Extracting content from archived Geocities webpages using DownThemAll

https://chrome.google.com/webstore/detail/downthemall/nljkibfhlpcnanjgbnlnbjecgicbjkge?hl=en

In this latest OSINT Tool Review, we are going to take a nostalgic trip down memory lane and explain how Digital Investigators can extract content, including media, from archived Geocities pages. Yes, that’s right, Geocities! For those of us who are old enough to remember, Geocities was a web hosting service that allowed users to create and publish websites for free and to browse user-created websites by theme or interest. In its original form, users selected a “city” in which to list the hyperlinks to their web pages. The “cities” were named after real cities or regions according to their content (for example, computer-related sites were placed in “SiliconValley” and those dealing with entertainment were assigned to “Hollywood”), hence the name of the site. Soon after Geocities was acquired by Yahoo!, this practice was abandoned in favour of using Yahoo! member names in the URLs. In April 2009, the company announced that it would end the United States GeoCities service on October 26, 2009. There were at least 38 million pages displayed by GeoCities before it was terminated, most of them user-written. The GeoCities Japan version of the service endured until March 31, 2019.

Although Geocities was shut down, the good folks at Archive.org still maintain a vast collection of archived pages, which can be extracted and downloaded in their entirety using various tools; we will touch on those tools in an upcoming article. Readers may wonder why on earth a Digital Investigator or OSINT Analyst would be interested in collecting media content from a defunct series of web pages from the early 2000s. We very much asked the same question until a recent investigation we supported led us to the Geocities archive. Indeed, after several weeks of OSINT research against several Geocities pages of interest, we found instances where links to nefarious sites (including those involved in the distribution of CSAM) were being shared. So, the bottom line is that conducting investigations against archived Geocities pages remains very relevant, even though it is incredibly time-consuming!

In order to extract media content directly from archived Geocities webpages, we recommend using DownThemAll, a simple but very powerful browser extension for media extraction that can be used within Google Chrome or Mozilla Firefox. We should also point out that DownThemAll is not limited to archived webpages; it can be used against most other webpages, provided that the tool can read the HTML markup.

DownThemAll can extract media embedded in web pages as well as links to image files. The latter is the most useful, given that most web pages insert images as links. In all, the tool can download and extract the following:

  • Software files including exe and msi
  • Image files including jpg, jpeg, png, gif and svg
  • Archive files including zip, rar and 7z
  • Document files including all Microsoft-supported files
  • Video files including mp4, webm and mkv
  • Audio files including mp3, flac and wav
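
To make the idea of grouping a page's links by file type more concrete, here is a minimal Python sketch. It is only an illustration of the general technique and is not how DownThemAll itself is implemented. The names LinkCollector, categorise and collect_media are our own hypothetical helpers, and the document category is left out because the list above does not enumerate specific extensions for it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Extensions copied from the list above; document formats are omitted
# because the article does not name specific extensions for them.
CATEGORIES = {
    "software": {"exe", "msi"},
    "image": {"jpg", "jpeg", "png", "gif", "svg"},
    "archive": {"zip", "rar", "7z"},
    "video": {"mp4", "webm", "mkv"},
    "audio": {"mp3", "flac", "wav"},
}


class LinkCollector(HTMLParser):
    """Collect every href/src attribute value as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(urljoin(self.base_url, value))


def categorise(url):
    """Return the category name for a URL, or None if the extension is unknown."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    for category, extensions in CATEGORIES.items():
        if extension in extensions:
            return category
    return None


def collect_media(html, base_url):
    """Group the media links found in an HTML string by category."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    found = {}
    for url in parser.urls:
        category = categorise(url)
        if category is not None:
            found.setdefault(category, []).append(url)
    return found
```

Calling collect_media with the HTML of a saved page and the address it was loaded from returns a dictionary keyed by category, which an investigator could then review before deciding what to download.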

The tool also allows Digital Investigators to filter for any additional file types through its ‘Fast Filtering’ and / or ‘Mask’ filtering capability. In situations where investigators have access to the public root of a website, they can use ‘Fast Filtering’ to isolate and extract entire web pages as a single HTML file. Lastly, in cases where websites (including Archive.org) apply rate limiting, investigators can configure DownThemAll to control the number and speed of downloads undertaken over a given period of time.
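
For readers who want to see the two ideas in this paragraph side by side, the short Python sketch below filters a list of URLs by a wildcard pattern and then paces how quickly they are handed on for download. It is an illustration under our own assumptions: the pattern syntax is that of Python's fnmatch module rather than DownThemAll's own filter syntax, and select_by_pattern and throttled are hypothetical helper names.

```python
import fnmatch
import time


def select_by_pattern(urls, pattern):
    """Keep URLs whose file name matches a shell-style wildcard, ignoring case."""
    pattern = pattern.lower()
    selected = []
    for url in urls:
        file_name = url.rsplit("/", 1)[-1].lower()
        if fnmatch.fnmatchcase(file_name, pattern):
            selected.append(url)
    return selected


def throttled(items, delay_seconds, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items.

    The sleep function is injectable so the pacing can be checked
    without actually waiting.
    """
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)
        yield item
```

In practice the loop consuming throttled(...) would perform one download per item; the delay value is something to tune against whatever limits the remote site applies.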

To conclude, we have explained why OSINT research against very old Geocities webpages remains relevant to some investigations and how media content can be effectively extracted from such pages using DownThemAll. However, as many of our readers will understand, the quality of media hosted on Geocities was incredibly low, mainly because the majority of Geocities users were on dial-up connections. Nevertheless, such media content could prove useful in an investigation.

