OSINT Toolbox Talk: Creating local copies of web pages, extracting 4Chan content, and scraping Instagram location data

This OSINT Toolbox Talk wraps up another fortnight of OSINT tool testing by the OS2INT team, and it is our most extensive article yet in terms of the overall capabilities of the tools that we showcase. We start by introducing ‘SingleFile’, a very neat tool that can be used either within a command-line environment or as a Google Chrome or Mozilla Firefox extension. ‘SingleFile’ is recommended as a tool for saving copies of single web pages, an important capability for every Digital Investigator. Next, we introduce ‘4Chan-Scraper’, a neat Python script that enables investigators to scrape posts and media from 4Chan threads. We also discuss why investigations on 4Chan remain relevant with regard to disinformation, political extremism, and CSAM. Lastly, we show you the awesome power of ‘Instagram Location Search’, a Python script developed by the fantastic Bellingcat team. This tool allows investigators to identify and scrape location IDs from a given set of coordinates. The output of this tool can then be passed to other Instagram scrapers to extract content based on the generated list of location IDs.

Stay tuned for our next OSINT Toolbox Talk as we introduce more tools that can add value for digital investigations!

Create and save local copies of web pages with 'SingleFile'
https://github.com/gildas-lormeau/SingleFile

Undoubtedly, the capability to create a copy of a web page and save it locally is especially important for Digital Investigators. For example, there may be situations where a social media page or a site suspected to be involved in the sale of illicit material needs to be locally saved for reference. There is an abundance of tools that can achieve this task, but for this OSINT Tool Review article, we will present ‘SingleFile’, a very popular web extension and command-line interface tool that is compatible with Chrome, Firefox (Desktop and Mobile), Microsoft Edge, Vivaldi, Brave, Waterfox, Yandex Browser, and Opera.

So, how does it work? During our tests, we found that installing and deploying the extension is incredibly straightforward. Once the extension is installed in your browser of choice, you can open a target web page and then click on the ‘SingleFile’ extension icon located at the top right-hand side of your browser. At this point, the extension does all of the hard work: it creates a copy of the target page and saves it locally on your system as an HTML file. The extension is highly flexible, and users can configure it to suit their needs by right-clicking the extension’s icon and selecting ‘options’. For example, users can configure the user interface, specify file names, choose what HTML content should be extracted, and remove scripts (to name but a few!).

In addition to the capabilities indicated above, ‘SingleFile’ allows Digital Investigators to batch save all tabs within their browser window and annotate saved web pages with the use of a neat editor. What we consider to be the most valuable capability, however, is the auto-save function, which saves each web page that is opened within the browser. Lastly, the tool integrates well with user scripts, which enables Digital Investigators to add behaviours such as auto-click and auto-scroll.

It goes without saying that this tool is impressive. However, the developers have recognised some teething problems when using the extension in incognito mode. Also, while the command-line interface version of the tool can be used to save Onion web pages, the extension is not compatible with Tor.

One very important point we wish to make at this stage is that the tool only allows Digital Investigators to save web pages as HTML; there is no capability to save as PDF. We must also point out that the tool does not forensically save web pages, which is important for investigators who require a forensically secure copy that can be presented as evidence before a court. For example, HTML copies of web pages can easily be edited locally in a text editor, whereas editing a PDF is less easy. In the same regard, digital evidence should be hashed at source, so that any later change to the copy can be detected. Therefore, while ‘SingleFile’ is a useful tool for creating a copy of a web page, perhaps for research purposes, we strongly advise that Digital Investigators look at specialised forensic browsers that can create secure copies of web pages and hash them immediately upon extraction. The titan of forensic browsers is undoubtedly Paliscope Discovry, as it can create forensic copies of web pages in HTML and PDF. Additionally, Paliscope Discovry can be used to create forensic copies of Onion web pages.
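To illustrate the hashing point above, here is a minimal Python sketch that computes a SHA-256 digest of a saved HTML file and builds a small record for a case log. This is only an assumption of how an investigator might record a hash after the fact: the file name saved_page.html and the record fields are hypothetical, and it is not a feature of ‘SingleFile’. Hashing a copy after saving does not make it forensically sound; it only allows later changes to that copy to be detected.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large saved pages do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_record(path: Path) -> dict:
    """Build a small record (hypothetical fields) for a case log."""
    return {
        "file": path.name,
        "sha256": sha256_of_file(path),
        "size_bytes": path.stat().st_size,
        "hashed_at_utc": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    # Hypothetical file name; replace with the page saved by SingleFile.
    saved_page = Path("saved_page.html")
    if saved_page.exists():
        print(hash_record(saved_page))
    else:
        print(f"{saved_page} not found")
```

Re-running sha256_of_file on the same file at a later date and comparing the digest with the logged value shows whether the local copy has been altered since it was hashed.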

Extracting media and user posts from 4Chan with '4Chan-Scraper'
https://github.com/malavmodi/4Chan-Scraper

4Chan, which many people refer to as the ‘cesspool of the internet’, remains a valuable source of data for digital investigations. 4Chan, alongside other imageboards including 8Chan (now referred to as 8kun) and 2Chan, has a very serious problem with the distribution of sexually explicit material. In some cases, Category C Child Sexual Abuse Material (CSAM) has been posted onto these imageboards. 4Chan is also undoubtedly a platform used by the so-called ‘Alt Right’, in addition to foreign actors intent on spreading disinformation of a political nature.

There are many tools for scraping media from imageboards. Scrapers that can pull entire boards and extract user posts, however, are far fewer. In this OSINT Tool Review, we present ‘4Chan-Scraper’, a lightweight Python script that does exactly what it says, and does it very well indeed. The tool can scrape the catalog of a given 4Chan board for all comments, files and associated media through the command-line interface. When run, the script creates a folder in the current working directory containing all associated data in a hierarchical structure, as follows:

  • Thread ID with Subject (Folder)
    • Thread ID files (Folder)
      • File Data
  • CSV with comments/replies from posts
  • JSON formatted output of thread
  • File Metadata
  • Thread Metadata
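Once a board has been scraped, a quick inventory of the output folder helps to gauge how much data there is before analysing it. The following is a minimal sketch, assuming only that the output directory contains one folder per thread as outlined above; the directory name scraper_output is hypothetical, and no specific file names from the tool are assumed.

```python
from collections import Counter
from pathlib import Path


def inventory(board_dir: Path) -> dict:
    """Count files by extension inside each thread folder of board_dir."""
    summary = {}
    for thread_dir in sorted(p for p in board_dir.iterdir() if p.is_dir()):
        # rglob walks nested folders too, e.g. a per-thread media folder.
        counts = Counter(
            f.suffix.lower() or "(none)"
            for f in thread_dir.rglob("*")
            if f.is_file()
        )
        summary[thread_dir.name] = dict(counts)
    return summary


if __name__ == "__main__":
    # Hypothetical path; point this at the folder created by the scraper.
    output_dir = Path("scraper_output")
    if output_dir.is_dir():
        for thread, counts in inventory(output_dir).items():
            print(thread, counts)
    else:
        print(f"{output_dir} not found")
```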

Installing the script requires some tweaking in the form of slight modifications to the HTML parser script located inside the py4chan utility file. However, the tool’s developer has provided detailed instructions in the GitHub repository on how to implement the required modifications. Thereafter, deploying the tool by invoking the desired command is straightforward. The scraper’s output is very useful, and we found no issues or flaws during our tests.

However, considering that the tool can scrape an enormous amount of content, we needed to take this further and analyse the data. To do this, we turned to Paliscope YOSE to index the data and apply AI algorithms to it. Using 4Chan-Scraper, we collected a vast amount of politics-related chats and media with the intent of identifying potential actors involved in spreading disinformation. This data was then processed through YOSE and, within minutes, we had identified one Lithuania-based actor involved in spreading ‘Alt Right’ disinformation. Using YOSE, we also identified cryptocurrency wallets associated with the same individual, in addition to other Lithuania-based disinformation sources.

But it doesn’t stop there! Within a matter of minutes, YOSE was able to classify all of the media content scraped from 4Chan, enabling us to quickly identify sexual, financial or violent content, then draw links between the users who posted the media and those who posted comments in response. Lastly, we made good use of YOSE’s sentiment analysis capability to single out discussions of a violent or sexual nature, then analyse such discussions even further.

So, in conclusion, 4Chan-Scraper is a great tool, but Digital Investigators should also look at other tools that will enable them to quickly and efficiently analyse scraped data. What YOSE is able to do is enable Digital Investigators and Intelligence Analysts to find the crucial needle in the haystack – in this case, the individuals of interest involved in the spreading of disinformation of a violent-political nature and analyse them even further.

Identifying and scraping from Instagram locations with 'Instagram Location Search'
https://github.com/bellingcat/instagram-location-search

We have been meaning to review this awesome tool by Bellingcat for quite some time. With the start of the military coup in Sudan, an opportunity was presented to us to try out Instagram Location Search in order to identify media that could be of intelligence value with regards to real-time security incident reporting. So, we will cut to the chase and say to our readers that this tool is outstanding and it did not disappoint!

Instagram Location Search is a lightweight Python script that packs an awesome punch. It allows Digital Investigators and Intelligence Analysts to search for Instagram location IDs based on given coordinates. The tool does not require users to specify their Instagram credentials; it relies on a session ID token instead. For those who are unfamiliar with how to obtain this token, the fantastic team at Bellingcat have provided an easy set of instructions. Deploying the tool is very straightforward: invoke the command python3 instagram-locations.py --session "<session-id-token>" and, within the same command, specify the desired location with the --lat and --lng arguments. Users can also use the --date argument to filter the Instagram location pages to show posts created on the given date or earlier. Lastly, users can choose what output they require, the four options being:

  • JSON, which can be requested with the --json command-line argument. This will save the list as a JSON file, almost identical to the raw API response.
  • GeoJSON, which can be requested with the --geojson argument. This will save the list of locations as a GeoJSON file for other geospatial applications.
  • Location IDs, which can be requested with the --ids argument. The IDs are saved within a .txt file that can be passed to another Instagram-based media scraper such as ‘Instagram Scraper’.
  • An interactive map (based on Leaflet JS), which can be requested with the --map argument. Most definitely an impressive feature, this will show the locations of all of the returned points.
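For investigators who run the tool repeatedly against different coordinates, the command described above can be assembled in a small Python helper. This is a minimal sketch using only the script name and arguments mentioned in this article. It assumes that each output argument takes a file path and that --date accepts a plain date string; both are assumptions that should be checked against the Bellingcat repository. The coordinates, date and file names below are illustrative only.

```python
import shlex

# Output arguments named in the article; each is ASSUMED to take a file path.
OUTPUT_FLAGS = ("json", "geojson", "ids", "map")


def build_command(session_token, lat, lng, date=None, outputs=None):
    """Return the command as an argument list (it is not executed here)."""
    outputs = outputs or {}
    unknown = set(outputs) - set(OUTPUT_FLAGS)
    if unknown:
        raise ValueError(f"unknown output option(s): {sorted(unknown)}")
    cmd = [
        "python3", "instagram-locations.py",
        "--session", session_token,
        "--lat", str(lat),
        "--lng", str(lng),
    ]
    if date:
        # The date string is passed through unchanged; its format is an assumption.
        cmd += ["--date", date]
    for flag in OUTPUT_FLAGS:
        if flag in outputs:
            cmd += [f"--{flag}", outputs[flag]]
    return cmd


if __name__ == "__main__":
    # Approximate coordinates for Khartoum; token and file names are placeholders.
    command = build_command(
        "<session-id-token>", 15.5007, 32.5599,
        date="2021-10-25",
        outputs={"ids": "khartoum_ids.txt", "map": "khartoum_map.html"},
    )
    print(shlex.join(command))
```

Building the command as a list rather than a single string means it can be handed to subprocess.run without shell quoting problems around the session token.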

Impressive as the tool is, it is even more effective when combined with another tool such as ‘Instagram Scraper’ to pull all of the media associated with the scraped location IDs. Additionally, Instagram Scraper can also be used to save the location metadata associated with the scraped media.
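Before handing the location IDs to a second scraper, it can be useful to review them manually. The sketch below reads an IDs file and turns each ID into an Instagram location page URL. It assumes the file contains numeric IDs separated by newlines, spaces or commas, which may not match the tool's exact output format, and the file name khartoum_ids.txt is hypothetical.

```python
import re
from pathlib import Path

# Public URL pattern for an Instagram location page.
LOCATION_URL = "https://www.instagram.com/explore/locations/{location_id}/"


def read_location_ids(path: Path) -> list:
    """Read numeric location IDs, dropping duplicates and non-numeric tokens."""
    text = path.read_text(encoding="utf-8")
    ids, seen = [], set()
    # ASSUMPTION: IDs are separated by whitespace and/or commas.
    for token in re.split(r"[,\s]+", text):
        if token.isdigit() and token not in seen:
            seen.add(token)
            ids.append(token)
    return ids


def location_urls(ids: list) -> list:
    """Map each location ID to its location page URL."""
    return [LOCATION_URL.format(location_id=i) for i in ids]


if __name__ == "__main__":
    # Hypothetical file name; use the file written with the --ids argument.
    ids_file = Path("khartoum_ids.txt")
    if ids_file.exists():
        for url in location_urls(read_location_ids(ids_file)):
            print(url)
    else:
        print(f"{ids_file} not found")
```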

As we discussed earlier, we tested Instagram Location Search by pointing it at Khartoum, Sudan, in order to identify locations that could be used to pull media of intelligence value in relation to the ongoing military coup. The tool did not let us down: we identified several hundred locations and visualised them with the map feature. Using Instagram Scraper, however, was tedious at best. We found that it required some additional configuration tweaks to overcome minor errors caused by recent updates to Instagram’s search functionality. Once we had implemented our tweaks, we found that Instagram Location Search and Instagram Scraper are a perfect combination, resulting in us scraping a substantial amount of media associated with the ongoing military coup in Sudan and using the extracted location metadata to map our results.

There is undoubtedly a significant number of reasons as to why every Digital Investigator and Intelligence Analyst should use Instagram Location Search, but we have settled for just one test case. It is easy to install and deploy, delivers effective results, and can be integrated with other scrapers. Therefore, Instagram Location Search is most certainly recommended and we take our hats off to the Bellingcat team for once again contributing to the OSINT community!
