4Chan-Scraper: Extracting media and user posts from 4Chan
Link to tool: https://github.com/malavmodi/4Chan-Scraper
4Chan, which many people refer to as the ‘cesspool of the internet’, remains a valuable source of data for digital investigations. 4Chan, alongside other imageboards including 8Chan (now known as 8kun) and 2Chan, has a very serious problem with the distribution of sexually explicit material. In some cases, Category C Child Sexual Abuse Material (CSAM) has been posted on these imageboards. 4Chan is also undoubtedly a platform used by the so-called ‘Alt Right’, as well as by foreign actors intent on spreading political disinformation.
Many tools exist for scraping media from imageboards. Scrapers that can pull entire boards and extract user posts, however, are far fewer. In this OSINT Tool Review, we present ‘4Chan-Scraper’, a lightweight Python script that does exactly what it says, and does it very well. The tool scrapes the catalog of a given 4Chan board for all comments, files and associated media through a command-line interface. When run, the script creates a folder in the current working directory containing all associated data, arranged hierarchically as follows:
- Thread ID with Subject (Folder)
  - Thread ID files (Folder)
    - File Data
  - Thread ID files (Folder)
    - CSV with comments/replies from posts
    - JSON formatted output of thread
    - File Metadata
    - Thread Metadata
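To get a quick overview of a finished scrape before analysing it, the folder structure above can be walked with a few lines of Python. The sketch below is ours, not part of 4Chan-Scraper: the function name `index_scrape` is hypothetical, and because we do not assume any particular file names or CSV column headers, files are discovered purely by extension.

```python
import csv
import json
from pathlib import Path


def index_scrape(root):
    """Summarise every thread folder found directly under `root`.

    Assumption: one sub-folder per thread, as in the layout listed above.
    CSV and JSON files are located by extension only; everything else is
    counted as media or other file data.
    """
    summary = {}
    thread_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for thread_dir in thread_dirs:
        comment_rows = 0
        json_documents = 0
        other_files = 0
        for path in thread_dir.rglob("*"):
            if not path.is_file():
                continue
            suffix = path.suffix.lower()
            if suffix == ".csv":
                # Count data rows; the first row is treated as the header.
                with path.open(newline="", encoding="utf-8") as fh:
                    comment_rows += sum(1 for _ in csv.DictReader(fh))
            elif suffix == ".json":
                # Parse the file to confirm it is well-formed JSON.
                with path.open(encoding="utf-8") as fh:
                    json.load(fh)
                json_documents += 1
            else:
                other_files += 1
        summary[thread_dir.name] = {
            "comment_rows": comment_rows,
            "json_documents": json_documents,
            "other_files": other_files,
        }
    return summary
```

Calling `index_scrape` on the folder the scraper created returns one entry per thread, which makes it easy to spot the largest threads before loading anything into an analysis platform.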
Installing the script requires slight modifications to the HTML parser script located inside the py4chan utility file. The tool’s developer has provided detailed instructions for these modifications in the GitHub repository. Thereafter, deploying the tool by invoking the desired command is straightforward. The scraper’s output is very useful, and we found no issues or flaws during our tests.
However, because the tool can scrape an enormous amount of content, the data needs further analysis. To do this, we turned to Paliscope YOSE to index the data and apply AI algorithms to it. Using 4Chan-Scraper, we collected a vast amount of politically themed chats and media; our intent was to identify potential actors involved in spreading disinformation. This data was then processed through YOSE and, within minutes, we had identified one Lithuania-based actor involved in spreading ‘Alt Right’ disinformation. Using YOSE, we also identified cryptocurrency wallets associated with the same individual, in addition to other Lithuania-based disinformation sources.
But it doesn’t stop there! Within a matter of minutes, YOSE classified all of the media content scraped from 4Chan, enabling us to quickly identify sexual, financial or violent content, then draw links between the users who posted the media and those who commented in response. Lastly, we made good use of YOSE’s sentiment analysis capability to single out discussions of a violent or sexual nature, then analyse such discussions even further.
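As a rough illustration of how wallet references like those mentioned above might be surfaced from scraped comments, the sketch below searches comment text for strings shaped like Bitcoin addresses. This is not how YOSE works internally, which we have no visibility into; the function name `find_wallet_candidates` is hypothetical, and the patterns check format only (no checksum validation), so every hit is a candidate that still needs verifying.

```python
import re

# Format-only patterns for Bitcoin-style addresses (assumption: Bitcoin is
# the currency of interest; other currencies need their own patterns).
WALLET_PATTERN = re.compile(
    r"\b(?:"
    r"[13][a-km-zA-HJ-NP-Z1-9]{25,34}"   # legacy Base58 addresses
    r"|bc1[ac-hj-np-z02-9]{39,59}"       # Bech32 (SegWit) addresses
    r")\b"
)


def find_wallet_candidates(comments):
    """Map each candidate address to the indexes of comments mentioning it."""
    hits = {}
    for index, text in enumerate(comments):
        # A set avoids recording the same comment twice for one address.
        for address in set(WALLET_PATTERN.findall(text)):
            hits.setdefault(address, []).append(index)
    return hits
```

Feeding it the comment column from each thread’s CSV gives a short list of addresses, together with the comments in which each one appears, to carry forward into further analysis.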
In conclusion, 4Chan-Scraper is a great tool, but Digital Investigators should also look at other tools that enable them to analyse scraped data quickly and efficiently. YOSE enables Digital Investigators and Intelligence Analysts to find the crucial needle in the haystack: in this case, the individuals of interest involved in spreading disinformation of a violent-political nature, who can then be analysed further.