Chat-Downloader: Scraping chats from YouTube, Reddit and Twitch livestreams
Link to tool: https://github.com/xenova/chat-downloader
Chat logs from livestreams are a valuable source of information for Digital Investigators and Intelligence Analysts. This is especially true of cases involving online child grooming, which remains persistently common across several video-sharing platforms, including YouTube. CSAM offenders are also known to exploit video-sharing and streaming sites associated with gaming, such as Twitch. At the other end of the spectrum, livestreams and their associated chat logs may also be of intelligence value to Country Risk Analysts responsible for monitoring high-profile protest activity such as the Capitol Hill riots, the 2019/2020 Iraq protests, and the recent protests in response to the military coup in Sudan.
Without further ado, we introduce ‘Chat-Downloader’, a simple Python tool that can be used to retrieve chat messages from livestreams, videos, clips and past broadcasts on YouTube, Reddit and Twitch. In addition to printing chats in the command-line interface, ‘Chat-Downloader’ can also write them to a JSON file, which includes a wide range of useful information such as:
- Message ID
- Message Content
- Message Type
- Timestamp
- Time (in seconds, relative to the start of the video)
- Time text (a human-readable version of the time in seconds)
- Author ID
- Author Name
- Author Image/Avatar
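As a rough illustration of how these fields might be pulled out of the JSON output, the sketch below flattens one record into the fields listed above. The sample record is invented, and the key names (message_id, message, message_type, timestamp, time_in_seconds, time_text, and an author object holding id, name and images) are assumptions that should be checked against a file the tool has actually produced.

```python
import json

# Hypothetical sample shaped like one record of the tool's JSON output.
# All key names and values here are assumptions for illustration only.
sample_json = """
[
  {
    "message_id": "abc123",
    "message": "Hello from the chat",
    "message_type": "text_message",
    "timestamp": 1635760000000000,
    "time_in_seconds": 75.0,
    "time_text": "1:15",
    "author": {
      "id": "UC_example",
      "name": "example_user",
      "images": [{"url": "https://example.com/avatar.jpg"}]
    }
  }
]
"""


def summarise(record):
    """Flatten one chat record into the fields listed above."""
    author = record.get("author", {})
    # Fall back to a single empty entry if no avatar images are present.
    images = author.get("images") or [{}]
    return {
        "message_id": record.get("message_id"),
        "message": record.get("message"),
        "message_type": record.get("message_type"),
        "timestamp": record.get("timestamp"),
        "time_in_seconds": record.get("time_in_seconds"),
        "time_text": record.get("time_text"),
        "author_id": author.get("id"),
        "author_name": author.get("name"),
        "author_image": images[0].get("url"),
    }


rows = [summarise(record) for record in json.loads(sample_json)]
print(rows[0]["author_name"], "at", rows[0]["time_text"], "-", rows[0]["message"])
```

Using .get() throughout means a record that lacks one of these keys yields None for that field rather than raising an error, which is convenient when message types differ.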
Installing the tool on your local system is straightforward: either install ‘Chat-Downloader’ directly from its Git repository, or manually clone the repository and install it by invoking python setup.py install. Once installed, running the tool is just as simple; the basic invocation is chat_downloader <your target livestream> <your options>. The tool is highly flexible with regard to scraping options: the user can specify parameters such as the scrape start and end time relative to the video’s length, the maximum number of messages to scrape, and proxy configuration. Additionally, the user can configure various troubleshooting settings, including retry attempts, retry timeouts and inactivity timeouts.
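To make the shape of an invocation more concrete, here is a minimal sketch that assembles a command line from a handful of the options described above. The build_command helper is hypothetical, the URL is a placeholder, and the flag names (--output, --start_time, --end_time, --max_messages, --proxy) are assumptions that should be confirmed with chat_downloader --help for your installed version.

```python
import shlex


def build_command(url, output=None, start_time=None, end_time=None,
                  max_messages=None, proxy=None):
    """Assemble a chat_downloader command line as a list of arguments.

    Hypothetical helper: the flag names below are assumptions, so verify
    them against the tool's own help text before relying on them.
    """
    cmd = ["chat_downloader", url]
    options = {
        "--output": output,              # file to write the chat to
        "--start_time": start_time,      # where in the video to start
        "--end_time": end_time,          # where in the video to stop
        "--max_messages": max_messages,  # cap on messages scraped
        "--proxy": proxy,                # proxy to route requests through
    }
    for flag, value in options.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_command(
    "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder URL
    output="chat.json",
    start_time=0,
    end_time=600,
    max_messages=500,
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if it is later passed to subprocess.run.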
During our test, we pointed the tool at several livestreams from news outlets covering the ongoing instability in Sudan. Our tests generated a significant number of JSON files, and the largest number of chat messages we scraped in a single session was just over 1,500. However, we believe the tool could have scraped far more than this! One issue we encountered during the scraping process was that the command line would not display Arabic characters; to overcome this, we had to view our JSON file with UTF-8 encoding enabled. Additionally, the raw output of ‘Chat-Downloader’ does not take emojis into account, which is understandably quite a hindrance, especially for Digital Investigators focused on cases involving online child grooming. That said, the output generated by this tool can certainly be ingested by a third-party application in order to visualise the full content of messages.
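The encoding workaround can also be applied when loading the output programmatically. The sketch below is a minimal illustration, not the tool's own behaviour: it writes a small invented sample containing Arabic text to a temporary file, then reads it back with the encoding set explicitly to UTF-8. The file name and record layout are assumptions.

```python
import json
import os
import tempfile

# Invented sample records; the layout is an assumption for illustration.
messages = [
    {"message": "مرحبا بالجميع", "author": {"name": "viewer_1"}},
    {"message": "Hello everyone", "author": {"name": "viewer_2"}},
]

path = os.path.join(tempfile.mkdtemp(), "chat.json")

# ensure_ascii=False keeps the Arabic text readable in the file instead of
# escaping it to \uXXXX sequences.
with open(path, "w", encoding="utf-8") as fh:
    json.dump(messages, fh, ensure_ascii=False)

# Passing encoding="utf-8" explicitly avoids relying on the platform
# default, which on some systems is not UTF-8.
with open(path, encoding="utf-8") as fh:
    loaded = json.load(fh)

print(len(loaded), "messages loaded; text preserved:", loaded == messages)
```

If the terminal itself cannot render Arabic, opening the file in a UTF-8 aware editor, as described above, remains the simplest way to read the messages.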
So, our overall verdict: ‘Chat-Downloader’ is a very effective tool, and we were certainly able to scrape a vast amount of content with it. It is easy to install, quick to deploy, and it handles at least two sources of livestreams (YouTube and Reddit) quite well. The main limitation we encountered is that it cannot scrape chats from Facebook livestreams. The developers have acknowledged this shortfall in the tool’s GitHub repository, but unfortunately no update adding Facebook support has yet been implemented. If they manage to implement a capability to scrape chats from Facebook livestreams, we believe this would be a significant milestone and a game-changer for Digital Investigators. That aside, we really like this tool for its easy-to-use and flexible functions; in short, it does exactly what it says.