OSINT Toolbox Talk: Scraping livestream chats, extracting 2Chan and 4Chan media, and verifying email usernames



This OSINT Toolbox Talk article rounds up another fortnight of OSINT tool testing by the OS2INT team. We begin by introducing Chat-Downloader, a very effective Python-based script that allows Digital Investigators to scrape chat logs from YouTube, Reddit and Twitch livestreams. Next, we follow up on our previous OSINT Toolbox Talk article, which discussed an effective 4Chan media and thread scraper, by presenting Chan-Scraper: yet another Python-based tool, this one scraping media from the 2Chan and 4Chan imageboards. Lastly, we show the powerful capabilities of MailCat, a lightweight and effective tool that allows Digital Investigators to verify usernames across a wide range of email service providers.

Stay tuned for our next edition of OSINT Toolbox Talk, in which we will introduce a very neat interface that enables Digital Investigators to geolocate Telegram users in real time, in addition to a highly effective and lightweight Twitter scraping utility.


Scraping chats from YouTube, Reddit and Twitch livestreams with 'Chat-Downloader'
https://github.com/xenova/chat-downloader

Chat logs from livestreams are undoubtedly a valuable source of information for Digital Investigators and Intelligence Analysts. This is especially true of cases involving online child grooming, which remains acutely frequent across several video-sharing platforms, including YouTube. CSAM criminals are also known to exploit video-sharing and streaming sites associated with gaming, such as Twitch. On the other end of the spectrum, livestreams and their associated chat logs may also be of intelligence value for Country Risk Analysts responsible for monitoring high-profile protest activity such as the Capitol Hill Riots, the 2019/2020 Iraq Protests, and the recent protests in response to the military coup in Sudan.

Without further ado, we introduce ‘Chat-Downloader’, a very simple Python script that can be used to retrieve chat messages from livestreams, videos, clips and past broadcasts from YouTube, Reddit and Twitch. In addition to scraping chats within the command-line interface, ‘Chat-Downloader’ can also parse chats into a JSON file which includes a wide range of useful information including:

  • Message ID
  • Message Content
  • Message Type
  • Timestamp
  • Time (in seconds, relative to video length)
  • Time text (local time in which the message was posted)
  • Author ID
  • Author Name
  • Author Image/Avatar

Installing the tool on your local system is very straightforward: either pull ‘Chat-Downloader’ directly via Git, or manually clone the tool and install it by invoking python setup.py install. Once installed, deploying the tool is just as simple; the basic invocation is chat_downloader <your target livestream> <your options>. The tool is highly flexible with regard to scraping options: the user can specify parameters such as the scrape start and end time relative to the video’s length, the maximum number of messages to scrape, and proxy configuration. Additionally, the user can configure various troubleshooting settings, including retry attempts, retry timeouts and inactivity timeouts.
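To make the invocation pattern above more concrete, here is a minimal Python sketch that assembles a chat_downloader command as an argument list. The helper name build_chat_downloader_command is our own, and the flag names (--output, --start_time, --end_time, --max_messages, --proxy) are assumptions based on our recollection of the tool's help text; they should be checked against chat_downloader --help before use.

```python
import shlex


def build_chat_downloader_command(url, output=None, start_time=None,
                                  end_time=None, max_messages=None, proxy=None):
    """Assemble an argument list for the chat_downloader command-line tool.

    The flag names below are assumptions; verify them with
    `chat_downloader --help` on your own installation.
    """
    command = ["chat_downloader", url]
    options = [
        ("--output", output),              # file to write the chat log to
        ("--start_time", start_time),      # scrape start, relative to the video
        ("--end_time", end_time),          # scrape end, relative to the video
        ("--max_messages", max_messages),  # stop after this many messages
        ("--proxy", proxy),                # proxy to route requests through
    ]
    for flag, value in options:
        # Only add an option when a value was actually supplied.
        if value is not None:
            command += [flag, str(value)]
    return command


if __name__ == "__main__":
    cmd = build_chat_downloader_command(
        "https://www.youtube.com/watch?v=VIDEO_ID",  # placeholder URL
        output="chat.json",
        max_messages=500,
    )
    # Print the command in a form that can be pasted into a shell.
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the tool is installed; building a list rather than a single string avoids shell-quoting problems with URLs.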

During our test, we pointed the tool at several livestreams from news outlets covering the ongoing instability in Sudan. Our tests generated a significant number of JSON files, and the maximum number of chat messages we scraped in a single session was just over 1500. However, we believe that the tool could have scraped much, much more than this! One issue we encountered during the scraping process was that the command-line would not display Arabic characters; to overcome this, we had to view our JSON file with UTF-8 encoding enabled. Additionally, the raw output of ‘Chat-Downloader’ does not take emojis into account, which is understandably quite a hindrance, especially for Digital Investigators focused on cases involving online child grooming. That said, the output generated by this tool can most certainly be ingested by a third-party application in order to visualise the entire content of messages.
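As a rough illustration of ingesting the output elsewhere, the sketch below reads a saved chat log with explicit UTF-8 decoding, so that Arabic text survives intact, and counts messages per author. It assumes the file is a JSON array of message objects with a message field and a nested author object containing a name field; those field names and the sample records are our assumptions, so adjust them to match your own output.

```python
import json
import os
import tempfile
from collections import Counter


def load_chat_log(path):
    """Read a chat log saved as a JSON array, decoding it explicitly as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def messages_per_author(messages):
    """Count messages per author name, skipping records without one."""
    counts = Counter()
    for record in messages:
        # "author" and "name" are assumed field names.
        name = (record.get("author") or {}).get("name")
        if name:
            counts[name] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample records standing in for a real scrape.
    sample = [
        {"message": "مرحبا", "author": {"name": "viewer_a"}},
        {"message": "hello", "author": {"name": "viewer_b"}},
        {"message": "السلام عليكم", "author": {"name": "viewer_a"}},
    ]
    path = os.path.join(tempfile.mkdtemp(), "chat.json")
    with open(path, "w", encoding="utf-8") as handle:
        # ensure_ascii=False keeps non-Latin text readable in the file itself.
        json.dump(sample, handle, ensure_ascii=False)
    for name, total in messages_per_author(load_chat_log(path)).most_common():
        print(total, name)
```

Passing encoding="utf-8" explicitly matters mainly on systems whose default encoding is not UTF-8, where Arabic text may otherwise be decoded incorrectly.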

So, our overall verdict: ‘Chat-Downloader’ is a very effective tool, and we were certainly able to scrape a vast amount of content by using it. It is easy to install, quick to deploy, and it addresses at least two sources of livestreams (YouTube and Reddit) quite well. The one shortfall we encountered is that it cannot scrape from Facebook livestreams. The developers have acknowledged this shortfall within the tool’s GitHub repository but, unfortunately, no update to address Facebook livestreams has yet been implemented. If they manage to implement a capability to scrape chats from Facebook livestreams, then we certainly believe that this would be a significant milestone and a game-changer for Digital Investigators. That aside, we really like this tool based on its easy-to-use and flexible functions; in short, it does exactly what it says.


Extracting media from 2ch.hk and 4Chan imageboards with 'Chan Scraper'
https://github.com/m3tro1d/chan-scraper

In one of our previous OSINT Tool Review articles, we introduced a scraper which has the capability to extract user posts and media from the 4Chan imageboard. In the same article, we discussed several reasons as to why 4Chan remains a hub for political extremism, disinformation, and CSAM criminality. However, 4Chan is certainly not unique in this regard; as stated by one of our esteemed readers, a well-renowned independent CSAM researcher, “CSAM criminals count on the ‘hard-to-surf-Chans’ structure to protect what is often meaningful intelligence”. Indeed, her comment rightfully points out that investigations on Chan imageboards should not be overlooked.

Our previous article also pointed out that OSINT tools for Chan imageboards are few in number, especially for the lesser-known Chans, which are often found to host a high volume of criminal users. The 2ch.hk imageboard, also referred to as ‘Dvach’, is Russia’s largest anonymous imageboard site and is well known for cyberbullying, misogyny, and toxic trolling. However, during our research, we also discovered CSAM being distributed across several threads.

So, if we (as Digital Investigators) want to scrape media content from 2ch.hk and 4Chan and use such content for the purpose of building an intelligence picture, the ideal tool to use is ‘Chan Scraper’, a lightweight Python script that is capable of downloading attachments (images, videos, or both) from individual threads. Downloading the tool is incredibly easy, as is deploying it: the tool is run by invoking the command python chan-scraper.py and, within the same command, the user can specify the output directory, the type of content to download, and the target imageboard thread. Additionally, the tool has the capability to scrape from more than one thread at the same time and can also be adapted for other imageboards by implementing some tweaks to the script itself.
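To illustrate the general idea of pulling attachments from a single thread, the sketch below builds media URLs from a thread's JSON as served by 4Chan's public read-only API (https://a.4cdn.org/<board>/thread/<thread number>.json), where each post with an attachment carries a tim and an ext field. This is our own illustration and not necessarily how Chan Scraper works internally; the function name, the extension groupings and the sample thread are assumptions, and 2ch.hk uses a different format that is not covered here.

```python
IMAGE_EXTENSIONS = {".jpg", ".png", ".gif"}
VIDEO_EXTENSIONS = {".webm", ".mp4"}


def media_urls(board, thread, kind="all"):
    """Return attachment URLs for the posts in a parsed 4Chan thread JSON.

    `kind` may be "images", "videos" or "all" (our own convention).
    """
    if kind == "images":
        wanted = IMAGE_EXTENSIONS
    elif kind == "videos":
        wanted = VIDEO_EXTENSIONS
    else:
        wanted = None  # accept any attachment type

    urls = []
    for post in thread.get("posts", []):
        ext = post.get("ext")
        # Posts without an attachment have neither "tim" nor "ext".
        if "tim" not in post or ext is None:
            continue
        if wanted is None or ext in wanted:
            urls.append(f"https://i.4cdn.org/{board}/{post['tim']}{ext}")
    return urls


if __name__ == "__main__":
    # Hypothetical thread data in the shape returned by the API.
    sample_thread = {
        "posts": [
            {"no": 1, "tim": 1600000000001, "ext": ".jpg"},
            {"no": 2},  # text-only reply
            {"no": 3, "tim": 1600000000002, "ext": ".webm"},
        ]
    }
    for url in media_urls("g", sample_thread, kind="images"):
        print(url)
```

Separating URL construction from downloading keeps the logic easy to check offline; the actual download step would then fetch each URL into the chosen output directory.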

The overall performance and output of this tool are impressive. During our tests, we sought to extract a range of media content linked to Russian right-wing extremists associated with the now-disbanded National Socialist Society (NSO). The content extracted included several high-quality images of armed individuals in addition to other graphical content depicting extreme politically-motivated violence. However, without some corrections to the script, it is unable to scrape from archived threads. Additionally, the script is not able to scrape entire imageboard catalogs.

As in our previous article, we sought to see what intelligence we could extract from the images that we scraped. Again, we turned to Paliscope YOSE and its awesome AI capabilities to index and analyse all of the images, and the results were once again fantastic. Within a few mouse clicks, we had categorised our images according to their content; it should be pointed out that the main categories of images discovered included weapons, violence, pornography and CSAM. We went even further and used YOSE’s imagery analysis capabilities to extract several weapon serial numbers and Russian ID cards. And lastly, we mapped the links between many Russian and European right-wing extremists.

To conclude, Chan Scraper is a very useful tool for Digital Investigators who need to scrape media content from 2ch.hk and 4Chan. However, as great as this tool is for collecting data, Digital Investigators should combine it with an analysis tool that enables them to categorise collected media and connect the dots between individuals of interest. All in all, Chan Scraper and Paliscope YOSE make an excellent combination!

 


Verifying email usernames using 'MailCat'
https://github.com/sharsil/mailcat

The capability to verify email addresses and accounts by username is undoubtedly a must-have for Digital Investigators. However, verifying usernames and accounts against individual email service providers one at a time is often time-consuming. One very effective tool that Digital Investigators can use to accomplish this task and save lots of time is ‘MailCat’. This tool is an ultra-lightweight Python script that checks usernames across a wide range of email providers including:

  • GMail
  • Yandex
  • ProtonMail – Including protonmail.com, protonmail.ch and pm.me
  • iCloud – Including icloud.com, me.com, and mac.com
  • tut.by – A Belarus-based independent news, media and service internet portal; one of the five most popular websites in Belarus, and the most popular news web portal and email service provider in the country
  • mail.ru – A popular Russian email service provider that also owns the VKontakte social media network
  • Rambler – A Russian search engine and email service provider owned by the Rambler Media Group
  • Tutanota – A German end-to-end encrypted email software and hosted secure email service
  • Yahoo
  • Outlook
  • Zoho – An Indian cloud-based software-as-a-service (SaaS) provider for businesses
  • Lycos
  • Eclipso – A German cloud and email service provider
  • Posteo – A German email service provider for individuals and businesses
  • mailbox.org – A German email service provider
  • FireMail – A German email service provider
  • FastMail – An Australian email hosting company
  • StartMail – An encrypted email service provider based in the Netherlands
  • KolabNow – A Swiss web-based email and groupware service
  • bigmir)net – A Ukrainian web portal and email service provider
  • XMail – A British Virgin Islands-incorporated secure email service provider
  • Ukr.net – A Ukrainian search portal and email service provider
  • Runbox – A Norwegian e-mail and web hosting provider
  • DuckDuckGo
  • HushMail – A Canadian web-based email service offering PGP-encrypted e-mail and vanity domain service
  • CTemplar – An Icelandic anonymous encrypted email service provider

The verification method used by the tool understandably varies depending on which email service provider it is querying. For GMail and Yandex, the tool uses SMTP verification. Verification of ProtonMail usernames / accounts is achieved through the use of the open API whilst iCloud emails are verified through the access recovery method. The remaining email service providers are verified via the registration method.
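To give a feel for what SMTP verification involves, here is a minimal sketch using Python's standard smtplib: it opens a session with a mail server and checks whether the server answers RCPT TO with code 250 for a candidate address. This is our own illustration, not MailCat's code; the function names, the probe sender address and the smtp_factory parameter (included so the logic can be exercised without a network connection) are assumptions. Many providers accept every recipient or block such probes altogether, so a result is only ever an indication.

```python
import smtplib


def candidate_addresses(username, domains):
    """Combine one username with each domain to form candidate addresses."""
    return [f"{username}@{domain}" for domain in domains]


def smtp_accepts(address, mx_host, sender="probe@example.com",
                 smtp_factory=smtplib.SMTP):
    """Return True if the server at mx_host answers RCPT TO with 250.

    mx_host must be the mail exchanger for the address's domain; looking
    it up (via DNS MX records) is outside the scope of this sketch.
    """
    with smtp_factory(mx_host, 25, timeout=10) as server:
        server.helo()
        server.mail(sender)
        # rcpt() returns a (code, message) tuple; 250 means "accepted".
        code, _message = server.rcpt(address)
        return code == 250


if __name__ == "__main__":
    # Domains taken from the ProtonMail and iCloud entries in the list above.
    domains = ["protonmail.com", "protonmail.ch", "pm.me",
               "icloud.com", "me.com", "mac.com"]
    for address in candidate_addresses("targetusername", domains):
        print(address)
```

The demo only prints candidate addresses and makes no network calls; smtp_accepts would need to be called explicitly with a real mail exchanger host.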

Overall, we had a very successful test of ‘MailCat’. Installation and deployment of the script were seamless, with no issues detected; running a search is achieved by invoking python mailcat.py <TARGET USERNAME> within the command-line interface. The overall search process can take between one and five minutes, depending on the number of email accounts identified across the email service providers. All in all, the tool is incredibly easy to use and produces effective results. As such, it most certainly comes with our recommendation, and it should be noted that the tool has also received glowing recommendations elsewhere, in addition to a notable mention by team Bellingcat.

