Scraping social media data, analysing disinformation, and batch scraping from Telegram
Once again, after a very busy period for the OS2INT team, we bring our readers another OSINT Toolbox Talk article focusing on three of the most effective OSINT tools tried and tested by us over the past few weeks. Starting with Instant Data Scraper, we show our readers how this simple, must-have Google Chrome extension packs a powerful punch when scraping data from social media sites such as Facebook, Instagram and Twitter. Next, we focus on the InVID and WeVerify Google Chrome extension, which brings a very powerful range of capabilities for OSINT analysts involved in analysing disinformation from multiple sources – undoubtedly, this is a must-have tool in light of the current military situation in eastern Europe and the consequent increase in Russian disinformation on social media. Finally, we show the awesome capabilities of TG-API, an effective Python utility that provides users with the capability to batch-scrape from multiple Telegram channels and groups.
Stay tuned for our upcoming OSINT tool reviews, where we will look at additional tools that can be used to investigate and analyse disinformation actors and scrape media from a combination of social media sites!
In this latest OSINT Tool Review, we will begin by pointing out that Twitter has implemented limits on the amount of data that can be scraped through the Twitter API. These limits present significant issues for Digital Investigators and Intelligence Analysts involved in the identification and analysis of suspected Russian disinformation actors, especially when considering the ongoing military situation in Ukraine and heightened geopolitical tensions. However, we at OS2INT will demonstrate how we overcame these limitations by using Instant Data Scraper to collect data from Twitter profiles suspected to be involved in the dissemination of disinformation.
What is Instant Data Scraper?
This is not the first time we have introduced Instant Data Scraper to our readers. In fact, we demonstrated in a previous OSINT Workflow article how this tool can be used to scrape friends lists from Facebook. So, to reacquaint our readers with this tool, Instant Data Scraper is a Google Chrome Extension developed by Web Robots. It is an automated data extraction tool for any website. It uses AI to predict which data is most relevant on an HTML page and allows saving it to Excel or CSV files (XLS, XLSX, CSV). The tool does not require website-specific scripts; instead, it applies AI to analyse the HTML structure of websites to detect data for extraction. If the prediction is not satisfactory, the user can customise the selections for greater accuracy. Additionally, the tool comes with an ‘infinite scroll’ capability that allows users to acquire HTML data that is auto-loaded when the user reaches the bottom of a web page.
Installation and deployment
As Instant Data Scraper is a Google Chrome Extension, it only needs to be downloaded and installed via the Google Chrome Web Store, and we must stress that the tool is 100% free to use. Deploying the tool is just as easy – the user simply navigates to the page they wish to scrape and initiates the tool by clicking the Instant Data Scraper icon in the Chrome browser toolbar.
What social media platforms can it scrape from?
The list of websites and social media platforms that Instant Data Scraper is compatible with is extremely long. However, we must say that we have spent a considerable amount of time using the tool to scrape from:
Telegram (Web App)
We have found, however, that the tool is not well-suited to LinkedIn, most likely due to the platform’s HTML structure and markup. That said, given that the tool has been developed to give the user as much control as possible over the type of data to be scraped, it is possible to configure the tool to detect specific HTML elements on any target page.
Identifying suspected disinformation actors
Identifying suspected disinformation actors was a very easy accomplishment when taking into account the current military situation in Ukraine. Additionally, the use of Twitter by pro-Russian disinformation actors is very well-established. To identify suspected disinformation actors, we started from the fact that the notorious letter ‘Z’ has become synonymous with support for the invasion – this symbolic character was the basis for our basic search for disinformation actors. Immediately, we found a substantial number of pro-Russian and pro-invasion Twitter accounts. On closer inspection, the majority of these accounts were publishing and circulating disinformation.
Scraping Twitter follower lists
As we earlier pointed out, Twitter has implemented a range of measures to limit the amount of data that can be scraped via its API. These limits affect the amount of media, tweets and follower data that can be extracted. However, taking into account that Instant Data Scraper does not rely on the Twitter API to scrape data, we can use this tool to extract follower lists from the Twitter profiles associated with a range of suspected disinformation actors. To do this, we simply navigated to the follower list of each target profile and initiated the scraper by clicking on the Instant Data Scraper icon located in the Chrome extensions toolbar (top right-hand-side).
When initiated, the Instant Data Scraper window will open. In this window, we must now select the ‘Infinite Scroll’ checkbox. We use Infinite Scroll because every follower list on Twitter is built with a feature called ‘lazy load’, which means that Twitter defers the initialisation of an object (such as a batch of followers) until the point at which it is needed. With the Infinite Scroll checkbox selected, we can now set our minimum and maximum delay values – these values will not only help us to bypass rate-limiting mechanisms, but they will also ensure a more accurate crawl and scrape of the target follower lists. In our case, we set the minimum delay value to ‘3’ and the maximum delay value to ‘20’. At this point, we selected the blue box labelled ‘Start Crawling’. Once the crawl and scrape process had finished, we opted to download the follower list as a CSV. We then repeated the same process across several additional Twitter profiles we suspected to be involved in the spreading of disinformation.
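For readers who want to picture what an infinite-scroll crawl with minimum and maximum delays does, the following is a minimal Python sketch of the general idea. It is not Instant Data Scraper's actual code, and we do not know how the extension chooses its wait time internally; the function name crawl_lazy_list and the load_next_batch callable are hypothetical stand-ins, and the uniform random delay is an assumption.

```python
import random
import time


def crawl_lazy_list(load_next_batch, min_delay=3.0, max_delay=20.0, sleep=time.sleep):
    """Collect items from a lazily loaded list until a batch brings nothing new.

    load_next_batch is a hypothetical callable standing in for "scroll down and
    read whatever rows are now on the page"; it returns a list of items.
    """
    if min_delay > max_delay:
        raise ValueError("min_delay must not exceed max_delay")
    collected = []
    seen = set()
    while True:
        new_count = 0
        for item in load_next_batch():
            if item not in seen:
                seen.add(item)
                collected.append(item)
                new_count += 1
        if new_count == 0:
            break  # nothing new loaded, so assume the end of the list
        # Assumption: wait a uniformly random interval before scrolling again.
        sleep(random.uniform(min_delay, max_delay))
    return collected
```

The sleep function is passed in as a parameter so that the loop can be tried out against simulated pages without actually waiting.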
Analysing the data
With our scraped lists of Twitter profiles suspected to be involved in spreading pro-Russian disinformation, we chose to process that same data within a link chart. This would allow us to visualise how each of the Twitter profiles is connected and where those connections intersect. To achieve this, we can combine each of our scraped follower lists into a ‘Node’ and ‘Edge’ list, then visualise the scraped data using Gephi. Instructions on how to process scraped data and visualise it using Gephi can be found in our previous OSINT Workflow Article.
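As a rough illustration of the ‘Node’ and ‘Edge’ list step, the Python sketch below merges several follower lists into the two tables that Gephi's spreadsheet import accepts (Id and Label for nodes; Source, Target and Type for edges). It is a sketch under stated assumptions rather than the workflow from our previous article: the function names are hypothetical, and the handle_column value is a placeholder, so check the header of your own CSV export before using it.

```python
import csv


def read_handles(csv_path, handle_column="handle"):
    """Read follower handles from one scraped CSV.

    handle_column is a hypothetical column name; use the one in your export.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row[handle_column] for row in csv.DictReader(f) if row.get(handle_column)]


def build_node_edge_lists(followers_by_target):
    """Turn {target profile: [follower handles]} into sorted node and edge lists."""
    nodes = set()
    edges = set()
    for target, followers in followers_by_target.items():
        nodes.add(target)
        for follower in followers:
            follower = follower.strip()
            if not follower:
                continue
            nodes.add(follower)
            edges.add((follower, target))  # edge runs from follower to followed profile
    return sorted(nodes), sorted(edges)


def write_gephi_csvs(nodes, edges, nodes_path="nodes.csv", edges_path="edges.csv"):
    """Write the node and edge tables using Gephi's spreadsheet column names."""
    with open(nodes_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Label"])
        writer.writerows((n, n) for n in nodes)
    with open(edges_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target", "Type"])
        writer.writerows((s, t, "Directed") for s, t in edges)
```

A follower who appears in more than one list ends up as a single node with several outgoing edges, which is exactly where the connections between target profiles intersect.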
However, in our case, we chose to upload our data into Paliscope YOSE by simply dragging and dropping our CSV files containing the scraped data into the YOSE database. The result – as you can see from the image above – shows that we produced an extensive link graph showing how each of the suspected disinformation actors was connected and where those connections intersect. The visual intelligence produced by YOSE now enables us to identify additional disinformation actors who may be of intelligence interest.
To conclude, Instant Data Scraper is a highly effective tool that delivers outstanding results. Whilst it does not have the automation capabilities that are normally found in Python-based scripts, it does provide any user with a simple and effective way to extract data from a variety of web pages and social media platforms. In our case, we used Instant Data Scraper to bypass limitations associated with the Twitter API and scrape data from suspected pro-Russian and pro-invasion disinformation actors. From this article, we have also found the basis to produce a wider workflow focused on disinformation actors operating on Twitter…watch this space!
The outbreak of war in Ukraine following the Russian invasion has undoubtedly resulted in an upsurge in the level of disinformation – mostly being disseminated by pro-invasion and pro-Russian actors online. What this means is that the narrative of the war in Ukraine is becoming greatly distorted by the increased disinformation campaign that has been deployed by pro-invasion and pro-Russian actors. The real danger in this regard stems from disinformation campaigns masking the realities of real atrocities and war crimes currently taking place in Ukraine – this presents a scenario where disinformation can affect the situation on the ground. Adding to this issue, there is a clear requirement for all-source intelligence analysts and journalists to apply greater scrutiny to media reports from un-trusted and un-corroborated sources. At this point, we will now introduce the highly effective InVID and WeVerify toolbox that can enable digital investigators, all-source intelligence analysts, and journalists to identify and analyse disinformation.
What is the InVID and WeVerify toolbox?
Put simply, the InVID and WeVerify toolbox is undoubtedly the ‘Swiss Army Knife’ of tools designed to detect and analyse disinformation. The toolbox is intended to help journalists, fact-checkers, and human rights defenders to save time and be more efficient in their fact-checking and debunking tasks on social networks, especially when verifying videos and images. It is a Google Chrome-based extension that was initially launched in July 2017 during the InVID European project, a Horizon 2020 innovation action funded by the European Union. The toolbox is currently maintained by AFP Medialab R&D and has been enhanced by the WeVerify project, also funded by the European Union.
Installation and deployment
Being a Google Chrome extension, the InVID and WeVerify toolbox can be installed onto your Chrome browser directly from the Chrome Web Store. No configuration is needed to run the tool, though we should point out that advanced features are understandably restricted to fact-checkers, journalists and researchers due to the computing power needed to run such features and to avoid misuse. On a highly important point of privacy, the team that maintains the toolbox states that no personal data is recorded. However, they do use Google Analytics to better understand usage, though users can opt out of this by un-checking the Google Analytics checkbox located on the ‘About’ page.
So what can it do?
A word of warning to our readers, this is going to be a lengthy section – but be prepared to be very pleased!
The toolbox consists of the following core modules, each with its own unique set of capabilities:
Video module containing the following functions:
Video Analysis: Provides you with contextual information on a YouTube, Facebook or Twitter video
Keyframes: Fragments a YouTube, Facebook or Twitter video or an MP4 file into keyframes for reverse image search on Google, Yandex, Bing, Tineye, Baidu or Karma Decay (for Reddit) search engines
Thumbnails: Extracts and performs a reverse image search of the thumbnails of a YouTube video
Video Rights: Provides information about the legal rights of a YouTube or Twitter video
Metadata: Extract metadata from videos in MP4 or M4V format
Image module that has the following capabilities:
Image Analysis: Provides contextual information on an image posted on Facebook or Twitter
Magnifier: Provides a magnifying lens and a photo editor to help you examine an image thoroughly
Metadata: Extracts metadata for JPEG images
Forensic: Provides an enhanced toolkit to detect image forgeries and alterations in manipulated images
Optical Character Recognition: Enables you to read text from images
Check GIF: An advanced feature restricted to registered fact-checkers, journalists and researchers that allows you to create a GIF between a manipulated image and an original one to better reveal the manipulation
Search module that allows users to perform the following:
Twitter Search: Enables users to perform advanced search queries on Twitter
Factchecks: Provides a customised search of fact-checks. Unfortunately, this feature has been deprecated since the latest Google update, but a solution is currently being explored.
XNetwork: Provides a customised search of cross-network queries. Unfortunately, this feature has been deprecated since the latest Google update, but a solution is currently being explored.
Data Analytics module that offers the following capabilities:
Twitter SNA: An advanced feature restricted to registered fact-checkers, journalists and researchers that enables you to perform social network analysis on Twitter
CSV Analysis: Can perform social network analysis from a CrowdTangle CSV export. The CrowdTangle Chrome Extension can be installed from: https://chrome.google.com/webstore/detail/crowdtangle-link-checker/klakndphagmmfkpelfkgjbkimjihpmkh/related?authuser=1
Detecting and analysing disinformation
It can be reasonably argued that Twitter accounts for a large share of the disinformation being circulated across mainstream social media platforms. Therefore, we tested out the InVID and WeVerify toolbox across a number of pro-invasion and pro-Russian disinformation actors using Twitter to disseminate and circulate fake news. Using the Image Analysis and Forensic features, we were able to easily and quickly detect modified images being disseminated by several disinformation actors.
We applied the same approach to several videos being circulated by the same disinformation actors, using the Keyframes feature to break down videos into frames and conducting reverse image searches. Lastly, we used the Twitter SNA feature to conduct a comprehensive analysis of disinformation on Twitter. This feature not only provides a range of visual intelligence in relation to URLs and hashtags associated with disinformation, but it also allows you to reveal users who have shared and / or liked tweets containing disinformation. As a final treat, the outputs from this feature include the capability to produce GEXF files that can be opened and visualised in Gephi. The result is very much the same as the outputs discussed in our previous OS2INT Tool Review article.
Wrapping it all up!
All we can say is that the InVID and WeVerify toolbox is quite simply outstanding, both in terms of the range of tools that it provides and their overall effectiveness. This toolbox has been developed for a very important purpose; with 68k active users per week from 223 countries (and growing!!), this toolbox is clearly well-renowned and well-trusted by journalists, fact-checkers, and researchers worldwide. So, to bring this OSINT Tool Review article to a natural and fitting conclusion, we at OS2INT see a very real need for this tool to be used to separate facts from fake news concerning the ongoing war in Ukraine. As such, this toolbox comes with our highest recommendation!
TG-API: Batch scrape from Telegram channels and groups
Telegram is undoubtedly a vital source of data and information concerning the ongoing war in Ukraine. On the one hand, Telegram channels and groups created by local civilians are being used to report on Russian troop movements; on the other hand, pro-Russian and pro-invasion disinformation actors have created a significant number of channels to broadcast their false narratives. Whilst Telegram does offer users the capability to export chat histories through the native export feature that can be found on the Telegram desktop application, tools that can batch extract chats are considerably few and far between. However, one such tool that offers OSINT analysts the capability to batch scrape from Telegram channels and groups is a Python-based utility named TG-API [Telegram API].
Why would you need to batch scrape?
Using the ongoing military situation in Ukraine as an example, there is a real risk of ‘information – or data – overload’. This is caused by the huge number of Telegram channels and groups that exist within this space and the vast amount of information being posted by users on a daily basis. Using the native export chat feature on the Telegram desktop application is quite simply not a feasible option as it would take days – or perhaps weeks – to archive each group individually. Batch scraping would at least enable OSINT analysts to continuously scrape from Telegram channels and groups, then process the extracted data through an effective third-party analysis tool.
What can TG-API do?
TG-API provides several very useful functions. Its core capability is that it individually or batch scrapes from Telegram channels and groups, then generates JSON files containing the scraped data. Such data includes information regarding the target channel / group in addition to scraped user posts. Additionally, the utility provides users with the capability to generate a CSV file based on the aforementioned JSON files – which is especially useful when using a third-party platform to analyse the results.
Installation and deployment
Cloning the tool from its GitHub repository is very straightforward, and installation of the tool using Python is done by invoking the standard command pip install -r requirements.txt. However, depending on your operating system of choice, some of the required Python libraries such as Louvain, Matplotlib, and Pandas will need to be manually installed by invoking pip install [INSERT TARGET LIBRARY HERE]. After all of the required Python libraries have been installed, your Telegram API credentials need to be inserted into the config.ini file located in the utility’s root folder.
Once all of the configurations are complete, the tool can now scrape from your target Telegram channels / groups by invoking python main.py --telegram-channel [INSERT CHANNEL NAME]. However, if you need to batch scrape from multiple sources, this can be achieved by creating a .txt file with a list of target Telegram channels / groups (one per line) and saving it in the utility’s root folder. Then, you can run the tool to scrape from multiple sources by invoking the command python main.py --batch-file [PATH TO TXT FILE].
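As a minimal sketch of the batch step, the Python below tidies a list of channel names, writes them one per line to a .txt file, and assembles the --batch-file command quoted above. The helper names and the channels.txt filename are hypothetical, the example channel names are placeholders rather than real targets, and we have not confirmed whether TG-API expects any particular format for channel names, so no normalisation beyond trimming and de-duplication is attempted.

```python
from pathlib import Path


def clean_channel_names(raw_names):
    """Trim whitespace, drop blanks and remove duplicates while keeping order."""
    cleaned = []
    for name in raw_names:
        name = name.strip()
        if name and name not in cleaned:
            cleaned.append(name)
    return cleaned


def write_batch_file(channels, path="channels.txt"):
    """Write one channel name per line, as described for the batch file."""
    batch_path = Path(path)
    batch_path.write_text("\n".join(channels) + "\n", encoding="utf-8")
    return batch_path


def batch_command(batch_path):
    """Build the batch command from the article as an argument list."""
    return ["python", "main.py", "--batch-file", str(batch_path)]


# Placeholder channel names for illustration only.
example = clean_channel_names([" chan_one ", "", "chan_two", "chan_one"])
print(" ".join(batch_command("channels.txt")))
```

The argument list can be handed to subprocess.run(..., check=True) from the utility’s root folder if you prefer to launch the scrape from a script rather than typing the command.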
The utility also provides users with the capability to scrape new messages from target Telegram channels / groups by invoking the command python main.py --telegram-channel channelname --min-id [INSERT LAST ID NUMBER SCRAPED].
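To illustrate how the last scraped ID might be tracked between runs, here is a small Python sketch that picks the highest message ID from a previous scrape and builds the --min-id command quoted above. It rests on assumptions: the exact layout of TG-API's JSON output is not documented in this article, so the sketch works on a plain list of message dictionaries with a hypothetical "id" field, and both function names are our own. We have also not verified whether --min-id includes or excludes the ID you pass, so expect a possible duplicate message at the boundary.

```python
def latest_message_id(messages, id_field="id"):
    """Return the highest integer message ID, or None if there are no IDs.

    id_field is an assumed key name; adjust it to match the scraped JSON.
    """
    ids = [m[id_field] for m in messages if isinstance(m.get(id_field), int)]
    return max(ids) if ids else None


def incremental_command(channel, min_id):
    """Build the incremental scrape command from the article as an argument list."""
    return ["python", "main.py", "--telegram-channel", channel, "--min-id", str(min_id)]


# Illustrative data only, not real scraped output.
previous_run = [{"id": 10}, {"id": 42}, {"text": "no id here"}]
last_id = latest_message_id(previous_run)
if last_id is not None:
    print(" ".join(incremental_command("channelname", last_id)))
```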
As we earlier pointed out, TG-API works by scraping Telegram channel / group data including metadata and posts, then saving them in JSON format. But, if you require the scraped data to be in CSV format, this can be easily achieved by invoking the command python build-datasets.py.
Analysing the output
TG-API is also meant to provide users with the capability to produce a Gephi file based on its output – ultimately enabling users to visualise collected data. Unfortunately, we found that this feature has a bug which prevents it from working (hopefully the utility’s developer can resolve this issue). That said, and going back to what we indicated earlier, collecting vast amounts of data from Telegram could be a useless task if you have no way to effectively analyse it.
To analyse the scraped data, we turned to YOSE by Paliscope – specifically its Chat Analytics module. In YOSE, we established a comprehensive keyword list containing a whole range of Russian military equipment so that we could later identify and analyse interactions where there were keyword matches. To load our Telegram data into YOSE, we simply used its drag-and-drop feature and then identified the relevant columns containing the Telegram chat data. The results (as shown below) are very good!
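For readers without access to a dedicated analysis platform, the basic idea of keyword matching can be sketched in a few lines of Python. This is not how YOSE works internally, which we have not examined; it is a simple, case-insensitive whole-term match using the standard re module. The function names are hypothetical and the two equipment designations in the usage example are illustrative entries, not our actual keyword list.

```python
import re


def compile_keywords(keywords):
    """Compile a case-insensitive pattern that matches any keyword as a whole term."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    # Longest first, so longer terms win over shorter ones that share a prefix.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    return re.compile(r"(?<!\w)(" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


def find_matches(messages, pattern):
    """Return (message, sorted lower-cased keywords found) for each matching message."""
    hits = []
    for message in messages:
        found = sorted({m.group(1).lower() for m in pattern.finditer(message)})
        if found:
            hits.append((message, found))
    return hits


# Illustrative keywords and messages only.
pattern = compile_keywords(["T-72", "BMP-2"])
for message, found in find_matches(["Column of t-72 tanks seen", "Nothing to report"], pattern):
    print(found, message)
```

The messages would typically come from the text column of the CSV produced by build-datasets.py; the name of that column depends on the export, so check the header first.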
Taking our analysis even further, we used YOSE to analyse the dataset and visualise the flow of chats between various users and instances where messages have been forwarded from one channel to another. As you can see from the image below, we were able to create an effective intelligence picture concerning our scraped chats and visualise how chats and messages are being shared between various Telegram channels / groups.
Our final thoughts
TG-API is quite a good tool for OSINT analysts who need the capability to batch scrape from multiple Telegram channels and groups. As the utility itself is relatively new, some features contain bugs or raise deprecation warnings. This means that unless these issues are addressed soon, the tool may not function effectively in the short term. Issues aside, the tool is very capable of extracting vast amounts of Telegram data from multiple sources and generating datasets that can be effectively analysed using third-party applications. As we already pointed out, there is a genuine need for OSINT analysts monitoring the situation in Ukraine to have the capability to batch scrape from multiple Telegram channels / groups. However, this data is useless unless you have the capability to process and analyse it effectively.
Joseph Jones | Founder of OS2INT and Director of Capability Development at Paliscope
Joseph Jones is a former British military intelligence operator and former National Crime Agency intelligence officer with more than 16 years of intelligence-gathering and investigative experience. He holds a BSc (Hons) Intelligence and Cyber Security from Staffordshire University and is also an external expert for the European Union Agency for Law Enforcement Training (CEPOL), the European Border and Coast Guard Agency (FRONTEX), the European Union Agency for Cybersecurity (ENISA) and Expertise France.