Batch scraping and archiving YouTube channels with YARK

YouTube is an often-neglected source of OSINT-rich data, but its media content can be gathered and put to use through a clean interface such as YARK.

October 30, 2022

YouTube-based OSINT, what is possible?

Despite the prominence of other content-sharing social media platforms such as TikTok, YouTube remains an important source of data for OSINT purposes. The use cases for YouTube data are broad. For example, YouTube remains a space where criminals target children for grooming and sexual exploitation.

YouTube video content showing events taking place in Ukraine

At the same time, video content relating to the awful events taking place in Ukraine is posted daily. Such content is valuable for military intelligence operators looking to maintain their situational awareness of events on the ground, especially in illegally occupied areas of the country.

Scratching the surface of YouTube

On the face of it, exploitable data on YouTube appears somewhat limited. For example, when we study the YouTube OSINT Attack Surface Map by OSINT Dojo (a great resource we should add!), we can identify the categories and classifications of data present on the surface of YouTube. Visible content includes the media itself, the associated users and channels, video captions and comments (to name but a few).

YouTube attack surface map by OSINT Dojo - https://www.osintdojo.com/

But when we scratch the surface, we often find a substantial amount of highly valuable metadata, which can include a broad range of video and channel metrics, geolocation data, and time and date stamps. All of this data can be leveraged for OSINT purposes; the question is how to collect it effectively.

YARK – YouTube archiving made simple

We have been closely following YARK on GitHub for some time and decided that now was a great moment to test the tool. YARK has been created and developed by Liverpool-based developer Owen Griffiths, and from the very start we have been very impressed with what he has created and how it can be used.

YARK, scraping video content from YouTube channels

This utility is written in Python and enables users to continuously archive all videos and associated metadata from a YouTube channel, then view their archives in a local web-app interface. To batch download video content from YouTube channels, YARK uses the YoutubeDL library, which is a capable downloading utility in its own right.

Installation and deployment

Installing YARK is very straightforward thanks to the instructions provided by Owen on the GitHub repo page. In our case, it was simply a matter of creating a folder for YARK within our toolbox and installing the tool in our virtual environment by invoking pip3 install yark. To deploy the utility, you create an archive for the target YouTube channel by invoking yark new [archive name] [YouTube Channel ID].

YARK web application showing scraped video content from YouTube

For example, if you invoke yark new vice VICENews, you will create an archive named ‘vice’ and associate that archive with the Vice News YouTube channel. To begin the download process, users can then invoke yark refresh [archive name]. This process will store downloaded video content within the YARK root folder.
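The two commands above can also be driven from a short Python script, which is handy when several channels need archiving. The following is only a minimal sketch: the helper names (`yark_new_cmd`, `yark_refresh_cmd`, `run`) are our own and not part of YARK, and the script assumes the `yark` executable is on your PATH and that it is run from the folder where the archives should live. By default it only prints the commands (a dry run) rather than executing them.

```python
import shlex
import subprocess
from typing import List


def yark_new_cmd(archive_name: str, channel_id: str) -> List[str]:
    """Build the `yark new [archive name] [YouTube Channel ID]` command."""
    return ["yark", "new", archive_name, channel_id]


def yark_refresh_cmd(archive_name: str) -> List[str]:
    """Build the `yark refresh [archive name]` command."""
    return ["yark", "refresh", archive_name]


def run(cmd: List[str], dry_run: bool = True) -> str:
    """Print the command line; execute it only when dry_run is False."""
    line = shlex.join(cmd)
    print(line)
    if not dry_run:
        # check=True raises CalledProcessError if yark exits with an error
        subprocess.run(cmd, check=True)
    return line


# Dry run of the example from the article: create the archive, then download.
run(yark_new_cmd("vice", "VICENews"))
run(yark_refresh_cmd("vice"))
```

Passing `dry_run=False` to `run` would actually invoke YARK; keeping the dry run as the default makes it easy to check the commands before starting a large download.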

Visualising your YouTube archives

Unlike the standard YoutubeDL library, YARK enables users to visualise archives of YouTube channels through a web-app interface that runs on localhost port 7667. To run and view the web app, invoke yark view in the command-line interface and open localhost:7667 in your browser of choice. Once it is open, users can search, open and view archived YouTube channels.
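If you would rather not start the viewer and open the browser by hand, the two steps can be combined. The sketch below is an illustration under assumptions: it presumes `yark view` keeps running in the foreground and serves the web app on localhost port 7667 as described above, and the function names (`port_open`, `wait_for_port`, `open_viewer`) are hypothetical helpers of ours, not part of YARK.

```python
import socket
import subprocess
import time
import webbrowser

VIEWER_HOST = "localhost"
VIEWER_PORT = 7667  # port the article gives for the YARK web app


def port_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_port(host: str, port: int, attempts: int = 20, delay: float = 0.5) -> bool:
    """Poll until the port accepts connections or the attempts run out."""
    for _ in range(attempts):
        if port_open(host, port):
            return True
        time.sleep(delay)
    return False


def open_viewer() -> None:
    """Start `yark view`, then open the browser once the web app responds."""
    server = subprocess.Popen(["yark", "view"])
    try:
        if wait_for_port(VIEWER_HOST, VIEWER_PORT):
            webbrowser.open(f"http://{VIEWER_HOST}:{VIEWER_PORT}")
        server.wait()  # keep the viewer running until it exits
    except KeyboardInterrupt:
        server.terminate()  # Ctrl+C stops the viewer cleanly
```

Calling `open_viewer()` from the folder that holds your archives should start the viewer; polling the port first avoids opening a browser tab before the web app is ready.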

YARK web application showing scraped video from YouTube

For each video collected by YARK, users can use the web app to visualise a range of information including:

  • History: Shows whether the video title has been changed, when the video description was created, whether and when the description has been modified, and whether any other changes have been applied.
  • Views over time: YARK provides a graphical view to show the number of video views from the point that the archive containing the video was created.

As an added feature, YARK also has the capability to allow users to attach notes to collected videos.

Refreshing YouTube archives

One reason why we appreciate the work that Owen has put into YARK is that he is upfront about its limits: automatic archive refreshing is not yet a feature. That said, he rightly points out that users can set up a cron job to automate the process of updating their YouTube archives. For now, at least, users can manually refresh an archive by invoking yark refresh [archive name]. This updates the video metrics and the chart showing video views over time.
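As a rough illustration of the cron approach, the script below refreshes a list of archives in one pass so that a single cron entry can cover them all. It is a sketch under assumptions rather than anything shipped with YARK: the function name `refresh_all`, the script name and the paths in the crontab comment are hypothetical, and it assumes `yark refresh` has to be run from the folder that contains the archives.

```python
import subprocess
from pathlib import Path
from typing import Dict, List

# Example crontab entry (hypothetical paths), refreshing daily at 03:00:
#   0 3 * * * /usr/bin/python3 /path/to/refresh_archives.py


def refresh_all(root: Path, archives: List[str], dry_run: bool = True) -> Dict[str, bool]:
    """Run `yark refresh <name>` for each archive inside the root folder.

    Returns a mapping of archive name to success flag. With dry_run=True
    the commands are only printed, never executed.
    """
    results: Dict[str, bool] = {}
    for name in archives:
        cmd = ["yark", "refresh", name]
        if dry_run:
            print("would run:", " ".join(cmd), "in", root)
            results[name] = True
            continue
        try:
            # Run from the root folder so yark can find the archive by name
            proc = subprocess.run(cmd, cwd=root)
            results[name] = proc.returncode == 0
        except OSError:
            # yark is not installed or the root folder is missing
            results[name] = False
    return results
```

A call such as `refresh_all(Path("/path/to/yark"), ["vice"], dry_run=False)` would then perform the real refresh. Collecting a success flag per archive means one failing channel does not stop the rest from being updated.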

On a final note!

Owen has done a great job with YARK. We spent a considerable amount of time testing it on several channels posting video content about events in occupied eastern Ukraine. This enabled us to batch-scrape a large volume of video content showing military activity on the ground. We could also see a broad range of useful information on video views and changes to video titles, which proved useful when analysing videos containing pro-invasion and pro-Russian disinformation.

YARK graphical interface showing OSINT outputs from YouTube

So, for OSINT’ers looking for an elegant YouTube channel scraping and visualisation utility, YARK is certainly a tool to consider. We also like Owen’s approach to developing the tool and the fact that he welcomes feedback and feature suggestions. Naturally, many OSINT’ers will have their own specific features that they would like to see implemented in YARK. We believe that YARK could also be combined with other YouTube-based OSINT tools to give users the capability to scrape and export user comments as CSV, in addition to geolocation data. However, such features would very likely require YouTube Data API access. Even in its current form, YARK is an elegant and effective tool for OSINT’ers looking to maintain an archive of YouTube content relating to their subject matter areas. So, what more can we say other than “great work, Owen!”
