Extracting YouTube Data with Prefect
May 5, 2022

In part two of James Sopkin's social media series, we explore extracting and loading data from YouTube with Prefect.

James Sopkin, Solutions Engineer

In part two of this social media series, this guide will demonstrate how to build out a data flow that extracts and loads YouTube channel data. In the previous segment, Twitter data was extracted with Tweepy. For the purposes of this guide, only the YouTube Data API v3 will be covered, but there are two additional APIs: YouTube Analytics and YouTube Reports.

Prefect is a workflow orchestrator that affords convenience, simplicity, and functionality: functions are packaged as tasks, secrets can be passed in through Prefect Secrets, and the workflow can be containerized, stored in the cloud, and run on a schedule.

This flow uses the following packages:

  • prefect

  • pandas

  • google-api-python-client

Here are the import statements used in this flow:
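A minimal set might look like this (assuming Prefect's flow and task decorators; the exact imports in the original flow may differ):

```python
import datetime
import re

import pandas as pd
from googleapiclient.discovery import build
from prefect import flow, task
```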

Before the YouTube Data API can be accessed, a project must be created in the Google Developers Console and the API must be enabled. More information on how to set up a project and obtain credentials is available in Google's documentation.

YouTube has an authorization process that involves retrieving an authorization code and then generating an access token and a refresh token. The refresh token can be used to generate a new access token.
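For flows that do need private data, a sketch of that token exchange might look like the following, assuming the google-auth package installed alongside google-api-python-client (the client ID, client secret, and refresh token are placeholders):

```python
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials

creds = Credentials(
    token=None,  # no access token yet; the refresh below obtains one
    refresh_token="<refresh-token>",
    token_uri="https://oauth2.googleapis.com/token",
    client_id="<client-id>",
    client_secret="<client-secret>",
)
creds.refresh(Request())  # exchanges the refresh token for a new access token
youtube = build("youtube", "v3", credentials=creds)
```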

If, however, only publicly accessible data is needed, the API can be accessed with an API key. The API reference details what can be pulled with an API key and what requires OAuth authorization. It also includes a feature that allows users to plug in parameters and test endpoints in the UI.
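Since this flow only needs public data, a minimal authentication task might look like this (the task name and the way the key is supplied are assumptions; in practice the key could come from a Prefect Secret):

```python
@task
def get_youtube_client(api_key: str):
    # Build a connection object for the YouTube Data API v3.
    # An API key is enough for publicly accessible data.
    return build("youtube", "v3", developerKey=api_key)
```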

The above code returns a youtube connection object that will be used to hit the API endpoints. (Note: this flow pulls data from a public YouTube account. If requesting data from a private YouTube account, OAuth authorization will be necessary.)
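Next, a task to pull channel statistics. This sketch matches the description below; the task and variable names are assumptions:

```python
@task
def get_channel_stats(youtube, channel_ids):
    # One call can return statistics for several channels at once.
    response = youtube.channels().list(
        part="statistics", id=",".join(channel_ids)
    ).execute()

    views, videos, subs = [], [], []
    for item in response.get("items", []):
        stats = item["statistics"]
        views.append(stats["viewCount"])
        videos.append(stats["videoCount"])
        subs.append(stats["subscriberCount"])
    return views, videos, subs
```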

This task returns three lists of channel metrics - view count, video count, and subscriber count.

www.youtube.com/watch?....&list=YourPlaylistID

The playlist ID of a channel’s uploads can be found in its URL. Use the value right after “list=” as the playlist ID for the below code.
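A sketch of that pagination, using the API's maximum page size of 50 (the task name is an assumption):

```python
@task
def get_video_data(youtube, playlist_id):
    # Page through the uploads playlist, 50 items at a time,
    # collecting every video ID.
    video_ids, next_page = [], None
    while True:
        response = youtube.playlistItems().list(
            part="contentDetails",
            playlistId=playlist_id,
            maxResults=50,
            pageToken=next_page,
        ).execute()
        video_ids += [i["contentDetails"]["videoId"] for i in response["items"]]
        next_page = response.get("nextPageToken")
        if next_page is None:
            break

    # Pull snippet, statistics, and duration for the videos in batches of 50.
    videos = []
    for start in range(0, len(video_ids), 50):
        response = youtube.videos().list(
            part="snippet,contentDetails,statistics",
            id=",".join(video_ids[start:start + 50]),
        ).execute()
        videos += response["items"]
    return videos
```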

This paginates through all of the video uploads to grab video ids and returns a list of JSONs with video data. The next few tasks will build a DataFrame, convert the video duration into a time object, and filter the DataFrame for specific column names.

Helper function for video times:
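A sketch of such a helper (the name parse_duration is an assumption):

```python
def parse_duration(duration):
    # Durations arrive as ISO 8601 strings, e.g. "PT1H1M1S".
    hours = re.search(r"(\d+)H", duration)
    minutes = re.search(r"(\d+)M", duration)
    seconds = re.search(r"(\d+)S", duration)
    return datetime.time(
        hour=int(hours.group(1)) if hours else 0,
        minute=int(minutes.group(1)) if minutes else 0,
        second=int(seconds.group(1)) if seconds else 0,
    )
```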

Video durations are returned as ISO 8601 strings formatted like PT1H1M1S, which denotes a duration of 1 hour, 1 minute, and 1 second. The regex searches parse the duration string and return those values as a datetime.time object.

Table task to build channel table and modify video DataFrame:
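A sketch of that task; the column selection and renames below are illustrative rather than the original's exact choices:

```python
@task
def build_tables(channel_stats, videos):
    # Channel table from the three metric lists.
    views, video_counts, subs = channel_stats
    channel_df = pd.DataFrame(
        {"views": views, "video_count": video_counts, "subscribers": subs}
    )

    # Flatten the nested video JSON into one column per field.
    video_df = pd.json_normalize(videos)
    video_df["duration"] = video_df["contentDetails.duration"].apply(parse_duration)

    # Keep a subset of columns and give them friendlier names.
    video_df = video_df[
        ["id", "snippet.title", "snippet.publishedAt",
         "statistics.viewCount", "statistics.likeCount", "duration"]
    ].rename(columns={
        "snippet.title": "title",
        "snippet.publishedAt": "published_at",
        "statistics.viewCount": "views",
        "statistics.likeCount": "likes",
    })
    return channel_df, video_df
```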

The above task returns two tables: the channel table and the video table. It also filters and renames certain columns of the video table. Feel free to pick other columns or use all of the columns in the json_normalize output.

The below snippet is a task that inserts a DataFrame into Snowflake. pks refers to primary keys, and the delete command removes rows from the temp table that already exist in the target table, so duplicates are dropped before the insert.
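A sketch of such a load task, assuming the snowflake-connector-python package and its write_pandas helper; the connection parameters are placeholders that would normally come from Prefect Secrets:

```python
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

@task
def load_to_snowflake(df, table, pks, conn_params):
    # conn_params (account, user, password, ...) are placeholders here.
    conn = snowflake.connector.connect(**conn_params)
    temp_table = f"{table}_TEMP"
    try:
        # Stage the DataFrame in a temp table.
        write_pandas(conn, df, temp_table, auto_create_table=True)

        # Delete rows from the temp table whose primary keys already
        # exist in the target, so only new rows are inserted.
        match = " AND ".join(
            f"{temp_table}.{pk} = {table}.{pk}" for pk in pks
        )
        cur = conn.cursor()
        cur.execute(f"DELETE FROM {temp_table} USING {table} WHERE {match}")
        cur.execute(f"INSERT INTO {table} SELECT * FROM {temp_table}")
        cur.execute(f"DROP TABLE IF EXISTS {temp_table}")
    finally:
        conn.close()
```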

When making parameterized SQL queries like the ones shown above, it is often best practice to bind parameters to avoid the risk of SQL injection attacks. However, in this case, the parameters (primary keys and table names) are explicitly defined in the flow and passed directly to the load task, so no user input is taken. If the queries need to be modified to take user input, feel free to utilize the binding capabilities built into Snowflake's Python client.

Our final step: run the flow!
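A sketch of the full flow, with placeholder IDs, table names, and primary keys:

```python
@flow
def youtube_to_snowflake():
    conn_params = {
        "account": "<account>",
        "user": "<user>",
        "password": "<password>",
        "database": "<database>",
        "schema": "<schema>",
    }
    youtube = get_youtube_client("<api-key>")
    channel_stats = get_channel_stats(youtube, ["<channel-id>"])
    videos = get_video_data(youtube, "<uploads-playlist-id>")
    channel_df, video_df = build_tables(channel_stats, videos)
    # Table names and primary keys below are placeholders.
    load_to_snowflake(channel_df, "CHANNELS", ["NAME"], conn_params)
    load_to_snowflake(video_df, "VIDEOS", ["ID"], conn_params)

if __name__ == "__main__":
    youtube_to_snowflake()
```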

The above flow first authenticates to the YouTube Data API, then grabs data about the channel and its videos. That data is formatted into DataFrames, deduplicated, and loaded into Snowflake.

That leaves plenty of options for how to use the data. One use case would be to track performance, perhaps by building Tableau dashboards to visualize which videos perform better than others. There are also some worthwhile machine learning applications: a simple ML model could predict successful videos (more views, subscribers, likes) based on video length, tags, and subscriber count. Better yet, an NLP model could determine which titles are the most effective, or generate a title for a video based on its tags. Whichever way you choose to use this data, what better data to use than your own?

