
Build a Robust Workflow to Visualize Trending GitHub Repositories in Python | by Khuyen Tran | Jul, 2022


Get Updated with the Latest Trends on GitHub on Your Local Machine

GitHub feed is a great way for you to follow along with what’s trending in the community. You can discover some useful repositories by looking at what your connections have starred.

Image by Author

However, there might be some repositories you don’t care about. For example, you might only be interested in Python repositories, while your GitHub feed also contains repositories written in other languages. Thus, it might take you a while to find useful libraries.

Wouldn’t it be nice if you could create a personal dashboard showing the repositories your connections followed, filtered by your favorite language, like below?

Image by Author

In this article, we will learn how to do exactly that with the GitHub API + Streamlit + Prefect.

At a high level, we will:

  • Use the GitHub API to write scripts that pull the data from GitHub
  • Use Streamlit to create a dashboard showing the statistics of the processed data
  • Use Prefect to schedule the scripts that get and process the data to run daily
Image by Author

If you want to skip the explanations and go straight to creating your own dashboard, view this GitHub repository:

To pull data from GitHub, you need your username and an access token. Next, create a file called .env to save your GitHub authentication details.
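A minimal .env might look like the sketch below; the variable names and values are placeholders, not the article’s exact choices:

# .env -- placeholder values, replace with your own credentials
GITHUB_USERNAME=your-username
GITHUB_TOKEN=ghp_your_personal_access_token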

Add .env to your .gitignore file to make sure that .env is not tracked by Git.

In the file get_data.py under the development directory, we will write code to access the information in the file .env using python-dotenv:
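A minimal sketch of that step, assuming the GITHUB_USERNAME and GITHUB_TOKEN variable names from the .env example above:

import os
from dotenv import load_dotenv  # pip install python-dotenv

# Read the variables defined in .env into the process environment
load_dotenv()

USERNAME = os.getenv("GITHUB_USERNAME")
TOKEN = os.getenv("GITHUB_TOKEN")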

Next, we will get the general information of the public repositories we received in our GitHub feed using the GitHub API. The GitHub API allows us to easily get data from your GitHub account, including events, repositories, etc.
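A sketch of that step, assuming the received_events endpoint and fetching only the first page of results:

import requests

def get_general_info_of_repos(username: str, token: str) -> list:
    """Pull the public events generated by the accounts you follow (first page only)."""
    url = f"https://api.github.com/users/{username}/received_events/public"
    response = requests.get(url, auth=(username, token))
    response.raise_for_status()
    return response.json()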

Image by Author

We will use pydash to get the URLs of all starred repositories from a list of repositories:
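A sketch under the assumption that starring shows up in the feed as WatchEvent entries and that each event stores the repository’s API URL under repo.url:

from pydash import py_

def get_starred_repo_urls(events: list) -> list:
    """Keep only the starring events (WatchEvent) and pluck each repository's API URL."""
    starred_events = py_.filter_(events, {"type": "WatchEvent"})
    return py_.map_(starred_events, "repo.url")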

Get the specific information of each starred repository, such as language, stars, owners, pull requests, etc.:
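A sketch of that step; the full repository payload returned by the API already contains those fields, so here we simply fetch it for every URL:

import requests

def get_specific_info_of_repos(urls: list, username: str, token: str) -> list:
    """Call the GitHub API for each repository URL and collect the detailed payloads."""
    return [requests.get(url, auth=(username, token)).json() for url in urls]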

The information extracted from each repo should look similar to this.

Save the data to the local directory:
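A sketch, assuming the results are written to data/repos.json (the exact path is an assumption):

import json
from pathlib import Path

def save_data(data: list, path: str = "data/repos.json"):
    """Write the collected repository information to a local JSON file."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f, indent=2)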

Put everything together:
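Drawing on the helper sketches above, the whole script might be wired up roughly like this:

def get_data():
    """Pull, enrich, and save the starred repositories from the GitHub feed."""
    events = get_general_info_of_repos(USERNAME, TOKEN)
    urls = get_starred_repo_urls(events)
    repos = get_specific_info_of_repos(urls, USERNAME, TOKEN)
    save_data(repos)

if __name__ == "__main__":
    get_data()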

The current script works, but to make sure it is resilient against failures, we will use Prefect to add observability, caching, and retries to the workflow.

Prefect is an open-source library that lets you orchestrate your data workflow in Python.

Add Observability

Because it takes a while to run the file get_data.py, we might want to know which code is being executed and roughly how much longer we need to wait. We can add Prefect’s decorators to our functions to get more insight into the state of each function.

Specifically, we add the decorator @task to the functions that each do one thing and the decorator @flow to the function that contains several tasks.

In the code below, we add the decorator @flow to the function get_data since get_data contains all the tasks.
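A sketch of how the decorators might be applied, with the task bodies elided since they match the earlier sketches:

from prefect import flow, task

@task
def get_general_info_of_repos(username: str, token: str) -> list:
    ...  # body as sketched earlier

@task
def get_starred_repo_urls(events: list) -> list:
    ...  # body as sketched earlier

@task
def get_specific_info_of_repos(urls: list, username: str, token: str) -> list:
    ...  # body as sketched earlier

@task
def save_data(data: list):
    ...  # body as sketched earlier

@flow
def get_data():
    events = get_general_info_of_repos(USERNAME, TOKEN)
    urls = get_starred_repo_urls(events)
    repos = get_specific_info_of_repos(urls, USERNAME, TOKEN)
    save_data(repos)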

You should see output like the below when running this Python script:

From the output, we know which tasks are completed and which ones are in progress.

Caching

In our current code, the function get_general_info_of_repos takes a while to run. If the function get_specific_info_of_repos failed, we would need to rerun the entire pipeline and wait for get_general_info_of_repos to run again.

Image by Author

To reduce the execution time, we can use Prefect’s caching to save the results of get_general_info_of_repos in the first run and then reuse the results in the second run.

Image by Author

In the file get_data.py, we add caching to the tasks get_general_info_of_repos, get_starred_repo_urls, and get_specific_info_of_repos because they take quite a bit of time to run.

To add caching to a task, specify the values for the arguments cache_key_fn and cache_expiration.
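A sketch of what that looks like on one of the slow tasks:

from datetime import timedelta

from prefect import task
from prefect.tasks import task_input_hash

@task(cache_key_fn=task_input_hash, cache_expiration=timedelta(days=1))
def get_general_info_of_repos(username: str, token: str) -> list:
    ...  # body as sketched earlier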

In the code above,

  • cache_key_fn=task_input_hash tells Prefect to use the cached results unless the inputs change or the cache expires.
  • cache_expiration=timedelta(days=1) tells Prefect to refresh the cache after one day.

We can see that the run that doesn’t use caching (lush-ostrich) finished in 27s, while the run that uses caching (provocative-wildebeest) finished in 1s!

Image by Author

Note: To view the dashboard with all runs like above, run prefect orion start.

Retries

There is a chance that we fail to pull the data from GitHub through the API, and we then need to run the entire pipeline again.

Image by Author

Instead of rerunning the entire pipeline, it is more efficient to rerun only the task that failed, a specific number of times, after a specific period of time.

Prefect lets you automatically retry on failure. To enable retries, add the retries and retry_delay_seconds parameters to your task.
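A sketch on one of the API-calling tasks:

from prefect import task

@task(retries=3, retry_delay_seconds=60)
def get_specific_info_of_repos(urls: list, username: str, token: str) -> list:
    ...  # body as sketched earlier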

In the code above:

  • retries=3 tells Prefect to rerun the task up to 3 times
  • retry_delay_seconds=60 tells Prefect to wait 60 seconds before retrying. This is handy since we might hit the rate limit if we call the GitHub API repeatedly in a short period of time.

In the file process_data.py under the development directory, we will clean up the data so that we get a table that only shows what we are interested in.

Start by loading the data and keeping only the repositories that are written in a specific language:
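A minimal sketch, assuming the JSON file written by get_data.py:

import json

def load_data(path: str = "data/repos.json") -> list:
    """Load the repositories saved by get_data.py."""
    with open(path) as f:
        return json.load(f)

def filter_by_language(repos: list, language: str = "Python") -> list:
    """Keep only the repositories written in the chosen language."""
    return [repo for repo in repos if repo.get("language") == language]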

Image by Author

Next, we will keep only the information about each repository that is interesting to us, which includes:

  • Full name
  • HTML URL
  • Description
  • Stargazers count

Create a DataFrame from the dictionary, then remove the duplicated entries:
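A sketch that covers both steps, using the GitHub API field names that correspond to the list above:

import pandas as pd

FIELDS = ["full_name", "html_url", "description", "stargazers_count"]

def create_dataframe(repos: list) -> pd.DataFrame:
    """Keep only the interesting fields, build a DataFrame, and drop duplicated repositories."""
    records = [{field: repo.get(field) for field in FIELDS} for repo in repos]
    return pd.DataFrame(records).drop_duplicates(subset="full_name")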

Put everything together:
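A sketch of the full flow, assuming the cleaned table is saved to data/processed_repos.csv (the path is an assumption):

from prefect import flow

@flow
def process_data(language: str = "Python"):
    """Clean the raw repositories and save a table of the interesting columns."""
    repos = load_data()
    filtered_repos = filter_by_language(repos, language)
    df = create_dataframe(filtered_repos)
    df.to_csv("data/processed_repos.csv", index=False)

if __name__ == "__main__":
    process_data()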

Now comes the fun part: create a dashboard to view the repositories and their statistics.

The structure of the directory for our app code will look similar to this:
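Roughly like this; Visualize.py is the home page named below, while the file name under pages is an assumption:

app
├── Visualize.py        # home page
└── pages
    └── Filter.py       # child page that filters repositories by topic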

Visualize The Statistics of Repositories

The file Visualize.py will create the home page of the app, while the files under the directory pages create the children pages.

We will use Streamlit to create a simple app in Python. Let’s start by writing the code to show the data and its statistics. Specifically, we want to see the following on the main page:

  • A table of repositories filtered by language
  • A chart of the top 10 most popular repositories
  • A chart of the top 10 most popular topics
  • A word cloud chart of topics

Code of Visualize.py:
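A sketch of what it might contain, assuming the processed CSV from process_data.py and a topics column stored as a comma-separated string (both assumptions):

import matplotlib.pyplot as plt
import pandas as pd
import plotly.express as px
import streamlit as st
from wordcloud import WordCloud

st.title("Trending GitHub Repositories")

# Assumption: process_data.py saved the filtered table here with a "topics" column
df = pd.read_csv("../data/processed_repos.csv")

# A table of repositories filtered by language (the filtering happened in process_data.py)
st.header("Repositories")
st.dataframe(df)

# A chart of the top 10 most popular repositories
st.header("Top 10 Most Popular Repositories")
top_repos = df.nlargest(10, "stargazers_count")
st.plotly_chart(px.bar(top_repos, x="full_name", y="stargazers_count"))

# A chart of the top 10 most popular topics
st.header("Top 10 Most Popular Topics")
topics = df["topics"].dropna().str.split(",").explode().str.strip()
top_topics = topics.value_counts().nlargest(10)
st.plotly_chart(px.bar(x=top_topics.index, y=top_topics.values, labels={"x": "topic", "y": "count"}))

# A word cloud chart of topics
st.header("Topic Word Cloud")
cloud = WordCloud(background_color="white").generate(" ".join(topics))
fig, ax = plt.subplots()
ax.imshow(cloud, interpolation="bilinear")
ax.axis("off")
st.pyplot(fig)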

To view the dashboard, type:

cd app
streamlit run Visualize.py

Go to http://localhost:8501/ and you should see the following dashboard!

Image by Author

Filter Repositories Based on Their Topics

We get repositories with different topics, but we often only care about specific topics such as machine learning and deep learning. Let’s create a page that helps users filter repositories based on their topics.
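A sketch of such a page; the file name pages/Filter.py and the topics format are assumptions:

# pages/Filter.py
import pandas as pd
import streamlit as st

st.title("Filter Repositories by Topic")

# Same assumption as on the home page: a comma-separated "topics" column
df = pd.read_csv("../data/processed_repos.csv")
df["topic_list"] = df["topics"].fillna("").str.split(",")

# Build the list of all topics and let the user pick the ones they care about
all_topics = sorted({t.strip() for topics in df["topic_list"] for t in topics if t.strip()})
chosen = st.multiselect("Topics", all_topics)

# Keep only the repositories that carry at least one of the chosen topics
if chosen:
    df = df[df["topic_list"].apply(lambda topics: any(t.strip() in chosen for t in topics))]

st.dataframe(df.drop(columns="topic_list"))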

And you should see a second page that looks like this. In the GIF below, I only see the repositories with the tags deep-learning, spark, and mysql after applying the filter.

Image by Author

If you want a daily update of the repositories in your GitHub feed, you might feel lazy running the script to get and process the data every day. Wouldn’t it be nice if you could schedule your script to run automatically every day?

Image by Author

Let’s schedule our Python scripts by creating a deployment with Prefect.

Use Subflows

Since we want to run the flow get_data before running the flow process_data, we can put them under another flow called get_and_process_data inside the file development/get_and_process_data.py.
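A sketch of that wrapper flow, assuming get_data and process_data are importable from their modules in the development directory:

from prefect import flow

from get_data import get_data
from process_data import process_data

@flow
def get_and_process_data():
    """Run the data-pulling flow, then the data-processing flow."""
    get_data()
    process_data()

if __name__ == "__main__":
    get_and_process_data()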

Next, we will write a script to deploy our flow. We use IntervalSchedule to run the deployment every day.
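A sketch of such a script, written against the Prefect 2.0 beta API that the prefect deployment create command below expects; the exact class and field names may differ between beta releases:

from datetime import timedelta

from prefect.deployments import DeploymentSpec
from prefect.orion.schemas.schedules import IntervalSchedule

from get_and_process_data import get_and_process_data

DeploymentSpec(
    flow=get_and_process_data,
    name="github-daily",                                    # assumption: any name works
    schedule=IntervalSchedule(interval=timedelta(days=1)),  # run once a day
    tags=["dev"],                                           # matches the dev work queue created below
)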

To run the deployment, we will:

  • Start a Prefect Orion server
  • Configure storage
  • Create a work queue
  • Run an agent
  • Create the deployment

Start a Prefect Orion Server

To start a Prefect Orion server, run:

prefect orion start

Configure Storage

Storage saves your task results and deployments. Later, when you run a deployment, Prefect will retrieve your flow from the storage.

To create storage, type:

prefect storage create

And you will see the following options in your terminal.

In this project, we will use temporary local storage.

Create a Work Queue

A work queue organizes deployments into a queue for execution.

To create a work queue, type:

prefect work-queue create --tag dev dev-queue
Image by Author

Output:

UUID('e0e4ee25-bcff-4abb-9697-b8c7534355b2')

The --tag dev tells the dev-queue work queue to only serve deployments that include a dev tag.

Run an Agent

An agent makes sure the deployments in a specific work queue are executed.

Image by Author

To run an agent, type prefect agent start <ID of dev-queue>. Since the ID of the dev-queue is e0e4ee25-bcff-4abb-9697-b8c7534355b2, we type:

prefect agent start 'e0e4ee25-bcff-4abb-9697-b8c7534355b2'

Create a Deployment

To create a deployment from the file development.py, type:

prefect deployment create development.py

You should see the new deployment under the Deployments tab.

Image by Author

Then click Run in the top right corner:

Image by Author

Then click Flow Runs in the left menu:

Image by Author

And you will see that your flow is scheduled!

Image by Author

Now the script that pulls and processes the data will run every day, and your dashboard shows the latest repositories on your local machine. How cool is that?

In the current version, the app and the Prefect agent run on your local machine, so they will stop running if you turn off the machine.

To prevent this from happening, we can use a cloud service such as AWS or GCP to run the agent, store the database, and serve the dashboard.

Image by Author

In the next article, we will learn how to do exactly that.
