Blog About Me Posts By Tags Subscribe Elastic Search My Sessions Terraform

Pavan Kumar Aryasomayajulu


Using BigQuery Stackoverflow and Github Data To Find Interesting Stats

Hello Everyone, As a developer, I use Stackoverflow and Github everyday. So I know these sites play a vital role in almost every software developers day to day life.

So In this post, I would like to find some interesting stats like how many .cs(C# files) have stackoverflow question as a comment and then which answer or question they are actually refering to. I’ll be making use of Google’s BigQuery and open datasets provided by them to find out these relationships.

BigQuery

“BigQuery is an enterprise data warehouse that solves this problem by enabling super-fast SQL queries using the processing power of Google’s infrastructure. … You can control access to both the project and your data based on your business needs, such as giving others the ability to view or query your data” – Google’s definition

So Google provides us ability to query in sql format by uploading large datasets or even we can public datasets

Google has almost 124 different public datasets available in the market place from various domains.

Getting Started With BigQuery

Navigate to [https://console.cloud.google.com/bigquery] (https://console.cloud.google.com/bigquery)

Login with your gmail account

Click on select project and add a new project if you dont have one or the quota got expired for existing one. Once we are done with the project creation, following screen appears.

At the bottom we can see bigquery-public-data and there we can see different public datasets. I am more interested in github_repos and stackoverflow datasets.
When we expand github_repos we can see different tables available and when we click on a table name we can see schema. If we click on preview, we can see sample data.

Finding Languages Used In Repos


SELECT distinct(lang.name) as lang FROM `bigquery-public-data.github_repos.languages`, UNNEST(language) as lang

How many times a language is used in different repos

Select lang,count(lang) as count FROM (
SELECT lang.name as lang FROM `bigquery-public-data.github_repos.languages`, UNNEST(language) as lang) a group by lang order by count(lang) desc


Explore with DataStudio
Google also provides us with the ability to plot various graphs and visualizations based on the data which we got. Once we get the query result, there is an option called “Explore with Datastudio”. We can assiociate our account to make use of that feature and there we can plot different graphs.




Note: In my case, I need to set Metric as “count” instead of the default value “Record Count” uder data or else we will not get appropriate results.



Thanks,
Pavan Kumar Aryasomayajulu