Surprise Topic 1: Google BigQuery
- Rodrigo Ledesma
- Jun 19, 2022
- 4 min read
Hello everybody! Welcome back to my blog. Today we will be giving the post a complete makeover. In our last 10 posts, we have been journeying through the process of data cleansing, feature selection, model tuning, and analysis of model performance. Now we will take a small turn in our objectives: in this case, we will not focus on machine learning or data science, but on what is, in my opinion, the true future of computing: cloud storage and cloud processing.
Google’s BigQuery ML is an online tool designed to allow data scientists and machine learning engineers to create, train, and store models and data in a serverless environment, with the added perk of Google’s infrastructure scaling resources based on demand. Of course, this service is not free; if you sign in for the first time you will get a $300 USD credit to use for free. Before trying to recreate the steps in this tutorial, please make sure you are either willing to pay for the service or still have this starter credit.

Now, the objective of this post will be to use the platform to store the Harry Potter and the Forbidden Journey waiting times, the dataset we have been using in our last posts to create the models. This will be the first of hopefully many posts using cloud services; my purpose is to eventually have every part of the pipeline in a cloud-based/serverless environment. So, one step at a time, let’s start at the very beginning, as my aunt Julie Andrews used to say (if you don’t understand, go ask your parents, they surely know what I am talking about).
Configuring your Google Cloud Platform profile
GCP is based on projects, which you can think of as folders that contain different services. So for starters, let’s create a new project, but first make sure you are logged in with your Google credentials on the GCP platform. Then click the following link to check your projects:

If you already have projects, this is how it will look, but in this case let’s just go step by step and click the “Create project” button at the top right corner.

Give your project any name you wish, but make sure it is not too generic, as over time you will have many projects and will have to navigate through them. Now that we have created the project, we need to link it to the BigQuery space; for this, click on the following link:
Add project, dataset and table in BigQuery

Make sure to select the project you have created, then click “Add data”, choose “Pin a project”, and search for your project.

Now the new project will appear and we will be ready to create a dataset:

Click the create dataset button and fill the information with the names you like:

Now you have created the equivalent of a database, so the next step will be to create the tables.
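If you prefer to do this step from code instead of the console, here is a minimal sketch using the official google-cloud-bigquery client library (this assumes you have installed it and are authenticated, for example with gcloud application default credentials; the project and dataset names are the ones I will use in the rest of the post):
from google.cloud import bigquery
# client pointing at our project
client = bigquery.Client(project='proyecto-ocr-293616')
# define the dataset (the equivalent of a database) and pick a location
dataset = bigquery.Dataset('proyecto-ocr-293616.Dataset_for_testing')
dataset.location = 'US'
# create it; exists_ok avoids an error if it already exists
client.create_dataset(dataset, exists_ok=True)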

Using Python and Pandas to insert information into a BigQuery dataset
To create a table, BigQuery gives you the tools to do it with Python, skipping the tedious process of defining the table’s structure feature by feature. Instead, you can upload a complete Pandas DataFrame with a few lines of code. First, please install the pandas-gbq library:
pip install pandas-gbq
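A quick note on authentication: by default, pandas-gbq will open a browser window and ask you to log in with your Google account the first time you use it. If you would rather use a service account (for example, on a server with no browser), a rough sketch looks like this (the key path is just a placeholder for the JSON key you download from GCP):
from google.oauth2 import service_account
import pandas_gbq
# load the service account key (placeholder path) and set it as the default
credentials = service_account.Credentials.from_service_account_file('path/to/your-key.json')
pandas_gbq.context.credentials = credentials
pandas_gbq.context.project = 'proyecto-ocr-293616'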
Now we are ready. Let’s import the CSV into a Pandas DataFrame and also import the recently installed library as “gbq”. Next we will upload it using three simple lines. The destination table is composed of two parts: the dataset’s name (the database) and the table’s name; it does not matter if you have not created the table yet, BigQuery’s API will handle that automatically. The project_id is nothing but the name of your project. The last parameter is if_exists, which, as you might imagine, handles the problem of an already existing table; you have three options here: replace the old table with the new one, append the new information to the old table, or raise an error.
import pandas as pd
from pandas.io import gbq
# the CSV path below is a placeholder; use your own waiting-times file
hp = pd.read_csv('waiting_times.csv')
hp.to_gbq(destination_table='Dataset_for_testing.waiting_times',
          project_id='proyecto-ocr-293616',
          if_exists='replace')
This single command will create the table’s schema and upload the dataset into BigQuery.
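And if later on you collect new waiting-time records, you do not need to replace the whole table; a small sketch of the append option (assuming a DataFrame called new_rows with the same columns) would be:
# add the new rows to the existing table instead of overwriting it
new_rows.to_gbq(destination_table='Dataset_for_testing.waiting_times',
                project_id='proyecto-ocr-293616',
                if_exists='append')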

Now let’s validate that the information was created correctly; for this, return to your BigQuery window and refresh the page.

As you can see, inside your dataset you now have the table, so click the preview button and let’s validate that it has the correct information:


Easy as winning a match against the Ottawa Senators! We have automatically created a table and its schema.

BigQuery allows you to perform queries (obviously) within its platform, and it is extremely easy: just type the query you wish and click the Run button.

Personally, it is nice to see this information in BigQuery, but it would be more useful to actually have it available in Pandas, so let’s see how we can query this information using Python and Jupyter notebooks.
Extract information from BigQuery using Python and Pandas
We are almost ready to perform queries; run this line to enable the BigQuery extension:
%load_ext google.cloud.bigquery
Then let’s create the query:
import pandas as pd
from pandas.io import gbq

query = """SELECT * FROM `proyecto-ocr-293616.Dataset_for_testing.waiting_times` LIMIT 1000"""
hp_sql = gbq.read_gbq(query, project_id="proyecto-ocr-293616")
Just by running these lines you will run the query and save the result in a Pandas DataFrame:

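By the way, the extension we loaded a moment ago gives you one more option: the %%bigquery cell magic can run a query and store the result directly in a DataFrame, without calling read_gbq. A minimal sketch (the variable name hp_magic is just an example, and it assumes the google-cloud-bigquery library is installed and authenticated):
%%bigquery hp_magic --project proyecto-ocr-293616
SELECT * FROM `proyecto-ocr-293616.Dataset_for_testing.waiting_times` LIMIT 1000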
And perfect! That was all for today’s post; I hope you enjoyed it. Thank you for reading and stay tuned for new posts.