Project Description:
In this blog post, we will explore the process of developing a modern data engineering project on the Uber dataset using Google Cloud Platform (GCP) and a stack of modern tools. We will cover the steps involved in building a data model, transforming the data using Python, deploying the code on a compute instance, loading the data onto Google BigQuery, and creating a final dashboard for data analysis and visualization. Let's dive in!
Tech Stack:
Our project utilizes a powerful tech stack to handle the complexities of data analysis and visualization. We leverage Python, SQL, and Google Cloud Platform (GCP) as the foundation of our solution. Python provides us with a versatile and efficient programming language, while SQL allows us to query and manipulate data effectively. GCP offers a range of services that enable us to store, process, and analyze large datasets.
To handle our data pipeline, we utilize Mage, an open-source data pipeline tool. Mage helps us streamline the process of extracting, transforming, and loading data from various sources into our target destination. With its intuitive interface and robust features, Mage simplifies the complexities of data integration.
For our data storage and analysis needs, we rely on Google BigQuery, a fully managed, serverless data warehouse. BigQuery allows us to store and query massive datasets with high performance and scalability. Its integration with GCP services and SQL-like querying capabilities make it an ideal choice for our project.
To visualize and explore our data, we utilize Looker Studio (formerly Google Data Studio), a powerful data visualization and exploration tool. Looker Studio enables us to create interactive dashboards and visually appealing reports, providing valuable insights into our data that can be easily shared with stakeholders.
You can find the code and commands used in our project on our GitHub repository:
With this robust tech stack, we are equipped to handle the complexities of data analysis and visualization, enabling us to derive meaningful insights and make data-driven decisions.
Github-
Flow-
Step 1: Creating a Bucket-
To begin our project, we create a bucket in GCP's Cloud Storage service. This bucket will serve as a storage location for our data.
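If you prefer to script this step instead of using the Cloud Console, the bucket can also be created with the google-cloud-storage Python client. This is a minimal sketch; the project ID and bucket name below are placeholders, not the ones used in the original project.

```python
from google.cloud import storage

# Placeholders -- substitute your own project ID and a globally unique bucket name.
client = storage.Client(project="your-gcp-project-id")
bucket = client.bucket("uber-data-engineering-bucket")
bucket.storage_class = "STANDARD"

# Create the bucket in a single region; a multi-region location such as "US" also works.
new_bucket = client.create_bucket(bucket, location="us-central1")
print(f"Created bucket {new_bucket.name} in {new_bucket.location}")
```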
Step 2: Creating an Instance-
Next, we create a compute instance with the desired specifications, such as an e2-standard-16 machine type (16 vCPUs and 64 GB of memory). This instance will be used to run our data engineering code.
Step 3: SSH into the Instance-
Once the instance is created, we SSH into it to access the command-line interface.
Step 4: Installing Python 3 and Required Libraries-
Inside the instance, we install Python 3 and the necessary libraries, such as pandas and mage-ai, using the pip package manager.
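As a quick, optional sanity check (not part of the original write-up), we can confirm inside the VM that both packages installed correctly:

```python
# Run inside the instance after the pip installation.
import importlib.metadata as metadata

import pandas as pd

print("pandas:", pd.__version__)
print("mage-ai:", metadata.version("mage-ai"))  # raises PackageNotFoundError if the install failed
```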
Step 5: Setting up Google Cloud-
We set up Google Cloud services, including Google Cloud BigQuery, to store and analyze our data. This involves creating a project in the GCP Console and enabling the required APIs.
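Once the project exists and the BigQuery API is enabled, a short check like the sketch below (the project ID is a placeholder) confirms that the instance can reach BigQuery with its current credentials:

```python
from google.cloud import bigquery

# Placeholder project ID; uses the VM's Application Default Credentials.
client = bigquery.Client(project="your-gcp-project-id")

datasets = list(client.list_datasets())
print(f"BigQuery is reachable; found {len(datasets)} dataset(s) in {client.project}")
```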
Step 6: Starting the Mage Project-
We start the Mage project, which serves its web UI on port 6789 by default. By browsing to the instance's external IP on that port, we can view the Mage dashboard.
Step 7: Changing Firewall Rules-
To allow access to the Mage dashboard through the external IP and port, we modify the firewall rules accordingly, opening port 6789 for the instance.
Step 8: Creating a Data Loader Pipeline-
In Mage, we create a pipeline for loading the Uber dataset. We specify the source, format, and destination of the data, starting with a data loader block like the sketch below.
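A Mage data loader block is a small Python function decorated with @data_loader. The sketch below follows Mage's standard API-loader template and reads the Uber CSV over HTTP; the URL is a placeholder for wherever your copy of the dataset lives (for example, a public object in the bucket from Step 1).

```python
import io

import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_uber_data(*args, **kwargs) -> pd.DataFrame:
    # Placeholder URL -- point this at your own copy of the Uber dataset.
    url = 'https://storage.googleapis.com/your-bucket/uber_data.csv'
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text), sep=',')


@test
def test_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'
```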
Step 9: Creating a Data Transformer Pipeline-
We create a transformer block in Mage using a generic template. However, we encounter a kernel overload error here, which needs to be resolved before the transformation can run.
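A transformer block has the same shape as the loader, using the @transformer decorator. The actual project builds out a full data model here; the version below is only an illustrative sketch (the tpep_* column names are assumed from the Uber/TLC trip schema) that parses the timestamp columns and drops duplicate rows.

```python
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@transformer
def transform(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Assumed column names from the Uber/TLC trip data; adjust to your schema.
    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])

    # Remove exact duplicate trips and use the index as a trip identifier.
    df = df.drop_duplicates().reset_index(drop=True)
    df['trip_id'] = df.index
    return df


@test
def test_output(output, *args) -> None:
    assert output is not None, 'The output is undefined'
```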
Step 10: Connecting the Pipeline to DataExporter-
Once the data is transformed, we connect the pipeline to a DataExporter component to export the data from Python to Google BigQuery.
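Mage generates a BigQuery exporter template that reads its credentials from io_config.yaml (configured in the following steps). A sketch of that block looks roughly like this; the table ID is a placeholder:

```python
from os import path

from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from pandas import DataFrame

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_data_to_big_query(df: DataFrame, **kwargs) -> None:
    # Placeholder: project.dataset.table in BigQuery.
    table_id = 'your-gcp-project-id.uber_dataset.uber_trips'
    config_path = path.join(get_repo_path(), 'io_config.yaml')
    config_profile = 'default'

    BigQuery.with_config(ConfigFileLoader(config_path, config_profile)).export(
        df,
        table_id,
        if_exists='replace',  # overwrite the table on each run
    )
```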
Step 11: Configuring io_config.yaml-
We configure the io_config.yaml file in Mage to specify the input and output configurations for our data pipeline.
Step 12: Creating a New Service Account-
In the GCP Console, we open APIs & Services and create a new service account. This account will have the necessary permissions to access GCP services.
Step 13: Downloading the Service Account Key-
We download the service account key in JSON format from the GCP Console. This key will be used for authentication when accessing GCP services.
Step 14: Copying the JSON Data into io_config.yaml-
We copy the JSON data from the service account key and paste it into the appropriate section of the io_config.yaml file. This enables authentication and access to GCP services.
Step 15: Refreshing BigQuery and Previewing Data-
Once the exporter block completes, we navigate to Google BigQuery and refresh the dataset or table where the transformed data is stored. This ensures that the latest data is available for analysis. We can also preview the data to verify its correctness.
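The same preview can be done programmatically with the BigQuery Python client; the project, dataset, and table names below are placeholders for whatever the exporter wrote to:

```python
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project-id")  # placeholder project ID

query = """
    SELECT *
    FROM `your-gcp-project-id.uber_dataset.uber_trips`  -- placeholder table
    LIMIT 10
"""
rows = client.query(query).to_dataframe()  # requires pandas and db-dtypes
print(rows.head())
```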
Step 16: Setting up Looker Studio-
To visualize and explore our data, we set up Looker Studio. This involves signing in to Looker Studio with a Google account and connecting it to our data sources.
Step 17: Connecting Data to Looker Studio Dashboard-
In Looker Studio, we connect to our BigQuery dataset and create a dashboard. We configure the necessary connections and queries to fetch the data from BigQuery and visualize it in Looker Studio.
Conclusion:
In this blog post, we have walked through the process of building an Uber data engineering project using GCP and modern tools. We covered the steps involved in creating a data model, transforming the data, deploying the code, loading the data onto BigQuery, and creating a final dashboard for data analysis and visualization. By following these steps, you can leverage the power of GCP and modern tools to build robust data engineering projects. Happy engineering!