menu
arrow_back

Dataprep: Qwik Start

search favorite_border
Add to favorites
help
Help
—/100

Checkpoints

arrow_forward

Create a Cloud Storage bucket

Initialize Cloud Dataprep

Dataprep: Qwik Start

1 hour Free

GSP105

Google Cloud Self-Paced Labs

Overview

Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing data for analysis. Cloud Dataprep is serverless and works at any scale. There is no infrastructure to deploy or manage. Easy data preparation with clicks and no code!

In this lab you use Dataprep to manipulate a dataset. You import datasets, correct mismatched data, transform data, and join data. If this is new to you, you'll know what it all is by the end of this lab.

Setup and Requirements

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.

This Qwiklabs hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.

What you need

To complete this lab, you need:

  • Access to a standard internet browser (Chrome browser recommended).
  • Time to complete the lab.

Note: If you already have your own personal Google Cloud account or project, do not use it for this lab.

Note: If you are using a Pixelbook, open an Incognito window to run this lab.

How to start your lab and sign in to the Google Cloud Console

  1. Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is a panel populated with the temporary credentials that you must use for this lab.

    Open Google Console

  2. Copy the username, and then click Open Google Console. The lab spins up resources, and then opens another tab that shows the Sign in page.

    Sign in

    Tip: Open the tabs in separate windows, side-by-side.

  3. In the Sign in page, paste the username that you copied from the Connection Details panel. Then copy and paste the password.

    Important: You must use the credentials from the Connection Details panel. Do not use your Qwiklabs credentials. If you have your own Google Cloud account, do not use it for this lab (avoids incurring charges).

  4. Click through the subsequent pages:

    • Accept the terms and conditions.
    • Do not add recovery options or two-factor authentication (because this is a temporary account).
    • Do not sign up for free trials.

After a few moments, the Cloud Console opens in this tab.

Check project permissions

Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu (nav-menu.png), click IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Home.

check-sa.png

If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.

  • In the Google Cloud console, on the Navigation menu, click Home.

  • Copy the project number (e.g. 729328892908).

  • On the Navigation menu, click IAM & Admin > IAM.

  • At the top of the IAM page, click Add.

  • For New members, type:

{project-number}-compute@developer.gserviceaccount.com

Replace {project-number} with your project number.

  • For Role, select Project (or Basic) > Editor. Click Save.

add-sa.png

Create a Cloud Storage bucket in your project

  1. In the Cloud Platform Console, select Navigation menu > Storage > Browser. nav_storage.png

  2. Click Create bucket.

  3. In the Create a bucket dialog, Name the bucket a unique name. Skip the other steps to leave those settings at their default value.

my-bucket.png

  1. Click Create.

You created your bucket. Remember the bucket name for later steps.

Test Completed Task

Click Check my progress to verify your performed task. If you have successfully created Cloud Storage bucket, you see an assessment score.

Create a Cloud Storage bucket

Initialize Cloud Dataprep

  1. Select Navigation menu > Dataprep.
  2. Check to accept the Google Dataprep Terms of Service, then click ACCEPT.
  3. Check to authorize sharing your account information with Trifacta, then click Agree and Continue.
  4. Click Allow to allow Trifacta to access project data.
  5. Click your Google Cloud username to sign in to Cloud Dataprep by Trifacta. Your Google Cloud username is Username in the left panel in your lab.
  6. Click Allow to grant Cloud Dataprep access to your Google Cloud lab account.
  7. Check to agree to Trifacta Terms of Service, and then click Accept.
  8. Click Continue on the "First time set up" screen to create the default storage location.

937d6677b5e75d9d.png

Dataprep opens in a new browser tab. In the Welcome page, click Hide tour in the top right.

Test Completed Task

Click Check my progress to verify your performed task. If you have successfully initialized Cloud Dataprep with default storage location, you see an assessment score.

Initialize Cloud Dataprep

Create a flow

Cloud Dataprep uses a flow workspace to access and manipulate datasets.

  1. Click Create Flow in the top right:

dataprep_create_flow.png

  1. Name and describe the flow. Since this lab uses 2016 data from the United States Federal Elections Commission 2016, name the flow "FEC-2016", and the describe the flow as "United States Federal Elections Commission 2016".

656369ed7d46b2dc.png

  1. Click Create.

The FEC-2016 flow page opens. You can scroll through the "What's a flow?" slides to get an overview of what you'll be doing next, or click Don't show me any helpers to skip it.

Import datasets

In this section you import and add data to the FEC-2016 flow.

  1. Click the Add Icon in the Dataset box.

dataset.png

  1. In the Add Datasets to Flow dialog box, select Import Datasets.

import-dataset.png

  1. In the left menu pane, select GCS to import datasets from Cloud Storage, then click on the pencil to edit the file path.

dataprep_choose_file1.png

  1. Type gs://spls/gsp105 in the Choose a file or folder text box, then click Go.

You may have to widen the browser window to see the Go and Cancel buttons.

  1. Click us-fec/.

  2. Click the + icon next to cn-2016.txt to create a dataset shown in the right pane. Click on the title in the dataset and rename it "Candidate Master 2016".

  3. In the same way add the itcont-2016.txt dataset, and rename it "Campaign Contributions 2016".

  4. After both datasets are listed in the right pane, click Import & Add to Flow.

4e126e9b671a7722.png

You see both datasets listed as a flow.

FEC-2016.png

Prep the candidate file

  1. By default, the Candidate Master 2016 dataset is selected. In the right pane, click Edit Recipe.

dataprep-recipe.png

The Candidate Master 2016 Transformer page opens in the grid view.

transformer-page.png

The Transformer page is where you build your transformation recipe and see the results applied to the sample. When you are satisfied with what you see, execute the job against your dataset.

Each of the column heads have a Name and value that specify the data type. Data types are shown when you click the flag icon:

datatypes.png

Also, when you click the flag option, a Details panel opens on the right:

details_panel.png

Click X in the top right of the Details panel to close the Details panel.

In the following steps, you explore data in the grid view and apply transformation steps to your recipe.

  1. Column5 provides data from 1990-2064. Widen column5 (like you would on a spreadsheet) to separate each year. Click to select the tallest bin, which represents year 2015.

dataprep-column5.png

This creates a step where these values are selected.

  1. In the Suggestions panel on the right, in the Keep rows section, click Add to add this step your recipe.

recipe-step.png

The Recipe panel on the right now has the following step:

Keep rows where(DATE(2015, 1, 1) <= column5) && (column5 < DATE(2020, 1, 1))

  1. In Column6 (State), hover over and click on the mismatched (red) portion of the header to select the mismatched rows.

3cdb3803ef49636b.png

Scroll down to find the mismatched values and notice how most of these records have the value "P" in column7, and "US" in column6. The mismatch occurs because column6 is marked as a "State" column (indicated by the flag icon), but there are non-state (such as "US") values.

  1. To correct the mismatch, click X in the top of the Suggestions panel to cancel the transformation, then click on the flag icon in Column6 and change it to a "String" column.

84cfd42fcab33662.png

There is no longer a mismatch and the column marker is now green.

  1. Filter on just the presidential candidates, which are those records that have the value "P" in column7. In the histogram for column7, hover over the two bins to see which is "H" and which is "P". Click the "P" bin.

328626b128b93f1.png

  1. In the right Suggestions panel, click Add to accept the step to the recipe.

Dataprep_row_7

Join the Contributions file

On the Join page, you can add your current dataset to another dataset or recipe based on information that is common to both datasets.

Before you join the Contributions file to the Candidates file, clean up the Contributions file.

  1. Click on FEC-2016 (the dataset selector) at the top of the grid view page.

dataprep_fec2016.png

  1. Click to select the grayed out Campaign Contributions.

  2. In the right pane, click Add > Recipe, then click Edit Recipe.

  3. Click the recipe icon at the top right of the page, then click Add New Step.

dataprep_2nd_recipe.png

Remove extra delimiters in the dataset.

  1. Insert the following Wrangle language command in the Search box:

replacepatterns col: * with: '' on: `{start}"|"{end}` global: true

The Transformation Builder parses the Wrangle command and populates the Find and Replace transformation fields.

recipe.png

  1. Click Add to add the transform to the recipe.

  2. Add another new step to the recipe. Click New Step, then type "Join" in the Search box.

Dataprep_join

  1. Click Join datasets to open the Joins page.

  2. Click on "Candidate Master 2016" to join with Campaign Contributions 2016-2, then Accept in the bottom right.

dataprep-candidate-master.png

  1. Hover in the Join keys section, then click on the pencil (Edit icon).

edit_join.png

Dataprep infers common keys. There are many common values that Dataprep suggests as Join Keys.

  1. In the Add Key panel, in the Suggested join keys section, click column2 = column11.

join_conditions.png

  1. Click Save and Continue.

Columns 2 and 11 open for your review.

  1. Click Next, then check the checkbox to the left of the "Columns" label to add all columns of both datasets to the joined dataset.

type_checkbox.png

  1. Click Review, and then Add to Recipe to return to the grid view.

Summary of data

Generate a useful summary by aggregating, averaging, and counting the contributions in Column 16 and grouping the candidates by IDs, names, and party affiliation in Columns 2, 24, 8 respectively.

  1. At the top of the Recipe panel on the left, click on New Step and enter the following formula in the Transformation search box to preview the aggregated data.

pivot value:sum(column16),average(column16),countif(column16 > 0) group: column2,column24,column8

An initial sample of the joined and aggregated data is displayed, representing a summary table of US presidential candidates and their 2016 campaign contribution metrics.

6f4fba772aa0a141.png

  1. Click Add to open a summary table of major US presidential candidates and their 2016 campaign contribution metrics.

summary_table.png

Rename columns

You can make the data easier to interpret by renaming the columns. Add each of the renaming and rounding steps individually to the recipe by clicking New Step, then enter:

rename type: manual mapping: [column24,'Candidate_Name'], [column2,'Candidate_ID'],[column8,'Party_Affiliation'], [sum_column16,'Total_Contribution_Sum'], [average_column16,'Average_Contribution_Sum'], [countif,'Number_of_Contributions']

Then click Add.

Add in this last New Step to round the Average Contribution amount:

set col: Average_Contribution_Sum value: round(Average_Contribution_Sum)

Then click Add.

Your results look something like this:

2b3dc976f95952a5.png

Congratulations!

You used Dataprep to add a dataset and created recipes to wrangle the data into meaningful results.

Next Steps / Learn More

This lab is part of a series of labs called Qwik Starts. These labs are designed to give you a little taste of the many features available with Google Cloud. Search for "Qwik Starts" in the lab catalog to find the next lab you'd like to take!

Google Cloud Training & Certification

...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.

Manual Last Updated December 2, 2020
Lab Last Tested November 25, 2020

Copyright 2021 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.