Study Notes: Google Machine Learning Engineer Certification

Section 3: Data Preparation and Processing

Yingying Hu
12 min read · Jan 17, 2021

3.1 Data ingestion

Ingestion of various file types (e.g. csv, json, img, parquet or databases, Hadoop/Spark)

Loading Data into BigQuery (diagram)

You also have the following data pipeline options to load data into BigQuery (a minimal Beam sketch follows this list):

  • Dataflow: a fully managed service on GCP built on the open source Apache Beam API, with support for various data sources: files, databases, message-based sources, and more. With Dataflow you can transform and enrich data in both batch and streaming modes with the same code. Google provides prebuilt Dataflow templates for batch jobs.
  • Dataproc: a fully managed service on GCP for Apache Spark and Apache Hadoop. Dataproc provides a BigQuery connector that enables Spark and Hadoop applications to read data from and write data to BigQuery.
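To make the Dataflow option concrete, here is a minimal Apache Beam sketch (not one of the prebuilt templates) that reads CSV files from Cloud Storage and writes rows to BigQuery; the project, bucket, table, and schema names are made up for illustration:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_csv(line):
    # Hypothetical two-column CSV: name,score
    name, score = line.split(",")
    return {"name": name, "score": int(score)}

options = PipelineOptions(
    runner="DataflowRunner",           # or "DirectRunner" for local testing
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/scores.csv")
     | "Parse" >> beam.Map(parse_csv)
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:my_dataset.scores",
           schema="name:STRING,score:INTEGER",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```

The same pipeline code can run in batch or streaming mode; only the source and pipeline options change.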

Database migration

Database migration: Concepts and principles (Part 1, Part 2)

Database migration is a migration of data from source databases to target databases, with the goal of turning down the source database systems after the migration completes. The entire dataset, or a subset of it, is migrated.

Homogeneous migration: the source and target databases are of the same database management system from the same provider.

Heterogeneous migration: the source and target databases are of different database management systems from different providers.

Streaming data (e.g. from IoT devices)

  • BigQuery streaming ingestion allows you to stream your data into BigQuery one record at a time by using the tabledata.insertAll method. The API allows uncoordinated inserts from multiple producers.
  • One of the common patterns to ingest real-time data on Google Cloud Platform is to read messages from a Cloud Pub/Sub topic using a Cloud Dataflow pipeline that runs in streaming mode and writes to BigQuery tables after the required processing is done. Alternatively, you can go serverless with Cloud Functions for low-volume events.
  • Write streaming pipelines in Apache Spark and run on a Hadoop cluster such as Cloud Dataproc using Apache Spark BigQuery Connector.
  • Call the Streaming API in any client library to stream data to BigQuery, as sketched below.
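A minimal sketch of the streaming-insert path using the Python client library; the project, dataset, and field names are hypothetical, and insert_rows_json wraps the tabledata.insertAll endpoint:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
table_id = "my-project.iot_dataset.sensor_readings"

rows = [
    {"device_id": "sensor-1", "temperature": 21.4, "ts": "2021-01-17T12:00:00Z"},
    {"device_id": "sensor-2", "temperature": 19.8, "ts": "2021-01-17T12:00:01Z"},
]

# Streaming insert of one small batch of records.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Streaming insert errors:", errors)
```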

3.2 Data exploration (EDA)

Visualization

Exploratory Data Analysis

Visualization methods for univariate analysis, where your data has only one variable:

Visualization methods for bivariate analysis, to find whether two variables are related:

In addition to the above cases, if you have “continuous to continuous” variables, you can use sns.regplot(), which plots the data together with a naive linear regression model fit.

Some common visualization types and Python libraries (a short sketch follows this list):

  • Histogram: Display the shape and spread of continuous sample data. pandas.DataFrame.hist()
  • Scatter Plot: Reveal relationship between two variables. matplotlib.pyplot.scatter()
  • HeatMap: Use a system of color coding to show correlations among multiple variables. sns.heatmap()
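A short sketch of these plots on synthetic data (the column names are made up):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "income": rng.normal(50_000, 15_000, 500),
})

df.hist(column="age", bins=30)                        # univariate: shape and spread
plt.figure(); plt.scatter(df["age"], df["income"])    # bivariate: relationship
plt.figure(); sns.heatmap(df.corr(), annot=True)      # correlations among variables
plt.figure(); sns.regplot(x="age", y="income", data=df)  # data + naive linear fit
plt.show()
```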

Best Practice for doing Visualization on GCP:

  • Use BigQuery to explore and preprocess large amounts of data: During EDA, data is usually retrieved from BigQuery and sent to an AI Platform Notebooks instance. However, if you have a large dataset, this might not be possible. Therefore, it’s better to execute the analytics and data processing in BigQuery and use an AI Platform Notebooks instance only to retrieve and visualize the results. Similarly, we recommend that you preprocess data in BigQuery before you retrieve it for training your model. Alternatively, you can write Dataflow or TensorFlow code to transform the data before plotting it in the notebook.
  • Visualizing BigQuery data using Data Studio: The Google Data Studio BigQuery connector allows you to access data from your BigQuery tables within Google Data Studio.
  • Dataprep: Explore and transform your data interactively in a web browser with a minimal amount of code.

Statistical fundamentals at scale

  • distribution
  • minimum, maximum
  • average
  • standard deviation
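For large tables, these summary statistics are cheap to compute inside BigQuery so that only the summary row is pulled back into the notebook; a sketch with hypothetical table and column names:

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT
  MIN(trip_seconds)    AS min_value,
  MAX(trip_seconds)    AS max_value,
  AVG(trip_seconds)    AS mean_value,
  STDDEV(trip_seconds) AS std_dev,
  APPROX_QUANTILES(trip_seconds, 4) AS quartiles  -- rough view of the distribution
FROM `my-project.my_dataset.taxi_trips`
"""
summary = client.query(sql).to_dataframe()
print(summary)
```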

Scaling feature values

Scaling means converting floating-point feature values from their natural range (for example, 100 to 900) into a standard range (for example, 0 to 1 or -1 to +1). If a feature set consists of only a single feature, then scaling provides little to no practical benefit. If, however, a feature set consists of multiple features, then feature scaling provides the following benefits:

  • Helps gradient descent converge more quickly.
  • Helps avoid the “NaN trap,” in which one number in the model becomes a NaN (e.g., when a value exceeds the floating-point precision limit during training), and — due to math operations — every other number in the model also eventually becomes a NaN.
  • Helps the model learn appropriate weights for each feature. Without feature scaling, the model will pay too much attention to the features having a wider range.
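A minimal sketch of the two most common scalings on a toy feature column:

```python
import numpy as np

values = np.array([100.0, 250.0, 400.0, 900.0])   # natural range roughly 100 to 900

# Min-max scaling into [0, 1].
min_max_scaled = (values - values.min()) / (values.max() - values.min())

# Z-score standardization (mean 0, standard deviation 1).
z_scaled = (values - values.mean()) / values.std()
```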

Evaluation of data quality and feasibility

Attributes related to the Data Quality:

  • Accuracy: The information your data contains corresponds to reality
  • Consistency: No matter where you look in the database, you won’t find any contradictions in your data
  • Timeliness: Data represents reality within a reasonable period of time or in accordance with corporate standards
  • Completeness: All available elements of the data have found their way to the database
  • Uniqueness: A data record with specific details appears only once in the database
  • Orderliness: The data entered has the required format and structure
  • Auditability: Data is accessible and it’s possible to trace introduced changes

Evaluation Metric:

  • Accuracy: the ratio of data to errors
  • Consistency: the number of inconsistencies
  • Timeliness: the number of records with delayed changes
  • Completeness: the number of missing values
  • Uniqueness: the number of duplicates revealed
  • Orderliness: the ratio of data in an inappropriate format
  • Auditability: the percentage of cells where metadata about introduced changes is not accessible

Ways to improve data quality:

  • Resolve Missing Values
  • Convert the Date feature column to Datetime Format
  • Parse date/time features
  • Remove unwanted values
  • Convert categorical columns to “one-hot encodings”
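A pandas sketch of these clean-up steps on a toy frame (the column names and the -999 sentinel are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": ["2021-01-15", "2021-01-16", None],
    "category": ["a", "b", "a"],
    "value": [10.0, None, -999.0],
})

df["value"] = df["value"].replace(-999.0, np.nan)      # remove unwanted sentinel values
df["value"] = df["value"].fillna(df["value"].mean())   # resolve missing values
df["date"] = pd.to_datetime(df["date"])                # convert to datetime format
df["day_of_week"] = df["date"].dt.dayofweek            # parse date/time features
df = pd.get_dummies(df, columns=["category"])          # one-hot encode categoricals
```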

3.3 Design data pipelines

Data preprocessing for machine learning: options and recommendations

Batching and streaming data pipelines at scale

Data privacy and compliance

Filtering for personal identifiable information

Monitoring/changing deployed pipelines

Monitor Dataprep jobs and output results to BigQuery or GCS

3.4 Build data pipelines

Data validation

Data validation is required before model training to decide whether you should retrain the model or stop the execution of the pipeline. This decision can be made automatically if the pipeline identifies either of the following:

  • Data schema skews: These skews are considered anomalies in the input data, which means that the downstream pipeline steps, including data processing and model training, receive data that doesn’t comply with the expected schema. In this case, you should stop the pipeline so the data science team can investigate. The team might release a fix or an update to the pipeline to handle these changes in the schema. Schema skews include receiving unexpected features, not receiving all the expected features, or receiving features with unexpected values.
  • Data values skews: These skews are significant changes in the statistical properties of data, which means that data patterns are changing, and you need to trigger a retraining of the model to capture these changes.
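One way to automate these checks (not the only option, and not named in these notes) is TensorFlow Data Validation; a hedged sketch with hypothetical file paths:

```python
import tensorflow_data_validation as tfdv

# Infer the expected schema from the training data.
train_stats = tfdv.generate_statistics_from_csv("train.csv")
schema = tfdv.infer_schema(statistics=train_stats)

# Validate newly arrived data against that schema.
new_stats = tfdv.generate_statistics_from_csv("new_data.csv")
anomalies = tfdv.validate_statistics(statistics=new_stats, schema=schema)

# Unexpected features, missing features, or out-of-range values show up as
# anomalies (schema skew); stop the pipeline so the team can investigate.
if anomalies.anomaly_info:
    raise ValueError(f"Schema skew detected: {anomalies.anomaly_info}")
```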

Handling missing data

How to Handle Missing Data in Machine Learning: 5 Techniques

Deductive Imputation: This is an imputation rule defined by logical reasoning, as opposed to a statistical rule. It requires no inference, and the true value can be assessed. But it can be time-consuming or might require specific coding.

Mean/Median/Mode Imputation: Any missing values in a given column are replaced with the mean (or median, or mode) of that column. This is the easiest to implement and comprehend.

Regression Imputation: This approach replaces missing values with a predicted value based on a regression line.

Stochastic Regression Imputation: This aims to preserve the variability of the data. To achieve this, we add an error (residual) term to each predicted value, drawn from a normal distribution with a mean of zero and a variance equal to the residual variance of the regression model.

Multiply-Stochastic Regression Imputation: This is similar to stochastic regression imputation, but it is repeated for several iterations and the final imputed value is the mean across those iterations. It allows for a better estimate of the true variance than singly-stochastic regression imputation, but takes a bit more effort to implement.
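A scikit-learn sketch of two of these techniques on synthetic data; IterativeImputer is shown here as one possible implementation of regression-style imputation:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50_000, 62_000, np.nan, 48_000]})

# Mean imputation (use strategy="median" or "most_frequent" for the other variants).
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(df), columns=df.columns)

# Regression-style imputation: each feature is modeled from the others.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
regression_imputed = pd.DataFrame(
    IterativeImputer(random_state=0).fit_transform(df), columns=df.columns)
```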

Handling outliers

How to Make Your Machine Learning Models Robust to Outliers

Common methods for detecting outliers

  • Box-plot & scatter plot (this method is not recommended for high dimensional data where the power of visualization fails.)
  • Cook’s Distance (this method is used only for linear regression and therefore has a limited application)
  • Z-Score (This method assumes that the variable has a Gaussian distribution)

Common data-based methods for handling outliers

  • Dropping the outlier when you’re sure that it is a measurement error, or when the number of outliers is very small compared with your data size.
  • Clipping/Winsorizing: setting the extreme values of an attribute to some specified value.
  • Log-Scale Transformation: It’s often preferred when the response variable follows exponential distribution or is right-skewed.
  • Binning: dividing a list of continuous variables into groups

Common model-based methods for handling outliers

  • Tree-based methods like Random Forest and Gradient Boosting are less impacted by outliers.
  • Avoid RMSE as a loss function; squaring the errors amplifies the influence of outliers, so MAE is a more robust choice.
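A small sketch of z-score detection plus clipping and a log transform on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(50, 5, 100), [150.0]))  # one injected outlier

# Z-score detection (assumes a roughly Gaussian variable).
z_scores = (values - values.mean()) / values.std()
outliers = values[z_scores.abs() > 3]

# Clipping / winsorizing: cap extremes at the 1st and 99th percentiles.
clipped = values.clip(lower=values.quantile(0.01), upper=values.quantile(0.99))

# Log-scale transformation for right-skewed data (log1p handles zeros).
log_scaled = np.log1p(values)
```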

Managing large samples (TFRecords)

If you plan to train a TensorFlow model, create a TFRecord file. A TFRecord file contains a sequence of records, where each record is encoded as a byte string. TFRecord files are optimized for training TensorFlow models. You can use TensorFlow Transform (TFT) to prepare the data as TFRecords for training TensorFlow models. TFT is implemented using Apache Beam and runs at scale on Dataflow.

The tf.data.experimental.TFRecordWriter function writes a dataset to a TFRecord file; read the file back using the tf.data.TFRecordDataset class.
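A minimal sketch of writing and reading a TFRecord file with toy records:

```python
import tensorflow as tf

def serialize(feature_value, label):
    # Encode one record as a tf.train.Example proto (a byte string).
    example = tf.train.Example(features=tf.train.Features(feature={
        "feature": tf.train.Feature(float_list=tf.train.FloatList(value=[feature_value])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    return example.SerializeToString()

records = tf.data.Dataset.from_tensor_slices([serialize(0.5, 1), serialize(1.2, 0)])

# Write the dataset of byte strings to a TFRecord file.
tf.data.experimental.TFRecordWriter("sample.tfrecord").write(records)

# Read it back and parse each serialized Example.
feature_spec = {
    "feature": tf.io.FixedLenFeature([], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}
parsed = tf.data.TFRecordDataset("sample.tfrecord").map(
    lambda x: tf.io.parse_single_example(x, feature_spec))
```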

Transformations (TensorFlow Transform)

Data preprocessing for machine learning using TensorFlow Transform

TensorFlow Transform (tf.Transform) is a hybrid of Apache Beam and TensorFlow. It uses Dataflow during training, but only TensorFlow during prediction, since TensorFlow is good for on-demand, on-the-fly processing.

Data preprocessing occurs in two phases: analyze and transform. Beam is good at the analyze phase, and TensorFlow is better suited to on-the-fly transformation of the input data.

tf.transform provides two PTransforms:

AnalyzeAndTransformDataset (Analysis phase)

Executed in Beam to create the preprocessed training dataset. It produces two artifacts:

transform_fn. A function that contains the computed stats from the analyze phase and the transformation logic (which uses the stats) as instance-level operations. The transform_fn is saved and attached to the model serving_input_fn, which makes it possible to apply the same transformation to online prediction data points.

transform_metadata. An object that describes the expected schema of the data after transformation.

TransformDataset (Transform phase)

Executed in Beam to create the training and evaluation datasets, and executed in TensorFlow during prediction.
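A minimal preprocessing_fn sketch (the feature names are hypothetical): the full-pass statistics and vocabulary are computed by Beam during the analyze phase, while the returned operations are applied instance by instance by TensorFlow at training and serving time.

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    return {
        # Full-pass mean/stddev computed in the analyze phase (Beam),
        # applied per instance in the transform phase (TensorFlow).
        "trip_seconds_scaled": tft.scale_to_z_score(inputs["trip_seconds"]),
        # Vocabulary computed during analysis, applied as an integer lookup.
        "payment_type_id": tft.compute_and_apply_vocabulary(inputs["payment_type"]),
    }
```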

3.5 Feature engineering

Data leakage and augmentation

Data leakage is when you use input features during training that “leak” information about the target you are trying to predict but that is unavailable when the model is actually served. Leakage can often be detected when a feature that is highly correlated with the target column is included as one of the input features. To prevent data leakage, make sure you understand what each column means and whether you should use it as a feature before including it. Also, check the correlations in the Train tab; high correlations should be flagged for review.

Data augmentation is a technique used to increase the amount of data by adding slightly modified copies of already existing data or newly created synthetic data from existing data. It acts as a regularizer and helps reduce overfitting when training a machine learning model.
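For image data, a common augmentation sketch with tf.image (assuming a tf.data dataset of 32x32x3 images with labels):

```python
import tensorflow as tf

def augment(image, label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.resize_with_crop_or_pad(image, 36, 36)   # pad, then random crop
    image = tf.image.random_crop(image, size=[32, 32, 3])
    return image, label

# Hypothetical usage: train_ds is a tf.data.Dataset of (image, label) pairs.
# augmented_ds = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```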

Encoding structured data types

Raw data must be mapped into numerical feature vectors.

Demonstrate several types of feature columns

Numeric columns

feature_column.numeric_column(source_column) represents real-valued features. When using this column, your model will receive the column value from the dataframe unchanged.

Bucketized columns

Instead of feeding a number directly into the model, you can use feature_column.bucketized_column(source_column, boundaries) to split its value into different categories (buckets) based on numerical ranges.

Categorical columns

We cannot feed strings directly to a model. Instead, we must first map them to numeric values. The categorical vocabulary columns provide a way to represent strings as a one-hot vector (much like the bucketized columns above). The vocabulary can be passed as a list using feature_column.categorical_column_with_vocabulary_list(key, vocabulary_list), or loaded from a file using feature_column.categorical_column_with_vocabulary_file(key, vocabulary_file).

Use categorical_column_with_identity(key, num_buckets, default_value=None) when your inputs are integers in the range [0, num_buckets) and you want to use the input value itself as the categorical ID. Values outside this range map to default_value if it is specified; otherwise the lookup fails.

Embedding columns

Suppose instead of having just a few possible strings, we have thousands (or more) of values per category. For a number of reasons, as the number of categories grows large, it becomes infeasible to train a neural network using one-hot encodings. We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, a feature_column.embedding_column represents that data as a lower-dimensional, dense vector in which each cell can contain any number, not just 0 or 1.

Hashed feature columns

Another way to represent a categorical column with a large number of values is to use a feature_column.categorical_column_with_hash_bucket. This feature column calculates a hash value of the input, then selects one of the hash_bucket_size buckets to encode a string. When using this column, you do not need to provide the vocabulary, and you can choose to make the number of hash_buckets significantly smaller than the number of actual categories to save space. An important downside of this technique is that there may be collisions in which different strings are mapped to the same bucket. In practice, this can work well for some datasets regardless.
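A consolidated sketch of the feature column types above, using hypothetical column names from a heart-disease-style dataset:

```python
import tensorflow as tf
from tensorflow import feature_column

age = feature_column.numeric_column("age")                           # numeric
age_buckets = feature_column.bucketized_column(                      # bucketized
    age, boundaries=[18, 25, 40, 65])
thal = feature_column.categorical_column_with_vocabulary_list(       # categorical
    "thal", ["fixed", "normal", "reversible"])
thal_one_hot = feature_column.indicator_column(thal)                 # one-hot wrapper
thal_embedding = feature_column.embedding_column(thal, dimension=8)  # dense embedding
thal_hashed = feature_column.categorical_column_with_hash_bucket(    # hashed
    "thal", hash_bucket_size=100)

# Categorical and hashed columns must be wrapped (indicator or embedding)
# before they can feed a Keras DenseFeatures layer.
feature_layer = tf.keras.layers.DenseFeatures(
    [age, age_buckets, thal_one_hot, thal_embedding,
     feature_column.indicator_column(thal_hashed)])
```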

Feature selection

A good feature should:

  1. be related to the objective
  2. be known at prediction time
  3. be numeric with a meaningful magnitude
  4. have enough examples
  5. bring human insight to the problem

A good rule of thumb: you should have at least five examples of any value before using it in your model.

Class imbalance

A classification data set with skewed class proportions is called imbalanced. Classes that make up a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes.

One way to handle imbalanced data is to downsample and upweight the majority class (sketched after the list below).

  • Downsampling (in this context) means training on a disproportionately low subset of the majority class examples.
  • Upweighting means adding an example weight to the downsampled class equal to the factor by which you downsampled.
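A pandas sketch of downsampling plus upweighting on a synthetic, heavily imbalanced dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "label": (rng.random(1000) < 0.05).astype(int),  # ~5% positive (minority) class
})

downsample_factor = 10
majority = df[df["label"] == 0].sample(frac=1 / downsample_factor, random_state=42)
minority = df[df["label"] == 1]

balanced = pd.concat([majority, minority])
# Upweight the downsampled class by the same factor so the loss still reflects
# the true class distribution.
balanced["example_weight"] = np.where(balanced["label"] == 0, downsample_factor, 1.0)
```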

Feature crosses

Combining features into a single feature enables a model to learn a separate weight for each combination of feature values. Feature crosses bring non-linear inputs into a linear learner. A feature cross memorizes the input space, whereas the goal of ML is generalization; memorization works when you have lots of data, since it essentially learns the mean response for each cell of the crossed input space. Feature crosses are only possible with categorical or discrete (bucketized) variables. Some benefits of feature crosses are:

  • Feature crosses + massive data is an efficient way for learning highly complex spaces
  • Feature crosses allow a linear model to memorize large datasets
  • Optimizing linear models is a convex problem
  • Feature crosses, as a preprocessor, make neural networks converge a lot quicker

TensorFlow has the feature_column.crossed_column(keys, hash_bucket_size) method for implementing feature crosses.
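A small sketch, reusing hypothetical columns like those in the earlier feature-column example:

```python
from tensorflow import feature_column

age = feature_column.numeric_column("age")
age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 40, 65])
thal = feature_column.categorical_column_with_vocabulary_list(
    "thal", ["fixed", "normal", "reversible"])

# Cross the bucketized and categorical columns; the cross is hashed into
# hash_bucket_size buckets, so collisions are possible.
age_x_thal = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)
age_x_thal_one_hot = feature_column.indicator_column(age_x_thal)
```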

Overcrossing: overuse of feature crosses

Regularization for Sparsity: L1 Regularization

Sparse vectors often contain many dimensions. Creating a feature cross results in even more dimensions. Given such high-dimensional feature vectors, model size may become huge and require huge amounts of RAM.

In a high-dimensional sparse vector, it would be nice to encourage weights to drop to exactly 0 where possible. A weight of exactly 0 essentially removes the corresponding feature from the model. Zeroing out features will save RAM and may reduce noise in the model.

L1 Regularization: Encourages sparsity and drives many coefficients to exactly zero, unlike L2 regularization, which makes the weights small but won’t actually drive them to zero.
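A minimal Keras sketch of applying L1 regularization to a linear model over a (hypothetically) wide, sparse crossed-feature input:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    # The L1 penalty on the kernel pushes many weights to exactly zero,
    # effectively pruning unused crossed features.
    tf.keras.layers.Dense(
        1,
        activation="sigmoid",
        kernel_regularizer=tf.keras.regularizers.l1(0.01)),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```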
