Study Notes: Google Machine Learning Engineer Certification

Section 2: ML Solution Architecture

Yingying Hu
4 min read · Jan 18, 2021

2.1 Design reliable, scalable, highly available ML solutions

Optimizing data use and storage

Cloud Storage services:

  • Standard Cloud Storage provides maximum availability.
  • Cloud Storage Nearline provides low-cost archival storage ideal for data accessed less than once a month.
  • Cloud Storage Coldline provides even lower-cost archival storage ideal for data accessed less than once a quarter.
  • Cloud Storage Archive provides the lowest-cost archival storage for backup and disaster recovery ideal for data you intend to access less than once a year.
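As a sketch, a bucket is assigned a storage class at creation, and lifecycle rules can migrate objects to colder tiers as they age. A minimal example with the google-cloud-storage Python client (the project ID, bucket name, and 90-day threshold are hypothetical placeholders):

```python
from google.cloud import storage

# Hypothetical project and bucket names.
client = storage.Client(project="my-project")
bucket = storage.Bucket(client, name="ml-training-archive")
bucket.storage_class = "NEARLINE"  # STANDARD, NEARLINE, COLDLINE, or ARCHIVE
client.create_bucket(bucket, location="US")

# Lifecycle rule: move objects to Coldline once they are 90 days old.
bucket.add_lifecycle_set_storage_class_rule("COLDLINE", age=90)
bucket.patch()
```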

Data connections

  • On-premises sources
  • Events streaming from IoT
  • GCS
  • Cloud SQL: a fully-managed database service that helps you set up, maintain, manage, and administer your relational databases on Google Cloud Platform. You can use Cloud SQL with MySQL, PostgreSQL, or SQL Server
  • BigQuery
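For example, training data that already lives in BigQuery can be pulled straight into a pipeline with the google-cloud-bigquery client. A minimal sketch (the project, dataset, and table names are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

query = """
    SELECT feature_a, feature_b, label
    FROM `my-project.ml_dataset.training_table`  -- hypothetical table
    LIMIT 10000
"""
# to_dataframe() requires the pandas and pyarrow extras to be installed.
df = client.query(query).to_dataframe()
print(df.head())
```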

Automation of data preparation and model training/deployment

Kubeflow, TFX, Dataflow, Pub/Sub, BigQuery, and GCS are likely to be core components.
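As one illustration, here is a minimal TFX pipeline sketch that ingests CSV data and validates it, run locally with LocalDagRunner (the paths are hypothetical placeholders; on GCP the same pipeline would typically be orchestrated with Kubeflow Pipelines):

```python
from tfx import v1 as tfx

def create_pipeline(pipeline_name: str, pipeline_root: str, data_root: str):
    # Ingest CSV files into TFRecord examples.
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)
    # Compute dataset statistics for validation.
    statistics_gen = tfx.components.StatisticsGen(
        examples=example_gen.outputs["examples"])
    # Infer a schema from those statistics.
    schema_gen = tfx.components.SchemaGen(
        statistics=statistics_gen.outputs["statistics"])
    return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=[example_gen, statistics_gen, schema_gen])

tfx.orchestration.LocalDagRunner().run(
    create_pipeline("demo-pipeline", "/tmp/tfx-root", "/tmp/data"))
```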

SDLC best practices

Software Development Life Cycle:

  • Source control changes
  • Reproducible builds by automation
  • Reproducible deployments by automation
  • Version models

2.2 Choose appropriate Google Cloud software components

A variety of component types are involved, including data collection and data management.

Exploration/analysis

  • AI Platform Notebooks
  • Dataprep: Dataprep’s main purpose is to let data analysts explore, clean, and prepare data for analysis. It provides tools to format, filter, and run macros against data. It uses a visual interface to cleanse and enrich multiple data sources before loading them to a Google Cloud Storage data lake or BigQuery data warehouse.

Feature engineering

  • BigQuery
  • Dataflow
  • TensorFlow
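On the TensorFlow side, a small feature-engineering sketch using the tf.feature_column API that was current at the time of writing (the feature names are hypothetical):

```python
import tensorflow as tf

# Hypothetical raw tabular features.
age = tf.feature_column.numeric_column("age")
# Bucketize a numeric feature into ranges.
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 35, 50, 65])
# Hash a high-cardinality categorical feature, then embed it.
zip3 = tf.feature_column.categorical_column_with_hash_bucket(
    "zip3", hash_bucket_size=1000)
zip3_embedding = tf.feature_column.embedding_column(zip3, dimension=8)

# A Keras layer that turns raw feature dicts into a dense input tensor.
feature_layer = tf.keras.layers.DenseFeatures([age_buckets, zip3_embedding])
```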

Logging/management

  • Cloud Logging: a fully managed service that performs at scale and that can ingest application and system log data. This lets you analyze and export selected logs to long-term storage in real time.
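For instance, a training job can write structured log entries with the google-cloud-logging client; a minimal sketch (the log name and fields are hypothetical):

```python
from google.cloud import logging

client = logging.Client()
logger = client.logger("ml-training")  # hypothetical log name

# Structured entries are easier to filter and export than plain strings.
logger.log_struct(
    {"event": "epoch_end", "epoch": 3, "val_accuracy": 0.91},
    severity="INFO",
)
```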

Automation

  • Cloud Build
  • TensorFlow Extended (TFX)
  • KubeFlow
  • AI Platform

Monitoring

  • Cloud Monitoring
  • Tensorboard
  • AI Platform (Continuous Evaluation)
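As a sketch of the TensorBoard piece, Keras can stream training metrics to a log directory via a callback (toy data; the log directory is a hypothetical placeholder):

```python
import numpy as np
import tensorflow as tf

# Toy dataset and model purely for illustration.
x = np.random.rand(200, 4).astype("float32")
y = (x.sum(axis=1) > 2).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Stream loss/metric curves to TensorBoard.
tb = tf.keras.callbacks.TensorBoard(log_dir="logs/demo")
model.fit(x, y, epochs=5, callbacks=[tb])
# Inspect with: tensorboard --logdir logs/demo
```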

Serving

2.3 Choose appropriate Google Cloud hardware components

Selection of quotas and compute/accelerators with components

TPUs (Tensor Processing Units) are specialized ML hardware. All Google Cloud projects are allocated a default AI Platform Training quota for at least one Cloud TPU. Quota is allocated in units of 8 TPU cores per Cloud TPU.
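A minimal sketch of attaching TensorFlow 2 to a Cloud TPU, assuming a TPU node already exists ("my-tpu" is a hypothetical node name):

```python
import tensorflow as tf

# "my-tpu" is a hypothetical TPU node name.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

strategy = tf.distribute.TPUStrategy(resolver)
print("TPU cores:", strategy.num_replicas_in_sync)  # typically 8 per Cloud TPU

# Build the model under the strategy scope so variables live on the TPU.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
```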

2.4 Design architecture that complies with regulatory and security concerns

Building secure ML systems

Privacy implications of data usage

Identifying sensitive data

Sensitive data in columns: Sensitive data can be restricted to specific columns in structured datasets. In this case, you identify which columns contain sensitive data, decide how to secure them, and document these decisions.

Sensitive data in unstructured text-based datasets: Sensitive data can be part of an unstructured text-based dataset, and it can often be detected using known patterns. (Tool: Regular expression → Cloud Data Loss Prevention API)
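A minimal Cloud DLP inspection sketch with the google-cloud-dlp client (the project ID and sample text are hypothetical):

```python
import google.cloud.dlp_v2 as dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.inspect_content(
    request={
        "parent": "projects/my-project",  # hypothetical project ID
        "inspect_config": {
            "info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}],
            "include_quote": True,
        },
        "item": {"value": "Contact jane.doe@example.com or 555-0100."},
    }
)

for finding in response.result.findings:
    print(finding.info_type.name, "->", finding.quote)
```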

Sensitive data in free-form unstructured data: Sensitive data can exist in free-form unstructured data such as text reports, audio recordings, photographs, or scanned receipts. There are many tools available to identify it:

  • Free-text documents: use the Cloud Natural Language API to identify entities, email addresses, and other sensitive data.
  • Audio recordings: Cloud Speech API + Cloud Natural Language API.
  • Images: use the Cloud Vision API to extract raw text from the image and isolate the location of that text within the image (see the sketch below).
  • Videos: parse each video into individual picture frames and treat them as image files, or use Cloud Video Intelligence API + Cloud Speech API.
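For the image case, a minimal Cloud Vision sketch that extracts text and its position (the file path is a hypothetical placeholder):

```python
from google.cloud import vision

client = vision.ImageAnnotatorClient()

# "receipt.jpg" is a hypothetical local file.
with open("receipt.jpg", "rb") as f:
    image = vision.Image(content=f.read())

response = client.text_detection(image=image)
# The first annotation is the full text; the rest are individual words
# with bounding polygons locating them within the image.
for annotation in response.text_annotations:
    print(annotation.description, list(annotation.bounding_poly.vertices))
```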

Sensitive data in a combination of fields: Sensitive data can exist as a combination of fields, or manifest from a trend in a protected field over time.

Sensitive data in unstructured content: Sensitive data sometimes exists in unstructured content because of embedded contextual information.

For model development, it is often effective to take a subsample of this data, have it scrubbed and reviewed by a trusted person, and make that subsample available to the modeling team. You can then use security restrictions and software automation to run the full dataset through the production model-training process.

Protecting sensitive data

Removing sensitive data

  • For structured datasets: create a view that doesn’t provide access to the columns in question.
  • For unstructured content identifiable by known patterns: use Cloud DLP to automatically remove the sensitive data and replace it with a generic string (see the sketch after this list).
  • For unstructured free-form data: extend the GCP API tools to identify the sensitive data, then mask or remove it.
  • For combinations of fields: incorporate automated tools or manual data-analysis steps to quantify the risk posed by each column; your data engineers can then make informed decisions about retaining or removing any relevant column.
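The Cloud DLP path above can be sketched as a de-identification request that swaps detected values for a generic string (the project ID and sample text are hypothetical):

```python
import google.cloud.dlp_v2 as dlp_v2

client = dlp_v2.DlpServiceClient()

response = client.deidentify_content(
    request={
        "parent": "projects/my-project",  # hypothetical project ID
        "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"}]},
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [{
                    "primitive_transformation": {
                        "replace_config": {
                            "new_value": {"string_value": "[REDACTED]"}
                        }
                    }
                }]
            }
        },
        "item": {"value": "Write to jane.doe@example.com for details."},
    }
)

print(response.item.value)  # Write to [REDACTED] for details.
```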

Masking sensitive data

  • Use a substitution cipher to replace all occurrences of a plain-text identifier with its hashed and/or encrypted value (a sketch follows this list).
  • Tokenization: push the encrypted values into a separate database, keeping only opaque tokens alongside the data.
  • PCA and other dimension-reducing techniques: PCA processing reduces the detail of the data distribution, trading accuracy for security.
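A minimal sketch of the substitution-cipher idea, using a keyed hash (HMAC-SHA256) so the same identifier always maps to the same token; the key shown is a hypothetical placeholder that would live in a secret manager, not in code:

```python
import hashlib
import hmac

# Hypothetical key; store it separately from the data (e.g., Secret Manager).
SECRET_KEY = b"replace-with-a-managed-secret"

def mask_identifier(value: str) -> str:
    """Deterministically pseudonymize an identifier with HMAC-SHA256."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

print(mask_identifier("jane.doe@example.com"))
```

Using a keyed hash rather than a plain hash prevents an attacker from reversing tokens with a precomputed dictionary of common identifiers.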

Coarsening sensitive data: decrease the precision or granularity of data (a code sketch follows the list below). This method is suited for:

  • Locations: reduce coordinates to a coarser-grained area.
  • Zip Codes: use just the first three digits (“zip3”)
  • Numeric quantities: Numbers can be binned to make them less likely to identify an individual.
  • IP addresses: zero out the last octet of IPv4 addresses (the last 80 bits if using IPv6).
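A small sketch of these coarsening rules in plain Python (the bin width and example values are arbitrary):

```python
import ipaddress

def zip3(zip_code: str) -> str:
    # Keep only the first three digits of a zip code.
    return zip_code[:3]

def bin_value(value: int, width: int = 10) -> str:
    # Replace an exact number with the range it falls into.
    low = (value // width) * width
    return f"{low}-{low + width - 1}"

def coarsen_ip(ip: str) -> str:
    # Zero the last octet of IPv4 (or the last 80 bits of IPv6).
    prefix = 48 if ":" in ip else 24
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False).network_address)

print(zip3("94103"))              # 941
print(bin_value(37))              # 30-39
print(coarsen_ip("203.0.113.7"))  # 203.0.113.0
```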

Establishing a governance policy

If your datasets contain any amount of sensitive data, it is recommended that you consult legal counsel to establish a governance policy and best-practices documentation.

Identifying potential regulatory issues
