Free Access Google.Professional-Data-Engineer.v2022-07-07.q155 Practice Test (Page 17)

Question 76

You are building a model to make clothing recommendations. You know a user's fashion pis likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available. How should you use this data to train the model?

A.Train on the existing data while using the new data as your test set.

B.Train on the new data while using the existing data as your test set.

C.Continuously retrain the model on a combination of existing data and the new data.

D.Continuously retrain the model on just the new data.

Question 77

You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

A.Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.

B.Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.

C.Add capacity (memory and disk space) to the database server by the order of 200.

D.Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.

Question 78

You are building a new application that you need to collect data from in a scalable way. Data arrives continuously from the application throughout the day, and you expect to generate approximately 150 GB of JSON data per day by the end of the year. Your requirements are:
* Decoupling producer from consumer
* Space and cost-efficient storage of the raw ingested data, which is to be stored indefinitely
* Near real-time SQL query
* Maintain at least 2 years of historical data, which will be queried with SQ Which pipeline should you use to meet these requirements?

A.Create an application that provides an API. Write a tool to poll the API and write data to Cloud Storage as gzipped JSON files.

B.Create an application that publishes events to Cloud Pub/Sub, and create a Cloud Dataflow pipeline that transforms the JSON event payloads to Avro, writing the data to Cloud Storage and BigQuery.

C.Create an application that writes to a Cloud SQL database to store the data. Set up periodic exports of the database to write to Cloud Storage and load into BigQuery.

D.Create an application that publishes events to Cloud Pub/Sub, and create Spark jobs on Cloud Dataproc to convert the JSON data to Avro format, stored on HDFS on Persistent Disk.

Question 79

You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the intitial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt?
(Choose two.)

A.Develop a data pipeline where status updates are appended to BigQuery instead of updated.

B.Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query.

C.Preserve the structure of the data as much as possible.

D.Use BigQuery UPDATE to further reduce the size of the dataset.

E.Denormalize the data as must as possible.

Question 80

Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow.
Numerous data logs are being are being generated during this step, and the team wants to analyze them.
Due to the dynamic nature of the campaign, the data is growing exponentially every hour. The data scientists have written the following code to read the data for a new key features in the logs.
BigQueryIO.Read
.named("ReadLogData")
.from("clouddataflow-readonly:samples.log_data")
You want to improve the performance of this data read. What should you do?

A.Call a transform that returns TableRow objects, where each element in the PCollexction represents a single row in the table.

B.Use .fromQuery operation to read specific fields from the table.

C.Specify the TableReference object in the code.

D.Use of both the Google BigQuery TableSchema and TableFieldSchema classes.

Other Version: 154Google.Professional-Data-Engineer.v2025-09-04.q126; 813Google.Professional-Data-Engineer.v2025-03-18.q108; 948Google.Professional-Data-Engineer.v2024-12-09.q327; 1476Google.Professional-Data-Engineer.v2024-02-15.q202; 2261Google.Professional-Data-Engineer.v2023-09-14.q233; 5271Google.Professional-Data-Engineer.v2022-02-11.q268; 112Google.Examcollectionpass.Professional-Data-Engineer.v2021-12-20.by.jonathan.161q.pdf

Latest Upload: 102OCEG.GRCP.v2025-09-11.q211; 102HP.HPE0-V27.v2025-09-11.q78; 117Oracle.1Z0-1057-23.v2025-09-10.q47; 150Google.Professional-Cloud-Network-Engineer.v2025-09-09.q179; 131SAP.C-S4EWM-2023.v2025-09-08.q83; 164TheSecOpsGroup.CNSP.v2025-09-08.q20; 222CFAInstitute.ESG-Investing.v2025-09-08.q173; 157PECB.ISO-IEC-27001-Lead-Implementer.v2025-09-06.q132; 147Salesforce.Data-Architect.v2025-09-05.q216; 141Adobe.AD0-E605.v2025-09-05.q50

Question 76

Question 77

Question 78

Question 79

Question 80

Download PDF File