Free Access Databricks.Databricks-Certified-Professional-Data-Engineer.v2025-11-20.q139 Practice Test (Page 16)

Question 71

The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the schema below:
item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING
The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.
A junior data engineer suggests converting this data to Delta Lake will improve query performance.
Which response to the junior data engineer's suggestion is correct?

A.Delta Lake statistics are not optimized for free text fields with high cardinality.

B.Text data cannot be stored with Delta Lake.

C.ZORDER ON review will need to be run to see performance gains.

D.The Delta log creates a term matrix for free text fields to support selective filtering.

E.Delta Lake statistics are only collected on the first 4 columns in a table.
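As background for the options above: Delta Lake records per-file column statistics such as minimum and maximum values, and a query can skip a file when those bounds rule out a match. The sketch below is a simplified pure-Python model of that idea, not Delta Lake's implementation. The `FileStats` class, the helper functions, and the sample rows are all hypothetical. It contrasts a range filter on `rating` with a keyword search inside `review`.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical per-file summary holding min/max for two columns."""
    rating_min: float
    rating_max: float
    review_min: str
    review_max: str


def build_stats(rows):
    # Collect the min and max of each column for one data file.
    ratings = [r["rating"] for r in rows]
    reviews = [r["review"] for r in rows]
    return FileStats(min(ratings), max(ratings), min(reviews), max(reviews))


def can_skip_rating_at_least(stats, threshold):
    # A range predicate can be decided from the bounds alone.
    return stats.rating_max < threshold


def can_skip_review_contains(stats, keyword):
    # Min/max bound the lexicographic order of whole strings; they say
    # nothing about substrings, so no file can be safely excluded.
    return False


# Two illustrative "files" of review rows (made-up data).
files = [
    [{"rating": 1.0, "review": "arrived broken"},
     {"rating": 2.0, "review": "would not buy again"}],
    [{"rating": 4.5, "review": "excellent battery life"},
     {"rating": 5.0, "review": "works, but battery drains"}],
]

stats = [build_stats(f) for f in files]
skipped_by_rating = [can_skip_rating_at_least(s, 4.0) for s in stats]
skipped_by_keyword = [can_skip_review_contains(s, "battery") for s in stats]
```

In this model the rating filter skips the first file, while the keyword search must read every file regardless of the stored bounds.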

Question 72

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source table has been de-duplicated and validated, which statement describes what will occur when this code is executed?

A.The silver_customer_sales table will be overwritten by aggregated values calculated from all records in the gold_customer_lifetime_sales_summary table as a batch job.

B.A batch job will update the gold_customer_lifetime_sales_summary table, replacing only those rows that have different values than the current version of the table, using customer_id as the primary key.

C.The gold_customer_lifetime_sales_summary table will be overwritten by aggregated values calculated from all records in the silver_customer_sales table as a batch job.

D.An incremental job will leverage running information in the state store to update aggregate values in the gold_customer_lifetime_sales_summary table.

E.An incremental job will detect if new rows have been written to the silver_customer_sales table; if new rows are detected, all aggregates will be recalculated and used to overwrite the gold_customer_lifetime_sales_summary table.

Question 73

A data engineer is attempting to execute the following PySpark code:
df = spark.read.table("sales")
result = df.groupBy("region").agg(sum("revenue"))
However, upon inspecting the execution plan and profiling the Spark job, they observe excessive data shuffling during the aggregation phase.
Which technique should be applied to reduce shuffling during the groupBy aggregation operation?

A.Caching the DataFrame df.

B.Repartition by region before aggregation.

C.Use coalesce() after the aggregation.

D.Use broadcast join.
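To make the idea of shuffling in a grouped aggregation concrete, here is a minimal pure-Python sketch, not Spark code. It counts how many rows would have to change partition so that all rows for a region sit together, first for scattered data and then for data already co-located by region. The partition layout, the sample rows, and the helper names are hypothetical, and `zlib.crc32` is only a stand-in for a hash partitioner. The model also ignores Spark's map-side partial aggregation, and in Spark a repartition is itself a shuffle, so treat this as an illustration of data placement rather than a performance claim.

```python
import zlib


def target_partition(key, num_partitions):
    # Deterministic stand-in for a hash partitioner (not Spark's hash).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def rows_moved(partitions):
    # Count rows that sit in a partition other than the one their key maps to.
    n = len(partitions)
    return sum(
        1
        for idx, part in enumerate(partitions)
        for region, _ in part
        if target_partition(region, n) != idx
    )


def repartition_by_key(partitions):
    # Place every row in the partition its key maps to.
    n = len(partitions)
    out = [[] for _ in range(n)]
    for part in partitions:
        for region, revenue in part:
            out[target_partition(region, n)].append((region, revenue))
    return out


def aggregate(partitions):
    # Sum revenue per region across all partitions.
    totals = {}
    for part in partitions:
        for region, revenue in part:
            totals[region] = totals.get(region, 0) + revenue
    return totals


# Made-up (region, revenue) rows spread across three partitions.
scattered = [
    [("east", 10), ("west", 5), ("north", 7)],
    [("west", 3), ("east", 2), ("north", 1)],
    [("north", 4), ("east", 6), ("west", 8)],
]

colocated = repartition_by_key(scattered)
moved_before = rows_moved(scattered)   # rows that must move to group by region
moved_after = rows_moved(colocated)    # rows that must move once co-located
totals = aggregate(colocated)
```

Once rows are co-located by the grouping key, the aggregation itself needs no further data movement in this model; the cost has been paid up front by the repartitioning step.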

Question 74

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.
The following logic was executed to create a database for the finance team:

After the database was successfully created and permissions configured, a member of the finance team runs the following code:

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

A.A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.

B.An external table will be created in the storage container mounted to /mnt/finance_eda_bucket.

C.A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.

D.A managed table will be created in the storage container mounted to /mnt/finance_eda_bucket.

E.A managed table will be created in the DBFS root storage container.

Question 75

The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame named preds with the schema "customer_id LONG, predictions DOUBLE, date DATE".

The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.
Which code block accomplishes this task while minimizing potential compute costs?

A.preds.write.mode("append").saveAsTable("churn_preds")

B.preds.write.format("delta").save("/preds/churn_preds")