  • Question 101

    The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for these dashboards completes in 10 minutes.
    Assuming normal operating conditions, which configuration will meet their service-level agreement requirements at the lowest cost?
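    For context, the usual low-cost pattern for a short hourly batch job is a cron-triggered job on an ephemeral job cluster, so compute is only billed for the ~10-minute run each hour. Below is a minimal sketch of a Databricks Jobs API 2.1 job definition; the job name, notebook path, and cluster settings are hypothetical.
    # Sketch of a Jobs API 2.1 job definition (all values illustrative):
    # an hourly cron trigger on a new job cluster avoids an always-on cluster.
    job_settings = {
        "name": "hourly-dashboard-etl",               # hypothetical job name
        "schedule": {
            "quartz_cron_expression": "0 0 * * * ?",  # top of every hour
            "timezone_id": "UTC",
        },
        "tasks": [{
            "task_key": "dashboard_etl",
            "notebook_task": {"notebook_path": "/pipelines/dashboard_etl"},  # hypothetical path
            "new_cluster": {                          # spun up per run, terminated after
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",          # illustrative node type
                "num_workers": 2,
            },
        }],
    }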
  • Question 102

    A Delta Lake table in the Lakehouse named customer_parsams is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources.
    Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
    Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table.
    Given the current implementation, which method can be used?
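    As background, Delta Lake time travel lets a job read the table as of a prior version, so a version-over-version diff can be computed right after the overwrite. A minimal sketch, assuming a Databricks environment where versionAsOf reads against a table name are supported and at least two versions exist:
    # Sketch: diff the newest table version against the one before it.
    latest = (spark.sql("DESCRIBE HISTORY customer_parsams LIMIT 1")
                   .collect()[0]["version"])
    current_df  = spark.read.option("versionAsOf", latest).table("customer_parsams")
    previous_df = spark.read.option("versionAsOf", latest - 1).table("customer_parsams")

    added_rows   = current_df.exceptAll(previous_df)   # rows new in this version
    removed_rows = previous_df.exceptAll(current_df)   # rows dropped by the overwrite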
  • Question 103

    A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.
    df has the following schema:
    device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT
    Code block:
    df.withWatermark("event_time", "10 minutes")
        .groupBy(
            ________,
            "device_id"
        )
        .agg(
            avg("temp").alias("avg_temp"),
            avg("humidity").alias("avg_humidity")
        )
        .writeStream
        .format("delta")
        .saveAsTable("sensor_avg")
    Which line of code correctly fills in the blank within the code block to complete this task?
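    For reference, non-overlapping (tumbling) windows in Structured Streaming come from pyspark.sql.functions.window. A sketch of the full pattern under that reading follows; note the sink here uses toTable, the DataStreamWriter method for writing a stream to a managed table, and the checkpoint path is hypothetical.
    # Sketch: tumbling five-minute windows via window(); the grouping expression
    # on the groupBy line is the kind of value the blank above asks for.
    from pyspark.sql.functions import avg, window

    (df.withWatermark("event_time", "10 minutes")
       .groupBy(
           window("event_time", "5 minutes"),   # non-overlapping 5-minute buckets
           "device_id",
       )
       .agg(
           avg("temp").alias("avg_temp"),
           avg("humidity").alias("avg_humidity"),
       )
       .writeStream
       .option("checkpointLocation", "/tmp/checkpoints/sensor_avg")  # hypothetical path
       .outputMode("append")
       .format("delta")
       .toTable("sensor_avg"))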
  • Question 104

    The business intelligence team has a dashboard configured to track various summary metrics for retail stores. This includes total sales for the previous day alongside totals and averages for a variety of time periods. The fields required to populate this dashboard have the following schema:

    For demand forecasting, the Lakehouse contains a validated table of all itemized sales, updated incrementally in near real-time. This table, named products_per_order, includes the following fields:

    Because reporting on long-term sales trends is less volatile, analysts using the new dashboard only require data to be refreshed once daily. Because the dashboard will be queried interactively by many users throughout a normal business day, it should return results quickly while limiting the total compute associated with each materialization.
    Which solution meets the expectations of the end users while controlling and limiting possible costs?
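    As background, the common pattern for this requirement is a nightly job that materializes a small pre-aggregated summary table that the dashboard queries directly. A minimal sketch, using hypothetical column names (store_id, sales_price, order_timestamp) since the schemas above are not shown:
    # Sketch of a once-daily summary materialization (column names hypothetical).
    from pyspark.sql.functions import avg, col, sum as sum_, to_date

    daily_store_sales = (
        spark.table("products_per_order")
             .withColumn("order_date", to_date(col("order_timestamp")))
             .groupBy("store_id", "order_date")
             .agg(sum_("sales_price").alias("total_sales"),
                  avg("sales_price").alias("avg_sale"))
    )

    # Overwriting a compact summary table once per day keeps interactive dashboard
    # queries fast and cheap compared with scanning the full itemized table.
    daily_store_sales.write.mode("overwrite").saveAsTable("store_sales_summary")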
  • Question 105

    When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?
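    For context, automatic recovery for a production streaming query is typically delegated to the job scheduler restarting the task on failure. A sketch of a hypothetical Jobs API payload illustrating the relevant fields:
    # Sketch of job settings for a production streaming task (values illustrative):
    # unlimited retries restart the query after a failure, a single concurrent run
    # prevents duplicate streams, and an ephemeral job cluster keeps cost down.
    streaming_job_settings = {
        "name": "production-streaming-job",            # hypothetical name
        "max_concurrent_runs": 1,                      # never run two copies of the stream
        "tasks": [{
            "task_key": "stream",
            "notebook_task": {"notebook_path": "/pipelines/stream"},  # hypothetical path
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",           # illustrative node type
                "num_workers": 2,
            },
            "max_retries": -1,                         # -1 = retry indefinitely on failure
            "retry_on_timeout": True,
        }],
    }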