A Delta Lake table was created with the below query. Consider the following query: DROP TABLE prod.sales_by_store. If this statement is executed by a workspace admin, which result will occur?
Correct Answer: C
Explanation When a managed table is dropped in Delta Lake, the table is removed from the catalog and its data is deleted. Because Delta Lake is a transactional storage layer that provides ACID guarantees, the drop removes the table's entry from the catalog and deletes its data files (including the transaction log) from the underlying storage. References: https://docs.databricks.com/delta/quick-start.html#drop-a-table https://docs.databricks.com/delta/delta-batch.html#drop-table
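For illustration, a minimal PySpark sketch of the scenario, assuming prod.sales_by_store is a managed table (the original CREATE statement is not shown above):

    # Dropping a managed Delta table removes it from the catalog and deletes its data.
    spark.sql("DROP TABLE prod.sales_by_store")

    # The table no longer appears in the catalog afterwards:
    remaining = [t.name for t in spark.catalog.listTables("prod")]
    assert "sales_by_store" not in remaining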
Question 127
An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable: Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order. If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?
Correct Answer: B
This is the correct answer because the code uses the dropDuplicates method to remove any duplicate records within each batch of data before writing to the orders table. However, this method does not check for duplicates across different batches or against records already in the target table, so newly written records may duplicate records that are already present in the target. To avoid this, a better approach is to use Delta Lake and perform an insert-only upsert with MERGE INTO, so that a record is only inserted when its composite key is not already in the target table. Verified References: [Databricks Certified Data Engineer Professional], under "Delta Lake" section; Databricks Documentation, under "DROP DUPLICATES" section.
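A minimal sketch of that MERGE-based alternative, assuming a Delta target table named orders, the composite key from the question, and an illustrative source path built from the date variable:

    from delta.tables import DeltaTable

    # Deduplicate within the batch, then insert only keys not already in the target.
    batch_df = (spark.read.format("parquet")
                .load(f"/mnt/raw_orders/{date}")              # assumed source path
                .dropDuplicates(["customer_id", "order_id"]))

    orders = DeltaTable.forName(spark, "orders")
    (orders.alias("t")
           .merge(batch_df.alias("s"),
                  "t.customer_id = s.customer_id AND t.order_id = s.order_id")
           .whenNotMatchedInsertAll()   # skip orders already written in earlier batches
           .execute())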
Question 128
The data engineering team maintains a table of aggregate statistics through nightly batch updates. This includes total sales for the previous day alongside totals and averages for a variety of time periods including the 7 previous days, year-to-date, and quarter-to-date. This table is named store_sales_summary and the schema is as follows: The table daily_store_sales contains all the information needed to update store_sales_summary. The schema for this table is: store_id INT, sales_date DATE, total_sales FLOAT. If daily_store_sales is implemented as a Type 1 table and the total_sales column might be adjusted after manual data auditing, which approach is the safest to generate accurate reports in the store_sales_summary table?
Correct Answer: E
The daily_store_sales table contains all the information needed to update store_sales_summary. Its schema is: store_id INT, sales_date DATE, total_sales FLOAT. Because daily_store_sales is implemented as a Type 1 table, old values are overwritten by new values and no history is maintained, and since the total_sales column might be adjusted after manual data auditing, the data in the table can change over time.

The safest approach to generate accurate reports in the store_sales_summary table is to use Structured Streaming to subscribe to the change data feed for daily_store_sales and apply the changes to the aggregates in store_sales_summary with each update. Structured Streaming is a scalable and fault-tolerant stream processing engine built on Spark SQL. It allows data streams to be processed as if they were tables or DataFrames, using familiar operations such as select, filter, groupBy, or join, and it supports output modes (append, update, complete) that specify how the results of a streaming query are written to a sink. Structured Streaming can also handle streaming and batch data sources in a unified manner.

The change data feed is a Delta Lake feature that exposes the changes made to a Delta table as a structured streaming source. It captures row-level data changes (inserts, updates, and deletes) as ordered events that can be processed by downstream applications or services, and a read can be configured to start from a specific version or timestamp.

By using Structured Streaming to subscribe to the change data feed for daily_store_sales, any changes made to the total_sales column during manual data auditing are captured and processed. Applying those changes to the aggregates in store_sales_summary with each update keeps the reports consistent and accurate with the latest data.

Verified References: [Databricks Certified Data Engineer Professional], under "Spark Core" section; Databricks Documentation, under "Structured Streaming" section; Databricks Documentation, under "Delta Change Data Feed" section.
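A heavily simplified sketch of this pattern; the summary table's columns, the checkpoint path, and the starting version are assumptions, and the real aggregation logic for the various time windows is omitted:

    def upsert_summary(microbatch_df, batch_id):
        # Recompute aggregates for the affected stores from the changed rows and
        # merge them into store_sales_summary (full window logic omitted).
        microbatch_df.createOrReplaceTempView("changes")
        microbatch_df.sparkSession.sql("""
            MERGE INTO store_sales_summary t
            USING (SELECT store_id, SUM(total_sales) AS total_sales
                   FROM changes
                   WHERE _change_type IN ('insert', 'update_postimage')
                   GROUP BY store_id) s
            ON t.store_id = s.store_id
            WHEN MATCHED THEN UPDATE SET t.total_sales = s.total_sales
            WHEN NOT MATCHED THEN INSERT (store_id, total_sales)
                 VALUES (s.store_id, s.total_sales)
        """)

    (spark.readStream
          .format("delta")
          .option("readChangeFeed", "true")       # subscribe to the change data feed
          .option("startingVersion", 1)           # illustrative starting point
          .table("daily_store_sales")
          .writeStream
          .foreachBatch(upsert_summary)
          .option("checkpointLocation", "/tmp/checkpoints/store_sales_summary")
          .trigger(availableNow=True)             # process new changes as a nightly batch
          .start())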
Question 129
A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job: Which statement describes the execution and results of running the above query multiple times?
Correct Answer: B
Explanation Reading a table's changes captured by CDF with spark.read treats them as a static (batch) source. So each time the job runs, all of the table's changes, starting from the specified startingVersion, will be read again.
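A minimal sketch of such a batch-style CDF read; the table name (bronze) and the startingVersion value are assumptions based on the question:

    # spark.read (not readStream) treats the change feed as a static source,
    # so every run re-reads all changes from the specified version onward.
    changes_df = (spark.read
                       .format("delta")
                       .option("readChangeFeed", "true")
                       .option("startingVersion", 0)    # illustrative value
                       .table("bronze"))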
Question 130
A Delta Lake table representing metadata about content posts from users has the following schema:
* user_id LONG
* post_text STRING
* post_id STRING
* longitude FLOAT
* latitude FLOAT
* post_time TIMESTAMP
* date DATE
Based on the above schema, which column is a good candidate for partitioning the Delta Table?
Correct Answer: A
Partitioning a Delta Lake table is a strategy used to improve query performance by dividing the table into distinct segments based on the values of a specific column. This approach allows queries to scan only the relevant partitions, thereby reducing the amount of data read and enhancing performance.

Considerations for Choosing a Partition Column:
* Cardinality: Columns with high cardinality (i.e., a large number of unique values) are generally poor choices for partitioning. High cardinality can lead to a large number of small partitions, which can degrade performance.
* Query Patterns: The partition column should align with common query filters. If queries frequently filter data based on a particular column, partitioning by that column can be beneficial.
* Partition Size: Each partition should ideally contain at least 1 GB of data. This ensures that partitions are neither too small (leading to too many partitions) nor too large (negating the benefits of partitioning).

Evaluation of Columns:
* date:
  * Cardinality: Typically low, especially if data spans over days, months, or years.
  * Query Patterns: Many analytical queries filter data based on date ranges.
  * Partition Size: Likely to meet the 1 GB threshold per partition, depending on data volume.
* user_id:
  * Cardinality: High, as each user has a unique ID.
  * Query Patterns: While some queries might filter by user_id, the high cardinality makes it unsuitable for partitioning.
  * Partition Size: Partitions could be too small, leading to inefficiencies.
* post_id:
  * Cardinality: Extremely high, with each post having a unique ID.
  * Query Patterns: Unlikely to be used for filtering large datasets.
  * Partition Size: Each partition would be very small, resulting in a large number of partitions.
* post_time:
  * Cardinality: High, especially if it includes exact timestamps.
  * Query Patterns: Queries might filter by time, but the high cardinality poses challenges.
  * Partition Size: Similar to user_id, partitions could be too small.

Conclusion: Given these considerations, the date column is the most suitable candidate for partitioning. It has low cardinality, aligns with common query patterns, and is likely to result in appropriately sized partitions.

References:
* Delta Lake Best Practices
* Partitioning in Delta Lake
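A minimal sketch of partitioning such a table by date; the table name (user_posts) and source DataFrame (posts_df) are assumptions:

    # Write the posts data partitioned by the low-cardinality date column.
    (posts_df.write
             .format("delta")
             .partitionBy("date")          # one partition directory per date value
             .mode("overwrite")
             .saveAsTable("user_posts"))

    # Queries filtering on date can then prune partitions:
    spark.table("user_posts").filter("date = '2024-06-01'").count()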