Free Access Databricks.Databricks-Certified-Professional-Data-Engineer.v2025-10-30.q132 Practice Test (Page 25)

Question 116

Create a schema called bronze using location '/mnt/delta/bronze', and check if the schema exists before creating.

A.CREATE SCHEMA IF NOT EXISTS bronze LOCATION '/mnt/delta/bronze'

B.CREATE SCHEMA bronze IF NOT EXISTS LOCATION '/mnt/delta/bronze'

C.if IS_SCHEMA('bronze'): CREATE SCHEMA bronze LOCATION '/mnt/delta/bronze'

D.Schema creation is not available in metastore, it can only be done in Unity catalog UI

E.Cannot create schema without a database

Question 117

What type of table is created when you create delta table with below command?
CREATE TABLE transactions USING DELTA LOCATION "DBFS:/mnt/bronze/transactions"

A.Managed delta table

B.External table

C.Managed table

D.Temp table

E.Delta Lake table

Question 118

A table in the Lakehouse namedcustomer_churn_paramsis used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.
Which approach would simplify the identification of these changed records?

A.Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

B.Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.

C.Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.

D.Modify the overwrite logic to include a field populated by calling spark.sql.functions.
current_timestamp() as data are being written; use this field to identify records written on a particular date.

E.Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.

Correct Answer: E

The approach that would simplify the identification of the changed records is to replace the current overwrite logic with a merge statement to modify only those records that have changed, and write logic to make predictions on the changed records identified by the change data feed. This approach leverages the Delta Lake features of merge and change data feed, which are designed to handle upserts and track row-level changes in a Delta table12. By using merge, the data engineering team can avoid overwriting the entire table every night, and only update or insert the records that have changed in the source data. By using change data feed, the ML team can easily access the change events that have occurred in the customer_churn_params table, and filter them by operation type (update or insert) and timestamp. This way, they can only make predictions on the records that have changed in the past 24 hours, and avoid re-processing the unchanged records.
The other options are not as simple or efficient as the proposed approach, because:
* Option A would require applying the churn model to all rows in the customer_churn_params table, which would be wasteful and redundant. It would also require implementing logic to perform an upsert into the predictions table, which would be more complex than using the merge statement.
* Option B would require converting the batch job to a Structured Streaming job, which would involve changing the data ingestion and processing logic. It would also require using the complete output mode, which would output the entire result table every time there is a change in the source data, which would be inefficient and costly.
* Option C would require calculating the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers, which would be computationally expensive and prone to errors. It would also require storing and accessing the previous predictions, which would add extra storage and I/O costs.
* Option D would require modifying the overwrite logic to include a field populated by calling spark.sql.
functions.current_timestamp() as data are being written, which would add extra complexity and overhead to the data engineering job. It would also require using this field to identify records written on a particular date, which would be less accurate and reliable than using the change data feed.
References: Merge, Change data feed

Question 119

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto- Optimize & Auto-Compaction cannot be used.
Which strategy will yield the best performance without shuffling data?

A.Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

B.Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.

C.Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.

D.Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024
/512), and then write to parquet.

E.Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.

Correct Answer: A

For this scenario where a one-TB JSON dataset needs to be converted into Parquet format without employing Delta Lake's auto-sizing features, the goal is to avoid unnecessary data shuffles and yet ensure optimal file sizes for the output Parquet files. Here's a breakdown of why option A is most suitable:
* Setting maxPartitionBytes:The spark.sql.files.maxPartitionBytes configuration controls the size of blocks that Spark reads from the data source (in this case, the JSON files) but also influences the output size of files when data is written without repartition or coalesce operations. Setting this parameter to
512 MB directly addresses the requirement to manage the output file size effectively.
* Data Ingestion and Processing:
* Ingesting Data:Load the JSON dataset into a DataFrame.
* Applying Transformations:Perform any required narrow transformations that do not involve shuffling data (like filtering or adding new columns).
* Writing to Parquet:Directly write the transformed DataFrame to Parquet files. The setting for maxPartitionBytes ensures that each part-file is approximately 512 MB, meeting the requirement for part-file size without additional steps to repartition or coalesce the data.
* Performance Consideration:This approach is optimal because:
* It avoids the overhead of shuffling data, which can be significant, especially with large datasets.
* It directly ties the read/write operations to a configuration that matches the target output size, making it efficient in terms of both computation and I/O operations.
* Alternative Options Analysis:
* Option B and D:Involves repartitioning, which would trigger a shuffle of the data, contradicting the requirement to avoid shuffling for performance reasons.
* Option C:Uses coalesce, which is less intensive than repartition but can still lead to uneven partition sizes and does not directly control the output file size as effectively as setting maxPartitionBytes.
* Option E:Setting shuffle partitions to 512 doesn't directly control the output file size for writing to Parquet and could lead to smaller files depending on the dataset's partitioning post- transformations.
References
* Apache Spark Configuration
* Writing to Parquet Files in Spark

Question 120

One of the queries in the Databricks SQL Dashboard takes a long time to refresh, which of the be-low steps can be taken to identify the root cause of this issue?

A.Restart the SQL endpoint

B.Select the SQL endpoint cluster, spark UI, SQL tab to see the execution plan and time spent in each step

C.Run optimize and Z ordering

D.Change the Spot Instance Policy from "Cost optimized" to "Reliability Optimized."

E.Use Query History, to view queries and select query, and check query profile to time spent in each step

Other Version: 751Databricks.Databricks-Certified-Professional-Data-Engineer.v2025-11-20.q139; 1389Databricks.Databricks-Certified-Professional-Data-Engineer.v2024-03-27.q76; 1167Databricks.Databricks-Certified-Professional-Data-Engineer.v2023-12-11.q54

Latest Upload: 102GAQM.CBCP-002.v2025-12-17.q28; 102ServiceNow.CIS-Discovery.v2025-12-17.q125; 113PECB.NIS-2-Directive-Lead-Implementer.v2025-12-16.q29; 122SAP.C_C4H32_2411.v2025-12-16.q51; 125Splunk.SPLK-3001.v2025-12-15.q104; 112SAP.C-S4PM2-2507.v2025-12-15.q29; 179APA.FPC-Remote.v2025-12-13.q147; 123CIPS.L6M2.v2025-12-13.q13; 131WGU.Cybersecurity-Architecture-and-Engineering.v2025-12-13.q81; 124Oracle.1Z0-1163-1.v2025-12-13.q17

Question 116

Question 117

Question 118

Question 119

Question 120

Download PDF File