A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A. If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?
Correct Answer: A
In a Databricks job with multiple tasks, each task commits its own results independently, and a failure in one task does not roll back work already completed by other tasks. Because task A finished successfully, both of its downstream tasks, B and C, were started. Task B completed successfully, so any data it wrote is persisted. Task C failed, so its work is incomplete and the overall job run is marked as failed. The results of tasks A and B are unaffected by the failure of task C, and the failed task can later be re-run (for example, with a repair run) without re-executing the tasks that already succeeded.
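For context, a dependency graph like the one described in the question is typically declared with per-task depends_on entries. Below is a minimal sketch in the Jobs API 2.1 task format; the job name and notebook paths are illustrative placeholders, not taken from the question.

```python
# Minimal sketch of a Jobs API 2.1 job definition with the dependency
# structure described above: A runs first, B and C both depend on A.
# Job name and notebook paths are illustrative placeholders.
job_spec = {
    "name": "three-task-example",
    "tasks": [
        {
            "task_key": "task_A",
            "notebook_task": {"notebook_path": "/Jobs/task_a"},
        },
        {
            "task_key": "task_B",
            "depends_on": [{"task_key": "task_A"}],
            "notebook_task": {"notebook_path": "/Jobs/task_b"},
        },
        {
            "task_key": "task_C",
            "depends_on": [{"task_key": "task_A"}],
            "notebook_task": {"notebook_path": "/Jobs/task_c"},
        },
    ],
}
```

With this layout, a failure in task_C leaves task_A and task_B marked as successful, and only the failed branch needs to be repaired.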
Question 87
A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that the Min, Median, and Max Durations for tasks in a particular stage show that the minimum and median times to complete a task are roughly the same, while the maximum duration of a task is roughly 100 times as long as the minimum. Which situation is causing the increased duration of the overall job?
Correct Answer: D
This is the correct answer because skew is a common situation that causes increased duration of the overall job. Skew occurs when some partitions have more data than others, resulting in uneven distribution of work among tasks and executors. Skew can be caused by various factors, such as skewed data distribution, improper partitioning strategy, or join operations with skewed keys. Skew can lead to performance issues such as long-running tasks, wasted resources, or even task failures due to memory or disk spills. Verified References: [Databricks Certified Data Engineer Professional], under "Performance Tuning" section; Databricks Documentation, under "Skew" section.
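For context, a common first mitigation for this pattern is to let Spark's adaptive query execution split skewed shuffle partitions at runtime. The sketch below is a minimal illustration; the table and column names are hypothetical, not from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Adaptive query execution can detect skewed shuffle partitions and split
# them into smaller tasks, smoothing out the min/median/max task durations.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

# Hypothetical join on a key whose value distribution is heavily skewed.
orders = spark.table("orders")
customers = spark.table("customers")
joined = orders.join(customers, "customer_id")
```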
Question 88
Although the Databricks Utilities Secrets module provides tools to store sensitive credentials and avoid accidentally displaying them in plain text, users should still be careful about which credentials are stored there and which users have access to those secrets. Which statement describes a limitation of Databricks Secrets?
Correct Answer: E
This is the correct answer because it describes a limitation of Databricks Secrets. Databricks Secrets is a module that provides tools to store sensitive credentials and avoid accidentally displaying them in plain text. Databricks Secrets allows creating secret scopes, which are collections of secrets that can be accessed by users or groups. Databricks Secrets also allows creating and managing secrets using the Databricks CLI or the Databricks REST API. However, a limitation of Databricks Secrets is that the Databricks REST API can be used to list secrets in plain text if the personal access token has proper credentials. Therefore, users should still be careful about which credentials are stored in Databricks Secrets and which users have access to those secrets. Verified References: [Databricks Certified Data Engineer Professional], under "Databricks Workspace" section; Databricks Documentation, under "List secrets" section.
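As an illustration of why scope access matters, any user who can run code on a cluster with access to a scope can read its secret values programmatically. The sketch below assumes a Databricks notebook (where dbutils is available); the scope and key names are placeholders.

```python
# Runs inside a Databricks notebook, where `dbutils` is available.
# Scope and key names below are illustrative placeholders.

# List the keys (metadata only) in a secret scope.
for meta in dbutils.secrets.list("finance-scope"):
    print(meta.key)

# Retrieve a secret value for use in code. Notebook output of the value is
# redacted, but the string itself is available to the running code, so scope
# ACLs ultimately determine who can make use of the credential.
db_password = dbutils.secrets.get(scope="finance-scope", key="db-password")
```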
Question 90
A Delta Lake table representing metadata about content posts from users has the following schema:
* user_id LONG
* post_text STRING
* post_id STRING
* longitude FLOAT
* latitude FLOAT
* post_time TIMESTAMP
* date DATE
Based on the above schema, which column is a good candidate for partitioning the Delta table?
Correct Answer: A
Partitioning a Delta Lake table is a strategy used to improve query performance by dividing the table into distinct segments based on the values of a specific column. This approach allows queries to scan only the relevant partitions, thereby reducing the amount of data read and enhancing performance.
Considerations for Choosing a Partition Column:
* Cardinality: Columns with high cardinality (i.e., a large number of unique values) are generally poor choices for partitioning. High cardinality can lead to a large number of small partitions, which can degrade performance.
* Query Patterns: The partition column should align with common query filters. If queries frequently filter data based on a particular column, partitioning by that column can be beneficial.
* Partition Size: Each partition should ideally contain at least 1 GB of data. This ensures that partitions are neither too small (leading to too many partitions) nor too large (negating the benefits of partitioning).
Evaluation of Columns:
* date:
  * Cardinality: Typically low, especially if data spans over days, months, or years.
  * Query Patterns: Many analytical queries filter data based on date ranges.
  * Partition Size: Likely to meet the 1 GB threshold per partition, depending on data volume.
* user_id:
  * Cardinality: High, as each user has a unique ID.
  * Query Patterns: While some queries might filter by user_id, the high cardinality makes it unsuitable for partitioning.
  * Partition Size: Partitions could be too small, leading to inefficiencies.
* post_id:
  * Cardinality: Extremely high, with each post having a unique ID.
  * Query Patterns: Unlikely to be used for filtering large datasets.
  * Partition Size: Each partition would be very small, resulting in a large number of partitions.
* post_time:
  * Cardinality: High, especially if it includes exact timestamps.
  * Query Patterns: Queries might filter by time, but the high cardinality poses challenges.
  * Partition Size: Similar to user_id, partitions could be too small.
Conclusion: Given the considerations, the date column is the most suitable candidate for partitioning. It has low cardinality, aligns with common query patterns, and is likely to result in appropriately sized partitions (see the sketch below).
References:
* Delta Lake Best Practices
* Partitioning in Delta Lake
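As a concrete illustration, a Delta table partitioned by the date column can be written as sketched below. This is a minimal example; the source table and target table names are hypothetical, not from the question.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical source of post metadata matching the schema in the question.
posts_df = spark.read.table("raw_posts")

# Write a Delta table partitioned by the low-cardinality `date` column, so
# queries that filter on date ranges only scan the relevant partitions.
(posts_df.write
    .format("delta")
    .partitionBy("date")
    .mode("overwrite")
    .saveAsTable("posts_metadata"))
```

A query such as SELECT COUNT(*) FROM posts_metadata WHERE date >= '2024-01-01' can then prune partitions instead of scanning the whole table.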