Question 26

What could be the expected output of the query SELECT COUNT(DISTINCT *) FROM user on this table?
  • Question 27

    A data engineering team has a time-consuming data ingestion job with three data sources. Each notebook takes about one hour to load new data. One day, the job fails because a notebook update introduced a new required configuration parameter. The team must quickly fix the issue and load the latest data from the failing source.
    Which action should the team take?
  • Question 28

    An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day, as indicated by the date variable:

    Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.
    If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?
  • Question 29

    An upstream system has been configured to pass the date for a given batch of data to the Databricks Jobs API as a parameter. The notebook to be scheduled will use this parameter to load data with the following code:
    df = spark.read.format("parquet").load(f"/mnt/source/{date}")
    Which code block should be used to create the date Python variable used in the above code block?
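The load path in this question is built with a Python f-string. As a minimal sketch of how that interpolation behaves (the date value below is a made-up example, not part of the question, and the sketch does not show how the job parameter itself is read), braces substitute the variable's value while parentheses stay literal:

```python
# Hypothetical batch date; in the scheduled notebook this value would come
# from the job parameter rather than being hard-coded.
date = "2024-01-15"

# Braces inside an f-string are replaced by the variable's value.
interpolated = f"/mnt/source/{date}"

# Parentheses are ordinary characters, so no substitution happens here.
literal = f"/mnt/source/(date)"

print(interpolated)  # /mnt/source/2024-01-15
print(literal)       # /mnt/source/(date)
```

Only the brace form yields a directory path that changes with each batch date.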
  • Question 30

    Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().
    Which of the following statements is correct?