Question 71

The data science team has requested assistance in accelerating queries on free form text from user reviews. The data is currently stored in Parquet with the below schema:
item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING
The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.
A junior data engineer suggests converting this data to Delta Lake will improve query performance.
Which response to the junior data engineer s suggestion is correct?
  • Question 72

    The data engineering team maintains the following code:

    Assuming that this code produces logically correct results and the data in the source table has been de- duplicated and validated, which statement describes what will occur when this code is executed?
  • Question 73

    A data engineer is attempting to execute the following PySpark code:
    df = spark.read.table("sales")
    result = df.groupBy("region").agg(sum("revenue"))
    However, upon inspecting the execution plan and profiling the Spark job, they observe excessive data shuffling during the aggregation phase.
    Which technique should be applied to reduce shuffling during the groupBy aggregation operation?
  • Question 74

    An external object storage container has been mounted to the location /mnt/finance_eda_bucket.
    The following logic was executed to create a database for the finance team:

    After the database was successfully created and permissions configured, a member of the finance team runs the following code:

    If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?
  • Question 75

    The data science team has created and logged a production model using MLflow. The following code correctly imports and applies the production model to output the predictions as a new DataFrame namedpredswith the schema "customer_id LONG, predictions DOUBLE, date DATE".

    The data science team would like predictions saved to a Delta Lake table with the ability to compare all predictions across time. Churn predictions will be made at most once per day.
    Which code block accomplishes this task while minimizing potential compute costs?