Which of the following code blocks sorts DataFrame transactionsDf both by column storeId in ascending and by column productId in descending order, in this priority?
Correct Answer: D
Explanation In this question it is important to realize that you are asked to sort transactionsDf by two columns. This means that the sorting of the second column depends on the sorting of the first column. So, any option that chains separate sort statements will not work, because the second sort() call reorders the entire DataFrame and discards the ordering established by the first. The two columns need to be passed to the same call to sort(). Also, order_by is not a valid DataFrame API method. More info: pyspark.sql.DataFrame.sort - PySpark 3.1.2 documentation Static notebook | Dynamic notebook: See test 2
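The answer options are not reproduced above, so as an illustration only, here is a minimal sketch of the single-sort() approach the explanation describes, using hypothetical sample data (the real transactionsDf schema is not shown here):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import asc, desc

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; the real transactionsDf schema is not reproduced here.
transactionsDf = spark.createDataFrame(
    [(1, 3), (1, 7), (2, 2), (2, 9)], ["storeId", "productId"]
)

# Both sort keys go into one sort() call: storeId ascending has priority,
# productId descending breaks ties within each storeId.
transactionsDf.sort(asc("storeId"), desc("productId")).show()
```

Chaining two separate sort() calls instead would let the second call reorder the whole DataFrame, destroying the order produced by the first.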
Question 22
Which of the following code blocks shows the structure of a DataFrame in a tree-like way, containing both column names and types?
Correct Answer: B
Explanation itemsDf.printSchema() Correct! Here is an example of what itemsDf.printSchema() shows; you can see the tree-like structure containing both column names and types:
root
 |-- itemId: integer (nullable = true)
 |-- attributes: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- supplier: string (nullable = true)
itemsDf.rdd.printSchema() No, the DataFrame's underlying RDD does not have a printSchema() method.
spark.schema(itemsDf) Incorrect, there is no spark.schema command.
print(itemsDf.columns) print(itemsDf.dtypes) Wrong. While the output of this code block contains both column names and column types, the information is not arranged in a tree-like way.
itemsDf.print.schema() No, DataFrame does not have a print method.
Static notebook | Dynamic notebook: See test 3
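As an illustration only, the following sketch builds a hypothetical itemsDf matching the schema printed above and contrasts printSchema() with dtypes:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Hypothetical itemsDf; the real data behind the question is not reproduced here.
schema = StructType([
    StructField("itemId", IntegerType()),
    StructField("attributes", ArrayType(StringType())),
    StructField("supplier", StringType()),
])
itemsDf = spark.createDataFrame([(1, ["blue", "winter"], "Sports Co.")], schema)

itemsDf.printSchema()   # tree-like view: column names, types, nullability
print(itemsDf.dtypes)   # flat list of (name, type) tuples, no tree structure
```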
Question 23
Which of the following describes a difference between Spark's cluster and client execution modes?
Correct Answer: C
Explanation In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode. Correct. The idea of Spark's client mode is that workloads can be executed from an edge node, also known as a gateway machine, outside the cluster. The most common way to execute Spark, however, is in cluster mode, where the driver resides on a worker node. In practice, client mode is constrained by the data transfer speed between the edge node and the cluster, which is typically much lower than the transfer speed between worker nodes inside the cluster. Also, any job that is executed in client mode will fail if the edge node fails. For these reasons, client mode is usually not used in a production environment.
In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client execution mode. No. In both execution modes, the cluster manager may reside on a worker node, but it does not reside on an edge node in client mode.
In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode. This is incorrect. Only the driver runs on gateway nodes (also known as "edge nodes") in client mode, but not the executor processes.
In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode. No. In client mode, the Spark driver is not co-located with the cluster manager. The whole point of client mode is that the driver is outside the cluster and not associated with the resource that manages the cluster (the machine that runs the cluster manager).
In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode. No, it is exactly the opposite: in cluster mode the driver runs on a worker node inside the cluster, while in client mode the gateway machine hosts the driver.
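The execution mode itself is chosen when the application is submitted, typically via spark-submit's --deploy-mode flag. As a small, hedged illustration (the configuration key is standard Spark, but whether it is populated depends on how the application was launched), a running application can inspect the mode it was started in:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "client": the driver runs on the submitting (edge/gateway) machine.
# "cluster": the driver runs on a worker node inside the cluster.
print(spark.conf.get("spark.submit.deployMode", "client"))
```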
Question 24
Which of the following describes Spark's Adaptive Query Execution?
Correct Answer: D
Explanation Adaptive Query Execution features include dynamically coalescing shuffle partitions, dynamically injecting scan filters, and dynamically optimizing skew joins. This is almost correct. All of these features, except for dynamically injecting scan filters, are part of Adaptive Query Execution. Dynamically injecting scan filters for join operations to limit the amount of data to be considered in a query is part of Dynamic Partition Pruning, not of Adaptive Query Execution.
Adaptive Query Execution reoptimizes queries at execution points. No, Adaptive Query Execution reoptimizes queries at materialization points.
Adaptive Query Execution is enabled in Spark by default. No, Adaptive Query Execution is disabled by default in Spark 3.1 and needs to be enabled through the spark.sql.adaptive.enabled property.
Adaptive Query Execution applies to all kinds of queries. No, Adaptive Query Execution applies only to queries that are not streaming queries and that contain at least one exchange (typically expressed through a join, aggregate, or window operator) or one subquery.
More info: How to Speed up SQL Queries with Adaptive Query Execution, Learning Spark, 2nd Edition, Chapter 12 (https://bit.ly/3tOh8M1)
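As a minimal sketch (the configuration keys are standard Spark 3.x properties; the session setup is illustrative), this is how AQE and two of the features named above are switched on:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# AQE is off by default in Spark 3.1 and is enabled through this property.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Features mentioned in the explanation (both default to true once AQE is enabled):
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```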
Question 25
The code block displayed below contains an error. The code block is intended to perform an outer join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively. Find the error. Code block: transactionsDf.join(itemsDf, [itemsDf.itemId, transactionsDf.productId], "outer")
Correct Answer: C
Explanation Correct code block: transactionsDf.join(itemsDf, itemsDf.itemId == transactionsDf.productId, "outer") The error is in the join condition: the two key columns need to be compared with the == operator so that join() receives a boolean expression linking itemsDf.itemId and transactionsDf.productId, rather than a list of two unrelated columns. Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/33.html , https://bit.ly/sparkpracticeexams_import_instructions)
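As an illustration only, here is a minimal runnable sketch of the corrected outer join, using hypothetical data since the real schemas are not reproduced here:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical minimal versions of the two DataFrames.
transactionsDf = spark.createDataFrame([(1, 3), (2, 6)], ["transactionId", "productId"])
itemsDf = spark.createDataFrame([(3, "shirt"), (4, "socks")], ["itemId", "itemName"])

# The join condition is a boolean expression comparing the two key columns,
# not a list of the two columns.
transactionsDf.join(itemsDf, itemsDf.itemId == transactionsDf.productId, "outer").show()
```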