Explanation

This is a tricky question to get right, since it is easy to confuse execution modes and deployment modes. Even in the literature, the two terms are sometimes used interchangeably.

There are only 3 valid execution modes in Spark: client, cluster, and local. Execution modes do not refer to specific frameworks, but to where the driver runs relative to the cluster. In client mode, the driver sits on a machine outside the cluster. In cluster mode, the driver sits on a machine inside the cluster. Finally, in local mode, all Spark infrastructure, including the driver, is started in a single JVM (Java Virtual Machine) on a single computer.

Deployment modes refer to the ways Spark can be deployed in cluster mode and to the cluster-management frameworks outside Spark that it uses. Valid deployment modes are standalone, Apache YARN, Apache Mesos, and Kubernetes.

Client, Cluster, Local
Correct, these are the valid execution modes in Spark.

Standalone, Client, Cluster
No, standalone is not a valid execution mode. It is a valid deployment mode, though.

Kubernetes, Local, Client
No, Kubernetes is a deployment mode, but not an execution mode.

Cluster, Server, Local
No, Server is not an execution mode.

Server, Standalone, Client
No, standalone and server are not execution modes.

More info: Apache Spark Internals - Learning Journal
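For illustration, here is a minimal sketch of local execution mode in PySpark (the application name is made up); client and cluster modes would instead be selected through spark-submit's --deploy-mode flag:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")          # local mode: driver and executors share one JVM
    .appName("local-mode-demo")  # hypothetical application name
    .getOrCreate()
)

print(spark.sparkContext.master)  # prints "local[*]"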
Question 37
Which of the following code blocks creates a new DataFrame with 3 columns, productId, highest, and lowest, that shows the biggest and smallest values of column value per value in column productId from DataFrame transactionsDf?

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+
|transactionId|predError|value|storeId|productId|   f|
+-------------+---------+-----+-------+---------+----+
|            1|        3|    4|     25|        1|null|
|            2|        6|    7|      2|        2|null|
|            3|        3| null|     25|        3|null|
|            4|     null| null|      3|        2|null|
|            5|     null| null|   null|        2|null|
|            6|        3|    2|     25|        2|null|
+-------------+---------+-----+-------+---------+----+
Correct Answer: D
Explanation

transactionsDf.groupby('productId').agg(max('value').alias('highest'), min('value').alias('lowest'))
Correct. groupBy and aggregate is a common pattern for investigating aggregated values per group.

transactionsDf.groupby("productId").agg({"highest": max("value"), "lowest": min("value")})
Wrong. While DataFrame.agg() accepts dictionaries, the syntax of the dictionary in this code block is wrong. If you use a dictionary, the syntax should look like {"value": "max"}, using the column name as the key and the aggregating function as the value.

transactionsDf.agg(max('value').alias('highest'), min('value').alias('lowest'))
Incorrect. While this is valid Spark syntax, it does not achieve what the question asks for. The question specifically asks for values to be aggregated per value in column productId - this column is not considered here. Instead, the max() and min() values are calculated as if the entire DataFrame were a single group.

transactionsDf.max('value').min('value')
Wrong. There is no DataFrame.max() method in Spark, so this command will fail.

transactionsDf.groupby(col(productId)).agg(max(col(value)).alias("highest"), min(col(value)).alias("lowest"))
No. While this would work if the column names were expressed as strings, it will not work as is. Python interprets the unquoted column names as variables and, as a result, PySpark cannot tell which columns you want to aggregate.

More info: pyspark.sql.DataFrame.agg - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
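To make the correct answer concrete, here is a runnable sketch, assuming a local PySpark session; it rebuilds the sample transactionsDf (leaving out the all-null column f to keep schema inference simple) and applies the groupBy/agg pattern:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Rebuild the sample transactionsDf from the question (column f omitted).
transactionsDf = spark.createDataFrame(
    [(1, 3, 4, 25, 1), (2, 6, 7, 2, 2), (3, 3, None, 25, 3),
     (4, None, None, 3, 2), (5, None, None, None, 2), (6, 3, 2, 25, 2)],
    ["transactionId", "predError", "value", "storeId", "productId"],
)

# Group per productId, then aggregate the biggest and smallest value per group.
result = transactionsDf.groupBy("productId").agg(
    F.max("value").alias("highest"),
    F.min("value").alias("lowest"),
)
result.show()

Importing the functions module as F avoids shadowing Python's built-in max() and min(), a common pitfall when importing these functions directly.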
Question 38
Which of the following statements about Spark's execution hierarchy is correct?
Correct Answer: A
Explanation

In Spark's execution hierarchy, a job may reach over multiple stage boundaries.
Correct. A job is a sequence of stages, and thus may reach over multiple stage boundaries.

In Spark's execution hierarchy, tasks are one layer above slots.
Incorrect. Slots are not a part of the execution hierarchy. Tasks are its lowest layer.

In Spark's execution hierarchy, a stage comprises multiple jobs.
No. It is the other way around - a job consists of one or multiple stages.

In Spark's execution hierarchy, executors are the smallest unit.
False. Executors are not a part of the execution hierarchy. Tasks are the smallest unit!

In Spark's execution hierarchy, manifests are one layer above jobs.
Wrong. Manifests are not a part of the Spark ecosystem.
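To see the hierarchy in action, here is a small sketch (the DataFrame and column names are made up for illustration); the shuffle introduced by the aggregation forces the single job triggered by the action to cross a stage boundary, and each stage runs as a set of tasks, one per partition:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

# The count() action triggers one job. Because the aggregation needs a shuffle,
# the job spans two stages: one stage reads and maps the data, a second stage
# reduces the shuffled partitions. Tasks are the smallest unit within each stage.
df.groupBy("bucket").count().count()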
Question 39
Which of the following describes Spark's Adaptive Query Execution?
Correct Answer: D
Explanation

Adaptive Query Execution features include dynamically coalescing shuffle partitions, dynamically injecting scan filters, and dynamically optimizing skew joins.
This is almost correct. All of these features, except for dynamically injecting scan filters, are part of Adaptive Query Execution. Dynamically injecting scan filters for join operations to limit the amount of data to be considered in a query is part of Dynamic Partition Pruning, not of Adaptive Query Execution.

Adaptive Query Execution reoptimizes queries at execution points.
No, Adaptive Query Execution reoptimizes queries at materialization points.

Adaptive Query Execution is enabled in Spark by default.
No, Adaptive Query Execution is disabled in Spark by default and needs to be enabled through the spark.sql.adaptive.enabled property.

Adaptive Query Execution applies to all kinds of queries.
No, Adaptive Query Execution applies only to queries that are not streaming queries and that contain at least one exchange (typically expressed through a join, aggregate, or window operator) or one subquery.

More info: How to Speed up SQL Queries with Adaptive Query Execution, Learning Spark, 2nd Edition, Chapter 12 (https://bit.ly/3tOh8M1)
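As a small illustration, a sketch assuming an existing SparkSession named spark on a Spark 3.1-era cluster (where the property defaults to false), AQE can be switched on through the configuration property named in the explanation:

# Assumes a SparkSession named spark already exists and that
# spark.sql.adaptive.enabled defaults to false (as in Spark 3.1).
spark.conf.set("spark.sql.adaptive.enabled", "true")

# Verify that the setting took effect.
print(spark.conf.get("spark.sql.adaptive.enabled"))  # -> "true"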
Question 40
The code block shown below should add a column itemNameBetweenSeparators to DataFrame itemsDf. The column should contain arrays of maximum 4 strings. The arrays should be composed of the values in column itemName which are separated at - or whitespace characters. Choose the answer that correctly fills the blanks in the code block to accomplish this.

Sample of DataFrame itemsDf:

+------+----------------------------------+-------------------+
|itemId|itemName                          |supplier           |
+------+----------------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |YetiX              |
|3     |Outdoors Backpack                 |Sports Company Inc.|
+------+----------------------------------+-------------------+

Code block:

itemsDf.__1__(__2__, __3__(__4__, "[\s\-]", __5__))
Correct Answer: A
Explanation

This question deals with the parameters of Spark's split operator for strings.

To solve this question, you first need to understand the difference between DataFrame.withColumn() and DataFrame.withColumnRenamed(). The correct option here is DataFrame.withColumn(), since, according to the question, we want to add a column and not rename an existing column. This leaves you with only 3 answers to consider.

The second gap should be filled with the name of the new column to be added to the DataFrame. One of the remaining answers states the column name as itemNameBetweenSeparators, while the other two state it as "itemNameBetweenSeparators". The correct option is "itemNameBetweenSeparators", since the unquoted option would make Python try to interpret itemNameBetweenSeparators as the name of a variable, which we have not defined. This leaves you with 2 answers to consider.

The decision boils down to how to fill gap 5: either with 4 or with 5. The question asks for arrays of maximum 4 strings. The code in gap 5 relates to the limit parameter of Spark's split operator (see documentation linked below). The documentation states that "the resulting array's length will not be more than limit", meaning that we should pick the answer option with 4 as the code in the fifth gap.

On a side note: one answer option includes a function str_split. This function does not exist in PySpark.

More info: pyspark.sql.functions.split - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
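For reference, here is a runnable sketch of the completed code block, assuming a local PySpark session; the source column for the split in gap 4 is taken to be itemName, based on the sample of itemsDf shown in the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Rebuild the sample itemsDf from the question.
itemsDf = spark.createDataFrame(
    [(1, "Thick Coat for Walking in the Snow", "Sports Company Inc."),
     (2, "Elegant Outdoors Summer Dress", "YetiX"),
     (3, "Outdoors Backpack", "Sports Company Inc.")],
    ["itemId", "itemName", "supplier"],
)

# Gap 1: withColumn, gap 2: the new column name, gap 3: split,
# gap 4: the source column, gap 5: the limit of 4 array elements.
itemsDf.withColumn(
    "itemNameBetweenSeparators",
    split("itemName", r"[\s\-]", 4),
).show(truncate=False)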