Which of the following describes the role of the cluster manager?
Correct Answer: C
Explanation
The cluster manager allocates resources to Spark applications and maintains the executor processes in client mode.
Correct. In client mode, the cluster manager is located on a node other than the client machine. From there it starts and ends executor processes on the cluster nodes as required by the Spark application running on the Spark driver.
The cluster manager allocates resources to Spark applications and maintains the executor processes in remote mode.
Wrong, there is no "remote" execution mode in Spark. The available execution modes are local, client, and cluster.
The cluster manager allocates resources to the DataFrame manager.
Wrong, there is no "DataFrame manager" in Spark.
The cluster manager schedules tasks on the cluster in client mode.
Wrong. In client mode, the Spark driver schedules tasks on the cluster, not the cluster manager.
The cluster manager schedules tasks on the cluster in local mode.
Wrong. In local mode, there is no cluster. The Spark application runs on a single machine, not on a cluster of machines.
Question 52
Which of the following code blocks creates a new DataFrame with two columns season and wind_speed_ms where column season is of data type string and column wind_speed_ms is of data type double?
Correct Answer: B
Explanation
spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
Correct. This command uses the SparkSession's createDataFrame method to create a new DataFrame. Notice how rows, columns, and column names are passed in here: The rows are specified as a Python list, and every entry in the list is a new row. Each row is a Python tuple (for example ("summer", 4.5)), and every column value is one entry in that tuple. The column names are specified as the second argument to createDataFrame(). The documentation (link below) states that "when schema is a list of column names, the type of each column will be inferred from data" (the first argument). Since values 4.5 and 7.5 are both Python floats, Spark correctly infers the double type for column wind_speed_ms. Given that all values in column season are strings, Spark infers the string type for that column. Find out more about SparkSession.createDataFrame() via the link below.
spark.newDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
No, the SparkSession does not have a newDataFrame method.
from pyspark.sql import types as T
spark.createDataFrame((("summer", 4.5), ("winter", 7.5)), T.StructType([T.StructField("season", T.CharType()), T.StructField("season", T.DoubleType())]))
No, pyspark.sql.types does not have a CharType type. See the link below for the available data types in Spark.
spark.createDataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
No, this is not correct Spark syntax. If you considered this option to be correct, you may have some experience with Python's pandas package, in which this would be valid syntax. To create a Spark DataFrame from a pandas DataFrame, you can simply use spark.createDataFrame(pandasDf), where pandasDf is the pandas DataFrame. Find out more about Spark syntax options using the examples in the documentation for SparkSession.createDataFrame linked below.
spark.DataFrame({"season": ["winter","summer"], "wind_speed_ms": [4.5, 7.5]})
No, the SparkSession (indicated by spark in the code above) does not have a DataFrame method.
More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.1 documentation and Data Types - Spark 3.1.2 Documentation
Static notebook | Dynamic notebook: See test 1
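For illustration, a minimal sketch of the correct option in action (assuming a SparkSession named spark, as in the question), with the inferred schema shown as comments:

df = spark.createDataFrame([("summer", 4.5), ("winter", 7.5)], ["season", "wind_speed_ms"])
df.printSchema()
# root
#  |-- season: string (nullable = true)
#  |-- wind_speed_ms: double (nullable = true)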
Question 53
Which of the following code blocks returns about 150 randomly selected rows from the 1000-row DataFrame transactionsDf, assuming that any row can appear more than once in the returned DataFrame?
Correct Answer: E
Explanation
Answering this question correctly depends on whether you understand the arguments to the DataFrame.sample() method (link to the documentation below). Its signature is DataFrame.sample(withReplacement=None, fraction=None, seed=None).
The first argument, withReplacement, specifies whether a row can be drawn from the DataFrame multiple times. By default, this option is disabled in Spark, but we have to enable it here, since the question asks for rows to be able to appear more than once. So, we need to pass True for this argument.
About replacement: "Replacement" is easiest explained with the example of removing random items from a box. When you remove them "with replacement", it means that after you have taken an item out of the box, you put it back inside. So, essentially, if you randomly take 10 items out of a box with 100 items, there is a chance you take the same item twice or more. "Without replacement" means that you do not put the item back into the box after removing it. So, every time you remove an item from the box, there is one less item in the box and you can never take the same item twice.
The second argument to the sample() method is fraction. This refers to the fraction of rows that should be returned. In the question we are asked for 150 out of 1000 rows - a fraction of 0.15.
The last argument is a random seed. A random seed makes a randomized process repeatable: if you re-run the same sample() operation with the same random seed, you get the same rows returned from the sample() command. The question specifies no behavior around the random seed, so the varying random seeds in the answer options are only there to confuse you!
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
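Putting this together, a minimal sketch of the matching call (assuming transactionsDf exists as described in the question; the seed value below is arbitrary, since the question does not constrain it):

# withReplacement=True lets a row be drawn more than once;
# fraction=0.15 returns roughly 150 of the 1000 rows.
sampledDf = transactionsDf.sample(withReplacement=True, fraction=0.15, seed=65)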
Question 54
Which of the following statements about broadcast variables is correct?
Correct Answer: C
Explanation
Broadcast variables are local to the worker node and not shared across the cluster.
This is wrong because broadcast variables are meant to be shared across the cluster. As such, they are never just local to the worker node, but available to all worker nodes.
Broadcast variables are commonly used for tables that do not fit into memory.
This is wrong because broadcast variables can only be broadcast because they are small and do fit into memory.
Broadcast variables are serialized with every single task.
This is wrong because they are cached on every machine in the cluster, precisely to avoid having to serialize them with every single task.
Broadcast variables are occasionally dynamically updated on a per-task basis.
This is wrong because broadcast variables are immutable - they are never updated.
More info: Spark - The Definitive Guide, Chapter 14
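To make the caching and immutability points concrete, here is a minimal sketch of creating and reading a broadcast variable in PySpark (the SparkSession named spark and the lookup dictionary are assumptions made purely for illustration):

# Broadcast a small lookup table once; each worker caches a read-only copy
# instead of receiving it serialized with every single task.
season_codes = spark.sparkContext.broadcast({"summer": 1, "winter": 2})

df = spark.createDataFrame([("summer",), ("winter",)], ["season"])
# Tasks read the cached value via .value; the broadcast variable itself cannot be updated.
codes = df.rdd.map(lambda row: season_codes.value[row["season"]]).collect()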
Question 55
The code block shown below should return a new 2-column DataFrame that shows one attribute from column attributes per row next to the associated itemName, for all suppliers in column supplier whose name includes Sports. Choose the answer that correctly fills the blanks in the code block to accomplish this.
Sample of DataFrame itemsDf:
+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Code block:
itemsDf.__1__(__2__).select(__3__, __4__)
Correct Answer: E
Explanation
Output of correct code block:
+----------------------------------+------+
|itemName                          |col   |
+----------------------------------+------+
|Thick Coat for Walking in the Snow|blue  |
|Thick Coat for Walking in the Snow|winter|
|Thick Coat for Walking in the Snow|cozy  |
|Outdoors Backpack                 |green |
|Outdoors Backpack                 |summer|
|Outdoors Backpack                 |travel|
+----------------------------------+------+
The key to solving this question is knowing about Spark's explode operator. Using this operator, you can extract values from arrays into single rows. The following guidance steps through the answers systematically from the first to the last gap. Note that there are many ways to solve gap questions and to filter out wrong answers: you do not always have to start from the first gap, but can also exclude some answers based on obvious problems you see with them.
The answers to the first gap present you with two options: filter and where. These two are synonyms in PySpark, so using either of them is fine. The answer options to this gap therefore do not help us in selecting the right answer.
The second gap is more interesting. One answer option includes "Sports".isin(col("Supplier")). This construct does not work, since Python's string does not have an isin method. Another option contains col(supplier). Here, Python will try to interpret supplier as a variable. We have not set this variable, so this is not a viable answer. That leaves the answer options that include col("supplier").contains("Sports") and col("supplier").isin("Sports"). The question states that we are looking for suppliers whose name includes Sports, so we have to go with the contains operator here. We would use the isin operator if we wanted to filter for supplier names that match any entry in a list of supplier names.
Finally, we are left with two answers that both fill the third gap with "itemName" and fill the fourth gap with either explode("attributes") or "attributes". While both are correct Spark syntax, only explode("attributes") will help us achieve our goal. Specifically, the question asks for one attribute from column attributes per row - this is exactly what the explode() operator does. One answer option also includes array_explode(), which is not a valid operator in PySpark.
More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
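Putting the pieces together, a sketch of how the filled-in code block could look (assuming itemsDf is available as in the sample above, using filter, which is interchangeable with where, and appending .show(truncate=False) only to display the result):

from pyspark.sql.functions import col, explode

# Keep only rows whose supplier name contains "Sports", then emit one row
# per element of the attributes array next to the associated itemName.
itemsDf.filter(col("supplier").contains("Sports")).select("itemName", explode("attributes")).show(truncate=False)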