Which of the following describes the characteristics of accumulators?
Correct Answer: E
Explanation

If an action including an accumulator fails during execution and Spark manages to restart the action and complete it successfully, only the successful attempt will be counted in the accumulator.
Correct. When Spark reruns a failed action that includes an accumulator, only the attempt that succeeded updates the accumulator.

Accumulators are immutable.
No. Although accumulators behave like write-only variables towards the executors and can only be read by the driver, they are not immutable.

All accumulators used in a Spark application are listed in the Spark UI.
Incorrect. For Scala, only named accumulators, not unnamed ones, are listed in the Spark UI. For PySpark, no accumulators are listed in the Spark UI - this feature is not yet implemented.

Accumulators are used to pass around lookup tables across the cluster.
Wrong - this is what broadcast variables do.

Accumulators can be instantiated directly via the accumulator(n) method of the pyspark.RDD module.
Wrong. Accumulators are instantiated via the accumulator(n) method of the SparkContext, for example: counter = spark.sparkContext.accumulator(0).

More info: python - In Spark, RDDs are immutable, then how Accumulators are implemented? - Stack Overflow, apache spark - When are accumulators truly reliable? - Stack Overflow, Spark - The Definitive Guide, Chapter 14
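A minimal sketch of this pattern, assuming a hypothetical job that counts rows; only spark.sparkContext.accumulator(0) is taken from the explanation above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Accumulators are created via the SparkContext, not via pyspark.RDD.
counter = spark.sparkContext.accumulator(0)

def count_row(row):
    # Executors can only add to the accumulator ("write-only" from their side).
    counter.add(1)

# foreach() is an action, so the accumulator update is applied reliably here.
spark.range(100).foreach(count_row)

# Only the driver can read the accumulated value.
print(counter.value)  # 100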
Question 12
Which of the following code blocks returns a new DataFrame in which column attributes of DataFrame itemsDf is renamed to feature0 and column supplier to feature1?
Correct Answer: D
Explanation

itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")
Correct! Spark's DataFrame.withColumnRenamed syntax makes it relatively easy to change the name of a column.

itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)
Incorrect. In this code block, the Python interpreter will try to use attributes and the other column names as variables. Needless to say, they are undefined, and as a result the block will not run.

itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))
Wrong. The DataFrame.withColumnRenamed() operator takes exactly two string arguments. So, in this answer, both using col() and passing four arguments are wrong.

itemsDf.withColumnRenamed("attributes", "feature0")
itemsDf.withColumnRenamed("supplier", "feature1")
No. In this answer, the returned DataFrame will only have column supplier renamed, since the result of the first line is not written back to itemsDf.

itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")
Incorrect. While withColumn works for adding and naming new columns, you cannot use it to rename existing columns.

More info: pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
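A short, self-contained sketch of the correct answer; the sample rows and schema for itemsDf are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for itemsDf, just to make the example runnable.
itemsDf = spark.createDataFrame(
    [(1, ["blue", "winter"], "Sports Company Inc.")],
    ["itemId", "attributes", "supplier"],
)

# Each withColumnRenamed() call returns a new DataFrame; chaining renames both columns.
renamedDf = (itemsDf
             .withColumnRenamed("attributes", "feature0")
             .withColumnRenamed("supplier", "feature1"))

print(renamedDf.columns)  # ['itemId', 'feature0', 'feature1']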
Question 13
The code block displayed below contains an error. The code block should write DataFrame transactionsDf as a parquet file to location filePath after partitioning it on column storeId. Find the error. Code block: transactionsDf.write.partitionOn("storeId").parquet(filePath)
Correct Answer: E
Explanation

No method partitionOn() exists for the DataFrame class; partitionBy() should be used instead.
Correct! Find out more about partitionBy() in the documentation (linked below).

The operator should use the mode() option to configure the DataFrameWriter so that it replaces any existing files at location filePath.
No. The question gives no indication that existing files should be overwritten.

The partitioning column as well as the file path should be passed to the write() method of DataFrame transactionsDf directly and not as appended commands as in the code block.
Incorrect. To write a DataFrame to disk, you need to work with a DataFrameWriter object, which you access through the DataFrame.write property - no parentheses involved.

Column storeId should be wrapped in a col() operator.
No, this is not necessary - the problem is in the partitionOn command (see above).

The partitionOn method should be called before the write method.
Wrong. First of all, partitionOn is not a valid method of DataFrame. Even if partitionOn were replaced by partitionBy (which is a valid method), partitionBy is a method of DataFrameWriter and not of DataFrame. So, you always have to call DataFrame.write first to get access to the DataFrameWriter object and only then call partitionBy.

More info: pyspark.sql.DataFrameWriter.partitionBy - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
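A minimal sketch of the corrected code block, with hypothetical sample data for transactionsDf and a placeholder filePath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for transactionsDf and filePath.
transactionsDf = spark.createDataFrame(
    [(1, 25, 4.99), (2, 25, 19.99), (3, 3, 0.99)],
    ["transactionId", "storeId", "value"],
)
filePath = "/tmp/transactions_parquet"

# DataFrame.write (a property, no parentheses) returns a DataFrameWriter;
# partitionBy() is called on the writer, not on the DataFrame itself.
transactionsDf.write.partitionBy("storeId").parquet(filePath)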
Question 14
Which of the following code blocks returns a 2-column DataFrame that shows the distinct values in column productId and the number of rows with that productId in DataFrame transactionsDf?
Correct Answer: D
Explanation

transactionsDf.groupBy("productId").count()
Correct. This code block first groups DataFrame transactionsDf by column productId and then counts the rows in each group.

transactionsDf.groupBy("productId").select(count("value"))
Incorrect. You cannot call select on a GroupedData object (the output of a groupBy() statement).

transactionsDf.count("productId")
No. DataFrame.count() does not take any arguments.

transactionsDf.count("productId").distinct()
Wrong. Since DataFrame.count() does not take any arguments, this option cannot be right.

transactionsDf.groupBy("productId").agg(col("value").count())
False. A Column object, as returned by col("value"), does not have a count() method. You can see all available methods for the Column object in the Spark documentation linked below.

More info: pyspark.sql.DataFrame.count - PySpark 3.1.2 documentation, pyspark.sql.Column - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
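A short sketch of the correct answer, using hypothetical sample data for transactionsDf:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for transactionsDf.
transactionsDf = spark.createDataFrame(
    [(1, "p1"), (2, "p1"), (3, "p2")],
    ["transactionId", "productId"],
)

# groupBy() returns a GroupedData object; count() turns it back into a
# two-column DataFrame with the distinct productId values and their row counts.
transactionsDf.groupBy("productId").count().show()
# +---------+-----+
# |productId|count|
# +---------+-----+
# |       p1|    2|
# |       p2|    1|
# +---------+-----+
# (row order may vary)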
Question 15
Which of the elements in the labeled panels represent the operation performed for broadcast variables?
Correct Answer: C
Explanation

2,3
Correct! Both panels 2 and 3 represent the operation performed for broadcast variables. While a broadcast operation may look like panel 3, with the driver being the bottleneck, it most probably looks like panel 2. This is because the torrent protocol sits behind Spark's broadcast implementation. In the torrent protocol, each executor tries to fetch missing broadcast variables from the driver or from other nodes, preventing the driver from becoming the bottleneck.

1,2
Wrong. While panel 2 may represent broadcasting, panel 1 shows bi-directional communication, which does not occur in broadcast operations.

3
No. While broadcasting may materialize as shown in panel 3, its use of the torrent protocol also enables communication as shown in panel 2 (see the first explanation).

1,3,4
No. While panel 3 shows broadcasting, panel 1 shows bi-directional communication - not a characteristic of broadcasting. Panel 4 shows uni-directional communication, but in the wrong direction; it resembles an accumulator variable more than a broadcast variable.

2,5
Incorrect. While panel 2 shows broadcasting, panel 5 includes bi-directional communication - not a characteristic of broadcasting.

More info: Broadcast Join with Spark - henning.kropponline.de
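To contrast the two shared-variable types in code, a minimal sketch of a broadcast variable used as a lookup table; the data and names are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A small lookup table broadcast once to every executor (read-only there),
# unlike accumulators, whose updates flow from the executors back to the driver.
store_names = spark.sparkContext.broadcast({25: "Downtown", 3: "Airport"})

df = spark.createDataFrame([(1, 25), (2, 3)], ["transactionId", "storeId"])

def lookup(row):
    # Executors read the broadcast value locally; the lookup table is not shuffled.
    return (row.transactionId, store_names.value.get(row.storeId))

print(df.rdd.map(lookup).collect())  # [(1, 'Downtown'), (2, 'Airport')]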