Which of the elements that are labeled with a circle and a number contain an error or are misrepresented?
Correct Answer: B
Explanation
1: Correct - this should just read "API" or "DataFrame API". The DataFrame is not part of the SQL API. To make a DataFrame accessible via SQL, you first need to create a DataFrame view; that view can then be accessed via SQL.
4: Although "K_38_INU" looks odd, it is a completely valid name for a DataFrame column.
6: No, StringType is a correct type.
7: Although StringType may not be the most efficient way to store a phone number, there is nothing fundamentally wrong with using this type here.
8: Correct - TreeType is not a type that Spark supports.
9: No, Spark DataFrames support ArrayType columns. In this case, the column would hold a sequence of elements with type LongType, which is also a valid type for Spark DataFrames.
10: There is nothing wrong with this row.
More info: Data Types - Spark 3.1.1 Documentation (https://bit.ly/3aAPKJT)
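As an illustration, a minimal schema sketch using the valid types mentioned above could look like the following (the column names here are hypothetical stand-ins for the ones in the question); note that pyspark.sql.types has no TreeType:

from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType

# Valid Spark SQL types: StringType for text (including phone numbers),
# ArrayType(LongType()) for a sequence of long integers.
schema = StructType([
    StructField("itemId", LongType(), True),                 # hypothetical column name
    StructField("K_38_INU", StringType(), True),             # odd-looking but valid column name
    StructField("phoneNumber", StringType(), True),          # StringType works, if not maximally efficient
    StructField("attributes", ArrayType(LongType()), True),  # sequence of LongType elements
])
# from pyspark.sql.types import TreeType  # would fail: TreeType does not exist in Spark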
Question 7
Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?
Correct Answer: A
Explanation
spark.read.json(filePath)
Correct. spark.read accesses Spark's DataFrameReader. Passing filePath into the DataFrameReader.json() method tells Spark to read the file at filePath as JSON.
spark.read.path(filePath)
Incorrect. Spark's DataFrameReader does not have a path() method. A universal way to read in files is provided by the DataFrameReader.load() method (link below).
spark.read.path(filePath, source="json")
Wrong. A DataFrameReader.path() method does not exist (see above).
spark.read().json(filePath)
Incorrect. spark.read is a way to access Spark's DataFrameReader. However, the DataFrameReader is not callable, so calling it via spark.read() will fail.
spark.read().path(filePath)
No, Spark's DataFrameReader is not callable (see above).
More info: pyspark.sql.DataFrameReader.json - PySpark 3.1.2 documentation, pyspark.sql.DataFrameReader.load - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
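A minimal sketch of the correct pattern (the value of filePath below is a hypothetical location):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
filePath = "/tmp/data.json"  # hypothetical path to a JSON file

# spark.read returns a DataFrameReader; .json() reads the file as JSON into a DataFrame.
df = spark.read.json(filePath)

# An equivalent, more generic alternative uses load() with an explicit format:
df2 = spark.read.format("json").load(filePath)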
Question 8
Which of the following code blocks returns a new DataFrame in which column attributes of DataFrame itemsDf is renamed to feature0 and column supplier to feature1?
Correct Answer: D
Explanation
itemsDf.withColumnRenamed("attributes", "feature0").withColumnRenamed("supplier", "feature1")
Correct! Spark's DataFrame.withColumnRenamed() syntax makes it relatively easy to change the name of a column.
itemsDf.withColumnRenamed(attributes, feature0).withColumnRenamed(supplier, feature1)
Incorrect. In this code block, the Python interpreter will try to use attributes and the other column names as variables. Since they are undefined, the block will not run.
itemsDf.withColumnRenamed(col("attributes"), col("feature0"), col("supplier"), col("feature1"))
Wrong. The DataFrame.withColumnRenamed() operator takes exactly two string arguments, so in this answer both using col() and passing four arguments are wrong.
itemsDf.withColumnRenamed("attributes", "feature0")
itemsDf.withColumnRenamed("supplier", "feature1")
No. In this answer, only column supplier ends up renamed in the returned DataFrame, since the result of the first line is never written back to itemsDf.
itemsDf.withColumn("attributes", "feature0").withColumn("supplier", "feature1")
Incorrect. While withColumn() works for adding and naming new columns, you cannot use it to rename existing columns.
More info: pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
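A minimal sketch of the correct chaining pattern (itemsDf is mocked up here from toy data purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for itemsDf with the two columns in question.
itemsDf = spark.createDataFrame(
    [(["blue", "winter"], "Sports Company Inc.")],
    ["attributes", "supplier"],
)

# Each withColumnRenamed() call returns a new DataFrame, so the calls can be chained.
renamedDf = (itemsDf
             .withColumnRenamed("attributes", "feature0")
             .withColumnRenamed("supplier", "feature1"))

print(renamedDf.columns)  # ['feature0', 'feature1']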
Question 9
The code block shown below should return only the average prediction error (column predError) of a random subset, without replacement, of approximately 15% of rows in DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this. transactionsDf.__1__(__2__, __3__).__4__(avg('predError'))
Correct Answer: B
Explanation
Correct code block:
transactionsDf.sample(withReplacement=False, fraction=0.15).select(avg('predError'))
You should remember that getting a random subset of rows means sampling, which in turn should point you to the DataFrame.sample() method. Once you know this, you can look up the correct order of arguments in the documentation (link below).
Lastly, you have to decide whether to use filter(), where() or select(). where() is just an alias for filter(), and filter() is not the correct method to use here, since it would only allow you to filter rows based on some condition. The question, however, asks to return only the average prediction error. You control the columns that a query returns with the select() method, so select() is the correct method to use here.
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
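A minimal sketch of the full expression (transactionsDf is mocked up here with toy values just to make the snippet self-contained):

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for transactionsDf with a predError column.
transactionsDf = spark.createDataFrame(
    [(1, 3.0), (2, 6.0), (3, 3.0), (4, 9.0)],
    ["transactionId", "predError"],
)

# sample() draws roughly 15% of the rows without replacement; select(avg(...))
# reduces the result to a single row holding the average prediction error of that sample.
result = (transactionsDf
          .sample(withReplacement=False, fraction=0.15)
          .select(avg("predError")))
result.show()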
Question 10
Which of the following is the idea behind dynamic partition pruning in Spark?
Correct Answer: A
Explanation
Dynamic partition pruning skips reading partitions that a query does not need: filter criteria derived at runtime from one side of a join (typically a filtered dimension table) are applied to prune the partitions scanned on the other side (typically a large, partitioned fact table). The other answer options describe something else:
Dynamic partition pruning reoptimizes query plans based on runtime statistics collected during query execution.
No - this is what adaptive query execution does, not dynamic partition pruning.
Dynamic partition pruning concatenates columns of similar data types to optimize join performance.
Wrong - this answer does not make sense, especially in relation to dynamic partition pruning.
Dynamic partition pruning reoptimizes physical plans based on data types and broadcast variables.
It is true that dynamic partition pruning works in joins using broadcast variables; this happens in both the logical optimization and the physical planning stage. However, data types do not play a role in the reoptimization.
Dynamic partition pruning performs wide transformations on disk instead of in memory.
This answer does not make sense. Dynamic partition pruning is meant to accelerate Spark - performing any transformation on disk instead of in memory would slow Spark down and achieve the opposite of what dynamic partition pruning is intended for.
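A minimal sketch of the kind of query that benefits from dynamic partition pruning (the table and column names here are hypothetical, and the tables are assumed to already exist, with sales partitioned by date); the feature is controlled by spark.sql.optimizer.dynamicPartitionPruning.enabled, which is enabled by default in Spark 3.x:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

sales = spark.table("sales")  # hypothetical large fact table, partitioned by "date"
dates = spark.table("dates")  # hypothetical small dimension table with a "holiday" flag

# The filter on the dimension table is evaluated at runtime and its result is used
# to prune the "date" partitions of the fact table, so unneeded partitions are never read.
result = (sales
          .join(dates, "date")
          .where(dates["holiday"] == True))
result.explain()  # dynamic pruning shows up as a partition filter in the physical plan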