Which of the following code blocks adds a column predErrorSqrt to DataFrame transactionsDf that is the square root of column predError?
Correct Answer: D
Explanation

transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError")))
Correct. The DataFrame.withColumn() operator adds a new column to a DataFrame. It takes two arguments: the name of the new column (here: predErrorSqrt) and a Column expression for the new column. In PySpark, a Column expression means referring to a column via col("predError"), or by other means such as transactionsDf.predError, or even just the column name as a string, "predError". The question asks for the square root. sqrt() is a function in pyspark.sql.functions that calculates the square root. It takes a value or a Column as input; here it receives the predError column of DataFrame transactionsDf, expressed through col("predError").

transactionsDf.withColumn("predErrorSqrt", sqrt(predError))
Incorrect. sqrt(predError) is invalid syntax: Python interprets predError as a (non-existent) variable name, so this fails before Spark ever evaluates the expression. You could pass transactionsDf.predError, col("predError") (as in the correct solution), or even just "predError" instead.

transactionsDf.select(sqrt(predError))
Wrong. The explanation just above about how to refer to predError applies here as well.

transactionsDf.select(sqrt("predError"))
No. While this is correct syntax, it returns a single-column DataFrame containing only the square root of column predError. The question, however, asks for a column to be added to the original DataFrame transactionsDf.

transactionsDf.withColumn("predErrorSqrt", col("predError").sqrt())
No. The issue with this statement is that column col("predError") has no sqrt() method: sqrt() is a member of pyspark.sql.functions, not of pyspark.sql.Column.

More info: pyspark.sql.DataFrame.withColumn - PySpark 3.1.2 documentation and pyspark.sql.functions.sqrt - PySpark 3.1.2 documentation
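For reference, a minimal runnable sketch of the correct approach. The sample rows and values are hypothetical; only the withColumn()/sqrt() pattern is taken from the answer.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sqrt

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for transactionsDf; the real schema may differ.
transactionsDf = spark.createDataFrame([(1, 4.0), (2, 9.0)], ["transactionId", "predError"])

# withColumn() adds predErrorSqrt; sqrt() comes from pyspark.sql.functions, not from Column.
transactionsDf.withColumn("predErrorSqrt", sqrt(col("predError"))).show()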
Question 2
Which of the following code blocks concatenates rows of DataFrames transactionsDf and transactionsNewDf, omitting any duplicates?
Correct Answer: B
Explanation DataFrame.unique() and DataFrame.concat() do not exist, union() is not a method of the SparkSession, and there is no union option for the how parameter of the DataFrame.join() method. More info: pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation
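The explanation does not spell out the correct block, but given the linked union() documentation and the requirement to omit duplicates, it is presumably transactionsDf.union(transactionsNewDf).distinct(). A minimal sketch with hypothetical data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins; in the question, both DataFrames share the same schema.
transactionsDf = spark.createDataFrame([(1, "a"), (2, "b")], ["transactionId", "value"])
transactionsNewDf = spark.createDataFrame([(2, "b"), (3, "c")], ["transactionId", "value"])

# union() appends rows by position and keeps duplicates; distinct() then removes them.
transactionsDf.union(transactionsNewDf).distinct().show()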
Question 3
Which of the following code blocks reads in the JSON file stored at filePath as a DataFrame?
Correct Answer: A
Explanation

spark.read.json(filePath)
Correct. spark.read accesses Spark's DataFrameReader. Spark then identifies the file type to be read as JSON by passing filePath into the DataFrameReader.json() method.

spark.read.path(filePath)
Incorrect. Spark's DataFrameReader does not have a path() method. A universal way to read in files is provided by the DataFrameReader.load() method (link below).

spark.read.path(filePath, source="json")
Wrong. A DataFrameReader.path() method does not exist (see above).

spark.read().json(filePath)
Incorrect. spark.read is a way to access Spark's DataFrameReader. However, the DataFrameReader is not callable, so calling it via spark.read() will fail.

spark.read().path(filePath)
No, Spark's DataFrameReader is not callable (see above).

More info: pyspark.sql.DataFrameReader.json - PySpark 3.1.2 documentation, pyspark.sql.DataFrameReader.load - PySpark 3.1.2 documentation
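A minimal sketch of the correct option. The file path and its contents are hypothetical and only serve to make the example self-contained:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path; write a tiny line-delimited JSON file so the read works locally.
filePath = "/tmp/example.json"
with open(filePath, "w") as f:
    f.write('{"transactionId": 1, "predError": 4.0}\n{"transactionId": 2, "predError": 9.0}\n')

# spark.read returns a DataFrameReader; .json() parses the file at filePath as JSON.
df = spark.read.json(filePath)
df.printSchema()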
Question 4
The code block shown below should return a two-column DataFrame with columns transactionId and supplier, with combined information from DataFrames itemsDf and transactionsDf. The code block should merge rows in which column productId of DataFrame transactionsDf matches the value of column itemId in DataFrame itemsDf, but only where column storeId of DataFrame transactionsDf does not match column itemId of DataFrame itemsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this. Code block: transactionsDf.__1__(itemsDf, __2__).__3__(__4__)
Correct Answer: C
Explanation

This question is pretty complex and, in its complexity, is probably above what you would encounter in the exam. However, by reading the question carefully, you can use your logic skills to weed out the wrong answers.

First, examine the join statement that is common to all answers. The first argument of the join() operator (documentation linked below) is the DataFrame to be joined with. Where join fills gap 3, the first argument in gap 4 would therefore have to be another DataFrame, which is not the case for any of those answers. So you can immediately discard two answers.

In all remaining answers, join fills gap 1 and is followed by (itemsDf, ... according to the code block, leaving three candidates. Looking further at the join() statement, the second argument (on=) expects "a string for the join column name, a list of column names, a join expression (Column), or a list of Columns", according to the documentation. One answer option passes a list of join expressions (transactionsDf.productId==itemsDf.itemId, transactionsDf.storeId!=itemsDf.itemId), which is unsupported according to the documentation, so we can discard that answer as well, leaving two candidates.

Both remaining candidates have valid syntax, but only one of them fulfills the condition in the question: "only where column storeId of DataFrame transactionsDf does not match column itemId of DataFrame itemsDf". That remaining answer option has to be the correct one.

As you can see, although sometimes overwhelming at first, even more complex questions can be figured out by rigorously applying the knowledge you can gain from the documentation during the exam.

More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation
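Putting the pieces together, the completed block presumably looks like the sketch below. The sample rows and supplier names are hypothetical; the point is the single join expression combining the two conditions with &, followed by select().

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the two DataFrames in the question.
transactionsDf = spark.createDataFrame(
    [(1, 10, 25), (2, 11, 11)], ["transactionId", "productId", "storeId"]
)
itemsDf = spark.createDataFrame([(10, "Acme"), (11, "Globex")], ["itemId", "supplier"])

# Join where productId matches itemId but storeId does not, then keep the two requested columns.
joined = transactionsDf.join(
    itemsDf,
    (transactionsDf.productId == itemsDf.itemId) & (transactionsDf.storeId != itemsDf.itemId),
).select("transactionId", "supplier")
joined.show()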
Question 5
The code block shown below should return a column that indicates through boolean variables whether rows in DataFrame transactionsDf have values greater than or equal to 20 and smaller than or equal to 30 in column storeId, and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this. Code block: transactionsDf.__1__((__2__.__3__) __4__ (__5__))
Correct Answer: D
Explanation

Correct code block: transactionsDf.select((col("storeId").between(20, 30)) & (col("productId")==2))

Although this question may make you think it asks for a filter or where statement, it does not. It explicitly asks to return a column with booleans, which should point you to the select statement. Another trick here is the rarely used between() method: it exists and resolves to ((storeId >= 20) AND (storeId <= 30)) in SQL. geq() and leq() do not exist. A further riddle is how to chain the two conditions. The only valid answer here is &; operators like && or and are not valid. Other boolean operators that are valid on Spark Columns are | (or) and ~ (not).
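A minimal runnable sketch of the correct code block, with hypothetical sample rows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for transactionsDf.
transactionsDf = spark.createDataFrame(
    [(1, 25, 2), (2, 35, 2), (3, 25, 3)], ["transactionId", "storeId", "productId"]
)

# between(20, 30) resolves to ((storeId >= 20) AND (storeId <= 30)); & chains the two
# boolean Columns, so select() returns a single boolean column.
transactionsDf.select(
    (col("storeId").between(20, 30)) & (col("productId") == 2)
).show()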