Which of the following code blocks concatenates rows of DataFrames transactionsDf and transactionsNewDf, omitting any duplicates?
Correct Answer: B
Explanation
To concatenate the rows of two DataFrames while omitting duplicates, chain DataFrame.union() with distinct(): union() appends the rows of transactionsNewDf to transactionsDf, and distinct() then removes the duplicate rows, since union() by itself keeps duplicates. The other options fail because DataFrame.unique() and DataFrame.concat() do not exist, union() is not a method of the SparkSession, and DataFrame.join() does not accept a "union" join type. More info: pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation
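To illustrate, here is a minimal runnable sketch of this approach; the sample data and column names are made up for demonstration only:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-ins for transactionsDf and transactionsNewDf; both must share the same schema
transactionsDf = spark.createDataFrame([(1, "a"), (2, "b")], ["transactionId", "value"])
transactionsNewDf = spark.createDataFrame([(2, "b"), (3, "c")], ["transactionId", "value"])

# union() appends the rows of transactionsNewDf; distinct() then removes duplicate rows
combinedDf = transactionsDf.union(transactionsNewDf).distinct()
combinedDf.show()  # the duplicate row (2, "b") appears only once

Note that union() matches columns by position, not by name; if the two DataFrames' columns could be ordered differently, unionByName() is the safer choice.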
Question 28
Which of the following describes a shuffle?
Correct Answer: C
Explanation
A shuffle is the mechanism by which Spark redistributes data across partitions, and potentially across executors, which is why it typically occurs during wide transformations such as joins, groupBy(), or repartition(). With that in mind, the incorrect options can be ruled out:
A shuffle is a Spark operation that results from DataFrame.coalesce(). No. DataFrame.coalesce() does not result in a shuffle.
A shuffle is a process that allocates partitions to executors. This is incorrect.
A shuffle is a process that is executed during a broadcast hash join. No, broadcast hash joins avoid shuffles and yield performance benefits if at least one of the two tables is small (<= 10 MB by default). Broadcast hash joins can avoid shuffles because, instead of exchanging partitions between executors, they broadcast the small table to all executors, which then perform the rest of the join operation locally.
A shuffle is a process that compares data across executors. No, in a shuffle, data is compared across partitions, not across executors.
More info: Spark Repartition & Coalesce - Explained (https://bit.ly/32KF7zS)
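For intuition, the following sketch (with made-up DataFrame sizes) contrasts an operation that triggers a shuffle with two that avoid one:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000)
small = spark.range(100)

# repartition() redistributes rows across partitions - a full shuffle
shuffled = df.repartition(10)

# coalesce() merges existing partitions without a full shuffle
narrowed = df.coalesce(2)

# broadcast() hints Spark to ship the small table to every executor,
# so the join can run locally instead of shuffling df
joined = df.join(broadcast(small), "id")
joined.explain()  # the physical plan shows a BroadcastHashJoin rather than a SortMergeJoin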
Question 29
The code block shown below should show information about the data type that column storeId of DataFrame transactionsDf contains. Choose the answer that correctly fills the blanks in the code block to accomplish this. Code block: transactionsDf.__1__(__2__).__3__
Correct Answer: B
Explanation
Correct code block: transactionsDf.select("storeId").printSchema()
The difficulty of this question is that it is hard to solve with the stepwise first-to-last-gap approach that has worked well for similar questions, since the answer options are so different from one another. Instead, you might want to eliminate answers by looking for patterns of frequently wrong answers.
A first pattern that you may recognize by now is that column names are not expressed in quotes in wrong answers. For this reason, the answer that includes storeId without quotes should be eliminated. By now, you may also have understood that DataFrame.limit() is useful for returning a specified number of rows; it has nothing to do with specific columns. For this reason, the answer that resolves to limit("storeId") can be eliminated.
Given that we are interested in information about the data type, you should question whether the answer that resolves to limit(1).columns provides this information. While DataFrame.columns is a valid call, it only reports column names, not column types, so you can eliminate this option as well.
The two remaining options use either the printSchema() or the print_schema() command. You may remember that DataFrame.printSchema() is the only valid command of the two. The select("storeId") part simply returns the storeId column of transactionsDf, which works here since we are only interested in that column's type anyway. More info: pyspark.sql.DataFrame.printSchema - PySpark 3.1.2 documentation
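A minimal sketch of the correct call, with made-up data so it can be run as is; the real storeId type may of course differ:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-in for transactionsDf
transactionsDf = spark.createDataFrame([(1, 25), (2, 3)], ["transactionId", "storeId"])

# printSchema() reports the data type of the selected column
transactionsDf.select("storeId").printSchema()
# root
#  |-- storeId: long (nullable = true)

# columns only lists names, not types, so it cannot answer the question
print(transactionsDf.limit(1).columns)  # ['transactionId', 'storeId']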
Question 30
The code block shown below should return a DataFrame with only columns from DataFrame transactionsDf for which there is a corresponding transactionId in DataFrame itemsDf. DataFrame itemsDf is very small and much smaller than DataFrame transactionsDf. The query should be executed in an optimized way. Choose the answer that correctly fills the blanks in the code block to accomplish this. Code block: __1__.__2__(__3__, __4__, __5__)
Correct Answer: C
Explanation
Correct code block: transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")
This question is extremely difficult and exceeds the difficulty of questions in the exam by far.
A first indication of what is asked of you here is the remark that "the query should be executed in an optimized way". You also have qualitative information about the sizes of itemsDf and transactionsDf. Given that itemsDf is "very small" and that the execution should be optimized, you should consider instructing Spark to perform a broadcast join, broadcasting the "very small" DataFrame itemsDf to all executors. You can explicitly suggest this to Spark by wrapping itemsDf in a broadcast() operator. One answer option does not include this operator, so you can disregard it. Another answer option wraps the broadcast() operator around transactionsDf, the bigger of the two DataFrames. This answer option does not make sense in the optimization context and can likewise be disregarded.
When thinking about the broadcast() operator, you may also remember that it is a method of pyspark.sql.functions. One answer option, however, resolves to itemsDf.broadcast([...]). The DataFrame class has no broadcast() method, so this answer option can be eliminated as well.
Both remaining answer options resolve to transactionsDf.join([...]) in the first two gaps, so you will have to figure out the details of the join now. You can pick between an outer and a left semi join. An outer join would include columns from both DataFrames, whereas a left semi join only includes columns from the "left" table, here transactionsDf, just as asked for by the question. So, the correct answer is the one that uses the left_semi join.
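Here is a minimal runnable sketch of the correct code block, using made-up sample data; the real DataFrames would have more columns and rows:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Illustrative stand-ins for transactionsDf and itemsDf
transactionsDf = spark.createDataFrame(
    [(1, 100.0), (2, 50.0), (3, 75.0)], ["transactionId", "amount"])
itemsDf = spark.createDataFrame([(1,), (3,)], ["transactionId"])

# left_semi keeps only transactionsDf's columns and only those rows whose
# transactionId also appears in itemsDf; broadcast() suggests shipping the
# small itemsDf to all executors so transactionsDf need not be shuffled
result = transactionsDf.join(broadcast(itemsDf), "transactionId", "left_semi")
result.show()  # transactions 1 and 3 remain, with only transactionsDf's columns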