Which of the following statements about garbage collection in Spark is incorrect?
Correct Answer: C
Explanation
Manually persisting RDDs in Spark prevents them from being garbage collected. This statement is incorrect, and thus the correct answer to the question. Spark's garbage collector will remove even persisted objects, albeit in an "LRU" fashion. LRU stands for least recently used: during a garbage collection run, the objects that were used the longest time ago are collected first. See the linked StackOverflow post below for more information.

Serialized caching is a strategy to increase the performance of garbage collection. This statement is correct. The more Java objects Spark needs to collect during garbage collection, the longer it takes. Storing a collection of many Java objects, such as a DataFrame with a complex schema, as a single serialized byte array therefore increases performance: garbage collection takes less time on a serialized DataFrame than on an unserialized one.

Optimizing garbage collection performance in Spark may limit caching ability. This statement is correct. A full garbage collection run slows down a Spark application. When talking about "tuning" garbage collection, we mean reducing the frequency or duration of these slowdowns. A full garbage collection run is triggered when the Old generation of the Java heap space is almost full. (If you are unfamiliar with this concept, check out the link to the Garbage Collection Tuning docs below.) Thus, one measure to avoid triggering a garbage collection run is to prevent the Old generation share of the heap space from becoming almost full. To achieve this, one may decrease its size. Objects larger than the Old generation space will then be discarded instead of being cached (stored) in that space and filling it up. This decreases the number of full garbage collection runs, increasing overall performance. Inevitably, however, discarded objects will need to be recomputed when they are needed again. So, this mechanism only works well when a Spark application needs to reuse cached data as little as possible.

Garbage collection information can be accessed in the Spark UI's stage detail view. This statement is correct. The task table in the Spark UI's stage detail view has a "GC Time" column, indicating the garbage collection time needed per task.

In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector. This statement is correct. The G1 garbage collector, also known as the Garbage-First garbage collector, is an alternative to the default Parallel garbage collector. While the default Parallel garbage collector divides the heap into a few static regions, the G1 garbage collector divides the heap into many small regions that are created dynamically. The G1 garbage collector has certain advantages over the Parallel garbage collector which improve performance, particularly for Spark workloads that require high throughput and low latency. The G1 garbage collector is not enabled by default; you need to explicitly pass an argument to Spark to enable it. For more information about the two garbage collectors, check out the Databricks article linked below.
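As an illustration of the last two points, here is a minimal sketch of enabling G1 and persisting a DataFrame. The configuration key and JVM flag are standard Spark settings, but the application name and the example DataFrame are purely illustrative, and the exact tuning values depend on your workload.

from pyspark.sql import SparkSession

# G1 is not enabled by default: pass the JVM flag to the executors explicitly.
# (For the driver, the flag is usually passed at launch time, e.g. via
# spark-submit --conf spark.driver.extraJavaOptions=-XX:+UseG1GC.)
spark = (
    SparkSession.builder
    .appName("gc-tuning-sketch")  # illustrative name
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .getOrCreate()
)

# Even a manually persisted DataFrame remains subject to eviction,
# least recently used first.
df = spark.range(1_000_000)
df.persist()   # cached (MEMORY_AND_DISK by default), but not exempt from eviction
df.count()     # an action that materializes the cache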
Question 17
Which of the following code blocks returns a new DataFrame with the same columns as DataFrame transactionsDf, except for columns predError and value which should be removed?
Correct Answer: B
Explanation
More info: pyspark.sql.DataFrame.drop - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
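Since the answer options are not reproduced here, the following is a minimal sketch of the approach the correct answer relies on, assuming transactionsDf is already defined: DataFrame.drop() accepts multiple column names and returns a new DataFrame without them.

# New DataFrame with all columns of transactionsDf except predError and value.
transactionsDf.drop("predError", "value")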
Question 18
The code block displayed below contains an error. The code block should display the schema of DataFrame transactionsDf. Find the error. Code block: transactionsDf.rdd.printSchema
Correct Answer: D
Explanation
Correct code block:
transactionsDf.printSchema()

This is more of a knowledge question that you should just memorize or look up in the provided documentation during the exam. You can get more info about DataFrame.printSchema() in the documentation (link below); it is a plain and simple method that takes no arguments. One answer points to an alternative way of printing the schema: you could also use print(transactionsDf.schema). This gives you a readable, but not nicely formatted, description of the schema.

More info: pyspark.sql.DataFrame.printSchema - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
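A short sketch of both ways to inspect the schema, assuming transactionsDf is already defined:

transactionsDf.printSchema()   # prints the schema as a nicely formatted tree
print(transactionsDf.schema)   # prints a readable, but unformatted, StructType description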
Question 19
The code block shown below should return a DataFrame with all columns of DataFrame transactionsDf, but only maximum 2 rows in which column productId has at least the value 2. Choose the answer that correctly fills the blanks in the code block to accomplish this. transactionsDf.__1__(__2__).__3__
Correct Answer: D
Explanation
Correct code block:
transactionsDf.filter(col("productId") >= 2).limit(2)

The filter and where operators in gap 1 are just aliases of one another, so you cannot use them to pick the right answer. The column expression in gap 2 is more helpful. The DataFrame.filter() method takes an argument of type Column or str. Of all possible answers, only the one including col("productId") >= 2 fits this profile, since it returns a Column type. The answer option using "productId" > 2 is invalid: it is a plain Python comparison between the string "productId" and the integer 2, so Spark never gets to interpret "productId" as a reference to column productId. The answer option using transactionsDf[productId] >= 2 is wrong because productId, without quotation marks, is treated as a Python variable rather than a column name; square bracket notation requires a string, as in transactionsDf["productId"] (if you are coming from Python using Pandas, this is something to watch out for). In all other options, productId is likewise being referred to as a Python variable, so they are relatively easy to eliminate.

Also note that the question asks for the value in column productId to be at least 2. This translates to a "greater than or equal" comparison (>= 2), not a "greater than" comparison (> 2). Another thing worth noting is that there is no DataFrame.max() method. If you picked any option including it, you may be confusing it with the pyspark.sql.functions.max function. The correct method to limit the number of rows is DataFrame.limit().

More info:
- pyspark.sql.DataFrame.filter - PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.limit - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
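Putting the gaps together, a minimal sketch of the complete answer, assuming transactionsDf is already defined:

from pyspark.sql.functions import col

# Keep only rows where productId is at least 2, then return at most 2 of them.
transactionsDf.filter(col("productId") >= 2).limit(2)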
Question 20
The code block displayed below contains multiple errors. The code block should remove column transactionDate from DataFrame transactionsDf and add a column transactionTimestamp in which dates that are expressed as strings in column transactionDate of DataFrame transactionsDf are converted into unix timestamps. Find the errors.

Sample of DataFrame transactionsDf:

+-------------+---------+-----+-------+---------+----+----------------+
|transactionId|predError|value|storeId|productId|   f| transactionDate|
+-------------+---------+-----+-------+---------+----+----------------+
|            1|        3|    4|     25|        1|null|2020-04-26 15:35|
|            2|        6|    7|      2|        2|null|2020-04-13 22:01|
|            3|        3| null|     25|        3|null|2020-04-02 10:53|
+-------------+---------+-----+-------+---------+----+----------------+

Code block:

transactionsDf = transactionsDf.drop("transactionDate")
transactionsDf["transactionTimestamp"] = unix_timestamp("transactionDate", "yyyy-MM-dd")
Correct Answer: E
Explanation
This question requires a lot of thinking to get right. For solving it, you may take advantage of the digital notepad that is provided to you during the test. You have probably seen that the code block includes multiple errors. In the test, you are usually confronted with a code block that only contains a single error. However, since you are practicing here, this challenging multi-error question will make it easier for you to deal with single-error questions in the real exam.

You can clearly see that column transactionDate should be dropped only after transactionTimestamp has been written. This is because, to generate column transactionTimestamp, Spark needs to read the values from column transactionDate.

Values in column transactionDate in the original transactionsDf DataFrame look like 2020-04-26 15:35. So, to convert those correctly, you would have to pass yyyy-MM-dd HH:mm. In other words: the string indicating the date format should be adjusted.

While you might be tempted to change unix_timestamp() to to_unixtime() (in line with the from_unixtime() operator), this function does not exist in Spark. unix_timestamp() is the correct operator to use here. Also, there is no DataFrame.withColumnReplaced() operator. A similar operator that does exist is DataFrame.withColumnRenamed(). Whether you use col() or not is irrelevant with unix_timestamp() - the command is fine with both.

Finally, you cannot assign a column like transactionsDf["columnName"] = ... in Spark. This is Pandas syntax (Pandas is a popular Python package for data analysis), but it is not supported in Spark. So, you need to use Spark's DataFrame.withColumn() syntax instead.

More info: pyspark.sql.functions.unix_timestamp - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
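For reference, a sketch of the corrected code block as described above, assuming transactionsDf is already defined:

from pyspark.sql.functions import unix_timestamp

# First derive the timestamp (wrapping the column name in col() would work equally well),
# then drop the source column.
transactionsDf = transactionsDf.withColumn(
    "transactionTimestamp", unix_timestamp("transactionDate", "yyyy-MM-dd HH:mm")
)
transactionsDf = transactionsDf.drop("transactionDate")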