Which of the following describes a narrow transformation?
Correct Answer: E
Explanation
A narrow transformation is an operation in which no data is exchanged across the cluster.
Correct! In narrow transformations, no data is exchanged across the cluster, since these transformations do not require any data from outside of the partition they are applied on. Typical narrow transformations include filter, drop, and coalesce.

A narrow transformation is an operation in which data is exchanged across partitions.
No, that would be one definition of a wide transformation, but not of a narrow transformation. Wide transformations typically cause a shuffle, in which data is exchanged across partitions, executors, and the cluster.

A narrow transformation is an operation in which data is exchanged across the cluster.
No, see the explanation just above this one.

A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.
No, type conversion has nothing to do with narrow transformations in Spark.

A narrow transformation is a process in which data from multiple RDDs is used.
No. A resilient distributed dataset (RDD) can be described as a collection of partitions. In a narrow transformation, no data is exchanged between partitions, and thus no data is exchanged between RDDs. One could say, though, that a narrow transformation, and in fact any transformation, results in a new RDD being created. This is because a transformation changes an existing RDD (RDDs are the foundation of other Spark data structures, like DataFrames), but since RDDs are immutable, a new RDD needs to be created to reflect that change.

More info: Spark Transformation and Action: A Deep Dive | by Misbah Uddin | CodeX | Medium
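For context, here is a minimal PySpark sketch contrasting narrow and wide transformations; the DataFrame and column names are made up for illustration only:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "A"), (2, "B"), (3, "A")], ["id", "group"])

# Narrow transformations: each output partition depends on only one input
# partition, so no data is exchanged across the cluster.
narrowDf = df.filter(col("id") > 1).drop("group").coalesce(1)

# Wide transformation: groupBy requires a shuffle, because rows sharing a key
# may live in different partitions.
wideDf = df.groupBy("group").count()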
Question 57
Which of the following code blocks returns all unique values of column storeId in DataFrame transactionsDf?
Correct Answer: B
Explanation distinct() is a method of a DataFrame. Knowing this, or recognizing this from the documentation, is the key to solving this question. More info: pyspark.sql.DataFrame.distinct - PySpark 3.1.2 documentation
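The exact answer option is not reproduced here, but a sketch consistent with the explanation would be:

# Select only the storeId column, then keep one row per distinct value.
uniqueStoreIds = transactionsDf.select("storeId").distinct()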
Question 58
Which of the following code blocks returns approximately 1000 rows, some of them potentially being duplicates, from the 2000-row DataFrame transactionsDf that only has unique rows?
Correct Answer: A
Explanation To solve this question, you need to know that DataFrame.sample() is not guaranteed to return the exact fraction of rows specified in its fraction argument. Furthermore, since duplicates may be returned, the method's withReplacement argument should be set to True. A force= argument does not exist for sample(). While take() returns an exact number of rows, it simply takes the first specified number of rows (1000 in this question) from the DataFrame. Since the DataFrame does not include duplicate rows, none of the rows returned by take() could be duplicates, so the correct answer cannot involve take(). More info: pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
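As a sketch (the exact answer option is not reproduced here), the behavior described above corresponds to a call like the following:

# fraction=0.5 of the 2000 rows yields approximately, not exactly, 1000 rows;
# withReplacement=True means the same row may be returned more than once.
sampledDf = transactionsDf.sample(withReplacement=True, fraction=0.5)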
Question 59
The code block shown below should return a column that indicates through boolean variables whether rows in DataFrame transactionsDf have values greater than or equal to 20 and smaller than or equal to 30 in column storeId and have the value 2 in column productId. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__((__2__.__3__) __4__ (__5__))
Correct Answer: D
Explanation
Correct code block: transactionsDf.select((col("storeId").between(20, 30)) & (col("productId") == 2))
Although this question may make you think that it asks for a filter or where statement, it does not. It explicitly asks to return a column of booleans, which should point you to the select statement. Another trick here is the rarely used between() method. It exists and resolves to ((storeId >= 20) AND (storeId <= 30)) in SQL. geq() and leq() do not exist. Another riddle here is how to chain the two conditions. The only valid answer is &. Operators like && or and are not valid. Other boolean operators that would be valid in Spark are | (or) and ~ (not).
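Written out as a runnable snippet (assuming transactionsDf has storeId and productId columns), the correct code block looks like this:

from pyspark.sql.functions import col

# between(20, 30) is inclusive on both ends; & combines the two boolean
# columns element-wise into a single boolean column.
result = transactionsDf.select(
    (col("storeId").between(20, 30)) & (col("productId") == 2)
)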
Question 60
Which of the following statements about reducing out-of-memory errors is incorrect?
Correct Answer: A
Explanation
Concatenating multiple string columns into a single column may guard against out-of-memory errors.
Exactly, this statement is incorrect and therefore the correct answer! Concatenating string columns does not reduce the size of the data, it just structures it in a different way. This does little to change how Spark processes the data and certainly does not reduce out-of-memory errors.

Reducing partition size can help against out-of-memory errors.
No, this is not incorrect. Reducing partition size is a viable way to guard against out-of-memory errors, since executors need to load partitions into memory before processing them. If an executor does not have enough memory available to do that, it will throw an out-of-memory error. Decreasing partition size can therefore be very helpful for preventing that.

Decreasing the number of cores available to each executor can help against out-of-memory errors.
No, this is not incorrect. To process a partition, that partition needs to be loaded into the memory of an executor. If you imagine that every core in every executor processes a partition, potentially in parallel with other executors, you can see that memory on the machine hosting the executors fills up quite quickly. So memory usage of executors is a concern, especially when multiple partitions are processed at the same time. To strike a balance between performance and memory usage, decreasing the number of cores may help against out-of-memory errors.

Setting a limit on the maximum size of serialized data returned to the driver may help prevent out-of-memory errors.
No, this is not incorrect. When using commands like collect() that trigger the transmission of potentially large amounts of data from the cluster to the driver, the driver may experience out-of-memory errors. One strategy to avoid this is to be careful about using commands like collect() that send back large amounts of data to the driver. Another strategy is setting the parameter spark.driver.maxResultSize. If data to be transmitted to the driver exceeds the threshold specified by this parameter, Spark will abort the job and therefore prevent an out-of-memory error.

Limiting the amount of data being automatically broadcast in joins can help against out-of-memory errors.
No, this is not incorrect. As part of Spark's internal optimization, Spark may choose to speed up operations by broadcasting (usually relatively small) tables to executors. This broadcast happens from the driver, so all the broadcast tables are loaded into the driver first. If these tables are relatively big, or multiple mid-size tables are being broadcast, this may lead to an out-of-memory error. The maximum table size for which Spark will consider broadcasting is set by the spark.sql.autoBroadcastJoinThreshold parameter.

More info: Configuration - Spark 3.1.2 Documentation and Spark OOM Error - Closeup. Does the following look familiar when... | by Amit Singh Rathore | The Startup | Medium
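For reference, here is a minimal sketch of how the two configuration parameters mentioned above could be set when building a session; the values are illustrative only:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Abort a job if more than 2g of serialized results would be sent to the
    # driver, instead of risking a driver out-of-memory error.
    .config("spark.driver.maxResultSize", "2g")
    # Only consider tables up to about 10 MB for automatic broadcasting in
    # joins; setting -1 disables automatic broadcast joins entirely.
    .config("spark.sql.autoBroadcastJoinThreshold", "10485760")
    .getOrCreate()
)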