Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?
Correct Answer: E
Explanation
DataFrame.repartition(6)
Correct. repartition() always triggers a full shuffle (unlike coalesce()).
DataFrame.repartition(12)
No, this would leave the DataFrame with 12 partitions, not 6.
DataFrame.coalesce(6)
No. coalesce() does not perform a full shuffle of the data. Whenever you see "full shuffle", you know you are not dealing with coalesce(). While coalesce() can perform a partial shuffle when required, it tries to minimize shuffle operations, that is, the amount of data sent between executors. Here, 12 partitions can easily be reduced to 6 simply by stitching every two partitions into one.
DataFrame.coalesce(6, shuffle=True) and DataFrame.coalesce(6).shuffle()
These statements are not valid Spark API syntax.
More info: Spark Repartition & Coalesce - Explained and Repartition vs Coalesce in Apache Spark - Rock the JVM Blog
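To see the difference concretely, here is a minimal sketch (assuming a SparkSession named spark; the DataFrame built with spark.range is purely illustrative) that reduces a 12-partition DataFrame to 6 partitions both ways:

```python
# Minimal sketch; assumes a SparkSession named `spark`. The DataFrame is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1200).repartition(12)       # start with 12 partitions

repartitioned = df.repartition(6)            # full shuffle down to 6 partitions
coalesced = df.coalesce(6)                   # merges neighboring partitions, avoids a full shuffle

print(repartitioned.rdd.getNumPartitions())  # 6
print(coalesced.rdd.getNumPartitions())      # 6
```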
Question 52
Which of the following statements about storage levels is incorrect?
Correct Answer: D
Explanation
MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.
Correct, this statement is wrong. MEMORY_AND_DISK does not store data in both places: Spark prioritizes storage in memory and only stores on disk the data that does not fit into memory.
DISK_ONLY will not use the worker node's memory.
Wrong, this statement is correct. DISK_ONLY keeps data only on the worker node's disk, not in memory.
In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.
Wrong, this statement is correct. In fact, Spark has no provision to cache DataFrames on the driver (which sits on the edge node in client mode); it caches DataFrames in the executors' memory.
Caching can be undone using the DataFrame.unpersist() operator.
Wrong, this statement is correct. Caching, achieved via the DataFrame.cache() or DataFrame.persist() operators, can be undone using DataFrame.unpersist(). This operator removes all of the DataFrame's cached blocks from the executors' memory and disk.
The cache operator on DataFrames is evaluated like a transformation.
Wrong, this statement is correct. DataFrame.cache() is evaluated like a transformation, through lazy evaluation: calling DataFrame.cache() has no effect until you call a subsequent action, such as DataFrame.cache().count().
More info: pyspark.sql.DataFrame.unpersist - PySpark 3.1.2 documentation
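The storage levels discussed above can be tried out with DataFrame.persist() and DataFrame.unpersist(); the sketch below assumes a SparkSession named spark and uses an illustrative DataFrame:

```python
# Minimal sketch; assumes a SparkSession named `spark`. The DataFrame is illustrative.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10**6)

df.persist(StorageLevel.MEMORY_AND_DISK)  # memory first; only data that does not fit spills to disk
df.count()                                # persist() is lazy, so an action is needed to materialize the cache

df.unpersist()                            # removes the cached blocks from executor memory and disk

df.persist(StorageLevel.DISK_ONLY)        # kept only on the workers' disks, never in memory
df.count()
df.unpersist()
```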
Question 53
Which of the following code blocks returns a single-column DataFrame of all entries in the Python list throughputRates, which contains only float-type values?
Correct Answer: E
Explanation
spark.createDataFrame(throughputRates, FloatType())
Correct! spark.createDataFrame is the right operator to use here, and the type FloatType(), passed in for the command's schema argument, is correctly instantiated using parentheses. Remember that in PySpark it is essential to instantiate types when passing them to SparkSession.createDataFrame. And, in Databricks, spark returns a SparkSession object.
spark.createDataFrame((throughputRates), FloatType)
No. While wrapping throughputRates in parentheses does not change how this command executes, not instantiating FloatType with parentheses, as in the previous answer, makes this command fail.
spark.createDataFrame(throughputRates, FloatType)
Incorrect. Given that it does not matter whether you wrap throughputRates in parentheses or not, see the explanation of the previous answer for further insights.
spark.DataFrame(throughputRates, FloatType)
Wrong. There is no SparkSession.DataFrame() method in Spark.
spark.createDataFrame(throughputRates)
False. Omitting the schema argument makes PySpark try to infer the schema. However, as you can see in the documentation (linked below), the inference only works if you pass an "RDD of either Row, namedtuple, or dict" for data (the first argument to createDataFrame). Since you are passing a Python list of plain float values, Spark's schema inference fails.
More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
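A short sketch of the correct answer (assuming a SparkSession named spark; the list values are made up for illustration):

```python
# Minimal sketch; assumes a SparkSession named `spark`. The list values are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType

spark = SparkSession.builder.getOrCreate()
throughputRates = [12.5, 7.25, 3.0, 9.75]   # Python list containing only float values

df = spark.createDataFrame(throughputRates, FloatType())  # note the instantiated type: FloatType()
df.show()
# +-----+
# |value|
# +-----+
# | 12.5|
# | 7.25|
# |  3.0|
# | 9.75|
# +-----+
```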
Question 54
The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__.__3__(__4__))
Correct Answer: D
Explanation
Correct code block: transactionsDf.select(col("storeId").cast(StringType()))
Solving this question involves understanding that types from pyspark.sql.types, such as StringType, need to be instantiated when used in Spark; in simple words, they need to be followed by parentheses, like so: StringType(). You could also use .cast("string") instead, but that option is not given here.
More info: pyspark.sql.Column.cast - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
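A quick sketch of the correct answer in context (assuming a SparkSession named spark; the contents of transactionsDf are made up for illustration):

```python
# Minimal sketch; assumes a SparkSession named `spark`. transactionsDf is illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(1, 25), (2, 3)], ["transactionId", "storeId"])

result = transactionsDf.select(col("storeId").cast(StringType()))
result.printSchema()
# root
#  |-- storeId: string (nullable = true)
```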
Question 55
Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?
Correct Answer: B
Explanation
itemsDf.sample(fraction=0.1, seed=87238)
Correct. If itemsDf has 10,000 rows, this code block returns about 1,000, since DataFrame.sample() is never guaranteed to return an exact number of rows. To avoid returning duplicates, leave the withReplacement parameter at False, which is the default. Since the question specifies that the same rows should be returned even if the code block is run twice, you need to specify a seed. The value passed as the seed does not matter, as long as it is an integer.
itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
Incorrect. While this code block fulfills almost all requirements, it may return duplicates because withReplacement is set to True. Here is how to understand replacement: imagine a bucket of 10,000 numbered balls from which you need to take 1,000 balls at random (similar to the problem in the question). If you took the balls with replacement, you would take a ball, note its number, and put it back into the bucket, so the next draw could return the exact same ball again. If you took the balls without replacement, you would leave each drawn ball outside the bucket while taking the remaining 999 balls.
itemsDf.sample(fraction=1000, seed=98263)
Wrong. The fraction parameter needs a value between 0 and 1. In this case, it should be 0.1, since 1,000/10,000 = 0.1.
itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)
No. DataFrame.sampleBy() is meant for stratified sampling: based on the values in a column of a DataFrame, you can draw a certain fraction of the rows containing those values (more details linked below). In the scenario at hand, sampleBy is not the right operator because you have no information about any column that the sampling should depend on.
itemsDf.sample(fraction=0.1)
Incorrect. This code block checks all the boxes except that it does not ensure the exact same rows are returned when you run it a second time. To achieve that, you would have to specify a seed.
More info:
- pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.sampleBy - PySpark 3.1.2 documentation
- Types of Samplings in PySpark 3. The explanations of the sampling... | by Pinar Ersoy | Towards Data Science
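A minimal sketch of the correct answer (assuming a SparkSession named spark; itemsDf is replaced here by an illustrative 10,000-row DataFrame):

```python
# Minimal sketch; assumes a SparkSession named `spark`. itemsDf stands in for the 10,000-row DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
itemsDf = spark.range(10000)

sampled = itemsDf.sample(fraction=0.1, seed=87238)  # ~1,000 rows, no duplicates (withReplacement defaults to False)
print(sampled.count())                              # roughly 1,000

# With the same seed (and the same data and partitioning), the same rows come back:
sampledAgain = itemsDf.sample(fraction=0.1, seed=87238)
print(sampled.exceptAll(sampledAgain).count())      # 0
```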