Which of the following code blocks prints the number of rows in which the expression Inc. appears in the string-type column supplier of DataFrame itemsDf?
Correct Answer: E
Explanation
Correct code block:

accum = sc.accumulator(0)

def check_if_inc_in_supplier(row):
    if 'Inc.' in row['supplier']:
        accum.add(1)

itemsDf.foreach(check_if_inc_in_supplier)
print(accum.value)

To answer this question correctly, you need to know about both the DataFrame.foreach() method and accumulators. When Spark runs the code, it executes it on the executors. The executors do not have any information about variables outside of their scope. This is why simply using a Python variable counter, as in the two examples that start with counter = 0, will not work. You need to tell the executors explicitly that counter is a special shared variable, an Accumulator, which is managed by the driver and can be accessed by all executors for the purpose of adding to it. If you have used pandas in the past, you might be familiar with the iterrows() command. Note that there is no such command in PySpark. The two examples that start with print do not work, since DataFrame.foreach() does not have a return value.
More info: pyspark.sql.DataFrame.foreach - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
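To make the difference concrete, here is a minimal sketch contrasting the two approaches; it assumes an existing SparkContext named sc and the itemsDf DataFrame from the question:

counter = 0
def count_locally(row):
    global counter
    if 'Inc.' in row['supplier']:
        counter += 1          # increments a copy on the executor, not the driver's variable
itemsDf.foreach(count_locally)
print(counter)                # still 0 on the driver

accum = sc.accumulator(0)     # shared variable, managed by the driver
def count_with_accumulator(row):
    if 'Inc.' in row['supplier']:
        accum.add(1)          # executors ship their increments back to the driver
itemsDf.foreach(count_with_accumulator)
print(accum.value)            # the actual row count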
Question 27
The code block shown below should return only the average prediction error (column predError) of a random subset, without replacement, of approximately 15% of rows in DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this. transactionsDf.__1__(__2__, __3__).__4__(avg('predError'))
Correct Answer: B
Explanation
Correct code block:

transactionsDf.sample(withReplacement=False, fraction=0.15).select(avg('predError'))

You should remember that getting a random subset of rows means sampling. This, in turn, should point you to the DataFrame.sample() method. Once you know this, you can look up the correct order of arguments in the documentation (link below). Lastly, you have to decide whether to use filter(), where(), or select(). where() is just an alias for filter(). filter() is not the correct method to use here, since it would only allow you to filter rows based on some condition, whereas the question asks to return only the average prediction error. You can control the columns that a query returns with the select() method, so this is the correct method to use here.
More info: pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
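A minimal runnable sketch of the full statement, assuming transactionsDf exists and contains a numeric predError column:

from pyspark.sql.functions import avg

# sample roughly 15% of the rows without replacement, then reduce them to the average prediction error
avg_err = transactionsDf.sample(withReplacement=False, fraction=0.15).select(avg('predError'))
avg_err.show()

Keep in mind that sample() only approximates the requested fraction, so the exact number of sampled rows varies between runs.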
Question 28
Which of the following describes a difference between Spark's cluster and client execution modes?
Correct Answer: C
Explanation
In cluster mode, the driver resides on a worker node, while it resides on an edge node in client mode.
Correct. The idea of Spark's client mode is that workloads can be executed from an edge node, also known as a gateway machine, from outside the cluster. The most common way to execute Spark, however, is in cluster mode, where the driver resides on a worker node. In practice, in client mode, the data transfer speed between the edge node and the cluster is a tight constraint relative to the data transfer speed between worker nodes in the cluster. Also, any job that is executed in client mode will fail if the edge node fails. For these reasons, client mode is usually not used in a production environment.
In cluster mode, the cluster manager resides on a worker node, while it resides on an edge node in client execution mode.
No. In both execution modes, the cluster manager may reside on a worker node, but it does not reside on an edge node in client mode.
In cluster mode, executor processes run on worker nodes, while they run on gateway nodes in client mode.
This is incorrect. Only the driver runs on gateway nodes (also known as "edge nodes") in client mode, but not the executor processes.
In cluster mode, the Spark driver is not co-located with the cluster manager, while it is co-located in client mode.
No. In client mode, the Spark driver is not co-located with the cluster manager. The whole point of client mode is that the driver is outside the cluster and not associated with the resource that manages the cluster (the machine that runs the cluster manager).
In cluster mode, a gateway machine hosts the driver, while it is co-located with the executor in client mode.
No, it is exactly the opposite: there are no gateway machines in cluster mode, but in client mode, they host the driver.
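As a practical illustration, the execution mode is typically selected through spark-submit's --deploy-mode flag; the sketch below uses a placeholder application (my_app.py) and a YARN master, so adjust both to your environment:

spark-submit --master yarn --deploy-mode cluster my_app.py   # driver runs on a worker node inside the cluster
spark-submit --master yarn --deploy-mode client my_app.py    # driver runs on the submitting machine, e.g. an edge node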
Question 29
The code block shown below should convert up to 5 rows in DataFrame transactionsDf that have the value 25 in column storeId into a Python list. Choose the answer that correctly fills the blanks in the code block to accomplish this. Code block: transactionsDf.__1__(__2__).__3__(__4__)
Correct Answer: D
Explanation
The correct code block is:

transactionsDf.filter(col("storeId") == 25).take(5)

Any of the options with collect will not work, because collect does not take any arguments, and in both cases the argument 5 is given. The option with toLocalIterator will not work, because the only argument to toLocalIterator is prefetchPartitions, which is a boolean, so passing 5 here does not make sense. The option using head will not work, because the expression passed to select is not proper syntax; it would work if the expression were col("storeId") == 25.
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/24.html , https://bit.ly/sparkpracticeexams_import_instructions)
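A minimal sketch of the correct pattern, assuming transactionsDf exists with an integer storeId column:

from pyspark.sql.functions import col

# keep only rows with storeId == 25, then pull at most 5 of them to the driver
rows = transactionsDf.filter(col("storeId") == 25).take(5)
print(rows)   # a Python list of Row objects

take(n) returns a Python list of up to n Row objects, which is exactly the "Python list" the question asks for.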
Question 30
The code block displayed below contains an error. The code block should return a new DataFrame that only contains rows from DataFrame transactionsDf in which the value in column predError is at least 5. Find the error. Code block: transactionsDf.where("col(predError) >= 5")
Correct Answer: A
Explanation
The actual error is that col() is a PySpark function, not a SQL function, so it cannot be used inside a SQL expression string; the string should reference the column directly, as in transactionsDf.where("predError >= 5").
The argument to the where method cannot be a string.
It can be a string, no problem here.
Instead of where(), filter() should be used.
No, that does not matter. In PySpark, where() and filter() are equivalent.
Instead of >=, the SQL operator GEQ should be used.
Incorrect.
The expression returns the original DataFrame transactionsDf and not a new DataFrame. To avoid this, the code block should be transactionsDf.toNewDataFrame().where("col(predError) >= 5").
No, Spark returns a new DataFrame.
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/27.html , https://bit.ly/sparkpracticeexams_import_instructions)
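A minimal sketch of the two working variants, assuming transactionsDf exists with a numeric predError column:

from pyspark.sql.functions import col

transactionsDf.where("predError >= 5")          # SQL expression string: reference the column by name, no col()
transactionsDf.where(col("predError") >= 5)     # Column expression: col() is used outside the string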