Explanation
The driver receives data upon request by actions. Correct! Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer result data back to the driver.
Actions are Spark's way of exchanging data between executors. No. In Spark, data is exchanged between executors via shuffles.
Writing data to disk is the primary purpose of actions. No. The primary purpose of actions is to access data that is stored in Spark's RDDs and return the data, often in aggregated form, back to the driver.
Actions are Spark's way of modifying RDDs. Incorrect. Firstly, RDDs are immutable - they cannot be modified. Secondly, Spark generates new RDDs via transformations and not actions.
Stage boundaries are commonly established by actions. Wrong. A stage boundary is commonly established by a shuffle, for example caused by a wide transformation.
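To make the correct answer concrete, here is a minimal sketch (the DataFrame and column names are invented for this illustration): the groupBy transformation only declares a shuffle between executors, while count() and collect() are actions that execute the tasks and return results to the driver.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("actions-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "a")], ["id", "letter"])

# Wide transformation: declares a shuffle between executors, but nothing runs yet.
counts = df.groupBy("letter").count()

# Actions: tasks are executed on the executors and the (aggregated) results
# are transferred back to the driver.
n_groups = counts.count()
rows_on_driver = counts.collect()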
Question 22
Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate format for this kind of data?
Correct Answer: D
Explanation
The schema passed into schema should be of type StructType or a string, so all entries in which a list is passed are incorrect. In addition, since all numbers are whole numbers, the IntegerType() data type is the correct option here. NumberType() is not a valid data type and StringType() would fail, since the parquet file is stored in the "most appropriate format for this kind of data", meaning that it is most likely an IntegerType, and Spark does not convert data types if a schema is provided. Also note that StructType accepts only a single argument (a list of StructFields). So, passing multiple arguments is invalid. Finally, Spark needs to know which format the file is in. However, all of the options listed are valid here, since Spark assumes parquet as a default when no file format is specifically passed.
More info: pyspark.sql.DataFrameReader.schema - PySpark 3.1.2 documentation and StructType - PySpark 3.1.2 documentation
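A sketch of the reading pattern described above (the column names and the example path are placeholders, since the question does not specify them):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.getOrCreate()
filePath = "/path/to/file.parquet"  # placeholder; the real location comes from the question

# StructType takes a single list of StructFields; IntegerType matches whole numbers.
schema = StructType([
    StructField("value1", IntegerType()),
    StructField("value2", IntegerType()),
])

# No explicit format call is needed: parquet is Spark's default data source.
df = spark.read.schema(schema).parquet(filePath)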
Question 23
The code block displayed below contains an error. The code block should return the average of rows in column value grouped by unique storeId. Find the error.
Code block:
transactionsDf.agg("storeId").avg("value")
Correct Answer: D
Explanation
The aggregation is applied with the wrong method: agg("storeId") does not group the DataFrame, it attempts an aggregation over all rows. To return the average of value per unique storeId, the rows first need to be grouped with groupBy("storeId"), after which the avg("value") aggregation can be applied.
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/30.html, https://bit.ly/sparkpracticeexams_import_instructions)
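A corrected version of the code block (a sketch; transactionsDf is assumed to exist as described in the question):

from pyspark.sql import functions as F

# Group by storeId first, then aggregate the value column.
transactionsDf.groupBy("storeId").avg("value")

# Equivalent, with an explicit aggregation expression:
transactionsDf.groupBy("storeId").agg(F.avg("value"))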
Question 24
Which of the following statements about executors is correct, assuming that one can consider each of the JVMs working as executors as a pool of task execution slots?
Correct Answer: E
Explanation
Tasks run in parallel via slots. Correct. Given the assumption, an executor then has one or more "slots", defined by the equation spark.executor.cores / spark.task.cpus. With the executor's resources divided into slots, each task takes up a slot and multiple tasks can be executed in parallel.
Slot is another name for executor. No, a slot is part of an executor.
An executor runs on a single core. No, an executor can occupy multiple cores. This is set by the spark.executor.cores option.
There must be more slots than tasks. No. Slots just process tasks. One could imagine a scenario where there is just a single slot for multiple tasks, processing one task at a time. Granted, this is the opposite of what Spark should be used for, which is distributed data processing over multiple cores and machines, performing many tasks in parallel.
There must be fewer executors than tasks. No, there is no such requirement.
More info: Spark Architecture | Distributed Systems Architecture (https://bit.ly/3x4MZZt)
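As an illustration of the slot equation (the configuration values below are made up for this example):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("slots-example")
         .config("spark.executor.cores", "4")  # cores available per executor
         .config("spark.task.cpus", "1")       # cores each task occupies
         .getOrCreate())

# Slots per executor = spark.executor.cores / spark.task.cpus = 4 / 1 = 4,
# so up to 4 tasks can run in parallel on each executor.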
Question 25
Which of the following statements about stages is correct?
Correct Answer: D
Explanation
Tasks in a stage may be executed by multiple machines at the same time. This is correct. Within a single stage, tasks do not depend on each other. Executors on multiple machines may execute tasks belonging to the same stage on the respective partitions they are holding at the same time.
Different stages in a job may be executed in parallel. No. Different stages in a job depend on each other and cannot be executed in parallel. The nuance is that every task in a stage may be executed in parallel by multiple machines. For example, if a job consists of Stage A and Stage B, tasks belonging to those stages may not be executed in parallel. However, tasks from Stage A may be executed on multiple machines at the same time, with each machine running them on a different partition of the same dataset. Then, afterwards, tasks from Stage B may be executed on multiple machines at the same time.
Stages may contain multiple actions, narrow, and wide transformations. No, stages may not contain multiple wide transformations. Wide transformations mean that shuffling is required, and shuffling typically terminates a stage because data needs to be exchanged across the cluster. This data exchange often causes partitions to change and rearrange, making it impossible to perform tasks in parallel on the same dataset.
Stages ephemerally store transactions, before they are committed through actions. No, this does not make sense. Stages do not "store" any data, and transactions are not "committed" in Spark.
Stages consist of one or more jobs. No, it is the other way around: jobs consist of one or more stages.
More info: Spark: The Definitive Guide, Chapter 15.
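A small sketch of how a wide transformation splits a job into two stages (the grouping expression is invented for this example):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1000)  # narrow operations so far, no shuffle

# Wide transformation: the groupBy requires a shuffle, which ends the first stage.
grouped = df.groupBy((df.id % 10).alias("bucket")).count()

# The action triggers the job. Tasks of the first stage run in parallel across
# partitions; only after the shuffle does the second stage's final aggregation run.
result = grouped.collect()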