Which of the following describes a narrow transformation?
Correct Answer: E
Explanation
A narrow transformation is an operation in which no data is exchanged across the cluster.
Correct! In narrow transformations, no data is exchanged across the cluster, since these transformations do not require any data from outside of the partition they are applied to. Typical narrow transformations include filter, drop, and coalesce.

A narrow transformation is an operation in which data is exchanged across partitions.
No, that would be one definition of a wide transformation, but not of a narrow transformation. Wide transformations typically cause a shuffle, in which data is exchanged across partitions, executors, and the cluster.

A narrow transformation is an operation in which data is exchanged across the cluster.
No, see the explanation just above this one.

A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.
No, type conversion has nothing to do with narrow transformations in Spark.

A narrow transformation is a process in which data from multiple RDDs is used.
No. A resilient distributed dataset (RDD) can be described as a collection of partitions. In a narrow transformation, no data is exchanged between partitions, so no data is exchanged between RDDs either. One could say, though, that a narrow transformation (and, in fact, any transformation) results in a new RDD being created. This is because a transformation results in a change to an existing RDD (RDDs are the foundation of other Spark data structures, like DataFrames). But, since RDDs are immutable, a new RDD needs to be created to reflect the change caused by the transformation.

More info: Spark Transformation and Action: A Deep Dive | by Misbah Uddin | CodeX | Medium
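As a rough illustration of the difference, the following PySpark sketch (assuming an active SparkSession named spark; the DataFrame df and its value column are made up for this example) runs a narrow transformation (filter) and a wide transformation (groupBy) and prints their physical plans. The wide plan contains an Exchange (shuffle) step, while the narrow one does not.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up example DataFrame with a single column named value
df = spark.range(100).withColumnRenamed("id", "value")

# Narrow transformation: filter works on each partition independently, no shuffle
narrow = df.filter(F.col("value") > 50)
narrow.explain()  # physical plan contains no Exchange step

# Wide transformation: groupBy requires data to be exchanged across partitions
wide = df.groupBy("value").count()
wide.explain()    # physical plan contains an Exchange (shuffle) step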
Question 32
The code block displayed below contains an error. The code block is intended to join DataFrame itemsDf with the larger DataFrame transactionsDf on column itemId. Find the error.
Code block:
transactionsDf.join(itemsDf, "itemId", how="broadcast")
Correct Answer: E
Explanation
broadcast is not a valid join type.
Correct! The code block should read transactionsDf.join(broadcast(itemsDf), "itemId"). This implies an inner join (the default in DataFrame.join()), but since the join type is not given in the question, this is a valid choice.

The larger DataFrame transactionsDf is being broadcast, rather than the smaller DataFrame itemsDf.
This option does not apply here, since the syntax around broadcasting is incorrect.

Spark will only perform the broadcast operation if this behavior has been enabled on the Spark cluster.
No, it is enabled by default, since the spark.sql.autoBroadcastJoinThreshold property is set to 10 MB by default. If that property were set to -1, broadcast joining would be disabled. More info: Performance Tuning - Spark 3.1.1 Documentation (https://bit.ly/3gCz34r)

The join method should be replaced by the broadcast method.
No, DataFrame has no broadcast() method.

The syntax is wrong, how= should be removed from the code block.
No, having the keyword argument how= is perfectly acceptable.
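For reference, here is a minimal sketch of the corrected join, assuming DataFrames transactionsDf and itemsDf with a shared itemId column already exist. Note that broadcast() has to be imported from pyspark.sql.functions.

from pyspark.sql.functions import broadcast

# Broadcast the smaller DataFrame (itemsDf) and join on itemId;
# with no join type given, this defaults to an inner join.
joined = transactionsDf.join(broadcast(itemsDf), "itemId")
joined.explain()  # the physical plan should show a BroadcastHashJoin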
Question 33
The code block displayed below contains an error. The code block should read the CSV file located at path data/transactions.csv into DataFrame transactionsDf, using the first row as the column header and casting the columns to the most appropriate types. Find the error.
First 3 rows of transactions.csv:
transactionId;storeId;productId;name
1;23;12;green grass
2;35;31;yellow sun
3;23;12;green grass
Code block:
transactionsDf = spark.read.load("data/transactions.csv", sep=";", format="csv", header=True)
Correct Answer: E
Explanation
Correct code block:
transactionsDf = spark.read.load("data/transactions.csv", sep=";", format="csv", header=True, inferSchema=True)
By default, Spark does not infer the schema of a CSV file, since this usually takes some time. So, you need to add the inferSchema=True option to the code block.
More info: pyspark.sql.DataFrameReader.csv - PySpark 3.1.2 documentation
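As a quick check, here is a minimal sketch of the corrected read (assuming an active SparkSession named spark and the file at data/transactions.csv); printSchema() confirms whether the numeric columns were inferred as integers instead of being read as strings.

# Read the semicolon-separated CSV, using the first row as the header
# and letting Spark infer the column types.
transactionsDf = spark.read.load(
    "data/transactions.csv",
    format="csv",
    sep=";",
    header=True,
    inferSchema=True,
)
transactionsDf.printSchema()  # transactionId, storeId, productId as int; name as string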
Question 34
Which of the following code blocks returns all unique values across all values in columns value and productId in DataFrame transactionsDf in a one-column DataFrame?
Correct Answer: D
Explanation
transactionsDf.select('value').union(transactionsDf.select('productId')).distinct()
Correct. This code block uses a common pattern for finding the unique values across multiple columns: union and distinct. In fact, it is so common that it is even mentioned in the Spark documentation for the union command (link below).

transactionsDf.select('value', 'productId').distinct()
Wrong. This code block returns unique rows, but not unique values.

transactionsDf.agg({'value': 'collect_set', 'productId': 'collect_set'})
Incorrect. This code block outputs a one-row, two-column DataFrame in which each cell holds an array of the unique values in the respective column (omitting any nulls), not a one-column DataFrame.

transactionsDf.select(col('value'), col('productId')).agg({'*': 'count'})
No. This command counts the number of rows, but does not return unique values.

transactionsDf.select('value').join(transactionsDf.select('productId'), col('value')==col('productId'), 'outer')
Wrong. This command performs an outer join of the value and productId columns. As such, it returns a two-column DataFrame. If you picked this answer, it might be a good idea to read up on the difference between union and join; a link is posted below.

More info: pyspark.sql.DataFrame.union - PySpark 3.1.2 documentation, sql - What is the difference between JOIN and UNION? - Stack Overflow
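To see why union plus distinct returns unique values rather than unique rows, the following sketch builds a small made-up DataFrame inline (the data is invented purely for illustration) and compares the two code blocks discussed above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up example data: the value 3 appears in both columns
transactionsDf = spark.createDataFrame(
    [(1, 3), (2, 3), (1, 3)], ["value", "productId"]
)

# Unique values across both columns, stacked into a single column: 1, 2, 3
transactionsDf.select("value").union(transactionsDf.select("productId")).distinct().show()

# Unique rows, not unique values: (1, 3) and (2, 3)
transactionsDf.select("value", "productId").distinct().show()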
Question 35
Which of the following code blocks returns a DataFrame with an added column to DataFrame transactionsDf that shows the unix epoch timestamps in column transactionDate as strings in the format month/day/year in column transactionDateFormatted? Excerpt of DataFrame transactionsDf:
Correct Answer: D
Explanation
transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="MM/dd/yyyy"))
Correct. This code block adds a new column with the name transactionDateFormatted to DataFrame transactionsDf, using Spark's from_unixtime method to transform values in column transactionDate into strings, following the format requested in the question.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate", format="dd/MM/yyyy"))
No. Although almost correct, this uses the wrong format for the timestamp-to-date conversion: day/month/year instead of month/day/year.

transactionsDf.withColumnRenamed("transactionDate", "transactionDateFormatted", from_unixtime("transactionDateFormatted", format="MM/dd/yyyy"))
Incorrect. This answer uses wrong syntax. The command DataFrame.withColumnRenamed() is for renaming an existing column and takes only two string parameters, specifying the old and the new name of the column.

transactionsDf.apply(from_unixtime(format="MM/dd/yyyy")).asColumn("transactionDateFormatted")
Wrong. Although this answer looks very tempting, it is actually incorrect Spark syntax. In Spark, there is no method DataFrame.apply(). Spark has an apply() method that can be used on grouped data, but this is irrelevant for this question, since we do not deal with grouped data here.

transactionsDf.withColumn("transactionDateFormatted", from_unixtime("transactionDate"))
No. Although this is valid Spark syntax, the strings in column transactionDateFormatted would look like this: 2020-04-26 15:35:32, which is the default format for from_unixtime in Spark and not what is asked for in the question.

More info: pyspark.sql.functions.from_unixtime - PySpark 3.1.1 documentation and pyspark.sql.DataFrame.withColumnRenamed - PySpark 3.1.1 documentation
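For reference, here is a minimal sketch of the correct answer, assuming transactionsDf already exists and its transactionDate column holds unix epoch timestamps; from_unixtime is imported from pyspark.sql.functions.

from pyspark.sql.functions import from_unixtime

# Format the unix epoch timestamps in transactionDate as month/day/year strings
transactionsDf = transactionsDf.withColumn(
    "transactionDateFormatted",
    from_unixtime("transactionDate", format="MM/dd/yyyy"),
)
transactionsDf.select("transactionDate", "transactionDateFormatted").show()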