Which of the following code blocks returns a DataFrame with a single column in which all items from column attributes of DataFrame itemsDf that contain the letter i are listed?

Sample of DataFrame itemsDf:

+------+----------------------------------+-----------------------------+-------------------+
|itemId|itemName                          |attributes                   |supplier           |
+------+----------------------------------+-----------------------------+-------------------+
|1     |Thick Coat for Walking in the Snow|[blue, winter, cozy]         |Sports Company Inc.|
|2     |Elegant Outdoors Summer Dress     |[red, summer, fresh, cooling]|YetiX              |
|3     |Outdoors Backpack                 |[green, summer, travel]      |Sports Company Inc.|
+------+----------------------------------+-----------------------------+-------------------+
Correct Answer: D
Explanation
Result of correct code block:

+-------------------+
|attributes_exploded|
+-------------------+
|             winter|
|            cooling|
+-------------------+

To solve this question, you need to know about explode(). This operation helps you split up arrays into single rows. If you have not yet had a chance to familiarize yourself with this method, find more examples in the documentation (link below). Note that explode() is a method made available through pyspark.sql.functions - it is not available as a method of a DataFrame or of a Column, as some of the answer options claim.

More info: pyspark.sql.functions.explode - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
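Since the answer options are not reproduced here, the exact correct code block is an assumption; a minimal sketch that produces the result shown above could look like this:

from pyspark.sql.functions import explode, col

# Explode the array column into one row per element, then keep only the
# elements that contain the letter i (here: "winter" and "cooling").
(itemsDf
 .select(explode("attributes").alias("attributes_exploded"))
 .filter(col("attributes_exploded").contains("i")))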
Question 12
The code block shown below should write DataFrame transactionsDf to disk at path csvPath as a single CSV file, using tabs (\t characters) as separators between columns, expressing missing values as string n/a, and omitting a header row with column names. Choose the answer that correctly fills the blanks in the code block to accomplish this.

transactionsDf.__1__.write.__2__(__3__, "\t").__4__.__5__(csvPath)
Correct Answer: C
Explanation
Correct code block:

transactionsDf.repartition(1).write.option("sep", "\t").option("nullValue", "n/a").csv(csvPath)

It is important here to understand that the question specifically asks for writing the DataFrame as a single CSV file. This should trigger you to think about partitions. By default, every partition is written as a separate file, so you need to include repartition(1) in your call. coalesce(1) works here, too! Secondly, the question is very much an invitation to search through the parameters in the Spark documentation that work with DataFrameWriter.csv (link below). You will also need to know that you need an option() statement to apply these parameters. The final concern is the general call structure: once you have accessed write on your DataFrame, the options follow, and then you write the DataFrame with csv. Instead of csv(csvPath), you could also use save(csvPath, format='csv') here.

More info: pyspark.sql.DataFrameWriter.csv - PySpark 3.1.1 documentation
Static notebook | Dynamic notebook: See test 1
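For reference, here are the equivalent variants mentioned above as minimal sketches (csvPath is assumed to hold the target path; each variant writes the same output, so only one would be run against a given path):

# Variant 1: as in the correct answer, with repartition(1).
transactionsDf.repartition(1).write.option("sep", "\t").option("nullValue", "n/a").csv(csvPath)

# Variant 2: coalesce(1) also reduces the DataFrame to a single partition.
transactionsDf.coalesce(1).write.option("sep", "\t").option("nullValue", "n/a").csv(csvPath)

# Variant 3: save() with an explicit format instead of the csv() shortcut.
transactionsDf.coalesce(1).write.option("sep", "\t").option("nullValue", "n/a").save(csvPath, format="csv")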
Question 13
Which of the following code blocks returns a single-row DataFrame that only has a column corr which shows the Pearson correlation coefficient between columns predError and value in DataFrame transactionsDf?
Correct Answer: D
Explanation
In difficulty, this question is above what you can expect from the exam. What this question wants to teach you, however, is to pay attention to the useful details included in the documentation. pyspark.sql.functions.corr is not a very common method, but it deals with Spark's data structure in an interesting way. The command takes two columns over multiple rows and returns a single row - similar to an aggregation function. When examining the documentation (linked below), you will find this code example:

a = range(20)
b = [2 * x for x in range(20)]
df = spark.createDataFrame(zip(a, b), ["a", "b"])
df.agg(corr("a", "b").alias('c')).collect()
[Row(c=1.0)]

See how corr just returns a single row? Once you understand this, you should be suspicious about answers that include first(), since there is no need to select just a single row. A further reason to eliminate those answers is that DataFrame.first() returns an object of type Row, not a DataFrame, as requested in the question.

transactionsDf.select(corr(col("predError"), col("value")).alias("corr"))
Correct! After calculating the Pearson correlation coefficient, the resulting column is correctly renamed to corr.

transactionsDf.select(corr(predError, value).alias("corr"))
No. In this answer, Python will interpret column names predError and value as variable names.

transactionsDf.select(corr(col("predError"), col("value")).alias("corr")).first()
Incorrect. first() returns a Row, not a DataFrame (see above and the linked documentation below).

transactionsDf.select(corr("predError", "value"))
Wrong. While this statement returns a DataFrame in the desired shape, the column will be named corr(predError, value) and not corr.

transactionsDf.select(corr(["predError", "value"]).alias("corr")).first()
False. In addition to first() returning a Row, this code block also uses the wrong call structure for corr, which takes two arguments (the two columns to correlate).

More info:
- pyspark.sql.functions.corr - PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.first - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
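As a self-contained version of the correct answer (the import line is an addition for runnability, not part of the answer option):

from pyspark.sql.functions import corr, col

# Returns a single-row DataFrame whose only column is named corr,
# holding the Pearson correlation coefficient of the two columns.
transactionsDf.select(corr(col("predError"), col("value")).alias("corr"))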
Question 14
Which of the following code blocks reorders the values inside the arrays in column attributes of DataFrame itemsDf from last to first one in the alphabet?

+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
Correct Answer: D
Explanation
Output of correct code block:

+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[winter, cozy, blue]         |Sports Company Inc.|
|2     |[summer, red, fresh, cooling]|YetiX              |
|3     |[travel, summer, green]      |Sports Company Inc.|
+------+-----------------------------+-------------------+

It can be confusing to differentiate between the different sorting functions in PySpark. In this case, a particularity of sort_array has to be considered: the sort direction is given by the second argument, not by the desc method. Luckily, this is documented in the documentation (link below). Also, to solve this question you need to understand the difference between sort and sort_array. With sort, you cannot sort values inside arrays. Also, sort is a method of DataFrame, while sort_array is a function in pyspark.sql.functions.

More info: pyspark.sql.functions.sort_array - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
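Since the answer options are not reproduced here, the exact correct code block is an assumption; a minimal sketch matching the output above:

from pyspark.sql.functions import sort_array

# asc=False sorts each array in descending (reverse alphabetical) order;
# note that the sort direction is the second argument, not a .desc() call.
itemsDf.withColumn("attributes", sort_array("attributes", asc=False))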
Question 15
Which of the following code blocks applies the boolean-returning Python function evaluateTestSuccess to column storeId of DataFrame transactionsDf as a user-defined function?
Correct Answer: A
Explanation
Recognizing that the UDF specification requires a return type (unless it is a string, which is the default) is important for solving this question. In addition, you should make sure that the generated UDF (evaluateTestSuccessUDF), and not the plain Python function (evaluateTestSuccess), is applied to column storeId.

More info: pyspark.sql.functions.udf - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
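A minimal sketch of the pattern the explanation describes; the UDF name evaluateTestSuccessUDF matches the explanation, while the target column name result is an assumption, since the answer options are not reproduced here:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import BooleanType

# Register the boolean-returning Python function as a UDF; the return
# type must be given explicitly because StringType is the default.
evaluateTestSuccessUDF = udf(evaluateTestSuccess, returnType=BooleanType())

# Apply the generated UDF - not the plain Python function - to storeId.
transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))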