Which of the elements that are labeled with a circle and a number contain an error or are misrepresented?
Correct Answer: B
Explanation
1: Correct - This should just read "API" or "DataFrame API". The DataFrame is not part of the SQL API. To make a DataFrame accessible via SQL, you first need to create a DataFrame view. That view can then be accessed via SQL.
4: Although "K_38_INU" looks odd, it is a completely valid name for a DataFrame column.
6: No, StringType is a correct type.
7: Although a StringType may not be the most efficient way to store a phone number, there is nothing fundamentally wrong with using this type here.
8: Correct - TreeType is not a type that Spark supports.
9: No, Spark DataFrames support ArrayType variables. In this case, the variable would represent a sequence of elements with type LongType, which is also a valid type for Spark DataFrames.
10: There is nothing wrong with this row.
More info: Data Types - Spark 3.1.1 Documentation (https://bit.ly/3aAPKJT)
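As a quick illustration of the valid types discussed above, a minimal schema sketch could look as follows (column names other than K_38_INU are made up for illustration):

    from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType

    schema = StructType([
        StructField("K_38_INU", StringType()),               # unusual, but a valid column name
        StructField("phoneNumber", StringType()),            # StringType works for phone numbers
        StructField("measurements", ArrayType(LongType())),  # ArrayType of LongType is supported
    ])

There is no TreeType in pyspark.sql.types, which is why element 8 in the question is an error.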
Question 17
Which of the following code blocks returns the number of unique values in column storeId of DataFrame transactionsDf?
Correct Answer: A
Explanation
transactionsDf.select("storeId").dropDuplicates().count()
Correct! After dropping all duplicates from column storeId, the remaining rows get counted, representing the number of unique values in the column.
transactionsDf.select(count("storeId")).dropDuplicates()
No. transactionsDf.select(count("storeId")) just returns a single-row DataFrame showing the number of non-null rows. dropDuplicates() does not have any effect in this context.
transactionsDf.dropDuplicates().agg(count("storeId"))
Incorrect. transactionsDf.dropDuplicates() removes duplicate rows, but it considers all columns rather than only storeId, so it eliminates full-row duplicates instead.
transactionsDf.distinct().select("storeId").count()
Wrong. transactionsDf.distinct() identifies unique rows across all columns, not unique values with respect to column storeId. This may leave duplicate values in the column, so the count would not represent the number of unique values in that column.
transactionsDf.select(distinct("storeId")).count()
False. There is no distinct method in pyspark.sql.functions.
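To see the difference between column-level and row-level deduplication in action, here is a minimal sketch (the sample data is made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import countDistinct

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10), (2, 10), (3, 20)], ["transactionId", "storeId"])

    df.select("storeId").dropDuplicates().count()   # 2 -- unique values in storeId
    df.dropDuplicates().count()                     # 3 -- every full row is already unique
    df.select(countDistinct("storeId")).show()      # aggregate-based alternative, also 2

Note that countDistinct ignores nulls, while dropDuplicates keeps a null as its own distinct value, so the two approaches can differ on data containing nulls.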
Question 18
Which of the following code blocks performs an inner join of DataFrames transactionsDf and itemsDf on columns productId and itemId, respectively, excluding columns value and storeId from DataFrame transactionsDf and column attributes from DataFrame itemsDf?
Correct Answer: E
Explanation
This question offers a wide variety of answers to a seemingly simple task. However, this variety reflects the many ways one can express a join in PySpark. You need to understand some SQL syntax to get to the correct answer here.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
statement = """
SELECT * FROM transactionsDf
INNER JOIN itemsDf ON transactionsDf.productId==itemsDf.itemId
"""
spark.sql(statement).drop("value", "storeId", "attributes")
Correct - this answer uses SQL to perform the inner join and afterwards drops the unwanted columns. This is totally fine. If you are unfamiliar with the triple quote """ in Python: it allows you to express a string across multiple lines.
transactionsDf \
  .drop(col('value'), col('storeId')) \
  .join(itemsDf.drop(col('attributes')), col('productId')==col('itemId'))
No, this answer option is a trap, since DataFrame.drop() does not accept a list of Column objects. You could use transactionsDf.drop('value', 'storeId') instead.
transactionsDf.drop("value", "storeId").join(itemsDf.drop("attributes"), "transactionsDf.productId==itemsDf.itemId")
Incorrect - Spark does not evaluate "transactionsDf.productId==itemsDf.itemId" as a valid join expression. This would work if it were a Column expression rather than a string.
transactionsDf.drop('value', 'storeId').join(itemsDf.select('attributes'), transactionsDf.productId==itemsDf.itemId)
Wrong, this statement incorrectly uses itemsDf.select instead of itemsDf.drop, so it keeps only column attributes instead of excluding it.
transactionsDf.createOrReplaceTempView('transactionsDf')
itemsDf.createOrReplaceTempView('itemsDf')
spark.sql("SELECT -value, -storeId FROM transactionsDf INNER JOIN itemsDf ON productId==itemId").drop("attributes")
No, here the SQL expression syntax is incorrect. Simply specifying -columnName does not drop a column.
More info: pyspark.sql.DataFrame.join - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
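For comparison, a pure DataFrame-API equivalent of the correct SQL answer could look like this (a sketch, assuming the column names from the question):

    transactionsDf.drop("value", "storeId") \
        .join(itemsDf.drop("attributes"),
              transactionsDf.productId == itemsDf.itemId,
              "inner")

Here the join condition is passed as a Column expression rather than a plain string, which is exactly what the third option above gets wrong.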
Question 19
Which of the following code blocks reads in parquet file /FileStore/imports.parquet as a DataFrame?
Correct Answer: D
Explanation
Static notebook | Dynamic notebook: See test 1 (https://flrs.github.io/spark_practice_tests_code/#1/23.html, https://bit.ly/sparkpracticeexams_import_instructions)
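The explanation is light here, so as a point of reference: the standard DataFrameReader pattern for parquet files looks like this (a sketch; that the correct option follows this pattern is an assumption, since the option text is not shown):

    spark.read.parquet("/FileStore/imports.parquet")
    # or, equivalently, via the generic reader:
    spark.read.format("parquet").load("/FileStore/imports.parquet")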
Question 20
The code block shown below should return a DataFrame with columns transactionId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this. transactionsDf.__1__(__2__)
Correct Answer: C
Explanation
Correct code block:
transactionsDf.select(["transactionId", "predError", "value", "f"])
DataFrame.select() returns specific columns from the DataFrame and accepts a list of column names as an argument. Thus, this is the correct choice here.
The option using col(["transactionId", "predError", "value", "f"]) is invalid, since col() only accepts a single column name, not a list.
Likewise, specifying all columns in a single string like "transactionId, predError, value, f" is not valid syntax.
filter and where filter rows based on conditions; they do not control which columns are returned.
Static notebook | Dynamic notebook: See test 2
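As a small sketch of the forms select() accepts (column names taken from the question):

    transactionsDf.select(["transactionId", "predError", "value", "f"])  # list form, as in the answer
    transactionsDf.select("transactionId", "predError", "value", "f")    # equivalent variadic form

Both forms are valid PySpark; the list form is the one that fits the single blank __2__ in this question.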