The code block displayed below contains an error. The code block should configure Spark so that DataFrames up to a size of 20 MB will be broadcast to all worker nodes when performing a join. Find the error. Code block:
Correct Answer: B
Explanation This question is hard. Let's assess the different answers one by one. Spark will only broadcast DataFrames that are much smaller than the default value. This is correct. The default value of spark.sql.autoBroadcastJoinThreshold is 10 MB (10485760 bytes). Since this configuration expects a number of bytes (and not megabytes), the code block sets the limit to merely 20 bytes, instead of the requested 20 * 1024 * 1024 (= 20971520) bytes. The command is evaluated lazily and needs to be followed by an action. No, this command is evaluated right away! Spark will only apply the limit to threshold joins and not to other joins. There are no "threshold joins", so this option does not make any sense. The correct option to write configurations is through spark.config and not spark.conf. No, it is indeed spark.conf! The passed limit has the wrong variable type. The configuration expects the number of bytes as a number, so the type of the 20 provided in the code block is fine; the problem is its value, not its type.
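For reference, a minimal sketch of how the intended configuration might look (assuming an active SparkSession named spark); the key point is that the threshold is specified in bytes:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 20 MB expressed in bytes; passing just 20 would set a 20-byte threshold instead
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20 * 1024 * 1024)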
Question 32
In which order should the code blocks shown below be run in order to assign articlesDf a DataFrame that lists all items in column attributes ordered by the number of times these items occur, from most to least often? Sample of DataFrame articlesDf:
+------+-----------------------------+-------------------+
|itemId|attributes                   |supplier           |
+------+-----------------------------+-------------------+
|1     |[blue, winter, cozy]         |Sports Company Inc.|
|2     |[red, summer, fresh, cooling]|YetiX              |
|3     |[green, summer, travel]      |Sports Company Inc.|
+------+-----------------------------+-------------------+
Question 33
The code block shown below should return a DataFrame with columns transactionId, predError, value, and f from DataFrame transactionsDf. Choose the answer that correctly fills the blanks in the code block to accomplish this. transactionsDf.__1__(__2__)
Correct Answer: C
Explanation Correct code block: transactionsDf.select(["transactionId", "predError", "value", "f"]) DataFrame.select() returns specific columns from the DataFrame and accepts a list of column names as its only argument. Thus, this is the correct choice here. The option using col(["transactionId", "predError", "value", "f"]) is invalid, since col() only accepts a single column name, not a list. Likewise, specifying all columns in a single string like "transactionId, predError, value, f" is not valid syntax. filter and where filter rows based on conditions; they do not control which columns are returned.
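A minimal, self-contained sketch of the accepted answer; the sample values below are made up for illustration, only the column names come from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; only the column names are taken from the question
transactionsDf = spark.createDataFrame(
    [(1, 3, 4.0, 7), (2, 6, 7.0, 3)],
    ["transactionId", "predError", "value", "f"],
)

# select() accepts a list of column names, so a single blank can hold the whole list
transactionsDf.select(["transactionId", "predError", "value", "f"]).show()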
Question 34
Which of the following describes characteristics of the Dataset API?
Correct Answer: D
Explanation The Dataset API is available in Scala, but it is not available in Python. Correct. The Dataset API uses fixed (strong) typing and is typically used for object-oriented programming. It is available when Spark is used with the Scala programming language, but not with Python. In Python, you use the DataFrame API, which is based on the Dataset API. The Dataset API does not provide compile-time type safety. No, the Dataset API does provide compile-time type safety; depending on the use case, that type safety is one of its advantages. The Dataset API does not support unstructured data. Wrong, the Dataset API supports structured and unstructured data. In Python, the Dataset API's schema is constructed via type hints. No, this is not applicable since the Dataset API is not available in Python. In Python, the Dataset API mainly resembles Pandas' DataFrame API. The Dataset API does not exist in Python, only in Scala and Java.
Question 35
The code block displayed below contains at least one error. The code block should return a DataFrame with only one column, result. That column should include all values in column value from DataFrame transactionsDf raised to the power of 5, and a null value for rows in which there is no value in column value. Find the error(s). Code block:

from pyspark.sql.functions import udf
from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    return x**5

spark.udf.register(pow_5, 'power_5_udf', T.LongType())
spark.sql('SELECT power_5_udf(value) FROM transactions')
Correct Answer: D
Explanation Correct code block:

from pyspark.sql.functions import udf
from pyspark.sql import types as T

transactionsDf.createOrReplaceTempView('transactions')

def pow_5(x):
    if x:
        return x**5
    return x

spark.udf.register('power_5_udf', pow_5, T.LongType())
spark.sql('SELECT power_5_udf(value) AS result FROM transactions')

Here it is important to understand how the pow_5 function handles empty values. In the wrong code block above, pow_5 is unable to handle empty values and will throw an error, since Python's ** operator cannot deal with the null values Spark passes into pow_5. The order of arguments when registering the UDF with Spark via spark.udf.register also matters. In the code snippet in the question, the arguments for the SQL function name and the actual Python function are switched. You can read more about the arguments of spark.udf.register and see some examples of its usage in the documentation (link below). Finally, you should recognize that the original code block is missing an expression to rename the column created through the UDF. The renaming is done by SQL's AS result clause; omitting it, you end up with the column name power_5_udf(value) instead of result. More info: pyspark.sql.functions.udf - PySpark 3.1.1 documentation
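As a side note, a similar result could be obtained without registering a SQL UDF at all, by using pyspark.sql.functions.udf directly on the DataFrame. This is only a sketch, assuming transactionsDf exists and has a numeric column value:

from pyspark.sql import functions as F
from pyspark.sql import types as T

# Null-safe power function: returns None for null inputs instead of raising an error
@F.udf(returnType=T.LongType())
def pow_5(x):
    return x**5 if x is not None else None

# Assumes transactionsDf has a column named value; alias() provides the result column name
transactionsDf.select(pow_5("value").alias("result"))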