Which of the following code blocks reduces a DataFrame from 12 to 6 partitions and performs a full shuffle?
Correct Answer: E
Explanation
DataFrame.repartition(6)
Correct. repartition() always triggers a full shuffle (unlike coalesce()).
DataFrame.repartition(12)
No, this would leave the DataFrame with 12 partitions, not 6.
DataFrame.coalesce(6)
No. coalesce() does not perform a full shuffle of the data. Whenever you see "full shuffle", you know you are not dealing with coalesce(). While coalesce() can perform a partial shuffle when required, it tries to minimize shuffle operations, that is, the amount of data sent between executors. Here, 12 partitions can easily be reduced to 6 simply by stitching every two partitions into one.
DataFrame.coalesce(6, shuffle=True) and DataFrame.coalesce(6).shuffle()
These statements are not valid Spark API syntax.
More info: Spark Repartition & Coalesce - Explained and Repartition vs Coalesce in Apache Spark - Rock the JVM Blog
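To see the difference concretely, here is a minimal sketch (assuming a SparkSession named spark; the DataFrame built with spark.range is purely illustrative) that reduces a 12-partition DataFrame to 6 partitions both ways:

```python
# Minimal sketch; assumes a SparkSession named `spark`. The DataFrame is illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1200).repartition(12)       # start with 12 partitions

repartitioned = df.repartition(6)            # full shuffle down to 6 partitions
coalesced = df.coalesce(6)                   # merges neighboring partitions, avoids a full shuffle

print(repartitioned.rdd.getNumPartitions())  # 6
print(coalesced.rdd.getNumPartitions())      # 6
```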
Question 52
Which of the following statements about storage levels is incorrect?
Correct Answer: D
Explanation
MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.
Correct, this statement is wrong. MEMORY_AND_DISK does not store data in both places: Spark prioritizes storage in memory and only stores on disk the data that does not fit into memory.
DISK_ONLY will not use the worker node's memory.
Wrong, this statement is correct. DISK_ONLY keeps data only on the worker node's disk, not in memory.
In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.
Wrong, this statement is correct. In fact, Spark has no provision to cache DataFrames on the driver (which sits on the edge node in client mode); it caches DataFrames in the executors' memory.
Caching can be undone using the DataFrame.unpersist() operator.
Wrong, this statement is correct. Caching, achieved via the DataFrame.cache() or DataFrame.persist() operators, can be undone using DataFrame.unpersist(). This operator removes all of the DataFrame's cached blocks from the executors' memory and disk.
The cache operator on DataFrames is evaluated like a transformation.
Wrong, this statement is correct. DataFrame.cache() is evaluated like a transformation, through lazy evaluation: calling DataFrame.cache() has no effect until you call a subsequent action, such as DataFrame.cache().count().
More info: pyspark.sql.DataFrame.unpersist - PySpark 3.1.2 documentation
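The storage levels discussed above can be tried out with DataFrame.persist() and DataFrame.unpersist(); the sketch below assumes a SparkSession named spark and uses an illustrative DataFrame:

```python
# Minimal sketch; assumes a SparkSession named `spark`. The DataFrame is illustrative.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10**6)

df.persist(StorageLevel.MEMORY_AND_DISK)  # memory first; only data that does not fit spills to disk
df.count()                                # persist() is lazy, so an action is needed to materialize the cache

df.unpersist()                            # removes the cached blocks from executor memory and disk

df.persist(StorageLevel.DISK_ONLY)        # kept only on the workers' disks, never in memory
df.count()
df.unpersist()
```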
Question 53
Which of the following code blocks returns a single-column DataFrame of all entries in the Python list throughputRates, which contains only float-type values?
Correct Answer: E
Explanation
spark.createDataFrame(throughputRates, FloatType())
Correct! spark.createDataFrame is the right operator to use here, and the type FloatType(), passed in for the command's schema argument, is correctly instantiated using parentheses. Remember that in PySpark it is essential to instantiate types when passing them to SparkSession.createDataFrame. And, in Databricks, spark returns a SparkSession object.
spark.createDataFrame((throughputRates), FloatType)
No. While wrapping throughputRates in parentheses does not change how this command executes, not instantiating FloatType with parentheses, as in the previous answer, makes this command fail.
spark.createDataFrame(throughputRates, FloatType)
Incorrect. Given that it does not matter whether you wrap throughputRates in parentheses or not, see the explanation of the previous answer for further insights.
spark.DataFrame(throughputRates, FloatType)
Wrong. There is no SparkSession.DataFrame() method in Spark.
spark.createDataFrame(throughputRates)
False. Omitting the schema argument makes PySpark try to infer the schema. However, as you can see in the documentation (linked below), the inference only works if you pass an "RDD of either Row, namedtuple, or dict" for data (the first argument to createDataFrame). Since you are passing a Python list of plain float values, Spark's schema inference fails.
More info: pyspark.sql.SparkSession.createDataFrame - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 3
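A short sketch of the correct answer (assuming a SparkSession named spark; the list values are made up for illustration):

```python
# Minimal sketch; assumes a SparkSession named `spark`. The list values are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.types import FloatType

spark = SparkSession.builder.getOrCreate()
throughputRates = [12.5, 7.25, 3.0, 9.75]   # Python list containing only float values

df = spark.createDataFrame(throughputRates, FloatType())  # note the instantiated type: FloatType()
df.show()
# +-----+
# |value|
# +-----+
# | 12.5|
# | 7.25|
# |  3.0|
# | 9.75|
# +-----+
```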
Question 54
The code block shown below should return a one-column DataFrame where the column storeId is converted to string type. Choose the answer that correctly fills the blanks in the code block to accomplish this.
transactionsDf.__1__(__2__.__3__(__4__))
Correct Answer: D
Explanation
Correct code block: transactionsDf.select(col("storeId").cast(StringType()))
Solving this question involves understanding that types from pyspark.sql.types, such as StringType, need to be instantiated when used in Spark; in simple words, they need to be followed by parentheses, like so: StringType(). You could also use .cast("string") instead, but that option is not given here.
More info: pyspark.sql.Column.cast - PySpark 3.1.2 documentation
Static notebook | Dynamic notebook: See test 2
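A quick sketch of the correct answer in context (assuming a SparkSession named spark; the contents of transactionsDf are made up for illustration):

```python
# Minimal sketch; assumes a SparkSession named `spark`. transactionsDf is illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()
transactionsDf = spark.createDataFrame([(1, 25), (2, 3)], ["transactionId", "storeId"])

result = transactionsDf.select(col("storeId").cast(StringType()))
result.printSchema()
# root
#  |-- storeId: string (nullable = true)
```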
Question 55
Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?
Correct Answer: B
Explanation
itemsDf.sample(fraction=0.1, seed=87238)
Correct. If itemsDf has 10,000 rows, this code block returns about 1,000, since DataFrame.sample() is never guaranteed to return an exact number of rows. To avoid returning duplicates, leave the withReplacement parameter at False, which is the default. Since the question specifies that the same rows should be returned even if the code block is run twice, you need to specify a seed. The value passed as the seed does not matter, as long as it is an integer.
itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)
Incorrect. While this code block fulfills almost all requirements, it may return duplicates because withReplacement is set to True. Here is how to understand replacement: imagine a bucket of 10,000 numbered balls from which you need to take 1,000 balls at random (similar to the problem in the question). If you took the balls with replacement, you would take a ball, note its number, and put it back into the bucket, so the next draw could return the exact same ball again. If you took the balls without replacement, you would leave each drawn ball outside the bucket while taking the remaining 999 balls.
itemsDf.sample(fraction=1000, seed=98263)
Wrong. The fraction parameter needs a value between 0 and 1. In this case, it should be 0.1, since 1,000/10,000 = 0.1.
itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)
No. DataFrame.sampleBy() is meant for stratified sampling: based on the values in a column of a DataFrame, you can draw a certain fraction of the rows containing those values (more details linked below). In the scenario at hand, sampleBy is not the right operator because you have no information about any column that the sampling should depend on.
itemsDf.sample(fraction=0.1)
Incorrect. This code block checks all the boxes except that it does not ensure the exact same rows are returned when you run it a second time. To achieve that, you would have to specify a seed.
More info:
- pyspark.sql.DataFrame.sample - PySpark 3.1.2 documentation
- pyspark.sql.DataFrame.sampleBy - PySpark 3.1.2 documentation
- Types of Samplings in PySpark 3. The explanations of the sampling... | by Pinar Ersoy | Towards Data Science
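A minimal sketch of the correct answer (assuming a SparkSession named spark; itemsDf is replaced here by an illustrative 10,000-row DataFrame):

```python
# Minimal sketch; assumes a SparkSession named `spark`. itemsDf stands in for the 10,000-row DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
itemsDf = spark.range(10000)

sampled = itemsDf.sample(fraction=0.1, seed=87238)  # ~1,000 rows, no duplicates (withReplacement defaults to False)
print(sampled.count())                              # roughly 1,000

# With the same seed (and the same data and partitioning), the same rows come back:
sampledAgain = itemsDf.sample(fraction=0.1, seed=87238)
print(sampled.exceptAll(sampledAgain).count())      # 0
```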