Online Access Free DSA-C03 Practice Test

Exam Code:	DSA-C03
Exam Name:	SnowPro Advanced: Data Scientist Certification Exam
Certification Provider:	Snowflake
Free Question Number:	289
Posted:	Sep 08, 2025

Question 1

A telecom company, 'ConnectPlus', observes that the individual call durations of its customers are heavily skewed towards shorter calls, following an exponential distribution. A data science team aims to analyze call patterns and needs to perform hypothesis testing on the average call duration. Which of the following statements regarding the applicability of the Central Limit Theorem (CLT) in this scenario are correct if the sample size is sufficiently large?

A.The CLT is applicable only if the sample size is extremely large (e.g., greater than 10,000), due to the exponential distribution's heavy tail.
B.The CLT is not applicable because the population distribution (call durations) is heavily skewed.
C.The CLT is applicable, and the distribution of sample means of call durations will approximate a normal distribution, regardless of the skewness of the individual call durations.
D.The CLT is applicable as long as the sample size is reasonably large (typically n > 30), and the distribution of sample means will be approximately normal. The specific minimum sample size depends on the severity of the skewness.
E.The CLT is applicable, and the sample mean will converge to the population median.
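
The behavior the options describe can be checked with a small simulation. The following is a minimal sketch, not part of the question: the mean call duration of 3 minutes, the sample size of 50, and the helper names `skewness` and `sample_means` are all illustrative assumptions. It compares the skewness of individual exponential durations with the skewness of their sample means.

```python
import random
import statistics


def skewness(values):
    """Population skewness (third standardized moment) of a list of numbers."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def sample_means(mean_duration, n, num_samples, rng):
    """Means of num_samples samples, each holding n exponential durations."""
    rate = 1.0 / mean_duration
    return [
        statistics.fmean(rng.expovariate(rate) for _ in range(n))
        for _ in range(num_samples)
    ]


rng = random.Random(42)   # fixed seed so the run is repeatable
MEAN_DURATION = 3.0       # minutes; assumed value for illustration only

# Individual call durations: strongly right-skewed.
individual = [rng.expovariate(1.0 / MEAN_DURATION) for _ in range(20_000)]

# Means of samples of size 50: much closer to symmetric.
means = sample_means(MEAN_DURATION, n=50, num_samples=2_000, rng=rng)

print(f"skewness of individual durations: {skewness(individual):.2f}")
print(f"skewness of sample means (n=50):  {skewness(means):.2f}")
print(f"mean of sample means:             {statistics.fmean(means):.2f}")
```

The sample means center on the population mean rather than the median, and their skewness shrinks as n grows (for an exponential population it falls roughly as 2 divided by the square root of n).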

Question 2

You have trained a complex Random Forest model in Snowflake to predict loan default risk. You wish to understand the individual and combined effects of 'credit_score' and 'debt_to_income_ratio' on the predicted probability of default. Which approach is MOST suitable for visualizing and interpreting these relationships?

A.Create a two-way Partial Dependence Plot (PDP) showing the interaction between 'credit_score' and 'debt_to_income_ratio'.
B.Calculate feature importance using SNOWFLAKE.ML.FEATURE_IMPORTANCE and focus on the features with the highest scores.
C.Fit a simpler linear model (e.g., Logistic Regression) to the data and interpret its coefficients.
D.Generate individual Partial Dependence Plots (PDPs) for 'credit_score' and 'debt_to_income_ratio'.
E.Examine the model's overall accuracy (e.g., AUC) and assume the relationships are well-represented.
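
For readers unfamiliar with how a two-way partial dependence surface is computed, here is a minimal sketch in plain Python. It is an illustration under stated assumptions: `toy_default_model` is a hypothetical stand-in for a trained model's probability output, and the rows, the extra `loan_amount` feature, and the grid values are invented for the example.

```python
import math
from statistics import fmean


def toy_default_model(row):
    """Hypothetical stand-in for a trained model's predicted default probability."""
    score = (
        -0.01 * (row["credit_score"] - 650)
        + 4.0 * row["debt_to_income_ratio"]
        # Interaction: a high debt ratio hurts more when the credit score is low.
        + 0.02 * max(0, 650 - row["credit_score"]) * row["debt_to_income_ratio"]
        + 0.00002 * row["loan_amount"]
        - 2.0
    )
    return 1.0 / (1.0 + math.exp(-score))


def two_way_partial_dependence(predict, rows, feat_a, grid_a, feat_b, grid_b):
    """Average prediction with feat_a and feat_b forced to each grid pair.

    All other features keep their observed values, so the result shows the
    joint effect of the two features averaged over the rest of the data.
    """
    surface = {}
    for a in grid_a:
        for b in grid_b:
            surface[(a, b)] = fmean(
                predict({**row, feat_a: a, feat_b: b}) for row in rows
            )
    return surface


# Invented sample rows, for illustration only.
rows = [
    {"credit_score": 580, "debt_to_income_ratio": 0.45, "loan_amount": 12_000},
    {"credit_score": 640, "debt_to_income_ratio": 0.30, "loan_amount": 8_000},
    {"credit_score": 710, "debt_to_income_ratio": 0.20, "loan_amount": 20_000},
    {"credit_score": 760, "debt_to_income_ratio": 0.10, "loan_amount": 5_000},
]

pdp = two_way_partial_dependence(
    toy_default_model,
    rows,
    "credit_score", [550, 650, 750],
    "debt_to_income_ratio", [0.1, 0.3, 0.5],
)

for (score, dti), prob in sorted(pdp.items()):
    print(f"credit_score={score} dti={dti:.1f} -> avg predicted default {prob:.3f}")
```

Plotting the resulting grid as a heatmap or contour plot gives the usual two-way PDP view; two separate one-way plots would average the interaction term away.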

Question 3

You are tasked with training a complex machine learning model using scikit-learn and need to leverage Snowflake's data for training outside of Snowflake using an external function. The training data resides in a Snowflake table named 'CUSTOMER_DATA'. Due to data governance policies, you must ensure minimal data movement and secure communication. You choose to implement the external function using AWS Lambda. Which of the following steps are crucial to achieve secure and efficient model training outside of Snowflake?

A.Grant usage privilege on the API integration object to the role that will be calling the external function, ensuring only authorized users can trigger the model training.
B.Utilize Snowflake's data masking policies on the table to anonymize sensitive information before sending it to the external function for training. This ensures data privacy and compliance with regulations.
C.In the Lambda function, establish a direct connection to the Snowflake database using the Snowflake JDBC driver and Snowflake user credentials stored in the Lambda environment variables. This allows the Lambda function to directly query the 'CUSTOMER_DATA' table.
D.Create an API integration object in Snowflake that points to your AWS API Gateway endpoint, configured to invoke the Lambda function. This API integration must use a service principal and access roles for secure authentication.
E.Create an external function in Snowflake that accepts a JSON payload containing the necessary parameters for model training, such as features to use and model hyperparameters. This function will call the API integration to invoke the Lambda function.
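
On the Lambda side, a handler for such an external function might look like the sketch below. It assumes the API Gateway uses Lambda proxy integration, so the request arrives as a JSON string in `event["body"]`, and that the external function passes a single OBJECT argument per row. Snowflake batches rows as `{"data": [[row_number, arg, ...], ...]}` and expects the same shape back. The `train_model` helper is a hypothetical placeholder for the actual scikit-learn training step.

```python
import json


def train_model(features, hyperparameters):
    """Hypothetical placeholder for the real scikit-learn training step."""
    return {
        "status": "trained",
        "n_features": len(features),
        "hyperparameters": hyperparameters,
    }


def lambda_handler(event, context):
    """Handle a Snowflake external function call proxied through API Gateway."""
    try:
        payload = json.loads(event["body"])
        results = []
        for row in payload["data"]:
            # First element is Snowflake's row number; it must be echoed back
            # so that each result can be matched to its input row.
            row_number, params = row[0], row[1]
            outcome = train_model(
                params["features"], params.get("hyperparameters", {})
            )
            results.append([row_number, outcome])
        return {"statusCode": 200, "body": json.dumps({"data": results})}
    except (KeyError, TypeError, ValueError) as exc:
        # A non-200 status is surfaced to the caller as a failed call.
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}


# Local usage example with an invented request body.
event = {
    "body": json.dumps(
        {
            "data": [
                [0, {"features": ["age", "tenure"],
                     "hyperparameters": {"n_estimators": 100}}]
            ]
        }
    )
}
response = lambda_handler(event, None)
print(response["statusCode"], response["body"])
```

Note that the handler holds no Snowflake credentials: it only sees what the external function sends through the API integration.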

Question 4

You are working with a dataset in Snowflake containing customer reviews stored in a 'REVIEWS' table. The 'SENTIMENT_SCORE' column contains continuous values ranging from -1 (negative) to 1 (positive). You need to create a new column, 'SENTIMENT_CATEGORY', based on the following rules: 'Negative': SENTIMENT_SCORE < -0.5; 'Neutral': -0.5 <= SENTIMENT_SCORE <= 0.5; 'Positive': SENTIMENT_SCORE > 0.5. You also want to binarize this 'SENTIMENT_CATEGORY' column into three separate columns: 'IS_NEGATIVE', 'IS_NEUTRAL', and 'IS_POSITIVE'. Which of the following SQL statements correctly implements both the categorization and subsequent binarization?

A.Option A
B.Option D
C.Option E
D.Option C
E.Option B
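
As a rough illustration of the two steps the question describes, the sketch below runs a CASE-based categorization followed by a CASE-based binarization. It uses Python's built-in `sqlite3` module as a stand-in for Snowflake, so it shows only the standard SQL pattern and is not tied to any of the listed options. The table contents and the `review_id` column are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (review_id INTEGER, sentiment_score REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(1, -0.8), (2, -0.5), (3, 0.0), (4, 0.5), (5, 0.9)],  # invented rows
)

query = """
WITH categorized AS (
    SELECT
        review_id,
        CASE
            WHEN sentiment_score < -0.5 THEN 'Negative'
            WHEN sentiment_score BETWEEN -0.5 AND 0.5 THEN 'Neutral'
            WHEN sentiment_score > 0.5 THEN 'Positive'
        END AS sentiment_category
    FROM reviews
)
SELECT
    review_id,
    sentiment_category,
    CASE WHEN sentiment_category = 'Negative' THEN 1 ELSE 0 END AS is_negative,
    CASE WHEN sentiment_category = 'Neutral'  THEN 1 ELSE 0 END AS is_neutral,
    CASE WHEN sentiment_category = 'Positive' THEN 1 ELSE 0 END AS is_positive
FROM categorized
ORDER BY review_id
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The boundary values -0.5 and 0.5 land in 'Neutral' because BETWEEN is inclusive, which matches the rules as stated in the question.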

Question 5

You are training a regression model to predict house prices using a Snowflake dataset. The dataset contains various features, including 'number_of_bedrooms'. You want to use time-based partitioning for your training, validation, and holdout sets. However, you also need to ensure that the dataset is properly shuffled within each time partition to mitigate potential bias introduced by the order of data entry. Which of the following strategies is MOST EFFECTIVE and EFFICIENT for partitioning your data into train, validation, and holdout sets in Snowflake, while also ensuring random shuffling within each partition, and addressing potential data leakage issues?

A.Use Snowflake's SAMPLE clause with a REPEATABLE seed for each split (train, validation, holdout), filtering by 'sale_date'. Add an 'ORDER BY RANDOM()' clause within each SAMPLE query to shuffle the data within each split. This approach does not guarantee non-overlapping sets and can introduce sampling bias.
B.Create a user-defined function (UDF) in Python that takes a 'sale_date' as input and returns either 'train', 'validation', or 'holdout' based on pre-defined date ranges. Apply this UDF to each row, creating a 'split_group' column. Then, create temporary tables for each split using 'CREATE TABLE AS SELECT ... FROM ... WHERE split_group = ... ORDER BY RANDOM()'. UDF overhead and global RANDOM sort make it very slow.
C.Create a new column 'split_group' using a CASE statement based on 'sale_date' to assign each row to 'train', 'validation', or 'holdout'. Calculate a random number within each 'split_group' by using 'OVER (PARTITION BY split_group ORDER BY RANDOM())'. Then create temporary tables for each split using 'CREATE TABLE AS SELECT ... FROM ... WHERE split_group = ... QUALIFY ROW_NUMBER() OVER (ORDER BY RANDOM()) (SELECT COUNT(*) FROM transactions WHERE split_group -- ...) (respective split percentage);'
D.Create separate views for train, validation, and holdout sets, filtering by 'sale_date'. Shuffle the entire dataset using 'ORDER BY RANDOM()' before creating the views to ensure randomness across all sets. This does not address shuffling within each partition.
E.Create a new column 'split_group' using a CASE statement based on 'sale_date' to assign each row to 'train', 'validation', or 'holdout'. Then, create temporary tables for each split using 'CREATE TABLE AS SELECT ... FROM ... WHERE split_group = ... ORDER BY RANDOM()'. This can be very slow because of global RANDOM sort and leakage issues with using full dataset for randomness.
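
The general idea behind these options (assign each row to a split by date, then shuffle only within each split) can be sketched outside Snowflake in plain Python. This is an illustration under assumptions: the cutoff dates, the `sale_id` field, and the helper names `assign_split` and `time_based_split` are hypothetical, and the sketch does not reproduce any specific option's SQL.

```python
import random
from datetime import date


def assign_split(sale_date, train_end, validation_end):
    """Mirror of a CASE expression on sale_date: earlier rows train, later rows test."""
    if sale_date <= train_end:
        return "train"
    if sale_date <= validation_end:
        return "validation"
    return "holdout"


def time_based_split(rows, train_end, validation_end, seed=0):
    """Partition rows by date, then shuffle each partition independently."""
    rng = random.Random(seed)  # seeded so the shuffle is repeatable
    splits = {"train": [], "validation": [], "holdout": []}
    for row in rows:
        group = assign_split(row["sale_date"], train_end, validation_end)
        splits[group].append(row)
    for group_rows in splits.values():
        # Shuffling inside a partition never moves rows across the date cutoffs.
        rng.shuffle(group_rows)
    return splits


# Invented rows spread over one year, for illustration only.
rows = [
    {"sale_id": i, "sale_date": date(2023, 1 + i % 12, 1 + i % 28)}
    for i in range(120)
]

splits = time_based_split(
    rows,
    train_end=date(2023, 8, 31),        # assumed cutoff
    validation_end=date(2023, 10, 31),  # assumed cutoff
    seed=7,
)

for name, group_rows in splits.items():
    print(name, len(group_rows))
```

Because the date cutoffs are applied before any shuffling, every training row predates every validation row, which is the property that guards against temporal leakage.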
