Free Access Snowflake.DSA-C03.v2025-10-01.q105 Practice Test (Page 19)

Question 86

You are tasked with presenting a business case to stakeholders demonstrating the value of a new machine learning model that predicts customer churn. The model has been trained on data within Snowflake, and you have various metrics such as accuracy, precision, recall, and F1-score. You also have feature importance scores generated using a SHAP (SHapley Additive exPlanations) explainer. Which of the following visualization strategies, when combined, would MOST effectively communicate the model's performance and impact to a non-technical audience, while also providing sufficient detail for technical stakeholders?

A. A simple bar chart showing the overall accuracy score of the model alongside a table detailing the precision, recall, and F1-score. Include a word cloud of the most important features from the SHAP values.

B. A confusion matrix visualizing the true positives, true negatives, false positives, and false negatives, along with a summary plot of the SHAP values showing the impact of each feature on the model's prediction for a representative sample of customers. A line chart showing cumulative churn rate across different customer segments.

C. A ROC curve (Receiver Operating Characteristic) showing the trade-off between true positive rate and false positive rate, paired with a detailed table of all feature importance scores generated by the SHAP explainer. Present statistical summaries, such as mean and standard deviation, of the top 5 feature values, grouped by predicted churn probability.

D. A scatter plot showing the relationship between two key features identified by SHAP, colored by the model's churn prediction, and a table summarizing the model's performance metrics (accuracy, precision, recall, F1-score). Additionally, include a waterfall plot for a specific customer, illustrating how each feature contributes to the final prediction.

E. A distribution plot (e.g., histogram or KDE) of the predicted churn probabilities, segmented by actual churn status (churned vs. not churned), combined with a SHAP force plot visualizing the feature contributions for a single, randomly selected customer who churned. Add a section on potential cost savings from churn reduction.
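
As a reminder of how the metrics named in this question relate to the confusion matrix, here is a minimal Python sketch. The label lists are made-up illustration data (1 = churned, 0 = not churned), and the function names are hypothetical, not part of any Snowflake API.

```python
def confusion_counts(actual, predicted):
    """Count true positives, true negatives, false positives, false negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


def summary_metrics(tp, tn, fp, fn):
    """Derive accuracy, precision, recall, and F1-score from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

counts = confusion_counts(actual, predicted)
metrics = summary_metrics(*counts)
```

The four counts are what a confusion-matrix chart displays directly, while the derived ratios are what a metrics table would list.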

Question 87

You are developing a model to predict equipment failure in a factory using sensor data stored in Snowflake. The data is partitioned by 'EQUIPMENT_ID' and 'TIMESTAMP'. After initial model training and cross-validation using the following code snippet:

You observe significant performance variations across different equipment groups when evaluating on out-of-sample data. Which of the following strategies could you employ to address this issue within the Snowflake environment to improve the model's generalization ability across all equipment?

A. Increase the overall size of the 'TRAINING_DATA' to include more historical data for all equipment, assuming this will balance the representation of each 'EQUIPMENT_ID'.

B. Implement a hyperparameter search using 'SYSTEM$OPTIMIZE_MODEL' with a wider range of parameters for each 'EQUIPMENT_ID' individually, creating a separate model for each 'EQUIPMENT_ID'.

C. Retrain the model with additional feature engineering to create interaction terms between 'EQUIPMENT_ID' and other relevant sensor features to capture equipment-specific patterns. For instance, you can one-hot encode 'EQUIPMENT_ID' and include it in 'INPUT_DATA'.

D. Implement cross-validation at the partition level by splitting 'TRAINING_DATA' into train and test sets before creating the model, and then using the 'FIT' command to train on the train set and 'PREDICT' to evaluate on the test set, repeating for each partition.

E. Create separate models per equipment ID. For each equipment ID, split the data into training and testing sets and use 'SYSTEM$OPTIMIZE_MODEL' to perform a hyperparameter search individually. Train and deploy the model at the equipment ID level.
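
The performance variation across equipment groups described in this question can be made visible by computing an error metric per 'EQUIPMENT_ID' instead of a single global score. The sketch below is a plain-Python illustration under assumed inputs: the record layout, the equipment IDs, and the choice of mean absolute error are all hypothetical and not taken from the missing code snippet.

```python
from collections import defaultdict


def per_equipment_mae(records):
    """Return {equipment_id: mean absolute error}.

    Each record is a (equipment_id, actual, predicted) tuple.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for equipment_id, actual, predicted in records:
        totals[equipment_id] += abs(actual - predicted)
        counts[equipment_id] += 1
    return {eq: totals[eq] / counts[eq] for eq in totals}


# Hypothetical out-of-sample results, for illustration only.
records = [
    ("EQ-1", 10.0, 11.0),
    ("EQ-1", 12.0, 11.0),
    ("EQ-2", 10.0, 16.0),
    ("EQ-2", 8.0, 4.0),
]

mae_by_equipment = per_equipment_mae(records)
```

A large gap between groups in such a breakdown is the kind of signal that motivates partition-level validation or equipment-specific features.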

Question 88

A marketing analyst at 'NovaRetail' suspects that a new advertising campaign has increased the average purchase amount. They have historical purchase data in a Snowflake table called 'purchase_history'. To validate their hypothesis using the Central Limit Theorem (CLT), they perform the following steps:

1. Calculate the population mean (μ) of purchase amounts from the historical data.
2. Draw 500 random samples of size 50 from the table.
3. Calculate the sample mean (x̄) for each sample.

Which of the following steps are essential for correctly applying the Central Limit Theorem to perform a z-test to determine whether the new advertising campaign has significantly increased the average purchase amount?

A. Calculate the standard deviation of the population (σ) from the historical data and estimate the standard error of the mean as σ / sqrt(50).

B. Check if the original population distribution (purchase amounts) is approximately normally distributed.

C. Ensure that the samples are drawn independently and randomly.

D. Calculate the standard deviation of the sample means and use it as an estimate for the standard error of the mean.

E. Verify that the sample size (n=50) is sufficiently large to approximate normality of the sample mean distribution based on the CLT. This implicitly assumes population size is significantly larger than the sample size.
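
The quantities mentioned in these options can be tied together in a short Python sketch of a one-sided z-test. The numeric inputs (population mean 100, population standard deviation 20, observed sample mean 106) are invented for illustration; only the sample size of 50 comes from the question.

```python
from math import sqrt
from statistics import NormalDist


def standard_error(population_std, sample_size):
    """Standard error of the mean: sigma / sqrt(n)."""
    return population_std / sqrt(sample_size)


def one_sided_z_test(sample_mean, population_mean, population_std, sample_size):
    """Return (z, p_value) for H1: the true mean exceeds population_mean."""
    se = standard_error(population_std, sample_size)
    z = (sample_mean - population_mean) / se
    # Upper-tail probability under the standard normal distribution.
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical values: mu = 100, sigma = 20, x-bar = 106; n = 50 as in the question.
z, p_value = one_sided_z_test(106.0, 100.0, 20.0, 50)
```

The normal approximation used for the p-value is what the CLT justifies when samples are independent and the sample size is large enough.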

Question 89

You are developing a churn prediction model using Snowpark Python and Scikit-learn. After initial model training, you observe significant overfitting. Which of the following hyperparameter tuning strategies and code snippets, when implemented within a Snowflake Python UDF, would be MOST effective to address overfitting in a Ridge Regression model and how can you implement a reproducible model with minimal code?

A. Option A

B. Option B

C. Option C

D. Option D

E. Option E

Question 90

You are training a regression model to predict house prices using a Snowflake dataset. The dataset contains various features, including 'number_of_bedrooms' and 'sale_date', among others. You want to use time-based partitioning for your training, validation, and holdout sets. However, you also need to ensure that the dataset is properly shuffled within each time partition to mitigate potential bias introduced by the order of data entry. Which of the following strategies is MOST EFFECTIVE and EFFICIENT for partitioning your data into train, validation, and holdout sets in Snowflake, while also ensuring random shuffling within each partition, and addressing potential data leakage issues?

A. Create separate views for train, validation, and holdout sets, filtering by 'sale_date'. Shuffle the entire dataset using 'ORDER BY RANDOM()' before creating the views to ensure randomness across all sets. This does not address shuffling within each partition.

B. Create a new column 'split_group' using a CASE statement based on 'sale_date' to assign each row to 'train', 'validation', or 'holdout'. Then, create temporary tables for each split using 'CREATE TABLE ... AS SELECT ... FROM ... WHERE split_group = ... ORDER BY RANDOM()'. This can be very slow because of the global RANDOM sort, and it has leakage issues from using the full dataset for randomness.

C. Use Snowflake's SAMPLE clause with a 'REPEATABLE' seed for each split (train, validation, holdout), filtering by 'sale_date'. Add an 'ORDER BY RANDOM()' clause within each 'SAMPLE' query to shuffle the data within each split. This approach does not guarantee non-overlapping sets and can introduce sampling bias.

D. Create a user-defined function (UDF) in Python that takes a 'sale_date' as input and returns either 'train', 'validation', or 'holdout' based on pre-defined date ranges. Apply this UDF to each row, creating a 'split_group' column. Then, create temporary tables for each split using 'CREATE TABLE ... AS SELECT ... FROM ... WHERE split_group = ... ORDER BY RANDOM()'. UDF overhead and the global RANDOM sort make it very slow.

E. Create a new column 'split_group' using a CASE statement based on 'sale_date' to assign each row to 'train', 'validation', or 'holdout'. Calculate a random number within each 'split_group' by using 'OVER (PARTITION BY split_group ORDER BY RANDOM())'. Then create temporary tables for each split using 'CREATE TABLE ... AS SELECT ... FROM ... WHERE split_group = ... QUALIFY ROW_NUMBER() OVER (ORDER BY RANDOM()) <= (SELECT COUNT(*) FROM transactions WHERE split_group = ...) * (respective split percentage);'
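
The idea shared by these options, assigning rows to splits by date and then shuffling only within each split, can be sketched outside of SQL in plain Python. The cutoff dates, the seed, and the row layout below are assumptions made for illustration; they are not taken from the question.

```python
import random
from datetime import date


def assign_split(sale_date, train_end, validation_end):
    """Map a sale date to 'train', 'validation', or 'holdout' by date range."""
    if sale_date < train_end:
        return "train"
    if sale_date < validation_end:
        return "validation"
    return "holdout"


def split_and_shuffle(rows, train_end, validation_end, seed=42):
    """Group rows by time-based split, then shuffle inside each group only."""
    groups = {"train": [], "validation": [], "holdout": []}
    for row in rows:
        groups[assign_split(row["sale_date"], train_end, validation_end)].append(row)
    rng = random.Random(seed)  # fixed seed keeps the shuffle reproducible
    for members in groups.values():
        rng.shuffle(members)
    return groups


# Hypothetical rows and cutoffs, for illustration only.
rows = [{"id": i, "sale_date": date(2023, 1 + i % 12, 1)} for i in range(24)]
splits = split_and_shuffle(rows, date(2023, 9, 1), date(2023, 11, 1))
```

Because every row is assigned to exactly one split before any shuffling happens, the splits cannot overlap and no later-dated row can end up in the training set.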
