You are exploring a large dataset of website user behavior in Snowflake to identify patterns and potential features for a machine learning model predicting user engagement. You want to create a visualization showing the distribution of 'session_duration' for different 'user_segments'. The 'user_segments' column contains categorical values like 'New', 'Returning', and 'Power User'. Which Snowflake SQL query and subsequent data visualization technique would be most effective for this task?
Correct Answer: B
Using the median (Option B) provides a more robust measure of central tendency than the average (Option A) when the data may contain outliers. The box plot effectively visualizes the distribution, including quartiles and outliers. Option C involves generating separate queries and histograms for each segment, which is less efficient. Calculating quantiles with 'APPROX_PERCENTILE' (Option D) scales well to large datasets, but the resulting scatter plot is not an effective way to show a distribution. A pie chart shows proportions, not distributions.
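As an illustration only (not part of the original answer options), a minimal Snowflake SQL sketch of the per-segment statistics a box plot relies on, assuming a hypothetical 'web_sessions' table with 'user_segment' and 'session_duration' columns:

-- Hypothetical table and column names; median plus quartiles per segment feed a box plot.
SELECT
    user_segment,
    MEDIAN(session_duration)                                        AS median_duration,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY session_duration)  AS q1,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY session_duration)  AS q3
FROM web_sessions
GROUP BY user_segment;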
Question 2
You have deployed a fraud detection model in Snowflake using Snowpark and are monitoring its performance. You observe a significant drift in the transaction data distribution compared to the data used during training. To address this, you want to implement a retraining strategy. Which of the following steps are MOST critical to automate the retraining process using Snowflake's features?
Correct Answer: A,B,C
Options A, B, and C are the most critical for automating the retraining process. A Stream captures changes in the data. A UDF computing drift metrics and a Task together automate drift detection and the retraining trigger. Updating the model artifact deploys the retrained model. While data lineage (D) is important for reproducibility, and building a new Docker image for each retraining run (E) is viable, neither is strictly critical to automating the core retraining process.
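A minimal sketch of the stream-plus-task wiring, assuming hypothetical object names ('transactions', 'transactions_stream', 'drift_check_and_retrain', 'ml_wh'); the drift metric itself would live in the called procedure or UDF:

-- Capture new transaction rows as they arrive.
CREATE OR REPLACE STREAM transactions_stream ON TABLE transactions;

-- Periodically check for drift and trigger retraining when new data exists.
CREATE OR REPLACE TASK drift_check_task
  WAREHOUSE = ml_wh
  SCHEDULE = '60 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('TRANSACTIONS_STREAM')
AS
  CALL drift_check_and_retrain();  -- hypothetical stored procedure wrapping the drift UDF

ALTER TASK drift_check_task RESUME;  -- tasks are created suspended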
Question 3
You're developing a model to predict customer churn using Snowflake. Your dataset is large and continuously growing. You need to implement partitioning strategies to optimize model training and inference performance. You consider the following partitioning strategies: 1. Partitioning by 'customer_segment' (e.g., 'High-Value', 'Medium-Value', 'Low-Value'). 2. Partitioning by 'signup_date' (e.g., monthly partitions). 3. Partitioning by 'region' (e.g., 'North America', 'Europe', 'Asia'). Which of the following statements accurately describe the potential benefits and drawbacks of these partitioning strategies within a Snowflake environment, specifically in the context of model training and inference?
Correct Answer: A,B,C,E
Options A, B, C and E are correct because: A: Correctly identifies the benefits (segment-specific models) and drawbacks (overfitting on small segments) of partitioning by 'customer_segment'. B: Accurately describes the advantages (temporal patterns, walk-forward validation) and limitations (independence from signup date) of partitioning by 'signup_date'. C: Properly explains the use case (geographic influence), performance benefits (filtering), and potential drawbacks (data silos) of partitioning by 'region'. E: Correctly highlights the implementation overhead and potential skew issues associated with partitioning. Option D is incorrect because clustering on top of partitioning does not guarantee performance improvements without assessing the underlying query patterns; Snowflake automatically partitions data into micro-partitions, so additional clustering might not always yield significant gains.
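As a rough sketch (hypothetical table and column names), 'partitioning' in Snowflake is typically expressed through clustering keys and filtered subsets rather than explicit partitions:

-- Cluster the feature table by signup month so date-filtered training queries prune micro-partitions.
ALTER TABLE customer_churn_features
  CLUSTER BY (DATE_TRUNC('MONTH', signup_date));

-- Segment-specific training subset for a per-segment model.
CREATE OR REPLACE VIEW high_value_training_set AS
SELECT *
FROM customer_churn_features
WHERE customer_segment = 'High-Value';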
Question 4
You are responsible for deploying a fraud detection model in Snowflake. The model needs to be validated rigorously before being put into production. Which of the following actions represent the MOST comprehensive approach to model validation within the Snowflake environment, focusing on both statistical performance and operational readiness, and using Snowflake features for validation?
Correct Answer: B,C
Options B and C represent the most comprehensive approaches. Option B utilizes K-fold cross-validation within Snowflake for robust performance evaluation across data segments and automates validation on new data using streams and tasks. Option C emphasizes backtesting with historical data using Snowflake's Time Travel feature and monitors performance with alerts, ensuring real-world relevance and timely detection of performance degradation. Option A is insufficient as it relies on a single train/test split. Option D is inadequate and risky due to the lack of validation. Option E is also insufficient: calculating a single AUC on the entire dataset (including training data) gives an overly optimistic estimate and cannot detect overfitting.
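A minimal Time Travel backtesting sketch, assuming a hypothetical scoring UDF 'fraud_score_udf', a 'transactions' table with a label column, and a data retention period at least as long as the chosen offset:

-- Score the table as it existed 7 days ago and compare predictions against known labels.
SELECT
    t.transaction_id,
    fraud_score_udf(t.amount, t.merchant_id) AS predicted_score,  -- hypothetical UDF and features
    t.is_fraud                               AS actual_label
FROM transactions AT(OFFSET => -60*60*24*7) AS t;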
Question 5
You are building a fraud detection model for an e-commerce platform. One of the features is 'purchase_amount', which ranges from $1 to $10,000. The data has a skewed distribution with many small purchases and a few very large ones. You need to normalize this feature for your model, which uses gradient descent. Which normalization technique(s) would be most suitable in Snowflake, considering the data characteristics and the need to handle potential future outliers?
Correct Answer: C,D
Options C and D are the most suitable. Robust scaling (C) is effective because it uses the IQR, making it far less sensitive to outliers than Min-Max scaling (A) or Z-score standardization (B); a Snowflake UDF implementing robust scaling is therefore not dramatically influenced by extreme purchase amounts. The Power Transformer (D) addresses the skewness of the data, also mitigating the impact of outliers. Min-Max scaling (A) is highly sensitive to outliers, making it a poor choice. Z-score standardization (B) can be distorted by extreme values in skewed distributions. Unit Vector normalization (E) changes the meaning of the purchase amounts by forcing each vector's magnitude to 1, which isn't desirable here.
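A minimal robust-scaling sketch in Snowflake SQL (hypothetical 'transactions' table), centering on the median and scaling by the IQR so extreme purchase amounts have limited influence:

-- Compute median and quartiles once, then scale each row: (x - median) / IQR.
WITH stats AS (
    SELECT
        PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY purchase_amount) AS med,
        PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY purchase_amount) AS q1,
        PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY purchase_amount) AS q3
    FROM transactions
)
SELECT
    t.purchase_amount,
    (t.purchase_amount - s.med) / NULLIF(s.q3 - s.q1, 0) AS purchase_amount_robust_scaled
FROM transactions t
CROSS JOIN stats s;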