Free Access Snowflake.DSA-C03.v2025-10-01.q105 Practice Test (Page 8)

Question 31

You have a structured dataset in Snowflake containing customer information and purchase history. You aim to build a multi-class classification model to predict customer churn, categorizing customers into 'Low Risk', 'Medium Risk', and 'High Risk' of churning. After training the model, you want to evaluate its performance. Which of the following metrics and evaluation techniques, when used together, provide the MOST comprehensive understanding of the model's performance across all churn risk categories, especially when dealing with potential class imbalance?

A.Overall Accuracy, Precision, Recall, F1-Score for each class, and Confusion Matrix.

B.Area Under the ROC Curve (AUC-ROC) for each class (one-vs-rest approach), Precision-Recall Curve for each class, and Cumulative Accuracy Profile (CAP) curve.

C.Log Loss (Cross-Entropy Loss), Gini Coefficient, and Kolmogorov-Smirnov (KS) statistic.

D.Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (Coefficient of Determination).

E.Only Overall Accuracy and a Confusion Matrix.
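
To make the metrics named in the options concrete, here is a minimal, standard-library sketch of a confusion matrix plus per-class precision, recall, and F1 for the three churn categories. The toy labels and the helper names (`confusion_matrix`, `per_class_report`, `accuracy`) are illustrative assumptions, not part of the question or of any Snowflake API.

```python
LABELS = ["Low Risk", "Medium Risk", "High Risk"]


def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix whose rows are true classes and columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def per_class_report(matrix, labels):
    """Compute precision, recall, and F1 for every class from the matrix."""
    report = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp  # predicted as i, but wrong
        fn = sum(matrix[i]) - tp                 # truly i, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


def accuracy(matrix):
    """Fraction of all predictions that fall on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical, imbalanced toy data: the model predicts "Low Risk" for everyone.
y_true = ["Low Risk"] * 8 + ["Medium Risk", "High Risk"]
y_pred = ["Low Risk"] * 10

matrix = confusion_matrix(y_true, y_pred, LABELS)
report = per_class_report(matrix, LABELS)

print("accuracy:", accuracy(matrix))
for label in LABELS:
    print(label, report[label])
```

On this toy data, overall accuracy is 0.8 even though recall for 'Medium Risk' and 'High Risk' is 0.0, which is why accuracy alone can mislead when classes are imbalanced.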

Question 32

You are a data scientist working with a Snowflake table named 'CUSTOMER_TRANSACTIONS' that contains sensitive PII data, including customer names and email addresses. You need to create a representative sample of 1% of the data for model development, ensuring that the sample is anonymized and protects customer privacy. The sample must be reproducible for future model iterations.
Which of the following steps are most appropriate using Snowpark for Python and SQL?

A.Use the 'SAMPLE' clause in a SQL query to extract 1% of the rows, then apply SHA256 hashing to the 'customer_name' and 'email_address' columns within Snowpark using a UDF. Seed the sampling for reproducibility.

B.Use Snowpark DataFrame's 'sample' function with a fraction of 0.01 and a fixed random seed. Before sampling, create a view that masks 'customer_name' and 'email_address' columns, and then sample from the view.

C.Create a new table using a 'CREATE TABLE AS SELECT' statement combined with the 'SAMPLE' clause and SHA256 hashing functions in SQL to create the sample and anonymize data. Manually seed the random number generator in Python before executing the SQL statement via Snowpark.

D.Employ stratified sampling based on a customer segment column, then anonymize data. Use the TABLESAMPLE BERNOULLI function in SQL with a 1 percent sample rate. Apply SHA256 hashing to the 'customer_name' and 'email_address' columns using SQL functions.

E.Use the 'QUALIFY ROW_NUMBER() OVER (ORDER BY RANDOM()) <= (SELECT COUNT(*) * 0.01 FROM CUSTOMER_TRANSACTIONS)' clause with SHA256 on sensitive columns directly within a CREATE TABLE AS statement to generate an anonymized sample. The query should return only 1 percent of the rows.
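
The two ideas the options combine, seeded sampling and SHA-256 hashing of PII columns, can be illustrated outside Snowflake with a small standard-library sketch. This is only an analogy for the logic, not Snowpark code: the synthetic rows, the seed value, and the helpers `anonymize` and `sample_and_anonymize` are assumptions made for illustration.

```python
import hashlib
import random


def anonymize(value: str) -> str:
    """Replace a PII string with the hex digest of its SHA-256 hash."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def sample_and_anonymize(rows, fraction, seed, pii_columns):
    """Keep each row with probability `fraction`, hashing the PII columns.

    A dedicated random.Random(seed) makes the selection reproducible:
    the same rows, fraction, and seed always yield the same sample.
    """
    rng = random.Random(seed)
    sample = []
    for row in rows:
        if rng.random() < fraction:
            clean = dict(row)  # copy so the source rows are left untouched
            for column in pii_columns:
                clean[column] = anonymize(clean[column])
            sample.append(clean)
    return sample


# Hypothetical stand-in for the CUSTOMER_TRANSACTIONS table.
rows = [
    {
        "customer_name": f"Customer {i}",
        "email_address": f"customer{i}@example.com",
        "amount": float(i % 100),
    }
    for i in range(10_000)
]

PII_COLUMNS = ["customer_name", "email_address"]
first = sample_and_anonymize(rows, 0.01, seed=42, pii_columns=PII_COLUMNS)
second = sample_and_anonymize(rows, 0.01, seed=42, pii_columns=PII_COLUMNS)

print("sample size:", len(first))
print("reproducible:", first == second)
```

Unsalted hashes of low-entropy values such as email addresses can still be reversed by dictionary lookup, so a production pipeline would usually add a secret salt or rely on a masking policy.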

Question 33

You are tasked with feature engineering a dataset containing customer transaction data stored in a Snowflake table named 'CUSTOMER_TRANSACTIONS'. This table includes columns like 'CUSTOMER_ID', 'TRANSACTION_DATE', and 'TRANSACTION_AMOUNT'. You need to create a new feature representing the 'Recency' of the customer, which is the number of days since their last transaction. Using Snowpark Pandas, which of the following code snippets will correctly calculate the Recency feature as a new column in a Snowpark DataFrame?

A.Option A

B.Option B

C.Option C

D.Option D

E.Option E
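
The Recency logic described in the question (latest transaction per customer, then days elapsed since that date) can be sketched in plain Python. This is a standard-library illustration of the computation only, not Snowpark pandas code; the sample rows, the reference date `AS_OF`, and the helper `recency_by_customer` are hypothetical.

```python
from datetime import date

# Hypothetical rows shaped like the CUSTOMER_TRANSACTIONS table.
transactions = [
    {"CUSTOMER_ID": "C1", "TRANSACTION_DATE": date(2024, 1, 5), "TRANSACTION_AMOUNT": 20.0},
    {"CUSTOMER_ID": "C1", "TRANSACTION_DATE": date(2024, 2, 20), "TRANSACTION_AMOUNT": 35.5},
    {"CUSTOMER_ID": "C2", "TRANSACTION_DATE": date(2024, 1, 31), "TRANSACTION_AMOUNT": 12.0},
]


def recency_by_customer(rows, as_of):
    """Return {customer_id: days between the last transaction and as_of}."""
    last_seen = {}
    for row in rows:
        customer = row["CUSTOMER_ID"]
        when = row["TRANSACTION_DATE"]
        # Keep only the most recent transaction date per customer.
        if customer not in last_seen or when > last_seen[customer]:
            last_seen[customer] = when
    return {customer: (as_of - when).days for customer, when in last_seen.items()}


# Assumed reference date; a real pipeline would typically use the current date.
AS_OF = date(2024, 3, 1)
recency = recency_by_customer(transactions, AS_OF)

# Attach the feature to every row as a new RECENCY column.
with_recency = [dict(row, RECENCY=recency[row["CUSTOMER_ID"]]) for row in transactions]

for row in with_recency:
    print(row["CUSTOMER_ID"], row["TRANSACTION_DATE"], row["RECENCY"])
```

In a pandas-style API the same two steps are usually a group-by on 'CUSTOMER_ID' taking the maximum 'TRANSACTION_DATE', followed by a date subtraction converted to days.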

Question 34

You are using Snowflake Cortex to build a customer support chatbot that leverages LLMs to answer customer questions. You have a knowledge base stored in a Snowflake table. The following options describe different methods for using this knowledge base in conjunction with the LLM to generate responses. Which of the following approaches will likely result in the MOST accurate, relevant, and cost-effective responses from the LLM?

A.Directly prompt the LLM with the entire knowledge base content for each customer question. Concatenate all knowledge base entries into a single string and include it in the prompt.

B.Use Snowflake Cortex's 'COMPLETE' function without any external knowledge base. Rely solely on the LLM's pre-trained knowledge.

C.Use Retrieval-Augmented Generation (RAG). Generate vector embeddings for the knowledge base entries, perform a similarity search to find the most relevant entries for each customer question, and include those entries in the prompt.

D.Fine-tune the LLM on the entire knowledge base. Train a custom LLM model specifically on the knowledge base data.

E.Partition your database by different subject matter and then query the specific partitions for your information.
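
The retrieval step of Retrieval-Augmented Generation mentioned in the options can be illustrated with a toy, standard-library sketch. It swaps real vector embeddings for simple word-count vectors and makes no call to Snowflake Cortex; the knowledge base entries and the helpers `embed`, `cosine`, `retrieve`, and `build_prompt` are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, knowledge_base, top_k=2):
    """Return the top_k entries most similar to the question."""
    query = embed(question)
    ranked = sorted(knowledge_base, key=lambda e: cosine(query, embed(e)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, knowledge_base, top_k=2):
    """Put only the retrieved entries, not the whole knowledge base, in the prompt."""
    context = "\n".join(f"- {entry}" for entry in retrieve(question, knowledge_base, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Hypothetical knowledge base rows.
knowledge_base = [
    "To reset your password, open Settings and choose Reset Password.",
    "Refunds are processed within 5 business days of approval.",
    "Our support team is available Monday through Friday.",
]

question = "How do I reset my password?"
prompt = build_prompt(question, knowledge_base, top_k=1)
print(prompt)
```

Because only the best-matching entries are sent, the prompt stays short as the knowledge base grows, which is the cost argument for retrieval over concatenating every entry.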

Question 35

You're building a regression model using Snowpark Python to predict house prices. After initial training, you observe that the model consistently overestimates the prices of high-value houses and underestimates the prices of low-value houses. Given the options below, which optimization metric, along with code snippet to calculate it using Snowpark, would be most effective in addressing this specific issue?

A.Mean Absolute Error (MAE) - as it is sensitive to outliers and will penalize large errors more heavily.

B.Root Mean Squared Error (RMSE) - as it gives more weight to larger errors, making it suitable for addressing the underestimation/overestimation problem.