Question 11

A Snowflake table named 'SALES DATA contains a 'TRANSACTION DATE column stored as VARCHAR. The data in this column is inconsistent; some rows have dates in 'YYYY-MM-DD' format, others in 'MM/DD/YYYY' format, and some contain invalid date strings like 'N/A'. You need to standardize all dates to 'YYYY-MM-DD' format and store them in a new column called FORMATTED DATE in a new table 'STANDARDIZED_SALES DATA. Which of the following approaches, using Snowpark Python and SQL, most effectively handles these inconsistencies and minimizes errors during data transformation? Select all that apply:
  • Question 12

    You are analyzing a dataset of website traffic and conversions in Snowflake, aiming to understand the relationship between the number of pages visited CPAGES VISITED) and the conversion rate (CONVERSION_RATE). You perform a simple linear regression using the 'REGR SLOPE and 'REGR INTERCEPT functions. However, after plotting the data and the regression line, you observe significant heteroscedasticity (non-constant variance of errors). Which of the following actions, performed within Snowflake during the data preparation and feature engineering phase, are MOST appropriate to address this heteroscedasticity and improve the validity of your linear regression model? (Select all that apply)
  • Question 13

    You have trained a logistic regression model in Python using scikit-learn and plan to deploy it as a Python stored procedure in Snowflake. You need to serialize the model for deployment. Consider the following code snippet:
  • Question 14

    You've built a customer churn prediction model in Snowflake, and are using the AUC as your primary performance metric. You notice that your model consistently performs well (AUC > 0.85) on your validation set but significantly worse (AUC < 0.7) in production. What are the possible reasons for this discrepancy? (Select all that apply)
  • Question 15

    You are building a model to predict loan defaults using data stored in Snowflake. As part of your feature engineering process within a Snowflake Notebook, you need to handle missing values in several columns: 'annual _ income', and You want to use a combination of imputation strategies: replace missing values with the median, 'annual_income' with the mean, and with a constant value of 0.5. You are leveraging the Snowpark DataFrame API. Which of the following code snippets correctly implements this imputation strategy?