You are tasked with training a machine learning model within Snowflake using a Python UDTF. The UDTF is intended to process incoming sales data, calculate features, and update the model incrementally. The model is a simple linear regression using scikit-learn. Your initial attempt fails with the error "ModuleNotFoundError: No module named 'sklearn'" within the UDTF. You have already confirmed that scikit-learn is available in your Anaconda channel and specified it during session creation. Which of the following actions would MOST directly address this issue and allow the UDTF to successfully import and use scikit-learn?
Correct Answer: D
The 'PACKAGES' parameter within the 'CREATE FUNCTION' statement is the MOST direct and reliable way to ensure that specific Python packages are available to your UDTF. Options A, B, and C might address related issues, but directly specifying the package in the function definition is the recommended approach. Option E, although technically feasible, is not a best practice and can lead to dependency management issues. The Snowpark session is created automatically and is not the reason sklearn is unavailable. The Anaconda channel only determines which packages can be requested; the function definition itself needs an explicit reference to each package it imports.
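As a minimal sketch of what the explanation describes, the DDL below declares scikit-learn in the 'PACKAGES' clause of a Python UDTF. The function name, column names, handler class, and the runtime version '3.10' are illustrative assumptions, not taken from the question; the statement is held in a Python string so that it can be submitted from a Snowpark session.

```python
# Sketch only: fit_sales_model, FitSalesModel, the column names and the
# runtime version are hypothetical choices made for illustration.
CREATE_UDTF_SQL = """
CREATE OR REPLACE FUNCTION fit_sales_model(feature FLOAT, target FLOAT)
RETURNS TABLE (coefficient FLOAT, intercept FLOAT)
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
PACKAGES = ('scikit-learn')
HANDLER = 'FitSalesModel'
AS $$
from sklearn.linear_model import LinearRegression

class FitSalesModel:
    def __init__(self):
        # Rows are buffered per partition and fitted once at the end.
        self.features, self.targets = [], []

    def process(self, feature, target):
        self.features.append([feature])
        self.targets.append(target)

    def end_partition(self):
        model = LinearRegression().fit(self.features, self.targets)
        yield (float(model.coef_[0]), float(model.intercept_))
$$
"""
```

Assuming an existing Snowpark session object named session, the statement could be run with session.sql(CREATE_UDTF_SQL).collect(). The point of the sketch is only that the package is named in the function definition itself rather than at session creation.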
Question 92
You are performing exploratory data analysis on a large sales dataset in Snowflake using Snowpark. The dataset contains columns such as 'order_id', , and 'profit'. You want to identify the top 5 most profitable products for each month. You have already created a Snowpark DataFrame named 'sales_df'. Which of the following Snowpark operations, when combined correctly, will efficiently achieve this?
Correct Answer: A
Option A correctly describes the process. First group by month and product to calculate total profit, then apply a window function, partitioned by month and ordered by total profit in descending order, to assign a rank within each month, and keep the rows ranked 1 through 5. Options B and C use less efficient ranking functions. Option D groups by product globally, missing the monthly granularity. Option E uses 'ntile', which divides the products into 5 buckets rather than selecting the top 5.
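The two-step logic (aggregate per month and product, then rank within each month) can be sketched in plain Python. This is not Snowpark code and does not use the Snowpark API; it only mirrors the group-by and per-partition ranking on in-memory tuples. The month and product fields are assumed here because the column list in the question is incomplete.

```python
from collections import defaultdict

def top_products_per_month(rows, n=5):
    """rows: iterable of (month, product, profit) tuples (hypothetical layout)."""
    # Step 1: group by (month, product) and sum the profit.
    totals = defaultdict(float)
    for month, product, profit in rows:
        totals[(month, product)] += profit

    # Step 2: partition by month.
    by_month = defaultdict(list)
    for (month, product), total in totals.items():
        by_month[month].append((product, total))

    # Step 3: order each partition by total profit, descending, and keep n rows.
    result = {}
    for month, items in by_month.items():
        ranked = sorted(items, key=lambda item: item[1], reverse=True)
        result[month] = ranked[:n]
    return result
```

Grouping only by product, as in Option D, would correspond to dropping the month from the key in step 1, which is why it loses the monthly granularity.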
Question 93
Consider the following Python UDF intended to train a simple linear regression model using scikit-learn within Snowflake. The UDF takes feature columns and a target column as input and returns the model's coefficients and intercept as a JSON string. You are encountering an error during the CREATE OR REPLACE FUNCTION statement because of the incorrect deployment of the package during runtime. What would be the right way to fix this deployment and execute your model?
Correct Answer: E
Option E is correct: it explains how to deploy the required packages so that the function is created and the model executes successfully.
Question 94
You have successfully trained a binary classification model using Snowpark ML and deployed it as a UDF in Snowflake. The UDF takes several input features and returns the predicted probability of the positive class. You need to continuously monitor the model's performance in production to detect potential data drift or concept drift. Which of the following methods and metrics, when used together, would provide the MOST comprehensive and reliable assessment of model performance and drift in a production environment? (Select TWO)
Correct Answer: B,D
Options B and D provide the most comprehensive assessment of model performance and drift. Option D, by continuously calculating key performance metrics (AUC, precision, recall, F1-score) on labeled production data, directly assesses how well the model is performing on real-world data. Comparing these metrics to the holdout set provides insights into potential overfitting or degradation over time (concept drift). Option B, calculating the KS statistic between the predicted probability distributions of training and production data, helps to identify data drift, indicating that the input data distribution has changed. Option A can be an indicator but is less reliable than the KS statistic. Option C monitors data pipeline health, not model performance. Option E focuses on data quality, which is important but doesn't directly assess model performance drift.
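For reference, the two-sample KS statistic mentioned for Option B is the largest absolute gap between the empirical cumulative distribution functions of the two samples. The following is a small standard-library sketch of that definition; the function name is hypothetical, and a production check would more likely rely on an established statistics library.

```python
from bisect import bisect_right

def ks_statistic(train_scores, prod_scores):
    """Two-sample Kolmogorov-Smirnov statistic for two non-empty samples."""
    a = sorted(train_scores)
    b = sorted(prod_scores)
    largest_gap = 0.0
    # The supremum of |F_a(x) - F_b(x)| is attained at an observed value,
    # so evaluating both empirical CDFs at every data point is sufficient.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

A value near 0 means the training and production score distributions look alike; a value near 1 means they barely overlap. Any alerting threshold is a monitoring design choice and is not specified in the question.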
Question 95
You are developing a machine learning model within a Snowflake UDF (User-Defined Function) written in Python. This UDF needs to access external Python libraries not included in the default Snowflake Anaconda channel. You've created a stage and uploaded the necessary file. You've successfully used 'conda create' and 'conda install --file requirements.txt' to create your environment locally, and subsequently zipped the environment. Now, what steps are essential to configure the Snowflake UDF to correctly use these external libraries from the stage? Select all that apply.
Correct Answer: B,C,D
Options B, C, and D are crucial. Snowflake UDFs can use custom environments created and uploaded as ZIP files to a stage. The 'imports' clause in the function definition must point to the ZIP file on the stage (Option C). The 'PYTHON_VERSION' must match the environment's Python version (Option D). Option B describes the process of creating a deployment-ready ZIP file. Option A's approach of manually setting 'sys._xoptions' is incorrect and not a recommended or supported method. Option E is not the standard way to manage external libraries; uploading a pre-built environment is more reliable and avoids dependency conflicts during UDF execution.
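A hedged sketch of a function definition that references a staged ZIP file follows. In 'CREATE FUNCTION' DDL the Python version is set with the 'RUNTIME_VERSION' clause. The stage path, archive name, function name, and the module my_lib are hypothetical placeholders, and the helper that builds the statement is only an illustration.

```python
def build_udf_ddl(stage_path, runtime_version):
    """Return CREATE FUNCTION DDL that imports a staged ZIP (names are hypothetical)."""
    return f"""
CREATE OR REPLACE FUNCTION score_row(x FLOAT)
RETURNS FLOAT
LANGUAGE PYTHON
RUNTIME_VERSION = '{runtime_version}'
IMPORTS = ('{stage_path}')
HANDLER = 'score'
AS $$
import my_lib  # hypothetical module assumed to be packaged inside the ZIP

def score(x):
    return my_lib.predict(x)
$$
"""

# Example: the runtime version should match the Python version used to build the archive.
ddl = build_udf_ddl("@my_stage/my_env.zip", "3.10")
```

Keeping the version and the stage path as parameters makes it easier to keep the declared runtime in step with the environment that produced the archive.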