You are tasked with fine-tuning a Snowflake Cortex LLM using your own labeled dataset to improve its performance on a specific sentiment analysis task related to customer reviews. You have already created a Snowflake stage 'my_stage' and uploaded your labeled data in CSV format to this stage. The labeled data contains two columns: 'review_text' and 'sentiment' (values: 'positive', 'negative', 'neutral'). Which of the following SQL commands, or sequences of commands, is MOST appropriate to initiate the fine-tuning process using the 'SNOWFLAKE.ML.FINETUNE_LLM' function? Assume you have already set the necessary permissions for your role to access the model and stage.
Correct Answer: E
The correct answer is E. The 'SNOWFLAKE.ML.FINETUNE_LLM' function requires 'INPUT', which specifies the location of the training data; 'MODEL', which is the base LLM from Snowflake Cortex to fine-tune; and 'TASK', which specifies the intent of the fine-tuning. Option D is incorrect because it adds 'parameter', which is not required. Option B is incorrect because it adds 'target_accuracy', which is not one of the function's parameters. Options A and C rely on custom function definitions, which is incorrect.
Question 62
A data scientist is building a linear regression model in Snowflake to predict customer churn based on structured data stored in a table named 'CUSTOMER_DATA'. The table includes features like 'CUSTOMER_ID', 'AGE', 'TENURE_MONTHS', 'NUM_PRODUCTS', and 'AVG_MONTHLY_SPEND'. The target variable is 'CHURNED' (1 for churned, 0 for active). After building the model, the data scientist wants to evaluate its performance using Mean Squared Error (MSE) on a held-out test set. Which of the following SQL queries, executed within Snowflake's stored procedure framework, is the MOST efficient and accurate way to calculate the MSE for the linear regression model predictions against the actual 'CHURNED' values in the 'CUSTOMER_DATA_TEST' table, assuming the linear regression model is named 'churn_model' and the predicted values are generated by the MODEL_APPLY() function?
Correct Answer: D
Option D is the most efficient and accurate because it calculates the MSE directly in a single SQL query. It avoids cursors and procedural logic, which are less performant in Snowflake. It uses SUM to calculate the sum of squared errors and COUNT(*) to get the total number of records, then divides the two to obtain the average (MSE). Option B takes the average of a power expression, which is the wrong mathematical operation. Option A is mathematically correct but slow because it uses a cursor, which does not follow Snowflake best practices. Option C uses JavaScript, which is also valid, but Snowflake recommends using SQL where possible for performance. Option E uses external Python for the model calculation, which is not the best fit for this scenario.
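To make the arithmetic behind the explanation concrete, here is a minimal Python sketch of the same calculation: the sum of squared errors divided by the row count. The sample values and the function name are hypothetical and only illustrate the formula; they are not taken from the question's answer options.

```python
def mean_squared_error(actual, predicted):
    """Return the mean of squared differences between paired values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one row is required")
    # Sum of squared errors divided by the row count,
    # mirroring SUM(squared error) / COUNT(*) in a single SQL query.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical held-out labels (1 = churned, 0 = active) and model outputs.
churned = [1, 0, 0, 1]
predicted = [0.5, 0.0, 0.5, 1.0]
mse = mean_squared_error(churned, predicted)
print(mse)
```

The whole computation is one aggregate over the paired rows, which is why a set-based query fits it better than a row-by-row cursor loop.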
Question 63
A data scientist is tasked with predicting house prices using Snowflake. They have a dataset stored in a Snowflake table called 'HOUSE_PRICES' with columns such as 'SQUARE_FOOTAGE', 'NUM_BEDROOMS', 'LOCATION_ID', and 'PRICE'. They choose a Random Forest Regressor model. Which of the following steps is MOST important to prevent overfitting and ensure good generalization performance on unseen data, and how can this be effectively implemented within a Snowflake-centric workflow?
Correct Answer: B
Hyperparameter tuning with cross-validation is crucial to prevent overfitting. By splitting the data into training and validation sets, we can evaluate the model's performance on unseen data and adjust the hyperparameters accordingly. Snowflake's 'QUALIFY' clause and temporary tables can be used to manage these splits efficiently. Using a maximum number of estimators without validation is prone to overfitting. Training on the entire dataset without validation provides no indication of generalization performance. Randomly selecting a subset of features may remove important predictors, and eliminating outliers without proper investigation can skew the data and reduce the efficacy of the model.
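The splitting step described above can be sketched in plain Python. This is only an illustration of k-fold assignment under the assumption that each row has a stable identifier; the function name and the modulo-based fold rule are hypothetical choices, not the method from the answer option.

```python
def k_fold_splits(row_ids, k):
    """Yield (train_ids, validation_ids) pairs, one per fold."""
    if k < 2 or k > len(row_ids):
        raise ValueError("k must be between 2 and the number of rows")
    for fold in range(k):
        # The row at position i belongs to fold i % k.
        # Each fold is held out for validation exactly once.
        validation = [r for i, r in enumerate(row_ids) if i % k == fold]
        train = [r for i, r in enumerate(row_ids) if i % k != fold]
        yield train, validation


row_ids = list(range(10))  # hypothetical row identifiers
splits = list(k_fold_splits(row_ids, k=5))
for train, validation in splits:
    print(len(train), len(validation))
```

In practice the rows would usually be shuffled, or assigned to folds by a random or hashed key, before splitting so that any ordering in the table does not leak into the folds.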
Question 64
You've deployed a fraud detection model in Snowflake using Snowpark. You are monitoring its performance and notice a significant decrease in recall, while precision remains high. This means the model is missing many fraudulent transactions. The training data was initially balanced, but you suspect that recent changes in user behavior have skewed the distribution of fraudulent vs. non-fraudulent transactions in production. Which of the following actions are MOST appropriate to address this issue and improve the model's performance, considering best practices for model retraining within the Snowflake ecosystem?
Correct Answer: B,C,D
Options B, C, and D are the most appropriate. B addresses the data drift by incorporating recent production data, with re-balancing to mitigate the skewed distribution. C directly improves recall by adjusting the classification threshold. D establishes a proactive drift detection and retraining system, which is a best practice for long-term model maintenance. A is incorrect because the original data doesn't reflect current trends. E is too drastic as a first step; adjusting the threshold and retraining are preferred first. Retraining with balanced, recent data is critical, especially if the class distribution has shifted. Monitoring for drift provides an automated approach to maintaining model accuracy in a changing environment. A low-code retraining pipeline built on SQL UDF transformations is also appropriate given the current model performance.
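The effect of adjusting the classification threshold can be shown with a small Python sketch. The labels and scores below are made up for illustration, and the helper name is hypothetical; the point is only that lowering the threshold trades precision for recall.

```python
def precision_recall(labels, scores, threshold):
    """Return (precision, recall) when scores >= threshold count as fraud."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical ground truth (1 = fraud) and model scores.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.6, 0.4, 0.3, 0.1, 0.2, 0.45, 0.35]

strict = precision_recall(labels, scores, threshold=0.5)
relaxed = precision_recall(labels, scores, threshold=0.3)
print("threshold 0.5:", strict)
print("threshold 0.3:", relaxed)
```

With these sample values the stricter threshold misses half of the fraudulent rows, while the lower threshold catches all of them at the cost of some false positives.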
Question 65
You have deployed a regression model in Snowflake as an external function using AWS Lambda. The external function takes several numerical features as input and returns a predicted value. You want to continuously monitor the model's performance in production and automatically retrain it when the performance degrades below a predefined threshold. Which of the following methods represent VALID approaches for calculating and monitoring model performance within the Snowflake environment and triggering the retraining process?
Correct Answer: A,B,C
Options A, B, and C all represent valid approaches. A uses Snowflake Tasks, SQL queries for metrics, and UDFs or stored procedures for retraining. B uses AWS Lambda logging, CloudWatch, and Step Functions to orchestrate retraining. C leverages Snowflake's alerting feature and webhooks. D, while technically possible, is not scalable, because polling an external function from Snowpark introduces unnecessary latency and overhead. E is partially correct; however, SageMaker can't directly validate predictions against the actual results stored in Snowflake. Therefore, alerting or tasks within Snowflake must be used.
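The decision that a scheduled job or alert has to make can be sketched as a simple threshold check. This Python example assumes RMSE as the monitored metric and a hypothetical threshold of 2.0; the question does not name either, and the function names are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error over paired actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def should_retrain(actual, predicted, rmse_threshold):
    """Return True when the error on recent rows exceeds the threshold."""
    return rmse(actual, predicted) > rmse_threshold


# Hypothetical recent observations and two sets of model predictions.
actual = [10.0, 12.0, 14.0]
healthy_predictions = [10.0, 12.0, 14.0]
degraded_predictions = [13.0, 16.0, 18.0]

print(should_retrain(actual, healthy_predictions, rmse_threshold=2.0))
print(should_retrain(actual, degraded_predictions, rmse_threshold=2.0))
```

Whichever orchestration option is chosen, the same comparison sits at its core: compute the metric over recent rows with known outcomes, compare it with the predefined threshold, and trigger retraining only when it is exceeded.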