You are tasked with identifying fraudulent transactions from unstructured log data stored in Snowflake. The logs contain various fields, including timestamps, user IDs, and transaction details embedded within free-text descriptions. You plan to use a supervised learning approach, having labeled a subset of transactions as 'fraudulent' or 'not fraudulent.' Which of the following methods best describes the extraction and processing of this data for training a machine learning model within Snowflake?
Correct Answer: C
Option C provides the most comprehensive and effective approach. It combines the strengths of both regular expressions (for structured data extraction) and NLP techniques (for understanding the semantic content of the log descriptions). Using Snowflake UDFs keeps the data processing within Snowflake, minimizing data movement. Combining extracted features with other structured data enhances the model's performance.
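The regex-plus-text-features idea behind Option C can be sketched in plain Python, the kind of logic a Python UDF might wrap. This is only an illustration: the log layout, the `extract_features` name, and the keyword list are assumptions, not part of the original question, and the keyword count merely stands in for real NLP features.

```python
import re
from typing import Optional

# Hypothetical log layout, assumed for illustration only:
# "2024-03-01T12:00:05Z user=U123 txn: paid $250.00 to ACME, urgent wire requested"
LOG_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\s+"
    r"user=(?P<user_id>\w+)\s+"
    r"txn:\s*(?P<description>.*)"
)
AMOUNT_PATTERN = re.compile(r"\$(\d+(?:\.\d{2})?)")

# Toy keyword list standing in for real NLP-derived features.
SUSPICIOUS_TERMS = ("urgent", "wire", "gift card")


def extract_features(line: str) -> Optional[dict]:
    """Pull structured fields out of one log line; return None if it does not parse."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    description = match.group("description")
    amount = AMOUNT_PATTERN.search(description)
    lowered = description.lower()
    return {
        "ts": match.group("ts"),
        "user_id": match.group("user_id"),
        "amount": float(amount.group(1)) if amount else None,
        # Crude text feature: how many suspicious terms appear in the description.
        "suspicious_term_count": sum(term in lowered for term in SUSPICIOUS_TERMS),
        "description_length": len(description),
    }


row = extract_features(
    "2024-03-01T12:00:05Z user=U123 txn: paid $250.00 to ACME, urgent wire requested"
)
```

The resulting dictionary could then be joined with other structured columns and the fraud labels before training.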
Question 67
You are building a fraud detection model using transaction data stored in Snowflake. The dataset includes features like transaction amount, merchant category, location, and time. Due to regulatory requirements, you need to ensure personally identifiable information (PII) is handled securely and compliantly during the data collection and preprocessing phases. Which of the following combinations of Snowflake features and techniques would be MOST suitable for achieving this goal?
Correct Answer: A,E
Options A and E are the MOST suitable. Option A directly addresses PII protection by leveraging Snowflake's masking policies to redact sensitive data before it is used for model training. Role-based access control provides an additional layer of security by limiting access to the unmasked data. Option E applies differential privacy to protect individual transaction data while still enabling useful model training and combines it with Row Access policies to restrict access to sensitive transaction records. Option B is partially correct but insufficient, as it only addresses which columns are seen, not protection within those columns. Option C protects the entire database but doesn't address PII handling during model training. Option D is highly risky and non-compliant, as it exposes PII to a third party without adequate protection.
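The role-dependent behaviour of a masking policy can be mimicked in a few lines of Python to show what downstream training code would see. This is a sketch under stated assumptions, not Snowflake's API: the role name, the `mask_email` and `apply_masking` helpers, and the email column are all hypothetical.

```python
from typing import Dict, List

# Hypothetical role name; in Snowflake the policy would check the session's actual role.
PRIVILEGED_ROLES = {"COMPLIANCE_ANALYST"}


def mask_email(value: str, current_role: str) -> str:
    """Return the raw value for privileged roles and a redacted form otherwise."""
    if current_role in PRIVILEGED_ROLES:
        return value
    _, _, domain = value.partition("@")
    return "*****@" + domain if domain else "*****"


def apply_masking(rows: List[Dict], current_role: str) -> List[Dict]:
    """Mask the (assumed) 'email' column in every row, leaving other fields untouched."""
    return [{**row, "email": mask_email(row["email"], current_role)} for row in rows]


transactions = [{"email": "jane@example.com", "amount": 42.5}]
training_view = apply_masking(transactions, current_role="DATA_SCIENTIST")
```

Here the non-privileged role still sees the transaction amount for model training, but only a redacted email.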
Question 68
You are working on a customer churn prediction model and are using Snowpark Feature Store. One of your features is updated daily. You notice that your model's performance degrades over time, likely due to stale feature values being used during inference. You want to ensure that the model always uses the most up-to-date feature values. Which of the following strategies would be the MOST effective way to address this issue using Snowpark Feature Store and avoid model staleness during online inference?
Correct Answer: E
Option E is the most effective. Configuring the feature group with is important to reduce model staleness during online inference. Setting the in the configuration will serve as an indicator for staleness and use the method to retrieve the latest feature value available.
Question 69
A data engineer is tasked with removing duplicates from a table named 'USER_ACTIVITY' in Snowflake, which contains user activity logs. The table has columns 'USER_ID', 'ACTIVITY_TIMESTAMP', 'ACTIVITY_TYPE', and 'DEVICE_ID'. The data engineer wants to remove duplicate rows, considering only the 'USER_ID', 'ACTIVITY_TYPE', and 'DEVICE_ID' columns. What is the most efficient and correct SQL query to achieve this while retaining only the earliest 'ACTIVITY_TIMESTAMP' for each unique combination of the specified columns?
Correct Answer: B
Option B provides the most efficient and correct solution. It uses the 'QUALIFY' clause along with the 'ROW_NUMBER()' window function to partition the data by 'USER_ID', 'ACTIVITY_TYPE', and 'DEVICE_ID'. Within each partition, it orders the rows by 'ACTIVITY_TIMESTAMP' in ascending order. The 'ROW_NUMBER()' function assigns a unique sequential number to each row within the partition. The 'QUALIFY' clause filters the result set, keeping only the rows where 'ROW_NUMBER()' equals 1, which effectively selects the earliest activity timestamp for each unique combination of 'USER_ID', 'ACTIVITY_TYPE', and 'DEVICE_ID'. Option A is incorrect because it aggregates and retains only the minimum 'ACTIVITY_TIMESTAMP', discarding other potentially relevant columns. Option C is incorrect because it returns only rows where a combination of 'USER_ID', 'ACTIVITY_TYPE', 'DEVICE_ID', and 'ACTIVITY_TIMESTAMP' appears exactly once, so it does not remove duplicates based on the desired columns. Option D is incorrect because it selects only distinct combinations of 'USER_ID', 'ACTIVITY_TYPE', and 'DEVICE_ID', thus losing the 'ACTIVITY_TIMESTAMP'. Option E is incorrect: although 'FIRST_VALUE' returns the earliest 'ACTIVITY_TIMESTAMP', it emits a value for every input row, so the duplicate rows remain in the result.
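The keep-row-number-1 logic can be checked outside Snowflake with a small Python emulation. The SQL in the comment is a query of the shape the explanation describes (the original option text is not reproduced here), and the `keep_earliest` helper and sample rows are assumptions made for illustration.

```python
from typing import Dict, Iterable, List

# Snowflake query of the shape described above (for reference, not executed here):
#   SELECT *
#   FROM USER_ACTIVITY
#   QUALIFY ROW_NUMBER() OVER (
#       PARTITION BY USER_ID, ACTIVITY_TYPE, DEVICE_ID
#       ORDER BY ACTIVITY_TIMESTAMP ASC
#   ) = 1;

KEY_COLUMNS = ("USER_ID", "ACTIVITY_TYPE", "DEVICE_ID")


def keep_earliest(rows: Iterable[Dict]) -> List[Dict]:
    """Keep one row per key: the one with the smallest ACTIVITY_TIMESTAMP."""
    earliest: Dict[tuple, Dict] = {}
    for row in rows:
        key = tuple(row[col] for col in KEY_COLUMNS)
        current = earliest.get(key)
        # ISO-8601 strings in one timezone compare correctly as plain strings.
        if current is None or row["ACTIVITY_TIMESTAMP"] < current["ACTIVITY_TIMESTAMP"]:
            earliest[key] = row
    return list(earliest.values())


sample = [
    {"USER_ID": "u1", "ACTIVITY_TYPE": "login", "DEVICE_ID": "d1",
     "ACTIVITY_TIMESTAMP": "2024-01-01T10:00:00"},
    {"USER_ID": "u1", "ACTIVITY_TYPE": "login", "DEVICE_ID": "d1",
     "ACTIVITY_TIMESTAMP": "2024-01-01T09:00:00"},
    {"USER_ID": "u2", "ACTIVITY_TYPE": "login", "DEVICE_ID": "d7",
     "ACTIVITY_TIMESTAMP": "2024-01-02T08:30:00"},
]
deduplicated = keep_earliest(sample)
```

Unlike an aggregate with MIN, each surviving row keeps all of its original columns, which is the property the explanation credits to the window-function approach.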
Question 70
You are working with a dataset containing customer reviews for various products. The dataset includes a 'REVIEW_TEXT' column with the raw review text and a 'PRODUCT_ID' column. You want to perform sentiment analysis on the reviews and create a new feature called 'SENTIMENT_SCORE' for each product. You plan to use a UDF to perform the sentiment analysis. Which of the following steps and SQL code snippets are essential for implementing this feature engineering task in Snowflake, ensuring optimal performance and scalability? Select all that apply:
Correct Answer: A,C,E
Options A, C, and E are correct. Option A is essential for performing sentiment analysis. Option C correctly integrates the UDF into a SQL query to generate the 'SENTIMENT_SCORE'. Option E is crucial for performance, since vectorized UDFs process rows in batches and are much faster and more efficient on large datasets. Option B is not a suitable pattern for sentiment analysis, because Snowflake ML does not yet cater well to this use case. Option D, while seemingly logical, is not ideal: the review data changes continuously, so the stored results would become outdated, and a temporary table exists only for the session in which it is created.
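The batch-oriented shape of a vectorized UDF can be sketched in plain Python. In Snowflake, a vectorized Python UDF receives batches as pandas objects; here ordinary lists stand in for them so the example has no dependencies. The toy lexicon, `score_batch`, and `sentiment_by_product` are assumptions for illustration and are no substitute for a real sentiment model.

```python
from collections import defaultdict
from typing import Dict, List

# Toy lexicon standing in for a real sentiment model; purely illustrative.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "broken"}


def score_batch(reviews: List[str]) -> List[float]:
    """Score a whole batch of reviews in one call; each score lies in [-1, 1]."""
    scores = []
    for text in reviews:
        words = [w.strip(".,!?") for w in text.lower().split()]
        pos = sum(w in POSITIVE for w in words)
        neg = sum(w in NEGATIVE for w in words)
        total = pos + neg
        # Reviews with no lexicon hits are treated as neutral.
        scores.append((pos - neg) / total if total else 0.0)
    return scores


def sentiment_by_product(product_ids: List[str], reviews: List[str]) -> Dict[str, float]:
    """Average the per-review scores for each product ID."""
    grouped = defaultdict(list)
    for pid, score in zip(product_ids, score_batch(reviews)):
        grouped[pid].append(score)
    return {pid: sum(vals) / len(vals) for pid, vals in grouped.items()}


per_product = sentiment_by_product(
    ["P1", "P1", "P2"],
    ["Great product, love it!", "Bad and broken.", "good"],
)
```

Scoring one batch per call rather than one row per call is what amortizes the per-invocation overhead, which is the performance argument made for Option E.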