Question 256

You are designing a data processing pipeline. The pipeline must be able to scale automatically as load increases. Messages must be processed at least once, and must be ordered within windows of 1 hour. How should you design the solution?
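
To make the two requirements concrete, here is a minimal sketch in plain Python of what "at least once" and "ordered within windows of 1 hour" ask of the processing step. It is an illustration of the requirements only, not a proposed answer or any particular product's API. The message shape `(msg_id, event_time, payload)`, the helper names, and the choice to drop redelivered duplicates by `msg_id` are all assumptions made for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # fixed one-hour windows, as stated in the question


def window_start(event_time):
    """Return the epoch second at which the message's one-hour window begins."""
    return int(event_time // WINDOW_SECONDS) * WINDOW_SECONDS


def order_within_windows(messages):
    """Group messages into hourly windows and sort each window by event time.

    `messages` is an iterable of (msg_id, event_time, payload) tuples; this
    shape is hypothetical. At-least-once delivery can redeliver a message,
    so repeats are skipped by msg_id (an assumption, not a stated requirement).
    """
    seen = set()
    windows = defaultdict(list)
    for msg_id, event_time, payload in messages:
        if msg_id in seen:
            continue  # redelivered duplicate
        seen.add(msg_id)
        windows[window_start(event_time)].append((event_time, payload))
    # Ordering is only enforced inside each window, not across the stream.
    return {
        start: [payload for _, payload in sorted(items, key=lambda item: item[0])]
        for start, items in sorted(windows.items())
    }
```

Messages may arrive in any order across windows; only the order inside each hourly bucket is guaranteed by this sketch.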
  • Question 257

    Your company is selecting a system to centralize data ingestion and delivery. You are considering messaging and data integration systems to address the requirements. The key requirements are:
    * The ability to seek to a particular offset in a topic, possibly back to the start of all data ever captured
    * Support for publish/subscribe semantics on hundreds of topics
    * Retention of per-key ordering
    Which system should you choose?
  • Question 258

    You have uploaded 5 years of log data to Cloud Storage. A user reported that some data points in the log data are outside of their expected ranges, which indicates errors. You need to address this issue and be able to run the process again in the future while keeping the original data for compliance reasons. What should you do?
  • Question 259

    Your company built a TensorFlow neural network model with a large number of neurons and layers. The model fits the training data well. However, when tested against new data, it performs poorly. What method can you employ to address this?
  • Question 260

    You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence. To separate storage from compute, you decided to use the Cloud Storage connector to store all input data, output data, and intermediate data. However, you noticed that one Hadoop job runs very slowly on Cloud Dataproc compared with the on-premises bare-metal Hadoop environment (8-core nodes with 100 GB of RAM).
    Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue. What should you do?