[Jul-2024] Pass Google Professional-Data-Engineer Exam in First Attempt Guaranteed! [Q36-Q59]

Full Professional-Data-Engineer Practice Test and 333 unique questions with explanations waiting just for you, get it now!

NEW QUESTION # 36
Your company is in a highly regulated industry. One of your requirements is to ensure individual users
have access only to the minimum amount of information required to do their jobs. You want to enforce this
requirement with Google BigQuery. Which three approaches can you take? (Choose three.)

  • A. Segregate data across multiple tables or databases.
  • B. Restrict BigQuery API access to approved users.
  • C. Restrict access to tables by role.
  • D. Ensure that the data is encrypted at all times.
  • E. Disable writes to certain tables.
  • F. Use Google Stackdriver Audit Logging to determine policy violations.

Answer: B,C,F


NEW QUESTION # 37
You designed a database for patient records as a pilot project to cover a few hundred patients in three clinics. Your design used a single database table to represent all patients and their visits, and you used self-joins to generate reports. The server resource utilization was at 50%. Since then, the scope of the project has expanded. The database must now store 100 times more patient records. You can no longer run the reports, because they either take too long or they encounter errors with insufficient compute resources. How should you adjust the database design?

  • A. Normalize the master patient-record table into the patient table and the visits table, and create other necessary tables to avoid self-join.
  • B. Partition the table into smaller tables, with one for each clinic. Run queries against the smaller table pairs, and use unions for consolidated reports.
  • C. Add capacity (memory and disk space) to the database server by the order of 200.
  • D. Shard the tables into smaller ones based on date ranges, and only generate reports with prespecified date ranges.

Answer: A
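
As a rough illustration of the normalized design, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are assumptions made up for this example; they do not come from the question. The report joins two narrow tables instead of self-joining one wide table.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with separate patient, clinic and visit tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE clinics (clinic_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE visits (
            visit_id   INTEGER PRIMARY KEY,
            patient_id INTEGER NOT NULL REFERENCES patients(patient_id),
            clinic_id  INTEGER NOT NULL REFERENCES clinics(clinic_id),
            visit_date TEXT NOT NULL
        );
        -- The index keeps the patient-to-visits join cheap as the data grows.
        CREATE INDEX idx_visits_patient ON visits(patient_id);
        """
    )
    # Hypothetical sample data, only there to make the report runnable.
    conn.executemany("INSERT INTO clinics VALUES (?, ?)", [(1, "North"), (2, "South")])
    conn.executemany("INSERT INTO patients VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?, ?, ?)",
        [(1, 1, 1, "2024-01-05"), (2, 1, 2, "2024-02-10"), (3, 2, 1, "2024-03-15")],
    )
    return conn


def visits_per_patient(conn):
    """Report the number of visits per patient with a plain join, no self-join."""
    return conn.execute(
        """
        SELECT p.name, COUNT(v.visit_id)
        FROM patients AS p
        LEFT JOIN visits AS v ON v.patient_id = p.patient_id
        GROUP BY p.patient_id
        ORDER BY p.name
        """
    ).fetchall()
```

Calling visits_per_patient(build_demo_db()) returns one row per patient with that patient's visit count.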


NEW QUESTION # 38
You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:
* The user profile: What the user likes and doesn't like to eat
* The user account information: Name, address, preferred meal times
* The order information: When orders are made, from where, to whom
The database will be used to store all the transactional data of the product. You want to optimize the data schema. Which Google Cloud Platform product should you use?

  • A. Cloud Bigtable
  • B. Cloud SQL
  • C. Cloud Datastore
  • D. BigQuery

Answer: D


NEW QUESTION # 39
You work for a large real estate firm and are preparing 6 TB of home sales data to be used for machine learning. You will use SQL to transform the data and use BigQuery ML to create a machine learning model. You plan to use the model for predictions against a raw dataset that has not been transformed. How should you set up your workflow in order to prevent skew at prediction time?

  • A. Use a BigQuery view to define your preprocessing logic. When creating your model, use the view as your model training data. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any transformations on the raw input data.
  • B. Preprocess all data using Dataflow. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any further transformations on the input data.
  • C. When creating your model, use BigQuery's TRANSFORM clause to define preprocessing steps. Before requesting predictions, use a saved query to transform your raw input data, and then use ML.EVALUATE.
  • D. When creating your model, use BigQuery's TRANSFORM clause to define preprocessing steps. At prediction time, use BigQuery's ML.EVALUATE clause without specifying any transformations on the raw input data.

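Answer: D

Explanation:
BigQuery ML stores the TRANSFORM clause with the model and applies it automatically during evaluation and prediction, so the raw input needs no separate preprocessing. Transforming the input beforehand, as in option C, would apply the preprocessing twice.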
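
The skew this question is about appears when the preprocessing used for training differs from the preprocessing used for prediction. The following is a toy, pure-Python sketch of the general idea of storing the transform inside the model, which is the role BigQuery ML's TRANSFORM clause plays. The class, the log transform, and the numbers are all assumptions for illustration; this is not BigQuery ML code.

```python
import math


class ModelWithTransform:
    """Toy one-weight linear model that keeps its preprocessing step with it."""

    def __init__(self, transform):
        self.transform = transform
        self.weight = 0.0

    def fit(self, raw_xs, ys):
        # The same transform object is used here and in predict().
        xs = [self.transform(x) for x in raw_xs]
        # Least-squares fit of y = weight * x through the origin.
        self.weight = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return self

    def predict(self, raw_x):
        # Callers pass raw values; the stored transform is applied for them.
        return self.weight * self.transform(raw_x)


def log_area(square_feet):
    """Hypothetical preprocessing step: base-10 log of the floor area."""
    return math.log10(square_feet)


model = ModelWithTransform(log_area).fit([100, 1000, 10000], [2.0, 3.0, 4.0])

consistent = model.predict(1000)   # transform applied inside the model
skewed = model.weight * 1000       # raw value fed to the weight: skew
```

Because the transform travels with the model, a caller cannot forget it or apply a different version of it at prediction time.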


NEW QUESTION # 40
You work for a mid-sized enterprise that needs to move its operational system transaction data from an on-premises database to GCP. The database is about 20 TB in size. Which database should you choose?

  • A. Cloud SQL
  • B. Cloud Bigtable
  • C. Cloud Datastore
  • D. Cloud Spanner

Answer: A


NEW QUESTION # 41
Which of the following is NOT one of the three main types of triggers that Dataflow supports?

  • A. Trigger that is a combination of other triggers
  • B. Trigger based on time
  • C. Trigger based on element size in bytes
  • D. Trigger based on element count

Answer: C

Explanation:
There are three major kinds of triggers that Dataflow supports:
1. Time-based triggers.
2. Data-driven triggers. You can set a trigger to emit results from a window when that window has received a certain number of data elements.
3. Composite triggers. These triggers combine multiple time-based or data-driven triggers in some logical way.
Reference: https://cloud.google.com/dataflow/model/triggers
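
To make the three kinds of triggers concrete, here is a small pure-Python simulation. It is not the Dataflow or Apache Beam API; the function names and the sample events are assumptions for illustration only. Each event is an (arrival_time_seconds, value) pair, and a trigger is a predicate that decides when the buffered pane is emitted.

```python
def run_trigger(events, should_fire):
    """Buffer events in arrival order and emit a pane whenever the trigger fires.

    Elements still buffered when the input ends are dropped in this sketch.
    """
    panes, pane = [], []
    for arrival, value in events:
        pane.append((arrival, value))
        if should_fire(pane):
            panes.append([v for _, v in pane])
            pane = []
    return panes


def after_count(n):
    """Data-driven trigger: fire once the pane holds at least n elements."""
    return lambda pane: len(pane) >= n


def after_delay(seconds):
    """Time-based trigger: fire once an element arrives at least `seconds`
    after the first element of the current pane."""
    return lambda pane: pane[-1][0] - pane[0][0] >= seconds


def after_first(*triggers):
    """Composite trigger: fire as soon as any of the given triggers fires."""
    return lambda pane: any(trigger(pane) for trigger in triggers)


events = [(0, "a"), (1, "b"), (2, "c"), (10, "d"), (11, "e"), (30, "f")]

by_count = run_trigger(events, after_count(2))
by_time = run_trigger(events, after_delay(5))
combined = run_trigger(events, after_first(after_count(3), after_delay(5)))
```

Note that nothing in this sketch fires on element size in bytes, which matches the answer above.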


NEW QUESTION # 42
You are designing storage for 20 TB of text files as part of deploying a data pipeline on Google Cloud.
Your input data is in CSV format. You want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which storage service and schema design should you use?

  • A. Use Cloud Storage for storage. Link as temporary tables in BigQuery for query.
  • B. Use Cloud Bigtable for storage. Link as permanent tables in BigQuery for query.
  • C. Use Cloud Bigtable for storage. Install the HBase shell on a Compute Engine instance to query the Cloud Bigtable data.
  • D. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.

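Answer: D

Explanation:
The data must stay in Cloud Storage so that multiple engines can read it. Linking the files as permanent external tables in BigQuery lets multiple users query them without copying the data into another store.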


NEW QUESTION # 43
You are designing the database schema for a machine learning-based food ordering service that will predict what users want to eat. Here is some of the information you need to store:
* The user profile: What the user likes and doesn't like to eat
* The user account information: Name, address, preferred meal times
* The order information: When orders are made, from where, to whom
The database will be used to store all the transactional data of the product. You want to optimize the data schema. Which Google Cloud Platform product should you use?

  • A. Cloud Bigtable
  • B. Cloud SQL
  • C. Cloud Datastore
  • D. BigQuery

Answer: D


NEW QUESTION # 44
You have an Oracle database deployed in a VM as part of a Virtual Private Cloud (VPC) network. You want to replicate and continuously synchronize 50 tables to BigQuery. You want to minimize the need to manage infrastructure. What should you do?

  • A. Create a Pub/Sub subscription to write to BigQuery directly. Deploy the Debezium Oracle connector to capture changes in the Oracle database, and sink to the Pub/Sub topic.
  • B. Create a Datastream service from Oracle to BigQuery, use a private connectivity configuration to the same VPC network, and a connection profile to BigQuery.
  • C. Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle Change Data Capture (CDC), and Dataflow to stream the Kafka topic to BigQuery.
  • D. Deploy Apache Kafka in the same VPC network, use Kafka Connect Oracle change data capture (CDC), and the Kafka Connect Google BigQuery Sink Connector.

Answer: B

Explanation:
Datastream is a serverless, scalable, and reliable change data capture (CDC) and replication service that streams data changes from source databases such as Oracle and MySQL to Google Cloud destinations such as BigQuery and Cloud Storage. Datastream supports private connectivity to sources inside a VPC network through a private connectivity configuration, and a BigQuery connection profile defines the destination, which simplifies configuring and managing the replication. Because the service is fully managed, there is no Kafka or Debezium infrastructure to operate. References:
* Datastream overview
* Creating a Datastream stream
* Using Datastream with BigQuery


NEW QUESTION # 45
You have developed three data processing jobs. One executes a Cloud Dataflow pipeline that transforms data uploaded to Cloud Storage and writes results to BigQuery. The second ingests data from on-premises servers and uploads it to Cloud Storage. The third is a Cloud Dataflow pipeline that gets information from third-party data providers and uploads the information to Cloud Storage. You need to be able to schedule and monitor the execution of these three workflows and manually execute them when needed. What should you do?

  • A. Develop an App Engine application to schedule and request the status of the jobs using GCP API calls.
  • B. Set up cron jobs in a Compute Engine instance to schedule and monitor the pipelines using GCP API calls.
  • C. Create a Directed Acyclic Graph in Cloud Composer to schedule and monitor the jobs.
  • D. Use Stackdriver Monitoring and set up an alert with a Webhook notification to trigger the jobs.

Answer: C

Explanation:
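Cloud Composer is a managed Apache Airflow service. Workflows are defined as Directed Acyclic Graphs (DAGs), which can be scheduled, monitored, and triggered manually, so it covers all three requirements without custom scheduling code.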
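
The idea of a Directed Acyclic Graph can be sketched with Python's standard graphlib module (Python 3.9 or later). This is not Cloud Composer or Airflow code. The job names are hypothetical, and the dependency shown (the transform job waiting for both upload jobs) is an assumption made for illustration; the question does not state it.

```python
from graphlib import TopologicalSorter

# Each key is a job; its value is the set of jobs that must finish first.
# Names and dependencies are assumptions for this example.
dependencies = {
    "ingest_on_prem_to_gcs": set(),
    "fetch_third_party_to_gcs": set(),
    "transform_gcs_to_bigquery": {
        "ingest_on_prem_to_gcs",
        "fetch_third_party_to_gcs",
    },
}


def execution_order(deps):
    """Return one valid run order; graphlib raises CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

A scheduler that understands the graph can run the two upload jobs independently and start the transform job only after both have succeeded.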


NEW QUESTION # 46
You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the initial data are Parquet files (on average 200-400 MB each). You see some degradation in performance after the migration to Dataproc, so you'd like to optimize for it. You need to keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with 2 non-preemptible workers only) for this workload.
What should you do?

  • A. Switch from HDDs to SSDs, override the preemptible VMs configuration to increase the boot disk size.
  • B. Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job and copy results back to GCS.
  • C. Increase the size of your Parquet files to ensure they are at least 1 GB.
  • D. Switch to TFRecords format (approx. 200 MB per file) instead of Parquet files.

Answer: C


NEW QUESTION # 47
You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity 'Movie' the property 'actors' and the property 'tags' have multiple values but the property 'date released' does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

  • A. Option C
  • B. Option B.
  • C. Option A
  • D. Option D

Answer: C
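
The option text for this question is not reproduced on this page, so the sketch below only illustrates the arithmetic behind the index explosion the question mentions. In Datastore, a composite index that includes several multi-valued properties needs one entry per combination of their values, whereas separate indexes that each pair one multi-valued property with date_released need one entry per value. The function names and the example counts are assumptions for illustration.

```python
def entries_combined_index(num_actors, num_tags):
    """Entries per entity for one index over (actors, tags, date_released).

    Every actor value is combined with every tag value; date_released is
    single-valued and contributes a factor of 1.
    """
    return num_actors * num_tags


def entries_separate_indexes(num_actors, num_tags):
    """Entries per entity for two indexes: (actors, date_released) and
    (tags, date_released)."""
    return num_actors + num_tags


# Hypothetical movie with 10 actors and 5 tags.
combined = entries_combined_index(10, 5)
separate = entries_separate_indexes(10, 5)
```

The two typical queries in the question each filter on a single multi-valued property, so neither needs the combined index.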


NEW QUESTION # 48
You've migrated a Hadoop job from an on-prem cluster to Dataproc and GCS. Your Spark job is a complicated analytical workload that consists of many shuffling operations, and the initial data are Parquet files (on average 200-400 MB each). You see some degradation in performance after the migration to Dataproc, so you'd like to optimize for it. You need to keep in mind that your organization is very cost-sensitive, so you'd like to continue using Dataproc on preemptibles (with 2 non-preemptible workers only) for this workload.
What should you do?

  • A. Switch from HDDs to SSDs, override the preemptible VMs configuration to increase the boot disk size.
  • B. Increase the size of your Parquet files to ensure they are at least 1 GB.
  • C. Switch to TFRecords format (approx. 200 MB per file) instead of Parquet files.
  • D. Switch from HDDs to SSDs, copy initial data from GCS to HDFS, run the Spark job and copy results back to GCS.

Answer: D


NEW QUESTION # 49
You are designing storage for 20 TB of text files as part of deploying a data pipeline on Google Cloud. Your input data is in CSV format. You want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which storage service and schema design should you use?

  • A. Use Cloud Storage for storage. Link as temporary tables in BigQuery for query.
  • B. Use Cloud Bigtable for storage. Link as permanent tables in BigQuery for query.
  • C. Use Cloud Bigtable for storage. Install the HBase shell on a Compute Engine instance to query the Cloud Bigtable data.
  • D. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.

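Answer: D

Explanation:
The data must stay in Cloud Storage so that multiple engines can read it. Linking the files as permanent external tables in BigQuery lets multiple users query them without copying the data into another store.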


NEW QUESTION # 50
Which of these is NOT a way to customize the software on Dataproc cluster instances?

  • A. Configure the cluster using Cloud Deployment Manager
  • B. Set initialization actions
  • C. Log into the master node and make changes from there
  • D. Modify configuration files using cluster properties

Answer: A

Explanation:
You can access the master node of the cluster by clicking the SSH button next to it in the Cloud Console.
You can easily use the --properties option of the dataproc command in the Google Cloud SDK to modify many common configuration files when creating a cluster. When creating a Cloud Dataproc cluster, you can specify initialization actions in executables and/or scripts that Cloud Dataproc will run on all nodes in your Cloud Dataproc cluster immediately after the cluster is set up. [https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions]
Reference: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties


NEW QUESTION # 51
You have Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update. What should you do?

  • A. Update the current pipeline and use the drain flag.
  • B. Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline.
  • C. Create a new pipeline that has the same Cloud Pub/Sub subscription and cancel the old pipeline.
  • D. Update the current pipeline and provide the transform mapping JSON object.

Answer: B


NEW QUESTION # 52
Your organization has been collecting and analyzing data in Google BigQuery for 6 months. The majority of the data analyzed is placed in a time-partitioned table named events_partitioned. To reduce the cost of queries, your organization created a view called events, which queries only the last 14 days of data. The view is described in legacy SQL. Next month, existing applications will be connecting to BigQuery to read the events data via an ODBC connection. You need to ensure the applications can connect. Which two actions should you take? (Choose two.)

  • A. Create a new partitioned table using a standard SQL query
  • B. Create a new view over events using standard SQL
  • C. Create a Google Cloud Identity and Access Management (Cloud IAM) role for the ODBC connection and shared "events"
  • D. Create a new view over events_partitioned using standard SQL
  • E. Create a service account for the ODBC connection to use for authentication

Answer: B,C


NEW QUESTION # 53
You are deploying a new storage system for your mobile application, which is a media streaming service. You decide the best fit is Google Cloud Datastore. You have entities with multiple properties, some of which can take on multiple values. For example, in the entity 'Movie' the property 'actors' and the property 'tags' have multiple values but the property 'date released' does not. A typical query would ask for all movies with actor=<actorname> ordered by date_released or all movies with tag=Comedy ordered by date_released. How should you avoid a combinatorial explosion in the number of indexes?

  • A. Option C
  • B. Option B.
  • C. Option A
  • D. Option D

Answer: C


NEW QUESTION # 54
Your financial services company is moving to cloud technology and wants to store 50 TB of financial time-series data in the cloud. This data is updated frequently and new data will be streaming in all the time.
Your company also wants to move their existing Apache Hadoop jobs to the cloud to get insights into this data. Which product should they use to store the data?

  • A. Google Cloud Datastore
  • B. Cloud Bigtable
  • C. Google BigQuery
  • D. Google Cloud Storage

Answer: B


NEW QUESTION # 55
Your organization has been collecting and analyzing data in Google BigQuery for 6 months. The majority
of the data analyzed is placed in a time-partitioned table named events_partitioned. To reduce the
cost of queries, your organization created a view called events, which queries only the last 14 days of
data. The view is described in legacy SQL. Next month, existing applications will be connecting to
BigQuery to read the events data via an ODBC connection. You need to ensure the applications can
connect. Which two actions should you take? (Choose two.)

  • A. Create a new partitioned table using a standard SQL query
  • B. Create a Google Cloud Identity and Access Management (Cloud IAM) role for the ODBC connection
    and shared "events"
  • C. Create a new view over events using standard SQL
  • D. Create a new view over events_partitioned using standard SQL
  • E. Create a service account for the ODBC connection to use for authentication

Answer: B,C


NEW QUESTION # 56
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world.
The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating many-to-many relationships between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure.
We also need environments in which our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
Given the record streams MJTelco is interested in ingesting per day, they are concerned about the cost of Google BigQuery increasing. MJTelco asks you to provide a design solution. They require a single large data table called tracking_table. Additionally, they want to minimize the cost of daily queries while performing fine-grained analysis of each day's events. They also want to use streaming ingestion. What should you do?

  • A. Create a table called tracking_table with a TIMESTAMP column to represent the day.
  • B. Create sharded tables for each day following the pattern tracking_table_YYYYMMDD.
  • C. Create a table called tracking_table and include a DATE column.
  • D. Create a partitioned table called tracking_table and include a TIMESTAMP column.

Answer: D
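
Why a partitioned table lowers the cost of daily queries can be shown with a toy in-memory model. This is not the BigQuery API; the class, its methods, and the sample rows are assumptions for illustration. Rows are bucketed by the date of their timestamp, and a query for one day reads only that bucket instead of the whole table.

```python
from collections import defaultdict
from datetime import date, datetime, timezone


class DayPartitionedTable:
    """Toy stand-in for a single table partitioned by day on a timestamp column."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, event_time, payload):
        # Streaming inserts land in the partition for the event's day.
        self._partitions[event_time.date()].append((event_time, payload))

    def total_rows(self):
        return sum(len(rows) for rows in self._partitions.values())

    def query_day(self, day):
        """Return (rows, rows_scanned); only one partition is read."""
        rows = self._partitions.get(day, [])
        return list(rows), len(rows)


table = DayPartitionedTable()
for hour in (1, 9, 17):
    table.insert(datetime(2024, 7, 1, hour, tzinfo=timezone.utc), f"link-{hour}")
for hour in (3, 12):
    table.insert(datetime(2024, 7, 2, hour, tzinfo=timezone.utc), f"link-{hour}")

rows, scanned = table.query_day(date(2024, 7, 1))
```

The application still sees one table, as the requirement for a single tracking_table demands, while each daily query touches only that day's rows.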


NEW QUESTION # 57
Cloud Bigtable is Google's ______ Big Data database service.

  • A. SQL Server
  • B. NoSQL
  • C. Relational
  • D. mySQL

Answer: B

Explanation:
Cloud Bigtable is Google's NoSQL Big Data database service. It is the same database that Google uses for services, such as Search, Analytics, Maps, and Gmail.
It is used for requirements that are low latency and high throughput including Internet of Things (IoT), user analytics, and financial data analysis.
Reference: https://cloud.google.com/bigtable/


NEW QUESTION # 58
You are deploying 10,000 new Internet of Things devices to collect temperature data in your warehouses globally. You need to process, store and analyze these very large datasets in real time.
What should you do?

  • A. Send the data to Google Cloud Pub/Sub, stream Cloud Pub/Sub to Google Cloud Dataflow, and store the data in Google BigQuery.
  • B. Send the data to Google Cloud Datastore and then export to BigQuery.
  • C. Export logs in batch to Google Cloud Storage and then spin up a Google Cloud SQL instance, import the data from Cloud Storage, and run an analysis as needed.
  • D. Send the data to Cloud Storage and then spin up an Apache Hadoop cluster as needed in Google Cloud Dataproc whenever analysis is required.

Answer: A


NEW QUESTION # 59
......
