Check Real Google Professional-Data-Engineer Exam Questions for Free (2023)
Prepare for Your Professional-Data-Engineer Exam with 270 Questions
To become a Google Certified Professional Data Engineer, candidates must pass a certification exam that consists of multiple-choice and scenario-based questions. The Professional-Data-Engineer exam tests the candidate's knowledge of big data and cloud technologies, as well as the ability to design and implement scalable, efficient data processing systems. The certification suits professionals who work with data pipelines, data warehousing, and data analytics, and who have a solid understanding of cloud computing and distributed systems. By earning this certification, data engineers can demonstrate their expertise in the field and improve their career opportunities and earning potential.
NEW QUESTION # 110
After migrating ETL jobs to run on BigQuery, you need to verify that the output of the migrated jobs is the same as the output of the original. You've loaded a table containing the output of the original job and want to compare the contents with output from the migrated job to show that they are identical. The tables do not contain a primary key column that would enable you to join them together for comparison.
What should you do?
- A. Select random samples from the tables using the HASH() function and compare the samples.
- B. Use a Dataproc cluster and the BigQuery Hadoop connector to read the data from each table and calculate a hash from non-timestamp columns of the table after sorting. Compare the hashes of each table.
- C. Select random samples from the tables using the RAND() function and compare the samples.
- D. Create stratified random samples using the OVER() function and compare equivalent samples from each table.
Answer: B
Explanation:
Only this option compares the full contents of both tables. The other options compare samples, which cannot guarantee that all of the data matches.
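The hashing idea in option B can be illustrated with a minimal, in-memory sketch. It assumes each table is available as a list of dictionaries and that the column names (`user`, `amount`, `loaded_at`) are hypothetical; it does not use Dataproc or the BigQuery connector, and only shows why sorting before hashing makes the comparison independent of row order.

```python
import hashlib


def table_fingerprint(rows, skip_columns=()):
    """Return a SHA-256 digest of all rows that does not depend on row order.

    rows: iterable of dicts mapping column name -> value.
    skip_columns: columns to leave out of the hash (e.g. load timestamps).
    """
    canonical_rows = []
    for row in rows:
        # Sort the columns so that key order inside a row does not matter.
        kept = sorted((k, repr(v)) for k, v in row.items() if k not in skip_columns)
        canonical_rows.append(repr(kept))
    # Sort the rows so that the order in which they were read does not matter.
    canonical_rows.sort()
    digest = hashlib.sha256()
    for line in canonical_rows:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical outputs of the original and the migrated job.
original = [
    {"user": "a", "amount": 10, "loaded_at": "2023-01-01"},
    {"user": "b", "amount": 20, "loaded_at": "2023-01-01"},
]
migrated = [
    {"user": "b", "amount": 20, "loaded_at": "2023-02-01"},
    {"user": "a", "amount": 10, "loaded_at": "2023-02-01"},
]

same = table_fingerprint(original, ("loaded_at",)) == table_fingerprint(migrated, ("loaded_at",))
```

Because duplicate rows are kept in the sorted list, two tables that differ only in how many times a row appears produce different digests.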
NEW QUESTION # 111
Business owners at your company have given you a database of bank transactions. Each row contains the user ID, transaction type, transaction location, and transaction amount. They ask you to investigate what type of machine learning can be applied to the data. Which three machine learning applications can you use? (Choose three.)
- A. Supervised learning to determine which transactions are most likely to be fraudulent.
- B. Clustering to divide the transactions into N categories based on feature similarity.
- C. Reinforcement learning to predict the location of a transaction.
- D. Supervised learning to predict the location of a transaction.
- E. Unsupervised learning to predict the location of a transaction.
- F. Unsupervised learning to determine which transactions are most likely to be fraudulent.
Answer: B,D,F
NEW QUESTION # 112
Your neural network model is taking days to train. You want to increase the training speed. What can you do?
- A. Subsample your training dataset.
- B. Increase the number of layers in your neural network.
- C. Increase the number of input features to your model.
- D. Subsample your test dataset.
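Answer: A
Explanation:
Training on a subsample of the training data reduces the computation needed per epoch. Adding layers or input features increases the computation, and subsampling the test dataset does not affect training time.
Reference: https://towardsdatascience.com/how-to-increase-the-accuracy-of-a-neural-network-9f5d1c6f407d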
NEW QUESTION # 113
You are designing storage for 20 TB of text files as part of deploying a data pipeline on Google Cloud.
Your input data is in CSV format. You want to minimize the cost of querying aggregate values for multiple users who will query the data in Cloud Storage with multiple engines. Which storage service and schema design should you use?
- A. Use Cloud Storage for storage. Link as permanent tables in BigQuery for query.
- B. Use Cloud Bigtable for storage. Link as permanent tables in BigQuery for query.
- C. Use Cloud Storage for storage. Link as temporary tables in BigQuery for query.
- D. Use Cloud Bigtable for storage. Install the HBase shell on a Compute Engine instance to query the Cloud Bigtable data.
Answer: A
NEW QUESTION # 114
You work on a regression problem in a natural language processing domain, and you have 100M labeled examples in your dataset. You have randomly shuffled your data and split your dataset into train and test samples (in a 90/10 ratio). After you trained the neural network and evaluated your model on a test set, you discover that the root-mean-squared error (RMSE) of your model is twice as high on the train set as on the test set. How should you improve the performance of your model?
- A. Increase the complexity of your model by, e.g., introducing an additional layer or increasing the size of vocabularies or n-grams used.
- B. Increase the share of the test sample in the train-test split.
- C. Try to collect more data and increase the size of your dataset.
- D. Try out regularization techniques (e.g., dropout or batch normalization) to avoid overfitting.
Answer: A
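As a minimal sketch of the metric in this question, the snippet below computes RMSE for a train split and a test split and compares them. The label and prediction values are made up purely for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-squared error between two equally long sequences of numbers."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up labels and predictions for the two splits.
train_rmse = rmse([3.0, 5.0], [1.0, 7.0])  # errors of 2 and -2
test_rmse = rmse([3.0, 5.0], [2.0, 6.0])   # errors of 1 and -1
ratio = train_rmse / test_rmse
```

Here `ratio` is 2.0, which mirrors the situation in the question: the error on the train set is twice the error on the test set.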
NEW QUESTION # 115
You are designing a basket abandonment system for an ecommerce company. The system will send a message to a user based on these rules:
No interaction by the user on the site for 1 hour
Has added more than $30 worth of products to the basket
Has not completed a transaction
You use Google Cloud Dataflow to process the data and decide if a message should be sent. How should you design the pipeline?
- A. Use a sliding time window with a duration of 60 minutes.
- B. Use a global window with a time based trigger with a delay of 60 minutes.
- C. Use a session window with a gap time duration of 60 minutes.
- D. Use a fixed-time window with a duration of 60 minutes.
Answer: C
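The session window named in option C groups a user's events by gaps of inactivity. The sketch below is a plain Python approximation of that grouping, not the Dataflow API; the timestamps, the `should_message` helper, and its parameters are hypothetical.

```python
def session_windows(timestamps, gap_seconds=3600):
    """Group event timestamps (in seconds) into sessions.

    A new session starts whenever the time since the previous event
    is at least gap_seconds.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap_seconds:
            sessions[-1].append(ts)  # still within the gap: same session
        else:
            sessions.append([ts])    # gap exceeded: open a new session
    return sessions


def should_message(basket_value, completed_purchase, min_value=30.0):
    """Apply the basket rules once a session has closed."""
    return basket_value > min_value and not completed_purchase


# One user's activity: three clicks within 20 minutes, then one 80 minutes later.
events = [0, 600, 1200, 6000]
sessions = session_windows(events)
```

With these events, the first three timestamps form one session and the last one starts a new session, because 4,800 seconds of inactivity exceeds the 3,600-second gap.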
NEW QUESTION # 116
You are planning to use Google's Dataflow SDK to analyze customer data such as the data displayed below. Your project requirement is to extract only the customer name from the data source and then write it to an output PCollection.
Tom,555 X street
Tim,553 Y street
Sam, 111 Z street
Which operation is best suited for the above data processing requirement?
- A. Source API
- B. Data extraction
- C. ParDo
- D. Sink API
Answer: C
Explanation:
In the Google Cloud Dataflow SDK, you can use ParDo to extract only the customer name from each element in your PCollection.
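The element-wise processing that ParDo performs can be sketched in plain Python. This is a stand-in for illustration, not the Dataflow SDK: `par_do` and `extract_name` are hypothetical helpers, and a list plays the role of the PCollection.

```python
def extract_name(line):
    """DoFn-style function: receives one element and yields zero or more outputs."""
    name, _, _address = line.partition(",")
    name = name.strip()
    if name:  # skip elements that have no name before the comma
        yield name


def par_do(fn, collection):
    """Apply fn to every element independently and flatten the outputs."""
    for element in collection:
        yield from fn(element)


records = ["Tom,555 X street", "Tim,553 Y street", "Sam, 111 Z street"]
names = list(par_do(extract_name, records))
```

Each element is handled on its own, which is what allows a real runner to distribute the work across many machines.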
NEW QUESTION # 117
You are the head of BI at a large enterprise company with multiple business units that each have different priorities and budgets. You use on-demand pricing for BigQuery with a quota of 2K concurrent on-demand slots per project. Users at your organization sometimes don't get slots to execute their queries and you need to correct this. You'd like to avoid introducing new projects to your account.
What should you do?
- A. Create an additional project to overcome the 2K on-demand per-project quota.
- B. Convert your batch BQ queries into interactive BQ queries.
- C. Switch to flat-rate pricing and establish a hierarchical priority model for your projects.
- D. Increase the number of concurrent slots per project on the Quotas page in the Cloud Console.
Answer: C
Explanation:
Reference: https://cloud.google.com/blog/products/gcp/busting-12-myths-about-bigquery
NEW QUESTION # 118
You have an Apache Kafka cluster on-prem with topics containing web application logs. You need to replicate the data to Google Cloud for analysis in BigQuery and Cloud Storage. The preferred replication method is mirroring to avoid deployment of Kafka Connect plugins.
What should you do?
- A. Deploy a Kafka cluster on GCE VM Instances with the PubSub Kafka connector configured as a Sink connector. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
- B. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Sink connector. Use a Dataflow job to read from PubSub and write to GCS.
- C. Deploy the PubSub Kafka connector to your on-prem Kafka cluster and configure PubSub as a Source connector. Use a Dataflow job to read from PubSub and write to GCS.
- D. Deploy a Kafka cluster on GCE VM Instances. Configure your on-prem cluster to mirror your topics to the cluster running in GCE. Use a Dataproc cluster or Dataflow job to read from Kafka and write to GCS.
Answer: D
NEW QUESTION # 119
You have a Google Cloud Dataflow streaming pipeline running with a Google Cloud Pub/Sub subscription as the source. You need to make an update to the code that will make the new Cloud Dataflow pipeline incompatible with the current version. You do not want to lose any data when making this update. What should you do?
- A. Create a new pipeline that has a new Cloud Pub/Sub subscription and cancel the old pipeline.
- B. Update the current pipeline and use the drain flag.
- C. Update the current pipeline and provide the transform mapping JSON object.
- D. Create a new pipeline that has the same Cloud Pub/Sub subscription and cancel the old pipeline.
Answer: C
Explanation:
If any transform names in your pipeline have changed, you must supply a transform mapping and pass it using the --transformNameMapping option.
https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#preventing_compatibility_breaks
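A transform mapping is a JSON object that maps transform names in the running job to the names used in the replacement job. The sketch below only builds such a JSON string for the `--transformNameMapping` option mentioned above; the transform names are hypothetical.

```python
import json

# Hypothetical names: transform name in the running job -> name in the new job.
transform_name_mapping = {
    "ParseLogs": "ParseWebLogs",
    "WriteRows": "WriteToWarehouse",
}

# Serialize the mapping as the JSON value passed to the option.
flag = "--transformNameMapping=" + json.dumps(transform_name_mapping, sort_keys=True)
print(flag)
```

When the option is passed on a command line, the JSON value usually needs shell quoting so that the braces and double quotes reach the launcher unchanged.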
NEW QUESTION # 120
You are deploying MariaDB SQL databases on GCE VM Instances and need to configure monitoring and alerting. You want to collect metrics including network connections, disk IO and replication status from MariaDB with minimal development effort and use StackDriver for dashboards and alerts.
What should you do?
- A. Place the MariaDB instances in an Instance Group with a Health Check.
- B. Install the OpenCensus Agent and create a custom metric collection application with a StackDriver exporter.
- C. Install the StackDriver Agent and configure the MySQL plugin.
- D. Install the StackDriver Logging Agent and configure fluentd in_tail plugin to read MariaDB logs.
Answer: C
NEW QUESTION # 121
What are two of the benefits of using denormalized data structures in BigQuery?
- A. Reduces the amount of storage required, increases query speed
- B. Increases query speed, makes queries simpler
- C. Reduces the amount of data processed, increases query speed
- D. Reduces the amount of data processed, reduces the amount of storage required
Answer: B
Explanation:
Denormalization increases query speed for tables with billions of rows because BigQuery's performance degrades when doing JOINs on large tables, but with a denormalized data structure, you don't have to use JOINs, since all of the data has been combined into one table. Denormalization also makes queries simpler because you do not have to use JOIN clauses. Denormalization increases the amount of data processed and the amount of storage required because it creates redundant data.
https://cloud.google.com/solutions/bigquery-data-warehouse#denormalizing_data
NEW QUESTION # 122
You work for a shipping company that uses handheld scanners to read shipping labels. Your company has strict data privacy standards that require scanners to only transmit tracking numbers when events are sent to Kafka topics. A recent software update caused the scanners to accidentally transmit recipients' personally identifiable information (PII) to analytics systems, which violates user privacy rules. You want to quickly build a scalable solution using cloud-native managed services to prevent exposure of PII to the analytics systems. What should you do?
- A. Build a Cloud Function that reads the topics and makes a call to the Cloud Data Loss Prevention API. Use the tagging and confidence levels to either pass or quarantine the data in a bucket for review.
- B. Create an authorized view in BigQuery to restrict access to tables with sensitive data.
- C. Install a third-party data validation tool on Compute Engine virtual machines to check the incoming data for sensitive information.
- D. Use Stackdriver logging to analyze the data passed through the total pipeline to identify transactions that may contain sensitive information.
Answer: A
NEW QUESTION # 123
Your globally distributed auction application allows users to bid on items. Occasionally, users place identical bids at nearly identical times, and different application servers process those bids. Each bid event contains the item, amount, user, and timestamp. You want to collate those bid events into a single location in real time to determine which user bid first. What should you do?
- A. Have each application server write the bid events to Cloud Pub/Sub as they occur. Push the events from Cloud Pub/Sub to a custom endpoint that writes the bid event information into Cloud SQL.
- B. Set up a MySQL database for each application server to write bid events into. Periodically query each of those distributed MySQL databases and update a master MySQL database with bid event information.
- C. Create a file on a shared file and have the application servers write all bid events to that file. Process the file with Apache Hadoop to identify which user bid first.
- D. Have each application server write the bid events to Google Cloud Pub/Sub as they occur. Use a pull subscription to pull the bid events using Google Cloud Dataflow. Give the bid for each item to the user in the bid event that is processed first.
Answer: A
Explanation:
Once the bid events are in Cloud SQL, the earliest bid can be found by filtering and ordering on the timestamp column, which satisfies the near-real-time requirement.
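Choosing the winner by event timestamp rather than by arrival order can be sketched as follows. The `Bid` tuple mirrors the fields listed in the question; the sample bids and the `first_bidders` helper are hypothetical, and the list stands in for rows collected in one central store.

```python
from collections import namedtuple

Bid = namedtuple("Bid", "item amount user timestamp")


def first_bidders(bids):
    """Return {(item, amount): user} for the earliest bid by event timestamp."""
    earliest = {}
    for bid in bids:
        key = (bid.item, bid.amount)
        # Keep the bid with the smallest timestamp, regardless of arrival order.
        if key not in earliest or bid.timestamp < earliest[key].timestamp:
            earliest[key] = bid
    return {key: bid.user for key, bid in earliest.items()}


# Bids listed in arrival order; the later arrival carries the earlier timestamp.
bids = [
    Bid("lamp", 50, "alice", 1000.250),
    Bid("lamp", 50, "bob", 1000.100),
    Bid("vase", 20, "carol", 990.000),
]
winners = first_bidders(bids)
```

In this sample, `bob` wins the lamp even though his bid arrived second, because his event timestamp is earlier.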
NEW QUESTION # 124
You need to compose a visualization for operations teams with the following requirements:
Telemetry must include data from all 50,000 installations for the most recent 6 weeks (sampling once every minute)
The report must not be more than 3 hours delayed from live data.
The actionable report should only show suboptimal links.
Most suboptimal links should be sorted to the top.
Suboptimal links can be grouped and filtered by regional geography.
User response time to load the report must be <5 seconds.
You create a data source to store the last 6 weeks of data, and create visualizations that allow viewers to see multiple date ranges, distinct geographic regions, and unique installation types. You always show the latest data without any changes to your visualizations. You want to avoid creating and updating new visualizations each month. What should you do?
- A. Look through the current data and compose a small set of generalized charts and tables bound to criteria filters that allow value selection.
- B. Load the data into relational database tables, write a Google App Engine application that queries all rows, summarizes the data across each criteria, and then renders results using the Google Charts and visualization API.
- C. Look through the current data and compose a series of charts and tables, one for each possible combination of criteria.
- D. Export the data to a spreadsheet, compose a series of charts and tables, one for each possible combination of criteria, and spread them across multiple tabs.
Answer: A
NEW QUESTION # 125
You have some data, which is shown in the graphic below. The two dimensions are X and Y, and the shade of each dot represents what class it is. You want to classify this data accurately using a linear algorithm. To do this you need to add a synthetic feature. What should the value of that feature be?
- A. cos(X)
- B. Y^2
- C. X^2+Y^2
- D. X^2
Answer: A
NEW QUESTION # 126
You work for a large fast food restaurant chain with over 400,000 employees. You store employee information in Google BigQuery in a Users table consisting of a FirstName field and a LastName field. A member of IT is building an application and asks you to modify the schema and data in BigQuery so the application can query a FullName field consisting of the value of the FirstName field concatenated with a space, followed by the value of the LastName field for each employee. How can you make that data available while minimizing cost?
- A. Use BigQuery to export the data for the table to a CSV file. Create a Google Cloud Dataproc job to process the CSV file and output a new CSV file containing the proper values for FirstName, LastName and FullName. Run a BigQuery load job to load the new CSV file into BigQuery.
- B. Create a Google Cloud Dataflow job that queries BigQuery for the entire Users table, concatenates the FirstName value and LastName value for each user, and loads the proper values for FirstName, LastName, and FullName into a new table in BigQuery.
- C. Create a view in BigQuery that concatenates the FirstName and LastName field values to produce the FullName.
- D. Add a new column called FullName to the Users table. Run an UPDATE statement that updates the FullName column for each user with the concatenation of the FirstName and LastName values.
Answer: B
NEW QUESTION # 127
Which is not a valid reason for poor Cloud Bigtable performance?
- A. The Cloud Bigtable cluster has too many nodes.
- B. The table's schema is not designed correctly.
- C. The workload isn't appropriate for Cloud Bigtable.
- D. There are issues with the network connection.
Answer: A
Explanation:
Having too many nodes is not a cause of poor performance. The valid cause is the opposite: the Cloud Bigtable cluster doesn't have enough nodes. If your Cloud Bigtable cluster is overloaded, adding more nodes can improve performance. Use the monitoring tools to check whether the cluster is overloaded.
Reference: https://cloud.google.com/bigtable/docs/performance
NEW QUESTION # 128
When a Cloud Bigtable node fails, ____ is lost.
- A. the time dimension
- B. the last transaction
- C. no data
- D. all data
Answer: C
Explanation:
A Cloud Bigtable table is sharded into blocks of contiguous rows, called tablets, to help balance the workload of queries. Tablets are stored on Colossus, Google's file system, in SSTable format. Each tablet is associated with a specific Cloud Bigtable node.
Data is never stored in Cloud Bigtable nodes themselves; each node has pointers to a set of tablets that are stored on Colossus. As a result:
- Rebalancing tablets from one node to another is very fast, because the actual data is not copied. Cloud Bigtable simply updates the pointers for each node.
- Recovery from the failure of a Cloud Bigtable node is very fast, because only metadata needs to be migrated to the replacement node.
- When a Cloud Bigtable node fails, no data is lost.
Reference: https://cloud.google.com/bigtable/docs/overview
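The separation between nodes and stored tablets described above can be pictured with a toy model. This is only an illustration under simplified assumptions, not how Cloud Bigtable is implemented: a dictionary stands in for Colossus, node names and tablet IDs are hypothetical, and reassignment is a simple round-robin.

```python
def fail_node(assignments, failed):
    """Hand the failed node's tablet pointers to the surviving nodes.

    assignments: dict mapping node name -> list of tablet IDs (pointers only).
    Returns a new assignment dict; tablet data itself is never touched.
    """
    survivors = sorted(node for node in assignments if node != failed)
    if not survivors:
        raise RuntimeError("no surviving nodes to take over the tablets")
    result = {node: list(assignments[node]) for node in survivors}
    for i, tablet in enumerate(assignments[failed]):
        # Only the pointer moves; round-robin keeps the load roughly even.
        result[survivors[i % len(survivors)]].append(tablet)
    return result


# Tablet data lives in shared storage, separate from the nodes.
storage = {"t1": ["row-a", "row-b"], "t2": ["row-c"], "t3": ["row-d"]}
assignments = {"node-1": ["t1"], "node-2": ["t2", "t3"]}

after = fail_node(assignments, "node-2")
```

After the simulated failure, `storage` still holds every tablet and the surviving node simply owns more pointers.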
NEW QUESTION # 129
Suppose you have a table that includes a nested column called "city" inside a column called "person", but when you try to submit the following query in BigQuery, it gives you an error.
SELECT person FROM `project1.example.table1` WHERE city = "London"
How would you correct the error?
- A. Add ", UNNEST(person)" before the WHERE clause.
- B. Add ", UNNEST(city)" before the WHERE clause.
- C. Change "person" to "person.city".
- D. Change "person" to "city.person".
Answer: A
Explanation:
To access the person.city column, you need to "UNNEST(person)" and JOIN it to table1 using a comma.
Reference: https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#nested_repeated_results
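What the comma join with UNNEST does can be mimicked in plain Python. The sketch assumes that "person" is a repeated record (an array of structs) and uses made-up rows with a hypothetical `id` column; it illustrates the row expansion only and does not call BigQuery.

```python
# Made-up rows: each row holds an array of person records.
table1 = [
    {"id": 1, "person": [{"name": "Ann", "city": "London"},
                         {"name": "Bo", "city": "Paris"}]},
    {"id": 2, "person": [{"name": "Cy", "city": "Oslo"}]},
]


def unnest_person(rows):
    """Yield one (row, element) pair per array element, like ', UNNEST(person)'."""
    for row in rows:
        for element in row["person"]:
            yield row, element


# After the expansion, 'city' is an ordinary field that a filter can reference.
matches = [row["id"] for row, element in unnest_person(table1)
           if element["city"] == "London"]
```

Before the expansion, `city` exists only inside the array elements, which is why the original WHERE clause cannot resolve it.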
NEW QUESTION # 130
You are training a spam classifier. You notice that you are overfitting the training data. Which three actions can you take to resolve this problem? (Choose three.)
- A. Increase the regularization parameters
- B. Use a smaller set of features
- C. Decrease the regularization parameters
- D. Use a larger set of features
- E. Get more training examples
- F. Reduce the number of training examples
Answer: A,B,E
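The effect of increasing a regularization parameter can be seen in a minimal one-feature example. The sketch below uses L2 (ridge) regularization on a linear model without an intercept, which is an assumption made for simplicity; the data points are made up.

```python
def ridge_weight(xs, ys, l2):
    """Weight w minimizing sum((y - w*x)**2) + l2 * w**2 for a one-feature model.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + l2).
    """
    numerator = sum(x * y for x, y in zip(xs, ys))
    denominator = sum(x * x for x in xs) + l2
    return numerator / denominator


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_unregularized = ridge_weight(xs, ys, l2=0.0)
w_regularized = ridge_weight(xs, ys, l2=14.0)
```

With no penalty the fitted weight is 2.0; with `l2=14.0` it shrinks to 1.0. A larger penalty pulls weights toward zero, which limits how closely the model can follow noise in the training data.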
NEW QUESTION # 131
You are building a model to make clothing recommendations. You know a user's fashion preference is likely to change over time, so you build a data pipeline to stream new data back to the model as it becomes available.
How should you use this data to train the model?
- A. Continuously retrain the model on just the new data.
- B. Continuously retrain the model on a combination of existing data and the new data.
- C. Train on the new data while using the existing data as your test set.
- D. Train on the existing data while using the new data as your test set.
Answer: B
Google recommends, but does not require, at least three years of industry experience, including at least one year working with Google Cloud Platform. The Professional-Data-Engineer exam consists of multiple-choice and multiple-select questions and is designed to be completed within two hours. Google does not publish an official passing score; 70% is a commonly cited unofficial estimate.
Use free Professional-Data-Engineer exam questions that simulate the actual exam: https://quizguide.actualcollection.com/Professional-Data-Engineer-exam-questions.html