TotalCloud Insights: Uncovering the Hidden Dangers in Google Cloud Dataproc

Rahul Pareek


  • The Apache Hadoop Distributed File System (HDFS) can be vulnerable to data compromise when a Compute Engine cluster is in a public-facing virtual private cloud (VPC) or shares the VPC with other Compute Engine instances.
  • Google Cloud Platform (GCP) provides a default VPC called ‘default.’ From the internet, this VPC allows inbound connections only on ports 22 and 3389, but it permits all inbound connections between instances on its internal subnets. This configuration poses a significant security risk when Dataproc clusters and Compute Engine instances share the default VPC: a single compromised instance can reach the cluster, which can lead to data corruption or theft.
  • The Google Security Team labeled the attack flow as an ‘Abuse Risk.’
  • Qualys TotalCloud now notifies customers of misconfigured Dataproc clusters that are vulnerable to exploitation, offering remediation steps and code for prompt resolution.

Why This Vulnerability is Important to Understand

Security vulnerabilities pose significant challenges in the rapidly evolving landscape of cloud computing. The recent discovery of an unauthenticated access vulnerability in Google Cloud Dataproc underscores the need for robust cloud security measures.

This risk to Google Cloud Dataproc clusters can lead to data theft, manipulation, or loss. The underlying Open-Source Software (OSS) managed solution lacks adequate security controls, enabling unauthorized access by attackers with knowledge of the Dataproc IP address.

Google’s Dataproc documentation highlights the security risk associated with open firewall rules on public networks and recommends caution in setting them up. However, it also emphasizes the need for vigilance against potential attackers who might gain initial access to a Compute Engine instance, as this could allow them unauthenticated access to GCP Dataproc. This scenario underscores the importance of robust security measures at all access points to safeguard against unauthorized access.

To shed light on this issue and help organizations enhance their security posture, in this post the Qualys TotalCloud team aims to analyze the attack flow comprehensively and offer recommendations for minimizing the associated risks.

What is the Google Cloud Dataproc Service?

Google Cloud Dataproc is a managed cloud service specially designed for seamless deployment and efficient management of Apache Spark and Apache Hadoop clusters. This service caters to large-scale data processing and analytics workloads, utilizing Hadoop for distributed storage and batch processing and Spark for in-memory data processing and analytics.

Understanding the Vulnerability

The GCP Dataproc threat exploits two critical weaknesses:

  1. The absence of security controls in Apache Hadoop’s web interfaces and
  2. The tendency to rely on default settings when creating resources.

These vulnerabilities combine to allow an attacker unimpeded access to the Apache Hadoop Distributed File System (HDFS) and the ability to then compromise sensitive data.

Two key web interfaces on the master node facilitate this exploitation: YARN ResourceManager on port 8088 and HDFS NameNode on port 9870. Alarmingly, neither interface requires authentication despite their crucial role in the cluster’s operation. The HDFS endpoint is particularly risky because HDFS is the primary storage system for the entire cluster. Both ports accept connections from any IP address, so network reachability is effectively the sole prerequisite for full access. Consequently, an insufficiently segmented cluster can have catastrophic consequences.

HDFS NameNode on port 9870
YARN ResourceManager on port 8088
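
To verify whether these two interfaces are reachable from a given machine in your own environment, a simple TCP connection test is usually enough. The following is a minimal sketch, not a Qualys tool: the function names are illustrative, and it only confirms that a port accepts connections, not what the service behind it will return.

```python
import socket

# Ports named in the text: YARN ResourceManager and HDFS NameNode web UIs.
DATAPROC_UI_PORTS = {8088: "YARN ResourceManager", 9870: "HDFS NameNode"}


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def reachable_ui_ports(host: str, ports=DATAPROC_UI_PORTS, timeout: float = 2.0) -> dict:
    """Map each reachable port to the name of the service expected on it."""
    return {
        port: name
        for port, name in ports.items()
        if is_port_open(host, port, timeout)
    }
```

Calling reachable_ui_ports with the master node's internal IP from another instance in the same VPC shows which of the two interfaces that instance can reach. An empty result from every non-cluster workload is the outcome you want.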

The GCP Dataproc Attack Flow 

While it may be challenging to prevent the occurrence of an internet-facing Compute Engine instance vulnerable to Remote Code Execution (RCE), organizations should prioritize risk mitigation.

This post outlines a potential attack path, assuming an external attacker has fully compromised such an instance, allowing them to scan for open ports. If a Dataproc cluster shares the same VPC, the consequences can be dire, granting the attacker unrestricted access to services they are not authorized to use.

The attacker gains access to both web interfaces by exploiting the compromised machine as a tunnel. They can utilize the YARN endpoint to create applications, submit jobs, and perform Cloud Storage operations. Alternatively, using the HDFS endpoint, the attacker can navigate the storage file system freely, potentially obtaining sensitive data.
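
Defenders can confirm this exposure on their own clusters by checking whether the NameNode answers a directory listing without credentials. The sketch below assumes the standard Apache Hadoop WebHDFS REST API is enabled on port 9870 (the Hadoop default, which may not hold for every cluster configuration). The function names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def webhdfs_list_url(host: str, path: str = "/", port: int = 9870) -> str:
    """Build the WebHDFS LISTSTATUS URL for a directory on the NameNode."""
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{host}:{port}/webhdfs/v1{urllib.parse.quote(path)}?op=LISTSTATUS"


def parse_liststatus(body: str) -> list:
    """Extract (name, type) pairs from a WebHDFS LISTSTATUS JSON response."""
    statuses = json.loads(body)["FileStatuses"]["FileStatus"]
    return [(entry["pathSuffix"], entry["type"]) for entry in statuses]


def list_without_credentials(host: str, path: str = "/", timeout: float = 5.0) -> list:
    """Request a listing with no credentials; success means the data is exposed."""
    with urllib.request.urlopen(webhdfs_list_url(host, path), timeout=timeout) as resp:
        return parse_liststatus(resp.read().decode("utf-8"))
```

If list_without_credentials returns entries when run from a workload that should have no access to the cluster, the HDFS endpoint is reachable and unauthenticated from that workload.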

It’s important to note that this attack flow is not limited to vulnerable Compute Engine instances. Any workload, such as Cloud Run or App Engine, deployed on the same VPC network as the Hadoop master node can expose the HDFS data if compromised, facilitating the same level of exploitation.

Evaluating HDFS NameNode Accessibility: GCP’s Vulnerability vs. AWS and Azure Security

In the context of HDFS NameNode accessibility, a significant vulnerability emerges in GCP Dataproc, notably absent in AWS EMR and Azure HDInsight. This variance underscores differing security paradigms in these cloud platforms.

GCP Dataproc Vulnerability

On the one hand, the Google Cloud Platform offers a secure method to access the HDFS NameNode through the Dataproc Web Interfaces within the Google Cloud Console. This method is more secure because it likely involves internal mechanisms for redirecting your request to the cluster. Additionally, this process typically includes authentication checks against your Google Cloud credentials. These checks ensure that you are authorized to access the cluster, thus providing a layer of security and control.

On the other hand, if you access the NameNode user interface directly by typing http://[MasterNodeIP]:9870 into a browser (where MasterNodeIP is the external IP of the master node VM instance attached to the Dataproc cluster), you are attempting to reach the service without the security context of the GCP Console. This direct approach bypasses the Google Cloud Console’s authentication and redirection mechanisms. If you can access the HDFS NameNode UI from any location using this method, it suggests that the service is configured to be publicly accessible. This accessibility could be due to network configurations that expose the service externally, which is a security concern if unintended.

AWS EMR’s Secure Approach

In contrast, AWS EMR employs a more secure default setup. Direct public access to the HDFS NameNode is not readily available. EMR clusters operate within the confines of a VPC, and security groups restrict unauthorized external access by default. To access the NameNode, users typically utilize SSH tunneling, a secure method that ensures encryption and limits exposure to potential external threats. This design choice in AWS EMR inherently protects against the kind of vulnerability seen in GCP Dataproc.

Azure HDInsight’s Restrictive Configuration

Azure HDInsight also follows a stringent security model. Like AWS, Azure HDInsight doesn’t readily expose the HDFS NameNode to the public internet. Deployed within Azure Virtual Networks and governed by Network Security Groups, HDInsight clusters are configured to significantly reduce the risk of unauthorized access. Azure’s approach, which emphasizes controlled and secure access mechanisms such as VPNs, aims to mitigate vulnerabilities akin to those seen in GCP Dataproc, though the effectiveness of these measures depends largely on the specific configurations applied by users.

Best Practices for Securing GCP Dataproc Clusters

The Qualys TotalCloud team advocates for regular vulnerability assessments, network segmentation, and the implementation of comprehensive security policies. These practices are crucial in safeguarding cloud environments against emerging threats.

Vulnerability Management:

Implementing consistent vulnerability and patch management practices is crucial for minimizing the possibility of unauthorized access to cloud environments. Tools like Qualys TotalCloud with Flex Scan provide comprehensive visibility and facilitate the detection and remediation of vulnerabilities, ensuring that unpatched servers and applications are promptly addressed.

Network Segmentation:

Organizations must diligently review default settings and configure a dedicated VPC network before deploying new clusters. Firewall rules should be tailored to specific requirements, and independent clusters should be deployed in different subnets within the same VPC to limit lateral movement and minimize the impact of a security breach. Avoid deploying additional services on these dedicated networks, focusing solely on the necessary cluster components.
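
As a starting point for reviewing firewall rules, the sketch below scans a list of rules for any that open the two Hadoop web UI ports. It assumes the rules are dictionaries shaped like the Compute Engine firewall resource (for example, the JSON output of a firewall-rules listing), and it deliberately ignores deny rules, priorities, and target tags, so treat its findings as candidates for manual review rather than a verdict.

```python
# Web UI ports named in this post: YARN ResourceManager and HDFS NameNode.
SENSITIVE_PORTS = (8088, 9870)


def port_spec_covers(spec: str, port: int) -> bool:
    """Return True if a port spec such as "22" or "8000-9000" includes port."""
    low, _, high = spec.partition("-")
    return int(low) <= port <= int(high or low)


def rule_allows_tcp_port(rule: dict, port: int) -> bool:
    """Return True if an enabled ingress allow rule permits TCP traffic to port."""
    if rule.get("disabled") or rule.get("direction", "INGRESS") != "INGRESS":
        return False
    for allowed in rule.get("allowed", []):
        if allowed.get("IPProtocol") not in ("tcp", "all"):
            continue
        ports = allowed.get("ports")
        # An entry with no "ports" list applies to every port of that protocol.
        if not ports or any(port_spec_covers(spec, port) for spec in ports):
            return True
    return False


def find_exposing_rules(rules: list, ports=SENSITIVE_PORTS) -> list:
    """List (rule name, port, source ranges) for each rule that opens a port."""
    findings = []
    for rule in rules:
        for port in ports:
            if rule_allows_tcp_port(rule, port):
                findings.append((rule["name"], port, rule.get("sourceRanges", [])))
    return findings
```

On a default VPC, a check like this would be expected to flag the internal allow-all rule, since it covers every TCP port for traffic originating inside the network.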

Avoiding Misconfigurations:

In the forthcoming release, following the recent identification of the vulnerability presented by misconfigured Dataproc clusters on the default VPC, Qualys TotalCloud will introduce a control to address this issue (CID 52139 – Ensure Dataproc Clusters are not using Default VPC). This enhancement will assist security teams by identifying vulnerable Dataproc Clusters that use the default VPC; customers will then need to mitigate the risk by migrating the clusters to an alternative VPC.
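
The idea behind such a check can be illustrated with a few lines of Python. This is a hedged sketch and not the logic of CID 52139: it assumes the cluster is supplied as a dictionary in the shape of the Dataproc cluster resource (config.gceClusterConfig with networkUri or subnetworkUri), and the function names are illustrative.

```python
def last_segment(uri: str) -> str:
    """Return the final path segment of a resource URI or short name."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def classify_cluster_network(cluster: dict) -> str:
    """Classify a cluster's network as "default", "custom", or "unknown"."""
    gce = cluster.get("config", {}).get("gceClusterConfig", {})
    network_uri = gce.get("networkUri")
    subnetwork_uri = gce.get("subnetworkUri")
    if network_uri:
        return "default" if last_segment(network_uri) == "default" else "custom"
    if subnetwork_uri:
        # A subnetwork URI does not name its parent network, so the parent
        # has to be resolved separately before the cluster can be cleared.
        return "unknown"
    # With neither field set, Dataproc falls back to the project's default network.
    return "default"
```

Clusters classified as "default" are the ones to migrate. Those classified as "unknown" need their subnetwork's parent network looked up before they can be ruled out.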

Encryption is essential for protecting your organization’s data, but the data remains unprotected if encryption is not configured correctly. Most CSPs offer encryption at no additional cost; enabling it is usually as simple as selecting a checkbox in the configuration settings. Despite the simplicity, this essential control is not universally implemented. To ensure that Dataproc clusters are encrypted at rest using Customer-Managed Keys (CMKs), refer to CID 52161.
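
A similar check can show whether a cluster has a customer-managed key configured at all. The sketch below is illustrative and not the logic of CID 52161. It assumes the key appears under config.encryptionConfig in the Dataproc cluster resource, in either the gcePdKmsKeyName field or the newer kmsKey field; verify both field names against the API version you use.

```python
def customer_managed_key(cluster: dict) -> str:
    """Return the configured KMS key name, or an empty string if none is set."""
    encryption = cluster.get("config", {}).get("encryptionConfig", {})
    # Field names are assumptions based on the Dataproc cluster resource.
    return encryption.get("gcePdKmsKeyName") or encryption.get("kmsKey") or ""


def clusters_without_cmk(clusters: list) -> list:
    """Return the names of clusters that have no customer-managed key."""
    return [
        cluster.get("clusterName", "<unnamed>")
        for cluster in clusters
        if not customer_managed_key(cluster)
    ]
```

A cluster with no key listed is still encrypted at rest with Google-managed keys, so a finding here means the organization does not control the key, not that the data is stored in plaintext.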


In summary, the vulnerabilities in Google Cloud Dataproc clusters highlight the urgent need for enhanced cloud security measures. Qualys TotalCloud’s upcoming release addresses these concerns, offering critical solutions for risk mitigation and data protection. Organizations must prioritize robust security practices, including network segmentation, regular vulnerability assessments, and the use of Customer-Managed Keys for encryption. This proactive approach is essential for safeguarding cloud environments against evolving threats and maintaining data integrity.

To learn more about how Qualys TotalCloud can enhance your cloud security, visit our website, or contact support for a detailed demonstration.
