Discussion and Comparison of Several Hadoop Security Tools

Yinyi Soo
16 min read · Nov 15, 2020

Abstract — Apache Hadoop is a big data management system. However, security was not a main concern during Hadoop's development, which has left the system vulnerable. To strengthen it, several security tools such as Apache Ranger, Apache Knox, and others have been developed. This report analyzes the security vulnerabilities of Apache Hadoop, discusses several security tools, and compares their features, functionality, and advantages. Finally, solutions and a conclusion are drawn for these security tools.

Keywords — Apache Hadoop, Cybersecurity, Big Data Security

I. Introduction

Put simply, big data refers to tremendous amounts of data, both structured and unstructured. More specifically, big data is characterized by four Vs: Volume, Velocity, Variety, and Veracity [1]. However, as the volume and usage of big data have grown, security has become a major concern, leading to a new V in the big data sector: Vulnerability [2].

Apache Hadoop is an open-source software framework used to manage big data through the MapReduce method. MapReduce simplifies data processing on large clusters by generating a set of intermediate key/value pairs and merging all values that share the same intermediate key [3]. In its early days, Apache Hadoop was concerned only with its core capabilities and paid little attention to security, which made it extremely vulnerable to security threats. According to a 2019 journal article, Hadoop's vulnerabilities can be divided into the following three types [4]:

· Software Vulnerabilities: The Hadoop framework is written entirely in Java, a language that has already been heavily exploited by cybercriminals.
· Web Interface Vulnerabilities: Hadoop deployments often use weak web configurations, such as default ports and IP addresses, and are vulnerable to Cross-Site Scripting attacks.
· Network Vulnerabilities: Because Hadoop deals with complex types of data and databases, different users may need different policy levels, which can lead to vulnerabilities.

These vulnerabilities give black hat hackers opportunities to attack the Hadoop big data system. For example, a grey hat hacker can easily enumerate details such as the ports and IP addresses of a Hadoop system when weak web configurations such as default ports and IP addresses are used. A Denial of Service (DoS) attack can also be launched against the Hadoop system using those vulnerable ports and web server details. According to a 2017 report, 11,246 attack incidents were recorded, and of those, 5 resulted in successful breaches [5].

A hacker can also combine the web interface and database vulnerabilities to perform an SQL injection attack. Instead of entering a plain value, the attacker might submit a Hadoop SQL query, for example in Hive, to retrieve valuable information or even execute Data Manipulation Language statements that modify data inside the Hadoop database. In short, Hadoop's vulnerabilities can lead to leakage of valuable data and to loss of data integrity, since the data may be modified by the attackers.
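
To illustrate the injection risk, here is a minimal sketch contrasting a HiveQL query built by string concatenation with a parameterized alternative. The HiveServer2 URL, credentials, and the employees table are hypothetical placeholders, not taken from this report.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveInjectionDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint; adjust host, port, and database.
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "user", "");

        String userInput = "x' OR '1'='1";  // attacker-controlled value

        // VULNERABLE: concatenating raw input into HiveQL lets the attacker
        // rewrite the WHERE clause and dump every row of the table.
        Statement stmt = conn.createStatement();
        ResultSet bad = stmt.executeQuery(
                "SELECT name, salary FROM employees WHERE id = '" + userInput + "'");

        // SAFER: a parameterized query treats the input as a literal value,
        // so the injected quote cannot change the query structure.
        PreparedStatement ps = conn.prepareStatement(
                "SELECT name, salary FROM employees WHERE id = ?");
        ps.setString(1, userInput);
        ResultSet good = ps.executeQuery();

        conn.close();
    }
}
```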

Among all these attacks, the most critical one for Apache Hadoop is data leakage, because it can happen silently, without the user or programmer ever realizing it. According to a 2017 research report, data leakage most commonly occurs at the application and operating system layers [6].

To enhance the security of the Apache Hadoop system and avoid data leakage and other security threats, several tools have been developed. The next section introduces and discusses a selection of these tools and their characteristics, followed by a comparison between them. The tools discussed are Apache Ranger, Apache Knox, Apache Atlas, Project Rhino, Kerberos, and Apache Sentry.

Each of these tools was selected for a different reason: Apache Ranger and Apache Knox are two of the most popular security tools and offer a wide range of security functions; Kerberos is a classic authentication technology used not only in big data but also in our daily login activities; Apache Atlas is a tool focused on data governance; and Project Rhino is a recently released toolset from Intel rather than from Apache.

II. Available Security Tools

This part of the report discusses six tools: Apache Ranger, Apache Knox, Apache Atlas, Project Rhino, Kerberos, and Apache Sentry. The functionality and features of each are discussed, along with their advantages and effectiveness. First, let us look at Apache Ranger.

A. Apache Ranger

Apache Ranger is a framework introduced by Hortonworks that provides comprehensive security across the Apache Hadoop ecosystem. Apache Ranger integrates with other Hadoop components such as HDFS, Hive, YARN, and Kafka. Its features and characteristics are listed below [7].

· Centralized Security Administration: manage all security-related tasks in a central UI or through APIs.
· Standardized Authorization Method
· Centralized Auditing: centralized control of user access and administrative actions

The main characteristic of Apache Ranger is centralization: it gives the user a central UI to manage file service policies and metadata policies. Beyond controlling file data, Apache Ranger also centrally controls user access and administrative actions through a plain and simple UI.

In terms of authentication and authorization, Apache Ranger uses Kerberos for user authentication and cooperates with Apache Knox for authorization, following a role-based access control (RBAC) model [4]. For data encryption, Apache Ranger relies on wire encryption, which protects data as it moves through Hadoop over RPC, HTTP, the Data Transfer Protocol (DTP), and JDBC [8].

In terms of communication, Apache Ranger supplies the user with a REST API as a platform to manage comprehensive data security across the Hadoop platform [9]. This API allows the user to centralize data auditing through centralized security administration.
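
As a rough illustration of that REST API, the sketch below creates an HDFS path policy through Ranger's public v2 endpoint. The host, credentials, service name (hadoopdev), path, and group are assumptions about a particular cluster; the endpoint and policy JSON shape follow the API documentation cited in [9].

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class RangerPolicyExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Ranger Admin host; 6080 is its usual HTTP port.
        URL url = new URL("http://ranger-host:6080/service/public/v2/api/policy");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Minimal HDFS path policy granting the 'analysts' group read access;
        // service and path are placeholders for the target cluster.
        String policy = "{"
                + "\"service\":\"hadoopdev\","
                + "\"name\":\"sales-read-only\","
                + "\"resources\":{\"path\":{\"values\":[\"/data/sales\"],\"isRecursive\":true}},"
                + "\"policyItems\":[{"
                +   "\"groups\":[\"analysts\"],"
                +   "\"accesses\":[{\"type\":\"read\",\"isAllowed\":true}]"
                + "}]}";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(policy.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```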

B. Apache Knox

Apache Knox acts as a single access point to Hadoop clusters, based on the concept of a stateless reverse proxy framework [10]. Apache Knox can be viewed as a firewall for the Hadoop system, responsible for authenticating users. It supports both HTTP and a REST API for communication. In the latest version, Apache Knox 1.4.0, Knox delivers three groups of user-facing services, shown in the diagram below [11].

Fig. 1. Services supported by Apache Knox

· Proxying Services
The proxying services focus on providing HTTP proxying to the cluster.
· Authentication Services
Authentication for users communicating through the REST API.
· Client Services
A platform for client development, either by scripting through a Domain Specific Language (DSL) or by using the Knox Shell classes directly as an SDK.

Apache Knox integrates well with other Apache Hadoop services such as Ambari, YARN RM, Hive, and so on. It also reduces the number of service endpoints the client must deal with when interacting with the gateway, by encapsulating the cluster layout for routing and translating between user-facing URLs and cluster internals.
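
For example, a client can list an HDFS directory through the gateway without ever knowing the NameNode's address. The sketch below assumes a Knox gateway at knox-host:8443 with a topology named default and Knox's demo guest credentials; TLS trust-store setup is omitted for brevity.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class KnoxWebHdfsExample {
    public static void main(String[] args) throws Exception {
        // Knox exposes WebHDFS under /gateway/<topology>/webhdfs/v1 and
        // authenticates the caller before forwarding the request inward.
        URL url = new URL(
            "https://knox-host:8443/gateway/default/webhdfs/v1/tmp?op=LISTSTATUS");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String auth = Base64.getEncoder()
                .encodeToString("guest:guest-password".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // JSON directory listing from the cluster
            }
        }
    }
}
```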

C. Apache Atlas

Apache Atlas is a scalable and extensible set of core data governance services for Apache Hadoop, released in 2019. Apache Atlas provides open metadata management and governance capabilities that let organizations build a catalog of their data assets efficiently and effectively [12]. Apache Atlas has several features:

· Metadata Types & Instances: ability to pre-define various new types for Hadoop and non-Hadoop data.
· Classification: ability to dynamically create classifications, including their attributes.
· Lineage: an intuitive UI for viewing data lineage, which updates dynamically through the API.
· Search/Discovery: an SQL-like query language for searching, named Domain Specific Language (DSL), plus search through the REST API.
· Security & Data Masking: fine-grained security for data access, enforced by controlling metadata access.

The highlight of Apache Atlas is tag-based data access control. The user can integrate Apache Atlas with Apache Ranger to manage tag-based policies, so that only users with the corresponding access rights can view certain tags. Tags can also be applied to other Hadoop services such as HDFS and Hive to control user access to them. In terms of data auditing, since version 0.6 Apache Atlas has been able to track data flowing through several Hadoop components such as Hive, Falcon, and Storm. Apache Atlas also provides a REST API gateway for lineage registration, which improves the auditing process.
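
As a small sketch of that REST API, the code below defines a new classification (tag) named PII through the Atlas v2 typedefs endpoint; once defined, the tag can be attached to entities and enforced via Ranger tag-based policies. The host and credentials are placeholders.

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class AtlasTagExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical Atlas server; 21000 is its usual port, and the
        // typedefs endpoint is part of the Atlas v2 REST API.
        URL url = new URL("http://atlas-host:21000/api/atlas/v2/types/typedefs");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        String auth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + auth);
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);

        // Define a 'PII' classification; entities carrying this tag can then
        // be governed through Ranger tag-based policies.
        String body = "{\"classificationDefs\":[{"
                + "\"name\":\"PII\","
                + "\"description\":\"Personally identifiable information\","
                + "\"superTypes\":[],\"attributeDefs\":[]}]}";
        try (OutputStream os = conn.getOutputStream()) {
            os.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```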

D. Project Rhino

Project Rhino is an open-source Hadoop security toolset developed by Intel to provide data protection for the Hadoop stack, built around the single sign-on (SSO) concept. Project Rhino's authentication is essentially built on top of technology already used by Apache Hadoop, such as Kerberos, and enhances that mechanism. Project Rhino supplies several features to strengthen the security of Apache Hadoop [13].

· Enhanced Data Protection: adds encryption support to Hadoop core systems such as HDFS, MapReduce, Hive, etc.
· Token-Based Authentication: implements a common token-based authentication framework to decouple internal user and service authentication from external mechanisms.
· Standardized Audit Logging: builds a standard, unified log format for the various types of compliance events and activities in Apache Hadoop.

Regarding Hadoop's authorization mechanism, HBase, a column-oriented database on HDFS, supports setting access control at the table or column-family level. However, this function is limited, and Project Rhino aims to extend the mechanism by authorizing operations on a per-call basis. Project Rhino also provides encryption for both data at rest and data in transit, whereas most Hadoop encryption covers data in transit only.
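
Cell-level access control of the kind Project Rhino contributed is now exposed by the standard HBase client API. The sketch below attaches a per-cell ACL to a single value; it assumes a cluster with the HBase AccessController coprocessor enabled, and the table, user, and column names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.security.access.Permission;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellAclExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("patients"))) {
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("diagnosis"),
                          Bytes.toBytes("confidential"));
            // Attach a per-cell ACL: only user 'doctor' may read this cell,
            // regardless of the coarser table- or family-level grants.
            put.setACL("doctor", new Permission(Permission.Action.READ));
            table.put(put);
        }
    }
}
```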

E. Kerberos

Kerberos is a very popular security tool used for authentication, not only in the big data sector but throughout networking. Kerberos was developed at the Massachusetts Institute of Technology (MIT) in the late 1980s, and it normally consists of three components: an authentication server, a ticket-granting server (TGS), and a database [14].

The Kerberos protocol uses secret-key cryptography to provide encrypted data communication over a non-secure network. Many web servers use the Kerberos protocol for authentication nowadays, and it is one of the earliest authentication tools used by Apache Hadoop. The diagram below shows the basic flow of Kerberos authentication in Apache Hadoop.

Fig. 2. Kerberos Authentication Process

During authentication, the user first communicates with the Authentication Server (AS) to retrieve a token; the user then presents that token to request a ticket from the Ticket-Granting Server (TGS). Once the user receives the ticket granted by the TGS, the user can use it to access the Hadoop service. Each request step is limited to once per user login session to mitigate DDoS and Man-in-the-Middle attacks.
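
In a Hadoop client, this whole exchange is usually hidden behind the UserGroupInformation class: the login call obtains the TGT from the AS, and later RPC calls transparently fetch service tickets from the TGS. A minimal sketch, assuming a hypothetical principal and keytab path:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosHdfsClient {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Tell the Hadoop client that the cluster expects Kerberos.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Obtain a TGT from the KDC using a keytab instead of a password;
        // principal and keytab path are placeholders for your realm.
        UserGroupInformation.loginUserFromKeytab(
                "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");

        // Subsequent RPC calls (e.g. listing HDFS) present service tickets
        // obtained via the TGS, as in Fig. 2.
        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}
```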

F. Apache Sentry

Apache Sentry is a tool developed by Cloudera. Like Apache Ranger, Apache Sentry provides role-based authorization and an administration module, supporting various levels of role-based access for HDFS, Hive, and Impala. Its main approach is to install Sentry plugins in the HDFS components and control them centrally through the policy metadata. The structure of Apache Sentry is shown in the image below [15].

Fig. 3. Apache Sentry Plugin Structure

Apache Sentry manages access to data and metadata by enforcing a precise level of privileges for authenticated users and applications in a Hadoop cluster [4]. However, Apache Sentry supports only certain Hadoop components, such as Hive, HDFS, and Impala; it does not support others such as HBase, YARN, and Kafka.
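
On a Sentry-enabled cluster, these role-based privileges are typically administered through Hive SQL statements. The sketch below creates a role, grants it read access to one table, and maps a group onto it; the endpoint, admin user, table, and group names are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SentryGrantExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical HiveServer2 endpoint on a Sentry-enabled cluster.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hive-host:10000/default", "hive_admin", "");
             Statement stmt = conn.createStatement()) {
            // Roles are the unit of authorization in Sentry.
            stmt.execute("CREATE ROLE analyst");
            // Grant the role read access to a single table only.
            stmt.execute("GRANT SELECT ON TABLE sales TO ROLE analyst");
            // Map an OS/LDAP group onto the role; members inherit its privileges.
            stmt.execute("GRANT ROLE analyst TO GROUP analysts");
        }
    }
}
```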

III. Comparisons

Having discussed the features and implementation of each security tool, this section compares their features and characteristics in more detail. The comparison is divided into four areas: Administration/Authentication, Authorization, Audit, and Data Protection, as shown in the tables below [4]. First, let us look at Administration/Authentication and Authorization.

TABLE I. COMPARISON FOR AUTHENTICATION AND AUTHORIZATION

In terms of administration and authentication, Apache Atlas, which focuses on data governance, does not support the administration process, while the other tools do support authentication. Apache Ranger uses centralized security control to manage security and administration, and it also provides a REST API gateway for security control. Apache Knox provides authentication over HTTP as a single access point for the Hadoop system; like Apache Ranger, it also provides a REST API gateway for authentication control. Apache Sentry is basically integrated with Apache Knox and uses Knox's method for authentication. Kerberos uses encrypted token-based authentication that can be integrated with all Hadoop resources, while Project Rhino uses single sign-on with token-based authentication.

In terms of authorization, Kerberos supports only the authentication process, not authorization. Apache Knox provides Access Control List (ACL) based authorization and supports service-level auditing; Knox constantly evaluates the identity of the context's user to determine that user's access. Apache Ranger also provides a centralized UI for role-based access control (RBAC), although this function mainly integrates with Apache Knox's authorization function. Apache Atlas uses tag-based policy control, setting up access policies for each tag and assigning them to users. Project Rhino provides cell-level encryption and fine-grained access control for HBase in Hadoop, while Apache Sentry also provides fine-grained access control but does not support column-level granularity.

TABLE II. COMPARISON FOR AUDIT AND DATA PROTECTION

In terms of data logging and auditing, Kerberos does not support data auditing. Apache Ranger provides centralized auditing across all Hadoop components based on user access. Apache Knox supports service-level auditing at the user or access-control-list level. Apache Atlas provides not only logging for data auditing but also a REST API gateway to track data and register data lineage. Project Rhino provides a standardized data logging and auditing format to record and track data, while the latest version of Apache Sentry provides only basic logging that records the authorization process.

In terms of data encryption, only Apache Ranger and Project Rhino provide actual encryption or protection for the data; the other tools, namely Apache Knox, Atlas, and Sentry, instead control metadata access and hide the original data through the authorization process. Apache Ranger uses wire encryption to encrypt and protect the original data, while Project Rhino additionally provides data encryption for Hadoop core system processes such as HDFS, MapReduce, etc. Having discussed the features of each tool, their advantages and disadvantages are organized and shown in the table below.

TABLE III. COMPARISON FOR ADVANTAGES AND DISADVANTAGES

Regarding authentication, most of the Apache tools, such as Apache Ranger and Knox, provide not only centralized security control but also a REST API gateway for easier communication with the tool, whereas the non-Apache tools, Project Rhino and Kerberos, focus on token-based authentication. For authorization, all the tools above except Kerberos provide some form of role-based access control. In terms of auditing, all the tools except Kerberos provide basic or advanced auditing; among them, Project Rhino and Apache Ranger provide not only their own auditing services but also aim to deliver centralized or standardized auditing for the whole Hadoop system. As for data encryption, only Apache Ranger and Project Rhino provide some amount of encryption to protect the data, while the other tools merely hide the original data by controlling metadata through the authorization process.

IV. Discussion

The previous section discussed the security features of the tools across Authentication, Authorization, Audit, and Data Protection. However, not all tools support every feature; for example, Kerberos supports only the authentication process. Moreover, some tools can be integrated with one another to provide more comprehensive security coverage; for example, Apache Sentry integrates with Apache Knox to support the administration process. Such integration, however, occurs only among certain products, such as the Apache tools, and not with Project Rhino. Given this reality, the best tool may not be the one that covers the full security spectrum by itself, but the one able to integrate with other tools to form a comprehensive security solution for the Hadoop system. Among these tools, only Apache Ranger and Project Rhino can cooperate or integrate with other tools to provide a solution covering Authentication, Authorization, Auditing, and Data Encryption.

Among all the security tools, Apache Ranger is considered the best for securing the Hadoop system. Apache Ranger has the highest compatibility for integrating with other tools into a comprehensive security solution. It also supports various Hadoop components, unlike Apache Sentry, which supports only certain plugins and not HBase or YARN. Apache Ranger further provides centralized UI support and control, so the user can easily manage the authorization and auditing processes when other users or clients communicate with the Hadoop system. To make things even easier, Apache Ranger also provides a REST API gateway for easier and more secure communication with the Hadoop system.

Although Project Rhino also provides a comprehensive security solution for the Hadoop system, it is not an Apache product and may have difficulty integrating with new security tools released for Hadoop in the future. Moreover, Project Rhino is an open-source project, which can make technical support difficult, since its code and structure may be contributed by other users on GitHub. On the other hand, because Project Rhino is open source it is free, and outsiders can study the structure of the tool since its information is transparent. That transparency, however, may also lead to vulnerability: hackers may attack the tool after discovering weaknesses in it.

Overall, among all the tools, Apache Ranger is recommended because of its ability to integrate with many different Hadoop components and its centralized control of Hadoop services. Of course, using only Apache Ranger and the Apache security components is not enough; to prevent modern attacks, the security tool itself must be constantly upgraded and kept aware of evolving malware. Here Apache Ranger has an advantage over Project Rhino: as an Apache product, it can be integrated with many other security tools provided by Apache, now and in the future. That is the main reason Apache Ranger is recommended: it can be continuously upgraded and integrated with new Apache security products such as Apache Atlas.

V. Conclusion

In this technical review paper, the vulnerabilities of the Hadoop system were divided into three categories, namely network, web-interface, and software vulnerabilities, and each was discussed. Several possible cyber-attack methods, such as SQL or Hive injection and Cross-Site Scripting, were also discussed in connection with these vulnerabilities. Six security tools, Apache Ranger, Apache Knox, Apache Atlas, Project Rhino, Kerberos, and Apache Sentry, were then discussed in terms of their features and implementation and compared by their advantages and disadvantages. Finally, Apache Ranger is recommended as the security tool for the Hadoop system, not only for its ability to centrally control the other security components in Hadoop but also for its ability to integrate with many other Hadoop security tools, now and in the future. Cybersecurity is an endless journey that requires developers to constantly upgrade their tools and equip themselves with modern knowledge to secure data; Apache Ranger's ability to integrate with future security tools earns it a place not only in users' hearts but also in the future of the cybersecurity world.

VI. References

[1] Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, Samee Ullah Khan, “The rise of “big data” on cloud computing: Review and open research issues,” Elsevier, vol. 47, pp. 98–115, 2015.

[2] S. Sharma, “Vulnerability — Introducing 10th V of Big Data,” Data Science Central, 20 July 2017. [Online]. Available: https://www.datasciencecentral.com/profiles/blogs/vulnerability-introducing-10th-v-of-big-data. [Accessed 24 June 2020].

[3] Jeffrey Dean, Sanjay Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Operating Systems Design and Implementation (OSDI), 2004.

[4] Gurjit Singh Bhathal, Amardeep Singh, “Big Data: Hadoop framework vulnerabilities, security issues and attacks,” Elsevier, Vols. 1–2, 2019.

[5] “Data Breach Digest: Perspective Is Reality,” Verizon Enterprise, 2017. [Online]. Available: https://enterprise.verizon.com/resources/reports/data-breach-digest/. [Accessed 24 June 2020].

[6] Bin Luo, Xiaojiang Du, “Security Threats to Hadoop: Data Leakage Attacks and Investigation,” IEEE Network, pp. 12–16, 2017.

[7] “Apache Ranger,” Apache, [Online]. Available: https://ranger.apache.org/. [Accessed 24 June 2020].

[8] “Apache Ranger,” Apache Software Foundation, [Online]. Available: http://ranger.apache.org/index.html. [Accessed 24 June 2020].

[9] “Ranger REST API,” [Online]. Available: https://ranger.apache.org/apidocs/index.html. [Accessed 24 June 2020].

[10] “Apache Knox gateway 0.14.x user’s guide,” Apache, 2019. [Online]. Available: https://knox.apache.org/books/knox-0-14-0/user-guide.html. [Accessed 24 June 2020].

[11] “Apache Knox,” Apache, [Online]. Available: https://knox.apache.org/. [Accessed 24 June 2020].

[12] “Apache Atlas,” Apache, [Online]. Available: https://atlas.apache.org/#/. [Accessed 24 June 2020].

[13] “intel-Hadoop/project-rhino,” Intel, 1 July 2015. [Online]. Available: https://github.com/intel-hadoop/project-rhino/. [Accessed 24 June 2020].

[14] Alessandro Basso, Carlo Baliello, Hassan Khalil, Cinzia Di Giusto & Daniel Machancoses, “Kerberos protocol: an overview,” Distributed Systems Fall, 2002.

[15] Regha S. & Manimekalai M, “Approval of Data in Hadoop Using Apache Sentry,” International Journal of Computer Sciences and Engineering, vol. 7, no. 1, pp. 583–586, 2019.
