Hadoop Distributed File System: Balancing Portability and Performance

Posted: July 2, 2015 in Uncategorized

Apache Hadoop is a pen-source implementation of distributed storage and distributed processing for the analysis of big datasets. To manage storage resources across the distributed cluster, Hadoop uses a distributed user-level file system. This file system — HDFS — is written in Java and designed for portability across heterogeneous hardware and software platforms. In this article we discuss the performance of HDFS and uncover several performance issues.

First, hadoop’s architectural bottlenecks exist in the Hadoop implementation that results in inefficient HDFS usage due to delays in scheduling new MapReduce tasks. Second, portability limitations prevent the Java implementation from exploiting features of the native platform. Third, HDFS implicitly makes portability assumptions about how the native platform manages storage resources, even though native filesystems and I/O schedulers vary widely in design and behavior.

The bottlenecks in HDFS are divided into three parts:

Software Architectural Bottlenecks — HDFS is not utilized to its full potential due to scheduling delays in the Hadoop architecture that result in cluster nodes waiting for new tasks. Instead of using the disk in a streaming manner, the access pattern is periodic. Further, even when tasks are available for computation, the HDFS client code, particularly for file reads, serializes computation and I/O instead of de-coupling and pipelining those operations. Data prefetching is not employed to improve performance, even though the typical MapReduce streaming access pattern is highly predictable.

Portability Limitations — Some performance-enhancing features in the native filesystem are not available in Java in a platform-independent manner. This includes options such as bypassing the filesystem page cache and transferring data directly from disk into user buffers. As such, the HDFS implementation runs less efficiently and has higher processor usage than would otherwise be necessary.

Portability Assumptions — The classic notion of software portability is simple: does the application run on multiple platforms? But, a broader notion of portability is: does the application perform well on multiple platforms? While HDFS is strictly portable, its performance is highly dependent on the behavior of underlying software layers, specifically the OS I/O scheduler and native filesystem allocation algorithm.

MapReduce systems such as Hadoop are used in large-scale deployments. Eliminating HDFS bottlenecks will not only boost application performance, but also improve overall cluster efficiency, thereby reducing power and cooling costs and allowing more computation to be accomplished with the same number of cluster nodes.

Hadoop application performance suffers due to architectural bottlenecks in the way that applications use the Hadoop filesystem. Ideally, MapReduce applications should manipulate the disk using streaming access patterns. The application framework should allow for data to be read or written to the disk continuously, and overlap computation with I/O. Many simple applications with low computation requirements do not achieve this ideal operating mode. Instead, they utilize the disk in a periodic fashion, decreasing performance.

The behavior of the disk and processor utilization over time for the simple search benchmark is “% of Time Disk Had 1 or More Outstanding Requests”. Disk utilization was measured as the percentage of time that the disk had at least one I/O request outstanding. This profiling did not measure the relative efficiency of disk accesses (which is influenced by excessive seeks and request size), but simply examined whether or not the disk was kept sufficiently busy with outstanding service requests. Here, the system is not ac-cessing the disk in a continuous streaming fashion as desired, even though there are ample processor resources still available. Rather, the system is reading data in bursts, processing it (by searching for a short text string in each input line), and then fetching more data in a periodic manner. This behavior is also evident in other applications such as the sort benchmark, not shown here.

The overall system impact of this periodic behavior is “Average Processor and HDFS Disk Utilization (% of Time Disk Had 1 or More Outstanding Requests) ”, which presents the average HDFS disk and processor utilization for each application in the test suite. The AIO test programs (running as native applications, not in Hadoop) kept the disk saturated with I/O requests nearly all the time (97.5%) with very low processor utilization (under 3.5%). Some Hadoop programs (such as S-Wr and Rnd-Bin) also kept the disk equivalently busy, albeit at much higher processor usage due to Hadoop and Java virtual machine overheads. In contrast, the remaining programs have poor resource utilization. For instance, the search program accesses the disk less than 40% of the time, and uses the processors less than 60% of the time.

This poor efficiency is a result of the way applications are scheduled in Hadoop, and is not a bottleneck caused by HDFS. By default, the test applications like search and sort were divided into hundreds of map tasks that each process only a single HDFS block or less before exiting. This can speed recovery from node failure (by reducing the amount of work lost) and simplify cluster scheduling. It is easy to take a map task that accesses a single HDFS block and assign it to the node that contains the data. Scheduling becomes more difficult, however, when map tasks access a region of multiple HDFS blocks, each of which could reside on different nodes. Unfortunately, the benefits of using a large number of small tasks come with a performance price that is particularly high for applications like the search test that complete tasks quickly. When a map task completes, the node can be idle for several seconds until the TaskTracker polls the JobTracker for more tasks. By default, the minimum polling interval is 3 seconds for a small cluster, and increases with cluster size. Then, the JobTracker runs a scheduling algorithm and returns the next task to the TaskTracker. Finally, a new Java virtual machine (JVM) is started, after which the node can resume application processing.

The Hadoop framework and filesystem impose a significant processor overhead on the cluster. While some of this over-head is inherent in providing necessary functionality, other overhead is incurred due to the design goal of creating a portable MapReduce implementation. These are referred to as Portability Limitations.

A final class of performance bottlenecks exists in the Hadoop filesystem that we refer to as Portability Assumptions. Specifically, these bottlenecks exist because the HDFS imple-mentation makes implicit assumptions that the underlying OS and filesystem will behave in an optimal manner for Hadoop. Unfortunately, I/O schedulers can cause excessive seeks under concurrent workloads, and disk allocation algorithms can cause excessive fragmentation, both of which degrade HDFS performance significantly. These agents are outside the direct control of HDFS, which runs inside a Java virtual machine and manages storage as a user-level application.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s