
1. INTRODUCTION

Nowadays, dealing with datasets on the order of terabytes or even petabytes is a reality. Therefore, processing such big datasets in an efficient way is a clear need for many users. In this context, Hadoop MapReduce is a big data processing framework that has rapidly become the de facto standard in both industry and academia. The main reasons for this popularity are the ease of use, scalability, and failover properties of Hadoop MapReduce. However, these features come at a price: the performance of Hadoop MapReduce is usually far from the performance of a well-tuned parallel database. Therefore, many research works (from industry and academia) have focused on improving the performance of Hadoop MapReduce jobs in many aspects. For example, researchers have proposed different data layouts, join algorithms, high-level query languages, failover algorithms, query optimization techniques, and indexing techniques. The latter includes HAIL, an indexing technique presented at VLDB 2012 that improves the performance of Hadoop MapReduce jobs by up to a factor of 70 without requiring expensive index creation phases.

Over the past years, researchers have actively studied the different performance problems of Hadoop MapReduce. Unfortunately, users do not always have a deep knowledge of how to efficiently exploit the resulting techniques. In this tutorial, we discuss how to reduce the performance gap to well-tuned database systems. We will point out the similarities and differences between the techniques used in Hadoop and those used in parallel databases. In particular, we will highlight research areas that have not yet been exploited. In the following, we present the three parts in which this tutorial is structured.

2. HADOOP MAPREDUCE

We will focus on Hadoop MapReduce, which is the most popular open source implementation of the MapReduce framework proposed by Google. Generally speaking, a Hadoop MapReduce job mainly consists of two user-defined functions: map and reduce. The input of a Hadoop MapReduce job is a set of key-value pairs (k, v), and the map function is called for each of these pairs. The map function produces zero or more intermediate key-value pairs (k′, v′). Then, the Hadoop MapReduce framework groups these intermediate key-value pairs by intermediate key k′ and calls the reduce function for each group. Finally, the reduce function produces zero or more aggregated results. The beauty of Hadoop MapReduce is that users usually only have to define the map and reduce functions. The framework takes care of everything else, such as parallelization and failover.

The Hadoop MapReduce framework uses a distributed file system to read and write its data. Typically, Hadoop MapReduce uses the Hadoop Distributed File System (HDFS), which is the open source counterpart of the Google File System [11]. Therefore, the I/O performance of a Hadoop MapReduce job strongly depends on HDFS.

In the first part of this tutorial, we will introduce Hadoop MapReduce and HDFS in detail. We will contrast both with parallel databases. In particular, we will show and explain the static physical execution plan of Hadoop MapReduce and how it affects job performance. In this part, we will also survey high-level languages that allow users to run jobs even more easily.
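To make the map and reduce contract concrete, consider the classic word-count job written against the standard org.apache.hadoop.mapreduce API. The sketch below is our illustration rather than code from the tutorial: the map function emits an intermediate pair (word, 1) for every token in its input line, and the reduce function sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // map: (k, v) -> zero or more (k', v') pairs. Here k is a byte
      // offset into the input file and v is one line of text.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE); // emit (word, 1)
          }
        }
      }

      // reduce: called once per intermediate key k' with all its values.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result); // emit (word, total count)
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note that the user writes only the two functions and a few lines of job setup; input splitting, scheduling, shuffling, and failover are handled entirely by the framework.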

3. JOB OPTIMIZATION

One of the major advantages of Hadoop MapReduce is that it allows non-expert users to easily run analytical tasks over big data. Hadoop MapReduce gives users full control over how input datasets are processed. Users code their queries using Java rather than SQL. This makes Hadoop MapReduce easy to use for a larger number of developers: no background in databases is required, only basic knowledge of Java. However, Hadoop MapReduce jobs are far behind parallel databases in their query processing efficiency. Hadoop MapReduce jobs achieve decent performance through scaling out to very large computing clusters, but this results in high costs in terms of hardware and power consumption. Therefore, researchers have proposed many techniques to effectively adapt the query processing techniques found in parallel databases to the context of Hadoop MapReduce.

In the second part of this tutorial, we will provide an overview of state-of-the-art techniques for optimizing Hadoop MapReduce jobs. We will cover two topics. First, we will survey research works that focus on tuning the configuration parameters of Hadoop MapReduce jobs. Second, we will survey different query optimization techniques for Hadoop MapReduce jobs.
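To give a flavor of the first topic, the sketch below shows how a handful of configuration parameters can be set programmatically when a job is submitted. This example is ours, not part of the tutorial; the parameter names are from Hadoop 2.x (they differ in older versions), and the values are purely illustrative, not tuning recommendations.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TunedJobSetup {
      public static Job createTunedJob() throws Exception {
        Configuration conf = new Configuration();

        // Buffer used to sort map output before spilling to disk (in MB).
        conf.setInt("mapreduce.task.io.sort.mb", 256);

        // Compress intermediate map output: trades CPU for less disk
        // and network I/O during the shuffle phase.
        conf.setBoolean("mapreduce.map.output.compress", true);

        // Number of parallel copier threads fetching map output.
        conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);

        Job job = Job.getInstance(conf, "tuned job");
        // Number of reduce tasks; often chosen relative to the number
        // of available reduce slots/containers in the cluster.
        job.setNumReduceTasks(32);
        return job;
      }
    }

The research works surveyed in this part aim to choose such settings automatically, for example by profiling previous job executions, instead of relying on manual trial and error.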


Source: http://vldb.org/pvldb/vol5/p2014_jensdittrich_vldb2012.pdf