BIG DATA BEST PRACTICES
Big Data tools like Hadoop, MapReduce, Hive and Pig can do wonders if used correctly and wisely. We all know how to use these tools, but there are some points which, if followed, help you get the best efficiency out of them. So, let's have a look at those points, which can be termed the Big Data best practices:
1. USE THE NUMBER OF MAP AND REDUCE TASKS APPROPRIATELY
Choosing the number of map and reduce tasks for a job is important. Here are some of the factors to keep in mind (a configuration sketch follows this list):
a. If each task takes less than 30-40 seconds, reduce the number of tasks. The task setup and scheduling overhead is a few seconds, so if tasks finish very quickly, you are wasting time while not doing useful work. In simple words, your task is under-loaded. Better to increase the task load and utilize it to the fullest. Another option is JVM reuse: the JVM spawned for one mapper can be reused by the next one, so that there is no overhead of spawning an extra JVM.
b. If you are dealing with a huge input data size, for example 1 TB, then consider increasing the block size of the input dataset to 256 MB or 512 MB, so that fewer mappers will be spawned. Increasing the number of mappers by decreasing the block size is not a good practice. Hadoop is designed to work on larger amounts of data to reduce disk seek time and increase computation speed, so always define the HDFS block size large enough to allow Hadoop to compute effectively.
c. If you have 50 map slots in your cluster, avoid jobs using 51 or 52 mappers, because the first 50 mappers finish at the same time and then the 51st and the 52nd will run before the reducer tasks can be started. Simply increasing the number of mappers to 500, 1000 or even 2000 does not speed up your job. The mappers run in parallel according to the map slots available in your cluster; if only 50 map slots are available, only 50 will run in parallel and the others will wait in the queue for map slots to become available.
d. The number of reduce tasks should always be equal to or less than the reduce slots available in your cluster.
e. Sometimes we don't really need reducers, for example when filtering or reducing noise in data. In these cases make sure you set the number of reducers to zero, since sorting and shuffling are expensive operations.
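As a minimal sketch of how these knobs can be set from a Hive session (the property names are the Hadoop 2.x ones, the values are only examples, and JVM reuse applies to classic MR1-style clusters):

-- d/e: number of reduce tasks for the following queries (0 disables the reduce phase entirely)
SET mapreduce.job.reduces=20;
-- a: let one JVM run several map tasks in sequence instead of spawning a new JVM per task
SET mapreduce.job.jvm.numtasks=10;
-- b: write files with a 256 MB HDFS block size so that downstream jobs spawn fewer mappers
SET dfs.blocksize=268435456;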
2. EXECUTE JOBS ON A SMALL DATASET FOR TESTING (SAMPLING)
Whenever a complex Hive query, Pig script or raw MapReduce job is written, it is a fair technique to first run it on a small dataset rather than testing it on the real dataset, which will be huge. It is better to catch all the bugs and bottlenecks in the job by running it on the test dataset than to waste cluster resources by running it on a huge dataset.
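A minimal sketch of this in Hive (the table and column names are hypothetical; block sampling granularity depends on the underlying file format):

-- build a roughly 1 percent sample of the production table once
CREATE TABLE web_logs_sample AS
SELECT * FROM web_logs TABLESAMPLE(1 PERCENT) s;

-- develop and debug the complex query against the small sample first
SELECT user_id, COUNT(*) AS hits
FROM web_logs_sample
GROUP BY user_id;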
3. PARTITIONING HIVE TABLES
Hive partitioning is an effective method to improve query performance on larger tables. Partitioning allows us to store data in separate sub-directories under the table location, and it greatly helps queries that filter on the partition keys. Although the selection of the partition key is always a sensitive decision, it should always be a low-cardinality attribute; e.g. if your data is associated with the time dimension, then date could be a good partition key.
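As a minimal sketch (the table and column names are hypothetical), a table partitioned by date stores each day's data in its own sub-directory, so queries that filter on the date read only the matching partitions:

CREATE TABLE sales (
  order_id BIGINT,
  amount   DOUBLE
)
PARTITIONED BY (order_date STRING);

-- only the sub-directory for 2024-01-15 is scanned here
SELECT SUM(amount) FROM sales WHERE order_date = '2024-01-15';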
4. COMPRESS MAP/REDUCE OUTPUT
Compression techniques significantly reduce the intermediate data volume, which in turn reduces the amount of data transferred between mappers and reducers; all of this generally occurs over the network. Compression can be applied to the mapper and reducer output individually. Keep in mind that gzip-compressed files are not splittable, so this should be applied with caution; a compressed file should not be larger than a few hundred megabytes, otherwise it can potentially lead to an imbalanced job. Other compression codec options include Snappy, LZO, bzip2, etc.
For map output compression, set mapreduce.map.output.compress (mapred.compress.map.output in older Hadoop versions) to true.
For reduce (job) output compression, set mapreduce.output.fileoutputformat.compress (mapred.output.compress in older versions) to true.
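A minimal sketch of enabling compression from a Hive session (assuming Hadoop 2.x property names and that the Snappy codec is available on the cluster):

-- compress the intermediate map output that is shuffled to the reducers
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- compress the final job/query output
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Hive-level switch for compressing final query results
SET hive.exec.compress.output=true;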
5. MAP JOIN
Map joins are really efficient if the table on one side of the join is small enough to fit in memory. Hive supports a parameter, hive.auto.convert.join, which when set to "true" tells Hive to try to convert joins into map joins automatically. When using this parameter, make sure auto conversion is actually enabled in your Hive environment.
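A minimal sketch of enabling it (hive.mapjoin.smalltable.filesize is the standard size threshold in bytes; the value shown is just an example):

-- let Hive rewrite a join as a map join when the smaller table fits in memory
SET hive.auto.convert.join=true;
-- tables below this size (in bytes) are considered small enough to load into memory (~25 MB here)
SET hive.mapjoin.smalltable.filesize=25000000;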
6. INPUT FORMAT SELECTION
Input format plays a vital role in Hive performance. For example, JSON and other text-based input formats are not a good choice for a large production system where data volume is really high: these readable formats take a lot of space and carry parsing overhead. To address these problems, Hive comes with columnar input formats like RCFile, ORC, etc. Columnar formats reduce read operations in analytics queries by allowing each column to be accessed individually. There are also binary formats like Avro, SequenceFile, Thrift and Protocol Buffers which can be useful.
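A minimal sketch (with a hypothetical table) of storing data in the columnar ORC format:

CREATE TABLE events_orc (
  event_id   BIGINT,
  event_type STRING,
  payload    STRING
)
STORED AS ORC;

-- an analytics query that touches only event_type reads just that column's data
SELECT event_type, COUNT(*) FROM events_orc GROUP BY event_type;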
7. PARALLEL EXECUTION
Hadoop can execute MapReduce jobs in parallel, and several queries executed on Hive make use of this parallelism. However, a single complex Hive query is commonly translated into a number of MapReduce jobs that are, by default, executed sequentially. Often, though, some of a query's MapReduce stages are not interdependent and could be executed in parallel. They can then take advantage of spare capacity on the cluster, improving cluster utilization while reducing the overall query execution time. Set hive.exec.parallel=true to enable this behavior.
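A minimal sketch of turning this on for a session (the thread count shown is the usual default and only an example):

-- run independent MapReduce stages of one query at the same time
SET hive.exec.parallel=true;
-- upper bound on how many stages may run concurrently
SET hive.exec.parallel.thread.number=8;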
8. VECTORIZATION
Vectorization allows Hive to process a batch of rows together instead of one row at a time. Each batch consists of column vectors, which are usually arrays of primitive types. Operations are performed on entire column vectors, which improves instruction pipelining and cache usage. To enable vectorization, set the configuration parameter hive.vectorized.execution.enabled=true.
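A minimal sketch of enabling it (vectorized execution generally requires the data to be stored in ORC format; the reduce-side switch is available in newer Hive versions):

-- process rows in batches (typically of 1024) instead of one at a time
SET hive.vectorized.execution.enabled=true;
-- also vectorize the reduce side where supported
SET hive.vectorized.execution.reduce.enabled=true;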
KEY EXPLANATION
The general rule to choose the number of mappers and reducers is:
Total number of mappers or reducers = number of nodes * maximum number of tasks per node
Maximum number of tasks per node = number of processors per node - 1 (since the DataNode and TaskTracker daemons take one processor)
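For example (with hypothetical numbers), a 10-node cluster with 8 processors per node gives 8 - 1 = 7 tasks per node, i.e. 10 * 7 = 70 map or reduce slots in total.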
Advantages of having a smaller number of maps:
It reduces the scheduling overhead; having fewer maps means task scheduling is easier and the availability of free slots in the cluster is higher.
It reduces the number of seeks required to shuffle the map outputs from the maps to the reducers, because each map produces output for each reduce; thus the number of seeks is m * r, where m is the number of maps and r is the number of reduces.
Each shuffled segment is larger, which reduces the overhead of connection establishment relative to the 'real' work done, that is, moving bytes across the network.
The reduce-side merge of the sorted map outputs is more efficient, since the branch factor for the merge is smaller; that is, fewer merges are needed because there are fewer sorted segments of map outputs to merge.