We know that the output of the map phase is sorted by key (not by value).
The simplest way to sort the files is to run them through a map-only job whose output is stored in a sequence file with IntWritable keys and LongWritable values. (Preparation)
Then pass that output through a default MapReduce job with the number of reducers set to n. This sorts the output within each reducer in the given sort order. (Partial Sort)
But the above produces n files, each individually sorted; because the default partitioner does not assign contiguous key ranges to reducers, simply concatenating them does not in general give a globally sorted file. That is what the Total Sort approach addresses.
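Why concatenation alone is not enough can be sketched outside Hadoop in plain Java. Here hash partitioning stands in for the default HashPartitioner; the class and method names are illustrative only:

```java
import java.util.*;

public class PartialSortSketch {
    // Hash-partition keys into n buckets (stand-in for the default
    // HashPartitioner), sort each bucket, then concatenate the buckets.
    static List<Integer> partialSort(int[] keys, int n) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < n; i++) partitions.add(new ArrayList<>());
        for (int k : keys)
            partitions.get(Math.floorMod(Integer.hashCode(k), n)).add(k);
        List<Integer> concatenated = new ArrayList<>();
        for (List<Integer> p : partitions) {
            Collections.sort(p);      // each reducer's output file is sorted
            concatenated.addAll(p);   // concatenating the n files
        }
        return concatenated;
    }

    public static void main(String[] args) {
        // Keys from different ranges land in each bucket, so the
        // concatenation is not globally sorted.
        System.out.println(partialSort(new int[]{5, 3, 8, 1, 9, 2, 7}, 2));
    }
}
```

Each bucket comes out sorted, but the buckets interleave key ranges, so the concatenated result is out of order.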
Total Sort:
This approach takes the input sequence files from the preparation phase and passes them through a default MapReduce job, but with the help of an input sampler, which divides the key space into even partition ranges. Functionally it is the same as before: it produces n sorted reducer output files, but now their key ranges do not overlap, so concatenating them yields a single globally sorted file.
InputSampler (which returns a sample of keys given an InputFormat and a Job) is used so that an even partitioning of the files is achieved. The client calls the writePartitionFile() method of InputSampler, which produces a sequence file. That same sequence file is used by TotalOrderPartitioner to create the partitions for the sort job. Below is the code:
Configuration conf = job.getConfiguration();
job.setPartitionerClass(TotalOrderPartitioner.class);
// Sample 10% of the keys, up to 10,000 samples from at most 10 splits
InputSampler.Sampler<IntWritable, LongWritable> sampler =
    new InputSampler.RandomSampler<IntWritable, LongWritable>(0.1, 10000, 10);
// Write the partition file that TotalOrderPartitioner will read
InputSampler.writePartitionFile(job, sampler);
// Add the partition file to the distributed cache
String partitionFile = TotalOrderPartitioner.getPartitionFile(conf);
URI partitionUri = new URI(partitionFile);
job.addCacheFile(partitionUri);
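The range-partitioning idea behind TotalOrderPartitioner can be sketched in plain Java. The split points below play the role of the partition file written by InputSampler; all names are illustrative, not the Hadoop API:

```java
import java.util.*;

public class TotalSortSketch {
    // Range-partition keys by split points (the role the sampled partition
    // file plays for TotalOrderPartitioner), sort each partition, then
    // concatenate: the result is globally sorted.
    static List<Integer> totalSort(int[] keys, int[] splitPoints) {
        int n = splitPoints.length + 1;  // n reducers, n - 1 split points
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < n; i++) partitions.add(new ArrayList<>());
        for (int k : keys) {
            int p = 0;
            while (p < splitPoints.length && k >= splitPoints[p]) p++;
            partitions.get(p).add(k);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : partitions) {
            Collections.sort(part);  // each reducer sorts its own key range
            out.addAll(part);        // ranges do not overlap, so this is global
        }
        return out;
    }

    public static void main(String[] args) {
        // One split point (5) => two reducers: keys < 5 and keys >= 5
        System.out.println(totalSort(new int[]{5, 3, 8, 1, 9, 2, 7}, new int[]{5}));
    }
}
```

Because every key in partition i is smaller than every key in partition i+1, concatenating the reducer outputs in order gives the total sort.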
Secondary Sort:
One can easily understand it from the example below. The standard recipe is:
- Make the key a composite of the natural key and the natural value.
- Make the sort comparator order by the composite key, i.e. by natural key and then by natural value.
- Make the partitioner and grouping comparator consider only the natural key, so all records with the same natural key reach the same reduce() call.
SortComparator: used to define how map output keys are sorted.
Excerpts from the book Hadoop: The Definitive Guide:
The sort order for keys is found as follows:
1. If the property mapred.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on Job, then an instance of that class is used. (In the old API the equivalent method is setOutputKeyComparatorClass() on JobConf.)
2. Otherwise, keys must be a subclass of WritableComparable, and the registered comparator for the key class is used.
3. If there is no registered comparator, then a RawComparator is used that deserializes the byte streams being compared into objects and delegates to WritableComparable's compareTo() method.
SortComparator vs. GroupComparator in one line: the SortComparator decides how map output keys are sorted, while the GroupComparator decides which map output keys within a Reducer go to the same reduce() call.
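The division of labour between the two comparators can be sketched in plain Java. The record layout and names here are illustrative, not the Hadoop API:

```java
import java.util.*;

public class SecondarySortSketch {
    // Each record is {naturalKey, naturalValue}; the pair acts as the
    // composite key from the recipe above.
    static Map<Integer, List<Integer>> secondarySort(List<int[]> records) {
        // SortComparator role: order by natural key, then by natural value
        records.sort((a, b) -> a[0] != b[0]
                ? Integer.compare(a[0], b[0])
                : Integer.compare(a[1], b[1]));
        // GroupComparator role: records that compare equal on the natural
        // key alone reach the same reduce() call; because of the sort
        // above, the values inside each group arrive already sorted.
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int[] r : records)
            groups.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> records = new ArrayList<>(Arrays.asList(
                new int[]{1, 30}, new int[]{2, 10},
                new int[]{1, 20}, new int[]{2, 40}));
        System.out.println(secondarySort(records)); // values sorted per key
    }
}
```

The sort step uses the full composite, while the grouping step deliberately ignores the value part; that asymmetry is exactly what secondary sort exploits.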