Sunday, 26 January 2014

Collector Methods


(Auto)
– Eagerly read any row from any input partition
– Output row order is undefined (non-deterministic)
– This is the default collector method
Generally, Auto is the fastest and most efficient method of collection
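
The eager behaviour of Auto can be illustrated outside DataStage with a small Python sketch. This is an analogy only, not how the engine is implemented: the function name auto_collect, the threads and the shared queue are all assumptions made for the example. Each partition pushes its rows onto one queue, and the collector takes whatever arrives first, so the output order can differ from run to run.

```python
import queue
import threading


def auto_collect(partitions):
    """Collect rows from all partitions as soon as each row is available.

    Hypothetical helper for illustration; the output order is not defined.
    """
    out = queue.Queue()
    done = object()  # sentinel marking the end of one partition

    def produce(rows):
        for row in rows:
            out.put(row)
        out.put(done)

    threads = [threading.Thread(target=produce, args=(p,)) for p in partitions]
    for t in threads:
        t.start()

    collected = []
    remaining = len(threads)
    while remaining:
        item = out.get()  # take whichever row arrives next
        if item is done:
            remaining -= 1
        else:
            collected.append(item)

    for t in threads:
        t.join()
    return collected


rows_out = auto_collect([[1, 2, 3], [4, 5], [6]])
# Every row is present, but no particular order is guaranteed.
print(sorted(rows_out))
```

Only the set of rows can be relied on here, which is why the sketch sorts the result before printing it.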

Round Robin
– Reads one row from each input partition in turn, in round-robin order
– Slower than Auto, rarely used
Round Robin collector can be used to reconstruct the original (sequential)
row order for round-robin partitioned inputs
– Only as long as intermediate processing (e.g. sort, aggregator) has not altered
the row order or reduced the number of rows
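
The order-reconstruction property can be checked with a short Python sketch. The helpers round_robin_partition and round_robin_collect are hypothetical names used only for this illustration; they model the dealing and collecting logic, not the DataStage engine itself.

```python
from itertools import zip_longest

_MISSING = object()  # filler for partitions that run out of rows first


def round_robin_partition(rows, n):
    """Deal rows to n partitions, one row at a time, in turn."""
    partitions = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        partitions[i % n].append(row)
    return partitions


def round_robin_collect(partitions):
    """Take one row from each partition in turn until all are exhausted."""
    collected = []
    for group in zip_longest(*partitions, fillvalue=_MISSING):
        collected.extend(row for row in group if row is not _MISSING)
    return collected


rows = list(range(10))
parts = round_robin_partition(rows, 3)
restored = round_robin_collect(parts)
print(restored == rows)  # original order is reconstructed

# If an intermediate step drops a row, the original order is lost.
reduced = [parts[0], [r for r in parts[1] if r != 4], parts[2]]
broken = round_robin_collect(reduced)
print(broken == sorted(broken))
```

The second half of the sketch removes one row from one partition, which is enough to shift every later row out of sequence.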


Ordered
– Reads all rows from the first partition, then the second, …
– Preserves the order that exists within partitions
Ordered is only appropriate when sorted input has been range partitioned
– No sort is required to produce sorted output when each partition is already sorted
– Rarely used, because range partitioning is itself rarely used
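
A minimal Python sketch of why Ordered depends on range partitioning. The function name ordered_collect and the sample key values are assumptions for illustration; the collector is modelled as plain concatenation of the partitions in partition order.

```python
from itertools import chain


def ordered_collect(partitions):
    """Read every row of partition 0, then partition 1, and so on."""
    return list(chain.from_iterable(partitions))


# Range partitioned and sorted: each partition holds a distinct key range.
range_partitions = [[1, 4, 7], [10, 15], [22, 30, 31]]
collected = ordered_collect(range_partitions)
print(collected == sorted(collected))

# Sorted within each partition, but key ranges overlap (not range partitioned).
overlapping = [[1, 15, 30], [4, 22], [7, 10, 31]]
mixed = ordered_collect(overlapping)
print(mixed == sorted(mixed))
```

With overlapping key ranges the order inside each partition survives, but the combined stream is no longer globally sorted.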

Sort Merge
– Produces a single (sequential) stream of rows sorted on the specified key
columns, from input partitions already sorted on those keys
– Row order is not preserved among rows with equal key values (non-stable)
To generate a single stream of sorted data, use the Sort Merge
collector
– Input data must already be sorted on these keys
– Sort Merge does not perform a sort; it only merges
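
The merge-without-sort behaviour can be sketched in Python with the standard library's heapq.merge, which merges already-sorted inputs. The name sort_merge_collect and the sample rows are assumptions for illustration. Note that heapq.merge happens to keep tied rows in input order, which the collector described above does not promise, so only the key order should be relied on.

```python
import heapq
from operator import itemgetter


def sort_merge_collect(partitions, key):
    """Merge partitions that are each already sorted on `key`."""
    return list(heapq.merge(*partitions, key=key))


by_id = itemgetter(0)  # the key column is the first field of each row

sorted_partitions = [
    [(1, "a"), (4, "d"), (9, "x")],
    [(2, "b"), (4, "e")],
    [(3, "c"), (8, "w")],
]
merged = sort_merge_collect(sorted_partitions, by_id)
print([k for k, _ in merged])

# A merge does not sort: unsorted input gives unsorted output.
unsorted_partitions = [[(5, "p"), (1, "q")], [(2, "r")]]
not_sorted = sort_merge_collect(unsorted_partitions, by_id)
print([k for k, _ in not_sorted])
```

The second call shows the consequence of feeding the merge an unsorted partition: the rows come out in whatever order the merge encounters them.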

 
