Data Parallelism
Data Parallelism
在给定数据集的情况下,数据并行性是指跨数据集的元素组在同一功能的多台机器或线程上同时执行。数据并行性也可以视为
MapReduce(Dean&Ghemawat,2008)是
在
MapReduce
在此模型中,可并行化的计算被抽象为
-
Map:用户编写的
Map 接受一组键/ 值对(“记录”)作为输入,对每个记录应用map 操作,并计算一组中间键/ 值对作为输出。 -
Reduce:还由用户编写的
Reduce 接受中间键和与该键关联的一组值,对其进行操作,并产生零个或一个输出值。
$$ \begin{array}{c}\operatorname{map}(k 1, v 1) \rightarrow \operatorname{list}(k 2, v 2) \ \text {reduce}(k 2, \text {list}(v 2)) \rightarrow \operatorname{list}(v 2)\end{array} $$
输入键和值是从与输出键和值不同的域中提取的。中间键和值与输出键和值来自同一域。例如,我们可以考虑对大量文档中每个单词出现的次数进行计数的问题。可以将其建模为一个
map(String key, String value):
// key: document name
// value: document contents
for each word in value:
EmitIntermediate(word, "1");
reduce(String key, Iterator values):
// key: a word
// values: a list of counts
int result = 0;
for each value in values:
result += ParseInt(value);
Emit(AsString(result));
在执行期间,
Fault Tolerance
Limitations
许多分析工作负载(例如
FlumeJava
引入
PCollection
, a immutable bag of elements of typeT
, which can be created from an in-memory JavaCollection
or from reading a file with encoding specified byrecordOf
.recordOf(...)
, specifies the encoding of the instancePTable
, a subclass ofPCollection>
, an immutable multi-map with keys of typeK
and values of typeV
parallelDo()
, can express both the map and reduce parts of MapReducegroupByKey()
, same as shuffle step of MapReducecombineValues()
, semantically a special case ofparallelDo()
, a combination of a MapReduce combiner and a MapReduce reducer, which is more efficient than doing all the combining in the reducer.flatten
, takes a list ofPCollection
s and returns a singlePCollection
.
PTable<String, Integer> wordsWithOnes =
words.parallelDo(
new DoFn<String, Pair<String, Integer>> () {
void process(String word,
EmitFn<Pair<String, Integer>> emitFn) {
emitFn.emit(Pair.of(word, 1));
}
}, tableOf(strings(), ints()));
PTable<String, Collection<Integer>>
groupedWordsWithOnes = wordsWithOnes.groupByKey();
PTable<String, Integer> wordCounts =
groupedWordsWithOnes.combineValues(SUM_INTS);
使用
-
Fusion: $f(g(x))=>g \circ f(x)$,本质上是功能组成。通过将多个可组合步骤组合为一个步骤,通常可以帮助减少给定工作所需的步骤数量。
-
MapShuffleCombineReduce (MSCR) Operation: 将ParallelDo ,GroupByKey,CombineValues 和Flatten 组合到一个MapReduce 作业中。这将MapReduce 扩展为接受多个输入和多个输出。下图说明了具有3 个输入通道,2 个分组(“ GroupByKey”)输出通道和1 个直通输出通道的MSCR 操作的情况。

总体优化器策略涉及一系列优化操作,其最终目标是产生最少,最有效的
- Sink Flatten: $h(f(a)+g(b)) \rightarrow h(f(a))+h(g(b))$
- Lift combineValues operations: If a
CombineValues
operation immediately follows aGroupByKey
operation, theGroupByKey
records the fact and originalCombineValues
is left in place, which can be treated as normalParallelDo
operation and subject to ParallelDo fusions. - Insert fusion blocks:
- Fuse
ParallelDo
s - Fuse MSCRs: create MSCR operations, and convert any remaining unfused ParallelDo operations into trivial MSCRs.