Reduction trees
Introducing coNCePTuaL communication

In this exercise we construct a communication benchmark that measures the communication throughput across a reduction tree.

Problem description

A common operation in parallel programs is the reduction. In a reduction operation, all tasks collectively apply a commutative/associative operator to a list of values, with one value per task. For example, a set of eight tasks might use the + operator to sum the values {955, 657, 326, 864, 431, 542, 156, 360}—each value provided by a different task. The result—in this case, 4291—is typically either received by a single task or distributed to all tasks. (The latter is often termed an all-reduce.)

Reductions are often implemented using a communication tree. In a communication tree, pairs of tasks reduce their values, then pairs of pairs reduce their reduced values, then pairs of pairs of pairs, and so forth until a single task (the root of the reduction tree) receives the completely reduced value. Again using {955, 657, 326, 864, 431, 542, 156, 360} as sample input, task 1 sends its value, 657, to task 0, who adds 955+657=1612. At the same time, task 3 sends its value, 864, to task 2, who adds 326+864=1190. Meanwhile, task 4 is adding 431+542=973 and task 6 is adding 156+360=516. We're now left with tasks {0, 2, 4, 6} holding the values {1612, 1190, 973, 516} respectively. In the second round, task 2 sends 1190 to task 0 who adds 1612+1190=2802 and task 6 sends 516 to task 4 who adds 973+516=1489. Finally, task 4 sends 1489 to task 0 who adds 2802+1489 for a grand total of 4291. The advantage of a reduction tree is that it takes time proportional to the base-2 logarithm of the number of tasks; for example, it takes only 16 communication steps to reduce the values from 65,536 tasks. (Yes, coNCePTuaL has been run on a computer system with that many processors.)

Our goal for this exercise is to implement a coNCePTuaL program that implements the communication pattern represented by a reduction tree and measures the rate at which data flows through that tree.

Scott Pakin,