Parallel folding is a bit of an oxymoron. The fold operation sequentially builds up an accumulator value from the elements of a data structure. So if it's inherently sequential, how can it be parallelized?

If the reduce function given to the fold is associative, the fold can be parallelized. If it's not, the fold is inherently sequential.

Associativity means that you can turn `(a + b) + c` into `a + (b + c)`, or, in function call terms, `(f (f a b) c) == (f a (f b c))`. When you have a long call chain like `a + b + c + d + e + f + g + h`, it means that you can split it into `(a + b) + (c + d) + (e + f) + (g + h)`, evaluate the parenthesized bits in parallel, and then reduce the results to the final value. The reduction can also be parallelized, as it is the fold `(a_b + c_d) + (e_f + g_h)`. This sort of parallel tree reduction is what map-reduce is all about btw: compute a bunch of values in parallel and reduce them to the result value.
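A minimal Python sketch of the split (the `tree_reduce` name and the fixed-chunk scheme are mine, not from the post): fold each chunk independently, then feed the partial results into the next round.

```python
from functools import reduce

def tree_reduce(f, xs, chunk_size=2):
    """Fold xs with f by repeatedly folding fixed-size chunks.

    Each round folds its chunks independently -- that's the part you
    could hand out to parallel workers -- then the partial results
    become the next round's input. Requires f to be associative.
    """
    while len(xs) > 1:
        xs = [reduce(f, xs[i:i + chunk_size])
              for i in range(0, len(xs), chunk_size)]
    return xs[0]
```

For an associative `f` like `+`, the result matches a plain left fold; for a non-associative one like `-`, the two generally disagree, which is exactly why associativity is the requirement.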

### Where it all goes by the wayside

Let's do the math to figure out the theoretical runtime for a reduction network. One parallel reduction step takes `computeTime` time. Moving the result to the next reduce node takes `totalResultSize / bandwidth` time, where `totalResultSize` is defined as `resultSize * branchingFactor`, as the branching factor is the number of nodes sending their reduce results to the reduce node. We can also split it into a more useful form, `branchingFactor * (resultSize / bandwidth)`, and write `transferTime = resultSize / bandwidth`. Finally, the number of parallel reduction steps required to reduce a tree of size `N` is `log_branchingFactor N` (with `1 < branchingFactor <= N`).

Putting it all together, we get something like

```haskell
treeReduceTime n branchingFactor computeTime transferTime =
    stepCount * stepTime
  where stepTime  = computeTime + (branchingFactor * transferTime)
        stepCount = if branchingFactor == 1.0
                      then n
                      else log n / log branchingFactor
```
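For experimenting with the model, here's an equivalent Python version (a direct port, nothing added):

```python
import math

def tree_reduce_time(n, branching_factor, compute_time, transfer_time):
    # One step costs the compute time plus receiving branching_factor
    # partial results; a branching factor of 1 degenerates into n steps.
    step_time = compute_time + branching_factor * transfer_time
    if branching_factor == 1.0:
        step_count = n
    else:
        step_count = math.log(n) / math.log(branching_factor)
    return step_count * step_time
```

With zero transfer time, larger branching factors strictly win: reducing 16 elements takes 16 step-times at branching factor 1, 4 at branching factor 2, and 1 at branching factor 16.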

From the above we can come up with a couple of generalizations:

If computeTime is O(1) with regard to branchingFactor and you have infinite bandwidth or zero result size (both mean that transferTime is zero), you want a branching factor of `N`. That's easy enough to see: with infinite bandwidth, the treeReduceTime is

`(computeTime + (branchingFactor * 0)) * stepCount = computeTime * stepCount`

To minimize that, we need to minimize stepCount. The smallest stepCount we can have is 1, and stepCount is 1 when branchingFactor is N.

If you have infinitely fast compute nodes (computeTime = 0), you want a branching factor of `e`. That's a bit tougher to prove [1] (or maybe I'm just lacking some essential bit of knowledge that would make the proof easier).

Note that it's highly likely that `computeTime` depends on `branchingFactor`. If the reduce algorithm is O(n), `computeTime = branchingFactor * elementComputeTime`, and in that case the optimum branching factor is `e` (as elementComputeTime and transferTime have the same factor, they can be rolled together and the infinitely fast compute nodes proof applies). For O(n^2) algorithms, the optimal branching factor seems to be `sqrt(e)`.
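That `sqrt(e)` figure is easy to sanity-check numerically (my own check, assuming `computeTime = branchingFactor^2 * elementComputeTime` and negligible transfer time): a step then costs on the order of `b^2`, there are `log n / log b` steps, so we minimize `b^2 / log b`.

```python
import math

# Minimize b^2 / ln(b) on a fine grid over (1, 10]; the ln(n) factor
# is constant with respect to b and can be ignored.
bs = [1.01 + 0.001 * i for i in range(9000)]
best = min(bs, key=lambda b: b * b / math.log(b))
```

`best` comes out within a grid step of `sqrt(e) =~ 1.649`.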

For O(1) computeTime, finite computational power and non-zero transfer time, the optimal branching factor is somewhere between `e` and `N`. I used a simple hill-climb optimizer to find optimal branching factors for different computation/transfer ratios, and they do seem to be independent of N (well, apart from the maximum branching factor). For 1:1 compute time : transfer time, the branching factor is around 3.6. For 1:10 (low bandwidth), the branching factor is a bit over 2.8. For 10:1 (slow compute), the branching factor is about 8.6. Here are some approximate ratios for integer branching factors: 3 -> 1:3.5, 4 -> 1.5:1, 5 -> 3:1, 6 -> 5:1; and for 1000:1 the factor is about 226.
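The hill-climb itself isn't shown in the post; here's a stand-in of mine using golden-section search over the model (with `tree_reduce_time` restated so the snippet is self-contained):

```python
import math

def tree_reduce_time(n, b, compute_time, transfer_time):
    # The model from the text: log_b(n) steps, each costing
    # computeTime plus branchingFactor * transferTime.
    step_time = compute_time + b * transfer_time
    step_count = n if b == 1.0 else math.log(n) / math.log(b)
    return step_count * step_time

def optimal_branching(n, compute_time, transfer_time):
    # Golden-section search on (1, n]; for a fixed n the model is
    # unimodal in the branching factor, so this finds the minimum.
    lo, hi = 1.001, float(n)
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(200):
        m1 = hi - phi * (hi - lo)
        m2 = lo + phi * (hi - lo)
        if (tree_reduce_time(n, m1, compute_time, transfer_time)
                < tree_reduce_time(n, m2, compute_time, transfer_time)):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2
```

`optimal_branching(10**6, 1, 1)` lands near 3.6 and `optimal_branching(10**6, 1, 10)` a bit over 2.8, matching the figures above; the result is the same for any `n`, because `log n` only scales the total time.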

The above approximations can be used for branching-factor-dependent computation times too. You do that by replacing the computation time with the per-reduce-step overhead and folding computeTime into transferTime. The new ratio is `overhead : (computeTime + transferTime)`. For intuition on this: if the overhead dominates, you want more branching, as that will decrease the contribution of the overhead. If you have very little overhead, you want a branching factor close to `e`.

For an example, consider an email tournament that aims to pick the best email out of ten thousand. You receive some fixed amount of emails and your task is to pick the best one, then forward it to your assigned receiver (who does the same, etc., and the last reducer sends the mail to the award ceremony). Let's estimate that the transfer time is 5 seconds, the computing time is a minute per email, and the fixed overhead is 8 hours. This gives us an overhead:processing-ratio of 443. If each receiver processes 10 emails per session, running the tournament would take 33 hours. But if each receiver does 118 emails a go, the tournament would be done in 19.5 hours. And if they do 200 emails, it'd take ... 20.2 hours. (Yeah, 118 is the best branching factor here.)

Suppose you do the same tournament but with people glued to their mail apps. Now the overhead might be just one minute. This time, doing the tournament with 118 mails per session would take a snappy 4.1 hours. But as the role of overhead has fallen, we should reduce the branching factor as well. With 4 mails per session, the tournament would take only 35 minutes. (With 2 mails per session, 42 minutes. 4 is optimal.)
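The tournament numbers drop straight out of the model (a quick script of mine, using the figures from the text: 5 s transfer plus one minute of judging, folded into a 65-second per-email cost):

```python
import math

def tournament_hours(n, emails_per_session, overhead_s, per_email_s):
    # Rounds needed to reduce n emails at this branching factor,
    # times the cost of one round (fixed overhead + per-email work).
    steps = math.log(n) / math.log(emails_per_session)
    return steps * (overhead_s + emails_per_session * per_email_s) / 3600
```

With the 8-hour overhead this gives about 32.7 hours at 10 emails per session, 19.6 at 118, and 20.2 at 200; with a 60-second overhead, about 4.1 hours at 118 per session and 35 minutes at 4, matching the figures above to rounding.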

[1] With infinitely fast compute nodes, the treeReduceTime is

`(branchingFactor * transferTime) * stepCount`

For a branching factor of 1, that equals `transferTime * n`. For other branching factors, the equation is `b * t * log_b(n)`, where `b = branchingFactor` and `t = transferTime`. Since t is a fixed factor, the minimum of the above equation is the minimum of `b * log_b(n)`, which can be rewritten (for `b > 1`) as

`b * log_b(n) = b * (log n / log b) = (b / log b) * log n`

Since log n is also a fixed factor, the minimum of `b * t * log_b(n)` is the minimum of `b / log b`. To find the minimum, take the derivative

`D(b / log b) = (1 * log b - b * (1/b)) / (log b)^2 = (log b - 1) / (log b)^2`

which is zero at b = e:

`(log e - 1) / (log e)^2 = (1 - 1) / 1^2 = 0 / 1 = 0`

Checking the values of `b / log b` on both sides of this critical point,

`e / log e = e =~ 2.718`

`2 / log 2 =~ 2.885`

`3 / log 3 =~ 2.731`

both neighbors are larger than the value at the critical point, confirming it as a local minimum. And since `b / log b` tends to infinity as b approaches the discontinuity at b = 1 from above, which is larger than `e / log e`, the global minimum of `b / log b` lies at b = e.

Plugging that back into the original equation: the minimum of `branchingFactor * transferTime * stepCount` is

`transferTime * min(n, e * ln(n))`

and because `n >= e * ln(n)`, the minimum is `transferTime * e * ln(n)`.
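The conclusion is also easy to confirm numerically (my own check, not part of the proof): minimize `b / log b` on a fine grid over `(1, 11]`.

```python
import math

# b / ln(b) should bottom out at b = e =~ 2.718 and never dip below e.
bs = [1.01 + 0.001 * i for i in range(10000)]
best = min(bs, key=lambda b: b / math.log(b))
```

`best` comes out within a grid step of e, and every grid value of `b / log b` stays at or above `e`.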