art with code

2016-12-07

Thought experiment on autoparallelizing loops

One thought I've been having these days is whether you could make JavaScript faster by executing loops in parallel. JavaScript feels like a pretty good language for autoparallelization, as it has no pointers and no shared-state concurrency: you know when two objects share memory, and the variables in your thread of execution can't suddenly change (i.e. no external threads mucking with your data). So proving that a loop can be parallelized should be easier than in C, and if you can prove that a loop can be parallelized, you can go ahead and do it with reasonable confidence that the results will be the same as for serial execution. And JS has another dubious advantage as well: it's slow. A slow language is less likely to run into a memory bandwidth bottleneck (assuming that it's slow because it's doing more boilerplate computation), so it should benefit from parallelization more easily than a language that's already memory-bound.

On a high level, you'd keep track of the input variables and output variables of a block of code, then figure out if any of the inputs are also outputs. If not, you've got a pure block and can execute it in parallel. If the outputs overlap (each iteration writes to a variable outside the block) but that output is not used as an input, use the last iteration's value for it.
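
For example, a quick sketch of the two cases (the names n, input, output and last are just for illustration, and the runtime is assumed to be able to split the index range across worker threads):

    // Pure block: each iteration reads values no other iteration writes
    // and writes only to output[i], so the iterations can run in parallel.
    for (var i = 0; i < n; i++) {
      output[i] = input[i] * 2;
    }

    // Output overlap, no input overlap: "last" is written on every
    // iteration but never read inside the block, so a parallel version
    // can just keep the value from the final iteration (i === n - 1).
    var last;
    for (var i = 0; i < n; i++) {
      output[i] = input[i] * 2;
      last = input[i];
    }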

If the inputs and outputs overlap, you've got a reduce tree. The shape of the tree depends on the associativity of the input-to-output function. If it's associative, e.g. step = step + myValue, you can make any shape of tree that you like. If it's not associative, e.g. step = step / myValue, the tree is sequential. You can still execute the computation of myValue in parallel, but you need to reduce the result of step for each block instance before it can proceed. This is a bit complex though. A simple heuristic would be to execute fully associative blocks in parallel and everything else sequentially.
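
A minimal sketch of the two shapes (parallelSum and sequentialDivide are made-up names, and the map over chunks stands in for real worker threads):

    // Associative op: step = step + myValue. Any tree shape gives the
    // same answer, so each chunk can be summed independently and the
    // partial sums combined afterwards.
    function parallelSum(chunks) {
      var partials = chunks.map(function (chunk) { // imagine: one worker per chunk
        return chunk.reduce(function (a, b) { return a + b; }, 0);
      });
      return partials.reduce(function (a, b) { return a + b; }, 0);
    }

    // Non-associative op: step = step / myValue. The reduce itself has
    // to run in order, though the myValue computations could still be
    // done in parallel beforehand.
    function sequentialDivide(step, myValues) {
      for (var i = 0; i < myValues.length; i++) {
        step = step / myValues[i]; // order matters here
      }
      return step;
    }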

How to implement... First, detect loops. Second, try to prove that the output values are independent of each other (for starters, deal with output[i] = myValue, where i is incremented by one on each iteration and terminates at a known value). Third, estimate the serial loop cost vs. the parallel loop cost plus thread creation overhead. If serial cost > parallel cost + overhead, turn the loop into a parallel one.
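
In pseudo-JS, that decision could look something like this (estimateIterCost, threadCreationOverhead and the emit functions are all invented names for the sake of the sketch):

    // Hypothetical compiler heuristic: parallelize only when it pays off.
    var serialCost   = iterations * estimateIterCost(loopBody);
    var parallelCost = Math.ceil(iterations / numThreads) * estimateIterCost(loopBody)
                     + threadCreationOverhead * numThreads;
    if (serialCost > parallelCost) {
      emitParallelLoop(loopBody);
    } else {
      emitSerialLoop(loopBody);
    }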

If output values depend on each other but you can prove that the reduce operation is associative (start by handling the case of a single number output written to by all threads, with + as the reduce op), estimate the reduce tree cost vs. the serial reduce cost. Pick the minimum cost from parallel map + serial reduce, parallel map + parallel reduce, and serial map + serial reduce.
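
Something like this, with the cost variables assumed to come out of the same kind of estimator as above:

    // Hypothetical: pick the cheapest of the three execution plans.
    var plans = [
      { name: 'parallel map + serial reduce',   cost: parMapCost + serReduceCost },
      { name: 'parallel map + parallel reduce', cost: parMapCost + parReduceCost },
      { name: 'serial map + serial reduce',     cost: serMapCost + serReduceCost }
    ];
    var best = plans.reduce(function (a, b) { return a.cost < b.cost ? a : b; });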

Durr... maybe some maps could be turned into a sequence of loop-wide vector ops (e.g. a[i] = b[i] + c[i] * d[i] => a = b + c * d), which could then be split into parallel SIMD chunks and fused back into idx = thread_id*thread_block_size+i; a_simd[idx] = b_simd[idx] + c_simd[idx] * d_simd[idx].
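
As a rough sketch of the per-thread kernel that split would produce (threadId and blockSize play the roles of thread_id and thread_block_size from the formula above, the rest is invented):

    // Hypothetical per-thread kernel: each thread walks its own
    // contiguous block of the fused vector op a = b + c * d.
    function chunkKernel(threadId, blockSize, a, b, c, d) {
      var base = threadId * blockSize;
      for (var i = 0; i < blockSize; i++) {
        var idx = base + i;
        a[idx] = b[idx] + c[idx] * d[idx];
      }
    }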
