Have you tried doing that with TMVars, for locklessness?

Julian: Just spent some time trying it (newEmptyTMVarIO, putTMVar, takeTMVar), but it's 5% slower...

There are no "locks" per se here, so there is no reason to use TMVars. The only thing the MVars are used for here is to make the consumer block until the producer is done with all of the elements. TMVars could do the same job, but at a higher cost and with no improvement.
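To make that use of MVars concrete, here is a minimal sketch. It is not the code from this thread: `parMapIO` and `spawn` are hypothetical names, and it assumes one forked thread and one MVar per element.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)

-- Hypothetical helper: fork one producer thread for an element and
-- return the MVar it will fill exactly once.
-- Caveat: if (f x) throws, the MVar is never filled.
spawn :: (a -> b) -> a -> IO (MVar b)
spawn f x = do
  var <- newEmptyMVar
  -- evaluate forces the result to weak head normal form in the
  -- producer thread before it is handed over.
  _ <- forkIO (evaluate (f x) >>= putMVar var)
  pure var

-- Hypothetical helper: the consumer blocks in takeMVar until each
-- producer has finished, so the list is returned only once every
-- element is done.
parMapIO :: (a -> b) -> [a] -> IO [b]
parMapIO f xs = do
  vars <- mapM (spawn f) xs
  mapM takeMVar vars
```

Each MVar here is written once and read once, so it acts purely as a "result is ready" signal rather than as a lock.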

I also suggested (it seems I forgot to "reply all", so it was off list) wrapping the takeMVar in unsafeInterleaveIO. This would allow you to, say, retrieve the result of element x while element y is still being computed (assuming x /= y and x is finished), since the blocking is deferred until you actually try to use the elements of the list. This may speed some things up, since you can start consuming the list before all the computations are done.
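A rough sketch of that suggestion, under the same assumptions as before (one forked thread and one MVar per element; `lazyParMapIO` and `spawn` are hypothetical names, not code from this thread):

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)
import System.IO.Unsafe (unsafeInterleaveIO)

-- Hypothetical helper: fork one producer per element; it fills its
-- MVar exactly once when the computation finishes.
spawn :: (a -> b) -> a -> IO (MVar b)
spawn f x = do
  var <- newEmptyMVar
  _ <- forkIO (evaluate (f x) >>= putMVar var)
  pure var

-- Hypothetical helper: all producers are started right away, but each
-- takeMVar is deferred by unsafeInterleaveIO until the corresponding
-- list element is actually demanded.
lazyParMapIO :: (a -> b) -> [a] -> IO [b]
lazyParMapIO f xs = do
  vars <- mapM (spawn f) xs
  -- The list spine is built immediately; every element is a thunk
  -- that blocks on its own MVar only when it is forced.
  mapM (unsafeInterleaveIO . takeMVar) vars
```

With this version, forcing one element of the result waits only for that element's producer, so a finished element can be used while others are still running. The usual caveat with unsafeInterleaveIO applies: the blocking now happens wherever the element is first forced, which can make the timing harder to reason about.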

