After some discussion on haskell@, Sebastian Sylvan suggested a much more straightforward way of hyperizing computations. By request of Nicholas, here is the hyperization code before parallelization:
mapM evaluate xs
and this is the code after parallelization:
mvs <- forM xs $ \x -> do
mv <- newEmptyMVar
forkIO (evaluate x >>= putMVar mv)
return mv
mapM takeMVar mvs
Or, in Perl 6 (without using hyper-operators themselves):
# Before
@xs.map(&evaluate);# After
my @mvs = @xs.map: -> $x {
my $mv = MVar.new;
async { $mv.put: evaluate($x) };
$mv;
};
.take for @mvs;
The main point here is that forkIO/async does not actually create an OS thread; instead, it creates a new task for the preemptive GHC runtime kernel, which then assign it to one of the CPUs currently available, via pre-spawned OS threads.
With this strategy, numbers on feather now looks better, and so does my dual-core Macbook:
$ time env GHCRTS=-N1 ./pugs -e 'my @x = 1..50000; @x.>>sqrt'
real 0m5.204s
user 0m5.093s
sys 0m0.088s$ time env GHCRTS=-N2 ./pugs -e 'my @x = 1..50000; @x.>>sqrt'
real 0m3.404s
user 0m3.937s
sys 0m0.107s
Note that it's now taking more user-time than real-time, which means SMP is doing its job correctly.
The profiler seems to point to GC performance as the major factor preventing true linear scalability, which GHC 6.8 will address by having a true multithreaded GC implementation. It'll be fun to try this little experiment again once it arrives. :-)

Have you tried doing that with TMVars, for locklessness?
Posted by: Julian Morrison | 2006.10.22 at 08:21 PM
Julian: Just spent smoe time trying it (newEmptyTMVarIO, putTMVar, takeTMVar), but it's 5% slower...
Posted by: Audrey Tang | 2006.10.22 at 11:27 PM
There are no "locks" per say here, so no reason to use TMVars. The only thing the MVars are used for here is to cause the consumer to block until the producer is done with all of the elements (which could be achieved by TMVars, but more expensively, at no improvement).
I also suggested (it seems I forgot to do "reply all" so it was off list) wrapping the takeMVar in unsafeInterleaveIO. This would allow you to, say, retrieve the result of element x while element y is still being computed (assuming x/=y and x is finished) since the blocking is deferred until you actually try to use the elements in the list. This may speed up some things since you can start consuming the list before all the computatinos are done.
Posted by: Sebastian Sylvan | 2006.10.22 at 11:28 PM