August 2008

Sun Mon Tue Wed Thu Fri Sat
          1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31            
Recently on this blog
Recently on other blogs

Map

Audrey

My Photo

License

« SMP parallelization comes to Pugs! | Main | Weekly Perl 6 mailing list summary for 15-21 October, 2006 »

2006.10.22

More SMP parallelism.

After some discussion on haskell@, Sebastian Sylvan suggested a much more straightforward way of hyperizing computations.  By request of Nicholas, here is the hyperization code before parallelization:

mapM evaluate xs

and this is the code after parallelization:

mvs <- forM xs $ \x -> do
    mv <- newEmptyMVar
    forkIO (evaluate x >>= putMVar mv)
    return mv
mapM takeMVar mvs

Or, in Perl 6 (without using hyper-operators themselves):

# Before
@xs.map(&evaluate);

# After
my @mvs = @xs.map: -> $x {
    my $mv = MVar.new;
    async { $mv.put: evaluate($x) };
    $mv;
};
.take for @mvs;

The main point here is that forkIO/async does not actually create an OS thread; instead, it creates a new task for the preemptive GHC runtime kernel, which then assign it to one of the CPUs currently available, via pre-spawned OS threads.

With this strategy, numbers on feather now looks better, and so does my dual-core Macbook:

$ time env GHCRTS=-N1 ./pugs -e 'my @x = 1..50000; @x.>>sqrt'
real    0m5.204s
user    0m5.093s
sys     0m0.088s

$ time env GHCRTS=-N2 ./pugs -e 'my @x = 1..50000; @x.>>sqrt'
real    0m3.404s
user    0m3.937s
sys     0m0.107s

Note that it's now taking more user-time than real-time, which means SMP is doing its job correctly.

The profiler seems to point to GC performance as the major factor preventing true linear scalability, which GHC 6.8 will address by having a true multithreaded GC implementation.  It'll be fun to try this little experiment again once it arrives. :-)

Comments

Have you tried doing that with TMVars, for locklessness?

Julian: Just spent smoe time trying it (newEmptyTMVarIO, putTMVar, takeTMVar), but it's 5% slower...

There are no "locks" per say here, so no reason to use TMVars. The only thing the MVars are used for here is to cause the consumer to block until the producer is done with all of the elements (which could be achieved by TMVars, but more expensively, at no improvement).

I also suggested (it seems I forgot to do "reply all" so it was off list) wrapping the takeMVar in unsafeInterleaveIO. This would allow you to, say, retrieve the result of element x while element y is still being computed (assuming x/=y and x is finished) since the blocking is deferred until you actually try to use the elements in the list. This may speed up some things since you can start consuming the list before all the computatinos are done.

Verify your Comment

Previewing your Comment

This is only a preview. Your comment has not yet been posted.

Working...
Your comment could not be posted. Error type:
Your comment has been posted. Post another comment

The letters and numbers you entered did not match the image. Please try again.

As a final step before posting your comment, enter the letters and numbers you see in the image below. This prevents automated programs from posting comments.

Having trouble reading this image? View an alternate.

Working...

Post a comment