Sunday, May 29, 2011

Web Workers faster than (naive) C++ [edited]


[EDIT] The main problem with the C++ version was probably the platform (Cygwin). So I re-ran the same program on Linux inside VirtualBox on the same machine and the results are not so catastrophic. I also ran the Python program with PyPy and it's twice as fast as CPython on that problem (Thanks to Alex Gaynor for trying it first).

I don't mean that to stir the controversy. If someone with great C++ skills could explain what I did wrong, I'd be glad to learn. So here is the story.

Multi-language parallel processing comparisons

It all began last week with Google Code Jam problem B in round 1A. The problem is the following. Two guys are playing hangman. They have a common dictionary of allowed words. The guy who tries to guess has a very simple strategy. He has a fixed list of letters that he will propose in order. However, he will skip letters that cannot appear in the word in the current state of the game. Each time he proposes a letter that does not appear in the word, the other guy makes one point. Your goal is to find the word that will make the most points, knowing the dictionary and the ordered list of letters of the guy who guesses.
I have written a solution with the same algorithm in four languages: Python, Java, C++ and JavaScript. I also wanted to get some knowledge of the parallel processing capabilities of those languages. In Python, I used the multiprocessing module, in Java, package java.util.concurrent with ExecutorService, in C++, OpenMP and in JavaScript HTML5 web workers. Here are the running times.

Single
Multi
1 worker 4 workers 6 workers
Java 26" 27" 18.5" 17.5"
C++ Linux 33" 33" 20" 21"
C++ Cygwin 58" 59" 30'00" 28'00"
PyPy 1'04" 1'08" 27" 26.5"
Python 2'30" 2'35" 50" 44"
JS Chrome 44" 45" 15.3" 14"
JS Opera 1'17" 1'17" 1'21" 1'35"
JS IE 1'30" No web workers
JS Firefox 1'34" crash
1'14" for 9
crash crash

Best times are highlighted in green in each column, worst times in red, crashing or unavailable solutions in black.

The code is available in github, with extensive explanations on how to build and run it. The HTML5 web workers solution can be tested right in your browser. To run the program in non-web worker mode, just click the "Run without workers" button. Be aware that the browser will probably tell you every few seconds that a script is making the page unresponsive. You need to allow it to continue if you want it to finish.
To run the program with web workers, you need to create a pool first. When you do that, a row of green squares will appear, to track the activity of allocated workers. Then, launch the computation with "Run". The "Queue size" will move briefly to 10, workers will turn red when busy. The result will appear once the queue is empty and all workers have finished.

Observations

About the C++ program

My first surprise is that I could not make my C++ program run as fast as the Java solution. My C++ skills are rusty, for sure, but when I wrote some numerical things not too long ago, I had no problem making them go two or three times faster than Java. This time, there is a lot of string and map handling and I guess my performance problems might come from too much memory allocation. I used gprof to locate potential hotspots, but nothing meaningful appears. I would genuinely be interested if some C++ wizard could explain how to fix that. [At least on Linux it's better than with Cygwin]
C++ 4-workers CPU profile

The second suprise is that using parallel processing actually made things worse, much much worse! Read the numbers: 30 minutes for the 4-workers C++ when Chrome takes 15 seconds. Actually it reinforces the hypothesis that my performance problems come from memory allocation. The multiple threads of execution are fighting each other for memory and provoke contention. That would explain the fact that the program only consumes 25% CPU, a big part of which is system time. [Yet again, on Linux the multi-threaded version sees a good increase in performance. Cygwin probably has to do much more work.]

About Firefox 4

The sequential solution, using only the main JavaScript thread, has a very curious memory profile.
Firefox non-worker solution

It looks like memory is never reclaimed during the execution, only after the function has finished running.

When running the 4-worker solution, Firefox CPU profile is similar to the C++ one: only 40% used and at least three quarters of that is system time. Using Process Explorer, I would conclude that Firefox uses threads for workers and probably meets memory contention too.

What is more troublesome is that Firefox cannot reliably compute the solution using web workers. Firebug catches some exceptions during execution. But those exceptions never occur in other browsers, or even in Firefox when running without workers. Among the ten tasks that workers have to process, there is generally one or two that produce an exception. It is sometimes possible to solve nine of the ten cases with one worker, but not always. And with four workers, it cannot even finish four cases. I have really no idea what is wrong.

About Opera 11

Opera performs quite well without workers. With workers, it performs the same. In fact, the more workers you add, the worse it performs. I guess that it uses "green threads", not native ones. No new threads appear in Process Explorer when launching workers and CPU use is always around 25%. With more workers, there is more work for the internal scheduler and some more seralization of data to do. That's what explains the slight degradation of running times.

About Internet Explorer 9

IE9 does not support web workers, so I only tested the standard solution. The performance is reasonable. It is worse than Chrome and Opera, but a bit better than Firefox.

About Chrome 13

Chrome is impressive, no doubt about that. The standard JavaScript solution is already the fastest of the browser pack (twice faster than Firefox and IE), and it benefits greatly from added workers. Using separate OS processes (like Python multiprocessing module) makes it immune to memory contention.

About Java

Java is a still a bit faster than Chrome V8 JavaScript in single-threaded mode. However, the gain when using multiple workers is not as big as expected: only 1.5 times faster with four workers when Python or Chrome go three times faster with four processes. I don't know why this is so and I don't have a precise idea how to diagnose. It might be a kind of memory contention, as with other multi-threaded implementations, except it does not appear as system time, because the JVM manages memory by itself.

About Python

I initially wrote my solution in Python, because the language feels so good. When ran with PyPy (1.5.0-alpha0) it can perform twice as fast as with CPython. That's great news!

Conclusions and open questions

I welcome all comments, on why my C++ solution is so slow, why Firefox fails with workers, and anything else. In the meantime, I will finish and release my micro-library to provide a Python-like Pool API above web workers, knowing that today, only Chrome can use it for heavy lifting.

3 comments:

  1. Your C++ is slow because you're copying strings all over the place. If you profile, I bet you'd find a lot of your app time spent on the heap.

    Pass by reference more.

    Replace substr() with a pair of const_iterator when possible -- basically only use substr() if you're going to modify the resulting string or if you need its lifetime to be longer than the source. In this case keeping the source around for longer instead is probably acceptable and will certainly be more performant than copying.

    consider using std::move() instead of copying in some places.

    ReplyDelete
  2. Thanks Cory, you taught me something today! I didn't know about std::move(). I have now used it to make my function more explicit.
    I have removed all the string copying that I could, I think. String shortening has been replaced by passing an index around. 1-character strings have been replaced by char.
    The only remaining strings are the one that are really dynamically built when playing letters. I don't think I can get rid of them. At least I tried to std::move() them instead of copying.

    The result: -20% execution time. Nice!

    ReplyDelete
  3. use java amd aparapi api rocks and ownz and wipes all benchmark you did.
    It utilizes GPU on the java very easy.
    Check : http://developer.amd.com/tools-and-sdks/heterogeneous-computing/aparapi/

    ReplyDelete