Performance measurements were taken using std::chrono::high_resolution_clock, with overhead corrections. The code was compiled with the build options: variant = release, optimization = speed. Tests were executed on an Intel Core i7-4770S 3.10GHz (4C/8T) running Linux (4.7.0/x86_64) with gcc-6.2.0.
The microbenchmark skynet from Alexander Temerev was ported and used for performance measurements. At the root the test spawns 10 threads-of-execution (ToE), e.g. actor/goroutine/fiber etc. Each spawned ToE spawns another 10 ToEs, and so on, until 100000 ToEs have been created. Each ToE returns its ordinal number (0 ... 99999); the numbers are summed at the previous level and sent back upstream, until the sum reaches the root.
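The recursive structure of the benchmark can be sketched with Boost.Fiber primitives as follows. This is a minimal single-threaded illustration using boost::fibers::buffered_channel and detached fibers; it is not the exact code used for the measurements, which additionally sets up custom stack allocators, schedulers and timing.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>

#include <boost/fiber/all.hpp>

using channel_t = boost::fibers::buffered_channel< std::uint64_t >;

// Each call handles `size` ordinals starting at `num`. A range of one element
// simply reports its ordinal; larger ranges are split into `div` sub-ranges,
// each handled by a newly spawned (detached) fiber, and the partial sums are
// collected through a channel and passed upstream.
void skynet( channel_t & c, std::uint64_t num, std::uint64_t size, std::uint64_t div) {
    if ( 1 == size) {
        c.push( num);
    } else {
        channel_t rc{ 16 };                 // capacity must be a power of two
        for ( std::uint64_t i = 0; i < div; ++i) {
            std::uint64_t sub_num = num + i * ( size / div);
            boost::fibers::fiber( skynet, std::ref( rc), sub_num, size / div, div).detach();
        }
        std::uint64_t sum = 0;
        for ( std::uint64_t i = 0; i < div; ++i) {
            sum += rc.value_pop();
        }
        c.push( sum);
    }
}

int main() {
    channel_t rc{ 2 };
    boost::fibers::fiber( skynet, std::ref( rc), 0, 100000, 10).detach();
    std::cout << rc.value_pop() << std::endl;   // expected result: 4999950000
}
```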
Table 1.1. performance of N=100000 actors/goroutines (other languages)

Haskell (stack-1.0.4) | Erlang (erts-7.0) | Go (go1.6.1, GOMAXPROCS == default) | Go (go1.6.1, GOMAXPROCS == 8)
---|---|---|---
32 ms | 64 ms - 121 ms | 152 ms - 164 ms | 70 ms - 98 ms
The test utilizes 4 cores with SMT enabled (8 hardware threads). The fiber stacks are allocated by fixedsize_stack.
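For reference, a fiber can be launched with a fixedsize_stack allocator as in the following minimal sketch. The default stack size is used here; the stack size actually used in the benchmark is not stated above.

```cpp
#include <memory>

#include <boost/fiber/all.hpp>

int main() {
    // stack allocator handing out fixed-size stacks (default size)
    boost::fibers::fixedsize_stack salloc;

    // pass the allocator via std::allocator_arg when launching the fiber
    boost::fibers::fiber f( std::allocator_arg, salloc, []{ /* fiber body */ });
    f.join();
}
```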
As the benchmark shows, the memory allocation algorithm is significant for performance in a multithreaded environment. The tests use glibc's memory allocation algorithm (based on ptmalloc2) as well as Google's TCMalloc (via linkflags="-ltcmalloc") [9].
The shared_work scheduling algorithm uses one global queue of fibers that are ready to run, shared between all threads; the work is distributed equally over all threads. In the work_stealing scheduling algorithm each thread has its own local queue: fibers that are ready to run are pushed to and popped from the local queue. If a thread's queue runs out of ready fibers, it steals fibers from the local queues of other threads.
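The following is a minimal sketch of installing the work_stealing algorithm on a group of threads. The thread count (8, matching the benchmark machine) and the registration/shutdown synchronization are illustrative assumptions, not the benchmark code itself; the shared_work algorithm is installed analogously via boost::fibers::use_scheduling_algorithm< boost::fibers::algo::shared_work >().

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

#include <boost/fiber/all.hpp>

static std::uint32_t const thread_count = 8;    // illustrative: 4C/8T machine

std::atomic< std::uint32_t > registered{ 0 };
boost::fibers::mutex mtx;
boost::fibers::condition_variable cnd;
bool done = false;

// every participating thread installs the same work_stealing algorithm and
// waits until all threads have registered before blocking on fiber primitives
void register_scheduler() {
    boost::fibers::use_scheduling_algorithm< boost::fibers::algo::work_stealing >( thread_count);
    ++registered;
    while ( registered.load() < thread_count) {
        std::this_thread::yield();
    }
}

void worker() {
    register_scheduler();
    // park this thread inside the fiber scheduler until the work is finished
    std::unique_lock< boost::fibers::mutex > lk( mtx);
    cnd.wait( lk, []{ return done; });
}

int main() {
    std::vector< std::thread > threads;
    for ( std::uint32_t i = 1; i < thread_count; ++i) {     // main thread counts as one
        threads.emplace_back( worker);
    }
    register_scheduler();

    // fibers launched here sit in this thread's local queue until an idle
    // thread steals them
    boost::fibers::fiber( []{ std::cout << "run by any registered thread\n"; }).join();

    {
        std::unique_lock< boost::fibers::mutex > lk( mtx);
        done = true;
    }
    cnd.notify_all();
    for ( auto & t : threads) {
        t.join();
    }
}
```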
Table 1.2. performance of N=100000 fibers

fiber (1C/1T, round robin) | fiber (1C/1T, round robin, tcmalloc) | fiber (4C/8T, work sharing) | fiber (4C/8T, work sharing, tcmalloc) | fiber (4C/8T, work stealing) | fiber (4C/8T, work stealing, tcmalloc)
---|---|---|---|---|---
91 ms - 128 ms | 90 ms - 103 ms | 90 ms - 111 ms | 62 ms - 80 ms | 35 ms - 66 ms | 13 ms - 26 ms
[9] Tais B. Ferreira, Rivalino Matias, Autran Macedo, Lucio B. Araujo, "An Experimental Study on Memory Allocators in Multicore and Multithreaded Applications", PDCAT '11: Proceedings of the 2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies, pages 92-98.