    tb hash: track translated blocks with qht · 909eaac9
    Emilio G. Cota authored
    Having a fixed-size hash table for keeping track of all translation blocks
    is suboptimal: some workloads are just too big or too small to get maximum
    performance from the hash table. The MRU promotion policy helps improve
    performance when the hash table is a little undersized, but it cannot
    make up for severely undersized hash tables.
    
    Furthermore, frequent MRU promotions result in writes that are a scalability
    bottleneck. For scalability, lookups should only perform reads, not writes.
    This is not a big deal for now, but it will become one once MTTCG matures.
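
    To make the cost of MRU promotion concrete, here is a minimal sketch of a
    fixed-size chained hash table whose lookup moves a hit to the head of its
    bucket. All names (struct entry, struct table, table_lookup_mru, the
    lookup_writes counter) are hypothetical and are not QEMU's actual code;
    the point is only that a successful lookup deeper in the chain performs
    stores to shared memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a translation block; not QEMU's TranslationBlock. */
struct entry {
    uint32_t key;
    struct entry *next;
};

#define N_BUCKETS 4 /* deliberately tiny so that chains get long */

struct table {
    struct entry *buckets[N_BUCKETS];
    unsigned long lookup_writes; /* pointer stores performed by lookups */
};

/* Insert at the head of the bucket chain. */
static void table_insert(struct table *t, struct entry *e)
{
    struct entry **head = &t->buckets[e->key % N_BUCKETS];

    e->next = *head;
    *head = e;
}

/* Lookup with MRU promotion: a hit that is not already first in its
 * chain is unlinked and relinked at the head, which costs three stores. */
static struct entry *table_lookup_mru(struct table *t, uint32_t key)
{
    struct entry **head = &t->buckets[key % N_BUCKETS];
    struct entry **pp;

    for (pp = head; *pp != NULL; pp = &(*pp)->next) {
        struct entry *e = *pp;

        if (e->key == key) {
            if (pp != head) {
                *pp = e->next;   /* unlink from current position */
                e->next = *head; /* point at the old head */
                *head = e;       /* become the new head */
                t->lookup_writes += 3;
            }
            return e;
        }
    }
    return NULL;
}
```

    In this sketch every promoted hit dirties the bucket head and two chain
    links, so concurrent readers of the same bucket would contend on those
    cache lines; a lookup that only reads avoids that.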
    
    The appended patch fixes these issues by using qht as the implementation
    of the TB hash table. This solution is superior to the other alternatives
    considered, namely:
    
    - master: implementation in QEMU before this patchset
    - xxhash: before this patch, i.e. fixed buckets + xxhash hashing + MRU.
    - xxhash-rcu: fixed buckets + xxhash + RCU list + MRU.
                  MRU is implemented here by adding an intermediate struct
                  that contains the u32 hash and a pointer to the TB; this
                  allows us, on an MRU promotion, to copy said struct (that is not
                  at the head), and put this new copy at the head. After a grace
                  period, the original non-head struct can be eliminated, and
                  after another grace period, freed.
    - qht-fixed-nomru: fixed buckets + xxhash + qht without auto-resize +
                       no MRU for lookups; MRU for inserts.

    The appended solution is the following:
    - qht-dyn-nomru: dynamic number of buckets + xxhash + qht w/ auto-resize +
                     no MRU for lookups; MRU for inserts.
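
    The "dyn-nomru" combination can be sketched as follows. This is not qht's
    actual bucket layout or resize policy: the chained buckets, the names
    (struct dyn_table, dyn_insert, dyn_lookup, dyn_grow) and the
    grow-at-load-factor-1 trigger are assumptions chosen for brevity, and the
    sketch is single-threaded. It only illustrates the three properties named
    above: lookups never write, inserts place the new entry first, and the
    bucket array doubles as the table fills up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct node {
    uint32_t hash;
    struct node *next;
};

struct dyn_table {
    struct node **buckets;
    size_t n_buckets; /* always a power of two */
    size_t n_entries;
};

static int dyn_init(struct dyn_table *t, size_t n_buckets)
{
    t->buckets = calloc(n_buckets, sizeof(*t->buckets));
    t->n_buckets = n_buckets;
    t->n_entries = 0;
    return t->buckets ? 0 : -1;
}

/* Read-only lookup: no promotion, hence no stores to the table. */
static const struct node *dyn_lookup(const struct dyn_table *t, uint32_t hash)
{
    const struct node *n = t->buckets[hash & (t->n_buckets - 1)];

    while (n != NULL && n->hash != hash) {
        n = n->next;
    }
    return n;
}

/* Double the number of buckets and rehash every entry. */
static int dyn_grow(struct dyn_table *t)
{
    size_t new_n = t->n_buckets * 2;
    struct node **nb = calloc(new_n, sizeof(*nb));
    size_t i;

    if (nb == NULL) {
        return -1;
    }
    for (i = 0; i < t->n_buckets; i++) {
        struct node *n = t->buckets[i];

        while (n != NULL) {
            struct node *next = n->next;
            size_t idx = n->hash & (new_n - 1);

            n->next = nb[idx];
            nb[idx] = n;
            n = next;
        }
    }
    free(t->buckets);
    t->buckets = nb;
    t->n_buckets = new_n;
    return 0;
}

/* Insert at the head of the bucket ("MRU for inserts"), growing first
 * if the table already holds one entry per bucket (assumed policy). */
static int dyn_insert(struct dyn_table *t, struct node *n)
{
    size_t idx;

    if (t->n_entries >= t->n_buckets && dyn_grow(t) < 0) {
        return -1;
    }
    idx = n->hash & (t->n_buckets - 1);
    n->next = t->buckets[idx];
    t->buckets[idx] = n;
    t->n_entries++;
    return 0;
}
```

    Starting such a table with 2 buckets and inserting 8 entries leaves it
    with 8 buckets, so the initial size matters much less than it does for
    the fixed-size variants.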
    
    The plots below compare the considered solutions. The Y axis shows the
    boot time (in seconds) of a Debian Jessie image with arm-softmmu; the X axis
    sweeps the number of buckets (or the initial number of buckets for the
    qht-dyn variants, which auto-resize).
    The plots in PNG format (and with errorbars) can be seen here:
      http://imgur.com/a/Awgnq
    
    Each test runs 5 times, and the entire QEMU process is pinned to a
    single core for repeatability of results.
    
                                Host: Intel Xeon E5-2690
    
      28 ++------------+-------------+-------------+-------------+------------++
         A*****        +             +             +             master **A*** +
      27 ++    *                                                 xxhash ##B###++
         |      A******A******                               xxhash-rcu $$C$$$ |
      26 C$$                  A******A******            qht-fixed-nomru*%%D%%%++
         D%%$$                              A******A******A*qht-dyn-mru A*E****A
      25 ++ %%$$                                          qht-dyn-nomru &&F&&&++
         B#####%                                                               |
      24 ++    #C$$$$$                                                        ++
         |      B###  $                                                        |
         |          ## C$$$$$$                                                 |
      23 ++           #       C$$$$$$                                         ++
         |             B######       C$$$$$$                                %%%D
      22 ++                  %B######       C$$$$$$C$$$$$$C$$$$$$C$$$$$$C$$$$$$C
         |                    D%%%%%%B######      @E@@@@@@    %%%D%%%@@@E@@@@@@E
      21 E@@@@@@E@@@@@@F&&&@@@E@@@&&&D%%%%%%B######B######B######B######B######B
         +             E@@@   F&&&   +      E@     +      F&&&   +             +
      20 ++------------+-------------+-------------+-------------+------------++
         14            16            18            20            22            24
                                 log2 number of buckets
    
                                     Host: Intel i7-4790K
    
      14.5 ++------------+------------+-------------+------------+------------++
           A**           +            +             +            master **A*** +
        14 ++ **                                                 xxhash ##B###++
      13.5 ++   **                                           xxhash-rcu $$C$$$++
           |                                            qht-fixed-nomru %%D%%% |
        13 ++     A******                                   qht-dyn-mru @@E@@@++
           |             A*****A******A******             qht-dyn-nomru &&F&&& |
      12.5 C$$                               A******A******A*****A******    ***A
        12 ++ $$                                                        A***  ++
           D%%% $$                                                             |
      11.5 ++  %%                                                             ++
           B###  %C$$$$$$                                                      |
        11 ++  ## D%%%%% C$$$$$                                               ++
           |     #      %      C$$$$$$                                         |
      10.5 F&&&&&&B######D%%%%%       C$$$$$$C$$$$$$C$$$$$$C$$$$$C$$$$$$    $$$C
        10 E@@@@@@E@@@@@@B#####B######B######E@@@@@@E@@@%%%D%%%%%D%%%###B######B
           +             F&&          D%%%%%%B######B######B#####B###@@@D%%%   +
       9.5 ++------------+------------+-------------+------------+------------++
           14            16           18            20           22            24
                                  log2 number of buckets
    
    Note that the starting point before this patch series is X=15 for "master";
    its low sensitivity to an increased number of buckets is due to the
    poor hashing function in master.
    
    xxhash-rcu has significant overhead due to the constant churn of allocating
    and deallocating intermediate structs for implementing MRU. An alternative
    would be to consider failed lookups as "maybe not there", and then
    acquire the external lock (tb_lock in this case) to confirm that
    the lookup really failed. This, however, would not be enough
    to implement dynamic resizing, which is more complex: see
    "Resizable, Scalable, Concurrent Hash Tables via Relativistic
    Programming" by Triplett, McKenney and Walpole. That approach was
    discarded due to the very coarse RCU read critical sections that we have
    in MTTCG; resizing requires waiting for readers after every pointer update,
    and resizes require many pointer updates, so this would quickly become
    prohibitive.
    
    qht-fixed-nomru shows that MRU promotion is advisable for undersized
    hash tables.
    
    However, qht-dyn-mru shows that MRU promotion is not important if the
    hash table is properly sized: there is virtually no difference in
    performance between qht-dyn-nomru and qht-dyn-mru.
    
    Before this patch, we're at X=15 on "xxhash"; after this patch, we're at
    X=15 on "qht-dyn-nomru". This patch thus matches the best performance that
    we can achieve with optimum sizing of the hash table, while keeping the
    hash table scalable for readers.
    
    The improvement we get before and after this patch for booting Debian Jessie
    with arm-softmmu is:
    
    - Intel Xeon E5-2690: 10.5% less time
    - Intel i7-4790K: 5.2% less time
    
    We could get this same improvement _for this particular workload_ by
    statically increasing the size of the hash table. But this would hurt
    workloads that do not need a large hash table. The dynamic (upward)
    resizing allows us to start small and enlarge the hash table as needed.
    
    A quick note on downsizing: the table is resized back to 2**15 buckets
    on every tb_flush. This makes sense because it is not guaranteed that the
    table will reach the same number of TBs later on (e.g. most bootup code is
    thrown away after boot); the hash table simply grows again as more code
    blocks are translated. This also avoids the complication of
    having to build downsizing hysteresis logic into qht.
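
    The downsizing-on-flush policy can be sketched as below. The names
    (struct flush_table, table_reset_size, on_flush, DEFAULT_BUCKETS) are
    hypothetical and only mirror the behaviour described above: a flush
    discards every entry and restarts from 2**15 buckets, so no shrink
    heuristic is needed while the table is in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEFAULT_BUCKETS ((size_t)1 << 15) /* 2**15, as in the text above */

struct flush_table {
    void **buckets;
    size_t n_buckets;
    size_t n_entries;
};

/* Replace the bucket array with an empty one of n_buckets slots.
 * Returns 0 on success, -1 if allocation fails (table left untouched). */
static int table_reset_size(struct flush_table *t, size_t n_buckets)
{
    void **fresh = calloc(n_buckets, sizeof(*fresh));

    if (fresh == NULL) {
        return -1;
    }
    free(t->buckets);
    t->buckets = fresh;
    t->n_buckets = n_buckets;
    t->n_entries = 0;
    return 0;
}

/* Called when all translated code is thrown away: every entry is stale,
 * so drop them all and go back to the default size. */
static int on_flush(struct flush_table *t)
{
    return table_reset_size(t, DEFAULT_BUCKETS);
}
```

    A table that had grown to 2**20 buckets during boot would thus return to
    2**15 buckets after the flush and only grow again if the new workload
    needs it.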

    Reviewed-by: Sergey Fedorov <serge.fedorov@linaro.org>
    Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
    Reviewed-by: Richard Henderson <rth@twiddle.net>
    Signed-off-by: Emilio G. Cota <cota@braap.org>
    Message-Id: <1465412133-3029-15-git-send-email-cota@braap.org>
    Signed-off-by: Richard Henderson <rth@twiddle.net>