CS533 Concepts of Operating Systems - Computer Action Team
Performance of memory reclamation for lockless synchronization
By Thomas E. Hart, Paul E. McKenney, Angela Demke Brown, Jonathan Walpole
Handwaved about by Jim Cotillier
CS510 Concurrent Systems

The Problem
Why not just stick with classical locks?
  o Performance issues (blocking)
  o CAS-class instruction overhead
  o Susceptible to:
      Deadlock
      Priority Inversion
      Convoying
Lockless synchronization addresses this, but is exposed to Read/Reclaim Races
  o Reclamation of shared data elements without coordination with all contenders leads to an inconsistent global state
Such ex post facto references to deleted data yield unpredictable results

Uncoordinated reclamation

Some Approaches to Solutions
QSBR - Quiescent State Based Reclamation
EBR/NEBR - Epoch Based Reclamation
HPBR - Hazard Pointers Based Reclamation
LFRC - Lock-free Reference Counting
Functionality provided by a client/library interface
  o But no single, invariant set of interface semantics exists across all schemes

QSBR
Permits the reclamation of data only after a time interval, called a Grace Period, elapses
QSBR defines a Grace Period to be the temporal interval (a,b), such that any data element deleted before a can be reclaimed after b
A Quiescent State is a state of a thread, T, in which T holds no references to shared elements, active or deleted (zombie)
Any interval in which each thread passes through a Quiescent State is a QSBR Grace Period

Three-thread QSBR example

QSBR Fuzzy Barriers
Protect access to protected code which no thread should execute before all other threads reach a specified point
Do not absolutely block, a la hard barriers; only prevent execution of protected code until the barrier opens
Thus, can be used to synchronize reclamation

Using QSBR
Client explicitly declares a Quiescent State, and thereby enters a fuzzy barrier
Problem: thread failure
  o A dead thread cannot call quiescent_state() and thus can force QSBR to block

EBR (Fraser)
Uses Grace Periods, like QSBR
  o Encapsulates lockless operations within Critical Sections, which the client explicitly declares via the functions critical_enter() and critical_exit()
  o But does not rely upon explicit client Quiescent State declarations, as QSBR does
Counts the number of Critical Region invocations, and then attempts to enter a fuzzy barrier to reclaim memory

Linked list search using EBR

EBR Epochs
Epochs are modeled after Z_3, the group of equivalence classes modulo 3
Epochs are hierarchical: Global and Local
Each epoch has an associated zombie element list
Fuzzy barrier for reclamation is entered upon entry to each new epoch
A thread entering a Critical Region (CR) updates its Local Epoch (LE) to match the Global Epoch (GE)
After M (magic number) LE updates, a thread will attempt to increment the GE

EBR Epochs Contd.
A GE update attempt only succeeds if the LE of each thread in a CR matches the GE
Since threads update their LE only at the start of a CR, whenever LE = GE for a thread T, all lockless operations of other threads that were in progress the last time T was in epoch GE have completed
Thankfully, a grace period has expired!

EBR Epoch Cycle

NEBR: a Modest EBR Improvement
EBR must pay for the expensive fences at the beginning and end of a CS
Modeled a little after QSBR: have the application set/reset a "critical section(s) may be in here" flag
  o NEBR then does not automatically do this in each CS
  o Application independence dies in favor of performance
Reduces EBR's overhead modestly, bringing it closer to QSBR
  o NEBR is attractive as the programmer's responsibilities are limited to marking sections that might contain lockless operations

HPBR/SMR (Michael)
Each thread T has (magic) K Hazard Pointers used to protect elements from reclamation by other threads
  o Thus, for N threads, H = NK HPs exist in toto
  o K is small, often 2 (queues and lists); 1 (stacks)
T caches removed elements privately in a list P of size (magic) R
After R removals, T reclaims each element in P that does not have a corresponding HP
If T fails, a maximum of K+R removed elements can be leaked

HPBR Paradigm

HPBR Paradigm Contd.
Hazardous References: references to shared elements that may now be zombies or ABA situations
  o Algorithms using HPBR must identify a Hazardous Reference, set a Hazard Pointer, then check for element removal
  o If an element has not been removed, it continues to be referentially safe
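
A rough sketch of the HPBR bookkeeping described above, written as a toy single-process Python model rather than a real lock-free implementation (no atomics, no fences). The class name HazardDomain and the methods acquire, clear, retire and scan are hypothetical names chosen for illustration; only K, R and the private list P come from the slides.

```python
class HazardDomain:
    """Toy model of HPBR: N threads, K hazard pointers each, retire threshold R."""

    def __init__(self, n_threads, k, r):
        self.k = k
        self.r = r
        # hazard[t][i] is the element protected by thread t's i-th HP, or None.
        self.hazard = [[None] * k for _ in range(n_threads)]
        # retired[t] is thread t's private list P of removed elements.
        self.retired = [[] for _ in range(n_threads)]
        self.freed = []  # elements that have been reclaimed, in order

    def acquire(self, t, slot, read_shared):
        """Set a hazard pointer, then re-read to check the element was not removed."""
        while True:
            element = read_shared()
            self.hazard[t][slot] = element
            if read_shared() is element:  # still reachable: safe to use
                return element

    def clear(self, t, slot):
        self.hazard[t][slot] = None

    def retire(self, t, element):
        """Thread t removed `element`; cache it in P and scan after R removals."""
        self.retired[t].append(element)
        if len(self.retired[t]) >= self.r:
            self.scan(t)

    def scan(self, t):
        """Reclaim every element in P that has no corresponding hazard pointer."""
        # Elements must be hashable in this toy model.
        protected = {e for row in self.hazard for e in row if e is not None}
        keep = []
        for e in self.retired[t]:
            if e in protected:
                keep.append(e)
            else:
                self.freed.append(e)
        self.retired[t] = keep


# Thread 1 protects "b"; thread 0 then removes three elements (R = 3).
shared = ["b"]
dom = HazardDomain(n_threads=2, k=2, r=3)
dom.acquire(1, 0, lambda: shared[0])
for name in ("a", "b", "c"):
    dom.retire(0, name)
# "b" stays in thread 0's private list until thread 1 clears its hazard pointer.
```

In this model an element survives a scan for as long as any thread's hazard pointer names it, which is why the number of unreclaimed elements stays bounded.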
LFRC (Valois, Detlefs, et al.)
Threads track the instantaneous count of references to elements
  o When count = 0, element can be reclaimed
Many variations on this scheme may or may not allow element types to change upon reclamation
  o May require type invariance (Valois); type independence requires DCAS (Detlefs, et al.)
Zombies may consume unbounded memory
Performance may be worse than lock-based
  o CAS, FAA (Intel: LOCK XADD) very expensive

Summary of Schemes
QSBR - Detects grace periods using application-specified quiescent states
EBR - Detects grace periods using application-independent epochs
HPBR - Uses per-thread Hazard Pointers to synchronize reclamation
LFRC - Uses per-element reference counts to synchronize reclamation

Performance Factors
Depends on a lot of stuff
  o Memory consistency and constraints
  o Workload, contention and thread scheduling
Sequentially consistent memory model is still generally assumed by the lock-free literature
  o But the hardware trends are toward weaker models
Coder needs to rely on fences (memory barriers), which artificially add overhead
HPBR, EBR and LFRC require per-operation fences, but QSBR does not; this is shown to be a distinct advantage

Performance Factors Contd.
Thread preemption
  o Descheduled threads are blocked threads, as far as reclamation schemes are concerned
  o Can start when number of threads > number of CPUs
Anything that prevents a Grace Period from closing is bad
Threads may sometimes need to borrow memory from a locked, global pool
  o A thread may be preempted whilst holding such a lock, setting up a thread convoy on memory
  o HPBR bounds memory stress and has an advantage here

The Benchmark
The Benchmark Contd.
Master thread flow logic
  o Create N children
  o Start a timer
  o When timer expires, stop children
Average execution time per measured operation = test duration / number of operations
Net CPU time = execution time * number of threads
  o If thread count > CPU count, report execution time; otherwise report CPU time.
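
The two formulas above, worked as a small Python sketch. The function names and the sample numbers are made up for illustration, and the reporting rule is coded exactly as this slide states it, not checked against the paper.

```python
def avg_time_per_op(test_duration_s, total_ops):
    """Average execution time per measured operation."""
    return test_duration_s / total_ops


def net_cpu_time(execution_time_s, n_threads):
    """Net CPU time = execution time * number of threads."""
    return execution_time_s * n_threads


def reported_metric(test_duration_s, total_ops, n_threads, n_cpus):
    """Pick which figure to report, following the rule on the slide."""
    exec_time = avg_time_per_op(test_duration_s, total_ops)
    if n_threads > n_cpus:
        return ("execution time", exec_time)
    return ("CPU time", net_cpu_time(exec_time, n_threads))


# Hypothetical run: a 2-second test, 4 million operations, 4 threads on 8 CPUs.
label, value = reported_metric(2.0, 4_000_000, n_threads=4, n_cpus=8)
```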
Driver parameters were selected not to be biased toward any particular reclamation scheme

The Benchmark Contd.
CAS implemented on POWER via larx/stcx (LL/SC)
Fences implemented via eieio (Enforce In-order Execution of I/O)
Spin locks implemented via CAS and fences
Statically allocated HPBR Hazard Pointers
  o Some algorithms may require unbounded HP counts
Choice of placement of QSBR QS declarations may not be obvious in some algorithms
Performance Measurement Guidelines
Measure the base costs first
  o Single-threaded execution, small data structures
  o No contention, preemption, or traversal of long lists
  o Non-blocking queues, single-element linked lists
Then move toward complexity
  o Pedagogical approach: try to change only one factor at a time
  o Consider the R/O, the W/O and the R/W cases in each of the examined reclamation schemes

Base Performance Costs
Scalability with Fractional Workload
Scalability with Traversal Length
Scalability of LFRC
No Preemption; R/O Workload
No Preemption; W/O Workload
Preemption; R/O Workload
Preemption; W/O Workload
Memory Stress
Busy Wait
Hash Tables; Update Fraction Workload
No Preemption; R/O Workload with NEBR

Case Study: RCU API in Linux
RCU concepts: Read/Copy/Update
  o Lockless concurrent reads with deferred destruction of zombie elements
  o Writers may not prevent readers from accessing shared data
  o Writers must coordinate with each other in some way
  o RCU does not specify what way
  o RCU neither blocks nor fails for readers
Preemptable kernels necessitate the use of rcu_read_lock() and rcu_read_unlock() to toggle kernel preemption so that context switches do not occur at intolerable times
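
The deferred destruction described above can be sketched as a toy QSBR-style model in Python. This is a single-process illustration, not the Linux implementation: QsbrDomain and retire are hypothetical names, and quiescent_state borrows the call named earlier in these slides. In a classic non-preemptible kernel a context switch plays the role of the quiescent state, which is why readers must not be switched out mid-read.

```python
class QsbrDomain:
    """Toy model: a removed element is freed once every thread has passed
    through a quiescent state after the removal (one grace period)."""

    def __init__(self, n_threads):
        self.n_threads = n_threads
        # Each entry is (element, set of threads that have not yet been quiescent).
        self.pending = []
        self.freed = []

    def retire(self, element):
        """A writer unlinked `element`; readers may still hold references."""
        self.pending.append((element, set(range(self.n_threads))))

    def quiescent_state(self, t):
        """Thread t declares it holds no references to shared elements."""
        still_pending = []
        for element, waiting in self.pending:
            waiting.discard(t)
            if waiting:
                still_pending.append((element, waiting))
            else:
                self.freed.append(element)  # grace period over: safe to reclaim
        self.pending = still_pending


dom = QsbrDomain(n_threads=3)
dom.retire("old_version")
dom.quiescent_state(0)
dom.quiescent_state(1)
# Thread 2 has not been quiescent yet, so "old_version" is still pending.
dom.quiescent_state(2)
# Now every thread has passed a quiescent state and the element is freed.
```

If one thread never calls quiescent_state(), its entry never leaves the waiting set and nothing retired afterwards is freed, which is the thread-failure problem noted for QSBR.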
Case Study: RCU Contd.
QSBR is a natural choice for memory reclamation
  o EBR could be used as well, but would not offer any advantages over QSBR
RCU is best targeted to read-mostly data structures
  o Rare updates imply rare reclamation

Case Study: RCU Contd.
SysV IPC subsystem implemented in Linux via CR-QSBR
  o Implements semaphores, message queues and shared memory
  o Apps use an integer Accessor ID to access in-kernel data structures (essentially a resource handle)
  o The dynamic, mostly-read (AID/resource) array, formerly spinlocked in stock Linux, was protected here through CR-QSBR instead, and benchmarked

Case Study: RCU Contd.
Semopbench, 8-CPU, 700 MHz Intel P-III

Case Study: RCU Contd.
DBT1 Database Benchmark Raw Results

Case Study: RCU Contd.
DBT1 database benchmark results (TPS)

Conclusions
Reclamation has a huge effect on lockless algorithm performance
  o So one must tune to the design of the application
Both QSBR and EBR can suffer in the face of memory exhaustion
HPBR and EBR have higher base costs than QSBR due to fences
The NEBR enhancements modestly improve EBR
LFRC has the highest overhead due to the per-element atomic instruction requirement

Conclusions Contd.
HPBR scales poorly as the traversal length increases
QSBR is, overall, the best performing reclamation scheme
  o and best suited to an OS kernel environment
  o Lockless approaches using QSBR can outperform locking approaches by a large margin

Rantings -- STAE
STAE: Specified Thread Abnormal Exit
  o User provides Exit code to be run on condition of thread error trap
  o Exit is driven by the etrap interrupt logic; Exit is called immediately after etrap is detected, e.g., SEGV
  o Exit has full access to environment of failing thread; may modify any data, etc.
  o Exit may:
      Allow failing thread to die (the status quo)
      Resuscitate failing thread by telling the dispatcher to restart the thread at an Exit-specified point in its code
      Call a completely new program to run in place of failing thread (with all of the failing thread's credentials and context)

Rantings -- PLO
PLO: Perform Locked Operation (IBM z Platform)
  o Meta instruction that atomically encapsulates all of: CAL, CAS, DCAS, CASAS, CASADS, CASATS into single-instruction global atomicity
  o 32, 64, or 128-bit operands are supported
  o Acquires a global hardware interlock unique to PLO
  o Is very powerful and flexible, but is so complex that it may require a pre-built parameter list just to program it!
  o Usually needs to be coded with a zillion operands
  o Its proprietary algorithm has to be huge, but whether its utility outstrips its cost enough to yield a net gain in performance has not yet been answered (afaik)
Questions/Musings
Suppose DCAS was improved so that it uses an order of magnitude fewer clocks than today.
  o To what extent could macroscopically faster hardware atomicity affect the utility of these lockless schemes?
Could the STAE formalism provably solve the failed thread blocking problem in QSBR?
  o If you believe the answer is yes, based on the empirical data in this paper, would the paradigm (QSBR+STAE) satisfy Ockham's Razor and thus become the overall best solution to the lockless reclamation problem?