Multi-processor synchronization notes

This is a brief refresher on code synchronization concepts, which apply to Unix/Linux/Windows/OSX operating systems.

  • Single processor (uni-processor or UP) concurrency comes from interrupts, or from cooperative multi-tasking (the code yields the CPU). Variables shared with interrupt handlers must be interrupt-safe.
  • Multi-processor systems can use cooperative multi-tasking, where a thread yields the CPU when it decides to, and domain locking (only one CPU is allowed to control a given feature).
  • With Symmetric Multi-Processing (SMP), the same code can run on multiple CPUs at once, so shared data accesses must be protected/synchronized to keep the data coherent. Most PCs and servers are now SMP (multiple sockets, multiple cores, multiple hyper-threads).
  • All general purpose OSes for SMP systems use preemptive multi-tasking: the scheduler can start/stop a thread at any time on any CPU, so every possible interleaving of accesses will eventually happen. Shared data must be SMP-safe.

A critical region (or critical section) is a code block accessing shared data. Access to the data in that region must be atomic. If two threads can execute the same critical region at the same time, that is a race condition and almost certainly a bug. It is avoided with synchronization primitives (i.e. code). A good practice is to write the smallest critical section possible: only access the data needed, and do all the other processing outside of the critical section.
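
For example, a minimal userspace sketch using POSIX threads (the names shared_total and compute_value are made up for illustration):

#include <pthread.h>

static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_total;            // shared data protected by total_lock

static long compute_value(int raw)   // placeholder for expensive processing
{
    return (long)raw * raw;
}

void add_sample(int raw)
{
    long v = compute_value(raw);     // all heavy work done outside the lock

    pthread_mutex_lock(&total_lock);
    shared_total += v;               // smallest possible critical section
    pthread_mutex_unlock(&total_lock);
}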

Atomic instructions are in the instruction-set for all SMP CPUs, and run a complex operation as one indivisible action as seen by any other CPU. They apply to native integer types and bit manipulations. Typical ones are fetch-and-add, compare-and-swap, test-and-set, load-link/store-conditional. Load and store instructions for basic types like a word are usually atomic by nature. Note that a 32-bit CPU will likely not have an atomic 64-bit load or store!
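
As an illustration of compare-and-swap, here is a sketch using the C11 <stdatomic.h> interface (not a kernel API): the increment retries until no other CPU changed the value between the load and the swap.

#include <stdatomic.h>

atomic_int counter = 0;

void add_one(void)
{
    int old = atomic_load(&counter);
    // on failure, old is reloaded with the current value and we try again
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
        ;
}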

Memory barriers: Atomicity only ensures that a memory access is consistent and was not corrupted by another CPU. It does not enforce ordering, i.e. which access happens first. Barriers enforce that an access completes before, or starts after, a certain point.

Linux kernel: uses the atomic_t integer type, see <asm/atomic.h> for ints or <asm/bitops.h> for bits, and defines inline ASM macros for common atomic instructions.
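
A sketch of the 2.6-era interface (the refcount and flags_word variables are just for illustration):

#include <linux/kernel.h>  // printk
#include <asm/atomic.h>
#include <asm/bitops.h>

atomic_t refcount = ATOMIC_INIT(1);     // integer initialized to 1
unsigned long flags_word = 0;

atomic_inc(&refcount);                  // atomically increment
if (atomic_dec_and_test(&refcount))     // atomically decrement, true if result is 0
    printk(KERN_INFO "last reference dropped\n");

set_bit(0, &flags_word);                // atomically set bit 0
if (test_and_clear_bit(0, &flags_word)) // atomically clear bit 0, true if it was set
    printk(KERN_INFO "bit 0 was set, now cleared\n");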

One can use atomic instructions to create lock-free code (pay attention to cache line thrashing). Programmers usually declare shared variables in lock-free code as volatile, which tells the compiler that every load or store of that variable in the source code must be performed exactly as written in the compiled code, and must not be optimized away. Otherwise, the compiler will keep the variable in a register whenever it can to make the code faster. In lock-free code, the “critical region” is always an access to a native data type with an atomic instruction.
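
A minimal sketch of the volatile effect (the stop flag and do_work() are hypothetical): without volatile, the compiler may read stop once, keep it in a register, and spin forever.

volatile int stop = 0;   // shared flag, set by another thread

void do_work(void);      // hypothetical unit of work

void worker(void)
{
    while (!stop)        // volatile forces a fresh load of stop on every iteration
        do_work();
}

void shutdown(void)
{
    stop = 1;            // a single aligned int store is atomic on typical CPUs
}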

Locks are higher-level constructs implemented using atomic instructions (so their use has higher overhead).

Locks are advisory and voluntary. The programmer must know a variable is shared and use the corresponding lock to protect the critical section. Note: a lock protects shared data, not a critical region of code. If two critical regions access the same data, they must both grab the corresponding lock. Questions to ask when deciding whether data needs locking:

  • is it a global variable?
  • is it shared with an interrupt handler?
  • is it shared with another process (like a child process)?
  • can the process sleep (block on a syscall), and if so, in what state is the data left?
  • can it be freed by another thread?
  • can the same code be called by other threads?

A deadlock is when the code cannot make progress because it waits for a resource that is never released. For example, a thread tries to grab a lock it already holds. Or two threads each own one of two resources and each needs the other resource before releasing its own (a deadly embrace, or ABBA deadlock). These are hard to debug.

Lock order: To avoid ABBA deadlocks, always acquire the locks in the same order. HP-UX actually enforces ordering by giving each lock a priority and flagging an order reversal at compile time. Lock release order does not matter, though it is common practice to release in reverse order.
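
A sketch with two POSIX mutexes (account_a and account_b are made-up names); every code path that needs both locks takes lock_a before lock_b:

#include <pthread.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;
static int account_a, account_b;

void transfer_a_to_b(int amount)
{
    pthread_mutex_lock(&lock_a);    // always lock_a first ...
    pthread_mutex_lock(&lock_b);    // ... then lock_b, in every thread
    account_a -= amount;
    account_b += amount;
    pthread_mutex_unlock(&lock_b);  // released in reverse order by convention
    pthread_mutex_unlock(&lock_a);
}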

Scalability: Lock contention, when two threads try to acquire the same lock at the same time, causes poor performance because the work of multiple threads is serialized. Contention grows with the frequency of access AND the size of the critical section. Scalability means performance should keep up as the number of threads, the amount of memory, and/or the number of CPUs increases.

Lock granularity: Finer grain means that a lock controls access to a smaller region of memory: for example a field in a structure, instead of an instance of the structure, or even a whole list. Finer grained locking is only needed when coarser locking measurably hurts performance. Splitting a lock this way is called breaking a lock, and it may incur a small penalty on systems with few CPUs.
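
A sketch of breaking a lock on a hash table (hypothetical types): one lock for the whole table versus one lock per bucket, so lookups in different buckets no longer contend.

#include <pthread.h>

struct entry;                     // hypothetical hash chain element

struct table_coarse {             // coarse grain: one lock serializes everything
    pthread_mutex_t lock;         // protects all 256 buckets
    struct entry *buckets[256];
};

struct bucket {
    pthread_mutex_t lock;         // finer grain: protects only this bucket's chain
    struct entry *chain;
};

struct table_fine {
    struct bucket buckets[256];
};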

Recursive locks: allow a thread to grab the lock multiple times. (avoid)

Spinlocks: If a thread tries to grab a lock already held/contended, it will spin in a busy loop trying to acquire it. In UP, spinlocks are compiled out of the code completely.

Locks can be used in interrupt handlers (which cannot sleep), but then interrupts must be disabled on the local CPU before taking the lock, otherwise an interrupt could try to acquire a lock already held by the code it interrupted (see spin_lock_irqsave()). In UP, the interrupt disable/enable is kept even though the lock itself is compiled out. Note that holding a spinlock disables kernel preemption, so the critical section cannot be preempted. Interrupt handlers with bottom halves have similar requirements (see spin_lock_bh()), and the behaviors and solutions vary between tasklets and softirqs.

#include <linux/spinlock.h> // see CONFIG_DEBUG_SPINLOCK for debugging
spinlock_t l = SPIN_LOCK_UNLOCKED;
// unsigned long flags;
// spin_trylock(&l); spin_is_locked(&l);
spin_lock(&l);  // or spin_lock_irqsave(&l, flags);
// critical section
spin_unlock(&l); // or spin_unlock_irqrestore(&l, flags);

Read-write locks: a spinlock that allows multiple readers to access the data at the same time, but gives exclusive access to writers. They come in the same flavors as spinlocks. You cannot upgrade a read lock to a write lock. However you can recursively obtain a read lock (for interrupts this means readers can use read_lock() instead of read_lock_irqsave(); writers still need the irqsave variant to avoid a deadlock).

#include <linux/spinlock.h>
rwlock_t l = RW_LOCK_UNLOCKED;
read_lock(&l); 
// critical section
read_unlock(&l);
write_lock(&l); 
// critical section
write_unlock(&l);

Instead of spinning, a waiting thread can go to sleep until the lock becomes available.

Semaphores: Waiting threads are put to sleep, with the overhead of two context switches. A process holding a semaphore can sleep and can be preempted by the scheduler; other waiters simply sleep too, so there is no deadlock. They are well suited for locks held a long time. They cannot be used in interrupt context or in code that is not subject to the OS scheduler. They are often used for locks held while synchronizing with user space.

Semaphores can’t be acquired while holding a spinlock, as the process may sleep.

Counting semaphores can have a use count greater than one to allow multiple processes to hold the lock at the same time, and are used to enforce limits.
A binary semaphore or mutex has a count of one, like spinlocks.

The classic semaphore operations are P(), to probe/decrement, and V(), to increment; in Linux they are down(), which decrements the count to acquire the semaphore, and up(), which increments the count to release it. If the count would become negative in down(), the semaphore is not acquired and the process is placed on a wait queue and sleeps.

#include <asm/semaphore.h>
static DECLARE_SEMAPHORE_GENERIC(s, count);
// static DECLARE_MUTEX(s); // binary semaphore shortcut
// semaphore s;
// sema_init(&s, count);
// init_MUTEX(&s);
down_interruptible(&s); // acquire sema or sleep in TASK_INTERRUPTIBLE state (returns -EINTR if interrupted)
// down_trylock(&s);
// critical section
up(&s); // release sema

Read-write semaphores: Same as a semaphore, but allowing multiple readers and a single writer. Note these come with a downgrade_write() method to convert a held write lock into a read lock.

#include <linux/rwsem.h>
static DECLARE_RWSEM(s);
down_read(&s);
// critical section
up_read(&s);
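
The write side, and the downgrade mentioned above (a sketch, using the same rwsem s as above):

down_write(&s);
// critical section (exclusive writer access)
up_write(&s);

down_write(&s);
// critical section (exclusive writer access)
downgrade_write(&s); // convert the held write lock into a read lock
// critical section (now shared with other readers)
up_read(&s);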

Completion variables: Somewhat similar to a semaphore, threads waiting for another to finish can sleep on one. Used by vfork().

#include <linux/completion.h>
DECLARE_COMPLETION(v);
// init_completion(&v);
wait_for_completion(&v);

// other thread runs:
complete(&v);

Big Kernel Lock (BKL): In Linux it is a global spinlock, used to ease the transition from coarse SMP to fine grained locking. A process can sleep while holding the BKL: the lock is dropped and reacquired automatically for the process when it is scheduled out and back in. It is recursive. It disables kernel preemption. It works only in process context. It is controlled with <linux/smp_lock.h> lock_kernel()/unlock_kernel().
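
Usage sketch:

#include <linux/smp_lock.h>

lock_kernel();
// critical section protected by the BKL (process context only, may sleep)
unlock_kernel();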

Seq locks: a simple shared data access mechanism that favors writers, which never stall. On a write, the data is locked and a sequence number is incremented. On a read, the sequence number is checked before and after the read; if the value is unchanged, the data was not modified during the read. If the value is even, no write is pending; taking the write lock makes the sequence odd.

#include <linux/seqlock.h>
seqlock_t l = SEQLOCK_UNLOCKED;
write_seqlock(&l);
// critical section
write_sequnlock(&l);

do {
  n = read_seqbegin(&l);
  // critical section
} while (read_seqretry(&l, n));

Disabling preemption: spinlocks disable kernel preemption. Per-CPU data does not need locking, but may still require disabling preemption around a critical region, to prevent another thread from being scheduled on the same CPU while the data should be updated atomically.

preempt_disable();
// CPU specific critical section
preempt_enable();

int cpu;
cpu = get_cpu(); // returns the current CPU number and disables preemption
// CPU specific critical section
put_cpu(); // reenables preemption

Barriers: Both compiler optimizations and out-of-order CPU execution can reorder memory accesses that have no dependencies. Barrier instructions enforce ordering.

Read barrier (rmb()): no load can be reordered across that point; loads before the barrier all complete before it, and loads after it do not start before it. The write barrier (wmb()) does the same for stores, and the full barrier (mb()) covers both loads and stores.
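
A sketch of the usual pairing (data and ready are illustrative shared variables): the writer's wmb() guarantees the store to data is visible before the store to ready, and the reader's rmb() guarantees data is not loaded before ready.

int data, ready;   // illustrative shared variables, ready initially 0

// CPU 1 (writer)
data = 42;
wmb();          // order: store to data before store to ready
ready = 1;

// CPU 2 (reader)
while (!ready)
    ;           // wait for the flag
rmb();          // order: load of ready before load of data
// here data is guaranteed to be 42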

read_barrier_depends() is a read barrier that applies only to loads for which subsequent loads have a dependency. On non out-of-order CPUs this is a noop.

Other topics:

Alpha and Beta semaphores (HP-UX)
Lock queues (ordering the threads spinning on a lock to avoid the thundering herd effect)
POSIX threads and their locking primitives
Condition variables
