
5. Locking

The kernel is a highly multithreaded environment. Unless a data structure is private to a single thread of execution, you have to do appropriate locking for it.

No kernel code should be written on the assumption that it will never run on SMP: kernel hackers who got free, big, 64-way SMP machines will hate you forever. And some locking is required on uniprocessor machines as well.

5.1 Being naive

On SMP, many system calls grab a magic recursive spinlock called the ``kernel lock''. This lock is automatically dropped when a process sleeps; it makes sure that user-context code always runs single-threaded. In 2.0 the entire kernel was protected by this lock, and it has been shrinking ever since, so relying on it to protect your code may not be a long-term solution.

On a single CPU, multitasking within the kernel is cooperative (kernel code is not preempted). If you're only using per-CPU resources (e.g. arrays indexed with cpu_number_map[smp_processor_id()]) you thus need no locking.

5.2 Stupid locking with save_flags()/cli()/sti()/restore_flags()

The simplest way to protect shared regions against interrupts is to turn off all interrupts. The standard idiom for that is:

#include <linux/interrupt.h>

        unsigned long flags;

        save_flags(flags);        /* remember whether interrupts were enabled */
        cli();                    /* disable interrupts */
        ... critical section ...
        restore_flags(flags);     /* put the interrupt state back as it was */

`cli()' disables interrupts on all processors for the duration of the critical section, and waits for any running bottom halves to finish first. Because this effectively stalls every CPU, it is really expensive on SMP machines. It is therefore strongly deprecated: use spinlocks to protect against interrupts.

It is very important that interrupts be turned off only for a really short time. Otherwise you'll kill interrupt latency for the whole kernel and make user-level soft real-time scheduling behave badly.

5.3 More intelligent locking

There are two main types of locks in the kernel: semaphores and spinlocks.

Semaphores <include/asm/semaphore.h> <SLEEPS>

A semaphore is basically a counter indicating the number of resources available; under Linux they can only be used in user context. If this number is 1 (the most common case), the semaphore is often called a `mutex'. You can grab the mutex and release it.

Declaring

You can either declare a `struct semaphore' and call `sema_init()' on it in your initialization code, or, if you only want a mutex, simply declare it with `static DECLARE_MUTEX(name)' (or `static DECLARE_MUTEX_LOCKED(name)' if you want it to start out locked).
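For instance, a minimal sketch (the names `my_sem', `my_mutex', `my_gate' and `my_setup()' are made up for illustration):

#include <asm/semaphore.h>

        static struct semaphore my_sem;          /* initialized at run time */
        static DECLARE_MUTEX(my_mutex);          /* starts out unlocked */
        static DECLARE_MUTEX_LOCKED(my_gate);    /* starts out locked */

        void my_setup(void)
        {
                sema_init(&my_sem, 1);           /* a count of 1 makes it a mutex */
        }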

Locking

Call `down_interruptible()': this returns 0 if you got the lock, or -EINTR if the process was interrupted with a signal.

If you really can't allow the down to fail (usually a sign of bad design), you can simply use `down()', which won't return until it has decremented the semaphore.

Either of these calls may sleep if the semaphore is not currently available, making them unusable in interrupt context.

Unlocking

Call `up()', which never fails. Since it never sleeps, it is also safe to call from interrupt context.
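Putting the two together, a typical user-context caller might look like this (a sketch, reusing the `my_mutex' declared above):

        if (down_interruptible(&my_mutex))
                return -EINTR;                   /* interrupted by a signal */

        ... critical section ...

        up(&my_mutex);                           /* release; may wake a sleeper */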

Wait Queues <include/linux/wait.h> <SLEEPS>

A wait queue is not really a form of locking, but it is used to guarantee that there is no gap between checking for a condition and going to sleep when it isn't true. You declare one wait_queue_head_t; processes which want to wait for the condition then declare a wait_queue_t referring to themselves and place it on the queue.

Declaring

You declare a `wait_queue_head_t' using the DECLARE_WAIT_QUEUE_HEAD() macro, or using the `init_waitqueue_head()' routine in your initialization code.
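For example (the names `my_event_queue' and `my_flag' are invented; `my_flag' is the condition used in the examples below):

#include <linux/wait.h>

        static DECLARE_WAIT_QUEUE_HEAD(my_event_queue);  /* compile-time setup */
        static int my_flag;                               /* the condition to wait for */

        /* or, for a queue set up at run time: */
        wait_queue_head_t wq;
        init_waitqueue_head(&wq);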

Queuing

Placing yourself in the waitqueue is fairly complex, because you must put yourself in the queue before checking the condition. There is a macro to do this: `wait_event_interruptible()' <include/linux/sched.h>. The first argument is the wait queue head, and the second is an expression which is evaluated; the macro returns 0 when this expression is true, or -ERESTARTSYS if a signal is received. The `wait_event()' version ignores signals.
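A sketch of the sleeping side, waiting on the invented `my_flag' condition:

        if (wait_event_interruptible(my_event_queue, my_flag != 0))
                return -ERESTARTSYS;             /* woken by a signal instead */

        /* my_flag was non-zero when we got here */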

Waking Up Queued Tasks

Call `wake_up()' <include/linux/sched.h>, which will wake up every process in the queue. The exception is if one has TASK_EXCLUSIVE set, in which case the remainder of the queue will not be woken.
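The waking side, continuing the sketch, makes the condition true before waking the sleepers:

        my_flag = 1;                             /* make the condition true... */
        wake_up(&my_event_queue);                /* ...then wake everyone waiting */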

Spinlocks <include/asm/spinlock.h>

A spinlock is a mutex which at first glance looks incredibly naive: if it can't get the lock the first time around, it keeps trying in a busy loop (this is the `spin' part). However, this means it can be used in interrupt context (where sleeping is not an option), and when lock contention is low, spinlocks are the cheapest locks in the Linux kernel.

Warning: speed has its price. They are not fair at all and can really break the machine when misused.

Declaring

You can either declare a `spinlock_t', and assign `SPIN_LOCK_UNLOCKED' to it, or use `spin_lock_init()' in your initialization code.
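An illustrative declaration (`my_lock' is a made-up name, reused below):

#include <asm/spinlock.h>

        static spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

        /* or, in your initialization code: */
        spin_lock_init(&my_lock);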

Locking

The normal method is to call `spin_lock_irqsave()'. This grabs the spinlock, and blocks all interrupts on the local CPU (saving the previous state in the flags argument: see `save_flags()' above). If you ever grab the spinlock both in and outside interrupt context, this is vital; if you have the spinlock and interrupts are still enabled, you could be interrupted by a bottom half, which then spins forever on the spinlock...

A better method is available if you are only using the spinlock between a bottom half and user context (the usual case): `spin_lock_bh()'. This stops bottom halves from being run on the current CPU (if you're already in a bottom half, it does nothing), and then grabs the spinlock. This means that interrupts can still be processed, and on most [FIXME: is this true?] architectures it is far cheaper than blocking interrupts.

Unlocking

The `spin_unlock_irqrestore()' and `spin_unlock_bh()' functions are used to unlock spinlocks locked with `spin_lock_irqsave()' and `spin_lock_bh()' respectively.
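Putting it together, a sketch using the `my_lock' declared above; the first form is needed if the lock is also taken in interrupt context, the second when it is only shared between user context and a bottom half:

        unsigned long flags;

        spin_lock_irqsave(&my_lock, flags);      /* lock + disable local interrupts */
        ... critical section ...
        spin_unlock_irqrestore(&my_lock, flags);

        spin_lock_bh(&my_lock);                  /* lock + block bottom halves */
        ... critical section ...
        spin_unlock_bh(&my_lock);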

On a kernel not compiled for SMP, spinlocks turn into nothing, except that *_irqsave/irqrestore are equivalent to save_flags/cli/restore_flags. Remember again that turning off interrupts for a long time is very bad for the system; make critical sections as short as possible.

Read-Write Locks <include/asm/spinlock.h>

A read-write lock is a spinlock, except that you lock it in one of two modes: a `read lock' or a `write lock'. Multiple holders can share a read lock, but a write lock is exclusive: while someone holds the write lock, no one else can hold any lock. If your code is clearly divided into readers (who don't need exclusive access to the data) and writers (who do), you can use read-write locks. Be aware that they are slightly more expensive than normal spinlocks on some [FIXME: most?] architectures.

Declaring

You declare a `rwlock_t', and assign `RW_LOCK_UNLOCKED' to it.
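For instance (`my_rwlock' is an invented name, reused below):

        static rwlock_t my_rwlock = RW_LOCK_UNLOCKED;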

Locking

`read_lock_irqsave()'/`read_lock_bh()' and `write_lock_irqsave()'/`write_lock_bh()' are used similarly to the spinlock routines described above. Note that if you have no writers in interrupt context (only readers), you can use the non-interrupt-blocking `read_lock()' calls; you'll still need to use `write_lock_irqsave()' though. [FIXME: Same logic applies to mixing read_lock() with write_lock_bh()?]
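A sketch of the arrangement just described, where readers may run in interrupt context but the only writer runs in user context (the shared data itself is left as an ellipsis):

        unsigned long flags;

        /* reader (safe even in interrupt context): */
        read_lock(&my_rwlock);
        ... read the shared data ...
        read_unlock(&my_rwlock);

        /* writer (user context only): */
        write_lock_irqsave(&my_rwlock, flags);
        ... modify the shared data ...
        write_unlock_irqrestore(&my_rwlock, flags);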

Unlocking

The `read_unlock_irqrestore()'/`read_unlock_bh()' and `write_unlock_irqrestore()'/`write_unlock_bh()' functions are used to unlock read-write locks taken with `read_lock_irqsave()'/`read_lock_bh()' and `write_lock_irqsave()'/`write_lock_bh()' respectively.

5.4 Atomic Operations

The fastest lock is no lock at all: certain operations are guaranteed to be atomic. The first class of operations works on `atomic_t' <include/asm/atomic.h>; this contains a signed integer (at least 32 bits long), and you must use these functions to manipulate or read atomic_t variables. atomic_read() and atomic_set() get and set the counter; atomic_add(), atomic_sub(), atomic_inc() and atomic_dec() adjust it; and atomic_dec_and_test() decrements it, returning true (i.e. != 0) if the result is zero.

Note that these functions are slower than normal arithmetic, and so should not be used unnecessarily.
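A small sketch (the counter name is invented):

#include <asm/atomic.h>

        static atomic_t my_count = ATOMIC_INIT(0);

        atomic_inc(&my_count);                   /* my_count is now 1 */
        atomic_add(5, &my_count);                /* my_count is now 6 */
        atomic_sub(4, &my_count);                /* my_count is now 2 */
        atomic_dec(&my_count);                   /* my_count is now 1 */
        if (atomic_dec_and_test(&my_count))      /* decremented to 0, so true */
                printk("counter hit zero\n");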

The second class of atomic operations is atomic bit operations, defined in <include/asm/bitops.h>. These operations generally take a pointer to the bit pattern, and a bit number: 0 is the least significant bit. set_bit(), clear_bit() and change_bit() set, clear, and flip the given bit. test_and_set_bit(), test_and_clear_bit() and test_and_change_bit() do the same thing, except return true if the bit was previously set; these are particularly useful for simple locking.

It is possible to call these operations with bit indices greater than 31. The resulting behaviour is strange on big-endian platforms though so it is a good idea not to do this.
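A sketch of using a bit as a simple lock (the flag word and bit assignments are invented):

#include <asm/bitops.h>

        static unsigned long my_flags;           /* bit 0 = initialized, bit 1 = busy */

        set_bit(0, &my_flags);                   /* atomically set bit 0 */

        if (test_and_set_bit(1, &my_flags)) {
                /* bit 1 was already set: someone else holds the "lock" */
        } else {
                ... do the work ...
                clear_bit(1, &my_flags);         /* release the "lock" */
        }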

[FIXME: Relative speeds of atomic_dec_and_test vs. test_and_set_bit?]

5.5 Locking Strategies

It's usually best to begin with a single lock, and refine later if that proves too slow. It's considered established wisdom in the Linux community that too many locks lead to unmaintainability and complicated deadlock problems. [FIXME: Get LM to elucidate this with horror stories maybe?].

If you need multiple spinlocks to do a single operation make sure they are always grabbed in the same order.
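For example (purely illustrative lock names): if one code path does

        spin_lock(&lock_a);
        spin_lock(&lock_b);

then every other path that needs both locks must also take `lock_a' before `lock_b'; taking them in the opposite order anywhere can deadlock when two CPUs each hold one lock and spin waiting for the other.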

A technique widely used in the networking code is atomic reference counts plus a `dead' marker, for example in `struct sock' <include/net/sock.h> and `struct in_device' <include/linux/inetdevice.h>. In this scheme, the structure is initialized with a reference count of 1 and the dead flag set to zero.

This technique works with objects which have a well-defined event which deletes the object. When this event occurs, it simply marks the object dead, and decrements the reference count.

The objects are kept in a hash table or linked list: this is locked by a spinlock or a rwlock as normal. When searching for an object, those which are marked dead are ignored. When an object is found, the reference is incremented, and the lock released. When the object is finished with, the reference count is decremented.

Whoever decrements the reference count to zero is responsible for the actual freeing of the object (the object must be marked dead unless there is a bug).
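A condensed sketch of the pattern (the structure, list, and helper names, including obj_matches(), are invented here; this is not the actual `struct sock' code):

        struct my_obj {
                struct my_obj   *next;
                atomic_t        refcnt;          /* initialized to 1 */
                int             dead;            /* initialized to 0 */
        };

        static spinlock_t my_list_lock = SPIN_LOCK_UNLOCKED;
        static struct my_obj *my_list;           /* protected by my_list_lock */

        /* Drop a reference; whoever hits zero frees the object. */
        static void my_obj_put(struct my_obj *obj)
        {
                if (atomic_dec_and_test(&obj->refcnt))
                        kfree(obj);
        }

        /* Lookup: hold the list lock, skip dead objects, take a reference. */
        spin_lock_bh(&my_list_lock);
        for (obj = my_list; obj; obj = obj->next)
                if (!obj->dead && obj_matches(obj, key)) {
                        atomic_inc(&obj->refcnt);
                        break;
                }
        spin_unlock_bh(&my_list_lock);

        /* The well-defined deletion event: mark it dead, drop the initial ref. */
        obj->dead = 1;
        my_obj_put(obj);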

5.6 Further reading

Documentation/spinlocks.txt

Linus Torvalds' spinlocking tutorial in the kernel sources.

Curt Schimmel's "Unix Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers" [ISBN: 0201633388]

A very good introduction to kernel level locking (not written for Linux, but nearly everything applies). The book is expensive, but really worth every penny to understand SMP locking.

