542 lines
15 KiB
Plaintext
542 lines
15 KiB
Plaintext
|
.HTML "Process Sleep and Wakeup on a Shared-memory Multiprocessor
|
||
|
.TL
|
||
|
Process Sleep and Wakeup on a Shared-memory Multiprocessor
|
||
|
.AU
|
||
|
Rob Pike
|
||
|
Dave Presotto
|
||
|
Ken Thompson
|
||
|
Gerard Holzmann
|
||
|
.sp
|
||
|
rob,presotto,ken,gerard@plan9.bell-labs.com
|
||
|
.AB
|
||
|
.FS
|
||
|
Appeared in a slightly different form in
|
||
|
.I
|
||
|
Proceedings of the Spring 1991 EurOpen Conference,
|
||
|
.R
|
||
|
Tromsø, Norway, 1991, pp. 161-166.
|
||
|
.FE
|
||
|
The problem of enabling a `sleeping' process on a shared-memory multiprocessor
|
||
|
is a difficult one, especially if the process is to be awakened by an interrupt-time
|
||
|
event. We present here the code
|
||
|
for sleep and wakeup primitives that we use in our multiprocessor system.
|
||
|
The code has been exercised by years of active use and by a verification
|
||
|
system.
|
||
|
.AE
|
||
|
.LP
|
||
|
Our problem is to synchronise processes on a symmetric shared-memory multiprocessor.
|
||
|
Processes suspend execution, or
|
||
|
.I sleep,
|
||
|
while awaiting an enabling event such as an I/O interrupt.
|
||
|
When the event occurs, the process is issued a
|
||
|
.I wakeup
|
||
|
to resume its execution.
|
||
|
During these events, other processes may be running and other interrupts
|
||
|
occurring on other processors.
|
||
|
.LP
|
||
|
More specifically, we wish to implement subroutines called
|
||
|
.CW sleep ,
|
||
|
callable by a process to relinquish control of its current processor,
|
||
|
and
|
||
|
.CW wakeup ,
|
||
|
callable by another process or an interrupt to resume the execution
|
||
|
of a suspended process.
|
||
|
The calling conventions of these subroutines will remain unspecified
|
||
|
for the moment.
|
||
|
.LP
|
||
|
We assume the processors have an atomic test-and-set or equivalent
|
||
|
operation but no other synchronisation method. Also, we assume interrupts
|
||
|
can occur on any processor at any time, except on a processor that has
|
||
|
locally inhibited them.
|
||
|
.LP
|
||
|
The problem is the generalisation to a multiprocessor of a familiar
|
||
|
and well-understood uniprocessor problem. It may be reduced to a
|
||
|
uniprocessor problem by using a global test-and-set to serialise the
|
||
|
sleeps and wakeups,
|
||
|
which is equivalent to synchronising through a monitor.
|
||
|
For performance and cleanliness, however,
|
||
|
we prefer to allow the interrupt handling and process control to be multiprocessed.
|
||
|
.LP
|
||
|
Our attempts to solve the sleep/wakeup problem in Plan 9
|
||
|
[Pik90]
|
||
|
prompted this paper.
|
||
|
We implemented solutions several times over several months and each
|
||
|
time convinced ourselves \(em wrongly \(em they were correct.
|
||
|
Multiprocessor algorithms can be
|
||
|
difficult to prove correct by inspection and formal reasoning about them
|
||
|
is impractical. We finally developed an algorithm we trust by
|
||
|
verifying our code using an
|
||
|
empirical testing tool.
|
||
|
We present that code here, along with some comments about the process by
|
||
|
which it was designed.
|
||
|
.SH
|
||
|
History
|
||
|
.LP
|
||
|
Since processes in Plan 9 and the UNIX
|
||
|
system have similar structure and properties, one might ask if
|
||
|
UNIX
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup
|
||
|
[Bac86]
|
||
|
could not easily be adapted from their standard uniprocessor implementation
|
||
|
to our multiprocessor needs.
|
||
|
The short answer is, no.
|
||
|
.LP
|
||
|
The
|
||
|
UNIX
|
||
|
routines
|
||
|
take as argument a single global address
|
||
|
that serves as a unique
|
||
|
identifier to connect the wakeup with the appropriate process or processes.
|
||
|
This has several inherent disadvantages.
|
||
|
From the point of view of
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup ,
|
||
|
it is difficult to associate a data structure with an arbitrary address;
|
||
|
the routines are unable to maintain a state variable recording the
|
||
|
status of the event and processes.
|
||
|
(The reverse is of course easy \(em we could
|
||
|
require the address to point to a special data structure \(em
|
||
|
but we are investigating
|
||
|
UNIX
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup ,
|
||
|
not the code that calls them.)
|
||
|
Also, multiple processes sleep `on' a given address, so
|
||
|
.CW wakeup
|
||
|
must enable them all, and let process scheduling determine which process
|
||
|
actually benefits from the event.
|
||
|
This is inefficient;
|
||
|
a queueing mechanism would be preferable
|
||
|
but, again, it is difficult to associate a queue with a general address.
|
||
|
Moreover, the lack of state means that
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup
|
||
|
cannot know what the corresponding process (or interrupt) is doing;
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup
|
||
|
must be executed atomically.
|
||
|
On a uniprocessor it suffices to disable interrupts during their
|
||
|
execution.
|
||
|
On a multiprocessor, however,
|
||
|
most processors
|
||
|
can inhibit interrupts only on the current processor,
|
||
|
so while a process is executing
|
||
|
.CW sleep
|
||
|
the desired interrupt can come and go on another processor.
|
||
|
If the wakeup is to be issued by another process, the problem is even harder.
|
||
|
Some inter-process mutual exclusion mechanism must be used,
|
||
|
which, yet again, is difficult to do without a way to communicate state.
|
||
|
.LP
|
||
|
In summary, to be useful on a multiprocessor,
|
||
|
UNIX
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup
|
||
|
must either be made to run atomically on a single
|
||
|
processor (such as by using a monitor)
|
||
|
or they need a richer model for their communication.
|
||
|
.SH
|
||
|
The design
|
||
|
.LP
|
||
|
Consider the case of an interrupt waking up a sleeping process.
|
||
|
(The other case, a process awakening a second process, is easier because
|
||
|
atomicity can be achieved using an interlock.)
|
||
|
The sleeping process is waiting for some event to occur, which may be
|
||
|
modeled by a condition coming true.
|
||
|
The condition could be just that the event has happened, or something
|
||
|
more subtle such as a queue draining below some low-water mark.
|
||
|
We represent the condition by a function of one
|
||
|
argument of type
|
||
|
.CW void* ;
|
||
|
the code supporting the device generating the interrupts
|
||
|
provides such a function to be used by
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup
|
||
|
to synchronise. The function returns
|
||
|
.CW false
|
||
|
if the event has not occurred, and
|
||
|
.CW true
|
||
|
some time after the event has occurred.
|
||
|
The
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup
|
||
|
routines must, of course, work correctly if the
|
||
|
event occurs while the process is executing
|
||
|
.CW sleep .
|
||
|
.LP
|
||
|
We assume that a particular call to
|
||
|
.CW sleep
|
||
|
corresponds to a particular call to
|
||
|
.CW wakeup ,
|
||
|
that is,
|
||
|
at most one process is asleep waiting for a particular event.
|
||
|
This can be guaranteed in the code that calls
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup
|
||
|
by appropriate interlocks.
|
||
|
We also assume for the moment that there will be only one interrupt
|
||
|
and that it may occur at any time, even before
|
||
|
.CW sleep
|
||
|
has been called.
|
||
|
.LP
|
||
|
For performance,
|
||
|
we desire that multiple instances of
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup
|
||
|
may be running simultaneously on our multiprocessor.
|
||
|
For example, a process calling
|
||
|
.CW sleep
|
||
|
to await a character from an input channel need not
|
||
|
wait for another process to finish executing
|
||
|
.CW sleep
|
||
|
to await a disk block.
|
||
|
At a finer level, we would like a process reading from one input channel
|
||
|
to be able to execute
|
||
|
.CW sleep
|
||
|
in parallel with a process reading from another input channel.
|
||
|
A standard approach to synchronisation is to interlock the channel `driver'
|
||
|
so that only one process may be executing in the channel code at once.
|
||
|
This method is clearly inadequate for our purposes; we need
|
||
|
fine-grained synchronisation, and in particular to apply
|
||
|
interlocks at the level of individual channels rather than at the level
|
||
|
of the channel driver.
|
||
|
.LP
|
||
|
Our approach is to use an object called a
|
||
|
.I rendezvous ,
|
||
|
which is a data structure through which
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup
|
||
|
synchronise.
|
||
|
(The similarly named construct in Ada is a control structure;
|
||
|
ours is an unrelated data structure.)
|
||
|
A rendezvous
|
||
|
is allocated for each active source of events:
|
||
|
one for each I/O channel,
|
||
|
one for each end of a pipe, and so on.
|
||
|
The rendezvous serves as an interlockable structure in which to record
|
||
|
the state of the sleeping process, so that
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup
|
||
|
can communicate if the event happens before or while
|
||
|
.CW sleep
|
||
|
is executing.
|
||
|
.LP
|
||
|
Our design for
|
||
|
.CW sleep
|
||
|
is therefore a function
|
||
|
.P1
|
||
|
void sleep(Rendezvous *r, int (*condition)(void*), void *arg)
|
||
|
.P2
|
||
|
called by the sleeping process.
|
||
|
The argument
|
||
|
.CW r
|
||
|
connects the call to
|
||
|
.CW sleep
|
||
|
with the call to
|
||
|
.CW wakeup ,
|
||
|
and is part of the data structure for the (say) device.
|
||
|
The function
|
||
|
.CW condition
|
||
|
is described above;
|
||
|
called with argument
|
||
|
.CW arg ,
|
||
|
it is used by
|
||
|
.CW sleep
|
||
|
to decide whether the event has occurred.
|
||
|
.CW Wakeup
|
||
|
has a simpler specification:
|
||
|
.P1
|
||
|
void wakeup(Rendezvous *r).
|
||
|
.P2
|
||
|
.CW Wakeup
|
||
|
must be called after the condition has become true.
|
||
|
.SH
|
||
|
An implementation
|
||
|
.LP
|
||
|
The
|
||
|
.CW Rendezvous
|
||
|
data type is defined as
|
||
|
.P1
|
||
|
typedef struct{
|
||
|
Lock l;
|
||
|
Proc *p;
|
||
|
}Rendezvous;
|
||
|
.P2
|
||
|
Our
|
||
|
.CW Locks
|
||
|
are test-and-set spin locks.
|
||
|
The routine
|
||
|
.CW lock(Lock\ *l)
|
||
|
returns when the current process holds that lock;
|
||
|
.CW unlock(Lock\ *l)
|
||
|
releases the lock.
|
||
|
.LP
|
||
|
Here is our implementation of
|
||
|
.CW sleep .
|
||
|
Its details are discussed below.
|
||
|
.CW Thisp
|
||
|
is a pointer to the current process on the current processor.
|
||
|
(Its value differs on each processor.)
|
||
|
.P1
|
||
|
void
|
||
|
sleep(Rendezvous *r, int (*condition)(void*), void *arg)
|
||
|
{
|
||
|
int s;
|
||
|
|
||
|
s = inhibit(); /* interrupts */
|
||
|
lock(&r->l);
|
||
|
|
||
|
/*
|
||
|
* if condition happened, never mind
|
||
|
*/
|
||
|
if((*condition)(arg)){
|
||
|
unlock(&r->l);
|
||
|
allow(); /* interrupts */
|
||
|
return;
|
||
|
}
|
||
|
|
||
|
/*
|
||
|
* now we are committed to
|
||
|
* change state and call scheduler
|
||
|
*/
|
||
|
if(r->p)
|
||
|
error("double sleep %d %d", r->p->pid, thisp->pid);
|
||
|
thisp->state = Wakeme;
|
||
|
r->p = thisp;
|
||
|
unlock(&r->l);
|
||
|
allow(s); /* interrupts */
|
||
|
sched(); /* relinquish CPU */
|
||
|
}
|
||
|
.P2
|
||
|
.ne 3i
|
||
|
Here is
|
||
|
.CW wakeup.
|
||
|
.P1
|
||
|
void
|
||
|
wakeup(Rendezvous *r)
|
||
|
{
|
||
|
Proc *p;
|
||
|
int s;
|
||
|
|
||
|
s = inhibit(); /* interrupts; return old state */
|
||
|
lock(&r->l);
|
||
|
p = r->p;
|
||
|
if(p){
|
||
|
r->p = 0;
|
||
|
if(p->state != Wakeme)
|
||
|
panic("wakeup: not Wakeme");
|
||
|
ready(p);
|
||
|
}
|
||
|
unlock(&r->l);
|
||
|
if(s)
|
||
|
allow();
|
||
|
}
|
||
|
.P2
|
||
|
.CW Sleep
|
||
|
and
|
||
|
.CW wakeup
|
||
|
both begin by disabling interrupts
|
||
|
and then locking the rendezvous structure.
|
||
|
Because
|
||
|
.CW wakeup
|
||
|
may be called in an interrupt routine, the lock must be set only
|
||
|
with interrupts disabled on the current processor,
|
||
|
so that if the interrupt comes during
|
||
|
.CW sleep
|
||
|
it will occur only on a different processor;
|
||
|
if it occurred on the processor executing
|
||
|
.CW sleep ,
|
||
|
the spin lock in
|
||
|
.CW wakeup
|
||
|
would hang forever.
|
||
|
At the end of each routine, the lock is released and processor priority
|
||
|
returned to its previous value.
|
||
|
.CW Wakeup "" (
|
||
|
needs to inhibit interrupts in case
|
||
|
it is being called by a process;
|
||
|
this is a no-op if called by an interrupt.)
|
||
|
.LP
|
||
|
.CW Sleep
|
||
|
checks to see if the condition has become true, and returns if so.
|
||
|
Otherwise the process posts its name in the rendezvous structure where
|
||
|
.CW wakeup
|
||
|
may find it, marks its state as waiting to be awakened
|
||
|
(this is for error checking only) and goes to sleep by calling
|
||
|
.CW sched() .
|
||
|
The manipulation of the rendezvous structure is all done under the lock,
|
||
|
and
|
||
|
.CW wakeup
|
||
|
only examines it under lock, so atomicity and mutual exclusion
|
||
|
are guaranteed.
|
||
|
.LP
|
||
|
.CW Wakeup
|
||
|
has a simpler job. When it is called, the condition has implicitly become true,
|
||
|
so it locks the rendezvous, sees if a process is waiting, and readies it to run.
|
||
|
.SH
|
||
|
Discussion
|
||
|
.LP
|
||
|
The synchronisation technique used here
|
||
|
is similar to known methods, even as far back as Saltzer's thesis
|
||
|
[Sal66].
|
||
|
The code looks trivially correct in retrospect: all access to data structures is done
|
||
|
under lock, and there is no place that things may get out of order.
|
||
|
Nonetheless, it took us several iterations to arrive at the above
|
||
|
implementation, because the things that
|
||
|
.I can
|
||
|
go wrong are often hard to see. We had four earlier implementations
|
||
|
that were examined at great length and only found faulty when a new,
|
||
|
different style of device or activity was added to the system.
|
||
|
.LP
|
||
|
.ne 3i
|
||
|
Here, for example, is an incorrect implementation of wakeup,
|
||
|
closely related to one of our versions.
|
||
|
.P1
|
||
|
void
|
||
|
wakeup(Rendezvous *r)
|
||
|
{
|
||
|
Proc *p;
|
||
|
int s;
|
||
|
|
||
|
p = r->p;
|
||
|
if(p){
|
||
|
s = inhibit();
|
||
|
lock(&r->l);
|
||
|
r->p = 0;
|
||
|
if(p->state != Wakeme)
|
||
|
panic("wakeup: not Wakeme");
|
||
|
ready(p);
|
||
|
unlock(&r->l);
|
||
|
if(s)
|
||
|
allow();
|
||
|
}
|
||
|
}
|
||
|
.P2
|
||
|
The mistake is that the reading of
|
||
|
.CW r->p
|
||
|
may occur just as the other process calls
|
||
|
.CW sleep ,
|
||
|
so when the interrupt examines the structure it sees no one to wake up,
|
||
|
and the sleeping process misses its wakeup.
|
||
|
We wrote the code this way because we reasoned that the fetch
|
||
|
.CW p
|
||
|
.CW =
|
||
|
.CW r->p
|
||
|
was inherently atomic and need not be interlocked.
|
||
|
The bug was found by examination when a new, very fast device
|
||
|
was added to the system and sleeps and interrupts were closely overlapped.
|
||
|
However, it was in the system for a couple of months without causing an error.
|
||
|
.LP
|
||
|
How many errors lurk in our supposedly correct implementation above?
|
||
|
We would like a way to guarantee correctness; formal proofs are beyond
|
||
|
our abilities when the subtleties of interrupts and multiprocessors are
|
||
|
involved.
|
||
|
With that in mind, the first three authors approached the last to see
|
||
|
if his automated tool for checking protocols
|
||
|
[Hol91]
|
||
|
could be
|
||
|
used to verify our new
|
||
|
.CW sleep
|
||
|
and
|
||
|
.CW wakeup
|
||
|
for correctness.
|
||
|
The code was translated into the language for that system
|
||
|
(with, unfortunately, no way of proving that the translation is itself correct)
|
||
|
and validated by exhaustive simulation.
|
||
|
.LP
|
||
|
The validator found a bug.
|
||
|
Under our assumption that there is only one interrupt, the bug cannot
|
||
|
occur, but in the more general case of multiple interrupts synchronising
|
||
|
through the same condition function and rendezvous,
|
||
|
the process and interrupt can enter a peculiar state.
|
||
|
A process may return from
|
||
|
.CW sleep
|
||
|
with the condition function false
|
||
|
if there is a delay between
|
||
|
the condition coming true and
|
||
|
.CW wakeup
|
||
|
being called,
|
||
|
with the delay occurring
|
||
|
just as the receiving process calls
|
||
|
.CW sleep .
|
||
|
The condition is now true, so that process returns immediately,
|
||
|
does whatever is appropriate, and then (say) decides to call
|
||
|
.CW sleep
|
||
|
again. This time the condition is false, so it goes to sleep.
|
||
|
The wakeup process then finds a sleeping process,
|
||
|
and wakes it up, but the condition is now false.
|
||
|
.LP
|
||
|
There is an easy (and verified) solution: at the end of
|
||
|
.CW sleep
|
||
|
or after
|
||
|
.CW sleep
|
||
|
returns,
|
||
|
if the condition is false, execute
|
||
|
.CW sleep
|
||
|
again. This re-execution cannot repeat; the second synchronisation is guaranteed
|
||
|
to function under the external conditions we are supposing.
|
||
|
.LP
|
||
|
Even though the original code is completely
|
||
|
protected by interlocks and had been examined carefully by all of us
|
||
|
and believed correct, it still had problems.
|
||
|
It seems to us that some exhaustive automated analysis is
|
||
|
required of multiprocessor algorithms to guarantee their safety.
|
||
|
Our experience has confirmed that it is almost impossible to
|
||
|
guarantee by inspection or simple testing the correctness
|
||
|
of a multiprocessor algorithm. Testing can demonstrate the presence
|
||
|
of bugs but not their absence
|
||
|
[Dij72].
|
||
|
.LP
|
||
|
We close by claiming that the code above with
|
||
|
the suggested modification passes all tests we have for correctness
|
||
|
under the assumptions used in the validation.
|
||
|
We would not, however, go so far as to claim that it is universally correct.
|
||
|
.SH
|
||
|
References
|
||
|
.LP
|
||
|
[Bac86] Maurice J. Bach,
|
||
|
.I "The Design of the UNIX Operating System,
|
||
|
Prentice-Hall,
|
||
|
Englewood Cliffs,
|
||
|
1986.
|
||
|
.LP
|
||
|
[Dij72] Edsger W. Dijkstra,
|
||
|
``The Humble Programmer \- 1972 Turing Award Lecture'',
|
||
|
.I "Comm. ACM,
|
||
|
15(10), pp. 859-866,
|
||
|
October 1972.
|
||
|
.LP
|
||
|
[Hol91] Gerard J. Holzmann,
|
||
|
.I "Design and Validation of Computer Protocols,
|
||
|
Prentice-Hall,
|
||
|
Englewood Cliffs,
|
||
|
1991.
|
||
|
.LP
|
||
|
[Pik90]
|
||
|
Rob Pike,
|
||
|
Dave Presotto,
|
||
|
Ken Thompson,
|
||
|
Howard Trickey,
|
||
|
``Plan 9 from Bell Labs'',
|
||
|
.I "Proceedings of the Summer 1990 UKUUG Conference,
|
||
|
pp. 1-9,
|
||
|
London,
|
||
|
July, 1990.
|
||
|
.LP
|
||
|
[Sal66] Jerome H. Saltzer,
|
||
|
.I "Traffic Control in a Multiplexed Computer System
|
||
|
MIT,
|
||
|
Cambridge, Mass.,
|
||
|
1966.
|