Lock-free DS : Work Stealing Queues

This is an attempt to explain the work-stealing queue, a lock-free data structure (DS), as it appears in the GHC Haskell runtime. There it is called WSDeque, for work-stealing double-ended queue.

Imagine you have to design a job scheduler for running multiple jobs in parallel. You create multiple worker threads (one per core) and assign jobs to all of them.
As it happens, some worker threads will finish early and sit idle. What a waste !!!

The way to solve this is to let idle workers steal jobs from heavily loaded workers. The core DS is the queue each worker owns, and it is aptly called a work-stealing queue (WSQueue).

WSQueue is a circular concurrent queue, implemented as a lock-free data structure to get maximum performance. A lock-free data structure uses volatile, memory barriers and CAS (compare-and-swap) instead of locks. So, no locks here even in a multi-threaded environment. Phew !!!

1) WSQueue has two ends : the left end is called top and the right end is called bottom. Valid jobs sit between top and bottom, inclusive.
2) Each worker thread has its own WSQueue.
3) Only the owner worker thread pushes to the bottom end of its WSQueue. So, only one thread pushes at a time.
4) Multiple worker threads can steal from the top end of some worker's WSQueue.
5) top only ever moves forward. bottom can move forward and backward (as you push and pop).

So, we have three operations on the queue : pushWSQ, popWSQ and stealWSQ.

1) pushWSQ and popWSQ work on the bottom end and are called only by the owner worker of the queue.
2) stealWSQ works on the top end and is called by multiple non-owner workers of the queue.

/* -----------------------------------------------------------------------------
 * Operations
 *
 * A WSDeque has an *owner* thread.  The owner can perform any operation;
 * other threads are only allowed to call stealWSDeque_(),
 * stealWSDeque(), looksEmptyWSDeque(), and dequeElements().
 *
 * -------------------------------------------------------------------------- */

// Allocation, deallocation
WSDeque * newWSDeque  (uint32_t size);
void      freeWSDeque (WSDeque *q);

// Take an element from the "write" end of the pool.  Can be called
// by the pool owner only.
void* popWSDeque (WSDeque *q);

// Push onto the "write" end of the pool.  Return true if the push
// succeeded, or false if the deque is full.
bool pushWSDeque (WSDeque *q, void *elem);

So, the question is : how does this work without locks when multiple threads are involved?
I think that to design a lock-free DS, we first need to minimize, and think carefully about, shared mutation :
1) top is changed only by stealWSQ (multiple threads).
2) bottom is changed only by pushWSQ and popWSQ (single owner thread).
3) Both top and bottom are read by all operations to decide their actions.

In stealWSQ, whoever wins the CAS race on top by incrementing it steals the job at the top position.
In pushWSQ, you increment bottom. So, no CAS here.
In popWSQ, you decrement bottom. Only the owner thread can change it, so normally no CAS is needed (read on for the one exception).

One edge case is when exactly one job is present in the queue. Then, technically, the owner worker's popWSQ and the non-owner workers' stealWSQ are equivalent: they race for the same element. So, in popWSQ, if you find that the job you are popping is the last one in the queue, you do a CAS on top and compete with the stealWSQs.

Since this is a multi-threaded environment, you want all threads to see the latest value. You don't want the compiler to optimize accesses to the variable by caching it in registers. So, you declare top and bottom as volatile.

Another thing you don't want is instruction-reordering optimization by the compiler. popWSQ first decrements bottom and then reads top. The compiler could legally reverse that order. Then, say, after popWSQ has read an old top, its thread is preempted by the CPU. Multiple stealWSQs run and empty the queue. When popWSQ resumes, it still holds the old top and wrongly concludes that a job is still present at the old bottom.

One more problem is memory reordering caused by CPU optimizations on architectures with a weak memory model. There, the CPU executes instructions out of the order you expect, to make full use of its pipeline. This is very efficient for a single thread, but can cause hard-to-detect bugs
in a multi-threaded environment; the same problem as above can occur. x86/x64 have a strong memory model. On many other platforms, you need memory barriers (instructions like mfence) to stop reordering at critical points such as the one above.
In the following code, store_load_barrier() inserts such a memory barrier:
void *popWSDeque (WSDeque *q)
{
    /* also a bit tricky, has to avoid concurrent steal() calls by
       accessing top with cas, when there is only one element left */
    StgWord t, b;
    long  currSize;
    void * removed;

    ASSERT_WSDEQUE_INVARIANTS(q);
    b = q->bottom;
    // "decrement b as a test, see what happens"
    b--;
    q->bottom = b;

    // very important that the following read of q->top does not occur
    // before the earlier write to q->bottom.
    store_load_barrier();

    t = q->top; /* using topBound would give an *upper* bound, we
                   need a lower bound. We use the real top here, but
                   can update the topBound value */
    q->topBound = t;
    currSize = (long)b - (long)t;

    if (currSize < 0) { /* deque was empty before the decrement:
                           set bottom back consistently and abort */
        q->bottom = t;
        return NULL;
    }

    // read the element at b
    removed = q->elements[b & q->moduloSize];

    if (currSize > 0) { /* no danger, still elements in buffer after b-- */
        return removed;
    }

    /* otherwise (currSize == 0): has someone meanwhile stolen the
       same (last) element? CAS on top decides who wins. */
    if (!(CASTOP(&(q->top), t, t + 1))) {
        removed = NULL; /* lost the race, but still adjust bottom */
    }
    q->bottom = t + 1;   /* deque is empty now; reset bottom consistently */
    q->topBound = t + 1; /* ...and the cached top value as well */

    return removed;
}
and here is the implementation of store_load_barrier (along with its companion, load_load_barrier):
EXTERN_INLINE void
store_load_barrier(void) {
#if defined(NOSMP)
    return;
#elif i386_HOST_ARCH
    __asm__ __volatile__ ("lock; addl $0,0(%%esp)" : : : "memory");
#elif x86_64_HOST_ARCH
    __asm__ __volatile__ ("lock; addq $0,0(%%rsp)" : : : "memory");
#elif powerpc_HOST_ARCH || powerpc64_HOST_ARCH || powerpc64le_HOST_ARCH
    __asm__ __volatile__ ("sync" : : : "memory");
#elif sparc_HOST_ARCH
    __asm__ __volatile__ ("membar #StoreLoad" : : : "memory");
#elif arm_HOST_ARCH
    __asm__ __volatile__ ("dmb" : : : "memory");
#elif aarch64_HOST_ARCH
    __asm__ __volatile__ ("dmb sy" : : : "memory");
#else
#error memory barriers unimplemented on this architecture
#endif
}

EXTERN_INLINE void
load_load_barrier(void) {
#if defined(NOSMP)
    return;
#elif i386_HOST_ARCH
    __asm__ __volatile__ ("" : : : "memory");
#elif x86_64_HOST_ARCH
    __asm__ __volatile__ ("" : : : "memory");
#elif powerpc_HOST_ARCH || powerpc64_HOST_ARCH || powerpc64le_HOST_ARCH
    __asm__ __volatile__ ("lwsync" : : : "memory");
#elif sparc_HOST_ARCH
    /* Sparc in TSO mode does not require load/load barriers. */
    __asm__ __volatile__ ("" : : : "memory");
#elif arm_HOST_ARCH
    __asm__ __volatile__ ("dmb" : : : "memory");
#elif aarch64_HOST_ARCH
    __asm__ __volatile__ ("dmb sy" : : : "memory");
#else
#error memory barriers unimplemented on this architecture
#endif
}
For x86/x64, store_load_barrier uses a locked add to the stack instead of mfence (a locked instruction is also a full barrier, and often cheaper), while load_load_barrier is empty because x86's strong memory model already keeps loads ordered.
On PowerPC, the sync and lwsync instructions provide the memory barriers.

The "memory" clobber at the end of each asm statement, as in __asm__ __volatile__ ("" : : : "memory");,
is a compiler barrier: it stops the compiler from reordering memory accesses across that point.

So, we have to stop both the compiler and the CPU from reordering certain parts of our code to make lock-free data structures work correctly.

For completeness, note: we could instead have idle workers ask a single master thread for jobs. That master thread could then steal from the queues of heavily loaded workers. But this approach may require more context switches, and the master thread can become a bottleneck as well.
