xemu/util/qemu-coroutine.c

205 lines
5.6 KiB
C

coroutine: introduce coroutines

Asynchronous code is becoming very complex. At the same time synchronous code is growing because it is convenient to write. Sometimes duplicate code paths are even added, one synchronous and the other asynchronous. This patch introduces coroutines which allow code that looks synchronous but is asynchronous under the covers.

A coroutine has its own stack and is therefore able to preserve state across blocking operations, which traditionally require callback functions and manual marshalling of parameters.

Creating and starting a coroutine is easy:

  coroutine = qemu_coroutine_create(my_coroutine);
  qemu_coroutine_enter(coroutine, my_data);

The coroutine then executes until it returns or yields:

  void coroutine_fn my_coroutine(void *opaque) {
      MyData *my_data = opaque;

      /* do some work */

      qemu_coroutine_yield();

      /* do some more work */
  }

Yielding switches control back to the caller of qemu_coroutine_enter(). This is typically used to switch back to the main thread's event loop after issuing an asynchronous I/O request. The request callback will then invoke qemu_coroutine_enter() once more to switch back to the coroutine.

Note that if coroutines are used only from threads which hold the global mutex they will never execute concurrently. This makes programming with coroutines easier than with threads. Race conditions cannot occur since only one coroutine may be active at any time. Other coroutines can only run across yield.

This coroutines implementation is based on the gtk-vnc implementation written by Anthony Liguori <anthony@codemonkey.ws> but it has been significantly rewritten by Kevin Wolf <kwolf@redhat.com> to use setjmp()/longjmp() instead of the more expensive swapcontext() and by Paolo Bonzini <pbonzini@redhat.com> for Windows Fibers support.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
2011-01-17 16:08:14 +00:00
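The enter/yield control flow described in this commit message can be illustrated with a small standalone sketch. It is not QEMU's implementation: it uses POSIX `swapcontext()` for brevity (the message notes QEMU prefers setjmp()/longjmp()), and every `mini_*` name, the `MiniCo` type, and the `trace` array are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

/* Hypothetical miniature coroutine; not QEMU's Coroutine type. */
typedef struct MiniCo {
    ucontext_t caller;            /* context that called mini_enter() */
    ucontext_t self;              /* the coroutine's own context */
    void (*entry)(void *opaque);  /* user function to run */
    void *opaque;
    int finished;
    char stack[64 * 1024];        /* private stack: keeps locals alive across yields */
} MiniCo;

static MiniCo *mini_current;      /* coroutine that is running right now, if any */

static void mini_trampoline(void)
{
    MiniCo *co = mini_current;

    co->entry(co->opaque);
    co->finished = 1;
    /* Returning here resumes uc_link, i.e. the latest caller of mini_enter(). */
}

static void mini_create(MiniCo *co, void (*entry)(void *), void *opaque)
{
    co->entry = entry;
    co->opaque = opaque;
    co->finished = 0;
    getcontext(&co->self);
    co->self.uc_stack.ss_sp = co->stack;
    co->self.uc_stack.ss_size = sizeof(co->stack);
    co->self.uc_link = &co->caller;
    makecontext(&co->self, mini_trampoline, 0);
}

/* Switch into the coroutine; returns when it yields or finishes. */
static void mini_enter(MiniCo *co)
{
    MiniCo *prev = mini_current;

    assert(!co->finished);
    mini_current = co;
    swapcontext(&co->caller, &co->self);
    mini_current = prev;
}

/* Switch back to whoever called mini_enter(). */
static void mini_yield(void)
{
    MiniCo *co = mini_current;

    swapcontext(&co->self, &co->caller);
}

static int trace[4];
static int trace_len;

static void my_coroutine(void *opaque)
{
    int *base = opaque;

    trace[trace_len++] = *base + 1;   /* do some work */
    mini_yield();
    trace[trace_len++] = *base + 3;   /* do some more work */
}

static MiniCo demo_co;

/* Returns 1 if the coroutine and its caller interleaved as expected. */
int mini_demo(void)
{
    int base = 10;

    trace_len = 0;
    mini_create(&demo_co, my_coroutine, &base);
    mini_enter(&demo_co);             /* runs until the yield */
    trace[trace_len++] = base + 2;    /* caller runs while the coroutine is parked */
    mini_enter(&demo_co);             /* resumes after the yield, then returns */
    return demo_co.finished && trace_len == 3 &&
           trace[0] == 11 && trace[1] == 12 && trace[2] == 13;
}
```

Calling `mini_demo()` from a `main()` exercises one create, two enters, and one yield. The coroutine's local pointer `base` survives the yield because it lives on the coroutine's private stack, which is the property the commit message relies on.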
/*
 * QEMU coroutines
 *
 * Copyright IBM, Corp. 2011
 *
 * Authors:
 *  Stefan Hajnoczi    <stefanha@linux.vnet.ibm.com>
 *  Kevin Wolf         <kwolf@redhat.com>
 *
 * This work is licensed under the terms of the GNU LGPL, version 2 or later.
 * See the COPYING.LIB file in the top-level directory.
 *
 */
#include "qemu/osdep.h"
#include "trace.h"
#include "qemu/thread.h"
coroutine: rewrite pool to avoid mutex

This patch removes the mutex by using fancy lock-free manipulation of the pool. Lock-free stacks and queues are not hard, but they can suffer from the ABA problem so they are better avoided unless you have some deferred reclamation scheme like RCU. Otherwise you have to stick with adding to a list, and emptying it completely. This is what this patch does, by coupling a lock-free global list of available coroutines with per-CPU lists that are actually used on coroutine creation.

Whenever the destruction pool is big enough, the next thread that runs out of coroutines will steal the whole destruction pool. This is positive in two ways:

1) the allocation does not have to do any atomic operation in the fast path, it's entirely using thread-local storage. Once every POOL_BATCH_SIZE allocations it will do a single atomic_xchg. Release does an atomic_cmpxchg loop, that hopefully doesn't cause any starvation, and an atomic_inc. A later patch will also remove atomic operations from the release path, and try to avoid the atomic_xchg altogether---succeeding in doing so if all devices either use ioeventfd or are not submitting requests actively.

2) in theory this should be completely adaptive. The number of coroutines around should be a little more than POOL_BATCH_SIZE * number of allocating threads; so this also empties qemu_coroutine_adjust_pool_size. (The previous pool size was POOL_BATCH_SIZE * number of block backends, so it was a bit more generous. But if you actually have many high-iodepth disks, it's better to put them in different iothreads, which will also use separate thread pools and aio=native file descriptors).

This speeds up perf/cost (in tests/test-coroutine) by a factor of ~1.33. No matter if we end with some kind of coroutine bypass scheme or not, it cannot hurt to optimize hot code.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Message-id: 1417518350-6167-6-git-send-email-pbonzini@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2014-12-02 11:05:48 +00:00
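The "add to a list, then empty it completely" pattern from this commit message can be sketched with C11 atomics. This is an illustration under assumptions, not the code behind QEMU's `QSLIST_INSERT_HEAD_ATOMIC` and `QSLIST_MOVE_ATOMIC` macros: the `Node` and `SharedList` types and the `shared_*` and `pool_demo` functions are hypothetical names. Producers push with a compare-and-swap loop, and a consumer detaches the entire chain with one exchange, so no node is ever popped individually and the ABA problem does not arise.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list node; stands in for a pooled coroutine. */
typedef struct Node {
    struct Node *next;
    int id;
} Node;

/* Hypothetical shared list; only the head pointer is atomic. */
typedef struct SharedList {
    _Atomic(Node *) head;
} SharedList;

/* Release side: publish one node with a compare-and-swap loop. */
static void shared_push(SharedList *l, Node *n)
{
    Node *old = atomic_load_explicit(&l->head, memory_order_relaxed);

    do {
        n->next = old;   /* on failure, old is refreshed with the current head */
    } while (!atomic_compare_exchange_weak_explicit(&l->head, &old, n,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}

/* Allocation side: a single exchange steals the whole chain at once. */
static Node *shared_take_all(SharedList *l)
{
    return atomic_exchange_explicit(&l->head, (Node *)NULL,
                                    memory_order_acquire);
}

/* Returns 1 if three pushed nodes come back in one batch, newest first. */
int pool_demo(void)
{
    static Node nodes[3] = { { NULL, 1 }, { NULL, 2 }, { NULL, 3 } };
    SharedList shared;
    Node *batch;
    int count = 0;
    int first_id;

    atomic_init(&shared.head, (Node *)NULL);
    for (int i = 0; i < 3; i++) {
        shared_push(&shared, &nodes[i]);
    }

    batch = shared_take_all(&shared);
    first_id = batch ? batch->id : -1;
    for (Node *n = batch; n; n = n->next) {
        count++;   /* the batch is now private, so plain traversal is safe */
    }

    /* A second steal finds the shared list empty. */
    return count == 3 && first_id == 3 && shared_take_all(&shared) == NULL;
}
```

After `shared_take_all()` returns, the chain belongs to the calling thread alone, which is why a thread-local pool can then hand out nodes without further atomic operations.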
#include "qemu/atomic.h"
#include "qemu/coroutine.h"
#include "qemu/coroutine_int.h"
#include "block/aio.h"
enum {
    POOL_BATCH_SIZE = 64,
};

/** Free list to speed up creation */
static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
static unsigned int release_pool_size;
static __thread QSLIST_HEAD(, Coroutine) alloc_pool = QSLIST_HEAD_INITIALIZER(pool);
static __thread unsigned int alloc_pool_size;
static __thread Notifier coroutine_pool_cleanup_notifier;

static void coroutine_pool_cleanup(Notifier *n, void *value)
{
    Coroutine *co;
    Coroutine *tmp;

    QSLIST_FOREACH_SAFE(co, &alloc_pool, pool_next, tmp) {
        QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
        qemu_coroutine_delete(co);
    }
}

Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
{
    Coroutine *co = NULL;

    if (CONFIG_COROUTINE_POOL) {
        co = QSLIST_FIRST(&alloc_pool);
        if (!co) {
            if (release_pool_size > POOL_BATCH_SIZE) {
                /* Slow path; a good place to register the destructor, too. */
                if (!coroutine_pool_cleanup_notifier.notify) {
                    coroutine_pool_cleanup_notifier.notify = coroutine_pool_cleanup;
                    qemu_thread_atexit_add(&coroutine_pool_cleanup_notifier);
                }

                /* This is not exact; there could be a little skew between
                 * release_pool_size and the actual size of release_pool. But
                 * it is just a heuristic, it does not need to be perfect.
                 */
                alloc_pool_size = qatomic_xchg(&release_pool_size, 0);
                QSLIST_MOVE_ATOMIC(&alloc_pool, &release_pool);
                co = QSLIST_FIRST(&alloc_pool);
            }
        }
        if (co) {
            QSLIST_REMOVE_HEAD(&alloc_pool, pool_next);
            alloc_pool_size--;
        }
    }

    if (!co) {
        co = qemu_coroutine_new();
    }
    co->entry = entry;
    co->entry_arg = opaque;
    QSIMPLEQ_INIT(&co->co_queue_wakeup);
    return co;
}

static void coroutine_delete(Coroutine *co)
{
    co->caller = NULL;

    if (CONFIG_COROUTINE_POOL) {
        if (release_pool_size < POOL_BATCH_SIZE * 2) {
            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
            qatomic_inc(&release_pool_size);
            return;
        }
        if (alloc_pool_size < POOL_BATCH_SIZE) {
            QSLIST_INSERT_HEAD(&alloc_pool, co, pool_next);
            alloc_pool_size++;
            return;
        }
    }

    qemu_coroutine_delete(co);
}
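The release policy above can be sketched outside QEMU as follows. This is only an illustration under stated assumptions: `Node`, `BATCH`, `pool_release` and the return codes are hypothetical names, and C11 `<stdatomic.h>` stands in for QEMU's `QSLIST_INSERT_HEAD_ATOMIC` and `qatomic_inc`. As in the original, the size check on the shared pool is only a heuristic and may be slightly stale under contention.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

#define BATCH 4u  /* hypothetical stand-in for POOL_BATCH_SIZE */

typedef struct Node {
    struct Node *next;
} Node;

/* Shared list: any thread may push, so the head is updated with a CAS loop. */
static _Atomic(Node *) release_head;
static atomic_uint release_size;

/* Per-thread list: only the owning thread touches it, so no atomics needed. */
static _Thread_local Node *alloc_head;
static _Thread_local unsigned alloc_size;

enum { TO_RELEASE_POOL, TO_ALLOC_POOL, FREED };

/* Recycle a node: shared pool first, then the thread-local pool, else free. */
static int pool_release(Node *n)
{
    if (atomic_load(&release_size) < BATCH * 2) {
        Node *old = atomic_load(&release_head);
        do {
            n->next = old;  /* 'old' is refreshed by a failed CAS */
        } while (!atomic_compare_exchange_weak(&release_head, &old, n));
        atomic_fetch_add(&release_size, 1);
        return TO_RELEASE_POOL;
    }
    if (alloc_size < BATCH) {
        n->next = alloc_head;
        alloc_head = n;
        alloc_size++;
        return TO_ALLOC_POOL;
    }
    free(n);
    return FREED;
}
```

Only pushes go through the CAS loop here. The commit message above describes emptying the shared list as a whole with a single exchange, which is what lets a design like this sidestep the ABA problem without a deferred reclamation scheme.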
void qemu_aio_coroutine_enter(AioContext *ctx, Coroutine *co)
{
    QSIMPLEQ_HEAD(, Coroutine) pending = QSIMPLEQ_HEAD_INITIALIZER(pending);
    Coroutine *from = qemu_coroutine_self();

    QSIMPLEQ_INSERT_TAIL(&pending, co, co_queue_next);
    /* Run co and any queued coroutines */
    while (!QSIMPLEQ_EMPTY(&pending)) {
        Coroutine *to = QSIMPLEQ_FIRST(&pending);
        CoroutineAction ret;

        /* Cannot rely on the read barrier for to in aio_co_wake(), as there are
         * callers outside of aio_co_wake() */
        const char *scheduled = qatomic_mb_read(&to->scheduled);
        QSIMPLEQ_REMOVE_HEAD(&pending, co_queue_next);

        trace_qemu_aio_coroutine_enter(ctx, from, to, to->entry_arg);

        /* if the Coroutine has already been scheduled, entering it again will
         * cause us to enter it twice, potentially even after the coroutine has
         * been deleted */
        if (scheduled) {
            fprintf(stderr,
                    "%s: Co-routine was already scheduled in '%s'\n",
                    __func__, scheduled);
            abort();
        }

        if (to->caller) {
            fprintf(stderr, "Co-routine re-entered recursively\n");
            abort();
        }
coroutine-lock: do not touch coroutine after another one has been entered

Submission of requests on linux aio is a bit tricky and can lead to
requests completions on submission path:

44713c9e8547 ("linux-aio: Handle io_submit() failure gracefully")
0ed93d84edab ("linux-aio: process completions from ioq_submit()")

That means that any coroutine which has been yielded in order to wait
for completion can be resumed from submission path and be eventually
terminated (freed).

The following use-after-free crash was observed when IO throttling
was enabled:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f5813dff700 (LWP 56417)]
virtqueue_unmap_sg (elem=0x7f5804009a30, len=1, vq=<optimized out>) at virtio.c:252
(gdb) bt
#0 virtqueue_unmap_sg (elem=0x7f5804009a30, len=1, vq=<optimized out>) at virtio.c:252
                            ^^^^^^^^^^^^^^
                            remember the address
#1 virtqueue_fill (vq=0x5598b20d21b0, elem=0x7f5804009a30, len=1, idx=0) at virtio.c:282
#2 virtqueue_push (vq=0x5598b20d21b0, elem=elem@entry=0x7f5804009a30, len=<optimized out>) at virtio.c:308
#3 virtio_blk_req_complete (req=req@entry=0x7f5804009a30, status=status@entry=0 '\000') at virtio-blk.c:61
#4 virtio_blk_rw_complete (opaque=<optimized out>, ret=0) at virtio-blk.c:126
#5 blk_aio_complete (acb=0x7f58040068d0) at block-backend.c:923
#6 coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at coroutine-ucontext.c:78

(gdb) p * elem
$8 = {index = 77, out_num = 2, in_num = 1,
      in_addr = 0x7f5804009ad8, out_addr = 0x7f5804009ae0,
      in_sg = 0x0, out_sg = 0x7f5804009a50}
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      'in_sg' and 'out_sg' are invalid.

e.g. it is impossible that 'in_sg' is zero, instead its value must be
equal to:

(gdb) p/x 0x7f5804009ad8 + sizeof(elem->in_addr[0]) + 2 * sizeof(elem->out_addr[0])
$26 = 0x7f5804009af0

Seems 'elem' was corrupted.

Meanwhile another thread raised an abort:

Thread 12 (Thread 0x7f57f2ffd700 (LWP 56426)):
#0 raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 qemu_coroutine_enter (co=0x7f5804009af0) at qemu-coroutine.c:113
#3 qemu_co_queue_run_restart (co=0x7f5804009a30) at qemu-coroutine-lock.c:60
#4 qemu_coroutine_enter (co=0x7f5804009a30) at qemu-coroutine.c:119
                        ^^^^^^^^^^^^^^^^^^
                        WTF?? this is equal to elem from crashed thread
#5 qemu_co_queue_run_restart (co=0x7f57e7f16ae0) at qemu-coroutine-lock.c:60
#6 qemu_coroutine_enter (co=0x7f57e7f16ae0) at qemu-coroutine.c:119
#7 qemu_co_queue_run_restart (co=0x7f5807e112a0) at qemu-coroutine-lock.c:60
#8 qemu_coroutine_enter (co=0x7f5807e112a0) at qemu-coroutine.c:119
#9 qemu_co_queue_run_restart (co=0x7f5807f17820) at qemu-coroutine-lock.c:60
#10 qemu_coroutine_enter (co=0x7f5807f17820) at qemu-coroutine.c:119
#11 qemu_co_queue_run_restart (co=0x7f57e7f18e10) at qemu-coroutine-lock.c:60
#12 qemu_coroutine_enter (co=0x7f57e7f18e10) at qemu-coroutine.c:119
#13 qemu_co_enter_next (queue=queue@entry=0x5598b1e742d0) at qemu-coroutine-lock.c:106
#14 timer_cb (blk=0x5598b1e74280, is_write=<optimized out>) at throttle-groups.c:419

Crash can be explained by access of 'co' object from the loop inside
qemu_co_queue_run_restart():

while ((next = QSIMPLEQ_FIRST(&co->co_queue_wakeup))) {
    QSIMPLEQ_REMOVE_HEAD(&co->co_queue_wakeup, co_queue_next);
                         ^^^^^^^^^^^^^^^^^^^^
                         on each iteration 'co' is accessed,
                         but 'co' can be already freed

    qemu_coroutine_enter(next);
}

When 'next' coroutine is resumed (entered) it can in its turn resume
'co', and eventually free it. That's why we see 'co' (which was freed)
has the same address as 'elem' from the first backtrace.

The fix is obvious: use temporary queue and do not touch coroutine after
first qemu_coroutine_enter() is invoked.

The issue is quite rare and happens every ~12 hours on very high IO
and CPU load (building linux kernel with -j512 inside guest) when IO
throttling is enabled. With the fix applied guest is running ~35 hours
and is still alive so far.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 20170601160847.23720-1-roman.penyaev@profitbricks.com
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Fam Zheng <famz@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: qemu-devel@nongnu.org
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
2017-06-01 16:08:47 +00:00
        to->caller = from;
        to->ctx = ctx;

        /* Store to->ctx before anything that stores to.  Matches
         * barrier in aio_co_wake and qemu_co_mutex_wake.
         */
        smp_wmb();

        ret = qemu_coroutine_switch(from, to, COROUTINE_ENTER);

        /* Queued coroutines are run depth-first; previously pending coroutines
         * run after those queued more recently.
         */
        QSIMPLEQ_PREPEND(&pending, &to->co_queue_wakeup);

        switch (ret) {
        case COROUTINE_YIELD:
            break;
        case COROUTINE_TERMINATE:
            assert(!to->locks_held);
            trace_qemu_coroutine_terminate(to);
            coroutine_delete(to);
            break;
        default:
            abort();
        }
    }
}
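The depth-first ordering produced by prepending each coroutine's wakeup queue to `pending` can be seen in a small stand-alone model. This is a sketch, not QEMU code: `Task`, `Queue`, `run_all` and the queue helpers are hypothetical stand-ins for `Coroutine`, `QSIMPLEQ` and the loop above, and "entering" a task is reduced to recording its id.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Task Task;

/* Singly linked tail queue, modelled loosely on QSIMPLEQ. */
typedef struct {
    Task *first;
    Task **last;   /* address of the final 'next' pointer */
} Queue;

struct Task {
    int id;
    Task *next;
    Queue wakeup;  /* tasks this one wants run after it */
};

static void queue_init(Queue *q)
{
    q->first = NULL;
    q->last = &q->first;
}

static void queue_push_tail(Queue *q, Task *t)
{
    t->next = NULL;
    *q->last = t;
    q->last = &t->next;
}

static Task *queue_pop_head(Queue *q)
{
    Task *t = q->first;
    if (t) {
        q->first = t->next;
        if (!q->first) {
            q->last = &q->first;
        }
    }
    return t;
}

/* Move every element of src to the front of dst, leaving src empty. */
static void queue_prepend(Queue *dst, Queue *src)
{
    if (!src->first) {
        return;
    }
    *src->last = dst->first;
    if (!dst->first) {
        dst->last = src->last;
    }
    dst->first = src->first;
    queue_init(src);
}

/* Run 'start' and everything it transitively wakes; record the visit order. */
static size_t run_all(Task *start, int *order, size_t cap)
{
    Queue pending;
    size_t n = 0;

    queue_init(&pending);
    queue_push_tail(&pending, start);
    while (pending.first) {
        Task *to = queue_pop_head(&pending);
        if (n < cap) {
            order[n++] = to->id;            /* stands in for entering 'to' */
        }
        queue_prepend(&pending, &to->wakeup);
    }
    return n;
}
```

If task 1 wakes tasks 2 and 3, and task 2 wakes task 4, the recorded order is 1, 2, 4, 3: the task queued most recently (4) runs before the one that was already pending (3). Because the loop works from a local queue, it does not recurse once per woken task.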
void qemu_coroutine_enter(Coroutine *co)
{
    qemu_aio_coroutine_enter(qemu_get_current_aio_context(), co);
}

void qemu_coroutine_enter_if_inactive(Coroutine *co)
{
    if (!qemu_coroutine_entered(co)) {
        qemu_coroutine_enter(co);
    }
}
void coroutine_fn qemu_coroutine_yield(void)
{
    Coroutine *self = qemu_coroutine_self();
    Coroutine *to = self->caller;

    trace_qemu_coroutine_yield(self, to);

    if (!to) {
        fprintf(stderr, "Co-routine is yielding to no one\n");
        abort();
    }

    self->caller = NULL;
    qemu_coroutine_switch(self, to, COROUTINE_YIELD);
}
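The way the `caller` pointer doubles as an "entered" flag across enter and yield can be illustrated with a minimal sketch. It assumes a Linux/glibc system where the POSIX `<ucontext.h>` functions are available, and it uses `swapcontext()` as a stand-in for `qemu_coroutine_switch()`. All names (`Co`, `sk_init`, `sk_enter`, `sk_yield`, `sk_entered`, `example_entry`) are hypothetical, and the pool, AioContext and wakeup queue are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <ucontext.h>

typedef struct Co {
    ucontext_t uc;
    struct Co *caller;        /* non-NULL only while the coroutine is entered */
    void (*entry)(void);
    bool done;
    char stack[64 * 1024];
} Co;

static Co leader;             /* represents the thread's original stack */
static Co *current = &leader;

static bool sk_entered(Co *co)
{
    return co->caller != NULL;
}

/* Switch into 'co'; returns when it yields or terminates. */
static void sk_enter(Co *co)
{
    Co *from = current;

    assert(!co->caller);      /* mirrors the recursive re-entry check */
    co->caller = from;
    current = co;
    swapcontext(&from->uc, &co->uc);
}

/* Switch back to whoever entered the running coroutine. */
static void sk_yield(void)
{
    Co *self = current;
    Co *to = self->caller;

    assert(to);               /* mirrors "yielding to no one" */
    self->caller = NULL;
    current = to;
    swapcontext(&self->uc, &to->uc);
}

/* First code to run on the coroutine's own stack. */
static void trampoline(void)
{
    Co *self = current;
    Co *to;

    self->entry();
    self->done = true;
    to = self->caller;
    self->caller = NULL;
    current = to;
    setcontext(&to->uc);      /* terminate: never returns here */
}

static void sk_init(Co *co, void (*entry)(void))
{
    getcontext(&co->uc);
    co->uc.uc_stack.ss_sp = co->stack;
    co->uc.uc_stack.ss_size = sizeof(co->stack);
    co->uc.uc_link = NULL;
    co->caller = NULL;
    co->entry = entry;
    co->done = false;
    makecontext(&co->uc, trampoline, 0);
}

/* Example body: does one step, yields, then does a second step. */
static int steps;
static bool seen_entered;

static void example_entry(void)
{
    seen_entered = sk_entered(current);
    steps++;
    sk_yield();
    steps++;
}
```

Entering `example_entry` once runs it up to the yield, at which point `caller` has already been cleared, so a check like `sk_entered()` reports false from the outside. A second `sk_enter()` resumes it after the yield and lets it terminate.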
bool qemu_coroutine_entered(Coroutine *co)
{
    return co->caller;
}

AioContext *coroutine_fn qemu_coroutine_get_aio_context(Coroutine *co)
{
    return co->ctx;
}