A BlockDriverState should not issue requests on itself through the
public block layer interface. Nested, or reentrant, requests are
problematic because they do I/O throttling and request tracking twice.
Features like block layer copy-on-read use request tracking to avoid
race conditions between concurrent requests. The reentrant request will
have to "wait" for its parent request to complete. But the parent is
waiting for the reentrant request to make progress so we have reached
deadlock.
The solution is for block drivers to avoid the public block layer
interfaces for reentrant requests. Instead they should call their own
internal functions if they wish to perform reentrant requests.
This is also a good opportunity to make copy_sectors() a true
coroutine_fn. That means calling bdrv_co_writev() instead of
bdrv_write(). Behavior is unchanged but we're being explicit that this
executes in coroutine context.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Unlocking during COW allows for more parallelism. One change it requires is
that buffers are dynamically allocated instead of just using a per-image
buffer.
While touching the code, drop the synchronous qcow2_read() function and replace
it by a bdrv_read() call.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If during allocation of compressed clusters the cluster was already allocated
uncompressed, fail and properly release the l2_table (the latter avoids a
failed assertion).
While at it, make it return some real error numbers instead of -1.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Dong Xu Wang <wdongxu@linux.vnet.ibm.com>
QCowL2Meta::offset is not cluster aligned but only sector aligned
however nb_clusters count cluster from cluster start.
This fix range check. Note that old code have no corruption issues
related to this check cause it only cause intersection to occur
when shouldn't.
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QCow2Meta structure was inserted into list before many fields are
initialized. Currently is not a problem cause all occur in a lock
but if qcow2_alloc_clusters would in a future unlock this lock
some issues could arise.
Initializing fields before inserting fix the problem.
Signed-off-by: Frediano Ziglio <freddy77@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Documentation states the num is measured in clusters, but its
actually measured in sectors
Signed-off-by: Devin Nakamura <devin122@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If qcow2_cache_put returns an error during cluster allocation and the
allocation fails, it must be removed from the list of in-flight allocations.
Otherwise we'd get a loop in the list when the ACB is used for the next
allocation.
Luckily, this qcow2_cache_put shouldn't fail anyway because the L2 table is
only read, so that qcow2_cache_put doesn't even involve I/O.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
This fixes memory leaks that may be caused by I/O errors during L1 table growth
(can happen during save_vm) and in qemu-img check.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When copying L2 tables (this happens only with internal snapshots), the order
wasn't completely safe, so that after a crash you could end up with a L2 table
that has too low refcount, possibly leading to corruption in the long run.
This patch puts the operations in the right order: First allocate the new
L2 table and replace the reference, and only then decrease the refcount of the
old table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When reading a compressed cluster failed, qcow2 falsely returned success.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
This adds a bdrv_discard function to qcow2 that frees the discarded clusters.
It does not yet pass the discard on to the underlying file system driver, but
the space can be reused by future writes to the image.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
qcow2 calls bdrv_flush() after performing COW in order to ensure that the
L2 table change is never written before the copy is safe on disk. Now that the
L2 table is cached, we can wait with flushing until we write out the next L2
table.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
cpu_to_be64w() is called with an obviously non-aligned pointer. Use
cpu_to_be64wu() instead. It fixes unaligned accesses errors on IA64
hosts.
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It doesn't really make sense for functions in qcow2.c to be named
qcow_ so convert the names to match correctly.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The cache content may be destroyed after a failed read, better not use it any
more.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
The L1 table grow operation includes a size calculation that bumps up
the new L1 table size in order to anticipate the size needs of vmstate
data. This helps reduce the number of times that the L1 table has to be
grown when vmstate data is appended.
This size overhead is not necessary during image creation,
bdrv_truncate(), or snapshot goto operations. In fact, existing
qemu-iotests that exercise table growth are no longer able to trigger it
because image creation preallocates an L1 table that is too large after
changes to qcow_create2().
This patch keeps the size calculation but also adds exact growth for
callers that do not want to inflate the L1 table size unnecessarily.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2 used to use bounce buffers for any AIO requests. This does not only imply
unnecessary copying, but also unbounded allocations which should be avoided.
This patch removes bounce buffers from the normal AIO read path, and constrains
them to a constant size for encrypted images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We always have a sync for the refcount update when a new cluster is
allocated. If we move this past the COW, we can save an additional sync.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When a new cluster was allocated, we only need a flush after the write to the
L2 table if it was a COW and we need to decrease the refcounts of the old
clusters.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If writing the L1 table to disk failed, we need to restore its old content in
memory to avoid inconsistencies.
Reported-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2_get_cluster_offset() looks up a given virtual disk offset and returns the
offset of the corresponding cluster in the image file. Errors (e.g. L2 table
can't be read) are currenctly indicated by a return value of 0, which is
unfortuately the same as for any unallocated cluster. So in effect we can't
check for errors.
This makes the old return value a by-reference parameter and returns the usual
0/-errno error code.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
l2_allocate has some intermediate states in which the image is inconsistent.
Change the order to write to the L1 table only after the new L2 table has
successfully been initialized.
Also reset the L2 cache in failure case, it's very likely wrong.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the L2 table was already updated in cache, but writing it to disk has
failed, we must not continue using the changed version in the cache to stay
consistent with what's on the disk.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Format drivers shouldn't need to bother with things like file names, but rather
just get an open BlockDriverState for the underlying protocol. This patch
introduces this behaviour for bdrv_open implementation. For protocols which
need to access the filename to open their file/device/connection/... a new
callback bdrv_file_open is introduced which doesn't get an underlying file
opened.
For now, also some of the more obscure formats use bdrv_file_open because they
open() the file themselves instead of using the block.c functions. They need to
be fixed in later patches.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Returning NULL on error doesn't allow distinguishing between different errors.
Change the interface to return an integer for -errno.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If we complete a request with a failure we need to remove it from the list of
requests that are in flight. If we don't do it, the next time the same AIOCB is
used for a cluster allocation it will create a loop in the list and qemu will
hang in an endless loop.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
If a write requests crosses a L2 table boundary and all clusters until the
end of the L2 table are usable for the request, we must not look at the next
L2 entry because we already have arrived at the end of the array.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Checking for return codes < 0 isn't really going to work with unsigned
types. Use signed types instead.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Now that qcow2_alloc_clusters can return error codes, we must handle them in
the callers of qcow2_alloc_clusters.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Returning 0/-errno allows it to distingush different errors classes. The
cluster offset of newly allocated clusters is now returned in the QCowL2Meta
struct.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Switching to 0/-errno allows it to distinguish different error cases.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Return the appropriate error value instead of always using EIO. Don't free the
L1 table on errors, we still need it.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Images with disk size 0 may be used for
VM snapshots, but not to save normal block data.
It is possible to create such images using
qemu-img, but opening them later fails.
So even "qemu-img info image.qcow2" is not
possible for an image created with
"qemu-img create -f qcow2 image.qcow2 0".
This is fixed here.
Signed-off-by: Stefan Weil <weil@mail.berlios.de>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
It was merely a workaround and the real fix is done now.
This reverts commit ef845c3bf4.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
When the synchronous read and write functions were dropped, they were replaced
by generic emulation functions. Unfortunately, these emulation functions don't
provide the same semantics as the original functions did.
The original bdrv_read would mean that we read some data synchronously and that
we won't be interrupted during this read. The latter assumption is no longer
true with the emulation function which needs to use qemu_aio_poll and therefore
allows the callback of any other concurrent AIO request to be run during the
read. Which in turn means that (meta)data read earlier could have changed and
be invalid now. qcow2 is not prepared to work in this way and it's just scary
how many places there are where other requests could run.
I'm not sure yet where exactly it breaks, but you'll see breakage with virtio
on qcow2 with a backing file. Providing synchronous functions again fixes the
problem for me.
Patchworks-ID: 35437
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
This patch increases the maximum qcow2 cluster size to 2 MB. Starting with 128k
clusters, L2 tables span 2 GB or more of virtual disk space, causing 32 bit
truncation and wraparound of signed integers. Therefore some variables need to
use a larger data type.
While being at reviewing data types, change some integers that are used for
array indices to unsigned. In some places they were checked against some upper
limit but not for negative values. This could avoid potential segfaults with
corrupted qcow2 images.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>