We are generally moving to int64_t for both offset and bytes parameters
on all io paths.
Main motivation is realization of 64-bit write_zeroes operation for
fast zeroing large disk chunks, up to the whole disk.
We chose signed type, to be consistent with off_t (which is signed) and
with possibility for signed return type (where negative value means
error).
So, convert driver read handlers parameters which are already 64bit to
signed type.
While being here, convert also flags parameter to be BdrvRequestFlags.
Now let's consider all callers. Simple
git grep '\->bdrv_\(aio\|co\)_preadv\(_part\)\?'
shows that's there three callers of driver function:
bdrv_driver_preadv() in block/io.c, passes int64_t, checked by
bdrv_check_qiov_request() to be non-negative.
qcow2_load_vmstate() does bdrv_check_qiov_request().
do_perform_cow_read() has uint64_t argument. And a lot of things in
qcow2 driver are uint64_t, so converting it is big job. But we must
not work with requests that don't satisfy bdrv_check_qiov_request(),
so let's just assert it here.
Still, the functions may be called directly, not only by drv->...
Let's check:
git grep '\.bdrv_\(aio\|co\)_preadv\(_part\)\?\s*=' | \
awk '{print $4}' | sed 's/,//' | sed 's/&//' | sort | uniq | \
while read func; do git grep "$func(" | \
grep -v "$func(BlockDriverState"; done
The only one such caller:
QEMUIOVector qiov = QEMU_IOVEC_INIT_BUF(qiov, &data, 1);
...
ret = bdrv_replace_test_co_preadv(bs, 0, 1, &qiov, 0);
in tests/unit/test-bdrv-drain.c, and it's OK obviously.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210903102807.27127-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
[eblake: fix typos]
Signed-off-by: Eric Blake <eblake@redhat.com>
We modify the request by adding an offset to vmstate. Let's check the
modified request. It will help us to safely move .bdrv_co_preadv_part
and .bdrv_co_pwritev_part to int64_t type of offset and bytes.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-3-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Only qcow2 driver supports vmstate.
In qcow2 these requests go through .bdrv_co_p{read,write}v_part
handlers.
So, let's do our basic check for the request on vmstate generic
handlers.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20210903102807.27127-2-vsementsov@virtuozzo.com>
Signed-off-by: Eric Blake <eblake@redhat.com>
Split checking for reserved bits out of aligned offset check.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Tested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-11-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
- use g_autofree for l1_table
- better name for size in bytes variable
- reduce code blocks nesting
- whitespaces, braces, newlines
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-9-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Check subcluster bitmap of the l2 entry for different types of
clusters:
- for compressed it must be zero
- for allocated check consistency of two parts of the bitmap
- for unallocated all subclusters should be unallocated
(or zero-plain)
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Tested-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Message-Id: <20210914122454.141075-7-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
We'll reuse the function to fix wrong L2 entry bitmap. Support it now.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-6-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Split fix_l2_entry_by_zero() out of check_refcounts_l2() to be
reused in further patch.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-5-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Add helper to parse compressed l2_entry and use it everywhere instead
of open-coding.
Note, that in most places we move to precise coffset/csize instead of
sector-aligned. Still it should work good enough for updating
refcounts.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-4-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Let's pass the whole L2 entry and not bother with
L2E_COMPRESSED_OFFSET_SIZE_MASK.
It also helps further refactoring that adds generic
qcow2_parse_compressed_l2_entry() helper.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Alberto Garcia <berto@igalia.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-3-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
- don't use same name for size in bytes and in entries
- use g_autofree for l2_table
- add whitespace
- fix block comment style
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210914122454.141075-2-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
There is no conflict and no dependency if we have parallel writes to
different subclusters of one cluster when the cluster itself is already
allocated. So, relax extra dependency.
Measure performance:
First, prepare build/qemu-img-old and build/qemu-img-new images.
cd scripts/simplebench
./img_bench_templater.py
Paste the following to stdin of running script:
qemu_img=../../build/qemu-img-{old|new}
$qemu_img create -f qcow2 -o extended_l2=on /ssd/x.qcow2 1G
$qemu_img bench -c 100000 -d 8 [-s 2K|-s 2K -o 512|-s $((1024*2+512))] \
-w -t none -n /ssd/x.qcow2
The result:
All results are in seconds
------------------ --------- ---------
old new
-s 2K 6.7 ± 15% 6.2 ± 12%
-7%
-s 2K -o 512 13 ± 3% 11 ± 5%
-16%
-s $((1024*2+512)) 9.5 ± 4% 8.4
-12%
------------------ --------- ---------
So small writes are more independent now and that helps to keep deeper
io queue which improves performance.
271 iotest output becomes racy for three allocation in one cluster.
Second and third writes may finish in different order. Second and
third requests don't depend on each other any more. Still they both
depend on first request anyway. Filter out second and third write
offsets to cover both possible outputs.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210824101517.59802-4-vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
[hreitz: s/ an / and /]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
No logic change, just prepare for the following commit. While being
here do also small grammar fix in a comment.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210824101517.59802-3-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
In mirror_iteration() we call mirror_wait_on_conflicts() with
`self` parameter set to NULL.
Starting from commit d44dae1a7c we dereference `self` pointer in
mirror_wait_on_conflicts() without checks if it is not NULL.
Backtrace:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 mirror_wait_on_conflicts (self=0x0, s=<optimized out>, offset=<optimized out>, bytes=<optimized out>)
at ../block/mirror.c:172
172 self->waiting_for_op = op;
[Current thread is 1 (Thread 0x7f0908931ec0 (LWP 380249))]
(gdb) bt
#0 mirror_wait_on_conflicts (self=0x0, s=<optimized out>, offset=<optimized out>, bytes=<optimized out>)
at ../block/mirror.c:172
#1 0x00005610c5d9d631 in mirror_run (job=0x5610c76a2c00, errp=<optimized out>) at ../block/mirror.c:491
#2 0x00005610c5d58726 in job_co_entry (opaque=0x5610c76a2c00) at ../job.c:917
#3 0x00005610c5f046c6 in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>)
at ../util/coroutine-ucontext.c:173
#4 0x00007f0909975820 in ?? () at ../sysdeps/unix/sysv/linux/x86_64/__start_context.S:91
from /usr/lib64/libc.so.6
Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2001404
Fixes: d44dae1a7c ("block/mirror: fix active mirror dead-lock in mirror_wait_on_conflicts")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-Id: <20210910124533.288318-1-sgarzare@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
bdrv_co_block_status() does it for us, we do not need to do it here.
The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210812084148.14458-7-hreitz@redhat.com>
bdrv_co_block_status() does it for us, we do not need to do it here.
The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210812084148.14458-6-hreitz@redhat.com>
bdrv_co_block_status() does it for us, we do not need to do it here.
The advantage of not capping *pnum is that bdrv_co_block_status() can
cache larger data regions than requested by its caller.
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-Id: <20210812084148.14458-5-hreitz@redhat.com>
As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform. The
main difference is that this time it is implemented as part of the
general block layer code.
The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow. Given the
implementation is outside of qemu, there is little we can do about its
performance.
We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts. So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.
To address this, we want to cache data regions. Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.
(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine. While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)
We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210812084148.14458-3-hreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
[hreitz: Added `local_file == bs` assertion, as suggested by Vladimir]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
gluster's block-status implementation is basically a copy of that in
block/file-posix.c, there is only one thing missing, and that is
aligning trailing data extents to the request alignment (as added by
commit 9c3db310ff).
Note that 9c3db310ff mentions that "there seems to be no other block
driver that sets request_alignment and [...]", but while block/gluster.c
does indeed not set request_alignment, block/io.c's
bdrv_refresh_limits() will still default to an alignment of 512 because
block/gluster.c does not provide a byte-aligned read function.
Therefore, unaligned tails can conceivably occur, and so we should apply
the change from 9c3db310ff to gluster's block-status implementation.
Reported-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210805143603.59503-1-mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
We expect the first qemu_vfio_dma_map() to fail (indicating
DMA mappings exhaustion, see commit 15a730e7a3). Do not
report the first failure as error, since we are going to
flush the mappings and retry.
This removes spurious error message displayed on the monitor:
(qemu) c
(qemu) qemu-kvm: VFIO_MAP_DMA failed: No space left on device
(qemu) info status
VM status: running
Reported-by: Tingting Mao <timao@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20210902070025.197072-12-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Currently qemu_vfio_dma_map() displays errors on stderr.
When using management interface, this information is simply
lost. Pass qemu_vfio_dma_map() an Error** handle so it can
propagate the error to callers.
Reviewed-by: Fam Zheng <fam@euphon.net>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20210902070025.197072-7-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
nvme_create_queue_pair() does not return a boolean value (indicating
eventual error) but a pointer, and is inconsistent in how it fills the
error handler. To fulfill callers expectations, always set an error
message on failure.
Reported-by: Auger Eric <eric.auger@redhat.com>
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20210902070025.197072-6-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Fix when building with -Wshorten-64-to-32:
warning: implicit conversion loses integer precision: 'unsigned long' to 'int' [-Wshorten-64-to-32]
Reviewed-by: Klaus Jensen <k.jensen@samsung.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20210902070025.197072-2-philmd@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Make 'qemu-img commit' work on Windows.
Command 'commit' requires reopening backing file in RW mode. So,
add reopen prepare/commit/abort handlers and change dwShareMode
for CreateFile call in order to allow further read/write reopening.
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/418
Suggested-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Viktor Prutyanov <viktor.prutyanov@phystech.edu>
Tested-by: Helge Konetzka <hk@zapateado.de>
Message-Id: <20210825173625.19415-1-viktor.prutyanov@phystech.edu>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Include linux/fs.h to avoid the following build failure on uclibc or
musl raised since version 6.0.0:
../block/export/fuse.c: In function 'fuse_lseek':
../block/export/fuse.c:641:19: error: 'SEEK_HOLE' undeclared (first use in this function)
641 | if (whence != SEEK_HOLE && whence != SEEK_DATA) {
| ^~~~~~~~~
../block/export/fuse.c:641:19: note: each undeclared identifier is reported only once for each function it appears in
../block/export/fuse.c:641:42: error: 'SEEK_DATA' undeclared (first use in this function); did you mean 'SEEK_SET'?
641 | if (whence != SEEK_HOLE && whence != SEEK_DATA) {
| ^~~~~~~~~
| SEEK_SET
Fixes:
- http://autobuild.buildroot.org/results/33c90ebf04997f4d3557cfa66abc9cf9a3076137
Signed-off-by: Fabrice Fontaine <fontaine.fabrice@gmail.com>
Message-Id: <20210827220301.272887-1-fontaine.fabrice@gmail.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
The only caller pass copy_range and compress both false. Let's just
drop these arguments.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210824083856.17408-35-vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Finally, copy-before-write gets own .bdrv_open and .bdrv_close
handlers, block_init() call and becomes available through bdrv_open().
To achieve this:
- cbw_init gets unused flags argument and becomes cbw_open
- block_copy_state_free() call moved to new cbw_close()
- in bdrv_cbw_append:
- options are completed with driver and node-name, and we can simply
use bdrv_insert_node() to do both open and drained replacing
- in bdrv_cbw_drop:
- cbw_close() is now responsible for freeing s->bcs, so don't do it
here
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-22-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Now block-copy will crash if user don't set progress meter by
block_copy_set_progress_meter(). copy-before-write filter will be used
in separate of backup job, and it doesn't want any progress meter (for
now). So, allow not setting it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-21-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
We are going to publish copy-before-write filter to be used in separate
of backup. Future step would support bitmap for the filter. But let's
start from full set bitmap.
We have to modify backup, as bitmap is first initialized by
copy-before-write filter, and then backup modifies it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-20-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
One more step closer to .bdrv_open(): use options instead of plain
arguments. Move to bdrv_open_child() calls, native for drive open
handlers.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210824083856.17408-19-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210824083856.17408-18-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
In the next commit we'll get rid of source argument of cbw_init().
Prepare to it now, to make next commit simpler: move the code block
that uses source below attaching the child and use bs->file->bs instead
of source variable.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-17-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
One more step closer to real .bdrv_open() handler: use more usual names
for bs being initialized and its state.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-16-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Move part of bdrv_cbw_append() to new function cbw_open(). It's an
intermediate step for adding normal .bdrv_open() handler to the
filter. With this commit no logic is changed, but we have a function
which will be turned into .bdrv_open() handler in future commit.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-15-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Refactor the function to replace child at last. Thus we don't need to
revert it and code is simplified.
block-copy state initialization being done before replacing the child
doesn't need any drained section.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-14-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
We are going to publish copy-before-write filter, and there no public
backing-child-based filter in Qemu. No reason to create a precedent, so
let's refactor copy-before-write filter instead.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-13-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
bdrv_attach_child() do bdrv_unref() on failure, so we shouldn't do it
by hand here.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-12-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
We are going to publish copy-before-write filter. So, user should be
able to create it with blockdev-add first, specifying both filtered and
target children. And then do blockdev-reopen, to actually insert the
filter where needed.
Currently, filter unshares write permission unconditionally on source
node. It's good, but it will not allow to do blockdev-add. So, let's
relax restrictions when filter doesn't have any parent.
Test output is modified, as now permission conflict happens only when
job creates a blk parent for filter node.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-11-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
The main consumer of cluster-size is block-copy. Let's calculate it
here instead of passing through backup-top.
We are going to publish copy-before-write filter soon, so it will be
created through options. But we don't want for now to make explicit
option for cluster-size, let's continue to calculate it automatically.
So, now is the time to get rid of cluster_size argument for
bdrv_cbw_append().
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-10-vsementsov@virtuozzo.com>
[hreitz: Add qemu/error-report.h include to block/block-copy.c]
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
We are going to publish copy-before-write filter, so it would be
initialized through options. Still we don't want to publish compress
and copy-range options, as
1. Modern way to enable compression is to use compress filter.
2. For copy-range it's unclean how to make proper interface:
- it's has experimental prefix for backup job anyway
- the whole BackupPerf structure doesn't make sense for the filter
So, let's just add copy-range possibility to the filter later if
needed.
Still, we are going to continue support for compression and
experimental copy-range in backup job. So, set these options after
filter creation.
Note, that we can drop "compress" argument of bdrv_cbw_append() now, as
well as "perf". The only reason not doing so is that now, when I
prepare this patch the big series around it is already reviewed and I
want to avoid extra rebase conflicts to simplify review of the
following version.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210824083856.17408-9-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
We'll need a possibility to set compress and use_copy_range options
after initialization of the state. So make corresponding part of
block_copy_state_new() separate and public.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20210824083856.17408-8-vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
We want to simplify initialization interface of copy-before-write
filter as we are going to make it public. So, let's detect fleecing
scheme exactly in block-copy code, to not pass this information through
extra levels.
Why not just set BDRV_REQ_SERIALISING unconditionally: because we are
going to implement new more efficient fleecing scheme which will not
rely on backing feature.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Hanna Reitz <hreitz@redhat.com>
Message-Id: <20210824083856.17408-7-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
We are going to convert backup_top to full featured public filter,
which can be used in separate of backup job. Start from renaming from
"how it used" to "what it does".
While updating comments in 283 iotest, drop and rephrase also things
about ".active", as this field is now dropped, and filter doesn't have
"inactive" mode.
Note that this change may be considered as incompatible interface
change, as backup-top filter format name was visible through
query-block and query-named-block-nodes.
Still, consider the following reasoning:
1. backup-top was never documented, so if someone depends on format
name (for driver that can't be used other than it is automatically
inserted on backup job start), it's a kind of "undocumented feature
use". So I think we are free to change it.
2. There is a hope, that there is no such users: it's a lot more native
to give a good node-name to backup-top filter if need to operate
with it somehow, and don't touch format name.
3. Another "incompatible" change in further commit would be moving
copy-before-write filter from using backing child to file child. And
this is even more reasonable than renaming: for now all public
filters are file-child based.
So, it's a risky change, but risk seems small and good interface worth
it.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-6-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Add function to change bs inside blk.
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-Id: <20210824083856.17408-3-vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
The following command-line fails due to a permissions conflict:
$ qemu-storage-daemon \
--blockdev driver=nvme,node-name=nvme0,device=0000:08:00.0,namespace=1 \
--blockdev driver=raw,node-name=l1-1,file=nvme0,offset=0,size=1073741824 \
--blockdev driver=raw,node-name=l1-2,file=nvme0,offset=1073741824,size=1073741824 \
--nbd-server addr.type=unix,addr.path=/tmp/nbd.sock,max-connections=2 \
--export type=nbd,id=nbd-l1-1,node-name=l1-1,name=l1-1,writable=on \
--export type=nbd,id=nbd-l1-2,node-name=l1-2,name=l1-2,writable=on
qemu-storage-daemon: --export type=nbd,id=nbd-l1-1,node-name=l1-1,name=l1-1,writable=on: Permission conflict on node 'nvme0': permissions 'resize' are both required by node 'l1-1' (uses node 'nvme0' as 'file' child) and unshared by node 'l1-2' (uses node 'nvme0' as 'file' child).
The problem is that block/raw-format.c relies on bdrv_default_perms() to
set permissions on the nvme node. The default permissions add RESIZE in
anticipation of a format driver like qcow2 that needs to grow the image
file. This fails because RESIZE is unshared, so we cannot get the RESIZE
permission.
Max Reitz pointed out that block/crypto.c already handles this case by
implementing a custom ->bdrv_child_perm() function that adjusts the
result of bdrv_default_perms().
This patch takes the same approach in block/raw-format.c so that RESIZE
is only required if it's actually necessary (e.g. the parent is qcow2).
Cc: Max Reitz <mreitz@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-Id: <20210726122839.822900-1-stefanha@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>
Signed-off-by: Mao Zhongyi <maozhongyi@cmss.chinamobile.com>
Message-Id: <20210802062507.347555-1-maozhongyi@cmss.chinamobile.com>
Signed-off-by: Hanna Reitz <hreitz@redhat.com>