Commit 5500316 (May 2012) implemented raw_co_is_allocated() as
follows:
1. If defined(CONFIG_FIEMAP), use the FS_IOC_FIEMAP ioctl
2. Else if defined(SEEK_HOLE) && defined(SEEK_DATA), use lseek()
3. Else pretend there are no holes
Later on, raw_co_is_allocated() was generalized to
raw_co_get_block_status().
Commit 4f11aa8 (May 2014) changed it to try the three methods in order
until success, because "there may be implementations which support
[SEEK_HOLE/SEEK_DATA] but not [FIEMAP] (e.g., NFSv4.2) as well as vice
versa."
Unfortunately, we used FIEMAP incorrectly: we lacked FIEMAP_FLAG_SYNC.
Commit 38c4d0a (Sep 2014) added it. Because that's a significant
speed hit, the next commit 7c159037 put SEEK_HOLE/SEEK_DATA first.
As you see, the obvious use of FIEMAP is wrong, and the correct use is
slow. I guess this puts it somewhere between -7 "The obvious use is
wrong" and -10 "It's impossible to get right" on Rusty Russel's Hard
to Misuse scale[*].
"Fortunately", the FIEMAP code is used only when
* SEEK_HOLE/SEEK_DATA aren't defined, but CONFIG_FIEMAP is
Uncommon. SEEK_HOLE had no XFS implementation between 2011 (when it
was introduced for ext4 and btrfs) and 2012.
* SEEK_HOLE/SEEK_DATA and CONFIG_FIEMAP are defined, but lseek() fails
Unlikely.
Thus, the FIEMAP code executes rarely. Makes it a nice hidey-hole for
bugs. Worse, bugs hiding there can theoretically bite even on a host
that has SEEK_HOLE/SEEK_DATA.
I don't want to worry about this crap, not even theoretically. Get
rid of it.
[*] http://ozlabs.org/~rusty/index.cgi/tech/2008-04-01.html
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Missed in commit 705be72.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
When extent types don't match, we return -ENOTSUP. In this case, be
polite to the caller and don't modify bdi.
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1415938161-16217-1-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The block layer read and write functions do not like requests which are
bigger than INT_MAX bytes. Since the VDI bmap is read and written in a
single operation, its size is therefore limited accordingly. This
reduces the maximum VDI image size supported by QEMU to half of what it
currently is (down to approximately 512 TB).
The VDI test 084 has to be adapted accordingly. Actually, one could
clearly see that it was broken from the "Could not open
'TEST_DIR/t.IMGFMT': Invalid argument" line for an image which was
supposed to work just fine.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Peter Lieven <pl@kamp.de>
Version: GnuPG v1
iQEcBAABAgAGBQJUV2wdAAoJEJykq7OBq3PIjcAH/29rl938ETw1wjXxYe3uH+R6
K2yFEiPh9/cOJSH0mJ+gD8DZIN+iyR4eoQGP2s5ALFPcX3bkYxRLlUeYK0BCp883
esc7gO6XPhLvTVqP0xgACRCdUwH2I0VTToDlHjXXZogyI/DuDX3gzWJufE3x1DGs
WNTMOp5n/uYkWH3rI3DkInmbSddEz3pgX65a8BuYtw0V/RSeSRnHKDYHMygvJBRL
EVfWRNeOIrZ730CyJry0t8ITjsZxiBDKXR5glNSwaIfQUfGkTSWi9YNSurNYkUDr
aMS2rgvOVlrOUDKTHUj9oS3jgoGWcDtlk9E1MeSoyIptbRoMhdFVl1AUJZsrMJU=
=Mfbu
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/stefanha/tags/block-pull-request' into staging
# gpg: Signature made Mon 03 Nov 2014 11:50:53 GMT using RSA key ID 81AB73C8
# gpg: Good signature from "Stefan Hajnoczi <stefanha@redhat.com>"
# gpg: aka "Stefan Hajnoczi <stefanha@gmail.com>"
* remotes/stefanha/tags/block-pull-request: (53 commits)
block: declare blockjobs and dataplane friends!
block: let commit blockjob run in BDS AioContext
block: let mirror blockjob run in BDS AioContext
block: let stream blockjob run in BDS AioContext
block: let backup blockjob run in BDS AioContext
block: add bdrv_drain()
blockjob: add block_job_defer_to_main_loop()
blockdev: add note that block_job_cb() must be thread-safe
blockdev: acquire AioContext in blockdev_mark_auto_del()
blockdev: acquire AioContext in do_qmp_query_block_jobs_one()
block: acquire AioContext in generic blockjob QMP commands
iotests: Expand test 061
block/qcow2: Simplify shared L2 handling in amend
block/qcow2: Make get_refcount() global
block/qcow2: Implement status CB for amend
qemu-img: Fix insignificant memleak
qemu-img: Add progress output for amend
block: Add status callback to bdrv_amend_options()
block: qemu-iotest 107 supports NFS
iotests: Add test for qcow2's bdrv_make_empty
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJUVhuDAAoJEL7lnXSkw9fbOKMIAIE3XZMhar4Vmokb/K0DFbnh
gy2z7iCe7vumLKiRSJX1LGmkFO3dwykw82JZQ1SVo0RdgguJ5dx1Abx1qDM1rojL
jJT0pJ9zWPl4fTv38wCEfaysQHPdgwoH4826ga+MXnVS9XHRHHxuQ4vI01AK3oyQ
4t6/wto9H8kF3n6ny7tz5WNZClsq7qbiIqw5nNCILQfSh/VBPwxQNBiWf/nYVMuY
Ubk5noztZwH+hbiAQL5lAPz/HolcRwg1tzbR0dfmt8/aqO28rJhasG58JgtziI2y
JSg4BwldqUQEgiHonArLfQDixjLtEEyL+fQSzZm02ixwcBpc/ADSyGDy2R1zpH8=
=j1ga
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/mjt/tags/pull-trivial-patches-2014-11-02' into staging
trivial patches for 2014-11-02
# gpg: Signature made Sun 02 Nov 2014 11:54:43 GMT using RSA key ID A4C3D7DB
# gpg: Good signature from "Michael Tokarev <mjt@tls.msk.ru>"
# gpg: aka "Michael Tokarev <mjt@corpit.ru>"
# gpg: aka "Michael Tokarev <mjt@debian.org>"
* remotes/mjt/tags/pull-trivial-patches-2014-11-02: (23 commits)
vdi: wrapped uuid_unparse() in #ifdef
tap: fix possible fd leak in net_init_tap
tap: do not close(fd) in net_init_tap_one
target-i386: Remove unused model_features_t struct
tap_int.h: remove repeating NETWORK_SCRIPT defines
os-posix: reorder parent notification for -daemonize
pidfile: stop making pidfile error a special case
os-posix: replace goto again with a proper loop
os-posix: use global daemon_pipe instead of cryptic fds[1]
dump: Fix dump-guest-memory termination and use-after-close
virtio-9p-proxy: improve error messages in connect_namedsocket()
virtio-9p-proxy: fix error return in proxy_init()
virtio-9p-proxy: Fix sockfd leak
target-tricore: check return value before using it
net/slirp: specify logbase for smbd
Revert "os-posix: report error message when lock file failed"
util: Improve os_mem_prealloc error message
sparse: fix build
target-arm: A64: remove redundant store
target-xtensa: mark XtensaConfig structs as unused
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
The commit block job must run in the BlockDriverState AioContext so that
it works with dataplane.
Acquire the AioContext in blockdev.c so starting the block job is safe.
One detail here is that the bdrv_drain_all() must be moved inside the
aio_context_acquire() region so requests cannot sneak in between the
drain and acquire.
The completion code in block/commit.c must perform backing chain
manipulation and bdrv_reopen() from the main loop. Use
block_job_defer_to_main_loop() to achieve that.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1413889440-32577-11-git-send-email-stefanha@redhat.com
The mirror block job must run in the BlockDriverState AioContext so that
it works with dataplane.
Acquire the AioContext in blockdev.c so starting the block job is safe.
Note that to_replace is treated separately from other BlockDriverStates
in that it does not need to be in the same AioContext. Explicitly
acquire/release to_replace's AioContext when accessing it.
The completion code in block/mirror.c must perform BDS graph
manipulation and bdrv_reopen() from the main loop. Use
block_job_defer_to_main_loop() to achieve that.
The bdrv_drain_all() call is not allowed outside the main loop since it
could lead to lock ordering problems. Use bdrv_drain(bs) instead
because we have acquired the AioContext so nothing else can sneak in
I/O.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1413889440-32577-10-git-send-email-stefanha@redhat.com
The stream block job must run in the BlockDriverState AioContext so that
it works with dataplane.
The basics of acquiring the AioContext are easy in blockdev.c.
The tricky part is the completion code which drops part of the backing
file chain. This must be done in the main loop where bdrv_unref() and
bdrv_close() are safe to call. Use block_job_defer_to_main_loop() to
achieve that.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1413889440-32577-9-git-send-email-stefanha@redhat.com
The backup block job must run in the BlockDriverState AioContext so that
it works with dataplane.
The basics of acquiring the AioContext are easy in blockdev.c.
The completion code in block/backup.c must call bdrv_unref() from the
main loop. Use block_job_defer_to_main_loop() to achieve that.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Message-id: 1413889440-32577-8-git-send-email-stefanha@redhat.com
Currently, we have a bitmap for keeping track of which clusters have
been created during the zero cluster expansion process. This was
necessary because we need to properly increase the refcount for shared
L2 tables.
However, now we can simply take the L2 refcount and use it for the
cluster allocated for expansion. This will be the correct refcount and
therefore we don't have to remember that cluster having been allocated
any more.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Message-id: 1414404776-4919-7-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reading the refcount of a cluster is an operation which can be useful in
all of the qcow2 code, so make that function globally available.
While touching this function, amend the comment describing the "addend"
parameter: It is (no longer, if it ever was) necessary to have it set to
-1 or 1; any value is fine.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Message-id: 1414404776-4919-6-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The only really time-consuming operation potentially performed by
qcow2_amend_options() is zero cluster expansion when downgrading qcow2
images from compat=1.1 to compat=0.10, so report status of that
operation and that operation only through the status CB.
For this, approximate the progress as the number of L1 entries visited
during the operation.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benoit Canet <benoit@irqsave.net>
Message-id: 1414404776-4919-5-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Depending on the changed options and the image format,
bdrv_amend_options() may take a significant amount of time. In these
cases, a way to be informed about the operation's status is desirable.
Since the operation is rather complex and may fundamentally change the
image, implementing it as AIO or a coroutine does not seem feasible. On
the other hand, implementing it as a block job would be significantly
more difficult than a simple callback and would not add benefits other
than progress report to the amending operation, because it should not
actually be run as a block job at all.
A callback may not be very pretty, but it's very easy to implement and
perfectly fits its purpose here.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1414404776-4919-2-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
qemu-img should use QMP commands whenever possible in order to ensure
feature completeness of both online and offline image operations. As
qemu-img itself has no access to QMP (since this would basically require
just everything being linked into qemu-img), imitate QMP's
implementation of block-commit by using commit_active_start() and then
waiting for the block job to finish.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1414159063-25977-9-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Instead of taking the total length of the block device as the block
job's length, use the number of dirty sectors. The progress is now the
number of sectors mirrored to the target block device. Note that this
may result in the job's length increasing during operation, which is
however in fact desirable.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1414159063-25977-8-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
bdrv_make_empty() is currently only called if the current image
represents an external snapshot that has been committed to its base
image; it is therefore unlikely to have internal snapshots. In this
case, bdrv_make_empty() can be greatly sped up by emptying the L1 and
refcount table (while having the dirty flag set, which only works for
compat=1.1) and creating a trivial refcount structure.
If there are snapshots or for compat=0.10, fall back to the simple
implementation (discard all clusters).
[Applied s/clusters/cluster/ typo fix suggested by Eric Blake
--Stefan]
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1414159063-25977-4-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Implement this function by making all clusters in the image file fall
through to the backing file (by using the recently extended discard).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1414159063-25977-3-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Normally, discarded sectors should read back as zero. However, there are
cases in which a sector (or rather cluster) should be discarded as if
they were never written in the first place, that is, reading them should
fall through to the backing file again.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1414159063-25977-2-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Instead of generating the full return value thrice in try_fiemap(),
try_seek_hole() and as a fall-back in raw_co_get_block_status() itself,
generate the value only in raw_co_get_block_status().
While at it, also remove the pnum parameter from try_fiemap() and
try_seek_hole().
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1414148280-17949-3-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
As its comment states, raw_co_get_block_status() should unconditionally
return 0 and set *pnum to 0 for after EOF.
An assertion after lseek(..., SEEK_HOLE) tried to catch this case by
asserting that errno != -ENXIO (which would indicate a position after
the EOF); but it should be errno != ENXIO instead. Regardless of that,
there should be no such assertion at all. If bdrv_getlength() returned
an outdated value and the image has been resized outside of qemu,
lseek() will return with errno == ENXIO. Just return that value as an
error then.
Setting *pnum to 0 and returning 0 should not be done here, as in that
case we should update the device length as well. So, from qemu's
perspective, the file has not been resized; it's just that there was an
error querying sectors beyond a certain point (the actual file size).
Additionally, nb_sectors should be clamped against the image end. This
was probably not an issue if FIEMAP or SEEK_HOLE/SEEK_DATA worked, but
the fallback did not take this case into account.
Reported-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-id: 1414148280-17949-2-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
qemu_opt_get_number returns a uint64_t, and curl_easy_setopt expects a
long (not an int). There is no warning about the latter type error
because curl_easy_setopt uses a varargs argument.
Store the timeout (which is a positive number of seconds) as a
uint64_t. Check that the number given by the user is reasonable.
Zero is permissible (meaning no timeout is enforced by cURL).
Cast it to long before calling curl_easy_setopt to fix the type error.
Example error message after this change has been applied:
$ ./qemu-img create -f qcow2 /tmp/test.qcow2 \
-b 'json: { "file.driver":"https",
"file.url":"https://foo/bar",
"file.timeout":-1 }'
qemu-img: /tmp/test.qcow2: Could not open 'json: { "file.driver":"https", "file.url":"https://foo/bar", "file.timeout":-1 }': timeout parameter is too large or negative: Invalid argument
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If there are still pending i/o while deleting snapshot,
because deleting snapshot is done in non-coroutine context, and
the pending i/o read/write (bdrv_co_do_rw) is done in coroutine context,
so it's possible to cause concurrency problem between above two operations.
Add bdrv_drain_all() to bdrv_snapshot_delete() to avoid this problem.
Signed-off-by: Zhang Haoyu <zhanghy@sangfor.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 201410211637596311287@sangfor.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This fixes Ceph issue 2467: ttp://tracker.ceph.com/issues/2467
[Dropped return r in void function as suggested by Josh Durgin
<josh.durgin@inktank.com>.
--Stefan]
Signed-off-by: Adam Crume <adamcrume@gmail.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-id: 1412880272-3154-1-git-send-email-adamcrume@gmail.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
found by valgrind.
Command: ./qemu-img convert -f parallels -O qcow2 1.hds 1.img
Invalid read of size 4
at 0x17D0EF: parallels_co_read (parallels.c:357)
by 0x11FEE4: bdrv_aio_rw_vector (block.c:4640)
by 0x11FFBF: bdrv_aio_readv_em (block.c:4652)
by 0x11F55F: bdrv_co_readv_em (block.c:4862)
by 0x123428: bdrv_aligned_preadv (block.c:3056)
by 0x1239FA: bdrv_co_do_preadv (block.c:3162)
by 0x125424: bdrv_rw_co_entry (block.c:2706)
by 0x155DD9: coroutine_trampoline (coroutine-ucontext.c:118)
by 0x6975B6F: ??? (in /lib/x86_64-linux-gnu/libc-2.19.so)
The problem is that s->catalog_bitmap is allocated/filled as
gmalloc(s->catalog_size) thus index validity check must be
inclusive, i.e. index >= s->catalog_size is invalid.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Message-id: 1412759610-2257-4-git-send-email-den@openvz.org
CC: Jeff Cody <jcody@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Cancel oversized requests early. They would generate
an iSCSI protocol error anyway; after having transferred
possibly a lot of data over the wire.
Suggested-By: Max Reitz <mreitz@redhat.com>
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
As Max pointed out there is a hidden cast from int64_t to int for all
limits. So use the newly introduced sector_limits_lun2qemu for all
limits received from the target.
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Copy the max_xfer_len from the BlockLimits VPD or use the
maximum value fitting in the CDB.
The helper function sector_limits_lun2qemu is introduced to convert
and cap the limits from the VPD to the maximum power of two fitting
in an integer; integer is the range for nb_sectors throughout
the block layer.
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Wrapped uuid_unparse() in #ifdef to avoid "-Wunused-function"
on clang 3.4 or later.
Signed-off-by: SeokYeon Hwang <syeon.hwang@samsung.com>
Reviewed-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Before, when a write protected iSCSI target is attached as scsi-disk
with BDRV_O_RDWR, we report it as writable, while in fact all writes
will fail.
One way to improve this is to report write protect flag as true to
guest, but a even better way is to refuse using a write protected LUN to
guest.
Target write protect flag is checked with a mode sense query.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce a new flag to mark devices that require requests to be aligned and
replace the usage of BDRV_O_NOCACHE and O_DIRECT with this flag when
appropriate.
If a character device is used as a backend on a FreeBSD host set this flag
unconditionally.
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
While writing an L1 table sector, qcow2_write_l1_entry() copies the
respective range from s->l1_table to the local "buf" array. The size of
s->l1_table does not have to be a multiple of L1_ENTRIES_PER_SECTOR;
thus, limit the index which is used for copying all entries to the L1
size.
Cc: qemu-stable@nongnu.org
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With BDRVQcowState.refcount_block_bits, we don't need REFCOUNT_SHIFT
anymore.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Because the old refcount structure will be leaked after having rebuilt
it, we need to recalculate the refcounts and run a leak-fixing operation
afterwards (if leaks should be fixed at all).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The previous commit introduced the "rebuild" variable to qcow2's
implementation of the image consistency check. Now make use of this by
adding a function which creates a completely new refcount structure
based solely on the in-memory information gathered before.
The old refcount structure will be leaked, however. This leak will be
dealt with in a follow-up commit.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If a referenced cluster has a refcount of 0, increasing its refcount may
result in clusters being allocated for the refcount structures. This may
overwrite the referenced cluster, therefore we cannot simply increase
the refcount then.
In such cases, we can either try to replicate all the refcount
operations solely for the check operation, basing the allocations on the
in-memory refcount table; or we can simply rebuild the whole refcount
structure based on the in-memory refcount table. Since the latter will
be much easier, do that.
To prepare for this, introduce a "rebuild" boolean which should be set
to true whenever a fix is rather dangerous or too complicated using the
current refcount structures. Another example for this is refcount blocks
being referenced more than once.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If the qcow2 check function detects a refcount block located beyond the
image end, grow the image appropriately. This cannot break anything and
is the logical fix for such a case.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We will later call calculate_refcounts multiple times, so reuse the
refcount table if possible.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that the refcount table can be passed around by reference, do that
for inc_refcounts() (and subsequently check_refcounts_l1() and
check_refcounts_l2()) and use it for resizing it when a cluster after
the image end is encountered.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
As of a future patch, inc_refcounts() will have to throw errors which
are generally signaled by returning -errno. Therefore, let it return an
integer which is either 0 for success or -errno and handle the -errno
case in all callers.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of printing out an error message, incrementing check_errors and
returning a fixed -errno, just do cleanups and return -ret, with ret set
by the code which threw the exception (jumped to the fail label).
Also, increment check_errors on error in check_refcounts_l2().
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Use int64_t for the entry count of the in-memory refcount table
throughout the check functions.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Pull check_refblocks() before calculate_refcounts() so we can drop its
static declaration.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When implementing variable refcounts, we want to be able to easily find
all the places in qemu which are tied to a certain refcount order.
Replace sizeof(uint16_t) in the check code by sizeof(**refcount_table)
so we can later find it more easily.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Put the code for calculating the reference counts and comparing them
during qemu-img check into own functions.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When opening dirty images, qcow2's repair function should not only
repair errors but leaks as well.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The size of a refblock entry is (in theory) variable; calculate
therefore the number of entries per refblock and the according bit shift
(1 << x == entry count) when opening an image.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There are macros for these operations, so make use of them.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Doesn't make a difference just yet, but it's the right thing to do.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move device model attachment / detachment and the BlockDevOps device
model callbacks and their wrappers from BlockDriverState to
BlockBackend.
Wrapper calls in block.c change from
bdrv_dev_FOO_cb(bs, ...)
to
if (bs->blk) {
bdrv_dev_FOO_cb(bs->blk, ...);
}
No change, because both bdrv_dev_change_media_cb() and
bdrv_dev_resize_cb() do nothing when no device model is attached, and
a device model can be attached only when bs->blk.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Much more command code needs conversion. I start with this one
because it's using bdrv_dev_* functions, which I'm about to lift into
BlockBackend.
While there, give bdrv_query_info() internal linkage.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
blockdev_init() always creates a DriveInfo, but only drive_new() fills
it in. qmp_blockdev_add() leaves it blank. This results in a drive
with type = IF_IDE, bus = 0, unit = 0. Screwed up in commit ee13ed1c.
Board initialization code looking for IDE drive (0,0) can pick up one
of these bogus drives. The QMP command has to execute really early to
be visible. Not sure how likely that is in practice.
Fix by creating DriveInfo in drive_new(). Block backends created by
blockdev-add don't get one.
Breaks the test for "has been created by qmp_blockdev_add()" in
blockdev_mark_auto_del() and do_drive_del(), because it changes the
value of dinfo && !dinfo->enable_auto_del from true to false. Simply
test !dinfo instead.
Leaves DriveInfo member enable_auto_del unused. Drop it.
A few places assume a block backend always has a DriveInfo. Fix them
up.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Device models should access their block backends only through the
block-backend.h API. Convert them, and drop direct includes of
inappropriate headers.
Just four uses of BlockDriverState are left:
* The Xen paravirtual block device backend (xen_disk.c) opens images
itself when set up via xenbus, bypassing blockdev.c. I figure it
should go through qmp_blockdev_add() instead.
* Device model "usb-storage" prompts for keys. No other device model
does, and this one probably shouldn't do it, either.
* ide_issue_trim_cb() uses bdrv_aio_discard() instead of
blk_aio_discard() because it fishes its backend out of a BlockAIOCB,
which has only the BlockDriverState.
* PC87312State has an unused BlockDriverState[] member.
The next two commits take care of the latter two.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
I'll use it with block backends shortly, and the name is going to fit
badly there. It's a block layer thing anyway, not just a block driver
thing.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
I'll use BlockDriverAIOCB with block backends shortly, and the name is
going to fit badly there. It's a block layer thing anyway, not just a
block driver thing.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
BlockBackend's name space is separate only to keep the initial patches
simple. Time to merge the two.
Retain bdrv_find() and bdrv_get_device_name() for now, to keep this
series manageable.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
device_name[] can become non-empty only in bdrv_new_root() and
bdrv_move_feature_fields(). The latter is used only to undo damage
done by bdrv_swap(). The former is called only by blk_new_with_bs().
Therefore, when a BlockDriverState's device_name[] is non-empty, then
it's been created with a BlockBackend, and vice versa. Furthermore,
blk_new_with_bs() keeps the two names equal.
Therefore, device_name[] is redundant. Eliminate it.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On BlockBackend destruction, unref its BlockDriverState. Replaces the
callers' unrefs.
This turns the pointer from BlockBackend to BlockDriverState into a
strong reference, managed with bdrv_ref() / bdrv_unref(). The
back-pointer remains weak.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Make the BlockBackend own the DriveInfo. Change blockdev_init() to
return the BlockBackend instead of the DriveInfo.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Convenience function blk_new_with_bs() creates a BlockBackend with its
BlockDriverState. Callers have to unref both. The commit after next
will relieve them of the need to unref the BlockDriverState.
Complication: due to the silly way drive_del works, we need a way to
hide a BlockBackend, just like bdrv_make_anon(). To emphasize its
"special" status, give the function a suitably off-putting name:
blk_hide_on_behalf_of_do_drive_del(). Unfortunately, hiding turns the
BlockBackend's name into the empty string. Can't avoid that without
breaking the blk->bs->device_name equals blk->name invariant.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
A block device consists of a frontend device model and a backend.
A block backend has a tree of block drivers doing the actual work.
The tree is managed by the block layer.
We currently use a single abstraction BlockDriverState both for tree
nodes and the backend as a whole. Drawbacks:
* Its API includes both stuff that makes sense only at the block
backend level (root of the tree) and stuff that's only for use
within the block layer. This makes the API bigger and more complex
than necessary. Moreover, it's not obvious which interfaces are
meant for device models, and which really aren't.
* Since device models keep a reference to their backend, the backend
object can't just be destroyed. But for media change, we need to
replace the tree. Our solution is to make the BlockDriverState
generic, with actual driver state in a separate object, pointed to
by member opaque. That lets us replace the tree by deinitializing
and reinitializing its root. This special need of the root makes
the data structure awkward everywhere in the tree.
The general plan is to separate the APIs into "block backend", for use
by device models, monitor and whatever other code dealing with block
backends, and "block driver", for use by the block layer and whatever
other code (if any) dealing with trees and tree nodes.
Code dealing with block backends, device models in particular, should
become completely oblivious of BlockDriverState. This should let us
clean up both APIs, and the tree data structures.
This commit is a first step. It creates a minimal "block backend"
API: type BlockBackend and functions to create, destroy and find them.
BlockBackend objects are created and destroyed exactly when root
BlockDriverState objects are created and destroyed. "Root" in the
sense of "in bdrv_states". They're not yet used for anything; that'll
come shortly.
A root BlockDriverState is created with bdrv_new_root(), so where to
create a BlockBackend is obvious. Where these roots get destroyed
isn't always as obvious.
It is obvious in qemu-img.c, qemu-io.c and qemu-nbd.c, and in error
paths of blockdev_init(), blk_connect(). That leaves destruction of
objects successfully created by blockdev_init() and blk_connect().
blockdev_init() is used only by drive_new() and qmp_blockdev_add().
Objects created by the latter are currently indestructible (see commit
48f364d "blockdev: Refuse to drive_del something added with
blockdev-add" and commit 2d246f0 "blockdev: Introduce
DriveInfo.enable_auto_del"). Objects created by the former get
destroyed by drive_del().
Objects created by blk_connect() get destroyed by blk_disconnect().
BlockBackend is reference-counted. Its reference count never exceeds
one so far, but that's going to change.
In drive_del(), the BB's reference count is surely one now. The BDS's
reference count is greater than one when something else is holding a
reference, such as a block job. In this case, the BB is destroyed
right away, but the BDS lives on until all extra references get
dropped.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Creating an anonymous BDS can't fail. Make that obvious.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Export names may be used with nbd+unix, too, fix nbd_refresh_filename()
accordingly. Also, for nbd+tcp, the documented path schema is
"nbd://host[:port]/export", so use it. Furthermore, as can be seen from
that schema, the port is optional.
That makes six single cases for how the filename can be formatted; it is
not easy to generalize these cases without the resulting statement being
completely unreadable, thus there is simply one snprintf() per case.
Finally, taking the options from BDRVNBDState::socket_opts is wrong,
because those will not contain the export name. Just use
BlockDriverState::options instead.
Reported-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
try_fiemap() uses FIEMAP_FLAG_SYNC which has a significant performance
impact.
Prefer seek_hole() over fiemap() to avoid this impact where possible.
seek_hole is more widely used and, arguably, has potential to be
optimised in the kernel.
Reported-By: Michael Steffens <michael_steffens@posteo.de>
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: Pádraig Brady <pbrady@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Using fiemap without FIEMAP_FLAG_SYNC is a known corrupter.
Add the FIEMAP_FLAG_SYNC flag to the FS_IOC_FIEMAP ioctl. This has
the downside of significantly reducing performance.
Reported-By: Michael Steffens <michael_steffens@posteo.de>
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: Pádraig Brady <pbrady@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When the Qcow2DiscardRegion is adjacent to another one referenced by "d",
free this Qcow2DiscardRegion metadata referenced by "p" after
it was removed from s->discards queue.
Signed-off-by: Zhang Haoyu <zhanghy@sangfor.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Just like lazy-refcounts, this field will be present iff the qcow2
compat level is 1.1 (or probably any future revision).
As expected, this breaks some tests due to the new field present in
qemu-img info output; so fix their output accordingly.
Suggested-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1412105489-7681-3-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
$ ./qemu-img create -f qcow2 overlay \
-b 'json: { "file.driver":"ssh",
"file.host":"localhost",
"file.host_key_check":"no" }'
qemu-img: qobject/qdict.c:193: qdict_get_obj: Assertion `obj != ((void *)0)' failed.
Aborted
A similar crash also happens if the file.host field is omitted.
https://bugzilla.redhat.com/show_bug.cgi?id=1147343
Bug found and reported by Jun Li.
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
yet 100% thread-safe, though, which makes it really, really
experimental. It also brings asynchronous cancellation to
the SCSI subsystem and implements it in virtio-scsi. This
is a pretty important feature. Almost all the work here
was done by Fam Zheng.
I also included the virtio refcount fixes from Gonglei,
because they had a small conflict with virtio-scsi dataplane.
This pull request is using the new subkey 4E6B09D7.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJUKpR2AAoJEBRUblpOawnXNLAH/RBeF66ZqWc29dl78JKEbv0+
C5pL61GhlI5vFIjSbPU3/iaZQifw3E4NLvX3SCN5ImsLzBw4r3qerapP2Ut96K/j
5CYdWTF1oqE32oCefvlWhJulHmE1vxGN53BvOz3HHxoehdF1/tJ0wUoZyfztGTOF
tiW85VMewi6CKm47/ns5tSNfGMVzWHqnUg67z/mwN6ZmPFU1dXBlgmiIv8Znahrn
B1AOAeMjWaKvOS+tiYNVG6k0GENWGoiypxiTR3ZXLQKxOYdkh/X0ARULqLMonASX
YsT772nzO9KZDIsdLj9QZZmM7vxs7UhW0MgQlvcSWP9vfZa5SeuRSgoXorPDj3Q=
=54T3
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/bonzini/tags/for-upstream' into staging
This update brings dataplane to virtio-scsi (NOT
yet 100% thread-safe, though, which makes it really, really
experimental. It also brings asynchronous cancellation to
the SCSI subsystem and implements it in virtio-scsi. This
is a pretty important feature. Almost all the work here
was done by Fam Zheng.
I also included the virtio refcount fixes from Gonglei,
because they had a small conflict with virtio-scsi dataplane.
This pull request is using the new subkey 4E6B09D7.
# gpg: Signature made Tue 30 Sep 2014 12:31:02 BST using RSA key ID 4E6B09D7
# gpg: Good signature from "Paolo Bonzini <pbonzini@redhat.com>"
# gpg: aka "Paolo Bonzini <bonzini@gnu.org>"
* remotes/bonzini/tags/for-upstream: (39 commits)
block/iscsi: handle failure on malloc of the allocationmap
util: introduce bitmap_try_new
virtio-scsi: Handle TMF request cancellation asynchronously
scsi: Introduce scsi_req_cancel_async
scsi: Introduce scsi_req_cancel_complete
scsi: Drop SCSIReqOps.cancel_io
scsi: Unify request unref in scsi_req_cancel
scsi-generic: Handle canceled request in scsi_command_complete
scsi: Drop scsi_req_abort
virtio-scsi: Process ".iothread" property
virtio-scsi: Call bdrv_io_plug/bdrv_io_unplug in cmd request handling
virtio-scsi: Batched prepare for cmd reqs
virtio-scsi: Two stages processing of cmd request
virtio-scsi: Add migration state notifier for dataplane code
virtio-scsi: Hook up with dataplane
virtio-scsi-dataplane: Code to run virtio-scsi on iothread
virtio-scsi: Add VirtIOSCSIVring in VirtIOSCSIReq
virtio-scsi: Add 'iothread' property to virtio-scsi
virtio: add a wrapper for virtio-backend initialization
virtio-9p: fix virtio-9p child refcount in transports
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Check for the presence of posix_fallocate() in configure and only
compile in support for PREALLOC_MODE_FALLOC when it's there.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJUJbcwAAoJEH8JsnLIjy/WmckP/RBMK3gkr2aArp0k/sbdFcnW
D7GusGkoj1jXRVFKuG3d+T/nozJTX0mkfnaRo8QCg/5VVJtqeHfsfY6JHgppQEai
C5Z8gtzIFxsQNE0/8d2D6zJMnqsIeXAVtDONWgWcDOZstVXn+KEKOw5E3KCZeGxF
yLA+h4h8sncT3ysPE0/jnQqeIwGJpLusFXmICiRY2lXRcCgY/qniymUq25FZO69m
Isw8iVHrUr05u1kfthdQeSAmdPQqI5tK5siR4CERtO2XdlyiODmdo5DRZmATfGiL
CGUIJd30mvrKpqRpIk8gbuGEtz92RuTnyoi9xkLcKi52eWyflGH4gjN3ojYe5Pro
hRY06AcwI9lAVjCy7lKPO6dJn8RWOi/wkgfffK0CFujlo8r1BtwfCW0Y9koYuhdq
mM/K0IEbjNgOvacHLIMOTQuKz3DLv7QB8wZQ+cTkdmvC6kNGnqD9yrA4cAQP4kjR
wjm/WfDIRKPif6cMwgaOiQeNaL0VF2LIn4GRWgaiIqEYfcKzy65uBFoPqOlAoXr8
NtTzCIXDIBbTSzz3VLBKD22FOsP0pfPrMeXkf+GwJwStKMcp6PGY4iUd3+M3bfOP
iUsxTJ4Uw5zMg0MgOaaWnUaDdbRtKGtH1y2UA+4X3gO+3W8Vz0e8B5cPMBqN/mpy
lUS4mBA08mTvaLE520rp
=ManP
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/kevin/tags/for-upstream' into staging
Block patches
# gpg: Signature made Fri 26 Sep 2014 19:57:52 BST using RSA key ID C88F2FD6
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>"
* remotes/kevin/tags/for-upstream:
qemu-iotests: Fail test if explicit test case number is unknown
block: Validate node-name
vpc: fix beX_to_cpu() and cpu_to_beX() confusion
docs: add blkdebug block driver documentation
block: Catch simultaneous usage of options and their aliases
block: Specify -drive legacy option aliases in array
block: Improve message for device name clashing with node name
qemu-nbd: Destroy the BlockDriverState properly
block: Keep DriveInfo alive until BlockDriverState dies
blockdev: Disentangle BlockDriverState and DriveInfo creation
blkdebug: show an error for invalid event names
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
iscsi_aio_write16_cb, iscsi_aio_writev, iscsi_aio_read16_cb, and
iscsi_aio_readv have not not been in use since commit
063c3378a9 ("block/iscsi: introduce
bdrv_co_{readv, writev, flush_to_disk}").
These were the only trace events in block/iscsi.c so drop the the
trace.h include.
Cc: Peter Lieven <pl@kamp.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-id: 1411394595-15300-4-git-send-email-stefanha@redhat.com
The beX_to_cpu() and cpu_to_beX() functions perform the same operation -
they do a byteswap if the host CPU endianness is little-endian or a
nothing otherwise.
The point of two names for the same operation is that it documents which
direction the data is being converted. This makes it clear whether the
data is suitable for CPU processing or in its external representation.
This patch fixes incorrect beX_to_cpu()/cpu_to_beX() usage.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It is easy to typo a blkdebug configuration and waste a lot of time
figuring out why no rules are matching.
Push the Error** down into add_rule() so we can report an error when the
event name is invalid.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Gonglei <arei.gonglei@huawei.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
In vhdx_create_metadata(), we allocate 40 bytes to entry_buffer for
the various metadata table entries. However, we write out 64kB from
that buffer into the new file. Only write out the correct 40 bytes.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch removes support for the cow file format.
Normally we do not break backwards compatibility but in this case there
is no impact and it is the most logical option. Extraordinary claims
require extraordinary evidence so I will show why removing the cow block
driver is the right thing to do.
The cow file format is the disk image format for Usermode Linux, a way
of running a Linux system in userspace. The performance of UML was
never great and it was hacky, but it enjoyed some popularity before
hardware virtualization support became mainstream.
QEMU's block/cow.c is supposed to read this image file format.
Unfortunately the file format was underspecified:
1. Earlier Linux versions used the MAXPATHLEN constant for the backing
filename field. The value of MAXPATHLEN can change, so Linux
switched to a 4096 literal but QEMU has a 1024 literal.
2. Padding was not used on the header struct (both in the Linux kernel
and in QEMU) so the struct layout varied across architectures. In
particular, i386 and x86_64 were different due to int64_t alignment
differences. Linux now uses __attribute__((packed)), QEMU does not.
Therefore:
1. QEMU cow images do not conform to the Linux cow image file format.
2. cow images cannot be shared between different host architectures.
This means QEMU cow images are useless and QEMU has not had bug reports
from users actually hitting these issues.
Let's get rid of this thing, it serves no purpose and no one will be
affected.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-id: 1410877464-20481-1-git-send-email-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Being able to set the overlap-check option to a string and then refine
it via the overlap-check.* options is a nice idea for the command line
but does not work so well for non-flattened dicts. In that case, one can
only specify either but not both, so add a field to overlap-check.*
which does the same as directly specifying overlap-check but can be used
in conjunction with the other fields in non-flattened dicts.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1408557576-14574-4-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Currently, the QemuOpts object opts is leaked if anything fails from its
creation up to and including the image repair block. Fix this by freeing
that object in the fail path.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Message-id: 1408557576-14574-2-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Offsets taken from the L1, L2 and refcount tables are generally assumed
to be correctly aligned. However, this cannot be guaranteed if the image
has been written to by something different than qemu, thus check all
offsets taken from these tables for correct cluster alignment.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1409926039-29044-5-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Use the new function in case of a failed overlap check.
This changes output in case of corruption, so adapt iotest 060's
reference output accordingly.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Message-id: 1409926039-29044-4-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Add a helper function for easily marking an image corrupt (on fatal
corruptions) while outputting an informative message to stderr and via
QAPI.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Message-id: 1409926039-29044-3-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Not every BLOCK_IMAGE_CORRUPTED event must be fatal; for example, when
reading from an image, they should generally not be. Nonetheless, even
an image only read from may of course be corrupted and this can be
detected during normal operation. In this case, a non-fatal event should
be emitted, but the image should not be marked corrupt (in accordance to
"fatal" set to false).
Signed-off-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1409926039-29044-2-git-send-email-mreitz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This is an analogue to Linux null_blk. It can be used for testing or
benchmarking block device emulation and general block layer
functionalities such as coroutines and throttling, where disk IO is not
necessary or wanted.
Use null-aio:// for AIO version, and null-co:// for coroutine version.
[Resolved conflict with Fam's async bdrv_aio_cancel() series:
1. Drop .bdrv_aio_cancel() since it is now done by block.c
2. Rename qemu_aio_release() to qemu_aio_unref()
--Stefan]
Signed-off-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1410415798-20673-2-git-send-email-famz@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Also drop the now unused SheepdogAIOCB.finished field. Note that this
aio is internal to sheepdog driver and has NULL cb and opaque, and
should be unused at all.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Before, we cancel all the child requests with bdrv_aio_cancel, then free
the acb..
Now we just kick off asynchronous cancellation of child requests and
return, we know quorum_aio_cb will be called later, so in the end
quorum_aio_finalize will take care of calling the caller's cb.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
For a fifo read pattern, we only have one running aio (possible other cases that
has less number than num_children in the future), so we need to check if
.acb is NULL against bdrv_aio_cancel() to avoid segfault.
Cc: Eric Blake <eblake@redhat.com>
Cc: Benoit Canet <benoit@irqsave.net>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The cancelled flag is no longer useful. Later the request will complete
as before, and cb will be called.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Just call io_cancel (2), if it fails, it means the request is not
canceled, so the event loop will eventually call
qemu_laio_process_completion.
In qemu_laio_process_completion, change to call the cb unconditionally.
It is required by bdrv_aio_cancel_async.
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The parent_vhdx_guid variable is defined but never used, which provokes
complaints from newer versions of clang. Since the variable definition
is here acting as documentation of the image format, mark it with the
'unused' attribute to keep the compiler happy rather than simply
deleting it.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
When trying to create a fixed vhd image qemu-img will return the
following error:
qemu-img: test.vhdx: Could not create image: Cannot allocate memory
This happens because of a incorrect check in vhdx.c. Specifficaly,
in vhdx_create_bat(), after allocating memory for the BAT entry,
there is a check to determine if the allocation was unsuccsessful.
The error comes from the fact that it checks if s->bat isn't NULL,
which is true in case of succsessful allocation, and exits with
error ENOMEM.
Signed-off-by: Adelina Tuvenie <atuvenie@cloudbasesolutions.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
preallocation=falloc allocates disk space by posix_fallocate(),
preallocation=full allocates disk space by writing zeros to disk.
Both modes imply preallocation=metadata.
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds a new option preallocation for raw format, and implements
falloc and full preallocation.
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch prepares for the subsequent patches.
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
and avoid converting it back later.
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently the file size requested by user is rounded down to nearest
sector, causing the actual file size could be a bit less than the size
user requested. Since some formats (like qcow2) record virtual disk
size in bytes, this can make the last few bytes cannot be accessed.
This patch fixes it by rounding up file size to nearest sector so that
the actual file size is no less than the requested file size.
Signed-off-by: Hu Tao <hutao@cn.fujitsu.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is the next step for decoupling block accounting functions from
BlockDriverState.
In a future commit the BlockAcctStats structure will be moved from
BlockDriverState to the device models structures.
Note that bdrv_get_stats was introduced so device models can retrieve the
BlockAcctStats structure of a BlockDriverState without being aware of it's
layout.
This function should go away when BlockAcctStats will be embedded in the device
models structures.
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Keith Busch <keith.busch@intel.com>
CC: Anthony Liguori <aliguori@amazon.com>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Peter Maydell <peter.maydell@linaro.org>
CC: Michael Tokarev <mjt@tls.msk.ru>
CC: John Snow <jsnow@redhat.com>
CC: Markus Armbruster <armbru@redhat.com>
CC: Alexander Graf <agraf@suse.de>
CC: Max Reitz <mreitz@redhat.com>
Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The middle term goal is to move the BlockAcctStats structure in the device models.
(Capturing I/O accounting statistics in the device models is good for billing)
This patch make a small step in this direction by removing a reference to BDRV.
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Keith Busch <keith.busch@intel.com>
CC: Anthony Liguori <aliguori@amazon.com>
CC: "Michael S. Tsirkin" <mst@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: John Snow <jsnow@redhat.com>
CC: Richard Henderson <rth@twiddle.net>
CC: Markus Armbruster <armbru@redhat.com>
CC: Alexander Graf <agraf@suse.de>i
Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The plan is to add new accounting metrics (latency, invalid requests, failed
requests, queue depth) and block.c is overpopulated so it will be better to work
in a separate module.
Moreover the long term plan is to have statistics in each of the BDS of the graph
for metrology purpose; this means that the device model statistics must move from
the topmost BDS to the device model.
So we need to decouple the statistic code from BlockDriverState.
This is another argument for the extraction of the code in a separate module.
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
CC: Benoit Canet <benoit@irqsave.net>
CC: Fam Zheng <famz@redhat.com>
CC: Peter Crosthwaite <peter.crosthwaite@xilinx.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Extract the block accounting statistics into a structure so the block device
models can hold them in the future.
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Max Reitz <mreitz@redhat.com>
CC: Eric Blake <eblake@redhat.com>
Signed-off-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
cpu_to_be32() is wrong since vhd_type is an enum constant
(just a regular CPU-endian integer).
Signed-off-by: Xiaodong Gong <gordongong0350@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
vmdk_open_sparse() does not take ownership of buf so the caller always
needs to free it.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Replace __sync builtins with ones provided by QEMU
for atomic operations.
Special thanks goes to Paolo Bonzini for his refactoring
suggestion in order to use the already existing atomic builtins
interface.
Signed-off-by: Chrysostomos Nanakos <cnanakos@grnet.gr>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In commit 63f0f45f2e the following
mechanical change was made:
if (!state) {
- qemu_aio_wait();
+ aio_poll(state->s->aio_context, true);
}
The new code now checks if state is NULL and then dereferences it
('state->s') which is obviously incorrect.
This commit replaces state->s->aio_context with
bdrv_get_aio_context(bs), fixing this problem. The two other hunks
are concerned with getting the BlockDriverState pointer bs to where it
is needed.
The original bug causes a segfault when using libguestfs to access a
VMware vCenter Server and doing any kind of complex read-heavy
operations. With this commit the segfault goes away.
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In order to access VMware ESX efficiently, we need to send a session
cookie. This patch is very simple and just allows you to send that
session cookie. It punts on the question of how you get the session
cookie in the first place, but in practice you can just run a `curl'
command against the server and extract the cookie that way.
To use it, add file.cookie to the curl URL. For example:
$ qemu-img info 'json: {
"file.driver":"https",
"file.url":"https://vcenter/folder/Windows%202003/Windows%202003-flat.vmdk?dcPath=Datacenter&dsName=datastore1",
"file.sslverify":"off",
"file.cookie":"vmware_soap_session=\"52a01262-bf93-ccce-d379-8dabb3e55560\""}'
image: [...]
file format: raw
virtual size: 8.0G (8589934592 bytes)
disk size: unavailable
Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
If two Linux AIO request completions are fetched in the same
io_getevents() call, QEMU will deadlock if request A's callback waits
for request B to complete using an aio_poll() loop. This was reported
to happen with the mirror blockjob.
This patch moves completion processing into a BH and makes it resumable.
Nested event loops can resume completion processing so that request B
will complete and the deadlock will not occur.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Marcin Gibuła <m.gibula@beyond.pl>
Reported-by: Marcin Gibuła <m.gibula@beyond.pl>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tested-by: Marcin Gibuła <m.gibula@beyond.pl>
We should reinit local_err as NULL inside the while loop or g_free() will report
corrupption and abort the QEMU when sheepdog driver tries reconnecting.
This was broken in commit 356b4ca.
qemu-system-x86_64: failed to get the header, Resource temporarily unavailable
qemu-system-x86_64: Failed to connect to socket: Connection refused
qemu-system-x86_64: (null)
[xcb] Unknown sequence number while awaiting reply
[xcb] Most likely this is a multi-threaded client and XInitThreads has not been called
[xcb] Aborting, sorry about that.
qemu-system-x86_64: ../../src/xcb_io.c:298: poll_for_response: Assertion `!xcb_xlib_threads_sequence_lost' failed.
Aborted (core dumped)
Cc: qemu-devel@nongnu.org
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Uses the same select/WSAEventSelect scheme as main-loop.c.
WSAEventSelect() is edge-triggered, so it cannot be used
directly, but it is still used as a way to exit from a
blocking g_poll().
Before g_poll() is called, we poll sockets with a non-blocking
select() to achieve the level-triggered semantics we require:
if a socket is ready, the g_poll() is made non-blocking too.
Based on a patch from Or Goshen.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This patch adds single read pattern to quorum driver and quorum vote is default
pattern.
For now we do a quorum vote on all the reads, it is designed for unreliable
underlying storage such as non-redundant NFS to make sure data integrity at the
cost of the read performance.
For some use cases as following:
VM
--------------
| |
v v
A B
Both A and B has hardware raid storage to justify the data integrity on its own.
So it would help performance if we do a single read instead of on all the nodes.
Further, if we run VM on either of the storage node, we can make a local read
request for better performance.
This patch generalize the above 2 nodes case in the N nodes. That is,
vm -> write to all the N nodes, read just one of them. If single read fails, we
try to read next node in FIFO order specified by the startup command.
The 2 nodes case is very similar to DRBD[1] though lack of auto-sync
functionality in the single device/node failure for now. But compared with DRBD
we still have some advantages over it:
- Suppose we have 20 VMs running on one(assume A) of 2 nodes' DRBD backed
storage. And if A crashes, we need to restart all the VMs on node B. But for
practice case, we can't because B might not have enough resources to setup 20 VMs
at once. So if we run our 20 VMs with quorum driver, and scatter the replicated
images over the data center, we can very likely restart 20 VMs without any
resource problem.
After all, I think we can build a more powerful replicated image functionality
on quorum and block jobs(block mirror) to meet various High Availibility needs.
E.g, Enable single read pattern on 2 children,
-drive driver=quorum,children.0.file.filename=0.qcow2,\
children.1.file.filename=1.qcow2,read-pattern=fifo,vote-threshold=1
[1] http://en.wikipedia.org/wiki/Distributed_Replicated_Block_Device
[Dropped \n from an error_setg() error message
--Stefan]
Cc: Benoit Canet <benoit@irqsave.net>
Cc: Eric Blake <eblake@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Yuan <namei.unix@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Recently, sheepdog revived its VDI locking functionality. This patch
updates sheepdog driver of QEMU for this feature. It changes an error
code for a case of failed locking. -EBUSY is a suitable one.
Reported-by: Valerio Pachera <sirio81@gmail.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Liu Yuan <namei.unix@gmail.com>
Cc: MORITA Kazutaka <morita.kazutaka@gmail.com>
Signed-off-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The update is required for supporting iSCSI multipath. It doesn't
affect behavior of QEMU driver but adding a new field to vdi request
struct is required.
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Liu Yuan <namei.unix@gmail.com>
Cc: MORITA Kazutaka <morita.kazutaka@gmail.com>
Signed-off-by: Hitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The curl hardcoded timeout (5 seconds) sometimes is not long
enough depending on the remote server configuration and network
traffic. The user should be able to set how much long he is
willing to wait for the connection.
Adding a new option to set this timeout gives the user this
flexibility. The previous default timeout of 5 seconds will be
used if this option is not present.
Reviewed-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Daniel Henrique Barboza <danielhb@linux.vnet.ibm.com>
Reviewed-by: Benoit Canet <benoit.canet@nodalink.com>
Tested-by: Richard W.M. Jones <rjones@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The gcc 4.1.2 compiler warns that delay_ns may be uninitialized in
mirror_iteration().
There are two break statements in the do ... while loop that skip over
the delay_ns assignment. These are probably the cause of the warning.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Instead of bdrv_getlength().
Commit 57322b7 did this all over block, but one more bdrv_getlength()
has crept in since.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Fam Zheng <famz@redhat.com>
Reviewed-by: Benoît Canet <benoit.canet@nodalink.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Otherwise error_callback_bh will access the already released acb.
Cc: qemu-stable@nongnu.org
Signed-off-by: Fam Zheng <famz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The following O_DIRECT read from a <512 byte file fails:
$ truncate -s 320 test.img
$ qemu-io -n -c 'read -P 0 0 512' test.img
qemu-io: can't open device test.img: Could not read image for determining its format: Invalid argument
Note that qemu-io completes successfully without the -n (O_DIRECT)
option.
This patch fixes qemu-iotests ./check -nocache -vmdk 059.
Cc: qemu-stable@nongnu.org
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Reported-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bs->total_sectors is not yet updated at this point. resulting
in memory corruption if the volume has grown and data is written
to the newly availble areas.
CC: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Just log to stderr unconditionally, like other similar code does.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Because blkdebug cannot simply create a configuration file, simply
refuse to reconstruct a plain filename and only generate an options
QDict from the rules instead.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add options for specifying the size of the metadata caches. This can
either be done directly for each cache (if only one is given, the other
will be derived according to a default ratio) or combined for both.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
With a variable cache size, the number given to qcow2_cache_create() may
be huge. Therefore, use g_try_new0().
While at it, use g_new0() instead of g_malloc0() for allocating the
Qcow2Cache object.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Specifying the metadata cache sizes in clusters results in less clusters
(and much less bytes) covered for small cluster sizes and vice versa.
Using a constant byte size reduces this difference, and makes it
possible to manually specify the cache size in an easily comprehensible
unit.
Signed-off-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
They clutter the code. Unfortunately, I can't figure out how to make
Coccinelle drop all of them, so I have to settle for common special
cases:
@@
type T;
T *pt;
void *pv;
@@
- pt = (T *)pv;
+ pt = pv;
@@
type T;
@@
- (T *)
(\(g_malloc\|g_malloc0\|g_realloc\|g_new\|g_new0\|g_renew\|
g_try_malloc\|g_try_malloc0\|g_try_realloc\|
g_try_new\|g_try_new0\|g_try_renew\)(...))
Topped off with minor manual style cleanups.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
g_new(T, n) is safer than g_malloc(sizeof(*v) * n) for two reasons.
One, it catches multiplication overflowing size_t. Two, it returns
T * rather than void *, which lets the compiler catch more type
errors.
Perhaps a conversion to g_malloc_n() would be neater in places, but
that's merely four years old, and we can't use such newfangled stuff.
This commit only touches allocations with size arguments of the form
sizeof(T), plus two that use 4 instead of sizeof(uint32_t). We can
make the others safe by converting to g_malloc_n() when it becomes
available to us in a couple of years.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
g_new(T, n) is neater than g_malloc(sizeof(T) * n). It's also safer,
for two reasons. One, it catches multiplication overflowing size_t.
Two, it returns T * rather than void *, which lets the compiler catch
more type errors.
Patch created with Coccinelle, with two manual changes on top:
* Add const to bdrv_iterate_format() to keep the types straight
* Convert the allocation in bdrv_drop_intermediate(), which Coccinelle
inexplicably misses
Coccinelle semantic patch:
@@
type T;
@@
-g_malloc(sizeof(T))
+g_new(T, 1)
@@
type T;
@@
-g_try_malloc(sizeof(T))
+g_try_new(T, 1)
@@
type T;
@@
-g_malloc0(sizeof(T))
+g_new0(T, 1)
@@
type T;
@@
-g_try_malloc0(sizeof(T))
+g_try_new0(T, 1)
@@
type T;
expression n;
@@
-g_malloc(sizeof(T) * (n))
+g_new(T, n)
@@
type T;
expression n;
@@
-g_try_malloc(sizeof(T) * (n))
+g_try_new(T, n)
@@
type T;
expression n;
@@
-g_malloc0(sizeof(T) * (n))
+g_new0(T, n)
@@
type T;
expression n;
@@
-g_try_malloc0(sizeof(T) * (n))
+g_try_new0(T, n)
@@
type T;
expression p, n;
@@
-g_realloc(p, sizeof(T) * (n))
+g_renew(T, p, n)
@@
type T;
expression p, n;
@@
-g_try_realloc(p, sizeof(T) * (n))
+g_try_renew(T, p, n)
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Max Reitz <mreitz@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit de82815db1 ("qcow2: Handle failure
for potentially large allocations") introduced a double-free of
new_blocks in the alloc_refcount_block() error path.
The qemu-iotests qcow2 026 test case was failing because qemu-io
segfaulted.
Make sure new_blocks is NULL after we free it the first time.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Parallels has released in the recent updates of Parallels Server 5/6
new addition to his image format. Images with signature WithouFreSpacExt
have offsets in the catalog coded not as offsets in sectors (multiple
of 512 bytes) but offsets coded in blocks (i.e. header->tracks * 512)
In this case all 64 bits of header->nb_sectors are used for image size.
This patch implements support of this for qemu-img and also adds specific
check for an incorrect image. Images with block size greater than
INT_MAX/513 are not supported. The biggest available Parallels image
cluster size in the field is 1 Mb. Thus this limit will not hurt
anyone.
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Jeff Cody <jcody@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
and rework error path a bit. There is no difference at the moment, but
the code will be definitely shorter when additional processing will
be required for WithouFreSpacExt
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Jeff Cody <jcody@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>