brdv_truncate() is also called from readv/writev commands on self-
growing file based storage. this will result in requests waiting
for theirselves to complete.
This reverts commit 9a665b2b86.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
realpath(3) is used to get an absolute path to the image file when
creating a -drive snapshot=on temporary qcow2. This does not work for
protocols since their filenames ("proto:foo:...") do not correspond to
file system paths.
Commit 7c96d46ec2 ("Let snapshot work with
protocols") skipped realpath(3) for protocols. Later on the "raw"
format was introduced and broke the check.
Use path_has_protocol(filename) to decide if this image uses a protocol
or a filename.
Reported-by: Richard Jones <rjones@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
For now bdrv_get_aio_context() is just a stub that calls
qemu_aio_get_context() since the block layer is currently tied to the
main loop AioContext.
Add the stub now so that the block layer can begin accessing its
AioContext.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
The options are passed down to the block drivers, which are supposed to
remove all options they have processed. Anything that is left over in
the end is an unknown option and results in an error.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It doesn't do anything yet except storing the options QDict in the
BlockDriverState.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
During a commit of 'all' using the HMP non-live commit, the operation
is aborted and returns error on the first error enountered. When
non-COW drives are in use (e.g. ejected floppy, cdrom, or drives without
a backing parent), that means a commit all will return an error of either
-ENOMEDIUM or -ENOTSUP. This is not desirable, so for the 'all' commit
case, only attempt the commit if both bs->drv and bs->backing_hd are
present.
More succinctly: 'commit all' now means a commit on all COW drives.
This means an individual commit to a specific non-COW drive will still
return the appropriate error (-ENOMEDIUM if eject / not present, -ENOTSUP
if no backing file).
Reported-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
It is better to present homogeneous hardware independent of the storage
technology that is chosen on the host, hence we make discard a host
parameter; the user can choose whether to pass it down to the image
format and protocol, or to ignore it.
Using DISCARD with filesystems can cause very severe fragmentation, so it
is left default-off for now. This can change later when we implement the
"anchor" operation for efficient management of preallocated files.
There is still one choice to make: whether DISCARD has an effect on the
dirty bitmap or not. I chose yes, though there is a disadvantage: if
the guest is buggy and issues discards for data that is in use, there
will be no way to migrate storage for that guest without downgrading
the machine type to an older one.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_truncate() invalidates the bdrv_check_request() result for
in-flight requests, so there should better be none.
Cc: qemu-stable@nongnu.org
Signed-off-by: Peter Lieven <pl@kamp.de>
Reported-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There can be a need to turn output to stdout off. This patch adds a -q option
that enable "Quiet mode". In Quiet mode, only errors are printed out.
Signed-off-by: Miroslav Rezanina <mrezanin@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
There's no synchronous wrapper for bdrv_co_is_allocated_above function
so it's not possible to check for sector allocation in an image with
a backing file.
Signed-off-by: Miroslav Rezanina <mrezanin@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
In an image chain, if the base image is smaller than the current
image, we need to make sure to use the current images count of
unallocated blocks once we get to the end of the base image. Without
this change the code will return 0 blocks when it gets to the end
of the base image and mirror_run will fail its assertion.
Signed-off-by: Vishvananda Ishaya <vishvananda@gmail.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is needed in the following patch.
Reviewed-by: Laszlo Ersek <lersek@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This actually uses the dirty bitmap in the block layer, and converts
mirroring to use an HBitmapIter.
Reviewed-by: Laszlo Ersek <lersek@redhat.com> (except block/mirror.c parts)
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Note that resetting bits in the dirty bitmap is done _before_ actually
processing the request. Writes, instead, set bits after the request
is completed.
This way, when there are concurrent write and discard requests, the
outcome will always be that the blocks are marked dirty. This scenario
should never happen, but it is safer to do it this way.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
bdrv_io_limits_enable() starts a new slice, but does not set io_base
correctly for that slice.
Here is how io_base is used:
bytes_base = bs->nr_bytes[is_write] - bs->io_base.bytes[is_write];
bytes_res = (unsigned) nb_sectors * BDRV_SECTOR_SIZE;
if (bytes_base + bytes_res <= bytes_limit) {
/* no wait */
} else {
/* operation needs to be throttled */
}
As a result, any I/O operations that are triggered between now and
bs->slice_end are incorrectly limited. If 10 MB of data has been
written since the VM was started, QEMU thinks that 10 MB of data has
been written in this slice. This leads to a I/O lockup in the guest.
We fix this by delaying the start of a new slice to the next
call of bdrv_exceed_io_limits().
Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
The qiov_is_aligned() function checks whether a QEMUIOVector meets a
BlockDriverState's alignment requirements. This is needed by
virtio-blk-data-plane so:
1. Move the function from block/raw-posix.c to block/block.c.
2. Make it public in block/block.h.
3. Rename to bdrv_qiov_is_aligned().
4. Change return type from int to bool.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
A blank CD or DVD is visible as a zero-sized disks. Probing such
disks will lead to an EIO and a failure to start the VM. Treating
them as raw is a better solution.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This allows removing of MinGW specific code and improves
reentrancy for POSIX hosts.
[Removed unused ret variable in qemu_get_timedate() to fix warning:
vl.c: In function ‘qemu_get_timedate’:
vl.c:451:16: error: variable ‘ret’ set but not used [-Werror=unused-but-set-variable]
-- Stefan Hajnoczi]
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This makes the blkdebug suspend/resume functionality available in
qemu-io. Use it like this:
$ ./qemu-io blkdebug::/tmp/test.qcow2
qemu-io> break write_aio req_a
qemu-io> aio_write 0 4k
qemu-io> blkdebug: Suspended request 'req_a'
qemu-io> resume req_a
blkdebug: Resuming request 'req_a'
qemu-io> wrote 4096/4096 bytes at offset 0
4 KiB, 1 ops; 0:00:30.71 (133.359788 bytes/sec and 0.0326 ops/sec)
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This commit adds an Error ** argument to bdrv_img_create() and set it
appropriately on error.
Callers of bdrv_img_create() pass NULL for the new argument and still
rely on bdrv_img_create()'s return value. Next commits will change
callers to use the Error object instead.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This fixes problems that are caused by the additional open/close cycle
of the existing format probing, for example related to qemu-nbd without
-t option or file descriptor passing.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Instead of waiting for all requests to complete, wait just for the
specific request that should be cancelled.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The local string tmp_filename is passed to function get_tmp_filename
which expects a string with minimum size MAX_PATH for w32 hosts.
MAX_PATH is 260 and PATH_MAX is 259, so tmp_filename was too short.
Commit eba25057b9 introduced this
regression.
Signed-off-by: Stefan Weil <sw@weilnetz.de>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Blue Swirl <blauwirbel@gmail.com>
Now that AIOPool no longer keeps a freelist, it isn't really a "pool"
anymore. Rename it to AIOCBInfo and make it const since it no longer
needs to be modified.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
AIO control blocks are frequently acquired and released because each aio
request involves at least one AIOCB. Therefore, we pool them to avoid
heap allocation overhead.
The problem with the freelist approach in AIOPool is thread-safety. If
we want BlockDriverStates to associate with AioContexts that execute in
multiple threads, then a global freelist becomes a problem.
This patch drops the freelist and instead uses g_slice_alloc() which is
tuned for per-thread fixed-size object pools. qemu_aio_get() and
qemu_aio_release() are now thread-safe.
Note that the change from g_malloc0() to g_slice_alloc() should be safe
since the freelist reuse case doesn't zero the AIOCB either.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* kwolf/for-anthony: (32 commits)
osdep: Less restrictive F_SEFL in qemu_dup_flags()
qemu-iotests: add testcases for mirroring on-source-error/on-target-error
qmp: add pull_event function
mirror: add support for on-source-error/on-target-error
iostatus: forward block_job_iostatus_reset to block job
qemu-iotests: add mirroring test case
mirror: implement completion
qmp: add drive-mirror command
mirror: introduce mirror job
block: introduce BLOCK_JOB_READY event
block: add block-job-complete
block: rename block_job_complete to block_job_completed
block: export dirty bitmap information in query-block
block: introduce new dirty bitmap functionality
block: add bdrv_open_backing_file
block: add bdrv_query_stats
block: add bdrv_query_info
qemu-config: Add new -add-fd command line option
monitor: Prevent removing fd from set during init
monitor: Enable adding an inherited fd to an fd set
...
Conflicts:
vl.c
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Assert that write_compressed is never used with the dirty bitmap.
Setting the bits early is wrong, because a coroutine might concurrently
examine them and copy incomplete data from the source.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Mirroring runs without the backing file so that it can be copied outside
QEMU. However, we need to add it at the time the job is completed and
QEMU switches to the target. Factor out the common bits of opening an
image and completing a mirroring operation.
The new function does not assume that the file is closed immediately after
it returns failure, so it keeps the BDRV_O_NO_BACKING flag up-to-date.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qmp_query_blockstat cannot have errors, remove the Error argument and
create a new public function bdrv_query_stats out of it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently, bdrv_find_backing_image compares bs->backing_file with
what is passed in as a backing_file name. Mismatches may occur,
however, when bs->backing_file and backing_file are not both
absolute or relative.
Use path_combine() to make sure any relative backing filenames are
relative to the current image filename being searched, and then use
realpath() to make all comparisons based on absolute filenames.
If either backing_file or bs->backing_file is determine to be a
protocol, then no filename normalization is performed.
This also changes bdrv_find_backing_image to no longer be recursive,
but iterative.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The first user of close notifiers will be the embedded NBD server.
It would be possible to use them to do some of the ad hoc processing
(e.g. for block jobs and I/O limits) that is currently done by
bdrv_close.
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is no reason in principle to skip job cancellation and draining
of pending I/O when there is no medium in the disk. Do these unconditionally,
which also prepares the code for the next patch.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Also, use PATH_MAX, rather than the arbitrary 1024.
Using PATH_MAX is more consistent with other filename-related
variables in this file, like backing_filename and tmp_filename.
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Jim Meyering <meyering@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
The following behaviors are possible:
'report': The behavior is the same as in 1.1. An I/O error,
respectively during a read or a write, will complete the job immediately
with an error code.
'ignore': An I/O error, respectively during a read or a write, will be
ignored. For streaming, the job will complete with an error and the
backing file will be left in place. For mirroring, the sector will be
marked again as dirty and re-examined later.
'stop': The job will be paused and the job iostatus will be set to
failed or nospace, while the VM will keep running. This can only be
specified if the block device has rerror=stop and werror=stop or enospc.
'enospc': Behaves as 'stop' for ENOSPC errors, 'report' for others.
In all cases, even for 'report', the I/O error is reported as a QMP
event BLOCK_JOB_ERROR, with the same arguments as BLOCK_IO_ERROR.
It is possible that while stopping the VM a BLOCK_IO_ERROR event will be
reported and will clobber the event from BLOCK_JOB_ERROR, or vice versa.
This is not really avoidable since stopping the VM completes all pending
I/O requests. In fact, it is already possible now that a series of
BLOCK_IO_ERROR events are reported with rerror=stop, because vm_stop
calls bdrv_drain_all and this can generate further errors.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Move the common part of IDE/SCSI/virtio error handling to the block
layer. The new function bdrv_error_action subsumes all three of
bdrv_emit_qmp_error_event, vm_stop, bdrv_iostatus_set_err.
The same scheme will be used for errors in block jobs.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Do this while we are touching this part of the code, before introducing
more uses of "int is_read".
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This will let block-stream reuse the enum. Places that used the enums
are renamed accordingly.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to remove knowledge of BLOCK_ERR_STOP_ENOSPC from drivers;
drivers should only be told whether to stop/report/ignore the error.
On the other hand, we want to keep using the nicer BlockErrorAction
name in the drivers. So rename the enums, while leaving aside the
names of the enum values for now.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is a simple helper function, that will return the base image
of a given image chain.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add bdrv_find_overlay(), and bdrv_drop_intermediate().
bdrv_find_overlay(): given 'bs' and the active (topmost) BDS of an image chain,
find the image that is the immediate top of 'bs'
bdrv_drop_intermediate():
Given 3 BDS (active, top, base), drop images above
base up to and including top, and set base to be the
backing file of top's overlay node.
E.g., this converts:
bottom <- base <- intermediate <- top <- active
to
bottom <- base <- active
Signed-off-by: Jeff Cody <jcody@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The keep_read_only flag is no longer used, in favor of the bdrv
flag BDRV_O_ALLOW_RDWR.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Currently, bdrv_commit() reopens images r/w itself, via risky
_delete() and _open() calls. Use the new safe method for drive reopen.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is based on Supriya Kannery's bdrv_reopen() patch series.
This provides a transactional method to reopen multiple
images files safely.
Image files are queue for reopen via bdrv_reopen_queue(), and the
reopen occurs when bdrv_reopen_multiple() is called. Changes are
staged in bdrv_reopen_prepare() and in the equivalent driver level
functions. If any of the staged images fails a prepare, then all
of the images left untouched, and the staged changes for each image
abandoned.
Block drivers are passed a reopen state structure, that contains:
* BDS to reopen
* flags for the reopen
* opaque pointer for any driver-specific data that needs to be
persistent from _prepare to _commit/_abort
* reopen queue pointer, if the driver needs to queue additional
BDS for a reopen
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_set_enable_write_cache() sets the bs->enable_write_cache flag,
but without the flag recorded in bs->open_flags, then next time
a reopen() is performed the enable_write_cache setting may be
inadvertently lost.
This will set the flag in open_flags, so it is preserved across
reopens.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
I believe the bs->keep_read_only flag is supposed to reflect
the initial open state of the device. If the device is initially
opened R/O, then commit operations, or reopen operations changing
to R/W, are prohibited.
Currently, the keep_read_only flag is only accurate for the active
layer, and its backing file. Subsequent images end up always having
the keep_read_only flag set.
For instance, what happens now:
[ base ] kro = 1, ro = 1
|
v
[ snap-1 ] kro = 1, ro = 1
|
v
[ snap-2 ] kro = 0, ro = 1
|
v
[ active ] kro = 0, ro = 0
What we want:
[ base ] kro = 0, ro = 1
|
v
[ snap-1 ] kro = 0, ro = 1
|
v
[ snap-2 ] kro = 0, ro = 1
|
v
[ active ] kro = 0, ro = 0
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The caller would not delete temporary file after failed get_tmp_filename().
Signed-off-by: Dunrong Huang <riegamaths@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The tray status should change also if you eject empty block device.
Signed-off-by: Pavel Hrdina <phrdina@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 29cdb251 already added a comment that no unnecessary flushes to
disk will occur, this patch makes the code even get to the point of the
comment. This is mostly theoretical because in practice we only stack
one format on top of one protocol, the former implementing flush_to_os
and the latter only flush_to_disk. It starts to matter when drivers that
are not on top implement flush_to_os.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Use the dedicated counting function in qmp_query_block in order to
propagate the backing file depth to HMP and add backing_file_depth
to qmp-commands.hx
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Create bdrv_get_backing_file_depth() in order to be able to show
in QMP and HMP how many ancestors backing an image a block device
have.
Signed-off-by: Benoit Canet <benoit@irqsave.net>
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
There are two producers of these hints: drive_init() on behalf of
-drive, and hd_geometry_guess().
The only consumer of the hint is hd_geometry_guess().
The callers of hd_geometry_guess() call it only when drive_init()
didn't set the hints. Therefore, drive_init()'s hints are never used.
Thus, hd_geometry_guess() only ever sees hints it produced itself in a
prior call. Only the first call computes something, subsequent calls
just repeat the first call's results. However, hd_geometry_guess() is
never called more than once: the device models don't, and the block
device is destroyed on unplug. Thus, dropping the repeat feature
doesn't break anything now.
If a block device wasn't destroyed on unplug and could be reused with
a new device, then repeating old results would be wrong. Thus,
dropping the repeat feature prevents future breakage.
This renders the hints unused. Purge them from the block layer.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit f3d54fc4 factored it out of hw/ide.c for reuse. Sensible,
except it was put into block.c. Device-specific functionality should
be kept in device code, not the block layer. Move it to
hw/hd-geometry.c, and make stylistic changes required to keep
checkpatch.pl happy.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Commit 5bbdbb46 moved it to block.c because "other geometry guessing
functions already reside in block.c". Device-specific functionality
should be kept in device code, not the block layer. Move it back.
Disk geometry guessing is still in block.c. To be moved out in a
later patch series.
Bonus: the floppy type used in pc_cmos_init() now obviously matches
the one in the FDrive. Before, we relied on
bdrv_get_floppy_geometry_hint() picking the same type both in
fd_revalidate() and in pc_cmos_init().
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
* mjt/mjt-iov2:
rewrite iov_send_recv() and move it to iov.c
cleanup qemu_co_sendv(), qemu_co_recvv() and friends
export iov_send_recv() and use it in iov_send() and iov_recv()
rename qemu_sendv to iov_send, change proto and move declarations to iov.h
change qemu_iovec_to_buf() to match other to,from_buf functions
consolidate qemu_iovec_copy() and qemu_iovec_concat() and make them consistent
allow qemu_iovec_from_buffer() to specify offset from which to start copying
consolidate qemu_iovec_memset{,_skip}() into single function and use existing iov_memset()
rewrite iov_* functions
change iov_* function prototypes to be more appropriate
virtio-serial-bus: use correct lengths in control_out() message
Conflicts:
tests/Makefile
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
To prepare move of guess_disk_lchs() into hw/, where it poking
BlockDriverState member io_limits_enabled directly would be unclean.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_get_floppy_geometry_hint() fails to store through its parameter
drive when bs has a geometry hint. Makes fd_revalidate() assign
random crap to drv->drive.
Has been broken that way for ages. Harmless, because:
* The only way to set a geometry hint is -drive if=none,cyls=...
Since commit c219331e, probably unintentional.
* The only use of drv->drive is as argument to another
bdrv_get_floppy_geometry_hint(). Which doesn't use it, since the
geometry hint is still there.
Drop the broken code, ignore -drive parameter cyls, heads and secs for
floppies even with if=none, just like before commit c219331e. Matches
-help, which explains cyls, heads, secs as "hard disk physical
geometry".
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The new function can be made a bit nicer than bdrv_append. It swaps the
whole contents, and then swaps back (using the usual t=a;a=b;b=t idiom)
the fields that need to stay on top. Thus, it does not need explicit
bdrv_detach_dev, bdrv_iostatus_disable, etc.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
While these should not be in use at the time a transaction is started,
a command in the prepare phase of a transaction might have added them,
so they need to be brought over.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
So callers don't need to know anything about maximum name length.
Returning a pointer is safe, because the name string lives as long as
the block driver it names, and block drivers don't die.
Requested by Peter Maydell.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Formats are entirely in charge of flushes for metadata writes. For
guest-initiated writes, a writethrough cache is faked in the block layer.
So we can always open in writeback mode.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Because the guest will be able to flip enable_write_cache, the actual
state may not match what is used to open the new snapshot.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We want to make the formats handle their own flushes
autonomously, while keeping for guests the ability to use a writethrough
cache. Since formats will write metadata via bs->file, bdrv_co_do_writev
is the only place where we need to add a flush.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The QED block driver already provides the functionality to not only
detect inconsistencies in images, but also fix them. However, this
functionality cannot be manually invoked with qemu-img, but the
check happens only automatically during bdrv_open().
This adds a -r switch to qemu-img check that allows manual invocation
of an image repair.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It now allows specifying offset within qiov to start from and
amount of bytes to copy. Actual implementation is just a call
to iov_to_buf().
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
qemu_iovec_concat() is currently a wrapper for
qemu_iovec_copy(), use the former (with extra
"0" arg) in a few places where it is used.
Change skip argument of qemu_iovec_copy() from
uint64_t to size_t, since size of qiov itself
is size_t, so there's no way to skip larger
sizes. Rename it to soffset, to make it clear
that the offset is applied to src.
Also change the only usage of uint64_t in
hw/9pfs/virtio-9p.c, in v9fs_init_qiov_from_pdu() -
all callers of it actually uses size_t too,
not uint64_t.
One added restriction: as for all other iovec-related
functions, soffset must point inside src.
Order of argumens is already good:
qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
int c, size_t bytes)
vs:
qemu_iovec_concat(QEMUIOVector *dst,
QEMUIOVector *src,
size_t soffset, size_t sbytes)
(note soffset is after _src_ not dst, since it applies to src;
for memset it applies to qiov).
Note that in many places where this function is used,
the previous call is qemu_iovec_reset(), which means
many callers actually want copy (replacing dst content),
not concat. So we may want to add a wrapper like
qemu_iovec_copy() with the same arguments but which
calls qemu_iovec_reset() before _concat().
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Similar to
qemu_iovec_memset(QEMUIOVector *qiov, size_t offset,
int c, size_t bytes);
the new prototype is:
qemu_iovec_from_buf(QEMUIOVector *qiov, size_t offset,
const void *buf, size_t bytes);
The processing starts at offset bytes within qiov.
This way, we may copy a bounce buffer directly to
a middle of qiov.
This is exactly the same function as iov_from_buf() from
iov.c, so use the existing implementation and rename it
to qemu_iovec_from_buf() to be shorter and to match the
utility function.
As with utility implementation, we now assert that the
offset is inside actual iovec. Nothing changed for
current callers, because `offset' parameter is new.
While at it, stop using "bounce-qiov" in block/qcow2.c
and copy decrypted data directly from cluster_data
instead of recreating a temp qiov for doing that.
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
In snapshot mode, bdrv_open creates an empty temporary file without
checking for mkstemp or close failure, and ignoring the possibility
of a buffer overrun given a surprisingly long $TMPDIR.
Change the get_tmp_filename function to return int (not void),
so that it can inform its two callers of those failures.
Also avoid the risk of buffer overrun and do not ignore mkstemp
or close failure.
Update both callers (in block.c and vvfat.c) to propagate
temp-file-creation failure to their callers.
get_tmp_filename creates and closes an empty file, while its
callers later open that presumed-existing file with O_CREAT.
The problem was that a malicious user could provoke mkstemp failure
and race to create a symlink with the selected temporary file name,
thus causing the qemu process (usually root owned) to open through
the symlink, overwriting an attacker-chosen file.
This addresses CVE-2012-2652.
http://bugzilla.redhat.com/CVE-2012-2652
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Jim Meyering <meyering@redhat.com>
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
qemu-img info should use the same logic as qemu when printing the
backing file path, or debugging becomes quite tricky. We can also
simplify the output in case the backing file has an absolute path
or a protocol.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_close should leave fields in the same state as bdrv_new. It is
not up to bdrv_open_common to fix the mess.
Also, backing_format was not being re-initialized.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
path_has_protocol will erroneously return "true" if the colon is part
of a filename. These names are common with stable device names produced
by udev. We cannot fully protect against this in case the filename
does not have a path component (e.g. if the current directory is
/dev/disk/by-path), but in the common case there will be a slash before
and path_has_protocol can easily detect that and return false.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
On Windows, all the logic is already in is_windows_drive and
is_windows_drive_prefix. On POSIX, there is no need to look
out for colons.
The win32 code changes the behaviour in some cases, we could have
something like "d:foo.img". The old code would treat it as relative
path, the new one as absolute. Now the path is absolute, because to
go from c:/program files/blah to d:foo.img you cannot say c:/program
files/blah/d:foo.img. You have to say d:foo.img. But you could also
say it's relative because (I think, at least it was like that in DOS
15 years ago) d:foo.img is relative to the current path of drive D.
Considering how path_is_absolute is used by path_combine, I think it's
better to treat it as absolute.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The limitation on not having I/O after cancellation cannot really be
kept. Even streaming has a very small race window where you could
cancel a job and have it report completion. If this window is hit,
bdrv_change_backing_file() will yield and possibly cause accesses to
dangling pointers etc.
So, let's just assume that we cannot know exactly what will happen
after the coroutine has set busy to false. We can set a very lax
condition:
- if we cancel the job, the coroutine won't set it to false again
(and hence will not call co_sleep_ns again).
- block_job_cancel_sync will wait for the coroutine to exit, which
pretty much ensures no race.
Instead, we track the coroutine that executes the job and put very
strict conditions on what to do while it is quiescent (busy = false).
First of all, the coroutine must never set busy = false while the job
has been cancelled. Second, the coroutine can be reentered arbitrarily
while it is quiescent, so you cannot really do anything but co_sleep_ns at
that time. This condition is obeyed by the block_job_sleep_ns function.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This function abstracts the pretty complex semantics of the "busy"
member of BlockJob.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
We are reusing bs->file across close/open, which may not cause any
known bugs but is a recipe for trouble. Prefer bdrv_delete, and
enjoy the new invariant in the implementation of bdrv_delete.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is another bug caused by not doing a full cleanup of the BDS
across close/open. This was found with mirroring by Shaolong Hu,
but it can probably be reproduced also with eject or change.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_append must also copy open_flags to the top, because the snapshot
has BDRV_O_NO_BACKING set. This causes interesting results if you
later use drive-reopen (not upstream) to reopen the image, and lose
the backing file in the process.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
QED's opaque data includes a pointer back to the BlockDriverState.
This breaks when bdrv_append shuffles data between bs_new and bs_top.
To avoid this, add a "rebind" function that tells the driver about
the new relationship between the BlockDriverState and its opaque.
The patch also adds rebind to VVFAT for completeness, even though
it is not used with live snapshots.
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Also reuse elsewhere the new constant for sizeof(unsigned long) * 8.
The dirty bitmap is allocated in bits but declared as unsigned long.
Thus, its memory block is accessed beyond its end unless the image
is a multiple of 64 chunks (i.e. a multiple of 64 MB).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
bdrv_img_create will temporarily open the backing file to probe its size.
However, this could be done with a read-write open if the wrong flags are
passed to bdrv_img_create. Since there is really no documentation on
what flags can be passed, assume that bdrv_img_create receives the flags
with which the new image will be opened; sanitize them when opening
the backing file.
Reported-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
These are needed to print "info block" output correctly. QCOW2 does this
because it needs it to write the header, but QED does not, and common code
is the right place to do it.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This check applies to all drivers, but QED lacks it.
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
[ Iterate until all block devices have processed all requests,
add comments. - Paolo ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The current qemu.git introduces failure with preallocation and some
sizes:
qemu-img create -f qcow2 new.img 976563K -o preallocation=metadata
qemu-img: qemu-coroutine-lock.c:111: qemu_co_mutex_unlock: Assertion
`mutex->locked == 1' failed.
And lock needs to work in coroutine context. So to fix this issue, we
need to make bdrv_create adopt coroutine at first.
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Allow streaming operations to be started with an initial speed limit.
This eliminates the window of time between starting streaming and
issuing block-job-set-speed. Users should use the new optional 'speed'
parameter instead so that speed limits are in effect immediately when
the job starts.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
There are at least two different errors that can occur in
block_job_set_speed(): the job might not support setting speeds or the
value might be invalid.
Use the Error mechanism to report the error where it occurs.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
The block job API uses -errno return values internally and we convert
these to Error in the QMP functions. This is ugly because the Error
should be created at the point where we still have all the relevant
information. More importantly, it is hard to add new error cases to
this case since we quickly run out of -errno values without losing
information.
Go ahead and use Error directly and don't convert later.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
The 'qemu-img convert -h' advertise that the default cache mode is
'writeback', while in fact it is 'unsafe'.
This patch 1) fix the help manual and 2) let bdrv_close() call bdrv_flush()
2) is needed because some backend storage doesn't have a self-flush
mechanism(for e.g., sheepdog), so we need to call bdrv_flush() to make
sure the image is really writen to the storage instead of hanging around
writeback cache forever.
Signed-off-by: Liu Yuan <tailai.ly@taobao.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
If an AIO request is in flight that refers to a BlockDriverState that
has been closed and possibly even freed, more or less anything could
happen. I have seen segfaults, -EBADF return values and qcow2 sometimes
actually catches the situation in bdrv_close() and abort()s.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
This function will clear all BDRV_O_INCOMING flags.
Signed-off-by: Benoit Canet <benoit.canet@gmail.com>
Reviewed-by: Stefan Hajnoczi <stefanha@gmail.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
A few fixups for bdrv_append():
The new bs (bs_new) passed into bdrv_append() should be anonymous. Rather
than call bdrv_make_anon() to enforce this, use an assert to catch when a caller
is passing in a bs_new that is not anonymous.
Also, the new top layer should have its backing_format reflect the original
top's format.
And last, after the swap of bs contents, the device_name will have been copied
down. This needs to be cleared to reflect the anonymity of the bs that was
pushed down.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is no need to do this in every implementation of set_speed
(even though there is only one right now).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Streaming can issue I/O while qcow2_close is running. This causes the
L2 caches to become very confused or, alternatively, could cause a
segfault when the streaming coroutine is reentered after closing its
block device. The fix is to cancel streaming jobs when closing their
underlying device.
The cancellation must be synchronous, on the other hand qemu_aio_wait
will not restart a coroutine that is sleeping in co_sleep. So add
a flag saying whether streaming has in-flight I/O. If the busy flag
is false, the coroutine is quiescent and, when cancelled, will not
issue any new I/O.
This protects streaming against closing, but not against deleting.
We have a reference count protecting us against concurrent deletion,
but I still added an assertion to ensure nothing bad happens.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Reviewed-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Monitor operations that manipulate image files must not execute while a
background job (like image streaming) is in progress. This prevents
corruptions from happening when two pieces of code are manipulating the
image file without knowledge of each other.
The monitor "commit" command raises QERR_DEVICE_IN_USE when
bdrv_commit() returns -EBUSY but "commit all" has no error handling.
This is easy to fix, although note that we do not deliver a detailed
error about which device was busy in the "commit all" case.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This is a QAPI/QMP only command to take a snapshot of a group of
devices. This is similar to the blockdev-snapshot-sync command, except
blockdev-group-snapshot-sync accepts a list devices, filenames, and
formats.
It is attempted to keep the snapshot of the group atomic; if the
creation or open of any of the new snapshots fails, then all of
the new snapshots are abandoned, and the name of the snapshot image
that failed is returned. The failure case should not interrupt
any operations.
Rather than use bdrv_close() along with a subsequent bdrv_open() to
perform the pivot, the original image is never closed and the new
image is placed 'in front' of the original image via manipulation
of the BlockDriverState fields. Thus, once the new snapshot image
has been successfully created, there are no more failure points
before pivoting to the new snapshot.
This allows the group of disks to remain consistent with each other,
even across snapshot failures.
Signed-off-by: Jeff Cody <jcody@redhat.com>
Acked-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Floppies must be read at a specific transfer rate, depending of its own format.
Update floppy description table to include required transfer rate.
Signed-off-by: Hervé Poussineau <hpoussin@reactos.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
It's emitted whenever the tray is moved by the guest or by HMP/QMP
commands.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
They are QMP events, not monitor events. Rename them accordingly.
Also, move bdrv_emit_qmp_error_event() up in the file. A new event will
be added soon and it's good to have them next each other.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Kevin Wolf <kwolf@redhat.com>
Copy-on-Read populates the image file with data read from a backing
image. In order to avoid bloating the image file when all zeroes are
read we should scan the buffer and perform an optimized zero write
operation.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The ability to zero regions of an image file is a useful primitive for
higher-level features such as image streaming or zero write detection.
Image formats may support an optimized metadata representation instead
of writing zeroes into the image file. This allows zero writes to be
potentially faster than regular write operations and also preserve
sparseness of the image file.
The .bdrv_co_write_zeroes() interface should be implemented by block
drivers that wish to provide efficient zeroing.
Note that this operation is different from the discard operation, which
may leave the contents of the region indeterminate. That means
discarded blocks are not guaranteed to contain zeroes and may contain
junk data instead.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Add bdrv_find_backing_image: given a BlockDriverState pointer, and an id,
traverse the backing image chain to locate the id.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Previously copy-on-read could only be enabled for all requests to a
block device. This means requests coming from the guest as well as
QEMU's internal requests would perform copy-on-read when enabled.
For image streaming we want to support finer-grained behavior than just
populating the image file from its backing image. Image streaming
supports partial streaming where a common backing image is preserved.
In this case guest requests should not perform copy-on-read because they
would indiscriminately copy data which should be left in a backing image
from the backing chain.
Introduce a per-request flag for copy-on-read so that a block device can
process both regular and copy-on-read requests. Overlapping reads and
writes still need to be serialized for correctness when copy-on-read is
happening, so add an in-flight reference count to track this.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Long-running block operations like block migration and image streaming
must have continual access to their block device. It is not safe to
perform operations like hotplug, eject, change, resize, commit, or
external snapshot while a long-running operation is in progress.
This patch adds the missing bdrv_in_use() checks so that block migration
and image streaming never have the rug pulled out from underneath them.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Coverity is confused by this "if" and reports leaks on acb->bh.
The bottom half is always deleted before releasing the AIOCB,
in either bdrv_aio_cancel_em or bdrv_aio_bh_cb.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that early failure of bdrv_aio_writev is not possible anymore,
mcb->num_requests can be set before the loop starts.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Initially done with the following semantic patch:
@ rule1 @
expression E;
statement S;
@@
E =
(
bdrv_aio_readv
| bdrv_aio_writev
| bdrv_aio_flush
| bdrv_aio_discard
| bdrv_aio_ioctl
)
(...);
(
- if (E == NULL) { ... }
|
- if (E)
{ <... S ...> }
)
which however missed the occurrence in block/blkverify.c
(as it should have done), and left behind some unused
variables.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Many places in QEMU call qemu_aio_flush() to complete all pending
asynchronous I/O. Most of these places actually want to drain all block
requests but there is no block layer API to do so.
This patch introduces the bdrv_drain_all() API to wait for requests
across all BlockDriverStates to complete. As a bonus we perform checks
after qemu_aio_wait() to ensure that requests really have finished.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Debugging a reentrant request deadlock was fun but in the future we need
a quick and obvious way of detecting such bugs. Add an assert that
checks we are not about to deadlock when waiting for another request.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Cases beyond the end of the disk image are only implemented for block
drivers that do not provide .bdrv_co_is_allocated(). It's worth making
these cases generic so that block drivers that do implement
.bdrv_co_is_allocated() also get them for free.
Suggested-by: Mark Wu <wudxw@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Detect overlapping requests and remember to align to cluster boundaries
if the image format uses them. This assumes that allocating I/O is
performed in cluster granularity - which is true for qcow2, qed, etc.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
When copy-on-read is enabled it is necessary to wait for overlapping
requests before issuing new requests. This prevents races between the
copy-on-read and a write request.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The bdrv_enable_copy_on_read()/bdrv_disable_copy_on_read() functions can
be used to programmatically enable or disable copy-on-read for a block
device. Later patches add the actual copy-on-read logic.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
The block layer does not know about pending requests. This information
is necessary for copy-on-read since overlapping requests must be
serialized to prevent races that corrupt the image.
The BlockDriverState gets a new tracked_request list field which
contains all pending requests. Each request is a BdrvTrackedRequest
record with sector_num, nb_sectors, and is_write fields.
Note that request tracking is always enabled but hopefully this extra
work is so small that it doesn't justify adding an enable/disable flag.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch introduces the public bdrv_co_is_allocated() interface which
can be used to query image allocation status while the VM is running.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Now that all block drivers have been converted to
.bdrv_co_is_allocated() we can drop .bdrv_is_allocated().
Note that the public bdrv_is_allocated() interface is still available
but is in fact a synchronous wrapper around .bdrv_co_is_allocated().
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
This patch adds the .bdrv_co_is_allocated() interface which is identical
to .bdrv_is_allocated() but runs in coroutine context. Running in
coroutine context implies that other coroutines might be performing I/O
at the same time. Therefore it must be safe to run while the following
BlockDriver functions are in-flight:
.bdrv_co_readv()
.bdrv_co_writev()
.bdrv_co_flush()
.bdrv_co_is_allocated()
The new .bdrv_co_is_allocated() interface is useful because it can be
used when a VM is running, whereas .bdrv_is_allocated() is a synchronous
interface that does not cope with parallel requests.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
There is no need for bdrv_commit() to use the BlockDriver
.bdrv_is_allocated() interface directly. Converting to the public
interface gives us the freedom to drop .bdrv_is_allocated() entirely in
favor of a new .bdrv_co_is_allocated() in the future.
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Image files have two types of data: immutable data that describes things like
image size, backing files, etc. and mutable data that includes offset and
reference count tables.
Today, image formats aggressively cache mutable data to improve performance. In
some cases, this happens before a guest even starts. When dealing with live
migration, since a file is open on two machines, the caching of meta data can
lead to data corruption.
This patch addresses this by introducing a mechanism to invalidate any cached
mutable data a block driver may have which is then used by the live migration
code.
NB, this still requires coherent shared storage. Addressing migration without
coherent shared storage (i.e. NFS) requires additional work.
Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
cache=unsafe completely ignored bdrv_flush, because flushing the host disk
costs a lot of performance. However, this means that qcow2 images (and
potentially any other format) can lose data even after the guest has issued a
flush if the qemu process crashes/is killed. In case of a host crash, data loss
is certainly expected with cache=unsafe, but if just the qemu process dies this
is a bit too unsafe.
Now that we have two separate flush functions, we can choose to flush
everythign to the OS, but don't enforce that it's physically written to the
disk.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>