mirror of
https://github.com/xemu-project/xemu.git
synced 2024-11-25 20:49:49 +00:00
bbacffc5f7
This patch introduces the icount field for saving within the snapshot. It is required for navigation between the snapshots in record/replay mode. Signed-off-by: Pavel Dovgalyuk <Pavel.Dovgalyuk@ispras.ru> Acked-by: Kevin Wolf <kwolf@redhat.com> -- v7 changes: - also fix the test which checks qcow2 snapshot extra data Message-Id: <160174518284.12451.2301137308458777398.stgit@pasha-ThinkPad-X280> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
902 lines
38 KiB
Plaintext
== General ==

A qcow2 image file is organized in units of constant size, which are called
(host) clusters. A cluster is the unit in which all allocations are done,
both for actual guest data and for image metadata.

Likewise, the virtual disk as seen by the guest is divided into (guest)
clusters of the same size.

All numbers in qcow2 are stored in Big Endian byte order.

== Header ==

The first cluster of a qcow2 image contains the file header:

    Byte   0 -   3:   magic
                      QCOW magic string ("QFI\xfb")

           4 -   7:   version
                      Version number (valid values are 2 and 3)

           8 -  15:   backing_file_offset
                      Offset into the image file at which the backing file name
                      is stored (NB: The string is not null terminated). 0 if the
                      image doesn't have a backing file.

                      Note: backing files are incompatible with raw external data
                      files (auto-clear feature bit 1).

          16 -  19:   backing_file_size
                      Length of the backing file name in bytes. Must not be
                      longer than 1023 bytes. Undefined if the image doesn't have
                      a backing file.

          20 -  23:   cluster_bits
                      Number of bits that are used for addressing an offset
                      within a cluster (1 << cluster_bits is the cluster size).
                      Must not be less than 9 (i.e. 512 byte clusters).

                      Note: qemu as of today has an implementation limit of 2 MB
                      as the maximum cluster size and won't be able to open images
                      with larger cluster sizes.

                      Note: if the image has Extended L2 Entries then cluster_bits
                      must be at least 14 (i.e. 16384 byte clusters).

          24 -  31:   size
                      Virtual disk size in bytes.

                      Note: qemu has an implementation limit of 32 MB as
                      the maximum L1 table size. With a 2 MB cluster
                      size, it is unable to populate a virtual cluster
                      beyond 2 EB (61 bits); with a 512 byte cluster
                      size, it is unable to populate a virtual size
                      larger than 128 GB (37 bits). Meanwhile, L1/L2
                      table layouts limit an image to no more than 64 PB
                      (56 bits) of populated clusters, and an image may
                      hit other limits first (such as a file system's
                      maximum size).

          32 -  35:   crypt_method
                      0 for no encryption
                      1 for AES encryption
                      2 for LUKS encryption

          36 -  39:   l1_size
                      Number of entries in the active L1 table

          40 -  47:   l1_table_offset
                      Offset into the image file at which the active L1 table
                      starts. Must be aligned to a cluster boundary.

          48 -  55:   refcount_table_offset
                      Offset into the image file at which the refcount table
                      starts. Must be aligned to a cluster boundary.

          56 -  59:   refcount_table_clusters
                      Number of clusters that the refcount table occupies

          60 -  63:   nb_snapshots
                      Number of snapshots contained in the image

          64 -  71:   snapshots_offset
                      Offset into the image file at which the snapshot table
                      starts. Must be aligned to a cluster boundary.

For version 2, the header is exactly 72 bytes in length, and finishes here.
For version 3 or higher, the header length is at least 104 bytes, including
the next fields through header_length.

          72 -  79:   incompatible_features
                      Bitmask of incompatible features. An implementation must
                      fail to open an image if an unknown bit is set.

                      Bit 0:      Dirty bit. If this bit is set then refcounts
                                  may be inconsistent; make sure to scan L1/L2
                                  tables to repair refcounts before accessing the
                                  image.

                      Bit 1:      Corrupt bit. If this bit is set then any data
                                  structure may be corrupt and the image must not
                                  be written to (except to regain consistency).

                      Bit 2:      External data file bit. If this bit is set, an
                                  external data file is used. Guest clusters are
                                  then stored in the external data file. For such
                                  images, clusters in the external data file are
                                  not refcounted. The offset field in the
                                  Standard Cluster Descriptor must match the
                                  guest offset and neither compressed clusters
                                  nor internal snapshots are supported.

                                  An External Data File Name header extension may
                                  be present if this bit is set.

                      Bit 3:      Compression type bit. If this bit is set,
                                  a non-default compression is used for compressed
                                  clusters. The compression_type field must be
                                  present and not zero.

                      Bit 4:      Extended L2 Entries. If this bit is set then
                                  L2 table entries use an extended format that
                                  allows subcluster-based allocation. See the
                                  Extended L2 Entries section for more details.

                      Bits 5-63:  Reserved (set to 0)

          80 -  87:   compatible_features
                      Bitmask of compatible features. An implementation can
                      safely ignore any unknown bits that are set.

                      Bit 0:      Lazy refcounts bit. If this bit is set then
                                  lazy refcount updates can be used. This means
                                  marking the image file dirty and postponing
                                  refcount metadata updates.

                      Bits 1-63:  Reserved (set to 0)

          88 -  95:   autoclear_features
                      Bitmask of auto-clear features. An implementation may only
                      write to an image with unknown auto-clear features if it
                      clears the respective bits from this field first.

                      Bit 0:      Bitmaps extension bit
                                  This bit indicates consistency for the bitmaps
                                  extension data.

                                  It is an error if this bit is set without the
                                  bitmaps extension present.

                                  If the bitmaps extension is present but this
                                  bit is unset, the bitmaps extension data must be
                                  considered inconsistent.

                      Bit 1:      Raw external data bit
                                  If this bit is set, the external data file can
                                  be read as a consistent standalone raw image
                                  without looking at the qcow2 metadata.

                                  Setting this bit has a performance impact for
                                  some operations on the image (e.g. writing
                                  zeros requires writing to the data file instead
                                  of only setting the zero flag in the L2 table
                                  entry) and conflicts with backing files.

                                  This bit may only be set if the External Data
                                  File bit (incompatible feature bit 2) is also
                                  set.

                      Bits 2-63:  Reserved (set to 0)

          96 -  99:   refcount_order
                      Describes the width of a reference count block entry (width
                      in bits: refcount_bits = 1 << refcount_order). For version 2
                      images, the order is always assumed to be 4
                      (i.e. refcount_bits = 16).
                      This value may not exceed 6 (i.e. refcount_bits = 64).

         100 - 103:   header_length
                      Length of the header structure in bytes. For version 2
                      images, the length is always assumed to be 72 bytes.
                      For version 3 it's at least 104 bytes and must be a multiple
                      of 8.

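The fixed part of the header can be decoded with a few big-endian unpack
calls. The following is a minimal sketch, not QEMU's implementation: the names
parse_header, V2_FORMAT and V3_FORMAT are made up for illustration, and the
derived keys cluster_size and refcount_bits are conveniences added here.

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"

# Bytes 0-71, shared by version 2 and version 3 (">" means big endian).
V2_FORMAT = ">4sIQIIQIIQQIIQ"
V2_FIELDS = ("magic", "version", "backing_file_offset", "backing_file_size",
             "cluster_bits", "size", "crypt_method", "l1_size",
             "l1_table_offset", "refcount_table_offset",
             "refcount_table_clusters", "nb_snapshots", "snapshots_offset")

# Bytes 72-103, only present in version 3 images.
V3_FORMAT = ">QQQII"
V3_FIELDS = ("incompatible_features", "compatible_features",
             "autoclear_features", "refcount_order", "header_length")


def parse_header(buf):
    """Decode the fixed header fields from the start of an image (hypothetical helper)."""
    header = dict(zip(V2_FIELDS, struct.unpack_from(V2_FORMAT, buf, 0)))
    if header["magic"] != QCOW2_MAGIC:
        raise ValueError("not a qcow2 image")
    if header["version"] == 2:
        # Version 2 has no such fields; the spec fixes these two values.
        header["refcount_order"] = 4
        header["header_length"] = 72
    elif header["version"] == 3:
        header.update(zip(V3_FIELDS, struct.unpack_from(V3_FORMAT, buf, 72)))
    else:
        raise ValueError("unsupported version")
    if header["cluster_bits"] < 9:
        raise ValueError("cluster_bits must not be less than 9")
    # Derived values, added for convenience.
    header["cluster_size"] = 1 << header["cluster_bits"]
    header["refcount_bits"] = 1 << header["refcount_order"]
    return header
```

A caller would typically read the first 104 bytes of the image file and pass
them to parse_header before looking at any other structure.
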
=== Additional fields (version 3 and higher) ===

In general, these fields are optional and may be safely ignored by the software,
as well as filled by zeros (which is equal to field absence), if software needs
to set field B, but does not care about field A which precedes B. More
formally, additional fields have the following compatibility rules:

1. If the value of the additional field must not be ignored for correct
   handling of the file, it will be accompanied by a corresponding incompatible
   feature bit.

2. If there are no unrecognized incompatible feature bits set, an unknown
   additional field may be safely ignored other than preserving its value when
   rewriting the image header.

3. An explicit value of 0 will have the same behavior as when the field is not
   present*, if not altered by a specific incompatible bit.

*. A field is considered not present when header_length is less than or equal
   to the field's offset. Also, all additional fields are not present for
   version 2.

               104:   compression_type

                      Defines the compression method used for compressed clusters.
                      All compressed clusters in an image use the same compression
                      type.

                      If the incompatible bit "Compression type" is set: the field
                      must be present and non-zero (which means non-zlib
                      compression type). Otherwise, this field must not be present
                      or must be zero (which means zlib).

                      Available compression type values:
                          0: zlib <https://www.zlib.net/>
                          1: zstd <http://github.com/facebook/zstd>

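The presence rule and the compression_type constraints above can be sketched
as follows. The helper names field_present and compression_type are
hypothetical, and the caller is assumed to have already decoded version,
header_length and incompatible_features from the header.

```python
COMPRESSION_TYPE_OFFSET = 104
INCOMPAT_COMPRESSION_TYPE = 1 << 3   # incompatible feature bit 3
COMPRESSION_NAMES = {0: "zlib", 1: "zstd"}


def field_present(version, header_length, field_offset):
    """A field is absent for version 2, or when header_length <= its offset."""
    return version >= 3 and header_length > field_offset


def compression_type(buf, version, header_length, incompatible_features):
    """Return the compression name, enforcing the feature-bit consistency rule."""
    value = 0  # an absent field behaves like an explicit 0 (rule 3)
    if field_present(version, header_length, COMPRESSION_TYPE_OFFSET):
        value = buf[COMPRESSION_TYPE_OFFSET]
    bit_set = bool(incompatible_features & INCOMPAT_COMPRESSION_TYPE)
    if bit_set != (value != 0):
        raise ValueError("compression_type disagrees with incompatible bit 3")
    if value not in COMPRESSION_NAMES:
        raise ValueError("unknown compression type %d" % value)
    return COMPRESSION_NAMES[value]
```

Treating an absent field as 0 keeps a single code path for version 2 images,
short version 3 headers, and headers that carry the field explicitly.
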
=== Header padding ===

@header_length must be a multiple of 8, which means that if the end of the last
additional field is not aligned, some padding is needed. This padding must be
zeroed, so that if some existing (or future) additional field falls into
the padding, it will be interpreted according to rule 3 of the previous
section, i.e. in the same manner as when this field is not present.

=== Header extensions ===

Directly after the image header, optional sections called header extensions can
be stored. Each extension has a structure like the following:

    Byte  0 -  3:   Header extension type:
                        0x00000000 - End of the header extension area
                        0xe2792aca - Backing file format name string
                        0x6803f857 - Feature name table
                        0x23852875 - Bitmaps extension
                        0x0537be77 - Full disk encryption header pointer
                        0x44415441 - External data file name string
                        other      - Unknown header extension, can be safely
                                     ignored

          4 -  7:   Length of the header extension data

          8 -  n:   Header extension data

          n -  m:   Padding to round up the header extension size to the next
                    multiple of 8.

Unless stated otherwise, each header extension type shall appear at most once
in the same image.

If the image has a backing file then the backing file name should be stored in
the remaining space between the end of the header extension area and the end of
the first cluster. It is not allowed to store other data here, so that an
implementation can safely modify the header and add extensions without harming
data of compatible features that it doesn't support. Compatible features that
need space for additional data can use a header extension.

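Walking the header extension area amounts to reading a type and a length,
taking that many data bytes, and skipping the padding. The sketch below
assumes the whole first cluster is available in buf; the function name
iter_header_extensions is illustrative only.

```python
import struct

END_OF_EXTENSIONS = 0x00000000


def iter_header_extensions(buf, header_length):
    """Yield (type, data) pairs for each header extension after the header."""
    offset = header_length
    while offset + 8 <= len(buf):
        ext_type, length = struct.unpack_from(">II", buf, offset)
        if ext_type == END_OF_EXTENSIONS:
            return
        data = bytes(buf[offset + 8:offset + 8 + length])
        if len(data) != length:
            raise ValueError("truncated header extension")
        yield ext_type, data
        # The data is padded so that the next extension starts 8-byte aligned.
        offset += 8 + ((length + 7) & ~7)
```

Unknown types are yielded like any other, which lets the caller ignore them as
the format allows while still preserving them when rewriting the header.
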
== String header extensions ==

Some header extensions (such as the backing file format name and the external
data file name) are just a single string. In this case, the header extension
length is the string length and the string is not '\0' terminated. (The header
extension padding can make it look like a string is '\0' terminated, but
neither is padding always necessary nor is there a guarantee that zero bytes
are used for padding.)

== Feature name table ==

The feature name table is an optional header extension that contains the name
for features used by the image. It can be used by applications that don't know
the respective feature (e.g. because the feature was introduced only later) to
display a useful error message.

The number of entries in the feature name table is determined by the length of
the header extension data. Each entry looks like this:

    Byte       0:   Type of feature (select feature bitmap)
                        0: Incompatible feature
                        1: Compatible feature
                        2: Autoclear feature

               1:   Bit number within the selected feature bitmap (valid
                    values: 0-63)

          2 - 47:   Feature name (padded with zeros, but not necessarily null
                    terminated if it has full length)

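Since every entry is 48 bytes long, the table can be split by simple slicing.
This is a sketch under the assumption that data holds the extension payload
(for example as obtained from a header extension walk); the function name and
the returned tuple shape are made up for illustration.

```python
FEATURE_TYPES = {0: "incompatible", 1: "compatible", 2: "autoclear"}
ENTRY_SIZE = 48


def parse_feature_name_table(data):
    """Return a list of (feature type, bit number, name) tuples."""
    entries = []
    for pos in range(0, len(data) - ENTRY_SIZE + 1, ENTRY_SIZE):
        feature_type = FEATURE_TYPES.get(data[pos], "unknown")
        bit_number = data[pos + 1]
        # The name is zero padded, but a full-length name has no terminator.
        raw_name = bytes(data[pos + 2:pos + ENTRY_SIZE]).rstrip(b"\0")
        entries.append((feature_type, bit_number,
                        raw_name.decode("ascii", "replace")))
    return entries
```

An application that fails to open an image because of an unknown incompatible
bit could look the bit up in this list to produce a readable error message.
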
== Bitmaps extension ==

The bitmaps extension is an optional header extension. It provides the ability
to store bitmaps related to a virtual disk. For now, there is only one bitmap
type: the dirty tracking bitmap, which tracks virtual disk changes from some
point in time.

The data of the extension should be considered consistent only if the
corresponding auto-clear feature bit is set, see autoclear_features above.

The fields of the bitmaps extension are:

    Byte  0 -  3:  nb_bitmaps
                   The number of bitmaps contained in the image. Must be
                   greater than or equal to 1.

                   Note: Qemu currently only supports up to 65535 bitmaps per
                   image.

          4 -  7:  Reserved, must be zero.

          8 - 15:  bitmap_directory_size
                   Size of the bitmap directory in bytes. It is the cumulative
                   size of all (nb_bitmaps) bitmap directory entries.

         16 - 23:  bitmap_directory_offset
                   Offset into the image file at which the bitmap directory
                   starts. Must be aligned to a cluster boundary.

== Full disk encryption header pointer ==

The full disk encryption header must be present if, and only if, the
'crypt_method' header requires metadata. Currently this is only true
of the 'LUKS' crypt method. The header extension must be absent for
other methods.

This header provides the offset at which the crypt method can store
its additional data, as well as the length of such data.

    Byte  0 -  7:   Offset into the image file at which the encryption
                    header starts in bytes. Must be aligned to a cluster
                    boundary.
    Byte  8 - 15:   Length of the written encryption header in bytes.
                    Note actual space allocated in the qcow2 file may
                    be larger than this value, since it will be rounded
                    up to a multiple of the cluster size. Any
                    unused bytes in the allocated space will be initialized
                    to 0.

For the LUKS crypt method, the encryption header works as follows.

The first 592 bytes of the header clusters will contain the LUKS
partition header. This is then followed by the key material data areas.
The size of the key material data areas is determined by the number of
stripes in the key slot and key size. Refer to the LUKS format
specification ('docs/on-disk-format.pdf' in the cryptsetup source
package) for details of the LUKS partition header format.

In the LUKS partition header, the "payload-offset" field will be
calculated as normal for the LUKS spec, i.e. the size of the LUKS
header, plus key material regions, plus padding, relative to the
start of the LUKS header. This offset value is not required to be
qcow2 cluster aligned. Its value is currently never used in the
context of qcow2, since the qcow2 file format itself defines where
the real payload offset is, but nonetheless a valid payload offset
should always be present.

In the LUKS key slots header, the "key-material-offset" is relative
to the start of the LUKS header clusters in the qcow2 container,
not the start of the qcow2 file.

Logically the layout looks like

  +-----------------------------+
  | QCow2 header                |
  | QCow2 header extension X    |
  | QCow2 header extension FDE  |
  | QCow2 header extension ...  |
  | QCow2 header extension Z    |
  +-----------------------------+
  | ....other QCow2 tables....  |
  .                             .
  .                             .
  +-----------------------------+
  | +-------------------------+ |
  | | LUKS partition header   | |
  | +-------------------------+ |
  | | LUKS key material 1     | |
  | +-------------------------+ |
  | | LUKS key material 2     | |
  | +-------------------------+ |
  | | LUKS key material ...   | |
  | +-------------------------+ |
  | | LUKS key material 8     | |
  | +-------------------------+ |
  +-----------------------------+
  | QCow2 cluster payload       |
  .                             .
  .                             .
  .                             .
  |                             |
  +-----------------------------+

== Data encryption ==

When an encryption method is requested in the header, the image payload
data must be encrypted/decrypted on every write/read. The image headers
and metadata are never encrypted.

The algorithms used for encryption vary depending on the method:

 - AES:

   The AES cipher, in CBC mode, with 256 bit keys.

   Initialization vectors are generated using the plain64 method, with
   the virtual disk sector as the input tweak.

   This format is no longer supported in QEMU system emulators, due
   to a number of design flaws affecting its security. It is only
   supported in the command line tools for the sake of backward
   compatibility and data liberation.

 - LUKS:

   The algorithms are specified in the LUKS header.

   Initialization vectors are generated using the method specified
   in the LUKS header, with the physical disk sector as the
   input tweak.

== Host cluster management ==

qcow2 manages the allocation of host clusters by maintaining a reference count
for each host cluster. A refcount of 0 means that the cluster is free, 1 means
that it is used, and >= 2 means that it is used and any write access must
perform a COW (copy on write) operation.

The refcounts are managed in a two-level table. The first level is called
refcount table and has a variable size (which is stored in the header). The
refcount table can cover multiple clusters, however it needs to be contiguous
in the image file.

It contains pointers to the second level structures which are called refcount
blocks and are exactly one cluster in size.

Although a large enough refcount table can reserve clusters past 64 PB
(56 bits) (assuming the underlying protocol can even be sized that
large), note that some qcow2 metadata such as L1/L2 tables must point
to clusters prior to that point.

Note: qemu has an implementation limit of 8 MB as the maximum refcount
table size. With a 2 MB cluster size and a default refcount_order of
4, it is unable to reference host resources beyond 2 EB (61 bits); in
the worst case, with a 512 byte cluster size and refcount_order of 6, it
is unable to access beyond 32 GB (35 bits).

Given an offset into the image file, the refcount of its cluster can be
obtained as follows:

    refcount_block_entries = (cluster_size * 8 / refcount_bits)

    refcount_block_index = (offset / cluster_size) % refcount_block_entries
    refcount_table_index = (offset / cluster_size) / refcount_block_entries

    refcount_block = load_cluster(refcount_table[refcount_table_index]);
    return refcount_block[refcount_block_index];

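The lookup above can be turned into a runnable sketch. It assumes the whole
image is held in memory as a bytes-like object named image, performs no bounds
checking on the refcount table, and uses the hypothetical function name
get_refcount. For sub-byte widths it takes entry 0 from the least significant
bits of each byte, following the bit numbering given for refcount block
entries.

```python
import struct


def get_refcount(image, offset, cluster_bits, refcount_order,
                 refcount_table_offset):
    """Return the refcount of the host cluster containing `offset`."""
    cluster_size = 1 << cluster_bits
    refcount_bits = 1 << refcount_order
    refcount_block_entries = cluster_size * 8 // refcount_bits

    cluster_index = offset // cluster_size
    refcount_block_index = cluster_index % refcount_block_entries
    refcount_table_index = cluster_index // refcount_block_entries

    (table_entry,) = struct.unpack_from(
        ">Q", image, refcount_table_offset + 8 * refcount_table_index)
    block_offset = table_entry & ~0x1FF  # bits 0-8 are reserved
    if block_offset == 0:
        return 0  # refcount block not allocated: all its refcounts are 0

    bit_pos = refcount_block_index * refcount_bits
    if refcount_bits >= 8:
        start = block_offset + bit_pos // 8
        return int.from_bytes(image[start:start + refcount_bits // 8], "big")
    # Sub-byte widths: several entries share one byte.
    byte = image[block_offset + bit_pos // 8]
    return (byte >> (bit_pos % 8)) & ((1 << refcount_bits) - 1)
```

A real implementation would read only the needed table entry and refcount
block from disk, and would validate the table index against
refcount_table_clusters first.
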
Refcount table entry:

    Bit  0 -  8:    Reserved (set to 0)

         9 - 63:    Bits 9-63 of the offset into the image file at which the
                    refcount block starts. Must be aligned to a cluster
                    boundary.

                    If this is 0, the corresponding refcount block has not yet
                    been allocated. All refcounts managed by this refcount block
                    are 0.

Refcount block entry (x = refcount_bits - 1):

    Bit  0 -  x:    Reference count of the cluster. If refcount_bits implies a
                    sub-byte width, note that bit 0 means the least significant
                    bit in this context.

== Cluster mapping ==

Just as for refcounts, qcow2 uses a two-level structure for the mapping of
guest clusters to host clusters. They are called L1 and L2 tables.

The L1 table has a variable size (stored in the header) and may use multiple
clusters, however it must be contiguous in the image file. L2 tables are
exactly one cluster in size.

The L1 and L2 tables have implications on the maximum virtual file
size; for a given L1 table size, a larger cluster size is required for
the guest to have access to more space. Furthermore, a virtual
cluster must currently map to a host offset below 64 PB (56 bits)
(although this limit could be relaxed by putting reserved bits into
use). Additionally, as cluster size increases, the maximum host
offset for a compressed cluster is reduced (a 2M cluster size requires
compressed clusters to reside below 512 TB (49 bits), and this limit
cannot be relaxed without an incompatible layout change).

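The size limits quoted in this document follow from simple bit counting. The
sketch below reproduces that arithmetic under the assumption of 8-byte L1 and
L2 entries (no Extended L2 Entries) and a power-of-two L1 table size; both
function names are invented for this example, and the 32 MB default is the
qemu implementation limit mentioned in the header description.

```python
def max_virtual_size_bits(cluster_bits, l1_table_bytes=32 * 1024 * 1024):
    """Bits of guest address space reachable with a given L1 table size."""
    l1_entries_bits = (l1_table_bytes // 8).bit_length() - 1
    l2_entries_bits = cluster_bits - 3  # cluster_size / 8 entries per L2 table
    return l1_entries_bits + l2_entries_bits + cluster_bits


def compressed_offset_bits(cluster_bits):
    """Width of the host offset field in a compressed cluster descriptor."""
    x = 62 - (cluster_bits - 8)
    return min(x, 56)  # bits beyond 55 must be 0 in any case


# 2 MB clusters: 22 + 18 + 21 = 61 bits (2 EB)
print(max_virtual_size_bits(21))
# 512 byte clusters: 22 + 6 + 9 = 37 bits (128 GB)
print(max_virtual_size_bits(9))
# 2 MB clusters: 49 bits (512 TB) for compressed cluster offsets
print(compressed_offset_bits(21))
```
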
Given an offset into the virtual disk, the offset into the image file can be
obtained as follows:

    l2_entries = (cluster_size / sizeof(uint64_t))        [*]

    l2_index = (offset / cluster_size) % l2_entries
    l1_index = (offset / cluster_size) / l2_entries

    l2_table = load_cluster(l1_table[l1_index]);
    cluster_offset = l2_table[l2_index];

    return cluster_offset + (offset % cluster_size)

    [*] this changes if Extended L2 Entries are enabled, see next section

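A runnable version of this translation might look as follows. It is a
simplified sketch: the image is assumed to be fully in memory, compressed
clusters, external data files and Extended L2 Entries are out of scope, and
guest_to_host_offset is a hypothetical name. It applies the bit masks of the
L1 and L2 entry formats described in this section, which the pseudocode
leaves implicit.

```python
import struct

OFFSET_MASK = 0x00FFFFFFFFFFFE00   # bits 9-55 of an L1 or L2 entry
COMPRESSED_FLAG = 1 << 62
ZERO_FLAG = 1 << 0


def guest_to_host_offset(image, offset, cluster_bits, l1_table_offset, l1_size):
    """Map a guest offset to a host offset, or None if no host data is read."""
    cluster_size = 1 << cluster_bits
    l2_entries = cluster_size // 8

    l2_index = (offset // cluster_size) % l2_entries
    l1_index = (offset // cluster_size) // l2_entries
    if l1_index >= l1_size:
        raise ValueError("offset is beyond the range covered by the L1 table")

    (l1_entry,) = struct.unpack_from(">Q", image, l1_table_offset + 8 * l1_index)
    l2_table_offset = l1_entry & OFFSET_MASK
    if l2_table_offset == 0:
        return None  # L2 table unallocated

    (l2_entry,) = struct.unpack_from(">Q", image, l2_table_offset + 8 * l2_index)
    if l2_entry & COMPRESSED_FLAG:
        raise NotImplementedError("compressed clusters are not handled here")
    cluster_offset = l2_entry & OFFSET_MASK
    if (l2_entry & ZERO_FLAG) or cluster_offset == 0:
        return None  # reads as zeros, or falls through to the backing file
    return cluster_offset + (offset % cluster_size)
```

Returning None for both zero clusters and unallocated clusters keeps the
sketch short; a reader that supports backing files would need to tell the two
cases apart.
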
L1 table entry:

    Bit  0 -  8:    Reserved (set to 0)

         9 - 55:    Bits 9-55 of the offset into the image file at which the L2
                    table starts. Must be aligned to a cluster boundary. If the
                    offset is 0, the L2 table and all clusters described by this
                    L2 table are unallocated.

        56 - 62:    Reserved (set to 0)

             63:    0 for an L2 table that is unused or requires COW, 1 if its
                    refcount is exactly one. This information is only accurate
                    in the active L1 table.

L2 table entry:

    Bit  0 - 61:    Cluster descriptor

             62:    0 for standard clusters
                    1 for compressed clusters

             63:    0 for clusters that are unused, compressed or require COW.
                    1 for standard clusters whose refcount is exactly one.
                    This information is only accurate in L2 tables
                    that are reachable from the active L1 table.

                    With external data files, all guest clusters have an
                    implicit refcount of 1 (because of the fixed host = guest
                    mapping for guest cluster offsets), so this bit should be 1
                    for all allocated clusters.

Standard Cluster Descriptor:

    Bit       0:    If set to 1, the cluster reads as all zeros. The host
                    cluster offset can be used to describe a preallocation,
                    but it won't be used for reading data from this cluster,
                    nor is data read from the backing file if the cluster is
                    unallocated.

                    With version 2 or with extended L2 entries (see the next
                    section), this is always 0.

         1 -  8:    Reserved (set to 0)

         9 - 55:    Bits 9-55 of host cluster offset. Must be aligned to a
                    cluster boundary. If the offset is 0 and bit 63 is clear,
                    the cluster is unallocated. The offset may only be 0 with
                    bit 63 set (indicating a host cluster offset of 0) when an
                    external data file is used.

        56 - 61:    Reserved (set to 0)


Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):

    Bit  0 - x-1:   Host cluster offset. This is usually _not_ aligned to a
                    cluster or sector boundary! If cluster_bits is
                    small enough that this field includes bits beyond
                    55, those upper bits must be set to 0.

         x - 61:    Number of additional 512-byte sectors used for the
                    compressed data, beyond the sector containing the offset
                    in the previous field. Some of these sectors may reside
                    in the next contiguous host cluster.

                    Note that the compressed data does not necessarily occupy
                    all of the bytes in the final sector; rather, decompression
                    stops when it has produced a cluster of data.

                    Another compressed cluster may map to the tail of the final
                    sector used by this compressed cluster.

If a cluster is unallocated, read requests shall read the data from the backing
file (except if bit 0 in the Standard Cluster Descriptor is set). If there is
no backing file or the backing file is smaller than the image, they shall read
zeros for all parts that are not covered by the backing file.

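The two descriptor layouts can be told apart by bit 62 and then decoded with
shifts and masks. The following sketch assumes a standard 64-bit L2 entry
without an external data file; decode_l2_entry and the dictionary keys it
returns are illustrative names, not part of the format.

```python
STANDARD_OFFSET_MASK = 0x00FFFFFFFFFFFE00  # bits 9-55


def decode_l2_entry(entry, cluster_bits):
    """Classify a 64-bit L2 entry and extract its descriptor fields."""
    if entry & (1 << 62):
        # Compressed Clusters Descriptor: offset in bits 0..x-1, sectors above.
        x = 62 - (cluster_bits - 8)
        return {
            "kind": "compressed",
            "host_offset": entry & ((1 << x) - 1),
            "additional_sectors": (entry >> x) & ((1 << (62 - x)) - 1),
        }
    refcount_is_one = bool((entry >> 63) & 1)
    host_offset = entry & STANDARD_OFFSET_MASK
    if entry & 1:
        kind = "zero"          # reads as zeros; the offset is a preallocation
    elif host_offset == 0 and not refcount_is_one:
        kind = "unallocated"
    else:
        kind = "standard"
    return {"kind": kind, "host_offset": host_offset,
            "refcount_is_one": refcount_is_one}
```

For a compressed cluster the data occupies additional_sectors + 1 sectors of
512 bytes, counting the sector that contains host_offset.
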
== Extended L2 Entries ==

An image uses Extended L2 Entries if bit 4 is set on the incompatible_features
field of the header.

In these images standard data clusters are divided into 32 subclusters of the
same size. They are contiguous and start from the beginning of the cluster.
Subclusters can be allocated independently and the L2 entry contains information
indicating the status of each one of them. Compressed data clusters don't have
subclusters so they are treated the same as in images without this feature.

The size of an extended L2 entry is 128 bits so the number of entries per table
is calculated using this formula:

    l2_entries = (cluster_size / (2 * sizeof(uint64_t)))

The first 64 bits have the same format as the standard L2 table entry described
in the previous section, with the exception of bit 0 of the standard cluster
descriptor.

The last 64 bits contain a subcluster allocation bitmap with this format:

Subcluster Allocation Bitmap (for standard clusters):

    Bit  0 - 31:    Allocation status (one bit per subcluster)

                    1: the subcluster is allocated. In this case the
                       host cluster offset field must contain a valid
                       offset.
                    0: the subcluster is not allocated. In this case
                       read requests shall go to the backing file or
                       return zeros if there is no backing file data.

                    Bits are assigned starting from the least significant
                    one (i.e. bit x is used for subcluster x).

        32 - 63:    Subcluster reads as zeros (one bit per subcluster)

                    1: the subcluster reads as zeros. In this case the
                       allocation status bit must be unset. The host
                       cluster offset field may or may not be set.
                    0: no effect.

                    Bits are assigned starting from the least significant
                    one (i.e. bit x is used for subcluster x - 32).

Subcluster Allocation Bitmap (for compressed clusters):

    Bit  0 - 63:    Reserved (set to 0)
                    Compressed clusters don't have subclusters,
                    so this field is not used.

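Reading one subcluster's state from the bitmap takes two bit tests. The sketch
below assumes bitmap is the second 64-bit word of an extended L2 entry for a
standard cluster; the helper names and the status strings are chosen for this
example only.

```python
SUBCLUSTERS_PER_CLUSTER = 32


def extended_l2_entries(cluster_size):
    """Number of 128-bit entries that fit in one L2 table."""
    return cluster_size // 16


def subcluster_size(cluster_size):
    return cluster_size // SUBCLUSTERS_PER_CLUSTER


def subcluster_status(bitmap, index):
    """Return 'allocated', 'zero' or 'unallocated' for subcluster `index`."""
    if not 0 <= index < SUBCLUSTERS_PER_CLUSTER:
        raise ValueError("subcluster index out of range")
    allocated = (bitmap >> index) & 1
    reads_zero = (bitmap >> (32 + index)) & 1
    if allocated and reads_zero:
        # The allocation bit must be unset when the zero bit is set.
        raise ValueError("invalid subcluster bitmap")
    if reads_zero:
        return "zero"
    return "allocated" if allocated else "unallocated"
```

With the minimum cluster size of 16384 bytes allowed for such images, each
subcluster covers 512 bytes.
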
== Snapshots ==

qcow2 supports internal snapshots. Their basic principle of operation is to
switch the active L1 table, so that a different set of host clusters are
exposed to the guest.

When creating a snapshot, the L1 table should be copied and the refcount of all
L2 tables and clusters reachable from this L1 table must be increased, so that
a write causes a COW and isn't visible in other snapshots.

When loading a snapshot, bit 63 of all entries in the new active L1 table and
all L2 tables referenced by it must be reconstructed from the refcount table
as it doesn't need to be accurate in inactive L1 tables.

A directory of all snapshots is stored in the snapshot table, a contiguous area
in the image file, whose starting offset and length are given by the header
fields snapshots_offset and nb_snapshots. The entries of the snapshot table
have variable length, depending on the length of ID, name and extra data.

Snapshot table entry:

    Byte 0 -  7:    Offset into the image file at which the L1 table for the
                    snapshot starts. Must be aligned to a cluster boundary.

         8 - 11:    Number of entries in the L1 table of the snapshot

        12 - 13:    Length of the unique ID string describing the snapshot

        14 - 15:    Length of the name of the snapshot

        16 - 19:    Time at which the snapshot was taken in seconds since the
                    Epoch

        20 - 23:    Subsecond part of the time at which the snapshot was taken
                    in nanoseconds

        24 - 31:    Time that the guest was running until the snapshot was
                    taken in nanoseconds

        32 - 35:    Size of the VM state in bytes. 0 if no VM state is saved.
                    If there is VM state, it starts at the first cluster
                    described by the first L1 table entry that doesn't describe
                    a regular guest cluster (i.e. VM state is stored like guest
                    disk content, except that it is stored at offsets that are
                    larger than the virtual disk presented to the guest)

        36 - 39:    Size of extra data in the table entry (used for future
                    extensions of the format)

        variable:   Extra data for future extensions. Unknown fields must be
                    ignored. Currently defined are (offset relative to snapshot
                    table entry):

                    Byte 40 - 47:   Size of the VM state in bytes. 0 if no VM
                                    state is saved. If this field is present,
                                    the 32-bit value in bytes 32-35 is ignored.

                    Byte 48 - 55:   Virtual disk size of the snapshot in bytes

                    Byte 56 - 63:   icount value which corresponds to
                                    the record/replay instruction count
                                    when the snapshot was taken. Set to -1
                                    if icount was disabled

                    Version 3 images must include extra data at least up to
                    byte 55.

        variable:   Unique ID string for the snapshot (not null terminated)

        variable:   Name of the snapshot (not null terminated)

        variable:   Padding to round up the snapshot table entry size to the
                    next multiple of 8.

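Because snapshot table entries have variable length, a reader has to decode
each entry to find the start of the next one. The sketch below shows one way
to do that; parse_snapshot_entry and its dictionary keys are hypothetical, and
the icount field is read as a signed value so that the "disabled" marker comes
out as -1.

```python
import struct

SNAPSHOT_FIXED_FORMAT = ">QIHHIIQII"   # bytes 0-39 of an entry


def parse_snapshot_entry(buf, offset=0):
    """Decode one entry; return (fields, offset of the next entry)."""
    (l1_table_offset, l1_size, id_len, name_len, date_sec, date_nsec,
     vm_clock_nsec, vm_state_size, extra_size) = struct.unpack_from(
        SNAPSHOT_FIXED_FORMAT, buf, offset)
    pos = offset + 40
    extra = bytes(buf[pos:pos + extra_size])
    pos += extra_size

    snap = {"l1_table_offset": l1_table_offset, "l1_size": l1_size,
            "date_sec": date_sec, "date_nsec": date_nsec,
            "vm_clock_nsec": vm_clock_nsec, "vm_state_size": vm_state_size}
    if len(extra) >= 8:    # 64-bit VM state size overrides bytes 32-35
        (snap["vm_state_size"],) = struct.unpack_from(">Q", extra, 0)
    if len(extra) >= 16:
        (snap["disk_size"],) = struct.unpack_from(">Q", extra, 8)
    if len(extra) >= 24:   # signed, so a disabled icount reads as -1
        (snap["icount"],) = struct.unpack_from(">q", extra, 16)

    snap["id"] = bytes(buf[pos:pos + id_len]).decode("utf-8", "replace")
    pos += id_len
    snap["name"] = bytes(buf[pos:pos + name_len]).decode("utf-8", "replace")
    pos += name_len
    # Entries are padded to a multiple of 8 bytes.
    next_offset = offset + ((pos - offset + 7) & ~7)
    return snap, next_offset
```

Calling the function nb_snapshots times, feeding each returned offset into the
next call, walks the whole snapshot table.
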
== Bitmaps ==

As mentioned above, the bitmaps extension provides the ability to store bitmaps
related to a virtual disk. This section describes how these bitmaps are stored.

All stored bitmaps are related to the virtual disk stored in the same image, so
each bitmap size is equal to the virtual disk size.

Each bit of the bitmap is responsible for a strictly defined range of the
virtual disk. For bit number bit_nr the corresponding range (in bytes) will be:

    [bit_nr * bitmap_granularity .. (bit_nr + 1) * bitmap_granularity - 1]

Granularity is a property of the concrete bitmap, see below.

=== Bitmap directory ===

Each bitmap saved in the image is described in a bitmap directory entry. The
bitmap directory is a contiguous area in the image file, whose starting offset
and length are given by the header extension fields bitmap_directory_offset and
bitmap_directory_size. The entries of the bitmap directory have variable
length, depending on the lengths of the bitmap name and extra data.

Structure of a bitmap directory entry:

    Byte 0 -  7:    bitmap_table_offset
                    Offset into the image file at which the bitmap table
                    (described below) for the bitmap starts. Must be aligned to
                    a cluster boundary.

         8 - 11:    bitmap_table_size
                    Number of entries in the bitmap table of the bitmap.

        12 - 15:    flags
                    Bit
                      0: in_use
                         The bitmap was not saved correctly and may be
                         inconsistent. Although the bitmap metadata is still
                         well-formed from a qcow2 perspective, the metadata
                         (such as the auto flag or bitmap size) or data
                         contents may be outdated.

                      1: auto
                         The bitmap must reflect all changes of the virtual
                         disk by any application that would write to this qcow2
                         file (including writes, snapshot switching, etc.). The
                         type of this bitmap must be 'dirty tracking bitmap'.

                      2: extra_data_compatible
                         This flag is meaningful when the extra data is
                         unknown to the software (currently any extra data is
                         unknown to Qemu).
                         If it is set, the bitmap may be used as expected, extra
                         data must be left as is.
                         If it is not set, the bitmap must not be used, but
                         both it and its extra data must be left as is.

                    Bits 3 - 31 are reserved and must be 0.

             16:    type
                    This field describes the sort of the bitmap.
                    Values:
                      1: Dirty tracking bitmap

                    Values 0, 2 - 255 are reserved.

             17:    granularity_bits
                    Granularity bits. Valid values: 0 - 63.

                    Note: Qemu currently supports only values 9 - 31.

                    Granularity is calculated as
                        granularity = 1 << granularity_bits

                    A bitmap's granularity is how many bytes of the image
                    account for one bit of the bitmap.

        18 - 19:    name_size
                    Size of the bitmap name. Must be non-zero.

                    Note: Qemu currently doesn't support values greater than
                    1023.

        20 - 23:    extra_data_size
                    Size of type-specific extra data.

                    For now, as no extra data is defined, extra_data_size is
                    reserved and should be zero. If it is non-zero the
                    behavior is defined by the extra_data_compatible flag.

        variable:   extra_data
                    Extra data for the bitmap, occupying extra_data_size bytes.
                    Extra data must never contain references to clusters or in
                    some other way allocate additional clusters.

        variable:   name
                    The name of the bitmap (not null terminated), occupying
                    name_size bytes. Must be unique among all bitmap names
                    within the bitmaps extension.

        variable:   Padding to round up the bitmap directory entry size to the
                    next multiple of 8. All bytes of the padding must be zero.

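A bitmap directory entry has a 24-byte fixed part followed by the extra data,
the name and padding. The sketch below decodes a single entry; the function
name parse_bitmap_directory_entry and the keys of the returned dictionary are
assumptions made for this example.

```python
import struct

BITMAP_DIR_FIXED_FORMAT = ">QIIBBHI"   # bytes 0-23 of an entry
FLAG_IN_USE = 1 << 0
FLAG_AUTO = 1 << 1
FLAG_EXTRA_DATA_COMPATIBLE = 1 << 2


def parse_bitmap_directory_entry(buf, offset=0):
    """Decode one directory entry; return (fields, offset of the next entry)."""
    (table_offset, table_size, flags, bitmap_type, granularity_bits,
     name_size, extra_data_size) = struct.unpack_from(
        BITMAP_DIR_FIXED_FORMAT, buf, offset)
    pos = offset + 24
    extra_data = bytes(buf[pos:pos + extra_data_size])
    pos += extra_data_size
    name = bytes(buf[pos:pos + name_size]).decode("utf-8", "replace")
    pos += name_size

    entry = {
        "bitmap_table_offset": table_offset,
        "bitmap_table_size": table_size,
        "in_use": bool(flags & FLAG_IN_USE),
        "auto": bool(flags & FLAG_AUTO),
        "extra_data_compatible": bool(flags & FLAG_EXTRA_DATA_COMPATIBLE),
        "type": bitmap_type,
        "granularity": 1 << granularity_bits,
        "extra_data": extra_data,
        "name": name,
    }
    # Entries are padded to a multiple of 8 bytes.
    next_offset = offset + ((pos - offset + 7) & ~7)
    return entry, next_offset
```

Repeating the call nb_bitmaps times should consume exactly
bitmap_directory_size bytes; a mismatch would indicate a malformed directory.
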
=== Bitmap table ===

Each bitmap is stored using a one-level structure (as opposed to two-level
structures like for refcounts and guest clusters mapping) for the mapping of
bitmap data to host clusters. This structure is called the bitmap table.

Each bitmap table has a variable size (stored in the bitmap directory entry)
and may use multiple clusters, however, it must be contiguous in the image
file.

Structure of a bitmap table entry:

    Bit       0:    Reserved and must be zero if bits 9 - 55 are non-zero.
                    If bits 9 - 55 are zero:
                      0: Cluster should be read as all zeros.
                      1: Cluster should be read as all ones.

         1 -  8:    Reserved and must be zero.

         9 - 55:    Bits 9 - 55 of the host cluster offset. Must be aligned to
                    a cluster boundary. If the offset is 0, the cluster is
                    unallocated; in that case, bit 0 determines how this
                    cluster should be treated during reads.

        56 - 63:    Reserved and must be zero.

=== Bitmap data ===

As noted above, bitmap data is stored in separate clusters, described by the
bitmap table. Given an offset (in bytes) into the bitmap data, the offset into
the image file can be obtained as follows:

    image_offset(bitmap_data_offset) =
        bitmap_table[bitmap_data_offset / cluster_size] +
            (bitmap_data_offset % cluster_size)

This offset is not defined if bits 9 - 55 of the bitmap table entry are zero
(see above).

Given an offset byte_nr into the virtual disk and the bitmap's granularity, the
bit offset into the image file to the corresponding bit of the bitmap can be
calculated like this:

    bit_offset(byte_nr) =
        image_offset(byte_nr / granularity / 8) * 8 +
            (byte_nr / granularity) % 8

If the size of the bitmap data is not a multiple of the cluster size then the
last cluster of the bitmap data contains some unused tail bits. These bits must
be zero.

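The two formulas in this section translate directly into code. In the sketch
below the bitmap table is assumed to be already loaded as a list of 64-bit
integers, and None is returned where the document calls the offset undefined;
the function names mirror the formulas but the signatures are invented here.

```python
BITMAP_OFFSET_MASK = 0x00FFFFFFFFFFFE00   # bits 9-55 of a bitmap table entry


def image_offset(bitmap_table, bitmap_data_offset, cluster_size):
    """Host offset of a byte of bitmap data, or None if it is not stored."""
    entry = bitmap_table[bitmap_data_offset // cluster_size]
    host_cluster = entry & BITMAP_OFFSET_MASK
    if host_cluster == 0:
        return None  # unallocated: bit 0 says all zeros or all ones
    return host_cluster + (bitmap_data_offset % cluster_size)


def bit_offset(bitmap_table, byte_nr, granularity, cluster_size):
    """Bit offset in the image file of the bit covering guest byte `byte_nr`."""
    bit_nr = byte_nr // granularity
    base = image_offset(bitmap_table, bit_nr // 8, cluster_size)
    if base is None:
        return None
    return base * 8 + bit_nr % 8
```

For an unallocated bitmap cluster the caller would instead consult bit 0 of
the table entry to learn whether every bit in that cluster reads as 0 or 1.
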
=== Dirty tracking bitmaps ===

Bitmaps with 'type' field equal to one are dirty tracking bitmaps.

When the virtual disk is in use, a dirty tracking bitmap may be 'enabled' or
'disabled'. While the bitmap is 'enabled', all writes to the virtual disk
should be reflected in the bitmap. A set bit in the bitmap means that the
corresponding range of the virtual disk (see above) was written to while the
bitmap was 'enabled'. An unset bit means that this range was not written to.

The software doesn't have to sync the bitmap in the image file with its
representation in RAM after each write or metadata change. Flag 'in_use'
should be set while the bitmap is not synced.

In the image file the 'enabled' state is reflected by the 'auto' flag. If this
flag is set, the software must consider the bitmap as 'enabled' and start
tracking virtual disk changes to this bitmap from the first write to the
virtual disk. If this flag is not set then the bitmap is disabled.