For SSD, if request size exceeds specific value (optimal io size), request size
isn't important for bandwidth. In such condition, if making request size bigger
will cause some disks idle, the total throughput will actually drop. A good
example is doing a readahead in a two-disk raid1 setup.
So when should we split big requests? We absolutly don't want to split big
request to very small requests. Even in SSD, big request transfer is more
efficient. This patch only considers request with size above optimal io size.
If all disks are busy, is it worth doing a split? Say optimal io size is 16k,
two requests 32k and two disks. We can let each disk run one 32k request, or
split the requests to 4 16k requests and each disk runs two. It's hard to say
which case is better, depending on hardware.
So only consider case where there are idle disks. For readahead, split is
always better in this case. And in my test, below patch can improve > 30%
thoughput. Hmm, not 100%, because disk isn't 100% busy.
Such case can happen not just in readahead, for example, in directio. But I
suppose directio usually will have bigger IO depth and make all disks busy, so
I ignored it.
Note: if the raid uses any hard disk, we don't prevent merging. That will make
performace worse.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
SSD hasn't spindle, distance between requests means nothing. And the original
distance based algorithm sometimes can cause severe performance issue for SSD
raid.
Considering two thread groups, one accesses file A, the other access file B.
The first group will access one disk and the second will access the other disk,
because requests are near from one group and far between groups. In this case,
read balance might keep one disk very busy but the other relative idle. For
SSD, we should try best to distribute requests to as many disks as possible.
There isn't spindle move penality anyway.
With below patch, I can see more than 50% throughput improvement sometimes
depending on workloads.
The only exception is small requests can be merged to a big request which
typically can drive higher throughput for SSD too. Such small requests are
sequential reads. Unlike hard disk, sequential read which can't be merged (for
example direct IO, or read without readahead) can be ignored for SSD. Again
there is no spindle move penality. readahead dispatches small requests and such
requests can be merged.
Last patch can help detect sequential read well, at least if concurrent read
number isn't greater than raid disk number. In that case, distance based
algorithm doesn't work well too.
V2: For hard disk and SSD mixed raid, doesn't use distance based algorithm for
random IO too. This makes the algorithm generic for raid with SSD.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Currently the sequential read detection is global wide. It's natural to make it
per disk based, which can improve the detection for concurrent multiple
sequential reads. And next patch will make SSD read balance not use distance
based algorithm, where this change help detect truly sequential read for SSD.
Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
md/raid10: Export is_congested test.
In similar fashion to commits
11d8a6e3719519fbc0e2c9d61b6fa931b84bf813
1ed7242e591af7e233234d483f12d33818b189d9
we export the RAID10 congestion checking function so that dm-raid.c can
make use of it and make use of the personality. The 'queue' and 'gendisk'
structures will not be available to the MD code when device-mapper sets
up the device, so we conditionalize access to these fields also.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID1/RAID10: Move some macros from .h file to .c file
There are three macros (IO_BLOCKED,IO_MADE_GOOD,BIO_SPECIAL) which are defined
in both raid1.h and raid10.h. They are only used in there respective .c files.
However, if we wish to make RAID10 accessible to the device-mapper RAID
target (dm-raid.c), then we need to move these macros into the .c files where
they are used so that they do not conflict with each other.
The macros from the two files are identical and could be moved into md.h, but
I chose to leave the duplication and have them remain in the personality
files.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID1: Rename the structure 'mirror_info' to 'raid1_info'
The same structure name ('mirror_info') is used by raid10. Each of these
structures are defined in there respective header files. If dm-raid is
to support both RAID1 and RAID10, the header files will be included and
the structure names must not collide. While only one of these structure
names needs to change, this patch adds consistency to the naming of the
structure.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
MD RAID10: Rename the structure 'mirror_info' to 'raid10_info'
The same structure name ('mirror_info') is used by raid1. Each of these
structures are defined in there respective header files. If dm-raid is
to support both RAID1 and RAID10, the header files will be included and
the structure names must not collide.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit outstanding metadata before returning the status for a dm thin
pool so that the numbers reported are as up-to-date as possible.
The commit is not performed if the device is suspended or if
the DM_NOFLUSH_FLAG is supplied by userspace and passed to the target
through a new 'status_flags' parameter in the target's dm_status_fn.
The userspace dmsetup tool will support the --noflush flag with the
'dmsetup status' and 'dmsetup wait' commands from version 1.02.76
onwards.
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add read-only and fail-io modes to thin provisioning.
If a transaction commit fails the pool's metadata device will transition
to "read-only" mode. If a commit fails once already in read-only mode
the transition to "fail-io" mode occurs.
Once in fail-io mode the pool and all associated thin devices will
report a status of "Fail".
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce dm_pool_abort_metadata to abort the current metadata
transaction. Generally this will only be called when bad things are
happening and dm-thin is trying to roll back to a good state for
read-only mode.
It's complicated by the fact that the metadata device may have failed
completely causing the abort to be unable to read the old transaction.
In this case the metadata object is placed in a 'fail' mode and
everything fails apart from destroying it.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce dm_pool_metadata_set_read_only to put the underlying block
manager into read-only mode.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce dm_bm_set_read_only to switch the block manager into a
read-only mode. To be used when dm-thin degrades due to io errors on
the metadata device.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Reduce the number of metadata commits by using
dm_thin_changed_this_transaction to check if metadata was changed on a
per thin device granularity.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce dm_thin_changed_this_transaction to dm-thin-metadata to publish a
useful bit of information we're already tracking. This will help dm thin
decide when to commit.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Add a parameter to dm_pool_metadata_open to indicate whether or not an
unformatted metadata area should be formatted.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Tidy up error path in __open_metadata and __format_metadata in dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Factor out __check_incompat_features and only call it once when we open
the metadata device rather than at the beginning of every transaction.
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove some duplicate initialisation of struct dm_pool_metadata.
These pmd fields are initialised by both:
__format_metadata's calls to dm_btree_empty
__write_initial_superblock + __begin_transaction
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove 'create' parameter from __create_persistent_data_objects() in dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move the check for __superblock_all_zeroes from
__create_persistent_data_objects() down to __open_or_format_metadata in
dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove nr_blocks arg from __create_persistent_data_objects in dm-thin-metadata.
It was always passed as zero.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Split __open_or_format_metadata into __format_metadata and __open_metadata in
dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Clean up __open_or_format_metadata in dm-thin-metadata by using struct
dm_pool_metadata members to replace local variables.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Zero the unused uuid when initialising the metadata superblock.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Lift the call to __begin_transaction out of __write_initial_superblock in
dm-thin-metadata. Called higher up the call chain now.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move dm_commit_pool_metadata inline into __write_initial_superblock in dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Factor out __write_initial_superblock and also pull some other initial
creation code out of dm_pool_metadata_open.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Lift some initialisation out of __open_or_format_metadata in dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Factor __destroy_persistent_data_objects out of dm_pool_metadata_close.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move block manager creation and the check for unformatted metadata into
__create_persistent_data_objects().
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rename init_pmd to __create_persistent_data_objects in dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce wrappers to handle write locking the superblock
appropriately.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Stop using dm_bm_unlock_move when shadowing blocks in the transaction
manager as an optimisation and remove the function as it is then no
longer used.
Some code, such as the space maps, keeps using on-disk data structures
from the previous transaction. It can do this because blocks won't
be reallocated until the subsequent transaction. Using
dm_bm_unlock_move to copy blocks sounds like a win, but it forces a
synchronous read should the old block be accessed.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Tidy the transaction manager creation functions.
They no longer lock the superblock. Superblock locking is pulled out to
the caller.
Also export dm_bm_write_lock_zero.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Remove an optimisation that tracks whether or not a thin metadata commit
is needed.
If dm_pool_commit_metadata() is called and no changes have been made
to the metadata then this optimisation avoided writing to disk.
Removing because we're going to do something better later.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
This patch introduces a separate struct for the block_manager.
It also uses IS_ERR to check the return value of dm_bufio_client_create
instead of testing incorrectly for NULL.
Prior to this patch a struct dm_block_manager was really an alias for
a struct dm_bufio_client. We want to add some functionality to the
block manager that will require extra fields, so this one to one
mapping is no longer valid.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Factor __setup_btree_details out of init_pmd in dm-thin-metadata.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
The thin provisioning target commits internal metadata on flush. So it
should receive flushes regardless of whether the underlying devices
support them.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Allow targets to override the 'supports flush' calculation.
Set 'flush_supported' if a target needs to receive flushes regardless of
whether or not its underlying devices have support.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Introduce bitmap_index_changed to track whether or not the index changed
then only commit a space map if it did.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Unlock the superblock even if initial dm_bufio_write_dirty_buffers fails.
Also, remove redundant flush calls. dm_bm_flush_and_unlock's calls to
dm_bufio_write_dirty_buffers already result in dm_bufio_issue_flush
being called.
This avoids warnings about unflushed dirty buffers from bufio.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
There's no need to break sharing, triggering a copy, for a write that has no
data (i.e. a flush).
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Fix memory leak in process_prepared_mapping by always freeing
the dm_thin_new_mapping structs from the mapping_pool mempool on
the error paths.
Signed-off-by: Joe Thornber <ejt@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Rename sector to cc_sector in dm-crypt's convert_context struct.
This is preparation for a future patch that merges dm_io and
convert_context which both have a "sector" field.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Store the crypt_config struct pointer directly in struct dm_crypt_io
instead of the dm_target struct pointer.
Target information is never used - only target->private is referenced,
thus we can change it to point directly to struct crypt_config.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
Move static dm-crypt cipher data out of per-cpu structure.
Cipher information is static, so it does not have to be in a per-cpu
structure.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
In preparation for RAID10 inclusion in dm-raid, we move the sectors_per_dev
calculation later in the device creation process. This is because we won't
know up-front how many stripes vs how many mirrors there are which will
change the calculation.
Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>
There are two dm crypt structures that have a field called "pending".
This patch renames them to "cc_pending" and "io_pending" to reduce confusion
and ease searching the code.
Also remove unnecessary initialisation of r in crypt_convert_block().
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Alasdair G Kergon <agk@redhat.com>