linux

mirror of https://github.com/FEX-Emu/linux.git synced 2025-01-29 23:12:23 +00:00

Author	SHA1	Message	Date
Milan Broz	8f0009a225	dm crypt: optionally support larger encryption sector size Add optional "sector_size" parameter that specifies encryption sector size (atomic unit of block device encryption). Parameter can be in range 512 - 4096 bytes and must be power of two. For compatibility reasons, the maximal IO must fit into the page limit, so the limit is set to the minimal page size possible (4096 bytes). NOTE: this device cannot yet be handled by cryptsetup if this parameter is set. IV for the sector is calculated from the 512 bytes sector offset unless the iv_large_sectors option is used. Test script using dmsetup: DEV="/dev/sdb" DEV_SIZE=$(blockdev --getsz $DEV) KEY="9c1185a5c5e9fc54612808977ee8f548b2258d31ddadef707ba62c166051b9e3cd0294c27515f2bccee924e8823ca6e124b8fc3167ed478bca702babe4e130ac" BLOCK_SIZE=4096 # dmsetup create test_crypt --table "0 $DEV_SIZE crypt aes-xts-plain64 $KEY 0 $DEV 0 1 sector_size:$BLOCK_SIZE" # dmsetup table --showkeys test_crypt Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2017-03-24 15:54:21 -04:00
Milan Broz	33d2f09fcb	dm crypt: introduce new format of cipher with "capi:" prefix For the new authenticated encryption we have to support generic composed modes (combination of encryption algorithm and authenticator) because this is how the kernel crypto API accesses such algorithms. To simplify the interface, we accept an algorithm directly in crypto API format. The new format is recognised by the "capi:" prefix. The dmcrypt internal IV specification is the same as for the old format. The crypto API cipher specifications format is: capi:cipher_api_spec-ivmode[:ivopts] Examples: capi:cbc(aes)-essiv:sha256 (equivalent to old aes-cbc-essiv:sha256) capi:xts(aes)-plain64 (equivalent to old aes-xts-plain64) Examples of authenticated modes: capi:gcm(aes)-random capi:authenc(hmac(sha256),xts(aes))-random capi:rfc7539(chacha20,poly1305)-random Authenticated modes can only be configured using the new cipher format. Note that this format allows user to specify arbitrary combinations that can be insecure. (Policy decision is done in cryptsetup userspace.) Authenticated encryption algorithms can be of two types, either native modes (like GCM) that performs both encryption and authentication internally, or composed modes where user can compose AEAD with separate specification of encryption algorithm and authenticator. For composed mode with HMAC (length-preserving encryption mode like an XTS and HMAC as an authenticator) we have to calculate HMAC digest size (the separate authentication key is the same size as the HMAC digest). Introduce crypt_ctr_auth_cipher() to parse the crypto API string to get HMAC algorithm and retrieve digest size from it. Also, for HMAC composed mode we need to parse the crypto API string to get the cipher mode nested in the specification. For native AEAD mode (like GCM), we can use crypto_tfm_alg_name() API to get the cipher specification. Because the HMAC composed mode is not processed the same as the native AEAD mode, the CRYPT_MODE_INTEGRITY_HMAC flag is no longer needed and "hmac" specification for the table integrity argument is removed. Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2017-03-24 15:54:20 -04:00
Milan Broz	e889f97a3e	dm crypt: factor IV constructor out to separate function No functional change. Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2017-03-24 15:54:19 -04:00
Milan Broz	ef43aa3806	dm crypt: add cryptographic data integrity protection (authenticated encryption) Allow the use of per-sector metadata, provided by the dm-integrity module, for integrity protection and persistently stored per-sector Initialization Vector (IV). The underlying device must support the "DM-DIF-EXT-TAG" dm-integrity profile. The per-bio integrity metadata is allocated by dm-crypt for every bio. Example of low-level mapping table for various types of use: DEV=/dev/sdb SIZE=417792 # Additional HMAC with CBC-ESSIV, key is concatenated encryption key + HMAC key SIZE_INT=389952 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 32 J 0" dmsetup create y --table "0 $SIZE_INT crypt aes-cbc-essiv:sha256 \ 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \ 00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff \ 0 /dev/mapper/x 0 1 integrity:32:hmac(sha256)" # AEAD (Authenticated Encryption with Additional Data) - GCM with random IVs # GCM in kernel uses 96bits IV and we store 128bits auth tag (so 28 bytes metadata space) SIZE_INT=393024 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 28 J 0" dmsetup create y --table "0 $SIZE_INT crypt aes-gcm-random \ 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \ 0 /dev/mapper/x 0 1 integrity:28:aead" # Random IV only for XTS mode (no integrity protection but provides atomic random sector change) SIZE_INT=401272 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 16 J 0" dmsetup create y --table "0 $SIZE_INT crypt aes-xts-random \ 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \ 0 /dev/mapper/x 0 1 integrity:16:none" # Random IV with XTS + HMAC integrity protection SIZE_INT=377656 dmsetup create x --table "0 $SIZE_INT integrity $DEV 0 48 J 0" dmsetup create y --table "0 $SIZE_INT crypt aes-xts-random \ 11ff33c6fb942655efb3e30cf4c0fd95f5ef483afca72166c530ae26151dd83b \ 00112233445566778899aabbccddeeff00112233445566778899aabbccddeeff \ 0 /dev/mapper/x 0 1 integrity:48:hmac(sha256)" Both AEAD and HMAC protection authenticates not only data but also sector metadata. HMAC protection is implemented through autenc wrapper (so it is processed the same way as an authenticated mode). In HMAC mode there are two keys (concatenated in dm-crypt mapping table). First is the encryption key and the second is the key for authentication (HMAC). (It is userspace decision if these keys are independent or somehow derived.) The sector request for AEAD/HMAC authenticated encryption looks like this: \|----- AAD -------\|------ DATA -------\|-- AUTH TAG --\| \| (authenticated) \| (auth+encryption) \| \| \| sector_LE \| IV \| sector in/out \| tag in/out \| For writes, the integrity fields are calculated during AEAD encryption of every sector and stored in bio integrity fields and sent to underlying dm-integrity target for storage. For reads, the integrity metadata is verified during AEAD decryption of every sector (they are filled in by dm-integrity, but the integrity fields are pre-allocated in dm-crypt). There is also an experimental support in cryptsetup utility for more friendly configuration (part of LUKS2 format). Because the integrity fields are not valid on initial creation, the device must be "formatted". This can be done by direct-io writes to the device (e.g. dd in direct-io mode). For now, there is available trivial tool to do this, see: https://github.com/mbroz/dm_int_tools Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Ondrej Mosnacek <omosnacek@gmail.com> Signed-off-by: Vashek Matyas <matyas@fi.muni.cz> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2017-03-24 15:49:41 -04:00
Mikulas Patocka	7eada909bf	dm: add integrity target The dm-integrity target emulates a block device that has additional per-sector tags that can be used for storing integrity information. A general problem with storing integrity tags with every sector is that writing the sector and the integrity tag must be atomic - i.e. in case of crash, either both sector and integrity tag or none of them is written. To guarantee write atomicity the dm-integrity target uses a journal. It writes sector data and integrity tags into a journal, commits the journal and then copies the data and integrity tags to their respective location. The dm-integrity target can be used with the dm-crypt target - in this situation the dm-crypt target creates the integrity data and passes them to the dm-integrity target via bio_integrity_payload attached to the bio. In this mode, the dm-crypt and dm-integrity targets provide authenticated disk encryption - if the attacker modifies the encrypted device, an I/O error is returned instead of random data. The dm-integrity target can also be used as a standalone target, in this mode it calculates and verifies the integrity tag internally. In this mode, the dm-integrity target can be used to detect silent data corruption on the disk or in the I/O path. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2017-03-24 15:49:07 -04:00
Ming Lei	2d06e3b714	md: raid10: avoid direct access to bvec table in handle_reshape_read_error All reshape I/O share pages from 1st copy device, so just use that pages for avoiding direct access to bvec table in handle_reshape_read_error. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	cdb76be315	md: raid10: retrieve page from preallocated resync page array Now one page array is allocated for each resync bio, and we can retrieve page from this table directly. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	f025061836	md: raid10: don't use bio's vec table to manage resync pages Now we allocate one page array for managing resync pages, instead of using bio's vec table to do that, and the old way is very hacky and won't work any more if multipage bvec is enabled. The introduced cost is that we need to allocate (128 + 16) * copies bytes per r10_bio, and it is fine because the inflight r10_bio for resync shouldn't be much, as pointed by Shaohua. Also bio_reset() in raid10_sync_request() and reshape_request() are removed because all bios are freshly new now in these functions and not necessary to reset any more. This patch can be thought as cleanup too. Suggested-by: Shaohua Li <shli@kernel.org> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	81fa152008	md: raid10: refactor code of read reshape's .bi_end_io reshape read request is a bit special and requires one extra bio which isn't allocated from r10buf_pool. Refactor the .bi_end_io for read reshape, so that we can use raid10's resync page mangement approach easily in the following patches. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	841c1316c7	md: raid1: improve write behind This patch improve handling of write behind in the following ways: - introduce behind master bio to hold all write behind pages - fast clone bios from behind master bio - avoid to change bvec table directly - use bio_copy_data() and make code more clean Suggested-by: Shaohua Li <shli@fb.com> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	d8c84c4f8b	md: raid1: move 'offset' out of loop The 'offset' local variable can't be changed inside the loop, so move it out. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:37 -07:00
Ming Lei	60928a91b0	md: raid1: use bio helper in process_checks() Avoid to direct access to bvec table. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Ming Lei	44cf0f4dc7	md: raid1: retrieve page from pre-allocated resync page array Now one page array is allocated for each resync bio, and we can retrieve page from this table directly. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Ming Lei	98d30c5812	md: raid1: don't use bio's vec table to manage resync pages Now we allocate one page array for managing resync pages, instead of using bio's vec table to do that, and the old way is very hacky and won't work any more if multipage bvec is enabled. The introduced cost is that we need to allocate (128 + 16) * raid_disks bytes per r1_bio, and it is fine because the inflight r1_bio for resync shouldn't be much, as pointed by Shaohua. Also the bio_reset() in raid1_sync_request() is removed because all bios are freshly new now and not necessary to reset any more. This patch can be thought as a cleanup too Suggested-by: Shaohua Li <shli@kernel.org> Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Ming Lei	a7234234d0	md: raid1: simplify r1buf_pool_free() This patch gets each page's reference of each bio for resync, then r1buf_pool_free() gets simplified a lot. The same policy has been taken in raid10's buf pool allocation/free too. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Ming Lei	513e2faa01	md: prepare for managing resync I/O pages in clean way Now resync I/O use bio's bec table to manage pages, this way is very hacky, and may not work any more once multipage bvec is introduced. So introduce helpers and new data structure for managing resync I/O pages more cleanly. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Ming Lei	d8e29fbc3b	md: move two macros into md.h Both raid1 and raid10 share common resync block size and page count, so move them into md.h. Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Ming Lei	c85ba149de	md: raid1/raid10: don't handle failure of bio_add_page() All bio_add_page() is for adding one page into resync bio, which is big enough to hold RESYNC_PAGES pages, and the current bio_add_page() doesn't check queue limit any more, so it won't fail at all. remove unused label (shaohua) Signed-off-by: Ming Lei <tom.leiming@gmail.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-24 10:41:36 -07:00
Zhilong Liu	3560741e31	md: fix several trivial typos in comments Signed-off-by: Zhilong Liu <zlliu@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-23 22:54:57 -07:00
Guoqing Jiang	27f26a0f37	md/raid10: refactor some codes from raid10_write_request Previously, we clone both bio and repl_bio in raid10_write_request, then add the cloned bio to plug->pending or conf->pending_bio_list based on plug or not, and most of the logics are same for the two conditions. So introduce raid10_write_one_disk for it, and use replacement parameter to distinguish the difference. No functional changes in the patch. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-23 22:42:14 -07:00
Dan Carpenter	0b408baf7f	raid5-ppl: silence a misleading warning message The "need_cache_flush" variable is never set to false. When the variable is true that means we print a warning message at the end of the function. Fixes: 3418d036c81d ("raid5-ppl: Partial Parity Log write logging implementation") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-23 22:38:46 -07:00
NeilBrown	4ad23a9764	MD: use per-cpu counter for writes_pending The 'writes_pending' counter is used to determine when the array is stable so that it can be marked in the superblock as "Clean". Consequently it needs to be updated frequently but only checked for zero occasionally. Recent changes to raid5 cause the count to be updated even more often - once per 4K rather than once per bio. This provided justification for making the updates more efficient. So we replace the atomic counter a percpu-refcount. This can be incremented and decremented cheaply most of the time, and can be switched to "atomic" mode when more precise counting is needed. As it is possible for multiple threads to want a precise count, we introduce a "sync_checker" counter to count the number of threads in "set_in_sync()", and only switch the refcount back to percpu mode when that is zero. We need to be careful about races between set_in_sync() setting ->in_sync to 1, and md_write_start() setting it to zero. md_write_start() holds the rcu_read_lock() while checking if the refcount is in percpu mode. If it is, then we know a switch to 'atomic' will not happen until after we call rcu_read_unlock(), in which case set_in_sync() will see the elevated count, and not set in_sync to 1. If it is not in percpu mode, we take the mddev->lock to ensure proper synchronization. It is no longer possible to quickly check if the count is zero, which we previously did to update a timer or to schedule the md_thread. So now we do these every time we decrement that counter, but make sure they are fast. mod_timer() already optimizes the case where the timeout value doesn't actually change. We leverage that further by always rounding off the jiffies to the timeout value. This may delay the marking of 'clean' slightly, but ensure we only perform atomic operation here when absolutely needed. md_wakeup_thread() current always calls wake_up(), even if THREAD_WAKEUP is already set. That too can be optimised to avoid calls to wake_up(). Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:18:56 -07:00
NeilBrown	55cc39f345	md: close a race with setting mddev->in_sync If ->in_sync is being set just as md_write_start() is being called, it is possible that set_in_sync() won't see the elevated ->writes_pending, and md_write_start() won't see the set ->in_sync. To close this race, re-test ->writes_pending after setting ->in_sync, and add memory barriers to ensure the increment of ->writes_pending will be seen by the time of this second test, or the new ->in_sync will be seen by md_write_start(). Add a spinlock to array_state_show() to ensure this temporary instability is never visible from userspace. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:18:30 -07:00
NeilBrown	6497709b5d	md: factor out set_in_sync() Three separate places in md.c check if the number of active writes is zero and, if so, sets mddev->in_sync. There are a few differences, but there shouldn't be: - it is always appropriate to notify the change in sysfs_state, and there is no need to do this outside a spin-locked region. - we never need to check ->recovery_cp. The state of resync is not relevant for whether there are any pending writes or not (which is what ->in_sync reports). So create set_in_sync() which does the correct tests and makes the correct changes, and call this in all three places. Any behaviour changes here a minor and cosmetic. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:18:18 -07:00
NeilBrown	84dd97a690	md/raid5: don't test ->writes_pending in raid5_remove_disk This test on ->writes_pending cannot be safe as the counter can be incremented at any moment and cannot be locked against. Change it to test conf->active_stripes, which at least can be locked against. More changes are still needed. A future patch will change ->writes_pending, and testing it here will be very inconvenient. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:18:05 -07:00
NeilBrown	37011e3afb	md/raid1: stop using bi_phys_segment Change to use bio->__bi_remaining to count number of r1bio attached to a bio. See precious raid10 patch for more details. Like the raid10.c patch, this fixes a bug as nr_queued and nr_pending used to measure different things, but were being compared. This patch fixes another bug in that nr_pending previously did not could write-behind requests, so behind writes could continue while resync was happening. How that nr_pending counts all r1_bio, the resync cannot commence until the behind writes have completed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:17:53 -07:00
NeilBrown	fd16f2e848	md/raid10: stop using bi_phys_segments raid10 currently repurposes bi_phys_segments on each incoming bio to count how many r10bio was used to encode the request. We need to know when the number of attached r10bio reaches zero to: 1/ call bio_endio() when all IO on the bio is finished 2/ decrement ->nr_pending so that resync IO can proceed. Now that the bio has its own __bi_remaining counter, that can be used instead. We can call bio_inc_remaining to increment the counter and call bio_endio() every time an r10bio completes, rather than only when bi_phys_segments reaches zero. This addresses point 1, but not point 2. bio_endio() doesn't (and cannot) report when the last r10bio has finished, so a different approach is needed. So: instead of counting bios in ->nr_pending, count r10bios. i.e. every time we attach a bio, increment nr_pending. Every time an r10bio completes, decrement nr_pending. Normally we only increment nr_pending after first checking that ->barrier is zero, or some other non-trivial tests and possible waiting. When attaching multiple r10bios to a bio, we only need the tests and the waiting once. After the first increment, subsequent increments can happen unconditionally as they are really all part of the one request. So introduce inc_pending() which can be used when we know that nr_pending is already elevated. Note that this fixes a bug. freeze_array() contains the line atomic_read(&conf->nr_pending) == conf->nr_queued+extra, which implies that the units for ->nr_pending, ->nr_queued and extra are the same. ->nr_queue and extra count r10_bios, but prior to this patch, ->nr_pending counted bios. If a bio ever resulted in multiple r10_bios (due to bad blocks), freeze_array() would not work correctly. Now it does. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:17:41 -07:00
NeilBrown	6b6c8110e1	md/raid1, raid10: move rXbio accounting closer to allocation. When raid1 or raid10 find they will need to allocate a new r1bio/r10bio, in order to work around a known bad block, they account for the allocation well before the allocation is made. This separation makes the correctness less obvious and requires comments. The accounting needs to be a little before: before the first rXbio is submitted, but that is all. So move the accounting down to where it makes more sense. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:17:24 -07:00
NeilBrown	97d5343808	Revert "md/raid5: limit request size according to implementation limits" This reverts commit e8d7c33232e5fdfa761c3416539bc5b4acd12db5. Now that raid5 doesn't abuse bi_phys_segments any more, we no longer need to impose these limits. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:17:12 -07:00
NeilBrown	0472a42ba1	md/raid5: remove over-loading of ->bi_phys_segments. When a read request, which bypassed the cache, fails, we need to retry it through the cache. This involves attaching it to a sequence of stripe_heads, and it may not be possible to get all the stripe_heads we need at once. We do what we can, and record how far we got in ->bi_phys_segments so we can pick up again later. There is only ever one bio which may have a non-zero offset stored in ->bi_phys_segments, the one that is either active in the single thread which calls retry_aligned_read(), or is in conf->retry_read_aligned waiting for retry_aligned_read() to be called again. So we only need to store one offset value. This can be in a local variable passed between remove_bio_from_retry() and retry_aligned_read(), or in the r5conf structure next to the ->retry_read_aligned pointer. Storing it there allows the last usage of ->bi_phys_segments to be removed from md/raid5.c. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:16:56 -07:00
NeilBrown	016c76ac76	md/raid5: use bio_inc_remaining() instead of repurposing bi_phys_segments as a counter md/raid5 needs to keep track of how many stripe_heads are processing a bio so that it can delay calling bio_endio() until all stripe_heads have completed. It currently uses 16 bits of ->bi_phys_segments for this purpose. 16 bits is only enough for 256M requests, and it is possible for a single bio to be larger than this, which causes problems. Also, the bio struct contains a larger counter, __bi_remaining, which has a purpose very similar to the purpose of our counter. So stop using ->bi_phys_segments, and instead use __bi_remaining. This means we don't need to initialize the counter, as our caller initializes it to '1'. It also means we can call bio_endio() directly as it tests this counter internally. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:16:30 -07:00
NeilBrown	bd83d0a28c	md/raid5: call bio_endio() directly rather than queueing for later. We currently gather bios that need to be returned into a bio_list and call bio_endio() on them all together. The original reason for this was to avoid making the calls while holding a spinlock. Locking has changed a lot since then, and that reason is no longer valid. So discard return_io() and various return_bi lists, and just call bio_endio() directly as needed. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:16:12 -07:00
NeilBrown	16d997b78b	md/raid5: simplfy delaying of writes while metadata is updated. If a device fails during a write, we must ensure the failure is recorded in the metadata before the completion of the write is acknowleged. Commit c3cce6cda162 ("md/raid5: ensure device failure recorded before write request returns.") added code for this, but it was unnecessarily complicated. We already had similar functionality for handling updates to the bad-block-list, thanks to Commit de393cdea66c ("md: make it easier to wait for bad blocks to be acknowledged.") So revert most of the former commit, and instead avoid collecting completed writes if MD_CHANGE_PENDING is set. raid5d() will then flush the metadata and retry the stripe_head. As this change can leave a stripe_head ready for handling immediately after handle_active_stripes() returns, we change raid5_do_work() to pause when MD_CHANGE_PENDING is set, so that it doesn't spin. We check MD_CHANGE_PENDING after analyse_stripe() as it could be set asynchronously. After analyse_stripe(), we have collected stable data about the state of devices, which will be used to make decisions. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:15:57 -07:00
NeilBrown	497280509f	md/raid5: use md_write_start to count stripes, not bios We use md_write_start() to increase the count of pending writes, and md_write_end() to decrement the count. We currently count bios submitted to md/raid5. Change it count stripe_heads that a WRITE bio has been attached to. So now, raid5_make_request() calls md_write_start() and then md_write_end() to keep the count elevated during the setup of the request. add_stripe_bio() calls md_write_start() for each stripe_head, and the completion routines always call md_write_end(), instead of only calling it when raid5_dec_bi_active_stripes() returns 0. make_discard_request also calls md_write_start/end(). The parallel between md_write_{start,end} and use of bi_phys_segments can be seen in that: Whenever we set bi_phys_segments to 1, we now call md_write_start. Whenever we increment it on non-read requests with raid5_inc_bi_active_stripes(), we now call md_write_start(). Whenever we decrement bi_phys_segments on non-read requsts with raid5_dec_bi_active_stripes(), we now call md_write_end(). This reduces our dependence on keeping a per-bio count of active stripes in bi_phys_segments. md_write_inc() is added which parallels md_write_start(), but requires that a write has already been started, and is certain never to sleep. This can be used inside a spinlocked region when adding to a write request. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-22 19:15:42 -07:00
Joe Thornber	0d963b6e65	dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty The dm_bitset_cursor_begin() call was using the incorrect nr_entries. Also, the last dm_bitset_cursor_next() must be avoided if we're at the end of the cursor. Fixes: 7f1b21591a6 ("dm cache metadata: use cursor api in blocks_are_clean_separate_dirty()") Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>	2017-03-20 16:00:49 -04:00
Guoqing Jiang	48df498daf	md: move bitmap_destroy to the beginning of __md_stop Since we have switched to sync way to handle METADATA_UPDATED msg for md-cluster, then process_metadata_update is depended on mddev->thread->wqueue. With the new change, clustered raid could possible hang if array received a METADATA_UPDATED msg after array unregistered mddev->thread, so we need to stop clustered raid (bitmap_destroy -> bitmap_free -> md_cluster_stop) earlier than unregister thread (mddev_detach -> md_unregister_thread). And this change should be safe for non-clustered raid since all writes are stopped before the destroy. Also in md_run, we activate the personality (pers->run()) before activating the bitmap (bitmap_create()). So it is pleasingly symmetric to stop the bitmap (bitmap_destroy()) before stopping the personality (__md_stop() calls pers->free()), we achieve this by move bitmap_destroy to the beginning of __md_stop. But we don't want to break the codes for waiting behind IO as Shaohua mentioned, so introduce bitmap_wait_behind_writes to call the codes, and call the new fun in both mddev_detach and bitmap_destroy, then we will not break original behind IO code and also fit the new condition well. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:58 -07:00
Song Liu	ea17481fb4	md/r5cache: generate R5LOG_PAYLOAD_FLUSH In r5c_finish_stripe_write_out(), R5LOG_PAYLOAD_FLUSH is append to log->current_io. Appending R5LOG_PAYLOAD_FLUSH in quiesce needs extra writes to journal. To simplify the logic, we just skip R5LOG_PAYLOAD_FLUSH in quiesce. Even R5LOG_PAYLOAD_FLUSH supports multiple stripes per payload. However, current implementation is one stripe per R5LOG_PAYLOAD_FLUSH, which is simpler. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:57 -07:00
Song Liu	2d4f468753	md/r5cache: handle R5LOG_PAYLOAD_FLUSH in recovery This patch adds handling of R5LOG_PAYLOAD_FLUSH in journal recovery. Next patch will add logic that generate R5LOG_PAYLOAD_FLUSH on flush finish. When R5LOG_PAYLOAD_FLUSH is seen in recovery, pending data and parity will be dropped from recovery. This will reduce the number of stripes to replay, and thus accelerate the recovery process. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:57 -07:00
Artur Paszkiewicz	ba903a3ea4	raid5-ppl: runtime PPL enabling or disabling Allow writing to 'consistency_policy' attribute when the array is active. Add a new function 'change_consistency_policy' to the md_personality operations structure to handle the change in the personality code. Values "ppl" and "resync" are accepted and turn PPL on and off respectively. When enabling PPL its location and size should first be set using 'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be written at this location on each member device. Enabling or disabling PPL is performed under a suspended array. The raid5_reset_stripe_cache function frees the stripe cache and allocates it again in order to allocate or free the ppl_pages for the stripes in the stripe cache. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:56 -07:00
Artur Paszkiewicz	6358c239d8	raid5-ppl: support disk hot add/remove with PPL Add a function to modify the log by removing an rdev when a drive fails or adding when a spare/replacement is activated as a raid member. Removing a disk just clears the child log rdev pointer. No new stripes will be accepted for this child log in ppl_write_stripe() and running io units will be processed without writing PPL to the device. Adding a disk sets the child log rdev pointer and writes an empty PPL header. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:56 -07:00
Artur Paszkiewicz	4536bf9ba2	raid5-ppl: load and recover the log Load the log from each disk when starting the array and recover if the array is dirty. The initial empty PPL is written by mdadm. When loading the log we verify the header checksum and signature. For external metadata arrays the signature is verified in userspace, so here we read it from the header, verifying only if it matches on all disks, and use it later when writing PPL. In addition to the header checksum, each header entry also contains a checksum of its partial parity data. If the header is valid, recovery is performed for each entry until an invalid entry is found. If the array is not degraded and recovery using PPL fully succeeds, there is no need to resync the array because data and parity will be consistent, so in this case resync will be disabled. Due to compatibility with IMSM implementations on other systems, we can't assume that the recovery data block size is always 4K. Writes generated by MD raid5 don't have this issue, but when recovering PPL written in other environments it is possible to have entries with 512-byte sector granularity. The recovery code takes this into account and also the logical sector size of the underlying drives. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:55 -07:00
Artur Paszkiewicz	664aed0444	md: add sysfs entries for PPL Add 'consistency_policy' attribute for array. It indicates how the array maintains consistency in case of unexpected shutdown. Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location and size of the PPL space on the device. They can't be changed for active members if the array is started and PPL is enabled, so in the setter functions only basic checks are performed. More checks are done in ppl_validate_rdev() when starting the log. These attributes are writable to allow enabling PPL for external metadata arrays and (later) to enable/disable PPL for a running array. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:55 -07:00
Artur Paszkiewicz	3418d036c8	raid5-ppl: Partial Parity Log write logging implementation Implement the calculation of partial parity for a stripe and PPL write logging functionality. The description of PPL is added to the documentation. More details can be found in the comments in raid5-ppl.c. Attach a page for holding the partial parity data to stripe_head. Allocate it only if mddev has the MD_HAS_PPL flag set. Partial parity is the xor of not modified data chunks of a stripe and is calculated as follows: - reconstruct-write case: xor data from all not updated disks in a stripe - read-modify-write case: xor old data and parity from all updated disks in a stripe Implement it using the async_tx API and integrate into raid_run_ops(). It must be called when we still have access to old data, so do it when STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is stored into sh->ppl_page. Partial parity is not meaningful for full stripe write and is not stored in the log or used for recovery, so don't attempt to calculate it when stripe has STRIPE_FULL_WRITE. Put the PPL metadata structures to md_p.h because userspace tools (mdadm) will also need to read/write PPL. Warn about using PPL with enabled disk volatile write-back cache for now. It can be removed once disk cache flushing before writing PPL is implemented. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:54 -07:00
Artur Paszkiewicz	ff875738ed	raid5: separate header for log functions Move raid5-cache declarations from raid5.h to raid5-log.h, add inline wrappers for functions which will be shared with ppl and use them in raid5 core instead of direct calls to raid5-cache. Remove unused parameter from r5c_cache_data(), move two duplicated pr_debug() calls to r5l_init_log(). Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:54 -07:00
Artur Paszkiewicz	ea0213e0c7	md: superblock changes for PPL Include information about PPL location and size into mdp_superblock_1 and copy it to/from rdev. Because PPL is mutually exclusive with bitmap, put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for 'feature_map', analogically to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL to mddev->flags to indicate that PPL is enabled on an array. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:53 -07:00
Song Liu	effe6ee752	md/r5cache: improve recovery with read ahead page pool In r5cache recovery, the journal device is scanned page by page. Currently, we use sync_page_io() to read journal device. This is not efficient when we have to recovery many stripes from the journal. To improve the speed of recovery, this patch introduces a read ahead page pool (ra_pool) to recovery_ctx. With ra_pool, multiple consecutive pages are read in one IO. Then the recovery code read the journal from ra_pool. With ra_pool, r5l_recovery_ctx has become much bigger. Therefore, r5l_recovery_log() is refactored so r5l_recovery_ctx is not using stack space. Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:53 -07:00
Shaohua Li	aaf9f12ebf	md/raid5: sort bios Previous patch (raid5: only dispatch IO from raid5d for harddisk raid) defers IO dispatching. The goal is to create better IO pattern. At that time, we don't sort the deffered IO and hope the block layer can do IO merge and sort. Now the raid5-cache writeback could create large amount of bios. And if we enable muti-thread for stripe handling, we can't control when to dispatch IO to raid disks. In a lot of time, we are dispatching IO which block layer can't do merge effectively. This patch moves further for the IO dispatching defer. We accumulate bios, but we don't dispatch all the bios after a threshold is met. This 'dispatch partial portion of bios' stragety allows bios coming in a large time window are sent to disks together. At the dispatching time, there is large chance the block layer can merge the bios. To make this more effective, we dispatch IO in ascending order. This increases request merge chance and reduces disk seek. Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:52 -07:00
Shaohua Li	84890c03b6	md/raid5-cache: bump flush stripe batch size Bump the flush stripe batch size to 2048. For my 12 disks raid array, the stripes takes: 12 * 4k * 2048 = 96MB This is still quite small. A hardware raid card generally has 1GB size, which we suggest the raid5-cache has similar cache size. The advantage of a big batch size is we can dispatch a lot of IO in the same time, then we can do some scheduling to make better IO pattern. Last patch prioritizes stripes, so we don't worry about a big flush stripe batch will starve normal stripes. Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:51 -07:00
Shaohua Li	535ae4eb12	md/raid5: prioritize stripes for writeback In raid5-cache writeback mode, we have two types of stripes to handle. - stripes which aren't cached yet - stripes which are cached and flushing out to raid disks Upperlayer is more sensistive to latency of the first type of stripes generally. But we only one handle list for all these stripes, where the two types of stripes are mixed together. When reclaim flushes a lot of stripes, the first type of stripes could be noticeably delayed. On the other hand, if the log space is tight, we'd like to handle the second type of stripes faster and free log space. This patch destinguishes the two types stripes. They are added into different handle list. When we try to get a stripe to handl, we prefer the first type of stripes unless log space is tight. This should have no impact for !writeback case. Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:51 -07:00
Guoqing Jiang	818da59f97	md-cluster: add the support for resize To update size for cluster raid, we need to make sure all nodes can perform the change successfully. However, it is possible that some of them can't do it due to failure (bitmap_resize could fail). So we need to consider the issue before we set the capacity unconditionally, and we use below steps to perform sanity check. 1. A change the size, then broadcast METADATA_UPDATED msg. 2. B and C receive METADATA_UPDATED change the size excepts call set_capacity, sync_size is not update if the change failed. Also call bitmap_update_sb to sync sb to disk. 3. A checks other node's sync_size, if sync_size has been updated in all nodes, then send CHANGE_CAPACITY msg otherwise send msg to revert previous change. 4. B and C call set_capacity if receive CHANGE_CAPACITY msg, otherwise pers->resize will be called to restore the old value. Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Shaohua Li <shli@fb.com>	2017-03-16 16:55:50 -07:00

... 2 3 4 5 6 ...

4850 Commits