linux

mirror of https://github.com/FEX-Emu/linux.git synced 2024-12-23 01:40:30 +00:00

Author	SHA1	Message	Date
Mike Snitzer	b372d360df	dm: convey that all flushes are processed as empty Rename __clone_and_map_flush to __clone_and_map_empty_flush for added clarity. Simplify logic associated with REQ_FLUSH conditionals. Introduce a BUG_ON() and add a few more helpful comments to the code so that it is clear that all flushes are empty. Cleanup __split_and_process_bio() so that an empty flush isn't processed by a 'sector_count' focused while loop. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-10 12:35:38 +02:00
Kiyoshi Ueda	05447420f9	dm: fix locking context in queue_io() Now queue_io() is called from dec_pending(), which may be called with interrupts disabled, so queue_io() must not enable interrupts unconditionally and must save/restore the current interrupts status. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-10 12:35:38 +02:00
Tejun Heo	6a8736d10c	dm: relax ordering of bio-based flush implementation Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't mandate any ordering against other bio's. This patch relaxes ordering around flushes. * A flush bio is no longer deferred to workqueue directly. It's processed like other bio's but __split_and_process_bio() uses md->flush_bio as the clone source. md->flush_bio is initialized to empty flush during md initialization and shared for all flushes. * As a flush bio now travels through the same execution path as other bio's, there's no need for dedicated error handling path either. It can use the same error handling path in dec_pending(). Dedicated error handling removed along with md->flush_error. * When dec_pending() detects that a flush has completed, it checks whether the original bio has data. If so, the bio is queued to the deferred list w/ REQ_FLUSH cleared; otherwise, it's completed. * As flush sequencing is handled in the usual issue/completion path, dm_wq_work() no longer needs to handle flushes differently. Now its only responsibility is re-issuing deferred bio's the same way as _dm_request() would. REQ_FLUSH handling logic including process_flush() is dropped. * There's no reason for queue_io() and dm_wq_work() write lock dm->io_lock. queue_io() now only uses md->deferred_lock and dm_wq_work() read locks dm->io_lock. * bio's no longer need to be queued on the deferred list while a flush is in progress making DMF_QUEUE_IO_TO_THREAD unncessary. Drop it. This avoids stalling the device during flushes and simplifies the implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-10 12:35:38 +02:00
Tejun Heo	29e4013de7	dm: implement REQ_FLUSH/FUA support for request-based dm This patch converts request-based dm to support the new REQ_FLUSH/FUA. The original request-based flush implementation depended on request_queue blocking other requests while a barrier sequence is in progress, which is no longer true for the new REQ_FLUSH/FUA. In general, request-based dm doesn't have infrastructure for cloning one source request to multiple targets, but the original flush implementation had a special mostly independent path which can issue flushes to multiple targets and sequence them. However, the capability isn't currently in use and adds a lot of complexity. Moreoever, it's unlikely to be useful in its current form as it doesn't make sense to be able to send out flushes to multiple targets when write requests can't be. This patch rips out special flush code path and deals handles REQ_FLUSH/FUA requests the same way as other requests. The only special treatment is that REQ_FLUSH requests use the block address 0 when finding target, which is enough for now. * added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as suggested by Mike Snitzer Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Mike Snitzer <snitzer@redhat.com> Tested-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-10 12:35:38 +02:00
Tejun Heo	d87f4c14f2	dm: implement REQ_FLUSH/FUA support for bio-based dm This patch converts bio-based dm to support REQ_FLUSH/FUA instead of now deprecated REQ_HARDBARRIER. * -EOPNOTSUPP handling logic dropped. * Preflush is handled as before but postflush is dropped and replaced with passing down REQ_FUA to member request_queues. This replaces one array wide cache flush w/ member specific FUA writes. * __split_and_process_bio() now calls __clone_and_map_flush() directly for flushes and guarantees all FLUSH bio's going to targets are zero ` length. * It's now guaranteed that all FLUSH bio's which are passed onto dm targets are zero length. bio_empty_barrier() tests are replaced with REQ_FLUSH tests. * Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes. * Dropped unlikely() around REQ_FLUSH tests. Flushes are not unlikely enough to be marked with unlikely(). * Block layer now filters out REQ_FLUSH/FUA bio's if the request_queue doesn't support cache flushing. Advertise REQ_FLUSH \| REQ_FUA capability. * Request based dm isn't converted yet. dm_init_request_based_queue() resets flush support to 0 for now. To avoid disturbing request based dm code, dm->flush_error is added for bio based dm while requested based dm continues to use dm->barrier_error. Lightly tested linear, stripe, raid1, snap and crypt targets. Please proceed with caution as I'm not familiar with the code base. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: dm-devel@redhat.com Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-10 12:35:38 +02:00
Tejun Heo	e9c7469bb4	md: implment REQ_FLUSH/FUA support This patch converts md to support REQ_FLUSH/FUA instead of now deprecated REQ_HARDBARRIER. In the core part (md.c), the following changes are notable. * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA don't interfere with processing of other requests and thus there is no reason to mark the queue congested while FLUSH/FUA is in progress. * REQ_FLUSH/FUA failures are final and its users don't need retry logic. Retry logic is removed. * Preflush needs to be issued to all member devices but FUA writes can be handled the same way as other writes - their processing can be deferred to request_queue of member devices. md_barrier_request() is renamed to md_flush_request() and simplified accordingly. For linear, raid0 and multipath, the core changes are enough. raid1, 5 and 10 need the following conversions. * raid1: Handling of FLUSH/FUA bio's can simply be deferred to request_queues of member devices. Barrier related logic removed. * raid5: Queue draining logic dropped. FUA bit is propagated through biodrain and stripe resconstruction such that all the updated parts of the stripe are written out with FUA writes if any of the dirtying writes was FUA. preread_active_stripes handling in make_request() is updated as suggested by Neil Brown. * raid10: FUA bit needs to be propagated to write clones. linear, raid0, 1, 5 and 10 tested. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Neil Brown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-10 12:35:38 +02:00
Tejun Heo	4913efe456	block: deprecate barrier and replace blk_queue_ordered() with blk_queue_flush() Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA requests. Deprecate barrier. All REQ_HARDBARRIERs are failed with -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler blk_queue_flush(). blk_queue_flush() takes combinations of REQ_FLUSH and FUA. If a device has write cache and can flush it, it should set REQ_FLUSH. If the device can handle FUA writes, it should also set REQ_FUA. All blk_queue_ordered() users are converted. * ORDERED_DRAIN is mapped to 0 which is the default value. * ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH. * ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH \| REQ_FUA. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Boaz Harrosh <bharrosh@panasas.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Cc: David S. Miller <davem@davemloft.net> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Pierre Ossman <drzeus@drzeus.cx> Cc: Stefan Weinhuber <wein@de.ibm.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-09-10 12:35:36 +02:00
NeilBrown	2c7d46ec19	md raid-1/10 Fix bio_rw bit manipulations again commit `7b6d91daee` changed the behaviour of a few variables in raid1 and raid10 from flags to bit-sets, but left them as type 'bool' so they did not work. Change them (back) to unsigned long. (historical note: see `1ef04fefe2`) Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: Jiri Slaby <jslaby@suse.cz> and many others	2010-08-18 16:16:05 +10:00
NeilBrown	6b96562054	md: provide appropriate return value for spare_active functions. md_check_recovery expects ->spare_active to return 'true' if any spares were activated, but none of them do, so the consequent change in 'degraded' is not notified through sysfs. So count the number of spares activated, subtract it from 'degraded' just once, and return it. Reported-by: Adrian Drzewiecki <adriand@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-18 12:04:32 +10:00
Adrian Drzewiecki	e6ffbcb6cd	md: Notify sysfs when RAID1/5/10 disk is In_sync. When RAID1 is done syncing disks, it'll update the state of synced rdevs to In_sync. But it neglected to notify sysfs that the attribute changed. So any programs that are waiting for an rdev's state to change will not be woken. (raid5/raid10 added by neilb) Signed-off-by: Adrian Drzewiecki <adriand@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-18 11:49:02 +10:00
NeilBrown	3a3a5ddb7a	Update recovery_offset even when external metadata is used. The update of ->recovery_offset in sync_sbs is appropriate even then external metadata is in use. However sync_sbs is only called when native metadata is used. So move that update in to the top of md_update_sb (which is the only caller of sync_sbs) before the test on ->external. This moves the update out of ->write_lock protection, but those fields only need ->reconfig_mutex protection which they still have. Also move the test on ->persistent up to where ->external is set as for metadata update purposes they are the same. Clear MD_CHANGE_DEVS and MD_CHANGE_CLEAN as they can only be confusing if ->external is set or ->persistent isn't. Finally move the update of ->utime down as it is only relevent (like the ->events update) for native metadata. Signed-off-by: NeilBrown <neilb@suse.de> Reported-by: "Kwolek, Adam" <adam.kwolek@intel.com>	2010-08-18 11:39:38 +10:00
Mike Snitzer	959eb4e559	dm mpath: support discard Enable discard support in the DM multipath target. This discard support depends on a few discard-specific fixes to the block layer's request stacking driver methods. Discard requests are optional so don't allow a failed discard to trigger path failures. If there is a real problem with a given path the barriers associated with the discard (either before or after the discard) will cause path failure. That said, unconditionally passing discard failures up the stack is not ideal. This must be fixed once DM has more information about the nature of the underlying storage failure. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>	2010-08-12 04:14:32 +01:00
Mikulas Patocka	7b76ec11fe	dm stripe: support discards The DM core will submit a discard bio to the stripe target for each stripe in a striped DM device. The stripe target will determine stripe-specific portions of the supplied bio to be remapped into individual (at most 'num_discard_requests' extents). If a given stripe-specific discard bio doesn't touch a particular stripe the bio will be dropped. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:26 +01:00
Mike Snitzer	a79245b3e5	dm: split discard requests on target boundaries Update __clone_and_map_discard to loop across all targets in a DM device's table when it processes a discard bio. If a discard crosses a target boundary it must be split accordingly. Update __issue_target_requests and __issue_target_request to allow a cloned discard bio to have a custom start sector and size. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:24 +01:00
Mikulas Patocka	c96053b767	dm stripe: optimize sector division Optimize sector division: If the number of stripes is a power of two, we can do shift and mask instead of division. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:21 +01:00
Mikulas Patocka	65988525ab	dm stripe: move sector translation to a function Move sector to stripe translation into a function. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:14 +01:00
Mike Snitzer	38e1b257fd	dm: error return error for discards Have the error target respond to a discard request with a hard -EIO rather than fail the request with -EOPNOTSUPP. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:14 +01:00
Mike Snitzer	3fd5d48027	dm delay: support discard Enable discard support for the delay target. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:13 +01:00
Mike Snitzer	f8facb61b5	dm: zero silently drop discards Have the zero target silently drop a discard rather than fail the request with -EOPNOTSUPP. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:12 +01:00
Alasdair G Kergon	b441a262e7	dm: use dm_target_offset macro Use new dm_target_offset() macro to avoid most references to ti->begin in dm targets. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:11 +01:00
Mike Snitzer	56a67df766	dm: factor out max_io_len_target_boundary Split max_io_len_target_boundary out of max_io_len so that the discard support can make use of it without duplicating max_io_len code. Avoiding max_io_len's split_io logic enables DM's discard support to submit the entire discard request to a target. But discards must still be split on target boundaries. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:10 +01:00
Mike Snitzer	06a426cee9	dm: use common __issue_target_request for flush and discard support Rename __flush_target to __issue_target_request now that it is used to issue both flush and discard requests. Introduce __issue_target_requests as a convenient wrapper to __issue_target_request 'num_flush_requests' or 'num_discard_requests' times per target. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:09 +01:00
Mike Snitzer	5ae89a8720	dm: linear support discard Allow discards to be passed through to linear mappings if at least one underlying device supports it. Discards will be forwarded only to devices that support them. A target that supports discards should set num_discard_requests to indicate how many times each discard request must be submitted to it. Verify table's underlying devices support discards prior to setting the associated DM device as capable of discards (via QUEUE_FLAG_DISCARD). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:08 +01:00
Milan Broz	5ebaee6d29	dm crypt: simplify crypt_ctr Allocate cipher strings indpendently of struct crypt_config and move cipher parsing and allocation into a separate function to prepare for supporting the cryptoapi format e.g. "xts(aes)". No functional change in this patch. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:07 +01:00
Milan Broz	28513fccf0	dm crypt: simplify crypt_config destruction logic Use just one label and reuse common destructor for crypt target. Parse remaining argv arguments in logic order. Also do not ignore error values from IV init and set key functions. No functional change in this patch except changed return codes based on above. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:06 +01:00
Peter Rajnoha	7e507eb643	dm: allow autoloading of dm mod Add devname:mapper/control and MAPPER_CTRL_MINOR module alias to support dm-mod module autoloading. Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Peter Rajnoha <prajnoha@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:05 +01:00
Mike Snitzer	57cba5d365	dm: rename map_info flush_request to target_request_nr 'target_request_nr' is a more generic name that reflects the fact that it will be used for both flush and discard support. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:04 +01:00
Will Drewry	26803b9f06	dm ioctl: refactor dm_table_complete This change unifies the various checks and finalization that occurs on a table prior to use. By doing so, it allows table construction without traversing the dm-ioctl interface. Signed-off-by: Will Drewry <wad@chromium.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:03 +01:00
Mikulas Patocka	b1d5552838	dm snapshot: implement merge Implement merge method for the snapshot origin to improve read performance. Without merge method, dm asks the upper layers to submit smallest possible bios --- one page. Submitting such small bios impacts performance negatively when reading or writing the origin device. Without this patch, CPU consumption when reading the origin on lvm on md-raid0 was 6 to 12%, with this patch, it drops to 1 to 4%. Note: in my testing, it actually degraded performance in some settings, I traced it to Maxtor disks having problems with > 512-sector requests. Reducing the number of sectors to /sys/block/sd*/queue/max_sectors_kb to 256 fixed the read performance. I think we don't have to care about weird disks that actually degrade performance because of large requests being sent to them. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:02 +01:00
Mike Snitzer	4a0b4ddf26	dm: do not initialise full request queue when bio based Change bio-based mapped devices no longer to have a fully initialized request_queue (request_fn, elevator, etc). This means bio-based DM devices no longer register elevator sysfs attributes ('iosched/' tree or 'scheduler' other than "none"). In contrast, a request-based DM device will continue to have a full request_queue and will register elevator sysfs attributes. Therefore a user can determine a DM device's type by checking if elevator sysfs attributes exist. First allocate a minimalist request_queue structure for a DM device (needed for both bio and request-based DM). Initialization of a full request_queue is deferred until it is known that the DM device is request-based, at the end of the table load sequence. Factor DM device's request_queue initialization: - common to both request-based and bio-based into dm_init_md_queue(). - specific to request-based into dm_init_request_based_queue(). The md->type_lock mutex is used to protect md->queue, in addition to md->type, during table_load(). A DM device's first table_load will establish the immutable md->type. But md->queue initialization, based on md->type, may fail at that time (because blk_init_allocated_queue cannot allocate memory). Therefore any subsequent table_load must (re)try dm_setup_md_queue independently of establishing md->type. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:02 +01:00
Mike Snitzer	a5664dad7e	dm ioctl: make bio or request based device type immutable Determine whether a mapped device is bio-based or request-based when loading its first (inactive) table and don't allow that to be changed later. This patch performs different device initialisation in each of the two cases. (We don't think it's necessary to add code to support changing between the two types.) Allowed md->type transitions: DM_TYPE_NONE to DM_TYPE_BIO_BASED DM_TYPE_NONE to DM_TYPE_REQUEST_BASED We now prevent table_load from replacing the inactive table with a conflicting type of table even after an explicit table_clear. Introduce 'type_lock' into the struct mapped_device to protect md->type and to prepare for the next patch that will change the queue initialization and allocate memory while md->type_lock is held. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> drivers/md/dm-ioctl.c \| 15 +++++++++++++++ drivers/md/dm.c \| 37 ++++++++++++++++++++++++++++++------- drivers/md/dm.h \| 5 +++++ include/linux/dm-ioctl.h \| 4 ++-- 4 files changed, 52 insertions(+), 9 deletions(-)	2010-08-12 04:14:01 +01:00
Mikulas Patocka	708e929513	dm: skip second flush on bio unsupported error When processing barriers, skip the second flush if processing the bio failed with -EOPNOTSUPP. This can happen with discard+barrier requests. If the device doesn't support discard, there would be two useless SYNCHRONIZE CACHE commands. The first dm_flush cannot be so easily optimized out, so we leave it there. Previously, -EOPNOTSUPP could be received in dec_pending only with empty barriers and we ignored that error, assuming the device not supporting cache flushes has cache always consistent. With the addition of discard barriers, this -EOPNOTSUPP can also be generated by discards and we must record it in md->barrier_error for process_barrier. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:14:00 +01:00
Tomohiro Kusumi	87c961cb74	dm snapshot: persistent use define for disk header chunk size This patch fixes hard-coded value for the size of a chunk that includes disk header for persistent snapshot. It should be changed to existing macro NUM_SNAPSHOT_HDR_CHUNKS instead of using hard-coded value 1. Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:59 +01:00
Julia Lawall	a9c88f2ebc	dm crypt: use kstrdup Use kstrdup when the goal of an allocation is copy a string into the allocated region. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression from,to; expression flag,E1,E2; statement S; @@ - to = kmalloc(strlen(from) + 1,flag); + to = kstrdup(from, flag); ... when != \(from = E1 \\| to = E1 \) if (to==NULL \|\| ...) S ... when != \(from = E2 \\| to = E2 \) - strcpy(to, from); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:58 +01:00
Arnd Bergmann	402ab352c2	dm ioctl: use nonseekable_open The dm control device does not implement read/write, so it has no use for seeking. Using no_llseek prevents falling back to default_llseek, which requires the BKL. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:57 +01:00
Kiyoshi Ueda	3f77316de0	dm: separate device deletion from dm_put This patch separates the device deletion code from dm_put() to make sure the deletion happens in the process context. By this patch, device deletion always occurs in an ioctl (process) context and dm_put() can be called in interrupt context. As a result, the request-based dm's bad dm_put() usage pointed out by Mikulas below disappears. http://marc.info/?l=dm-devel&m=126699981019735&w=2 Without this patch, I confirmed there is a case to crash the system: dm_put() => dm_table_destroy() => vfree() => BUG_ON(in_interrupt()) Some more backgrounds and details: In request-based dm, a device opener can remove a mapped_device while the last request is still completing, because bios in the last request complete first and then the device opener can close and remove the mapped_device before the last request completes: CPU0 CPU1 ================================================================= <<INTERRUPT>> blk_end_request_all(clone_rq) blk_update_request(clone_rq) bio_endio(clone_bio) == end_clone_bio blk_update_request(orig_rq) bio_endio(orig_bio) <<I/O completed>> dm_blk_close() dev_remove() dm_put(md) <<Free md>> blk_finish_request(clone_rq) .... dm_end_request(clone_rq) free_rq_clone(clone_rq) blk_end_request_all(orig_rq) rq_completed(md) So request-based dm used dm_get()/dm_put() to hold md for each I/O until its request completion handling is fully done. However, the final dm_put() can call the device deletion code which must not be run in interrupt context and may cause kernel panic. To solve the problem, this patch moves the device deletion code, dm_destroy(), to predetermined places that is actually deleting the mapped_device in ioctl (process) context, and changes dm_put() just to decrement the reference count of the mapped_device. By this change, dm_put() can be used in any context and the symmetric model below is introduced: dm_create(): create a mapped_device dm_destroy(): destroy a mapped_device dm_get(): increment the reference count of a mapped_device dm_put(): decrement the reference count of a mapped_device dm_destroy() waits for all references of the mapped_device to disappear, then deletes the mapped_device. dm_destroy() uses active waiting with msleep(1), since deleting the mapped_device isn't performance-critical task. And since at this point, nobody opens the mapped_device and no new reference will be taken, the pending counts are just for racing completing activity and will eventually decrease to zero. For the unlikely case of the forced module unload, dm_destroy_immediate(), which doesn't wait and forcibly deletes the mapped_device, is also introduced and used in dm_hash_remove_all(). Otherwise, "rmmod -f" may be stuck and never return. And now, because the mapped_device is deleted at this point, subsequent accesses to the mapped_device may cause NULL pointer references. Cc: stable@kernel.org Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:56 +01:00
Kiyoshi Ueda	98f332855e	dm ioctl: release _hash_lock between devices in remove_all This patch changes dm_hash_remove_all() to release _hash_lock when removing a device. After removing the device, dm_hash_remove_all() takes _hash_lock and searches the hash from scratch again. This patch is a preparation for the next patch, which changes device deletion code to wait for md reference to be 0. Without this patch, the wait in the next patch may cause AB-BA deadlock: CPU0 CPU1 ----------------------------------------------------------------------- dm_hash_remove_all() down_write(_hash_lock) table_status() md = find_device() dm_get(md) <increment md->holders> dm_get_live_or_inactive_table() dm_get_inactive_table() down_write(_hash_lock) <in the md deletion code> <wait for md->holders to be 0> Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:55 +01:00
Kiyoshi Ueda	abdc568b05	dm: prevent access to md being deleted This patch prevents access to mapped_device which is being deleted. Currently, even after a mapped_device has been removed from the hash, it could be accessed through idr_find() using minor number. That could cause a race and NULL pointer reference below: CPU0 CPU1 ------------------------------------------------------------------ dev_remove(param) down_write(_hash_lock) dm_lock_for_deletion(md) spin_lock(_minor_lock) set_bit(DMF_DELETING) spin_unlock(_minor_lock) __hash_remove(hc) up_write(_hash_lock) dev_status(param) md = find_device(param) down_read(_hash_lock) __find_device_hash_cell(param) dm_get_md(param->dev) md = dm_find_md(dev) spin_lock(_minor_lock) md = idr_find(MINOR(dev)) spin_unlock(_minor_lock) dm_put(md) free_dev(md) dm_get(md) up_read(_hash_lock) __dev_status(md, param) dm_put(md) This patch fixes such problems. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:54 +01:00
Peter Rajnoha	856a6f1dbd	dm ioctl: return uevent flag after rename All the dm ioctls that generate uevents set the DM_UEVENT_GENERATED flag so that userspace knows whether or not to wait for a uevent to be processed before continuing, The dm rename ioctl sets this flag but was not structured to return it to userspace. This patch restructures the rename ioctl processing to behave like the other ioctls that return data and so fix this. Signed-off-by: Peter Rajnoha <prajnoha@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:53 +01:00
Alasdair G Kergon	094ea9a071	dm ioctl: make __dev_status void __dev_status() cannot fail so make it void and simplify callers. Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:52 +01:00
Peter Rajnoha	6be5449401	dm ioctl: remove __dev_status from geometry and target message Remove useless __dev_status call while processing an ioctl that sets up device geometry and target message. The data is not returned to userspace so there is no point collecting it and in the case of target_message it is collected before processing the message so if it did return it might be stale. Signed-off-by: Peter Rajnoha <prajnoha@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:52 +01:00
Mikulas Patocka	c241104506	dm snapshot: test chunk size against both origin and snapshot Validate chunk size against both origin and snapshot sector size Don't allow chunk size smaller than either origin or snapshot logical sector size. Reading or writing data not aligned to sector size is not allowed and causes immediate errors. This requires us to open the origin before initialising the exception store and to export dm_snap_origin. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:51 +01:00
Mikulas Patocka	1e5554c842	dm snapshot: iterate origin and cow devices Iterate both origin and snapshot devices iterate_devices method should call the callback for all the devices where the bio may be remapped. Thus, snapshot_iterate_devices should call the callback for both snapshot and origin underlying devices because it remaps some bios to the snapshot and some to the origin. snapshot_iterate_devices called the callback only for the origin device. This led to badly calculated device limits if snapshot and origin were placed on different types of disks. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:50 +01:00
Alasdair G Kergon	6bbf79a140	dm mpath: fix NULL pointer dereference when path parameters missing multipath_ctr() forgets to return an error after detecting missing path parameters. Fix this. Signed-off-by: Patrick LoPresti <lopresti@gmail.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>	2010-08-12 04:13:49 +01:00
Linus Torvalds	3d30701b58	Merge branch 'for-linus' of git://neil.brown.name/md * 'for-linus' of git://neil.brown.name/md: (24 commits) md: clean up do_md_stop md: fix another deadlock with removing sysfs attributes. md: move revalidate_disk() back outside open_mutex md/raid10: fix deadlock with unaligned read during resync md/bitmap: separate out loading a bitmap from initialising the structures. md/bitmap: prepare for storing write-intent-bitmap via dm-dirty-log. md/bitmap: optimise scanning of empty bitmaps. md/bitmap: clean up plugging calls. md/bitmap: reduce dependence on sysfs. md/bitmap: white space clean up and similar. md/raid5: export raid5 unplugging interface. md/plug: optionally use plugger to unplug an array during resync/recovery. md/raid5: add simple plugging infrastructure. md/raid5: export is_congested test raid5: Don't set read-ahead when there is no queue md: add support for raising dm events. md: export various start/stop interfaces md: split out md_rdev_init md: be more careful setting MD_CHANGE_CLEAN md/raid5: ensure we create a unique name for kmem_cache when mddev has no gendisk ...	2010-08-10 15:38:19 -07:00
NeilBrown	fd8aa2c181	Merge git://git.infradead.org/users/dwmw2/libraid-2.6 into for-linus	2010-08-10 10:02:33 +10:00
David Woodhouse	2144381da4	Merge branch 'async' of macbook:git/btrfs-unstable Conflicts: drivers/md/Makefile lib/raid6/unroll.pl	2010-08-09 10:36:44 +01:00
NeilBrown	6e17b02764	md: clean up do_md_stop There is only one error exit from do_md_stop, so make that more explicit and discard the 'err' variable. Also drop the 'revalidate' variable by moving the unlock calls around. Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-08 21:22:45 +10:00
NeilBrown	bb4f1e9d0e	md: fix another deadlock with removing sysfs attributes. Move the deletion of sysfs attributes from reconfig_mutex to open_mutex didn't really help as a process can try to take open_mutex while holding reconfig_mutex, so the same deadlock can happen, just requiring one more process to be involved in the chain. I looks like I cannot easily use locking to wait for the sysfs deletion to complete, so don't. The only things that we cannot do while the deletions are still pending is other things which can change the sysfs namespace: run, takeover, stop. Each of these can fail with -EBUSY. So set a flag while doing a sysfs deletion, and fail run, takeover, stop if that flag is set. This is suitable for 2.6.35.x Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-08 21:21:27 +10:00
Dan Williams	147e0b6a63	md: move revalidate_disk() back outside open_mutex Commit `b821eaa5` "md: remove ->changed and related code" moved revalidate_disk() under open_mutex, and lockdep noticed. [ INFO: possible circular locking dependency detected ] 2.6.32-mdadm-locking #1 ------------------------------------------------------- mdadm/3640 is trying to acquire lock: (&bdev->bd_mutex){+.+.+.}, at: [<ffffffff811acecb>] revalidate_disk+0x5b/0x90 but task is already holding lock: (&mddev->open_mutex){+.+...}, at: [<ffffffffa055e07a>] do_md_stop+0x4a/0x4d0 [md_mod] which lock already depends on the new lock. It is suitable for 2.6.35.x Cc: <stable@kernel.org> Reported-by: Przemyslaw Czarnowski <przemyslaw.hawrylewicz.czarnowski@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>	2010-08-08 21:20:17 +10:00

1 2 3 4 5 ...

1752 Commits