linux/drivers/md
Guoqing Jiang 0ba959774e md-cluster: use sync way to handle METADATA_UPDATED msg
Previously, when node received METADATA_UPDATED msg, it just
need to wakeup mddev->thread, then md_reload_sb will be called
eventually.

We taken the asynchronous way to avoid a deadlock issue, the
deadlock issue could happen when one node is receiving the
METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
the path:

md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                  -> md_update_sb-metadata_update_start
		     (want EX on token however token is
		      got by the sending node)

Since we will support resizing for clustered raid, and we
need the metadata update handling to be synchronous so that
the initiating node can detect failure, so we need to change
the way for handling METADATA_UPDATED msg.

But, we obviously need to avoid above deadlock with the
sync way. To make this happen, we considered to not hold
reconfig_mutex to call md_reload_sb, if some other thread
has already taken reconfig_mutex and waiting for the 'token',
then process_recvd_msg() can safely call md_reload_sb()
without taking the mutex. This is because we can be certain
that no other thread will take the mutex, and we also certain
that the actions performed by md_reload_sb() won't interfere
with anything that the other thread is in the middle of.

To make this more concrete, we added a new cinfo->state bit
        MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD

Which is set in lock_token() just before dlm_lock_sync() is
called, and cleared just after. As lock_token() is always
called with reconfig_mutex() held (the specific case is the
resync_info_update which is distinguished well in previous
patch), if process_recvd_msg() finds that the new bit is set,
then the mutex must be held by some other thread, and it will
keep waiting.

So process_metadata_update() can call md_reload_sb() if either
mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
is set. The tricky bit is what to do if neither of these apply.
We need to wait. Fortunately mddev_unlock() always calls wake_up()
on mddev->thread->wqueue. So we can get lock_token() to call
wake_up() on that when it sets the bit.

There are also some related changes inside this commit:
1. remove RELOAD_SB related codes since there are not valid anymore.
2. mddev is added into md_cluster_info then we can get mddev inside
   lock_token.
3. add new parameter for lock_token to distinguish reconfig_mutex
   is held or not.

And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below:
1. set it before unregister thread, otherwise a deadlock could
   appear if stop a resyncing array.
   This is because md_unregister_thread(&cinfo->recv_thread) is
   blocked by recv_daemon -> process_recvd_msg
			  -> process_metadata_update.
   To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
   also need to be set before unregister thread.
2. set it in metadata_update_start to fix another deadlock.
	a. Node A sends METADATA_UPDATED msg (held Token lock).
	b. Node B wants to do resync, and is blocked since it can't
	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
	   not set since the callchain
	   (md_do_sync -> sync_request
        	       -> resync_info_update
		       -> sendmsg
		       -> lock_comm -> lock_token)
	   doesn't hold reconfig_mutex.
	c. Node B trys to update sb (held reconfig_mutex), but stopped
	   at wait_event() in metadata_update_start since we have set
	   MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
	d. Then Node B receives METADATA_UPDATED msg from A, of course
	   recv_daemon is blocked forever.
   Since metadata_update_start always calls lock_token with reconfig_mutex,
   we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
   lock_token don't need to set it twice unless lock_token is invoked from
   lock_comm.

Finally, thanks to Neil for his great idea and help!

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-16 16:55:49 -07:00
..
bcache drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h 2017-03-09 17:01:10 -08:00
persistent-data sched/headers: Prepare to move the get_task_struct()/put_task_struct() and related APIs from <linux/sched.h> to <linux/sched/task.h> 2017-03-02 08:42:40 +01:00
bitmap.c md: separate flags for superblock changes 2016-12-08 22:01:47 -08:00
bitmap.h
dm-bio-prison.c
dm-bio-prison.h
dm-bio-record.h
dm-bufio.c sched/headers: Prepare to move the memalloc_noio_*() APIs to <linux/sched/mm.h> 2017-03-02 08:42:33 +01:00
dm-bufio.h
dm-builtin.c
dm-cache-block-types.h linux: drop __bitwise__ everywhere 2016-12-16 00:13:41 +02:00
dm-cache-metadata.c dm cache metadata: use cursor api in blocks_are_clean_separate_dirty() 2017-02-16 13:12:51 -05:00
dm-cache-metadata.h dm cache metadata: add "metadata2" feature 2017-02-16 13:12:47 -05:00
dm-cache-policy-cleaner.c
dm-cache-policy-internal.h
dm-cache-policy-smq.c dm cache policy smq: use hash_32() instead of hash_32_generic() 2016-12-08 19:42:37 -05:00
dm-cache-policy.c
dm-cache-policy.h
dm-cache-target.c - Fix dm-raid transient device failure processing and other smaller 2017-02-21 12:11:41 -08:00
dm-core.h dm: always defer request allocation to the owner of the request_queue 2017-01-27 15:08:35 -07:00
dm-crypt.c KEYS: Differentiate uses of rcu_dereference_key() and user_key_payload() 2017-03-02 10:09:00 +11:00
dm-delay.c
dm-era-target.c block: Use pointer to backing_dev_info from request_queue 2017-02-02 08:20:48 -07:00
dm-exception-store.c
dm-exception-store.h
dm-flakey.c dm flakey: introduce "error_writes" feature 2016-12-13 15:01:31 -05:00
dm-io.c
dm-ioctl.c sched/headers: Prepare to move the memalloc_noio_*() APIs to <linux/sched/mm.h> 2017-03-02 08:42:33 +01:00
dm-kcopyd.c
dm-linear.c
dm-log-userspace-base.c
dm-log-userspace-transfer.c
dm-log-userspace-transfer.h
dm-log-writes.c
dm-log.c
dm-mpath.c Merge branch 'for-4.11/next' into for-4.11/linus-merge 2017-02-17 14:08:19 -07:00
dm-mpath.h
dm-path-selector.c
dm-path-selector.h
dm-queue-length.c
dm-raid1.c Merge branch 'for-4.10/block' of git://git.kernel.dk/linux-block 2016-12-13 10:19:16 -08:00
dm-raid.c dm raid: bump the target version 2017-02-28 16:47:52 -05:00
dm-region-hash.c
dm-round-robin.c dm round robin: revert "use percpu 'repeat_count' and 'current_path'" 2017-02-17 00:54:09 -05:00
dm-rq.c dm-rq: don't dereference request payload after ending request 2017-02-24 13:19:32 -07:00
dm-rq.h dm: always defer request allocation to the owner of the request_queue 2017-01-27 15:08:35 -07:00
dm-service-time.c
dm-snap-persistent.c
dm-snap-transient.c
dm-snap.c
dm-stats.c dm stats: fix a leaked s->histogram_boundaries array 2017-02-16 14:17:07 -05:00
dm-stats.h
dm-stripe.c
dm-switch.c
dm-sysfs.c
dm-table.c block: Use pointer to backing_dev_info from request_queue 2017-02-02 08:20:48 -07:00
dm-target.c dm: always defer request allocation to the owner of the request_queue 2017-01-27 15:08:35 -07:00
dm-thin-metadata.c
dm-thin-metadata.h
dm-thin.c block: Use pointer to backing_dev_info from request_queue 2017-02-02 08:20:48 -07:00
dm-uevent.c
dm-uevent.h
dm-verity-fec.c
dm-verity-fec.h
dm-verity-target.c
dm-verity.h
dm-zero.c
dm.c blk: Ensure users for current->bio_list can see the full list. 2017-03-11 15:31:37 -07:00
dm.h dm: always defer request allocation to the owner of the request_queue 2017-01-27 15:08:35 -07:00
faulty.c md: fast clone bio in bio_clone_mddev() 2017-02-15 11:24:54 -08:00
Kconfig
linear.c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2017-02-24 14:42:19 -08:00
linear.h md linear: fix a race between linear_add() and linear_congested() 2017-02-13 09:17:50 -08:00
Makefile
md-cluster.c md-cluster: use sync way to handle METADATA_UPDATED msg 2017-03-16 16:55:49 -07:00
md-cluster.h
md.c md-cluster: use sync way to handle METADATA_UPDATED msg 2017-03-16 16:55:49 -07:00
md.h md-cluster: use sync way to handle METADATA_UPDATED msg 2017-03-16 16:55:49 -07:00
multipath.c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2017-02-24 14:42:19 -08:00
multipath.h
raid0.c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2017-02-24 14:42:19 -08:00
raid0.h
raid1.c md/raid1: fix a trivial typo in comments 2017-03-14 11:10:44 -07:00
raid1.h RAID1: avoid unnecessary spin locks in I/O barrier code 2017-02-19 22:04:25 -08:00
raid5-cache.c md/raid5-cache: exclude reclaiming stripes in reclaim check 2017-02-13 09:20:05 -08:00
raid5.c md/r5cache: fix set_syndrome_sources() for data in cache 2017-03-14 09:57:10 -07:00
raid5.h md/raid5-cache: exclude reclaiming stripes in reclaim check 2017-02-13 09:20:05 -08:00
raid10.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md 2017-03-16 11:43:48 -07:00
raid10.h