linux/drivers
Guoqing Jiang 0ba959774e md-cluster: use sync way to handle METADATA_UPDATED msg
Previously, when node received METADATA_UPDATED msg, it just
need to wakeup mddev->thread, then md_reload_sb will be called
eventually.

We taken the asynchronous way to avoid a deadlock issue, the
deadlock issue could happen when one node is receiving the
METADATA_UPDATED msg (wants reconfig_mutex) and trying to run
the path:

md_check_recovery -> mddev_trylock(hold reconfig_mutex)
                  -> md_update_sb-metadata_update_start
		     (want EX on token however token is
		      got by the sending node)

Since we will support resizing for clustered raid, and we
need the metadata update handling to be synchronous so that
the initiating node can detect failure, so we need to change
the way for handling METADATA_UPDATED msg.

But, we obviously need to avoid above deadlock with the
sync way. To make this happen, we considered to not hold
reconfig_mutex to call md_reload_sb, if some other thread
has already taken reconfig_mutex and waiting for the 'token',
then process_recvd_msg() can safely call md_reload_sb()
without taking the mutex. This is because we can be certain
that no other thread will take the mutex, and we also certain
that the actions performed by md_reload_sb() won't interfere
with anything that the other thread is in the middle of.

To make this more concrete, we added a new cinfo->state bit
        MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD

Which is set in lock_token() just before dlm_lock_sync() is
called, and cleared just after. As lock_token() is always
called with reconfig_mutex() held (the specific case is the
resync_info_update which is distinguished well in previous
patch), if process_recvd_msg() finds that the new bit is set,
then the mutex must be held by some other thread, and it will
keep waiting.

So process_metadata_update() can call md_reload_sb() if either
mddev_trylock() succeeds, or if MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD
is set. The tricky bit is what to do if neither of these apply.
We need to wait. Fortunately mddev_unlock() always calls wake_up()
on mddev->thread->wqueue. So we can get lock_token() to call
wake_up() on that when it sets the bit.

There are also some related changes inside this commit:
1. remove RELOAD_SB related codes since there are not valid anymore.
2. mddev is added into md_cluster_info then we can get mddev inside
   lock_token.
3. add new parameter for lock_token to distinguish reconfig_mutex
   is held or not.

And, we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD in below:
1. set it before unregister thread, otherwise a deadlock could
   appear if stop a resyncing array.
   This is because md_unregister_thread(&cinfo->recv_thread) is
   blocked by recv_daemon -> process_recvd_msg
			  -> process_metadata_update.
   To resolve the issue, MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
   also need to be set before unregister thread.
2. set it in metadata_update_start to fix another deadlock.
	a. Node A sends METADATA_UPDATED msg (held Token lock).
	b. Node B wants to do resync, and is blocked since it can't
	   get Token lock, but MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD is
	   not set since the callchain
	   (md_do_sync -> sync_request
        	       -> resync_info_update
		       -> sendmsg
		       -> lock_comm -> lock_token)
	   doesn't hold reconfig_mutex.
	c. Node B trys to update sb (held reconfig_mutex), but stopped
	   at wait_event() in metadata_update_start since we have set
	   MD_CLUSTER_SEND_LOCK flag in lock_comm (step 2).
	d. Then Node B receives METADATA_UPDATED msg from A, of course
	   recv_daemon is blocked forever.
   Since metadata_update_start always calls lock_token with reconfig_mutex,
   we need to set MD_CLUSTER_HOLDING_MUTEX_FOR_RECVD here as well, and
   lock_token don't need to set it twice unless lock_token is invoked from
   lock_comm.

Finally, thanks to Neil for his great idea and help!

Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
2017-03-16 16:55:49 -07:00
..
accessibility
acpi Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-03-07 14:47:24 -08:00
amba
android sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> 2017-03-02 08:42:29 +01:00
ata ahci: qoriq: correct the sata ecc setting error 2017-03-09 11:55:23 -05:00
atm sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
auxdisplay
base Merge branch 'rebased-statx' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2017-03-03 11:38:56 -08:00
bcma
block A fix for the recently discovered misdirected requests bug present in 2017-03-10 11:05:47 -08:00
bluetooth btmrvl: fix spelling mistake: "actived" -> "activated" 2017-02-19 00:26:37 +01:00
bus ARM: SoC driver updates 2017-02-23 15:57:04 -08:00
cdrom Merge branch 'for-4.11/next' into for-4.11/linus-merge 2017-02-17 14:08:19 -07:00
char Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2017-03-15 09:26:04 -07:00
clk ARM: SoC: late DT updates for v4.11 2017-03-03 16:15:48 -08:00
clocksource sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h> 2017-03-02 08:42:27 +01:00
connector
cpufreq Merge branch 'pm-cpufreq' 2017-03-09 15:12:27 +01:00
cpuidle Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-03-03 10:16:38 -08:00
crypto Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2017-03-15 09:26:04 -07:00
dax sched/headers: Prepare to remove the <linux/magic.h> include from <linux/sched/task_stack.h> 2017-03-02 08:42:40 +01:00
dca
devfreq scripts/spelling.txt: add "followings" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
dio
dma sched/headers: Prepare to move the get_task_struct()/put_task_struct() and related APIs from <linux/sched.h> to <linux/sched/task.h> 2017-03-02 08:42:40 +01:00
dma-buf sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
edac Merge branch 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-02-20 12:47:44 -08:00
eisa
extcon scripts/spelling.txt: add "swithc" pattern and fix typo instances 2017-02-27 18:43:46 -08:00
firewire Merge branch 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax 2017-02-28 20:29:41 -08:00
firmware Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-03-07 14:25:48 -08:00
fmc
fpga
fsi
gpio This is the bulk of GPIO changes for the v4.11 cycle 2017-02-23 08:46:04 -08:00
gpu intel, amd and mxsfb fixes. 2017-03-10 09:53:00 -08:00
hid sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
hsi sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
hv scripts/spelling.txt: add "disble(d)" pattern and fix typo instances 2017-03-09 17:01:09 -08:00
hwmon scripts/spelling.txt: add "followings" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
hwspinlock
hwtracing mm, fs: reduce fault, page_mkwrite, and pfn_mkwrite to take only vmf 2017-02-24 17:46:54 -08:00
i2c Revert "i2c: copy device properties when using i2c_register_board_info()" 2017-03-09 16:41:48 +01:00
ide sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task_stack.h> 2017-03-02 08:42:36 +01:00
idle Power management turbostat utility updates for v4.11-rc1 2017-03-02 17:41:27 -08:00
iio Staging/IIO driver fixes for 4.11-rc1 2017-03-04 11:26:18 -08:00
infiniband sched/headers: Prepare to move the get_task_struct()/put_task_struct() and related APIs from <linux/sched.h> to <linux/sched/task.h> 2017-03-02 08:42:40 +01:00
input Input: rmi4 - f30: detect INPUT_PROP_BUTTONPAD from the button count 2017-03-01 10:01:56 -08:00
iommu sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> 2017-03-02 08:42:28 +01:00
ipack
irqchip irqchip/irqdomain updates for 4.11-rc2 2017-03-09 12:06:41 +01:00
isdn Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-03-14 21:31:23 -07:00
leds sched/headers: Prepare for new header dependencies before moving code to <linux/sched/loadavg.h> 2017-03-02 08:42:27 +01:00
lguest sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
lightnvm
macintosh powerpc/pmac: Fix crash in dma-mapping.h with NULL dma_ops 2017-03-10 14:17:23 +11:00
mailbox sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
mcb
md md-cluster: use sync way to handle METADATA_UPDATED msg 2017-03-16 16:55:49 -07:00
media Merge branch 'akpm' (patches from Andrew) 2017-03-10 08:34:42 -08:00
memory ARM: SoC driver updates 2017-02-23 15:57:04 -08:00
memstick Merge branch 'for-4.11/next' into for-4.11/linus-merge 2017-02-17 14:08:19 -07:00
message SCSI misc on 20170220 2017-02-21 11:51:42 -08:00
mfd staging/iio driver patches for 4.11-rc1 2017-02-22 12:14:01 -08:00
misc mm: convert generic code to 5-level paging 2017-03-09 11:48:47 -08:00
mmc sched/headers: Prepare for new header dependencies before moving code to <uapi/linux/sched/types.h> 2017-03-02 08:42:27 +01:00
mtd scripts/spelling.txt: add "disble(d)" pattern and fix typo instances 2017-03-09 17:01:09 -08:00
net Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-03-14 21:31:23 -07:00
nfc scripts/spelling.txt: add "omited" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
ntb ntb: ntb_hw_intel: link_poll isn't clearing the pending status properly 2017-02-16 23:11:26 -05:00
nubus
nvdimm nfit, libnvdimm: fix interleave set cookie calculation 2017-03-01 00:49:42 -08:00
nvme Merge branch 'for-linus' of git://git.kernel.dk/linux-block 2017-03-03 10:53:35 -08:00
nvmem
of DeviceTree updates for 4.11: 2017-02-22 19:23:14 -08:00
oprofile sched/headers: Prepare to move the get_task_struct()/put_task_struct() and related APIs from <linux/sched.h> to <linux/sched/task.h> 2017-03-02 08:42:40 +01:00
parisc Merge branch 'parisc-4.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux 2017-03-03 16:20:06 -08:00
parport sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
pci PCI/ASPM: Always set link->downstream to avoid NULL dereference on remove 2017-03-07 14:23:30 -06:00
pcmcia
perf sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h> 2017-03-02 08:42:27 +01:00
phy pci-v4.11-changes 2017-02-23 11:53:22 -08:00
pinctrl pinctrl: uniphier: change pin names of aio/xirq for LD11 2017-03-06 14:38:05 +01:00
platform platform-drivers-x86 for v4.11-2 2017-03-13 13:23:43 -07:00
pnp
power scripts/spelling.txt: add "intialization" pattern and fix typo instances 2017-02-27 18:43:47 -08:00
powercap
pps
ps3 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> 2017-03-02 08:42:29 +01:00
ptp 4.11 is going to be a relatively large release for KVM, with a little over 2017-02-22 18:22:53 -08:00
pwm pwm: Changes for v4.11-rc1 2017-03-01 09:46:02 -08:00
rapidio rapidio: use get_user_pages_unlocked() 2017-02-27 18:43:45 -08:00
ras
regulator regulator: Updates for v4.11 2017-02-20 17:23:57 -08:00
remoteproc virtio, vhost: optimizations, fixes 2017-03-02 13:53:13 -08:00
reset ARM: SoC driver updates 2017-02-23 15:57:04 -08:00
rpmsg virtio, vhost: optimizations, fixes 2017-03-02 13:53:13 -08:00
rtc sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
s390 Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-03-03 10:16:38 -08:00
sbus
scsi SCSI fixes on 20170315 2017-03-15 10:44:19 -07:00
sfi
sh
sn
soc sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
spi sched/headers: Prepare for new header dependencies before moving code to <uapi/linux/sched/types.h> 2017-03-02 08:42:27 +01:00
spmi
ssb
staging Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2017-03-14 21:31:23 -07:00
target Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-03-03 10:16:38 -08:00
tc
thermal sched/headers: Prepare for new header dependencies before moving code to <uapi/linux/sched/types.h> 2017-03-02 08:42:27 +01:00
thunderbolt
tty serial: samsung: Continue to work if DMA request fails 2017-03-07 19:58:37 +01:00
uio sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> 2017-03-02 08:42:32 +01:00
usb USB fixes for 4.11-rc2 2017-03-11 00:08:39 -08:00
uwb
vfio sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> 2017-03-02 08:42:29 +01:00
vhost Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-03-03 10:16:38 -08:00
video sched/headers: Remove the <linux/mm_types.h> dependency from <linux/sched.h> 2017-03-03 01:45:16 +01:00
virt
virtio Merge branch 'WIP.sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-03-03 10:16:38 -08:00
vlynq
vme
w1 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h> 2017-03-02 08:42:29 +01:00
watchdog watchdog: retu: restore MFD dependency 2017-03-01 06:15:10 -08:00
xen features and fixes for 4.11 rc1 2017-03-09 12:23:30 -08:00
zorro
Kconfig
Makefile pci-v4.11-changes 2017-02-23 11:53:22 -08:00