linux/drivers
Shaohua Li 851c30c9ba raid5: offload stripe handle to workqueue
This is another attempt to create multiple threads to handle raid5 stripes.
This time I use a workqueue.

raid5 handles requests (especially writes) in units of stripes. A stripe is
page-size aligned, one page long, and spans all disks. For a write to any disk
sector, raid5 runs a state machine for the corresponding stripe, which includes
reading some disks of the stripe, calculating parity, and writing some disks of
the stripe. The state machine currently runs in the raid5d thread. Since there
is only one thread, it doesn't scale well for high-speed storage. An obvious
solution is multi-threading.
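
As a rough illustration of the parity step in that state machine, here is a
minimal userspace sketch, not the kernel code. It assumes plain XOR parity over
the data chunks of one stripe; the function names are hypothetical and do not
appear in raid5.c.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the parity chunk of one stripe as the
 * byte-wise XOR of its data chunks (full-stripe write case).
 */
void stripe_compute_parity(const uint8_t *const *data, size_t ndata,
                           size_t len, uint8_t *parity)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t p = 0;

        for (size_t d = 0; d < ndata; d++)
            p ^= data[d][i];
        parity[i] = p;
    }
}

/*
 * Hypothetical helper: read-modify-write update for a single chunk.
 * XORing out the old data and XORing in the new data gives the same
 * parity as recomputing it over the whole stripe.
 */
void stripe_update_parity(uint8_t *parity, const uint8_t *old_data,
                          const uint8_t *new_data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```

The second helper shows why a small write still needs reads from some disks:
the old data and old parity must be known before the new parity can be written.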

To get better performance, we have some requirements:
a. locality. A stripe corresponding to a request submitted from one cpu is best
handled by a thread on the local cpu or local node. The local cpu is preferred
but can sometimes become a bottleneck, for example when parity calculation is
too heavy. Running on the local node is more widely applicable.
b. configurability. Different raid5 array setups might need different
configurations, especially the thread number. More threads don't always mean
better performance because of lock contention.

My original implementation created some kernel threads. There were interfaces
to control which cpu's stripes each thread should handle, and userspace could
set the affinity of the threads. This provides the greatest flexibility and
configurability, but it's hard to use, and apparently a new thread pool
implementation is disfavored.

Recent workqueue improvements are quite promising. An unbound workqueue will be
bound to a numa node. If WQ_SYSFS is set on the workqueue, there are sysfs
options to set affinity. For example, we can include only one HT sibling in the
affinity. Since work is non-reentrant by default, we can control the number of
running threads by limiting the number of dispatched work_structs.

In this patch, I created several stripe worker groups. A group corresponds to a
numa node. Stripes from cpus of one node will be added to a group list. The
workqueue thread of one node will only handle stripes of that node's worker
group. In this way, stripe handling has numa node locality. And as I said, we
can control the thread number by limiting the number of dispatched work_structs.
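
The grouping and the thread cap can be modelled in userspace as follows. This
is only a sketch under stated assumptions: cpus are numbered contiguously per
node, every node has the same number of cpus, and all struct, field and
function names are hypothetical rather than taken from the patch.

```c
#include <stddef.h>

#define MAX_GROUPS 8

/* One worker group per numa node (hypothetical layout). */
struct worker_group {
    int queued_stripes;  /* stripes waiting on this group's list */
    int active_workers;  /* work items currently dispatched */
};

struct stripe_groups {
    struct worker_group groups[MAX_GROUPS];
    int ngroups;         /* number of numa nodes */
    int cpus_per_node;   /* assumption: uniform, contiguous cpu numbering */
    int max_workers;     /* per-group cap on dispatched work items */
};

/* Map the submitting cpu to the worker group of its node. */
int cpu_to_group(const struct stripe_groups *sg, int cpu)
{
    return (cpu / sg->cpus_per_node) % sg->ngroups;
}

/*
 * Add a stripe to the group of the submitting cpu. Returns 1 if a new
 * work item should be dispatched for that group, or 0 if the cap has
 * been reached and an already running worker will pick the stripe up.
 */
int queue_stripe(struct stripe_groups *sg, int cpu)
{
    struct worker_group *g = &sg->groups[cpu_to_group(sg, cpu)];

    g->queued_stripes++;
    if (g->active_workers < sg->max_workers) {
        g->active_workers++;
        return 1;
    }
    return 0;
}
```

With max_workers set to 1, a second stripe queued from the same node does not
start a second worker, which is how limiting dispatched work items bounds the
number of running threads in this model.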

The work_struct callback function handles several stripes in one run. A typical
workqueue usage is to run one unit in each work_struct. In the raid5 case, the
unit is a stripe. But we can't do that:
a. Though handling a stripe doesn't need a lock, because of reference
accounting and because the stripe isn't on any list, queuing a work_struct for
each stripe will make the workqueue lock very heavily contended.
b. blk_start_plug()/blk_finish_plug() should surround stripe handling, as we
might dispatch requests. If each work_struct only handles one stripe, such a
block plug is meaningless.
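
The batching argument can be sketched in userspace like this. It is a model,
not the patch itself: plug_start() and plug_finish() are stand-ins for
blk_start_plug()/blk_finish_plug(), the batch size of 8 is an assumption, and
the struct and function names are hypothetical.

```c
#include <stddef.h>

#define STRIPE_BATCH 8  /* assumption: illustrative batch size */

struct stripe_queue {
    int pending;  /* stripes waiting to be handled */
    int handled;  /* stripes handled so far */
    int plugs;    /* number of plug/unplug pairs issued */
};

/* Stand-in for blk_start_plug(): count one plug section. */
void plug_start(struct stripe_queue *q)
{
    q->plugs++;
}

/* Stand-in for blk_finish_plug(): nothing to model here. */
void plug_finish(struct stripe_queue *q)
{
    (void)q;
}

/*
 * One run of the work callback: handle up to STRIPE_BATCH stripes
 * inside a single plug section. Returns the number handled.
 */
int handle_stripe_batch(struct stripe_queue *q)
{
    int n = 0;

    plug_start(q);
    while (q->pending > 0 && n < STRIPE_BATCH) {
        q->pending--;
        q->handled++;
        n++;
    }
    plug_finish(q);
    return n;
}
```

In this model 20 pending stripes need three plug sections instead of twenty,
so requests dispatched within one run have a chance to be merged.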

This implementation can't do very fine-grained configuration. But numa binding
is the most popular usage model and should be enough for most workloads.

Note: since we have only one stripe queue, switching to multi-threading might
decrease the size of requests dispatched down to the lower layer. The impact
depends on thread number, raid configuration and workload. So multi-threaded
raid5 might not be suitable for all setups.

Changes V1 -> V2:
1. remove WQ_NON_REENTRANT
2. disable multi-threading by default
3. add more descriptions in changelog

Signed-off-by: Shaohua Li <shli@fusionio.com>
Signed-off-by: NeilBrown <neilb@suse.de>
2013-08-28 16:46:38 +10:00
..
accessibility
acpi Revert "ACPI / video: Always call acpi_video_init_brightness() on init" 2013-08-22 23:39:02 +02:00
amba
ata sata_fsl: save irqs while coalescing 2013-08-20 08:38:23 -04:00
atm
auxdisplay
base
bcma
block aoe: adjust ref of head for compound page tails 2013-08-13 17:57:48 -07:00
bluetooth
bus
cdrom
char More virtio console fixes than I'm happy with, but all real issues, 2013-08-08 09:32:20 -07:00
clk clk: exynos4: Add CLK_GET_RATE_NOCACHE flag for the Exynos4x12 ISP clocks 2013-08-13 10:01:56 -07:00
clocksource
connector
cpufreq cpufreq: rename ignore_nice as ignore_nice_load 2013-08-07 22:25:06 +02:00
cpuidle
crypto
dca
devfreq
dio
dma ARM: SoC fixes for v3.11-rc 2013-08-08 09:28:08 -07:00
edac
eisa
extcon
firewire
firmware
fmc
gpio
gpu Merge tag 'drm-intel-fixes-2013-08-23' of git://people.freedesktop.org/~danvet/drm-intel into drm-fixes 2013-08-23 18:52:37 +10:00
hid Revert "HID: hid-logitech-dj: querying_devices was never set" 2013-08-09 11:34:19 +02:00
hsi
hv
hwmon hwmon: (adt7470) Fix incorrect return code check 2013-08-08 12:43:07 -07:00
hwspinlock
i2c
ide
idle
iio iio: adjd_s311: Fix non-scan mode data read 2013-08-19 19:30:21 +01:00
infiniband
input
iommu
ipack
irqchip
isdn
leds
lguest
macintosh
mailbox
md raid5: offload stripe handle to workqueue 2013-08-28 16:46:38 +10:00
media Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media 2013-08-09 15:04:09 -07:00
memory
memstick
message
mfd
misc
mmc
mtd
net be2net: fix disabling TX in be_close() 2013-08-22 19:58:23 -07:00
nfc
ntb
nubus
of of: fdt: fix memory initialization for expanded DT 2013-08-21 20:05:49 -05:00
oprofile
parisc
parport
pci ACPI: Try harder to resolve _ADR collisions for bridges 2013-08-07 22:55:00 +02:00
pcmcia
pinctrl pinctrl: sunxi: Add spinlocks 2013-08-07 21:57:17 +02:00
platform Merge branch 'akpm' (patches from Andrew Morton) 2013-08-23 09:52:32 -07:00
pnp
power
pps
ps3
ptp
pwm
rapidio
regulator
remoteproc
reset
rpmsg
rtc drivers/rtc/rtc-stmp3xxx.c: provide timeout for potentially endless loop polling a HW bit 2013-08-13 17:57:48 -07:00
s390 [SCSI] zfcp: remove access control tables interface (keep sysfs files) 2013-08-22 09:26:51 -07:00
sbus
scsi [SCSI] lpfc: Don't force CONFIG_GENERIC_CSUM on 2013-08-21 10:54:20 -07:00
sfi
sh
sn
spi
ssb
staging staging: comedi: bug-fix NULL pointer dereference on failed attach 2013-08-23 10:31:47 -07:00
target
tc
thermal
tty
uio
usb usb: phy: fix build breakage 2013-08-23 10:41:46 -07:00
uwb
vfio
vhost
video fbdev fixes: 2013-08-09 11:52:34 -07:00
virt
virtio
vlynq
vme
w1
watchdog
xen Bug-fixes: 2013-08-21 16:38:33 -07:00
zorro
Kconfig
Makefile