linux/fs
Dave Chinner 80168676eb xfs: force background CIL push under sustained load
I have been seeing occasional pauses in transaction throughput up to
30s long under heavy parallel workloads. The only notable thing was
that the xfsaild was trying to be active during the pauses, but
making no progress. It was running exactly 20 times a second (on the
50ms no-progress backoff), and the number of pushbuf events was
constant across this time as well.  IOWs, the xfsaild appeared to be
stuck on buffers that it could not push out.

Further investigation indicated that it was trying to push out inode
buffers that were pinned and/or locked. The xfsbufd was also getting
woken at the same frequency (by the xfsaild, no doubt) to push out
delayed write buffers. The xfsbufd was not making any progress
because all the buffers in the delwri queue were pinned. This scan-
and-make-no-progress dance went one in the trace for some seconds,
before the xfssyncd came along an issued a log force, and then
things started going again.

However, I noticed something strange about the log force - there
were way too many IO's issued. 516 log buffers were written, to be
exact. That added up to 129MB of log IO, which got me very
interested because it's almost exactly 25% of the size of the log.
He delayed logging code is suppose to aggregate the minimum of 25%
of the log or 8MB worth of changes before flushing. That's what
really puzzled me - why did a log force write 129MB instead of only
8MB?

Essentially what has happened is that no CIL pushes had occurred
since the previous tail push which cleared out 25% of the log space.
That caused all the new transactions to block because there wasn't
log space for them, but they kick the xfsaild to push the tail.
However, the xfsaild was not making progress because there were
buffers it could not lock and flush, and the xfsbufd could not flush
them because they were pinned. As a result, both the xfsaild and the
xfsbufd could not move the tail of the log forward without the CIL
first committing.

The cause of the problem was that the background CIL push, which
should happen when 8MB of aggregated changes have been committed, is
being held off by the concurrent transaction commit load. The
background push does a down_write_trylock() which will fail if there
is a concurrent transaction commit holding the push lock in read
mode. With 8 CPUs all doing transactions as fast as they can, there
was enough concurrent transaction commits to hold off the background
push until tail-pushing could no longer free log space, and the halt
would occur.

It should be noted that there is no reason why it would halt at 25%
of log space used by a single CIL checkpoint. This bug could
definitely violate the "no transaction should be larger than half
the log" requirement and hence result in corruption if the system
crashed under heavy load. This sort of bug is exactly the reason why
delayed logging was tagged as experimental....

The fix is to start blocking background pushes once the threshold
has been exceeded. Rework the threshold calculations to keep the
amount of log space a CIL checkpoint can use to below that of the
AIL push threshold to avoid the problem completely.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2010-09-29 07:51:03 -05:00
..
9p fs/9p: Don't use dotl version of mknod for dotu inode operations 2010-09-13 08:13:03 -05:00
adfs check ATTR_SIZE contraints in inode_change_ok 2010-08-09 16:47:39 -04:00
affs AFFS: wait for sb synchronization when needed 2010-08-09 16:48:51 -04:00
afs Merge git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6 2010-08-13 10:37:30 -07:00
autofs autofs/autofs4: Move compat_ioctl handling into fs 2010-08-09 00:13:34 +02:00
autofs4 autofs4: remove unneeded null check in try_to_fill_dentry() 2010-08-11 08:59:06 -07:00
befs fix typos concerning "initiali[zs]e" 2010-06-16 18:05:05 +02:00
bfs BFS: clean up the superblock usage 2010-08-09 16:48:53 -04:00
btrfs Merge branch 'for-2.6.36' of git://git.kernel.dk/linux-2.6-block 2010-08-10 15:22:42 -07:00
cachefiles Add a dummy printk function for the maintenance of unused printks 2010-08-12 09:51:35 -07:00
ceph ceph: select CRYPTO 2010-09-17 12:30:31 -07:00
cifs cifs: fix potential double put of TCP session reference 2010-09-14 23:21:03 +00:00
coda Coda: mount hangs because of missed REQ_WRITE rename 2010-09-19 11:03:09 -07:00
configfs
cramfs cramfs: only unlock new inodes 2010-08-18 01:01:33 -04:00
debugfs
devpts
dlm fs/dlm: Drop unnecessary null test 2010-08-05 14:23:45 -05:00
ecryptfs eCryptfs: Fix encrypted file name lookup regression 2010-08-27 10:50:53 -05:00
efs
exofs Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd 2010-08-11 09:19:43 -07:00
exportfs
ext2 mbcache: Remove unused features 2010-08-09 16:48:45 -04:00
ext3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
ext4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
fat remove SWRITE* I/O types 2010-08-18 01:09:01 -04:00
freevxfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
fscache Add a dummy printk function for the maintenance of unused printks 2010-08-12 09:51:35 -07:00
fuse fuse: fix lock annotations 2010-09-07 13:42:41 +02:00
gfs2 GFS2: gfs2_logd should be using interruptible waits 2010-09-17 14:00:10 +01:00
hfs convert remaining ->clear_inode() to ->evict_inode() 2010-08-09 16:48:37 -04:00
hfsplus convert remaining ->clear_inode() to ->evict_inode() 2010-08-09 16:48:37 -04:00
hostfs hostfs ->follow_link() braino 2010-08-18 06:21:10 -04:00
hpfs switch hpfs to ->evict_inode() 2010-08-09 16:48:17 -04:00
hppfs switch hppfs to ->evict_inode() 2010-08-09 16:48:16 -04:00
hugetlbfs new helper: end_writeback() 2010-08-09 16:47:49 -04:00
isofs isofs: Fix lseek() to position beyond 4 GB 2010-08-11 00:29:47 -04:00
jbd remove SWRITE* I/O types 2010-08-18 01:09:01 -04:00
jbd2 remove SWRITE* I/O types 2010-08-18 01:09:01 -04:00
jffs2 Merge git://git.infradead.org/mtd-2.6 2010-08-10 11:49:21 -07:00
jfs jfs: don't allow os2 xattr namespace overlap with others 2010-08-10 15:33:09 -07:00
lockd
logfs logfs: kill BKL 2010-08-14 00:24:24 +02:00
minix minix: fix regression in minix_mkdir() 2010-09-09 18:57:25 -07:00
ncpfs Merge branch 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing 2010-08-10 13:58:28 -07:00
nfs SUNRPC: Fix the NFSv4 and RPCSEC_GSS Kconfig dependencies 2010-09-12 19:57:50 -04:00
nfs_common
nfsd SUNRPC: Fix the NFSv4 and RPCSEC_GSS Kconfig dependencies 2010-09-12 19:57:50 -04:00
nilfs2 nilfs2: fix leak of shadow dat inode in error path of load_nilfs 2010-08-30 10:18:03 +09:00
nls
notify fsnotify: drop two useless bools in the fnsotify main loop 2010-08-27 21:42:11 -04:00
ntfs convert remaining ->clear_inode() to ->evict_inode() 2010-08-09 16:48:37 -04:00
ocfs2 o2dlm: force free mles during dlm exit 2010-09-23 14:16:53 -07:00
omfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/bcopeland/omfs 2010-08-10 11:47:36 -07:00
openpromfs
partitions [S390] partitions: fix build error in ibm partition detection code 2010-08-13 10:06:55 +02:00
proc /proc/pid/smaps: fix dirty pages accounting 2010-09-22 17:22:39 -07:00
qnx4 get rid of cont_write_begin_newtrunc 2010-08-09 16:47:31 -04:00
quota Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
ramfs check ATTR_SIZE contraints in inode_change_ok 2010-08-09 16:47:39 -04:00
reiserfs remove SWRITE* I/O types 2010-08-18 01:09:01 -04:00
romfs
smbfs switch smbfs to evict_inode() 2010-08-09 16:48:00 -04:00
squashfs Squashfs: fix checkpatch.pl warnings 2010-08-08 22:29:33 +00:00
sysfs sysfs: checking for NULL instead of ERR_PTR 2010-09-03 17:26:28 -07:00
sysv fs/sysv/super.c: add support for non-PDP11 v7 filesystems 2010-08-11 08:59:23 -07:00
ubifs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
udf Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2010-08-10 11:26:52 -07:00
ufs remove SWRITE* I/O types 2010-08-18 01:09:01 -04:00
xfs xfs: force background CIL push under sustained load 2010-09-29 07:51:03 -05:00
aio.c aio: do not return ERESTARTSYS as a result of AIO 2010-09-22 17:22:39 -07:00
anon_inodes.c
attr.c check ATTR_SIZE contraints in inode_change_ok 2010-08-09 16:47:39 -04:00
bad_inode.c bkl: Remove locked .ioctl file operation 2010-08-14 00:24:24 +02:00
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c
binfmt_em86.c
binfmt_flat.c flat: tweak default stack alignment 2010-06-29 15:29:31 -07:00
binfmt_misc.c binfmt_misc: fix binfmt_misc priority 2010-09-09 18:57:24 -07:00
binfmt_script.c Make do_execve() take a const filename pointer 2010-08-17 18:07:43 -07:00
binfmt_som.c
bio-integrity.c fs/bio-integrity.c: return -ENOMEM on kmalloc failure 2010-08-23 13:36:59 +02:00
bio.c block: unify flags for struct bio and struct request 2010-08-07 18:20:39 +02:00
block_dev.c blkdev: cgroup whitelist permission fix 2010-08-11 08:59:18 -07:00
buffer.c remove SWRITE* I/O types 2010-08-18 01:09:01 -04:00
char_dev.c char: Mark /dev/zero and /dev/kmem as not capable of writeback 2010-09-22 09:48:47 +02:00
compat_binfmt_elf.c
compat_ioctl.c bkl: Remove locked .ioctl file operation 2010-08-14 00:24:24 +02:00
compat.c Prevent freeing uninitialized pointer in compat_do_readv_writev 2010-09-22 17:22:38 -07:00
dcache.c fs: brlock vfsmount_lock 2010-08-18 08:35:48 -04:00
dcookies.c
direct-io.c O_DIRECT: fix the splitting up of contiguous I/O 2010-09-09 18:57:22 -07:00
drop_caches.c simplify checks for I_CLEAR/I_FREEING 2010-08-09 16:47:44 -04:00
eventfd.c
eventpoll.c
exec.c execve: make responsive to SIGKILL with large arguments 2010-09-10 08:10:26 -07:00
fcntl.c vfs: take O_NONBLOCK out of the O_* uniqueness test 2010-09-09 18:57:25 -07:00
fifo.c
file_table.c fs: scale files_lock 2010-08-18 08:35:48 -04:00
file.c vfs: use kmalloc() to allocate fdmem if possible 2010-08-11 08:59:02 -07:00
filesystems.c
fs_struct.c fs: fs_struct rwlock to spinlock 2010-08-18 08:35:46 -04:00
fs-writeback.c bdi: Fix warnings in __mark_inode_dirty for /dev/zero and friends 2010-09-22 09:48:47 +02:00
generic_acl.c vfs: update ctime when changing the file's permission by setfacl 2010-08-18 01:04:22 -04:00
inode.c Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify 2010-08-10 11:39:13 -07:00
internal.h fs: brlock vfsmount_lock 2010-08-18 08:35:48 -04:00
ioctl.c bkl: Remove locked .ioctl file operation 2010-08-14 00:24:24 +02:00
ioprio.c
Kconfig fs/Kconfig: Fix typo Userpace -> Userspace 2010-07-20 17:30:22 +02:00
Kconfig.binfmt
libfs.c check ATTR_SIZE contraints in inode_change_ok 2010-08-09 16:47:39 -04:00
locks.c
Makefile
mbcache.c mbcache: Limit the maximum number of cache entries 2010-08-18 06:24:41 -04:00
mpage.c
namei.c fs: brlock vfsmount_lock 2010-08-18 08:35:48 -04:00
namespace.c VFS: Sanity check mount flags passed to change_mnt_propagation() 2010-09-07 13:46:20 -07:00
nfsctl.c
no-block.c
open.c fs: cleanup files_lock locking 2010-08-18 08:35:47 -04:00
pipe.c
pnode.c fs: brlock vfsmount_lock 2010-08-18 08:35:48 -04:00
pnode.h
posix_acl.c
read_write.c fsnotify: pass a file instead of an inode to open, read, and write 2010-07-28 09:58:32 -04:00
read_write.h
readdir.c vfs: fix warning: 'dirent' is used uninitialized in this function 2010-08-09 20:45:05 -07:00
select.c
seq_file.c
signalfd.c signalfd: fill in ssi_int for posix timers and message queues 2010-08-11 08:59:20 -07:00
splice.c splice: fix misuse of SPLICE_F_NONBLOCK 2010-08-07 18:52:56 +02:00
stack.c
stat.c Mark arguments to certain syscalls as being const 2010-08-13 16:53:13 -07:00
statfs.c add f_flags to struct statfs(64) 2010-08-09 16:48:44 -04:00
super.c fs: scale files_lock 2010-08-18 08:35:48 -04:00
sync.c get rid of file_fsync() 2010-08-09 16:47:43 -04:00
timerfd.c
utimes.c Mark arguments to certain syscalls as being const 2010-08-13 16:53:13 -07:00
xattr_acl.c
xattr.c