linux/fs
Andrea Arcangeli 7766755a2f Fix /proc dcache deadlock in do_exit
This patch fixes a sles9 system hang in start_this_handle from a customer
with some heavy workload where all tasks are waiting on kjournald to commit
the transaction, but kjournald waits on t_updates to go down to zero (it
never does).

This was reported as a lowmem shortage deadlock but when checking the debug
data I noticed the VM wasn't under pressure at all (well it was really
under vm pressure, because lots of tasks hanged in the VM prune_dcache
methods trying to flush dirty inodes, but no task was hanging in GFP_NOFS
mode, the holder of the journal handle should have if this was a vm issue
in the first place).

No task was apparently holding the leftover handle in the committing
transaction, so I deduced t_updates was stuck to 1 because a journal_stop
was never run by some path (this turned out to be correct).  With a debug
patch adding proper reverse links and stack trace logging in ext3 deployed
in production, I found journal_stop is never run because
mark_inode_dirty_sync is called inside release_task called by do_exit.
(that was quite fun because I would have never thought about this
subtleness, I thought a regular path in ext3 had a bug and it forgot to
call journal_stop)

do_exit->release_task->mark_inode_dirty_sync->schedule() (will never
come back to run journal_stop)

The reason is that shrink_dcache_parent is racy by design (feature not
a bug) and it can do blocking I/O in some case, but the point is that
calling shrink_dcache_parent at the last stage of do_exit isn't safe
for self-reaping tasks.

I guess the memory pressure of the unbalanced highmem system allowed
to trigger this more easily.

Now mainline doesn't have this line in iput (like sles9 has):

    	     if (inode->i_state & I_DIRTY_DELAYED)
	     			mark_inode_dirty_sync(inode);

so it will probably not crash with ext3, but for example ext2 implements an
I/O-blocking ext2_put_inode that will lead to similar screwups with
ext2_free_blocks never coming back and it's definitely wrong to call
blocking-IO paths inside do_exit.  So this should fix a subtle bug in
mainline too (not verified in practice though).  The equivalent fix for
ext3 is also not verified yet to fix the problem in sles9 but I don't have
doubt it will (it usually takes days to crash, so it'll take weeks to be
sure).

An alternate fix would be to offload that work to a kernel thread, but I
don't think a reschedule for this is worth it, the vm should be able to
collect those entries for the synchronous release_task.

Signed-off-by: Andrea Arcangeli <andrea@suse.de>
Cc: Jan Kara <jack@ucw.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:18 -08:00
..
9p
adfs
affs
afs vfs: Add 64 bit i_version support 2008-01-28 23:58:27 -05:00
autofs
autofs4
befs fs/: Spelling fixes 2008-02-03 17:33:42 +02:00
bfs regression: bfs endianness bug 2007-12-05 09:25:20 -08:00
cifs Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
coda coda: convert struct class_device to struct device 2008-01-24 20:40:05 -08:00
configfs configfs: file.c fix possible recursive locking 2008-01-25 15:05:47 -08:00
cramfs
debugfs Kobject: convert fs/* from kobject_unregister() to kobject_put() 2008-01-24 20:40:40 -08:00
devpts
dlm dlm: static initialization improvements 2008-01-30 11:04:43 -06:00
ecryptfs Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
efs
exportfs
ext2 ext2: Fix the max file size for ext2 file system. 2008-01-28 23:58:26 -05:00
ext3 Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
ext4 Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
fat fat: optimize fat_count_free_clusters() 2008-01-08 16:10:35 -08:00
freevxfs fs/: Spelling fixes 2008-02-03 17:33:42 +02:00
fuse Kobject: convert fs/* from kobject_unregister() to kobject_put() 2008-01-24 20:40:40 -08:00
gfs2 Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
hfs hfs: fix coverity-found null deref 2008-01-17 15:38:58 -08:00
hfsplus
hostfs
hpfs
hppfs
hugetlbfs hugetlb: allow sticky directory mount option 2008-02-05 09:44:14 -08:00
isofs
jbd spinlock: lockbreak cleanup 2008-01-30 13:31:20 +01:00
jbd2 spinlock: lockbreak cleanup 2008-01-30 13:31:20 +01:00
jffs2 Typoes: "whith" -> "with" 2008-02-03 15:14:02 +02:00
jfs Spelling fixes: lenght->length 2008-02-03 15:42:53 +02:00
lockd NLM: tear down RPC clients in nlm_shutdown_hosts 2008-02-01 16:42:15 -05:00
minix
msdos
ncpfs vm audit: add VM_DONTEXPAND to mmap for drivers that need it 2008-02-04 07:55:38 -08:00
nfs Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
nfs_common
nfsd nfsd: more careful input validation in nfsctl write methods 2008-02-01 16:42:15 -05:00
nls
ntfs is_vmalloc_addr(): Check if an address is within the vmalloc boundaries 2008-02-05 09:44:14 -08:00
ocfs2 Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
openpromfs [SPARC]: Constify function pointer tables. 2008-01-22 18:29:20 -08:00
partitions Kobject: convert fs/* from kobject_unregister() to kobject_put() 2008-01-24 20:40:40 -08:00
proc Fix /proc dcache deadlock in do_exit 2008-02-05 09:44:18 -08:00
qnx4
ramfs
reiserfs Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
romfs
smbfs Merge branch 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc 2008-02-01 11:45:47 +11:00
sysfs Merge git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6 2008-01-25 17:19:08 -08:00
sysv
udf
ufs ufs: fix nexstep dir block size 2007-12-05 09:21:18 -08:00
vfat
xfs is_vmalloc_addr(): Check if an address is within the vmalloc boundaries 2008-02-05 09:44:14 -08:00
aio.c core: remove last users of empty FASTCALL macro 2008-01-30 13:31:17 +01:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c mm: fix exit_mmap BUG() on a.out binary exit 2007-12-20 07:49:53 -08:00
binfmt_elf_fdpic.c
binfmt_elf.c fs/binfmt_elf.c: spello fix 2008-02-03 18:05:15 +02:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
binfmt_som.c
bio.c __bio_clone: don't calculate hw/phys segment counts 2008-01-28 10:04:46 +01:00
block_dev.c Driver core: convert block from raw kobjects to core devices 2008-01-24 20:40:36 -08:00
buffer.c bufferhead: revert constructor removal 2008-02-05 09:44:14 -08:00
char_dev.c Kobject: rename kobject_init_ng() to kobject_init() 2008-01-24 20:40:38 -08:00
compat_binfmt_elf.c x86: compat_binfmt_elf 2008-01-30 13:31:46 +01:00
compat_ioctl.c remove __attribute_used__ 2008-01-28 23:21:18 +01:00
compat.c timerfd: new timerfd API 2008-02-05 09:44:07 -08:00
dcache.c
dcookies.c
direct-io.c Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
dnotify.c
dquot.c Don't send quota messages repeatedly when hardlimit reached 2007-12-23 12:54:36 -08:00
drop_caches.c
eventfd.c
eventpoll.c lockdep: annotate epoll 2008-02-05 09:44:07 -08:00
exec.c exec: rework the group exit and fix the race with kill 2008-02-05 09:44:07 -08:00
fcntl.c
fifo.c
file_table.c
file.c
filesystems.c
fs-writeback.c Revert "writeback: introduce writeback_control.more_io to indicate more io" 2008-01-14 21:21:29 -08:00
generic_acl.c
inode.c ext4: Add inode version support in ext4 2008-01-28 23:58:27 -05:00
inotify_user.c
inotify.c
internal.h
ioctl.c
ioprio.c cfq-iosched: relax IOPRIO_CLASS_IDLE restrictions 2008-01-28 11:38:15 +01:00
Kconfig nfsd: select CONFIG_PROC_FS in nfsv4 and gss server cases 2008-02-01 16:42:04 -05:00
Kconfig.binfmt x86: compat_binfmt_elf Kconfig 2008-01-30 13:31:46 +01:00
libfs.c Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
locks.c pid-namespaces-vs-locks-interaction 2008-02-03 17:51:36 -05:00
Makefile x86: compat_binfmt_elf Kconfig 2008-01-30 13:31:46 +01:00
mbcache.c
mpage.c Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
namei.c Use access mode instead of open flags to determine needed permissions 2008-01-12 14:47:58 -08:00
namespace.c kobject: convert main fs kobject to use kobject_create 2008-01-24 20:40:13 -08:00
nfsctl.c
no-block.c
open.c mark sys_open/sys_read exports unused 2007-11-14 18:45:42 -08:00
pipe.c
pnode.c
pnode.h
posix_acl.c
quota_v1.c
quota_v2.c
quota.c
read_write.c ext4: export iov_shorten from kernel for ext4's use 2008-01-28 23:58:27 -05:00
read_write.h
readdir.c Use mutex_lock_killable in vfs_readdir 2007-12-06 17:39:54 -05:00
select.c
seq_file.c
signalfd.c Fix a small number of "memeber" typoes. 2008-02-03 15:12:15 +02:00
splice.c splice: always updated atime in direct splice 2008-02-01 09:26:32 +01:00
stack.c
stat.c
super.c
sync.c
timerfd.c timerfd: new timerfd API 2008-02-05 09:44:07 -08:00
utimes.c
xattr_acl.c
xattr.c