linux/fs
Anton Altaparmakov 88bd5121d6 [PATCH] Fix soft lockup due to NTFS: VFS part and explanation
Something has changed in the core kernel such that we now get concurrent
inode write outs, one e.g via pdflush and one via sys_sync or whatever.
This causes a nasty deadlock in ntfs.  The only clean solution
unfortunately requires a minor vfs api extension.

First the deadlock analysis:

Prerequisive knowledge: NTFS has a file $MFT (inode 0) loaded at mount
time.  The NTFS driver uses the page cache for storing the file contents as
usual.  More interestingly this file contains the table of on-disk inodes
as a sequence of MFT_RECORDs.  Thus NTFS driver accesses the on-disk inodes
by accessing the MFT_RECORDs in the page cache pages of the loaded inode
$MFT.

The situation: VFS inode X on a mounted ntfs volume is dirty.  For same
inode X, the ntfs_inode is dirty and thus corresponding on-disk inode,
which is as explained above in a dirty PAGE_CACHE_PAGE belonging to the
table of inodes ($MFT, inode 0).

What happens:

Process 1: sys_sync()/umount()/whatever...  calls __sync_single_inode() for
$MFT -> do_writepages() -> write_page for the dirty page containing the
on-disk inode X, the page is now locked -> ntfs_write_mst_block() which
clears PageUptodate() on the page to prevent anyone else getting hold of it
whilst it does the write out (this is necessary as the on-disk inode needs
"fixups" applied before the write to disk which are removed again after the
write and PageUptodate is then set again).  It then analyses the page
looking for dirty on-disk inodes and when it finds one it calls
ntfs_may_write_mft_record() to see if it is safe to write this on-disk
inode.  This then calls ilookup5() to check if the corresponding VFS inode
is in icache().  This in turn calls ifind() which waits on the inode lock
via wait_on_inode whilst holding the global inode_lock.

Process 2: pdflush results in a call to __sync_single_inode for the same
VFS inode X on the ntfs volume.  This locks the inode (I_LOCK) then calls
write-inode -> ntfs_write_inode -> map_mft_record() -> read_cache_page() of
the page (in page cache of table of inodes $MFT, inode 0) containing the
on-disk inode.  This page has PageUptodate() clear because of Process 1
(see above) so read_cache_page() blocks when tries to take the page lock
for the page so it can call ntfs_read_page().

Thus Process 1 is holding the page lock on the page containing the on-disk
inode X and it is waiting on the inode X to be unlocked in ifind() so it
can write the page out and then unlock the page.

And Process 2 is holding the inode lock on inode X and is waiting for the
page to be unlocked so it can call ntfs_readpage() or discover that
Process 1 set PageUptodate() again and use the page.

Thus we have a deadlock due to ifind() waiting on the inode lock.

The only sensible solution: NTFS does not care whether the VFS inode is
locked or not when it calls ilookup5() (it doesn't use the VFS inode at
all, it just uses it to find the corresponding ntfs_inode which is of
course attached to the VFS inode (both are one single struct); and it uses
the ntfs_inode which is subject to its own locking so I_LOCK is irrelevant)
hence we want a modified ilookup5_nowait() which is the same as ilookup5()
but it does not wait on the inode lock.

Without such functionality I would have to keep my own ntfs_inode cache in
the NTFS driver just so I can find ntfs_inodes independent of their VFS
inodes which would be slow, memory and cpu cycle wasting, and incredibly
stupid given the icache already exists in the VFS.

Below is a patch that does the ilookup5_nowait() implementation in
fs/inode.c and exports it.

ilookup5_nowait.diff:

Introduce ilookup5_nowait() which is basically the same as ilookup5() but
it does not wait on the inode's lock (i.e. it omits the wait_on_inode()
done in ifind()).

This is needed to avoid a nasty deadlock in NTFS.

Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-13 11:25:24 -07:00
..
adfs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
affs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
afs [PATCH] Cleanup patch for process freezing 2005-06-25 17:10:13 -07:00
autofs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
autofs4 [PATCH] autofs4: mistake in debug print 2005-07-07 18:23:46 -07:00
befs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
bfs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
cifs [CIFS] Fix cifs update of page cache. Write at correct offset when out of memory 2005-06-09 14:44:07 -07:00
coda [PATCH] class: convert the remaining class_simple users in the kernel to usee the new class api 2005-06-20 15:15:11 -07:00
cramfs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
debugfs [PATCH] remove duplicate get_dentry functions in various places 2005-06-23 09:45:20 -07:00
devfs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
devpts Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
efs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
exportfs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
ext2 [PATCH] ext2: fix mount options parting 2005-07-12 16:01:01 -07:00
ext3 [PATCH] ext3: fix options parsing 2005-07-12 16:01:01 -07:00
fat [PATCH] fatfs sectioning fix 2005-06-30 22:29:48 -07:00
freevxfs [PATCH] freevxfs: minor cleanups 2005-06-30 08:45:12 -07:00
hfs [PATCH] hfs, hfsplus: don't leak s_fs_info and fix an oops 2005-05-01 08:59:16 -07:00
hfsplus [PATCH] hfs, hfsplus: don't leak s_fs_info and fix an oops 2005-05-01 08:59:16 -07:00
hostfs [PATCH] uml: remove 2_5compat.h 2005-05-28 16:46:11 -07:00
hpfs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
hppfs [PATCH] uml: restore hppfs support 2005-07-07 18:23:44 -07:00
hugetlbfs [PATCH] Avoiding mmap fragmentation 2005-06-21 18:46:16 -07:00
isofs [PATCH] isofs: show hidden files, add granularity for assoc/hidden files flags 2005-06-21 19:07:38 -07:00
jbd [PATCH] Cleanup patch for process freezing 2005-06-25 17:10:13 -07:00
jffs [PATCH] fs/jffs/: cleanups 2005-06-25 16:25:04 -07:00
jffs2 [JFFS2] Simplify the tree insert code. 2005-07-06 18:30:00 +02:00
jfs JFS: Need to be root to create files with security context 2005-07-13 09:15:18 -05:00
lockd [PATCH] Cleanup patch for process freezing 2005-06-25 17:10:13 -07:00
minix Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
msdos Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
ncpfs [PATCH] fs/ncpfs/: remove unused #ifdef USE_OLD_SLOW_DIRECTORY_LISTING code 2005-06-25 16:25:04 -07:00
nfs [PATCH] really remove xattr_acl.h 2005-06-28 21:20:31 -07:00
nfs_common [PATCH] NFSD: Add server support for NFSv3 ACLs. 2005-06-22 16:07:23 -04:00
nfsd [PATCH] inotify 2005-07-12 20:38:38 -07:00
nls [PATCH] make some things static 2005-05-05 16:36:47 -07:00
ntfs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
openpromfs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
partitions [PATCH] small partitions/msdos cleanups 2005-06-25 16:24:59 -07:00
proc [PATCH] kdump: Parse elf32 headers and export through /proc/vmcore 2005-06-25 16:24:53 -07:00
qnx4 [PATCH] fs/qnx4/*: fix sparse warnings 2005-06-24 14:14:24 -07:00
ramfs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
reiserfs reiserfs: run scripts/Lindent on reiserfs code 2005-07-12 20:21:28 -07:00
romfs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
smbfs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
sysfs [PATCH] inotify 2005-07-12 20:38:38 -07:00
sysv Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
udf [PATCH] udf_find_entry() cleanup 2005-06-30 08:45:11 -07:00
ufs Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
umsdos Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
vfat Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
xfs [PATCH] Cleanup patch for process freezing 2005-06-25 17:10:13 -07:00
aio.c [PATCH] aio-retry-fix: fix aio retry work queueing 2005-06-28 21:20:32 -07:00
attr.c [PATCH] inotify 2005-07-12 20:38:38 -07:00
bad_inode.c [PATCH] make some things static 2005-05-05 16:36:47 -07:00
binfmt_aout.c [PATCH] Avoiding mmap fragmentation 2005-06-21 18:46:16 -07:00
binfmt_elf_fdpic.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
binfmt_elf.c [PATCH] Avoiding mmap fragmentation 2005-06-21 18:46:16 -07:00
binfmt_em86.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
binfmt_flat.c [PATCH] binfmt_flat mmap flag fix 2005-06-06 14:57:51 -07:00
binfmt_misc.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
binfmt_script.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
binfmt_som.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
bio.c [PATCH] mostly_read data section 2005-07-07 18:23:46 -07:00
block_dev.c [PATCH] block: add unlocked_ioctl support for block devices 2005-06-23 09:45:32 -07:00
buffer.c [PATCH] page_uptodate locking scalability 2005-07-07 18:23:45 -07:00
char_dev.c [PATCH] cdev: cdev_put oops 2005-07-12 16:01:02 -07:00
compat_ioctl.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
compat.c [PATCH] inotify 2005-07-12 20:38:38 -07:00
dcache.c [PATCH] make some things static 2005-05-05 16:36:47 -07:00
dcookies.c [PATCH] dcookies.c: use proper refcounting functions 2005-07-07 18:23:52 -07:00
direct-io.c [PATCH] pass iocb to dio_iodone_t 2005-06-24 00:05:19 -07:00
dnotify.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
dquot.c [PATCH] list_for_each_entry: fs-dquot.c 2005-06-25 16:25:11 -07:00
eventpoll.c [PATCH] Remove eventpoll macro obfuscation 2005-06-23 09:45:30 -07:00
exec.c [PATCH] reset real_timer target on exec leader change 2005-07-12 16:01:01 -07:00
fcntl.c [PATCH] convert that currently tests _NSIG directly to use valid_signal() 2005-05-01 08:59:14 -07:00
fifo.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
file_table.c [PATCH] inotify 2005-07-12 20:38:38 -07:00
file.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
filesystems.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
fs-writeback.c [PATCH] O(1) sb list traversing on syncs 2005-06-23 09:45:27 -07:00
inode.c [PATCH] Fix soft lockup due to NTFS: VFS part and explanation 2005-07-13 11:25:24 -07:00
inotify.c [PATCH] inotify: misc cleanup 2005-07-13 11:09:31 -07:00
ioctl.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
ioprio.c [PATCH] move ioprio syscalls into syscalls.h 2005-07-07 18:23:37 -07:00
Kconfig [PATCH] inotify 2005-07-12 20:38:38 -07:00
Kconfig.binfmt Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
libfs.c [PATCH] fix fsync(dir) return value for ram-based filesystems 2005-06-25 16:24:38 -07:00
locks.c [PATCH] coverity: fs/locks.c flp null check 2005-07-07 18:23:47 -07:00
Makefile [PATCH] inotify 2005-07-12 20:38:38 -07:00
mbcache.c [PATCH] make some things static 2005-05-05 16:36:47 -07:00
mpage.c [PATCH] mpage_end_io_write() I/O error handling fix 2005-06-04 17:12:59 -07:00
namei.c [PATCH] inotify 2005-07-12 20:38:38 -07:00
namespace.c [PATCH] namespace: rename mnt_fslink to mnt_expire 2005-07-07 18:23:52 -07:00
nfsctl.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
open.c [PATCH] inotify 2005-07-12 20:38:38 -07:00
pipe.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
posix_acl.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
quota_v1.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
quota_v2.c [PATCH] quota: possible bug in quota format v2 support 2005-04-16 15:25:47 -07:00
quota.c [PATCH] O(1) sb list traversing on syncs 2005-06-23 09:45:27 -07:00
read_write.c [PATCH] inotify 2005-07-12 20:38:38 -07:00
readdir.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
select.c [PATCH] make some things static 2005-05-05 16:36:47 -07:00
seq_file.c [PATCH] DocBook: fix some descriptions 2005-05-01 08:59:26 -07:00
stat.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
super.c [PATCH] set mnt_namespace in the correct place 2005-07-07 18:23:52 -07:00
xattr_acl.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
xattr.c [PATCH] inotify 2005-07-12 20:38:38 -07:00