linux/fs
Filipe Manana 6ca0709756 Btrfs: fix hang during inode eviction due to concurrent readahead
Zygo Blaxell and other users have reported occasional hangs while an
inode is being evicted, leading to traces like the following:

[ 5281.972322] INFO: task rm:20488 blocked for more than 120 seconds.
[ 5281.973836]       Not tainted 4.0.0-rc5-btrfs-next-9+ #2
[ 5281.974818] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 5281.976364] rm              D ffff8800724cfc38     0 20488   7747 0x00000000
[ 5281.977506]  ffff8800724cfc38 ffff8800724cfc38 ffff880065da5c50 0000000000000001
[ 5281.978461]  ffff8800724cffd8 ffff8801540a5f50 0000000000000008 ffff8801540a5f78
[ 5281.979541]  ffff8801540a5f50 ffff8800724cfc58 ffffffff8143107e 0000000000000123
[ 5281.981396] Call Trace:
[ 5281.982066]  [<ffffffff8143107e>] schedule+0x74/0x83
[ 5281.983341]  [<ffffffffa03b33cf>] wait_on_state+0xac/0xcd [btrfs]
[ 5281.985127]  [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31
[ 5281.986715]  [<ffffffffa03b4b71>] wait_extent_bit.constprop.32+0x7c/0xde [btrfs]
[ 5281.988680]  [<ffffffffa03b540b>] lock_extent_bits+0x5d/0x88 [btrfs]
[ 5281.990200]  [<ffffffffa03a621d>] btrfs_evict_inode+0x24e/0x5be [btrfs]
[ 5281.991781]  [<ffffffff8116964d>] evict+0xa0/0x148
[ 5281.992735]  [<ffffffff8116a43d>] iput+0x18f/0x1e5
[ 5281.993796]  [<ffffffff81160d4a>] do_unlinkat+0x15b/0x1fa
[ 5281.994806]  [<ffffffff81435b54>] ? ret_from_sys_call+0x1d/0x58
[ 5281.996120]  [<ffffffff8107d314>] ? trace_hardirqs_on_caller+0x18f/0x1ab
[ 5281.997562]  [<ffffffff8123960b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[ 5281.998815]  [<ffffffff81161a16>] SyS_unlinkat+0x29/0x2b
[ 5281.999920]  [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[ 5282.001299] 1 lock held by rm/20488:
[ 5282.002066]  #0:  (sb_writers#12){.+.+.+}, at: [<ffffffff8116dd81>] mnt_want_write+0x24/0x4b

This happens when we have readahead, which calls readpages(), happening
right before the inode eviction handler is invoked. So the reason is
essentially:

1) readpages() is called while a reference on the inode is held, so
   eviction can not be triggered before readpages() returns. It also
   locks one or more ranges in the inode's io_tree (which is done at
   extent_io.c:__do_contiguous_readpages());

2) readpages() submits several read bios, all with an end io callback
   that runs extent_io.c:end_bio_extent_readpage() and that is executed
   by other task when a bio finishes, corresponding to a work queue
   (fs_info->end_io_workers) worker kthread. This callback unlocks
   the ranges in the inode's io_tree that were previously locked in
   step 1;

3) readpages() returns, the reference on the inode is dropped;

4) One or more of the read bios previously submitted are still not
   complete (their end io callback was not yet invoked or has not
   yet finished execution);

5) Inode eviction is triggered (through an unlink call for example).
   The inode reference count was not incremented before submitting
   the read bios, therefore this is possible;

6) The eviction handler starts executing and enters the loop that
   iterates over all extent states in the inode's io_tree;

7) The loop picks one extent state record and uses its ->start and
   ->end fields, after releasing the inode's io_tree spinlock, to
   call lock_extent_bits() and clear_extent_bit(). The call to lock
   the range [state->start, state->end] blocks because the whole
   range or a part of it was locked by the previous call to
   readpages() and the corresponding end io callback, which unlocks
   the range was not yet executed;

8) The end io callback for the read bio is executed and unlocks the
   range [state->start, state->end] (or a superset of that range).
   And at clear_extent_bit() the extent_state record state is used
   as a second argument to split_state(), which sets state->start to
   a larger value;

9) The task executing the eviction handler is woken up by the task
   executing the bio's end io callback (through clear_state_bit) and
   the eviction handler locks the range
   [old value for state->start, state->end]. Shortly after, when
   calling clear_extent_bit(), it unlocks the range
   [new value for state->start, state->end], so it ends up unlocking
   only part of the range that it locked, leaving an extent state
   record in the io_tree that represents the unlocked subrange;

10) The eviction handler loop, in its next iteration, gets the
    extent_state record for the subrange that it did not unlock in the
    previous step and then tries to lock it, resulting in an hang.

So fix this by not using the ->start and ->end fields of an existing
extent_state record. This is a simple solution, and an alternative
could be to bump the inode's reference count before submitting each
read bio and having it dropped in the bio's end io callback. But that
would be a more invasive/complex change and would not protect against
other possible places that are not holding a reference on the inode
as well. Something to consider in the future.

Many thanks to Zygo Blaxell for reporting, in the mailing list, the
issue, a set of scripts to trigger it and testing this fix.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-06-03 04:03:09 -07:00
..
9p Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
adfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
affs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
afs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
autofs4 VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
befs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
bfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
btrfs Btrfs: fix hang during inode eviction due to concurrent readahead 2015-06-03 04:03:09 -07:00
cachefiles VFS: fs/cachefiles: d_backing_inode() annotations 2015-04-15 15:06:59 -04:00
ceph Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
cifs CIFS: Fix race condition on RFC1002_NEGATIVE_SESSION_RESPONSE 2015-05-20 13:25:55 -05:00
coda VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
configfs configfs: init configfs module earlier at boot time 2015-05-05 17:10:11 -07:00
cramfs
debugfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
devpts VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
dlm
ecryptfs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
efivarfs Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-05-06 10:57:37 -07:00
efs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
exofs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
exportfs VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) 2015-02-22 11:38:41 -05:00
ext2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
ext3 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
ext4 ext4: fix an ext3 collapse range regression in xfstests 2015-05-15 00:24:10 -04:00
f2fs f2fs: fix wrong error hanlder in f2fs_follow_link 2015-05-04 14:15:16 -07:00
fat Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
freevxfs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
fscache
fuse VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
gfs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
hfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
hfsplus Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
hostfs hostfs: Use correct mask for file mode 2015-05-04 14:50:29 +02:00
hpfs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
hppfs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
hugetlbfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
isofs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
jbd
jbd2 jbd2: fix r_count overflows leading to buffer overflow in journal recovery 2015-05-14 19:11:50 -04:00
jffs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
jfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
kernfs kernfs: do not account ino_ida allocations to memcg 2015-05-14 17:55:51 -07:00
lockd nfsd: eliminate NFSD_DEBUG 2015-04-21 16:16:02 -04:00
logfs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
minix VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
ncpfs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
nfs nfs: take extra reference to fl->fl_file when running a setlk 2015-05-13 14:56:06 -04:00
nfs_common
nfsd nfsd: skip CB_NULL probes for 4.1 or later 2015-05-04 12:02:42 -04:00
nilfs2 nilfs2: fix sanity check of btree level in nilfs_btree_root_broken() 2015-05-05 17:10:11 -07:00
nls
notify fanotify: fix event filtering with FAN_ONDIR set 2015-03-12 18:46:08 -07:00
ntfs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
ocfs2 ocfs2: dlm: fix race between purge and get lock resource 2015-05-05 17:10:11 -07:00
omfs omfs: fix potential integer overflow in allocator 2015-05-28 18:25:19 -07:00
openpromfs
overlayfs ovl: mount read-only if workdir can't be created 2015-05-19 14:30:12 +02:00
proc Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
pstore Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
qnx4
qnx6 VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
quota Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
ramfs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
reiserfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
romfs make new_sync_{read,write}() static 2015-04-11 22:29:40 -04:00
squashfs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
sysfs sysfs: Only accept read/write permissions for file attributes 2015-03-25 13:27:57 +01:00
sysv VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
tracefs
ubifs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
udf Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
ufs VFS: normal filesystems (and lustre): d_inode() annotations 2015-04-15 15:06:57 -04:00
xfs xfs: fix broken i_nlink accounting for whiteout tmpfile inode 2015-05-29 08:14:55 +10:00
aio.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-16 23:27:56 -04:00
anon_inodes.c
attr.c
bad_inode.c don't bother with most of the bad_file_ops methods 2015-02-20 04:03:58 -05:00
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c fs/binfmt_elf.c:load_elf_binary(): return -EINVAL on zero-length mappings 2015-05-28 18:25:18 -07:00
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
binfmt_script.c
block_dev.c direct-io: only inc/dec inode->i_dio_count for file systems 2015-04-24 15:45:28 -04:00
buffer.c page_writeback: clean up mess around cancel_dirty_page() 2015-04-14 16:49:01 -07:00
char_dev.c
compat_binfmt_elf.c
compat_ioctl.c Bluetooth: bnep: Add support for get bnep features via ioctl 2015-04-03 23:21:34 +02:00
compat.c
coredump.c coredump: accept any write method 2015-04-11 22:29:39 -04:00
dax.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2015-04-26 17:22:07 -07:00
dcache.c d_walk() might skip too much 2015-05-28 23:45:30 -04:00
dcookies.c
direct-io.c direct-io: only inc/dec inode->i_dio_count for file systems 2015-04-24 15:45:28 -04:00
drop_caches.c
eventfd.c eventfd: don't take the spinlock in eventfd_poll 2015-02-17 14:34:52 -08:00
eventpoll.c
exec.c parisc,metag: Fix crashes due to stack randomization on stack-grows-upwards architectures 2015-05-12 22:03:44 +02:00
fcntl.c
fhandle.c
file_table.c ->aio_read and ->aio_write removed 2015-04-11 22:29:43 -04:00
file.c mm: rcu-protected get_mm_exe_file() 2015-04-17 09:04:07 -04:00
filesystems.c
fs_pin.c fs_pin: Allow for the possibility that m_list or s_list go unused. 2015-04-09 11:39:55 -05:00
fs_struct.c
fs-writeback.c fs: add dirtytime_expire_seconds sysctl 2015-03-17 12:23:32 -04:00
inode.c direct-io: only inc/dec inode->i_dio_count for file systems 2015-04-24 15:45:28 -04:00
internal.h trylock_super(): replacement for grab_super_passive() 2015-02-22 11:38:42 -05:00
ioctl.c
Kconfig f2fs: relocate Kconfig from misc filesystems 2015-04-10 15:08:35 -07:00
Kconfig.binfmt mm: split ET_DYN ASLR from mmap ASLR 2015-04-14 16:49:05 -07:00
libfs.c VFS: fs library helpers: d_inode() annotations 2015-04-15 15:06:58 -04:00
locks.c proc: show locks in /proc/pid/fdinfo/X 2015-04-17 09:04:12 -04:00
Makefile This adds the new tracefs file system. This has been in linux-next for 2015-04-14 10:22:29 -07:00
mbcache.c
mount.h
mpage.c
namei.c path_openat(): fix double fput() 2015-05-09 00:12:48 -04:00
namespace.c mnt: Fix fs_fully_visible to verify the root directory is visible 2015-05-09 11:55:50 -05:00
no-block.c
nsfs.c VFS: assorted weird filesystems: d_inode() annotations 2015-04-15 15:06:58 -04:00
open.c xfs: update for 4.1-rc1 2015-04-24 07:08:41 -07:00
pipe.c VFS: assorted weird filesystems: d_inode() annotations 2015-04-15 15:06:58 -04:00
pnode.c mnt: Don't propagate unmounts to locked mounts 2015-04-02 20:34:20 -05:00
pnode.h mnt: Honor MNT_LOCKED when detaching mounts 2015-04-09 11:39:55 -05:00
posix_acl.c VFS: assorted d_backing_inode() annotations 2015-04-15 15:06:59 -04:00
proc_namespace.c
read_write.c new_sync_write(): discard ->ki_pos unless the return value is positive 2015-04-11 22:29:46 -04:00
readdir.c
select.c
seq_file.c Btrfs: show subvol= and subvolid= in /proc/mounts 2015-06-03 04:03:02 -07:00
signalfd.c
splice.c splice: sendfile() at once fails for big files 2015-05-06 09:27:41 -06:00
stack.c
stat.c VFS: assorted d_backing_inode() annotations 2015-04-15 15:06:59 -04:00
statfs.c
super.c cleancache: remove limit on the number of cleancache enabled filesystems 2015-04-14 16:49:03 -07:00
sync.c
timerfd.c
utimes.c
xattr.c