linux/fs
Maxim Patlasov 5a53748568 mm/page-writeback.c: add strictlimit feature
The feature prevents mistrusted filesystems (ie: FUSE mounts created by
unprivileged users) to grow a large number of dirty pages before
throttling.  For such filesystems balance_dirty_pages always check bdi
counters against bdi limits.  I.e.  even if global "nr_dirty" is under
"freerun", it's not allowed to skip bdi checks.  The only use case for now
is fuse: it sets bdi max_ratio to 1% by default and system administrators
are supposed to expect that this limit won't be exceeded.

The feature is on if a BDI is marked by BDI_CAP_STRICTLIMIT flag.  A
filesystem may set the flag when it initializes its BDI.

The problematic scenario comes from the fact that nobody pays attention to
the NR_WRITEBACK_TEMP counter (i.e.  number of pages under fuse
writeback).  The implementation of fuse writeback releases original page
(by calling end_page_writeback) almost immediately.  A fuse request queued
for real processing bears a copy of original page.  Hence, if userspace
fuse daemon doesn't finalize write requests in timely manner, an
aggressive mmap writer can pollute virtually all memory by those temporary
fuse page copies.  They are carefully accounted in NR_WRITEBACK_TEMP, but
nobody cares.

To make further explanations shorter, let me use "NR_WRITEBACK_TEMP
problem" as a shortcut for "a possibility of uncontrolled grow of amount
of RAM consumed by temporary pages allocated by kernel fuse to process
writeback".

The problem was very easy to reproduce.  There is a trivial example
filesystem implementation in fuse userspace distribution: fusexmp_fh.c.  I
added "sleep(1);" to the write methods, then recompiled and mounted it.
Then created a huge file on the mount point and run a simple program which
mmap-ed the file to a memory region, then wrote a data to the region.  An
hour later I observed almost all RAM consumed by fuse writeback.  Since
then some unrelated changes in kernel fuse made it more difficult to
reproduce, but it is still possible now.

Putting this theoretical happens-in-the-lab thing aside, there is another
thing that really hurts real world (FUSE) users.  This is write-through
page cache policy FUSE currently uses.  I.e.  handling write(2), kernel
fuse populates page cache and flushes user data to the server
synchronously.  This is excessively suboptimal.  Pavel Emelyanov's patches
("writeback cache policy") solve the problem, but they also make resolving
NR_WRITEBACK_TEMP problem absolutely necessary.  Otherwise, simply copying
a huge file to a fuse mount would result in memory starvation.  Miklos,
the maintainer of FUSE, believes strictlimit feature the way to go.

And eventually putting FUSE topics aside, there is one more use-case for
strictlimit feature.  Using a slow USB stick (mass storage) in a machine
with huge amount of RAM installed is a well-known pain.  Let's make simple
computations.  Assuming 64GB of RAM installed, existing implementation of
balance_dirty_pages will start throttling only after 9.6GB of RAM becomes
dirty (freerun == 15% of total RAM).  So, the command "cp 9GB_file
/media/my-usb-storage/" may return in a few seconds, but subsequent
"umount /media/my-usb-storage/" will take more than two hours if effective
throughput of the storage is, to say, 1MB/sec.

After inclusion of strictlimit feature, it will be trivial to add a knob
(e.g.  /sys/devices/virtual/bdi/x:y/strictlimit) to enable it on demand.
Manually or via udev rule.  May be I'm wrong, but it seems to be quite a
natural desire to limit the amount of dirty memory for some devices we are
not fully trust (in the sense of sustainable throughput).

[akpm@linux-foundation.org: fix warning in page-writeback.c]
Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11 15:58:04 -07:00
..
9p
adfs
affs
afs afs: get rid of redundant ->d_name.len checks 2013-09-07 19:54:55 -04:00
autofs4 autofs4 - fix device ioctl mount lookup 2013-09-08 22:07:47 -04:00
befs
bfs bfs: iget_locked() doesn't return an ERR_PTR 2013-08-24 12:10:22 -04:00
btrfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-09-06 09:36:28 -07:00
cachefiles CacheFiles: Implement interface to check cache consistency 2013-09-06 09:17:30 +01:00
ceph ceph: use d_invalidate() to invalidate aliases 2013-09-06 12:55:29 -07:00
cifs direct-io: Handle O_(D)SYNC AIO 2013-09-04 09:23:46 -04:00
coda
configfs
cramfs
debugfs debugfs: debugfs_remove_recursive() must not rely on list_empty(d_subdirs) 2013-07-31 12:16:31 -04:00
devpts
dlm dlm: remove signal blocking 2013-08-12 15:22:43 -05:00
ecryptfs
efivarfs
efs efs: iget_locked() doesn't return an ERR_PTR() 2013-08-24 12:10:22 -04:00
exofs
exportfs exportfs: don't assume that ->iterate() won't feed us too long entries 2013-09-07 19:54:55 -04:00
ext2
ext3 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2013-09-06 09:06:02 -07:00
ext4 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2013-09-06 09:36:28 -07:00
f2fs f2fs: optimize gc for better performance 2013-09-05 13:50:32 +09:00
fat
freevxfs
fscache fscache: Netfs function for cleanup post readpages 2013-09-06 09:17:30 +01:00
fuse mm/page-writeback.c: add strictlimit feature 2013-09-11 15:58:04 -07:00
gfs2 This is possibly the smallest ever set of GFS2 patches for a merge 2013-09-09 09:16:51 -07:00
hfs
hfsplus
hostfs um: hostfs: Fix writeback 2013-09-07 10:38:29 +02:00
hpfs
hppfs
hugetlbfs cope with potentially long ->d_dname() output for shmem/hugetlb 2013-08-24 12:10:17 -04:00
isofs isofs: Refuse RW mount of the filesystem instead of making it RO 2013-07-31 22:14:50 +02:00
jbd jbd: use a single printk for jbd_debug() 2013-08-09 10:49:00 +02:00
jbd2 jbd2: Fix endian mixing problems in the checksumming code 2013-08-28 14:59:58 -04:00
jffs2
jfs jfs: fix readdir cookie incompatibility with NFSv4 2013-08-15 17:22:29 -05:00
lockd LOCKD: Don't call utsname()->nodename from nlmclnt_setlockargs 2013-08-05 15:03:46 -04:00
logfs
minix
ncpfs
nfs NFS client updates for Linux 3.12 2013-09-09 09:19:15 -07:00
nfs_common
nfsd Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linux 2013-09-10 20:04:59 -07:00
nilfs2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-09-05 08:50:26 -07:00
nls
notify
ntfs
ocfs2 ocfs2: fix the end cluster offset of FIEMAP 2013-09-11 15:56:53 -07:00
omfs
openpromfs
proc mm: track vma changes with VM_SOFTDIRTY bit 2013-09-11 15:57:56 -07:00
pstore pstore/ram: (really) fix undefined usage of rounddown_pow_of_two 2013-08-30 15:57:01 -07:00
qnx4
qnx6
quota xfs: update for v3.12-rc1 2013-09-09 11:19:09 -07:00
ramfs
reiserfs Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2013-09-06 09:06:02 -07:00
romfs
squashfs
sysfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-09-07 14:36:57 -07:00
sysv
ubifs
udf udf: Refuse RW mount of the filesystem instead of making it RO 2013-07-31 22:14:51 +02:00
ufs
xfs xfs: update for v3.12-rc1 2013-09-09 11:19:09 -07:00
aio.c
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
binfmt_som.c
bio-integrity.c
bio.c Merge branch 'for-3.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2013-09-03 18:25:03 -07:00
block_dev.c direct-io: Handle O_(D)SYNC AIO 2013-09-04 09:23:46 -04:00
buffer.c
char_dev.c
compat_binfmt_elf.c
compat_ioctl.c
compat.c
coredump.c
coredump.h
dcache.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-09-10 12:44:24 -07:00
dcookies.c
direct-io.c direct-io: Use return from cmpxchg to decide of assignment happened 2013-09-09 10:47:42 -07:00
drop_caches.c
eventfd.c
eventpoll.c switch epoll_ctl() to fdget 2013-09-03 23:04:44 -04:00
exec.c mm: track vma changes with VM_SOFTDIRTY bit 2013-09-11 15:57:56 -07:00
fcntl.c vfs: add missing check for __O_TMPFILE in fcntl_init() 2013-08-05 18:25:32 +04:00
fhandle.c
file_table.c only regular files with FMODE_WRITE need to be on s_files 2013-09-03 22:50:28 -04:00
file.c
filesystems.c
fs_struct.c
fs-writeback.c mm/writeback: make writeback_inodes_wb static 2013-09-11 15:58:02 -07:00
generic_acl.c
inode.c constify touch_atime() 2013-09-03 22:52:45 -04:00
internal.h rename user_path_umountat() to user_path_mountpoint_at() 2013-09-08 20:20:21 -04:00
ioctl.c
ioprio.c
Kconfig
Kconfig.binfmt
libfs.c
locks.c
Makefile
mbcache.c
mount.h
mpage.c
namei.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2013-09-10 12:44:24 -07:00
namespace.c rename user_path_umountat() to user_path_mountpoint_at() 2013-09-08 20:20:21 -04:00
no-block.c
open.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2013-09-07 14:35:32 -07:00
pipe.c
pnode.c
pnode.h vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces 2013-08-26 18:42:15 -07:00
posix_acl.c
proc_namespace.c
read_write.c
readdir.c
select.c
seq_file.c
signalfd.c
splice.c
stack.c
stat.c quota: provide interface for readding allocated space into reserved space 2013-08-17 09:32:32 -04:00
statfs.c
super.c prune_super(): sb->s_op is never NULL 2013-09-07 19:54:56 -04:00
sync.c
timerfd.c
utimes.c
xattr_acl.c
xattr.c