linux/fs
Miklos Szeredi 3be5a52b30 fuse: support writable mmap
Quoting Linus (3 years ago, FUSE inclusion discussions):

  "User-space filesystems are hard to get right. I'd claim that they
   are almost impossible, unless you limit them somehow (shared
   writable mappings are the nastiest part - if you don't have those,
   you can reasonably limit your problems by limiting the number of
   dirty pages you accept through normal "write()" calls)."

Instead of attempting the impossible, I've just waited for the dirty page
accounting infrastructure to materialize (thanks to Peter Zijlstra and
others).  This nicely solved the biggest problem: limiting the number of pages
used for write caching.

Some small details remained, however, which this largish patch attempts to
address.  It provides a page writeback implementation for fuse, which is
completely safe against VM related deadlocks.  Performance may not be very
good for certain usage patterns, but generally it should be acceptable.

It has been tested extensively with fsx-linux and bash-shared-mapping.

Fuse page writeback design
--------------------------

fuse_writepage() allocates a new temporary page with GFP_NOFS|__GFP_HIGHMEM.
It copies the contents of the original page, and queues a WRITE request to the
userspace filesystem using this temp page.

The writeback is finished instantly from the MM's point of view: the page is
removed from the radix trees, and the PageDirty and PageWriteback flags are
cleared.

For the duration of the actual write, the NR_WRITEBACK_TEMP counter is
incremented.  The per-bdi writeback count is not decremented until the actual
write completes.

On dirtying the page, fuse waits for a previous write to finish before
proceeding.  This makes sure, there can only be one temporary page used at a
time for one cached page.

This approach is wasteful in both memory and CPU bandwidth, so why is this
complication needed?

The basic problem is that there can be no guarantee about the time in which
the userspace filesystem will complete a write.  It may be buggy or even
malicious, and fail to complete WRITE requests.  We don't want unrelated parts
of the system to grind to a halt in such cases.

Also a filesystem may need additional resources (particularly memory) to
complete a WRITE request.  There's a great danger of a deadlock if that
allocation may wait for the writepage to finish.

Currently there are several cases where the kernel can block on page
writeback:

  - allocation order is larger than PAGE_ALLOC_COSTLY_ORDER
  - page migration
  - throttle_vm_writeout (through NR_WRITEBACK)
  - sync(2)

Of course in some cases (fsync, msync) we explicitly want to allow blocking.
So for these cases new code has to be added to fuse, since the VM is not
tracking writeback pages for us any more.

As an extra safetly measure, the maximum dirty ratio allocated to a single
fuse filesystem is set to 1% by default.  This way one (or several) buggy or
malicious fuse filesystems cannot slow down the rest of the system by hogging
dirty memory.

With appropriate privileges, this limit can be raised through
'/sys/class/bdi/<bdi>/max_ratio'.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-30 08:29:50 -07:00
..
9p [PATCH] restore sane ->umount_begin() API 2008-04-25 09:23:25 -04:00
adfs adfs: work around bogus sparse warning 2008-04-29 08:05:59 -07:00
affs fs/affs/file.c: use BUG_ON 2008-04-29 08:06:02 -07:00
afs afs: support the CB.ProbeUuid RPC op 2008-04-29 08:06:26 -07:00
autofs mount options: fix autofs 2008-02-08 09:22:40 -08:00
autofs4 autofs4: fix sparse warning in root.c 2008-04-29 08:06:01 -07:00
befs befs: fix sparse warning in linuxvfs.c 2008-04-29 08:05:59 -07:00
bfs iget: stop BFS from using iget() and read_inode() 2008-02-07 08:42:27 -08:00
cifs proc: remove proc_root_fs 2008-04-29 08:06:18 -07:00
coda codafs: fix build warning 2008-04-29 08:06:04 -07:00
configfs mm: bdi: add separate writeback accounting capability 2008-04-30 08:29:50 -07:00
cramfs fs: Remove unnecessary inclusions of asm/semaphore.h 2008-04-18 22:16:44 -04:00
debugfs debugfs: fix sparse warnings 2008-03-04 14:47:06 -08:00
devpts devpts: factor out PTY index allocation 2008-04-30 08:29:48 -07:00
dlm Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm 2008-04-22 13:44:23 -07:00
ecryptfs Remove duplicated unlikely() in IS_ERR() 2008-04-29 08:06:25 -07:00
efs efs: update error msg to not refer to deleted read_inode() 2008-04-02 15:28:19 -07:00
exportfs
ext2 ext2: retry block allocation if new blocks are allocated from system zone 2008-04-28 08:58:43 -07:00
ext3 ext3: fix test ext_generic_write_end() copied return value 2008-04-29 22:01:27 -04:00
ext4 ext4: fix test ext_generic_write_end() copied return value 2008-04-29 22:01:18 -04:00
fat fat: use get/put_unaligned_* helpers 2008-04-29 08:06:28 -07:00
freevxfs fs/freevxfs/: proper externs 2008-04-29 08:06:00 -07:00
fuse fuse: support writable mmap 2008-04-30 08:29:50 -07:00
gfs2 mm: remove nopage 2008-04-28 08:58:18 -07:00
hfs hfs: handle match_strdup failure 2008-04-29 08:06:01 -07:00
hfsplus trivial: fix user-visible typo in hfsplus 2008-04-29 13:23:21 -07:00
hostfs uml: fix hostfs tv_usec calculations 2008-02-05 09:44:30 -08:00
hpfs mount options: fix hpfs 2008-02-08 09:22:40 -08:00
hppfs [PATCH] sanitize hppfs 2008-03-19 06:42:18 -04:00
hugetlbfs mm: bdi: add separate writeback accounting capability 2008-04-30 08:29:50 -07:00
isofs isofs: fix access to unallocated memory when reading corrupted filesystem 2008-04-30 08:29:33 -07:00
jbd jbd: replace remaining __FUNCTION__ occurrences 2008-04-28 08:58:45 -07:00
jbd2 jbd2: use non-racy method for proc entries creation 2008-04-29 08:06:20 -07:00
jffs2 [JFFS2] Introduce dbg_readinode2 log level, use it to shut read_dnode() up 2008-04-23 16:43:15 +01:00
jfs proc: remove proc_root_fs 2008-04-29 08:06:18 -07:00
lockd locks: don't call ->copy_lock methods on return of conflicting locks 2008-04-25 13:00:11 -04:00
minix iget: stop the MINIX filesystem from using iget() and read_inode() 2008-02-07 08:42:28 -08:00
msdos fat: fat_notify_change() and check_mode() cleanup 2008-04-28 08:58:47 -07:00
ncpfs ncpfs: use get/put_unaligned_* helpers 2008-04-29 08:06:28 -07:00
nfs mm: bdi: expose the BDI object in sysfs for NFS 2008-04-30 08:29:49 -07:00
nfs_common
nfsd nfsd: use proc_create to setup de->proc_fops 2008-04-29 08:06:20 -07:00
nls
ntfs Remove duplicated unlikely() in IS_ERR() 2008-04-29 08:06:25 -07:00
ocfs2 mm: bdi: add separate writeback accounting capability 2008-04-30 08:29:50 -07:00
openpromfs iget: stop OPENPROMFS from using iget() and read_inode() 2008-02-07 08:42:29 -08:00
partitions fat: detect media without partition table correctly 2008-04-28 08:58:47 -07:00
proc mm: Add NR_WRITEBACK_TEMP counter 2008-04-30 08:29:50 -07:00
qnx4 iget: stop QNX4 from using iget() and read_inode() 2008-02-07 08:42:28 -08:00
ramfs mm: bdi: add separate writeback accounting capability 2008-04-30 08:29:50 -07:00
reiserfs reiserfs: use non-racy method for proc entries creation 2008-04-29 08:06:20 -07:00
romfs ROMFS: Fix up an error in iget removal 2008-03-19 18:53:36 -07:00
smbfs NULL noise: fs/*, mm/*, kernel/* 2008-03-30 14:18:41 -07:00
sysfs mm: bdi: add separate writeback accounting capability 2008-04-30 08:29:50 -07:00
sysv iget: stop the SYSV filesystem from using iget() and read_inode() 2008-02-07 08:42:29 -08:00
udf udf: fix sparse warning in namei.c 2008-04-28 08:58:46 -07:00
ufs ufs: replace __inline with inline 2008-04-28 08:58:45 -07:00
vfat fat: use __getname() 2008-04-28 08:58:47 -07:00
xfs [XFS] Include linux/random.h in all builds, not just debug. 2008-04-30 07:53:50 -07:00
aio.c aio: fix misleading comments 2008-04-29 08:06:29 -07:00
anon_inodes.c [PATCH] fix up new filp allocators 2008-03-19 06:54:05 -04:00
attr.c
bad_inode.c iget: introduce a function to register iget failure 2008-02-07 08:42:26 -08:00
binfmt_aout.c fs/binfmt_aout.c: use printk_ratelimit() 2008-04-29 08:06:04 -07:00
binfmt_elf_fdpic.c fdpic: check that the size returned by kernel_read() is what we asked for 2008-04-29 08:06:05 -07:00
binfmt_elf.c elf: fix shadowed variables in fs/binfmt_elf.c 2008-04-29 08:06:16 -07:00
binfmt_em86.c binfmt_misc.c: avoid potential kernel stack overflow 2008-04-29 08:06:04 -07:00
binfmt_flat.c procfs task exe symlink 2008-04-29 08:06:17 -07:00
binfmt_misc.c binfmt_misc.c: avoid potential kernel stack overflow 2008-04-29 08:06:04 -07:00
binfmt_script.c binfmt_misc.c: avoid potential kernel stack overflow 2008-04-29 08:06:04 -07:00
binfmt_som.c [PATCH] sanitize handling of shared descriptor tables in failing execve() 2008-04-25 09:23:53 -04:00
bio.c block: add dma alignment and padding support to blk_rq_map_kern 2008-04-29 09:50:34 +02:00
block_dev.c fs/block_dev.c: remove #if 0'ed code 2008-02-19 10:04:00 +01:00
buffer.c make fs/buffer.c:cont_expand_zero() static 2008-04-29 08:06:01 -07:00
char_dev.c fs: remove unused fops from struct char_device_struct 2008-04-29 08:06:01 -07:00
compat_binfmt_elf.c x86: compat_binfmt_elf 2008-01-30 13:31:46 +01:00
compat_ioctl.c tty: The big operations rework 2008-04-30 08:29:47 -07:00
compat.c signals: use HAVE_SET_RESTORE_SIGMASK 2008-04-30 08:29:37 -07:00
dcache.c [patch 2/7] vfs: mountinfo: add seq_file_root() 2008-04-23 00:04:38 -04:00
dcookies.c d_path: Make d_path() use a struct path 2008-02-14 21:17:09 -08:00
direct-io.c Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
dnotify.c
dquot.c quota: quota core changes for quotaon on remount 2008-04-28 08:58:33 -07:00
drop_caches.c vfs: skip inodes without pages to free in drop_pagecache_sb() 2008-04-29 08:06:05 -07:00
eventfd.c fs/eventfd.c should #include <linux/syscalls.h> 2008-02-06 10:41:03 -08:00
eventpoll.c signals: use HAVE_SET_RESTORE_SIGMASK 2008-04-30 08:29:37 -07:00
exec.c document de_thread() with exit_notify() connection 2008-04-30 08:29:38 -07:00
fcntl.c [PATCH] sanitize locate_fd() 2008-04-25 09:24:05 -04:00
fifo.c
file_table.c [PATCH] r/o bind mounts: debugging for missed calls 2008-04-19 00:29:28 -04:00
file.c get rid of NR_OPEN and introduce a sysctl_nr_open 2008-02-06 10:41:06 -08:00
filesystems.c
fs-writeback.c fs/fs-writeback.c: make 2 functions static 2008-04-29 08:06:00 -07:00
generic_acl.c
inode.c fs/inode.c: use hlist_for_each_entry() 2008-04-29 08:06:06 -07:00
inotify_user.c Remove duplicated unlikely() in IS_ERR() 2008-04-29 08:06:25 -07:00
inotify.c inotify: remove debug code 2008-02-06 10:41:07 -08:00
internal.h [PATCH] move a bunch of declarations to fs/internal.h 2008-04-21 23:11:01 -04:00
ioctl.c make vfs_ioctl() static 2008-04-29 08:06:00 -07:00
ioprio.c cfq-iosched: relax IOPRIO_CLASS_IDLE restrictions 2008-01-28 11:38:15 +01:00
Kconfig Merge git://git.linux-nfs.org/projects/trondmy/nfs-2.6 2008-04-24 11:46:16 -07:00
Kconfig.binfmt make BINFMT_FLAT a bool 2008-04-29 08:06:01 -07:00
libfs.c Pagecache zeroing: zero_user_segment, zero_user_segments and zero_user 2008-02-05 09:44:13 -08:00
locks.c Export __locks_copy_lock() so modular lockd builds 2008-04-25 15:49:46 -07:00
Makefile x86: compat_binfmt_elf Kconfig 2008-01-30 13:31:46 +01:00
mbcache.c vfs: fix possible deadlock in ext2, ext3, ext4 when using xattrs 2008-04-15 19:35:41 -07:00
mpage.c docbook: fix filesystems.tmpl source files 2008-03-03 10:47:13 -08:00
namei.c cgroups: implement device whitelist 2008-04-29 08:06:09 -07:00
namespace.c vfs: remove lives_below_in_same_fs() 2008-04-29 08:06:06 -07:00
nfsctl.c Introduce path_put() 2008-02-14 21:13:33 -08:00
no-block.c
open.c xip: support non-struct page backed memory 2008-04-28 08:58:23 -07:00
pipe.c [PATCH] double-free of inode on alloc_file() failure exit in create_write_pipe() 2008-04-22 19:54:57 -04:00
pnode.c [patch 7/7] vfs: mountinfo: show dominating group id 2008-04-23 00:05:09 -04:00
pnode.h [patch 7/7] vfs: mountinfo: show dominating group id 2008-04-23 00:05:09 -04:00
posix_acl.c
quota_v1.c quota: do not allow setting of quota limits to too high values 2008-04-28 08:58:32 -07:00
quota_v2.c quota: do not allow setting of quota limits to too high values 2008-04-28 08:58:32 -07:00
quota.c quota: quota core changes for quotaon on remount 2008-04-28 08:58:33 -07:00
read_write.c fs: use loff_t type instead of long long 2008-04-22 15:17:11 -07:00
read_write.h
readdir.c Use mutex_lock_killable in vfs_readdir 2007-12-06 17:39:54 -05:00
select.c signals: use HAVE_SET_RESTORE_SIGMASK 2008-04-30 08:29:37 -07:00
seq_file.c [patch 2/7] vfs: mountinfo: add seq_file_root() 2008-04-23 00:04:38 -04:00
signalfd.c signalfd: fix for incorrect SI_QUEUE user data reporting 2008-04-11 08:06:44 -07:00
splice.c relay: fix splice problem 2008-04-29 09:48:15 +02:00
stack.c
stat.c Introduce path_put() 2008-02-14 21:13:33 -08:00
super.c make __put_super() static 2008-04-29 08:06:00 -07:00
sync.c vfs: fix unconditional write_super() call in file_fsync() 2008-04-29 08:06:06 -07:00
timerfd.c fs/timerfd.c should #include <linux/syscalls.h> 2008-04-29 08:06:01 -07:00
utimes.c [PATCH] r/o bind mounts: elevate write count for do_utimes() 2008-04-19 00:29:24 -04:00
xattr_acl.c
xattr.c xattr: add missing consts to function arguments 2008-04-29 08:06:06 -07:00