linux/fs
Peter Staubach 38c73044f5 NFS: read-modify-write page updating
Hi.

I have a proposal for possibly resolving this issue.

I believe that this situation occurs due to the way that the
Linux NFS client handles writes which modify partial pages.

The Linux NFS client handles partial page modifications by
allocating a page from the page cache, copying the data from
the user level into the page, and then keeping track of the
offset and length of the modified portions of the page.  The
page is not marked as up to date because there are portions
of the page which do not contain valid file contents.

When a read call comes in for a portion of the page, the
contents of the page must be read in the from the server.
However, since the page may already contain some modified
data, that modified data must be written to the server
before the file contents can be read back in the from server.
And, since the writing and reading can not be done atomically,
the data must be written and committed to stable storage on
the server for safety purposes.  This means either a
FILE_SYNC WRITE or a UNSTABLE WRITE followed by a COMMIT.
This has been discussed at length previously.

This algorithm could be described as modify-write-read.  It
is most efficient when the application only updates pages
and does not read them.

My proposed solution is to add a heuristic to decide whether
to do this modify-write-read algorithm or switch to a read-
modify-write algorithm when initially allocating the page
in the write system call path.  The heuristic uses the modes
that the file was opened with, the offset in the page to
read from, and the size of the region to read.

If the file was opened for reading in addition to writing
and the page would not be filled completely with data from
the user level, then read in the old contents of the page
and mark it as Uptodate before copying in the new data.  If
the page would be completely filled with data from the user
level, then there would be no reason to read in the old
contents because they would just be copied over.

This would optimize for applications which randomly access
and update portions of files.  The linkage editor for the
C compiler is an example of such a thing.

I tested the attached patch by using rpmbuild to build the
current Fedora rawhide kernel.  The kernel without the
patch generated about 269,500 WRITE requests.  The modified
kernel containing the patch generated about 261,000 WRITE
requests.  Thus, about 8,500 fewer WRITE requests were
generated.  I suspect that many of these additional
WRITE requests were probably FILE_SYNC requests to WRITE
a single page, but I didn't test this theory.

The difference between this patch and the previous one was
to remove the unneeded PageDirty() test.  I then retested to
ensure that the resulting system continued to behave as
desired.

	Thanx...

		ps

Signed-off-by: Peter Staubach <staubach@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2009-08-10 08:54:16 -04:00
..
9p 9p: Fix incorrect parameters to v9fs_file_readn. 2009-07-14 15:54:42 -05:00
adfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
affs affs: add ->sync_fs 2009-06-11 21:36:14 -04:00
afs AFS: Fix compilation warning 2009-07-12 12:24:07 -07:00
autofs switch follow_down() 2009-06-11 21:36:01 -04:00
autofs4 headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
befs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2009-06-17 08:46:57 -07:00
bfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
btrfs Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable 2009-08-07 19:03:09 -07:00
cachefiles enforce ->sync_fs is only called for rw superblock 2009-06-11 21:36:06 -04:00
cifs [CIFS] Update readme to reflect forceuid mount parms 2009-08-04 03:53:28 +00:00
coda splice: implement default splice_read method 2009-05-11 14:13:10 +02:00
configfs configfs: Rework configfs_depend_item() locking and make lockdep happy 2009-04-30 10:48:26 -07:00
cramfs fs/cramfs: return f_fsid for statfs(2) 2009-04-02 19:05:08 -07:00
debugfs debugfs: use specified mode to possibly mark files read/write only 2009-06-15 21:30:28 -07:00
devpts devpts: remove module-related code 2009-06-24 08:15:24 -04:00
dlm dlm: free socket in error exit path 2009-07-14 12:28:43 -05:00
ecryptfs eCryptfs: parse_tag_3_packet check tag 3 packet encrypted key size 2009-07-28 14:26:06 -07:00
efs get rid of BKL in fs/efs 2009-06-17 00:36:36 -04:00
exofs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
exportfs
ext2 headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
ext3 ext3: Get rid of extenddisksize parameter of ext3_get_blocks_handle() 2009-07-15 21:30:46 +02:00
ext4 Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 2009-07-13 16:39:25 -07:00
fat headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
freevxfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
fscache FS-Cache: Fixup renamed filenames in comments in internal.h 2009-05-27 10:20:13 -07:00
fuse Revert "fuse: Fix build error" as unnecessary 2009-07-11 11:22:34 -07:00
gfs2 GFS2: remove dcache entries for remote deleted inodes 2009-07-30 11:01:03 +01:00
hfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
hfsplus headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
hostfs hostfs: set maximum filesize in superblock for proper LFS support 2009-06-30 18:56:03 -07:00
hpfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
hppfs hppfs: hppfs_read_file() may return -ERROR 2009-04-02 19:04:53 -07:00
hugetlbfs Merge branch 'master' into next 2009-05-22 18:40:59 +10:00
isofs isofs: fix Joliet regression 2009-07-10 19:18:59 -07:00
jbd jbd: fix race between write_metadata_buffer and get_write_access 2009-07-21 11:54:42 +02:00
jbd2 jbd2: fix race between write_metadata_buffer and get_write_access 2009-07-13 17:55:35 -04:00
jffs2 jffs2: Fix return value from jffs2_do_readpage_nolock() 2009-08-04 12:13:06 +01:00
jfs jfs: Fix early release of acl in jfs_get_acl 2009-07-23 11:08:36 -05:00
lockd headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
minix Making fs/minix/minix.h double including safe 2009-06-22 11:34:42 -07:00
ncpfs NLS: update handling of Unicode 2009-06-15 21:44:43 -07:00
nfs NFS: read-modify-write page updating 2009-08-10 08:54:16 -04:00
nfs_common
nfsd headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
nilfs2 nilfs2: fix missing unlock in error path of nilfs_mdt_write_page 2009-08-02 22:24:15 +09:00
nls NLS: update handling of Unicode 2009-06-15 21:44:43 -07:00
notify inotify: use GFP_NOFS under potential memory pressure 2009-07-21 15:26:27 -04:00
ntfs ntfs: use is_power_of_2() function for clarity. 2009-06-16 19:47:48 -07:00
ocfs2 headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
omfs switch omfs to simple_fsync() 2009-06-11 21:36:13 -04:00
openpromfs zero i_uid/i_gid on inode allocation 2009-01-05 11:54:28 -05:00
partitions partitions: fix broken uevent_suppress conversion 2009-07-12 13:02:09 -07:00
proc proc: vmcore - use kzalloc in get_new_element() 2009-06-18 13:03:41 -07:00
qnx4 fs/qnx4: sanitize includes 2009-06-11 21:36:12 -04:00
quota quota: Silence lockdep on quota_on 2009-07-30 17:31:23 +02:00
ramfs fs/ramfs/file-nommu.c needs include/linux/sched.h 2009-07-29 19:10:36 -07:00
reiserfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
romfs ROMFS: romfs_dev_read() error ignored 2009-05-09 10:49:41 -04:00
smbfs push BKL down into ->put_super 2009-06-11 21:36:07 -04:00
squashfs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
sysfs sysfs: fix hardlink count on device_move 2009-07-28 13:45:21 -07:00
sysv get rid of BKL in fs/sysv 2009-06-17 00:36:37 -04:00
ubifs headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
udf udf: Fix loading of VAT inode when drive wrongly reports number of recorded blocks 2009-07-30 17:28:26 +02:00
ufs ufs: sector_t cannot be negative 2009-06-18 13:03:46 -07:00
xfs xfs: fix freeing of inodes not yet added to the inode cache 2009-08-07 14:38:34 -03:00
aio.c eventfd: revised interface and cleanups 2009-06-30 18:55:58 -07:00
anon_inodes.c fs: Provide empty .set_page_dirty() aop for anon inodes 2009-06-18 14:46:10 +02:00
attr.c vfs: Use lowercase names of quota functions 2009-03-26 02:18:35 +01:00
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c elf_core_dump: use rcu_read_lock() to access ->real_parent 2009-06-18 13:03:52 -07:00
binfmt_elf.c elf: fix one check-after-use 2009-07-01 11:14:28 -07:00
binfmt_em86.c
binfmt_flat.c flat: fix uninitialized ptr with shared libs 2009-08-07 10:39:57 -07:00
binfmt_misc.c fs/binfmt_misc.c: add terminating newline to /proc/sys/fs/binfmt_misc/status 2009-01-06 15:59:19 -08:00
binfmt_script.c
binfmt_som.c Don't crap into descriptor table in binfmt_som 2009-03-31 23:00:28 -04:00
bio-integrity.c block: Create bip slabs with embedded integrity vectors 2009-07-01 10:56:25 +02:00
bio.c block: fix sg SG_DXFER_TO_FROM_DEV regression 2009-07-10 20:31:53 +02:00
block_dev.c PM / Hibernate: Replace bdget call with simple atomic_inc of i_count 2009-07-29 21:07:55 +02:00
buffer.c Merge branch 'for-2.6.31' of git://git.kernel.dk/linux-2.6-block 2009-06-11 11:10:35 -07:00
char_dev.c headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
compat_binfmt_elf.c
compat_ioctl.c compat_ioctl: hook up compat handler for FIEMAP ioctl 2009-08-07 10:39:56 -07:00
compat.c headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
dcache.c dcache: extrace and use d_unlinked() 2009-06-11 21:36:06 -04:00
dcookies.c [CVE-2009-0029] System call wrapper special cases 2009-01-14 14:15:18 +01:00
direct-io.c block: Do away with the notion of hardsect_size 2009-05-22 23:22:54 +02:00
drop_caches.c mm: remove __invalidate_mapping_pages variant 2009-06-16 19:47:43 -07:00
eventfd.c eventfd: revised interface and cleanups 2009-06-30 18:55:58 -07:00
eventpoll.c epoll: fix nested calls support 2009-06-18 13:03:41 -07:00
exec.c cred_guard_mutex: do not return -EINTR to user-space 2009-07-06 13:57:04 -07:00
fcntl.c headers: smp_lock.h redux 2009-07-12 12:22:34 -07:00
fifo.c
file_table.c fs: move mark_files_ro into file_table.c 2009-06-11 21:36:02 -04:00
file.c
filesystems.c fs: Mark get_filesystem_list() as __init function. 2009-04-20 23:02:52 -04:00
fs_struct.c Get rid of indirect include of fs_struct.h 2009-03-31 23:00:27 -04:00
fs-writeback.c cleanup __writeback_single_inode 2009-06-24 08:15:26 -04:00
generic_acl.c New helper - current_umask() 2009-03-31 23:00:26 -04:00
inode.c vfs: add __destroy_inode 2009-08-07 14:38:29 -03:00
internal.h Trim a bit of crap from fs.h 2009-06-11 21:36:07 -04:00
ioctl.c fs: Add new pre-allocation ioctls to vfs for compatibility with legacy xfs ioctls 2009-06-24 08:15:27 -04:00
ioprio.c [CVE-2009-0029] System call wrappers part 28 2009-01-14 14:15:30 +01:00
Kconfig fs/Kconfig: move nilfs2 out 2009-07-14 12:34:17 +09:00
Kconfig.binfmt CORE_DUMP_DEFAULT_ELF_HEADERS depends on ELF_CORE 2009-01-09 16:54:41 -08:00
libfs.c New helper - simple_fsync() 2009-06-11 21:36:11 -04:00
locks.c lockd: call locks_release_private to cleanup per-filesystem state 2009-04-24 16:36:03 -04:00
Makefile nilfs2: update makefile and Kconfig 2009-04-07 08:31:16 -07:00
mbcache.c
mpage.c ext4: Properly initialize the buffer_head state 2009-05-13 15:13:42 -04:00
namei.c integrity: add ima_counts_put (updated) 2009-06-29 08:59:10 +10:00
namespace.c vfs: mnt_want_write_file(): fix special file handling 2009-08-07 10:39:56 -07:00
nfsctl.c [CVE-2009-0029] System call wrappers part 27 2009-01-14 14:15:29 +01:00
no-block.c
open.c fs: Add new pre-allocation ioctls to vfs for compatibility with legacy xfs ioctls 2009-06-24 08:15:27 -04:00
pipe.c lockdep: Fix lockdep annotation for pipe_double_lock() 2009-07-22 21:14:14 +02:00
pnode.c
pnode.h
posix_acl.c
read_write.c splice: implement default splice_read method 2009-05-11 14:13:10 +02:00
read_write.h
readdir.c [CVE-2009-0029] System call wrappers part 32 2009-01-14 14:15:31 +01:00
select.c poll: avoid extra wakeups in select/poll 2009-06-16 19:47:48 -07:00
seq_file.c seq_file: add function to write binary data 2009-06-18 13:03:57 -07:00
signalfd.c [CVE-2009-0029] System call wrappers part 31 2009-01-14 14:15:31 +01:00
splice.c splice: fix kmaps in default_file_splice_write() 2009-05-19 11:37:46 +02:00
stack.c
stat.c kill vfs_stat_fd / vfs_lstat_fd 2009-04-20 23:02:52 -04:00
super.c ... and the same for vfsmount id/mount group id 2009-06-24 08:15:26 -04:00
sync.c sys_sync(): fix 16% performance regression in ffsb create_4k test 2009-07-06 13:57:03 -07:00
timerfd.c timerfd: add flags check 2009-02-18 15:37:53 -08:00
utimes.c [CVE-2009-0029] System call wrappers part 30 2009-01-14 14:15:30 +01:00
xattr_acl.c
xattr.c fs: introduce mnt_clone_write 2009-06-11 21:36:02 -04:00