linux

mirror of https://github.com/FEX-Emu/linux.git synced 2024-12-16 05:50:19 +00:00

Author	SHA1	Message	Date
Christoph Hellwig	070ecdca54	xfs: skip writeback from reclaim context Allowing writeback from reclaim context causes massive problems with stack overflows as we can call into the writeback code which tends to be a heavy stack user both in the generic code and XFS from random contexts that perform memory allocations. Follow the example of btrfs (and in slightly different form ext4) and refuse to write out data from reclaim context. This issue should really be handled by the VM so that we can tune better for this case, but until we get it sorted out there we have to hack around this in each filesystem with a complex writeback path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com>	2010-06-03 16:22:29 +10:00
Dave Chinner	5b257b4a1f	xfs: fix race in inode cluster freeing failing to stale inodes When an inode cluster is freed, it needs to mark all inodes in memory as XFS_ISTALE before marking the buffer as stale. This is needed because the inodes have a different life cycle to the buffer, and once the buffer is torn down during transaction completion, we must ensure none of the inodes get written back (which is what XFS_ISTALE does). Unfortunately, xfs_ifree_cluster() has some bugs that lead to inodes not being marked with XFS_ISTALE. This shows up when xfs_iflush() is called on these inodes either during inode reclaim or tail pushing on the AIL. The buffer is read back, but no longer contains inodes and so triggers assert failures and shutdowns. This was reproducible with a run.dbench10 invocation from xfstests. There are two main causes of xfs_ifree_cluster() failing. The first is simple - it checks in-memory inodes it finds in the per-ag icache to see if they are clean without holding the flush lock. If they are clean it skips them completely. However, if an inode is flushed delwri, it will appear clean, but is not guaranteed to be written back until the flush lock has been dropped. Hence we may have raced on the clean check and the inode may actually be dirty. Hence always mark inodes found in memory stale before we check properly if they are clean. The second is more complex, and makes the first problem easier to hit. Basically the in-memory inode scan is done with full knowledge it can be racing with inode flushing and AIL tail pushing, which means that inodes that it can't get the flush lock on might not be attached to the buffer after the in-memory inode scan due to IO completion occurring. This is actually documented in the code as "needs better interlocking". i.e. this is a zero-day bug. Effectively, the in-memory scan must be done while the inode buffer is locked and IO cannot be issued on it while we do the in-memory inode scan. This ensures that inodes we couldn't get the flush lock on are guaranteed to be attached to the cluster buffer, so we can then catch all in-memory inodes and mark them stale. Now that the inode cluster buffer is locked before the in-memory scan is done, there is no need for the two-phase update of the in-memory inodes, so simplify the code into two loops and remove the allocation of the temporary buffer used to hold locked inodes across the phases. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2010-06-03 16:22:29 +10:00
Theodore Ts'o	1f5a81e41f	ext4: Make sure the MOVE_EXT ioctl can't overwrite append-only files Dan Rosenberg has reported a problem with the MOVE_EXT ioctl. If the donor file is an append-only file, we should not allow the operation to proceed, lest we end up overwriting the contents of an append-only file. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Dan Rosenberg <dan.j.rosenberg@gmail.com>	2010-06-02 22:04:39 -04:00
Sage Weil	558d3499bd	ceph: fix f_namelen reported by statfs We were setting f_namelen in kstatfs to PATH_MAX instead of NAME_MAX. That disagrees with ceph_lookup behavior (which checks against NAME_MAX), and also makes the pjd posix test suite spit out ugly errors because it can't clean up its temporary files. Signed-off-by: Sage Weil <sage@newdream.net>	2010-06-01 16:56:03 -07:00
Yehuda Sadeh	205475679a	ceph: fix memory leak in statfs Freeing the statfs request structure when required. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-06-01 16:56:02 -07:00
Henry C Chang	13a4214cd9	ceph: fix d_subdirs ordering problem We misused list_move_tail() to order the dentry in d_subdirs. This will screw up the d_subdirs order. This bug can be reliably reproduced by: 1. mount ceph fs. 2. on ceph fs, git clone git://ceph.newdream.net/git/ceph.git 3. Run autogen.sh in ceph directory. (Note: Errors only occur at the first time you run autogen.sh.) Signed-off-by: Henry C Chang <henry_c_chang@tcloudcomputing.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-06-01 16:55:55 -07:00
Christoph Hellwig	b160fdabe9	nfsd: nfsd_setattr needs to call commit_metadata The conversion of write_inode_now calls to commit_metadata in commit `f501912a35` missed out the call in nfsd_setattr. But without this conversion we can't guarantee that a SETATTR request has actually been committed to disk with XFS, which causes a regression from 2.6.32 (only for NFSv2, but anyway). Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>	2010-06-01 19:17:50 -04:00
Dan Carpenter	08a66859e6	FS-Cache: Remove unneeded null checks fscache_write_op() makes unnecessary checks of the page variable to see if it is NULL. It can't be NULL at those points as the kernel would already have crashed a little higher up where we examined page->index. Furthermore, unless radix_tree_gang_lookup_tag() can return 1 but no page, a NULL pointer crash should not be encountered there as we can only get there if r_t_g_l_t() returned 1. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-01 13:32:11 -07:00
Jeff Layton	06b43672a9	cifs: fix page refcount leak Commit `315e995c63` is causing OOM kills when stress-testing a CIFS filesystem. The VFS readpages operation takes a page reference. The older code just handed this reference off to the page cache, but the new code takes an extra one. The simplest fix is to put the new reference after add_to_page_cache_lru. Signed-off-by: Jeff Layton <jlayton@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Steve French <sfrench@us.ibm.com>	2010-06-01 17:15:52 +00:00
Denis Kirjanov	037776fcbe	AFS: Fix possible null pointer dereference in afs_alloc_server() Fix a possible null pointer dereference in afs_alloc_server(): the server pointer is NULL if there was an allocation failure, and under such a condition, we can't dereference it in the _leave() statement. Signed-off-by: Denis Kirjanov <dkirjanov@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-01 09:26:36 -07:00
Takuya Yoshikawa	e30c7c3b30	binfmt_elf_fdpic: Fix clear_user() error handling clear_user() returns the number of bytes that could not be copied rather than an error code. So we should return -EFAULT rather than directly returning the results. Without this patch, positive values may be returned to elf_fdpic_map_file() and the following error handlings do not function as expected. 1. ret = elf_fdpic_map_file_constdisp_on_uclinux(params, file, mm); if (ret < 0) return ret; 2. ret = elf_fdpic_map_file_by_direct_mmap(params, file, mm); if (ret < 0) return ret; Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Mike Frysinger <vapier@gentoo.org> CC: Alexander Viro <viro@zeniv.linux.org.uk> CC: Andrew Morton <akpm@linux-foundation.org> CC: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com> CC: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-06-01 08:11:06 -07:00
Jens Axboe	b4ca761577	Merge branch 'master' into for-linus Conflicts: fs/pipe.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-01 12:42:12 +02:00
Jens Axboe	0e3c9a2284	Revert "writeback: fix WB_SYNC_NONE writeback from umount" This reverts commit `e913fc825d`. We are investigating a hang associated with the WB_SYNC_NONE changes, so revert them for now. Conflicts: fs/fs-writeback.c mm/page-writeback.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-01 11:08:43 +02:00
Jens Axboe	f17625b318	Revert "writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync" This reverts commit `7c8a3554c6`. We are investigating a hang associated with the WB_SYNC_NONE changes, so revert them for now. Signed-off-by: Jens Axboe <jaxboe@fusionio.com>	2010-06-01 11:05:22 +02:00
Ryusuke Konishi	c29684d683	nilfs2: remove obsolete declarations of cache constructor and destructor The commit `41c88bd7` ("nilfs2: cleanup multi kmem_cache_{create,destroy} code") consolidated slab constructors and destructors used in nilfs, but it left some declarations in header files. This gets rid of the obsolete declarations. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-05-31 20:50:29 +09:00
Ryusuke Konishi	84cb099985	nilfs2: fix style issue in nilfs_destroy_cachep This gets rid of unwanted space chars in front of conditional sentences of nilfs_destroy_cachep(). Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>	2010-05-31 20:50:29 +09:00
Linus Torvalds	003386fff3	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: mm: export generic_pipe_buf_() to modules fuse: support splice() reading from fuse device fuse: allow splice to move pages mm: export remove_from_page_cache() to modules mm: export lru_cache_add_() to modules fuse: support splice() writing to fuse device fuse: get page reference for readpages fuse: use get_user_pages_fast() fuse: remove unneeded variable	2010-05-30 09:16:14 -07:00
Linus Torvalds	d28619f156	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: quota: Convert quota statistics to generic percpu_counter ext3 uses rb_node = NULL; to zero rb_root. quota: Fixup dquot_transfer reiserfs: Fix resuming of quotas on remount read-write pohmelfs: Remove dead quota code ufs: Remove dead quota code udf: Remove dead quota code quota: rename default quotactl methods to dquot_ quota: explicitly set ->dq_op and ->s_qcop quota: drop remount argument to ->quota_on and ->quota_off quota: move unmount handling into the filesystem quota: kill the vfs_dq_off and vfs_dq_quota_on_remount wrappers quota: move remount handling into the filesystem ocfs2: Fix use after free on remount read-only Fix up conflicts in fs/ext4/super.c and fs/ufs/file.c	2010-05-30 09:11:11 -07:00
Linus Torvalds	b612a05537	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: clean up on forwarded aborted mds request ceph: fix leak of osd authorizer ceph: close out mds, osd connections before stopping auth ceph: make lease code DN specific fs/ceph: Use ERR_CAST ceph: renew auth tickets before they expire ceph: do not resend mon requests on auth ticket renewal ceph: removed duplicated #includes ceph: avoid possible null dereference ceph: make mds requests killable, not interruptible sched: add wait_for_completion_killable_timeout	2010-05-30 08:56:39 -07:00
Sage Weil	2a8e5e3637	ceph: clean up on forwarded aborted mds request If an mds request is aborted (timeout, SIGKILL), it is left registered to keep our state in sync with the mds. If we get a forward notification, though, we know the request didn't succeed and we can unregister it safely. We were trying to resend it, but then bailing out (and not unregistering) in __do_request. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:42:05 -07:00
Sage Weil	79494d1b9b	ceph: fix leak of osd authorizer Release the ceph_authorizer when releasing osd state. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:42:04 -07:00
Sage Weil	a922d38fd1	ceph: close out mds, osd connections before stopping auth The auth module (part of the mon_client) is needed to free any ceph_authorizer(s) used by the mds and osd connections. Flush the msgr workqueue before stopping monc to ensure that the destroy_authorizer auth op is available when those connections are closed out. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:42:03 -07:00
Sage Weil	dd1c905736	ceph: make lease code DN specific The lease code includes a mask in the CEPH_LOCK_* namespace, but that namespace is changing, and only one mask (formerly _DN == 1) is used, so hard code for that value for now. If we ever extend this code to handle leases over different data types we can extend it accordingly. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:42 -07:00
Julia Lawall	7e34bc524e	fs/ceph: Use ERR_CAST Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more clear what is the purpose of the operation, which otherwise looks like a no-op. In the case of fs/ceph/inode.c, ERR_CAST is not needed, because the type of the returned value is the same as the type of the enclosing function. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ type T; T x; identifier f; @@ T f (...) { <+... - ERR_PTR(PTR_ERR(x)) + x ...+> } @@ expression x; @@ - ERR_PTR(PTR_ERR(x)) + ERR_CAST(x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:41 -07:00
Sage Weil	a41359fa35	ceph: renew auth tickets before they expire We were only requesting renewal after our tickets expire; do so before that. Most of the low-level logic for this was already there; just use it. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:39 -07:00
Sage Weil	09c4d6a7d4	ceph: do not resend mon requests on auth ticket renewal We only want to send pending mon requests when we successfully authenticate. If we are already authenticated, like when we renew our ticket, there is no need to resend pending requests. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:38 -07:00
Andrea Gelmini	984c76908e	ceph: removed duplicated #includes fs/ceph/auth.c: linux/slab.h is included more than once. fs/ceph/super.h: linux/slab.h is included more than once. Acked-by: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:37 -07:00
Sage Weil	e95e9a7ae4	ceph: avoid possible null dereference ac->ops may be null; use protocol id in error message instead. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:36 -07:00
Sage Weil	aa91647c89	ceph: make mds requests killable, not interruptible The underlying problem is that many mds requests can't be restarted. For example, a restarted create() would return -EEXIST if the original request succeeds. However, we do not want a hung MDS to hang the client too. So, use the _killable wait_for_completion variants to abort on SIGKILL but nothing else. Signed-off-by: Sage Weil <sage@newdream.net>	2010-05-29 09:12:35 -07:00
Linus Torvalds	9a90e09854	Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6 * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-acpi-2.6: (27 commits) ACPI: Don't let acpi_pad needlessly mark TSC unstable drivers/acpi/sleep.h: Checkpatch cleanup ACPI: Minor cleanup eliminating redundant PMTIMER_TICKS to NS conversion ACPI: delete unused c-state promotion/demotion data strucutures ACPI: video: fix acpi_backlight=video ACPI: EC: Use kmemdup drivers/acpi: use kasprintf ACPI, APEI, EINJ injection parameters support Add x64 support to debugfs ACPI, APEI, Use ERST for persistent storage of MCE ACPI, APEI, Error Record Serialization Table (ERST) support ACPI, APEI, Generic Hardware Error Source memory error support ACPI, APEI, UEFI Common Platform Error Record (CPER) header Unified UUID/GUID definition ACPI Hardware Error Device (PNP0C33) support ACPI, APEI, PCIE AER, use general HEST table parsing in AER firmware_first setup ACPI, APEI, Document for APEI ACPI, APEI, EINJ support ACPI, APEI, HEST table parsing ACPI, APEI, APEI supporting infrastructure ...	2010-05-28 14:42:18 -07:00
Christoph Hellwig	fb3b504ade	xfs: fix access to upper inodes without inode64 If a filesystem is mounted without the inode64 mount option we should still be able to access inodes not fitting into 32 bits, just not create new ones. For this to work we need to make sure the inode cache radix tree is initialized for all allocation groups, not just those we plan to allocate inodes from. This patch makes sure we initialize the inode cache radix tree for all allocation groups, and also cleans xfs_initialize_perag up a bit to separate the inode32 logic from the general perag structure setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:56 -05:00
Dave Chinner	9b98b6f3e1	xfs: fix might_sleep() warning when initialising per-ag tree The use of radix_tree_preload() only works if the radix tree was initialised without the __GFP_WAIT flag. The per-ag tree uses GFP_NOFS, so does not trigger allocation of new tree nodes from the preloaded array. Hence it enters the allocator with a spinlock held and triggers the might_sleep() warnings. Reported-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:50 -05:00
Julia Lawall	38e712ab3d	fs/xfs/quota: Add missing mutex_unlock Add a mutex_unlock missing on the error path. The use of this lock is balanced elsewhere in the file. The semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression E1; @@ * mutex_lock(E1,...); <+... when != E1 if (...) { ... when != E1 * return ...; } ...+> * mutex_unlock(E1,...); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:41 -05:00
Huang Weiyi	3bd0946eb1	xfs: remove duplicated #include Remove duplicated #include('s) in fs/xfs/linux-2.6/xfs_quotaops.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:36 -05:00
Li Zefan	f8adb4d574	xfs: convert more trace events to DEFINE_EVENT Use DECLARE_EVENT_CLASS, and save ~15K: text data bss dec hex filename 171949 43028 48 215025 347f1 fs/xfs/linux-2.6/xfs_trace.o.orig 156521 43028 36 199585 30ba1 fs/xfs/linux-2.6/xfs_trace.o No change in functionality. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:31 -05:00
Huang Weiyi	292ec4cf35	xfs: xfs_trace.c: remove duplicated #include Remove duplicated #include('s) in fs/xfs/linux-2.6/xfs_trace.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Reviewed-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:24 -05:00
Dave Chinner	07f1a4f5e8	xfs: Check new inode size is OK before preallocating The new xfsqa test 228 tries to preallocate more space than the filesystem contains. It should fail, but instead triggers an assert about lock flags. The failure is due to the size extension failing in vmtruncate() due to rlimit being set. Check this before we start the preallocation to avoid allocating space that will never be used. Also the path through xfs_vn_allocate already holds the IO lock, so it should not be present in the lock flags when the setattr fails. Hence the assert needs to take this into account. This will prevent other such callers from hitting this incorrect ASSERT. (Fixed a reference to "newsize" to read "new_size". -Alex) Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 15:19:12 -05:00
Christoph Hellwig	fdc07f44c8	xfs: clean up xlog_align Add suggested cleanups to commit 29db3370a1369541d58d692fbfb168b8a0bd7f41 from review that didn't end up being committed. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 14:58:36 -05:00
Christoph Hellwig	025101dca4	xfs: cleanup log reservation calculactions Instead of having small helper functions calling big macros, do the calculations for the log reservations directly in the functions. These are mostly 1:1 from the macros except that the macros kept the quota calculations in their callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 14:58:30 -05:00
Eric Sandeen	32891b292d	xfs: be more explicit if RT mount fails due to config Recent testers were slightly confused that a realtime mount failed due to missing CONFIG_XFS_RT; we can make that a little more obvious. V2: drop the else as suggested by Christoph Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 14:58:24 -05:00
Eric Sandeen	657a4cffde	xfs: replace E2BIG with EFBIG where appropriate Many places in the xfs code return E2BIG when they really mean EFBIG; trying to grow past 16T on a 32 bit machine, for example, says "Argument list too long" rather than "File too large" which is not particularly helpful. Some of these don't make perfect sense as EFBIG either, but still better than E2BIG IMHO. Signed-off-by: Eric Sandeen <sandeen@sandeen.net> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-28 14:58:16 -05:00
Al Viro	49837a80b3	remove detritus left by "mm: make read_cache_page synchronous" gets minix get_dir_page() in sync with its analogs; back in 2007 Nick has switched read_cache_page() and friends to sync behaviour (i.e. they wait for the page to get unlocked, check if it's uptodate and if it isn't return ERR_PTR(-EIO) instead) and removed the duplicate logics from the callers. In case of fs/minix/dir.c he'd removed only half of that... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-28 11:37:41 -04:00
Al Viro	4c9002de32	fix fs/sysv s_dirt handling got broken on ->sync_fs() conversion a year ago, nobody noticed... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:16:05 -04:00
npiggin@suse.de	459f6ed3b8	fat: convert to use the new truncate convention. Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:16:02 -04:00
npiggin@suse.de	737f2e93b9	ext2: convert to use the new truncate convention. I also have commented a possible bug in existing ext2 code, marked with XXX. Cc: linux-ext4@vger.kernel.org Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:15:57 -04:00
Nick Piggin	3322e79a38	fs: convert simple fs to new truncate Convert simple filesystems: ramfs, configfs, sysfs, block_dev to new truncate sequence. Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:15:47 -04:00
npiggin@suse.de	15c6fd9786	kill spurious reference to vmtruncate Lots of filesystems call vmtruncate despite not implementing the old ->truncate method. Switch them to use simple_setsize and add some comments about the truncate code where it seems fitting. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:15:42 -04:00
npiggin@suse.de	7bb46a6734	fs: introduce new truncate sequence Introduce a new truncate calling sequence into fs/mm subsystems. Rather than setattr > vmtruncate > truncate, have filesystems call their truncate sequence from ->setattr if filesystem specific operations are required. vmtruncate is deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced previously should be used. simple_setattr is introduced for simple in-ram filesystems to implement the new truncate sequence. Eventually all filesystems should be converted to implement a setattr, and the default code in notify_change should go away. simple_setsize is also introduced to perform just the ATTR_SIZE portion of simple_setattr (ie. changing i_size and trimming pagecache). To implement the new truncate sequence: - filesystem specific manipulations (eg freeing blocks) must be done in the setattr method rather than ->truncate. - vmtruncate can not be used by core code to trim blocks past i_size in the event of write failure after allocation, so this must be performed in the fs code. - convert usage of helpers block_write_begin, nobh_write_begin, cont_write_begin, and blockdev_direct_IO to use _newtrunc postfixed variants. These avoid calling vmtruncate to trim blocks (see previous). - inode_setattr should not be used. generic_setattr is a new function to be used to copy simple attributes into the generic inode. - make use of the better opportunity to handle errors with the new sequence. Big problem with the previous calling sequence: the filesystem is not called until i_size has already changed. This means it is not allowed to fail the call, and also it does not know what the previous i_size was. Also, generic code calling vmtruncate to truncate allocated blocks in case of error had no good way to return a meaningful error (or, for example, atomically handle block deallocation). Cc: Christoph Hellwig <hch@lst.de> Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:15:33 -04:00
Randy Dunlap	7000d3c424	fs/super: fix kernel-doc warning Fix fs/super.c kernel-doc warning and function notation: Warning(fs/super.c:957): No description found for parameter 'sb' Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:06:23 -04:00
Erik van der Kouwe	0ab7620a0c	fs/minix: bugfix, number of indirect block ptrs per block depends on block size The MINIX filesystem driver used a constant number of indirect block pointers in an indirect block. This worked only for filesystems with 1kb block, while the MINIX default block size is now 4kb. As a consequence, large files were read incorrectly on such filesystems and writing a large file would cause the filesystem to become corrupted. This patch computes the number of indirect block pointers based on the block size, making the driver work for each block size. I would like to thank Feiran Zheng ('Fam') for pointing out the cause of the corruption. Signed-off-by: Erik van der Kouwe <vdkouwe@cs.vu.nl> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:06:22 -04:00
Christoph Hellwig	1b061d9247	rename the generic fsync implementations We don't name our generic fsync implementations very well currently. The no-op implementation for in-memory filesystems currently is called simple_sync_file which doesn't make too much sense to start with, and the generic one for simple filesystems is called simple_fsync which can lead to some confusion. This patch renames the generic file fsync method to generic_file_fsync to match the other generic_file_* routines it is supposed to be used with, and the no-op implementation to noop_fsync to make it obvious what to expect. In addition add some documentation for both methods. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:06:06 -04:00
Christoph Hellwig	7ea8085910	drop unused dentry argument to ->fsync Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:05:02 -04:00
Julia Lawall	cc967be547	fs: Add missing mutex_unlock Add a mutex_unlock missing on the error path. At other exits from the function that return an error flag, the mutex is unlocked, so do the same here. The semantic match that finds this problem is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression E1; @@ * mutex_lock(E1,...); <+... when != E1 if (...) { ... when != E1 * return ...; } ...+> * mutex_unlock(E1,...); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:03:09 -04:00
Al Viro	d7065da038	get rid of the magic around f_count in aio __aio_put_req() plays sick games with file refcount. What it wants is fput() from atomic context; it's almost always done with f_count > 1, so they only have to deal with delayed work in rare cases when their reference happens to be the last one. Current code decrements f_count and if it hasn't hit 0, everything is fine. Otherwise it keeps a pointer to struct file (with zero f_count!) around and has delayed work do __fput() on it. Better way to do it: use atomic_long_add_unless( , -1, 1) instead of !atomic_long_dec_and_test(). IOW, decrement it only if it's not the last reference, leave refcount alone if it was. And use normal fput() in delayed work. I've made that atomic_long_add_unless call a new helper - fput_atomic(). Drops a reference to file if it's safe to do in atomic (i.e. if that's not the last one), tells if it had been able to do that. aio.c converted to it, __fput() use is gone. req->ki_file always contributes to refcount now. And __fput() became static. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:03:07 -04:00
Neil Brown	176306f59a	VFS: fix recent breakage of FS_REVAL_DOT Commit `1f36f774b2` broke FS_REVAL_DOT semantics. In particular, before this patch, the command ls -l in an NFS mounted directory would always check if the directory on the server had changed and if so would flush and refill the pagecache for the dir. After this patch, the same "ls -l" will repeatedly return stale data until the cached attributes for the directory time out. The following patch fixes this by ensuring the d_revalidate is called by do_last when "." is being looked-up. link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN is not set so nfs_lookup_verify_inode chooses not to do any validation. The following patch restores the original behaviour. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2010-05-27 22:03:06 -04:00
Al Viro	1eb2cbb6d5	Revert "anon_inode: set S_IFREG on the anon_inode" This reverts commit `a7cf4145bb`.	2010-05-27 22:03:05 -04:00
Linus Torvalds	105a048a4f	Merge git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable * git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (27 commits) Btrfs: add more error checking to btrfs_dirty_inode Btrfs: allow unaligned DIO Btrfs: drop verbose enospc printk Btrfs: Fix block generation verification race Btrfs: fix preallocation and nodatacow checks in O_DIRECT Btrfs: avoid ENOSPC errors in btrfs_dirty_inode Btrfs: move O_DIRECT space reservation to btrfs_direct_IO Btrfs: rework O_DIRECT enospc handling Btrfs: use async helpers for DIO write checksumming Btrfs: don't walk around with task->state != TASK_RUNNING Btrfs: do aio_write instead of write Btrfs: add basic DIO read/write support direct-io: do not merge logically non-contiguous requests direct-io: add a hook for the fs to provide its own submit_bio function fs: allow short direct-io reads to be completed via buffered IO Btrfs: Metadata ENOSPC handling for balance Btrfs: Pre-allocate space for data relocation Btrfs: Metadata ENOSPC handling for tree log Btrfs: Metadata reservation for orphan inodes Btrfs: Introduce global metadata reservation ...	2010-05-27 10:43:44 -07:00
Linus Torvalds	e4ce30f377	Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (40 commits) ext4: Make fsync sync new parent directories in no-journal mode ext4: Drop whitespace at end of lines ext4: Fix compat EXT4_IOC_ADD_GROUP ext4: Conditionally define compat ioctl numbers tracing: Convert more ext4 events to DEFINE_EVENT ext4: Add new tracepoints to track mballoc's buddy bitmap loads ext4: Add a missing trace hook ext4: restart ext4_ext_remove_space() after transaction restart ext4: Clear the EXT4_EOFBLOCKS_FL flag only when warranted ext4: Avoid crashing on NULL ptr dereference on a filesystem error ext4: Use bitops to read/modify i_flags in struct ext4_inode_info ext4: Convert calls of ext4_error() to EXT4_ERROR_INODE() ext4: Convert callers of ext4_get_blocks() to use ext4_map_blocks() ext4: Add new abstraction ext4_map_blocks() underneath ext4_get_blocks() ext4: Use our own write_cache_pages() ext4: Show journal_checksum option ext4: Fix for ext4_mb_collect_stats() ext4: check for a good block group before loading buddy pages ext4: Prevent creation of files larger than RLIMIT_FSIZE using fallocate ext4: Remove extraneous newlines in ext4_msg() calls ... Fixed up trivial conflict in fs/ext4/fsync.c	2010-05-27 10:26:37 -07:00
Linus Torvalds	ade61088bc	Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 * 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: NFS: Fix another nfs_wb_page() deadlock NFS: Ensure that we mark the inode as dirty if we exit early from commit NFS: Fix a lock imbalance typo in nfs_access_cache_shrinker sunrpc: fix leak on error on socket xprt setup	2010-05-27 10:18:44 -07:00
Dmitry Monakhov	f32764bd2b	quota: Convert quota statistics to generic percpu_counter Generic per-cpu counter has some memory overhead but it is negligible for modern systems and embedded systems compile without quota support. And code reuse is a good thing. This patch should fix the complaint from preemptive kernels which was introduced by `dde9588853`. [Jan Kara: Fixed patch to work on 32-bit archs as well] Reported-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-27 18:56:27 +02:00
Jan Blunck	ca572727db	fs/: do not fallback to default_llseek() when readdir() uses BKL Do not use the fallback default_llseek() if the readdir operation of the filesystem still uses the big kernel lock. Since llseek() modifies file->f_pos of the directory directly it may need locking to not confuse readdir which usually uses file->f_pos directly as well. Since the special characteristics of the BKL (unlocked on schedule) are not necessary in this case, the inode mutex can be used for locking as provided by generic_file_llseek(). This is only possible since all filesystems, except reiserfs, either use a directory as a flat file or with disk address offsets. Reiserfs on the other hand uses a 32bit hash of the filename as the offset so generic_file_llseek() can get used as well since the hash is always smaller than sb->s_maxbytes (= (512 << 32) - blocksize). Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Anders Larsen <al@alarsen.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:56 -07:00
Jan Blunck	ae6afc3f5c	vfs: introduce noop_llseek() This is an implementation of ->llseek usable for the rare special case when userspace expects the seek to succeed but the (device) file is actually not able to perform the seek. In this case you use noop_llseek() instead of falling back to the default implementation of ->llseek. Signed-off-by: Jan Blunck <jblunck@suse.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:56 -07:00
Jeff Moyer	9d85cba718	aio: fix the compat vectored operations The aio compat code was not converting the struct iovecs from 32bit to 64bit pointers, causing either EINVAL to be returned from io_getevents, or EFAULT as the result of the I/O. This patch passes a compat flag to io_submit to signal that pointer conversion is necessary for a given iocb array. A variant of this was tested by Michael Tokarev. I have also updated the libaio test harness to exercise this code path with good success. Further, I grabbed a copy of ltp and ran the testcases/kernel/syscall/readv and writev tests there (compiled with -m32 on my 64bit system). All seems happy, but extra eyes on this would be welcome. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix CONFIG_COMPAT=n build] Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Michael Tokarev <mjt@tls.msk.ru> Cc: Zach Brown <zach.brown@oracle.com> Cc: <stable@kernel.org> [2.6.35.1] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:53 -07:00
Jeff Moyer	b83733639a	compat: factor out compat_rw_copy_check_uvector from compat_do_readv_writev It was reported in http://lkml.org/lkml/2010/3/8/309 that 32 bit readv and writev AIO operations were not functioning properly. It turns out that the code to convert the 32bit io vectors to 64 bits was never written. The results of that can be pretty bad, but in my testing, it mostly ended up in generating EFAULT as we walked off the list of I/O vectors provided. This patch set fixes the problem in my environment. are greatly appreciated. This patch: Factor out code that will be used by both compat_do_readv_writev and the compat aio submission code paths. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Michael Tokarev <mjt@tls.msk.ru> Cc: Zach Brown <zach.brown@oracle.com> Cc: <stable@kernel.org> [2.6.35.1] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:53 -07:00
Julia Lawall	cccad8f9f0	fs/affs: use ERR_CAST Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more clear what is the purpose of the operation, which otherwise looks like a no-op. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ type T; T x; identifier f; @@ T f (...) { <+... - ERR_PTR(PTR_ERR(x)) + x ...+> } @@ expression x; @@ - ERR_PTR(PTR_ERR(x)) + ERR_CAST(x) // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:53 -07:00
Wu Fengguang	36e15263aa	kcore: add _text to KCORE_TEXT Extend KCORE_TEXT to cover the pages between _text and _stext, to allow examining some important page table pages. `readelf -a` output on x86_64 before and after patch: Type Offset VirtAddr PhysAddr before LOAD 0x00007fff8100c000 0xffffffff81009000 0x0000000000000000 after LOAD 0x00007fff81003000 0xffffffff81000000 0x0000000000000000 The newly covered pages are: 0xffffffff81000000 <startup_64> etc. 0xffffffff81001000 <init_level4_pgt> 0xffffffff81002000 <level3_ident_pgt> 0xffffffff81003000 <level3_kernel_pgt> 0xffffffff81004000 <level2_fixmap_pgt> 0xffffffff81005000 <level1_fixmap_pgt> 0xffffffff81006000 <level2_ident_pgt> 0xffffffff81007000 <level2_kernel_pgt> 0xffffffff81008000 <level2_spare_pgt> Before patch, /proc/kcore shows outdated contents for the above page table pages, for example: (gdb) p level3_ident_pgt $1 = {<text variable, no debug info>} 0xffffffff81002000 <level3_ident_pgt> (gdb) p/x ((pud_t )&level3_ident_pgt)@512 $2 = {{pud = 0x1006063}, {pud = 0x0} <repeats 511 times>} while the real content is: root@hp /home/wfg# hexdump -s 0x1002000 -n 4096 /dev/mem 1002000 6063 0100 0000 0000 8067 0000 0000 0000 1002010 0000 0000 0000 0000 0000 0000 0000 0000 * 1003000 That is, on a x86_64 box with 2GB memory, we can see first-1GB / full-2GB identity mapping before/after patch: (gdb) p/x ((pud_t )&level3_ident_pgt)@512 before $1 = {{pud = 0x1006063}, {pud = 0x0} <repeats 511 times>} after $1 = {{pud = 0x1006063}, {pud = 0x8067}, {pud = 0x0} <repeats 510 times>} Obviously the content before patch is wrong. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Amerigo Wang	57f87869f0	proc: remove obsolete comments A quick test shows these comments are obsolete, so just remove them. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Dan Carpenter	73d3646029	proc: cleanup: remove unused assignments I removed 3 unused assignments. The first two get reset on the first statement of their functions. For "err" in root.c we don't return an error and we don't use the variable again. Signed-off-by: Dan Carpenter <error27@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Oleg Nesterov	7e49827cc9	proc: get_nr_threads() doesn't need ->siglock any longer Now that task->signal can't go away get_nr_threads() doesn't need ->siglock to read signal->count. Also, make it inline, move into sched.h, and convert 2 other proc users of signal->count to use this (now trivial) helper. Henceforth get_nr_threads() is the only valid user of signal->count, we are ready to turn it into "int nr_threads" or, perhaps, kill it. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:47 -07:00
Oleg Nesterov	d344193a05	exit: avoid sig->count in de_thread/__exit_signal synchronization de_thread() and __exit_signal() use signal_struct->count/notify_count for synchronization. We can simplify the code and use ->notify_count only. Instead of comparing these two counters, we can change de_thread() to set ->notify_count = nr_of_sub_threads, then change __exit_signal() to dec-and-test this counter and notify group_exit_task. Note that __exit_signal() checks "notify_count > 0" just for symmetry with exit_notify(), we could just check it is != 0. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Roland McGrath <roland@redhat.com> Cc: Veaceslav Falico <vfalico@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:46 -07:00
Oleg Nesterov	269b005a28	coredump: shift down_write(mmap_sem) into coredump_wait() - move the cprm.mm_flags checks up, before we take mmap_sem - move down_write(mmap_sem) and ->core_state check from do_coredump() to coredump_wait() This simplifies the code and makes the locking symmetrical. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	5e43aef530	coredump: factor out put_cred() calls Given that do_coredump() calls put_cred() on exit path, it is a bit ugly to do put_cred() + "goto fail" twice, just add the new "fail_creds" label. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	d5bf4c4f5f	coredump: cleanup "ispipe" code - kill "int dump_count", argv_split(argcp) accepts argcp == NULL. - move "int dump_count" under " if (ispipe)" branch, fail_dropcount can check ispipe. - move "char **helper_argv" as well, change the code to do argv_free() right after call_usermodehelper_fns(). - If call_usermodehelper_fns() fails goto close_fail label instead of closing the file by hand. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Oleg Nesterov	c713541125	coredump: factor out the not-ispipe file checks do_coredump() does a lot of file checks after it opens the file or calls usermode helper. But all of these checks are only needed in !ispipe case. Move this code into the "else" branch and kill the ugly repetitive ispipe checks. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Roland McGrath <roland@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:45 -07:00
Neil Horman	898b374af6	exec: replace call_usermodehelper_pipe with use of umh init function and resolve limit The first patch in this series introduced an init function to the call_usermodehelper api so that processes could be customized by caller. This patch takes advantage of that fact, by customizing the helper in do_coredump to create the pipe and set its core limit to one (for our recursion check). This lets us clean up the previous ugliness in the usermodehelper internals and factor call_usermodehelper out entirely. While I'm at it, we can also modify the helper setup to look for a core limit value of 1 rather than zero for our recursion check. Signed-off-by: Neil Horman <nhorman@tuxdriver.com> Reviewed-by: Oleg Nesterov <oleg@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:44 -07:00
Thomas Stewart	d27d7a9a78	ufs: permit mounting of BorderWare filesystems I recently had to recover some files from an old broken machine that was running BorderWare Document Gateway. It's basically a drop in web server for sharing files. From the look of the init process and using strings on a few files it seems to be based on FreeBSD 3.3. The process turned out to be more difficult than I imagined, but to cut a long story short BorderWare in their wisdom use a nonstandard magic number in their UFS (ufstype=44bsd) file systems. Thus Linux refuses to mount the file systems in order to recover the data. After a bit of hunting I was able to make a quick fix to fs/ufs/super.c in order to detect the new magic number. I assume that this number is the same for all installations. It's quite easy to find out from ufs_fs.h. The superblock sits 8k into the block device and the magic number is 1372 bytes into the superblock struct. # dd if=/dev/sda5 skip=$(( 8192 + 1372 )) bs=1 count=4 2> /dev/null | hd 00000000 97 26 24 0f |.&$.| # Signed-off-by: Thomas Stewart <thomas@stewarts.org.uk> Cc: Evgeniy Dushistov <dushistov@mail.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:43 -07:00
Julia Lawall	7ca5ca60cb	fs/autofs4: use memdup_user Use memdup_user when user data is immediately copied into the allocated region. Elimination of the variable ads, which is no longer useful. The semantic patch that makes this change is as follows: (http://coccinelle.lip6.fr/) // <smpl> @@ expression from,to,size,flag; position p; identifier l1,l2; @@ - to = \(kmalloc@p\|kzalloc@p\)(size,flag); + to = memdup_user(from,size); if ( - to==NULL + IS_ERR(to) || ...) { <+... when != goto l1; - -ENOMEM + PTR_ERR(to) ...+> } - if (copy_from_user(to, from, size) != 0) { - <+... when != goto l2; - -EFAULT - ...+> - } // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Cc: Ian Kent <raven@themaw.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-27 09:12:41 -07:00
Venkatesh Pallipadi	1513b02c8b	ext3 uses rb_node = NULL; to zero rb_root. The problem with this is that `17d9ddc72f` ("rbtree: Add support for augmented rbtrees") in the linux-next tree adds a new field to that struct which needs to be NULL as well. This patch uses RB_ROOT as the initializer so all of the relevant fields will be NULL'd. Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Eric Paris <eparis@redhat.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-27 17:39:36 +02:00
Jan Kara	4dea496974	quota: Fixup dquot_transfer Commit `bc8e5f0739` had a typo which caused quota miscomputation when changing owner group of a file. Linus will hate me. Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-27 17:39:36 +02:00
Jan Kara	f4b113ae6f	reiserfs: Fix resuming of quotas on remount read-write When quota was suspended on remount-ro, finish_unfinished() will try to turn it on again (which fails) and also turns the quotas off on exit. Fix the function to check whether quotas are already on at function entry and do not turn them off in that case. CC: reiserfs-devel@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz>	2010-05-27 17:39:36 +02:00
Chris Mason	9aeead7378	Btrfs: add more error checking to btrfs_dirty_inode The ENOSPC code will now return ENOSPC to btrfs_start_transaction. btrfs_dirty_inode needs to check for this and error out appropriately. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-27 10:23:00 -04:00
Chris Mason	5a5f79b570	Btrfs: allow unaligned DIO In order to support DIO that isn't aligned to the filesystem blocksize, we fall back to buffered for any unaligned DIOs. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-26 21:35:35 -04:00
Chris Mason	933b585f70	Btrfs: drop verbose enospc printk Less printk is good printk. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-26 21:35:34 -04:00
Yan, Zheng	5bdd3536cb	Btrfs: Fix block generation verification race After the path is released, the generation number obtained from the block pointer is no longer valid. The race may cause disk corruption, because verify_parent_transid() calls clear_extent_buffer_uptodate() when generation numbers mismatch. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-26 21:35:33 -04:00
Chris Mason	46bfbb5c07	Btrfs: fix preallocation and nodatacow checks in O_DIRECT The O_DIRECT code wasn't checking for multiple references on preallocated or nodatacow extents. This means it wasn't honoring snapshots properly. The fix here is to add an explicit check for multiple references. This also fixes the math for selecting the correct disk block, making sure not to go past the end of the extent. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-26 21:34:45 -04:00
Linus Torvalds	63a6440326	Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus * git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-linus: squashfs: update documentation to include description of xattr layout squashfs: fix name reading in squashfs_xattr_get squashfs: constify xattr handlers squashfs: xattr fix sparse warnings squashfs: xattr_lookup sparse fix squashfs: add xattr support configure option squashfs: add new extended inode types squashfs: add support for xattr reading squashfs: add xattr id support	2010-05-26 08:57:20 -07:00
Andrew Morton	cc68e3be74	fs/fscache/object-list.c: fix warning on 32-bit fs/fscache/object-list.c: In function 'fscache_objlist_lookup': fs/fscache/object-list.c:105: warning: cast to pointer from integer of different size Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-26 08:19:23 -07:00
Chris Mason	94b604429a	Btrfs: avoid ENOSPC errors in btrfs_dirty_inode btrfs_dirty_inode tries to sneak in without much waiting or space reservation, mostly for performance reasons. This usually works well but can cause problems when there are many many writers. When btrfs_update_inode fails with ENOSPC, we fallback to a slower btrfs_start_transaction call that will reserve some space. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-26 11:02:00 -04:00
Chris Mason	3f7c579c41	Btrfs: move O_DIRECT space reservation to btrfs_direct_IO This moves the delalloc space reservation done for O_DIRECT into btrfs_direct_IO. This way we don't leak reserved space if the generic O_DIRECT write code errors out before it calls into btrfs_direct_IO. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-26 10:59:53 -04:00
Trond Myklebust	0522f6aded	NFS: Fix another nfs_wb_page() deadlock J.R. Okajima reports that the call to sync_inode() in nfs_wb_page() can deadlock with other writeback flush calls. It boils down to the fact that we cannot ever call writeback_single_inode() while holding a page lock (even if we do set nr_to_write to zero) since another process may already be waiting in the call to do_writepages(), and so will deny us the I_SYNC lock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-05-26 08:43:53 -04:00
Trond Myklebust	c5efa5fc91	NFS: Ensure that we mark the inode as dirty if we exit early from commit If we exit from nfs_commit_inode() without ensuring that the COMMIT rpc call has been completed, we must re-mark the inode as dirty. Otherwise, future calls to sync_inode() with the WB_SYNC_ALL flag set will fail to ensure that the data is on the disk. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-05-26 08:43:52 -04:00
Trond Myklebust	59844a9bd7	NFS: Fix a lock imbalance typo in nfs_access_cache_shrinker Commit `9c7e7e2337` (NFS: Don't call iput() in nfs_access_cache_shrinker) unintentionally removed the spin unlock for the inode->i_lock. Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>	2010-05-26 08:43:51 -04:00
Miklos Szeredi	51921cb746	mm: export generic_pipe_buf_*() to modules This is needed by fuse device code which wants to create pipe buffers. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-26 08:44:22 +02:00
Chris Mason	4845e44ffd	Btrfs: rework O_DIRECT enospc handling This changes O_DIRECT write code to mark extents as delalloc while it is processing them. Yan Zheng has reworked the enospc accounting based on tracking delalloc extents and this makes it much easier to track enospc in the O_DIRECT code. There are a few special cases with the O_DIRECT code though: it only sets the EXTENT_DELALLOC bits, instead of doing EXTENT_DELALLOC | EXTENT_DIRTY | EXTENT_UPTODATE, because we don't want to mess with clearing the dirty and uptodate bits when things go wrong. This is important because there are no pages in the page cache, so any extent state structs that we put in the tree won't get freed by releasepage. We have to clear them ourselves as the DIO ends. With this commit, we reserve space in btrfs_file_aio_write, and then as each btrfs_direct_IO call progresses it sets EXTENT_DELALLOC on the range. btrfs_get_blocks_direct is responsible for clearing the delalloc at the same time it drops the extent lock. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 21:52:08 -04:00
Kay Sievers	578454ff7e	driver core: add devname module aliases to allow module on-demand auto-loading This adds: alias: devname:<name> to some common kernel modules, which will allow the on-demand loading of the kernel module when the device node is accessed. Ideally all these modules would be compiled-in, but distros seem too much in love with their modularization that we need to cover the common cases with this new facility. It will allow us to remove a bunch of pretty useless init scripts and modprobes from init scripts. The static device node aliases will be carried in the module itself. The program depmod will extract this information to a file in the module directory: $ cat /lib/modules/2.6.34-00650-g537b60d-dirty/modules.devname # Device nodes to trigger on-demand module loading. microcode cpu/microcode c10:184 fuse fuse c10:229 ppp_generic ppp c108:0 tun net/tun c10:200 dm_mod mapper/control c10:235 Udev will pick up the depmod created file on startup and create all the static device nodes which the kernel modules specify, so that these modules get automatically loaded when the device node is accessed: $ /sbin/udevd --debug ... static_dev_create_from_modules: mknod '/dev/cpu/microcode' c10:184 static_dev_create_from_modules: mknod '/dev/fuse' c10:229 static_dev_create_from_modules: mknod '/dev/ppp' c108:0 static_dev_create_from_modules: mknod '/dev/net/tun' c10:200 static_dev_create_from_modules: mknod '/dev/mapper/control' c10:235 udev_rules_apply_static_dev_perms: chmod '/dev/net/tun' 0666 udev_rules_apply_static_dev_perms: chmod '/dev/fuse' 0666 A few device nodes are switched to statically allocated numbers, to allow the static nodes to work. This might also be useful for systems which still run a plain static /dev, which is completely unsafe to use with any dynamic minor numbers.
Note: The devname aliases must be limited to the common and single-instance device nodes, like the misc devices, and never be used for conceptually limited systems like the loop devices, which should rather get fixed properly and get a control node for losetup to talk to, instead of creating a random number of device nodes in advance, regardless if they are ever used. This facility is to hide the mess distros are creating with too modularized kernels, and just to hide that these modules are not compiled-in, and not to paper-over broken concepts. Thanks! :) Cc: Greg Kroah-Hartman <gregkh@suse.de> Cc: David S. Miller <davem@davemloft.net> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Chris Mason <chris.mason@oracle.com> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk> Cc: Ian Kent <raven@themaw.net> Signed-Off-By: Kay Sievers <kay.sievers@vrfy.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>	2010-05-25 15:08:26 -07:00
Linus Torvalds	f16a5e3478	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes * git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Fix permissions checking for setflags ioctl() GFS2: Don't "get" xattrs for ACLs when ACLs are turned off GFS2: Rework reclaiming unlinked dinodes	2010-05-25 08:17:51 -07:00
Linus Torvalds	110b93842e	Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfs * 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: Ensure inode allocation buffers are fully replayed xfs: enable background pushing of the CIL xfs: forced unmounts need to push the CIL xfs: Introduce delayed logging core code xfs: Delayed logging design documentation xfs: Improve scalability of busy extent tracking xfs: make the log ticket ID available outside the log infrastructure xfs: clean up log ticket overrun debug output xfs: Clean up XFS_BLI_* flag namespace xfs: modify buffer item reference counting xfs: allow log ticket allocation to take allocation flags xfs: Don't reuse the same transaction ID for duplicated transactions.	2010-05-25 08:17:01 -07:00
Huang Weiyi	337bbfdbff	smbfs: remove duplicated #include Remove duplicated #include('s) in fs/smbfs/symlink.c Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:07 -07:00
Andy Shevchenko	91f06e6680	fs: ldm: don't use own implementation of hex_to_bin() Remove own implementation of hex_to_bin(). Signed-off-by: Andy Shevchenko <ext-andriy.shevchenko@nokia.com> Cc: "Richard Russon (FlatCap)" <ldm@flatcap.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:06 -07:00
OGAWA Hirofumi	aaa04b4875	fatfs: ratelimit corruption report Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:04 -07:00
Minchan Kim	4c99000ac4	ntfs: use add_to_page_cache_lru() Quote from Nick Piggin about the btrfs patch - http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg04472.html. "add_to_page_cache_lru is exported, so it should be used. Benefits over using a private pagevec: neater code, 128 bytes fewer stack used, percpu lru ordering is preserved, and finally don't need to flush pagevec before returning so batching may be shared with other LRU insertions." Let's use it instead of private pagevec in ntfs, too. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Anton Altaparmakov <aia21@cantab.net> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:03 -07:00
Minchan Kim	2ec93b0bf3	ntfs: clean up ntfs_attr_extend_initialized cached_page and lru_pvec have not been used. Let's remove the arguments. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: Anton Altaparmakov <aia21@cantab.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:03 -07:00
Alexey Dobriyan	4be929be34	kernel-wide: replace USHORT_MAX, SHORT_MAX and SHORT_MIN with USHRT_MAX, SHRT_MAX and SHRT_MIN - C99 knows about USHRT_MAX/SHRT_MAX/SHRT_MIN, not USHORT_MAX/SHORT_MAX/SHORT_MIN. - Make SHRT_MIN of type s16, not int, for consistency. [akpm@linux-foundation.org: fix drivers/dma/timb_dma.c] [akpm@linux-foundation.org: fix security/keys/keyring.c] Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:02 -07:00
Richard Kennedy	58a9d3d8db	fs-writeback: check sync bit earlier in inode_wait_for_writeback When wb_writeback() hasn't written anything it will re-acquire the inode lock before calling inode_wait_for_writeback. This change tests the sync bit first so that it doesn't need to drop & re-acquire the lock if the inode became available while wb_writeback() was waiting to get the lock. Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:07:00 -07:00
Mel Gorman	a8bef8ff6e	mm: migration: avoid race between shift_arg_pages() and rmap_walk() during migration by not migrating temporary stacks Page migration requires rmap to be able to find all ptes mapping a page at all times, otherwise the migration entry can be instantiated, but it is possible to leave one behind if the second rmap_walk fails to find the page. If this page is later faulted, migration_entry_to_page() will call BUG because the page is locked indicating the page was migrated by the migration PTE not cleaned up. For example kernel BUG at include/linux/swapops.h:105! invalid opcode: 0000 [#1] PREEMPT SMP ... Call Trace: [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e [<ffffffff813099b5>] page_fault+0x25/0x30 [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b [<ffffffff8111329b>] search_binary_handler+0x173/0x313 [<ffffffff81114896>] do_execve+0x219/0x30a [<ffffffff8100a5c6>] sys_execve+0x43/0x5e [<ffffffff8100320a>] stub_execve+0x6a/0xc0 RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129 There is a race between shift_arg_pages and migration that triggers this bug. A temporary stack is setup during exec and later moved. If migration moves a page in the temporary stack and the VMA is then removed before migration completes, the migration PTE may not be found leading to a BUG when the stack is faulted. This patch causes pages within the temporary stack during exec to be skipped by migration. It does this by marking the VMA covering the temporary stack with an otherwise impossible combination of VMA flags. These flags are cleared when the temporary stack is moved to its final location. 
[kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:06:59 -07:00
Naoya Horiguchi	1a5cb81465	pagemap: add #ifdefs CONFIG_HUGETLB_PAGE on code walking hugetlb vma If !CONFIG_HUGETLB_PAGE, pagemap_hugetlb_range() is never called. So put it (and its calling function) into #ifdef block. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2010-05-25 08:06:58 -07:00
Chris Mason	eaf25d933e	Btrfs: use async helpers for DIO write checksumming The async helper threads offload crc work onto all the CPUs, and make streaming writes much faster. This changes the O_DIRECT write code to use them. The only small complication was that we need to pass in the logical offset in the file for each bio, because we can't find it in the bio's pages. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:58 -04:00
Chris Mason	ed3b3d314c	Btrfs: don't walk around with task->state != TASK_RUNNING Yan Zheng noticed two places we were doing a lot of work without task->state set to TASK_RUNNING. This sets the state properly after we get ready to sleep but decide not to. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:58 -04:00
Josef Bacik	11c65dccf7	Btrfs: do aio_write instead of write In order for AIO to work, we need to implement aio_write. This patch converts our btrfs_file_write to btrfs_aio_write. I've tested this with xfstests and nothing broke, and the AIO stuff magically started working. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:57 -04:00
Josef Bacik	4b46fce233	Btrfs: add basic DIO read/write support This provides basic DIO support for reading and writing. It does not do the work to recover from mismatching checksums, that will come later. A few design changes have been made from Jim's code (sorry Jim!) 1) Use the generic direct-io code. Jim originally re-wrote all the generic DIO code in order to account for all of BTRFS's oddities, but thanks to that work it seems like the best bet is to just ignore compression and such and just opt to fall back on buffered IO. 2) Fall back on buffered IO for compressed or inline extents. Jim's code did its own buffering to make dio with compressed extents work. Now we just fall back onto normal buffered IO. 3) Use ordered extents for the writes so that all of the lock_extent() lookup_ordered() type checks continue to work. 4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with DIO writes. I've tested this with fsx and everything works great. This patch depends on my dio and filemap.c patches to work. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:57 -04:00
Josef Bacik	c2c6ca417e	direct-io: do not merge logically non-contiguous requests Btrfs cannot handle having logically non-contiguous requests submitted. For example if you have Logical: [0-4095][HOLE][8192-12287] Physical: [0-4095] [4096-8191] Normally the DIO code would put these into the same BIO's. The problem is we need to know exactly what offset is associated with what BIO so we can do our checksumming and unlocking properly, so putting them in the same BIO doesn't work. So add another check where we submit the current BIO if the physical blocks are not contiguous OR the logical blocks are not contiguous. Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:56 -04:00
Josef Bacik	facd07b07d	direct-io: add a hook for the fs to provide its own submit_bio function Because BTRFS can do RAID and such, we need our own submit hook so we can setup the bio's in the correct fashion, and handle checksum errors properly. So there are a few changes here 1) The submit_io hook. This is straightforward, just call this instead of submit_bio. 2) Allow the fs to return -ENOTBLK for reads. Usually this has only worked for writes, since writes can fallback onto buffered IO. But BTRFS needs the option of falling back on buffered IO if it encounters a compressed extent, since we need to read the entire extent in and decompress it. So if we get -ENOTBLK back from get_block we'll return back and fallback on buffered just like the write case. I've tested these changes with fsx and everything seems to work. Thanks, Signed-off-by: Josef Bacik <josef@redhat.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:55 -04:00
Yan, Zheng	3fd0a5585e	Btrfs: Metadata ENOSPC handling for balance This patch adds metadata ENOSPC handling for the balance code. It consists of the following major changes: 1. Avoid COWing tree leaves in the tree-merging phase. 2. Handle interaction with snapshot creation. 3. Allow the backref cache to live across transactions. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:54 -04:00
Yan, Zheng	efa5646456	Btrfs: Pre-allocate space for data relocation Pre-allocate space for data relocation. This can detect ENOSPC conditions caused by fragmentation of free space. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:53 -04:00
Yan, Zheng	4a500fd178	Btrfs: Metadata ENOSPC handling for tree log Previous patches make the allocator return -ENOSPC if there is no unreserved free metadata space. This patch updates tree log code and various other places to propagate/handle the ENOSPC error. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:53 -04:00
Yan, Zheng	d68fc57b7e	Btrfs: Metadata reservation for orphan inodes reserve metadata space for handling orphan inodes Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:52 -04:00
Yan, Zheng	8929ecfa50	Btrfs: Introduce global metadata reservation Reserve metadata space for extent tree, checksum tree and root tree Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:52 -04:00
Yan, Zheng	0ca1f7ceb1	Btrfs: Update metadata reservation for delayed allocation Introduce metadata reservation context for delayed allocation and update various related functions. This patch also introduces EXTENT_FIRST_DELALLOC control bit for set/clear_extent_bit. It tells set/clear_bit_hook whether they are processing the first extent_state with EXTENT_DELALLOC bit set. This change is important if set/clear_extent_bit involves multiple extent_state. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:51 -04:00
Yan, Zheng	a22285a6a3	Btrfs: Integrate metadata reservation with start_transaction Besides simplifying the code, this change makes sure all metadata reservations for normal metadata operations are released after committing the transaction. Changes since V1: Add code that checks if unlink and rmdir will free space. Add ENOSPC handling for clone ioctl. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:50 -04:00
Yan, Zheng	f0486c68e4	Btrfs: Introduce contexts for metadata reservation Introducing metadata reservation contexts has two major advantages. First, it makes metadata reservation more traceable. Second, it can reclaim freed space and re-add it to itself after the transaction is committed. Besides adding the btrfs_block_rsv structure and related helper functions, this patch contains the following changes: Move code that decides if a freed tree block should be pinned into btrfs_free_tree_block(). Make space accounting more accurate, mainly for handling read only block groups. Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:50 -04:00
Yan, Zheng	2ead6ae770	Btrfs: Kill init_btrfs_i() All code in init_btrfs_i can be moved into btrfs_alloc_inode() Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:49 -04:00
Yan, Zheng	5da9d01b66	Btrfs: Shrink delay allocated space in a synchronized Shrinking delayed allocation space in a synchronized manner is more controllable than flushing all delay allocated space in an async thread. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:48 -04:00
Yan, Zheng	424499dbd0	Btrfs: Kill allocate_wait in space_info We already have fs_info->chunk_mutex to avoid concurrent chunk creation. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:48 -04:00
Yan, Zheng	b742bb82f1	Btrfs: Link block groups of different raid types The size of reserved space is stored in space_info. If block groups of different raid types are linked to separate space_info, changing allocation profile will corrupt reserved space accounting. Signed-off-by: Yan Zheng <zheng.yan@oracle.com> Signed-off-by: Chris Mason <chris.mason@oracle.com>	2010-05-25 10:34:47 -04:00
Miklos Szeredi	c3021629a0	fuse: support splice() reading from fuse device Allow userspace filesystem implementation to use splice() to read from the fuse device. The userspace filesystem can now transfer data coming from a WRITE request to an arbitrary file descriptor (regular file, block device or socket) without having to go through a userspace buffer. The semantics of using splice() to read messages are: 1) with a single splice() call move the whole message from the fuse device to a temporary pipe 2) read the header from the pipe and determine the message type 3a) if message is a WRITE then splice data from pipe to destination 3b) else read rest of message to userspace buffer Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-25 15:06:07 +02:00
Miklos Szeredi	ce534fb052	fuse: allow splice to move pages When splicing buffers to the fuse device with SPLICE_F_MOVE, try to move pages from the pipe buffer into the page cache. This allows populating the fuse filesystem's cache without ever touching the page contents, i.e. zero copy read capability. The following steps are performed when trying to move a page into the page cache: - buf->ops->confirm() to make sure the new page is uptodate - buf->ops->steal() to try to remove the new page from its previous place - remove_from_page_cache() on the old page - add_to_page_cache_locked() on the new page If any of the above steps fail (non fatally) then the code falls back to copying the page. In particular ->steal() will fail if there are external references (other than the page cache and the pipe buffer) to the page. Also since the remove_from_page_cache() + add_to_page_cache_locked() are non-atomic it is possible that the page cache is repopulated in between the two and add_to_page_cache_locked() will fail. This could be fixed by creating a new atomic replace_page_cache_page() function. fuse_readpages_end() needed to be reworked so it works even if page->mapping is NULL for some or all pages which can happen if the add_to_page_cache_locked() failed. A number of sanity checks were added to make sure the stolen pages don't have weird flags set, etc... These could be moved into generic splice/steal code. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-25 15:06:07 +02:00
Miklos Szeredi	dd3bb14f44	fuse: support splice() writing to fuse device Allow userspace filesystem implementation to use splice() to write to the fuse device. The semantics of using splice() are: 1) buffer the message header and data in a temporary pipe 2) with a single splice() call move the message from the temporary pipe to the fuse device The READ reply message has the most interesting use for this, since now the data from an arbitrary file descriptor (which could be a regular file, a block device or a socket) can be transferred into the fuse device without having to go through a userspace buffer. It will also allow zero copy moving of pages. One caveat is that the protocol on the fuse device requires the length of the whole message to be written into the header. But the length of the data transferred into the temporary pipe may not be known in advance. The current library implementation works around this by using vmsplice to write the header and modifying the header after splicing the data into the pipe (error handling omitted): struct fuse_out_header out; iov.iov_base = &out; iov.iov_len = sizeof(struct fuse_out_header); vmsplice(pip[1], &iov, 1, 0); len = splice(input_fd, input_offset, pip[1], NULL, len, 0); /* retrospectively modify the header: */ out.len = len + sizeof(struct fuse_out_header); splice(pip[0], NULL, fuse_chan_fd(req->ch), NULL, out.len, flags); This works since vmsplice only saves a pointer to the data, it does not copy the data itself. Since pipes are currently limited to 16 pages and messages need to be spliced atomically, the length of the data is limited to 15 pages (or 60kB for 4k pages). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-25 15:06:06 +02:00
Miklos Szeredi	b5dd328537	fuse: get page reference for readpages Acquire a page ref on pages in ->readpages() and release them when the read has finished. Not acquiring a reference didn't seem to cause any trouble since the page is locked and will not be kicked out of the page cache during the read. However the following patches will want to remove the page from the cache so a separate ref is needed. Making the reference in req->pages explicit also makes the code easier to understand. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-25 15:06:06 +02:00
Miklos Szeredi	1bf94ca73e	fuse: use get_user_pages_fast() Replace uses of get_user_pages() with get_user_pages_fast(). It looks nicer and should be faster in most cases. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-25 15:06:06 +02:00
Dan Carpenter	4aa0edd294	fuse: remove unneeded variable "map" isn't needed any more after: `0bd87182d3` "fuse: fix kunmap in fuse_ioctl_copy_user" Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>	2010-05-25 15:06:05 +02:00
Nick Piggin	0ae0b5d055	fs/splice.c: fix mapping_gfp_mask usage mapping_gfp_mask() is not supposed to store allocation context details, only page location details. So mapping_gfp_mask should be applied to the pagecache page allocation, whereas normal (kernel mapped) memory should be used for surrounding allocations such as radix-tree nodes allocated by add_to_page_cache. Context modifiers should be applied on a per-callsite basis. So change splice to follow this convention (which is followed in similar code patterns in core code). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-25 10:25:26 +02:00
Jens Axboe	b9598db340	pipe: make F_{GET,SET}PIPE_SZ deal with byte sizes Instead of requiring an exact number of pages as the argument and return value, change the API to deal with number of bytes instead. This also relaxes the requirement that the passed in size must result in a power-of-2 page array size. Round up to the nearest power-of-2 automatically and return the resulting size of the pipe on success. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-24 19:34:43 +02:00
Jens Axboe	0191f8697b	pipe: F_SETPIPE_SZ should return -EPERM for non-root If the passed in size is larger than what has been set as the system wide limit and the user is not root, we want to return permission denied (not invalid value). Signed-off-by: Jens Axboe <jens.axboe@oracle.com>	2010-05-24 19:15:57 +02:00
Alex Elder	88e88374ee	Merge branch 'delayed-logging-for-2.6.35' into for-linus	2010-05-24 11:57:36 -05:00
Dave Chinner	ccf7c23fc1	xfs: Ensure inode allocation buffers are fully replayed With delayed logging, we can get inode allocation buffers in the same transaction as inode unlink buffers. We don't currently mark inode allocation buffers in the log, so inode unlink buffers take precedence over allocation buffers. The result is that when they are combined into the same checkpoint, only the unlinked inode chain fields are replayed, resulting in uninitialised inode buffers being detected when the next inode modification is replayed. To fix this, we need to ensure that we do not set the inode buffer flag in the buffer log item format flags if the inode allocation has not already hit the log. To avoid requiring a change to log recovery, we really need to make this a modification that relies only on in-memory state. We can do this by checking during buffer log formatting (while the CIL cannot be flushed) if we are still in the same sequence when we commit the unlink transaction as the inode allocation transaction. If we are, then we do not add the inode buffer flag to the buffer log format item flags. This means the entire buffer will be replayed, not just the unlinked fields. We do this while CIL flushes are locked out to ensure that we don't race with the sequence numbers changing and hence fail to put the inode buffer flag in the buffer format flags when we really need to. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:41:22 -05:00
Dave Chinner	df806158b0	xfs: enable background pushing of the CIL If we let the CIL grow without bound, it will grow large enough to violate recovery constraints (must be at least one complete transaction in the log at all times) or take forever to write out through the log buffers. Hence we need a check during asynchronous transactions as to whether the CIL needs to be pushed. We track the amount of log space the CIL consumes, so it is relatively simple to limit it on a pure size basis. Make the limit the minimum of just under half the log size (recovery constraint) or 8MB of log space (which is an awful lot of metadata). Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:38:20 -05:00
Dave Chinner	9da1ab181a	xfs: forced unmounts need to push the CIL If the filesystem is being shut down and there is no log error, the current code forces out the current log buffers. This code now needs to push the CIL before it forces out the log buffers to achieve the same result. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:38:14 -05:00
Dave Chinner	71e330b593	xfs: Introduce delayed logging core code The delayed logging code only changes in-memory structures and as such can be enabled and disabled with a mount option. Add the mount option and emit a warning that this is an experimental feature that should not be used in production yet. We also need infrastructure to track committed items that have not yet been written to the log. This is what the Committed Item List (CIL) is for. The log item also needs to be extended to track the current log vector, the associated memory buffer and its location in the Committed Item List. Extend the log item and log vector structures to enable this tracking. To maintain the current log format for transactions with delayed logging, we need to introduce a checkpoint transaction and a context for tracking each checkpoint from initiation to transaction completion. This includes adding a log ticket for tracking log space required/used by the context checkpoint. To track all the changes we need an io vector array per log item, rather than a single array for the entire transaction. Using the new log vector structure for this requires two passes - the first to allocate the log vector structures and chain them together, and the second to fill them out. This log vector chain can then be passed to the CIL for formatting, pinning and insertion into the CIL. Formatting of the log vector chain is relatively simple - it's just a loop over the iovecs on each log vector, but it is made slightly more complex because we re-write the iovec after the copy to point back at the memory buffer we just copied into. This code also needs to pin log items. If the log item is not already tracked in this checkpoint context, then it needs to be pinned. Otherwise it is already pinned and we don't need to pin it again. The only other complexity is calculating the amount of new log space the formatting has consumed. 
This needs to be accounted to the transaction in progress, and the accounting is made more complex because we also need to steal space from it for log metadata in the checkpoint transaction. Calculate all this at insert time and update all the tickets, counters, etc correctly. Once we've formatted all the log items in the transaction, attach the busy extents to the checkpoint context so the busy extents live until checkpoint completion and can be processed at that point in time. Transactions can then be freed at this point in time. Now we need to issue checkpoints - we are tracking the amount of log space used by the items in the CIL, so we can trigger background checkpoints when the space usage gets to a certain threshold. Otherwise, checkpoints need to be triggered when a log synchronisation point is reached - a log force event. Because the log write code already handles chained log vectors, writing the transaction is trivial, too. Construct a transaction header, add it to the head of the chain and write it into the log, then issue a commit record write. Then we can release the checkpoint log ticket and attach the context to the log buffer so it can be called during IO completion to complete the checkpoint. We also need to allow for synchronising multiple in-flight checkpoints. This is needed for two things - the first is to ensure that checkpoint commit records appear in the log in the correct sequence order (so they are replayed in the correct order). The second is so that xfs_log_force_lsn() operates correctly and only flushes and/or waits for the specific sequence it was provided with. To do this we need a wait variable and a list tracking the checkpoint commits in progress. We can walk this list and wait for the checkpoints to change state or complete easily, and this provides the necessary synchronisation for correct operation in both cases. 
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:38:03 -05:00
Dave Chinner	ed3b4d6cdc	xfs: Improve scalability of busy extent tracking When we free a metadata extent, we record it in the per-AG busy extent array so that it is not re-used before the freeing transaction hits the disk. This array is fixed size, so when it overflows we make further allocation transactions synchronous because we cannot track more freed extents until those transactions hit the disk and are completed. Under heavy mixed allocation and freeing workloads with large log buffers, we can overflow this array quite easily. Further, the array is sparsely populated, which means that inserts need to search for a free slot, and array searches often have to search many more slots than are actually used to check all the busy extents. Quite inefficient, really. To enable this aspect of extent freeing to scale better, we need a structure that can grow dynamically. While in other areas of XFS we have used radix trees, the extents being freed are at random locations on disk so are better suited to being indexed by an rbtree. So, use a per-AG rbtree indexed by block number to track busy extents. This incurs a memory allocation when marking an extent busy, but should not occur too often in low memory situations. This should scale to an arbitrary number of extents so should not be a limitation for features such as in-memory aggregation of transactions. However, there are still situations where we can't avoid allocating busy extents (such as allocation from the AGFL). To minimise the overhead of such occurrences, we need to avoid doing a synchronous log force while holding the AGF locked to ensure that the previous transactions are safely on disk before we use the extent. We can do this by marking the transaction doing the allocation as synchronous rather than issuing a log force. 
Because of the locking involved and the ordering of transactions, the synchronous transaction provides the same guarantees as a synchronous log force because it ensures that all the prior transactions are already on disk when the synchronous transaction hits the disk. i.e. it preserves the free->allocate order of the extent correctly in recovery. By doing this, we avoid holding the AGF locked while log writes are in progress, hence reducing the length of time the lock is held and therefore we increase the rate at which we can allocate and free from the allocation group, thereby increasing overall throughput. The only problem with this approach is that when a metadata buffer is marked stale (e.g. a directory block is removed), the buffer remains pinned and locked until the log goes to disk. The issue here is that if that stale buffer is reallocated in a subsequent transaction, the attempt to lock that buffer in the transaction will hang waiting for the log to go to disk to unlock and unpin the buffer. Hence if someone tries to lock a pinned, stale, locked buffer we need to push on the log to get it unlocked ASAP. Effectively we are trading off a guaranteed log force for a much less common trigger for log force to occur. Ideally we should not reallocate busy extents. That is a much more complex fix to the problem as it involves direct intervention in the allocation btree searches in many places. This is left to a future set of modifications. Finally, now that we track busy extents in allocated memory, we don't need the descriptors in the transaction structure to point to them. We can replace the complex busy chunk infrastructure with a simple linked list of busy extents. This allows us to remove a large chunk of code, making the overall change a net reduction in code size. Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:34:00 -05:00
Dave Chinner	955833cf2a	xfs: make the log ticket ID available outside the log infrastructure The ticket ID is needed to uniquely identify transactions when doing busy extent matching. Delayed logging changes the lifecycle of busy extents with respect to the transaction structure lifecycle. Hence we can no longer use the transaction structure as a means of determining the owner of the busy extent as it may be freed and reused while the busy extent is still active. This commit provides the infrastructure to access the xlog_tid_t held in the ticket from a transaction handle. This avoids the need for callers to peek into the transaction and log structures to find this out. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:52 -05:00
Dave Chinner	169a7b078e	xfs: clean up log ticket overrun debug output Push the error message output when a ticket overrun is detected into the ticket printing functions. Also remove the debug version of the code as the production version will still panic just as effectively on a debug kernel via the panic mask being set. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:46 -05:00
Dave Chinner	c11554104f	xfs: Clean up XFS_BLI_* flag namespace Clean up the buffer log format (XFS_BLI_) flags because they have a polluted namespace. The XFS_BLI_ prefix is used for both in-memory and on-disk flag fields, but the flags have overlapping values for different flags. Rename the buffer log format flags to use the XFS_BLF_ prefix to avoid confusing them with the in-memory XFS_BLI_* prefixed flags. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:39 -05:00
Dave Chinner	64fc35de60	xfs: modify buffer item reference counting The buffer log item reference counts used to take references for every transaction, similar to the pin counting. This is symmetric (like the pin/unpin) with respect to transaction completion, but with delayed logging becomes asymmetric as the pinning becomes asymmetric w.r.t. transaction completion. To make both cases the same, allow the buffer pinning to take a reference to the buffer log item and always drop the reference the transaction has on it when being unlocked. This is balanced correctly because the unpin operation always drops a reference to the log item. Hence reference counting becomes symmetric w.r.t. item pinning as well as w.r.t active transactions and as a result the reference counting model remains consistent between normal and delayed logging. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:31 -05:00
Dave Chinner	3383ca5780	xfs: allow log ticket allocation to take allocation flags Delayed logging currently requires ticket allocation to succeed, so we need to be able to sleep on allocation. It also should not allow memory allocation to recurse into the filesystem. hence we need to pass allocation flags directing the type of allocation the caller requires. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:17 -05:00
Dave Chinner	524ee36fa4	xfs: Don't reuse the same transaction ID for duplicated transactions. The transaction ID is written into the log as the unique identifier for transactions during recovery. When duplicating a transaction, we reuse the log ticket, which means it has the same transaction ID as the previous transaction. Rather than regenerating a random transaction ID for the duplicated transaction, just add one to the current ID so that duplicated transactions can be easily spotted in the log and during recovery when diagnosing problems. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>	2010-05-24 10:33:10 -05:00
Linus Torvalds	f13771187b	Merge branch 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing * 'bkl/ioctl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: uml: Pushdown the bkl from harddog_kern ioctl sunrpc: Pushdown the bkl from sunrpc cache ioctl sunrpc: Pushdown the bkl from ioctl autofs4: Pushdown the bkl from ioctl uml: Convert to unlocked_ioctls to remove implicit BKL ncpfs: BKL ioctl pushdown coda: Clean-up whitespace problems in pioctl.c coda: BKL ioctl pushdown drivers: Push down BKL into various drivers isdn: Push down BKL into ioctl functions scsi: Push down BKL into ioctl functions dvb: Push down BKL into ioctl functions smbfs: Push down BKL into ioctl function coda/psdev: Remove BKL from ioctl function um/mmapper: Remove BKL usage sn_hwperf: Kill BKL usage hfsplus: Push down BKL into ioctl function	2010-05-24 08:01:10 -07:00
Linus Torvalds	0163916f1d	Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osd * 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: confusion between kmap() and kmap_atomic() api exofs: Add default address_space_operations	2010-05-24 07:57:41 -07:00
Linus Torvalds	3e766fd41d	Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6 * git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6: fat: convert to unlocked_ioctl fat: Cleanup nls_unload() usage fat: use pack_hex_byte() instead of custom one	2010-05-24 07:41:47 -07:00
Linus Torvalds	4fd5ec509b	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs: 9p: Optimize TCREATE by eliminating a redundant fid clone. 9p: cleanup: remove unneeded assignment 9p: Add mksock support fs/9p: Make sure we properly instantiate dentry. 9p: add 9P2000.L rename operation 9p: add 9P2000.L statfs operation 9p: VFS switches for 9p2000.L: VFS switches 9p: VFS switches for 9p2000.L: protocol and client changes	2010-05-24 07:41:13 -07:00
Linus Torvalds	6e188240eb	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (59 commits) ceph: reuse mon subscribe message instead of allocated anew ceph: avoid resending queued message to monitor ceph: Storage class should be before const qualifier ceph: all allocation functions should get gfp_mask ceph: specify max_bytes on readdir replies ceph: cleanup pool op strings ceph: Use kzalloc ceph: use common helper for aborted dir request invalidation ceph: cope with out of order (unsafe after safe) mds reply ceph: save peer feature bits in connection structure ceph: resync headers with userland ceph: use ceph. prefix for virtual xattrs ceph: throw out dirty caps metadata, data on session teardown ceph: attempt mds reconnect if mds closes our session ceph: clean up send_mds_reconnect interface ceph: wait for mds OPEN reply to indicate reconnect success ceph: only send cap releases when mds is OPEN|HUNG ceph: dicard cap releases on mds restart ceph: make mon client statfs handling more generic ceph: drop src address(es) from message header [new protocol feature] ...	2010-05-24 07:37:52 -07:00

18457 Commits