Once the state_owner and lock_owner semaphores get removed, it will be
possible for other OPEN requests to reopen the same file if they have
lower sequence ids than our CLOSE call.
This patch ensures that we recheck the file state once
nfs_wait_on_sequence() has completed waiting.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
NFSv4 file state-changing functions such as OPEN, CLOSE, LOCK,... are all
labelled with "sequence identifiers" in order to prevent the server from
reordering RPC requests, as this could cause its file state to
become out of sync with the client.
Currently the NFS client code enforces this ordering locally using
semaphores to restrict access to structures until the RPC call is done.
This, of course, only works with synchronous RPC calls, since the
user process must first grab the semaphore.
By dropping semaphores, and instead teaching the RPC engine to hold
the RPC calls until they are ready to be sent, we can extend this
process to work nicely with asynchronous RPC calls too.
This patch adds a new list called "rpc_sequence" that defines the order
of the RPC calls to be sent. We add one such list for each state_owner.
When an RPC call is ready to be sent, it checks if it is top of the
rpc_sequence list. If so, it proceeds. If not, it goes back to sleep,
and loops until it hits top of the list.
Once the RPC call has completed, it can then bump the sequence id counter,
and remove itself from the rpc_sequence list, and then wake up the next
sleeper.
Note that the state_owner sequence ids and lock_owner sequence ids are
all indexed to the same rpc_sequence list, so OPEN, LOCK,... requests
are all ordered w.r.t. each other.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
lock_kiocb() was introduced to serialize retrying and cancellation. In the
process of doing so it tried to sleep waiting for KIF_LOCKED while holding
the ctx_lock spinlock. Recent fixes have ensured that multiple concurrent
retries won't be attempted for a given iocb. Cancel has other problems and
has no significant in-tree users that have been complaining about it. So
for the immediate future we'll revert sleeping with the lock held and will
address proper cancellation and retry serialization in the future.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Acked-by: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently you do not get all the map entries on nommu systems because the
start function doesn't index into the list using the value of "pos".
Signed-off-by: David McCullough <davidm@snapgear.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Oopsable since nfs_wait_on_inode() can get called as part of iput_final().
Unnecessary since the caller had better be damned sure that the inode won't
disappear from underneath it anyway.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If the data cache has been marked as potentially invalid by nfs_refresh_inode,
we should invalidate it rather than assume that changes are due to our own
activity.
Also ensure that we always start with a valid cache before declaring it
to be protected by a delegation.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
"proc_smaps_operations" is not defined in case of "CONFIG_MMU=n".
Signed-off-by: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Nir Tzachar <tzachar@cs.bgu.ac.il> points out that if an ELF file specifies a
zero-length bss at a whacky address, we cannot load that binary because
padzero() tries to zero out the end of the page at the whacky address, and
that may not be writeable.
See also http://bugzilla.kernel.org/show_bug.cgi?id=5411
So teach load_elf_binary() to skip the bss settng altogether if the elf file
has a zero-length bss segment.
Cc: Roland McGrath <roland@redhat.com>
Cc: Daniel Jacobowitz <dan@debian.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Here is a compatibility fix between Linux and Solaris when used with VxFS
filesystems: Solaris usually accepts acl entries in any order, but with
VxFS it replies with NFSERR_INVAL when it sees a four-entry acl that is not
in canonical form. It may also fail with other non-canonical acls -- I
can't tell, because that case never triggers: We only send non-canonical
acls when we fake up an ACL_MASK entry.
Instead of adding fake ACL_MASK entries at the end, inserting them in the
correct position makes Solaris+VxFS happy. The Linux client and server
sides don't care about entry order. The three-entry-acl special case in
which we need a fake ACL_MASK entry was handled in xdr_nfsace_encode. The
patch moves this into nfsacl_encode.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
v9fs_file_read and v9fs_file_write use kmalloc to allocate buffers as big
as the data buffer received as parameter. kmalloc cannot be used to
allocate buffers bigger than 128K, so reading/writing data in chunks bigger
than 128k fails.
This patch reorganizes v9fs_file_read and v9fs_file_write to allocate only
buffers as big as the maximum data that can be sent in one 9P message.
Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The third param in this call to vmap shouldn't be GFP_KERNEL, which
makes no sense, but rather VM_MAP. Thanks to Al Viro for spotting
this.
Signed-off-by: Tom Zanussi <zanussi@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
- added typedef unsigned int __nocast gfp_t;
- replaced __nocast uses for gfp flags with gfp_t - it gives exactly
the same warnings as far as sparse is concerned, doesn't change
generated code (from gcc point of view we replaced unsigned int with
typedef) and documents what's going on far better.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The nameidata "last.name" is always allocated with "__getname()", and
should always be free'd with "__putname()".
Using "putname()" without the underscores will leak memory, because the
allocation will have been hidden from the AUDITSYSCALL code.
Arguably the real bug is that the AUDITSYSCALL code is really broken,
but in the meantime this fixes the problem people see.
Reported by Robert Derr, patch by Rick Lindsley.
Acked-by: Al Viro <viro@ftp.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
bfs_fill_super() walks the inode table to get the bitmap of free inodes
and collect stats. It has no business using iget() for that - it's a
lot of extra work, extra icache pollution and more complex code.
Switched to walking the damn thing directly.
Note: that also allows to kill ->i_dsk_ino in there - separate patch if
Tigran can confirm that this field can be zero only for deleted inodes
(i.e. something that could only be found during that scan and not by
normal lookups).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Check O_DIRECT and return -EINVAL error in open. dentry_open() also checks
this but only after the open method is called. This patch optimizes away
the unnecessary upcalls in this case.
It could be a correctness issue too: if filesystem has open() with side
effect, then it should fail before doing the open, not after.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Calling truncate() on hostfs spits a kernel warning "Something isn't
implemented here", but it still works fine.
Indeed, hostfs i_op->truncate doesn't do anything. But hostfs_setattr() ->
set_attr() correctly detects ATTR_SIZE and calls truncate() on the host. So
we should be safe (using ftruncate() may be better, in case the file is
unlinked on the host, but we aren't sure to have the file open for writing,
and reopening it would cause the same races; plus nobody should expect UML to
be so careful).
So, the warning is wrong, because the current implementation is working. Al,
am I correct, and can the warning be therefore dropped?
CC: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Recently aio_p{read,write} changed to perform retries internally rather
than returning -EIOCBRETRY. This inadvertantly resulted in always calling
aio_{read,write} with ki_left at 0 which would in turn immediately return
0. Harmless, but we can avoid this call by checking in the caller.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Only one of the run or kick path is supposed to put an iocb on the run
list. If both of them do it than one of them can end up referencing a
freed iocb. The kick path could delete the task_list item from the wait
queue before getting the ctx_lock and putting the iocb on the run list.
The run path was testing the task_list item outside the lock so that it
could catch ki_retry methods that return -EIOCBRETRY *without* putting the
iocb on a wait queue and promising to call kick_iocb. This unlocked check
could then race with the kick path to cause both to try and put the iocb on
the run list.
The patch stops the run path from testing task_list by requring that any
ki_retry that returns -EIOCBRETRY *must* guarantee that kick_iocb() will be
called in the future. aio_p{read,write}, the only in-tree -EIOCBRETRY
users, are updated.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Only one of the run or kick path is supposed to put an iocb on the run
list. If both of them do it than one of them can end up referencing a
freed iocb. The kick patch could set the Kicked bit before acquiring the
ctx_lock and putting the iocb on the run list. The run path, while holding
the ctx_lock, could see this partial kick and mistake it for a kick that
was deferred while it was doing work with the run_list NULLed out. It
would then race with the kick thread to add the iocb to the run list.
This patch moves the kick setting under the ctx_lock so that only one of
the kick or run path queues the iocb on the run list, as intended.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Benjamin LaHaise <bcrl@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
it seems that readv(2)/writev(2) syscalls do not call
file_permission callback. Looks like this is overlook.
I have filled the issue into redhat bugzilla as
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=169433
and got the recommendation to post this on lsm mailing list.
The following trivial patch solves the problem.
Signed-off-by: Kostik Belousov <kostikbel@gmail.com>
Signed-off-by: Chris Wright <chrisw@osdl.org>
Fid management cleanup. The patch attempts to fix the races in dentry's
fid management.
Dentries don't keep the opened fids anymore, they are moved to the file
structs. Ideally there should be no more than one fid with fidcreate equal
to zero in the dentry's list of fids.
v9fs_fid_create initializes the important fields (fid, fidcreated) before
v9fs_fid is added to the list. v9fs_fid_lookup returns only fids that are
not created by v9fs_create. v9fs_fid_get_created returns the fid created
by the same process by v9fs_create (if any) and removes it from dentry's
list
Signed-off-by: Latchesar Ionkov <lucho@ionkov.net>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix failure paths in ext3_new_inode() and clean up duplicated code: -
DQUOT_DROP() was not being called if ext3_init_security() failed.
Signed-off-by: Chris Sykes <chris@sigsegv.plus.com>
Cc: Stephen Smalley <sds@epoch.ncsc.mil>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix failure paths in ext2_new_inode() and clean up duplicated code: -
DQUOT_DROP() was not being called if ext2_init_security() failed.
Signed-off-by: Chris Sykes <chris@sigsegv.plus.com>
Cc: Stephen Smalley <sds@epoch.ncsc.mil>
Cc: Jan Kara <jack@ucw.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch checks reserved node ID values returned by lookup and creation
operations. In case one of the reserved values is sent, return -EIO.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add information about required version of the userspace library/utilities
to Documentation/Changes. Also add pointer to this and to FUSE
documentation from Kconfig.
Thanks to Anton Altaparmakov for the reminder.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
restart pages in the journal without multi sector transfer protection
fixups (i.e. the update sequence array is empty and in fact does not
exist).
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
cifsd had been preventing software suspend from completing.
Signed-off-by: pavel@suse.de
Signed-off-by: Steve French <sfrench@us.ibm.com> lightly modified
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently rpc_mkdir/rpc_rmdir and rpc_mkpipe/mk_unlink have an API that's
a little unfortunate. They take a path relative to the rpc_pipefs root and
thus need to perform a full lookup. If you look at debugfs or usbfs they
always store the dentry for directories they created and thus can pass in
a dentry + single pathname component pair into their equivalents of the
above functions.
And in fact rpc_pipefs actually stores a dentry for all but one component so
this change not only simplifies the core rpc_pipe code but also the callers.
Unfortuntately this code path is only used by the NFS4 idmapper and
AUTH_GSSAPI for which I don't have a test enviroment. Could someone give
it a spin? It's the last bit needed before we can rework the
lookup_hash API
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Each transport implementation can now set unique bind, connect,
reestablishment, and idle timeout values. These are variables,
allowing the values to be modified dynamically. This permits
exponential backoff of any of these values, for instance.
As an example, we implement exponential backoff for the connection
reestablishment timeout.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Get rid of the "xprt->nocong" variable.
Test-plan:
Use WAN simulation to cause sporadic bursty packet loss with UDP mounts.
Look for significant regression in performance or client stability.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Now we can fix up the last few places that use the "xprt->stream"
variable, and get rid of it from the rpc_xprt structure.
Test-plan:
Destructive testing (unplugging the network temporarily). Connectathon
with UDP and TCP.
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Implement a best practice: don't use exponential backoff when computing
retransmit timeout values on TCP connections, but simply retransmit
at regular intervals.
This also fixes a bug introduced when xprt_reset_majortimeo() was added.
Test-plan:
Enable RPC debugging and watch timeout behavior on a NFS/TCP mount.
Version: Thu, 11 Aug 2005 16:02:19 -0400
Signed-off-by: Chuck Lever <cel@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Fixes a condition whereby the kernel is returning the non-POSIX error
EBADCOOKIE to userspace.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
[PATCH] Fix miscompare in __posix_lock_file
If an application requests the same lock twice, the
kernel should just leave the existing lock in place.
Currently, it will install a second lock of the same type.
Signed-off-by: Olaf Kirch <okir@suse.de>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
When doing a rename on top of an existing file that is not in use,
the inode of the overwritten file will remain in the icache.
The fix is to decrement i_nlink of the overwritten inode, like we
do for unlink, rmdir etc already.
Problem diagnosed by Olaf Kirch. This patch is a slight variation
on his fix.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
since we otherwise get into a lock reversal deadlock if a read locked
runlist is passed in. In the process also change it to take an ntfs
inode instead of a vfs inode as parameter.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
nfs_readpage_release() causes an oops while accessing a file with NFS
debugging turned on (echo 32767 > /proc/sys/sunrpc/nfs_debug) and a kernel
built with CONFIG_DEBUG_SLAB.
This patch moves the debugging statement above nfs_release_request() to
avoid accessing freed memory.
Signed-off-by: Nick Wilson <njw@osdl.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>