Make /proc/ page monitoring configurable
This puts the following files under an embedded config option:
/proc/pid/clear_refs
/proc/pid/smaps
/proc/pid/pagemap
/proc/kpagecount
/proc/kpageflags
[akpm@linux-foundation.org: Kconfig fix]
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes a subset of physical page flags available to userspace. Together
with /proc/pid/kpagemap, this allows tracking of a wide variety of VM behaviors.
Exported flags are decoupled from the kernel's internal flags. This
allows us to reorder flag bits, and synthesize any bits that get
redefined in terms of other bits.
[akpm@linux-foundation.org: remove unneeded access_ok()]
[akpm@linux-foundation.org: s/0/NULL/]
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes physical page map counts available to userspace. Together
with /proc/pid/pagemap and /proc/pid/clear_refs, this can be used to
monitor memory usage on a per-page basis.
[akpm@linux-foundation.org: remove unneeded access_ok()]
[bunk@stusta.de: make struct proc_kpagemap static]
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Adrian Bunk <bunk@stusta.de>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This interface provides a mapping for each page in an address space to its
physical page frame number, allowing precise determination of what pages are
mapped and what pages are shared between processes.
New in this version:
- headers gone again (as recommended by Dave Hansen and Alan Cox)
- 64-bit entries (as per discussion with Andi Kleen)
- swap pte information exported (from Dave Hansen)
- page walker callback for holes (from Dave Hansen)
- direct put_user I/O (as suggested by Rusty Russell)
This patch folds in cleanups and swap PTE support from Dave Hansen
<haveblue@us.ibm.com>.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reorder source so that all the code and data for each interface is together.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This puts all the clear_refs code where it belongs and probably lets things
compile on MMU-less systems as well.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This pulls the shared map display code out of show_map and puts it in
show_smap where it belongs.
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use the generic pagewalker for smaps and clear_refs
Signed-off-by: Matt Mackall <mpm@selenic.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "proportional set size" (PSS) of a process is the count of pages it has
in memory, where each page is divided by the number of processes sharing
it. So if a process has 1000 pages all to itself, and 1000 shared with one
other process, its PSS will be 1500.
- lwn.net: "ELC: How much memory are applications really using?"
The PSS proposed by Matt Mackall is a very nice metic for measuring an
process's memory footprint. So collect and export it via
/proc/<pid>/smaps.
Matt Mackall's pagemap/kpagemap and John Berthels's exmap can also do the
job. They are comprehensive tools. But for PSS, let's do it in the simple
way.
Cc: John Berthels <jjberthels@gmail.com>
Cc: Bernardo Innocenti <bernie@codewiz.org>
Cc: Padraig Brady <P@draigBrady.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow sticky directory mount option for hugetlbfs. This allows admin
to create a shared hugetlbfs mount point for multiple users, while
prevent accidental file deletion that users may step on each other.
It is similiar to default tmpfs mount option, or typical option used
on /tmp.
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <hermes@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The constructor for buffer_head slabs was removed recently. We need the
constructor back in slab defrag in order to insure that slab objects always
have a definite state even before we allocated them.
I think we mistakenly merged the removal of the constuctor into a cleanup
patch. You (ie: akpm) had a test that showed that the removal of the
constructor led to a small regression. The prior state makes things easier
for slab defrag.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Checking if an address is a vmalloc address is done in a couple of places.
Define a common version in mm.h and replace the other checks.
Again the include structures suck. The definition of VMALLOC_START and
VMALLOC_END is not available in vmalloc.h since highmem.c cannot be included
there.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Simplify page cache zeroing of segments of pages through 3 functions
zero_user_segments(page, start1, end1, start2, end2)
Zeros two segments of the page. It takes the position where to
start and end the zeroing which avoids length calculations and
makes code clearer.
zero_user_segment(page, start, end)
Same for a single segment.
zero_user(page, start, length)
Length variant for the case where we know the length.
We remove the zero_user_page macro. Issues:
1. Its a macro. Inline functions are preferable.
2. The KM_USER0 macro is only defined for HIGHMEM.
Having to treat this special case everywhere makes the
code needlessly complex. The parameter for zeroing is always
KM_USER0 except in one single case that we open code.
Avoiding KM_USER0 makes a lot of code not having to be dealing
with the special casing for HIGHMEM anymore. Dealing with
kmap is only necessary for HIGHMEM configurations. In those
configurations we use KM_USER0 like we do for a series of other
functions defined in highmem.h.
Since KM_USER0 is depends on HIGHMEM the existing zero_user_page
function could not be a macro. zero_user_* functions introduced
here can be be inline because that constant is not used when these
functions are called.
Also extract the flushing of the caches to be outside of the kmap.
[akpm@linux-foundation.org: fix nfs and ntfs build]
[akpm@linux-foundation.org: fix ntfs build some more]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: <linux-ext4@vger.kernel.org>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Anton Altaparmakov <aia21@cantab.net>
Cc: Mark Fasheh <mark.fasheh@oracle.com>
Cc: David Chinner <dgc@sgi.com>
Cc: Michael Halcrow <mhalcrow@us.ibm.com>
Cc: Steven French <sfrench@us.ibm.com>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the new timerfd API as it is implemented by the following patch:
int timerfd_create(int clockid, int flags);
int timerfd_settime(int ufd, int flags,
const struct itimerspec *utmr,
struct itimerspec *otmr);
int timerfd_gettime(int ufd, struct itimerspec *otmr);
The timerfd_create() API creates an un-programmed timerfd fd. The "clockid"
parameter can be either CLOCK_MONOTONIC or CLOCK_REALTIME.
The timerfd_settime() API give new settings by the timerfd fd, by optionally
retrieving the previous expiration time (in case the "otmr" parameter is not
NULL).
The time value specified in "utmr" is absolute, if the TFD_TIMER_ABSTIME bit
is set in the "flags" parameter. Otherwise it's a relative time.
The timerfd_gettime() API returns the next expiration time of the timer, or
{0, 0} if the timerfd has not been set yet.
Like the previous timerfd API implementation, read(2) and poll(2) are
supported (with the same interface). Here's a simple test program I used to
exercise the new timerfd APIs:
http://www.xmailserver.org/timerfd-test2.c
[akpm@linux-foundation.org: coding-style cleanups]
[akpm@linux-foundation.org: fix ia64 build]
[akpm@linux-foundation.org: fix m68k build]
[akpm@linux-foundation.org: fix mips build]
[akpm@linux-foundation.org: fix alpha, arm, blackfin, cris, m68k, s390, sparc and sparc64 builds]
[heiko.carstens@de.ibm.com: fix s390]
[akpm@linux-foundation.org: fix powerpc build]
[akpm@linux-foundation.org: fix sparc64 more]
Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk-manpages@gmx.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Davide Libenzi <davidel@xmailserver.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As Roland pointed out, we have the very old problem with exec. de_thread()
sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then
clears signal->flags. All signals (even fatal ones) sent in this window
(which is not too small) will be lost.
With this patch exec doesn't abuse SIGNAL_GROUP_EXIT. signal_group_exit(),
the new helper, should be used to detect exit_group() or exec() in progress.
It can have more users, but this patch does only strictly necessary changes.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Davide Libenzi <davidel@xmailserver.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Robin Holt <holt@sgi.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It was dumb to make get_task_comm() return void. Change it to return a
pointer to the resulting output for caller convenience.
Cc: Ulrich Drepper <drepper@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On Sat, 2008-01-05 at 13:35 -0800, Davide Libenzi wrote:
> I remember I talked with Arjan about this time ago. Basically, since 1)
> you can drop an epoll fd inside another epoll fd 2) callback-based wakeups
> are used, you can see a wake_up() from inside another wake_up(), but they
> will never refer to the same lock instance.
> Think about:
>
> dfd = socket(...);
> efd1 = epoll_create();
> efd2 = epoll_create();
> epoll_ctl(efd1, EPOLL_CTL_ADD, dfd, ...);
> epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
>
> When a packet arrives to the device underneath "dfd", the net code will
> issue a wake_up() on its poll wake list. Epoll (efd1) has installed a
> callback wakeup entry on that queue, and the wake_up() performed by the
> "dfd" net code will end up in ep_poll_callback(). At this point epoll
> (efd1) notices that it may have some event ready, so it needs to wake up
> the waiters on its poll wait list (efd2). So it calls ep_poll_safewake()
> that ends up in another wake_up(), after having checked about the
> recursion constraints. That are, no more than EP_MAX_POLLWAKE_NESTS, to
> avoid stack blasting. Never hit the same queue, to avoid loops like:
>
> epoll_ctl(efd2, EPOLL_CTL_ADD, efd1, ...);
> epoll_ctl(efd3, EPOLL_CTL_ADD, efd2, ...);
> epoll_ctl(efd4, EPOLL_CTL_ADD, efd3, ...);
> epoll_ctl(efd1, EPOLL_CTL_ADD, efd4, ...);
>
> The code "if (tncur->wq == wq || ..." prevents re-entering the same
> queue/lock.
Since the epoll code is very careful to not nest same instance locks
allow the recursion.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Drivers that register a ->fault handler, but do not range-check the
offset argument, must set VM_DONTEXPAND in the vm_flags in order to
prevent an expanding mremap from overflowing the resource.
I've audited the tree and attempted to fix these problems (usually by
adding VM_DONTEXPAND where it is not obvious).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fcntl(F_GETLK,..) can return pid of process for not current pid namespace
(if process is belonged to the several namespaces). It is true also for
pids in /proc/locks. So correct behavior is saving pointer to the struct
pid of the process lock owner.
Signed-off-by: Vitaliy Gusev <vgusev@openvz.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
interruptible_sleep_on_locked() is just an open-coded
wait_event_interruptible_timeout(), with the one difference that
interruptible_sleep_on_locked() doesn't bother to check the condition on
which it is waiting, depending instead on the BKL to avoid the case
where it blocks after the wakeup has already been called.
locks_block_on_timeout() is only used in one place, so it's actually
simpler to inline it into its caller.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
For such a short function (with such a long comment),
posix_locks_deadlock() seems to cause a lot of confusion. Attempt to
make it a bit clearer:
- Remove the initial posix_same_owner() check, which can never
pass (since this is only called in the case that block_fl and
caller_fl conflict)
- Use an explicit loop (and a helper function) instead of a goto.
- Rewrite the comment, attempting a clearer explanation, and
removing some uninteresting historical detail.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* 'for-linus' of git://linux-nfs.org/~bfields/linux: (100 commits)
SUNRPC: RPC program information is stored in unsigned integers
SUNRPC: Move exported symbol definitions after function declaration part 2
NLM: tear down RPC clients in nlm_shutdown_hosts
SUNRPC: spin svc_rqst initialization to its own function
nfsd: more careful input validation in nfsctl write methods
lockd: minor log message fix
knfsd: don't bother mapping putrootfh enoent to eperm
rdma: makefile
rdma: ONCRPC RDMA protocol marshalling
rdma: SVCRDMA sendto
rdma: SVCRDMA recvfrom
rdma: SVCRDMA Core Transport Services
rdma: SVCRDMA Transport Module
rdma: SVCRMDA Header File
svc: Add svc_xprt_names service to replace svc_sock_names
knfsd: Support adding transports by writing portlist file
svc: Add svc API that queries for a transport instance
svc: Add /proc/sys/sunrpc/transport files
svc: Add transport hdr size for defer/revisit
svc: Move the xprt independent code to the svc_xprt.c file
...
It's possible for a RPC to outlive the lockd daemon that created it, so
we need to make sure that all RPC's are killed when lockd is coming
down. When nlm_shutdown_hosts is called, kill off all RPC tasks
associated with the host. Since we need to wait until they have all gone
away, we might as well just shut down the RPC client altogether.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Neil Brown points out that we're checking buf[size-1] in a couple places
without first checking whether size is zero.
Actually, given the implementation of simple_transaction_get(), buf[-1]
is zero, so in both of these cases the subsequent check of the value of
buf[size-1] will catch this case.
But it seems fragile to depend on that, so add explicit checks for this
case.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Acked-by: NeilBrown <neilb@suse.de>
Neither EPERM and ENOENT map to valid errors for PUTROOTFH according to
rfc 3530, and, if anything, ENOENT is likely to be slightly more
informative; so don't bother mapping ENOENT to EPERM. (Probably this
was originally done because one likely cause was that there is an fsid=0
export but that it isn't permitted to this particular client. Now that
we allow WRONGSEC returns, this is somewhat less likely.)
In the long term we should work to make this situation less likely,
perhaps by turning off nfsv4 service entirely in the absence of the
pseudofs root, or constructing a pseudofilesystem root ourselves in the
kernel as necessary.
Thanks to Benny Halevy <bhalevy@panasas.com> for pointing out this
problem.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Benny Halevy <bhalevy@panasas.com>
Create a transport independent version of the svc_sock_names function.
The toclose capability of the svc_sock_names service can be implemented
using the svc_xprt_find and svc_xprt_close services.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Acked-by: Neil Brown <neilb@suse.de>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Update the write handler for the portlist file to allow creating new
listening endpoints on a transport. The general form of the string is:
<transport_name><space><port number>
For example:
echo "tcp 2049" > /proc/fs/nfsd/portlist
This is intended to support the creation of a listening endpoint for
RDMA transports without adding #ifdef code to the nfssvc.c file.
Transports can also be removed as follows:
'-'<transport_name><space><port number>
For example:
echo "-tcp 2049" > /proc/fs/nfsd/portlist
Attempting to add a listener with an invalid transport string results
in EPROTONOSUPPORT and a perror string of "Protocol not supported".
Attempting to remove an non-existent listener (.e.g. bad proto or port)
results in ENOTCONN and a perror string of
"Transport endpoint is not connected"
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Acked-by: Neil Brown <neilb@suse.de>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Add a new svc function that allows a service to query whether a
transport instance has already been created. This is used in lockd
to determine whether or not a transport needs to be created when
a lockd instance is brought up.
Specifying 0 for the address family or port is effectively a wild-card,
and will result in matching the first transport in the service's list
that has a matching class name.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Acked-by: Neil Brown <neilb@suse.de>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Move sk_list and sk_ready to svc_xprt. This involves close because these
lists are walked by svcs when closing all their transports. So I combined
the moving of these lists to svc_xprt with making close transport independent.
The svc_force_sock_close has been changed to svc_close_all and takes a list
as an argument. This removes some svc internals knowledge from the svcs.
This code races with module removal and transport addition.
Thanks to Simon Holm Thøgersen for a compile fix.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Acked-by: Neil Brown <neilb@suse.de>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Simon Holm Thøgersen <odie@cs.aau.dk>
Modify the various kernel RPC svcs to use the svc_create_xprt service.
Signed-off-by: Tom Tucker <tom@opengridcomputing.com>
Acked-by: Neil Brown <neilb@suse.de>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Greg Banks <gnb@sgi.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Fix nlm_block leak for the case of supplied blocking lock info.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Document these checks a little better and inline, as suggested by Neil
Brown (note both functions have two callers). Remove an obviously bogus
check while we're there (checking whether unsigned value is negative).
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Neil Brown <neilb@suse.de>
The server silently ignores attempts to set the uid and gid on create.
Based on the comment, this appears to have been done to prevent some
overly-clever IRIX client from causing itself problems.
Perhaps we should remove that hack completely. For now, at least, it
makes sense to allow root (when no_root_squash is set) to set uid and
gid.
While we're there, since nfsd_create and nfsd_create_v3 share the same
logic, pull that out into a separate function. And spell out the
individual modifications of ia_valid instead of doing them both at once
inside a conditional.
Thanks to Roger Willcocks <roger@filmlight.ltd.uk> for the bug report
and original patch on which this is based.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Without the patch, there is a leakage of nlmblock structure refcount
that holds a reference nlmfile structure, that holds a reference to
struct file, when async GETFL is used (-EINPROGRESS return from
file_ops->lock()), and also in some error cases.
Fix up a style nit while we're here.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
This patch addresses a compatibility issue with a Linux NFS server and
AIX NFS client.
I have exported /export as fsid=0 with sec=krb5:krb5i
I have mount --bind /home onto /export/home
I have exported /export/home with sec=krb5i
The AIX client mounts / -o sec=krb5:krb5i onto /mnt
If I do an ls /mnt, the AIX client gets a permission error. Looking at
the network traceIwe see a READDIR looking for attributes
FATTR4_RDATTR_ERROR and FATTR4_MOUNTED_ON_FILEID. The response gives a
NFS4ERR_WRONGSEC which the AIX client is not expecting.
Since the AIX client is only asking for an attribute that is an
attribute of the parent file system (pseudo root in my example), it
seems reasonable that there should not be an error.
In discussing this issue with Bruce Fields, I initially proposed
ignoring the error in nfsd4_encode_dirent_fattr() if all that was being
asked for was FATTR4_RDATTR_ERROR and FATTR4_MOUNTED_ON_FILEID, however,
Bruce suggested that we avoid calling cross_mnt() if only these
attributes are requested.
The following patch implements bypassing cross_mnt() if only
FATTR4_RDATTR_ERROR and FATTR4_MOUNTED_ON_FILEID are called. Since there
is some complexity in the code in nfsd4_encode_fattr(), I didn't want to
duplicate code (and introduce a maintenance nightmare), so I added a
parameter to nfsd4_encode_fattr() that indicates whether it should
ignore cross mounts and simply fill in the attribute using the passed in
dentry as opposed to it's parent.
Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The failure to return a stateowner from nfs4_preprocess_seqid_op() means
in the case where a lock request is of a type incompatible with an open
(due to, e.g., an application attempting a write lock on a file open for
read), means that fs/nfsd/nfs4xdr.c:ENCODE_SEQID_OP_TAIL() never bumps
the seqid as it should. The client, attempting to close the file
afterwards, then gets an (incorrect) bad sequence id error. Worse, this
prevents the open file from ever being closed, so we leak state.
Thanks to Benny Halevy and Trond Myklebust for analysis, and to Steven
Wilton for the report and extensive data-gathering.
Cc: Benny Halevy <bhalevy@panasas.com>
Cc: Steven Wilton <steven.wilton@team.eftel.com.au>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
In a number of places where we wish only to translate nlm_drop_reply to
rpc_drop_reply errors we instead return early with rpc_drop_reply,
skipping some important end-of-function cleanup.
This results in reference count leaks when lockd is doing posix locking
on GFS2.
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
When the callback channel fails, we inform the client of that by
returning a cb_path_down error the next time it tries to renew its
lease.
If we wait most of a lease period before deciding that a callback has
failed and that the callback channel is down, then we decrease the
chances that the client will find out in time to do anything about it.
So, mark the channel down as soon as we recognize that an rpc has
failed. However, continue trying to recall delegations anyway, in hopes
it will come back up. This will prevent more delegations from being
given out, and ensure cb_path_down is returned to renew calls earlier,
while still making the best effort to deliver recalls of existing
delegations.
Also fix a couple comments and remove a dprink that doesn't seem likely
to be useful.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Declare this variable in the one function where it's used, and clean up
some minor style problems.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
We generate a unique cl_confirm for every new client; so if we've
already checked that this cl_confirm agrees with the cl_confirm of
unconf, then we already know that it does not agree with the cl_confirm
of conf.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Again, the only way conf and unconf can have the same clientid is if
they were created in the "probable callback update" case of setclientid,
in which case we already know that the cl_verifier fields must agree.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
If conf and unconf are both found in the lookup by cl_clientid, then
they share the same cl_clientid. We always create a unique new
cl_clientid field when creating a new client--the only exception is the
"probable callback update" case in setclientid, where we copy the old
cl_clientid from another clientid with the same name.
Therefore two clients with the same cl_client field also always share
the same cl_name field, and a couple of the checks here are redundant.
Thanks to Simon Holm Thøgersen for a compile fix.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Simon Holm Thøgersen <odie@cs.aau.dk>
Using a counter instead of the nanoseconds value seems more likely to
produce a unique cl_confirm.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
We're supposed to generate a different cl_confirm verifier for each new
client, so these to cl_confirm values should never be the same.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Most of these comments just summarize the code.
The matching of code to the cases described in the RFC may still be
useful, though; add specific section references to make that easier to
follow. Also update references to the outdated RFC 3010.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
While we're here, let's remove the redundant (and now wrong) pathname in
the comment, and the #ifdef __KERNEL__'s.
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
This header is used only in a few places in fs/nfsd, so there seems to
be little point to having it in include/. (Thanks to Robert Day for
pointing this out.)
Cc: Robert P. J. Day <rpjday@crashcourse.ca>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Newer server features such as nfsv4 and gss depend on proc to work, so a
failure to initialize the proc files they need should be treated as
fatal.
Thanks to Andrew Morton for style fix and compile fix in case where
CONFIG_NFSD_V4 is undefined.
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
I assume the reason failure of creation was ignored here was just to
continue support embedded systems that want nfsd but not proc.
However, in cases where proc is supported it would be clearer to fail
entirely than to come up with some features disabled.
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The server depends on upcalls under /proc to support nfsv4 and gss.
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
There's really nothing much the caller can do if cache unregistration
fails. And indeed, all any caller does in this case is print an error
and continue. So just return void and move the printk's inside
cache_unregister.
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
If the reply cache initialization fails due to a kmalloc failure,
currently we try to soldier on with a reduced (or nonexistant) reply
cache.
Better to just fail immediately: the failure is then much easier to
understand and debug, and it could save us complexity in some later
code. (But actually, it doesn't help currently because the cache is
also turned off in some odd failure cases; we should probably find a
better way to handle those failure cases some day.)
Fix some minor style problems while we're at it, and rename
nfsd_cache_init() to remove the need for a comment describing it.
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Handle the failure case here with something closer to the standard
kernel style.
Doesn't really matter for now, but I'd like to add a few more failure
cases, and then this'll help.
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
We forgot to shut down the nfs4 state and idmapping code in this case.
Acked-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The length "nbytes" passed into read_buf should never be negative, but
we check only for too-large values of "nbytes", not for too-small
values. Make nbytes unsigned, so it's clear that the former tests are
sufficient. (Despite this read_buf() currently correctly returns an xdr
error in the case of a negative length, thanks to an unsigned
comparison with size_of() and bounds-checking in kmalloc(). This seems
very fragile, though.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: path name lengths are unsigned on the wire, negative lengths
are not meaningful natively either.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-By: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: adjust the sign of the length argument of nfsd_lookup and
nfsd_lookup_dentry, for consistency with recent changes. NFSD version
4 callers already pass an unsigned file name length.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-By: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Clean up: file name lengths are unsigned on the wire, negative lengths
are not meaningful natively either.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-By: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
According to The Open Group's NLM specification, NLM callers are variable
length strings. XDR variable length strings use an unsigned 32 bit length.
And internally, negative string lengths are not meaningful for the Linux
NLM implementation.
Clean up: Make nlm_lock.len and nlm_reboot.len unsigned integers. This
makes the sign of NLM string lengths consistent with the sign of xdr_netobj
lengths.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-By: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Obviously at some point we thought "error" represented the length when
positive. This appears to be a long-standing typo.
Thanks to Prasad Potluri <pvp@us.ibm.com> for finding the problem and
proposing an earlier version of this patch.
Cc: Steve French <smfltc@us.ibm.com>
Cc: Prasad V Potluri <pvp@us.ibm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Dereferenced pointer "dentry" without checking and assigned to inode
in the declaration.
(We could just delete the NULL checks that follow instead, as we never
get to the encode function in this particular case. But it takes a
little detective work to verify that fact, so it's probably safer to
leave the checks in place.)
Cc: Steve French <smfltc@us.ibm.com>
Signed-off-by: Prasad V Potluri <pvp@us.ibm.com>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
The whole reason to move this callback-channel probe into a separate
thread was because (for now) we don't have an easy way to create the
rpc_client asynchronously. But I forgot to move the rpc_create() to the
spawned thread. Doh! Fix that.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Our callback code doesn't actually handle concurrent attempts to probe
the callback channel. Some rethinking of the locking may be required.
However, we can also just move the callback probing to this case. Since
this is the only time a client is "confirmed" (and since that can only
happen once in the lifetime of a client), this ensures we only probe
once.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Andre Majorel <aym-xunil@teaser.fr> points out that if we only updated
the atime when we transfer some data, we deviate from the standard
of always updating the atime. So change splice to always call
file_accessed() even if splice_direct_to_actor() didn't transfer
any data.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'task_killable' of git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc: (22 commits)
Remove commented-out code copied from NFS
NFS: Switch from intr mount option to TASK_KILLABLE
Add wait_for_completion_killable
Add wait_event_killable
Add schedule_timeout_killable
Use mutex_lock_killable in vfs_readdir
Add mutex_lock_killable
Use lock_page_killable
Add lock_page_killable
Add fatal_signal_pending
Add TASK_WAKEKILL
exit: Use task_is_*
signal: Use task_is_*
sched: Use task_contributes_to_load, TASK_ALL and TASK_NORMAL
ptrace: Use task_is_*
power: Use task_is_*
wait: Use TASK_NORMAL
proc/base.c: Use task_is_*
proc/array.c: Use TASK_REPORT
perfmon: Use task_is_*
...
Fixed up conflicts in NFS/sunrpc manually..
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm: (21 commits)
dlm: static initialization improvements
dlm: clean ups
dlm: Sanity check namelen before copying it
dlm: keep cached master rsbs during recovery
dlm: change error message to debug
dlm: fix possible use-after-free
dlm: limit dir lookup loop
dlm: reject normal unlock when lock is waiting for lookup
dlm: validate messages before processing
dlm: reject messages from non-members
dlm: another call to confirm_master in receive_request_reply
dlm: recover locks waiting for overlap replies
dlm: clear ast_type when removing from astqueue
dlm: use fixed errno values in messages
dlm: swap bytes for rcom lock reply
dlm: align midcomms message buffer
dlm: close othercons
dlm: use dlm prefix on alloc and free functions
dlm: don't print common non-errors
dlm: proper prototypes
...
also change name_prefix from char pointer to char array.
Signed-off-by: Denis Cheng <crquan@gmail.com>
Signed-off-by: David Teigland <teigland@redhat.com>
A couple small clean-ups. Remove unnecessary wrapper-functions in
rcom.c, and remove unnecessary casting and an unnecessary ASSERT in
util.c.
Signed-off-by: David Teigland <teigland@redhat.com>
The 32/64 compatibility code in the DLM does not check the validity of
the lock name length passed into it, so it can easily overwrite memory
if the value is rubbish (as early versions of libdlm can cause with
unlock calls, it doesn't zero the field).
This patch restricts the length of the name to the amount of data
actually passed into the call.
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
To prevent the master of an rsb from changing rapidly, an unused rsb is kept
on the "toss list" for a period of time to be reused. The toss list was
being cleared completely for each recovery, which is unnecessary. Much of
the benefit of the toss list can be maintained if nodes keep rsb's in their
toss list that they are the master of. These rsb's need to be included
when the resource directory is rebuilt during recovery.
Signed-off-by: David Teigland <teigland@redhat.com>
The invalid lockspace messages are normal and can appear relatively
often. They should be suppressed without debugging enabled.
Signed-off-by: David Teigland <teigland@redhat.com>
The dlm_put_lkb() can free the lkb and its associated ua structure,
so we can't depend on using the ua struct after the put.
Signed-off-by: David Teigland <teigland@redhat.com>
In a rare case we may need to repeat a local resource directory lookup
due to a race with removing the rsb and removing the resdir record.
We'll never need to do more than a single additional lookup, though,
so the infinite loop around the lookup can be removed. In addition
to being unnecessary, the infinite loop is dangerous since some other
unknown condition may appear causing the loop to never break.
Signed-off-by: David Teigland <teigland@redhat.com>
Non-forced unlocks should be rejected if the lock is waiting on the
rsb_lookup list for another lock to establish the master node.
Signed-off-by: David Teigland <teigland@redhat.com>
There was some hit and miss validation of messages that has now been
cleaned up and unified. Before processing a message, the new
validate_message() function checks that the lkb is the appropriate type,
process-copy or master-copy, and that the message is from the correct
nodeid for the the given lkb. Other checks and assertions on the
lkb type and nodeid have been removed. The assertions were particularly
bad since they would panic the machine instead of just ignoring the bad
message.
Although other recent patches have made processing old message unlikely,
it still may be possible for an old message to be processed and caught
by these checks.
Signed-off-by: David Teigland <teigland@redhat.com>
Messages from nodes that are no longer members of the lockspace should be
ignored. When nodes are removed from the lockspace, recovery can
sometimes complete quickly enough that messages arrive from a removed node
after recovery has completed. When processed, these messages would often
cause an error message, and could in some cases change some state, causing
problems.
Signed-off-by: David Teigland <teigland@redhat.com>
When a failed request (EBADR or ENOTBLK) is unlocked/canceled instead of
retried, there may be other lkb's waiting on the rsb_lookup list for it
to complete. A call to confirm_master() is needed to move on to the next
waiting lkb since the current one won't be retried.
Signed-off-by: David Teigland <teigland@redhat.com>
When recovery looks at locks waiting for replies, it fails to consider
locks that have already received a reply for their first remote operation,
but not received a reply for secondary, overlapping unlock/cancel. The
appropriate stub reply needs to be called for these waiters.
Appears when we start doing recovery in the presence of a many overlapping
unlock/cancel ops.
Signed-off-by: David Teigland <teigland@redhat.com>
The lkb_ast_type field indicates whether the lkb is on the astqueue list.
When clearing locks for a process, lkb's were being removed from the astqueue
list without clearing the field. If release_lockspace then happened
immediately afterward, it could try to remove the lkb from the list a second
time.
Appears when process calls libdlm dlm_release_lockspace() which first
closes the ls dev triggering clear_proc_locks, and then removes the ls
(a write to control dev) causing release_lockspace().
Signed-off-by: David Teigland <teigland@redhat.com>
Some errno values differ across platforms. So if we return things like
-EINPROGRESS from one node it can get misinterpreted or rejected on
another one.
This patch fixes up the errno values passed on the wire so that they
match the x86 ones (so as not to break the protocol), and re-instates
the platform-specific ones at the other end.
Many thanks to Fabio for testing this patch.
Initial patch from Patrick.
Signed-off-by: Patrick Caulfield <pcaulfie@redhat.com>
Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com>
Signed-off-by: David Teigland <teigland@redhat.com>
DLM_RCOM_LOCK_REPLY messages need byte swapping.
Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com>
Signed-off-by: David Teigland <teigland@redhat.com>
gcc does not guarantee that an auto buffer is 64bit aligned.
This change allows sparc64 to work.
Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com>
Signed-off-by: David Teigland <teigland@redhat.com>
ibcs2 support has never been supported on 2.6 kernels as far as I know,
and if it has it must have been an external patch. Anyways, if anybody
applies an external patch they could as well readd the ibcs checking
code to the ELF loader in the same patch. But there is no reason to
keep this code running in all Linux kernels. This will save at least
two strcmps each ELF execution.
No deprecation period because it could not have been used anyway.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This adds Kconfig and Makefile bits to build fs/compat_binfmt_elf.c,
just added. Each arch that wants to use this file needs to add a
"select COMPAT_BINFMT_ELF" line in its Kconfig bits that enable COMPAT.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This adds fs/compat_binfmt_elf.c, a wrapper around fs/binfmt_elf.c for
32-bit ELF support on 64-bit kernels. It can replace all the hand-rolled
versions of this that each 32/64 arch has, which are all about the same.
To use this, an arch's asm/elf.h has to define at least a few compat_*
macros that parallel the various macros that fs/binfmt_elf.c uses for
native support.
There is no attempt to deal with compat macros for the core dump format
support. To use this file, the arch has to define compat_gregset_t for
linux/elfcore-compat.h and #define CORE_DUMP_USE_REGSET. The 32-bit
compatible formats should come automatically from task_user_regset_view
called on a 32-bit task.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This modifies the ELF core dump code under #ifdef CORE_DUMP_USE_REGSET.
It changes nothing when this macro is not defined. When it's #define'd
by some arch header (e.g. asm/elf.h), the arch must support the
user_regset (linux/regset.h) interface for reading thread state.
This provides an alternate version of note segment writing that is based
purely on the user_regset interfaces. When CORE_DUMP_USE_REGSET is set,
the arch need not define macros such as ELF_CORE_COPY_REGS and ELF_ARCH.
All that information is taken from the user_regset data structures.
The core dumps come out exactly the same if arch's definitions for its
user_regset details are correct.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This pulls out the code for writing the notes segment of an ELF core dump
into separate functions. This cleanly isolates into one cluster of
functions everything that deals with the note formats and the hooks into
arch code to fill them. The top-level elf_core_dump function itself now
deals purely with the generic ELF format and the memory segments.
This only moves code around into functions that can be inlined away.
It should not change any behavior at all.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The break_lock data structure and code for spinlocks is quite nasty.
Not only does it double the size of a spinlock but it changes locking to
a potentially less optimal trylock.
Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a
__raw_spin_is_contended that uses the lock data itself to determine whether
there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is
not set.
Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
decouple it from the spinlock implementation, and make it typesafe (rwlocks
do not have any need_lockbreak sites -- why do they even get bloated up
with that break_lock then?).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>