Normally, activate_mm() is called from exec(), and thus it used to be a
no-op because we use a completely new "MM context" on the host (for
instance, a new process), and so we didn't need to flush any "TLB entries"
(which for us are the set of memory mappings for the host process from the
virtual "RAM" file).
Kernel threads, instead, are usually handled in a different way. So, when
for AIO we call use_mm(), things used to break and so Benjamin implemented
activate_mm(). However, that is only needed for AIO, and could slow down
exec() inside UML, so be smart: detect being called for AIO (via
PF_BORROWED_MM) and do the full flush only in that situation.
Comment also the caller so that people won't go breaking UML without
noticing. I also rely on the caller's locks for testing current->flags.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
CC: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch modifies the VFS setxattr, getxattr, and listxattr code to fall
back to the security module for security xattrs if the filesystem does not
support xattrs natively. This allows security modules to export the incore
inode security label information to userspace even if the filesystem does
not provide xattr storage, and eliminates the need to individually patch
various pseudo filesystem types to provide such access. The patch removes
the existing xattr code from devpts and tmpfs as it is then no longer
needed.
The patch restructures the code flow slightly to reduce duplication between
the normal path and the fallback path, but this should only have one
user-visible side effect - a program may get -EACCES rather than
-EOPNOTSUPP if policy denied access but the filesystem didn't support the
operation anyway. Note that the post_setxattr hook call is not needed in
the fallback case, as the inode_setsecurity hook call handles the incore
inode security state update directly. In contrast, we do call fsnotify in
both cases.
Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov>
Acked-by: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add a "smaps" entry to /proc/pid: show howmuch memory is resident in each
mapping.
People that want to perform a memory consumption analysing can use it
mainly if someone needs to figure out which libraries can be reduced for
embedded systems. So the new features are the physical size of shared and
clean [or dirty]; private and clean [or dirty].
Take a look the example below:
# cat /proc/4576/smaps
08048000-080dc000 r-xp /bin/bash
Size: 592 KB
Rss: 500 KB
Shared_Clean: 500 KB
Shared_Dirty: 0 KB
Private_Clean: 0 KB
Private_Dirty: 0 KB
080dc000-080e2000 rw-p /bin/bash
Size: 24 KB
Rss: 24 KB
Shared_Clean: 0 KB
Shared_Dirty: 0 KB
Private_Clean: 0 KB
Private_Dirty: 24 KB
080e2000-08116000 rw-p
Size: 208 KB
Rss: 208 KB
Shared_Clean: 0 KB
Shared_Dirty: 0 KB
Private_Clean: 0 KB
Private_Dirty: 208 KB
b7e2b000-b7e34000 r-xp /lib/tls/libnss_files-2.3.2.so
Size: 36 KB
Rss: 12 KB
Shared_Clean: 12 KB
Shared_Dirty: 0 KB
Private_Clean: 0 KB
Private_Dirty: 0 KB
...
(Includes a cleanup from "Richard Purdie" <rpurdie@rpsys.net>)
From: Torsten Foertsch <torsten.foertsch@gmx.net>
show_smap calls first show_map and then prints its additional information to
the seq_file. show_map checks if all it has to print fits into the buffer and
if yes marks the current vma as written. While that is correct for show_map
it is not for show_smap. Here the vma should be marked as written only after
the additional information is also written.
The attached patch cures the problem. It moves the functionality of the
show_map function to a new function show_map_internal that is called with an
additional struct mem_size_stats* argument. Then show_map calls
show_map_internal with NULL as struct mem_size_stats* whereas show_smap calls
it with a real pointer. Now the final
if (m->count < m->size) /* vma is copied successfully */
m->version = (vma != get_gate_vma(task))? vma->vm_start: 0;
is done only if the whole entry fits into the buffer.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch was recently discussed on linux-mm:
http://marc.theaimsgroup.com/?t=112085728500002&r=1&w=2
I inherited a large code base from Ray for page migration. There was a
small patch in there that I find to be very useful since it allows the
display of the locality of the pages in use by a process. I reworked that
patch and came up with a /proc/<pid>/numa_maps that gives more information
about the vma's of a process. numa_maps is indexes by the start address
found in /proc/<pid>/maps. F.e. with this patch you can see the page use
of the "getty" process:
margin:/proc/12008 # cat maps
00000000-00004000 r--p 00000000 00:00 0
2000000000000000-200000000002c000 r-xp 00000000 08:04 516 /lib/ld-2.3.3.so
2000000000038000-2000000000040000 rw-p 00028000 08:04 516 /lib/ld-2.3.3.so
2000000000040000-2000000000044000 rw-p 2000000000040000 00:00 0
2000000000058000-2000000000260000 r-xp 00000000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000260000-2000000000268000 ---p 00208000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000268000-2000000000274000 rw-p 00200000 08:04 54707842 /lib/tls/libc.so.6.1
2000000000274000-2000000000280000 rw-p 2000000000274000 00:00 0
2000000000280000-20000000002b4000 r--p 00000000 08:04 9126923 /usr/lib/locale/en_US.utf8/LC_CTYPE
2000000000300000-2000000000308000 r--s 00000000 08:04 60071467 /usr/lib/gconv/gconv-modules.cache
2000000000318000-2000000000328000 rw-p 2000000000318000 00:00 0
4000000000000000-4000000000008000 r-xp 00000000 08:04 29576399 /sbin/mingetty
6000000000004000-6000000000008000 rw-p 00004000 08:04 29576399 /sbin/mingetty
6000000000008000-600000000002c000 rw-p 6000000000008000 00:00 0 [heap]
60000fff7fffc000-60000fff80000000 rw-p 60000fff7fffc000 00:00 0
60000ffffff44000-60000ffffff98000 rw-p 60000ffffff44000 00:00 0 [stack]
a000000000000000-a000000000020000 ---p 00000000 00:00 0 [vdso]
cat numa_maps
2000000000000000 default MaxRef=43 Pages=11 Mapped=11 N0=4 N1=3 N2=2 N3=2
2000000000038000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
2000000000040000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
2000000000058000 default MaxRef=43 Pages=61 Mapped=61 N0=14 N1=15 N2=16 N3=16
2000000000268000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
2000000000274000 default MaxRef=1 Pages=3 Mapped=3 Anon=3 N0=3
2000000000280000 default MaxRef=8 Pages=3 Mapped=3 N0=3
2000000000300000 default MaxRef=8 Pages=2 Mapped=2 N0=2
2000000000318000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N2=1
4000000000000000 default MaxRef=6 Pages=2 Mapped=2 N1=2
6000000000004000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
6000000000008000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
60000fff7fffc000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
60000ffffff44000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
getty uses ld.so. The first vma is the code segment which is used by 43
other processes and the pages are evenly distributed over the 4 nodes.
The second vma is the process specific data portion for ld.so. This is
only one page.
The display format is:
<startaddress> Links to information in /proc/<pid>/map
<memory policy> This can be "default" "interleave={}", "prefer=<node>" or "bind={<zones>}"
MaxRef= <maximum reference to a page in this vma>
Pages= <Nr of pages in use>
Mapped= <Nr of pages with mapcount >
Anon= <nr of anonymous pages>
Nx= <Nr of pages on Node x>
The content of the proc-file is self-evident. If this would be tied into
the sparsemem system then the contents of this file would not be too
useful.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
the pagebuf had been unlocked if the buffer was delwri. At high load, this
could result in a race when the superblock was being synced that would
result the flags being incorrect and the iodone functions being executed
incorrectly. This then leads to iclog callback failures or AIL list
corruptions resulting in filesystem shutdowns.
SGI-PV: 923981
SGI-Modid: xfs-linux:xfs-kern:23616a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
fixes crashes under high nfs load
SGI-PV: 941429
SGI-Modid: xfs-linux:xfs-kern:197929a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
which can cause an extent hole to be filled and a free extent to be
processed. In this case, we make a few mistakes: forget to pass back the
transaction, forget to put a hold on the buffer and forget to add the buf
to the new transaction.
SGI-PV: 940366
SGI-Modid: xfs-linux:xfs-kern:23594a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
linvfs_clear_inode(). The behavior may go away in VOP_INACTIVE.
SGI-PV: 941000
SGI-Modid: xfs-linux:xfs-kern:197355a
Signed-off-by: Felix Blyakher <felixb@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
having previously mounted with quotas.
SGI-PV: 940491
SGI-Modid: xfs-linux:xfs-kern:23388a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
because aio+dio completions may happen from irq context but we need
process context for converting unwritten extents. We also queue regular
direct I/O completions to workqueue for regularity, there's only one
queue_work call per syscall.
SGI-PV: 934766
SGI-Modid: xfs-linux:xfs-kern:196857a
Signed-off-by: Christoph Hellwig <hch@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Use MAP_PRIVATE when calling mmap to get memory for the code region.
The flat loader was using MAP_SHARED, but underlying changes to the
MMUless mmap means this is now wrong.
Signed-off-by: Greg Ungerer <gerg@uclinux.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
when it goes to force out the log, and get the tail lsn, it will want to
get the AIL lock.
SGI-PV: 940076
SGI-Modid: xfs-linux:xfs-kern:23260a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
include all the components which make up the transaction in the ondisk
log. Having this incomplete has shown up as problems on IRIX when some v2
log changes went in. The symptom was the msg of "xfs_log_write:
reservation ran out. Need to up reservation" and was seen on synchronous
writes on files with lots of holes (and therefore lots of extents).
SGI-PV: 931457
SGI-Modid: xfs-linux:xfs-kern:23095a
Signed-off-by: Tim Shimmin <tes@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
are getting ENOSPC errors on writes. When we fail to allocate space for
indirect blocks in xfs_bmapi() make sure we release the direct block
allocation before returning.
SGI-PV: 938502
SGI-Modid: xfs-linux:xfs-kern:22986a
Signed-off-by: David Chinner <dgc@sgi.com>
Signed-off-by: Nathan Scott <nathans@sgi.com>
Since the patch to add a NULL short-circuit to crypto_free_tfm() went in,
there's no longer any need for callers of that function to check for NULL.
This patch removes the redundant NULL checks and also a few similar checks
for NULL before calls to kfree() that I ran into while doing the
crypto_free_tfm bits.
I've succesfuly compile tested this patch, and a kernel with the patch
applied boots and runs just fine.
When I posted the patch to LKML (and other lists/people on Cc) it drew the
following comments :
J. Bruce Fields commented
"I've no problem with the auth_gss or nfsv4 bits.--b."
Sridhar Samudrala said
"sctp change looks fine."
Herbert Xu signed off on the patch.
So, I guess this is ready to be dropped into -mm and eventually mainline.
Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch goes through the current users of the crypto layer and sets
CRYPTO_TFM_REQ_MAY_SLEEP at crypto_alloc_tfm() where all crypto operations
are performed in process context.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Lots of places just needs the states, not even linux/tcp.h, where this
enum was, needs it.
This speeds up development of the refactorings as less sources are
rebuilt when things get moved from net/tcp.h.
Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The problem arises if an entity in sysfs is created and removed without
ever having been made completely visible. In SCSI this is triggered by
removing a device while it's initialising.
The problem appears to be that because it was never made visible in sysfs,
the sysfs dentry has a null d_inode which oopses when a reference is made
to it. The solution is simply to check d_inode and assume the object was
never made visible (and thus doesn't need deleting) if it's NULL.
(akpm: possibly a stopgap for 2.6.13 scsi problems. May not be the
long-term fix)
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Cc: Greg KH <greg@kroah.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The recent change to locks_remove_flock code in fs/locks.c changes how
byte range locks are removed from closing files, which shows up a bug in
cifs.
The assumption in the cifs code was that the close call sent to the
server would remove any pending locks on the server on this file, but
that is no longer safe as the fs/locks.c code on the client wants unlock
of 0 to PATH_MAX to remove all locks (at least from this client, it is
not possible AFAIK to remove all locks from other clients made to the
server copy of the file).
Note that cifs locks are different from posix locks - and it is not
possible to map posix locks perfectly on the wire yet, due to
restrictions of the cifs network protocol, even to Samba without adding
a new request type to the network protocol (which we plan to do for
Samba 3.0.21 within a few months), but the local client will have the
correct, posix view, of the lock in most cases.
The correct fix for cifs for this would involve a bigger change than I
would like to do this late in the 2.6.13-rc cycle - and would involve
cifs keeping track of all unmerged (uncoalesced) byte range locks for
each remote inode and scanning that list to remove locks that intersect
or fall wholly within the range - locks that intersect may have to be
reaquired with the smaller, remaining range.
Signed-off-by: Steve French <sfrench@us.ibm.com>
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While touching this code I noticed the error handling is bogus, so I
fixed it up.
I've removed the IS_ERR(proc_dentry) check, which will never trigger and
is clearly a typo: we must check proc_file instead.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Update hppfs for the symlink functions prototype change.
Yes, I know the code I leave there is still _bogus_, see next patch for
this.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
There is an off by one problem with idr_get_new_above.
The comment and function name suggest that it will return an id >
starting_id, but it actually returned an id >= starting_id, and kernel
callers other than inotify treated it as such.
The patch below fixes the comment, and fixes inotifys usage. The
function name still doesn't match the behaviour, but it never did.
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This bug could cause oopses and page state corruption, because ncpfs
used the generic page-cache symlink handlign functions. But those
functions only work if the page cache is guaranteed to be "stable", ie a
page that was installed when the symlink walk was started has to still
be installed in the page cache at the end of the walk.
We could have fixed ncpfs to not use the generic helper routines, but it
is in many ways much cleaner to instead improve on the symlink walking
helper routines so that they don't require that absolute stability.
We do this by allowing "follow_link()" to return a error-pointer as a
cookie, which is fed back to the cleanup "put_link()" routine. This
also simplifies NFS symlink handling.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The current calling conventions for ->follow_link() are already fairly
complex.
What we have is
1) you can return -error; then you must release nameidata yourself
and ->put_link() will _not_ be called.
2) you can do nd_set_link(nd, ERR_PTR(-error)) and return 0
3) you can do nd_set_link(nd, path) and return 0
4) you can return 0 (after having moved nameidata yourself)
jffs2 follow_link() is broken - it has an exit where it returns
-EIO and leaks nameidata.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When i_acl_default is set to some error we do not hold the lock (hence we
are not allowed to drop it and reacquire later).
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jeff Mahoney <jeffm@suse.com>
Cc: Chris Mason <mason@suse.com>
Cc: <reiserfs-dev@namesys.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Down the road we want to eliminate the use of the global kernel lock entirely
from the NFS client. To do this, we need to protect the fields in the
nfs_inode structure adequately. Start by serializing updates to the
"cache_validity" field.
Note this change addresses an SMP hang found by njw@osdl.org, where processes
deadlock because nfs_end_data_update and nfs_revalidate_mapping update the
"cache_validity" field without proper serialization.
Test plan:
Millions of fsx ops on SMP clients. Run Nick Wilson's breaknfs program on
large SMP clients.
Signed-off-by: Chuck Lever <cel@netapp.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Introduce atomic bitops to manipulate the bits in the nfs_inode structure's
"flags" field.
Using bitops means we can use a generic wait_on_bit call instead of an ad hoc
locking scheme in fs/nfs/inode.c, so we can remove the "nfs_i_wait" field from
nfs_inode at the same time.
The other new flags field will continue to use bitmask and logic AND and OR.
This permits several flags to be set at the same time efficiently. The
following patch adds a spin lock to protect these flags, and this spin lock
will later cover other fields in the nfs_inode structure, amortizing the cost
of using this type of serialization.
Test plan:
Millions of fsx ops on SMP clients.
Signed-off-by: Chuck Lever <cel@netapp.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Certain bits in nfsi->flags can be manipulated with atomic bitops, and some
are better manipulated via logical bitmask operations.
This patch splits the flags field into two. The next patch introduces atomic
bitops for one of the fields.
Test plan:
Millions of fsx ops on SMP clients.
Signed-off-by: Chuck Lever <cel@netapp.com>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The nfsd holds the big kernel lock upon exit, when it really shouldn't.
Not to mention that this breaks Ingo's RT patch. This is a trivial fix
to release the lock.
Ingo, this patch also works with your kernel, and stops the problem with
nfsd.
Note, there's a "goto out;" where "out:" is right above svc_exit_thread.
The point of the goto also holds the kernel_lock, so I don't see any
problem here in releasing it.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When the client performs an exclusive create and opens the file for writing,
a Netapp filer will first create the file using the mode 01777. It does this
since an NFSv3/v4 exclusive create cannot immediately set the mode bits.
The 01777 mode then gets put into the inode->i_mode. After the file creation
is successful, we then do a setattr to change the mode to the correct value
(as per the NFS spec).
The problem is that nfs_refresh_inode() no longer updates inode->i_mode, so
the latter retains the 01777 mode. A bit later, the VFS notices this, and calls
remove_suid(). This of course now resets the file mode to inode->i_mode & 0777.
Hey presto, the file mode on the server is now magically changed to 0777. Duh...
Fixes http://bugzilla.linux-nfs.org/show_bug.cgi?id=32
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
the buffers when mapping them after the VM had discarded them.
Thanks to Martin MOKREJŠ for the bug report.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
This adds a MOVE_SELF event to inotify. It is sent whenever the inode
you are watching is moved. We need this event so that we can catch
something like this:
- app1:
watch /etc/mtab
- app2:
cp /etc/mtab /tmp/mtab-work
mv /etc/mtab /etc/mtab~
mv /tmp/mtab-work /etc/mtab
app1 still thinks it's watching /etc/mtab but it's actually watching
/etc/mtab~.
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We are saving the wrong thing in ->last_wd. We want the wd, not the
return value.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix path name conversion for long filenames when mapchars mount option
was specified at mount time.
Signed-off-by: Steve French (sfrench@us.ibm.com)
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix missing entries in search results when very long file names and more
than 50 (or so) of such long search entries in the directory.
FindNext could send corrupt last byte of resume name when resume key was
a few hundred bytes long file name or longer.
Fixes Samba Bug # 2932
Signed-off-by: Steve French (sfrench@us.ibm.com)
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Initialize key object ID in inode so that we don't try to remove the inode
when we fail on some checks even before we manage to allocate something.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
TxAnchor.anon_list is protected by jfsTxnLock (TXN_LOCK), but there was
a place in txLock() that was removing an entry from the list without holding
the spinlock.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
The patch below unhooks fsnotify from vfs_unlink & vfs_rmdir. It
introduces two new fsnotify calls, that are hooked in at the dcache
level. This not only more closely matches how the VFS layer works, it
also avoids the problem with locking and inode lifetimes.
The two functions are
- fsnotify_nameremove -- called when a directory entry is going away.
It notifies the PARENT of the deletion. This is called from
d_delete().
- inoderemove -- called when the files inode itself is going away. It
notifies the inode that is being deleted. This is called from
dentry_iput().
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I'm resending this patch, because I still believe it's the correct fix.
Tested before/after applying the patch with a test application
available from:
http://www.inf.bme.hu/~mszeredi/nstest.c
Bind mount from a foreign namespace results in an un-removable mount.
The reason is that mnt->mnt_namespace is copied from the old mount in
clone_mnt(). Because of this check_mnt() in sys_umount() will fail.
The solution is to set mnt->mnt_namespace to current->namespace in
clone_mnt(). clone_mnt() is either called from do_loopback() or
copy_tree(). copy_tree() is called from do_loopback() or
copy_namespace().
When called (directly or indirectly) from do_loopback(), always
current->namspace is being modified: check_mnt(nd->mnt). So setting
mnt->mnt_namespace to current->namspace is the right thing to do.
When called from copy_namespace(), the setting of mnt_namespace is
irrelevant, since mnt_namespace is reset later in that function for
all copied mounts.
Jamie said:
This patch is correct. The old code was buggy for more fundamental and
serious reason: it broke the invariant that a tree of vfsmnts all have the
same value of mnt_namespace (and the same for the mnt_list list).
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Jamie Lokier <jamie@shareable.org>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This uses the new deflateBound() thing to sanity-check the input to the
zlib decompressor before we even bother to start reading in the blocks.
Problem noted by Tim Yamin <plasmaroo@gentoo.org>
This avoids the whole #ifdef mess by just getting a copy of
dentry->d_inode before d_delete is called - that makes the codepaths the
same for the INOTIFY/DNOTIFY cases as for the regular no-notify case.
I've been running this under a Gnome session for the last 10 minutes.
Inotify is being used extensively.
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The included patch fixes a problem where a inotify client would receive a
delete event before the file was actually deleted. The bug affects both
dnotify & inotify.
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The inotify help text still refers to the character device. Update it.
Fixes kernel bug #4993.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If there was a read error, the bnode might miss some pages, so skip them.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If inode size hasn't changed, don't do anything further in truncate, which
also prevents a dirty inode, what might upset some readonly devices quite
badly.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Some error paths may iput an invalid inode with i_nlink=0. jfs should
not try to actually delete such an inode.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
When you rm a watch, an IN_IGNORED event is sent down the event queue
with the watch descriptor that you just rm'd.
If you then add a watch you could get the ignored watch's wd and if you
haven't read the entire event queue, user space will think that it's
newly created watch was just ignored.
To avoid this problem we just use idr_get_new_above instead of
idr_get_new.
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When a file is moved over an existing file that you are watching,
inotify won't send you a DELETE_SELF event and it won't unref the inode
until the inotify instance is closed by the application.
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o sysfs_dirent's s_mode field should also be updated in sysfs_setattr(), else
there could be inconsistency in the two fields. s_mode is used while
->readdir so as not to bring in the inode to cache.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o sysfs_chmod_file() must update the new iattr field in sysfs_dirent else
the mode change will not be persistent in case of inode evacuation from
cache.
Signed-off-by: Maneesh Soni <maneesh@in.ibm.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Actually implement the hostfs "sync" method.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix bug introduced in 2.6.11-rc2: when we clone a BIO we need to copy over the
current index into it as well.
It corrupts data with some MD setups.
See http://bugzilla.kernel.org/show_bug.cgi?id=4946
Huuuuuuuuge thanks to Matthew Stapleton <matthew4196@gmail.com> for doggedly
chasing this one down.
Acked-by: Jens Axboe <axboe@suse.de>
Cc: <linux-raid@vger.kernel.org>
Cc: <dm-devel@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
`gcc -W' likes to complain if the static keyword is not at the beginning of
the declaration. This patch fixes all remaining occurrences of "inline
static" up with "static inline" in the entire kernel tree (140 occurrences in
47 files).
While making this change I came across a few lines with trailing whitespace
that I also fixed up, I have also added or removed a blank line or two here
and there, but there are no functional changes in the patch.
Signed-off-by: Jesper Juhl <juhl-lkml@dif.dk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
turn many #if $undefined_string into #ifdef $undefined_string to fix some
warnings after -Wno-def was added to global CFLAGS
Signed-off-by: Olaf Hering <olh@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The cache parameter to mb_cache_shrink isn't used. We may as well remove
it.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
I believe that there is a problem with the handling of POSIX locks, which
the attached patch should address.
The problem appears to be a race between fcntl(2) and close(2). A
multithreaded application could close a file descriptor at the same time as
it is trying to acquire a lock using the same file descriptor. I would
suggest that that multithreaded application is not providing the proper
synchronization for itself, but the OS should still behave correctly.
SUS3 (Single UNIX Specification Version 3, read: POSIX) indicates that when
a file descriptor is closed, that all POSIX locks on the file, owned by the
process which closed the file descriptor, should be released.
The trick here is when those locks are released. The current code releases
all locks which exist when close is processing, but any locks in progress
are handled when the last reference to the open file is released.
There are three cases to consider.
One is the simple case, a multithreaded (mt) process has a file open and
races to close it and acquire a lock on it. In this case, the close will
release one reference to the open file and when the fcntl is done, it will
release the other reference. For this situation, no locks should exist on
the file when both the close and fcntl operations are done. The current
system will handle this case because the last reference to the open file is
being released.
The second case is when the mt process has dup(2)'d the file descriptor.
The close will release one reference to the file and the fcntl, when done,
will release another, but there will still be at least one more reference
to the open file. One could argue that the existence of a lock on the file
after the close has completed is okay, because it was acquired after the
close operation and there is still a way for the application to release the
lock on the file, using an existing file descriptor.
The third case is when the mt process has forked, after opening the file
and either before or after becoming an mt process. In this case, each
process would hold a reference to the open file. For each process, this
degenerates to first case above. However, the lock continues to exist
until both processes have released their references to the open file. This
lock could block other lock requests.
The changes to release the lock when the last reference to the open file
aren't quite right because they would allow the lock to exist as long as
there was a reference to the open file. This is too long.
The new proposed solution is to add support in the fcntl code path to
detect a race with close and then to release the lock which was just
acquired when such as race is detected. This causes locks to be released
in a timely fashion and for the system to conform to the POSIX semantic
specification.
This was tested by instrumenting a kernel to detect the handling locks and
then running a program which generates case #3 above. A dangling lock
could be reliably generated. When the changes to detect the close/fcntl
race were added, a dangling lock could no longer be generated.
Cc: Matthew Wilcox <willy@debian.org>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Oliver Paukstadt from our test department is testing the xip patches in
Linus' git-tree. He found a problem that shows when reading a file that
contains sparse blocks (holes) on a -o xip mounted ext2 filesystem: the
BUG_ON() in fs/ext2/xip.c:40 triggers where it should not. The problem was
introduced by a cleanup in my previous patch, this patch fixes it.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
If the automount daemon receives a signal which causes it to sumarily
terminate the autofs4 module leaks dentries. The same problem exists with
detached mount requests without the warning.
This patch cleans these dentries at umount.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We must drop references to quota structures before releasing the inode.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We must drop references to quota structures before releasing the inode.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
reiserfs_new_inode() can call iput() with the xattr lock held. This will
cause a deadlock to occur when reiserfs_delete_xattrs() is called to clean
up.
The following patch releases the lock and reacquires it after the iput.
This is safe because interaction with xattrs is complete, and the relock is
just to balance out the release in the caller.
The locking needs some reworking to be more sane, but that's more intrusive
and I was just looking to fix this bug.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Here's a patch to fix a missing refrigerator call in jffs2.
Signed-off-by: Nigel Cunningham <nigel@suspend2.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Under heavy load, hot metadata pages are often locked by non-committed
transactions, making them difficult to flush to disk. This prevents
the sync point from advancing past a transaction that had modified the
page.
There is a point during the sync barrier processing where all
outstanding transactions have been committed to disk, but no new
transaction have been allowed to proceed. This is the best time
to write the metadata.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
Cc: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Check for (unlikely) errors in the filesystem initialization stuff in
our module_init() function.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Change default inotify limits: Maximum instances per user to 128 and
maximum events per queue to 16k. The max instances used to be 128; the
change to 8 was a mistake. Memory consumption is fine.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Handle error out paths better.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Bug fix: Ensure that the fd passed to inotify_add_watch() and
inotify_rm_watch() belongs to inotify.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As an optimization, use fget_light() and fput_light() where possible.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Miscellaneous invariant clean up, comment fixes, and so on. Trivial
stuff.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: John McCutchan <ttb@tentacle.dhs.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A failure in dbAlloc caused a directory's i_blocks to be incorrectly
incremented, causing jfs_fsck to find the inode to be corrupt.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
If a metadata page is kept active, it is possible that the sync barrier logic
continues to trigger, even if all active transactions have been phyically
written to the journal. This can cause a hang, since the completion of the
journal I/O is what unsets the sync barrier flag to allow new transactions
to be created.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
This patch includes feedback from Andrew and Christoph. Thanks for
taking time to review.
Use of empty_zero_page was eliminated to fix compilation for architectures
that don't have it.
This patch removes setting pages up-to-date in ext2_get_xip_page and all
bug checks to verify that the page is indeed up to date. Setting the page
state on mapping to userland is bogus. None of the code patchs involved
with these pages in mm cares about the page state.
still on my ToDo list: identify a place outside second extended where
__inode_direct_access should reside
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This is half of a patch that Qu Fuping submitted in April. The first part
was applied to fs/mpage.c in 2.6.12-rc4.
jfs_fsync should return error, but it doesn't wait for the metadata page to
be uptodate, e.g.:
jfs_fsync->jfs_commit_inode->txCommit->diWrite->read_metapage->
__get_metapage->read_cache_page reads a page from disk. Because read is
async, when read_cache_page: err = filler(data, page), filler will not
return error, it just submits I/O request and returns. So, page is not
uptodate. Checking only if(IS_ERROR(mp->page)) is not enough, we should
add "|| !PageUptodate(mp->page)"
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
In the rare case of failing to write the cleanmarker
the allocated node was not freed.
Pointed out by Forrest Zhao
Initial cleanup by Joern Engel
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Minimal patch removing uses of ROOT_DEV; next patch unexports it. I've
opposed this, but I've planned to reintroduce the functionality without using
ROOT_DEV.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix the error message to refer to the error code, i.e. err, not count, plus
add some cosmetical fixes.
Signed-off-by: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
Cc: Jeff Dike <jdike@addtoit.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Allow the setting of NLM timeouts and grace periods through the proc and
sysclt interfaces on x86_64 architectures
Signed-off-by: Steve Dickson <steved@redhat.com>
Acked-by: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Something has changed in the core kernel such that we now get concurrent
inode write outs, one e.g via pdflush and one via sys_sync or whatever.
This causes a nasty deadlock in ntfs. The only clean solution
unfortunately requires a minor vfs api extension.
First the deadlock analysis:
Prerequisive knowledge: NTFS has a file $MFT (inode 0) loaded at mount
time. The NTFS driver uses the page cache for storing the file contents as
usual. More interestingly this file contains the table of on-disk inodes
as a sequence of MFT_RECORDs. Thus NTFS driver accesses the on-disk inodes
by accessing the MFT_RECORDs in the page cache pages of the loaded inode
$MFT.
The situation: VFS inode X on a mounted ntfs volume is dirty. For same
inode X, the ntfs_inode is dirty and thus corresponding on-disk inode,
which is as explained above in a dirty PAGE_CACHE_PAGE belonging to the
table of inodes ($MFT, inode 0).
What happens:
Process 1: sys_sync()/umount()/whatever... calls __sync_single_inode() for
$MFT -> do_writepages() -> write_page for the dirty page containing the
on-disk inode X, the page is now locked -> ntfs_write_mst_block() which
clears PageUptodate() on the page to prevent anyone else getting hold of it
whilst it does the write out (this is necessary as the on-disk inode needs
"fixups" applied before the write to disk which are removed again after the
write and PageUptodate is then set again). It then analyses the page
looking for dirty on-disk inodes and when it finds one it calls
ntfs_may_write_mft_record() to see if it is safe to write this on-disk
inode. This then calls ilookup5() to check if the corresponding VFS inode
is in icache(). This in turn calls ifind() which waits on the inode lock
via wait_on_inode whilst holding the global inode_lock.
Process 2: pdflush results in a call to __sync_single_inode for the same
VFS inode X on the ntfs volume. This locks the inode (I_LOCK) then calls
write-inode -> ntfs_write_inode -> map_mft_record() -> read_cache_page() of
the page (in page cache of table of inodes $MFT, inode 0) containing the
on-disk inode. This page has PageUptodate() clear because of Process 1
(see above) so read_cache_page() blocks when tries to take the page lock
for the page so it can call ntfs_read_page().
Thus Process 1 is holding the page lock on the page containing the on-disk
inode X and it is waiting on the inode X to be unlocked in ifind() so it
can write the page out and then unlock the page.
And Process 2 is holding the inode lock on inode X and is waiting for the
page to be unlocked so it can call ntfs_readpage() or discover that
Process 1 set PageUptodate() again and use the page.
Thus we have a deadlock due to ifind() waiting on the inode lock.
The only sensible solution: NTFS does not care whether the VFS inode is
locked or not when it calls ilookup5() (it doesn't use the VFS inode at
all, it just uses it to find the corresponding ntfs_inode which is of
course attached to the VFS inode (both are one single struct); and it uses
the ntfs_inode which is subject to its own locking so I_LOCK is irrelevant)
hence we want a modified ilookup5_nowait() which is the same as ilookup5()
but it does not wait on the inode lock.
Without such functionality I would have to keep my own ntfs_inode cache in
the NTFS driver just so I can find ntfs_inodes independent of their VFS
inodes which would be slow, memory and cpu cycle wasting, and incredibly
stupid given the icache already exists in the VFS.
Below is a patch that does the ilookup5_nowait() implementation in
fs/inode.c and exports it.
ilookup5_nowait.diff:
Introduce ilookup5_nowait() which is basically the same as ilookup5() but
it does not wait on the inode's lock (i.e. it omits the wait_on_inode()
done in ifind()).
This is needed to avoid a nasty deadlock in NTFS.
Signed-off-by: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This moves the inotify sysctl knobs to "/proc/sys/fs/inotify" from
"/proc/sys/fs". Also some related cleanup.
Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
All of the different xattr namespaces have different rules.
user.* and ACL's are not allowed on symlinks, and since these were the
first xattrs implemented, I assumed there was no need to support xattrs
on symlinks. This one-line patch should fix it.
Signed-off-by: Dave Kleikamp <shaggy@austin.ibm.com>
inotify is intended to correct the deficiencies of dnotify, particularly
its inability to scale and its terrible user interface:
* dnotify requires the opening of one fd per each directory
that you intend to watch. This quickly results in too many
open files and pins removable media, preventing unmount.
* dnotify is directory-based. You only learn about changes to
directories. Sure, a change to a file in a directory affects
the directory, but you are then forced to keep a cache of
stat structures.
* dnotify's interface to user-space is awful. Signals?
inotify provides a more usable, simple, powerful solution to file change
notification:
* inotify's interface is a system call that returns a fd, not SIGIO.
You get a single fd, which is select()-able.
* inotify has an event that says "the filesystem that the item
you were watching is on was unmounted."
* inotify can watch directories or files.
Inotify is currently used by Beagle (a desktop search infrastructure),
Gamin (a FAM replacement), and other projects.
See Documentation/filesystems/inotify.txt.
Signed-off-by: Robert Love <rml@novell.com>
Cc: John McCutchan <ttb@tentacle.dhs.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This was a pure indentation change, using:
scripts/Lindent fs/reiserfs/*.c include/linux/reiserfs_*.h
to make reiserfs match the regular Linux indentation style. As Jeff
Mahoney <jeffm@suse.com> writes:
The ReiserFS code is a mix of a number of different coding styles, sometimes
different even from line-to-line. Since the code has been relatively stable
for quite some time and there are few outstanding patches to be applied, it
is time to reformat the code to conform to the Linux style standard outlined
in Documentation/CodingStyle.
This patch contains the result of running scripts/Lindent against
fs/reiserfs/*.c and include/linux/reiserfs_*.h. There are places where the
code can be made to look better, but I'd rather keep those patches separate
so that there isn't a subtle by-hand hand accident in the middle of a huge
patch. To be clear: This patch is reformatting *only*.
A number of patches may follow that continue to make the code more consistent
with the Linux coding style.
Hans wasn't particularly enthusiastic about these patches, but said he
wouldn't really oppose them either.
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
indent(1) doesn't know how to handle the "do not compile" error. It results
in the item_ops array declaration being indented a tab stop in when it should
not be. This patch replaces it with a #error that describes why it's failing.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
While fixing an oops in the st driver in a dirty release path, I
encountered an oops in cdev_put for cdevs allocated using cdev_alloc. If
cdev_del is called when the cdev kobject still has an open user, when the
last cdev_put is called, the cdev_put will call kobject_put, which will end
up ultimately releasing the cdev in cdev_dynamic_release. Patch fixes the
oops by preventing cdev_put from accessing freed memory.
Signed-off-by: Brian King <brking@us.ibm.com>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Restore old set of ext2 mount options when remounting of a filesystem
fails.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a problem with ext3 mount option parsing. When remount of a filesystem
fails, old options are now restored.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When a noninitial thread does exec, it becomes the new group leader. If
there is a ITIMER_REAL timer running, it points at the old group leader and
when it fires it can follow a stale pointer. The timer data needs to be
reset to point at the exec'ing thread that is becoming the group leader.
This has to synchronize with any concurrent firing of the timer to make
sure that it_real_fn can never run when the data points to a thread that
might have been reaped already.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Bug symptoms
~~~~~~~~~~~~
For the same inode VFS calls read_inode() twice and doesn't call
clear_inode() between the two read_inode() invocations.
Bug description
~~~~~~~~~~~~~~~
Suppose we have an inode which has zero reference count but is still in
the inode cache. Suppose kswapd invokes shrink_icache_memory() to free
some RAM. In prune_icache() inodes are removed from i_hash. prune_icache
() is then going to call clear_inode(), but drops the inode_lock
spinlock before this. If in this moment another task calls iget() for an
inode which was just removed from i_hash by prune_icache(), then iget()
invokes read_inode() for this inode, because it is *already removed*
from i_hash.
The end result is: we call iget(#N) then iput(#N); inode #N has zero
i_count now and is in the inode cache; kswapd starts. kswapd removes the
inode #N from i_hash ans is preempted; we call iget(#N) again;
read_inode() is invoked as the result; but we expect clear_inode()
before.
Fix
~~~~~~~
To fix the bug I remove inodes from i_hash later, when clear_inode() is
actually called. I remove them from i_hash under spinlock protection.
Since the i_state is set to I_FREEING, it is safe to do this. The others
will sleep waiting for the inode state change.
I also postpone removing inodes from i_sb_list. It is not compulsory to
do so but I do it for readability reasons. Inodes are added/removed to
the lists together everywhere in the code and there is no point to
change this rule. This is harmless because the only user of i_sb_list
which somehow may interfere with me (invalidate_list()) is excluded by
the iprune_sem mutex.
The same race is possible in invalidate_list() so I do the same for it.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes queer behavior in __wait_on_freeing_inode().
If I_LOCK was not set it called yield(), effectively busy waiting for the
removal of the inode from the hash. This change was introduced within
"[PATCH] eliminate inode waitqueue hashtable" Changeset 1.1938.166.16 last
october by wli.
The solution is to restore the old behavior, of unconditionally waiting on
the waitqueue. It doesn't matter if I_LOCK is not set initally, the task
will go to sleep, and wake up when wake_up_inode() is called from
generic_delete_inode() after removing the inode from the hash chain.
Comment is also updated to better reflect current behavior.
This condition is very hard to trigger normally (simultaneous clear_inode()
with iget()) so probably only heavy stress testing can reveal any change of
behavior.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In case of a mount error locks might be uninitialized but
accessed by the resulting call to jffs2_kill_sb().
Signed-off-by: Artem B. Bityuckiy <dedekind@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We recently changed the method of collecting and sorting of
tmp_dnode objects to use a temporary RB-tree instead of a
temporary list. Rename function and update comments.
Signed-off-by: Artem B. Bityuckiy <dedekind@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
After discussion at the recent NFSv4 bake-a-thon, I realized that my
assumption that NFS4_FH_PERSISTENT required filehandles to persist was a
misreading of the spec. This also fixes an interoperability problem with the
Solaris client.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We shouldn't be allowing, e.g., write locks on files not open for read. To
enforce this, we add a pointer from the lock stateid back to the open stateid
it came from, so that the check will continue to be correct even after the
open is upgraded or downgraded.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
As long as we're here, do some miscellaneous cleanup.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The handling of close_lru in preprocess_stateid_op was a source of some
confusion here recently. Try to make the logic a little clearer, by renaming
find_openstateowner_id to make its purpose clearer and untangling some
unnecessarily complicated goto's.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
nfs4_preprocess_seqid_op is called by NFSv4 operations that imply an implicit
renewal of the client lease.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
from RFC 3530:
"Share reservations are established by OPEN operations and by their
nature are mandatory in that when the OPEN denies READ or WRITE
operations, that denial results in such operations being rejected
with error NFS4ERR_LOCKED."
(Note that share_denied is really only a legal error for OPEN.)
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
An OPEN from the same client/open stateowner requires a stateid update because
of the share/deny access update.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We're insisting that the lock sequence id field passed in the
open_to_lockowner struct always be zero. This is probably thanks to the
sentence in rfc3530: "The first request issued for any given lock_owner is
issued with a sequence number of zero."
But there doesn't seem to be any problem with allowing initial sequence
numbers other than zero. And currently this is causing lock reclaims from the
Linux client to fail.
In the spirit of "be liberal in what you accept, conservative in what you
send", we'll relax the check (and patch the Linux client as well).
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Add some comments on the use of so_seqid, in an attempt to avoid some of the
confusion outlined in the previous patch....
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The sequence number we store in the sequence id is the last one we received
from the client. So on the next operation we'll check that the client gives
us the next higher number.
We increment sequence id's at the last moment, in encode, so that we're sure
of knowing the right error return. (The decision to increment the sequence id
depends on the exact error returned.)
However on the *first* use of a sequence number, if we set the sequence number
to the one received from the client and then let the increment happen on
encode, we'll be left with a sequence number one to high.
For that reason, ENCODE_SEQID_OP_TAIL only increments the sequence id on
*confirmed* stateowners.
This creates a problem for open reclaims, which are confirmed on first use.
Therefore the open reclaim code, as a special exception, *decrements* the
sequence id, cancelling out the undesired increment on encode. But this
prevents the sequence id from ever being incremented in the case where
multiple reclaims are sent with the same openowner. Yuch!
We could add another exception to the open reclaim code, decrementing the
sequence id only if this is the first use of the open owner.
But it's simpler by far to modify the meaning of the op_seqid field: instead
of representing the previous value sent by the client, we take op_seqid, after
encoding, to represent the *next* sequence id that we expect from the client.
This eliminates the need for special-case handling of the first use of a
stateowner.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Yeah, it's trivial, but this drives me up the wall....
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A misreading of the spec lead us to convert all errors on open and lock
reclaims to RECLAIM_BAD. This causes problems--for example, a reboot within
the grace period could lead to reclaims with stale stateid's, and we'd like to
return STALE errors in those cases.
What rfc3530 actually says about RECLAIM_BAD: "The reclaim provided by the
client does not match any of the server's state consistency checks and is
bad." I'm assuming that "state consistency checks" refers to checks for
consistency with the state recorded to stable storage, and that the error
should be reserved for that case.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A GRACE or NOGRACE response to a lock request should also bump the sequence
id. So we delay the handling of grace period errors till after we've found
the relevant owner.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The GRACE and NOGRACE errors should bump the sequence id on open. So we delay
the handling of these errors until nfsd4_process_open2, at which point we've
set the open owner, so the encode routine will be able to bump the sequence
id.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We oops in list_for_each_entry(), because release_stateowner frees something
on the list we're traversing.
Signed-off-by: Andy Adamson <andros@citi.umich.edu>
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Make sure we don't try to delete client recovery directories multiple times;
fixes some spurious error messages.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Oops, this lookup_one_len needs the i_sem.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We need to fsync the recovery directory after writing to it, but we weren't
doing this correctly. (For example, we weren't taking the i_sem when calling
->fsync().)
Just reuse the existing nfsd fsync code instead.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We need to remove the recovery directory here too. (This chunk just got lost
somehow in the process of commuting the reboot recovery patches past the other
patches.)
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Signed-off-by: Neil Brown <neilb@cse.unsw.edu.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch renames _mntput() to something a little more descriptive:
mntput_no_expire().
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch renames vfsmount->mnt_fslink to something a little more
descriptive: vfsmount->mnt_expire.
Signed-off-by: Mike Waychison <michael.waychison@sun.com>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Dcookies shouldn't play with the internals of dentry and vfsmnt
refcounting. It defeats grepping, and is prone to break if implementation
details change.
In addition the function doesn't even seem to be performance critical: it
calls kmem_cache_alloc().
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch sets ->mnt_namespace where it's actually added to the
namespace.
Previously mnt_namespace was set in do_kern_mount() even if the filesystem
was never added to any process's namespace (most kernel-internal
filesystems).
This discrepancy doesn't actually cause any problems, but it's cleaner if
mnt_namespace is NULL for these non exported filesystems.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch clears mnt_namespace in an expired mount.
If mnt_namespace is not cleared, it's possible to attach a new mount to the
already detached mount, because check_mnt() can return true.
The effect is a resource leak, since the resulting tree will never be
freed.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes a bug noticed by Al Viro:
However, we still have a problem here - just what would
happen if vfsmount is detached while we were grabbing namespace
semaphore? Refcount alone is not useful here - we might be held by
whoever had detached the vfsmount. IOW, we should check that it's
still attached (i.e. that mnt->mnt_parent != mnt). If it's not -
just leave it alone, do mntput() and let whoever holds it deal with
the sucker. No need to put it back on lists.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch splits the mark_mounts_for_expiry() function. It's too complex and
too deeply nested, even without the bugfix in the following patch.
Otherwise code is completely the same.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch simplifies mark_mounts_for_expiry() by using detach_mnt() instead
of duplicating everything it does.
It should be an equivalent transformation except for righting the dput/mntput
order.
Al Viro said: "Looks sane".
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Cc: <viro@parcelfarce.linux.theplanet.co.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch fixes a race found by Ram in mark_mounts_for_expiry() in
fs/namespace.c.
The bug can only be triggered with simultaneous exiting of a process having
a private namespace, and expiry of a mount from within that namespace.
It's practically impossible to trigger, and I haven't even tried. But
still, a bug is a bug.
The race happens when put_namespace() is called by another task, while
mark_mounts_for_expiry() is between atomic_read() and get_namespace(). In
that case get_namespace() will be called on an already dead namespace with
unforeseeable results.
The solution was suggested by Al Viro, with his own words:
Instead of screwing with atomic_read() in there, why don't we
simply do the following:
a) atomic_dec_and_lock() in put_namespace()
b) __put_namespace() called without dropping lock
c) the first thing done by __put_namespace would be
struct vfsmount *root = namespace->root;
namespace->root = NULL;
spin_unlock(...);
....
umount_tree(root);
...
d) check in mark_... would be simply namespace && namespace->root.
And we are all set; no screwing around with atomic_read(), no magic
at all. Dying namespace gets NULL ->root.
All changes of ->root happen under spinlock.
If under a spinlock we see non-NULL ->mnt_namespace, it won't be
freed until we drop the lock (we will set ->mnt_namespace to NULL
under that lock before we get to freeing namespace).
If under a spinlock we see non-NULL ->mnt_namespace and
->mnt_namespace->root, we can grab a reference to namespace and be
sure that it won't go away.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Al Viro <viro@parcelfarce.linux.theplanet.co.uk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch clears mnt_namespace on unmount.
Not clearing mnt_namespace has two effects:
1) It is possible to attach a new mount to a detached mount,
because check_mnt() returns true.
This means, that when no other references to the detached mount
remain, it still can't be freed. This causes a resource leak,
and possibly un-removable modules.
2) If mnt_namespace is dereferenced (only in mark_mounts_for_expiry())
after the namspace has been freed, it can cause an Oops, memory
corruption, etc.
1) has been tested before and after the patch, 2) is only speculation.
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We're dereferencing `flp' and then we're testing it for NULLness.
Either the compiler accidentally saved us or the existing null-pointer checdk
is redundant.
This defect was found automatically by Coverity Prevent, a static analysis tool.
Signed-off-by: Zaur Kambarov <zkambarov@coverity.com>
Cc: Matthew Wilcox <willy@debian.org>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix debugging printk.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We are not using the in-inode space for xattrs in reserved inodes because
mkfs.ext3 doesn't initialize it properly. For those inodes, we set
i_extra_isize to 0. Make sure that we also don't overwrite the
i_extra_isize field when writing out the inode in that case. This is for
cleanliness only, and doesn't fix an actual bug.
Signed-off-by: Andreas Gruenbacher <agruen@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>