nilfs_wait_on_logs has a potential to slip out before completion of
all bio requests when it met an error. This synchronization fault may
cause unexpected results, for instance, violative access to freed
segment buffers from an end-bio callback routine.
This fixes the issue by ensuring that nilfs_wait_on_logs waits all
given logs.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
According to the report from Andreas Beckmann (Message-ID:
<4BA54677.3090902@abeckmann.de>), nilfs in 2.6.33 kernel got stuck
after a disk full error.
This turned out to be a regression by log writer updates merged at
kernel 2.6.33. nilfs_segctor_abort_construction, which is a cleanup
function for erroneous cases, was skipping writeback completion for
some logs.
This fixes the bug and would resolve the hang issue.
Reported-by: Andreas Beckmann <debian@abeckmann.de>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Tested-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: stable <stable@kernel.org> [2.6.33.x]
Clear pointer to mds request after dropping the reference to
ensure we don't drop it again, as there is at least one error
path through this function that does not reset fi->last_readdir
to a new value.
Signed-off-by: Sage Weil <sage@newdream.net>
Fix a broken check that a reply came back from the same MDS we sent the
request to. I don't think a case that actually triggers this would ever
come up in practice, but it's clearly wrong and easy to fix.
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Return ERR_PTR(-ENOMEM) if kmalloc() fails. We handle allocation
failures the same way later in the function.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Currently, if the wait_event_interruptible is interrupted, we
return EAGAIN unconditionally and loop, such that we aren't, in
fact, interruptible. So, propagate ERESTARTSYS if we get it.
Signed-off-by: Sage Weil <sage@newdream.net>
We were rebuilding the snap context when it was not necessary
(i.e. when the realm seq hadn't changed _and_ the parent seq
was still older), which caused page snapc pointers to not match
the realm's snapc pointer (even though the snap context itself
was identical). This confused begin_write and put it into an
endless loop.
The correct logic is: rebuild snapc if _my_ realm seq changed, or
if my parent realm's seq is newer than mine (and thus mine needs
to be rebuilt too).
Signed-off-by: Sage Weil <sage@newdream.net>
We get a fault callback on _every_ tcp connection fault. Normally, we
want to reopen the connection when that happens. If the address we have
is bad, however, and connection attempts always result in a connection
refused or similar error, explicitly closing and reopening the msgr
connection just prevents the messenger's backoff logic from kicking in.
The result can be a console full of
[ 3974.417106] ceph: osd11 10.3.14.138:6800 connection failed
[ 3974.423295] ceph: osd11 10.3.14.138:6800 connection failed
[ 3974.429709] ceph: osd11 10.3.14.138:6800 connection failed
Instead, if we get a fault, and have outstanding requests, but the osd
address hasn't changed and the connection never successfully connected in
the first place, do nothing to the osd connection. The messenger layer
will back off and retry periodically, because we never connected and thus
the lossy bit is not set.
Instead, touch each request's r_stamp so that handle_timeout can tell the
request is still alive and kicking.
Signed-off-by: Sage Weil <sage@newdream.net>
Make variable name slightly more generic, since it will (soon)
reflect either the time the request was sent OR the time it was
last determined to be still retrying.
Signed-off-by: Sage Weil <sage@newdream.net>
The messenger fault was clearing the BUSY bit, for reasons unclear. This
made it possible for the con->ops->fault function to reopen the connection,
and requeue work in the workqueue--even though the current thread was
already in con_work.
This avoids a problem where the client busy loops with connection failures
on an unreachable OSD, but doesn't address the root cause of that problem.
Signed-off-by: Sage Weil <sage@newdream.net>
Prevent duplicate 'mds0 caps stale' message from spamming the console every
few seconds while the MDS restarts. Set s_renew_requested earlier, so that
we only print the message once, even if we don't send an actual request.
Signed-off-by: Sage Weil <sage@newdream.net>
The incremental map decoding of pg pool updates wasn't skipping
the snaps and removed_snaps vectors. This caused osd requests
to stall when pool snapshots were created or fs snapshots were
deleted. Use a common helper for full and incremental map
decoders that decodes pools properly.
Signed-off-by: Sage Weil <sage@newdream.net>
The wait_unsafe_requests() helper dropped the mdsc mutex to wait
for each request to complete, and then examined r_node to get the
next request after retaking the lock. But the request completion
removes the request from the tree, so r_node was always undefined
at this point. Since it's a small race, it usually led to a
valid request, but not always. The result was an occasional
crash in rb_next() while dereferencing node->rb_left.
Fix this by clearing the rb_node when removing the request from
the request tree, and not walking off into the weeds when we
are done waiting for a request. Since the request we waited on
will _always_ be out of the request tree, take a ref on the next
request, in the hopes that it won't be. But if it is, it's ok:
we can start over from the beginning (and traverse over older read
requests again).
Signed-off-by: Sage Weil <sage@newdream.net>
We were releasing used caps (e.g. FILE_CACHE) from encode_inode_release
with MDS requests (e.g. setattr). We don't carry refs on most caps, so
this code worked most of the time, but for setattr (utimes) we try to
drop Fscr.
This causes cap state to get slightly out of sync with reality, and may
result in subsequent mds revoke messages getting ignored.
Fix by only releasing unused caps.
Signed-off-by: Sage Weil <sage@newdream.net>
Drop session mutex unconditionally in handle_cap_grant, and do the
check_caps from the handle_cap_grant helper. This avoids using a magic
return value.
Also avoid using a flag variable in the IMPORT case and call
check_caps at the appropriate point.
Signed-off-by: Sage Weil <sage@newdream.net>
Passing a session pointer to ceph_check_caps() used to mean it would leave
the session mutex locked. That wasn't always possible if it wasn't passed
CHECK_CAPS_AUTHONLY. If could unlock the passed session and lock a
differet session mutex, which was clearly wrong, and also emitted a
warning when it a racing CPU retook it and we did an unlock from the wrong
context.
This was only a problem when there was more than one MDS.
First, make ceph_check_caps unconditionally drop the session mutex, so that
it is free to lock other sessions as needed. Then adjust the one caller
that passes in a session (handle_cap_grant) accordingly.
Signed-off-by: Sage Weil <sage@newdream.net>
This causes an oops when debug output is enabled and we kick
an osd request with no current r_osd (sometime after an osd
failure). Check the pointer before dereferencing.
Signed-off-by: Sage Weil <sage@newdream.net>
Previously we would decode state directly into our current ticket_handler.
This is problematic if for some reason we fail to decode, because we end
up with half new state and half old state.
We are probably already in bad shape if we get an update we can't decode,
but we may as well be tidy anyway. Decode into new_* temporaries and
update the ticket_handler only on success.
Signed-off-by: Sage Weil <sage@newdream.net>
It seems clear from the surrounding code that xpermits is allowed to be
NULL here.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The reply parsing code attempts to decode the GETATTR response even if
the DELEGRETURN portion of the compound returned an error. The GETATTR
response won't actually exist if that's the case and we're asking the
parser to read past the end of the response.
This bug is fairly benign. The parser catches this without reading past
the end of the response and decode_getfattr returns -EIO. Earlier
kernels however had decode_op_hdr using the READ_BUF macro, and this
bug would make this printk pop any time the client got an error from
a delegreturn:
kernel: decode_op_hdr: reply buffer overflowed in line XXXX
More recent kernels seem to have replaced this printk with a dprintk.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Andreas Beckmann gave me a report that nilfs logged the following
warnings when it got a disk full:
nilfs_sufile_do_cancel_free: segment 0 must be clean
nilfs_sufile_do_cancel_free: segment 1 must be clean
These arise from a duplicate call to nilfs_segctor_cancel_freev in an
error path of log writer. This will fix the issue.
Reported-by: Andreas Beckmann <debian@abeckmann.de>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Release the old ticket_blob buffer when we get an updated service ticket
from the monitor. Previously these were getting leaked.
Signed-off-by: Sage Weil <sage@newdream.net>
The buffer size was incorrectly calculated for the ceph_x_encrypt()
encapsulated ticket blob. Use a helper (with correct arithmetic) and
BUG out if we were wrong.
Signed-off-by: Sage Weil <sage@newdream.net>
We were failing to reconnect to services due to an old authenticator, even
though we had the new ticket, because we weren't properly retrying the
connect handshake, because we were calling an old/incorrect helper that
left in_base_pos incorrect. The result was a failure to reconnect to the
OSD or MDS (with an authentication error) if the MDS restarted after the
service had been up a few hours (long enough for the original authenticator
to be invalid). This was only a problem if the AUTH_X authentication was
enabled.
Now that the 'negotiate' and 'connect' stages are fully separated, we
should use the prepare_read_connect() helper instead, and remove the
obsolete one.
Signed-off-by: Sage Weil <sage@newdream.net>
When an inode was dropped while being migrated between two MDSs,
i_cap_exporting_issued was non-zero such that issue caps were non-zero and
__ceph_is_any_caps(ci) was true. This prevented the inode from being
removed from the snap realm, even as it was dropped from the cache.
Fix this by dropping any residual i_snap_realm ref in destroy_inode.
Signed-off-by: Sage Weil <sage@newdream.net>
All ci->i_snap_realm_item/realm->inodes_with_caps manipulation should be
protected by realm->inodes_with_caps_lock. This bug would have only bit
us in a rare race with a realm split (during some snap creations).
Signed-off-by: Sage Weil <sage@newdream.net>
Added assertion, and cleared one case where the implemented caps were
not following the issued caps.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
You can't store a pointer that you haven't filled in yet and expect it
to work.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
When replacing a xattr's value, in some case we wipe its name/value
first and then re-add it. The wipe is done by
ocfs2_xa_block_wipe_namevalue() when the xattr is in the inode or
block. We currently adjust name_offset for all the entries which have
(offset < name_offset). This does not adjust the entrie we're replacing.
Since we are replacing the entry, we don't adjust the total entry count.
When we calculate a new namevalue location, we trust the entries
now-wrong offset in ocfs2_xa_get_free_start(). The solution is to
also adjust the name_offset for the replaced entry, allowing
ocfs2_xa_get_free_start() to calculate the new namevalue location
correctly.
The following script can trigger a kernel panic easily.
echo 'y'|mkfs.ocfs2 --fs-features=local,xattr -b 4K $DEVICE
mount -t ocfs2 $DEVICE $MNT_DIR
FILE=$MNT_DIR/$RANDOM
for((i=0;i<76;i++))
do
string_76="a$string_76"
done
string_78="aa$string_76"
string_82="aaaa$string_78"
touch $FILE
setfattr -n 'user.test1234567890' -v $string_76 $FILE
setfattr -n 'user.test1234567890' -v $string_78 $FILE
setfattr -n 'user.test1234567890' -v $string_82 $FILE
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (205 commits)
ceph: update for write_inode API change
ceph: reset osd after relevant messages timed out
ceph: fix flush_dirty_caps race with caps migration
ceph: include migrating caps in issued set
ceph: fix osdmap decoding when pools include (removed) snaps
ceph: return EBADF if waiting for caps on closed file
ceph: set osd request message front length correctly
ceph: reset front len on return to msgpool; BUG on mismatched front iov
ceph: fix snaptrace decoding on cap migration between mds
ceph: use single osd op reply msg
ceph: reset bits on connection close
ceph: remove bogus mds forward warning
ceph: remove fragile __map_osds optimization
ceph: fix connection fault STANDBY check
ceph: invalidate_authorizer without con->mutex held
ceph: don't clobber write return value when using O_SYNC
ceph: fix client_request_forward decoding
ceph: drop messages on unregistered mds sessions; cleanup
ceph: fix comments, locking in destroy_inode
ceph: move dereference after NULL test
...
Fix trivial conflicts in Documentation/ioctl/ioctl-number.txt
* git://git.kernel.org/pub/scm/linux/kernel/git/sfrench/cifs-2.6:
cifs: trivial white space
[CIFS] checkpatch cleanup
cifs: add cifs_revalidate_file
cifs: add a CIFSSMBUnixQFileInfo function
cifs: add a CIFSSMBQFileInfo function
cifs: overhaul cifs_revalidate and rename to cifs_revalidate_dentry
* git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (30 commits)
Btrfs: fix the inode ref searches done by btrfs_search_path_in_tree
Btrfs: allow treeid==0 in the inode lookup ioctl
Btrfs: return keys for large items to the search ioctl
Btrfs: fix key checks and advance in the search ioctl
Btrfs: buffer results in the space_info ioctl
Btrfs: use __u64 types in ioctl.h
Btrfs: fix search_ioctl key advance
Btrfs: fix gfp flags masking in the compression code
Btrfs: don't look at bio flags after submit_bio
btrfs: using btrfs_stack_device_id() get devid
btrfs: use memparse
Btrfs: add a "df" ioctl for btrfs
Btrfs: cache the extent state everywhere we possibly can V2
Btrfs: cache ordered extent when completing io
Btrfs: cache extent state in find_delalloc_range
Btrfs: change the ordered tree to use a spinlock instead of a mutex
Btrfs: finish read pages in the order they are submitted
btrfs: fix btrfs_mkdir goto for no free objectids
Btrfs: flush data on snapshot creation
Btrfs: make df be a little bit more understandable
...
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
NFS: ensure bdi_unregister is called on mount failure.
NFS: Avoid a deadlock in nfs_release_page
NFSv4: Don't ignore the NFS_INO_REVAL_FORCED flag in nfs_revalidate_inode()
nfs4: Make the v4 callback service hidden
nfs: fix unlikely memory leak
rpc client can not deal with ENOSOCK, so translate it into ENOCONN
What we were doing before was to ask for the current window size as the
maximum allocation. This had the effect of limiting the amount of allocation
we could get for the local alloc during times when the window size was
shrunk due to fragmentation. In some cases, that could actually *increase*
fragmentation by artificially limiting the number of bits we can accept. So
while we still want to ask for a minimum number of bits equal to window
size, there is no reason why we should limit the number of bits the local
alloc should accept. Hence always allow the maximum number of local alloc
bits.
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
This is used by the inode lookup ioctl to follow all the backrefs up
to the subvol root. But the search being done would sometimes land one
past the last item in the leaf instead of finding the backref.
This changes the search to look for the highest possible backref and hop
back one item. It also fixes a leaked path on failure to find the root.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
When a root id of 0 is sent to the inode lookup ioctl, it will
use the root of the file we're ioctling and pass the root id
back to userland along with the results.
This allows userland to do searches based on that root later on.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The search ioctl was skipping large items entirely (ones that are too
big for the results buffer). This changes things to at least copy
the item header so that we can send information about the item back to
userland.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
The search ioctl was working well for finding tree roots, but using it for
generic searches requires a few changes to how the keys are advanced.
This treats the search control min fields for objectid, type and offset
more like a key, where we drop the offset to zero once we bump the type,
etc.
The downside of this is that we are changing the min_type and min_offset
fields during the search, and so the ioctl caller needs extra checks to make sure
the keys in the result are the ones it wanted.
This also changes key_in_sk to use btrfs_comp_cpu_keys, just to make
things more readable.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Use bitmap_weight() instead of doing hweight32() for each u32 element in
the page.
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Anton Altaparmakov <aia21@cantab.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
jffs2 uses rb_node = NULL; to zero rb_root.
The problem with this is that 17d9ddc72fb8bba0d4f678 ("rbtree: Add
support for augmented rbtrees") in the linux-next tree adds a new field
to that struct which needs to be NULL as well. This patch uses RB_ROOT
as the intializer so all of the relevant fields will be NULL'd.
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Eric Paris <eparis@redhat.com>
Acked-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2_set_acl() and ocfs2_init_acl() were setting i_mode on the in-memory
inode, but never setting it on the disk copy. Thus, acls were some times not
getting propagated between nodes. This patch fixes the issue by adding a
helper function ocfs2_acl_set_mode() which does this the right way.
ocfs2_set_acl() and ocfs2_init_acl() are then updated to call
ocfs2_acl_set_mode().
Signed-off-by: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In reflink, we need to upate i_blocks for the target inode.
Reported-by: Jie Liu <jeff.liu@oracle.com>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
In ocfs2_validate_gd_parent, we check bg_chain against the
cl_next_free_rec of the dinode. Actually in resize, we have
the chance of bg_chain == cl_next_free_rec. So add some
additional condition check for it.
I also rename paramter "clean_error" to "resize", since the
old one is not clearly enough to indicate that we should only
meet with this case in resize.
btw, the correpsonding bug is
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1230.
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
ocfs2_lock() will skip locks on file which has mode set to 02666. This
is a problem in cases where the mode of the file is changed after a
process has obtained a lock on the file.
ocfs2_lock() should skip the check for mandatory locks when unlocking a
file.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Joel Becker <joel.becker@oracle.com>
If we are doing a forced shutdown, we can get lots of noise about
delalloc pages being discarded. This is happens by design during a
forced shutdown, so don't spam the logs with these messages.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>