ceph_calc_file_object_mapping() takes (among other things) a "file"
offset and length, and based on the layout, determines the object
number ("bno") backing the affected portion of the file's data and
the offset into that object where the desired range begins. It also
computes the size that should be used for the request--either the
amount requested or something less if that would exceed the end of
the object.
This patch changes the input length parameter in this function so it
is used only for input. That is, the argument will be passed by
value rather than by address, so the value provided won't get
updated by the function.
The value would only get updated if the length would surpass the
current object, and in that case the value it got updated to would
be exactly that returned in *oxlen.
Only one of the two callers is affected by this change. Update
ceph_calc_raw_layout() so it records any updated value.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
The MDS may have incorrect wanted caps after importing caps. So the
client should check the value mds has and send cap update if necessary.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
When client wants to release an imported cap, it's possible there
is no reserved cap_release message in corresponding mds session.
so __queue_cap_release causes kernel panic.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Allow revoking duplicated caps issued by non-auth MDS if these caps
are also issued by auth MDS.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The mds now sends back a created inode if the create request
performed the create. If the file already existed, no inode is
returned in the reply. This allows ceph to set the created flag
in atomic_open so that permissions are properly checked in the case
that the file wasn't created by the create call to the mds.
To ensure compability with previous kernels, a feature for sending
back the inode in the create reply was added, so that the mds will
only send back the inode if the client indicates it supports the
feature.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
The error returned by ceph_mdsc_do_request includes errors sending the
request, errors on timeout, or any errors coming from the mds. If
ceph_mdsc_do_request returns an error, the reply struct will most likely
be bogus. We need to bail out and propogate the error instead of
overwriting it.
Signed-off-by: Sam Lang <sam.lang@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Function start_read() can get an error before processing all pages.
It must not only release the remaining pages, but unlock them too.
This fixes http://tracker.newdream.net/issues/3370
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
If client sends cap message that requests new max size during
exporting caps, the exporting MDS will drop the message quietly.
So the client may wait for the reply that updates the max size
forever. call handle_cap_grant() for cap import message can
avoid this issue.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
we should set i_truncate_pending to 0 after page cache is truncated
to i_truncate_size
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Add dirty inode to cap_dirty_migrating list instead, this can avoid
ceph_flush_dirty_caps() entering infinite loop.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
__wake_requests() will enter infinite loop if we use it to wake
requests in the session->s_waiting list. __wake_requests() deletes
requests from the list and __do_request() adds requests back to
the list.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
The cap from non-auth mds doesn't have a meaningful max_size value.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
This would reset a connection with any OSD that had an outstanding
request that was taking more than N seconds. The idea was that if the
OSD was buggy, the client could compensate by resending the request.
In reality, this only served to hide server bugs, and we haven't
actually seen such a bug in quite a while. Moreover, the userspace
client code never did this.
More importantly, often the request is taking a long time because the
OSD is trying to recover, or overloaded, and killing the connection
and retrying would only make the situation worse by giving the OSD
more work to do.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
ceph_aio_write() has an optimization that marks cap EPH_CAP_FILE_WR
dirty before data is copied to page cache and inode size is updated.
If ceph_check_caps() flushes the dirty cap before the inode size is
updated, MDS can miss the new inode size. The fix is move
ceph_{get,put}_cap_refs() into ceph_write_{begin,end}() and call
__ceph_mark_dirty_caps() after inode size is updated.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
set_request_path_attr() checks for NULL ptr before calling strlen()
This fixes http://tracker.newdream.net/issues/3404
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
Call to d_find_alias() needs a corresponding dput()
This fixes http://tracker.newdream.net/issues/3271
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
A pgoff_t is defined (by default) to have type (unsigned long). On
architectures such as i686 that's a 32-bit type. The ceph address
space code was attempting to produce 64 bit offsets by shifting a
page's index by PAGE_CACHE_SHIFT, but the result was not what was
desired because the shift occurred before the result got promoted
to 64 bits.
Fix this by converting all uses of page->index used in this way to
use the page_offset() macro, which ensures the 64-bit result has the
intended value.
This fixes http://tracker.newdream.net/issues/3112
Reported-by: Mohamed Pakkeer <pakkeer.mohideen@realimage.com>
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
If the user calls GET_DATALOC on a file with an invalid (e.g.,
zeroed) layout, return EIO to userland.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
If we are creating an osd request and get an invalid layout, return
an EINVAL to the caller. We switch up the return to have an error
code instead of NULL implying -ENOMEM.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Convert cpu_to_le32(le32_to_cpu(E1) + E2) to use le32_add_cpu().
dpatch engine is used to auto generate this patch.
(https://github.com/weiyj/dpatch)
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Sage Weil <sage@inktank.com>
A recent change to /sbin/mountall causes any trailing '/' character
in the "device" (or fs_spec) field in /etc/fstab to be stripped. As
a result, an entry for a ceph mount that intends to mount the root
of the name space ends up with now path portion, and the ceph mount
option processing code rejects this.
That is, an entry in /etc/fstab like:
cephserver:port:/ /mnt ceph defaults 0 0
provides to the ceph code just "cephserver:port:" as the "device,"
and that gets rejected.
Although this is a bug in /sbin/mountall, we can have the ceph mount
code support an empty/nonexistent path, interpreting it to mean the
root of the name space.
RFC 5952 offers recommendations for how to express IPv6 addresses,
and recommends the usage found in RFC 3986 (which specifies the
format for URI's) for representing both IPv4 and IPv6 addresses that
include port numbers. (See in particular the definition of
"authority" found in the Appendix of RFC 3986.)
According to those standards, no host specification will ever
contain a '/' character. As a result, it is sufficient to scan a
provided "device" from an /etc/fstab entry for the first '/'
character, and if it's found, treat that as the beginning of the
path. If no '/' character is present, we can treat the entire
string as the monitor host specification(s), and assume the path
to be the root of the name space. We'll still require a ':' to
separate the host portion from the (possibly empty) path portion.
This means that we can more formally define how ceph will interpret
the "device" it's provided when processing a mount request:
"device" will look like:
<server_spec>[,<server_spec>...]:[<path>]
where
<server_spec> is <ip>[:<port>]
<path> is optional, but if present must begin with '/'
This addresses http://tracker.newdream.net/issues/2919
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
If "l->stripe_unit" is zero the the mod on the next line will cause a
divide by zero bug. This comes from the copy_from_user() in
ceph_ioctl_set_layout_policy(). Passing 0 is valid, though (it means
"do not change") so avoid the % check in that case.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
If the MDS gives us a dentry and we weren't prepared to handle it,
WARN_ON_ONCE instead of crashing.
Reported-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
The debugfs directory includes the cluster fsid and our unique global_id.
We need to delay the initialization of the debug entry until we have
learned both the fsid and our global_id from the monitor or else the
second client can't create its debugfs entry and will fail (and multiple
client instances aren't properly reflected in debugfs).
Reported by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
The initial ->atomic_open op was carried over from the old intent code,
which was incomplete and didn't really work. Replace it with a fresh
method. In particular:
* always attempt to do an atomic open+lookup, both for the create case
and for lookups of existing files.
* fix symlink handling by returning 1 to the VFS so that we can follow
the link to its destination. This fixes a longstanding ceph bug (#2392).
Signed-off-by: Sage Weil <sage@inktank.com>
Pull second vfs pile from Al Viro:
"The stuff in there: fsfreeze deadlock fixes by Jan (essentially, the
deadlock reproduced by xfstests 068), symlink and hardlink restriction
patches, plus assorted cleanups and fixes.
Note that another fsfreeze deadlock (emergency thaw one) is *not*
dealt with - the series by Fernando conflicts a lot with Jan's, breaks
userland ABI (FIFREEZE semantics gets changed) and trades the deadlock
for massive vfsmount leak; this is going to be handled next cycle.
There probably will be another pull request, but that stuff won't be
in it."
Fix up trivial conflicts due to unrelated changes next to each other in
drivers/{staging/gdm72xx/usb_boot.c, usb/gadget/storage_common.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (54 commits)
delousing target_core_file a bit
Documentation: Correct s_umount state for freeze_fs/unfreeze_fs
fs: Remove old freezing mechanism
ext2: Implement freezing
btrfs: Convert to new freezing mechanism
nilfs2: Convert to new freezing mechanism
ntfs: Convert to new freezing mechanism
fuse: Convert to new freezing mechanism
gfs2: Convert to new freezing mechanism
ocfs2: Convert to new freezing mechanism
xfs: Convert to new freezing code
ext4: Convert to new freezing mechanism
fs: Protect write paths by sb_start_write - sb_end_write
fs: Skip atime update on frozen filesystem
fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
fs: Improve filesystem freezing handling
switch the protection of percpu_counter list to spinlock
nfsd: Push mnt_want_write() outside of i_mutex
btrfs: Push mnt_want_write() outside of i_mutex
fat: Push mnt_want_write() outside of i_mutex
...
Pull Ceph changes from Sage Weil:
"Lots of stuff this time around:
- lots of cleanup and refactoring in the libceph messenger code, and
many hard to hit races and bugs closed as a result.
- lots of cleanup and refactoring in the rbd code from Alex Elder,
mostly in preparation for the layering functionality that will be
coming in 3.7.
- some misc rbd cleanups from Josh Durgin that are finally going
upstream
- support for CRUSH tunables (used by newer clusters to improve the
data placement)
- some cleanup in our use of d_parent that Al brought up a while back
- a random collection of fixes across the tree
There is another patch coming that fixes up our ->atomic_open()
behavior, but I'm going to hammer on it a bit more before sending it."
Fix up conflicts due to commits that were already committed earlier in
drivers/block/rbd.c, net/ceph/{messenger.c, osd_client.c}
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (132 commits)
rbd: create rbd_refresh_helper()
rbd: return obj version in __rbd_refresh_header()
rbd: fixes in rbd_header_from_disk()
rbd: always pass ops array to rbd_req_sync_op()
rbd: pass null version pointer in add_snap()
rbd: make rbd_create_rw_ops() return a pointer
rbd: have __rbd_add_snap_dev() return a pointer
libceph: recheck con state after allocating incoming message
libceph: change ceph_con_in_msg_alloc convention to be less weird
libceph: avoid dropping con mutex before fault
libceph: verify state after retaking con lock after dispatch
libceph: revoke mon_client messages on session restart
libceph: fix handling of immediate socket connect failure
ceph: update MAINTAINERS file
libceph: be less chatty about stray replies
libceph: clear all flags on con_close
libceph: clean up con flags
libceph: replace connection state bits with states
libceph: drop unnecessary CLOSED check in socket state change callback
libceph: close socket directly from ceph_con_close()
...
There are two structures in which a count of snapshots are
maintained:
struct ceph_snap_context {
...
u32 num_snaps;
...
}
and
struct ceph_snap_realm {
...
u32 num_prior_parent_snaps; /* had prior to parent_since */
...
u32 num_snaps;
...
}
These fields never take on negative values (e.g., to hold special
meaning), and so are really inherently unsigned. Furthermore they
take their value from over-the-wire or on-disk formatted 32-bit
values.
So change their definition to have type u32, and change some spots
elsewhere in the code to account for this change.
Signed-off-by: Alex Elder <elder@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
We re-run the loop but we don't re-set the attrs pointer back to NULL.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Reviewed-by: Alex Elder <elder@inktank.com>
When we detect a mds session reset, close the old ceph_connection before
reopening it. This ensures we clean up the old socket properly and keep
the ceph_connection state correct.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
This is simply cleanup that will keep things more closely synced with the
userland code.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Alex Elder <elder@inktank.com>
Reviewed-by: Yehuda Sadeh <yehuda@inktank.com>
CC: Sage Weil <sage@newdream.net>
CC: ceph-devel@vger.kernel.org
Acked-by: Sage Weil <sage@newdream.net>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
d_parent is never NULL, and IS_ROOT() is the proper way to check for a
(non-self-referential) parent.
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Sage Weil <sage@inktank.com>
Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called. They could also be passed to the
compare function.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
boolean "does it have to be exclusive?" flag is passed instead;
Local filesystem should just ignore it - the object is guaranteed
not to be there yet.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument. And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just pass struct file *. Methods are happier that way...
There's no need to return struct file * from finish_open() now,
so let it return int. Next: saner prototypes for parts in
namei.c
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Change of calling conventions:
old new
NULL 1
file 0
ERR_PTR(-ve) -ve
Caller *knows* that struct file *; no need to return it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and let finish_open() report having opened the file via that sucker.
Next step: don't modify od->filp at all.
[AV: FILE_CREATE was already used by cifs; Miklos' fix folded]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add an ->atomic_open implementation which replaces the atomic lookup+open+create
operation implemented via ->lookup and ->create operations.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
What was the purpose of this?
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I got lots of NULL pointer dereference Oops when compiling kernel on ceph.
The bug is because the kernel page migration routine replaces some pages
in the page cache with new pages, these new pages' private can be non-zero.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 28c0254ede)