Peng Tao points out that the call to pnfs_mark_matching_lsegs_return()
could race with pnfs_put_lseg(), in which case the layout segment is
cleared, but no layoutreturn will be sent.
Fix is to replace the call to pnfs_mark_matching_lsegs_invalid().
Reported-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fix a bug whereby if all the layout segments could be immediately freed,
the call to pnfs_error_mark_layout_for_return() would never result in
a layoutreturn.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If pnfs_mark_matching_lsegs_return() needs to mark a layout segment for
return, then it must also set the return iomode.
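
As a rough illustration only (not the actual patch), the rule above can be
sketched in userspace C. The sk_* names and structures are hypothetical
stand-ins, and the widening to "any" when two different iomodes are pending
is an assumption of this sketch. The numeric iomode values follow RFC5661.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Iomode values as in RFC5661; SK_IOMODE_NONE means "nothing pending". */
enum sk_iomode {
	SK_IOMODE_NONE = 0,
	SK_IOMODE_READ = 1,
	SK_IOMODE_RW   = 2,
	SK_IOMODE_ANY  = 3,
};

/* Hypothetical, heavily simplified layout header and segment. */
struct sk_layout_hdr {
	enum sk_iomode return_iomode;
};

struct sk_lseg {
	enum sk_iomode iomode;
	bool marked_for_return;
};

/* Record which iomode the pending layoutreturn has to cover. */
static void sk_set_return_iomode(struct sk_layout_hdr *lo, enum sk_iomode iomode)
{
	if (lo->return_iomode == iomode)
		return;
	/* Assumption: two different pending iomodes widen to "any". */
	if (lo->return_iomode != SK_IOMODE_NONE)
		iomode = SK_IOMODE_ANY;
	lo->return_iomode = iomode;
}

/* Marking a segment for return always updates the return iomode too. */
static void sk_mark_lseg_return(struct sk_layout_hdr *lo, struct sk_lseg *lseg)
{
	assert(lo != NULL && lseg != NULL);
	lseg->marked_for_return = true;
	sk_set_return_iomode(lo, lseg->iomode);
}
```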
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If two processes share the same credentials and NFSv4 open stateid, then
allow them both to dirty the same page, even if their nfs_open_context
differs.
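
A minimal sketch of the kind of comparison this implies, assuming
hypothetical sk_* structures in place of the real nfs_open_context, the
credential and the open state. Comparing the "other" field of the stateid
rather than the seqid is an assumption of this sketch, not a statement about
the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sk_cred {
	uint32_t uid;
	uint32_t gid;
};

struct sk_open_stateid {
	uint32_t seqid;
	unsigned char other[12];
};

struct sk_open_context {
	const struct sk_cred *cred;
	const struct sk_open_stateid *stateid;
};

static bool sk_cred_equal(const struct sk_cred *a, const struct sk_cred *b)
{
	return a->uid == b->uid && a->gid == b->gid;
}

/* Two contexts may dirty the same page if credentials and open stateid match. */
static bool sk_contexts_compatible(const struct sk_open_context *a,
				   const struct sk_open_context *b)
{
	assert(a != NULL && b != NULL);
	if (a == b)
		return true;
	if (!sk_cred_equal(a->cred, b->cred))
		return false;
	/* Assumption: only the opaque "other" field identifies the open state. */
	return memcmp(a->stateid->other, b->stateid->other,
		      sizeof(a->stateid->other)) == 0;
}
```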
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the layout segment is invalid, then we should not be adding more
write requests to the commit list. Instead, those writes should be
replayed after requesting a new layout.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Allow synchronous RPC calls to wait for pending RPC calls to finish,
but also allow asynchronous ones to just fire off another commit.
With this patch, the xfstests generic/074 test completes in 226s
instead of 242s.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The flexfiles layout, in particular, seems to want to poke around in the
O_DIRECT flags when retransmitting.
This patch sets up an interface to allow it to call back into O_DIRECT
to handle retransmission correctly. It also fixes a potential bug whereby
we could change the behaviour of O_DIRECT if an error is already pending.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Jeff reports seeing an Oops in ff_layout_alloc_lseg. Turns out
copy+paste has played cruel tricks on a nested loop.
Reported-by: Jeff Layton <jeff.layton@primarydata.com>
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If an NFSv4 client uses the cache_consistency_bitmask in order to
request only information about the change attribute, timestamps and
size, then it has not revalidated all attributes, and hence the
attribute timeout timestamp should not be updated.
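
The idea can be sketched as a simple bitmask test. Everything below is
hypothetical: the SK_ATTR_* bits do not match the real FATTR4 bitmap layout,
and sk_inode only models the attribute timeout timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute bits, for illustration only. */
#define SK_ATTR_CHANGE (1u << 0)
#define SK_ATTR_SIZE   (1u << 1)
#define SK_ATTR_MTIME  (1u << 2)
#define SK_ATTR_CTIME  (1u << 3)
#define SK_ATTR_MODE   (1u << 4)
#define SK_ATTR_OWNER  (1u << 5)

/* What a cache-consistency-only request asks for in this sketch. */
#define SK_ATTR_CONSISTENCY \
	(SK_ATTR_CHANGE | SK_ATTR_SIZE | SK_ATTR_MTIME | SK_ATTR_CTIME)
/* Everything the client caches in this sketch. */
#define SK_ATTR_ALL (SK_ATTR_CONSISTENCY | SK_ATTR_MODE | SK_ATTR_OWNER)

struct sk_inode {
	uint64_t attr_timeout_stamp;
};

/*
 * Only refresh the attribute timeout timestamp when every cached attribute
 * was revalidated. Returns true if the timestamp was updated.
 */
static bool sk_update_attr_stamp(struct sk_inode *inode, uint32_t received,
				 uint64_t now)
{
	assert(inode != NULL);
	if ((received & SK_ATTR_ALL) != SK_ATTR_ALL)
		return false;
	inode->attr_timeout_stamp = now;
	return true;
}
```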
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Donald Buczek reports that NFS clients can also report incorrect
results for access() due to lack of revalidation of attributes
before calling execute_ok().
Looking closely, it seems chdir() is afflicted with the same problem.
Fix is to ensure we call nfs_revalidate_inode_rcu() or
nfs_revalidate_inode() as appropriate before deciding to trust
execute_ok().
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Link: http://lkml.kernel.org/r/1451331530-3748-1-git-send-email-buczek@molgen.mpg.de
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
* flexfiles:
  pNFS/flexfiles: Ensure we record layoutstats even if RPC is terminated early
  pNFS: Add flag to track if we've called nfs4_ff_layout_stat_io_start_read/write
  pNFS/flexfiles: Fix a statistics gathering imbalance
  pNFS/flexfiles: Don't mark the entire layout as failed, when returning it
  pNFS/flexfiles: Don't prevent flexfiles client from retrying LAYOUTGET
  pnfs/flexfiles: count io stat in rpc_count_stats callback
  pnfs/flexfiles: do not mark delay-like status as DS failure
  NFS41: map NFS4ERR_LAYOUTUNAVAILABLE to ENODATA
  nfs: only remove page from mapping if launder_page fails
  nfs: handle request add failure properly
  nfs: centralize pgio error cleanup
  nfs: clean up rest of reqs when failing to add one
  NFS41: pop some layoutget errors to application
  pNFS/flexfiles: Support server-supplied layoutstats sampling period
If the client is promising to return the layout ASAP, then there is no
need to return DELAY and have the server retry. Instead default to the
normal procedure described in RFC5661.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
The RFC requires us to check if the server is recalling a stateid that we
haven't yet received. If so, tell it to wait.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the client needs to delay the layout callback, then speed up the recall
process by marking the remaining layout segments to be actively returned
by the client.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This ensures that we don't reuse the stateid if a layout return or
implied layout return means that we've returned all layout segments.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we're unable to perform the layoutget due to an invalid open stateid
or a bulk recall, ensure that we return the error so that the caller
can decide on an appropriate action.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Currently, we will only record the layoutstats correctly if the
RPC call successfully obtains a slot. If we exit before that
happens, then we may find ourselves starting the busy timer through
the call in ff_layout_(read|write)_prepare_layoutstats, but never stopping it.
The same thing happens if we're doing DA-DS.
The fix is to ensure that we catch these cases in the rpc_release()
callback.
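
A simplified model of that pattern, assuming a per-request flag that records
whether the busy timer was started. All sk_* names are hypothetical and the
real callbacks take different arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical busy counter standing in for the layoutstats busy timer. */
struct sk_busy_timer {
	int busy;
};

/* Hypothetical per-request header carrying the "stat started" flag. */
struct sk_pgio_hdr {
	struct sk_busy_timer *timer;
	bool stat_started;
};

static void sk_stat_start(struct sk_pgio_hdr *hdr)
{
	assert(hdr != NULL && hdr->timer != NULL);
	if (hdr->stat_started)
		return;
	hdr->stat_started = true;
	hdr->timer->busy++;
}

static void sk_stat_done(struct sk_pgio_hdr *hdr)
{
	assert(hdr != NULL && hdr->timer != NULL);
	if (!hdr->stat_started)
		return;	/* never started, or already stopped */
	hdr->stat_started = false;
	hdr->timer->busy--;
}

/* The prepare step starts the timer. */
static void sk_rpc_prepare(struct sk_pgio_hdr *hdr)
{
	sk_stat_start(hdr);
}

/* Normal completion stops it. */
static void sk_rpc_done(struct sk_pgio_hdr *hdr)
{
	sk_stat_done(hdr);
}

/* Release catches the case where the call exited before completion. */
static void sk_rpc_release(struct sk_pgio_hdr *hdr)
{
	sk_stat_done(hdr);
}
```

Because the stop path is keyed off the flag, calling it from both the
completion and the release callback cannot unbalance the counter.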
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When we replay a failed read, write or commit to the dataserver, we
need to ensure that we call ff_layout_read_prepare_v3(),
ff_layout_write_prepare_v3 or ff_layout_commit_prepare_v3() so that we
reset the statistics.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In pNFS/flexfiles, we want to return the layout without necessarily marking
it as having completely failed. We therefore move the call to
pnfs_layout_io_set_failed() out of pnfs_error_mark_layout_for_return(),
and then ensure that pNFS/files layout calls it separately.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Fix a bug in which flexfiles clients are falling back to I/O through the
MDS even when the FF_FLAGS_NO_IO_THRU_MDS flag is set.
The flexfiles client will always report errors through the LAYOUTRETURN
and/or LAYOUTERROR mechanisms, so it should normally be safe for it
to retry the LAYOUTGET until it fails or succeeds.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If the client ever restarts I/O due to errors, we'll end up
miscounting I/O stats if we do the counting in the .rpc_done
callback. Move it to the .rpc_count_stats callback, which is only
called when releasing the RPC.
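
To illustrate why the placement matters, here is a toy model in which the
completion callback may run once per (re)started attempt while the counting
callback runs once per RPC. The sk_* types and signatures are hypothetical,
not the SUNRPC API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-mirror I/O statistics. */
struct sk_io_stats {
	unsigned long ops;
	unsigned long bytes;
};

/* Hypothetical task state: the result of the latest attempt. */
struct sk_rpc_task {
	struct sk_io_stats *stats;
	unsigned long bytes_done;
};

/* May run several times if the I/O is restarted: only records the result. */
static void sk_rpc_done(struct sk_rpc_task *task, unsigned long bytes)
{
	assert(task != NULL);
	task->bytes_done = bytes;
}

/* Runs once per RPC in this model, so each operation is counted once. */
static void sk_rpc_count_stats(struct sk_rpc_task *task)
{
	assert(task != NULL && task->stats != NULL);
	task->stats->ops++;
	task->stats->bytes += task->bytes_done;
}
```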
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We just need to delay and retry in these cases.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Map NFS4ERR_LAYOUTUNAVAILABLE to ENODATA instead of EIO, which is a
fatal error that fails the application. We'll go inband after getting
NFS4ERR_LAYOUTUNAVAILABLE.
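
A small sketch of such a mapping, assuming positive NFSv4 status codes as
input and negative errno values as output. The helper names are
hypothetical. The numeric value of NFS4ERR_LAYOUTUNAVAILABLE is taken from
RFC5661.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* NFS4ERR_LAYOUTUNAVAILABLE as defined in RFC5661. */
#define SK_NFS4ERR_LAYOUTUNAVAILABLE 10059

/* Hypothetical mapper: unknown statuses still collapse to -EIO. */
static int sk_map_layoutget_status(int status)
{
	assert(status >= 0);
	switch (status) {
	case 0:
		return 0;
	case SK_NFS4ERR_LAYOUTUNAVAILABLE:
		return -ENODATA;
	default:
		return -EIO;
	}
}

/* -ENODATA means "no layout, use inband I/O" rather than a hard failure. */
static bool sk_should_go_inband(int err)
{
	return err == -ENODATA;
}
```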
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of dropping pages whenever a write fails, only do so when
we get a fatal failure in launder_page writeback.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When we fail to queue a read page to the IO descriptor,
we need to clean it up; otherwise it hangs around and
prevents the nfs module from being removed.
When we fail to queue a write page to the IO descriptor,
we need to clean it up and also save the failure status
to the open context. Then at file close, we can try to write
pages back again and drop the page if it fails to write back
in .launder_page, which will be done in the next patch.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
In case we fail during setting things up for read/write IO, set
pg_error in IO descriptor and do the cleanup in nfs_pageio_add_request,
where we clean up all pages that are still hanging around on the IO
descriptor.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If we fail to set things up before sending anything over the wire,
we need to clean up the reqs that are still attached to the
IO descriptor.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
For ERESTARTSYS/EIO/EROFS/ENOSPC/E2BIG in layoutget, we
should just bail out instead of hiding the error and
retrying inband IO.
Change all the call sites to pop the error all the way up.
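
The classification can be sketched as follows. The helper names are
hypothetical, and ERESTARTSYS is defined locally because it is a
kernel-internal errno that userspace headers do not export.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#ifndef ERESTARTSYS
#define ERESTARTSYS 512	/* kernel-internal value, defined here for the sketch */
#endif

/* Errors that should be passed up instead of retrying inband. */
static bool sk_layoutget_error_is_fatal(int err)
{
	switch (err) {
	case -ERESTARTSYS:
	case -EIO:
	case -EROFS:
	case -ENOSPC:
	case -E2BIG:
		return true;
	default:
		return false;
	}
}

/*
 * Hypothetical call site: returns the error to propagate, or 0.
 * *use_mds is set when the caller should fall back to inband I/O.
 */
static int sk_layoutget_result(int err, bool *use_mds)
{
	assert(use_mds != NULL);
	*use_mds = false;
	if (err == 0)
		return 0;
	if (sk_layoutget_error_is_fatal(err))
		return err;
	*use_mds = true;
	return 0;
}
```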
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Some servers want to be able to control the frequency with which clients
report layoutstats, for instance, in order to monitor QoS for a particular
file or set of files. In order to support this, the flexfiles layout allows
the server to pass this info as a hint in the layout payload.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
If there are already writes queued up for commit, then don't flush
just this page even if it is a reclaim issue.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Background flush is needed in order to satisfy the global page limits.
Don't subvert it by reducing the priority.
This should also address a write starvation issue that was reported by
Neil Brown.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Since commit 2d8ae84fbc32, nothing is bumping lo->plh_block_lgets in the
layoutreturn path, so it should not be touched in nfs4_layoutreturn_release
either.
Fixes: 2d8ae84fbc32 ("NFSv4.1/pnfs: Remove redundant lo->plh_block_lgets...")
Cc: stable@vger.kernel.org # 4.3+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Donald Buczek reports that a nfs4 client incorrectly denies
execute access based on outdated file mode (missing 'x' bit).
After the mode on the server is 'fixed' (chmod +x) further execution
attempts continue to fail, because the nfs ACCESS call updates
the access parameter but not the mode parameter or the mode in
the inode.
The root cause is ultimately that the VFS is calling may_open()
before the NFS client has a chance to OPEN the file and hence revalidate
the access and attribute caches.
Al Viro suggests:
>>> Make nfs_permission() relax the checks when it sees MAY_OPEN, if you know
>>> that things will be caught by server anyway?
>>
>> That can work as long as we're guaranteed that everything that calls
>> inode_permission() with MAY_OPEN on a regular file will also follow up
>> with a vfs_open() or dentry_open() on success. Is this always the
>> case?
>
> 1) in do_tmpfile(), followed by do_dentry_open() (not reachable by NFS since
> it doesn't have ->tmpfile() instance anyway)
>
> 2) in atomic_open(), after the call of ->atomic_open() has succeeded.
>
> 3) in do_last(), followed on success by vfs_open()
>
> That's all. All calls of inode_permission() that get MAY_OPEN come from
> may_open(), and there's no other callers of that puppy.
Reported-by: Donald Buczek <buczek@molgen.mpg.de>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=109771
Link: http://lkml.kernel.org/r/1451046656-26319-1-git-send-email-buczek@molgen.mpg.de
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Allow LAYOUTRETURN and DELEGRETURN to use machine credentials if the
server supports it. Add request for OPEN_DOWNGRADE as the close path
also uses that.
Signed-off-by: Andrew Elble <aweits@rit.edu>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
This patch fixes the checkpatch.pl error to nfs4sysctl.c:
ERROR: do not initialise statics to 0
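
For reference, the pattern checkpatch.pl complains about looks roughly like
this. The variable name is hypothetical and is not the one in nfs4sysctl.c.

```c
#include <assert.h>

/*
 * Flagged by checkpatch.pl ("do not initialise statics to 0"):
 *
 *	static int sk_debug_level = 0;
 *
 * Preferred form: objects with static storage duration are
 * zero-initialised by the C language, so the initialiser is redundant.
 */
static int sk_debug_level;

static int sk_get_debug_level(void)
{
	assert(sk_debug_level >= 0);
	return sk_debug_level;
}
```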
Signed-off-by: Wei Tang <tangwei@cmss.chinamobile.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Instead of displaying a layout segment pointer in these tracepoints,
let's use the layout stateid, now that Olga gave us a set of tools for
displaying them.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
pnfs_update_layout is really the "nexus" of layout handling. If it
returns NULL then we end up going through the MDS. This patch adds
some tracepoints to that function that allow us to determine the
cause when we end up going through the MDS unexpectedly.
Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Operations to which stateid information is added:
close, delegreturn, open, read, setattr, layoutget, layoutcommit, test_stateid,
write, lock, locku, lockt
Format is "stateid=<seqid>:<crc32 hash stateid.other>", also "openstateid=",
"layoutstateid=", and "lockstateid=" for open_file, layoutget, set_lock
tracepoints.
A new function, nfs_stateid_hash(), is added to internal.h to compute the hash.
trace_nfs4_setattr() is moved from nfs4_do_setattr() to _nfs4_do_setattr()
to get access to the stateid.
trace_nfs4_setattr and trace_nfs4_delegreturn are changed from INODE_EVENT
to a new event type, INODE_STATEID_EVENT, which is the same as INODE_EVENT but
adds stateid information.
For locking tracepoints, trace_nfs4_set_lock() is moved into _nfs4_do_setlk()
to get access to stateid information, and trace_nfs4_lock_reclaim() and
trace_nfs4_lock_expired() are removed, as they call into _nfs4_do_setlk() and
both were previously the same LOCK_EVENT type.
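
A userspace sketch of the "stateid=<seqid>:<crc32 hash stateid.other>"
format, for illustration only. The sk_* names are hypothetical, the CRC here
is the common bitwise CRC-32 (the kernel helper's seed and final inversion
may differ), and the exact printf format string is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define SK_STATEID_OTHER_SIZE 12

/* Hypothetical stand-in for an NFSv4 stateid. */
struct sk_stateid {
	uint32_t seqid;
	unsigned char other[SK_STATEID_OTHER_SIZE];
};

/* Bitwise CRC-32 using the reflected polynomial 0xEDB88320. */
static uint32_t sk_crc32(const unsigned char *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Hash only the opaque "other" field; the seqid is printed as-is. */
static uint32_t sk_stateid_hash(const struct sk_stateid *stateid)
{
	assert(stateid != NULL);
	return sk_crc32(stateid->other, sizeof(stateid->other));
}

/* Format as "stateid=<seqid>:0x<hash>"; returns the snprintf() result. */
static int sk_format_stateid(char *out, size_t outlen,
			     const struct sk_stateid *stateid)
{
	assert(out != NULL && stateid != NULL);
	return snprintf(out, outlen, "stateid=%u:0x%08x",
			(unsigned int)stateid->seqid,
			(unsigned int)sk_stateid_hash(stateid));
}
```

Hashing the opaque part keeps the trace output short while still letting two
events be matched to the same state.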
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
When the server returns a layoutstats stateid error, we should
invalidate the client's layout so that the next I/O can trigger a new
layoutget.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
We've seen this in a packet capture - I've intermixed what I
think was going on. The fix here is to grab the so_lock sooner.
1964379 -> #1 open (for write) reply seqid=1
1964393 -> #2 open (for read) reply seqid=2
__nfs4_close(), state->n_wronly--
nfs4_state_set_mode_locked(), changes state->state = [R]
state->flags is [RW]
state->state is [R], state->n_wronly == 0, state->n_rdonly == 1
1964398 -> #3 open (for write) call -> because close is already running
1964399 -> downgrade (to read) call seqid=2 (close of #1)
1964402 -> #3 open (for write) reply seqid=3
__update_open_stateid()
nfs_set_open_stateid_locked(), changes state->flags
state->flags is [RW]
state->state is [R], state->n_wronly == 0, state->n_rdonly == 1
new sequence number is exposed now via nfs4_stateid_copy()
next step would be update_open_stateflags(), pending so_lock
1964403 -> downgrade reply seqid=2, fails with OLD_STATEID (close of #1)
nfs4_close_prepare() gets so_lock and recalcs flags -> send close
1964405 -> downgrade (to read) call seqid=3 (close of #1 retry)
__update_open_stateid() gets so_lock
* update_open_stateflags() updates state->n_wronly.
nfs4_state_set_mode_locked() updates state->state
state->flags is [RW]
state->state is [RW], state->n_wronly == 1, state->n_rdonly == 1
* should have suppressed the preceding nfs4_close_prepare() from
sending open_downgrade
1964406 -> write call
1964408 -> downgrade (to read) reply seqid=4 (close of #1 retry)
nfs_clear_open_stateid_locked()
state->flags is [R]
state->state is [RW], state->n_wronly == 1, state->n_rdonly == 1
1964409 -> write reply (fails, openmode)
Signed-off-by: Andrew Elble <aweits@rit.edu>
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>