The AFS3 FID is three 32-bit unsigned numbers and is represented as three
up-to-8-hex-digit numbers separated by colons to the afs.fid xattr.
However, with the advent of support for YFS, the FID is now a 64-bit volume
number, a 96-bit vnode/inode number and a 32-bit uniquifier (as before).
Whilst the sprintf in afs_xattr_get_fid() has been partially updated (it
currently ignores the upper 32 bits of the 96-bit vnode number), the size
of the stack-based buffer has not been increased to match, thereby allowing
stack corruption to occur.
Fix this by increasing the buffer size appropriately and conditionally
including the upper part of the vnode number if it is non-zero. The latter
requires the lower part to be zero-padded if the upper part is non-zero.
Fixes: 3b6492df4153 ("afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS")
Signed-off-by: David Howells <dhowells@redhat.com>
Fix the ->get handlers for the afs.cell and afs.volume xattrs to pass the
source data size to memcpy() rather than target buffer size.
Overcopying the source data occasionally causes the kernel to oops.
Fixes: d3e3b7eac886 ("afs: Add metadata xattrs")
Signed-off-by: David Howells <dhowells@redhat.com>
While it's not possible to give an accurate number for the blocks
used on the server, populate i_blocks based on the file size so
that 'du' can give a reasonable estimate.
The value is rounded up to 1K granularity, for consistency with
what other AFS clients report, and the servers' 1K usage quota
unit. Note that the value calculated by 'du' at the root of a
volume can still be slightly lower than the quota usage on the
server, as 0-length files are charged 1 quota block, but are
reported as occupying 0 blocks. Again, this is consistent with
other AFS clients.
Signed-off-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Log more information when "kAFS: AFS vnode with undefined type\n" is
displayed due to a vnode record being retrieved from the server that
appears to have a duff file type (usually 0). This prints more information
to try and help pin down the problem.
Signed-off-by: David Howells <dhowells@redhat.com>
This issue is found by running liburing/test/io_uring_setup test.
When test run, the testcase "attempt to bind to invalid cpu" would not
pass with messages like:
io_uring_setup(1, 0xbfc2f7c8), \
flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, \
resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, \
sq_thread_cpu: 2
expected -1, got 3
FAIL
On my system, there is:
CPU(s) possible : 0-3
CPU(s) online : 0-1
CPU(s) offline : 2-3
CPU(s) present : 0-1
The sq_thread_cpu 2 is offline on my system, so the bind should fail.
But cpu_possible() will pass the check. We shouldn't be able to bind
to an offline cpu. Use cpu_online() to do the check.
After the change, the testcase run as expected: EINVAL will be returned
for cpu offlined.
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Actually we don't do anything with return value from
nfs_wait_client_init_complete in nfs_match_client, as a
consequence if we get a fatal signal and client is not
fully initialised, we'll loop to "again" label
This has been proven to cause soft lockups on some scenarios
(no-carrier but configured network interfaces)
Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Pull crypto update from Herbert Xu:
"API:
- Add support for AEAD in simd
- Add fuzz testing to testmgr
- Add panic_on_fail module parameter to testmgr
- Use per-CPU struct instead multiple variables in scompress
- Change verify API for akcipher
Algorithms:
- Convert x86 AEAD algorithms over to simd
- Forbid 2-key 3DES in FIPS mode
- Add EC-RDSA (GOST 34.10) algorithm
Drivers:
- Set output IV with ctr-aes in crypto4xx
- Set output IV in rockchip
- Fix potential length overflow with hashing in sun4i-ss
- Fix computation error with ctr in vmx
- Add SM4 protected keys support in ccree
- Remove long-broken mxc-scc driver
- Add rfc4106(gcm(aes)) cipher support in cavium/nitrox"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (179 commits)
crypto: ccree - use a proper le32 type for le32 val
crypto: ccree - remove set but not used variable 'du_size'
crypto: ccree - Make cc_sec_disable static
crypto: ccree - fix spelling mistake "protedcted" -> "protected"
crypto: caam/qi2 - generate hash keys in-place
crypto: caam/qi2 - fix DMA mapping of stack memory
crypto: caam/qi2 - fix zero-length buffer DMA mapping
crypto: stm32/cryp - update to return iv_out
crypto: stm32/cryp - remove request mutex protection
crypto: stm32/cryp - add weak key check for DES
crypto: atmel - remove set but not used variable 'alg_name'
crypto: picoxcell - Use dev_get_drvdata()
crypto: crypto4xx - get rid of redundant using_sd variable
crypto: crypto4xx - use sync skcipher for fallback
crypto: crypto4xx - fix cfb and ofb "overran dst buffer" issues
crypto: crypto4xx - fix ctr-aes missing output IV
crypto: ecrdsa - select ASN1 and OID_REGISTRY for EC-RDSA
crypto: ux500 - use ccflags-y instead of CFLAGS_<basename>.o
crypto: ccree - handle tee fips error during power management resume
crypto: ccree - add function to handle cryptocell tee fips error
...
Pull stack trace updates from Ingo Molnar:
"So Thomas looked at the stacktrace code recently and noticed a few
weirdnesses, and we all know how such stories of crummy kernel code
meeting German engineering perfection end: a 45-patch series to clean
it all up! :-)
Here's the changes in Thomas's words:
'Struct stack_trace is a sinkhole for input and output parameters
which is largely pointless for most usage sites. In fact if embedded
into other data structures it creates indirections and extra storage
overhead for no benefit.
Looking at all usage sites makes it clear that they just require an
interface which is based on a storage array. That array is either on
stack, global or embedded into some other data structure.
Some of the stack depot usage sites are outright wrong, but
fortunately the wrongness just causes more stack being used for
nothing and does not have functional impact.
Another oddity is the inconsistent termination of the stack trace
with ULONG_MAX. It's pointless as the number of entries is what
determines the length of the stored trace. In fact quite some call
sites remove the ULONG_MAX marker afterwards with or without nasty
comments about it. Not all architectures do that and those which do,
do it inconsistenly either conditional on nr_entries == 0 or
unconditionally.
The following series cleans that up by:
1) Removing the ULONG_MAX termination in the architecture code
2) Removing the ULONG_MAX fixups at the call sites
3) Providing plain storage array based interfaces for stacktrace
and stackdepot.
4) Cleaning up the mess at the callsites including some related
cleanups.
5) Removing the struct stack_trace based interfaces
This is not changing the struct stack_trace interfaces at the
architecture level, but it removes the exposure to the generic
code'"
* 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
x86/stacktrace: Use common infrastructure
stacktrace: Provide common infrastructure
lib/stackdepot: Remove obsolete functions
stacktrace: Remove obsolete functions
livepatch: Simplify stack trace retrieval
tracing: Remove the last struct stack_trace usage
tracing: Simplify stack trace retrieval
tracing: Make ftrace_trace_userstack() static and conditional
tracing: Use percpu stack trace buffer more intelligently
tracing: Simplify stacktrace retrieval in histograms
lockdep: Simplify stack trace handling
lockdep: Remove save argument from check_prev_add()
lockdep: Remove unused trace argument from print_circular_bug()
drm: Simplify stacktrace handling
dm persistent data: Simplify stack trace handling
dm bufio: Simplify stack trace retrieval
btrfs: ref-verify: Simplify stack trace retrieval
dma/debug: Simplify stracktrace retrieval
fault-inject: Simplify stacktrace retrieval
mm/page_owner: Simplify stack trace handling
...
Currently variable ret is declared in a while-loop code block that
shadows another variable ret. When an error occurs in the while-loop
the error return in ret is not being set in the outer code block and
so the error check on ret is always going to be checking on the wrong
ret variable resulting in check that is always going to be true and
a premature return occurs.
Fix this by removing the declaration of the inner while-loop variable
ret so that shadowing does not occur.
Addresses-Coverity: ("'Constant' variable guards dead code")
Fixes: 6b06314c47e1 ("io_uring: add file set registration")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This amends commit 10dce8af3422 ("fs: stream_open - opener for
stream-like files so that read and write can run simultaneously without
deadlock") in how position is passed into .read()/.write() handler for
stream-like files:
Rasmus noticed that we currently pass 0 as position and ignore any position
change if that is done by a file implementation. This papers over bugs if ppos
is used in files that declare themselves as being stream-like as such bugs will
go unnoticed. Even if a file implementation is correctly converted into using
stream_open, its read/write later could be changed to use ppos and even though
that won't be working correctly, that bug might go unnoticed without someone
doing wrong behaviour analysis. It is thus better to pass ppos=NULL into
read/write for stream-like files as that don't give any chance for ppos usage
bugs because it will oops if ppos is ever used inside .read() or .write().
Note 1: rw_verify_area, new_sync_{read,write} needs to be updated
because they are called by vfs_read/vfs_write & friends before
file_operations .read/.write .
Note 2: if file backend uses new-style .read_iter/.write_iter, position
is still passed into there as non-pointer kiocb.ki_pos . Currently
stream_open.cocci (semantic patch added by 10dce8af3422) ignores files
whose file_operations has *_iter methods.
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Kirill Smelkov <kirr@nexedi.com>
We found that it return success when we set IMMUTABLE_FL flag to a file in
docker even though the docker didn't have the capability
CAP_LINUX_IMMUTABLE.
The commit d1d04ef8572b ("ovl: stack file ops") and dab5ca8fd9dd ("ovl: add
lsattr/chattr support") implemented chattr operations on a regular overlay
file. ovl_real_ioctl() overridden the current process's subjective
credentials with ofs->creator_cred which have the capability
CAP_LINUX_IMMUTABLE so that it will return success in
vfs_ioctl()->cap_capable().
Fix this by checking the capability before cred overridden. And here we
only care about APPEND_FL and IMMUTABLE_FL, so get these information from
inode.
[SzM: move check and call to underlying fs inside inode locked region to
prevent two such calls from racing with each other]
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Overlayfs "fake" path is used for stacked file operations on underlying
files. Operations on files with "fake" path must not generate fsnotify
events with path data, because those events have already been generated at
overlayfs layer and because the reported event->fd for fanotify marks on
underlying inode/filesystem will have the wrong path (the overlayfs path).
Link: https://lore.kernel.org/linux-fsdevel/20190423065024.12695-1-jencce.kernel@gmail.com/
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Fixes: d1d04ef8572b ("ovl: stack file ops")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Overlay file f_pos is the master copy that is preserved
through copy up and modified on read/write, but only real
fs knows how to SEEK_HOLE/SEEK_DATA and real fs may impose
limitations that are more strict than ->s_maxbytes for specific
files, so we use the real file to perform seeks.
We do not call real fs for SEEK_CUR:0 query and for SEEK_SET:0
requests.
Fixes: d1d04ef8572b ("ovl: stack file ops")
Reported-by: Eddie Horng <eddiehorng.tw@gmail.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Generalize the helper ovl_open_maybe_copy_up() and use it to copy up file
with data before FS_IOC_SETFLAGS ioctl.
The FS_IOC_SETFLAGS ioctl is a bit of an odd ball in vfs, which probably
caused the confusion. File may be open O_RDONLY, but ioctl modifies the
file. VFS does not call mnt_want_write_file() nor lock inode mutex, but
fs-specific code for FS_IOC_SETFLAGS does. So ovl_ioctl() calls
mnt_want_write_file() for the overlay file, and fs-specific code calls
mnt_want_write_file() for upper fs file, but there was no call for
ovl_want_write() for copy up duration which prevents overlayfs from copying
up on a frozen upper fs.
Fixes: dab5ca8fd9dd ("ovl: add lsattr/chattr support")
Cc: <stable@vger.kernel.org> # v4.19
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Pull vfs fixes from Al Viro:
- a couple of ->i_link use-after-free fixes
- regression fix for wrong errno on absent device name in mount(2)
(this cycle stuff)
- ancient UFS braino in large GID handling on Solaris UFS images (bogus
cut'n'paste from large UID handling; wrong field checked to decide
whether we should look at old (16bit) or new (32bit) field)
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
ufs: fix braino in ufs_get_inode_gid() for solaris UFS flavour
Abort file_remove_privs() for non-reg. files
[fix] get rid of checking for absent device name in vfs_get_tree()
apparmorfs: fix use-after-free on symlink traversal
securityfs: fix use-after-free on symlink traversal
Otherwise we race with orangefs_writepage/orangefs_writepages
which and does not expect i_size < page_offset.
Fixes xfstests generic/129.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
->readpage looks in file->private_data to try and find out how the
userspace program set "count" in read(2) or with "dd bs=" or whatever.
->readpage uses "count" and inode->i_size to calculate how much
data Orangefs should deposit in the Orangefs shared buffer, and
remembers which slot the data is in.
After copying data from the Orangefs shared buffer slot into
"the page", readpage tries to increment through the pagecache index
and fill as many pages as it can from the extra data in the shared
buffer. Hopefully these extra pages will soon be needed by the vfs,
and they'll be in the pagecache already.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
When userspace deposits more than a page of data into the shared buffer,
we'll need to know which slot it is in when we get back to readpage
so that we can try to use the extra data to fill some extra pages.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Orangefs wins when it can do IO on large (up to four meg) blocks at a time,
and looses when it has to do tiny "small io" reads and writes. Accessing
Orangefs through the pagecache with the kernel module helps with small io,
both reading and writing, a great deal. Readpage generally tries to fetch a
page (four k) at a time. We'll let users use "count" (as in read(2) or
pread(2) for example) as a knob to control how much data they get from
Orangefs at a time and we'll try to use the data to fill extra
pagecache pages when we get to ->readpage, hopefully resulting in
fewer calls to readpage and Orangefs userspace.
We need a way to remember how they set count so that we can still have
it available when we get to ->readpage.
- We'll use file->private_data to keep track of "count".
We'll wrap generic_file_open with orangefs_file_open and
initialize private_data to NULL there.
- In ->read_iter we have access to both "count" and file, so
we'll kmalloc some space onto file->private_data and store
"count" there.
- We'll kfree file->private_data each time we visit ->flush and
reinitialize it to NULL.
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
This is modeled after NFS, except our method is different. We use a
simple timer to determine whether to invalidate the page cache. This
is bound to perform.
This addes a sysfs parameter cache_timeout_msecs which controls the time
between page cache invalidations.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Go through pages and look for a consecutive writable region. After
finding a number of consecutive writable pages or when finding that
the next page's dirty range is not contiguous and cannot be written
as one request, send the write to the server.
The number of pages is determined by the client-core's buffer size.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Attach the actual range of bytes written to plus the responsible uid/gid
to each dirty page. This information must be sent to the server when
the page is written out.
Now write_begin, page_mkwrite, and invalidatepage keep up with this
information. There are several conditions where they must write out the
page immediately to store the new range. Two non-contiguous ranges
cannot be stored on a single page.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Without this, an fsync call is sent to the server even if no data
changed. This resulted in a rather severe (50%) performance regression
under certain metadata-heavy workloads.
In the past, everything was direct IO. Nothing happend on a close call.
An explicit fsync call would send an fsync request to the server which
in turn fsynced the underlying file.
Now there are cached writes. Then fsync began writing out dirty pages
in addition to making an fsync request to the server, and close began
calling fsync.
With this commit, close only writes out dirty pages, and does not make
the fsync request.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Would happen if an inode is dirty but whatever happened is not something
that can be written out to OrangeFS.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
direct_IO was the only caller and all direct_IO did was call it,
so there's no use in having the code spread out into so many functions.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Now orangefs_inode_getattr fills from cache if an inode has dirty pages.
also if attr_valid and dirty pages and !flags, we spin on inode writeback
before returning if pages still dirty after: should it be other way
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
Remove orangefs_inode_read. It was used by readpage. Calling
wait_for_direct_io directly serves the purpose just as well. There is
now no check of the bufmap size in the readpage path. There are already
other places the bufmap size is assumed to be greater than PAGE_SIZE.
Important to call truncate_inode_pages now in the write path so a
subsequent read sees the new data.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
It's a copy of the loop which would run in read_pages from
mm/readahead.c.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
OrangeFS accepts a mask indicating which attributes were changed. The
kernel must not set any bits except those that were actually changed.
The kernel must set the uid/gid of the request to the actual uid/gid
responsible for the change.
Code path for notify_change initiated setattrs is
orangefs_setattr(dentry, iattr)
-> __orangefs_setattr(inode, iattr)
In kernel changes are initiated by calling __orangefs_setattr.
Code path for writeback is
orangefs_write_inode
-> orangefs_inode_setattr
attr_valid and attr_uid and attr_gid change together under i_lock.
I_DIRTY changes separately.
__orangefs_setattr
lock
if needs to be cleaned first, unlock and retry
set attr_valid
copy data in
unlock
mark_inode_dirty
orangefs_inode_setattr
lock
copy attributes out
unlock
clear getattr_time
# __writeback_single_inode clears dirty
orangefs_inode_getattr
# possible to get here with attr_valid set and not dirty
lock
if getattr_time ok or attr_valid set, unlock and return
unlock
do server operation
# another thread may getattr or setattr, so check for that
lock
if getattr_time ok or attr_valid, unlock and return
else, copy in
update getattr_time
unlock
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This is a fairly big change, but ultimately it's not a lot of code.
Implement write_inode and then avoid the call to orangefs_inode_setattr
within orangefs_setattr.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This should be a no-op now. When inode writeback works, this will
prevent a getattr from overwriting inode data while an inode is
transitioning to dirty.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This should be a no-op now, but once inode writeback works, it'll be
necessary to have the correct attribute in the dirty inode.
Previously the attribute fetch timeout was marked invalid and the server
provided the updated attribute. When the inode is dirty, the server
cannot be consulted since it does not yet know the pending setattr.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
No need to store the received mask. It is either STATX_BASIC_STATS or
STATX_BASIC_STATS & ~STATX_SIZE. If STATX_SIZE is requested, the cache
is bypassed anyway, so the cached mask is unnecessary to decide whether
to do a real getattr.
This is a change. Previously a getattr would want size and use the
cached size. All of the in-kernel callers that wanted size did not want
a cached size. Now a getattr cannot use the cached size if it wants
size at all.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
When an inode is created, we fetch attributes from the server. There is
no need to turn around and invalidate them.
No need to initialize attributes after the getattr either. Either it'll
be exactly the same, or it'll be something else and wrong.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
This uses the same timeout as the getattr cache. This substantially
increases performance when writing files with smaller buffer sizes.
When writing, the size is (often) changed, which causes a call to
notify_change which calls security_inode_need_killpriv which needs a
getxattr. Caching it reduces traffic to the server.
Signed-off-by: Martin Brandenburg <martin@omnibond.com>
Signed-off-by: Mike Marshall <hubcap@omnibond.com>
If we have an error writing out a delalloc range in
btrfs_punch_hole_lock_range we'll unlock the inode and then goto
out_only_mutex, where we will again unlock the inode. This is bad,
don't do this.
Fixes: f27451f22996 ("Btrfs: add support for fallocate's zero range operation")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When a file's compression property is set as zlib or zstd but leave
the compression mount option not be set, that means btrfs will try
to compress the file with default compression level. But in
btrfs_compress_pages(), it calls get_workspace() with level = 0.
This will return a workspace with a wrong compression level.
For zlib, the compression level in the workspace will be 0
(that means "store only"). And for zstd, the compression in the
workspace will be 1, not the default level 3.
How to reproduce:
mkfs -t btrfs /dev/sdb
mount /dev/sdb /mnt/
mkdir /mnt/zlib
btrfs property set /mnt/zlib/ compression zlib
dd if=/dev/zero of=/mnt/zlib/compression-friendly-file-10M bs=1M count=10
sync
btrfs-debugfs -f /mnt/zlib/compression-friendly-file-10M
btrfs-debugfs output:
* before:
...
(258 9961472): ram 524288 disk 1106247680 disk_size 524288
file: ... extents 20 disk size 10485760 logical size 10485760 ratio 1.00
* after:
...
(258 10354688): ram 131072 disk 14217216 disk_size 4096
file: ... extents 80 disk size 327680 logical size 10485760 ratio 32.00
The steps for zstd are similar, but need to put a debugging message to
show the level of the return workspace in zstd_get_workspace().
This commit adds a check of the compression level before getting a
workspace by set_level().
CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Johnny Chang <johnnyc@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of having the convention where individual nfsd4_callback_ops->done
operations return -1 to indicate the callback path is down, move the check
to nfsd4_cb_done. Only mark the callback path down on transport-level
errors, not NFS-level errors.
The existing logic causes the server to set SEQ4_STATUS_CB_PATH_DOWN
just because the client returned an error to a CB_RECALL for a
delegation that the client had already done a FREE_STATEID for. But
clearly that error doesn't mean that there's anything wrong with the
backchannel.
Additionally, handle NFS4ERR_DELAY in nfsd4_cb_recall_done. The client
returns NFS4ERR_DELAY if it is already in the process of returning the
delegation.
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
No need to set it in io_poll_add; io_poll_complete doesn't use it to set
the result in the CQE.
Signed-off-by: Stefan Bühler <source@stbuehler.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Allow registration of an eventfd, which will trigger an event every
time a completion event happens for this io_uring instance.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This just pulls out the ksys_sync_file_range() code to work on a struct
file instead of an fd, so we can use it elsewhere.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are no ordering constraints between the submission and completion
side of io_uring. But sometimes that would be useful to have. One common
example is doing an fsync, for instance, and have it ordered with
previous writes. Without support for that, the application must do this
tracking itself.
This adds a general SQE flag, IOSQE_IO_DRAIN. If a command is marked
with this flag, then it will not be issued before previous commands have
completed, and subsequent commands submitted after the drain will not be
issued before the drain is started.. If there are no pending commands,
setting this flag will not change the behavior of the issue of the
command.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlzLBg4QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpsG8EACHxeAgDY3j+ZSLqn9sCFzvDo+hc9q2x64o
ydscoxcdRSZYAU1dP6bnjQBFZbwVHaDzYG11L9aPwlvbsxis9/XWASc/gCo8sumN
itmk3jKW5zYXlLw57ojh2+/UbPKnO5fKJcA66AakDGTopp3nYZQvov1kUJKOwTiz
Ir48vEwST4DU9szkfGGeKOGG735veU/yarUNyADwc+CIigBcIa/A/bZtY8P4xV5d
6bSLnMayehntZaHpVUPwDqOgP7+nSqIzQ88BPAKQnSzDat/SK4C0pgIX9L2nS1/4
7fBpg3/HN0sFgUuhAG/3ILD82yD3CWDGbt2xif2CP0ylvVOMB6d/ObDAIxwTBKVd
MwJ3U0RK+Ph2rvaglp6ifk3YgS0Nzt4N5F9AVmp32RR3QpqzJ2pLN4PUjbdWaEbr
lugu6uengn0flu90mesBazrVNrZkrZ2w859tq9TNWPYpdIthKrvNn44J1bXx+yUj
F1d+Bf0Sxxnc1b8hEKn9KOXPRoIY07hd/wUi2a/F4RVGSFa2b9CnaengqXIDzQNC
duEErt+Pd2xlBA4WhPlKSI2PmrWJvpDr7iuAiqdC8cC69hpF4F2/yWCiL+JYI4pF
v0zPn214KZ2rg9XZfb0Me8ff1BNvmXJMKzvecX2mOcclcO97PFVkqJhcx/75JE/6
ooQiTofl+w==
=KjZ4
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20190502' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"This is mostly io_uring fixes/tweaks. Most of these were actually done
in time for the last -rc, but I wanted to ensure that everything
tested out great before including them. The code delta looks larger
than it really is, as it's mostly just comment additions/changes.
Outside of the comment additions/changes, this is mostly removal of
unnecessary barriers. In all, this pull request contains:
- Tweak to how we handle errors at submission time. We now post a
completion event if the error occurs on behalf of an sqe, instead
of returning it through the system call. If the error happens
outside of a specific sqe, we return the error through the system
call. This makes it nicer to use and makes the "normal" use case
behave the same as the offload cases. (me)
- Fix for a missing req reference drop from async context (me)
- If an sqe is submitted with RWF_NOWAIT, don't punt it to async
context. Return -EAGAIN directly, instead of using it as a hint to
do async punt. (Stefan)
- Fix notes on barriers (Stefan)
- Remove unnecessary barriers (Stefan)
- Fix potential double free of memory in setup error (Mark)
- Further improve sq poll CPU validation (Mark)
- Fix page allocation warning and leak on buffer registration error
(Mark)
- Fix iov_iter_type() for new no-ref flag (Ming)
- Fix a case where dio doesn't honor bio no-page-ref (Ming)"
* tag 'for-linus-20190502' of git://git.kernel.dk/linux-block:
io_uring: avoid page allocation warnings
iov_iter: fix iov_iter_type
block: fix handling for BIO_NO_PAGE_REF
io_uring: drop req submit reference always in async punt
io_uring: free allocated io_memory once
io_uring: fix SQPOLL cpu validation
io_uring: have submission side sqe errors post a cqe
io_uring: remove unnecessary barrier after unsetting IORING_SQ_NEED_WAKEUP
io_uring: remove unnecessary barrier after incrementing dropped counter
io_uring: remove unnecessary barrier before reading SQ tail
io_uring: remove unnecessary barrier after updating SQ head
io_uring: remove unnecessary barrier before reading cq head
io_uring: remove unnecessary barrier before wq_has_sleeper
io_uring: fix notes on barriers
io_uring: fix handling SQEs requesting NOWAIT
Recent refactoring of cow_file_range_async means it's now possible to
request a rather large physically contiguous memory via kmalloc. The
size is dependent on the number of 512k chunks that the compressed range
consists of. David reported multiple OOM messages on such large
allocations. Fix it by switching to using kvmalloc.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>