49165 Commits

Author SHA1 Message Date
Chao Yu
a72d4b97bb f2fs: relocate inode_{,un}lock in F2FS_IOC_SETFLAGS
This patch expands cover region of inode->i_rwsem to keep setting flag
atomically.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 14:30:19 -07:00
Jan Kara
3adc5fcb7e f2fs: Make flush bios explicitely sync
Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed REQ_SYNC flag from WRITE_{FUA|PREFLUSH|...}
definitions.  generic_make_request_checks() however strips REQ_FUA and
REQ_PREFLUSH flags from a bio when the storage doesn't report volatile
write cache and thus write effectively becomes asynchronous which can
lead to performance regressions.

Fix the problem by making sure all bios which are synchronous are
properly marked with REQ_SYNC.

Fixes: b685d3d65ac791406e0dfd8779cc9b3707fea5a3
Cc: stable@vger.kernel.org # 4.9+
CC: Jaegeuk Kim <jaegeuk@kernel.org>
CC: linux-f2fs-devel@lists.sourceforge.net
Signed-off-by: Jan Kara <jack@suse.cz>
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 14:30:18 -07:00
Darrick J. Wong
fe0be23e68 xfs: reserve enough blocks to handle btree splits when remapping
In xfs_reflink_end_cow, we erroneously reserve only enough blocks to
handle adding 1 extent.  This is problematic if we fragment free space,
have to do CoW, and then have to perform multiple bmap btree expansions.
Furthermore, the BUI recovery routine doesn't reserve /any/ blocks to
handle btree splits, so log recovery fails after our first error causes
the filesystem to go down.

Therefore, refactor the transaction block reservation macros until we
have a macro that works for our deferred (re)mapping activities, and fix
both problems by using that macro.

With 1k blocks we can hit this fairly often in g/187 if the scratch fs
is big enough.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2017-05-03 13:21:40 -07:00
Linus Torvalds
a3719f34fd Merge branch 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull quota, reiserfs, udf and ext2 updates from Jan Kara:
 "The branch contains changes to quota code so that it does not modify
  persistent flags in inode->i_flags (it was the only place in kernel
  doing that) and handle it inside filesystem's quotaon/off handlers
  instead.

  The branch also contains two UDF cleanups, a couple of reiserfs fixes
  and one fix for ext2 quota locking"

* 'generic' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
  ext4: Improve comments in ext4_quota_{on|off}()
  udf: use kmap_atomic for memcpy copying
  udf: use octal for permissions
  quota: Remove dquot_quotactl_ops
  reiserfs: Remove i_attrs_to_sd_attrs()
  reiserfs: Remove useless setting of i_flags
  jfs: Remove jfs_get_inode_flags()
  ext2: Remove ext2_get_inode_flags()
  ext4: Remove ext4_get_inode_flags()
  quota: Stop setting IMMUTABLE and NOATIME flags on quota files
  jfs: Set flags on quota files directly
  ext2: Set flags on quota files directly
  reiserfs: Set flags on quota files directly
  ext4: Set flags on quota files directly
  reiserfs: Protect dquot_writeback_dquots() by s_umount semaphore
  reiserfs: Make cancel_old_flush() reliable
  ext2: Call dquot_writeback_dquots() with s_umount held
  reiserfs: avoid a -Wmaybe-uninitialized warning
2017-05-03 11:35:47 -07:00
Linus Torvalds
5133cd7518 Merge branch 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fsnotify updates from Jan Kara:
 "The branch contains mainly a rework of fsnotify infrastructure fixing
  a shortcoming that we have waited for response to fanotify permission
  events with SRCU read lock held and when the process consuming events
  was slow to respond the kernel has stalled.

  It also contains several cleanups of unnecessary indirections in
  fsnotify framework and a bugfix from Amir fixing leakage of kernel
  internal errno to userspace"

* 'fsnotify' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (37 commits)
  fanotify: don't expose EOPENSTALE to userspace
  fsnotify: remove a stray unlock
  fsnotify: Move ->free_mark callback to fsnotify_ops
  fsnotify: Add group pointer in fsnotify_init_mark()
  fsnotify: Drop inode_mark.c
  fsnotify: Remove fsnotify_find_{inode|vfsmount}_mark()
  fsnotify: Remove fsnotify_detach_group_marks()
  fsnotify: Rename fsnotify_clear_marks_by_group_flags()
  fsnotify: Inline fsnotify_clear_{inode|vfsmount}_mark_group()
  fsnotify: Remove fsnotify_recalc_{inode|vfsmount}_mask()
  fsnotify: Remove fsnotify_set_mark_{,ignored_}mask_locked()
  fanotify: Release SRCU lock when waiting for userspace response
  fsnotify: Pass fsnotify_iter_info into handle_event handler
  fsnotify: Provide framework for dropping SRCU lock in ->handle_event
  fsnotify: Remove special handling of mark destruction on group shutdown
  fsnotify: Detach mark from object list when last reference is dropped
  fsnotify: Move queueing of mark for destruction into fsnotify_put_mark()
  inotify: Do not drop mark reference under idr_lock
  fsnotify: Free fsnotify_mark_connector when there is no mark attached
  fsnotify: Lock object list with connector lock
  ...
2017-05-03 11:05:15 -07:00
Jaegeuk Kim
5b0ef73c9d f2fs: show available_nids in f2fs/status
This patch adds an entry in f2fs/status to show # of available nids.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:57 -07:00
Jaegeuk Kim
1c0f4bf5c3 f2fs: flush dirty nats periodically
This patch flushes dirty nats in order to acquire available nids by writing
checkpoint. Otherwise, we can have no chance to get freed nids.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:56 -07:00
Chao Yu
1f43e2ad7b f2fs: introduce CP_TRIMMED_FLAG to avoid unneeded discard
Introduce CP_TRIMMED_FLAG to indicate all invalid block were trimmed
before umount, so once we do mount with image which contain the flag,
we don't record invalid blocks as undiscard one, when fstrim is being
triggered, we can avoid issuing redundant discard commands.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:56 -07:00
Chao Yu
c473f1a965 f2fs: allow cpc->reason to indicate more than one reason
Change to use different bits of cpc->reason to indicate different status,
so cpc->reason can indicate more than one reason.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:55 -07:00
Hou Pengyang
279d6df20c f2fs: release cp and dnode lock before IPU
We don't need to rewrite the page under cp_rwsem and dnode locks.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-03 10:04:54 -07:00
Fred Isaman
c296cfe26b pNFS: Fix NULL dereference in pnfs_generic_alloc_ds_commits
Signed-off-by: Fred Isaman <fred.isaman@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-03 12:29:41 -04:00
Linus Torvalds
0302e28dee Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
Pull security subsystem updates from James Morris:
 "Highlights:

  IMA:
   - provide ">" and "<" operators for fowner/uid/euid rules

  KEYS:
   - add a system blacklist keyring

   - add KEYCTL_RESTRICT_KEYRING, exposes keyring link restriction
     functionality to userland via keyctl()

  LSM:
   - harden LSM API with __ro_after_init

   - add prlmit security hook, implement for SELinux

   - revive security_task_alloc hook

  TPM:
   - implement contextual TPM command 'spaces'"

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (98 commits)
  tpm: Fix reference count to main device
  tpm_tis: convert to using locality callbacks
  tpm: fix handling of the TPM 2.0 event logs
  tpm_crb: remove a cruft constant
  keys: select CONFIG_CRYPTO when selecting DH / KDF
  apparmor: Make path_max parameter readonly
  apparmor: fix parameters so that the permission test is bypassed at boot
  apparmor: fix invalid reference to index variable of iterator line 836
  apparmor: use SHASH_DESC_ON_STACK
  security/apparmor/lsm.c: set debug messages
  apparmor: fix boolreturn.cocci warnings
  Smack: Use GFP_KERNEL for smk_netlbl_mls().
  smack: fix double free in smack_parse_opts_str()
  KEYS: add SP800-56A KDF support for DH
  KEYS: Keyring asymmetric key restrict method with chaining
  KEYS: Restrict asymmetric key linkage using a specific keychain
  KEYS: Add a lookup_restriction function for the asymmetric key type
  KEYS: Add KEYCTL_RESTRICT_KEYRING
  KEYS: Consistent ordering for __key_link_begin and restrict check
  KEYS: Add an optional lookup_restriction hook to key_type
  ...
2017-05-03 08:50:52 -07:00
Josef Bacik
563f40019d fs: don't set *REFERENCED on single use objects
By default we set DCACHE_REFERENCED and I_REFERENCED on any dentry or
inode we create.  This is problematic as this means that it takes two
trips through the LRU for any of these objects to be reclaimed,
regardless of their actual lifetime.  With enough pressure from these
caches we can easily evict our working set from page cache with single
use objects.  So instead only set *REFERENCED if we've already been
added to the LRU list.  This means that we've been touched since the
first time we were accessed, and so more likely to need to hang out in
cache.

To illustrate this issue I wrote the following scripts

https://github.com/josefbacik/debug-scripts/tree/master/cache-pressure

on my test box.  It is a single socket 4 core CPU with 16gib of RAM and
I tested on an Intel 2tib NVME drive.  The cache-pressure.sh script
creates a new file system and creates 2 6.5gib files in order to take up
13gib of the 16gib of ram with pagecache.  Then it runs a test program
that reads these 2 files in a loop, and keeps track of how often it has
to read bytes for each loop.  On an ideal system with no pressure we
should have to read 0 bytes indefinitely.  The second thing this script
does is start a fs_mark job that creates a ton of 0 length files,
putting pressure on the system with slab only allocations.  On exit the
script prints out how many bytes were read by the read-file program.
The results are as follows

Without patch:
/mnt/btrfs-test/reads/file1: total read during loops 27262988288
/mnt/btrfs-test/reads/file2: total read during loops 27262976000

With patch:
/mnt/btrfs-test/reads/file2: total read during loops 18640457728
/mnt/btrfs-test/reads/file1: total read during loops 9565376512

This patch results in a 50% reduction of the amount of pages evicted
from our working set.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-05-03 11:47:05 -04:00
Rabin Vincent
3998e6b87d CIFS: fix oplock break deadlocks
When the final cifsFileInfo_put() is called from cifsiod and an oplock
break work is queued, lockdep complains loudly:

 =============================================
 [ INFO: possible recursive locking detected ]
 4.11.0+ #21 Not tainted
 ---------------------------------------------
 kworker/0:2/78 is trying to acquire lock:
  ("cifsiod"){++++.+}, at: flush_work+0x215/0x350

 but task is already holding lock:
  ("cifsiod"){++++.+}, at: process_one_work+0x255/0x8e0

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock("cifsiod");
   lock("cifsiod");

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 2 locks held by kworker/0:2/78:
  #0:  ("cifsiod"){++++.+}, at: process_one_work+0x255/0x8e0
  #1:  ((&wdata->work)){+.+...}, at: process_one_work+0x255/0x8e0

 stack backtrace:
 CPU: 0 PID: 78 Comm: kworker/0:2 Not tainted 4.11.0+ #21
 Workqueue: cifsiod cifs_writev_complete
 Call Trace:
  dump_stack+0x85/0xc2
  __lock_acquire+0x17dd/0x2260
  ? match_held_lock+0x20/0x2b0
  ? trace_hardirqs_off_caller+0x86/0x130
  ? mark_lock+0xa6/0x920
  lock_acquire+0xcc/0x260
  ? lock_acquire+0xcc/0x260
  ? flush_work+0x215/0x350
  flush_work+0x236/0x350
  ? flush_work+0x215/0x350
  ? destroy_worker+0x170/0x170
  __cancel_work_timer+0x17d/0x210
  ? ___preempt_schedule+0x16/0x18
  cancel_work_sync+0x10/0x20
  cifsFileInfo_put+0x338/0x7f0
  cifs_writedata_release+0x2a/0x40
  ? cifs_writedata_release+0x2a/0x40
  cifs_writev_complete+0x29d/0x850
  ? preempt_count_sub+0x18/0xd0
  process_one_work+0x304/0x8e0
  worker_thread+0x9b/0x6a0
  kthread+0x1b2/0x200
  ? process_one_work+0x8e0/0x8e0
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x31/0x40

This is a real warning.  Since the oplock is queued on the same
workqueue this can deadlock if there is only one worker thread active
for the workqueue (which will be the case during memory pressure when
the rescuer thread is handling it).

Furthermore, there is at least one other kind of hang possible due to
the oplock break handling if there is only worker.  (This can be
reproduced without introducing memory pressure by having passing 1 for
the max_active parameter of cifsiod.) cifs_oplock_break() can wait
indefintely in the filemap_fdatawait() while the cifs_writev_complete()
work is blocked:

 sysrq: SysRq : Show Blocked State
   task                        PC stack   pid father
 kworker/0:1     D    0    16      2 0x00000000
 Workqueue: cifsiod cifs_oplock_break
 Call Trace:
  __schedule+0x562/0xf40
  ? mark_held_locks+0x4a/0xb0
  schedule+0x57/0xe0
  io_schedule+0x21/0x50
  wait_on_page_bit+0x143/0x190
  ? add_to_page_cache_lru+0x150/0x150
  __filemap_fdatawait_range+0x134/0x190
  ? do_writepages+0x51/0x70
  filemap_fdatawait_range+0x14/0x30
  filemap_fdatawait+0x3b/0x40
  cifs_oplock_break+0x651/0x710
  ? preempt_count_sub+0x18/0xd0
  process_one_work+0x304/0x8e0
  worker_thread+0x9b/0x6a0
  kthread+0x1b2/0x200
  ? process_one_work+0x8e0/0x8e0
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x31/0x40
 dd              D    0   683    171 0x00000000
 Call Trace:
  __schedule+0x562/0xf40
  ? mark_held_locks+0x29/0xb0
  schedule+0x57/0xe0
  io_schedule+0x21/0x50
  wait_on_page_bit+0x143/0x190
  ? add_to_page_cache_lru+0x150/0x150
  __filemap_fdatawait_range+0x134/0x190
  ? do_writepages+0x51/0x70
  filemap_fdatawait_range+0x14/0x30
  filemap_fdatawait+0x3b/0x40
  filemap_write_and_wait+0x4e/0x70
  cifs_flush+0x6a/0xb0
  filp_close+0x52/0xa0
  __close_fd+0xdc/0x150
  SyS_close+0x33/0x60
  entry_SYSCALL_64_fastpath+0x1f/0xbe

 Showing all locks held in the system:
 2 locks held by kworker/0:1/16:
  #0:  ("cifsiod"){.+.+.+}, at: process_one_work+0x255/0x8e0
  #1:  ((&cfile->oplock_break)){+.+.+.}, at: process_one_work+0x255/0x8e0

 Showing busy workqueues and worker pools:
 workqueue cifsiod: flags=0xc
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
     in-flight: 16:cifs_oplock_break
     delayed: cifs_writev_complete, cifs_echo_request
 pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 750 3

Fix these problems by creating a a new workqueue (with a rescuer) for
the oplock break work.

Signed-off-by: Rabin Vincent <rabinv@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2017-05-03 10:10:10 -05:00
David Disseldorp
6026685de3 cifs: fix CIFS_ENUMERATE_SNAPSHOTS oops
As with 618763958b22, an open directory may have a NULL private_data
pointer prior to readdir. CIFS_ENUMERATE_SNAPSHOTS must check for this
before dereference.

Fixes: 834170c85978 ("Enable previous version support")
Signed-off-by: David Disseldorp <ddiss@suse.de>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-03 09:59:20 -05:00
David Disseldorp
0e5c795592 cifs: fix leak in FSCTL_ENUM_SNAPS response handling
The server may respond with success, and an output buffer less than
sizeof(struct smb_snapshot_array) in length. Do not leak the output
buffer in this case.

Fixes: 834170c85978 ("Enable previous version support")
Signed-off-by: David Disseldorp <ddiss@suse.de>
CC: Stable <stable@vger.kernel.org>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-03 09:54:12 -05:00
Chao Yu
9a744b92da f2fs: shrink size of struct discard_cmd
In order to shrink size of struct discard_cmd, change variable type of
@state in struct discard_cmd from int to unsigned char.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:51 -07:00
Chao Yu
ec9895add2 f2fs: don't hold cmd_lock during waiting discard command
Previously, with protection of cmd_lock, we will wait for end io of
discard command which potentially may lead long latency, making worse
concurrency.

So, in this patch, we try to add reference into discard entry to prevent
the entry being released by other thread, then we can avoid holding
global cmd_lock during waiting discard to finish.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:50 -07:00
Jaegeuk Kim
4d97807813 f2fs: nullify fio->encrypted_page for each writes
This makes sure each write request has nullified encrypted_page pointer.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:49 -07:00
Jin Qian
b9dd46188e f2fs: sanity check segment count
F2FS uses 4 bytes to represent block address. As a result, supported
size of disk is 16 TB and it equals to 16 * 1024 * 1024 / 2 segments.

Signed-off-by: Jin Qian <jinqian@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:48 -07:00
Jaegeuk Kim
a817737e87 f2fs: introduce valid_ipu_blkaddr to clean up
This patch introduces valid_ipu_blkaddr to clean up checking block address for
inplace-update.

Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:48 -07:00
Hou Pengyang
e959c8f543 f2fs: lookup extent cache first under IPU scenario
If a page is cold, NOT atomit written and need_ipu now, there is
a high probability that IPU should be adapted. For IPU, we try to
check extent tree to get the block index first, instead of reading
the dnode page, where may lead to an useless dnode IO, since no need to
update the dnode index for IPU.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:47 -07:00
Hou Pengyang
7eab0c0df8 f2fs: reconstruct code to write a data page
This patch introduces encrypt_one_page which encrypts one data page before
submit_bio, and change the use of need_inplace_update.

Signed-off-by: Hou Pengyang <houpengyang@huawei.com>
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:46 -07:00
Chao Yu
63a94fa1d7 f2fs: introduce __wait_discard_cmd
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:45 -07:00
Chao Yu
bd5b07383a f2fs: introduce __issue_discard_cmd
Just cleanup, no logic change.

Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2017-05-02 21:19:44 -07:00
Linus Torvalds
89c9fea3c8 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  tty: fix comment for __tty_alloc_driver()
  init/main: properly align the multi-line comment
  init/main: Fix double "the" in comment
  Fix dead URLs to ftp.kernel.org
  drivers: Clean up duplicated email address
  treewide: Fix typo in xml/driver-api/basics.xml
  tools/testing/selftests/powerpc: remove redundant CFLAGS in Makefile: "-Wall -O2 -Wall" -> "-O2 -Wall"
  selftests/timers: Spelling s/privledges/privileges/
  HID: picoLCD: Spelling s/REPORT_WRTIE_MEMORY/REPORT_WRITE_MEMORY/
  net: phy: dp83848: Fix Typo
  UBI: Fix typos
  Documentation: ftrace.txt: Correct nice value of 120 priority
  net: fec: Fix typo in error msg and comment
  treewide: Fix typos in printk
2017-05-02 19:09:35 -07:00
Linus Torvalds
76f1948a79 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching
Pull livepatch updates from Jiri Kosina:

 - a per-task consistency model is being added for architectures that
   support reliable stack dumping (extending this, currently rather
   trivial set, is currently in the works).

   This extends the nature of the types of patches that can be applied
   by live patching infrastructure. The code stems from the design
   proposal made [1] back in November 2014. It's a hybrid of SUSE's
   kGraft and RH's kpatch, combining advantages of both: it uses
   kGraft's per-task consistency and syscall barrier switching combined
   with kpatch's stack trace switching. There are also a number of
   fallback options which make it quite flexible.

   Most of the heavy lifting done by Josh Poimboeuf with help from
   Miroslav Benes and Petr Mladek

   [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

 - module load time patch optimization from Zhou Chengming

 - a few assorted small fixes

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
  livepatch: add missing printk newlines
  livepatch: Cancel transition a safe way for immediate patches
  livepatch: Reduce the time of finding module symbols
  livepatch: make klp_mutex proper part of API
  livepatch: allow removal of a disabled patch
  livepatch: add /proc/<pid>/patch_state
  livepatch: change to a per-task consistency model
  livepatch: store function sizes
  livepatch: use kstrtobool() in enabled_store()
  livepatch: move patching functions into patch.c
  livepatch: remove unnecessary object loaded check
  livepatch: separate enabled and patched states
  livepatch/s390: add TIF_PATCH_PENDING thread flag
  livepatch/s390: reorganize TIF thread flag bits
  livepatch/powerpc: add TIF_PATCH_PENDING thread flag
  livepatch/x86: add TIF_PATCH_PENDING thread flag
  livepatch: create temporary klp_update_patch_state() stub
  x86/entry: define _TIF_ALLWORK_MASK flags explicitly
  stacktrace/x86: add function for detecting reliable stack traces
2017-05-02 18:24:16 -07:00
Linus Torvalds
8d65b08deb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Millar:
 "Here are some highlights from the 2065 networking commits that
  happened this development cycle:

   1) XDP support for IXGBE (John Fastabend) and thunderx (Sunil Kowuri)

   2) Add a generic XDP driver, so that anyone can test XDP even if they
      lack a networking device whose driver has explicit XDP support
      (me).

   3) Sparc64 now has an eBPF JIT too (me)

   4) Add a BPF program testing framework via BPF_PROG_TEST_RUN (Alexei
      Starovoitov)

   5) Make netfitler network namespace teardown less expensive (Florian
      Westphal)

   6) Add symmetric hashing support to nft_hash (Laura Garcia Liebana)

   7) Implement NAPI and GRO in netvsc driver (Stephen Hemminger)

   8) Support TC flower offload statistics in mlxsw (Arkadi Sharshevsky)

   9) Multiqueue support in stmmac driver (Joao Pinto)

  10) Remove TCP timewait recycling, it never really could possibly work
      well in the real world and timestamp randomization really zaps any
      hint of usability this feature had (Soheil Hassas Yeganeh)

  11) Support level3 vs level4 ECMP route hashing in ipv4 (Nikolay
      Aleksandrov)

  12) Add socket busy poll support to epoll (Sridhar Samudrala)

  13) Netlink extended ACK support (Johannes Berg, Pablo Neira Ayuso,
      and several others)

  14) IPSEC hw offload infrastructure (Steffen Klassert)"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2065 commits)
  tipc: refactor function tipc_sk_recv_stream()
  tipc: refactor function tipc_sk_recvmsg()
  net: thunderx: Optimize page recycling for XDP
  net: thunderx: Support for XDP header adjustment
  net: thunderx: Add support for XDP_TX
  net: thunderx: Add support for XDP_DROP
  net: thunderx: Add basic XDP support
  net: thunderx: Cleanup receive buffer allocation
  net: thunderx: Optimize CQE_TX handling
  net: thunderx: Optimize RBDR descriptor handling
  net: thunderx: Support for page recycling
  ipx: call ipxitf_put() in ioctl error path
  net: sched: add helpers to handle extended actions
  qed*: Fix issues in the ptp filter config implementation.
  qede: Fix concurrency issue in PTP Tx path processing.
  stmmac: Add support for SIMATIC IOT2000 platform
  net: hns: fix ethtool_get_strings overflow in hns driver
  tcp: fix wraparound issue in tcp_lp
  bpf, arm64: fix jit branch offset related to ldimm64
  bpf, arm64: implement jiting of BPF_XADD
  ...
2017-05-02 16:40:27 -07:00
Steve French
26c9cb668c Set unicode flag on cifs echo request to avoid Mac error
Mac requires the unicode flag to be set for cifs, even for the smb
echo request (which doesn't have strings).

Without this Mac rejects the periodic echo requests (when mounting
with cifs) that we use to check if server is down

Signed-off-by: Steve French <smfrench@gmail.com>
CC: Stable <stable@vger.kernel.org>
2017-05-02 14:57:34 -05:00
Pavel Shilovsky
c610c4b619 CIFS: Add asynchronous write support through kernel AIO
This patch adds support to process write calls passed by io_submit()
asynchronously. It based on the previously introduced async context
that allows to process i/o responses in a separate thread and
return the caller immediately for asynchronous calls.

This improves writing performance of single threaded applications
with increasing of i/o queue depth size.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Pavel Shilovsky
6685c5e2d1 CIFS: Add asynchronous read support through kernel AIO
This patch adds support to process read calls passed by io_submit()
asynchronously. It based on the previously introduced async context
that allows to process i/o responses in a separate thread and
return the caller immediately for asynchronous calls.

This improves reading performance of single threaded applications
with increasing of i/o queue depth size.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Pavel Shilovsky
ccf7f4088a CIFS: Add asynchronous context to support kernel AIO
Currently the code doesn't recognize asynchronous calls passed
by io_submit() and processes all calls synchronously. This is not
what kernel AIO expects. This patch introduces a new async context
that keeps track of all issued i/o requests and moves a response
collecting procedure to a separate thread. This allows to return
to a caller immediately for async calls and call iocb->ki_complete()
once all requests are completed. For sync calls the current thread
simply waits until all requests are completed.

Signed-off-by: Pavel Shilovsky <pshilov@microsoft.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Daniel N Pettersson
29bb3158cf cifs: fix IPv6 link local, with scope id, address parsing
When the IP address is gotten from the UNC, use only the address part
of the UNC. Else all after the percent sign in an IPv6 link local
address is interpreted as a scope id. This includes the slash and
share name. A scope id is expected to be an integer and any trailing
characters makes the conversion to integer fail.
Example of mount command that fails:
mount -i -t cifs //fe80::6a05:caff:fe3e:8ffc%2/test /mnt/t -o sec=none

Signed-off-by: Daniel N Pettersson <danielnp@axis.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Dan Carpenter
564277ecee cifs: small underflow in cnvrtDosUnixTm()
January is month 1.  There is no zero-th month.  If someone passes a
zero month then it means we read from one space before the start of the
total_days_of_prev_months[] array.

We may as well also be strict about days as well.

Fixes: 1bd5bbcb6531 ("[CIFS] Legacy time handling for Win9x and OS/2 part 1")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steve French <smfrench@gmail.com>
2017-05-02 14:57:34 -05:00
Linus Torvalds
204f144c9f Merge branch 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fs/compat.c cleanups from Al Viro:
 "More moving of compat syscalls from fs/compat.c to fs/*.c where the
  native counterparts live.

  And death to compat_sys_getdents64() - the only architecture that used
  to need it was ia64, and _that_ has lost biarch support quite a few
  years ago"

* 'work.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs/compat.c: trim unused includes
  move compat_rw_copy_check_uvector() over to fs/read_write.c
  fhandle: move compat syscalls from compat.c
  open: move compat syscalls from compat.c
  stat: move compat syscalls from compat.c
  fcntl: move compat syscalls from compat.c
  readdir: move compat syscalls from compat.c
  statfs: move compat syscalls from compat.c
  utimes: move compat syscalls from compat.c
  move compat select-related syscalls to fs/select.c
  Remove compat_sys_getdents64()
2017-05-02 11:54:26 -07:00
Linus Torvalds
da7b66ffb2 Merge branch 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull splice updates from Al Viro:
 "These actually missed the last cycle; the branch itself is from last
  December"

* 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  make nr_pages calculation in default_file_splice_read() a bit less ugly
  splice/tee/vmsplice: validate flags
  splice_pipe_desc: kill ->flags
  remove spd_release_page()
2017-05-02 11:38:06 -07:00
Linus Torvalds
5b13475a5e Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
 "Cleanups that sat in -next + -stable fodder that has just missed 4.11.

  There's more iov_iter work in my local tree, but I'd prefer to push
  the stuff that had been in -next first"

* 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  iov_iter: don't revert iov buffer if csum error
  generic_file_read_iter(): make use of iov_iter_revert()
  generic_file_direct_write(): make use of iov_iter_revert()
  orangefs: use iov_iter_revert()
  sctp: switch to copy_from_iter_full()
  net/9p: switch to copy_from_iter_full()
  switch memcpy_from_msg() to copy_from_iter_full()
  rds: make use of iov_iter_revert()
2017-05-02 11:18:50 -07:00
Linus Torvalds
6fd4e7f774 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French:
 "Three cifs/smb3 fixes - including two for stable"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: don't check for failure from mempool_alloc()
  Do not return number of bytes written for ioctl CIFS_IOC_COPYCHUNK_FILE
  Fix match_prepath()
2017-05-02 11:16:29 -07:00
Linus Torvalds
2575be8ad3 - constify compression structures; Bhumika Goyal
- restore powerpc dumping; Ankit Kumar
 - fix more bugs in the rarely exercises module unloading logic
 - reorganize filesystem locking to fix problems noticed by lockdep
 - refactor internal pstore APIs to make development and review easier:
   - improve error reporting
   - add kernel-doc structure and function comments
   - avoid insane argument passing by using a common record structure
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 Comment: Kees Cook <kees@outflux.net>
 
 iQIcBAABCgAGBQJZB3wzAAoJEIly9N/cbcAmVQAP/3+GzoxUcL43ypsDa1CmCsFN
 l2roQjzWLNGfHgq5qkS/mNtrUdEvBMUBd2oyhHcaiqM0DsuuO3rKTp6dZ8oczYjN
 6GoTmU8ZwIPze3VEadNPjCIdpsfTMNKtZvVJCTrWnsgXxTawDS89qqr7SCs3qhBS
 Dm1E8oX77YyhOKoGA6O3CJxpdm/Ge4+KpPR6Uwj90eVro04vYoiwnBjyLUzE7w1l
 JcXEGEh1t5NjUxHeMwW7HZwJYZfA3DQ6I3MOOzGhf9tsKp6J0LQTTV8PMSEo1mif
 mLZhDBy8BBlunL+b2Tp3+c4QItGSHkBCWASI2RLa2TM7xvL67oC+qm/WaUyoRovy
 hllEG96rsCs3Zx7fFFsfQCwURcTWfJQMrD+0d/fM+P2ylWvgp+KU6PeLTS9IHu6M
 3n6i5i6A6OY/QvmZr1tN/06kUBjtQmo8EgQ0jxoxAlWyNcJqi93hmJyaRW28KxjS
 tjFTNLZMrslj0UDmjiD6fIuaT6gsGDB+3wAMPVAf+iV/k/2GUlj3ZILe4RaABAe9
 8xaUu11tZ5sTniayZ+10bA+6+K5n7uTlgU8RfFgaUZoRAzHgtyijOmdo6N+HILfK
 klv59B1Fmf6JpDlq7L9vurOqE82FAWFn4DruFM2bAaky2meFUNbYFiNfwK4l6lPI
 pmAgpdgRRvNMBCEmbVfv
 =S14G
 -----END PGP SIGNATURE-----

Merge tag 'pstore-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull pstore updates from Kees Cook:
 "This has a large internal refactoring along with several smaller
  fixes.

   - constify compression structures; Bhumika Goyal

   - restore powerpc dumping; Ankit Kumar

   - fix more bugs in the rarely exercises module unloading logic

   - reorganize filesystem locking to fix problems noticed by lockdep

   - refactor internal pstore APIs to make development and review
     easier:
      - improve error reporting
      - add kernel-doc structure and function comments
      - avoid insane argument passing by using a common record
        structure"

* tag 'pstore-v4.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits)
  pstore: Solve lockdep warning by moving inode locks
  pstore: Fix flags to enable dumps on powerpc
  pstore: Remove unused vmalloc.h in pmsg
  pstore: simplify write_user_compat()
  pstore: Remove write_buf() callback
  pstore: Replace arguments for write_buf_user() API
  pstore: Replace arguments for write_buf() API
  pstore: Replace arguments for erase() API
  pstore: Do not duplicate record metadata
  pstore: Allocate records on heap instead of stack
  pstore: Pass record contents instead of copying
  pstore: Always allocate buffer for decompression
  pstore: Replace arguments for write() API
  pstore: Replace arguments for read() API
  pstore: Switch pstore_mkfile to pass record
  pstore: Move record decompression to function
  pstore: Extract common arguments into structure
  pstore: Add kernel-doc for struct pstore_info
  pstore: Improve register_pstore() error reporting
  pstore: Avoid race in module unloading
  ...
2017-05-02 10:35:45 -07:00
Trond Myklebust
5f0114832a pNFS: Fix a typo in pnfs_generic_alloc_ds_commits
If the layout segment is invalid, we want to just resend the remaining
writes.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-02 12:35:34 -04:00
Trond Myklebust
61f454e30c pNFS: Fix a deadlock when coalescing writes and returning the layout
Consider the following deadlock:

Process P1	Process P2		Process P3
==========	==========		==========
					lock_page(page)

		lseg = pnfs_update_layout(inode)

lo = NFS_I(inode)->layout
pnfs_error_mark_layout_for_return(lo)

		lock_page(page)

					lseg = pnfs_update_layout(inode)

In this scenario,
- P1 has declared the layout to be in error, but P2 holds a reference to
  a layout segment on that inode, so the layoutreturn is deferred.
- P2 is waiting for a page lock held by P3.
- P3 is asking for a new layout segment, but is blocked waiting
  for the layoutreturn.

The fix is to ensure that pnfs_error_mark_layout_for_return() does
not set the NFS_LAYOUT_RETURN flag, which blocks P3. Instead, we allow
the latter to call LAYOUTGET so that it can make progress and unblock
P2.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-02 12:35:33 -04:00
Trond Myklebust
5466d21411 pNFS: Don't clear the layout return info if there are segments to return
In pnfs_clear_layoutreturn_info, ensure that we don't clear the layout
return info if there are new segments queued for return due to, for
instance, a race between a LAYOUTRETURN and a failed I/O attempt.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
2017-05-02 12:35:33 -04:00
Eric Biggers
aa1dca3bd9 ext4: inherit encryption xattr before other xattrs
When using both encryption and SELinux (or another feature that requires
an xattr per file) on a filesystem with 256-byte inodes, each file's
xattrs usually spill into an external xattr block.  Currently, the
xattrs are inherited in the order ACL, security, then encryption.
Therefore, if spillage occurs, the encryption xattr will always end up
in the external block.  This is not ideal because the encryption xattrs
contain a nonce, so they will always be unique and will prevent the
external xattr blocks from being deduplicated.

To improve the situation, change the inheritance order to encryption,
ACL, then security.  This gives the encryption xattr a better chance to
be stored in-inode, allowing the other xattr(s) to be deduplicated.

Note that it may be better for userspace to format the filesystem with
512-byte inodes in this case.  However, it's not the default.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-05-02 00:49:54 -04:00
Linus Torvalds
6dc2cce932 Merge branch 'x86-process-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pul x86/process updates from Ingo Molnar:
 "The main change in this cycle was to add the ARCH_[GET|SET]_CPUID
  prctl() ABI extension to control the availability of the CPUID
  instruction, analogously to the existing PR_GET|SET_TSC ABI that
  controls RDTSC.

  Motivation: the 'rr' user-space record-and-replay execution debugger
  would like to trap and emulate the CPUID instruction - which
  instruction is normally unprivileged.

  Trapping CPUID is possible on IvyBridge and later Intel CPUs - expose
  this hardware capability"

* 'x86-process-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/syscalls/32: Ignore arch_prctl for other architectures
  um/arch_prctl: Fix fallout from x86 arch_prctl() rework
  x86/arch_prctl: Add ARCH_[GET|SET]_CPUID
  x86/cpufeature: Detect CPUID faulting support
  x86/syscalls/32: Wire up arch_prctl on x86-32
  x86/arch_prctl: Add do_arch_prctl_common()
  x86/arch_prctl/64: Rename do_arch_prctl() to do_arch_prctl_64()
  x86/arch_prctl/64: Use SYSCALL_DEFINE2 to define sys_arch_prctl()
  x86/arch_prctl: Rename 'code' argument to 'option'
  x86/msr: Rename MISC_FEATURE_ENABLES to MISC_FEATURES_ENABLES
  x86/process: Optimize TIF_NOTSC switch
  x86/process: Correct and optimize TIF_BLOCKSTEP switch
  x86/process: Optimize TIF checks in __switch_to_xtra()
2017-05-01 19:57:58 -07:00
Linus Torvalds
3527d3e951 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this cycle were:

   - another round of rq-clock handling debugging, robustization and
     fixes

   - PELT accounting improvements

   - CPU hotplug related ->cpus_allowed affinity handling fixes all
     around the tree

   - ... plus misc fixes, cleanups and updates"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
  sched/x86: Update reschedule warning text
  crypto: N2 - Replace racy task affinity logic
  cpufreq/sparc-us2e: Replace racy task affinity logic
  cpufreq/sparc-us3: Replace racy task affinity logic
  cpufreq/sh: Replace racy task affinity logic
  cpufreq/ia64: Replace racy task affinity logic
  ACPI/processor: Replace racy task affinity logic
  ACPI/processor: Fix error handling in __acpi_processor_start()
  sparc/sysfs: Replace racy task affinity logic
  powerpc/smp: Replace open coded task affinity logic
  ia64/sn/hwperf: Replace racy task affinity logic
  ia64/salinfo: Replace racy task affinity logic
  workqueue: Provide work_on_cpu_safe()
  ia64/topology: Remove cpus_allowed manipulation
  sched/fair: Move the PELT constants into a generated header
  sched/fair: Increase PELT accuracy for small tasks
  sched/fair: Fix comments
  sched/Documentation: Add 'sched-pelt' tool
  sched/fair: Fix corner case in __accumulate_sum()
  sched/core: Remove 'task' parameter and rename tsk_restore_flags() to current_restore_flags()
  ...
2017-05-01 19:12:53 -07:00
Linus Torvalds
5db6db0d40 Merge branch 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull uaccess unification updates from Al Viro:
 "This is the uaccess unification pile. It's _not_ the end of uaccess
  work, but the next batch of that will go into the next cycle. This one
  mostly takes copy_from_user() and friends out of arch/* and gets the
  zero-padding behaviour in sync for all architectures.

  Dealing with the nocache/writethrough mess is for the next cycle;
  fortunately, that's x86-only. Same for cleanups in iov_iter.c (I am
  sold on access_ok() in there, BTW; just not in this pile), same for
  reducing __copy_... callsites, strn*... stuff, etc. - there will be a
  pile about as large as this one in the next merge window.

  This one sat in -next for weeks. -3KLoC"

* 'work.uaccess' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (96 commits)
  HAVE_ARCH_HARDENED_USERCOPY is unconditional now
  CONFIG_ARCH_HAS_RAW_COPY_USER is unconditional now
  m32r: switch to RAW_COPY_USER
  hexagon: switch to RAW_COPY_USER
  microblaze: switch to RAW_COPY_USER
  get rid of padding, switch to RAW_COPY_USER
  ia64: get rid of copy_in_user()
  ia64: sanitize __access_ok()
  ia64: get rid of 'segment' argument of __do_{get,put}_user()
  ia64: get rid of 'segment' argument of __{get,put}_user_check()
  ia64: add extable.h
  powerpc: get rid of zeroing, switch to RAW_COPY_USER
  esas2r: don't open-code memdup_user()
  alpha: fix stack smashing in old_adjtimex(2)
  don't open-code kernel_setsockopt()
  mips: switch to RAW_COPY_USER
  mips: get rid of tail-zeroing in primitives
  mips: make copy_from_user() zero tail explicitly
  mips: clean and reorder the forest of macros...
  mips: consolidate __invoke_... wrappers
  ...
2017-05-01 14:41:04 -07:00
Arnd Bergmann
67fd389735 block, dax: use correct format string in bdev_dax_supported
The new message has an incorrect format string, causing a warning in some
configurations:

fs/block_dev.c: In function 'bdev_dax_supported':
fs/block_dev.c:779:5: error: format '%d' expects argument of type 'int', but argument 2 has type 'long int' [-Werror=format=]
     "error: dax access failed (%d)", len);

This changes it to use the correct %ld instead of %d.

Fixes: 2093f2e9dfec ("block, dax: convert bdev_dax_supported() to dax_direct_access()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-05-01 13:16:29 -07:00
Linus Torvalds
694752922b Merge branch 'for-4.12/block' of git://git.kernel.dk/linux-block
Pull block layer updates from Jens Axboe:

 - Add BFQ IO scheduler under the new blk-mq scheduling framework. BFQ
   was initially a fork of CFQ, but subsequently changed to implement
   fairness based on B-WF2Q+, a modified variant of WF2Q. BFQ is meant
   to be used on desktop type single drives, providing good fairness.
   From Paolo.

 - Add Kyber IO scheduler. This is a full multiqueue aware scheduler,
   using a scalable token based algorithm that throttles IO based on
   live completion IO stats, similary to blk-wbt. From Omar.

 - A series from Jan, moving users to separately allocated backing
   devices. This continues the work of separating backing device life
   times, solving various problems with hot removal.

 - A series of updates for lightnvm, mostly from Javier. Includes a
   'pblk' target that exposes an open channel SSD as a physical block
   device.

 - A series of fixes and improvements for nbd from Josef.

 - A series from Omar, removing queue sharing between devices on mostly
   legacy drivers. This helps us clean up other bits, if we know that a
   queue only has a single device backing. This has been overdue for
   more than a decade.

 - Fixes for the blk-stats, and improvements to unify the stats and user
   windows. This both improves blk-wbt, and enables other users to
   register a need to receive IO stats for a device. From Omar.

 - blk-throttle improvements from Shaohua. This provides a scalable
   framework for implementing scalable priotization - particularly for
   blk-mq, but applicable to any type of block device. The interface is
   marked experimental for now.

 - Bucketized IO stats for IO polling from Stephen Bates. This improves
   efficiency of polled workloads in the presence of mixed block size
   IO.

 - A few fixes for opal, from Scott.

 - A few pulls for NVMe, including a lot of fixes for NVMe-over-fabrics.
   From a variety of folks, mostly Sagi and James Smart.

 - A series from Bart, improving our exposed info and capabilities from
   the blk-mq debugfs support.

 - A series from Christoph, cleaning up how handle WRITE_ZEROES.

 - A series from Christoph, cleaning up the block layer handling of how
   we track errors in a request. On top of being a nice cleanup, it also
   shrinks the size of struct request a bit.

 - Removal of mg_disk and hd (sorry Linus) by Christoph. The former was
   never used by platforms, and the latter has outlived it's usefulness.

 - Various little bug fixes and cleanups from a wide variety of folks.

* 'for-4.12/block' of git://git.kernel.dk/linux-block: (329 commits)
  block: hide badblocks attribute by default
  blk-mq: unify hctx delay_work and run_work
  block: add kblock_mod_delayed_work_on()
  blk-mq: unify hctx delayed_run_work and run_work
  nbd: fix use after free on module unload
  MAINTAINERS: bfq: Add Paolo as maintainer for the BFQ I/O scheduler
  blk-mq-sched: alloate reserved tags out of normal pool
  mtip32xx: use runtime tag to initialize command header
  scsi: Implement blk_mq_ops.show_rq()
  blk-mq: Add blk_mq_ops.show_rq()
  blk-mq: Show operation, cmd_flags and rq_flags names
  blk-mq: Make blk_flags_show() callers append a newline character
  blk-mq: Move the "state" debugfs attribute one level down
  blk-mq: Unregister debugfs attributes earlier
  blk-mq: Only unregister hctxs for which registration succeeded
  blk-mq-debugfs: Rename functions for registering and unregistering the mq directory
  blk-mq: Let blk_mq_debugfs_register() look up the queue name
  blk-mq: Register <dev>/queue/mq after having registered <dev>/queue
  ide-pm: always pass 0 error to ide_complete_rq in ide_do_devset
  ide-pm: always pass 0 error to __blk_end_request_all
  ..
2017-05-01 10:39:57 -07:00
Theodore Ts'o
72d622b422 ext4: replace BUG_ON with WARN_ONCE in ext4_end_bio()
Add fallback code and a WARN_ONCE() call instead of a BUG_ON() in
the ext4_end_bio() function.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 20:08:05 -04:00
Jan Kara
dddbd6ac8f ext4: avoid unnecessary transaction stalls during writeback
Currently ext4_writepages() submits all pages with transaction started.
When no page needs block allocation or extent conversion we can submit
all dirty pages in the inode while holding a single transaction handle
and when device is congested this can take significant amount of time.
Thus ext4_writepages() can block transaction commits for extended
periods of time.

Take for example a simple benchmark simulating PostgreSQL database
(pgioperf in mmtest). The benchmark runs 16 processes doing random reads
from a huge file, one process doing random writes to the huge file, and
one process doing sequential writes to a small files and frequently
running fsync. With unpatched kernel transaction commits take on average
~18s with standard deviation of ~41s, top 5 commit times are:

274.466639s, 126.467347s, 86.992429s, 34.351563s, 31.517653s.

After this patch transaction commits take on average 0.1s with standard
deviation of 0.15s, top 5 commit times are:

0.563792s, 0.519980s, 0.509841s, 0.471700s, 0.469899s

[ Modified so we use an explicit do_map flag instead of relying on
  io_end not being allocated, the since io_end->inode is needed for I/O
  error handling. -- tytso ]

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2017-04-30 18:29:10 -04:00