49295 Commits

Author SHA1 Message Date
David Sterba
e03733da5a btrfs: fix bool type in btrfs_page_exists_in_range
We use only a simple bool indicator, int is not a problem here.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
c9fed2bb61 btrfs: remove unused member list from btrfs_end_io_wq
The end io work queue items have been tracked by the work queues since
"Btrfs: Add async worker threads for pre and post IO checksumming"
(8b7128429235d9bd72cfd5e) (2008).

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
ee4ea69852 btrfs: remove unused members dir_path from recorded_ref
The two members do not seem to be used since the initial commit.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
b297c9f68f btrfs: remove unused member list from async_submit_bio
The list used to track checksums in the early version (2.6.29), but I
was able not pinpoint the commit that stopped using it. Everything
apparently works without it for a long time.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
David Sterba
106204f191 btrfs: remove unused member err from reada_extent
Seems to be unused since the initial commit, we ignore readahead errors
anyway, the full read will handle that if necessary.

Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Sahil Kang
0bef71093d btrfs: Remove unnecessary branching in free-space-tree.c
Both btrfs_create_free_space_tree and btrfs_clear_free_space_tree
contain:

  if (ret)
          return ret;

  return 0;

The if statement is only false when ret equals zero, and since we return
zero in such cases, we can safely remove the branching.

Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
e477094f0d Btrfs: hardcode GFP_NOFS for btrfs_bio_clone_partial
We only pass GFP_NOFS to btrfs_bio_clone_partial, so lets hardcode it.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Arnd Bergmann
3c91ee6964 Btrfs: work around maybe-uninitialized warning
A rewrite of btrfs_submit_direct_hook appears to have introduced a warning:

fs/btrfs/inode.c: In function 'btrfs_submit_direct_hook':
fs/btrfs/inode.c:8467:14: error: 'bio' may be used uninitialized in this function [-Werror=maybe-uninitialized]

Where the 'bio' variable was previously initialized unconditionally, it
is now set in the "while (submit_len > 0)" loop that would never execute
if submit_len is zero.

Assuming this cannot happen in practice, we can avoid the warning
by simply replacing the while{} loop with a do{}while() loop so
the compiler knows that it will always be entered at least once.

Fixes changes introduced in "Btrfs: use bio_clone_bioset_partial to
simplify DIO submit".

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
3892ac9086 Btrfs: unify naming of btrfs_io_bio
All dio endio functions are using io_bio for struct btrfs_io_bio, this
makes btrfs_submit_direct to follow this convention.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
11b5616516 Btrfs: check-integrity use bvec_iter
Some check-integrity code depends on bio->bi_vcnt, this changes it to use
bio segments because some bios passing here may not have a reliable
bi_vcnt.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
629ebf4fad Btrfs: record error if one block has failed to retry
In the nocsum case of dio read endio, it returns immediately if an error
gets returned when repairing, which leaves the rest blocks unrepaired.  The
behavior is different from how buffered read endio works in the same case.
This changes it to record error only and go on repairing the rest blocks.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
17347cec15 Btrfs: change how we iterate bios in endio
Since dio submit has used bio_clone_fast, the submitted bio may not have a
reliable bi_vcnt, for the bio vector iterations in checksum related
functions, bio->bi_iter is not modified yet and it's safe to use
bio_for_each_segment, while for those bio vector iterations in dio read's
endio, we now save a copy of bvec_iter in struct btrfs_io_bio when cloning
bios and use the helper __bio_for_each_segment with the saved bvec_iter to
access each bvec.

Also for dio reads which don't get split, we also need to save a copy of
bio iterator in btrfs_bio_clone to let __bio_for_each_segments to access
each bvec in dio read's endio.  Note that it doesn't affect other calls of
btrfs_bio_clone() because they don't need to use this iterator.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:59 +02:00
Liu Bo
725130bac5 Btrfs: use bio_clone_bioset_partial to simplify DIO submit
Currently when mapping bio to limit bio to a single stripe length, we
split bio by adding page to bio one by one, but later we don't modify
the vector of bio at all, thus we can use bio_clone_fast to use the
original bio vector directly.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Liu Bo
2f8e914042 Btrfs: new helper btrfs_bio_clone_partial
This adds a new helper btrfs_bio_clone_partial, it'll allocate a cloned
bio that only owns a part of the original bio's data.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Liu Bo
015c1bd9f1 Btrfs: use bio_clone_fast to clone our bio
For raid1 and raid10, we clone the original bio to the bios which are then
sent to different disks.

Right now we use bio_clone_bioset to create a clone bio with iterating
bi_io_vec to initialize it.  This changes it to use bio_clone_fast()
which creates a clone bio but only copies the bi_io_vec pointer
instead of iterating bi_io_vec.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik
7870d0822b Btrfs: don't pass the inode through clean_io_failure
Instead pass around the failure tree and the io tree.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik
6ec656bc0f btrfs: remove inode argument from repair_io_failure
Once we remove the btree_inode we won't have an inode to pass anymore,
just pass the fs_info directly and the inum since we use that to print
out the repair message.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Josef Bacik
c6100a4b4e Btrfs: replace tree->mapping with tree->private_data
For extent_io tree's we have carried the address_mapping of the inode
around in the io tree in order to pull the inode back out for calling
into various tree ops hooks.  This works fine when everything that has
an extent_io_tree has an inode.  But we are going to remove the
btree_inode, so we need to change this.  Instead just have a generic
void * for private data that we can initialize with, and have all the
tree ops use that instead.  This had a lot of cascading changes but
should be relatively straightforward.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor reordering of the callback prototypes ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Sargun Dhillon
2723480a0f btrfs: Add quota_override knob into sysfs
This patch adds the read-write attribute quota_override into sysfs.
Any process which has CAP_SYS_RESOURCE can set this flag to on, and
once it is set to true, processes with CAP_SYS_RESOURCE can exceed
the quota.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor changelog edits ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Sargun Dhillon
f29efe2921 btrfs: add quota override flag to enable quota override for CAP_SYS_RESOURCE
This patch introduces the quota override flag to btrfs_fs_info, and a
change to quota limit checking code to temporarily allow for quota to be
overridden for processes with CAP_SYS_RESOURCE.

It's useful for administrative programs, such as log rotation, that may
need to temporarily use more disk space in order to free up a greater
amount of overall disk space without yielding more disk space to the
rest of userland.

Eventually, we may want to add the idea of an operator-specific quota,
operator reserved space, or something else to allow for administrative
override, but this is perhaps the simplest solution.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor changelog edits ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Nikolay Borisov
a5ed45f822 btrfs: Convert fs_info->free_chunk_space to atomic64_t
The ->free_chunk_space variable is used to track the unallocated space
and access to it is protected by a spinlock, which is not used for
anything else.  Make the code a bit self-explanatory by switching the
variable to an atomic64_t type and kill the spinlock.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
[ not a performance critical code, use of atomic type is ok ]
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Anand Jain
401b41e5a8 btrfs: add framework to handle device flush error as a volume
This adds comments to the flush error handling part of the code, and
hopes to maintain the same logic with a framework which can be used to
handle the errors at the volume level.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Daichou
6b349dfe80 Btrfs: remove obsolete FIXMEs in qgroup ioctls
These FIXMEs were already addressed in 2013. All functions check for
qgroup existence:

* btrfs_add_qgroup_relation
* btrfs_ioctl_qgroup_create
* btrfs_limit_qgroup
* btrfs_del_qgroup_relation

Signed-off-by: Daichou <tommy0705c@gmail.com>
[ enhance and reformat changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:58 +02:00
Dan Carpenter
97d038562a Btrfs: remove an unused variable
"item" is never used.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:57 +02:00
Fabian Frederick
977ec79271 btrfs: kmap() can't fail
Remove NULL test on kmap() as it will always return a valid pointer.

Signed-off-by: Fabian Frederick <fabf@skynet.be>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2017-06-19 18:25:57 +02:00
Hugh Dickins
1be7107fbe mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs liks gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunatelly.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications.  For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and and the vma tree's subtree_gap support for that.

Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-19 21:50:20 +08:00
Linus Torvalds
6e20350659 A fix for an old ceph ->fh_to_* bug from Luis and two timestamp
fixups from Zheng, prompted by the ongoing y2038 work.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQEcBAABCAAGBQJZRTAEAAoJEEp/3jgCEfOLNwcH/jfzGqhOS262fU5FCCowfNVJ
 9ANzXiRpWykaHR7iPXOTRmRUqCJOCzhogmwzjnl7bQUX7cJPsFN+R7l9KR+dCAUX
 300dplDWZ5oQCX2c7A7vzRCIgv4wjQjtS0mo+dY/EBCNYcynoAUVmbr/87Ezrroi
 qfmA6pnI6hI527RLBkwIObljoAiy11MjQ1xFj0zS2bckWxfCSauO1v1qSpMhawkn
 v4fAWEKz3y8oUG3MtT7/Ukx4/GJAOcksxKZf93AW0sNwQozCxvB40D/Dda3NcT4s
 xYVVymUTYGTg1I/CmZHZxSqJwtKUOZJLwMFTXEFyo6bQH0Vj2pw/HaRf8Q5ksOU=
 =ClP/
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.12-rc6' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A fix for an old ceph ->fh_to_* bug from Luis and two timestamp fixups
  from Zheng, prompted by the ongoing y2038 work"

* tag 'ceph-for-4.12-rc6' of git://github.com/ceph/ceph-client:
  ceph: unify inode i_ctime update
  ceph: use current_kernel_time() to get request time stamp
  ceph: check i_nlink while converting a file handle to dentry
2017-06-18 08:23:02 +09:00
Linus Torvalds
adc311034c Changes since last update:
- Fix some bogus ASSERT failures on CONFIG_SMP=n and CONFIG_XFS_DEBUG=y.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCgAGBQJZQhB9AAoJEPh/dxk0SrTrMWsP/A1gSHmaS2UI3WvXdqwAxldy
 GiDZ3oSAhn4SF7BY+AhIu+Ot9kwCyOVzd/RZ9zVYmRjTTFxYX7P099uua2e/mf70
 Z1o9oSk9H76srqo7L76h7lI2ONsgd28aqh082hmDrRUhhwy7LZThVV15ZPZFfgU4
 8Q17h0uRUVgSn8M8INsuiMpw1sCLJsXw/9Rb9iFMgi3tSJaGZ1Mm//nA1aeS8HFH
 xCKHYw7YUtVKtIVyVV1NGdtXhXZbNznJXelkUZLMnMgOOmAqWiUa8FOGPNbQEezX
 1VidjzGhxRF7uoUJNnnX5mM+26c8/Ip4cLtqvnFQo30bx+HXO3OR8jywQO+6DD9Q
 NxFHY5D/Peud96xsnq5mRzNMN5fqumyVAyUCXSebT3Oa42HMdva+66s9JoFfDehi
 Z7VmRdoryxexDRbhiQVev4xe20beLtOHAP2I5JrnJddrXb0aFWowLk776wa0uxD8
 7cbKgovikqSHk4/mqpyZ5iQeaufg/kOi2cNaTcAFeCbvbXYieeRXVlisMBh1DKec
 lSX5e4kNNS20VVbCYakAttK69lpZzuraYPDDsnb4HRlmt0VX12lYyqQwklY/YZ9S
 jDagtKWKDm/L/jq2j5Nd3uSycM+lMaq97mIMjzPRrnPjOriME1ZGLwEQbKfBLXnW
 3Flzt5C2Hk1Fb/VlNx4S
 =U99d
 -----END PGP SIGNATURE-----

Merge tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fix from Darrick Wong:
 "One more bugfix for you for 4.12-rc6 to fix something that came up in
  an earlier rc:

   - Fix some bogus ASSERT failures on CONFIG_SMP=n and CONFIG_XFS_DEBUG=y"

* tag 'xfs-4.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: fix spurious spin_is_locked() assert failures on non-smp kernels
2017-06-17 17:34:41 +09:00
Linus Torvalds
c8636b90a0 Merge branch 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull ufs fixes from Al Viro:
 "Fix assorted ufs bugs: a couple of deadlocks, fs corruption in
  truncate(), oopsen on tail unpacking and truncate when racing with
  vmscan, mild fs corruption (free blocks stats summary buggered, *BSD
  fsck would complain and fix), several instances of broken logics
  around reserved blocks (starting with "check almost never triggers
  when it should" and then there are issues with sufficiently large
  UFS2)"

[ Note: ufs hasn't gotten any loving in a long time, because nobody
  really seems to use it. These ufs fixes are triggered by people
  actually caring now, not some sudden influx of new bugs.  - Linus ]

* 'ufs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  ufs_truncate_blocks(): fix the case when size is in the last direct block
  ufs: more deadlock prevention on tail unpacking
  ufs: avoid grabbing ->truncate_mutex if possible
  ufs_get_locked_page(): make sure we have buffer_heads
  ufs: fix s_size/s_dsize users
  ufs: fix reserved blocks check
  ufs: make ufs_freespace() return signed
  ufs: fix logics in "ufs: make fsck -f happy"
2017-06-17 17:30:07 +09:00
Linus Torvalds
ccd3d905f7 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs fixes from Al Viro:
 "A couple of fixes; a leak in mntns_install() caught by Andrei (this
  cycle regression) + d_invalidate() softlockup fix - that had been
  reported by a bunch of people lately, but the problem is pretty old"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: don't forget to put old mntns in mntns_install
  Hang/soft lockup in d_invalidate with simultaneous calls
2017-06-17 17:26:53 +09:00
Andrea Arcangeli
64c2b20301 userfaultfd: shmem: handle coredumping in handle_userfault()
Anon and hugetlbfs handle FOLL_DUMP set by get_dump_page() internally to
__get_user_pages().

shmem as opposed has no special FOLL_DUMP handling there so
handle_mm_fault() is invoked without mmap_sem and ends up calling
handle_userfault() that isn't expecting to be invoked without mmap_sem
held.

This makes handle_userfault() fail immediately if invoked through
shmem_vm_ops->fault during coredumping and solves the problem.

The side effect is a BUG_ON with no lock held triggered by the
coredumping process which exits.  Only 4.11 is affected, pre-4.11 anon
memory holes are skipped in __get_user_pages by checking FOLL_DUMP
explicitly against empty pagetables (mm/gup.c:no_page_table()).

It's zero cost as we already had a check for current->flags to prevent
futex to trigger userfaults during exit (PF_EXITING).

Link: http://lkml.kernel.org/r/20170615214838.27429-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: <stable@vger.kernel.org>	[4.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-17 06:37:05 +09:00
Linus Torvalds
ab2789b72d A fix from Nic for a race seen in production (including a stable tag).
And while I'm sending you this I'm also sneaking in a trivial new helper
 from Bart so that we don't need inter-tree dependencies for the next merge
 window.
 -----BEGIN PGP SIGNATURE-----
 
 iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAllDnicLHGhjaEBsc3Qu
 ZGUACgkQD55TZVIEUYOkPA//TMmDanqxLjjz12m9TiQoCjo/iFCtv9KpuJH/rdCz
 EnWK1GdGtWhR3Z1uk/Ss3zbBA/CwfUR/urVdc1P/aefLoVmsYOWQi1jsPHCHtFG6
 zkDYHr7qYqu91otaO0HgFrcOpuJe+LdbhwZndvUiTYJN8vNMRnQAnKdiEUEKmArq
 dBUj/H0JTbQwSXHZat2ZS9PwHsm7RGO+0qeixxc/HE730LF0TEwnteoy9jlu5d7U
 v1RZs9/zszmvQpWU34vPHCVH/sNfTMdVGPzc9+WNrOoxjM9vmhEOE0jTiclOcsCK
 sMAYHCG7woxkCPVZmxqgLx6P/9zZav6L2NZFPcT3z4jFq5Um+ugJ691f1oHaTq+L
 Bnn1DJdTl50wtMnb7yS1Uux+Y0OswKAXvDdC6NFPGJWwEnG41K3oL78Pq/vN7bKV
 ynKxRZciIsy/9S/Oyzp0oYV+l/cyScPVe/KfUN4zvIALi/mltMkAXYaZMEZDp7Vo
 w2TeJO7Nr3O75ghw/yCFHTWMAVbrTJg/ma1rkdUeekKYXix+4Bpr2XYqA3HHZCQY
 06pvIH+fZs1XshFlCs3RoWXvjdfjDgIO8zjrvSkTs8WUK4AxVNXIDtPDA6fpzcGz
 yZEehpdbPWPDvdd1C7TzEAi6lgOV/W5AsPUfk5KbLOaFzKWRe+FYtzDykGwamYeP
 Ov8=
 =NGL4
 -----END PGP SIGNATURE-----

Merge tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfs

Pull configfs updates from Christoph Hellwig:
 "A fix from Nic for a race seen in production (including a stable tag).

  And while I'm sending you this I'm also sneaking in a trivial new
  helper from Bart so that we don't need inter-tree dependencies for the
  next merge window"

* tag 'configfs-for-4.12' of git://git.infradead.org/users/hch/configfs:
  configfs: Introduce config_item_get_unless_zero()
  configfs: Fix race between create_link and configfs_rmdir
2017-06-16 18:45:47 +09:00
Christoph Hellwig
20223f0f39 fs: pass on flags in compat_writev
Fixes: 793b80ef14af ("vfs: pass a flags argument to vfs_readv/vfs_writev")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-06-16 18:40:51 +09:00
Andrei Vagin
4068367c9c fs: don't forget to put old mntns in mntns_install
Fixes: 4f757f3cbf54 ("make sure that mntns_install() doesn't end up with referral for root")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 06:53:05 -04:00
Al Viro
81be24d263 Hang/soft lockup in d_invalidate with simultaneous calls
It's not hard to trigger a bunch of d_invalidate() on the same
dentry in parallel.  They end up fighting each other - any
dentry picked for removal by one will be skipped by the rest
and we'll go for the next iteration through the entire
subtree, even if everything is being skipped.  Morevoer, we
immediately go back to scanning the subtree.  The only thing
we really need is to dissolve all mounts in the subtree and
as soon as we've nothing left to do, we can just unhash the
dentry and bugger off.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 06:52:09 -04:00
Linus Torvalds
54ed0f71f0 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
 "This fixes a bug on sparc where we may dereference freed stack memory"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: Work around deallocated stack frame reference gcc bug on sparc.
2017-06-15 17:54:51 +09:00
Al Viro
a8fad98483 ufs_truncate_blocks(): fix the case when size is in the last direct block
The logics when deciding whether we need to do anything with direct blocks
is broken when new size is within the last direct block.  It's better to
find the path to the last byte _not_ to be removed and use that instead
of the path to the beginning of the first block to be freed...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 03:57:46 -04:00
Al Viro
289dec5b89 ufs: more deadlock prevention on tail unpacking
->s_lock is not needed for ufs_change_blocknr()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 00:42:56 -04:00
Al Viro
09bf4f5b6e ufs: avoid grabbing ->truncate_mutex if possible
tail unpacking is done in a wrong place; the deadlocks galore
is best dealt with by doing that in ->write_iter() (and switching
to iomap, while we are at it), but that's rather painful to
backport.  The trouble comes from grabbing pages that cover
the beginning of tail from inside of ufs_new_fragments(); ongoing
pageout of any of those is going to deadlock on ->truncate_mutex
with process that got around to extending the tail holding that
and waiting for page to get unlocked, while ->writepage() on
that page is waiting on ->truncate_mutex.

The thing is, we don't need ->truncate_mutex when the fragment
we are trying to map is within the tail - the damn thing is
allocated (tail can't contain holes).

Let's do a plain lookup and if the fragment is present, we can
just pretend that we'd won the race in almost all cases.  The
only exception is a fragment between the end of tail and the
end of block containing tail.

Protect ->i_lastfrag with ->meta_lock - read_seqlock_excl() is
sufficient.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-15 00:41:18 -04:00
Al Viro
267309f394 ufs_get_locked_page(): make sure we have buffer_heads
callers rely upon that, but find_lock_page() racing with attempt of
page eviction by memory pressure might have left us with
	* try_to_free_buffers() successfully done
	* __remove_mapping() failed, leaving the page in our mapping
	* find_lock_page() returning an uptodate page with no
buffer_heads attached.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 23:32:19 -04:00
Al Viro
c596961d1b ufs: fix s_size/s_dsize users
For UFS2 we need 64bit variants; we even store them in uspi, but
use 32bit ones instead.  One wrinkle is in handling of reserved
space - recalculating it every time had been stupid all along, but
now it would become really ugly.  Just calculate it once...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 16:43:03 -04:00
Al Viro
b451cec4bb ufs: fix reserved blocks check
a) honour ->s_minfree; don't just go with default (5)
b) don't bother with capability checks until we know we'll need them

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 15:46:05 -04:00
Al Viro
fffd70f588 ufs: make ufs_freespace() return signed
as it is, checking that its return value is <= 0 is useless and
that's how it's being used.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 15:36:31 -04:00
Al Viro
96ecff1422 ufs: fix logics in "ufs: make fsck -f happy"
Storing stats _only_ at new locations is wrong for UFS1; old
locations should always be kept updated.  The check for "has
been converted to use of new locations" is also wrong - it
should be "->fs_maxbsize is equal to ->fs_bsize".

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-06-14 15:17:32 -04:00
Yan, Zheng
4ca2fea6f8 ceph: unify inode i_ctime update
Current __ceph_setattr() can set inode's i_ctime to current_time(),
req->r_stamp or attr->ia_ctime. These time stamps may have minor
differences. It may cause potential problem.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-06-14 19:37:23 +02:00
Yan, Zheng
56199016e8 ceph: use current_kernel_time() to get request time stamp
ceph uses ktime_get_real_ts() to get request time stamp. In most
other cases, current_kernel_time() is used to get time stamp for
filesystem operations (called by current_time()).

There is granularity difference between ktime_get_real_ts() and
current_kernel_time(). The later one can be up to one jiffy behind
the former one. This can causes inode's ctime to go back.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-06-14 19:33:23 +02:00
Luis Henriques
03f219041f ceph: check i_nlink while converting a file handle to dentry
Converting a file handle to a dentry can be done call after the inode
unlink.  This means that __fh_to_dentry() requires an extra check to
verify the number of links is not 0.

The issue can be easily reproduced using xfstest generic/426, which does
something like:

    name_to_handle_at(&fh)
    echo 3 > /proc/sys/vm/drop_caches
    unlink()
    open_by_handle_at(&fh)

The call to open_by_handle_at() should fail, as the file doesn't exist
anymore.

Link: http://tracker.ceph.com/issues/19958
Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2017-06-14 19:32:43 +02:00
Bart Van Assche
19e72d3abb configfs: Introduce config_item_get_unless_zero()
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
[hch: minor style tweak]
Signed-off-by: Christoph Hellwig <hch@lst.de>
2017-06-12 13:20:20 +02:00
Nicholas Bellinger
ba80aa909c configfs: Fix race between create_link and configfs_rmdir
This patch closes a long standing race in configfs between
the creation of a new symlink in create_link(), while the
symlink target's config_item is being concurrently removed
via configfs_rmdir().

This can happen because the symlink target's reference
is obtained by config_item_get() in create_link() before
the CONFIGFS_USET_DROPPING bit set by configfs_detach_prep()
during configfs_rmdir() shutdown is actually checked..

This originally manifested itself on ppc64 on v4.8.y under
heavy load using ibmvscsi target ports with Novalink API:

[ 7877.289863] rpadlpar_io: slot U8247.22L.212A91A-V1-C8 added
[ 7879.893760] ------------[ cut here ]------------
[ 7879.893768] WARNING: CPU: 15 PID: 17585 at ./include/linux/kref.h:46 config_item_get+0x7c/0x90 [configfs]
[ 7879.893811] CPU: 15 PID: 17585 Comm: targetcli Tainted: G           O 4.8.17-customv2.22 #12
[ 7879.893812] task: c00000018a0d3400 task.stack: c0000001f3b40000
[ 7879.893813] NIP: d000000002c664ec LR: d000000002c60980 CTR: c000000000b70870
[ 7879.893814] REGS: c0000001f3b43810 TRAP: 0700   Tainted: G O     (4.8.17-customv2.22)
[ 7879.893815] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28222242  XER: 00000000
[ 7879.893820] CFAR: d000000002c664bc SOFTE: 1
                GPR00: d000000002c60980 c0000001f3b43a90 d000000002c70908 c0000000fbc06820
                GPR04: c0000001ef1bd900 0000000000000004 0000000000000001 0000000000000000
                GPR08: 0000000000000000 0000000000000001 d000000002c69560 d000000002c66d80
                GPR12: c000000000b70870 c00000000e798700 c0000001f3b43ca0 c0000001d4949d40
                GPR16: c00000014637e1c0 0000000000000000 0000000000000000 c0000000f2392940
                GPR20: c0000001f3b43b98 0000000000000041 0000000000600000 0000000000000000
                GPR24: fffffffffffff000 0000000000000000 d000000002c60be0 c0000001f1dac490
                GPR28: 0000000000000004 0000000000000000 c0000001ef1bd900 c0000000f2392940
[ 7879.893839] NIP [d000000002c664ec] config_item_get+0x7c/0x90 [configfs]
[ 7879.893841] LR [d000000002c60980] check_perm+0x80/0x2e0 [configfs]
[ 7879.893842] Call Trace:
[ 7879.893844] [c0000001f3b43ac0] [d000000002c60980] check_perm+0x80/0x2e0 [configfs]
[ 7879.893847] [c0000001f3b43b10] [c000000000329770] do_dentry_open+0x2c0/0x460
[ 7879.893849] [c0000001f3b43b70] [c000000000344480] path_openat+0x210/0x1490
[ 7879.893851] [c0000001f3b43c80] [c00000000034708c] do_filp_open+0xfc/0x170
[ 7879.893853] [c0000001f3b43db0] [c00000000032b5bc] do_sys_open+0x1cc/0x390
[ 7879.893856] [c0000001f3b43e30] [c000000000009584] system_call+0x38/0xec
[ 7879.893856] Instruction dump:
[ 7879.893858] 409d0014 38210030 e8010010 7c0803a6 4e800020 3d220000 e94981e0 892a0000
[ 7879.893861] 2f890000 409effe0 39200001 992a0000 <0fe00000> 4bffffd0 60000000 60000000
[ 7879.893866] ---[ end trace 14078f0b3b5ad0aa ]---

To close this race, go ahead and obtain the symlink's target
config_item reference only after the existing CONFIGFS_USET_DROPPING
check succeeds.

This way, if configfs_rmdir() wins create_link() will return -ENONET,
and if create_link() wins configfs_rmdir() will return -EBUSY.

Reported-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Tested-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
2017-06-12 13:20:10 +02:00
Linus Torvalds
5e38b72ac1 Fix various bug fixes in ext4 caused by races and memory allocation
failures.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEK2m5VNv+CHkogTfJ8vlZVpUNgaMFAlk9it4ACgkQ8vlZVpUN
 gaNE2wgAiuo9pHO5cUzRhP5HG8Jz4vE2l6mpuPq2JbhT+xGFbvjX+UhA+DN6Cw35
 EK+MnBC6h2wFoQrEOLbavNWc94nb6HZWw2riheK4sst80hBpeclwInCVCw1DLmu6
 +Nx8fzXVqjSf57Qi8CR09AGovSFBgfAJFi8aJTe8KXaSRPx48bWFxK6glgIG5gRw
 VEOyEg5kPRYoUNA8ewsunC67V32ljYF2IZSzTTlrcqaLsXi+SWeiJ2hceYfgI0qz
 fZB1EFBvmGBsdsFM2SHiJpME6fzMEQqx7oTyNC8DhCxMRVd29WjxeqCwcU4nV0Q5
 jJaCKJftJ/q/eLKn7ksezhoL0A96BA==
 =Z9eH
 -----END PGP SIGNATURE-----

Merge tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4

Pull ext4 fixes from Ted Ts'o:
 "Fix various bug fixes in ext4 caused by races and memory allocation
  failures"

* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix fdatasync(2) after extent manipulation operations
  ext4: fix data corruption for mmap writes
  ext4: fix data corruption with EXT4_GET_BLOCKS_ZERO
  ext4: fix quota charging for shared xattr blocks
  ext4: remove redundant check for encrypted file on dio write path
  ext4: remove unused d_name argument from ext4_search_dir() et al.
  ext4: fix off-by-one error when writing back pages before dio read
  ext4: fix off-by-one on max nr_pages in ext4_find_unwritten_pgoff()
  ext4: keep existing extra fields when inode expands
  ext4: handle the rest of ext4_mb_load_buddy() ENOMEM errors
  ext4: fix off-by-in in loop termination in ext4_find_unwritten_pgoff()
  ext4: fix SEEK_HOLE
  jbd2: preserve original nofs flag during journal restart
  ext4: clear lockdep subtype for quota files on quota off
2017-06-11 11:57:47 -07:00