linux

mirror of https://github.com/FEX-Emu/linux.git synced 2025-01-01 14:52:32 +00:00

Author	SHA1	Message	Date
Filipe Manana	a2cc11db24	Btrfs: fix directory recovery from fsync log When replaying a directory from the fsync log, if a directory entry exists both in the fs/subvol tree and in the log, the directory's inode got its i_size updated incorrectly, accounting for the dentry's name twice. Reproducer, from a test for xfstests: _scratch_mkfs >> $seqres.full 2>&1 _init_flakey _mount_flakey touch $SCRATCH_MNT/foo sync touch $SCRATCH_MNT/bar xfs_io -c "fsync" $SCRATCH_MNT xfs_io -c "fsync" $SCRATCH_MNT/bar _load_flakey_table $FLAKEY_DROP_WRITES _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES _mount_flakey [ -f $SCRATCH_MNT/foo ] \|\| echo "file foo is missing" [ -f $SCRATCH_MNT/bar ] \|\| echo "file bar is missing" _unmount_flakey _check_scratch_fs $FLAKEY_DEV The filesystem check at the end failed with the message: "root 5 root dir 256 error". A test case for xfstests follows. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:27 -07:00
Liu Bo	25ce459c1a	Btrfs: fix loop writing of async reclaim One of my tests shows that when we really don't have space to reclaim via flush_space and also run out of space, this async reclaim work loops on adding itself into the workqueue and keeps writing something to disk according to iostat's results, and these writes mainly comes from commit_transaction which writes super_block. This's unacceptable as it can be bad to disks, especially memeory storages. This adds a check to avoid the above situation. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:25 -07:00
Josef Bacik	dc046b10c8	Btrfs: make fiemap not blow when you have lots of snapshots We have been iterating all references for each extent we have in a file when we do fiemap to see if it is shared. This is fine when you have a few clones or a few snapshots, but when you have 5k snapshots suddenly fiemap just sits there and stares at you. So add btrfs_check_shared which will use the backref walking code but will short circuit as soon as it finds a root or inode that doesn't match the one we currently have. This makes fiemap on my testbox go from looking at me blankly for a day to spitting out actual output in a reasonable amount of time. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:24 -07:00
Filipe Manana	78a017a2c9	Btrfs: add missing compression property remove in btrfs_ioctl_setflags The behaviour of a 'chattr -c' consists of getting the current flags, clearing the FS_COMPR_FL bit and then sending the result to the set flags ioctl - this means the bit FS_NOCOMP_FL isn't set in the flags passed to the ioctl. This results in the compression property not being cleared from the inode - it was cleared only if the bit FS_NOCOMP_FL was set in the received flags. Reproducer: $ mkfs.btrfs -f /dev/sdd $ mount /dev/sdd /mnt && cd /mnt $ mkdir a $ chattr +c a $ touch a/file $ lsattr a/file --------c------- a/file $ chattr -c a $ touch a/file2 $ lsattr a/file2 --------c------- a/file2 $ lsattr -d a ---------------- a Reported-by: Andreas Schneider <asn@cryptomilk.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:23 -07:00
Qu Wenruo	12b894cb28	btrfs: Fix a deadlock in btrfs_dev_replace_finishing() btrfs-transacion:5657 [stack snip] btrfs_bio_map() btrfs_bio_counter_inc_blocked() percpu_counter_inc(&fs_info->bio_counter) ###bio_counter > 0(A) __btrfs_bio_map() btrfs_dev_replace_lock() mutex_lock(dev_replace->lock) ###wait mutex(B) btrfs:32612 [stack snip] btrfs_dev_replace_start() btrfs_dev_replace_lock() mutex_lock(dev_replace->lock) ###hold mutex(B) btrfs_dev_replace_finishing() btrfs_rm_dev_replace_blocked() wait until percpu_counter_sum == 0 ###wait on bio_counter(A) This bug can be triggered quite easily by the following test script: http://pastebin.com/MQmb37Cy This patch will fix the ABBA problem by calling btrfs_dev_replace_unlock() before btrfs_rm_dev_replace_blocked(). The consistency of btrfs devices list and their superblocks is protected by device_list_mutex, not btrfs_dev_replace_lock/unlock(). So it is safe the move btrfs_dev_replace_unlock() before btrfs_rm_dev_replace_blocked(). Reported-by: Zhao Lei <zhaolei@cn.fujitsu.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Cc: Stefan Behrens <sbehrens@giantdisaster.de> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:22 -07:00
Liu Bo	a583c02664	Btrfs: cleanup the same name in end_bio_extent_readpage We've defined a 'offset' out of bio_for_each_segment_all. This is just a clean rename, no function changes. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:20 -07:00
Mark Fasheh	0b4699dcb6	btrfs: don't go readonly on existing qgroup items btrfs_drop_snapshot() leaves subvolume qgroup items on disk after completion. This can cause problems with snapshot creation. If a new snapshot tries to claim the deleted subvolumes id, btrfs will get -EEXIST from add_qgroup_item() and go read-only. The following commands will reproduce this problem (assume btrfs is on /dev/sda and is mounted at /btrfs) mkfs.btrfs -f /dev/sda mount -t btrfs /dev/sda /btrfs/ btrfs quota enable /btrfs/ btrfs su sna /btrfs/ /btrfs/snap btrfs su de /btrfs/snap sleep 45 umount /btrfs/ mount -t btrfs /dev/sda /btrfs/ We can fix this by catching -EEXIST in add_qgroup_item() and initializing the existing items. We have the problem of orphaned relation items being on disk from an old snapshot but that is outside the scope of this patch. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:19 -07:00
Filipe Manana	2a39e59802	Btrfs: shrink further sizeof(struct extent_buffer) The map_start and map_len fields aren't used anywhere, so just remove them. On a x86_64 system, this reduced sizeof(struct extent_buffer) from 296 bytes to 280 bytes, and therefore 14 extent_buffer structs can now fit into a page instead of 13. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:17 -07:00
Filipe Manana	4395e0c4da	Btrfs: send, lower mem requirements for processing xattrs Maximum xattr size can be up to nearly the leaf size. For an fs with a leaf size larger than the page size, using kmalloc requires allocating multiple pages that are contiguous, which might not be possible if there's heavy memory fragmentation. Therefore fallback to vmalloc if we fail to allocate with kmalloc. Also start with a smaller buffer size, since xattr values typically are smaller than a page. Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:16 -07:00
David Sterba	f87c4318af	btrfs: remove stale define after removing ordered operations Last user removed in commit "btrfs: disable strict file flushes for renames and truncates" (`8d875f95da`). Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:15 -07:00
Filipe Manana	2000552396	Btrfs: improve free space cache management and space allocation While under random IO, a block group's free space cache eventually reaches a state where it has a mix of extent entries and bitmap entries representing free space regions. As later free space regions are returned to the cache, some of them are merged with existing extent entries if they are contiguous with them. But others are not merged, because despite the existence of adjacent free space regions in the cache, the merging doesn't happen because the existing free space regions are represented in bitmap extents. Even when new free space regions are merged with existing extent entries (enlarging the free space range they represent), we create chances of having after an enlarged region that is contiguous with some other region represented in a bitmap entry. Both clustered and non-clustered space allocation work by iterating over our extent and bitmap entries and skipping any that represents a region smaller then the allocation request (and giving preference to extent entries before bitmap entries). By having a contiguous free space region that is represented by 2 (or more) entries (mix of extent and bitmap entries), we end up not satisfying an allocation request with a size larger than the size of any of the entries but no larger than the sum of their sizes. Making the caller assume we're under a ENOSPC condition or force it to allocate multiple smaller space regions (as we do for file data writes), which adds extra overhead and more chances of causing fragmentation due to the smaller regions being all spread apart from each other (more likely when under concurrency). For example, if we have the following in the cache: * extent entry representing free space range: [128Mb - 256Kb, 128Mb[ * bitmap entry covering the range [128Mb, 256Mb[, but only with the bits representing the range [128Mb, 128Mb + 768Kb[ set - that is, only that space in this 128Mb area is marked as free An allocation request for 1Mb, starting at offset not greater than 128Mb - 256Kb, would fail before, despite the existence of such contiguous free space area in the cache. The caller could only allocate up to 768Kb of space at once and later another 256Kb (or vice-versa). In between each smaller allocation request, another task working on a different file/inode might come in and take that space, preventing the former task of getting a contiguous 1Mb region of free space. Therefore this change implements the ability to move free space from bitmap entries into existing and new free space regions represented with extent entries. This is done when a space region is added to the cache. A test was added to the sanity tests that explains in detail the issue too. Some performance test results with compilebench on a 4 cores machine, with 32Gb of ram and using an HDD follow. Test: compilebench -D /mnt -i 30 -r 1000 --makej Before this change: intial create total runs 30 avg 69.02 MB/s (user 0.28s sys 0.57s) compile total runs 30 avg 314.96 MB/s (user 0.12s sys 0.25s) read compiled tree total runs 3 avg 27.14 MB/s (user 1.52s sys 0.90s) delete compiled tree total runs 30 avg 3.14 seconds (user 0.15s sys 0.66s) After this change: intial create total runs 30 avg 68.37 MB/s (user 0.29s sys 0.55s) compile total runs 30 avg 382.83 MB/s (user 0.12s sys 0.24s) read compiled tree total runs 3 avg 27.82 MB/s (user 1.45s sys 0.97s) delete compiled tree total runs 30 avg 3.18 seconds (user 0.17s sys 0.65s) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:13 -07:00
Anand Jain	3c1dbdf54a	btrfs: rename total_bytes to avoid confusion we are assigning number_devices to the total_bytes, that's very confusing for a moment Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:12 -07:00
Anand Jain	de4c296f63	btrfs: fix typo in the log message there is no matching open parenthesis for the closing parenthesis Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:11 -07:00
Anand Jain	b2efedca68	btrfs: rw_devices shouldn't be incremented for seed fs in btrfs_rm_dev_replace_srcdev() seed fs devices don't participate as rw_device, so don't increment rw_devices when the device being handled belongs to a seed fs. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:10 -07:00
Anand Jain	8bef8401a0	btrfs: fix memory leak when there is no more seed device When we replace all the seed device in the system there is no point in just keeping the btrfs_fs_devices with out any device Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:09 -07:00
Anand Jain	94d5f0c2ae	btrfs: update sprout seed pointer when seed fs is relinquished We are not updating sprout fs seed pointer when all seed device is replaced. This patch will check if all seed device has been replaced and then update the sprout pointer accordingly. Same reproducer as in the previous patch would apply here. And notice that btrfs_close_device will check if seed fs is present and spits out the error with out this patch. int btrfs_close_devices(struct btrfs_fs_devices *fs_devices) { :: seed_devices = fs_devices->seed; :: while (seed_devices) { fs_devices = seed_devices; seed_devices = fs_devices->seed; __btrfs_close_devices(fs_devices); free_fs_devices(fs_devices); } Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:08 -07:00
Anand Jain	63dd86fa79	btrfs: fix rw_devices miss match after seed replace reproducer: reproducer: mount /dev/sdb /btrfs btrfs dev add /dev/sdc /btrfs btrfs rep start -B /dev/sdb /dev/sdd /btrfs umount /btrfs WARNING: CPU: 0 PID: 3882 at fs/btrfs/volumes.c:892 __btrfs_close_devices+0x1c8/0x200 [btrfs]() which is WARN_ON(fs_devices->rw_devices); The problem here is that we did not add one to the rw_devices when we replace the seed device with a writable device. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:06 -07:00
Anand Jain	25e8e9113d	btrfs: replace seed device followed by unmount causes kernel WARNING reproducer: mount /dev/sdb /btrfs btrfs dev add /dev/sdc /btrfs btrfs rep start -B /dev/sdb /dev/sdd /btrfs umount /btrfs WARNING: CPU: 0 PID: 12661 at fs/btrfs/volumes.c:891 __btrfs_close_devices+0x1b0/0x200 [btrfs]() :: __btrfs_close_devices() :: WARN_ON(fs_devices->open_devices); After the seed device has been replaced the new target device is no more a seed device. So we need to update the device numbers in the fs_devices as pointed by the fs_info. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:05 -07:00
Anand Jain	d51908ce4e	btrfs: preparatory to make btrfs_rm_dev_replace_srcdev() seed aware There is no logical change in this patch, just a preparatory patch, so that changes can be easily reasoned. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:04 -07:00
Andrey Utkin	56094eecd3	btrfs: Drop stray check of fixup_workers creation The issue was introduced in `a79b7d4b3e`, adding allocation of extent_workers, so this stray check is surely not meant to be a check of something else. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=82021 Reported-by: Maks Naumov <maksqwe1@ukr.net> Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:03 -07:00
Filipe Manana	f98de9b9c0	Btrfs: make btrfs_search_forward return with nodes unlocked None of the uses of btrfs_search_forward() need to have the path nodes (level >= 1) read locked, only the leaf needs to be locked while the caller processes it. Therefore make it return a path with all nodes unlocked, except for the leaf. This change is motivated by the observation that during a file fsync we repeatdly call btrfs_search_forward() and process the returned leaf while upper nodes of the returned path (level >= 1) are read locked, which unnecessarily blocks other tasks that want to write to the same fs/subvol btree. Therefore instead of modifying the fsync code to unlock all nodes with level >= 1 immediately after calling btrfs_search_forward(), change btrfs_search_forward() to do it, so that it benefits all callers. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:02 -07:00
Anand Jain	79aec2b80d	btrfs: sysfs label interface should check for read only FS Not sure how this escaped many eyes so far Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:38:01 -07:00
Anand Jain	20ee0825ec	btrfs: code optimize: BTRFS_ATTR_RW could set the mode BTRFS_ATTR_RW could set the mode and be inline with BTRFS_ATTR Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:59 -07:00
Anand Jain	98b3d389eb	btrfs: code optimize: BTRFS_ATTR could handle the mode All that uses BTRFS_ATTR want mode to be set at 0444 so just do it at the define. And few spacing alignments. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:58 -07:00
Anand Jain	3f4b57e09d	btrfs: use BTRFS_ATTR instead of btrfs_no_store() we have BTRFS_ATTR define to create sysfs RO file, use that. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:57 -07:00
Filipe Manana	160f4089c8	Btrfs: avoid unnecessary switch of path locks to blocking mode If we need to cow a node, increase the write lock level and retry the tree search, there's no point of changing the node locks in our path to blocking mode, as we only waste time and unnecessarily wake up other tasks waiting on the spinning locks (just to block them again shortly after) because we release our path before repeating the tree search. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:56 -07:00
Filipe Manana	24cdc847d9	Btrfs: unlock nodes earlier when inserting items in a btree In ctree.c:setup_items_for_insert(), we can unlock all nodes in our path before we process the leaf (shift items and data, adjust data offsets, etc). This allows for better btree concurrency, as we're often holding a write lock on at least the node at level 1. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:55 -07:00
Satoru Takeuchi	d1b00a4711	btrfs: use IS_ALIGNED() for assertion in btrfs_lookup_csums_range() for simplicity btrfs_lookup_csums_range() uses ALIGN() to check if "start" and "end + 1" are aligned to "root->sectorsize". It's better to replace these with IS_ALIGNED() for simplicity. Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:54 -07:00
Mark Fasheh	d3982100ba	btrfs: add trace for qgroup accounting We want this to debug qgroup changes on live systems. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:50 -07:00
Miao Xie	443f24fee7	Btrfs: cleanup unused latest_devid and latest_trans in fs_devices The member variants - latest_devid and latest_trans - of fs_devices structure are set, but no one use them to do anything. so remove them. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:49 -07:00
Miao Xie	6ba40b615f	Btrfs: update the comment of total_bytes and disk_total_bytes of btrfs_devie Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:48 -07:00
Miao Xie	addc3fa74e	Btrfs: Fix the problem that the dirty flag of dev stats is cleared The io error might happen during writing out the device stats, and the device stats information and dirty flag would be update at that time, but the current code didn't consider this case, just clear the dirty flag, it would cause that we forgot to write out the new device stats information. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:46 -07:00
Miao Xie	d5ee37bcb1	Btrfs: make the device lock and its protected data in the same cacheline The lock in btrfs_device structure was far away from its protected data, it would make CPU load the cache line twice when we accessed them, move them together. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:45 -07:00
Miao Xie	5f546063ce	Btrfs: fix wrong generation check of super block on a seed device The super block generation of the seed devices is not the same as the filesystem which sprouted from them because we don't update the super block on the seed devices when we change that new filesystem. So we should not use the generation of that new filesystem to check the super block generation on the seed devices, Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:44 -07:00
Miao Xie	17a9be2f28	Btrfs: fix wrong fsid check of scrub All the metadata in the seed devices has the same fsid as the fsid of the seed filesystem which is on the seed device, so we should check them by the current filesystem. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:43 -07:00
David Sterba	2fad4e83e1	btrfs: wake up transaction thread from SYNC_FS ioctl The transaction thread may want to do more work, namely it pokes the cleaner ktread that will start processing uncleaned subvols. This can be triggered by user via the 'btrfs fi sync' command, otherwise there was a delay up to 30 seconds before the cleaner started to clean old snapshots. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:42 -07:00
Wang Shilong	c01a5c074c	Btrfs: fix wrong max inline data size limit inline data is stored from offset of @disk_bytenr in struct btrfs_file_extent_item. So substracting total size of struct btrfs_file_extent_item is wrong, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:40 -07:00
Wang Shilong	354877befa	Btrfs: fix off-by-one in cow_file_range_inline() Btrfs could still inline file data if its size is same as page size, so don't skip max value here. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:39 -07:00
Wang Shilong	7816030eb4	Btrfs: fall into nocompression codes quickly if possible If flag NOCOMPRESS is set which means bad compression ratio, we could avoid call cow_file_range_async() for this case earlier. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:38 -07:00
Wang Shilong	f79707b092	Btrfs: fix wrong skipping compression for an inode If a file's compression ratios is bad, we will set NOCOMPRESS flag for it, and it will skip compression for that inode next time. However, if we remount fs to COMPRESS_FORCE, it still should try if we could compress pages for that inode, this patch fix wrong check for this problem. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:36 -07:00
Fabian Frederick	d447d0da44	Btrfs: fix sparse warning Fix the following sparse warning: fs/btrfs/send.c:518:51: warning: incorrect type in argument 2 (different address spaces) fs/btrfs/send.c:518:51: expected char const [noderef] <asn:1><noident> fs/btrfs/send.c:518:51: got char We can safely use (const char __user *) with set_fs(KERNEL_DS) __force added to avoid sparse-all warning: fs/btrfs/send.c:518:40: warning: cast adds address space to expression (<asn:1>) Signed-off-by: Fabian Frederick <fabf@skynet.be> Reviewed-by: Zach Brown <zab@zabbo.net> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:35 -07:00
HIMANGI SARAOGI	14586651ed	Btrfs: use BUG_ON Use BUG_ON(x) rather than if(x) BUG(); The semantic patch that fixes this problem is as follows: // <smpl> @@ identifier x; @@ -if (x) BUG(); +BUG_ON(x); // </smpl> Signed-off-by: Himangi Saraogi <himangi774@gmail.com> Acked-by: Julia Lawall <julia.lawall@lip6.fr> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:34 -07:00
Sergey Senozhatsky	7880991344	btrfs compression: merge inflate and deflate z_streams `struct workspace' used for zlib compression contains two zlib z_stream-s: `def_strm' used in zlib_compress_pages(), and `inf_strm' used in zlib_decompress/zlib_decompress_biovec(). None of these functions use `inf_strm' and `def_strm' simultaniously, meaning that for every compress/decompress operation we need only one z_stream (out of two available). `inf_strm' and `def_strm' are different in size of ->workspace. For inflate stream we vmalloc() zlib_inflate_workspacesize() bytes, for deflate stream - zlib_deflate_workspacesize() bytes. On my system zlib returns the following workspace sizes, correspondingly: 42312 and 268104 (+ guard pages). Keep only one `z_stream' in `struct workspace' and use it for both compression and decompression. Hence, instead of vmalloc() of two z_stream->worskpace-s, allocate only one of size: max(zlib_deflate_workspacesize(), zlib_inflate_workspacesize()) Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:33 -07:00
Filipe Manana	555e128640	Btrfs: set error return value in btrfs_get_blocks_direct We were returning with 0 (success) because we weren't extracting the error code from em (PTR_ERR(em)). Fix it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:32 -07:00
Filipe Manana	27a3507de9	Btrfs: reduce size of struct extent_state The tree field of struct extent_state was only used to figure out if an extent state was connected to an inode's io tree or not. For this we can just use the rb_node field itself. On a x86_64 system with this change the sizeof(struct extent_state) is reduced from 96 bytes down to 88 bytes, meaning that with a page size of 4096 bytes we can now store 46 extent states per page instead of 42. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:30 -07:00
Fabian Frederick	6f84e23646	btrfs: use PTR_ERR_OR_ZERO replace IS_ERR/PTR_ERR Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: linux-btrfs@vger.kernel.org Signed-off-by: Fabian Frederick <fabf@skynet.be> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:29 -07:00
Wang Shilong	29549aec76	Btrfs: print btrfs specific info for some fatal error cases Marc argued that if there are several btrfs filesystems mounted, while users even don't know which filesystem hit the corrupted errors something like generation verification failure. Since @extent_buffer structure has a member @fs_info, let's output btrfs device info. Reported-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:28 -07:00
Miao Xie	d20983b40e	Btrfs: fix writing data into the seed filesystem If we mounted a seed filesystem with degraded option, and then added a new device into the seed filesystem, then we found adding device failed because of the IO failure. Steps to reproduce: # mkfs.btrfs -d raid1 -m raid1 <dev0> <dev1> # btrfstune -S 1 <dev0> # mount <dev0> -o degraded <mnt> # btrfs device add -f <dev2> <mnt> It is because the original didn't set the chunk on the seed device to be read-only if the degraded flag was set. It was introduced by patch `f48b90756`, which fixed the problem the raid1 filesystem became read-only after one device of it was missing. But this fix method was not right, we should set the read-only flag according to the number of the missing devices, not the degraded mount option, if the number of the missing devices is less than the max error number that the profile of the chunk tolerates, we don't set it to be read-only. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:27 -07:00
Wang Shilong	47059d930f	Btrfs: make defragment work with nodatacow option Btrfs defragment will utilize COW feature, which means this did not work for nodatacow option, this problem was detected by xfstests generic/018 with nodatacow mount option. Fix this problem by forcing cow for a extent with state @EXTETN_DEFRAG setting. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:26 -07:00
Satoru Takeuchi	48fcc3ff7d	btrfs: label should not contain return char Rediffed remaining parts of original patch from Anand Jain. This makes sure to avoid trailing newlines in the btrfs label output reproducer.sh: =============================================================================== TEST_DEV=/dev/vdb TEST_DIR=/home/sat/mnt umount /home/sat/mnt mkfs.btrfs -f $TEST_DEV UUID=$(btrfs fi show $TEST_DEV \| head -1 \| sed -e 's/.uuid: $[-0-9a-z]$$/\1/') mount $TEST_DEV $TEST_DIR LABELFILE=/sys/fs/btrfs/$UUID/label echo "Test for empty label..." >&2 LINES="$(cat $LABELFILE \| wc -l \| awk '{print $1}')" RET=0 if [ $LINES -eq 0 ] ; then echo '[PASS] Trailing \n is removed correctly.' >&2 else echo '[FAIL] Trailing \n still exists.' >&2 RET=1 fi echo "Test for non-empty label..." >&2 echo testlabel >$LABELFILE LINES="$(cat $LABELFILE \| wc -l \| awk '{print $1}')" if [ $LINES -eq 1 ] ; then echo '[PASS] Trailing \n is removed correctly.' >&2 else echo '[FAIL] Trailing \n still exists.' >&2 RET=1 fi exit $RET =============================================================================== Signed-off-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:25 -07:00
Anand Jain	ec95d4917b	btrfs: device delete must be sysloged as in the disk add patch, disk detached from the volume must be recorded in the syslog as well for the same reason. Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:23 -07:00
Anand Jain	43d2076168	btrfs: device add must be sysloged when we add a new disk to the mounted btrfs we don't record it as of now, disk add is a critical change of btrfs configuration, it must be recorded in the syslog to help offline investigations of customer problems when reported. Signed-off-by: Anand Jain <Anand.Jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:20 -07:00
Wang Shilong	4027e0f4c4	Btrfs: clear compress-force when remounting with compress option Steps to reproduce: # mkfs.btrfs -f /dev/sdb # mount /dev/sdb /mnt -o compress-force=lzo # mount /dev/sdb /mnt -o remount,compress=zlib # cat /proc/mounts Remounting from compress-force to compress could not clear compress-force option. The problem is there is no way for users to clear compress-force option separately. Fix this problem by clearing @FORCE_COMPRESS flag when remounting to compress=xxx. Suggested-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:19 -07:00
David Sterba	ed6078f703	btrfs: use DIV_ROUND_UP instead of open-coded variants The form (value + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT is equivalent to (value + PAGE_CACHE_SIZE - 1) / PAGE_CACHE_SIZE The rest is a simple subsitution, no difference in the generated assembly code. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:17 -07:00
David Sterba	4e54b17ad6	btrfs: clean away stripe_align helper Only wraps the ALIGN macro. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:16 -07:00
David Sterba	707e8a0715	btrfs: use nodesize everywhere, kill leafsize The nodesize and leafsize were never of different values. Unify the usage and make nodesize the one. Cleanup the redundant checks and helpers. Shaves a few bytes from .text: text data bss dec hex filename 852418 24560 23112 900090 dbbfa btrfs.ko.before 851074 24584 23112 898770 db6d2 btrfs.ko.after Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:14 -07:00
David Sterba	962a298f35	btrfs: kill the key type accessor helpers btrfs_set_key_type and btrfs_key_type are used inconsistently along with open coded variants. Other members of btrfs_key are accessed directly without any helpers anyway. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:12 -07:00
David Sterba	3abdbd780e	btrfs: make close_ctree return void There's no user of the return value and we can get rid of the comment in put_super. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:11 -07:00
David Sterba	57cdc8db21	btrfs: cleanup ino cache members of btrfs_root The naming is confusing, generic yet used for a specific cache. Add a prefix 'ino_' or rename appropriately. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:09 -07:00
David Sterba	c6f83c74fd	btrfs: clenaup: don't call btrfs_release_path before free_path Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:08 -07:00
David Sterba	32471dc2ba	btrfs: remove obsolete comment in btrfs_clean_one_deleted_snapshot The comment applied when there was a BUG_ON. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-17 13:37:07 -07:00
Linus Torvalds	7ed641be75	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "Filipe is doing a careful pass through fsync problems, and these are the fixes so far. I'll have one more for rc6 that we're still testing. My big commit is fixing up some inode hash races that Al Viro found (thanks Al)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: use insert_inode_locked4 for inode creation Btrfs: fix fsync data loss after a ranged fsync Btrfs: kfree()ing ERR_PTRs Btrfs: fix crash while doing a ranged fsync Btrfs: fix corruption after write/fsync failure + fsync + log recovery Btrfs: fix autodefrag with compression	2014-09-12 11:53:30 -07:00
Chris Mason	b0d5d10f41	Btrfs: use insert_inode_locked4 for inode creation Btrfs was inserting inodes into the hash table before we had fully set the inode up on disk. This leaves us open to rare races that allow two different inodes in memory for the same [root, inode] pair. This patch fixes things by using insert_inode_locked4 to insert an I_NEW inode and unlock_new_inode when we're ready for the rest of the kernel to use the inode. It also makes sure to init the operations pointers on the inode before going into the error handling paths. Signed-off-by: Chris Mason <clm@fb.com> Reported-by: Al Viro <viro@zeniv.linux.org.uk>	2014-09-08 13:56:45 -07:00
Filipe Manana	49dae1bc1c	Btrfs: fix fsync data loss after a ranged fsync While we're doing a full fsync (when the inode has the flag BTRFS_INODE_NEEDS_FULL_SYNC set) that is ranged too (covers only a portion of the file), we might have ordered operations that are started before or while we're logging the inode and that fall outside the fsync range. Therefore when a full ranged fsync finishes don't remove every extent map from the list of modified extent maps - as for some of them, that fall outside our fsync range, their respective ordered operation hasn't finished yet, meaning the corresponding file extent item wasn't inserted into the fs/subvol tree yet and therefore we didn't log it, and we must let the next fast fsync (one that checks only the modified list) see this extent map and log a matching file extent item to the log btree and wait for its ordered operation to finish (if it's still ongoing). A test case for xfstests follows. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-08 13:56:43 -07:00
Dan Carpenter	c47ca32d3a	Btrfs: kfree()ing ERR_PTRs The "inherit" in btrfs_ioctl_snap_create_v2() and "vol_args" in btrfs_ioctl_rm_dev() are ERR_PTRs so we can't call kfree() on them. These kind of bugs are "One Err Bugs" where there is just one error label that does everything. I could set the "inherit = NULL" and keep the single out label but it ends up being more complicated that way. It makes the code simpler to re-order the unwind so it's in the mirror order of the allocation and introduce some new error labels. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-08 13:56:42 -07:00
Filipe Manana	dac5705cad	Btrfs: fix crash while doing a ranged fsync While doing a ranged fsync, that is, one whose range doesn't cover the whole possible file range (0 to LLONG_MAX), we can crash under certain circumstances with a trace like the following: [41074.641913] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC (...) [41074.642692] CPU: 0 PID: 24580 Comm: fsx Not tainted 3.16.0-fdm-btrfs-next-45+ #1 (...) [41074.643886] RIP: 0010:[<ffffffffa01ecc99>] [<ffffffffa01ecc99>] btrfs_ordered_update_i_size+0x279/0x2b0 [btrfs] (...) [41074.644919] Stack: (...) [41074.644919] Call Trace: [41074.644919] [<ffffffffa01db531>] btrfs_truncate_inode_items+0x3f1/0xa10 [btrfs] [41074.644919] [<ffffffffa01eb54f>] ? btrfs_get_logged_extents+0x4f/0x80 [btrfs] [41074.644919] [<ffffffffa02137a9>] btrfs_log_inode+0x2f9/0x970 [btrfs] [41074.644919] [<ffffffff81090875>] ? sched_clock_local+0x25/0xa0 [41074.644919] [<ffffffff8164a55e>] ? mutex_unlock+0xe/0x10 [41074.644919] [<ffffffff810af51d>] ? trace_hardirqs_on+0xd/0x10 [41074.644919] [<ffffffffa0214b4f>] btrfs_log_inode_parent+0x1ef/0x560 [btrfs] [41074.644919] [<ffffffff811d0c55>] ? dget_parent+0x5/0x180 [41074.644919] [<ffffffffa0215d11>] btrfs_log_dentry_safe+0x51/0x80 [btrfs] [41074.644919] [<ffffffffa01e2d1a>] btrfs_sync_file+0x1ba/0x3e0 [btrfs] [41074.644919] [<ffffffff811eda6b>] vfs_fsync_range+0x1b/0x30 (...) The necessary conditions that lead to such crash are: * an incremental fsync (when the inode doesn't have the BTRFS_INODE_NEEDS_FULL_SYNC flag set) happened for our file and it logged a file extent item ending at offset X; * the file got the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, due to a file truncate operation that reduces the file to a size smaller than X; * a ranged fsync call happens (via an msync for example), with a range that doesn't cover the whole file and the end of this range, lets call it Y, is smaller than X; * btrfs_log_inode, sees the flag BTRFS_INODE_NEEDS_FULL_SYNC set and calls btrfs_truncate_inode_items() to remove all items from the log tree that are associated with our file; * btrfs_truncate_inode_items() removes all of the inode's items, and the lowest file extent item it removed is the one ending at offset X, where X > 0 and X > Y - before returning, it calls btrfs_ordered_update_i_size() with an offset parameter set to X; * btrfs_ordered_update_i_size() sees that X is greater then the current ordered size (btrfs_inode's disk_i_size) and then it assumes there can't be any ongoing ordered operation with a range covering the offset X, calling a BUG_ON() if such ordered operation exists. This assumption is made because the disk_i_size is only increased after the corresponding file extent item is added to the btree (btrfs_finish_ordered_io); * But because our fsync covers only a limited range, such an ordered extent might exist, and our fsync callback (btrfs_sync_file) doesn't wait for such ordered extent to finish when calling btrfs_wait_ordered_range(); And then by the time btrfs_ordered_update_i_size() is called, via: btrfs_sync_file() -> btrfs_log_dentry_safe() -> btrfs_log_inode_parent() -> btrfs_log_inode() -> btrfs_truncate_inode_items() -> btrfs_ordered_update_i_size() We hit the BUG_ON(), which could never happen if the fsync range covered the whole possible file range (0 to LLONG_MAX), as we would wait for all ordered extents to finish before calling btrfs_truncate_inode_items(). So just don't call btrfs_ordered_update_i_size() if we're removing the inode's items from a log tree, which isn't supposed to change the in memory inode's disk_i_size. Issue found while running xfstests/generic/127 (happens very rarely for me), more specifically via the fsx calls that use memory mapped IO (and issue msync calls). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-02 16:46:05 -07:00
Filipe Manana	d9f85963e3	Btrfs: fix corruption after write/fsync failure + fsync + log recovery While writing to a file, in inode.c:cow_file_range() (and same applies to submit_compressed_extents()), after reserving an extent for the file data, we create a new extent map for the written range and insert it into the extent map cache. After that, we create an ordered operation, but if it fails (due to a transient/temporary-ENOMEM), we return without dropping that extent map, which points to a reserved extent that is freed when we return. A subsequent incremental fsync (when the btrfs inode doesn't have the flag BTRFS_INODE_NEEDS_FULL_SYNC) considers this extent map valid and logs a file extent item based on that extent map, which points to a disk extent that doesn't contain valid data - it was freed by us earlier, at this point it might contain any random/garbage data. Therefore, if we reach an error condition when cowing a file range after we added the new extent map to the cache, drop it from the cache before returning. Some sequence of steps that lead to this: $ mkfs.btrfs -f /dev/sdd $ mount -o commit=9999 /dev/sdd /mnt $ cd /mnt $ xfs_io -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" foo $ xfs_io -c "pwrite -S 0x02 -b 4096 4096 4096" $ sync $ od -t x1 foo 0000000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 * 0010000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 * 0020000 $ xfs_io -c "pwrite -S 0xa1 -b 4096 0 4096" foo # Now this write + fsync fail with -ENOMEM, which was returned by # btrfs_add_ordered_extent() in inode.c:cow_file_range(). $ xfs_io -c "pwrite -S 0xff -b 4096 4096 4096" foo $ xfs_io -c "fsync" foo fsync: Cannot allocate memory # Now do a new write + fsync, which will succeed. Our previous # -ENOMEM was a transient/temporary error. $ xfs_io -c "pwrite -S 0xee -b 4096 16384 4096" foo $ xfs_io -c "fsync" foo # Our file content (in page cache) is now: $ od -t x1 foo 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 * 0010000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee * 0050000 # Now reboot the machine, and mount the fs, so that fsync log replay # takes place. # The file content is now weird, in particular the first 8Kb, which # do not match our data before nor after the sync command above. $ od -t x1 foo 0000000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee * 0010000 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 * 0020000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 0040000 ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee ee * 0050000 # In fact these first 4Kb are a duplicate of the last 4kb block. # The last write got an extent map/file extent item that points to # the same disk extent that we got in the write+fsync that failed # with the -ENOMEM error. btrfs-debug-tree and btrfsck allow us to # verify that: $ btrfs-debug-tree /dev/sdd (...) item 6 key (257 EXTENT_DATA 0) itemoff 15819 itemsize 53 extent data disk byte 12582912 nr 8192 extent data offset 0 nr 8192 ram 8192 item 7 key (257 EXTENT_DATA 8192) itemoff 15766 itemsize 53 extent data disk byte 0 nr 0 extent data offset 0 nr 8192 ram 8192 item 8 key (257 EXTENT_DATA 16384) itemoff 15713 itemsize 53 extent data disk byte 12582912 nr 4096 extent data offset 0 nr 4096 ram 4096 $ umount /dev/sdd $ btrfsck /dev/sdd Checking filesystem on /dev/sdd UUID: db5e60e1-050d-41e6-8c7f-3d742dea5d8f checking extents extent item 12582912 has multiple extent items ref mismatch on [12582912 4096] extent item 1, found 2 Backref bytes do not match extent backref, bytenr=12582912, ref bytes=4096, backref bytes=8192 backpointer mismatch on [12582912 4096] Errors found in extent allocation tree or chunk allocation checking free space cache checking fs roots root 5 inode 257 errors 1000, some csum missing found 131074 bytes used err is 1 total csum bytes: 4 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 123404 file data blocks allocated: 274432 referenced 274432 Btrfs v3.14.1-96-gcc7fd5a-dirty Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-09-02 16:46:05 -07:00
Linus Torvalds	1fb00cbca0	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "The biggest of these comes from Liu Bo, who tracked down a hang we've been hitting since moving to kernel workqueues (it's a btrfs bug, not in the generic code). His patch needs backporting to 3.16 and 3.15 stable, which I'll send once this is in. Otherwise these are assorted fixes. Most were integrated last week during KS, but I wanted to give everyone the chance to test the result, so I waited for rc2 to come out before sending" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (24 commits) Btrfs: fix task hang under heavy compressed write Btrfs: fix filemap_flush call in btrfs_file_release Btrfs: fix crash on endio of reading corrupted block btrfs: fix leak in qgroup_subtree_accounting() error path btrfs: Use right extent length when inserting overlap extent map. Btrfs: clone, don't create invalid hole extent map Btrfs: don't monopolize a core when evicting inode Btrfs: fix hole detection during file fsync Btrfs: ensure tmpfile inode is always persisted with link count of 0 Btrfs: race free update of commit root for ro snapshots Btrfs: fix regression of btrfs device replace Btrfs: don't consider the missing device when allocating new chunks Btrfs: Fix wrong device size when we are resizing the device Btrfs: don't write any data into a readonly device when scrub Btrfs: Fix the problem that the replace destroys the seed filesystem btrfs: Return right extent when fiemap gives unaligned offset and len. Btrfs: fix wrong extent mapping for DirectIO Btrfs: fix wrong write range for filemap_fdatawrite_range() Btrfs: fix wrong missing device counter decrease Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs ...	2014-08-27 09:14:17 -07:00
Chris Mason	e9512d72e8	Btrfs: fix autodefrag with compression The autodefrag code skips defrag when two extents are adjacent. But one big advantage for autodefrag is cutting down on the number of small extents, even when they are adjacent. This commit changes it to defrag all small extents. Signed-off-by: Chris Mason <clm@fb.com>	2014-08-27 08:45:37 -07:00
Liu Bo	9e0af23764	Btrfs: fix task hang under heavy compressed write This has been reported and discussed for a long time, and this hang occurs in both 3.15 and 3.16. Btrfs now migrates to use kernel workqueue, but it introduces this hang problem. Btrfs has a kind of work queued as an ordered way, which means that its ordered_func() must be processed in the way of FIFO, so it usually looks like -- normal_work_helper(arg) work = container_of(arg, struct btrfs_work, normal_work); work->func() <---- (we name it work X) for ordered_work in wq->ordered_list ordered_work->ordered_func() ordered_work->ordered_free() The hang is a rare case, first when we find free space, we get an uncached block group, then we go to read its free space cache inode for free space information, so it will file a readahead request btrfs_readpages() for page that is not in page cache __do_readpage() submit_extent_page() btrfs_submit_bio_hook() btrfs_bio_wq_end_io() submit_bio() end_workqueue_bio() <--(ret by the 1st endio) queue a work(named work Y) for the 2nd also the real endio() So the hang occurs when work Y's work_struct and work X's work_struct happens to share the same address. A bit more explanation, A,B,C -- struct btrfs_work arg -- struct work_struct kthread: worker_thread() pick up a work_struct from @worklist process_one_work(arg) worker->current_work = arg; <-- arg is A->normal_work worker->current_func(arg) normal_work_helper(arg) A = container_of(arg, struct btrfs_work, normal_work); A->func() A->ordered_func() A->ordered_free() <-- A gets freed B->ordered_func() submit_compressed_extents() find_free_extent() load_free_space_inode() ... <-- (the above readhead stack) end_workqueue_bio() btrfs_queue_work(work C) B->ordered_free() As if work A has a high priority in wq->ordered_list and there are more ordered works queued after it, such as B->ordered_func(), its memory could have been freed before normal_work_helper() returns, which means that kernel workqueue code worker_thread() still has worker->current_work pointer to be work A->normal_work's, ie. arg's address. Meanwhile, work C is allocated after work A is freed, work C->normal_work and work A->normal_work are likely to share the same address(I confirmed this with ftrace output, so I'm not just guessing, it's rare though). When another kthread picks up work C->normal_work to process, and finds our kthread is processing it(see find_worker_executing_work()), it'll think work C as a collision and skip then, which ends up nobody processing work C. So the situation is that our kthread is waiting forever on work C. Besides, there're other cases that can lead to deadlock, but the real problem is that all btrfs workqueue shares one work->func, -- normal_work_helper, so this makes each workqueue to have its own helper function, but only a wraper pf normal_work_helper. With this patch, I no long hit the above hang. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-24 07:17:02 -07:00
Chris Mason	f6dc45c7a9	Btrfs: fix filemap_flush call in btrfs_file_release We should only be flushing on close if the file was flagged as needing it during truncate. I broke this with my ordered data vs transaction commit deadlock fix. Thanks to Miao Xie for catching this. Signed-off-by: Chris Mason <clm@fb.com> Reported-by: Miao Xie <miaox@cn.fujitsu.com> Reported-by: Fengguang Wu <fengguang.wu@intel.com>	2014-08-21 07:55:31 -07:00
Liu Bo	38c1c2e44b	Btrfs: fix crash on endio of reading corrupted block The crash is ------------[ cut here ]------------ kernel BUG at fs/btrfs/extent_io.c:2124! [...] Workqueue: btrfs-endio normal_work_helper [btrfs] RIP: 0010:[<ffffffffa02d6055>] [<ffffffffa02d6055>] end_bio_extent_readpage+0xb45/0xcd0 [btrfs] This is in fact a regression. It is because we forgot to increase @offset properly in reading corrupted block, so that the @offset remains, and this leads to checksum errors while reading left blocks queued up in the same bio, and then ends up with hiting the above BUG_ON. Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-21 07:55:30 -07:00
Eric Sandeen	a3c108950d	btrfs: fix leak in qgroup_subtree_accounting() error path Coverity pointed this out; in the newly added qgroup_subtree_accounting(), if btrfs_find_all_roots() returns an error, we leak at least the parents pointer, and possibly the roots pointer, depending on what failure occurs. If btrfs_find_all_roots() returns an error, we need to free up all allocations before we return. "roots" is initialized to NULL, so it should be safe to free it unconditionally (ulist_free() handles that case). Cc: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-21 07:55:29 -07:00
Qu Wenruo	51f395ad40	btrfs: Use right extent length when inserting overlap extent map. When current btrfs finds that a new extent map is going to be insereted but failed with -EEXIST, it will try again to insert the extent map but with the length of sectorsize. This is OK if we don't enable 'no-holes' feature since all extent space is continuous, we will not go into the not found->insert routine. But if we enable 'no-holes' feature, it will make things out of control. e.g. in 4K sectorsize, we pass the following args to btrfs_get_extent(): btrfs_get_extent() args: start: 27874 len 4100 28672 27874 28672 27874+4100 32768 \|-----------------------\| \|---------hole--------------------\|---------data----------\| 1) not found and insert Since no extent map containing the range, btrfs_get_extent() will go into the not_found and insert routine, which will try to insert the extent map (27874, 27847 + 4100). 2) first overlap But it overlaps with (28672, 32768) extent, so -EEXIST will be returned by add_extent_mapping(). 3) retry but still overlap After catching the -EEXIST, then btrfs_get_extent() will try insert it again but with 4K length, which still overlaps, so -EEXIST will be returned. This makes the following patch fail to punch hole. `d77815461f` btrfs: Avoid trucating page or punching hole in a already existed hole. This patch will use the right length, which is the (exsisting->start - em->start) to insert, making the above patch works in 'no-holes' mode. Also, some small code style problems in above patch is fixed too. Reported-by: Filipe David Manana <fdmanana@gmail.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Filipe David Manana <fdmanana@suse.com> Tested-by: Filipe David Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-21 07:55:27 -07:00
Filipe Manana	62e2390e1a	Btrfs: clone, don't create invalid hole extent map When cloning a file that consists of an inline extent, we were creating an extent map that represents a non-existing trailing hole starting at a file offset that isn't a multiple of the sector size. This happened because when processing an inline extent we weren't aligning the extent's length to the sector size, and therefore incorrectly treating the range [inline_extent_length; sector_size[ as a hole. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-21 07:55:26 -07:00
Filipe Manana	7064dd5c36	Btrfs: don't monopolize a core when evicting inode If an inode has a very large number of extent maps, we can spend a lot of time freeing them, which triggers a soft lockup warning. Therefore reschedule if we need to when freeing the extent maps while evicting the inode. I could trigger this all the time by running xfstests/generic/299 on a file system with the no-holes feature enabled. That test creates an inode with 11386677 extent maps. $ mkfs.btrfs -f -O no-holes $TEST_DEV $ MKFS_OPTIONS="-O no-holes" ./check generic/299 generic/299 382s ... Message from syslogd@debian-vm3 at Aug 7 10:44:29 ... kernel:[85304.208017] BUG: soft lockup - CPU#0 stuck for 22s! [umount:25330] 384s Ran: generic/299 Passed all 1 tests $ dmesg (...) [86304.300017] BUG: soft lockup - CPU#0 stuck for 23s! [umount:25330] (...) [86304.300036] Call Trace: [86304.300036] [<ffffffff81698ba9>] __slab_free+0x54/0x295 [86304.300036] [<ffffffffa02ee9cc>] ? free_extent_map+0x5c/0xb0 [btrfs] [86304.300036] [<ffffffff811a6cd2>] kmem_cache_free+0x282/0x2a0 [86304.300036] [<ffffffffa02ee9cc>] free_extent_map+0x5c/0xb0 [btrfs] [86304.300036] [<ffffffffa02e3775>] btrfs_evict_inode+0xd5/0x660 [btrfs] [86304.300036] [<ffffffff811e7c8d>] ? __inode_wait_for_writeback+0x6d/0xc0 [86304.300036] [<ffffffff816a389b>] ? _raw_spin_unlock+0x2b/0x40 [86304.300036] [<ffffffff811d8cbb>] evict+0xab/0x180 [86304.300036] [<ffffffff811d8dce>] dispose_list+0x3e/0x60 [86304.300036] [<ffffffff811d9b04>] evict_inodes+0xf4/0x110 [86304.300036] [<ffffffff811bd953>] generic_shutdown_super+0x53/0x110 [86304.300036] [<ffffffff811bdaa6>] kill_anon_super+0x16/0x30 [86304.300036] [<ffffffffa02a78ba>] btrfs_kill_super+0x1a/0xa0 [btrfs] [86304.300036] [<ffffffff811bd3a9>] deactivate_locked_super+0x59/0x80 [86304.300036] [<ffffffff811be44e>] deactivate_super+0x4e/0x70 [86304.300036] [<ffffffff811dec14>] mntput_no_expire+0x174/0x1f0 [86304.300036] [<ffffffff811deab7>] ? mntput_no_expire+0x17/0x1f0 [86304.300036] [<ffffffff811e0517>] SyS_umount+0x97/0x100 (...) Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-21 07:55:25 -07:00
Filipe Manana	74121f7cbb	Btrfs: fix hole detection during file fsync The file hole detection logic during a file fsync wasn't correct, because it didn't look back (in a previous leaf) for the last file extent item that can be in a leaf to the left of our leaf and that has a generation lower than the current transaction id. This made it assume that a hole exists when it really doesn't exist in the file. Such false positive hole detection happens in the following scenario: * We have a file that has many file extent items, covering 3 or more btree leafs (the first leaf must contain non file extent items too). * Two ranges of the file are modified, with their extent items being located at 2 different leafs and those leafs aren't consecutive. * When processing the second modified leaf, we weren't checking if some file extent item exists that is located in some leaf that is between our 2 modified leafs, and therefore assumed the range defined between the last file extent item in the first leaf and the first file extent item in the second leaf matched a hole. Fortunately this didn't result in overriding the log with wrong data, instead it made the last loop in copy_items() attempt to insert a duplicated key (for a hole file extent item), which makes the file fsync code return with -EEXIST to file.c:btrfs_sync_file() which in turn ends up doing a full transaction commit, which is much more expensive then writing only to the log tree and wait for it to be durably persisted (as well as the file's modified extents/pages). Therefore fix the hole detection logic, so that we don't pay the cost of doing full transaction commits. I could trigger this issue with the following test for xfstests (which never fails, either without or with this patch). The last fsync call results in a full transaction commit, due to the -EEXIST error mentioned above. I could also observe this behaviour happening frequently when running xfstests/generic/075 in a loop. Test: _cleanup() { _cleanup_flakey rm -fr $tmp } # get standard environment, filters and checks . ./common/rc . ./common/filter . ./common/dmflakey # real QA test starts here _supported_fs btrfs _supported_os Linux _require_scratch _require_dm_flakey _need_to_be_root rm -f $seqres.full # Create a file with many file extent items, each representing a 4Kb extent. # These items span 3 btree leaves, of 16Kb each (default mkfs.btrfs leaf size # as of btrfs-progs 3.12). _scratch_mkfs -l 16384 >/dev/null 2>&1 _init_flakey SAVE_MOUNT_OPTIONS="$MOUNT_OPTIONS" MOUNT_OPTIONS="$MOUNT_OPTIONS -o commit=999" _mount_flakey # First fsync, inode has BTRFS_INODE_NEEDS_FULL_SYNC flag set. $XFS_IO_PROG -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" \ $SCRATCH_MNT/foo \| _filter_xfs_io # For any of the following fsync calls, inode doesn't have the flag # BTRFS_INODE_NEEDS_FULL_SYNC set. for ((i = 1; i <= 500; i++)); do OFFSET=$((4096 * i)) LEN=4096 $XFS_IO_PROG -c "pwrite -S 0x01 $OFFSET $LEN" -c "fsync" \ $SCRATCH_MNT/foo \| _filter_xfs_io done # Commit transaction and bump next transaction's id (to 7). sync # Truncate will set the BTRFS_INODE_NEEDS_FULL_SYNC flag in the btrfs's # inode runtime flags. $XFS_IO_PROG -c "truncate 2048000" $SCRATCH_MNT/foo # Commit transaction and bump next transaction's id (to 8). sync # Touch 1 extent item from the first leaf and 1 from the last leaf. The leaf # in the middle, containing only file extent items, isn't touched. So the # next fsync, when calling btrfs_search_forward(), won't visit that middle # leaf. First and 3rd leaf have now a generation with value 8, while the # middle leaf remains with a generation with value 6. $XFS_IO_PROG \ -c "pwrite -S 0xee -b 4096 0 4096" \ -c "pwrite -S 0xff -b 4096 2043904 4096" \ -c "fsync" \ $SCRATCH_MNT/foo \| _filter_xfs_io _load_flakey_table $FLAKEY_DROP_WRITES md5sum $SCRATCH_MNT/foo \| _filter_scratch _unmount_flakey _load_flakey_table $FLAKEY_ALLOW_WRITES # During mount, we'll replay the log created by the fsync above, and the file's # md5 digest should be the same we got before the unmount. _mount_flakey md5sum $SCRATCH_MNT/foo \| _filter_scratch _unmount_flakey MOUNT_OPTIONS="$SAVE_MOUNT_OPTIONS" status=0 exit Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-21 07:55:24 -07:00
Filipe Manana	5762b5c958	Btrfs: ensure tmpfile inode is always persisted with link count of 0 If we open a file with O_TMPFILE, don't do any further operation on it (so that the inode item isn't updated) and then force a transaction commit, we get a persisted inode item with a link count of 1, and not 0 as it should be. Steps to reproduce it (requires a modern xfs_io with -T support): $ mkfs.btrfs -f /dev/sdd $ mount -o /dev/sdd /mnt $ xfs_io -T /mnt & $ sync Then btrfs-debug-tree shows the inode item with a link count of 1: $ btrfs-debug-tree /dev/sdd (...) fs tree key (FS_TREE ROOT_ITEM 0) leaf 29556736 items 4 free space 15851 generation 6 owner 5 fs uuid f164d01b-1b92-481d-a4e4-435fb0f843d0 chunk uuid 0e3d0e56-bcca-4a1c-aa5f-cec2c6f4f7a6 item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160 inode generation 3 transid 6 size 0 block group 0 mode 40755 links 1 item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12 inode ref index 0 namelen 2 name: .. item 2 key (257 INODE_ITEM 0) itemoff 15951 itemsize 160 inode generation 6 transid 6 size 0 block group 0 mode 100600 links 1 item 3 key (ORPHAN ORPHAN_ITEM 257) itemoff 15951 itemsize 0 orphan item checksum tree key (CSUM_TREE ROOT_ITEM 0) (...) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-21 07:55:23 -07:00
Filipe Manana	9c3b306e1c	Btrfs: race free update of commit root for ro snapshots This is a better solution for the problem addressed in the following commit: Btrfs: update commit root on snapshot creation after orphan cleanup (`3821f34888`) The previous solution wasn't the best because of 2 reasons: 1) It added another full transaction commit, which is more expensive than just swapping the commit root with the root; 2) If a reboot happened after the first transaction commit (the one that creates the snapshot) and before the second transaction commit, then we would end up with the same problem if a send using that snapshot was requested before the first transaction commit after the reboot. This change addresses those 2 issues. The second issue is addressed by switching the commit root in the dentry lookup VFS callback, which is also called by the snapshot/subvol creation ioctl and performs orphan cleanup if needed. Like the vfs, the ioctl locks the parent inode too, preventing race issues between a dentry lookup and snapshot creation. Cc: Alex Lyakas <alex.btrfs@zadarastorage.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-21 07:55:21 -07:00
Liu Bo	87fa3bb078	Btrfs: fix regression of btrfs device replace Commit 49c6f736f34f901117c20960ebd7d5e60f12fcac( btrfs: dev replace should replace the sysfs entry) added the missing sysfs entry in the process of device replace, but didn't take missing devices into account, so now we have BUG: unable to handle kernel NULL pointer dereference at 0000000000000088 IP: [<ffffffffa0268551>] btrfs_kobj_rm_device+0x21/0x40 [btrfs] ... To reproduce it, 1. mkfs.btrfs -f disk1 disk2 2. mkfs.ext4 disk1 3. mount disk2 /mnt -odegraded 4. btrfs replace start -B 1 disk3 /mnt -------------------------- This fixes the problem. Reported-by: Chris Murphy <lists@colorremedies.com> Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Tested-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-21 07:55:20 -07:00
Miao Xie	95669976bd	Btrfs: don't consider the missing device when allocating new chunks The original code allocated new chunks by the number of the writable devices and missing devices to make sure that any RAID levels on a degraded FS continue to be honored, but it introduced a problem that it stopped us to allocating new chunks, the steps to reproduce is following: # mkfs.btrfs -m raid1 -d raid1 -f <dev0> <dev1> # mkfs.btrfs -f <dev1> //Removing <dev1> from the original fs # mount -o degraded <dev0> <mnt> # dd if=/dev/null of=<mnt>/tmpfile bs=1M It is because we allocate new chunks only on the writable devices, if we take the number of missing devices into account, and want to allocate new chunks with higher RAID level, we will fail becaue we don't have enough writable device. Fix it by ignoring the number of missing devices when allocating new chunks. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:52:19 -07:00
Miao Xie	7df69d3e94	Btrfs: Fix wrong device size when we are resizing the device total_bytes of device is just a in-memory variant which is used to record the size of the device, and it might be changed before we resize a device, if the resize operation fails, it will be fallbacked. But some code used it to update on-disk metadata of the device, it would cause the problem that on-disk metadata of the devices was not consistent. We should use the other variant named disk_total_bytes to update the on-disk metadata of device, because that variant is updated only when the resize operation is successful. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:52:18 -07:00
Miao Xie	5d68da3b8e	Btrfs: don't write any data into a readonly device when scrub We should not write data into a readonly device especially seed device when doing scrub, skip those devices. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:52:17 -07:00
Miao Xie	ff61d17c63	Btrfs: Fix the problem that the replace destroys the seed filesystem The seed filesystem was destroyed by the device replace, the reproduce method is: # mkfs.btrfs -f <dev0> # btrfstune -S 1 <dev0> # mount <dev0> <mnt> # btrfs device add <dev1> <mnt> # umount <mnt> # mount <dev1> <mnt> # btrfs replace start -f <dev0> <dev2> <mnt> # umount <mnt> # mount <dev0> <mnt> It is because we erase the super block on the seed device. It is wrong, we should not change anything on the seed device. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:52:16 -07:00
Qu Wenruo	2c91943b50	btrfs: Return right extent when fiemap gives unaligned offset and len. When page aligned start and len passed to extent_fiemap(), the result is good, but when start and len is not aligned, e.g. start = 1 and len = 4095 is passed to extent_fiemap(), it returns no extent. The problem is that start and len is all rounded down which causes the problem. This patch will round down start and round up (start + len) to return right extent. Reported-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:52:14 -07:00
Wang Shilong	e2eca69dc6	Btrfs: fix wrong extent mapping for DirectIO btrfs_next_leaf() will use current leaf's last key to search and then return a bigger one. So it may still return a file extent item that is smaller than expected value and we will get an overflow here for @em->len. This is easy to reproduce for Btrfs Direct writting, it did not cause any problem, because writting will re-insert right mapping later. However, by hacking code to make DIO support compression, wrong extent mapping is kept and it encounter merging failure(EEXIST) quickly. Fix this problem by looping to find next file extent item that is bigger than @start or we could not find anything more. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:52:13 -07:00
Wang Shilong	9a025a0860	Btrfs: fix wrong write range for filemap_fdatawrite_range() filemap_fdatawrite_range() expect the third arg to be @end not @len, fix it. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:52:12 -07:00
Miao Xie	3a7d55c84c	Btrfs: fix wrong missing device counter decrease The missing devices are accounted by its own fs device, for example the missing devices in seed filesystem will be accounted by the fs device of the seed filesystem, not by the new filesystem which is based on the seed filesystem, so when we remove the missing device in the seed filesystem, we should decrease the counter of its own fs device. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:52:10 -07:00
Miao Xie	69611ac810	Btrfs: fix unzeroed members in fs_devices when creating a fs from seed fs We forgot to zero some members in fs_devices when we create new fs_devices from the one of the seed fs. It would cause the problem that we got wrong chunk profile when allocating chunks. Fix it. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:36:32 -07:00
Anand Jain	77bdae4d13	btrfs: check generation as replace duplicates devid+uuid When FS in unmounted we need to check generation number as well since devid+uuid combination could match with the missing replaced disk when it reappears, and without this patch it might pair with the replaced disk again. device_list_add() function is called in the following threads, mount device option mount argument ioctl BTRFS_IOC_SCAN_DEV (btrfs dev scan) ioctl BTRFS_IOC_DEVICES_READY (btrfs dev ready <dev>) they have been unit tested to work fine with this patch. If the user knows what he is doing and really want to pair with replaced disk (which is not a standard operation), then he should first clear the kernel btrfs device list in the memory by doing the module unload/load and followed with the mount -o device option. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:36:30 -07:00
Anand Jain	b96de000bc	Btrfs: device_list_add() should not update list when mounted device_list_add() is called when user runs btrfs dev scan, which would add any btrfs device into the btrfs_fs_devices list. Now think of a mounted btrfs. And a new device which contains the a SB from the mounted btrfs devices. In this situation when user runs btrfs dev scan, the current code would just replace existing device with the new device. Which is to note that old device is neither closed nor gracefully removed from the btrfs. The FS is still operational with the old bdev however the device name is the btrfs_device is new which is provided by the btrfs dev scan. reproducer: devmgt[1] detach /dev/sdc replace the missing disk /dev/sdc btrfs rep start -f 1 /dev/sde /btrfs Label: none uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120 Total devices 2 FS bytes used 32.00KiB devid 1 size 958.94MiB used 115.88MiB path /dev/sde devid 2 size 958.94MiB used 103.88MiB path /dev/sdd make /dev/sdc to reappear devmgt attach host2 btrfs dev scan btrfs fi show -m Label: none uuid: 5dc0aaf4-4683-4050-b2d6-5ebe5f5cd120^M Total devices 2 FS bytes used 32.00KiB^M devid 1 size 958.94MiB used 115.88MiB path /dev/sdc <- Wrong. devid 2 size 958.94MiB used 103.88MiB path /dev/sdd since /dev/sdc has been replaced with /dev/sde, the /dev/sdc shouldn't be part of the btrfs-fsid when it reappears. If user want it to be part of it then sys admin should be using btrfs device add instead. [1] github.com/anajain/devmgt.git Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Reviewed-by: Satoru Takeuchi <takeuchi_satoru@jp.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:36:28 -07:00
chandan	1707e26d6a	Btrfs: fill_holes: Fix slot number passed to hole_mergeable() call. For a non-existent key, btrfs_search_slot() sets path->slots[0] to the slot where the key could have been present, which in this case would be the slot containing the extent item which would be the next neighbor of the file range being punched. The current code passes an incremented path->slots[0] and we skip to the wrong file extent item. This would mean that we would fail to merge the "yet to be created" hole with the next neighboring hole (if one exists). Fix this. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Reviewed-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:36:26 -07:00
Miao Xie	7a5c3c9be1	Btrfs: fix put dio bio twice when we submit dio bio fail The caller of btrfs_submit_direct_hook() will put the original dio bio when btrfs_submit_direct_hook() return a error number, so we needn't put the original bio in btrfs_submit_direct_hook(). Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-19 08:36:24 -07:00
Linus Torvalds	e64df3ebe8	Merge branch 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "These are all fixes I'd like to get out to a broader audience. The biggest of the bunch is Mark's quota fix, which is also in the SUSE kernel, and makes our subvolume quotas dramatically more accurate. I've been running xfstests with these against your current git overnight, but I'm queueing up longer tests as well" * 'for-linus2' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: disable strict file flushes for renames and truncates Btrfs: fix csum tree corruption, duplicate and outdated checksums Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch Btrfs: fix compressed write corruption on enospc btrfs: correctly handle return from ulist_add btrfs: qgroup: account shared subtrees during snapshot delete Btrfs: read lock extent buffer while walking backrefs Btrfs: __btrfs_mod_ref should always use no_quota btrfs: adjust statfs calculations according to raid profiles	2014-08-16 09:06:55 -06:00
Chris Mason	8d875f95da	btrfs: disable strict file flushes for renames and truncates Truncates and renames are often used to replace old versions of a file with new versions. Applications often expect this to be an atomic replacement, even if they haven't done anything to make sure the new version is fully on disk. Btrfs has strict flushing in place to make sure that renaming over an old file with a new file will fully flush out the new file before allowing the transaction commit with the rename to complete. This ordering means the commit code needs to be able to lock file pages, and there are a few paths in the filesystem where we will try to end a transaction with the page lock held. It's rare, but these things can deadlock. This patch removes the ordered flushes and switches to a best effort filemap_flush like ext4 uses. It's not perfect, but it should fix the deadlocks. Signed-off-by: Chris Mason <clm@fb.com>	2014-08-15 07:43:42 -07:00
Filipe Manana	27b9a8122f	Btrfs: fix csum tree corruption, duplicate and outdated checksums Under rare circumstances we can end up leaving 2 versions of a checksum for the same file extent range. The reason for this is that after calling btrfs_next_leaf we process slot 0 of the leaf it returns, instead of processing the slot set in path->slots[0]. Most of the time (by far) path->slots[0] is 0, but after btrfs_next_leaf() releases the path and before it searches for the next leaf, another task might cause a split of the next leaf, which migrates some of its keys to the leaf we were processing before calling btrfs_next_leaf(). In this case btrfs_next_leaf() returns again the same leaf but with path->slots[0] having a slot number corresponding to the first new key it got, that is, a slot number that didn't exist before calling btrfs_next_leaf(), as the leaf now has more keys than it had before. So we must really process the returned leaf starting at path->slots[0] always, as it isn't always 0, and the key at slot 0 can have an offset much lower than our search offset/bytenr. For example, consider the following scenario, where we have: sums->bytenr: 40157184, sums->len: 16384, sums end: 40173568 four 4kb file data blocks with offsets 40157184, 40161280, 40165376, 40169472 Leaf N: slot = 0 slot = btrfs_header_nritems() - 1 \|-------------------------------------------------------------------\| \| [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM `40116224`), size 4] \| \|-------------------------------------------------------------------\| Leaf N + 1: slot = 0 slot = btrfs_header_nritems() - 1 \|--------------------------------------------------------------------\| \| [(CSUM CSUM 40161280), size 32] ... [((CSUM CSUM 40615936), size 8 \| \|--------------------------------------------------------------------\| Because we are at the last slot of leaf N, we call btrfs_next_leaf() to find the next highest key, which releases the current path and then searches for that next key. However after releasing the path and before finding that next key, the item at slot 0 of leaf N + 1 gets moved to leaf N, due to a call to ctree.c:push_leaf_left() (via ctree.c:split_leaf()), and therefore btrfs_next_leaf() will returns us a path again with leaf N but with the slot pointing to its new last key (CSUM CSUM 40161280). This new version of leaf N is then: slot = 0 slot = btrfs_header_nritems() - 2 slot = btrfs_header_nritems() - 1 \|----------------------------------------------------------------------------------------------------\| \| [(CSUM CSUM 39239680), size 8] ... [(CSUM CSUM `40116224`), size 4] [(CSUM CSUM 40161280), size 32] \| \|----------------------------------------------------------------------------------------------------\| And incorrecly using slot 0, makes us set next_offset to 39239680 and we jump into the "insert:" label, which will set tmp to: tmp = min((sums->len - total_bytes) >> blocksize_bits, (next_offset - file_key.offset) >> blocksize_bits) = min((16384 - 0) >> 12, (39239680 - 40157184) >> 12) = min(4, (u64)-917504 = 18446744073708634112 >> 12) = 4 and ins_size = csum_size * tmp = 4 * 4 = 16 bytes. In other words, we insert a new csum item in the tree with key (CSUM_OBJECTID CSUM_KEY 40157184 = sums->bytenr) that contains the checksums for all the data (4 blocks of 4096 bytes each = sums->len). Which is wrong, because the item with key (CSUM CSUM 40161280) (the one that was moved from leaf N + 1 to the end of leaf N) contains the old checksums of the last 12288 bytes of our data and won't get those old checksums removed. So this leaves us 2 different checksums for 3 4kb blocks of data in the tree, and breaks the logical rule: Key_N+1.offset >= Key_N.offset + length_of_data_its_checksums_cover An obvious bad effect of this is that a subsequent csum tree lookup to get the checksum of any of the blocks with logical offset of 40161280, 40165376 or 40169472 (the last 3 4kb blocks of file data), will get the old checksums. Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-15 07:43:40 -07:00
Takashi Iwai	4eb1f66dce	Btrfs: Fix memory corruption by ulist_add_merge() on 32bit arch We've got bug reports that btrfs crashes when quota is enabled on 32bit kernel, typically with the Oops like below: BUG: unable to handle kernel NULL pointer dereference at 00000004 IP: [<f9234590>] find_parent_nodes+0x360/0x1380 [btrfs] pde = 00000000 Oops: 0000 [#1] SMP CPU: 0 PID: 151 Comm: kworker/u8:2 Tainted: G S W 3.15.2-1.gd43d97e-default #1 Workqueue: btrfs-qgroup-rescan normal_work_helper [btrfs] task: f1478130 ti: f147c000 task.ti: f147c000 EIP: 0060:[<f9234590>] EFLAGS: 00010213 CPU: 0 EIP is at find_parent_nodes+0x360/0x1380 [btrfs] EAX: f147dda8 EBX: f147ddb0 ECX: 00000011 EDX: 00000000 ESI: 00000000 EDI: f147dda4 EBP: f147ddf8 ESP: f147dd38 DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 CR0: 8005003b CR2: 00000004 CR3: 00bf3000 CR4: 00000690 Stack: 00000000 00000000 f147dda4 00000050 00000001 00000000 00000001 00000050 00000001 00000000 d3059000 00000001 00000022 000000a8 00000000 00000000 00000000 000000a1 00000000 00000000 00000001 00000000 00000000 11800000 Call Trace: [<f923564d>] __btrfs_find_all_roots+0x9d/0xf0 [btrfs] [<f9237bb1>] btrfs_qgroup_rescan_worker+0x401/0x760 [btrfs] [<f9206148>] normal_work_helper+0xc8/0x270 [btrfs] [<c025e38b>] process_one_work+0x11b/0x390 [<c025eea1>] worker_thread+0x101/0x340 [<c026432b>] kthread+0x9b/0xb0 [<c0712a71>] ret_from_kernel_thread+0x21/0x30 [<c0264290>] kthread_create_on_node+0x110/0x110 This indicates a NULL corruption in prefs_delayed list. The further investigation and bisection pointed that the call of ulist_add_merge() results in the corruption. ulist_add_merge() takes u64 as aux and writes a 64bit value into old_aux. The callers of this function in backref.c, however, pass a pointer of a pointer to old_aux. That is, the function overwrites 64bit value on 32bit pointer. This caused a NULL in the adjacent variable, in this case, prefs_delayed. Here is a quick attempt to band-aid over this: a new function, ulist_add_merge_ptr() is introduced to pass/store properly a pointer value instead of u64. There are still ugly void * cast remaining in the callers because void ** cannot be taken implicitly. But, it's safer than explicit cast to u64, anyway. Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=887046 Cc: <stable@vger.kernel.org> [v3.11+] Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-15 07:43:19 -07:00
Liu Bo	ce62003f69	Btrfs: fix compressed write corruption on enospc When failing to allocate space for the whole compressed extent, we'll fallback to uncompressed IO, but we've forgotten to redirty the pages which belong to this compressed extent, and these 'clean' pages will simply skip 'submit' part and go to endio directly, at last we got data corruption as we write nothing. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Tested-By: Martin Steigerwald <martin@lichtvoll.de> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-15 07:43:18 -07:00
Mark Fasheh	f90e579c2b	btrfs: correctly handle return from ulist_add ulist_add() can return '1' on sucess, which qgroup_subtree_accounting() doesn't take into account. As a result, that value can be bubbled up to callers, causing an error to be printed. Fix this by only returning the value of ulist_add() when it indicates an error. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-15 07:43:16 -07:00
Mark Fasheh	1152651a08	btrfs: qgroup: account shared subtrees during snapshot delete During its tree walk, btrfs_drop_snapshot() will skip any shared subtrees it encounters. This is incorrect when we have qgroups turned on as those subtrees need to have their contents accounted. In particular, the case we're concerned with is when removing our snapshot root leaves the subtree with only one root reference. In those cases we need to find the last remaining root and add each extent in the subtree to the corresponding qgroup exclusive counts. This patch implements the shared subtree walk and a new qgroup operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of this type is encountered during qgroup accounting, we search for any root references to that extent and in the case that we find only one reference left, we go ahead and do the math on it's exclusive counts. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>	2014-08-15 07:43:14 -07:00

1 2 3 4 5 ...

4317 Commits