2007-06-12 09:07:21 -04:00
|
|
|
/*
|
|
|
|
* Copyright (C) 2007 Oracle. All rights reserved.
|
|
|
|
*
|
|
|
|
* This program is free software; you can redistribute it and/or
|
|
|
|
* modify it under the terms of the GNU General Public
|
|
|
|
* License v2 as published by the Free Software Foundation.
|
|
|
|
*
|
|
|
|
* This program is distributed in the hope that it will be useful,
|
|
|
|
* but WITHOUT ANY WARRANTY; without even the implied warranty of
|
|
|
|
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
|
|
|
|
* General Public License for more details.
|
|
|
|
*
|
|
|
|
* You should have received a copy of the GNU General Public
|
|
|
|
* License along with this program; if not, write to the
|
|
|
|
* Free Software Foundation, Inc., 59 Temple Place - Suite 330,
|
|
|
|
* Boston, MA 021110-1307, USA.
|
|
|
|
*/
|
|
|
|
|
2008-01-08 15:46:30 -05:00
|
|
|
#ifndef __BTRFS_TRANSACTION__
|
|
|
|
#define __BTRFS_TRANSACTION__
|
2007-04-30 15:25:45 -04:00
|
|
|
#include "btrfs_inode.h"
|
2009-03-13 10:10:06 -04:00
|
|
|
#include "delayed-ref.h"
|
2012-06-28 18:03:02 +02:00
|
|
|
#include "ctree.h"
|
2007-03-16 16:20:31 -04:00
|
|
|
|
Btrfs: make the state of the transaction more readable
We used 3 variants to track the state of the transaction, it was complex
and wasted the memory space. Besides that, it was hard to understand that
which types of the transaction handles should be blocked in each transaction
state, so the developers often made mistakes.
This patch improved the above problem. In this patch, we define 6 states
for the transaction,
enum btrfs_trans_state {
TRANS_STATE_RUNNING = 0,
TRANS_STATE_BLOCKED = 1,
TRANS_STATE_COMMIT_START = 2,
TRANS_STATE_COMMIT_DOING = 3,
TRANS_STATE_UNBLOCKED = 4,
TRANS_STATE_COMPLETED = 5,
TRANS_STATE_MAX = 6,
}
and just use 1 variant to track those state.
In order to make the blocked handle types for each state more clear,
we introduce a array:
unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
[TRANS_STATE_RUNNING] = 0U,
[TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE |
__TRANS_START),
[TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH),
[TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN),
[TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
[TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
}
it is very intuitionistic.
Besides that, because we remove ->in_commit in transaction structure, so
the lock ->commit_lock which was used to protect it is unnecessary, remove
->commit_lock.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 03:53:43 +00:00
|
|
|
enum btrfs_trans_state {
|
|
|
|
TRANS_STATE_RUNNING = 0,
|
|
|
|
TRANS_STATE_BLOCKED = 1,
|
|
|
|
TRANS_STATE_COMMIT_START = 2,
|
|
|
|
TRANS_STATE_COMMIT_DOING = 3,
|
|
|
|
TRANS_STATE_UNBLOCKED = 4,
|
|
|
|
TRANS_STATE_COMPLETED = 5,
|
|
|
|
TRANS_STATE_MAX = 6,
|
|
|
|
};
|
|
|
|
|
2015-09-24 10:46:10 -04:00
|
|
|
#define BTRFS_TRANS_HAVE_FREE_BGS 0
|
|
|
|
#define BTRFS_TRANS_DIRTY_BG_RUN 1
|
2015-10-01 12:55:18 -04:00
|
|
|
#define BTRFS_TRANS_CACHE_ENOSPC 2
|
2015-09-24 10:46:10 -04:00
|
|
|
|
2007-03-22 15:59:16 -04:00
|
|
|
struct btrfs_transaction {
|
|
|
|
u64 transid;
|
2013-05-15 07:48:27 +00:00
|
|
|
/*
|
|
|
|
* total external writers(USERSPACE/START/ATTACH) in this
|
|
|
|
* transaction, it must be zero before the transaction is
|
|
|
|
* being committed
|
|
|
|
*/
|
|
|
|
atomic_t num_extwriters;
|
2009-03-12 20:12:45 -04:00
|
|
|
/*
|
|
|
|
* total writers in this transaction, it must be zero before the
|
|
|
|
* transaction can end
|
|
|
|
*/
|
2011-04-11 15:45:29 -04:00
|
|
|
atomic_t num_writers;
|
2011-04-11 17:25:13 -04:00
|
|
|
atomic_t use_count;
|
2015-09-24 16:17:39 -04:00
|
|
|
atomic_t pending_ordered;
|
2009-03-12 20:12:45 -04:00
|
|
|
|
2015-09-24 10:46:10 -04:00
|
|
|
unsigned long flags;
|
btrfs: Fix out-of-space bug
Btrfs will report NO_SPACE when we create and remove files for several times,
and we can't write to filesystem until mount it again.
Steps to reproduce:
1: Create a single-dev btrfs fs with default option
2: Write a file into it to take up most fs space
3: Delete above file
4: Wait about 100s to let chunk removed
5: goto 2
Script is like following:
#!/bin/bash
# Recommend 1.2G space, too large disk will make test slow
DEV="/dev/sda16"
MNT="/mnt/tmp"
dev_size="$(lsblk -bn -o SIZE "$DEV")" || exit 2
file_size_m=$((dev_size * 75 / 100 / 1024 / 1024))
echo "Loop write ${file_size_m}M file on $((dev_size / 1024 / 1024))M dev"
for ((i = 0; i < 10; i++)); do umount "$MNT" 2>/dev/null; done
echo "mkfs $DEV"
mkfs.btrfs -f "$DEV" >/dev/null || exit 2
echo "mount $DEV $MNT"
mount "$DEV" "$MNT" || exit 2
for ((loop_i = 0; loop_i < 20; loop_i++)); do
echo
echo "loop $loop_i"
echo "dd file..."
cmd=(dd if=/dev/zero of="$MNT"/file0 bs=1M count="$file_size_m")
"${cmd[@]}" 2>/dev/null || {
# NO_SPACE error triggered
echo "dd failed: ${cmd[*]}"
exit 1
}
echo "rm file..."
rm -f "$MNT"/file0 || exit 2
for ((i = 0; i < 10; i++)); do
df "$MNT" | tail -1
sleep 10
done
done
Reason:
It is triggered by commit: 47ab2a6c689913db23ccae38349714edf8365e0a
which is used to remove empty block groups automatically, but the
reason is not in that patch. Code before works well because btrfs
don't need to create and delete chunks so many times with high
complexity.
Above bug is caused by many reason, any of them can trigger it.
Reason1:
When we remove some continuous chunks but leave other chunks after,
these disk space should be used by chunk-recreating, but in current
code, only first create will successed.
Fixed by Forrest Liu <forrestl@synology.com> in:
Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
Reason2:
contains_pending_extent() return wrong value in calculation.
Fixed by Forrest Liu <forrestl@synology.com> in:
Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
Reason3:
btrfs_check_data_free_space() try to commit transaction and retry
allocating chunk when the first allocating failed, but space_info->full
is set in first allocating, and prevent second allocating in retry.
Fixed in this patch by clear space_info->full in commit transaction.
Tested for severial times by above script.
Changelog v3->v4:
use light weight int instead of atomic_t to record have_remove_bgs in
transaction, suggested by:
Josef Bacik <jbacik@fb.com>
Changelog v2->v3:
v2 fixed the bug by adding more commit-transaction, but we
only need to reclaim space when we are really have no space for
new chunk, noticed by:
Filipe David Manana <fdmanana@gmail.com>
Actually, our code already have this type of commit-and-retry,
we only need to make it working with removed-bgs.
v3 fixed the bug with above way.
Changelog v1->v2:
v1 will introduce a new bug when delete and create chunk in same disk
space in same transaction, noticed by:
Filipe David Manana <fdmanana@gmail.com>
V2 fix this bug by commit transaction after remove block grops.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Suggested-by: Filipe David Manana <fdmanana@gmail.com>
Suggested-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-02-12 14:18:17 +08:00
|
|
|
|
Btrfs: make the state of the transaction more readable
We used 3 variants to track the state of the transaction, it was complex
and wasted the memory space. Besides that, it was hard to understand that
which types of the transaction handles should be blocked in each transaction
state, so the developers often made mistakes.
This patch improved the above problem. In this patch, we define 6 states
for the transaction,
enum btrfs_trans_state {
TRANS_STATE_RUNNING = 0,
TRANS_STATE_BLOCKED = 1,
TRANS_STATE_COMMIT_START = 2,
TRANS_STATE_COMMIT_DOING = 3,
TRANS_STATE_UNBLOCKED = 4,
TRANS_STATE_COMPLETED = 5,
TRANS_STATE_MAX = 6,
}
and just use 1 variant to track those state.
In order to make the blocked handle types for each state more clear,
we introduce a array:
unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = {
[TRANS_STATE_RUNNING] = 0U,
[TRANS_STATE_BLOCKED] = (__TRANS_USERSPACE |
__TRANS_START),
[TRANS_STATE_COMMIT_START] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH),
[TRANS_STATE_COMMIT_DOING] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN),
[TRANS_STATE_UNBLOCKED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
[TRANS_STATE_COMPLETED] = (__TRANS_USERSPACE |
__TRANS_START |
__TRANS_ATTACH |
__TRANS_JOIN |
__TRANS_JOIN_NOLOCK),
}
it is very intuitionistic.
Besides that, because we remove ->in_commit in transaction structure, so
the lock ->commit_lock which was used to protect it is unnecessary, remove
->commit_lock.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-05-17 03:53:43 +00:00
|
|
|
/* Be protected by fs_info->trans_lock when we want to change it. */
|
|
|
|
enum btrfs_trans_state state;
|
2007-04-19 21:01:03 -04:00
|
|
|
struct list_head list;
|
2008-01-24 16:13:08 -05:00
|
|
|
struct extent_io_tree dirty_pages;
|
2007-06-08 15:33:54 -04:00
|
|
|
unsigned long start_time;
|
2007-03-22 15:59:16 -04:00
|
|
|
wait_queue_head_t writer_wait;
|
|
|
|
wait_queue_head_t commit_wait;
|
2015-09-24 16:17:39 -04:00
|
|
|
wait_queue_head_t pending_wait;
|
2008-01-08 15:46:30 -05:00
|
|
|
struct list_head pending_snapshots;
|
2013-06-27 13:22:46 -04:00
|
|
|
struct list_head pending_chunks;
|
2014-03-13 15:42:13 -04:00
|
|
|
struct list_head switch_commits;
|
2014-11-17 15:45:48 -05:00
|
|
|
struct list_head dirty_bgs;
|
2015-04-06 12:46:08 -07:00
|
|
|
struct list_head io_bgs;
|
2015-09-15 10:07:04 -04:00
|
|
|
struct list_head dropped_roots;
|
2015-02-18 08:06:57 -08:00
|
|
|
u64 num_dirty_bgs;
|
2015-04-06 12:46:08 -07:00
|
|
|
|
|
|
|
/*
|
|
|
|
* we need to make sure block group deletion doesn't race with
|
|
|
|
* free space cache writeout. This mutex keeps them from stomping
|
|
|
|
* on each other
|
|
|
|
*/
|
|
|
|
struct mutex cache_write_mutex;
|
2014-11-17 15:45:48 -05:00
|
|
|
spinlock_t dirty_bgs_lock;
|
2015-11-27 12:16:16 +00:00
|
|
|
/* Protected by spin lock fs_info->unused_bgs_lock. */
|
2015-06-15 09:41:19 -04:00
|
|
|
struct list_head deleted_bgs;
|
2015-09-15 10:07:04 -04:00
|
|
|
spinlock_t dropped_roots_lock;
|
2009-03-13 10:10:06 -04:00
|
|
|
struct btrfs_delayed_ref_root delayed_refs;
|
2012-03-01 17:24:58 +01:00
|
|
|
int aborted;
|
2007-03-22 15:59:16 -04:00
|
|
|
};
|
|
|
|
|
2013-05-15 07:48:27 +00:00
|
|
|
#define __TRANS_FREEZABLE (1U << 0)
|
|
|
|
|
|
|
|
#define __TRANS_USERSPACE (1U << 8)
|
|
|
|
#define __TRANS_START (1U << 9)
|
|
|
|
#define __TRANS_ATTACH (1U << 10)
|
|
|
|
#define __TRANS_JOIN (1U << 11)
|
|
|
|
#define __TRANS_JOIN_NOLOCK (1U << 12)
|
2014-05-07 17:06:09 -04:00
|
|
|
#define __TRANS_DUMMY (1U << 13)
|
2013-05-15 07:48:27 +00:00
|
|
|
|
|
|
|
#define TRANS_USERSPACE (__TRANS_USERSPACE | __TRANS_FREEZABLE)
|
|
|
|
#define TRANS_START (__TRANS_START | __TRANS_FREEZABLE)
|
|
|
|
#define TRANS_ATTACH (__TRANS_ATTACH)
|
|
|
|
#define TRANS_JOIN (__TRANS_JOIN | __TRANS_FREEZABLE)
|
|
|
|
#define TRANS_JOIN_NOLOCK (__TRANS_JOIN_NOLOCK)
|
|
|
|
|
|
|
|
#define TRANS_EXTWRITERS (__TRANS_USERSPACE | __TRANS_START | \
|
|
|
|
__TRANS_ATTACH)
|
2012-09-20 01:51:59 -06:00
|
|
|
|
2014-07-31 00:43:18 +02:00
|
|
|
#define BTRFS_SEND_TRANS_STUB ((void *)1)
|
2014-03-28 17:07:27 -04:00
|
|
|
|
2007-03-16 16:20:31 -04:00
|
|
|
struct btrfs_trans_handle {
|
|
|
|
u64 transid;
|
2010-05-16 10:46:25 -04:00
|
|
|
u64 bytes_reserved;
|
Btrfs: fix -ENOSPC when finishing block group creation
While creating a block group, we often end up getting ENOSPC while updating
the chunk tree, which leads to a transaction abortion that produces a trace
like the following:
[30670.116368] WARNING: CPU: 4 PID: 20735 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x106 [btrfs]()
[30670.117777] BTRFS: Transaction aborted (error -28)
(...)
[30670.163567] Call Trace:
[30670.163906] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b
[30670.164522] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad
[30670.165171] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb
[30670.166323] [<ffffffffa035daa7>] ? __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.167213] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48
[30670.167862] [<ffffffffa035daa7>] __btrfs_abort_transaction+0x52/0x106 [btrfs]
[30670.169116] [<ffffffffa03743d7>] btrfs_create_pending_block_groups+0x101/0x130 [btrfs]
[30670.170593] [<ffffffffa038426a>] __btrfs_end_transaction+0x84/0x366 [btrfs]
[30670.171960] [<ffffffffa038455c>] btrfs_end_transaction+0x10/0x12 [btrfs]
[30670.174649] [<ffffffffa036eb6b>] btrfs_check_data_free_space+0x11f/0x27c [btrfs]
[30670.176092] [<ffffffffa039450d>] btrfs_fallocate+0x7c8/0xb96 [btrfs]
[30670.177218] [<ffffffff812459f2>] ? __this_cpu_preempt_check+0x13/0x15
[30670.178622] [<ffffffff81152447>] vfs_fallocate+0x14c/0x1de
[30670.179642] [<ffffffff8116b915>] ? __fget_light+0x2d/0x4f
[30670.180692] [<ffffffff81152863>] SyS_fallocate+0x47/0x62
[30670.186737] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[30670.187792] ---[ end trace 0373e6b491c4a8cc ]---
This is because we don't do proper space reservation for the chunk block
reserve when we have multiple tasks allocating chunks in parallel.
So block group creation has 2 phases, and the first phase essentially
checks if there is enough space in the system space_info, allocating a
new system chunk if there isn't, while the second phase updates the
device, extent and chunk trees. However, because the updates to the
chunk tree happen in the second phase, if we have N tasks, each with
its own transaction handle, allocating new chunks in parallel and if
there is only enough space in the system space_info to allocate M chunks,
where M < N, none of the tasks ends up allocating a new system chunk in
the first phase and N - M tasks will get -ENOSPC when attempting to
update the chunk tree in phase 2 if they need to COW any nodes/leafs
from the chunk tree.
Fix this by doing proper reservation in the chunk block reserve.
The issue could be reproduced by running fstests generic/038 in a loop,
which eventually triggered the problem.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-05-20 14:01:54 +01:00
|
|
|
u64 chunk_bytes_reserved;
|
2011-04-13 15:15:59 -04:00
|
|
|
unsigned long use_count;
|
2007-03-16 16:20:31 -04:00
|
|
|
unsigned long blocks_reserved;
|
|
|
|
unsigned long blocks_used;
|
2009-03-13 10:10:06 -04:00
|
|
|
unsigned long delayed_ref_updates;
|
2010-05-16 10:46:25 -04:00
|
|
|
struct btrfs_transaction *transaction;
|
|
|
|
struct btrfs_block_rsv *block_rsv;
|
2011-04-13 15:15:59 -04:00
|
|
|
struct btrfs_block_rsv *orig_rsv;
|
2012-09-20 01:51:59 -06:00
|
|
|
short aborted;
|
|
|
|
short adding_csums;
|
2012-12-18 09:16:16 -05:00
|
|
|
bool allocating_chunk;
|
2015-10-03 13:13:13 +01:00
|
|
|
bool can_flush_pending_bgs;
|
2013-09-25 21:47:45 +08:00
|
|
|
bool reloc_reserved;
|
2014-01-15 13:34:13 -05:00
|
|
|
bool sync;
|
2013-05-15 07:48:27 +00:00
|
|
|
unsigned int type;
|
2011-09-13 11:40:09 +02:00
|
|
|
/*
|
|
|
|
* this root is only needed to validate that the root passed to
|
|
|
|
* start_transaction is the same as the one passed to end_transaction.
|
|
|
|
* Subvolume quota depends on this
|
|
|
|
*/
|
|
|
|
struct btrfs_root *root;
|
2012-06-28 18:03:02 +02:00
|
|
|
struct seq_list delayed_ref_elem;
|
|
|
|
struct list_head qgroup_ref_list;
|
2012-09-11 16:57:25 -04:00
|
|
|
struct list_head new_bgs;
|
2007-03-16 16:20:31 -04:00
|
|
|
};
|
|
|
|
|
2008-01-08 15:46:30 -05:00
|
|
|
struct btrfs_pending_snapshot {
|
2008-11-17 21:02:50 -05:00
|
|
|
struct dentry *dentry;
|
2013-02-28 10:01:15 +00:00
|
|
|
struct inode *dir;
|
2008-01-08 15:46:30 -05:00
|
|
|
struct btrfs_root *root;
|
2015-11-10 18:54:00 +01:00
|
|
|
struct btrfs_root_item *root_item;
|
2010-05-16 10:48:46 -04:00
|
|
|
struct btrfs_root *snap;
|
2011-09-14 15:58:21 +02:00
|
|
|
struct btrfs_qgroup_inherit *inherit;
|
2015-11-10 18:54:03 +01:00
|
|
|
struct btrfs_path *path;
|
2010-05-16 10:48:46 -04:00
|
|
|
/* block reservation for the operation */
|
|
|
|
struct btrfs_block_rsv block_rsv;
|
2013-02-28 10:04:33 +00:00
|
|
|
u64 qgroup_reserved;
|
2016-05-19 21:18:45 -04:00
|
|
|
/* extra metadata reservation for relocation */
|
2010-05-16 10:48:46 -04:00
|
|
|
int error;
|
2010-12-20 16:04:08 +08:00
|
|
|
bool readonly;
|
2008-01-08 15:46:30 -05:00
|
|
|
struct list_head list;
|
|
|
|
};
|
|
|
|
|
2007-08-10 16:22:09 -04:00
|
|
|
static inline void btrfs_set_inode_last_trans(struct btrfs_trans_handle *trans,
|
|
|
|
struct inode *inode)
|
|
|
|
{
|
Btrfs: fix metadata inconsistencies after directory fsync
We can get into inconsistency between inodes and directory entries
after fsyncing a directory. The issue is that while a directory gets
the new dentries persisted in the fsync log and replayed at mount time,
the link count of the inode that directory entries point to doesn't
get updated, staying with an incorrect link count (smaller then the
correct value). This later leads to stale file handle errors when
accessing (including attempt to delete) some of the links if all the
other ones are removed, which also implies impossibility to delete the
parent directories, since the dentries can not be removed.
Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
when fsyncing a directory, new files aren't logged (their metadata and
dentries) nor any child directories. So this patch fixes this issue too,
since it has the same resolution as the incorrect inode link count issue
mentioned before.
This is very easy to reproduce, and the following excerpt from my test
case for xfstests shows how:
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our main test file and directory.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
mkdir $SCRATCH_MNT/mydir
# Make sure all metadata and data are durably persisted.
sync
# Add a hard link to 'foo' inside our test directory and fsync only the
# directory. The btrfs fsync implementation had a bug that caused the new
# directory entry to be visible after the fsync log replay but, the inode
# of our file remained with a link count of 1.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2
# Add a few more links and new files.
# This is just to verify nothing breaks or gives incorrect results after the
# fsync log is replayed.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
$XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2
# Add some subdirectories and new files and links to them. This is to verify
# that after fsyncing our top level directory 'mydir', all the subdirectories
# and their files/links are registered in the fsync log and exist after the
# fsync log is replayed.
mkdir -p $SCRATCH_MNT/mydir/x/y/z
ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
touch $SCRATCH_MNT/mydir/x/y/z/qwerty
# Now fsync only our top directory.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir
# And fsync now our new file named 'hello', just to verify later that it has
# the expected content and that the previous fsync on the directory 'mydir' had
# no bad influence on this fsync.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Verify the content of our file 'foo' remains the same as before, 8192 bytes,
# all with the value 0xaa.
echo "File 'foo' content after log replay:"
od -t x1 $SCRATCH_MNT/foo
# Remove the first name of our inode. Because of the directory fsync bug, the
# inode's link count was 1 instead of 5, so removing the 'foo' name ended up
# deleting the inode and the other names became stale directory entries (still
# visible to applications). Attempting to remove or access the remaining
# dentries pointing to that inode resulted in stale file handle errors and
# made it impossible to remove the parent directories since it was impossible
# for them to become empty.
echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
rm -f $SCRATCH_MNT/foo
# Now verify that all files, links and directories created before fsyncing our
# directory exist after the fsync log was replayed.
[ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
[ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
[ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
[ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
[ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
echo "Link mydir/x/y/foo_y_link is missing"
[ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
echo "Link mydir/x/y/z/foo_z_link is missing"
[ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
echo "File mydir/x/y/z/qwerty is missing"
# We expect our file here to have a size of 64Kb and all the bytes having the
# value 0xff.
echo "file 'hello' content after log replay:"
od -t x1 $SCRATCH_MNT/hello
# Now remove all files/links, under our test directory 'mydir', and verify we
# can remove all the directories.
rm -f $SCRATCH_MNT/mydir/x/y/z/*
rmdir $SCRATCH_MNT/mydir/x/y/z
rm -f $SCRATCH_MNT/mydir/x/y/*
rmdir $SCRATCH_MNT/mydir/x/y
rmdir $SCRATCH_MNT/mydir/x
rm -f $SCRATCH_MNT/mydir/*
rmdir $SCRATCH_MNT/mydir
# An fsck, run by the fstests framework everytime a test finishes, also detected
# the inconsistency and printed the following error message:
#
# root 5 inode 257 errors 2001, no inode item, link count wrong
# unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
# unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
status=0
exit
The expected golden output for the test is:
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 65536/65536 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File 'foo' content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000
file 'foo' link count after log replay: 5
file 'hello' content after log replay:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0200000
Which is the output after this patch and when running the test against
ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
output is:
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 65536/65536 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File 'foo' content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000
file 'foo' link count after log replay: 1
Link mydir/foo_2 is missing
Link mydir/foo_3 is missing
Link mydir/x/y/foo_y_link is missing
Link mydir/x/y/z/foo_z_link is missing
File mydir/x/y/z/qwerty is missing
file 'hello' content after log replay:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0200000
rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty
Fsck, without this fix, also complains about the wrong link count:
root 5 inode 257 errors 2001, no inode item, link count wrong
unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
So fix this by logging the inodes that the dentries point to when
fsyncing a directory.
A test case for xfstests follows.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-20 17:19:46 +00:00
|
|
|
spin_lock(&BTRFS_I(inode)->lock);
|
2007-08-10 16:22:09 -04:00
|
|
|
BTRFS_I(inode)->last_trans = trans->transaction->transid;
|
2009-10-13 13:21:08 -04:00
|
|
|
BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
|
2012-08-29 01:07:55 -06:00
|
|
|
BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
|
Btrfs: fix metadata inconsistencies after directory fsync
We can get into inconsistency between inodes and directory entries
after fsyncing a directory. The issue is that while a directory gets
the new dentries persisted in the fsync log and replayed at mount time,
the link count of the inode that directory entries point to doesn't
get updated, staying with an incorrect link count (smaller then the
correct value). This later leads to stale file handle errors when
accessing (including attempt to delete) some of the links if all the
other ones are removed, which also implies impossibility to delete the
parent directories, since the dentries can not be removed.
Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
when fsyncing a directory, new files aren't logged (their metadata and
dentries) nor any child directories. So this patch fixes this issue too,
since it has the same resolution as the incorrect inode link count issue
mentioned before.
This is very easy to reproduce, and the following excerpt from my test
case for xfstests shows how:
_scratch_mkfs >> $seqres.full 2>&1
_init_flakey
_mount_flakey
# Create our main test file and directory.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
mkdir $SCRATCH_MNT/mydir
# Make sure all metadata and data are durably persisted.
sync
# Add a hard link to 'foo' inside our test directory and fsync only the
# directory. The btrfs fsync implementation had a bug that caused the new
# directory entry to be visible after the fsync log replay but, the inode
# of our file remained with a link count of 1.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2
# Add a few more links and new files.
# This is just to verify nothing breaks or gives incorrect results after the
# fsync log is replayed.
ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
$XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2
# Add some subdirectories and new files and links to them. This is to verify
# that after fsyncing our top level directory 'mydir', all the subdirectories
# and their files/links are registered in the fsync log and exist after the
# fsync log is replayed.
mkdir -p $SCRATCH_MNT/mydir/x/y/z
ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
touch $SCRATCH_MNT/mydir/x/y/z/qwerty
# Now fsync only our top directory.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir
# And fsync now our new file named 'hello', just to verify later that it has
# the expected content and that the previous fsync on the directory 'mydir' had
# no bad influence on this fsync.
$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello
# Simulate a crash/power loss.
_load_flakey_table $FLAKEY_DROP_WRITES
_unmount_flakey
_load_flakey_table $FLAKEY_ALLOW_WRITES
_mount_flakey
# Verify the content of our file 'foo' remains the same as before, 8192 bytes,
# all with the value 0xaa.
echo "File 'foo' content after log replay:"
od -t x1 $SCRATCH_MNT/foo
# Remove the first name of our inode. Because of the directory fsync bug, the
# inode's link count was 1 instead of 5, so removing the 'foo' name ended up
# deleting the inode and the other names became stale directory entries (still
# visible to applications). Attempting to remove or access the remaining
# dentries pointing to that inode resulted in stale file handle errors and
# made it impossible to remove the parent directories since it was impossible
# for them to become empty.
echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
rm -f $SCRATCH_MNT/foo
# Now verify that all files, links and directories created before fsyncing our
# directory exist after the fsync log was replayed.
[ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
[ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
[ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
[ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
[ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
echo "Link mydir/x/y/foo_y_link is missing"
[ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
echo "Link mydir/x/y/z/foo_z_link is missing"
[ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
echo "File mydir/x/y/z/qwerty is missing"
# We expect our file here to have a size of 64Kb and all the bytes having the
# value 0xff.
echo "file 'hello' content after log replay:"
od -t x1 $SCRATCH_MNT/hello
# Now remove all files/links, under our test directory 'mydir', and verify we
# can remove all the directories.
rm -f $SCRATCH_MNT/mydir/x/y/z/*
rmdir $SCRATCH_MNT/mydir/x/y/z
rm -f $SCRATCH_MNT/mydir/x/y/*
rmdir $SCRATCH_MNT/mydir/x/y
rmdir $SCRATCH_MNT/mydir/x
rm -f $SCRATCH_MNT/mydir/*
rmdir $SCRATCH_MNT/mydir
# An fsck, run by the fstests framework everytime a test finishes, also detected
# the inconsistency and printed the following error message:
#
# root 5 inode 257 errors 2001, no inode item, link count wrong
# unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
# unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
status=0
exit
The expected golden output for the test is:
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 65536/65536 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File 'foo' content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000
file 'foo' link count after log replay: 5
file 'hello' content after log replay:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0200000
Which is the output after this patch and when running the test against
ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
output is:
wrote 8192/8192 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
wrote 65536/65536 bytes at offset 0
XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
File 'foo' content after log replay:
0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
*
0020000
file 'foo' link count after log replay: 1
Link mydir/foo_2 is missing
Link mydir/foo_3 is missing
Link mydir/x/y/foo_y_link is missing
Link mydir/x/y/z/foo_z_link is missing
File mydir/x/y/z/qwerty is missing
file 'hello' content after log replay:
0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
*
0200000
rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty
Fsck, without this fix, also complains about the wrong link count:
root 5 inode 257 errors 2001, no inode item, link count wrong
unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
So fix this by logging the inodes that the dentries point to when
fsyncing a directory.
A test case for xfstests follows.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-03-20 17:19:46 +00:00
|
|
|
spin_unlock(&BTRFS_I(inode)->lock);
|
2007-08-10 16:22:09 -04:00
|
|
|
}
|
|
|
|
|
2015-04-20 09:53:50 +08:00
|
|
|
/*
|
|
|
|
* Make qgroup codes to skip given qgroupid, means the old/new_roots for
|
|
|
|
* qgroup won't contain the qgroupid in it.
|
|
|
|
*/
|
|
|
|
static inline void btrfs_set_skip_qgroup(struct btrfs_trans_handle *trans,
|
|
|
|
u64 qgroupid)
|
|
|
|
{
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs;
|
|
|
|
|
|
|
|
delayed_refs = &trans->transaction->delayed_refs;
|
|
|
|
WARN_ON(delayed_refs->qgroup_to_skip);
|
|
|
|
delayed_refs->qgroup_to_skip = qgroupid;
|
|
|
|
}
|
|
|
|
|
|
|
|
static inline void btrfs_clear_skip_qgroup(struct btrfs_trans_handle *trans)
|
|
|
|
{
|
|
|
|
struct btrfs_delayed_ref_root *delayed_refs;
|
|
|
|
|
|
|
|
delayed_refs = &trans->transaction->delayed_refs;
|
|
|
|
WARN_ON(!delayed_refs->qgroup_to_skip);
|
|
|
|
delayed_refs->qgroup_to_skip = 0;
|
|
|
|
}
|
|
|
|
|
2007-03-22 15:59:16 -04:00
|
|
|
int btrfs_end_transaction(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
|
|
|
struct btrfs_trans_handle *btrfs_start_transaction(struct btrfs_root *root,
|
2015-09-22 20:59:15 +00:00
|
|
|
unsigned int num_items);
|
2015-11-13 23:57:16 +00:00
|
|
|
struct btrfs_trans_handle *btrfs_start_transaction_fallback_global_rsv(
|
|
|
|
struct btrfs_root *root,
|
|
|
|
unsigned int num_items,
|
|
|
|
int min_factor);
|
Btrfs: improve the noflush reservation
In some places(such as: evicting inode), we just can not flush the reserved
space of delalloc, flushing the delayed directory index and delayed inode
is OK, but we don't try to flush those things and just go back when there is
no enough space to be reserved. This patch fixes this problem.
We defined 3 types of the flush operations: NO_FLUSH, FLUSH_LIMIT and FLUSH_ALL.
If we can in the transaction, we should not flush anything, or the deadlock
would happen, so use NO_FLUSH. If we flushing the reserved space of delalloc
would cause deadlock, use FLUSH_LIMIT. In the other cases, FLUSH_ALL is used,
and we will flush all things.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2012-10-16 11:33:38 +00:00
|
|
|
struct btrfs_trans_handle *btrfs_start_transaction_lflush(
|
2015-09-22 20:59:15 +00:00
|
|
|
struct btrfs_root *root,
|
|
|
|
unsigned int num_items);
|
2011-04-13 12:54:33 -04:00
|
|
|
struct btrfs_trans_handle *btrfs_join_transaction(struct btrfs_root *root);
|
|
|
|
struct btrfs_trans_handle *btrfs_join_transaction_nolock(struct btrfs_root *root);
|
Btrfs: fix orphan transaction on the freezed filesystem
With the following debug patch:
static int btrfs_freeze(struct super_block *sb)
{
+ struct btrfs_fs_info *fs_info = btrfs_sb(sb);
+ struct btrfs_transaction *trans;
+
+ spin_lock(&fs_info->trans_lock);
+ trans = fs_info->running_transaction;
+ if (trans) {
+ printk("Transid %llu, use_count %d, num_writer %d\n",
+ trans->transid, atomic_read(&trans->use_count),
+ atomic_read(&trans->num_writers));
+ }
+ spin_unlock(&fs_info->trans_lock);
return 0;
}
I found there was a orphan transaction after the freeze operation was done.
It is because the transaction may not be committed when the transaction handle
end even though it is the last handle of the current transaction. This design
avoid committing the transaction frequently, but also introduce the above
problem.
So I add btrfs_attach_transaction() which can catch the current transaction
and commit it. If there is no transaction, it will return ENOENT, and do not
anything.
This function also can be used to instead of btrfs_join_transaction_freeze()
because it don't increase the writer counter and don't start a new transaction,
so it also can fix the deadlock between sync and freeze.
Besides that, it is used to instead of btrfs_join_transaction() in
transaction_kthread(), because if there is no transaction, the transaction
kthread needn't anything.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
2012-09-20 01:54:00 -06:00
|
|
|
struct btrfs_trans_handle *btrfs_attach_transaction(struct btrfs_root *root);
|
Btrfs: fix uncompleted transaction
In some cases, we need commit the current transaction, but don't want
to start a new one if there is no running transaction, so we introduce
the function - btrfs_attach_transaction(), which can catch the current
transaction, and return -ENOENT if there is no running transaction.
But no running transaction doesn't mean the current transction completely,
because we removed the running transaction before it completes. In some
cases, it doesn't matter. But in some special cases, such as freeze fs, we
hope the transaction is fully on disk, it will introduce some bugs, for
example, we may feeze the fs and dump the data in the disk, if the transction
doesn't complete, we would dump inconsistent data. So we need fix the above
problem for those cases.
We fixes this problem by introducing a function:
btrfs_attach_transaction_barrier()
if we hope all the transaction is fully on the disk, even they are not
running, we can use this function.
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Josef Bacik <jbacik@fusionio.com>
2013-02-20 09:17:06 +00:00
|
|
|
struct btrfs_trans_handle *btrfs_attach_transaction_barrier(
|
|
|
|
struct btrfs_root *root);
|
2011-04-13 12:54:33 -04:00
|
|
|
struct btrfs_trans_handle *btrfs_start_ioctl_transaction(struct btrfs_root *root);
|
Btrfs: add START_SYNC, WAIT_SYNC ioctls
START_SYNC will start a sync/commit, but not wait for it to
complete. Any modification started after the ioctl returns is
guaranteed not to be included in the commit. If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.
WAIT_SYNC will wait for any in-progress commit to complete. If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately. If the specified transaction doesn't exist, it
returns EINVAL.
If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it's commit to disk.
If there is no currently committing transaction, it returns
success.
These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.
Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.
Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result. However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well. Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2010-10-29 15:41:32 -04:00
|
|
|
int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid);
|
2007-06-08 15:33:54 -04:00
|
|
|
|
2013-07-25 15:11:47 -04:00
|
|
|
void btrfs_add_dead_root(struct btrfs_root *root);
|
2013-01-31 18:21:12 +00:00
|
|
|
int btrfs_defrag_root(struct btrfs_root *root);
|
2013-03-12 15:13:28 +00:00
|
|
|
int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root);
|
2007-10-15 16:14:19 -04:00
|
|
|
int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2010-10-29 15:37:34 -04:00
|
|
|
int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root,
|
|
|
|
int wait_for_unblock);
|
2008-06-25 16:01:31 -04:00
|
|
|
int btrfs_end_transaction_throttle(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2010-05-16 10:49:58 -04:00
|
|
|
int btrfs_should_end_transaction(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2008-07-29 16:15:18 -04:00
|
|
|
void btrfs_throttle(struct btrfs_root *root);
|
Btrfs: Mixed back reference (FORWARD ROLLING FORMAT CHANGE)
This commit introduces a new kind of back reference for btrfs metadata.
Once a filesystem has been mounted with this commit, IT WILL NO LONGER
BE MOUNTABLE BY OLDER KERNELS.
When a tree block in subvolume tree is cow'd, the reference counts of all
extents it points to are increased by one. At transaction commit time,
the old root of the subvolume is recorded in a "dead root" data structure,
and the btree it points to is later walked, dropping reference counts
and freeing any blocks where the reference count goes to 0.
The increments done during cow and decrements done after commit cancel out,
and the walk is a very expensive way to go about freeing the blocks that
are no longer referenced by the new btree root. This commit reduces the
transaction overhead by avoiding the need for dead root records.
When a non-shared tree block is cow'd, we free the old block at once, and the
new block inherits old block's references. When a tree block with reference
count > 1 is cow'd, we increase the reference counts of all extents
the new block points to by one, and decrease the old block's reference count by
one.
This dead tree avoidance code removes the need to modify the reference
counts of lower level extents when a non-shared tree block is cow'd.
But we still need to update back ref for all pointers in the block.
This is because the location of the block is recorded in the back ref
item.
We can solve this by introducing a new type of back ref. The new
back ref provides information about pointer's key, level and in which
tree the pointer lives. This information allow us to find the pointer
by searching the tree. The shortcoming of the new back ref is that it
only works for pointers in tree blocks referenced by their owner trees.
This is mostly a problem for snapshots, where resolving one of these
fuzzy back references would be O(number_of_snapshots) and quite slow.
The solution used here is to use the fuzzy back references in the common
case where a given tree block is only referenced by one root,
and use the full back references when multiple roots have a reference
on a given block.
This commit adds per subvolume red-black tree to keep trace of cached
inodes. The red-black tree helps the balancing code to find cached
inodes whose inode numbers within a given range.
This commit improves the balancing code by introducing several data
structures to keep the state of balancing. The most important one
is the back ref cache. It caches how the upper level tree blocks are
referenced. This greatly reduce the overhead of checking back ref.
The improved balancing code scales significantly better with a large
number of snapshots.
This is a very large commit and was written in a number of
pieces. But, they depend heavily on the disk format change and were
squashed together to make sure git bisect didn't end up in a
bad state wrt space balancing or the format change.
Signed-off-by: Yan Zheng <zheng.yan@oracle.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2009-06-10 10:45:14 -04:00
|
|
|
int btrfs_record_root_in_trans(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2009-10-13 13:29:19 -04:00
|
|
|
int btrfs_write_marked_extents(struct btrfs_root *root,
|
2009-11-12 09:33:26 +00:00
|
|
|
struct extent_io_tree *dirty_pages, int mark);
|
2009-10-13 13:29:19 -04:00
|
|
|
int btrfs_wait_marked_extents(struct btrfs_root *root,
|
2009-11-12 09:33:26 +00:00
|
|
|
struct extent_io_tree *dirty_pages, int mark);
|
2010-05-16 10:49:58 -04:00
|
|
|
int btrfs_transaction_blocked(struct btrfs_fs_info *info);
|
2009-07-30 10:04:48 -04:00
|
|
|
int btrfs_transaction_in_commit(struct btrfs_fs_info *info);
|
2013-09-30 11:36:38 -04:00
|
|
|
void btrfs_put_transaction(struct btrfs_transaction *transaction);
|
2014-02-05 15:26:17 +01:00
|
|
|
void btrfs_apply_pending_changes(struct btrfs_fs_info *fs_info);
|
2015-09-15 10:07:04 -04:00
|
|
|
void btrfs_add_dropped_root(struct btrfs_trans_handle *trans,
|
|
|
|
struct btrfs_root *root);
|
2007-03-16 16:20:31 -04:00
|
|
|
#endif
|