If there are multiple segments in one section, f2fs_gc reads their SSA blocks,
which have contiguous addresses, one by one. This may hurt performance, so let's
read ahead the SSA blocks by merging the multiple read requests.
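A minimal sketch of the merging idea, assuming the SSA block addresses of a section's segments are already known and sorted. `struct read_req` and `merge_contiguous_reads()` are hypothetical names for illustration, not the f2fs API: runs of contiguous addresses collapse into one request instead of one request per block.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read request: read `len` blocks starting at `start`. */
struct read_req {
	unsigned long start;
	unsigned long len;
};

/*
 * Coalesce sorted block addresses into as few requests as possible.
 * Returns the number of requests written to `reqs` (at most n).
 */
size_t merge_contiguous_reads(const unsigned long *blkaddr, size_t n,
			      struct read_req *reqs)
{
	size_t nr = 0;

	assert(n == 0 || (blkaddr != NULL && reqs != NULL));
	for (size_t i = 0; i < n; i++) {
		/* Extend the current request if this block directly follows it. */
		if (nr > 0 &&
		    reqs[nr - 1].start + reqs[nr - 1].len == blkaddr[i]) {
			reqs[nr - 1].len++;
			continue;
		}
		/* Otherwise start a new request at this block. */
		reqs[nr].start = blkaddr[i];
		reqs[nr].len = 1;
		nr++;
	}
	return nr;
}
```

For example, addresses 100-103 and 200-201 become two requests instead of six single-block reads.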
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds a sysfs entry to control dir_level, which is used by large directories.
The description of this entry is:
dir_level This parameter controls the directory level to
support large directories. If a directory has a
large number of files, increasing this dir_level value
can reduce the file lookup latency. Otherwise,
decrease this value to reduce the space overhead.
The default value is 0.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces an i_dir_level field to support large directories.
Previously, f2fs maintained multi-level hash tables to find a dentry quickly
among a bunch of child dentries in a directory, and the hash tables form the
following tree structure.
In Documentation/filesystems/f2fs.txt,
----------------------
A : bucket
B : block
N : MAX_DIR_HASH_DEPTH
----------------------
level #0 | A(2B)
|
level #1 | A(2B) - A(2B)
|
level #2 | A(2B) - A(2B) - A(2B) - A(2B)
. | . . . .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . .
level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
But if we can guess that a directory will hold a large number of child files,
we don't need to traverse the tree from level #0 to #N every time.
Since the lower-level tables contain a relatively small number of dentries,
the miss ratio for the target dentry is likely to be high.
In order to avoid that, we can configure the hash tables sparsely from level #0
like this.
level #0 | A(2B) - A(2B) - A(2B) - A(2B)
level #1 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . .
level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B)
. | . . . .
level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B)
With this structure, we can skip the ineffective tree searches in the
lower-level hash tables.
This patch adds just the facility for this by introducing i_dir_level in
f2fs_inode.
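A minimal sketch of how a per-directory level could shift the bucket counts in the diagrams above. The constant values and the function names `dir_buckets()` / `bucket_blocks()` are assumptions for illustration; check include/linux/f2fs_fs.h and fs/f2fs/dir.c for the real definitions. With dir_level = 2, level #0 starts with four buckets, matching the second diagram.

```c
#include <assert.h>

/* Assumed values for illustration only. */
#define MAX_DIR_HASH_DEPTH 63
#define MAX_DIR_BUCKETS (1u << ((MAX_DIR_HASH_DEPTH / 2) - 1))

/* Buckets at `level`: doubles per level, starting dir_level levels deeper. */
unsigned int dir_buckets(unsigned int level, unsigned int dir_level)
{
	if (level + dir_level < MAX_DIR_HASH_DEPTH / 2)
		return 1u << (level + dir_level);
	return MAX_DIR_BUCKETS;   /* capped in the lower half of the tree */
}

/* Blocks per bucket: A(2B) above level #N/2, A(4B) from there down. */
unsigned int bucket_blocks(unsigned int level)
{
	assert(level <= MAX_DIR_HASH_DEPTH);
	return level < MAX_DIR_HASH_DEPTH / 2 ? 2 : 4;
}
```

The trade-off is the one the sysfs description states: a larger dir_level means fewer useless probes in sparse upper levels, at the cost of more preallocated buckets.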
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
It turns out that a bit operation like find_next_bit is not always fast enough
for f2fs_find_entry.
Instead, it is simple and fast to traverse each dentry.
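A minimal sketch of the linear traversal, assuming a toy dentry block layout; `NR_SLOTS`, `SLOT_LEN`, `struct dentry_block` and `find_in_block()` are hypothetical and much smaller than the real on-disk format. Each slot's bit is tested in order, and a matching-hash entry is skipped by the number of slots its name occupies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical geometry; real f2fs dentry blocks are much larger. */
#define NR_SLOTS 16
#define SLOT_LEN 8

struct dentry_slot {
	uint32_t hash;
	uint16_t name_len;
};

struct dentry_block {
	uint8_t bitmap[(NR_SLOTS + 7) / 8];
	struct dentry_slot dentry[NR_SLOTS];
	char filename[NR_SLOTS * SLOT_LEN];  /* a name may span several slots */
};

static bool slot_used(const struct dentry_block *b, int pos)
{
	return b->bitmap[pos / 8] & (1u << (pos % 8));
}

/* Return the slot index of `name`, or -1 if it is not in this block. */
int find_in_block(const struct dentry_block *b, const char *name,
		  uint16_t len, uint32_t hash)
{
	int pos = 0;

	assert(len > 0);
	while (pos < NR_SLOTS) {
		const struct dentry_slot *de = &b->dentry[pos];
		int slots;

		/* Plain per-slot bit test instead of find_next_bit(). */
		if (!slot_used(b, pos)) {
			pos++;
			continue;
		}
		if (de->hash == hash && de->name_len == len &&
		    pos * SLOT_LEN + len <= NR_SLOTS * SLOT_LEN &&
		    memcmp(&b->filename[pos * SLOT_LEN], name, len) == 0)
			return pos;

		/* Skip every slot occupied by this entry's name (at least one). */
		slots = (de->name_len + SLOT_LEN - 1) / SLOT_LEN;
		pos += slots > 0 ? slots : 1;
	}
	return -1;
}
```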
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The stat_show function only shows the current status of f2fs.
So, we can remove all the locks in it.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch introduces a radix tree for the list of free_nids, which enhances
the performance on free nid management.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce a helper macro, on_build_free_nids(), which just uses build_lock
to determine whether building free nids is in progress, so that we can remove
the on_build_free_nids field from f2fs_sb_info.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
[Jaegeuk Kim: remove an unnecessary white line removal]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The nat cache entry maintains a status indicating whether it is checkpointed
or not. So, if a new cache entry is loaded from the last checkpoint,
nat_entry->checkpointed should be true.
If the cache entry is modified and becomes dirty, nat_entry->checkpointed
should be false.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
At the end of the recovery procedure, write_checkpoint is called and updates
the cp count, which is managed by f2fs stat.
But previously, f2fs_build_stats() was called after the recovery procedure,
which results in:
BUG: unable to handle kernel NULL pointer dereference at 000000000000012c
IP: [<ffffffffa03b1030>] write_checkpoint+0x720/0xbc0 [f2fs]
Call Trace:
[<ffffffff810a6b44>] ? mark_held_locks+0x74/0x140
[<ffffffff8109a3e0>] ? __init_waitqueue_head+0x60/0x60
[<ffffffffa03bf036>] recover_fsync_data+0x656/0xf20 [f2fs]
[<ffffffff812ee3eb>] ? security_d_instantiate+0x1b/0x30
[<ffffffffa03aeb4d>] f2fs_fill_super+0x94d/0xa00 [f2fs]
[<ffffffff811a9825>] mount_bdev+0x1a5/0x1f0
[<ffffffff8114915e>] ? __get_free_pages+0xe/0x40
[<ffffffffa03ae200>] ? f2fs_remount+0x130/0x130 [f2fs]
[<ffffffffa03aa575>] f2fs_mount+0x15/0x20 [f2fs]
[<ffffffff811aa713>] mount_fs+0x43/0x1b0
[<ffffffff811c7124>] vfs_kern_mount+0x74/0x160
[<ffffffff811c5cb1>] ? __get_fs_type+0x51/0x60
[<ffffffff811c9727>] do_mount+0x237/0xb50
[<ffffffff811c936a>] ? copy_mount_options+0x3a/0x170
So, this patch changes the order of recover_fsync_data() and
f2fs_build_stats(), so that the stats are built before recovery starts.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When f2fs_write_data_page is called from the page reclaim path, we should
not write the page, in order to keep enough free segments for the worst-case
scenario. Otherwise, f2fs can run out of free segments while gc is being
conducted, resulting in:
------------[ cut here ]------------
kernel BUG at /home/zeus/f2fs_test/src/fs/f2fs/segment.c:565!
RIP: 0010:[<ffffffffa02c3b11>] [<ffffffffa02c3b11>] new_curseg+0x331/0x340 [f2fs]
Call Trace:
allocate_segment_by_default+0x204/0x280 [f2fs]
allocate_data_block+0x108/0x210 [f2fs]
write_data_page+0x8a/0xc0 [f2fs]
do_write_data_page+0xe1/0x2a0 [f2fs]
move_data_page+0x8a/0xf0 [f2fs]
f2fs_gc+0x446/0x970 [f2fs]
f2fs_balance_fs+0xb6/0xd0 [f2fs]
f2fs_write_begin+0x50/0x350 [f2fs]
? unlock_page+0x27/0x30
? unlock_page+0x27/0x30
generic_file_buffered_write+0x10a/0x280
? file_update_time+0xa3/0xf0
__generic_file_aio_write+0x1c8/0x3d0
? generic_file_aio_write+0x52/0xb0
? generic_file_aio_write+0x52/0xb0
generic_file_aio_write+0x65/0xb0
do_sync_write+0x5a/0x90
vfs_write+0xc5/0x1f0
SyS_write+0x55/0xa0
system_call_fastpath+0x16/0x1b
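A minimal sketch of one reading of the rule above: when the write comes from reclaim, keep the page dirty and decline to write it, so reclaim does not consume the free segments that gc relies on. The structures are pared-down, hypothetical stand-ins and the `WRITEPAGE_*` return codes are invented for illustration; the real f2fs path may add further conditions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, pared-down stand-ins for the kernel structures. */
struct writeback_control {
	bool for_reclaim;   /* set when called from the page reclaim path */
};

struct page {
	bool dirty;
	bool written;
};

enum { WRITEPAGE_DONE = 0, WRITEPAGE_SKIPPED = 1 };

int sketch_write_data_page(struct page *page, struct writeback_control *wbc)
{
	assert(page->dirty);
	if (wbc->for_reclaim)
		return WRITEPAGE_SKIPPED;   /* page stays dirty for a later flush */

	page->written = true;           /* normal writeback path */
	page->dirty = false;
	return WRITEPAGE_DONE;
}
```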
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch shows the checkpoint count in the f2fs status.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch helps us clean up the readahead code by merging the ra_{sit,nat}_pages
functions into ra_meta_pages.
Additionally, the new function is used to read ahead cp blocks in
recover_orphan_inodes.
Change log from v1:
o fix an infinite loop bug pointed out by Jaegeuk Kim.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, without the protection of the inode mutex, f2fs_falloc and other
data-related operations could interfere with each other.
So let's use the inode mutex to keep f2fs_falloc atomic.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If f2fs has entered an erroneous checkpoint status, it should skip writing meta
pages instead of redirtying them.
Otherwise, the partition cannot be unmounted even though f2fs is in read-only
status.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When a new directory is allocated and an error occurs, we should truncate the
preallocated dentry pages too.
This bug was reported by Andrey Tsyvarev a while ago, as follows.
mkdir()->
f2fs_add_link()->
init_inode_metadata()->
f2fs_init_acl()->
f2fs_get_acl()->
f2fs_getxattr()->
read_all_xattrs() fails.
Also, a BUG_ON was triggered after the fault in
mkdir()->
f2fs_add_link()->
init_inode_metadata()->
remove_inode_page() ->
f2fs_bug_on(inode->i_blocks != 0 && inode->i_blocks != 1);
But the previous patch did not fully resolve that bug, so the following bug
report was also submitted.
kernel BUG at fs/f2fs/inode.c:274!
Call Trace:
[<ffffffff811fde03>] evict+0xa3/0x1a0
[<ffffffff811fe615>] iput+0xf5/0x180
[<ffffffffa01c7f63>] f2fs_mkdir+0xf3/0x150 [f2fs]
[<ffffffff811f2a77>] vfs_mkdir+0xb7/0x160
[<ffffffff811f36bf>] SyS_mkdir+0x5f/0xc0
[<ffffffff81680769>] system_call_fastpath+0x16/0x1b
Finally, this patch resolves all the issues as follows.
If an error occurs after make_empty_dir(),
1. truncate_inode_pages()
The make_bad_inode() prior to iput() changes i_mode to S_IFREG, which
means that f2fs will not decrement fi->dirty_dents during f2fs_evict_inode.
By calling truncate_inode_pages() here instead, we can do that.
2. truncate_blocks()
The preallocated dentry pages are truncated here to sync i_blocks.
3. remove_dirty_dir_inode()
Remove this directory inode from the dirty directory list.
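A minimal sketch of the goto-style unwind order described above. The `fake_*` functions are hypothetical stand-ins that only record that they ran, and `later_err` simulates any failure that happens after make_empty_dir(); this is not the actual f2fs code.

```c
#include <assert.h>
#include <string.h>

/* Records which steps ran, in order, so the unwind order is visible. */
static char trace[64];

static void record(const char *step)
{
	assert(strlen(trace) + strlen(step) + 2 <= sizeof(trace));
	strcat(trace, step);
	strcat(trace, ";");
}

/* Hypothetical stand-ins: they only record that they were called. */
static void fake_make_empty_dir(void)         { record("empty_dir"); }
static void fake_truncate_inode_pages(void)   { record("pages"); }
static void fake_truncate_blocks(void)        { record("blocks"); }
static void fake_remove_dirty_dir_inode(void) { record("dirty"); }

/* `later_err` simulates a failure after make_empty_dir() (0 = success). */
int sketch_init_new_dir(int later_err)
{
	trace[0] = '\0';
	fake_make_empty_dir();          /* dentry pages are preallocated here */
	if (later_err)
		goto error;
	return 0;

error:
	fake_truncate_inode_pages();    /* 1. drop cached dentry pages */
	fake_truncate_blocks();         /* 2. free preallocated blocks, sync i_blocks */
	fake_remove_dirty_dir_inode();  /* 3. unlink from the dirty dir list */
	return later_err;
}
```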
Reported-and-Tested-by: Andrey Tsyvarev <tsyvarev@ispras.ru>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch modifies flow a little bit to avoid the following build warnings.
src/fs/f2fs/recovery.c: In function ‘check_index_in_prev_nodes’:
src/fs/f2fs/recovery.c:288:51: warning: ‘sum.<U5390>.<U52f8>.ofs_in_node’ may
be used uninitialized in this function [-Wmaybe-uninitialized]
src/fs/f2fs/recovery.c:260:23: warning: ‘sum.nid’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This is the erroneous scenario.
                                    i_size  on-disk i_size  i_blocks
__f2fs_add_link()                   4096    4096            2
  get_new_data_page                 8192    4096            3
  -ENOSPC = init_inode_metadata
checkpoint                          -       4096            3
POR and reboot
__f2fs_add_link()                   4096    4096            3
  page = get_new_data_page (page->index = 1 by NEW_ADDR)
  add a dentry to the page successfully
f2fs_rmdir()
  f2fs_empty_dir()                  4096    4096            3
  f2fs_unlink() proceeds, since no valid dentry is found within i_size = 4096.
But there is still one dentry in page->index = 1.
So this patch moves the code that writes dir->i_size into the on-disk i_size,
in order to keep dir's i_size, on-disk i_size, and i_blocks in sync.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch changes the use of bi_private to remove pointer chasing for sbi.
Previously, we had a bi_private structure, but it required memory allocation.
So this patch stores the sbi pointer in bi_private and adds a completion
pointer to the sbi.
This requires no memory allocation and makes good use of bi_private.
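A minimal sketch of the idea: the end-io handler recovers sbi directly from bi_private and signals the completion that the sbi points to. `struct bio`, `struct sb_info` and `struct completion` are heavily simplified, hypothetical stand-ins (the completion is just a flag here), not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel types. */
struct completion {
	bool done;
};

struct sb_info {
	struct completion *wait_io;  /* set by a submitter that wants to wait */
	int io_errors;
};

struct bio {
	void *bi_private;                      /* points at struct sb_info */
	void (*bi_end_io)(struct bio *bio, int err);
};

/* End-io handler: sbi comes straight from bi_private, no extra allocation. */
static void sketch_end_io(struct bio *bio, int err)
{
	struct sb_info *sbi = bio->bi_private;

	if (err)
		sbi->io_errors++;
	if (sbi->wait_io) {
		sbi->wait_io->done = true;   /* stands in for complete() */
		sbi->wait_io = NULL;
	}
}

void sketch_init_bio(struct bio *bio, struct sb_info *sbi)
{
	assert(bio != NULL && sbi != NULL);
	bio->bi_private = sbi;
	bio->bi_end_io = sketch_end_io;
}
```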
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If a new xattr node page was allocated and its inode is fsynced, we should
recover the xattr node page during the roll-forward process after a power cut.
But previously, f2fs didn't handle that case, resulting in the following kernel
panic, reported by Tom Li.
BUG: unable to handle kernel paging request at ffffc9001c861a98
IP: [<ffffffffa0295236>] check_index_in_prev_nodes+0x86/0x2d0 [f2fs]
Call Trace:
[<ffffffff815ece9b>] ? printk+0x48/0x4a
[<ffffffffa029626a>] recover_fsync_data+0xdca/0xf50 [f2fs]
[<ffffffffa02873ae>] f2fs_fill_super+0x92e/0x970 [f2fs]
[<ffffffff8112c9f8>] mount_bdev+0x1b8/0x200
[<ffffffffa0286a80>] ? f2fs_remount+0x130/0x130 [f2fs]
[<ffffffffa0285e40>] f2fs_mount+0x10/0x20 [f2fs]
[<ffffffff8112d4de>] mount_fs+0x3e/0x1b0
[<ffffffff810ef4eb>] ? __alloc_percpu+0xb/0x10
[<ffffffff8114761f>] vfs_kern_mount+0x6f/0x120
[<ffffffff811497b9>] do_mount+0x259/0xa90
[<ffffffff810ead1d>] ? memdup_user+0x3d/0x80
[<ffffffff810eadb3>] ? strndup_user+0x53/0x70
[<ffffffff8114a2c9>] SyS_mount+0x89/0xd0
[<ffffffff815feae2>] system_call_fastpath+0x16/0x1b
This patch adds a recovery function of xattr node pages.
Reported-by: Tom Li <biergaizi@members.fsf.org>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
To keep the fs consistent, update_inode_page should never fail.
Otherwise, some metadata in the inode, such as the link count, may be
lost.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Pull core block IO changes from Jens Axboe:
"The major piece in here is the immutable bio_ve series from Kent, the
rest is fairly minor. It was supposed to go in last round, but
various issues pushed it to this release instead. The pull request
contains:
- Various smaller blk-mq fixes from different folks. Nothing major
here, just minor fixes and cleanups.
- Fix for a memory leak in the error path in the block ioctl code
from Christian Engelmayer.
- Header export fix from CaiZhiyong.
- Finally the immutable biovec changes from Kent Overstreet. This
enables some nice future work on making arbitrarily sized bios
possible, and splitting more efficient. Related fixes to immutable
bio_vecs:
- dm-cache immutable fixup from Mike Snitzer.
- btrfs immutable fixup from Muthu Kumar.
- bio-integrity fix from Nic Bellinger, which is also going to stable"
* 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
xtensa: fixup simdisk driver to work with immutable bio_vecs
block/blk-mq-cpu.c: use hotcpu_notifier()
blk-mq: for_each_* macro correctness
block: Fix memory leak in rw_copy_check_uvector() handling
bio-integrity: Fix bio_integrity_verify segment start bug
block: remove unrelated header files and export symbol
blk-mq: uses page->list incorrectly
blk-mq: use __smp_call_function_single directly
btrfs: fix missing increment of bi_remaining
Revert "block: Warn and free bio if bi_end_io is not set"
block: Warn and free bio if bi_end_io is not set
blk-mq: fix initializing request's start time
block: blk-mq: don't export blk_mq_free_queue()
block: blk-mq: make blk_sync_queue support mq
block: blk-mq: support draining mq queue
dm cache: increment bi_remaining when bi_end_io is restored
block: fixup for generic bio chaining
block: Really silence spurious compiler warnings
block: Silence spurious compiler warnings
block: Kill bio_pair_split()
...
Pull vfs updates from Al Viro:
"Assorted stuff; the biggest pile here is Christoph's ACL series. Plus
assorted cleanups and fixes all over the place...
There will be another pile later this week"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits)
__dentry_path() fixes
vfs: Remove second variable named error in __dentry_path
vfs: Is mounted should be testing mnt_ns for NULL or error.
Fix race when checking i_size on direct i/o read
hfsplus: remove can_set_xattr
nfsd: use get_acl and ->set_acl
fs: remove generic_acl
nfs: use generic posix ACL infrastructure for v3 Posix ACLs
gfs2: use generic posix ACL infrastructure
jfs: use generic posix ACL infrastructure
xfs: use generic posix ACL infrastructure
reiserfs: use generic posix ACL infrastructure
ocfs2: use generic posix ACL infrastructure
jffs2: use generic posix ACL infrastructure
hfsplus: use generic posix ACL infrastructure
f2fs: use generic posix ACL infrastructure
ext2/3/4: use generic posix ACL infrastructure
btrfs: use generic posix ACL infrastructure
fs: make posix_acl_create more useful
fs: make posix_acl_chmod more useful
...
f2fs has some weird mode bit handling, so still using the old
chmod code for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_create to __posix_acl_create and add
a fully featured helper to set up the ACLs on file creation that
uses get_acl().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the current posix_acl_chmod to __posix_acl_chmod and add
a fully featured ACL chmod helper that uses the ->set_acl inode
operation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If a node page is truncated, we'd better drop the page from the node_inode's
page cache to reduce the memory footprint.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds NODE_MAPPING, which is similar to META_MAPPING introduced by
Gu Zheng.
Cc: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Since orphan_blocks may be as large as 504, it is neither safe
nor rigorous to store such a large array on the kernel stack,
as Dan Carpenter pointed out.
In fact, grab_meta_page has already locked the page in the page cache,
and we can use find_get_page() to fetch the page safely
downstream, so we can remove the page array entirely.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Introduce a helper function, META_MAPPING(), to get the address space of the
cached meta blocks.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
If a dentry page is updated, we should call mark_inode_dirty to add the inode
into the dirty list, so that its dentry pages are flushed to the disk.
Otherwise, the inode can be evicted without flush.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Fixed a variety of trivial checkpatch warnings. The only delta should
be some minor formatting on log strings that were split / too long.
Signed-off-by: Chris Fries <cfries@motorola.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When sync_meta_pages is called with META_FLUSH during a checkpoint, we
override rw with WRITE_FLUSH_FUA. At that point, we should also set
REQ_META|REQ_PRIO.
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch should resolve the following bug.
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
3.13.0-rc5.f2fs+ #6 Not tainted
---------------------------------------------------------
kswapd0/41 just changed the state of lock:
(&sbi->gc_mutex){+.+.-.}, at: [<ffffffffa030503e>] f2fs_balance_fs+0xae/0xd0 [f2fs]
but this lock took another, RECLAIM_FS-READ-unsafe lock in the past:
(&sbi->cp_rwsem){++++.?}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Chain exists of:
&sbi->gc_mutex --> &sbi->cp_mutex --> &sbi->cp_rwsem
Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&sbi->cp_rwsem);
                               local_irq_disable();
                               lock(&sbi->gc_mutex);
                               lock(&sbi->cp_mutex);
  <Interrupt>
    lock(&sbi->gc_mutex);

 *** DEADLOCK ***
This bug is due to the f2fs_balance_fs call in f2fs_write_data_page.
If f2fs_write_data_page is triggered with wbc->for_reclaim via kswapd, it should
not call f2fs_balance_fs, which tries to take a mutex that may already be held
by the original syscall flow.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Support for f2fs-tools/tools/f2stat to monitor
/sys/kernel/debug/f2fs/status
Signed-off-by: Changman Lee <cm224.lee@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
With the two previous changes, all the long-running operations have been moved
out of the protected region, so here we can use a spinlock rather than a mutex
(orphan_inode_mutex) for lower overhead.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Move the allocation of a new orphan node out of the lock-protected region.
Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The "bool sync" parameter is never referenced in f2fs_wait_on_page_writeback,
so we should remove it.
Signed-off-by: Yuan Zhong <yuan.mark.zhong@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Previously, during SSR and GC, the maximum number of retries to find a victim
segment was hard-coded by MAX_VICTIM_SEARCH, 4096 by default.
This number affects IO locality when SSR mode is activated, which results in
performance fluctuation on some low-end devices.
If max_victim_search = 4, the victim will be searched like below.
("D" represents a dirty segment, and "*" indicates a selected victim segment.)
D1 D2 D3 D4 D5 D6 D7 D8 D9
[ * ]
[ * ]
[ * ]
[ ....]
This patch adds a sysfs entry to control the number dynamically through:
/sys/fs/f2fs/$dev/max_victim_search
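A minimal sketch of a bounded victim search matching the windows in the diagram above: each call inspects at most `max_search` dirty segments, resuming where the previous call stopped, and picks the cheapest one in that window. `sketch_get_victim()`, the cost array and the `last_pos` cursor are hypothetical, not the f2fs victim-selection code.

```c
#include <assert.h>
#include <limits.h>

/*
 * Pick the lowest-cost dirty segment among at most `max_search` candidates,
 * starting at *last_pos and wrapping around. Returns -1 if none is dirty.
 */
int sketch_get_victim(const unsigned int *cost, const int *dirty, int nsegs,
		      unsigned int max_search, int *last_pos)
{
	unsigned int min_cost = UINT_MAX, searched = 0;
	int victim = -1, pos = *last_pos;

	assert(nsegs > 0 && pos >= 0 && pos < nsegs);
	for (int i = 0; i < nsegs && searched < max_search; i++) {
		int seg = (pos + i) % nsegs;

		if (!dirty[seg])
			continue;
		searched++;
		if (cost[seg] < min_cost) {
			min_cost = cost[seg];
			victim = seg;
		}
		/* The next search resumes right after the last inspected segment. */
		*last_pos = (seg + 1) % nsegs;
	}
	return victim;
}
```

With max_victim_search = 4 and nine dirty segments, successive calls examine D1-D4, then D5-D8, then D9 and wrap around, as in the diagram; a small value keeps each search local, while 4096 effectively scans everything.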
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
When a bunch of data writes is issued with very frequent fsync calls, we can
observe the following performance regression.
N: Node IO, D: Data IO, IO scheduler: cfq
Issue pending IOs
D1 D2 D3 D4
D1 D2 D3 D4 N1
D2 D3 D4 N1 N2
N1 D3 D4 N2 D1
--> N1 can be selected by cfq because N and D have the same priority.
Then D3 and D4 would be delayed, resulting in performance degradation.
So, when processing the fsync call, it'd be better to give data IOs higher
priority than node IOs by assigning WRITE_SYNC to data IOs and WRITE to node IOs.
This patch improves random write performance with frequent fsync calls by up
to 10%.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Here is a case in which inline data could be read into a page other than the
first page.
1. write inline data
2. lseek to offset 4096
3. read 4096 bytes from offset 4096
(read_inline_data reads the inline data into a non-first page,
and the VFS has already added this page to the page cache)
4. ftruncate to offset 8192
5. read 4096 bytes from offset 4096
(we hit this uptodate page with inline data in the cache)
So, in this case, we should leave this page with initialized data and the
uptodate flag set.
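A minimal sketch of filling a page of an inline-data file, assuming only page index 0 can hold inline data. `struct page_sketch`, `PAGE_SIZE_SKETCH` and `sketch_read_inline_page()` are hypothetical: a non-first page is zero-filled and still marked uptodate, so the read in step 5 sees zeros instead of stale inline bytes, and the first page gets the inline data with its tail zeroed.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SIZE_SKETCH 4096   /* assumed page size */

struct page_sketch {
	unsigned long index;
	bool uptodate;
	unsigned char data[PAGE_SIZE_SKETCH];
};

void sketch_read_inline_page(struct page_sketch *page,
			     const unsigned char *inline_buf, size_t inline_len)
{
	assert(inline_len <= PAGE_SIZE_SKETCH);
	if (page->index > 0) {
		/* Not the inline page: initialize it to zeros. */
		memset(page->data, 0, PAGE_SIZE_SKETCH);
	} else {
		/* Copy the inline data, then zero the rest of the page. */
		memcpy(page->data, inline_buf, inline_len);
		memset(page->data + inline_len, 0, PAGE_SIZE_SKETCH - inline_len);
	}
	page->uptodate = true;
}
```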
Change log from v1:
o fix a deadlock bug
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
Change log from v1:
o reduce unneeded memset in __f2fs_convert_inline_data
From 58796be2bd2becbe8d52305210fb2a64e7dd80b6 Mon Sep 17 00:00:00 2001
From: Chao Yu <chao2.yu@samsung.com>
Date: Mon, 30 Dec 2013 09:21:33 +0800
Subject: [PATCH] f2fs: avoid to left uninitialized data in page when read
inline data
We leave uninitialized data in the tail of the page when we read an inline data
page. So let's initialize the rest of the page outside the inline data region.
Signed-off-by: Chao Yu <chao2.yu@samsung.com>
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The truncate_partial_nodes function puts pages incorrectly in the following two
cases. Note that the value of the argument 'depth' can only be 2 or 3.
Please see truncate_inode_blocks() and truncate_partial_nodes().
1) An error occurs in the first 'for' loop
When an error occurs with depth = 2, pages[0] is invalid, so this page doesn't
need to be put. That case is fine; however, when depth is 3, the pages are not
put correctly when pages[0] is valid and pages[1] is invalid.
In this case, depth is set to 2 (see the statement depth = i + 1), and then
we 'goto fail'.
In label 'fail', for (i = depth - 3; i >= 0; i--) cannot meet the condition
because i = -1, so pages[0] can't be put.
2) An error occurs in the second 'for' loop
Now we've got pages[0] with depth = 2, or pages[0] and pages[1]
with depth = 3. When an error is detected, we need to 'goto fail' to put those
pages.
When depth is 2, in label 'fail', for (i = depth - 3; i >= 0; i--) can't
meet the condition because i = -1, so pages[0] can't be put.
When depth is 3, in label 'fail', for (i = depth - 3; i >= 0; i--) can
only put pages[0]; pages[1] can't be put.
Note that 'depth' has been changed before the first 'goto fail' (see the
statement depth = i + 1), so passing this modified 'depth' to the tracepoint,
trace_f2fs_truncate_partial_nodes, is also incorrect.
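A minimal sketch of the corrected bookkeeping, assuming a toy reference-counted page model: instead of rewriting `depth`, a separate counter records how many pages were actually obtained, and the cleanup loop puts exactly those. All names (`struct npage`, `sketch_get_node_page()`, `sketch_truncate_partial_nodes()`) are hypothetical, and in this simplified model both the success and error paths release the pages through the same label.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 3

/* Hypothetical node page with a reference count. */
struct npage {
	int refs;
};

static struct npage pool[MAX_DEPTH - 1];

/* Stand-in for fetching a node page; fails at level `fail_at` (-1: never). */
static struct npage *sketch_get_node_page(int level, int fail_at)
{
	if (level == fail_at)
		return NULL;
	pool[level].refs++;
	return &pool[level];
}

static void sketch_put_page(struct npage *p)
{
	assert(p->refs > 0);
	p->refs--;
}

/* depth is 2 or 3, so depth - 1 node pages are needed; depth is never changed. */
int sketch_truncate_partial_nodes(int depth, int fail_at, int later_err)
{
	struct npage *pages[MAX_DEPTH - 1];
	int i, got = 0, err = 0;

	assert(depth == 2 || depth == 3);
	for (i = 0; i < depth - 1; i++) {
		pages[i] = sketch_get_node_page(i, fail_at);
		if (!pages[i]) {
			err = -1;
			goto out;
		}
		got++;
	}

	/* Second phase: a failure here must still put every page obtained. */
	if (later_err)
		err = later_err;

out:
	/* Release exactly the pages that were obtained, newest first. */
	for (i = got - 1; i >= 0; i--)
		sketch_put_page(pages[i]);
	return err;
}
```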
Signed-off-by: Shifei Ge <shifei10.ge@samsung.com>
[Jaegeuk Kim: modify the description and fix one bug]
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
The get_dnode_of_data function nullifies the inode and node pages when an error
occurs. There are two cases that pass an inode page into get_dnode_of_data().
1. make_empty_dir()
-> get_new_data_page()
-> f2fs_reserve_block(ipage)
-> get_dnode_of_data()
2. f2fs_convert_inline_data()
-> __f2fs_convert_inline_data()
-> f2fs_reserve_block(ipage)
-> get_dnode_of_data()
This patch adds correct error handling code for when get_dnode_of_data()
returns an error.
First, f2fs_reserve_block() calls f2fs_put_dnode() whenever reserve_new_block
returns an error.
So, the rule of f2fs_reserve_block() is to nullify the inode page when any
error occurs internally.
Finally, the two callers of f2fs_reserve_block() should call f2fs_put_dnode()
appropriately if they get an error after f2fs_reserve_block() succeeds.
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>
This patch adds an inline_data recovery routine with the following policy.
 [prev.] [next] of inline_data flag
    o       o  -> recover inline_data
    o       x  -> remove inline_data, and then recover data blocks
    x       o  -> remove data blocks, and then recover inline_data
    x       x  -> recover data blocks
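The policy table maps directly onto a small decision function. This is a sketch only: the enum and `sketch_inline_recovery_policy()` are hypothetical names. It reads the "x o" row as first dropping whatever the previous inode held (its data blocks) before recovering inline_data.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical action codes, one per row of the policy table. */
enum inline_recovery_action {
	RECOVER_INLINE_DATA,            /* o o */
	REMOVE_INLINE_RECOVER_BLOCKS,   /* o x */
	CLEAR_THEN_RECOVER_INLINE,      /* x o: drop prev. data blocks first */
	RECOVER_DATA_BLOCKS,            /* x x */
};

/* prev_inline / next_inline: inline_data flag before and after the fsync. */
enum inline_recovery_action
sketch_inline_recovery_policy(bool prev_inline, bool next_inline)
{
	if (prev_inline && next_inline)
		return RECOVER_INLINE_DATA;
	if (prev_inline)
		return REMOVE_INLINE_RECOVER_BLOCKS;
	if (next_inline)
		return CLEAR_THEN_RECOVER_INLINE;
	return RECOVER_DATA_BLOCKS;
}
```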
Signed-off-by: Jaegeuk Kim <jaegeuk.kim@samsung.com>