Recognize BTRFS_BALANCE_RESUME flag passed from userspace. We use the
same heuristics used when recovering balance after a crash to try to
start where we left off last time.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Implement an ioctl for canceling restriper. Currently we wait until
relocation of the current block group is finished, in future this can be
done by triggering a commit. Balance item is deleted and no memory
about the interrupted balance is kept.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Implement an ioctl for pausing restriper. This pauses the relocation,
but balance is still considered to be "in progress": balance item is
not deleted, other volume operations cannot be started, etc. If paused
in the middle of profile changing operation we will continue making
allocations with the target profile.
Add a hook to close_ctree() to pause restriper and free its data
structures on unmount. (It's safe to unmount when restriper is in
"paused" state, we will resume with the same parameters on the next
mount)
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Since restriper kthread starts involuntarily on mount and can suck cpu
and memory bandwidth add a mount option to forcefully skip it. The
restriper in that case hangs around in paused state and can be resumed
from userspace when it's convenient.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
On mount, if balance item is found, resume balance in a separate
kernel thread.
Try to be smart to continue roughly where previous balance (or convert)
was interrupted. For chunk types that were being converted to some
profile we turn on soft convert, in case of a simple balance we turn on
usage filter and relocate only less-than-90%-full chunks of that type.
These are just heuristics but they help quite a bit, and can be improved
in future.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Introduce a new btree objectid for storing balance item. The reason is
to be able to resume restriper after a crash with the same parameters.
Balance item has a very high objectid and goes into tree of tree roots.
The key for the new item is as follows:
[ BTRFS_BALANCE_OBJECTID ; BTRFS_BALANCE_ITEM_KEY ; 0 ]
Older kernels simply ignore it so it's safe to mount with an older
kernel and then go back to the newer one.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
When doing convert from one profile to another if soft mode is on
restriper won't touch chunks that already have the profile we are
converting to. This is useful if e.g. half of the FS was converted
earlier.
The soft mode switch is (like every other filter) per-type. This means
that we can convert for example meta chunks the "hard" way while
converting data chunks selectively with soft switch.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Profile changing is done by launching a balance with
BTRFS_BALANCE_CONVERT bits set and target fields of respective
btrfs_balance_args structs initialized. Profile reducing code in this
case will pick restriper's target profile if it's available instead of
doing a blind reduce. If target profile is not yet available it goes
back to a plain reduce.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Every caller of do_chunk_alloc() feeds it the reduced allocation
profile, so stop trying to reduce it one more time. Instead check the
validity of the passed profile.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Select chunks which have at least one byte located inside a given
[vstart, vend) virtual address space range.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Select chunks which have at least one byte of at least one stripe
located on a device with devid X in a given [pstart,pend) physical
address range.
This filter only works when devid filter is turned on.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
This allows to have a separate set of filters for each chunk type
(data,meta,sys). The code however is generic and switch on chunk type
is only done once.
This commit also adds a type filter: it allows to balance for example
meta and system chunks w/o touching data ones.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Add basic restriper infrastructure: extended balancing ioctl and all
related ioctl data structures, add data structure for tracking
restriper's state to fs_info, etc. The semantics of the old balancing
ioctl are fully preserved.
Explicitly disallow any volume operations when balance is in progress.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Currently when new chunks are created respective avail_alloc_bits field
is updated to reflect profiles of all chunks present in the system.
However when chunks are removed profile bits are never cleared.
This patch clears profile bit of respective avail_alloc_bits field when
the last chunk with that profile is removed. Restriper needs this to
properly operate when "downgrading".
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Right now on-disk BTRFS_BLOCK_GROUP_* profile bits are used for
avail_{data,metadata,system}_alloc_bits fields, which gather info about
available allocation profiles in the FS. When chunk is created or read
from disk, its profile is OR'ed with the corresponding avail_alloc_bits
field. Since SINGLE is denoted by 0 in the on-disk format, currently
there is no way to tell when such chunks become avaialble. Restriper
needs that information, so add a separate bit for SINGLE profile.
This bit is going to be in-memory only, it should never be written out
to disk, so it's not a disk format change. However to avoid remappings
in future, reserve corresponding on-disk bit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Chunk's type and profile are encoded in u64 flags field. Introduce
masks to easily access them. Also fix the type of BTRFS_BLOCK_GROUP_*
constants, it should be ULL.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Commit 3292beb340c7688 ("sched/accounting: Change cpustat fields to an array")
deleted the code which provides us with the sum of all interrupts in the
system, causing vmstat to report zero interrupts occuring in the system.
Fix this by restoring the code.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Tested-by: Russell King <rmk+kernel@arm.linux.org.uk> # [on ARM]
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Glauber Costa <glommer@parallels.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Tuner <pjt@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (655 commits)
[media] revert patch: HDIC HD29L2 DMB-TH USB2.0 reference design driver
mb86a20s: Add a few more register settings at the init seq
mb86a20s: Group registers into the same line
[media] [PATCH] don't reset the delivery system on DTV_CLEAR
[media] [BUG] it913x-fe fix typo error making SNR levels unstable
[media] cx23885: Query the CX25840 during enum_input for status
[media] cx25840: Add support for g_input_status
[media] rc-videomate-m1f.c Rename to match remote controler name
[media] drivers: media: au0828: Fix dependency for VIDEO_AU0828
[media] convert drivers/media/* to use module_platform_driver()
[media] drivers: video: cx231xx: Fix dependency for VIDEO_CX231XX_DVB
[media] Exynos4 JPEG codec v4l2 driver
[media] doc: v4l: selection: choose pixels as units for selection rectangles
[media] v4l: s5p-tv: mixer: fix setup of VP scaling
[media] v4l: s5p-tv: mixer: add support for selection API
[media] v4l: emulate old crop API using extended crop/compose API
[media] doc: v4l: add documentation for selection API
[media] doc: v4l: add binary images for selection API
[media] v4l: add support for selection api
[media] hd29l2: fix review findings
...
* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
Revert "block: recursive merge requests"
block: Stop using macro stubs for the bio data integrity calls
blockdev: convert some macros to static inlines
fs: remove unneeded plug in mpage_readpages()
block: Add BLKROTATIONAL ioctl
block: Introduce blk_set_stacking_limits function
block: remove WARN_ON_ONCE() in exit_io_context()
block: an exiting task should be allowed to create io_context
block: ioc_cgroup_changed() needs to be exported
block: recursive merge requests
block, cfq: fix empty queue crash caused by request merge
block, cfq: move icq creation and rq->elv.icq association to block core
block, cfq: restructure io_cq creation path for io_context interface cleanup
block, cfq: move io_cq exit/release to blk-ioc.c
block, cfq: move icq cache management to block core
block, cfq: move io_cq lookup to blk-ioc.c
block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
block, cfq: reorganize cfq_io_context into generic and cfq specific parts
block: remove elevator_queue->ops
block: reorder elevator switch sequence
...
Fix up conflicts in:
- block/blk-cgroup.c
Switch from can_attach_task to can_attach
- block/cfq-iosched.c
conflict with now removed cic index changes (we now use q->id instead)
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
UBI: use own macros for the layout volume
UBI: fix nameless volumes handling
UBIFS: fix non-debug configuration build
Fix a brown paperbag bug introduced by me in the previous commit. I was
in hurry and forgot about the non-debug case completely.
Artem: amend the commit message and tweak the patch to preserve alignment.
This made the patch a bit less readable, though.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@linux.intel.com>
* 'for-linus' of git://selinuxproject.org/~jmorris/linux-security:
capabilities: remove __cap_full_set definition
security: remove the security_netlink_recv hook as it is equivalent to capable()
ptrace: do not audit capability check when outputing /proc/pid/stat
capabilities: remove task_ns_* functions
capabitlies: ns_capable can use the cap helpers rather than lsm call
capabilities: style only - move capable below ns_capable
capabilites: introduce new has_ns_capabilities_noaudit
capabilities: call has_ns_capability from has_capability
capabilities: remove all _real_ interfaces
capabilities: introduce security_capable_noaudit
capabilities: reverse arguments to security_capable
capabilities: remove the task from capable LSM hook entirely
selinux: sparse fix: fix several warnings in the security server cod
selinux: sparse fix: fix warnings in netlink code
selinux: sparse fix: eliminate warnings for selinuxfs
selinux: sparse fix: declare selinux_disable() in security.h
selinux: sparse fix: move selinux_complete_init
selinux: sparse fix: make selinux_secmark_refcount static
SELinux: Fix RCU deref check warning in sel_netport_insert()
Manually fix up a semantic mis-merge wrt security_netlink_recv():
- the interface was removed in commit fd7784615248 ("security: remove
the security_netlink_recv hook as it is equivalent to capable()")
- a new user of it appeared in commit a38f7907b926 ("crypto: Add
userspace configuration API")
causing no automatic merge conflict, but Eric Paris pointed out the
issue.
Removing the parent of a watched file results in "kernel BUG at
fs/notify/mark.c:139".
To reproduce
add "-w /tmp/audit/dir/watched_file" to audit.rules
rm -rf /tmp/audit/dir
This is caused by fsnotify_destroy_mark() being called without an
extra reference taken by the caller.
Reported by Francesco Cosoleto here:
https://bugzilla.novell.com/show_bug.cgi?id=689860
Fix by removing the BUG_ON and adding a comment about not accessing mark after
the iput.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
CC: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
iQIcBAABAgAGBQJPD2aFAAoJENkgDmzRrbjxNzsQAIeYbbrXYLjr6kQzUSngj/eC
FzjaTEfYTQIeuQCFJHcHthyc5lXV4sQbo3jOezW+Bp5yuDJL2aWIHesSfWZe7imu
zQdM4VshOYdAmUR9Q0AW5zhB8Smbs7/AyABiF2jm4p0ZPOuyMDSlei9sjvE9Vjvt
B7g5ht7L6kz0JbDnwwy0u5gs+tEitwpXYId9Y4ysZIBzIbL0qkPX8veOddGTMy0N
8xhWXaKtufpjvxFD2ORLDsw3AkoF1xXSNuFd/5nzCNpbeE7TW931jfkPoqJumuAO
7GLxcU9kKYl+IICobC6wBtsj/RrB7w+cBXMvPGwdBliam1qaRhUcJZi5FLM/Ha5d
2A9QDYNUpoXiO8JbPXrV9Z+Y0+Co8RilsQj7R/rjZh6AbbYCWt9nxzx2Svl/RfTr
xfiimHuB2P3rHjOvpCXULwOOuE5c8MzPuWncpdjiD3uGXOY/aY+X1m+if/quJw9D
grPlKL0+YiRakEYUeGG4M77KCqyKFZaF7L7UQPbqfZcj8V/9AW3/7U5I/B9RlAjs
idsr4fcf5s0N+oKUyTCW1ncpUDQNiwbU2NyJQqeu1ZxaRGj72AgyvsaNeyIPDyK+
f6x95Bi7i8KLjXc9Z1KvJwh2Nxt25gNUiTYVha/9H2NpJGd1cfI15kTOGXrgddVv
1pvuGcJDZwYiwfiXr3FL
=HHrh
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://github.com/rustyrussell/linux
Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999 BFCB D920 0E6C D1AD B8F1
* tag 'for-linus' of git://github.com/rustyrussell/linux:
module_param: check that bool parameters really are bool.
intelfbdrv.c: bailearly is an int module_param
paride/pcd: fix bool verbose module parameter.
module_param: make bool parameters really bool (drivers & misc)
module_param: make bool parameters really bool (arch)
module_param: make bool parameters really bool (core code)
kernel/async: remove redundant declaration.
printk: fix unnecessary module_param_name.
lirc_parallel: fix module parameter description.
module_param: avoid bool abuse, add bint for special cases.
module_param: check type correctness for module_param_array
modpost: use linker section to generate table.
modpost: use a table rather than a giant if/else statement.
modules: sysfs - export: taint, coresize, initsize
kernel/params: replace DEBUGP with pr_debug
module: replace DEBUGP with pr_debug
module: struct module_ref should contains long fields
module: Fix performance regression on modules with large symbol tables
module: Add comments describing how the "strmap" logic works
Fix up conflicts in scripts/mod/file2alias.c due to the new linker-
generated table approach to adding __mod_*_device_table entries. The
ARM sa11x0 mcp bus needed to be converted to that too.
* 'for-3.3' of git://linux-nfs.org/~bfields/linux: (31 commits)
nfsd4: nfsd4_create_clid_dir return value is unused
NFSD: Change name of extended attribute containing junction
svcrpc: don't revert to SVC_POOL_DEFAULT on nfsd shutdown
svcrpc: fix double-free on shutdown of nfsd after changing pool mode
nfsd4: be forgiving in the absence of the recovery directory
nfsd4: fix spurious 4.1 post-reboot failures
NFSD: forget_delegations should use list_for_each_entry_safe
NFSD: Only reinitilize the recall_lru list under the recall lock
nfsd4: initialize special stateid's at compile time
NFSd: use network-namespace-aware cache registering routines
SUNRPC: create svc_xprt in proper network namespace
svcrpc: update outdated BKL comment
nfsd41: allow non-reclaim open-by-fh's in 4.1
svcrpc: avoid memory-corruption on pool shutdown
svcrpc: destroy server sockets all at once
svcrpc: make svc_delete_xprt static
nfsd: Fix oops when parsing a 0 length export
nfsd4: Use kmemdup rather than duplicating its implementation
nfsd4: add a separate (lockowner, inode) lookup
nfsd4: fix CONFIG_NFSD_FAULT_INJECTION compile error
...
Since commit 080d676de095 ("aio: allocate kiocbs in batches") iocbs are
allocated in a batch during processing of first iocbs. All iocbs in a
batch are automatically added to ctx->active_reqs list and accounted in
ctx->reqs_active.
If one (not the last one) of iocbs submitted by an user fails, further
iocbs are not processed, but they are still present in ctx->active_reqs
and accounted in ctx->reqs_active. This causes process to stuck in a D
state in wait_for_all_aios() on exit since ctx->reqs_active will never
go down to zero. Furthermore since kiocb_batch_free() frees iocb
without removing it from active_reqs list the list become corrupted
which may cause oops.
Fix this by removing iocb from ctx->active_reqs and updating
ctx->reqs_active in kiocb_batch_free().
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Cc: stable@kernel.org # 3.2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Fix nlink setting on inode creation
GFS2: fail mount if journal recovery fails
GFS2: let spectator mount do read only recovery
GFS2: Fix a use-after-free that coverity spotted
GFS2: dlm based recovery coordination
* 'linux-next' of git://git.infradead.org/ubifs-2.6:
UBIFS: fix key printing
UBIFS: use snprintf instead of sprintf when printing keys
UBIFS: fix debugging messages
UBIFS: make debugging messages light again
UBI: fix debugging messages
UBI: make vid_hdr non-static
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: ensure prealloc_blob is in place when removing xattr
rbd: initialize snap_rwsem in rbd_add()
ceph: enable/disable dentry complete flags via mount option
vfs: export symbol d_find_any_alias()
ceph: always initialize the dentry in open_root_dentry()
libceph: remove useless return value for osd_client __send_request()
ceph: avoid iput() while holding spinlock in ceph_dir_fsync
ceph: avoid useless dget/dput in encode_fh
ceph: dereference pointer after checking for NULL
crush: fix force for non-root TAKE
ceph: remove unnecessary d_fsdata conditional checks
ceph: Use kmemdup rather than duplicating its implementation
Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
always initialize the dentry in open_root_dentry)
.. and the just as dead bhv_desc forward declaration while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
Replace the nasty if, else if, elseif condition with more natural C flow
that expressed the logic we want here better.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Ben Myers <bpm@sgi.com>
This wrapper isn't overly useful, not to say rather confusing.
Around the call to xfs_itruncate_extents it does:
- add tracing
- add a few asserts in debug builds
- conditionally update the inode size in two places
- log the inode
Both the tracing and the inode logging can be moved to xfs_itruncate_extents
as they are useful for the attribute fork as well - in fact the attr code
already does an equivalent xfs_trans_log_inode call just after calling
xfs_itruncate_extents. The conditional size updates are a mess, and there
was no reason to do them in two places anyway, as the first one was
conditional on the inode having extents - but without extents we
xfs_itruncate_extents would be a no-op and the placement wouldn't matter
anyway. Instead move the size assignments and the asserts that make sense
to the callers that want it.
As a side effect of this clean up xfs_setattr_size by introducing variables
for the old and new inode size, and moving the size updates into a common
place.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ben Myers <bpm@sgi.com>
I don't know how I missed this obvious mistake when I
reviewed Als' patches, sorry.
[ Quoting Al:
Grr... Note to self: do git status *and* git stash show -p
before git push. Nothing like "WTF? I'd fixed that braino"
feeling ;-/
Al sent the same patch - it got broken in commit d668dc56631d:
"autofs4: deal with autofs4_write/autofs4_write races". ]
Reported-and-tested-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before commit 56e46742e846e4de167dde0e1e1071ace1c882a5 we have had locking
around all printing macros and we could use static buffers for creating
key strings and printing them. However, now we do not have that locking and
we cannot use static buffers. This commit removes the old DBGKEY() macros
and introduces few new helper macros for printing debugging messages plus
a key at the end. Thankfully, all the messages are already structures in
a way that the key is printed in the end.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Switch to 'snprintf()' which is more secure and reliable. This is also a
preparation to the subsequent key printing fixes.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Andrew explains:
- various misc stuff
- Most of the rest of MM: memcg, threaded hugepages, others.
- cpumask
- kexec
- kdump
- some direct-io performance tweaking
- radix-tree optimisations
- new selftests code
A note on this: often people will develop a new userspace-visible
feature and will develop userspace code to exercise/test that
feature. Then they merge the patch and the selftest code dies.
Sometimes we paste it into the changelog. Sometimes the code gets
thrown into Documentation/(!).
This saddens me. So this patch creates a bare-bones framework which
will henceforth allow me to ask people to include their test apps in
the kernel tree so we can keep them alive. Then when people enhance
or fix the feature, I can ask them to update the test app too.
The infrastruture is terribly trivial at present - let's see how it
evolves.
- checkpoint/restart feature work.
A note on this: this is a project by various mad Russians to perform
c/r mainly from userspace, with various oddball helper code added
into the kernel where the need is demonstrated.
So rather than some large central lump of code, what we have is
little bits and pieces popping up in various places which either
expose something new or which permit something which is normally
kernel-private to be modified.
The overall project is an ongoing thing. I've judged that the size
and scope of the thing means that we're more likely to be successful
with it if we integrate the support into mainline piecemeal rather
than allowing it all to develop out-of-tree.
However I'm less confident than the developers that it will all
eventually work! So what I'm asking them to do is to wrap each piece
of new code inside CONFIG_CHECKPOINT_RESTORE. So if it all
eventually comes to tears and the project as a whole fails, it should
be a simple matter to go through and delete all trace of it.
This lot pretty much wraps up the -rc1 merge for me.
* akpm: (96 commits)
unlzo: fix input buffer free
ramoops: update parameters only after successful init
ramoops: fix use of rounddown_pow_of_two()
c/r: prctl: add PR_SET_MM codes to set up mm_struct entries
c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4
c/r: introduce CHECKPOINT_RESTORE symbol
selftests: new x86 breakpoints selftest
selftests: new very basic kernel selftests directory
radix_tree: take radix_tree_path off stack
radix_tree: remove radix_tree_indirect_to_ptr()
dio: optimize cache misses in the submission path
vfs: cache request_queue in struct block_device
fs/direct-io.c: calculate fs_count correctly in get_more_blocks()
drivers/parport/parport_pc.c: fix warnings
panic: don't print redundant backtraces on oops
sysctl: add the kernel.ns_last_pid control
kdump: add udev events for memory online/offline
include/linux/crash_dump.h needs elf.h
kdump: fix crash_kexec()/smp_send_stop() race in panic()
kdump: crashk_res init check for /sys/kernel/kexec_crash_size
...
The mm->start_code/end_code, mm->start_data/end_data, mm->start_brk are
involved into calculation of program text/data segment sizes (which might
be seen in /proc/<pid>/statm) and into brk() call final address.
For restore we need to know all these values. While
mm->start_code/end_code already present in /proc/$pid/stat, the rest
members are not, so this patch brings them in.
The restore procedure of these members is addressed in another patch using
prctl().
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Vagin <avagin@openvz.org>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some investigation of a transaction processing workload showed that a
major consumer of cycles in __blockdev_direct_IO is the cache miss while
accessing the block size. This is because it has to walk the chain from
block_dev to gendisk to queue.
The block size is needed early on to check alignment and sizes. It's only
done if the check for the inode block size fails. But the costly block
device state is unconditionally fetched.
- Reorganize the code to only fetch block dev state when actually
needed.
Then do a prefetch on the block dev early on in the direct IO path. This
is worth it, because there is substantial code run before we actually
touch the block dev now.
- I also added some unlikelies to make it clear the compiler that block
device fetch code is not normally executed.
This gave a small, but measurable improvement on a large database
benchmark (about 0.3%)
[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: using prefetch requires including prefetch.h]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This makes it possible to get from the inode to the request_queue with one
less cache miss. Used in followon optimization.
The livetime of the pointer is the same as the gendisk.
This assumes that the queue will always stay the same in the gendisk while
it's visible to block_devices. I think that's safe correct?
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In get_more_blocks(), we use dio_count to calcuate fs_count and do some
tricky things to increase fs_count if dio_count isn't aligned. But
actually it still has some corner cases that can't be coverd. See the
following example:
dio_write foo -s 1024 -w 4096
(direct write 4096 bytes at offset 1024). The same goes if the offset
isn't aligned to fs_blocksize.
In this case, the old calculation counts fs_count to be 1, but actually we
will write into 2 different blocks (if fs_blocksize=4096). The old code
just works, since it will call get_block twice (and may have to allocate
and create extents twice for filesystems like ext4). So we'd better call
get_block just once with the proper fs_count.
Signed-off-by: Tao Ma <boyu.mt@taobao.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.
This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.
[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
there was a debate on disabling synchronous compaction but this severely
impacted allocation success rates. Part of the reason was that many dirty
pages are skipped in asynchronous compaction by the following check;
if (PageDirty(page) && !sync &&
mapping->a_ops->migratepage != migrate_page)
rc = -EBUSY;
This skips over all mapping aops using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It is
the responsibility of the callback to fail gracefully if migration would
block.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Dave Jones <davej@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Andy Isaacson <adi@hexapodia.org>
Cc: Nai Xia <nai.xia@gmail.com>
Cc: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>