select_bad_process() checks PF_EXITING to detect the task which is going
to release its memory, but the logic is very wrong.
- a single process P with the dead group leader disables
select_bad_process() completely, it will always return
ERR_PTR() while P can live forever
- if the PF_EXITING task has already released its ->mm
it doesn't make sense to expect it is goiing to free
more memory (except task_struct/etc)
Change the code to ignore the PF_EXITING tasks without ->mm.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
select_bad_process() thinks a kernel thread can't have ->mm != NULL, this
is not true due to use_mm().
Change the code to check PF_KTHREAD.
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KSM reference counts can cause an anon_vma to exist after the processe it
belongs to have already exited. Because the anon_vma lock now lives in
the root anon_vma, we need to ensure that the root anon_vma stays around
until after all the "child" anon_vmas have been freed.
The obvious way to do this is to have a "child" anon_vma take a reference
to the root in anon_vma_fork. When the anon_vma is freed at munmap or
process exit, we drop the refcount in anon_vma_unlink and possibly free
the root anon_vma.
The KSM anon_vma reference count function also needs to be modified to
deal with the possibility of freeing 2 levels of anon_vma. The easiest
way to do this is to break out the KSM magic and make it generic.
When compiling without CONFIG_KSM, this code is compiled out.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Always (and only) lock the root (oldest) anon_vma whenever we do something
in an anon_vma. The recently introduced anon_vma scalability is due to
the rmap code scanning only the VMAs that need to be scanned. Many common
operations still took the anon_vma lock on the root anon_vma, so always
taking that lock is not expected to introduce any scalability issues.
However, always taking the same lock does mean we only need to take one
lock, which means rmap_walk on pages from any anon_vma in the vma is
excluded from occurring during an munmap, expand_stack or other operation
that needs to exclude rmap_walk and similar functions.
Also add the proper locking to vma_adjust.
Signed-off-by: Rik van Riel <riel@redhat.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Track the root (oldest) anon_vma in each anon_vma tree. Because we only
take the lock on the root anon_vma, we cannot use the lock on higher-up
anon_vmas to lock anything. This makes it impossible to do an indirect
lookup of the root anon_vma, since the data structures could go away from
under us.
However, a direct pointer is safe because the root anon_vma is always the
last one that gets freed on munmap or exit, by virtue of the same_vma list
order and unlink_anon_vmas walking the list forward.
[akpm@linux-foundation.org: fix typo]
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Subsitute a direct call of spin_lock(anon_vma->lock) with an inline
function doing exactly the same.
This makes it easier to do the substitution to the root anon_vma lock in a
following patch.
We will deal with the handful of special locks (nested, dec_and_lock, etc)
separately.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rename anon_vma_lock to vma_lock_anon_vma. This matches the naming style
used in page_lock_anon_vma and will come in really handy further down in
this patch series.
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Larry Woodman <lwoodman@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a copy-on-write occurs, we take one of two paths in handle_mm_fault:
through handle_pte_fault for normal pages, or through hugetlb_fault for
huge pages.
In the normal page case, we eventually get to do_wp_page and call mmu
notifiers via ptep_clear_flush_notify. There is no callout to the mmmu
notifiers in the huge page case. This patch fixes that.
Signed-off-by: Doug Doan <dougd@cray.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Provide an INIT_MM_CONTEXT intializer macro which can be used to
statically initialize mm_struct:mm_context of init_mm. This way we can
get rid of code which will do the initialization at run time (on s390).
In addition the current code can be found at a place where it is not
expected. So let's have a common initializer which architectures
can use if needed.
This is based on a patch from Suzuki Poulose.
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Suzuki Poulose <suzuki@in.ibm.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use ERR_CAST(x) rather than ERR_PTR(PTR_ERR(x)). The former makes more
clear what is the purpose of the operation, which otherwise looks like a
no-op.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
type T;
T x;
identifier f;
@@
T f (...) { <+...
- ERR_PTR(PTR_ERR(x))
+ x
...+> }
@@
expression x;
@@
- ERR_PTR(PTR_ERR(x))
+ ERR_CAST(x)
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use memdup_user when user data is immediately copied into the
allocated region.
The semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to,size,flag;
position p;
identifier l1,l2;
@@
- to = \(kmalloc@p\|kzalloc@p\)(size,flag);
+ to = memdup_user(from,size);
if (
- to==NULL
+ IS_ERR(to)
|| ...) {
<+... when != goto l1;
- -ENOMEM
+ PTR_ERR(to)
...+>
}
- if (copy_from_user(to, from, size) != 0) {
- <+... when != goto l2;
- -EFAULT
- ...+>
- }
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make sure we check the truncate constraints early on in ->setattr by adding
those checks to inode_change_ok. Also clean up and document inode_change_ok
to make this obvious.
As a fallout we don't have to call inode_newsize_ok from simple_setsize and
simplify it down to a truncate_setsize which doesn't return an error. This
simplifies a lot of setattr implementations and means we use truncate_setsize
almost everywhere. Get rid of fat_setsize now that it's trivial and mark
ext2_setsize static to make the calling convention obvious.
Keep the inode_newsize_ok in vmtruncate for now as all callers need an
audit for its removal anyway.
Note: setattr code in ecryptfs doesn't call inode_change_ok at all and
needs a deeper audit, but that is left for later.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make sure we call inode_change_ok before doing any changes in ->setattr,
and make sure to call it even if our fs wants to ignore normal UNIX
permissions, but use the ATTR_FORCE to skip those.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Despite its name it's now a generic implementation of ->setattr, but
rather a helper to copy attributes from a struct iattr to the inode.
Rename it to setattr_copy to reflect this fact.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All callers expect a boolean result which is true if the region
overlaps a reserved region. However, the implementation actually
returns -1 if there is no overlap, and a region index (0 based)
if there is.
Make it behave as callers (and common sense) expect.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fix a bug where a lock is _bh nested within another _bh lock,
but forgets to use the _bh variant for unlock.
Further more, it's not necessary to test _bh locks, the inner lock
can just use spin_lock(). So fix up the bug by making that change.
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch makes sure we first initialize everything and set the BDI_registered
flag, and only after this we add the bdi to 'bdi_list'. Current code adds the
bdi to the list too early, and as a result I the
WARN(!test_bit(BDI_registered, &bdi->state)
in bdi forker is triggered. Also, it is in general good practice to make things
visible only when they are fully initialized.
Also, this patch does few micro clean-ups:
1. Removes the 'exit' label which does not do anything, just returns. This
allows to get rid of few braces and 'ret' variable and make the code smaller.
2. If 'kthread_run()' fails, remove the error code it returns, not hard-coded
'-ENOMEM'. Theoretically, some day 'kthread_run()' can return something
else. Also, in case of failure it is not necessary to set 'bdi->wb.task' to
NULL.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Add 2 new trace points to the periodic write-back wake up case, just like we do
in the 'bdi_queue_work()' function. Namely, introduce:
1. trace_writeback_wake_thread(bdi)
2. trace_writeback_wake_forker_thread(bdi)
The first event is triggered every time we wake up a bdi thread to start
periodic background write-out. The second event is triggered only when the bdi
thread does not exist and should be created by the forker thread.
This patch was suggested by Dave Chinner and Christoph Hellwig.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The 'setup_timer()' function also calls 'init_timer()', so the extra
'init_timer()' call is not needed. Indeed, 'setup_timer()' is basically
'init_timer()' plus callback function and data pointers initialization.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Whe the first inode for a bdi is marked dirty, we wake up the bdi thread which
should take care of the periodic background write-out. However, the write-out
will actually start only 'dirty_writeback_interval' centisecs later, so we can
delay the wake-up.
This change was requested by Nick Piggin who pointed out that if we delay the
wake-up, we weed out 2 unnecessary contex switches, which matters because
'__mark_inode_dirty()' is a hot-path function.
This patch introduces a new function - 'bdi_wakeup_thread_delayed()', which
sets up a timer to wake-up the bdi thread and returns. So the wake-up is
delayed.
We also delete the timer in bdi threads just before writing-back. And
synchronously delete it when unregistering bdi. At the unregister point the bdi
does not have any users, so no one can arm it again.
Since now we take 'bdi->wb_lock' in the timer, which can execute in softirq
context, we have to use 'spin_lock_bh()' for 'bdi->wb_lock'. This patch makes
this change as well.
This patch also moves the 'bdi_wb_init()' function down in the file to avoid
forward-declaration of 'bdi_wakeup_thread_delayed()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Finally, we can get rid of unnecessary wake-ups in bdi threads, which are very
bad for battery-driven devices.
There are two types of activities bdi threads do:
1. process bdi works from the 'bdi->work_list'
2. periodic write-back
So there are 2 sources of wake-up events for bdi threads:
1. 'bdi_queue_work()' - submits bdi works
2. '__mark_inode_dirty()' - adds dirty I/O to bdi's
The former already has bdi wake-up code. The latter does not, and this patch
adds it.
'__mark_inode_dirty()' is hot-path function, but this patch adds another
'spin_lock(&bdi->wb_lock)' there. However, it is taken only in rare cases when
the bdi has no dirty inodes. So adding this spinlock should be fine and should
not affect performance.
This patch makes sure bdi threads and the forker thread do not wake-up if there
is nothing to do. The forker thread will nevertheless wake up at least every
5 min. to check whether it has to kill a bdi thread. This can also be optimized,
but is not worth it.
This patch also tidies up the warning about unregistered bid, and turns it from
an ugly crocodile to a simple 'WARN()' statement.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, bdi threads can decide to exit if there were no useful activities
for 5 minutes. However, this causes nasty races: we can easily oops in the
'bdi_queue_work()' if the bdi thread decides to exit while we are waking it up.
And even if we do not oops, but the bdi tread exits immediately after we wake
it up, we'd lose the wake-up event and have an unnecessary delay (up to 5 secs)
in the bdi work processing.
This patch makes the forker thread to be the central place which not only
creates bdi threads, but also kills them if they were inactive long enough.
This better design-wise.
Another reason why this change was done is to prepare for the further changes
which will prevent the bdi threads from waking up every 5 sec and wasting
power. Indeed, when the task does not wake up periodically anymore, it won't be
able to exit either.
This patch also moves the the 'wake_up_bit()' call from the bdi thread to the
forker thread as well. So now the forker thread sets the BDI_pending bit, then
forks the task or kills it, then clears the bit and wakes up the waiting
process.
The only process which may wain on the bit is 'bdi_wb_shutdown()'. This
function was changed as well - now it first removes the bdi from the
'bdi_list', then waits on the 'BDI_pending' bit. Once it wakes up, it is
guaranteed that the forker thread won't race with it, because the bdi is not
visible. Note, the forker thread sets the 'BDI_pending' bit under the
'bdi->wb_lock' which is essential for proper serialization.
And additionally, when we change 'bdi->wb.task', we now take the
'bdi->work_lock', to make sure that we do not lose wake-ups which we otherwise
would when raced with, say, 'bdi_queue_work()'.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch re-structures the bdi forker a little:
1. Add 'bdi_cap_flush_forker(bdi)' condition check to the bdi loop. The reason
for this is that the forker thread can start _before_ the 'BDI_registered'
flag is set (see 'bdi_register()'), so the WARN() statement will fire for
the default bdi. I observed this warning at boot-up.
2. Introduce an enum 'action' and use "switch" statement in the outer loop.
This is a preparation to the further patch which will teach the forker
thread killing bdi threads, so we'll have another case in the "switch"
statement. This change was suggested by Christoph Hellwig.
This patch is just a small step towards the coming change where the forker
thread will kill the bdi threads. It should simplify reviewing the following
changes, which would otherwise be larger.
This patch also amends comments a little.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The forker thread removes bdis from 'bdi_list' before forking the bdi thread.
But this is wrong for at least 2 reasons.
Reason #1: if we temporary remove a bdi from the list, we may miss works which
would otherwise be given to us.
Reason #2: this is racy; indeed, 'bdi_wb_shutdown()' expects that bdis are
always in the 'bdi_list' (see 'bdi_remove_from_list()'), and when
it races with the forker thread, it can shut down the bdi thread
at the same time as the forker creates it.
This patch makes sure the forker thread never removes bdis from 'bdi_list'
(which was suggested by Christoph Hellwig).
In order to make sure that we do not race with 'bdi_wb_shutdown()', we have to
hold the 'bdi_lock' while walking the 'bdi_list' and setting the 'BDI_pending'
flag.
NOTE! The error path is interesting. Currently, when we fail to create a bdi
thread, we move the bdi to the tail of 'bdi_list'. But if we never remove the
bdi from the list, we cannot move it to the tail either, because then we can
mess up the RCU readers which walk the list. And also, we'll have the race
described above in "Reason #2".
But I not think that adding to the tail is any important so I just do not do
that.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch simplifies bdi code a little by removing the 'pending_list' which is
redundant. Indeed, currently the forker thread ('bdi_forker_thread()') is
working like this:
1. In a loop, fetch all bdi's which have works but have no writeback thread and
move them to the 'pending_list'.
2. If the list is empty, sleep for 5 sec.
3. Otherwise, take one bdi from the list, fork the writeback thread for this
bdi, and repeat the loop.
IOW, it first moves everything to the 'pending_list', then process only one
element, and so on. This patch simplifies the algorithm, which is now as
follows.
1. Find the first bdi which has a work and remove it from the global list of
bdi's (bdi_list).
2. If there was not such bdi, sleep 5 sec.
3. Fork the writeback thread for this bdi and repeat the loop.
IOW, now we find the first bdi to process, process it, and so on. This is
simpler and involves less lists.
The bonus now is that we can get rid of a couple of functions, as well as
remove complications which involve 'rcu_call()' and 'bdi->rcu_head'.
This patch also makes sure we use 'list_add_tail_rcu()', instead of plain
'list_add_tail()', but this piece of code is going to be removed in the next
patch anyway.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently, if someone submits jobs for the default bdi, we can lose wake-up
events. E.g., this can happen if 'bdi_queue_work()' is called when
'bdi_forker_thread()' is executing code after 'wb_do_writeback(me, 0)', but
before 'set_current_state(TASK_INTERRUPTIBLE)'.
This situation is unlikely, and the result is not very severe - we'll just
delay the execution of the work, but this is still not very nice.
This patch fixes the issue by checking whether the default bdi has works before
the forker thread goes sleep.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Currently the forker thread can lose wake-ups which may lead to unnecessary
delays in processing bdi works. E.g., consider the following scenario.
1. 'bdi_forker_thread()' walks the 'bdi_list', finds out there is nothing to
do, and is about to finish the loop.
2. A bdi thread decides to exit because it was inactive for long time.
3. 'bdi_queue_work()' adds a work to the bdi which just exited, so it wakes up
the forker thread.
4. but 'bdi_forker_thread()' executes 'set_current_state(TASK_INTERRUPTIBLE)'
and goes sleep. We lose a wake-up.
Losing the wake-up is not fatal, but this means that the bdi work processing
will be delayed by up to 5 sec. This race is theoretical, I never hit it, but
it is worth fixing.
The fix is to execute 'set_current_state(TASK_INTERRUPTIBLE)' _before_ walking
'bdi_list', not after.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
This patch fixes a very unlikely race condition on the bdi forker thread error
path: when bdi thread creation fails, 'bdi->wb.task' may contain the error code
for a short period of time. If at the same time someone submits a work to this
bdi, we can end up with an oops 'bdi_queue_work()' while executing
'wake_up_process(wb->task)'.
This patch fixes the issue by introducing a temporary variable 'task' and
storing the possible error code there, so that 'wb->task' would never take
erroneous values.
Note, this race is very unlikely and I never hit it, so it is theoretical, but
nevertheless worth fixing.
This patch also merges 2 comments which were previously separate.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The write-back code mixes words "thread" and "task" for the same things. This
is not a big deal, but still an inconsistency.
hch: a convention I tend to use and I've seen in various places
is to always use _task for the storage of the task_struct pointer,
and thread everywhere else. This especially helps with having
foo_thread for the actual thread and foo_task for a global
variable keeping the task_struct pointer
This patch renames:
* 'bdi_add_default_flusher_task()' -> 'bdi_add_default_flusher_thread()'
* 'bdi_forker_task()' -> 'bdi_forker_thread()'
because bdi threads are 'bdi_writeback_thread()', so these names are more
consistent.
This patch also amends commentaries and makes them refer the forker and bdi
threads as "thread", not "task".
Also, while on it, make 'bdi_add_default_flusher_thread()' declaration use
'static void' instead of 'void static' and make checkpatch.pl happy.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Add a trace event to the ->writepage loop in write_cache_pages to give
visibility into how the ->writepage call is changing variables within the
writeback control structure. Of most interest is how wbc->nr_to_write changes
from call to call, especially with filesystems that write multiple pages
in ->writepage.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Tracing high level background writeback events is good, but it doesn't
give the entire picture. Add visibility into write throttling to catch IO
dispatched by foreground throttling of processing dirtying lots of pages.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Trace queue/sched/exec parts of the writeback loop. This provides
insight into when and why flusher threads are scheduled to run. e.g
a sync invocation leaves traces like:
sync-[...]: writeback_queue: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
flush-8:0-[...]: writeback_exec: bdi 8:0: sb_dev 8:1 nr_pages=7712 sync_mode=0 kupdate=0 range_cyclic=0 background=0
This also lays the foundation for adding more writeback tracing to
provide deeper insight into the whole writeback path.
The original tracing code is from Jens Axboe, though this version is
a rewrite as a result of the code being traced changing
significantly.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Move all code for the writeback thread into fs/fs-writeback.c instead of
splitting it over two functions in two files.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
The wb_list member of struct backing_device_info always has exactly one
element. Just use the direct bdi->wb pointer instead and simplify some
code.
Also remove bdi_task_init which is now trivial to prepare for the next
patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
Remove the current bio flags and reuse the request flags for the bio, too.
This allows to more easily trace the type of I/O from the filesystem
down to the block driver. There were two flags in the bio that were
missing in the requests: BIO_RW_UNPLUG and BIO_RW_AHEAD. Also I've
renamed two request flags that had a superflous RW in them.
Note that the flags are in bio.h despite having the REQ_ name - as
blkdev.h includes bio.h that is the only way to go for now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slub: Allow removal of slab caches during boot
Revert "slub: Allow removal of slab caches during boot"
slub numa: Fix rare allocation from unexpected node
slab: use deferable timers for its periodic housekeeping
slub: Use kmem_cache flags to detect if slab is in debugging mode.
slub: Allow removal of slab caches during boot
slub: Check kasprintf results in kmem_cache_init()
SLUB: Constants need UL
slub: Use a constant for a unspecified node.
SLOB: Free objects to their own list
slab: fix caller tracking on !CONFIG_DEBUG_SLAB && CONFIG_TRACING
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Ioremap: fix wrong physical address handling in PAT code
x86, tlb: Clean up and correct used type
x86, iomap: Fix wrong page aligned size calculation in ioremapping code
x86, mm: Create symbolic index into address_markers array
x86, ioremap: Fix normal ram range check
x86, ioremap: Fix incorrect physical address handling in PAE mode
x86-64, mm: Initialize VDSO earlier on 64 bits
x86, kmmio/mmiotrace: Fix double free of kmmio_fault_pages
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (162 commits)
tracing/kprobes: unregister_trace_probe needs to be called under mutex
perf: expose event__process function
perf events: Fix mmap offset determination
perf, powerpc: fsl_emb: Restore setting perf_sample_data.period
perf, powerpc: Convert the FSL driver to use local64_t
perf tools: Don't keep unreferenced maps when unmaps are detected
perf session: Invalidate last_match when removing threads from rb_tree
perf session: Free the ref_reloc_sym memory at the right place
x86,mmiotrace: Add support for tracing STOS instruction
perf, sched migration: Librarize task states and event headers helpers
perf, sched migration: Librarize the GUI class
perf, sched migration: Make the GUI class client agnostic
perf, sched migration: Make it vertically scrollable
perf, sched migration: Parameterize cpu height and spacing
perf, sched migration: Fix key bindings
perf, sched migration: Ignore unhandled task states
perf, sched migration: Handle ignored migrate out events
perf: New migration tool overview
tracing: Drop cpparg() macro
perf: Use tracepoint_synchronize_unregister() to flush any pending tracepoint call
...
Fix up trivial conflicts in Makefile and drivers/cpufreq/cpufreq.c
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
Revert "net: Make accesses to ->br_port safe for sparse RCU"
mce: convert to rcu_dereference_index_check()
net: Make accesses to ->br_port safe for sparse RCU
vfs: add fs.h to define struct file
lockdep: Add an in_workqueue_context() lockdep-based test function
rcu: add __rcu API for later sparse checking
rcu: add an rcu_dereference_index_check()
tree/tiny rcu: Add debug RCU head objects
mm: remove all rcu head initializations
fs: remove all rcu head initializations, except on_stack initializations
powerpc: remove all rcu head initializations
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
debug_core,kdb: fix crash when arch does not have single step
kgdb,x86: use macro HBP_NUM to replace magic number 4
kgdb,mips: remove unused kgdb_cpu_doing_single_step operations
mm,kdb,kgdb: Add a debug reference for the kdb kmap usage
KGDB: Remove set but unused newPC
ftrace,kdb: Allow dumping a specific cpu's buffer with ftdump
ftrace,kdb: Extend kdb to be able to dump the ftrace buffer
kgdb,powerpc: Replace hardcoded offset by BREAK_INSTR_SIZE
arm,kgdb: Add ability to trap into debugger on notify_die
gdbstub: do not directly use dbg_reg_def[] in gdb_cmd_reg_set()
gdbstub: Implement gdbserial 'p' and 'P' packets
kgdb,arm: Individual register get/set for arm
kgdb,mips: Individual register get/set for mips
kgdb,x86: Individual register get/set for x86
kgdb,kdb: individual register set and and get API
gdbstub: Optimize kgdb's "thread:" response for the gdb serial protocol
kgdb: remove custom hex_to_bin()implementation
The kdb kmap should never get used outside of the kernel debugger
exception context.
Signed-off-by: Jason Wessel<jason.wessel@windriver.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: linux-mm@kvack.org
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: allow limited allocation before slab is online
percpu: make @dyn_size always mean min dyn_size in first chunk init functions
Serialize kmem_cache_create and kmem_cache_destroy using the slub_lock. Only
possible after the use of the slub_lock during dynamic dma creation has been
removed.
Then make sure that the setup of the slab sysfs entries does not race
with kmem_cache_create and kmem_cache destroy.
If a slab cache is removed before we have setup sysfs then simply skip over
the sysfs handling.
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Roland Dreier <rdreier@cisco.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
is_hwpoison_address accesses the page table, so the caller must hold
current->mm->mmap_sem in read mode. So fix its usage in hva_to_pfn of
kvm accordingly.
Comment is_hwpoison_address to remind other users.
Reported-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>