12828 Commits

Author SHA1 Message Date
Steven Rostedt
a790087554 ftrace: Allocate the mcount record pages as groups
Allocate the mcount record pages as a group of pages as big
as can be allocated and waste no more than a single page.

Grouping the mcount pages as much as possible helps with cache
locality, as we do not need to redirect with descriptors as we
cross from page to page. It also allows us to do more with the
records later on (sort them with bigger benefits).

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-12-21 07:18:30 -05:00
Steven Rostedt
3208230983 ftrace: Remove usage of "freed" records
Records that are added to the function trace table are
permanently there, except for modules. By separating out the
modules to their own pages that can be freed in one shot
we can remove the "freed" flag and simplify some of the record
management.

Another benefit of doing this is that we can also move the
records around; sort them.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-12-21 07:17:57 -05:00
Steven Rostedt
c88fd8634e ftrace: Allow archs to modify code without stop machine
The stop machine method to modify all functions in the kernel
(some 20,000 of them) is the safest way to do so across all archs.
But some archs may not need this big hammer approach to modify code
on SMP machines, and can simply just update the code it needs.

Adding a weak function arch_ftrace_update_code() that now does the
stop machine, will also let any arch override this method.

If the arch needs to check the system and then decide if it can
avoid stop machine, it can still call ftrace_run_stop_machine() to
use the old method.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-12-21 07:16:58 -05:00
Jiri Olsa
30fb6aa740 ftrace: Fix unregister ftrace_ops accounting
Multiple users of the function tracer can register their functions
with the ftrace_ops structure. The accounting within ftrace will
update the counter on each function record that is being traced.
When the ftrace_ops filtering adds or removes functions, the
function records will be updated accordingly if the ftrace_ops is
still registered.

When a ftrace_ops is removed, the counter of the function records,
that the ftrace_ops traces, are decremented. When they reach zero
the functions that they represent are modified to stop calling the
mcount code.

When changes are made, the code is updated via stop_machine() with
a command passed to the function to tell it what to do. There is an
ENABLE and DISABLE command that tells the called function to enable
or disable the functions. But the ENABLE is really a misnomer as it
should just update the records, as records that have been enabled
and now have a count of zero should be disabled.

The DISABLE command is used to disable all functions regardless of
their counter values. This is the big off switch and is not the
complement of the ENABLE command.

To make matters worse, when a ftrace_ops is unregistered and there
is another ftrace_ops registered, neither the DISABLE nor the
ENABLE command are set when calling into the stop_machine() function
and the records will not be updated to match their counter. A command
is passed to that function that will update the mcount code to call
the registered callback directly if it is the only one left. This
means that the ftrace_ops that is still registered will have its callback
called by all functions that have been set for it as well as the ftrace_ops
that was just unregistered.

Here's a way to trigger this bug. Compile the kernel with
CONFIG_FUNCTION_PROFILER set and with CONFIG_FUNCTION_GRAPH not set:

 CONFIG_FUNCTION_PROFILER=y
 # CONFIG_FUNCTION_GRAPH is not set

This will force the function profiler to use the function tracer instead
of the function graph tracer.

  # cd /sys/kernel/debug/tracing
  # echo schedule > set_ftrace_filter
  # echo function > current_tracer
  # cat set_ftrace_filter
 schedule
  # cat trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 692/68108025   #P:4
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
      kworker/0:2-909   [000] ....   531.235574: schedule <-worker_thread
           <idle>-0     [001] .N..   531.235575: schedule <-cpu_idle
      kworker/0:2-909   [000] ....   531.235597: schedule <-worker_thread
             sshd-2563  [001] ....   531.235647: schedule <-schedule_hrtimeout_range_clock

  # echo 1 > function_profile_enabled
  # echo 0 > function_porfile_enabled
  # cat set_ftrace_filter
 schedule
  # cat trace
 # tracer: function
 #
 # entries-in-buffer/entries-written: 159701/118821262   #P:4
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
           <idle>-0     [002] ...1   604.870655: local_touch_nmi <-cpu_idle
           <idle>-0     [002] d..1   604.870655: enter_idle <-cpu_idle
           <idle>-0     [002] d..1   604.870656: atomic_notifier_call_chain <-enter_idle
           <idle>-0     [002] d..1   604.870656: __atomic_notifier_call_chain <-atomic_notifier_call_chain

The same problem could have happened with the trace_probe_ops,
but they are modified with the set_frace_filter file which does the
update at closure of the file.

The simple solution is to change ENABLE to UPDATE and call it every
time an ftrace_ops is unregistered.

Link: http://lkml.kernel.org/r/1323105776-26961-3-git-send-email-jolsa@redhat.com

Cc: stable@vger.kernel.org # 3.0+
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-12-21 07:09:14 -05:00
Daisuke Nishimura
62af3783e4 sched: Fix cgroup movement of waking process
There is a small race between try_to_wake_up() and sched_move_task(),
which is trying to move the process being woken up.

    try_to_wake_up() on CPU0       sched_move_task() on CPU1
--------------------------------+---------------------------------
  raw_spin_lock_irqsave(p->pi_lock)
  task_waking_fair()
    ->p.se.vruntime -= cfs_rq->min_vruntime
  ttwu_queue()
    ->send reschedule IPI to CPU1
  raw_spin_unlock_irqsave(p->pi_lock)
                                   task_rq_lock()
                                     -> tring to aquire both p->pi_lock and
                                        rq->lock with IRQ disabled
                                   task_move_group_fair()
                                     -> p.se.vruntime
                                          -= (old)cfs_rq->min_vruntime
                                          += (new)cfs_rq->min_vruntime
                                   task_rq_unlock()

                                   (via IPI)
                                   sched_ttwu_pending()
                                     raw_spin_lock(rq->lock)
                                     ttwu_do_activate()
                                       ...
                                       enqueue_entity()
                                         child.se->vruntime += cfs_rq->min_vruntime
                                     raw_spin_unlock(rq->lock)

As a result, vruntime of the process becomes far bigger than min_vruntime,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by just ignoring such process in
task_move_group_fair(), because the vruntime has already been normalized in
task_waking_fair().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20111215143741.df82dd50.nishimura@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 10:34:52 +01:00
Daisuke Nishimura
7ceff013c4 sched: Fix cgroup movement of newly created process
There is a small race between do_fork() and sched_move_task(), which is
trying to move the child.

            do_fork()                 sched_move_task()
--------------------------------+---------------------------------
  copy_process()
    sched_fork()
      task_fork_fair()
        -> vruntime of the child is initialized
           based on that of the parent.
  -> we can see the child in "tasks" file now.
                                    task_rq_lock()
                                    task_move_group_fair()
                                      -> child.se.vruntime
                                           -= (old)cfs_rq->min_vruntime
                                           += (new)cfs_rq->min_vruntime
                                    task_rq_unlock()
  wake_up_new_task()
    ...
    enqueue_entity()
      child.se.vruntime += cfs_rq->min_vruntime

As a result, vruntime of the child becomes far bigger than min_vruntime,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by just ignoring such process in
task_move_group_fair(), because the vruntime has already been normalized in
task_fork_fair().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20111215143607.2ee12c5d.nishimura@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 10:34:51 +01:00
Daisuke Nishimura
4fc420c91f sched: Fix cgroup movement of forking process
There is a small race between task_fork_fair() and sched_move_task(),
which is trying to move the parent.

        task_fork_fair()                 sched_move_task()
--------------------------------+---------------------------------
  cfs_rq = task_cfs_rq(current)
    -> cfs_rq is the "old" one.
  curr = cfs_rq->curr
    -> curr is set to the parent.
                                    task_rq_lock()
                                    dequeue_task()
                                      ->parent.se.vruntime -= (old)cfs_rq->min_vruntime
                                    enqueue_task()
                                      ->parent.se.vruntime += (new)cfs_rq->min_vruntime
                                    task_rq_unlock()
  raw_spin_lock_irqsave(rq->lock)
  se->vruntime = curr->vruntime
    -> vruntime of the child is set to that of the parent
       which has already been updated by sched_move_task().
  se->vruntime -= (old)cfs_rq->min_vruntime.
  raw_spin_unlock_irqrestore(rq->lock)

As a result, vruntime of the child becomes far bigger than expected,
if (new)cfs_rq->min_vruntime >> (old)cfs_rq->min_vruntime.

This patch fixes this problem by setting "cfs_rq" and "curr" after
holding the rq->lock.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: Paul Turner <pjt@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/r/20111215143655.662676b0.nishimura@mxp.nes.nec.co.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 10:34:49 +01:00
Kamalesh Babulal
11534ec5b6 sched: Remove cfs bandwidth period check in tg_set_cfs_period()
Remove cfs bandwidth period check from tg_set_cfs_period.
Invalid bandwidth period's lower/upper limits are denoted
by min_cfs_quota_period/max_cfs_quota_period repsectively,
and are checked against valid period in tg_set_cfs_bandwidth().

As pjt pointed out, negative input will result in very large unsigned
numbers and will be caught by the max allowed period test.

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Paul Turner <pjt@google.com>
[ammended changelog to mention negative values]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111210135925.GA14593@linux.vnet.ibm.com
--
 kernel/sched/core.c |    3 ---
 1 file changed, 3 deletions(-)

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 10:34:48 +01:00
Peter Zijlstra
a195f004e9 sched: Fix load-balance lock-breaking
The current lock break relies on contention on the rq locks, something
which might never come because we've got IRQs disabled. Or will be
very likely because on anything with more than 2 cpus a synchronized
load-balance pass will very likely cause contention on the rq locks.

Also the sched_nr_migrate thing fails when it gets trapped the loops
of either the cgroup muck in load_balance_fair() or the move_tasks()
load condition.

Instead, use the new lb_flags field to propagate break/abort
conditions for all these loops and create a new loop outside the irq
disabled on the break being required.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-tsceb6w61q0gakmsccix6xxi@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 10:34:47 +01:00
Peter Zijlstra
5b54b56be5 sched: Replace all_pinned with a generic flags field
Replace the all_pinned argument with a flags field so that we can add
some extra controls throughout that entire call chain.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-33kevm71m924ok1gpxd720v3@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 10:34:45 +01:00
Peter Zijlstra
518cd62341 sched: Only queue remote wakeups when crossing cache boundaries
Mike reported a 13% drop in netperf TCP_RR performance due to the
new remote wakeup code. Suresh too noticed some performance issues
with it.

Reducing the IPIs to only cross cache domains solves the observed
performance issues.

Reported-by: Suresh Siddha <suresh.b.siddha@intel.com>
Reported-by: Mike Galbraith <efault@gmx.de>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Kleikamp <dave.kleikamp@oracle.com>
Link: http://lkml.kernel.org/r/1323338531.17673.7.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 10:34:44 +01:00
Peter Zijlstra
f07fdec50a lockdep/waitqueues: Add better annotation
-> #2 (&tty->write_wait){-.-...}:

is a lot more informative than:

 -> #2 (key#19){-.....}:

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-8zpopbny51023rdb0qq67eye@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 10:07:39 +01:00
Ingo Molnar
2d2b7749e8 Merge commit 'v3.2-rc6' into core/locking
Merge reason: Pick up the latest fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-21 10:06:37 +01:00
Linus Torvalds
a4a4923919 Merge branch 'for-3.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
* 'for-3.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroups: fix a css_set not found bug in cgroup_attach_proc
2011-12-20 11:44:18 -08:00
Linus Torvalds
5fbd305dd2 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  time/clocksource: Fix kernel-doc warnings
  rtc: m41t80: Workaround broken alarm functionality
  rtc: Expire alarms after the time is set.
2011-12-20 11:42:38 -08:00
Ingo Molnar
d87f69a16e Merge commit 'v3.2-rc6' into perf/core
Merge reason: Update with the latest fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-20 20:32:11 +01:00
Michel Lespinasse
3d3c8f93a2 binary_sysctl(): fix memory leak
binary_sysctl() calls sysctl_getname() which allocates from names_cache
slab usin __getname()

The matching function to free the name is __putname(), and not putname()
which should be used only to match getname() allocations.

This is because when auditing is enabled, putname() calls audit_putname
*instead* (not in addition) to __putname().  Then, if a syscall is in
progress, audit_putname does not release the name - instead, it expects
the name to get released when the syscall completes, but that will happen
only if audit_getname() was called previously, i.e.  if the name was
allocated with getname() rather than the naked __getname().  So,
__getname() followed by putname() ends up leaking memory.

Signed-off-by: Michel Lespinasse <walken@google.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Eric Paris <eparis@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-20 10:25:04 -08:00
David Rientjes
b246272ecc cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
new set of allowed cpuset nodes where the two nodemasks, as a result of
the remap, are now disjoint.

c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing
cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
nodes from changing for a thread.  This causes any update to a set of
allowed nodes to stall until put_mems_allowed() is called.

This stall is unncessary, however, if at least one node remains unchanged
in the update to the set of allowed nodes.  This was addressed by
89e8a244b97e ("cpusets: avoid looping when storing to mems_allowed if one
node remains set"), but it's still possible that an empty nodemask may be
read from a mempolicy because the old nodemask may be remapped to the new
nodemask during rebind.  To prevent this, only avoid the stall if there is
no mempolicy for the thread being changed.

This is a temporary solution until all reads from mempolicy nodemasks can
be guaranteed to not be empty without the get_mems_allowed()
synchronization.

Also moves the check for nodemask intersection inside task_lock() so that
tsk->mems_allowed cannot change.  This ensures that nothing can set this
tsk's mems_allowed out from under us and also protects tsk->mempolicy.

Reported-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-20 10:25:04 -08:00
Ingo Molnar
45aa0663cc Merge branch 'memblock-kill-early_node_map' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into core/memblock 2011-12-20 12:14:26 +01:00
Martin Schwidefsky
612ef28a04 Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into cputime-tip
Conflicts:
	drivers/cpufreq/cpufreq_conservative.c
	drivers/cpufreq/cpufreq_ondemand.c
	drivers/macintosh/rack-meter.c
	fs/proc/stat.c
	fs/proc/uptime.c
	kernel/sched/core.c
2011-12-19 19:23:15 +01:00
Mandeep Singh Baines
29e21368b9 cgroups: remove redundant get/put of css_set from css_set_check_fetched()
We already have a reference to all elements in newcg_list.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: Paul Menage <paul@paulmenage.org>
2011-12-19 09:14:30 -08:00
Mandeep Singh Baines
e0197aae59 cgroups: fix a css_set not found bug in cgroup_attach_proc
There is a BUG when migrating a PF_EXITING proc. Since css_set_prefetch()
is not called for the PF_EXITING case, find_existing_css_set() will return
NULL inside cgroup_task_migrate() causing a BUG.

This bug is easy to reproduce. Create a zombie and echo its pid to
cgroup.procs.

$ cat zombie.c
\#include <unistd.h>

int main()
{
  if (fork())
      pause();
  return 0;
}
$

We are hitting this bug pretty regularly on ChromeOS.

This bug is already fixed by Tejun Heo's cgroup patchset which is
targetted for the next merge window:

https://lkml.org/lkml/2011/11/1/356

I've create a smaller patch here which just fixes this bug so that a
fix can be merged into the current release and stable.

Signed-off-by: Mandeep Singh Baines <msb@chromium.org>
Downstream-Bug-Report: http://crosbug.com/23953
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: containers@lists.linux-foundation.org
Cc: cgroups@vger.kernel.org
Cc: stable@kernel.org
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Olof Johansson <olofj@chromium.org>
2011-12-19 09:09:09 -08:00
Kusanagi Kouichi
b1b73d0950 time/clocksource: Fix kernel-doc warnings
Fix various KernelDoc build warnings.

Signed-off-by: Kusanagi Kouichi <slash@ac.auone-net.jp>
Cc: John Stultz <johnstul@us.ibm.com>
Link: http://lkml.kernel.org/r/20111219091320.0D5AF6FC03D@msa105.auone-net.jp
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-19 11:41:40 +01:00
Wu Fengguang
83712358ba writeback: dirty ratelimit - think time compensation
Compensate the task's think time when computing the final pause time,
so that ->dirty_ratelimit can be executed accurately.

        think time := time spend outside of balance_dirty_pages()

In the rare case that the task slept longer than the 200ms period time
(result in negative pause time), the sleep time will be compensated in
the following periods, too, if it's less than 1 second.

Accumulated errors are carefully avoided as long as the max pause area
is not hitted.

Pseudo code:

        period = pages_dirtied / task_ratelimit;
        think = jiffies - dirty_paused_when;
        pause = period - think;

1) normal case: period > think

        pause = period - think
        dirty_paused_when = jiffies + pause
        nr_dirtied = 0

                             period time
              |===============================>|
                  think time      pause time
              |===============>|==============>|
        ------|----------------|---------------|------------------------
        dirty_paused_when   jiffies

2) no pause case: period <= think

        don't pause; reduce future pause time by:
        dirty_paused_when += period
        nr_dirtied = 0

                           period time
              |===============================>|
                                  think time
              |===================================================>|
        ------|--------------------------------+-------------------|----
        dirty_paused_when                                       jiffies

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-12-18 14:20:27 +08:00
Wu Fengguang
54848d73f9 writeback: charge leaked page dirties to active tasks
It's a years long problem that a large number of short-lived dirtiers
(eg. gcc instances in a fast kernel build) may starve long-run dirtiers
(eg. dd) as well as pushing the dirty pages to the global hard limit.

The solution is to charge the pages dirtied by the exited gcc to the
other random dirtying tasks. It sounds not perfect, however should
behave good enough in practice, seeing as that throttled tasks aren't
actually running so those that are running are more likely to pick it up
and get throttled, therefore promoting an equal spread.

Randy: fix compile error: 'dirty_throttle_leaks' undeclared in exit.c

Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
2011-12-18 14:20:20 +08:00
Linus Torvalds
ab347d94d6 Merge branches 'perf-urgent-for-linus' and 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf events: Fix ring_buffer_wakeup() brown paperbag bug

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix select_idle_sibling() regression in selecting an idle SMT sibling
  MAINTAINERS: Update tip.git related git trees
2011-12-17 14:03:50 -08:00
Peter Zijlstra
ab2789213d sched: Fix select_idle_sibling() regression in selecting an idle SMT sibling
Mike Galbraith reported that this recent commit:

   commit 4dcfe1025b513c2c1da5bf5586adb0e80148f612
   Author: Peter Zijlstra <peterz@infradead.org>
   Date:   Thu Nov 10 13:01:10 2011 +0100

       sched: Avoid SMT siblings in select_idle_sibling() if possible

stopped selecting an idle SMT sibling when there are no idle
cores in a single socket system.

Intent of the select_idle_sibling() was to fallback to an idle
SMT sibling, if it fails to identify an idle core. But this
fallback was not happening on systems where all the scheduler
domains had `SD_SHARE_PKG_RESOURCES' flag set.

Fix it. Slightly bigger patch of cleaning all these goto's etc
is queued up for the next release.

Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1323978421.1984.244.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-16 09:44:58 +01:00
Kees Cook
07cde2608a sched: Add missing rcu_dereference() around ->real_parent usage
Wrap another ->real_parent dereference while under rcu_read_lock.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Link: http://lkml.kernel.org/r/20111215164918.GA13003@www.outflux.net
[ tidied up the changelog ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-16 09:42:09 +01:00
Martin Schwidefsky
648616343c [S390] cputime: add sparse checking and cleanup
Make cputime_t and cputime64_t nocast to enable sparse checking to
detect incorrect use of cputime. Drop the cputime macros for simple
scalar operations. The conversion macros are still needed.

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2011-12-15 14:56:19 +01:00
Ingo Molnar
6a54aebf69 Merge commit 'v3.2-rc5' into sched/core
Merge reason: Pick up the latest fixes.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-15 08:21:30 +01:00
Kay Sievers
d369a5d8fc clocksource: convert sysdev_class to a regular subsystem
After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.

Cc: John Stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-14 15:28:51 -08:00
Kay Sievers
997d3eaf02 rtmutex-tester: convert sysdev_class to a regular subsystem
After all sysdev classes are ported to regular driver core entities, the
sysdev implementation will be entirely removed from the kernel.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-14 14:54:22 -08:00
Will Deacon
44b7f4b98d perf events: Fix ring_buffer_wakeup() brown paperbag bug
Commit 10c6db11 ("perf: Fix loss of notification with multi-event")
seems to unconditionally dereference event->rb in the wakeup handler,
this is wrong, there might not be a buffer attached.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20111213152651.GP20297@mudshark.cambridge.arm.com
[ minor edits ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-12-14 08:44:53 +01:00
Tejun Heo
b2efa05265 block, cfq: unlink cfq_io_context's immediately
cic is association between io_context and request_queue.  A cic is
linked from both ioc and q and should be destroyed when either one
goes away.  As ioc and q both have their own locks, locking becomes a
bit complex - both orders work for removal from one but not from the
other.

Currently, cfq tries to circumvent this locking order issue with RCU.
ioc->lock nests inside queue_lock but the radix tree and cic's are
also protected by RCU allowing either side to walk their lists without
grabbing lock.

This rather unconventional use of RCU quickly devolves into extremely
fragile convolution.  e.g. The following is from cfqd going away too
soon after ioc and q exits raced.

 general protection fault: 0000 [#1] PREEMPT SMP
 CPU 2
 Modules linked in:
 [   88.503444]
 Pid: 599, comm: hexdump Not tainted 3.1.0-rc10-work+ #158 Bochs Bochs
 RIP: 0010:[<ffffffff81397628>]  [<ffffffff81397628>] cfq_exit_single_io_context+0x58/0xf0
 ...
 Call Trace:
  [<ffffffff81395a4a>] call_for_each_cic+0x5a/0x90
  [<ffffffff81395ab5>] cfq_exit_io_context+0x15/0x20
  [<ffffffff81389130>] exit_io_context+0x100/0x140
  [<ffffffff81098a29>] do_exit+0x579/0x850
  [<ffffffff81098d5b>] do_group_exit+0x5b/0xd0
  [<ffffffff81098de7>] sys_exit_group+0x17/0x20
  [<ffffffff81b02f2b>] system_call_fastpath+0x16/0x1b

The only real hot path here is cic lookup during request
initialization and avoiding extra locking requires very confined use
of RCU.  This patch makes cic removal from both ioc and request_queue
perform double-locking and unlink immediately.

* From q side, the change is almost trivial as ioc->lock nests inside
  queue_lock.  It just needs to grab each ioc->lock as it walks
  cic_list and unlink it.

* From ioc side, it's a bit more difficult because of inversed lock
  order.  ioc needs its lock to walk its cic_list but can't grab the
  matching queue_lock and needs to perform unlock-relock dancing.

  Unlinking is now wholly done from put_io_context() and fast path is
  optimized by using the queue_lock the caller already holds, which is
  by far the most common case.  If the ioc accessed multiple devices,
  it tries with trylock.  In unlikely cases of fast path failure, it
  falls back to full double-locking dance from workqueue.

Double-locking isn't the prettiest thing in the world but it's *far*
simpler and more understandable than RCU trick without adding any
meaningful overhead.

This still leaves a lot of now unnecessary RCU logics.  Future patches
will trim them.

-v2: Vivek pointed out that cic->q was being dereferenced after
     cic->release() was called.  Updated to use local variable @this_q
     instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:39 +01:00
Tejun Heo
6e736be7f2 block: make ioc get/put interface more conventional and fix race on alloction
Ignoring copy_io() during fork, io_context can be allocated from two
places - current_io_context() and set_task_ioprio().  The former is
always called from local task while the latter can be called from
different task.  The synchornization between them are peculiar and
dubious.

* current_io_context() doesn't grab task_lock() and assumes that if it
  saw %NULL ->io_context, it would stay that way until allocation and
  assignment is complete.  It has smp_wmb() between alloc/init and
  assignment.

* set_task_ioprio() grabs task_lock() for assignment and does
  smp_read_barrier_depends() between "ioc = task->io_context" and "if
  (ioc)".  Unfortunately, this doesn't achieve anything - the latter
  is not a dependent load of the former.  ie, if ioc itself were being
  dereferenced "ioc->xxx", it would mean something (not sure what tho)
  but as the code currently stands, the dependent read barrier is
  noop.

As only one of the the two test-assignment sequences is task_lock()
protected, the task_lock() can't do much about race between the two.
Nothing prevents current_io_context() and set_task_ioprio() allocating
its own ioc for the same task and overwriting the other's.

Also, set_task_ioprio() can race with exiting task and create a new
ioc after exit_io_context() is finished.

ioc get/put doesn't have any reason to be complex.  The only hot path
is accessing the existing ioc of %current, which is simple to achieve
given that ->io_context is never destroyed as long as the task is
alive.  All other paths can happily go through task_lock() like all
other task sub structures without impacting anything.

This patch updates ioc get/put so that it becomes more conventional.

* alloc_io_context() is replaced with get_task_io_context().  This is
  the only interface which can acquire access to ioc of another task.
  On return, the caller has an explicit reference to the object which
  should be put using put_io_context() afterwards.

* The functionality of current_io_context() remains the same but when
  creating a new ioc, it shares the code path with
  get_task_io_context() and always goes through task_lock().

* get_io_context() now means incrementing ref on an ioc which the
  caller already has access to (be that an explicit refcnt or implicit
  %current one).

* PF_EXITING inhibits creation of new io_context and once
  exit_io_context() is finished, it's guaranteed that both ioc
  acquisition functions return %NULL.

* All users are updated.  Most are trivial but
  smp_read_barrier_depends() removal from cfq_get_io_context() needs a
  bit of explanation.  I suppose the original intention was to ensure
  ioc->ioprio is visible when set_task_ioprio() allocates new
  io_context and installs it; however, this wouldn't have worked
  because set_task_ioprio() doesn't have wmb between init and install.
  There are other problems with this which will be fixed in another
  patch.

* While at it, use NUMA_NO_NODE instead of -1 for wildcard node
  specification.

-v2: Vivek spotted contamination from debug patch.  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-12-14 00:33:38 +01:00
Davidlohr Bueso
52dcf8a1f8 resource cgroups: remove bogus cast
The memparse() function already accepts const char * as the parsing string.

Signed-off-by: Davidlohr Bueso <dave@gnu.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2011-12-13 07:43:08 -08:00
Tejun Heo
494c167cf7 cgroup: kill subsys->can_attach_task(), pre_attach() and attach_task()
These three methods are no longer used.  Kill them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
2011-12-12 18:12:22 -08:00
Tejun Heo
94196f51c1 cgroup, cpuset: don't use ss->pre_attach()
->pre_attach() is supposed to be called before migration, which is
observed during process migration but task migration does it the other
way around.  The only ->pre_attach() user is cpuset which can do the
same operaitons in ->can_attach().  Collapse cpuset_pre_attach() into
cpuset_can_attach().

-v2: Patch contamination from later patch removed.  Spotted by Paul
     Menage.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
2011-12-12 18:12:22 -08:00
Tejun Heo
bb9d97b6df cgroup: don't use subsys->can_attach_task() or ->attach_task()
Now that subsys->can_attach() and attach() take @tset instead of
@task, they can handle per-task operations.  Convert
->can_attach_task() and ->attach_task() users to use ->can_attach()
and attach() instead.  Most converions are straight-forward.
Noteworthy changes are,

* In cgroup_freezer, remove unnecessary NULL assignments to unused
  methods.  It's useless and very prone to get out of sync, which
  already happened.

* In cpuset, PF_THREAD_BOUND test is checked for each task.  This
  doesn't make any practical difference but is conceptually cleaner.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: James Morris <jmorris@namei.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
2011-12-12 18:12:21 -08:00
Tejun Heo
2f7ee5691e cgroup: introduce cgroup_taskset and use it in subsys->can_attach(), cancel_attach() and attach()
Currently, there's no way to pass multiple tasks to cgroup_subsys
methods necessitating the need for separate per-process and per-task
methods.  This patch introduces cgroup_taskset which can be used to
pass multiple tasks and their associated cgroups to cgroup_subsys
methods.

Three methods - can_attach(), cancel_attach() and attach() - are
converted to use cgroup_taskset.  This unifies passed parameters so
that all methods have access to all information.  Conversions in this
patchset are identical and don't introduce any behavior change.

-v2: documentation updated as per Paul Menage's suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: James Morris <jmorris@namei.org>
2011-12-12 18:12:21 -08:00
Tejun Heo
134d33737f cgroup: improve old cgroup handling in cgroup_attach_proc()
cgroup_attach_proc() behaves differently from cgroup_attach_task() in
the following aspects.

* All hooks are invoked even if no task is actually being moved.

* ->can_attach_task() is called for all tasks in the group whether the
  new cgrp is different from the current cgrp or not; however,
  ->attach_task() is skipped if new equals new.  This makes the calls
  asymmetric.

This patch improves old cgroup handling in cgroup_attach_proc() by
looking up the current cgroup at the head, recording it in the flex
array along with the task itself, and using it to remove the above two
differences.  This will also ease further changes.

-v2: nr_todo renamed to nr_migrating_tasks as per Paul Menage's
     suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Paul Menage <paul@paulmenage.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2011-12-12 18:12:21 -08:00
Tejun Heo
cd3d095275 cgroup: always lock threadgroup during migration
Update cgroup to take advantage of the fack that threadgroup_lock()
guarantees stable threadgroup.

* Lock threadgroup even if the target is a single task.  This
  guarantees that when the target tasks stay stable during migration
  regardless of the target type.

* Remove PF_EXITING early exit optimization from attach_task_by_pid()
  and check it in cgroup_task_migrate() instead.  The optimization was
  for rather cold path to begin with and PF_EXITING state can be
  trusted throughout migration by checking it after locking
  threadgroup.

* Don't add PF_EXITING tasks to target task array in
  cgroup_attach_proc().  This ensures that task migration is performed
  only for live tasks.

* Remove -ESRCH failure path from cgroup_task_migrate().  With the
  above changes, it's guaranteed to be called only for live tasks.

After the changes, only live tasks are migrated and they're guaranteed
to stay alive until migration is complete.  This removes problems
caused by exec and exit racing against cgroup migration including
symmetry among cgroup attach methods and different cgroup methods
racing each other.

v2: Oleg pointed out that one more PF_EXITING check can be removed
    from cgroup_attach_proc().  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
2011-12-12 18:12:21 -08:00
Tejun Heo
77e4ef99d1 threadgroup: extend threadgroup_lock() to cover exit and exec
threadgroup_lock() protected only protected against new addition to
the threadgroup, which was inherently somewhat incomplete and
problematic for its only user cgroup.  On-going migration could race
against exec and exit leading to interesting problems - the symmetry
between various attach methods, task exiting during method execution,
->exit() racing against attach methods, migrating task switching basic
properties during exec and so on.

This patch extends threadgroup_lock() such that it protects against
all three threadgroup altering operations - fork, exit and exec.  For
exit, threadgroup_change_begin/end() calls are added to exit_signals
around assertion of PF_EXITING.  For exec, threadgroup_[un]lock() are
updated to also grab and release cred_guard_mutex.

With this change, threadgroup_lock() guarantees that the target
threadgroup will remain stable - no new task will be added, no new
PF_EXITING will be set and exec won't happen.

The next patch will update cgroup so that it can take full advantage
of this change.

-v2: beefed up comment as suggested by Frederic.

-v3: narrowed scope of protection in exit path as suggested by
     Frederic.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-12 18:12:21 -08:00
Tejun Heo
257058ae2b threadgroup: rename signal->threadgroup_fork_lock to ->group_rwsem
Make the following renames to prepare for extension of threadgroup
locking.

* s/signal->threadgroup_fork_lock/signal->group_rwsem/
* s/threadgroup_fork_read_lock()/threadgroup_change_begin()/
* s/threadgroup_fork_read_unlock()/threadgroup_change_end()/
* s/threadgroup_fork_write_lock()/threadgroup_lock()/
* s/threadgroup_fork_write_unlock()/threadgroup_unlock()/

This patch doesn't cause any behavior change.

-v2: Rename threadgroup_change_done() to threadgroup_change_end() per
     KAMEZAWA's suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul Menage <paul@paulmenage.org>
2011-12-12 18:12:21 -08:00
Tejun Heo
e25e2cbb4c cgroup: add cgroup_root_mutex
cgroup wants to make threadgroup stable while modifying cgroup
hierarchies which will introduce locking dependency on
cred_guard_mutex from cgroup_mutex.  This unfortunately completes
circular dependency.

 A. cgroup_mutex -> cred_guard_mutex -> s_type->i_mutex_key -> namespace_sem
 B. namespace_sem -> cgroup_mutex

B is from cgroup_show_options() and this patch breaks it by
introducing another mutex cgroup_root_mutex which nests inside
cgroup_mutex and protects cgroupfs_root.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
2011-12-12 18:12:21 -08:00
Paul E. McKenney
a513f6bab0 cpu: Export cpu_up()
Building rcutorture as a module requires cpu_up() as well as cpu_down()
exported, so apply EXPORT_SYMBOL_GPL().

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-12 15:55:15 -08:00
Paul E. McKenney
4f89b336fd rcu: Apply ACCESS_ONCE() to rcu_boost() return value
Both TINY_RCU's and TREE_RCU's implementations of rcu_boost() access
the ->boost_tasks and ->exp_tasks fields without preventing concurrent
changes to these fields.  This commit therefore applies ACCESS_ONCE in
order to prevent compiler mischief.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11 10:33:19 -08:00
Paul E. McKenney
70321d447a Revert "rcu: Permit rt_mutex_unlock() with irqs disabled"
This reverts commit 5342e269b2b58ee0b0b4168a94087faaa60d0567.

The approach taken in this patch was deemed too abusive to mutexes,
and thus too likely to result in maintenance problems in the future.
Instead, we will disallow RCU read-side critical sections that partially
overlap with interrupt-disbled code segments.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11 10:33:18 -08:00
Paul E. McKenney
4968c300e1 rcu: Augment rcu_batch_end tracing for idle and callback state
The current rcu_batch_end event trace records only the name of the RCU
flavor and the total number of callbacks that remain queued on the
current CPU.  This is insufficient for testing and tuning the new
dyntick-idle RCU_FAST_NO_HZ code, so this commit adds idle state along
with whether or not any of the callbacks that were ready to invoke
at the beginning of rcu_do_batch() are still queued.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11 10:32:22 -08:00
Paul E. McKenney
101db7b41d rcu: Add rcutorture tests for srcu_read_lock_raw()
This commit adds simple rcutorture tests for srcu_read_lock_raw() and
srcu_read_unlock_raw().  It does not test doing srcu_read_lock_raw()
in an exception handler and releasing it in the corresponding process
context.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2011-12-11 10:32:21 -08:00