Commit 1b9508f, "Rate-limit newidle" has been confirmed to fix
the netperf UDP loopback regression reported by Alex Shi.
This is a cleanup and a fix:
- moved to a more out of the way spot
- fix to ensure that balancing doesn't try to balance
runqueues which haven't gone online yet, which can
mess up CPU enumeration during boot.
Reported-by: Alex Shi <alex.shi@intel.com>
Reported-by: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org> # .32.x: a1f84a3: sched: Check for an idle shared cache
Cc: <stable@kernel.org> # .32.x: 1b9508f: sched: Rate-limit newidle
Cc: <stable@kernel.org> # .32.x: fd21073: sched: Fix affinity logic
Cc: <stable@kernel.org> # .32.x
LKML-Reference: <1257821402.5648.17.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impose a clear locking design on the note_new_gpnum()
function's use of the ->gpnum counter. This is done by updating
rdp->gpnum only from the corresponding leaf rcu_node structure's
rnp->gpnum field, and even then only under the protection of
that same rcu_node structure's ->lock field. Performance and
scalability are maintained using a form of double-checked
locking, and excessive spinning is avoided by use of the
spin_trylock() function. The use of spin_trylock() is safe due
to the fact that CPUs who fail to acquire this lock will try
again later. The hierarchical nature of the rcu_node data
structure limits contention (which could be limited further if
need be using the RCU_FANOUT kernel parameter).
Without this patch, obscure but quite possible races could
result in a quiescent state that occurred during one grace
period to be accounted to the following grace period, causing
this following grace period to end prematurely. Not good!
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: <stable@kernel.org> # .32.x
LKML-Reference: <12571987492350-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Impose a clear locking design on the rcu_process_gp_end()
function's use of the ->completed counter. This is done by
creating a ->completed field in the rcu_node structure, which
can safely be accessed under the protection of that structure's
lock. Performance and scalability are maintained by using a
form of double-checked locking, so that rcu_process_gp_end()
only acquires the leaf rcu_node structure's ->lock if a grace
period has recently ended.
This fix reduces rcutorture failure rate by at least two orders
of magnitude under heavy stress with force_quiescent_state()
being invoked artificially often. Without this fix,
unsynchronized access to the ->completed field can cause
rcu_process_gp_end() to advance callbacks whose grace period has
not yet expired. (Bad idea!)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: <stable@kernel.org> # .32.x
LKML-Reference: <12571987494069-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
For SELinux to do better filtering in userspace we send the name of the
module along with the AVC denial when a program is denied module_request.
Example output:
type=SYSCALL msg=audit(11/03/2009 10:59:43.510:9) : arch=x86_64 syscall=write success=yes exit=2 a0=3 a1=7fc28c0d56c0 a2=2 a3=7fffca0d7440 items=0 ppid=1727 pid=1729 auid=unset uid=root gid=root euid=root suid=root fsuid=root egid=root sgid=root fsgid=root tty=(none) ses=unset comm=rpc.nfsd exe=/usr/sbin/rpc.nfsd subj=system_u:system_r:nfsd_t:s0 key=(null)
type=AVC msg=audit(11/03/2009 10:59:43.510:9) : avc: denied { module_request } for pid=1729 comm=rpc.nfsd kmod="net-pf-10" scontext=system_u:system_r:nfsd_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=system
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
When the system has too many timers or too many aggregate
queued signals, the EAGAIN error is returned to application
from kernel, including timer_create() [POSIX.1b].
It means that the app exceeded the limit of pending signals,
but in general application writers do not expect this
outcome and the current silent failure can cause rare app
failures under very high load.
This patch adds a new message when we reach the limit
and if print_fatal_signals is enabled:
task/1234: reached RLIMIT_SIGPENDING, dropping signal
If you see this message and your system behaved unexpectedly,
you can run following command to lift the limit:
# ulimit -i unlimited
With help from Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>.
Signed-off-by: Naohiro Ooiwa <nooiwa@miraclelinux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: oleg@redhat.com
LKML-Reference: <4AF6E7E2.9080406@miraclelinux.com>
[ Modified a few small details, gave surrounding code some love. ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The macro used to be used in both trace_selftest.c and
trace_ksym.c, but no longer, so remove it from header file.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Allow or refuse to build a counter using the breakpoints pmu following
given constraints.
We keep track of the pmu users by using three per cpu variables:
- nr_cpu_bp_pinned stores the number of pinned cpu breakpoints counters
in the given cpu
- nr_bp_flexible stores the number of non-pinned breakpoints counters
in the given cpu.
- task_bp_pinned stores the number of pinned task breakpoints in a cpu
The latter is not a simple counter but gathers the number of tasks that
have n pinned breakpoints.
Considering HBP_NUM the number of available breakpoint address
registers:
task_bp_pinned[0] is the number of tasks having 1 breakpoint
task_bp_pinned[1] is the number of tasks having 2 breakpoints
[...]
task_bp_pinned[HBP_NUM - 1] is the number of tasks having the
maximum number of registers (HBP_NUM).
When a breakpoint counter is created and wants an access to the pmu,
we evaluate the following constraints:
== Non-pinned counter ==
- If attached to a single cpu, check:
(per_cpu(nr_bp_flexible, cpu) || (per_cpu(nr_cpu_bp_pinned, cpu)
+ max(per_cpu(task_bp_pinned, cpu)))) < HBP_NUM
-> If there are already non-pinned counters in this cpu, it
means there is already a free slot for them.
Otherwise, we check that the maximum number of per task
breakpoints (for this cpu) plus the number of per cpu
breakpoint (for this cpu) doesn't cover every registers.
- If attached to every cpus, check:
(per_cpu(nr_bp_flexible, *) || (max(per_cpu(nr_cpu_bp_pinned, *))
+ max(per_cpu(task_bp_pinned, *)))) < HBP_NUM
-> This is roughly the same, except we check the number of per
cpu bp for every cpu and we keep the max one. Same for the
per tasks breakpoints.
== Pinned counter ==
- If attached to a single cpu, check:
((per_cpu(nr_bp_flexible, cpu) > 1)
+ per_cpu(nr_cpu_bp_pinned, cpu)
+ max(per_cpu(task_bp_pinned, cpu))) < HBP_NUM
-> Same checks as before. But now the nr_bp_flexible, if any,
must keep one register at least (or flexible breakpoints will
never be be fed).
- If attached to every cpus, check:
((per_cpu(nr_bp_flexible, *) > 1)
+ max(per_cpu(nr_cpu_bp_pinned, *))
+ max(per_cpu(task_bp_pinned, *))) < HBP_NUM
Changes in v2:
- Counter -> event rename
Changes in v5:
- Fix unreleased non-pinned task-bound-only counters. We only released
it in the first cpu. (Thanks to Paul Mackerras for reporting that)
Changes in v6:
- Currently, events scheduling are done in this order: cpu context
pinned + cpu context non-pinned + task context pinned + task context
non-pinned events. Then our current constraints are right theoretically
but not in practice, because non-pinned counters may be scheduled
before we can apply every possible pinned counters. So consider
non-pinned counters as pinned for now.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kiszka <jan.kiszka@web.de>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
This patch rebase the implementation of the breakpoints API on top of
perf events instances.
Each breakpoints are now perf events that handle the
register scheduling, thread/cpu attachment, etc..
The new layering is now made as follows:
ptrace kgdb ftrace perf syscall
\ | / /
\ | / /
/
Core breakpoint API /
/
| /
| /
Breakpoints perf events
|
|
Breakpoints PMU ---- Debug Register constraints handling
(Part of core breakpoint API)
|
|
Hardware debug registers
Reasons of this rewrite:
- Use the centralized/optimized pmu registers scheduling,
implying an easier arch integration
- More powerful register handling: perf attributes (pinned/flexible
events, exclusive/non-exclusive, tunable period, etc...)
Impact:
- New perf ABI: the hardware breakpoints counters
- Ptrace breakpoints setting remains tricky and still needs some per
thread breakpoints references.
Todo (in the order):
- Support breakpoints perf counter events for perf tools (ie: implement
perf_bpcounter_event())
- Support from perf tools
Changes in v2:
- Follow the perf "event " rename
- The ptrace regression have been fixed (ptrace breakpoint perf events
weren't released when a task ended)
- Drop the struct hw_breakpoint and store generic fields in
perf_event_attr.
- Separate core and arch specific headers, drop
asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h
- Use new generic len/type for breakpoint
- Handle off case: when breakpoints api is not supported by an arch
Changes in v3:
- Fix broken CONFIG_KVM, we need to propagate the breakpoint api
changes to kvm when we exit the guest and restore the bp registers
to the host.
Changes in v4:
- Drop the hw_breakpoint_restore() stub as it is only used by KVM
- EXPORT_SYMBOL_GPL hw_breakpoint_restore() as KVM can be built as a
module
- Restore the breakpoints unconditionally on kvm guest exit:
TIF_DEBUG_THREAD doesn't anymore cover every cases of running
breakpoints and vcpu->arch.switch_db_regs might not always be
set when the guest used debug registers.
(Waiting for a reliable optimization)
Changes in v5:
- Split-up the asm-generic/hw-breakpoint.h moving to
linux/hw_breakpoint.h into a separate patch
- Optimize the breakpoints restoring while switching from kvm guest
to host. We only want to restore the state if we have active
breakpoints to the host, otherwise we don't care about messed-up
address registers.
- Add asm/hw_breakpoint.h to Kbuild
- Fix bad breakpoint type in trace_selftest.c
Changes in v6:
- Fix wrong header inclusion in trace.h (triggered a build
error with CONFIG_FTRACE_SELFTEST
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kiszka <jan.kiszka@web.de>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
In 15934a37324f32e0fda633dc7984a671ea81cd75,
field last_tick_seen is added to struct rq.
But it is unused now.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Guillaume Chazarain <guichaz@yahoo.fr>
LKML-Reference: <4AE6A513.6010100@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
root_task_group_empty is used only with FAIR_GROUP_SCHED
so if we use other scheduler options we get:
kernel/sched.c:314: warning: 'root_task_group_empty' defined but not used
So move CONFIG_FAIR_GROUP_SCHED up that it covers
root_task_group_empty().
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <20091026192414.GB5321@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
If a parent directory (ie /proc/irq/<irq>) could not be created
we should not attempt to create subdirectories. Otherwise it
would lead that "smp_affinity" and "spurious" entries are may be
registered under /proc root instead of a proper place.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <20091026202811.GD5321@lenovo>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Fix variable name in sched.c kernel-doc notation.
Fixes this DocBook warning:
Warning(kernel/sched.c:2008): No description found for parameter
'p' Warning(kernel/sched.c:2008): Excess function parameter 'k'
description in 'kthread_bind'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
LKML-Reference: <4AF4B1BC.8020604@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
While tracing using events with perf, if one enables the
lockdep:lock_acquire event, it will infect every other perf
trace events.
Basically, you can enable whatever set of trace events through
perf but if this event is part of the set, the only result we
can get is a long list of lock_acquire events of rcu read lock,
and only that.
This is because of a recursion inside perf.
1) When a trace event is triggered, it will fill a per cpu
buffer and submit it to perf.
2) Perf will commit this event but will also protect some data
using rcu_read_lock
3) A recursion appears: rcu_read_lock triggers a lock_acquire
event that will fill the per cpu event and then submit the
buffer to perf.
4) Perf detects a recursion and ignores it
5) Perf continues its work on the previous event, but its buffer
has been overwritten by the lock_acquire event, it has then
been turned into a lock_acquire event of rcu read lock
Such scenario also happens with lock_release with
rcu_read_unlock().
We could turn the rcu_read_lock() into __rcu_read_lock() to drop
the lock debugging from perf fast path, but that would make us
lose the rcu debugging and that doesn't prevent from other
possible kind of recursion from perf in the future.
This patch adds a recursion protection based on a counter on the
perf trace per cpu buffers to solve the problem.
-v2: Fixed lost whitespace, added reviewed-by tag
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Jason Baron <jbaron@redhat.com>
LKML-Reference: <1257477185-7838-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c
sched: Disable SD_PREFER_LOCAL at node level
sched: Fix boot crash by zalloc()ing most of the cpu masks
sched: Strengthen buddies and mitigate buddy induced latencies
Rate limit newidle to migration_cost. It's a win for all
stages of sysbench oltp tests.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When waking affine, check for an idle shared cache, and if
found, wake to that CPU/sibling instead of the waker's CPU.
This improves pgsql+oltp ramp up by roughly 8%. Possibly more
for other loads, depending on overlap. The trade-off is a
roughly 1% peak downturn if tasks are truly synchronous.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@kernel.org>
LKML-Reference: <1256654138.17752.7.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
commit 74296a8ed added this function for debug purposes, but it was
never used for anything. Remove it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix docbook comments to match the actual function names
(set_irq_msi, handle_percpu_irq).
Signed-off-by: Liuwenyi <qingshenlwy@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently partition_sched_domains() takes a 'struct cpumask
*doms_new' which is a kmalloc'ed array of cpumask_t. You can't
have such an array if 'struct cpumask' is undefined, as we plan
for CONFIG_CPUMASK_OFFSTACK=y.
So, we make this an array of cpumask_var_t instead: this is the
same for the CONFIG_CPUMASK_OFFSTACK=n case, but requires
multiple allocations for the CONFIG_CPUMASK_OFFSTACK=y case.
Hence we add alloc_sched_domains() and free_sched_domains()
functions.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <200911031453.40668.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
find_lowest_rq() wants to call pick_optimal_cpu() on the
intersection of sched_domain_span(sd) and lowest_mask. Rather
than doing a cpus_and into a temporary, we can open-code it.
This actually makes the code slightly clearer, IMHO.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <200911031453.15350.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Rename Kprobes-based event tracer to kprobes-based tracing event
(kprobe-event), since it is not a tracer but an extensible
tracing event interface.
This also changes CONFIG_KPROBE_TRACER to CONFIG_KPROBE_EVENT
and sets it y by default.
Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Jason Baron <jbaron@redhat.com>
Cc: K.Prasad <prasad@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
LKML-Reference: <20091104001247.3454.14131.stgit@harusame>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
tools/perf/Makefile
Merge reason: Resolve the conflict, merge to upstream and merge in
perf fixes so we can add a dependent patch.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
cpu_nr_migrations() is not used, remove it.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4AF12A66.6020609@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When a command is passed to the set_ftrace_filter, then
the ftrace_regex_lock is still held going back to user space.
# echo 'do_open : foo' > set_ftrace_filter
(still holding ftrace_regex_lock when returning to user space!)
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
LKML-Reference: <4AEF7F8A.3080300@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We got a sudden panic when we reduced the size of the
ringbuffer.
We can reproduce the panic by the following steps:
echo 1 > events/sched/enable
cat trace_pipe > /dev/null &
while ((1))
do
echo 12000 > buffer_size_kb
echo 512 > buffer_size_kb
done
(not more than 5 seconds, panic ...)
Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
LKML-Reference: <4AF01735.9060409@cn.fujitsu.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
A simple callback in a perf event can be used for multiple purposes.
For example it is useful for triggered based events like hardware
breakpoints that need a callback to dispatch a triggered breakpoint
event.
v2: Simplify a bit the callback attribution as suggested by Paul
Mackerras
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "K.Prasad" <prasad@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mundt <lethal@linux-sh.org>
There are reasons for kernel code to ask for, and use, performance
counters.
For example, in CPU freq governors this tends to be a good idea, but
there are other examples possible as well of course.
This patch adds the needed bits to do enable this functionality; they
have been tested in an experimental cpufreq driver that I'm working on,
and the changes are all that I needed to access counters properly.
[fweisbec@gmail.com: added pid to perf_event_create_kernel_counter so
that we can profile a particular task too
TODO: Have a better error reporting, don't just return NULL in fail
case.]
v2: Remove the wrong comment about the fact
perf_event_create_kernel_counter must be called from a kernel
thread.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: "K.Prasad" <prasad@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Jan Kiszka <jan.kiszka@web.de>
Cc: Avi Kivity <avi@redhat.com>
LKML-Reference: <20090925122556.2f8bd939@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
* 'pm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
PM: Remove some debug messages producing too much noise
PM: Fix warning on suspend errors
PM / Hibernate: Add newline to load_image() fail path
PM / Hibernate: Fix error handling in save_image()
PM / Hibernate: Fix blkdev refleaks
PM / yenta: Split resume into early and late parts (rev. 4)
nr_processes() returns the sum of the per cpu counter process_counts for
all online CPUs. This counter is incremented for the current CPU on
fork() and decremented for the current CPU on exit(). Since a process
does not necessarily fork and exit on the same CPU the process_count for
an individual CPU can be either positive or negative and effectively has
no meaning in isolation.
Therefore calculating the sum of process_counts over only the online
CPUs omits the processes which were started or stopped on any CPU which
has since been unplugged. Only the sum of process_counts across all
possible CPUs has meaning.
The only caller of nr_processes() is proc_root_getattr() which
calculates the number of links to /proc as
stat->nlink = proc_root.nlink + nr_processes();
You don't have to be all that unlucky for the nr_processes() to return a
negative value leading to a negative number of links (or rather, an
apparently enormous number of links). If this happens then you can get
failures where things like "ls /proc" start to fail because they got an
-EOVERFLOW from some stat() call.
Example with some debugging inserted to show what goes on:
# ps haux|wc -l
nr_processes: CPU0: 90
nr_processes: CPU1: 1030
nr_processes: CPU2: -900
nr_processes: CPU3: -136
nr_processes: TOTAL: 84
proc_root_getattr. nlink 12 + nr_processes() 84 = 96
84
# echo 0 >/sys/devices/system/cpu/cpu1/online
# ps haux|wc -l
nr_processes: CPU0: 85
nr_processes: CPU2: -901
nr_processes: CPU3: -137
nr_processes: TOTAL: -953
proc_root_getattr. nlink 12 + nr_processes() -953 = -941
75
# stat /proc/
nr_processes: CPU0: 84
nr_processes: CPU2: -901
nr_processes: CPU3: -137
nr_processes: TOTAL: -954
proc_root_getattr. nlink 12 + nr_processes() -954 = -942
File: `/proc/'
Size: 0 Blocks: 0 IO Block: 1024 directory
Device: 3h/3d Inode: 1 Links: 4294966354
Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2009-11-03 09:06:55.000000000 +0000
Modify: 2009-11-03 09:06:55.000000000 +0000
Change: 2009-11-03 09:06:55.000000000 +0000
I'm not 100% convinced that the per_cpu regions remain valid for offline
CPUs, although my testing suggests that they do. If not then I think the
correct solution would be to aggregate the process_count for a given CPU
into a global base value in cpu_down().
This bug appears to pre-date the transition to git and it looks like it
may even have been present in linux-2.6.0-test7-bk3 since it looks like
the code Rusty patched in http://lwn.net/Articles/64773/ was already
wrong.
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Finish a line by \n when load_image fails in the middle of loading.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
There are too many retval variables in save_image(). Thus error return
value from snapshot_read_next() may be ignored and only part of the
snapshot (successfully) written.
Remove 'error' variable, invert the condition in the do-while loop
and convert the loop to use only 'ret' variable.
Switch the rest of the function to consider only 'ret'.
Also make sure we end printed line by \n if an error occurs.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
While cruising through the swsusp code I found few blkdev reference
leaks of resume_bdev.
swsusp_read: remove blkdev_put altogether. Some fail paths do
not do that.
swsusp_check: make sure we always put a reference on fail paths
software_resume: all fail paths between swsusp_check and swsusp_read
omit swsusp_close. Add it in those cases. And since
swsusp_read doesn't drop the reference anymore, do
it here unconditionally.
[rjw: Fixed a small coding style issue.]
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Eric Paris reported that commit
f685ceacab07d3f6c236f04803e2f2f0dbcc5afb causes boot time
PREEMPT_DEBUG complaints.
[ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
[ 4.593043] caller is task_hot+0x86/0xd0
Since kthread_bind() messes with scheduler internals, move the
body to sched.c, and lock the runqueue.
Reported-by: Eric Paris <eparis@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Tested-by: Eric Paris <eparis@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1256813310.7574.3.camel@marge.simson.net>
[ v2: fix !SMP build and clean up ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf tools: Remove -Wcast-align
perf tools: Fix compatibility with libelf 0.8 and autodetect
perf events: Don't generate events for the idle task when exclude_idle is set
perf events: Fix swevent hrtimer sampling by keeping track of remaining time when enabling/disabling swevent hrtimers
* 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
tracing: Remove cpu arg from the rb_time_stamp() function
tracing: Fix comment typo and documentation example
tracing: Fix trace_seq_printf() return value
tracing: Update *ppos instead of filp->f_pos
For as long as kretprobes have existed, we've allocated NR_CPUS
instances of kretprobe_instance structures. With the default
value of CONFIG_NR_CPUS increasing on certain architectures, we
are potentially wasting kernel memory.
See http://sourceware.org/bugzilla/show_bug.cgi?id=10839#c3 for
more details.
Use a saner num_possible_cpus() instead of NR_CPUS for
allocation.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Jim Keniston <jkenisto@us.ibm.com>
Cc: fweisbec@gmail.com
LKML-Reference: <20091030135310.GA22230@in.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Currently, rcu_irq_exit() is invoked only for CONFIG_NO_HZ,
while rcu_irq_enter() is invoked unconditionally. This patch
moves rcu_irq_exit() out from under CONFIG_NO_HZ so that the
calls are balanced.
This patch has no effect on the behavior of the kernel because
both rcu_irq_enter() and rcu_irq_exit() are empty for
!CONFIG_NO_HZ, but the code is easier to understand if the calls
are obviously balanced in all cases.
Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <12567428891605-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Very long RCU read-side critical sections (50 milliseconds or
so) can cause a race between force_quiescent_state() and
rcu_start_gp() as follows on kernel builds with multi-level
rcu_node hierarchies:
1. CPU 0 calls force_quiescent_state(), sees that there is a
grace period in progress, and acquires ->fsqlock.
2. CPU 1 detects the end of the grace period, and so
cpu_quiet_msk_finish() sets rsp->completed to rsp->gpnum.
This operation is carried out under the root rnp->lock,
but CPU 0 has not yet acquired that lock. Note that
rsp->signaled is still RCU_SAVE_DYNTICK from the last
grace period.
3. CPU 1 calls rcu_start_gp(), but no one wants a new grace
period, so it drops the root rnp->lock and returns.
4. CPU 0 acquires the root rnp->lock and picks up rsp->completed
and rsp->signaled, then drops rnp->lock. It then enters the
RCU_SAVE_DYNTICK leg of the switch statement.
5. CPU 2 invokes call_rcu(), and now needs a new grace period.
It calls rcu_start_gp(), which acquires the root rnp->lock, sets
rsp->signaled to RCU_GP_INIT (too bad that CPU 0 is already in
the RCU_SAVE_DYNTICK leg of the switch statement!) and starts
initializing the rcu_node hierarchy. If there are multiple
levels to the hierarchy, it will drop the root rnp->lock and
initialize the lower levels of the hierarchy.
6. CPU 0 notes that rsp->completed has not changed, which permits
both CPU 2 and CPU 0 to try updating it concurrently. If CPU 0's
update prevails, later calls to force_quiescent_state() can
count old quiescent states against the new grace period, which
can in turn result in premature ending of grace periods.
Not good.
This patch adds an RCU_GP_IDLE state for rsp->signaled that is
set initially at boot time and any time a grace period ends.
This prevents CPU 0 from getting into the workings of
force_quiescent_state() in step 4. Additional locking and
checks prevent the concurrent update of rsp->signaled in step 6.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference: <1256742889199-git-send-email->
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo triggered the following warning:
WARNING: at lib/debugobjects.c:255 debug_print_object+0x42/0x50()
Hardware name: System Product Name
ODEBUG: init active object type: timer_list
Modules linked in:
Pid: 2619, comm: dmesg Tainted: G W 2.6.32-rc5-tip+ #5298
Call Trace:
[<81035443>] warn_slowpath_common+0x6a/0x81
[<8120e483>] ? debug_print_object+0x42/0x50
[<81035498>] warn_slowpath_fmt+0x29/0x2c
[<8120e483>] debug_print_object+0x42/0x50
[<8120ec2a>] __debug_object_init+0x279/0x2d7
[<8120ecb3>] debug_object_init+0x13/0x18
[<810409d2>] init_timer_key+0x17/0x6f
[<81041526>] free_uid+0x50/0x6c
[<8104ed2d>] put_cred_rcu+0x61/0x72
[<81067fac>] rcu_do_batch+0x70/0x121
debugobjects warns about an enqueued timer being initialized. If
CONFIG_USER_SCHED=y the user management code uses delayed work to
remove the user from the hash table and tear down the sysfs objects.
free_uid is called from RCU and initializes/schedules delayed work if
the usage count of the user_struct is 0. The init/schedule happens
outside of the uidhash_lock protected region which allows a concurrent
caller of find_user() to reference the about to be destroyed
user_struct w/o preventing the work from being scheduled. If the next
free_uid call happens before the work timer expired then the active
timer is initialized and the work scheduled again.
The race was introduced in commit 5cb350ba (sched: group scheduling,
sysfs tunables) and made more prominent by commit 3959214f (sched:
delayed cleanup of user_struct)
Move the init/schedule_delayed_work inside of the uidhash_lock
protected region to prevent the race.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@us.ibm.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: stable@kernel.org
I got a boot crash when forcing cpumasks offstack on 32 bit,
because find_new_ilb() returned 3 on my UP system (nohz.cpu_mask
wasn't zeroed).
AFAICT the others need to be zeroed too: only
nohz.ilb_grp_nohz_mask is initialized before use.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <200911022037.21282.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
____ftrace_check_##name() is used for compile-time check on
F_printk() only, so it should be marked as __unused instead
of __used.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <4AEE2D01.4010305@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
sched: move rq_weight data array out of .percpu
percpu: allow pcpu_alloc() to be called with IRQs off
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-param-fixes:
param: fix setting arrays of bool
param: fix NULL comparison on oom
param: fix lots of bugs with writing charp params from sysfs, by leaking mem.