PREEMPT_RESTRICT was a method aimed at reducing the amount of wakeup
related preemption. It has a disadvantage though, it can prevent
legitimate wakeups if a task is 'unlucky' to be hit too early by a tick
that clears peer_preempt.
Now that the wakeup preemption has been cleaned up we dont seem to have
excessive preemptions anymore, so this feature can be turned off. (and
removed in the next patch)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
1) hardcoded 1000000000 value is used five times in places where
NSEC_PER_SEC might be more readable.
2) A conversion from nsec to msec uses the hardcoded 1000000 value,
which is a candidate for NSEC_PER_MSEC.
no code changed:
text data bss dec hex filename
44359 3326 36 47721 ba69 sched.o.before
44359 3326 36 47721 ba69 sched.o.after
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Yanmin Zhang reported an aim7 regression and bisected it down to:
| commit 38ad464d41
| Author: Ingo Molnar <mingo@elte.hu>
| Date: Mon Oct 15 17:00:02 2007 +0200
|
| sched: uniform tunings
|
| use the same defaults on both UP and SMP.
fix this by reintroducing similar SMP tunings again. This resolves
the regression.
(also update the comments to match the ilog2(nr_cpus) tuning effect)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Adds a cpu.usage file to the CFS cgroup that reports CPU usage in
milliseconds for that cgroup's tasks
[ mingo@elte.hu: style cleanups. ]
Signed-off-by: Paul Menage <menage@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Peter Zijlstra noticed that the rcu_head object need not be present
in every cfs_rq of a group. Move it to the task_group structure
instead.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch:
commit 9b5b77512d
Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Date: Mon Oct 15 17:00:09 2007 +0200
sched: clean up code under CONFIG_FAIR_GROUP_SCHED
Introduced an assumption of the existence of CPU0 via this line
cfs_rq = tg->cfs_rq[0];
If you have no CPU0, that will be NULL. The fix seems to be just to
take whatever cfs_rq queue comes out of the for_each_possible_cpu()
loop, since they're all equally good for the destruction operation.
Signed-off-by: James Bottomley <James.Bottomley@SteelEye.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
At the moment, a lot of load balancing code that is irrelevant to non
SMP systems gets included during non SMP builds.
This patch addresses this issue and reduces the binary size on non
SMP systems:
text data bss dec hex filename
10983 28 1192 12203 2fab sched.o.before
10739 28 1192 11959 2eb7 sched.o.after
Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
At the moment, balance_tasks() provides low level functionality for both
move_tasks() and move_one_task() (indirectly) via the load_balance()
function (in the sched_class interface) which also provides dual
functionality. This dual functionality complicates the interfaces and
internal mechanisms and makes the run time overhead of operations that
are called with two run queue locks held.
This patch addresses this issue and reduces the overhead of these
operations.
Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
- replace "cont" with "cgrp" in a few places in the CFS cgroup code,
- use write_uint rather than write for cpu.shares write function
Signed-off-by: Paul Menage <menage@google.com>
Acked-by : Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
A full register dump along with stack backtrace would make the
"scheduling while atomic" message more helpful. Use show_regs() instead
of dump_stack() for this. We already know we're atomic in here (that is
why this function was called) so show_regs()'s atomicity expectations
are guaranteed.
Also, modify the output of the "BUG: scheduling while atomic:" header a
bit to keep task->comm and task->pid together and preempt_count() after
them.
Signed-off-by: Satyam Sharma <satyam@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
clean up sched_domain_debug().
this also shrinks the code a bit:
text data bss dec hex filename
50474 4306 480 55260 d7dc sched.o.before
50404 4306 480 55190 d796 sched.o.after
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jeff Dike noticed that wait_for_completion_interruptible()'s prototype
had a mismatched fastcall.
Fix this by removing the fastcall attributes from all the completion APIs.
Found-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
commit 029190c515 (cpuset
sched_load_balance flag) was not tested SCHED_DEBUG enabled as
committed as it dereferences NULL when used and it reordered
the sysctl registration to cause it to never show any domains
or their tunables.
Fixes:
1) restore arch_init_sched_domains ordering
we can't walk the domains before we build them
presently we register cpus with empty directories (no domain
directories or files).
2) make unregister_sched_domain_sysctl do nothing when already unregistered
detach_destroy_domains is now called one set of cpus at a time
unregister_syctl dereferences NULL if called with a null.
While the the function would always dereference null if called
twice, in the previous code it was always called once and then
was followed a register. So only the hidden bug of the
sysctl_root_table not being allocated followed by an attempt to
free it would have shown the error.
3) always call unregister and register in partition_sched_domains
The code is "smart" about unregistering only needed domains.
Since we aren't guaranteed any calls to unregister, always
unregister. Without calling register on the way out we
will not have a table or any sysctl tree.
4) warn if register is called without unregistering
The previous table memory is lost, leaving pointers to the
later freed memory in sysctl and leaking the memory of the
tables.
Before this patch on a 2-core 4-thread box compiled for SMT and NUMA,
the domains appear empty (there are actually 3 levels per cpu). And as
soon as two domains a null pointer is dereferenced (unreliable in this
case is stack garbage):
bu19a:~# ls -R /proc/sys/kernel/sched_domain/
/proc/sys/kernel/sched_domain/:
cpu0 cpu1 cpu2 cpu3
/proc/sys/kernel/sched_domain/cpu0:
/proc/sys/kernel/sched_domain/cpu1:
/proc/sys/kernel/sched_domain/cpu2:
/proc/sys/kernel/sched_domain/cpu3:
bu19a:~# mkdir /dev/cpuset
bu19a:~# mount -tcpuset cpuset /dev/cpuset/
bu19a:~# cd /dev/cpuset/
bu19a:/dev/cpuset# echo 0 > sched_load_balance
bu19a:/dev/cpuset# mkdir one
bu19a:/dev/cpuset# echo 1 > one/cpus
bu19a:/dev/cpuset# echo 0 > one/sched_load_balance
Unable to handle kernel paging request for data at address 0x00000018
Faulting instruction address: 0xc00000000006b608
NIP: c00000000006b608 LR: c00000000006b604 CTR: 0000000000000000
REGS: c000000018d973f0 TRAP: 0300 Not tainted (2.6.23-bml)
MSR: 9000000000009032 <EE,ME,IR,DR> CR: 28242442 XER: 00000000
DAR: 0000000000000018, DSISR: 0000000040000000
TASK = c00000001912e340[1987] 'bash' THREAD: c000000018d94000 CPU: 2
..
NIP [c00000000006b608] .unregister_sysctl_table+0x38/0x110
LR [c00000000006b604] .unregister_sysctl_table+0x34/0x110
Call Trace:
[c000000018d97670] [c000000007017270] 0xc000000007017270 (unreliable)
[c000000018d97720] [c000000000058710] .detach_destroy_domains+0x30/0xb0
[c000000018d977b0] [c00000000005cf1c] .partition_sched_domains+0x1bc/0x230
[c000000018d97870] [c00000000009fdc4] .rebuild_sched_domains+0xb4/0x4c0
[c000000018d97970] [c0000000000a02e8] .update_flag+0x118/0x170
[c000000018d97a80] [c0000000000a1768] .cpuset_common_file_write+0x568/0x820
[c000000018d97c00] [c00000000009d95c] .cgroup_file_write+0x7c/0x180
[c000000018d97cf0] [c0000000000e76b8] .vfs_write+0xe8/0x1b0
[c000000018d97d90] [c0000000000e810c] .sys_write+0x4c/0x90
[c000000018d97e30] [c00000000000852c] syscall_exit+0x0/0x40
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
KVM clears it by itself now, and for s390 this is plain wrong.
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Fix the various misspellings of "system", controller", "interrupt" and
"[un]necessary".
Signed-off-by: Robert P. J. Day <rpjday@mindspring.com>
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Enable "cgroup" (formerly containers) based fair group scheduling. This
will let administrator create arbitrary groups of tasks (using "cgroup"
pseudo filesystem) and control their cpu bandwidth usage.
[akpm@linux-foundation.org: fix cpp condition]
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have
been running on that cpu.
Currently, such a task is migrated:
1) to any cpu on the same node as the disabled cpu, which is both online
and among that task's cpus_allowed
2) to any cpu which is both online and among that task's cpus_allowed
It is typical of a multithreaded application running on a large NUMA system to
have its tasks confined to a cpuset so as to cluster them near the memory that
they share. Furthermore, it is typical to explicitly place such a task on a
specific cpu in that cpuset. And in that case the task's cpus_allowed
includes only a single cpu.
This patch would insert a preference to migrate such a task to some cpu within
its cpuset (and set its cpus_allowed to its entire cpuset).
With this patch, migrate the task to:
1) to any cpu on the same node as the disabled cpu, which is both online
and among that task's cpus_allowed
2) to any online cpu within the task's cpuset
3) to any cpu which is both online and among that task's cpus_allowed
In order to do this, move_task_off_dead_cpu() must make a call to
cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will
not block. (name change - per Oleg's suggestion)
Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to set
the cpuset mutex during the whole migrate_live_tasks() and
migrate_dead_tasks() procedure.
[akpm@linux-foundation.org: build fix]
[pj@sgi.com: Fix indentation and spacing]
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The task_struct->pid member is going to be deprecated, so start
using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
the kernel.
The first thing to start with is the pid, printed to dmesg - in
this case we may safely use task_pid_nr(). Besides, printks produce
more (much more) than a half of all the explicit pid usage.
[akpm@linux-foundation.org: git-drm went and changed lots of stuff]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
tsk->exit_state can only be 0, EXIT_ZOMBIE, or EXIT_DEAD. A non-zero test
is the same as tsk->exit_state & (EXIT_ZOMBIE | EXIT_DEAD), so just testing
tsk->exit_state is sufficient.
Signed-off-by: Eugene Teo <eugeneteo@kernel.sg>
Cc: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cause writes to cpuset "cpus" file to update cpus_allowed for member tasks:
- collect batches of tasks under tasklist_lock and then call
set_cpus_allowed() on them outside the lock (since this can sleep).
- add a simple generic priority heap type to allow efficient collection
of batches of tasks to be processed without duplicating or missing any
tasks in subsequent batches.
- make "cpus" file update a no-op if the mask hasn't changed
- fix race between update_cpumask() and sched_setaffinity() by making
sched_setaffinity() post-check that it's not running on any cpus outside
cpuset_cpus_allowed().
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new per-cpuset flag called 'sched_load_balance'.
When enabled in a cpuset (the default value) it tells the kernel scheduler
that the scheduler should provide the normal load balancing on the CPUs in
that cpuset, sometimes moving tasks from one CPU to a second CPU if the
second CPU is less loaded and if that task is allowed to run there.
When disabled (write "0" to the file) then it tells the kernel scheduler
that load balancing is not required for the CPUs in that cpuset.
Now even if this flag is disabled for some cpuset, the kernel may still
have to load balance some or all the CPUs in that cpuset, if some
overlapping cpuset has its sched_load_balance flag enabled.
If there are some CPUs that are not in any cpuset whose sched_load_balance
flag is enabled, the kernel scheduler will not load balance tasks to those
CPUs.
Moreover the kernel will partition the 'sched domains' (non-overlapping
sets of CPUs over which load balancing is attempted) into the finest
granularity partition that it can find, while still keeping any two CPUs
that are in the same shed_load_balance enabled cpuset in the same element
of the partition.
This serves two purposes:
1) It provides a mechanism for real time isolation of some CPUs, and
2) it can be used to improve performance on systems with many CPUs
by supporting configurations in which load balancing is not done
across all CPUs at once, but rather only done in several smaller
disjoint sets of CPUs.
This mechanism replaces the earlier overloading of the per-cpuset
flag 'cpu_exclusive', which overloading was removed in an earlier
patch: cpuset-remove-sched-domain-hooks-from-cpusets
See further the Documentation and comments in the code itself.
[akpm@linux-foundation.org: don't be weird]
Signed-off-by: Paul Jackson <pj@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The find_task_by_something is a set of macros are used to find task by pid
depending on what kind of pid is proposed - global or virtual one. All of
them are wrappers above the most generic one - find_task_by_pid_type_ns() -
and just substitute some args for it.
It turned out, that dereferencing the current->nsproxy->pid_ns construction
and pushing one more argument on the stack inline cause kernel text size to
grow.
This patch moves all this stuff out-of-line into kernel/pid.c. Together
with the next patch it saves a bit less than 400 bytes from the .text
section.
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is the largest patch in the set. Make all (I hope) the places where
the pid is shown to or get from user operate on the virtual pids.
The idea is:
- all in-kernel data structures must store either struct pid itself
or the pid's global nr, obtained with pid_nr() call;
- when seeking the task from kernel code with the stored id one
should use find_task_by_pid() call that works with global pids;
- when showing pid's numerical value to the user the virtual one
should be used, but however when one shows task's pid outside this
task's namespace the global one is to be used;
- when getting the pid from userspace one need to consider this as
the virtual one and use appropriate task/pid-searching functions.
[akpm@linux-foundation.org: build fix]
[akpm@linux-foundation.org: nuther build fix]
[akpm@linux-foundation.org: yet nuther build fix]
[akpm@linux-foundation.org: remove unneeded casts]
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Paul Menage <menage@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This example demonstrates how to use the generic cgroup subsystem for a
simple resource tracker that counts, for the processes in a cgroup, the
total CPU time used and the %CPU used in the last complete 10 second interval.
Portions contributed by Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: Kirill Korotaev <dev@openvz.org>
Cc: Herbert Poetzl <herbert@13thfloor.at>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Cedric Le Goater <clg@fr.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This adds items to the taststats struct to account for user and system
time based on scaling the CPU frequency and instruction issue rates.
Adds account_(user|system)_time_scaled callbacks which architectures
can use to account for time using this mechanism.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
schedstat is useful in investigating CPU scheduler behavior. Ideally,
I think it is beneficial to have it on all the time. However, the
cost of turning it on in production system is quite high, largely due
to number of events it collects and also due to its large memory
footprint.
Most of the fields probably don't need to be full 64-bit on 64-bit
arch. Rolling over 4 billion events will most like take a long time
and user space tool can be made to accommodate that. I'm proposing
kernel to cut back most of variable width on 64-bit system. (note,
the following patch doesn't affect 32-bit system).
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
printk: add the KERN_CONT annotation (which is empty string but via
which checkpatch.pl can notice that the lacking KERN_ level is fine).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The recent wait_for_completion() cleanups:
commit 8cbbe86dfc
Author: Andi Kleen <ak@suse.de>
Date: Mon Oct 15 17:00:14 2007 +0200
sched: cleanup: refactor common code of sleep_on / wait_for_completion
Refactor common code of sleep_on / wait_for_completion
broke the return value of wait_for_completion_interruptible().
Previously it returned 0 on success, now -1. Fix that.
Problem found by Geert Uytterhoeven.
[ mingo: fixed whitespace damage ]
Reported-by: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Change migration_call(CPU_DEAD) to use direct spin_lock_irq() instead of
task_rq_lock(rq->idle), rq->idle can't change its task_rq().
This makes the code a bit more symmetrical with migrate_dead_tasks()'s path
which uses spin_lock_irq/spin_unlock_irq.
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently move_task_off_dead_cpu() is called under
write_lock_irq(tasklist). This means it can't use task_lock() which is
needed to improve migrating to take task's ->cpuset into account.
Change the code to call move_task_off_dead_cpu() with irqs enabled, and
change migrate_live_tasks() to use read_lock(tasklist).
This all is a preparation for the futher changes proposed by Cliff Wickman, see
http://marc.info/?t=117327786100003
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: Gautham R Shenoy <ego@in.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Child task may be added on a different cpu that the one on which parent
is running. In which case, task_new_fair() should check whether the new
born task's parent entity should be added as well on the cfs_rq.
Patch below fixes the problem in task_new_fair.
This could fix the put_prev_task_fair() crashes reported.
Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Reported-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We recently discovered a nasty performance bug in the kernel CPU load
balancer where we were hit by 50% performance regression.
When tasks are assigned to a subset of CPUs that span across
sched_domains (either ccNUMA node or the new multi-core domain) via
cpu affinity, kernel fails to perform proper load balance at
these domains, due to several logic in find_busiest_group() miss
identified busiest sched group within a given domain. This leads to
inadequate load balance and causes 50% performance hit.
To give you a concrete example, on a dual-core, 2 socket numa system,
there are 4 logical cpu, organized as:
CPU0 attaching sched-domain:
domain 0: span 0003 groups: 0001 0002
domain 1: span 000f groups: 0003 000c
CPU1 attaching sched-domain:
domain 0: span 0003 groups: 0002 0001
domain 1: span 000f groups: 0003 000c
CPU2 attaching sched-domain:
domain 0: span 000c groups: 0004 0008
domain 1: span 000f groups: 000c 0003
CPU3 attaching sched-domain:
domain 0: span 000c groups: 0008 0004
domain 1: span 000f groups: 000c 0003
If I run 2 tasks with CPU affinity set to 0x5. There are situation
where cpu0 has run queue length of 2, and cpu2 will be idle. The
kernel load balancer is unable to balance out these two tasks over
cpu0 and cpu2 due to at least three logics in find_busiest_group()
that heavily bias load balance towards power saving mode. e.g. while
determining "busiest" variable, kernel only set it when
"sum_nr_running > group_capacity". This test is flawed that
"sum_nr_running" is not necessary same as
sum-tasks-allowed-to-run-within-the sched-group. The end result is
that kernel "think" everything is balanced, but in reality we have an
imbalance and thus causing one CPU to be over-subscribed and leaving
other idle. There are two other logic in the same function will also
causing similar effect. The nastiness of this bug is that kernel not
be able to get unstuck in this unfortunate broken state. From what
we've seen in our environment, kernel will stuck in imbalanced state
for extended period of time and it is also very easy for the kernel to
stuck into that state (it's pretty much 100% reproducible for us).
So proposing the following fix: add addition logic in
find_busiest_group to detect intrinsic imbalance within the busiest
group. When such condition is detected, load balance goes into spread
mode instead of default grouping mode.
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It occurred to me this morning that the procname field was dynamically
allocated and needed to be freed. I started to put in break statements
when allocation failed but it was approaching 50% error handling code.
I came up with this alternative of looping while entry->mode is set and
checking proc_handler instead of ->table. Alternatively, the string
version of the domain name and cpu number could be stored the structs.
I verified by compiling CONFIG_DEBUG_SLAB and checking the allocation
counts after taking a cpuset exclusive and back.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Remove the cpuset hooks that defined sched domains depending on the setting
of the 'cpu_exclusive' flag.
The cpu_exclusive flag can only be set on a child if it is set on the
parent.
This made that flag painfully unsuitable for use as a flag defining a
partitioning of a system.
It was entirely unobvious to a cpuset user what partitioning of sched
domains they would be causing when they set that one cpu_exclusive bit on
one cpuset, because it depended on what CPUs were in the remainder of that
cpusets siblings and child cpusets, after subtracting out other
cpu_exclusive cpusets.
Furthermore, there was no way on production systems to query the
result.
Using the cpu_exclusive flag for this was simply wrong from the get go.
Fortunately, it was sufficiently borked that so far as I know, almost no
successful use has been made of this. One real time group did use it to
affectively isolate CPUs from any load balancing efforts. They are willing
to adapt to alternative mechanisms for this, such as someway to manipulate
the list of isolated CPUs on a running system. They can do without this
present cpu_exclusive based mechanism while we develop an alternative.
There is a real risk, to the best of my understanding, of users
accidentally setting up a partitioned scheduler domains, inhibiting desired
load balancing across all their CPUs, due to the nonobvious (from the
cpuset perspective) side affects of the cpu_exclusive flag.
Furthermore, since there was no way on a running system to see what one was
doing with sched domains, this change will be invisible to any using code.
Unless they have real insight to the scheduler load balancing choices, they
will be unable to detect that this change has been made in the kernel's
behaviour.
Initial discussion on lkml of this patch has generated much comment. My
(probably controversial) take on that discussion is that it has reached a
rough concensus that the current cpuset cpu_exclusive mechanism for
defining sched domains is borked. There is no concensus on the
replacement. But since we can remove this mechanism, and since its
continued presence risks causing unwanted partitioning of the schedulers
load balancing, we should remove it while we can, as we proceed to work the
replacement scheduler domain mechanisms.
Signed-off-by: Paul Jackson <pj@sgi.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Christoph Lameter <clameter@engr.sgi.com>
Cc: Dinakar Guniguntala <dino@in.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu
variable. This saves sizeof(cpumask_t) * NR unused cpus. Access is mostly
from startup and CPU HOTPLUG functions.
Signed-off-by: Mike Travis <travis@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
make sure sync wakeups preempt too - the scheduler will not
overschedule as we've got various throttles against that.
As a result, sync wakeups can be used more widely in the kernel
(to signal wakeup affinity between tasks), and no arbitrary
latencies will be introduced either.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
make sync wakeups affine for cache-cold tasks: if a cache-cold task
is woken up by a sync wakeup then use the opportunity to migrate it
straight away. (the two tasks are 'related' because they communicate)
Signed-off-by: Ingo Molnar <mingo@elte.hu>
modify account_system_time() to add cputime to cpustat->guest if we are
running a VCPU. We add this cputime to cpustat->user instead of
cpustat->system because this part of KVM code is in fact user code
although it is executed in the kernel. We duplicate VCPU time between
guest and user to allow an unmodified "top(1)" to display correct value.
A modified "top(1)" is able to display good cpu user time and cpu guest
time by subtracting cpu guest time from cpu user time. Update "gtime" in
task_struct accordingly.
Signed-off-by: Laurent Vivier <Laurent.Vivier@bull.net>
Acked-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
we had an incorrect-terminator bug in sd_alloc_ctl_domain_table()
before, so add a comment that documents it.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Now that we are calling this at runtime, a more relaxed error path is
suggested. If an allocation fails, we just register the partial table,
which will show empty directories.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Unregister and free the sysctl table before destroying domains, then
rebuild and register after creating the new domains. This prevents the
sysctl table from pointing to freed memory for root to write.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
init_sched_domain_sysctl was walking cpus 0-n and referencing per_cpu
variables. If the cpus_possible mask is not contigious this will result
in a crash referencing unallocated data. If the online mask is not
contigious then we would show offline cpus and miss online ones.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
kcalloc checks for n * sizeof(element) overflows and it zeros.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>