Child task may be added on a different cpu that the one on which parent
is running. In which case, task_new_fair() should check whether the new
born task's parent entity should be added as well on the cfs_rq.
Patch below fixes the problem in task_new_fair.
This could fix the put_prev_task_fair() crashes reported.
Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Reported-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
When CONFIG_SYSFS is not set, CONFIG_FAIR_USER_SCHED fails to build
with
kernel/built-in.o: In function `uids_kobject_init':
(.init.text+0x1488): undefined reference to `kernel_subsys'
kernel/built-in.o: In function `uids_kobject_init':
(.init.text+0x1490): undefined reference to `kernel_subsys'
kernel/built-in.o: In function `uids_kobject_init':
(.init.text+0x1480): undefined reference to `kernel_subsys'
kernel/built-in.o: In function `uids_kobject_init':
(.init.text+0x1494): undefined reference to `kernel_subsys'
This patch fixes this build error.
Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
We recently discovered a nasty performance bug in the kernel CPU load
balancer where we were hit by 50% performance regression.
When tasks are assigned to a subset of CPUs that span across
sched_domains (either ccNUMA node or the new multi-core domain) via
cpu affinity, kernel fails to perform proper load balance at
these domains, due to several logic in find_busiest_group() miss
identified busiest sched group within a given domain. This leads to
inadequate load balance and causes 50% performance hit.
To give you a concrete example, on a dual-core, 2 socket numa system,
there are 4 logical cpu, organized as:
CPU0 attaching sched-domain:
domain 0: span 0003 groups: 0001 0002
domain 1: span 000f groups: 0003 000c
CPU1 attaching sched-domain:
domain 0: span 0003 groups: 0002 0001
domain 1: span 000f groups: 0003 000c
CPU2 attaching sched-domain:
domain 0: span 000c groups: 0004 0008
domain 1: span 000f groups: 000c 0003
CPU3 attaching sched-domain:
domain 0: span 000c groups: 0008 0004
domain 1: span 000f groups: 000c 0003
If I run 2 tasks with CPU affinity set to 0x5. There are situation
where cpu0 has run queue length of 2, and cpu2 will be idle. The
kernel load balancer is unable to balance out these two tasks over
cpu0 and cpu2 due to at least three logics in find_busiest_group()
that heavily bias load balance towards power saving mode. e.g. while
determining "busiest" variable, kernel only set it when
"sum_nr_running > group_capacity". This test is flawed that
"sum_nr_running" is not necessary same as
sum-tasks-allowed-to-run-within-the sched-group. The end result is
that kernel "think" everything is balanced, but in reality we have an
imbalance and thus causing one CPU to be over-subscribed and leaving
other idle. There are two other logic in the same function will also
causing similar effect. The nastiness of this bug is that kernel not
be able to get unstuck in this unfortunate broken state. From what
we've seen in our environment, kernel will stuck in imbalanced state
for extended period of time and it is also very easy for the kernel to
stuck into that state (it's pretty much 100% reproducible for us).
So proposing the following fix: add addition logic in
find_busiest_group to detect intrinsic imbalance within the busiest
group. When such condition is detected, load balance goes into spread
mode instead of default grouping mode.
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
It occurred to me this morning that the procname field was dynamically
allocated and needed to be freed. I started to put in break statements
when allocation failed but it was approaching 50% error handling code.
I came up with this alternative of looping while entry->mode is set and
checking proc_handler instead of ->table. Alternatively, the string
version of the domain name and cpu number could be stored the structs.
I verified by compiling CONFIG_DEBUG_SLAB and checking the allocation
counts after taking a cpuset exclusive and back.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'upstream-linus' of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:
WOL bugfix for 3c59x.c
skge 1.12
skge: add a debug interface
skge: eeprom support
skge: internal stats
skge: XM PHY handling fixes
skge: changing MTU while running causes problems
skge: fix ram buffer size calculation
gianfar: Fix compile regression caused by 09f75cd7
net: Fix new EMAC driver for NAPI changes
bonding: two small fixes for IPoIB support
e1000e: don't poke PHY registers to retreive link status
e1000e: fix error checks
e1000e: Fix debug printk macro
tokenring/3c359.c: fixed array index problem
[netdrvr] forcedeth: remove in-driver copy of net_device_stats
[netdrvr] forcedeth: improved probe info; dev_printk() cleanups
forcedeth: fix NAPI rx poll function
AFAICS, fallout from repacing include of blkdev.h with include of bio.h.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This fixes a lost 'key' variable declaration that went missing in a
mismerge (commit b981d8b3f5)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some NICs (3c905B) can not generate PME in power state PCI_D0, while others
like 3c905C can. Call pci_enable_wake() with PCI_D3hot should give proper
WOL for 3c905B.
Signed-off-by: Steffen Klassert <klassert@mathematik.tu-chemnitz.de>
Tested-by: Harry Coin <hcoin@n4comm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Add a debugfs interface to look at internal ring state.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Change how PHY is managed on SysKonnect fibre based boards.
Poll for PHY coming up 1 per second, but use interrupt to detect loss.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Rather than bring network down/up when changing MTU,
only need to impact receiver.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
This fixes problems with transmit hangs on older fiber based SysKonnect boards.
Adjust ram buffer sizing calculation to make it correct on all boards
and make it like the code in sky2 driver.
Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
net: Fix new EMAC driver for NAPI changes
This fixes the new EMAC driver for the NAPI updates. The previous patch
by Roland Dreier (already applied) to do that doesn't actually work. This
applies on top of it makes it work on my test Ebony machine.
This patch depends on "net: Add __napi_sycnhronize() to sync with napi poll"
posted previously.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Two small fixes to IPoIB support for bonding:
1- copy header_ops from slave to bonding for IPoIB slaves
2- move release and destroy logic to UNREGISTER from GOING_DOWN
notifier to avoid double release
Set bonding to version 3.2.1.
Signed-off-by: Moni Shoua <monis at voltaire.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Apparently poking the link status registers when autonegotiation
is running on the PHY might botch the PHY link on 80003es2lan
devices. While this is a very rare condition we can completely
avoid it alltogether by just using the MAC link bits to provide
the proper information to ethtool.
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
Spotted by the Coverity checker.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
The xl_laa array is just 6 bytes long, so we should substract
10 from the index, like is also done some lines above already.
Signed-Off-By: Marcus Meissner <meissner@suse.de>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
A copy of struct net_device_stats now lives in struct net_device,
making in-driver copies a waste of memory.
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
main change:
* greatly improve per-NIC probe diagnostic output. Similar to other
net drivers, print out MAC address, PHY info, and various hardware and
software flags that may be relevant.
other changes:
* similar to other net drivers, only print the initial version message
when we have found at least one board.
* don't bother to print error message when pci_enable_device() fails,
it will do so for us.
* use dev_printk() rather than printk() in nv_probe(). This gives
use a standardized output similar to the rest of the kernel, and
eliminates the need to manually print out PCI bus id.
* use DRV_NAME constant where appropriate
* clean struct pci_driver indentation
Signed-off-by: Jeff Garzik <jgarzik@redhat.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394-2.6: (25 commits)
firewire: fw-cdev: reorder wakeup vs. spinlock
firewire: in-code doc updates.
firewire: a header cleanup
firewire: adopt read cycle timer ABI from raw1394
firewire: fw-ohci: check for misconfigured bus (phyID == 63)
firewire: fw-ohci: missing dma_unmap_single
firewire: fw-ohci: log posted write errors
firewire: fw-ohci: reorder includes
firewire: fw-ohci: fix includes
firewire: fw-ohci: enforce read order for selfID generation
firewire: fw-sbp2: use an own workqueue (fix system responsiveness)
firewire: fw-sbp2: expose module parameter for workarounds
firewire: fw-sbp2: add support for multiple logical units per target
firewire: fw-sbp2: always enable IRQs before calling command ORB callback
firewire: fw-core: local variable shadows a global one
firewire: optimize fw_core_add_address_handler
ieee1394: ieee1394_core.c: use DEFINE_SPINLOCK for spinlock definition
ieee1394: csr1212: proper refcounting
ieee1394: nodemgr: fix leak of struct csr1212_keyval
ieee1394: pcilynx: I2C cleanups
...
This patch kills ugly warnings when the "Improve SELinux performance
when ACV misses" patch.
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
Signed-off-by: James Morris <jmorris@namei.org>
* We add ebitmap_for_each_positive_bit() which enables to walk on
any positive bit on the given ebitmap, to improve its performance
using common bit-operations defined in linux/bitops.h.
In the previous version, this logic was implemented using a combination
of ebitmap_for_each_bit() and ebitmap_node_get_bit(), but is was worse
in performance aspect.
This logic is most frequestly used to compute a new AVC entry,
so this patch can improve SELinux performance when AVC misses are happen.
* struct ebitmap_node is redefined as an array of "unsigned long", to get
suitable for using find_next_bit() which is fasted than iteration of
shift and logical operation, and to maximize memory usage allocated
from general purpose slab.
* Any ebitmap_for_each_bit() are repleced by the new implementation
in ss/service.c and ss/mls.c. Some of related implementation are
changed, however, there is no incompatibility with the previous
version.
* The width of any new line are less or equal than 80-chars.
The following benchmark shows the effect of this patch, when we
access many files which have different security context one after
another. The number is more than /selinux/avc/cache_threshold, so
any access always causes AVC misses.
selinux-2.6 selinux-2.6-ebitmap
AVG: 22.763 [s] 8.750 [s]
STD: 0.265 0.019
------------------------------------------
1st: 22.558 [s] 8.786 [s]
2nd: 22.458 [s] 8.750 [s]
3rd: 22.478 [s] 8.754 [s]
4th: 22.724 [s] 8.745 [s]
5th: 22.918 [s] 8.748 [s]
6th: 22.905 [s] 8.764 [s]
7th: 23.238 [s] 8.726 [s]
8th: 22.822 [s] 8.729 [s]
Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
Allow policy to select, in much the same way as it selects MLS support, how
the kernel should handle access decisions which contain either unknown
classes or unknown permissions in known classes. The three choices for the
policy flags are
0 - Deny unknown security access. (default)
2 - reject loading policy if it does not contain all definitions
4 - allow unknown security access
The policy's choice is exported through 2 booleans in
selinuxfs. /selinux/deny_unknown and /selinux/reject_unknown.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
It reduces the selinux overhead on read/write by only revalidating
permissions in selinux_file_permission if the task or inode labels have
changed or the policy has changed since the open-time check. A new LSM
hook, security_dentry_open, is added to capture the necessary state at open
time to allow this optimization.
(see http://marc.info/?l=selinux&m=118972995207740&w=2)
Signed-off-by: Yuichi Nakamura<ynakam@hitachisoft.jp>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
This patch reduces memory usage of SELinux by tuning avtab. Number of hash
slots in avtab was 32768. Unused slots used memory when number of rules is
fewer. This patch decides number of hash slots dynamically based on number
of rules. (chain length)^2 is also printed out in avtab_hash_eval to see
standard deviation of avtab hash table.
Signed-off-by: Yuichi Nakamura<ynakam@hitachisoft.jp>
Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
This duplicates the read cycle timer feature of raw1394 (added in Linux
2.6.21) in firewire-core's userspace ABI. The argument to the ioctl is
reordered though to ensure 32/64 bit compatibility.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Kristian Høgsberg <krh@redhat.com>
Check NodeID.nodeNumber as per OHCI 1.1 clause 7.2.3.2. See also IEEE
1394a table 5B-1.
Also, demote the "node ID not valid" message from error to notification
as it is not an error condition.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
at_context_queue_packet() didn't clean up in an early exit path.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Kristian Høgsberg <krh@redhat.com>
It seems unlikely, but access to self_id_cpu[0] could at least in theory
be deferred until after the loop over self_id_cpu[1..n] or even after
the subsequent reg_read. Enforce the desired order by a read barrier.
Also prevent the reg_read from being reordered relative to the for loop.
This isn't necessary if the loop's conditional printk counts as an
implicit barrier, but better make it explicit.
(self_id_cpu[] is a coherent DMA buffer.)
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Firewire-sbp2 did very uncooperative things in the kernel's shared
workqueue: Sleeping until reception of management status from the
target for up to 2 seconds, and performing SCSI inquiry and all of the
setup of SCSI command set drivers via scsi_add_device. If there were
transient or permanent error conditions, this caused long blockage of
the kernel's events process, noticeable e.g. by blocked keyboard input.
We now allocate a workqueue process exclusive to fw-sbp2. As a side
effect, this also increases parallelism of fw-sbp2's login and reconnect
work versus fw-core's device discovery and device update work which is
performed in the shared workqueue.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Signed-off-by: Kristian Høgsberg <krh@redhat.com>
On rare occasions, the ability to set one of the workaround flags at
runtime may save the day.
People who experience I/O errors with firewire-sbp2 while the old sbp2
driver worked for them should try workarounds=1 and report to the devel
mailinglist whether that improves things. Firewire-sbp2 defaults to the
SCSI stack's maximum transfer size per command, while sbp2 limits them
to 128 kBytes. Flag 1 accomplishes just that.
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
On IOMMU-less noncoherent architectures, orb->callback will memcpy the
whole SCSI command buffer for READ-like SCSI commands. It is therefore
friendlier to enable IRQs before the call, like before patch "Add
ref-counting for sbp2 orbs".
Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Acked-by: Kristian Høgsberg <krh@redhat.com>