Two x86 patches broke lguest:
1) v2.6.35-492-g72d7c3b, which changed x86 to use the memblock allocator.
In lguest, the host places linear page tables at the top of mem, which
used to be enough to get us up to the swapper_pg_dir page tables. With
the first patch, the direct mapping tables used that memory:
Before: kernel direct mapping tables up to 4000000 @ 7000-1a000
After: kernel direct mapping tables up to 4000000 @ 3fed000-4000000
I initially fixed this by lying about the amount of memory we had, so
the kernel wouldn't blatt the lguest boot pagetables (yuk!), but then...
2) v2.6.36-rc8-54-gb40827f, which made x86 boot use initial_page_table.
This was initialized in a part of head_32.S which isn't executed by
lguest; it is then copied into swapper_pg_dir. So we have to initialize
it; and anyway we switch to it before we blatt the old tables, so that
fixes the previous damage as well.
For the moment, I cut & pasted the code into lguest's boot code, but
next merge window I will merge them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: x86@kernel.org
lguest is dumb and drops *all* the pagetables for set_pte (which is
only used for kernel mapping manipulation, so it's OK without highmem).
But it's used a lot in boot, too. As a guest optimization, we
suppressed this flushing until the first page switch. Now we have
initial_page_table, that happens much earlier, so extend the heuristic
to wait until we switch to something other than the swapper_pg_dir or
initial_page_table.
As measured on my laptop under kvm, this dropped the time-to-mount-root
from 48 seconds to 4.3 seconds.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
fe25c7fc2e "x86: lguest: Convert to new irq chip functions" converted
enable_lguest_irq() to take a struct irq_data *, but didn't fix the one
internal caller.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
To: x86@kernel.org
Add missing header file:
arch/x86/crypto/ghash-clmulni-intel_glue.c:256: error: implicit declaration of function 'IS_ERR'
arch/x86/crypto/ghash-clmulni-intel_glue.c:257: error: implicit declaration of function 'PTR_ERR'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* master.kernel.org:/home/rmk/linux-2.6-arm:
ARM: 6535/1: V6 MPCore v6_dma_inv_range and v6_dma_flush_range RWFO fix
ARM: 6534/1: Make CONFIG_FPE_NWFPE depend on !CONFIG_THUMB2_KERNEL
ARM: 6533/1: Thumb-2: Make CONFIG_THUMB2_KERNEL depend on !CPU_V6
Change bcmring Maintainer list.
ARM: Update mach-types
ARM: 6528/1: Use CTR for the I-cache line size on ARMv7
ARM: 6527/1: Use CTR instead of CCSIDR for the D-cache line size on ARMv7
ARM: pxa/palm: fix ifdef around gen_nand driver registration
ARM: pxa: fix pxa2xx-flash section mismatch
ARM: mmp2: remove not used clk_rtc
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6:
sparc: Write to prom console using indirect buffer.
sparc: Delete prom_*getchar().
sparc: Pass buffer pointer all the way down to prom_{get,put}char().
sparc: Do not export prom_nb{get,put}char().
sparc64: Delete prom_setcallback().
sparc64: Unexport prom_service_exists().
sparc: Kill prom devops_{32,64}.c
sparc: Remove prom_pathtoinode()
sparc64: Delete prom_puts() unused.
SPARC/LEON: removed constant timer initialization as if HZ=100, now it reflects the value of HZ
Cache ownership must be acquired by reading/writing data from the
cache line to make cache operation have the desired effect on the
SMP MPCore CPU. However, the ownership is never acquired in the
v6_dma_inv_range function when cleaning the first line and
flushing the last one, in case the address is not aligned
to D_CACHE_LINE_SIZE boundary.
Fix this by reading/writing data if needed, before performing
cache operations.
While at it, fix v6_dma_flush_range to prevent RWFO outside
the buffer.
Cc: stable@kernel.org
Signed-off-by: Valentine Barshak <vbarshak@mvista.com>
Signed-off-by: George G. Davis <gdavis@mvista.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Because the nwfpe support is unlikely to be used on new platforms
and requires CONFIG_OABI_COMPAT, which is not generally used with
ARMv7+, we shouldn't expect to build nwfpe support into a Thumb-2
kernel.
At present, nwfpe contains assembly code which isn't Thumb-2
compatible, and for now it doesn't appear useful to port this
code.
All ARMv7-A/R platforms necessarily have VFPv3 hardware floating-
point natively, making emulation unnecessary.
Signed-off-by: Dave Martin <dave.martin@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
This makes sense, because Thumb-2 code can't execute on plain
ARMv6 processors.
This will avoid accidentally configuring a broken kernel where the
config otherwise would allow multiple architecture versions to
coexist in the same kernel.
Not adding !CPU_V5 etc., because the chance of anyone trying to
put v5 and v7 in the same kernel is low, and I'm not aware of
any mach which can do this. These could be added later if it
matters.
Note that the rules may need to be refined if support for the
ARM1156J(F)-S processor is later added to the kernel, since this
processor supports the rare ARMv6T2 extensions, which add support
for Thumb-2 and a few other ARMv7 features.
Signed-off-by: Dave Martin <dave.martin@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Nicolas Pitre <nicolas.pitre@linaro.org>
Acked-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
There are still quite a number of MFD and GPIO expander drivers that are
using the old irq_chip APIs that haven't had a chance to update during
the .37 cycle, resulting in allyes/modconfig errors on some
configurations.
Mark Brown has done most of the legwork to get these fixed up in .38,
so this should just be a .37 stop-gap that we can drop at the end of the
.38 merge window.
Reported-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Paul Mundt <lethal@linux-sh.org>
The current implementation of the v7_coherent_*_range function assumes
that the D and I cache lines have the same size, which is incorrect
architecturally. This patch adds the icache_line_size macro which reads
the CTR register. The main loop in v7_coherent_*_range is split in two
independent loops or the D and I caches. This also has the performance
advantage that the DSB is moved outside the main loop.
Reported-by: Kevin Sapp <ksapp@quicinc.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
The current implementation of the dcache_line_size macro reads the L1
cache size from the CCSIDR register. This, however, is not guaranteed to
be the smallest cache line in the cache hierarchy. The patch changes to
the macro to use the more architecturally correct CTR register.
Reported-by: Kevin Sapp <ksapp@quicinc.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
After Charu's GPIO hwmod patches, GPIO initialization on N800 emits
the following messages for all GPIO banks:
omap_hwmod: gpio1: cannot be enabled (3)
This is due to OMAP24XX_ST_GPIOS_SHIFT being defined as a bitmask.
Fix this and also fix two other macros that had the same problem.
Thanks to Tony Lindgren <tony@atomide.com> for originally reporting
this bug.
Signed-off-by: Paul Walmsley <paul@pwsan.com
Cc: Charulatha Varadarajan <charu@ti.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
commit 0d8e2d0dad (OMAP2+: PM/serial:
hold console semaphore while OMAP UARTs are disabled) added use of the
console semaphore to protect UARTs from being accessed after disabled
during idle, but this causes problems in suspend.
During suspend, the console semaphore is acquired by the console
suspend method (console_suspend()) so the try_acquire_console_sem()
will always fail and suspend will be aborted.
To fix, introduce a check so the console semaphore is only attempted
during idle, and not during suspend. Also use the same check so that
the console semaphore is not prematurely released during resume.
Thanks to Paul Walmsley for suggesting adding the same check during
resume.
Cc: Paul Walmsley <paul@pwsan.com>
Tested-by: Jean Pihet <j-pihet@ti.com>
Tested-by: Paul Walmsley <paul@pwsan.com>
Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Kernel was failing to boot on omap1611 based OSK boards due to
mis-configured SRAM size. Existing code was using a hard-coded value
for 250k, which was then rounded down by PAGE_SIZE. Increasing this to
256k allows kernel to boot on omap1611 SoCs.
Problem reported by and initial fix suggested by Tim Bird.
Thanks to Tony Lindgren for helping diagnose the problem to being
specific to OMAP1611 and not affecting OMAP1610/OMAP1623.
Reported-by: Tim Bird <tim.bird@am.sony.com>
Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86/pvclock: Zero last_value on resume
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf record: Fix eternal wait for stillborn child
perf header: Don't assume there's no attr info if no sample ids is provided
perf symbols: Figure out start address of kernel map from kallsyms
perf symbols: Fix kallsyms kernel/module map splitting
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
nohz: Fix printk_needs_cpu() return value on offline cpus
printk: Fix wake_up_klogd() vs cpu hotplug
clk_get() return value should be checked with IS_ERR().
Signed-off-by: Aaro Koskinen <aaro.koskinen@nokia.com>
Acked-by: Kevin Hilman <khilman@deeprootsystems.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
Currently the {set,get}_pull callbacks of the s3c24xx_gpiocfg_default structure
are initalized via s3c_gpio_{get,set}pull_1up. This results in a linker
error when only CONFIG_CPU_S3C2442 is selected:
arch/arm/plat-s3c24xx/built-in.o:(.data+0x13f4): undefined reference to
`s3c_gpio_getpull_1up'
arch/arm/plat-s3c24xx/built-in.o:(.data+0x13f8): undefined reference to
`s3c_gpio_setpull_1up'
The s3c2442 has pulldowns instead of pullups compared to the s3c2440.
The method of controlling them is the same though.
So this patch modifies the existing s3c_gpio_{get,set}pull_1up helper functions
to take an additional parameter deciding whether the pin has a pullup or pulldown.
The s3c_gpio_{get,set}pull_1{down,up} functions then wrap that functions passing
either S3C_GPIO_PULL_UP or S3C_GPIO_PULL_DOWN.
Furthermore this patch sets up the s3c24xx_gpiocfg_default.{get,set}_pull fields
in the s3c244{0,2}_map_io function to the new pulldown helper functions.
Based on patch from "Lars-Peter Clausen" <lars@metafoo.de>
Signed-off-by: Vasily Khoruzhick <anarsoul@gmail.com>
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
Fix the name of interrupt mask alteration function (ie the
local_change_intr_mask_level() fn) called in gdbstub to have an arch_
prefix to match the definition in asm/irqflags.h.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 0ea1293009 ("arm: return both physical and virtual addresses
from addruart") took out the test for MMU on/off but didn't switch the
ldr instructions to no longer be conditionals based on said test.
Fix that.
Signed-off-by: Olof Johansson <olof@lixom.net>
Acked-by: Colin Cross <ccross@android.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
clk_get() returns ERR_PTR() on error, not NULL.
Signed-off-by: Vasiliy Kulikov <segoon@openwall.com>
Acked-by: Kevin Hilman <khilman@deeprootsystems.com>
[tony@atomide.com: updated to include err.h to compile on omap1]
Signed-off-by: Tony Lindgren <tony@atomide.com>
This patch complements ed919b0 "mmc: sdio: fix runtime PM anomalies by
introducing MMC_CAP_POWER_OFF_CARD" by declaring MMC_CAP_POWER_OFF_CARD
on the ZOOM's wl1271 mmc slot.
This is required in order not to break runtime PM support for the wl1271
sdio driver.
Signed-off-by: Ohad Ben-Cohen <ohad@wizery.com>
Signed-off-by: Chris Ball <cjb@laptop.org>
Signed-off-by: Tony Lindgren <tony@atomide.com>
* master.kernel.org:/home/rmk/linux-2.6-arm:
ARM: 6524/1: GIC irq desciptor bug fix
ARM: 6523/1: iop: ensure sched_clock() is notrace
ARM: 6456/1: Fix for building DEBUG with sa11xx_base.c as a module.
ARM: 6519/1: kuser: Fix incorrect cmpxchg syscall in kuser helpers
ARM: 6505/1: kprobes: Don't HAVE_KPROBES when CONFIG_THUMB2_KERNEL is selected
ARM: 6508/1: vexpress: Correct data alignment in headsmp.S for CONFIG_THUMB2_KERNEL
ARM: 6507/1: RealView: Correct data alignment in headsmp.S for CONFIG_THUMB2_KERNEL
ARM: 6504/1: Thumb-2: Fix long-distance conditional branches in head.S for Thumb-2.
ARM: 6503/1: Thumb-2: Restore sensible zImage header layout for CONFIG_THUMB2_KERNEL
ARM: 6502/1: Thumb-2: Fix CONFIG_THUMB2_KERNEL breakage in compressed/head.S
ARM: 6501/1: Thumb-2: Correct data alignment for CONFIG_THUMB2_KERNEL in mm/proc-v7.S
ARM: 6500/1: Thumb-2: Correct data alignment for CONFIG_THUMB2_KERNEL in kernel/head.S
ARM: 6499/1: Thumb-2: Correct data alignment for CONFIG_THUMB2_KERNEL in bootp/init.S
ARM: 6498/1: vfp: Correct data alignment for CONFIG_THUMB2_KERNEL
ARM: 6497/1: kexec: Correct data alignment for CONFIG_THUMB2_KERNEL
ARM: 6496/1: GIC: Do not try to register more then NR_IRQS interrupts
ARM: cns3xxx: Fix build with CONFIG_PCI=y
gic_set_cpu will directly use irq_desc[]. If CONFIG_SPARSE_IRQ is
enabled, there is no irq_desc[]. So we need use irq_to_desc(irq) to
get the descriptor for irq.
Signed-off-by: Chao Xie <chao.xie@marvell.com>
Acked-by: Kyungmin Park <kyungmin.park@samsung.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/kyle/parisc-2.6:
parisc: Fix GSC PS/2 driver name for keyboard and mouse
parisc: KittyHawk LCD fix
parisc: convert the rest of the irq handlers to simple/percpu
parisc: fix dino/gsc interrupts
parisc: remove redundant initialization in sigsegv path of sys_rt_sigreturn
The generic conversion eliminates the spurious no_ack and no_end
routines, converts all the cascaded handlers to handle_simple_irq() and
makes iosapic use a modified handle_percpu_irq() to become the same as
the CPU irq's. This isn't an essential change, but it eliminates the
mask/unmask overhead of handle_level_irq().
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Tested-by: Helge Deller <deller@gmx.de>
Signed-off-by: Kyle McMartin <kyle@mcmartin.ca>
The essential problem we're currently having is that dino (and gsc) is a
cascaded CPU interrupt. Under the old __do_IRQ() handler, our CPU
interrupts basically did an ack followed by an end. In the new scheme,
we replaced them with level handlers which do a mask, an ack and then an
unmask (but no end). Instead, with the renaming of end to eoi, we
actually want to call the percpu flow handlers, because they actually
have all the characteristics we want.
This patch does the conversion and gets my C360 booting again.
Signed-off-by: James Bottomley <James.Bottomley@suse.de>
Signed-off-by: Kyle McMartin <kyle@mcmartin.ca>
Include sched.h to ensure sched_clock() has the notrace
annotation, and mark any functions it calls as notrace
too.
Include sched.h to ensure sched_clock() has the notrace
annotation, and mark any functions it calls as notrace
too.
Acked-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Rabin Vincent <rabin@rab.in>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
The existing code invokes the syscall with rubbish in r7,
due to what looks like an incorrect literal load idiom.
Reviewed-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Dave Martin <dave.martin@linaro.org>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
* '2.6.37-rc4-pvhvm-fixes' of git://xenbits.xen.org/people/sstabellini/linux-pvhvm:
xen: unplug the emulated devices at resume time
xen: fix save/restore for PV on HVM guests with pirq remapping
xen: resume the pv console for hvm guests too
xen: fix MSI setup and teardown for PV on HVM guests
xen: use PHYSDEVOP_get_free_pirq to implement find_unbound_pirq
The MACH_MINI2440 entry requires the backlight LED driver, but this
subsystem has not been enabled and the select of LEDS_TRIGGER_BACKLIGHT
alone is insufficient to enable the necessary bits of the LED driver.
Add NEW_LEDS, LEDS_CLASS and LEDS_TRIGGER to the select to allow the
kernel to link.
This fixes the following error:
drivers/built-in.o: In function `led_trigger_set':
/home/ben/linux.git/drivers/leds/led-triggers.c:116: undefined reference to `led_brightness_set'
Signed-off-by: Ben Dooks <ben-linux@fluff.org>
* 'upstream/core' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
xen: allocate irq descs on any NUMA node
xen: prevent crashes with non-HIGHMEM 32-bit kernels with largeish memory
xen: use default_idle
xen: clean up "extra" memory handling some more
* 'upstream/bugfix' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen:
xen: x86/32: perform initial startup on initial_page_table
xen: don't bother to stop other cpus on shutdown/reboot
* 'sh-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6:
sh: se/7724: Remove FSI/B of GPIO init code
sh: se/7724: Update clock framework of FSI clock to non-legacy
sh: Assume new page cache pages have dirty dcache lines.
sh: boards: mach-se: use IS_ERR() instead of NULL check
sh: Add div6_reparent_clks to clock framework for FSI
dma: shdma: add a MODULE_ALIAS() to allow module autoloading
Implement asm/syscall.h for the MN10300 arch.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Wakeup-on-timer code does not have/need debugfs dependency. Move
the function out of debugfs ifdef.
Fixes compile error when CONFIG_DEBUG_FS is disabled but PM debug is
enabled.
Reported-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Kevin Hilman <khilman@deeprootsystems.com>
Signed-off-by: Tony Lindgren <tony@atomide.com>
On stock 2.6.37-rc4, running:
# mount lilith:/export /mnt/lilith
# find /mnt/lilith/ -type f -print0 | xargs -0 file
crashes the machine fairly quickly under Xen. Often it results in oops
messages, but the couple of times I tried just now, it just hung quietly
and made Xen print some rude messages:
(XEN) mm.c:2389:d80 Bad type (saw 7400000000000001 != exp
3000000000000000) for mfn 1d7058 (pfn 18fa7)
(XEN) mm.c:964:d80 Attempt to create linear p.t. with write perms
(XEN) mm.c:2389:d80 Bad type (saw 7400000000000010 != exp
1000000000000000) for mfn 1d2e04 (pfn 1d1fb)
(XEN) mm.c:2965:d80 Error while pinning mfn 1d2e04
Which means the domain tried to map a pagetable page RW, which would
allow it to map arbitrary memory, so Xen stopped it. This is because
vm_unmap_ram() left some pages mapped in the vmalloc area after NFS had
finished with them, and those pages got recycled as pagetable pages
while still having these RW aliases.
Removing those mappings immediately removes the Xen-visible aliases, and
so it has no problem with those pages being reused as pagetable pages.
Deferring the TLB flush doesn't upset Xen because it can flush the TLB
itself as needed to maintain its invariants.
When unmapping a region in the vmalloc space, clear the ptes
immediately. There's no point in deferring this because there's no
amortization benefit.
The TLBs are left dirty, and they are flushed lazily to amortize the
cost of the IPIs.
This specific motivation for this patch is an oops-causing regression
since 2.6.36 when using NFS under Xen, triggered by the NFS client's use
of vm_map_ram() introduced in 56e4ebf877 ("NFS: readdir with vmapped
pages") . XFS also uses vm_map_ram() and could cause similar problems.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Bryan Schumaker <bjschuma@netapp.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: Alex Elder <aelder@sgi.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When remapping MSIs into pirqs for PV on HVM guests, qemu is responsible
for doing the actual mapping and unmapping.
We only give qemu the desired pirq number when we ask to do the mapping
the first time, after that we should be reading back the pirq number
from qemu every time we want to re-enable the MSI.
This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
trying to enable the same MSI for the second time: the old MSI to pirq
mapping is still valid at this point but xen_hvm_setup_msi_irqs would
try to assign a new pirq anyway.
A simple way to reproduce this bug is to assign an MSI capable network
card to a PV on HVM guest, if the user brings down the corresponding
ethernet interface and up again, Linux would fail to enable MSIs on the
device.
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
This fixes the same problem as described in the patch "nohz: fix
printk_needs_cpu() return value on offline cpus" for the arch_needs_cpu()
primitive:
arch_needs_cpu() may return 1 if called on offline cpus. When a cpu gets
offlined it schedules the idle process which, before killing its own cpu,
will call tick_nohz_stop_sched_tick().
That function in turn will call arch_needs_cpu() in order to check if the
local tick can be disabled. On offline cpus this function should naturally
return 0 since regardless if the tick gets disabled or not the cpu will be
dead short after. That is besides the fact that __cpu_disable() should already
have made sure that no interrupts on the offlined cpu will be delivered anyway.
In this case it prevents tick_nohz_stop_sched_tick() to call
select_nohz_load_balancer(). No idea if that really is a problem. However what
made me debug this is that on 2.6.32 the function get_nohz_load_balancer() is
used within __mod_timer() to select a cpu on which a timer gets enqueued.
If arch_needs_cpu() returns 1 then the nohz_load_balancer cpu doesn't get
updated when a cpu gets offlined. It may contain the cpu number of an offline
cpu. In turn timers get enqueued on an offline cpu and not very surprisingly
they never expire and cause system hangs.
This has been observed 2.6.32 kernels. On current kernels __mod_timer() uses
get_nohz_timer_target() which doesn't have that problem. However there might
be other problems because of the too early exit tick_nohz_stop_sched_tick()
in case a cpu goes offline.
This specific bug was indrocuded with 3c5d92a0 "nohz: Introduce
arch_needs_cpu".
In this case a cpu hotplug notifier is used to fix the issue in order to keep
the normal/fast path small. All we need to do is to clear the condition that
makes arch_needs_cpu() return 1 since it is just a performance improvement
which is supposed to keep the local tick running for a short period if a cpu
goes idle. Nothing special needs to be done except for clearing the condition.
Cc: stable@kernel.org
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>