This reverts commit 13a229abc2.
Quoth Andi:
"After some consideration and feedback from various people it turns
out this wasn't that good an idea. It has some problems and needs
more work. Since it was only an optimization anyways it's best to
just back it out again for now."
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The commit e2c0388866 added
setup_additional_cpus to setup.c but this is only defined if
CONFIG_HOTPLUG_CPU is set. This patch changes the #ifdef to reflect that.
Signed-off-by: Brian Magnuson <magnuson@rcn.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The previous experiment for using apicmaintimer on ATI systems didn't
work out very well. In particular laptops with C2/C3 support often
don't let it tick during idle, which makes it useless. There were also
some other bugs that made the apicmaintimer often not used at all.
I tried some other experiments - running timer over RTC and some other
things but they didn't really work well neither.
I rechecked the specs now and it turns out this simple change is
actually enough to avoid the double ticks on the ATI systems. We just
turn off IRQ 0 in the 8254 and only route it directly using the IO-APIC.
I tested it on a few ATI systems and it worked there. In fact it worked
on all chipsets (NVidia, Intel, AMD, ATI) I tried it on.
According to the ACPI spec routing should always work through the
IO-APIC so I think it's the correct thing to do anyways (and most of the
old gunk in check_timer should be thrown away for x86-64).
But for 2.6.16 it's best to do a fairly minimal change:
- Use the known to be working everywhere-but-ATI IRQ0 both over 8254
and IO-APIC setup everywhere
- Except on ATI disable IRQ0 in the 8254
- Remove the code to select apicmaintimer on ATI chipsets
- Add some boot options to allow to override this (just paranoia)
In 2.6.17 I hope to switch the default over to this for everybody.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
SMP time selection originally ran after all CPUs were brought up because
it needed to know the number of CPUs to decide if it needs an MP safe
timer or not.
This is not needed anymore because we know present CPUs early.
This fixes a couple of problems:
- apicmaintimer didn't always work because it relied on state that was
set up time_init_gtod too late.
- The output for the used timer in early kernel log was misleading
because time_init_gtod could actually change it later. Now always
print the final timer choice
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It didn't set up the CPU possible map early enough, so the
option didn't actually work.
Noticed by Heiko Carstens
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[description from AK]
Old check for the IO-APIC watchdog during the timer check was wrong -
it obviously should only drop into this if the IO-APIC watchdog is used.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Big Unisys systems have multiple clusters too, but they have an
synchronized TSC.
I'm using the SMBIOS to check for vendor == IBM.
Cc: Chris McDermott <lcm@us.ibm.com>
Cc: "Protasevich, Natalie" <Natalie.Protasevich@unisys.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
In previous versions of pci-gart.c, no_iommu was used to determine if IOMMU was
disabled in the GART DMA mapping functions. This changed in 2.6.16 and now
gart_xxx() functions are only called if gart is enabled. Therefore, uses of
no_iommu in the GART code are no longer necessary and can be removed.
Also, it removes double deceleration of no_iommu and force_iommu in pci.h and
proto.h, by removing the deceleration in pci.h.
Lastly, end_pfn off by one error.
Tested (along with patch 1/2) on dual opteron with gart enabled, iommu=soft,
and iommu=off.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Commit 9c869edac5 moved the i386 topology.c
file. That change broke x86-64 compiles, as it uses the same file.
Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't print KERN_INFO in the middle of a printk line.
printk(KERN_INFO "OEM ID: %s ",str);
is just above this. This is already fixed up in i386 copy.
Signed-off-by: Martin J. Bligh <mbligh@google.com>
Cc: Andi Kleen <ak@muc.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
But do it after everything else to risk less from recursive
crashes.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Many laptops have problems with ticking the local APIC timer in C2/C3.
The code added earlier to use it by default on ATI didn't really work
for them. Don't enable it when the system supports C2/C3.
This doesn't fix the problem fully, but at least it's not worse than before.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This caused a sigreturn with bad argument on a preemptible kernel
to complain with
Debug: sleeping function called from invalid context at /home/lsrc/quilt/linux/include/linux/rwsem.h:43
in_atomic():0, irqs_disabled():1
Call Trace: {__might_sleep+190} {profile_task_exit+21}
{__do_exit+34} {do_wait+0}
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Along with that, also suppress the memory touching altogether when the
watchdog is not running, to eliminate needless crosstalk. Plus ad a call
to it to make things consistent (one could also consider removing the call
in enable_timer_nmi_watchdog()).
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Don't touch the non DMA members in the sg list in dma_map_sg in the IOMMU
Some drivers (in particular ST) ran into problems because they reused the sg
lists after passing them to pci_map_sg(). The merging procedure in the K8
GART IOMMU corrupted the state. This patch changes it to only touch the dma*
entries during merging, but not the other fields. Approach suggested by Dave
Miller.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
We found a problem with x86_64 kernels with preemption enabled, where
having multiple tasks doing ptrace singlesteps around the same time will
cause the system to 'oops'. The problem seems that a task can get
preempted out of the do_debug() processing while it is running on the
DEBUG_STACK stack. If another task on that same cpu then enters do_debug()
and uses the same per-cpu DEBUG_STACK stack, the previous preempted tasks's
stack contents can be corrupted, and the system will oops when the
preempted task is context switched back in again.
The typical oops looks like the following:
Unable to handle kernel paging request at ffffffffffffffae RIP: <ffffffff805452a1>{thread_return+34}
PGD 103027 PUD 102429067 PMD 0
Oops: 0002 [1] PREEMPT SMP
CPU 0
Modules linked in:
Pid: 3786, comm: ssdd Not tainted 2.6.15.2 #1
RIP: 0010:[<ffffffff805452a1>] <ffffffff805452a1>{thread_return+34}
RSP: 0018:ffffffff80824058 EFLAGS: 000136c2
RAX: ffff81017e12cea0 RBX: 0000000000000000 RCX: 00000000c0000100
RDX: 0000000000000000 RSI: ffff8100f7856e20 RDI: ffff81017e12cea0
RBP: 0000000000000046 R08: ffff8100f68a6000 R09: 0000000000000000
R10: 0000000000000000 R11: ffff81017e12cea0 R12: ffff81000c2d53e8
R13: ffff81017f5b3be8 R14: ffff81000c0036e0 R15: 000001056cbfc899
FS: 00002aaaaaad9b00(0000) GS:ffffffff80883800(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffffffffffae CR3: 00000000f6fcf000 CR4: 00000000000006e0
Process ssdd (pid: 3786, threadinfo ffff8100f68a6000, task ffff8100f7856e20)
Stack: ffffffff808240d8 ffffffff8012a84a ffff8100055f6c00 0000000000000020
0000000000000001 ffff81000c0036e0 ffffffff808240b8 0000000000000000
0000000000000000 0000000000000000
Call Trace: <#DB>
<ffffffff8012a84a>{try_to_wake_up+985}
<ffffffff8012c0d3>{kick_process+87}
<ffffffff8013b262>{signal_wake_up+48}
<ffffffff8013b5ce>{specific_send_sig_info+179}
<ffffffff80546abc>{_spin_unlock_irqrestore+27}
<ffffffff8013b67c>{force_sig_info+159}
<ffffffff801103a0>{do_debug+289} <ffffffff80110278>{sync_regs+103}
<ffffffff8010ed9a>{paranoid_userspace+35}
Unable to handle kernel paging request at 00007fffffb7d000 RIP: <ffffffff8010f2e4>{show_trace+465}
PGD f6f25067 PUD f6fcc067 PMD f6957067 PTE 0
Oops: 0000 [2] PREEMPT SMP
This patch disables preemptions for the task upon entry to do_debug(), before
interrupts are reenabled, and then disables preemption before exiting
do_debug(), after disabling interrupts. I've noticed that the task can be
preempted either at the end of an interrupt, or on the call to
force_sig_info() on the spin_unlock_irqrestore() processing. It might be
better to attempt to code a fix in entry.S around the code that calls
do_debug().
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[description from AK]
The IBM Summit 3 chipset doesn't implement the HPET timer replacement
option. Since the current Linux code relies on it use a mixed mode with
both PIT for the interrupt and HPET counters for the time keeping. That
was already implemented, but didn't work properly because it was still
using the last interrupt offset in HPET. This resulted in x460 not
booting. Fix this up by using the free running HPET counter.
Shouldn't affect any other machine because they either use full HPET mode
or no HPET at all.
TBD needs a similar 32bit fix.
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Pallipadi, Venkatesh" <venkatesh.pallipadi@intel.com>
Cc: Bob Picco <bob.picco@hp.com>
Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, x86_64 and ia64 arches do not clear the corresponding bits in
the node's cpumask when a cpu goes down or cpu bring up is cancelled. This
is buggy since there are pieces of common code where the cpumask is checked
in the cpu down code path to decide on things (like in the slab down path).
PPC does the right thing, but x86_64 and ia64 don't (This was the reason
Sonny hit upon a slab bug during cpu offline on ppc and could not reproduce
on other arches). This patch fixes it for x86_64. I won't attempt ia64 as
I cannot test it.
Credit for spotting this should go to Alok.
(akpm: this was applied, then reverted. But it's OK now because we now use
for_each_cpu() in the right places).
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This reverts commit 10f4dc8b27.
Quoth Andi Kleen:
"Kiran decided that it makes the problem worse than it was before.
Fixing it fully requires more work which is too much for 2.6.16. So
please revert that commit for now."
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This patch contains a printk reorder to remove the current problem of
displaying "PCI-DMA: Disabling IOMMU." and then "PCI-DMA: using GART
IOMMU" 20 lines later in dmesg.
It also constains a printk reorder in swiotlb to state swiotlb
enablement prior to describing the location of the bounce buffers, and a
printk reorder to state gart enablement prior to describing the
aperature.
Also constains a whitespace cleanup in arch/x86_64/kernel/setup.c
Tested (along with patch 2/2) on dual opteron with gart enabled,
iommu=soft, and iommu=off.
Signed-off-by: Jon Mason <jdmason@us.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hack for 2.6.16. In 2.6.17 all code that uses NR_CPUs should
be audited and changed to only touch possible CPUs.
Don't mark the reference per cpu data init data (so it stays
around after boot) and point all impossible CPUs to it. This way
they reference some valid - although shared memory. Usually
this is only initialization like INIT_LIST_HEADs and there
won't be races because these CPUs never run. Still somewhat hackish.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It's bad juju to touch the APIC when it hasn't been enabled.
I also moved ack_bad_irq for x86-64 out of line following i386.
Signed-off-by: Andi Kleen <ak@suse.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Conditionalize two unwind directives to match other similarly
conditional code.
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Cc: Jim Houston <jim.houston@ccur.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
On some broken motherboards (at least one NForce3 based AMD64 laptop)
the PIT timer runs at a incorrect frequency. This patch adds a new
option "apicpmtimer" that allows to use the APIC timer and calibrate it
using the PMTimer. It requires the earlier patch that allows to run the
main timer from the APIC.
Specifying apicpmtimer implies apicmaintimer.
The option defaults to off for now.
I tested it on a few systems and the resulting APIC timer frequencies
were usually a bit off, but always <1%, which should be tolerable.
TBD figure out heuristic to enable this automatically on the affected
systems TBD perhaps do it on all NForce3s or using DMI?
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
kprobes cannot deal with the funny calling conventions when it
runs on a different stack when it returns. If someone wants
to instrument context switch they can add a probe to schedule()
instead.
Cc: jkenisto@us.ibm.com, prasanna@in.ibm.com
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Align the start of the per-cpu section to the configured number of bytes in a
cache line. This stops a BUG_ON() from triggering in load_module() when
DEFINE_PER_CPU() is used in a module and the section isn't cacheline-aligned.
Rusty also found this and sent a patch in a while ago
(http://lkml.org/lkml/2004/10/19/17), I don't know what came of that.
Signed-off-by: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
[ AK: I redid Kevin's fix to be simpler, but the idea and original
analysis of the problem is from Kevin]
This avoid allocation failures on some SATA systems like Nvidia CK8
when the IOMMU gets fragmented. Modern SATA devices have quite large queues
(128 entries) and the FS with ext2/3 is good enough now that it often
passes whole 128 page sg lists down to the driver. These require
512K of continuous free space in the IOMMU aperture to map when merged.
When the IOMMU is fragmented this could lead to spurious IO errors
due to failing mappings.
Short term fix is to just try to map the SG list again unmerged
page by page - this way fragmentation doesn't matter anymore.
The code for that was already there, but it just wasn't enabled for the
merge case.
According to Kevin at least the Nvidia device doesn't seem to benefit
from merging much anyways, so the only slowdown is from trying
to do an unnecessary merge attempt.
Kevin plans to implement better fragmentation avoidance in the future,
but that wouldn't be 2.6.16 material.
TBD: should add some statistic counters to count how often that really
happens.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
attached patch is 2 more cases i found via running the reference_init.pl
script. These were easy to spot just knowing the file names. There is
one another about init/main.c that i cant exactly zero in. (partly
because i dont know how to interpret the data thats spewed out of the tool).
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Currently, x86_64 and ia64 arches do not clear the corresponding bits
in the node's cpumask when a cpu goes down or cpu bring up is cancelled.
This is buggy since there are pieces of common code where the cpumask is
checked in the cpu down code path to decide on things (like in the slab
down path). PPC does the right thing, but x86_64 and ia64 don't (This
was the reason Sonny hit upon a slab bug during cpu offline on ppc and
could not reproduce on other arches). This patch fixes it for x86_64.
I won't attempt ia64 as I cannot test it.
Credit for spotting this should go to Alok.
Signed-off-by: Alok N Kataria <alokk@calsoftinc.com>
Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org>
Signed-off-by: Shai Fultheim <shai@scalex86.org>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They cause quite bad performance regressions on Netburst
This is temporary until we can get new optimized functions
for these CPUs.
This undoes changes that were done in 2.6.15 and in 2.6.16-rc1,
essentially bringing the code back to 2.6.14 level. Only change
is I renamed the X86_FEATURE_K8_C flag to X86_FEATURE_REP_GOOD
and fixed the check for the flag and also fixed some comments.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This avoids BUG_ONs in the low level allocator when an illegal
GFP mask is added.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
At resume time, TSC's value or something similar might be changed a lot
against suspend time. This could make system gets a very big lost ticks.
See http://bugzilla.kernel.org/show_bug.cgi?id=5825
Signed-off-by: Shaohua Li<shaohua.li@intel.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They all have problems with IRQ 0 routing, so just use the APIC on them.
Can be overwritten with "noapicmaintimer"
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Another piece from the no-idle-tick patch.
This can be enabled with the "apicmaintimer" option.
This is mainly useful when the PIT/HPET interrupt is unreliable.
Note there are some systems that are known to stop the APIC
timer in C3. For those it will never work, but this case
should be automatically detected.
It also only works with PM timer right now. When HPET is used
the way the main timer handler computes the delay doesn't work.
It should be a bit more efficient because there is one less
regular interrupt to process on the boot processor.
Requires earlier bugfix from Venkatesh
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
A kprobe executes IRET early and that could cause NMI recursion
and stack corruption.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Fix a typo/mis-merge in one of the previous patches.
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
... as they are no longer needed. Since there were hard-coded numbers in the
file, the patch also adds a mechanism to avoid these (otherwise potential
future changes would again and again require adjusting these numbers).
Signed-Off-By: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This unbreaks recursive kprobes which didn't work anymore
due to an earlier patch which converted the debug entry point
to use an IST.
This also allows nesting of the debug entry point too.
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
o This fix was posted for i386 long back. Posting it for x86_64.
http://marc.theaimsgroup.com/?l=linux-kernel&m=110380103229830&w=2
o This patch fixes the problem of secondary cpus boot up. This situation
is faced when kernel is built for default locations like 16MB and
onwards. In this configuration, only primary cpu (BP) comes and
secondary cpus don't boot.
o Problem occurs because in trampoline code, lgdt is not able to load the
GDT as it happens to be situated beyond 16MB. This is due to the fact
that cpu is still in real mode and default operand size is 16bit.
o This patch uses lgdtl instead of lgdt to force operand size to 32
instead of 16.
Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>