With priorities in place and no one really understanding the difference between
DIE_NMI and DIE_NMI_IPI, just remove DIE_NMI_IPI and convert everyone to DIE_NMI.
This also simplifies default_do_nmi() a little bit. Instead of calling the
die_notifier in both the if and else part, just pull it out and call it before
the if-statement. This has the side benefit of avoiding a call to the ioport
to see if there is an external NMI sitting around until after the (more frequent)
internal NMIs are dealt with.
Patch-Inspired-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1294348732-15030-5-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
In order to consolidate the NMI die_chain events, we need to setup the priorities
for the die notifiers.
I started by defining a bunch of common priorities that can be used by the
notifier blocks. Then I modified the notifier blocks to use the newly created
priorities.
Now that the priorities are straightened out, it should be easier to remove the
event DIE_NMI_IPI.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1294348732-15030-4-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
They are a handful of places in the code that register a die_notifier
as a catch all in case no claims the NMI. Unfortunately, they trigger
on events like DIE_NMI and DIE_NMI_IPI, which depending on when they
registered may collide with other handlers that have the ability to
determine if the NMI is theirs or not.
The function unknown_nmi_error() makes one last effort to walk the
die_chain when no one else has claimed the NMI before spitting out
messages that the NMI is unknown.
This is a better spot for these devices to execute any code without
colliding with the other handlers.
The two drivers modified are only compiled on x86 arches I believe, so
they shouldn't be affected by other arches that may not have
DIE_NMIUNKNOWN defined.
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Russ Anderson <rja@sgi.com>
Cc: Corey Minyard <minyard@acm.org>
Cc: openipmi-developer@lists.sourceforge.net
Cc: dann frazier <dannf@hp.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1294348732-15030-3-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Replace the NMI related magic numbers with symbol constants.
Memory parity error is only valid for IBM PC-AT, newer machine use
bit 7 (0x80) of 0x61 port for PCI SERR. While memory error is usually
reported via MCE. So corresponding function name and kernel log string
is changed.
But on some machines, PCI SERR line is still used to report memory
errors. This is used by EDAC, so corresponding EDAC call is reserved.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1294348732-15030-2-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
arch/x86/include/asm/io_apic.h
Merge reason: Resolve the conflict, update to a more recent -rc base
Signed-off-by: Ingo Molnar <mingo@elte.hu>
"x86, numa: Fake node-to-cpumask for NUMA emulation" broke the
build when CONFIG_DEBUG_PER_CPU_MAPS is set and CONFIG_NUMA_EMU
is not. This is because it is possible to map a cpu to multiple
nodes when NUMA emulation is used; the patch required a physical
node address table to find those nodes that was only available
when CONFIG_NUMA_EMU was enabled.
This extracts the common debug functionality to its own function
for CONFIG_DEBUG_PER_CPU_MAPS and uses it regardless of whether
CONFIG_NUMA_EMU is set or not.
NUMA emulation will now iterate over the set of possible nodes
for each cpu and call the new debug function whereas only the
cpu's node will be used without NUMA emulation enabled.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <alpine.DEB.2.00.1012301053590.12995@chino.kir.corp.google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'x86-alternatives-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, suspend: Avoid unnecessary smp alternatives switch during suspend/resume
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86-64, asm: Use fxsaveq/fxrestorq in more places
* 'x86-hwmon-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, hwmon: Add core threshold notification to therm_throt.c
* 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, paravirt: Use native_halt on a halt, not native_safe_halt
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
locking, lockdep: Convert sprintf_symbol to %pS
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
irq: Better struct irqaction layout
* 'x86-uv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, UV, BAU: Extend for more than 16 cpus per socket
x86, UV: Fix the effect of extra bits in the hub nodeid register
x86, UV: Add common uv_early_read_mmr() function for reading MMRs
* 'x86-tsc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Check tsc available/disabled in the delayed init function
x86: Improve TSC calibration using a delayed workqueue
x86: Make tsc=reliable override boot time stability checks
* 'x86-security-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
module: Move RO/NX module protection to after ftrace module update
x86: Resume trampoline must be executable
x86: Add RO/NX protection for loadable kernel modules
x86: Add NX protection for kernel data
x86: Fix improper large page preservation
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, earlyprintk: Move mrst early console to platform/ and fix a typo
x86, apbt: Setup affinity for apb timers acting as per-cpu timer
ce4100: Add errata fixes for UART on CE4100
x86: platform: Move iris to x86/platform where it belongs
x86, mrst: Check platform_device_register() return code
x86/platform: Add Eurobraille/Iris power off support
x86, mrst: Add explanation for using 1960 as the year offset for vrtc
x86, mrst: Fix dependencies of "select INTEL_SCU_IPC"
x86, mrst: The shutdown for MRST requires the SCU IPC mechanism
x86: Ce4100: Add reboot_fixup() for CE4100
ce4100: Add PCI register emulation for CE4100
x86: Add CE4100 platform support
x86: mrst: Set vRTC's IRQ to level trigger type
x86: mrst: Add audio driver bindings
rtc: Add drivers/rtc/rtc-mrst.c
x86: mrst: Add vrtc driver which serves as a wall clock device
x86: mrst: Add Moorestown specific reboot/shutdown support
x86: mrst: Parse SFI timer table for all timer configs
x86/mrst: Add SFI platform device parsing code
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, microcode, AMD: Cleanup code a bit
x86, microcode, AMD: Replace vmalloc+memset with vzalloc
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Fix included-by file reference comments
x86, cpu: Only CPU features determine NX capabilities
x86, cpu: Call verify_cpu during 32bit CPU startup
x86, cpu: Clear XD_DISABLED flag on Intel to regain NX
x86, cpu: Rename verify_cpu_64.S to verify_cpu.S
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86: Fix APIC ID sizing bug on larger systems, clean up MAX_APICS confusion
x86, acpi: Parse all SRAT cpu entries even above the cpu number limitation
x86, acpi: Add MAX_LOCAL_APIC for 32bit
x86: io_apic: Split setup_ioapic_ids_from_mpc()
x86: io_apic: Fix CONFIG_X86_IO_APIC=n breakage
x86: apic: Move probe_nr_irqs_gsi() into ioapic_init_mappings()
x86: Allow platforms to force enable apic
* 'x86-amd-nb-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, cacheinfo: Cleanup L3 cache index disable support
x86, amd-nb: Cleanup AMD northbridge caching code
x86, amd-nb: Complete the rename of AMD NB and related code
Prevent the long delay in io_check_error making NMI watchdog
timeout.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
LKML-Reference: <1294198689-15447-3-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The spin_lock_debug/rcu_cpu_stall detector uses
trigger_all_cpu_backtrace() to dump cpu backtrace.
Therefore it is possible that trigger_all_cpu_backtrace()
could be called at the same time on different CPUs, which
triggers and 'unknown reason NMI' warning. The following case
illustrates the problem:
CPU1 CPU2 ... CPU N
trigger_all_cpu_backtrace()
set "backtrace_mask" to cpu mask
|
generate NMI interrupts generate NMI interrupts ...
\ | /
\ | /
The "backtrace_mask" will be cleaned by the first NMI interrupt
at nmi_watchdog_tick(), then the following NMI interrupts
generated by other cpus's arch_trigger_all_cpu_backtrace() will
be taken as unknown reason NMI interrupts.
This patch uses a test_and_set to avoid the problem, and stop
the arch_trigger_all_cpu_backtrace() from calling to avoid
dumping a double cpu backtrace info when there is already a
trigger_all_cpu_backtrace() in progress.
Signed-off-by: Dongdong Deng <dongdong.deng@windriver.com>
Reviewed-by: Bruce Ashfield <bruce.ashfield@windriver.com>
Cc: fweisbec@gmail.com
LKML-Reference: <1294198689-15447-2-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Don Zickus <dzickus@redhat.com>
There are some paths that walk the die_chain with preemption on.
Make sure we are in an NMI call before we start doing anything.
This was triggered by do_general_protection calling notify_die
with DIE_GPF.
Reported-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Don Zickus <dzickus@redhat.com>
LKML-Reference: <1294198689-15447-1-git-send-email-dzickus@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Found one x2apic pre-enabled system, x2apic_mode suddenly get
corrupted after register some cpus, when compiled
CONFIG_NR_CPUS=255 instead of 512.
It turns out that generic_processor_info() ==> phyid_set(apicid,
phys_cpu_present_map) causes the problem.
phys_cpu_present_map is sized by MAX_APICS bits, and pre-enabled
system some cpus have an apic id > 255.
The variable after phys_cpu_present_map may get corrupted
silently:
ffffffff828e8420 B phys_cpu_present_map
ffffffff828e8440 B apic_verbosity
ffffffff828e8444 B local_apic_timer_c2_ok
ffffffff828e8448 B disable_apic
ffffffff828e844c B x2apic_mode
ffffffff828e8450 B x2apic_disabled
ffffffff828e8454 B num_processors
...
Actually phys_cpu_present_map is referenced via apic id, instead
index. We should use MAX_LOCAL_APIC instead MAX_APICS.
For 64-bit it will be 32768 in all cases. BSS will increase by 4k bytes
on 64-bit:
text data bss dec filename
21696943 4193748 12787712 38678403 vmlinux.before
21696943 4193748 12791808 38682499 vmlinux.after
No change on 32bit.
Finally we can remove MAX_APCIS that was rather confusing.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
LKML-Reference: <4D23BD9C.3070102@kernel.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
v2.6.36-rc8-54-gb40827f (x86-32, mm: Add an initial page table
for core bootstrapping) made x86 boot using initial_page_table
and broke lguest.
For 2.6.37 we simply cut & paste the initialization code into
lguest (da32dac10126 "lguest: populate initial_page_table"), now
we fix it properly by doing that initialization before the
paravirt jump.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: lguest <lguest@ozlabs.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <201101041720.54535.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Add these new power trace events:
power:cpu_idle
power:cpu_frequency
power:machine_suspend
The old C-state/idle accounting events:
power:power_start
power:power_end
Have now a replacement (but we are still keeping the old
tracepoints for compatibility):
power:cpu_idle
and
power:power_frequency
is replaced with:
power:cpu_frequency
power:machine_suspend is newly introduced.
Jean Pihet has a patch integrated into the generic layer
(kernel/power/suspend.c) which will make use of it.
the type= field got removed from both, it was never
used and the type is differed by the event type itself.
perf timechart userspace tool gets adjusted in a separate patch.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Jean Pihet <jean.pihet@newoldbits.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: rjw@sisk.pl
LKML-Reference: <1294073445-14812-3-git-send-email-trenn@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
LKML-Reference: <1290072314-31155-2-git-send-email-trenn@suse.de>
The code will use a segment prefix instead of doing the lookup and
calculation.
Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix a hard-coded limit of a maximum of 16 cpu's per socket.
The UV Broadcast Assist Unit code initializes by scanning the
cpu topology of the system and assigning a master cpu for each
socket and UV hub. That scan had an assumption of a limit of 16
cpus per socket. With Westmere we are going over that limit.
The UV hub hardware will allow up to 32.
If the scan finds the system has gone over that limit it returns
an error and we print a warning and fall back to doing TLB
shootdowns without the BAU.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
Cc: <stable@kernel.org> # .37.x
LKML-Reference: <E1PZol7-0000mM-77@eag09.americas.sgi.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This patch adds code to therm_throt.c to notify core thermal threshold
events. These thresholds are supported by the IA32_THERM_INTERRUPT register.
The status/log for the same is monitored using the IA32_THERM_STATUS register.
The necessary #defines are in msr-index.h. A call back is added to mce.h, to
further notify the thermal stack, about the threshold events.
Signed-off-by: Durgadoss R <durgadoss.r@intel.com>
LKML-Reference: <D6D887BA8C9DFF48B5233887EF04654105C1251710@bgsmsx502.gar.corp.intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Disable preemption in init_ibs(). The function only checks the
ibs capabilities and sets up pci devices (if necessary). It runs
only on one cpu but operates with the local APIC and some MSRs,
thus it is better to disable preemption.
[ 7.034377] BUG: using smp_processor_id() in preemptible [00000000] code: modprobe/483
[ 7.034385] caller is setup_APIC_eilvt+0x155/0x180
[ 7.034389] Pid: 483, comm: modprobe Not tainted 2.6.37-rc1-20101110+ #1
[ 7.034392] Call Trace:
[ 7.034400] [<ffffffff812a2b72>] debug_smp_processor_id+0xd2/0xf0
[ 7.034404] [<ffffffff8101e985>] setup_APIC_eilvt+0x155/0x180
[ ... ]
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=22812
Reported-by: <atswartz@gmail.com>
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: oprofile-list@lists.sourceforge.net <oprofile-list@lists.sourceforge.net>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Dan Carpenter <error27@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org> [2.6.37.x]
LKML-Reference: <20110103111514.GM4739@erda.amd.com>
[ small cleanups ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The only bit of EFER that affects the mmu is NX, and this is already
accounted for (LME only takes effect when changing cr0).
Based on a patch by Hillf Danton.
Signed-off-by: Avi Kivity <avi@redhat.com>
isr_ack is never initialized. So, until the first PIC reset, interrupts
may fail to be injected. This can cause Windows XP to fail to boot, as
reported in the fallout from the fix to
https://bugzilla.kernel.org/show_bug.cgi?id=21962.
Reported-and-tested-by: Nicolas Prochazka <prochazka.nicolas@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Replace all uses of current_cpu_data with this_cpu operations on the
per cpu structure cpu_info. The scala accesses are replaced with the
matching this_cpu ops which results in smaller and more efficient
code.
In the long run, it might be a good idea to remove cpu_data() macro
too and use per_cpu macro directly.
tj: updated description
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Go through x86 code and replace __get_cpu_var and get_cpu_var
instances that refer to a scalar and are not used for address
determinations.
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
We use the physical address instead of the base gfn for the four
PAE page directories we use in unpaged mode. When the guest accesses
an address above 1GB that is backed by a large host page, a BUG_ON()
in kvm_mmu_set_gfn() triggers.
Resolves: https://bugzilla.kernel.org/show_bug.cgi?id=21962
Reported-and-tested-by: Nicolas Prochazka <prochazka.nicolas@gmail.com>
KVM-Stable-Tag.
Signed-off-by: Avi Kivity <avi@redhat.com>
halt() should use native_halt()
safe_halt() uses native_safe_halt()
If CONFIG_PARAVIRT=y, halt() is defined in arch/x86/include/asm/paravirt.h as
static inline void halt(void)
{
PVOP_VCALL0(pv_irq_ops.safe_halt);
}
Otherwise (no CONFIG_PARAVIRT) halt() in arch/x86/include/asm/irqflags.h is
static inline void halt(void)
{
native_halt();
}
So it looks to me like the CONFIG_PARAVIRT case of using native_safe_halt()
for a halt() is an oversight.
Am I missing something?
It probably hasn't shown up as a problem because the local apic is disabled
on a shutdown or restart. But if we disable interrupts and call halt()
we shouldn't expect that the halt() will re-enable interrupts.
Signed-off-by: Cliff Wickman <cpw@sgi.com>
LKML-Reference: <E1PSBcz-0001g1-FM@eag09.americas.sgi.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
In arch/x86/kernel/microcode_intel.c::generic_load_microcode()
we have this:
while (leftover) {
...
if (get_ucode_data(mc, ucode_ptr, mc_size) ||
microcode_sanity_check(mc) < 0) {
vfree(mc);
break;
}
...
}
if (mc)
vfree(mc);
This will cause a double free of 'mc'. This patch fixes that by
just removing the vfree() call in the loop since 'mc' will be
freed nicely just after we break out of the loop.
There's also a second change in the patch. I noticed a lot of
checks for pointers being NULL before passing them to vfree().
That's completely redundant since vfree() deals gracefully with
being passed a NULL pointer. Removing the redundant checks
yields a nice size decrease for the object file.
Size before the patch:
text data bss dec hex filename
4578 240 1032 5850 16da arch/x86/kernel/microcode_intel.o
Size after the patch:
text data bss dec hex filename
4489 240 984 5713 1651 arch/x86/kernel/microcode_intel.o
Signed-off-by: Jesper Juhl <jj@chaosbits.net>
Acked-by: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
Cc: Shaohua Li <shaohua.li@intel.com>
LKML-Reference: <alpine.LNX.2.00.1012251946100.10759@swampdragon.chaosbits.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf probe: Fix to support libdwfl older than 0.148
perf tools: Fix lazy wildcard matching
perf buildid-list: Fix error return for success
perf buildid-cache: Fix symbolic link handling
perf symbols: Stop using vmlinux files with no symbols
perf probe: Fix use of kernel image path given by 'k' option
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
x86, kexec: Limit the crashkernel address appropriately
NUMA boot code assumes that physical node ids start at 0, but the DIMMs
that the apic id represents may not be reachable. If this is the case,
node 0 is never online and cpus never end up getting appropriately
assigned to a node. This causes the cpumask of all online nodes to be
empty and machines crash with kernel code assuming online nodes have
valid cpus.
The fix is to appropriately map all the address ranges for physical nodes
and ensure the cpu to node mapping function checks all possible nodes (up
to MAX_NUMNODES) instead of simply checking nodes 0-N, where N is the
number of physical nodes, for valid address ranges.
This requires no longer "compressing" the address ranges of nodes in the
physical node map from 0-N, but rather leave indices in physnodes[] to
represent the actual node id of the physical node. Accordingly, the
topology exported by both amd_get_nodes() and acpi_get_nodes() no longer
must return the number of nodes to iterate through; all such iterations
will now be to MAX_NUMNODES.
This change also passes the end address of system RAM (which may be
different from normal operation if mem= is specified on the command line)
before the physnodes[] array is populated. ACPI parsed nodes are
truncated to fit within the address range that respect the mem=
boundaries and even some physical nodes may become unreachable in such
cases.
When NUMA emulation does succeed, any apicid to node mapping that exists
for unreachable nodes are given default values so that proximity domains
can still be assigned. This is important for node_distance() to
function as desired.
Signed-off-by: David Rientjes <rientjes@google.com>
LKML-Reference: <alpine.DEB.2.00.1012221702090.3701@chino.kir.corp.google.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
It's necessary to fake the node-to-cpumask mapping so that an emulated
node ID returns a cpumask that includes all cpus that have affinity to
the memory it represents.
This is a little intrusive because it requires knowledge of the physical
topology of the system. setup_physnodes() gives us that information, but
since NUMA emulation ends up altering the physnodes array, it's necessary
to reset it before cpus are brought online.
Accordingly, the physnodes array is moved out of init.data and into
cpuinit.data since it will be needed on cpuup callbacks.
This works regardless of whether numa=fake is used on the command line,
or the setup of the fake node succeeds or fails. The physnodes array
always contains the physical topology of the machine if CONFIG_NUMA_EMU
is enabled and can be used to setup the correct node-to-cpumask mappings
in all cases since setup_physnodes() is called whenever the array needs
to be repopulated with the correct data.
To fake the actual mappings, numa_add_cpu() and numa_remove_cpu() are
rewritten for CONFIG_NUMA_EMU so that we first find the physical node to
which each cpu has local affinity, then iterate through all online nodes
to find the emulated nodes that have local affinity to that physical
node, and then finally map the cpu to each of those emulated nodes.
Signed-off-by: David Rientjes <rientjes@google.com>
LKML-Reference: <alpine.DEB.2.00.1012221701520.3701@chino.kir.corp.google.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch adds the equivalent of acpi_fake_nodes() for AMD Northbridge
platforms. The goal is to fake the apicid-to-node mappings for NUMA
emulation so the physical topology of the machine is correctly maintained
within the kernel.
This change also fakes proximity domains for both ACPI and k8 code so the
physical distance between emulated nodes is maintained via
node_distance(). This exports the correct distances via
/sys/devices/system/node/.../distance based on the underlying topology.
A new helper function, fake_physnodes(), is introduced to correctly
invoke the correct NUMA code to fake these two mappings based on the
system type. If there is no underlying NUMA configuration, all cpus are
mapped to node 0 for local distance.
Since acpi_fake_nodes() is no longer called with CONFIG_ACPI_NUMA, it's
prototype can be removed from the header file for such a configuration.
Signed-off-by: David Rientjes <rientjes@google.com>
LKML-Reference: <alpine.DEB.2.00.1012221701360.3701@chino.kir.corp.google.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Both acpi_get_nodes() and amd_get_nodes() are only necessary when
CONFIG_NUMA_EMU is enabled, so avoid compiling them when the option is
disabled.
Signed-off-by: David Rientjes <rientjes@google.com>
LKML-Reference: <alpine.DEB.2.00.1012221701210.3701@chino.kir.corp.google.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch changes the minimum fake node size from 64MB to 32MB so it is
possible to test NUMA code at a greater scale on smaller machines
(64 nodes on a 2G machine, 1024 nodes on 32G machine with
CONFIG_NODES_SHIFT=10).
Signed-off-by: David Rientjes <rientjes@google.com>
LKML-Reference: <alpine.DEB.2.00.1012221700590.3701@chino.kir.corp.google.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Recent Intel new system have different order in MADT, aka will list all thread0
at first, then all thread1.
But SRAT table still old order, it will list cpus in one socket all together.
If the user have compiled limited NR_CPUS or boot with nr_cpus=, could have missed
to put some cpus apic id to node mapping into apicid_to_node[].
for example for 4 sockets system with 64 cpus with nr_cpus=32 will get crash...
[ 9.106288] Total of 32 processors activated (136190.88 BogoMIPS).
[ 9.235021] divide error: 0000 [#1] SMP
[ 9.235315] last sysfs file:
[ 9.235481] CPU 1
[ 9.235592] Modules linked in:
[ 9.245398]
[ 9.245478] Pid: 2, comm: kthreadd Not tainted 2.6.37-rc1-tip-yh-01782-ge92ef79-dirty #274 /Sun Fire x4800
[ 9.265415] RIP: 0010:[<ffffffff81075a8f>] [<ffffffff81075a8f>] select_task_rq_fair+0x4f0/0x623
...
[ 9.645938] RIP [<ffffffff81075a8f>] select_task_rq_fair+0x4f0/0x623
[ 9.665356] RSP <ffff88103f8d1c40>
[ 9.665568] ---[ end trace 2296156d35fdfc87 ]---
So let just parse all cpu entries in SRAT.
Also add apicid checking with MAX_LOCAL_APIC, in case We could out of boundaries of
apicid_to_node[].
it fixes following bug too.
https://bugzilla.kernel.org/show_bug.cgi?id=22662
-v2: expand to 32bit according to hpa
need to add MAX_LOCAL_APIC for 32bit
Reported-and-Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Reported-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
Tested-by: Myron Stowe <myron.stowe@hp.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4D0AD486.9020704@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
We should use MAX_LOCAL_APIC for max apic ids and MAX_APICS as number
of local apics.
Also apic_version[] array should use MAX_LOCAL_APICs.
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <4D0AD464.2020408@kernel.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>