While backporting 72dc67a696, a gfn_to_page()
call was duplicated instead of moved (due to an unrelated patch not being
present in mainline). This caused a page reference leak, resulting in a
fairly massive memory leak.
Fix by removing the extraneous gfn_to_page() call.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Do not assume that a shadow mapping will always point to the same host
frame number. Fixes crash with madvise(MADV_DONTNEED).
[avi: move after first printk(), add another printk()]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The vmx hardware state restore restores the tss selector and base address, but
not its length. Usually, this does not matter since most of the tss contents
is within the default length of 0x67. However, if a process is using ioperm()
to grant itself I/O port permissions, an additional bitmap within the tss,
but outside the default length is consulted. The effect is that the process
will receive a SIGSEGV instead of transparently accessing the port.
Fix by restoring the tss length. Note that i386 had this working already.
Closes bugzilla 10246.
Signed-off-by: Avi Kivity <avi@qumranet.com>
It appears that 64-bit PCI resources cannot possibly ever have worked on
x86-32 even when the RESOURCES_64BIT config option was set, because any
driver that tried to [pci_]ioremap() the resource would have been unable
to do so because the high 32 bits would have been silently dropped on
the floor by the ioremap() routines that only used "unsigned long".
Change them to use "resource_size_t" instead, which properly encodes the
whole 64-bit resource data if RESOURCES_64BIT is enabled.
Acked-by: H. Peter Anvin <hpa@kernel.org>
Acked-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert
commit f62f1fc9ef
Author: Yinghai Lu <yhlu.kernel@gmail.com>
Date: Fri Mar 7 15:02:50 2008 -0800
x86: reserve dma32 early for gart
The patch has a dependency on bootmem modifications which are not .25
material that late in the -rc cycle. The problem which is addressed by
the patch is limited to machines with 256G and more memory booted with
NUMA disabled. This is not a .25 regression and the audience which is
affected by this problem is very limited, so it's safer to do the
revert than pulling in intrusive bootmem changes right now.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
so use nodedata_phys directly.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
fix the bug reported here:
http://bugzilla.kernel.org/show_bug.cgi?id=10232
use update_memory_range() instead of add_memory_range() directly
to avoid closing the gap.
( the new code only affects and runs on systems where the MTRR
workaround triggers. )
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
we have seen a little problem in rebooting Dell Optiplex 745 with the
0KW626 board. Here is a small patch enabling reboot with this board,
which forces the default reboot path it into the BIOS reboot mode.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The fault_msg text is not explictly nul terminated now in startup
assembly. Do so by converting .ascii to .asciz.
Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
aperture_64.c takes a piece of memory and makes it into iommu
window... but such window may not be saved by swsusp -- that leads to
oops during hibernation.
Signed-off-by: Pavel Machek <pavel@suse.cz>
Acked-by: "Rafael J. Wysocki" <rjw@sisk.pl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
this patch allows hpet=force on nVidia nForce 430 southbridge.
This patch was tested by me on my old Asus A8N-VM CSM (where bios does not
support hpet and does not advertise it via acpi entry). My nForce430 version:
lspci -nn | grep LPC
00:0a.0 ISA bridge [0601]: nVidia Corporation MCP51 LPC Bridge [10de:0260]
(rev a2)
Kernel 2.6.24.3 after patching and using hpet=force reports this:
dmesg | grep -i hpet
Kernel command line: root=/dev/sda8 ro vga=773 video=vesafb:mtrr:4,ywrap
vt.default_utf8=0 hpet=force
Force enabled HPET at base address 0xfed00000
hpet clockevent registered
Time: hpet clocksource has been installed.
grep -i hpet /proc/timer_list
Clock Event Device: hpet
set_next_event: hpet_legacy_next_event
set_mode: hpet_legacy_set_mode
grep Clock /proc/timer_list (before patching)
Clock Event Device: pit
Clock Event Device: lapic
grep Clock /proc/timer_list (after patching)
Clock Event Device: hpet
Clock Event Device: lapic
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
a system with 256 GB of RAM, when NUMA is disabled crashes the
following way:
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Cannot allocate aperture memory hole (ffff8101c0000000,65536K)
Kernel panic - not syncing: Not enough memory for aperture
Pid: 0, comm: swapper Not tainted 2.6.25-rc4-x86-latest.git #33
Call Trace:
[<ffffffff84037c62>] panic+0xb2/0x190
[<ffffffff840381fc>] ? release_console_sem+0x7c/0x250
[<ffffffff847b1628>] ? __alloc_bootmem_nopanic+0x48/0x90
[<ffffffff847b0ac9>] ? free_bootmem+0x29/0x50
[<ffffffff847ac1f7>] gart_iommu_hole_init+0x5e7/0x680
[<ffffffff847b255b>] ? alloc_large_system_hash+0x16b/0x310
[<ffffffff84506a2f>] ? _etext+0x0/0x1
[<ffffffff847a2e8c>] pci_iommu_alloc+0x1c/0x40
[<ffffffff847ac795>] mem_init+0x45/0x1a0
[<ffffffff8479ff35>] start_kernel+0x295/0x380
[<ffffffff8479f1c2>] _sinittext+0x1c2/0x230
the root cause is : memmap PMD is too big,
[ffffe200e0600000-ffffe200e07fffff] PMD ->ffff81383c000000 on node 0
almost near 4G..., and vmemmap_alloc_block will use up the ram under 4G.
solution will be:
1. make memmap allocation get memory above 4G...
2. reserve some dma32 range early before we try to set up memmap for all.
and release that before pci_iommu_alloc, so gart or swiotlb could get some
range under 4g limit for sure.
the patch is using method 2.
because method1 may need more code to handle SPARSEMEM and SPASEMEM_VMEMMAP
will get
Your BIOS doesn't leave a aperture memory hole
Please enable the IOMMU option in the BIOS setup
This costs you 64 MB of RAM
Mapping aperture over 65536 KB of RAM @ 4000000
Memory: 264245736k/268959744k available (8484k kernel code, 4187464k reserved, 4004k data, 724k init)
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We recently got some of the "Desktop Form Factor" Optiplex 745's in. I
noticed that there's an entry for the SFF one's, but the BIOS model number
of the DFF differs from that of the SFF. We have been reliably
experiencing the same (as far as I can tell) reboot bug as the SFF boxes.
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix visws printk format warnings:
/local/linsrc/linux-2.6.24-git15/arch/x86/mach-visws/traps.c:50: warning: format '%#lx' expects type 'long unsigned int', but argument 2 has type 'u32'
/local/linsrc/linux-2.6.24-git15/arch/x86/mach-visws/traps.c:50: warning: format '%#lx' expects type 'long unsigned int', but argument 3 has type 'u32'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
when numa disabled I got this compile warning:
arch/x86/kernel/setup64.c: In function setup_per_cpu_areas:
arch/x86/kernel/setup64.c:147: warning: the address of
contig_page_data will always evaluate as true
it seems we missed checking if the node is online before we try to refer
NODE_DATA. Fix it.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
memory-less node support:
this patch uses updated dev_to_node, because dev_to_node already makes sure
it returns an online node.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move 00-INDEX entries to power/00-INDEX (and add entry for
pm_qos_interface.txt).
Update references to moved filenames.
Fix some trailing whitespace.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com>
quicklists cause a serious memory leak on 32-bit x86,
as documented at:
http://bugzilla.kernel.org/show_bug.cgi?id=9991
the reason is that the quicklist pool is a special-purpose
cache that grows out of proportion. It is not accounted for
anywhere and users have no way to even realize that it's
the quicklists that are causing RAM usage spikes. It was
supposed to be a relatively small pool, but as demonstrated
by KOSAKI Motohiro, they can grow as large as:
Quicklists: 1194304 kB
given how much trouble this code has caused historically,
and given that Andrew objected to its introduction on x86
(years ago), the best option at this point is to remove them.
[ any performance benefits of caching constructed pgds should
be implemented in a more generic way (possibly within the page
allocator), while still allowing constructed pages to be
allocated by other workloads. ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The code to restart syscalls after signals depends on checking for a
negative orig_ax, and for particular negative -ERESTART* values in ax.
These fields are 64 bits and for a 32-bit task they get zero-extended.
The syscall restart behavior is lost, a regression from a native 32-bit
kernel and from 64-bit tasks' behavior.
This patch fixes the problem by doing sign-extension where it matters.
For orig_ax, the only time the value should be -1 but winds up as
0x0ffffffff is via a 32-bit ptrace call. So the patch changes ptrace to
sign-extend the 32-bit orig_eax value when it's stored; it doesn't
change the checks on orig_ax, though it uses the new current_syscall()
inline to better document the subtle importance of the used of
signedness there.
The ax value is stored a lot of ways and it seems hard to get them all
sign-extended at their origins. So for that, we use the
current_syscall_ret() to sign-extend it only for 32-bit tasks at the
time of the -ERESTART* comparisons.
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
I figured out another ACPI related regression today.
randconfig testing triggered an early boot-time hang on a laptop of mine
(32-bit x86, config attached) - the screen was scrolling ACPI AML
exceptions [with no serial port and no early debugging available].
v2.6.24 works fine on that laptop with the same .config, so after a few
hours of bisection (had to restart it 3 times - other regressions
interacted), it honed in on this commit:
| 10270d4838 is first bad commit
|
| Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
| Date: Wed Feb 13 09:56:14 2008 -0800
|
| acpi: fix acpi_os_read_pci_configuration() misuse of raw_pci_read()
reverting this commit ontop of -rc5 gave a correctly booting kernel.
But this commit fixes a real bug so the real question is, why did it
break the bootup?
After quite some head-scratching, the following change stood out:
- pci_id->bus = tu8;
+ pci_id->bus = val;
pci_id->bus is defined as u16:
struct acpi_pci_id {
u16 segment;
u16 bus;
...
and 'tu8' changed from u8 to u32. So previously we'd unconditionally
mask the return value of acpi_os_read_pci_configuration()
(raw_pci_read()) to 8 bits, but now we just trust whatever comes back
from the PCI access routines and only crop it to 16 bits.
But if the high 8 bits of that result contains any noise then we'll
write that into ACPI's PCI ID descriptor and confuse the heck out of the
rest of ACPI.
So lets check the PCI-BIOS code on that theory. We have this codepath
for 8-bit accesses (arch/x86/pci/pcbios.c:pci_bios_read()):
switch (len) {
case 1:
__asm__("lcall *(%%esi); cld\n\t"
"jc 1f\n\t"
"xor %%ah, %%ah\n"
"1:"
: "=c" (*value),
"=a" (result)
: "1" (PCIBIOS_READ_CONFIG_BYTE),
"b" (bx),
"D" ((long)reg),
"S" (&pci_indirect));
Aha! The "=a" output constraint puts the full 32 bits of EAX into
*value. But if the BIOS's routines set any of the high bits to nonzero,
we'll return a value with more set in it than intended.
The other, more common PCI access methods (v1 and v2 PCI reads) clear
out the high bits already, for example pci_conf1_read() does:
switch (len) {
case 1:
*value = inb(0xCFC + (reg & 3));
which explicitly converts the return byte up to 32 bits and zero-extends
it.
So zero-extending the result in the PCI-BIOS read routine fixes the
regression on my laptop. ( It might fix some other long-standing issues
we had with PCI-BIOS during the past decade ... ) Both 8-bit and 16-bit
accesses were buggy.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ahmed managed to crash the Host in release_pgd(), which cannot be a Guest
bug, and indeed it wasn't.
The bug was that handing a 0 as the address of the toplevel page table
being manipulated can cause the lookup code in find_pgdir() to return
an uninitialized cache entry (we shadow up to 4 top level page tables
for each Guest).
Commit 37cc8d7f96 introduced this
behaviour in the Guest, uncovering the bug.
The patch which he submitted (which removed the /4 from the index
calculation) simply ensured that these high-indexed entries hit the
early exit path of guest_set_pmd(). But you get lots of segfaults in
guest userspace as the PMDs aren't being updated.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Now the TSC code handles a zero return from calculate_cpu_khz(),
lguest can simply pass through the value it gets from the Host: if
non-zero, all the normal TSC code applies.
Otherwise (or if the Host really doesn't support TSC), the clocksource
code will fall back to the slower but reasonable lguest clock.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This makes 64-bit ptrace calls setting the 64-bit orig_ax field for a
32-bit task sign-extend the low 32 bits up to 64. This matches what a
64-bit debugger expects when tracing a 32-bit task.
This follows on my "x86_64 ia32 syscall restart fix". This didn't
matter until that was fixed.
The debugger ignores or zeros the high half of every register slot it
sets (including the orig_rax pseudo-register) uniformly. It expects
that the setting of the low 32 bits always has the same meaning as a
32-bit debugger setting those same 32 bits with native 32-bit
facilities.
This never arose before because the syscall restart check never
matched any -ERESTART* values due to lack of sign extension. Before
that fix, even 32-bit ptrace setting orig_eax to -1 failed to trigger
the restart check anyway. So this was never noticed as a regression
of 64-bit debuggers vs 32-bit debuggers on the same 64-bit kernel.
Signed-off-by: Roland McGrath <roland@redhat.com>
[ Changed to just do the sign-extension unconditionally on x86-64,
since orig_ax is always just a small integer and doesn't need
the full 64-bit range ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The new x86 setup code (4fd06960f1) broke booting on an old P3/500MHz
with an onboard Voodoo3 of mine. After debugging it, it turned out
to be caused by the fact that the vesa probing now asks for VBE2 data.
Disassembing the video BIOS shows that it overflows the vesa_general_info
structure when VBE2 data is requested because the source addresses for the
information strings which get strcpy'ed to the buffer lie outside the 32K
BIOS code (and hence contain long sequences of 0xff's).
E.G.:
get_vbe_controller_info:
00002A9C 60 pushaw
00002A9D 1E push ds
00002A9E 0E push cs
00002A9F 1F pop ds
00002AA0 2BC9 sub cx,cx
00002AA2 6626813D56424532 cmp dword [es:di],0x32454256 ; "VBE2"
00002AAA 7501 jnz .1
00002AAC 41 inc cx
.1:
00002AAD 51 push cx
00002AAE B91400 mov cx,0x14
00002AB1 BED47F mov si, controller_header
00002AB4 57 push di
00002AB5 F3A4 rep movsb ; copy vbe1.2 header
00002AB7 B9EC00 mov cx,0xec
00002ABA 2AC0 sub al,al
00002ABC F3AA rep stosb ; zero pad remainder
00002ABE 5F pop di
00002ABF E8EB0D call word get_memory
00002AC2 C1E002 shl ax,0x2
00002AC5 26894512 mov [es:di+0x12],ax ; total memory
00002AC9 26C745040003 mov word [es:di+0x4],0x300 ; VBE version
00002ACF 268C4D08 mov [es:di+0x8],cs
00002AD3 268C4D10 mov [es:di+0x10],cs
00002AD7 59 pop cx
00002AD8 E361 jcxz .done ; VBE2 requested?
00002ADA 8D9D0001 lea bx,[di+0x100]
00002ADE 53 push bx
00002ADF 87DF xchg bx,di ; di now points to 2nd half
00002AE1 26C747140001 mov word [es:bx+0x14],0x100 ; sw rev
00002AE7 26897F06 mov [es:bx+0x6],di ; oem string
00002AEB 268C4708 mov [es:bx+0x8],es
00002AEF BE5280 mov si,0x8052 ; oem string
00002AF2 E87A1B call word strcpy
00002AF5 26897F0E mov [es:bx+0xe],di ; video mode list
00002AF9 268C4710 mov [es:bx+0x10],es
00002AFD B91E00 mov cx,0x1e
00002B00 BEE87F mov si,vidmodes
00002B03 F3A5 rep movsw
00002B05 26897F16 mov [es:bx+0x16],di ; oem vendor
00002B09 268C4718 mov [es:bx+0x18],es
00002B0D BE2480 mov si,0x8024 ; oem vendor
00002B10 E85C1B call word strcpy
00002B13 26897F1A mov [es:bx+0x1a],di ; oem product
00002B17 268C471C mov [es:bx+0x1c],es
00002B1B BE3880 mov si,0x8038 ; oem product
00002B1E E84E1B call word strcpy
00002B21 26897F1E mov [es:bx+0x1e],di ; oem product rev
00002B25 268C4720 mov [es:bx+0x20],es
00002B29 BE4580 mov si,0x8045 ; oem product rev
00002B2C E8401B call word strcpy
00002B2F 58 pop ax
00002B30 B90001 mov cx,0x100
00002B33 2BCF sub cx,di
00002B35 03C8 add cx,ax
00002B37 2AC0 sub al,al
00002B39 F3AA rep stosb ; zero pad
.done:
00002B3B 1F pop ds
00002B3C 61 popaw
00002B3D B84F00 mov ax,0x4f
00002B40 C3 ret
(The full BIOS can be found at http://peter.korsgaard.com/vgabios.bin
if interested).
The old setup code didn't ask for VBE2 info, and the new code doesn't
actually do anything with the extra information, so the fix is to simply
not request it. Other BIOS'es might have the same problem.
Signed-off-by: Peter Korsgaard <jacmet@sunsite.dk>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Jan Beulich noticed that the reboot fixups went missing during
reboot.c unification.
(commit 4d022e35fd)
Geode and a few other rare boards with special reboot quirks are
affected.
Reported-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
convert_fxsr_to_user() in 2.6.24's i387_32.c did this, and
convert_to_fxsr() also does the inverse, so I assume it's an oversight
that it is no longer being done.
[ mingo@elte.hu:
we encode it this way because there's no space for the 'FPU Last
Instruction Opcode' (->fop) field in the legacy user_i387_ia32_struct
that PTRACE_GETFPREGS/PTRACE_SETFPREGS uses.
it's probably pure legacy - i'd be surprised if any user-space relied on
the FPU Last Opcode in any way. But indeed we used to do it previously
so the most conservative thing is to preserve that piece of information.
]
Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
The Linux kernel currently does not clear the direction flag before
calling a signal handler, whereas the x86/x86-64 ABI requires that.
Linux had this behavior/bug forever, but this becomes a real problem
with gcc version 4.3, which assumes that the direction flag is
correctly cleared at the entry of a function.
This patches changes the setup_frame() functions to clear the
direction before entering the signal handler.
Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: H. Peter Anvin <hpa@zytor.com>
We don't need to printk a message every time we transition.
Leave the code there, but ifdef'd out, as it's useful when
adding support for new processors.
Reported-by: Petr Titěra <P.Titera@century.cz>
Signed-off-by: Dave Jones <davej@redhat.com>
Add CONFIG_HAVE_KRETPROBES to the arch/<arch>/Kconfig file for relevant
architectures with kprobes support. This facilitates easy handling of
in-kernel modules (like samples/kprobes/kretprobe_example.c) that depend on
kretprobes being present in the kernel.
Thanks to Sam Ravnborg for helping make the patch more lean.
Per Mathieu's suggestion, added CONFIG_KRETPROBES and fixed up dependencies.
Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
P4 has been coming out as CPU_FAMILY=4 instead of 6: fix MPENTIUM4 typo.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-x86:
x86/xen: fix DomU boot problem
x86: not set node to cpu_to_node if the node is not online
x86, i387: fix ptrace leakage using init_fpu()
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/avi/kvm:
x86: disable KVM for Voyager and friends
KVM: VMX: Avoid rearranging switched guest msrs while they are loaded
KVM: MMU: Fix race when instantiating a shadow pte
KVM: Route irq 0 to vcpu 0 exclusively
KVM: Avoid infinite-frequency local apic timer
KVM: make MMU_DEBUG compile again
KVM: move alloc_apic_access_page() outside of non-preemptable region
KVM: SVM: fix Windows XP 64 bit installation crash
KVM: remove the usage of the mmap_sem for the protection of the memory slots.
KVM: emulate access to MSR_IA32_MCG_CTL
KVM: Make the supported cpuid list a host property rather than a vm property
KVM: Fix kvm_arch_vcpu_ioctl_set_sregs so that set_cr0 works properly
KVM: SVM: set NM intercept when enabling CR0.TS in the guest
KVM: SVM: Fix lazy FPU switching
Construct Xen guest e820 map with a hole between 640K-1M.
It's pure luck that Xen kernels have gotten away with it in the past.
The patch below seems like the right thing to do. It certainly boots in
a domU without the DMI problem (without any of the other related patches
such as Alexander's).
Signed-off-by: Ian Campbell <ijc@hellion.org.uk>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Tested-by: Mark McLoughlin <markmc@redhat.com>
Acked-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
resolve boot problem reported by Mel Gorman:
http://lkml.org/lkml/2008/2/13/404
init_cpu_to_node will use cpu->apic (from MADT or mptable) and
apic->node(from SRAT or AMD config space with k8_bus_64.c) to have
cpu->node mapping, and later identify_cpu will overwrite them
again...(with nearby_node...)
this patch checks if the node is online, otherwise it will not
update cpu_node map. so keep cpu_node map to online node before
identify_cpu..., to prevent possible error.
Signed-off-by: Yinghai Lu <yinghai.lu@sun.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
This bug got introduced by the recent i387 merge:
commit 4421011120
Author: Roland McGrath <roland@redhat.com>
Date: Wed Jan 30 13:31:50 2008 +0100
x86: x86 i387 user_regset
Current usage of unlazy_fpu() in ptrace specific routines is wrong.
unlazy_fpu() will not init fpu if the task never used math. So the
ptrace calls can expose the parent tasks FPU data in some cases.
Replace it with the init_fpu() which will init the math state, if the
task never used math before.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Most classic Pentiums don't have hardware virtualization extension,
and building kvm with Voyager, Visual Workstation, or NUMAQ
generates spurious failures.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
KVM tries to run as much as possible with the guest msrs loaded instead of
host msrs, since switching msrs is very expensive. It also tries to minimize
the number of msrs switched according to the guest mode; for example,
MSR_LSTAR is needed only by long mode guests. This optimization is done by
setup_msrs().
However, we must not change which msrs are switched while we are running with
guest msr state:
- switch to guest msr state
- call setup_msrs(), removing some msrs from the list
- switch to host msr state, leaving a few guest msrs loaded
An easy way to trigger this is to kexec an x86_64 linux guest. Early during
setup, the guest will switch EFER to not include SCE. KVM will stop saving
MSR_LSTAR, and on the next msr switch it will leave the guest LSTAR loaded.
The next host syscall will end up in a random location in the kernel.
Fix by reloading the host msrs before changing the msr list.
Signed-off-by: Avi Kivity <avi@qumranet.com>
For improved concurrency, the guest walk is performed concurrently with other
vcpus. This means that we need to revalidate the guest ptes once we have
write-protected the guest page tables, at which point they can no longer be
modified.
The current code attempts to avoid this check if the shadow page table is not
new, on the assumption that if it has existed before, the guest could not have
modified the pte without the shadow lock. However the assumption is incorrect,
as the racing vcpu could have modified the pte, then instantiated the shadow
page, before our vcpu regains control:
vcpu0 vcpu1
fault
walk pte
modify pte
fault in same pagetable
instantiate shadow page
lookup shadow page
conclude it is old
instantiate spte based on stale guest pte
We could do something clever with generation counters, but a test run by
Marcelo suggests this is unnecessary and we can just do the revalidation
unconditionally. The pte will be in the processor cache and the check can
be quite fast.
Signed-off-by: Avi Kivity <avi@qumranet.com>
If the local apic initial count is zero, don't start a an hrtimer with infinite
frequency, locking up the host.
Signed-off-by: Avi Kivity <avi@qumranet.com>
the cr3 variable is now inside the vcpu->arch structure.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
alloc_apic_access_page() can sleep, while vmx_vcpu_setup is called
inside a non preemptable region. Move it after put_cpu().
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
While installing Windows XP 64 bit wants to access the DEBUGCTL and the last
branch record (LBR) MSRs. Don't allowing this in KVM causes the installation to
crash. This patch allow the access to these MSRs and fixes the issue.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch replaces the mmap_sem lock for the memory slots with a new
kvm private lock, it is needed beacuse untill now there were cases where
kvm accesses user memory while holding the mmap semaphore.
Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Revert:
commit 8be8f54bae
Author: Thomas Gleixner <tglx@linutronix.de>
Date: Sat Feb 23 20:43:21 2008 +0100
x86: CPA: avoid split of alias mappings
because it clearly mishandles the case when __change_page_attr(), called
from __change_page_attr_set_clr(), changes cpa->processed to 1 and
cpa_process_alias(cpa) is executed right after that.
This crashes my x86-64 test box early in the boot process
(ref. http://bugzilla.kernel.org/show_bug.cgi?id=10140#c4).
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Injecting an GP when accessing this MSR lets Windows crash when running some
stress test tools in KVM. So this patch emulates access to this MSR.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
One of the use cases for the supported cpuid list is to create a "greatest
common denominator" of cpu capabilities in a server farm. As such, it is
useful to be able to get the list without creating a virtual machine first.
Since the code does not depend on the vm in any way, all that is needed is
to move it to the device ioctl handler. The capability identifier is also
changed so that binaries made against -rc1 will fail gracefully.
Signed-off-by: Avi Kivity <avi@qumranet.com>