set_cr0_no_modeswitch() was a hack to avoid corrupting segment registers.
As we now cache the protected mode values on entry to real mode, this
isn't an issue anymore, and it interferes with reboot (which usually _is_
a modeswitch).
Signed-off-by: Avi Kivity <avi@qumranet.com>
The reset state has cs.selector == 0xf000 and cs.base == 0xffff0000,
which aren't compatible with vm86 mode, which is used for real mode
virtualization.
When we create a vcpu, we set cs.base to 0xf0000, but if we get there by
way of a reset, the values are inconsistent and vmx refuses to enter
guest mode.
Workaround by detecting the state and munging it appropriately.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The current string pio interface communicates using guest virtual addresses,
relying on userspace to translate addresses and to check permissions. This
interface cannot fully support guest smp, as the check needs to take into
account two pages at one in case an unaligned string transfer straddles a
page boundary.
Change the interface not to communicate guest addresses at all; instead use
a buffer page (mmaped by userspace) and do transfers there. The kernel
manages the virtual to physical translation and can perform the checks
atomically by taking the appropriate locks.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This is redundant, as we also return -EINTR from the ioctl, but it
allows us to examine the exit_reason field on resume without seeing
old data.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Currently, userspace is told about the nature of the last exit from the
guest using two fields, exit_type and exit_reason, where exit_type has
just two enumerations (and no need for more). So fold exit_type into
exit_reason, reducing the complexity of determining what really happened.
Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM used to handle cpuid by letting userspace decide what values to
return to the guest. We now handle cpuid completely in the kernel. We
still let userspace decide which values the guest will see by having
userspace set up the value table beforehand (this is necessary to allow
management software to set the cpu features to the least common denominator,
so that live migration can work).
The motivation for the change is that kvm kernel code can be impacted by
cpuid features, for example the x86 emulator.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Currently when passing the a PIO emulation request to userspace, we
rely on userspace updating %rax (on 'in' instructions) and %rsi/%rdi/%rcx
(on string instructions). This (a) requires two extra ioctls for getting
and setting the registers and (b) is unfriendly to non-x86 archs, when
they get kvm ports.
So fix by doing the register fixups in the kernel and passing to userspace
only an abstract description of the PIO to be done.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of twiddling the rip registers directly, use the
skip_emulated_instruction() function to do that for us.
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
failed VM entry on VMX might still change %fs or %gs, thus make sure
that KVM always reloads the segment selectors. This is crutial on both
x86 and x86_64: x86 has __KERNEL_PDA in %fs on which things like
'current' depends and x86_64 has 0 there and needs MSR_GS_BASE to work.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Intel virtualization extensions do not support virtualizing real mode. So
kvm uses virtualized vm86 mode to run real mode code. Unfortunately, this
virtualized vm86 mode does not support the so called "big real" mode, where
the segment selector and base do not agree with each other according to the
real mode rules (base == selector << 4).
To work around this, kvm checks whether a selector/base pair violates the
virtualized vm86 rules, and if so, forces it into conformance. On a
transition back to protected mode, if we see that the guest did not touch
a forced segment, we restore it back to the original protected mode value.
This pile of hacks breaks down if the gdt has changed in real mode, as it
can cause a segment selector to point to a system descriptor instead of a
normal data segment. In fact, this happens with the Windows bootloader
and the qemu acpi bios, where a protected mode memcpy routine issues an
innocent 'pop %es' and traps on an attempt to load a system descriptor.
"Fix" by checking if the to-be-restored selector points at a system segment,
and if so, coercing it into a normal data segment. The long term solution,
of course, is to abandon vm86 mode and use emulation for big real mode.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The vmx code currently treats the guest's sysenter support msrs as 32-bit
values, which breaks 32-bit compat mode userspace on 64-bit guests. Fix by
using the native word width of the machine.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Allocate a distinct inode for every vcpu in a VM. This has the following
benefits:
- the filp cachelines are no longer bounced when f_count is incremented on
every ioctl()
- the API and internal code are distinctly clearer; for example, on the
KVM_GET_REGS ioctl, there is no need to copy the vcpu number from
userspace and then copy the registers back; the vcpu identity is derived
from the fd used to make the call
Right now the performance benefits are completely theoretical since (a) we
don't support more than one vcpu per VM and (b) virtualization hardware
inefficiencies completely everwhelm any cacheline bouncing effects. But
both of these will change, and we need to prepare the API today.
Signed-off-by: Avi Kivity <avi@qumranet.com>
This adds a special MSR based hypercall API to KVM. This is to be
used by paravirtual kernels and virtual drivers.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The whole thing is rotten, but this allows vmx to boot with the guest reboot
fix.
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Convert the PDA code to use %fs rather than %gs as the segment for
per-processor data. This is because some processors show a small but
measurable performance gain for reloading a NULL segment selector (as %fs
generally is in user-space) versus a non-NULL one (as %gs generally is).
On modern processors the difference is very small, perhaps undetectable.
Some old AMD "K6 3D+" processors are noticably slower when %fs is used
rather than %gs; I have no idea why this might be, but I think they're
sufficiently rare that it doesn't matter much.
This patch also fixes the math emulator, which had not been adjusted to
match the changed struct pt_regs.
[frederik.deweerdt@gmail.com: fixit with gdb]
[mingo@elte.hu: Fix KVM too]
Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com>
Signed-off-by: Andi Kleen <ak@suse.de>
Cc: Ian Campbell <Ian.Campbell@XenSource.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Zachary Amsden <zach@vmware.com>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Frederik Deweerdt <frederik.deweerdt@gmail.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
On hotplug, we execute the hardware extension enable sequence. On unplug, we
decache any vcpus that last ran on the exiting cpu, and execute the hardware
extension disable sequence.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Like the inline code it replaces, this function decaches the vmcs from the cpu
it last executed on. in addition:
- vcpu_clear() works if the last cpu is also the cpu we're running on
- it is faster on larger smps by virtue of using smp_call_function_single()
Includes fix from Ingo Molnar.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Or 32-bit userspace will get confused.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Just like svm.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Forms like "0(%rsp)" generate an instruction with an unnecessary one byte
displacement under certain circumstances. replace with the equivalent
"(%rsp)".
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Intel hosts, without long mode, and with nx support disabled in the bios
have an efer that is readable but not writable. This causes a lockup on
switch to guest mode (even though it should exit with reason 34 according
to the documentation).
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kvm mmio read path looks like:
1. guest read faults
2. kvm emulates read, calls emulator_read_emulated()
3. fails as a read requires userspace help
4. exit to userspace
5. userspace emulates read, kvm sets vcpu->mmio_read_completed
6. re-enter guest, fault again
7. kvm emulates read, calls emulator_read_emulated()
8. succeeds as vcpu->mmio_read_emulated is set
9. instruction completes and guest is resumed
A problem surfaces if the userspace exit (step 5) also requests an interrupt
injection. In that case, the guest does not re-execute the original
instruction, but the interrupt handler. The next time an mmio read is
exectued (likely for a different address), step 3 will find
vcpu->mmio_read_completed set and return the value read for the original
instruction.
The problem manifested itself in a few annoying ways:
- little squares appear randomly on console when switching virtual terminals
- ne2000 fails under nfs read load
- rtl8139 complains about "pci errors" even though the device model is
incapable of issuing them.
Fix by skipping interrupt injection if an mmio read is pending.
A better fix is to avoid re-entry into the guest, and re-emulating immediately
instead. However that's a bit more complex.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both "=r" and "=g" breaks my build on i386:
$ make
CC [M] drivers/kvm/vmx.o
{standard input}: Assembler messages:
{standard input}:3318: Error: bad register name `%sil'
make[1]: *** [drivers/kvm/vmx.o] Error 1
make: *** [_module_drivers/kvm] Error 2
The reason is that setbe requires an 8-bit register but "=r" does not
constrain the target register to be one that has an 8-bit version on
i386.
According to
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10153
the correct constraint is "=q".
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No need to test for rflags.if as both VT and SVM specs assure us that on exit
caused from interrupt window opening, 'if' is set.
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It overwrites the right cr3 set from mmu setup. Happens only with the test
harness.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This will allow us to see the root cause when a vmwrite error happens.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Since we write protect shadowed guest page tables, there is no need to trap
page invalidations (the guest will always change the mapping before issuing
the invlpg instruction).
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
When beginning to process a page fault, make sure we have enough shadow pages
available to service the fault. If not, free some pages.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Hardware virtualization implementations allow the guests to freely change some
of the bits in cr0 and cr4, but trap when changing the other bits. This is
useful to avoid excessive exits due to changing, for example, the ts flag.
It also means the kvm's copy of cr0 and cr4 may be stale with respect to these
bits. most of the time this doesn't matter as these bits are not very
interesting. Other times, however (for example when returning cr0 to
userspace), they are, so get the fresh contents of these bits from the guest
by means of a new arch operation.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
The current interrupt injection mechanism might delay an interrupt under
the following circumstances:
- if injection fails because the guest is not interruptible (rflags.IF clear,
or after a 'mov ss' or 'sti' instruction). Userspace can check rflags,
but the other cases or not testable under the current API.
- if injection fails because of a fault during delivery. This probably
never happens under normal guests.
- if injection fails due to a physical interrupt causing a vmexit so that
it can be handled by the host.
In all cases the guest proceeds without processing the interrupt, reducing
the interactive feel and interrupt throughput of the guest.
This patch fixes the situation by allowing userspace to request an exit
when the 'interrupt window' opens, so that it can re-inject the interrupt
at the right time. Guest interactivity is very visibly improved.
Signed-off-by: Dor Laor <dor.laor@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
KVM does kmalloc() in an atomic section while having preemption disabled via
vcpu_load(). Fix this by moving the ->*_msr setup from the vcpu_setup method
to the vcpu_create method.
(This is also a small speedup for setting up a vcpu, which can in theory be
more frequent than the vcpu_create method).
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
No need to append _MSR to msr names, a prefix should suffice.
Signed-off-by: Nguyen Anh Quynh <aquynh@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
Instead of doing tricky stuff with the arch dependent virtualization
registers, take a peek at the guest's efer.
This simlifies some code, and fixes some confusion in the mmu branch.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This allows plan9 to get a little further booting.
Signed-off-by: Michael Riepe <michael@mr511.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This allows opensolaris to boot on kvm/intel.
Signed-off-by: Michael Riepe <michael@mr511.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
It seems macbooks set bit 2 but not bit 0, which is an "enabled but vmxon will
fault" setting.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Tested-by: Alex Larsson (sometimes testing helps)
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
They're not on speaking terms.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
This makes the SET_SREGS ioctl behave symmetrically to the GET_SREGS ioctl wrt
the segment access rights flag.
Signed-off-by: Uri Lublin <uril@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>