Commit Graph

611 Commits

Author SHA1 Message Date
Gleb Natapov
9222be18f7 KVM: SVM: Coalesce userspace/kernel irqchip interrupt injection logic
Start to use interrupt/exception queues like VMX does.
This also fix the bug that if exit was caused by a guest
internal exception access to IDT the exception was not
reinjected.

Use EVENTINJ to inject interrupts.  Use VINT only for detecting when IRQ
windows is open again.  EVENTINJ ensures
the interrupt is injected immediately and not delayed.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:46 +03:00
Gleb Natapov
5df5664647 KVM: Use kvm_arch_interrupt_allowed() instead of checking interrupt_window_open directly
kvm_arch_interrupt_allowed() also checks IF so drop the check.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:46 +03:00
Gleb Natapov
1f21e79aac KVM: VMX: Cleanup vmx_intr_assist()
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:45 +03:00
Gleb Natapov
863e8e658e KVM: VMX: Consolidate userspace and kernel interrupt injection for VMX
Use the same callback to inject irq/nmi events no matter what irqchip is
in use. Only from VMX for now.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:45 +03:00
Gleb Natapov
8061823a25 KVM: Make kvm_cpu_(has|get)_interrupt() work for userspace irqchip too
At the vector level, kernel and userspace irqchip are fairly similar.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:45 +03:00
Jan Kiszka
3438253926 KVM: MMU: Fix auditing code
Fix build breakage of hpa lookup in audit_mappings_page. Moreover, make
this function robust against shadow_notrap_nonpresent_pte entries.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:45 +03:00
Marcelo Tosatti
59839dfff5 KVM: x86: check for cr3 validity in ioctl_set_sregs
Matt T. Yourst notes that kvm_arch_vcpu_ioctl_set_sregs lacks validity
checking for the new cr3 value:

"Userspace callers of KVM_SET_SREGS can pass a bogus value of cr3 to
the kernel. This will trigger a NULL pointer access in gfn_to_rmap()
when userspace next tries to call KVM_RUN on the affected VCPU and kvm
attempts to activate the new non-existent page table root.

This happens since kvm only validates that cr3 points to a valid guest
physical memory page when code *inside* the guest sets cr3. However, kvm
currently trusts the userspace caller (e.g. QEMU) on the host machine to
always supply a valid page table root, rather than properly validating
it along with the rest of the reloaded guest state."

http://sourceforge.net/tracker/?func=detail&atid=893831&aid=2687641&group_id=180599

Check for a valid cr3 address in kvm_arch_vcpu_ioctl_set_sregs, triple
fault in case of failure.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:43 +03:00
Avi Kivity
463656c000 KVM: Replace kvmclock open-coded get_cpu_var() with the real thing
Suggested by Ingo Molnar.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:42 +03:00
Gleb Natapov
8317c298ea KVM: SVM: Skip instruction on a task switch only when appropriate
If a task switch was initiated because off a task gate in IDT and IDT
was accessed because of an external even the instruction should not
be skipped.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:42 +03:00
Gleb Natapov
ba8afb6b0a KVM: x86 emulator: Add new mode of instruction emulation: skip
In the new mode instruction is decoded, but not executed. The EIP
is moved to point after the instruction.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:42 +03:00
Gleb Natapov
e637b8238a KVM: x86 emulator: Decode soft interrupt instructions
Do not emulate them yet.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:41 +03:00
Gleb Natapov
84ce66a686 KVM: x86 emulator: Completely decode in/out at decoding stage
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:41 +03:00
Gleb Natapov
341de7e372 KVM: x86 emulator: Add unsigned byte immediate decode
Extend "Source operand type" opcode description field to 4 bites
to accommodate new option.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:41 +03:00
Gleb Natapov
d53c4777b3 KVM: x86 emulator: Complete decoding of call near in decode stage
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:41 +03:00
Gleb Natapov
b2833e3cde KVM: x86 emulator: Complete short/near jcc decoding in decode stage
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:40 +03:00
Gleb Natapov
782b877c80 KVM: x86 emulator: Complete ljmp decoding at decode stage
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:40 +03:00
Gleb Natapov
0654169e73 KVM: x86 emulator: Add lcall decoding
No emulation yet.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:40 +03:00
Gleb Natapov
a5f868bd45 KVM: x86 emulator: Add decoding of 16bit second immediate argument
Such as segment number in lcall/ljmp

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:40 +03:00
Marcelo Tosatti
c2d0ee46e6 KVM: MMU: remove global page optimization logic
Complexity to fix it not worthwhile the gains, as discussed
in http://article.gmane.org/gmane.comp.emulators.kvm.devel/28649.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:39 +03:00
Marcelo Tosatti
ede2ccc517 KVM: PIT: fix count read and mode 0 handling
Commit 46ee278652f4cbd51013471b64c7897ba9bcd1b1 causes Solaris 10
to hang on boot.

Assuming that PIT counter reads should return 0 for an expired timer
is wrong: when it is active, the counter never stops (see comment on
__kpit_elapsed).

Also arm a one shot timer for mode 0.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:39 +03:00
Gleb Natapov
2d03319654 KVM: x86 emulator: fix call near emulation
The length of pushed on to the stack return address depends on operand
size not address size.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:38 +03:00
Sheng Yang
4c26b4cd6f KVM: MMU: Discard reserved bits checking on PDE bit 7-8
1. It's related to a Linux kernel bug which fixed by Ingo on
07a66d7c53. The original code exists for quite a
long time, and it would convert a PDE for large page into a normal PDE. But it
fail to fit normal PDE well.  With the code before Ingo's fix, the kernel would
fall reserved bit checking with bit 8 - the remaining global bit of PTE. So the
kernel would receive a double-fault.

2. After discussion, we decide to discard PDE bit 7-8 reserved checking for now.
For this marked as reserved in SDM, but didn't checked by the processor in
fact...

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:38 +03:00
Gleb Natapov
64a7ec0668 KVM: Fix unneeded instruction skipping during task switching.
There is no need to skip instruction if the reason for a task switch
is a task gate in IDT and access to it is caused by an external even.
The problem  is currently solved only for VMX since there is no reliable
way to skip an instruction in SVM. We should emulate it instead.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:38 +03:00
Gleb Natapov
b237ac37a1 KVM: Fix task switch back link handling.
Back link is written to a wrong TSS now.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:37 +03:00
Gleb Natapov
8843419048 KVM: VMX: Do not zero idt_vectoring_info in vmx_complete_interrupts().
We will need it later in task_switch().
Code in handle_exception() is dead. is_external_interrupt(vect_info)
will always be false since idt_vectoring_info is zeroed in
vmx_complete_interrupts().

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:37 +03:00
Gleb Natapov
37b96e9880 KVM: VMX: Rewrite vmx_complete_interrupt()'s twisted maze of if() statements
...with a more straightforward switch().

Also fix a bug when NMI could be dropped on exit. Although this should
never happen in practice, since NMIs can only be injected, never triggered
internally by the guest like exceptions.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:37 +03:00
Gleb Natapov
7b4a25cb29 KVM: VMX: Fix handling of a fault during NMI unblocked due to IRET
Bit 12 is undefined in any of the following cases:
 If the VM exit sets the valid bit in the IDT-vectoring information field.
 If the VM exit is due to a double fault.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:36 +03:00
Dong, Eddie
20c466b561 KVM: Use rsvd_bits_mask in load_pdptrs()
Also remove bit 5-6 from rsvd_bits_mask per latest SDM.

Signed-off-by: Eddie Dong <Eddie.Dong@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:36 +03:00
Sheng Yang
93ba03c2e2 KVM: VMX: Fix feature testing
The testing of feature is too early now, before vmcs_config complete initialization.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:36 +03:00
Sheng Yang
045471563d KVM: VMX: Clean up Flex Priority related
And clean paranthes on returns.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:36 +03:00
Wei Yongjun
7a6ce84c74 KVM: remove pointless conditional before kfree() in lapic initialization
Remove pointless conditional before kfree().

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:35 +03:00
Avi Kivity
9645bb56b3 KVM: MMU: Use different shadows when EFER.NXE changes
A pte that is shadowed when the guest EFER.NXE=1 is not valid when
EFER.NXE=0; if bit 63 is set, the pte should cause a fault, and since the
shadow EFER always has NX enabled, this won't happen.

Fix by using a different shadow page table for different EFER.NXE bits.  This
allows vcpus to run correctly with different values of EFER.NXE, and for
transitions on this bit to be handled correctly without requiring a full
flush.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:35 +03:00
Dong, Eddie
82725b20e2 KVM: MMU: Emulate #PF error code of reserved bits violation
Detect, indicate, and propagate page faults where reserved bits are set.
Take care to handle the different paging modes, each of which has different
sets of reserved bits.

[avi: fix pte reserved bits for efer.nxe=0]

Signed-off-by: Eddie Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:35 +03:00
Eddie Dong
a8b876b1a4 KVM: MMU: Fix comment in page_fault()
The original one is for the code before refactoring.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:34 +03:00
Sheng Yang
f9c617f611 KVM: VMX: Correct wrong vmcs field sizes
EXIT_QUALIFICATION and GUEST_LINEAR_ADDRESS are natural width, not 64-bit.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:34 +03:00
Avi Kivity
7d433b9f94 KVM: VMX: Make flexpriority module parameter reflect hardware capability
If the hardware does not support flexpriority, zero the module parameter.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:34 +03:00
Gleb Natapov
78646121e9 KVM: Fix interrupt unhalting a vcpu when it shouldn't
kvm_vcpu_block() unhalts vpu on an interrupt/timer without checking
if interrupt window is actually opened.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:33 +03:00
Gleb Natapov
09cec75488 KVM: Timer event should not unconditionally unhalt vcpu.
Currently timer events are processed before entering guest mode. Move it
to main vcpu event loop since timer events should be processed even while
vcpu is halted.  Timer may cause interrupt/nmi to be injected and only then
vcpu will be unhalted.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:33 +03:00
Avi Kivity
089d034e0c KVM: VMX: Fold vm_need_ept() into callers
Trivial.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:32 +03:00
Avi Kivity
575ff2dcb2 KVM: VMX: Zero ept module parameter if ept is not present
Allows reading back hardware capability.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:32 +03:00
Avi Kivity
919818abc2 KVM: VMX: Zero the vpid module parameter if vpid is not supported
This allows reading back how the hardware is configured.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:32 +03:00
Avi Kivity
4462d21a61 KVM: VMX: Annotate module parameters as __read_mostly
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:32 +03:00
Avi Kivity
736caefe15 KVM: VMX: Simplify module parameter names
Instead of 'enable_vpid=1', use a simple 'vpid=1'.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:31 +03:00
Avi Kivity
6062d012ed KVM: VMX: Rename kvm_handle_exit() to vmx_handle_exit()
It is a static vmx-specific function.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:31 +03:00
Avi Kivity
c1f8bc04c6 KVM: VMX: Make module parameters readable
Useful to see how the module was loaded.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:31 +03:00
Gleb Natapov
fe4c7b1914 KVM: reuse (pop|push)_irq from svm.c in vmx.c
The prioritized bit vector manipulation functions are useful in both vmx and
svm.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:31 +03:00
Gleb Natapov
61c50edfcd KVM: SVM: Remove duplicate code in svm_do_inject_vector()
svm_do_inject_vector() reimplements pop_irq().

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:30 +03:00
Amit Shah
7fe29e0faa KVM: x86: Ignore reads to EVNTSEL MSRs
We ignore writes to the performance counters and performance event
selector registers already. Kaspersky antivirus reads the eventsel
MSR causing it to crash with the current behaviour.

Return 0 as data when the eventsel registers are read to stop the
crash.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:30 +03:00
Gleb Natapov
f00be0cae4 KVM: MMU: do not free active mmu pages in free_mmu_pages()
free_mmu_pages() should only undo what alloc_mmu_pages() does.
Free mmu pages from the generic VM destruction function, kvm_destroy_vm().

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:30 +03:00
Sheng Yang
e56d532f20 KVM: Device assignment framework rework
After discussion with Marcelo, we decided to rework device assignment framework
together. The old problems are kernel logic is unnecessary complex. So Marcelo
suggest to split it into a more elegant way:

1. Split host IRQ assign and guest IRQ assign. And userspace determine the
combination. Also discard msi2intx parameter, userspace can specific
KVM_DEV_IRQ_HOST_MSI | KVM_DEV_IRQ_GUEST_INTX in assigned_irq->flags to
enable MSI to INTx convertion.

2. Split assign IRQ and deassign IRQ. Import two new ioctls:
KVM_ASSIGN_DEV_IRQ and KVM_DEASSIGN_DEV_IRQ.

This patch also fixed the reversed _IOR vs _IOW in definition(by deprecated the
old interface).

[avi: replace homemade bitcount() by hweight_long()]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:29 +03:00
Hannes Eder
386eb6e8b3 KVM: make 'lapic_timer_ops' and 'kpit_ops' static
Fix this sparse warnings:
  arch/x86/kvm/lapic.c:916:22: warning: symbol 'lapic_timer_ops' was not declared. Should it be static?
  arch/x86/kvm/i8254.c:268:22: warning: symbol 'kpit_ops' was not declared. Should it be static?

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:29 +03:00
Gleb Natapov
58c2dde17d KVM: APIC: get rid of deliver_bitmask
Deliver interrupt during destination matching loop.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-06-10 11:48:27 +03:00
Gleb Natapov
e1035715ef KVM: change the way how lowest priority vcpu is calculated
The new way does not require additional loop over vcpus to calculate
the one with lowest priority as one is chosen during delivery bitmap
construction.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-06-10 11:48:27 +03:00
Gleb Natapov
343f94fe4d KVM: consolidate ioapic/ipi interrupt delivery logic
Use kvm_apic_match_dest() in kvm_get_intr_delivery_bitmask() instead
of duplicating the same code. Use kvm_get_intr_delivery_bitmask() in
apic_send_ipi() to figure out ipi destination instead of reimplementing
the logic.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-06-10 11:48:27 +03:00
Gleb Natapov
6da7e3f643 KVM: APIC: kvm_apic_set_irq deliver all kinds of interrupts
Get rid of ioapic_inj_irq() and ioapic_inj_nmi() functions.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-06-10 11:48:26 +03:00
Joerg Roedel
f5a1e9f895 KVM: MMU: remove call to kvm_mmu_pte_write from walk_addr
There is no reason to update the shadow pte here because the guest pte
is only changed to dirty state.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-06-10 11:48:26 +03:00
Marcelo Tosatti
d3c7b77d1a KVM: unify part of generic timer handling
Hide the internals of vcpu awakening / injection from the in-kernel
emulated timers. This makes future changes in this logic easier and
decreases the distance to more generic timer handling.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:25 +03:00
Marcelo Tosatti
fd66842370 KVM: PIT: remove usage of count_load_time for channel 0
We can infer elapsed time from hrtimer_expires_remaining.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:25 +03:00
Marcelo Tosatti
5a05d54554 KVM: PIT: remove unused scheduled variable
Unused.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:25 +03:00
Matt T. Yourst
2dea4c84bc KVM: x86: silence preempt warning on kvm_write_guest_time
This issue just appeared in kvm-84 when running on 2.6.28.7 (x86-64)
with PREEMPT enabled.

We're getting syslog warnings like this many (but not all) times qemu
tells KVM to run the VCPU:

BUG: using smp_processor_id() in preemptible [00000000] code:
qemu-system-x86/28938
caller is kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm]
Pid: 28938, comm: qemu-system-x86 2.6.28.7-mtyrel-64bit
Call Trace:
debug_smp_processor_id+0xf7/0x100
kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm]
? __wake_up+0x4e/0x70
? wake_futex+0x27/0x40
kvm_vcpu_ioctl+0x2e9/0x5a0 [kvm]
enqueue_hrtimer+0x8a/0x110
_spin_unlock_irqrestore+0x27/0x50
vfs_ioctl+0x31/0xa0
do_vfs_ioctl+0x74/0x480
sys_futex+0xb4/0x140
sys_ioctl+0x99/0xa0
system_call_fastpath+0x16/0x1b

As it turns out, the call trace is messed up due to gcc's inlining, but
I isolated the problem anyway: kvm_write_guest_time() is being used in a
non-thread-safe manner on preemptable kernels.

Basically kvm_write_guest_time()'s body needs to be surrounded by
preempt_disable() and preempt_enable(), since the kernel won't let us
query any per-CPU data (indirectly using smp_processor_id()) without
preemption disabled. The attached patch fixes this issue by disabling
preemption inside kvm_write_guest_time().

[marcelo: surround only __get_cpu_var calls since the warning
is harmless]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:24 +03:00
Sheng Yang
bfd349d073 KVM: bit ops for deliver_bitmap
It's also convenient when we extend KVM supported vcpu number in the future.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:22 +03:00
Sheng Yang
110c2faeba KVM: Update intr delivery func to accept unsigned long* bitmap
Would be used with bit ops, and would be easily extended if KVM_MAX_VCPUS is
increased.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:22 +03:00
Avi Kivity
5897297bc2 KVM: VMX: Don't intercept MSR_KERNEL_GS_BASE
Windows 2008 accesses this MSR often on context switch intensive workloads;
since we run in guest context with the guest MSR value loaded (so swapgs can
work correctly), we can simply disable interception of rdmsr/wrmsr for this
MSR.

A complication occurs since in legacy mode, we run with the host MSR value
loaded. In this case we enable interception.  This means we need two MSR
bitmaps, one for legacy mode and one for long mode.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:21 +03:00
Avi Kivity
3e7c73e9b1 KVM: VMX: Don't use highmem pages for the msr and pio bitmaps
Highmem pages are a pain, and saving three lowmem pages on i386 isn't worth
the extra code.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-06-10 11:48:21 +03:00
Avi Kivity
a2edf57f51 KVM: Fix PDPTR reloading on CR4 writes
The processor is documented to reload the PDPTRs while in PAE mode if any
of the CR4 bits PSE, PGE, or PAE change.  Linux relies on this
behaviour when zapping the low mappings of PAE kernels during boot.

The code already handled changes to CR4.PAE; augment it to also notice changes
to PSE and PGE.

This triggered while booting an F11 PAE kernel; the futex initialization code
runs before any CR3 reloads and writes to a NULL pointer; the futex subsystem
ended up uninitialized, killing PI futexes and pulseaudio which uses them.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-05-25 20:00:53 +03:00
Avi Kivity
a8cd0244e9 KVM: Make paravirt tlb flush also reload the PAE PDPTRs
The paravirt tlb flush may be used not only to flush TLBs, but also
to reload the four page-directory-pointer-table entries, as it is used
as a replacement for reloading CR3.  Change the code to do the entire
CR3 reloading dance instead of simply flushing the TLB.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-05-25 20:00:50 +03:00
Avi Kivity
99f85a28a7 KVM: SVM: Remove port 80 passthrough
KVM optimizes guest port 80 accesses by passthing them through to the host.
Some AMD machines die on port 80 writes, allowing the guest to hard-lock the
host.

Remove the port passthrough to avoid the problem.

Cc: stable@kernel.org
Reported-by: Piotr Jaroszyński <p.jaroszynski@gmail.com>
Tested-by: Piotr Jaroszyński <p.jaroszynski@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-05-11 14:40:51 +03:00
Avi Kivity
e286e86e6d KVM: Make EFER reads safe when EFER does not exist
Some processors don't have EFER; don't oops if userspace wants us to
read EFER when we check NX.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-05-11 11:19:00 +03:00
Avi Kivity
334b8ad7b1 KVM: Fix NX support reporting
NX support is bit 20, not bit 1.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-05-11 11:18:48 +03:00
Andre Przywara
19bca6ab75 KVM: SVM: Fix cross vendor migration issue with unusable bit
AMDs VMCB does not have an explicit unusable segment descriptor field,
so we emulate it by using "not present". This has to be setup before
the fixups, because this field is used there.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-05-11 11:18:04 +03:00
Jan Kiszka
888d256e9c KVM: Unregister cpufreq notifier on unload
Properly unregister cpufreq notifier on onload if it was registered
during init.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-04-22 13:54:33 +03:00
Joerg Roedel
7f1ea20896 KVM: x86: release time_page on vcpu destruction
Not releasing the time_page causes a leak of that page or the compound
page it is situated in.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-04-22 13:52:10 +03:00
Marcelo Tosatti
bf47a760f6 KVM: MMU: disable global page optimization
Complexity to fix it not worthwhile the gains, as discussed
in http://article.gmane.org/gmane.comp.emulators.kvm.devel/28649.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-04-22 13:52:09 +03:00
Ingo Molnar
8302294f43 Merge branch 'tracing/core-v2' into tracing-for-linus
Conflicts:
	include/linux/slub_def.h
	lib/Kconfig.debug
	mm/slob.c
	mm/slub.c
2009-04-02 00:49:02 +02:00
Avi Kivity
16175a796d KVM: VMX: Don't allow uninhibited access to EFER on i386
vmx_set_msr() does not allow i386 guests to touch EFER, but they can still
do so through the default: label in the switch.  If they set EFER_LME, they
can oops the host.

Fix by having EFER access through the normal channel (which will check for
EFER_LME) even on i386.

Reported-and-tested-by: Benjamin Gilbert <bgilbert@cs.cmu.edu>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:15 +02:00
Andrea Arcangeli
4539b35881 KVM: Fix missing smp tlb flush in invlpg
When kvm emulates an invlpg instruction, it can drop a shadow pte, but
leaves the guest tlbs intact.  This can cause memory corruption when
swapping out.

Without this the other cpu can still write to a freed host physical page.
tlb smp flush must happen if rmap_remove is called always before mmu_lock
is released because the VM will take the mmu_lock before it can finally add
the page to the freelist after swapout. mmu notifier makes it safe to flush
the tlb after freeing the page (otherwise it would never be safe) so we can do
a single flush for multiple sptes invalidated.

Cc: stable@kernel.org
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:14 +02:00
Hannes Eder
cded19f396 KVM: fix sparse warnings: Should it be static?
Impact: Make symbols static.

Fix this sparse warnings:
  arch/x86/kvm/mmu.c:992:5: warning: symbol 'mmu_pages_add' was not declared. Should it be static?
  arch/x86/kvm/mmu.c:1124:5: warning: symbol 'mmu_pages_next' was not declared. Should it be static?
  arch/x86/kvm/mmu.c:1144:6: warning: symbol 'mmu_pages_clear_parents' was not declared. Should it be static?
  arch/x86/kvm/x86.c:2037:5: warning: symbol 'kvm_read_guest_virt' was not declared. Should it be static?
  arch/x86/kvm/x86.c:2067:5: warning: symbol 'kvm_write_guest_virt' was not declared. Should it be static?
  virt/kvm/irq_comm.c:220:5: warning: symbol 'setup_routing_entry' was not declared. Should it be static?

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:14 +02:00
Hannes Eder
d7364a29b3 KVM: fix sparse warnings: context imbalance
Impact: Attribute function with __acquires(...) resp. __releases(...).

Fix this sparse warnings:
  arch/x86/kvm/i8259.c:34:13: warning: context imbalance in 'pic_lock' - wrong count at exit
  arch/x86/kvm/i8259.c:39:13: warning: context imbalance in 'pic_unlock' - unexpected unlock

Signed-off-by: Hannes Eder <hannes@hanneseder.net>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:13 +02:00
Amit Shah
41d6af1192 KVM: is_long_mode() should check for EFER.LMA
is_long_mode currently checks the LongModeEnable bit in
EFER instead of the LongModeActive bit. This is wrong, but
we survived this till now since it wasn't triggered. This
breaks guests that go from long mode to compatibility mode.

This is noticed on a solaris guest and fixes bug #1842160

Signed-off-by: Amit Shah <amit.shah@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2009-03-24 11:03:13 +02:00
Amit Shah
401d10dee0 KVM: VMX: Update necessary state when guest enters long mode
setup_msrs() should be called when entering long mode to save the
shadow state for the 64-bit guest state.

Using vmx_set_efer() in enter_lmode() removes some duplicated code
and also ensures we call setup_msrs(). We can safely pass the value
of shadow_efer to vmx_set_efer() as no other bits in the efer change
while enabling long mode (guest first sets EFER.LME, then sets CR0.PG
which causes a vmexit where we activate long mode).

With this fix, is_long_mode() can check for EFER.LMA set instead of
EFER.LME and 5e23049e86dd298b72e206b420513dbc3a240cd9 can be reverted.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:13 +02:00
Joerg Roedel
c5bc224240 KVM: MMU: Fix another largepage memory leak
In the paging_fetch function rmap_remove is called after setting a large
pte to non-present. This causes rmap_remove to not drop the reference to
the large page. The result is a memory leak of that page.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:11 +02:00
Andre Przywara
1fbdc7a585 KVM: SVM: set accessed bit for VMCB segment selectors
In the segment descriptor _cache_ the accessed bit is always set
(although it can be cleared in the descriptor itself). Since Intel
checks for this condition on a VMENTRY, set this bit in the AMD path
to enable cross vendor migration.

Cc: stable@kernel.org
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Acked-By: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:11 +02:00
Gleb Natapov
4925663a07 KVM: Report IRQ injection status to userspace.
IRQ injection status is either -1 (if there was no CPU found
that should except the interrupt because IRQ was masked or
ioapic was misconfigured or ...) or >= 0 in that case the
number indicates to how many CPUs interrupt was injected.
If the value is 0 it means that the interrupt was coalesced
and probably should be reinjected.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:11 +02:00
Joerg Roedel
452425dbaa KVM: MMU: remove assertion in kvm_mmu_alloc_page
The assertion no longer makes sense since we don't clear page tables on
allocation; instead we clear them during prefetch.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:10 +02:00
Joerg Roedel
6bed6b9e84 KVM: MMU: remove redundant check in mmu_set_spte
The following code flow is unnecessary:

	if (largepage)
		was_rmapped = is_large_pte(*shadow_pte);
	 else
	 	was_rmapped = 1;

The is_large_pte() function will always evaluate to one here because the
(largepage && !is_large_pte) case is already handled in the first
if-clause. So we can remove this check and set was_rmapped to one always
here.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:10 +02:00
Gerd Hoffmann
c807660407 KVM: Fix kvmclock on !constant_tsc boxes
kvmclock currently falls apart on machines without constant tsc.
This patch fixes it.  Changes:

  * keep tsc frequency in a per-cpu variable.
  * handle kvmclock update using a new request flag, thus checking
    whenever we need an update each time we enter guest context.
  * use a cpufreq notifier to track frequency changes and force
    kvmclock updates.
  * send ipis to kick cpu out of guest context if needed to make
    sure the guest doesn't see stale values.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:09 +02:00
Sheng Yang
49cd7d2238 KVM: VMX: Use kvm_mmu_page_fault() handle EPT violation mmio
Removed duplicated code.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:09 +02:00
Jan Kiszka
34c33d163f KVM: Drop unused evaluations from string pio handlers
Looks like neither the direction nor the rep prefix are used anymore.
Drop related evaluations from SVM's and VMX's I/O exit handlers.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:08 +02:00
Alexander Graf
1b2fd70c4e KVM: Add FFXSR support
AMD K10 CPUs implement the FFXSR feature that gets enabled using
EFER. Let's check if the virtual CPU description includes that
CPUID feature bit and allow enabling it then.

This is required for Windows Server 2008 in Hyper-V mode.

v2 adds CPUID capability exposure

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:08 +02:00
Marcelo Tosatti
44882eed2e KVM: make irq ack notifications aware of routing table
IRQ ack notifications assume an identity mapping between pin->gsi,
which might not be the case with, for example, HPET.

Translate before acking.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Gleb Natapov <gleb@redhat.com>
2009-03-24 11:03:08 +02:00
Avi Kivity
399ec807dd KVM: Userspace controlled irq routing
Currently KVM has a static routing from GSI numbers to interrupts (namely,
0-15 are mapped 1:1 to both PIC and IOAPIC, and 16:23 are mapped 1:1 to
the IOAPIC).  This is insufficient for several reasons:

- HPET requires non 1:1 mapping for the timer interrupt
- MSIs need a new method to assign interrupt numbers and dispatch them
- ACPI APIC mode needs to be able to reassign the PCI LINK interrupts to the
  ioapics

This patch implements an interrupt routing table (as a linked list, but this
can be easily changed) and a userspace interface to replace the table.  The
routing table is initialized according to the current hardwired mapping.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:06 +02:00
Amit Shah
1935547504 KVM: x86: Fix typos and whitespace errors
Some typos, comments, whitespace errors corrected in the cpuid code

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:05 +02:00
Avi Kivity
5a41accd3f KVM: MMU: Only enable cr4_pge role in shadow mode
Two dimensional paging is only confused by it.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:04 +02:00
Avi Kivity
f6e2c02b6d KVM: MMU: Rename "metaphysical" attribute to "direct"
This actually describes what is going on, rather than alerting the reader
that something strange is going on.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:04 +02:00
Marcelo Tosatti
9903a927a4 KVM: MMU: drop zeroing on mmu_memory_cache_alloc
Zeroing on mmu_memory_cache_alloc is unnecessary since:

- Smaller areas are pre-allocated with kmem_cache_zalloc.
- Page pointed by ->spt is overwritten with prefetch_page
  and entries in page pointed by ->gfns are initialized
  before reading.

[avi: zeroing pages is unnecessary]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:04 +02:00
Joe Perches
ff81ff10b4 KVM: SVM: Fix typo in has_svm()
Signed-off-by: Joe Perches <joe@perches.com>
Acked-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:04 +02:00
Avi Kivity
4780c65904 KVM: Reset PIT irq injection logic when the PIT IRQ is unmasked
While the PIT is masked the guest cannot ack the irq, so the reinject logic
will never allow the interrupt to be injected.

Fix by resetting the reinjection counters on unmask.

Unbreaks Xen.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:03 +02:00
Avi Kivity
5d9b8e30f5 KVM: Add CONFIG_HAVE_KVM_IRQCHIP
Two KVM archs support irqchips and two don't.  Add a Kconfig item to
make selecting between the two models easier.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:02 +02:00
Avi Kivity
4677a3b693 KVM: MMU: Optimize page unshadowing
Using kvm_mmu_lookup_page() will result in multiple scans of the hash chains;
use hlist_for_each_entry_safe() to achieve a single scan instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:02 +02:00
Alexander Graf
c8a73f186b KVM: SVM: Add microcode patch level dummy
VMware ESX checks if the microcode level is correct when using a barcelona
CPU, in order to see if it actually can use SVM. Let's tell it we're on the
safe side...

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-03-24 11:03:02 +02:00