otherwise Xen is _completely_ unusable with 5 or more VCPUs.
(when !CONFIG_HAVE_SPARSE_IRQ).
based on Alex Nixon's patch.
also add +1 offset after redir_entries
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Acked-by: Alex Nixon <alex.nixon@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Missed two lines when copying.
Fix panic on one of Ingo's machines that need to adjust ioapic id when
acpi off/ 32bit.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Suresh Siddha noticed that we should have a spinlock around it.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix warning:
arch/x86/kernel/io_apic.c: In function ‘print_local_APIC’:
arch/x86/kernel/io_apic.c:1786: warning: format ‘%08x’ expects type ‘unsigned int’, but argument 2 has type ‘u64’
arch/x86/kernel/io_apic.c:1787: warning: format ‘%08x’ expects type ‘unsigned int’, but argument 2 has type ‘u64’
By creating uniform behavior on 32-bit and 64-bit and printing out the ICR
value in two 32-bit words.
Code has changed:
text data bss dec hex filename
22901 19650 17040 59591 e8c7 io_apic.o.before
22899 19650 17040 59589 e8c5 io_apic.o.after
Due to the 32-bit cast narrowing the printed out value on 64-bit.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Following commit 9c3f2468d8339866d9ef6a25aae31a8909c6be0d, do_IRQ()
looks up the IRQ number in the per-cpu variable vector_irq.
This commit makes Xen initialise an identity vector_irq map for both X86_32 and X86_64.
Signed-off-by: Alex Nixon <alex.nixon@citrix.com>
Acked-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
for !CONFIG_HAVE_SPARSE_IRQ
fix:
In file included from arch/x86/kernel/early-quirks.c:18:
include/asm/io_apic.h: In function 'probe_nr_irqs':
include/asm/io_apic.h:209: error: 'NR_IRQS' undeclared (first use in this function)
include/asm/io_apic.h:209: error: (Each undeclared identifier is reported only once
include/asm/io_apic.h:209: error: for each function it appears in.)
v2: fix by Ingo
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Ingo said sparse_irq is some intrusive. need to make it selectable
to make it simple, remove irq_desc as parameter in some functions.
(ack, eoi, set_affinity).
may need to make member if irq_chip to take irq_desc, or struct irq later.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
use code in 64 to replace
move_native_irq(irq, desc);
in 32 bit
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
all the same except INTR_REMAPPING related and ioapic io resource.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
do we have 64bit system with sis chipset?
[ mingo@elte.hu: nope, the problem chipset was 32-bit only.
The code symmetry is good nevertheless. ]
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
also make no_timer_check to be global on 64 bit, because vmi_32 is using that.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
move first_system_vector to apic_64.c.
also add #ifdef CONFIG_INTR_REMAP to prepare 32 bit to use
same file.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
try to make functions have the same order between 32-bit and 64-bit.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
prepare for unification:
try to make functions be of the same order to io_apic_64.c.
v2: add calling setup_msi_irq back to arch_setup_msi_irq
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
let user decide the meaning of the bits.
This unifies the 32-bit and 64-bit io-apic code a bit.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so could figure out bugs where we get an interrupt, but vector_irq is
not initialized yet.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so we can merge io_apic_32.c and io_apic_64.c
v2: Use cpu_online_map as target cpus for bigsmp, just like 64-bit is doing.
Also remove some unused TARGET_CPUS macro.
v3: need to check if desc is null in smp_irq_move_cleanup
also migration needs to reset vector too, so copy __target_IO_APIC_irq
from 64bit.
(the duplication will go away once the two files are unified.)
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
but actually irq still needs to be less than NR_IRQS, because
interrupt[NR_IRQS] in entry.S.
need to enable per_cpu vector...
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
This has been deprecated for years, the user space irqbalanced utility
works better with numa, has configurable policies, etc...
Signed-off-by: Yinghai Lu <yhlu.kernel@gmai.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
also print out irq no in /proc/interrups and /proc/stat in hex, so could
tell bus/dev/func.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
change names:
irq_desc() ==> irq_desc_alloc
__irq_desc() ==> irq_desc
Also split a few of the uses in lowlevel x86 code.
v2: need to check if desc is null in smp_irq_move_cleanup
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
So we could remove some duplicated calling to irq_desc
v2: make sure irq_desc in init/main.c is not used without generic_hardirqs
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
remove irq limit checks - nr_irqs is dynamic and we expand anytime.
v2: fix checking about result irq_cfg_without_new, so could use msi again
v3: use irq_desc_without_new to check irq is valid
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
There are a handful of loops that go from 0 to nr_irqs and use
get_irq_desc() on them. These would allocate all the irq_desc
entries, regardless of the need for them.
Use the smarter for_each_irq_desc() iterator that will only iterate
over the present ones.
v2: make sure arch without GENERIC_HARDIRQS work too
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
based on Eric's patch ...
together mold it with dyn_array for irq_desc, will allcate kstat_irqs for
nr_irq_desc alltogether if needed. -- at that point nr_cpus is known already.
v2: make sure system without generic_hardirqs works they don't have irq_desc
v3: fix merging
v4: [mingo@elte.hu] fix typo
[ mingo@elte.hu ] irq: build fix
fix:
arch/x86/xen/spinlock.c: In function 'xen_spin_lock_slow':
arch/x86/xen/spinlock.c:90: error: 'struct kernel_stat' has no member named 'irqs'
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
preallocate 32 irq_2_pin entries, and use get_one_free_irq_2_pin() to get
one more and link to irq_cfg if needed.
so don't waste one where no irq is enabled.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
preallocate size is 32, and if it is not enough, irq_cfg will more
via alloc_bootmem() or kzalloc(). (depending on how early we are in
system setup)
v2: fix typo about size of init_one_irq_cfg ... should use sizeof(struct irq_cfg)
v3: according to Eric, change get_irq_cfg() to irq_cfg()
v4: squash add irq_cfg_alloc in
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
add CONFIG_HAVE_SPARSE_IRQ to for use condensed array.
Get rid of irq_desc[] array assumptions.
Preallocate 32 irq_desc, and irq_desc() will try to get more.
( No change in functionality is expected anywhere, except the odd build
failure where we missed a code site or where a crossing commit itroduces
new irq_desc[] usage. )
v2: according to Eric, change get_irq_desc() to irq_desc()
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Until now, NR_IRQS was derived from black magic defines that had to
be "large enough" to both accomodate NR_CPUS and MAX_NR_IO_APICs.
This resulted in a way too large irq_desc[] array on most x86 systems.
Especially with larger CPU masks, the size of irq_desc can spiral out
of control quickly.
So be smarter about it and use precise allocation instead: determine the
default maximum possible IRQ number from the ACPI MADT. Use a minimum limit
of at least 32 IRQs for broken BIOSes.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
fix non-APIC UP build:
arch/x86/kernel/built-in.o: In function `setup_arch':
: undefined reference to `pin_map_size'
arch/x86/kernel/built-in.o: In function `setup_arch':
: undefined reference to `first_free_entry'
Signed-off-by: Ingo Molnar <mingo@elte.hu>
also add first_free_entry and pin_map_size, which were NR_IRQS derived
constants.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
so could spare some memory with small alignment in bootmem
also tighten the alignment checking, and make print out less debug info.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Moving irq ack notification logic as common, and make
it shared with ia64 side.
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
To share with other archs, this patch moves device assignment
logic to common parts.
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM-x86 dumps a lot of debug messages that have no meaning for normal
operation:
- INIT de-assertion is ignored
- SIPIs are sent and received
- APIC writes are unaligned or < 4 byte long
(Windows Server 2003 triggers this on SMP)
Degrade them to true debug messages, keeping the host kernel log clean
for real problems.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Assigned device could DMA to mmio pages, so also need to map mmio pages
into VT-d page table.
Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
The PIC code makes little effort to avoid kvm_vcpu_kick(), resulting in
unnecessary guest exits in some conditions.
For example, if the timer interrupt is routed through the IOAPIC, IRR
for IRQ 0 will get set but not cleared, since the APIC is handling the
acks.
This means that everytime an interrupt < 16 is triggered, the priority
logic will find IRQ0 pending and send an IPI to vcpu0 (in case IRQ0 is
not masked, which is Linux's case).
Introduce a new variable isr_ack to represent the IRQ's for which the
guest has been signalled / cleared the ISR. Use it to avoid more than
one IPI per trigger-ack cycle, in addition to the avoidance when ISR is
set in get_priority().
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Cache the unsynced children information in a per-page bitmap.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Allow guest pagetables to go out of sync. Instead of emulating write
accesses to guest pagetables, or unshadowing them, we un-write-protect
the page table and allow the guest to modify it at will. We rely on
invlpg executions to synchronize individual ptes, and will synchronize
the entire pagetable on tlb flushes.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Need to convert shadow_notrap_nonpresent -> shadow_trap_nonpresent when
unsyncing pages.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
kvm_mmu_zap_page will soon zap the unsynced children of a page. Restart
list walk in such case.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Introduce a function to walk all parents of a given page, invoking a handler.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
With pages out of sync invlpg needs to be trapped. For now simply nuke
the entry.
Untested on AMD.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Examine guest pagetable and bring the shadow back in sync. Caller is responsible
for local TLB flush before re-entering guest mode.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
There is not much point in write protecting large mappings. This
can only happen when a page is shadowed during the window between
is_largepage_backed and mmu_lock acquision. Zap the entry instead, so
the next pagefault will find a shadowed page via is_largepage_backed and
fallback to 4k translations.
Simplifies out of sync shadow.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Split the spte entry creation code into a new set_spte function.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
It is necessary to flush all TLB's when a large spte entry is
overwritten with a normal page directory pointer.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
arch/x86/kernel/pvclock.c:102:6: warning: symbol 'tsc_khz' shadows an earlier one
include/asm/tsc.h:18:21: originally declared here
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
The vcpu should process pending SIPI message before entering guest mode again.
kvm_arch_vcpu_runnable() returns true if the vcpu is in SIPI state, so
we can't call it here.
Signed-off-by: Gleb Natapov <gleb@qumranet.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Noticed by sparse:
arch/x86/kvm/x86.c:3591:5: warning: symbol 'kvm_load_realmode_segment' was not declared. Should it be static?
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Convert gfn_to_pfn to use get_user_pages_fast, which can do lockless
pagetable lookups on x86. Kernel compilation on 4-way guest is 3.7%
faster on VMX.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
When an IRQ allocation fails, we free up the device structures and
disable the device so that we can unregister the device in the
userspace and not expose it to the guest at all.
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Based on a patch by: Kay, Allen M <allen.m.kay@intel.com>
This patch enables PCI device assignment based on VT-d support.
When a device is assigned to the guest, the guest memory is pinned and
the mapping is updated in the VT-d IOMMU.
[Amit: Expose KVM_CAP_IOMMU so we can check if an IOMMU is present
and also control enable/disable from userspace]
Signed-off-by: Kay, Allen M <allen.m.kay@intel.com>
Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
Signed-off-by: Amit Shah <amit.shah@qumranet.com>
Acked-by: Mark Gross <mgross@linux.intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
For instruction 'and al,imm' we use DstAcc instead of doing
the emulation directly into the instruction's opcode.
Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Add decode entries for these opcodes; execution is already implemented.
Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Add DstAcc operand type. That means that there are 4 bits now for
DstMask.
"In the good old days cpus would have only one register that was able to
fully participate in arithmetic operations, typically called A for
Accumulator. The x86 retains this tradition by having special, shorter
encodings for the A register (like the cmp opcode), and even some
instructions that only operate on A (like mul).
SrcAcc and DstAcc would accommodate these instructions by decoding A
into the corresponding 'struct operand'."
-- Avi Kivity
Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Avi Kivity <avi@qumranet.com>
jmp r/m64 doesn't require the rex.w prefix to indicate the operand size
is 64 bits. Set the Stack attribute (even though it doesn't involve the
stack, really) to indicate this.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Commit 1c0f4f5011829dac96347b5f84ba37c2252e1e08 left a useless access
of VM_ENTRY_INTR_INFO_FIELD in vmx_intr_assist behind. Clean this up.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Since "KVM: x86: do not execute halted vcpus", HLT by vcpu0 before system
reset by the IO thread will hang the guest.
Mark vcpu as runnable in such case.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Offline or uninitialized vcpu's can be executed if requested to perform
userspace work.
Follow Avi's suggestion to handle halted vcpu's in the main loop,
simplifying kvm_emulate_halt(). Introduce a new vcpu->requests bit to
indicate events that promote state from halted to running.
Also standardize vcpu wake sites.
Signed-off-by: Marcelo Tosatti <mtosatti <at> redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The patch adds in/out instructions to the x86 emulator.
The instruction was encountered while running the BIOS while using
the invalid guest state emulation patch.
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
EPT is 4 level by default in 32pae(48 bits), but the addr parameter
of kvm_shadow_walk->entry() only accept unsigned long as virtual
address, which is 32bit in 32pae. This result in SHADOW_PT_INDEX()
overflow when try to fetch level 4 index.
Fix it by extend kvm_shadow_walk->entry() to accept 64bit addr in
parameter.
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This adds the std and cld instructions to the emulator.
Encountered while running the BIOS with invalid guest
state emulation enabled.
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Currently KVM implements MC0-MC4_MISC read support. When booting Linux this
results in KVM warnings in the kernel log when the guest tries to read
MC5_MISC. Fix this warnings with this patch.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The accessed bit was accidentally turned on in a random flag word, rather
than, the spte itself, which was lucky, since it used the non-EPT compatible
PT_ACCESSED_MASK.
Fix by turning the bit on in the spte and changing it to use the portable
accessed mask.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Otherwise, the cpu may allow writes to the tracked pages, and we lose
some display bits or fail to migrate correctly.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The emulator only supported one instance of mov r, imm instruction
(opcode 0xb8), this adds the rest of these instructions.
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
We currently walk the shadow page tables in two places: direct map (for
real mode and two dimensional paging) and paging mode shadow. Since we
anticipate requiring a third walk (for invlpg), it makes sense to have
a generic facility for shadow walk.
This patch adds such a shadow walker, walks the page tables and calls a
method for every spte encountered. The method can examine the spte,
modify it, or even instantiate it. The walk can be aborted by returning
nonzero from the method.
Signed-off-by: Avi Kivity <avi@qumranet.com>
In all cases the shadow root level is available in mmu.shadow_root_level,
so there is no need to pass it as a parameter.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The two paths are equivalent except for one argument, which is already
available. Merge the two codepaths.
Signed-off-by: Avi Kivity <avi@qumranet.com>
sparse says:
arch/x86/kvm/x86.c:107:32: warning: symbol 'kvm_find_assigned_dev' was not declared. Should it be static?
arch/x86/kvm/i8254.c:225:6: warning: symbol 'kvm_pit_ack_irq' was not declared. Should it be static?
Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch modifies mode switching and vmentry function in order to
drive invalid guest state emulation.
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This adds the invalid guest state handler function which invokes the x86
emulator until getting the guest to a VMX-friendly state.
[avi: leave atomic context if scheduling]
[guillaume: return to atomic context correctly]
Signed-off-by: Laurent Vivier <laurent.vivier@bull.net>
Signed-off-by: Guillaume Thouvenin <guillaume.thouvenin@ext.bull.net>
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The patch adds the module parameter required to enable emulating invalid
guest state, as well as the emulation_required flag used to drive
emulation whenever needed.
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This patch adds functions to check whether guest state is VMX compliant.
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Even though we don't share irqs at the moment, we should ensure
regular user processes don't try to allocate system resources.
We check for capability to access IO devices (CAP_SYS_RAWIO) before
we request_irq on behalf of the guest.
Noticed by Avi.
Signed-off-by: Amit Shah <amit.shah@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Spurious acks can be generated, for example if the PIC is being reset.
Handle those acks gracefully rather than flooding the log with warnings.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The irq ack during pic reset has three problems:
- Ignores slave/master PIC, using gsi 0-8 for both.
- Generates an ACK even if the APIC is in control.
- Depends upon IMR being clear, which is broken if the irq was masked
at the time it was generated.
The last one causes the BIOS to hang after the first reboot of
Windows installation, since PIT interrupts stop.
[avi: fix check whether pic interrupts are seen by cpu]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The vcpu thread can be preempted after the guest_debug_pre() callback,
resulting in invalid debug registers on the new vcpu.
Move it inside the non-preemptable section.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
We're in a hot path. We can't use kmalloc() because
it might impact performance. So, we just stick the buffer that
we need into the kvm_vcpu_arch structure. This is used very
often, so it is not really a waste.
We also have to move the buffer structure's definition to the
arch-specific x86 kvm header.
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
[sheng: fix KVM_GET_LAPIC using wrong size]
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
On my machine with gcc 3.4, kvm uses ~2k of stack in a few
select functions. This is mostly because gcc fails to
notice that the different case: statements could have their
stack usage combined. It overflows very nicely if interrupts
happen during one of these large uses.
This patch uses two methods for reducing stack usage.
1. dynamically allocate large objects instead of putting
on the stack.
2. Use a union{} member for all of the case variables. This
tricks gcc into combining them all into a single stack
allocation. (There's also a comment on this)
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Based on a patch from: Amit Shah <amit.shah@qumranet.com>
This patch adds support for handling PCI devices that are assigned to
the guest.
The device to be assigned to the guest is registered in the host kernel
and interrupt delivery is handled. If a device is already assigned, or
the device driver for it is still loaded on the host, the device
assignment is failed by conveying a -EBUSY reply to the userspace.
Devices that share their interrupt line are not supported at the moment.
By itself, this patch will not make devices work within the guest.
The VT-d extension is required to enable the device to perform DMA.
Another alternative is PVDMA.
Signed-off-by: Amit Shah <amit.shah@qumranet.com>
Signed-off-by: Ben-Ami Yassour <benami@il.ibm.com>
Signed-off-by: Weidong Han <weidong.han@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
We're currently facing timing problems in guests that do
calibration under heavy load, and then the load vanishes.
This means we'll have a much lower lpj than we actually should,
and delays end up taking less time than they should, which is a
nasty bug.
Solution is to pass on the lpj value from host to guest, and have it
preset.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
KVM intends to use paravirt code to calibrate khz. Xen
current code will do just fine. So as a first step, factor out
code to pvclock.c.
Signed-off-by: Glauber Costa <gcosta@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
The PIT injection logic is problematic under the following cases:
1) If there is a higher priority vector to be delivered by the time
kvm_pit_timer_intr_post is invoked ps->inject_pending won't be set.
This opens the possibility for missing many PIT event injections (say if
guest executes hlt at this point).
2) ps->inject_pending is racy with more than two vcpus. Since there's no locking
around read/dec of pt->pending, two vcpu's can inject two interrupts for a single
pt->pending count.
Fix 1 by using an irq ack notifier: only reinject when the previous irq
has been acked. Fix 2 with appropriate locking around manipulation of
pending count and irq_ack by the injection / ack paths.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Based on a patch from: Ben-Ami Yassour <benami@il.ibm.com>
which was based on a patch from: Amit Shah <amit.shah@qumranet.com>
Notify IRQ acking on PIC/APIC emulation. The previous patch missed two things:
- Edge triggered interrupts on IOAPIC
- PIC reset with IRR/ISR set should be equivalent to ack (LAPIC probably
needs something similar).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
CC: Amit Shah <amit.shah@qumranet.com>
CC: Ben-Ami Yassour <benami@il.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This can be used by kvm subsystems that are interested in when
interrupts are acked, for example time drift compensation.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Netware writes to DEBUGCTL and reads from the DEBUGCTL and LAST*IP MSRs
without further checks and is really confused to receive a #GP during that.
To make it happy we should just make them stubs, which is exactly what SVM
already does.
Writes to DEBUGCTL that are vendor-specific are resembled to behave as if the
virtual CPU does not know them.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Usually HOST_RSP retains its value across guest entries. Take advantage
of this and avoid a vmwrite() when this is so.
Signed-off-by: Avi Kivity <avi@qumranet.com>
As we execute real mode guests in VM86 mode, exception have to be
reinjected appropriately when the guest triggered them. For this purpose
the patch adopts the real-mode injection pattern used in vmx_inject_irq
to vmx_queue_exception, additionally taking care that the IP is set
correctly for #BP exceptions. Furthermore it extends
handle_rmode_exception to reinject all those exceptions that can be
raised in real mode.
This fixes the execution of himem.exe from FreeDOS and also makes its
debug.com work properly.
Note that guest debugging in real mode is broken now. This has to be
fixed by the scheduled debugging infrastructure rework (will be done
once base patches for QEMU have been accepted).
Signed-off-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Since checking for vcpu->arch.rmode.active is already done whenever we
call handle_rmode_exception(), checking it inside the function is redundant.
Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of looking at failed injections in the vm entry path, move
processing to the exit path in vmx_complete_interrupts(). This simplifes
the logic and removes any state that is hidden in vmx registers.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Similar to the exception queue, this hold interrupts that have been
accepted by the virtual processor core but not yet injected.
Not yet used.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The vmx code assumes that IDT-Vectoring can only be set when an exception
is injected due to the exception in question. That's not true, however:
if the exception is injected correctly, and later another exception occurs
but its delivery is blocked due to a fault, then we will incorrectly assume
the first exception was not delivered.
Fix by unconditionally dequeuing the pending exception, and requeuing it
(or the second exception) if we see it in the IDT-Vectoring field.
Signed-off-by: Avi Kivity <avi@qumranet.com>
If we're emulating an instruction, either it will succeed, in which case
any previously queued exception will be spurious, or we will requeue the
same exception.
Signed-off-by: Avi Kivity <avi@qumranet.com>
Instead of processing nmi injection failure in the vm entry path, move
it to the vm exit path (vm_complete_interrupts()). This separates nmi
injection from nmi post-processing, and moves the nmi state from the VT
state into vcpu state (new variable nmi_injected specifying an injection
in progress).
Signed-off-by: Avi Kivity <avi@qumranet.com>
Currently most interrupt exit processing is handled on the entry path,
which is confusing. Move the NMI IRET fault processing to a new function,
vmx_complete_interrupts(), which is called on the vmexit path.
Signed-off-by: Avi Kivity <avi@qumranet.com>
The twisty maze of conditionals can be reduced.
[joerg: fix tlb flushing]
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
This function injects an interrupt into the guest given the kvm struct,
the (guest) irq number and the interrupt level.
Signed-off-by: Amit Shah <amit.shah@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
As suggested by Avi, introduce accessors to read/write guest registers.
This simplifies the ->cache_regs/->decache_regs interface, and improves
register caching which is important for VMX, where the cost of
vmcs_read/vmcs_write is significant.
[avi: fix warnings]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>