Commit Graph

14227 Commits

Author SHA1 Message Date
Avi Kivity
9e31905f29 Merge remote-tracking branch 'tip/perf/core' into kvm-updates/3.3
* tip/perf/core: (66 commits)
  perf, x86: Expose perf capability to other modules
  perf, x86: Implement arch event mask as quirk
  x86, perf: Disable non available architectural events
  jump_label: Provide jump_label_key initializers
  jump_label, x86: Fix section mismatch
  perf, core: Rate limit perf_sched_events jump_label patching
  perf: Fix enable_on_exec for sibling events
  perf: Remove superfluous arguments
  perf, x86: Prefer fixed-purpose counters when scheduling
  perf, x86: Fix event scheduler for constraints with overlapping counters
  perf, x86: Implement event scheduler helper functions
  perf: Avoid a useless pmu_disable() in the perf-tick
  x86/tools: Add decoded instruction dump mode
  x86: Update instruction decoder to support new AVX formats
  x86/tools: Fix insn_sanity message outputs
  x86/tools: Fix instruction decoder message output
  x86: Fix instruction decoder to handle grouped AVX instructions
  x86/tools: Fix Makefile to build all test tools
  perf test: Soft errors shouldn't stop the "Validate PERF_RECORD_" test
  perf test: Validate PERF_RECORD_ events and perf_sample fields
  ...

Signed-off-by: Avi Kivity <avi@redhat.com>

* commit 'b3d9468a8bd218a695e3a0ff112cd4efd27b670a': (66 commits)
  perf, x86: Expose perf capability to other modules
  perf, x86: Implement arch event mask as quirk
  x86, perf: Disable non available architectural events
  jump_label: Provide jump_label_key initializers
  jump_label, x86: Fix section mismatch
  perf, core: Rate limit perf_sched_events jump_label patching
  perf: Fix enable_on_exec for sibling events
  perf: Remove superfluous arguments
  perf, x86: Prefer fixed-purpose counters when scheduling
  perf, x86: Fix event scheduler for constraints with overlapping counters
  perf, x86: Implement event scheduler helper functions
  perf: Avoid a useless pmu_disable() in the perf-tick
  x86/tools: Add decoded instruction dump mode
  x86: Update instruction decoder to support new AVX formats
  x86/tools: Fix insn_sanity message outputs
  x86/tools: Fix instruction decoder message output
  x86: Fix instruction decoder to handle grouped AVX instructions
  x86/tools: Fix Makefile to build all test tools
  perf test: Soft errors shouldn't stop the "Validate PERF_RECORD_" test
  perf test: Validate PERF_RECORD_ events and perf_sample fields
  ...
2011-12-27 11:22:24 +02:00
Sasha Levin
ff5c2c0316 KVM: Use memdup_user instead of kmalloc/copy_from_user
Switch to using memdup_user when possible. This makes code more
smaller and compact, and prevents errors.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:21 +02:00
Sasha Levin
cdfca7b346 KVM: Use kmemdup() instead of kmalloc/memcpy
Switch to kmemdup() in two places to shorten the code and avoid possible bugs.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:20 +02:00
Jan Kiszka
234b639206 KVM: x86 emulator: Remove set-but-unused cr4 from check_cr_write
This was probably copy&pasted from the cr0 case, but it's unneeded here.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:16 +02:00
Jan Kiszka
3d56cbdf35 KVM: MMU: Drop unused return value of kvm_mmu_remove_some_alloc_mmu_pages
freed_pages is never evaluated, so remove it as well as the return code
kvm_mmu_remove_some_alloc_mmu_pages so far delivered to its only user.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:15 +02:00
Alex,Shi
086c985501 KVM: use this_cpu_xxx replace percpu_xxx funcs
percpu_xxx funcs are duplicated with this_cpu_xxx funcs, so replace them
for further code clean up.

And in preempt safe scenario, __this_cpu_xxx funcs has a bit better
performance since __this_cpu_xxx has no redundant preempt_disable()

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:13 +02:00
Xiao Guangrong
e37fa7853c KVM: MMU: audit: inline audit function
inline audit function and little cleanup

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:12 +02:00
Xiao Guangrong
d750ea2886 KVM: MMU: remove oos_shadow parameter
The unsync code should be stable now, maybe it is the time to remove this
parameter to cleanup the code a little bit

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:10 +02:00
Xiao Guangrong
e459e3228d KVM: MMU: move the relevant mmu code to mmu.c
Move the mmu code in kvm_arch_vcpu_init() to kvm_mmu_create()

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:09 +02:00
Xiao Guangrong
9edb17d55f KVM: x86: remove the dead code of KVM_EXIT_HYPERCALL
KVM_EXIT_HYPERCALL is not used anymore, so remove the code

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:07 +02:00
Xiao Guangrong
0375f7fad9 KVM: MMU: audit: replace mmu audit tracepoint with jump-label
The tracepoint is only used to audit mmu code, it should not be exposed to
user, let us replace it with jump-label.

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:05 +02:00
Sasha Levin
831bf664e9 KVM: Refactor and simplify kvm_dev_ioctl_get_supported_cpuid
This patch cleans and simplifies kvm_dev_ioctl_get_supported_cpuid by using a table
instead of duplicating code as Avi suggested.

This patch also fixes a bug where kvm_dev_ioctl_get_supported_cpuid would return
-E2BIG when amount of entries passed was just right.

Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:02 +02:00
Liu, Jinsong
fb215366b3 KVM: expose latest Intel cpu new features (BMI1/BMI2/FMA/AVX2) to guest
Intel latest cpu add 6 new features, refer http://software.intel.com/file/36945
The new feature cpuid listed as below:

1. FMA		CPUID.EAX=01H:ECX.FMA[bit 12]
2. MOVBE	CPUID.EAX=01H:ECX.MOVBE[bit 22]
3. BMI1		CPUID.EAX=07H,ECX=0H:EBX.BMI1[bit 3]
4. AVX2		CPUID.EAX=07H,ECX=0H:EBX.AVX2[bit 5]
5. BMI2		CPUID.EAX=07H,ECX=0H:EBX.BMI2[bit 8]
6. LZCNT	CPUID.EAX=80000001H:ECX.LZCNT[bit 5]

This patch expose these features to guest.
Among them, FMA/MOVBE/LZCNT has already been defined, MOVBE/LZCNT has
already been exposed.

This patch defines BMI1/AVX2/BMI2, and exposes FMA/BMI1/AVX2/BMI2 to guest.

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:22:01 +02:00
Avi Kivity
00b27a3efb KVM: Move cpuid code to new file
The cpuid code has grown; put it into a separate file.

Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:21:49 +02:00
Takuya Yoshikawa
2b5e97e1fa KVM: x86 emulator: Use opcode::execute for INS/OUTS from/to port in DX
INSB       : 6C
INSW/INSD  : 6D
OUTSB      : 6E
OUTSW/OUTSD: 6F

The I/O port address is read from the DX register when we decode the
operand because we see the SrcDX/DstDX flag is set.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:46 +02:00
Xiao Guangrong
28a37544fb KVM: introduce id_to_memslot function
Introduce id_to_memslot to get memslot by slot id

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:39 +02:00
Xiao Guangrong
be6ba0f096 KVM: introduce kvm_for_each_memslot macro
Introduce kvm_for_each_memslot to walk all valid memslot

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:37 +02:00
Xiao Guangrong
be593d6286 KVM: introduce update_memslots function
Introduce update_memslots to update slot which will be update to
kvm->memslots

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:35 +02:00
Xiao Guangrong
93a5cef07d KVM: introduce KVM_MEM_SLOTS_NUM macro
Introduce KVM_MEM_SLOTS_NUM macro to instead of
KVM_MEMORY_SLOTS + KVM_PRIVATE_MEM_SLOTS

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:34 +02:00
Takuya Yoshikawa
ff227392cd KVM: x86 emulator: Use opcode::execute for BSF/BSR
BSF: 0F BC
BSR: 0F BD

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:32 +02:00
Takuya Yoshikawa
e940b5c20f KVM: x86 emulator: Use opcode::execute for CMPXCHG
CMPXCHG: 0F B0, 0F B1

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:31 +02:00
Takuya Yoshikawa
e1e210b0a7 KVM: x86 emulator: Use opcode::execute for WRMSR/RDMSR
WRMSR: 0F 30
RDMSR: 0F 32

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:29 +02:00
Takuya Yoshikawa
bc00f8d2c2 KVM: x86 emulator: Use opcode::execute for MOV to cr/dr
MOV: 0F 22 (move to control registers)
MOV: 0F 23 (move to debug registers)

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:28 +02:00
Takuya Yoshikawa
d4ddafcdf2 KVM: x86 emulator: Use opcode::execute for CALL
CALL: E8

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:26 +02:00
Takuya Yoshikawa
ce7faab24f KVM: x86 emulator: Use opcode::execute for BT family
BT : 0F A3
BTS: 0F AB
BTR: 0F B3
BTC: 0F BB

Group 8: 0F BA

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:25 +02:00
Takuya Yoshikawa
d7841a4b1b KVM: x86 emulator: Use opcode::execute for IN/OUT
IN : E4, E5, EC, ED
OUT: E6, E7, EE, EF

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:23 +02:00
Gleb Natapov
46199f33c2 KVM: VMX: remove unneeded vmx_load_host_state() calls.
vmx_load_host_state() does not handle msrs switching (except
MSR_KERNEL_GS_BASE) since commit 26bb0981b3. Remove call to it
where it is no longer make sense.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:22 +02:00
Takuya Yoshikawa
95d4c16ce7 KVM: Optimize dirty logging by rmap_write_protect()
Currently, write protecting a slot needs to walk all the shadow pages
and checks ones which have a pte mapping a page in it.

The walk is overly heavy when dirty pages in that slot are not so many
and checking the shadow pages would result in unwanted cache pollution.

To mitigate this problem, we use rmap_write_protect() and check only
the sptes which can be reached from gfns marked in the dirty bitmap
when the number of dirty pages are less than that of shadow pages.

This criterion is reasonable in its meaning and worked well in our test:
write protection became some times faster than before when the ratio of
dirty pages are low and was not worse even when the ratio was near the
criterion.

Note that the locking for this write protection becomes fine grained.
The reason why this is safe is descripted in the comments.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:20 +02:00
Takuya Yoshikawa
7850ac5420 KVM: Count the number of dirty pages for dirty logging
Needed for the next patch which uses this number to decide how to write
protect a slot.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:19 +02:00
Takuya Yoshikawa
9b9b149236 KVM: MMU: Split gfn_to_rmap() into two functions
rmap_write_protect() calls gfn_to_rmap() for each level with gfn fixed.
This results in calling gfn_to_memslot() repeatedly with that gfn.

This patch introduces __gfn_to_rmap() which takes the slot as an
argument to avoid this.

This is also needed for the following dirty logging optimization.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:17 +02:00
Takuya Yoshikawa
d6eebf8b80 KVM: MMU: Clean up BUG_ON() conditions in rmap_write_protect()
Remove redundant checks and use is_large_pte() macro.

Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:13 +02:00
Chris Wright
fb92045843 KVM: MMU: remove KVM host pv mmu support
The host side pv mmu support has been marked for feature removal in
January 2011.  It's not in use, is slower than shadow or hardware
assisted paging, and a maintenance burden.  It's November 2011, time to
remove it.

Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:10 +02:00
Chris Wright
5202397df8 KVM guest: remove KVM guest pv mmu support
This has not been used for some years now.  It's time to remove it.

Signed-off-by: Chris Wright <chrisw@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:08 +02:00
Jan Kiszka
3f2e5260f5 KVM: x86: Simplify kvm timer handler
The vcpu reference of a kvm_timer can't become NULL while the timer is
valid, so drop this redundant test. This also makes it pointless to
carry a separate __kvm_timer_fn, fold it into kvm_timer_fn.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-27 11:17:05 +02:00
Xiao Guangrong
a30f47cb15 KVM: MMU: improve write flooding detected
Detecting write-flooding does not work well, when we handle page written, if
the last speculative spte is not accessed, we treat the page is
write-flooding, however, we can speculative spte on many path, such as pte
prefetch, page synced, that means the last speculative spte may be not point
to the written page and the written page can be accessed via other sptes, so
depends on the Accessed bit of the last speculative spte is not enough

Instead of detected page accessed, we can detect whether the spte is accessed
after it is written, if the spte is not accessed but it is written frequently,
we treat is not a page table or it not used for a long time

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:02 +02:00
Xiao Guangrong
5d9ca30e96 KVM: MMU: fix detecting misaligned accessed
Sometimes, we only modify the last one byte of a pte to update status bit,
for example, clear_bit is used to clear r/w bit in linux kernel and 'andb'
instruction is used in this function, in this case, kvm_mmu_pte_write will
treat it as misaligned access, and the shadow page table is zapped

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:17:01 +02:00
Xiao Guangrong
889e5cbced KVM: MMU: split kvm_mmu_pte_write function
kvm_mmu_pte_write is too long, we split it for better readable

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:59 +02:00
Xiao Guangrong
f8734352c6 KVM: MMU: remove unnecessary kvm_mmu_free_some_pages
In kvm_mmu_pte_write, we do not need to alloc shadow page, so calling
kvm_mmu_free_some_pages is really unnecessary

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:58 +02:00
Xiao Guangrong
f57f2ef58f KVM: MMU: fast prefetch spte on invlpg path
Fast prefetch spte for the unsync shadow page on invlpg path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:56 +02:00
Xiao Guangrong
505aef8f30 KVM: MMU: cleanup FNAME(invlpg)
Directly Use mmu_page_zap_pte to zap spte in FNAME(invlpg), also remove the
same code between FNAME(invlpg) and FNAME(sync_page)

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:54 +02:00
Xiao Guangrong
d01f8d5e02 KVM: MMU: do not mark accessed bit on pte write path
In current code, the accessed bit is always set when page fault occurred,
do not need to set it on pte write path

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:53 +02:00
Xiao Guangrong
6f6fbe98c3 KVM: x86: cleanup port-in/port-out emulated
Remove the same code between emulator_pio_in_emulated and
emulator_pio_out_emulated

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:51 +02:00
Xiao Guangrong
1cb3f3ae5a KVM: x86: retry non-page-table writing instructions
If the emulation is caused by #PF and it is non-page_table writing instruction,
it means the VM-EXIT is caused by shadow page protected, we can zap the shadow
page and retry this instruction directly

The idea is from Avi

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:50 +02:00
Xiao Guangrong
d5ae7ce835 KVM: x86: tag the instructions which are used to write page table
The idea is from Avi:
| tag instructions that are typically used to modify the page tables, and
| drop shadow if any other instruction is used.
| The list would include, I'd guess, and, or, bts, btc, mov, xchg, cmpxchg,
| and cmpxchg8b.

This patch is used to tag the instructions and in the later path, shadow page
is dropped if it is written by other instructions

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:48 +02:00
Xiao Guangrong
f759e2b4c7 KVM: MMU: avoid pte_list_desc running out in kvm_mmu_pte_write
kvm_mmu_pte_write is unsafe since we need to alloc pte_list_desc in the
function when spte is prefetched, unfortunately, we can not know how many
spte need to be prefetched on this path, that means we can use out of the
free  pte_list_desc object in the cache, and BUG_ON() is triggered, also some
path does not fill the cache, such as INS instruction emulated that does not
trigger page fault

Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:47 +02:00
Nadav Har'El
51cfe38ea5 KVM: nVMX: Fix warning-causing idt-vectoring-info behavior
When L0 wishes to inject an interrupt while L2 is running, it emulates an exit
to L1 with EXIT_REASON_EXTERNAL_INTERRUPT. This was explained in the original
nVMX patch 23, titled "Correct handling of interrupt injection".

Unfortunately, it is possible (though rare) that at this point there is valid
idt_vectoring_info in vmcs02. For example, L1 injected some interrupt to L2,
and when L2 tried to run this interrupt's handler, it got a page fault - so
it returns the original interrupt vector in idt_vectoring_info. The problem
is that if this is the case, we cannot exit to L1 with EXTERNAL_INTERRUPT
like we wished to, because the VMX spec guarantees that idt_vectoring_info
and exit_reason_external_interrupt can never happen together. This is not
just specified in the spec - a KVM L1 actually prints a kernel warning
"unexpected, valid vectoring info" if we violate this guarantee, and some
users noticed these warnings in L1's logs.

In order to better emulate a processor, which would never return the external
interrupt and the idt-vectoring-info together, we need to separate the two
injection steps: First, complete L1's injection into L2 (i.e., enter L2,
injecting to it the idt-vectoring-info); Second, after entry into L2 succeeds
and it exits back to L0, exit to L1 with the EXIT_REASON_EXTERNAL_INTERRUPT.
Most of this is already in the code - the only change we need is to remain
in L2 (and not exit to L1) in this case.

Note that the previous patch ensures (by using KVM_REQ_IMMEDIATE_EXIT) that
although we do enter L2 first, it will exit immediately after processing its
injection, allowing us to promptly inject to L1.

Note how we test vmcs12->idt_vectoring_info_field; This isn't really the
vmcs12 value (we haven't exited to L1 yet, so vmcs12 hasn't been updated),
but rather the place we save, at the end of vmx_vcpu_run, the vmcs02 value
of this field. This was explained in patch 25 ("Correct handling of idt
vectoring info") of the original nVMX patch series.

Thanks to Dave Allan and to Federico Simoncelli for reporting this bug,
to Abel Gordon for helping me figure out the solution, and to Avi Kivity
for helping to improve it.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:45 +02:00
Nadav Har'El
d6185f20a0 KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXIT
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
This bit requests that when next entering the guest, we should run it only
for as little as possible, and exit again.

We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
to continue running so it can inject an event to it, we unfortunately cannot
just pretend to have run L2 for a little while - We must really launch L2,
otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2)
will be lost. So the existing code runs L2 in this case.
But L2 could potentially run for a long time until it exits, and the
injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us
to request that L2 will be entered, as necessary, but will exit as soon as
possible after entry.

Our implementation of this request uses smp_send_reschedule() to send a
self-IPI, with interrupts disabled. The interrupts remain disabled until the
guest is entered, and then, after the entry is complete (often including
processing an injection and jumping to the relevant handler), the physical
interrupt is noticed and causes an exit.

On recent Intel processors, we could have achieved the same goal by using
MTF instead of a self-IPI. Another technique worth considering in the future
is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to
slightly improve performance by avoiding the useless interrupt handler
which ends up being called when smp_send_reschedule() is used.

Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-27 11:16:43 +02:00
Jan Kiszka
4d25a066b6 KVM: Don't automatically expose the TSC deadline timer in cpuid
Unlike all of the other cpuid bits, the TSC deadline timer bit is set
unconditionally, regardless of what userspace wants.

This is broken in several ways:
 - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
   deadline timer feature, a guest that uses the feature will break
 - live migration to older host kernels that don't support the TSC deadline
   timer will cause the feature to be pulled from under the guest's feet;
   breaking it
 - guests that are broken wrt the feature will fail.

Fix by not enabling the feature automatically; instead report it to userspace.
Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
KVM_GET_SUPPORTED_CPUID.

Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.

[avi: add the KVM_CAP + documentation]

Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-12-26 13:27:44 +02:00
Jan Kiszka
0924ab2cfa KVM: x86: Prevent starting PIT timers in the absence of irqchip support
User space may create the PIT and forgets about setting up the irqchips.
In that case, firing PIT IRQs will crash the host:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000128
IP: [<ffffffffa10f6280>] kvm_set_irq+0x30/0x170 [kvm]
...
Call Trace:
 [<ffffffffa11228c1>] pit_do_work+0x51/0xd0 [kvm]
 [<ffffffff81071431>] process_one_work+0x111/0x4d0
 [<ffffffff81071bb2>] worker_thread+0x152/0x340
 [<ffffffff81075c8e>] kthread+0x7e/0x90
 [<ffffffff815a4474>] kernel_thread_helper+0x4/0x10

Prevent this by checking the irqchip mode before starting a timer. We
can't deny creating the PIT if the irqchips aren't set up yet as
current user land expects this order to work.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2011-12-25 17:13:18 +02:00
Linus Torvalds
ecefc36b41 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
  net: Add a flow_cache_flush_deferred function
  ipv4: reintroduce route cache garbage collector
  net: have ipconfig not wait if no dev is available
  sctp: Do not account for sizeof(struct sk_buff) in estimated rwnd
  asix: new device id
  davinci-cpdma: fix locking issue in cpdma_chan_stop
  sctp: fix incorrect overflow check on autoclose
  r8169: fix Config2 MSIEnable bit setting.
  llc: llc_cmsg_rcv was getting called after sk_eat_skb.
  net: bpf_jit: fix an off-one bug in x86_64 cond jump target
  iwlwifi: update SCD BC table for all SCD queues
  Revert "Bluetooth: Revert: Fix L2CAP connection establishment"
  Bluetooth: Clear RFCOMM session timer when disconnecting last channel
  Bluetooth: Prevent uninitialized data access in L2CAP configuration
  iwlwifi: allow to switch to HT40 if not associated
  iwlwifi: tx_sync only on PAN context
  mwifiex: avoid double list_del in command cancel path
  ath9k: fix max phy rate at rate control init
  nfc: signedness bug in __nci_request()
  iwlwifi: do not set the sequence control bit is not needed
2011-12-21 18:29:26 -08:00