Currently, all non-dot symbols are being treated as function descriptors
in ABIv1. This is incorrect and is resulting in perf probe not working:
# perf probe do_fork
Added new event:
Failed to write event: Invalid argument
Error: Failed to add events.
# dmesg | tail -1
[192268.073063] Could not insert probe at _text+768432: -22
perf probe bases all kernel probes on _text and writes,
for example, "p:probe/do_fork _text+768432" to
/sys/kernel/debug/tracing/kprobe_events. In-kernel, _text is being
considered to be a function descriptor and is resulting in the above
error.
Fix this by changing how we lookup symbol addresses on ppc64. We first
check for the dot variant of a symbol and look at the non-dot variant
only if that fails. In this manner, we avoid having to look at the
function descriptor.
While at it, also separate out how this works on ABIv2 where
we don't have dot symbols, but need to use the local entry point.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Once upon a time, at least 9 years ago (< 2.6.12), _TIF_SYSCALL_T_OR_A
meant "TRACE or AUDIT". But these days it means TRACE or AUDIT or
SECCOMP or TRACEPOINT or NOHZ.
All of those are implemented via syscall_dotrace() so rename the flag to
that to try and clarify things.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We removed the last usage of CPU_FTR_IABR in commit 1ad7d70562
"powerpc/xmon: Enable HW instruction breakpoint on POWER8".
Mark it as free.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pci_dn->phb is set to phb in update_dn_pci_info(), if succeed.
This patch removes the duplication of pci_dn->phb initialization.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When IOMMU bypass is enabled, a PCI device can read and write memory
that was not mapped by the driver without causing an EEH. That might
cause memory corruption, for example.
When we disable bypass, DMA reads and writes to addresses not mapped by
the IOMMU will cause an EEH, allowing us to debug such issues.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The current handling of EPOW_SHUTDOWN_ON_UPS event does not shutdown the
system after logging the message. All the events of EPOW_SYSTEM_SHUTDOWN
action code (EPOW_SHUTDOWN_ON_UPS is a part of it) must initiate system
shutdown as per the SPAPR spec. If the LPAR does not shutdown after
receiving this rtas based event, it will expose itself to a forced
abrupt shutdown initiated by the platform firmware. This patch fixes the
situation.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The M64 range information is missed in dmesg, which would be helpful in debug.
This patch prints the M64 range information in the same format as M32.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Turning snoops on is the last step in CAPP recovery. Sapphire is expected to
have reinitialized the PHB and done the previous recovery steps.
Add mode argument to opal call to do this. Driver can turn snoops off although
it does not currently.
Signed-off-by: Ryan Grimm <grimm@linux.vnet.ibm.com>
Acked-by: Ian Munsie <imunsie@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add calls to the ps3_mm_set_repository_highmem() routine when the ps3
r1 highmem region is either created or destroyed.
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add the new routine ps3_mm_set_repository_highmem() that saves highmem info to
the LV1 hypervisor registry so that the info will be available to second stage
OS's loaded by petitboot/kexec. FreeBSD and some Linux derivatives use
this feature.
Also, move the existing ps3_mm_get_repository_highmem() routine up in
the source file.
This implementation of ps3_mm_set_repository_highmem() assumes the repository
will have a single highmem region entry (at index 0).
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
To avoid the need for preprocessor conditionals in C source files add a set of
empty inline repository highmem write routines to platform.h that are used when
CONFIG_PS3_REPOSITORY_WRITE is not defined.
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Distros are enabling NUMA balancing (eg Ubuntu), so it would be good to
get some more test coverage with it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Enable config options required by lxc and docker.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
KSM will only be used on areas marked for merging via madvise, and it
is showing nice improvements on KVM workloads, so enable it by
default.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We are starting to see ppc64 boxes with SATA AHCI adapters in it,
so enable it in our defconfigs.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This was enabled on the pseries defconfigs recently, but missed
the ppc64 one.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It looks like it's ~4 years since we updated some of these, so do a bulk
update.
Verified that the before and after generated configs are exactly the
same.
Which begs the question why update them? The answer is that it can be
confusing when the stored defconfig drifts too far from the generated
result.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We have two arrays in kvm_host_state that contain register values for
the PMU. Currently we only create an asm-offsets symbol for the base of
the arrays, and do the array offset in the assembly code.
Creating an asm-offsets symbol for each field individually makes the
code much nicer to read, particularly for the MMCRx/SIxR/SDAR fields, and
might have helped us notice the recent double restore bug we had in this
code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Alexander Graf <agraf@suse.de>
In the Makefile, string.o (which is generated from string.S) is
included into the list of objects being built unconditionally
(obj-y) in line 12.
Additionally, if CONFIG_PPC64 is set, it is included again in
line 17.
This patch removes the latter unnecessary inclusion.
Signed-off-by: Andreas Ruprecht <rupran@einserver.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 2a2c74b2ef ("IBM Akebono: Add the Akebono platform") added a
select of IBM_EMAC_RGMII_WOL. But that Kconfig symbol isn't (yet) part
of the tree. So this select has been a nop since that commit was
included in v3.16-rc1.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Acked-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In case of error, the function ioremap() returns NULL
not ERR_PTR(). The IS_ERR() test in the return value
check should be replaced with NULL test.
Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
isxdigit() macro definition is the same.
isalnum() from linux/ctype.h will accept additional latin non-ASCII
characters. This is harmless since this macro is used in scanhex() which
parses user input.
isspace() from linux/ctype.h will accept vertical tab and form feed but
not NULL. The use of this macro is modified to accept NULL as
well. Additional characters are harmless since this macro is also only
used in scanhex().
Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Moving config DTL up so it is below config PPC_SPLPAR means that
menuconfig will show config DTL nicely indented right below config
PPC_SPLPAR when PPC_SPLPAR is enabled.
To contrast that, right now if I enable PPC_SPLPAR in menuconfig, all I
can immediately tell is that "something showed up further down the list
where I wasn't looking", and I end up having to toggle the option a few
times to figure out what showed up, or look at the KConfig to find out
that config DTL depends on config PPC_SPLPAR.
Signed-off-by: Cody P Schafer <cody@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This removes the last few uses of CONFIG_PM_RUNTIME introduced
recently and makes that config option finally go away.
CONFIG_PM will be available directly from the menu now and
also it will be selected automatically if CONFIG_SUSPEND or
CONFIG_HIBERNATION is set.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJUlYaPAAoJEILEb/54YlRx/SoP/2wYioGzBhOCYfHw6fZF8zrP
rotQ86sakhvSHre8K9QyFjvsA9wJ0CaTJF46YKZuHFhqU+IJZ7aXvNdEM1hK214J
Mf3L2AcbcdnXioAN+HpeZhQklp2qHe84YkVXBqsFD6kb/qUNV2LSjy6nKEUdY3jW
6KL2f3RgF/LDjTdedujJgcCYwMBwfX4B7U42BG4NQQ8z3wCV+imJgzNDrR5nNlqK
xu8ab8hO1Gi3msOJxS0y4MN6VTUpYOvQKhSyM9ErcB2ibclAdmcivKuFAz6gy5U7
PyDfYo/P3mXjMRBFb9fLqGtRcfstsnxPPSeKwp236tIQFX19Bj76UVUMJoUlXJP5
/f55/P7mCascg74ZZC4GiD/BSCRdqwInCsFMzqAfSq2NciKzeS6W7Mhd9VTLKDpl
5kqE39imUjZyps7/QqkfWskzB7Puhmqk3ZgTq2yAd4uQTpV7xlJYcnvr4oHCmAia
SsLdYOqMQzWr3qyz2f5cOqPAvOo3/Xk/HHfTOCHW/4L+Ov+C921/f3d5GnxX9Ha+
ucRaMp9j5FPYVwFaFkczAMNF2Eanq+Fupa3e6XUNNbYdchFqT9obnHZbVKyvswjR
vdGAYAjP/cLzIH9ETDCCXCRvBRw5pzeelDgvDPjPdmPjndHXG8WViyTIEyLL4+1i
BENtc/SUw3pZ7iNlGO78
=QnSO
-----END PGP SIGNATURE-----
Merge tag 'pm-config-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull CONFIG_PM_RUNTIME elimination from Rafael Wysocki:
"This removes the last few uses of CONFIG_PM_RUNTIME introduced
recently and makes that config option finally go away.
CONFIG_PM will be available directly from the menu now and also it
will be selected automatically if CONFIG_SUSPEND or CONFIG_HIBERNATION
is set"
* tag 'pm-config-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: Eliminate CONFIG_PM_RUNTIME
tty: 8250_omap: Replace CONFIG_PM_RUNTIME with CONFIG_PM
sound: sst-haswell-pcm: Replace CONFIG_PM_RUNTIME with CONFIG_PM
spi: Replace CONFIG_PM_RUNTIME with CONFIG_PM
Having switched over all of the users of CONFIG_PM_RUNTIME to use
CONFIG_PM directly, turn the latter into a user-selectable option
and drop the former entirely from the tree.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Kevin Hilman <khilman@linaro.org>
The highlight is the series that reworks the idle management on powernv, which
allows us to use deeper idle states on those machines.
There's the fix from Anton for the "BUG at kernel/smpboot.c:134!" problem.
An i2c driver for powernv. This is acked by Wolfram Sang, and he asked that we
take it through the powerpc tree.
A fix for audit from rgb at Red Hat, acked by Paul Moore who is one of the audit
maintainers.
A patch from Ben to export the symbol map of our OPAL firmware as a sysfs file,
so that tools can use it.
Also some CXL fixes, a couple of powerpc perf fixes, a fix for smt-enabled, and
the patch to add __force to get_user() so we can use bitwise types.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJUk+oCAAoJEFHr6jzI4aWADBAP/i/CJ+cu6o4mzNDdfs8bnxqn
RGZCSV+SrkTZPcoLbLiM9iaqq34ORVIn7hwkhkTz2/koluMVfTsqtVulMoFf+hVd
GTVt81MjMFzA3hM3bXEV58KRT79+64K54dLCe0F7OaD6f4AikKR4LLz/PY0EBMiZ
2h13uQlfglaMeYTsaD9eeUpIIKs7+PwsNqUknmN9We07WWfxWqnRpiTR4TYTMXx4
3lQPvCnnHokwDqjuKgwiqDVSaCfCl8laS1i+BPk0G0aRV1AnPDvR3MhgVb2IpNxX
Joxy2D1HSawwDhqHOsId8dkGZXOM4vzo+Y658qnC1XfThqE0MhA+kCfa5/b6xlOR
K7nDO5A41B6nXB3mMOQh/szTXSIa8KJRTR3ibbJJrMdF6F0TN0JLLQNUcmM4j/5D
vvgZEzvFNZhWX98ktlQLde2E4ClWJg6mWESCGSgJeVjIXaxe/6GneIa8vLKm5QMu
OoykNsASMDGqddYMGoYeX/mSsvjPjK0PDO2q19sPbkP8xpyDLx6J8xo+5hO4l8xc
0Cdb38ECfeno+w5oKAnjidHnz0KYBsuYFLeS+rV0b8sUSWAzfdEjSn2AVIQ8gLOv
IOCAqwZ5tL9EcUs+AKru5EHtBEV+2XB54xPRxfdFS/k+vYRE7MpS3ipxveIynN2l
eRxf9hsSO7ASNDd0b3ID
=GXdK
-----END PGP SIGNATURE-----
Merge tag 'powerpc-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux
Pull second batch of powerpc updates from Michael Ellerman:
"The highlight is the series that reworks the idle management on
powernv, which allows us to use deeper idle states on those machines.
There's the fix from Anton for the "BUG at kernel/smpboot.c:134!"
problem.
An i2c driver for powernv. This is acked by Wolfram Sang, and he
asked that we take it through the powerpc tree.
A fix for audit from rgb at Red Hat, acked by Paul Moore who is one of
the audit maintainers.
A patch from Ben to export the symbol map of our OPAL firmware as a
sysfs file, so that tools can use it.
Also some CXL fixes, a couple of powerpc perf fixes, a fix for
smt-enabled, and the patch to add __force to get_user() so we can use
bitwise types"
* tag 'powerpc-3.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mpe/linux:
powerpc/powernv: Ignore smt-enabled on Power8 and later
powerpc/uaccess: Allow get_user() with bitwise types
powerpc/powernv: Expose OPAL firmware symbol map
powernv/powerpc: Add winkle support for offline cpus
powernv/cpuidle: Redesign idle states management
powerpc/powernv: Enable Offline CPUs to enter deep idle states
powerpc/powernv: Switch off MMU before entering nap/sleep/rvwinkle mode
i2c: Driver to expose PowerNV platform i2c busses
powerpc: add little endian flag to syscall_get_arch()
power/perf/hv-24x7: Use kmem_cache_free() instead of kfree
powerpc/perf/hv-24x7: Use per-cpu page buffer
cxl: Unmap MMIO regions when detaching a context
cxl: Add timeout to process element commands
cxl: Change contexts_lock to a mutex to fix sleep while atomic bug
powerpc: Secondary CPUs must set cpu_callin_map after setting active and online
- spring cleaning: removed support for IA64, and for hardware-assisted
virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken because
the (non-KVM) XSAVES patches inadvertently changed the KVM userspace
ABI whenever XSAVES was enabled; hence, this part is going to stable.
Guest support is just a matter of exposing the feature and CPUID leaves
support.
Right now KVM is broken for PPC BookE in your tree (doesn't compile).
I'll reply to the pull request with a patch, please apply it either
before the pull request or in the merge commit, in order to preserve
bisectability somewhat.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJUkpg+AAoJEL/70l94x66DUmoH/jzXYkptSW9NGgm79KqxGJlD
lzLnLBkitVvx++Mz5YBhdJEhKKLUlCtifFT1zPJQ/pthQhIRSaaAwZyNGgUs5w5x
yMGKHiPQFyZRbmQtZhCInW0BftJoYHHciO3nUfHCZnp34My9MP2D55W7/z+fYFfQ
DuqBSE9ThyZJtZ4zh8NRA9fCOeuqwVYRyoBs820Wbsh4cpIBoIK63Dg7k+CLE+ZV
MZa/mRL6bAfsn9W5bnOUAgHJ3SPznnWbO3/g0aV+roL/5pffblprJx9lKNR08xUM
6hDFLop2gDehDJesDkY/o8Ckp1hEouvfsVpSShry4vcgtn0hgh2O5/6Orbmj6vE=
=Zwq1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM update from Paolo Bonzini:
"3.19 changes for KVM:
- spring cleaning: removed support for IA64, and for hardware-
assisted virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken
because the (non-KVM) XSAVES patches inadvertently changed the KVM
userspace ABI whenever XSAVES was enabled; hence, this part is
going to stable. Guest support is just a matter of exposing the
feature and CPUID leaves support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (179 commits)
KVM: move APIC types to arch/x86/
KVM: PPC: Book3S: Enable in-kernel XICS emulation by default
KVM: PPC: Book3S HV: Improve H_CONFER implementation
KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register
KVM: PPC: Book3S HV: Remove code for PPC970 processors
KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions
KVM: PPC: Book3S HV: Simplify locking around stolen time calculations
arch: powerpc: kvm: book3s_paired_singles.c: Remove unused function
arch: powerpc: kvm: book3s_pr.c: Remove unused function
arch: powerpc: kvm: book3s.c: Remove some unused functions
arch: powerpc: kvm: book3s_32_mmu.c: Remove unused function
KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked
KVM: PPC: Book3S HV: ptes are big endian
KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI
KVM: PPC: Book3S HV: Fix KSM memory corruption
KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI
KVM: PPC: Book3S HV: Fix computation of tlbie operand
KVM: PPC: Book3S HV: Add missing HPTE unlock
KVM: PPC: BookE: Improve irq inject tracepoint
arm/arm64: KVM: Require in-kernel vgic for the arch timers
...
Commit 69111bac42 ("powerpc: Replace __get_cpu_var uses") introduced
compile breakage to the e500 target by introducing invalid automatically
created C syntax.
Fix up the breakage and make the code compile again.
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Starting with POWER8, the subcore logic relies on all threads of a core
being booted so that they can participate in split mode switches. So on
those machines we ignore the smt_enabled_at_boot setting (smt-enabled on
the kernel command line).
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
[mpe: Update comment and change log to be more precise]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
At the moment, if p and x are both of the same bitwise type
(eg. __le32), get_user(x, p) produces a sparse warning.
This is because *p is loaded into a long then cast back to typeof(*p).
When typeof(*p) is a bitwise type (which is uncommon), such a cast needs
__force, otherwise sparse produces a warning.
For non-bitwise types __force should have no effect, and should not hide
any legitimate errors.
Note that we are casting to typeof(*p) not typeof(x). Even with the
cast, if x and *p are of different types we should get the warning, so I
think we are not loosing the ability to detect any actual errors.
virtio would like to use bitwise types with get_user() so fix these
spurious warnings by adding __force.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
[mpe: Fill in changelog with more details]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The in-kernel XICS emulation is faster than doing it all in QEMU
and it has got a lot of testing, so enable it by default.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently the H_CONFER hcall is implemented in kernel virtual mode,
meaning that whenever a guest thread does an H_CONFER, all the threads
in that virtual core have to exit the guest. This is bad for
performance because it interrupts the other threads even if they
are doing useful work.
The H_CONFER hcall is called by a guest VCPU when it is spinning on a
spinlock and it detects that the spinlock is held by a guest VCPU that
is currently not running on a physical CPU. The idea is to give this
VCPU's time slice to the holder VCPU so that it can make progress
towards releasing the lock.
To avoid having the other threads exit the guest unnecessarily,
we add a real-mode implementation of H_CONFER that checks whether
the other threads are doing anything. If all the other threads
are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER),
it returns H_TOO_HARD which causes a guest exit and allows the
H_CONFER to be handled in virtual mode.
Otherwise it spins for a short time (up to 10 microseconds) to give
other threads the chance to observe that this thread is trying to
confer. The spin loop also terminates when any thread exits the guest
or when all other threads are idle or trying to confer. If the
timeout is reached, the H_CONFER returns H_SUCCESS. In this case the
guest VCPU will recheck the spinlock word and most likely call
H_CONFER again.
This also improves the implementation of the H_CONFER virtual mode
handler. If the VCPU is part of a virtual core (vcore) which is
runnable, there will be a 'runner' VCPU which has taken responsibility
for running the vcore. In this case we yield to the runner VCPU
rather than the target VCPU.
We also introduce a check on the target VCPU's yield count: if it
differs from the yield count passed to H_CONFER, the target VCPU
has run since H_CONFER was called and may have already released
the lock. This check is required by PAPR.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
There are two ways in which a guest instruction can be obtained from
the guest in the guest exit code in book3s_hv_rmhandlers.S. If the
exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal
instruction), the offending instruction is in the HEIR register
(Hypervisor Emulation Instruction Register). If the exit was caused
by a load or store to an emulated MMIO device, we load the instruction
from the guest by turning data relocation on and loading the instruction
with an lwz instruction.
Unfortunately, in the case where the guest has opposite endianness to
the host, these two methods give results of different endianness, but
both get put into vcpu->arch.last_inst. The HEIR value has been loaded
using guest endianness, whereas the lwz will load the instruction using
host endianness. The rest of the code that uses vcpu->arch.last_inst
assumes it was loaded using host endianness.
To fix this, we define a new vcpu field to store the HEIR value. Then,
in kvmppc_handle_exit_hv(), we transfer the value from this new field to
vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness
differ.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This removes the code that was added to enable HV KVM to work
on PPC970 processors. The PPC970 is an old CPU that doesn't
support virtualizing guest memory. Removing PPC970 support also
lets us remove the code for allocating and managing contiguous
real-mode areas, the code for the !kvm->arch.using_mmu_notifiers
case, the code for pinning pages of guest memory when first
accessed and keeping track of which pages have been pinned, and
the code for handling H_ENTER hypercalls in virtual mode.
Book3S HV KVM is now supported only on POWER7 and POWER8 processors.
The KVM_CAP_PPC_RMA capability now always returns 0.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
This patch adds trace points in the guest entry and exit code and also
for exceptions handled by the host in kernel mode - hypercalls and page
faults. The new events are added to /sys/kernel/debug/tracing/events
under a new subsystem called kvm_hv.
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Currently the calculations of stolen time for PPC Book3S HV guests
uses fields in both the vcpu struct and the kvmppc_vcore struct. The
fields in the kvmppc_vcore struct are protected by the
vcpu->arch.tbacct_lock of the vcpu that has taken responsibility for
running the virtual core. This works correctly but confuses lockdep,
because it sees that the code takes the tbacct_lock for a vcpu in
kvmppc_remove_runnable() and then takes another vcpu's tbacct_lock in
vcore_stolen_time(), and it thinks there is a possibility of deadlock,
causing it to print reports like this:
=============================================
[ INFO: possible recursive locking detected ]
3.18.0-rc7-kvm-00016-g8db4bc6 #89 Not tainted
---------------------------------------------
qemu-system-ppc/6188 is trying to acquire lock:
(&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb1fe8>] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
but task is already holding lock:
(&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&(&vcpu->arch.tbacct_lock)->rlock);
lock(&(&vcpu->arch.tbacct_lock)->rlock);
*** DEADLOCK ***
May be due to missing lock nesting notation
3 locks held by qemu-system-ppc/6188:
#0: (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f98>] .vcpu_load+0x28/0xe0 [kvm]
#1: (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecb41b0>] .kvmppc_vcpu_run_hv+0x530/0x1530 [kvm_hv]
#2: (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv]
stack backtrace:
CPU: 40 PID: 6188 Comm: qemu-system-ppc Not tainted 3.18.0-rc7-kvm-00016-g8db4bc6 #89
Call Trace:
[c000000b2754f3f0] [c000000000b31b6c] .dump_stack+0x88/0xb4 (unreliable)
[c000000b2754f470] [c0000000000faeb8] .__lock_acquire+0x1878/0x2190
[c000000b2754f600] [c0000000000fbf0c] .lock_acquire+0xcc/0x1a0
[c000000b2754f6d0] [c000000000b2954c] ._raw_spin_lock_irq+0x4c/0x70
[c000000b2754f760] [d00000000ecb1fe8] .vcore_stolen_time+0x48/0xd0 [kvm_hv]
[c000000b2754f7f0] [d00000000ecb25b4] .kvmppc_remove_runnable.part.3+0x44/0xd0 [kvm_hv]
[c000000b2754f880] [d00000000ecb43ec] .kvmppc_vcpu_run_hv+0x76c/0x1530 [kvm_hv]
[c000000b2754f9f0] [d00000000eb9f46c] .kvmppc_vcpu_run+0x2c/0x40 [kvm]
[c000000b2754fa60] [d00000000eb9c9a4] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm]
[c000000b2754faf0] [d00000000eb94538] .kvm_vcpu_ioctl+0x498/0x760 [kvm]
[c000000b2754fcb0] [c000000000267eb4] .do_vfs_ioctl+0x444/0x770
[c000000b2754fd90] [c0000000002682a4] .SyS_ioctl+0xc4/0xe0
[c000000b2754fe30] [c0000000000092e4] syscall_exit+0x0/0x98
In order to make the locking easier to analyse, we change the code to
use a spinlock in the kvmppc_vcore struct to protect the stolen_tb and
preempt_tb fields. This lock needs to be an irq-safe lock since it is
used in the kvmppc_core_vcpu_load_hv() and kvmppc_core_vcpu_put_hv()
functions, which are called with the scheduler rq lock held, which is
an irq-safe lock.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Remove the function inst_set_field() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
Remove the function get_fpr_index() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
Removes some functions that are not used anywhere:
kvmppc_core_load_guest_debugstate() kvmppc_core_load_host_debugstate()
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
Remove the function sr_nx() that is not used anywhere.
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Alexander Graf <agraf@suse.de>
The kvmppc_vcore_blocked() code does not check for the wait condition
after putting the process on the wait queue. This means that it is
possible for an external interrupt to become pending, but the vcpu to
remain asleep until the next decrementer interrupt. The fix is to
make one last check for pending exceptions and ceded state before
calling schedule().
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
When being restored from qemu, the kvm_get_htab_header are in native
endian, but the ptes are big endian.
This patch fixes restore on a KVM LE host. Qemu also needs a fix for
this :
http://lists.nongnu.org/archive/html/qemu-ppc/2014-11/msg00008.html
Signed-off-by: Cédric Le Goater <clg@fr.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
This fixes some inaccuracies in the state machine for the virtualized
ICP when implementing the H_IPI hcall (Set_MFFR and related states):
1. The old code wipes out any pending interrupts when the new MFRR is
more favored than the CPPR but less favored than a pending
interrupt (by always modifying xisr and the pending_pri). This can
cause us to lose a pending external interrupt.
The correct code here is to only modify the pending_pri and xisr in
the ICP if the MFRR is equal to or more favored than the current
pending pri (since in this case, it is guaranteed that that there
cannot be a pending external interrupt). The code changes are
required in both kvmppc_rm_h_ipi and kvmppc_h_ipi.
2. Again, in both kvmppc_rm_h_ipi and kvmppc_h_ipi, there is a check
for whether MFRR is being made less favored AND further if new MFFR
is also less favored than the current CPPR, we check for any
resends pending in the ICP. These checks look like they are
designed to cover the case where if the MFRR is being made less
favored, we opportunistically trigger a resend of any interrupts
that had been previously rejected. Although, this is not a state
described by PAPR, this is an action we actually need to do
especially if the CPPR is already at 0xFF. Because in this case,
the resend bit will stay on until another ICP state change which
may be a long time coming and the interrupt stays pending until
then. The current code which checks for MFRR < CPPR is broken when
CPPR is 0xFF since it will not get triggered in that case.
Ideally, we would want to do a resend only if
prio(pending_interrupt) < mfrr && prio(pending_interrupt) < cppr
where pending interrupt is the one that was rejected. But we don't
have the priority of the pending interrupt state saved, so we
simply trigger a resend whenever the MFRR is made less favored.
3. In kvmppc_rm_h_ipi, where we save state to pass resends to the
virtual mode, we also need to save the ICP whose need_resend we
reset since this does not need to be my ICP (vcpu->arch.icp) as is
incorrectly assumed by the current code. A new field rm_resend_icp
is added to the kvmppc_icp structure for this purpose.
Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
Testing with KSM active in the host showed occasional corruption of
guest memory. Typically a page that should have contained zeroes
would contain values that look like the contents of a user process
stack (values such as 0x0000_3fff_xxxx_xxx).
Code inspection in kvmppc_h_protect revealed that there was a race
condition with the possibility of granting write access to a page
which is read-only in the host page tables. The code attempts to keep
the host mapping read-only if the host userspace PTE is read-only, but
if that PTE had been temporarily made invalid for any reason, the
read-only check would not trigger and the host HPTE could end up
read-write. Examination of the guest HPT in the failure situation
revealed that there were indeed shared pages which should have been
read-only that were mapped read-write.
To close this race, we don't let a page go from being read-only to
being read-write, as far as the real HPTE mapping the page is
concerned (the guest view can go to read-write, but the actual mapping
stays read-only). When the guest tries to write to the page, we take
an HDSI and let kvmppc_book3s_hv_page_fault take care of providing a
writable HPTE for the page.
This eliminates the occasional corruption of shared pages
that was previously seen with KSM active.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>
The B (segment size) field in the RB operand for the tlbie
instruction is two bits, which we get from the top two bits of
the first doubleword of the HPT entry to be invalidated. These
bits go in bits 8 and 9 of the RB operand (bits 54 and 55 in IBM
bit numbering).
The compute_tlbie_rb() function gets these bits as v >> (62 - 8),
which is not correct as it will bring in the top 10 bits, not
just the top two. These extra bits could corrupt the AP, AVAL
and L fields in the RB value. To fix this we shift right 62 bits
and then shift left 8 bits, so we only get the two bits of the
B field.
The first doubleword of the HPT entry is under the control of the
guest kernel. In fact, Linux guests will always put zeroes in bits
54 -- 61 (IBM bits 2 -- 9), but we should not rely on guests doing
this.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
In kvm_test_clear_dirty(), if we find an invalid HPTE we move on to the
next HPTE without unlocking the invalid one. In fact we should never
find an invalid and unlocked HPTE in the rmap chain, but for robustness
we should unlock it. This adds the missing unlock.
Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Alexander Graf <agraf@suse.de>