FORM2 NUMA affinity is prepared to deal with empty (memory/cpu less)
NUMA nodes. This is used by the DAX KMEM driver to locate a PAPR SCM
device that has a different latency than the original NUMA node from the
regular memory. FORM2 is also able to deal with asymmetric NUMA
distances gracefully, something that our FORM1 implementation doesn't
do.
Move these FORM1 verifications to a new function and wait until after
CAS, when we're sure that we're sticking with FORM1, to enforce them.
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20210920174947.556324-6-danielhb413@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Introducing a new NUMA affinity, FORM2, requires a new mechanism to
switch between affinity modes after CAS. Also, we want FORM2 data
structures and functions to be completely separated from the existing
FORM1 code, allowing us to avoid adding new code that inherits the
existing complexity of FORM1.
The idea of switching values used by the write_dt() functions in
spapr_numa.c was already introduced in the previous patch, and
the same approach will be used when dealing with the FORM1 and FORM2
arrays.
We can accomplish that by that by renaming the existing numa_assoc_array
to FORM1_assoc_array, which now is used exclusively to handle FORM1 affinity
data. A new helper get_associativity() is then introduced to be used by the
write_dt() functions to retrieve the current ibm,associativity array of
a given node, after considering affinity selection that might have been
done during CAS. All code that was using numa_assoc_array now needs to
retrieve the array by calling this function.
This will allow for an easier plug of FORM2 data later on.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20210920174947.556324-5-danielhb413@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This addresses the comments from v22.
The functional changes are (the VOF ones need retesting with Pegasos2):
(VOF) setprop will start failing if the machine class callback
did not handle it;
(VOF) unit addresses are lowered in path_offset();
(SPAPR) /chosen/bootargs is initialized from kernel_cmdline if
the client did not change it.
Fixes: 5c991e5d4378 ("spapr: Implement Open Firmware client interface")
Cc: BALATON Zoltan <balaton@eik.bme.hu>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20210708065625.548396-1-aik@ozlabs.ru>
Tested-by: BALATON Zoltan <balaton@eik.bme.hu>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
There are several new L1D cache flush bits added to the hcall which reflect
hardware security features for speculative cache access issues.
These behaviours are now being specified as negative in order to simplify
patched kernel compatibility with older firmware (a new problem found in
existing systems would automatically be vulnerable).
[dwg: Technically this changes behaviour for existing machine types.
After discussion with Nick, we've determined this is safe, because
the worst that will happen if a guest gets the wrong information due
to a migration is that it will perform some unnecessary workarounds,
but will remain correct and secure (well, as secure as it was going
to be anyway). In addition the change only affects cap-cfpc=safe
which is not enabled by default, and in fact is not possible to set
on any current hardware (though it's expected it will be possible on
POWER10)]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20210615044107.1481608-1-npiggin@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The PAPR platform describes an OS environment that's presented by
a combination of a hypervisor and firmware. The features it specifies
require collaboration between the firmware and the hypervisor.
Since the beginning, the runtime component of the firmware (RTAS) has
been implemented as a 20 byte shim which simply forwards it to
a hypercall implemented in qemu. The boot time firmware component is
SLOF - but a build that's specific to qemu, and has always needed to be
updated in sync with it. Even though we've managed to limit the amount
of runtime communication we need between qemu and SLOF, there's some,
and it has become increasingly awkward to handle as we've implemented
new features.
This implements a boot time OF client interface (CI) which is
enabled by a new "x-vof" pseries machine option (stands for "Virtual Open
Firmware). When enabled, QEMU implements the custom H_OF_CLIENT hcall
which implements Open Firmware Client Interface (OF CI). This allows
using a smaller stateless firmware which does not have to manage
the device tree.
The new "vof.bin" firmware image is included with source code under
pc-bios/. It also includes RTAS blob.
This implements a handful of CI methods just to get -kernel/-initrd
working. In particular, this implements the device tree fetching and
simple memory allocator - "claim" (an OF CI memory allocator) and updates
"/memory@0/available" to report the client about available memory.
This implements changing some device tree properties which we know how
to deal with, the rest is ignored. To allow changes, this skips
fdt_pack() when x-vof=on as not packing the blob leaves some room for
appending.
In absence of SLOF, this assigns phandles to device tree nodes to make
device tree traversing work.
When x-vof=on, this adds "/chosen" every time QEMU (re)builds a tree.
This adds basic instances support which are managed by a hash map
ihandle -> [phandle].
Before the guest started, the used memory is:
0..e60 - the initial firmware
8000..10000 - stack
400000.. - kernel
3ea0000.. - initramdisk
This OF CI does not implement "interpret".
Unlike SLOF, this does not format uninitialized nvram. Instead, this
includes a disk image with pre-formatted nvram.
With this basic support, this can only boot into kernel directly.
However this is just enough for the petitboot kernel and initradmdisk to
boot from any possible source. Note this requires reasonably recent guest
kernel with:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df5be5be8735
The immediate benefit is much faster booting time which especially
crucial with fully emulated early CPU bring up environments. Also this
may come handy when/if GRUB-in-the-userspace sees light of the day.
This separates VOF and sPAPR in a hope that VOF bits may be reused by
other POWERPC boards which do not support pSeries.
This assumes potential support for booting from QEMU backends
such as blockdev or netdev without devices/drivers used.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20210625055155.2252896-1-aik@ozlabs.ru>
Reviewed-by: BALATON Zoltan <balaton@eik.bme.hu>
[dwg: Adjusted some includes which broke compile in some more obscure
compilation setups]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Moved has_spr to cpu.h as ppc_has_spr and turned it into an inline function.
Change spr verification in pnv.c and spapr.c to a version that can
compile in a !TCG environment.
Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
Message-Id: <20210507164146.67086-1-lucas.araujo@eldorado.org.br>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The hypercalls h_enter, h_remove, h_bulk_remove, h_protect, and h_read,
have been moved to spapr_softmmu.c with the functions they depend on. The
functions is_ram_address and push_sregs_to_kvm_pr are not static anymore
as functions on both spapr_hcall.c and spapr_softmmu.c depend on them.
The hypercalls h_resize_hpt_prepare and h_resize_hpt_commit have been
divided, the KVM part stayed in spapr_hcall.c while the softmmu part
was moved to spapr_softmmu.c
Signed-off-by: Lucas Mateus Castro (alqotel) <lucas.araujo@eldorado.org.br>
Message-Id: <20210506163941.106984-2-lucas.araujo@eldorado.org.br>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
A following patch will make use of it.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Message-Id: <20210505001130.3999968-2-farosas@linux.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Here's the first ppc pull request for qemu-6.1. It has a wide variety
of stuff accumulated during the 6.0 freeze. Highlights are:
* Multi-phase reset cleanups for PAPR
* Preliminary cleanups towards allowing !CONFIG_TCG for the ppc target
* Cleanup of AIL logic and extension to POWER10
* Further improvements to handling of hot unplug failures on PAPR
* Allow much larger numbers of CPU on pseries
* Support for the H_SCM_HEALTH hypercall
* Add support for the Pegasos II board
* Substantial cleanup to hflag handling
* Assorted minor fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEdfRlhq5hpmzETofcbDjKyiDZs5IFAmCQ4ScACgkQbDjKyiDZ
s5KmNhAAsICdDqeu/jm1uhRCr0DDT/Wa6KE1xlglQ53ybWb5Hm2ae0Uwzti5ZWkt
T9yryObX++wiugbU5Dlx9eXTiJIPgTbDoBV1wfOa3a1BAxSEES1t70jwuwAXXBpX
mgU++SurQB70IB7vVvyXDi2Z592qGvMiKXqT0sdkfoexPHzAL0+KkQPyJZLeFchM
Ap/zRHAodXf9SuWAl+LwLXeb350jivXYXBWNcFRrBbOGpbVT0AJMYrk/TEa2ZIpi
SvbzAWuW+9mX0EOmk7JK5JfkT41cGNdcBcwd0bt4xyvUpmkXLaTMFDLVHj3HWSUn
PFA4RB3uKXyTfISVtWdxJBbFOzMpchI6lEiRJHCS+KuY7UsACqV1T/y54ATOUauC
ycLc9APgRaStdNPxfDl+xeFfoVb/f0mQsNwcmY1tv7z+3qE/trY9bMyrbgaebBFn
/TAkmPvXfwtAREnx8xF/57poarWUkvupGTQkANNosdFokpExmrLj8T0sKv90hh5Y
vkGf5zP4pYGN1Rs8qhOdHu+IjhVJvUl/L3LZYWcoMI6E61D8rGRc0Dkacx7gcja+
sluFi5Yh2fQn55y6LTi3049cB1wMd6wly0214F11RKoBswguiGuaqJmL4sNDO/s4
IcMCy5mg6C0jNZA5kHcdWmqsVzD2+XwP5J29n/LedlmgXoHYF+M=
=N0qr
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/dg-gitlab/tags/ppc-for-6.1-20210504' into staging
ppc patch queue 2021-05-04
Here's the first ppc pull request for qemu-6.1. It has a wide variety
of stuff accumulated during the 6.0 freeze. Highlights are:
* Multi-phase reset cleanups for PAPR
* Preliminary cleanups towards allowing !CONFIG_TCG for the ppc target
* Cleanup of AIL logic and extension to POWER10
* Further improvements to handling of hot unplug failures on PAPR
* Allow much larger numbers of CPU on pseries
* Support for the H_SCM_HEALTH hypercall
* Add support for the Pegasos II board
* Substantial cleanup to hflag handling
* Assorted minor fixes and cleanups
# gpg: Signature made Tue 04 May 2021 06:52:39 BST
# gpg: using RSA key 75F46586AE61A66CC44E87DC6C38CACA20D9B392
# gpg: Good signature from "David Gibson <david@gibson.dropbear.id.au>" [full]
# gpg: aka "David Gibson (Red Hat) <dgibson@redhat.com>" [full]
# gpg: aka "David Gibson (ozlabs.org) <dgibson@ozlabs.org>" [full]
# gpg: aka "David Gibson (kernel.org) <dwg@kernel.org>" [unknown]
# Primary key fingerprint: 75F4 6586 AE61 A66C C44E 87DC 6C38 CACA 20D9 B392
* remotes/dg-gitlab/tags/ppc-for-6.1-20210504: (46 commits)
hw/ppc/pnv_psi: Use device_cold_reset() instead of device_legacy_reset()
hw/ppc/spapr_vio: Reset TCE table object with device_cold_reset()
hw/intc/spapr_xive: Use device_cold_reset() instead of device_legacy_reset()
target/ppc: removed VSCR from SPR registration
target/ppc: Reduce the size of ppc_spr_t
target/ppc: Clean up _spr_register et al
target/ppc: Add POWER10 exception model
target/ppc: rework AIL logic in interrupt delivery
target/ppc: move opcode table logic to translate.c
target/ppc: code motion from translate_init.c.inc to gdbstub.c
spapr_drc.c: handle hotunplug errors in drc_unisolate_logical()
spapr.h: increase FDT_MAX_SIZE
spapr.c: do not use MachineClass::max_cpus to limit CPUs
ppc: Rename current DAWR macros and variables
target/ppc: POWER10 supports scv
target/ppc: Fix POWER9 radix guest HV interrupt AIL behaviour
docs/system: ppc: Add documentation for ppce500 machine
roms/u-boot: Bump ppce500 u-boot to v2021.04 to fix broken pci support
roms/Makefile: Update ppce500 u-boot build directory name
ppc/spapr: Add support for implement support for H_SCM_HEALTH
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
POWER10 adds a new bit that modifies interrupt behaviour, LPCR[HAIL],
and it removes support for the LPCR[AIL]=0b10 mode.
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Tested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20210501072436.145444-3-npiggin@gmail.com>
[dwg: Corrected tab indenting]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The AIL logic is becoming unmanageable spread all over powerpc_excp(),
and it is slated to get even worse with POWER10 support.
Move it all to a new helper function.
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Tested-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20210501072436.145444-2-npiggin@gmail.com>
[dwg: Corrected tab indenting]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Stop including cpu.h in files that don't need it.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20210416171314.2074665-4-thuth@redhat.com>
Signed-off-by: Laurent Vivier <laurent@vivier.eu>
The function is called only inside spapr_hcall.c.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20210114180628.1675603-3-danielhb413@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
No need to expose the way DRCs are traversed outside of spapr_drc.c.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20201218103400.689660-4-groug@kaod.org>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Tested-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Documentation of object_property_iter_init() clearly stipulates that
"it is forbidden to modify the property list while iterating". But this
is exactly what we do when resetting transient DR connectors during CAS.
The call to spapr_drc_reset() can finalize the hot-unplug sequence of a
PHB or a PCI bridge, both of which will then in turn destroy their PCI
DRCs. This could potentially invalidate the iterator. It is pure luck
that this haven't caused any issues so far.
Change spapr_drc_reset() to return true if it caused a device to be
removed. Restart from scratch in this case. This can potentially
increase the overall DRC reset time, especially with a high maxmem
which generates a lot of LMB DRCs. But this kind of setup is rare,
and so is the use case of rebooting a guest while doing hot-unplug.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20201218103400.689660-3-groug@kaod.org>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Tested-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Non-transient DRCs are either in the empty or the ready state,
which means spapr_drc_reset() doesn't change their state. It
is thus not needed to do any checking. Call spapr_drc_reset()
unconditionally and squash spapr_drc_transient() into its
only user, spapr_drc_needed().
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20201218103400.689660-2-groug@kaod.org>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Tested-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This allows to drop a user of qdev_get_machine().
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20201209170052.1431440-4-groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
HPT resizing is asynchronous: the guest first kicks off the creation of a
new HPT, then it waits for that new HPT to be actually created and finally
it asks the current HPT to be replaced by the new one.
In the case of a userland allocated HPT, this currently relies on calling
qemu_memalign() which aborts on OOM and never returns NULL. Since we seem
to have path to report the failure to the guest with an H_NO_MEM return
value, use qemu_try_memalign() instead of qemu_memalign().
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <160398563636.32380.1747166034877173994.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Use the return value of ppc_set_compat_all() to check failures,
which is preferred over hijacking local_err.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20200914123505.612812-7-groug@kaod.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The cas_check_pvr() function has two purposes:
- finding the "best" logical PVR, ie. the most recent one supported by
the guest for this CPU type
- checking if the guest supports the real PVR of this CPU type, which
is just an optional extra information to workaround the lack of
support for "compat" mode in PR KVM
This logic doesn't need error reporting, really. If we don't find a
suitable logical PVR, we return the special value 0 which is definitely
not a valid PVR. Let the caller decide on whether it should error out
or not.
This doesn't change the behavior.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20200914123505.612812-6-groug@kaod.org>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The implementation of this hypercall will be modified to use
spapr->numa_assoc_arrays input. Moving it to spapr_numa.c makes
make more sense.
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <20200904172422.617460-2-danielhb413@gmail.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Pass &error_abort instead of NULL where the returned value is
dereferenced or asserted to be non-null. Drop a now redundant
assertion.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Message-Id: <20200707160613.848843-24-armbru@redhat.com>
The CAS reboot flag is false by default and all the locations that
could set it to true have been dropped. This means that all code
blocks depending on the flag being set is dead code and the other
code blocks should be executed always.
Just do that and drop the now uneeded CAS reboot flag. Fix a
comment on the way to make checkpatch happy.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158514994893.478799.11772512888322840990.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
At the moment "ibm,client-architecture-support" ("CAS") is implemented
in SLOF and QEMU assists via the custom H_CAS hypercall which copies
an updated flatten device tree (FDT) blob to the SLOF memory which
it then uses to update its internal tree.
When we enable the OpenFirmware client interface in QEMU, we won't need
to copy the FDT to the guest as the client is expected to fetch
the device tree using the client interface.
This moves FDT rebuild out to a separate helper which is going to be
called from the "ibm,client-architecture-support" handler and leaves
writing FDT to the guest in the H_CAS handler.
This should not cause any behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20200310050733.29805-3-aik@ozlabs.ru>
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158514994229.478799.2178881312094922324.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The guest can select the MMU mode by setting bits 0-1 of byte 24
in OV5 to to 0b00 for hash or 0b01 for radix. As required by the
architecture, we terminate the boot process if any other value
is found there.
The usual way to negotiate features in OV5 is basically ANDing
the bitfield provided by the guest and the bitfield of features
supported by QEMU, previously populated at machine init.
For some not documented reason, MMU is treated differently : bit 1
of byte 24 (the radix/hash bit) is cleared from the guest OV5 and
explicitely set in the final negotiated OV5 if radix was requested.
Since the only expected input from the guest is the radix/hash bit
being set or not, it seems more appropriate to handle this like we
do for XIVE.
Set the radix bit in spapr->ov5 at machine init if it has a chance
to work (ie. power9, either TCG or a radix capable KVM) and rely
exclusively on spapr_ovec_intersect() to set the radix bit in
spapr->ov5_cas.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158514993621.478799.4204740354545734293.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We currently check if some capability in OV5 was removed by the guest
since the previous CAS, and we trigger a CAS reboot in that case. This
was required because it could call for a device-tree property or node
removal, that we didn't support until recently (see commit 6787d27b04
"spapr: add option vector handling in CAS-generated resets" for details).
Now that we render a full FDT at CAS and that SLOF is able to handle
node removal, we don't need to do a CAS reset in this case anymore.
Also, this check can only return true if the guest has already called
CAS since the last full system reset (otherwise spapr->ov5_cas is
empty). Linux doesn't do that so this can be considered as dead code
for the vast majority of existing setups.
Drop the check. Since the only use of the ov5_cas_old variable is
precisely the check itself, drop the variable as well.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158514993021.478799.10928618293640651819.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This is the only error path that needs to free the previously allocated
ov1.
Reported-by: Coverity (CID 1421924)
Fixes: cbd0d7f363 "spapr: Fail CAS if option vector table cannot be parsed"
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158481206205.336182.16106097429336044843.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
The Real Mode Area (RMA) is the part of memory which a guest can access
when in real (MMU off) mode. Of course, for a guest under KVM, the MMU
isn't really turned off, it's just in a special translation mode - Virtual
Real Mode Area (VRMA) - which looks like real mode in guest mode.
The mechanics of how this works when using the hash MMU (HPT) put a
constraint on the size of the RMA, which depends on the size of the
HPT. So, the latter part of spapr_setup_hpt_and_vrma() clamps the RMA
we advertise to the guest based on this VRMA limit.
There are several things wrong with this:
1) spapr_setup_hpt_and_vrma() doesn't actually clamp, it takes the minimum
of Node 0 memory size and the VRMA limit. That will *often* work the
same as clamping, but there can be other constraints on RMA size which
supersede Node 0 memory size. We have real bugs caused by this
(currently worked around in the guest kernel)
2) Some callers of spapr_setup_hpt_and_vrma() are in a situation where
we're past the point that we can actually advertise an RMA limit to the
guest
3) But most fundamentally, the VRMA limit depends on host configuration
(page size) which shouldn't be visible to the guest, but this partially
exposes it. This can cause problems with migration in certain edge
cases, although we will mostly get away with it.
In practice, this clamping is almost never applied anyway. With 64kiB
pages and the normal rules for sizing of the HPT, the theoretical VRMA
limit will be 4x(guest memory size) and so never hit. It will hit with
4kiB pages, where it will be (guest memory size)/4. However all mainstream
distro kernels for POWER have used a 64kiB page size for at least 10 years.
So, simply replace this logic with a check that the RMA we've calculated
based only on guest visible configuration will fit within the host implied
VRMA limit. This can break if running HPT guests on a host kernel with
4kiB page size. As noted that's very rare. There also exist several
possible workarounds:
* Change the host kernel to use 64kiB pages
* Use radix MMU (RPT) guests instead of HPT
* Use 64kiB hugepages on the host to back guest memory
* Increase the guest memory size so that the RMA hits one of the fixed
limits before the RMA limit. This is relatively easy on POWER8 which
has a 16GiB limit, harder on POWER9 which has a 1TiB limit.
* Use a guest NUMA configuration which artificially constrains the RMA
within the VRMA limit (the RMA must always fit within Node 0).
Previously, on KVM, we also temporarily reduced the rma_size to 256M so
that the we'd load the kernel and initrd safely, regardless of the VRMA
limit. This was a) confusing, b) could significantly limit the size of
images we could load and c) introduced a behavioural difference between
KVM and TCG. So we remove that as well.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Greg Kurz <groug@kaod.org>
If a hot plug or unplug request is pending at CAS, we currently trigger
a CAS reboot, which severely increases the guest boot time. This is
because SLOF doesn't handle hot plug events and we had no way to fix
the FDT that gets presented to the guest.
We can do better thanks to recent changes in QEMU and SLOF:
- we now return a full FDT to SLOF during CAS
- SLOF was fixed to correctly detect any device that was either added or
removed since boot time and to update its internal DT accordingly.
The right solution is to process all pending hot plug/unplug requests
during CAS: convert hot plugged devices to cold plugged devices and
remove the hot unplugged ones, which is exactly what spapr_drc_reset()
does. Also clear all hot plug events that are currently queued since
they're no longer relevant.
Note that SLOF cannot currently populate hot plugged PCI bridges or PHBs
at CAS. Until this limitation is lifted, SLOF will reset the machine when
this scenario occurs : this will allow the FDT to be fully processed when
SLOF is started again (ie. the same effect as the CAS reboot that would
occur anyway without this patch).
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158257222352.4102917.8984214333937947307.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
We currently don't support hotplug of devices between boot and CAS. If
this happens a CAS reboot is triggered. We detect this during CAS using
the spapr_drc_needed() function which is essentially a VMStateDescription
.needed callback. Even if the condition for CAS reboot happens to be the
same as for DRC migration, it looks wrong to piggyback a migration helper
for this.
Introduce a helper with slightly more explicit name and use it in both CAS
and DRC migration code. Since a subsequent patch will enhance this helper
to cover the case of hot unplug, let's go for spapr_drc_transient(). While
here convert spapr_hotplugged_dev_before_cas() to the "transient" wording as
well.
This doesn't change any behaviour.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <158169248180.3465937.9531405453362718771.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Use an explicit boolean type.
This commit was produced with the included Coccinelle script
scripts/coccinelle/exec_rw_const.
Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
According to the description of "ibm,client-architecture-support" that
can found in LoPAPR "B.6.2.3 Root Node Methods":
If multiple partition processors or threads are active at the time of
the ibm,client-architecture-support method call, or an error is detected
in the format of the ibm,architecture.vec structure, the err? boolean
shall be TRUE; else FALSE.
We certainly don't want to temper with the platform or with the PCR of
the other vCPUs if they happen to be active. Ensure we have only one
active vCPU and fail CAS otherwise. This is just for conformance and
robustness, it doesn't fix any known bugs.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <157969867170.571404.12117797348882189656.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Most of the option vector helpers have assertions to check their
arguments aren't null. The guest can provide an arbitrary address
for the CAS structure that would result in such null arguments.
Fail CAS with H_PARAMETER and print a warning instead of aborting
QEMU.
Signed-off-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-Id: <157925255250.397143.10855183619366882459.stgit@bahia.lan>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
spapr_ovec_diff(ov, old, new) has somewhat complex semantics. ov is set
to those bits which are in new but not old, and it returns as a boolean
whether or not there are any bits in old but not new.
It turns out that both callers only care about the second, not the first.
This is basically equivalent to a bitmap subset operation, which is easier
to understand and implement. So replace spapr_ovec_diff() with
spapr_ovec_subset().
Cc: Mike Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cedric Le Goater <clg@fr.ibm.com>
spapr_h_cas_compose_response() handles the last piece of the PAPR feature
negotiation process invoked via the ibm,client-architecture-support OF
call. Its only caller is h_client_architecture_support() which handles
most of the rest of that process.
I believe it was placed in a separate file originally to handle some
fiddly dependencies between functions, but mostly it's just confusing
to have the CAS process split into two pieces like this. Now that
compose response is simplified (by just generating the whole device
tree anew), it's cleaner to just fold it into
h_client_architecture_support().
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cedric Le Goater <clg@fr.ibm.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
PAPR allows the interrupt controller used on a POWER9 machine (XICS or
XIVE) to be selected by the guest operating system, by using the
ibm,client-architecture-support (CAS) feature negotiation call.
Currently, if the guest selects an interrupt controller different from the
one selected at initial boot, this causes the system to be reset with the
new model and the boot starts again. This means we run through the SLOF
boot process twice, as well as any other bootloader (e.g. grub) in use
before the OS calls CAS. This can be confusing and/or inconvenient for
users.
Thanks to two fairly recent changes, we no longer need this reboot. 1) we
now completely regenerate the device tree when CAS is called (meaning we
don't need special case updates for all the device tree changes caused by
the interrupt controller mode change), 2) we now have explicit code paths
to activate and deactivate the different interrupt controllers, rather than
just implicitly calling those at machine reset time.
We can therefore eliminate the reboot for changing irq mode, simply by
putting a call to spapr_irq_update_active_intc() before we call
spapr_h_cas_compose_response() (which gives the updated device tree to
the guest firmware and OS).
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cedric Le Goater <clg@fr.ibm.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
SpaprIrq::ov5 stores the value for a particular byte in PAPR option vector
5 which indicates whether XICS, XIVE or both interrupt controllers are
available. As usual for PAPR, the encoding is kind of overly complicated
and confusing (though to be fair there are some backwards compat things it
has to handle).
But to make our internal code clearer, have SpaprIrq encode more directly
which backends are available as two booleans, and derive the OV5 value from
that at the point we need it.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Certain old guest versions don't understand the radix MMU introduced with
POWER ISA 3.0, but incorrectly select it if presented with the option at
CAS time. We workaround this in qemu by explicitly excluding the radix
(and other ISA 3.0 linked) options if the guest doesn't explicitly note
support for ISA 3.0.
This is handled by the 'cas_legacy_guest_workaround' flag, which is pretty
vague. Rename it to 'cas_pre_isa3_guest' to be clearer about what it's for.
In addition, we unnecessarily call spapr_populate_pa_features() with
different options when initially constructing the device tree and when
adjusting it at CAS time. At the initial construct time cas_pre_isa3_guest
is already false, so we can still use the flag, rather than explicitly
overriding it to be false at the callsite.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
The sPAPR platform includes feature negotiation between the guest and
platform. That sometimes requires reconfiguring the virtual hardware, and
in some cases that is a complex enough process that we trigger a system
reset to handle it. That interacts badly with -no-reboot - we trigger the
reboot, -no-reboot means we exit and so the guest never gets to try again.
Eventually we want to get rid of CAS reboots entirely, since they're odd
and irritating for the user. But in the meantime we can fix the -no-reboot
problem by using SHUTDOWN_CAUSE_SUBSYSTEM_RESET which ignores -no-reboot
and seems to be designed for this sort of faux-reset for internal purposes
only.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This implements the H_TPM_COMM hypercall, which is used by an
Ultravisor to pass TPM commands directly to the host's TPM device, or
a TPM Resource Manager associated with the device.
This also introduces a new virtual device, spapr-tpm-proxy, which
is used to configure the host TPM path to be used to service
requests sent by H_TPM_COMM hcalls, for example:
-device spapr-tpm-proxy,id=tpmp0,host-path=/dev/tpmrm0
By default, no spapr-tpm-proxy will be created, and hcalls will return
H_FUNCTION.
The full specification for this hypercall can be found in
docs/specs/ppc-spapr-uv-hcalls.txt
Since SVM-related hcalls like H_TPM_COMM use a reserved range of
0xEF00-0xEF80, we introduce a separate hcall table here to handle
them.
Signed-off-by: Michael Roth <mdroth@linux.vnet.ibm.com
Message-Id: <20190717205842.17827-3-mdroth@linux.vnet.ibm.com>
[dwg: Corrected #include for upstream change]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This has been useful to modify and test the Linux pseries suspend
code but it requires modification to the guest to call it (due to
being gated by other unimplemented features). It is not otherwise
used by Linux yet, but work is slowly progressing there.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20190718034214.14948-5-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
This does not do directed yielding and is not quite as strict as PAPR
specifies in terms of precise dispatch behaviour. This generally will
mean suboptimal performance, rather than guest misbehaviour. Linux
does not rely on exact dispatch behaviour.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20190718034214.14948-4-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
H_PROD is added, and H_CEDE is modified to test the prod bit
according to PAPR.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20190718034214.14948-3-npiggin@gmail.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Implement cpu_exec_enter/exit on ppc which calls into new methods of
the same name in PPCVirtualHypervisorClass. These are used by spapr
to implement the splpar VPA dispatch counter initially.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20190718034214.14948-2-npiggin@gmail.com>
[dwg: Removed unnecessary CONFIG_USER_ONLY checks as suggested by gkurz]
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Free all SpaprOptionVector local pointers after use.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Message-Id: <156335160761.82682.11912058325777251614.stgit@lep8c.aus.stglabs.ibm.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
sysemu/sysemu.h is a rather unfocused dumping ground for stuff related
to the system-emulator. Evidence:
* It's included widely: in my "build everything" tree, changing
sysemu/sysemu.h still triggers a recompile of some 1100 out of 6600
objects (not counting tests and objects that don't depend on
qemu/osdep.h, down from 5400 due to the previous two commits).
* It pulls in more than a dozen additional headers.
Split stuff related to run state management into its own header
sysemu/runstate.h.
Touching sysemu/sysemu.h now recompiles some 850 objects. qemu/uuid.h
also drops from 1100 to 850, and qapi/qapi-types-run-state.h from 4400
to 4200. Touching new sysemu/runstate.h recompiles some 500 objects.
Since I'm touching MAINTAINERS to add sysemu/runstate.h anyway, also
add qemu/main-loop.h.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190812052359.30071-30-armbru@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
[Unbreak OS-X build]
In my "build everything" tree, changing qemu/main-loop.h triggers a
recompile of some 5600 out of 6600 objects (not counting tests and
objects that don't depend on qemu/osdep.h). It includes block/aio.h,
which in turn includes qemu/event_notifier.h, qemu/notify.h,
qemu/processor.h, qemu/qsp.h, qemu/queue.h, qemu/thread-posix.h,
qemu/thread.h, qemu/timer.h, and a few more.
Include qemu/main-loop.h only where it's needed. Touching it now
recompiles only some 1700 objects. For block/aio.h and
qemu/event_notifier.h, these numbers drop from 5600 to 2800. For the
others, they shrink only slightly.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190812052359.30071-21-armbru@redhat.com>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Tested-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Let's suggest to the user how the machine should be configured to allow
the guest to boot successfully.
Suggested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <155799221739.527449.14907564571096243745.stgit@bahia.lan>
Reviewed-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
[dwg: Adjusted for style error]
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>