-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABAgAGBQJcsHOvAAoJEC7Z13T+cC21IgoP/37yKOLOuagT4an9L573qWPp
xCS48CJ4rNkpWXP3SHuTe+UHSp20sk+5b6rgM/VkLT2d301WS4gxF/VVu85ZFxGX
tkqDwnljb87MqwTbH5Yj/U4mmq8tZkNg4CqmWuvJhWv4aFe6T/YtAzUkl7y7YCcT
QfKqErl363JJKkL7cz+QWopFn5Gl/hy+mvEhbEtexWLIV1UNZ1i2hPWURfkh8FWH
C1047CAfnC/rEJy0GcwkH4iCem8n4LWkMuf3Zehq+Yx+f2e8FfxMkoOtLJVCoKWj
qoMidAplGqUxLoamZsbU1wEzwH6YH28X+uNUULsgDIIBuyW35ymbGsGTmKmNm6ED
zVM1K7badvLeO3PXBxUkviZk7UFjxjXz3xCQMheY47wPoslfX+EN0xUvQ2gW2MvO
Dhb9oPWsr/PFrMMJ35D2OOFH5kJC8Sj30YiP5lsnRoUBi4ecHCIUSlw6esKuYI+H
JfDAYzxe6QqoGg5cxSNUXP+vAgU/FQq+nGNGHzHnsIR4Udt/JsAtNwxQ9DCbYR0C
LA5qxZZTqkDtLPAHynqOzd8m7AoNaBSAgP2qp7Yp8ItXMyemlW2OIYR7yRxuL5bH
zWO/deGYHp3+j9/Z0quzSUL5G85m0o1xgRJcJe9T2fYjWgsy271WFqaZg1JEvpI6
zHcXEw71B7WQuAFaFO1+
=tJvv
-----END PGP SIGNATURE-----
Merge tag 's390-ccw-bios-2019-04-12' into s390-next-staging
Support for booting from a vfio-ccw passthrough dasd device
# gpg: Signature made Fri 12 Apr 2019 01:17:03 PM CEST
# gpg: using RSA key 2ED9D774FE702DB5
# gpg: Good signature from "Thomas Huth <th.huth@gmx.de>" [full]
# gpg: aka "Thomas Huth <thuth@redhat.com>" [undefined]
# gpg: aka "Thomas Huth <huth@tuxfamily.org>" [undefined]
# gpg: aka "Thomas Huth <th.huth@posteo.de>" [unknown]
* tag 's390-ccw-bios-2019-04-12':
pc-bios/s390: Update firmware images
s390-bios: Use control unit type to find bootable devices
s390-bios: Support booting from real dasd device
s390-bios: Add channel command codes/structs needed for dasd-ipl
s390-bios: Use control unit type to determine boot method
s390-bios: Refactor virtio to run channel programs via cio
s390-bios: Factor finding boot device out of virtio code path
s390-bios: Extend find_dev() for non-virtio devices
s390-bios: cio error handling
s390-bios: Support for running format-0/1 channel programs
s390-bios: ptr2u32 and u32toptr
s390-bios: Map low core memory
s390-bios: Decouple channel i/o logic from virtio
s390-bios: Clean up cio.h
s390-bios: decouple common boot logic from virtio
s390-bios: decouple cio setup from virtio
s390 vfio-ccw: Add bootindex property and IPLB data
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Rename qemu_getrampagesize() to qemu_minrampagesize(). While at it,
properly rename find_max_supported_pagesize() to
find_min_backend_pagesize().
s390x is actually interested into the maximum ram pagesize, so
introduce and use qemu_maxrampagesize().
Add a TODO, indicating that looking at any mapped memory backends is not
100% correct in some cases.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20190417113143.5551-3-david@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20190417190641.26814-8-armbru@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Add bootindex property and iplb data for vfio-ccw devices. This allows us to
forward boot information into the bios for vfio-ccw devices.
Refactor s390_get_ccw_device() to return device type. This prevents us from
having to use messy casting logic in several places.
Signed-off-by: Jason J. Herne <jjherne@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <1554388475-18329-2-git-send-email-jjherne@linux.ibm.com>
[thuth: fixed "typedef struct VFIOCCWDevice" build failure with clang]
Signed-off-by: Thomas Huth <thuth@redhat.com>
The GCC 9 compiler complains about many places in s390 code
that take the address of members of the 'struct SCHIB' which
is marked packed:
hw/vfio/ccw.c: In function ‘vfio_ccw_io_notifier_handler’:
hw/vfio/ccw.c:133:15: warning: taking address of packed member of ‘struct SCHIB’ may result in an unaligned pointer value \
[-Waddress-of-packed-member]
133 | SCSW *s = &sch->curr_status.scsw;
| ^~~~~~~~~~~~~~~~~~~~~~
hw/vfio/ccw.c:134:15: warning: taking address of packed member of ‘struct SCHIB’ may result in an unaligned pointer value \
[-Waddress-of-packed-member]
134 | PMCW *p = &sch->curr_status.pmcw;
| ^~~~~~~~~~~~~~~~~~~~~~
...snip many more...
Almost all of these are just done for convenience to avoid
typing out long variable/field names when referencing struct
members. We can get most of this convenience by taking the
address of the 'struct SCHIB' instead, avoiding triggering
the compiler warnings.
In a couple of places we copy via a local variable which is
a technique already applied elsewhere in s390 code for this
problem.
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20190329111104.17223-12-berrange@redhat.com>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Reviewed-by: Farhan Ali <alifm@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Some trace points are attributed to the wrong source file. Happens
when we neglect to update trace-events for code motion, or add events
in the wrong place, or misspell the file name.
Clean up with help of cleanup-trace-events.pl. Same funnies as in the
previous commit, of course. Manually shorten its change to
linux-user/trace-events to */signal.c.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20190314180929.27722-6-armbru@redhat.com
Message-Id: <20190314180929.27722-6-armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Tracked down with cleanup-trace-events.pl. Funnies requiring manual
post-processing:
* block.c and blockdev.c trace points are in block/trace-events.
* hw/block/nvme.c uses the preprocessor to hide its trace point use
from cleanup-trace-events.pl.
* include/hw/xen/xen_common.h trace points are in hw/xen/trace-events.
* net/colo-compare and net/filter-rewriter.c use pseudo trace points
colo_compare_udp_miscompare and colo_filter_rewriter_debug to guard
debug code.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-id: 20190314180929.27722-5-armbru@redhat.com
Message-Id: <20190314180929.27722-5-armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We spell out sub/dir/ in sub/dir/trace-events' comments pointing to
source files. That's because when trace-events got split up, the
comments were moved verbatim.
Delete the sub/dir/ part from these comments. Gets rid of several
misspellings.
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Message-id: 20190314180929.27722-3-armbru@redhat.com
Message-Id: <20190314180929.27722-3-armbru@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This makes vfio_get_region_info_cap() to be used in quirks.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20190307050518.64968-3-aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The "systempagesize" name suggests that it is the host system page size
while it is the smallest page size of memory backing the guest RAM so
let's rename it to stop confusion. This should cause no behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20190227085149.38596-4-aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The current code assumes that we can address more bits on a PCI bus
for DMA than we really can but there is no way knowing the actual limit.
This makes a better guess for the number of levels and if the kernel
fails to allocate that, this increases the level numbers till succeeded
or reached the 64bit limit.
This adds levels to the trace point.
This may cause the kernel to warn about failed allocation:
[65122.837458] Failed to allocate a TCE memory, level shift=28
which might happen if MAX_ORDER is not large enough as it can vary:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/Kconfig?h=v5.0-rc2#n727
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Message-Id: <20190227085149.38596-3-aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Kick the display link up event with a 0.1 sec delay,
so the guest has a chance to notice the link down first.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
[update for redefined macro]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This allows configure the display resolution which the vgpu should use.
The information will be passed to the guest using EDID, so the mdev
driver must support the vfio edid region for this to work.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch adds EDID support to the vfio display (aka vgpu) code.
When supported by the mdev driver qemu will generate a EDID blob
and pass it on using the new vfio edid region. The EDID blob will
be updated on UI changes (i.e. window resize), so the guest can
adapt.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
[remove control flow via macro, use unsigned format specifier]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
CONFIG_VFIO_PCI was not "default y" - and once you do that, it is also
important to disable it if PCI is not there.
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The make_device_config.sh script is replaced by minikconf, which
is modified to support the same command line as its predecessor.
The roots of the parsing are default-configs/*.mak, Kconfig.host and
hw/Kconfig. One difference with make_device_config.sh is that all symbols
have to be defined in a Kconfig file, including those coming from the
configure script. This is the reason for the Kconfig.host file introduced
in the previous patch. Whenever a file in default-configs/*.mak used
$(...) to refer to a config-host.mak symbol, this is replaced by a
Kconfig dependency; this part must be done already in this patch
for bisectability.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Acked-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20190123065618.3520-28-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The Kconfig files were generated mostly with this script:
for i in `grep -ho CONFIG_[A-Z0-9_]* default-configs/* | sort -u`; do
set fnord `git grep -lw $i -- 'hw/*/Makefile.objs' `
shift
if test $# = 1; then
cat >> $(dirname $1)/Kconfig << EOF
config ${i#CONFIG_}
bool
EOF
git add $(dirname $1)/Kconfig
else
echo $i $*
fi
done
sed -i '$d' hw/*/Kconfig
for i in hw/*; do
if test -d $i && ! test -f $i/Kconfig; then
touch $i/Kconfig
git add $i/Kconfig
fi
done
Whenever a symbol is referenced from multiple subdirectories, the
script prints the list of directories that reference the symbol.
These symbols have to be added manually to the Kconfig files.
Kconfig.host and hw/Kconfig were created manually.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Message-Id: <20190123065618.3520-27-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Lots of work on tests: BiosTablesTest UEFI app,
vhost-user testing for non-Linux hosts.
Misc cleanups and fixes all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJccBqMAAoJECgfDbjSjVRpvSEIAKYPRNdCBX/SSS/L/tmJS5Zt
8IyU/HW1YJ249vO+aT6z4Q3QPgqNC3KjXC3brx/WRoPZnRroen4rv2Kqnk6SayPa
a52d2ubXKWxb3swdG1CAVzFRhq/ABpgAPx0dr1JW+RXgo2lxpJ4GNYxKMosQTaPE
hRNeXl1XlcIK525kJhFH3Hlij9mTRuY6T7ydpPQd8dUq2dBRaL9RrzZRrkZxCy6l
gQPUqNzPhG0XXyOiJmwYyVX0zGzbYrMLrMQAor2SBIYmU+zv2eZGPJUYxoMTUMzt
YR0WCpvkvPITlAryaBoozAIDYVz8PxBRT1KRwpDal+2rzlm6o+veKDiF8R46gn0=
=GzUz
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging
pci, pc, virtio: fixes, cleanups, tests
Lots of work on tests: BiosTablesTest UEFI app,
vhost-user testing for non-Linux hosts.
Misc cleanups and fixes all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# gpg: Signature made Fri 22 Feb 2019 15:51:40 GMT
# gpg: using RSA key 281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>" [full]
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>" [full]
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* remotes/mst/tags/for_upstream: (26 commits)
pci: Sanity test minimum downstream LNKSTA
hw/smbios: fix offset of type 3 sku field
pci: Move NVIDIA vendor id to the rest of ids
virtio-balloon: Safely handle BALLOON_PAGE_SIZE < host page size
virtio-balloon: Use ram_block_discard_range() instead of raw madvise()
virtio-balloon: Rework ballon_page() interface
virtio-balloon: Corrections to address verification
virtio-balloon: Remove unnecessary MADV_WILLNEED on deflate
i386/kvm: ignore masked irqs when update msi routes
contrib/vhost-user-blk: fix the compilation issue
Revert "contrib/vhost-user-blk: fix the compilation issue"
pc-dimm: use same mechanism for [get|set]_addr
tests/data: introduce "uefi-boot-images" with the "bios-tables-test" ISOs
tests/uefi-test-tools: add build scripts
tests: introduce "uefi-test-tools" with the BiosTablesTest UEFI app
roms: build the EfiRom utility from the roms/edk2 submodule
roms: add the edk2 project as a git submodule
vhost-user-test: create a temporary directory per TestServer
vhost-user-test: small changes to init_hugepagefs
vhost-user-test: create a main loop per TestServer
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Introduces hot plug/unplug support for the vfio-ap device.
To hot plug a vfio-ap device using the QEMU device_add command:
(qemu) device_add vfio-ap,sysfsdev=$path-to-mdev
Where $path-to-mdev is the absolute path to the mediated matrix device
to which AP resources to be used by the guest have been assigned.
A vfio-ap device can be hot plugged only if:
1. A vfio-ap device has not been attached to the virtual machine's ap-bus
via the QEMU command line or a prior hot plug action.
2. The guest was started with the CPU model feature for AP enabled
(e.g., -cpu host,ap=on)
To hot unplug a vfio-ap device using the QEMU device_del command:
(qemu) device_del vfio-ap,sysfsdev=$path-to-mdev
Where $path-to-mdev is the absolute path to the mediated matrix device
specified when the vfio-ap device was attached to the virtual machine's
ap-bus.
A vfio-ap device can be hot unplugged only if:
1. A vfio-ap device has been attached to the virtual machine's ap-bus
via the QEMU command line or a prior hot plug action.
2. The guest was started with the CPU model feature for AP enabled
(e.g., -cpu host,ap=on)
Please note that a hot plug handler is not necessary for the vfio-ap device
because the AP matrix configuration for the guest is performed by the
kernel device driver when the vfio-ap device is realized. The vfio-ap device
represents a VFIO mediated device created in the host sysfs for use by a guest.
The mdev device is configured with an AP matrix (i.e., adapters and domains) via
its sysfs attribute interfaces prior to starting the guest or plugging a vfio-ap
device in. When the device is realized, a file descriptor is opened on the mdev
device which results in a callback to the vfio_ap kernel device driver. The
device driver then configures the AP matrix in the guest's SIE state description
from the AP matrix assigned via the mdev device's sysfs interfaces. The AP
devices will be created for the guest when the AP bus running on the guest
subsequently performs its periodic scan for AP devices.
The qdev_simple_device_unplug_cb() callback function is used for the same
reaons; namely, the vfio_ap kernel device driver will perform the AP resource
de-configuration for the guest when the vfio-ap device is unplugged. When the
vfio-ap device is unrealized, the mdev device file descriptor is closed which
results in a callback to the vfio_ap kernel device driver. The device driver
then clears the AP matrix configuration in the guest's SIE state description
and resets all of the affected queues. The AP devices created for the guest
will be removed when the AP bus running on the guest subsequently performs
its periodic scan and finds there are no longer any AP resources assigned to the
guest.
Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Tested-by: Pierre Morel <pmorel@linux.ibm.com>
Message-Id: <1550519397-25359-2-git-send-email-akrowiak@linux.ibm.com>
[CH: adapt to changed qbus_set_hotplug_handler() signature]
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
sPAPR code will use it too so move it from VFIO to the common code.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-Id: <20190214051440.59167-1-aik@ozlabs.ru>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We introduce the vfio_init_container_type() helper.
It computes the highest usable iommu type and then
set the container and the iommu type.
Its usage in vfio_connect_container() makes the code
ready for addition of new iommu types.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
A kernel bug was introduced in v4.15 via commit 71a7d3d78e3c which
adds a test for address space wrap-around in the vfio DMA unmap path.
Unfortunately due to overflow, the kernel detects an unmap of the last
page in the 64-bit address space as a wrap-around. In QEMU, a Q35
guest with VT-d emulation and guest IOMMU enabled will attempt to make
such an unmap request during VM system reset, triggering an error:
qemu-kvm: VFIO_UNMAP_DMA: -22
qemu-kvm: vfio_dma_unmap(0x561f059948f0, 0xfef00000, 0xffffffff01100000) = -22 (Invalid argument)
Here the IOVA start address (0xfef00000) and the size parameter
(0xffffffff01100000) add to exactly 2^64, triggering the bug. A
kernel fix is queued for the Linux v5.0 release to address this.
This patch implements a workaround to retry the unmap, excluding the
final page of the range when we detect an unmap failing which matches
the requirements for this issue. This is expected to be a safe and
complete workaround as the VT-d address space does not extend to the
full 64-bit space and therefore the last page should never be mapped.
This workaround can be removed once all kernels with this bug are
sufficiently deprecated.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
Reported-by: Pei Zhang <pezhang@redhat.com>
Debugged-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Make hw/vfio configurable and add new CONFIG_VFIO_* to the
default-configs/s390x*-softmmu.mak. This allow a finer-grain
selection of the various VFIO backends.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20190202072456.6468-28-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead of wrapping the entire Makefile.objs with an ifeq/endif, just
include the directory only for Linux.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20190202072456.6468-4-yang.zhong@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The '%m' format instructs glibc's printf()/syslog() implementation to
insert the contents of strerror(errno). Since this is a glibc extension
it should generally be avoided in QEMU due to need for portability to a
variety of platforms.
Even though vfio is Linux-only code that could otherwise use "%m", it
must still be avoided in trace-events files because several of the
backends do not use the format string and so this error information is
invisible to them.
The errno string value should be given as an explicit trace argument
instead, making it accessible to all backends. This also allows it to
work correctly with future patches that use the format string with
systemtap's simple printf code.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Message-id: 20190123120016.4538-4-berrange@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This will be needed when we change the QTAILQ head and elem structs
to unions. However, it is also consistent with the usage elsewhere
in QEMU for other list head structs (see for example FsMountList).
Note that most QTAILQs only need their name in order to do backwards
walks. Those do not break with the struct->union change, and anyway
the change will also remove the need to name heads when doing backwards
walks, so those are not touched here.
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is not used outside hw/vfio/common.c, so it does not need to
be extern.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VTD fixes
IR and split irqchip are now the default for Q35
ACPI refactoring
hotplug refactoring
new names for virtio devices
multiple pcie link width/speeds
PCI fixes
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJcG967AAoJECgfDbjSjVRpUlMH/1wy8WW7Br/4JxlWUPsfZTqZ
0Lg2n59wuFzRVS+gLotp6bUaJGR9xn9fKjI1wfD59oVrDTKyauuw82v0OityEs33
ZquFecuJvP6k5N40DkqA9YJjKSw7nUmLrsyrD0t2H43npikP2aD9f6yootrr3oVV
nPwBvyT9ykIBQc7IzsHDiLw3EPdIf+9IR7+l+PLptzV0zK43Jfwi/nzHIQq00UMz
eLM/ejQLIx4BZJnYGS0Cy6v3K7cS3k45LHDY0cGc0id2jHFN2vdQyHCF9I1ps72q
pSlhMaLEwvQSYyre6iFTG5uuvyIPWv3LOkaBEwMSA5B/HXuEb2RKHThYzS9dc68=
=OwW7
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging
pci, pc, virtio: fixes, features
VTD fixes
IR and split irqchip are now the default for Q35
ACPI refactoring
hotplug refactoring
new names for virtio devices
multiple pcie link width/speeds
PCI fixes
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# gpg: Signature made Thu 20 Dec 2018 18:26:03 GMT
# gpg: using RSA key 281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>"
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>"
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* remotes/mst/tags/for_upstream: (44 commits)
x86-iommu: turn on IR by default if proper
x86-iommu: switch intr_supported to OnOffAuto type
q35: set split kernel irqchip as default
pci: Adjust PCI config limit based on bus topology
spapr_pci: perform unplug via the hotplug handler
pci/shpc: perform unplug via the hotplug handler
pci: Reuse pci-bridge hotplug handler handlers for pcie-pci-bridge
pci/pcie: perform unplug via the hotplug handler
pci/pcihp: perform unplug via the hotplug handler
pci/pcihp: overwrite hotplug handler recursively from the start
pci/pcihp: perform check for bus capability in pre_plug handler
s390x/pci: rename hotplug handler callbacks
pci/shpc: rename hotplug handler callbacks
pci/pcie: rename hotplug handler callbacks
hw/i386: Remove deprecated machines pc-0.10 and pc-0.11
hw: acpi: Remove AcpiRsdpDescriptor and fix tests
hw: acpi: Export and share the ARM RSDP build
hw: arm: Support both legacy and current RSDP build
hw: arm: Convert the RSDP build to the buid_append_foo() API
hw: arm: Carry RSDP specific data through AcpiRsdpData
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.
This commit was created with scripts/clean-includes, with the changes
to the following files manually reverted:
contrib/libvhost-user/libvhost-user-glib.h
contrib/libvhost-user/libvhost-user.c
contrib/libvhost-user/libvhost-user.h
linux-user/mips64/cpu_loop.c
linux-user/mips64/signal.c
linux-user/sparc64/cpu_loop.c
linux-user/sparc64/signal.c
linux-user/x86_64/cpu_loop.c
linux-user/x86_64/signal.c
target/s390x/gen-features.c
tests/migration/s390x/a-b-bios.c
tests/test-rcu-simpleq.c
tests/test-rcu-tailq.c
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20181204172535.2799-1-armbru@redhat.com>
Acked-by: Eduardo Habkost <ehabkost@redhat.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Acked-by: Yuval Shaia <yuval.shaia@oracle.com>
Acked-by: Viktor Prutyanov <viktor.prutyanov@phystech.edu>
Now that the downstream port will virtually negotiate itself to the
link status of the downstream device, we can remove this emulation.
It's not clear that it was every terribly useful anyway.
Tested-by: Geoffrey McRae <geoff@hostfission.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
In preparation for reporting higher virtual link speeds and widths,
create enums and macros to help us manage them.
Cc: Marcel Apfelbaum <marcel.apfelbaum@gmail.com>
Tested-by: Geoffrey McRae <geoff@hostfission.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vfio-ap devices do not pin any pages in the host. Therefore, they
are compatible with memory ballooning.
Flag them as compatible, so both vfio-ap and a balloon can be
used simultaneously.
Cc: qemu-stable@nongnu.org
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Tony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Halil Pasic <pasic@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
If ioctl(..., VFIO_DEVICE_RESET) fails, we want to report errno
instead of ret (which is always -1 on error).
Fixes Coverity issue CID 1396176.
Reported-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Tony Krowiak <akrowiak@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
The previous commit changed vfio's warning messages from
vfio warning: DEV-NAME: Could not frobnicate
to
warning: vfio DEV-NAME: Could not frobnicate
To match this change, change error messages from
vfio error: DEV-NAME: On fire
to
vfio DEV-NAME: On fire
Note the loss of "error". If we think marking error messages that way
is a good idea, we should mark *all* error messages, i.e. make
error_report() print it.
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20181017082702.5581-7-armbru@redhat.com>
The vfio code reports warnings like
error_report(WARN_PREFIX "Could not frobnicate", DEV-NAME);
where WARN_PREFIX is defined so the message comes out as
vfio warning: DEV-NAME: Could not frobnicate
This usage predates the introduction of warn_report() & friends in
commit 97f40301f1. It's time to convert to that interface. Since
these functions already prefix the message with "warning: ", replace
WARN_PREFIX by VFIO_MSG_PREFIX, so the messages come out like
warning: vfio DEV-NAME: Could not frobnicate
The next commit will replace ERR_PREFIX.
Cc: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20181017082702.5581-6-armbru@redhat.com>
From include/qapi/error.h:
* Pass an existing error to the caller with the message modified:
* error_propagate(errp, err);
* error_prepend(errp, "Could not frobnicate '%s': ", name);
Fei Li pointed out that doing error_propagate() first doesn't work
well when @errp is &error_fatal or &error_abort: the error_prepend()
is never reached.
Since I doubt fixing the documentation will stop people from getting
it wrong, introduce error_propagate_prepend(), in the hope that it
lures people away from using its constituents in the wrong order.
Update the instructions in error.h accordingly.
Convert existing error_prepend() next to error_propagate to
error_propagate_prepend(). If any of these get reached with
&error_fatal or &error_abort, the error messages improve. I didn't
check whether that's the case anywhere.
Cc: Fei Li <fli@suse.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20181017082702.5581-2-armbru@redhat.com>
Define a TYPE_VFIO_PCI and drop DO_UPCAST.
Signed-off-by: Li Qiang <liq3ea@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Up to now the vfio-platform device has been abstract and could not be
instantiated. The integration of a new vfio platform device required
creating a dummy derived device which only set the compatible string.
Following the few vfio-platform device integrations we have seen the
actual requested adaptation happens on device tree node creation
(sysbus-fdt).
Hence remove the abstract setting, and read the list of compatible
values from sysfs if not set by a derived device.
Update the amd-xgbe and calxeda-xgmac drivers to fill in the number of
compatible values, as there can now be more than one.
Note that sysbus-fdt does not support the instantiation of the
vfio-platform device yet.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
[geert: Rebase, set user_creatable=true, use compatible values in sysfs
instead of user-supplied manufacturer/model options, reword]
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
So we have a boot display when using a vgpu as primary display.
ramfb depends on a fw_cfg file. fw_cfg files can not be added and
removed at runtime, therefore a ramfb-enabled vfio device can't be
hotplugged.
Add a nohotplug variant of the vfio-pci device (as child class). Add
the ramfb property to the nohotplug variant only. So to enable the vgpu
display with boot support use this:
-device vfio-pci-nohotplug,display=on,ramfb=on,sysfsdev=...
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Introduces a VFIO based AP device. The device is defined via
the QEMU command line by specifying:
-device vfio-ap,sysfsdev=<path-to-mediated-matrix-device>
There may be only one vfio-ap device configured for a guest.
The mediated matrix device is created by the VFIO AP device
driver by writing a UUID to a sysfs attribute file (see
docs/vfio-ap.txt). The mediated matrix device will be named
after the UUID. Symbolic links to the $uuid are created in
many places, so the path to the mediated matrix device $uuid
can be specified in any of the following ways:
/sys/devices/vfio_ap/matrix/$uuid
/sys/devices/vfio_ap/matrix/mdev_supported_types/vfio_ap-passthrough/devices/$uuid
/sys/bus/mdev/devices/$uuid
/sys/bus/mdev/drivers/vfio_mdev/$uuid
When the vfio-ap device is realized, it acquires and opens the
VFIO iommu group to which the mediated matrix device is
bound. This causes a VFIO group notification event to be
signaled. The vfio_ap device driver's group notification
handler will get called at which time the device driver
will configure the the AP devices to which the guest will
be granted access.
Signed-off-by: Tony Krowiak <akrowiak@linux.ibm.com>
Tested-by: Pierre Morel <pmorel@linux.ibm.com>
Acked-by: Halil Pasic <pasic@linux.ibm.com>
Tested-by: Pierre Morel <pmorel@linux.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Message-Id: <20181010170309.12045-6-akrowiak@linux.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
[CH: added missing g_free and device category]
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Add two functions to print an error/warning report once depending
on a passed-in condition variable and flip it if printed. This is
useful if you want to print a message not once-globally, but e.g.
once-per-device.
Inspired by warn_once() in hw/vfio/ccw.c, which has been replaced
with warn_report_once_cond().
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Message-Id: <20180830145902.27376-2-cohuck@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Function comments reworded]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
A new error path fails to close the device file descriptor when
triggered by a ballooning incompatibility within the group. Fix it.
Fixes: 238e917285 ("vfio/ccw/pci: Allow devices to opt-in for ballooning")
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Fix error reported by Coverity where realpath can return NULL,
resulting in a segfault in strcmp(). This should never happen given
that we're working through regularly structured sysfs paths, but
trivial enough to easily avoid.
Fixes: 238e917285 ("vfio/ccw/pci: Allow devices to opt-in for ballooning")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
At the moment the PPC64/pseries guest only supports 4K/64K/16M IOMMU
pages and POWER8 CPU supports the exact same set of page size so
so far things worked fine.
However POWER9 supports different set of sizes - 4K/64K/2M/1G and
the last two - 2M and 1G - are not even allowed in the paravirt interface
(RTAS DDW) so we always end up using 64K IOMMU pages, although we could
back guest's 16MB IOMMU pages with 2MB pages on the host.
This stores the supported host IOMMU page sizes in VFIOContainer and uses
this later when creating a new DMA window. This uses the system page size
(64k normally, 2M/16M/1G if hugepages used) as the upper limit of
the IOMMU pagesize.
This changes the type of @pagesize to uint64_t as this is what
memory_region_iommu_get_min_page_size() returns and clz64() takes.
There should be no behavioral changes on platforms other than pseries.
The guest will keep using the IOMMU page size selected by the PHB pagesize
property as this only changes the underlying hardware TCE table
granularity.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
If a vfio assigned device makes use of a physical IOMMU, then memory
ballooning is necessarily inhibited due to the page pinning, lack of
page level granularity at the IOMMU, and sufficient notifiers to both
remove the page on balloon inflation and add it back on deflation.
However, not all devices are backed by a physical IOMMU. In the case
of mediated devices, if a vendor driver is well synchronized with the
guest driver, such that only pages actively used by the guest driver
are pinned by the host mdev vendor driver, then there should be no
overlap between pages available for the balloon driver and pages
actively in use by the device. Under these conditions, ballooning
should be safe.
vfio-ccw devices are always mediated devices and always operate under
the constraints above. Therefore we can consider all vfio-ccw devices
as balloon compatible.
The situation is far from straightforward with vfio-pci. These
devices can be physical devices with physical IOMMU backing or
mediated devices where it is unknown whether a physical IOMMU is in
use or whether the vendor driver is well synchronized to the working
set of the guest driver. The safest approach is therefore to assume
all vfio-pci devices are incompatible with ballooning, but allow user
opt-in should they have further insight into mediated devices.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
We use a VFIOContainer to associate an AddressSpace to one or more
VFIOGroups. The VFIOContainer represents the DMA context for that
AdressSpace for those VFIOGroups and is synchronized to changes in
that AddressSpace via a MemoryListener. For IOMMU backed devices,
maintaining the DMA context for a VFIOGroup generally involves
pinning a host virtual address in order to create a stable host
physical address and then mapping a translation from the associated
guest physical address to that host physical address into the IOMMU.
While the above maintains the VFIOContainer synchronized to the QEMU
memory API of the VM, memory ballooning occurs outside of that API.
Inflating the memory balloon (ie. cooperatively capturing pages from
the guest for use by the host) simply uses MADV_DONTNEED to "zap"
pages from QEMU's host virtual address space. The page pinning and
IOMMU mapping above remains in place, negating the host's ability to
reuse the page, but the host virtual to host physical mapping of the
page is invalidated outside of QEMU's memory API.
When the balloon is later deflated, attempting to cooperatively
return pages to the guest, the page is simply freed by the guest
balloon driver, allowing it to be used in the guest and incurring a
page fault when that occurs. The page fault maps a new host physical
page backing the existing host virtual address, meanwhile the
VFIOContainer still maintains the translation to the original host
physical address. At this point the guest vCPU and any assigned
devices will map different host physical addresses to the same guest
physical address. Badness.
The IOMMU typically does not have page level granularity with which
it can track this mapping without also incurring inefficiencies in
using page size mappings throughout. MMU notifiers in the host
kernel also provide indicators for invalidating the mapping on
balloon inflation, not for updating the mapping when the balloon is
deflated. For these reasons we assume a default behavior that the
mapping of each VFIOGroup into the VFIOContainer is incompatible
with memory ballooning and increment the balloon inhibitor to match
the attached VFIOGroups.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
PCI devices needing a ROM allocate an optional MemoryRegion with
pci_add_option_rom(). pci_del_option_rom() does the cleanup when the
device is destroyed. The only action taken by this routine is to call
vmstate_unregister_ram() which clears the id string of the optional
ROM RAMBlock and now, also flags the RAMBlock as non-migratable. This
was recently added by commit b895de5027 ("migration: discard
non-migratable RAMBlocks"), .
VFIO devices do their own loading of the PCI option ROM in
vfio_pci_size_rom(). The memory region is switched to an I/O region
and the PCI attribute 'has_rom' is set but the RAMBlock of the ROM
region is not allocated. When the associated PCI device is deleted,
pci_del_option_rom() calls vmstate_unregister_ram() which tries to
flag a NULL RAMBlock, leading to a SEGV.
It seems that 'has_rom' was set to have memory_region_destroy()
called, but since commit 469b046ead ("memory: remove
memory_region_destroy") this is not necessary anymore as the
MemoryRegion is freed automagically.
Remove the PCIDevice 'has_rom' attribute setting in vfio.
Fixes: b895de5027 ("migration: discard non-migratable RAMBlocks")
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
It eases code review, unit is explicit.
Patch generated using:
$ git grep -E '(1024|2048|4096|8192|(<<|>>).?(10|20|30))' hw/ include/hw/
and modified manually.
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20180625124238.25339-38-f4bug@amsat.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is at least one guest (OS) such that although it does not rely on
the guarantees provided by ORB 1 word 9 bit (aka unlimited prefetch, aka
P bit) not being set, it fails to tell this to the machine.
Usually this ain't a big deal, as the original purpose of the P bit is to
allow for performance optimizations. vfio-ccw however can not provide the
guarantees required if the bit is not set.
It is not possible to implement support for the P bit not set without
transitioning to lower level protocols for vfio-ccw. So let's give the
user the opportunity to force setting the P bit, if the user knows this
is safe. For self modifying channel programs forcing the P bit is not
safe. If the P bit is forced for a self modifying channel program things
are expected to break in strange ways.
Let's also avoid warning multiple about P bit not set in the ORB in case
P bit is not told to be forced, and designate the affected vfio-ccw
device.
Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Suggested-by: Dong Jia Shi <bjsdjshi@linux.ibm.com>
Acked-by: Jason J. Herne <jjherne@linux.ibm.com>
Tested-by: Jason J. Herne <jjherne@linux.ibm.com>
Message-Id: <20180524175828.3143-2-pasic@linux.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Add support for multiple IOMMU indexes to the IOMMU notifier APIs.
When initializing a notifier with iommu_notifier_init(), the caller
must pass the IOMMU index that it is interested in. When a change
happens, the IOMMU implementation must pass
memory_region_notify_iommu() the IOMMU index that has changed and
that notifiers must be called for.
IOMMUs which support only a single index don't need to change.
Callers which only really support working with IOMMUs with a single
index can use the result of passing MEMTXATTRS_UNSPECIFIED to
memory_region_iommu_attrs_to_index().
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20180604152941.20374-3-peter.maydell@linaro.org
Commit a9994687cb ("vfio/display: core & wireup") added display
support to vfio-pci with the default being "auto", which breaks
existing VMs when the vGPU requires GL support but had no previous
requirement for a GL compatible configuration. "Off" is the safer
default as we impose no new requirements to VM configurations.
Fixes: a9994687cb ("vfio/display: core & wireup")
Cc: qemu-stable@nongnu.org
Cc: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
With vfio ioeventfd support, we can program vfio-pci to perform a
specified BAR write when an eventfd is triggered. This allows the
KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
userspace handling for these events. On the same micro-benchmark
where the ioeventfd got us to almost 90% of performance versus
disabling the GeForce quirks, this gets us to within 95%.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The NVIDIA BAR0 quirks virtualize the PCI config space mirrors found
in device MMIO space. Normally PCI config space is considered a slow
path and further optimization is unnecessary, however NVIDIA uses a
register here to enable the MSI interrupt to re-trigger. Exiting to
QEMU for this MSI-ACK handling can therefore rate limit our interrupt
handling. Fortunately the MSI-ACK write is easily detected since the
quirk MemoryRegion otherwise has very few accesses, so simply looking
for consecutive writes with the same data is sufficient, in this case
10 consecutive writes with the same data and size is arbitrarily
chosen. We configure the KVM ioeventfd with data match, so there's
no risk of triggering for the wrong data or size, but we do risk that
pathological driver behavior might consume all of QEMU's file
descriptors, so we cap ourselves to 10 ioeventfds for this purpose.
In support of the above, generic ioeventfd infrastructure is added
for vfio quirks. This automatically initializes an ioeventfd list
per quirk, disables and frees ioeventfds on exit, and allows
ioeventfds marked as dynamic to be dropped on device reset. The
rationale for this latter feature is that useful ioeventfds may
depend on specific driver behavior and since we necessarily place a
cap on our use of ioeventfds, a machine reset is a reasonable point
at which to assume a new driver and re-profile.
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Quirks can be self modifying, provide a hook to allow them to cleanup
on device reset if desired.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This will later be used to include list initialization.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
No declaration of "hw/vfio/vfio-common.h" directly requires to include
the "exec/address-spaces.h" header. To simplify dependencies and
ease the upcoming cleanup of "exec/address-spaces.h", directly include
it in the source file where the declaration are used.
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Message-Id: <20180528232719.4721-2-f4bug@amsat.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As part of plumbing MemTxAttrs down to the IOMMU translate method,
add MemTxAttrs as an argument to address_space_translate()
and address_space_translate_cached(). Callers either have an
attrs value to hand, or don't care and can use MEMTXATTRS_UNSPECIFIED.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-id: 20180521140402.23318-4-peter.maydell@linaro.org
A recent patch fixed leaks of the dynamically allocated vcdev->vdev.name
field in vfio_ccw_realize(), but we now have three freeing sites for it.
This is unfortunate and seems to indicate something is wrong with its
life cycle.
The root issue is that vcdev->vdev.name is set before vfio_get_device()
is called, which theoretically prevents to call vfio_put_device() to
do the freeing. Well actually, we could call it anyway because
vfio_put_base_device() is a nop if the device isn't attached, but this
would be confusing.
This patch hence moves all the logic of attaching the device, including
the "already attached" check, to a separate vfio_ccw_get_device() function,
counterpart of vfio_put_device(). While here, vfio_put_device() is renamed
to vfio_ccw_put_device() for consistency.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <152326891065.266543.9487977590811413472.stgit@bahia.lan>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
During guest OS reboot, guest framebuffer is invalid. It will cause
bugs, if the invalid guest framebuffer is still used by host.
This patch is to introduce vfio_display_reset which is invoked
during vfio display reset. This vfio_display_reset function is used
to release the invalid display resource, disable scanout mode and
replace the invalid surface with QemuConsole's DisplaySurafce.
This patch can fix the GPU hang issue caused by gd_egl_draw during
guest OS reboot.
Changes v3->v4:
- Move dma-buf based display check into the vfio_display_reset().
(Gerd)
Changes v2->v3:
- Limit vfio_display_reset to dma-buf based vfio display. (Gerd)
Changes v1->v2:
- Use dpy_gfx_update_full() update screen after reset. (Gerd)
- Remove dpy_gfx_switch_surface(). (Gerd)
Signed-off-by: Tina Zhang <tina.zhang@intel.com>
Message-id: 1524820266-27079-3-git-send-email-tina.zhang@intel.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
If the subchannel is already attached or if vfio_get_device() fails, the
code jumps to the 'out_device_err' label and doesn't free the string it
has just allocated.
The code should be reworked so that vcdev->vdev.name only gets set when
the device has been attached, and freed when it is about to be detached.
This could be achieved with the addition of a vfio_ccw_get_device()
function that would be the counterpart of vfio_put_device(). But this is
a more elaborate cleanup that should be done in a follow-up. For now,
let's just add calls to g_free() on the buggy error paths.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <152311222681.203086.8874800175539040298.stgit@bahia>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Commit 567b5b309a ("vfio/pci: Relax DMA map errors for MMIO regions")
added an error message if a passed memory section address or size
is not aligned to the page size and thus cannot be DMA mapped.
This patch fixes the trace by printing the region name and the
memory region section offset within the address space (instead of
offset_within_region).
We also turn the error_report into a trace event. Indeed, In some
cases, the traces can be confusing to non expert end-users and
let think the use case does not work (whereas it works as before).
This is the case where a BAR is successively mapped at different
GPAs and its sections are not compatible with dma map. The listener
is called several times and traces are issued for each intermediate
mapping. The end-user cannot easily match those GPAs against the
final GPA output by lscpi. So let's keep those information to
informed users. In mid term, the plan is to advise the user about
BAR relocation relevance.
Fixes: 567b5b309a ("vfio/pci: Relax DMA map errors for MMIO regions")
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This adds a possibility for the platform to tell VFIO not to emulate MSIX
so MMIO memory regions do not get split into chunks in flatview and
the entire page can be registered as a KVM memory slot and make direct
MMIO access possible for the guest.
This enables the entire MSIX BAR mapping to the guest for the pseries
platform in order to achieve the maximum MMIO preformance for certain
devices.
Tested on:
LSI Logic / Symbios Logic SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
At the moment we unconditionally avoid mapping MSIX data of a BAR and
emulate MSIX table in QEMU. However it is 1) not always necessary as
a platform may provide a paravirt interface for MSIX configuration;
2) can affect the speed of MMIO access by emulating them in QEMU when
frequently accessed registers share same system page with MSIX data,
this is particularly a problem for systems with the page size bigger
than 4KB.
A new capability - VFIO_REGION_INFO_CAP_MSIX_MAPPABLE - has been added
to the kernel [1] which tells the userspace that mapping of the MSIX data
is possible now. This makes use of it so from now on QEMU tries mapping
the entire BAR as a whole and emulate MSIX on top of that.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a32295c612c57990d17fb0f41e7134394b2f35f6
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
At the moment if vfio_memory_listener is registered in the system memory
address space, it maps/unmaps every RAM memory region for DMA.
It expects system page size aligned memory sections so vfio_dma_map
would not fail and so far this has been the case. A mapping failure
would be fatal. A side effect of such behavior is that some MMIO pages
would not be mapped silently.
However we are going to change MSIX BAR handling so we will end having
non-aligned sections in vfio_memory_listener (more details is in
the next patch) and vfio_dma_map will exit QEMU.
In order to avoid fatal failures on what previously was not a failure and
was just silently ignored, this checks the section alignment to
the smallest supported IOMMU page size and prints an error if not aligned;
it also prints an error if vfio_dma_map failed despite the page size check.
Both errors are not fatal; only MMIO RAM regions are checked
(aka "RAM device" regions).
If the amount of errors printed is overwhelming, the MSIX relocation
could be used to avoid excessive error output.
This is unlikely to cause any behavioral change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: Fix Int128 bit ops]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Infrastructure for display support. Must be enabled
using 'display' property.
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed By: Kirti Wankhede <kwankhede@nvidia.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The license text currently specifies "any version" of the GPL. It
is unlikely that GPL v1 was ever intended; change this to the
standard "or any later version" text.
Cc: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Cc: Xiao Feng Ren <renxiaof@linux.vnet.ibm.com>
Cc: Pierre Morel <pmorel@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Acked-by: Pierre Morel <pmorel@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
basename(3) and dirname(3) modify their argument and may return
pointers to statically allocated memory which may be overwritten by
subsequent calls.
g_path_get_basename and g_path_get_dirname have no such issues, and
therefore more preferable.
Signed-off-by: Julia Suvorova <jusual@mail.ru>
Message-Id: <1519888086-4207-1-git-send-email-jusual@mail.ru>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- new stats in virtio balloon
- virtio eventfd rework for boot speedup
- vhost memory rework for boot speedup
- fixes and cleanups all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJagxKDAAoJECgfDbjSjVRp5qAH/3gmgBaIzL3KRHd5i0RZifJv
PvyAVYgZd7h0+/1r9GM7guHKyEPZ08JtbHSm/HuDV4BD/Vf3/8joy8roExIfde2A
6k8fd6ANVQmE3t5zUxNXi9qiG4pO4xDIu4cMAbixzgN9x5ttlcfTw7fTT0e0VJxJ
8SN02/uCPPR/DY4/cpjah+slSyv6rBKT1v1ONy7djyRTYHi6h3Meoh05YfEALkwA
goxTKBZHi0L1IZ3HP/ZpXJDohQ5n2P09DX0fQgb8PgmW6WIWB/Qpi5pD53LZpMCV
n9waTF0U0ahneFd2FHo22QMMrwWvQyrjv+w5uXVr+qmHb/OyH2tUt7PgGF9+QKA=
=78s5
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'remotes/mst/tags/for_upstream' into staging
virtio,vhost,pci,pc: features, fixes and cleanups
- new stats in virtio balloon
- virtio eventfd rework for boot speedup
- vhost memory rework for boot speedup
- fixes and cleanups all over the place
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# gpg: Signature made Tue 13 Feb 2018 16:29:55 GMT
# gpg: using RSA key 281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>"
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>"
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* remotes/mst/tags/for_upstream: (22 commits)
virtio-balloon: include statistics of disk/file caches
acpi-test: update FADT
lpc: drop pcie host dependency
tests: acpi: fix FADT not being compared to reference table
hw/pci-bridge: fix pcie root port's IO hints capability
libvhost-user: Support across-memory-boundary access
libvhost-user: Fix resource leak
virtio-balloon: unref the memory region before continuing
pci: removed the is_express field since a uniform interface was inserted
virtio-blk: enable multiple vectors when using multiple I/O queues
pci/bus: let it has higher migration priority
pci-bridge/i82801b11: clear bridge registers on platform reset
vhost: Move log_dirty check
vhost: Merge and delete unused callbacks
vhost: Clean out old vhost_set_memory and friends
vhost: Regenerate region list from changed sections list
vhost: Merge sections added to temporary list
vhost: Simplify ring verification checks
vhost: Build temporary section list and deref after commit
virtio: improve virtio devices initialization time
...
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
qemu-common.h includes qemu/option.h, but most places that include the
former don't actually need the latter. Drop the include, and add it
to the places that actually need it.
While there, drop superfluous includes of both headers, and
separate #include from file comment with a blank line.
This cleanup makes the number of objects depending on qemu/option.h
drop from 4545 (out of 4743) to 284 in my "build everything" tree.
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20180201111846.21846-20-armbru@redhat.com>
[Semantic conflict with commit bdd6a90a9e in block/nvme.c resolved]
according to Eduardo Habkost's commit fd3b02c889 all PCIEs now implement
INTERFACE_PCIE_DEVICE so we don't need is_express field anymore.
Devices that implements only INTERFACE_PCIE_DEVICE (is_express == 1)
or
devices that implements only INTERFACE_CONVENTIONAL_PCI_DEVICE (is_express == 0)
where not affected by the change.
The only devices that were affected are those that are hybrid and also
had (is_express == 1) - therefor only:
- hw/vfio/pci.c
- hw/usb/hcd-xhci.c
- hw/xen/xen_pt.c
For those 3 I made sure that QEMU_PCI_CAP_EXPRESS is on in instance_init()
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Yoni Bettan <ybettan@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
After next patch, listener unregister will need the container to be
alive. Let's move this unregister phase to be before unset container,
since that operation will free the backend container in kernel,
otherwise we'll get these after next patch:
qemu-system-x86_64: VFIO_UNMAP_DMA: -22
qemu-system-x86_64: vfio_dma_unmap(0x559bf53a4590, 0x0, 0xa0000) = -22 (Invalid argument)
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20180122060244.29368-4-peterx@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These quirks are necessary for GeForce, but not for Quadro/GRID/Tesla
assignment. Leaving them enabled is fully functional and provides the
most compatibility, but due to the unique NVIDIA MSI ACK behavior[1],
it also introduces latency in re-triggering the MSI interrupt. This
overhead is typically negligible, but has been shown to adversely
affect some (very) high interrupt rate applications. This adds the
vfio-pci device option "x-no-geforce-quirks=" which can be set to
"on" to disable this additional overhead.
A follow-on optimization for GeForce might be to make use of an
ioeventfd to allow KVM to trigger an irqfd in the kernel vfio-pci
driver, avoiding the bounce through userspace to handle this device
write.
[1] Background: the NVIDIA driver has been observed to issue a write
to the MMIO mirror of PCI config space in BAR0 in order to allow the
MSI interrupt for the device to retrigger. Older reports indicated a
write of 0xff to the (read-only) MSI capability ID register, while
more recently a write of 0x0 is observed at config space offset 0x704,
non-architected, extended config space of the device (BAR0 offset
0x88704). Virtualization of this range is only required for GeForce.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
There is already @hostwin in vfio_listener_region_add() so there is no
point in having the other one.
Fixes: 2e4109de8e ("vfio/spapr: Create DMA window dynamically (SPAPR IOMMU v2)")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add the initialization of the mutex protecting the interrupt list.
Signed-off-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Recently proposed vfio-pci kernel changes (v4.16) remove the
restriction preventing userspace from mmap'ing PCI BARs in areas
overlapping the MSI-X vector table. This change is primarily intended
to benefit host platforms which make use of system page sizes larger
than the PCI spec recommendation for alignment of MSI-X data
structures (ie. not x86_64). In the case of POWER systems, the SPAPR
spec requires the VM to program MSI-X using hypercalls, rendering the
MSI-X vector table unused in the VM view of the device. However,
ARM64 platforms also support 64KB pages and rely on QEMU emulation of
MSI-X. Regardless of the kernel driver allowing mmaps overlapping
the MSI-X vector table, emulation of the MSI-X vector table also
prevents direct mapping of device MMIO spaces overlapping this page.
Thanks to the fact that PCI devices have a standard self discovery
mechanism, we can try to resolve this by relocating the MSI-X data
structures, either by creating a new PCI BAR or extending an existing
BAR and updating the MSI-X capability for the new location. There's
even a very slim chance that this could benefit devices which do not
adhere to the PCI spec alignment guidelines on x86_64 systems.
This new x-msix-relocation option accepts the following choices:
off: Disable MSI-X relocation, use native device config (default)
auto: Use a known good combination for the platform/device (none yet)
bar0..bar5: Specify the target BAR for MSI-X data structures
If compatible, the target BAR will either be created or extended and
the new portion will be used for MSI-X emulation.
The first obvious user question with this option is how to determine
whether a given platform and device might benefit from this option.
In most cases, the answer is that it won't, especially on x86_64.
Devices often dedicate an entire BAR to MSI-X and therefore no
performance sensitive registers overlap the MSI-X area. Take for
example:
# lspci -vvvs 0a:00.0
0a:00.0 Ethernet controller: Intel Corporation I350 Gigabit Network Connection
...
Region 0: Memory at db680000 (32-bit, non-prefetchable) [size=512K]
Region 3: Memory at db7f8000 (32-bit, non-prefetchable) [size=16K]
...
Capabilities: [70] MSI-X: Enable+ Count=10 Masked-
Vector table: BAR=3 offset=00000000
PBA: BAR=3 offset=00002000
This device uses the 16K bar3 for MSI-X with the vector table at
offset zero and the pending bits arrary at offset 8K, fully honoring
the PCI spec alignment guidance. The data sheet specifically refers
to this as an MSI-X BAR. This device would not see a benefit from
MSI-X relocation regardless of the platform, regardless of the page
size.
However, here's another example:
# lspci -vvvs 02:00.0
02:00.0 Serial Attached SCSI controller: xxxxxxxx
...
Region 0: I/O ports at c000 [size=256]
Region 1: Memory at ef640000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at ef600000 (64-bit, non-prefetchable) [size=256K]
...
Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
Vector table: BAR=1 offset=0000e000
PBA: BAR=1 offset=0000f000
Here the MSI-X data structures are placed on separate 4K pages at the
end of a 64KB BAR. If our host page size is 4K, we're likely fine,
but at 64KB page size, MSI-X emulation at that location prevents the
entire BAR from being directly mapped into the VM address space.
Overlapping performance sensitive registers then starts to be a very
likely scenario on such a platform. At this point, the user could
enable tracing on vfio_region_read and vfio_region_write to determine
more conclusively if device accesses are being trapped through QEMU.
Upon finding a device and platform in need of MSI-X relocation, the
next problem is how to choose target PCI BAR to host the MSI-X data
structures. A few key rules to keep in mind for this selection
include:
* There are only 6 BAR slots, bar0..bar5
* 64-bit BARs occupy two BAR slots, 'lspci -vvv' lists the first slot
* PCI BARs are always a power of 2 in size, extending == doubling
* The maximum size of a 32-bit BAR is 2GB
* MSI-X data structures must reside in an MMIO BAR
Using these rules, we can evaluate each BAR of the second example
device above as follows:
bar0: I/O port BAR, incompatible with MSI-X tables
bar1: BAR could be extended, incurring another 64KB of MMIO
bar2: Unavailable, bar1 is 64-bit, this register is used by bar1
bar3: BAR could be extended, incurring another 256KB of MMIO
bar4: Unavailable, bar3 is 64bit, this register is used by bar3
bar5: Available, empty BAR, minimum additional MMIO
A secondary optimization we might wish to make in relocating MSI-X
is to minimize the additional MMIO required for the device, therefore
we might test the available choices in order of preference as bar5,
bar1, and finally bar3. The original proposal for this feature
included an 'auto' option which would choose bar5 in this case, but
various drivers have been found that make assumptions about the
properties of the "first" BAR or the size of BARs such that there
appears to be no foolproof automatic selection available, requiring
known good combinations to be sourced from users. This patch is
pre-enabled for an 'auto' selection making use of a validated lookup
table, but no entries are yet identified.
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The kernel provides similar emulation of PCI BAR register access to
QEMU, so up until now we've used that for things like BAR sizing and
storing the BAR address. However, if we intend to resize BARs or add
BARs that don't exist on the physical device, we need to switch to the
pure QEMU emulation of the BAR.
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Add one more layer to our stack of MemoryRegions, this base region
allows us to register BARs independently of the vfio region or to
extend the size of BARs which do map to a region. This will be
useful when we want hypervisor defined BARs or sections of BARs,
for purposes such as relocating MSI-X emulation. We therefore call
msix_init() based on this new base MemoryRegion, while the quirks,
which only modify regions still operate on those sub-MemoryRegions.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The fields were removed in the referenced commit, but the comment
still mentions them.
Fixes: 2fb9636ebf ("vfio-pci: Remove unused fields from VFIOMSIXInfo")
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In order to enable TCE operations support in KVM, we have to inform
the KVM about VFIO groups being attached to specific LIOBNs. The KVM
already knows about VFIO groups, the only bit missing is which
in-kernel TCE table (the one with user visible TCEs) should update
the attached broups. There is an KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE
attribute of the VFIO KVM device which receives a groupfd/tablefd couple.
This uses a new memory_region_iommu_get_attr() helper to get the IOMMU fd
and calls KVM to establish the link.
As get_attr() is not implemented yet, this should cause no behavioural
change.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
applied using ./scripts/clean-includes
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
When support for multiple mappings per a region were added, this was
left behind, let's finish and remove unused bits.
Fixes: db0da029a1 ("vfio: Generalize region support")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The vfio_iommu_spapr_tce driver advertises kernel's support for
v1 and v2 IOMMU support, however it is not always possible to use
the requested IOMMU type. For example, a pseries host platform does not
support dynamic DMA windows so v2 cannot initialize and QEMU fails to
start.
This adds a fallback to the v1 IOMMU if v2 cannot be used.
Fixes: 318f67ce13 ("vfio: spapr: Add DMA memory preregistering (SPAPR IOMMU v2)")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The init of giommu_list and hostwin_list is missed during container
initialization.
Signed-off-by: Liu, Yi L <yi.l.liu@linux.intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Commit 8c37faa475 ("vfio-pci, ppc64/spapr: Reorder group-to-container
attaching") moved registration of groups with the vfio-kvm device from
vfio_get_group() to vfio_connect_container(), but it missed the case
where a group is attached to an existing container and takes an early
exit. Perhaps this is a less common case on ppc64/spapr, but on x86
(without viommu) all groups are connected to the same container and
thus only the first group gets registered with the vfio-kvm device.
This becomes a problem if we then hot-unplug the devices associated
with that first group and we end up with KVM being misinformed about
any vfio connections that might remain. Fix by including the call to
vfio_kvm_device_add_group() in this early exit path.
Fixes: 8c37faa475 ("vfio-pci, ppc64/spapr: Reorder group-to-container attaching")
Cc: qemu-stable@nongnu.org # qemu-2.10+
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The bus pointer in PCIDevice is basically redundant with QOM information.
It's always initialized to the qdev_get_parent_bus(), the only difference
is the type.
Therefore this patch eliminates the field, instead creating a pci_get_bus()
helper to do the type mangling to derive it conveniently from the QOM
Device object underneath.
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Simplify the error handling of the SSCH and RSCH handler avoiding
arbitrary and cryptic error codes being used to tell how the instruction
is supposed to end. Let the code detecting the condition tell how it's
to be handled in a less ambiguous way. It's best to handle SSCH and RSCH
in one go as the emulation of the two shares a lot of code.
For passthrough this change isn't pure refactoring, but changes the way
kernel reported EFAULT is handled. After clarifying the kernel interface
we decided that EFAULT shall be mapped to unit exception. Same goes for
unexpected error codes and absence of required ORB flags.
Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Message-Id: <20171017140453.51099-4-pasic@linux.vnet.ibm.com>
Tested-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
[CH: cosmetic changes]
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
Add INTERFACE_CONVENTIONAL_PCI_DEVICE to all direct subtypes of
TYPE_PCI_DEVICE, except:
1) The ones that already have INTERFACE_PCIE_DEVICE set:
* base-xhci
* e1000e
* nvme
* pvscsi
* vfio-pci
* virtio-pci
* vmxnet3
2) base-pci-bridge
Not all PCI bridges are Conventional PCI devices, so
INTERFACE_CONVENTIONAL_PCI_DEVICE is added only to the subtypes
that are actually Conventional PCI:
* dec-21154-p2p-bridge
* i82801b11-bridge
* pbm-bridge
* pci-bridge
The direct subtypes of base-pci-bridge not touched by this patch
are:
* xilinx-pcie-root: Already marked as PCIe-only.
* pcie-pci-bridge: Already marked as PCIe-only.
* pcie-port: all non-abstract subtypes of pcie-port are already
marked as PCIe-only devices.
3) megasas-base
Not all megasas devices are Conventional PCI devices, so the
interface names are added to the subclasses registered by
megasas_register_types(), according to information in the
megasas_devices[] array.
"megasas-gen2" already implements INTERFACE_PCIE_DEVICE, so add
INTERFACE_CONVENTIONAL_PCI_DEVICE only to "megasas".
Acked-by: Alberto Garcia <berto@igalia.com>
Acked-by: John Snow <jsnow@redhat.com>
Acked-by: Anthony PERARD <anthony.perard@citrix.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The following devices support both PCI Express and Conventional
PCI, by including special code to handle the QEMU_PCI_CAP_EXPRESS
flag and/or conditional pcie_endpoint_cap_init() calls:
* vfio-pci (is_express=1, but legacy PCI handled by
vfio_populate_device())
* vmxnet3 (is_express=0, but PCIe handled by vmxnet3_realize())
* pvscsi (is_express=0, but PCIe handled by pvscsi_realize())
* virtio-pci (is_express=0, but PCIe handled by
virtio_pci_dc_realize(), and additional legacy PCI code at
virtio_pci_realize())
* base-xhci (is_express=1, but pcie_endpoint_cap_init() call
is conditional on pci_bus_is_express(dev->bus)
* Note that xhci does not clear QEMU_PCI_CAP_EXPRESS like the
other hybrid devices
Cc: Dmitry Fleytman <dmitry@daynix.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
NVIDIA has defined a specification for creating GPUDirect "cliques",
where devices with the same clique ID support direct peer-to-peer DMA.
When running on bare-metal, tools like NVIDIA's p2pBandwidthLatencyTest
(part of cuda-samples) determine which GPUs can support peer-to-peer
based on chipset and topology. When running in a VM, these tools have
no visibility to the physical hardware support or topology. This
option allows the user to specify hints via a vendor defined
capability. For instance:
<qemu:commandline>
<qemu:arg value='-set'/>
<qemu:arg value='device.hostdev0.x-nv-gpudirect-clique=0'/>
<qemu:arg value='-set'/>
<qemu:arg value='device.hostdev1.x-nv-gpudirect-clique=1'/>
<qemu:arg value='-set'/>
<qemu:arg value='device.hostdev2.x-nv-gpudirect-clique=1'/>
</qemu:commandline>
This enables two cliques. The first is a singleton clique with ID 0,
for the first hostdev defined in the XML (note that since cliques
define peer-to-peer sets, singleton clique offer no benefit). The
subsequent two hostdevs are both added to clique ID 1, indicating
peer-to-peer is possible between these devices.
QEMU only provides validation that the clique ID is valid and applied
to an NVIDIA graphics device, any validation that the resulting
cliques are functional and valid is the user's responsibility. The
NVIDIA specification allows a 4-bit clique ID, thus valid values are
0-15.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If the hypervisor needs to add purely virtual capabilties, give us a
hook through quirks to do that. Note that we determine the maximum
size for a capability based on the physical device, if we insert a
virtual capability, that can change. Therefore if maximum size is
smaller after added virt capabilities, use that.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
If vfio_add_std_cap() errors then going to out prepends irrelevant
errors for capabilities we haven't attempted to add as we unwind our
recursive stack. Just return error.
Fixes: 7ef165b9a8 ("vfio/pci: Pass an error object to vfio_add_capabilities")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The existing tries to round up the number of pages but @pages is always
calculated as the rounded up value minus one which makes ctz64() always
return 0 and have create.levels always set 1.
This removes wrong "-1" and allows having more than 1 levels. This becomes
handy for >128GB guests with standard 64K pages as this requires blocks
with zone order 9 and the popular limit of CONFIG_FORCE_MAX_ZONEORDER=9
means that only blocks up to order 8 are allowed.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
The only exception are groups of numers separated by symbols
'.', ' ', ':', '/', like 'ab.09.7d'.
This patch is made by the following:
> find . -name trace-events | xargs python script.py
where script.py is the following python script:
=========================
#!/usr/bin/env python
import sys
import re
import fileinput
rhex = '%[-+ *.0-9]*(?:[hljztL]|ll|hh)?(?:x|X|"\s*PRI[xX][^"]*"?)'
rgroup = re.compile('((?:' + rhex + '[.:/ ])+' + rhex + ')')
rbad = re.compile('(?<!0x)' + rhex)
files = sys.argv[1:]
for fname in files:
for line in fileinput.input(fname, inplace=True):
arr = re.split(rgroup, line)
for i in range(0, len(arr), 2):
arr[i] = re.sub(rbad, '0x\g<0>', arr[i])
sys.stdout.write(''.join(arr))
=========================
Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Message-id: 20170731160135.12101-5-vsementsov@virtuozzo.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
With the move of some docs/ to docs/devel/ on ac06724a71,
no references were updated.
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
hw/vfio/pci.c:308:29: warning: Use of memory after it is freed
qemu_set_fd_handler(*pfd, NULL, NULL, vdev);
^~~~
Reported-by: Clang Static Analyzer
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
free the data _after_ using it.
hw/vfio/platform.c:126:29: warning: Use of memory after it is freed
qemu_set_fd_handler(*pfd, NULL, NULL, NULL);
^~~~
Reported-by: Clang Static Analyzer
Signed-off-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Commit 7da624e2 ("vfio: Test realized when using VFIOGroup.device_list
iterator") introduced a pointer to the Object DeviceState in the VFIO
common base-device and skipped non-realized devices as we iterate
VFIOGroup.device_list. While it missed to initialize the pointer for
the vfio-ccw case. Let's fix it.
Fixes: 7da624e2 ("vfio: Test realized when using VFIOGroup.device_list
iterator")
Cc: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Halil Pasic <pasic@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Message-Id: <20170718014926.44781-3-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
When allocating memory for the vfio_irq_info parameter of the
VFIO_DEVICE_GET_IRQ_INFO ioctl, we used the wrong size. Let's
fix it by using the right size.
Reviewed-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Jing Zhang <bjzhjing@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170718014926.44781-2-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cohuck@redhat.com>
At the moment VFIO PCI device initialization works as follows:
vfio_realize
vfio_get_group
vfio_connect_container
register memory listeners (1)
update QEMU groups lists
vfio_kvm_device_add_group
Then (example for pseries) the machine reset hook triggers region_add()
for all regions where listeners from (1) are listening:
ppc_spapr_reset
spapr_phb_reset
spapr_tce_table_enable
memory_region_add_subregion
vfio_listener_region_add
vfio_spapr_create_window
This scheme works fine until we need to handle VFIO PCI device hotplug
and we want to enable PPC64/sPAPR in-kernel TCE acceleration on,
i.e. after PCI hotplug we need a place to call
ioctl(vfio_kvm_device_fd, KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE).
Since the ioctl needs a LIOBN fd (from sPAPRTCETable) and a IOMMU group fd
(from VFIOGroup), vfio_listener_region_add() seems to be the only place
for this ioctl().
However this only works during boot time because the machine reset
happens strictly after all devices are finalized. When hotplug happens,
vfio_listener_region_add() is called when a memory listener is registered
but when this happens:
1. new group is not added to the container->group_list yet;
2. VFIO KVM device is unaware of the new IOMMU group.
This moves bits around to have all necessary VFIO infrastructure
in place for both initial startup and hotplug cases.
[aw: ie, register vfio groups with kvm prior to memory listener
registration such that kvm-vfio pseudo device ioctls are available
during the region_add callback]
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This defines new QOM object - IOMMUMemoryRegion - with MemoryRegion
as a parent.
This moves IOMMU-related fields from MR to IOMMU MR. However to avoid
dymanic QOM casting in fast path (address_space_translate, etc),
this adds an @is_iommu boolean flag to MR and provides new helper to
do simple cast to IOMMU MR - memory_region_get_iommu. The flag
is set in the instance init callback. This defines
memory_region_is_iommu as memory_region_get_iommu()!=NULL.
This switches MemoryRegion to IOMMUMemoryRegion in most places except
the ones where MemoryRegion may be an alias.
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20170711035620.4232-2-aik@ozlabs.ru>
Acked-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel 82599 VFs report a PCIe capability version of 0, which is
invalid. The earliest version of the PCIe spec used version 1. This
causes Windows to fail startup on the device and it will be disabled
with error code 10. Our choices are either to drop the PCIe cap on
such devices, which has the side effect of likely preventing the guest
from discovering any extended capabilities, or performing a fixup to
update the capability to the earliest valid version. This implements
the latter.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
VFIOGroup.device_list is effectively our reference tracking mechanism
such that we can teardown a group when all of the device references
are removed. However, we also use this list from our machine reset
handler for processing resets that affect multiple devices. Generally
device removals are fully processed (exitfn + finalize) when this
reset handler is invoked, however if the removal is triggered via
another reset handler (piix4_reset->acpi_pcihp_reset) then the device
exitfn may run, but not finalize. In this case we hit asserts when
we start trying to access PCI helpers since much of the PCI state of
the device is released. To resolve this, add a pointer to the Object
DeviceState in our common base-device and skip non-realized devices
as we iterate.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
After the patch 'Make errp the last parameter of pci_add_capability()',
pci_add_capability() and pci_add_capability2() now do exactly the same.
So drop the wrapper pci_add_capability() of pci_add_capability2(), then
replace the pci_add_capability2() with pci_add_capability() everywhere.
Cc: pbonzini@redhat.com
Cc: rth@twiddle.net
Cc: ehabkost@redhat.com
Cc: mst@redhat.com
Cc: dmitry@daynix.com
Cc: jasowang@redhat.com
Cc: marcel@redhat.com
Cc: alex.williamson@redhat.com
Cc: armbru@redhat.com
Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Mao Zhongyi <maozy.fnst@cn.fujitsu.com>
Reviewed-by: Marcel Apfelbaum <marcel@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
A bunch of fixes all over the place. Most notably this fixes
the new MTU feature when using vhost.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
-----BEGIN PGP SIGNATURE-----
iQEcBAABAgAGBQJZK2bwAAoJECgfDbjSjVRpNBgIALmNG7VaixhNUlnfX1n1JBnh
+HBP2zNfvi0q5roBuPFmlziKa3IBHb2Fcte4nb6QxmPg+uoaj39AOzfrrvz210kR
h2j5Qk2bCdMeWBpxI+xDDScwi/Im23Y6KN1eZyMekFr2CaSGiqOHZPPdbsyEcHPB
VylM0uHqSTZL5JAAzEuYlH+LLfPu91HoxMsIAdNuQX+qKyM2DZ4eICBQ0zA73USt
OduZltcRMk7UpvQMqY+2iaEXapXQQEUGrP2Mo8ZyqeIl2ItC33GspqBQIKjuZdrr
tpr/T1VWsLdZnURZXyELrFqrErDXvKaP9HROwvyLyYPXZF+pJ3LA7TopS5UmfNQ=
=Z4xG
-----END PGP SIGNATURE-----
Merge remote-tracking branch 'mst/tags/for_upstream' into staging
pci, virtio, vhost: fixes
A bunch of fixes all over the place. Most notably this fixes
the new MTU feature when using vhost.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
# gpg: Signature made Mon 29 May 2017 01:10:24 AM BST
# gpg: using RSA key 0x281F0DB8D28D5469
# gpg: Good signature from "Michael S. Tsirkin <mst@kernel.org>"
# gpg: aka "Michael S. Tsirkin <mst@redhat.com>"
# Primary key fingerprint: 0270 606B 6F3C DF3D 0B17 0970 C350 3912 AFBE 8E67
# Subkey fingerprint: 5D09 FD08 71C8 F85B 94CA 8A0D 281F 0DB8 D28D 5469
* mst/tags/for_upstream:
acpi-test: update expected files
pc: ACPI BIOS: use highest NUMA node for hotplug mem hole SRAT entry
vhost-user: pass message as a pointer to process_message_reply()
virtio_net: Bypass backends for MTU feature negotiation
intel_iommu: turn off pt before 2.9
intel_iommu: support passthrough (PT)
intel_iommu: allow dev-iotlb context entry conditionally
intel_iommu: use IOMMU_ACCESS_FLAG()
intel_iommu: provide vtd_ce_get_type()
intel_iommu: renaming context entry helpers
x86-iommu: use DeviceClass properties
memory: remove the last param in memory_region_iommu_replay()
memory: tune last param of iommu_ops.translate()
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
We were always passing in that one as "false" to assume that's an read
operation, and we also assume that IOMMU translation would always have
that read permission. A better permission would be IOMMU_NONE since the
replay is after all not a real read operation, but just a page table
rebuilding process.
CC: David Gibson <david@gibson.dropbear.id.au>
CC: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Jason Wang <jasowang@redhat.com>
Concurrent-sense data is currently not delivered. This patch stores
the concurrent-sense data to the subchannel if a unit check is pending
and the concurrent-sense bit is enabled. Then a TSCH can retreive the
right IRB data back to the guest.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170517004813.58227-13-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Introduce a new callback on subchannel to handle ccw-request.
Realize the callback in vfio-ccw device. Besides, resort to
the event notifier handler to handling the ccw-request results.
1. Pread the I/O results via MMIO region.
2. Update the scsw info to guest.
3. Inject an I/O interrupt to notify guest the I/O result.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Xiao Feng Ren <renxiaof@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170517004813.58227-11-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
vfio-ccw resorts to the eventfd mechanism to communicate with userspace.
We fetch the irqs info via the ioctl VFIO_DEVICE_GET_IRQ_INFO,
register a event notifier to get the eventfd fd which is sent
to kernel via the ioctl VFIO_DEVICE_SET_IRQS, then we can implement
read operation once kernel sends the signal.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170517004813.58227-10-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
vfio-ccw provides an MMIO region for I/O operations. We fetch its
information via ioctls here, then we can use it performing I/O
instructions and retrieving I/O results later on.
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170517004813.58227-9-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
We use the IOMMU_TYPE1 of VFIO to realize the subchannels
passthrough, implement a vfio based subchannels passthrough
driver called "vfio-ccw".
Support qemu parameters in the style of:
"-device vfio-ccw,sysfsdev=$mdev_file_path,devno=xx.x.xxxx'
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Xiao Feng Ren <renxiaof@linux.vnet.ibm.com>
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Message-Id: <20170517004813.58227-8-bjsdjshi@linux.vnet.ibm.com>
Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com>
commit 33cd52b5d7 unset
cannot_instantiate_with_device_add_yet in TYPE_SYSBUS, making all
sysbus devices appear on "-device help" and lack the "no-user"
flag in "info qdm".
To fix this, we can set user_creatable=false by default on
TYPE_SYS_BUS_DEVICE, but this requires setting
user_creatable=true explicitly on the sysbus devices that
actually work with -device.
Fortunately today we have just a few has_dynamic_sysbus=1
machines: virt, pc-q35-*, ppce500, and spapr.
virt, ppce500, and spapr have extra checks to ensure just a few
device types can be instantiated:
* virt supports only TYPE_VFIO_CALXEDA_XGMAC, TYPE_VFIO_AMD_XGBE.
* ppce500 supports only TYPE_ETSEC_COMMON.
* spapr supports only TYPE_SPAPR_PCI_HOST_BRIDGE.
This patch sets user_creatable=true explicitly on those 4 device
classes.
Now, the more complex cases:
pc-q35-*: q35 has no sysbus device whitelist yet (which is a
separate bug). We are in the process of fixing it and building a
sysbus whitelist on q35, but in the meantime we can fix the
"-device help" and "info qdm" bugs mentioned above. Also, despite
not being strictly necessary for fixing the q35 bug, reducing the
list of user_creatable=true devices will help us be more
confident when building the q35 whitelist.
xen: We also have a hack at xen_set_dynamic_sysbus(), that sets
has_dynamic_sysbus=true at runtime when using the Xen
accelerator. This hack is only used to allow xen-backend devices
to be dynamically plugged/unplugged.
This means today we can use -device with the following 22 device
types, that are the ones compiled into the qemu-system-x86_64 and
qemu-system-i386 binaries:
* allwinner-ahci
* amd-iommu
* cfi.pflash01
* esp
* fw_cfg_io
* fw_cfg_mem
* generic-sdhci
* hpet
* intel-iommu
* ioapic
* isabus-bridge
* kvmclock
* kvm-ioapic
* kvmvapic
* SUNW,fdtwo
* sysbus-ahci
* sysbus-fdc
* sysbus-ohci
* unimplemented-device
* virtio-mmio
* xen-backend
* xen-sysdev
This patch adds user_creatable=true explicitly to those devices,
temporarily, just to keep 100% compatibility with existing
behavior of q35. Subsequent patches will remove
user_creatable=true from the devices that are really not meant to
user-creatable on any machine, and remove the FIXME comment from
the ones that are really supposed to be user-creatable. This is
being done in separate patches because we still don't have an
obvious list of devices that will be whitelisted by q35, and I
would like to get each device reviewed individually.
Cc: Alexander Graf <agraf@suse.de>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Alistair Francis <alistair.francis@xilinx.com>
Cc: Beniamino Galvani <b.galvani@gmail.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Cornelia Huck <cornelia.huck@de.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: "Edgar E. Iglesias" <edgar.iglesias@gmail.com>
Cc: Eduardo Habkost <ehabkost@redhat.com>
Cc: Frank Blaschka <frank.blaschka@de.ibm.com>
Cc: Gabriel L. Somlo <somlo@cmu.edu>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: John Snow <jsnow@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Laszlo Ersek <lersek@redhat.com>
Cc: Marcel Apfelbaum <marcel@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Max Reitz <mreitz@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Maydell <peter.maydell@linaro.org>
Cc: Pierre Morel <pmorel@linux.vnet.ibm.com>
Cc: Prasad J Pandit <pjp@fedoraproject.org>
Cc: qemu-arm@nongnu.org
Cc: qemu-block@nongnu.org
Cc: qemu-ppc@nongnu.org
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rob Herring <robh@kernel.org>
Cc: Shannon Zhao <zhaoshenglong@huawei.com>
Cc: sstabellini@kernel.org
Cc: Thomas Huth <thuth@redhat.com>
Cc: Yi Min Zhao <zyimin@linux.vnet.ibm.com>
Acked-by: John Snow <jsnow@redhat.com>
Acked-by: Juergen Gross <jgross@suse.com>
Acked-by: Marcel Apfelbaum <marcel@redhat.com>
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Message-Id: <20170503203604.31462-3-ehabkost@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[ehabkost: Small changes at sysbus_device_class_init() comments]
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
When the "No host device provided" error occurs, the hint message
that starts with "Use -vfio-pci," makes no sense, since "-vfio-pci"
is not a valid command line parameter.
Correct this by replacing "-vfio-pci" with "-device vfio-pci".
Signed-off-by: Dong Jia Shi <bjsdjshi@linux.vnet.ibm.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This patch enables 8-byte writes and reads to VFIO. Such implemention
is already done but it's missing the 'case' to handle such accesses in
both vfio_region_write and vfio_region_read and the MemoryRegionOps:
impl.max_access_size and impl.min_access_size.
After this patch, 8-byte writes such as:
qemu_mutex_lock locked mutex 0x10905ad8
vfio_region_write (0001:03:00.0:region1+0xc0, 0x4140c, 4)
vfio_region_write (0001:03:00.0:region1+0xc4, 0xa0000, 4)
qemu_mutex_unlock unlocked mutex 0x10905ad8
goes like this:
qemu_mutex_lock locked mutex 0x10905ad8
vfio_region_write (0001:03:00.0:region1+0xc0, 0xbfd0008, 8)
qemu_mutex_unlock unlocked mutex 0x10905ad8
Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Sets valid.max_access_size and valid.min_access_size to ensure safe
8-byte accesses to vfio. Today, 8-byte accesses are broken into pairs
of 4-byte calls that goes unprotected:
qemu_mutex_lock locked mutex 0x10905ad8
vfio_region_write (0001:03:00.0:region1+0xc0, 0x2020c, 4)
qemu_mutex_unlock unlocked mutex 0x10905ad8
qemu_mutex_lock locked mutex 0x10905ad8
vfio_region_write (0001:03:00.0:region1+0xc4, 0xa0000, 4)
qemu_mutex_unlock unlocked mutex 0x10905ad8
which occasionally leads to:
qemu_mutex_lock locked mutex 0x10905ad8
vfio_region_write (0001:03:00.0:region1+0xc0, 0x2030c, 4)
qemu_mutex_unlock unlocked mutex 0x10905ad8
qemu_mutex_lock locked mutex 0x10905ad8
vfio_region_write (0001:03:00.0:region1+0xc0, 0x1000c, 4)
qemu_mutex_unlock unlocked mutex 0x10905ad8
qemu_mutex_lock locked mutex 0x10905ad8
vfio_region_write (0001:03:00.0:region1+0xc4, 0xb0000, 4)
qemu_mutex_unlock unlocked mutex 0x10905ad8
qemu_mutex_lock locked mutex 0x10905ad8
vfio_region_write (0001:03:00.0:region1+0xc4, 0xa0000, 4)
qemu_mutex_unlock unlocked mutex 0x10905ad8
causing strange errors in guest OS. With this patch, such accesses
are protected by the same lock guard:
qemu_mutex_lock locked mutex 0x10905ad8
vfio_region_write (0001:03:00.0:region1+0xc0, 0x2000c, 4)
vfio_region_write (0001:03:00.0:region1+0xc4, 0xb0000, 4)
qemu_mutex_unlock unlocked mutex 0x10905ad8
This happens because the 8-byte write should be broken into 4-byte
writes by memory.c:access_with_adjusted_size() in order to be under
the same lock. Today, it's done in exec.c:address_space_write_continue()
which was able to handle only 4 bytes due to a zero'ed
valid.max_access_size (see exec.c:memory_access_size()).
Signed-off-by: Jose Ricardo Ziviani <joserz@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
In this patch, IOMMUNotifier.{start|end} are introduced to store section
information for a specific notifier. When notification occurs, we not
only check the notification type (MAP|UNMAP), but also check whether the
notified iova range overlaps with the range of specific IOMMU notifier,
and skip those notifiers if not in the listened range.
When removing an region, we need to make sure we removed the correct
VFIOGuestIOMMU by checking the IOMMUNotifier.start address as well.
This patch is solving the problem that vfio-pci devices receive
duplicated UNMAP notification on x86 platform when vIOMMU is there. The
issue is that x86 IOMMU has a (0, 2^64-1) IOMMU region, which is
splitted by the (0xfee00000, 0xfeefffff) IRQ region. AFAIK
this (splitted IOMMU region) is only happening on x86.
This patch also helps vhost to leverage the new interface as well, so
that vhost won't get duplicated cache flushes. In that sense, it's an
slight performance improvement.
Suggested-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <1491562755-23867-2-git-send-email-peterx@redhat.com>
[ehabkost: included extra vhost_iommu_region_del() change from Peter Xu]
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
The NVIDIA BAR5 quirk is targeting an ioport BAR. Some older devices
have a BAR5 which is not ioport and can induce a segfault here. Test
the BAR type to skip these devices.
Link: https://bugs.launchpad.net/qemu/+bug/1678466
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
This reverts commit c2b2e158cc.
The original patch intend to prevent linux i915 driver from using
stolen meory. But this patch breaks windows IGD driver loading on
Gen9+, as IGD HW will use stolen memory on Gen9+, once windows IGD
driver see zero size stolen memory, it will unload.
Meanwhile stolen memory will be disabled in 915 when i915 run as
a guest.
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
[aw: Gen9+ is SkyLake and newer]
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Regardless of running in UPT or legacy mode, the guest igd
drivers may attempt to use stolen memory, however only legacy
mode has BIOS support for reserving stolen memmory in the
guest VM. We zero out the stolen memory size in all cases,
then guest igd driver won't use stolen memory.
In legacy mode, user could use x-igd-gms option to specify the
amount of stolen memory which will be pre-allocated and reserved
by bios for igd use.
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99028https://bugs.freedesktop.org/show_bug.cgi?id=99025
Signed-off-by: Xiong Zhang <xiong.y.zhang@intel.com>
Tested-by: Terrence Xu <terrence.xu@intel.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Since commit 4bb571d857 ("pci/pcie: don't assume cap id 0 is
reserved") removes the internal use of extended capability ID 0, the
comment here becomes invalid. However, peeling back the onion, the
code is still correct and we still can't seed the capability chain
with ID 0, unless we want to muck with using the version number to
force the header to be non-zero, which is much uglier to deal with.
The comment also now covers some of the subtleties of using cap ID 0,
such as transparently indicating absence of capabilities if none are
added. This doesn't detract from the correctness of the referenced
commit as vfio in the kernel also uses capability ID zero to mask
capabilties. In fact, we should skip zero capabilities precisely
because the kernel might also expose such a capability at the head
position and re-introduce the problem.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Tested-by: Peter Xu <peterx@redhat.com>
Reported-by: Jintack Lim <jintack@cs.columbia.edu>
Tested-by: Jintack Lim <jintack@cs.columbia.edu>
Currently we ignore this error, report it with error_reportf_err()
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Linux vfio driver supports to do VFIO_IOMMU_UNMAP_DMA for a very big
region. This can be leveraged by QEMU IOMMU implementation to cleanup
existing page mappings for an entire iova address space (by notifying
with an IOTLB with extremely huge addr_mask). However current
vfio_iommu_map_notify() does not allow that. It make sure that all the
translated address in IOTLB is falling into RAM range.
The check makes sense, but it should only be a sensible checker for
mapping operations, and mean little for unmap operations.
This patch moves this check into map logic only, so that we'll get
faster unmap handling (no need to translate again), and also we can then
better support unmapping a very big region when it covers non-ram ranges
or even not-existing ranges.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
A cleanup for vfio_iommu_map_notify(). Now we will fetch vaddr even if
the operation is unmap, but it won't hurt much.
One thing to mention is that we need the RCU read lock to protect the
whole translation and map/unmap procedure.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We traces its range, but we don't know whether it's a MAP/UNMAP. Let's
dump it as well.
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Both devices seem to be specific to the ARM platform. It's confusing
for the users if they show up on other target architectures, too
(e.g. when the user runs QEMU with "-device ?" to get a list of
supported devices). Thus let's introduce proper configuration switches
so that the devices are only compiled and included when they are
really required.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The device has "bridge" in its name, so it should obviously be in
the category DEVICE_CATEGORY_BRIDGE.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Previous IGD, up through Broadwell, only seem to write GTT values into
the first 1MB of space allocated for the BDSM, but clearly the GTT
can be multiple MB in size. Our test in vfio_igd_quirk_data_write()
correctly filters out indexes beyond 1MB, but given the 1MB mask we're
using, we re-apply writes only to the first 1MB of the guest allocated
BDSM.
We can't assume either the host or guest BDSM is naturally aligned, so
we can't simply apply a different mask. Instead, save the host BDSM
and do the arithmetic to subtract the host value to get the BDSM
offset and add it to the guest allocated BDSM.
Reported-by: Alexander Indenbaum <alexander.indenbaum@gmail.com>
Tested-by: Alexander Indenbaum <alexander.indenbaum@gmail.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
msix_init() reports errors with error_report(), which is wrong when
it's used in realize(). The same issue was fixed for msi_init() in
commit 1108b2f. In order to make the API change as small as possible,
leave the return value check to later patch.
For some devices(like e1000e, vmxnet3, nvme) who won't fail because of
msix_init's failure, suppress the error report by passing NULL error
object.
Bonus: add comment for msix_init.
CC: Jiri Pirko <jiri@resnulli.us>
CC: Gerd Hoffmann <kraxel@redhat.com>
CC: Dmitry Fleytman <dmitry@daynix.com>
CC: Jason Wang <jasowang@redhat.com>
CC: Michael S. Tsirkin <mst@redhat.com>
CC: Hannes Reinecke <hare@suse.de>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Alex Williamson <alex.williamson@redhat.com>
CC: Markus Armbruster <armbru@redhat.com>
CC: Marcel Apfelbaum <marcel@redhat.com>
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
There are a number of unused trace events that
scripts/cleanup-trace-events.pl finds. The "hw/vfio/pci-quirks.c"
filename was typoed and "qapi/qapi-visit-core.c" was missing the qapi/
directory prefix.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 20170126171613.1399-3-stefanha@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Now the kernel commit 05f0c03fbac1 ("vfio-pci: Allow to mmap
sub-page MMIO BARs if the mmio page is exclusive") allows VFIO
to mmap sub-page BARs. This is the corresponding QEMU patch.
With those patches applied, we could passthrough sub-page BARs
to guest, which can help to improve IO performance for some devices.
In this patch, we expand MemoryRegions of these sub-page
MMIO BARs to PAGE_SIZE in vfio_pci_write_config(), so that
the BARs could be passed to KVM ioctl KVM_SET_USER_MEMORY_REGION
with a valid size. The expanding size will be recovered when
the base address of sub-page BAR is changed and not page aligned
any more in guest. And we also set the priority of these BARs'
memory regions to zero in case of overlap with BARs which share
the same page with sub-page BARs in guest.
Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When a PCI device is reset, pci_do_device_reset resets all BAR addresses
in the relevant PCIDevice's config buffer.
The VFIO configuration space stays untouched, so the guest OS may choose
to skip restoring the BAR addresses as they would seem intact. The PCI
device may be left non-operational.
One example of such a scenario is when the guest exits S3.
Fix this by resetting the BAR addresses in the VFIO configuration space
as well.
Signed-off-by: Ido Yariv <ido@wizery.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
As reported in the link below, user has a PCI device with a 4KB BAR
which contains the MSI-X table. This seems to hit a corner case in
the kernel where the region reports being mmap capable, but the sparse
mmap information reports a zero sized range. It's not entirely clear
that the kernel is incorrect in doing this, but regardless, we need
to handle it. To do this, fill our mmap array only with non-zero
sized sparse mmap entries and add an error return from the function
so we can tell the difference between nr_mmaps being zero based on
sparse mmap info vs lack of sparse mmap info.
NB, this doesn't actually change the behavior of the device, it only
removes the scary "Failed to mmap ... Performance may be slow" error
message. We cannot currently create an mmap over the MSI-X table.
Link: http://lists.nongnu.org/archive/html/qemu-discuss/2016-10/msg00009.html
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Setting skip_dump on a MemoryRegion allows us to modify one specific
code path, but the restriction we're trying to address encompasses
more than that. If we have a RAM MemoryRegion backed by a physical
device, it not only restricts our ability to dump that region, but
also affects how we should manipulate it. Here we recognize that
MemoryRegions do not change to sometimes allow dumps and other times
not, so we replace setting the skip_dump flag with a new initializer
so that we know exactly the type of region to which we're applying
this behavior.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
When vfio device is reset(encounter FLR, or bus reset), if need to do
bus reset(vfio_pci_hot_reset_one is called), vfio_pci_pre_reset &
vfio_pci_post_reset will be called twice.
Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Introductory comment for rtl8168 VFIO MSI-X quirk states:
At BAR2 offset 0x70 there is a dword data register,
offset 0x74 is a dword address register.
vfio: vfio_bar_read(0000:05:00.0:BAR2+0x70, 4) = 0xfee00398 // read data
Thus, correct offset for data read is 0x70,
but function vfio_rtl8168_quirk_data_read() wrongfully uses offset 0x74.
Signed-off-by: Thorsten Kohfeldt <thorsten.kohfeldt@gmx.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>