The 745 and 755 can share the HID registration, so move it all into
register_755_sprs, which applies for both CPUs.
Also rename that function to register_745_sprs, since the 745 is the
earliest of the two. This will help with separating 755-specific
registers in a subsequent patch.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20220216162426.1885923-14-farosas@linux.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Move some of the 440 registers that are being repeated in the 440*
CPUs to register_440_sprs.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20220216162426.1885923-11-farosas@linux.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
We're considering these two to be from different CPU families, so
duplicate some code to keep them separate.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20220216162426.1885923-10-farosas@linux.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
We're considering these two to be in different CPU families (6xx and
7xx), so keep their SPR registration separate.
The code was copied into register_G2_sprs and the common function was
renamed to apply only to the 755.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20220216162426.1885923-9-farosas@linux.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Make sure that every register_*_sprs function only has calls to
spr_register* to register individual SPRs. Do not allow nesting. This
makes the code easier to follow and a look at init_proc_* should
suffice to know what SPRs a CPU has.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20220216162426.1885923-6-farosas@linux.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Now that the 601 was removed, all of our CPUs have a timebase, so that
can be moved into the common function.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20220216162426.1885923-5-farosas@linux.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
The top level init_proc calls register_generic_sprs but also registers
some other SPRs outside of that function. Let's group everything into
a single place.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20220216162426.1885923-4-farosas@linux.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
The G2LE CPU initialization code is the same as the G2. Use the latter
for both.
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20220216162426.1885923-3-farosas@linux.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
The /* XXX : not implemented */ comments all over cpu_init are
confusing and ambiguous.
Do they mean not implemented by QEMU, not implemented in a specific
access mode? Not implemented by the CPU? Do they apply to just the
register right after or to a whole block? Do they mean we have an
action to take in the future to implement these? Are they only
informative?
Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Message-Id: <20220216162426.1885923-2-farosas@linux.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
This implements the Nested KVM HV hcall API for spapr under TCG.
The L2 is switched in when the H_ENTER_NESTED hcall is made, and the
L1 is switched back in returned from the hcall when a HV exception
is sent to the vhyp. Register state is copied in and out according to
the nested KVM HV hcall API specification.
The hdecr timer is started when the L2 is switched in, and it provides
the HDEC / 0x980 return to L1.
The MMU re-uses the bare metal radix 2-level page table walker by
using the get_pate method to point the MMU to the nested partition
table entry. MMU faults due to partition scope errors raise HV
exceptions and accordingly are routed back to the L1.
The MMU does not tag translations for the L1 (direct) vs L2 (nested)
guests, so the TLB is flushed on any L1<->L2 transition (hcall entry
and exit).
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
[ clg: checkpatch fixes ]
Message-Id: <20220216102545.1808018-10-npiggin@gmail.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Introduce virtual hypervisor methods that can support a "Nested KVM HV"
implementation using the bare metal 2-level radix MMU, and using HV
exceptions to return from H_ENTER_NESTED (rather than cause interrupts).
HV exceptions can now be raised in the TCG spapr machine when running a
nested KVM HV guest. The main ones are the lev==1 syscall, the hdecr,
hdsi and hisi, hv fu, and hv emu, and h_virt external interrupts.
HV exceptions are intercepted in the exception handler code and instead
of causing interrupts in the guest and switching the machine to HV mode,
they go to the vhyp where it may exit the H_ENTER_NESTED hcall with the
interrupt vector numer as return value as required by the hcall API.
Address translation is provided by the 2-level page table walker that is
implemented for the bare metal radix MMU. The partition scope page table
is pointed to the L1's partition scope by the get_pate vhc method.
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Message-Id: <20220216102545.1808018-9-npiggin@gmail.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
This moves the logic to reset the QEMU exception state into its own
function.
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[ clg: checkpatch fixes ]
Message-Id: <20220216102545.1808018-8-npiggin@gmail.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
The virtual hypervisor currently always intercepts and handles
hypercalls but with a future change this will not always be the case.
Add a helper for the test so the logic is abstracted from the mechanism.
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20220216102545.1808018-7-npiggin@gmail.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
In prepartion for implementing a full partition table option for
vhyp, update the get_pate method to take an lpid and return a
success/fail indicator.
The spapr implementation currently just asserts lpid is always 0
and always return success.
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[ clg: checkpatch fixes ]
Message-Id: <20220216102545.1808018-6-npiggin@gmail.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
The radix on vhyp MMU uses a single-level radix table walk, with the
partition scope mapping provided by the flat QEMU machine memory.
A subsequent change will use the two-level radix walk on vhyp in some
situations, so provide a helper which can abstract that logic.
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20220216102545.1808018-5-npiggin@gmail.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Machines which don't emulate the HDEC facility are able to use the
timer for something else. Provide functions to start and stop the
hdecr timer.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[ clg: checkpatch fixes ]
Message-Id: <20220216102545.1808018-4-npiggin@gmail.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
The spapr virtual hypervisor does not require the hdecr timer.
Remove it.
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Message-Id: <20220216102545.1808018-3-npiggin@gmail.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Invalid or missing partition table entry exceptions should cause HV
interrupts. HDSISR is set to bad MMU config, which is consistent with
the ISA and experimentally matches what POWER9 generates.
Reviewed-by: Fabiano Rosas <farosas@linux.ibm.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[ clg: checkpatch fixes ]
Message-Id: <20220216102545.1808018-2-npiggin@gmail.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
If the device backend is not persistent memory for the nvdimm, there is
need for explicit IO flushes on the backend to ensure persistence.
On SPAPR, the issue is addressed by adding a new hcall to request for
an explicit flush from the guest when the backend is not pmem. So, the
approach here is to convey when the hcall flush is required in a device
tree property. The guest once it knows the device backend is not pmem,
makes the hcall whenever flush is required.
To set the device tree property, a new PAPR specific device type inheriting
the nvdimm device is implemented. When the backend doesn't have pmem=on
the device tree property "ibm,hcall-flush-required" is set, and the guest
makes hcall H_SCM_FLUSH requesting for an explicit flush. The new device
has boolean property pmem-override which when "on" advertises the device
tree property even when pmem=on for the backend. The flush function
invokes the fdatasync or pmem_persist() based on the type of backend.
The vmstate structures are made part of the spapr-nvdimm device object.
The patch attempts to keep the migration compatibility between source and
destination while rejecting the incompatibles ones with failures.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <164396256092.109112.17933240273840803354.stgit@ltczzess4.aus.stglabs.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
The patch adds support for the SCM flush hcall for the nvdimm devices.
To be available for exploitation by guest through the next patch. The
hcall is applicable only for new SPAPR specific device class which is
also introduced in this patch.
The hcall expects the semantics such that the flush to return with
H_LONG_BUSY_ORDER_10_MSEC when the operation is expected to take longer
time along with a continue_token. The hcall to be called again by providing
the continue_token to get the status. So, all fresh requests are put into
a 'pending' list and flush worker is submitted to the thread pool. The
thread pool completion callbacks move the requests to 'completed' list,
which are cleaned up after collecting the return status for the guest
in subsequent hcall from the guest.
The semantics makes it necessary to preserve the continue_tokens and
their return status across migrations. So, the completed flush states
are forwarded to the destination and the pending ones are restarted
at the destination in post_load. The necessary nvdimm flush specific
vmstate structures are also introduced in this patch which are to be
saved in the new SPAPR specific nvdimm device to be introduced in the
following patch.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <164396254862.109112.16675611182159105748.stgit@ltczzess4.aus.stglabs.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
A new subclass inheriting NVDIMMDevice is going to be introduced in
subsequent patches. The new subclass uses the realize and unrealize
callbacks. Add them on NVDIMMClass to appropriately call them as part
of plug-unplug.
Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com>
Acked-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Message-Id: <164396253158.109112.1926755104259023743.stgit@ltczzess4.aus.stglabs.ibm.com>
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Honor the expected behavior of syncfs() to synchronously flush all data
and metadata to disk on linux systems.
If virtiofsd is started with '-o announce_submounts', the client is
expected to send a FUSE_SYNCFS request for each individual submount.
In this case, we just create a new file descriptor on the submount
inode with lo_inode_open(), call syncfs() on it and close it. The
intermediary file is needed because O_PATH descriptors aren't
backed by an actual file and syncfs() would fail with EBADF.
If virtiofsd is started without '-o announce_submounts' or if the
client doesn't have the FUSE_CAP_SUBMOUNTS capability, the client
only sends a single FUSE_SYNCFS request for the root inode. The
server would thus need to track submounts internally and call
syncfs() on each of them. This will be implemented later.
Note that syncfs() might suffer from a time penalty if the submounts
are being hammered by some unrelated workload on the host. The only
solution to prevent that is to avoid shared mounts.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20220215181529.164070-2-groug@kaod.org>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Provide an option "-o security_label/no_security_label" to enable/disable
security label functionality. By default these are turned off.
If enabled, server will indicate to client that it is capable of handling
one security label during file creation. Typically this is expected to
be a SELinux label. File server will set this label on the file. It will
try to set it atomically wherever possible. But its not possible in
all the cases.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20220208204813.682906-11-vgoyal@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
If guest and host policies can't work with each other, then guest security
context (selinux label) needs to be set into an xattr. Say remap guest
security.selinux xattr to trusted.virtiofs.security.selinux.
That means setting "fscreate" is not going to help as that's ony useful
for security.selinux xattr on host.
So we need another method which is atomic. Use O_TMPFILE to create new
file, set xattr and then linkat() to proper place.
But this works only for regular files. So dir, symlinks will continue
to be non-atomic.
Also if host filesystem does not support O_TMPFILE, we fallback to
non-atomic behavior.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20220208204813.682906-10-vgoyal@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
This patch adds support for creating new file with security context
as sent by client. It basically takes three paths.
- If no security context enabled, then it continues to create files without
security context.
- If security context is enabled and but security.selinux has not been
remapped, then it uses /proc/thread-self/attr/fscreate knob to set
security context and then create the file. This will make sure that
newly created file gets the security context as set in "fscreate" and
this is atomic w.r.t file creation.
This is useful and host and guest SELinux policies don't conflict and
can work with each other. In that case, guest security.selinux xattr
is not remapped and it is passthrough as "security.selinux" xattr
on host.
- If security context is enabled but security.selinux xattr has been
remapped to something else, then it first creates the file and then
uses setxattr() to set the remapped xattr with the security context.
This is a non-atomic operation w.r.t file creation.
This mode will be most versatile and allow host and guest to have their
own separate SELinux xattrs and have their own separate SELinux policies.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20220208204813.682906-9-vgoyal@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Soon we will be able to create and also set security context on the file
atomically using /proc/self/task/tid/attr/fscreate knob. If this knob
is available on the system, first set the knob with the desired context
and then create the file. It will be created with the context set in
fscreate. This works basically for SELinux and its per thread.
This patch just introduces the helper functions. Subsequent patches will
make use of these helpers.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20220208204813.682906-8-vgoyal@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
dgilbert: Manually merged gettid syscall number fixup from Vivek
Move core file creation bits in a separate function. Soon this is going
to get more complex as file creation need to set security context also.
And there will be multiple modes of file creation in next patch.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20220208204813.682906-7-vgoyal@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Add capability to enable and parse security context as sent by client
and put into fuse_req. Filesystems now can get security context from
request and set it on files during creation.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20220208204813.682906-6-vgoyal@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
->capable keeps track of what capabilities kernel supports and ->wants keep
track of what capabilities filesytem wants.
Right now these fields are 32bit in size. But now fuse has run out of
bits and capabilities can now have bit number which are higher than 31.
That means 32 bit fields are not suffcient anymore. Increase size to 64
bit so that we can add newer capabilities and still be able to use existing
code to check and set the capabilities.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20220208204813.682906-5-vgoyal@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Add some code to parse extended "struct fuse_init_in". And use a local
variable "flag" to represent 64 bit flags. This will make it easier
to add more features without having to worry about two 32bit flags (->flags
and ->flags2) in "fuse_struct_in".
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20220208204813.682906-4-vgoyal@redhat.com>
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
dgilbert: Fixed up long line
Update headers to 5.17-rc1. I need latest fuse changes.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20220208204813.682906-3-vgoyal@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Kernel version 5.17 has increased the size of "struct fuse_init_in" struct.
Previously this struct was 16 bytes and now it has been extended to
64 bytes in size.
Once qemu headers are updated to latest, it will expect to receive 64 byte
size struct (for protocol version major 7 and minor > 6). But if guest is
booting older kernel (older than 5.17), then it still sends older
fuse_init_in of size 16 bytes. And do_init() fails. It is expecting
64 byte struct. And this results in mount of virtiofs failing.
Fix this by parsing 16 bytes only for now. Separate patches will be
posted which will parse rest of the bytes and enable new functionality.
Right now we don't support any of the new functionality, so we don't
lose anything by not parsing bytes beyond 16.
Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Message-Id: <20220208204813.682906-2-vgoyal@redhat.com>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
It is recommended to use g_autofree or g_autoptr as it reduces
the odds of introducing memory leaks in future changes.
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20220201151508.190035-3-groug@kaod.org>
Reviewed-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
local_test_path is allocated in virtio_9p_create_local_test_dir() to hold the path
of the temporary directory. It should be freed in virtio_9p_remove_local_test_dir()
when the temporary directory is removed. Clarify the lifecycle of local_test_path
while here.
Based-on: <f6602123c6f7d0d593466231b04fba087817abbd.1642879848.git.qemu_oss@crudebyte.com>
Signed-off-by: Greg Kurz <groug@kaod.org>
Message-Id: <20220201151508.190035-2-groug@kaod.org>
Reviewed-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
The 9p test cases use mkdtemp() to create a temporary directory for
running the 'local' 9p tests with real files/dirs. Unlike mktemp()
which only generates a unique file name, mkdtemp() also creates the
directory, therefore the subsequent mkdir() was wrong and caused
errors on some systems.
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Fixes: 136b7af2 (tests/9pfs: fix test dir for parallel tests)
Reported-by: Daniel P. Berrangé <berrange@redhat.com>
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/832
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Greg Kurz <Greg Kurz <groug@kaod.org>
Message-Id: <f6602123c6f7d0d593466231b04fba087817abbd.1642879848.git.qemu_oss@crudebyte.com>
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Message-Id: <E1mn1fA-0005qZ-TM@lizzy.crudebyte.com>
The current firmware descriptor schema for flash requires that both the
executable to NVRAM template paths be provided. This is fine for the
most common usage of EDK2 builds in virtualization where the separate
_CODE and _VARS files are provided.
With confidential computing technology like AMD SEV, persistent storage
of variables may be completely disabled because the firmware requires a
known clean state on every cold boot. There is no way to express this
in the firmware descriptor today.
Even with regular EDK2 builds it is possible to create a firmware that
has both executable code and variable persistence in a single file. This
hasn't been commonly used, since it would mean every guest bootup would
need to clone the full firmware file, leading to redundant duplicate
storage of the code portion. In some scenarios this may not matter and
might even be beneficial. For example if a public cloud allows users to
bring their own firmware, such that the user can pre-enroll their own
secure boot keys, you're going to have this copied on disk for each
tenant already. At this point the it can be simpler to just deal with
a single file rather than split builds. The firmware descriptor ought
to be able to express this combined firmware model too.
This all points towards expanding the schema for flash with a 'mode'
concept:
- "split" - the current implicit behaviour with separate files
for code and variables.
- "combined" - the alternate behaviour where a single file contains
both code and variables.
- "stateless" - the confidential computing use case where storage
of variables is completely disable, leaving only the code.
Reviewed-by: Kashyap Chamarthy <kchamart@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Eduardo has indicated that he no longer has time to be involved in
a QEMU maintainership role. As one of the more frequent contributors
of patches and design ideas to seccomp, I'll take over in an "Odd
Fixes" role.
Acked-by: Eduardo Otubo <otubo@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
setns/unshare are used to change namespaces which is not something QEMU
needs to be able todo.
execveat is a new variant of execve so should be blocked just like
execve already is.
Acked-by: Eduardo Otubo <otubo@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
Modern glibc will use clone3 instead of clone, when it detects that it
is available. We need to compare flags in order to decide whether to
allow clone (thread create vs process fork), but in clone3 the flags
are hidden inside a struct. Seccomp can't currently match on data inside
a struct, so our only option is to block clone3 entirely. If we use
ENOSYS to block it, then glibc transparently falls back to clone.
This may need to be revisited if Linux adds a new architecture in
future and only provides clone3, without clone.
Acked-by: Eduardo Otubo <otubo@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
When '-sandbox on,spawn=deny' is given, we are supposed to block the
ability to spawn processes. We naively blocked the 'fork' syscall,
forgetting that any modern libc will use the 'clone' syscall instead.
We can't simply block the 'clone' syscall though, as that will break
thread creation. We thus list the set of flags used to create threads
and block anything that doesn't match this exactly.
Acked-by: Eduardo Otubo <otubo@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
The handling of some syscalls / libc function is quite subtle. For
example, 'fork' at a libc level doesn't always correspond to 'fork'
at a syscall level, since the 'clone' syscall is preferred usually.
The unit test will help to detect these kind of problems. A point of
difficulty in writing a test though is that the QEMU build process may
already be confined by seccomp. For example, if running inside a
container. Since we can't predict what filtering might have been applied
already, we are quite conservative and skip all tests if we see any kind
of seccomp filter active.
Acked-by: Eduardo Otubo <otubo@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
We're currently tailoring whether to use kill process or return EPERM
based on the syscall set. This is not flexible enough for future
requirements where we also need to be able to return a variety of
actions on a per-syscall granularity.
Acked-by: Eduardo Otubo <otubo@redhat.com>
Signed-off-by: Daniel P. Berrangé <berrange@redhat.com>
With the current implementation, blocking flock can lead to
deadlock. Thus, it's better to return EOPNOTSUPP if a user attempts
to perform a blocking flock request.
Signed-off-by: Sebastian Hasler <sebastian.hasler@stuvus.uni-stuttgart.de>
Message-Id: <20220113153249.710216-1-sebastian.hasler@stuvus.uni-stuttgart.de>
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
Reviewed-by: Greg Kurz <groug@kaod.org>