Do not mount /dev/shm with the MS_NOEXEC flag on WSL1. A bug on WSL1
(https://github.com/microsoft/WSL/issues/8777) prevents files from
being mapped using mmap if the underlying filesystem is mounted
with MS_NOEXEC.
Darling can now be used without overlayfs by setting
the environment variable "DARLING_NOOVERLAYFS". Darling also
disables overlayfs when it detects that it is running in a WSL1
environment.
Without overlayfs, Darling will have to recursively copy all files
and folders from LIBEXEC_PATH to DPREFIX.
- Implemented an alternative to pidfd_open for kernels older than 5.3.
mldr should send a "lifetime pipe" to darlingserver during process start.
When the process dies, darlingserver should receive a POLLHUP event.
- Set increased_limit.rlim_cur to default_limit.rlim_max on systems without
/proc/sys/fs/nr_open. On WSL1, this greatly increases the number of open file
descriptors available.
- For systems without NSpid in /proc/self/status, implemented a way to manage
thread IDs in darlingserver during checkin. darlingserver should receive a hint
address on the thread's stack, and then compare it with a stack pointer retrieved using
PTRACE_GETREGS.
- Avoided sending socket messages when msg_hdr.msg_name->sun_path is an empty string.
A null msg_name is used instead; otherwise, on some systems, sending would fail with EINVAL.
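The "lifetime pipe" idea from the list above can be sketched as follows. This is a minimal illustration, not the actual darlingserver code: it assumes the server holds the read end and the process keeps the write end, and `lifetimePipeHungUp` is a hypothetical helper name. Passing the descriptor over the socket is omitted.

```cpp
#include <cassert>
#include <poll.h>
#include <unistd.h>

// Hypothetical helper: reports whether every write end of the pipe has been
// closed, which happens automatically when the process holding it dies.
// Assumes the caller owns the read end of the pipe.
bool lifetimePipeHungUp(int readFd, int timeoutMs) {
    assert(readFd >= 0);

    pollfd pfd{};
    pfd.fd = readFd;
    pfd.events = 0; // POLLHUP is always reported, so nothing needs to be requested

    int ready = poll(&pfd, 1, timeoutMs);
    return ready == 1 && (pfd.revents & POLLHUP) != 0;
}
```

Unlike pidfd_open, this needs no kernel support beyond plain pipes and poll, which is why it works as a fallback on kernels older than 5.3.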
Debug logging produces *lots* of output *very* quickly, so that's
disabled by default now. The log level can be controlled with the new
`DSERVER_LOG_LEVEL` env var. Just set it to the minimum level
you want to see in the output. It defaults to "error" so that only
error messages are logged.
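As a rough sketch of how such a minimum-level setting can be read, assuming level names "debug", "info", "warning", and "error" (only "error" is confirmed above; the other names and all function names here are hypothetical):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Assumed set of levels, ordered from most to least verbose.
enum class LogLevel { Debug, Info, Warning, Error };

// Maps a level name to a LogLevel. Unset or unrecognized values fall back
// to Error, matching the documented default.
LogLevel parseLogLevel(const char* value) {
    if (value == nullptr) {
        return LogLevel::Error;
    }
    const std::string name(value);
    if (name == "debug") return LogLevel::Debug;
    if (name == "info") return LogLevel::Info;
    if (name == "warning") return LogLevel::Warning;
    return LogLevel::Error;
}

// Reads the minimum level from the DSERVER_LOG_LEVEL environment variable.
LogLevel logLevelFromEnvironment() {
    return parseLogLevel(std::getenv("DSERVER_LOG_LEVEL"));
}

// A message is emitted only if it is at least as severe as the minimum level.
bool shouldLog(LogLevel message, LogLevel minimum) {
    return static_cast<int>(message) >= static_cast<int>(minimum);
}
```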
One significant change made here is that lck_mtx structures now directly
contain the internals of dtape_mutex structures. This was changed
because the old way of storing them in a malloc'ed object led to memory leaks.
The problem is that there's a lot of XNU code that uses simple locks and
does not destroy them (because it doesn't need to in the XNU
implementation). Since the only structure that really cares about the
lock size is the waitq structure, we just patch that up. Besides, we
had modified the waitq structure in the LKM before and nothing blew up,
so this should be fine.
This is used to avoid the server reading incorrect/corrupted reply
contents for pushed replies. This was happening because clients were
sending the push-reply call with the pointer to the message contents,
but they were immediately returning after sending it. This led to a race
condition in which the server would sometimes read the data after the
client had already overwritten/discarded said data.
The thread might have died after sending the message, so
it might not exist by the time the server gets the message.
In that case, just ignore/drop the message.
We were previously always updating the timer deadline. This meant that,
when a later deadline than the current one came along, we would update
the deadline to the later one. In effect, we were scheduling a timer for
the latest deadline available rather than the earliest.
The fix involves keeping track of the current deadline and not updating
it if the new deadline is later than the current one. There is an option
to override this behavior, however, because sometimes the timer_call code
changes the deadline on us to a later time and we *do* want to update it
when it tells us to do so explicitly. For example, the deadline returned
by timer_queue_expire is definitive: that's definitely the next deadline
we want. The deadline passed to timer_queue_assign, on the other hand,
is merely a suggestion.
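The deadline rule described above can be sketched roughly like this. `DeadlineTracker` and its members are hypothetical names for illustration, not the actual darlingserver types; `force` stands in for the override option.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tracker that keeps the earliest pending deadline.
class DeadlineTracker {
public:
    // Returns true if the stored deadline changed and the timer must be re-armed.
    // A suggestion (force == false) only wins if it is earlier than the current
    // deadline; a definitive deadline (force == true) always replaces it.
    bool update(std::uint64_t newDeadline, bool force = false) {
        if (armed_ && !force && newDeadline >= deadline_) {
            return false; // keep the earlier deadline we already have
        }
        deadline_ = newDeadline;
        armed_ = true;
        return true;
    }

    // Only valid once a deadline has been set.
    std::uint64_t deadline() const {
        assert(armed_);
        return deadline_;
    }

private:
    bool armed_ = false;
    std::uint64_t deadline_ = 0;
};
```

Under this sketch, a deadline from timer_queue_assign would be passed with `force` left false, while one returned by timer_queue_expire would be passed with `force` set to true.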
We were writing out the path to the target process (i.e. the one we're
looking up), but we should instead write it out to the process that made
the call.
This resolves a race condition where we receive a call and then
immediately receive an interrupt while that call is still pending.
The new behavior is to go ahead and process the pending call, but we
trigger interrupt processing as soon as the call suspends.
See DarlingServer::Kqchan::MachPort::_read() for why this is necessary.
This fixes crashes in libkqueue due to out-of-order kqchannel messages,
mainly visible in aslmanager.
This fixes some crashes with syslogd because the mqueue was vanishing
and calling knote_vanish, indicating its klist was going to be emptied.
However, since we weren't storing this flag in the knote,
filt_machportdetach thought the knote was still attached and tried to
detach it, causing a NULL pointer access.
Together with the corresponding changes in mldr, darlingserver no longer
requires capabilities while running! The next step towards making
Darling completely unprivileged would be to remove SUID from the main
Darling binary, but that's a task for some other time.
I originally started doing this to see if some issues I was seeing with
LLDB were related to the capabilities in mldr, but it seems they're
unrelated.
What this means is that we no longer release and destroy Thread and
Process instances when the threads and processes they manage die.
Instead, we keep them alive to perform some cleanup (like finishing
active calls).
This should fix the duct-tape panic where threads and tasks are still
referenced at death.
Best of all, there don't seem to be any leaks with this approach: for
each `process dying` or `thread dying` message in the log, there's a
`process being destroyed` or `thread being destroyed` message later
on. This means we're not leaking any processes or threads.
This call needs to access lots of private thread members, so it's better
to provide a single private helper that handles the call in the Thread
class rather than have it all in a Call.
This allows kernel runner threads to be created as necessary to process
the work that comes in through `kernelAsync` and `kernelSync`.
There's currently a hardcoded max of 10 permanent kernel runners.
However, if the workload is too much, temporary runners can be spawned;
each temporary worker processes a single work item and then exits. There
is no limit on the number of temporary workers that can be spawned.
This commit allows Darling processes to convert private memory in other
Darling processes into shared memory that they can access. This is
necessary, e.g. for LLDB.
They were using the current task, but that's not always the right one.
LLDB, for example, calls mach_vm_region_recurse with the map of the task
it's debugging.
std::stoul is base 10 by default, so we were trying to process hex
values as decimal values (producing incorrect values, as expected).
Also, memoryRegionInfo now returns a structure with the info rather than
having everything passed in as a reference, just like memoryInfo was
recently changed to do as well. This should make it easier to add more info
fields later.
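To illustrate the std::stoul pitfall and the struct-returning style, here is a small sketch. It assumes the hex values come from a "start-end" address range like the ones in /proc/<pid>/maps; `MemoryRegion` and `parseAddressRange` are hypothetical names, not the real memoryRegionInfo interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical result structure; returning a struct makes it easy to add
// more fields later without changing the function signature.
struct MemoryRegion {
    std::uintptr_t start;
    std::uintptr_t end;
};

// Parses a leading "start-end" hex address range from a line of text.
MemoryRegion parseAddressRange(const std::string& line) {
    std::size_t pos = 0;
    // Base 16 must be passed explicitly; the default base is 10.
    unsigned long start = std::stoul(line, &pos, 16);
    if (pos >= line.size() || line[pos] != '-') {
        throw std::invalid_argument("expected '-' after start address");
    }
    unsigned long end = std::stoul(line.substr(pos + 1), nullptr, 16);
    return MemoryRegion{static_cast<std::uintptr_t>(start),
                        static_cast<std::uintptr_t>(end)};
}
```

With the default base, `std::stoul("7f00")` stops at the first non-decimal character and yields 7 instead of 0x7f00, which is the kind of silently wrong value the fix addresses.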