Same situation as the last stack memory leak fix: this is fairly tricky
since it is dealing with stack pivoting. Fixes the memory leak around
pthread stack allocations, lowering memory usage for applications
that constantly spin up and destroy threads (like Steam).
We need to let glibc allocate a minimum-sized stack (128KB, and we can't
control it) to work around a race condition with DTV/TLS regions. This
means we need to do a stack pivot once the thread starts executing.
We also need to be careful because the `PThread` object is deleted
inside the execution thread, which was resulting in a use-after-free
bug.
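A minimal sketch of the shape this takes, assuming Linux/glibc; the names (`SpawnThread`, `StackTracking`, `ExecutionStack`) are illustrative and the architecture-specific pivot itself is left out:

```cpp
#include <pthread.h>
#include <sys/mman.h>
#include <limits.h>
#include <cstddef>

struct StackTracking {
  void* ExecutionStack;        // the large stack the thread pivots onto
  size_t ExecutionStackSize;
};

pthread_t SpawnThread(void* (*Entry)(void*), StackTracking* Tracking) {
  // Let glibc allocate its own minimum-sized stack so it can set up the
  // DTV/TLS regions without racing against us.
  pthread_attr_t Attr;
  pthread_attr_init(&Attr);
  pthread_attr_setstacksize(&Attr, PTHREAD_STACK_MIN);

  // Allocate the real execution stack ourselves so it can be tracked and
  // munmap'ed when the thread exits instead of being leaked.
  Tracking->ExecutionStackSize = 8 * 1024 * 1024;
  Tracking->ExecutionStack =
    mmap(nullptr, Tracking->ExecutionStackSize, PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

  // Entry pivots onto ExecutionStack before doing real work and must not
  // touch the tracking/PThread object once it has been freed.
  pthread_t Thread;
  pthread_create(&Thread, &Attr, Entry, Tracking);
  pthread_attr_destroy(&Attr);
  return Thread;
}
```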
There are definitely some more memory leaks that I'm still fighting, and I have
noticed in my abusive thread creation program that we might want to
change some jemalloc options to more aggressively cut down on residency.
This is just one out of many.
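For reference, the sort of knob meant here is jemalloc's decay configuration, which controls how quickly unused dirty/muzzy pages are returned to the kernel; the values below are purely illustrative, not a decided-on configuration:

```cpp
// Either via the environment at launch:
//   MALLOC_CONF="dirty_decay_ms:1000,muzzy_decay_ms:0"
// or compiled in through jemalloc's application-provided config symbol:
extern "C" const char* malloc_conf = "dirty_decay_ms:1000,muzzy_decay_ms:0";
```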
I remember seeing some application last year that closed a FEX-owned FD,
but now I don't remember what it was. This can really mess us up, so add
some debug tracking so we can try to find it again.
It might be something specifically around Flatpak, AppImage, or Chrome's
sandbox. I have some ideas for how to work around these problems if
they crop up, but I need to find the problem applications again.
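A rough sketch of what the debug tracking could look like; the names are hypothetical and this is not the actual syscall-handler code:

```cpp
#include <unordered_set>
#include <mutex>
#include <cstdio>

namespace FDTracking {
std::mutex Lock;
std::unordered_set<int> FEXOwnedFDs;

// Called whenever FEX opens an FD for its own use (logs, IPC, rootfs, ...).
void TrackFEXFD(int FD) {
  std::lock_guard<std::mutex> lk(Lock);
  FEXOwnedFDs.insert(FD);
}

// Called from the guest close() syscall handler before forwarding it.
void CheckGuestClose(int FD) {
  std::lock_guard<std::mutex> lk(Lock);
  if (FEXOwnedFDs.count(FD)) {
    fprintf(stderr, "Guest application closed FEX-owned FD %d\n", FD);
  }
}
} // namespace FDTracking
```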
Since we do an immediate overwrite of the file we are copying, we can
instead do a rename. Failure on rename is fine: it either means the
telemetry file didn't exist initially, or there was some other permission
error, in which case the telemetry would get lost regardless.
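A minimal sketch of the idea; the paths and function name are illustrative:

```cpp
#include <cstdio>

// Move the telemetry file to its destination instead of copying it byte by byte.
void CommitTelemetry(const char* SourcePath, const char* DestPath) {
  if (std::rename(SourcePath, DestPath) != 0) {
    // Intentionally ignored: either the file never existed or we hit a
    // permission error, and in both cases the telemetry is lost regardless.
  }
}
```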
This may be useful for tracking TSO faults that manage to fetch stale
data. While most TSO crashes are due to nullptr dereferences, this can
still check for the corruption case.
In 64-bit mode, the LOOP instruction's counter register is RCX or ECX,
selected by the address-size override prefix.
In 32-bit mode, it is ECX or CX.
FEX wasn't handling the 16-bit case at all, which caused the LOOP
instruction to effectively always operate at 32-bit size. Now this is
correctly supported, and it also stops treating the operation as 64-bit.
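A sketch of the selection logic (illustrative, not FEX's decoder code):

```cpp
#include <cstdint>

enum class CPUMode { Bit64, Bit32 };

// Width in bytes of the LOOP counter register, given the operating mode and
// whether the 0x67 address-size override prefix is present.
uint32_t LoopCounterWidth(CPUMode Mode, bool HasAddressSizeOverride) {
  if (Mode == CPUMode::Bit64) {
    return HasAddressSizeOverride ? 4 : 8;  // ECX instead of RCX
  }
  return HasAddressSizeOverride ? 2 : 4;    // CX instead of ECX
}
```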
It was a funny joke that this was here, but it is fundamentally
incompatible with what we're doing. All of those users are running proot
anyway because of how broken running directly under Termux is.
Just remove this from here.
Take e.g. a forward rep movsb copy from addr 0 to 1. Since this is a
bytewise copy, the expected behaviour is:
before: aaabbbb...
after: aaaaaaa...
but by copying in 32-byte chunks we end up with:
after: aaaabbbb...
due to the self-overwrites not occurring within a single 32-byte copy.
When TSO is disabled, vector LDP/STP can be used for a two-instruction
32-byte memory copy, which is significantly faster than the current
byte-by-byte copy. Performing two such copies directly after one another
also marginally increases copy speed for all sizes >= 64.
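As a sketch of the safety condition this implies (illustrative, not the actual backend code), the wide path only matches bytewise forward-copy semantics when the destination either trails the source or leads it by at least one full chunk:

```cpp
#include <cstdint>

bool CanUseWideForwardCopy(uint64_t Dst, uint64_t Src, uint64_t Length,
                           uint64_t ChunkSize) {
  // Non-overlapping ranges are always safe.
  if (Dst >= Src + Length || Src >= Dst + Length) {
    return true;
  }
  // Overlapping forward copy: bytewise self-overwrite semantics are only
  // preserved if the destination trails the source, or leads it by at least
  // one full chunk, so every chunk is written before any later chunk that
  // covers those bytes is read.
  return Dst <= Src || (Dst - Src) >= ChunkSize;
}
```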
I was looking at some other JIT overheads and this cropped up. Instead of
materializing a constant using mov+movk+movk+movk, load it from the named
vector constant array.
In a micro-benchmark this improved performance by 34%.
In bytemark this improved a sub-bench by 0.82%.
Missed this instruction when implementing rdtscp. It returns the same ID
result in a register just like rdtscp, but without the cycle counter
results, and just like rdtscp it doesn't touch any flags.
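Assuming the instruction is rdpid (which matches this description), a rough host-side comparison purely for illustration:

```cpp
#include <cstdint>

// rdtscp: cycle counter in EDX:EAX, IA32_TSC_AUX (the ID) in ECX.
uint32_t ReadIDViaRdtscp() {
  uint32_t Lo, Hi, Aux;
  asm volatile("rdtscp" : "=a"(Lo), "=d"(Hi), "=c"(Aux));
  return Aux;
}

// rdpid: the same IA32_TSC_AUX value, no cycle counter read, no flags touched.
uint64_t ReadIDViaRdpid() {
  uint64_t ID;
  asm volatile("rdpid %0" : "=r"(ID));
  return ID;
}
```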
This unit test recreates the error condition that #3478 causes.
With a string operation that is a backwards copy, the optimization will
read past the end of the page and result in a crash.
This seemingly only happens with backwards string operations, but this
test covers both forward and backward.
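Roughly the shape of the setup (a sketch, not the actual unit test in the tree): put the source at the very top of an accessible page, guard the page after it, and run a backwards copy right below the boundary; the real test covers the forward direction as well.

```cpp
#include <sys/mman.h>
#include <cstdint>

int main() {
  const size_t Page = 4096;
  // Two pages: the first accessible, the second PROT_NONE so that any read
  // past the end of the first page faults immediately.
  void* Mapping = mmap(nullptr, 2 * Page, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (Mapping == MAP_FAILED) return 1;
  uint8_t* Base = static_cast<uint8_t*>(Mapping);
  mprotect(Base + Page, Page, PROT_NONE);

  uint8_t* End = Base + Page;  // first unmapped byte
  const uint64_t Size = 64;
  uint64_t Src = reinterpret_cast<uint64_t>(End - 1);         // last readable byte
  uint64_t Dst = reinterpret_cast<uint64_t>(End - 1 - Size);
  uint64_t Count = Size;

  // Backwards copy: DF=1, RSI/RDI point at the highest byte and walk down.
  // A bytewise copy never touches the guarded page; an over-reading
  // optimization does, and crashes.
  asm volatile("std\n\t"
               "rep movsb\n\t"
               "cld"
               : "+S"(Src), "+D"(Dst), "+c"(Count)
               :
               : "memory", "cc");
  return 0;
}
```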
x86 has a few prefetch instructions.
- prefetch - One of two classic 3DNow! instructions
  - Prefetch into the L1 data cache
- prefetchw - One of two classic 3DNow! instructions
  - Implies a prefetch into the L1 data cache
  - Prefetch cacheline with intent to write and exclusive ownership
- prefetchnta
  - Prefetch non-temporal data with respect to /all/ cache levels
  - Assumes inclusive caches?
- prefetch{t0,t1,t2}
  - Prefetch data with respect to each cache level
  - T0 = L1 and higher
  - T1 = L2 and higher
  - T2 = L3 and higher
**Some silly duplicates**
- prefetchwt1
  - Duplicate of prefetchw but explicitly the L1 data cache
- prefetch_exclusive
  - Duplicate of prefetch
God Of War 2018 uses prefetchw as a hint for exclusive ownership of the
cacheline in some very aggressive spin-loops. Let's implement the
operations to help it along.
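As a rough host-side analogue of what these hints mean (using the compiler builtin purely for illustration; this is not what the JIT emits):

```cpp
// rw: 0 = read, 1 = write intent. locality: 3 = keep in all levels, 0 = non-temporal.
inline void Prefetch(const void* Addr)    { __builtin_prefetch(Addr, 0, 3); } // 3DNow! prefetch, L1
inline void PrefetchW(const void* Addr)   { __builtin_prefetch(Addr, 1, 3); } // write intent / exclusive
inline void PrefetchNTA(const void* Addr) { __builtin_prefetch(Addr, 0, 0); } // non-temporal
inline void PrefetchT0(const void* Addr)  { __builtin_prefetch(Addr, 0, 3); } // L1 and higher
inline void PrefetchT1(const void* Addr)  { __builtin_prefetch(Addr, 0, 2); } // L2 and higher
inline void PrefetchT2(const void* Addr)  { __builtin_prefetch(Addr, 0, 1); } // L3 and higher
```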
This function can be unit-tested more easily, and the stack special case is
more cleanly handled as a post-collection step.
There is a minor functional change: the stack special case previously didn't
trigger if the range end was within the stack mapping. This is now fixed.