I'm taking GSoC in a slightly different direction. I will finish the
PowerPC blending/blitting optimizations, but first I'm going to focus
on the general Graphics::Surface and Graphics::ManagedSurface code.
PowerPC's <altivec.h> header redefines bool to be __vector(4) __bool,
which is weird, so I changed the function prototypes to use int instead
of bool. Hopefully this fixes things.
Not all Arm NEON intrinsics are included in the iOS simulator's
arm_neon.h file, so we no longer compile the Arm NEON code for the
simulator. Also, arm_neon.h on Windows seems to be either an empty
header or at least a header with only a few of the many intrinsics that
should be there.
Made it so that iOS doesn't use Arm NEON, since it only supports a very
limited set of instructions (apparently it doesn't even have intrinsics
for something as simple as bit shifting?). I also changed every
floating-point literal in surface_simd_sse from a double literal to a
float literal because Windows x64 was complaining about it.
I was using only the GCC and Clang macros to detect which platform
ScummVM was being compiled for, and neglected the MSVC ones, so the code
would not compile with MSVC. I fixed that by adding them. I also added
the fallback SIMD implementation .cpp file to module.mk for the AGS
engine.
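A minimal sketch of the kind of check this involves. The EXAMPLE_SIMD_*
names and exampleSimdName() are hypothetical and are not the macros used
in the engine; only the compiler-provided macros being tested are real.
Windows on Arm is deliberately left on the fallback path here, matching
the arm_neon.h problem noted in another commit.

```cpp
// GCC and Clang define __SSE2__ when SSE2 is enabled. MSVC does not:
// it defines _M_X64 on x64 (where SSE2 is always available) and sets
// _M_IX86_FP to 2 on 32-bit x86 when building with /arch:SSE2.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define EXAMPLE_SIMD_SSE2 1
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
#define EXAMPLE_SIMD_NEON 1
#else
// Nothing detected: build only the fallback implementation.
#define EXAMPLE_SIMD_NONE 1
#endif

// Hypothetical helper that reports which path was selected at compile time.
const char *exampleSimdName() {
#if defined(EXAMPLE_SIMD_SSE2)
	return "sse2";
#elif defined(EXAMPLE_SIMD_NEON)
	return "neon";
#else
	return "none";
#endif
}
```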
Finished writing the code in surface_simd_sse.cpp. I also added a backup
option in case no processor SIMD extensions are found. In that case it
just defaults to the normal drawInnerGeneric. I also made
drawInnerGeneric a bit faster by moving certain things to compile
time. Tests were changed to also include SSE2.
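A rough sketch of the fallback idea, assuming a function pointer picked
once from a SIMD-availability flag. All names and the three-argument
signature are hypothetical; the real drawInner functions take many more
arguments, and the "SIMD" path below is plain C++ standing in for
intrinsics so the example builds anywhere.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical signature shared by every implementation.
typedef void (*DrawFuncExample)(uint32_t *dst, const uint32_t *src, size_t count);

// Plain scalar loop, standing in for drawInnerGeneric.
void drawGenericExample(uint32_t *dst, const uint32_t *src, size_t count) {
	for (size_t i = 0; i < count; ++i)
		dst[i] = src[i];
}

// Stand-in for a SIMD path: 4 pixels per iteration, then the leftover tail.
// A real version would use SSE2 or NEON intrinsics in the first loop.
void drawSimdExample(uint32_t *dst, const uint32_t *src, size_t count) {
	size_t i = 0;
	for (; i + 4 <= count; i += 4) {
		dst[i] = src[i];
		dst[i + 1] = src[i + 1];
		dst[i + 2] = src[i + 2];
		dst[i + 3] = src[i + 3];
	}
	for (; i < count; ++i)
		dst[i] = src[i];
}

// Chosen once; callers never need to know which path they got.
DrawFuncExample pickDrawFuncExample(bool simdAvailable) {
	return simdAvailable ? drawSimdExample : drawGenericExample;
}
```

Both paths must produce identical pixels, which is what makes it safe to
run the same tests against each of them.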
Added a template specialization for 2bpp to 2bpp blits in
BITMAP::drawInner, which makes 2bpp to 2bpp blits around 2 times as
fast as normal 4bpp to 4bpp blitting.
Optimized most if not all code paths in BITMAP::draw. All blending modes
have been optimized with Arm NEON intrinsics, and multiple different
source and destination formats are optimized (the following source and
destination bytes-per-pixel pairs: 1 and 1, 2 and 2, 4 and 4, 2 and 4).
After this, I am going to clean up this code and apply more optimizations
where I can, then make the SSE versions of the functions, and try to
optimize the slow path as much as I can. Then I will see what I can do
with BITMAP::stretched draw.
Just committing my first attempts at optimizing BITMAP::draw and the
blendPixel function. Here's an overview of the changes (some are
temporary):
- Put the loop of BITMAP::draw into its own function drawInner.
I templated it so that I could put different paths into the loop that
could be optimized out at compile time if a certain blending function
didn't need it etc.
- I added Arm NEON (SIMD) intrinsics to the drawInner function. I
haven't ported it to SSE yet, but there is a small library that actually
maps NEON intrinsics to SSE ones.
- Removed a few ifs from the normal x loop and put them in the y loop.
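A minimal sketch of the templating idea from the list above, assuming a
single compile-time flag. drawRowExample, drawRowDispatchExample and the
transparent-color check are hypothetical illustrations; the real
drawInner has more template parameters and arguments.

```cpp
#include <cstdint>

// Hypothetical inner loop for one row of 32-bit pixels.
template<bool SkipTransparent>
void drawRowExample(uint32_t *dst, const uint32_t *src, int width, uint32_t transparentColor) {
	for (int x = 0; x < width; ++x) {
		// SkipTransparent is a compile-time constant, so when it is false
		// the compiler can drop this check from the loop entirely.
		if (SkipTransparent && src[x] == transparentColor)
			continue;
		dst[x] = src[x];
	}
}

// The runtime decision is made once, outside the per-pixel loop.
void drawRowDispatchExample(uint32_t *dst, const uint32_t *src, int width,
		bool skipTransparent, uint32_t transparentColor) {
	if (skipTransparent)
		drawRowExample<true>(dst, src, width, transparentColor);
	else
		drawRowExample<false>(dst, src, width, transparentColor);
}
```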
Moved the loop of BITMAP::draw into a templated function so that certain
checks could be resolved at compile time rather than at runtime. When
the src and dst formats are both 4 bytes per pixel, it skips colorToARGB
and uses a quicker method. This is still a very early work in progress.
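The message does not say what the quicker method is. One possibility,
sketched here purely as an assumption, is to unpack channels with
constant shifts when both surfaces are known to share a 32-bit layout,
instead of reading shift amounts from the pixel format for every pixel.
FormatExample, colorToARGBExample and unpackARGB8888Example are
hypothetical names, and the A-in-the-top-byte layout is assumed.

```cpp
#include <cstdint>

// Hypothetical, simplified format description (8 bits per channel only).
struct FormatExample {
	int aShift, rShift, gShift, bShift;
};

// Generic path: shift amounts are read from the format at run time.
inline void colorToARGBExample(const FormatExample &f, uint32_t c,
		uint8_t &a, uint8_t &r, uint8_t &g, uint8_t &b) {
	a = (uint8_t)((c >> f.aShift) & 0xFF);
	r = (uint8_t)((c >> f.rShift) & 0xFF);
	g = (uint8_t)((c >> f.gShift) & 0xFF);
	b = (uint8_t)((c >> f.bShift) & 0xFF);
}

// Fast path: the layout is assumed fixed, so every shift is a constant.
inline void unpackARGB8888Example(uint32_t c,
		uint8_t &a, uint8_t &r, uint8_t &g, uint8_t &b) {
	a = (uint8_t)(c >> 24);
	r = (uint8_t)((c >> 16) & 0xFF);
	g = (uint8_t)((c >> 8) & 0xFF);
	b = (uint8_t)(c & 0xFF);
}
```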
The old benchmark measured the render_to_screen function, which wasn't
specific enough to what needs to be benchmarked (BITMAP::draw and the
blending functions), so it was changed to use the drawing functions of
Allegro's bitmap directly. Also, from looking at other games, it seems
like AGS games use truecolor graphics more often than not, so the
benchmark test graphic was changed to truecolor.
The original engine allowed players to go to the main menu
via the Esc button, but only when they were in the Scene state.
This is now implemented here as well.