/*
 * Copyright © 2008 Rodrigo Kumpera
 * Copyright © 2008 André Tupinambá
 *
 * Permission to use, copy, modify, distribute, and sell this software and its
 * documentation for any purpose is hereby granted without fee, provided that
 * the above copyright notice appear in all copies and that both that
 * copyright notice and this permission notice appear in supporting
 * documentation, and that the name of Red Hat not be used in advertising or
 * publicity pertaining to distribution of the software without specific,
 * written prior permission. Red Hat makes no representations about the
 * suitability of this software for any purpose. It is provided "as is"
 * without express or implied warranty.
 *
 * THE COPYRIGHT HOLDERS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS
 * SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
 * FITNESS, IN NO EVENT SHALL THE COPYRIGHT HOLDERS BE LIABLE FOR ANY
 * SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
 * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
 * AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING
 * OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
 * SOFTWARE.
 *
 * Author:  Rodrigo Kumpera (kumpera@gmail.com)
 *          André Tupinambá (andrelrt@gmail.com)
 *
 * Based on work by Owen Taylor and Søren Sandmann
 */
#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#include <mmintrin.h>
#include <xmmintrin.h> /* for _mm_shuffle_pi16 and _MM_SHUFFLE */
#include <emmintrin.h> /* for SSE2 intrinsics */
#include "pixman-private.h"
#include "pixman-combine32.h"
#include "pixman-fast-path.h"

#if defined(_MSC_VER) && defined(_M_AMD64)
/* Windows 64 doesn't allow MMX to be used, so
 * the pixman-x64-mmx-emulation.h file contains
 * implementations of those MMX intrinsics that
 * are used in the SSE2 implementation.
 */
#   include "pixman-x64-mmx-emulation.h"
#endif

#ifdef USE_SSE2

/* --------------------------------------------------------------------
 * Locals
 */

static __m64 mask_x0080;
static __m64 mask_x00ff;
static __m64 mask_x0101;
static __m64 mask_x_alpha;

static __m64 mask_x565_rgb;
static __m64 mask_x565_unpack;

static __m128i mask_0080;
static __m128i mask_00ff;
static __m128i mask_0101;
static __m128i mask_ffff;
static __m128i mask_ff000000;
static __m128i mask_alpha;

static __m128i mask_565_r;
static __m128i mask_565_g1, mask_565_g2;
static __m128i mask_565_b;
static __m128i mask_red;
static __m128i mask_green;
static __m128i mask_blue;

static __m128i mask_565_fix_rb;
static __m128i mask_565_fix_g;

/* ----------------------------------------------------------------------
 * SSE2 Inlines
 */
|
|
static force_inline __m128i
|
|
unpack_32_1x128 (uint32_t data)
|
|
{
|
|
return _mm_unpacklo_epi8 (_mm_cvtsi32_si128 (data), _mm_setzero_si128 ());
|
|
}
|
|
|
|
static force_inline void
|
|
unpack_128_2x128 (__m128i data, __m128i* data_lo, __m128i* data_hi)
|
|
{
|
|
*data_lo = _mm_unpacklo_epi8 (data, _mm_setzero_si128 ());
|
|
*data_hi = _mm_unpackhi_epi8 (data, _mm_setzero_si128 ());
|
|
}
|
|
|
|
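/* Convert packed r5g6b5 pixels (one per 32-bit lane, in the low 16 bits)
 * to x8r8g8b8.  Each channel is widened by replicating its top bits into
 * the newly created low bits, so 0x1f becomes 0xff and 0x3f becomes 0xff;
 * the alpha byte is left as zero.
 */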
static force_inline __m128i
|
|
unpack_565_to_8888 (__m128i lo)
|
|
{
|
|
__m128i r, g, b, rb, t;
|
|
|
|
r = _mm_and_si128 (_mm_slli_epi32 (lo, 8), mask_red);
|
|
g = _mm_and_si128 (_mm_slli_epi32 (lo, 5), mask_green);
|
|
b = _mm_and_si128 (_mm_slli_epi32 (lo, 3), mask_blue);
|
|
|
|
rb = _mm_or_si128 (r, b);
|
|
t = _mm_and_si128 (rb, mask_565_fix_rb);
|
|
t = _mm_srli_epi32 (t, 5);
|
|
rb = _mm_or_si128 (rb, t);
|
|
|
|
t = _mm_and_si128 (g, mask_565_fix_g);
|
|
t = _mm_srli_epi32 (t, 6);
|
|
g = _mm_or_si128 (g, t);
|
|
|
|
return _mm_or_si128 (rb, g);
|
|
}
|
|
|
|
static force_inline void
|
|
unpack_565_128_4x128 (__m128i data,
|
|
__m128i* data0,
|
|
__m128i* data1,
|
|
__m128i* data2,
|
|
__m128i* data3)
|
|
{
|
|
__m128i lo, hi;
|
|
|
|
lo = _mm_unpacklo_epi16 (data, _mm_setzero_si128 ());
|
|
hi = _mm_unpackhi_epi16 (data, _mm_setzero_si128 ());
|
|
|
|
lo = unpack_565_to_8888 (lo);
|
|
hi = unpack_565_to_8888 (hi);
|
|
|
|
unpack_128_2x128 (lo, data0, data1);
|
|
unpack_128_2x128 (hi, data2, data3);
|
|
}
|
|
|
|
static force_inline uint16_t
|
|
pack_565_32_16 (uint32_t pixel)
|
|
{
|
|
return (uint16_t) (((pixel >> 8) & 0xf800) |
|
|
((pixel >> 5) & 0x07e0) |
|
|
((pixel >> 3) & 0x001f));
|
|
}
|
|
|
|
static force_inline __m128i
|
|
pack_2x128_128 (__m128i lo, __m128i hi)
|
|
{
|
|
return _mm_packus_epi16 (lo, hi);
|
|
}
|
|
|
|
static force_inline __m128i
|
|
pack_565_2x128_128 (__m128i lo, __m128i hi)
|
|
{
|
|
__m128i data;
|
|
__m128i r, g1, g2, b;
|
|
|
|
data = pack_2x128_128 (lo, hi);
|
|
|
|
r = _mm_and_si128 (data, mask_565_r);
|
|
g1 = _mm_and_si128 (_mm_slli_epi32 (data, 3), mask_565_g1);
|
|
g2 = _mm_and_si128 (_mm_srli_epi32 (data, 5), mask_565_g2);
|
|
b = _mm_and_si128 (_mm_srli_epi32 (data, 3), mask_565_b);
|
|
|
|
return _mm_or_si128 (_mm_or_si128 (_mm_or_si128 (r, g1), g2), b);
|
|
}
|
|
|
|
static force_inline __m128i
|
|
pack_565_4x128_128 (__m128i* xmm0, __m128i* xmm1, __m128i* xmm2, __m128i* xmm3)
|
|
{
|
|
return _mm_packus_epi16 (pack_565_2x128_128 (*xmm0, *xmm1),
|
|
pack_565_2x128_128 (*xmm2, *xmm3));
|
|
}
|
|
|
|
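/* Alpha tests on four packed a8r8g8b8 pixels.  _mm_movemask_epi8 yields
 * one bit per byte; the 0x8888 mask keeps bits 3, 7, 11 and 15, i.e. the
 * alpha byte of each pixel.  is_opaque: all alphas are 0xff; is_zero: the
 * whole vector is zero; is_transparent: all alphas are 0x00.
 */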
static force_inline int
|
|
is_opaque (__m128i x)
|
|
{
|
|
__m128i ffs = _mm_cmpeq_epi8 (x, x);
|
|
|
|
return (_mm_movemask_epi8 (_mm_cmpeq_epi8 (x, ffs)) & 0x8888) == 0x8888;
|
|
}
|
|
|
|
static force_inline int
|
|
is_zero (__m128i x)
|
|
{
|
|
return _mm_movemask_epi8 (
|
|
_mm_cmpeq_epi8 (x, _mm_setzero_si128 ())) == 0xffff;
|
|
}
|
|
|
|
static force_inline int
|
|
is_transparent (__m128i x)
|
|
{
|
|
return (_mm_movemask_epi8 (
|
|
_mm_cmpeq_epi8 (x, _mm_setzero_si128 ())) & 0x8888) == 0x8888;
|
|
}
|
|
|
|
static force_inline __m128i
|
|
expand_pixel_32_1x128 (uint32_t data)
|
|
{
|
|
return _mm_shuffle_epi32 (unpack_32_1x128 (data), _MM_SHUFFLE (1, 0, 1, 0));
|
|
}
|
|
|
|
static force_inline __m128i
|
|
expand_alpha_1x128 (__m128i data)
|
|
{
|
|
return _mm_shufflehi_epi16 (_mm_shufflelo_epi16 (data,
|
|
_MM_SHUFFLE (3, 3, 3, 3)),
|
|
_MM_SHUFFLE (3, 3, 3, 3));
|
|
}
|
|
|
|
static force_inline void
|
|
expand_alpha_2x128 (__m128i data_lo,
|
|
__m128i data_hi,
|
|
__m128i* alpha_lo,
|
|
__m128i* alpha_hi)
|
|
{
|
|
__m128i lo, hi;
|
|
|
|
lo = _mm_shufflelo_epi16 (data_lo, _MM_SHUFFLE (3, 3, 3, 3));
|
|
hi = _mm_shufflelo_epi16 (data_hi, _MM_SHUFFLE (3, 3, 3, 3));
|
|
|
|
*alpha_lo = _mm_shufflehi_epi16 (lo, _MM_SHUFFLE (3, 3, 3, 3));
|
|
*alpha_hi = _mm_shufflehi_epi16 (hi, _MM_SHUFFLE (3, 3, 3, 3));
|
|
}
|
|
|
|
static force_inline void
|
|
expand_alpha_rev_2x128 (__m128i data_lo,
|
|
__m128i data_hi,
|
|
__m128i* alpha_lo,
|
|
__m128i* alpha_hi)
|
|
{
|
|
__m128i lo, hi;
|
|
|
|
lo = _mm_shufflelo_epi16 (data_lo, _MM_SHUFFLE (0, 0, 0, 0));
|
|
hi = _mm_shufflelo_epi16 (data_hi, _MM_SHUFFLE (0, 0, 0, 0));
|
|
*alpha_lo = _mm_shufflehi_epi16 (lo, _MM_SHUFFLE (0, 0, 0, 0));
|
|
*alpha_hi = _mm_shufflehi_epi16 (hi, _MM_SHUFFLE (0, 0, 0, 0));
|
|
}
|
|
|
|
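/* Per-channel multiply of eight unpacked 16-bit channels per register,
 * with each product divided by 255: (x * a + 0x80) * 0x101 >> 16 is the
 * usual exact-rounding substitute for (x * a) / 255 on 8-bit values,
 * implemented here with _mm_adds_epu16 and _mm_mulhi_epu16.
 */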
static force_inline void
|
|
pix_multiply_2x128 (__m128i* data_lo,
|
|
__m128i* data_hi,
|
|
__m128i* alpha_lo,
|
|
__m128i* alpha_hi,
|
|
__m128i* ret_lo,
|
|
__m128i* ret_hi)
|
|
{
|
|
__m128i lo, hi;
|
|
|
|
lo = _mm_mullo_epi16 (*data_lo, *alpha_lo);
|
|
hi = _mm_mullo_epi16 (*data_hi, *alpha_hi);
|
|
lo = _mm_adds_epu16 (lo, mask_0080);
|
|
hi = _mm_adds_epu16 (hi, mask_0080);
|
|
*ret_lo = _mm_mulhi_epu16 (lo, mask_0101);
|
|
*ret_hi = _mm_mulhi_epu16 (hi, mask_0101);
|
|
}
|
|
|
|
static force_inline void
|
|
pix_add_multiply_2x128 (__m128i* src_lo,
|
|
__m128i* src_hi,
|
|
__m128i* alpha_dst_lo,
|
|
__m128i* alpha_dst_hi,
|
|
__m128i* dst_lo,
|
|
__m128i* dst_hi,
|
|
__m128i* alpha_src_lo,
|
|
__m128i* alpha_src_hi,
|
|
__m128i* ret_lo,
|
|
__m128i* ret_hi)
|
|
{
|
|
__m128i t1_lo, t1_hi;
|
|
__m128i t2_lo, t2_hi;
|
|
|
|
pix_multiply_2x128 (src_lo, src_hi, alpha_dst_lo, alpha_dst_hi, &t1_lo, &t1_hi);
|
|
pix_multiply_2x128 (dst_lo, dst_hi, alpha_src_lo, alpha_src_hi, &t2_lo, &t2_hi);
|
|
|
|
*ret_lo = _mm_adds_epu8 (t1_lo, t2_lo);
|
|
*ret_hi = _mm_adds_epu8 (t1_hi, t2_hi);
|
|
}
|
|
|
|
static force_inline void
|
|
negate_2x128 (__m128i data_lo,
|
|
__m128i data_hi,
|
|
__m128i* neg_lo,
|
|
__m128i* neg_hi)
|
|
{
|
|
*neg_lo = _mm_xor_si128 (data_lo, mask_00ff);
|
|
*neg_hi = _mm_xor_si128 (data_hi, mask_00ff);
|
|
}
|
|
|
|
static force_inline void
|
|
invert_colors_2x128 (__m128i data_lo,
|
|
__m128i data_hi,
|
|
__m128i* inv_lo,
|
|
__m128i* inv_hi)
|
|
{
|
|
__m128i lo, hi;
|
|
|
|
lo = _mm_shufflelo_epi16 (data_lo, _MM_SHUFFLE (3, 0, 1, 2));
|
|
hi = _mm_shufflelo_epi16 (data_hi, _MM_SHUFFLE (3, 0, 1, 2));
|
|
*inv_lo = _mm_shufflehi_epi16 (lo, _MM_SHUFFLE (3, 0, 1, 2));
|
|
*inv_hi = _mm_shufflehi_epi16 (hi, _MM_SHUFFLE (3, 0, 1, 2));
|
|
}
|
|
|
|
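/* Porter-Duff OVER on two pairs of unpacked registers:
 * dst = src + (1 - alpha) * dst, with the multiply rounded as in
 * pix_multiply_2x128 and the final add saturating per byte.
 */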
static force_inline void
|
|
over_2x128 (__m128i* src_lo,
|
|
__m128i* src_hi,
|
|
__m128i* alpha_lo,
|
|
__m128i* alpha_hi,
|
|
__m128i* dst_lo,
|
|
__m128i* dst_hi)
|
|
{
|
|
__m128i t1, t2;
|
|
|
|
negate_2x128 (*alpha_lo, *alpha_hi, &t1, &t2);
|
|
|
|
pix_multiply_2x128 (dst_lo, dst_hi, &t1, &t2, dst_lo, dst_hi);
|
|
|
|
*dst_lo = _mm_adds_epu8 (*src_lo, *dst_lo);
|
|
*dst_hi = _mm_adds_epu8 (*src_hi, *dst_hi);
|
|
}
|
|
|
|
static force_inline void
|
|
over_rev_non_pre_2x128 (__m128i src_lo,
|
|
__m128i src_hi,
|
|
__m128i* dst_lo,
|
|
__m128i* dst_hi)
|
|
{
|
|
__m128i lo, hi;
|
|
__m128i alpha_lo, alpha_hi;
|
|
|
|
expand_alpha_2x128 (src_lo, src_hi, &alpha_lo, &alpha_hi);
|
|
|
|
lo = _mm_or_si128 (alpha_lo, mask_alpha);
|
|
hi = _mm_or_si128 (alpha_hi, mask_alpha);
|
|
|
|
invert_colors_2x128 (src_lo, src_hi, &src_lo, &src_hi);
|
|
|
|
pix_multiply_2x128 (&src_lo, &src_hi, &lo, &hi, &lo, &hi);
|
|
|
|
over_2x128 (&lo, &hi, &alpha_lo, &alpha_hi, dst_lo, dst_hi);
|
|
}
|
|
|
|
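/* (src IN mask) OVER dst for unpacked data: the source and its alpha are
 * first multiplied by the mask, then composited with over_2x128.
 */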
static force_inline void
|
|
in_over_2x128 (__m128i* src_lo,
|
|
__m128i* src_hi,
|
|
__m128i* alpha_lo,
|
|
__m128i* alpha_hi,
|
|
__m128i* mask_lo,
|
|
__m128i* mask_hi,
|
|
__m128i* dst_lo,
|
|
__m128i* dst_hi)
|
|
{
|
|
__m128i s_lo, s_hi;
|
|
__m128i a_lo, a_hi;
|
|
|
|
pix_multiply_2x128 (src_lo, src_hi, mask_lo, mask_hi, &s_lo, &s_hi);
|
|
pix_multiply_2x128 (alpha_lo, alpha_hi, mask_lo, mask_hi, &a_lo, &a_hi);
|
|
|
|
over_2x128 (&s_lo, &s_hi, &a_lo, &a_hi, dst_lo, dst_hi);
|
|
}
|
|
|
|
/* load 4 pixels from a 16-byte boundary aligned address */
static force_inline __m128i
load_128_aligned (__m128i* src)
{
    return _mm_load_si128 (src);
}

/* load 4 pixels from an unaligned address */
static force_inline __m128i
load_128_unaligned (const __m128i* src)
{
    return _mm_loadu_si128 (src);
}

/* save 4 pixels using Write Combining memory on a 16-byte
 * boundary aligned address
 */
static force_inline void
save_128_write_combining (__m128i* dst,
                          __m128i data)
{
    _mm_stream_si128 (dst, data);
}

/* save 4 pixels on a 16-byte boundary aligned address */
static force_inline void
save_128_aligned (__m128i* dst,
                  __m128i data)
{
    _mm_store_si128 (dst, data);
}

/* save 4 pixels on an unaligned address */
static force_inline void
save_128_unaligned (__m128i* dst,
                    __m128i data)
{
    _mm_storeu_si128 (dst, data);
}
|
|
|
|
/* ------------------------------------------------------------------
|
|
* MMX inlines
|
|
*/
|
|
|
|
static force_inline __m64
|
|
load_32_1x64 (uint32_t data)
|
|
{
|
|
return _mm_cvtsi32_si64 (data);
|
|
}
|
|
|
|
static force_inline __m64
|
|
unpack_32_1x64 (uint32_t data)
|
|
{
|
|
return _mm_unpacklo_pi8 (load_32_1x64 (data), _mm_setzero_si64 ());
|
|
}
|
|
|
|
static force_inline __m64
|
|
expand_alpha_1x64 (__m64 data)
|
|
{
|
|
return _mm_shuffle_pi16 (data, _MM_SHUFFLE (3, 3, 3, 3));
|
|
}
|
|
|
|
static force_inline __m64
|
|
expand_alpha_rev_1x64 (__m64 data)
|
|
{
|
|
return _mm_shuffle_pi16 (data, _MM_SHUFFLE (0, 0, 0, 0));
|
|
}
|
|
|
|
static force_inline __m64
|
|
expand_pixel_8_1x64 (uint8_t data)
|
|
{
|
|
return _mm_shuffle_pi16 (
|
|
unpack_32_1x64 ((uint32_t)data), _MM_SHUFFLE (0, 0, 0, 0));
|
|
}
|
|
|
|
static force_inline __m64
|
|
pix_multiply_1x64 (__m64 data,
|
|
__m64 alpha)
|
|
{
|
|
return _mm_mulhi_pu16 (_mm_adds_pu16 (_mm_mullo_pi16 (data, alpha),
|
|
mask_x0080),
|
|
mask_x0101);
|
|
}
|
|
|
|
static force_inline __m64
|
|
pix_add_multiply_1x64 (__m64* src,
|
|
__m64* alpha_dst,
|
|
__m64* dst,
|
|
__m64* alpha_src)
|
|
{
|
|
__m64 t1 = pix_multiply_1x64 (*src, *alpha_dst);
|
|
__m64 t2 = pix_multiply_1x64 (*dst, *alpha_src);
|
|
|
|
return _mm_adds_pu8 (t1, t2);
|
|
}
|
|
|
|
static force_inline __m64
|
|
negate_1x64 (__m64 data)
|
|
{
|
|
return _mm_xor_si64 (data, mask_x00ff);
|
|
}
|
|
|
|
static force_inline __m64
|
|
invert_colors_1x64 (__m64 data)
|
|
{
|
|
return _mm_shuffle_pi16 (data, _MM_SHUFFLE (3, 0, 1, 2));
|
|
}
|
|
|
|
static force_inline __m64
|
|
over_1x64 (__m64 src, __m64 alpha, __m64 dst)
|
|
{
|
|
return _mm_adds_pu8 (src, pix_multiply_1x64 (dst, negate_1x64 (alpha)));
|
|
}
|
|
|
|
static force_inline __m64
|
|
in_over_1x64 (__m64* src, __m64* alpha, __m64* mask, __m64* dst)
|
|
{
|
|
return over_1x64 (pix_multiply_1x64 (*src, *mask),
|
|
pix_multiply_1x64 (*alpha, *mask),
|
|
*dst);
|
|
}
|
|
|
|
static force_inline __m64
|
|
over_rev_non_pre_1x64 (__m64 src, __m64 dst)
|
|
{
|
|
__m64 alpha = expand_alpha_1x64 (src);
|
|
|
|
return over_1x64 (pix_multiply_1x64 (invert_colors_1x64 (src),
|
|
_mm_or_si64 (alpha, mask_x_alpha)),
|
|
alpha,
|
|
dst);
|
|
}
|
|
|
|
static force_inline uint32_t
|
|
pack_1x64_32 (__m64 data)
|
|
{
|
|
return _mm_cvtsi64_si32 (_mm_packs_pu16 (data, _mm_setzero_si64 ()));
|
|
}
|
|
|
|
/* Expand 16 bits positioned at @pos (0-3) of an MMX register into
|
|
*
|
|
* 00RR00GG00BB
|
|
*
|
|
* --- Expanding 565 in the low word ---
|
|
*
|
|
* m = (m << (32 - 3)) | (m << (16 - 5)) | m;
|
|
* m = m & (01f0003f001f);
|
|
* m = m * (008404100840);
|
|
* m = m >> 8;
|
|
*
|
|
* Note the trick here - the top word is shifted by another nibble to
|
|
* avoid it bumping into the middle word
|
|
*/
|
|
static force_inline __m64
|
|
expand565_16_1x64 (uint16_t pixel)
|
|
{
|
|
__m64 p;
|
|
__m64 t1, t2;
|
|
|
|
p = _mm_cvtsi32_si64 ((uint32_t) pixel);
|
|
|
|
t1 = _mm_slli_si64 (p, 36 - 11);
|
|
t2 = _mm_slli_si64 (p, 16 - 5);
|
|
|
|
p = _mm_or_si64 (t1, p);
|
|
p = _mm_or_si64 (t2, p);
|
|
p = _mm_and_si64 (p, mask_x565_rgb);
|
|
p = _mm_mullo_pi16 (p, mask_x565_unpack);
|
|
|
|
return _mm_srli_pi16 (p, 8);
|
|
}
|
|
|
|
/* ----------------------------------------------------------------------------
|
|
* Compose Core transformations
|
|
*/
|
|
static force_inline uint32_t
|
|
core_combine_over_u_pixel_sse2 (uint32_t src, uint32_t dst)
|
|
{
|
|
uint8_t a;
|
|
__m64 ms;
|
|
|
|
a = src >> 24;
|
|
|
|
if (a == 0xff)
|
|
{
|
|
return src;
|
|
}
|
|
else if (src)
|
|
{
|
|
ms = unpack_32_1x64 (src);
|
|
return pack_1x64_32 (
|
|
over_1x64 (ms, expand_alpha_1x64 (ms), unpack_32_1x64 (dst)));
|
|
}
|
|
|
|
return dst;
|
|
}
|
|
|
|
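/* Combine a single source pixel with an optional mask pixel: when pm is
 * non-NULL the source is multiplied by the mask's alpha (unified-alpha
 * variant); otherwise the source is returned unchanged.
 */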
static force_inline uint32_t
|
|
combine1 (const uint32_t *ps, const uint32_t *pm)
|
|
{
|
|
uint32_t s = *ps;
|
|
|
|
if (pm)
|
|
{
|
|
__m64 ms, mm;
|
|
|
|
mm = unpack_32_1x64 (*pm);
|
|
mm = expand_alpha_1x64 (mm);
|
|
|
|
ms = unpack_32_1x64 (s);
|
|
ms = pix_multiply_1x64 (ms, mm);
|
|
|
|
s = pack_1x64_32 (ms);
|
|
}
|
|
|
|
return s;
|
|
}
|
|
|
|
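/* Same as combine1 but for four pixels at a time.  If all four mask
 * pixels are fully transparent, the source load is skipped entirely and
 * zero is returned.
 */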
static force_inline __m128i
|
|
combine4 (const __m128i *ps, const __m128i *pm)
|
|
{
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_msk_lo, xmm_msk_hi;
|
|
__m128i s;
|
|
|
|
if (pm)
|
|
{
|
|
xmm_msk_lo = load_128_unaligned (pm);
|
|
|
|
if (is_transparent (xmm_msk_lo))
|
|
return _mm_setzero_si128 ();
|
|
}
|
|
|
|
s = load_128_unaligned (ps);
|
|
|
|
if (pm)
|
|
{
|
|
unpack_128_2x128 (s, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_msk_lo, &xmm_msk_lo, &xmm_msk_hi);
|
|
|
|
expand_alpha_2x128 (xmm_msk_lo, xmm_msk_hi, &xmm_msk_lo, &xmm_msk_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_msk_lo, &xmm_msk_hi,
|
|
&xmm_src_lo, &xmm_src_hi);
|
|
|
|
s = pack_2x128_128 (xmm_src_lo, xmm_src_hi);
|
|
}
|
|
|
|
return s;
|
|
}
|
|
|
|
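/* The core_combine_*_sse2 loops below all share the same shape: a scalar
 * head loop runs until the destination pointer is 16-byte aligned, the
 * main loop then processes 4 pixels per iteration with aligned stores,
 * and a scalar tail loop handles the remaining 0-3 pixels.  OVER also
 * short-circuits a block of 4 pixels: a fully opaque source is stored
 * directly and a fully zero source leaves the destination untouched.
 */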
static force_inline void
|
|
core_combine_over_u_sse2 (uint32_t* pd,
|
|
const uint32_t* ps,
|
|
const uint32_t* pm,
|
|
int w)
|
|
{
|
|
uint32_t s, d;
|
|
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_alpha_lo, xmm_alpha_hi;
|
|
|
|
/* Align dst on a 16-byte boundary */
|
|
while (w && ((unsigned long)pd & 15))
|
|
{
|
|
d = *pd;
|
|
s = combine1 (ps, pm);
|
|
|
|
*pd++ = core_combine_over_u_pixel_sse2 (s, d);
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
/* I'm loading unaligned because I'm not sure about
|
|
* the address alignment.
|
|
*/
|
|
xmm_src_hi = combine4 ((__m128i*)ps, (__m128i*)pm);
|
|
|
|
if (is_opaque (xmm_src_hi))
|
|
{
|
|
save_128_aligned ((__m128i*)pd, xmm_src_hi);
|
|
}
|
|
else if (!is_zero (xmm_src_hi))
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*) pd);
|
|
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
expand_alpha_2x128 (
|
|
xmm_src_lo, xmm_src_hi, &xmm_alpha_lo, &xmm_alpha_hi);
|
|
|
|
over_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
/* rebuild the 4 pixel data and save */
|
|
save_128_aligned ((__m128i*)pd,
|
|
pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
}
|
|
|
|
w -= 4;
|
|
ps += 4;
|
|
pd += 4;
|
|
if (pm)
|
|
pm += 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
d = *pd;
|
|
s = combine1 (ps, pm);
|
|
|
|
*pd++ = core_combine_over_u_pixel_sse2 (s, d);
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
|
|
w--;
|
|
}
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_over_reverse_u_sse2 (uint32_t* pd,
|
|
const uint32_t* ps,
|
|
const uint32_t* pm,
|
|
int w)
|
|
{
|
|
uint32_t s, d;
|
|
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_alpha_lo, xmm_alpha_hi;
|
|
|
|
/* Align dst on a 16-byte boundary */
|
|
while (w &&
|
|
((unsigned long)pd & 15))
|
|
{
|
|
d = *pd;
|
|
s = combine1 (ps, pm);
|
|
|
|
*pd++ = core_combine_over_u_pixel_sse2 (d, s);
|
|
w--;
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
/* I'm loading unaligned because I'm not sure
|
|
* about the address alignment.
|
|
*/
|
|
xmm_src_hi = combine4 ((__m128i*)ps, (__m128i*)pm);
|
|
xmm_dst_hi = load_128_aligned ((__m128i*) pd);
|
|
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi);
|
|
|
|
over_2x128 (&xmm_dst_lo, &xmm_dst_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi,
|
|
&xmm_src_lo, &xmm_src_hi);
|
|
|
|
/* rebuild the 4 pixel data and save */
|
|
save_128_aligned ((__m128i*)pd,
|
|
pack_2x128_128 (xmm_src_lo, xmm_src_hi));
|
|
|
|
w -= 4;
|
|
ps += 4;
|
|
pd += 4;
|
|
|
|
if (pm)
|
|
pm += 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
d = *pd;
|
|
s = combine1 (ps, pm);
|
|
|
|
*pd++ = core_combine_over_u_pixel_sse2 (d, s);
|
|
ps++;
|
|
w--;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
}
|
|
|
|
static force_inline uint32_t
|
|
core_combine_in_u_pixelsse2 (uint32_t src, uint32_t dst)
|
|
{
|
|
uint32_t maska = src >> 24;
|
|
|
|
if (maska == 0)
|
|
{
|
|
return 0;
|
|
}
|
|
else if (maska != 0xff)
|
|
{
|
|
return pack_1x64_32 (
|
|
pix_multiply_1x64 (unpack_32_1x64 (dst),
|
|
expand_alpha_1x64 (unpack_32_1x64 (src))));
|
|
}
|
|
|
|
return dst;
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_in_u_sse2 (uint32_t* pd,
|
|
const uint32_t* ps,
|
|
const uint32_t* pm,
|
|
int w)
|
|
{
|
|
uint32_t s, d;
|
|
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
|
|
while (w && ((unsigned long) pd & 15))
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_in_u_pixelsse2 (d, s);
|
|
w--;
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*) pd);
|
|
xmm_src_hi = combine4 ((__m128i*) ps, (__m128i*) pm);
|
|
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
pix_multiply_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned ((__m128i*)pd,
|
|
pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
w -= 4;
|
|
if (pm)
|
|
pm += 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_in_u_pixelsse2 (d, s);
|
|
w--;
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_reverse_in_u_sse2 (uint32_t* pd,
|
|
const uint32_t* ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, d;
|
|
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
|
|
while (w && ((unsigned long) pd & 15))
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_in_u_pixelsse2 (s, d);
|
|
ps++;
|
|
w--;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*) pd);
|
|
xmm_src_hi = combine4 ((__m128i*) ps, (__m128i*)pm);
|
|
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
pix_multiply_2x128 (&xmm_dst_lo, &xmm_dst_hi,
|
|
&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
w -= 4;
|
|
if (pm)
|
|
pm += 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_in_u_pixelsse2 (s, d);
|
|
w--;
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_reverse_out_u_sse2 (uint32_t* pd,
|
|
const uint32_t* ps,
|
|
const uint32_t* pm,
|
|
int w)
|
|
{
|
|
while (w && ((unsigned long) pd & 15))
|
|
{
|
|
uint32_t s = combine1 (ps, pm);
|
|
uint32_t d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
unpack_32_1x64 (d), negate_1x64 (
|
|
expand_alpha_1x64 (unpack_32_1x64 (s)))));
|
|
|
|
if (pm)
|
|
pm++;
|
|
ps++;
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
|
|
xmm_src_hi = combine4 ((__m128i*)ps, (__m128i*)pm);
|
|
xmm_dst_hi = load_128_aligned ((__m128i*) pd);
|
|
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
negate_2x128 (xmm_src_lo, xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_dst_lo, &xmm_dst_hi,
|
|
&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
if (pm)
|
|
pm += 4;
|
|
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
uint32_t s = combine1 (ps, pm);
|
|
uint32_t d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
unpack_32_1x64 (d), negate_1x64 (
|
|
expand_alpha_1x64 (unpack_32_1x64 (s)))));
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
w--;
|
|
}
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_out_u_sse2 (uint32_t* pd,
|
|
const uint32_t* ps,
|
|
const uint32_t* pm,
|
|
int w)
|
|
{
|
|
while (w && ((unsigned long) pd & 15))
|
|
{
|
|
uint32_t s = combine1 (ps, pm);
|
|
uint32_t d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
unpack_32_1x64 (s), negate_1x64 (
|
|
expand_alpha_1x64 (unpack_32_1x64 (d)))));
|
|
w--;
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
|
|
xmm_src_hi = combine4 ((__m128i*) ps, (__m128i*)pm);
|
|
xmm_dst_hi = load_128_aligned ((__m128i*) pd);
|
|
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
negate_2x128 (xmm_dst_lo, xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
w -= 4;
|
|
if (pm)
|
|
pm += 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
uint32_t s = combine1 (ps, pm);
|
|
uint32_t d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
unpack_32_1x64 (s), negate_1x64 (
|
|
expand_alpha_1x64 (unpack_32_1x64 (d)))));
|
|
w--;
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
}
|
|
|
|
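/* ATOP: dst = src * alpha(dst) + dst * (1 - alpha(src)) */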
static force_inline uint32_t
|
|
core_combine_atop_u_pixel_sse2 (uint32_t src,
|
|
uint32_t dst)
|
|
{
|
|
__m64 s = unpack_32_1x64 (src);
|
|
__m64 d = unpack_32_1x64 (dst);
|
|
|
|
__m64 sa = negate_1x64 (expand_alpha_1x64 (s));
|
|
__m64 da = expand_alpha_1x64 (d);
|
|
|
|
return pack_1x64_32 (pix_add_multiply_1x64 (&s, &da, &d, &sa));
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_atop_u_sse2 (uint32_t* pd,
|
|
const uint32_t* ps,
|
|
const uint32_t* pm,
|
|
int w)
|
|
{
|
|
uint32_t s, d;
|
|
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_alpha_src_lo, xmm_alpha_src_hi;
|
|
__m128i xmm_alpha_dst_lo, xmm_alpha_dst_hi;
|
|
|
|
while (w && ((unsigned long) pd & 15))
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_atop_u_pixel_sse2 (s, d);
|
|
w--;
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_src_hi = combine4 ((__m128i*)ps, (__m128i*)pm);
|
|
xmm_dst_hi = load_128_aligned ((__m128i*) pd);
|
|
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_alpha_src_lo, &xmm_alpha_src_hi);
|
|
expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi,
|
|
&xmm_alpha_dst_lo, &xmm_alpha_dst_hi);
|
|
|
|
negate_2x128 (xmm_alpha_src_lo, xmm_alpha_src_hi,
|
|
&xmm_alpha_src_lo, &xmm_alpha_src_hi);
|
|
|
|
pix_add_multiply_2x128 (
|
|
&xmm_src_lo, &xmm_src_hi, &xmm_alpha_dst_lo, &xmm_alpha_dst_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi, &xmm_alpha_src_lo, &xmm_alpha_src_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
w -= 4;
|
|
if (pm)
|
|
pm += 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_atop_u_pixel_sse2 (s, d);
|
|
w--;
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
}
|
|
|
|
static force_inline uint32_t
|
|
core_combine_reverse_atop_u_pixel_sse2 (uint32_t src,
|
|
uint32_t dst)
|
|
{
|
|
__m64 s = unpack_32_1x64 (src);
|
|
__m64 d = unpack_32_1x64 (dst);
|
|
|
|
__m64 sa = expand_alpha_1x64 (s);
|
|
__m64 da = negate_1x64 (expand_alpha_1x64 (d));
|
|
|
|
return pack_1x64_32 (pix_add_multiply_1x64 (&s, &da, &d, &sa));
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_reverse_atop_u_sse2 (uint32_t* pd,
|
|
const uint32_t* ps,
|
|
const uint32_t* pm,
|
|
int w)
|
|
{
|
|
uint32_t s, d;
|
|
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_alpha_src_lo, xmm_alpha_src_hi;
|
|
__m128i xmm_alpha_dst_lo, xmm_alpha_dst_hi;
|
|
|
|
while (w && ((unsigned long) pd & 15))
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_reverse_atop_u_pixel_sse2 (s, d);
|
|
ps++;
|
|
w--;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_src_hi = combine4 ((__m128i*)ps, (__m128i*)pm);
|
|
xmm_dst_hi = load_128_aligned ((__m128i*) pd);
|
|
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_alpha_src_lo, &xmm_alpha_src_hi);
|
|
expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi,
|
|
&xmm_alpha_dst_lo, &xmm_alpha_dst_hi);
|
|
|
|
negate_2x128 (xmm_alpha_dst_lo, xmm_alpha_dst_hi,
|
|
&xmm_alpha_dst_lo, &xmm_alpha_dst_hi);
|
|
|
|
pix_add_multiply_2x128 (
|
|
&xmm_src_lo, &xmm_src_hi, &xmm_alpha_dst_lo, &xmm_alpha_dst_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi, &xmm_alpha_src_lo, &xmm_alpha_src_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
w -= 4;
|
|
if (pm)
|
|
pm += 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_reverse_atop_u_pixel_sse2 (s, d);
|
|
ps++;
|
|
w--;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
}
|
|
|
|
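/* XOR: dst = src * (1 - alpha(dst)) + dst * (1 - alpha(src)) */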
static force_inline uint32_t
|
|
core_combine_xor_u_pixel_sse2 (uint32_t src,
|
|
uint32_t dst)
|
|
{
|
|
__m64 s = unpack_32_1x64 (src);
|
|
__m64 d = unpack_32_1x64 (dst);
|
|
|
|
__m64 neg_d = negate_1x64 (expand_alpha_1x64 (d));
|
|
__m64 neg_s = negate_1x64 (expand_alpha_1x64 (s));
|
|
|
|
return pack_1x64_32 (pix_add_multiply_1x64 (&s, &neg_d, &d, &neg_s));
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_xor_u_sse2 (uint32_t* dst,
|
|
const uint32_t* src,
|
|
const uint32_t *mask,
|
|
int width)
|
|
{
|
|
int w = width;
|
|
uint32_t s, d;
|
|
uint32_t* pd = dst;
|
|
const uint32_t* ps = src;
|
|
const uint32_t* pm = mask;
|
|
|
|
__m128i xmm_src, xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_alpha_src_lo, xmm_alpha_src_hi;
|
|
__m128i xmm_alpha_dst_lo, xmm_alpha_dst_hi;
|
|
|
|
while (w && ((unsigned long) pd & 15))
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_xor_u_pixel_sse2 (s, d);
|
|
w--;
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_src = combine4 ((__m128i*) ps, (__m128i*) pm);
|
|
xmm_dst = load_128_aligned ((__m128i*) pd);
|
|
|
|
unpack_128_2x128 (xmm_src, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_alpha_src_lo, &xmm_alpha_src_hi);
|
|
expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi,
|
|
&xmm_alpha_dst_lo, &xmm_alpha_dst_hi);
|
|
|
|
negate_2x128 (xmm_alpha_src_lo, xmm_alpha_src_hi,
|
|
&xmm_alpha_src_lo, &xmm_alpha_src_hi);
|
|
negate_2x128 (xmm_alpha_dst_lo, xmm_alpha_dst_hi,
|
|
&xmm_alpha_dst_lo, &xmm_alpha_dst_hi);
|
|
|
|
pix_add_multiply_2x128 (
|
|
&xmm_src_lo, &xmm_src_hi, &xmm_alpha_dst_lo, &xmm_alpha_dst_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi, &xmm_alpha_src_lo, &xmm_alpha_src_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
w -= 4;
|
|
if (pm)
|
|
pm += 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_xor_u_pixel_sse2 (s, d);
|
|
w--;
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_add_u_sse2 (uint32_t* dst,
|
|
const uint32_t* src,
|
|
const uint32_t* mask,
|
|
int width)
|
|
{
|
|
int w = width;
|
|
uint32_t s, d;
|
|
uint32_t* pd = dst;
|
|
const uint32_t* ps = src;
|
|
const uint32_t* pm = mask;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
*pd++ = _mm_cvtsi64_si32 (
|
|
_mm_adds_pu8 (_mm_cvtsi32_si64 (s), _mm_cvtsi32_si64 (d)));
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
__m128i s;
|
|
|
|
s = combine4 ((__m128i*)ps, (__m128i*)pm);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, _mm_adds_epu8 (s, load_128_aligned ((__m128i*)pd)));
|
|
|
|
pd += 4;
|
|
ps += 4;
|
|
if (pm)
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w--)
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
ps++;
|
|
*pd++ = _mm_cvtsi64_si32 (
|
|
_mm_adds_pu8 (_mm_cvtsi32_si64 (s), _mm_cvtsi32_si64 (d)));
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
}
|
|
|
|
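/* SATURATE: when the source alpha exceeds the free destination alpha
 * (~dst >> 24), the source is first scaled by their ratio so that the
 * following saturating add keeps the result alpha within range.
 */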
static force_inline uint32_t
|
|
core_combine_saturate_u_pixel_sse2 (uint32_t src,
|
|
uint32_t dst)
|
|
{
|
|
__m64 ms = unpack_32_1x64 (src);
|
|
__m64 md = unpack_32_1x64 (dst);
|
|
uint32_t sa = src >> 24;
|
|
uint32_t da = ~dst >> 24;
|
|
|
|
if (sa > da)
|
|
{
|
|
ms = pix_multiply_1x64 (
|
|
ms, expand_alpha_1x64 (unpack_32_1x64 (DIV_UN8 (da, sa) << 24)));
|
|
}
|
|
|
|
return pack_1x64_32 (_mm_adds_pu16 (md, ms));
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_saturate_u_sse2 (uint32_t * pd,
|
|
const uint32_t *ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, d;
|
|
|
|
uint32_t pack_cmp;
|
|
__m128i xmm_src, xmm_dst;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_saturate_u_pixel_sse2 (s, d);
|
|
w--;
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst = load_128_aligned ((__m128i*)pd);
|
|
xmm_src = combine4 ((__m128i*)ps, (__m128i*)pm);
|
|
|
|
pack_cmp = _mm_movemask_epi8 (
|
|
_mm_cmpgt_epi32 (
|
|
_mm_srli_epi32 (xmm_src, 24),
|
|
_mm_srli_epi32 (_mm_xor_si128 (xmm_dst, mask_ff000000), 24)));
|
|
|
|
/* if some source alpha is greater than the respective ~dest alpha */
|
|
if (pack_cmp)
|
|
{
|
|
s = combine1 (ps++, pm);
|
|
d = *pd;
|
|
*pd++ = core_combine_saturate_u_pixel_sse2 (s, d);
|
|
if (pm)
|
|
pm++;
|
|
|
|
s = combine1 (ps++, pm);
|
|
d = *pd;
|
|
*pd++ = core_combine_saturate_u_pixel_sse2 (s, d);
|
|
if (pm)
|
|
pm++;
|
|
|
|
s = combine1 (ps++, pm);
|
|
d = *pd;
|
|
*pd++ = core_combine_saturate_u_pixel_sse2 (s, d);
|
|
if (pm)
|
|
pm++;
|
|
|
|
s = combine1 (ps++, pm);
|
|
d = *pd;
|
|
*pd++ = core_combine_saturate_u_pixel_sse2 (s, d);
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
else
|
|
{
|
|
save_128_aligned ((__m128i*)pd, _mm_adds_epu8 (xmm_dst, xmm_src));
|
|
|
|
pd += 4;
|
|
ps += 4;
|
|
if (pm)
|
|
pm += 4;
|
|
}
|
|
|
|
w -= 4;
|
|
}
|
|
|
|
while (w--)
|
|
{
|
|
s = combine1 (ps, pm);
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_saturate_u_pixel_sse2 (s, d);
|
|
ps++;
|
|
if (pm)
|
|
pm++;
|
|
}
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_src_ca_sse2 (uint32_t* pd,
|
|
const uint32_t* ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, m;
|
|
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_mask_lo, xmm_mask_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (unpack_32_1x64 (s), unpack_32_1x64 (m)));
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_src_hi = load_128_unaligned ((__m128i*)ps);
|
|
xmm_mask_hi = load_128_unaligned ((__m128i*)pm);
|
|
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (unpack_32_1x64 (s), unpack_32_1x64 (m)));
|
|
w--;
|
|
}
|
|
}
|
|
|
|
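/* Component-alpha OVER: each channel of the mask scales the matching
 * channel of the source (and of the source alpha) before the usual OVER
 * against the destination.
 */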
static force_inline uint32_t
|
|
core_combine_over_ca_pixel_sse2 (uint32_t src,
|
|
uint32_t mask,
|
|
uint32_t dst)
|
|
{
|
|
__m64 s = unpack_32_1x64 (src);
|
|
__m64 expAlpha = expand_alpha_1x64 (s);
|
|
__m64 unpk_mask = unpack_32_1x64 (mask);
|
|
__m64 unpk_dst = unpack_32_1x64 (dst);
|
|
|
|
return pack_1x64_32 (in_over_1x64 (&s, &expAlpha, &unpk_mask, &unpk_dst));
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_over_ca_sse2 (uint32_t* pd,
|
|
const uint32_t* ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, m, d;
|
|
|
|
__m128i xmm_alpha_lo, xmm_alpha_hi;
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_mask_lo, xmm_mask_hi;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_over_ca_pixel_sse2 (s, m, d);
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*)pd);
|
|
xmm_src_hi = load_128_unaligned ((__m128i*)ps);
|
|
xmm_mask_hi = load_128_unaligned ((__m128i*)pm);
|
|
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi);
|
|
|
|
in_over_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_over_ca_pixel_sse2 (s, m, d);
|
|
w--;
|
|
}
|
|
}
|
|
|
|
static force_inline uint32_t
|
|
core_combine_over_reverse_ca_pixel_sse2 (uint32_t src,
|
|
uint32_t mask,
|
|
uint32_t dst)
|
|
{
|
|
__m64 d = unpack_32_1x64 (dst);
|
|
|
|
return pack_1x64_32 (
|
|
over_1x64 (d, expand_alpha_1x64 (d),
|
|
pix_multiply_1x64 (unpack_32_1x64 (src),
|
|
unpack_32_1x64 (mask))));
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_over_reverse_ca_sse2 (uint32_t* pd,
|
|
const uint32_t* ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, m, d;
|
|
|
|
__m128i xmm_alpha_lo, xmm_alpha_hi;
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_mask_lo, xmm_mask_hi;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_over_reverse_ca_pixel_sse2 (s, m, d);
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*)pd);
|
|
xmm_src_hi = load_128_unaligned ((__m128i*)ps);
|
|
xmm_mask_hi = load_128_unaligned ((__m128i*)pm);
|
|
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi);
|
|
pix_multiply_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
over_2x128 (&xmm_dst_lo, &xmm_dst_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_mask_lo, xmm_mask_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_over_reverse_ca_pixel_sse2 (s, m, d);
|
|
w--;
|
|
}
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_in_ca_sse2 (uint32_t * pd,
|
|
const uint32_t *ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, m, d;
|
|
|
|
__m128i xmm_alpha_lo, xmm_alpha_hi;
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_mask_lo, xmm_mask_hi;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
pix_multiply_1x64 (unpack_32_1x64 (s), unpack_32_1x64 (m)),
|
|
expand_alpha_1x64 (unpack_32_1x64 (d))));
|
|
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*)pd);
|
|
xmm_src_hi = load_128_unaligned ((__m128i*)ps);
|
|
xmm_mask_hi = load_128_unaligned ((__m128i*)pm);
|
|
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_dst_lo, &xmm_dst_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
pix_multiply_1x64 (
|
|
unpack_32_1x64 (s), unpack_32_1x64 (m)),
|
|
expand_alpha_1x64 (unpack_32_1x64 (d))));
|
|
|
|
w--;
|
|
}
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_in_reverse_ca_sse2 (uint32_t * pd,
|
|
const uint32_t *ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, m, d;
|
|
|
|
__m128i xmm_alpha_lo, xmm_alpha_hi;
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_mask_lo, xmm_mask_hi;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
unpack_32_1x64 (d),
|
|
pix_multiply_1x64 (unpack_32_1x64 (m),
|
|
expand_alpha_1x64 (unpack_32_1x64 (s)))));
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*)pd);
|
|
xmm_src_hi = load_128_unaligned ((__m128i*)ps);
|
|
xmm_mask_hi = load_128_unaligned ((__m128i*)pm);
|
|
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi);
|
|
pix_multiply_2x128 (&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_dst_lo, &xmm_dst_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
unpack_32_1x64 (d),
|
|
pix_multiply_1x64 (unpack_32_1x64 (m),
|
|
expand_alpha_1x64 (unpack_32_1x64 (s)))));
|
|
w--;
|
|
}
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_out_ca_sse2 (uint32_t * pd,
|
|
const uint32_t *ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, m, d;
|
|
|
|
__m128i xmm_alpha_lo, xmm_alpha_hi;
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_mask_lo, xmm_mask_hi;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
pix_multiply_1x64 (
|
|
unpack_32_1x64 (s), unpack_32_1x64 (m)),
|
|
negate_1x64 (expand_alpha_1x64 (unpack_32_1x64 (d)))));
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*)pd);
|
|
xmm_src_hi = load_128_unaligned ((__m128i*)ps);
|
|
xmm_mask_hi = load_128_unaligned ((__m128i*)pm);
|
|
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi);
|
|
negate_2x128 (xmm_alpha_lo, xmm_alpha_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
pix_multiply_2x128 (&xmm_dst_lo, &xmm_dst_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
pix_multiply_1x64 (
|
|
unpack_32_1x64 (s), unpack_32_1x64 (m)),
|
|
negate_1x64 (expand_alpha_1x64 (unpack_32_1x64 (d)))));
|
|
|
|
w--;
|
|
}
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_out_reverse_ca_sse2 (uint32_t * pd,
|
|
const uint32_t *ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, m, d;
|
|
|
|
__m128i xmm_alpha_lo, xmm_alpha_hi;
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_mask_lo, xmm_mask_hi;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
unpack_32_1x64 (d),
|
|
negate_1x64 (pix_multiply_1x64 (
|
|
unpack_32_1x64 (m),
|
|
expand_alpha_1x64 (unpack_32_1x64 (s))))));
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*)pd);
|
|
xmm_src_hi = load_128_unaligned ((__m128i*)ps);
|
|
xmm_mask_hi = load_128_unaligned ((__m128i*)pm);
|
|
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
negate_2x128 (xmm_mask_lo, xmm_mask_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_dst_lo, &xmm_dst_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
unpack_32_1x64 (d),
|
|
negate_1x64 (pix_multiply_1x64 (
|
|
unpack_32_1x64 (m),
|
|
expand_alpha_1x64 (unpack_32_1x64 (s))))));
|
|
w--;
|
|
}
|
|
}
|
|
|
|
static force_inline uint32_t
|
|
core_combine_atop_ca_pixel_sse2 (uint32_t src,
|
|
uint32_t mask,
|
|
uint32_t dst)
|
|
{
|
|
__m64 m = unpack_32_1x64 (mask);
|
|
__m64 s = unpack_32_1x64 (src);
|
|
__m64 d = unpack_32_1x64 (dst);
|
|
__m64 sa = expand_alpha_1x64 (s);
|
|
__m64 da = expand_alpha_1x64 (d);
|
|
|
|
s = pix_multiply_1x64 (s, m);
|
|
m = negate_1x64 (pix_multiply_1x64 (m, sa));
|
|
|
|
return pack_1x64_32 (pix_add_multiply_1x64 (&d, &m, &s, &da));
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_atop_ca_sse2 (uint32_t * pd,
|
|
const uint32_t *ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, m, d;
|
|
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_alpha_src_lo, xmm_alpha_src_hi;
|
|
__m128i xmm_alpha_dst_lo, xmm_alpha_dst_hi;
|
|
__m128i xmm_mask_lo, xmm_mask_hi;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_atop_ca_pixel_sse2 (s, m, d);
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*)pd);
|
|
xmm_src_hi = load_128_unaligned ((__m128i*)ps);
|
|
xmm_mask_hi = load_128_unaligned ((__m128i*)pm);
|
|
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_alpha_src_lo, &xmm_alpha_src_hi);
|
|
expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi,
|
|
&xmm_alpha_dst_lo, &xmm_alpha_dst_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_src_lo, &xmm_src_hi);
|
|
pix_multiply_2x128 (&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_alpha_src_lo, &xmm_alpha_src_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
negate_2x128 (xmm_mask_lo, xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
pix_add_multiply_2x128 (
|
|
&xmm_dst_lo, &xmm_dst_hi, &xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_src_lo, &xmm_src_hi, &xmm_alpha_dst_lo, &xmm_alpha_dst_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_atop_ca_pixel_sse2 (s, m, d);
|
|
w--;
|
|
}
|
|
}
|
|
|
|
static force_inline uint32_t
|
|
core_combine_reverse_atop_ca_pixel_sse2 (uint32_t src,
|
|
uint32_t mask,
|
|
uint32_t dst)
|
|
{
|
|
__m64 m = unpack_32_1x64 (mask);
|
|
__m64 s = unpack_32_1x64 (src);
|
|
__m64 d = unpack_32_1x64 (dst);
|
|
|
|
__m64 da = negate_1x64 (expand_alpha_1x64 (d));
|
|
__m64 sa = expand_alpha_1x64 (s);
|
|
|
|
s = pix_multiply_1x64 (s, m);
|
|
m = pix_multiply_1x64 (m, sa);
|
|
|
|
return pack_1x64_32 (pix_add_multiply_1x64 (&d, &m, &s, &da));
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_reverse_atop_ca_sse2 (uint32_t * pd,
|
|
const uint32_t *ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, m, d;
|
|
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_alpha_src_lo, xmm_alpha_src_hi;
|
|
__m128i xmm_alpha_dst_lo, xmm_alpha_dst_hi;
|
|
__m128i xmm_mask_lo, xmm_mask_hi;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_reverse_atop_ca_pixel_sse2 (s, m, d);
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*)pd);
|
|
xmm_src_hi = load_128_unaligned ((__m128i*)ps);
|
|
xmm_mask_hi = load_128_unaligned ((__m128i*)pm);
|
|
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_alpha_src_lo, &xmm_alpha_src_hi);
|
|
expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi,
|
|
&xmm_alpha_dst_lo, &xmm_alpha_dst_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_src_lo, &xmm_src_hi);
|
|
pix_multiply_2x128 (&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_alpha_src_lo, &xmm_alpha_src_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
negate_2x128 (xmm_alpha_dst_lo, xmm_alpha_dst_hi,
|
|
&xmm_alpha_dst_lo, &xmm_alpha_dst_hi);
|
|
|
|
pix_add_multiply_2x128 (
|
|
&xmm_dst_lo, &xmm_dst_hi, &xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_src_lo, &xmm_src_hi, &xmm_alpha_dst_lo, &xmm_alpha_dst_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_reverse_atop_ca_pixel_sse2 (s, m, d);
|
|
w--;
|
|
}
|
|
}
|
|
|
|
static force_inline uint32_t
|
|
core_combine_xor_ca_pixel_sse2 (uint32_t src,
|
|
uint32_t mask,
|
|
uint32_t dst)
|
|
{
|
|
__m64 a = unpack_32_1x64 (mask);
|
|
__m64 s = unpack_32_1x64 (src);
|
|
__m64 d = unpack_32_1x64 (dst);
|
|
|
|
__m64 alpha_dst = negate_1x64 (pix_multiply_1x64 (
|
|
a, expand_alpha_1x64 (s)));
|
|
__m64 dest = pix_multiply_1x64 (s, a);
|
|
__m64 alpha_src = negate_1x64 (expand_alpha_1x64 (d));
|
|
|
|
return pack_1x64_32 (pix_add_multiply_1x64 (&d,
|
|
&alpha_dst,
|
|
&dest,
|
|
&alpha_src));
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_xor_ca_sse2 (uint32_t * pd,
|
|
const uint32_t *ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, m, d;
|
|
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_alpha_src_lo, xmm_alpha_src_hi;
|
|
__m128i xmm_alpha_dst_lo, xmm_alpha_dst_hi;
|
|
__m128i xmm_mask_lo, xmm_mask_hi;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_xor_ca_pixel_sse2 (s, m, d);
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*)pd);
|
|
xmm_src_hi = load_128_unaligned ((__m128i*)ps);
|
|
xmm_mask_hi = load_128_unaligned ((__m128i*)pm);
|
|
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_alpha_src_lo, &xmm_alpha_src_hi);
|
|
expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi,
|
|
&xmm_alpha_dst_lo, &xmm_alpha_dst_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_src_lo, &xmm_src_hi);
|
|
pix_multiply_2x128 (&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_alpha_src_lo, &xmm_alpha_src_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
negate_2x128 (xmm_alpha_dst_lo, xmm_alpha_dst_hi,
|
|
&xmm_alpha_dst_lo, &xmm_alpha_dst_hi);
|
|
negate_2x128 (xmm_mask_lo, xmm_mask_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
pix_add_multiply_2x128 (
|
|
&xmm_dst_lo, &xmm_dst_hi, &xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_src_lo, &xmm_src_hi, &xmm_alpha_dst_lo, &xmm_alpha_dst_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = core_combine_xor_ca_pixel_sse2 (s, m, d);
|
|
w--;
|
|
}
|
|
}
|
|
|
|
static force_inline void
|
|
core_combine_add_ca_sse2 (uint32_t * pd,
|
|
const uint32_t *ps,
|
|
const uint32_t *pm,
|
|
int w)
|
|
{
|
|
uint32_t s, m, d;
|
|
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_mask_lo, xmm_mask_hi;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
_mm_adds_pu8 (pix_multiply_1x64 (unpack_32_1x64 (s),
|
|
unpack_32_1x64 (m)),
|
|
unpack_32_1x64 (d)));
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_src_hi = load_128_unaligned ((__m128i*)ps);
|
|
xmm_mask_hi = load_128_unaligned ((__m128i*)pm);
|
|
xmm_dst_hi = load_128_aligned ((__m128i*)pd);
|
|
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_src_lo, &xmm_src_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (
|
|
_mm_adds_epu8 (xmm_src_lo, xmm_dst_lo),
|
|
_mm_adds_epu8 (xmm_src_hi, xmm_dst_hi)));
|
|
|
|
ps += 4;
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *ps++;
|
|
m = *pm++;
|
|
d = *pd;
|
|
|
|
*pd++ = pack_1x64_32 (
|
|
_mm_adds_pu8 (pix_multiply_1x64 (unpack_32_1x64 (s),
|
|
unpack_32_1x64 (m)),
|
|
unpack_32_1x64 (d)));
|
|
w--;
|
|
}
|
|
}
|
|
|
|
/* ---------------------------------------------------
|
|
* fb_compose_setup_sSE2
|
|
*/
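/* The create_mask_* helpers below replicate 16-bit or 32-bit constants
 * across the lanes of an MMX (__m64) or SSE2 (__m128i) register,
 * producing the per-channel masks used by the pixel arithmetic helpers.
 */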
|
|
static force_inline __m64
|
|
create_mask_16_64 (uint16_t mask)
|
|
{
|
|
return _mm_set1_pi16 (mask);
|
|
}
|
|
|
|
static force_inline __m128i
|
|
create_mask_16_128 (uint16_t mask)
|
|
{
|
|
return _mm_set1_epi16 (mask);
|
|
}
|
|
|
|
static force_inline __m64
|
|
create_mask_2x32_64 (uint32_t mask0,
|
|
uint32_t mask1)
|
|
{
|
|
return _mm_set_pi32 (mask0, mask1);
|
|
}
|
|
|
|
/* Work around a code generation bug in Sun Studio 12. */
|
|
#if defined(__SUNPRO_C) && (__SUNPRO_C >= 0x590)
|
|
# define create_mask_2x32_128(mask0, mask1) \
|
|
(_mm_set_epi32 ((mask0), (mask1), (mask0), (mask1)))
|
|
#else
|
|
static force_inline __m128i
|
|
create_mask_2x32_128 (uint32_t mask0,
|
|
uint32_t mask1)
|
|
{
|
|
return _mm_set_epi32 (mask0, mask1, mask0, mask1);
|
|
}
|
|
#endif
|
|
|
|
/* SSE2 code patch for fbcompose.c */
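/* The sse2_combine_* functions below are thin wrappers that adapt the
 * core_combine_*_sse2 helpers to the generic combiner signature (taking
 * a pixman_implementation_t and a pixman_op_t).  Each one ends with
 * _mm_empty () so that the MMX state used by the single-pixel helpers
 * is cleared before control returns to code that may use the x87 FPU.
 */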
|
|
|
|
static void
|
|
sse2_combine_over_u (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_over_u_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_over_reverse_u (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_over_reverse_u_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_in_u (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_in_u_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_in_reverse_u (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_reverse_in_u_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_out_u (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_out_u_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_out_reverse_u (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_reverse_out_u_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_atop_u (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_atop_u_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_atop_reverse_u (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_reverse_atop_u_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_xor_u (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_xor_u_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_add_u (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_add_u_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_saturate_u (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_saturate_u_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_src_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_src_ca_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_over_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_over_ca_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_over_reverse_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_over_reverse_ca_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_in_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_in_ca_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_in_reverse_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_in_reverse_ca_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_out_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_out_ca_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_out_reverse_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_out_reverse_ca_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_atop_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_atop_ca_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_atop_reverse_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_reverse_atop_ca_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_xor_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_xor_ca_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
static void
|
|
sse2_combine_add_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
uint32_t * dst,
|
|
const uint32_t * src,
|
|
const uint32_t * mask,
|
|
int width)
|
|
{
|
|
core_combine_add_ca_sse2 (dst, src, mask, width);
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* -------------------------------------------------------------------
|
|
* composite_over_n_8888
|
|
*/
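/* Solid (n) source OVER a 32 bpp destination.  The source pixel and its
 * expanded alpha are computed once outside the loops; each scanline is
 * then processed with a scalar head until dst is 16-byte aligned, a
 * 4-pixel SSE2 body, and a scalar tail.
 */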
|
|
|
|
static void
|
|
sse2_composite_over_n_8888 (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint32_t src;
|
|
uint32_t *dst_line, *dst, d;
|
|
int32_t w;
|
|
int dst_stride;
|
|
__m128i xmm_src, xmm_alpha;
|
|
__m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;
|
|
|
|
src = _pixman_image_get_solid (src_image, dst_image->bits.format);
|
|
|
|
if (src == 0)
|
|
return;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
|
|
|
|
xmm_src = expand_pixel_32_1x128 (src);
|
|
xmm_alpha = expand_alpha_1x128 (xmm_src);
|
|
|
|
while (height--)
|
|
{
|
|
dst = dst_line;
|
|
|
|
dst_line += dst_stride;
|
|
w = width;
|
|
|
|
while (w && (unsigned long)dst & 15)
|
|
{
|
|
d = *dst;
|
|
*dst++ = pack_1x64_32 (over_1x64 (_mm_movepi64_pi64 (xmm_src),
|
|
_mm_movepi64_pi64 (xmm_alpha),
|
|
unpack_32_1x64 (d)));
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_dst = load_128_aligned ((__m128i*)dst);
|
|
|
|
unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
over_2x128 (&xmm_src, &xmm_src,
|
|
&xmm_alpha, &xmm_alpha,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
/* rebuild the 4 pixel data and save */
|
|
save_128_aligned (
|
|
(__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
w -= 4;
|
|
dst += 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
d = *dst;
|
|
*dst++ = pack_1x64_32 (over_1x64 (_mm_movepi64_pi64 (xmm_src),
|
|
_mm_movepi64_pi64 (xmm_alpha),
|
|
unpack_32_1x64 (d)));
|
|
w--;
|
|
}
|
|
|
|
}
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* ---------------------------------------------------------------------
|
|
* composite_over_n_0565
|
|
*/
|
|
static void
|
|
sse2_composite_over_n_0565 (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint32_t src;
|
|
uint16_t *dst_line, *dst, d;
|
|
int32_t w;
|
|
int dst_stride;
|
|
__m128i xmm_src, xmm_alpha;
|
|
__m128i xmm_dst, xmm_dst0, xmm_dst1, xmm_dst2, xmm_dst3;
|
|
|
|
src = _pixman_image_get_solid (src_image, dst_image->bits.format);
|
|
|
|
if (src == 0)
|
|
return;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint16_t, dst_stride, dst_line, 1);
|
|
|
|
xmm_src = expand_pixel_32_1x128 (src);
|
|
xmm_alpha = expand_alpha_1x128 (xmm_src);
|
|
|
|
while (height--)
|
|
{
|
|
dst = dst_line;
|
|
|
|
dst_line += dst_stride;
|
|
w = width;
|
|
|
|
while (w && (unsigned long)dst & 15)
|
|
{
|
|
d = *dst;
|
|
|
|
*dst++ = pack_565_32_16 (
|
|
pack_1x64_32 (over_1x64 (_mm_movepi64_pi64 (xmm_src),
|
|
_mm_movepi64_pi64 (xmm_alpha),
|
|
expand565_16_1x64 (d))));
|
|
w--;
|
|
}
|
|
|
|
while (w >= 8)
|
|
{
|
|
xmm_dst = load_128_aligned ((__m128i*)dst);
|
|
|
|
unpack_565_128_4x128 (xmm_dst,
|
|
&xmm_dst0, &xmm_dst1, &xmm_dst2, &xmm_dst3);
|
|
|
|
over_2x128 (&xmm_src, &xmm_src,
|
|
&xmm_alpha, &xmm_alpha,
|
|
&xmm_dst0, &xmm_dst1);
|
|
over_2x128 (&xmm_src, &xmm_src,
|
|
&xmm_alpha, &xmm_alpha,
|
|
&xmm_dst2, &xmm_dst3);
|
|
|
|
xmm_dst = pack_565_4x128_128 (
|
|
&xmm_dst0, &xmm_dst1, &xmm_dst2, &xmm_dst3);
|
|
|
|
save_128_aligned ((__m128i*)dst, xmm_dst);
|
|
|
|
dst += 8;
|
|
w -= 8;
|
|
}
|
|
|
|
while (w--)
|
|
{
|
|
d = *dst;
|
|
*dst++ = pack_565_32_16 (
|
|
pack_1x64_32 (over_1x64 (_mm_movepi64_pi64 (xmm_src),
|
|
_mm_movepi64_pi64 (xmm_alpha),
|
|
expand565_16_1x64 (d))));
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* ------------------------------
|
|
* composite_add_n_8888_8888_ca
|
|
*/
|
|
static void
|
|
sse2_composite_add_n_8888_8888_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint32_t src, srca;
|
|
uint32_t *dst_line, d;
|
|
uint32_t *mask_line, m;
|
|
uint32_t pack_cmp;
|
|
int dst_stride, mask_stride;
|
|
|
|
__m128i xmm_src, xmm_alpha;
|
|
__m128i xmm_dst;
|
|
__m128i xmm_mask, xmm_mask_lo, xmm_mask_hi;
|
|
|
|
__m64 mmx_src, mmx_alpha, mmx_mask, mmx_dest;
|
|
|
|
src = _pixman_image_get_solid (src_image, dst_image->bits.format);
|
|
srca = src >> 24;
|
|
|
|
if (src == 0)
|
|
return;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
mask_image, mask_x, mask_y, uint32_t, mask_stride, mask_line, 1);
|
|
|
|
xmm_src = _mm_unpacklo_epi8 (
|
|
create_mask_2x32_128 (src, src), _mm_setzero_si128 ());
|
|
xmm_alpha = expand_alpha_1x128 (xmm_src);
|
|
mmx_src = _mm_movepi64_pi64 (xmm_src);
|
|
mmx_alpha = _mm_movepi64_pi64 (xmm_alpha);
|
|
|
|
while (height--)
|
|
{
|
|
int w = width;
|
|
const uint32_t *pm = (uint32_t *)mask_line;
|
|
uint32_t *pd = (uint32_t *)dst_line;
|
|
|
|
dst_line += dst_stride;
|
|
mask_line += mask_stride;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
m = *pm++;
|
|
|
|
if (m)
|
|
{
|
|
d = *pd;
|
|
|
|
mmx_mask = unpack_32_1x64 (m);
|
|
mmx_dest = unpack_32_1x64 (d);
|
|
|
|
*pd = pack_1x64_32 (
|
|
_mm_adds_pu8 (pix_multiply_1x64 (mmx_mask, mmx_src), mmx_dest));
|
|
}
|
|
|
|
pd++;
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_mask = load_128_unaligned ((__m128i*)pm);
|
|
|
|
pack_cmp =
|
|
_mm_movemask_epi8 (
|
|
_mm_cmpeq_epi32 (xmm_mask, _mm_setzero_si128 ()));
|
|
|
|
/* if all bits in mask are zero, pack_cmp is equal to 0xffff */
|
|
if (pack_cmp != 0xffff)
|
|
{
|
|
xmm_dst = load_128_aligned ((__m128i*)pd);
|
|
|
|
unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_src, &xmm_src,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
xmm_mask_hi = pack_2x128_128 (xmm_mask_lo, xmm_mask_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, _mm_adds_epu8 (xmm_mask_hi, xmm_dst));
|
|
}
|
|
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
m = *pm++;
|
|
|
|
if (m)
|
|
{
|
|
d = *pd;
|
|
|
|
mmx_mask = unpack_32_1x64 (m);
|
|
mmx_dest = unpack_32_1x64 (d);
|
|
|
|
*pd = pack_1x64_32 (
|
|
_mm_adds_pu8 (pix_multiply_1x64 (mmx_mask, mmx_src), mmx_dest));
|
|
}
|
|
|
|
pd++;
|
|
w--;
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* ---------------------------------------------------------------------------
|
|
* composite_over_n_8888_8888_ca
|
|
*/
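/* Component-alpha version: the 32-bit mask supplies a separate alpha
 * for each color channel, so the blend uses in_over with the unpacked
 * mask rather than a single scalar alpha.  Blocks of four mask pixels
 * that are entirely zero are detected with _mm_movemask_epi8 and
 * skipped without touching the destination.
 */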
|
|
|
|
static void
|
|
sse2_composite_over_n_8888_8888_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint32_t src;
|
|
uint32_t *dst_line, d;
|
|
uint32_t *mask_line, m;
|
|
uint32_t pack_cmp;
|
|
int dst_stride, mask_stride;
|
|
|
|
__m128i xmm_src, xmm_alpha;
|
|
__m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_mask, xmm_mask_lo, xmm_mask_hi;
|
|
|
|
__m64 mmx_src, mmx_alpha, mmx_mask, mmx_dest;
|
|
|
|
src = _pixman_image_get_solid (src_image, dst_image->bits.format);
|
|
|
|
if (src == 0)
|
|
return;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
mask_image, mask_x, mask_y, uint32_t, mask_stride, mask_line, 1);
|
|
|
|
xmm_src = _mm_unpacklo_epi8 (
|
|
create_mask_2x32_128 (src, src), _mm_setzero_si128 ());
|
|
xmm_alpha = expand_alpha_1x128 (xmm_src);
|
|
mmx_src = _mm_movepi64_pi64 (xmm_src);
|
|
mmx_alpha = _mm_movepi64_pi64 (xmm_alpha);
|
|
|
|
while (height--)
|
|
{
|
|
int w = width;
|
|
const uint32_t *pm = (uint32_t *)mask_line;
|
|
uint32_t *pd = (uint32_t *)dst_line;
|
|
|
|
dst_line += dst_stride;
|
|
mask_line += mask_stride;
|
|
|
|
while (w && (unsigned long)pd & 15)
|
|
{
|
|
m = *pm++;
|
|
|
|
if (m)
|
|
{
|
|
d = *pd;
|
|
mmx_mask = unpack_32_1x64 (m);
|
|
mmx_dest = unpack_32_1x64 (d);
|
|
|
|
*pd = pack_1x64_32 (in_over_1x64 (&mmx_src,
|
|
&mmx_alpha,
|
|
&mmx_mask,
|
|
&mmx_dest));
|
|
}
|
|
|
|
pd++;
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_mask = load_128_unaligned ((__m128i*)pm);
|
|
|
|
pack_cmp =
|
|
_mm_movemask_epi8 (
|
|
_mm_cmpeq_epi32 (xmm_mask, _mm_setzero_si128 ()));
|
|
|
|
/* if all bits in mask are zero, pack_cmp is equal to 0xffff */
|
|
if (pack_cmp != 0xffff)
|
|
{
|
|
xmm_dst = load_128_aligned ((__m128i*)pd);
|
|
|
|
unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
|
|
unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
in_over_2x128 (&xmm_src, &xmm_src,
|
|
&xmm_alpha, &xmm_alpha,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)pd, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
}
|
|
|
|
pd += 4;
|
|
pm += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
m = *pm++;
|
|
|
|
if (m)
|
|
{
|
|
d = *pd;
|
|
mmx_mask = unpack_32_1x64 (m);
|
|
mmx_dest = unpack_32_1x64 (d);
|
|
|
|
*pd = pack_1x64_32 (
|
|
in_over_1x64 (&mmx_src, &mmx_alpha, &mmx_mask, &mmx_dest));
|
|
}
|
|
|
|
pd++;
|
|
w--;
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/*---------------------------------------------------------------------
|
|
* composite_over_8888_n_8888
|
|
*/
|
|
|
|
static void
|
|
sse2_composite_over_8888_n_8888 (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint32_t *dst_line, *dst;
|
|
uint32_t *src_line, *src;
|
|
uint32_t mask;
|
|
int32_t w;
|
|
int dst_stride, src_stride;
|
|
|
|
__m128i xmm_mask;
|
|
__m128i xmm_src, xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_alpha_lo, xmm_alpha_hi;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
|
|
|
|
mask = _pixman_image_get_solid (mask_image, PIXMAN_a8r8g8b8);
|
|
|
|
xmm_mask = create_mask_16_128 (mask >> 24);
|
|
|
|
while (height--)
|
|
{
|
|
dst = dst_line;
|
|
dst_line += dst_stride;
|
|
src = src_line;
|
|
src_line += src_stride;
|
|
w = width;
|
|
|
|
while (w && (unsigned long)dst & 15)
|
|
{
|
|
uint32_t s = *src++;
|
|
uint32_t d = *dst;
|
|
|
|
__m64 ms = unpack_32_1x64 (s);
|
|
__m64 alpha = expand_alpha_1x64 (ms);
|
|
__m64 dest = _mm_movepi64_pi64 (xmm_mask);
|
|
__m64 alpha_dst = unpack_32_1x64 (d);
|
|
|
|
*dst++ = pack_1x64_32 (
|
|
in_over_1x64 (&ms, &alpha, &dest, &alpha_dst));
|
|
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_src = load_128_unaligned ((__m128i*)src);
|
|
xmm_dst = load_128_aligned ((__m128i*)dst);
|
|
|
|
unpack_128_2x128 (xmm_src, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi);
|
|
|
|
in_over_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi,
|
|
&xmm_mask, &xmm_mask,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
dst += 4;
|
|
src += 4;
|
|
w -= 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
uint32_t s = *src++;
|
|
uint32_t d = *dst;
|
|
|
|
__m64 ms = unpack_32_1x64 (s);
|
|
__m64 alpha = expand_alpha_1x64 (ms);
|
|
__m64 mask = _mm_movepi64_pi64 (xmm_mask);
|
|
__m64 dest = unpack_32_1x64 (d);
|
|
|
|
*dst++ = pack_1x64_32 (
|
|
in_over_1x64 (&ms, &alpha, &mask, &dest));
|
|
|
|
w--;
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/*---------------------------------------------------------------------
|
|
 * composite_src_x888_8888
|
|
*/
|
|
|
|
static void
|
|
sse2_composite_src_x888_8888 (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint32_t *dst_line, *dst;
|
|
uint32_t *src_line, *src;
|
|
int32_t w;
|
|
int dst_stride, src_stride;
|
|
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
|
|
|
|
while (height--)
|
|
{
|
|
dst = dst_line;
|
|
dst_line += dst_stride;
|
|
src = src_line;
|
|
src_line += src_stride;
|
|
w = width;
|
|
|
|
while (w && (unsigned long)dst & 15)
|
|
{
|
|
*dst++ = *src++ | 0xff000000;
|
|
w--;
|
|
}
|
|
|
|
while (w >= 16)
|
|
{
|
|
__m128i xmm_src1, xmm_src2, xmm_src3, xmm_src4;
|
|
|
|
xmm_src1 = load_128_unaligned ((__m128i*)src + 0);
|
|
xmm_src2 = load_128_unaligned ((__m128i*)src + 1);
|
|
xmm_src3 = load_128_unaligned ((__m128i*)src + 2);
|
|
xmm_src4 = load_128_unaligned ((__m128i*)src + 3);
|
|
|
|
save_128_aligned ((__m128i*)dst + 0, _mm_or_si128 (xmm_src1, mask_ff000000));
|
|
save_128_aligned ((__m128i*)dst + 1, _mm_or_si128 (xmm_src2, mask_ff000000));
|
|
save_128_aligned ((__m128i*)dst + 2, _mm_or_si128 (xmm_src3, mask_ff000000));
|
|
save_128_aligned ((__m128i*)dst + 3, _mm_or_si128 (xmm_src4, mask_ff000000));
|
|
|
|
dst += 16;
|
|
src += 16;
|
|
w -= 16;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
*dst++ = *src++ | 0xff000000;
|
|
w--;
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* ---------------------------------------------------------------------
|
|
* composite_over_x888_n_8888
|
|
*/
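/* x8r8g8b8 source with a solid mask: the source alpha byte is forced to
 * 0xff (| 0xff000000 / mask_ff000000), the mask alpha is replicated
 * into xmm_mask once, and each pixel is blended with in_over.
 */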
|
|
static void
|
|
sse2_composite_over_x888_n_8888 (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint32_t *dst_line, *dst;
|
|
uint32_t *src_line, *src;
|
|
uint32_t mask;
|
|
int dst_stride, src_stride;
|
|
int32_t w;
|
|
|
|
__m128i xmm_mask, xmm_alpha;
|
|
__m128i xmm_src, xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
|
|
|
|
mask = _pixman_image_get_solid (mask_image, PIXMAN_a8r8g8b8);
|
|
|
|
xmm_mask = create_mask_16_128 (mask >> 24);
|
|
xmm_alpha = mask_00ff;
|
|
|
|
while (height--)
|
|
{
|
|
dst = dst_line;
|
|
dst_line += dst_stride;
|
|
src = src_line;
|
|
src_line += src_stride;
|
|
w = width;
|
|
|
|
while (w && (unsigned long)dst & 15)
|
|
{
|
|
uint32_t s = (*src++) | 0xff000000;
|
|
uint32_t d = *dst;
|
|
|
|
__m64 src = unpack_32_1x64 (s);
|
|
__m64 alpha = _mm_movepi64_pi64 (xmm_alpha);
|
|
__m64 mask = _mm_movepi64_pi64 (xmm_mask);
|
|
__m64 dest = unpack_32_1x64 (d);
|
|
|
|
*dst++ = pack_1x64_32 (
|
|
in_over_1x64 (&src, &alpha, &mask, &dest));
|
|
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_src = _mm_or_si128 (
|
|
load_128_unaligned ((__m128i*)src), mask_ff000000);
|
|
xmm_dst = load_128_aligned ((__m128i*)dst);
|
|
|
|
unpack_128_2x128 (xmm_src, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
in_over_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_alpha, &xmm_alpha,
|
|
&xmm_mask, &xmm_mask,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
|
|
dst += 4;
|
|
src += 4;
|
|
w -= 4;
|
|
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
uint32_t s = (*src++) | 0xff000000;
|
|
uint32_t d = *dst;
|
|
|
|
__m64 src = unpack_32_1x64 (s);
|
|
__m64 alpha = _mm_movepi64_pi64 (xmm_alpha);
|
|
__m64 mask = _mm_movepi64_pi64 (xmm_mask);
|
|
__m64 dest = unpack_32_1x64 (d);
|
|
|
|
*dst++ = pack_1x64_32 (
|
|
in_over_1x64 (&src, &alpha, &mask, &dest));
|
|
|
|
w--;
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* --------------------------------------------------------------------
|
|
* composite_over_8888_8888
|
|
*/
|
|
static void
|
|
sse2_composite_over_8888_8888 (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
int dst_stride, src_stride;
|
|
uint32_t *dst_line, *dst;
|
|
uint32_t *src_line, *src;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
|
|
|
|
dst = dst_line;
|
|
src = src_line;
|
|
|
|
while (height--)
|
|
{
|
|
core_combine_over_u_sse2 (dst, src, NULL, width);
|
|
|
|
dst += dst_stride;
|
|
src += src_stride;
|
|
}
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* ------------------------------------------------------------------
|
|
* composite_over_8888_0565
|
|
*/
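/* a8r8g8b8 source OVER an r5g6b5 destination.  Destination pixels are
 * expanded from r5g6b5 to the unpacked 8888 working format
 * (unpack_565_128_4x128 / expand565_16_1x64), blended with over, and
 * repacked with the pack_565_* helpers.
 */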
|
|
static force_inline uint16_t
|
|
composite_over_8888_0565pixel (uint32_t src, uint16_t dst)
|
|
{
|
|
__m64 ms;
|
|
|
|
ms = unpack_32_1x64 (src);
|
|
return pack_565_32_16 (
|
|
pack_1x64_32 (
|
|
over_1x64 (
|
|
ms, expand_alpha_1x64 (ms), expand565_16_1x64 (dst))));
|
|
}
|
|
|
|
static void
|
|
sse2_composite_over_8888_0565 (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint16_t *dst_line, *dst, d;
|
|
uint32_t *src_line, *src, s;
|
|
int dst_stride, src_stride;
|
|
int32_t w;
|
|
|
|
__m128i xmm_alpha_lo, xmm_alpha_hi;
|
|
__m128i xmm_src, xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst, xmm_dst0, xmm_dst1, xmm_dst2, xmm_dst3;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint16_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
|
|
|
|
#if 0
|
|
/* FIXME
|
|
*
|
|
 * This code was copied from the MMX version together with its FIXME.
 * If it is a problem there, it is probably a problem here too.
|
|
*/
|
|
assert (src_image->drawable == mask_image->drawable);
|
|
#endif
|
|
|
|
while (height--)
|
|
{
|
|
dst = dst_line;
|
|
src = src_line;
|
|
|
|
dst_line += dst_stride;
|
|
src_line += src_stride;
|
|
w = width;
|
|
|
|
/* Align dst on a 16-byte boundary */
|
|
while (w &&
|
|
((unsigned long)dst & 15))
|
|
{
|
|
s = *src++;
|
|
d = *dst;
|
|
|
|
*dst++ = composite_over_8888_0565pixel (s, d);
|
|
w--;
|
|
}
|
|
|
|
/* This is an 8-pixel loop */
|
|
while (w >= 8)
|
|
{
|
|
/* Load the source unaligned, since the source address alignment
 * is not guaranteed.
 */
|
|
xmm_src = load_128_unaligned ((__m128i*) src);
|
|
xmm_dst = load_128_aligned ((__m128i*) dst);
|
|
|
|
/* Unpacking */
|
|
unpack_128_2x128 (xmm_src, &xmm_src_lo, &xmm_src_hi);
|
|
unpack_565_128_4x128 (xmm_dst,
|
|
&xmm_dst0, &xmm_dst1, &xmm_dst2, &xmm_dst3);
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi);
|
|
|
|
/* Load the next 4 pixels from memory ahead of time, to
 * optimize the memory read.
 */
|
|
xmm_src = load_128_unaligned ((__m128i*) (src + 4));
|
|
|
|
over_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi,
|
|
&xmm_dst0, &xmm_dst1);
|
|
|
|
/* Unpacking */
|
|
unpack_128_2x128 (xmm_src, &xmm_src_lo, &xmm_src_hi);
|
|
expand_alpha_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi);
|
|
|
|
over_2x128 (&xmm_src_lo, &xmm_src_hi,
|
|
&xmm_alpha_lo, &xmm_alpha_hi,
|
|
&xmm_dst2, &xmm_dst3);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)dst, pack_565_4x128_128 (
|
|
&xmm_dst0, &xmm_dst1, &xmm_dst2, &xmm_dst3));
|
|
|
|
w -= 8;
|
|
dst += 8;
|
|
src += 8;
|
|
}
|
|
|
|
while (w--)
|
|
{
|
|
s = *src++;
|
|
d = *dst;
|
|
|
|
*dst++ = composite_over_8888_0565pixel (s, d);
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* -----------------------------------------------------------------
|
|
* composite_over_n_8_8888
|
|
*/
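/* Solid source with an a8 mask OVER a 32 bpp destination.  Four 8-bit
 * mask values are fetched at once as a uint32_t; when the source is
 * opaque and all four mask bytes are 0xff, the precomputed xmm_def is
 * stored directly, otherwise the mask is expanded and the pixels go
 * through in_over.
 */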
|
|
|
|
static void
|
|
sse2_composite_over_n_8_8888 (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint32_t src, srca;
|
|
uint32_t *dst_line, *dst;
|
|
uint8_t *mask_line, *mask;
|
|
int dst_stride, mask_stride;
|
|
int32_t w;
|
|
uint32_t m, d;
|
|
|
|
__m128i xmm_src, xmm_alpha, xmm_def;
|
|
__m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;
|
|
__m128i xmm_mask, xmm_mask_lo, xmm_mask_hi;
|
|
|
|
__m64 mmx_src, mmx_alpha, mmx_mask, mmx_dest;
|
|
|
|
src = _pixman_image_get_solid (src_image, dst_image->bits.format);
|
|
|
|
srca = src >> 24;
|
|
if (src == 0)
|
|
return;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
|
|
|
|
xmm_def = create_mask_2x32_128 (src, src);
|
|
xmm_src = expand_pixel_32_1x128 (src);
|
|
xmm_alpha = expand_alpha_1x128 (xmm_src);
|
|
mmx_src = _mm_movepi64_pi64 (xmm_src);
|
|
mmx_alpha = _mm_movepi64_pi64 (xmm_alpha);
|
|
|
|
while (height--)
|
|
{
|
|
dst = dst_line;
|
|
dst_line += dst_stride;
|
|
mask = mask_line;
|
|
mask_line += mask_stride;
|
|
w = width;
|
|
|
|
while (w && (unsigned long)dst & 15)
|
|
{
|
|
uint8_t m = *mask++;
|
|
|
|
if (m)
|
|
{
|
|
d = *dst;
|
|
mmx_mask = expand_pixel_8_1x64 (m);
|
|
mmx_dest = unpack_32_1x64 (d);
|
|
|
|
*dst = pack_1x64_32 (in_over_1x64 (&mmx_src,
|
|
&mmx_alpha,
|
|
&mmx_mask,
|
|
&mmx_dest));
|
|
}
|
|
|
|
w--;
|
|
dst++;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
m = *((uint32_t*)mask);
|
|
|
|
if (srca == 0xff && m == 0xffffffff)
|
|
{
|
|
save_128_aligned ((__m128i*)dst, xmm_def);
|
|
}
|
|
else if (m)
|
|
{
|
|
xmm_dst = load_128_aligned ((__m128i*) dst);
|
|
xmm_mask = unpack_32_1x128 (m);
|
|
xmm_mask = _mm_unpacklo_epi8 (xmm_mask, _mm_setzero_si128 ());
|
|
|
|
/* Unpacking */
|
|
unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);
|
|
unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_rev_2x128 (xmm_mask_lo, xmm_mask_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
in_over_2x128 (&xmm_src, &xmm_src,
|
|
&xmm_alpha, &xmm_alpha,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
}
|
|
|
|
w -= 4;
|
|
dst += 4;
|
|
mask += 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
uint8_t m = *mask++;
|
|
|
|
if (m)
|
|
{
|
|
d = *dst;
|
|
mmx_mask = expand_pixel_8_1x64 (m);
|
|
mmx_dest = unpack_32_1x64 (d);
|
|
|
|
*dst = pack_1x64_32 (in_over_1x64 (&mmx_src,
|
|
&mmx_alpha,
|
|
&mmx_mask,
|
|
&mmx_dest));
|
|
}
|
|
|
|
w--;
|
|
dst++;
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* ----------------------------------------------------------------
|
|
 * pixman_fill_sse2 and composite_src_n_8_8888
|
|
*/
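/* pixman_fill_sse2 () replicates the fill value into an 8, 16 or 32 bpp
 * pattern, advances with byte/word/dword stores until the destination
 * is 16-byte aligned, then writes the bulk of each scanline in 128-,
 * 64-, 32- and 16-byte aligned blocks before a small scalar tail.
 */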
|
|
|
|
pixman_bool_t
|
|
pixman_fill_sse2 (uint32_t *bits,
|
|
int stride,
|
|
int bpp,
|
|
int x,
|
|
int y,
|
|
int width,
|
|
int height,
|
|
uint32_t data)
|
|
{
|
|
uint32_t byte_width;
|
|
uint8_t *byte_line;
|
|
|
|
__m128i xmm_def;
|
|
|
|
if (bpp == 8)
|
|
{
|
|
uint8_t b;
|
|
uint16_t w;
|
|
|
|
stride = stride * (int) sizeof (uint32_t) / 1;
|
|
byte_line = (uint8_t *)(((uint8_t *)bits) + stride * y + x);
|
|
byte_width = width;
|
|
stride *= 1;
|
|
|
|
b = data & 0xff;
|
|
w = (b << 8) | b;
|
|
data = (w << 16) | w;
|
|
}
|
|
else if (bpp == 16)
|
|
{
|
|
stride = stride * (int) sizeof (uint32_t) / 2;
|
|
byte_line = (uint8_t *)(((uint16_t *)bits) + stride * y + x);
|
|
byte_width = 2 * width;
|
|
stride *= 2;
|
|
|
|
data = (data & 0xffff) * 0x00010001;
|
|
}
|
|
else if (bpp == 32)
|
|
{
|
|
stride = stride * (int) sizeof (uint32_t) / 4;
|
|
byte_line = (uint8_t *)(((uint32_t *)bits) + stride * y + x);
|
|
byte_width = 4 * width;
|
|
stride *= 4;
|
|
}
|
|
else
|
|
{
|
|
return FALSE;
|
|
}
|
|
|
|
xmm_def = create_mask_2x32_128 (data, data);
|
|
|
|
while (height--)
|
|
{
|
|
int w;
|
|
uint8_t *d = byte_line;
|
|
byte_line += stride;
|
|
w = byte_width;
|
|
|
|
while (w >= 1 && ((unsigned long)d & 1))
|
|
{
|
|
*(uint8_t *)d = data;
|
|
w -= 1;
|
|
d += 1;
|
|
}
|
|
|
|
while (w >= 2 && ((unsigned long)d & 3))
|
|
{
|
|
*(uint16_t *)d = data;
|
|
w -= 2;
|
|
d += 2;
|
|
}
|
|
|
|
while (w >= 4 && ((unsigned long)d & 15))
|
|
{
|
|
*(uint32_t *)d = data;
|
|
|
|
w -= 4;
|
|
d += 4;
|
|
}
|
|
|
|
while (w >= 128)
|
|
{
|
|
save_128_aligned ((__m128i*)(d), xmm_def);
|
|
save_128_aligned ((__m128i*)(d + 16), xmm_def);
|
|
save_128_aligned ((__m128i*)(d + 32), xmm_def);
|
|
save_128_aligned ((__m128i*)(d + 48), xmm_def);
|
|
save_128_aligned ((__m128i*)(d + 64), xmm_def);
|
|
save_128_aligned ((__m128i*)(d + 80), xmm_def);
|
|
save_128_aligned ((__m128i*)(d + 96), xmm_def);
|
|
save_128_aligned ((__m128i*)(d + 112), xmm_def);
|
|
|
|
d += 128;
|
|
w -= 128;
|
|
}
|
|
|
|
if (w >= 64)
|
|
{
|
|
save_128_aligned ((__m128i*)(d), xmm_def);
|
|
save_128_aligned ((__m128i*)(d + 16), xmm_def);
|
|
save_128_aligned ((__m128i*)(d + 32), xmm_def);
|
|
save_128_aligned ((__m128i*)(d + 48), xmm_def);
|
|
|
|
d += 64;
|
|
w -= 64;
|
|
}
|
|
|
|
if (w >= 32)
|
|
{
|
|
save_128_aligned ((__m128i*)(d), xmm_def);
|
|
save_128_aligned ((__m128i*)(d + 16), xmm_def);
|
|
|
|
d += 32;
|
|
w -= 32;
|
|
}
|
|
|
|
if (w >= 16)
|
|
{
|
|
save_128_aligned ((__m128i*)(d), xmm_def);
|
|
|
|
d += 16;
|
|
w -= 16;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
*(uint32_t *)d = data;
|
|
|
|
w -= 4;
|
|
d += 4;
|
|
}
|
|
|
|
if (w >= 2)
|
|
{
|
|
*(uint16_t *)d = data;
|
|
w -= 2;
|
|
d += 2;
|
|
}
|
|
|
|
if (w >= 1)
|
|
{
|
|
*(uint8_t *)d = data;
|
|
w -= 1;
|
|
d += 1;
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
return TRUE;
|
|
}
|
|
|
|
static void
|
|
sse2_composite_src_n_8_8888 (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint32_t src, srca;
|
|
uint32_t *dst_line, *dst;
|
|
uint8_t *mask_line, *mask;
|
|
int dst_stride, mask_stride;
|
|
int32_t w;
|
|
uint32_t m;
|
|
|
|
__m128i xmm_src, xmm_def;
|
|
__m128i xmm_mask, xmm_mask_lo, xmm_mask_hi;
|
|
|
|
src = _pixman_image_get_solid (src_image, dst_image->bits.format);
|
|
|
|
srca = src >> 24;
|
|
if (src == 0)
|
|
{
|
|
pixman_fill_sse2 (dst_image->bits.bits, dst_image->bits.rowstride,
|
|
PIXMAN_FORMAT_BPP (dst_image->bits.format),
|
|
dest_x, dest_y, width, height, 0);
|
|
return;
|
|
}
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
|
|
|
|
xmm_def = create_mask_2x32_128 (src, src);
|
|
xmm_src = expand_pixel_32_1x128 (src);
|
|
|
|
while (height--)
|
|
{
|
|
dst = dst_line;
|
|
dst_line += dst_stride;
|
|
mask = mask_line;
|
|
mask_line += mask_stride;
|
|
w = width;
|
|
|
|
while (w && (unsigned long)dst & 15)
|
|
{
|
|
uint8_t m = *mask++;
|
|
|
|
if (m)
|
|
{
|
|
*dst = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
_mm_movepi64_pi64 (xmm_src), expand_pixel_8_1x64 (m)));
|
|
}
|
|
else
|
|
{
|
|
*dst = 0;
|
|
}
|
|
|
|
w--;
|
|
dst++;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
m = *((uint32_t*)mask);
|
|
|
|
if (srca == 0xff && m == 0xffffffff)
|
|
{
|
|
save_128_aligned ((__m128i*)dst, xmm_def);
|
|
}
|
|
else if (m)
|
|
{
|
|
xmm_mask = unpack_32_1x128 (m);
|
|
xmm_mask = _mm_unpacklo_epi8 (xmm_mask, _mm_setzero_si128 ());
|
|
|
|
/* Unpacking */
|
|
unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_rev_2x128 (xmm_mask_lo, xmm_mask_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
pix_multiply_2x128 (&xmm_src, &xmm_src,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)dst, pack_2x128_128 (xmm_mask_lo, xmm_mask_hi));
|
|
}
|
|
else
|
|
{
|
|
save_128_aligned ((__m128i*)dst, _mm_setzero_si128 ());
|
|
}
|
|
|
|
w -= 4;
|
|
dst += 4;
|
|
mask += 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
uint8_t m = *mask++;
|
|
|
|
if (m)
|
|
{
|
|
*dst = pack_1x64_32 (
|
|
pix_multiply_1x64 (
|
|
_mm_movepi64_pi64 (xmm_src), expand_pixel_8_1x64 (m)));
|
|
}
|
|
else
|
|
{
|
|
*dst = 0;
|
|
}
|
|
|
|
w--;
|
|
dst++;
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/*-----------------------------------------------------------------------
|
|
* composite_over_n_8_0565
|
|
*/
|
|
|
|
static void
|
|
sse2_composite_over_n_8_0565 (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint32_t src, srca;
|
|
uint16_t *dst_line, *dst, d;
|
|
uint8_t *mask_line, *mask;
|
|
int dst_stride, mask_stride;
|
|
int32_t w;
|
|
uint32_t m;
|
|
__m64 mmx_src, mmx_alpha, mmx_mask, mmx_dest;
|
|
|
|
__m128i xmm_src, xmm_alpha;
|
|
__m128i xmm_mask, xmm_mask_lo, xmm_mask_hi;
|
|
__m128i xmm_dst, xmm_dst0, xmm_dst1, xmm_dst2, xmm_dst3;
|
|
|
|
src = _pixman_image_get_solid (src_image, dst_image->bits.format);
|
|
|
|
srca = src >> 24;
|
|
if (src == 0)
|
|
return;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint16_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
|
|
|
|
xmm_src = expand_pixel_32_1x128 (src);
|
|
xmm_alpha = expand_alpha_1x128 (xmm_src);
|
|
mmx_src = _mm_movepi64_pi64 (xmm_src);
|
|
mmx_alpha = _mm_movepi64_pi64 (xmm_alpha);
|
|
|
|
while (height--)
|
|
{
|
|
dst = dst_line;
|
|
dst_line += dst_stride;
|
|
mask = mask_line;
|
|
mask_line += mask_stride;
|
|
w = width;
|
|
|
|
while (w && (unsigned long)dst & 15)
|
|
{
|
|
m = *mask++;
|
|
|
|
if (m)
|
|
{
|
|
d = *dst;
|
|
mmx_mask = expand_alpha_rev_1x64 (unpack_32_1x64 (m));
|
|
mmx_dest = expand565_16_1x64 (d);
|
|
|
|
*dst = pack_565_32_16 (
|
|
pack_1x64_32 (
|
|
in_over_1x64 (
|
|
&mmx_src, &mmx_alpha, &mmx_mask, &mmx_dest)));
|
|
}
|
|
|
|
w--;
|
|
dst++;
|
|
}
|
|
|
|
while (w >= 8)
|
|
{
|
|
xmm_dst = load_128_aligned ((__m128i*) dst);
|
|
unpack_565_128_4x128 (xmm_dst,
|
|
&xmm_dst0, &xmm_dst1, &xmm_dst2, &xmm_dst3);
|
|
|
|
m = *((uint32_t*)mask);
|
|
mask += 4;
|
|
|
|
if (m)
|
|
{
|
|
xmm_mask = unpack_32_1x128 (m);
|
|
xmm_mask = _mm_unpacklo_epi8 (xmm_mask, _mm_setzero_si128 ());
|
|
|
|
/* Unpacking */
|
|
unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_rev_2x128 (xmm_mask_lo, xmm_mask_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
in_over_2x128 (&xmm_src, &xmm_src,
|
|
&xmm_alpha, &xmm_alpha,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_dst0, &xmm_dst1);
|
|
}
|
|
|
|
m = *((uint32_t*)mask);
|
|
mask += 4;
|
|
|
|
if (m)
|
|
{
|
|
xmm_mask = unpack_32_1x128 (m);
|
|
xmm_mask = _mm_unpacklo_epi8 (xmm_mask, _mm_setzero_si128 ());
|
|
|
|
/* Unpacking */
|
|
unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
expand_alpha_rev_2x128 (xmm_mask_lo, xmm_mask_hi,
|
|
&xmm_mask_lo, &xmm_mask_hi);
|
|
in_over_2x128 (&xmm_src, &xmm_src,
|
|
&xmm_alpha, &xmm_alpha,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_dst2, &xmm_dst3);
|
|
}
|
|
|
|
save_128_aligned (
|
|
(__m128i*)dst, pack_565_4x128_128 (
|
|
&xmm_dst0, &xmm_dst1, &xmm_dst2, &xmm_dst3));
|
|
|
|
w -= 8;
|
|
dst += 8;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
m = *mask++;
|
|
|
|
if (m)
|
|
{
|
|
d = *dst;
|
|
mmx_mask = expand_alpha_rev_1x64 (unpack_32_1x64 (m));
|
|
mmx_dest = expand565_16_1x64 (d);
|
|
|
|
*dst = pack_565_32_16 (
|
|
pack_1x64_32 (
|
|
in_over_1x64 (
|
|
&mmx_src, &mmx_alpha, &mmx_mask, &mmx_dest)));
|
|
}
|
|
|
|
w--;
|
|
dst++;
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* -----------------------------------------------------------------------
|
|
* composite_over_pixbuf_0565
|
|
*/
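/* The pixbuf source is non-premultiplied and has swapped color channels
 * (hence the over_rev_non_pre / invert_colors helpers).  Fully opaque
 * 4-pixel blocks are converted with invert_colors_2x128 and written
 * directly, fully transparent blocks leave the destination untouched,
 * and mixed blocks go through over_rev_non_pre_2x128.
 */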
|
|
|
|
static void
|
|
sse2_composite_over_pixbuf_0565 (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint16_t *dst_line, *dst, d;
|
|
uint32_t *src_line, *src, s;
|
|
int dst_stride, src_stride;
|
|
int32_t w;
|
|
uint32_t opaque, zero;
|
|
|
|
__m64 ms;
|
|
__m128i xmm_src, xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst, xmm_dst0, xmm_dst1, xmm_dst2, xmm_dst3;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint16_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
|
|
|
|
#if 0
|
|
/* FIXME
|
|
*
|
|
 * This code was copied from the MMX version together with its FIXME.
 * If it is a problem there, it is probably a problem here too.
|
|
*/
|
|
assert (src_image->drawable == mask_image->drawable);
|
|
#endif
|
|
|
|
while (height--)
|
|
{
|
|
dst = dst_line;
|
|
dst_line += dst_stride;
|
|
src = src_line;
|
|
src_line += src_stride;
|
|
w = width;
|
|
|
|
while (w && (unsigned long)dst & 15)
|
|
{
|
|
s = *src++;
|
|
d = *dst;
|
|
|
|
ms = unpack_32_1x64 (s);
|
|
|
|
*dst++ = pack_565_32_16 (
|
|
pack_1x64_32 (
|
|
over_rev_non_pre_1x64 (ms, expand565_16_1x64 (d))));
|
|
w--;
|
|
}
|
|
|
|
while (w >= 8)
|
|
{
|
|
/* First round */
|
|
xmm_src = load_128_unaligned ((__m128i*)src);
|
|
xmm_dst = load_128_aligned ((__m128i*)dst);
|
|
|
|
opaque = is_opaque (xmm_src);
|
|
zero = is_zero (xmm_src);
|
|
|
|
unpack_565_128_4x128 (xmm_dst,
|
|
&xmm_dst0, &xmm_dst1, &xmm_dst2, &xmm_dst3);
|
|
unpack_128_2x128 (xmm_src, &xmm_src_lo, &xmm_src_hi);
|
|
|
|
/* preload next round */
|
|
xmm_src = load_128_unaligned ((__m128i*)(src + 4));
|
|
|
|
if (opaque)
|
|
{
|
|
invert_colors_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_dst0, &xmm_dst1);
|
|
}
|
|
else if (!zero)
|
|
{
|
|
over_rev_non_pre_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_dst0, &xmm_dst1);
|
|
}
|
|
|
|
/* Second round */
|
|
opaque = is_opaque (xmm_src);
|
|
zero = is_zero (xmm_src);
|
|
|
|
unpack_128_2x128 (xmm_src, &xmm_src_lo, &xmm_src_hi);
|
|
|
|
if (opaque)
|
|
{
|
|
invert_colors_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_dst2, &xmm_dst3);
|
|
}
|
|
else if (!zero)
|
|
{
|
|
over_rev_non_pre_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_dst2, &xmm_dst3);
|
|
}
|
|
|
|
save_128_aligned (
|
|
(__m128i*)dst, pack_565_4x128_128 (
|
|
&xmm_dst0, &xmm_dst1, &xmm_dst2, &xmm_dst3));
|
|
|
|
w -= 8;
|
|
src += 8;
|
|
dst += 8;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *src++;
|
|
d = *dst;
|
|
|
|
ms = unpack_32_1x64 (s);
|
|
|
|
*dst++ = pack_565_32_16 (
|
|
pack_1x64_32 (
|
|
over_rev_non_pre_1x64 (ms, expand565_16_1x64 (d))));
|
|
w--;
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* -------------------------------------------------------------------------
|
|
* composite_over_pixbuf_8888
|
|
*/
|
|
|
|
static void
|
|
sse2_composite_over_pixbuf_8888 (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint32_t *dst_line, *dst, d;
|
|
uint32_t *src_line, *src, s;
|
|
int dst_stride, src_stride;
|
|
int32_t w;
|
|
uint32_t opaque, zero;
|
|
|
|
__m128i xmm_src_lo, xmm_src_hi;
|
|
__m128i xmm_dst_lo, xmm_dst_hi;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
|
|
|
|
#if 0
|
|
/* FIXME
|
|
*
|
|
 * This code was copied from the MMX version together with its FIXME.
 * If it is a problem there, it is probably a problem here too.
|
|
*/
|
|
assert (src_image->drawable == mask_image->drawable);
|
|
#endif
|
|
|
|
while (height--)
|
|
{
|
|
dst = dst_line;
|
|
dst_line += dst_stride;
|
|
src = src_line;
|
|
src_line += src_stride;
|
|
w = width;
|
|
|
|
while (w && (unsigned long)dst & 15)
|
|
{
|
|
s = *src++;
|
|
d = *dst;
|
|
|
|
*dst++ = pack_1x64_32 (
|
|
over_rev_non_pre_1x64 (
|
|
unpack_32_1x64 (s), unpack_32_1x64 (d)));
|
|
|
|
w--;
|
|
}
|
|
|
|
while (w >= 4)
|
|
{
|
|
xmm_src_hi = load_128_unaligned ((__m128i*)src);
|
|
|
|
opaque = is_opaque (xmm_src_hi);
|
|
zero = is_zero (xmm_src_hi);
|
|
|
|
unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
|
|
|
|
if (opaque)
|
|
{
|
|
invert_colors_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
}
|
|
else if (!zero)
|
|
{
|
|
xmm_dst_hi = load_128_aligned ((__m128i*)dst);
|
|
|
|
unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
over_rev_non_pre_2x128 (xmm_src_lo, xmm_src_hi,
|
|
&xmm_dst_lo, &xmm_dst_hi);
|
|
|
|
save_128_aligned (
|
|
(__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
|
|
}
|
|
|
|
w -= 4;
|
|
dst += 4;
|
|
src += 4;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
s = *src++;
|
|
d = *dst;
|
|
|
|
*dst++ = pack_1x64_32 (
|
|
over_rev_non_pre_1x64 (
|
|
unpack_32_1x64 (s), unpack_32_1x64 (d)));
|
|
|
|
w--;
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* -------------------------------------------------------------------------------------------------
|
|
* composite_over_n_8888_0565_ca
|
|
*/
|
|
|
|
static void
|
|
sse2_composite_over_n_8888_0565_ca (pixman_implementation_t *imp,
|
|
pixman_op_t op,
|
|
pixman_image_t * src_image,
|
|
pixman_image_t * mask_image,
|
|
pixman_image_t * dst_image,
|
|
int32_t src_x,
|
|
int32_t src_y,
|
|
int32_t mask_x,
|
|
int32_t mask_y,
|
|
int32_t dest_x,
|
|
int32_t dest_y,
|
|
int32_t width,
|
|
int32_t height)
|
|
{
|
|
uint32_t src;
|
|
uint16_t *dst_line, *dst, d;
|
|
uint32_t *mask_line, *mask, m;
|
|
int dst_stride, mask_stride;
|
|
int w;
|
|
uint32_t pack_cmp;
|
|
|
|
__m128i xmm_src, xmm_alpha;
|
|
__m128i xmm_mask, xmm_mask_lo, xmm_mask_hi;
|
|
__m128i xmm_dst, xmm_dst0, xmm_dst1, xmm_dst2, xmm_dst3;
|
|
|
|
__m64 mmx_src, mmx_alpha, mmx_mask, mmx_dest;
|
|
|
|
src = _pixman_image_get_solid (src_image, dst_image->bits.format);
|
|
|
|
if (src == 0)
|
|
return;
|
|
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
dst_image, dest_x, dest_y, uint16_t, dst_stride, dst_line, 1);
|
|
PIXMAN_IMAGE_GET_LINE (
|
|
mask_image, mask_x, mask_y, uint32_t, mask_stride, mask_line, 1);
|
|
|
|
xmm_src = expand_pixel_32_1x128 (src);
|
|
xmm_alpha = expand_alpha_1x128 (xmm_src);
|
|
mmx_src = _mm_movepi64_pi64 (xmm_src);
|
|
mmx_alpha = _mm_movepi64_pi64 (xmm_alpha);
|
|
|
|
while (height--)
|
|
{
|
|
w = width;
|
|
mask = mask_line;
|
|
dst = dst_line;
|
|
mask_line += mask_stride;
|
|
dst_line += dst_stride;
|
|
|
|
while (w && ((unsigned long)dst & 15))
|
|
{
|
|
m = *(uint32_t *) mask;
|
|
|
|
if (m)
|
|
{
|
|
d = *dst;
|
|
mmx_mask = unpack_32_1x64 (m);
|
|
mmx_dest = expand565_16_1x64 (d);
|
|
|
|
*dst = pack_565_32_16 (
|
|
pack_1x64_32 (
|
|
in_over_1x64 (
|
|
&mmx_src, &mmx_alpha, &mmx_mask, &mmx_dest)));
|
|
}
|
|
|
|
w--;
|
|
dst++;
|
|
mask++;
|
|
}
|
|
|
|
while (w >= 8)
|
|
{
|
|
/* First round */
|
|
xmm_mask = load_128_unaligned ((__m128i*)mask);
|
|
xmm_dst = load_128_aligned ((__m128i*)dst);
|
|
|
|
pack_cmp = _mm_movemask_epi8 (
|
|
_mm_cmpeq_epi32 (xmm_mask, _mm_setzero_si128 ()));
|
|
|
|
unpack_565_128_4x128 (xmm_dst,
|
|
&xmm_dst0, &xmm_dst1, &xmm_dst2, &xmm_dst3);
|
|
unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
/* preload next round */
|
|
xmm_mask = load_128_unaligned ((__m128i*)(mask + 4));
|
|
|
|
if (pack_cmp != 0xffff)
|
|
{
|
|
in_over_2x128 (&xmm_src, &xmm_src,
|
|
&xmm_alpha, &xmm_alpha,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_dst0, &xmm_dst1);
|
|
}
|
|
|
|
/* Second round */
|
|
pack_cmp = _mm_movemask_epi8 (
|
|
_mm_cmpeq_epi32 (xmm_mask, _mm_setzero_si128 ()));
|
|
|
|
unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
|
|
|
|
if (pack_cmp != 0xffff)
|
|
{
|
|
in_over_2x128 (&xmm_src, &xmm_src,
|
|
&xmm_alpha, &xmm_alpha,
|
|
&xmm_mask_lo, &xmm_mask_hi,
|
|
&xmm_dst2, &xmm_dst3);
|
|
}
|
|
|
|
save_128_aligned (
|
|
(__m128i*)dst, pack_565_4x128_128 (
|
|
&xmm_dst0, &xmm_dst1, &xmm_dst2, &xmm_dst3));
|
|
|
|
w -= 8;
|
|
dst += 8;
|
|
mask += 8;
|
|
}
|
|
|
|
while (w)
|
|
{
|
|
m = *(uint32_t *) mask;
|
|
|
|
if (m)
|
|
{
|
|
d = *dst;
|
|
mmx_mask = unpack_32_1x64 (m);
|
|
mmx_dest = expand565_16_1x64 (d);
|
|
|
|
*dst = pack_565_32_16 (
|
|
pack_1x64_32 (
|
|
in_over_1x64 (
|
|
&mmx_src, &mmx_alpha, &mmx_mask, &mmx_dest)));
|
|
}
|
|
|
|
w--;
|
|
dst++;
|
|
mask++;
|
|
}
|
|
}
|
|
|
|
_mm_empty ();
|
|
}
|
|
|
|
/* -----------------------------------------------------------------------
|
|
* composite_in_n_8_8
|
|
*/
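/* IN with a solid source, an a8 mask and an a8 destination:
 * dst = src_alpha * mask * dst.  Sixteen a8 pixels are processed per
 * SSE2 iteration, with scalar code handling the unaligned head and the
 * tail.
 */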
|
|
|
|
static void
sse2_composite_in_n_8_8 (pixman_implementation_t *imp,
                         pixman_op_t              op,
                         pixman_image_t *         src_image,
                         pixman_image_t *         mask_image,
                         pixman_image_t *         dst_image,
                         int32_t                  src_x,
                         int32_t                  src_y,
                         int32_t                  mask_x,
                         int32_t                  mask_y,
                         int32_t                  dest_x,
                         int32_t                  dest_y,
                         int32_t                  width,
                         int32_t                  height)
{
    uint8_t *dst_line, *dst;
    uint8_t *mask_line, *mask;
    int dst_stride, mask_stride;
    uint32_t d, m;
    uint32_t src;
    uint8_t sa;
    int32_t w;

    __m128i xmm_alpha;
    __m128i xmm_mask, xmm_mask_lo, xmm_mask_hi;
    __m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;

    PIXMAN_IMAGE_GET_LINE (
        dst_image, dest_x, dest_y, uint8_t, dst_stride, dst_line, 1);
    PIXMAN_IMAGE_GET_LINE (
        mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);

    src = _pixman_image_get_solid (src_image, dst_image->bits.format);

    sa = src >> 24;

    xmm_alpha = expand_alpha_1x128 (expand_pixel_32_1x128 (src));

    while (height--)
    {
        dst = dst_line;
        dst_line += dst_stride;
        mask = mask_line;
        mask_line += mask_stride;
        w = width;

        while (w && ((unsigned long)dst & 15))
        {
            m = (uint32_t) *mask++;
            d = (uint32_t) *dst;

            *dst++ = (uint8_t) pack_1x64_32 (
                pix_multiply_1x64 (
                    pix_multiply_1x64 (_mm_movepi64_pi64 (xmm_alpha),
                                       unpack_32_1x64 (m)),
                    unpack_32_1x64 (d)));
            w--;
        }

        while (w >= 16)
        {
            xmm_mask = load_128_unaligned ((__m128i*)mask);
            xmm_dst = load_128_aligned ((__m128i*)dst);

            unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
            unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);

            pix_multiply_2x128 (&xmm_alpha, &xmm_alpha,
                                &xmm_mask_lo, &xmm_mask_hi,
                                &xmm_mask_lo, &xmm_mask_hi);

            pix_multiply_2x128 (&xmm_mask_lo, &xmm_mask_hi,
                                &xmm_dst_lo, &xmm_dst_hi,
                                &xmm_dst_lo, &xmm_dst_hi);

            save_128_aligned (
                (__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));

            mask += 16;
            dst += 16;
            w -= 16;
        }

        while (w)
        {
            m = (uint32_t) *mask++;
            d = (uint32_t) *dst;

            *dst++ = (uint8_t) pack_1x64_32 (
                pix_multiply_1x64 (
                    pix_multiply_1x64 (
                        _mm_movepi64_pi64 (xmm_alpha), unpack_32_1x64 (m)),
                    unpack_32_1x64 (d)));
            w--;
        }
    }

    _mm_empty ();
}

/* -----------------------------------------------------------------------
 * composite_in_n_8
 */
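/*
 * Solid source, no mask, a8 destination (IN): every destination byte is
 * multiplied by the source alpha.  A fully opaque source leaves the
 * destination untouched; a fully transparent one clears it via pixman_fill.
 */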
static void
sse2_composite_in_n_8 (pixman_implementation_t *imp,
                       pixman_op_t              op,
                       pixman_image_t *         src_image,
                       pixman_image_t *         mask_image,
                       pixman_image_t *         dst_image,
                       int32_t                  src_x,
                       int32_t                  src_y,
                       int32_t                  mask_x,
                       int32_t                  mask_y,
                       int32_t                  dest_x,
                       int32_t                  dest_y,
                       int32_t                  width,
                       int32_t                  height)
{
    uint8_t *dst_line, *dst;
    int dst_stride;
    uint32_t d;
    uint32_t src;
    int32_t w;

    __m128i xmm_alpha;
    __m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;

    PIXMAN_IMAGE_GET_LINE (
        dst_image, dest_x, dest_y, uint8_t, dst_stride, dst_line, 1);

    src = _pixman_image_get_solid (src_image, dst_image->bits.format);

    xmm_alpha = expand_alpha_1x128 (expand_pixel_32_1x128 (src));

    src = src >> 24;

    if (src == 0xff)
        return;

    if (src == 0x00)
    {
        pixman_fill (dst_image->bits.bits, dst_image->bits.rowstride,
                     8, dest_x, dest_y, width, height, src);

        return;
    }

    while (height--)
    {
        dst = dst_line;
        dst_line += dst_stride;
        w = width;

        while (w && ((unsigned long)dst & 15))
        {
            d = (uint32_t) *dst;

            *dst++ = (uint8_t) pack_1x64_32 (
                pix_multiply_1x64 (
                    _mm_movepi64_pi64 (xmm_alpha),
                    unpack_32_1x64 (d)));
            w--;
        }

        while (w >= 16)
        {
            xmm_dst = load_128_aligned ((__m128i*)dst);

            unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);

            pix_multiply_2x128 (&xmm_alpha, &xmm_alpha,
                                &xmm_dst_lo, &xmm_dst_hi,
                                &xmm_dst_lo, &xmm_dst_hi);

            save_128_aligned (
                (__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));

            dst += 16;
            w -= 16;
        }

        while (w)
        {
            d = (uint32_t) *dst;

            *dst++ = (uint8_t) pack_1x64_32 (
                pix_multiply_1x64 (
                    _mm_movepi64_pi64 (xmm_alpha),
                    unpack_32_1x64 (d)));
            w--;
        }
    }

    _mm_empty ();
}

/* ---------------------------------------------------------------------------
 * composite_in_8_8
 */
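/*
 * a8 source IN a8 destination: each destination byte is multiplied by the
 * corresponding source byte, 16 pixels at a time.
 */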
static void
sse2_composite_in_8_8 (pixman_implementation_t *imp,
                       pixman_op_t              op,
                       pixman_image_t *         src_image,
                       pixman_image_t *         mask_image,
                       pixman_image_t *         dst_image,
                       int32_t                  src_x,
                       int32_t                  src_y,
                       int32_t                  mask_x,
                       int32_t                  mask_y,
                       int32_t                  dest_x,
                       int32_t                  dest_y,
                       int32_t                  width,
                       int32_t                  height)
{
    uint8_t *dst_line, *dst;
    uint8_t *src_line, *src;
    int src_stride, dst_stride;
    int32_t w;
    uint32_t s, d;

    __m128i xmm_src, xmm_src_lo, xmm_src_hi;
    __m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;

    PIXMAN_IMAGE_GET_LINE (
        dst_image, dest_x, dest_y, uint8_t, dst_stride, dst_line, 1);
    PIXMAN_IMAGE_GET_LINE (
        src_image, src_x, src_y, uint8_t, src_stride, src_line, 1);

    while (height--)
    {
        dst = dst_line;
        dst_line += dst_stride;
        src = src_line;
        src_line += src_stride;
        w = width;

        while (w && ((unsigned long)dst & 15))
        {
            s = (uint32_t) *src++;
            d = (uint32_t) *dst;

            *dst++ = (uint8_t) pack_1x64_32 (
                pix_multiply_1x64 (
                    unpack_32_1x64 (s), unpack_32_1x64 (d)));
            w--;
        }

        while (w >= 16)
        {
            xmm_src = load_128_unaligned ((__m128i*)src);
            xmm_dst = load_128_aligned ((__m128i*)dst);

            unpack_128_2x128 (xmm_src, &xmm_src_lo, &xmm_src_hi);
            unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);

            pix_multiply_2x128 (&xmm_src_lo, &xmm_src_hi,
                                &xmm_dst_lo, &xmm_dst_hi,
                                &xmm_dst_lo, &xmm_dst_hi);

            save_128_aligned (
                (__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));

            src += 16;
            dst += 16;
            w -= 16;
        }

        while (w)
        {
            s = (uint32_t) *src++;
            d = (uint32_t) *dst;

            *dst++ = (uint8_t) pack_1x64_32 (
                pix_multiply_1x64 (unpack_32_1x64 (s), unpack_32_1x64 (d)));
            w--;
        }
    }

    _mm_empty ();
}

/* -------------------------------------------------------------------------
 * composite_add_n_8_8
 */
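/*
 * Solid source, a8 mask, a8 destination (ADD): the product of source alpha
 * and mask is added to the destination with unsigned saturation.
 */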
static void
sse2_composite_add_n_8_8 (pixman_implementation_t *imp,
                          pixman_op_t              op,
                          pixman_image_t *         src_image,
                          pixman_image_t *         mask_image,
                          pixman_image_t *         dst_image,
                          int32_t                  src_x,
                          int32_t                  src_y,
                          int32_t                  mask_x,
                          int32_t                  mask_y,
                          int32_t                  dest_x,
                          int32_t                  dest_y,
                          int32_t                  width,
                          int32_t                  height)
{
    uint8_t *dst_line, *dst;
    uint8_t *mask_line, *mask;
    int dst_stride, mask_stride;
    int32_t w;
    uint32_t src;
    uint8_t sa;
    uint32_t m, d;

    __m128i xmm_alpha;
    __m128i xmm_mask, xmm_mask_lo, xmm_mask_hi;
    __m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;

    PIXMAN_IMAGE_GET_LINE (
        dst_image, dest_x, dest_y, uint8_t, dst_stride, dst_line, 1);
    PIXMAN_IMAGE_GET_LINE (
        mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);

    src = _pixman_image_get_solid (src_image, dst_image->bits.format);

    sa = src >> 24;

    xmm_alpha = expand_alpha_1x128 (expand_pixel_32_1x128 (src));

    while (height--)
    {
        dst = dst_line;
        dst_line += dst_stride;
        mask = mask_line;
        mask_line += mask_stride;
        w = width;

        while (w && ((unsigned long)dst & 15))
        {
            m = (uint32_t) *mask++;
            d = (uint32_t) *dst;

            *dst++ = (uint8_t) pack_1x64_32 (
                _mm_adds_pu16 (
                    pix_multiply_1x64 (
                        _mm_movepi64_pi64 (xmm_alpha), unpack_32_1x64 (m)),
                    unpack_32_1x64 (d)));
            w--;
        }

        while (w >= 16)
        {
            xmm_mask = load_128_unaligned ((__m128i*)mask);
            xmm_dst = load_128_aligned ((__m128i*)dst);

            unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
            unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);

            pix_multiply_2x128 (&xmm_alpha, &xmm_alpha,
                                &xmm_mask_lo, &xmm_mask_hi,
                                &xmm_mask_lo, &xmm_mask_hi);

            xmm_dst_lo = _mm_adds_epu16 (xmm_mask_lo, xmm_dst_lo);
            xmm_dst_hi = _mm_adds_epu16 (xmm_mask_hi, xmm_dst_hi);

            save_128_aligned (
                (__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));

            mask += 16;
            dst += 16;
            w -= 16;
        }

        while (w)
        {
            m = (uint32_t) *mask++;
            d = (uint32_t) *dst;

            *dst++ = (uint8_t) pack_1x64_32 (
                _mm_adds_pu16 (
                    pix_multiply_1x64 (
                        _mm_movepi64_pi64 (xmm_alpha), unpack_32_1x64 (m)),
                    unpack_32_1x64 (d)));

            w--;
        }
    }

    _mm_empty ();
}

/* -------------------------------------------------------------------------
 * composite_add_n_8
 */
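/*
 * Solid source, no mask, a8 destination (ADD): the source alpha is added to
 * every destination byte with saturation.  Alpha 0x00 is a no-op and alpha
 * 0xff degenerates to a solid fill.
 */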
static void
sse2_composite_add_n_8 (pixman_implementation_t *imp,
                        pixman_op_t              op,
                        pixman_image_t *         src_image,
                        pixman_image_t *         mask_image,
                        pixman_image_t *         dst_image,
                        int32_t                  src_x,
                        int32_t                  src_y,
                        int32_t                  mask_x,
                        int32_t                  mask_y,
                        int32_t                  dest_x,
                        int32_t                  dest_y,
                        int32_t                  width,
                        int32_t                  height)
{
    uint8_t *dst_line, *dst;
    int dst_stride;
    int32_t w;
    uint32_t src;

    __m128i xmm_src;

    PIXMAN_IMAGE_GET_LINE (
        dst_image, dest_x, dest_y, uint8_t, dst_stride, dst_line, 1);

    src = _pixman_image_get_solid (src_image, dst_image->bits.format);

    src >>= 24;

    if (src == 0x00)
        return;

    if (src == 0xff)
    {
        pixman_fill (dst_image->bits.bits, dst_image->bits.rowstride,
                     8, dest_x, dest_y, width, height, 0xff);

        return;
    }

    src = (src << 24) | (src << 16) | (src << 8) | src;
    xmm_src = _mm_set_epi32 (src, src, src, src);

    while (height--)
    {
        dst = dst_line;
        dst_line += dst_stride;
        w = width;

        while (w && ((unsigned long)dst & 15))
        {
            *dst = (uint8_t)_mm_cvtsi64_si32 (
                _mm_adds_pu8 (
                    _mm_movepi64_pi64 (xmm_src),
                    _mm_cvtsi32_si64 (*dst)));

            w--;
            dst++;
        }

        while (w >= 16)
        {
            save_128_aligned (
                (__m128i*)dst,
                _mm_adds_epu8 (xmm_src, load_128_aligned ((__m128i*)dst)));

            dst += 16;
            w -= 16;
        }

        while (w)
        {
            *dst = (uint8_t)_mm_cvtsi64_si32 (
                _mm_adds_pu8 (
                    _mm_movepi64_pi64 (xmm_src),
                    _mm_cvtsi32_si64 (*dst)));

            w--;
            dst++;
        }
    }

    _mm_empty ();
}

/* ----------------------------------------------------------------------
 * composite_add_8_8
 */
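/*
 * a8 source ADD a8 destination: saturating byte-wise add.  The bulk of each
 * scanline is handed to core_combine_add_u_sse2 in 4-byte groups, with
 * scalar head and tail loops for the unaligned edges.
 */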
static void
sse2_composite_add_8_8 (pixman_implementation_t *imp,
                        pixman_op_t              op,
                        pixman_image_t *         src_image,
                        pixman_image_t *         mask_image,
                        pixman_image_t *         dst_image,
                        int32_t                  src_x,
                        int32_t                  src_y,
                        int32_t                  mask_x,
                        int32_t                  mask_y,
                        int32_t                  dest_x,
                        int32_t                  dest_y,
                        int32_t                  width,
                        int32_t                  height)
{
    uint8_t *dst_line, *dst;
    uint8_t *src_line, *src;
    int dst_stride, src_stride;
    int32_t w;
    uint16_t t;

    PIXMAN_IMAGE_GET_LINE (
        src_image, src_x, src_y, uint8_t, src_stride, src_line, 1);
    PIXMAN_IMAGE_GET_LINE (
        dst_image, dest_x, dest_y, uint8_t, dst_stride, dst_line, 1);

    while (height--)
    {
        dst = dst_line;
        src = src_line;

        dst_line += dst_stride;
        src_line += src_stride;
        w = width;

        /* Small head */
        while (w && (unsigned long)dst & 3)
        {
            t = (*dst) + (*src++);
            *dst++ = t | (0 - (t >> 8));
            w--;
        }

        core_combine_add_u_sse2 ((uint32_t*)dst, (uint32_t*)src, NULL, w >> 2);

        /* Small tail */
        dst += w & 0xfffc;
        src += w & 0xfffc;

        w &= 3;

        while (w)
        {
            t = (*dst) + (*src++);
            *dst++ = t | (0 - (t >> 8));
            w--;
        }
    }

    _mm_empty ();
}

/* ---------------------------------------------------------------------
 * composite_add_8888_8888
 */
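/*
 * a8r8g8b8 source ADD a8r8g8b8 destination: each scanline is a saturating
 * component-wise add performed by core_combine_add_u_sse2.
 */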
static void
sse2_composite_add_8888_8888 (pixman_implementation_t *imp,
                              pixman_op_t              op,
                              pixman_image_t *         src_image,
                              pixman_image_t *         mask_image,
                              pixman_image_t *         dst_image,
                              int32_t                  src_x,
                              int32_t                  src_y,
                              int32_t                  mask_x,
                              int32_t                  mask_y,
                              int32_t                  dest_x,
                              int32_t                  dest_y,
                              int32_t                  width,
                              int32_t                  height)
{
    uint32_t *dst_line, *dst;
    uint32_t *src_line, *src;
    int dst_stride, src_stride;

    PIXMAN_IMAGE_GET_LINE (
        src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);
    PIXMAN_IMAGE_GET_LINE (
        dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);

    while (height--)
    {
        dst = dst_line;
        dst_line += dst_stride;
        src = src_line;
        src_line += src_stride;

        core_combine_add_u_sse2 (dst, src, NULL, width);
    }

    _mm_empty ();
}

/* -------------------------------------------------------------------------
 * sse2_composite_copy_area
 */
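/*
 * Plain blit for images of equal depth (16 or 32 bpp): rows are copied with
 * 64-byte unrolled SSE2 moves, with 16/32-bit scalar copies at the
 * unaligned edges.  Returns FALSE for depths it cannot handle.
 */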
static pixman_bool_t
pixman_blt_sse2 (uint32_t *src_bits,
                 uint32_t *dst_bits,
                 int       src_stride,
                 int       dst_stride,
                 int       src_bpp,
                 int       dst_bpp,
                 int       src_x,
                 int       src_y,
                 int       dst_x,
                 int       dst_y,
                 int       width,
                 int       height)
{
    uint8_t * src_bytes;
    uint8_t * dst_bytes;
    int byte_width;

    if (src_bpp != dst_bpp)
        return FALSE;

    if (src_bpp == 16)
    {
        src_stride = src_stride * (int) sizeof (uint32_t) / 2;
        dst_stride = dst_stride * (int) sizeof (uint32_t) / 2;
        src_bytes = (uint8_t *)(((uint16_t *)src_bits) + src_stride * (src_y) + (src_x));
        dst_bytes = (uint8_t *)(((uint16_t *)dst_bits) + dst_stride * (dst_y) + (dst_x));
        byte_width = 2 * width;
        src_stride *= 2;
        dst_stride *= 2;
    }
    else if (src_bpp == 32)
    {
        src_stride = src_stride * (int) sizeof (uint32_t) / 4;
        dst_stride = dst_stride * (int) sizeof (uint32_t) / 4;
        src_bytes = (uint8_t *)(((uint32_t *)src_bits) + src_stride * (src_y) + (src_x));
        dst_bytes = (uint8_t *)(((uint32_t *)dst_bits) + dst_stride * (dst_y) + (dst_x));
        byte_width = 4 * width;
        src_stride *= 4;
        dst_stride *= 4;
    }
    else
    {
        return FALSE;
    }

    while (height--)
    {
        int w;
        uint8_t *s = src_bytes;
        uint8_t *d = dst_bytes;
        src_bytes += src_stride;
        dst_bytes += dst_stride;
        w = byte_width;

        while (w >= 2 && ((unsigned long)d & 3))
        {
            *(uint16_t *)d = *(uint16_t *)s;
            w -= 2;
            s += 2;
            d += 2;
        }

        while (w >= 4 && ((unsigned long)d & 15))
        {
            *(uint32_t *)d = *(uint32_t *)s;

            w -= 4;
            s += 4;
            d += 4;
        }

        while (w >= 64)
        {
            __m128i xmm0, xmm1, xmm2, xmm3;

            xmm0 = load_128_unaligned ((__m128i*)(s));
            xmm1 = load_128_unaligned ((__m128i*)(s + 16));
            xmm2 = load_128_unaligned ((__m128i*)(s + 32));
            xmm3 = load_128_unaligned ((__m128i*)(s + 48));

            save_128_aligned ((__m128i*)(d), xmm0);
            save_128_aligned ((__m128i*)(d + 16), xmm1);
            save_128_aligned ((__m128i*)(d + 32), xmm2);
            save_128_aligned ((__m128i*)(d + 48), xmm3);

            s += 64;
            d += 64;
            w -= 64;
        }

        while (w >= 16)
        {
            save_128_aligned ((__m128i*)d, load_128_unaligned ((__m128i*)s));

            w -= 16;
            d += 16;
            s += 16;
        }

        while (w >= 4)
        {
            *(uint32_t *)d = *(uint32_t *)s;

            w -= 4;
            s += 4;
            d += 4;
        }

        if (w >= 2)
        {
            *(uint16_t *)d = *(uint16_t *)s;
            w -= 2;
            s += 2;
            d += 2;
        }
    }

    _mm_empty ();

    return TRUE;
}

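/* OP_SRC copy between same-depth images, implemented on top of
 * pixman_blt_sse2; the operator and mask arguments are unused here.
 */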
static void
sse2_composite_copy_area (pixman_implementation_t *imp,
                          pixman_op_t              op,
                          pixman_image_t *         src_image,
                          pixman_image_t *         mask_image,
                          pixman_image_t *         dst_image,
                          int32_t                  src_x,
                          int32_t                  src_y,
                          int32_t                  mask_x,
                          int32_t                  mask_y,
                          int32_t                  dest_x,
                          int32_t                  dest_y,
                          int32_t                  width,
                          int32_t                  height)
{
    pixman_blt_sse2 (src_image->bits.bits,
                     dst_image->bits.bits,
                     src_image->bits.rowstride,
                     dst_image->bits.rowstride,
                     PIXMAN_FORMAT_BPP (src_image->bits.format),
                     PIXMAN_FORMAT_BPP (dst_image->bits.format),
                     src_x, src_y, dest_x, dest_y, width, height);
}

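/*
 * x888 source, a8 mask, 8888 destination (OVER): the source is forced opaque
 * by OR-ing in 0xff000000, so only the mask decides how much of the
 * destination shows through.
 */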
static void
sse2_composite_over_x888_8_8888 (pixman_implementation_t *imp,
                                 pixman_op_t              op,
                                 pixman_image_t *         src_image,
                                 pixman_image_t *         mask_image,
                                 pixman_image_t *         dst_image,
                                 int32_t                  src_x,
                                 int32_t                  src_y,
                                 int32_t                  mask_x,
                                 int32_t                  mask_y,
                                 int32_t                  dest_x,
                                 int32_t                  dest_y,
                                 int32_t                  width,
                                 int32_t                  height)
{
    uint32_t *src, *src_line, s;
    uint32_t *dst, *dst_line, d;
    uint8_t *mask, *mask_line;
    uint32_t m;
    int src_stride, mask_stride, dst_stride;
    int32_t w;
    __m64 ms;

    __m128i xmm_src, xmm_src_lo, xmm_src_hi;
    __m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;
    __m128i xmm_mask, xmm_mask_lo, xmm_mask_hi;

    PIXMAN_IMAGE_GET_LINE (
        dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
    PIXMAN_IMAGE_GET_LINE (
        mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
    PIXMAN_IMAGE_GET_LINE (
        src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);

    while (height--)
    {
        src = src_line;
        src_line += src_stride;
        dst = dst_line;
        dst_line += dst_stride;
        mask = mask_line;
        mask_line += mask_stride;

        w = width;

        while (w && (unsigned long)dst & 15)
        {
            s = 0xff000000 | *src++;
            m = (uint32_t) *mask++;
            d = *dst;
            ms = unpack_32_1x64 (s);

            if (m != 0xff)
            {
                __m64 ma = expand_alpha_rev_1x64 (unpack_32_1x64 (m));
                __m64 md = unpack_32_1x64 (d);

                ms = in_over_1x64 (&ms, &mask_x00ff, &ma, &md);
            }

            *dst++ = pack_1x64_32 (ms);
            w--;
        }

        while (w >= 4)
        {
            m = *(uint32_t*) mask;
            xmm_src = _mm_or_si128 (
                load_128_unaligned ((__m128i*)src), mask_ff000000);

            if (m == 0xffffffff)
            {
                save_128_aligned ((__m128i*)dst, xmm_src);
            }
            else
            {
                xmm_dst = load_128_aligned ((__m128i*)dst);

                xmm_mask = _mm_unpacklo_epi16 (
                    unpack_32_1x128 (m), _mm_setzero_si128 ());

                unpack_128_2x128 (xmm_src, &xmm_src_lo, &xmm_src_hi);
                unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
                unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);

                expand_alpha_rev_2x128 (
                    xmm_mask_lo, xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);

                in_over_2x128 (&xmm_src_lo, &xmm_src_hi,
                               &mask_00ff, &mask_00ff,
                               &xmm_mask_lo, &xmm_mask_hi,
                               &xmm_dst_lo, &xmm_dst_hi);

                save_128_aligned (
                    (__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
            }

            src += 4;
            dst += 4;
            mask += 4;
            w -= 4;
        }

        while (w)
        {
            m = (uint32_t) *mask++;

            if (m)
            {
                s = 0xff000000 | *src;

                if (m == 0xff)
                {
                    *dst = s;
                }
                else
                {
                    __m64 ma, md, ms;

                    d = *dst;

                    ma = expand_alpha_rev_1x64 (unpack_32_1x64 (m));
                    md = unpack_32_1x64 (d);
                    ms = unpack_32_1x64 (s);

                    *dst = pack_1x64_32 (in_over_1x64 (&ms, &mask_x00ff, &ma, &md));
                }
            }

            src++;
            dst++;
            w--;
        }
    }

    _mm_empty ();
}

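/*
 * a8r8g8b8 source, a8 mask, 8888 destination (OVER): the source is first
 * multiplied by the mask and then composited over the destination; fully
 * opaque source/mask pairs are copied straight through.
 */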
static void
sse2_composite_over_8888_8_8888 (pixman_implementation_t *imp,
                                 pixman_op_t              op,
                                 pixman_image_t *         src_image,
                                 pixman_image_t *         mask_image,
                                 pixman_image_t *         dst_image,
                                 int32_t                  src_x,
                                 int32_t                  src_y,
                                 int32_t                  mask_x,
                                 int32_t                  mask_y,
                                 int32_t                  dest_x,
                                 int32_t                  dest_y,
                                 int32_t                  width,
                                 int32_t                  height)
{
    uint32_t *src, *src_line, s;
    uint32_t *dst, *dst_line, d;
    uint8_t *mask, *mask_line;
    uint32_t m;
    int src_stride, mask_stride, dst_stride;
    int32_t w;

    __m128i xmm_src, xmm_src_lo, xmm_src_hi, xmm_srca_lo, xmm_srca_hi;
    __m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;
    __m128i xmm_mask, xmm_mask_lo, xmm_mask_hi;

    PIXMAN_IMAGE_GET_LINE (
        dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
    PIXMAN_IMAGE_GET_LINE (
        mask_image, mask_x, mask_y, uint8_t, mask_stride, mask_line, 1);
    PIXMAN_IMAGE_GET_LINE (
        src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);

    while (height--)
    {
        src = src_line;
        src_line += src_stride;
        dst = dst_line;
        dst_line += dst_stride;
        mask = mask_line;
        mask_line += mask_stride;

        w = width;

        while (w && (unsigned long)dst & 15)
        {
            uint32_t sa;

            s = *src++;
            m = (uint32_t) *mask++;
            d = *dst;

            sa = s >> 24;

            if (m)
            {
                if (sa == 0xff && m == 0xff)
                {
                    *dst = s;
                }
                else
                {
                    __m64 ms, md, ma, msa;

                    ma = expand_alpha_rev_1x64 (load_32_1x64 (m));
                    ms = unpack_32_1x64 (s);
                    md = unpack_32_1x64 (d);

                    msa = expand_alpha_rev_1x64 (load_32_1x64 (sa));

                    *dst = pack_1x64_32 (in_over_1x64 (&ms, &msa, &ma, &md));
                }
            }

            dst++;
            w--;
        }

        while (w >= 4)
        {
            m = *(uint32_t *) mask;

            if (m)
            {
                xmm_src = load_128_unaligned ((__m128i*)src);

                if (m == 0xffffffff && is_opaque (xmm_src))
                {
                    save_128_aligned ((__m128i *)dst, xmm_src);
                }
                else
                {
                    xmm_dst = load_128_aligned ((__m128i *)dst);

                    xmm_mask = _mm_unpacklo_epi16 (
                        unpack_32_1x128 (m), _mm_setzero_si128 ());

                    unpack_128_2x128 (xmm_src, &xmm_src_lo, &xmm_src_hi);
                    unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
                    unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);

                    expand_alpha_2x128 (
                        xmm_src_lo, xmm_src_hi, &xmm_srca_lo, &xmm_srca_hi);
                    expand_alpha_rev_2x128 (
                        xmm_mask_lo, xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);

                    in_over_2x128 (&xmm_src_lo, &xmm_src_hi,
                                   &xmm_srca_lo, &xmm_srca_hi,
                                   &xmm_mask_lo, &xmm_mask_hi,
                                   &xmm_dst_lo, &xmm_dst_hi);

                    save_128_aligned (
                        (__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
                }
            }

            src += 4;
            dst += 4;
            mask += 4;
            w -= 4;
        }

        while (w)
        {
            uint32_t sa;

            s = *src++;
            m = (uint32_t) *mask++;
            d = *dst;

            sa = s >> 24;

            if (m)
            {
                if (sa == 0xff && m == 0xff)
                {
                    *dst = s;
                }
                else
                {
                    __m64 ms, md, ma, msa;

                    ma = expand_alpha_rev_1x64 (load_32_1x64 (m));
                    ms = unpack_32_1x64 (s);
                    md = unpack_32_1x64 (d);

                    msa = expand_alpha_rev_1x64 (load_32_1x64 (sa));

                    *dst = pack_1x64_32 (in_over_1x64 (&ms, &msa, &ma, &md));
                }
            }

            dst++;
            w--;
        }
    }

    _mm_empty ();
}

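/*
 * OVER_REVERSE with a solid source: the existing destination is composited
 * over the solid color and the result written back to the destination.
 */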
static void
sse2_composite_over_reverse_n_8888 (pixman_implementation_t *imp,
                                    pixman_op_t              op,
                                    pixman_image_t *         src_image,
                                    pixman_image_t *         mask_image,
                                    pixman_image_t *         dst_image,
                                    int32_t                  src_x,
                                    int32_t                  src_y,
                                    int32_t                  mask_x,
                                    int32_t                  mask_y,
                                    int32_t                  dest_x,
                                    int32_t                  dest_y,
                                    int32_t                  width,
                                    int32_t                  height)
{
    uint32_t src;
    uint32_t *dst_line, *dst;
    __m128i xmm_src;
    __m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;
    __m128i xmm_dsta_hi, xmm_dsta_lo;
    int dst_stride;
    int32_t w;

    src = _pixman_image_get_solid (src_image, dst_image->bits.format);

    if (src == 0)
        return;

    PIXMAN_IMAGE_GET_LINE (
        dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);

    xmm_src = expand_pixel_32_1x128 (src);

    while (height--)
    {
        dst = dst_line;

        dst_line += dst_stride;
        w = width;

        while (w && (unsigned long)dst & 15)
        {
            __m64 vd;

            vd = unpack_32_1x64 (*dst);

            *dst = pack_1x64_32 (over_1x64 (vd, expand_alpha_1x64 (vd),
                                            _mm_movepi64_pi64 (xmm_src)));
            w--;
            dst++;
        }

        while (w >= 4)
        {
            __m128i tmp_lo, tmp_hi;

            xmm_dst = load_128_aligned ((__m128i*)dst);

            unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);
            expand_alpha_2x128 (xmm_dst_lo, xmm_dst_hi,
                                &xmm_dsta_lo, &xmm_dsta_hi);

            tmp_lo = xmm_src;
            tmp_hi = xmm_src;

            over_2x128 (&xmm_dst_lo, &xmm_dst_hi,
                        &xmm_dsta_lo, &xmm_dsta_hi,
                        &tmp_lo, &tmp_hi);

            save_128_aligned (
                (__m128i*)dst, pack_2x128_128 (tmp_lo, tmp_hi));

            w -= 4;
            dst += 4;
        }

        while (w)
        {
            __m64 vd;

            vd = unpack_32_1x64 (*dst);

            *dst = pack_1x64_32 (over_1x64 (vd, expand_alpha_1x64 (vd),
                                            _mm_movepi64_pi64 (xmm_src)));
            w--;
            dst++;
        }
    }

    _mm_empty ();
}

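/*
 * a8r8g8b8 source, a8r8g8b8 mask, 8888 destination (OVER): only the alpha
 * channel of the mask contributes; the source is multiplied by that alpha
 * before being composited over the destination.
 */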
static void
sse2_composite_over_8888_8888_8888 (pixman_implementation_t *imp,
                                    pixman_op_t              op,
                                    pixman_image_t *         src_image,
                                    pixman_image_t *         mask_image,
                                    pixman_image_t *         dst_image,
                                    int32_t                  src_x,
                                    int32_t                  src_y,
                                    int32_t                  mask_x,
                                    int32_t                  mask_y,
                                    int32_t                  dest_x,
                                    int32_t                  dest_y,
                                    int32_t                  width,
                                    int32_t                  height)
{
    uint32_t *src, *src_line, s;
    uint32_t *dst, *dst_line, d;
    uint32_t *mask, *mask_line;
    uint32_t m;
    int src_stride, mask_stride, dst_stride;
    int32_t w;

    __m128i xmm_src, xmm_src_lo, xmm_src_hi, xmm_srca_lo, xmm_srca_hi;
    __m128i xmm_dst, xmm_dst_lo, xmm_dst_hi;
    __m128i xmm_mask, xmm_mask_lo, xmm_mask_hi;

    PIXMAN_IMAGE_GET_LINE (
        dst_image, dest_x, dest_y, uint32_t, dst_stride, dst_line, 1);
    PIXMAN_IMAGE_GET_LINE (
        mask_image, mask_x, mask_y, uint32_t, mask_stride, mask_line, 1);
    PIXMAN_IMAGE_GET_LINE (
        src_image, src_x, src_y, uint32_t, src_stride, src_line, 1);

    while (height--)
    {
        src = src_line;
        src_line += src_stride;
        dst = dst_line;
        dst_line += dst_stride;
        mask = mask_line;
        mask_line += mask_stride;

        w = width;

        while (w && (unsigned long)dst & 15)
        {
            uint32_t sa;

            s = *src++;
            m = (*mask++) >> 24;
            d = *dst;

            sa = s >> 24;

            if (m)
            {
                if (sa == 0xff && m == 0xff)
                {
                    *dst = s;
                }
                else
                {
                    __m64 ms, md, ma, msa;

                    ma = expand_alpha_rev_1x64 (load_32_1x64 (m));
                    ms = unpack_32_1x64 (s);
                    md = unpack_32_1x64 (d);

                    msa = expand_alpha_rev_1x64 (load_32_1x64 (sa));

                    *dst = pack_1x64_32 (in_over_1x64 (&ms, &msa, &ma, &md));
                }
            }

            dst++;
            w--;
        }

        while (w >= 4)
        {
            xmm_mask = load_128_unaligned ((__m128i*)mask);

            if (!is_transparent (xmm_mask))
            {
                xmm_src = load_128_unaligned ((__m128i*)src);

                if (is_opaque (xmm_mask) && is_opaque (xmm_src))
                {
                    save_128_aligned ((__m128i *)dst, xmm_src);
                }
                else
                {
                    xmm_dst = load_128_aligned ((__m128i *)dst);

                    unpack_128_2x128 (xmm_src, &xmm_src_lo, &xmm_src_hi);
                    unpack_128_2x128 (xmm_mask, &xmm_mask_lo, &xmm_mask_hi);
                    unpack_128_2x128 (xmm_dst, &xmm_dst_lo, &xmm_dst_hi);

                    expand_alpha_2x128 (
                        xmm_src_lo, xmm_src_hi, &xmm_srca_lo, &xmm_srca_hi);
                    expand_alpha_2x128 (
                        xmm_mask_lo, xmm_mask_hi, &xmm_mask_lo, &xmm_mask_hi);

                    in_over_2x128 (&xmm_src_lo, &xmm_src_hi,
                                   &xmm_srca_lo, &xmm_srca_hi,
                                   &xmm_mask_lo, &xmm_mask_hi,
                                   &xmm_dst_lo, &xmm_dst_hi);

                    save_128_aligned (
                        (__m128i*)dst, pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
                }
            }

            src += 4;
            dst += 4;
            mask += 4;
            w -= 4;
        }

        while (w)
        {
            uint32_t sa;

            s = *src++;
            m = (*mask++) >> 24;
            d = *dst;

            sa = s >> 24;

            if (m)
            {
                if (sa == 0xff && m == 0xff)
                {
                    *dst = s;
                }
                else
                {
                    __m64 ms, md, ma, msa;

                    ma = expand_alpha_rev_1x64 (load_32_1x64 (m));
                    ms = unpack_32_1x64 (s);
                    md = unpack_32_1x64 (d);

                    msa = expand_alpha_rev_1x64 (load_32_1x64 (sa));

                    *dst = pack_1x64_32 (in_over_1x64 (&ms, &msa, &ma, &md));
                }
            }

            dst++;
            w--;
        }
    }

    _mm_empty ();
}

/* A variant of 'core_combine_over_u_sse2' with minor tweaks */
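/*
 * One scanline of a nearest-neighbour scaled OVER: 'vx' is a 16.16 fixed
 * point source coordinate advanced by 'unit_x' per destination pixel, and
 * the source pixel index is simply vx >> 16.
 */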
static force_inline void
scaled_nearest_scanline_sse2_8888_8888_OVER (uint32_t*       pd,
                                             const uint32_t* ps,
                                             int32_t         w,
                                             pixman_fixed_t  vx,
                                             pixman_fixed_t  unit_x,
                                             pixman_fixed_t  max_vx)
{
    uint32_t s, d;
    const uint32_t* pm = NULL;

    __m128i xmm_dst_lo, xmm_dst_hi;
    __m128i xmm_src_lo, xmm_src_hi;
    __m128i xmm_alpha_lo, xmm_alpha_hi;

    /* Align dst on a 16-byte boundary */
    while (w && ((unsigned long)pd & 15))
    {
        d = *pd;
        s = combine1 (ps + (vx >> 16), pm);
        vx += unit_x;

        *pd++ = core_combine_over_u_pixel_sse2 (s, d);
        if (pm)
            pm++;
        w--;
    }

    while (w >= 4)
    {
        __m128i tmp;
        uint32_t tmp1, tmp2, tmp3, tmp4;

        tmp1 = ps[vx >> 16];
        vx += unit_x;
        tmp2 = ps[vx >> 16];
        vx += unit_x;
        tmp3 = ps[vx >> 16];
        vx += unit_x;
        tmp4 = ps[vx >> 16];
        vx += unit_x;

        tmp = _mm_set_epi32 (tmp4, tmp3, tmp2, tmp1);

        xmm_src_hi = combine4 ((__m128i*)&tmp, (__m128i*)pm);

        if (is_opaque (xmm_src_hi))
        {
            save_128_aligned ((__m128i*)pd, xmm_src_hi);
        }
        else if (!is_zero (xmm_src_hi))
        {
            xmm_dst_hi = load_128_aligned ((__m128i*) pd);

            unpack_128_2x128 (xmm_src_hi, &xmm_src_lo, &xmm_src_hi);
            unpack_128_2x128 (xmm_dst_hi, &xmm_dst_lo, &xmm_dst_hi);

            expand_alpha_2x128 (
                xmm_src_lo, xmm_src_hi, &xmm_alpha_lo, &xmm_alpha_hi);

            over_2x128 (&xmm_src_lo, &xmm_src_hi,
                        &xmm_alpha_lo, &xmm_alpha_hi,
                        &xmm_dst_lo, &xmm_dst_hi);

            /* rebuild the 4 pixel data and save */
            save_128_aligned ((__m128i*)pd,
                              pack_2x128_128 (xmm_dst_lo, xmm_dst_hi));
        }

        w -= 4;
        pd += 4;
        if (pm)
            pm += 4;
    }

    while (w)
    {
        d = *pd;
        s = combine1 (ps + (vx >> 16), pm);
        vx += unit_x;

        *pd++ = core_combine_over_u_pixel_sse2 (s, d);
        if (pm)
            pm++;

        w--;
    }
    _mm_empty ();
}

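/*
 * Instantiate the nearest-neighbour main loops for the COVER, NONE and PAD
 * repeat modes around the scanline function above.
 */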
FAST_NEAREST_MAINLOOP (sse2_8888_8888_cover_OVER,
                       scaled_nearest_scanline_sse2_8888_8888_OVER,
                       uint32_t, uint32_t, COVER);
FAST_NEAREST_MAINLOOP (sse2_8888_8888_none_OVER,
                       scaled_nearest_scanline_sse2_8888_8888_OVER,
                       uint32_t, uint32_t, NONE);
FAST_NEAREST_MAINLOOP (sse2_8888_8888_pad_OVER,
                       scaled_nearest_scanline_sse2_8888_8888_OVER,
                       uint32_t, uint32_t, PAD);

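/*
 * Lookup table mapping (operator, source format, mask format, destination
 * format) combinations to the SSE2 fast paths above; combinations that are
 * not listed here are handled elsewhere.
 */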
static const pixman_fast_path_t sse2_fast_paths[] =
{
    /* PIXMAN_OP_OVER */
    PIXMAN_STD_FAST_PATH (OVER, solid, a8, r5g6b5, sse2_composite_over_n_8_0565),
    PIXMAN_STD_FAST_PATH (OVER, solid, a8, b5g6r5, sse2_composite_over_n_8_0565),
    PIXMAN_STD_FAST_PATH (OVER, solid, null, a8r8g8b8, sse2_composite_over_n_8888),
    PIXMAN_STD_FAST_PATH (OVER, solid, null, x8r8g8b8, sse2_composite_over_n_8888),
    PIXMAN_STD_FAST_PATH (OVER, solid, null, r5g6b5, sse2_composite_over_n_0565),
    PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, a8r8g8b8, sse2_composite_over_8888_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, x8r8g8b8, sse2_composite_over_8888_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, a8b8g8r8, sse2_composite_over_8888_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, x8b8g8r8, sse2_composite_over_8888_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, null, r5g6b5, sse2_composite_over_8888_0565),
    PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, null, b5g6r5, sse2_composite_over_8888_0565),
    PIXMAN_STD_FAST_PATH (OVER, solid, a8, a8r8g8b8, sse2_composite_over_n_8_8888),
    PIXMAN_STD_FAST_PATH (OVER, solid, a8, x8r8g8b8, sse2_composite_over_n_8_8888),
    PIXMAN_STD_FAST_PATH (OVER, solid, a8, a8b8g8r8, sse2_composite_over_n_8_8888),
    PIXMAN_STD_FAST_PATH (OVER, solid, a8, x8b8g8r8, sse2_composite_over_n_8_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, a8r8g8b8, a8r8g8b8, sse2_composite_over_8888_8888_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, a8, x8r8g8b8, sse2_composite_over_8888_8_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, a8, a8r8g8b8, sse2_composite_over_8888_8_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, a8, x8b8g8r8, sse2_composite_over_8888_8_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, a8, a8b8g8r8, sse2_composite_over_8888_8_8888),
    PIXMAN_STD_FAST_PATH (OVER, x8r8g8b8, a8, x8r8g8b8, sse2_composite_over_x888_8_8888),
    PIXMAN_STD_FAST_PATH (OVER, x8r8g8b8, a8, a8r8g8b8, sse2_composite_over_x888_8_8888),
    PIXMAN_STD_FAST_PATH (OVER, x8b8g8r8, a8, x8b8g8r8, sse2_composite_over_x888_8_8888),
    PIXMAN_STD_FAST_PATH (OVER, x8b8g8r8, a8, a8b8g8r8, sse2_composite_over_x888_8_8888),
    PIXMAN_STD_FAST_PATH (OVER, x8r8g8b8, solid, a8r8g8b8, sse2_composite_over_x888_n_8888),
    PIXMAN_STD_FAST_PATH (OVER, x8r8g8b8, solid, x8r8g8b8, sse2_composite_over_x888_n_8888),
    PIXMAN_STD_FAST_PATH (OVER, x8b8g8r8, solid, a8b8g8r8, sse2_composite_over_x888_n_8888),
    PIXMAN_STD_FAST_PATH (OVER, x8b8g8r8, solid, x8b8g8r8, sse2_composite_over_x888_n_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, solid, a8r8g8b8, sse2_composite_over_8888_n_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8r8g8b8, solid, x8r8g8b8, sse2_composite_over_8888_n_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, solid, a8b8g8r8, sse2_composite_over_8888_n_8888),
    PIXMAN_STD_FAST_PATH (OVER, a8b8g8r8, solid, x8b8g8r8, sse2_composite_over_8888_n_8888),
    PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, a8r8g8b8, sse2_composite_over_n_8888_8888_ca),
    PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, x8r8g8b8, sse2_composite_over_n_8888_8888_ca),
    PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8b8g8r8, a8b8g8r8, sse2_composite_over_n_8888_8888_ca),
    PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8b8g8r8, x8b8g8r8, sse2_composite_over_n_8888_8888_ca),
    PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8r8g8b8, r5g6b5, sse2_composite_over_n_8888_0565_ca),
    PIXMAN_STD_FAST_PATH_CA (OVER, solid, a8b8g8r8, b5g6r5, sse2_composite_over_n_8888_0565_ca),
    PIXMAN_STD_FAST_PATH (OVER, pixbuf, pixbuf, a8r8g8b8, sse2_composite_over_pixbuf_8888),
    PIXMAN_STD_FAST_PATH (OVER, pixbuf, pixbuf, x8r8g8b8, sse2_composite_over_pixbuf_8888),
    PIXMAN_STD_FAST_PATH (OVER, rpixbuf, rpixbuf, a8b8g8r8, sse2_composite_over_pixbuf_8888),
    PIXMAN_STD_FAST_PATH (OVER, rpixbuf, rpixbuf, x8b8g8r8, sse2_composite_over_pixbuf_8888),
    PIXMAN_STD_FAST_PATH (OVER, pixbuf, pixbuf, r5g6b5, sse2_composite_over_pixbuf_0565),
    PIXMAN_STD_FAST_PATH (OVER, rpixbuf, rpixbuf, b5g6r5, sse2_composite_over_pixbuf_0565),
    PIXMAN_STD_FAST_PATH (OVER, x8r8g8b8, null, x8r8g8b8, sse2_composite_copy_area),
    PIXMAN_STD_FAST_PATH (OVER, x8b8g8r8, null, x8b8g8r8, sse2_composite_copy_area),

    /* PIXMAN_OP_OVER_REVERSE */
    PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8r8g8b8, sse2_composite_over_reverse_n_8888),
    PIXMAN_STD_FAST_PATH (OVER_REVERSE, solid, null, a8b8g8r8, sse2_composite_over_reverse_n_8888),

    /* PIXMAN_OP_ADD */
    PIXMAN_STD_FAST_PATH_CA (ADD, solid, a8r8g8b8, a8r8g8b8, sse2_composite_add_n_8888_8888_ca),
    PIXMAN_STD_FAST_PATH (ADD, a8, null, a8, sse2_composite_add_8_8),
    PIXMAN_STD_FAST_PATH (ADD, a8r8g8b8, null, a8r8g8b8, sse2_composite_add_8888_8888),
    PIXMAN_STD_FAST_PATH (ADD, a8b8g8r8, null, a8b8g8r8, sse2_composite_add_8888_8888),
    PIXMAN_STD_FAST_PATH (ADD, solid, a8, a8, sse2_composite_add_n_8_8),
    PIXMAN_STD_FAST_PATH (ADD, solid, null, a8, sse2_composite_add_n_8),

    /* PIXMAN_OP_SRC */
    PIXMAN_STD_FAST_PATH (SRC, solid, a8, a8r8g8b8, sse2_composite_src_n_8_8888),
    PIXMAN_STD_FAST_PATH (SRC, solid, a8, x8r8g8b8, sse2_composite_src_n_8_8888),
    PIXMAN_STD_FAST_PATH (SRC, solid, a8, a8b8g8r8, sse2_composite_src_n_8_8888),
    PIXMAN_STD_FAST_PATH (SRC, solid, a8, x8b8g8r8, sse2_composite_src_n_8_8888),
    PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, a8r8g8b8, sse2_composite_src_x888_8888),
    PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, a8b8g8r8, sse2_composite_src_x888_8888),
    PIXMAN_STD_FAST_PATH (SRC, a8r8g8b8, null, a8r8g8b8, sse2_composite_copy_area),
    PIXMAN_STD_FAST_PATH (SRC, a8b8g8r8, null, a8b8g8r8, sse2_composite_copy_area),
    PIXMAN_STD_FAST_PATH (SRC, a8r8g8b8, null, x8r8g8b8, sse2_composite_copy_area),
    PIXMAN_STD_FAST_PATH (SRC, a8b8g8r8, null, x8b8g8r8, sse2_composite_copy_area),
    PIXMAN_STD_FAST_PATH (SRC, x8r8g8b8, null, x8r8g8b8, sse2_composite_copy_area),
    PIXMAN_STD_FAST_PATH (SRC, x8b8g8r8, null, x8b8g8r8, sse2_composite_copy_area),
    PIXMAN_STD_FAST_PATH (SRC, r5g6b5, null, r5g6b5, sse2_composite_copy_area),
    PIXMAN_STD_FAST_PATH (SRC, b5g6r5, null, b5g6r5, sse2_composite_copy_area),

    /* PIXMAN_OP_IN */
    PIXMAN_STD_FAST_PATH (IN, a8, null, a8, sse2_composite_in_8_8),
    PIXMAN_STD_FAST_PATH (IN, solid, a8, a8, sse2_composite_in_n_8_8),
    PIXMAN_STD_FAST_PATH (IN, solid, null, a8, sse2_composite_in_n_8),

    SIMPLE_NEAREST_FAST_PATH_COVER (OVER, a8r8g8b8, x8r8g8b8, sse2_8888_8888),
    SIMPLE_NEAREST_FAST_PATH_COVER (OVER, a8b8g8r8, x8b8g8r8, sse2_8888_8888),
    SIMPLE_NEAREST_FAST_PATH_COVER (OVER, a8r8g8b8, a8r8g8b8, sse2_8888_8888),
    SIMPLE_NEAREST_FAST_PATH_COVER (OVER, a8b8g8r8, a8b8g8r8, sse2_8888_8888),
    SIMPLE_NEAREST_FAST_PATH_NONE (OVER, a8r8g8b8, x8r8g8b8, sse2_8888_8888),
    SIMPLE_NEAREST_FAST_PATH_NONE (OVER, a8b8g8r8, x8b8g8r8, sse2_8888_8888),
    SIMPLE_NEAREST_FAST_PATH_NONE (OVER, a8r8g8b8, a8r8g8b8, sse2_8888_8888),
    SIMPLE_NEAREST_FAST_PATH_NONE (OVER, a8b8g8r8, a8b8g8r8, sse2_8888_8888),
    SIMPLE_NEAREST_FAST_PATH_PAD (OVER, a8r8g8b8, x8r8g8b8, sse2_8888_8888),
    SIMPLE_NEAREST_FAST_PATH_PAD (OVER, a8b8g8r8, x8b8g8r8, sse2_8888_8888),
    SIMPLE_NEAREST_FAST_PATH_PAD (OVER, a8r8g8b8, a8r8g8b8, sse2_8888_8888),
    SIMPLE_NEAREST_FAST_PATH_PAD (OVER, a8b8g8r8, a8b8g8r8, sse2_8888_8888),

    { PIXMAN_OP_NONE },
};

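/*
 * blt entry point: try the SSE2 blitter and delegate to the next
 * implementation if it cannot handle the request.
 */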
static pixman_bool_t
sse2_blt (pixman_implementation_t *imp,
          uint32_t *               src_bits,
          uint32_t *               dst_bits,
          int                      src_stride,
          int                      dst_stride,
          int                      src_bpp,
          int                      dst_bpp,
          int                      src_x,
          int                      src_y,
          int                      dst_x,
          int                      dst_y,
          int                      width,
          int                      height)
{
    if (!pixman_blt_sse2 (
            src_bits, dst_bits, src_stride, dst_stride, src_bpp, dst_bpp,
            src_x, src_y, dst_x, dst_y, width, height))
    {
        return _pixman_implementation_blt (
            imp->delegate,
            src_bits, dst_bits, src_stride, dst_stride, src_bpp, dst_bpp,
            src_x, src_y, dst_x, dst_y, width, height);
    }

    return TRUE;
}

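/*
 * fill entry point: try pixman_fill_sse2 and delegate to the next
 * implementation if it cannot handle the request.
 */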
#if defined(__GNUC__) && !defined(__x86_64__) && !defined(__amd64__)
__attribute__((__force_align_arg_pointer__))
#endif
static pixman_bool_t
sse2_fill (pixman_implementation_t *imp,
           uint32_t *               bits,
           int                      stride,
           int                      bpp,
           int                      x,
           int                      y,
           int                      width,
           int                      height,
           uint32_t                 xor)
{
    if (!pixman_fill_sse2 (bits, stride, bpp, x, y, width, height, xor))
    {
        return _pixman_implementation_fill (
            imp->delegate, bits, stride, bpp, x, y, width, height, xor);
    }

    return TRUE;
}

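/*
 * Create the SSE2 implementation: set up the 128-bit and 64-bit constant
 * masks and install the SSE2 combiners, blt and fill entry points on top of
 * the MMX or fast-path fallback implementation.
 */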
#if defined(__GNUC__) && !defined(__x86_64__) && !defined(__amd64__)
__attribute__((__force_align_arg_pointer__))
#endif
pixman_implementation_t *
_pixman_implementation_create_sse2 (void)
{
#ifdef USE_MMX
    pixman_implementation_t *fallback = _pixman_implementation_create_mmx ();
#else
    pixman_implementation_t *fallback = _pixman_implementation_create_fast_path ();
#endif
    pixman_implementation_t *imp = _pixman_implementation_create (fallback, sse2_fast_paths);

    /* SSE2 constants */
    mask_565_r = create_mask_2x32_128 (0x00f80000, 0x00f80000);
    mask_565_g1 = create_mask_2x32_128 (0x00070000, 0x00070000);
    mask_565_g2 = create_mask_2x32_128 (0x000000e0, 0x000000e0);
    mask_565_b = create_mask_2x32_128 (0x0000001f, 0x0000001f);
    mask_red = create_mask_2x32_128 (0x00f80000, 0x00f80000);
    mask_green = create_mask_2x32_128 (0x0000fc00, 0x0000fc00);
    mask_blue = create_mask_2x32_128 (0x000000f8, 0x000000f8);
    mask_565_fix_rb = create_mask_2x32_128 (0x00e000e0, 0x00e000e0);
    mask_565_fix_g = create_mask_2x32_128 (0x0000c000, 0x0000c000);
    mask_0080 = create_mask_16_128 (0x0080);
    mask_00ff = create_mask_16_128 (0x00ff);
    mask_0101 = create_mask_16_128 (0x0101);
    mask_ffff = create_mask_16_128 (0xffff);
    mask_ff000000 = create_mask_2x32_128 (0xff000000, 0xff000000);
    mask_alpha = create_mask_2x32_128 (0x00ff0000, 0x00000000);

    /* MMX constants */
    mask_x565_rgb = create_mask_2x32_64 (0x000001f0, 0x003f001f);
    mask_x565_unpack = create_mask_2x32_64 (0x00000084, 0x04100840);

    mask_x0080 = create_mask_16_64 (0x0080);
    mask_x00ff = create_mask_16_64 (0x00ff);
    mask_x0101 = create_mask_16_64 (0x0101);
    mask_x_alpha = create_mask_2x32_64 (0x00ff0000, 0x00000000);

    _mm_empty ();

    /* Set up function pointers */

    /* SSE code patch for fbcompose.c */
    imp->combine_32[PIXMAN_OP_OVER] = sse2_combine_over_u;
    imp->combine_32[PIXMAN_OP_OVER_REVERSE] = sse2_combine_over_reverse_u;
    imp->combine_32[PIXMAN_OP_IN] = sse2_combine_in_u;
    imp->combine_32[PIXMAN_OP_IN_REVERSE] = sse2_combine_in_reverse_u;
    imp->combine_32[PIXMAN_OP_OUT] = sse2_combine_out_u;
    imp->combine_32[PIXMAN_OP_OUT_REVERSE] = sse2_combine_out_reverse_u;
    imp->combine_32[PIXMAN_OP_ATOP] = sse2_combine_atop_u;
    imp->combine_32[PIXMAN_OP_ATOP_REVERSE] = sse2_combine_atop_reverse_u;
    imp->combine_32[PIXMAN_OP_XOR] = sse2_combine_xor_u;
    imp->combine_32[PIXMAN_OP_ADD] = sse2_combine_add_u;

    imp->combine_32[PIXMAN_OP_SATURATE] = sse2_combine_saturate_u;

    imp->combine_32_ca[PIXMAN_OP_SRC] = sse2_combine_src_ca;
    imp->combine_32_ca[PIXMAN_OP_OVER] = sse2_combine_over_ca;
    imp->combine_32_ca[PIXMAN_OP_OVER_REVERSE] = sse2_combine_over_reverse_ca;
    imp->combine_32_ca[PIXMAN_OP_IN] = sse2_combine_in_ca;
    imp->combine_32_ca[PIXMAN_OP_IN_REVERSE] = sse2_combine_in_reverse_ca;
    imp->combine_32_ca[PIXMAN_OP_OUT] = sse2_combine_out_ca;
    imp->combine_32_ca[PIXMAN_OP_OUT_REVERSE] = sse2_combine_out_reverse_ca;
    imp->combine_32_ca[PIXMAN_OP_ATOP] = sse2_combine_atop_ca;
    imp->combine_32_ca[PIXMAN_OP_ATOP_REVERSE] = sse2_combine_atop_reverse_ca;
    imp->combine_32_ca[PIXMAN_OP_XOR] = sse2_combine_xor_ca;
    imp->combine_32_ca[PIXMAN_OP_ADD] = sse2_combine_add_ca;

    imp->blt = sse2_blt;
    imp->fill = sse2_fill;

    return imp;
}

#endif /* USE_SSE2 */