Vector Technology as Implemented for Use with a RISC and SIMD Technology Signal Processor
A vector processor uses long registers that are addressable by segment, where each segment is n bits wide. The power of a vector processor is that many complex matrix operations, whose algorithms take many scalar CPU instructions and clock cycles to emulate on a regular personal computer processor, can often compute and transfer the correct result in less than a single clock cycle. The impossibility of replicating this precise behavior has allowed vendors to protect their systems against hardware emulation ever since display devices began rendering three-dimensional graphics. The Nintendo 64 was the first video game system to use this to its advantage.
Project Reality's Signal Processor
Part of the engineering make-up of the Nintendo 64 (original codename: Project Reality) is a modified MIPS R4000-family co-processor called the "Reality Coprocessor" (RCP). More importantly, the signal processor in this component is responsible for all vector memory operations and transactions, almost all of which are impossible to emulate with full accuracy on a scalar personal computer processor. The vector technology implemented in this design was adopted from Silicon Graphics, Inc.
RSP Vector Operation Matrices
Here, the entire MIPS R4000 instruction set was modified for very fast, exception-free processing flow, and operation definitions for each instruction do not fall within the scope of this section. Presented instead are layouts of the new instructions added to the scalar unit (those under `LWC2` and `SWC2`, even though they do interface with the vector unit) and the vector unit (essentially, any instruction under `COP2` whose mnemonic starts with a 'V'). How the pre-existing MIPS R4000 instructions were modified, and which ones were removed, is left for the MIPS programmer to research.
`C2 vd, vs, vt[element] /* exceptions: scalar divide reads */`
COP2 | element | vs1 | vs2 | vt | func |
---|---|---|---|---|---|
010010 | 1eeee | ttttt | sssss | ddddd | ?????? |
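The layout above can be unpacked with a few shifts and masks. The following is only a minimal sketch: it assumes the fields are listed from the most significant bit down, as is usual for MIPS encodings, and the names `c2_fields` and `decode_c2` are hypothetical rather than taken from this plugin's source.

```c
#include <stdint.h>

/* Hypothetical container for the fields in the layout table. */
struct c2_fields {
    unsigned element; /* eeee: low four bits of the element field */
    unsigned vt;      /* ttttt */
    unsigned vs;      /* sssss */
    unsigned vd;      /* ddddd */
    unsigned func;    /* six-bit function code */
};

/*
 * Returns 1 and fills *out when word carries major opcode 010010 with
 * the leading 1 of the element field set; returns 0 otherwise.
 * Assumption: fields are packed most-significant first.
 */
static int decode_c2(uint32_t word, struct c2_fields *out)
{
    if ((word >> 26) != 0x12u)        /* 010010 */
        return 0;
    if (((word >> 25) & 1u) == 0)     /* the fixed 1 in 1eeee */
        return 0;
    out->element = (word >> 21) & 0xFu;
    out->vt      = (word >> 16) & 0x1Fu;
    out->vs      = (word >> 11) & 0x1Fu;
    out->vd      = (word >>  6) & 0x1Fu;
    out->func    =  word        & 0x3Fu;
    return 1;
}
```

A caller would typically test the return value first and only then dispatch on `func`.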
The major types of VU computational instructions are multiply, add, select, logical, and divide.
Multiply instructions are the most frequent and classifiable as follows:
- If `a == 0`, then round the product loaded to the accumulator (`VMUL*` and `VMUD*`).
- If `a == 1`, then the product is added to an accumulator element (`VMAC*` and `VMAD*`).
- If `(format & 0b100) == 0`, then the operation is single-precision (`VMUL*` and `VMAC*`).
- If `(format & 0b100) != 0`, then the operation is double-precision (`VMUD*` and `VMAD*`).
op-code | Type |
---|---|
00axxx | multiply |
01xxxx | add |
100xxx | select |
101xxx | logical |
110xxx | divide |
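As a rough illustration of the classification rules and the op-code table, the six-bit function code can be sorted by its upper bits. This sketch assumes that `a` is bit 3 of the function code and that `format` refers to its low three bits; the enum and function names are hypothetical. Codes of the form 111xxx are not in the table and are treated here as reserved.

```c
/* Hypothetical names for the major instruction types. */
enum vu_type {
    VU_MULTIPLY, VU_ADD, VU_SELECT, VU_LOGICAL, VU_DIVIDE, VU_RESERVED
};

/* Classify a six-bit function code by its upper bits. */
static enum vu_type vu_classify(unsigned func)
{
    func &= 077u;
    if ((func >> 4) == 0)
        return VU_MULTIPLY;           /* 00axxx */
    if ((func >> 4) == 1)
        return VU_ADD;                /* 01xxxx */
    switch (func >> 3) {
    case 4:  return VU_SELECT;        /* 100xxx */
    case 5:  return VU_LOGICAL;       /* 101xxx */
    case 6:  return VU_DIVIDE;        /* 110xxx */
    default: return VU_RESERVED;      /* 111xxx */
    }
}

/* For multiplies only: a == 1 means the product is accumulated. */
static int vu_mul_accumulates(unsigned func)
{
    return (int)((func >> 3) & 1u);
}

/* For multiplies only: (format & 0b100) != 0 means double-precision. */
static int vu_mul_is_double(unsigned func)
{
    return (func & 4u) != 0;
}
```

The two multiply helpers are only meaningful when `vu_classify` has already returned `VU_MULTIPLY`.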
- 00 (VMULF): Vector Multiply Signed Fractions
- 01 (VMULU): Vector Multiply Unsigned Fractions
- 02 reserved: `VRNDP` was intended for MPEG DCT rounding but omitted.
- 03 reserved: `VMULQ` was intended for MPEG inverse quantization but omitted.
- 04 (VMUDL): Vector Multiply Low Partial Products
- 05 (VMUDM): Vector Multiply Mid Partial Products
- 06 (VMUDN): Vector Multiply Mid Partial Products
- 07 (VMUDH): Vector Multiply High Partial Products
- 10 (VMACF): Vector Multiply-Accumulate Signed Fractions
- 11 (VMACU): Vector Multiply-Accumulate Unsigned Fractions
- 12 reserved: `VRNDN` was intended for MPEG DCT rounding but omitted.
- 13 (VMACQ): Vector Accumulator Oddification
- 14 (VMADL): Vector Multiply-Accumulate Low Partial Products
- 15 (VMADM): Vector Multiply-Accumulate Mid Partial Products
- 16 (VMADN): Vector Multiply-Accumulate Mid Partial Products
- 17 (VMADH): Vector Multiply-Accumulate High Partial Products
- 20 (VADD): Vector Add Short Elements
- 21 (VSUB): Vector Subtract Short Elements
- 22 reserved
- 23 (VABS): Vector Absolute Value of Short Elements
- 24 (VADDC): Vector Add Short Elements with Carry
- 25 (VSUBC): Vector Subtract Short Elements with Carry
- 26 reserved
- 27 reserved
- 30 reserved
- 31 reserved
- 32 reserved
- 33 reserved
- 34 reserved
- 35 (VSAR): Vector Accumulator Read
- 36 reserved
- 37 reserved
- 40 (VLT): Vector Select Less Than
- 41 (VEQ): Vector Select Equal
- 42 (VNE): Vector Select Not Equal
- 43 (VGE): Vector Select Greater Than or Equal
- 44 (VCL): Vector Select Clip Test Low
- 45 (VCH): Vector Select Clip Test High
- 46 (VCR): Vector Select Clip Test Low (single-precision)
- 47 (VMRG): Vector Select Merge
- 50 (VAND): Vector AND Short Elements
- 51 (VNAND): Vector NAND Short Elements
- 52 (VOR): Vector OR Short Elements
- 53 (VNOR): Vector NOR Short Elements
- 54 (VXOR): Vector XOR Short Elements
- 55 (VNXOR): Vector NXOR Short Elements
- 56 reserved
- 57 reserved
- 60 (VRCP): Vector Element Scalar Reciprocal (single-precision)
- 61 (VRCPL): Vector Element Scalar Reciprocal Low
- 62 (VRCPH): Vector Element Scalar Reciprocal High
- 63 (VMOV): Vector Element Scalar Move
- 64 (VRSQ): Vector Element Scalar SQRT Reciprocal (single-precision)
- 65 (VRSQL): Vector Element Scalar SQRT Reciprocal Low
- 66 (VRSQH): Vector Element Scalar SQRT Reciprocal High
- 67 (VNOP): Vector Null Instruction
- 70 reserved
- 71 reserved
- 72 reserved
- 73 reserved
- 74 reserved
- 75 reserved
- 76 reserved
- 77 reserved
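For a disassembler or a debug trace, the matrix above maps naturally onto a 64-entry lookup table indexed by the function code. The sketch below is one possible arrangement, with one row per octal "decade"; the identifiers `vu_mnemonic` and `vu_name` are hypothetical, and reserved slots are stored as `NULL`.

```c
#include <stddef.h>

/* One row per leading octal digit; NULL marks a reserved slot. */
static const char *const vu_mnemonic[64] = {
    /* 00-07 */ "VMULF", "VMULU", NULL,    NULL,    "VMUDL", "VMUDM", "VMUDN", "VMUDH",
    /* 10-17 */ "VMACF", "VMACU", NULL,    "VMACQ", "VMADL", "VMADM", "VMADN", "VMADH",
    /* 20-27 */ "VADD",  "VSUB",  NULL,    "VABS",  "VADDC", "VSUBC", NULL,    NULL,
    /* 30-37 */ NULL,    NULL,    NULL,    NULL,    NULL,    "VSAR",  NULL,    NULL,
    /* 40-47 */ "VLT",   "VEQ",   "VNE",   "VGE",   "VCL",   "VCH",   "VCR",   "VMRG",
    /* 50-57 */ "VAND",  "VNAND", "VOR",   "VNOR",  "VXOR",  "VNXOR", NULL,    NULL,
    /* 60-67 */ "VRCP",  "VRCPL", "VRCPH", "VMOV",  "VRSQ",  "VRSQL", "VRSQH", "VNOP",
    /* 70-77 */ NULL,    NULL,    NULL,    NULL,    NULL,    NULL,    NULL,    NULL
};

/* Return the mnemonic for a function code, or "reserved". */
static const char *vu_name(unsigned func)
{
    const char *name = vu_mnemonic[func & 077u];
    return name != NULL ? name : "reserved";
}
```

Writing the index in octal (for example `vu_name(035)`) keeps call sites aligned with the numbering used in the list.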
RSP Vector Load Transfers
The VR-DMEM transaction instruction cycles are still processed by the scalar unit, not the vector unit. In the modern implementations accepted by most vector unit communications systems today, the transfer instructions are classifiable under five groups:
- BV, SV, LV, DV
- PV, UV, XV, ZV
- HV, FV, AV
- QV, RV
- TV, WV
Not all of those instructions had been implemented by the time of the Nintendo 64's RCP, however. Additionally, their ordering in the opcode matrix was a little skewed compared to what is seen below. At this time, it is better to use only three categories of instructions:
- normal: Anything under Group I or Group IV is normal type. Only the element must be aligned; `addr & 1` may resolve true.
- packed: Anything under Group II or Group III. Useful for working with specially mapped data, such as pixels.
- transposed: `LTV`, LTWV, `STV`, and `SWV` can be found in heaps of 16 instructions, all dedicated to matrix transposition through eight diagonals of halfword elements.
`LWC2 vt[element], offset(base)`
LWC2 | base | vt | rd | element | offset |
---|---|---|---|---|---|
110010 | sssss | ttttt | ????? | eeee | Xxxxxxx |
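The load format can be unpacked in the same way as any other MIPS word. This is a tentative sketch with two assumptions that the table does not state outright: the fields are packed most-significant first, and the capital `X` in the offset field marks a sign bit, so the seven-bit offset is sign-extended. The names `lwc2_fields` and `decode_lwc2` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the fields in the layout table. */
struct lwc2_fields {
    unsigned base;    /* sssss: scalar base register */
    unsigned vt;      /* ttttt: vector register */
    unsigned opcode;  /* rd field, selects the transfer type */
    unsigned element; /* eeee */
    int      offset;  /* Xxxxxxx, sign-extended (assumption) */
};

/* Returns 1 and fills *out when word has major opcode 110010. */
static int decode_lwc2(uint32_t word, struct lwc2_fields *out)
{
    unsigned raw;

    if ((word >> 26) != 0x32u)        /* 110010 */
        return 0;
    out->base    = (word >> 21) & 0x1Fu;
    out->vt      = (word >> 16) & 0x1Fu;
    out->opcode  = (word >> 11) & 0x1Fu;
    out->element = (word >>  7) & 0xFu;
    raw          =  word        & 0x7Fu;
    /* Assumed sign extension of the seven-bit field. */
    out->offset  = (raw & 0x40u) ? (int)raw - 128 : (int)raw;
    return 1;
}
```

Any scaling of the offset by the size of the transfer is deliberately left out, since this section does not describe it.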
- 00 (LBV): Load Byte to Vector Unit
- 01 (LSV): Load Shortword to Vector Unit
- 02 (LLV): Load Longword to Vector Unit
- 03 (LDV): Load Doubleword to Vector Unit
- 04 (LQV): Load Quadword to Vector Unit
- 05 (LRV): Load Rest to Vector Unit
- 06 (LPV): Load Packed Signed to Vector Unit
- 07 (LUV): Load Packed Unsigned to Vector Unit
- 10 (LHV): Load Alternate Bytes to Vector Unit
- 11 (LFV): Load Alternate Fourths to Vector Unit
- 12 reserved: LTWV
- 13 (LTV): Load Transposed to Vector Unit
- 14 reserved
- 15 reserved
- 16 reserved
- 17 reserved
`SWC2 vt[element], offset(base)`
SWC2 | base | vt | rd | element | offset |
---|---|---|---|---|---|
111010 | sssss | ttttt | ????? | eeee | Xxxxxxx |
- 00 (SBV): Store Byte from Vector Unit
- 01 (SSV): Store Shortword from Vector Unit
- 02 (SLV): Store Longword from Vector Unit
- 03 (SDV): Store Doubleword from Vector Unit
- 04 (SQV): Store Quadword from Vector Unit
- 05 (SRV): Store Rest from Vector Unit
- 06 (SPV): Store Packed Signed from Vector Unit
- 07 (SUV): Store Packed Unsigned from Vector Unit
- 10 (SHV): Store Alternate Bytes from Vector Unit
- 11 (SFV): Store Alternate Fourths from Vector Unit
- 12 (SWV): Store Transposed Wrapped from Vector Unit
- 13 (STV): Store Transposed from Vector Unit
- 14 reserved
- 15 reserved
- 16 reserved
- 17 reserved
If, by any chance, the opcode specifier is greater than 17 [oct], it was probably meant to execute the extended counterparts to the above loads and stores, which were questionably obsolete and remain reserved.
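One way to honor that rule in a decoder is to treat every opcode specifier above 17 [oct] as reserved, alongside the reserved slots inside the two matrices. The following sketch shows the idea; the tables mirror the load and store lists, and the identifiers `lwc2_mnemonic`, `swc2_mnemonic`, and `transfer_name` are hypothetical.

```c
#include <stddef.h>

/* Indexed by opcode specifier 00-17 [oct]; NULL marks a reserved slot. */
static const char *const lwc2_mnemonic[16] = {
    "LBV", "LSV", "LLV", "LDV", "LQV", "LRV", "LPV", "LUV",
    "LHV", "LFV", NULL,  "LTV", NULL,  NULL,  NULL,  NULL
};

static const char *const swc2_mnemonic[16] = {
    "SBV", "SSV", "SLV", "SDV", "SQV", "SRV", "SPV", "SUV",
    "SHV", "SFV", "SWV", "STV", NULL,  NULL,  NULL,  NULL
};

/*
 * Name a vector transfer. is_store selects the SWC2 table instead of
 * the LWC2 table. Anything above 17 [oct] is reported as reserved.
 */
static const char *transfer_name(int is_store, unsigned opcode)
{
    const char *name = NULL;

    if (opcode <= 017u)
        name = is_store ? swc2_mnemonic[opcode] : lwc2_mnemonic[opcode];
    return name != NULL ? name : "reserved";
}
```

Note that opcode 12 [oct] resolves differently in the two directions: it is reserved as a load but names `SWV` as a store.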
Informational References for Vector Processor Architecture
- Instruction Methods for Performing Data Formatting While Moving Data Between Memory and a Vector Register File. United States patent no. 5,812,147. Timothy J. Van Hook, Silicon Graphics, Inc.
- Method and System for Efficient Matrix Multiplication in a SIMD Processor Architecture. United States patent no. 7,873,812. Tibet Mimar.
- Efficient Handling of Vector High-Level Language Constructs in a SIMD Processor. United States patent no. 7,793,084. Tibet Mimar.
- Flexible Vector Modes of Operation for SIMD Processor. Patent pending? Tibet Mimar.
- Programming a Vector Processor and Parallel Programming of an Asymmetric Dual Multiprocessor Comprised of a Vector Processor and a RISC Processor. United States patent no. 6,016,395. Moataz Ali Mohamed, Samsung Electronics Co., Ltd.
- Execution Unit for Processing a Data Stream Independently and in Parallel. United States patent no. 6,401,194. Le Trong Nguyen, Samsung Electronics Co., Ltd.