Draft of 2 cycles saved

Sauraen
2025-08-23 16:15:04 -07:00
parent c5c5edf960
commit dcd617a0a0
2 changed files with 143 additions and 28 deletions

@@ -45,7 +45,7 @@ even to an odd number of lights adds a different time than vice versa.
| 1st tri to occluded | Can't | Can't | 43 |
| Only/2nd tri to draw | 172 | 156 | 158 |
| 1st tri to draw | 173 | 157 | 159 |
| Extra per tri from snake | Can't | 9 | 9 |
| Tri snake | Can't | * | * |
| Vtx before DMA start | 16 | 17 | 17 |
| Vtx pair, no lighting | 54 | 54 | 70 |
| Vtx pair, 0 dir lts | Can't | 65 | 81 |
@@ -88,3 +88,113 @@ even to an odd number of lights adds a different time than vice versa.
| Light dir xfrm, 7 dir lts | 375 | 170 | 170 |
| Light dir xfrm, 8 dir lts | Can't | 171 | 171 |
| Light dir xfrm, 9 dir lts | Can't | 196 | 196 |
## Triangle Snake Cycle Counts
### Very Long Snakes
For this section, we assume almost all tris are contained in very long snakes,
so the overhead of starting and ending snakes is negligible. This overhead is
discussed in the next section.
We assume that the same set of tris is drawn with or without snakes. Thus, the
cycles from `tri_main_from_snake` up to, but not including, the instruction
after the return are not counted here, as they are the same regardless of which
method is used.
For a pair of tris drawn without snakes, i.e. with a single `SP2Triangles`
command, the cycles are:
- Command dispatch: 12
- First tri up to `tri_main_from_snake`: 5
- Second tri up to `tri_main_from_snake`: 4
- Total: 21
For a pair of tris which are part of a long snake, the cycles are:
- Each tri up to `tri_main_from_snake`: 11
- Total: 22
However, there are also memory bandwidth savings. The `SP2Triangles` command
is 8 bytes and the two tris in a long snake are 2 bytes, so switching to snakes
saves 6 bytes of bandwidth. Testing has shown that RSP DMAs on average transfer
about 2.2 bytes per cycle, though this depends on the DMA length. So this is a
savings of about 6 / 2.2 = 2.7 cycles of RDRAM / RDP time. Since the DMAs
loading this data are input buffer loads, and the RSP stalls waiting for input
buffer loads (it does not do useful work during this time), this is also 2.7
cycles of RSP time. This more than offsets the 1 extra cycle of processing the
tri pair above.
Therefore, switching to snakes (assuming very long snakes) saves about 2.7
cycles of RDRAM / RDP time and 2.7 - 1 = 1.7 cycles of RSP time per two tris, or
about 0.9 RSP cycles and 1.4 RDRAM cycles per tri.
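The arithmetic above can be reproduced with a short script. This is only a sketch of the bookkeeping: the constant and function names are made up for illustration, and the cycle counts are the ones quoted in this section.

```python
# Cycle counts quoted in the text above (names are illustrative only).
DISPATCH = 12              # command dispatch
FIRST_TRI = 5              # first tri up to tri_main_from_snake
SECOND_TRI = 4             # second tri up to tri_main_from_snake
SNAKE_TRI = 11             # each tri in a long snake, up to tri_main_from_snake
DMA_BYTES_PER_CYCLE = 2.2  # measured average; depends on DMA length


def long_snake_savings_per_pair():
    """Return (rsp_saved, rdram_saved) in cycles for one pair of tris."""
    pair_no_snake = DISPATCH + FIRST_TRI + SECOND_TRI  # SP2Triangles path
    pair_snake = 2 * SNAKE_TRI                         # long-snake path
    bytes_saved = 8 - 2                                # 8-byte command vs 2 bytes
    rdram_saved = bytes_saved / DMA_BYTES_PER_CYCLE
    # The RSP stalls on input buffer loads, so it gains the same DMA cycles,
    # minus the extra processing cycles of the snake path.
    rsp_saved = rdram_saved - (pair_snake - pair_no_snake)
    return rsp_saved, rdram_saved


rsp, rdram = long_snake_savings_per_pair()
print(f"per pair: RSP {rsp:.1f}, RDRAM {rdram:.1f}")
print(f"per tri:  RSP {rsp / 2:.1f}, RDRAM {rdram / 2:.1f}")
```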
### Starting a Snake
Since a `SPTriSnake` command encodes 5 triangles, for comparison to
`SP2Triangles` we will consider the overhead for 10 triangles total / two snake
starts.
For `SPTriSnake`, this is 2 x (12 cycles command dispatch + 4 cycles snake
initialization + 5 tris x 11 cycles per tri as discussed above) = 142 RSP
cycles. It is also 16 bytes of loads = 7.3 cycles of RDRAM / RDP time, during
which the RSP is stalled. So the total cost is 149.3 RSP and 7.3 RDRAM cycles.
For `SP2Triangles`, this is 5 x (21 cycles as discussed above) = 105 RSP cycles.
It is also 40 bytes of loads = 18.2 cycles of RDRAM / RDP time, during which the
RSP is stalled. So the total cost is 123.2 RSP and 18.2 RDRAM cycles.
But drawing those 10 tris as part of very long snakes would have saved 13.5
RDRAM cycles and 8.5 RSP cycles relative to `SP2Triangles`. So the relative cost
of drawing these tris as two start-of-snakes instead of in very long snakes is
(149.3 - 123.2) + 8.5 = 34.6 RSP cycles and (7.3 - 18.2) + 13.5 = 2.6 RDRAM
cycles. Thus the cost of each start-of-snake relative to long snakes is
17.3 RSP cycles and 1.3 RDRAM cycles.
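As a cross-check, the start-of-snake cost can be recomputed from the same inputs. This sketch assumes the rounded per-tri long-snake savings of 0.85 RSP and 1.35 RDRAM cycles (half of the 1.7 and 2.7 per pair from the previous section); the variable names are illustrative.

```python
DMA_BYTES_PER_CYCLE = 2.2   # measured average from the text
LONG_RSP_PER_TRI = 0.85     # assumed: 1.7 RSP cycles per pair / 2
LONG_RDRAM_PER_TRI = 1.35   # assumed: 2.7 RDRAM cycles per pair / 2
TRIS = 10                   # two SPTriSnake commands of 5 tris each

# Two SPTriSnake commands: dispatch + snake init + 5 tris at 11 cycles each.
snake_rdram = 16 / DMA_BYTES_PER_CYCLE
snake_rsp = 2 * (12 + 4 + 5 * 11) + snake_rdram    # stall time counts as RSP time

# Five SP2Triangles commands at 21 cycles each.
tri2_rdram = 40 / DMA_BYTES_PER_CYCLE
tri2_rsp = 5 * 21 + tri2_rdram

# Cost relative to drawing the same tris inside very long snakes.
start_cost_rsp = (snake_rsp - tri2_rsp) + TRIS * LONG_RSP_PER_TRI
start_cost_rdram = (snake_rdram - tri2_rdram) + TRIS * LONG_RDRAM_PER_TRI

print(f"two starts: RSP {start_cost_rsp:.1f}, RDRAM {start_cost_rdram:.1f}")
print(f"per start:  RSP {start_cost_rsp / 2:.1f}, RDRAM {start_cost_rdram / 2:.1f}")
```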
### Ending a Snake
Ending a snake costs 12 cycles of RSP time and has no direct impact on memory
traffic. However, calculating the overall performance is more complicated: the
snake can end after 1-8 bytes of the `SPContinueSnake` command, and the
remaining bytes are "wasted" in that they do not contribute to drawing tris
with memory bandwidth savings.
From a mesh optimization standpoint, this is not an issue. If you have a snake
which has filled 8 bytes of the previous `SPContinueSnake` command, and you have
another triangle to draw, there are only two cases. If that tri can't be
appended to the snake, you have to draw it with a `SP1Triangle` command either
way, so there is no performance difference. If it can be appended to the snake,
doing so will take 8 bytes of memory traffic--the same as the `SP1Triangle`
command. The snake end penalty will have to be paid whether before or after this
tri. And it's 11 RSP cycles to draw one more tri in an existing snake, whereas
the command dispatch plus second tri code for `SP1Triangle` is 16 cycles. So
it's better to continue a snake than to stop it early and use non-snake
commands, even if this leads to a mostly empty `SPContinueSnake` command. Of
course, if you can fill up even more tris in the command, the performance
benefit increases.
Assuming snake lengths are uniformly distributed, on average a snake will end
after 4.5 bytes (and thus 4.5 tris, at one byte per tri) of a `SPContinueSnake`
command.
In this case, the command will take 4.5 tris x 11 cycles per tri + 12 cycle end
snake penalty = 61.5 RSP cycles, and 8 bytes of memory traffic = 3.6 RDRAM
cycles. If these 4.5 tris were instead drawn with `SP2Triangles` commands, that
would be 2.25 commands = 47.3 RSP cycles and 18 bytes = 8.2 RDRAM cycles. Thus
on average, the snake end costs 14.2 RSP cycles and saves 4.6 RDRAM cycles
compared to `SP2Triangles` commands. But drawing those 4.5 tris as part of very
long snakes would have saved 3.9 RSP cycles and 6.1 RDRAM cycles. So the average
cost of ending a snake relative to very long snakes is 18.1 RSP and 1.5 RDRAM
cycles.
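The average end-of-snake cost follows the same pattern. The sketch below again assumes the rounded per-tri long-snake savings of 0.85 RSP and 1.35 RDRAM cycles, so some intermediate values can differ from the rounded ones quoted above by about 0.1 cycle; the names are illustrative.

```python
DMA_BYTES_PER_CYCLE = 2.2   # measured average from the text
LONG_RSP_PER_TRI = 0.85     # assumed: 1.7 RSP cycles per pair / 2
LONG_RDRAM_PER_TRI = 1.35   # assumed: 2.7 RDRAM cycles per pair / 2

# A snake ends after 1 to 8 bytes of SPContinueSnake, uniformly distributed.
avg_tris = sum(range(1, 9)) / 8                     # 4.5

# Partly filled SPContinueSnake: tris at 11 cycles each plus the end penalty.
end_rsp = avg_tris * 11 + 12
end_rdram = 8 / DMA_BYTES_PER_CYCLE

# The same tris drawn with SP2Triangles (21 cycles and 8 bytes per pair).
tri2_rsp = (avg_tris / 2) * 21
tri2_rdram = (avg_tris / 2) * 8 / DMA_BYTES_PER_CYCLE

# Cost relative to drawing the same tris inside very long snakes.
end_cost_rsp = (end_rsp - tri2_rsp) + avg_tris * LONG_RSP_PER_TRI
end_cost_rdram = (end_rdram - tri2_rdram) + avg_tris * LONG_RDRAM_PER_TRI

print(f"snake end: RSP {end_cost_rsp:.1f}, RDRAM {end_cost_rdram:.1f}")
```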
### Example
Suppose there are 4000 tris on screen. Suppose that 90% of them have been
encoded with snakes; the rest are disconnected single tris or tri pairs (quads).
That remaining 10% is encoded with `SP2Triangles` commands, which perform the
same with or without snakes, so we ignore those tris. This leaves
3600 "snakeable" tris in the scene.
Suppose that the average snake length is 16, to account for some objects with
more contiguous tris with the same material, and others with smaller disjoint
parts. Thus, for 3600 tris, there are 225 snakes.
Switching the 3600 tris from `SP2Triangles` commands to long snakes saves
4860 RDRAM cycles and 3060 RSP cycles. However, the 225 snake starts and ends
cost 225 x (1.3 + 1.5) = 630 RDRAM and 225 x (17.3 + 18.1) = 7965 RSP cycles
relative to this. So the total performance change of switching to snakes in this
case is that the RDRAM / RDP goes faster by 4230 cycles = 68 us, but the RSP
goes slower by 4905 cycles = 78 us.
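The example can be parameterized to try other scenes. This is a rough sketch using the rounded per-tri and per-snake figures from the sections above; the function name and parameters are hypothetical, and the 62.5 MHz clock used for the microsecond conversion is an assumption that happens to reproduce the 68 us and 78 us figures.

```python
CLOCK_MHZ = 62.5  # assumed clock for converting cycles to microseconds


def snake_tradeoff(total_tris, snakeable_percent, avg_snake_len):
    """Return (rdram_cycles_saved, rsp_cycles_lost) from switching to snakes."""
    snakeable = total_tris * snakeable_percent // 100
    snakes = snakeable // avg_snake_len
    # Long-snake savings per tri (rounded figures from the text).
    rdram_saved = snakeable * 1.35
    rsp_saved = snakeable * 0.85
    # Start + end cost per snake, relative to long snakes.
    rdram_cost = snakes * (1.3 + 1.5)
    rsp_cost = snakes * (17.3 + 18.1)
    return rdram_saved - rdram_cost, rsp_cost - rsp_saved


rdram_gain, rsp_loss = snake_tradeoff(4000, 90, 16)
print(f"RDRAM faster by {rdram_gain:.0f} cycles = {rdram_gain / CLOCK_MHZ:.0f} us")
print(f"RSP slower by {rsp_loss:.0f} cycles = {rsp_loss / CLOCK_MHZ:.0f} us")
```

Longer average snakes shift the balance: with fewer starts and ends, the per-snake cost term shrinks while the per-tri savings stay the same.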

@@ -755,7 +755,7 @@ $zero ---------------------- Hardwired zero ------------------------------------
$1 v1 texptr <------------- vtxLeft ------------------------------> temp, init 0
$2 v2 shdptr clipVNext -------> <----- lbPostAo laPtr temp
$3 v3 shdflg clipVLastOfsc vLoopRet ---------> laVtxLeft temp
$4 ~ unused! ~
$4 <--------- origV1Idx -------->
$5 ------------------------- vGeomMid ---------------------------------------------
$6 geom mode clipMaskIdx -----> <-- lbTexgenOrRet laSTKept
$7 v2flag tile <------------- fogFlag ----------> laPacked mtx valid cmd byte
@@ -797,6 +797,12 @@ perfCounterA equ $28 // Performance counter A (functions depend on config)
perfCounterB equ $29 // Performance counter B (functions depend on config)
perfCounterC equ $30 // Performance counter C (functions depend on config)
// Tri write:
origV1Idx equ $4 // Original / current vertex 1 index (not address)
// Vertex init:
viLtFlag equ $9 // Holds pointLightFlag or dirLightsXfrmValid
// Vertex write:
vtxLeft equ $1 // Number of vertices left to process * 0x10
vLoopRet equ $3 // Return address at end of vtx loop = top of loop or misc lighting
@@ -826,7 +832,7 @@ laSpecFres equ $16 // Nonzero if doing ltadv_normal_to_vertex for specular
laL2A equ $19 // Nonzero if light-to-alpha (cel shading) enabled
laTexgen equ $20 // Nonzero if texgen enabled
// Clipping
// Clipping:
clipVNext equ $2 // Next vertex (vertex at forward end of current edge)
clipVLastOfsc equ $3 // Last vertex / offscreen vertex
clipVOnsc equ $19 // Onscreen vertex
@@ -837,10 +843,7 @@ clipPolyRead equ $17 // Read pointer within current polygon being clipped
clipPolySelect equ $18 // Clip poly double buffer selection
clipPolyWrite equ $21 // Write pointer within current polygon being clipped
// Vertex init
viLtFlag equ $9 // Holds pointLightFlag or dirLightsXfrmValid
// Misc
// Misc:
nextRA equ $10 // Address to return to after overlay load
ovlInitClock equ $16 // Temp for profiling
dmaLen equ $19 // DMA length in bytes minus 1
@@ -1292,18 +1295,24 @@ G_TRISNAKE_handler:
sw cmd_w0, rdpHalf1Val // Store indices a, b, c
addi inputBufferPos, inputBufferPos, -6 // Point to byte 2, index b of 1st tri
li $ra, tri_snake_loop // For tri_main
lbu origV1Idx, rdpHalf1Val + 1 // Initial value, normally carried over
tri_snake_loop:
lh $3, (inputBufferEnd)(inputBufferPos) // Load indices b and c
addi inputBufferPos, inputBufferPos, 1 // Increment indices being read
tri_snake_loop_from_input_buffer:
lb $2, rdpHalf1Val + 1 // Old v1; == index b, except when bridging between old and new load
bltz $3, tri_snake_end // Upper bit of real index b set = done
andi $11, $3, 1 // Get direction flag from index c
beqz inputBufferPos, tri_snake_over_input_buffer // == 0 at end of input buffer
andi $3, $3, 0x7E // Mask out flags from index c
sb $3, rdpHalf1Val + 1 // Store index c as vertex 1
j tri_main
sb $2, (rdpHalf1Val + 2)($11) // Store old v1 as 2 if dir clear or 3 if set
tri_snake_loop_from_input_buffer:
andi $11, $3, 1 // Get direction flag from index c
bltz $3, tri_snake_end // Upper bit of real index b set = done
sb origV1Idx, (rdpHalf1Val + 2)($11) // Store old v1 as 2 if dir clear or 3 if set
andi origV1Idx, $3, 0x7E // New v1 = mask out flags from index c
sb origV1Idx, rdpHalf1Val + 1 // Store index c as vertex 1
j tri_main_from_snake // Repeat next instr so we can skip lbu origV1Idx
lpv $v27[4], (rdpHalf1Val)($zero) // To vector unit in elems 5-7
tri_snake_ret_from_input_buffer:
li $ra, tri_snake_loop // Clobbered by DMA. Not in the loop to save a cycle.
j tri_snake_loop_from_input_buffer // inputBufferPos pointing to first byte loaded
lbu $3, (inputBufferEnd)(inputBufferPos) // Load c; clear real index b sign bit -> don't exit
// H = highest on screen = lowest Y value; then M = mid, L = low
tHAtF equ $v5
@@ -1319,6 +1328,8 @@ tPosMmH equ $v6
tPosLmH equ $v8
tPosHmM equ $v11
align_with_warning 8, "One instruction of padding before tris"
G_TRI2_handler:
G_QUAD_handler:
jal tri_main // Send second tri; return here for first tri
@@ -1328,12 +1339,13 @@ G_TRI1_handler:
sw cmd_w0, rdpHalf1Val // Store first tri indices
tri_main:
lpv $v27[4], (rdpHalf1Val)($zero) // To vector unit in elems 5-7
lbu $1, rdpHalf1Val + 1
lbu origV1Idx, rdpHalf1Val + 1
tri_main_from_snake:
lbu $2, rdpHalf1Val + 2
vclr vZero
lbu $3, rdpHalf1Val + 3
vmudn $v29, vOne, vTRC_VB // Address of vertex buffer
lhu $1, (vertexTable)($1)
lhu $1, (vertexTable)(origV1Idx)
vmadl $v27, $v27, vTRC_VS // Plus vtx indices times length
lhu $2, (vertexTable)($2)
vmadl $v6, $v31, $v31[2] // 0; vtx 1 addr in $v6 elem 5
@@ -1420,9 +1432,7 @@ tPosCatF equ $v25
andi $11, vGeomMid, G_SHADING_SMOOTH >> 8
.endif
vmudh $v29, tPosMmH, tPosLmH[0]
.if !ENABLE_PROFILING
lbu $10, rdpHalf1Val + 1 // Original vertex 1 before shuffle and clipping
.endif
// nop
t1WI equ $v13 // elems 0, 4, 6
vmadh $v29, tPosLmH, tPosHmM[0]
mfc2 $3, tLPos[10] // tLPos = highest Y value = lowest on screen (x, y, addr)
@@ -1432,7 +1442,7 @@ tXPI equ $v17
lpv tHAtI[0], VTX_COLOR_VEC($1) // Load vert color of vertex 1
vreadacc tXPF, ACC_MIDDLE
.if !ENABLE_PROFILING
lhu $10, (vertexTable)($10)
lhu $10, (vertexTable)(origV1Idx)
.endif
vrcp $v20[0], tPosCatI[1]
lpv tMAtI[0], VTX_COLOR_VEC($2) // Load vert color of vertex 2
@@ -1776,13 +1786,8 @@ return_and_end_mat:
sb $zero, materialCullMode // This covers all tri early exits except clipping
tri_snake_over_input_buffer:
j displaylist_dma_tri_snake // inputBufferPos is now 0; load whole buffer
li nextRA, tri_snake_ret_from_input_buffer
tri_snake_ret_from_input_buffer:
li $ra, tri_snake_loop // Clobbered by DMA. Putting this in the loop saves an instruction but loop takes 1 more cycle per tri.
j tri_snake_loop_from_input_buffer // inputBufferPos pointing to first byte loaded
lbu $3, (inputBufferEnd)(inputBufferPos) // Load c; clear real index b sign bit -> don't exit
bgez $3, displaylist_dma_tri_snake // If $3 < 0, last tri flag set, proceed to end
li nextRA, tri_snake_ret_from_input_buffer // inputBufferPos is now 0; load whole buffer
tri_snake_end:
addi inputBufferPos, inputBufferPos, 7 // Round up to whole input command
addi $11, $zero, 0xFFF8 // Sign-extend; andi is zero-extend!