@page performance Performance Results

# Performance Results

F3DEX3_NOC matches or beats the RSP performance of F3DEX2 on **all** critical
paths in the microcode, including command dispatch, vertex processing, and
triangle processing. On top of that, the RDP and memory traffic improvements of
F3DEX3 (56-vertex buffer, auto-batched rendering, etc.) should further improve
overall game performance.

## Cycle Counts

These are cycle counts for many key paths in the microcode. Lower numbers are
better. The timings are hand-counted, taking into account all pipeline stalls
and all dual-issue conditions. Instruction alignment after branches is usually
taken into account (especially in F3DEX3), but in some cases it is assumed to be
optimal.

All numbers assume the default profiling configuration. Tri numbers assume
texture, shade, and Z, and not flushing the buffer. Tri numbers are measured
from the first cycle of the command handler (inclusive) to the first cycle of
whatever is after $ra (exclusive); this captures an extra stall cycle in F3DEX2
when finishing a triangle and going to the next command.

Vertex numbers assume no extra F3DEX3 features (packed normals, ambient
occlusion, etc.). These features are listed below as the number of extra cycles
each one costs per vertex pair. ltbasic is the codepath when point lighting,
specular, and Fresnel are disabled; ltadv is the codepath with any of these
enabled. Timings are listed separately for each number of lights because some
implementations are pipelined for two lights, so going from an even to an odd
number of lights adds a different number of cycles than going from odd to even.

|                            | F3DEX2 | F3DEX3_NOC | F3DEX3 |
|----------------------------|--------|------------|--------|
| Command dispatch           | 12     | 10         | 10     |
| Small RDP command          | 14     | 4          | 4      |
| Only/2nd tri to offscreen  | 27     | 20         | 20     |
| 1st tri to offscreen       | 28     | 21         | 21     |
| Only/2nd tri to clip       | 32     | 25         | 25     |
| 1st tri to clip            | 33     | 26         | 26     |
| Only/2nd tri to backface   | 38     | 31         | 31     |
| 1st tri to backface        | 39     | 32         | 32     |
| Only/2nd tri to degenerate | 42     | 33         | 33     |
| 1st tri to degenerate      | 43     | 34         | 34     |
| Only/2nd tri to occluded   | Can't  | Can't      | 37     |
| 1st tri to occluded        | Can't  | Can't      | 38     |
| Only/2nd tri to draw       | 172    | 149        | 151    |
| 1st tri to draw            | 173    | 150        | 152    |
| Tri snake                  | Can't  | 10/11*     | 10/11* |
| Vtx before DMA start       | 16     | 17         | 17     |
| Vtx pair, no lighting      | 54     | 54         | 70     |
| Vtx pair, 0 dir lts        | Can't  | 65         | 81     |
| Vtx pair, 1 dir lt         | 73     | 70         | 86     |
| Vtx pair, 2 dir lts        | 76     | 77         | 93     |
| Vtx pair, 3 dir lts        | 88     | 84         | 100    |
| Vtx pair, 4 dir lts        | 91     | 91         | 107    |
| Vtx pair, 5 dir lts        | 103    | 98         | 114    |
| Vtx pair, 6 dir lts        | 106    | 105        | 121    |
| Vtx pair, 7 dir lts        | 118    | 112        | 128    |
| Vtx pair, 8 dir lts        | Can't  | 119        | 135    |
| Vtx pair, 9 dir lts        | Can't  | 126        | 142    |
| Vtx pair, 0 point lts      | Can't  | 117        | 133    |
| Vtx pair, 1 point lt       | 276    | 194        | 210    |
| Vtx pair, 2 point lts      | 420    | 271        | 287    |
| Vtx pair, 3 point lts      | 564    | 348        | 364    |
| Vtx pair, 4 point lts      | 708    | 425        | 441    |
| Vtx pair, 5 point lts      | 852    | 502        | 518    |
| Vtx pair, 6 point lts      | 996    | 579        | 595    |
| Vtx pair, 7 point lts      | 1140   | 656        | 672    |
| Vtx pair, 8 point lts      | Can't  | 733        | 749    |
| Vtx pair, 9 point lts      | Can't  | 810        | 826    |
| Packed normals, ltbasic    | Can't  | 6          | 6      |
| Light-to-alpha, ltbasic    | Can't  | 10         | 10     |
| Ambient occlusion, ltbasic | Can't  | 9          | 9      |
| Packed normals, ltadv      | Can't  | -3         | -3     |
| Light-to-alpha, ltadv      | Can't  | 6          | 6      |
| Ambient occlusion, ltadv   | Can't  | 0          | 0      |
| Specular or Fresnel        | Can't  | 47         | 47     |
| + Fresnel                  | Can't  | 23         | 23     |
| + Specular per dir lt      | Can't  | 13         | 13     |
| + Specular per point lt    | Can't  | 13         | 13     |
| Light dir xfrm, 0 dir lts  | Can't  | 92         | 92     |
| Light dir xfrm, 1 dir lt   | 141    | 92         | 92     |
| Light dir xfrm, 2 dir lts  | 180    | 93         | 93     |
| Light dir xfrm, 3 dir lts  | 219    | 118        | 118    |
| Light dir xfrm, 4 dir lts  | 258    | 119        | 119    |
| Light dir xfrm, 5 dir lts  | 297    | 144        | 144    |
| Light dir xfrm, 6 dir lts  | 336    | 145        | 145    |
| Light dir xfrm, 7 dir lts  | 375    | 170        | 170    |
| Light dir xfrm, 8 dir lts  | Can't  | 171        | 171    |
| Light dir xfrm, 9 dir lts  | Can't  | 196        | 196    |

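The even/odd effect of two-light pipelining can be read off the table by taking differences between consecutive rows. The following is a small sketch, not part of the microcode or its tooling; the dictionary and function names are made up for illustration, and the values are copied from the table.

```python
# Cycle counts per vertex pair, keyed by number of directional lights.
# Values are copied from the table; the names here are illustrative only.
F3DEX2_DIR = {1: 73, 2: 76, 3: 88, 4: 91, 5: 103, 6: 106, 7: 118}
F3DEX3_NOC_DIR = {0: 65, 1: 70, 2: 77, 3: 84, 4: 91,
                  5: 98, 6: 105, 7: 112, 8: 119, 9: 126}
# Light direction transform (same numbers for F3DEX3_NOC and F3DEX3).
F3DEX3_LIGHT_XFRM = {0: 92, 1: 92, 2: 93, 3: 118, 4: 119,
                     5: 144, 6: 145, 7: 170, 8: 171, 9: 196}


def increments(costs):
    """Map n -> extra cycles when going from n-1 lights to n lights."""
    return {n: costs[n] - costs[n - 1] for n in sorted(costs) if n - 1 in costs}


f3dex2_steps = increments(F3DEX2_DIR)
noc_steps = increments(F3DEX3_NOC_DIR)
xfrm_steps = increments(F3DEX3_LIGHT_XFRM)

for name, steps in [("F3DEX2 vtx pair", f3dex2_steps),
                    ("F3DEX3_NOC vtx pair", noc_steps),
                    ("F3DEX3 light dir xfrm", xfrm_steps)]:
    print(name, steps)
```

In these numbers, the F3DEX2 vertex path alternates between a cheap and an expensive step as lights are added, the F3DEX3_NOC vertex path settles to a constant step after the first light, and the F3DEX3 light direction transform alternates again. This is why a single "cycles per light" figure would not describe every column.
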
## Triangle Snake Cycle Counts

With the recent F3DEX3 updates bringing significant RSP time savings in command
dispatch and triangle draw, triangle snakes are unfortunately no longer
competitive in RSP time.

Suppose we have two tris which are offscreen. If drawn with `SP2Triangles`, this
is 10 cycles for command dispatch, 21 cycles to cull the first tri, and 20
cycles to cull the second, for a total of 51 cycles. If drawn as part of a long
triangle snake, the triangle snake processing adds 10 or 11 cycles relative to
the `SP2Triangles` first or second triangle respectively. So this is 31 cycles
to cull each triangle, for a total of 62 cycles.

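The tally for two offscreen triangles can be written out as a short sketch. The constant names are invented for illustration; the values come from the F3DEX3 column of the cycle count table, and the snake figures cover only the steady state inside a long snake, not the cost of starting or ending one.

```python
# F3DEX3 cycle counts from the table (names are illustrative only).
DISPATCH = 10              # Command dispatch
FIRST_TRI_OFFSCREEN = 21   # 1st tri to offscreen
SECOND_TRI_OFFSCREEN = 20  # Only/2nd tri to offscreen
SNAKE_EXTRA_VS_FIRST = 10  # Tri snake cost on top of the 1st-tri path
SNAKE_EXTRA_VS_SECOND = 11 # Tri snake cost on top of the 2nd-tri path

# One SP2Triangles command: one dispatch plus two culled tris.
sp2_total = DISPATCH + FIRST_TRI_OFFSCREEN + SECOND_TRI_OFFSCREEN

# Two tris inside a long snake: no per-pair dispatch, but each tri pays
# the snake processing overhead.
snake_per_tri = (FIRST_TRI_OFFSCREEN + SNAKE_EXTRA_VS_FIRST,
                 SECOND_TRI_OFFSCREEN + SNAKE_EXTRA_VS_SECOND)
snake_total = sum(snake_per_tri)

print("SP2Triangles:", sp2_total)
print("snake, per tri:", snake_per_tri)
print("snake, two tris:", snake_total)
print("snake penalty per tri pair:", snake_total - sp2_total)
```
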
It gets worse for snakes when counting the overhead of starting and ending a
snake, which has also gotten worse with the recent changes that improved
triangle performance. I used to have a long discussion here computing estimated
performance for switching to snakes, but the numbers have all changed and they
were imprecise to begin with. The upshot is that for a typical scene, switching
everything from `SP2Triangles` to snakes might save about 70 µs of RDRAM/RDP
time but cost about 400 µs of RSP time.

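To put the two estimates on the same footing as the cycle table, here is a rough conversion sketch. The 62.5 MHz RSP clock is an assumption not stated on this page, and the helper name is hypothetical; the 70 and 400 figures are the approximate estimates from the paragraph above.

```python
# Assumption: nominal RSP clock of 62.5 MHz (not stated on this page).
RSP_CLOCK_MHZ = 62.5

# Approximate per-scene estimates quoted in the text.
RDP_SAVED_US = 70
RSP_COST_US = 400


def us_to_rsp_cycles(microseconds, clock_mhz=RSP_CLOCK_MHZ):
    """Convert a duration in microseconds to RSP cycles (MHz == cycles/us)."""
    return microseconds * clock_mhz


rsp_cost_cycles = us_to_rsp_cycles(RSP_COST_US)
cost_to_saving_ratio = RSP_COST_US / RDP_SAVED_US

print("RSP cost in cycles:", rsp_cost_cycles)
print("RSP cost / RDP saving:", round(cost_to_saving_ratio, 1))
```

Because the inputs are rough estimates, the output is only an order-of-magnitude guide: the RSP cost is several times larger than the RDRAM/RDP saving.
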
However, note that in F3DEX2, `SP2Triangles` for two offscreen triangles costs
12+28+27 = 67 cycles. F3DEX3 is so much faster than F3DEX2 that even with the
performance penalty of snakes, F3DEX3 snakes still take fewer cycles than
F3DEX2 `SP2Triangles`.
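
The same sum can be compared across microcodes with a short sketch. The function name is hypothetical; the inputs are the command dispatch, "1st tri to offscreen", and "Only/2nd tri to offscreen" rows of the cycle count table.

```python
def sp2triangles_offscreen(dispatch, first_tri, second_tri):
    """Cycles for one SP2Triangles command whose two tris are both offscreen."""
    return dispatch + first_tri + second_tri


f3dex2_cycles = sp2triangles_offscreen(12, 28, 27)
f3dex3_cycles = sp2triangles_offscreen(10, 21, 20)

saving = f3dex2_cycles - f3dex3_cycles
saving_pct = 100 * saving / f3dex2_cycles

print("F3DEX2:", f3dex2_cycles, "cycles")
print("F3DEX3:", f3dex3_cycles, "cycles")
print("saving:", saving, "cycles, about", round(saving_pct, 1), "percent")
```
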