@page performance Performance Results
# Performance Results
F3DEX3_NOC matches or beats the RSP performance of F3DEX2 on **all** critical
paths in the microcode, including command dispatch, vertex processing, and
triangle processing. On top of that, the RDP and memory traffic improvements of
F3DEX3 (56-vertex buffer, auto-batched rendering, etc.) should further improve
overall game performance.
## Cycle Counts
These are cycle counts for many key paths in the microcode. Lower numbers are
better. The timings are hand-counted taking into account all pipeline stalls and
all dual-issue conditions. Instruction alignment after branches is usually taken
into account (especially in F3DEX3), but in some cases it is assumed to be
optimal.

All numbers assume default profiling configuration. Tri numbers assume texture,
shade, and Z, and not flushing the buffer. Tri numbers are measured from the
first cycle of the command handler inclusive, to the first cycle of whatever is
after `$ra` exclusive; this captures the extra stall cycle that F3DEX2 incurs
when finishing a triangle and going to the next command.

Vertex numbers assume no extra F3DEX3 features (packed normals, ambient
occlusion, etc.). These features are listed below as the number of extra cycles
the feature costs per vertex pair. ltbasic is the codepath when point lighting,
specular, and Fresnel are disabled; ltadv is the codepath with any of these
enabled. Timings are listed separately for each number of lights because some
implementations are pipelined for two lights, so going from an even to an odd
number of lights adds a different amount of time than going from odd to even.

| | F3DEX2 | F3DEX3_NOC | F3DEX3 |
|----------------------------|--------|------------|--------|
| Command dispatch | 12 | 10 | 10 |
| Small RDP command | 14 | 4 | 4 |
| Only/2nd tri to offscreen | 27 | 20 | 20 |
| 1st tri to offscreen | 28 | 21 | 21 |
| Only/2nd tri to clip | 32 | 25 | 25 |
| 1st tri to clip | 33 | 26 | 26 |
| Only/2nd tri to backface | 38 | 31 | 31 |
| 1st tri to backface | 39 | 32 | 32 |
| Only/2nd tri to degenerate | 42 | 33 | 33 |
| 1st tri to degenerate | 43 | 34 | 34 |
| Only/2nd tri to occluded | Can't | Can't | 37 |
| 1st tri to occluded | Can't | Can't | 38 |
| Only/2nd tri to draw | 172 | 149 | 151 |
| 1st tri to draw | 173 | 150 | 152 |
| Tri snake | Can't | 10/11* | 10/11* |
| Vtx before DMA start | 16 | 17 | 17 |
| Vtx pair, no lighting | 54 | 54 | 70 |
| Vtx pair, 0 dir lts | Can't | 65 | 81 |
| Vtx pair, 1 dir lt | 73 | 70 | 86 |
| Vtx pair, 2 dir lts | 76 | 77 | 93 |
| Vtx pair, 3 dir lts | 88 | 84 | 100 |
| Vtx pair, 4 dir lts | 91 | 91 | 107 |
| Vtx pair, 5 dir lts | 103 | 98 | 114 |
| Vtx pair, 6 dir lts | 106 | 105 | 121 |
| Vtx pair, 7 dir lts | 118 | 112 | 128 |
| Vtx pair, 8 dir lts | Can't | 119 | 135 |
| Vtx pair, 9 dir lts | Can't | 126 | 142 |
| Vtx pair, 0 point lts | Can't | 117 | 133 |
| Vtx pair, 1 point lt | 276 | 194 | 210 |
| Vtx pair, 2 point lts | 420 | 271 | 287 |
| Vtx pair, 3 point lts | 564 | 348 | 364 |
| Vtx pair, 4 point lts | 708 | 425 | 441 |
| Vtx pair, 5 point lts | 852 | 502 | 518 |
| Vtx pair, 6 point lts | 996 | 579 | 595 |
| Vtx pair, 7 point lts | 1140 | 656 | 672 |
| Vtx pair, 8 point lts | Can't | 733 | 749 |
| Vtx pair, 9 point lts | Can't | 810 | 826 |
| Packed normals, ltbasic | Can't | 6 | 6 |
| Light-to-alpha, ltbasic | Can't | 10 | 10 |
| Ambient occlusion, ltbasic | Can't | 9 | 9 |
| Packed normals, ltadv | Can't | -3 | -3 |
| Light-to-alpha, ltadv | Can't | 6 | 6 |
| Ambient occlusion, ltadv | Can't | 0 | 0 |
| Specular or Fresnel | Can't | 47 | 47 |
| + Fresnel | Can't | 23 | 23 |
| + Specular per dir lt | Can't | 13 | 13 |
| + Specular per point lt | Can't | 13 | 13 |
| Light dir xfrm, 0 dir lts | Can't | 92 | 92 |
| Light dir xfrm, 1 dir lt | 141 | 92 | 92 |
| Light dir xfrm, 2 dir lts | 180 | 93 | 93 |
| Light dir xfrm, 3 dir lts | 219 | 118 | 118 |
| Light dir xfrm, 4 dir lts | 258 | 119 | 119 |
| Light dir xfrm, 5 dir lts | 297 | 144 | 144 |
| Light dir xfrm, 6 dir lts | 336 | 145 | 145 |
| Light dir xfrm, 7 dir lts | 375 | 170 | 170 |
| Light dir xfrm, 8 dir lts | Can't | 171 | 171 |
| Light dir xfrm, 9 dir lts | Can't | 196 | 196 |
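
The even/odd effect mentioned before the table can be read off by differencing
adjacent rows. The following is a minimal Python sketch, not part of the
microcode or its tooling; the list and function names are made up for
illustration, and the values are copied from the directional-light rows of the
table above (1 to 7 lights, the range that F3DEX2 supports).

```python
# Cycles per vertex pair for 1..7 directional lights, copied from the table.
# Variable and function names are illustrative only.
F3DEX2_DIR = [73, 76, 88, 91, 103, 106, 118]
F3DEX3_NOC_DIR = [70, 77, 84, 91, 98, 105, 112]


def per_light_increments(counts):
    """Return the extra cycles added by each additional light."""
    return [after - before for before, after in zip(counts, counts[1:])]


if __name__ == "__main__":
    print("F3DEX2:    ", per_light_increments(F3DEX2_DIR))
    print("F3DEX3_NOC:", per_light_increments(F3DEX3_NOC_DIR))
```

For these rows, F3DEX2 alternates between 3 and 12 extra cycles per added
light, while F3DEX3_NOC adds a flat 7, so a single per-light average would
misstate the cost of any particular light count.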
## Triangle Snake Cycle Counts
With the recent F3DEX3 updates bringing significant RSP time savings in command
dispatch and triangle draw, triangle snakes are unfortunately no longer
competitive in RSP time.

Suppose we have two tris which are offscreen. If drawn with `SP2Triangles`, this
is 10 cycles for command dispatch, 21 cycles to cull the first tri, and 20
cycles to cull the second, for a total of 51 cycles. If drawn as part of a long
triangle snake, the triangle snake processing adds 10 or 11 cycles relative to
the `SP2Triangles` first or second triangle respectively. So this is 31 cycles
to cull each triangle, for a total of 62 cycles.
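
The comparison above is plain addition over the F3DEX3 column of the table. A
small Python sketch of that arithmetic follows; the constant and function names
are hypothetical, and it deliberately ignores the cost of starting and ending a
snake.

```python
# F3DEX3 cycle counts copied from the table; names are illustrative only.
DISPATCH = 10
CULL_OFFSCREEN_FIRST = 21    # "1st tri to offscreen"
CULL_OFFSCREEN_SECOND = 20   # "Only/2nd tri to offscreen"
SNAKE_EXTRA_FIRST = 10       # "Tri snake" cost relative to a first tri
SNAKE_EXTRA_SECOND = 11      # "Tri snake" cost relative to a second tri


def sp2triangles_two_offscreen():
    """One SP2Triangles command whose two triangles are both culled offscreen."""
    return DISPATCH + CULL_OFFSCREEN_FIRST + CULL_OFFSCREEN_SECOND


def snake_offscreen_per_tri():
    """Per-triangle cull cost inside a long snake, computed from each baseline."""
    return (
        CULL_OFFSCREEN_FIRST + SNAKE_EXTRA_FIRST,
        CULL_OFFSCREEN_SECOND + SNAKE_EXTRA_SECOND,
    )


if __name__ == "__main__":
    print("SP2Triangles, two offscreen tris:", sp2triangles_two_offscreen())
    print("Snake, per offscreen tri:", snake_offscreen_per_tri())
    print("Snake, two offscreen tris:", sum(snake_offscreen_per_tri()))
```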

It gets worse for snakes when counting the overhead of starting and ending a
snake, which has also grown with the recent triangle performance improvements.
I used to have a long discussion here computing estimated performance for
switching to snakes, but the numbers have all changed and they were imprecise
to begin with. The upshot is that for a typical scene, switching everything
from `SP2Triangles` to snakes might save about 70 us of RDRAM/RDP time but
cost about 400 us of RSP time.

However, note that in F3DEX2, drawing two offscreen triangles with
`SP2Triangles` takes 12+28+27 = 67 cycles. F3DEX3 is so much faster than F3DEX2
that even snakes, with their performance penalty, still cull these triangles in
fewer cycles than F3DEX2 does.
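
To make the three-way comparison concrete, here is a rough Python sketch that
scales the offscreen-cull case to a batch of triangles. The dictionary layout,
the function names, and the batch size of 100 are assumptions for illustration;
the per-command numbers come from the table and the text above, and snake
start/end overhead is not modeled.

```python
# Offscreen-cull cycle counts copied from the table; names are illustrative.
OFFSCREEN = {
    "F3DEX2": {"dispatch": 12, "first": 28, "second": 27},
    "F3DEX3": {"dispatch": 10, "first": 21, "second": 20},
}
# Per-triangle cull cost inside a long F3DEX3 snake, as derived in the text.
F3DEX3_SNAKE_PER_OFFSCREEN_TRI = 31


def sp2triangles_cycles(microcode, num_tris):
    """RSP cycles to cull num_tris offscreen triangles, two per command."""
    if num_tris % 2 != 0:
        raise ValueError("this sketch only models an even number of triangles")
    costs = OFFSCREEN[microcode]
    per_command = costs["dispatch"] + costs["first"] + costs["second"]
    return (num_tris // 2) * per_command


def snake_cycles(num_tris):
    """RSP cycles to cull num_tris offscreen triangles inside one long snake.

    Assumption: the overhead of starting and ending the snake is ignored.
    """
    return num_tris * F3DEX3_SNAKE_PER_OFFSCREEN_TRI


if __name__ == "__main__":
    tris = 100  # arbitrary batch size chosen for the example
    print("F3DEX2 SP2Triangles:", sp2triangles_cycles("F3DEX2", tris))
    print("F3DEX3 SP2Triangles:", sp2triangles_cycles("F3DEX3", tris))
    print("F3DEX3 snake:       ", snake_cycles(tris))
```

Under these assumptions the ordering from fastest to slowest is F3DEX3
`SP2Triangles`, then F3DEX3 snakes, then F3DEX2 `SP2Triangles`.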