Added performance results

2026-01-21 10:37:45 -08:00 · 2024-04-12 22:56:32 -07:00
parent 675ef164e6
commit eaaa2fbcf5
2 changed files with 65 additions and 30 deletions
--- a/README.md
+++ b/README.md
@@ -158,6 +158,8 @@ framerate:
 - This only applies to vertex processing, not triangle processing or other
  miscellaneous microcode tasks. So the total RSP cycles spent doing useful work
  during the frame is only modestly increased.
+- The increase in time is only RSP cycles; there is no additional memory
+  traffic, so the RDP time is not directly affected.
 - In scenes which are complex enough to fill the RSP->RDP FIFO in DRAM, the RSP
  usually spends a significant fraction of time waiting for the FIFO to not be
  full (as revealed by the F3DEX3 performance counters, see below). In these
@@ -184,13 +186,13 @@ faster version based on the same algorithms as F3DEX2. This removes:

 However, it retains all other F3DEX3 features:
 - 56 verts, 9 directional lights
-- Occlusion plane (optional, see below)
+- Occlusion plane (optional with NOC configuration)
 - Z attribute offsets
 - All features not related to vertex/lighting: auto-batched rendering, packed 5
  triangles commands, hints system, etc.

-The performance of F3DEX3 vertex processing with LVP and NOC is almost the same
-as that of F3DEX2; see the Performance Results section below.
+The performance of F3DEX3 vertex processing with both LVP and NOC is almost the
+same as that of F3DEX2; see the Performance Results section below.

 ### Profiling

@@ -252,16 +254,36 @@ Some ways to use this for debugging are:

 ## Performance Results

-Vertex pipeline cycles per vertex pair in steady state. Hand-counted timings
-taking into account all pipeline stalls, but not instruction alignment.
+Vertex pipeline cycles per **vertex pair** in steady state. Hand-counted timings
+taking into account all pipeline stalls and all dual-issue conditions except for
+instruction alignment.

-| Microcode      | No Lighting | First Dir Lt | Second Dir Lt |
-|----------------|-------------|--------------|---------------|
-| F3DEX3         |
-| F3DEX3_NOC     |
-| F3DEX3_LVP     |
-| F3DEX3_LVP_NOC |
-| F3DEX2         | 54          | 19           | 3             |
+| Microcode      | No Lighting | First Dir Lt | Total for 1 Dir Lt | Extra Dir Lts |
+|----------------|-------------|--------------|--------------------|---------------|
+| F3DEX3         | 97          | 103          | 200                | 29            |
+| F3DEX3_NOC     | 79          | 103          | 182                | 29            |
+| F3DEX3_LVP     | 80          | 15           | 95                 | 7             |
+| F3DEX3_LVP_NOC | 62          | 15           | 77                 | 7             |
+| F3DEX2         | 54          | 19           | 73                 | 3 then 12     |
+
+Vertex processing time as reported by the performance counter in the `PA`
+configuration.
+- Scene 1: Kakariko, adult day, from DMT entrance
+- Scene 2: Custom empty scene with Suzanne monkey head with 1 dir light
+- Scene 3: Same but Suzanne has vertex colors instead of lighting (Link is still
+  on screen and has lighting)
+
+| Microcode      | Scene 1 | Scene 2 | Scene 3 |
+|----------------|---------|---------|---------|
+| F3DEX3         | 7.64ms  | 3.13ms  | 2.37ms  |
+| F3DEX3_NOC     | 7.07ms  | 2.89ms  | 2.14ms  |
+| F3DEX3_LVP     | 4.57ms  | 1.77ms  | 1.67ms  |
+| F3DEX3_LVP_NOC | 3.96ms  | 1.52ms  | 1.41ms  |
+| F3DEX2         | No*     | No*     | No*     |
+| Vertex count   | 3664    | 1608    | 1608    |
+
+*F3DEX2 does not contain performance counters, so the portion of the RSP time
+taken for vertex processing cannot be measured.


 ## Porting Your Romhack Codebase to F3DEX3
@@ -484,12 +506,7 @@ always use the new encoding.

 ### Vertex Processing RSP Time

-The vertex processing algorithm in F3DEX3 is redesigned compared to F3DEX2,
-which enables several of the new graphical features in F3DEX3 as well as the
-56 vertex buffer. With the new algorithm, the RSP takes significantly longer to
-process vertices in F3DEX3, especially vertices without lighting or with a very
-small number of directional lights. Note that this is RSP cycles only, not RDP
-cycles or DRAM traffic.
+See the Microcode Configuration and Performance Results sections above.

 ### Overlay 4

@@ -560,11 +577,15 @@ It is recommended to use `G_NORMALS_MODE_FAST` (the default) for most things,
 and use `G_NORMALS_MODE_AUTO` only for objects while they currently have a
 nonuniform scale (e.g. Mario only while he is squashed).

+Note that in the LVP configuration, lighting is computed in model space by
+transforming light directions into model space with M transpose, like in F3DEX2.
+Thus there is no mIT matrix and the SPNormalsMode setting is ignored.
+
 ### Optimizing for RSP code size

-A number of over-zealous optimizations in F3DEX2 which saved a few cycles but
-took several more instructions have been removed. This has a very small impact
-on overall RSP time and no impact on RDP time.
+A number of optimizations in F3DEX2 which saved a few cycles but took several
+more instructions have been removed. Outside of vertex processing, these have a
+very small impact on overall RSP time and no impact on RDP time.

 ### Far clipping removal

@@ -578,22 +599,25 @@ The removal of far clipping saved a bunch of DMEM space, and enabled other
 changes to the clipping implementation which saved even more DMEM space.

 NoN (No Nearclipping) is also mandatory in F3DEX3, though this was already the
-microcode option used in OoT.
+microcode option used in OoT. Note that tris are still clipped at the camera
+plane; nearclipping means they are clipped at the nearplane, which is a short
+distance in front of the camera plane.

 ### Removal of scaled vertex normals

 A few clever romhackers figured out that you could shrink the normals on verts
 in your mesh (so their length is less than "1") to make the lighting on those
-verts dimmer and create a version of ambient occlusion. F3DEX3 normalizes vertex
-normals after transforming them, which is required for most features of the
-lighting system including packed normals, so this no longer works. However,
-F3DEX3 has support for ambient occlusion via vertex alpha, which accomplishes
-the same goal with some extra benefits:
+verts dimmer and create a version of ambient occlusion. In the base vertex
+pipeline, F3DEX3 normalizes vertex normals after transforming them, which is
+required for most features of the lighting system including packed normals, so
+this no longer works. However, F3DEX3 has support for ambient occlusion via
+vertex alpha, which accomplishes the same goal with some extra benefits:
 - Much easier to create: just paint the vertex alpha in Blender / fast64. The
  scaled normals approach was not supported in fast64 and had to be done with
  scripts or by hand.
-- The amount of ambient occlusion in F3DEX3 can be set at runtime based on scene
-  lighting, whereas the scaled normals approach is baked into the mesh.
+- The amount of ambient occlusion in F3DEX3 can be set at runtime based on
+  variable scene lighting, whereas the scaled normals approach is baked into the
+  mesh.
 - F3DEX3 can have the vertex alpha affect ambient, directional, and point lights
  by different amounts, which is not possible with scaled normals. In fact,
  scaled normals never affect the ambient light, contrary to the concept of
@@ -607,6 +631,9 @@ F3DEX3 will fix the normals' scale but then apply the AO.
 The only case where scaled normals work but F3DEX3 AO doesn't work is for meshes
 with vertex alpha actually used for transparency (therefore also no fog).

+Note that in LVP mode, scaled normals are supported and work the same way as in
+F3DEX2, while ambient occlusion is not supported.
+
 ### RDP temporary buffers shrinking

 In FIFO versions of F3DEX2, there are two DMEM buffers to hold RDP commands
--- a/f3dex3.s
+++ b/f3dex3.s
@@ -1416,7 +1416,7 @@ sWRL equ $v25 // vtx_store W Reciprocal Low  | IMPORTANT: Can be the same reg as
 sWRH equ $v26 // vtx_store W Reciprocal High | using different ones saves one cycle delay
    vmudl   $v29, vPairTPosF, $v30[3] // Persp norm
    move    secondVtxPos, outputVtxPos          // Second and output vertices write to same mem...
-    vmadm  s1WH, vPairTPosI, $v30[3] // Persp norm
+    vmadm   s1WH, vPairTPosI, $v30[3] // Persp norm
    bltz    $1, @@skipsecond                    // ...if < 0 verts remain, ...
     vmadn  s1WL, $v31, $v31[2] // 0
    addi    secondVtxPos, outputVtxPos, vtxSize // ...otherwise, second vtx is next vtx
@@ -3038,6 +3038,7 @@ lt_loop:
    lpv     vCCC[4], (ltBufOfs + 8 - lightSize)(curLight) // Light or lookat 0 dir in elems 4-6
    lbu     $11,     (ltBufOfs + 3 - lightSize)(curLight) // Light type / constant attenuation
    beq     curLight, altBaseReg, lt_post
+     // nop
     vmrg   vAAA, vAAA, vCCC                            // vAAA = light direction
    bnez    $11, lt_point
     luv    vDDD,    (ltBufOfs + 0 - lightSize)(curLight) // Light color
@@ -3045,6 +3046,7 @@ lt_loop:
    vmulf   vAAA, vAAA, vPairNrml // Light dir * normalized normals
    vmudh   $v29, vOne, $v31[7] // Load accum mid with 0x7FFF (1 in s.15)
    vmadm   vCCC, vPairRGBA, $v30[1] // + (alpha - 1) * aoDir factor; elems 3, 7
+    // vnop
    vmudh   $v29, vOne, vAAA[0h]
    vmadh   $v29, vOne, vAAA[1h]
    vmadh   vAAA, vOne, vAAA[2h]
@@ -3060,9 +3062,11 @@ lt_finish_light:
    vxor    vAAA, vAAA, $v31[7] // = 0x7FFF - result
 lt_skip_specular:
    vge     vAAA, vAAA, $v31[2] // 0; clamp dot product to >= 0
+    // vnop; vnop; vnop
    vmudm   $v29, vAAA, vBBB[2h] // Dot product int * scale frac
    vmadh   vAAA, vAAA, vBBB[3h] // Dot product int * scale int, clamp to 0x7FFF
    addi    curLight, curLight, -lightSize
+    // vnop; vnop
    vmudh   $v29, vOne, vPairLt // Load accum mid with current light level
    j       lt_loop
     vmacf  vPairLt, vDDD, vAAA[0h] // + light color * dot product
@@ -3078,6 +3082,7 @@ vLtAOut    equ $v26 // = vDDD: light / effects alpha output
    andi    $11, $5, G_LIGHTTOALPHA >> 8
    andi    $20, $5, G_PACKED_NORMALS >> 8
    andi    $10, $5, G_TEXTURE_GEN >> 8
+    // nop
    vmulf   vLtRGBOut, vPairRGBA, vPairLt  // RGB output is RGB * light
    beqz    $11, lt_skip_cel
     vcopy  vLtAOut, vPairRGBA             // Alpha output = vertex alpha (only 3, 7 matter)
@@ -3203,6 +3208,7 @@ lt_normalize:
    vreadacc vDDD, ACC_MIDDLE
    vreadacc vCCC, ACC_UPPER
    mtc2    $11, vPairLt[6] // Constant frac part in elem 3
+    // vnop; vnop
    vmudm   $v29, vOne, vDDD[2h] // Sum of squared components
    vmadh   $v29, vOne, vCCC[2h]
    srl     $11, $24, 5 // Top 3 bits
@@ -3214,6 +3220,7 @@ lt_normalize:
    ori     $20, $20, 0x20 // Append leading 1 to mantissa
    vmadh   vCCC, vCCC, vOne
    sllv    $20, $20, $11 // Left shift to create floating point
+    // vnop; vnop; vnop
    vrsqh   $v29[2], vCCC[0] // High input, garbage output
    sll     $20, $20, 8 // Min range 00002000, 00002100... 00003F00, max 00100000...001F8000
    vrsql   $v29[1], vDDD[0] // Low input, low output
@@ -3224,6 +3231,7 @@ lt_normalize:
    vrsql   $v29[5], vDDD[4] // Low input, low output
    vrsqh   $v29[4], $v31[2] // 0 input, high output
    mtc2    $20, vCCC[6] // Quadratic frac part in elem 3
+    // vnop; vnop; vnop
    vmudn   vBBB, vBBB, $v29[0h] // Vec frac * int scaling, discard result
    srl     $20, $20, 16
    vmadm   vBBB, vAAA, $v29[1h] // Vec int * frac scaling, discard result