Lots more documentation

2026-01-21 10:37:45 -08:00 · 2025-07-13 16:37:22 -07:00
parent 35f7faf653
commit 4c3af75485
9 changed files with 218 additions and 323 deletions
--- a/README.md
+++ b/README.md
@@ -29,8 +29,8 @@ all at the same time!
  and normals/lighting on the same mesh**, by encoding the normals in the unused
  2 bytes of each vertex using the 5-6-5 bit encoding by HailToDodongo from
  [Tiny3D](https://github.com/HailToDodongo/tiny3d). Model-space precision of
-  the normals is reduced, but this is rarely noticeable and there is barely any
-  performance penalty compared to regular normals without vertex colors.
+  the normals is reduced, but this is rarely noticeable, and the performance is
+  nearly identical to vanilla normals (without simultaneous vertex colors).
 - New geometry mode bit `G_AMBOCCLUSION` enables **ambient occlusion** for
  opaque materials. Paint the shadow map into the vertex alpha channel; separate
  factors (set with `SPAmbOcclusion`) control how much this affects the ambient
@@ -73,7 +73,7 @@ all at the same time!
  RSP time as fewer verts have to be reloaded and re-transformed, and also makes
  display lists shorter.
 - New **occlusion plane** system allows the placement of a 3D quadrilateral
-  where objects behind this plane in screen space are culled. This can
+  where triangles behind this plane in screen space are culled. This can
  dramatically improve RDP performance by reducing overdraw in scenes with walls
  in the middle, such as a city or an indoor scene.
 - If a material display list being drawn is the same as the last material, the
@@ -92,7 +92,8 @@ all at the same time!
  shade alpha values are all below or above a settable threshold. This
  **substantially reduces the performance penalty of cel shading**--only tris
  which "straddle" the cel threshold are drawn twice, the others are only drawn
-  once.
+  once. This can also be used to **cull tris which are fully in fog**, replacing
+  far clipping, which is removed in F3DEX3.
 - A new "hints" system encodes the expected size of the target display list into
  call, branch, and return DL commands. This allows only the needed number of DL
  commands in the next DL to be fetched, rather than always fetching full
@@ -107,26 +108,30 @@ all at the same time!
  value. This can be used for clearing the Z buffer or filling the framebuffer
  or the letterbox with a solid color **faster than the RDP can in fill mode**.
  Practical performance may vary due to scheduling constraints.
- The key codepaths for triangle draw and vertex processing (assuming lighting
-  enabled and the occlusion plane disabled with the `NOC` configuration) are
-  **slightly faster than in F3DEX2**.
+- New `SPFlush` command can ensure that the RDP starts clearing the framebuffer
+  as soon as possible during the frame, instead of waiting a short time for
+  further RSP processing.
+- The key codepaths for command dispatch, triangle draw, and vertex processing
+  (assuming lighting enabled and the occlusion plane disabled with the `NOC`
+  configuration) are **slightly faster than in F3DEX2**.

 ### Miscellaneous

 - **Z-fighting of decals has been nearly eliminated**, with only a modest
-  increase in overdraw of very close occluding geometry. This is based on a
-  technique developed by SGI, neglected and removed by Nintendo, and re-added
-  by Rare; the F3DEX3 version improves upon it by choosing optimal parameters
-  and automatically enabling it for all decals with no code or DL changes. In
-  addition, the reduction in Z buffer precision from F3DEX(1) to F3DEX2 has been
-  reversed, and additional Z buffer precision beyond F3DEX(1) has been added.
+  increase in overdraw onto the decal of very close occluding geometry. This is
+  based on a technique developed by SGI, neglected and removed by Nintendo, and
+  re-added by Rare; the F3DEX3 version improves upon it by choosing optimal
+  parameters and automatically enabling it for all decals with no code or DL
+  changes.
+- The reduction in Z buffer precision from F3DEX(1) to F3DEX2 has been reversed,
+  and **additional Z buffer precision** beyond F3DEX(1) has been added.
 - **Point lighting** has been redesigned. The appearance when a light is close
  to an object has been improved. Fixed a bug in F3DEX2/ZEX point lighting where
  a Z component was accidentally doubled in the point lighting calculations. The
-  quadratic point light attenuation factor is now an E3M5 floating-point number.
-  The performance penalty for using large numbers of point lights has been
-  reduced.
- Maximum number of directional / point **lights raised from 7 to 9**. Minimum
+  quadratic point light attenuation factor is now an E3M5 floating-point number
+  for a wider representable range. The performance penalty for using large
+  numbers of point lights has been reduced.
+- Maximum number of directional / point lights **raised from 7 to 9**. Minimum
  number of directional / point lights lowered from 1 to 0 (F3DEX2 required at
  least one). Also supports loading all lights in one DMA transfer
  (`SPSetLights`), rather than one per light.
@@ -136,15 +141,15 @@ all at the same time!
  parameters are encoded in the command. With some limitations, this allows the
  tint colors of cel shading to **match scene lighting** with no code
  intervention. Also useful for other lighting-dependent effects.
- The microcode automatically switches between two lighting implementations
+- The microcode automatically switches between **two lighting implementations**
  depending on which visual features are selected in the particular material.
  The "basic lighting" codepath--which is roughly the same speed as F3DEX2--
  supports all F3DEX2 features (directional lights, texgen), plus packed
  normals, ambient occlusion, and light-to-alpha. The "advanced lighting"
  codepath, which is slower, adds support for point lights, specular, and
-  Fresnel. You only pay the performance penalty for the features you use, and
-  only for the objects which use them.
-  
+  Fresnel. You only pay the performance penalty for the objects which use these
+  advanced features.
+

 ### Profiling

--- a/cpu/occlusionplane.c
+++ b/cpu/occlusionplane.c
@@ -20,9 +20,9 @@ update the occlusion plane after updating the camera and write the pointer to
 this occlusion plane into the existing DL command near the beginning.

 3. Create a system in your game engine for dynamically choosing or creating an
-occlusion plane. For example, you might have a set of pre-determined occlusion
-planes in the scene, and at runtime pick the one which you think is most
-optimal. Some criteria to use for this include:
+occlusion plane. (See the implementation in HackerOoT.) For example, you might
+have a set of pre-determined occlusion planes in the scene, and at runtime pick
+the one which you think is best. Some criteria to use for this include:
  - whether the camera is on the correct side of the occlusion plane
  - the distance from the camera to the full (infinite) plane
  - how far the point of the camera projected onto the full (infinite) plane is
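
The first two criteria in the list above (which side of the plane the camera is on, and its distance to the full infinite plane) can be sketched in C roughly as follows. This is only an illustration under stated assumptions: `Vec3`, `PlaneCandidate`, `signed_distance`, and `pick_occlusion_plane` are hypothetical names, not part of F3DEX3, the GBI, or HackerOoT, and preferring the nearest plane is an assumed heuristic rather than the documented selection rule.

```c
#include <stddef.h>

/* Hypothetical types for illustration; not part of F3DEX3 or the GBI. */
typedef struct { float x, y, z; } Vec3;

typedef struct {
    Vec3 point;   /* any point on the candidate's infinite plane */
    Vec3 normal;  /* unit normal, pointing toward the side the camera must be on */
} PlaneCandidate;

/* Signed distance from the camera to the full (infinite) plane.
 * Positive means the camera is on the side the normal points toward. */
static float signed_distance(const PlaneCandidate *p, Vec3 cam)
{
    return (cam.x - p->point.x) * p->normal.x
         + (cam.y - p->point.y) * p->normal.y
         + (cam.z - p->point.z) * p->normal.z;
}

/* Returns the index of the candidate whose plane is nearest to the camera
 * among those the camera is on the correct side of, or -1 if there is none.
 * "Nearest wins" is an assumption made for this sketch. */
static int pick_occlusion_plane(const PlaneCandidate *cands, size_t count, Vec3 cam)
{
    int best = -1;
    float best_dist = 0.0f;
    for (size_t i = 0; i < count; i++) {
        float d = signed_distance(&cands[i], cam);
        if (d <= 0.0f) {
            continue; /* camera is on the wrong side of this plane */
        }
        if (best < 0 || d < best_dist) {
            best = (int)i;
            best_dist = d;
        }
    }
    return best;
}
```

A real engine would likely combine this with the remaining criteria and with the size of the quadrilateral on screen before committing to a plane for the frame.
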
--- a/docs/Documentation/Backwards
+++ b/docs/Documentation/Backwards
@@ -29,8 +29,8 @@ e.g. `SPMatrix` refers to `gSPMatrix` and `gsSPMatrix`. `*` means wildcard.
 | Command              | Bin | C   | Perf | Notes |
 |----------------------|-----|-----|------|-------|
 | `DPLoadTLUT*`        | =   | =   | Up   | Load is not sent to RDP if repeated in auto-batched rendering. See the GBI comment near `SPDontSkipTexLoadsAcross`. This is a performance optimization only and doesn't affect on-screen output unless the game is buggy / misusing the feature, so this behavior need not be emulated in HLE. |
-| `DPLoadBlock*`       | =   | =   | Up   |  Same as `DPLoadTLUT*` above. |
-| `DPLoadTile*`        | =   | =   | Up   |  Same as `DPLoadTLUT*` above. |
+| `DPLoadBlock*`       | =   | =   | Up   | Same as `DPLoadTLUT*` above. |
+| `DPLoadTile*`        | =   | =   | Up   | Same as `DPLoadTLUT*` above. |
 | `SPSetOtherMode`     | =   | =   |      |  |
 | All other `DP*`      | =   | =   |      | Microcode generally can't change RDP command behavior. |

@@ -48,13 +48,13 @@ e.g. `SPMatrix` refers to `gSPMatrix` and `gsSPMatrix`. `*` means wildcard.
 | `G_MV_POINT`         | Rem | Rem |      | Removed because the internal vertex format is no longer a multiple of 8 (DMA word). |
 | `SPTexture`          | =   | =   |      |  |
 | `SPTextureL`         | =   | =   |      | HW V1 workaround; long since deprecated. |
-| `SP1Triangle`        | =   | =   | Up   | Some of the new features in F3DEX3 (occlusion plane, alpha compare culling, decal fix) are during triangle processing.
+| `SP1Triangle`        | =   | =   | Up   | Some of the new features in F3DEX3 (occlusion plane, alpha compare culling, decal fix) are applied during triangle processing. |
 | `SP2Triangles`       | =   | =   | Up   | Same as `SP1Triangle` above. |
 | `SP1Quadrangle`      | =   | =   | Up   | Same as `SP1Triangle` above. |
 | `SPTriStrip`         | New | New | Up   | New command that draws 5 tris from 7 indexes, see GBI. |
 | `SPTriFan`           | New | New | Up   | New command that draws 5 tris from 7 indexes, see GBI. |
 | `SPMemset`           | New | New | Up   | New command that memsets a RDRAM region faster than the RDP can, for framebuffer or Z-buffer clear. |
-| `G_LINE3D`           | Rem | Rem |      | Removed; no-op in F3DEX2. |
+| `G_LINE3D`           | Rem | Rem |      | Removed; was a no-op in F3DEX2. |

 ### Control Logic

@@ -74,7 +74,7 @@ e.g. `SPMatrix` refers to `gSPMatrix` and `gsSPMatrix`. `*` means wildcard.
 | `G_MW_SEGMENT`       | =   | =   |      |  |
 | `G_MWO_SEGMENT_*`    | =   | =   |      | These were never needed. |
 | `SPFlush`            | New | New | Up   | This is a performance optimization only and can't be HLE emulated, so it should be treated as a no-op. |
-| `G*` (`Gfx` subtypes) | ?  | ?   |      | Deprecated. These did not fully reflect the bits usage in actual commands even in F3DEX2. These have mostly not been updated for F3DEX3. |
+| `G*` (`Gfx` subtypes) | ?  | ?   |      | Deprecated. These did not fully reflect the bit usage in actual commands even in F3DEX2. Almost none of these have been updated for F3DEX3. |

 ### 3D Space

@@ -82,7 +82,7 @@ e.g. `SPMatrix` refers to `gSPMatrix` and `gsSPMatrix`. `*` means wildcard.
 |----------------------|-----|-----|------|-------|
 | `Mtx`                | =   | =   |      |  |
 | `SPMatrix`           | Chg | =   | *    | Encoding changed due to multiple flags below changing. |
-| `G_MTX_PUSH`         | =   | =   | Down | `SPMatrix` processing with `G_MTX_PUSH` set is moved to Overlay 3 (slower) as games should not use the RSP matrix stack for accuracy and performance reasons (see GBI). |
+| `G_MTX_PUSH`         | =   | =   | Down | `SPMatrix` processing with `G_MTX_PUSH` set is moved to Overlay 3 (slower) as games generally should not use the RSP matrix stack for accuracy and performance reasons (see GBI). |
 | `G_MTX_NOPUSH`       | =   | =   |      |  |
 | `G_MTX_LOAD`         | Chg | =   |      | Encoding inverted (in SPMatrix, not in the definition of `G_MTX_LOAD`). |
 | `G_MTX_MUL`          | Chg | =   |      | Encoding inverted (in SPMatrix, not in the definition of `G_MTX_MUL`). |
@@ -92,20 +92,20 @@ e.g. `SPMatrix` refers to `gSPMatrix` and `gsSPMatrix`. `*` means wildcard.
 | `G_MV_TEMPMTX0`      | Chg | =   |      | Encoding changed. |
 | `G_MV_VPMTX`         | Chg | New |      | New name for `G_MV_PMTX`, encoding changed. |
 | `G_MV_TEMPMTX1`      | Chg | =   |      | Encoding changed. |
-| `SPPopMatrix*`       | Chg | =   | Down | Moved to Overlay 3 (slower) as games should not use the RSP matrix stack for accuracy and performance reasons (see GBI). Encoding is changed due to `G_MV_MMTX` changing. |
+| `SPPopMatrix*`       | Chg | =   | Down | Moved to Overlay 3 (slower) as games generally should not use the RSP matrix stack for accuracy and performance reasons (see GBI). Encoding is changed due to `G_MV_MMTX` changing. |
 | `SPForceMatrix`      | Chg | Chg |      | Converted into no-op. |
 | `G_MV_MATRIX`        | Rem | Rem |      | Removed. |
 | `G_MW_MATRIX`        | Rem | Rem |      | Removed. |
 | `G_MW_FORCEMTX`      | Rem | Rem |      | Removed. |
 | `SPViewport`         | *   | *   |      | Command itself is the same, but see `Vp` below. |
-| `Vp_t` / `Vp`        | Chg | Chg |      | The Y scale is now negated, and the Z values are different due to the change from `G_MAXZ` to `G_NEW_MAXZ`.
+| `Vp_t` / `Vp`        | Chg | Chg |      | The Y scale is now negated, and the Z values are different due to the change from `G_MAXZ` to `G_NEW_MAXZ`. |
 | `G_MAXZ`             | Rem | Rem |      | Replaced with `G_NEW_MAXZ`. The name change is to force you to update your code--especially viewport definitions with hardcoded constants which are NOT defined in terms of `G_MAXZ`. |
 | `G_NEW_MAXZ`         | New | New |      | The equivalent of `G_MAXZ` constant used in viewport calculations. |
 | `G_MV_VIEWPORT`      | =   | =   |      |  |
 | `SPPerspNormalize`   | Chg | =   |      | Encoding changed. |
 | `G_MW_PERSPNORM`     | Rem | Rem |      | Removed. The perspective normalization factor is set via `G_MW_FX` with the changed encoding of `SPPerspNormalize`. |
 | `G_MWO_PERSPNORM`    | New | New |      |  |
-| `SPClipRatio`        | Chg | Chg |      | Converted into no-op. It is not possible to change the clip ratio from 2 in F3DEX3. |
+| `SPClipRatio`        | Chg | Chg |      | Converted into no-op. It is not possible to change the clip ratio from 2 in F3DEX3. Changing the clip ratio was rarely used in production games. |
 | `G_MW_CLIP`          | Rem | Rem |      | Removed. See `SPClipRatio` above. |

 ### Lighting
@@ -113,9 +113,9 @@ e.g. `SPMatrix` refers to `gSPMatrix` and `gsSPMatrix`. `*` means wildcard.
 | Command              | Bin | C   | Perf | Notes |
 |----------------------|-----|-----|------|-------|
 | `Light_t`, `Light`   | Chg | *   |      | `type` field must be set to 0 (`LIGHT_TYPE_DIR`) to indicate directional light. `size` field for specular added. Otherwise the same, though note that now there is not an extra 8 bytes of padding between lights (the offset between them is 16, not 24). |
-| `LIGHT_TYPE_DIR`     | New | New |      | New macro, but the encoding is the same as F3DEX2_PL. |
-| `PointLight_t`       | Chg | *   |      | Same changes as `Light_t`. Also note that the `kq` field is now interpreted as an E3M5 floating-point number. |
-| `LIGHT_TYPE_POINT`   | New | New |      | New macro, but the encoding is the same as F3DEX2_PL. |
+| `LIGHT_TYPE_DIR`     | New | New |      | New macro, but the encoding is the same as in F3DEX2_PL. |
+| `PointLight_t`       | Chg | *   |      | Same changes as `Light_t`. Also the `kq` field is now interpreted as an E3M5 floating-point number. |
+| `LIGHT_TYPE_POINT`   | New | New |      | New macro, but the encoding is the same as in F3DEX2_PL. |
 | `Ambient_t`, `Ambient` | = | =   |      | Note that you must use `Ambient`, not `Light`, for the ambient light if you have 9 directional/point lights. |
 | `Lights1`, `Lights2`, ... | Chg | * |   | The ambient light is at the end, not the beginning. The data layout matches the RSP internal data layout to enable `SPSetLights`. |
 | `Lightsn`            | Chg | *   |      | Same as `Lights1` etc. Also, now 9 directional/point lights. |
@@ -127,7 +127,7 @@ e.g. `SPMatrix` refers to `gSPMatrix` and `gsSPMatrix`. `*` means wildcard.
 | `G_MWO_NUMLIGHT`     | =   | =   |      |  |
 | `NUML`               | Chg | =   |      | Encoding changed. |
 | `NUMLIGHTS_*`        | Chg | =   |      | Deprecated as these are just defined equal to their number, because F3DEX3 supports zero lights. |
-| `LIGHT_*`            | =   | =   |      | Deprecated and were never useful. |
+| `LIGHT_*`            | =   | =   |      | Deprecated and were not useful in F3DEX2 either. |
 | `SPLight`            | Chg | =   |      | Encoding changed. Note that you must use `SPAmbient`, not `SPLight`, for the ambient light if you have 9 directional/point lights. Also note that you should usually use `SPSetLights` unless you need to set individual lights without affecting the others. |
 | `SPAmbient`          | New | New |      | New command to upload the ambient light. If you have 0-8 directional/point lights, you can also use `SPLight` for this (slightly slower), but if you have 9 directional/point lights you must use `SPAmbient`. |
 | `SPLightColor*`      | Chg | =   |      | Encoding changed. |
@@ -138,7 +138,7 @@ e.g. `SPMatrix` refers to `gSPMatrix` and `gsSPMatrix`. `*` means wildcard.
 | `G_MWO_bLIGHT_*`     | Chg | =   |      | Encodings changed. No longer needed. |
 | `G_MVO_L*`           | Rem | Rem |      | Removed. |
 | `SPCameraWorld`      | New | New |      | New command to set the camera position for Fresnel. |
-| `PlainVtx`           | New | New |      | For `SPCameraWorld`.
+| `PlainVtx`           | New | New |      | For `SPCameraWorld`. |
 | `SPLookAt`           | New | New |      | Replaces `SPLookAtX` and `SPLookAtY`. |
 | `SPLookAtX`          | Chg | *   |      | Encoding changed; in an attempt at backwards compatibility, defined as `SPLookAt`, which works with basic usage. |
 | `SPLookAtY`          | Chg | *   |      | Converted to no-op. |
@@ -155,7 +155,7 @@ e.g. `SPMatrix` refers to `gSPMatrix` and `gsSPMatrix`. `*` means wildcard.
 |--------------------------|-----|-----|------|-------|
 | `SP*GeometryMode*`       | *   | *   |      | Commands themselves are the same, but many new geometry mode flags, see below. |
 | `G_ZBUFFER`              | =   | =   |      |  |
-| `G_TEXTURE_ENABLE`       | =   | =   |      | Very old (F3D / HW v1) display lists with this bit set will no longer crash on F3DEX3, unlike F3DEX2. |
+| `G_TEXTURE_ENABLE`       | =   | =   |      | Very old (F3D / HW v1) display lists with this bit set will crash on F3DEX2, but not on F3DEX3. |
 | `G_SHADE`                | =   | =   |      |  |
 | `G_ATTROFFSET_ST_ENABLE` | New | New |      | New geometry mode bit that enables ST attribute offsets, usually for smooth scrolling. |
 | `SPAttrOffsetST`         | New | New |      | New command which writes ST attribute offsets using `G_MWO_ATTR_OFFSET_*`. |
@@ -199,13 +199,11 @@ e.g. `SPMatrix` refers to `gSPMatrix` and `gsSPMatrix`. `*` means wildcard.
 | `SPDontSkipTexLoadsAcross` | New | New | Up | New command which locally cancels auto-batched rendering by writing an invalid address to `G_MWO_LAST_MAT_DL_ADDR`. |
 | `G_MWO_LAST_MAT_DL_ADDR`   | New | New |      |  |
 | `SPAlphaCompareCull` | New | New | Up   | New command which enables culling of tris based on shade alpha values, for cel shading. Normal use of this command in cel shading is a performance optimization only and doesn't affect on-screen output, so it can be treated as a no-op by an initial HLE implementation. But it is easy to write a display list where it does affect on-screen output, so a good HLE implementation should emulate it. |
-| `G_ALPHA_COMPARE_CULL_DISABLE` | New | New |      | Settings for `SPAlphaCompareCull`. |
-| `G_ALPHA_COMPARE_CULL_BELOW`   | New | New |      | Settings for `SPAlphaCompareCull`. |
-| `G_ALPHA_COMPARE_CULL_ABOVE`   | New | New |      | Settings for `SPAlphaCompareCull`. |
+| `G_ALPHA_COMPARE_CULL_*`   | New | New |      | Settings for `SPAlphaCompareCull`. |
 | `G_MWO_ALPHA_COMPARE_CULL` | New | New |      |  |
 | `MoveWd`             | =   | =   |      | Regular/valid encodings are the same. |
 | `MoveHalfwd`         | New | New |      | Like `MoveWd` but writes 2 bytes instead of 4. |
 | `G_MW_FX`            | New | New |      | New moveword table index for base address for many parameters. |
 | `G_SPECIAL_1`        | Rem | Rem |      | Removed; in F3DEX2, triggered MVP matrix recalculation. |
-| `G_SPECIAL_2`        | Rem | Rem |      | Removed; no-op in F3DEX2. |
-| `G_SPECIAL_3`        | Rem | Rem |      | Removed; no-op in F3DEX2. |
+| `G_SPECIAL_2`        | Rem | Rem |      | Removed; was a no-op in F3DEX2. |
+| `G_SPECIAL_3`        | Rem | Rem |      | Removed; was a no-op in F3DEX2. |
--- a/docs/Documentation/Configuration.md
+++ b/docs/Documentation/Configuration.md
@@ -2,7 +2,7 @@

 # Microcode Configuration

-There are several selectable configuration settings when building F3DEX3, which
+There are a few selectable configuration settings when building F3DEX3, which
 can be enabled in any combination. With a couple minor exceptions, none of these
 settings affect the GBI--in fact, you can swap between the microcode versions on
 a per-frame basis if you build multiple versions into your romhack.
@@ -30,65 +30,21 @@ which version to use on the profiling results from the previous frame: if the
 RSP is the bottleneck (e.g. the RDP `CLK - CMD` is high), use the NOC version,
 and otherwise use the base version.

-## Legacy Vertex Pipeline (LVP)
-
-The primary tradeoff for all the new lighting features in F3DEX3 is increased
-RSP time for vertex processing. The base version of F3DEX3 takes about
-**2-2.5x** more RSP time for vertex processing than F3DEX2 (see Performance
-Results section below), assuming no lighting or directional lights only. You
-should use the F3DEX3 performance counters (see below) to determine whether your
-game is usually RSP or RDP bound.
-
-If your game is usually RDP bound--like OoT--this generally will not affect the
-game's overall framerate, so you should stick with base F3DEX3:
- The increased time only applies to vertex processing, not triangle processing
-  or other miscellaneous microcode tasks. So the total RSP cycles spent doing
-  useful work during the frame is only modestly increased.
- The increase in time is only RSP cycles; there is no additional memory
-  traffic, so the RDP time is not directly affected.
- In scenes which are complex enough to fill the RSP->RDP FIFO in DRAM, the RSP
-  usually spends a significant fraction of time waiting for the FIFO to not be
-  full, as revealed by the performance counters. In these cases, slower vertex
-  processing simply means less time spent waiting, and little to no change in
-  total RSP time.
- When the FIFO does not fill up, usually the RSP takes significantly less time
-  during the frame compared to the RDP, so increased RSP time usually does not
-  affect the overall framerate.
-
-However, for RSP bound or extremely optimized (Kaze Emanuar) games, base F3DEX3
-can become a bottleneck, so the Legacy Vertex Pipeline (LVP) configuration has
-been introduced.
-
-This configuration replaces F3DEX3's native vertex and lighting code with a
-faster version based on the same algorithms as F3DEX2. This removes:
- Point lighting
- F3DEX3 lighting features: packed normals, ambient occlusion, light-to-alpha
-  (cel shading), Fresnel, and specular lighting
- ST attribute offsets
-
-However, it retains all other F3DEX3 features:
- 56 verts, 9 directional lights
- Occlusion plane (optional with NOC configuration)
- All features not related to vertex/lighting: auto-batched rendering, packed 5
-  triangles commands, hints system, etc.
-
-With both LVP and NOC enabled, F3DEX3 is faster on the RSP than F3DEX2 (see
-@ref performance).
-
 ## Profiling

-As mentioned above, F3DEX3 includes many performance counters. There are far too
-many counters for a single microcode to maintain, so multiple configurations of
-the microcode can be built, each containing a different set of performance
-counters. These can be swapped while the game is running so the full set of
-counters can be effectively accessed over multiple frames.
+F3DEX3 includes many performance counters. There are far too many counters for a
+single microcode to maintain, so multiple configurations of the microcode can be
+built, each containing a different set of performance counters. These can be
+swapped while the game is running so the full set of counters can be effectively
+accessed over multiple frames.

 There are a total of 21 performance counters, including:
 - Counts of vertices, triangles, rectangles, matrices, DL commands, etc.
 - Times the microcode was processing vertices, processing triangles, stalled
  because the RDP FIFO in DMEM was full, and stalled waiting for DMAs to finish
 - A counter enabling a rough measurement of how long the RDP was stalled
-  waiting for RDRAM for I/O to the framebuffer / Z buffer
+  waiting for RDRAM for I/O to the framebuffer / Z buffer (spoiler: often
+  half to two thirds of the total RDP time!)

 The default configuration of F3DEX3 provides a few of the most basic counters.
 The additional profiling configurations, called A, B, and C (for example
@@ -103,7 +59,9 @@ because their removal does not affect the RDP render time.
 Use `BrZ` if the microcode is replacing F3DEX2 or an earlier F3D version (i.e.
 SM64), or `BrW` if the microcode is replacing F3DZEX (i.e. OoT or MM). This
 controls whether `SPBranchLessZ*` uses the vertex's W coordinate or screen Z
-coordinate.
+coordinate. If you are creating a new project for any game without using vanilla
+scenes, and you're considering using this command for LoD, you should use
+`BrW`.

 ## Debug Normals (`dbgN`)

@@ -113,10 +71,12 @@ version intended to be shipped. It can still be enabled by changing

 To help debug lighting issues when integrating F3DEX3 into your romhack, this
 feature causes the vertex colors of any material with lighting enabled to be set
-to the transformed, normalized world space normals. The X, Y, and Z components
-map to R, G, and B, with each dimension's conceptual (-1.0 ... 1.0) range mapped
-to (0 ... 255). This is not compatible with LVP as world space normals do not
-exist in that pipeline. This also breaks vertex alpha and texgen / lookat.
+to the normals. When F3DEX3 is using the "basic" lighting codepath, these are
+the model space normals, and when it is using the "advanced" lighting codepath
+(point lights, specular, or Fresnel) these are transformed, normalized world
+space normals. The X, Y, and Z components map to R, G, and B, with each
+dimension's conceptual (-1.0 ... 1.0) range mapped to (0 ... 255). This also
+breaks vertex alpha and texgen / lookat.

 Some ways to use this for debugging are:
 - If the normals have obvious problems (e.g. flickering, or not changing
@@ -124,9 +84,9 @@ Some ways to use this for debugging are:
  model space normals or the M matrix. Conversely, if there is a problem with
  the standard lighting results (e.g. flickering) but the normals don't have
  this problem, the problem is likely in the lighting data.
- Check that the colors don't change based on the camera position, but DO change
-  as the object rotates, so that the same side of an object in world space is
-  always the same color.
+- If using the "advanced" lighting codepath, check that the colors don't change
+  based on the camera position, but DO change as the object rotates, so that the
+  same side of an object in world space is always the same color.
 - Make a simple object like an octahedron or sphere, view it in game, and check
  that the normals are correct. A normal pointing along +X would be
  (1.0, 0.0, 0.0), meaning (255, 128, 128) or pink. A normal pointing along -X
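
The mapping from normals to debug colors described in this hunk, with each component's conceptual (-1.0 ... 1.0) range mapped to (0 ... 255), can be reproduced on the CPU to predict what a given normal should look like. The following is a minimal sketch, assuming round-to-nearest scaling; the microcode's exact rounding may differ, and `DebugColor`, `normal_component_to_channel`, and `normal_to_debug_color` are hypothetical names that are not part of F3DEX3 or the GBI.

```c
#include <stdint.h>

/* Hypothetical type for illustration; not part of F3DEX3 or the GBI. */
typedef struct { uint8_t r, g, b; } DebugColor;

/* Maps one normal component in [-1.0, 1.0] to a color channel in [0, 255].
 * Round-to-nearest is an assumption; the RSP may round differently. */
static uint8_t normal_component_to_channel(float n)
{
    if (n < -1.0f) n = -1.0f;  /* clamp so the cast below cannot overflow */
    if (n > 1.0f) n = 1.0f;
    return (uint8_t)((n * 0.5f + 0.5f) * 255.0f + 0.5f);
}

/* X, Y, and Z map to R, G, and B respectively. */
static DebugColor normal_to_debug_color(float x, float y, float z)
{
    DebugColor c = {
        normal_component_to_channel(x),
        normal_component_to_channel(y),
        normal_component_to_channel(z),
    };
    return c;
}
```

With this sketch, `normal_to_debug_color(1.0f, 0.0f, 0.0f)` gives (255, 128, 128), matching the pink value quoted above for a normal pointing along +X.
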
--- a/docs/Documentation/Design
+++ b/docs/Documentation/Design
@@ -2,85 +2,39 @@

 # What are the tradeoffs for all these new features?

-## Vertex Processing RSP Time
+In other words, when is F3DEX3 worse than F3DEX2?

-See the Microcode Configuration and Performance Results sections above.
+## Vertex processing RSP time for occlusion plane

-## Overlay 4
+In the occlusion plane F3DEX3 configuration, vertex processing is slower than
+in F3DEX2. If using this configuration and there is no occlusion plane or it is
+occluding almost nothing, the RSP will be slower with no other benefit.

-(Note that in the LVP configuration, Overlay 4 is absent; there is no M inverse
-transpose matrix discussed below, and the other commands mentioned below are
-directly in the microcode without an overlay, due to there being enough IMEM
-space.)
+However, when the occlusion plane is occluding even a few percent of the
+triangles in the scene, the situation changes. This saves RDP time, and most
+games are RDP bound, so this trades off RSP time for RDP time and makes the game
+faster overall. Plus, RSP time is also saved for the tris which are not drawn,
+which can approximately cancel out the extra RSP time for computing the
+occlusion plane for all vertices.

-F3DEX2 contains Overlay 2, which does lighting, and Overlay 3, which does
-clipping (run on any large triangle which extends a large distance offscreen).
-These overlays are more RSP assembly code which are loaded into the same space
-in IMEM. If the wrong overlay is loaded when the other is needed, the proper
-one is loaded and then code jumps to it. Display lists which do not use lighting
-can stay on Overlay 3 at all times. Display lists for things that are typically
-relatively small on screen, such as characters, can stay on Overlay 2 at all
-times, because even when a triangle overlaps the edge of the screen, it
-typically moves fully off the screen and is discarded before it reaches the
-clipping bounds (2x the screen size).
+## Functionality in Overlay 3

-In F3DEX2, the only case where the overlays are swapped frequently is for
-scenes with lighting, because they have large triangles which often extend far
-offscreen (Overlay 3) but also need lighting (Overlay 2). Worst case, the RSP
-will load Overlay 2 once for every `SPVertex` command and then load Overlay 3
-for every set of `SP*Triangle*` commands.
+The following commands are moved to Overlay 3 in F3DEX3 to save IMEM space. This
+means that code will have to be loaded from DRAM to run them if Overlays 2 or 4
+(for lighting) happen to be loaded already.
+- Push and multiply codepaths for `SPMatrix`
+- `SPPopMatrix*`
+- `SPDma*`
+- `SPMemset`

-(If you're curious, Overlays 0 and 1 are not related to 2 and 3, and have to do
-with starting and stopping RSP tasks. During normal display list execution,
-Overlay 1 is always loaded.)
+However:
+- Multiplying, pushing, and popping matrices is not recommended for performance
+  or accuracy, and these are not used for most 3D objects in SM64 or OoT.
+- `SPDma*` is rarely used except at startup for HLE detection.
+- `SPMemset` is a new F3DEX3 command which can improve performance. Plus, it is
+  typically run shortly after render start, when Overlay 3 is already in IMEM.

-F3DEX3 introduces Overlay 4, which can occupy the same IMEM as Overlay 2 and 3.
-This overlay contains handlers for:
- Computing the inverse transpose of the model matrix M (abbreviated as mIT),
-  discussed below
- The codepath for `SPMatrix` with `G_MTX_MUL` set (base version only; this is
-  moved out of the overlay to normal microcode in the NOC configuration due to
-  having extra IMEM space available)
- `SPBranchLessZ*`
- `SPDma_io`
-
-Whenever any of these features is needed, the RSP has to swap to Overlay 4. The
-next time lighting or clipping is needed, the RSP has to then swap back to
-Overlay 2 or 3. The round-trip of these two overlay loads takes about 5
-microseconds of DRAM time including overheads. Fortunately, all the above
-features other than the mIT matrix are rarely or never used.
-
-The mIT matrix is needed in F3DEX3 because normals are covectors--they stretch
-in the opposite direction of an object's scaling. So while you multiply a vertex
-by M to transform it from model space to world space, you have to multiply a
-normal by M inverse transpose to go to world space. F3DEX2 solves this problem
-by instead transforming light directions into model space with M transpose, and
-computing the lighting in model space. However, this requires extra DMEM to
-store the transformed lights, and adds an additional performance penalty for
-point lighting which is absent in F3DEX3. Plus, having world space normals in
-F3DEX3 enables Fresnel and specular lighting.
-
-If an object's transformation matrix stack only includes translations,
-rotations, and uniform scale (i.e. same scale in X, Y, and Z), then M inverse
-transpose is just a rescaled version of M, and the normals can be transformed
-with M directly. It is only when the matrix includes nonuniform scales or shear
-that M inverse transpose differs from M. The difference gets larger as the scale
-or shear gets more extreme.
-
-F3DEX3 provides three options for handling this (see `SPNormalsMode`):
- `G_NORMALS_MODE_FAST`: Use M to transform normals. No performance penalty.
-  Lighting will be somewhat distorted for objects with nonuniform scale or
-  shear.
- `G_NORMALS_MODE_AUTO`: The RSP will automatically compute M inverse transpose
-  whenever M changes. Costs about 3.5 microseconds of DRAM time per matrix, i.e.
-  per object or skeleton limb which has lighting enabled. Lighting is correct
-  for nonuniform scale or shear.
- `G_NORMALS_MODE_MANUAL`: You compute M inverse transpose on the CPU and
-  manually upload it to the RSP every time M changes.
-
-It is recommended to use `G_NORMALS_MODE_FAST` (the default) for most things,
-and use `G_NORMALS_MODE_AUTO` only for objects while they currently have a
-nonuniform scale (e.g. Mario only while he is squashed).
+So there is not a significant practical performance impact from these changes.

 ## Far clipping removal

@@ -90,6 +44,13 @@ though it can be seen in certain extreme cases. However, it is used on the SM64
 title screen for the zoom-in on Mario's face, so this will look slightly
 different.

+Far clipping can be used to cull tris which are fully "fogged out" if the
+background color (no skybox) is also the fog color, for performance benefits.
+This effect has a bad reputation from '90s-era games, where it was used as a
+cheap trick to hide performance problems, though it's occasionally used in
+"spooky" levels in romhacks. In F3DEX3, `SPAlphaCompareCull` can be used instead
+of far clipping to cull these triangles which are fully in fog.
+
 The removal of far clipping saved a bunch of DMEM space, and enabled other
 changes to the clipping implementation which saved even more DMEM space.
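
As a rough illustration of the fog culling described above, the test applied in
the "above" mode of `SPAlphaCompareCull` can be modeled on the CPU as follows.
This is only a sketch of the rule stated in the GBI comment (cull when all three
vertex shade alpha values are >= the threshold); `demo_cull_above` and its
parameters are hypothetical names, not part of the GBI or the microcode.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CPU-side model; the real test runs in the RSP microcode.
 * Returns true when the triangle would be culled in "above" mode, i.e. when
 * every vertex shade alpha is at or above the threshold. */
static bool demo_cull_above(uint8_t a0, uint8_t a1, uint8_t a2,
                            uint8_t threshold) {
    return a0 >= threshold && a1 >= threshold && a2 >= threshold;
}
```

With a threshold of `0xFF`, a tri whose three verts are all at full fog is
dropped, while a tri with even one vert below the threshold is still drawn.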

@@ -102,11 +63,11 @@ distance in front of the camera plane.

 A few clever romhackers figured out that you could shrink the normals on verts
 in your mesh (so their length is less than "1") to make the lighting on those
-verts dimmer and create a version of ambient occlusion. In the base vertex
-pipeline, F3DEX3 normalizes vertex normals after transforming them, which is
-required for most features of the lighting system including packed normals, so
-this no longer works. However, F3DEX3 has support for ambient occlusion via
-vertex alpha, which accomplishes the same goal with some extra benefits:
+verts dimmer and create a version of ambient occlusion. In the "advanced"
+lighting codepath, F3DEX3 normalizes vertex normals after transforming them,
+which is required for point lights, specular, and Fresnel, so this no longer
+works. However, F3DEX3 has support for ambient occlusion via vertex alpha, which
+accomplishes the same goal with some extra benefits:
 - Much easier to create: just paint the vertex alpha in Blender / fast64. The
  scaled normals approach was not supported in fast64 and had to be done with
  scripts or by hand.
@@ -118,16 +79,12 @@ vertex alpha, which accomplishes the same goal with some extra benefits:
  scaled normals never affect the ambient light, contrary to the concept of
  ambient occlusion.

-Furthermore, for partial HLE compatibility, the same mesh can have the ambient
-occlusion information encoded in both scaled normals and vertex alpha at the
-same time. HLE will ignore the vertex alpha AO but use the scaled normals;
-F3DEX3 will fix the normals' scale but then apply the AO.
-
 The only case where scaled normals work but F3DEX3 AO doesn't work is for meshes
 with vertex alpha actually used for transparency (therefore also no fog).

-Note that in LVP mode, scaled normals are supported and work the same way as in
-F3DEX2, while ambient occlusion is not supported.
+Note that in the "basic" lighting codepath in F3DEX3, vertex normals are treated
+the same way as in F3DEX2, so scaled normals are supported there. Ambient
+occlusion is also supported there.

 ## RDP temporary buffers shrinking

@@ -161,20 +118,13 @@ In F3DEX2, the RSP time for drawing non-textured tris was significantly lower
 than for textured tris, by skipping a chunk of computation for the texture
 coefficients if they were disabled. In F3DEX3, no computation is skipped when
 textures are disabled. However, almost all materials use textures, and F3DEX3 is
-a little faster at drawing textured tris than F3DEX2. Plus, DRAM access time RSP
-> FIFO and FIFO -> RDP is still saved from not sending the coefficients, and
-RDP time savings from avoiding loading a texture are unaffected of course. 
+a little faster at drawing textured tris than F3DEX2. Plus, F3DEX3 still does
+not send the texture coefficients if they are disabled, saving DRAM access time
+for RSP -> FIFO and FIFO -> RDP. RDP time savings from avoiding loading a
+texture are unaffected, of course.

 ## Obscure semantic differences from F3DEX2 that should never matter in practice

- `SPLoadUcode*` corrupts the current M inverse transpose matrix state. If using
-  `G_NORMALS_MODE_FAST`, this doesn't matter. If using `G_NORMALS_MODE_AUTO`,
-  you must send the M matrix to the RSP again after returning to F3DEX3 from the
-  other microcode (which would normally be done anyway when starting to draw the
-  next object). If using `G_NORMALS_MODE_MANUAL`, you must send the updated
-  M inverse transpose matrix to the RSP after returning to F3DEX3 from the other
-  microcode (which would normally be done anyway when starting to draw the next
-  object).
 - Changing fog settings--i.e. enabling or disabling `G_FOG` in the geometry mode
  or executing `SPFogFactor` or `SPFogPosition`--between loading verts and
  drawing tris with those verts will lead to incorrect fog values for those
--- a/docs/Documentation/Performance.md
+++ b/docs/Documentation/Performance.md
@@ -1,96 +1,71 @@
@page performance Performance Results

-# Philosophy
-
-The base version of F3DEX3 was created for RDP bound games like OoT, where new
-visual effects are desired and increasing the RSP time a bit does not affect the
-overall performance. If your game is RSP bound, using the base version of F3DEX3
-will make it slower.
-
-Conversely, F3DEX3_LVP_NOC matches or beats the RSP performance of F3DEX2 on
-**all** critical paths in the microcode, including command dispatch, vertex
-processing, and triangle processing. Then, the RDP and memory traffic
-performance improvements of F3DEX3--56 vertex buffer, auto-batched rendering,
-etc.--should further improve performance from there. This means that switching
-from F3DEX2 to F3DEX3_LVP_NOC should always improve performance regardless of
-whether your game is RSP bound or RDP bound.
-
-
 # Performance Results

+F3DEX3_NOC matches or beats the RSP performance of F3DEX2 on **all** critical
+paths in the microcode, including command dispatch, vertex processing, and
+triangle processing. Then, the RDP and memory traffic performance improvements
+of F3DEX3--56 vertex buffer, auto-batched rendering, etc.--should further
+improve overall game performance from there.
+
 ## Cycle Counts

 These are cycle counts for many key paths in the microcode. Lower numbers are
 better. The timings are hand-counted taking into account all pipeline stalls and
-all dual-issue conditions. Instruction alignment after branches is sometimes
-taken into account, otherwise assumed to be optimal.
+all dual-issue conditions. Instruction alignment after branches is usually taken
+into account, but in some cases it is assumed to be optimal.

-Vertex / lighting numbers assume no special features (texgen, packed normals,
-etc.) Tri numbers assume texture, shade, and Z, and not flushing the buffer.
-All numbers assume default profiling configuration. Empty cells are "not
-measured yet".
+All numbers assume default profiling configuration. Tri numbers assume texture,
+shade, and Z, and not flushing the buffer. Tri numbers are measured from the
+first cycle of the command handler inclusive, to the first cycle of whatever is
+after $ra exclusive; this is in order to capture the extra latency and stalls in
+F3DEX2.

-|                            | F3DEX2 | F3DEX3_LVP_NOC | F3DEX3_LVP | F3DEX3_NOC | F3DEX3 |
-|----------------------------|--------|----------------|------------|------------|--------|
-| Command dispatch           | 12     | 12             | 12         | 12         | 12     |
-| Small RDP command          | 14     | 5              | 5          | 5          | 5      |
-| Vtx before DMA start       | 16     | 17             | 17         | 17         | 17     |
-| Vtx pair, no lighting      | 54     | 54             | 81         | 79         | 98     |
-| Vtx pair, 0 dir lts        | Can't  | 64             |            |            |        |
-| Vtx pair, 1 dir lt         | 73     | 70             | 96         | 182        | 201    |
-| Vtx pair, 2 dir lts        | 76     | 77             | 103        | 211        | 230    |
-| Vtx pair, 3 dir lts        | 88     | 84             | 110        | 240        | 259    |
-| Vtx pair, 4 dir lts        | 91     | 91             | 117        | 269        | 288    |
-| Vtx pair, 5 dir lts        | 103    | 98             | 124        | 298        | 317    |
-| Vtx pair, 6 dir lts        | 106    | 105            | 131        | 327        | 346    |
-| Vtx pair, 7 dir lts        | 118    | 112            | 138        | 356        | 375    |
-| Vtx pair, 8 dir lts        | Can't  | 119            | 145        | 385        | 404    |
-| Vtx pair, 9 dir lts        | Can't  | 126            | 152        | 414        | 433    |
-| Light dir xfrm, 0 dir lts  | Can't  | 95             | 95         | None       | None   |
-| Light dir xfrm, 1 dir lt   | 141    | 95             | 95         | None       | None   |
-| Light dir xfrm, 2 dir lts  | 180    | 96             | 96         | None       | None   |
-| Light dir xfrm, 3 dir lts  | 219    | 121            | 121        | None       | None   |
-| Light dir xfrm, 4 dir lts  | 258    | 122            | 122        | None       | None   |
-| Light dir xfrm, 5 dir lts  | 297    | 147            | 147        | None       | None   |
-| Light dir xfrm, 6 dir lts  | 336    | 148            | 148        | None       | None   |
-| Light dir xfrm, 7 dir lts  | 375    | 173            | 173        | None       | None   |
-| Light dir xfrm, 8 dir lts  | Can't  | 174            | 174        | None       | None   |
-| Light dir xfrm, 9 dir lts  | Can't  | 199            | 199        | None       | None   |
-| Only/2nd tri to offscreen  | 27     | 26             | 26         | 26         | 26     |
-| 1st tri to offscreen       | 28     | 27             | 27         | 27         | 27     |
-| Only/2nd tri to clip       | 32     | 31             | 31         | 31         | 31     |
-| 1st tri to clip            | 33     | 32             | 32         | 32         | 32     |
-| Only/2nd tri to backface   | 38     | 38             | 38         | 38         | 38     |
-| 1st tri to backface        | 39     | 39             | 39         | 39         | 39     |
-| Only/2nd tri to degenerate | 42     | 40             | 40         | 40         | 40     |
-| 1st tri to degenerate      | 43     | 41             | 41         | 41         | 41     |
-| Only/2nd tri to occluded   | Can't  | Can't          | 49         | Can't      | 49     |
-| 1st tri to occluded        | Can't  | Can't          | 50         | Can't      | 50     |
-| Only/2nd tri to draw       | 172    | 160            | 163        | 160        | 163    |
-| 1st tri to draw            | 173    | 160            | 163        | 160        | 163    |
-
-
-Tri numbers are measured from the first cycle of the command handler inclusive,
-to the first cycle of whatever is after $ra exclusive. This is in order
-to capture the extra latency and stalls in F3DEX2.
-
-## Measurements
-
-Vertex processing time as reported by the performance counter in the `PA`
-configuration.
- Scene 1: Kakariko, adult day, from DMT entrance
- Scene 2: Custom empty scene with Suzanne monkey head with 1 dir light
- Scene 3: Same but Suzanne has vertex colors instead of lighting (Link is still
-  on screen and has lighting)
-
-| Microcode      | Scene 1 | Scene 2 | Scene 3 |
-|----------------|---------|---------|---------|
-| F3DEX3         | 7.41ms  | 2.99ms  | 2.22ms  |
-| F3DEX3_NOC     | 6.85ms  | 2.75ms  | 1.98ms  |
-| F3DEX3_LVP     | 4.12ms  | 1.59ms  | 1.48ms  |
-| F3DEX3_LVP_NOC | 3.34ms  | 1.27ms  | 1.16ms  |
-| F3DEX2         | Can't*  | Can't*  | Can't*  |
-| Vertex count   | 3557    | 1548    | 1548    |
-
-*F3DEX2 does not contain performance counters, so the portion of the RSP time
-taken for vertex processing cannot be measured.
+|                            | F3DEX2 | F3DEX3_NOC | F3DEX3 |
+|----------------------------|--------|------------|--------|
+| Command dispatch           | 12     | 12         | 12     |
+| Small RDP command          | 14     | 5          | 5      |
+| Only/2nd tri to offscreen  | 27     | 26         | 26     |
+| 1st tri to offscreen       | 28     | 27         | 27     |
+| Only/2nd tri to clip       | 32     | 31         | 31     |
+| 1st tri to clip            | 33     | 32         | 32     |
+| Only/2nd tri to backface   | 38     | 38         | 38     |
+| 1st tri to backface        | 39     | 39         | 39     |
+| Only/2nd tri to degenerate | 42     | 40         | 40     |
+| 1st tri to degenerate      | 43     | 41         | 41     |
+| Only/2nd tri to occluded   | Can't  | Can't      | 49     |
+| 1st tri to occluded        | Can't  | Can't      | 50     |
+| Only/2nd tri to draw       | 172    | 159        | 162    |
+| 1st tri to draw            | 173    | 160        | 163    |
+| Vtx before DMA start       | 16     | 17         | 17     |
+| Vtx pair, no lighting      | 54     | 54         | 70     |
+| Vtx pair, 0 dir lts        | Can't  | 65         | 81     |
+| Vtx pair, 1 dir lt         | 73     | 70         | 86     |
+| Vtx pair, 2 dir lts        | 76     | 77         | 93     |
+| Vtx pair, 3 dir lts        | 88     | 84         | 100    |
+| Vtx pair, 4 dir lts        | 91     | 91         | 107    |
+| Vtx pair, 5 dir lts        | 103    | 98         | 114    |
+| Vtx pair, 6 dir lts        | 106    | 105        | 121    |
+| Vtx pair, 7 dir lts        | 118    | 112        | 128    |
+| Vtx pair, 8 dir lts        | Can't  | 119        | 135    |
+| Vtx pair, 9 dir lts        | Can't  | 126        | 142    |
+| Vtx pair, 0 point lts      | Can't  | TODO       | +16    |
+| Vtx pair, 1 point lt       | TODO   | TODO       | +16    |
+| Vtx pair, 2 point lts      | TODO   | TODO       | +16    |
+| Vtx pair, 3 point lts      | TODO   | TODO       | +16    |
+| Vtx pair, 4 point lts      | TODO   | TODO       | +16    |
+| Vtx pair, 5 point lts      | TODO   | TODO       | +16    |
+| Vtx pair, 6 point lts      | TODO   | TODO       | +16    |
+| Vtx pair, 7 point lts      | TODO   | TODO       | +16    |
+| Vtx pair, 8 point lts      | Can't  | TODO       | +16    |
+| Vtx pair, 9 point lts      | Can't  | TODO       | +16    |
+| Light dir xfrm, 0 dir lts  | Can't  | 92         | 92     |
+| Light dir xfrm, 1 dir lt   | 141    | 92         | 92     |
+| Light dir xfrm, 2 dir lts  | 180    | 93         | 93     |
+| Light dir xfrm, 3 dir lts  | 219    | 118        | 118    |
+| Light dir xfrm, 4 dir lts  | 258    | 119        | 119    |
+| Light dir xfrm, 5 dir lts  | 297    | 144        | 144    |
+| Light dir xfrm, 6 dir lts  | 336    | 145        | 145    |
+| Light dir xfrm, 7 dir lts  | 375    | 170        | 170    |
+| Light dir xfrm, 8 dir lts  | Can't  | 171        | 171    |
+| Light dir xfrm, 9 dir lts  | Can't  | 196        | 196    |
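
To get a feel for what the per-pair cycle counts in this table mean in
wall-clock terms, they can be converted to microseconds. The sketch below is
not from the F3DEX3 sources: the helper names are hypothetical, it assumes the
RSP runs at 62.5 MHz, and it assumes an odd trailing vertex costs a full pair.
It also counts only the vertex pair loop, ignoring DMA waits, command dispatch,
and lighting setup.

```c
/* Assumption: RSP clock of 62.5 MHz, i.e. 62.5 cycles per microsecond. */
#define DEMO_RSP_CYCLES_PER_US 62.5

/* Cycles spent in the vertex pair loop. Vertices are counted in pairs;
 * rounding an odd vertex up to a full pair is an assumption of this sketch. */
static long demo_vtx_cycles(long num_verts, long cycles_per_pair) {
    long pairs = (num_verts + 1) / 2;
    return pairs * cycles_per_pair;
}

/* Convert an RSP cycle count to microseconds. */
static double demo_cycles_to_us(long cycles) {
    return (double)cycles / DEMO_RSP_CYCLES_PER_US;
}
```

Under these assumptions, 1000 vertices with one directional light come to
500 * 86 = 43000 cycles (about 688 microseconds) on F3DEX3, versus
500 * 70 = 35000 cycles (560 microseconds) on F3DEX3_NOC.
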
--- a/docs/Documentation/Porting
+++ b/docs/Documentation/Porting
@@ -6,14 +6,14 @@ For an OoT codebase, only a few minor changes are required to use F3DEX3.
 However, more changes are recommended to increase performance and enable new
 features.

-How to modify the microcode in your HackerOoT based romhack (steps may be
-similar for other games):
+How to modify the microcode in your HackerOoT based romhack (note that this is
+already done in HackerOoT, so this is provided as a guide for other games):
 - Replace `include/ultra64/gbi.h` in your romhack with `gbi.h` from this repo.
 - Make the "Required Changes" listed below.
 - Build this repo: install the latest version of `armips`, then `make
  F3DEX3_BrZ` or `make F3DEX3_BrW`.
 - Copy the microcode binaries (`build/F3DEX3_X/F3DEX3_X.code` and
-  `build/F3DEX3_X/F3DEX3_X.data`) to somewhere in your romhack repo, e.g. `data`.
+  `build/F3DEX3_X/F3DEX3_X.data`) to `data` in your romhack repo.
 - In `data/rsp.rodata.s`, change the line between `fifoTextStart` and
  `fifoTextEnd` to `.incbin "data/F3DEX3_X.code"` (or wherever you put the
  binary), and similarly change the line between `fifoDataStart` and
@@ -41,9 +41,12 @@ Both OoT and SM64:
  dynamically) (search for `Vp` case-sensitive, `SPViewport`, and `G_MAXZ`),
  change the maximum Z value from `G_MAXZ` to `G_NEW_MAXZ` and negate the
  Y scale. For more information, see the comment next to `G_MAXZ` in the GBI.
-  Note that your romhack codebase may have the constant hardcoded, usually as
-  `511` which is supposed to be `(G_MAXZ/2)`, instead of actually writing
-  `G_MAXZ`; you need to change these too, there are several of these in SM64.
+  Note that your romhack codebase may have the constant hardcoded (usually as
+  `511`, which is supposed to be `(G_MAXZ/2)`) instead of actually writing an
+  expression containing `G_MAXZ`; you need to change these too. There are
+  several of these in SM64. Fortunately, it is easy to notice if you have failed
+  to update a Y scale, as anything drawn using that viewport will be upside
+  down.
 - Remove uses of internal GBI features which have been removed in F3DEX3 (see
  @ref compatibility for full list). In OoT, the only changes needed are:
    - In `src/code/ucode_disas.c`, remove the switch statement cases for
@@ -51,20 +54,22 @@ Both OoT and SM64:
      and `G_MW_PERSPNORM`.
    - In `src/libultra/gu/lookathil.c`, remove the lines which set the `col`,
      `colc`, and `pad` fields.
-    - In each place `G_MAXZ` is used, a compiler error will be generated;
-      negate the Y scale in each related viewport and change to `G_NEW_MAXZ`.
+    - As mentioned above, in each place `G_MAXZ` is used, a compiler error will
+      be generated; negate the Y scale in each related viewport and change the
+      Z scale and offset to use `G_NEW_MAXZ`.
 - Change your game engine lighting code to set the `type` (formerly `pad1`)
  field to 0 in the initialization of any directional light (`Light_t` and
-  derived structs like `Light` or `Lightsn`). F3DEX3 ignores the state of the
-  `G_LIGHTING_POSITIONAL` geometry mode bit in all display lists, meaning both
-  directional and point lights are supported for all display lists (including
-  vanilla). The light is identified as directional if `type` == 0 or point if
-  `kc` > 0 (`kc` and `type` are the same byte). This change is required because
-  otherwise garbage nonzero values may be put in the padding byte, leading
-  directional lights to be misinterpreted as point lights.
+  derived structs like `Light` or `Lightsn`). This change is required because
+  otherwise garbage nonzero values may be put in this byte, which was a padding
+  byte for a non-point-light microcode but is used to identify the light as
+  point or directional in a point light microcode.
    - The change needed in OoT is: in `src/code/z_lights.c`, in
      `Lights_BindPoint`, `Lights_BindDirectional`, and `Lights_NewAndDraw`, set
      `l.type` to 0 right before setting `l.col`.
+- If your game already had point lighting, use `ENABLE_POINT_LIGHTS` instead
+  of `G_LIGHTING_POSITIONAL` to indicate that point lights are currently active.
+  (Static uses of `G_LIGHTING_POSITIONAL` in display lists need not be removed
+  as this bit is ignored.)
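
The directional light initialization described above can be sketched as
follows. `DemoLight` is a simplified, hypothetical stand-in for `Light_t` (the
real definition lives in the GBI and may differ in layout), and
`demo_init_directional` is an illustrative helper, not an existing function.
The point is only that the byte which used to be padding must be explicitly
zeroed.

```c
#include <string.h>

/* Hypothetical stand-in for Light_t; field layout is an assumption. */
typedef struct {
    unsigned char col[3];  /* light color */
    unsigned char type;    /* formerly pad1; 0 marks a directional light */
    unsigned char colc[3]; /* copy of the light color */
    unsigned char pad2;
    signed char dir[3];    /* light direction */
    unsigned char pad3;
} DemoLight;

/* Zero the whole struct first so no stale bytes end up in `type`,
 * then fill in the color and direction. */
static void demo_init_directional(DemoLight *l, const unsigned char rgb[3],
                                  const signed char dir[3]) {
    memset(l, 0, sizeof(*l)); /* guarantees l->type == 0 */
    memcpy(l->col, rgb, 3);
    memcpy(l->colc, rgb, 3);
    memcpy(l->dir, dir, 3);
}
```

Clearing the whole struct is a slightly broader fix than the single `l.type = 0`
assignment described for OoT, but it has the same effect on the byte that
matters.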

 SM64 only:

@@ -72,15 +77,15 @@ SM64 only:
  fixed, the vanilla permanent light direction of `{0x28, 0x28, 0x28}` must be
  changed to `{0x49, 0x49, 0x49}`, or everything will be too dark. The former
  vector is not properly normalized, but F3D through F3DEX2 normalize light
-  directions in the microcode, so it doesn't matter with those microcodes. In
-  contrast, F3DEX3 normalizes vertex normals (after transforming them), but
-  assumes light directions have already been normalized.
+  directions in the microcode, so it doesn't matter with those microcodes. The
+  two lighting codepaths in F3DEX3 treat light directions and vertex normals
+  differently: the fast one works like F3DEX2, but the slow one normalizes
+  vertex normals after transforming them and does not modify light directions.
+  Thus in this case, the light directions must already be normalized.
 - Matrix stack fix (world space lighting / view matrix in VP instead of in M) is
  basically required. If you *really* want camera space lighting, use matrix
  stack fix, transform the fixed camera space light direction by V inverse each
-  frame, and send that to the RSP. This will be faster than the alternative (not
-  using matrix stack fix and enabling `G_NORMALS_MODE_AUTO` to correct the
-  matrix).
+  frame, and send that to the RSP.
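
The change from `{0x28, 0x28, 0x28}` to `{0x49, 0x49, 0x49}` mentioned above
can be reproduced by normalizing the direction on the CPU. This is a minimal
sketch, assuming a normalized direction has length 127 (the largest signed byte
value), which is consistent with the two vectors quoted above. The function
names are hypothetical, and a small Newton iteration stands in for `sqrt` so the
example has no math library dependency.

```c
/* Newton's method square root; adequate for the small inputs used here. */
static double demo_sqrt(double v) {
    if (v <= 0.0) return 0.0;
    double guess = v > 1.0 ? v : 1.0;
    for (int i = 0; i < 40; i++) {
        guess = 0.5 * (guess + v / guess);
    }
    return guess;
}

/* Rescale a light direction to length 127 and round to signed bytes.
 * A zero-length input is left as all zeros. */
static void demo_normalize_light_dir(int x, int y, int z, signed char out[3]) {
    double len = demo_sqrt((double)(x * x + y * y + z * z));
    int in[3] = {x, y, z};
    for (int i = 0; i < 3; i++) {
        double scaled = (len > 0.0) ? in[i] * 127.0 / len : 0.0;
        /* Round half away from zero. */
        out[i] = (signed char)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
    }
}
```

For the vanilla vector, 0x28 is 40, the length is 40 * sqrt(3), or about 69.3,
and 40 * 127 / 69.3 is about 73.3, which rounds to 73 = 0x49.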

 ## Recommended Changes (Non-Lighting)

@@ -88,18 +93,14 @@ SM64 only:
  use `SPLookAt` instead (this is only a few lines change). Also remove any
  code which writes `SPClipRatio` or `SPForceMatrix`--these are now no-ops, so
  you might as well not write them.
- Avoid using `G_MTX_MUL` in `SPMatrix`. That is, make sure your game engine
-  computes a matrix stack on the CPU and sends the final matrix for each object
-  / limb to the RSP, rather than multiplying matrices on the RSP. OoT already
-  usually does the former for precision / accuracy reasons and only uses
-  `G_MTX_MUL` in a couple places (e.g. view * perspective matrix); it is okay to
-  leave those. This change is recommended because the `G_MTX_MUL` mode of
-  `SPMatrix` has been moved to Overlay 4 in F3DEX3 (see below), making it
-  substantially slower than it was in F3DEX2. It still functions the same though
-  so you can use it if it's really needed.
+- Avoid using `G_MTX_MUL` and `G_MTX_PUSH` in `SPMatrix`, and `SPPopMatrix*`,
+  for performance and accuracy reasons. See the GBI for more information. If
+  these are only used in a couple of non-critical places such as GUIs, that's
+  okay.
 - Re-export as many display lists (scenes, objects, skeletons, etc.) as possible
  with fast64 set to F3DEX3 mode, to take advantage of the substantially larger
-  vertex buffer, triangle packing commands, "hints" system, etc.
+  vertex buffer (and eventually, once supported by community tools, the triangle
+  packing commands and "hints" system).
 - `#define REQUIRE_SEMICOLONS_AFTER_GBI_COMMANDS` (at the top of, or before
  including, the GBI) for a more modern, OoT-style codebase where uses of GBI
  commands require semicolons after them. SM64 omits the semicolons sometimes,
@@ -137,10 +138,9 @@ SM64 only:
  emulate point lights in a scene with a directional light recomputed per actor.
  You can now just send those to the RSP as real point lights, regardless of
  whether the display lists are vanilla or new.
- If you are porting a game which already had point lighting (e.g. Majora's
-  Mask), note that the point light kc, kl, and kq factors have been changed, so
-  you will need to redesign how game engine light parameters (e.g. "light
-  radius") map to these parameters.
+- If your game already had point lighting, note that the point light kc, kl, and
+  kq factors have been changed, so you will need to redesign how game engine
+  light parameters (e.g. "light radius") map to these parameters.

 ## Changes Required for New Features

--- a/f3dex3.s
+++ b/f3dex3.s
@@ -1346,7 +1346,7 @@ tri_noinit: // ra is next cmd, second tri in TRI2, or middle of clipping
    vsub    $v11, $v4, $v6    // v11 = vertex 2 - vertex 1 (x, y, addr)
    vlt     $v13, $v2, $v4[1] // v13 = min(v1.y, v2.y), VCO = v1.y < v2.y
    bnez    $11, return_and_end_mat // Then the whole tri is offscreen, cull
-     // 22 cycles
+     // 22 cycles (for tri2 first tri; tri1/only subtract 1 from counts)
     vmrg   tHPos, $v6, $v4   // v14 = v1.y < v2.y ? v1 : v2 (lower vertex of v1, v2)
    vmudh   $v29, $v10, $v12[1] // x = (v1 - v2).x * (v1 - v3).y ... 
    lhu     $24, activeClipPlanes
@@ -3147,6 +3147,7 @@ ltbasic_setup_after_xfrm:
    j       vtx_after_lt_setup
     li     lbAfter, ltbasic_ao
    
+.align 8
 xfrm_light_store_lookat:
    vmadh   $v29, $v9,  lpWrld[1h]
    spv     lpFinal[0], (xfrmLookatDirs)($zero) // Store lookat. 1st time garbage, 2nd real
--- a/gbi.h
+++ b/gbi.h
@@ -2954,8 +2954,14 @@ _DW({                                         \

    
 /**
- * Alpha compare culling. Optimization for cel shading, could also be used for
- * other scenarios where tris are being drawn with alpha compare.
+ * Alpha compare culling. This was originally created as an optimization for cel
+ * shading, but it can also be used for other scenarios. In particular, it can
+ * be used with fog to cull tris which are entirely in the fog. This could also
+ * be accomplished with far clipping, but far clipping is removed in F3DEX3.
+ * ```
+ * // Cull tris where all three vertex shade alpha values are >= 0xFF
+ * gSPAlphaCompareCull(..., G_ALPHA_COMPARE_CULL_ABOVE, 0xFF);
+ * ```
 * 
 * If mode == G_ALPHA_COMPARE_CULL_DISABLE, tris are drawn normally.
 *