diff --git a/README.md b/README.md
index a377ff3..3e60c70 100644
--- a/README.md
+++ b/README.md
@@ -4,8 +4,8 @@ Modern graphics microcode for N64 romhacks. Will make you want to finally ditch
 HLE. Heavily modified version of F3DEX2, with all vertex and lighting code
 rewritten from scratch.
 
-**F3DEX3 is in alpha. It is not guaranteed to be bug-free, and updates may bring
-breaking changes.**
+**F3DEX3 is in beta. The GBI should be relatively stable but may change if there
+is a good reason.**
 
 [View the documentation here](https://hackern64.github.io/F3DEX3/) (or just look
 through the docs folder).
@@ -16,7 +16,7 @@ through the docs folder).
 
 Compared to F3DEX2 or any other F3D family microcode, F3DEX3 is...
 - faster on the RDP
-- in `LVP_NOC` configuration ([see docs](https://hackern64.github.io/F3DEX3/configuration.html)), [also faster on the RSP](https://hackern64.github.io/F3DEX3/performance.html)
+- in `NOC` configuration ([see docs](https://hackern64.github.io/F3DEX3/configuration.html)), [also faster on the RSP](https://hackern64.github.io/F3DEX3/performance.html)
 - more accurate
 - full of new visual features
 - [measurable in performance](https://hackern64.github.io/F3DEX3/counters.html)
@@ -27,9 +27,10 @@ all at the same time!
 
 - New geometry mode bit `G_PACKED_NORMALS` enables **simultaneous vertex colors
   and normals/lighting on the same mesh**, by encoding the normals in the unused
-  2 bytes of each vertex using a variant of [octahedral encoding](https://knarkowicz.wordpress.com/2014/04/16/octahedron-normal-vector-encoding/).
-  The normals are effectively as precise as with the vanilla method of replacing
-  vertex RGB with normal XYZ.
+  2 bytes of each vertex using the 5-6-5 bit encoding by HailToDodongo from
+  [Tiny3D](https://github.com/HailToDodongo/tiny3d). Model-space precision of
+  the normals is reduced, but this is rarely noticeable and there is barely any
+  performance penalty compared to regular normals without vertex colors.
 - New geometry mode bit `G_AMBOCCLUSION` enables **ambient occlusion** for
   opaque materials. Paint the shadow map into the vertex alpha channel; separate
   factors (set with `SPAmbOcclusion`) control how much this affects the ambient
@@ -106,6 +107,9 @@ all at the same time!
   value. This can be used for clearing the Z buffer or filling the framebuffer
   or the letterbox with a solid color **faster than the RDP can in fill mode**.
   Practical performance may vary due to scheduling constraints.
+- The key codepaths for triangle draw and vertex processing (assuming lighting
+  enabled and the occlusion plane disabled with the `NOC` configuration) are
+  **slightly faster than in F3DEX2**.
 
 ### Miscellaneous
 
@@ -132,14 +136,23 @@ all at the same time!
   parameters are encoded in the command. With some limitations, this allows the
   tint colors of cel shading to **match scene lighting** with no code
   intervention. Also useful for other lighting-dependent effects.
+- The microcode automatically switches between two lighting implementations
+  depending on which visual features are selected in the particular material.
+  The "basic lighting" codepath--which is roughly the same speed as F3DEX2--
+  supports all F3DEX2 features (directional lights, texgen), plus packed
+  normals, ambient occlusion, and light-to-alpha. The "advanced lighting"
+  codepath, which is slower, adds support for point lights, specular, and
+  Fresnel. You only pay the performance penalty for the features you use, and
+  only for the objects which use them.
+
 
 ### Profiling
 
 F3DEX3 introduces a suite of performance profiling capabilities. These take the
 form of performance counters, which report cycle counts for various operations
 or the number of items processed of a given type. There are a total of 21
-performance counters across multiple microcode versions. See the Profiling
-section below.
+performance counters across multiple microcode versions. See the Performance
+Counters page in the docs.
 
 
 ## Credits
@@ -159,6 +172,7 @@ Other contributors:
 - Kaze Emanuar: several feature suggestions, testing
 - thecozies: Fresnel feature suggestion
 - Rasky: memset feature suggestion
+- HailToDodongo: packed normals encoding
 - coco875: Doxygen / GitHub Pages setup
 - ThePerfectLuigi64: CI build setup
 - neoshaman: feature discussions
diff --git a/docs/Code/Counters.md b/docs/Code/Counters.md
index 49f236d..dd62a97 100644
--- a/docs/Code/Counters.md
+++ b/docs/Code/Counters.md
@@ -116,7 +116,7 @@ In variables.h with the ENABLE_SPEEDMETER section:
 extern volatile F3DEX3YieldDataFooter gRSPProfilingResults;
 ```
 
-In the true codepath of Sched_TaskComplete:
+In the `true` codepath of Sched_TaskComplete:
 ```
 #ifdef ENABLE_SPEEDMETER
     /* Fetch number of primitives drawn from yield data */
@@ -139,7 +139,7 @@ volatile F3DEX3YieldDataFooter gRSPProfilingResults;
 ```
 
 You can display them on screen however you wish. Here is an example, in
-SpeedMeter_DrawTimeEntries
+SpeedMeter_DrawTimeEntries:
 ```
 GfxPrint printer;
 Gfx* opaStart;
diff --git a/docs/Documentation/Removed.md b/docs/Documentation/Removed.md
index 039172a..60cee62 100644
--- a/docs/Documentation/Removed.md
+++ b/docs/Documentation/Removed.md
@@ -4,6 +4,60 @@
 
 These features were present in earlier F3DEX3 versions, but have been removed.
 
+## Legacy Vertex Pipeline (LVP) Configuration
+
+Early versions of F3DEX3 were developed exclusively in an OoT context, where
+scenes are almost always RDP bottlenecked. Thus, these versions focused on
+reducing RDP time and adding new visual features at the cost of RSP time.
+
+Later, Kaze Emanuar became interested in using F3DEX3 in Return to Yoshi's
+Island due to the RDP performance improvements. However, due to the intense
+optimization work he had done, his game was relatively balanced in RDP / RSP
+time. Thus, when he tried F3DEX3, the decrease in RDP time and increase in RSP
+time made the game slower overall, which was not acceptable.
+
+As a result, the LVP configuration of F3DEX3 was developed, to bring
+F3DEX2-style vertex processing in exchange for dropping some of the advanced
+lighting features (which Kaze was not going to use anyway due to HLE
+compatibility). This was implemented, and after much optimization across the
+entire microcode, `F3DEX3_LVP_NOC` became slightly faster than F3DEX2 on both
+RDP and RSP. This caused Kaze to immediately adopt this configuration of F3DEX3
+for Return to Yoshi's Island.
+
+Unfortunately, this meant that if developers wanted to use the advanced lighting
+features of F3DEX3 in any part of their project, they were stuck with the much
+slower non-LVP configuration of F3DEX3. The desire to have the microcode
+automatically swap versions for each material, plus the invention of ways to
+include some of the advanced lighting features in the LVP vertex processing
+without any performance penalty when not using them, led to the two versions
+being merged. Now you get LVP-style performance unless a material uses point
+lights, specular, or Fresnel, and you only pay the performance penalty while
+rendering objects which use them.
+
+A similar approach was also considered for the NOC configuration--to have the
+microcode only compute the occlusion plane when it is enabled. This is
+unfortunately infeasible. Register allocation / naming, as well as some
+pipelined instructions leading into and out of lighting, are significantly
+different between the occlusion plane and NOC versions of vertex processing.
+This means the microcode would have to swap between four versions of lighting
+code instead of just two, creating much more complexity with the overlay system
+and IMEM size issues. Furthermore, the occlusion plane is typically not
+enabled/disabled per object, but used when rendering as much of the game
+contents as possible to maximize occluded objects. So it is reasonable to choose
+the occlusion plane or NOC configuration on a per-frame or even per-scene basis.
+
+## Octahedral Encoding for Packed Normals
+
+Previous F3DEX3 versions encoded packed normals into the unused 2 bytes of each
+vertex using a variant of [octahedral encoding](https://knarkowicz.wordpress.com/2014/04/16/octahedron-normal-vector-encoding/).
+Using this method, the normals were effectively as precise as with the vanilla
+method of replacing vertex RGB with normal XYZ. However, the decoding of this
+format was inefficient, partly due to the requirement to also support vanilla
+normals at vanilla performance. Once HailToDodongo showed that the community was
+willing to accept the moderate precision loss of the much simpler 5-6-5 bit
+encoding in [Tiny3D](https://github.com/HailToDodongo/tiny3d), this was adopted
+in F3DEX3.
+
 ## Clipping minimal scanlines algorithm
 
 Earlier F3DEX3 versions included a modified algorithm for triangulating the
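Note on the packed normals change above: the precision tradeoff of a 5-6-5 encoding can be illustrated with a minimal host-side sketch. This is a generic signed 5-6-5 quantization written for illustration only. The bit layout (X in the top 5 bits, Y in the middle 6, Z in the low 5), the scaling, and all function names here are assumptions, and the actual format used by Tiny3D and F3DEX3 may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Quantize v in [-1, 1] to a signed integer of the given bit width.
 * Hypothetical helper; the scale factor is an assumption. */
int quantize_signed(float v, int bits) {
    assert(bits >= 2 && bits <= 8);
    int max_q = (1 << (bits - 1)) - 1; /* 15 for 5 bits, 31 for 6 bits */
    float scaled = v * (float)max_q;
    int q = (int)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f)); /* round to nearest */
    if (q > max_q) q = max_q;
    if (q < -max_q) q = -max_q;
    return q;
}

/* Sign-extend the low `bits` bits of `field` to a full int. */
int sign_extend(unsigned field, int bits) {
    unsigned sign = 1u << (bits - 1);
    return (int)(field ^ sign) - (int)sign;
}

/* Pack a normal into 16 bits: assumed layout X[15:11], Y[10:5], Z[4:0]. */
uint16_t pack_normal_565(float x, float y, float z) {
    unsigned qx = (unsigned)quantize_signed(x, 5) & 0x1Fu;
    unsigned qy = (unsigned)quantize_signed(y, 6) & 0x3Fu;
    unsigned qz = (unsigned)quantize_signed(z, 5) & 0x1Fu;
    return (uint16_t)((qx << 11) | (qy << 5) | qz);
}

/* Unpack to floats. The result is generally not unit length, so a
 * consumer would renormalize it before lighting. */
void unpack_normal_565(uint16_t p, float out[3]) {
    out[0] = (float)sign_extend((p >> 11) & 0x1Fu, 5) / 15.0f;
    out[1] = (float)sign_extend((p >> 5) & 0x3Fu, 6) / 31.0f;
    out[2] = (float)sign_extend(p & 0x1Fu, 5) / 15.0f;
}
```

With only 15 or 31 steps per axis in each direction, nearby normals collapse onto the same code, which is the reduced model-space precision the README mentions. Decoding needs only shifts, masks, and a scale.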
diff --git a/f3dex3.s b/f3dex3.s
index e371f43..1e502f9 100644
--- a/f3dex3.s
+++ b/f3dex3.s
@@ -143,6 +143,8 @@ COUNTER_C_FIFO_FULL equ 1
 
 .endif
 
+CFG_DEBUG_NORMALS equ 0 // Can manually enable here
+
 // Only raise a warning in base modes; in profiling modes, addresses will be off
 .macro warn_if_base, warntext
     .if !ENABLE_PROFILING
@@ -1264,7 +1266,6 @@ G_MODIFYVTX_handler:
     j       do_moveword  // Moveword adds cmd_w0 to $10 for final addr
      lbu    cmd_w0, (inputBufferEnd - 0x07)(inputBufferPos)  // offset in vtx, bit 15 clear
 
-TODO check vtx 1 behavior
 G_TRIFAN_handler: // 17
     li      $1, 0x8000                   // $ra negative = flag for G_TRIFAN
 G_TRISTRIP_handler:
@@ -2597,8 +2598,9 @@ tris_end:
 
 tri_fan_store:
     lb      $11, (inputBufferEnd - 7)(inputBufferPos) // Load vtx 1
+    sh      cmd_w1_dram, 5(rdpCmdBufPtr) // Store vtx N+2 and N+3 as 1 and 2
     j       tri_main
-     sb     $11, 5(rdpCmdBufPtr)         // Store vtx 1
+     sb     $11, 7(rdpCmdBufPtr)         // Store vtx 1 as 3
 
 // Converts the segmented address in cmd_w1_dram to the corresponding physical address
 segmented_to_physical: // 7
@@ -3235,6 +3237,19 @@ ltbasic_start_standard:
     vnop
     luv     lVCI[0],    (tempVpRGBA)(rdpCmdBufEndP1) // Load vertex color input
 ltbasic_after_start:
+
+.if CFG_DEBUG_NORMALS
+.warning "Debug normals visualization is enabled"
+    vmudh   vpNrmlX, vOne, vpNrmlX[3h] // Move X to all elements
+    vne     $v29, $v31, $v31[1h] // Set VCC to 10111011
+    vmrg    vpNrmlX, vpNrmlX, vpNrmlY[3h] // X in 0, 4; Y to 1, 5
+    vne     $v29, $v31, $v31[2h] // Set VCC to 11011101
+    vmrg    vpNrmlX, vpNrmlX, vpNrmlZ[3h] // Z to 2, 6
+    vmudh   $v29, vOne, $v31[5] // 0x4000; middle gray
+    j       vtx_return_from_lighting
+     vmacf  vpRGBA, vpNrmlX, $v31[5] // 0x4000; + 0.5 * normal
+.else // CFG_DEBUG_NORMALS
+
     vmulf   $v29,  vpNrmlX, vLTC[4] // Normals X elems 3, 7 * first light dir X
 // lDIR <- (NOC: -, Occ: sOTM)
     lpv     lDIR[0], (ltBufOfs + 8 - 2*lightSize)(ambLight) // Xfrmed dir in elems 4-6; temp reg
@@ -3267,7 +3282,9 @@ ltbasic_post:
     jr      lbAfter
 // vpRGBA <- lDIR
      vmrg   vpRGBA, vpLtTot, lVCI  // RGB = light, A = vtx alpha
-    
+
+.endif // CFG_DEBUG_NORMALS
+
 // lbAfter       = ltbasic_ao if AO else
 // lbPostAo      = ltbasic_l2a if L2A else
 //                 ltbasic_packed if packed else
@@ -3469,6 +3486,17 @@ ltadv_vtx_loop: // Even instruction
     vmudn   vpWrlF, vpWrlF, $v31[1] // -1; negate world pos so add light/cam pos to it
     andi    laSpecFres, vGeomMid, (G_LIGHTING_SPECULAR | G_FRESNEL_COLOR | G_FRESNEL_ALPHA) >> 8
     vmadh   vpWrlI, vpWrlI, $v31[1] // -1
+
+.if CFG_DEBUG_NORMALS
+    vmudh   $v29, vOne, $v31[5] // 0x4000; middle gray
+    li      laTexgen, 0
+    vmacf   vpRGBA, vpWNrm, $v31[5] // 0x4000; + 0.5 * normal
+ltadv_finish_light:
+ltadv_loop:
+ltadv_normals_to_regs:
+ltadv_specular:
+.else
+
 ltadv_normals_to_regs:
     vmudh   vpNrmlY, vOne, vpWNrm[1h] // Move normals to separate registers
     bnez    laSpecFres, ltadv_spec_fres_setup
@@ -3504,7 +3532,7 @@ ltadv_specular: // aDOT in/out, uses vpLtTot[3] and $11 as temps
     jr      $ra
      vxor   aDOT, aDOT, $v31[7]    // = 0x7FFF - result
 
-align_with_warning 8, "One instruction of padding before ltadv_post"
+.align 8
 ltadv_post:
 // aClOut <- vpWrlF
 // aAlOut <- vpWrlI
@@ -3525,7 +3553,7 @@ ltadv_post:
     vcopy   aClOut, vpLtTot            // If no packed normals, base output is just light
 @@skip_novtxcolor:
     vmrg    vpRGBA, aClOut, aAlOut     // Merge base output and alpha output
-    beqz    $11, @@skip_fresnel
+    beqz    $11, ltadv_skip_fresnel
      ldv    vpMdl[8], (VTX_IN_OB + 0 * inputVtxSize)(laPtr) // Vtx 1 Model pos + PN
     lsv     aAOF[0], (vTRC_0100_addr - altBase)(altBaseReg) // Load constant 0x0100 to temp
     vabs    aOAFrs, aOAFrs, aOAFrs     // Fresnel dot in aOAFrs[0h]; absolute value for underwater
@@ -3538,7 +3566,10 @@ ltadv_post:
 @@skip:
     vmrg    vpRGBA, vpRGBA, aOAFrs[0h] // Replace color or alpha with fresnel
     vge     vpRGBA, vpRGBA, $v31[2]    // Clamp to >= 0 for fresnel; doesn't affect others
-@@skip_fresnel:
+
+.endif // CFG_DEBUG_NORMALS
+
+ltadv_skip_fresnel:
     beqz    laTexgen, ltadv_after_texgen
      suv    vpRGBA,   (VTX_IN_TC - 2 * inputVtxSize)(laPtr) // Vtx 2:1 RGBA
 // Texgen: aDOT still contains lookat 0 in elems 0-2, lookat 1 in elems 4-6
@@ -3659,13 +3690,6 @@ ltadv_normalize: // Normalize vector in aDPosI:vpWrlF i/f
      // aDIR <- aDotSc
 
 
-CFG_DEBUG_NORMALS equ 0 // Can manually enable here
-.if CFG_DEBUG_NORMALS
-.warning "Debug normals visualization is enabled"
-    vmudh   $v29, vOne, $v31[5] // 0x4000; middle gray
-    j       TODO
-     vmacf  vpRGBA, vpWNrm, $v31[5] // 0x4000; + 0.5 * normal
-.endif
 
 ovl4_end:
 .align 8
diff --git a/gbi.h b/gbi.h
index 50c6d6b..3c14377 100644
--- a/gbi.h
+++ b/gbi.h
@@ -2707,13 +2707,19 @@ _DW({                                                        \
 }
 /**
  * 5 Triangles in strip arrangement. Draws the following tris:
- * v1-v2-v3, v3-v2-v4, v3-v4-v5, v5-v4-v6, v5-v6-v7
+ * v1-v2-v3, v2-v4-v3, v3-v4-v5, v4-v6-v5, v5-v6-v7
  * If you want to draw fewer tris, set indices to -1 from the right.
- * e.g. to draw 4 tris, set v7 to -1; to draw 3 tris, set v6 to -1
- * Note that any set of 3 adjacent tris can be drawn with either SPTriStrip
+ * e.g. to draw 4 tris, set v7 to -1; to draw 3 tris, set v6 to -1.
+ *
+ * @note Any set of 3 adjacent tris can be drawn with either SPTriStrip
  * or SPTriFan. For arbitrary sets of 4 adjacent tris, four out of five of them
  * can be drawn with one of SPTriStrip or SPTriFan. The 4-triangle formation
- * which can't be drawn with either command looks like the Triforce.
+ * which can't be drawn with either command looks like the Triforce--maybe
+ * F3DEX4 will support gsSPTriForce. :)
+ *
+ * @note The first index of each triangle drawn is different, so that in
+ * !G_SHADING_SMOOTH (flat shading) mode, the single color or single normal of
+ * each triangle can be set independently.
  */
 #define gSPTriStrip(pkt, v1, v2, v3, v4, v5, v6, v7) \
     _gSP5Triangles(pkt, G_TRISTRIP, v1, v2, v3, v4, v5, v6, v7)
@@ -2724,8 +2730,8 @@ _DW({                                                        \
     _gsSP5Triangles(G_TRISTRIP, v1, v2, v3, v4, v5, v6, v7)
 /**
  * 5 Triangles in fan arrangement. Draws the following tris:
- * v1-v2-v3, v1-v3-v4, v1-v4-v5, v1-v5-v6, v1-v6-v7
- * Otherwise works the same as SPTriStrip, see above.
+ * v2-v3-v1, v3-v4-v1, v4-v5-v1, v5-v6-v1, v6-v7-v1
+ * Otherwise works the same as SPTriStrip. @see gSPTriStrip
  */
 #define gSPTriFan(pkt, v1, v2, v3, v4, v5, v6, v7) \
     _gSP5Triangles(pkt, G_TRIFAN, v1, v2, v3, v4, v5, v6, v7)
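
Note on the new vertex orderings documented above: the following is a minimal host-side sketch that expands the seven indices into the triangle lists given in the updated comments. It is not the microcode's implementation. The names `TRI_MODE_STRIP`, `TRI_MODE_FAN`, `Tri`, and `expand_5tris` are hypothetical, and stopping at the first negative index is an assumption based on the "set indices to -1 from the right" rule.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mode constants, not the real G_TRISTRIP / G_TRIFAN values. */
enum { TRI_MODE_STRIP, TRI_MODE_FAN };

typedef struct { int a, b, c; } Tri;

/* Expand v[0..6] (v1..v7 in the comments) into up to 5 triangles.
 * Returns the number of triangles written to out. */
int expand_5tris(int mode, const int v[7], Tri out[5]) {
    assert(v != NULL && out != NULL);
    int n = 0;
    for (int i = 0; i < 5; i++) {
        if (v[i + 2] < 0) {
            break; /* assumed: a -1 index ends the list */
        }
        Tri t;
        if (mode == TRI_MODE_FAN) {
            /* v2-v3-v1, v3-v4-v1, ...: the shared hub v1 goes last */
            t.a = v[i + 1]; t.b = v[i + 2]; t.c = v[0];
        } else if (i % 2 == 0) {
            /* v1-v2-v3, v3-v4-v5, v5-v6-v7 */
            t.a = v[i]; t.b = v[i + 1]; t.c = v[i + 2];
        } else {
            /* v2-v4-v3, v4-v6-v5: last two swapped to keep the winding */
            t.a = v[i]; t.b = v[i + 2]; t.c = v[i + 1];
        }
        out[n++] = t;
    }
    return n;
}
```

In both modes the first index of triangle `i` is distinct (`v[i]` for strips, `v[i + 1]` for fans), which is what lets each triangle take its own color or normal in flat shading mode.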