diff --git a/README.md b/README.md index a377ff3..3e60c70 100644 --- a/README.md +++ b/README.md @@ -4,8 +4,8 @@ Modern graphics microcode for N64 romhacks. Will make you want to finally ditch HLE. Heavily modified version of F3DEX2, with all vertex and lighting code rewritten from scratch. -**F3DEX3 is in alpha. It is not guaranteed to be bug-free, and updates may bring -breaking changes.** +**F3DEX3 is in beta. The GBI should be relatively stable but may change if there +is a good reason.** [View the documentation here](https://hackern64.github.io/F3DEX3/) (or just look through the docs folder). @@ -16,7 +16,7 @@ through the docs folder). Compared to F3DEX2 or any other F3D family microcode, F3DEX3 is... - faster on the RDP -- in `LVP_NOC` configuration ([see docs](https://hackern64.github.io/F3DEX3/configuration.html)), [also faster on the RSP](https://hackern64.github.io/F3DEX3/performance.html) +- in `NOC` configuration ([see docs](https://hackern64.github.io/F3DEX3/configuration.html)), [also faster on the RSP](https://hackern64.github.io/F3DEX3/performance.html) - more accurate - full of new visual features - [measurable in performance](https://hackern64.github.io/F3DEX3/counters.html) @@ -27,9 +27,10 @@ all at the same time! - New geometry mode bit `G_PACKED_NORMALS` enables **simultaneous vertex colors and normals/lighting on the same mesh**, by encoding the normals in the unused - 2 bytes of each vertex using a variant of [octahedral encoding](https://knarkowicz.wordpress.com/2014/04/16/octahedron-normal-vector-encoding/). - The normals are effectively as precise as with the vanilla method of replacing - vertex RGB with normal XYZ. + 2 bytes of each vertex using the 5-6-5 bit encoding by HailToDodongo from + [Tiny3D](https://github.com/HailToDodongo/tiny3d). Model-space precision of + the normals is reduced, but this is rarely noticeable and there is barely any + performance penalty compared to regular normals without vertex colors. 
- New geometry mode bit `G_AMBOCCLUSION` enables **ambient occlusion** for opaque materials. Paint the shadow map into the vertex alpha channel; separate factors (set with `SPAmbOcclusion`) control how much this affects the ambient @@ -106,6 +107,9 @@ all at the same time! value. This can be used for clearing the Z buffer or filling the framebuffer or the letterbox with a solid color **faster than the RDP can in fill mode**. Practical performance may vary due to scheduling constraints. +- The key codepaths for triangle draw and vertex processing (assuming lighting + enabled and the occlusion plane disabled with the `NOC` configuration) are + **slightly faster than in F3DEX2**. ### Miscellaneous @@ -132,14 +136,23 @@ all at the same time! parameters are encoded in the command. With some limitations, this allows the tint colors of cel shading to **match scene lighting** with no code intervention. Also useful for other lighting-dependent effects. +- The microcode automatically switches between two lighting implementations + depending on which visual features are selected in the particular material. + The "basic lighting" codepath--which is roughly the same speed as F3DEX2-- + supports all F3DEX2 features (directional lights, texgen), plus packed + normals, ambient occlusion, and light-to-alpha. The "advanced lighting" + codepath, which is slower, adds support for point lights, specular, and + Fresnel. You only pay the performance penalty for the features you use, and + only for the objects which use them. + ### Profiling F3DEX3 introduces a suite of performance profiling capabilities. These take the form of performance counters, which report cycle counts for various operations or the number of items processed of a given type. There are a total of 21 -performance counters across multiple microcode versions. See the Profiling -section below. +performance counters across multiple microcode versions. See the Performance +Counters page in the docs. 
## Credits @@ -159,6 +172,7 @@ Other contributors: - Kaze Emanuar: several feature suggestions, testing - thecozies: Fresnel feature suggestion - Rasky: memset feature suggestion +- HailToDodongo: packed normals encoding - coco875: Doxygen / GitHub Pages setup - ThePerfectLuigi64: CI build setup - neoshaman: feature discussions diff --git a/docs/Code/Counters.md b/docs/Code/Counters.md index 49f236d..dd62a97 100644 --- a/docs/Code/Counters.md +++ b/docs/Code/Counters.md @@ -116,7 +116,7 @@ In variables.h with the ENABLE_SPEEDMETER section: extern volatile F3DEX3YieldDataFooter gRSPProfilingResults; ``` -In the true codepath of Sched_TaskComplete: +In the `true` codepath of Sched_TaskComplete: ``` #ifdef ENABLE_SPEEDMETER /* Fetch number of primitives drawn from yield data */ @@ -139,7 +139,7 @@ volatile F3DEX3YieldDataFooter gRSPProfilingResults; ``` You can display them on screen however you wish. Here is an example, in -SpeedMeter_DrawTimeEntries +SpeedMeter_DrawTimeEntries: ``` GfxPrint printer; Gfx* opaStart; diff --git a/docs/Documentation/Removed.md b/docs/Documentation/Removed.md index 039172a..60cee62 100644 --- a/docs/Documentation/Removed.md +++ b/docs/Documentation/Removed.md @@ -4,6 +4,60 @@ These features were present in earlier F3DEX3 versions, but have been removed. +## Legacy Vertex Pipeline (LVP) Configuration + +Early versions of F3DEX3 were developed exclusively in an OoT context, where +scenes are almost always RDP bottlenecked. Thus, these versions focused on +reducing RDP time and adding new visual features at the cost of RSP time. + +Later, Kaze Emanuar became interested in using F3DEX3 in Return to Yoshi's +Island due to the RDP performance improvements. However, due to the intense +optimization work he had done, his game was relatively balanced in RDP / RSP +time. Thus, when he tried F3DEX3, the decrease in RDP time and increase in RSP +time made the game slower overall, which was not acceptable. 
+ +As a result, the LVP configuration of F3DEX3 was developed, to bring +F3DEX2-style vertex processing in exchange for dropping some of the advanced +lighting features (which Kaze was not going to use anyway due to HLE +compatibility). This was implemented, and after much optimization across the +entire microcode, `F3DEX3_LVP_NOC` became slightly faster than F3DEX2 on both +RDP and RSP. This caused Kaze to immediately adopt this configuration of F3DEX3 +for Return to Yoshi's Island. + +Unfortunately, this meant that if developers wanted to use the advanced lighting +features of F3DEX3 in any part of their project, they were stuck with the much +slower non-LVP configuration of F3DEX3. The desire to have the microcode +automatically swap versions for each material, plus the invention of ways to +include some of the advanced lighting features in the LVP vertex processing +without any performance penalty when not using them, led to the reunion of the +versions. Now you get LVP-style performance when not using some of the advanced +features, and only pay the performance penalty while rendering objects which +use them. + +A similar approach was also considered for the NOC configuration--to have the +microcode only compute the occlusion plane when it is enabled. This is +unfortunately infeasible. Register allocation / naming, as well as some +pipelined instructions leading into and out of lighting, are significantly +different between the occlusion plane and NOC versions of vertex processing. +This means the microcode would have to swap between four versions of lighting +code instead of just two, creating much more complexity with the overlay system +and IMEM size issues. Furthermore, the occlusion plane is typically not +enabled/disabled per object, but used when rendering as much of the game +contents as possible to maximize occluded objects. So it is reasonable to choose +the occlusion plane or NOC configuration on a per-frame or even per-scene basis. 
+ +## Octahedral Encoding for Packed Normals + +Previous F3DEX3 versions encoded packed normals into the unused 2 bytes of each +vertex using a variant of [octahedral encoding](https://knarkowicz.wordpress.com/2014/04/16/octahedron-normal-vector-encoding/). +Using this method, the normals were effectively as precise as with the vanilla +method of replacing vertex RGB with normal XYZ. However, the decoding of this +format was inefficient, partly due to the requirement to also support vanilla +normals at vanilla performance. Once HailToDodongo showed that the community was +willing to accept the moderate precision loss of the much simpler 5-6-5 bit +encoding in [Tiny3D](https://github.com/HailToDodongo/tiny3d), this was adopted +in F3DEX3. + ## Clipping minimal scanlines algorithm Earlier F3DEX3 versions included a modified algorithm for triangulating the diff --git a/f3dex3.s b/f3dex3.s index e371f43..1e502f9 100644 --- a/f3dex3.s +++ b/f3dex3.s @@ -143,6 +143,8 @@ COUNTER_C_FIFO_FULL equ 1 .endif +CFG_DEBUG_NORMALS equ 0 // Can manually enable here + // Only raise a warning in base modes; in profiling modes, addresses will be off .macro warn_if_base, warntext .if !ENABLE_PROFILING @@ -1264,7 +1266,6 @@ G_MODIFYVTX_handler: j do_moveword // Moveword adds cmd_w0 to $10 for final addr lbu cmd_w0, (inputBufferEnd - 0x07)(inputBufferPos) // offset in vtx, bit 15 clear -TODO check vtx 1 behavior G_TRIFAN_handler: // 17 li $1, 0x8000 // $ra negative = flag for G_TRIFAN G_TRISTRIP_handler: @@ -2597,8 +2598,9 @@ tris_end: tri_fan_store: lb $11, (inputBufferEnd - 7)(inputBufferPos) // Load vtx 1 + sh cmd_w1_dram, 5(rdpCmdBufPtr) // Store vtx N+2 and N+3 as 1 and 2 j tri_main - sb $11, 5(rdpCmdBufPtr) // Store vtx 1 + sb $11, 7(rdpCmdBufPtr) // Store vtx 1 as 3 // Converts the segmented address in cmd_w1_dram to the corresponding physical address segmented_to_physical: // 7 @@ -3235,6 +3237,19 @@ ltbasic_start_standard: vnop luv lVCI[0], (tempVpRGBA)(rdpCmdBufEndP1) // Load 
vertex color input ltbasic_after_start: + +.if CFG_DEBUG_NORMALS +.warning "Debug normals visualization is enabled" + vmudh vpNrmlX, vOne, vpNrmlX[3h] // Move X to all elements + vne $v29, $v31, $v31[1h] // Set VCC to 10111011 + vmrg vpNrmlX, vpNrmlX, vpNrmlY[3h] // X in 0, 4; Y to 1, 5 + vne $v29, $v31, $v31[2h] // Set VCC to 11011101 + vmrg vpNrmlX, vpNrmlX, vpNrmlZ[3h] // Z to 2, 6 + vmudh $v29, vOne, $v31[5] // 0x4000; middle gray + j vtx_return_from_lighting + vmacf vpRGBA, vpNrmlX, $v31[5] // 0x4000; + 0.5 * normal +.else // CFG_DEBUG_NORMALS + vmulf $v29, vpNrmlX, vLTC[4] // Normals X elems 3, 7 * first light dir X // lDIR <- (NOC: -, Occ: sOTM) lpv lDIR[0], (ltBufOfs + 8 - 2*lightSize)(ambLight) // Xfrmed dir in elems 4-6; temp reg @@ -3267,7 +3282,9 @@ ltbasic_post: jr lbAfter // vpRGBA <- lDIR vmrg vpRGBA, vpLtTot, lVCI // RGB = light, A = vtx alpha - + +.endif // CFG_DEBUG_NORMALS + // lbAfter = ltbasic_ao if AO else // lbPostAo = ltbasic_l2a if L2A else // ltbasic_packed if packed else @@ -3469,6 +3486,17 @@ ltadv_vtx_loop: // Even instruction vmudn vpWrlF, vpWrlF, $v31[1] // -1; negate world pos so add light/cam pos to it andi laSpecFres, vGeomMid, (G_LIGHTING_SPECULAR | G_FRESNEL_COLOR | G_FRESNEL_ALPHA) >> 8 vmadh vpWrlI, vpWrlI, $v31[1] // -1 + +.if CFG_DEBUG_NORMALS + vmudh $v29, vOne, $v31[5] // 0x4000; middle gray + li laTexgen, 0 + vmacf vpRGBA, vpWNrm, $v31[5] // 0x4000; + 0.5 * normal +ltadv_finish_light: +ltadv_loop: +ltadv_normals_to_regs: +ltadv_specular: +.else + ltadv_normals_to_regs: vmudh vpNrmlY, vOne, vpWNrm[1h] // Move normals to separate registers bnez laSpecFres, ltadv_spec_fres_setup @@ -3504,7 +3532,7 @@ ltadv_specular: // aDOT in/out, uses vpLtTot[3] and $11 as temps jr $ra vxor aDOT, aDOT, $v31[7] // = 0x7FFF - result -align_with_warning 8, "One instruction of padding before ltadv_post" +.align 8 ltadv_post: // aClOut <- vpWrlF // aAlOut <- vpWrlI @@ -3525,7 +3553,7 @@ ltadv_post: vcopy aClOut, vpLtTot // If no packed normals, 
base output is just light @@skip_novtxcolor: vmrg vpRGBA, aClOut, aAlOut // Merge base output and alpha output - beqz $11, @@skip_fresnel + beqz $11, ltadv_skip_fresnel ldv vpMdl[8], (VTX_IN_OB + 0 * inputVtxSize)(laPtr) // Vtx 1 Model pos + PN lsv aAOF[0], (vTRC_0100_addr - altBase)(altBaseReg) // Load constant 0x0100 to temp vabs aOAFrs, aOAFrs, aOAFrs // Fresnel dot in aOAFrs[0h]; absolute value for underwater @@ -3538,7 +3566,10 @@ ltadv_post: @@skip: vmrg vpRGBA, vpRGBA, aOAFrs[0h] // Replace color or alpha with fresnel vge vpRGBA, vpRGBA, $v31[2] // Clamp to >= 0 for fresnel; doesn't affect others -@@skip_fresnel: + +.endif // CFG_DEBUG_NORMALS + +ltadv_skip_fresnel: beqz laTexgen, ltadv_after_texgen suv vpRGBA, (VTX_IN_TC - 2 * inputVtxSize)(laPtr) // Vtx 2:1 RGBA // Texgen: aDOT still contains lookat 0 in elems 0-2, lookat 1 in elems 4-6 @@ -3659,13 +3690,6 @@ ltadv_normalize: // Normalize vector in aDPosI:vpWrlF i/f // aDIR <- aDotSc -CFG_DEBUG_NORMALS equ 0 // Can manually enable here -.if CFG_DEBUG_NORMALS -.warning "Debug normals visualization is enabled" - vmudh $v29, vOne, $v31[5] // 0x4000; middle gray - j TODO - vmacf vpRGBA, vpWNrm, $v31[5] // 0x4000; + 0.5 * normal -.endif ovl4_end: .align 8 diff --git a/gbi.h b/gbi.h index 50c6d6b..3c14377 100644 --- a/gbi.h +++ b/gbi.h @@ -2707,13 +2707,19 @@ _DW({ \ } /** * 5 Triangles in strip arrangement. Draws the following tris: - * v1-v2-v3, v3-v2-v4, v3-v4-v5, v5-v4-v6, v5-v6-v7 + * v1-v2-v3, v2-v4-v3, v3-v4-v5, v4-v6-v5, v5-v6-v7 * If you want to draw fewer tris, set indices to -1 from the right. - * e.g. to draw 4 tris, set v7 to -1; to draw 3 tris, set v6 to -1 - * Note that any set of 3 adjacent tris can be drawn with either SPTriStrip + * e.g. to draw 4 tris, set v7 to -1; to draw 3 tris, set v6 to -1. + * + * @note Any set of 3 adjacent tris can be drawn with either SPTriStrip * or SPTriFan. 
For arbitrary sets of 4 adjacent tris, four out of five of them * can be drawn with one of SPTriStrip or SPTriFan. The 4-triangle formation - * which can't be drawn with either command looks like the Triforce. + * which can't be drawn with either command looks like the Triforce--maybe + * F3DEX4 will support gsSPTriForce. :) + * + * @note The first index of each triangle drawn is different, so that in + * !G_SHADING_SMOOTH (flat shading) mode, the single color or single normal of + * each triangle can be set independently. */ #define gSPTriStrip(pkt, v1, v2, v3, v4, v5, v6, v7) \ _gSP5Triangles(pkt, G_TRISTRIP, v1, v2, v3, v4, v5, v6, v7) @@ -2724,8 +2730,8 @@ _DW({ \ _gsSP5Triangles(G_TRISTRIP, v1, v2, v3, v4, v5, v6, v7) /** * 5 Triangles in fan arrangement. Draws the following tris: - * v1-v2-v3, v1-v3-v4, v1-v4-v5, v1-v5-v6, v1-v6-v7 - * Otherwise works the same as SPTriStrip, see above. + * v2-v3-v1, v3-v4-v1, v4-v5-v1, v5-v6-v1, v6-v7-v1 + * Otherwise works the same as @see SPTriStrip. */ #define gSPTriFan(pkt, v1, v2, v3, v4, v5, v6, v7) \ _gSP5Triangles(pkt, G_TRIFAN, v1, v2, v3, v4, v5, v6, v7)
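
Note on the `gbi.h` hunk: the reordered index lists in the `gSPTriStrip` and `gSPTriFan` comments can be checked with a small host-side sketch. This is not part of `gbi.h` or the microcode. `Tri` and `expand_tris` are hypothetical names used only for illustration, and treating the first negative index as the end of the list is an assumption based on the "set indices to -1 from the right" remark. The index orders themselves are copied from the updated comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration type: one triangle as three vertex indices. */
typedef struct {
    int a, b, c;
} Tri;

/*
 * Hypothetical helper (not in gbi.h): expands the seven indices v1..v7,
 * passed as v[0]..v[6], into the triangles listed in the updated comments.
 *   strip: v1-v2-v3, v2-v4-v3, v3-v4-v5, v4-v6-v5, v5-v6-v7
 *   fan:   v2-v3-v1, v3-v4-v1, v4-v5-v1, v5-v6-v1, v6-v7-v1
 * Assumption: expansion stops at the first negative index.
 * Returns the number of triangles written to out (0 to 5).
 */
static int expand_tris(const int v[7], int is_fan, Tri out[5])
{
    int n = 0;

    assert(v != NULL && out != NULL);

    for (int i = 0; i < 5; i++) {
        /* Triangle i needs v[i + 2]; stop if it, or any earlier index, is missing. */
        if (v[0] < 0 || v[1] < 0 || v[i + 2] < 0) {
            break;
        }
        if (is_fan) {
            /* The shared vertex v1 goes last, so the first index is v2, v3, ... */
            out[n] = (Tri){ v[i + 1], v[i + 2], v[0] };
        } else if (i % 2 == 0) {
            /* Even strip triangles keep the natural order. */
            out[n] = (Tri){ v[i], v[i + 1], v[i + 2] };
        } else {
            /* Odd strip triangles swap the last two indices. */
            out[n] = (Tri){ v[i], v[i + 2], v[i + 1] };
        }
        n++;
    }
    return n;
}
```

In both arrangements the first index of triangle `i` is distinct (`v[i]` for the strip, `v[i + 1]` for the fan), which is the property the new `@note` relies on for flat shading.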