diff --git a/design-tradeoffs.html b/design-tradeoffs.html
index 62945c7..c1821db 100644
--- a/design-tradeoffs.html
+++ b/design-tradeoffs.html
@@ -111,8 +111,8 @@ Vertex processing RSP time for occlusion plane
In the occlusion plane F3DEX3 configuration, vertex processing is slower than in F3DEX2. If using this configuration and there is no occlusion plane or it is occluding almost nothing, the RSP will be slower with no other benefit.
However, when the occlusion plane is occluding even a few percent of the triangles in the scene, the situation changes. This saves RDP time, and most games are RDP bound, so this trades off RSP time for RDP time and makes the game faster overall. Plus, RSP time is also saved for the tris which are not drawn, which can approximately cancel out the extra RSP time for computing the occlusion plane for all vertices.
The following commands are moved to Overlay 3 in F3DEX3 to save IMEM space. This means that code will have to be loaded from DRAM to run them if Overlays 2 or 4 (for lighting) happen to be loaded already.
The following commands are moved to Overlay 2 or 3 in F3DEX3 to save IMEM space. This means that code will have to be loaded from DRAM to run them if a different overlay happens to be loaded already.
SPMatrix
SPPopMatrix*
SPDma*

However:

SPDma* is rarely used except at startup for HLE detection.
-SPMemset is a new F3DEX3 command which can improve performance. Plus, it is typically run shortly after render start, when Overlay 3 is already in IMEM.
+SPMemset is a new F3DEX3 command which can improve performance. Plus, it is typically run shortly after render start, when Overlay 3 (which contains it) is already in IMEM.

So there is not a significant practical performance impact from these changes.
Segment 0 is now reserved: ensure segment 0 is never set to anything but 0x00000000. In F3DEX2 and prior this was only a good idea (and SM64 and OoT always follow this); in F3DEX3 segmented addresses are now resolved relative to other segments. That is, gsSPSegment(0x08, 0x07001000) sets segment 8 to the base address of segment 7 with an additional offset of 0x1000. So for correct behavior when supplying a direct-mapped or physical address such as 0x80101000, segment 0 must always be 0x00000000 so that this address resolves to e.g. 0x101000 as expected in this example.
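The segment rule above can be illustrated with a small model. This is only a sketch: the 16-entry table, the 4-bit segment index in bits 24-27, the 24-bit offset, and the base address chosen for segment 7 are assumptions for illustration, and the function names are hypothetical rather than taken from the microcode.

```python
# Simplified model of segmented address resolution (assumed layout:
# segment index in bits 24-27, offset in the low 24 bits).
OFFSET_MASK = 0x00FFFFFF


def resolve(segments, address):
    """Resolve an address against the segment table."""
    index = (address >> 24) & 0x0F
    return segments[index] + (address & OFFSET_MASK)


def set_segment_f3dex3(segments, number, value):
    """Model of the F3DEX3 behavior: the new base is itself resolved
    relative to whichever segment `value` refers to."""
    segments[number] = resolve(segments, value)


segments = [0] * 16
segments[7] = 0x00200000  # hypothetical base for segment 7

# gsSPSegment(0x08, 0x07001000): segment 8 = segment 7 base + 0x1000
set_segment_f3dex3(segments, 0x08, 0x07001000)

# A direct-mapped address selects segment 0, which must hold zero
physical = resolve(segments, 0x80101000)
```

In this model, `0x80101000` resolves to `0x101000` only because `segments[0]` is zero; any other value in segment 0 would be added to every direct-mapped address.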
In F3DEX2, the RSP time for drawing non-textured tris was significantly lower than for textured tris, by skipping a chunk of computation for the texture coefficients if they were disabled. In F3DEX3, no computation is skipped when textures are disabled. However, almost all materials use textures, and F3DEX3 is a little faster at drawing textured tris than F3DEX2. Plus, F3DEX3 still does not send the texture coefficients if they are disabled, saving DRAM access time for RSP -> FIFO and FIFO -> RDP. RDP time savings from avoiding loading a texture are unaffected of course.
+In F3DEX2, the RSP time for drawing non-textured tris was significantly lower than for textured tris, by skipping a chunk of computation for the texture coefficients if they were disabled. In F3DEX3, no computation is skipped when textures are disabled. However, in practice almost all materials use textures, and F3DEX3 is faster at drawing textured tris than F3DEX2. Plus, F3DEX3 still does not send the texture coefficients if they are disabled, saving DRAM access time for RSP -> FIFO and FIFO -> RDP. RDP time savings from avoiding loading a texture are unaffected of course.
In F3DEX2, the microcode checks whether the CPU has requested that it yield (to run the audio microcode) before running every display list command. F3DEX3 now performs this check every time the input buffer is refilled, which is typically once every 21 commands. The amount by which this delays the start of the audio microcode is typically very small, and worst case during normal conditions would be a few hundred microseconds. However, if the RDP FIFO is full during this time, the microcode will have to wait for the RDP to make progress through its workload to free up space for the outputs of the RSP commands. This will slow down the RSP to the RDP's speed, and since triangles can be arbitrarily large on screen, this can theoretically cause huge stalls. If you ever encounter this in practice, please contact Sauraen.
+G_FOG in the geometry mode or executing SPFogFactor or SPFogPosition–between loading verts and drawing tris with those verts will lead to incorrect fog values for those tris. In F3DEX2, the fog settings at vertex load time would always be used, even if they were changed before drawing tris.
G_RDPHALF_1, which is used to hold state during some display list macros which are actually two 8-byte commands. This change is not noticeable when using standard GBI commands, only if something highly custom has been set up.
SPTexture and SPFogFactor state is corrupted when loading and returning from another microcode (S2DEX). In F3DEX2, it would be reinitialized to default values; in F3DEX3, it is left as garbage values.
SPMemset command fills a specified RDRAM region with a repeated 16-bit value. This can be used for clearing the Z buffer or filling the framebuffer or the letterbox with a solid color faster than the RDP can in fill mode. Practical performance may vary due to scheduling constraints.
SPFlush command can ensure that the RDP starts clearing the framebuffer as soon as possible during the frame, instead of waiting a short time for further RSP processing.
-NOC configuration) are slightly faster than in F3DEX2.
+NOC configuration) are faster than in F3DEX2, sometimes much faster.
F3DEX3_NOC matches or beats the RSP performance of F3DEX2 on all critical paths in the microcode, including command dispatch, vertex processing, and triangle processing. Then, the RDP and memory traffic performance improvements of F3DEX3–56 vertex buffer, auto-batched rendering, etc.–should further improve overall game performance from there.
-These are cycle counts for many key paths in the microcode. Lower numbers are better. The timings are hand-counted taking into account all pipeline stalls and all dual-issue conditions. Instruction alignment after branches is usually taken into account, but in some cases it is assumed to be optimal.
+These are cycle counts for many key paths in the microcode. Lower numbers are better. The timings are hand-counted taking into account all pipeline stalls and all dual-issue conditions. Instruction alignment after branches is usually taken into account (especially in F3DEX3), but in some cases it is assumed to be optimal.
All numbers assume default profiling configuration. Tri numbers assume texture, shade, and Z, and not flushing the buffer. Tri numbers are measured from the first cycle of the command handler inclusive, to the first cycle of whatever is after $ra exclusive; this is in order to capture an extra stall cycle in F3DEX2 when finishing a triangle and going to the next command.
Vertex numbers assume no extra F3DEX3 features (packed normals, ambient occlusion, etc.). These features are listed below as the number of extra cycles the feature costs per vertex pair. ltbasic is the codepath when point lighting, specular, and Fresnel are disabled; ltadv is the codepath with any of these enabled. Timings are listed separately for each number of lights because some implementations are pipelined for two lights, so going from an even to an odd number of lights adds a different time than vice versa.
| | F3DEX2 | F3DEX3_NOC | F3DEX3 |
|---|---|---|---|
-| Command dispatch | 12 | 12 | 12 |
+| Command dispatch | 12 | 10 | 10 |
-| Small RDP command | 14 | 5 | 5 |
+| Small RDP command | 14 | 4 | 4 |
-| Only/2nd tri to offscreen | 27 | 25 | 25 |
+| Only/2nd tri to offscreen | 27 | 20 | 20 |
-| 1st tri to offscreen | 28 | 26 | 26 |
+| 1st tri to offscreen | 28 | 21 | 21 |
-| Only/2nd tri to clip | 32 | 30 | 30 |
+| Only/2nd tri to clip | 32 | 25 | 25 |
-| 1st tri to clip | 33 | 31 | 31 |
+| 1st tri to clip | 33 | 26 | 26 |
-| Only/2nd tri to backface | 38 | 36 | 36 |
+| Only/2nd tri to backface | 38 | 31 | 31 |
-| 1st tri to backface | 39 | 37 | 37 |
+| 1st tri to backface | 39 | 32 | 32 |
-| Only/2nd tri to degenerate | 42 | 38 | 38 |
+| Only/2nd tri to degenerate | 42 | 33 | 33 |
-| 1st tri to degenerate | 43 | 39 | 39 |
+| 1st tri to degenerate | 43 | 34 | 34 |
-| Only/2nd tri to occluded | Can't | Can't | 42 |
+| Only/2nd tri to occluded | Can't | Can't | 37 |
-| 1st tri to occluded | Can't | Can't | 43 |
+| 1st tri to occluded | Can't | Can't | 38 |
-| Only/2nd tri to draw | 172 | 156 | 158 |
+| Only/2nd tri to draw | 172 | 149 | 151 |
-| 1st tri to draw | 173 | 157 | 159 |
+| 1st tri to draw | 173 | 150 | 152 |
-| Tri snake | Can't | * | * |
+| Tri snake | Can't | 10/11* | 10/11* |
| Vtx before DMA start | 16 | 17 | 17 |
| Light dir xfrm, 9 dir lts | Can't | 196 | 196 |
For this section, we assume almost all tris are contained in very long snakes, so the overhead of starting and ending snakes is negligible. This overhead is discussed in the next section.
-We are assuming that the same set of tris is being drawn with or without snakes. Thus, cycles from tri_main_from_snake through the instruction after the return exclusive are not counted here, as they are the same regardless of which method is being used.
For a pair of tris drawn without snakes, i.e. with a single SP2Triangles command, the cycles are:
tri_main_from_snake: 5
tri_main_from_snake: 4

For a pair of tris which are part of a long snake, the cycles are:

tri_main_from_snake: 11

However, there's also the memory bandwidth savings. The SP2Triangles command is 8 bytes and the two tris in a long snake are 2 bytes, so switching to snake saves 6 bytes of bandwidth. Testing has shown that RSP DMAs on average transfer about 2.2 bytes per cycle, though it depends on the length. So this is a savings of about 2.7 cycles of RDRAM / RDP time. Since the DMAs loading this data are input buffer loads, and the RSP stalls waiting for input buffer loads (it does not do useful work during this time), this is also 2.7 cycles of RSP time. This offsets the 1 extra cycle of processing the tri pair above.
Therefore, switching to snake (assuming very long snakes) saves about 2.7 cycles of RDRAM / RDP time and 1.7 cycles of RSP time per two tris, or about 0.9 RSP cycles and 1.4 RDRAM cycles per tri.
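The per-pair arithmetic above can be reproduced as follows. This is a sketch using only the figures quoted in the text (8 and 2 bytes, about 2.2 bytes per cycle, 1 extra RSP cycle); the function and parameter names are made up for illustration.

```python
BYTES_PER_CYCLE = 2.2  # measured average RSP DMA throughput quoted above


def long_snake_savings_per_pair(sp2_bytes=8, snake_bytes=2, extra_rsp_cycles=1):
    """Return (rsp_cycles_saved, rdram_cycles_saved) for two tris
    moved from one SP2Triangles command into a very long snake."""
    rdram_saved = (sp2_bytes - snake_bytes) / BYTES_PER_CYCLE
    # The RSP stalls on input buffer loads, so it gains the same DMA
    # time but pays the extra processing cycle for the pair.
    rsp_saved = rdram_saved - extra_rsp_cycles
    return rsp_saved, rdram_saved


rsp_pair, rdram_pair = long_snake_savings_per_pair()
rsp_per_tri, rdram_per_tri = rsp_pair / 2, rdram_pair / 2
```

Rounded to one decimal place, this gives 1.7 RSP and 2.7 RDRAM cycles per pair, or 0.9 and 1.4 per tri.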
-Since a SPTriSnake command encodes 5 triangles, for comparison to SP2Triangles we will consider the overhead for 10 triangles total / two snake starts.
For SPTriSnake, this is 2 x (12 cycles command dispatch + 4 cycles snake initialization + 5 tris x 11 cycles per tri as discussed above) = 142 RSP cycles. And it is 16 bytes of loads = 7.3 cycles of RDRAM / RDP time and stall RSP time. So the total cost is 149.3 RSP and 7.3 RDRAM cycles.
For SP2Triangles, this is 5 x (21 cycles as discussed above) = 105 RSP cycles. And it is 40 bytes of loads = 18.2 cycles of RDRAM / RDP time and stall RSP time. So the total cost is 123.2 RSP and 18.2 RDRAM cycles.
But drawing those 10 tris as part of very long snakes would have saved 13.5 RDRAM cycles and 8.5 RSP cycles. So the relative cost of drawing these tris as two start-of-snakes instead of in very long snakes is 34.6 RSP cycles and 2.6 RDRAM cycles. Thus the cost of each start-of-snake relative to long snakes is 17.3 RSP cycles and 1.3 RDRAM cycles.
-Ending a snake costs 12 cycles of RSP time and has no direct impact on memory traffic. However, calculating the overall performance is more complicated: the snake can end after 1-8 bytes of the SPContinueSnake command, and the remaining bytes are "wasted" in that they do not contribute to drawing tris with memory bandwidth savings.
From a mesh optimization standpoint, this is not an issue. If you have a snake which has filled 8 bytes of the previous SPContinueSnake command, and you have another triangle to draw, there are only two cases. If that tri can't be appended to the snake, you have to draw it with a SP1Triangle command either way, so there is no performance difference. If it can be appended to the snake, doing so will take 8 bytes of memory traffic–the same as the SP1Triangle command. The snake end penalty will have to be paid whether before or after this tri. And it's 11 RSP cycles to draw one more tri in an existing snake, whereas the command dispatch plus second tri code for SP1Triangle is 16 cycles. So it's better to continue a snake than to stop it early and use non-snake commands, even if this leads to a mostly empty SPContinueSnake command. Of course, if you can fill up even more tris in the command, the performance benefit increases.
Assuming snake lengths are uniformly distributed, on average a snake will end after 4.5 bytes (the same number of triangles) of a SPContinueSnake command. In this case, the command will take 4.5 tris x 11 cycles per tri + 12 cycle end snake penalty = 61.5 RSP cycles, and 8 bytes of memory traffic = 3.6 RDRAM cycles. If these 4.5 tris were instead drawn with SP2Triangles commands, that would be 2.25 commands = 47.3 RSP cycles and 18 bytes = 8.2 RDRAM cycles. Thus on average, the snake end costs 14.2 RSP cycles and saves 4.6 RDRAM cycles compared to SP2Triangles commands. But drawing those 4.5 tris as part of very long snakes would have saved 3.9 RSP cycles and 6.1 RDRAM cycles. So the average cost of ending a snake relative to very long snakes is 18.1 RSP and 1.5 RDRAM cycles.
Suppose there are 4000 tris on screen. Suppose that 90% of them have been encoded with snakes–the rest are disconnected single tris or tri pairs (quads). That 10% are then encoded with SP2Triangles commands, which is the same performance with or without snakes, so we ignore those tris, and there are 3600 "snakeable" tris in the scene.
Suppose that the average snake length is 16, to account for some objects with more contiguous tris with the same material, and others with smaller disjoint parts. Thus, for 3600 tris, there are 225 snakes.
-Switching the 3600 tris from SP2Triangles commands to long snakes saves 4860 RDRAM cycles and 3060 RSP cycles. However, the 225 snake starts and ends cost 630 RDRAM and 7965 RSP cycles relative to this. So the total performance change of switching to snakes in this case is that the RDRAM / RDP goes faster by 4230 cycles = 68 us, but the RSP goes slower by 4905 cycles = 78 us.
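The scene estimate above can be re-derived with a short calculation. This sketch takes the per-tri, per-start, and per-end costs from the preceding paragraphs as given, and assumes the 62.5 MHz RCP clock for the conversion to microseconds; the names are illustrative only.

```python
RCP_MHZ = 62.5  # assumed RSP/RDP clock, used to convert cycles to microseconds

# Per-tri savings of very long snakes vs SP2Triangles (RSP, RDRAM cycles)
SAVE_PER_TRI = (0.85, 1.35)
# Cost of each snake start and end relative to very long snakes
START_COST = (17.3, 1.3)
END_COST = (18.1, 1.5)


def scene_estimate(snakeable_tris=3600, avg_snake_len=16):
    """Return the net (rsp_cycles, rdram_cycles) saved; negative means slower."""
    snakes = snakeable_tris / avg_snake_len
    rsp = snakeable_tris * SAVE_PER_TRI[0] - snakes * (START_COST[0] + END_COST[0])
    rdram = snakeable_tris * SAVE_PER_TRI[1] - snakes * (START_COST[1] + END_COST[1])
    return rsp, rdram


rsp_net, rdram_net = scene_estimate()
rsp_us, rdram_us = rsp_net / RCP_MHZ, rdram_net / RCP_MHZ
```

With the default arguments this yields a net RDRAM saving of about 4230 cycles and a net RSP loss of about 4905 cycles, matching the figures in the paragraph.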
With the recent F3DEX3 updates bringing significant RSP time savings in command dispatch and triangle draw, triangle snakes are unfortunately no longer competitive in RSP time.
+Suppose we have two tris which are offscreen. If drawn with SP2Triangles, this is 10 cycles for command dispatch, 21 cycles to cull the first tri, and 20 cycles to cull the second, for a total of 51 cycles. If drawn as part of a long triangle snake, the triangle snake processing adds 10 or 11 cycles relative to the SP2Triangles first or second triangle respectively. So this is 31 cycles to cull each triangle, for a total of 62 cycles.
It gets worse for snakes when counting the overhead of starting and ending a snake, which have also gotten worse with the recent changes bringing triangle performance improvements. I used to have a long discussion here computing estimated performance for switching to snakes, but the numbers have all changed and they were imprecise to begin with. The upshot is that for a typical scene, switching everything from SP2Triangles to snakes might save about 70 us of RDRAM/RDP time but cost about 400 us of RSP time.
However, note that in F3DEX2, SP2Triangles to two offscreen triangles is 12+28+27 = 67 cycles. F3DEX3 is so much faster than F3DEX2 that even the performance penalty of snakes doesn't outweigh this.
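The offscreen-tri comparison can be checked directly against the cycle table. The sketch below copies the relevant counts from the table (using the updated F3DEX3 values) and the 10/11 cycle snake overhead; the dictionary keys and function names are illustrative only.

```python
# Cycle counts copied from the table (updated F3DEX3 values).
F3DEX2 = {"dispatch": 12, "first_offscreen": 28, "second_offscreen": 27}
F3DEX3 = {"dispatch": 10, "first_offscreen": 21, "second_offscreen": 20}
# Extra snake cycles relative to the SP2Triangles first / second tri.
SNAKE_EXTRA = (10, 11)


def sp2triangles_offscreen(costs):
    """Cycles for one SP2Triangles command whose two tris are both offscreen."""
    return costs["dispatch"] + costs["first_offscreen"] + costs["second_offscreen"]


def snake_offscreen_per_tri(costs, extra=SNAKE_EXTRA):
    """Cycles to cull each of two offscreen tris inside a long snake."""
    return (costs["first_offscreen"] + extra[0],
            costs["second_offscreen"] + extra[1])


f3dex3_sp2 = sp2triangles_offscreen(F3DEX3)
f3dex2_sp2 = sp2triangles_offscreen(F3DEX2)
snake_each = snake_offscreen_per_tri(F3DEX3)
```

Here `f3dex3_sp2` is 51, `f3dex2_sp2` is 67, and each tri in the snake costs 31 cycles to cull.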
For an OoT codebase, only a few minor changes are required to use F3DEX3. However, more changes are recommended to increase performance and enable new features.
How to modify the microcode in your HackerOoT based romhack (note that this is already done in HackerOoT, so this is provided as a guide for other games):
Both OoT and SM64:
{0x28, 0x28, 0x28} must be changed to {0x49, 0x49, 0x49}, or everything will be too dark. The former vector is not properly normalized, but F3D through F3DEX2 normalize light directions in the microcode, so it doesn't matter with those microcodes. The two lighting codepaths in F3DEX3 treat light directions and vertex normals differently: the fast one works like F3DEX2, but the slow one normalizes vertex normals after transforming them and does not modify light directions. Thus in this case, the light directions must already be normalized.

SPLookAtX and SPLookAtY to use SPLookAt instead (this is only a few lines change). Also remove any code which writes SPClipRatio or SPForceMatrix–these are now no-ops, so you might as well not write them.

#define REQUIRE_SEMICOLONS_AFTER_GBI_COMMANDS (at the top of, or before including, the GBI) for a more modern, OoT-style codebase where uses of GBI commands require semicolons after them. SM64 omits the semicolons sometimes, e.g. gSPDisplayList(gfx++, foo) gSPEndDisplayList(gfx++);. If you are using -Wpedantic, using this define is required.

#define NO_SYNCS_IN_TEXTURE_LOADS (at the top of, or before including, the GBI) and fix any crashes or graphical issues that arise. Display lists exported from fast64 already do not contain these syncs, but vanilla display lists or custom ones using the texture loading multi-command macros do. Disabling the syncs saves a few percent of RDP cycles for each material setup; what percentage this is of the total RDP time depends on how many triangles are typically drawn between each material change. For more information, see the GBI documentation near this define.

SPSetLights, instead of one-at-a-time with repeated SPLight commands. Note that if you are using a pointer (dynamically allocated) rather than a direct variable (statically allocated), you need to dereference it; see the docstring for this macro in the GBI.

Each of these changes is required if you want to use the respective new feature, but is not necessary if you are not using it.
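The light direction change from {0x28, 0x28, 0x28} to {0x49, 0x49, 0x49} mentioned above is just a normalization to signed 8-bit range. A minimal sketch, assuming a target length of 127 (an assumption that happens to reproduce 0x49; the helper name is hypothetical and not part of the GBI):

```python
import math


def normalize_light_dir(x, y, z, scale=127):
    """Rescale a light direction so its length is about `scale`
    (assumed here to be the largest signed 8-bit magnitude)."""
    length = math.sqrt(x * x + y * y + z * z)
    if length == 0:
        raise ValueError("light direction must be non-zero")
    return tuple(int(round(c / length * scale)) for c in (x, y, z))


fixed = normalize_light_dir(0x28, 0x28, 0x28)
```

The same helper could be run over any other hard-coded light directions in a codebase to check whether they are already normalized.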
These features were present in earlier F3DEX3 versions, but have been removed.
-Early versions of F3DEX3 were developed exclusively in an OoT context, where scenes are almost always RDP bottlenecked. Thus, these versions focused on reducing RDP time and adding new visual features at the cost of RSP time.
Later, Kaze Emanuar became interested in using F3DEX3 in Return to Yoshi's Island due to the RDP performance improvements. However, due to the intense optimization work he had done, his game was relatively balanced in RDP / RSP time. Thus, when he tried F3DEX3, the decrease in RDP time and increase in RSP time made the game slower overall, which was not acceptable.
As a result, the LVP configuration of F3DEX3 was developed, to bring F3DEX2-style vertex processing in exchange for dropping some of the advanced lighting features (which Kaze was not going to use anyway due to HLE compatibility). This was implemented, and after much optimization across the entire microcode, F3DEX3_LVP_NOC became slightly faster than F3DEX2 on both RDP and RSP. This caused Kaze to immediately adopt this configuration of F3DEX3 for Return to Yoshi's Island.
Unfortunately, this meant that if developers wanted to use the advanced lighting features of F3DEX3 in any part of their project, they were stuck with the much slower non-LVP configuration of F3DEX3. The desire to have the microcode automatically swap versions for each material, plus the invention of ways to include some of the advanced lighting features in the LVP vertex processing without any performance penalty when not using them, led to the reunion of the versions. Now you get LVP-style performance when not using some of the advanced features, and only pay the performance penalty while rendering objects which use them.
A similar approach was also considered for the NOC configuration–to have the microcode only compute the occlusion plane when it is enabled. This is unfortunately infeasible. Register allocation / naming, as well as some pipelined instructions leading into and out of lighting, are significantly different between the occlusion plane and NOC versions of vertex processing. This means the microcode would have to swap between four versions of lighting code instead of just two, creating much more complexity with the overlay system and IMEM size issues. Furthermore, the occlusion plane is typically not enabled/disabled per object, but used when rendering as much of the game contents as possible to maximize occluded objects. So it is reasonable to choose the occlusion plane or NOC configuration on a per-frame or even per-scene basis.
-Previous F3DEX3 versions encoded packed normals into the unused 2 bytes of each vertex using a variant of octahedral encoding. Using this method, the normals were effectively as precise as with the vanilla method of replacing vertex RGB with normal XYZ. However, the decoding of this format was inefficient, partly due to the requirement to also support vanilla normals at vanilla performance. Once HailToDodongo showed that the community was willing to accept the moderate precision loss of the much simpler 5-6-5 bit encoding in Tiny3D, this was adopted in F3DEX3.
-Earlier F3DEX3 versions included a modified algorithm for triangulating the polygon which was formed as the result of clipping. This algorithm broke up the polygon into triangles in such a way that the fewest scanlines were accessed multiple times, leading to maximum performance on the RDP. For example, if the polygon was a diamond shape, this algorithm would always cut it horizontally–leading to few or no scanlines being touched by both the top and bottom tris–as opposed to vertically, leading to all scanlines being touched by the left and right tris.
In testing, this was able to save a few hundred microseconds at best in scenes with many large clipped tris. However, this feature has been removed, because it was found to cause undesirable visual artifacts. Other changes to clipping were experimented with in the past, and ultimately not included. These are not due to a bug or design issue with the microcode, but a fundamental limitation of the RDP: vertex colors are interpolated in screen space without perspective correction. In other words, the shade colors of ANY triangle not flat to the camera are slightly wrong, regardless of which microcode is in use. The same world space portion of the triangle will have a slightly different color depending on how the camera is rotated around it. The issues with clipping are a result of this.
@@ -132,10 +132,10 @@ Color interpolation example
Note that BOTH of these are wrong: the correct value for that pixel is 128, because all points on the horizontal midline of the original triangle are color 128. The N64 can't draw the correct triangle here–its colors would have to change nonlinearly along an edge.
The problem with the clipping minimal scanlines algorithm is that it would switch between cases C and D here based on which diagonal had a larger Y component. In other words, if the camera moved slightly, the choice of triangulation might change, causing the middle of the polygon to visibly change color. This was visible on large scene triangles with lighting: as you walked around, the colors would have slight but abrupt changes, which look wrong/bad.
The best we can do, which is what all previous F3D family microcodes did and F3DEX3 does now, is to triangulate in a consistent way, based on the winding of the input triangles. The results are still wrong, but they're wrong the same way every frame, so there are no abrupt changes visible.
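The dependence of the interpolated color on the choice of diagonal can be shown with plain screen-space barycentric interpolation. The quad coordinates and corner colors below are hypothetical values chosen for illustration; they are not the values from the figure.

```python
def barycentric_color(p, tri, colors):
    """Linearly interpolate a color at point p inside a screen-space
    triangle, with no perspective correction (as the RDP does for shade)."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    det = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    w0 = ((y1 - y2) * (p[0] - x2) + (x2 - x1) * (p[1] - y2)) / det
    w1 = ((y2 - y0) * (p[0] - x2) + (x0 - x2) * (p[1] - y2)) / det
    w2 = 1.0 - w0 - w1
    return w0 * colors[0] + w1 * colors[1] + w2 * colors[2]


# Hypothetical clipped quad A-B-C-D on screen, with one shade value per corner
A, B, C, D = (0, 0), (10, 0), (10, 10), (0, 10)
cA, cB, cC, cD = 0, 100, 255, 100
center = (5, 5)

# Split along diagonal A-C: the center lies on the edge A-C
color_split_ac = barycentric_color(center, (A, B, C), (cA, cB, cC))
# Split along diagonal B-D: the center lies on the edge B-D
color_split_bd = barycentric_color(center, (A, B, D), (cA, cB, cD))
```

The same pixel comes out as 127.5 with one triangulation and 100 with the other, which is the kind of abrupt change that appears if the diagonal choice flips between frames.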
-Earlier F3DEX3 versions included attribute offsets for vertex Z as well as ST. By setting this to -2 and drawing an opaque tri, the tri would appear like a decal, but with no Z-fighting. This has been removed and replaced with the decal fix, which is automatic and does not require any special setup in the display list.
-SPTriStrip and SPTriFan
These commands are still supported in the GBI, but as special cases of SPTriSnake with specific sets of directions. In addition to covering both of these commands, the SPTriSnake command can draw the mirror-imaged 4-triangle strip which SPTriStrip could not (without inefficiency), as well as arbitrarily long triangle strips, fans, and other snake shapes via SPContinueSnake.
The snake need not be constrained to either shape; it can turn left or right in any combination. This can be thought of as concatenating triangle strips and fans. (Original photo by Al d'Vilas, free-use licensed)
A snake can be arbitrarily long. It starts with a SPTriSnake command, which may be followed by one or more SPContinueSnake macros which encode continued indices. The latter are not commands (there's no command byte)–they are just more index data sequentially in the display list. In other words, the display list input buffer is the storage for the indices data. The microcode correctly handles the case when the snake runs off the end of the input buffer and the input buffer needs to be refilled. The refilled data starts from the start of the input buffer, as if it were regular commands; this matters for the hints system.
The goal of any accelerated triangles system in a microcode is to reduce the memory bandwidth used for loading triangle indices. The actual tris drawn are the same regardless of how their indices are encoded in the display list, so we do not consider the performance of actually drawing the tris, only loading their indices.
An SPTriSnake command by itself contains 7 vertices and draws 5 triangles (because the first triangle needs two extra vertices to start itself). An SPContinueSnake macro contains 8 vertices and draws 8 tris, in each case continuing the existing snake. The F3D family microcodes before F3DEX3 only provided SP1Triangle and SP2Triangles commands, so any snake of 3 or more tris is more efficient than F3DEX2 and older microcodes. The efficiency gain is up to 4x (2 tris -> 8 tris per 8-byte macro), though in typical meshes the gain is expected to be 2-3x.
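The index bandwidth comparison can be sketched as below. It counts only 8-byte display list words, uses the 5 and 8 tris per word stated above, and assumes that ending the snake needs no extra index slot (an assumption, since the end-of-snake encoding is not described here). The function names are illustrative.

```python
import math

WORD_BYTES = 8  # every GBI command or SPContinueSnake macro is 8 bytes


def snake_bytes(num_tris):
    """Bytes of index data for one snake: SPTriSnake covers the first 5 tris,
    each SPContinueSnake covers up to 8 more (no end marker modeled)."""
    extra_tris = max(0, num_tris - 5)
    return WORD_BYTES * (1 + math.ceil(extra_tris / 8))


def sp2triangles_bytes(num_tris):
    """Bytes of index data when the same tris use SP2Triangles commands."""
    return WORD_BYTES * math.ceil(num_tris / 2)


def gain(num_tris):
    """Bandwidth ratio of SP2Triangles to snake encoding."""
    return sp2triangles_bytes(num_tris) / snake_bytes(num_tris)
```

For example, under these assumptions `gain(2)` is 1.0, `gain(3)` is 2.0, and `gain(13)` is 3.5, consistent with snakes of 3 or more tris being more efficient.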
The key advantage of a triangle snake over a traditional triangle strip is that it better exploits the vertex cache.
In any microcode, the vertex cache is of a fixed size, and any continuous subset of it can be reloaded. Loading a vertex costs 16 bytes of memory throughput plus some RSP time to perform transformation and lighting. So, the goal of vertex cache optimization is to reduce the number of vertices reloaded (loaded a second time when they had been loaded in the past but are no longer loaded). A secondary goal is to reduce the number of vertices kept in the cache between loads, as doing so increases the average load size, which decreases the relative overhead of the loads.
@@ -145,7 +145,7 @@ Part of a mesh showing subdivision into 4 strips or 1 snake
If vertex loads are optimized for rendering with triangle strips, long "1D" sections of meshes will be loaded, which does not exploit the "2D" spatial locality of the vertex cache. This is especially inefficient if the export tool always reloads vertices instead of keeping them in the cache: in the case pictured, the entire top row of selected vertices will be immediately reloaded when rendering the next strip up.
-Microcodes compatible with libultra–including the F3D family, S2DEX, JPEG decoder, etc.–are required to listen for a flag from the CPU, and if it is set, to save their state and stop executing. This allows the higher-priority audio microcode to be swapped in and run, which must occur soon after every VI. The audio microcode may take a few ms to run, so if it is delayed by more than a few ms, there is a risk of audio corruption.
Any command which results in RDP commands being enqueued–triangle or rectangle draws, texture loads, CC setting changes, etc.–can cause the current temporary RDP command buffer in DMEM to be flushed to the FIFO in RDRAM. If the latter is full, the command will wait until space is available. In an extreme case, the RDP may have to clear the framebuffer and depth buffer before making progress and opening up space in the FIFO, which can take several ms. Thus, the processing of most display list commands could theoretically cause the RSP to wait–delaying the yield–by several ms.
@@ -158,11 +158,11 @@ What about yielding?
Since each pair of tris drawn in the snake require a buffer flush, and the tris the RSP is enqueuing are the same data size as the tris the RDP is rendering, the RSP will have to wait after each pair of tris in the snake for capacity in the RDP buffer before it can continue with the snake. In other words, the snake speed is limited by the RDP drawing speed for tris much earlier in the frame. In this example, the RSP will not finish the snake and respond to the yield for 10 ms, delaying the audio microcode too long and causing audio corruption.
This is still an unlikely case though:
A future version of F3DEX3 could allow the snake command to yield in the middle. This has not been implemented yet because it is very difficult to validate. Yields are rare relative to display list commands (typically 1-2 of the former and many thousands of the latter per frame). And, until we have a robust F3DEX3 mesh optimizer and a game where most things are drawn with snakes (i.e. few vanilla assets left), snakes will also be rare in the display list. So it will be hard to know whether the yield-during-snake codepath is even being run, let alone whether it is correct in all cases.
-F3DEX3 now checks for yields whenever the input buffer is refilled, not before every command as in F3DEX2. When a triangle snake extends across the boundary of the input buffer, a yield can occur, and F3DEX3 correctly suspends and resumes the triangle snake in this case. So, while triangle snakes can be unlimited in length, because the input buffer is 21 commands = 168 bytes, there is guaranteed to be a yield check at least once every 168 triangles. (Any snake of 8 tris or more can potentially cross the input buffer end and therefore be interrupted.) In practice this guarantee makes little difference, though, as snakes will not be longer than about 110 tris due to the vertex buffer size.
+Tiny3D, the homebrew microcode, uses triangle strips as its accelerated triangles command. F3DEX3 triangle snakes have several advantages compared to Tiny3D's triangle strips:
Light structure.
+Note: the weird order is for the DMEM alignment benefit of the microcode.
+