STEP2 [Introduction to N64 Programming] - Chapter 4 Tips and Techniques

4-2 Tuning Performance

4-2-1 Double Buffering the Display List

Usually only one buffer is used to hold the display list that is constructed by the CPU and executed by the RCP. However, if you use two display list buffers instead of one, the RCP can execute the drawing process for one frame at the same time as the CPU is constructing the next frame's display list. This technique is called "double buffering." Use this technique to speed up your game's graphics if the graphics are complex and you have plenty of memory.

-A display list is a graphics command list that holds the commands necessary to render one frame of graphics. The CPU constructs the display list and then passes it to the RCP to execute it.

The longer it takes the CPU to construct each display list, the more speed you gain by using the double-buffering technique. However, remember that although double buffering makes it appear as though the display list construction time is zero, it really is not. The CPU still has to construct each display list, and while it does so it is not available for other processes that might require it. Therefore, you should still devise an efficient algorithm to minimize the time needed to construct the display list. This can ultimately lead to faster overall game processing. The weak point of this method is that it requires twice as much memory for display lists, so if you are short on memory, this technique may not work for you.

Figure 4-2-1 Double Buffering the Display List
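
The buffer switching described above can be sketched in plain C. This is only an illustration under stated assumptions: DlCommand is a hypothetical stand-in for the SDK's 64-bit display list command type, DL_MAX_COMMANDS is an arbitrary capacity, and the function names are made up for this example. Starting the graphics task and waiting for the RCP are left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for the SDK's 64-bit display list command type. */
typedef uint64_t DlCommand;

#define DL_BUFFER_COUNT 2      /* one buffer being built, one being executed */
#define DL_MAX_COMMANDS 1024   /* assumed capacity; size it for your scenes */

static DlCommand dl_buffers[DL_BUFFER_COUNT][DL_MAX_COMMANDS];
static unsigned  dl_build_index = 0;   /* buffer the CPU fills this frame */

/* Return the buffer the CPU should fill for the current frame. */
DlCommand *dl_begin_frame(void)
{
    return dl_buffers[dl_build_index];
}

/* Return the buffer that was just completed and switch the CPU to the
 * other one. A real program would start the graphics task on the returned
 * buffer here, and must make sure the RCP has finished with the other
 * buffer before the CPU starts overwriting it. */
DlCommand *dl_end_frame(void)
{
    DlCommand *finished = dl_buffers[dl_build_index];
    dl_build_index ^= 1u;
    return finished;
}
```

Each frame calls dl_begin_frame(), writes its commands, and then calls dl_end_frame(), so the CPU never writes into the buffer the RCP was given for the previous frame.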

4-2-2 Triple Frame Buffering

Usually, you use two frame buffers (double buffering), as explained in Section 2-4 Use of Frame Buffer. While the RCP is drawing the next frame into one buffer, the video DAC is displaying the previously drawn frame from the other. However, switching between the two frame buffers occurs only at the vertical synchronization point. Therefore, if the RCP hasn't finished drawing a new frame when the next vertical synchronization occurs, the switch is put off until the following vertical synchronization, and once the frame is finished the RCP has no buffer to draw into until that switch takes place. In cases like this, where each frame takes longer to draw, you can make the RCP more efficient by using triple frame buffering: one buffer for "displaying," one for "drawing completed and waiting for switch," and one for "drawing."

The speed-up effect of this method can be huge when the drawing time of a frame is frequently out of sync with the vertical synchronization timing, because without triple buffering the RCP spends a long time waiting. On the other hand, if the drawing time of a frame is usually in sync with the vertical synchronization, this method has little value because each drawn frame waits only briefly. You need to weigh the advantages against the disadvantages. Triple buffering uses a lot of memory because each of the three frame buffers is quite large even in low resolution. Another disadvantage is that the TV display is always two frames behind instead of just one.

Figure 4-2-2 Triple Frame Buffering
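
The three roles can be tracked with a small state table. The following is a minimal sketch in plain C, not SDK code: FbQueue, FbState and the fb_* functions are hypothetical names, and registering the buffer with the video hardware is omitted.

```c
#define FB_COUNT 3

typedef enum { FB_FREE, FB_DRAWING, FB_READY, FB_DISPLAYING } FbState;

typedef struct {
    FbState  state[FB_COUNT];
    unsigned ready_order[FB_COUNT];  /* order in which drawing finished */
    unsigned next_order;
} FbQueue;

/* Buffer 0 starts on screen; the others are free for drawing. */
void fb_init(FbQueue *q)
{
    for (int i = 0; i < FB_COUNT; i++) {
        q->state[i] = FB_FREE;
        q->ready_order[i] = 0;
    }
    q->state[0] = FB_DISPLAYING;
    q->next_order = 0;
}

/* Give the RCP a free buffer to draw into, or -1 if it has to wait. */
int fb_acquire_draw(FbQueue *q)
{
    for (int i = 0; i < FB_COUNT; i++) {
        if (q->state[i] == FB_FREE) {
            q->state[i] = FB_DRAWING;
            return i;
        }
    }
    return -1;
}

/* Drawing is complete; the buffer now waits for the next switch. */
void fb_finish_draw(FbQueue *q, int index)
{
    q->state[index] = FB_READY;
    q->ready_order[index] = q->next_order++;
}

/* Call at vertical synchronization. Shows the oldest finished buffer and
 * frees the one that was on screen. Returns the buffer now displayed. */
int fb_on_vsync(FbQueue *q)
{
    int shown = -1, next = -1;
    for (int i = 0; i < FB_COUNT; i++) {
        if (q->state[i] == FB_DISPLAYING)
            shown = i;
        if (q->state[i] == FB_READY &&
            (next < 0 || q->ready_order[i] < q->ready_order[next]))
            next = i;
    }
    if (next < 0)
        return shown;              /* nothing new: keep the same frame */
    if (shown >= 0)
        q->state[shown] = FB_FREE;
    q->state[next] = FB_DISPLAYING;
    return next;
}
```

With FB_COUNT set to 2, fb_acquire_draw() would return -1 as soon as one frame is finished and waiting. With three buffers the RCP can start the next frame right away, which is where the speed-up comes from.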

4-2-3 Using LOD (Level Of Detail)

LOD stands for level of detail. By providing different levels of detail, you can significantly improve performance. For example, objects that are moving fast or are far away need much less detail than stationary objects that are close. When you display a lot of objects on the screen, the RCP processing time increases. The RCP processing time is determined by the vertex coordinate transformations, lighting, and so on performed by the RSP microcode, and by the polygon texturing performed by the RDP. When you increase the number of objects to be displayed, the processing time for vertex coordinate transformation and lighting can become a problem.

When displaying 3D objects that are close, you need to provide a lot of precise detail. On the other hand, when an object is small and far away, very little detail is enough. Therefore, you can prepare several versions of a model in advance, each with a different level of detail, and then switch the displayed model based on its distance from the viewer. This very effectively reduces processing time. The disadvantages of this technique are that it takes a long time to prepare several LOD versions of each model so that the switching appears natural, and that you need memory to store all those models. However, because models that have little detail use very little room, the impact on memory is not too bad.

Figure 4-2-3 Images of the three levels of LOD
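
Switching by distance can be as simple as a table of thresholds. The sketch below assumes three LOD versions per model; the threshold values and the name lod_select are made up for illustration and would need tuning so that the switch is not noticeable.

```c
#define LOD_LEVELS 3

/* Hypothetical distance limits: level 0 (most detail) is used up to the
 * first limit, level 1 up to the second, and level 2 beyond that. */
static const float lod_max_distance[LOD_LEVELS - 1] = { 200.0f, 800.0f };

/* Return the index of the model version to draw at this distance. */
int lod_select(float distance)
{
    int level = 0;
    while (level < LOD_LEVELS - 1 && distance > lod_max_distance[level])
        level++;
    return level;
}
```

The returned index would typically select one of several pre-built display lists for the same model.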

When a model is beyond a certain distance from the viewer, you can greatly improve performance without noticeably affecting quality by not drawing the model in 3D at all. Instead, display it as a single flat picture made from pre-rendered image data. This makes it possible to produce a good-looking result at high speed.

This method is most effective when the processing capability of the RSP microcode is saturated, but it has no effect when the RDP performance is saturated.

4-2-4 Volume Culling

If the RCP simply processed every object it was given, it would waste a lot of time on coordinate transformations for vertices and models that lie outside the current view. To speed up processing, do not process data that is not displayed on the screen. Volume culling simply means skipping those commands in the display list that apply to vertices or models that lie outside the current view. See the gSPCullDisplayList function for details. Of course, it is most effective for the CPU not to send drawing instructions for things outside the view in the first place. Volume culling is very effective when there are many models or vertices and the processing capability of the RSP microcode is saturated, but it has no effect when the RDP performance is saturated.

Figure 4-2-4 Do not process data that is not displayed on the screen.
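
For the CPU-side check, one common approach is to test a bounding sphere around each model against the planes of the view volume. This is a minimal sketch of that general technique, not an SDK routine: the Plane and Sphere types and sphere_in_view are hypothetical, and the planes are assumed to have unit-length normals pointing into the view volume.

```c
/* A plane with a unit-length normal pointing into the view volume.
 * A point (x, y, z) is on the inside when nx*x + ny*y + nz*z + d >= 0. */
typedef struct { float nx, ny, nz, d; } Plane;

/* Bounding sphere of a model, in the same space as the planes. */
typedef struct { float x, y, z, radius; } Sphere;

/* Return 1 if the sphere may be visible, 0 if it lies completely outside
 * at least one plane and the model can be skipped. */
int sphere_in_view(const Sphere *s, const Plane *planes, int plane_count)
{
    for (int i = 0; i < plane_count; i++) {
        float dist = planes[i].nx * s->x + planes[i].ny * s->y +
                     planes[i].nz * s->z + planes[i].d;
        if (dist < -s->radius)
            return 0;
    }
    return 1;
}
```

Models for which this returns 0 never need to be added to the display list, so neither the CPU nor the RSP spends time on them.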

4-2-5 Anti-Aliasing

Anti-aliasing is one of the strongest features of the N64. It smooths out the jagged edges on lines. However, anti-aliasing takes time. It reduces the pixel fill-rate performance of the RDP because the anti-aliasing process needs to update the coverage value of each pixel, which means reading the frame buffer and then writing the update. As a result, the amount of frame buffer memory access doubles. When the RDP fill-rate performance is saturated, you should turn off anti-aliasing. You need to consider the trade-off and decide which is more important: image quality or drawing speed.

4-2-6 Z-Buffering

When you use the Z-buffer, draw the closest object first and then work toward the background to get the best speed. If you draw the farthest object first, you have to write the same parts of the Z-buffer over and over as closer objects cover them. Drawing the closest object first is faster because, for subsequent objects, you need to write only that part of the Z-buffer that is not "covered up" by the foreground object. This is a very effective technique when the RDP fill-rate performance is saturated. For an explanation of the Z-buffer, see STEP 1 [1-7 Basic Terminology (Thread, Message, etc.)].

Figure 4-2-5 Draw from the closest picture in the Z-buffer
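
One way to get this drawing order is to sort the objects by distance from the viewer before building the display list. The sketch below uses the standard C qsort function; DrawItem and its fields are hypothetical names for this example.

```c
#include <stdlib.h>

/* One object to draw, with its distance from the viewer. */
typedef struct {
    int   id;
    float view_depth;   /* larger means farther away */
} DrawItem;

/* qsort comparison function: nearer items sort first. */
static int compare_near_first(const void *a, const void *b)
{
    float da = ((const DrawItem *)a)->view_depth;
    float db = ((const DrawItem *)b)->view_depth;
    return (da > db) - (da < db);
}

/* Reorder the items so that the closest one is drawn first. */
void sort_near_to_far(DrawItem *items, size_t count)
{
    qsort(items, count, sizeof items[0], compare_near_first);
}
```

The sort itself costs CPU time, so it pays off mainly when the RDP fill rate, not the CPU, is the bottleneck.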

4-2-7 Optimizing GBI Commands

-The gSPVertex function loads vertex information. Its processing is vectorized, so it performs coordinate transformations for two vertices at a time. Therefore, it is better to load an even number of vertices rather than an odd number so that no display list processing time is wasted.

-If you use the gSPModifyVertex function, you can directly write values into a vertex that has already been loaded into the vertex cache. When a vertex cannot be shared because it has the same xyz coordinates but a different texture coordinate, you can use this function to load only the new texture coordinate, thus optimizing the display list. Additionally, this command embeds the value directly in a 64-bit GBI command, so it operates at very high speed. However, the texture coordinate you provide must already be multiplied by the scaling value set up by the gSPTexture function.

-Each time you load a modeling matrix or projection matrix, the MP matrix is recalculated. For example, if you give the scaling, rotation, and translation matrices to the RSP separately and have it multiply them, you cause unnecessary recalculations of the MP matrix internally. Try to do this with a single matrix load if possible. For example, if you know you are going to reuse the result of a matrix multiplication, you can improve performance by doing the multiplication once in advance and then providing the result as a constant in all the places it is needed.

-Lighting can have a significant effect on the performance of the vertex coordinate transformation process. Try to do as much as possible with lighting off. For example, turn the lighting off for objects that are fixed in the field and are not affected by changes in the light. Also, if you create a texture that already appears lighted, the lighting process can be omitted. However, if you use too many textures, performance will suffer. In addition, do volume culling with lighting off, because culled vertices obviously don't need lighting. Depending on the microcode, turning lighting off can as much as double the performance of the gSPVertex function.

Figure 4-2-6 Lighting

-In graphics microcode, the command that requires the most RSP processing is not the gSPVertex function, but the gSP1Triangle function. This is because, to have the RDP draw a triangle, the RSP must set up a command of close to 180 bytes and send it. This causes a delay in sending data to the RDP. In most microcode, the output buffer is a 1K-byte area allocated from DMEM inside the RSP, and it becomes full with the drawing commands for about six triangles. If the triangles are small, this is not a problem because the RDP completes each drawing process quickly and the commands are removed one after another. However, if the triangles are big, the RDP processes each command slowly and the RSP has to wait before it can output more. In this case, using the .fifo microcode or the .dram microcode may speed up the process.

Figure 4-2-7 Drawing Triangles

-If you do not use a texture, one part of the triangle command sent to the RDP becomes unnecessary and the RDP command becomes 64 bytes shorter. Therefore, you should avoid using a texture just to give color to a monochromatic object. Instead, turn the texture off, and add color by applying the primitive color or the vertex color. When lighting is on, the vertex color cannot be used to add color directly, so if you need lighting, use the primitive color or switch lights. However, because switching lights takes more RSP processing time, it is more effective to use the primitive color. The RDP drawing time does not change, so this method is effective only when you have many vertices and the problem is the RSP processing time, not the drawing time.

Figure 4-2-8 Setting color with the texture
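
The advice above about combining matrices on the CPU can be sketched as follows. This is an illustration in plain C with floating-point matrices, not SDK code: Mat4 and the mat4_* functions are hypothetical, and a row-vector convention (translation stored in the fourth row) is assumed. Rotation is omitted for brevity; it would be multiplied in the same way.

```c
typedef struct { float m[4][4]; } Mat4;

/* Uniform scaling matrix. */
Mat4 mat4_scale(float s)
{
    Mat4 r = { { { s, 0, 0, 0 }, { 0, s, 0, 0 }, { 0, 0, s, 0 }, { 0, 0, 0, 1 } } };
    return r;
}

/* Translation matrix (row-vector convention: offsets in the last row). */
Mat4 mat4_translate(float x, float y, float z)
{
    Mat4 r = { { { 1, 0, 0, 0 }, { 0, 1, 0, 0 }, { 0, 0, 1, 0 }, { x, y, z, 1 } } };
    return r;
}

/* Return a * b. With row vectors, a is applied first and b second. */
Mat4 mat4_multiply(const Mat4 *a, const Mat4 *b)
{
    Mat4 r;
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            r.m[i][j] = 0.0f;
            for (int k = 0; k < 4; k++)
                r.m[i][j] += a->m[i][k] * b->m[k][j];
        }
    }
    return r;
}

/* Transform a point (w assumed to be 1), used here only to check results. */
void mat4_transform_point(const Mat4 *m, const float in[3], float out[3])
{
    for (int j = 0; j < 3; j++)
        out[j] = in[0] * m->m[0][j] + in[1] * m->m[1][j] +
                 in[2] * m->m[2][j] + m->m[3][j];
}
```

The combined matrix from mat4_multiply() would be converted to the RSP's matrix format and loaded once, instead of loading the scaling and translation matrices separately and having the RSP multiply them.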

4-2-8 Optimizing the Display List

Commands to the RSP microcode are provided by the display list. The processing rate of the microcode varies depending on the way the display list is constructed. The key here is to reuse the vertex cache effectively, acquiring the vertex data as needed. It is particularly important to optimize the display list when the game application creates model data dynamically.
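
To see how much vertex sharing matters, you can count how many vertices would have to be loaded for a given triangle order. The following is a rough sketch under stated assumptions: triangles are processed in order and grouped into batches that fit in the vertex cache, each batch loads every vertex it uses once, and the cache size is passed in as a parameter because it depends on the microcode. The function names are made up for this example.

```c
#include <stddef.h>

#define MAX_CACHE 64   /* upper bound on cache_size in this sketch */

/* Return 1 if vertex index v is already in the current batch. */
static int batch_contains(const unsigned short *batch, size_t n,
                          unsigned short v)
{
    for (size_t i = 0; i < n; i++)
        if (batch[i] == v)
            return 1;
    return 0;
}

/* Count the vertices loaded when drawing tri_count triangles, where
 * indices holds three vertex indices per triangle. Returns 0 if
 * cache_size is outside the range this sketch supports. */
size_t count_vertex_loads(const unsigned short *indices, size_t tri_count,
                          size_t cache_size)
{
    unsigned short batch[MAX_CACHE];
    size_t batch_count = 0, loads = 0;

    if (cache_size < 3 || cache_size > MAX_CACHE)
        return 0;

    for (size_t t = 0; t < tri_count; t++) {
        const unsigned short *tri = &indices[t * 3];
        size_t missing = 0;

        for (int k = 0; k < 3; k++)
            if (!batch_contains(batch, batch_count, tri[k]))
                missing++;

        /* Cache would overflow: close this batch and start a new one. */
        if (batch_count + missing > cache_size) {
            loads += batch_count;
            batch_count = 0;
        }
        for (int k = 0; k < 3; k++)
            if (!batch_contains(batch, batch_count, tri[k]))
                batch[batch_count++] = tri[k];
    }
    return loads + batch_count;
}
```

For two triangles that share an edge, this counts four vertex loads when both fit in the cache together, compared with six when each triangle loads its own three vertices.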

4-2-9 Speeding up the Audio Process

Audio processing time must never be ignored. The audio process needs both CPU and RSP processing time.
To decrease the audio processing time, pay attention to the following:

-Lower the sampling frequency of the audio playback as far as sound quality permits.
The lower the sampling frequency, the fewer the number of samples that must be processed per unit time. The direct effect is a faster audio process. Refer to the osAiSetFrequency function.

-Keep the number of physical voices (the number of sounds that can be played simultaneously) given to the synthesizer driver as small as possible.
Because the time for the waveform synthesis process in the CPU and RSP is almost proportional to the number of physical voices, decreasing the number of physical voices is effective in speeding up the process. You can change the number of physical voices for each scene. Therefore, it may be most effective to save the richest music, which uses a lot of physical voices, for the beginning or end of the game, and to decrease the number of physical voices during gameplay, when processing time is most crucial.

-Minimize the resampling (pitch shift) process or the ADPCM process as it takes a relatively long time.
Try to record the sound using a sampling rate as close to the playback rate as possible, so that you can avoid the resampling process. Or, when the N64 Game Pak ROM capacity can afford it, try to use raw voice source data that is not ADPCM compressed. This improves RSP processing time. In practice, however, some resampling is likely to be necessary because, for example, the actual rate becomes 32006 Hz even if you set the AI to 32000 Hz.

-Minimize sound effects; these take a long time to process.
The more effect primitives you use, the longer the processing takes. Try to keep the number of effect primitives to an absolute minimum.
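
As a rough illustration of the first two points, the amount of work per video frame can be estimated from the sampling frequency and the number of physical voices. This is a simplified model using made-up function names; it treats the cost as strictly proportional to the number of voices, whereas the text above only says "almost proportional."

```c
/* Samples that must be produced for one video frame, rounded up.
 * Returns 0 if frames_per_second is 0. */
unsigned audio_samples_per_frame(unsigned sample_rate_hz,
                                 unsigned frames_per_second)
{
    if (frames_per_second == 0)
        return 0;
    return (sample_rate_hz + frames_per_second - 1u) / frames_per_second;
}

/* Rough work estimate for one frame: every physical voice contributes
 * one sample of synthesis work per output sample. */
unsigned audio_voice_samples_per_frame(unsigned sample_rate_hz,
                                       unsigned frames_per_second,
                                       unsigned physical_voices)
{
    return audio_samples_per_frame(sample_rate_hz, frames_per_second) *
           physical_voices;
}
```

In this model, 32000 Hz at 60 frames per second with 16 voices gives 534 x 16 = 8544 voice samples per frame, while 22050 Hz with 8 voices gives 368 x 8 = 2944, roughly a third of the work.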