24.5 CPU Tuning

24.5.1 Parallel Execution of the CPU and the RCP

Full speed rendering in the Nintendo 64 can only be accomplished by fully utilizing all of it's resources. One of the most powerful is the coarse-grain parallelism that can be achieved between the CPU and the RCP.

There are many ways you can exploit this parallelism, here are some ideas:

compute game and animation parameters for frame (n+1) while frame (n) is rendered with the RCP.
compute game and animation parameters while another RCP task is being processed.
instruct the RDP to render from a DRAM display list while the RSP is used to compute another task, such as audio.

24.5.2 Sorting

A detailed analysis of sorting algorithms is beyond the scope of this document. The reader is referred to texts by Knuth¹ or Sedgewick², among others. It is useful to review major properties of sorting algorithm analysis and see how they relate to real-time system performance.

¹Knuth, D. E., The Art of Computer Programming, Volume 3: Searching and Sorting, Addison-Wesley Publishing, 1973, ISBN: 0-201-03803-X.

²Sedgewick, R., Algorithms in C, Addison-Wesley Publishing, 1990, ISBN: 0-201-51425-7.

Properties of sorting algorithms which we want to compare include:

best case sorting time
worst case sorting time
average case sorting time
additional memory requirements
size of the code to implement
ability to exploit coherence

The time to sort is probably the most important; obviously we want to choose an algorithm that is fast. But it is not that easy. Some of the fastest sorting algorithms have the widest disparity between their average time and their worst-case time. This makes it difficult to predict performance necessary for a real-time system.

Often the difference between worst-average-best-case performance is the initial order of the data. By knowing what we are sorting (and why) we can choose a better sort. For example, if we are sorting Z-values in order to determine visibility drawing order, we can reason that this order varies only slightly from frame to frame (objects do not move "dramatically" and sort interchanges are local). By exploiting this frame to frame coherence, we can choose a sort with linear performance for the "already nearly sorted" case, speeding up our sort tremendously.

Additional memory requirements are also a major concern in an embedded system. They must be minimal, and most of all, predictable. Consider the sorting problem when designing your data structures.