Hi there:
Those who have read the proofs of why the Wii's Hollywood is a GPU natively capable of GPGPU will have no trouble reading this; if not, you should first read the topic "Nvidia 7 series not capable of HDR+AA, but ATI Hollywood".

OK, first off, most of us know about the displacement mapping patents that Nintendo filed some years ago. But why can't even an architecture like the one in the ATI X1000 series (e.g. the ATI Radeon X1900) achieve displacement mapping efficiently? The answer can be found here:


Vertex Displacement Mapping, or simply Displacement Mapping, is a technique that deforms a polygonal mesh using a texture (the displacement map) in order to add surface detail. The principle is not new; it is even the basis of a number of terrain generation algorithms (see the Terrain Generator GLUT project). The new thing is the use of the GPU to achieve the mesh deformation in real time.

Update: November 5, 2006:
Displacement mapping requires a graphics controller that allows access to at least one texture unit from inside the vertex shader. Accessing a texture inside the vertex shader is called Vertex Texture Fetching. Shader Model 3.0 requires that at least 4 texture units be accessible inside the vertex shader. Currently, only graphics controllers based on the nVidia GeForce 6, GeForce 7 and higher support Vertex Texture Fetching. ATI graphics controllers do not support Vertex Texture Fetching, even the latest high-end models such as the X1950XTX (for more explanation see here: ATI X1900XTX and VTF).
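To make concrete what Vertex Texture Fetching buys you, here is a minimal CPU-side Python sketch of the technique the article describes: each vertex samples a displacement map at its UV coordinate and gets pushed along its normal. On SM 3.0 hardware that sampling step is exactly what happens inside the vertex shader; everything here (the function names, the tiny 2x2 "texture") is illustrative, not from any real driver or SDK.

```python
# Minimal CPU sketch of vertex displacement mapping: each vertex samples a
# height texture (the displacement map) at its UV coordinate and is moved
# along its normal by the sampled height.

def sample_nearest(texture, u, v):
    """Nearest-neighbour lookup into a row-major 2D list of heights."""
    h = len(texture)
    w = len(texture[0])
    x = min(int(u * w), w - 1)
    y = min(int(v * h), h - 1)
    return texture[y][x]

def displace(vertices, normals, uvs, disp_map, scale=1.0):
    """Return vertices moved along their normals by the sampled height."""
    out = []
    for (px, py, pz), (nx, ny, nz), (u, v) in zip(vertices, normals, uvs):
        d = sample_nearest(disp_map, u, v) * scale
        out.append((px + nx * d, py + ny * d, pz + nz * d))
    return out

# A flat 2-vertex patch facing +Z, displaced by a 2x2 height map.
disp_map = [[0.0, 1.0],
            [0.5, 0.25]]
verts   = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
normals = [(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
uvs     = [(0.0, 0.0), (0.9, 0.0)]

print(displace(verts, normals, uvs, disp_map))
```

The whole point of doing this on the GPU is that the per-vertex loop above runs in the vertex shader instead, which is only possible when the hardware can fetch textures at that pipeline stage.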


As we know, the first ATI X1900 cards were launched in January 2006, but the ATI Hollywood was not finished until after June 2006, when NEC announced its collaboration with Nintendo and MoSys to provide eDRAM for the Wii.

The thing is, if Nintendo was seeking a way to achieve displacement mapping efficiently, that would make the ATI Hollywood comparable to the ATI HD 2000 series models, since those were the ATI Radeons that supported vertex texture fetch.


The Radeon HD 2000 texture processor is a highly integrated device, configured as follows:

8 texture address units to calculate the address to sample
20 texture samplers
4 texture filter units

The new texture processor supports filtering of FP32 textures as well as the vertex texture fetch feature the Radeon X1000 did not support. It also supports 8192x8192 textures and RGBE 9:9:9:5 texture format to comply with DirectX 10 requirements. Besides everything else, ATI/AMD claims an improved quality of anisotropic filtering.
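To give an idea of what that RGBE 9:9:9:5 format means in practice, here is a small Python sketch that decodes it, assuming the DX10 shared-exponent convention (R9G9B9E5: exponent bias 15, mantissas treated as 9-bit fractions); the function name and sample values are mine, not from any SDK.

```python
# Hypothetical decoder for the RGBE 9:9:9:5 shared-exponent texture format:
# three 9-bit mantissas share one 5-bit exponent, so bright HDR colors fit
# in 32 bits per texel. Assumes the DX10 R9G9B9E5 convention (bias 15,
# 9 fractional mantissa bits).

def decode_rgb9e5(r_m, g_m, b_m, exp):
    """Decode 9-bit mantissas (0..511) plus a shared 5-bit exponent (0..31)."""
    scale = 2.0 ** (exp - 15 - 9)   # bias 15, mantissa interpreted as m / 2^9
    return (r_m * scale, g_m * scale, b_m * scale)

# With exponent 16, a mantissa of 256 decodes to exactly 1.0 per channel.
print(decode_rgb9e5(256, 256, 256, 16))
```

The shared exponent is the trade-off: you get HDR range in 32 bits, but the three channels can't each pick their own scale.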


In a few words, Hollywood may be very similar to the ATI Radeon HD 2000 series. Plus, there is a displacement mapping patent filed by Nintendo that talks about vertex texture fetching, with the command processor providing a stream of vertex commands.

Patent Storm

Command processor 200 receives display commands from main processor 110 and parses them—obtaining any additional data necessary to process them from shared memory 112. The command processor 200 provides a stream of vertex commands to graphics pipeline 180 for 2D and/or 3D processing and rendering. Graphics pipeline 180 generates images based on these commands. The resulting image information may be transferred to main memory 112 for access by display controller/video interface unit 164—which displays the frame buffer output of pipeline 180 on display 56.

FIG. 5 is a logical flow diagram of graphics processor 154. Main processor 110 may store graphics command streams 210, display lists 212 and vertex arrays 214 in main memory 112, and pass pointers to command processor 200 via bus interface 150. The main processor 110 stores graphics commands in one or more graphics first-in-first-out (FIFO) buffers 210 it allocates in main memory 110. The command processor 200 fetches: command streams from main memory 112 via an on-chip FIFO memory buffer 216 that receives and buffers the graphics commands for synchronization/flow control and load balancing, display lists 212 from main memory 112 via an on-chip call FIFO memory buffer 218, and vertex attributes from the command stream and/or from vertex arrays 214 in main memory 112 via a vertex cache 220.

Command processor 200 performs command processing operations 200a that convert attribute types to floating point format, and pass the resulting complete vertex polygon data to graphics pipeline 180 for rendering/rasterization. A programmable memory arbitration circuitry 130 (see FIG. 4) arbitrates access to shared main memory 112 between graphics pipeline 180, command processor 200 and display controller/video interface unit 164.
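As a rough mental model of what the patent describes (not actual Hollywood code), here is a Python sketch of a command FIFO whose consumer converts fixed-point vertex attributes to floating point before handing them downstream, mirroring operations 200a; the 8.8 fixed-point format, class name, and buffer layout are all assumptions for illustration.

```python
# Toy model of the patent's flow: the main processor writes packed vertex
# commands into a FIFO, and the command processor drains it, converting
# fixed-point attributes (here 8.8 fixed point) into floats before passing
# complete vertices to the graphics pipeline.

from collections import deque

FIXED_ONE = 256  # 8.8 fixed point: integer 256 represents 1.0

class CommandFIFO:
    def __init__(self):
        self.buf = deque()          # stand-in for the on-chip FIFO buffer

    def write(self, fixed_xyz):     # main processor side: enqueue a command
        self.buf.append(fixed_xyz)

    def fetch_vertex(self):         # command processor side: dequeue and
        x, y, z = self.buf.popleft()  # convert attribute types to float
        return (x / FIXED_ONE, y / FIXED_ONE, z / FIXED_ONE)

fifo = CommandFIFO()
fifo.write((256, 128, 0))           # (1.0, 0.5, 0.0) in 8.8 fixed point
fifo.write((512, 64, 256))          # (2.0, 0.25, 1.0)
while fifo.buf:
    print(fifo.fetch_vertex())
```

The FIFO decouples the producer (CPU) from the consumer (GPU), which is the "synchronization/flow control and load balancing" role the patent assigns to buffer 216.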

FIG. 4 shows that graphics pipeline 180 may include: a transform unit 300, a setup/rasterizer 400, a texture unit 500, a texture environment unit 600, and a pixel engine 700.

Transform unit 300 performs a variety of 2D and 3D transform and other operations 300a (see FIG. 5). Transform unit 300 may include one or more matrix memories 300b for storing matrices used in transformation processing 300a. Transform unit 300 transforms incoming geometry per vertex from object space to screen space; and transforms incoming texture coordinates and computes projective texture coordinates (300c). Transform unit 300 may also perform polygon clipping/culling (300d). Lighting processing 300e also performed by transform unit 300b provides per vertex lighting computations for up to eight independent lights in one example embodiment. As discussed herein in greater detail, Transform unit 300 also performs texture coordinate generation (300c) for emboss-style bump mapping effects.
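For reference, the transform unit's core per-vertex job (taking geometry from object space toward screen space) boils down to a 4x4 matrix multiply followed by a perspective divide. This little Python sketch shows the math with a trivial translation matrix; it is my own worked example, not code from the patent.

```python
# Per-vertex transform as a transform unit would do it: multiply the
# object-space position by a 4x4 matrix (model-view-projection), then
# divide by w (the perspective divide).

def transform(matrix, vertex):
    """Row-major 4x4 matrix times the column vector (x, y, z, 1)."""
    x, y, z = vertex
    v = (x, y, z, 1.0)
    cx, cy, cz, cw = (sum(row[i] * v[i] for i in range(4)) for row in matrix)
    return (cx / cw, cy / cw, cz / cw)   # perspective divide

# A pure translation by (2, 3, 4); the bottom row leaves w = 1.
mvp = [[1, 0, 0, 2],
       [0, 1, 0, 3],
       [0, 0, 1, 4],
       [0, 0, 0, 1]]

print(transform(mvp, (1.0, 1.0, 1.0)))
```

A real MVP matrix would also encode rotation, scale, and the projection, but the per-vertex arithmetic the hardware performs is exactly this multiply-and-divide.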


Tellingly, among the new features of the ATI HD 2000 series are that they include a vertex cache and that they have only 4 texture units:


Associated on the side of the shader core diagram are the texture units. ATI has chosen four texture units for R600. Each unit has eight texture addresses per cycle while four of those are used for bilinear and four are used for four unfiltered lookups. The vertex cache can be used for vertex accesses or other structured accesses. It can even be used for displacements, which will probably become more prevalent in DX10 games.

Associated within each unit are 20 texture samplers for a total of 80 samplers in R600. These samplers fetch and return the data. According to ATI, it does not matter if it is floating point or integer data. It will return four filtered floating point values per cycle and it will return four unfiltered floating-point, or any other type of data per cycle. The 2400 and 2600 core functionality will remain the same but they won't be able to return as much because they have fewer units.

Compared to the previous generations, the texture caches are a bit more complicated, as they are broken up into several caches. There is a 32K L1 unified for all of the SIMD arrays. In comparison, the R500 series only had an 8K cache (per SIMD it is four times larger). It is backed up by a second 256K L2 (2600 has 128K L2 and the 2400 has no L2). The secondary cache allows for very large data structures like fat pixels or very large textures. The aim is to reduce the bandwidth they use for texture.

In concert with the texture cache subsystem, there is also a vertex cache system. It is called a vertex cache because that is one of its primary uses, but it can be used for unfiltered texture lookups as well. It is quite common to use the cache with displacement mapping, structured lookup into arrays and render-to-vertex arrays where data is fetched back. Since it deals primarily with vertex data, it was called a vertex cache. For all intents and purposes, it is a structured linear cache working in parallel. It is not necessarily as important how much data actually goes through any of these caches as much as it is the availability of resources when work needs to be done. The availability of resources for which the cache can be arbitrated is more crucial. In the case of the HD 2400, it actually fetches its vertices through the texture cache. The hardware looks at all of these units as a general resource and will have the compiler take the shader code and convert it. The key to all of this architecture is how well the compiler can convert code, which will determine how things are going to work and what kind of throughput you will actually end up with.

In tasks such as render to texture, it is common to create a texture and then immediately use it. Issues can arise by doing that. The texture needs to finish being drawn before it is used. On older processors (ATI and current Nvidia), the chip would idle to finish rendering the texture before moving on to the next command. There is a performance hit involved. ATI has changed this on the 2000 series. As mentioned before, self checking has been moved down into the hardware so when the rendering of textures occurs there is a coherency check within the chip across the texture units and the raster back ends. The driver doesn't care anymore. It just sends the commands down to the chip and fills it up. The processor itself handles all of the synchronizations between all of the units.

Stream Out allows something that was introduced in the R500 called render-to-vertex buffer. This can now be done after geometry shader processing by streaming the data out directly from the shader. It can write vertex data out of the shader and then circulate it through for tessellation or any other extra processing. It can also be done via thread communication: one thread can write the data out and have the next thread read it back in, do a render-to-vertex buffer, or overflow the GS data. This can only be done if the GPR stack is virtualized.


And as we read in Nintendo's displacement mapping patent, the hardware includes a vertex cache.

Another key point is that some models of the ATI HD 2000 series are so streamlined that they have very few texture units, comparable to the ATI X1600; they can have as few as 8 texture units in total.