Written exam in EDA075 Mobile Graphics, 2006-10-20, 14-19
=============================================================

1a) The major problem is heat: the power the device consumes is converted
to heat, which must get out. Furthermore, the devices are small, so there
is no room for fans. The generated heat can be accepted on a desktop, but
not on a phone, where it would be an inconvenience to the user.

1b) See slide 5 (title: "Memory system in a mobile") in L8.pdf.

1c) The depth complexity, d, is the number of triangles that overlap a
pixel (even though each triangle need not write to the pixel). The
overdraw, o, is the number of times you actually write to a pixel. A
statistical model of the overdraw is:

    o(d) = 1 + 1/2 + 1/3 + ... + 1/d

2a) In each node, you can compute a bounding volume (e.g., a minimal box)
enclosing the geometry of that node's entire subtree. When rendering the
scene, you test whether a node's bounding volume is outside the view
frustum. If it is, its contents cannot be visible, and hence you do not
need to process that node's subtree further. This can speed up rendering
a lot.

2b) If the path of transforms from a node up to the root is static, i.e.,
constant, then you can collapse that path into a single matrix that never
needs to be updated. Note also that you can multiply this concatenated
matrix into the object's vertices and store the transformed vertices. The
only matrices then needed are the camera transform and the projection
matrix, so you save the multiplication between the object's matrix and
the camera*projection matrix. In general, if we assume that a transform
node is either "static" (its matrix never changes) or "dynamic" (its
matrix can change on a per-frame basis), then you can always collapse two
or more neighboring (connected by an edge in the graph) static nodes.

2c) One could imagine rendering recursive structures in a simple way
using loops.
For example, assume that we have a transform node whose child node
contains a sphere, and a loop edge from the child back up to the
transform node. Each time you reach the transform node, you scale the
sphere down (say, to half its radius) and translate it in the y-direction
by twice the new radius. This gives you a sphere with a smaller sphere on
top of it, which in turn has an even smaller sphere on top, and so on. To
avoid infinite recursion, a stopping criterion is needed inside the
transform node: for example, stop when the radius is smaller than some
threshold value, or when the projected size of the object is too small to
make a significant impact on the rendered image.

2d) OpenGL ES has been slimmed down quite a lot, and many redundant or
otherwise obsolete API calls have been removed. This makes for a smaller
rendering core and simpler drivers.

3a) Zmin culling works with tiled rasterization algorithms: each tile has
storage for the minimum of its z-values in the depth buffer. Let us
denote this value zmin. It must be stored in on-chip memory or accessed
through a cache. When rendering a triangle, the maximum depth of the
triangle inside the tile, Z_max, is estimated. If Z_max < zmin, then the
rendered triangle is definitely in front of all previously rendered
triangles in the tile, and hence we do not need to read from the
z-buffer. This is the bandwidth we can save, i.e., only reads from the
z-buffer. Z_max can be estimated as: 1) the max of the triangle vertices'
depths, or 2) the max of the triangle plane's depths at the tile corners,
or 3) the minimum of 1 and 2 (which is the best-performing solution).

3b) You start an occlusion query and then render a number of triangles;
the hardware counts how many fragments pass through the entire pipeline
(often with all writes to the buffers turned off).
You can, for example, draw the triangles of the bounding box of a complex
character; if no fragments pass, the box is invisible, and thus so is the
character. This means that you do not need to render the character at
all, and you do not even need to send it over the bus to the graphics
hardware.

3c) Both algorithms can save only depth reads (assuming texturing is done
after the depth test), so we look at the part of the rasterization
bandwidth equation that deals with depth-buffer reads (and writes):

    B_d = (d - o)*Z_r + o*(Z_r + Z_w)

The term (d - o)*Z_r covers fragments that only read, i.e., fragments
that are occluded, and these reads can be avoided by Zmax culling. So
when

    d - o > o  <==>  d > 2*o

is fulfilled, we know that Zmax culling works better. When d = 4 we have
(see 1c) o = 2 (approximately), and this is where break-even is reached.
So for d >= 4 we would want to use Zmax culling, and Zmin culling before
that (in the beginning of rendering a frame). Somehow, we need to "count"
the depth complexity of each tile, or have some estimate of the average
depth complexity of the entire screen: a counter per tile, or one for the
whole screen, is incremented each time we render a triangle.

4a) You must make sure that there are no cracks and that no pixel is
written more than once. This is done using McCool's tie-breaking rule,
which can be found in Figure 2.4 in the course notes. Basically:

    bool inside(x, y)
        if (e(x, y) > 0) return true;
        if (e(x, y) < 0) return false;
        // e(x, y) == 0: break the tie using the edge's coefficients
        if (a > 0) return true;
        if (a < 0) return false;
        if (b > 0) return true;
        return false;
    end

In addition, you need to take care of the case where a vertex coincides
with the sampling point of a pixel. This can be done by using fixed-point
coordinates for the vertices and offsetting them so that this never
happens, or by using the solution in Figure 2.8.

4b) You should not project all four tile corners onto the normal, and you
should not evaluate the edge functions at all corners of the tile. There
are smarter, less expensive solutions.
So, you look at the signs of the components of each edge function's
normal in order to determine only _one_ tile corner (called p) to insert
into that edge function. The normal is (a, b), sometimes called
(n_x, n_y). If n_x < 0 and n_y < 0, use the lower-left corner of the tile
as p, and so on (see Equation 2.10 in the course notes). If e(p) < 0,
then we have no overlap. Do this for all three edge functions. You also
need to check whether the bounding box of the triangle overlaps the tile.

4c) Fixed-point multiplication adds the numbers of fractional bits, so
a*b will have 4 + 5 = 9 fractional bits. c has 2 fractional bits, so it
must be shifted 7 steps to the left before the addition can be carried
out. Solution: d = a*b + (c << 7);

5a) First of all, we need a fallback so that uncompressed blocks can be
handled: if a tile cannot be compressed with the available algorithms,
the block has to be sent in uncompressed form over the bus. In addition,
you need to store a few (e.g., 1 or 2) bits per tile in a cache or on
chip, indicating which algorithm the tile has been compressed with. For
example, with two bits:

    00 = uncompressed (fallback)
    01 = tile is cleared
    10 = compressed using algorithm X
    11 = compressed using algorithm Y

5b) Depth varies linearly over a triangle, and so can be predicted easily
when a tile overlaps only one or two triangles. The color buffer is much
more complex: texturing can make the colors appear random, which is of
course harder to compress.

5c) 8x4 texels are compressed together, and two colors (16 bits each) are
stored within each such block, leaving 32 bits: one modulation bit per
texel. Two images are bilinearly upscaled by reading neighboring blocks'
colors, and the per-texel bit chooses between them. There are some
special cases, but these are not needed for full score on this exercise.
This costs 16 + 16 + 32 = 64 bits, which is 2 bits per texel.

6a) Unified shaders: by using the same shader instruction set for both
vertex shaders and pixel shaders, we can build one type of shader unit
that executes these instructions. The major idea is then to use the units
for either vertices or pixels, depending on where the bottleneck is. That
way we get load balancing in the pipeline and better exploit the
computational units in the hardware.

6b) Here is one idea: compress each color channel (R, G, and B)
separately. For a 4x4 tile, store the channel's value at three of the
corners; this costs 3*8 = 24 bits. The remaining values are predicted
using the plane equation through these three values. To handle more
tiles, we add a 3-bit correction term for each of the remaining
4*4 - 3 = 13 pixels, which costs 13*3 = 39 bits. In total, this is
24 + 39 = 63 bits per channel, and 63*3 = 189 bits for R, G, and B
together, which is about 12 bits per pixel. That is, the compression
ratio is 50%.

6c) To turn a 4-bit number into an 8-bit number, we need to make sure
that the largest 4-bit value maps to the largest 8-bit value. Thus, we
need to multiply by 255/15 = 17. Assume the 4-bit number is A. Then we
compute:

    17*A = 16*A + A = (A << 4) + A = (A << 4) or A

where "or" is bitwise or (the low four bits of A << 4 are zero, so the
addition and the or give the same result).

If there are many shader units in an architecture, then a