Written exam in EDA075 Mobile Graphics, 2006-10-20, 14-19
=============================================================

1a) The major problem is heat: the power the device consumes is converted
to heat, which must get out. Furthermore, the devices are small, so there
is no room for fans. The generated heat can be accepted on a desktop, but
not on a phone, where it would be an inconvenience to the user.

1b) See slide 5 (title: "Memory system in a mobile") in L8.pdf.

1c) The depth complexity, d, is the number of triangles that overlap a
pixel (even though each triangle need not write to the pixel). The
overdraw, o, is the number of times you actually write to a pixel. A
statistical model of the overdraw is:

    o(d) = 1 + 1/2 + 1/3 + ... + 1/d

2a) In each node, you can compute a bounding volume (e.g., a minimal box)
enclosing the geometry of that node's entire subtree. When rendering the
scene, you test whether a node's bounding volume is outside the view
frustum. If it is, its contents cannot be visible, and hence you do not
need to process that node's subtree further. This can speed up rendering
a lot.

2b) If the path of transforms from a node up to the root is static, i.e.,
constant, then you can collapse that path into a single matrix that never
needs to be updated. Note also that you can multiply this concatenated
matrix into the object's vertices and store the transformed vertices. The
only matrices then needed are the camera transform and the projection
matrix, so you save the multiplication between the object's matrix and
the camera*projection matrix. In general, if we assume that a transform
node is either "static" (its matrix never changes) or "dynamic" (its
matrix can change on a per-frame basis), then you can always collapse two
or more neighboring (connected by an edge in the graph) static nodes.

2c) One could imagine rendering recursive structures in a simple way
using loops.
For example, assume that we have a transform node whose child node
contains a sphere, and a loop edge from the child back up to the
transform node. Each time you reach the transform node, you scale the
sphere down (say, to half its radius) and translate it in the y-direction
by twice the new radius. This gives you a sphere with a smaller sphere on
top of it, which in turn has an even smaller sphere on top, and so on. To
avoid infinite recursion, a stopping criterion is needed inside the
transform node: for example, stop when the radius is smaller than some
threshold value, or when the projected size of the object is too small to
make a significant impact on the rendered image.

2d) OpenGL ES has been slimmed down quite a lot, and many redundant or
otherwise obsolete API calls have been removed. This makes for a smaller
rendering core and simpler drivers.

3a) Zmin culling works with tiled rasterization algorithms: each tile has
storage for the minimum of its z-values in the depth buffer. Let us
denote this value zmin. It must be stored in on-chip memory or accessed
through a cache. When rendering a triangle, the maximum depth of the
triangle inside the tile, Z_max, is estimated. If Z_max < zmin, then the
rendered triangle is definitely in front of all previously rendered
triangles in the tile, and hence we do not need to read from the
z-buffer. This is the bandwidth we can save, i.e., only reads from the
z-buffer. Z_max can be estimated as: 1) the max of the triangle vertices'
depths, or 2) the max of the triangle plane's depths at the tile corners,
or 3) the minimum of 1 and 2 (which is the best-performing solution).

3b) You start an occlusion query and then render a number of triangles;
the hardware counts how many fragments pass through the entire pipeline
(often with all writes to the buffers turned off).
You can, for example, draw the triangles of the bounding box of a complex
character; if no fragments pass, the box is invisible, and thus so is the
character. This means that you do not need to render the character at
all, and you do not even need to send it over the bus to the graphics
hardware.

3c) Both algorithms can save only depth reads (assuming texturing is done
after the depth test), so we look at the part of the rasterization
bandwidth equation that deals with depth-buffer reads (and writes):

    B_d = (d - o)*Z_r + o*(Z_r + Z_w)

The term (d - o)*Z_r covers fragments that only read, i.e., fragments
that are occluded, and these reads can be avoided by Zmax culling. So
when

    d - o > o  <==>  d > 2*o

is fulfilled, we know that Zmax culling works better. When d = 4 we have
(see 1c) o = 2 (approximately), and this is where break-even is reached.
So for d >= 4 we would want to use Zmax culling, and Zmin culling before
that (in the beginning of rendering a frame). Somehow, we need to "count"
the depth complexity of each tile, or have some estimate of the average
depth complexity of the entire screen: a counter per tile, or one for the
whole screen, is incremented each time we render a triangle.

4a) You must make sure that there are no cracks and that no pixel is
written more than once. This is done using McCool's tie-breaking rule,
which can be found in Figure 2.4 in the course notes. Basically:

    bool inside(x, y)
        if (e(x, y) > 0) return true;
        if (e(x, y) < 0) return false;
        // e(x, y) == 0: break the tie using the edge's coefficients
        if (a > 0) return true;
        if (a < 0) return false;
        if (b > 0) return true;
        return false;
    end

In addition, you need to take care of the case where a vertex coincides
with the sampling point of a pixel. This can be done by using fixed-point
coordinates for the vertices and offsetting them so that this never
happens, or by using the solution in Figure 2.8.

4b) You should not project all four tile corners onto the normal, and you
should not evaluate the edge functions at all corners of the tile. There
are smarter, less expensive solutions.
So, you look at the signs of the components of each edge function's
normal in order to determine only _one_ tile corner (called p) to insert
into that edge function. The normal is (a, b), sometimes called
(n_x, n_y). If n_x < 0 and n_y < 0, use the lower-left corner of the tile
as p, and so on (see Equation 2.10 in the course notes). If e(p) < 0,
then we have no overlap. Do this for all three edge functions. You also
need to check whether the bounding box of the triangle overlaps the tile.

4c) Fixed-point multiplication adds the numbers of fractional bits, so
a*b will have 4 + 5 = 9 fractional bits. c has 2 fractional bits, so it
must be shifted 7 steps to the left before the addition can be carried
out. Solution: d = a*b + (c << 7);

5a) First of all, we need a fallback so that uncompressed blocks can be
handled: if a tile cannot be compressed with the available algorithms,
the block has to be sent in uncompressed form over the bus. In addition,
you need to store a few (e.g., 1 or 2) bits per tile in a cache or on
chip, indicating which algorithm the tile has been compressed with. For
example, with two bits:

    00 = uncompressed (fallback)
    01 = tile is cleared
    10 = compressed using algorithm X
    11 = compressed using algorithm Y

5b) Depth varies linearly over a triangle, and so can be predicted easily
when a tile overlaps only one or two triangles. The color buffer is much
more complex: texturing can make the colors appear random, which is of
course harder to compress.

5c) 8x4 texels are compressed together, and two colors (16 bits each) are
stored within each such block, leaving 32 bits: one modulation bit per
texel. Two images are bilinearly upscaled by reading neighboring blocks'
colors, and the per-texel bit chooses between them. There are some
special cases, but these are not needed for full score on this exercise.
This costs 16 + 16 + 32 = 64 bits, which is 2 bits per texel.

6a) Unified shaders: by using the same shader instruction set for both
vertex shaders and pixel shaders, we can build one type of shader unit
that executes these instructions. The major idea is then to use the units
for either vertices or pixels, depending on where the bottleneck is. That
way we get load balancing in the pipeline and better exploit the
computational units in the hardware.

6b) Here is one idea: compress each color channel (R, G, and B)
separately. For a 4x4 tile, store the channel's value at three of the
corners; this costs 3*8 = 24 bits. The remaining values are predicted
using the plane equation through these three values. To handle more
tiles, we add a 3-bit correction term for each of the remaining
4*4 - 3 = 13 pixels, which costs 13*3 = 39 bits. In total, this is
24 + 39 = 63 bits per channel, and 63*3 = 189 bits for R, G, and B
together, which is about 12 bits per pixel. That is, the compression
ratio is 50%.

6c) To turn a 4-bit number into an 8-bit number, we need to make sure
that the largest 4-bit value maps to the largest 8-bit value. Thus, we
need to multiply by 255/15 = 17. Assume the 4-bit number is A. Then we
compute:

    17*A = 16*A + A = (A << 4) + A = (A << 4) or A

where "or" is bitwise or (the low four bits of A << 4 are zero, so the
addition and the or give the same result).

If there are many shader units in an architecture, then a