When Tegra 4 was introduced at CES there was precious little information about the setup of the integrated GPU.  We all knew that it would be a much more powerful GPU, but we were not entirely sure how it was set up.  Now NVIDIA has finally released a slew of whitepapers that deal with not only the GPU portion of Tegra 4, but also some of the low level features of the Cortex A15 processor.  For this little number I am just going over the graphics portion.

This robust looking fellow is the Tegra 4.  Note the four pixel "pipelines" that can output 4 pixels per clock.

The graphics units on the Tegra 4 and Tegra 4i are identical in overall architecture, just that the 4i has fewer units and they are arranged slightly differently.  Tegra 4 is comprised of 72 units, 48 of which are pixel shaders.  These pixel shaders are VLIW based VEC4 units.  The other 24 units are vertex shaders.  The Tegra 4i is comprised of 60 units, 48 of which are pixel shaders and 12 are vertex shaders.  We knew at CES that it was not a unified shader design, but we were still unsure of the overall makeup of the part.  There are some very good reasons why NVIDIA went this route, as we will soon explore.

If NVIDIA were to transition to unified shaders, it would increase the overall complexity and power consumption of the part.  Each shader unit would have to be able to handle both vertex and pixel workloads, which means more transistors are needed to handle it.  Simpler shaders focused on either pixel or vertex operations are more efficient at what they do, both in terms of transistors used and power consumption.  This is the same train of thought when using fixed function units vs. fully programmable.  Yes, the programmability will give more flexibility, but the fixed function unit is again smaller, faster, and more efficient at its workload.

On the other hand here we have the Tegra 4i, which gives up half the pixel pipelines and vertex shaders, but keeps all 48 pixel shaders.

If there was one surprise here, it would be that the part is not completely OpenGL ES 3.0 compliant.  It is lacking in one major function that is required for certification.  This particular part cannot render at FP32 levels.  It has been quite a few years since we have heard of anything not being able to do FP32 in the PC market, but it is quite common to not support it in the power and transistor conscious mobile market.  NVIDIA decided to go with a FP 20 partial precision setup.  They claim that for all intents and purposes, it will not be noticeable to the human eye.  Colors will still be rendered properly and artifacts will be few and far between.  Remember back in the day when NVIDIA supported FP16 and FP32 while they chastised ATI for choosing FP24 with the Radeon 9700 Pro?  Times have changed a bit.  Going with FP20 is again a power and transistor saving decision.  It still supports DX9.3 and OpenGL ES 2.0, but it is not fully OpenGL ES 3.0 compliant.  This is not to say that it does not support any 3.0 features.  It in fact does support quite a bit of the functionality required by 3.0, but it is still not fully compliant.

This will be an interesting decision to watch over the next few years.  The latest Mali 600 series, PowerVR 6 series, and Adreno 300 series solutions all support OpenGL ES 3.0.  Tegra 4 is the odd man out.  While most developers have no plans to go to 3.0 anytime in the near future, it will eventually be implemented in software.  When that point comes, then the Tegra 4 based devices will be left a bit behind.  By then NVIDIA will have a fully compliant solution, but that is little comfort for those buying phones and tablets in the near future that will be saddled with non-compliance once applications hit.

The list of OpenGL ES 3.0 features that are actually present in Tegra 4, but the lack of FP32 relegates it to 2.0 compliant status.

The core speed is increased to 672 MHz, well up from the 520 MHz in Tegra 3 (8 pixel and 4 vertex shaders).  The GPU can output four pixels per clock, double that of Tegra 3.  Once we consider the extra clock speed and pixel pipelines, the Tegra 4 increases pixel fillrate by 2.6x.  Pixel and vertex shading will get a huge boost in performance due to the dramatic increase of units and clockspeed.  Overall this is a very significant improvement over the previous generation of parts.

The Tegra 4 can output to a 4K display natively, and that is not the only new feature for this part.  Here is a quick list:

2x/4x Multisample Antialiasing (MSAA)

24-bit Z (versus 20-bit Z in the Tegra 3 processor) and 8-bit Stencil

4K x 4K texture size incl. Non-Power of Two textures (versus 2K x 2K in the Tegra 3 processor) – for higher quality textures, and easier to port full resolution textures from  console and PC games to Tegra 4 processor.  Good for high resolution displays.

16:1 Depth (Z) Compression and 4:1 Color Compression (versus none in Tegra 3 processor) – this is lossless compression and is useful for reducing bandwidth to/from the frame buffer, and especially effective in antialiasing processing when processing multiple samples per pixel

Depth Textures

Percentage Closer Filtering for Shadow Texture Mapping and Soft Shadows

Texture border color eliminate coarse MIP-level bleeding

sRGB for Texture Filtering, Render Surfaces and MSAA down-filter

1 – CSAA is no longer supported in Tegra 4 processors

This is a big generational jump, and now we only have to see how it performs against the other top end parts from Qualcomm, Samsung, and others utilizing IP from Imagination and ARM.