Intel's Larrabee Architecture

Author: Josh Walrath
Manufacturer: Intel

Something Borrowed, Something BLUE

    Internal communications for this chip had to be rethought as well.  The advantage of using a well-known x86 core technology is that the design is adept at context switching and pre-emptive multitasking, handles virtual memory and page swapping with ease, and, perhaps most importantly, has a robust mechanism for ensuring cache coherency between all of the cores.  To let the cores communicate with each other at speeds that will not leave them idling, Intel has implemented a 1024-bit ring bus architecture.  It is a bi-directional 512 x 2 bus which looks surprisingly similar to what AMD (ATI) developed for the Radeon 1x00/2x00/3x00 series of products.  When combined with large L1 and L2 caches, the cores theoretically have enough data and bandwidth to stay at a high level of utilization in most workloads.
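One reason a bi-directional ring is attractive is that a message can take the shorter of the two directions around the loop, halving the worst-case hop count versus a one-way ring.  A minimal sketch (the 16-core count here is purely illustrative, not a disclosed Larrabee configuration):

```python
def ring_hops(src: int, dst: int, n_cores: int) -> int:
    """Shortest hop count between two stops on a bidirectional ring.

    A message may travel clockwise or counter-clockwise, so the worst
    case is n_cores // 2 hops rather than n_cores - 1 on a one-way ring.
    """
    clockwise = (dst - src) % n_cores
    return min(clockwise, n_cores - clockwise)

# With 16 cores on the ring, the farthest stop is only 8 hops away.
worst = max(ring_hops(0, d, 16) for d in range(16))
print(worst)  # 8
```

Keeping worst-case latency low is what lets the coherency traffic between all those L2 caches avoid becoming the bottleneck.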

We can see the bandwidth requirements across three different applications and how immediate mode rendering compares with binning (tiling).  The bandwidth requirements in each case are significantly lower for binning.

    Intel is also taking a route pioneered by Videologic and GigaPixel.  Larrabee is not an immediate mode renderer; rather, it uses a tiling format.  Intel has added a few more wrinkles to the old tiling idea, and they are calling it “binned rendering”.  From what I gather, it essentially takes processed primitive and vertex data and puts it into bins, and then processes these bins in parallel to render multiple tiles.  This contrasts with immediate mode rendering, which follows a single pixel through the entire pipeline before storing it in the frame buffer.  This type of binning appears to be very cost effective in terms of bandwidth, and it is something many 3D architects have attempted in the past, but they have always fallen a few steps short of matching competing immediate mode rendering architectures.
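The core idea of binning can be sketched in a few lines: sort each primitive into every screen tile its bounding box touches, then rasterize the tiles independently.  This is a toy model under my own assumptions (a 64-pixel tile size and bounding-box binning are illustrative choices, not disclosed Larrabee parameters):

```python
from collections import defaultdict

TILE = 64  # tile size in pixels (illustrative, not Intel's actual value)

def bin_primitives(triangles, screen_w, screen_h):
    """Assign each triangle (given by its screen-space bounding box) to
    every tile its bounding box overlaps.  Returns {tile_coord: [tris]}."""
    bins = defaultdict(list)
    for tri in triangles:
        x0, y0, x1, y1 = tri["bbox"]
        for ty in range(max(0, y0 // TILE), min(screen_h - 1, y1) // TILE + 1):
            for tx in range(max(0, x0 // TILE), min(screen_w - 1, x1) // TILE + 1):
                bins[(tx, ty)].append(tri)
    return bins

# Two triangles: one confined to the top-left tile, one spanning two tiles.
tris = [{"bbox": (10, 10, 50, 50)}, {"bbox": (40, 10, 100, 50)}]
bins = bin_primitives(tris, 128, 128)
# Each tile can now be rasterized independently -- and in parallel --
# touching only the framebuffer memory that fits within a core's cache.
```

Because a tile's working set stays resident in cache until the tile is finished, external bandwidth is spent once per tile rather than once per overlapping primitive, which is where the bandwidth savings in the chart above come from.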

    Intel is also breaking these operations up into threads, and further breaking threads up into “strands” and then “fibers”.  A thread is sent to a core and broken up into 16 or so strands (each vector unit can handle 16 ops, or “strands” in this case).  Each core can handle up to four threads at a time, and the 16 strands can be further broken up into 8 fibers.  The threads are hardware managed, but the strands and fibers have to be software managed, so efficiently handling all of those strands will be an interesting feat.  Unfortunately, Intel did not give us a very granular view of how these threads/strands/fibers are actually processed, or what kind of operations each is able to perform.
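The decomposition described above can be sketched as follows.  This is only a toy model of how I read the disclosure: each hardware thread drives one 16-wide vector issue, so it carries up to 16 strands, and four thread slots share a core.  Since Intel did not explain how strands map to fibers, fibers are omitted here.

```python
VECTOR_WIDTH = 16      # strands per thread -- one per 16-wide vector lane
THREADS_PER_CORE = 4   # hardware thread slots on a single core

def schedule(work_items):
    """Toy decomposition of a flat list of work items into threads of
    up to 16 strands each, round-robined across one core's four
    hardware thread slots."""
    threads = [work_items[i:i + VECTOR_WIDTH]
               for i in range(0, len(work_items), VECTOR_WIDTH)]
    core_slots = [threads[s::THREADS_PER_CORE] for s in range(THREADS_PER_CORE)]
    return threads, core_slots

threads, slots = schedule(list(range(100)))
# 100 items -> 7 threads: six full 16-strand threads plus one partial
```

The hardware only sees the four thread slots; everything inside a thread (keeping all 16 lanes busy, covering the partial tail) is the software-managed part, which is exactly where the "interesting feat" lies.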

Instead of processing one pixel at a time, Intel's architecture bins the pixel data and then renders it out in tiles.  This saves on bandwidth and potentially provides greater efficiency for Intel's architecture.

    The one overwhelming theme that Intel presented to us was scalable performance from more and more cores.  Theoretically, Intel can simply throw more cores at the problem, and with this architecture performance will continue to scale without hitting a point of diminishing returns.  Meanwhile, NVIDIA and AMD have thrown more stream processors at the problem with their latest GTX 200 series and Radeon 4800 parts, yet their return on performance is not scaling as well as many would have imagined.  The Radeon 3800 series had 320 stream processors, but the Radeon 4800 series with 800 SPs does not deliver 2.5X the performance of the older architecture.
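Why 2.5X the units fails to deliver 2.5X the performance is the classic Amdahl's-law effect: any serial or fixed-cost portion of the frame caps the gain from widening the parallel part.  A back-of-the-envelope sketch (the 95% parallel fraction is my own illustrative assumption, not a measured figure for either architecture):

```python
def amdahl_speedup(parallel_fraction: float, units: float) -> float:
    """Amdahl's-law speedup for a workload whose parallel portion is
    spread across `units` execution units."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / units)

# Going from 320 SPs to 800 SPs with an assumed 95%-parallel workload:
base = amdahl_speedup(0.95, 320)
wide = amdahl_speedup(0.95, 800)
print(round(wide / base, 2))  # barely above 1x -- nowhere near the naive 2.5x
```

Intel's pitch, in effect, is that Larrabee's architecture keeps that non-scaling fraction small enough for core counts to keep paying off; whether it actually does is something the slides asserted rather than demonstrated.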

Odds and Ends

    Intel also showed a few extra examples of what its programmable architecture can do, and they primarily deal with rendering “corner cases” where traditional renderers may sometimes fall down.  The examples they give deal with sorting, z-cull, and their “irregular Z-buffer”.

Once the bins are ready, we can see how the tiles are processed.  Tile data is stored in the L2 cache, where it can then be accessed for further operations on that data or for rendering.

    The sorting portion supposedly resolves some of the issues with transparency and pixel sorting.  When a scene is rendered, it is rendered either back to front or front to back.  In back-to-front rendering the rearmost pixels are rendered first and the frontmost pixels last.  This creates a very large amount of overdraw and a big performance drop, but it is very accurate when rendering scenes that include some type of transparency or fog.  Front-to-back rendering is much more efficient, as pixels which are occluded are never rendered (something else a tiling architecture is good at).  However, some of the effects that involve transparency or opacity can be incorrect in their final representation.  With Intel’s highly programmable architecture, a very efficient z-cull mechanism can be programmed in to accurately render these problematic scenes.  Again, Intel did not go into great detail about any of these features, so we do not know how effective the mechanism will be, or whether the functionality will even be exposed to developers in the future.
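The overdraw trade-off described above is easy to see with a toy model of a single pixel covered by several opaque surfaces.  With a depth test, front-to-back order shades only the visible surface, while back-to-front shades every one of them; for transparency, though, the blend order matters, which is why back-to-front survives despite the cost.  (This is a generic illustration, not Intel's z-cull implementation.)

```python
def shade_count(depths, front_to_back: bool) -> int:
    """Count how many fragments at one pixel actually get shaded.
    `depths` lists fragment depths (smaller = closer), all opaque."""
    order = sorted(depths, reverse=not front_to_back)
    nearest = float("inf")
    shaded = 0
    for depth in order:
        if depth < nearest:       # simple early-Z style depth test
            nearest = depth
            shaded += 1
    return shaded

depths = [5.0, 2.0, 9.0, 1.0]     # four opaque surfaces stacked at one pixel
back_first = shade_count(depths, front_to_back=False)   # shades all four
front_first = shade_count(depths, front_to_back=True)   # shades only one
```

A programmable z-cull stage could, in principle, pick the cheap order per tile and fall back to sorted blending only where transparent fragments are actually present.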

    The irregular Z-buffer is aimed at making shadowing much more realistic by eliminating some of the annoying quirks that standard shadow maps bring to a scene.  Most shadows are sampled at a much lower resolution than the scene, so the shadows (if not softened by pixel shading routines) will appear very jagged and often incorrect when dealing with more complex geometry.  Another problem, and one that users can notice when running games like Bioshock, is that shadows on parts of the body which are touching other surfaces (for example, feet touching the ground) are incorrect, giving the user the impression that the character is actually floating off the ground or not touching anything at all.
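The jaggedness comes from a resolution mismatch: many screen pixels map to the same shadow-map texel, so shadow edges snap to texel boundaries.  A one-dimensional toy illustration (the 1024/128 resolutions are made up for the example; an irregular Z-buffer instead stores depth samples at the exact screen-pixel positions, so no such mismatch exists):

```python
def shadow_texel(screen_x: int, screen_res: int, shadow_res: int) -> int:
    """Map a screen-space position to the shadow-map texel covering it,
    assuming the two are aligned 1:1 over the same extent."""
    return screen_x * shadow_res // screen_res

# A 1024-pixel-wide surface sampled through a 128-texel shadow map:
covered = {t: sum(1 for x in range(1024) if shadow_texel(x, 1024, 128) == t)
           for t in range(128)}
# Every shadow texel is stretched across 8 screen pixels, so a shadow
# edge can only land on every 8th pixel -> blocky, "detached" shadows.
```

It is that snapping at contact points which produces the floating-feet artifact described above, and sampling at true pixel positions is precisely what the irregular Z-buffer promises to fix.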