Re-introduction to Nehalem
Last week was the final Intel Developer Forum before the release of the new Nehalem processor, and with that expo came the final pieces of the hardware puzzle. The architecture has been explained and discussed in a general sense since early 2007, but last week we finally got the details on some key architectural improvements that help Nehalem stand out from both Intel's and AMD's current generation lineups.
Prior to this piece, I had written not one, not two, but three rough-draft analyses of the Nehalem core. (Here, here and here.) Some of what we cover today will repeat those details, but even I could use a refresher course after all the time that has passed since the early March 2007 briefings.
Intel's Design Goals
Since introducing the "tick-tock" method of processor design several generations ago, Intel has really impressed me with its ability to lay out a roadmap years in advance and hit the dates and performance targets nearly dead on. The "tock" of this design mentality is a new microarchitecture (like Merom) while the "tick" is an upgraded process technology (like the move from 65nm to 45nm with Penryn). Nehalem will be the next "tock" on this scale, followed by a 32nm shrink called Westmere.
Intel has already laid out the next "tock" as well. It is called Sandy Bridge, and Intel is keeping mostly mum about the features and details of this chip until sometime next year.
Here you can see a die shot of the new Nehalem processor - in this iteration a four core design with two separate QPI links and an L3 cache that is large in relation to the rest of the chip. The primary goal of Nehalem was to take the big performance advantages that the Core 2 CPUs have and modularize them. Now with the Nehalem design, which will be branded as the Intel Core i7, Intel can easily create a range of processors from 1 core to 8 cores depending on the application and market demands. Eight core CPUs will be found in servers, while you'll find dual core machines in the mobile market several months after the initial desktop introduction. The number of QPI (QuickPath Interconnect) channels can also vary in order to improve CPU-to-CPU communication.
The current Intel flagship CPU, the Core 2 Duo/Quad design, is still quite the performer. It introduced a 4-wide execution engine and 128-bit wide SSE execution, and the Penryn refresh added the SSE4.1 instructions. Smart Cache and Smart Memory Access were marketing names given to better caching systems and protocols that improved performance marginally over the previous design.
At a high level, the Nehalem core adds some key features to the processor designs we currently have with Penryn. SSE instructions get a bump to revision 4.2, the branch prediction and pre-fetch algorithms are improved, and simultaneous multi-threading (SMT) makes a return after a brief hiatus following the NetBurst architecture.
Viewed purely at the block diagram level, here is what Intel's Nehalem architecture has to offer. We will walk through most of these features and specifications on the following pages.
Nehalem Decode Engine
The first section of the Nehalem architecture includes the fetch and decode operations as well as the first layer of cache, and is dubbed the "front-end" of the design. This part of the processor is responsible for creating the micro-ops for the compute engine to crunch on while performing effective branch prediction. New in the Nehalem design are updated macrofusion techniques and an improved loop stream detector.
Macrofusion is a technique introduced with the Core 2 design that combines specific pairs of instructions into one operation for faster execution and better efficiency. This was only possible in 32-bit mode before, but with Nehalem the benefit will apply to 64-bit systems as well. The loop stream detector, while not new, has been improved by moving it past the instruction decode step. This allows it to hold as many as 28 already-decoded micro-ops, so a short loop can be replayed without being fetched and decoded again on every pass.
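To make the two front-end features a little more concrete, here is a minimal Python sketch of the idea, not Intel's actual logic. It assumes (and this is my simplification, not something from the briefings) that every instruction decodes to one micro-op and that only a compare or test followed directly by a conditional jump gets fused. The function names and the instruction mnemonics in the sets are purely illustrative; the 28 micro-op limit is the figure quoted above.

```python
# Toy model of macrofusion plus a loop stream detector capacity check.
# Assumptions (not from the article): each instruction decodes to one
# micro-op, and only cmp/test followed directly by a conditional jump fuses.

LSD_CAPACITY = 28  # micro-op capacity quoted in the article

COMPARES = {"cmp", "test"}
COND_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}


def fuse(instructions):
    """Return a micro-op list in which cmp/test + jcc pairs are merged."""
    micro_ops = []
    i = 0
    while i < len(instructions):
        op = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if op in COMPARES and nxt in COND_JUMPS:
            micro_ops.append(op + "+" + nxt)  # one fused micro-op
            i += 2
        else:
            micro_ops.append(op)
            i += 1
    return micro_ops


def fits_in_lsd(micro_ops, capacity=LSD_CAPACITY):
    """True if the whole loop body could be replayed from the buffer."""
    return len(micro_ops) <= capacity


loop_body = ["mov", "add", "add", "cmp", "jne"]
fused = fuse(loop_body)
print(fused)               # ['mov', 'add', 'add', 'cmp+jne']
print(fits_in_lsd(fused))  # True
```

Five instructions become four micro-ops here, and the fused loop body is small enough to sit in the buffer. A loop that is slightly too long before fusion could, in this simplified model, squeeze under the limit afterwards.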
The branch prediction unit has also seen some improvements - the examples Intel offered up include a second-level (L2) branch predictor for larger code sizes and a renamed return stack buffer.
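For readers who have not met a return stack before, the basic idea can be sketched in a few lines of Python. This models a plain fixed-depth return-address stack only; it does not attempt to show what the renaming in Nehalem changes, since Intel's examples did not go to that level. The class name, method names and the depth value are all hypothetical.

```python
from collections import deque


class ReturnStackBuffer:
    """Fixed-depth stack of predicted return addresses (toy model)."""

    def __init__(self, depth=16):  # depth is an arbitrary illustrative value
        self.entries = deque(maxlen=depth)  # oldest entry drops when full

    def on_call(self, return_address):
        # A call pushes the address of the instruction that follows it.
        self.entries.append(return_address)

    def predict_return(self):
        # A return pops the most recent entry; empty means no prediction.
        return self.entries.pop() if self.entries else None


rsb = ReturnStackBuffer(depth=2)
rsb.on_call(0x100)
rsb.on_call(0x200)
rsb.on_call(0x300)                # overflows: 0x100 is discarded
print(hex(rsb.predict_return()))  # 0x300
print(hex(rsb.predict_return()))  # 0x200
print(rsb.predict_return())       # None
```

The last line shows the weakness of any small return stack: once calls nest deeper than the buffer, the oldest return address is lost and that return has to be predicted some other way.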