A Detailed Look at Intel's New Core Architecture
Wide Dynamic Execution and Media Boost
Intel is emphasizing five key architectural changes, all of which revolve around the areas described above.
The summary should look very familiar if you read my original IDF article from day one, but this article has a lot more detail for you to digest.
Wide Dynamic Execution
By far the most important change in Intel's Core Architecture is the move to a wider, four-instruction-wide design.
Compared to traditional three-instruction-wide designs, the Core design can work on four separate instructions at the same time as they move through the pipeline. Intel's diagram refers to a "4+" width design, which takes the new macro-fusion feature into account.
Macro-fusion is a new technology that allows Intel's processors to combine common x86 instruction pairs into a single micro-op that can be executed in a single clock on a single ALU. For example, a compare (CMP) and a conditional branch (JCC) can be combined into a single Intel-defined micro-op, a combined compare and conditional branch (CMP+JCC).
The ALUs were enhanced for macro-fusion but are still single-dispatch, single-cycle execution units. An ALU can perform the combined micro-op, which actually represents two x86 instructions, in a single pass. The Core Architecture reads five instructions from the instruction queue, and if it finds a fusible pair, the two are combined and converted into one micro-op. This means that the fifth instruction gets done one cycle earlier than it would have without the fusion process.
When the ALU physically performs the arithmetic on a fused micro-op, the processor 'magically' appears to be both wider and deeper in design, without the cost and drawbacks associated with a true five-instruction-wide design. What we are essentially seeing here is something like data compression applied to instructions: more work is passed through the ALU and completed in less time.
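The fusion step described above can be sketched as a toy model. This is only an illustration under stated assumptions, not Intel's actual decode logic: the function names are hypothetical, the only fusible pair modeled is CMP followed by a conditional jump (the article's example), and at most one pair is fused per cycle, matching the "five instructions read, four micro-ops out" description.

```python
from collections import deque

# Assumption: only CMP followed by a conditional jump fuses, per the example above.
FUSIBLE_FIRST = {"CMP"}


def is_conditional_branch(mnemonic):
    """True for Jcc-style mnemonics such as JNE or JE, but not the unconditional JMP."""
    return mnemonic.startswith("J") and mnemonic != "JMP"


def decode_cycles(instructions, width=4, fusion=True):
    """Count cycles for a toy decoder that emits up to `width` micro-ops per cycle.

    With fusion enabled, at most one CMP + Jcc pair per cycle is merged into a
    single micro-op, so up to width + 1 x86 instructions can leave the queue.
    """
    queue = deque(instructions)
    cycles = 0
    while queue:
        slots_used = 0
        fused_this_cycle = False
        while queue and slots_used < width:
            first = queue.popleft()
            can_fuse = (
                fusion
                and not fused_this_cycle
                and first in FUSIBLE_FIRST
                and queue
                and is_conditional_branch(queue[0])
            )
            if can_fuse:
                queue.popleft()  # the branch rides along in the same micro-op
                fused_this_cycle = True
            slots_used += 1
        cycles += 1
    return cycles


block = ["MOV", "ADD", "SUB", "CMP", "JNE"]
print(decode_cycles(block, fusion=True))   # 1
print(decode_cycles(block, fusion=False))  # 2
```

In this model the five-instruction block fits in one cycle with fusion and needs a second cycle without it, which is the "one cycle earlier" effect on the fifth instruction.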
Advanced Digital Media Boost
With the new Core Architecture, Intel's processors will be able to execute all of the SSE instructions, up to 128 bits wide, in a single cycle.
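As a rough back-of-the-envelope sketch of why single-cycle 128-bit execution matters, the snippet below compares a full 128-bit datapath with a 64-bit one that needs two passes per operation. The 64-bit comparison point, the single fully pipelined execution unit, and the function name are all assumptions made for illustration; they are not figures from Intel.

```python
def cycles_for_sse_ops(num_ops, datapath_bits, op_bits=128):
    """Cycles for one execution unit to issue `num_ops` SSE operations.

    Assumes one pass through the datapath per cycle and no other stalls.
    """
    passes_per_op = -(-op_bits // datapath_bits)  # ceiling division
    return num_ops * passes_per_op


print(cycles_for_sse_ops(1000, datapath_bits=64))   # 2000
print(cycles_for_sse_ops(1000, datapath_bits=128))  # 1000
```

Under these assumptions, the same stream of 128-bit operations takes half as many issue cycles on the wider datapath.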
In theory, the processor can now handle a 128-bit multiply, add, store and load in the same clock and still have a single pipeline left open for any other operation. In this diagram Intel showed the macro-fused CMPJCC as the extra instruction being executed, so you are essentially getting six x86 instructions done per clock at a maximum.
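The arithmetic behind the "six per clock" figure can be written out as a small tally. The labels and the helper name are illustrative only; the counts simply restate the best case described above, where four 128-bit operations issue alongside one macro-fused micro-op that stands for two x86 instructions.

```python
# Each entry: (illustrative label, number of x86 instructions it represents)
issued_this_clock = [
    ("128-bit multiply", 1),
    ("128-bit add", 1),
    ("128-bit store", 1),
    ("128-bit load", 1),
    ("macro-fused CMP + JCC", 2),  # one micro-op, two x86 instructions
]


def x86_instructions_per_clock(micro_ops):
    """Sum the x86 instructions represented by the micro-ops issued in one clock."""
    return sum(count for _, count in micro_ops)


print(len(issued_this_clock))                         # 5
print(x86_instructions_per_clock(issued_this_clock))  # 6
```

Five pipelines are busy, but because one of them carries a fused pair, the peak works out to six x86 instructions.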
In this diagram Intel was demonstrating that these full 128-bit access paths cut the latency involved in waiting for math to complete before you could load more data from memory (previously the two used the same pipeline), which can improve performance considerably in applications like media encoding.