Heterogeneous Uniform Memory Access
Several years ago we first heard of AMD’s plans to create a uniform memory architecture that would allow the CPU to share an address space with the GPU. The promise is a very efficient architecture that provides excellent performance in a mixed environment of serial and parallel workloads. When GPU computing came on the scene it was full of great promise. A heavily parallel processing unit that accelerates both integer and floating point workloads could be a gold mine in a wide variety of applications. Alas, the results we have seen so far have not met expectations. There are many problems with splitting serial and parallel workloads between CPUs and GPUs, and a lot of them come down to very basic programming and the communication of data between two separate memory pools.
CPUs and GPUs do not share a common memory pool. Instead of passing a pointer to tell each unit where data is stored in memory, the current implementation of GPU computing requires the CPU to write the contents of that address to the standalone memory pool of the GPU. This is time consuming and wastes cycles. It also increases programming complexity. Typically only very advanced programmers with a lot of expertise in this subject can write effective code that takes these limitations into consideration. The lack of unified memory between CPU and GPU has hindered the adoption of the technology in many applications that could potentially use the massively parallel processing capabilities of a GPU.
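The difference between copying data into a second memory pool and simply handing over a pointer can be sketched on the host alone. The snippet below is only an analogy in plain Python, not GPU code: `device_copy` and `shared_view` are hypothetical names standing in for a buffer copied to a separate pool and a buffer reached through a shared address space.

```python
# Host-side analogy only: no GPU is involved here.
data = bytearray(range(8))        # stands in for a buffer in CPU memory

# Separate-pool model: the "device" receives its own copy of the bytes.
device_copy = bytes(data)

# Shared-address-space model: the "device" receives a view of the same bytes.
shared_view = memoryview(data)

data[0] = 255                     # the CPU updates the buffer afterwards

print(device_copy[0])             # the copy still holds the old value
print(shared_view[0])             # the view sees the update, no transfer needed
```

In the copy model every update has to be sent across again before the other side can see it, which is the overhead a shared address space is meant to remove.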
The idea for GPU compute has been around for a long time (comparatively). I still remember getting very excited about the idea of using a high end video card along with a card like the old GeForce 6600 GT to be a coprocessor which would handle heavy math operations and PhysX. That particular plan never quite came to fruition, but the idea was planted years before the actual introduction of modern DX9/10/11 hardware. It seems as if this step with hUMA could actually provide a great amount of impetus to implement a wide range of applications which can actively utilize the GPU portion of an APU.
Jaguar Hits the Embedded Space
It has long been known that AMD has simply not had a lot of luck going head to head against Intel in the processor market. Some years back they worked on differentiating themselves, and in so doing have been able to stay afloat through hard times. The acquisitions that AMD has made in the past decade are starting to make a difference in the company, especially now that the PC market that they have relied upon for revenue and growth opportunities is suddenly contracting. This of course puts a cramp in AMD’s style, but with better than expected results in their previous quarter, things are not nearly as dim as some would expect.
Q1 was still pretty harsh for AMD, but they maintained their market share in both processors and graphics chips. One area that looks to get a boost is that of embedded processors. AMD has offered embedded processors for some time, but with the way the market is heading they look to really ramp up their offerings to fit a variety of applications and SKUs. The last generation of G-series processors was based upon the Bobcat/Brazos platform. This two chip design (APU and media hub) came in a variety of wattages with good performance from both the CPU and GPU portions. While the setup looked pretty good on paper, it was not widely implemented because of the added complexity of a two chip design plus thermal concerns vs. performance.
AMD looks to address these problems with one of their first true SOC designs. The latest G-series SOCs are based upon the brand new Jaguar core from AMD. Jaguar is the successor to the successful Bobcat core, which is a low power, dual core processor with integrated DX11/VLIW5 based graphics. Jaguar improves CPU performance vs. Bobcat by between 6% and 13% when clocked identically, and because it is manufactured on a smaller process node it is able to do so without using as much power. Jaguar can come in both dual core and quad core packages. The graphics portion is based on the latest GCN architecture.
Subject: Processors | April 17, 2013 - 09:48 PM | Tim Verry
Tagged: overclocking, intel ivr, intel hd graphics, Intel, haswell, cpu
During the Intel Developer Forum in Beijing, China the x86 chip giant revealed details about how overclocking will work on its upcoming Haswell processors. Enthusiasts will be pleased to know that the new chips do not appear to be any more restrictive than the existing Ivy Bridge processors as far as overclocking. Intel has even opened up the overclocking capabilities slightly by allowing additional BCLK tiers without putting aspects such as the PCI-E bus out of spec.
The new Haswell chips have an integrated voltage regulator, which allows programmable voltage to the CPU, memory, and GPU portions of the chip. As far as overclocking the CPU itself, Intel has opened up Turbo Boost and is allowing enthusiasts to set an overclocked Turbo Boost clockspeed. Additionally, Intel is specifying available BCLK values of 100, 125, and 167MHz without putting other systems out of spec (they use different ratios to counterbalance the increased BCLK, which is important for keeping the PCI-E bus at ~100MHz). The chips will also feature unlocked core ratios all the way up to 80 in 100MHz increments. That would allow enthusiasts with a cherry-picked chip and outrageous cooling to clock the chip up to 8GHz without overclocking the BCLK value (though no chip is likely to reach that clockspeed, especially for everyday usage!).
Remember that the CPU clockspeed is determined by the BCLK value times a pre-set multiplier. Unlocked processors will allow enthusiasts to adjust the multiplier up or down as they please, while non-K edition chips will likely only permit lower multipliers with higher-than-default multipliers locked out. Further, Intel will allow the adventurous to overclock the BCLK value above the pre-defined 100, 125, and 167MHz options, but the chip maker expects most chips will max out at anywhere between five and seven percent higher than normal. PC Perspective’s Morry Teitelman speculates that slightly higher BCLK overclocks may be possible if you have a good chip and adequate cooling, however.
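To make the BCLK-times-multiplier relationship concrete, here is a minimal Python sketch. The function names and the 4GHz target are my own illustrative assumptions; only the BCLK tiers, the ratio ceiling of 80, and the five to seven percent headroom come from the figures quoted above.

```python
def core_clock_mhz(bclk_mhz, ratio):
    """CPU clock = base clock (BCLK) times the core ratio (multiplier)."""
    return bclk_mhz * ratio

def bclk_with_headroom(bclk_mhz, percent):
    """BCLK after raising it by the given percentage above a tier."""
    return bclk_mhz * (100 + percent) / 100

# Ratio ceiling of 80 at the stock 100MHz BCLK: the theoretical 8GHz figure.
print(core_clock_mhz(100, 80))

# Hypothetical 4GHz target: the nearest ratio at each supported BCLK tier.
TARGET_MHZ = 4000
for bclk in (100, 125, 167):
    ratio = round(TARGET_MHZ / bclk)
    print(bclk, ratio, core_clock_mhz(bclk, ratio))

# Intel's expected BCLK ceiling of roughly 5 to 7 percent above a tier.
print(bclk_with_headroom(100, 5), bclk_with_headroom(100, 7))
```

Note how the higher BCLK tiers reach the same target with a lower ratio, which is why the tiers matter most on chips whose upper multipliers are locked.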
Similar to current-generation Ivy Bridge (and Sandy Bridge before that) processors, Intel will pack Haswell processors with its own HD Graphics GPU. The new HD Graphics will be unlocked and the graphics ratio will be able to scale up to a maximum of 60 in 50MHz steps for a potential maximum of 3GHz. The new processor graphics will also benefit from Intel’s IVR (programmable voltage) circuitry. The HD Graphics and CPU are fed voltage from the integrated voltage regulator (IVR), which is controlled by adjusting the Vccin value. The default is 1.8V, but it supports a recommended range of 1.8V to 2.3V with a maximum of 3V.
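The graphics ratio and the Vccin limits can be sketched the same way. This is a rough illustration rather than anything Intel ships: the function names and the labels returned by `classify_vccin` are my own, and treating anything below the 1.8V default as out of range is an assumption, since Intel only quoted the default, the recommended range, and the maximum.

```python
GFX_STEP_MHZ = 50      # graphics clock step quoted above
GFX_MAX_RATIO = 60     # maximum graphics ratio quoted above

def gfx_clock_mhz(ratio):
    """Graphics clock = graphics ratio times the 50MHz step."""
    if not 1 <= ratio <= GFX_MAX_RATIO:
        raise ValueError("graphics ratio must be between 1 and 60")
    return ratio * GFX_STEP_MHZ

def classify_vccin(volts):
    """Label a Vccin setting against the quoted 1.8V to 2.3V range and 3V cap."""
    if 1.8 <= volts <= 2.3:
        return "recommended"
    if 2.3 < volts <= 3.0:
        return "above recommended"
    # Assumption: values under the 1.8V default are treated as invalid too.
    return "out of range"

print(gfx_clock_mhz(GFX_MAX_RATIO))   # the 3GHz theoretical ceiling, in MHz
print(classify_vccin(1.8), classify_vccin(2.5), classify_vccin(3.1))
```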
Finally, Intel is opening up the memory controller to further overclocking. Intel will allow enthusiasts to overclock the memory in either 200MHz or 266MHz increments, which allows for a maximum of either 2,000MHz or 2,666MHz respectively. The default voltage will depend on the particular RAM DIMMs you use, but can be controlled via the Vddq IVR setting.
It remains to be seen how Intel will lock down the various processor SKUs, especially the non-K edition chips, but at least now we have an idea of how a fully-unlocked Haswell processor will overclock. On a positive note, it is similar to what we have become used to with Ivy Bridge, so similar overclocking strategies for getting the most out of processors should still apply with a bit of tweaking. I’m interested to see how the integration of the voltage regulation hardware will affect overclocking though. Hopefully it will live up to the promises of increased efficiency!
Are you gearing up for a Haswell overhaul of your system, and do you plan to overclock?
In addition to Intel's announcement of new Xeon processors, the company is launching three new Atom-series processors for servers later this year. The new processor lineups include the Intel Atom S12x9 family for storage applications, Rangeley processors for networking gear, and Avoton SoCs for low-power micro-servers.
The Intel Atom S12x9 family takes the existing S1200 processors and makes a few tweaks to optimize the SoCs for storage servers and other storage appliances. For reference, the Intel Atom S1200 series of processors features sub-9W TDPs, 1MB of cache, and two physical CPU cores clocked at up to 2GHz. However, Intel did not list the individual S12x9 SKUs or specifications, so it is unknown if they will also be clocked at up to 2GHz. The new Atom S12x9 processors will feature 40 PCI-E 2.0 lanes (24 Root Port and 16 Non-Transparent Bridge) to provide ample bandwidth between I/O and processor. The SoCs also feature hardware RAID acceleration, Native Dual-Casting, and Asynchronous DRAM Self-Refresh. Native Dual-Casting allows data to be read from one source and written to two memory locations simultaneously, while Asynchronous DRAM Self-Refresh protects data during a power failure.
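For a rough sense of what 40 PCI-E 2.0 lanes means in bandwidth terms, the arithmetic can be sketched as follows. The 5 GT/s signalling rate and 8b/10b encoding are standard PCI-E 2.0 figures rather than anything Intel stated for these parts, and the result is a theoretical per-direction peak, not a measured number.

```python
PCIE2_MEGATRANSFERS = 5000   # PCI-E 2.0 signals at 5 GT/s per lane
TOTAL_LANES = 40             # lane count quoted for the S12x9 family
NTB_LANES = 16               # Non-Transparent Bridge lanes quoted above

def lane_payload_mb_s():
    """Usable MB/s per lane, per direction, after 8b/10b encoding overhead."""
    payload_mbit = PCIE2_MEGATRANSFERS * 8 // 10   # 8 data bits per 10 line bits
    return payload_mbit // 8                       # bits to bytes

per_lane = lane_payload_mb_s()
print(per_lane)                  # MB/s per lane
print(per_lane * TOTAL_LANES)    # MB/s across all 40 lanes
print(per_lane * NTB_LANES)      # MB/s across the 16 NTB lanes
```

That works out to a theoretical 20GB/s in each direction across all 40 lanes, which gives some scale to the "ample bandwidth" claim for a sub-9W part.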
The new chips are available now to customers and will be available in OEM systems later this year. Vendors that plan to release systems with the S12x9 processors include Accusys, MacroSAN, Qnap, and Qsan.
Intel is also introducing a new series of processors, codenamed Rangeley, that is intended to power future networking gear. The 22nm Atom SoC is slated to be available sometime in the second half of this year (2H'13). Intel is positioning the Rangeley processors at entry-level to mid-range routers, switches, and security appliances.
While S12x9 and Rangeley are targeted at specific tasks, the company is also releasing a general purpose Atom processor codenamed Avoton. The Avoton SoCs are aimed at low power micro-servers, and are Intel's answer to ARM chips in the server room. Avoton is Intel's second generation 64-bit Atom processor series. It uses the company's Silvermont architecture on a 22nm process. The major update with Avoton is the inclusion of an Ethernet controller built into the processor itself. According to Intel, building networking into the processor instead of as a chip on a separate add-on board results in "significant improvements in performance per watt." These chips are currently being sampled to partners, and should be available in Avoton-powered servers later this year (2H'13).
This year is certainly shaping up to be an interesting year for Atom processors. I'm excited to see how the battle unfolds between the ARM and Atom-based solutions in the data center.
Subject: Processors | April 3, 2013 - 08:35 AM | Tim Verry
Tagged: mobile, Lenovo, electrical engineering, chip design, arm
According to a recent article in the EE Times, Beijing-based PC OEM Lenovo may be entering the mobile chip design business. An anonymous source allegedly familiar with the matter has indicated that Lenovo will be expanding its Integrated Circuits design team to 100 engineers by the second half of this year. Further, Lenovo will reportedly task the newly-expanded team with designing an ARM processor of its own to join the ranks of Apple, Intel, NVIDIA, Qualcomm, Huawei, Samsung, and others.
It is unclear whether Lenovo simply intends to license an existing ARM core and graphics module or if the design team expansion is merely the beginning of a growing division that will design a custom chip for its smartphones and Chromebooks to truly differentiate itself and take advantage of vertical integration.
Junko Yoshida, author of the EE Times article, notes that Lenovo was turned away by Samsung when it attempted to use the company's latest Exynos Octa processor. I think that might contribute to the desire to have its own chip design team, but it may also be that the company believes it can compete in a serious way and set its lineup of smartphones apart from the crowd (as Apple has managed to do) as it pursues further Chinese market share and slowly moves its phones into the United States market.
Details are scarce, but it is at least an intriguing potential future for the company. It will be interesting to see if Lenovo is able to make it work in this extremely competitive and expensive area.
Do you think Lenovo has what it takes to design its own mobile chip? Is it a good idea?
Subject: Editorial, General Tech, Processors, Shows and Expos | March 20, 2013 - 06:26 PM | Scott Michaud
Tagged: windows rt, nvidia, GTC 2013
NVIDIA develops processors, but without an x86 license they are only able to power ARM-based operating systems. When it comes to Windows, that means Windows Phone or Windows RT. The latter segment of the market has disappointing sales according to multiple OEMs, which Microsoft blames them for, but the jolly green GPU company is not crying doomsday.
NVIDIA just skimming the Surface RT, they hope.
As reported by The Verge, NVIDIA CEO Jen-Hsun Huang was optimistic that Microsoft would eventually let Windows RT blossom. He noted how Microsoft very often "gets it right" at some point when they push an initiative. And it is true, Microsoft has a history of turning around perceived disasters across a variety of devices.
They also have a history of, as they call it, "knifing the baby."
I think there is a very real fear for some that Microsoft could consider Intel's latest offerings as good enough to stop pursuing ARM. Of course, the more they pursue ARM, the more their business model will rely upon the-interface-formerly-known-as-Metro and likely all of its certification politics. As such, I think it is safe to say that I am watching the industry teeter on a fence with a bear on one side and a pack of rabid dogs on the other. On the one hand, Microsoft jumping back to Intel would allow them to perpetuate the desktop and all of the openness it provides. On the other hand, even if they stick with Intel they likely will just kill the desktop anyway, for the sake of reducing user confusion and gaining the security benefits of certification. We might just have fewer processor manufacturers when they do that.
So it could be that NVIDIA is confident that Microsoft will push Windows RT, or it could be that NVIDIA is pushing Microsoft to continue to develop Windows RT. Frankly, I do not know which would be better... or more accurately, worse.
Subject: Processors | March 12, 2013 - 02:52 PM | Jeremy Hellstrom
Tagged: VLIW4, trinity, Richland, piledriver, notebook, mobile, hd 8000, APU, amd, A10-5750
The differences between Richland and Trinity are not earth shattering but there are certainly some refinements implemented by AMD in the A10-5750. One very noticeable one is support for DDR3-1866, along with better power management for both the CPU and GPU; with new temperature balancing algorithms and measurement, the ability to balance the load properly has improved over Trinity. Many AMD users will be more interested in the GPU portion of the die than the CPU, as that is where AMD actually has a lead on Intel. This particular chip contains the HD8650G, with clocks of 720MHz boost and 533MHz base, an increase over the previous generation of 35 and 37MHz respectively. You can read more about the other three models that will be released over at The Tech Report.
"AMD has formally introduced the first members of its Richland APU family. We have the goods on the chips and Richland's new power management tech, which combines temperature-based inputs with bottleneck-aware clock boosting."
Here are some more Processor articles from around the web:
- AMD Richland APU Preview: Trinity Gets a Facelift @ Hardware Canucks
- 2013 AMD Mobile APU (Richland) @ Bjorn3D
- Westmere-EP to Sandy Bridge-EP: The Scientist Potential Upgrade @ AnandTech
- AMD Phenom II X4 955, Phenom II X4 960T, Phenom II X6 1075T and Intel Pentium G2120, Core i3-3220, Core i5-3330 @ ixbt.com
- AMD FX-8350 @ iXBT Labs
- The new Opteron 6300: Finally Tested! @ AnandTech
- Intel Core i5-3570K vs. i7-3770K Ivy Bridge @ techPowerUp
AMD Exposes Richland
When we first heard about “Richland” last year, there was a little bit of excitement from people. Not many were sure what to expect other than a faster “Trinity” based CPU with a couple extra goodies. Today we finally get to see what Richland is. While interesting, it is not necessarily exciting. While an improvement, it will not take AMD over the top in the mobile market. What it actually brings to the table is better competition and a software suite that could help to convince buyers to choose AMD instead of a competing Intel part.
From a design standpoint, it is nearly identical to the previous Trinity. That being said, a modern processor is not exactly simple. A lot of software optimizations can be applied to these products to increase performance and efficiency. It seems that AMD has done exactly that. We had heard rumors that the graphics portion was in fact changed, but it looks like it has stayed the same. Process improvements have been made, but that is about the extent of actual hardware changes to the design.
The new Richland APUs are branded the A-5000 series of products. The top end is the A10-5750M with HD-8650 integrated graphics. This is still the VLIW-4 based graphics unit seen in the previous Trinity products, but enough changes have been made with software that it can enable Dual Graphics with the new Solar System based GPUs (GCN). The speeds of these products have received a nice boost. As compared to the previous top end A10-4600M, the 5750M takes the base speed from 2.3 GHz to 2.5 GHz. Boost goes from 3.2 GHz up to 3.5 GHz. The graphics portion takes the base clock from 496 MHz up to 533 MHz, while turbo mode improves over the 4600M from 685 MHz to 720 MHz. These are not staggering figures, but it all still fits within the 35 watt TDP of the previous product.
One other important improvement is the ability to utilize DDR3-1866 memory. Throughout the past year we have seen memory densities increase fairly dramatically without impacting power consumption. This goes for speed as well. While we would expect to see lower power DIMMs be used in the thin and light categories, expect to see faster DDR3-1866 in the larger notebooks that will soon be heading our way.
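To put those clock bumps in perspective, the quick Python sketch below turns the figures quoted above into absolute and percentage gains. The table layout and the `uplift` helper are just for illustration; the clock speeds themselves are the ones listed for the previous top end Trinity part and the new A10-5750M.

```python
# Clock speeds in MHz quoted above: (previous Trinity part, new Richland part).
CLOCKS_MHZ = {
    "CPU base":  (2300, 2500),
    "CPU boost": (3200, 3500),
    "GPU base":  (496, 533),
    "GPU turbo": (685, 720),
}

def uplift(old_mhz, new_mhz):
    """Return (absolute gain in MHz, relative gain in percent)."""
    gain = new_mhz - old_mhz
    return gain, 100 * gain / old_mhz

for name, (trinity, richland) in CLOCKS_MHZ.items():
    gain, pct = uplift(trinity, richland)
    print(f"{name}: +{gain} MHz ({pct:.1f}%)")
```

Every gain lands under ten percent, which squares with the "not staggering" verdict, though all of it comes within the same 35 watt TDP.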
Subject: Processors | February 20, 2013 - 09:35 PM | Josh Walrath
Tagged: Tegra 4i, tegra 4, tegra 3, Tegra 2, tegra, phoenix, nvidia, icera, i500
The NVIDIA Tegra 4 and Shield project were announced at this year’s CES, but there were other products in the pipeline that were just not quite ready to see the light of day at that time. While Tegra 4 is an impressive looking part for mobile applications, it is not entirely appropriate for the majority of smartphones out there. Sure, the nebulous “Superphone” category will utilize Tegra 4, but that is not a large part of the smartphone market. The two basic issues with Tegra 4 are that it pulls a bit more power at the rated clockspeeds than some manufacturers like, and it does not contain a built-in modem for communication needs.
The die shot of the Tegra 4i. A lot going on in this little guy.
NVIDIA bought up UK modem designer Icera to help create true all-in-one SOCs. Icera has a unique method of building their modems that they say is not only more flexible than what others are offering, but also much more powerful. These modems skip a lot of the fixed function units that most modems are made of and rely on high speed general purpose compute units and an interesting software stack to create smaller modems with greater flexibility when it comes to wireless standards. At CES NVIDIA showed off the first product of this acquisition, the i500. This is a standalone chip and is set to be offered with the Tegra 4 SOC.
Yesterday NVIDIA introduced the Tegra 4i, formerly codenamed “Grey”. This is a combined Tegra SOC with the Icera i500 modem. This is not exactly what we were expecting, but the results are actually quite exciting. Before I get too out of hand about the possibilities of the chip, I must make one thing perfectly clear. The chip itself will not be available until Q4 2013. It will be released in limited products with greater availability in Q1 2014. While NVIDIA is announcing this chip, end users will not get to use it until much later this year. I believe this issue is not so much that NVIDIA cannot produce the chips, but rather the design cycles of new and complex cell phones do not allow for rapid product development.
Tegra 4i really should not be confused with the slightly earlier Tegra 4. The 4i actually uses the 4th revision of the Cortex A9 processor rather than the Cortex A15 in the Tegra 4. The A9 has been a mainstay of modern cell phone processors for some years now and offers a great deal of performance when considering die size and power consumption. The 4th revision improves IPC of the A9 in a variety of ways (memory management, prefetch, buffers, etc.), so it will perform better than previous Cortex A9 solutions. Performance will not approach that provided by the much larger and more complex A15 cores, but it is a nice little boost from what we have previously seen.
The Tegra 4 features a 72 core GPU (though NVIDIA has still declined to detail the specifics of their new mobile graphics technology; these ain’t Kepler though), while the 4i features a nearly identical unit featuring 60 cores. There is no word so far as to what speed these will be running at or how performance really compares to the latest graphics products from ARM, Imagination, or Qualcomm.
The chip is made on TSMC’s 28 nm HPM process and features core speeds up to 2.3 GHz. We again have no information on whether that will be all four cores at that speed or turbo functionality with one core. The design adopts the previous 4+1 core setup with four high speed cores and one power saving core. Considering how small each core is (Cortex A9 or A15), the extra core is not a waste of silicon when weighed against the potential power savings. The HPM process is the high performance mobile version rather than the HPL (low power) process used for Tegra 4. My guess here is that the A9 cores are not going to pull all that much power anyway due to their simpler design as compared to A15. Hitting 2.3 GHz is also a factor in the process decision. Also consider that the +1 core is fabricated slightly differently than the other four to allow for slower transistor switching speed with much lower leakage.
The die size looks to be in the 60 to 65 mm squared range. This is not a whole lot larger than the original Tegra 2 which was around 50 mm squared. Consider that the Tegra 4i has three more cores, a larger and more able GPU portion, and the integrated Icera i500 modem. The modem is a full Cat 3 LTE capable unit (100 Mbps), so bandwidth should not be an issue for this phone. The chip has all of the features of the larger Tegra 4, such as the Computational Photography Architecture, Image Signal Processor, video engine, and the “optimized memory interface”. All of those neat things that NVIDIA showed off at CES will be included. The only major feature that is not present is the ability to output 3200x2000 resolutions. This particular chip is limited to 1920x1200. Not a horrific tradeoff considering this will be a smartphone SOC with a max of 1080P resolution for the near future.
We expect to see Tegra 4 out in late Q2 in some devices, but not a lot. While Tegra 4 is certainly impressive, I would argue that Tegra 4i is the more marketable product with a larger chance of success. If it were available today, I would expect its market impact to be similar to what we saw with the original 28nm Krait SOCs from Qualcomm last year. There is simply a lot of good technology in this core. It is small, it has a built-in modem, and performance per mm squared looks to be pretty tremendous. Power consumption will be appropriate for handhelds, and perhaps might turn out to be better than most current solutions built on 28 nm and 32 nm processes.
NVIDIA also developed the Phoenix Reference Phone which features the Tegra 4i. This is a rather robust looking unit with a 5” screen and 1080P resolution. It has front and rear facing cameras, USB and HDMI ports, and is only 8 mm thin. Just as with the original Tegra 3 it features the DirectTouch functionality which uses the +1 core to handle all touch inputs. This makes it more accurate and sensitive as compared to other solutions on the market.
Overall I am impressed with this product. It is a very nice balance of performance, features, and power consumption. As mentioned before, it will not be out until Q4 2013. This will obviously give the competition some time to hone their own products and perhaps release something that will not only compete well with Tegra 4i in its price range, but exceed it in most ways. I am not entirely certain of this, but it is a potential danger. The potential is low though, as the design cycles for complex and feature packed cell phones are longer than 6 to 7 months. While NVIDIA has had some success in the SOC market, they have not had a true home run yet. Tegra 2 and Tegra 3 had their fair share of design wins, but did not ship in numbers that came anywhere near those of Qualcomm or Samsung. Perhaps Tegra 4i will be that breakthrough part for NVIDIA? Hard to say, but when we consider how aggressive this company is, how deep their developer relations are, and how feature packed these products seem to be, then I think that NVIDIA will continue to gain traction and market share in the SOC market.
Subject: Processors | January 25, 2013 - 06:11 PM | Jeremy Hellstrom
Tagged: haswell, Intel, overclocking, speculation, BCLK
hardCOREware is engaging in a bit of informed speculation on how overclocking the upcoming Haswell chips will be accomplished. Now that Intel has relaxed the draconian lock down of frequencies and multipliers that they enforced for a few generations of chips, overclockers are once again getting excited about their new chips. They talk about the departure of the Front Side Bus and the four frequencies which overclockers have been using in modern generations and then share their research on why the inclusion of a GPU on the CPU might just make overclockers very happy.
"This is an overclocking preview of Intel’s upcoming Haswell platform. We have noticed that they have made an architectural change that may be a great benefit to overclockers. Check out our thoughts on the potential return of BCLK overclocking!"
Here are some more Processor articles from around the web:
- Intel Core i7-3960x vs. i7-3970x @ Bjorn3D
- Intel Core i3-3220 v. Intel Core i3-3225 Review @ MissingRemote
- Desktop CPU Comparison Guide @ TechARP
- Testing Memory Speeds on AMD's A10-5800K Trinity APU @ Legit Reviews
- AMD A10 5700K APU @ Guru of 3D