ARM Releases Cortex-A72 for Licensing
On February 3rd, ARM announced a slew of new designs, including the Cortex A72. Few details were shared with us, but what we learned was that it could potentially redefine power and performance in the ARM ecosystem. Ryan was invited to London to participate in a deep dive of what ARM has done to improve its position against market behemoth Intel in the very competitive mobile space. Intel has a leg up on process technology with their 14nm Tri-Gate process, but they are continuing to work hard in making their x86 based processors more power efficient, while still maintaining good performance. There are certain drawbacks to using an ISA that is focused on high performance computing rather than being designed from scratch to provide good performance with excellent energy efficiency.
ARM has been on a pretty good roll with their Cortex A9, A7, A15, A17, A53, and A57 parts over the past several years. These designs have been utilized in a multitude of products and scenarios, with configurations that have scaled up to 16 cores. While each iteration has improved upon the previous, ARM is facing the specter of Intel’s latest generation, highly efficient x86 SOCs based on the 2nd gen 14nm Tri-Gate process. Several things have fallen into place for ARM to help them stay competitive, but we also cannot ignore the experience and design hours that have led to this product.
(Editor's Note: During my time with ARM last week it became very apparent that it is not standing still, not satisfied with its current status. With competition from Intel, Qualcomm and others ramping up over the next 12 months in both mobile and server markets, ARM will more than ever be depedent on the evolution of core design and GPU design to maintain advantages in performance and efficiency. As Josh will go into more detail here, the Cortex-A72 appears to be an incredibly impressive design and all indications and conversations I have had with others, outside of ARM, believe that it will be an incredibly successful product.)
Cortex A72: Highest Performance ARM Cortex
ARM has been ubiquitous for mobile applications since it first started selling licenses for their products in the 90s. They were found everywhere it seemed, but most people wouldn’t recognize the name ARM because these chips were fabricated and sold by licensees under their own names. Guys like Ti, Qualcomm, Apple, DEC and others all licensed and adopted ARM technology in one form or the other.
ARM’s importance grew dramatically with the introduction of increased complexity cellphones and smartphones. They also gained attention through multimedia devices such as the Microsoft Zune. What was once a fairly niche company with low performance, low power offerings became the 800 pound gorilla in the mobile market. Billions of chips are sold yearly based on ARM technology. To stay in that position ARM has worked aggressively on continually providing excellent power characteristics for their parts, but now they are really focusing on overall performance and capabilities to address, not only the smartphone market, but also the higher performance computing and server spaces that they want a significant presence in.
ARM Releases Top Cortex Design to Partners
ARM has an interesting history of releasing products. The company was once in the shadowy background of the CPU world, but with the explosion of mobile devices and its relevance in that market, ARM has had to adjust how it approaches the public with their technologies. For years ARM has announced products and technology, only to see it ship one to two years down the line. It seems that with the increased competition in the marketplace from Apple, Intel, NVIDIA, and Qualcomm ARM is now pushing to license out its new IP in a way that will enable their partners to achieve a faster time to market.
The big news this time is the introduction of the Cortex A72. This is a brand new design that will be based on the ARMv8-A instruction set. This is a 64 bit capable processor that is also backwards compatible with 32 bit applications programmed for ARMv7 based processors. ARM does not go into great detail about the product other than it is significantly faster than the previous Cortex-A15 and Cortex-A57.
The previous Cortex-A15 processors were announced several years back and made their first introduction in late 2013/early 2014. These were still 32 bit processors and while they had good performance for the time, they did not stack up well against the latest A8 SOCs from Apple. The A53 and A57 designs were also announced around two years ago. These are the first 64 bit designs from ARM and were meant to compete with the latest custom designs from Apple and Qualcomm’s upcoming 64 bit part. We are only now just seeing these parts make it into production, and even Qualcomm has licensed the A53 and A57 designs to insure a faster time to market for this latest batch of next-generation mobile devices.
We can look back over the past five years and see that ARM is moving forward in announcing their parts and then having their partners ship them within a much shorter timespan than we were used to seeing. ARM is hoping to accelerate the introduction of its new parts within the next year.
Subject: Processors, Mobile | October 29, 2014 - 04:30 AM | Scott Michaud
Tagged: arm, mali-T800, mali
While some mobile SoC manufacturers have created their own graphics architectures, others license from ARM (and some even have a mixture of each within their product stack). There does not seem to be a specific push with this generation, rather just increases in the areas that make the most sense. Some comments tout increased energy efficiency, others higher performance, and even API support got a boost to OpenGL ES 3.1, which brings compute shaders to mobile graphics applications (without invoking OpenCL, etc.).
Three models are in the Mali-T800 series: the T820, the T830, and the T860. As you climb in the list, the products go from entry level to high-performance mobile. GPUs are often designed in modularized segments, which ARM calls cores. You see this frequently in desktop, discrete graphics cards where an entire product stack contains a handful of actual designs, but products are made by disabling whole modules. The T820 and T830 can scale between one to four "core" modules, each core containing four actual "shader cores", while the T860 can scale between one to sixteen "core" modules, each core with 16 "shader cores". Again "core modules" are groups that contain actual shader processors (and L2 cache, etc.). Cores in cores.
This is probably why NVIDIA calls them "Streaming Multiprocessors" that contain "CUDA Cores".
ARM does not (yet) provide an actual GFLOP rating for these processors, and it is up to manufacturers to some extent. It is normally a matter of multiplying the clock frequency by the number of ops per cycle and by the number of shader units available. I tried, but I assume my assumption of instructions per clock was off because the number I was getting did not match with known values from previous generations, so I assumed that I made a mistake. Also, again, ARM considers their performance figures to be conservative. Manufacturers should have no problem exceeding these, effortlessly.
As for a release timeline? Because these architectures are designed for manufacturers to implement, you should start seeing them within devices hitting retail in late 2015, early 2016.
Subject: General Tech | October 31, 2013 - 03:48 PM | Ken Addison
Tagged: podcast, video, R9 290X, amd, radeon, 290x crossfire, 280x, r9 280x, gtx 770, gtx 780, arm, mali, Altera
PC Perspective Podcast #275 - 10/31/2013
Join us this week as we discuss the AMD Radeon R9 290X, ARMTechCon 2013, NVIDIA Pricedrops and more!
The URL for the podcast is: http://pcper.com/podcast - Share with your friends!
- iTunes - Subscribe to the podcast directly through the Store
- RSS - Subscribe through your regular RSS reader
- MP3 - Direct download link to the MP3 file
Hosts: Ryan Shrout, Jeremy Hellstrom, Josh Walrath, and Allyn Malventano
ARM is Serious About Graphics
Ask most computer users from 10 years ago who ARM is, and very few would give the correct answer. Some well informed people might mention “Intel” and “StrongARM” or “XScale”, but ARM remained a shadowy presence until we saw the rise of the Smartphone. Since then, ARM has built up their brand, much to the chagrin of companies like Intel and AMD. Partners such as Samsung, Apple, Qualcomm, MediaTek, Rockchip, and NVIDIA have all worked with ARM to produce chips based on the ARMv7 architecture, with Apple being the first to release the first ARMv8 (64 bit) SOCs. The multitude of ARM architectures are likely the most shipped chips in the world, going from very basic processors to the very latest Apple A7 SOC.
The ARMv7 and ARMv8 architectures are very power efficient, yet provide enough performance to handle the vast majority of tasks utilized on smartphones and tablets (as well as a handful of laptops). With the growth of visual computing, ARM also dedicated itself towards designing competent graphics portions of their chips. The Mali architecture is aimed at being an affordable option for those without access to their own graphics design groups (NVIDIA, Qualcomm), but competitive with others that are willing to license their IP out (Imagination Technologies).
ARM was in fact one of the first to license out the very latest graphics technology to partners in the form of the Mali-T600 series of products. These modules were among the first to support OpenGL ES 3.0 (compatible with 2.0 and 1.1) and DirectX 11. The T600 architecture is very comparable to Imagination Technologies’ Series 6 and the Qualcomm Adreno 300 series of products. Currently NVIDIA does not have a unified mobile architecture in production that supports OpenGL ES 3.0/DX11, but they are adapting the Kepler architecture to mobile and will be licensing it to interested parties. Qualcomm does not license out Adreno after buying that group from AMD (Adreno is an anagram of Radeon).
Subject: General Tech, Graphics Cards, Processors, Mobile | July 23, 2013 - 04:01 AM | Scott Michaud
Tagged: Samsung, mali, exynos
Exynos, the line of System on a Chip (SoC) products from Samsung, were notably absent of ARM Mali GPUs. This, apparently, struck concern over how viable Mali will continue to be and whether ARM will continue to lose designs to competitors such as Imagination Technologies.
Then Samsung announced, Monday evening for us North Americans, the upcoming Exynos 5 Octa Processor will embed six ARM Mali-T628 GPU cores. The T628 GPU cores are capable of OpenCL 1.1 and OpenGL ES 3.0 standards which should allow applications to offload heavy batches of tasks, such as computational photography processing, with high efficiency and performance.
The Exynos 5 Octa contains four ARM Cortex-A15 cores at 1.8GHz, supported by four additional Cortex-A7 cores clocked at 1.3GHz. These processors are currently being sampled and should be produced in August.
Cortex-A12 fills a gap
Starting off Computex with an interesting announcement, ARM is talking about a new Cortex-A12 core that will attempt to address a performance gap in the SoC ecosystem between the A9 and A15. In the battle to compete with Krait and Intel's Silvermont architecture due in late 2013, ARM definitely needed to address the separation in performance and efficiency of the A9 and A15.
Source: ARM. Top to bottom: Cortex-A15, A12, A9 die size estimate
Targeted at mid-range devices that tend to be more cost (and thus die-size) limited, the Cortex-A12 will ship in late 2014 for product sampling and you should begin seeing hardware for sale in early 2015.
Architecturally, the changes for the upcoming A12 core revolve around a move to fully out of order dual-issue design including the integrated floating point units. The execution units are faster and the memory design has been improved but ARM wasn't ready to talk about specifics with me yet; expect that later in the year.
ARM claims this results in a 40% performance gain for the Cortex-A12 over the Cortex-A9, tested in SPECint. Because product won't even start sampling until late in 2014 we have no way to verify this data yet or to evaluate efficiency claims. That time lag between announcement and release will also give competitors like Intel, AMD and even Qualcomm time to answer back with potential earlier availability.
ARM is a company that no longer needs much of an introduction. This was not always the case. ARM has certainly made a name for themselves among PC, tablet, and handheld consumers. Their primary source of income is licensing CPU designs as well as their ISA. While names like the Cortex A9 and Cortex A15 are fairly well known, not as many people know about the graphics IP that ARM also licenses. Mali is the product name of the graphics IP, and it encompasses an entire range of features and performance that can be licensed by other 3rd parties.
I was able to get a block of time with Nizar Romdhane, Head of the Mali Ecosystem at ARM. I was able to ask a few questions about Mali, ARM’s plans to address the increasingly important mobile graphics market, and how they will compete with competition from Imagination Technologies, Intel, AMD, NVIDIA, and Qualcomm.
We would like to thank Nizar for his time, as well as Phil Hughes in facilitating this interview. Stay tuned as we are expecting to continue this series of interviews with other ARM employees in the near future.
Subject: Graphics Cards | February 25, 2013 - 08:01 PM | Josh Walrath
Tagged: nvidia, tegra, tegra 4, Tegra 4i, pixel, vertex, PowerVR, mali, adreno, geforce
When Tegra 4 was introduced at CES there was precious little information about the setup of the integrated GPU. We all knew that it would be a much more powerful GPU, but we were not entirely sure how it was set up. Now NVIDIA has finally released a slew of whitepapers that deal with not only the GPU portion of Tegra 4, but also some of the low level features of the Cortex A15 processor. For this little number I am just going over the graphics portion.
This robust looking fellow is the Tegra 4. Note the four pixel "pipelines" that can output 4 pixels per clock.
The graphics units on the Tegra 4 and Tegra 4i are identical in overall architecture, just that the 4i has fewer units and they are arranged slightly differently. Tegra 4 is comprised of 72 units, 48 of which are pixel shaders. These pixel shaders are VLIW based VEC4 units. The other 24 units are vertex shaders. The Tegra 4i is comprised of 60 units, 48 of which are pixel shaders and 12 are vertex shaders. We knew at CES that it was not a unified shader design, but we were still unsure of the overall makeup of the part. There are some very good reasons why NVIDIA went this route, as we will soon explore.
If NVIDIA were to transition to unified shaders, it would increase the overall complexity and power consumption of the part. Each shader unit would have to be able to handle both vertex and pixel workloads, which means more transistors are needed to handle it. Simpler shaders focused on either pixel or vertex operations are more efficient at what they do, both in terms of transistors used and power consumption. This is the same train of thought when using fixed function units vs. fully programmable. Yes, the programmability will give more flexibility, but the fixed function unit is again smaller, faster, and more efficient at its workload.
On the other hand here we have the Tegra 4i, which gives up half the pixel pipelines and vertex shaders, but keeps all 48 pixel shaders.
If there was one surprise here, it would be that the part is not completely OpenGL ES 3.0 compliant. It is lacking in one major function that is required for certification. This particular part cannot render at FP32 levels. It has been quite a few years since we have heard of anything not being able to do FP32 in the PC market, but it is quite common to not support it in the power and transistor conscious mobile market. NVIDIA decided to go with a FP 20 partial precision setup. They claim that for all intents and purposes, it will not be noticeable to the human eye. Colors will still be rendered properly and artifacts will be few and far between. Remember back in the day when NVIDIA supported FP16 and FP32 while they chastised ATI for choosing FP24 with the Radeon 9700 Pro? Times have changed a bit. Going with FP20 is again a power and transistor saving decision. It still supports DX9.3 and OpenGL ES 2.0, but it is not fully OpenGL ES 3.0 compliant. This is not to say that it does not support any 3.0 features. It in fact does support quite a bit of the functionality required by 3.0, but it is still not fully compliant.
This will be an interesting decision to watch over the next few years. The latest Mali 600 series, PowerVR 6 series, and Adreno 300 series solutions all support OpenGL ES 3.0. Tegra 4 is the odd man out. While most developers have no plans to go to 3.0 anytime in the near future, it will eventually be implemented in software. When that point comes, then the Tegra 4 based devices will be left a bit behind. By then NVIDIA will have a fully compliant solution, but that is little comfort for those buying phones and tablets in the near future that will be saddled with non-compliance once applications hit.
The list of OpenGL ES 3.0 features that are actually present in Tegra 4, but the lack of FP32 relegates it to 2.0 compliant status.
The core speed is increased to 672 MHz, well up from the 520 MHz in Tegra 3 (8 pixel and 4 vertex shaders). The GPU can output four pixels per clock, double that of Tegra 3. Once we consider the extra clock speed and pixel pipelines, the Tegra 4 increases pixel fillrate by 2.6x. Pixel and vertex shading will get a huge boost in performance due to the dramatic increase of units and clockspeed. Overall this is a very significant improvement over the previous generation of parts.
The Tegra 4 can output to a 4K display natively, and that is not the only new feature for this part. Here is a quick list:
2x/4x Multisample Antialiasing (MSAA)
24-bit Z (versus 20-bit Z in the Tegra 3 processor) and 8-bit Stencil
4K x 4K texture size incl. Non-Power of Two textures (versus 2K x 2K in the Tegra 3 processor) – for higher quality textures, and easier to port full resolution textures from console and PC games to Tegra 4 processor. Good for high resolution displays.
16:1 Depth (Z) Compression and 4:1 Color Compression (versus none in Tegra 3 processor) – this is lossless compression and is useful for reducing bandwidth to/from the frame buffer, and especially effective in antialiasing processing when processing multiple samples per pixel
Percentage Closer Filtering for Shadow Texture Mapping and Soft Shadows
Texture border color eliminate coarse MIP-level bleeding
sRGB for Texture Filtering, Render Surfaces and MSAA down-filter
1 - CSAA is no longer supported in Tegra 4 processors
This is a big generational jump, and now we only have to see how it performs against the other top end parts from Qualcomm, Samsung, and others utilizing IP from Imagination and ARM.