Project Trillium: Machine Learning on ARM
Addressing New Markets
Machine Learning is one of the hottest topics in technology, and certainly one of the fastest growing. Applications such as facial recognition and self-driving cars are driving much of the development in this area. So far we have seen CPUs and GPUs used for ML workloads, but in most cases they are not the most efficient way to execute these highly parallel yet computationally simple operations. New chips focused squarely on machine learning have been introduced, and now it seems that ARM is throwing their hat into the ring.
ARM is introducing three products under the Project Trillium brand: an ML processor, an OD (Object Detection) processor, and an ARM-developed neural network software stack. This project came as a surprise to most of us, but in hindsight it is a logical market for ARM to address, as it will be incredibly important moving forward. Currently, many applications that require machine learning are not processed at the edge, namely in the consumer's hand or on a device right next to them. Workloads may be requested from the edge, but most of the heavy-duty processing occurs in datacenters located all around the world. This requires communication, and sometimes pretty hefty amounts of bandwidth. If either of those is missing, applications that rely on ML break down.
The solution here, of course, is to do ML and OD at the source. I find it interesting that computing history has swung between centralized and personal models multiple times over the past several decades. It takes no genius to see that applications are going to be the key motivator as to where computing will take place. While we will see applications move into the cloud, it is a good bet that we will eventually see new applications that require being on the person rather than in the cloud.
To achieve this, ARM is introducing a yet-unnamed ML processor that promises performance of around 4.6 TOPs at roughly 1.5 watts. This product is designed at ARM and is aimed initially at the premium mobile market. It could easily find its way into other form factors, but mobile seems to make the most sense for now. ARM did not go into detail about what the ML processor's architecture comprises, but we can make a few guesses. In the past year NVIDIA released news of their Tensor Cores, which are integrated into several of their high-end products. These 4x4 matrix multiply-accumulate units are small but powerful blocks that run a narrower subset of operations than CPU cores or modern GPU cores do. My guess is that the ARM unit will be very similar, with fair-sized caches and a high-speed interconnect to an ARM CPU. I do not think it will have a memory controller of its own, as that would make the design that much more difficult in an already constrained form factor.
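To make the Tensor Core comparison concrete, the following is a minimal sketch of the 4x4 fused multiply-accumulate primitive those units expose (D = A x B + C). This is a hedged illustration of the kind of operation such an ML block accelerates, not ARM's actual design; the function name and layout are invented for this example.

```python
def mma_4x4(a, b, c):
    """Fused multiply-accumulate on 4x4 matrices: returns a @ b + c."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = c[i][j]                 # start from the accumulator input
            for k in range(n):
                acc += a[i][k] * b[k][j]  # dot product of row i, column j
            d[i][j] = acc
    return d

# A large matrix multiply is tiled into many of these small MACs,
# which is why such fixed-function units parallelize so well.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]
a = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
print(mma_4x4(a, identity, zeros))  # A @ I + 0 reproduces A
```

Because each unit only handles a tiny fixed-size tile, the hardware stays simple and can be replicated many times over, trading generality for density and efficiency.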
The second processor is the Object Detection unit. This is a second-generation design based on the work of the Apical group that ARM acquired. It turns out to be a pretty key unit for ML, as it can quickly identify objects of interest and hand that data off to the ML unit. It culls the incoming data much the way GPUs remove hidden triangles: only the pertinent data is passed on, improving efficiency and lowering the amount of work the ML processor needs to do.
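The culling idea can be sketched in a few lines: a detection front-end flags regions of interest so the ML stage only ever sees a fraction of the frame. The region format, threshold, and function name below are illustrative assumptions, not ARM's actual interface.

```python
def cull_regions(frame, regions, min_score=0.5):
    """Return only the pixel crops worth sending to the ML processor.

    regions: list of (x, y, width, height, confidence) candidates.
    """
    crops = []
    for x, y, w, h, score in regions:
        if score < min_score:           # discard low-confidence regions
            continue
        crop = [row[x:x + w] for row in frame[y:y + h]]
        crops.append(crop)
    return crops

frame = [[0] * 64 for _ in range(64)]   # toy 64x64 "image"
regions = [(0, 0, 8, 8, 0.9), (10, 10, 8, 8, 0.2)]
kept = cull_regions(frame, regions)
print(len(kept))  # only the confident region survives -> 1
```

Here the ML stage would process one 8x8 crop instead of the full 64x64 frame, which is the efficiency win the OD unit is after.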
The final part is of course the software stack that allows these products to interoperate with existing ML frameworks such as TensorFlow, Caffe, and others. This should speed up adoption as well as lower the workload of programmers trying to get their applications up and running on ARM and their partners' products utilizing these designs. Not having to reinvent the wheel is key here, as programming for every new architecture coming out over the next several years would take up too much time. We have already seen this with TensorFlow leaning heavily on CUDA. The software development that ARM has done to enable easier integration with their products will only help strengthen the market.
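At a high level, a stack like this takes a graph exported from a framework and lowers each operator onto the best available backend, falling back to the CPU when the accelerator lacks support. The sketch below illustrates that idea only; the backend names and supported-op sets are hypothetical, not ARM's actual software API.

```python
# Assumed backends and the ops each can run; purely illustrative.
BACKENDS = {
    "ml_processor": {"conv2d", "matmul"},                       # fast path
    "cpu":          {"conv2d", "matmul", "softmax", "resize"},  # fallback
}

def lower_graph(ops, preference=("ml_processor", "cpu")):
    """Assign each op in a framework-exported graph to the first
    backend (in preference order) that supports it."""
    placement = []
    for op in ops:
        for backend in preference:
            if op in BACKENDS[backend]:
                placement.append((op, backend))
                break
        else:
            raise ValueError(f"no backend supports {op!r}")
    return placement

print(lower_graph(["conv2d", "softmax"]))
# conv2d lands on the ML processor, softmax falls back to the CPU
```

This is the "write once, run on whatever silicon is present" promise: the application talks to the framework, and the stack decides where each piece of work actually executes.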
These products can be further enhanced by working alongside Cortex CPUs and Mali graphics. More complex workloads can be shunted to those units as needed, allowing the ML unit to focus on the work it does best and most efficiently. More sophisticated programming will be needed to get the best mix of performance and efficiency out of each combination of high-end SoC and ML unit.
The first OD unit will be available shortly, while the ML unit will show up by the middle of this year. I would expect high-end phones to integrate these parts by early next year. The key to their success will be suitable apps that actually take advantage of this functionality. No successful manufacturer wants to set aside silicon area and power for units that never get used because the applications meant to exercise them are immature or non-functional.
Pushing these applications onto local hardware is a logical decision considering the bandwidth and computational needs of ML products we are still only beginning to use. As more people utilize this functionality, the load on centralized processing and storage grows: more bandwidth is required, along with greater connectivity and higher uptime to guarantee that an application will work when it is needed most.
ARM is attempting to put this functionality in our hands with a low power solution that in theory can provide a large degree of localized machine learning. Now we just need to see companies willing to adopt it and make applications that can improve the customer’s experience.