Edge AI is still new, and many people are not sure which hardware platform to choose for their projects. Today, we will compare a few of the leading and emerging platforms.
Nvidia has dominated the AI chip market with its GPUs since the deep learning boom began in 2012. Although GPUs were power hungry, noisy and expensive (blame the Bitcoin gold rush), there was no alternative and we had to put up with them. About three years ago, Google announced that it had designed the Tensor Processing Unit (TPU) to accelerate deep learning inference in datacenters. That triggered a rush among established tech companies and startups to bring out specialised AI chips for both datacenters and the edge.
What we will talk about today are platforms for edge AI. So, what exactly is edge AI? The term is borrowed from edge computing, where the computation happens close to the data source. In the AI world, it now generally means anything that does not happen in a datacenter or on your bulky computers. This includes IoT devices, mobile phones, drones, self-driving cars, etc., which, as you can see, vary greatly in physical size, and there are many vendors.
We will therefore focus on platforms that are small enough to fit comfortably into a pocket and that individuals and small companies can purchase and use. Nvidia did a good job of including its competitors in the following benchmark comparisons, which cover the Intel Neural Compute Stick, Google Edge TPU and its very own Jetson Nano.
When evaluating AI models and hardware platforms for real-time deployment, the first thing I look at is how fast they are. In computer vision tasks, speed is normally measured in frames per second (FPS). A higher number indicates better performance; for real-time video streaming, you need at least about 10 FPS for the video to appear smooth. Nvidia performed some benchmarks, and you can find the results at https://developer.nvidia.com/embedded/jetson-nano-dl-inference-benchmarks. A number of applications are used in the benchmarks; two of the most common are classification and object detection. Computationally, classification is the simpler task, as it only needs to make one prediction of what the image is, e.g. an apple or an orange. Detection, on the other hand, is more demanding, as it needs to find the locations and classes of multiple objects, e.g. multiple cars and pedestrians. This is exactly the kind of application that requires hardware acceleration.
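For readers who want to produce comparable numbers on their own hardware, here is a minimal sketch of one common way to measure FPS: run a few untimed warm-up frames, then time a loop of inferences. `infer` is a placeholder for whatever runs your model (a TensorRT engine, an OpenVINO request, an Edge TPU interpreter) and is not any vendor's API. `fake_model` is a hypothetical stand-in. Published benchmarks may or may not include pre- and post-processing in the timed loop, so treat this as an assumption about methodology.

```python
import time
from typing import Callable, Sequence


def measure_fps(infer: Callable[[object], object], frames: Sequence[object], warmup: int = 5) -> float:
    """Average frames per second of `infer` over `frames`, after `warmup` untimed frames."""
    # Warm-up runs absorb one-off costs such as memory allocation and lazy initialisation.
    for frame in frames[:warmup]:
        infer(frame)
    timed = frames[warmup:]
    if not timed:
        raise ValueError("need more frames than warm-up frames")
    start = time.perf_counter()
    for frame in timed:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed


def frame_budget_ms(target_fps: float) -> float:
    """Milliseconds available per frame to sustain `target_fps`."""
    return 1000.0 / target_fps


def fake_model(frame: object) -> None:
    """Hypothetical stand-in for a real model: roughly 20 ms of 'inference' per frame."""
    time.sleep(0.02)


if __name__ == "__main__":
    fps = measure_fps(fake_model, frames=list(range(30)))
    print(f"measured {fps:.1f} FPS; a 10 FPS target leaves {frame_budget_ms(10):.0f} ms per frame")
```

The 10 FPS rule of thumb from the text translates into a budget of 100 ms per frame. Everything, including capture and drawing, has to fit within that budget.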
Now that we understand what these two applications involve, we can look at the benchmark results (I'll explain the DNR later). Jetson Nano's numbers look good for real-time inference, so let's use them as the baseline. The Intel Neural Compute Stick 2 (we'll just call it NCS2 here) can perform classification at 30 FPS using MobileNet-v2, which is not bad. However, it really struggles with object detection at 11 FPS. By the way, NCS2 is a USB stick, so it has to be used together with an external host computer, which is a Raspberry Pi 3 in this case. The benchmark numbers may be higher if a more powerful host is used. If we look at the numbers for the Raspberry Pi 3 alone, without NCS2, it can run classification at 2.5 FPS, which is not bad for a hobbyist or toy project. Going back to NCS2, I think a frame rate of about 10 FPS is probably not fast enough for real-time object tracking, especially with fast movement. Many objects are likely to be missed, and you would need a very good tracking algorithm to compensate. Of course, we shouldn't trust benchmark results wholly. Companies normally compare their own hand-optimized software against competitors' out-of-the-box models.
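To make the tracking concern concrete: the distance an object travels between two processed frames is its speed divided by the frame rate. The sketch below uses the FPS figures quoted above. The 550 pixels-per-second object speed is a made-up example value, not a measurement.

```python
def pixels_per_frame(speed_px_per_s: float, fps: float) -> float:
    """Distance in pixels an object moves between two consecutive processed frames."""
    return speed_px_per_s / fps


# Figures quoted above: NCS2 on a Raspberry Pi 3 runs classification at 30 FPS
# and detection at 11 FPS; the Raspberry Pi 3 alone classifies at 2.5 FPS.
speedup = 30 / 2.5
print(f"NCS2 classification speed-up over the Pi 3 alone: {speedup:.0f}x")

# Hypothetical object crossing the image at 550 pixels per second.
OBJECT_SPEED_PX_S = 550
for fps in (11, 30):
    jump = pixels_per_frame(OBJECT_SPEED_PX_S, fps)
    print(f"{fps:>2} FPS: object moves {jump:.1f} px between processed frames")
```

At 11 FPS the example object jumps 50 px between detections, against about 18 px at 30 FPS. A tracker therefore has to associate detections that lie much farther apart, which is where the missed objects come from.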
Now let's turn our attention to Google Edge TPU. It is quite unusual for a company to include a superior competitor's results in its own report. Edge TPU could perform classification at 130 FPS, twice the Nano's rate! For object detection, Edge TPU is also faster, but only slightly, at 48 FPS vs 39 FPS. I got hold of an Edge TPU board a few days ago and ran the demo that comes with it, and this is what I got: 75 FPS! I haven't gone into the code to check the network's input image size, which has a big impact on inference speed, but the demo certainly looked very smooth and the FPS was impressive!
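Here is a small sketch that puts the object-detection figures quoted so far on the Jetson Nano baseline. The numbers are the ones from Nvidia's benchmark cited above. My own 75 FPS demo result is left out because its model and input size are unknown. The per-frame time assumes frames are processed one at a time, with no pipelining.

```python
from typing import Dict

# Object-detection throughput quoted above from Nvidia's benchmark.
DETECTION_FPS = {"NCS2 + Raspberry Pi 3": 11, "Jetson Nano": 39, "Edge TPU": 48}


def relative_to(fps_table: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Throughput of each platform as a multiple of the baseline platform's FPS."""
    base = fps_table[baseline]
    return {name: fps / base for name, fps in fps_table.items()}


ratios = relative_to(DETECTION_FPS, "Jetson Nano")
for name, fps in sorted(DETECTION_FPS.items(), key=lambda item: item[1], reverse=True):
    # 1000 / FPS is the average time per frame if frames are processed one at a time.
    print(f"{name:<22}{fps:>4} FPS  {ratios[name]:.2f}x Nano  {1000 / fps:5.1f} ms/frame")
```

For detection, Edge TPU's lead works out to about 1.23x the Nano. NCS2 on a Raspberry Pi 3 reaches only about 0.28x.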
SIZE, POWER AND COST
Physical size is an important factor: the platform has to be small enough to fit into the edge device. Development boards contain some peripherals that may not end up in production modules, e.g. Ethernet and USB sockets, but the dev boards give us a good idea of the size and also an indication of power consumption. The figure below shows the actual development boards (I only have NCS1 and have yet to receive my Coral USB). Starting from the middle, the Coral Edge TPU dev board is roughly the size of a credit card, and you can use it as a reference to gauge the size of the others.
Dev Board Price and Production Module Size
- NCS2: $99, 72.5mm x 27mm
- Edge TPU USB: $74.99, 65mm x 30mm
- Edge TPU Dev Board: $149.99 (dev board), 40mm x 48mm
- Jetson Nano: $129 (module) or $99 (dev kit), 45mm x 70mm
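To compare the footprints directly, this sketch multiplies out the production-module dimensions listed above. It uses the ISO/IEC 7810 ID-1 credit card size (85.60mm x 53.98mm) as a reference. Area is only a rough proxy, since height and connector placement also matter when fitting a module into a device.

```python
from typing import Tuple

# Production-module dimensions from the list above, in millimetres.
MODULES_MM = {
    "NCS2": (72.5, 27),
    "Edge TPU USB": (65, 30),
    "Edge TPU module": (40, 48),
    "Jetson Nano module": (45, 70),
}
CREDIT_CARD_MM = (85.6, 53.98)  # ISO/IEC 7810 ID-1 card, for scale


def area_mm2(dims: Tuple[float, float]) -> float:
    """Board footprint in square millimetres (width x height)."""
    width, height = dims
    return width * height


card = area_mm2(CREDIT_CARD_MM)
for name, dims in MODULES_MM.items():
    area = area_mm2(dims)
    print(f"{name:<19}{area:8.1f} mm^2  ({area / card:.0%} of a credit card)")
```

On these numbers, the Jetson Nano module needs about 1.64 times the board area of the Edge TPU module.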
Both the Jetson Nano and the Edge TPU dev board use 5V power supplies, and the former has a power specification of 10W. I couldn't find the number for Edge TPU, but from its supply specification of 5V at 2–3A, I suppose it is in the same power bracket. However, the heatsink on the Edge TPU board is much smaller, and its fan doesn't run all the time during the object detection demo. Combined with Edge TPU's efficient hardware architecture, this suggests that its power consumption should be significantly lower than that of the Jetson Nano. I think Nvidia recognised the formidable challenge and priced its dev kit low at $99. Google hasn't announced the price of its production module, but I estimate it will be competitive with the Jetson Nano.
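The power-bracket reasoning above is simple arithmetic: a supply's rating gives P = V x I, which is a ceiling on what the board can draw, not its measured consumption. This sketch also converts power into energy per frame. It uses the Nano's 10W specification and its 39 FPS detection figure, which assumes (as an upper bound) that the board draws the full 10W while detecting.

```python
def max_supply_power_w(volts: float, amps: float) -> float:
    """Upper bound on the power a supply can deliver: P = V * I."""
    return volts * amps


def energy_per_frame_j(power_w: float, fps: float) -> float:
    """Energy spent per processed frame, in joules, at a given power draw and frame rate."""
    return power_w / fps


# Coral dev board supply rating quoted above: 5 V at 2-3 A.
low, high = max_supply_power_w(5, 2), max_supply_power_w(5, 3)
print(f"Coral supply rating: {low:.0f}-{high:.0f} W (a ceiling, not measured draw)")

# Jetson Nano: 10 W specification and 39 FPS detection; assumes the full 10 W is drawn.
print(f"Jetson Nano upper bound: {energy_per_frame_j(10, 39) * 1000:.0f} mJ per frame")
```

Energy per frame is often a fairer efficiency metric than raw power, because a faster chip that finishes each frame sooner can use less energy even at a similar wattage.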
On the other hand, both USB 3.0 sticks are of similar size, but NCS2 is pricier despite its lower performance. Does this mean Intel is doomed? Not necessarily; software could turn the tide of battle.
As you're probably aware, USB sticks need to connect to a host system, and if your system runs Windows, then NCS2 is your only choice. End of story; you can stop reading now.
Although Edge TPU appears to be the most competitive in terms of performance and size, it is also the most limited in software. It supports only Ubuntu as the host system, but the biggest challenge lies in the machine learning framework. It supports only one ML framework, TensorFlow (no prize for guessing that one; you know TensorFlow comes from Google, right?). Actually, no: technically it is TensorFlow Lite, a variant that supports a limited number of neural network layers. Worse still, it doesn't even support the full TensorFlow Lite, only models that are quantized to 8-bit integers (INT8)! This is in contrast to NCS2, which also supports FP16 (16-bit floating point) in addition to INT8.
What is the significance of that? Traditionally, deep learning models are trained in FP32, and in general they can later be converted to FP16 easily without much loss in accuracy. However, that is not the case for INT8, where post-training conversion often costs a lot of accuracy. You'll have to incorporate the quantization into the training. That means you can't use your pre-trained FP32 models as they are; you have to insert fake-quantization operations into your model and train it again. The training will also take longer than usual because of the additional operations. If you want to learn more about quantization, you can read my blog post here. Google does provide some pre-trained models that you can fine-tune, which saves you lots of time, but unfortunately there are only a few computer vision models to choose from. This is why there were so many DNRs (did not run) in Nvidia's benchmark of Edge TPU. This is where Intel and Nvidia do better. Intel has a good number of pre-trained models to choose from (https://software.intel.com/en-us/openvino-toolkit/documentation/pretrained-models). Interestingly, they include resnet50-binary-0001, which makes use of Binary Convolution layers, or in layman's terms, 1-bit layers. Intel's OpenVINO can convert models from TensorFlow, Caffe, MXNet, Kaldi and ONNX.
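To see why FP16 conversion is usually harmless while INT8 is harder, here is a small sketch of affine INT8 quantization placed next to an FP16 round trip. Affine quantization maps a real value to an integer using a scale and a zero point (real = scale * (q - zero_point)), which is the general scheme TensorFlow Lite uses for integer models. The sketch simplifies it to a single per-tensor range. The FP16 round trip uses Python's `struct` half-precision format `'e'`. The weights are made-up values, and the sketch only measures per-value rounding error. Real accuracy loss depends on how these errors accumulate through the whole network.

```python
import struct
from typing import Tuple

QMIN, QMAX = -128, 127  # signed INT8 range


def int8_params(x_min: float, x_max: float) -> Tuple[float, int]:
    """Per-tensor scale and zero point mapping [x_min, x_max] onto [QMIN, QMAX]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # the range must contain 0.0
    scale = (x_max - x_min) / (QMAX - QMIN) or 1.0   # guard against an all-zero tensor
    zero_point = max(QMIN, min(QMAX, round(QMIN - x_min / scale)))
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int) -> int:
    """Round to the nearest INT8 level, clamping values that fall outside the range."""
    return max(QMIN, min(QMAX, round(x / scale) + zero_point))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an INT8 level back to a real value: real = scale * (q - zero_point)."""
    return scale * (q - zero_point)


def fp16_roundtrip(x: float) -> float:
    """Round x to IEEE 754 half precision and back, using struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


weights = [-0.9, -0.31, 0.0, 0.047, 0.35, 1.2]  # made-up example values
scale, zp = int8_params(min(weights), max(weights))
print(f"scale={scale:.6f}, zero_point={zp}")
for w in weights:
    int8_err = abs(dequantize(quantize(w, scale, zp), scale, zp) - w)
    fp16_err = abs(fp16_roundtrip(w) - w)
    print(f"{w:+.3f}  INT8 error {int8_err:.5f}  FP16 error {fp16_err:.6f}")
```

The INT8 error can reach half a step (scale / 2, about 0.004 here), which is roughly an order of magnitude more than FP16 for these values. The step also grows with the tensor's range, so a single outlier weight coarsens every other value. Training with quantization in the loop lets the network adapt to this.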
NVIDIA IS THE SOFTWARE KING
As a pioneer in AI hardware, Nvidia has the most versatile software: TensorRT supports most ML frameworks, including MATLAB. Edge TPU and NCS2 are designed to support a subset of computational layers (primarily for computer vision tasks), but the Jetson Nano is essentially a GPU, and it can do most computations that its big-brother desktop GPUs can, only slower. That said, if your application involves non-computer-vision models, e.g. recurrent networks, or you develop your own models with many custom layers, it is safer to use the Jetson series to avoid nasty surprises when porting trained models to embedded deployment. Nvidia also provides the DeepStream SDK, which handles multiple video streams, and the Isaac robotics engine for path planning and autonomous navigation.
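One way to avoid those porting surprises is to check a model's layer types against what the target toolchain accepts before committing to hardware. The sketch below shows the idea only. `ACCELERATOR_SUPPORTED` is a hypothetical set, not the real supported-layer list for Edge TPU, OpenVINO or TensorRT; those lists differ by toolchain and version and should be taken from vendor documentation. The model's layer list is also made up.

```python
from typing import Iterable, List, Set

# Hypothetical set of layer types an accelerator compiler accepts (illustration only).
ACCELERATOR_SUPPORTED = {
    "Conv2D", "DepthwiseConv2D", "ReLU", "MaxPool2D", "AveragePool2D",
    "Add", "Concatenate", "Reshape", "Softmax", "FullyConnected",
}


def unsupported_layers(model_layers: Iterable[str], supported: Set[str]) -> List[str]:
    """Layer types the target cannot run, in first-seen order, without duplicates."""
    missing: List[str] = []
    for layer in model_layers:
        if layer not in supported and layer not in missing:
            missing.append(layer)
    return missing


# A made-up model: CNN backbone plus a recurrent head and a custom layer.
MY_MODEL = ["Conv2D", "ReLU", "DepthwiseConv2D", "ReLU", "MaxPool2D",
            "LSTM", "MyCustomAttention", "FullyConnected", "Softmax"]

missing = unsupported_layers(MY_MODEL, ACCELERATOR_SUPPORTED)
if missing:
    print("Layers needing a fallback (CPU, GPU or a rewrite):", ", ".join(missing))
else:
    print("All layers are supported by the accelerator.")
```

A non-empty result usually means part of the network falls back to the host CPU or has to be redesigned. On a GPU-based board like the Nano, that fallback path is far less likely to be needed.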
Now that we have had an overview of these platforms and their pros and cons, which platform should we use for which applications? All of them are capable of running computer vision AI, but here is what I think each of them is most suitable for. I'll also mention some of their unique hardware features.
INTEL NEURAL COMPUTE STICK 2
Pros: Supports Windows, fast deployment, good selection of models
Cons: Relatively slower inference speed and higher price
The best applications are kiosks, ATMs and point-of-sale systems that run Windows. NCS2 allows very easy and fast AI upgrades to existing systems. It is also good for hobbyists and low-volume projects.
GOOGLE EDGE TPU
Pros: Top performance, comes with Wi-Fi and an encryption engine
Cons: Limited training resources, AI models and software libraries (e.g. OpenCV is not supported)
Although its price is the highest of all, it includes a complete system with Wi-Fi and an encryption engine, making it ideal for consumer electronics and IoT devices like smart home cameras. Because it is the newest kid on the block, there are not many resources (training materials, AI models, tutorials) available. It therefore makes more sense for consumer electronics businesses that can afford higher R&D costs.
NVIDIA JETSON NANO
Pros: Good software ecosystem and resources, additional software libraries
Cons: Slightly bulky
This will be ideal for autonomous devices such as drones, toys and vacuum cleaners. It is a general-purpose AI platform, so in areas where the other platforms do not excel, the Nano is a safe bet.
I was planning to run some benchmarks to measure the actual performance of these platforms. However, after publishing this blog post, I found a report with detailed measurements of speed, power consumption and temperature, which saves me the trouble of reinventing the wheel. The ranking basically matches my estimate: Edge TPU is the fastest, followed by the Nano and then NCS2, and the Nano is more power hungry than Edge TPU.
Edge AI has arrived, so what's my prediction for future hardware? One obvious trend is the use of lower bitwidths, and that will continue. Currently, anything below 8 bits doesn't give good accuracy, but this is an active research area, and hardware companies should be prepared for breakthroughs in algorithm research.
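A rough way to see why going below 8 bits is hard: with uniform quantization over a fixed range, the spacing between representable values is range / (2^bits - 1). Each bit removed therefore roughly doubles both the step and the worst-case rounding error. The [-1, 1] range below is an arbitrary example. Binary networks such as the resnet50-binary-0001 model mentioned earlier cope with the extreme 1-bit case through training, not through rounding alone.

```python
def quant_step(bits: int, value_range: float = 2.0) -> float:
    """Spacing between adjacent levels when value_range is split into 2**bits levels."""
    return value_range / (2 ** bits - 1)


for bits in (16, 8, 4, 2, 1):
    step = quant_step(bits)
    # Worst-case rounding error is half a step for values inside the range.
    print(f"{bits:>2}-bit: {2 ** bits:>5} levels, step {step:.5f}, max error {step / 2:.5f}")
```

Going from 8 to 4 bits makes the step 17 times coarser (2/15 vs 2/255). This is why sub-8-bit accuracy depends on algorithmic advances rather than on simply rounding trained weights.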
Computer vision was the first area to be revolutionised by deep learning, and all the aforementioned platforms are geared heavily towards the feed-forward convolutional neural networks used for computer vision. With the rise of voice-based intelligent systems like Alexa, I see a gap for edge AI chips for speech.
Soon-Yau is a tech industry veteran with 15 years of experience. He has worked for Qualcomm and Nvidia, and is currently a freelance AI consultant specialising in embedded AI.