Machine Learning on Embedded (Part 5)

Note: This post is the fourth in the series. Here you can find part 1, part 2, part 3 and part 4.


In the previous post here, I’ve used x-cube-ai with the STM32F746 to run a tflite model and benchmark the inference performance. In that post I’ve found that the x-cube-ai is ~12.5x faster compared to TensorFlow Lite for microcontrollers (tflite-micro) when running on the same MCU. Generally, the first 4 posts were focused on running the model inference on the edge, which means running the inference on the embedded MCU. This actually is the most important thing nowadays, as being able running inferences on the edge on small MCUs means less consumption and more important that are not rely on the cloud. What is cloud? That means that there is an inference accelerator in cloud, or in layman terms the inference is running on a remote server somewhere on the internet.

One thing to note, is that the only reason I’m using the MNIST model is for benchmarking and consistency with the previous post. There’s no any real reason to use this model in a scenario like this. The important thing here is not the model, but the model’s complexity. So any model with the some kind of complexity that matches your use-case scenario can be used. But as I’ve said since I’ve used this model in the previous posts, I’ll use it also here.

So, what are the benefits of running the inference on the cloud?

Well, that depends. There are many parameters that define a decision like this. I’ll try to list a few of them.

  • It might be faster to run an inference on the cloud (that depends also on other parameters though).
  • The MCU that you already have (or you must use) is not capable to run the inference itself using e.g. tflite-micro or another API.
  • There is a robust network
  • The time that the cloud inference to be run (including the network transactions) is faster than running on the edge (=on the device)
  • If the target device runs on battery it may be more energy efficient to use a cloud accelerator
  • It’s possible to re-train your model and update the cloud without having to update the clients (as long the input and output tensors are not changed).

What are the disadvantages on running the inference on the cloud?

  • You need a target with a network connection
  • Networks are not always reliable
  • The server hardware is not reliable. If the server fails, all the clients will fail
  • The cloud server is not energy efficient
  • Maintenance of the cloud

If you ask me, the most important advantage of edge devices is that they don’t rely on any external dependencies. And the most important advantage of the cloud is that it can be updated at any time, even on the fly.

On this post I’ll focus on running the inference on the cloud and use an MCU as a client to that service. Since I like embedded things the cloud tflite server will be a Jetson nano running in the two supported power modes and the client will be an esp8266 NodeMCU running at 160MHz.

All the project file are in this repo:

Now let’s dive into it.


Let’s have a look in the components I’ve used.

ESP8266 NodeMCU

This is the esp8266 module with 4MB flash and the esp8266 core which can run up to 160MHz. It has two SPI interfaces, one used for the onboard EEPROM and one it’s free to use. Also it has a 12-bit ADC channel which is limited to max 1V input signals. This is a great limitation and we’ll see later why. You can find this on ebay sold for ~1.5 EUR, which is dirt cheap. For this project I’ll use the Arduino library to create a TCP socket that connects to a server, sends an input array and then retrieves the inference result output.

Jetson nano development board

The jetson nano dev board is based on a Quad-core ARM Cortex-A57 running @ 1.4GHz, with 4GB LPDDR4 and an NVIDIA Maxwell GPU with 128 CUDA cores. I’m using this board because the tensorflow-gpu (which contains the tflite) supports its GPU and therefore it provides acceleration when running a model inference. This board doesn’t have WiFi or BT, but it has a mini-pcie connector (key-E) so you’re able to connect a WiFi-BT module. In this project I will just use the ethernet cable connected to a wireless router.

The Jetson nano supports two power modes. The default mode 0 is called MAXN and the mode 1 is called 5W. You can verify on which mode your CPU is running with this command:

nvpmodel -q

And you can set the mode (e.g. mode 1 – 5W) like this:

# sets the mode to 5W
sudo nvpmodel -m 1

# sets the mode to MAXN
sudo nvpmodel -m 0

I’ll benchmark both modes in this post.

My workstation

I’ve also used my development workstation in order to do benchmark comparisons with the Jetson nano. The main specs are:

  • Ryzen 2700x @ 3700MHz (8 cores / 16 threads)
  • 32GB @ 3200MHz
  • GeForce GT 710 (No CUDA ūüôĀ )
  • Ubuntu 18.04
  • Kernel 4.18.20-041820-generic

Network setup

This is the network setup I’ve used for the development and testing/benchmarking the project. The esp8266 is connected via WiFi on the router and the workstation (2700x) and the jetson nano are connected via Ethernet (in the drawing replace TCP = ETH!).

This is a photo of the development setup.

Repo details

In the repo you’ll find several folders. Here I’ll list what each folder contains. I suggest you also read the file in the repo as it contains information that might not be available here and also the README file will be always updated.

  • ./esp8266-tf-client: This folder contains the firmware for the esp8266
  • ./jupyter_notebook: This folder contains the .ipynb jupyter notebook which you can use on the server and includes the TfliteServer class (which will be explained later) and the tflite model file (mnist.tflite).
  • ./schema: The flatbuffers schema file I’ve used for the communication protocol
  • ./tcp-stress-tool: A C/C++ tool that I’vewritten to stress and benchmark the tflite server.

esp8266 firmware

This folder contains the source code for the esp8266 firmware. To build the esp8266 firmware open the `esp8266-tf-client/esp8266-tf-client.ino` with Arduino IDE (version > 1.8). Then in the file you need to change a couple of variables according to your network setup. In the source code you’ll find those values:

#define SSID "SSID"
#define SERVER_IP ""
#define SERVER_PORT 32001

You need to edit them according to your wifi network and router setup. So, use your wifi router’s SSID and password. The `SERVER_IP` is the IP of the computer that will run the python server and the `SERVER_PORT` is the server’s port and they both need to be the same also in the python script. All the data in the communication between the client and the server are serialized with flatbuffers. This comes with quite a significant performance hit, but it’s quite necessary in this case. The client sends 3180 bytes on every transaction to the server, which are the serialized 784 floats for each 28×28 byte digit. Then the response from the server to the client is 96 bytes. These byte lengths are hardcoded, so if you do any changes you need also to change he definitions in the code. They are hard-coded in order to accelerate the network recv() routines so they don’t wait for timeouts.

By default this project assumes that the esp8266 runs at 160MHz. In case you change this to 80MHz then you need also to change the `MS_CONST` in the code like this:

#define MS_CONST 80000.0f

Otherwise the ms values will be wrong. I guess there’s an easier and automated way to do this, but yeah…

The firmware supports 3 serial commands that you can send via the terminal. All the commands need to be terminated with a newline. The supported commands are:

Command Description
TEST Sends a single digit inference request to the server and it will print the parsed response
START=<SECS> Triggers a TCP inference request from the server every <SECS>. Therefore, if you want to poll the server every 5 secs then you need to send this command over the serial to the esp8266 (don’t forget the newline in the end). For example, this will trigger an inference request every 5 seconds: `START=5`.
STOP Stops the timer that sends the periodical TCP inference requests

To build and upload the firmware to the esp8266 read the of the repo.

Using the Jupyter notebook

I’ve used the exact same tflite model that I’ve used in part 3 and part 4. The model is located in ./jupyter_notebook/mnist.tflite. You need to clone the repo on the Jetson nano (or your workstation is you prefer). From now on instead of making a distinction between the Jetson nano and the workstation I’ll just refer to them as the cloud as it doesn’t really make any difference. Therefore, just clone the repo to your cloud server. This here is the jupyter notepad on bitbucket.

Benchmarking the inference on the cloud

The important sections in the notepad are 3 and 4. Section 3 is the `Benchmark the inference on the Jetson-nano`. Here I assume that this runs on the nano, but it’s the same on any server. So, in this section I’m benchmarking the model inference with a random input. I’ve run this benchmark on both my workstation and the Jetson nano and these are the results I got. For reference I’ll also add the numbers with the edge inference on the STM32F7 from the previous post using x-cube-ai.

Cloud server ms (after 1000 runs)
My workstation 0.206410
Jetson nano (MAXN) 0.987536
Jetson nano (5W) 2.419758
STM32F746 @ 216MHz 76.754
STM32F746 @ 288MHz 57.959

The next table shown the difference in performance between all the different benchmarks.

STM@216 STM@288 Nano 5W Nano MAXN 2700x
STM@216 1 1.324 31.719 77.722 371.852
STM@288 0.755 1 23.952 58.69 280.795
Nano 5W 0.031 0.041 1 2.45 11.723
Nano MAXN 0.012 0.017 0.408 1 4.784
2700x 0.002 0.003 0.085 0.209 1

An example how to read the above table is that the STM32F7@288 is 1.324x faster than STM32F7@216. Also Ryzen 2700x is 371.8x times faster. Also the Jetson nano in MAXN mode is 2.45x times faster that the 5W mode, e.t.c.

What you should probably keep from the above table is that Jetson nano is ~32x to 78x times faster than the STM32F7 at the stock clocks. Also the 2700x is only ~4.7x times faster than nano in MAXN mode, which is very good performance for the nano if you think about its consumption, cost and size.

Therefore, the performance/cost and performance/consumption ratio is far better on the Jetson nano compared to 2700x. So it makes perfect sense to use this as a cloud tflite server. One use-case of this scenario is having a cloud accelerator running locally on a place that covers a wide area with WiFi and then having dozens of esp8266 clients that request inferences from the server.

Benchmarking the tflite cloud inference

To run the server you need to run the cell in section `4. Run the TCP server`. First you need to insert the correct IP of the cloud server. For example my Jetson nano has the IP Then you run the cell. The other way is you can edit the `jupyter_notebook/TfliteServer/` file and in this code change the IP (or the TCP if you like also)

if __name__=="__main__":
    srv = TfliteServer('../mnist.tflite')
    srv.listen('', 32001)

Then on your terminal run:


This will run the server and you’ll get the following output.

dimtass@jetson-nano:~/rnd/tensorflow-nano/jupyter_notebook/TfliteServer$ python3
TfliteServer initialized
TCP server started at port: 32001

Now send the TEST command on the esp8266 via the serial terminal. When you do this, then the following things will happen:

  1. esp8266 serializes the 28×28 random array to a flatbuffer
  2. esp8266 connects the TCP port of the server
  3. esp8266 sends the flabuffer to the server
  4. Server de-serializes the flatbuffer
  5. Server converts the tensor from (784,) to (1, 28, 28, 1)
  6. Server runs the inference with the input
  7. Server serializes the output it in a flatbuffer (including the time in ms of the inference operation)
  8. Server sends the output back to the esp8266
  9. esp8266 de-serializes the output
  10. esp8266 outputs the result

This is what you get from the esp8266 serial output:

Request a single inference...
======== Results ========
Inference time in ms: 12.608528
out[0]: 0.080897
out[1]: 0.128900
out[2]: 0.112090
out[3]: 0.129278
out[4]: 0.079890
out[5]: 0.106956
out[6]: 0.074446
out[7]: 0.106730
out[8]: 0.103112
out[9]: 0.077702
Transaction time: 42.387493 ms

In this output the “inference time in ms” is the time in ms that the cloud server spend to run the inference. Then you get the array of the 10 predictions for the output and finally the “Transaction time” is the total time of the whole procedure. The total time is the time that steps 1-9 spent. At the same time the output of the server is the following:

==== Results ====
Hander time in msec: 30.779839
Prediction results: [0.08089687 0.12889975 0.11208985 0.12927799 0.07988966 0.10695633
 0.07444601 0.10673008 0.10311186 0.07770159]
Predicted value: 3

The “handler time in msec” is the time that the TCP reception handler used (see: jupyter_notebook/TfliteServer/ and the FbTcpHandler class.

From the above benchmark with the esp8266 we need to keep the following two things:

  1. From the 42.38 ms the 12.60 ms was the inference run time, so all the rest things like serialization and network transactions costed 29.78 ms (on the local WiFi network). Therefore, the extra time was 2.3x times more that the inference running time itself.
  2. The total time that the above operation lasted was 42.38 ms and the STM32F7 needed 76.75 ms @ 216MHz (or 57.96 @ 288MHz). That means the the cloud inference is 1.8x and 1.36x times faster.

Finally, as you probably already noticed, the protocol is very simple, so there are no checksums, server-client validation and other fail-safe mechanisms. Of course, that’s on purpose, as you can imagine. Otherwise, the complexity would be higher. But you need to consider those things if you’re going to design a system like this.

Benchmarking the tflite server

The tflite TCP server is just a python TCP socket listener. That means that by design it has much lower performance compared to any TCP server written in C/C++ or Java. Despite the fact that I was aware of this limitation, I’ve chosen to go with this solution in order to integrate the server easily in the jupyter notebook and it was also much faster to implement. Sadly, I’ve seen a great performance hit with this implementation and I would like to investigate a bit further (in the future) and verify if that’s because of the python implementation or something else. The results were pretty bad especially for the Jetson nano.

In order to test the server, I’ve written a small C/C++ stress tool that I’ve used to spawn a user-defined number of TCP client threads and request inferences from the server. Because it’s still early in my testing, I assume that the gpu can only run one inference per time, therefore there’s a thread lock before any thread is able to call the inference function. This lock is in the jupyter_notebook/TfliteServer/ file in those lines:

output_data, time_ms = runInference(resp.Input().DigitAsNumpy())

One thing I would like to mention here is that I’m not lazy to investigate in depth every aspect of each framework, it’s just that I don’t have the time, therefore I do logical assumptions. This is why I assume that I need to put a lock there, in order to prevent several simultaneous calls in the tensorflow API. Maybe this is handled in the API, I don’t know. Anyway, have in mind that’s the reason this lock there, so all the inferences requests will block and wait until the current running inference is finished.

So, the easiest way to run some benchmarks is to use run the TfliteServer on the server. First you need to edit the IP address in the __main__ function. You need to use the IP of the server, or if you run this locally (even when I do this locally I use the real IP address). Then run the server:

cd jupyter_notebook/TfliteServer/

Then you can run the client and pass the server IP, port and number of threads in the command line. For example, I’ve run both the client and server on my workstation, which has the IP, so the command I’ve used was:

cd tcp-stress-tool/
./tcp-stress-tool 32001 500

This will spawn 500 clients (each on its own thread) and request an inference from the python server. Because the output is quite big, I’ll only post the last line (but I’ve copied some logs in the results/ folder in the repo).

This tool will spawn a number of TCP clients and will request
the tflite server to run an inference on random data.
Warning: there is no proper input parsing, so you need to be
cautious and read the usage below.

tcp-stress-tool [server ip] [server port] [number of clients]

server ip:
server port: 32001
number of clients: 500

Spawning 500 TCP clients...
[thread=2] Connected
[thread=1] Connected
[thread=3] Connected


Total elapsed time: 31228.558064 ms
Average server inference time: 0.461818 ms

The output means that 500 TCP transactions and inferences were completed in 31.2 secs with average inference time 0.46 ms. That means the total time for the inferences were 23 secs and the rest 8.2 secs were spend in the comms and serializing the data. These 8.2 secs seem a bit too much, though, right? I’m sure that this time should be less. On the Jetson nano it’s even worse, because I wasn’t able to run a test with 500 clients and many connections were rejected. Any number more that 20 threads and python script can’t handle this. I don’t know why. In the results/ folder you’ll find the following test results:

  • tcp-stress-jetson-nano-10c-5W.txt
  • tcp-stress-jetson-nano-50c-5W.txt
  • tcp-stress-jetson-nano-50c-MAXN.txt
  • tcp-stress-output-workstation-500c.txt

As you can guess from the filename, Xc is the number of threads and for Jetson nano there are results for both modes (MAXN and 5W). This is a table with all the results:

Test Threads Total time ms Avg. inference ms
Nano 5W 10 1057.1 3.645
Nano 5W 20 3094.05 4.888
Nano MAXN 10 236.13 2.41
Nano MAXN 20 3073.33 3.048
2700x 500 31228.55 0.461

From those results, I’m quite sure that there’s something wrong with the python3 TCP server. Maybe at some point I’ll try something different. In any case that concludes my tests, although there’s a question mark as regarding the performance of the Jetson nano when it’s acting as tflite server. For now, it seems that it can’t handle a lot of connections (with this implementation), but I’m quite certain this will be much different if the server is a proper C/C++ implementation.


With this post I’ve finished the main tests around ML I had originally on my mind. I’ve explored how ML can be used with various embedded MCUs and I’ve tested both edge and cloud implementations. At the edge part of ML, I’ve tested a naive implementation and also two higher level APIs (the TensorFlow Lite for Microcontrollers API and also the x-cube-ai from ST). For the cloud part of ML, I’ve tested one of the most common and dirt cheap WiFi enabled MCUs the esp8266.

I’ll mention here once again that, although I’ve used the MNIST example, that doesn’t really matter. It’s the NN model complexity that matters. By that I mean that although it doesn’t make any sense to send a 28×28 tensor from the esp8266 to the cloud for running a prediction on a digit, the model is still just fine for running benchmarks and make conclusions. Also this (784,) input tensor, stresses also the network, which is good for performance tests.

One thing that you might wondering at this point is, “which implementation is better”? There’s no a single answer for this. This is a per case decision and it depends on several parameters around the specific requirements of the project, like cost, energy efficiency, location, environmental conditioons and several other things. By doing those tests though, I now have a more clear image of the capabilities and the limitations of the current technology and this is a very good thing to have when you have to start with a real project development. I hope that the readers who gone all the posts of this series are able to make some conclusions about those tools and the limitations; and based on this knowledge can start evaluating more focused solutions that fit their project’s specs.

One thing that’s also important, is that the whole ML domain is developing really fast and things are changing very fast, even in next days or even hours. New APIs, new tools, new hardware are showing up. For example, numerous hardware vendors are now releasing products with some kind of NN acceleration (or AI as they prefer to mention it). I’ve read a couple of days ago that even Alibaba released a 16-core RISC-V Processor (XuanTie 910) with AI acceleration. AmLogic released the A311D. Rockchip released the RK3399Pro.¬† Also, Gyrfalcon released the Lightspeeur 2801S Neural Accelerator, to compete Intel’s NCS2 and Google’s TPU. And many more chinese manufactures will release several other RISC-V CPUs with accelerators for NN the next few weeks and months. Therefore, as you can see the ML on the embedded domain is very hot.

I think I will return from time to time to the embedded ML domain in the future to sync with the current progress and maybe write a few more posts on the subject. But the next stupid-project will be something different. There’s still some clean up and editing I want to do in the first 2 posts in the series, though.

I hope you enjoyed this series as much as I did.

Have fun!

Machine Learning on Embedded (Part 4)


Note: This post is the fourth in the series. Here you can find part 1, part 2 and part 3.

For this post I’ve used the same MNIST model that I’ve trained for TensorFlow Lite for Microcontrollers (tflite-micro) and I’ve implemented the firmware on the 32F746GDISCOVERY by using the ST’s X-CUBE-AI framework. But before dive into this let’s do a recap and repeat some key points from the previous articles.

In part 1, I’ve implemented a naive implementation of a single neuron with 3-inputs and 1-output. Naive means that the inference was just C code, without any accelerations from the hardware. I’ve run those tests on a various different MCUs and it was fun seeing even an arduino nano running this thing. Also I’ve overclocked a few MCUs to see how the frequency increment scales with the inference performance.

In part 2, I’ve implemented another naive implementation of a NN with 3-input, 32-hidden, 1-output. The result was that as expected, which means that as the NN complexity increases the performance drops. Therefore, not all MCUs can provide the performance to run more complex in real-time. The real-time part now is something objective, because real-time can be from a few ns up to several hours depending on the project. That means that if the inference of a deep-er network needs 12 hours to run in your arduino and your data stream is 1 input per 12 hours and 2 minutes, then you’re fine. Anyway, I won’t debate on that I think you know what I mean. But if your input sample is every few ms then you need something faster. Also, in the back of my head was to verify if this simple NN complexity is useful at all and if it can offer something more than lookup tables or algorithms.

In part 3, I was planning to use x-cube-ai from ST, to port a Keras NN and then benchmark the inference, but after the hint I got in the comments from Raukk, I’ve decided to go with the tflite-micro. Tflite-micro at that point seemed very appealing, because it’s a great idea to have a common API between the desktop, the embedded Linux and the MCU worlds. Think about it. It’s really great to be able to share (almost) the same code between those platforms.

Therefore, in this post I’ve implemented the exact same model to do a comparison of the x-cube-ai and tflite-micro. As I’ve mentioned also to the previous posts (and I’m doing this also now), the Machine Learning (ML) on the low embedded (=MCUs) is still a work in progress and there’s a lot of development on the various tools. If you think about it the whole ML is still is changing rapidly for the last years and its introduction to microcontrollers is even more recent. It’s a very hot topic and domain right now. For example, while I was doing the tflite-micro post the repo, it was updated several times; but I had to stop updating and lock to a git version in order to finish the post.

Also, after I’ve finished the post for the x-cube-ai, the same day the new version 4.0.0 released, which pushed back the post release. The new version supports to import tflite models and because I’ve used a Keras model in my first implementation, I had to throw away quite some work that I’ve done… But I couldn’t do otherwise, as now I had the chance to use the exact same tflite model and not the Keras model (the tflite was a port from Keras). Of course, I didn’t expect any differences, but still it’s better to compare the exact same models.

You’ll find all the source code for this project here:

So, let’s dive into it.


ST presents the X-CUBE-AI as an “STM32Cube Expansion Package part of the STM32Cube.AI ecosystem and extending STM32CubeMX capabilities with automatic conversion of pre-trained Neural Network and integration of generated optimized library into the user’s project“. Yeah, I know, fancy words. In plain English that means that it’s just a static library for the STM32 MCUs that uses the cmsis-dsp accelerations and a set of tools that convert various model formats to the format that the library can process. That’s it. And it works really well.

There’s also a very informative video here, that shows the procedure you need to follow in order to create a new x-cube-ai project and that’s the one I’ve also used to create the project in this repo. I believe it’s very straight forward and there’s no reason to explain anything more than that. The only different thing I do always is that I’m just integrating the resulted code from STM32CubeMX to my cmake template.

So the x-cube-ai adds some tools in the CubeMX GUI and you can use them to analyze the model, compress the weight values, and validate the model on both desktop and the target. With x-cube-ai, you can finally create source code for 3 types of projects, which are the SystemPerformance, Validation and ApplicationTemplate. For the first two projects you just compile them, flash and run, so you don’t have to write any code yourself (unless you want to change default behaviour). As you can see on the YouTube link I’ve posted, you can choose the type of project in the “Pinout & Configuration” tab and then click in the “Additional Software”. From that list expand the “X-CUBE-AI/Application” (be careful to select the proper (=latest?) version if you have many) and then in the Selection column, select the type of the project you want to build.

Analyzing the model

I want to mention here that in ST they’ve done a great job on logging and display information for the model. You get many information in CubeMX while preparing your model and you know beforehand the RAM/ROM size with the compression, the complexity, the ROM usage, MACC and also you can derive the complexity by layer. This is an example output I got when I’ve analyzed the MNIST model.

Analyzing model 
Neural Network Tools for STM32 v1.0.0 (AI tools v4.0.0) 
-- Importing model 
-- Importing model - done (elapsed time 0.401s) 
-- Rendering model 
-- Rendering model - done (elapsed time 0.156s) 
Creating report file /home/dimtass/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/4.0.0/Utilities/linux/stm32ai_output/mnistkeras_analyze_report.txt 
Exec/report summary (analyze 0.558s err=0) 
model file      : /rnd/bitbucket/machine-learning-for-embedded/code-stm32f746-xcube/mnist.tflite 
type            : tflite (tflite) 
c_name          : mnistkeras 
compression     : 4 
quantize        : None 
L2r error       : NOT EVALUATED 
workspace dir   : /tmp/mxAI_workspace26422621629890969500934879814382 
output dir      : /home/dimtass/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/4.0.0/Utilities/linux/stm32ai_output 
model_name      : mnist 
model_hash      : 3be31e1950791ab00299d58cada9dfae 
input           : input_0 (item#=784, size=3.06 KiB, fmt=FLOAT32) 
input (total)   : 3.06 KiB 
output          : nl_7 (item#=10, size=40 B, fmt=FLOAT32) 
output (total)  : 40 B 
params #        : 93,322 (364.54 KiB) 
macc            : 2,852,598 
rom (ro)        : 263,720 (257.54 KiB) -29.35% 
ram (rw)        : 33,664 + 3,176 (32.88 KiB + 3.10 KiB) 
id  layer (type)        output shape      param #     connected to             macc           rom                 
0   input_0 (Input)     (28, 28, 1)                                                                               
    conv2d_0 (Conv2D)   (26, 26, 32)      320         input_0                  237,984        1,280               
    nl_0 (Nonlinearity) (26, 26, 32)                  conv2d_0                                                    
1   pool_1 (Pool)       (13, 13, 32)                  nl_0                                                        
2   conv2d_2 (Conv2D)   (11, 11, 64)      18,496      pool_1                   2,244,480      73,984              
    nl_2 (Nonlinearity) (11, 11, 64)                  conv2d_2                                                    
3   pool_3 (Pool)       (5, 5, 64)                    nl_2                                                        
4   conv2d_4 (Conv2D)   (3, 3, 64)        36,928      pool_3                   332,416        147,712             
    nl_4 (Nonlinearity) (3, 3, 64)                    conv2d_4                                                    
5   reshape_5 (Reshape) (576,)                        nl_4                                                        
    dense_5 (Dense)     (64,)             36,928      reshape_5                36,864         38,144 (c)          
    nl_5 (Nonlinearity) (64,)                         dense_5                  64                                 
6   dense_6 (Dense)     (10,)             650         nl_5                     640            2,600               
7   nl_7 (Nonlinearity) (10,)                         dense_6                  150                                
mnist p=93322(364.54 KBytes) macc=2852598 rom=257.54 KBytes ram=32.88 KBytes -29.35% 
Complexity by layer - macc=2,852,598 rom=263,720 
id      layer (type)        macc                                    rom                                     
0       conv2d_0 (Conv2D)   ||||                              8.3%  |                                 0.5%  
2       conv2d_2 (Conv2D)   |||||||||||||||||||||||||||||||  78.7%  ||||||||||||||||                 28.1%  
4       conv2d_4 (Conv2D)   |||||                            11.7%  |||||||||||||||||||||||||||||||  56.0%  
5       dense_5 (Dense)     |                                 1.3%  ||||||||                         14.5%  
5       nl_5 (Nonlinearity) |                                 0.0%  |                                 0.0%  
6       dense_6 (Dense)     |                                 0.0%  |                                 1.0%  
7       nl_7 (Nonlinearity) |                                 0.0%  |                                 0.0%  
Using TensorFlow backend. 
Analyze complete on AI model

This is the output that you get by just running the analyze tool on the imported tflite model in CubeMX. Lots of information there, but let’s focus in some really important info. As you can see, you know exactly how much ROM and RAM you need! You couldn’t do that with the tflite-micro. In tflite-micro you need to either calculate this by your own, or you would need to add heap size and try to load the model, if the heap wasn’t enough and the allocator was complaining, then add more heap and repeat. This is not very convenient right? But with x-cube-ai you know exactly how much heap you need at least for the model (and also add more for your app). Great stuff.

Model RAM/ROM usage

So in this case the ROM needed for the model is 263760 bytes. In part 3, that was 375740 bytes (see section 3 in the jupyter notepad). That difference is not because I’ve used quantization, but because of the 4x compression selection I’ve made for the weights in the tool (see in the YouTube video which does the same time=3:21). Therefore, the decrease in the model size in ROM is from that compression. According to the tools that’s -29.35% compared to the original size. In the current project the model binary blob is in the `source/src/mnistkeras_data.c` file and it’s an C array like the one in the tflite-micro project. The similar file in the tf-lite model was the `source/src/inc/model_data.h`. Those sizes are without quantization, because I didn’t manage to convert the model to UINT8 as the TFLiteConverter converts the model only to INT8, which is not supported in tflite. I’m still puzzled with that and I can’t figure out why this happening and I couldn’t find any documentation or example how to do that.

Now, let’s go to the RAM usage. With x-cube-ai the RAM needed is only 36840 bytes! In the tflite-micro I needed 151312 bytes (see the table in the “Model RAM Usage” section here). That’s 4x times less RAM. It’s amazing. The reason for that is that in tflite-micro the micro_allocator expands the layers of the model in the RAM, but in the x-cube-ai that doesn’t happen. From the above report (and from what I’ve seen) it seems that the layers remain in the ROM and it seems that the API only allocates RAM for the needed operations.

As you can imagine those two things (RAM and ROM usage) makes x-cube-ai a much better option even to start with. That makes even possible to run this model in MCUs with less RAM/ROM than the STM32F746, which is considered a buffed MCU. Huge difference in terms of resources.

Framework output project types

As I’ve mentioned previously, with x-cube-ai you can create 3 types of projects (SystemPerformance, Validation, ApplicationTemplate). Let’s see a few more details about those.

Note: for the SystemPerformance and Validation project types, I’ve included the bin files in the extras/folder. You can only flash those on the STM32F746 which comes with the 32F746GDISCOVERY board.


As the name clearly implies, you can use this project type in order to benchmark the performance using random inputs. If you think about it, that’s all that I would need for this post. I just need to import the model, build this application and there you go, I have all I need. That’s correct, but… I wanted to do the same that I’ve done in the previous project with tflite-micro and be able to use a comm protocol to upload inputs from hand-drawn digits from the jupyter notebook to the STM32F7, run the inference and get the output back and validate the result. Therefore, although this project type is enough for benchmarking, I still had work to do. But in case you just need to benchmark the MCU running the model inference, just build this. You don’t even have to write a single line of code. This is the serial output when this code runs (this is a loop, but I only post one iteration).

Running PerfTest on "mnistkeras" with random inputs (16 iterations)...

Results for "mnistkeras", 16 inferences @216MHz/216MHz (complexity: 2852598 MACC)
 duration     : 73.785 ms (average)
 CPU cycles   : 15937636 -1352/+808 (average,-/+)
 CPU Workload : 7%
 cycles/MACC  : 5.58 (average for all layers)
 used stack   : 576 bytes
 used heap    : 0:0 0:0 (req:allocated,req:released) cfg=0

From the above output we can see that @216MHz (default frequency) the inference duration was 73.78 ms (average) and then some other info. Ok, so now let’s push the frequency up a bit @288MHz and see what happens.

Running PerfTest on "mnistkeras" with random inputs (16 iterations)...

Results for "mnistkeras", 16 inferences @288MHz/288MHz (complexity: 2852598 MACC)
 duration     : 55.339 ms (average)
 CPU cycles   : 15937845 -934/+1145 (average,-/+)
 CPU Workload : 5%
 cycles/MACC  : 5.58 (average for all layers)
 used stack   : 576 bytes
 used heap    : 0:0 0:0 (req:allocated,req:released) cfg=0

55.39 ms! It’s amazing. More about that later.


The validation project type is the one that you can use if you want to validate your model with different inputs. There is a mode that you can validate on the target with either random or user-defined data. There is a pdf document here, named “Getting started with X-CUBE-AI Expansion Package for Artificial Intelligence (AI)” and you can find the format of the user input in section 14.2, which is just a csv file with comma separated values.

The default mode, which is the random inputs produces the following output (warning: a lot of text is following).

Starting AI validation on target with random data... 
Neural Network Tools for STM32 v1.0.0 (AI tools v4.0.0) 
-- Importing model 
-- Importing model - done (elapsed time 0.403s) 
-- Building X86 C-model 
-- Building X86 C-model - done (elapsed time 0.519s) 
-- Setting inputs (and outputs) data 
Using random input, shape=(10, 784) 
-- Setting inputs (and outputs) data - done (elapsed time 0.691s) 
-- Running STM32 C-model 
ON-DEVICE STM32 execution ("mnistkeras", /dev/ttyUSB0, 115200).. 
<Stm32com id=0x7f8fd8339ef0 - CONNECTED(/dev/ttyUSB0/115200) devid=0x449/STM32F74xxx msg=2.0> 
 0x449/STM32F74xxx @216MHz/216MHz (FPU is present) lat=7 Core:I$/D$ ART: 
 found network(s): ['mnistkeras'] 
 description    : 'mnistkeras' (28, 28, 1)-[7]->(1, 1, 10) macc=2852598 rom=257.54KiB ram=32.88KiB 
 tools versions : rt=(4, 0, 0) tool=(4, 0, 0)/(1, 3, 0) api=(1, 1, 0) "Fri Jul 26 14:30:06 2019" 
Running with inputs=(10, 28, 28, 1).. 
....... 1/10 
....... 2/10 
....... 3/10 
....... 4/10 
....... 5/10 
....... 6/10 
....... 7/10 
....... 8/10 
....... 9/10 
....... 10/10 
 RUN Stats    : batches=10 dur=4.912s tfx=4.684s 6.621KiB/s (wb=30.625KiB,rb=400B) 
Results for 10 inference(s) @216/216MHz (macc:2852598) 
 duration    : 78.513 ms (average) 
 CPU cycles  : 16958877 (average) 
 cycles/MACC : 5.95 (average for all layers) 
Inspector report (layer by layer) 
 n_nodes        : 7 
 num_inferences : 10 
Clayer  id  desc                          oshape          fmt       ms         
0       0   10011/(Merged Conv2d / Pool)  (13, 13, 32)    FLOAT32   11.289     
1       2   10011/(Merged Conv2d / Pool)  (5, 5, 64)      FLOAT32   57.406     
2       4   10004/(2D Convolutional)      (3, 3, 64)      FLOAT32   8.768      
3       5   10005/(Dense)                 (1, 1, 64)      FLOAT32   1.009      
4       5   10009/(Nonlinearity)          (1, 1, 64)      FLOAT32   0.006      
5       6   10005/(Dense)                 (1, 1, 10)      FLOAT32   0.022      
6       7   10009/(Nonlinearity)          (1, 1, 10)      FLOAT32   0.015      
                                                                    78.513 (total) 
-- Running STM32 C-model - done (elapsed time 5.282s) 
-- Running original model 
-- Running original model - done (elapsed time 0.100s) 
Exec/report summary (validate 0.000s err=0) 
model file      : /rnd/bitbucket/machine-learning-for-embedded/code-stm32f746-xcube/mnist.tflite 
type            : tflite (tflite) 
c_name          : mnistkeras 
compression     : 4 
quantize        : None 
L2r error       : 2.87924684e-03 (expected to be < 0.01) 
workspace dir   : /tmp/mxAI_workspace3396387792167015918690437549914931 
output dir      : /home/dimtass/.stm32cubemx/stm32ai_output 
model_name      : mnist 
model_hash      : 3be31e1950791ab00299d58cada9dfae 
input           : input_0 (item#=784, size=3.06 KiB, fmt=FLOAT32) 
input (total)   : 3.06 KiB 
output          : nl_7 (item#=10, size=40 B, fmt=FLOAT32) 
output (total)  : 40 B 
params #        : 93,322 (364.54 KiB) 
macc            : 2,852,598 
rom (ro)        : 263,720 (257.54 KiB) -29.35% 
ram (rw)        : 33,664 + 3,176 (32.88 KiB + 3.10 KiB) 
id  layer (type)        output shape      param #     connected to             macc           rom                 
0   input_0 (Input)     (28, 28, 1)                                                                               
    conv2d_0 (Conv2D)   (26, 26, 32)      320         input_0                  237,984        1,280               
    nl_0 (Nonlinearity) (26, 26, 32)                  conv2d_0                                                    
1   pool_1 (Pool)       (13, 13, 32)                  nl_0                                                        
2   conv2d_2 (Conv2D)   (11, 11, 64)      18,496      pool_1                   2,244,480      73,984              
    nl_2 (Nonlinearity) (11, 11, 64)                  conv2d_2                                                    
3   pool_3 (Pool)       (5, 5, 64)                    nl_2                                                        
4   conv2d_4 (Conv2D)   (3, 3, 64)        36,928      pool_3                   332,416        147,712             
    nl_4 (Nonlinearity) (3, 3, 64)                    conv2d_4                                                    
5   reshape_5 (Reshape) (576,)                        nl_4                                                        
    dense_5 (Dense)     (64,)             36,928      reshape_5                36,864         38,144 (c)          
    nl_5 (Nonlinearity) (64,)                         dense_5                  64                                 
6   dense_6 (Dense)     (10,)             650         nl_5                     640            2,600               
7   nl_7 (Nonlinearity) (10,)                         dense_6                  150                                
mnist p=93322(364.54 KBytes) macc=2852598 rom=257.54 KBytes ram=32.88 KBytes -29.35% 
Cross accuracy report (reference vs C-model) 
NOTE: the output of the reference model is used as ground truth value 
acc=100.00%, rmse=0.0007, mae=0.0003 
10 classes (10 samples) 
C0         0    .    .    .    .    .    .    .    .    .   
C1         .    0    .    .    .    .    .    .    .    .   
C2         .    .    2    .    .    .    .    .    .    .   
C3         .    .    .    0    .    .    .    .    .    .   
C4         .    .    .    .    0    .    .    .    .    .   
C5         .    .    .    .    .    1    .    .    .    .   
C6         .    .    .    .    .    .    0    .    .    .   
C7         .    .    .    .    .    .    .    2    .    .   
C8         .    .    .    .    .    .    .    .    5    .   
C9         .    .    .    .    .    .    .    .    .    0   
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_m_inputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_c_inputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_m_outputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_c_outputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_io.npz 
Evaluation report (summary) 
Mode                acc       rmse      mae       
X-cross             100.0%    0.000672  0.000304  
L2r error : 2.87924684e-03 (expected to be < 0.01) 
Creating report file /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_validate_report.txt 
Complexity/l2r error by layer - macc=2,852,598 rom=263,720 
id  layer (type)        macc                          rom                           l2r error                     
0   conv2d_0 (Conv2D)   |||                     8.3%  |                       0.5%                                
2   conv2d_2 (Conv2D)   |||||||||||||||||||||  78.7%  |||||||||||            28.1%                                
4   conv2d_4 (Conv2D)   |||                    11.7%  |||||||||||||||||||||  56.0%                                
5   dense_5 (Dense)     |                       1.3%  ||||||                 14.5%                                
5   nl_5 (Nonlinearity) |                       0.0%  |                       0.0%                                
6   dense_6 (Dense)     |                       0.0%  |                       1.0%                                
7   nl_7 (Nonlinearity) |                       0.0%  |                       0.0%  2.87924684e-03 *              
fatal: not a git repository (or any of the parent directories): .git 
Using TensorFlow backend. 
Validation ended

I’ve also included a file extras/digit.csv which is the digit “2” (same used in the jupyter notebook) that you can use this to verify the model on the target using the `extras/code-stm32f746-xcube-evaluation.bin` firmware and CubeMX. You just need to load the digit to the CubeMX input and validate the model on the target. This is part of the output, when validating with that file:

Cross accuracy report (reference vs C-model) 
NOTE: the output of the reference model is used as ground truth value 
acc=100.00%, rmse=0.0000, mae=0.0000 
10 classes (1 samples) 
C0         0    .    .    .    .    .    .    .    .    .   
C1         .    0    .    .    .    .    .    .    .    .   
C2         .    .    1    .    .    .    .    .    .    .   
C3         .    .    .    0    .    .    .    .    .    .   
C4         .    .    .    .    0    .    .    .    .    .   
C5         .    .    .    .    .    0    .    .    .    .   
C6         .    .    .    .    .    .    0    .    .    .   
C7         .    .    .    .    .    .    .    0    .    .   
C8         .    .    .    .    .    .    .    .    0    .   
C9         .    .    .    .    .    .    .    .    .    0

The above output means that the network found the digit “2” with 100% accuracy.


This is the project you want to build when you develop your own application. In this case CubeMX creates only the necessary code that wraps the x-cube-ai library. These are the app_x-cube-ai.hand app_x-cube-ai.cfiles that are located in the source/srcfolder (and in the inc/ forder in the src). These are just wrappers files around the library and the model. You actually only need to call this function and then you’re ready to run your inference.


The x-cube-ai static lib

Let’s see a few things about the x-cube-ai library. First and most important, it’s a closed source library, so it’s a proprietary software. You won’t get the code for this, which for people like me is a big negative. I guess that way ST tries to keep the library around their own hardware, which it makes sense; but nevertheless I don’t like it. That means that the only thing you have access are the header files in the `source/libs/AI/Inc` folder and the static library blob. The only insight you can have in the library is using the elfread tool and extract some information from the blob. I’ve added the output in the `extras/elfread_libNetworkRuntime400_CM7_GCC.txt`.

From the output I can tell that this was build on a windows machine from the user `fauvarqd`, lol. Very valuable information. OK seriously now, you can also see the exported calls (which you could see anyways from the header files) and also the name of the object files that are used to build the library. An other trick if you want to get more info is to try to build the project by removing the dsp library. Then the linker will complain that the lib doesn’t find some functions, which means that you can derive some of them. But does it really matter though. No source code, no fun ūüôĀ

I don’t like the fact that I don’t have access in there, but it is what it is, so let’s move on.

Building the project

You can find the C++ cmake project here:

In the source/libs folder you’ll find all the necessary libraries which are CMSIS, the STM32F7xx_HAL_Driver, flatbuffers and the x-cube-ai lib. All these are building as static libraries and then the main.cpp app is linked against those static libs. You will find the cmake files for those libs in source/cmake. The file in the repo is quite thorough about the build options and the different builds. To build the code run this command:


If you want to enable overclocking the you can build like this:


Just be aware to select the value you like for the clock in sources/src/main.cppfile in this line:

RCC_OscInitStruct.PLL.PLLN = 288; // Overclock

The default overclocking value is 288MHz, but you can experiment with a higher one (in my case that was the maximum without hard-faults).

Also if you overclock you want to change also the clock dividers on the APB1 and APB2 buses, otherwise the clocks will be too high and you’ll get hard-faults.

RCC_ClkInitStruct.APB1CLKDivider = RCC_HCLK_DIV4;
RCC_ClkInitStruct.APB2CLKDivider = RCC_HCLK_DIV2;

The build command will build the project in the build-stm32folder. It’s interesting to see the resulted sizes for all the libs and the binary file. The next array lists the sizes by using the current latest gcc-arm-none-eabi-8-2019-q3-update toolchain from here. By the time you read the article this might already have changed.

File Size
stm32f7-mnist-x_cube_ai.bin 339.5 kB
libNetworkRuntime400_CM7_GCC.a 414.4kB

This is interesting. Let’s see now the differences between the resulted binary and the main AI libs (tflite-micro and x-cube-ai).

(sizes in kB)
x-cube-ai tflite-micro
binary 339.5 542.7
library 414.4 2867

As you can see from above, both the binary and the library for x-cube-ai are much smaller. Regarding the binary, that’s because the model is smaller as the weights are compressed. Regarding the libs you can’t really say if the size matters are the implementation and the supported layers for tflite-micro are different, but it seems that the x-cube-ai library is much more optimized for this MCU and also it must be more stripped down.

Supported commands in STM32F7 firmware

The code structure of this project in the repo is pretty much the same with the code in the 3rd post. In this case though I’ve only used a single command. I’ll copy-paste the text needed from the previous post.

After you build and flash the firmware on the STM32F7 (read the for more detailed info), you can use a serial port to either send commands via a terminal like cutecom or interact with the jupyter notebook. The firmware supports two UART ports on the STM32F7. In the first case the commands are just ASCII strings, but in the second case it’s a binary flatbuffer schema. You can find the schema in `source/schema/schema.fbs` if you want to experiment and change stuff. In the firmware code the handing of the received UART commands is done in `source/src/main.cpp` in function `dbg_uart_parser()`.

The command protocol is plain simple (115200,8,n,1) and its format is:

where ID is a number and each number is a different command. So:
CMD=1, runs the inference of the hard-coded hand-drawn digit (see below)

This is how I’ve connected the two UART ports in my case. Also have a look the repo’s file for the exact pins on the connector.

Note: More copy-paste from the previous post is coming, as many things are the same, but I have to add them here for consistency.

Use the Jupyter notebook with STM32F7

In the jupyter notebook here, there’s a description on how to evaluate the model on the STM32F7. There are actually two ways to do that, the first one is to use the digit which is already integrated in the code and the other way is to upload your hand-draw digit to the STM32 for evaluation. In any case this will validate the model and also benchmark the NN. Therefore, all you need to do is to build and upload the firmware, make the proper connections, run the jupyter notebook and follow the steps in “5. Load model and interpreter”.

I’ve written two custom Python classes which are used in the notebook. Those classes are located in jupyter_notebook/ folder and each has its own folder.


The MnistDigitDraw class is using tkinter to create a small window on which you can draw your custom digit using your mouse.


In the left window you can draw your digit by using your mouse. When you’ve done then you can either press the Clearbutton if you’re not satisfied. If you are then you can press the Inferencebutton which will actually convert the digit to the format that is used for the inference (I know think that this button name in not the best I could use, but anyway). This will also display the converted digit on the right side of the panel. This is an example.

Finally, you need to press the Exportbutton to write the digit into a file, which can be used later in the notepad. Have in mind that jupyter notepad can only execute only one cell at a time. That means that as long as the this window is not terminated then the current cell is running, so you need to first to close the window pressing the [x] button to proceed.

After you export the digit you can validate it in the next cells either in the notepad or the STM32F7.


The FbComm class handles the communication between the jupyter notepad and the STM32F7 (or another tool which I’ll explain). The FbComm supports two different communication means. The first is the Serial comms using a serial port and the other is a TCP socket. There is a reason I’ve done this. Normally, the communication of the notepad is using the serial port and send to/receive data from the STM32F7. To develop using this communication is slow as it takes a lot of time to build and flash the firmware on the device every time. Therefore, I’ve written a small C++ tool in `jupyter_notebook/FbComm/test_cpp_app/fb_comm_test.cpp`. Actually it’s mainlt C code for sockets but wrapped in a C++ file as flatbuffers need C++. Anyway, if you plan on changing stuff in the flatbuffer schema it’s better to use this tool first to validate the protocol and the conversions and when it’s done then just copy-paste the code on the STM32F7 and expect that it should work.

When you switch to the STM32F7 then you can just use the same class but with the proper arguments for using the serial port.


The files in this folder are generated from the flatc compiler, so you shouldn’t change anything in there. If you make any changes in `source/schema/schema.fbs`, then you need to re-run the flatc compiler to re-create the new files. Have a look in the “Flatbuffers” section in the file how to do this.

Benchmarking the x-cube-ai

The benchmark procedure was a bit easier with the x-cube-ai compared to the tflite-micro. I’ve just compiled the project w/ and w/o overclocking and run the inference several times from the jupyter notebook. As I’ve mentioned earlier you don’t really have to do that, just use the SystemPerformance project from the CubeMX and just change the frequency, but this is not so cool like uploading your hand-drawn digit, right? Anyway, that’s the table with the results:

216 MHz 288 MHz
76.754 ms 57.959 ms

Now let’s do a comparison between the tflite-micro and the x-cube-ai inference run times.

x-cube-ai (ms) tflite-micro (ms) difference
216 MHz 76.754 126.31 1.64x (48.8%)
288 MHz 57.959 94.957 1.64x (48.4%)

Mistakenly I’ve initially calculated this difference to be 170%, because I’ve build the tflite firmware with the DEBUG flag on and I thought that it really huge. After fixing this, I’ve measured a difference of ~48% which is still significant difference, but it might be acceptable depending the application (or not).

You might noticed that the inference time is a bit higher now compared to the SystemPerformance project binary. I only assume that this is because in the benchmark the outputs are not populated and they are dropped. I’m not sure about this, but it’s my guess as this seems to be a consistent behaviour. Anyway, the difference is 2-3 ms, so I’ll skip ruin my day thinking more about this as the results of my project are actually a bit faster than the default validation project.

Evaluating on the STM32F7

This is an example image of the digit I’ve drawn. The format is the standard grayscale 28×28 px image. That’s an uint8 grayscale image [0, 255], but it’s normalized to a [0, 1] float, as the network input and output is float32.

After running the inference on the target we get back this result on the jupyter notebook.

Comm initialized
Num of elements: 784
Sending image data
Receive results...
Command: 2
Execution time: 76.754265 msec
Out[9]: 0.000000
Out[8]: 0.000000
Out[7]: 0.000000
Out[6]: 0.000000
Out[5]: 0.000000
Out[4]: 0.000000
Out[3]: 0.000000
Out[2]: 1.000000
Out[1]: 0.000000
Out[0]: 0.000000

The output predicts that the input is number 2 and it’s 100% certain about it. Cool.

Things I liked and didn’t liked about x-cube-ai

From the things that you’ve read above you can pretty much conclude by yourself about the pros of the x-cube-ai, which actually make almost all the cons to seem less important, but I’ll list them anyways. This is not yet a comparison with tflite-micro.


  1. It’s lightning fast. The performance of this library is amazing.
  2. It’s very light and doesn’t use a lot of resources and the result binary is small.
  3. The tool in the CubeMX is able to compress the weights.
  4. The x-cube-ai tool is integrated nicely in the CubeMX interface, although it could be better.
  5. Great analysis reports that helps you make decisions for which MCU you need to use and optimizations before even start coding (regarding ROM and RAM usage).
  6. Supports importing models from Keras, tflite, Lasagne, Caffe and ConvNetJS. So, you are not limited in one tool and also Keras support is really nice.
  7. You can build and test the performance and validate your NN without having to write a single line of code. Just import your model and build the SystemPerformance or Validation application and you’re done.
  8. When you write your own application based on the template then you actually only have to use two functions, one to init the network and a function to run your inference. That’s it.


  1. It’s a proprietary library! No source code available. That’s a big, big problem for many reasons. I never had a good experience with closed source libraries, because when you hit a bug, then you’re f*cked. You can’t debug and solve this by yourself and you need to file a report for the bug and then wait. And you might wait forever!
  2. ST support quite sucks if you’re an individual developer or a really small company. There is a forum, which is based on other developers help, but most of the times you might not get an answer. Sometimes, you see answers from ST stuff, but expect that this won’t happen most of the times. If you’re a big player and you have support from component vendors like Arrow e.t.c. then you can expect all the help you need.
  3. Lack of documentation. There’s only a pdf document here (UM2526). This has a lot of information, but there are still a lot of information missing. Tbh, after I searched in the x-cube-ai folders which are installed in the CubeMX, I’ve found more info and tools, but there’s no mention about those anywhere! I really didn’t like that. OK, now I know, so if you’re also desperate then in your Linux box, have a look at this path: ~/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/4.0.0/Documentation. That’s for the 4.0.0 version, so in our case it might be different.

TFLite-micro vs x-cube-ai

Disclaimer: I have nothing to do with ST and I’ve never even got a free sample from them. I had to do this, for what is following…

As you can see the x-cube-ai’s has more pros than cons are more cons compare to the tflite-micro. Tbh, I’ve also enjoyed more working with the x-cube-ai rather the tflite-micro as it was much easier. The only thing from the x-cube-ai that leaves a bitter taste is that it’s a proprietary software. I can’t stress out how much I don’t like this and all the problems that brings along. For example, let’s assume that tomorrow ST decides to pull off the plug from this project, boom, everything is gone. That doesn’t sound very nice when you’re planning for a long commitment to an API or tool. I quite insist on this, because the last 15-16 years I’ve seen this many times in my professional career and you don’t want this to happen to your released product.¬† Of course, if the API serves you well for your current running project and you don’t plan on changing something critical then it’s fine, go for it. But, I really like the fact that tflite-micro is open.

I’m a bit puzzled about tflite. At the this point, the only reason I can think of using tflite-micro over x-cube-ai, is if you want to port your code from a tflite project which already runs on your application CPU (and Linux) to an MCU to test and prototype and decide if it worth switching to an MCU as a cheaper solution. Of course, the impact of tflite in the performance is something that needs consideration and currently there’s no rule of thumb of how much slower is compared to other APIs and on specific hardware. For example in the STM32F7 case (and for the specific model) is 1.64x times slower, but this figure might be different for another MCU. Anyway, you must be aware of these limitations, know what to really expect from tflite-micro and how much room you have for performance enhancement.

There is another nice thing about tflite-micro thought. Because it’s open source you can branch the git repo and then spend time to optimise the code for your specific hardware. Definitely the performance will be much, much better; but I can’t really say how much as it depends on so many things. Have in mind that also tflite-micro is written in C++ and some of its hocus pocus may have negative impact in the performance. But at least it remains a good alternative option for prototyping, experimentation and develop to its core. And that’s the best thing with open source code.

Finally, x-cube-ai limits your options to the STM32 series. Don’t get me wrong this MCU series is great and I use stm32 for many of my projects, but it’s always nice to have an alternative.


The x-cube-ai is fast. It’s also easy to use and develop on it, it has those ready-to-build apps and the template to build your project, everything is in an all-in-one solution (CubeMX). But on the other hand is a black box and don’t expect much support if you’re not a big player.

ST was very active the last year. I also liked the STM32-MP1 SBC they released with Yocto support from day one and mainline kernel support. They are very active and serious. Although I still consider the whole HAL Driver library a bloated library (which it is, as I’ve proven that in previous stupid-projects). I didn’t had any issues; but I’ve also didn’t write much code for these last two projects (I had serious issues when I’ve tried a few years ago).

Generally, the code is focused around the NN libs performance and not the MCU peripheral library performance, but still you need to consider those things when you evaluating platforms to start a new project.

From a brief in the source code though, it seems that you can use the x-cube-ai library without the HAL library, but you would need to port some bits to the LL library to use it with that one. Anyway, that’s me; I guess most people are happy with HAL, so…

In my next post, I will use a jetson-nano to run the same inference using tflite (not micro) and an ESP8266 that will use a REST-API. Also TensorRT, seems nice. I may also try this for the next post, will see.

Update: Next post is available here.

Have fun!