Control Siglent SDG1025 with python (bonus: add web access using any SBC)


When you’re an electronic engineer you have some standard equipment on your bench: a few multimeters, a soldering station, a bench-top PSU and an oscilloscope. Some of us also have a frequency counter and a waveform generator. If your equipment is fairly new, then it probably has some kind of communication interface for connecting to and controlling the device remotely. That’s very useful in some cases, because you can do a lot of fancy things. The problem is that if you’re a Linux user you usually don’t get this that easily, as the EE world is dominated by Windows. Nevertheless, nowadays there are sometimes solutions which are also open source, and that makes it easy to take full control.

This post is about how to control the Siglent SDG1000 waveform generator series using Python. I own an SDG1025, so I’ve only tested on that model, but it should be the same for the whole series.

Siglent SDG1025

A few words about the SDG1025, though if you’re reading this you probably already have it on your bench. The SDG1025 is an arbitrary function generator with a maximum output frequency of 25MHz for sine, square and Gaussian noise, but less for other signals (e.g. 5MHz for pulse and arbitrary waves). It has two output channels, and the second channel can also be used as a frequency counter.

This is the device user manual with more info, and this is what it looks like.

What makes the SDG1025 special is its price and the features that come with it. Regarding the price range, this is neither hobbyist nor pro equipment. It lies somewhere in the middle, but a very small and easy modification moves it into the semi-pro range. This modification is to replace the XTAL on the motherboard with a better TCXO; the motherboard already has the footprint for it, so it’s easy to do. You can find 25MHz TCXOs with 0.1 or 0.3 ppm on ebay for ~10-15 EUR and this will make the input/output much more accurate. It is questionable though whether those old-format TCXOs are capable of such low ppm accuracy, but in any case with a bit more effort you can also use a new SMD TCXO as long as it has a TTL level output. This is a video that shows this modification, if you’re interested.

Another big plus of SDG1025 is the USB communication port. This port is used for communication between your workstation and the instrument and it supports the Virtual Instrument Software Architecture (VISA) API.

The only negative thing I can think of about the SDG1025 is that the fan is loud. I think that will probably be my next hack on the SDG1025, but I don’t use it often enough to really consider doing it yet.


VISA is a standard communication protocol made by various companies, but you probably know it from National Instruments’ NI-VISA, which is used in LabVIEW. This protocol is available over Ethernet, GPIB, serial, USB etc. and it’s quite a simple protocol where the host queries the device and the device sends a response back to the host. The response can be some meaningful data or just an acknowledgement of the received command.

The problem with VISA is that almost all the tools and drivers are made for Windows. You see, LabVIEW was a big thing in the Windows XP era (and I guess it probably still is in some industries), as it made it easier for people who didn’t know programming to create monitoring and control tools and automations. Of course, a lot of things have changed since then and to be honest I’m not aware of the current state of LabVIEW. At some point I believe it was also used by engineers who knew how to program, but by then LabVIEW had become a de facto tool in the industry and people also preferred the nice GUI elements that it offered. I remember at that time (early 2000’s) I was using Borland’s VCL components with Borland C++ and Delphi to create my own GUIs for instrument control, instead of LabVIEW. I found it easier to use VCL and C++ than the LabVIEW workflow. Oh, the VCL! It was an awesome library ahead of its time and I wonder why it didn’t conquer the market…

Anyway, because Windows is favored when it comes to those tools, Linux users are left behind. Times have changed though, and although Linux should be considered the main development platform, some companies are lagging a bit behind or they provide solutions and support only to specific partners.

USB Port

Now let’s get back to the SDG1025. On the rear side of the generator there’s a USB port, as you can see from this image.

That USB port is used for communicating with your workstation and it supports two modes of operation, USBRAW and USBTMC.

USBTMC stands for USB Test and Measurement Class. You might know this already from those cheap 8-ch, 24MHz logic analyzers that are sold on ebay, some of which are based on the CY7C68013A EZ-USB. Those analyzers claim compatibility with the Sigrok software suite, which supports USBTMC.

Supporting USBTMC doesn’t mean that your device (e.g. the SDG1025) is supported by such software. It only means that your device implements that class specification; therefore you could integrate your device into Sigrok if you like, but you have to add the support yourself. So, USBTMC is just the class specification and not the device-specific communication details, which are different and custom for each piece of equipment.

So, let’s go back to the SDG1025. You can select the USB mode by pressing the Utility button and browse to this menu [Utility >> Interface >> USB Setup]. There you’ll find those two options:

In this menu you need to select the USBTMC option.

Now connect the USB cable to your Linux machine and run the lsusb command which should print a line like this.

$ lsusb
Bus 008 Device 002: ID f4ed:ee3a Shenzhen Siglent Co., Ltd. SDG1010 Waveform Generator (TMC mode)

That means that your workstation recognises the SDG1025 and that it’s also set to the correct mode (USBTMC). Now all that is left is to be able to communicate with the device.
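If you want to pick those IDs out of the lsusb line from a script, a minimal sketch could look like the one below. The function name `parse_lsusb_ids` is made up for this example. The decimal form of the IDs is the one that shows up in the USBTMC resource strings later in this post (e.g. USB::62701::60986::…).

```python
import re

# Matches the "ID vvvv:pppp" part of an lsusb line (hex vendor and product IDs).
LSUSB_ID = re.compile(r"ID ([0-9a-fA-F]{4}):([0-9a-fA-F]{4})")


def parse_lsusb_ids(line):
    """Return (vendor_id, product_id) as integers, or None if no ID is found."""
    match = LSUSB_ID.search(line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


line = ("Bus 008 Device 002: ID f4ed:ee3a Shenzhen Siglent Co., Ltd. "
        "SDG1010 Waveform Generator (TMC mode)")
vid, pid = parse_lsusb_ids(line)
print(f"VID=0x{vid:04x} ({vid}), PID=0x{pid:04x} ({pid})")
# prints: VID=0xf4ed (62701), PID=0xee3a (60986)
```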

Host controller

Usually when you want to control a piece of equipment you use your workstation. But that’s not always the case, as you could also use a small Linux SBC. Therefore, in this post I’ll assume that the host controller is the nanopi-neo SBC, which is based on the Allwinner H3, a quad-core Cortex-A7 that runs at up to 1.2GHz. This small board costs $10 and it’s powerful enough to perform better in some tasks than other, more expensive SBCs.

Therefore, since we’re talking about Linux you have the flexibility to use whatever hardware you like as the host controller. In this case, the nanopi-neo actually turns the SDG1025 into an Ethernet-controlled device; likewise, you can use an SBC with WiFi to control the SDG1025 over your WiFi network. Anyway you get the idea, it’s very flexible. The power consumption of these small SBCs is also very low.

Therefore, from now on the commands that I’m going to use are the same regardless of the host controller. Finally, for the nanopi-neo I’ve built the Armbian 5.93 git version, but you can also download a pre-built image from here.

Communicate with the SDG1025

Let’s assume that you have booted your Linux host and you have a console terminal. In general, when it comes to python, the best practice is to always use a virtual environment. That makes things easier, because if you break something you only break the virtual environment and not your main one. To set up your host, create a new virtual environment and install the needed dependencies, use these commands:

# Update system
sudo apt update
sudo apt -y upgrade

# Install needed packages
sudo apt install python3-venv
sudo apt install python3-pip
sudo apt install libusb-1.0-0-dev

# Create a new environment
mkdir ~/pyenv
cd ~/pyenv
python3 -m venv sdg1025
source ~/pyenv/sdg1025/bin/activate

# Install python dependencies
pip3 install setuptools
pip3 install pyusb

After running the “source ~/pyenv/sdg1025/bin/activate” command, you should see the environment name in your prompt. For example, on my nanopi-neo I see this prompt:

(sdg1025) dimtass@nanopineo:~$

That means that now you’re working in the virtual environment and all the python modules you install are installed in that environment. To deactivate the virtual environment you run this command:

deactivate

Finally, you always need to activate this environment when you want to communicate with the SDG1025 and use those tools:

source ~/pyenv/sdg1025/bin/activate

If everything works fine for you and you didn’t break anything in the process, then if you don’t like using this virtual env, you can just install the same tools as before without creating the virtual env and skip those steps. From now on though, I’ll assume that the virtual env is used.
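If you’re not sure whether a script is actually running inside the virtual env, a small check like the following can help. This is just a sketch; it relies on the fact that inside a venv `sys.prefix` differs from `sys.base_prefix`, and the function name is my own.

```python
import sys


def in_virtualenv():
    """True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the system installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtual env active:", in_virtualenv())
```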

Now you need to create a udev rule, so your user gets permission to operate on the USB device, as by default only root has access to it. As you may remember, in a previous step you ran the lsusb command and got an output similar to this:

$ lsusb
Bus 008 Device 002: ID f4ed:ee3a Shenzhen Siglent Co., Ltd. SDG1010 Waveform Generator (TMC mode)

From this output you need the USB VID and PID numbers, which are f4ed and ee3a in this case. Your SDG1025 should have the same VID/PID. Now you need to create a new udev rule with those IDs.

sudo nano /etc/udev/rules.d/51-siglent-sdg1025.rules

And then copy this line and save the file.

SUBSYSTEM=="usb", ATTRS{idVendor}=="f4ed", ATTRS{idProduct}=="ee3a", MODE="0666"

Now you need to update and re-trigger the rules with this command:

sudo udevadm control --reload-rules && sudo udevadm trigger
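If you have more than one instrument, you may prefer to generate the rule line instead of typing it. Here is a minimal sketch; the helper name `udev_rule` is hypothetical and the 0666 mode is simply the one used above.

```python
def udev_rule(vendor_id, product_id, mode="0666"):
    """Build a udev rule line that opens a USB device to all users."""
    return (
        'SUBSYSTEM=="usb", '
        f'ATTRS{{idVendor}}=="{vendor_id:04x}", '
        f'ATTRS{{idProduct}}=="{product_id:04x}", '
        f'MODE="{mode}"'
    )


# Same VID/PID as reported by lsusb for the SDG1025.
print(udev_rule(0xF4ED, 0xEE3A))
```

The printed line is the same one that goes into the rules file above.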

If you haven’t already connected the USB cable from the SDG1025 to your host, you can do it now.


The last piece you need is the python-usbtmc module, which you can download from here:

This is the module which implements the USBTMC protocol and which you can use to script with python. To install it run those commands:

git clone
cd python-usbtmc
python3 setup.py install
cd ../
rm -rf python-usbtmc

If everything went smoothly then now you should be able to test the communication between the host and the SDG1025. To do this launch python3 and you should see something like this:

(sdg1025) $ python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.

The next commands import the python-usbtmc module and list the USBTMC devices that are connected to the system:

>>> import usbtmc
>>> usbtmc.list_resources()

As you can see, the list_resources() function listed the SDG1025. Now you need to copy this exact string and use it to instantiate a new instrument object.

>>> sdg = usbtmc.Instrument('USB::62701::60986::SDG10GA1234567::INSTR')

In your case this string will be different, of course, as it also contains the device serial number. Now you can poll the device and get some basic information.

>>> print(sdg.ask("*IDN?"))
*IDN SDG,SDG1025,SDG10GA1234567,,04-00-00-30-28

The *IDN? is a standard protocol command that sends back some basic information. In this case it’s the device name, serial number and firmware version.
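If you want to use that reply in a script, you can split it into fields. The sketch below assumes the reply always has the layout shown above; the field names are my own labels and not something defined by the device.

```python
def parse_idn(response):
    """Split an SDG1025-style *IDN? reply into named fields.

    The field names are an assumption based on the reply shown in this post.
    """
    # Drop the leading "*IDN" echo, then split the comma-separated payload.
    payload = response.strip()
    if payload.startswith("*IDN"):
        payload = payload[len("*IDN"):].strip()
    fields = payload.split(",")
    return {
        "model": fields[1],
        "serial": fields[2],
        "version": fields[-1],
    }


info = parse_idn("*IDN SDG,SDG1025,SDG10GA1234567,,04-00-00-30-28")
print(info["model"], info["serial"], info["version"])
# prints: SDG1025 SDG10GA1234567 04-00-00-30-28
```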

You’re done! Now you can control the SDG1025 via USB.

Supported commands

The SDG1025 supports a lot of commands that can control pretty much all the functionality of the device, and there’s a full list of the supported commands in this document. Yes, it’s quite a large document and there are many commands in there, but you don’t really need them all, right? You should have a look at all of them in order to decide which are the most interesting for your own custom scripts. The very basic commands, I think, are BSWV (BASIC_WAVE) and OUTP (OUTPUT).

The BSWV sets or gets basic wave parameters and you can use it to control the output waveform. To read the current output setup of channel 1 run this:

>>> sdg.ask("C1:BSWV?")

As you can guess, C1 means channel 1, so for channel 2 you should run this:

>>> sdg.ask("C2:BSWV?")
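The reply to a BSWV query comes back as one long comma-separated string, so a small parser makes it easier to use in scripts. This is only a sketch: the sample reply below is illustrative and was not captured from a real device, so check the Programming Guide for the exact fields your firmware returns.

```python
def parse_bswv(response):
    """Turn 'C1:BSWV KEY,VALUE,KEY,VALUE,...' into (channel, params dict)."""
    header, _, payload = response.strip().partition(" ")
    channel = header.split(":")[0]          # e.g. "C1"
    items = payload.split(",")
    # Parameters come as alternating key/value items.
    params = dict(zip(items[0::2], items[1::2]))
    return channel, params


# Illustrative reply, not captured from a real device.
sample = "C1:BSWV WVTP,SQUARE,FRQ,10000000HZ,AMP,3.3V,OFST,0V"
channel, params = parse_bswv(sample)
print(channel, params["WVTP"], params["FRQ"], params["AMP"])
# prints: C1 SQUARE 10000000HZ 3.3V
```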

Therefore, if you want to set CH1 to a 3V3, 10MHz, square wave output, you need to run this:

>>> sdg.write('C1:BSWV WVTP,SQUARE,FRQ,10000000,AMP,3.3V')
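When you script this, it gets tedious to assemble those strings by hand. A minimal helper could look like this; `bswv_command` is a name I made up and it does no validation of the parameter names or values, so it will happily build commands the device rejects.

```python
def bswv_command(channel, **params):
    """Build a 'Cn:BSWV KEY,VALUE,...' string from keyword arguments."""
    if channel not in (1, 2):
        raise ValueError("the SDG1025 has two channels: 1 and 2")
    pairs = ",".join(f"{key.upper()},{value}" for key, value in params.items())
    return f"C{channel}:BSWV {pairs}"


cmd = bswv_command(1, wvtp="SQUARE", frq=10000000, amp="3.3V")
print(cmd)
# prints: C1:BSWV WVTP,SQUARE,FRQ,10000000,AMP,3.3V
```

The resulting string can then be passed to sdg.write(cmd).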

As you can see though, the BSWV command only configures the output, but it doesn’t also enable it by default. You need the OUTP command to do that. Therefore, to enable the CH1 output with that configuration you need to run this

>>> sdg.write('C1:OUTP ON')

You can disable the output like this:

>>> sdg.write('C1:OUTP OFF')

Finally, you can query the output state like this

>>> sdg.ask('C1:OUTP?')

From the response, you see that you can also control the output load (LOAD) and the polarity (PLRT). NOR means that the polarity is normal, but you can set it to INVT if you want to invert it.
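In a script, the OUTP reply can be turned into a small dictionary. The sample string below is an assumption about the reply layout (state first, then key/value pairs), based on the LOAD and PLRT fields mentioned above, so verify it against your own device.

```python
def parse_outp(response):
    """Parse an OUTP? reply such as 'C1:OUTP ON,LOAD,HZ,PLRT,NOR'."""
    header, _, payload = response.strip().partition(" ")
    state, *rest = payload.split(",")
    result = {"channel": header.split(":")[0], "enabled": state == "ON"}
    # The remaining items are key/value pairs such as LOAD and PLRT.
    result.update(zip(rest[0::2], rest[1::2]))
    return result


# Assumed reply layout, not captured from a real device.
status = parse_outp("C1:OUTP ON,LOAD,HZ,PLRT,NOR")
print(status)
# prints: {'channel': 'C1', 'enabled': True, 'LOAD': 'HZ', 'PLRT': 'NOR'}
```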

At this point it should be clear what you can do with this module and the supported commands. Now that you know that you can script with python and control the device, you can understand how many things you can do. It’s awesome!

Just read the Programming Guide document and do your magic.

Bonus: Create a web control using the nanopi-neo

This wouldn’t be a proper stupid project if this was missing, so I couldn’t resist the temptation. As bonus material I’ll show you how to use the nanopi-neo (or any other SBC) to do some basic control of the output via a web interface using python, flask and websockets. I won’t really get into the code details, but the code is free to use and edit and you can get it from this repo:

In order to run the code you need to follow all the commands in the previous sections to install a virtual env and python-usbtmc. Additionally, you need to install a couple more things in order to support flask and wtforms, on which my web interface is based. In my case I’ve used the nanopi-neo and also my workstation and it worked fine on both. Now type those commands:

pip3 install flask
pip3 install wtforms
pip3 install flask-socketio
pip3 install eventlet

The last command which installs the eventlet package may fail, but it doesn’t really matter. This package is for installing the eventlet async mode, which makes websockets faster. Without it the default mode is set to threading, which is still fast but not that fast (I’ve seen around 2ms ping times using websockets with eventlet and ~30ms with threading).
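To make that fallback explicit in your own code, you could check whether eventlet is importable before choosing the mode. This is a sketch of the idea and not necessarily how my web app does it; Flask-SocketIO also picks a mode by itself when you don’t pass one.

```python
import importlib.util


def pick_async_mode():
    """Prefer eventlet when it is installed, otherwise fall back to threading."""
    if importlib.util.find_spec("eventlet") is not None:
        return "eventlet"
    return "threading"


mode = pick_async_mode()
print("async mode:", mode)
# With Flask-SocketIO this value would be passed as
# SocketIO(app, async_mode=mode).
```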

Now that you’ve installed the packages, you can run the code, but first you need to connect the USB cable on the nanopi-neo (or whatever) and the SDG1025. This is everything connected together (in the photo the web-interface is already running).

Now, run these commands (includes also the repo clone).

git clone
cd web-interface-for-sdg1025

And that’s it. You should see in the output messages like the following:

[SDG1025]: Connecting to the first available device: USB::62701::60986::SDG10GA4150816::INSTR
[SDG1025]: Connected to: SDG10GA4150816
Starting Flask web server...
Open this link to your browser:

That means that everything worked fine. Now, if the web app is running on the nanopi-neo (or wherever), open your browser and connect to its IP address and port 5000. And you will see something like this:

Note: in the picture the frequency should be (HZ) and not (KHz), but I’ve fixed that in the repo.

When the interface is opened, it reads the current configuration for both channels and then updates all the needed elements on the web interface. You can use the controls to play around with the interface. I’ll also post a video that I’ve made using the nanopi-neo running the web server and my tablet to control the SDG1025. Excuse some delays when using controls on my tablet, but it’s because the touch screen is problematic. I won’t buy a new tablet until this one has completely fallen apart. For now it serves its purpose just fine.


Most new equipment comes with some kind of communication interface, which is really nice, as you can use it to perform a lot of automated tasks. The problem is that most of those devices support Windows only. This is an old habit for most of the companies, as they either can’t see the value in supporting Linux platforms or the engineers are still stuck in the 00’s era, when Windows XP was the ultimate platform for engineers.

Anyway, for whatever reason, there is hope if your equipment supports a standard protocol or class, because then you can probably find a module (or develop your own) and then do whatever you like. For the SDG1025 I’ve used the python-usbtmc module, which is open and available here. I’ve also used Flask to develop a web app. The front-end is made with HTML and javascript (jquery + jquery-ui) and the back-end is written in Python. The reason I’ve used Flask is that it made it easier to integrate the python-usbtmc module in the web app. Flask is also really nice and mature enough, and it has a very nice back-end that also integrates the web server, so you don’t need to install one. Anyway, I also like Flask, so I use it whenever python is involved.

This web app is a very basic one and I’ve made it just for fun. I’ve only supported setting the outputs, but you can use it as a base template to do whatever you like and add more functionality. You can make tons of automation using Python scripting and the SDG1025, it’s up to your imagination.
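As an example of such automation, here is a rough sketch of a logarithmic frequency sweep. All the names are made up for the example, and the `RecordingInstrument` class only records commands so that the script can be dry-run without the generator; on the bench you would pass the real usbtmc instrument object instead and probably add a delay between steps.

```python
def log_sweep(start_hz, stop_hz, steps):
    """Return `steps` frequencies spaced evenly on a log scale (inclusive)."""
    if steps < 2:
        return [float(start_hz)]
    ratio = (stop_hz / start_hz) ** (1.0 / (steps - 1))
    return [start_hz * ratio ** i for i in range(steps)]


def sweep_commands(channel, start_hz, stop_hz, steps):
    """Build one BSWV frequency command per sweep point."""
    return [f"C{channel}:BSWV FRQ,{freq:.0f}"
            for freq in log_sweep(start_hz, stop_hz, steps)]


class RecordingInstrument:
    """Stand-in for usbtmc.Instrument that only records what is written."""

    def __init__(self):
        self.sent = []

    def write(self, command):
        self.sent.append(command)


instr = RecordingInstrument()   # swap for the real `sdg` object on the bench
for command in sweep_commands(1, 1000, 1000000, 4):
    instr.write(command)
print(instr.sent)
```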

Keep in mind that a few times I’ve noticed that the communication was broken and I had to restart the device. I didn’t find the cause, as it didn’t happen often, but I suspect the device…

That was my last stupid project for the next few months, as I plan to get some rest and take some proper holidays.

Have fun!

Benchmarking TensorFlow Lite for microcontrollers on Linux SBCs


([06.08.2019] Edit: I’ve added an extra section with more benchmarks for the nanopi-neo4)

([07.08.2019] Edit: I’ve added 3 more results that Shaw Tan posted in the comments)

In this post, I’ll show you the results of benchmarking the TensorFlow Lite for microcontrollers (tflite-micro) API, not on various MCUs this time, but on various Linux SBCs (Single-Board Computers). For this purpose I’ve used some code which I originally wrote to test and compare the tflite-micro API, which is written in C++, and the tflite python API. This is the repo:

Then, I thought why not test the tflite-micro API on various SBCs that I have around.

For those who don’t know what an SBC is, it’s just a small computer with an application CPU, like the Raspberry Pi (RPi). Probably everyone knows the RPi, but there are more than 100 SBCs on the market in all forms and sizes (e.g. RPi has currently released 13 different models).


Although I have plenty of those boards around, I didn’t use all of them. That would be very time consuming. I’ve selected quite a few though. Before I list the devices and their specs, here is a photo of the testing bench. A couple of SBCs missed the photo shoot and joined the party later.

Top row, from left to right:

Bottom row, from left to right:

  • Nanopi Neo 4: Rockchip RK3399, Dual Cortex-A72 @ 1.8GHz + Quad Cortex-A53 @ 1.5GHz
  • Nanopi K1 Plus: Allwinner H5, Quad Cortex-A53 @ 1.3GHz, 3GB DDR3
  • Nanopi Neo2: Allwinner H5, Quad Cortex A53 @ 900MHz, 512MB DDR3
  • Orangepi Prime: Allwinner H5, Quad Cortex-A53 @ 1.3 GHz, 2GB DDR3

Missing from the photo:

  • Nanopi neo: Allwinner H3, Quad Cortex-A7 @ 1.2GHz, 512MB DDR3
  • Nanopi duo: Allwinner H2+, Quad Cortex-A7 @ 1GHz, 512MB DDR3

I’ve tried to use the same distribution packages for each SBC, so for the majority of the boards the rootfs and packages are from Ubuntu 18.04, but the kernels are different versions. It’s impossible to have the exact same kernel version, as not every SoC has mainline support. Also, even if the SoC has mainline support, like the H5 and RK3399, the SBC itself probably doesn’t. Therefore, each board needs a device-tree and several patches for a specific kernel version to be able to boot properly. There are build systems like Armbian which make things easier, but they support only a fraction of the available SBCs. SoC and SBC support in the mainline kernel has been an important issue for a long time now, but let’s continue.

In this table I’ve listed the image I’ve used for each board. These are the boards that I’ve benchmarked myself:

| SBC | Image | Kernel |
|---|---|---|
| AML-S905X-CC | Ubuntu 18.04 Headless | 4.19.55 |
| Raspberry Pi 3 B+ | Ubuntu 18.04 Server | 4.4.0-1102 |
| Jetson nano | Ubuntu 18.04 | 4.9 |
| Nanopi Duo | Armbian 5.93 | 4.19.63 |
| Nanopi Neo | Armbian 5.93 | 4.19.63 |
| Nanopi Neo2 | Armbian 5.93 | 4.19.63 |
| Nanopi Neo 4 | Armbian 5.93 | 4.4.179 |
| Nanopi Neo 4 (2) | Yocto build | 4.4.176 |
| Nanopi K1 Plus | Armbian 5.93 | 4.19.63 |
| Orangepi Prime | Armbian 5.93 | 4.19.63 |
| Beaglebone Black | Debian 9.5 | |

This table lists other SBCs that Shaw Tan benchmarked, with results posted in the comments:

| SBC | Image | Kernel |
|---|---|---|
| Google Coral Dev Board | Debian Stretch aarch64 | 4.9.51-imx |
| Rock Pi 4B | Debian Stretch aarch64 | 4.4.154 |
| Raspberry Pi 4 | Raspbian | 4.19.59-v8 |

I got more results from RemoteC in the comments:

| SBC | Image | Kernel |
|---|---|---|
| Odroid MC1 | Linux odroid | 4.14.141-169 |
| Odroid N2 | Linux odroid | 4.9.216-71 |

But why use tflite-micro?

That’s a good question. Why use the tflite-micro API on a Linux SBC? The tflite-micro is meant to be used with small microcontrollers and there is the tflite C++ API for Linux, which is more powerful and complete. Let’s see what the pros and cons of the C++ tflite API are.

Pros:

  • Supports more Ops
  • Supports also Python (that doesn’t have to do with C++ but is a plus for the API)
  • Can scale to multi-core CPUs
  • GPU, TPU and NCS2 support
  • Can load flatbuffer models from the file system
  • Small binary (~300KB)

Cons:

  • It’s hard to build
  • A lot of dependencies
  • The whole API is integrated in the git repo and it’s very difficult to get only the needed files to create your custom build via Make or CMake
  • No pre-built binaries available for architectures other than x86_64

I’ve tried to build the API on a couple of SBCs and that failed for several different reasons. For now only the RPi seems to be supported, but even then it’s quite difficult to re-use the library; it seems like you need to develop your application inside the tensorflow repo and it’s difficult to extract the headers. Maybe there is documentation somewhere on how to do this, but I couldn’t find it. I’ve also seen other people having the same issue (see here), but none of the suggestions worked for me (except on my x86_64 workstation). At some point I got tired of trying, so I quit. Also, the build was taking hours just to see it fail.

Then I thought, why not use the tflite-micro API, which worked fine on the STM32F7 before? Since it’s a simple library that builds on MCUs and its only dependency is flatbuffers, it should build everywhere else (including Linux on those SBCs). So I’ve tried this and it worked just fine. Then I thought I’d run some benchmarks on the various SBCs I have around to see how it performs and whether it’s usable.

tflite model

Before I present the results, let’s have a quick look at the tflite model. The model is the same I’ve used in the tests I’ve done in the “ML for embedded” series here, here and here. The complexity (MACC) of this model is measured in multiply-and-accumulate (MAC) operations, which is the fundamental operation in ML.

For the specific mnist model I’ve used, the MACC is 2,852,598. That means that to get the output for any random input the CPU executes about 2.8 million MAC operations. That’s, I guess, a medium-size model for a NN, but definitely not a small one.
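To give an idea of how such a number adds up, here is a small sketch that counts MACs for convolution and fully connected layers. The layer shapes below are hypothetical and chosen only to illustrate the arithmetic; they are not the layers of the model used in this post.

```python
def dense_macs(inputs, outputs):
    """A fully connected layer does one MAC per weight."""
    return inputs * outputs


def conv2d_macs(out_h, out_w, kernel_h, kernel_w, in_ch, out_ch):
    """Each output value needs kernel_h * kernel_w * in_ch MACs."""
    return out_h * out_w * out_ch * kernel_h * kernel_w * in_ch


# Hypothetical layer shapes, not the layers of the model in this post.
layers = [
    conv2d_macs(28, 28, 3, 3, 1, 32),
    conv2d_macs(14, 14, 3, 3, 32, 64),
    dense_macs(7 * 7 * 64, 10),
]
print(layers, sum(layers))
# prints: [225792, 3612672, 31360] 3869824
```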

Because now we’re using the tflite-micro API, the model is not a file which is loaded from the file system, but it’s embedded in the executable binary, as the flatbuffer model is just a byte array in a header file.
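For illustration, this is roughly what such a conversion does: it takes the bytes of the model file and writes them out as a C array, similar to the output of `xxd -i`. The array name and the few placeholder bytes are made up for the example; a real model would be read from the .tflite file.

```python
def to_c_array(data, name="model_tflite", per_line=12):
    """Render bytes as a C array definition, similar to `xxd -i` output."""
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("  " + ", ".join(f"0x{byte:02x}" for byte in chunk) + ",")
    body = "\n".join(lines)
    return (f"const unsigned char {name}[] = {{\n{body}\n}};\n"
            f"const unsigned int {name}_len = {len(data)};\n")


# A few placeholder bytes stand in for the contents of a real .tflite file.
header_text = to_c_array(bytes([0x1C, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4C, 0x33]))
print(header_text)
```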

Build the benchmark

If you want to test on any other SBC, you just need g++ and cmake. If your distro supports apt, then run this:

sudo apt install cmake g++

Then you just need to clone the repo, build the benchmark and run it. Just be aware that the output filename is changing depending on the CPU architecture. So for an aarch64 CPU run those commands:

git clone
cd tflite-micro-python-comparison/

And you’ll get the results.

(For a cortex-a7, the executable name will be mnist-tflite-micro-armv7l)
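If you script the benchmark across several boards, you can derive the name from the architecture reported by the system. This sketch assumes the build directory follows the same pattern as the executable (as in the aarch64 path used later in this post); I haven’t checked that for every architecture.

```python
import platform


def benchmark_binary(arch=None):
    """Return the expected benchmark path for a CPU architecture string."""
    # platform.machine() reports e.g. "aarch64", "armv7l" or "x86_64".
    arch = arch or platform.machine()
    return f"./build-{arch}/src/mnist-tflite-micro-{arch}"


print(benchmark_binary("aarch64"))
print(benchmark_binary("armv7l"))
print(benchmark_binary())  # whatever the current machine reports
```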


Note: I haven’t done any custom optimization for any of the SBCs or tuned the compiler flags for each architecture. The compiler is left to automatically detect the best options depending on the system. Also, I’m running the stock images without even changing the performance governor, which is set to “ondemand” for all the SBCs. Finally, I haven’t applied any overclocking to any of the SBCs, therefore the CPU frequencies are whatever those images have set for the ondemand governor.
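If you want to check which governor your own board is using before you compare numbers, you can read it from sysfs. A minimal sketch, assuming the usual cpufreq layout under /sys/devices/system/cpu (on systems without cpufreq it just returns an empty dict):

```python
from pathlib import Path


def read_governors(cpu_root="/sys/devices/system/cpu"):
    """Map each cpuN directory to its current cpufreq governor."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name      # e.g. "cpu0"
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


print(read_governors())
```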

Ok, so now let’s see the results I got on those SBCs. I’ve created a table with the average time of 1000 inferences for each test. In this list I’ve also added the results from my workstation and also in the end the results of the exact same code with the exact same tflite model running on the STM32F746, which is an MCU and baremetal firmware.

| SBC | Average for 1000 runs (ms) |
|---|---|
| Ryzen 2700X (this is not an SBC) | 2.19 |
| AML-S905X-CC | 15.54 |
| Raspberry Pi 3 B+ | 13.47 |
| Jetson nano | 9.34 |
| Nanopi Duo | 36.76 |
| Nanopi Neo | 16 |
| Nanopi Neo2 | 22.83 |
| Nanopi Neo 4 | 5.82 |
| Nanopi Neo 4 (2) | 5.21 |
| Nanopi K1 Plus | 14.32 |
| Orangepi Prime | 18.40 |
| Beaglebone Black | 97.03 |
| STM32F746 @ 216MHz | 76.75 |
| STM32F746 @ 288MHz | 57.95 |

The next table has results that Shaw Tan and RemoteC posted in the comments section:

| SBC | Average for 1000 runs (ms) |
|---|---|
| Google Coral Dev Board | 12.40 |
| Rock Pi 4B | 6.33 |
| Raspberry Pi 4 | 8.35 |
| Odroid MC1 – core 0 | 7.82 |
| Odroid MC1 – core 1 | 15.32 |
| Odroid N2 – core 1 | 9.90 |
| Odroid N2 – core 5 | 6.35 |

There are many interesting results in these tables.

First, the RK3399 CPU (nanopi-neo4) outperforms the Jetson nano and it’s only about 2.7 times slower than my Ryzen 2700X CPU (5.82 ms vs 2.19 ms). Amazing, right? This board costs only $50. Excellent performance on this specific test.

Then the Jetson nano needs 9.34 ms. Be aware that this is a single CPU thread time! If you remember, here the tflite python API scored 0.98 ms in MAXN mode and 2.42 in 5W mode. The tflite python API supported the GPU though and used CUDA acceleration. So, for this board, when using the GPU and the tflite you get 9.5x (MAXN mode) and 3.8x (5W mode) better performance. Nevertheless, 9.34 ms doesn’t sound that bad compared to the 5W mode.

Beaglebone black is the worst performer. It’s even slower than the STM32F7, which is quite expected as BBB is a single core running @ 1GHz and the code is running on top of Linux. So the OS has a huge impact in performance, compared to baremetal. This raises an interesting question…

What would happen if tflite-micro was running baremetal on those application CPUs? Without an OS? Interesting question, maybe I’ll get to it in another post some time in the future.

Then, the rest of the SBCs lie somewhere in the middle.

Finally, it’s worth noting that the nanopi neo is an SBC that costs only $9.99! That’s an amazing price/performance ratio for running the tflite-micro API. This board amazed me with those results (given its price). Also it’s supposed to be an LTS board, which means that it will be in stock for some time, though it’s not clear for how long.
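To put those numbers in perspective, the following sketch computes how much slower each board is than the Ryzen and how many million MACs per second each one effectively executes, using the MACC of the model and a few timings copied from the first results table. It’s only arithmetic on the published numbers, not an additional measurement.

```python
MACC = 2_852_598  # multiply-accumulate operations per inference (from the post)

# Average inference time in ms, copied from the first results table.
results_ms = {
    "Ryzen 2700X": 2.19,
    "Nanopi Neo4": 5.82,
    "Jetson nano": 9.34,
    "Nanopi Neo": 16.0,
    "STM32F746 @ 216MHz": 76.75,
    "Beaglebone Black": 97.03,
}

baseline = results_ms["Ryzen 2700X"]
for name, ms in sorted(results_ms.items(), key=lambda item: item[1]):
    slowdown = ms / baseline                    # relative to the workstation
    mmacs_per_s = MACC / (ms / 1000.0) / 1e6    # effective throughput
    print(f"{name:20s} {ms:7.2f} ms  {slowdown:6.2f}x  {mmacs_per_s:7.1f} MMAC/s")
```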

Additional benchmarks on the nanopi-neo4 (RK3399)

Since the nanopi-neo4 performed better than the other SBCs, I’ve built an image for it using Yocto. Since I’m the owner and maintainer of this layer, I need to say that it is actually a mix of the armbian u-boot and kernel versions and patches, and I’ve also used some parts from meta-sunxi, but I’ve done more integration work, like supporting the GPU and some other things.

Initially, I’ve tried the armbian build because it’s easy for everyone to reproduce, but then I thought I’d also test my Yocto layer. Therefore I’ve used this repo here:

There are details how to use the layer and build in the README file of this repo, but this is the command I’ve used to build the image I’ve tested (after setting up the repo as described in the readme).

MACHINE=nanopi-neo4 DISTRO=rk-none source ./ build

Then in the build/conf/local.conf file I’ve added this line in the end of the file

IMAGE_INSTALL += " cmake "

And finally I’ve built it with this command:

bitbake rk-image-testing

Then I’ve flashed the image on an SD card:

# change sdX with your SD card
sudo umount /dev/sdX*
bzcat build/tmp/deploy/images/nanopi-neo4/rk-image-testing-nanopi-neo4.wic.bz2 | sudo dd of=/dev/sdX status=progress
sudo eject /dev/sdX

This distro gave me a bit better results. I’ve also tried to benchmark each core separately by running the script like this:

taskset --cpu-list 0 ./build-aarch64/src/mnist-tflite-micro-aarch64

To test all cores, replace the 0 in the above command with each core number from 0 to 5. Cores [0-3] are the Cortex-A53 and [4-5] are the Cortex-A72, which are faster. Without using taskset the OS will always run the script on the A72. These are the results:

| Core number | Results (ms) |
|---|---|
| 0 | 12.39 |
| 1 | 12.40 |
| 2 | 12.39 |
| 3 | 12.93 |
| 4 | 5.21 |
| 5 | 5.21 |

Therefore, compared to the armbian distro there’s a slightly better performance of 0.61 ms. This might be down to the different kernel version, I don’t know, but the difference seems to be constant on every inference run I’ve tested.
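If you want to repeat the per-core test without typing the command six times, you can generate the taskset invocations from a script. This is a sketch that only builds and prints the commands; on the board you would run each one, for example with subprocess.run.

```python
BINARY = "./build-aarch64/src/mnist-tflite-micro-aarch64"


def per_core_commands(cores, binary=BINARY):
    """Build one `taskset --cpu-list N binary` argument list per core."""
    return [["taskset", "--cpu-list", str(core), binary] for core in cores]


# Cores 0-3 are the Cortex-A53 cluster and 4-5 the Cortex-A72 cluster.
commands = per_core_commands(range(6))
for argv in commands:
    print(" ".join(argv))
    # On the board: subprocess.run(argv, check=True)
```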


From the above results I’ve come to some conclusions, which I’ll list as pros and cons of using the tflite-micro API on an SBC running a Linux OS.

Pros:

  • Very simple to build and use
  • Minimal dependencies
  • Easy to extract the proper source and header files from the Tensorflow repo (compared to tflite)
  • Performance is not that bad (that’s a personal opinion though)
  • Portability. You can compile the exact same code on any CPU or MCU and also on Linux or baremetal (maybe also on Windows with MinGW, I haven’t tested).

Cons:

  • No multi-threading. The tflite-micro only supports 1 thread.
  • The model is embedded in the executable as a byte array, therefore if you want to be able to load tflite models from the file system, then you need to implement your own parser, which just loads the tflite model file from the filesystem to a dynamically allocated byte array.
  • It’s slower compared to the tflite C++ API

From the above points (I may have missed a few, so I’ll update the list if something comes to mind), it’s clear that the performance of the tflite-micro API on a multi-core CPU and in Linux will be worse compared to the tflite API. I only have numbers to do a comparison between tflite and tflite-micro for my Ryzen 2700x and the Jetson nano, but not for the other boards. See the table:

| CPU | tflite-micro/tflite speed ratio |
|---|---|
| Ryzen 2700X | 10.63x |
| Jetson nano (MAXN) | 9.46x |
| Jetson nano (5W) | 3.86x |

The above table shows that the tflite API is 10.63x faster than tflite-micro on the 2700x and 9.46x and 3.86x faster on the Jetson nano (depending on the mode). This is a great difference, of course.
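For the Jetson nano you can reproduce the ratios from the timings quoted in the results discussion (9.34 ms for tflite-micro, 0.98 ms for tflite in MAXN mode and 2.42 ms in 5W mode). Those quoted timings are rounded, so the MAXN ratio comes out close to, but not exactly at, the figure in the table. A quick sketch:

```python
def speed_ratio(micro_ms, tflite_ms):
    """How many times faster tflite is than tflite-micro."""
    return micro_ms / tflite_ms


micro_ms = 9.34                              # tflite-micro, single CPU thread
tflite_ms = {"MAXN": 0.98, "5W": 2.42}       # tflite python API with the GPU

for mode, ms in tflite_ms.items():
    print(f"Jetson nano ({mode}): {speed_ratio(micro_ms, ms):.2f}x")
# prints:
# Jetson nano (MAXN): 9.53x
# Jetson nano (5W): 3.86x
```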

What I would keep from this benchmark is that tflite-micro is easy to build and use on any architecture, but there’s a performance impact. This impact is much larger if the SoC is multi-core or has a GPU or any other accelerator, which the tflite API can use but tflite-micro can’t.

It’s up to you to decide if the performance of tflite-micro is enough for your model (depending on the MACC). If it is, then you can just use tflite-micro instead. Of course, don’t expect to run inferences on real-time video, but for real-time audio it should probably be enough.

Anyway, with tflite-micro it’s very easy to test and evaluate for your model on your SBC.

Have fun!

Controlling a 3D object in Unity3D with teensy and MPU-6050


Have a look at this image.

What does it look like? Is it the new Star Wars? Nope. It’s a 3D real time rendering from a “game” that I’ve built just to test my new stupid project with the MPU6050 motion tracking sensor. Cool, isn’t it? Before I get there let me do a short introduction.

The reason I’ve enjoyed this project so much is that I started it without knowing anything about 3D game engines, skyboxes and quaternions, and after 1.5 days I got this thing up and running! I don’t say this to praise myself. On the contrary, I mention this to praise the current state of free and open source software. Therefore, I’ll use some space here to praise the people that contribute to FOSS and make it easy and fast for others to experiment and prototype.

I don’t know for how long you’ve been engineers, hobbyists or anything related and for how long. But from my experience, trying to make something similar 15 or 10 years ago (and also on a Linux machine), would be really hard and time spending procedure. Of course, I don’t mean getting the same 3D results, I mean a result relative to that era. Maybe I would spend several months, because I would have to do almost everything by myself. Now though, I’ve only wrote a few hundred lines of glue code and spend some time on YouTube and surfing github and some other sources around the web. Nowadays, there are so many things that are free/open and available that is almost impossible not to find what you’re looking for. I’m not only talking about source code, but also tools, documentation, finished projects and resources (even open hardware). There are so many people that provide their knowledge and the outcome of their work/research nowadays, that you can find almost everything. You can learn almost everything from the internet, too. OK, probably it’s quite difficult to become a nuclear physicist using only online sources, but you can definitely learn anything related to programming and other engineering domains. It doesn’t matter why people decide to make their knowledge available publicly, all it matters is that it’s out there, available for everyone to use it.

And thanks to those people, I’ve managed to install Unity3D on my Ubuntu, learn how to use it to make what I needed, found a free to use 3D model for the Millennium Falcon, used Blender on Linux to edit the 3D model, found a tool to create a Skybox that resembles the universe, found an HID example for the teensy 3.2, the source code for the MPU6050 and how to calibrate it and finally dozens of posts with information about all those things. Things like how to use lens flares, mesh colliders to control flare visibility on cameras with flare layers, event system for GUI interactions and several other stuff that I wasn’t even aware of before, everything explained from other people in forums in a way that it’s far easier to read from any available documentation. Tons of information.

Then I just put all the pieces together and wrote the glue code and this is the result.

(Excuse the bad video quality, but I’ve used my terrible smartphone camera.)

It is a great time to be an engineer or a hobbyist, with all those tools and information available at your fingertips. All you need is just some time for playing and making stupid projects 😉

All the source code and files for this project are available here:

Note: This post is not targeting 3D graphics developers in any way. It’s meant mostly for embedded engineers or hobbyists.

So, let’s dive into the project…


To make the project I’ve used various software tools and only two hardware components. Let’s see the components.

Teensy 3.2

You’re not limited to the Teensy 3.2; you can use any Teensy that supports the RawHID lib. The Teensy 3.2 is based on the NXP MK20DX256VLH7, which has a Cortex-M4 core running at 72 MHz that can easily be overclocked up to 96 MHz using the Arduino IDE. It has various peripherals and the pinout exposes everything you need to build various interesting projects. For this project I’ve only used the USB and I2C. Teensy is not the cheapest MCU board out there as it costs around $20, but it comes with great support and it’s great for prototyping.


According to TDK (which manufactures the MPU-6050) this is a Six-Axis (Gyro + Accelerometer) MEMS MotionTracking Device which has an onboard Digital Motion Processor (DMP). According to the TDK web page I should use a ™ on every word less than 4 characters in the previous sentence. Anyway, to put it simply, it’s a package that contains a 3-axis gyro, a 3-axis accelerometer and a special processing unit that performs some complex calculations on the sensor’s data. You can find small boards with the sensor on ebay that cost ~1.5 EUR, which is dirt cheap. The sensor is 3V3 and 5V capable, which makes it easy to use with a very wide range of development boards and kits.

Connecting the hardware

The hardware connections are easy as both the Teensy and the mpu-6050 are voltage level compatible. The following table shows the connections you need to make.

Teensy 3.2 (pin)    MPU-6050
18                  SDA
19                  SCL
23                  INT

That’s it, you’re done. Now all you have to do, is to connect Teensy 3.2 to your workstation, calibrate the sensor and then build the firmware and flash it.

Calibrating the MPU-6050

Not all MPUs are made the same. Since it’s quite a complex device, both the gyro and the accelerometer (accel) have tolerances. Those tolerances affect the readings you get. For example, if you place the sensor on a perfectly flat surface, the readings will indicate that it’s not flat, which means that you’re reading the tolerances (offsets). Therefore, first you need to place the sensor on a flat surface and then use the readings to calibrate the sensor and minimize those offsets. Because every chip has different tolerances, you need to do this for every sensor; you can’t calibrate a single sensor once and then re-use the same values for others (even if they are from the same production batch). The sensor lets you upload those offset values to it through some I2C registers, so that it applies them in its own calculations, which in the end offloads the external CPU.
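To make the idea of "reading the tolerances" a bit more concrete, here is a minimal Python sketch that averages a few raw samples and compares them with the ideal readings of a sensor lying flat. It assumes the ±2g accelerometer range, where 1 g corresponds to 16384 LSB on the Z axis; the sample values and the `residual_errors` helper are made up for illustration and are not part of the repo. It only shows the measured error per axis, not how the calibration firmware turns that error into register values.

```python
from statistics import mean

# Ideal readings for a sensor lying flat. Assumption: +/-2g accel range,
# so gravity shows up as 16384 LSB on the Z axis and everything else is 0.
EXPECTED = {"ax": 0, "ay": 0, "az": 16384, "gx": 0, "gy": 0, "gz": 0}


def residual_errors(samples):
    """Return mean(reading) - expected for every axis."""
    return {
        axis: mean(sample[axis] for sample in samples) - target
        for axis, target in EXPECTED.items()
    }


# Hypothetical raw samples, not real measurements.
samples = [
    {"ax": -10, "ay": -4, "az": 16378, "gx": 1, "gy": 0, "gz": -1},
    {"ax": -4, "ay": -6, "az": 16382, "gx": -1, "gy": 2, "gz": 1},
]

errors = residual_errors(samples)
print(errors)
```

The closer every value in `errors` is to zero, the better the offsets that are currently loaded in the sensor.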

Normally, if you need maximum accuracy during calibration, then you need an inclinometer in order to place your sensor completely flat, and a vibration-free base. You can find inclinometers on ebay or amazon, from 7 EUR up to thousands of EUR. Of course, you get what you pay for. Keep in mind that an inclinometer is just a tilt sensor, but one that is calibrated in production. A cheap inclinometer may suck in many different ways, e.g. maybe it’s not even calibrated, or the calibration reference is not calibrated, or the tilt sensor itself is crap. Anyway, for this project you don’t really need one.

For now just place the sensor on a surface that seems flat. Also, because you have probably already soldered the pin-headers, try to keep the sensor level with the surface. We really don’t need accuracy here, just an approximation, so make your eye an inclinometer.

Another important thing to know is that when you power up the sensor, the orientation is zeroed at the current orientation. That means that if the sensor is pointing south, this direction will always be the starting point. This is important when you connect the sensor to the 3D object: you need to put the sensor flat and pointing at that object on your screen and then power it on (connect the USB cable to Teensy).

Note: Before you start the calibration you need to leave the module powered on for a few minutes in order for the temperature to stabilize. This is very important, don’t skip this step.

OK, so now you should have your sensor connected to Teensy and on a flat(-ish) surface. Now you need to flash the calibration firmware. I’ve included two calibration source codes in the repo. The easiest one to use is in `mpu6050-calibration/mpu6050-raw-calibration/mpu6050-raw-calibration.ino`. I’ve got this source code from here.

Note: In order to be able to build the firmware in the Arduino IDE, you need to add this library here. The Arduino library in this repo covers both the MPU-6050 and I2Cdev, which is needed by all the source code. Just copy the I2Cdev and MPU6050 folders from this folder into your Arduino library folder.

When you build and upload `mpu6050-raw-calibration.ino` to the Teensy, you also need to open the Serial Monitor in the Arduino IDE. When you do that, you’ll get this prompt repeatedly:

Send any character to start calibrating...

Press Enter in the text box of the Serial Monitor and the calibration will start. In my case there were a few iterations and then I got the calibration values in the output:

            ax	ay	az	gx	gy	gz
average values:		-7	-5	16380	0	1	0
calibration offsets:	1471	-3445	1355	-44	26	26

MPU-6050 is calibrated.

Use these calibration offsets in your code:

Now copy-paste the above code block in to your
'teensy-motion-hid/teensy-motion-hid.ino' file
in function setCalibrationValues().

As the message says, now just open the `teensy-motion-hid/teensy-motion-hid.ino` file and copy the mpu.set* function calls into the setCalibrationValues() function.

Advanced calibration

If you want to see a few more details regarding calibration and an example on how to use a PID controller for calibrating the sensor and then use a jupyter notebook to analyze the results, then continue here. Otherwise, you can skip this section as it’s not really needed.

In order to calculate the calibration offsets you can use a PID controller. For those who don’t know what a PID controller is, you’ll have to see this first (or if you know how negative feedback on op-amps works, then think of it as quite the same thing). Generally, it is a control feedback loop that is used a lot in control systems (e.g. HVAC for room temperature, elevators, etc.).
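If you’ve never seen a PID loop in code, the following Python sketch may help. It is a textbook discrete PID, not the firmware’s implementation: the gains, the simulated bias of -3500 and the `simulate` helper are assumptions chosen only to show how the controller output (the offset) keeps changing until the reading reaches the target.

```python
class PID:
    """Textbook discrete PID controller (illustrative, not the firmware code)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, error, dt=1.0):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(bias=-3500.0, target=0.0, steps=200):
    """Drive a simulated axis reading towards target by adjusting an offset."""
    pid = PID(kp=0.2, ki=0.5, kd=0.0)  # hypothetical gains
    offset = 0.0
    reading = bias
    for _ in range(steps):
        reading = bias + offset      # simulated sensor: fixed bias plus offset
        error = target - reading     # how far we are from the target
        offset = pid.update(error)   # controller output becomes the new offset
    return offset, reading


offset, reading = simulate()
print(round(offset, 3), round(reading, 3))
```

In this simulation the integral term is what ends up holding the offset once the error has gone to zero; a real sensor adds noise on top, so the error never settles completely.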

Anyway, in order to calibrate the sensor using a PID controller, then you need to build and upload the `mpu6050-calibration/mpu6050-pid-calibration/mpu6050-pid-calibration.ino` source code. I won’t get in to the details of the code, but the important thing is that this code uses 6 different PID controllers, one for each offset you want to calculate (3 for the accel. axes and 3 for the gyro axes). This source code is a modification I’ve made of this repo here.

Again, you need to leave the sensor powered on for a few minutes before performing the calibration, and set it on a flat surface. When the firmware starts, it will spit out numbers on the serial monitor. Here’s an example:


And this goes on forever. Each line is a comma separated list of values and the meaning of those values from left to right is:

  • Average Acc X value
  • Average Acc X offset
  • Average Acc Y value
  • Average Acc Y offset
  • Average Acc Z value
  • Average Acc Z offset
  • Average Gyro X value
  • Average Gyro X offset
  • Average Gyro Y value
  • Average Gyro Y offset
  • Average Gyro Z value
  • Average Gyro Z offset

Now, all you need to do is to let this run for a while (30-60 seconds) and then copy all the output from the serial monitor to a file named calibration_data.txt. The file actually already exists in the `/rnd/bitbucket/teensy-hid-with-unity3d/mpu6050-calibration` folder and it contains the values from my sensor, so you can use those to experiment or erase the file and put yours in its place. Also, be aware that when you copy the output from the serial monitor to the txt file, you need to delete any empty line at the end for the notebook scripts to work, otherwise you’ll get an error in the jupyter notebook.
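As a rough illustration of the log format described above, this Python sketch splits each comma separated line into the 12 named fields and ignores blank lines, so a stray empty line at the end is harmless here. The field names and the sample text are my own placeholders, not real sensor output and not the notebook’s code.

```python
# Column order as listed above: value/offset pairs for each axis.
FIELDS = [
    "acc_x_value", "acc_x_offset",
    "acc_y_value", "acc_y_offset",
    "acc_z_value", "acc_z_offset",
    "gyro_x_value", "gyro_x_offset",
    "gyro_y_value", "gyro_y_offset",
    "gyro_z_value", "gyro_z_offset",
]


def parse_line(line):
    """Turn one comma separated line into a {field: float} dict."""
    values = [float(v) for v in line.strip().split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} values, got {len(values)}")
    return dict(zip(FIELDS, values))


def parse_log(text):
    """Parse a whole log, skipping empty lines."""
    return [parse_line(line) for line in text.splitlines() if line.strip()]


# Hypothetical log text (note the empty lines at the end).
sample_text = (
    "-3520,0,120,0,16000,0,-40,0,20,0,25,0\n"
    "-1800,700,60,-30,16200,90,-20,10,10,-5,12,-6\n"
    "\n\n"
)

rows = parse_log(sample_text)
print(len(rows), rows[0]["acc_x_value"])
```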

Note: while you’re running the calibration firmware you need to be sure that there are no vibrations on the surface. For example, if you put this on your desk then be sure that there are no vibrations from you or any other equipment you may have running on the desk.

As I explain quite thoroughly in the notebook how to use it, I’ll keep it simple here. Also, from this point on I assume that you’ve read the jupyter notebook in the repo here.

You can use the notebook to visualize the PID controller output and also calculate the values to use for your sensor’s offsets. It’s interesting to see some plots here. As I mention in the notebook, you can use data_start_offset and data_end_offset to plot different subsets of data for each axis.

This is the plot when data_start_offset=0 and data_end_offset=20.


So, as you can see in the first 20 samples, the PID controller kicks in and tries to correct the error and as the error in the beginning is significant, you get this slope. See in the first 15 samples the error for the Acc X axis is corrected from more than -3500 to near zero. For the gyro things are a bit different, as it’s more sensitive and fluctuates a bit more. Let’s see the same plots with data_start_offset=20 and data_end_offset=120.

In the above images, I’ve started from sample 20 in order to remove the steep slope of the first PID controller corrections. Now you can see that the data coming from the accel. and gyro axes fluctuate quite a lot and the PID tries to correct this error on every iteration. Of course, you’ll never get a zero error, as the sensor is quite sensitive and there’s also thermal and electronic noise, plus vibrations that you can’t sense but the sensor does. So, what you do in such cases is use the average value for each axis. Be careful, though. You don’t want to include the first samples in the average calculation for each axis, because that would give you values that are way off. As you can see in the notebook here, I’m using skip_first_n_data to skip the first 100 samples and then calculate the average from the remaining values.
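The averaging step can be sketched in a few lines of Python. The skip_first_n_data name is borrowed from the notebook, but the function itself and the synthetic data (a ramp followed by values fluctuating around 1471) are only assumptions to show why the first samples must be dropped.

```python
def average_offsets(rows, skip_first_n_data=100):
    """Average each column of rows after dropping the first samples."""
    kept = rows[skip_first_n_data:]
    if not kept:
        raise ValueError("not enough samples left after skipping")
    n_axes = len(kept[0])
    return [sum(row[i] for row in kept) / len(kept) for i in range(n_axes)]


# Synthetic single-axis data: 100 samples of the initial PID ramp,
# then 200 samples that fluctuate by +/-3 around 1471.
ramp = [[i * 14.0] for i in range(100)]
settled = [[1471.0 + (3.0 if i % 2 == 0 else -3.0)] for i in range(200)]
data = ramp + settled

with_skip = average_offsets(data, skip_first_n_data=100)[0]
without_skip = average_offsets(data, skip_first_n_data=0)[0]
print(with_skip, without_skip)
```

With the ramp included, the average is pulled far below the value the controller actually settled on, which is exactly the "way off" result mentioned above.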

Finally, you can use the calculated values from the “Source code values” section and copy those into the firmware. You can use whatever method you like to calibrate the sensor, just be aware that if you try both methods you won’t get the same values! Even if you run the same test twice you won’t get the exact same values, but they should be similar.

HID Manager

In the hid_manager/ folder you’ll find the source code of a tool I’ve written and named hid_manager. Let me explain what this does and why it is needed.

The HID manager is the software that receives the HID raw messages from Teensy and then copies the data into a buffer that is shared with Unity. Note that this code works only on Linux. If you want to use the project on Windows then you need to port this code; it’s actually the only part of the project that is OS specific.

So, why use this HID manager? The problem with Unity3D and most similar software is that although they have some kind of input API, this is often quite complicated. I mean, I had a look at the API and tried to use it, but I quickly realized that it would take too much effort and time to create my own custom input controller for Unity and then use it in there. So, I decided to make it quick and dirty. In this case, though, I would say that quick and dirty is not a bad thing (except that it’s OS specific). Therefore, what is the easiest and fastest way to import real-time data into another piece of software that runs on the same host? Using some kind of inter-process communication, of course. In this case, the easiest way to do that was to use the Linux /tmp folder, which is often mounted in the system’s RAM (tmpfs), create a file buffer in /tmp and share this buffer between the Unity3D game application and the HID manager.

To do that I’ve written a script in hid_manager/, which makes sure to create this buffer and reserve 64 bytes for it. The USB HID packets I’m using are 64 bytes long, so we don’t need more than that. Of course, I’m not using all the bytes, but it’s good to have more room than the exact number you need. For example, the first test was to send only the Euler angles from the sensor, but then I realized that I was being affected by the gimbal lock effect, so I also had to add the quaternion values, which I was getting from the sensor anyway (I’ll come back to those later). So, having more space available is always nice, and in this case the USB packet buffer is always 64 bytes, so you get them for free. The problem arises when you need more than 64 bytes; then you need to use some data serialization and packaging.

Also, note that the endianness is the same in both teensy-motion-hid/teensy-motion-hid.ino and hid_manager/hid_manager.c, which makes things a bit easier and faster.

In order to build the code, just run make inside the folder. Then you first need to flash the Teensy with the teensy-motion-hid.ino firmware and then run the manager using the script.
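The same shared-file idea can be sketched in Python with the standard mmap module. This is not the hid_manager code (that one is written in C); the helper names and the demo file name are hypothetical, and the demo uses the system temp folder so that it doesn’t touch the real buffer.

```python
import mmap
import os
import tempfile

BUF_SIZE = 64  # one raw HID packet


def create_buffer(path):
    """Create (or reset) a zero-filled 64-byte file to be shared."""
    with open(path, "wb") as f:
        f.write(bytes(BUF_SIZE))


def write_packet(path, packet):
    """Producer side: copy one 64-byte packet into the shared file."""
    if len(packet) != BUF_SIZE:
        raise ValueError("packet must be exactly 64 bytes")
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), BUF_SIZE, access=mmap.ACCESS_WRITE) as mm:
            mm[:BUF_SIZE] = packet
            mm.flush()


def read_packet(path):
    """Consumer side: read the latest 64 bytes from the shared file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), BUF_SIZE, access=mmap.ACCESS_READ) as mm:
            return bytes(mm[:BUF_SIZE])


# Demo with a hypothetical file name in the temp folder.
demo_path = os.path.join(tempfile.gettempdir(), "hid-shared-buffer-demo")
create_buffer(demo_path)
write_packet(demo_path, b"\xAB\xCD" + bytes(62))
received = read_packet(demo_path)
os.remove(demo_path)
print(received[:2].hex())
```

Note that nothing here synchronizes the writer and the reader, so a reader could in principle see a half-updated packet. For a demo like this one that trade-off keeps things simple.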


If you try to run the script before the Teensy is programmed, then you’ll get an error as the HID device won’t be found attached and enumerated on your system.

Note: if the HID manager is not running then on the serial monitor output you’ll get this message

Unable to transmit packet

The HID manager supports two modes. In the default mode it just copies the incoming data from the HID device to the buffer. The second one is the debug mode. In debug mode, it also prints the 64 bytes that it gets from the Teensy. To run the HID manager in debug mode, run this command.

DEBUG=1 ./

By running the above command you’ll get an output in your console similar to this:

$ DEBUG=1 ./ 
    This script starts the HID manager. You need to connect
    the teensy via USB to your workstation and flash the proper
    firmware that works with this tool.
    More info here:
Controlling a 3D object in Unity3D with teensy and MPU-6050
Usage for no debugging mode:
$ ./
Usage for debugging mode (it prints the HID incoming data):
$ DEBUG=1 ./

Debug ouput is set to: 1
Starting HID manager...
Denug mode enabled...
Open shared memfile
found rawhid device
recv 64 bytes:
AB CD A9 D5 BD 37 4C 2B E5 BB 0B D3 BD BE 00 FC 7F 3F 00 00 54 3B 00 00 80 38 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 F9 29 00 00 01 00 00 00 E1 94 FF 1F
recv 64 bytes:
AB CD 9D 38 E5 3B 28 E3 AB 37 B6 EA AB BE 00 FC 7F 3F 00 00 40 3B 00 00 00 00 00 00 80 B8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 F9 29 00 00 01 00 00 00 E1 94 FF 1F
...

The 0xAB 0xCD bytes are just a preamble.

Note: I haven’t added any checks on the data, like checking the preamble or having checksums, etc., as there wasn’t a reason for this. Otherwise I would consider at least a naive checksum like XOR’ing the bytes, which is fast.
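For reference, a naive check like the one mentioned in the note could look like the following Python sketch. The frame layout (preamble, payload, zero padding and a checksum in the last byte) is purely hypothetical; the actual firmware doesn’t reserve a checksum byte.

```python
from functools import reduce
from operator import xor

PREAMBLE = b"\xAB\xCD"
FRAME_SIZE = 64


def xor_checksum(data):
    """XOR all bytes together; cheap, but only catches simple corruption."""
    return reduce(xor, data, 0)


def build_frame(payload):
    """Hypothetical layout: preamble + payload + padding + 1 checksum byte."""
    room = FRAME_SIZE - len(PREAMBLE) - 1
    if len(payload) > room:
        raise ValueError("payload too long for one frame")
    body = PREAMBLE + payload + bytes(room - len(payload))
    return body + bytes([xor_checksum(body)])


def is_valid(frame):
    """Check the size, the preamble and the trailing checksum."""
    return (
        len(frame) == FRAME_SIZE
        and frame[:2] == PREAMBLE
        and xor_checksum(frame[:-1]) == frame[-1]
    )


frame = build_frame(b"\x01\x02\x03")
corrupted = frame[:2] + bytes([frame[2] ^ 0x10]) + frame[3:]
print(is_valid(frame), is_valid(corrupted))
```

An XOR checksum misses errors that cancel each other out (e.g. the same bit flipped in two bytes), so it’s a sanity check rather than real protection.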

In the next video you can see on the left top side the quaternion data output from the Teensy and on the left bottom the raw hid data in the hid manager while it’s running in debug mode.

Of course, printing the data on both ends adds significant latency in model motion.

Teensy 3.2 firmware

The firmware for Teensy is located in teensy-motion-hid/teensy-motion-hid.ino. There’s not much to say here, just open the file with the Arduino IDE and then build and upload the firmware.

The important part of the code is these lines:

        mpu.dmpGetQuaternion(&q, fifoBuffer);

        un_hid_payload pl;
        pl.packet.preamble[0] = 0xAB;
        pl.packet.preamble[1] = 0xCD;

        mpu.dmpGetEuler(euler, &q);
        pl.packet.x.number = euler[0] * 180 / M_PI;
        pl.packet.y.number = euler[1] * 180 / M_PI;
        pl.packet.z.number = euler[2] * 180 / M_PI;
        pl.packet.qw.number = q.w;
        pl.packet.qx.number = q.x;
        pl.packet.qy.number = q.y;
        pl.packet.qz.number = q.z;

If ADD_EULER_TO_HID is enabled, then the Euler angles will also be added to the HID packet, but this might add a bit more latency.

Now that the data are copied from Teensy to a shared memory buffer in /tmp, you can use Unity3D to read those data and use them in your game. Before proceeding with the Unity3D section, though, let’s open a short parenthesis on the kind of data you get from the sensor.
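To see what actually travels in the packet, here is a small Python sketch that decodes the first packet from the debug output shown earlier. The layout is an inference, not something taken from the firmware headers: a 2-byte preamble followed by seven little-endian 32-bit floats (x, y, z, qw, qx, qy, qz) with no padding, which matches the field order in the code above and the fact that the Unity script reads x at byte offset 2.

```python
import struct

# Assumed layout: little-endian, packed, 2-byte preamble + 7 floats.
PACKET_FORMAT = "<2s7f"
FIELD_NAMES = ("preamble", "x", "y", "z", "qw", "qx", "qy", "qz")


def decode_packet(raw):
    """Decode the first 30 bytes of a raw HID packet into named fields."""
    if len(raw) < struct.calcsize(PACKET_FORMAT):
        raise ValueError("packet too short")
    return dict(zip(FIELD_NAMES, struct.unpack_from(PACKET_FORMAT, raw)))


# First packet from the HID manager debug output.
SAMPLE_HEX = (
    "AB CD A9 D5 BD 37 4C 2B E5 BB 0B D3 BD BE 00 FC "
    "7F 3F 00 00 54 3B 00 00 80 38 00 00 00 00 00 00 "
    "00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "
    "00 00 00 00 F9 29 00 00 01 00 00 00 E1 94 FF 1F"
)

packet = decode_packet(bytes.fromhex(SAMPLE_HEX))
for name in FIELD_NAMES:
    print(name, packet[name])
```

With this layout qw decodes to roughly 0.9999 while qx, qy and qz are close to zero, i.e. almost the identity rotation, which is what you’d expect from a sensor resting flat right after power-up.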

Sensor data

As I’ve mentioned, the sensor does all the hard work and maths to calculate the Euler and the quaternion values from the 6 axes values in real-time (which is a great acceleration). But what are those values, why do we need them and why do I prefer to use only the quaternion? Usually I prefer to give just a quick explanation and let the pros explain it better than me, so I’ll do the same now.

The Euler angles are just the angles of rotation around each axis in 3D space. In air navigation those angles are known as roll, pitch and yaw, and by knowing those angles you know your object’s rotation. You can also see this video, which explains this better than I do. There’s one problem with Euler angles: if two of the 3 axes are driven into a parallel configuration then you lose one degree of freedom (this is the gimbal lock). This video explains it in more detail.

As I’ve already mentioned, the sensor calculates the quaternion values. Quaternions are much more difficult to explain, as a quaternion is a 4-D number and anything more than 3-D is difficult to visualize and explain. I’ll avoid explaining this myself and just post this link here, which explains quaternions and tries to represent them in 3D space. The important things you need to know are that the quaternion doesn’t suffer from gimbal lock, that it’s supported in Unity3D and that it’s supposed to make calculations faster for the CPU/GPU compared to vector calculations.
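To connect the two representations, here is a Python sketch that converts a unit quaternion to roll, pitch and yaw using the common aerospace (Z-Y-X) convention. This is the textbook conversion and not necessarily the one the MPU6050 library uses in dmpGetEuler(), so treat the function as an illustration only.

```python
import math


def quaternion_to_euler(w, x, y, z):
    """Return (roll, pitch, yaw) in degrees for a unit quaternion."""
    roll = math.atan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Clamp to [-1, 1] so rounding errors can't push asin() out of its domain.
    sin_pitch = max(-1.0, min(1.0, 2.0 * (w * y - z * x)))
    pitch = math.asin(sin_pitch)
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return tuple(math.degrees(angle) for angle in (roll, pitch, yaw))


half = math.sqrt(0.5)
identity = quaternion_to_euler(1.0, 0.0, 0.0, 0.0)    # no rotation
yaw_90 = quaternion_to_euler(half, 0.0, 0.0, half)    # 90 degrees around Z
nose_up = quaternion_to_euler(half, 0.0, half, 0.0)   # 90 degrees around Y
print(identity, yaw_90, nose_up)
```

The last case is the interesting one: at a pitch of 90 degrees the roll and yaw values are no longer independent, which is the gimbal lock described above. The quaternion itself has no such special case.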

Unity3D project

For this project I wanted to interact with a 3D object on the screen using the mpu-6050. Then I remembered that I’d seen a video about Unity3D which seemed nice, so when I saw that there was a Linux build (but not officially supported), I thought I’d give it a try. When I started the project I knew nothing about this software, but for doing simple things it seems nice. I had quite a few difficulties, but with some google-fu I’ve fixed all the issues.

Installing Unity3D on Ubuntu is not pain free, but it’s not that hard either and when you succeed, it works great. I’ve downloaded the installer from here (always see the last post, which has the latest version) and to install Unity Hub I’ve followed this guide here. Unity3D is not open source, but it’s free for personal use and you need to create an account in order to get your free license. I guess I could have used an open 3D game engine, but since it was free and I wanted it for personal use, I went for that. In order to install the same versions that I’ve used, run the following commands:

# install dependencies
sudo apt install libgtk2.0-0 libsoup2.4-1 libarchive13 libpng16-16 libgconf-2-4 lib32stdc++6 libcanberra-gtk-module

# install unity
chmod +x UnitySetup-2019.1.0f2

# install unity hub
chmod +x UnityHubSetup.AppImage

When you open the project, you’ll find some files in the bottom tab. The one that’s interesting for you is HID_Controller.cs. In the repo this file is located here: Unity3D-project/gyro-acc-test/Assets/HID_Controller.cs. In this file the important bits are the MemoryMappedFile object, which is instantiated in the Start() function, opens the shared file in /tmp and reads the mpu6050 data, and the Update() function.

void Start()
{
    hid_buf = new byte[64];

    // transform.Rotate(30.53f, -5.86f, -6.98f);
    Debug.developerConsoleVisible = true;
    Debug.Log("Unity3D demo started...");

    // Map the shared file that the HID manager writes to
    mmf = MemoryMappedFile.CreateFromFile("/tmp/hid-shared-buffer", FileMode.OpenOrCreate, "/tmp/hid-shared-buffer");
    // Dummy read to verify that the stream works
    using (var stream = mmf.CreateViewStream()) {
        var data = stream.Read(hid_buf, 0, 64);
        if (data > 0) {
            Debug.Log("Data in: " + data);
            float hid_x = System.BitConverter.ToSingle(hid_buf, 2);
            Debug.Log("x: " + hid_x);
        }
    }
}

// Update is called once per frame
void Update()
{
    using (var stream = mmf.CreateViewStream()) {
        var data = stream.Read(hid_buf, 0, 64);
        if (data > 0) {
            float qw = System.BitConverter.ToSingle(hid_buf, (int) DataIndex.qW);
            float qx = System.BitConverter.ToSingle(hid_buf, (int) DataIndex.qX);
            float qy = System.BitConverter.ToSingle(hid_buf, (int) DataIndex.qY);
            float qz = System.BitConverter.ToSingle(hid_buf, (int) DataIndex.qZ);
            // Remap the sensor axes to Unity's axes
            transform.rotation = new Quaternion(-qy, -qz, qx, qw);
        }
    }
}

As you can see in the Start() function, the mmf MemoryMappedFile is created and attached to /tmp/hid-shared-buffer. Then there’s a dummy read from the file to make sure that the stream works, and a debug message is printed. This code runs only once, when the HID_Controller class is created.

In the Update() function the code creates a stream connected to the memory mapped file, reads the data, parses the proper float values from the buffer and finally creates a Quaternion object with the 4D values and updates the object rotation.

You’ll also notice that the values in the quaternion are not in the (x,y,z,w) order, but (-y,-z,x,w). This is weird, right? But this happens for a couple of reasons that I’ll try to explain. On page 40 of this PDF datasheet you’ll find this image.

These are the X, Y, Z axes on the chip. Notice also the signs: they are positive in the direction shown and negative in the opposite direction. The dot on the top corner indicates where pin 1 is located on the plastic package. The next image is the stick I’ve made with the sensor board attached to it, on which I’ve drawn the dot position and the axes.

Therefore, you see that the X and Y axes are swapped (compared to the PDF image), so the quaternion from (x,y,z,w) becomes (y,x,z,w). But wait… in the code it is (-y,-z,x,w)! Well, that troubled me too for a moment, then I read this in the documentation: “Most of the standard scripts in Unity assume that the y-axis represents up in your 3D world.” That means that you also need to swap Y and Z, but because X now sits in the place of Y, you replace X with Z, so the quaternion from (y,x,z,w) becomes (y,z,x,w). But wait! What about the “-” signs? Well, if you look again at the first image, it shows the sign for each axis. The way you hold the stick, compared to the moving object on the screen, reverses the rotation for the y and z axes, so the quaternion becomes (-y,-z,x,w). Well, I don’t know anything about 3D graphics, Unity and quaternions, but at least the above explanation makes sense and it works, so… this must be it.
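The remapping itself is tiny, so it’s easy to sanity check outside Unity. The Python sketch below mirrors the argument order used in the Update() code; the `sensor_to_unity` name is my own and the mapping is specific to how the sensor is mounted on my stick, so a different mounting would most likely need a different permutation and signs.

```python
def sensor_to_unity(x, y, z, w):
    """Map a sensor quaternion (x, y, z, w) to Unity's (x, y, z, w) order.

    Mirrors `new Quaternion(-qy, -qz, qx, qw)` from the Update() code.
    """
    return (-y, -z, x, w)


# Easy-to-follow dummy components (not a unit quaternion).
remapped = sensor_to_unity(0.1, 0.2, 0.3, 0.9)
print(remapped)
```

Swapping components and flipping signs doesn’t change the length of the quaternion, so a unit quaternion from the sensor is still a unit quaternion after the remap.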

I’ve found the Millennium Falcon 3D model here and it’s free to use (except that any Star Wars related stuff is trademarked and you can’t really use it for any professional use). This is what I meant in the intro: all the software I’ve used until now was free or open. So A. Meerow, who built this model, made this 3D mesh in his free time and gave it away for free, and I was able to save the dozens of hours it would take to make the same thing. Thanks mate.

I had a few issues with the 3D model, though, when I imported it in Unity. One issue was that there was a significant offset on one of the axes, another was that, because of the previous thing I’ve mentioned, I had to export the model with the Y and Z axes swapped, and finally the texture wasn’t applied properly when importing the .obj file, so I had to import the model in the .fbx format. To fix those things I’ve downloaded and used Blender. It was also the first time I’d used Blender, but it was quite straightforward to use and fix those issues.

Blender is a free and open source 3D creation suite and I have to say that it looks like a beautiful and very professional tool. There are so many options, buttons and menus that it’s clear that this is a professional grade tool. And it’s free! Amazing.

Anyway, after I’d fixed those things and exported the model to the .fbx format, I wanted to change the default Skybox in Unity3D and I wanted something that looks like deep space. So I’ve found another awesome free and open tool, which is named Spacescape and creates spaceboxes with stars and nebulas using algorithms, and it also has a ton of options to play with. The funny thing was that I tried to build it on my Ubuntu 18.04, but I had a lot of issues as it’s based on quite an old Qt version and also needs some dependencies that also failed. Anyway, I’ve downloaded the Windows executable and it worked fine with Wine (version 3.0). This is a screenshot of the tool running on my Ubuntu.

These are the default options and I’ve actually used them as the result was great.

Finally, I’ve just added some lights, a lens flare and a camera rotation in the Unity3D scene and it was done.

Play the game demo

In order to “play” the game (yeah I know it’s not a game, it’s just a moving 3D object in a scene), you need to load the project from the Unity3D-project/gyro-acc-test/ folder. Then you just build it by pressing Ctrl+B, which creates an executable file named “teensy-wars.x86_64” and also launches the game. Once the project is built (see also File >> Build Settings), you can just launch the teensy-wars.x86_64 executable directly.

Make sure that before you do this, you’re running the hid_manager in the background and that you’ve flashed Teensy with the teensy-motion-hid/teensy-motion-hid.ino firmware and the mpu-6050 sensor is connected properly.


I’m amazed by this project for many reasons. It took me 1.5 days to complete it. Now that I’m writing these lines, I’m thinking that I’ve spent more time documenting this project and writing the article than implementing it. The reason for this is the power of open source, the available free software (free to use or open), the tons of available information in documents, manuals and forums and, finally and most importantly, the fact that people are willing to share their work and know-how. Of course, open source is not new to me, I’ve been doing this myself for years, but I was amazed that I was able to build something like this in such a short time without having used any of those tools before. Prototyping has become so easy nowadays. It’s really amazing.

Anyway, I’ve really enjoyed this project, and I’ve enjoyed even more realizing the power that everyone has in their hands nowadays. It’s really easy to build amazing stuff very fast and get good results. Of course, I need to mention that this can’t replace domain expertise in any way. But it’s still nice that engineers from other domains can jump into an unknown domain, make something quick and dirty and get some idea of how things work.

Have fun!

Machine Learning on Embedded (Part 5)

Note: This post is the fifth in the series. Here you can find part 1, part 2, part 3 and part 4.


In the previous post here, I’ve used x-cube-ai with the STM32F746 to run a tflite model and benchmark the inference performance. In that post I’ve found that x-cube-ai is ~12.5x faster compared to TensorFlow Lite for microcontrollers (tflite-micro) when running on the same MCU. Generally, the first 4 posts were focused on running the model inference on the edge, which means running the inference on the embedded MCU. This actually is the most important thing nowadays, as being able to run inferences on the edge on small MCUs means lower power consumption and, more importantly, no reliance on the cloud. What is the cloud in this context? It means that there is an inference accelerator in the cloud or, in layman’s terms, that the inference is running on a remote server somewhere on the internet.

One thing to note is that the only reason I’m using the MNIST model is for benchmarking and consistency with the previous posts. There isn’t any real reason to use this model in a scenario like this. The important thing here is not the model, but the model’s complexity. So any model with the kind of complexity that matches your use-case scenario can be used. But as I’ve said, since I’ve used this model in the previous posts, I’ll use it here, too.

So, what are the benefits of running the inference on the cloud?

Well, that depends. There are many parameters that define a decision like this. I’ll try to list a few of them.

  • It might be faster to run an inference on the cloud (that depends also on other parameters though).
  • The MCU that you already have (or you must use) is not capable of running the inference itself using e.g. tflite-micro or another API.
  • There is a robust network
  • Running the inference on the cloud (including the network transactions) takes less time than running it on the edge (=on the device)
  • If the target device runs on battery it may be more energy efficient to use a cloud accelerator
  • It’s possible to re-train your model and update the cloud without having to update the clients (as long the input and output tensors are not changed).
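The timing argument in the list above boils down to a simple comparison, sketched here in Python. The function name and all the millisecond values are hypothetical placeholders, not measurements from this post.

```python
def cloud_is_faster(edge_ms, cloud_inference_ms, network_round_trip_ms):
    """True if cloud inference plus network time beats on-device inference."""
    return cloud_inference_ms + network_round_trip_ms < edge_ms


# Hypothetical numbers: a slow MCU versus a fast server over a local network.
slow_mcu = cloud_is_faster(edge_ms=250.0, cloud_inference_ms=5.0,
                           network_round_trip_ms=40.0)

# Same server, but reached through a slow or congested link.
bad_link = cloud_is_faster(edge_ms=250.0, cloud_inference_ms=5.0,
                           network_round_trip_ms=400.0)

print(slow_mcu, bad_link)
```

Speed is only one of the criteria, of course; energy, reliability and maintenance don’t fit into a one-line comparison like this.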

What are the disadvantages of running the inference on the cloud?

  • You need a target with a network connection
  • Networks are not always reliable
  • The server hardware can fail. If the server fails, all the clients will fail
  • The cloud server is not energy efficient
  • Maintenance of the cloud

If you ask me, the most important advantage of edge devices is that they don’t rely on any external dependencies. And the most important advantage of the cloud is that it can be updated at any time, even on the fly.

In this post I’ll focus on running the inference on the cloud and use an MCU as a client to that service. Since I like embedded things, the cloud tflite server will be a Jetson nano running in the two supported power modes and the client will be an esp8266 NodeMCU running at 160MHz.

All the project files are in this repo:

Now let’s dive into it.


Let’s have a look at the components I’ve used.

ESP8266 NodeMCU

This is the esp8266 module with 4MB flash and the esp8266 core, which can run at up to 160MHz. It has two SPI interfaces, one used for the onboard flash and one that is free to use. It also has a 10-bit ADC channel, which is limited to max 1V input signals. This is a great limitation and we’ll see later why. You can find this on ebay sold for ~1.5 EUR, which is dirt cheap. For this project I’ll use the Arduino library to create a TCP socket that connects to a server, sends an input array and then retrieves the inference result output.

Jetson nano development board

The jetson nano dev board is based on a Quad-core ARM Cortex-A57 running @ 1.4GHz, with 4GB LPDDR4 and an NVIDIA Maxwell GPU with 128 CUDA cores. I’m using this board because the tensorflow-gpu (which contains the tflite) supports its GPU and therefore it provides acceleration when running a model inference. This board doesn’t have WiFi or BT, but it has an M.2 Key-E connector, so you’re able to connect a WiFi-BT module. In this project I will just use the ethernet cable connected to a wireless router.

The Jetson nano supports two power modes. The default mode 0 is called MAXN and the mode 1 is called 5W. You can verify in which mode your board is running with this command:

nvpmodel -q

And you can set the mode (e.g. mode 1 – 5W) like this:

# sets the mode to 5W
sudo nvpmodel -m 1

# sets the mode to MAXN
sudo nvpmodel -m 0

I’ll benchmark both modes in this post.

My workstation

I’ve also used my development workstation in order to do benchmark comparisons with the Jetson nano. The main specs are:

  • Ryzen 2700x @ 3700MHz (8 cores / 16 threads)
  • 32GB @ 3200MHz
  • GeForce GT 710 (No CUDA 🙁 )
  • Ubuntu 18.04
  • Kernel 4.18.20-041820-generic

Network setup

This is the network setup I’ve used for developing and testing/benchmarking the project. The esp8266 is connected via WiFi to the router and the workstation (2700x) and the jetson nano are connected via Ethernet (in the drawing replace TCP = ETH!).

This is a photo of the development setup.

Repo details

In the repo you’ll find several folders. Here I’ll list what each folder contains. I suggest you also read the README file in the repo, as it contains information that might not be available here and it will always be kept up to date.

  • ./esp8266-tf-client: This folder contains the firmware for the esp8266
  • ./jupyter_notebook: This folder contains the .ipynb jupyter notebook which you can use on the server and includes the TfliteServer class (which will be explained later) and the tflite model file (mnist.tflite).
  • ./schema: The flatbuffers schema file I’ve used for the communication protocol
  • ./tcp-stress-tool: A C/C++ tool that I’ve written to stress and benchmark the tflite server.

esp8266 firmware

This folder contains the source code for the esp8266 firmware. To build the esp8266 firmware open the `esp8266-tf-client/esp8266-tf-client.ino` with Arduino IDE (version > 1.8). Then in the file you need to change a couple of variables according to your network setup. In the source code you’ll find those values:

#define SSID "SSID"
#define SERVER_IP ""
#define SERVER_PORT 32001

You need to edit them according to your wifi network and router setup. So, use your wifi router’s SSID and password. The `SERVER_IP` is the IP of the computer that will run the python server and the `SERVER_PORT` is the server’s port; both need to match the values in the python script. All the data in the communication between the client and the server are serialized with flatbuffers. This comes with quite a significant performance hit, but it’s quite necessary in this case. The client sends 3180 bytes on every transaction to the server, which are the serialized 784 floats of each 28×28 digit. Then the response from the server to the client is 96 bytes. These byte lengths are hard-coded, so if you make any changes you also need to change the definitions in the code. They are hard-coded in order to accelerate the network recv() routines, so they don’t wait for timeouts.
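To illustrate why fixed message lengths help on the receiving side, here is a minimal Python sketch of a "read exactly N bytes" helper. The name `recv_exact` and the constants are hypothetical and not taken from the repo; the only figures from this post are the 3180-byte request and the 96-byte response. Since 784 float32 values take 3136 bytes, the remaining 44 bytes of the request are presumably serialization overhead.

```python
import socket
import struct

REQUEST_SIZE = 3180   # serialized request size quoted in the text
RESPONSE_SIZE = 96    # serialized response size quoted in the text
RAW_INPUT_SIZE = 28 * 28 * struct.calcsize("<f")  # 784 float32 values


def recv_exact(sock, size):
    """Read exactly `size` bytes from `sock`, or raise if the peer closes early."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed with %d bytes missing" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    # Loopback demo with a connected socket pair instead of a real esp8266 client.
    client, server = socket.socketpair()
    client.sendall(bytes(REQUEST_SIZE))
    payload = recv_exact(server, REQUEST_SIZE)
    print(len(payload), "bytes received")
    print(REQUEST_SIZE - RAW_INPUT_SIZE, "bytes beyond the raw floats")
    client.close()
    server.close()
```

Because the receiver knows the length up front, it can return as soon as the last byte arrives instead of waiting for a timeout or for the connection to close.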

By default this project assumes that the esp8266 runs at 160MHz. In case you change this to 80MHz then you need also to change the `MS_CONST` in the code like this:

#define MS_CONST 80000.0f

Otherwise the ms values will be wrong. I guess there’s an easier and automated way to do this, but yeah…
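As a rough illustration of where such a constant comes from, the sketch below assumes that the firmware converts an elapsed CPU cycle count to milliseconds by dividing it by `MS_CONST` (an assumption based on the 80000.0f value for 80MHz, i.e. cycles per millisecond). The function names are hypothetical and are not part of the firmware.

```python
def cycles_per_ms(cpu_hz):
    """Number of CPU cycles that elapse in one millisecond at the given clock."""
    return cpu_hz / 1000.0


def cycles_to_ms(cycles, cpu_hz):
    """Convert an elapsed cycle count to milliseconds."""
    return cycles / cycles_per_ms(cpu_hz)


if __name__ == "__main__":
    for mhz in (80, 160):
        print("%d MHz -> MS_CONST = %.1f" % (mhz, cycles_per_ms(mhz * 1_000_000)))

    # The same cycle count read with the wrong constant is off by a factor of two.
    cycles = 6_400_000
    print(cycles_to_ms(cycles, 160_000_000), "ms at 160 MHz")
    print(cycles_to_ms(cycles, 80_000_000), "ms if 80 MHz is wrongly assumed")
```

Under that assumption, the constant could also be derived from the configured CPU frequency instead of being edited by hand.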

The firmware supports 3 serial commands that you can send via the terminal. All the commands need to be terminated with a newline. The supported commands are:

| Command | Description |
|---|---|
| `TEST` | Sends a single digit inference request to the server and prints the parsed response |
| `START=<SECS>` | Triggers a TCP inference request to the server every <SECS> seconds. For example, to poll the server every 5 seconds, send `START=5` over the serial to the esp8266 (don’t forget the newline at the end). |
| `STOP` | Stops the timer that sends the periodic TCP inference requests |

To build and upload the firmware to the esp8266 read the README of the repo.

Using the Jupyter notebook

I’ve used the exact same tflite model that I’ve used in part 3 and part 4. The model is located in ./jupyter_notebook/mnist.tflite. You need to clone the repo on the Jetson nano (or your workstation if you prefer). From now on, instead of making a distinction between the Jetson nano and the workstation, I’ll just refer to them as the cloud, as it doesn’t really make any difference. Therefore, just clone the repo to your cloud server. This here is the jupyter notebook on bitbucket.

Benchmarking the inference on the cloud

The important sections in the notebook are 3 and 4. Section 3 is the `Benchmark the inference on the Jetson-nano`. Here I assume that this runs on the nano, but it’s the same on any server. So, in this section I’m benchmarking the model inference with a random input. I’ve run this benchmark on both my workstation and the Jetson nano and these are the results I got. For reference I’ll also add the numbers of the edge inference on the STM32F7 from the previous post using x-cube-ai.

| Cloud server | ms (after 1000 runs) |
|---|---|
| My workstation | 0.206410 |
| Jetson nano (MAXN) | 0.987536 |
| Jetson nano (5W) | 2.419758 |
| STM32F746 @ 216MHz | 76.754 |
| STM32F746 @ 288MHz | 57.959 |

The next table shows the difference in performance between all the different benchmarks.

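| | STM@216 | STM@288 | Nano 5W | Nano MAXN | 2700x |
|---|---|---|---|---|---|
| STM@216 | 1 | 1.324 | 31.719 | 77.722 | 371.852 |
| STM@288 | 0.755 | 1 | 23.952 | 58.69 | 280.795 |
| Nano 5W | 0.031 | 0.041 | 1 | 2.45 | 11.723 |
| Nano MAXN | 0.012 | 0.017 | 0.408 | 1 | 4.784 |
| 2700x | 0.002 | 0.003 | 0.085 | 0.209 | 1 |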

An example of how to read the above table: each cell shows how many times faster the column is than the row. So the STM32F7@288 is 1.324x faster than the STM32F7@216 and the Ryzen 2700x is 371.8x faster than the STM32F7@216. Also, the Jetson nano in MAXN mode is 2.45x faster than in the 5W mode, etc.

What you should probably keep from the above table is that the Jetson nano is ~32x to 78x faster than the STM32F7 at the stock clocks. Also the 2700x is only ~4.8x faster than the nano in MAXN mode, which is very good performance for the nano if you think about its consumption, cost and size.

Therefore, the performance/cost and performance/consumption ratios are far better on the Jetson nano compared to the 2700x. So it makes perfect sense to use this as a cloud tflite server. One use-case of this scenario is having a cloud accelerator running locally at a place that covers a wide area with WiFi and then having dozens of esp8266 clients that request inferences from the server.
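The ratios in the comparison table can be reproduced from the measured times with a few lines of Python. This is only a sketch of the arithmetic: the timings are copied from the benchmark table, while the dictionary and function names are my own.

```python
# Average inference time in ms, copied from the benchmark table.
timings_ms = {
    "STM@216": 76.754,
    "STM@288": 57.959,
    "Nano 5W": 2.419758,
    "Nano MAXN": 0.987536,
    "2700x": 0.206410,
}


def speedup(row, col):
    """How many times faster `col` is than `row` (row time / column time)."""
    return timings_ms[row] / timings_ms[col]


if __name__ == "__main__":
    names = list(timings_ms)
    print(" " * 10 + "".join("%10s" % name for name in names))
    for row in names:
        cells = "".join("%10.3f" % speedup(row, col) for col in names)
        print("%-10s" % row + cells)
```

Each row is divided by each column, so values above 1 mean the column platform is faster than the row platform.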

Benchmarking the tflite cloud inference

To run the server you need to run the cell in section `4. Run the TCP server`. First you need to insert the correct IP of the cloud server (in my case, the IP of my Jetson nano) and then run the cell. The other way is to edit the `jupyter_notebook/TfliteServer/` file and change the IP (and the TCP port, if you like) in this code:

if __name__=="__main__":
    srv = TfliteServer('../mnist.tflite')
    srv.listen('', 32001)

Then on your terminal run:


This will run the server and you’ll get the following output.

dimtass@jetson-nano:~/rnd/tensorflow-nano/jupyter_notebook/TfliteServer$ python3
TfliteServer initialized
TCP server started at port: 32001

Now send the TEST command on the esp8266 via the serial terminal. When you do this, then the following things will happen:

  1. esp8266 serializes the 28×28 random array to a flatbuffer
  2. esp8266 connects to the TCP port of the server
  3. esp8266 sends the flatbuffer to the server
  4. Server de-serializes the flatbuffer
  5. Server converts the tensor from (784,) to (1, 28, 28, 1)
  6. Server runs the inference with the input
  7. Server serializes the output in a flatbuffer (including the time in ms of the inference operation)
  8. Server sends the output back to the esp8266
  9. esp8266 de-serializes the output
  10. esp8266 outputs the result

This is what you get from the esp8266 serial output:

Request a single inference...
======== Results ========
Inference time in ms: 12.608528
out[0]: 0.080897
out[1]: 0.128900
out[2]: 0.112090
out[3]: 0.129278
out[4]: 0.079890
out[5]: 0.106956
out[6]: 0.074446
out[7]: 0.106730
out[8]: 0.103112
out[9]: 0.077702
Transaction time: 42.387493 ms

In this output the “inference time in ms” is the time in ms that the cloud server spent running the inference. Then you get the array of the 10 predictions for the output and finally the “Transaction time” is the total time of the whole procedure. The total time is the time that steps 1-9 took. At the same time the output of the server is the following:

==== Results ====
Hander time in msec: 30.779839
Prediction results: [0.08089687 0.12889975 0.11208985 0.12927799 0.07988966 0.10695633
 0.07444601 0.10673008 0.10311186 0.07770159]
Predicted value: 3

The “handler time in msec” is the time that the TCP reception handler used (see: jupyter_notebook/TfliteServer/ and the FbTcpHandler class).

From the above benchmark with the esp8266 we need to keep the following two things:

  1. Of the 42.38 ms, 12.60 ms was the inference run time, so all the rest, like serialization and network transactions, cost 29.78 ms (on the local WiFi network). Therefore, the extra time was ~2.4x the inference run time itself.
  2. The total time of the above operation was 42.38 ms and the STM32F7 needed 76.75 ms @ 216MHz (or 57.96 ms @ 288MHz). That means that the cloud inference is ~1.8x and ~1.37x faster, respectively.
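The breakdown above is simple arithmetic on the numbers printed by the esp8266 and on the earlier STM32F7 results. A short Python sketch (the function names are hypothetical, the values are the ones quoted in this post) makes it easy to repeat with your own measurements:

```python
transaction_ms = 42.387493  # "Transaction time" printed by the esp8266
inference_ms = 12.608528    # "Inference time in ms" reported by the server
stm32_ms = {"216MHz": 76.754, "288MHz": 57.959}  # edge results from the earlier table


def overhead_ms(transaction, inference):
    """Time spent outside the inference: serialization plus network transfers."""
    return transaction - inference


def ratio(a, b):
    """How many times larger `a` is than `b`."""
    return a / b


if __name__ == "__main__":
    extra = overhead_ms(transaction_ms, inference_ms)
    print("overhead: %.2f ms" % extra)
    print("overhead / inference: %.2fx" % ratio(extra, inference_ms))
    for clock, edge_ms in stm32_ms.items():
        print("cloud vs STM32F7 @ %s: %.2fx faster" % (clock, ratio(edge_ms, transaction_ms)))
```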

Finally, as you probably already noticed, the protocol is very simple, so there are no checksums, server-client validation and other fail-safe mechanisms. Of course, that’s on purpose, as you can imagine. Otherwise, the complexity would be higher. But you need to consider those things if you’re going to design a system like this.

Benchmarking the tflite server

The tflite TCP server is just a python TCP socket listener. That means that by design it has much lower performance compared to any TCP server written in C/C++ or Java. Despite the fact that I was aware of this limitation, I’ve chosen to go with this solution in order to integrate the server easily in the jupyter notebook and it was also much faster to implement. Sadly, I’ve seen a great performance hit with this implementation and I would like to investigate a bit further (in the future) and verify if that’s because of the python implementation or something else. The results were pretty bad especially for the Jetson nano.

In order to test the server, I’ve written a small C/C++ stress tool that I’ve used to spawn a user-defined number of TCP client threads and request inferences from the server. Because it’s still early in my testing, I assume that the GPU can only run one inference at a time, therefore there’s a thread lock before any thread is able to call the inference function. This lock is in the jupyter_notebook/TfliteServer/ file in those lines:

output_data, time_ms = runInference(resp.Input().DigitAsNumpy())

One thing I would like to mention here is that I’m not too lazy to investigate every aspect of each framework in depth, it’s just that I don’t have the time, therefore I make logical assumptions. This is why I assume that I need to put a lock there, in order to prevent several simultaneous calls to the tensorflow API. Maybe this is handled in the API, I don’t know. Anyway, have in mind that this is the reason the lock is there, so all the inference requests will block and wait until the currently running inference is finished.

So, the easiest way to run some benchmarks is to run the TfliteServer on the server. First you need to edit the IP address in the __main__ function. You need to use the IP of the server; this also applies if you run it locally (even when I do this locally I use the real IP address). Then run the server:
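A minimal sketch of this locking pattern in Python could look like the following. It is not the code from the repo: `run_inference` is a stand-in that just sums the input instead of calling tflite, and `handle_request` and `inference_lock` are hypothetical names.

```python
import threading
import time

# One global lock shared by all handler threads.
inference_lock = threading.Lock()


def run_inference(input_data):
    """Stand-in for the real model call; returns (output, elapsed ms)."""
    start = time.perf_counter()
    output = [sum(input_data)]  # placeholder work instead of a tflite invoke
    return output, (time.perf_counter() - start) * 1000.0


def handle_request(input_data, results, index):
    """Per-client handler: only one thread at a time may run the inference."""
    with inference_lock:
        output, time_ms = run_inference(input_data)
    results[index] = (output, time_ms)


if __name__ == "__main__":
    results = [None] * 8
    threads = [
        threading.Thread(target=handle_request, args=([1.0] * 784, results, i))
        for i in range(len(results))
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print(all(output == [784.0] for output, _ in results))
```

Using the lock as a context manager guarantees it is released even if the inference call raises, so one failing request cannot block all the other clients.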

cd jupyter_notebook/TfliteServer/

Then you can run the client and pass the server IP, port and number of threads in the command line. For example, I’ve run both the client and server on my workstation, which has the IP, so the command I’ve used was:

cd tcp-stress-tool/
./tcp-stress-tool 32001 500

This will spawn 500 clients (each on its own thread) and request an inference from the python server. Because the output is quite big, I’ll only post the last line (but I’ve copied some logs in the results/ folder in the repo).

This tool will spawn a number of TCP clients and will request
the tflite server to run an inference on random data.
Warning: there is no proper input parsing, so you need to be
cautious and read the usage below.

tcp-stress-tool [server ip] [server port] [number of clients]

server ip:
server port: 32001
number of clients: 500

Spawning 500 TCP clients...
[thread=2] Connected
[thread=1] Connected
[thread=3] Connected


Total elapsed time: 31228.558064 ms
Average server inference time: 0.461818 ms

The output means that 500 TCP transactions and inferences were completed in 31.2 secs with an average inference time of 0.46 ms. That means the total time for the inferences was only ~0.23 secs (500 × 0.46 ms) and the remaining ~31 secs were spent in the comms and serializing the data. That seems like far too much, right? I’m sure that this time should be less. On the Jetson nano it’s even worse, because I wasn’t able to run a test with 500 clients and many connections were rejected. With any number of threads above 20 the python script can’t handle the load. I don’t know why. In the results/ folder you’ll find the following test results:

  • tcp-stress-jetson-nano-10c-5W.txt
  • tcp-stress-jetson-nano-50c-5W.txt
  • tcp-stress-jetson-nano-50c-MAXN.txt
  • tcp-stress-output-workstation-500c.txt

As you can guess from the filename, Xc is the number of threads and for Jetson nano there are results for both modes (MAXN and 5W). This is a table with all the results:

| Test | Threads | Total time ms | Avg. inference ms |
|---|---|---|---|
| Nano 5W | 10 | 1057.1 | 3.645 |
| Nano 5W | 20 | 3094.05 | 4.888 |
| Nano MAXN | 10 | 236.13 | 2.41 |
| Nano MAXN | 20 | 3073.33 | 3.048 |
| 2700x | 500 | 31228.55 | 0.461 |

From those results, I’m quite sure that there’s something wrong with the python3 TCP server. Maybe at some point I’ll try something different. In any case that concludes my tests, although there’s a question mark regarding the performance of the Jetson nano when it’s acting as a tflite server. For now, it seems that it can’t handle a lot of connections (with this implementation), but I’m quite certain this will be much different if the server is a proper C/C++ implementation.
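To see how little of the wall-clock time is actually spent in the model, the stress-test results can be post-processed like this. It is a rough sketch: it assumes the inferences are serialized by the lock, so threads × average inference time is the total inference time. The numbers are copied from the results table; the names are my own.

```python
# (test, threads, total time ms, avg. inference ms), copied from the results table.
results = [
    ("Nano 5W", 10, 1057.1, 3.645),
    ("Nano 5W", 20, 3094.05, 4.888),
    ("Nano MAXN", 10, 236.13, 2.41),
    ("Nano MAXN", 20, 3073.33, 3.048),
    ("2700x", 500, 31228.55, 0.461),
]


def per_request_ms(total_ms, threads):
    """Average wall-clock time per request."""
    return total_ms / threads


def inference_share(total_ms, threads, avg_inference_ms):
    """Fraction of the total wall-clock time spent inside the inference calls."""
    return threads * avg_inference_ms / total_ms


if __name__ == "__main__":
    for name, threads, total_ms, avg_ms in results:
        print("%-10s %4d threads: %8.2f ms/request, %5.2f%% in inference" % (
            name, threads,
            per_request_ms(total_ms, threads),
            100.0 * inference_share(total_ms, threads, avg_ms),
        ))
```

In every row the inference itself is a small part of the total, which supports the suspicion that the time is lost in the TCP handling and serialization rather than in the model.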


With this post I’ve finished the main tests around ML that I originally had in mind. I’ve explored how ML can be used with various embedded MCUs and I’ve tested both edge and cloud implementations. On the edge side of ML, I’ve tested a naive implementation and also two higher level APIs (the TensorFlow Lite for Microcontrollers API and also the x-cube-ai from ST). On the cloud side of ML, I’ve tested one of the most common and dirt cheap WiFi enabled MCUs, the esp8266.

I’ll mention here once again that, although I’ve used the MNIST example, that doesn’t really matter. It’s the NN model complexity that matters. By that I mean that although it doesn’t make any sense to send a 28×28 tensor from the esp8266 to the cloud for running a prediction on a digit, the model is still just fine for running benchmarks and drawing conclusions. Also, this (784,) input tensor stresses the network as well, which is good for performance tests.

One thing that you might be wondering at this point is, “which implementation is better”? There’s no single answer to this. This is a per case decision and it depends on several parameters around the specific requirements of the project, like cost, energy efficiency, location, environmental conditions and several other things. By doing those tests though, I now have a clearer picture of the capabilities and the limitations of the current technology and this is a very good thing to have when you have to start with a real project development. I hope that the readers who have gone through all the posts of this series are able to draw some conclusions about those tools and their limitations; and based on this knowledge can start evaluating more focused solutions that fit their project’s specs.

One thing that’s also important is that the whole ML domain is developing really fast and things are changing within days or even hours. New APIs, new tools and new hardware are showing up. For example, numerous hardware vendors are now releasing products with some kind of NN acceleration (or AI as they prefer to call it). I’ve read a couple of days ago that even Alibaba released a 16-core RISC-V Processor (XuanTie 910) with AI acceleration. AmLogic released the A311D. Rockchip released the RK3399Pro. Also, Gyrfalcon released the Lightspeeur 2801S Neural Accelerator, to compete with Intel’s NCS2 and Google’s TPU. And many more Chinese manufacturers will release several other RISC-V CPUs with accelerators for NN in the next few weeks and months. Therefore, as you can see, ML in the embedded domain is very hot.

I think I will return from time to time to the embedded ML domain in the future to sync with the current progress and maybe write a few more posts on the subject. But the next stupid-project will be something different. There’s still some clean up and editing I want to do in the first 2 posts in the series, though.

I hope you enjoyed this series as much as I did.

Have fun!

Machine Learning on Embedded (Part 4)


Note: This post is the fourth in the series. Here you can find part 1, part 2 and part 3.

For this post I’ve used the same MNIST model that I’ve trained for TensorFlow Lite for Microcontrollers (tflite-micro) and I’ve implemented the firmware on the 32F746GDISCOVERY by using ST’s X-CUBE-AI framework. But before diving into this, let’s do a recap and repeat some key points from the previous articles.

In part 1, I’ve implemented a naive implementation of a single neuron with 3-inputs and 1-output. Naive means that the inference was just C code, without any accelerations from the hardware. I’ve run those tests on various different MCUs and it was fun seeing even an arduino nano running this thing. Also I’ve overclocked a few MCUs to see how the inference performance scales with the frequency increase.

In part 2, I’ve implemented another naive implementation of a NN with 3-input, 32-hidden, 1-output. The result was as expected, which means that as the NN complexity increases the performance drops. Therefore, not all MCUs can provide the performance to run more complex networks in real-time. The real-time part is something relative, though, because real-time can be from a few ns up to several hours depending on the project. That means that if the inference of a deep-er network needs 12 hours to run on your arduino and your data stream is 1 input per 12 hours and 2 minutes, then you’re fine. Anyway, I won’t debate on that, I think you know what I mean. But if your input sample arrives every few ms then you need something faster. Also, in the back of my head was to verify if this simple NN complexity is useful at all and if it can offer something more than lookup tables or algorithms.

In part 3, I was planning to use x-cube-ai from ST, to port a Keras NN and then benchmark the inference, but after the hint I got in the comments from Raukk, I’ve decided to go with the tflite-micro. Tflite-micro at that point seemed very appealing, because it’s a great idea to have a common API between the desktop, the embedded Linux and the MCU worlds. Think about it. It’s really great to be able to share (almost) the same code between those platforms.

Therefore, in this post I’ve implemented the exact same model to do a comparison of the x-cube-ai and tflite-micro. As I’ve also mentioned in the previous posts (and I’m doing this also now), Machine Learning (ML) on the low embedded (=MCUs) is still a work in progress and there’s a lot of development on the various tools. If you think about it, the whole ML domain has been changing rapidly for the last few years and its introduction to microcontrollers is even more recent. It’s a very hot topic and domain right now. For example, while I was writing the tflite-micro post, the repo was updated several times; but I had to stop updating and lock to a git version in order to finish the post.

Also, on the same day that I finished the post for the x-cube-ai, the new version 4.0.0 was released, which pushed back the post release. The new version supports importing tflite models and because I’ve used a Keras model in my first implementation, I had to throw away quite some work that I’ve done… But I couldn’t do otherwise, as now I had the chance to use the exact same tflite model and not the Keras model (the tflite was a port from Keras). Of course, I didn’t expect any differences, but still it’s better to compare the exact same models.

You’ll find all the source code for this project here:

So, let’s dive into it.


ST presents the X-CUBE-AI as an “STM32Cube Expansion Package part of the STM32Cube.AI ecosystem and extending STM32CubeMX capabilities with automatic conversion of pre-trained Neural Network and integration of generated optimized library into the user’s project”. Yeah, I know, fancy words. In plain English that means that it’s just a static library for the STM32 MCUs that uses the cmsis-dsp accelerations and a set of tools that convert various model formats to the format that the library can process. That’s it. And it works really well.

There’s also a very informative video here, that shows the procedure you need to follow in order to create a new x-cube-ai project and that’s the one I’ve also used to create the project in this repo. I believe it’s very straightforward and there’s no reason to explain anything more than that. The only thing I always do differently is that I integrate the resulting code from STM32CubeMX into my cmake template.

So the x-cube-ai adds some tools in the CubeMX GUI and you can use them to analyze the model, compress the weight values, and validate the model on both desktop and the target. With x-cube-ai, you can finally create source code for 3 types of projects, which are the SystemPerformance, Validation and ApplicationTemplate. For the first two projects you just compile them, flash and run, so you don’t have to write any code yourself (unless you want to change default behaviour). As you can see on the YouTube link I’ve posted, you can choose the type of project in the “Pinout & Configuration” tab and then click in the “Additional Software”. From that list expand the “X-CUBE-AI/Application” (be careful to select the proper (=latest?) version if you have many) and then in the Selection column, select the type of the project you want to build.

Analyzing the model

I want to mention here that ST has done a great job on logging and displaying information for the model. You get a lot of information in CubeMX while preparing your model and you know beforehand the RAM/ROM size with the compression, the complexity in MACC and also the complexity by layer. This is an example output I got when I’ve analyzed the MNIST model.

Analyzing model 
Neural Network Tools for STM32 v1.0.0 (AI tools v4.0.0) 
-- Importing model 
-- Importing model - done (elapsed time 0.401s) 
-- Rendering model 
-- Rendering model - done (elapsed time 0.156s) 
Creating report file /home/dimtass/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/4.0.0/Utilities/linux/stm32ai_output/mnistkeras_analyze_report.txt 
Exec/report summary (analyze 0.558s err=0) 
model file      : /rnd/bitbucket/machine-learning-for-embedded/code-stm32f746-xcube/mnist.tflite 
type            : tflite (tflite) 
c_name          : mnistkeras 
compression     : 4 
quantize        : None 
L2r error       : NOT EVALUATED 
workspace dir   : /tmp/mxAI_workspace26422621629890969500934879814382 
output dir      : /home/dimtass/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/4.0.0/Utilities/linux/stm32ai_output 
model_name      : mnist 
model_hash      : 3be31e1950791ab00299d58cada9dfae 
input           : input_0 (item#=784, size=3.06 KiB, fmt=FLOAT32) 
input (total)   : 3.06 KiB 
output          : nl_7 (item#=10, size=40 B, fmt=FLOAT32) 
output (total)  : 40 B 
params #        : 93,322 (364.54 KiB) 
macc            : 2,852,598 
rom (ro)        : 263,720 (257.54 KiB) -29.35% 
ram (rw)        : 33,664 + 3,176 (32.88 KiB + 3.10 KiB) 
id  layer (type)        output shape      param #     connected to             macc           rom                 
0   input_0 (Input)     (28, 28, 1)                                                                               
    conv2d_0 (Conv2D)   (26, 26, 32)      320         input_0                  237,984        1,280               
    nl_0 (Nonlinearity) (26, 26, 32)                  conv2d_0                                                    
1   pool_1 (Pool)       (13, 13, 32)                  nl_0                                                        
2   conv2d_2 (Conv2D)   (11, 11, 64)      18,496      pool_1                   2,244,480      73,984              
    nl_2 (Nonlinearity) (11, 11, 64)                  conv2d_2                                                    
3   pool_3 (Pool)       (5, 5, 64)                    nl_2                                                        
4   conv2d_4 (Conv2D)   (3, 3, 64)        36,928      pool_3                   332,416        147,712             
    nl_4 (Nonlinearity) (3, 3, 64)                    conv2d_4                                                    
5   reshape_5 (Reshape) (576,)                        nl_4                                                        
    dense_5 (Dense)     (64,)             36,928      reshape_5                36,864         38,144 (c)          
    nl_5 (Nonlinearity) (64,)                         dense_5                  64                                 
6   dense_6 (Dense)     (10,)             650         nl_5                     640            2,600               
7   nl_7 (Nonlinearity) (10,)                         dense_6                  150                                
mnist p=93322(364.54 KBytes) macc=2852598 rom=257.54 KBytes ram=32.88 KBytes -29.35% 
Complexity by layer - macc=2,852,598 rom=263,720 
id      layer (type)        macc                                    rom                                     
0       conv2d_0 (Conv2D)   ||||                              8.3%  |                                 0.5%  
2       conv2d_2 (Conv2D)   |||||||||||||||||||||||||||||||  78.7%  ||||||||||||||||                 28.1%  
4       conv2d_4 (Conv2D)   |||||                            11.7%  |||||||||||||||||||||||||||||||  56.0%  
5       dense_5 (Dense)     |                                 1.3%  ||||||||                         14.5%  
5       nl_5 (Nonlinearity) |                                 0.0%  |                                 0.0%  
6       dense_6 (Dense)     |                                 0.0%  |                                 1.0%  
7       nl_7 (Nonlinearity) |                                 0.0%  |                                 0.0%  
Using TensorFlow backend. 
Analyze complete on AI model

This is the output that you get by just running the analyze tool on the imported tflite model in CubeMX. Lots of information there, but let’s focus on some really important info. As you can see, you know exactly how much ROM and RAM you need! You couldn’t do that with tflite-micro. In tflite-micro you either need to calculate this on your own, or you need to set a heap size and try to load the model; if the heap isn’t enough and the allocator complains, then you add more heap and repeat. This is not very convenient, right? But with x-cube-ai you know exactly how much heap you need at least for the model (and you can then add more for your app). Great stuff.

Model RAM/ROM usage

So in this case the ROM needed for the model is 263720 bytes. In part 3, that was 375740 bytes (see section 3 in the jupyter notebook). That difference is not because I’ve used quantization, but because of the 4x compression selection I’ve made for the weights in the tool (the YouTube video does the same at time=3:21). Therefore, the decrease in the model size in ROM is from that compression. According to the tools that’s -29.35% compared to the original size. In the current project the model binary blob is in the `source/src/mnistkeras_data.c` file and it’s a C array like the one in the tflite-micro project. The equivalent file in the tflite-micro project was the `source/src/inc/model_data.h`. Those sizes are without quantization, because I didn’t manage to convert the model to UINT8 as the TFLiteConverter converts the model only to INT8, which is not supported in tflite. I’m still puzzled by that; I can’t figure out why this is happening and I couldn’t find any documentation or example on how to do that.

Now, let’s go to the RAM usage. With x-cube-ai the RAM needed is only 36840 bytes! In tflite-micro I needed 151312 bytes (see the table in the “Model RAM Usage” section here). That’s ~4.1x less RAM. It’s amazing. The reason for that is that in tflite-micro the micro_allocator expands the layers of the model in the RAM, but in x-cube-ai that doesn’t happen. From the above report (and from what I’ve seen) it seems that the layers remain in the ROM and that the API only allocates RAM for the needed operations.

As you can imagine, those two things (RAM and ROM usage) make x-cube-ai a much better option even to start with. They even make it possible to run this model on MCUs with less RAM/ROM than the STM32F746, which is considered a buffed MCU. Huge difference in terms of resources.
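The ROM and RAM figures can be cross-checked against the analyze report with a few lines of Python. This is only a sketch of the arithmetic: the values are copied from the report and from part 3, the names are my own, and the idea that the -29.35% is relative to the uncompressed float32 weights (93,322 parameters × 4 bytes) is my reading of the report.

```python
PARAMS = 93_322             # "params #" from the analyze report
ROM_BYTES = 263_720         # "rom (ro)" from the analyze report
RAM_BYTES = 33_664 + 3_176  # the two "ram (rw)" parts from the analyze report
TFLITE_MICRO_RAM = 151_312  # RAM usage quoted from part 3


def rom_reduction_pct(params, rom_bytes, bytes_per_param=4):
    """ROM saving relative to storing every parameter as a float32."""
    uncompressed = params * bytes_per_param
    return 100.0 * (uncompressed - rom_bytes) / uncompressed


def ram_ratio(other_ram, this_ram):
    """How many times more RAM the other framework needs."""
    return other_ram / this_ram


if __name__ == "__main__":
    print("uncompressed weights: %d bytes" % (PARAMS * 4))
    print("ROM reduction: %.2f%%" % rom_reduction_pct(PARAMS, ROM_BYTES))
    print("x-cube-ai RAM: %d bytes" % RAM_BYTES)
    print("tflite-micro / x-cube-ai RAM: %.2fx" % ram_ratio(TFLITE_MICRO_RAM, RAM_BYTES))
```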

Framework output project types

As I’ve mentioned previously, with x-cube-ai you can create 3 types of projects (SystemPerformance, Validation, ApplicationTemplate). Let’s see a few more details about those.

Note: for the SystemPerformance and Validation project types, I’ve included the bin files in the extras/ folder. You can only flash those on the STM32F746 which comes with the 32F746GDISCOVERY board.


SystemPerformance

As the name clearly implies, you can use this project type to benchmark the performance using random inputs. If you think about it, that’s all I would need for this post. I just need to import the model, build this application and there you go, I have all I need. That’s correct, but… I wanted to do the same thing I did in the previous project with tflite-micro and be able to use a comm protocol to upload inputs from hand-drawn digits from the jupyter notebook to the STM32F7, run the inference, get the output back and validate the result. Therefore, although this project type is enough for benchmarking, I still had work to do. But in case you just need to benchmark the MCU running the model inference, just build this. You don’t even have to write a single line of code. This is the serial output when this code runs (this is a loop, but I only post one iteration).

Running PerfTest on "mnistkeras" with random inputs (16 iterations)...

Results for "mnistkeras", 16 inferences @216MHz/216MHz (complexity: 2852598 MACC)
 duration     : 73.785 ms (average)
 CPU cycles   : 15937636 -1352/+808 (average,-/+)
 CPU Workload : 7%
 cycles/MACC  : 5.58 (average for all layers)
 used stack   : 576 bytes
 used heap    : 0:0 0:0 (req:allocated,req:released) cfg=0

From the above output we can see that @216MHz (default frequency) the inference duration was 73.78 ms (average) and then some other info. Ok, so now let’s push the frequency up a bit @288MHz and see what happens.

Running PerfTest on "mnistkeras" with random inputs (16 iterations)...

Results for "mnistkeras", 16 inferences @288MHz/288MHz (complexity: 2852598 MACC)
 duration     : 55.339 ms (average)
 CPU cycles   : 15937845 -934/+1145 (average,-/+)
 CPU Workload : 5%
 cycles/MACC  : 5.58 (average for all layers)
 used stack   : 576 bytes
 used heap    : 0:0 0:0 (req:allocated,req:released) cfg=0

55.34 ms! It’s amazing. More about that later.
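
The two reports are consistent with a simple model: the number of CPU cycles per inference stays the same and only the clock changes. A quick Python sketch with the numbers copied from the two outputs above shows that (the helper name is mine, it's not part of any tool):

```python
MACC = 2_852_598  # model complexity from the report

# Average CPU cycles per inference, keyed by core clock in MHz.
cycles = {216: 15_937_636, 288: 15_937_845}

def duration_ms(cpu_cycles, clock_mhz):
    """Time needed to execute cpu_cycles at clock_mhz, in milliseconds."""
    return cpu_cycles / (clock_mhz * 1e6) * 1e3

d216 = duration_ms(cycles[216], 216)
d288 = duration_ms(cycles[288], 288)
cycles_per_macc = cycles[216] / MACC
speedup = d216 / d288

print(f"216 MHz: {d216:.3f} ms")
print(f"288 MHz: {d288:.3f} ms")
print(f"cycles/MACC: {cycles_per_macc:.3f}")
print(f"speedup: {speedup:.3f}x (clock ratio: {288 / 216:.3f}x)")
```

So the speedup follows the clock ratio almost exactly, which suggests that for this model the inference doesn't get noticeably limited by anything else when overclocking.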


Validation

The validation project type is the one that you can use if you want to validate your model with different inputs. There is a mode in which you can validate on the target with either random or user-defined data. There is a pdf document here, named “Getting started with X-CUBE-AI Expansion Package for Artificial Intelligence (AI)”, and you can find the format of the user input in section 14.2, which is just a csv file with comma separated values.
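
As a rough idea of how such an input file could be produced, here is a minimal Python sketch. It assumes one flattened sample per row with 784 float values (28x28 pixels already normalized to [0, 1]); that layout is my assumption, so check section 14.2 of the document for the exact format. The `write_samples` helper and the test image are made up for the example.

```python
import csv
import io

INPUT_SIZE = 28 * 28  # one flattened MNIST digit

def write_samples(samples, stream):
    """Write each sample as one CSV row of INPUT_SIZE float values."""
    writer = csv.writer(stream)
    for sample in samples:
        if len(sample) != INPUT_SIZE:
            raise ValueError(f"expected {INPUT_SIZE} values, got {len(sample)}")
        writer.writerow([f"{value:.6f}" for value in sample])

# Hypothetical input: an empty image with a single bright pixel in the middle.
digit = [0.0] * INPUT_SIZE
digit[14 * 28 + 14] = 1.0

buffer = io.StringIO()
write_samples([digit], buffer)
csv_text = buffer.getvalue()
print(csv_text[:44] + "...")
```

To write a real file, pass an `open("digit.csv", "w", newline="")` handle instead of the `StringIO` buffer.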

The default mode, which uses random inputs, produces the following output (warning: a lot of text follows).

Starting AI validation on target with random data... 
Neural Network Tools for STM32 v1.0.0 (AI tools v4.0.0) 
-- Importing model 
-- Importing model - done (elapsed time 0.403s) 
-- Building X86 C-model 
-- Building X86 C-model - done (elapsed time 0.519s) 
-- Setting inputs (and outputs) data 
Using random input, shape=(10, 784) 
-- Setting inputs (and outputs) data - done (elapsed time 0.691s) 
-- Running STM32 C-model 
ON-DEVICE STM32 execution ("mnistkeras", /dev/ttyUSB0, 115200).. 
<Stm32com id=0x7f8fd8339ef0 - CONNECTED(/dev/ttyUSB0/115200) devid=0x449/STM32F74xxx msg=2.0> 
 0x449/STM32F74xxx @216MHz/216MHz (FPU is present) lat=7 Core:I$/D$ ART: 
 found network(s): ['mnistkeras'] 
 description    : 'mnistkeras' (28, 28, 1)-[7]->(1, 1, 10) macc=2852598 rom=257.54KiB ram=32.88KiB 
 tools versions : rt=(4, 0, 0) tool=(4, 0, 0)/(1, 3, 0) api=(1, 1, 0) "Fri Jul 26 14:30:06 2019" 
Running with inputs=(10, 28, 28, 1).. 
....... 1/10 
....... 2/10 
....... 3/10 
....... 4/10 
....... 5/10 
....... 6/10 
....... 7/10 
....... 8/10 
....... 9/10 
....... 10/10 
 RUN Stats    : batches=10 dur=4.912s tfx=4.684s 6.621KiB/s (wb=30.625KiB,rb=400B) 
Results for 10 inference(s) @216/216MHz (macc:2852598) 
 duration    : 78.513 ms (average) 
 CPU cycles  : 16958877 (average) 
 cycles/MACC : 5.95 (average for all layers) 
Inspector report (layer by layer) 
 n_nodes        : 7 
 num_inferences : 10 
Clayer  id  desc                          oshape          fmt       ms         
0       0   10011/(Merged Conv2d / Pool)  (13, 13, 32)    FLOAT32   11.289     
1       2   10011/(Merged Conv2d / Pool)  (5, 5, 64)      FLOAT32   57.406     
2       4   10004/(2D Convolutional)      (3, 3, 64)      FLOAT32   8.768      
3       5   10005/(Dense)                 (1, 1, 64)      FLOAT32   1.009      
4       5   10009/(Nonlinearity)          (1, 1, 64)      FLOAT32   0.006      
5       6   10005/(Dense)                 (1, 1, 10)      FLOAT32   0.022      
6       7   10009/(Nonlinearity)          (1, 1, 10)      FLOAT32   0.015      
                                                                    78.513 (total) 
-- Running STM32 C-model - done (elapsed time 5.282s) 
-- Running original model 
-- Running original model - done (elapsed time 0.100s) 
Exec/report summary (validate 0.000s err=0) 
model file      : /rnd/bitbucket/machine-learning-for-embedded/code-stm32f746-xcube/mnist.tflite 
type            : tflite (tflite) 
c_name          : mnistkeras 
compression     : 4 
quantize        : None 
L2r error       : 2.87924684e-03 (expected to be < 0.01) 
workspace dir   : /tmp/mxAI_workspace3396387792167015918690437549914931 
output dir      : /home/dimtass/.stm32cubemx/stm32ai_output 
model_name      : mnist 
model_hash      : 3be31e1950791ab00299d58cada9dfae 
input           : input_0 (item#=784, size=3.06 KiB, fmt=FLOAT32) 
input (total)   : 3.06 KiB 
output          : nl_7 (item#=10, size=40 B, fmt=FLOAT32) 
output (total)  : 40 B 
params #        : 93,322 (364.54 KiB) 
macc            : 2,852,598 
rom (ro)        : 263,720 (257.54 KiB) -29.35% 
ram (rw)        : 33,664 + 3,176 (32.88 KiB + 3.10 KiB) 
id  layer (type)        output shape      param #     connected to             macc           rom                 
0   input_0 (Input)     (28, 28, 1)                                                                               
    conv2d_0 (Conv2D)   (26, 26, 32)      320         input_0                  237,984        1,280               
    nl_0 (Nonlinearity) (26, 26, 32)                  conv2d_0                                                    
1   pool_1 (Pool)       (13, 13, 32)                  nl_0                                                        
2   conv2d_2 (Conv2D)   (11, 11, 64)      18,496      pool_1                   2,244,480      73,984              
    nl_2 (Nonlinearity) (11, 11, 64)                  conv2d_2                                                    
3   pool_3 (Pool)       (5, 5, 64)                    nl_2                                                        
4   conv2d_4 (Conv2D)   (3, 3, 64)        36,928      pool_3                   332,416        147,712             
    nl_4 (Nonlinearity) (3, 3, 64)                    conv2d_4                                                    
5   reshape_5 (Reshape) (576,)                        nl_4                                                        
    dense_5 (Dense)     (64,)             36,928      reshape_5                36,864         38,144 (c)          
    nl_5 (Nonlinearity) (64,)                         dense_5                  64                                 
6   dense_6 (Dense)     (10,)             650         nl_5                     640            2,600               
7   nl_7 (Nonlinearity) (10,)                         dense_6                  150                                
mnist p=93322(364.54 KBytes) macc=2852598 rom=257.54 KBytes ram=32.88 KBytes -29.35% 
Cross accuracy report (reference vs C-model) 
NOTE: the output of the reference model is used as ground truth value 
acc=100.00%, rmse=0.0007, mae=0.0003 
10 classes (10 samples) 
C0         0    .    .    .    .    .    .    .    .    .   
C1         .    0    .    .    .    .    .    .    .    .   
C2         .    .    2    .    .    .    .    .    .    .   
C3         .    .    .    0    .    .    .    .    .    .   
C4         .    .    .    .    0    .    .    .    .    .   
C5         .    .    .    .    .    1    .    .    .    .   
C6         .    .    .    .    .    .    0    .    .    .   
C7         .    .    .    .    .    .    .    2    .    .   
C8         .    .    .    .    .    .    .    .    5    .   
C9         .    .    .    .    .    .    .    .    .    0   
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_m_inputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_c_inputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_m_outputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_c_outputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_io.npz 
Evaluation report (summary) 
Mode                acc       rmse      mae       
X-cross             100.0%    0.000672  0.000304  
L2r error : 2.87924684e-03 (expected to be < 0.01) 
Creating report file /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_validate_report.txt 
Complexity/l2r error by layer - macc=2,852,598 rom=263,720 
id  layer (type)        macc                          rom                           l2r error                     
0   conv2d_0 (Conv2D)   |||                     8.3%  |                       0.5%                                
2   conv2d_2 (Conv2D)   |||||||||||||||||||||  78.7%  |||||||||||            28.1%                                
4   conv2d_4 (Conv2D)   |||                    11.7%  |||||||||||||||||||||  56.0%                                
5   dense_5 (Dense)     |                       1.3%  ||||||                 14.5%                                
5   nl_5 (Nonlinearity) |                       0.0%  |                       0.0%                                
6   dense_6 (Dense)     |                       0.0%  |                       1.0%                                
7   nl_7 (Nonlinearity) |                       0.0%  |                       0.0%  2.87924684e-03 *              
fatal: not a git repository (or any of the parent directories): .git 
Using TensorFlow backend. 
Validation ended
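
The layer by layer inspector report is the most useful part of this output if you want to know where the time goes. A small Python sketch with the timings copied from the report (the short layer labels are mine):

```python
# (label, time in ms) copied from the "Inspector report (layer by layer)".
layers = [
    ("conv2d+pool -> (13, 13, 32)", 11.289),
    ("conv2d+pool -> (5, 5, 64)", 57.406),
    ("conv2d -> (3, 3, 64)", 8.768),
    ("dense -> (1, 1, 64)", 1.009),
    ("nonlinearity -> (1, 1, 64)", 0.006),
    ("dense -> (1, 1, 10)", 0.022),
    ("nonlinearity -> (1, 1, 10)", 0.015),
]

total_ms = sum(ms for _, ms in layers)
for label, ms in layers:
    print(f"{label:<30} {ms:7.3f} ms  {ms / total_ms * 100:5.1f}%")

slowest_label, slowest_ms = max(layers, key=lambda item: item[1])
slowest_share = slowest_ms / total_ms * 100
print(f"total: {total_ms:.3f} ms, slowest: {slowest_label}")
```

The second convolution takes roughly three quarters of the inference time, which matches the complexity table at the end of the report where conv2d_2 holds 78.7% of the MACC. If you ever need to shrink this model, that's the layer to look at first.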

I’ve also included the file extras/digit.csv, which is the digit “2” (the same one used in the jupyter notebook). You can use it to verify the model on the target using the `extras/code-stm32f746-xcube-evaluation.bin` firmware and CubeMX. You just need to load the digit as the CubeMX input and validate the model on the target. This is part of the output when validating with that file:

Cross accuracy report (reference vs C-model) 
NOTE: the output of the reference model is used as ground truth value 
acc=100.00%, rmse=0.0000, mae=0.0000 
10 classes (1 samples) 
C0         0    .    .    .    .    .    .    .    .    .   
C1         .    0    .    .    .    .    .    .    .    .   
C2         .    .    1    .    .    .    .    .    .    .   
C3         .    .    .    0    .    .    .    .    .    .   
C4         .    .    .    .    0    .    .    .    .    .   
C5         .    .    .    .    .    0    .    .    .    .   
C6         .    .    .    .    .    .    0    .    .    .   
C7         .    .    .    .    .    .    .    0    .    .   
C8         .    .    .    .    .    .    .    .    0    .   
C9         .    .    .    .    .    .    .    .    .    0

The above output means that the network found the digit “2” with 100% accuracy.


ApplicationTemplate

This is the project you want to build when you develop your own application. In this case CubeMX creates only the necessary code that wraps the x-cube-ai library. These are the app_x-cube-ai.h and app_x-cube-ai.c files that are located in the source/src folder (and in the inc/ folder in the src). These are just wrapper files around the library and the model. You actually only need to call this function and then you’re ready to run your inference.


The x-cube-ai static lib

Let’s see a few things about the x-cube-ai library. First and most important, it’s a closed source library, so it’s proprietary software. You won’t get the code for this, which for people like me is a big negative. I guess that way ST tries to keep the library around their own hardware, which makes sense; but nevertheless I don’t like it. That means that the only things you have access to are the header files in the `source/libs/AI/Inc` folder and the static library blob. The only insight you can get into the library is by using the readelf tool to extract some information from the blob. I’ve added the output in the `extras/elfread_libNetworkRuntime400_CM7_GCC.txt`.

From the output I can tell that this was built on a Windows machine by the user `fauvarqd`, lol. Very valuable information. OK, seriously now, you can also see the exported calls (which you could see anyway from the header files) and also the names of the object files that are used to build the library. Another trick, if you want to get more info, is to try to build the project after removing the dsp library. Then the linker will complain that the lib can’t find some functions, which means that you can derive some of them. But does it really matter though? No source code, no fun 🙁

I don’t like the fact that I don’t have access in there, but it is what it is, so let’s move on.

Building the project

You can find the C++ cmake project here:

In the source/libs folder you’ll find all the necessary libraries, which are CMSIS, the STM32F7xx_HAL_Driver, flatbuffers and the x-cube-ai lib. All these are built as static libraries and then the main.cpp app is linked against those static libs. You will find the cmake files for those libs in source/cmake. The file in the repo is quite thorough about the build options and the different builds. To build the code run this command:


If you want to enable overclocking then you can build like this:


Just make sure to select the value you like for the clock in the source/src/main.cpp file in this line:

RCC_OscInitStruct.PLL.PLLN = 288; // Overclock

The default overclocking value is 288MHz, but you can experiment with a higher one (in my case that was the maximum without hard-faults).

Also, if you overclock you need to change the clock dividers on the APB1 and APB2 buses too, otherwise the clocks will be too high and you’ll get hard-faults.

RCC_ClkInitStruct.APB1CLKDivider = RCC_HCLK_DIV4;
RCC_ClkInitStruct.APB2CLKDivider = RCC_HCLK_DIV2;
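
To see what those settings mean for the bus clocks, here is a small Python sketch of the clock arithmetic. It assumes a 2 MHz PLL input (after the PLLM divider), PLLP = 2 and an AHB prescaler of 1, because that's the combination that makes PLLN = 288 give a 288 MHz core clock; check your own main.cpp for the real values. The nominal APB limits are the ones I remember from the STM32F746 datasheet, so verify those too.

```python
PLL_INPUT_MHZ = 2.0            # assumption: PLL input clock after PLLM
PLLP = 2                       # assumption: main PLL output divider
APB1_NOMINAL_MAX_MHZ = 54.0    # from memory, verify in the datasheet
APB2_NOMINAL_MAX_MHZ = 108.0   # from memory, verify in the datasheet

def bus_clocks(plln, apb1_div, apb2_div):
    """Return (SYSCLK, APB1, APB2) in MHz, assuming HCLK == SYSCLK."""
    sysclk = PLL_INPUT_MHZ * plln / PLLP
    return sysclk, sysclk / apb1_div, sysclk / apb2_div

for plln in (216, 288):
    sysclk, apb1, apb2 = bus_clocks(plln, apb1_div=4, apb2_div=2)
    print(f"PLLN={plln}: SYSCLK={sysclk:.0f} MHz, "
          f"APB1={apb1:.0f} MHz (nominal max {APB1_NOMINAL_MAX_MHZ:.0f}), "
          f"APB2={apb2:.0f} MHz (nominal max {APB2_NOMINAL_MAX_MHZ:.0f})")
```

Under these assumptions the buses already run above their nominal limits at 288 MHz even with DIV4 and DIV2, so smaller dividers would push them even further.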

The build command will build the project in the build-stm32 folder. It’s interesting to see the resulting sizes for all the libs and the binary file. The next table lists the sizes when using the current latest gcc-arm-none-eabi-8-2019-q3-update toolchain from here. By the time you read the article this might already have changed.

| File | Size |
|---|---|
| stm32f7-mnist-x_cube_ai.bin | 339.5 kB |
| libNetworkRuntime400_CM7_GCC.a | 414.4 kB |

This is interesting. Let’s see now the differences between the resulting binary and the main AI libs (tflite-micro and x-cube-ai).

(sizes in kB)

| | x-cube-ai | tflite-micro |
|---|---|---|
| binary | 339.5 | 542.7 |
| library | 414.4 | 2867 |

As you can see from above, both the binary and the library for x-cube-ai are much smaller. Regarding the binary, that’s because the model is smaller as the weights are compressed. Regarding the libs, you can’t really say if the size matters, as the implementation and the supported layers for tflite-micro are different, but it seems that the x-cube-ai library is much more optimized for this MCU and it must also be more stripped down.

Supported commands in STM32F7 firmware

The code structure of this project in the repo is pretty much the same as the code in the 3rd post. In this case though I’ve only used a single command. I’ll copy-paste the text needed from the previous post.

After you build and flash the firmware on the STM32F7 (read the for more detailed info), you can use a serial port to either send commands via a terminal like cutecom or interact with the jupyter notebook. The firmware supports two UART ports on the STM32F7. In the first case the commands are just ASCII strings, but in the second case it’s a binary flatbuffer schema. You can find the schema in `source/schema/schema.fbs` if you want to experiment and change stuff. In the firmware code the handling of the received UART commands is done in `source/src/main.cpp` in the function `dbg_uart_parser()`.

The command protocol is plain simple (115200,8,n,1) and its format is:

where ID is a number and each number is a different command. So:
CMD=1, runs the inference of the hard-coded hand-drawn digit (see below)

This is how I’ve connected the two UART ports in my case. Also have a look at the repo’s file for the exact pins on the connector.

Note: More copy-paste from the previous post is coming, as many things are the same, but I have to add them here for consistency.

Use the Jupyter notebook with STM32F7

In the jupyter notebook here, there’s a description of how to evaluate the model on the STM32F7. There are actually two ways to do that: the first one is to use the digit which is already integrated in the code and the other is to upload your hand-drawn digit to the STM32 for evaluation. In any case this will validate the model and also benchmark the NN. Therefore, all you need to do is build and upload the firmware, make the proper connections, run the jupyter notebook and follow the steps in “5. Load model and interpreter”.

I’ve written two custom Python classes which are used in the notebook. Those classes are located in the jupyter_notebook/ folder and each has its own folder.


MnistDigitDraw

The MnistDigitDraw class uses tkinter to create a small window in which you can draw your custom digit using your mouse.


In the left window you can draw your digit by using your mouse. When you’re done, you can press the Clear button if you’re not satisfied. If you are, then you can press the Inference button, which will actually convert the digit to the format that is used for the inference (I now think that this button name is not the best I could have used, but anyway). This will also display the converted digit on the right side of the panel. This is an example.

Finally, you need to press the Export button to write the digit into a file, which can be used later in the notebook. Keep in mind that the jupyter notebook can execute only one cell at a time. That means that as long as this window is not terminated the current cell is still running, so you first need to close the window by pressing the [x] button in order to proceed.

After you export the digit you can validate it in the next cells, either in the notebook or on the STM32F7.


FbComm

The FbComm class handles the communication between the jupyter notebook and the STM32F7 (or another tool which I’ll explain). The FbComm supports two different means of communication. The first is serial comms using a serial port and the other is a TCP socket. There is a reason I’ve done this. Normally, the notebook uses the serial port to send data to and receive data from the STM32F7. Developing with this communication is slow, as it takes a lot of time to build and flash the firmware on the device every time. Therefore, I’ve written a small C++ tool in `jupyter_notebook/FbComm/test_cpp_app/fb_comm_test.cpp`. Actually it’s mainly C code for sockets, but wrapped in a C++ file as flatbuffers need C++. Anyway, if you plan on changing stuff in the flatbuffer schema it’s better to use this tool first to validate the protocol and the conversions, and when that’s done just copy-paste the code to the STM32F7 and expect that it should work.

When you switch to the STM32F7 then you can just use the same class but with the proper arguments for using the serial port.
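
The idea of swapping the transport under the same interface can be sketched in a few lines of Python. This is not the real FbComm code; the class and function names below are made up for the example, the payload is a dummy byte string instead of a flatbuffer, and the serial variant assumes the third-party pyserial package.

```python
import socket
import threading

class TcpTransport:
    """Raw bytes over a TCP socket, e.g. to a test tool on the desktop."""
    def __init__(self, host, port):
        self._sock = socket.create_connection((host, port), timeout=5)
    def write(self, data):
        self._sock.sendall(data)
    def read(self, size):
        return self._sock.recv(size)
    def close(self):
        self._sock.close()

class SerialTransport:
    """Same interface over a serial port (needs pyserial on the host)."""
    def __init__(self, port, baudrate=115200):
        import serial  # third-party, imported lazily so TCP works without it
        self._ser = serial.Serial(port, baudrate, timeout=5)
    def write(self, data):
        self._ser.write(data)
    def read(self, size):
        return self._ser.read(size)
    def close(self):
        self._ser.close()

def exchange(transport, request, response_size):
    """Send a request and read until response_size bytes have arrived."""
    transport.write(request)
    received = bytearray()
    while len(received) < response_size:
        chunk = transport.read(response_size - len(received))
        if not chunk:
            break  # timeout or closed connection
        received.extend(chunk)
    return bytes(received)

def loopback_demo(payload=b"ping"):
    """Exercise the TCP path against a tiny local echo server."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]

    def echo_once():
        conn, _ = server.accept()
        with conn:
            conn.sendall(conn.recv(1024))

    worker = threading.Thread(target=echo_once, daemon=True)
    worker.start()
    link = TcpTransport("127.0.0.1", port)
    try:
        return exchange(link, payload, len(payload))
    finally:
        link.close()
        worker.join(timeout=2)
        server.close()

print(loopback_demo())
```

The code that builds and parses the messages only ever calls `exchange()`, so moving from the desktop test tool to the real board is just a matter of constructing a `SerialTransport` instead of a `TcpTransport`.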


The files in this folder are generated by the flatc compiler, so you shouldn’t change anything in there. If you make any changes in `source/schema/schema.fbs`, then you need to re-run the flatc compiler to re-create the new files. Have a look in the “Flatbuffers” section in the file how to do this.

Benchmarking the x-cube-ai

The benchmark procedure was a bit easier with the x-cube-ai compared to the tflite-micro. I’ve just compiled the project w/ and w/o overclocking and ran the inference several times from the jupyter notebook. As I’ve mentioned earlier, you don’t really have to do that; you can just use the SystemPerformance project from the CubeMX and change the frequency, but this is not as cool as uploading your hand-drawn digit, right? Anyway, that’s the table with the results:

| 216 MHz | 288 MHz |
|---|---|
| 76.754 ms | 57.959 ms |

Now let’s do a comparison between the tflite-micro and the x-cube-ai inference run times.

| | x-cube-ai (ms) | tflite-micro (ms) | difference |
|---|---|---|---|
| 216 MHz | 76.754 | 126.31 | 1.64x (48.8%) |
| 288 MHz | 57.959 | 94.957 | 1.64x (48.4%) |

Mistakenly, I initially calculated this difference to be 170%, because I had built the tflite firmware with the DEBUG flag on, and I thought that it was really huge. After fixing this, I measured a difference of ~48%, which is still a significant difference, but it might be acceptable depending on the application (or not).
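
For reference, this is how the two figures in the difference column can be reproduced in Python. The ~48% matches the symmetric percentage difference (the difference divided by the mean of the two times); that's my reading of the numbers, and the function name is just for this example.

```python
def compare(xcube_ms, tflite_ms):
    """Return (ratio, percentage difference relative to the mean)."""
    ratio = tflite_ms / xcube_ms
    mean_ms = (xcube_ms + tflite_ms) / 2
    pct_diff = (tflite_ms - xcube_ms) / mean_ms * 100
    return ratio, pct_diff

# Inference times in ms from the table: clock -> (x-cube-ai, tflite-micro).
timings = {216: (76.754, 126.31), 288: (57.959, 94.957)}

results = {mhz: compare(*pair) for mhz, pair in timings.items()}
for mhz, (ratio, pct_diff) in results.items():
    print(f"{mhz} MHz: tflite-micro takes {ratio:.3f}x the time, "
          f"difference {pct_diff:.1f}%")
```

Note that if you take the x-cube-ai time as the reference instead of the mean, tflite-micro needs roughly 64-65% more time per inference, which is the same thing the ~1.64x ratio says.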

You might have noticed that the inference time is a bit higher now compared to the SystemPerformance project binary. I can only assume that this is because in the benchmark the outputs are not populated and they are dropped. I’m not sure about this, but it’s my guess as this seems to be a consistent behaviour. Anyway, the difference is 2-3 ms, so I won’t ruin my day thinking more about this, as the results of my project are actually a bit faster than the default validation project.

Evaluating on the STM32F7

This is an example image of the digit I’ve drawn. The format is the standard grayscale 28×28 px image. That’s a uint8 grayscale image [0, 255], but it’s normalized to a [0, 1] float, as the network input and output are float32.
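
The normalization step is just a division by 255. A minimal pure-Python sketch (the notebook itself may do this differently, e.g. with numpy; the function name is mine):

```python
def normalize(pixels):
    """Map uint8 grayscale values [0, 255] to floats in [0.0, 1.0]."""
    if any(not 0 <= p <= 255 for p in pixels):
        raise ValueError("pixel values must be in the range 0..255")
    return [p / 255.0 for p in pixels]

# Three sample pixels: black, white and a dark gray.
sample = normalize([0, 255, 51])
print(sample)
```

For a full digit the input list has 28 * 28 = 784 values, which is the “Num of elements” that gets sent to the STM32F7.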

After running the inference on the target we get back this result on the jupyter notebook.

Comm initialized
Num of elements: 784
Sending image data
Receive results...
Command: 2
Execution time: 76.754265 msec
Out[9]: 0.000000
Out[8]: 0.000000
Out[7]: 0.000000
Out[6]: 0.000000
Out[5]: 0.000000
Out[4]: 0.000000
Out[3]: 0.000000
Out[2]: 1.000000
Out[1]: 0.000000
Out[0]: 0.000000

The output predicts that the input is number 2 and it’s 100% certain about it. Cool.
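
Reading the prediction out of the ten outputs is simply a matter of picking the index with the highest value. In Python, with the values from the result above:

```python
# Out[0] .. Out[9] as received from the STM32F7.
outputs = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# The predicted digit is the index of the largest output.
digit = max(range(len(outputs)), key=outputs.__getitem__)
confidence = outputs[digit]

print(f"predicted digit: {digit} ({confidence * 100:.1f}%)")
```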

Things I liked and didn’t like about x-cube-ai

From what you’ve read above you can pretty much draw your own conclusions about the pros of the x-cube-ai, which actually make almost all the cons seem less important, but I’ll list them anyway. This is not yet a comparison with tflite-micro.


Pros

  1. It’s lightning fast. The performance of this library is amazing.
  2. It’s very light and doesn’t use a lot of resources and the resulting binary is small.
  3. The tool in the CubeMX is able to compress the weights.
  4. The x-cube-ai tool is integrated nicely in the CubeMX interface, although it could be better.
  5. Great analysis reports that help you decide which MCU you need to use and which optimizations to make before you even start coding (regarding ROM and RAM usage).
  6. Supports importing models from Keras, tflite, Lasagne, Caffe and ConvNetJS. So, you are not limited to one tool and also Keras support is really nice.
  7. You can build and test the performance and validate your NN without having to write a single line of code. Just import your model and build the SystemPerformance or Validation application and you’re done.
  8. When you write your own application based on the template then you actually only have to use two functions, one to init the network and one to run your inference. That’s it.


Cons

  1. It’s a proprietary library! No source code available. That’s a big, big problem for many reasons. I never had a good experience with closed source libraries, because when you hit a bug, then you’re f*cked. You can’t debug and solve this by yourself and you need to file a report for the bug and then wait. And you might wait forever!
  2. ST support quite sucks if you’re an individual developer or a really small company. There is a forum, which is based on other developers’ help, but most of the time you might not get an answer. Sometimes you see answers from ST staff, but expect that this won’t happen most of the time. If you’re a big player and you have support from component vendors like Arrow etc., then you can expect all the help you need.
  3. Lack of documentation. There’s only a pdf document here (UM2526). This has a lot of information, but there is still a lot of information missing. Tbh, after I searched in the x-cube-ai folders which are installed with the CubeMX, I’ve found more info and tools, but there’s no mention of those anywhere! I really didn’t like that. OK, now I know, so if you’re also desperate then on your Linux box have a look at this path: ~/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/4.0.0/Documentation. That’s for the 4.0.0 version, so in your case it might be different.

TFLite-micro vs x-cube-ai

Disclaimer: I have nothing to do with ST and I’ve never even got a free sample from them. I had to do this, for what is following…

As you can see, the x-cube-ai has more pros than cons compared to the tflite-micro. Tbh, I’ve also enjoyed working with the x-cube-ai more than with the tflite-micro, as it was much easier. The only thing about the x-cube-ai that leaves a bitter taste is that it’s proprietary software. I can’t stress enough how much I don’t like this and all the problems that it brings along. For example, let’s assume that tomorrow ST decides to pull the plug on this project; boom, everything is gone. That doesn’t sound very nice when you’re planning a long commitment to an API or tool. I quite insist on this, because in the last 15-16 years I’ve seen this many times in my professional career and you don’t want this to happen to your released product. Of course, if the API serves you well for your current running project and you don’t plan on changing something critical then it’s fine, go for it. But I really like the fact that tflite-micro is open.

I’m a bit puzzled about tflite. At this point, the only reason I can think of for using tflite-micro over x-cube-ai is if you want to port your code from a tflite project which already runs on your application CPU (and Linux) to an MCU, in order to test, prototype and decide if it’s worth switching to an MCU as a cheaper solution. Of course, the impact of tflite on the performance is something that needs consideration and currently there’s no rule of thumb for how much slower it is compared to other APIs and on specific hardware. For example, in the STM32F7 case (and for the specific model) it’s 1.64x slower, but this figure might be different for another MCU. Anyway, you must be aware of these limitations, know what to really expect from tflite-micro and how much room you have for performance enhancement.

There is another nice thing about tflite-micro though. Because it’s open source you can fork the git repo and then spend time optimising the code for your specific hardware. Definitely the performance will be much, much better; but I can’t really say how much, as it depends on so many things. Keep in mind also that tflite-micro is written in C++ and some of its hocus pocus may have a negative impact on the performance. But at least it remains a good alternative option for prototyping, experimentation and development down to its core. And that’s the best thing about open source code.

Finally, x-cube-ai limits your options to the STM32 series. Don’t get me wrong, this MCU series is great and I use stm32 for many of my projects, but it’s always nice to have an alternative.


The x-cube-ai is fast. It’s also easy to use and develop on, it has those ready-to-build apps and the template to build your project, and everything is in an all-in-one solution (CubeMX). But on the other hand it’s a black box, and don’t expect much support if you’re not a big player.

ST was very active the last year. I also liked the STM32-MP1 SBC they released with Yocto support from day one and mainline kernel support. They are very active and serious. Although I still consider the whole HAL Driver library a bloated library (which it is, as I’ve proven in previous stupid-projects), I didn’t have any issues; but I also didn’t write much code for these last two projects (I had serious issues when I tried a few years ago).

Generally, the code is focused around the NN libs performance and not the MCU peripheral library performance, but you still need to consider those things when you’re evaluating platforms to start a new project.

From a brief look at the source code, though, it seems that you can use the x-cube-ai library without the HAL library, but you would need to port some bits to the LL library to use it with that one. Anyway, that’s me; I guess most people are happy with HAL, so…

In my next post, I will use a jetson-nano to run the same inference using tflite (not micro) and an ESP8266 that will use a REST-API. Also, TensorRT seems nice. I may also try this for the next post, we’ll see.

Update: Next post is available here.

Have fun!

Machine Learning on Embedded (Part 3)


Note: This post is the third in the series. Here you can find part 1, part 2, part 4 and part 5.

Edit (24.07.2019): I’ve made a stupid mistake and I haven’t used the hard float acceleration of the FPU on the STM32F7. This explains the terrible performance I had. But even with the FPU enabled, although the NN is now 3x faster, it’s still much, much slower compared to the x-cube-ai implementation from ST (~13x slower).

Edit (19.04.2020): I’ve found out that the code was built with DEBUG on, thus the long time needed to run the inference. I’ve updated the tables and conclusions based on the new times.

In the previous posts (part 1 and part 2), I’ve done a benchmark of two very simple NNs on various different MCUs. That was a naive implementation of a 2-input, 1-output and a 2-input, 32-hidden, 1-output network. Of course, as you can imagine, this is hardly enough to do anything useful in the ML domain, but it works fine for benchmarking. My plan next was to run a more complicated NN on the STM32F746 MCU. To do that I was about to use the X-CUBE-AI and I’ve started to work on it, but during the process I got fed up with the implementation, the lack of information around the API and some bits and tools that, although they are referenced, are nowhere available. I’ve also asked in their forum, but ST forums are not the place to get answers as the company doesn’t provide any help. Only other users do, but the concept is quite new and there are not many users to provide answers. Btw, this is my unanswered post in the ST community forums.

As I’ve also mentioned in the previous post, this domain is quite hot right now. There is a lot of development and many things are changing rapidly, which makes things that I’ve written 1 week ago obsolete by now. For example, a few hours ago the new 4.0.0 version of X-CUBE-AI was released, which supports a few things that are now very interesting to test, benchmark and compare. Anyway, I’ll get to that later.

You’ll find all the source code and files used for this project here:

So, let’s step into it…

TensorFlow Lite for microcontrollers

In the first post, I had a very interesting chat in the comments section with Raukk, who provided several suggestions (you can see the comments at the end of this post). At some point he suggested to have a look at the TensorFlow-Lite API for microcontrollers and then, after reading the online documentation, I thought that this is what I need. I thought that this would make things easier. Normally, I would provide my findings at the end of the post, but I’ll do a spoiler. Well, it doesn’t! In the current state (unless I’m doing something terribly wrong!), the tflite micro API sucks in so many ways, but the most important is the very low performance. During the rest of the post I’ll elaborate on my findings.

TensorFlow (TF or tf) has a subset API, which is the TensorFlow Lite (tflite). As the name implies, this is a lighter version of the main API. The idea is that small application CPUs (like ARM) are able to run tf models with fewer dependencies, low latency and smaller size, which is great for medium/large embedded devices. Note that I’m referring to application CPUs and not MCUs. That seems to be working quite well for small Linux SBCs and also Android devices. Especially for Android there’s support for the NNAPI, but at the time this post is written there are also quite a few issues with various platforms. It’s still like a beta thing.

Anyway, at the same time, tensorflow tries to support even smaller CPUs from the MCU domain. TensorFlow Lite for Microcontrollers (tflite-micro) is an experimental subset of tflite that is meant to be a bare-metal, portable API. Portable, of course, means that although the API is bare-metal and can be compiled for virtually any MCU, there are still parts, like the HW acceleration, that need to be ported depending on the architecture. For example, Cortex-M4 and M7 have DSP instructions (there’s also a new NN library in CMSIS) that can be used to improve the performance. At the same time, tflite also provides other optimizations, like quantization, that can improve the performance on MCUs that don’t have HW accelerators. As we’ll see next though, because this API is very immature, not all of those things really work out of the box and there are still several bugs.

Nevertheless, this stupid project was quite fun, too, because I’ve put together a jupyter notebook that trains a MNIST model with TF, which you can then convert to tflite and upload to the STM32F7. You can also use the notebook to hand-draw your own digit and test it with both the notebook and the STM32F7. So the notebook can communicate with the STM32F7 and run inferences, cool stuff.

Therefore, this post will be limited only around TF-Lite for microcontrollers and the next post will be about the new X-CUBE-AI API.

Keep in mind that this subset is still new and a work in progress. I think it was quite a mistake to dive into it so early, but I had to, as x-cube-ai initially didn’t meet my expectations and I needed to proceed with my work on porting a keras model to an MCU. For that reason, I’ve decided to dig deeper into tflite-micro, as I think it will also be the API that dominates the market (at least in the near future).

I’ve also spent a few days making tflite-micro work the way I wanted. It was quite challenging and, in the end, I was completely disappointed by the procedure and the time it takes to set it up for use. The reason is that the API is a bit chaotic and under heavy development, but I’ll list the issues I had in more detail later in the post.

Training the MNIST TF model

First you need to train your model. Since I’m not an expert in the domain, I’ve chosen to port another Keras model from here to tflite. The procedure was easy and very straightforward, as the Keras structure seems to be almost the same as TF’s. The resulting model is here, so you can compare the two models:

MNIST for TensorFlow-Lite notepad

It’s better if you git clone the repo and open the notebook locally, so you can run the steps if you like. Of course, you definitely need to clone it locally in order to test the code with the STM32F7. In the notebook I’ve tried to put things in an order that makes sense, so first, in section 1, I’m creating the model using TF and then I’m doing the training and evaluation. For this project the model itself and its accuracy don’t really matter, as I want to focus on transferring the model to the STM32F7 and then benchmarking the tflite-micro API. So I’ll deal with the low-level technical stuff and not with how to create the perfect model.

Convert Keras model for use in embedded

After I’ve trained the model, in section 2, I’m converting the model to the tflite format. The tflite format is just a serialized and flattened binary format of the model, which uses the flatbuffers serialization library. This library is actually the coolest thing in tflite and it works quite well. I’ve also added a script in `jupyter_notebook/` which does the same thing and which you can run from the repo source path.
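As a quick sanity check on a converted file, you can look at the flatbuffer header directly. The sketch below assumes the standard flatbuffers layout, where a 4-byte file identifier follows the 4-byte little-endian root offset, and that tflite files carry the identifier `TFL3`. The helper name `looks_like_tflite` is made up for illustration, and the sample bytes are the first 12 bytes of the converted MNIST model.

```python
import struct

TFLITE_IDENTIFIER = b"TFL3"  # flatbuffers file identifier, stored at bytes 4..7


def looks_like_tflite(blob):
    """Return True if the buffer carries the tflite file identifier."""
    return len(blob) >= 8 and blob[4:8] == TFLITE_IDENTIFIER


# First 12 bytes of the converted MNIST model
header = bytes([0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33,
                0x00, 0x00, 0x00, 0x00])

# The first 4 bytes are the little-endian offset to the root table
(root_offset,) = struct.unpack_from("<I", header, 0)

print(looks_like_tflite(header), root_offset)
```

To check a real file, read it with `open(path, "rb").read()` and pass the bytes to `looks_like_tflite()`.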

Converting the model to tflite was the first major issue I had and I’ve only managed to solve it partially. The problem is that by default all the model weights are float32 numbers, which means two things. First the model size is big as every float32 is 4 bytes and it takes a lot of flash and RAM. Second the execution will be slower compared to UINT8 numbers that tflite is supposed to support. Have a look at this snippet from the notebook:

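import tensorflow as tf  # TF 1.14, as used in the notebook

tflite_mnist_model = 'mnist.tflite'
# Load the trained Keras model and convert it to a tflite flatbuffer
converter = tf.lite.TFLiteConverter.from_keras_model_file('mnist_keras.h5')
# Uncomment one of the next two lines to enable post-training quantization
# converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
# Write the flatbuffer to disk; write() returns the number of bytes written
with open(tflite_mnist_model, "wb") as f:
    flatbuffer_size = f.write(tflite_model)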

There are two lines which are commented out. The important one is `tf.lite.Optimize.OPTIMIZE_FOR_SIZE`. According to the RTFM, this will post-quantize the model and reduce the size of the weights from 32 bits to 8. That’s a 4x reduction in size. That works fine when converting the model, but it doesn’t work when the model is running on the MCU. If you build the code with OPTIMIZE_FOR_SIZE, then when the model is loaded in the MCU, you’ll get this error:

Only float32, int16, int32, int64, uint8, bool, complex64 supported currently.

This error comes from the `source/libs/tensorflow/lite/experimental/micro/` which is allocating RAM for each layer, and it seems like the converter tool converts some weights to int8 values instead of uint8, and int8 is not supported. At least not yet. Therefore, although quantization seems like a great feature, it just doesn’t work with the tflite-micro API! The same also stands for the `OPTIMIZED_UINT8` option. I’ve also seen another person complaining about this, so either we’re the only ones that tried it or we make the same mistake somewhere in the process.
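To make the uint8 versus int8 distinction a bit more concrete, here is a minimal sketch of affine (scale and zero-point) quantization in plain Python. It only illustrates the general idea of mapping float weights to 8 bits. It is not the converter’s actual implementation, and the function names and sample weights are hypothetical.

```python
def quantize(weights, qmin, qmax):
    """Map floats to integers in [qmin, qmax] using a scale and a zero-point."""
    lo = min(min(weights), 0.0)  # the represented range must include 0.0
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale
    zero_point = int(round(qmin - lo / scale))
    q = [min(qmax, max(qmin, int(round(w / scale)) + zero_point))
         for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the quantized integers."""
    return [scale * (v - zero_point) for v in q]


weights = [-0.5, 0.0, 0.25, 1.0]                  # made-up sample weights
q_u8, s_u8, zp_u8 = quantize(weights, 0, 255)     # uint8 range
q_i8, s_i8, zp_i8 = quantize(weights, -128, 127)  # int8 range

print(q_u8, zp_u8)
print(q_i8, zp_i8)
```

In this sketch both encodings take one byte per weight and differ only by a constant offset, but a kernel that is written for uint8 tensors still has to reject a tensor that is tagged as int8, which matches the kind of error shown above.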

Anyway, I’ll at least compare the resulting converted sizes, as I hope this will be fixed in the future. But for now keep in mind that you can only use float32 for all the weights. As I’ve mentioned earlier, this may change even in a few hours or it may take longer, who knows.

Even if you use quantization, although there’s a significant reduction in the model size, the result is still quite large for most small embedded devices, which don’t have enough flash. Post-quantization, though, has a large impact on the model size, as you’ll see in the next table. Post-quantization means that the quantization happens after training the model, but you can also use quantization during the training (according to the RTFM). Also, there are different types of quantization, but let’s have a look at the following table.

| Model file | Size (bytes) | Ratio | Saving % |
|---|---|---|---|
| Original file | 780504 | | |
| No quantization | 375740 | 2 | 51.8 |
| OPTIMIZE_FOR_SIZE | 99344 | 7.85 | 87.27 |
| OPTIMIZED_UINT8 | 97424 | 8.01 | 87.51 |
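The ratio and saving columns follow directly from the byte counts. As a small sketch (the function name `compression` is just for illustration), they can be recomputed like this:

```python
ORIGINAL_SIZE = 780504  # bytes, the "Original file" row

tflite_sizes = {
    "No quantization": 375740,
    "OPTIMIZE_FOR_SIZE": 99344,
    "OPTIMIZED_UINT8": 97424,
}


def compression(original, converted):
    """Return (size ratio, saving in percent) for a converted model."""
    ratio = original / converted
    saving = 100.0 * (1.0 - converted / original)
    return ratio, saving


for name, size in tflite_sizes.items():
    ratio, saving = compression(ORIGINAL_SIZE, size)
    print(f"{name:18s} {size:7d} bytes  ratio {ratio:5.2f}  saving {saving:5.2f}%")
```

The recomputed values can differ from the table in the last digit, depending on whether you round or truncate.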

From the above table it seems that OPTIMIZE_FOR_SIZE and OPTIMIZED_UINT8 make a huge difference compared to no quantization, but there’s no real difference in the produced size between the two. Keep in mind that if you want to use the OPTIMIZED_UINT8 flag, then you also need to make your model quantization-aware by adding this before you compile and fit your model. According to the RTFM this is how it’s done.

import tensorflow as tf
# Quantization aware training
sess = tf.keras.backend.get_session()

Finally, if you want to convert those models yourself using the script, these are the commands.

# Convert the keras model to tflite
python3 mnist.h5

# Convert the keras model to tflite and optimize with OPTIMIZE_FOR_SIZE
python3 mnist.h5 size

# Convert the keras model to tflite and optimize with QUANTIZED_UINT8
python3 mnist.h5 uint8

For the time being, forget about quantization and convert your model to tflite without any optimization. Now that you have your model converted from TF to tflite, there’s another step: you need to convert it to a C array. To do that you can run the following command:

xxd -i jupyter_notebook/mnist.tflite > source/src/inc/model_data.h

This will create a C header file and place it in the source code (overwriting any existing file). Here is what you’ll find inside the file:

unsigned char jupyter_notebook_mnist_tflite[] = {
  0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33, 0x00, 0x00, 0x00, 0x00,
  ...
};
unsigned int jupyter_notebook_mnist_tflite_len = 375740;

This is the mnist.tflite converted to bytes. Don’t forget that mnist.tflite is a flatbuffer container. That means that the C++ structured model is serialized into this flatbuffer in order to be transferred to another platform or architecture. Therefore this C array will be deserialized at run time. Also note that, as a plain initialized array, it would normally end up in RAM, but you don’t want that. The non-optimized model size is 367KB, which is more than the available RAM on the STM32F746NG. Therefore, you need to force the compiler to store this array in flash, which means that you need to change your array to const like this:

const unsigned char jupyter_notebook_mnist_tflite[] = {
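If you’d rather not edit the generated header by hand every time, the same header can be produced with a few lines of Python. This is only a sketch that mimics the `xxd -i` output and adds the `const` qualifier. The function names `to_c_header`, `c_identifier` and `convert` are made up for this example.

```python
import re


def c_identifier(path):
    """Replace characters that are not valid in a C identifier, like xxd does."""
    return re.sub(r"[^0-9A-Za-z_]", "_", path)


def to_c_header(data, var_name, per_line=12):
    """Render bytes as a const C array, similar to `xxd -i` plus `const`."""
    lines = [f"const unsigned char {var_name}[] = {{"]
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    lines.append("};")
    lines.append(f"const unsigned int {var_name}_len = {len(data)};")
    return "\n".join(lines) + "\n"


def convert(src_path, dst_path):
    """Read a binary file and write it out as a C header."""
    with open(src_path, "rb") as f:
        blob = f.read()
    with open(dst_path, "w") as f:
        f.write(to_c_header(blob, c_identifier(src_path)))
```

For example, `convert("jupyter_notebook/mnist.tflite", "source/src/inc/model_data.h")` would write an array named `jupyter_notebook_mnist_tflite`, the same name that xxd derives from the path.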

That’s it! Now you have your model and weights flattened and ready to use with the tflite API on your microcontroller. As already mentioned in the online documentation here, only a subset of operations is currently supported, but the API is flexible enough to extend or build with more options if you like.

Porting TF-Lite micro to STM32F7 and CMAKE

TF-Lite for microcontrollers doesn’t support cmake. Instead there are some makefiles in the github repo that build specific examples to test. Although this may seem ok, it’s not, as it’s very hard to re-use those makefiles for your own projects, because they are written in a way that requires you to develop your application inside the repo. My preference in general is to keep things simple and separated, and this is why I generally prefer cmake. The problem with cmake is that you can achieve the same thing in many different ways and sometimes you may end up with builds that work, but are also very complicated. Of course, as the project complexity grows cmake also becomes a bit uglier, but anyway I believe that it’s far easier to maintain and scale and, most importantly, I always have my STM32 template projects in cmake. Therefore, I had to make tflite-micro build with cmake. That task took a while, as the makefile project does some magic in the background, like downloading other repos that are not in the source code (e.g. flatbuffers and gemmlowp).

In the end I’ve managed to do so, but the problem is that it’s not easy to update it. The reason is that the header file includes have paths relative to the main repo’s top folder, which is not the tflite folder but the TF API’s folder. For that reason, I had to run sed on all the source files that include header files.
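To give an idea of what such a rewrite looks like, here is a rough Python equivalent of a sed one-liner that changes the prefix of quoted includes. This is only a sketch and not the exact command I’ve used. The prefixes and the header name in the example are hypothetical, and the right mapping depends on how your include directories are set up in cmake.

```python
import re


def rewrite_includes(source, old_prefix, new_prefix):
    """Replace the leading path of quoted #include directives."""
    pattern = re.compile(
        r'^(\s*#\s*include\s*")' + re.escape(old_prefix), re.MULTILINE
    )
    # A function is used as the replacement, so backslashes in new_prefix
    # are never interpreted as regex escapes.
    return pattern.sub(lambda m: m.group(1) + new_prefix, source)


# Hypothetical input: one project-relative include and one system include
example = '#include "tensorflow/lite/example.h"\n#include <stdint.h>\n'
print(rewrite_includes(example, "tensorflow/lite/", ""))
```

Only quoted includes that start with the given prefix are touched, so system headers in angle brackets stay as they are.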

Things I liked and didn’t like about TF-Lite micro

I prefer to write this section here, before the benchmark, because it’s negative feedback and I prefer to keep the end of the post focused on the results. The whole procedure was a bit of a pain and I’m not satisfied with the outcome… I believe (or better, I hope) that in the future the tflite micro API will get better and easier to use. I mean, I expect more from Google. I had several problems when it came to actually using it, which I will try to address next. Keep in mind that at the time this article is written the version of tensorflow is 1.14.0-718503b075d, so in case the post is not updated, many things may have changed by the time you read this.


Pros:

  1. The thingy which is used for the automated tests is very interesting! I didn’t know that such a thing existed and it seems very promising for future use. You can use it to emulate HW, run your built binaries on it and integrate automated tests into your current CI infrastructure.
  2. The idea of having a common API that runs on every platform and architecture is really interesting and this is the main reason that I hope this API gets better. That means that you can (almost) just copy-paste your code from your Jetson nano or RPi and compile it on the STM32F7. That’s really awesome.
  3. It has Google’s support. Some of you might wonder why that’s a pro. I think it is, because it means that it will get more development effort, but of course that doesn’t mean that the result will be optimal. Only time will tell.


Cons:

  1. Documentation is currently horrible. Because of that, it’s very difficult to do even simple things. Sometimes the only way to really understand what’s going on is to read the source code. You may think that this is expected with most APIs, but this API is huge and that takes much more time! Better documentation would definitely help.
  2. It seems that you can achieve the same thing in many different ways, as there are quite a few duplicate implementations. So, when you’re looking for examples you may see completely different API calls that do the same thing. That makes it very difficult to plan your workflow, especially when you’re getting started with TF. I’ve read some people saying that this is a nice feature of the API. No, it’s not. An API should be clean and simple to use.
  3. It’s very slow… Much, much, much slower compared to x-cube-ai. Keep in mind that I’ve only managed to benchmark float and not quantized uint8 numbers. But my current rough estimate is that tf-lite micro is approx. 38x slower at running the same inference compared to X-CUBE-AI. That’s a really big number there…
  4. There are some examples for different microcontrollers, but the build system is a bit bloated and I find it over-engineered, which makes it difficult to build your own code.
  5. The build system is based on the make build automation tool, which I guess is ok, but it was really difficult for me to port to cmake, because the build scripts download other stuff in the background and there are many different pieces of code all over the place. Not that cmake makes things much better, but anyway…
  6. Because there are so many different pieces all over the place, the code doesn’t make much sense. While trying to port the build to cmake I’ve realized that it’s a spaghetti of files. The problem is that micro-tflite is a subset of tflite, which is a subset of tensorflow, but all those things are not kept distinct. At some point it would be nice if micro tflite was a separate github repo.
  7. There’s a list of supported platforms here. The problem with that list is that the example for the stm32f103 (bluepill) is in the github repo and you just call make to build it, but for the stm32f746 you need to download some tarball project files that contain the source files, including some unknown tflite version. The problem is that those files are already outdated! Also, why use keil project files in those tarballs? It’s a bit of a mess…
  8. Regarding the project files for the stm32f746 that I’ve mentioned in the previous point, why use Keil? I didn’t expect Google to enforce the usage of this IDE, because it’s only for Windows and also it doesn’t make any sense to use Keil when so many better FOSS alternatives exist. Again, my personal opinion is that cmake would make more sense, especially for embedded.
  9. The tflite-micro code is in C++11. Many will think, “OK, so what?”. The problem actually is that most low-level embedded engineers are not familiar with that language. Of course, you can overcome this by just learning it and, to be fair, the API is relatively easy and not much C++11 hocus-pocus is used. My main concern regarding C++ though is that it’s not easy to set up a C++ project for every microcontroller. For example, for the STM32, the CubeMX tool that is used to set up a project doesn’t support creating C++ projects from templates. Therefore, you need to spend time doing it yourself. For that reason, for example, I have my own cmake C++ template, but as I’ve said, porting tflite-micro to cmake was an adventure.
  10. Porting from the tflite build system to cmake isn’t sustainable in the long term. The reason is that there’s a lot of work that needs to be done. For example, all the header includes have hardcoded paths, which is not convenient for cmake, and in my case I had to remove all those hardcoded paths.
  11. Another annoying issue is that the size optimizations when converting an h5 model to tflite seem to be incompatible with tflite-micro. Others also complain about this issue. In the end only the non-optimized model can be used, but I guess it’s just a matter of time for that to be fixed.

I know that the cons list is much longer, but the main advantage is the unified API across all those different platforms and architectures. Currently the performance really sucks, but if this gets better then imho TF will become the de-facto tool.

Building the project

You can find the C++ cmake project here:

In the source/libs folder you’ll find all the necessary libraries, which are CMSIS, the STM32F7xx_HAL_Driver, flatbuffers, gemmlowp and tensorflow. All these are built as static libraries and then the main.cpp app is linked against those static libs. You will find the cmake files for those libs in source/cmake. The README file in the repo is quite thorough about the build options and the different builds, but here I’ll focus only on the accelerated build, which uses the CMSIS-NN API for the STM32F7. To build with this option, run this command:


This will build the project in the build-stm32 folder. It’s interesting to see the resulting sizes for all the libs and the binary file. The next table lists the sizes when using the current latest gcc-arm-none-eabi-8-2019-q3-update toolchain from here. By the time you read the article this might already have changed.

| File | Size |
|---|---|
| stm32f7-mnist-tflite.bin | 542.7 kB |
| libSTM32F7_DSP_Lib.a | 5.1 MB |
| libSTM32F7_NN_Lib.a | 598.8 kB |
| libSTM32F7xx_HAL_Driver.a | 814.9 kB |
| libTensorflow_lite_micro.a | 2.8 MB |

Normally you would wonder why you should care about the size of the static libs if only the binary size matters, and that’s a good point. But it does matter, because the RTFM of tflite-micro mentions that this lib is ~300KB. After testing this, the only way to achieve this size is to build a dynamic lib and then strip it, and then it gets to around 300KB. But this was not mentioned in the RTFM, so let’s say this is what they wanted to write in the first place. Btw, you can strip any of the above libs by running this:

arm-none-eabi-strip -s libname.a

BUT, you can’t strip statically linked libraries, because there won’t be any symbols left to link against :p . Anyway, keep in mind that the claimed size is only for dynamically linked libs, which of course doesn’t really matter for MCUs.

Finally, as you can see, the binary size is about half a megabyte. This is huge for an MCU. Most of this size comes from the `source/src/inc/model_data.h` file, which is the flatbuffer model of the NN and is already ~367 KB. The binary size with the model converted with the quantization optimizations would be 266 kB, but as I’ve said this won’t work with the tflite-micro API.

Model RAM usage

This table shows the RAM usage per layer when the flattened flatbuffer model is expanded into memory.

| Layer | Size in bytes |
|---|---|
| conv2d_7_input | 3136 |
| dense_4/Softmax | 40 |
| dense_4/BiasAdd | 40 |
| dense_3/Relu | 256 |
| conv2d_9/Relu | 2304 |
| max_pooling2d_6/MaxPool | 6400 |
| conv2d_8/Relu | 30976 |
| max_pooling2d_5/MaxPool | 21632 |
| conv2d_7/Relu | 86528 |
| Total | 151312 |

Therefore, you see that this model needs more than 151KB of RAM. The STM32F746 I’m using has 320KB of RAM, which is enough for this model, but 151KB is still quite a lot of RAM for embedded, so you need to keep such limitations in mind!
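The per-layer numbers are simply the number of elements in each output tensor multiplied by 4 bytes (float32). The shapes in the sketch below are my assumption, as they are inferred from the byte counts and not read from the model file, but they reproduce the table exactly:

```python
BYTES_PER_FLOAT32 = 4

# (tensor name, assumed shape); the shapes are inferred from the byte counts
tensors = [
    ("conv2d_7_input", (28, 28, 1)),
    ("dense_4/Softmax", (10,)),
    ("dense_4/BiasAdd", (10,)),
    ("dense_3/Relu", (64,)),
    ("conv2d_9/Relu", (3, 3, 64)),
    ("max_pooling2d_6/MaxPool", (5, 5, 64)),
    ("conv2d_8/Relu", (11, 11, 64)),
    ("max_pooling2d_5/MaxPool", (13, 13, 32)),
    ("conv2d_7/Relu", (26, 26, 32)),
]


def tensor_bytes(shape):
    """Size in bytes of a float32 tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * BYTES_PER_FLOAT32


for name, shape in tensors:
    print(f"{name:26s} {tensor_bytes(shape):6d}")

total = sum(tensor_bytes(shape) for _, shape in tensors)
print(total)  # 151312
```

The same arithmetic gives a quick estimate of the RAM needed by other models before you try to fit them on a target.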

Supported commands in STM32F7 firmware

After you build and flash the firmware on the STM32F7 (read the repo’s README file for more detailed info), you can use a serial port to either send commands via a terminal like cutecom or interact with the jupyter notebook. The firmware supports two UART ports on the STM32F7. In the first case the commands are just ASCII strings, but in the second case it’s a binary flatbuffer schema. You can find the schema in `source/schema/schema.fbs` if you want to experiment and change stuff. In the firmware code the handling of the received UART commands is done in `source/src/main.cpp` in the function `dbg_uart_parser()`.

The command protocol is plain and simple (115200,8,n,1) and its format is:

CMD=<ID>

where ID is a number and each number is a different command. So:
CMD=1, prints the model view
CMD=2, runs the inference of the hard-coded hand-drawn digit (see below)

This is how I’ve connected the two UART ports in my case. Also have a look at the repo’s README file for the exact pins on the connector.

Use the Jupyter notebook with STM32F7

In the jupyter notebook here, there’s a description of how to evaluate the model on the STM32F7. There are actually two ways to do that: the first one is to use the digit which is already integrated in the code and the other is to upload your hand-drawn digit to the STM32 for evaluation. In either case this will validate the model and also benchmark the NN. Therefore, all you need to do is build and upload the firmware, make the proper connections, run the jupyter notebook and follow the steps in “5. Load model and interpreter”.

I’ve written two custom Python classes which are used in the notebook. Those classes are located in the jupyter_notebook/ folder and each one has its own sub-folder.


The MnistDigitDraw class is using tkinter to create a small window on which you can draw your custom digit using your mouse.


In the left window you can draw your digit by using your mouse. When you’re done, you can press the Clear button if you’re not satisfied. If you are, then you can press the Inference button, which will actually convert the digit to the format that is used for the inference (I now think that this button name is not the best I could have used, but anyway). This will also display the converted digit on the right side of the panel. This is an example.

Finally, you need to press the Export button to write the digit into a file, which can be used later in the notebook. Keep in mind that the jupyter notebook can execute only one cell at a time. That means that as long as this window is not terminated the current cell is still running, so you first need to close the window by pressing the [x] button in order to proceed.

In my case, as I’m ambidextrous and I’m using two mice at the same time on my desk, I’ve managed to run several tests drawing digits with both my hands, as each of my hands produces a different output. I know it’s weird, but in the office I usually prefer to use the mouse with my left hand and at home with both, so I can rest my hands a bit.

After you export the digit you can validate it in the next cells, either in the notebook or on the STM32F7.


The FbComm class handles the communication between the jupyter notebook and the STM32F7 (or another tool which I’ll explain). FbComm supports two different means of communication. The first is serial comms using a serial port and the other is a TCP socket. There is a reason I’ve done this. Normally, the notebook uses the serial port to send data to and receive data from the STM32F7. Developing with this kind of communication is slow, as it takes a lot of time to build and flash the firmware on the device every time. Therefore, I’ve written a small C++ tool in `jupyter_notebook/FbComm/test_cpp_app/fb_comm_test.cpp`. Actually it’s mainly C code for sockets, but wrapped in a C++ file as flatbuffers needs C++. Anyway, if you plan on changing stuff in the flatbuffer schema it’s better to use this tool first to validate the protocol and the conversions and, when that’s done, just copy-paste the code to the STM32F7 and expect that it should work.

When you switch to the STM32F7 then you can just use the same class but with the proper arguments for using the serial port.


The files in this folder are generated by the flatc compiler, so you shouldn’t change anything in there. If you make any changes in `source/schema/schema.fbs`, then you need to re-run the flatc compiler to re-create the files. Have a look at the “Flatbuffers” section in the repo’s README file to see how to do this.

Benchmarking the TF-Lite micro firmware

Finally, we got here. But I need to clarify some things first.

I’ve implemented several different tests for the firmware in order to benchmark the various implementations of the tflite micro API. What I mean is that the depthwise_conv layer is implemented in 3 different ways in the API. The default implementation is in the `source/libs/tensorflow/lite/experimental/micro/kernels/` file. Then there is another implementation in `source/libs/tensorflow/lite/experimental/micro/kernels/portable_optimized/` and finally the one in `source/libs/tensorflow/lite/experimental/micro/kernels/cmsis-nn/`. I’ve added detailed instructions on how to build each case in the repo’s README file.

In `source/src/inc/digit.h` I’ve added a custom hand-drawn digit (number 5) that you can use to test the firmware and the model without having to send any data to the board. To do that, just send the command CMD=2. This will run the inference and at the same time benchmark the process for every layer and the total time it takes. Let’s see the results when running the benchmark in various scenarios.

The first column is the layer name and the others are the time in msec for each layer in 6 different cases, which are:

  • [1]: 216MHz, default
  • [2]: 216MHz, portable_optimized/
  • [3]: 216MHz, cmsis-nn/
  • [4]: 288MHz, default
  • [5]: 288MHz, portable_optimized/
  • [6]: 288MHz, cmsis-nn/

Edit (24.07.2019): The following table is with the FPU of the STM32F7 disabled, which was my mistake. Therefore, I just leave it here for reference. The next table is the one that has the FPU enabled.

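| Layer | [1] | [2] | [3] | [4] | [5] | [6] |
|---|---|---|---|---|---|---|
| DEPTHWISE_CONV_2D | 236 | 236 | 235 | 177 | 177 | 176 |
| MAX_POOL_2D | 23 | 23 | 23 | 18 | 17 | 17 |
| CONV_2D | 2347 | 2346 | 2346 | 1760 | 1760 | 1760 |
| MAX_POOL_2D | 7 | 7 | 7 | 5 | 5 | 5 |
| CONV_2D | 348 | 348 | 348 | 261 | 261 | 260 |
| SOFTMAX | 0 | 0 | 0 | 0 | 0 | 0 |
| TOTAL TIME | 2966 | 2965 | 2964 | 2225 | 2224 | 2222 |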

Edit (24.07.2019): This is the table with the FPU enabled.

| Layer | [1] | [2] | [3] | [4] | [5] | [6] |
|---|---|---|---|---|---|---|
| DEPTHWISE_CONV_2D | 18.69 | 18.7 | 18.77 | 14.02 | 14.02 | 14.08 |
| MAX_POOL_2D | 1.99 | 1.99 | 1.99 | 1.49 | 1.49 | 1.49 |
| CONV_2D | 91.03 | 91.08 | 90.94 | 68.48 | 68.49 | 68.54 |
| MAX_POOL_2D | 0.56 | 0.56 | 0.56 | 0.42 | 0.42 | 0.42 |
| CONV_2D | 12.52 | 12.51 | 12.49 | 9.41 | 9.39 | 9.39 |
| FULLY_CONNECTED | 1.48 | 1.48 | 1.48 | 1.11 | 1.12 | 1.11 |
| FULLY_CONNECTED | 0.03 | 0.03 | 0.03 | 0.02 | 0.02 | 0.02 |
| SOFTMAX | 0.01 | 0.01 | 0.01 | 0.007 | 0.007 | 0.007 |
| TOTAL TIME | 126.31 | 126.36 | 126.27 | 94.957 | 94.957 | 95.057 |

From the above table, you can notice that:

  • When the FPU is enabled, tflite is ~23.48x faster (oh, really?)
  • There’s no real difference with and without the DSP/NN libs acceleration
  • The CPU frequency has a great impact on the execution time (which is expected)
  • It’s quite fast, but not that much
  • The CPU spends most of the time in the CONV_2D layers.
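These observations can be checked with a few divisions on the totals of the two tables. The sketch below only re-uses the numbers from case [1] and case [4]:

```python
fpu_off_ms = 2966.0     # case [1], 216MHz, FPU disabled
fpu_on_ms = 126.31      # case [1], 216MHz, FPU enabled
fpu_on_288_ms = 94.957  # case [4], 288MHz, FPU enabled

fpu_speedup = fpu_off_ms / fpu_on_ms        # ~23.48
clock_speedup = fpu_on_ms / fpu_on_288_ms   # ~1.33
clock_ratio = 288 / 216                     # ~1.33
conv2d_share = (91.03 + 12.52) / fpu_on_ms  # ~0.82, both CONV_2D layers

print(f"FPU speedup:   {fpu_speedup:.2f}x")
print(f"Clock speedup: {clock_speedup:.2f}x (clock ratio {clock_ratio:.2f}x)")
print(f"CONV_2D share: {conv2d_share:.0%}")
```

The execution time scales almost exactly with the clock ratio and the two CONV_2D layers account for roughly 82% of the total time.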

I’m quite disappointed by the fact that the CMSIS DSP/NN library didn’t make any real difference here. I’ve spent quite some time integrating it in the cmake build and I was hoping for better results.

In case you want to overclock your CPU, have in mind that it may be unstable and the CPU can crash. I’ve managed to run the benchmark @ 288MHz, but when I was using the flatbuffers communication between the jupyter notebook and the STM32F7 then the CPU was crashing at a random point. I’ve used st-link with GDB to verify that this was the case and not a software bug. So, just be aware if you experiment with overclocked CPU.

If you want to use GDB with the code, then mind that although the -g flag is set in cmake, the elf file is stripped. Therefore, in the `source/CMakeLists.txt` file you need to find this line

-s \

and remove the -s from it and re-build. Then GDB will be able to find the symbols.

Evaluating on the STM32F7

This is an example image of the digit I’ve drawn. The format is the standard grayscale 28×28 px image. That’s an uint8 grayscale image [0, 255], but it’s normalized to a [0, 1] float, as the network input and output is float32.
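As a minimal sketch of that normalization step (the function name is made up, and I’m assuming a plain division by 255, which is the usual way to do it), the 28×28 uint8 image is flattened into 784 floats like this:

```python
def normalize_image(pixels):
    """Flatten a 28x28 image of ints in [0, 255] to 784 floats in [0, 1]."""
    flat = [p for row in pixels for p in row]
    if len(flat) != 28 * 28:
        raise ValueError("expected a 28x28 image")
    return [p / 255.0 for p in flat]


# Dummy image for illustration: black background with one white pixel
image = [[0] * 28 for _ in range(28)]
image[14][14] = 255

data = normalize_image(image)
print(len(data), min(data), max(data))
```

The 784 elements are the same count that shows up as “Num of elements” in the log when the image is sent to the board.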

After running the inference on the target we get back this result.

Comm initialized
Num of elements: 784
Sending image data
Receive results...
Command: 2
Execution time: 126.329910 msec
Out[9]: 0.000000
Out[8]: 0.000000
Out[7]: 1.000000
Out[6]: 0.000000
Out[5]: 0.000000
Out[4]: 0.000000
Out[3]: 0.000000
Out[2]: 0.000000
Out[1]: 0.000000
Out[0]: 0.000000

From the above output, you can see that the result is an array of 10 float32. Each index of the array represents the prediction of the NN for each digit. Out[0] is the digit 0 and Out[9] is the digit 9. So from the above output you see that the NN classifies the image as number 7. It’s interesting that Out[1], Out[2], Out[3] are not zero. I think it’s quite obvious why the NN made those predictions, because there are parts of 7 that are quite similar to 1, 2, 3. Anyway, in this case I’m getting the same prediction from the notebook and from the STM32F7. And that was the case for all my tests.
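Picking the predicted digit from such an output array is just an argmax over the 10 values. A tiny sketch, using the values from the log above:

```python
# Out[0]..Out[9], copied from the log above
outputs = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

# The predicted digit is the index with the highest score
predicted = max(range(len(outputs)), key=lambda i: outputs[i])
print(predicted)  # 7
```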

Conclusions (and a spoiler for part 4)

Before I close this post, I will make a spoiler for the next post that follows. I’ve already used the exact same model with the X-CUBE-AI and this is part of the result from an inference (with random input data, which doesn’t matter).

Running PerfTest on "mnistkeras" with random inputs (16 iterations)...

Results for "mnistkeras", 16 inferences @216MHz/216MHz (complexity: 2852598 MACC)
 duration     : 73.735 ms (average)
 CPU cycles   : 15926760 -458/+945 (average,-/+)
 CPU Workload : 7%
 cycles/MACC  : 5.58 (average for all layers)
 used stack   : 576 bytes
 used heap    : 0:0 0:0 (req:allocated,req:released) cfg=0

Do you notice something in there? The duration for the same model is 73.7 ms, instead of the 126.31 ms that tflite-micro needs at the same frequency. That’s ~1.71x faster!
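The numbers in that report are consistent with each other, which you can verify with a bit of arithmetic. The sketch below uses only the values printed above plus the tflite-micro total at 216MHz:

```python
cpu_hz = 216_000_000  # 216MHz core clock
cycles = 15_926_760   # "CPU cycles" from the report
macc = 2_852_598      # "complexity" from the report
tflite_ms = 126.31    # tflite-micro total, case [1] with the FPU enabled

xcube_ms = cycles / cpu_hz * 1000.0  # duration in msec
cycles_per_macc = cycles / macc      # average cycles per MACC
speedup = tflite_ms / xcube_ms       # X-CUBE-AI vs tflite-micro

print(f"{xcube_ms:.3f} ms, {cycles_per_macc:.2f} cycles/MACC, {speedup:.2f}x")
```

The cycle count divided by the clock gives the reported 73.735 ms, and the cycles per MACC come out at about 5.58, as in the report.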

I guess this difference is because the x-cube-ai uses internally INT8 for all the weights.

In the next post, I’ll do benchmarks with the X-CUBE-AI for the same model on the STM32F7 and then do a comparison.

Update: Part 4 is now available here.

Have fun!

Machine Learning on Embedded (Part 2)


Note: This post is the second in the series. Here you can find part 1, part 3, part 4 and part 5.

In the first part (here) we’ve designed, trained and evaluated a very simple NN with 3-inputs and 1-output. It will make more sense if you have a look at the first post before continuing with this.

So, in this post we will design a slightly more complex (but still simple) NN and we’ll follow the same procedure as in the first part. Design, train and evaluate. For consistency and to make it easier to compare, we’ll use the same inputs and training set.


The MCUs that we’re going to use are the same ones as in the previous post.

Another simple NN

Everything that is related to this project, for all the parts of the article, is in this bitbucket repo:

In the previous post we had a very simple NN with 3-inputs and 1-output. In this post we’ll have a NN with 3-inputs, a hidden layer with 32 nodes and 1-output. You can see that in the following picture:

You see that not all 32 nodes are displayed in the picture, but only h(0), h(1), h(2) and h(31). Also I haven’t added all the weights because there wasn’t enough space, but it’s easy to guess that they are similar to the ones from a(0).

Writing the mathematical equation for this NN is a bit more complex (only because it takes a lot of lines), but the logic behind it is the same. It’s just the dot product of the inputs and the weights between the inputs and the hidden layer and then the dot product of the hidden layer and the weights between the hidden layer and the output. Anyway, math doesn’t really matter for now.
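For those who prefer code over equations, here is a minimal sketch of that forward pass in plain Python. It assumes a sigmoid activation on both the hidden layer and the output and no bias terms, and it uses random, untrained weights just to show the data flow. The function names are made up for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_hidden, w_out):
    """Forward pass; w_hidden is [n_inputs][n_hidden], w_out is [n_hidden]."""
    n_hidden = len(w_out)
    # Dot product of the inputs with the weights of each hidden node
    hidden = [
        sigmoid(sum(inputs[i] * w_hidden[i][j] for i in range(len(inputs))))
        for j in range(n_hidden)
    ]
    # Dot product of the hidden layer with the output weights
    return sigmoid(sum(h * w for h, w in zip(hidden, w_out)))


random.seed(1)  # fixed seed, so the run is repeatable
w_hidden = [[random.uniform(-1, 1) for _ in range(32)] for _ in range(3)]
w_out = [random.uniform(-1, 1) for _ in range(32)]

y = forward([1, 0, 1], w_hidden, w_out)
print(y)
```

With trained weights in place of the random ones, the same function would produce the prediction for each of the 8 input sets.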

As the inputs are the same, the same table with all possible 8 input sets stands as before.

Training the model

Training this model is a bit more complicated than before. You can open the `Simple python NN (1 hidden).ipynb` notebook from the cloned repo in your Jupyter browser or you can just view it here. The python code seems almost the same, but in this case I’ve made some changes to support the hidden layer and the additional weights between each layer.

In step 2. in the notebook you can see that now the weights are a [3][32] array. That means 32 weights for each of the 3 inputs. That’s 96 weights only for the first two layers, plus another 32 weights for the next, which is total 128 weights! So you can imagine that this will need a lot more processing time to calculate and also that this number can grow really fast the more hidden layers or nodes/layer you add.
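To make the weight count concrete, here is a minimal sketch of the forward prediction for this topology in plain python. It is not the notebook’s code: the random weights, the names `w1` and `w2`, and the use of the sigmoid on both layers are assumptions made only for illustration.

```python
import math
import random

N_INPUTS, N_HIDDEN = 3, 32

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict(inputs, w1, w2):
    """Forward pass: inputs -> 32 hidden nodes -> 1 output."""
    # Each hidden node is the sigmoid of the dot product of the inputs
    # with one column of the [3][32] weight array.
    hidden = [
        sigmoid(sum(inputs[i] * w1[i][j] for i in range(N_INPUTS)))
        for j in range(N_HIDDEN)
    ]
    # The output is the sigmoid of the dot product of the hidden layer
    # with the 32 output weights.
    return sigmoid(sum(h * w for h, w in zip(hidden, w2)))

# Random (untrained) weights, only to show the shapes.
rng = random.Random(1)
w1 = [[rng.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(N_INPUTS)]
w2 = [rng.uniform(-1, 1) for _ in range(N_HIDDEN)]

n_weights = N_INPUTS * N_HIDDEN + N_HIDDEN
print("total weights:", n_weights)
print("untrained output for [1, 0, 1]:", predict([1, 0, 1], w1, w2))
```

Every prediction touches all 128 weights, which is where the extra processing time comes from.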

After we train the model we see some interesting results. I’m copying them here:

# Simple 2

[0 0 0] = [0.28424671]
[0 0 1] = [0.00297735]
[0 1 0] = [0.21864649]
[0 1 1] = [0.00229043]
[1 0 0] = [0.99992042]
[1 0 1] = [0.99799112]
[1 1 0] = [0.99988018]
[1 1 1] = [0.99720236]

Let’s see again the results from the previous post.

# Simple 1

[0 0 0] = [0.5]
[0 0 1] = [0.009664]
[0 1 0] = [0.44822538]
[0 1 1] = [0.00786466]
[1 0 0] = [0.99993704]
[1 0 1] = [0.99358931]
[1 1 0] = [0.9999225]
[1 1 1] = [0.99211997]

Do you see what just happened? In the previous NN with no hidden layer the prediction for [0 0 0] was 50% and for [0 1 0] it was 44%. With the new NN that has the hidden layer the prediction is much clearer (28% and 21%), and the NN predicts that those values are most probably 0. Therefore, with the same inputs and the same output, the new, more complex NN makes more accurate predictions.

A more complex NN doesn’t necessarily make better predictions. Actually, it might be the opposite. If you want to dig deeper you can read about NN over-fitting. Most probably even in this second case with the 32-node hidden layer the model is over-fitting, and maybe 8 nodes are more than enough, but I prefer to test this 32-node hidden layer in order to stress the MCUs with more load and get some insight into how these little boards cope with that load.

Evaluate on the MCUs

Now that we’ve designed, trained and evaluated our model in the Jupyter notebook, we’re going to test the NN on different MCUs.

What is important here is not whether the evaluation really works on the MCUs. I mean, that’s just code and of course it will work the same way and you’ll get similar results. Your results may just differ a bit between different MCUs, because we’re using doubles and the accuracy may vary.

C code

Regarding the NN prediction implementation in the C code, just have a look at the test_neural_network2() and benchmark_neural_network2() functions in the code. The rest is the same as I’ve described in the first post.

Supported serial commands

Again, please refer to the first post.

For this post the START=2 command was used in order to execute the benchmark with the second simple NN. In the previous post the benchmark results were obtained with the START=1 command. Keep in mind that if you want to switch from one mode to another you first need to send the STOP command.


You can find all the oscilloscope screenshots for the prediction benchmarks in the screenshots folder. The captures are the ones that have the simple2 in their filename. In the following table I’ve gathered all the results for the prediction execution time for each board. As the second NN takes more time you can ignore the toggle time as it’s insignificant. Here are the results:

| MCU | Prediction time (μsec) |
|---|---|
| stm32f103 @ 72MHz | 700 |
| stm32f103 @ 128MHz | 385 |
| Arduino Uno @ 8MHz | 5600 |
| Ard. Leonardo @ 16MHz | Oops! |
| Arduino DUE @ 84MHz | 686 |
| ESP8266-12E @ 160MHz | 392 |
| Teensy 3.2 @ 120MHz | 504 |
| Teensy 3.5 @ 168MHz | 363 |
| stm32f746 @ 216MHz | 127 |
| stm32f746 @ 295MHz | 92.8 |

As you can see from the above table I’ve lost the results for the Arduino Leonardo. But who cares. I mean it’s boringly slow anyway. I may try to re-run the test and update.

Now let’s think about a real-time application. As you can see the prediction time now has increased significantly. It’s interesting to see how much that time has increased. Let’s see the ratio between the NN in the first post and this.

| MCU | Prediction time ratio |
|---|---|
| stm32f103 @ 72MHz | 41.42 |
| stm32f103 @ 128MHz | 41.04 |
| Arduino Uno @ 8MHz | 48.95 |
| Ard. Leonardo @ 16MHz | |
| Arduino DUE @ 84MHz | 36.29 |
| ESP8266-12E @ 160MHz | 25.06 |
| Teensy 3.2 @ 120MHz | 42.857 |
| Teensy 3.5 @ 168MHz | 41.06 |
| stm32f746 @ 216MHz | 26.13 |
| stm32f746 @ 295MHz | 25.92 |

Let’s explain what this ratio is. This number shows how much slower the second NN executes compared to the first NN on the specific CPU. So for the stm32f103 the second NN needs 41 times the time that the first NN needed to predict the output. Therefore, the bigger the number, the worse the effect the second NN had on the MCU. In those terms, the stm32f103 seems to scale much worse than the stm32f746 and the esp8266. The stm32f746 and esp8266 really shine and scale much better than any other MCU. The reason, I guess, is the hardware FPU that those two have, which can explain the ratio difference, as the NN is actually just calculating dot products on doubles.
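If you want to reproduce the ratio column yourself, the following sketch simply divides the prediction times of this post by the ones measured in the first post. The dictionary names are mine; the numbers are the ones from the two tables (the Arduino Leonardo is left out because its result for this post is missing).

```python
# Prediction times in μsec for the first NN (part 1 table).
simple1_us = {
    "stm32f103 @ 72MHz": 16.9,
    "stm32f103 @ 128MHz": 9.38,
    "Arduino Uno @ 8MHz": 114.4,
    "Arduino DUE @ 84MHz": 18.9,
    "ESP8266-12E @ 160MHz": 15.64,
    "Teensy 3.2 @ 120MHz": 11.76,
    "Teensy 3.5 @ 168MHz": 8.84,
    "stm32f746 @ 216MHz": 4.86,
    "stm32f746 @ 295MHz": 3.58,
}

# Prediction times in μsec for the second NN (this post).
simple2_us = {
    "stm32f103 @ 72MHz": 700,
    "stm32f103 @ 128MHz": 385,
    "Arduino Uno @ 8MHz": 5600,
    "Arduino DUE @ 84MHz": 686,
    "ESP8266-12E @ 160MHz": 392,
    "Teensy 3.2 @ 120MHz": 504,
    "Teensy 3.5 @ 168MHz": 363,
    "stm32f746 @ 216MHz": 127,
    "stm32f746 @ 295MHz": 92.8,
}

# How many times slower the second NN is on each MCU.
ratios = {mcu: simple2_us[mcu] / simple1_us[mcu] for mcu in simple2_us}

# Print from the MCU that scales best to the one that scales worst.
for mcu, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{mcu:22s} {ratio:6.2f}")
```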

Therefore, here we have a good hint. If you want to run a NN on an MCU, first find one with a hard FPU, like Cortex-M4/7 or esp8266. Of those two, the stm32f746 is of course a much better option (but that also depends on the use case, so if you need a wifi connection then the esp8266 is also a good option). So, coming back to real-time applications, we need to keep in mind that the second NN is also a simple one, as we only have 3 inputs. If we had more inputs then we would need more time to get a prediction. Also, the closer we get to the millisecond area, the more MCUs are excluded from any real-time application that needs to make fast decisions. Of course, once again it always depends on the project! If for example you had a NN whose inputs were the axes of a 3D-accelerometer and you had a trained model that needed to predict a value according to the inputs, then maybe 700 μsec or even 500 μsec are ok. But they may not be! So it really depends on the problem you need to solve.


After finishing those tests I had mixed feelings. On one hand, I’ve managed to design, train and evaluate two simple NN models and test them successfully on all the MCUs. That was awesome. Of course, the performance is different and depends on the MCU. So, although I see some potential here, at the same time it seems that the performance drops quite a lot as the model complexity increases. But as I’ve said, it depends on the real use case you may have. You might be able to use an MCU to run the predict function, you might not. It all depends on the real-time requirements and the model complexity.

Let’s keep in mind that the tools are out there. There are many different MCUs, with different processing power and accelerators, that might fit your use case. New Cortex-M cpus are now coming with NN accelerators. I believe it’s a good time now to start diving into ML and the ways that it can be used with the various MCUs in the low embedded domain. Also there are many other HW platforms available in the market, like FPGAs with integrated application CPUs that can be used for ML. The market is growing a lot and now it’s a good time to get involved.

Update: next part is here.

Until then have fun!

Machine Learning on Embedded (Part 1)


Note: This post is the first in the series. Here you can find part 2, part 3, part 4 and part 5.

Since 2015 I have been following the whole machine learning hype closely, and after 4 years I can finally say that it is mature enough for me to get involved and try to play and experiment with it in the low/mid embedded domain. Although it’s exciting to get immediately involved in new tech, I believe that engineers should just keep an eye on the ones that seem to be valuable in the future or have the potential to grow into something that can be used in their domain. But at the same time engineers must be moderate and wait for the “hype” to fade and then get the really valuable information. This is what happened with machine learning in my case. Now I finally feel that it’s the right time to dig into this domain more seriously and that the tools and frameworks are mature and simple to use.

And this brings us to the fact that it’s one thing to implement and develop the tools and a different thing to use them to solve a problem. Until now, many engineers worked hard on the development of these tools, and now it’s much easier for us to just use them to solve problems. Keras, for example, is exactly that. It’s a really mature and beautiful framework to use and now it’s very stable. On the other hand, when you wait for this to happen, you then have a steeper learning curve in front of you, so from time to time it’s good to stay updated with what’s going on in the domains that you’re interested in.

Anyway, this time I’ve decided to make a completely stupid project to find the limits and the use cases of ML in the embedded world. Well, don’t get me wrong, this is not research, it’s just evaluating the current status of the tools and how they perform with the current low embedded technologies. Nowadays, when most engineers hear embedded they think of some kind of ARM application CPU that runs Linux on an SBC. Well, sure, that’s embedded too, and there are many of those SBCs these days and they are really cheap, but embedded is also those dirt cheap 8, 16, 32-bit RISC MCUs and also the Cortex-M series.

Let’s make some assumptions for the rest of this post. Let’s assume that low embedded is everything that is equal to or less than a Cortex-M MCU and high embedded is all the rest, the application CPUs that can actually run Linux. I know that there are also some smaller MMU-less MCUs that can run Linux, but let’s forget about that now. Also, from now on I’ll refer to machine and deep learning as ML, just for convenience. Although the terminology in the field is getting standardized, I’ll try to keep it simple, even if I’m not using the proper convention in some cases. Otherwise this post will become like those that I was reading myself in the beginning, which were really hard to follow. So, although there are some differences, let’s keep it simple. AI, deep learning, machine learning… I’ll just use ML! Also I’ll refer to a single neuron as a neuron or a node. Finally, I’ll refer to a neural network as NN.

This article will be split in 4 or 5 different posts. The first one (this one) will have some very generic information about ML and NN; but not in depth, as this is not the purpose of this post series. Also in this post we’ll implement a very simple NN with a single node with 3 inputs and 1 output and then run some benchmarks in various MCUs and analyze the results.

In the second part  I’ll use the same MCUs but with a bit more complex NN that has the same inputs, but a hidden layer with 32 nodes and 1 output. This NN will be more accurate in its predictions (as we’ll see), compared to the simple NN; but at the same time it will need more processing time to run the forward prediction. Please don’t expect to learn the terminology and details on ML here, but it will be much easier to follow if you already know some basic things around NN.

Excited already? No? Well, don’t forget it’s a stupid project. There’s always a small excitement in doing something useless. So, let’s move on.


Spoiler. For me, one of the most interesting things in this stupid project was the number of different boards that I’ve used to run those benchmarks. I think what I liked most was the fact that I was able to test all these different boards with the same code. For sure, the stm32f103 (blue-pill) was more optimized as I’ve used my own low level cmake template, but nevertheless I enjoyed having most of my boards running the same neural network code. Well, I didn’t use any of my PSoC 4 & 5, STM8, LPC1110, LPC1768 and a few other boards I have around, but I didn’t have more time to spend on this. Maybe at some later point I’ll add the benchmark for those, too.

STM32F103C8T6 (aka blue-pill)

This is my favorite board and my reference, so it couldn’t miss the party. I’ve run the benchmarks @72MHz and then I’ve overclocked the MCU @128MHz.


This is actually the STM32 F7 discovery board here, which is a cute development board with a lot of peripherals and cool stuff on it, but in this case I’ve only used a serial port and a gpio pin. As I like to overclock the stm32s, I’ve managed to overclock this mcu @295MHz. Beyond that it didn’t work for me.

Arduino Uno

I guess I don’t need to write more about this. Everybody knows it. It runs on an ATmega328p and it’s the slowest MCU in this comparison.

Arduino Leonardo

This is another Arduino variant with the ATmega32U4 cpu, which is a bit faster than the ATmega328p.

Arduino DUE

This is an arduino board that runs on an Atmel SAM3X8E MCU, which is actually an ARM Cortex-M3 running at 84MHz. Quite fast MCU for its release date back then.

Teensy 3.2

The teensy is a very interesting board. It’s a bit expensive, sure. But it’s almost fully compatible with the Arduino IDE libraries and that makes it ideal for fast prototyping and testing. It’s based on a Cortex-M4 CPU and for the test I’ve used it overclocked @120MHz.

Teensy 3.5

This teensy board is using a Cortex-M4 CPU, too, but it runs on faster clocks. I’ve also used it overclocked @168MHz. The overclocking options for both teensy boards come easy and for free from within the Teensy plugin in the Arduino IDE. I had some issues with one library but nothing difficult to solve. More details in the file in each MCU code folder.


Yep, we all know this board. An L106 32-bit RISC CPU running up to 160MHz.

A simple NN

OK, so let’s now jump to the interesting stuff. Everything that is related to this project, for all the posts, is in this bitbucket repo:

Although it’s not the best thing to have all these different things in one repo, it makes more sense as it makes it easier to maintain and update. During this post series I’ll use different parts from this repo, so not everything you see there is for this first post.

Before we begin, in case that you want to learn some basics for NN then you can watch these videos in YouTube (1, 2, 3, 4) and also this playlist.

First let’s start with a simple NN. For this post, we’re going to use a single neuron with 3 inputs and 1 output. You can see that in the following picture.

In the above image we see the topology of a simple NN, which has 3 inputs and 1 output. I won’t get into the details of the math. In this case the output is simple to calculate and it’s:

y = a0*w0 + a1*w1 + a2*w2

This is the dot product of a(n) and w(n), where n=0,1,2. Just be aware that a(n) is not a function, it just means a0, a1, a2. The same goes for w(n). So, a(n) are the inputs and w(n) are the so-called weights. You can think of the weights as just numbers whose size controls the effect that each a(n) has on the output result. The higher w(n) is, the more a(n) affects y.

The output is not y, though. The output is the sigmoid of y, so:

output = sigmoid(y)

What the sigmoid does is limit the output between 0 and 1. So the more negative y is, the nearer the output is to 0, and the more positive y is, the nearer it is to 1. In the ML world this function is called an activation function.
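In python, the two formulas above fit in a few lines. This is only a sketch to show the idea: the weight values are made up for illustration and are not the trained values from the repo.

```python
import math

def sigmoid(y):
    """Limit y to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def predict(inputs, weights):
    """y = a0*w0 + a1*w1 + a2*w2, then output = sigmoid(y)."""
    y = sum(a * w for a, w in zip(inputs, weights))
    return sigmoid(y)

# Illustrative weights only (hypothetical, not taken from the notebook).
weights = [9.67, -0.21, -4.63]

for a in ([0, 0, 0], [0, 0, 1], [1, 0, 0]):
    print(a, "->", round(predict(a, weights), 4))
```

Note that with all inputs at 0 the dot product is 0 whatever the weights are, so the output is always sigmoid(0) = 0.5.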

For this project we assume that a(n) is a single binary digit (0 or 1). Therefore, since we have 3 inputs then all the possible combinations are in the following table:

a0 a1 a2
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

For simplicity, you can think of those inputs as 3 buttons connected to 3 gpio pins on the MCU, and their state is either pressed or not pressed. Then, depending on their state, the output is also a binary value (0 or 1).

Training the model

Training the model means getting a set of inputs that we already know produce a specific output and then training the NN according to these. Then we hope/expect that the NN is able to predict the output for unknown inputs that it hasn’t been trained on. The training is not done on the target; it’s done separately on a workstation (or cloud) that has more processing power, and only the prediction function is finally executed on the MCU. Although this model is very simple, someone may argue that 2 inputs – 1 output is simpler :p . Although it’s simple enough, we’ll do the training on a workstation, as it’s important to use some tools that make the workflow easier.

To do that, it is better to use a jupyter notebook to do all the design, training, testing and evaluation. Also Jupyter notebooks are the standard documents that you’ll find in most new github projects. The simplest way to install Jupyter and the other tools we need is using miniconda. I’ll be really quick on this. I’m using Ubuntu, so the following commands are for that OS.

  • Download miniconda from here
  • Install miniconda
  • Create a new environment and install the tools you’ll need
    # Create a new environment
    conda create -n nn-env python
    # Activate the environment
    conda activate nn-env
    # Now install those packages to that environment
    conda install -c conda-forge numpy
    conda install -c conda-forge jupyter
    conda install -c conda-forge scikit-learn
    conda install -c conda-forge tensorflow-gpu
    conda install -c conda-forge keras

    Not all of the above packages are needed for this example, but we’ll use them later.

  • Next git clone the repo for this project and run Jupyter.
    git clone
    cd machine-learning-for-embedded
    jupyter notebook &

    If everything goes right, then now you should be able to see the web-interface from Jupyter in your browser. If not, then I guess you need to do some google-fu. In this web interface you would see a folder with the name jupyter_notebooks. Just double click on that and there you’ll find all the notebooks for this project. The one we need for this post is the `Simple python NN.ipynb`. Just click on that.

    What you see in there is some mix of markdown text and python code. For the simple cases of the first two parts we’re going to implement the NN with just python code, without using any advanced library like tensorflow or keras. The reason for this is that we can write code that we can later convert to simple C and run tests on the different MCUs.

Again, I won’t go into the details of Jupyter notebooks and python. I guess there are plenty of tutorials on the internet that are much better than any explanation I can provide.

Let’s see the notebook now.

Note: In case you just want to view the notebook and evaluate your results, you don’t have to install Jupyter, but instead you can just view the notebook in the bitbucket repo here.

First we import some functions from numpy to simplify the code. Then we create a NeuralNetwork class that is able to train and evaluate our simple NN. Then we create a training set for our binary inputs. As we’ve seen before, 3 binary inputs have 8 possible combinations and we choose to use a training set of 4 inputs. That means that we’ll train our NN with only 4 out of 8 combinations and then expect the NN to be able to predict the rest by itself. So we train with 50% of the possible values.

Then we create an array with the 4 inputs and the 4 outputs. After that we initialize the NeuralNetwork class and view the random weights. A note here is that the weights always have random values in the beginning. The meaning of training is to calculate those weights (if you prefer the mathematical approach, it is to find where the slope of the function I’ve mentioned before is minimum or ideally zero). Another note is that when you run this notebook in your browser you may get different values after each training (you shouldn’t, but you may). Those values should be similar to the ones in the notebook, but they also might differ a bit. In this case, if you just want to evaluate your results with the C code that runs on the MCUs, then keep in mind that you may need to change the weights in the MCU code according to your weights. By default, the weights in the C code are the ones that you get in the repo’s notebooks without executing any cells.

Finally, we train the model to calculate the weights and then we evaluate the model with all the possible input combinations. For convenience I’m copying my results here:

[0 0 0] = [0.5]
[0 0 1] = [0.009664]
[0 1 0] = [0.44822538]
[0 1 1] = [0.00786466]
[1 0 0] = [0.99993704]
[1 0 1] = [0.99358931]
[1 1 0] = [0.9999225]
[1 1 1] = [0.99211997]

From the above output we see that for the values that we used during training the predictions are very accurate. This output is from the stm32f103; as you’ll find out, the code compiled for the Arduinos doesn’t give that floating point precision when it converts the doubles to strings. As I’ve mentioned before, in the output we get values from 0 to 1. That’s the prediction of the NN, and the closer it is to 0 or 1, the higher the probability that the output has that value (because in this example it happens that we have a binary output, so it’s 0 or 1 anyway). So in the case of the training inputs [[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]] we see that the accuracy is much better compared to the unknown inputs like [0 0 0] and [0 1 0]. For the first input especially, it’s not actually possible to say if it’s 0 or 1, as it stands right in the middle. Ouch!
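If you just want the gist of the training workflow without opening the notebook, here is a rough sketch in plain python (the notebook itself uses numpy and a NeuralNetwork class). The training outputs [0, 1, 1, 0], the 10000 iterations, the seed and the update rule are my assumptions for this sketch; I’ve inferred the outputs from the results above, where the prediction seems to follow the first input.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(inputs, weights):
    """Forward prediction: sigmoid of the dot product."""
    return sigmoid(sum(a * w for a, w in zip(inputs, weights)))

def train(inputs, targets, iterations=10000, seed=1):
    """Nudge the weights towards a smaller error on every iteration."""
    rng = random.Random(seed)
    # The weights always start with random values.
    weights = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    for _ in range(iterations):
        outputs = [predict(row, weights) for row in inputs]
        for j in range(3):
            # input * error * slope of the sigmoid, summed over the set
            weights[j] += sum(
                row[j] * (t - o) * o * (1.0 - o)
                for row, t, o in zip(inputs, targets, outputs)
            )
    return weights

# The 4 training inputs from the post and their assumed outputs.
train_inputs = [[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]]
train_outputs = [0, 1, 1, 0]

weights = train(train_inputs, train_outputs)

# Evaluate all 8 possible input combinations.
for n in range(8):
    a = [(n >> 2) & 1, (n >> 1) & 1, n & 1]
    print(a, "=", predict(a, weights))
```

The exact numbers depend on the initial random weights, so don’t expect them to match the notebook digit for digit.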

Evaluate on the MCUs

Now that we’ve designed, trained and evaluated our model in the Jupyter notebook, we’re going to test the NN on different MCUs.

What is important here is not actually whether the prediction really works on the MCUs. I mean, that’s just code, of course it will work the same way and you’ll get similar results. Your results might differ a bit because we use doubles, which may differ from one architecture to another. What is important, though, is the performance!

That’s all we care about eventually, right? And that was the main drive for me to create this project. To find out how those MCUs perform with simple or more complex NNs. Is it possible to run a NN in real-time? Does it even make sense to do that on an MCU? What should you expect? Is it worth it? Can those tiny MCUs give good performance? What are the limits? Is it maybe better to convert a NN problem to an algorithmic one in order to run it on an MCU? Are nested ifs, lookup tables and Karnaugh maps still a better alternative? And a lot of other questions.

Just be sure that I’m not going to answer all those things here though, as there are a lot of different parameters per project and use case. But by doing this yourself, you should definitely get an idea of the performance, the potentials and the limits that exist with the current technologies.

The evaluation on the MCUs is split in 3 different cases. The stm32f103 has its own code in the `code-stm32f103` folder. The stm32f746 also has its own code folder (code-stm32f746), as do the esp8266 and the arduino due. For the other arduinos and teensy boards you can use the code-arduino folder.

Just a parenthesis here. People that read my blog more often probably know that I’m more of a baremetal embedded guy. I enjoy doing stuff with more stripped down libraries, even CMSIS for the Cortex-M. Although I mention from time to time that I don’t like using the Arduino IDE or HAL libraries, I’ve also mentioned that I find these very useful for cases like this. Prototyping! Therefore, using those tools for prototyping is an excellent choice and a no-brainer. We need to choose and use our tools wisely and where they fit best every time. So when evaluating a case or project on different HW it always makes sense to use those tools for prototyping and then write the proper low embedded code where it is needed. OK, enough with this parenthesis.

You’ll find details on how to build and run each code on each MCU in the README files in the project folders.  Hence, I’ll only mention the serial protocol that I’m using and also how it works in the background.

C code

The c code is really simple for this example. The dot product and the sigmoid function are implemented in the neural_network.h/c files, and from the main.c file we just call the prediction() function (which is just the sigmoid(dot()) function). The same .h and .c files are used for all the different codes. Also the weights for this example are the double weights[] array in main.c and the inputs are the double inputs[8][3] array, again in main.c. For now just ignore the double weights_1[32][3] and double weights_2[] arrays, which are used for part 2.

Finally, two other important functions for this example are benchmark_neural_network() and test_neural_network(). Both are triggered with commands from the serial port. The test function will just print the prediction for all the input combinations in order to compare them with the jupyter notebook, and the benchmark function will run a single prediction and at the same time toggle a pin in order to measure the time the function has taken with an oscilloscope.

Supported serial commands

In order to simplify testing I’ve created a couple of commands. In the case of the stm32 you can connect to the serial port at 115200 bps, and for the rest of the MCUs that use the .ino project you need to connect at 9600 bps (anyway it’s either 9600 or 115200).

The supported commands are the following (all commands expect a newline in the end):


TEST=<mode>

where <mode>: 1 or 2

This command will evaluate all the 8 possible inputs by running the prediction using the calculated weights and will print the output. Then you can compare the output with the output from the jupyter notebook.

Mode 1, is using the simple1 NN and its weights. The simple1 NN is the one we use on this post with 3 inputs and 1 output.

Mode 2, is using the simple2 NN and its weights. The simple2 NN is the one that we use in part 2 with 3 inputs, a hidden layer with 32 nodes and 1 output.

Note: If you run the TEST commands on any arduino build firmware you’ll get a bit disappointed, as for some reason the Serial.print function prints double values with only 2 decimals. That’s a bit crap. I’ve read that there are some ways to fix this, but it doesn’t really matter. It only matters that the predictions are correct enough. With the stm32 that’s not an issue, and you will get pretty much the same accuracy as the python output.


START=<mode>

where <mode>: 1 or 2 (has the same meaning as before)

This command starts a timer that every 3 seconds will run the prediction function and also toggle a gpio in order to help us make precise measurements. By default, the prediction is using the first input set [0 0 0]. That doesn’t really matter as it doesn’t affect the computation timing, but you can change it in the code if you like. You can verify that mode 1 is much faster than mode 2, but we’ll have a look at it in the next post.


STOP

The STOP command just stops the timer that was triggered with the START=<mode> command.
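If you prefer a script over a serial terminal, a small python helper along these lines could send the commands. It is a sketch under several assumptions: it relies on the third-party pyserial package, the port name `/dev/ttyUSB0` is just a placeholder, the `TEST=<mode>` form is assumed by analogy with `START=<mode>`, and the number of reply lines is a guess.

```python
def build_command(name, mode=None):
    """Return the bytes of one command, terminated by a newline."""
    text = name if mode is None else f"{name}={mode}"
    return (text + "\n").encode("ascii")

def run_test(port="/dev/ttyUSB0", baudrate=115200, mode=1, lines=8):
    """Send a TEST command and return the reply lines as strings.

    The port name, the TEST=<mode> syntax and the number of reply
    lines are assumptions; adjust them to your board and firmware.
    """
    import serial  # pyserial, imported here so the helper above works without it

    with serial.Serial(port, baudrate, timeout=2) as ser:
        ser.write(build_command("TEST", mode))
        return [
            ser.readline().decode(errors="replace").strip()
            for _ in range(lines)
        ]

print(build_command("START", 2))
print(build_command("STOP"))
```

For the boards that use the .ino project you would pass baudrate=9600 instead.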


First I need to mention that the best way to measure the time that a code needs to run is by using an oscilloscope and a gpio pin. You can set the pin high just before you run your function, then run the function and then set the pin to low. Then by using the oscilloscope you can calculate the exact time the operation lasted.

There’s a catch though! Toggling a pin also takes some time, and that time is different for different hardware and even for different gpio libraries on the same hardware. For that reason, in the code you’ll find that every time I toggle the pin twice before running the NN prediction function. That way you can measure the time that those two toggles take and then subtract the average from the time that the prediction operation lasted. So, you measure the time of the two toggles, and if that time is Tt, then you measure the time between the HIGH and LOW of the prediction function, and the total time spent on the prediction will be:

Tp = Thl - (Tt/2)
Tp : Prediction time
Thl: Time of High-Low transition that includes the prediction function
Tt: Time of the two toggles

Anyway, let’s not complicate things more. The above only helps when the prediction function is fast, or when different MCUs have similar times and you want to remove the overhead of any GPIO handling that may differ between them.
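As a quick worked example of the Tp formula, the following sketch applies it to the stm32f103 @ 128MHz numbers. Treating the 290 nsec pin toggle time as Tt (the time of the two toggles) is an assumption on my side for the sake of the example.

```python
def prediction_time(t_hl, t_toggles):
    """Tp = Thl - (Tt / 2), all times in the same unit."""
    return t_hl - t_toggles / 2.0

# stm32f103 @ 128MHz, times in μsec (Tt = 0.290 is an assumption).
tp = prediction_time(t_hl=9.38, t_toggles=0.290)
print(f"Tp = {tp:.3f} usec")
```

For this board the correction is about 1.5% of the measured time, which is why it only matters for fast predictions.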

Note: I’ve included all the oscilloscope screenshots in the screenshots folder in the repo. Therefore, you can have a look on the oscilloscope output for each different MCU as I’m not going to post them all here (there are just too many).

Before posting the table of the results, these are the screenshots for the stm32f103 and the Arduino Uno. The name coding in the screenshots folder is <mcu>-<NN topology>-<frequency>-<capture>.png. That means that for the teensy 3.2 the screenshot for this simple example (simple1) and the pin toggle will be `teensy_3.2-simple1-120MHz-predict.png`. In the next post (part 2) the NN topology will be called simple2.

These are the captures for the toggle and prediction for stm32f103 and arduino uno.

stm32f103 @ 128MHz pin toggle time = 290 nsec

stm32f103 @ 128MHz prediction time = 9.38 μsec

Arduino Uno @ 8MHz pin toggle time = 15.5 μsec

Arduino Uno @ 8MHz prediction time = 114.4 μsec

Although you already get a rough idea, the next table summarizes everything.

| MCU | Pin toggle time (μsec) | Prediction time (μsec) |
|---|---|---|
| stm32f103 @ 72MHz | 0.512 | 16.9 |
| stm32f103 @ 128MHz | 0.290 | 9.38 |
| Arduino Uno @ 8MHz | 15.5 | 114.4 |
| Ard. Leonardo @ 16MHz | 21 | 116 |
| Arduino DUE @ 84MHz | 8.8 | 18.9 |
| ESP8266-12E @ 160MHz | 1.58 | 15.64 |
| Teensy 3.2 @ 120MHz | 0.830 | 11.76 |
| Teensy 3.5 @ 168MHz | 0.572 | 8.84 |
| stm32f746 @ 216MHz | 0.157 | 4.86 |
| stm32f746 @ 295MHz | 0.115 | 3.58 |

As you can see from the above table, the higher the frequency the better the performance (oh, really?). I haven’t subtracted the pin toggle time from the prediction time! Also note that although the Teensy 3.5 performs better than the stm32f103 @ 128MHz, its pin toggle time is almost double… That’s because those arduino libraries are implemented on top of bloated functions, even for just enabling/disabling a pin. Of course, the overclocked stm32f746 @ 295MHz is by far the fastest in all terms.

Also I’ve noticed something unrelated to the NN. If you look at the ratio of the (Prediction time)/(Pin toggle time), then you get some interesting numbers. Let’s see the following table:

| MCU | (prediction time)/(pin toggle time) |
|---|---|
| stm32f103 @ 72MHz | 33 |
| stm32f103 @ 128MHz | 32.34 |
| Arduino Uno @ 8MHz | 7.38 |
| Ard. Leonardo @ 16MHz | 5.52 |
| Arduino DUE @ 84MHz | 2.14 |
| ESP8266-12E @ 160MHz | 9.89 |
| Teensy 3.2 @ 120MHz | 14.16 |
| Teensy 3.5 @ 168MHz | 15.45 |
| stm32f746 @ 295MHz | 31.13 |

The above table shows what you can expect from your hardware and how those bloatware arduino libs hurt the overall performance. To be fair though, the NN code is not affected by the libraries, as it’s plain C code. But normally your MCU will also do other tasks and not only run the NN; therefore, everything else that the cpu does affects the NN performance, especially if the code uses bloated libraries. In this case we just toggle a pin and run a timer in the background, nothing else. Well, not true, the stm32f103 actually also runs a few other things in the background, but nevertheless it has the best prediction/toggle ratio. The Arduino DUE has the weirdest behavior, which doesn’t make sense, but it was consistent. I didn’t even bother to debug that, though. Anyway, the above table is the reason that sometimes I mention that prototyping is completely different from development. Prototyping is proof of concept, and after that, going into the low level will bring you the performance. If you don’t care about performance, then sure, pick the tool that suits your needs.


From this example we’ve seen that we can actually design, train, evaluate and test a NN with Jupyter and python and then run the forward prediction function on a small MCU. Isn’t that great? Yeah, I know… Using so much resources on those small MCUs to run a 3-input, 1-output NN deserves the title of the stupid project! Anyway, from the last tables we have some interesting results that you can also interpret as you think.

The most interesting is that we can actually use this NN for real-time applications! OK, don’t laugh. I know that this example it’s useless, but you can! Even the 114.4 usec of the Arduino is ok’ish for fast real-time applications. Of course, it depends on the case and the specs. I mean if you expect you inputs to change faster than that, of course you can’t use it. But think buttons for now! 😛

It’s really fast and even an Arduino Uno can handle this NN; roughly 100 μsec is really fast. Oh, wait. That brings us to another question. If the inputs are buttons, then why not create a nested-if function and handle them much, much faster?

Even better, why not create a lookup table? Maybe even create a Karnaugh map of the inputs/outputs and reduce that to a couple of logic operations. That would work really, really fast!
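As a rough sketch of the lookup-table idea, the three binary inputs can be packed into a 3-bit index into an 8-entry table. The truth table below is a made-up placeholder (a simple majority vote), not the data set the NN was trained on, and the function names are only for illustration. The second function is what a Karnaugh-map reduction of that same placeholder table would give.

```python
# Hypothetical truth table: the output is 1 when at least two inputs are 1.
# Replace it with the real input/output sets of your own network.
LOOKUP = [0, 0, 0, 1, 0, 1, 1, 1]  # index = (a << 2) | (b << 1) | c


def predict_lut(a: int, b: int, c: int) -> int:
    """Return the output for three binary inputs with a single table access."""
    index = (a << 2) | (b << 1) | c
    return LOOKUP[index]


def predict_logic(a: int, b: int, c: int) -> int:
    """Same placeholder function after Karnaugh-map reduction: ab + ac + bc."""
    return (a & b) | (a & c) | (b & c)
```

On an MCU the same thing would be a small constant array in C; either way there is no floating-point math at all, which is why it would beat the NN by a wide margin for such a small input set.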

Well, as I said, this is a very simplified example. I mean, this is just for testing and is not meant to do anything really usable. But on the other hand, what if instead of 3 inputs we had 128? Or 512? Then it would be really difficult to make a Karnaugh map and simplify it. Or we would need to write a ton of if-else cases. And what would happen if we needed to change something in the input or output sets? Then it would also mean quite some work in the code. Maybe the lookup table is still a valid and good solution, though. It will cost RAM or FLASH space, but the weights of the NN will also take a lot of space. So you would need to compare how much space each solution uses, and then, if the NN needs less space, decide whether less space is more important than execution speed.
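To put some rough numbers on that space comparison, here is a back-of-the-envelope sketch. It assumes binary inputs, a full table with one 1-bit output per input combination, and a NN with a single hidden layer of 32 nodes stored as 32-bit floats (weights plus biases). These are assumptions for illustration only; a real network for 128 inputs might well need a different hidden layer size.

```python
def lut_bytes(n_inputs: int) -> int:
    """Bytes for a full table with one 1-bit output per input combination."""
    entries = 2 ** n_inputs
    return (entries + 7) // 8  # round up to whole bytes


def nn_bytes(n_inputs: int, hidden: int = 32, n_outputs: int = 1,
             bytes_per_param: int = 4) -> int:
    """Bytes for the weights and biases of a one-hidden-layer network."""
    params = n_inputs * hidden + hidden        # input->hidden weights + biases
    params += hidden * n_outputs + n_outputs   # hidden->output weights + biases
    return params * bytes_per_param


for n in (3, 16, 32, 128):
    print(n, lut_bytes(n), nn_bytes(n))
```

Under these assumptions the table wins easily for 3 inputs (1 byte against 644 bytes), is already larger at 16 inputs (8192 bytes against 2308 bytes), and a full table for 128 binary inputs would need 2^125 bytes, so it only stays realistic when the input count is small or when only a small set of input patterns can actually occur.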

It’s important to realise that ML doesn’t improve every problem we have, nor is it a magic tool that solves all our engineering problems. It’s a tool that seems to have the potential to solve some issues that were very complicated to solve before. It may also apply to problems that we already have solutions for, where ML may provide some flexibility we didn’t have before.

In the next post here, I’ll do the same for a slightly more complex NN with 3 inputs, a hidden layer with 32 nodes and 1 output.

Until then have fun!

Losing the wagon


This post is not about a stupid-project, but it’s a bit more philosophical and it’s about losing the wagon. Well, life has many wagons, but let’s narrow it to the technological and engineering wagon. This is an engineering blog after all.

The last couple of days I was exploring the current state of the home automation domain, and specifically KNX. I started developing for the KNX bus back in 2007. The trigger was a friend of mine, who’s an electrical engineer and started talking about this fancy new KNX bus, derived from the Instabus, around 2006-2007 (if I remember correctly). He got my attention, as he had already made some KNX installations, and soon I got involved in it. I was fascinated by it and I wanted to start building stuff around it.

The KNX standard was supposed to be an open standard at the time, but it wasn’t really. Back then there was only a little information around and you needed to buy the specifications (which were expensive). So, I had to do a lot of stuff by myself. The only thing that was available was the BCUSDK. This project started in 2005, but it was all that I needed. From this code I managed to extract and understand the protocol and most things around it. The details weren’t obvious, of course, because the code wasn’t documented, but having the code and some KNX devices to experiment with was enough to do everything I wanted. A friend of mine (also an engineer) got fascinated with it too; soon we got our KNX certification and in no time we’d developed a whole platform around it. This included APIs, GUI interfaces and gateways to many standard protocols used at the time (IP, RS232, RS485), as well as gateways to GSM modules, GPS, alarm systems and several other things.

Well, it was brilliant at the time and there wasn’t anything like that back in 2007. We could beat any competition. And then… for some reason we just stopped. I don’t even remember the excuse at the time, to be honest. But I know the real reason now. That was around 2008-2009.

Now I’ve checked again, and the KNX automation domain has completely transformed into a huge market and code-base. Several different APIs exist for several programming languages: Python, C/C++, even a KNX module for Qt. I wrote a KNX module in Qt in 2008 by myself, and now I’ve seen that last year a new Qt module for that was released. After 10 years!

So, I was 10 years ahead of the market. I saw the wagon of this train more than 10 years ago, with all its potential. I developed a whole system around it and I let it die, thus losing the wagon and the train. Now you can find almost everything and a lot of it is open source, which is great. There is even a Yocto layer with all those tools included. It’s the meta-calaos.

Trust me, it may seem a bit disappointing to realize that you’ve lost the wagon and to see where the market has ended up today, knowing that you were there 10 years ago and just did nothing. But is it, really? When this happens, the most reasonable thing to do is ask yourself why. Actually, not only a single why but several whys, and then, once you find the reasons, make some decisions for yourself and even get to know yourself better.

And this is what this post is all about.

Some thoughts

I guess I’m not the only one who has been in this situation. Some of you know exactly what I’m talking about and have already been there at least once. Well, for me this was not the first time either. I’ve had that more than once, but the above case hit me harder, as I was a pioneer and 10 years ahead of the rest of the market. Well, in my case I know why I failed. The reason is that I’m a “lazy” engineer. I’ll come back to this phrase later.

I’ve seen many engineers in my life. Mostly “not-really-good” engineers, by my standards. Although in my professional career I’ve been told that I’m a good engineer and I know that I’m capable of doing stuff, at the same time I don’t consider myself a -good- engineer. What is a -good- engineer after all? No one should consider himself a -good- engineer. If you do that, then it’s over. Of course, when it comes to the professional aspect you need to present yourself as a good engineer, and it’s easier to do that if others already believe it. But in the end I consider myself just an engineer. And this is a good and a bad thing at the same time.

Being an engineer is only a part of what you are in your professional career. You’re not only an engineer. You are also a salesman, a manager, a director and a CEO. You’re everything at the same time. At least you become those things after a few years in your domain; it’s the natural evolution which is called experience. But it’s the proportion of these qualities that you have that shapes and drives your professional career. Some people are better managers than engineers. Others might have more “CEO-like” qualities than the rest. So, you can’t have all the qualities at a high level at the same time. You may have one or two, but it’s extremely rare to have everything. But is that really a problem?

For example, I’m a “lazy” engineer. Lazy doesn’t mean that I’m really too lazy to do something. Actually it’s the opposite. I can drive myself to finish a project and complete it in the best and most optimal way. But then I need to do something else. I can’t stay on it for a long time. I can’t devote myself to a single project or domain and stay there forever. If I try to do that, then it makes me lazy in the end. I get bored and I start to hate what I do. And thus, I’m a “lazy” engineer. Well, at least until now I haven’t found a project or domain that I would like to stay in forever.

But being a “lazy” engineer has its flaws. For example, in this case I was 10 years ahead of the market and then I got bored. So, I got lazy. Therefore, I had to just drop everything and go to the next challenge. Otherwise, I would have doomed myself to a situation where I would hate what I’m doing. Maybe some of you can understand this, maybe others can’t. Not every engineer has to be the same. We have different qualities in different proportions, and that’s fine!

I’ve met engineers who are not so skilled, but who devoted themselves to an idea and a project and succeeded in making it their main source of income. Many of those projects and ideas were so easy for me to develop and implement that they were boring to even start. They were just too simple for me, from the engineering aspect. But they were profitable! Some engineers struggled to do something which seemed so easy to me, and they made a profit out of it. Others didn’t, though. I believe those who did were also a bit lucky, but all of them were devoted, and better salesmen than engineers. Being able to sell something is more important than being able to build it in the best possible engineering way. The product may have its flaws, it may need several iterations until it gets released, it may even get released and be crap from the engineering aspect. But does this matter in the end? If you make a profit and a business case out of it, then it’s successful in mainstream market terms.

You don’t have to be an expert in something to do stuff. I’ve programmed in more than 10 programming languages as a professional. I may be an expert in only 2-3 of them, but it doesn’t really matter. You don’t need to be an expert in any programming language to make something that works and is profitable. Writing code is the easiest thing to do. Does it matter if it’s the best code? If you do it the pythonic, or yoctonic or C++17 way? All the code you’ve ever written is, in the end, just crap. It’s a mess, unless it’s just a few lines that do a very specific thing. You might have thought you’d written the best code 5 years ago, and if you see that code today you’ll hate yourself for writing that crap. But it doesn’t matter. Really. You become an expert in something more as a “professional skill” that makes it easier for you to find a better job; but if you want to realize your own ideas and turn them into a product, then it doesn’t matter if you’re an expert. It never mattered.

Therefore, who’s the successful engineer in the end? The one that managed to devote himself to a product, release it to the market and make a profit, or the one that didn’t? The one that is an expert in 1-2 things or the one who’s capable in 10+? The one that delivers fast or the one that delivers something robust and well-engineered? The one that sees 10 steps further or the one that can focus on the current step? Don’t try to answer this question too fast.

I think that success is to be satisfied with what you do and to be happy with what you’ve achieved at the end of your work. This sentence is a bit vague though, because what makes you happy now won’t necessarily make you happy in the future. But do you really know the future? No. So, what is left is what makes you a happy engineer now. And if you’re happy, then it probably means that you’re also successful in what you do.

Therefore, making your own best-selling project and profiting from your awesome idea is not necessarily what will make you a happy and successful engineer. So, first you need to focus and find what makes you happy as an engineer, and even more, whether engineering is actually making you happy at all. Because you might be a very good engineer and not be happy being an engineer. You need to know your assets and your values and what to expect from yourself.

Sure, it would be a great thing to become a successful engineer who has a profitable idea and makes a product out of it. But it’s not really necessary. Is it? It might happen, it might not. Maybe you even say it out loud to yourself sometimes, but in the back of your head you don’t really want it or believe it. Because in the end everything comes with a price and you already know it. So, if your idea becomes successful, then you need to devote yourself to it. You need to stop being an engineer and be a salesman, a CEO and whatever comes with it. But certainly not an engineer anymore. You will spend more time managing things and doing stuff unrelated to the engineering domain, and you will fade as an engineer. Whether that’s good or bad depends. If you like managing things and prefer it to being an engineer, then it’s great! But you also need to be capable of managing things, not just like it. Therefore, you need to know what you want to be, and you need to know whether you have the proper skills for that and, if you don’t, try to develop them.

If you know what makes you happy, then do it, but first consider all the consequences and be certain that you can judge yourself and your skills correctly.

For me, being an engineer is not really a job. It’s just a hobby and it’s fun to explore new things and have different challenges. In my job I may not have the freedom to do exactly what I want every time, but I’m also doing a lot of stuff in my free time, without a profit. And I’m happy. It’s more like a lifestyle. What you do in your life should be fun. And I feel lucky that it’s still fun for me. So, I don’t really have regrets about missed opportunities, because all the missed (or not) opportunities brought me to this point today. In the end, the only thing that matters is to know what makes you happy. You don’t have to be the best in something or find the best idea and make a huge profit. All you have to be is happy with what you do.

If you’re lucky enough to be happy with what you do, then you are a successful engineer, and no matter what wagons you’ve lost or are losing down the path, you’re always on your happiness wagon, doing the things that you like. And that’s the best wagon you can ever be on in your professional career.

Have fun!

STM32 cmake template (with more cool stuff)


While I’m still waiting for some parts for my next stupid project, I was a bit bored and decided to clean up the STM32 cmake template that I usually use for my bluepill projects. I mean, I was pretty happy with it until now and it was working fine, but there’s no better waste of time than doing the same thing again to get the same result and having the illusion that this time it’s better. So, this deserves a post in the stupid-projects.

Anyway, while I was about to waste my time on that, I thought it would be a nice thing to make it support a few different libraries. Cmake is something that you either love or hate. I do both. The good thing is that you can achieve the same result in a number of different ways. This is nice, but it can also be trouble. The reason is that if there was only a single valid way to create a cmake build, then it would be difficult to get it wrong. You would make a lot of mistakes until you made it work, but once it worked, that would be the only right way. On the other hand, if there are many different ways to achieve the same result, then there are good and bad ways. And by some unknown universal law, the chance of choosing the worst way is much higher than that of selecting any other way, good or bad.

Therefore, cmake gets both love and hate from my side. In the end, it’s all about experience. If you do something very often, then after some time you learn to choose the better ways. But if you create a cmake project 1-2 times per year, then the next time you deal with your own CMakeLists.txt files you have to re-learn everything you’ve done, and you realise how many things you did wrong or could have done better. Also, the cmake documentation reminds me of a law textbook. There is a ton of information in there, but written in a way that stupid people like me can’t understand, so we need to read a cmake cookbook or look at examples on the internet. Then everything gets clear.


I use the standard peripheral library from ST a lot. In general, I hate the monstrous HAL API and the use of C++ when it’s not really needed, but I like CubeMX, because it’s nice for calculating clocks and playing around with the pinout. Also, when I’m using the USB on the stm32f103c8t6 (blue-pill), I always use ST’s USB FS Device Driver, which is compatible with the standard peripheral library. That combination is beautiful. I’ve found a couple of bugs, which I’ve fixed, and everything is great. I can say that I don’t need anything other than that.

I know that there are plenty of people who like the HAL API and using C++ with the STM32 and that’s fine. If you like using that, then keep doing it. From my perspective, the HAL API doesn’t provide anything more than the stdperiph library, and there are so many layers of software between the actual API and the CMSIS level that it doesn’t make sense. For me it’s too complicated, and when it breaks you can’t just open the MCU datasheet and find the error; you also need to debug all that software layer in between. Sorry, no time for this. Regarding C++, I’ve written a post before here. Generally, there’s no right or wrong. But personally I prefer to write C++ when I’m developing a Qt app or when I really need some things from the language that can make my code cleaner, more portable and faster. If it can’t do that, then I see no reason to use it. Also, the libraries are C libraries with C++ wrappers in the headers. That means something. Therefore, I need to be convinced that C++ will actually be better than C for the specific project, otherwise I’ll go with C.

There is also another project that supports the stm32 and plenty of other MCUs and it deserves more love. This is the libopencm3 project. It is actually a library that replaces the standard peripheral library from ST. This is a very nice library. It’s a low-level library and based on CMSIS. It gets updated much more often than the stdperiph and the project is active. For example, I see now that the last update was a few hours ago (that doesn’t mean that it was for stm32f1), while the last version of the stdperiph was in 2012, so 7 years ago. Another neat thing with libopencm3 is that everyone can join the project and send commits to add functionality or fix bugs. I’m thinking of submitting the stm32f1 overclocking patch I have, which clocks the stm at 128MHz; I guess it won’t be accepted as it’s out of specs, but anyway I’ll try, he he. So, yeah, libopencm3 deserves more love, and I think that sometimes it may also produce smaller code.

So I’ve decided to add support to the cmake template also for the libopencm3.

Finally, let’s go to FreeRTOS. I guess everyone knows what that is and I guess there are a lot of people that love it. Well, I respect rtos. I mean, most of my work is done on embedded Linux, so I know about rtos. But still, until today, I’ve never, never really had to use an rtos on a small embedded mcu. Until today there was nothing that I couldn’t do using state machines. I’ve also written a very nice and neat lib for state machines and I think I should open-source it at some point. Anyway, I never had to use an rtos on an stm32 or other small mcu, but I guess other people have needs for it. From my perspective it seems that it simplifies things and produces less code and complexity, but on the other hand you lose more important things, like full control of the runtime, and there’s also a hit in performance. But anyway, it’s fun to have it as an option for prototyping and writing small apps when you don’t want to mess with timers and interrupts.

Hence, in this cmake template you get all the above in the same project and you are able to select which libraries to enable by selecting the proper options in the cmake. But let’s have a look. This is the repo here:

After you clone the repo, there is a very interesting file that you should read. It’s supposed to be written in a way that is easier to understand, compared to the cmake documentation. Another important file is the build.sh script, which handles all the details and runs cmake with the proper options.

So let’s see what those options are. The only thing you need to do to build the examples is to run the build.sh script with the proper parameters. Inside the build script you’ll find all the supported parameters, but not all of them need to be set every time.

  • TOOLCHAIN_DIR: This should point to your toolchain path.
  • CMAKE_TOOLCHAIN: This points to your cmake toolchain file. This file actually sets up the toolchain to be used. When working with the blue-pill, you shouldn’t need to change it.
  • CLEANBUILD: This parameter is either true or false. When it’s true, the build script deletes the build folder, which means that you get a clean build. Very useful, especially if you’re making changes to your cmake files, in order to remove the cmake cache. The default is false.
  • ECLIPSE_IDE: This is either true or false. If it’s true, then cmake will also create Eclipse project files, so you can import the project in Eclipse and use it as an IDE to develop. That’s a handy option because of intellisense. The default is false, because I usually prefer VS Code.
  • USE_STDPERIPH_DRIVER: This option can be ON or OFF and enables or disables ST’s standard peripheral driver and the CMSIS lib. It is set to OFF by default, so you need to explicitly set it to ON during build.
  • USE_STM32_USB_FS_LIB: This option can be ON or OFF and enables or disables ST’s USB FS Device Driver. It is set to OFF by default, so you need to explicitly set it to ON during build.
  • USE_LIBOPENCM3: This option can be ON or OFF and enables or disables the libopencm3 library. It is set to OFF by default, so you need to explicitly set it to ON during build. You can’t have this set to ON at the same time as USE_STDPERIPH_DRIVER.
  • USE_FREERTOS: This option can be ON or OFF and enables or disables the FreeRTOS library. It is set to OFF by default, so you need to explicitly set it to ON during build.
  • SRC: With this option you can specify the source folder. You may have different source folders with different projects in the source/ folder. For example, in this template there are two folders, source/src_stdperiph and source/src_freertos, so you can select which one you want to build, as they contain completely different projects and need different libraries.
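To make the defaults and the one exclusion rule from the list above easier to see at a glance, here is a small sketch of a hypothetical Python wrapper. It is not part of the template. It assumes that build.sh picks the parameters up as environment variables and that SRC takes the folder name without the source/ prefix; both are guesses, so check the script itself before relying on them. The function name make_build_env is made up for the example.

```python
import os

# Defaults as described in the list above. The values are strings because this
# sketch assumes they are handed to build.sh as environment variables.
DEFAULTS = {
    "CLEANBUILD": "false",
    "ECLIPSE_IDE": "false",
    "USE_STDPERIPH_DRIVER": "OFF",
    "USE_STM32_USB_FS_LIB": "OFF",
    "USE_LIBOPENCM3": "OFF",
    "USE_FREERTOS": "OFF",
}
# Parameters that have no default in this sketch.
PATH_PARAMS = {"TOOLCHAIN_DIR", "CMAKE_TOOLCHAIN"}


def make_build_env(src: str, **overrides: str) -> dict:
    """Merge overrides into the defaults and reject invalid combinations."""
    unknown = set(overrides) - set(DEFAULTS) - PATH_PARAMS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    options = {**DEFAULTS, **overrides, "SRC": src}
    # The two peripheral libraries can't be enabled at the same time.
    if options["USE_STDPERIPH_DRIVER"] == "ON" and options["USE_LIBOPENCM3"] == "ON":
        raise ValueError("USE_STDPERIPH_DRIVER and USE_LIBOPENCM3 are mutually exclusive")
    return {**os.environ, **options}


env = make_build_env("src_stdperiph", USE_STDPERIPH_DRIVER="ON", CLEANBUILD="true")
# Assumed invocation, left commented out on purpose:
# subprocess.run(["./build.sh"], env=env, check=True)
```

The point is only that everything is off unless you switch it on, and that the stdperiph and libopencm3 options exclude each other.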

The two example projects, as you can guess from the names, are for testing the stdperiph and the freertos/libopencm3 libs. To build those two projects you can run these commands:

# stdperiph

# FreeRTOS & LibopenCM3

# Create Eclipse projects files

So, yeah, pretty much that’s it. Easy and handy.


This was a re-write of my cmake template and, as a cherry on top, I decided to add support for FreeRTOS and LibopenCM3. I’ll probably use libopencm3 more often in the future, or at least evaluate it enough to see how it performs. Regarding FreeRTOS, I think it’s a nice addition for prototyping and using tasks instead of writing code.

Finally, one note here. Be careful when you use the -flto flag in the GCC optimisations, because it completely breaks FreeRTOS. For example, you can build the freertos example and flash it on the stm and you get an LED toggling every 500ms, but if you add the -flto flag to the COMPILER_OPTIMISATION parameter in the main CMakeLists.txt file, then you’ll find out that vTaskDelay breaks and the pin toggles very fast.

Have fun!