Controlling a 3D object in Unity3D with teensy and MPU-6050


Have a look at this image.

What does it look like? Is it the new Star Wars? Nope. It’s a real-time 3D rendering from a “game” that I’ve built just to test my new stupid project with the MPU-6050 motion tracking sensor. Cool, isn’t it? Before I get there, let me do a short introduction.

The reason I’ve enjoyed this project so much is that I started it without knowing anything about 3D game engines, skyboxes and quaternions, and after 1.5 days I got this thing up and running! I don’t say this to praise myself. On the contrary, I mention it to praise the current state of free and open source software. Therefore, I’ll use some space here to praise the people who contribute to FOSS and make it easy and fast for others to experiment and prototype.

I don’t know how long you’ve been engineers, hobbyists or anything related. But from my experience, trying to make something similar 10 or 15 years ago (and on a Linux machine, too) would have been a really hard and time-consuming procedure. Of course, I don’t mean getting the same 3D results; I mean a result relative to that era. Maybe I would have spent several months, because I would have had to do almost everything by myself. Now, though, I’ve only written a few hundred lines of glue code and spent some time on YouTube and surfing GitHub and some other sources around the web. Nowadays, there are so many things that are free/open and available that it’s almost impossible not to find what you’re looking for. I’m not only talking about source code, but also tools, documentation, finished projects and resources (even open hardware). So many people share their knowledge and the outcome of their work/research that you can learn almost everything from the internet, too. OK, it’s probably quite difficult to become a nuclear physicist using only online sources, but you can definitely learn anything related to programming and other engineering domains. It doesn’t matter why people decide to make their knowledge publicly available; all that matters is that it’s out there, available for everyone to use.

And thanks to those people, I’ve managed to install Unity3D on my Ubuntu, learn how to use it to make what I needed, find a free-to-use 3D model of the Millennium Falcon, use Blender on Linux to edit the 3D model, find a tool to create a skybox that resembles the universe, and find an HID example for the Teensy 3.2, the source code for the MPU-6050 and how to calibrate it, and finally dozens of posts with information about all those things. Things like how to use lens flares, mesh colliders to control flare visibility on cameras with flare layers, the event system for GUI interactions and several other things that I wasn’t even aware of before, all explained by other people in forums in a way that is far easier to read than any available documentation. Tons of information.

Then I just put all the pieces together and wrote the glue code and this is the result.

(Excuse the bad video quality, but I’ve used my terrible smartphone camera)

It is a great time to be an engineer or a hobbyist, with all those tools and information available at your fingertips. All you need is just some time for playing and making stupid projects 😉

All the source code and files for this project are available here:

Note: This post is not targeting 3D graphics developers in any way. It’s meant mostly for embedded engineers or hobbyists.

So, let’s dive into the project…


To make the project I’ve used various software tools and only two hardware components. Let’s see the components.

Teensy 3.2

You’re not limited to the Teensy 3.2; you can use any Teensy that supports the RawHID lib. The Teensy 3.2 is based on the NXP MK20DX256VLH7, which has a Cortex-M4 core running at 72 MHz and can easily be overclocked up to 96 MHz using the Arduino IDE. It has a variety of peripherals and the pinout exposes everything you need to build various interesting projects. For this project I’ve only used the USB and I2C. The Teensy is not the cheapest MCU board out there, as it costs around $20, but it comes with great support and it’s great for prototyping.


MPU-6050

According to TDK (which manufactures the MPU-6050), this is a Six-Axis (Gyro + Accelerometer) MEMS MotionTracking Device with an onboard Digital Motion Processor (DMP). According to the TDK web page, I should use a ™ on every word less than 4 characters in the previous sentence. Anyway, to put it simply, it’s a package that contains a 3-axis gyro, a 3-axis accelerometer and a special processing unit that performs some complex calculations on the sensor’s data. You can find small boards with the sensor on ebay that cost ~1.5 EUR, which is dirt cheap. These boards are 3V3 and 5V capable, which makes them easy to use with a very wide range of development boards and kits.

Connecting the hardware

The hardware connections are easy, as both the Teensy and the MPU-6050 are voltage-level compatible. The following table shows the connections you need to make.

Teensy 3.2 (pin) MPU-6050
18 SDA
19 SCL
23 INT

That’s it, you’re done. Now all you have to do is connect the Teensy 3.2 to your workstation, calibrate the sensor and then build the firmware and flash it.

Calibrating the MPU-6050

Not all MPUs are made the same. Since it’s quite a complex device, both the gyro and the accelerometer (accel) have tolerances. Those tolerances affect the readings you get. For example, if you place the sensor on a perfectly flat surface, the readings will say that it’s not perfectly flat, which means that you’re reading the tolerances (offsets). Therefore, you first need to place the sensor on a flat surface and then use the readings to calibrate the sensor and minimize those offsets. Because every chip has different tolerances, you need to do this for every sensor; you can’t calibrate a single sensor once and then re-use the same values for others (even if they are from the same production batch). The sensor lets you upload those offset values to it through some I2C registers, so that it applies them in its own calculations, which in the end offloads the external CPU.

Normally, if you need maximum accuracy during calibration, you need an inclinometer in order to place your sensor completely flat, and a vibration-free base. You can find inclinometers on ebay or amazon, from 7 EUR up to thousands of EUR. Of course, you get what you pay for. Keep in mind that an inclinometer is just a tilt sensor that is calibrated during production. A cheap inclinometer may suck in many different ways, e.g. maybe it’s not even calibrated, or the calibration reference is not calibrated, or the tilt sensor itself is crap. Anyway, for this project you don’t really need one.

For now just place the sensor on a surface that seems flat. Also, because you’ve probably already soldered the pin headers, try to keep the sensor level relative to the surface. We really don’t need accuracy here, just an approximation, so make your eye an inclinometer.

Another important thing to know is that when you power up the sensor, the orientation is zeroed at the current orientation. That means that if the sensor is pointing south, then this direction will always be the starting point. This is important for when you connect the sensor to the 3D object: you need to put the sensor flat and pointing at that object on your screen and then power on (connect the USB cable to the Teensy).

Note: Before you start the calibration you need to leave the module powered on for a few minutes in order for the temperature to stabilize. This is very important, don’t skip this step.

OK, so now you should have your sensor connected to the Teensy and on a flat(-ish) surface. Now you need to flash the calibration firmware. I’ve included two calibration sketches in the repo. The easiest one to use is in `mpu6050-calibration/mpu6050-raw-calibration/mpu6050-raw-calibration.ino`. I’ve got this source code from here.

Note: In order to be able to build the firmware in the Arduino IDE, you need to add this library here. The Arduino library in this repo covers both the MPU-6050 and the I2Cdev, which is needed by all the sketches. Just copy the I2Cdev and MPU6050 folders from this folder into your Arduino library folder.

After you build and upload `mpu6050-raw-calibration.ino` to the Teensy, you also need to open the Serial Monitor in the Arduino IDE. When you do that, you’ll get this prompt repeatedly:

Send any character to start calibrating...

Press Enter in the text box of the Serial Monitor and the calibration will start. In my case there were a few iterations and then I got the calibration values in the output:

            ax	ay	az	gx	gy	gz
average values:		-7	-5	16380	0	1	0
calibration offsets:	1471	-3445	1355	-44	26	26

MPU-6050 is calibrated.

Use these calibration offsets in your code:

Now copy-paste the above code block in to your
'teensy-motion-hid/teensy-motion-hid.ino' file
in function setCalibrationValues().

As the message says, now just open the `teensy-motion-hid/teensy-motion-hid.ino` file and copy the mpu.set* function calls into the setCalibrationValues() function.
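As a quick sanity check of the "average values" line above, here is a minimal Python sketch (it is not part of the repo) that compares the reported averages with what an ideal sensor would read when lying flat and still. It assumes the accelerometer is set to the ±2g range, where 1 g corresponds to 16384 LSB; the function names and the `tolerance` value are made up for illustration.

```python
# Ideal readings for a sensor lying flat and still, assuming the +/-2g
# accelerometer range (1 g = 16384 LSB) and a gyro that reads zero at rest.
IDEAL_AT_REST = {"ax": 0, "ay": 0, "az": 16384, "gx": 0, "gy": 0, "gz": 0}


def rest_errors(averages):
    """Return the deviation of each averaged axis from its ideal rest value."""
    return {axis: averages[axis] - ideal for axis, ideal in IDEAL_AT_REST.items()}


def looks_calibrated(averages, tolerance=10):
    """True when every axis is within `tolerance` LSB of its ideal value.

    The tolerance is an arbitrary illustration value, not one from the repo.
    """
    return all(abs(err) <= tolerance for err in rest_errors(averages).values())


if __name__ == "__main__":
    # The "average values" line from the calibration output above
    avg = {"ax": -7, "ay": -5, "az": 16380, "gx": 0, "gy": 1, "gz": 0}
    print(rest_errors(avg))
    print(looks_calibrated(avg))
```

If one of the axes is still thousands of LSB away from its ideal value, the offsets were most likely not applied or the sensor moved during the run.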

Advanced calibration

If you want to see a few more details regarding calibration and an example on how to use a PID controller for calibrating the sensor and then use a jupyter notebook to analyze the results, then continue here. Otherwise, you can skip this section as it’s not really needed.

In order to calculate the calibration offsets you can use a PID controller. For those who don’t know what a PID controller is, you’ll have to see this first (or if you know how negative feedback on op-amps works, then think of it as quite the same thing). Generally, it’s a control feedback loop that is used a lot in control systems (e.g. HVAC for room temperature, elevators, etc.).

Anyway, in order to calibrate the sensor using a PID controller, you need to build and upload the `mpu6050-calibration/mpu6050-pid-calibration/mpu6050-pid-calibration.ino` source code. I won’t get into the details of the code, but the important thing is that this code uses 6 different PID controllers, one for each offset you want to calculate (3 for the accel axes and 3 for the gyro axes). This source code is a modification I’ve made of this repo here.
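To give a feeling for what each of those six controllers does, here is a minimal Python sketch of a PID loop that searches for the offset of a single axis. It is not the code from the repo: the sensor is simulated as `reading = bias + offset` without any noise, and the gains (`kp=0.2`, `ki=0.5`, `kd=0.0`), the bias value and all names are assumptions picked only so that the toy loop converges.

```python
class PID:
    """Textbook discrete PID controller (positional form)."""

    def __init__(self, kp, ki, kd, setpoint=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measurement):
        """Return the new controller output for the given measurement."""
        error = self.setpoint - measurement
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def calibrate_axis(bias, steps=100):
    """Find the offset that cancels a (simulated) constant sensor bias.

    Returns the final offset and the residual reading with that offset applied.
    """
    pid = PID(kp=0.2, ki=0.5, kd=0.0)  # illustration gains, not from the repo
    offset = 0.0
    for _ in range(steps):
        reading = bias + offset        # simulated sensor: no noise, no drift
        offset = pid.update(reading)   # controller output becomes the new offset
    return offset, bias + offset


if __name__ == "__main__":
    offset, residual = calibrate_axis(bias=-3500.0)
    print(f"offset={offset:.3f} residual={residual:.6f}")
```

On a real sensor the reading never settles completely because of noise, which is why the values keep fluctuating and an average is used in the end.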

Again, you need to leave the sensor powered on for a few minutes before performing the calibration and set it on a flat surface. When the firmware starts, it will spit out numbers on the serial monitor. Here’s an example:


And this goes on forever. Each line is a comma separated list of values and the meaning of those values from left to right is:

  • Average Acc X value
  • Average Acc X offset
  • Average Acc Y value
  • Average Acc Y offset
  • Average Acc Z value
  • Average Acc Z offset
  • Average Gyro X value
  • Average Gyro X offset
  • Average Gyro Y value
  • Average Gyro Y offset
  • Average Gyro Z value
  • Average Gyro Z offset

Now, all you need to do is to let this run for a short while (30-60 seconds) and then copy all the output from the serial monitor to a file named calibration_data.txt. The file actually already exists in the `/rnd/bitbucket/teensy-hid-with-unity3d/mpu6050-calibration` folder and it contains the values from my sensor, so you can use those to experiment, or erase the file and put yours in its place. Also, be aware that when you copy the output from the serial monitor to the txt file, you need to delete any empty line at the end for the notebook scripts to work; otherwise, you’ll get an error in the Jupyter notebook.
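If you prefer to look at the data outside the notebook, the line format described above can be parsed with a few lines of Python. This is only a sketch under the assumption that every line holds exactly the 12 comma-separated values listed above; the field names and the function name are mine, not the notebook’s. Unlike the notebook scripts, it simply skips empty lines.

```python
# Field order as described in the list above: value, offset for each axis.
FIELDS = [
    "acc_x", "acc_x_offset",
    "acc_y", "acc_y_offset",
    "acc_z", "acc_z_offset",
    "gyro_x", "gyro_x_offset",
    "gyro_y", "gyro_y_offset",
    "gyro_z", "gyro_z_offset",
]


def parse_calibration_lines(lines):
    """Turn serial monitor lines into a list of {field name: float} dicts."""
    rows = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing one
        values = [float(v) for v in line.split(",")]
        if len(values) != len(FIELDS):
            raise ValueError(f"line {number}: expected {len(FIELDS)} values, got {len(values)}")
        rows.append(dict(zip(FIELDS, values)))
    return rows


if __name__ == "__main__":
    # Assumes calibration_data.txt is in the current folder
    with open("calibration_data.txt") as f:
        rows = parse_calibration_lines(f)
    print(f"parsed {len(rows)} samples")
```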

Note: while you’re running the calibration firmware you need to be sure that there are no vibrations on the surface. For example, if you put this on your desk, then be sure that there are no vibrations from you or any other equipment you may have running on the desk.

As I explain quite thoroughly in the notebook how to use it, I’ll keep it simple here. Also, from this point I assume that you’ve read the Jupyter notebook in the repo here.

You can use the notebook to visualize the PID controller output and also calculate the values to use for your sensor’s offsets. It’s interesting to see some plots here. As I mention in the notebook, you can use the data_start_offset and data_end_offset to plot different subsets of data for each axis.

This is the plot when data_start_offset=0 and data_end_offset=20.

So, as you can see, in the first 20 samples the PID controller kicks in and tries to correct the error, and as the error in the beginning is significant, you get this slope. In the first 15 samples the error for the Acc X axis is corrected from more than -3500 to near zero. For the gyro things are a bit different, as it’s more sensitive and fluctuates a bit more. Let’s see the same plots with data_start_offset=20 and data_end_offset=120.

On the above images, I’ve started from sample 20, in order to remove the steep slope during the first PID controller input/output correction. Now you can see that the data coming from the accel and gyro axes fluctuate quite a lot and the PID tries on every iteration to correct this error. Of course, you’ll never get a zero error, as the sensor is quite sensitive and there’s also thermal and electronic noise and also vibrations that you can’t sense but the sensor does. So, what you do in such cases is use the average value for each axis. Be careful, though. You don’t want to include the first samples in the average value calculations for each axis, because that would give you values that are way off. As you can see in the notebook here, I’m using skip_first_n_data to skip the first 100 samples and then calculate the average from the rest of the values.
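The averaging step itself is tiny. Here is a rough Python sketch of the idea, not the actual notebook code: only the `skip_first_n_data` name comes from the notebook, the function name is hypothetical.

```python
def average_after_skip(samples, skip_first_n_data=100):
    """Average a list of offset samples, ignoring the initial PID transient."""
    settled = samples[skip_first_n_data:]
    if not settled:
        raise ValueError("not enough samples left after skipping the transient")
    return sum(settled) / len(settled)


if __name__ == "__main__":
    # Made-up numbers: 100 transient samples followed by 3 settled ones
    demo = [1000.0] * 100 + [10.0, 20.0, 30.0]
    print(average_after_skip(demo))                       # settled part only
    print(average_after_skip(demo, skip_first_n_data=0))  # skewed by the transient
```

The second call shows why the first samples must be skipped: the transient drags the average far away from the value the controller finally settles on.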

Finally, you can use the calculated values from the “Source code values” section and copy those into the firmware. You can use whatever method you like to calibrate the sensor; just be aware that if you try both methods you won’t get the same values! Even if you run the same test twice you won’t get the exact same values, but they should be similar.

HID Manager

In the hid_manager/ folder you’ll find the source code of a tool I’ve written and named hid_manager. Let me explain what it does and why it’s needed.

The HID manager is the software that receives the raw HID messages from the Teensy and then copies the data into a buffer that is shared with Unity. Note that this code works only on Linux. If you want to use the project on Windows then you need to port this code; it’s actually the only part of the project that is OS specific.

So, why use this HID manager? The problem with Unity3D and most similar software is that although they have some kind of input API, it’s often quite complicated. I mean, I had a look at the API and tried to use it, but I quickly realized that it would take too much effort and time to create my own custom input controller for Unity and then use it in there. So, I decided to make it quick and dirty. In this case, though, I would say that quick and dirty is not a bad thing (except that it’s OS specific). Therefore, what is the easiest and fastest way to import real-time data into any other software that runs on the same host? Using some kind of inter-process communication, of course. In this case, the easiest way to do that was to use the Linux /tmp folder, which is mounted in the system’s RAM when it’s set up as a tmpfs, then create a file buffer in /tmp and share this buffer between the Unity3D game application and the HID manager.

To do that I’ve written a script in hid_manager/ which makes sure to create this buffer and occupy 64 bytes in RAM. The USB HID packets I’m using are 64 bytes long, so we don’t need more RAM than that. Of course, I’m not using all the bytes, but it’s good to have more room than exactly what you need. For example, the first test was to send only the Euler angles from the sensor, but then I realized that I was affected by the gimbal lock effect, so I also had to add the quaternion values, which I was getting from the sensor anyway (I’ll come back to those later). So, having more space available is always nice, and in this case the USB packet buffer is always 64 bytes, so you get them for free. The problem arises when you need more than 64 bytes; then you need to use some data serialization and packaging.
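The shared-buffer idea can be sketched in a few lines of Python. This is not the hid_manager code (that one is written in C); it’s just an illustration of two processes exchanging a fixed 64-byte block through a memory-mapped file. The helper names are hypothetical, and the demo writes to a temporary folder instead of the real /tmp/hid-shared-buffer file so that it doesn’t interfere with a running setup.

```python
import mmap
import os
import tempfile

BUF_SIZE = 64  # one raw HID packet


def create_buffer(path):
    """Create (or reset) the shared file and fill it with 64 zero bytes."""
    with open(path, "wb") as f:
        f.write(bytes(BUF_SIZE))


def write_packet(path, packet):
    """Producer side: copy one 64-byte packet into the shared file."""
    if len(packet) != BUF_SIZE:
        raise ValueError("packet must be exactly 64 bytes")
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), BUF_SIZE) as mm:
        mm[:] = packet
        mm.flush()


def read_packet(path):
    """Consumer side: return the current 64 bytes of the shared file."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), BUF_SIZE) as mm:
        return bytes(mm[:])


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "hid-shared-buffer-demo")
    create_buffer(path)
    write_packet(path, bytes([0xAB, 0xCD]) + bytes(62))
    print(read_packet(path)[:2].hex())
```

Because the buffer has a fixed size and is always overwritten in place, the reader never has to deal with framing; it just reads the latest 64 bytes on every frame.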

Also, note that in both the teensy-motion-hid/teensy-motion-hid.ino and hid_manager/hid_manager.c the endianness is the same, which makes things a bit easier and faster.

In order to build the code, just run make inside the folder. Then you first need to flash the Teensy with the teensy-motion-hid.ino firmware and then run the manager using the script.


If you try to run the script before the Teensy is programmed, then you’ll get an error as the HID device won’t be found attached and enumerated on your system.

Note: if the HID manager is not running then on the serial monitor output you’ll get this message

Unable to transmit packet

The HID manager supports two modes. In the default mode it just copies the incoming data from the HID device to the buffer. The second one is the debug mode, in which it also prints the 64 bytes that it gets from the Teensy. To run the HID manager in debug mode, run this command.

DEBUG=1 ./

By running the above command you’ll get an output in your console similar to this:

$ DEBUG=1 ./
    This script starts the HID manager. You need to connect
    the teensy via USB to your workstation and flash the proper
    firmware that works with this tool.
    More info here:
    Controlling a 3D object in Unity3D with teensy and MPU-6050

Usage for no debugging mode:
$ ./
Usage for debugging mode (it prints the HID incoming data):
$ DEBUG=1 ./

Debug ouput is set to: 1
Starting HID manager...
Denug mode enabled...
Open shared memfile
found rawhid device
recv 64 bytes:
AB CD A9 D5 BD 37 4C 2B E5 BB 0B D3 BD BE 00 FC 7F 3F 00 00 54 3B 00 00 80 38 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 F9 29 00 00 01 00 00 00 E1 94 FF 1F
recv 64 bytes:
AB CD 9D 38 E5 3B 28 E3 AB 37 B6 EA AB BE 00 FC 7F 3F 00 00 40 3B 00 00 00 00 00 00 80 B8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 F9 29 00 00 01 00 00 00 E1 94 FF 1F
...

The 0xAB 0xCD bytes are just a preamble.
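To make the dump above a bit more readable, here is a small Python sketch that decodes such a packet. The layout is an assumption that I inferred from the firmware snippet and the dump: a 2-byte preamble followed by seven packed little-endian 32-bit floats in the order x, y, z (Euler angles in degrees), qw, qx, qy, qz. Unlike the HID manager, the sketch checks the preamble, only to show where it sits. The function name and the dictionary keys are made up.

```python
import struct

PREAMBLE = b"\xAB\xCD"
PAYLOAD_FORMAT = "<7f"  # 7 little-endian floats: x, y, z, qw, qx, qy, qz (assumed)
PAYLOAD_OFFSET = 2      # the floats start right after the preamble


def parse_packet(buf):
    """Decode one 64-byte raw HID packet into Euler angles and a quaternion."""
    if len(buf) != 64:
        raise ValueError("expected a 64-byte packet")
    if buf[:2] != PREAMBLE:
        raise ValueError("bad preamble")
    x, y, z, qw, qx, qy, qz = struct.unpack_from(PAYLOAD_FORMAT, buf, PAYLOAD_OFFSET)
    return {"euler_deg": (x, y, z), "quat_wxyz": (qw, qx, qy, qz)}


# First 26 bytes of the first packet in the debug output above. The rest of the
# packet is not used by this parser, so it is simply zero-filled here.
SAMPLE = bytes.fromhex(
    "AB CD A9 D5 BD 37 4C 2B E5 BB 0B D3 BD BE 00 FC 7F 3F 00 00 54 3B 00 00 80 38"
).ljust(64, b"\x00")

if __name__ == "__main__":
    print(parse_packet(SAMPLE))
```

With this layout the first packet decodes to a quaternion with w very close to 1 and the other components close to 0, which is what you’d expect from a sensor that has barely moved since power-up.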

Note: I haven’t added any checks on the data, like checking the preamble or having checksums etc., as there wasn’t a reason for this. Otherwise I would consider at least a naive checksum like xor’ing the bytes, which is fast.
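For reference, such a naive XOR checksum could look like the following Python sketch. Neither the firmware nor the HID manager does this; storing the checksum in the last byte of the 64-byte packet is just a hypothetical choice for the example, and so are the function names.

```python
from functools import reduce
from operator import xor


def xor_checksum(data):
    """XOR all bytes together; the result fits in one byte."""
    return reduce(xor, data, 0)


def seal(payload):
    """Build a 64-byte packet: 63 payload bytes plus the checksum as last byte."""
    if len(payload) != 63:
        raise ValueError("payload must be 63 bytes")
    return bytes(payload) + bytes([xor_checksum(payload)])


def is_intact(packet):
    """True when the last byte matches the XOR of the first 63 bytes."""
    return len(packet) == 64 and xor_checksum(packet[:63]) == packet[63]


if __name__ == "__main__":
    packet = seal(bytes([0xAB, 0xCD]) + bytes(61))
    print(hex(packet[63]), is_intact(packet))
```

An XOR checksum catches any single flipped bit, but it can miss errors that cancel each other out, e.g. the same bit flipped in two different bytes.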

In the next video you can see on the left top side the quaternion data output from the Teensy and on the left bottom the raw hid data in the hid manager while it’s running in debug mode.

Of course, printing the data on both ends adds significant latency in model motion.

Teensy 3.2 firmware

The firmware for the Teensy is located in teensy-motion-hid/teensy-motion-hid.ino. There’s not much to say here, just open the file with the Arduino IDE and then build and upload the firmware.

The important part of the code is these lines:

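        // Read the orientation quaternion computed by the DMP
        mpu.dmpGetQuaternion(&q, fifoBuffer);

        // Start a new HID payload with the 2-byte preamble
        un_hid_payload pl;
        pl.packet.preamble[0] = 0xAB;
        pl.packet.preamble[1] = 0xCD;

        // Euler angles, converted from radians to degrees
        // (only sent when ADD_EULER_TO_HID is enabled)
        mpu.dmpGetEuler(euler, &q);
        pl.packet.x.number = euler[0] * 180 / M_PI;
        pl.packet.y.number = euler[1] * 180 / M_PI;
        pl.packet.z.number = euler[2] * 180 / M_PI;

        // Quaternion components as floats
        pl.packet.qw.number = q.w;
        pl.packet.qx.number = q.x;
        pl.packet.qy.number = q.y;
        pl.packet.qz.number = q.z;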

If ADD_EULER_TO_HID is enabled, then the Euler angles will also be added to the HID packet, but this might add a bit more latency.

Now that the data are copied from the Teensy to a shared memory buffer in /tmp, you can use Unity3D to read those data and use them in your game. Before proceeding with the Unity3D section, though, let’s open a short parenthesis on the kind of data you get from the sensor.

Sensor data

As I’ve mentioned, the sensor does all the hard work and maths to calculate the Euler and the quaternion values from the 6 axes’ values in real time (which is a great acceleration). But what are those values, why do we need them and why do I prefer to use only the quaternion? Usually I prefer to give just a quick explanation and let the pros explain it better than me, so I’ll do the same now.

The Euler angles are just the angles of rotation about each axis in 3D space. In air navigation those angles are known as roll, pitch and yaw, and by knowing those angles you know your object’s rotation. You can also see this video, which explains this better than I do. There’s one problem with Euler angles: if two of the 3 axes are driven into a parallel configuration then you lose one degree of freedom (this is the gimbal lock). This video explains this in more detail.

As I’ve already mentioned, the sensor calculates the quaternion values. A quaternion is much more difficult to explain, as it’s a 4-D number and anything more than 3-D is difficult to visualize and explain. I will avoid explaining this myself and just post this link here, which explains quaternions and tries to represent them in 3D space. The important thing you need to know is that the quaternion doesn’t suffer from gimbal lock, it’s supported in Unity3D and it’s supposed to make calculations faster for the CPU/GPU compared to vector calculations.
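If you want to play with quaternions a bit without a game engine, here is a small self-contained Python sketch that rotates a 3D vector with a unit quaternion using the standard q·v·q* product. It has nothing to do with the code in the repo or with Unity’s implementation; the (w, x, y, z) tuple order and the function names are just my choices for the example.

```python
import math


def quat_from_axis_angle(axis, angle_rad):
    """Unit quaternion (w, x, y, z) for a rotation of angle_rad about axis."""
    ax, ay, az = axis
    norm = math.sqrt(ax * ax + ay * ay + az * az)
    s = math.sin(angle_rad / 2) / norm
    return (math.cos(angle_rad / 2), ax * s, ay * s, az * s)


def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def rotate(q, v):
    """Rotate vector v by unit quaternion q, i.e. compute q * v * conj(q)."""
    w, x, y, z = q
    conjugate = (w, -x, -y, -z)
    _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, v[0], v[1], v[2])), conjugate)
    return (rx, ry, rz)


if __name__ == "__main__":
    # A quarter turn about the Z axis moves the X axis onto the Y axis
    q = quat_from_axis_angle((0.0, 0.0, 1.0), math.pi / 2)
    print(rotate(q, (1.0, 0.0, 0.0)))
```

Since any orientation is a single rotation about some axis, chaining rotations is just multiplying quaternions, and there is no axis ordering that can collapse like it does with the three Euler angles.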

Unity3D project

For this project I wanted to interact with a 3D object on the screen using the MPU-6050. Then I remembered that I’d seen a video on Unity3D which seemed nice, so when I saw that there was a Linux build (although not officially supported), I thought I’d give it a try. When I started the project I knew nothing about this software, but for doing simple things it seems nice. I had quite a few difficulties, but with some google-fu I fixed all the issues.

Installing Unity3D on Ubuntu is not pain free, but it’s not that hard either, and when you succeed, it works great. I’ve downloaded the installer from here (always see the last post, which has the latest version) and to install Unity Hub I’ve followed this guide here. Unity3D is not open source, but it’s free for personal use and you need to create an account in order to get your free license. I guess I could have used an open 3D game engine, but since it was free and I wanted it for personal use, I went for that. In order to install the same versions that I’ve used, run the following commands:

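# install dependencies
sudo apt install libgtk2.0-0 libsoup2.4-1 libarchive13 libpng16-16 libgconf-2-4 lib32stdc++6 libcanberra-gtk-module

# install unity (make the installer executable, then run it)
chmod +x UnitySetup-2019.1.0f2
./UnitySetup-2019.1.0f2

# install unity hub (make the AppImage executable, then run it)
chmod +x UnityHubSetup.AppImage
./UnityHubSetup.AppImage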

When you open the project, you’ll find some files in the bottom tab. The one that’s interesting for you is HID_Controller.cs. This file is located in the repo here: Unity3D-project/gyro-acc-test/Assets/HID_Controller.cs. In this file the important bits are the MemoryMappedFile object, which is instantiated in the Start() function and opens the shared file in /tmp to read the mpu6050 data, and the Update() function.

// Needs: using System.IO; using System.IO.MemoryMappedFiles; using UnityEngine;
void Start()
{
    hid_buf = new byte[64];

    // transform.Rotate(30.53f, -5.86f, -6.98f);
    Debug.developerConsoleVisible = true;
    Debug.Log("Unity3D demo started...");

    // Map the shared buffer that the HID manager writes to
    mmf = MemoryMappedFile.CreateFromFile("/tmp/hid-shared-buffer", FileMode.OpenOrCreate, "/tmp/hid-shared-buffer");
    // Dummy read, just to verify that the stream works
    using (var stream = mmf.CreateViewStream()) {
        var data = stream.Read(hid_buf, 0, 64);
        if (data > 0) {
            Debug.Log("Data in: " + data);
            float hid_x = System.BitConverter.ToSingle(hid_buf, 2);
            Debug.Log("x: " + hid_x);
        }
    }
}

// Update is called once per frame
void Update()
{
    using (var stream = mmf.CreateViewStream()) {
        var data = stream.Read(hid_buf, 0, 64);
        if (data > 0) {
            // Parse the quaternion floats from their offsets in the packet
            float qw = System.BitConverter.ToSingle(hid_buf, (int) DataIndex.qW);
            float qx = System.BitConverter.ToSingle(hid_buf, (int) DataIndex.qX);
            float qy = System.BitConverter.ToSingle(hid_buf, (int) DataIndex.qY);
            float qz = System.BitConverter.ToSingle(hid_buf, (int) DataIndex.qZ);
            // Remap the sensor axes to Unity's axes (explained below)
            transform.rotation = new Quaternion(-qy, -qz, qx, qw);
        }
    }
}

As you can see in the Start() function, the mmf MemoryMappedFile is created and attached to /tmp/hid-shared-buffer. Then there’s a dummy read from the file to make sure that the stream works, which prints a debug message. This code runs only once, when the HID_Controller class is created.

In the Update() function the code creates a stream connected to the memory mapped file, then reads the data, parses the proper float values from the buffer and finally creates a Quaternion object with the 4D values and updates the object rotation.

You’ll also notice that the values in the quaternion are not in the (x,y,z,w) order, but (-y,-z,x,w). This is weird, right? But this happens for a couple of reasons that I’ll try to explain. On page 40 of this PDF datasheet you’ll find this image.

These are the X, Y, Z axes on the chip. Notice also the signs: they are positive in the direction shown and negative in the opposite direction. The dot on the top corner indicates where pin 1 is located on the plastic package. The next image is the stick I’ve made with the sensor board attached to it, on which I’ve drawn the dot position and the axes.

Therefore, you see that the X and Y axes are swapped (compared to the PDF image), so the quaternion from (x,y,z,w) becomes (y,x,z,w). But wait… in the code it’s (-y,-z,x,w)! Well, that troubled me too for a moment, then I read this in the documentation: “Most of the standard scripts in Unity assume that the y-axis represents up in your 3D world.” This means that you also need to swap Y and Z, but because X is now in the place of Y, you replace X with Z, so the quaternion from (y,x,z,w) becomes (y,z,x,w). But wait! What about the “-” signs? Well, if you look again at the first image, it shows the sign for each axis. The way you hold the stick, relative to the moving object on the screen, reverses the rotation for the y and z axes, so the quaternion becomes (-y,-z,x,w). Well, I don’t know anything about 3D graphics, Unity and quaternions, but at least the above explanation makes sense and it works, so… this must be it.
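The whole remapping boils down to one line. Here it is as a tiny Python sketch, only to make the component shuffling explicit; the function name is made up, and the mapping is valid only for the way my sensor is mounted on the stick.

```python
import math


def mpu_to_unity(qw, qx, qy, qz):
    """Map the sensor quaternion to the (x, y, z, w) order and axes used in
    the Unity script: (x, y, z, w) = (-qy, -qz, qx, qw)."""
    return (-qy, -qz, qx, qw)


if __name__ == "__main__":
    unity_q = mpu_to_unity(qw=0.5, qx=0.5, qy=0.5, qz=0.5)
    print(unity_q)
    # Swapping and negating components never changes the quaternion's length
    print(math.sqrt(sum(c * c for c in unity_q)))
```

If you mount the sensor in a different orientation, this is the only place that needs to change.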

I’ve found the Millennium Falcon 3D model here and it’s free to use (except that any Star Wars related stuff is trademarked and you can’t really use it for any professional use). This is what I meant in the intro: all the software I’ve used until now was free or open. So A. Meerow, who built this model, did this 3D mesh in his free time and gave it away for free, and I was able to save the dozens of hours it would take to make the same thing. Thanks mate.

I had a few issues with the 3D model, though, when I imported it in Unity. One issue was that there was a significant offset on one of the axes. Another issue was that, because of what I’ve mentioned before, I had to export the model with the Y and Z axes swapped. Finally, the texture wasn’t applied properly when importing the .obj file, so I had to import the model in the .fbx format. To fix those things I’ve downloaded and used Blender. I was also using Blender for the first time, but it was quite straightforward to use it and fix those issues.

Blender is a free and open source 3D creation suite and I have to say that it looks like a beautiful and very professional tool. There are so many options, buttons and menus that it’s clear this is a professional-grade tool. And it’s free! Amazing.

Anyway, after I’d fixed those things and exported the model to the .fbx format, I wanted to change the default skybox in Unity3D to something that looks like deep space. So I found another awesome free and open tool, named Spacescape, which creates space skyboxes with stars and nebulas using algorithms and also has a ton of options to play with. The funny thing was that I tried to build it on my Ubuntu 18.04, but I had a lot of issues, as it’s based on quite an old Qt version and also needs some dependencies that failed to build as well. Anyway, I downloaded the Windows executable and it worked fine with Wine (version 3.0). This is a screenshot of the tool running on my Ubuntu.

These are the default options and I’ve actually used them as the result was great.

Finally, I’ve just added some lights, a lens flare and a camera rotation in the Unity3D scene and it was done.

Play the game demo

In order to “play” the game (yeah I know it’s not a game, it’s just a moving 3D object in a scene), you need to load the project from the Unity3D-project/gyro-acc-test/ folder. Then you just build it by pressing Ctrl+B (or via File >> Build Settings); this will create an executable file named “teensy-wars.x86_64” and at the same time it will also launch the game. After you’ve built the project once, you can just launch the teensy-wars.x86_64 executable.

Make sure that before you do this, you’re running the hid_manager in the background and that you’ve flashed Teensy with the teensy-motion-hid/teensy-motion-hid.ino firmware and the mpu-6050 sensor is connected properly.


I’m amazed by this project for many reasons. It took me 1.5 days to complete it. Now that I’m writing these lines, I’m thinking that I’ve spent more time documenting this project and writing the article than implementing it, and the reason for this is the power of open source, the available free software (free to use or open), the tons of available information in documents, manuals and forums and, finally and most importantly, the fact that people are willing to share their work and know-how. Of course, open source is not new to me, I’ve been doing this for years myself, but I was amazed that I was able to build something like this in such a short time without having used any of those tools before. Prototyping has become so easy nowadays. It’s really amazing.

Anyway, I’ve really enjoyed this project, and even more the realization of the power that everyone has in their hands nowadays. It’s really easy to build amazing stuff very fast and get good results. Of course, I need to mention that this can’t replace domain expertise in any way. But it’s still nice that engineers can jump into an unknown domain, make something quick and dirty and get some idea of how things work.

Have fun!

Machine Learning on Embedded (Part 5)

Note: This post is the fifth in the series. Here you can find part 1, part 2, part 3 and part 4.


In the previous post here, I’ve used x-cube-ai with the STM32F746 to run a tflite model and benchmark the inference performance. In that post I found that x-cube-ai is ~12.5x faster compared to TensorFlow Lite for microcontrollers (tflite-micro) when running on the same MCU. Generally, the first 4 posts were focused on running the model inference on the edge, which means running the inference on the embedded MCU. This is actually the most important thing nowadays, as being able to run inferences on the edge on small MCUs means less power consumption and, more importantly, no reliance on the cloud. What is the cloud? Here it means that there is an inference accelerator in the cloud, or in layman’s terms, the inference is running on a remote server somewhere on the internet.

One thing to note is that the only reason I’m using the MNIST model is for benchmarking and consistency with the previous posts. There’s no real reason to use this model in a scenario like this. The important thing here is not the model, but the model’s complexity. So any model with the kind of complexity that matches your use-case scenario can be used. But as I’ve said, since I’ve used this model in the previous posts, I’ll also use it here.

So, what are the benefits of running the inference on the cloud?

Well, that depends. There are many parameters that define a decision like this. I’ll try to list a few of them.

  • It might be faster to run an inference on the cloud (that depends also on other parameters though).
  • The MCU that you already have (or you must use) is not capable of running the inference itself using e.g. tflite-micro or another API.
  • There is a robust network
  • The time that the cloud inference needs to run (including the network transactions) is shorter than running on the edge (=on the device)
  • If the target device runs on battery it may be more energy efficient to use a cloud accelerator
  • It’s possible to re-train your model and update the cloud without having to update the clients (as long as the input and output tensors are not changed).

What are the disadvantages of running the inference on the cloud?

  • You need a target with a network connection
  • Networks are not always reliable
  • The server hardware is not reliable. If the server fails, all the clients will fail
  • The cloud server is not energy efficient
  • Maintenance of the cloud

If you ask me, the most important advantage of edge devices is that they don’t rely on any external dependencies. And the most important advantage of the cloud is that it can be updated at any time, even on the fly.

In this post I’ll focus on running the inference on the cloud and use an MCU as a client to that service. Since I like embedded things, the cloud tflite server will be a Jetson nano running in the two supported power modes and the client will be an esp8266 NodeMCU running at 160MHz.

All the project files are in this repo:

Now let’s dive into it.


Let’s have a look at the components I’ve used.

ESP8266 NodeMCU

This is the esp8266 module with 4MB flash and the esp8266 core, which can run at up to 160MHz. It has two SPI interfaces, one used for the onboard flash and one that is free to use. It also has a 10-bit ADC channel which is limited to max 1V input signals. This is a great limitation and we’ll see later why. You can find this on ebay sold for ~1.5 EUR, which is dirt cheap. For this project I’ll use the Arduino library to create a TCP socket that connects to a server, sends an input array and then retrieves the inference result output.

Jetson nano development board

The Jetson nano dev board is based on a quad-core ARM Cortex-A57 running @ 1.4GHz, with 4GB LPDDR4 and an NVIDIA Maxwell GPU with 128 CUDA cores. I’m using this board because tensorflow-gpu (which contains tflite) supports its GPU and therefore it provides acceleration when running a model inference. This board doesn’t have WiFi or BT, but it has an M.2 connector (key-E), so you’re able to connect a WiFi-BT module. In this project I will just use an ethernet cable connected to a wireless router.

The Jetson nano supports two power modes. The default mode 0 is called MAXN and the mode 1 is called 5W. You can verify on which mode your CPU is running with this command:

nvpmodel -q

And you can set the mode (e.g. mode 1 – 5W) like this:

# sets the mode to 5W
sudo nvpmodel -m 1

# sets the mode to MAXN
sudo nvpmodel -m 0

I’ll benchmark both modes in this post.

My workstation

I’ve also used my development workstation in order to do benchmark comparisons with the Jetson nano. The main specs are:

  • Ryzen 2700x @ 3700MHz (8 cores / 16 threads)
  • 32GB @ 3200MHz
  • GeForce GT 710 (No CUDA :( )
  • Ubuntu 18.04
  • Kernel 4.18.20-041820-generic

Network setup

This is the network setup I’ve used for developing and testing/benchmarking the project. The esp8266 is connected via WiFi to the router and the workstation (2700x) and the Jetson nano are connected via Ethernet (in the drawing replace TCP = ETH!).

This is a photo of the development setup.

Repo details

In the repo you’ll find several folders. Here I’ll list what each folder contains. I suggest you also read the README file in the repo, as it contains information that might not be available here and it will always be kept up to date.

  • ./esp8266-tf-client: This folder contains the firmware for the esp8266
  • ./jupyter_notebook: This folder contains the .ipynb jupyter notebook which you can use on the server and includes the TfliteServer class (which will be explained later) and the tflite model file (mnist.tflite).
  • ./schema: The flatbuffers schema file I’ve used for the communication protocol
  • ./tcp-stress-tool: A C/C++ tool that I’ve written to stress and benchmark the tflite server.

esp8266 firmware

This folder contains the source code for the esp8266 firmware. To build the esp8266 firmware open the `esp8266-tf-client/esp8266-tf-client.ino` with Arduino IDE (version > 1.8). Then in the file you need to change a couple of variables according to your network setup. In the source code you’ll find those values:

#define SSID "SSID"
#define SERVER_IP ""
#define SERVER_PORT 32001

You need to edit them according to your wifi network and router setup. So, use your wifi router’s SSID and password. The `SERVER_IP` is the IP of the computer that will run the python server and the `SERVER_PORT` is the server’s port; both need to match the values in the python script. All the data exchanged between the client and the server are serialized with flatbuffers. This comes with quite a significant performance hit, but it’s necessary in this case. The client sends 3180 bytes to the server on every transaction, which are the serialized 784 floats of a 28×28 digit. The response from the server to the client is 96 bytes. These byte lengths are hard-coded, so if you make any changes you also need to change the definitions in the code. They are hard-coded in order to speed up the network recv() routines, so they don’t have to wait for timeouts.
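To illustrate why the fixed lengths help, here is a minimal Python sketch (not the code from the repo) of a receive helper that reads an exact number of bytes and returns right away instead of waiting for a timeout. The names `recv_exact`, `REQUEST_LEN`, `RESPONSE_LEN` and `RAW_INPUT_LEN` are made up for this example; only the 3180 and 96 byte values come from the text. The demo sends a dummy zero-filled payload rather than a real flatbuffer.

```python
import socket
import struct

REQUEST_LEN = 3180    # serialized request size quoted in the post
RESPONSE_LEN = 96     # serialized response size quoted in the post
RAW_INPUT_LEN = 784 * struct.calcsize("<f")   # 784 float32 values, no framing


def recv_exact(sock, length):
    """Read exactly `length` bytes from a blocking socket."""
    chunks = []
    remaining = length
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


if __name__ == "__main__":
    # Local demo over a socket pair; zeros stand in for the serialized digit.
    client, server = socket.socketpair()
    client.sendall(bytes(REQUEST_LEN))
    request = recv_exact(server, REQUEST_LEN)
    print(len(request), "bytes received")
    print(REQUEST_LEN - RAW_INPUT_LEN, "bytes on top of the raw floats")
    client.close()
    server.close()
```

The difference between the 3180 bytes on the wire and the 3136 bytes of raw floats is presumably the flatbuffers framing, but I haven’t verified that against the schema.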

By default this project assumes that the esp8266 runs at 160MHz. In case you change this to 80MHz then you need also to change the `MS_CONST` in the code like this:

#define MS_CONST 80000.0f

Otherwise the ms values will be wrong. I guess there’s an easier and automated way to do this, but yeah…
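As a rough illustration of what this constant does, here is a small Python sketch. It assumes that `MS_CONST` is simply the number of CPU cycles per millisecond and that the firmware measures time by counting CPU cycles; the value 80000.0 at 80MHz is consistent with that, but I’m not quoting the firmware code here. The helper names are hypothetical.

```python
def cycles_per_ms(cpu_hz):
    """CPU cycles that elapse in one millisecond at the given clock."""
    return cpu_hz / 1000.0


def cycles_to_ms(cycles, cpu_hz):
    """Convert a cycle-counter difference to milliseconds."""
    return cycles / cycles_per_ms(cpu_hz)


if __name__ == "__main__":
    print(cycles_per_ms(80_000_000))    # constant for an 80MHz core
    print(cycles_per_ms(160_000_000))   # constant for a 160MHz core
    print(cycles_to_ms(6_400_000, 160_000_000), "ms")
```

Deriving the constant from the configured CPU frequency like this, instead of hard-coding it, would be one way to avoid wrong ms values after a clock change.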

The firmware supports 3 serial commands that you can send via the terminal. All the commands need to be terminated with a newline. The supported commands are:

Command         Description
TEST            Sends a single digit inference request to the server and prints the parsed response
START=<SECS>    Triggers a TCP inference request to the server every <SECS> seconds. For example, `START=5` polls the server every 5 seconds (don’t forget the newline at the end).
STOP            Stops the timer that sends the periodic TCP inference requests
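To make the command format concrete, this is a minimal Python sketch of how such newline-terminated commands could be parsed. It is not the firmware parser (that one is Arduino code in the repo); `parse_command` is a name made up for this example, and rejecting non-positive intervals is my own assumption.

```python
def parse_command(line):
    """Return (command, argument) for one newline-terminated command line.

    Raises ValueError for anything that is not TEST, START=<SECS> or STOP.
    """
    text = line.strip()
    if text in ("TEST", "STOP"):
        return (text, None)
    if text.startswith("START="):
        secs = int(text[len("START="):])   # ValueError if not a number
        if secs <= 0:
            raise ValueError("interval must be a positive number of seconds")
        return ("START", secs)
    raise ValueError("unknown command: %r" % text)


if __name__ == "__main__":
    for raw in ("TEST\n", "START=5\n", "STOP\n"):
        print(parse_command(raw))
```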

To build and upload the firmware to the esp8266 read the README of the repo.

Using the Jupyter notebook

I’ve used the exact same tflite model that I’ve used in part 3 and part 4. The model is located in ./jupyter_notebook/mnist.tflite. You need to clone the repo on the Jetson nano (or your workstation if you prefer). From now on, instead of making a distinction between the Jetson nano and the workstation, I’ll just refer to them as the cloud, as it doesn’t really make any difference. Therefore, just clone the repo to your cloud server. This here is the jupyter notebook on bitbucket.

Benchmarking the inference on the cloud

The important sections in the notebook are 3 and 4. Section 3 is the `Benchmark the inference on the Jetson-nano`. Here I assume that this runs on the nano, but it’s the same on any server. So, in this section I’m benchmarking the model inference with a random input. I’ve run this benchmark on both my workstation and the Jetson nano and these are the results I got. For reference I’ll also add the numbers of the edge inference on the STM32F7 from the previous post using x-cube-ai.

Cloud server          ms (after 1000 runs)
My workstation        0.206410
Jetson nano (MAXN)    0.987536
Jetson nano (5W)      2.419758
STM32F746 @ 216MHz    76.754
STM32F746 @ 288MHz    57.959

The next table shows the difference in performance between all the different benchmarks.

             STM@216   STM@288   Nano 5W   Nano MAXN   2700x
STM@216      1         1.324     31.719    77.722      371.852
STM@288      0.755     1         23.952    58.69       280.795
Nano 5W      0.031     0.041     1         2.45        11.723
Nano MAXN    0.012     0.017     0.408     1           4.784
2700x        0.002     0.003     0.085     0.209       1

An example of how to read the above table: the STM32F7@288 is 1.324x faster than the STM32F7@216, the Ryzen 2700x is 371.8x faster than the STM32F7@216, the Jetson nano in MAXN mode is 2.45x faster than in 5W mode, etc.
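The ratios in the table can be recomputed from the measured times: each cell is the time of the row’s platform divided by the time of the column’s platform. A minimal Python sketch of that calculation follows; the dictionary and function names are mine, the timings are the ones from the first table.

```python
# Inference times in ms, copied from the benchmark table
timings_ms = {
    "STM@216": 76.754,
    "STM@288": 57.959,
    "Nano 5W": 2.419758,
    "Nano MAXN": 0.987536,
    "2700x": 0.206410,
}


def speedup(row, col):
    """How many times faster `col` is than `row` (row time / column time)."""
    return timings_ms[row] / timings_ms[col]


if __name__ == "__main__":
    names = list(timings_ms)
    print(" " * 10 + " ".join("%10s" % name for name in names))
    for row in names:
        cells = " ".join("%10.3f" % speedup(row, col) for col in names)
        print("%-10s" % row + cells)
```

The printed values can differ from the table in the last digit, because this sketch rounds while some of the table values appear to be truncated.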

What you should probably keep from the above table is that the Jetson nano is ~32x to 78x faster than the STM32F7 at the stock clock. Also the 2700x is only ~4.8x faster than the nano in MAXN mode, which is very good performance for the nano if you think about its consumption, cost and size.

Therefore, the performance/cost and performance/consumption ratios are far better on the Jetson nano compared to the 2700x. So it makes perfect sense to use it as a cloud tflite server. One use-case of this scenario is having a cloud accelerator running locally in a place that is covered by WiFi over a wide area and then having dozens of esp8266 clients that request inferences from the server.

Benchmarking the tflite cloud inference

To run the server you need to run the cell in section `4. Run the TCP server`. First you need to insert the correct IP of the cloud server (in my case, the IP of my Jetson nano). Then you run the cell. Alternatively, you can edit the `jupyter_notebook/TfliteServer/` file and change the IP (or the TCP port, if you like) in this code:

if __name__=="__main__":
    srv = TfliteServer('../mnist.tflite')
    srv.listen('', 32001)

Then on your terminal run:


This will run the server and you’ll get the following output.

dimtass@jetson-nano:~/rnd/tensorflow-nano/jupyter_notebook/TfliteServer$ python3
TfliteServer initialized
TCP server started at port: 32001

Now send the TEST command on the esp8266 via the serial terminal. When you do this, then the following things will happen:

  1. esp8266 serializes the 28×28 random array to a flatbuffer
  2. esp8266 connects to the TCP port of the server
  3. esp8266 sends the flatbuffer to the server
  4. Server de-serializes the flatbuffer
  5. Server converts the tensor from (784,) to (1, 28, 28, 1)
  6. Server runs the inference with the input
  7. Server serializes the output in a flatbuffer (including the time in ms of the inference operation)
  8. Server sends the output back to the esp8266
  9. esp8266 de-serializes the output
  10. esp8266 outputs the result

This is what you get from the esp8266 serial output:

Request a single inference...
======== Results ========
Inference time in ms: 12.608528
out[0]: 0.080897
out[1]: 0.128900
out[2]: 0.112090
out[3]: 0.129278
out[4]: 0.079890
out[5]: 0.106956
out[6]: 0.074446
out[7]: 0.106730
out[8]: 0.103112
out[9]: 0.077702
Transaction time: 42.387493 ms

In this output the “Inference time in ms” is the time in ms that the cloud server spent running the inference. Then you get the array of the 10 predictions for the output and finally the “Transaction time” is the total time of the whole procedure, i.e. the time spent in steps 1-9. At the same time the output of the server is the following:

==== Results ====
Hander time in msec: 30.779839
Prediction results: [0.08089687 0.12889975 0.11208985 0.12927799 0.07988966 0.10695633
 0.07444601 0.10673008 0.10311186 0.07770159]
Predicted value: 3

The “handler time in msec” is the time that the TCP reception handler used (see jupyter_notebook/TfliteServer/ and the FbTcpHandler class).

From the above benchmark with the esp8266 we need to keep the following two things:

  1. From the 42.38 ms, 12.60 ms was the inference run time, so everything else, like serialization and network transactions, cost 29.78 ms (on the local WiFi network). Therefore, the extra time was ~2.4x the inference run time itself.
  2. The total time of the above operation was 42.38 ms, while the STM32F7 needed 76.75 ms @ 216MHz (or 57.96 ms @ 288MHz). That means that the cloud inference is ~1.8x and ~1.37x faster, respectively.
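The arithmetic behind those two points can be checked with a few lines of Python. This is just a worked example using the numbers printed by the esp8266 and the STM32F7 timings from the previous post; the variable names are mine.

```python
transaction_ms = 42.387493   # "Transaction time" printed by the esp8266
inference_ms = 12.608528     # "Inference time in ms" reported by the server
stm32_edge_ms = {"216MHz": 76.754, "288MHz": 57.959}   # x-cube-ai on the edge

# Everything that is not the inference itself: serialization + network
overhead_ms = transaction_ms - inference_ms
overhead_ratio = overhead_ms / inference_ms

if __name__ == "__main__":
    print("overhead: %.2f ms (%.2fx the inference time)"
          % (overhead_ms, overhead_ratio))
    for clock, edge_ms in stm32_edge_ms.items():
        print("cloud vs STM32F7 @ %s: %.2fx faster"
              % (clock, edge_ms / transaction_ms))
```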

Finally, as you probably already noticed, the protocol is very simple, so there are no checksums, server-client validation and other fail-safe mechanisms. Of course, that’s on purpose, as you can imagine. Otherwise, the complexity would be higher. But you need to consider those things if you’re going to design a system like this.

Benchmarking the tflite server

The tflite TCP server is just a python TCP socket listener. That means that by design it has much lower performance compared to any TCP server written in C/C++ or Java. Despite the fact that I was aware of this limitation, I’ve chosen to go with this solution in order to integrate the server easily in the jupyter notebook and it was also much faster to implement. Sadly, I’ve seen a great performance hit with this implementation and I would like to investigate a bit further (in the future) and verify if that’s because of the python implementation or something else. The results were pretty bad especially for the Jetson nano.

In order to test the server, I’ve written a small C/C++ stress tool that I’ve used to spawn a user-defined number of TCP client threads and request inferences from the server. Because it’s still early in my testing, I assume that the GPU can only run one inference at a time, therefore there’s a thread lock before any thread is able to call the inference function. This lock is in the jupyter_notebook/TfliteServer/ file in those lines:

output_data, time_ms = runInference(resp.Input().DigitAsNumpy())

One thing I would like to mention here is that it’s not that I’m too lazy to investigate every aspect of each framework in depth, it’s just that I don’t have the time, therefore I make logical assumptions. This is why I assume that I need to put a lock there, in order to prevent several simultaneous calls to the tensorflow API. Maybe this is handled in the API, I don’t know. Anyway, keep in mind that this is the reason the lock is there, so all the inference requests will block and wait until the currently running inference is finished.

So, the easiest way to run some benchmarks is to run the TfliteServer on the server. First you need to edit the IP address in the __main__ function. You need to use the IP of the server; even when I run this locally I use the real IP address. Then run the server:
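The locking pattern described here looks roughly like the following Python sketch. It is not the code from the repo: `run_inference` is a hypothetical stand-in that just sums its input instead of calling tflite, and `handle_request` is a made-up name for whatever the TCP handler does per connection.

```python
import threading
import time

# One lock shared by all handler threads
inference_lock = threading.Lock()


def run_inference(input_data):
    """Hypothetical stand-in for the real tflite call: (output, time in ms)."""
    start = time.perf_counter()
    output = [sum(input_data)]
    return output, (time.perf_counter() - start) * 1000.0


def handle_request(input_data):
    # Only one thread at a time gets past this point; the others block here
    # until the running inference has finished.
    with inference_lock:
        return run_inference(input_data)


if __name__ == "__main__":
    results = []

    def worker(value):
        results.append(handle_request([value, value]))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print(len(results), "requests served")
```

Using the lock as a context manager guarantees it is released even if the inference call raises, so a failing request can’t block all the other clients forever.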

cd jupyter_notebook/TfliteServer/

Then you can run the client and pass the server IP, port and number of threads in the command line. For example, I’ve run both the client and server on my workstation, so the command I’ve used was:

cd tcp-stress-tool/
./tcp-stress-tool 32001 500

This will spawn 500 clients (each on its own thread) and request an inference from the python server. Because the output is quite big, I’ll only post the last line (but I’ve copied some logs in the results/ folder in the repo).

This tool will spawn a number of TCP clients and will request
the tflite server to run an inference on random data.
Warning: there is no proper input parsing, so you need to be
cautious and read the usage below.

tcp-stress-tool [server ip] [server port] [number of clients]

server ip:
server port: 32001
number of clients: 500

Spawning 500 TCP clients...
[thread=2] Connected
[thread=1] Connected
[thread=3] Connected


Total elapsed time: 31228.558064 ms
Average server inference time: 0.461818 ms

The output means that 500 TCP transactions and inferences were completed in 31.2 secs with an average inference time of 0.46 ms. That means the total time for the inferences was only ~0.23 secs (500 × 0.46 ms) and the remaining ~31 secs were spent in the comms and serializing the data. That seems way too much, though, right? I’m sure that this time should be less. On the Jetson nano it’s even worse, because I wasn’t able to run a test with 500 clients and many connections were rejected. With any number above 20 threads the python script can’t handle the load. I don’t know why. In the results/ folder you’ll find the following test results:
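As a sanity check on how the total time splits between inference and everything else, the arithmetic can be redone from the two numbers the stress tool prints. A small Python worked example (the variable names are mine):

```python
clients = 500
total_ms = 31228.558064        # "Total elapsed time" from the tool
avg_inference_ms = 0.461818    # "Average server inference time" from the tool

# Time the server actually spent inside the inference call, summed up
inference_total_ms = clients * avg_inference_ms
# Whatever is left: connections, serialization, python overhead, waiting
overhead_ms = total_ms - inference_total_ms

if __name__ == "__main__":
    print("inference total: %.1f ms" % inference_total_ms)
    print("everything else: %.1f ms" % overhead_ms)
    print("average per client: %.2f ms" % (total_ms / clients))
```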

  • tcp-stress-jetson-nano-10c-5W.txt
  • tcp-stress-jetson-nano-50c-5W.txt
  • tcp-stress-jetson-nano-50c-MAXN.txt
  • tcp-stress-output-workstation-500c.txt

As you can guess from the filename, Xc is the number of threads and for Jetson nano there are results for both modes (MAXN and 5W). This is a table with all the results:

Test         Threads   Total time ms   Avg. inference ms
Nano 5W      10        1057.1          3.645
Nano 5W      20        3094.05         4.888
Nano MAXN    10        236.13          2.41
Nano MAXN    20        3073.33         3.048
2700x        500       31228.55        0.461

From those results, I’m quite sure that there’s something wrong with the python3 TCP server. Maybe at some point I’ll try something different. In any case that concludes my tests, although there’s a question mark regarding the performance of the Jetson nano when it’s acting as a tflite server. For now, it seems that it can’t handle a lot of connections (with this implementation), but I’m quite certain this will be much different if the server is a proper C/C++ implementation.


With this post I’ve finished the main tests around ML that I originally had in mind. I’ve explored how ML can be used with various embedded MCUs and I’ve tested both edge and cloud implementations. On the edge side of ML, I’ve tested a naive implementation and also two higher level APIs (the TensorFlow Lite for Microcontrollers API and x-cube-ai from ST). On the cloud side of ML, I’ve tested one of the most common and dirt cheap WiFi-enabled MCUs, the esp8266.

I’ll mention here once again that, although I’ve used the MNIST example, that doesn’t really matter. It’s the NN model complexity that matters. By that I mean that although it doesn’t make any sense to send a 28×28 tensor from the esp8266 to the cloud for running a prediction on a digit, the model is still just fine for running benchmarks and drawing conclusions. Also, this (784,) input tensor stresses the network, which is good for performance tests.

One thing that you might be wondering at this point is “which implementation is better”? There’s no single answer for this. It’s a per-case decision and it depends on several parameters around the specific requirements of the project, like cost, energy efficiency, location, environmental conditions and several other things. By doing those tests though, I now have a clearer picture of the capabilities and the limitations of the current technology, and this is a very good thing to have when you have to start a real project development. I hope that the readers who have gone through all the posts of this series are able to draw some conclusions about those tools and their limitations and, based on this knowledge, can start evaluating more focused solutions that fit their project’s specs.

One thing that’s also important is that the whole ML domain is developing really fast and things are changing within days or even hours. New APIs, new tools and new hardware are showing up. For example, numerous hardware vendors are now releasing products with some kind of NN acceleration (or AI as they prefer to call it). I’ve read a couple of days ago that even Alibaba released a 16-core RISC-V processor (XuanTie 910) with AI acceleration. AmLogic released the A311D. Rockchip released the RK3399Pro. Also, Gyrfalcon released the Lightspeeur 2801S Neural Accelerator, to compete with Intel’s NCS2 and Google’s TPU. And many more Chinese manufacturers will release several other RISC-V CPUs with accelerators for NN in the next few weeks and months. Therefore, as you can see, ML in the embedded domain is very hot.

I think I will return from time to time to the embedded ML domain in the future to sync with the current progress and maybe write a few more posts on the subject. But the next stupid-project will be something different. There’s still some clean up and editing I want to do in the first 2 posts in the series, though.

I hope you enjoyed this series as much as I did.

Have fun!

Machine Learning on Embedded (Part 4)


Note: This post is the fourth in the series. Here you can find part 1, part 2 and part 3.

For this post I’ve used the same MNIST model that I’ve trained for TensorFlow Lite for Microcontrollers (tflite-micro) and I’ve implemented the firmware on the 32F746GDISCOVERY by using ST’s X-CUBE-AI framework. But before diving into this, let’s do a recap and repeat some key points from the previous articles.

In part 1, I’ve implemented a naive implementation of a single neuron with 3 inputs and 1 output. Naive means that the inference was just C code, without any acceleration from the hardware. I’ve run those tests on various MCUs and it was fun seeing even an arduino nano running this thing. Also I’ve overclocked a few MCUs to see how the inference performance scales with the frequency increase.

In part 2, I’ve implemented another naive implementation of a NN with 3 inputs, 32 hidden nodes and 1 output. The result was as expected, which means that as the NN complexity increases the performance drops. Therefore, not all MCUs can provide the performance to run more complex NNs in real-time. The real-time part is relative, though, because real-time can be from a few ns up to several hours depending on the project. That means that if the inference of a deeper network needs 12 hours to run on your arduino and your data stream is 1 input per 12 hours and 2 minutes, then you’re fine. Anyway, I won’t debate on that, I think you know what I mean. But if your input sample comes every few ms then you need something faster. Also, in the back of my head I wanted to verify if this simple NN complexity is useful at all and if it can offer something more than lookup tables or algorithms.

In part 3, I was planning to use x-cube-ai from ST, to port a Keras NN and then benchmark the inference, but after the hint I got in the comments from Raukk, I’ve decided to go with the tflite-micro. Tflite-micro at that point seemed very appealing, because it’s a great idea to have a common API between the desktop, the embedded Linux and the MCU worlds. Think about it. It’s really great to be able to share (almost) the same code between those platforms. During the process though, I’ve found a lot of issues and I had to spend more time than I thought. You can read the full list of those issues in the post, but the main issue was that the performance was not good. Actually, the inference was extremely slow. At the end of writing the article I’ve figured out that I had the FPU disabled (duh!) and when I’ve enabled I got a 3x times better performance, but it was still slow.

Therefore, in this post I’ve implemented the exact same model to do a comparison of x-cube-ai and tflite-micro. As I’ve mentioned in the previous posts (and I’m doing it again now), Machine Learning (ML) on the low-end embedded domain (=MCUs) is still a work in progress and there’s a lot of development on the various tools. If you think about it, the whole of ML has been changing rapidly over the last years and its introduction to microcontrollers is even more recent. It’s a very hot topic and domain right now. For example, while I was writing the tflite-micro post, the repo was updated several times; but I had to stop updating and lock to a git version in order to finish the post.

Also, on the same day I finished the post for x-cube-ai, the new version 4.0.0 was released, which pushed back the post release. The new version supports importing tflite models and because I’ve used a Keras model in my first implementation, I had to throw away quite some work that I’ve done… But I couldn’t do otherwise, as now I had the chance to use the exact same tflite model and not the Keras model (the tflite was a port from Keras). Of course, I didn’t expect any differences, but still it’s better to compare the exact same models.

You’ll find all the source code for this project here:

So, let’s dive into it.


ST presents the X-CUBE-AI as an “STM32Cube Expansion Package part of the STM32Cube.AI ecosystem and extending STM32CubeMX capabilities with automatic conversion of pre-trained Neural Network and integration of generated optimized library into the user’s project“. Yeah, I know, fancy words. In plain English that means that it’s just a static library for the STM32 MCUs that uses the cmsis-dsp accelerations and a set of tools that convert various model formats to the format that the library can process. That’s it. And it works really well.

There’s also a very informative video here, that shows the procedure you need to follow in order to create a new x-cube-ai project and that’s the one I’ve also used to create the project in this repo. I believe it’s very straight forward and there’s no reason to explain anything more than that. The only different thing I do always is that I’m just integrating the resulted code from STM32CubeMX to my cmake template.

So the x-cube-ai adds some tools in the CubeMX GUI and you can use them to analyze the model, compress the weight values, and validate the model on both desktop and the target. With x-cube-ai, you can finally create source code for 3 types of projects, which are the SystemPerformance, Validation and ApplicationTemplate. For the first two projects you just compile them, flash and run, so you don’t have to write any code yourself (unless you want to change default behaviour). As you can see on the YouTube link I’ve posted, you can choose the type of project in the “Pinout & Configuration” tab and then click in the “Additional Software”. From that list expand the “X-CUBE-AI/Application” (be careful to select the proper (=latest?) version if you have many) and then in the Selection column, select the type of the project you want to build.

Analyzing the model

I want to mention here that ST has done a great job on logging and displaying information for the model. You get a lot of information in CubeMX while preparing your model and you know beforehand the RAM/ROM size with the compression, the complexity, the ROM usage, the MACC, and you can also get the complexity by layer. This is an example output I got when I’ve analyzed the MNIST model.

Analyzing model 
Neural Network Tools for STM32 v1.0.0 (AI tools v4.0.0) 
-- Importing model 
-- Importing model - done (elapsed time 0.401s) 
-- Rendering model 
-- Rendering model - done (elapsed time 0.156s) 
Creating report file /home/dimtass/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/4.0.0/Utilities/linux/stm32ai_output/mnistkeras_analyze_report.txt 
Exec/report summary (analyze 0.558s err=0) 
model file      : /rnd/bitbucket/machine-learning-for-embedded/code-stm32f746-xcube/mnist.tflite 
type            : tflite (tflite) 
c_name          : mnistkeras 
compression     : 4 
quantize        : None 
L2r error       : NOT EVALUATED 
workspace dir   : /tmp/mxAI_workspace26422621629890969500934879814382 
output dir      : /home/dimtass/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/4.0.0/Utilities/linux/stm32ai_output 
model_name      : mnist 
model_hash      : 3be31e1950791ab00299d58cada9dfae 
input           : input_0 (item#=784, size=3.06 KiB, fmt=FLOAT32) 
input (total)   : 3.06 KiB 
output          : nl_7 (item#=10, size=40 B, fmt=FLOAT32) 
output (total)  : 40 B 
params #        : 93,322 (364.54 KiB) 
macc            : 2,852,598 
rom (ro)        : 263,720 (257.54 KiB) -29.35% 
ram (rw)        : 33,664 + 3,176 (32.88 KiB + 3.10 KiB) 
id  layer (type)        output shape      param #     connected to             macc           rom                 
0   input_0 (Input)     (28, 28, 1)                                                                               
    conv2d_0 (Conv2D)   (26, 26, 32)      320         input_0                  237,984        1,280               
    nl_0 (Nonlinearity) (26, 26, 32)                  conv2d_0                                                    
1   pool_1 (Pool)       (13, 13, 32)                  nl_0                                                        
2   conv2d_2 (Conv2D)   (11, 11, 64)      18,496      pool_1                   2,244,480      73,984              
    nl_2 (Nonlinearity) (11, 11, 64)                  conv2d_2                                                    
3   pool_3 (Pool)       (5, 5, 64)                    nl_2                                                        
4   conv2d_4 (Conv2D)   (3, 3, 64)        36,928      pool_3                   332,416        147,712             
    nl_4 (Nonlinearity) (3, 3, 64)                    conv2d_4                                                    
5   reshape_5 (Reshape) (576,)                        nl_4                                                        
    dense_5 (Dense)     (64,)             36,928      reshape_5                36,864         38,144 (c)          
    nl_5 (Nonlinearity) (64,)                         dense_5                  64                                 
6   dense_6 (Dense)     (10,)             650         nl_5                     640            2,600               
7   nl_7 (Nonlinearity) (10,)                         dense_6                  150                                
mnist p=93322(364.54 KBytes) macc=2852598 rom=257.54 KBytes ram=32.88 KBytes -29.35% 
Complexity by layer - macc=2,852,598 rom=263,720 
id      layer (type)        macc                                    rom                                     
0       conv2d_0 (Conv2D)   ||||                              8.3%  |                                 0.5%  
2       conv2d_2 (Conv2D)   |||||||||||||||||||||||||||||||  78.7%  ||||||||||||||||                 28.1%  
4       conv2d_4 (Conv2D)   |||||                            11.7%  |||||||||||||||||||||||||||||||  56.0%  
5       dense_5 (Dense)     |                                 1.3%  ||||||||                         14.5%  
5       nl_5 (Nonlinearity) |                                 0.0%  |                                 0.0%  
6       dense_6 (Dense)     |                                 0.0%  |                                 1.0%  
7       nl_7 (Nonlinearity) |                                 0.0%  |                                 0.0%  
Using TensorFlow backend. 
Analyze complete on AI model

This is the output that you get by just running the analyze tool on the imported tflite model in CubeMX. Lots of information there, but let’s focus on some really important info. As you can see, you know exactly how much ROM and RAM you need! You couldn’t do that with tflite-micro. In tflite-micro you need to either calculate this on your own, or add heap size and try to load the model; if the heap isn’t enough and the allocator complains, then you add more heap and repeat. This is not very convenient, right? But with x-cube-ai you know exactly how much RAM you need, at least for the model (you also need to add more for your app). Great stuff.

Model RAM/ROM usage

So in this case the ROM needed for the model is 263760 bytes. In part 3, that was 375740 bytes (see section 3 in the jupyter notepad). That difference is not because I’ve used quantization, but because of the 4x compression selection I’ve made for the weights in the tool (see in the YouTube video which does the same time=3:21). Therefore, the decrease in the model size in ROM is from that compression. According to the tools that’s -29.35% compared to the original size. In the current project the model binary blob is in the `source/src/mnistkeras_data.c` file and it’s an C array like the one in the tflite-micro project. The similar file in the tf-lite model was the `source/src/inc/model_data.h`. Those sizes are without quantization, because I didn’t manage to convert the model to UINT8 as the TFLiteConverter converts the model only to INT8, which is not supported in tflite. I’m still puzzled with that and I can’t figure out why this happening and I couldn’t find any documentation or example how to do that.

Now, let’s go to the RAM usage. With x-cube-ai the RAM needed is only 36840 bytes! With tflite-micro I needed 151312 bytes (see the table in the “Model RAM Usage” section here). That’s about 4x less RAM. It’s amazing. The reason is that in tflite-micro the micro_allocator expands the layers of the model in RAM, but in x-cube-ai that doesn’t happen. From the above report (and from what I’ve seen) it seems that the layers remain in ROM and the API only allocates RAM for the needed operations.

As you can imagine, those two things (RAM and ROM usage) make x-cube-ai a much better option to start with. They even make it possible to run this model on MCUs with less RAM/ROM than the STM32F746, which is considered a beefy MCU. That’s a huge difference in terms of resources.
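To put those numbers side by side, here is a small back-of-the-envelope calculation. The figures are copied from the reports quoted in this post; the one assumption (not stated by the tool) is that the uncompressed size is simply the parameter count times 4 bytes, since the weights are float32.

```python
# Figures copied from the reports in this post.
PARAMS = 93_322              # "params #" in the x-cube-ai report
ROM_XCUBE = 263_720          # "rom (ro)" in bytes, with 4x weight compression
RAM_XCUBE = 33_664 + 3_176   # the two parts of "ram (rw)" in bytes
RAM_TFLITE_MICRO = 151_312   # RAM needed in the tflite-micro project, in bytes

# Assumption: uncompressed weights are float32, i.e. 4 bytes per parameter.
rom_uncompressed = PARAMS * 4

rom_change_pct = (ROM_XCUBE - rom_uncompressed) / rom_uncompressed * 100
ram_ratio = RAM_TFLITE_MICRO / RAM_XCUBE

print(f"uncompressed weights: {rom_uncompressed} bytes")
print(f"ROM change          : {rom_change_pct:.2f}%")
print(f"x-cube-ai RAM       : {RAM_XCUBE} bytes")
print(f"RAM ratio           : {ram_ratio:.1f}x")
```

Under that assumption the ROM change matches the percentage printed by the tool, and the RAM ratio is where the "about 4x" comes from.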

Framework output project types

As I’ve mentioned previously, with x-cube-ai you can create 3 types of projects (SystemPerformance, Validation, ApplicationTemplate). Let’s see a few more details about those.

Note: for the SystemPerformance and Validation project types, I’ve included the bin files in the `extras/` folder. You can only flash those on the STM32F746 which comes with the 32F746GDISCOVERY board.


SystemPerformance

As the name clearly implies, you can use this project type to benchmark the performance using random inputs. If you think about it, that’s all I would need for this post: import the model, build this application and there you go, I have all I need. That’s correct, but… I wanted to do the same thing I’ve done in the previous project with tflite-micro: use a comm protocol to upload hand-drawn digits from the jupyter notebook to the STM32F7, run the inference, get the output back and validate the result. Therefore, although this project type is enough for benchmarking, I still had work to do. But in case you just need to benchmark the MCU running the model inference, just build this. You don’t even have to write a single line of code. This is the serial output when this code runs (it’s a loop, but I only post one iteration).

Running PerfTest on "mnistkeras" with random inputs (16 iterations)...

Results for "mnistkeras", 16 inferences @216MHz/216MHz (complexity: 2852598 MACC)
 duration     : 73.785 ms (average)
 CPU cycles   : 15937636 -1352/+808 (average,-/+)
 CPU Workload : 7%
 cycles/MACC  : 5.58 (average for all layers)
 used stack   : 576 bytes
 used heap    : 0:0 0:0 (req:allocated,req:released) cfg=0

From the above output we can see that @216MHz (default frequency) the inference duration was 73.785 ms (average), along with some other info. OK, so now let’s push the frequency up a bit to 288MHz and see what happens.

Running PerfTest on "mnistkeras" with random inputs (16 iterations)...

Results for "mnistkeras", 16 inferences @288MHz/288MHz (complexity: 2852598 MACC)
 duration     : 55.339 ms (average)
 CPU cycles   : 15937845 -934/+1145 (average,-/+)
 CPU Workload : 5%
 cycles/MACC  : 5.58 (average for all layers)
 used stack   : 576 bytes
 used heap    : 0:0 0:0 (req:allocated,req:released) cfg=0

55.339 ms! It’s amazing. More about that later.
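The two reports are easy to cross-check, because the duration is just the cycle count divided by the clock. A minimal sketch using only the numbers printed above (the function name is mine, not part of any tool):

```python
MACC = 2_852_598  # model complexity reported by the tool

def inference_stats(cpu_cycles: int, clock_hz: float) -> tuple[float, float]:
    """Return (duration in ms, cycles per MACC) for one inference."""
    duration_ms = cpu_cycles / clock_hz * 1000.0
    return duration_ms, cpu_cycles / MACC

# (clock in MHz, average CPU cycles) taken from the two PerfTest reports.
for mhz, cycles in ((216, 15_937_636), (288, 15_937_845)):
    ms, cycles_per_macc = inference_stats(cycles, mhz * 1e6)
    print(f"{mhz} MHz: {ms:.3f} ms, {cycles_per_macc:.2f} cycles/MACC")
```

The cycle count is practically the same in both runs, so the duration scales with the inverse of the clock frequency.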


Validation

The Validation project type is the one to use if you want to validate your model with different inputs. There is a mode in which you can validate on the target with either random or user-defined data. There is a pdf document here, named “Getting started with X-CUBE-AI Expansion Package for Artificial Intelligence (AI)”, and you can find the format of the user input in section 14.2; it’s just a csv file with comma separated values.
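For the user-defined data, a csv input could be produced with something like the sketch below. This assumes one flattened sample per row (784 values for a 28×28 digit), which matches the `shape=(10, 784)` that the tool logs for random inputs; the exact layout is described in section 14.2 of the document, so check it before relying on this. The function name and the sample data are hypothetical.

```python
import csv
import io

def write_samples_csv(stream, samples):
    """Write one flattened sample per row, values separated by commas."""
    writer = csv.writer(stream)
    for sample in samples:
        writer.writerow(f"{value:.6f}" for value in sample)

# Hypothetical data: two inputs, already flattened to 784 floats in [0, 1].
samples = [[0.0] * 784, [1.0] * 784]
buffer = io.StringIO()
write_samples_csv(buffer, samples)
rows = buffer.getvalue().splitlines()
print(len(rows), "rows,", len(rows[0].split(",")), "values per row")
```

To write a real file, pass `open("digit.csv", "w", newline="")` instead of the `StringIO` buffer.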

The default mode, which uses random inputs, produces the following output (warning: a lot of text follows).

Starting AI validation on target with random data... 
Neural Network Tools for STM32 v1.0.0 (AI tools v4.0.0) 
-- Importing model 
-- Importing model - done (elapsed time 0.403s) 
-- Building X86 C-model 
-- Building X86 C-model - done (elapsed time 0.519s) 
-- Setting inputs (and outputs) data 
Using random input, shape=(10, 784) 
-- Setting inputs (and outputs) data - done (elapsed time 0.691s) 
-- Running STM32 C-model 
ON-DEVICE STM32 execution ("mnistkeras", /dev/ttyUSB0, 115200).. 
<Stm32com id=0x7f8fd8339ef0 - CONNECTED(/dev/ttyUSB0/115200) devid=0x449/STM32F74xxx msg=2.0> 
 0x449/STM32F74xxx @216MHz/216MHz (FPU is present) lat=7 Core:I$/D$ ART: 
 found network(s): ['mnistkeras'] 
 description    : 'mnistkeras' (28, 28, 1)-[7]->(1, 1, 10) macc=2852598 rom=257.54KiB ram=32.88KiB 
 tools versions : rt=(4, 0, 0) tool=(4, 0, 0)/(1, 3, 0) api=(1, 1, 0) "Fri Jul 26 14:30:06 2019" 
Running with inputs=(10, 28, 28, 1).. 
....... 1/10 
....... 2/10 
....... 3/10 
....... 4/10 
....... 5/10 
....... 6/10 
....... 7/10 
....... 8/10 
....... 9/10 
....... 10/10 
 RUN Stats    : batches=10 dur=4.912s tfx=4.684s 6.621KiB/s (wb=30.625KiB,rb=400B) 
Results for 10 inference(s) @216/216MHz (macc:2852598) 
 duration    : 78.513 ms (average) 
 CPU cycles  : 16958877 (average) 
 cycles/MACC : 5.95 (average for all layers) 
Inspector report (layer by layer) 
 n_nodes        : 7 
 num_inferences : 10 
Clayer  id  desc                          oshape          fmt       ms         
0       0   10011/(Merged Conv2d / Pool)  (13, 13, 32)    FLOAT32   11.289     
1       2   10011/(Merged Conv2d / Pool)  (5, 5, 64)      FLOAT32   57.406     
2       4   10004/(2D Convolutional)      (3, 3, 64)      FLOAT32   8.768      
3       5   10005/(Dense)                 (1, 1, 64)      FLOAT32   1.009      
4       5   10009/(Nonlinearity)          (1, 1, 64)      FLOAT32   0.006      
5       6   10005/(Dense)                 (1, 1, 10)      FLOAT32   0.022      
6       7   10009/(Nonlinearity)          (1, 1, 10)      FLOAT32   0.015      
                                                                    78.513 (total) 
-- Running STM32 C-model - done (elapsed time 5.282s) 
-- Running original model 
-- Running original model - done (elapsed time 0.100s) 
Exec/report summary (validate 0.000s err=0) 
model file      : /rnd/bitbucket/machine-learning-for-embedded/code-stm32f746-xcube/mnist.tflite 
type            : tflite (tflite) 
c_name          : mnistkeras 
compression     : 4 
quantize        : None 
L2r error       : 2.87924684e-03 (expected to be < 0.01) 
workspace dir   : /tmp/mxAI_workspace3396387792167015918690437549914931 
output dir      : /home/dimtass/.stm32cubemx/stm32ai_output 
model_name      : mnist 
model_hash      : 3be31e1950791ab00299d58cada9dfae 
input           : input_0 (item#=784, size=3.06 KiB, fmt=FLOAT32) 
input (total)   : 3.06 KiB 
output          : nl_7 (item#=10, size=40 B, fmt=FLOAT32) 
output (total)  : 40 B 
params #        : 93,322 (364.54 KiB) 
macc            : 2,852,598 
rom (ro)        : 263,720 (257.54 KiB) -29.35% 
ram (rw)        : 33,664 + 3,176 (32.88 KiB + 3.10 KiB) 
id  layer (type)        output shape      param #     connected to             macc           rom                 
0   input_0 (Input)     (28, 28, 1)                                                                               
    conv2d_0 (Conv2D)   (26, 26, 32)      320         input_0                  237,984        1,280               
    nl_0 (Nonlinearity) (26, 26, 32)                  conv2d_0                                                    
1   pool_1 (Pool)       (13, 13, 32)                  nl_0                                                        
2   conv2d_2 (Conv2D)   (11, 11, 64)      18,496      pool_1                   2,244,480      73,984              
    nl_2 (Nonlinearity) (11, 11, 64)                  conv2d_2                                                    
3   pool_3 (Pool)       (5, 5, 64)                    nl_2                                                        
4   conv2d_4 (Conv2D)   (3, 3, 64)        36,928      pool_3                   332,416        147,712             
    nl_4 (Nonlinearity) (3, 3, 64)                    conv2d_4                                                    
5   reshape_5 (Reshape) (576,)                        nl_4                                                        
    dense_5 (Dense)     (64,)             36,928      reshape_5                36,864         38,144 (c)          
    nl_5 (Nonlinearity) (64,)                         dense_5                  64                                 
6   dense_6 (Dense)     (10,)             650         nl_5                     640            2,600               
7   nl_7 (Nonlinearity) (10,)                         dense_6                  150                                
mnist p=93322(364.54 KBytes) macc=2852598 rom=257.54 KBytes ram=32.88 KBytes -29.35% 
Cross accuracy report (reference vs C-model) 
NOTE: the output of the reference model is used as ground truth value 
acc=100.00%, rmse=0.0007, mae=0.0003 
10 classes (10 samples) 
C0         0    .    .    .    .    .    .    .    .    .   
C1         .    0    .    .    .    .    .    .    .    .   
C2         .    .    2    .    .    .    .    .    .    .   
C3         .    .    .    0    .    .    .    .    .    .   
C4         .    .    .    .    0    .    .    .    .    .   
C5         .    .    .    .    .    1    .    .    .    .   
C6         .    .    .    .    .    .    0    .    .    .   
C7         .    .    .    .    .    .    .    2    .    .   
C8         .    .    .    .    .    .    .    .    5    .   
C9         .    .    .    .    .    .    .    .    .    0   
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_m_inputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_c_inputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_m_outputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_c_outputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_io.npz 
Evaluation report (summary) 
Mode                acc       rmse      mae       
X-cross             100.0%    0.000672  0.000304  
L2r error : 2.87924684e-03 (expected to be < 0.01) 
Creating report file /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_validate_report.txt 
Complexity/l2r error by layer - macc=2,852,598 rom=263,720 
id  layer (type)        macc                          rom                           l2r error                     
0   conv2d_0 (Conv2D)   |||                     8.3%  |                       0.5%                                
2   conv2d_2 (Conv2D)   |||||||||||||||||||||  78.7%  |||||||||||            28.1%                                
4   conv2d_4 (Conv2D)   |||                    11.7%  |||||||||||||||||||||  56.0%                                
5   dense_5 (Dense)     |                       1.3%  ||||||                 14.5%                                
5   nl_5 (Nonlinearity) |                       0.0%  |                       0.0%                                
6   dense_6 (Dense)     |                       0.0%  |                       1.0%                                
7   nl_7 (Nonlinearity) |                       0.0%  |                       0.0%  2.87924684e-03 *              
fatal: not a git repository (or any of the parent directories): .git 
Using TensorFlow backend. 
Validation ended

I’ve also included the file `extras/digit.csv`, which is the digit “2” (the same one used in the jupyter notebook), that you can use to verify the model on the target using the `extras/code-stm32f746-xcube-evaluation.bin` firmware and CubeMX. You just need to load the digit as the CubeMX input and validate the model on the target. This is part of the output when validating with that file:

Cross accuracy report (reference vs C-model) 
NOTE: the output of the reference model is used as ground truth value 
acc=100.00%, rmse=0.0000, mae=0.0000 
10 classes (1 samples) 
C0         0    .    .    .    .    .    .    .    .    .   
C1         .    0    .    .    .    .    .    .    .    .   
C2         .    .    1    .    .    .    .    .    .    .   
C3         .    .    .    0    .    .    .    .    .    .   
C4         .    .    .    .    0    .    .    .    .    .   
C5         .    .    .    .    .    0    .    .    .    .   
C6         .    .    .    .    .    .    0    .    .    .   
C7         .    .    .    .    .    .    .    0    .    .   
C8         .    .    .    .    .    .    .    .    0    .   
C9         .    .    .    .    .    .    .    .    .    0

The above output means that the C-model running on the target agrees with the reference model: the network found the digit “2” with 100% accuracy.
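To make the cross accuracy report a bit less magic, here is a minimal sketch of how such numbers could be computed from the outputs of the reference model and the C-model. This is not the code of the ST tool, just the usual definitions of accuracy, RMSE and MAE, with the reference output used as ground truth (as the NOTE in the report says). The sample data is hypothetical.

```python
import math

def cross_accuracy(reference, c_model):
    """Compare two lists of output vectors (one vector per sample)."""
    n_classes = len(reference[0])
    confusion = [[0] * n_classes for _ in range(n_classes)]
    sq_err = abs_err = 0.0
    count = 0
    for ref, out in zip(reference, c_model):
        # The predicted class is the index of the largest output.
        ref_class = max(range(n_classes), key=ref.__getitem__)
        out_class = max(range(n_classes), key=out.__getitem__)
        confusion[ref_class][out_class] += 1
        for r, o in zip(ref, out):
            sq_err += (r - o) ** 2
            abs_err += abs(r - o)
            count += 1
    matches = sum(confusion[i][i] for i in range(n_classes))
    return {
        "acc": 100.0 * matches / len(reference),
        "rmse": math.sqrt(sq_err / count),
        "mae": abs_err / count,
        "confusion": confusion,
    }

# Hypothetical outputs for a single sample that both models classify as "2".
reference = [[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
c_model = [[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
report = cross_accuracy(reference, c_model)
print(report["acc"], report["rmse"], report["mae"], report["confusion"][2][2])
```

The diagonal of the confusion matrix counts the samples where both models agree, which is what the `C0` to `C9` rows in the report show.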


ApplicationTemplate

This is the project you want to build when you develop your own application. In this case CubeMX creates only the necessary code that wraps the x-cube-ai library. These are the `app_x-cube-ai.h` and `app_x-cube-ai.c` files that are located in the `source/src` folder (and in the `inc/` folder inside `src`). They are just wrapper files around the library and the model. You actually only need to call this function and then you’re ready to run your inference.


The x-cube-ai static lib

Let’s see a few things about the x-cube-ai library. First and most important, it’s a closed source library, so it’s proprietary software. You won’t get the code for it, which for people like me is a big negative. I guess that way ST tries to keep the library tied to their own hardware, which makes sense; nevertheless, I don’t like it. That means that the only things you have access to are the header files in the `source/libs/AI/Inc` folder and the static library blob. The only insight you can get into the library is by using the `readelf` tool to extract some information from the blob. I’ve added the output in `extras/elfread_libNetworkRuntime400_CM7_GCC.txt`.

From the output I can tell that this was built on a Windows machine by the user `fauvarqd`, lol. Very valuable information. OK, seriously now, you can also see the exported calls (which you could see anyway from the header files) and also the names of the object files that were used to build the library. Another trick, if you want to get more info, is to try to build the project without the DSP library. Then the linker will complain that the lib can’t find some functions, which means that you can derive some of them. But does it really matter, though? No source code, no fun 🙁

I don’t like the fact that I don’t have access to the code, but it is what it is, so let’s move on.

Building the project

You can find the C++ cmake project here:

In the `source/libs` folder you’ll find all the necessary libraries, which are CMSIS, the STM32F7xx_HAL_Driver, flatbuffers and the x-cube-ai lib. All of these are built as static libraries and then the main.cpp app is linked against them. You’ll find the cmake files for those libs in `source/cmake`. The README file in the repo is quite thorough about the build options and the different builds. To build the code run this command:


If you want to enable overclocking then you can build like this:


Just be sure to select the clock value you like in the `source/src/main.cpp` file, in this line:

RCC_OscInitStruct.PLL.PLLN = 288; // Overclock

The default overclocking value is 288MHz, but you can experiment with a higher one (in my case that was the maximum without hard-faults).

Also, if you overclock you need to change the clock dividers on the APB1 and APB2 buses as well, otherwise the clocks will be too high and you’ll get hard-faults.

RCC_ClkInitStruct.APB1CLKDivider = RCC_HCLK_DIV4;
RCC_ClkInitStruct.APB2CLKDivider = RCC_HCLK_DIV2;

The build command will build the project in the `build-stm32` folder. It’s interesting to see the resulting sizes for all the libs and the binary file. The next table lists the sizes using the current latest gcc-arm-none-eabi-8-2019-q3-update toolchain from here. By the time you read the article this might already have changed.

| File | Size |
|---|---|
| stm32f7-mnist-x_cube_ai.bin | 339.5 kB |
| libNetworkRuntime400_CM7_GCC.a | 414.4 kB |

This is interesting. Let’s see now the differences between the resulting binary and the main AI libs (tflite-micro and x-cube-ai).

| (sizes in kB) | x-cube-ai | tflite-micro |
|---|---|---|
| binary | 339.5 | 542.7 |
| library | 414.4 | 2867 |

As you can see from above, both the binary and the library for x-cube-ai are much smaller. Regarding the binary, that’s because the model is smaller as the weights are compressed. Regarding the libs, you can’t really say if the size matters, as the implementation and the supported layers for tflite-micro are different, but it seems that the x-cube-ai library is much more optimized for this MCU and it must also be more stripped down.

Supported commands in STM32F7 firmware

The code structure of this project in the repo is pretty much the same as the code in the 3rd post. In this case, though, I’ve only used a single command. I’ll copy-paste the text needed from the previous post.

After you build and flash the firmware on the STM32F7 (read the README for more detailed info), you can use a serial port to either send commands via a terminal like cutecom or interact with the jupyter notebook. The firmware supports two UART ports on the STM32F7. In the first case the commands are just ASCII strings, but in the second case it’s a binary flatbuffer schema. You can find the schema in `source/schema/schema.fbs` if you want to experiment and change stuff. In the firmware code the handling of the received UART commands is done in `source/src/main.cpp` in the function `dbg_uart_parser()`.

The command protocol is plain and simple (115200,8,n,1) and its format is:

CMD=<ID>

where ID is a number and each number is a different command. So:
CMD=1, runs the inference of the hard-coded hand-drawn digit (see below)
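As an illustration of the ASCII side of the protocol, this is a minimal Python sketch that builds and parses commands shaped like the `CMD=1` example above. It is not the firmware’s `dbg_uart_parser()`; the helper names are hypothetical and the trailing newline terminator is an assumption.

```python
import re

# A command is the literal "CMD=" followed by a decimal ID.
_CMD_RE = re.compile(r"^CMD=(\d+)$")

def build_command(cmd_id: int) -> bytes:
    """Encode a command as ASCII; the trailing newline is an assumption."""
    return f"CMD={cmd_id}\n".encode("ascii")

def parse_command(line: bytes):
    """Return the command ID, or None if the line is not a valid command."""
    match = _CMD_RE.match(line.decode("ascii", errors="replace").strip())
    return int(match.group(1)) if match else None

print(build_command(1))
print(parse_command(b"CMD=1\r\n"))
print(parse_command(b"hello"))
```

With a serial library such as pyserial, the bytes returned by `build_command()` are what you would write to the port opened at 115200 baud, 8 data bits, no parity, 1 stop bit.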

This is how I’ve connected the two UART ports in my case. Also have a look at the repo’s README file for the exact pins on the connector.

Note: More copy-paste from the previous post is coming, as many things are the same, but I have to add them here for consistency.

Use the Jupyter notebook with STM32F7

In the jupyter notebook here, there’s a description of how to evaluate the model on the STM32F7. There are actually two ways to do that: the first is to use the digit which is already integrated in the code and the other is to upload your hand-drawn digit to the STM32 for evaluation. In either case this will validate the model and also benchmark the NN. Therefore, all you need to do is build and upload the firmware, make the proper connections, run the jupyter notebook and follow the steps in “5. Load model and interpreter”.

I’ve written two custom Python classes which are used in the notebook. Those classes are located in the `jupyter_notebook/` folder and each has its own folder.


The MnistDigitDraw class is using tkinter to create a small window on which you can draw your custom digit using your mouse.


In the left window you can draw your digit using your mouse. When you’re done, you can press the Clear button if you’re not satisfied. If you are, you can press the Inference button, which actually converts the digit to the format that is used for the inference (I now think that this button name is not the best I could have used, but anyway). This also displays the converted digit on the right side of the panel. This is an example.

Finally, you need to press the Export button to write the digit into a file, which can be used later in the notebook. Keep in mind that the jupyter notebook can only execute one cell at a time. That means that as long as this window is not terminated the current cell is still running, so you first need to close the window by pressing the [x] button to proceed.

After you export the digit you can validate it in the next cells either in the notebook or on the STM32F7.


The FbComm class handles the communication between the jupyter notebook and the STM32F7 (or another tool which I’ll explain). FbComm supports two different means of communication. The first is serial comms using a serial port and the other is a TCP socket. There is a reason I’ve done this. Normally, the notebook uses the serial port to send data to and receive data from the STM32F7. Developing with this kind of communication is slow, as it takes a lot of time to build and flash the firmware on the device every time. Therefore, I’ve written a small C++ tool in `jupyter_notebook/FbComm/test_cpp_app/fb_comm_test.cpp`. Actually it’s mainly C code for sockets, but wrapped in a C++ file as flatbuffers need C++. Anyway, if you plan on changing stuff in the flatbuffer schema it’s better to use this tool first to validate the protocol and the conversions, and when that’s done just copy-paste the code to the STM32F7 and expect that it should work.
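The idea of one class with two interchangeable transports can be sketched roughly like this. This is not the actual FbComm code; the class and method names are hypothetical and only the socket transport is shown. A socket pair stands in for the notebook on one end and the test tool on the other.

```python
import socket

class SocketTransport:
    """Byte transport over a connected socket (e.g. to a local test tool)."""

    def __init__(self, sock: socket.socket):
        self._sock = sock

    def write(self, data: bytes) -> None:
        self._sock.sendall(data)

    def read(self, size: int) -> bytes:
        # recv() may return fewer bytes than requested, so loop until done.
        chunks = bytearray()
        while len(chunks) < size:
            chunk = self._sock.recv(size - len(chunks))
            if not chunk:
                raise ConnectionError("peer closed the connection")
            chunks += chunk
        return bytes(chunks)

class Comm:
    """Talks to the target through any object with write() and read()."""

    def __init__(self, transport):
        self._transport = transport

    def send(self, payload: bytes) -> None:
        self._transport.write(payload)

    def receive(self, size: int) -> bytes:
        return self._transport.read(size)

# Loopback demo: one end plays the notebook, the other plays the test tool.
host_end, tool_end = socket.socketpair()
comm = Comm(SocketTransport(host_end))
tool = SocketTransport(tool_end)

comm.send(b"ping")
request = tool.read(4)
tool.write(request.upper())
reply = comm.receive(4)
print(reply)

host_end.close()
tool_end.close()
```

Because `Comm` only needs `write()` and `read(size)`, a serial port object that offers the same two methods (pyserial’s `Serial` does) could be dropped in when switching to the real hardware.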

When you switch to the STM32F7 then you can just use the same class but with the proper arguments for using the serial port.


The files in this folder are generated by the flatc compiler, so you shouldn’t change anything in there. If you make any changes in `source/schema/schema.fbs`, then you need to re-run the flatc compiler to re-create the files. Have a look at the “Flatbuffers” section in the README file for how to do this.

Benchmarking the x-cube-ai

The benchmark procedure was a bit easier with x-cube-ai compared to tflite-micro. I’ve just compiled the project with and without overclocking and ran the inference several times from the jupyter notebook. As I’ve mentioned earlier, you don’t really have to do that; you can just use the SystemPerformance project from CubeMX and change the frequency, but that’s not as cool as uploading your hand-drawn digit, right? Anyway, this is the table with the results:

| 216 MHz | 288 MHz |
|---|---|
| 76.754 ms | 57.959 ms |

Now let’s do a comparison between the tflite-micro and the x-cube-ai inference run times.

| | x-cube-ai (ms) | tflite-micro (ms) | difference |
|---|---|---|---|
| 216 MHz | 76.754 | 923 | ~12x (~170%) |
| 288 MHz | 57.959 | 692 | ~12x (~170%) |

Wow. That’s a huge difference, right? The x-cube-ai is about 12x faster (roughly a 170% difference if you compare the two times against their average). Great work they’ve done.
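The comparison can be reproduced from the measured times with a few lines. One assumption here: the percentage in the table is treated as the symmetric percentage difference (the difference relative to the mean of the two values), since that is the formula that gives a figure of about 170% for these numbers.

```python
def speedup(slow_ms: float, fast_ms: float) -> float:
    """How many times faster the fast run is."""
    return slow_ms / fast_ms

def pct_difference(a: float, b: float) -> float:
    """Symmetric percentage difference: |a - b| relative to their mean."""
    return abs(a - b) / ((a + b) / 2) * 100

# clock in MHz: (x-cube-ai ms, tflite-micro ms), from the table above
results = {216: (76.754, 923.0), 288: (57.959, 692.0)}

for mhz, (xcube_ms, tflite_ms) in results.items():
    print(f"{mhz} MHz: {speedup(tflite_ms, xcube_ms):.1f}x, "
          f"{pct_difference(tflite_ms, xcube_ms):.0f}% difference")
```

The ratio stays practically the same at both clocks, which is what you would expect if both libraries are bound by the CPU and not by memory or I/O.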

You might have noticed that the inference time is a bit higher now compared to the SystemPerformance project binary. I can only assume that this is because in that benchmark the outputs are not populated and are dropped. I’m not sure about this, but it’s my guess as this seems to be a consistent behaviour. Anyway, the difference is 2-3 ms, so I won’t ruin my day thinking more about this, as the results of my project are actually a bit faster than the default validation project.

Evaluating on the STM32F7

This is an example image of the digit I’ve drawn. The format is the standard grayscale 28×28 px image. That’s a uint8 grayscale image [0, 255], but it’s normalized to a [0, 1] float, as the network input and output are float32.
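The conversion described above boils down to a division and a flatten. A minimal sketch, assuming the image is stored as 28 rows of 28 integers and is flattened row by row (the function name and the test image are hypothetical; the notebook does this with its own code):

```python
def to_network_input(image):
    """Flatten a 28x28 uint8 image (values 0..255) to 784 floats in 0..1."""
    if len(image) != 28 or any(len(row) != 28 for row in image):
        raise ValueError("expected a 28x28 image")
    return [pixel / 255.0 for row in image for pixel in row]

# Hypothetical image: black background with a single white pixel.
image = [[0] * 28 for _ in range(28)]
image[14][14] = 255
data = to_network_input(image)
print(len(data), min(data), max(data))
```

The 784 values are what gets sent to the target as the input tensor.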

After running the inference on the target we get back this result on the jupyter notebook.

Comm initialized
Num of elements: 784
Sending image data
Receive results...
Command: 2
Execution time: 76.754265 msec
Out[9]: 0.000000
Out[8]: 0.000000
Out[7]: 0.000000
Out[6]: 0.000000
Out[5]: 0.000000
Out[4]: 0.000000
Out[3]: 0.000000
Out[2]: 1.000000
Out[1]: 0.000000
Out[0]: 0.000000

The output predicts that the input is number 2 and it’s 100% certain about it. Cool.
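Reading the result is just a matter of picking the index of the largest output. A tiny sketch with the values from the log above:

```python
# Out[0]..Out[9] as printed in the log above.
outputs = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

# The predicted digit is the index of the largest output value.
predicted = max(range(len(outputs)), key=outputs.__getitem__)
confidence = outputs[predicted]
print(f"digit {predicted} with {confidence:.0%} confidence")
```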

Things I liked and didn’t liked about x-cube-ai

From what you’ve read above you can pretty much conclude by yourself what the pros of x-cube-ai are, and they actually make almost all the cons seem less important, but I’ll list them anyway. This is not yet a comparison with tflite-micro.


Pros

  1. It’s lightning fast. The performance of this library is amazing.
  2. It’s very light, doesn’t use a lot of resources and the resulting binary is small.
  3. The tool in CubeMX is able to compress the weights.
  4. The x-cube-ai tool is integrated nicely in the CubeMX interface, although it could be better.
  5. Great analysis reports that help you decide which MCU you need and which optimizations to make before you even start coding (regarding ROM and RAM usage).
  6. Supports importing models from Keras, tflite, Lasagne, Caffe and ConvNetJS. So you are not limited to one tool, and the Keras support is really nice.
  7. You can build and test the performance and validate your NN without having to write a single line of code. Just import your model and build the SystemPerformance or Validation application and you’re done.
  8. When you write your own application based on the template you actually only have to use two functions, one to init the network and one to run your inference. That’s it.


Cons

  1. It’s a proprietary library! No source code available. That’s a big, big problem for many reasons. I’ve never had a good experience with closed source libraries, because when you hit a bug, you’re f*cked. You can’t debug and solve it by yourself; you need to file a bug report and then wait. And you might wait forever!
  2. ST support quite sucks if you’re an individual developer or a really small company. There is a forum, which is based on help from other developers, but most of the time you might not get an answer. Sometimes you see answers from ST staff, but expect that this won’t happen most of the time. If you’re a big player and you have support from component vendors like Arrow etc., then you can expect all the help you need.
  3. Lack of documentation. There’s only a pdf document here (UM2526). It has a lot of information, but there’s still a lot of information missing. Tbh, after I searched in the x-cube-ai folders which are installed by CubeMX, I found more info and tools, but there’s no mention of those anywhere! I really didn’t like that. OK, now I know, so if you’re also desperate, on your Linux box have a look at this path: `~/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/4.0.0/Documentation`. That’s for the 4.0.0 version, so in your case it might be different.

TFLite-micro vs x-cube-ai

Disclaimer: I have nothing to do with ST and I’ve never even got a free sample from them. I had to do this, for what is following…

As you can see, x-cube-ai has more pros than cons compared to tflite-micro. Tbh, I’ve also enjoyed working with x-cube-ai more than with tflite-micro, as it was much easier. The only thing about x-cube-ai that leaves a bitter taste is that it’s proprietary software. I can’t stress enough how much I dislike this and all the problems it brings along. For example, let’s assume that tomorrow ST decides to pull the plug on this project: boom, everything is gone. That doesn’t sound very nice when you’re planning a long commitment to an API or tool. I insist on this because over the last 15-16 years I’ve seen it happen many times in my professional career, and you don’t want it to happen to your released product. Of course, if the API serves you well for your current running project and you don’t plan on changing something critical then it’s fine, go for it. But I really like the fact that tflite-micro is open.

I’m a bit puzzled about tflite. At this point, the only reason I can think of for using tflite-micro over x-cube-ai is if you want to port your code from a tflite project which already runs on your application CPU (and Linux) to an MCU, to test and prototype and decide if it’s worth switching to an MCU as a cheaper solution. Of course, the impact of tflite on performance is something that needs consideration, and currently there’s no rule of thumb for how much slower it is compared to other APIs and on specific hardware. For example, in the STM32F7 case (and for this specific model) it’s about 12x slower, but this figure might be different for another MCU. Anyway, you must be aware of these limitations, know what to really expect from tflite-micro and how much room you have for performance enhancement.

There is another nice thing about tflite-micro, though. Because it’s open source you can branch the git repo and then spend time optimising the code for your specific hardware. The performance will definitely be much, much better, but I can’t really say how much as it depends on so many things. Keep in mind also that tflite-micro is written in C++ and some of its hocus pocus and template magic may have a negative impact on performance. But at least it remains a good alternative for prototyping, experimenting and developing down to its core. And that’s the best thing about open source code.

Finally, x-cube-ai limits your options to the STM32 series. Don’t get me wrong this MCU series is great and I use stm32 for many of my projects, but it’s always nice to have an alternative.


The x-cube-ai is fast. It’s also easy to use and develop with; it has those ready-to-build apps and the template to build your project, and everything comes in an all-in-one solution (CubeMX). On the other hand, it’s a black box, and don’t expect much support if you’re not a big player.

ST was very active over the last year. I also liked the STM32-MP1 SBC they released with Yocto support from day one and mainline kernel support. They are very active and serious. I still consider the whole HAL Driver library bloated, though (which it is, as I’ve shown in previous stupid-projects). I didn’t have any issues, but I also didn’t write much code for these last two projects (I had serious issues when I tried a few years ago).

Generally, the code here is focused on the performance of the NN libs and not on the performance of the MCU peripheral library, but you still need to consider those things when you’re evaluating platforms to start a new project.

From a brief look at the source code, though, it seems that you can use the x-cube-ai library without the HAL library, but you would need to port some bits to the LL library to use it with that one. Anyway, that’s me; I guess most people are happy with HAL, so…

In my next post, I will use a jetson-nano to run the same inference using tflite (not micro) and an ESP8266 that will use a REST-API. Also, TensorRT seems nice. I may also try this for the next post, we’ll see.

Update: Next post is available here.

Have fun!

Machine Learning on Embedded (Part 3)


Note: This post is the third in the series. Here you can find part 1, part 2, part 4 and part 5.

Edit (24.07.2019): I’ve made a stupid mistake and I haven’t used the hard float acceleration of the FPU on the STM32F7. This explains the terrible performance I had. But even with the FPU enabled, although the NN is now 3x faster, it’s still much, much slower compared to the x-cube-ai implementation from ST (~13x slower).

In the previous posts (part 1 and part 2), I’ve done a benchmark of two very simple NNs on various MCUs. That was a naive implementation of a 2-input, 1-output NN and a 2-input, 32-hidden, 1-output NN. Of course, as you can imagine, this is hardly useful for doing anything in the ML domain, but it works fine for benchmarking. My plan next was to run a more complicated NN on the STM32F746 MCU. To do that I was about to use X-CUBE-AI and I started working on it, but during the process I got fed up with the implementation, the lack of information around the API, and some bits and tools that are referenced but are nowhere available. I’ve also asked in their forum, but the ST forums are not the place to get answers, as the company doesn’t provide any help. Only other users do, but the concept is quite new and there are not many users to provide answers. Btw, this is my unanswered post in the ST community forums.

As I’ve also mentioned in the previous post, this domain is quite hot right now. There is a lot of development and many things are changing rapidly, which makes things that I wrote 1 week ago obsolete now. For example, a few hours ago the new 4.0.0 version of X-CUBE-AI was released, which supports a few things that are now very interesting to test, benchmark and compare. Anyway, I’ll get to that later.

You’ll find all the source code and files used for this project here:

So, let’s step into it…

TensorFlow Lite for microcontrollers

In the first post, I had a very interesting chat in the comments section with Raukk, who provided several suggestions (you can see the comments at the end of this post). At some point he suggested having a look at the TensorFlow-Lite API for microcontrollers, and after reading the online documentation I thought that this was what I needed. I thought that it would make things easier. Normally, I would provide my findings at the end of the post, but here’s a spoiler: it doesn’t! In its current state (unless I’m doing something terribly wrong!), the tflite-micro API sucks in so many ways, but the most important one is the very low performance. During the rest of the post I’ll elaborate on my findings.

TensorFlow (TF or tf) has a subset API, which is TensorFlow Lite (tflite). As the name implies, this is a lighter version of the main API. The idea is that small application CPUs (like ARM) should be able to run tf models with fewer dependencies, low latency and smaller size, which is great for medium/large embedded devices. Note that I’m referring to application CPUs and not MCUs. That seems to be working quite well for small Linux SBCs and also Android devices. Especially for Android there’s support for the NNAPI, but by the time this post is written there are also quite a few issues with various platforms. It’s still like a beta thing.

Anyway, at the same time, tensorflow tries to support even smaller CPUs from the MCU domain. TensorFlow Lite for Microcontrollers (tflite-micro) is an experimental subset of tflite that is meant to be a bare-metal portable API. Portable, of course, means that although the API is bare-metal and can be compiled for virtually any MCU, there are still parts, like the HW acceleration, that need to be ported depending on the architecture. For example, Cortex-M4 and M7 have DSP accelerators (there’s also a new NN library in CMSIS) that can be used to improve the performance. At the same time, tflite also provides other optimizations, like quantization, that can improve the performance on MCUs that don’t have HW accelerators. As we’ll see next though, because this API is very immature, not all of those things really work out of the box and there are still several bugs.

Nevertheless, this stupid project was quite fun, too, because I’ve put together a jupyter notebook that trains an MNIST model with TF, which you can then convert to tflite and upload to the STM32F7. You can also use the notebook to hand-draw your own digit and test it with both the notebook and the STM32F7. So the notebook can communicate with the STM32F7 and run inferences, cool stuff.

Therefore, this post will be limited to TF-Lite for microcontrollers and the next post will be about the new X-CUBE-AI API.

Have in mind that this subset is still new and a work in progress. I think it was quite a mistake to dive into it so early, but I had to, as x-cube-ai initially didn’t meet my expectations and I needed to proceed with my work on porting a keras model to an MCU. For that reason, I’ve decided to get deeper into tflite-micro, as I think it will also be the API that will dominate the market (at least in the near future).

Also, I’ve spent a few days making tflite-micro work the way I wanted. It was quite challenging and in the end I was completely disappointed by the procedure and the time it takes to set it up for use. The reason is that the API is a bit chaotic and under heavy development, but I’ll list the issues I had in more detail later in the post.

Training the MNIST TF model

First you need to train your model. Since I’m not an expert in the domain, I’ve chosen to port another Keras model from here to tflite. The procedure was easy and very straightforward, as the Keras structure seems to be almost identical to TF. The resulting model is here, so you can compare the two models:

MNIST for TensorFlow-Lite notepad

It’s better if you git clone the repo and open the notebook locally, so you can run the steps if you like. Of course, you definitely need to clone it locally in order to test the code with the STM32F7. In the notebook I’ve tried to put things in an order that makes sense, so first in section 1 I’m creating the model using TF and then I’m doing the training and evaluation. For this project the model in general, and also its accuracy, doesn’t really matter, as I want to focus on the model transfer to the STM32F7 and then benchmark the tflite-micro API. So I’ll deal with the low level technical stuff and not with how to create the perfect model.

Convert Keras model for use in embedded

After I’ve trained the model, in section 2, I’m converting the model to the tflite format. The tflite format is just a serialized and flattened binary format of the model that uses the flatbuffers serialization library. This library is actually the coolest thing in tflite and it works quite well. I’ve also added a script in `jupyter_notebook/` which does the same thing and you can run it from the repo source path.

Converting the model to tflite was the first major issue I had and I’ve only managed to solve it partially. The problem is that by default all the model weights are float32 numbers, which means two things. First, the model is big, as every float32 takes 4 bytes, so it needs a lot of flash and RAM. Second, the execution will be slower compared to the uint8 numbers that tflite is supposed to support. Have a look at this snippet from the notebook:

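import tensorflow as tf  # written against TF 1.14

tflite_mnist_model = 'mnist.tflite'
converter = tf.lite.TFLiteConverter.from_keras_model_file('mnist_keras.h5')
# Uncomment one of the next two lines to post-quantize the weights
# converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
# converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
# Write the serialized flatbuffer to disk; write() returns the size in bytes
with open(tflite_mnist_model, "wb") as f:
    flatbuffer_size = f.write(tflite_model)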

There are two lines which are commented out. The important one is the `tf.lite.Optimize.OPTIMIZE_FOR_SIZE`. According to the RTFM, this will post-quantize the model and reduce the size of the weights from 32 bits to 8. That’s a 4x reduction in size. That works fine when converting the model, but it doesn’t work when the model is running on the MCU. If you build the code with a model converted with OPTIMIZE_FOR_SIZE, then when the model is loaded in the MCU, you’ll get this error:

Only float32, int16, int32, int64, uint8, bool, complex64 supported currently.

This error comes from the `source/libs/tensorflow/lite/experimental/micro/` which is allocating RAM for each layer. It seems that the converter tool converts some weights to int8 values instead of uint8, and int8 is not supported. At least not yet. Therefore, although quantization seems like a great feature, it just doesn’t work with the tflite-micro API! The same also stands for the `OPTIMIZED_UINT8` option. I’ve also seen another person complaining about this, so either we’re the only ones that tried it or we both make the same mistake somewhere in the process.

Anyway, I’ll do a comparison at least of the resulted converted sizes, as I hope in the future this will be fixed. But for now keep in mind that you can only use float32 for all the weights. As I’ve mentioned earlier, this may change even in a few hours or it may take more, who knows.

Even if you use quantization, although there’s a significant compression of the model size, the size is still quite large for most of the small embedded devices that don’t have enough flash. Post-quantization, though, has a large impact on the model size, as you’ll see in the next table. Post-quantization means that the quantization happens after training the model, but you can also use quantization during the training (according to the RTFM). Also, there are different types of quantization, but let’s have a look at the following table.

| | Size (bytes) | Ratio | Saving % |
|---|---|---|---|
| Original file | 780504 | | |
| No quantization | 375740 | 2 | 51.8 |
| OPTIMIZE_FOR_SIZE | 99344 | 7.85 | 87.27 |
| OPTIMIZED_UINT8 | 97424 | 8.01 | 87.51 |
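
The ratio and saving columns follow directly from the file sizes. As a minimal sketch (the helper name `compression_stats` is mine and not part of the repo), this is how they can be recomputed if you convert your own model:

```python
# Sizes in bytes, copied from the table above.
ORIGINAL_SIZE = 780504
converted_sizes = {
    "No quantization": 375740,
    "OPTIMIZE_FOR_SIZE": 99344,
    "OPTIMIZED_UINT8": 97424,
}


def compression_stats(original, converted):
    """Return (ratio, saving in percent) of a converted file vs. the original."""
    ratio = original / converted
    saving = (1.0 - converted / original) * 100.0
    return ratio, saving


for name, size in converted_sizes.items():
    ratio, saving = compression_stats(ORIGINAL_SIZE, size)
    print(f"{name:<18} {size:>7} bytes  ratio {ratio:.2f}  saving {saving:.2f}%")
```

Small differences in the last digit compared to the table are just rounding.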

From the above table it seems that OPTIMIZE_FOR_SIZE or OPTIMIZED_UINT8 make a huge difference compared to no quantization, but there isn’t any real difference in the produced size between the two. Have in mind that if you want to use the OPTIMIZED_UINT8 flag, then you also need to make your model quantization aware by adding this before you compile and fit your model. According to the RTFM this is how it’s done.

import tensorflow as tf
# Quantization aware training
sess = tf.keras.backend.get_session()

Finally, if you want to convert those models yourself using the script, then these are the commands.

# Convert the keras model to tflite
python3 mnist.h5

# Convert the keras model to tflite and optimize with OPTIMIZE_FOR_SIZE
python3 mnist.h5 size

# Convert the keras model to tflite and optimize with QUANTIZED_UINT8
python3 mnist.h5 uint8

For the time being, forget about quantization and convert your model to tflite without any optimization. Now that you have your model converted from TF to tflite, there’s another step. You need to convert it to a C array. To do that you can run the following command:

xxd -i jupyter_notebook/mnist.tflite > source/src/inc/model_data.h

This will create a C header file and place it (overwriting any existing file) in the source code. Here is what you’ll find inside the file:

unsigned char jupyter_notebook_mnist_tflite[] = {
  0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33, 0x00, 0x00, 0x00, 0x00,
  ...
};
unsigned int jupyter_notebook_mnist_tflite_len = 375740;

This is the mnist.tflite converted to bytes. Don’t forget that the mnist.tflite is a flatbuffer container. That means that the C++ structured model is serialized into this flatbuffer model in order to be transferred to another platform or architecture. Therefore this C array will be deserialized at run time. Also note that this array would normally be placed in RAM (a non-const initialized array ends up in the .data section), but you don’t want that. The non-optimized model size is 367KB, which is more than the available RAM on the STM32F746NG. Therefore, you need to force the compiler to store this array in flash, which means that you need to change your array to const like this:

const unsigned char jupyter_notebook_mnist_tflite[] = {
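
If you don’t want to edit the generated header by hand every time, the two steps can be combined in a few lines of Python. This is only a sketch and not a script from the repo; the function name `to_c_header` is made up for the example and the output mimics the `xxd -i` layout with `const` added:

```python
def to_c_header(data: bytes, name: str, per_line: int = 12) -> str:
    """Render bytes as a const C array plus a length variable, xxd -i style."""
    lines = [f"const unsigned char {name}[] = {{"]
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    # xxd omits the comma after the very last byte; do the same.
    if data:
        lines[-1] = lines[-1].rstrip(",")
    lines.append("};")
    lines.append(f"const unsigned int {name}_len = {len(data)};")
    return "\n".join(lines) + "\n"


# The first 12 bytes of mnist.tflite shown above (0x54 0x46 0x4c 0x33 is "TFL3").
sample = bytes([0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33,
                0x00, 0x00, 0x00, 0x00])
header = to_c_header(sample, "jupyter_notebook_mnist_tflite")
print(header)
```

For the real model you would read the bytes of `jupyter_notebook/mnist.tflite` and write the returned string to `source/src/inc/model_data.h`.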

That’s it! Now you have your model and weights flattened and ready to use with the tflite API in your microcontroller. As already mentioned in the online documentation here, only a subset of operations is currently supported, but the API is flexible enough to extend or build with more options if you like.

Porting TF-Lite micro to STM32F7 and CMAKE

TF-Lite for microcontrollers doesn’t support cmake. Instead there are some makefiles in the github repo that build specific examples to test. Although this may seem ok, it’s not, as it’s very hard to re-use those makefiles for your own projects, because they are done in a way that you need to develop your application inside the repo. My preference in general is to keep things simple and separated; this is why I generally prefer cmake. The problem with cmake is that you can achieve the same thing in many different ways and sometimes you may end up with builds that work, but are also very complicated. Of course, as the project complexity grows cmake also becomes a bit more ugly, but anyway I believe that it’s far easier to maintain and scale and, most importantly, I always have my STM32 template projects in cmake. Therefore, I had to make tflite-micro build with cmake. That task took a while, as the makefile project does some magic in the background like downloading other repos that are not in the source code (e.g. flatbuffers and gemmlowp).

In the end I’ve managed to do so, but the problem is that it’s not easy to update it. The reason is that the header file includes have paths relative to the main repo’s top folder, which is not the tflite folder but the TF API’s folder. For that reason, I had to sed all the source files that include header files.
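
To give an idea of what such a rewrite looks like, here is a rough Python equivalent of a sed pass over one file. It’s only an illustration and not the exact command I used; the `lite/` target prefix and the sample include lines are assumptions made for the example:

```python
import re

# Matches quoted includes only; system includes like <cstdint> are left alone.
INCLUDE_RE = re.compile(r'^(\s*#\s*include\s*")([^"]+)(")', re.MULTILINE)


def rewrite_includes(source: str, old_prefix: str, new_prefix: str = "") -> str:
    """Replace old_prefix at the start of every quoted #include path."""
    def fix(match):
        path = match.group(2)
        if path.startswith(old_prefix):
            path = new_prefix + path[len(old_prefix):]
        return match.group(1) + path + match.group(3)
    return INCLUDE_RE.sub(fix, source)


example = (
    '#include <cstdint>\n'
    '#include "tensorflow/lite/experimental/micro/micro_interpreter.h"\n'
    '#include "flatbuffers/flatbuffers.h"\n'
)
rewritten = rewrite_includes(example, "tensorflow/lite/", "lite/")
print(rewritten)
```

The drawback stays the same whether you use sed or a script: every time you pull a newer tensorflow version you have to run the rewrite again.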

Things I liked and didn’t like about TF-Lite micro

I prefer to write this section here, before the benchmark, because it’s negative feedback and I prefer to keep the end of the post focused on the results. The whole procedure was a bit of a pain and I’m not satisfied with the outcome… I believe (or better, I hope) that in the future the tflite micro API will get better and easier to use. I mean, I expect more from Google. I had several problems when it came to actually using it, which I will try to address next. Keep in mind that at the time this article is written the version of tensorflow is 1.14.0-718503b075d, so if the post is not updated, many things may have changed by the time you read this.


Pros:

  1. The thingy which is used for the automated tests is very interesting! I didn’t know that such a thing existed and it seems very promising for future use. You can use it to emulate HW, run your built binaries on it and integrate automated tests into your current CI infrastructure.
  2. The idea of having a common API that runs on every platform and architecture is really interesting and this is the main reason that I hope this API gets better. That means that you can (almost) just copy-paste your code from your Jetson nano or RPi and compile it on the STM32F7. That’s really awesome.
  3. It has Google’s support. Some of you might wonder why that’s a pro. I think it is, because it means that it will get more development effort, but of course that doesn’t mean that the result will be optimal. Only time will tell.


Cons:

  1. Documentation is currently horrible. Because of that, it’s very difficult to do even simple things. Sometimes the only way to really understand what’s going on is to read the source code. You may think that this is expected with most APIs, but this API is huge and that takes much more time! Better documentation would definitely help.
  2. It seems that you can achieve the same thing in many different ways, as there are quite a few duplicate implementations. So, when you’re looking for examples you may see completely different API calls that do the same thing. That makes it very difficult to plan your workflow, especially when you’re getting started with TF. I’ve read that some people consider this a nice feature of the API. No, it’s not. An API should be clean and simple to use.
  3. It’s very slow… Much, much, much slower compared to x-cube-ai. Have in mind that I’ve only managed to benchmark float and not quantized uint8 numbers. But my current rough estimation is that tf-lite micro is approx. 12.5x slower than X-CUBE-AI when running the same inference (with the FPU enabled). That’s a really big number there…
  4. There are some examples for different microcontrollers, but the build system is a bit bloated and I find it a bit over-engineered, which makes it difficult to build your own code.
  5. The build system is based on the make build automation tool, which I guess is ok, but it was really difficult for me to port to cmake, because the build scripts download other stuff in the background and there are many different pieces of code all over the place. Not that cmake makes things much better, but anyway…
  6. Because there are so many different pieces all over the place, the code doesn’t make much sense. While trying to port the build to cmake I’ve realized that it’s a spaghetti of files. The problem is that micro-tflite is a subset of tflite, which is a subset of tensorflow. But all those things are not distinct. At some point it would be nice if micro tflite was a separate github repo.
  7. There’s a list of supported platforms here. The problem with that list is that the example for the stm32f103 (bluepill) is in the github repo and you just call make to build it, but for the stm32f746 you need to download some tarball project files that contain the source files, including some unknown tflite version. The problem is that those files are already outdated! Also, why use Keil project files in those tarballs? It’s a bit of a mess…
  8. Regarding the project files for the stm32f746 that I’ve mentioned in the previous point, why use Keil? I didn’t expect Google to enforce the usage of this IDE, because it’s only for Windows and it also doesn’t make any sense to use Keil when so many better and FOSS alternatives exist. Again, my personal opinion is that cmake would make more sense, especially for embedded.
  9. The tflite-micro code is in C++11. Many will think, “OK, so what?”. The problem actually is that most of the low-level embedded engineers are not familiar with that language. Of course, you can overcome this by just learning it and, to be fair, the API is relatively easy and not much C++11 hocus-pocus is used. My main concern regarding C++ though is that it’s not easy to set up a C++ project to build for every microcontroller. For example for the STM32, the CubeMX tool that is used to set up a project doesn’t support creating C++ projects from templates. Therefore, you need to spend time to do it yourself. For that reason, for example, I have my own cmake C++ template, but as I’ve said, porting tflite-micro to cmake was an adventure.
  10. Porting from the tflite build system to cmake isn’t sustainable in the long term. The reason is that there’s a lot of work that needs to be done. For example, all the header includes have hardcoded paths, which is not convenient for cmake, and in my case I had to remove all those hardcoded paths.
  11. Another annoying issue is that the size optimizations when converting an h5 model to tflite seem to be incompatible with tflite-micro. Others also complain about this issue. In the end only the non-optimized model can be used, but I guess it’s just a matter of time for that to be fixed.

I know that the cons list is much longer, but the main advantage is the unified API across all those different platforms and architectures. Currently the performance really sucks, but if this gets better then imho TF will become the de-facto tool.

Building the project

You can find the C++ cmake project here:

In the source/libs folder you’ll find all the necessary libraries, which are CMSIS, the STM32F7xx_HAL_Driver, flatbuffers, gemmlowp and tensorflow. All these are built as static libraries and then the main.cpp app is linked against those static libs. You will find the cmake files for those libs in source/cmake. The README file in the repo is quite thorough about the build options and the different builds, but here I’ll focus only on the accelerated build, which uses the CMSIS-NN API for the STM32F7. To build with this option run this command:


This will build the project in the build-stm32 folder. It’s interesting to see the resulting sizes for all the libs and the binary file. The next table lists the sizes when using the current latest gcc-arm-none-eabi-8-2019-q3-update toolchain from here. By the time you read the article this might already have changed.

| File | Size |
|---|---|
| stm32f7-mnist-tflite.bin | 542.7 kB |
| libSTM32F7_DSP_Lib.a | 5.1 MB |
| libSTM32F7_NN_Lib.a | 598.8 kB |
| libSTM32F7xx_HAL_Driver.a | 814.9 kB |
| libTensorflow_lite_micro.a | 2.8 MB |

Normally you would wonder why you should care about the size of the static libs if only the binary size matters, and that’s a good point. But it does matter, because the RTFM of tflite-micro mentions that this lib is ~300KB. After testing this, the only way to achieve this size is to build a dynamic lib and then strip it; then it gets to around 300KB. But this was not mentioned in the RTFM, so let’s say this is what they wanted to write in the first place. Btw, you can strip any of the above libs by running this:

arm-none-eabi-strip -s libname.a

BUT, you can’t strip statically linked libraries, because there will not be any symbols left to build against :p . Anyway, have in mind that the claimed size is only for dynamically linked libs, which of course doesn’t really matter for MCUs.

Finally, as you can see, the binary size is ~half a megabyte. This is huge for an MCU. Most of this size comes from the `source/src/inc/model_data.h` file, which is the flatbuffer model of the NN and is already ~367 KB (375740 bytes). The binary size with the model after the conversion with the quantization optimizations would be 266 kB, but as I’ve said this won’t work with the tflite-micro API.

Model RAM usage

This table shows the RAM usage per layer when the flattened flatbuffer model is expanded to memory.

| Layer | Size in bytes |
|---|---|
| conv2d_7_input | 3136 |
| dense_4/Softmax | 40 |
| dense_4/BiasAdd | 40 |
| dense_3/Relu | 256 |
| conv2d_9/Relu | 2304 |
| max_pooling2d_6/MaxPool | 6400 |
| conv2d_8/Relu | 30976 |
| max_pooling2d_5/MaxPool | 21632 |
| conv2d_7/Relu | 86528 |
| Total | 151312 |
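
Each of those numbers is just the number of elements of the layer’s output tensor multiplied by 4 bytes (float32). The following sketch reproduces the table; note that the tensor shapes are my assumption, inferred from the byte counts, and are not printed by the firmware:

```python
BYTES_PER_FLOAT32 = 4

# (tensor name, assumed shape); the shapes are inferred from the byte counts above.
tensors = [
    ("conv2d_7_input",          (28, 28, 1)),
    ("dense_4/Softmax",         (10,)),
    ("dense_4/BiasAdd",         (10,)),
    ("dense_3/Relu",            (64,)),
    ("conv2d_9/Relu",           (3, 3, 64)),
    ("max_pooling2d_6/MaxPool", (5, 5, 64)),
    ("conv2d_8/Relu",           (11, 11, 64)),
    ("max_pooling2d_5/MaxPool", (13, 13, 32)),
    ("conv2d_7/Relu",           (26, 26, 32)),
]


def tensor_bytes(shape, element_size=BYTES_PER_FLOAT32):
    """Number of bytes needed to hold a dense tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * element_size


sizes = {name: tensor_bytes(shape) for name, shape in tensors}
total = sum(sizes.values())

for name, size in sizes.items():
    print(f"{name:<26} {size:>6}")
print(f"{'Total':<26} {total:>6}")
```

This also shows why quantization would matter so much here: with 1-byte elements the same tensors would need a quarter of the RAM.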

Therefore, you see that this model needs more than 151KB of RAM. The STM32F746 I’m using has 320KB of RAM, which is enough for this model, but 151KB is still quite a lot of RAM for embedded, so you need to keep such limitations in mind!

Supported commands in STM32F7 firmware

After you build and flash the firmware on the STM32F7 (read the README for more detailed info), you can use a serial port to either send commands via a terminal like cutecom or interact with the jupyter notebook. The firmware supports two UART ports on the STM32F7. In the first case the commands are just ASCII strings, but in the second case it’s a binary flatbuffer schema. You can find the schema in `source/schema/schema.fbs` if you want to experiment and change stuff. In the firmware code the handling of the received UART commands is done in `source/src/main.cpp` in function `dbg_uart_parser()`.

The command protocol is plain and simple (115200,8,n,1). Each command has the format CMD=ID, where ID is a number and each number is a different command. So:
CMD=1, prints the model view
CMD=2, runs the inference of the hard-coded hand-drawn digit (see below)

This is how I’ve connected the two UART ports in my case. Also have a look at the repo’s README file for the exact pins on the connector.

Use the Jupyter notebook with STM32F7

In the jupyter notebook here, there’s a description of how to evaluate the model on the STM32F7. There are actually two ways to do that: the first one is to use the digit which is already integrated in the code and the other is to upload your hand-drawn digit to the STM32 for evaluation. In any case this will validate the model and also benchmark the NN. Therefore, all you need to do is to build and upload the firmware, make the proper connections, run the jupyter notebook and follow the steps in “5. Load model and interpreter”.

I’ve written two custom Python classes which are used in the notebook. Those classes are located in the jupyter_notebook/ folder and each one has its own sub-folder.


The MnistDigitDraw class is using tkinter to create a small window on which you can draw your custom digit using your mouse.


In the left window you can draw your digit using your mouse. When you’re done, you can press the Clear button if you’re not satisfied. If you are, then you can press the Inference button, which will actually convert the digit to the format that is used for the inference (I now think that this button name is not the best I could use, but anyway). This will also display the converted digit on the right side of the panel. This is an example.

Finally, you need to press the Export button to write the digit into a file, which can be used later in the notebook. Have in mind that the jupyter notebook can execute only one cell at a time. That means that as long as this window is not terminated the current cell keeps running, so you first need to close the window by pressing the [x] button in order to proceed.

In my case, as I’m ambidextrous and I’m using two mice at the same time on my desk, I’ve managed to run several tests drawing digits with both my hands, as each of my hands produces a different output. I know it’s weird, but usually in the office I prefer to use my left hand for the mouse and at home both, so I can rest my hands a bit.

After you export the digit you can validate it in the next cells either in the notepad or the STM32F7.


The FbComm class handles the communication between the jupyter notebook and the STM32F7 (or another tool which I’ll explain). FbComm supports two different means of communication. The first is a serial port and the other is a TCP socket. There is a reason I’ve done this. Normally, the notebook uses the serial port to send data to and receive data from the STM32F7. Developing with this communication is slow, as it takes a lot of time to build and flash the firmware on the device every time. Therefore, I’ve written a small C++ tool in `jupyter_notebook/FbComm/test_cpp_app/fb_comm_test.cpp`. Actually it’s mainly C code for sockets, but wrapped in a C++ file as flatbuffers need C++. Anyway, if you plan on changing stuff in the flatbuffer schema, it’s better to use this tool first to validate the protocol and the conversions and, when that’s done, just copy-paste the code to the STM32F7 and expect that it should work.

When you switch to the STM32F7 then you can just use the same class but with the proper arguments for using the serial port.


The files in this folder are generated by the flatc compiler, so you shouldn’t change anything in there. If you make any changes in `source/schema/schema.fbs`, then you need to re-run the flatc compiler to re-create the files. Have a look at the “Flatbuffers” section in the README file for how to do this.

Benchmarking the TF-Lite micro firmware

Finally, we got here. But I need to clarify some things first.

I’ve implemented several different tests for the firmware in order to benchmark the various implementations of the tflite micro API. What I mean is that the depthwise_conv layer is implemented in 3 different ways in the API. The default implementation is in the `source/libs/tensorflow/lite/experimental/micro/kernels/` file. Then there is another implementation in `source/libs/tensorflow/lite/experimental/micro/kernels/portable_optimized/` and finally the one in `source/libs/tensorflow/lite/experimental/micro/kernels/cmsis-nn/`. I’ve added detailed instructions on how to build each case in the repo’s README file.

In `source/src/inc/digit.h` I’ve added a custom hand-drawn digit (number 5) that you can use to test the firmware and the model without having to send any data to the board. You can do that by sending the command CMD=2. This will run the inference and at the same time benchmark the process, both for every layer and for the total time it takes. Let’s see the results when running the benchmark in various scenarios.

The first column is the layer name and the others are the time in msec for each layer in 6 different cases, which are:

  • [1]: 216MHz, default
  • [2]: 216MHz, portable_optimized/
  • [3]: 216MHz, cmsis-nn/
  • [4]: 288MHz, default
  • [5]: 288MHz, portable_optimized/
  • [6]: 288MHz, cmsis-nn/

Edit (24.07.2019): The following table is with the FPU of the STM32F7 disabled, which was my mistake. Therefore, I just leave it here for reference. The next table is the one that has the FPU enabled.

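| Layer | [1] | [2] | [3] | [4] | [5] | [6] |
|---|---|---|---|---|---|---|
| DEPTHWISE_CONV_2D | 236 | 236 | 235 | 177 | 177 | 176 |
| MAX_POOL_2D | 23 | 23 | 23 | 18 | 17 | 17 |
| CONV_2D | 2347 | 2346 | 2346 | 1760 | 1760 | 1760 |
| MAX_POOL_2D | 7 | 7 | 7 | 5 | 5 | 5 |
| CONV_2D | 348 | 348 | 348 | 261 | 261 | 260 |
| SOFTMAX | 0 | 0 | 0 | 0 | 0 | 0 |
| TOTAL TIME | 2966 | 2965 | 2964 | 2225 | 2224 | 2222 |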

Edit (24.07.2019): This is the table with the FPU enabled.

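| Layer | [1] | [2] | [3] | [4] | [5] | [6] |
|---|---|---|---|---|---|---|
| DEPTHWISE_CONV_2D | 70 | 70 | 69 | 52 | 52 | 52 |
| MAX_POOL_2D | 7 | 7 | 7 | 5 | 5 | 5 |
| CONV_2D | 733 | 733 | 733 | 550 | 550 | 550 |
| MAX_POOL_2D | 7 | 7 | 2 | 2 | 2 | 2 |
| CONV_2D | 108 | 108 | 108 | 81 | 81 | 81 |
| SOFTMAX | 0 | 0 | 0 | 0 | 0 | 0 |
| TOTAL TIME | 923 | 923 | 922 | 692 | 692 | 692 |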

From the above tables, you can notice that:

  • When the FPU is enabled, tflite is ~3.2x faster (oh, really?)
  • There isn’t really any difference with and without the DSP/NN libs acceleration
  • The CPU frequency has a great impact on the execution time (which is expected)
  • It’s slooooooooooooow
  • The CPU spends most of the time in the CONV_2D layer. But all the layers are quite slow.
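
The first three points can be checked with a bit of arithmetic on the TOTAL TIME rows. This is a small sketch using only the numbers from the two tables above:

```python
# Total inference times in msec, copied from the two tables above.
# Columns: [1] 216MHz default, [2] 216MHz portable_optimized, [3] 216MHz cmsis-nn,
#          [4] 288MHz default, [5] 288MHz portable_optimized, [6] 288MHz cmsis-nn
total_fpu_off = [2966, 2965, 2964, 2225, 2224, 2222]
total_fpu_on = [923, 923, 922, 692, 692, 692]

# Speed-up from enabling the FPU, per column.
fpu_speedup = [off / on for off, on in zip(total_fpu_off, total_fpu_on)]

# Speed-up from overclocking (default kernels) vs. the clock ratio itself.
clock_speedup = total_fpu_on[0] / total_fpu_on[3]
clock_ratio = 288 / 216

# Gain of the cmsis-nn build over the default build at 216MHz.
cmsis_gain = total_fpu_on[0] / total_fpu_on[2]

print("FPU speed-up per case:", [round(s, 2) for s in fpu_speedup])
print(f"288MHz vs 216MHz: {clock_speedup:.3f}x (clock ratio {clock_ratio:.3f})")
print(f"cmsis-nn vs default @216MHz: {cmsis_gain:.3f}x")
```

The execution time scales almost exactly with the clock, while the cmsis-nn build gains about 0.1%, which is within the measurement noise.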

I’m quite disappointed with the fact that the CMSIS DSP/NN library didn’t make any real difference here. I’ve spent quite some time integrating it in the cmake build and I was hoping for better results.

In case you want to overclock your CPU, have in mind that it may be unstable and the CPU can crash. I’ve managed to run the benchmark @ 288MHz, but when I was using the flatbuffers communication between the jupyter notebook and the STM32F7 the CPU was crashing at a random point. I’ve used st-link with GDB to verify that this was the case and not a software bug. So, just be aware of that if you experiment with an overclocked CPU.

If you want to use GDB with the code, then mind that although the -g flag is set in cmake, the elf file is stripped. Therefore, in the `source/CMakeLists.txt` file you need to find this line

set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-exceptions -fno-rtti -s")

and remove the -s from it and re-build. Then GDB will be able to find the symbols.

Evaluating on the STM32F7

This is an example image of the digit I’ve drawn. The format is the standard grayscale 28×28 px image. That’s a uint8 grayscale image [0, 255], but it’s normalized to a [0, 1] float, as the network input and output are float32.
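
The normalization itself is a one-liner. Here’s a minimal pure-Python sketch of it (the function name `normalize_image` and the synthetic test image are made up for the example; the notebook works on the exported digit instead):

```python
IMAGE_SIDE = 28
BYTES_PER_FLOAT32 = 4


def normalize_image(pixels):
    """Map uint8 grayscale values [0, 255] to floats in [0, 1]."""
    if len(pixels) != IMAGE_SIDE * IMAGE_SIDE:
        raise ValueError("expected a 28x28 image (784 pixels)")
    return [p / 255.0 for p in pixels]


# A synthetic test image: black background with a single white pixel.
image = bytes([0] * 783 + [255])
normalized = normalize_image(image)

# As float32 the input tensor needs 4 bytes per pixel.
input_bytes = len(normalized) * BYTES_PER_FLOAT32
print(len(normalized), "elements,", input_bytes, "bytes as float32")
```

The byte count matches the 3136 bytes of the conv2d_7_input tensor in the RAM usage table.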

After running the inference on the target we get back this result.

Comm initialized
Num of elements: 784
Sending image data
Receive results...
Command: 2
Execution time: 922.918518 msec
Out[9]: 0.000000
Out[8]: 0.000000
Out[7]: 1.000000
Out[6]: 0.000000
Out[5]: 0.000000
Out[4]: 0.000000
Out[3]: 0.000000
Out[2]: 0.000000
Out[1]: 0.000000
Out[0]: 0.000000

From the above output, you can see that the result is an array of 10 float32. Each index of the array represents the prediction of the NN for each digit. Out[0] is the digit 0 and Out[9] is the digit 9. So from the above output you see that the NN classifies the image as number 7. It’s interesting that Out[1], Out[2], Out[3] are not zero. I think it’s quite obvious why the NN made those predictions, because there are parts of 7 that are quite similar to 1, 2, 3. Anyway, in this case I’m getting the same prediction from the notebook as from the STM32F7. And that was the case for all my tests.
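
Picking the predicted digit out of that array is just an argmax. A tiny sketch, using the values from the log above (the helper name `predicted_digit` is mine):

```python
# Output scores copied from the log above, indexed by digit (Out[0]..Out[9]).
output = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]


def predicted_digit(scores):
    """Return the index (digit) with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


digit = predicted_digit(output)
print(f"Predicted digit: {digit} (score {output[digit]:.6f})")
```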

Conclusions (and a spoiler for part 4)

Before I close this post, I will make a spoiler for the next post that follows. I’ve already used the exact same model with the X-CUBE-AI and this is part of the result from an inference (with random input data, which doesn’t matter).

Running PerfTest on "mnistkeras" with random inputs (16 iterations)...

Results for "mnistkeras", 16 inferences @216MHz/216MHz (complexity: 2852598 MACC)
 duration     : 73.735 ms (average)
 CPU cycles   : 15926760 -458/+945 (average,-/+)
 CPU Workload : 7%
 cycles/MACC  : 5.58 (average for all layers)
 used stack   : 576 bytes
 used heap    : 0:0 0:0 (req:allocated,req:released) cfg=0

Do you notice a couple of things there? First, the duration is 73.7 ms instead of 922 ms at the same frequency with tflite-micro. That’s ~12.5x faster!
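
The numbers in the X-CUBE-AI report are consistent with each other, which is easy to verify. This sketch only uses values from the report above and the tflite-micro execution time from the previous section:

```python
cpu_hz = 216_000_000            # 216MHz
duration_ms = 73.735            # X-CUBE-AI average duration
cpu_cycles = 15926760           # X-CUBE-AI average CPU cycles
macc = 2852598                  # model complexity in MACC
tflite_micro_ms = 922.918518    # tflite-micro execution time @216MHz

cycles_from_duration = duration_ms / 1000.0 * cpu_hz
cycles_per_macc = cpu_cycles / macc
speedup = tflite_micro_ms / duration_ms

# Same metric for tflite-micro, assuming the same MACC count for the same model.
tflite_cycles_per_macc = tflite_micro_ms / 1000.0 * cpu_hz / macc

print(f"cycles from duration : {cycles_from_duration:.0f}")
print(f"cycles/MACC          : {cycles_per_macc:.2f}")
print(f"X-CUBE-AI speed-up   : {speedup:.1f}x")
print(f"tflite-micro cycles/MACC (estimate): {tflite_cycles_per_macc:.1f}")
```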

Also the CPU workload is very low… I’m not sure what’s going on, but I’ve seen that x-cube-ai also needs the CRC module to be enabled, but maybe that’s irrelevant and it’s used only for the comms. Anyway, I hope I’ll find out during my next tests.

I really don’t know if I did something so terribly wrong with tflite-micro (which I’m sure I didn’t, but you never know) or the API is so slow. That’s an enormous difference that makes tflite pretty much unusable. I hope that I’ve done something wrong, though. I plan to raise an issue in the github repo with my findings to see if I get any answer.

In the next post, I’ll do benchmarks with the X-CUBE-AI for the same model on the STM32F7 and then do a comparison.

Update: Part 4 is now available here.

Have fun!

Machine Learning on Embedded (Part 2)


Note: This post is the second in the series. Here you can find part 1, part 3, part 4 and part 5.

In the first part (here) we’ve designed, trained and evaluated a very simple NN with 3-inputs and 1-output. It will make more sense if you have a look at the first post before continuing with this.

So, in this post we will design a slightly more complex (but again simple) NN and we’ll follow the same procedure as in the first part: design, train and evaluate. For consistency and to make it easier to compare, we’ll use the same inputs and training set.


The MCUs that we’re going to use are the same ones as in the previous post.

Another simple NN

Everything that is related to this project, for all the article parts, is in this bitbucket repo:

In the previous post we had a very simple NN with 3-inputs and 1-output. In this post we’ll have a NN with 3-inputs, a hidden layer with 32 nodes and 1-output. You can see that in the following picture:

You can see that not all 32 nodes are displayed in the picture, but only h(0), h(1), h(2) and h(31). Also I haven’t added all the weights because there wasn’t enough space, but it’s easy to guess that they are similar to the ones from a(0).

Writing the mathematical equation for this NN is a bit more complex (only because it takes a lot of lines), but the logic behind it is the same. It’s just the dot product of the inputs and the weights between the inputs and the hidden layer, and then the dot product of the hidden layer and the weights between the hidden layer and the output. Anyway, the math doesn’t really matter for now.
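To make the two dot products a bit more concrete, here is a minimal Python sketch of the forward prediction for a 3-input, 32-node hidden layer, 1-output NN. It is only an illustration: the random weights are placeholders and not the trained ones from the notebook, and applying the sigmoid on the hidden layer as well as on the output is my assumption.

```python
import math
import random


def sigmoid(x):
    """Squash any real number into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, weights_1, weights_2):
    """Forward prediction: inputs -> hidden layer -> single output."""
    hidden = []
    for j in range(len(weights_2)):
        # Dot product of the inputs with the weights that feed hidden node j
        y = sum(a * weights_1[i][j] for i, a in enumerate(inputs))
        hidden.append(sigmoid(y))  # assumption: sigmoid on the hidden layer
    # Dot product of the hidden layer with the output weights
    return sigmoid(sum(h * w for h, w in zip(hidden, weights_2)))


# Placeholder weights (NOT the trained ones): a [3][32] array and a [32] array
random.seed(1)
N_INPUTS, N_HIDDEN = 3, 32
weights_1 = [[random.uniform(-1, 1) for _ in range(N_HIDDEN)]
             for _ in range(N_INPUTS)]
weights_2 = [random.uniform(-1, 1) for _ in range(N_HIDDEN)]

print(forward([1, 0, 1], weights_1, weights_2))
```

Whatever the weights are, the output always lands between 0 and 1 because of the last sigmoid.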

As the inputs are the same, the same table with all possible 8 input sets stands as before.

Training the model

Training this model is a bit more complicated than before. You can open the `Simple python NN (1 hidden).ipynb` notebook from the cloned repo in your Jupyter browser or you can just view it here. The python code seems almost the same, but in this case I’ve made some changes to support the hidden layer and the additional weights between each layer.

In step 2 in the notebook you can see that the weights are now a [3][32] array. That means 32 weights for each of the 3 inputs. That’s 96 weights just between the first two layers, plus another 32 weights between the hidden layer and the output, which is 128 weights in total! So you can imagine that this will need a lot more processing time to calculate, and also that this number can grow really fast the more hidden layers or nodes per layer you add.
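The weight count above can be checked with a few lines of Python. This is just a sketch for fully connected layers without bias terms (which is what the numbers in this post imply); the helper name `count_weights` and the last topology in the example are made up for illustration.

```python
def count_weights(layer_sizes):
    """Count the weights of a fully connected NN without bias terms.

    layer_sizes lists the nodes per layer, e.g. [3, 32, 1].
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


print(count_weights([3, 1]))          # the NN from the first post
print(count_weights([3, 32, 1]))      # the NN from this post: 96 + 32
print(count_weights([3, 32, 32, 1]))  # hypothetical: a second hidden layer
```

Adding just one more 32-node hidden layer already brings the count to 96 + 1024 + 32 = 1152 weights, which shows how quickly the number grows.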

After we train the model we see some interesting results. I’m copying them here:

# Simple 2

[0 0 0] = [0.28424671]
[0 0 1] = [0.00297735]
[0 1 0] = [0.21864649]
[0 1 1] = [0.00229043]
[1 0 0] = [0.99992042]
[1 0 1] = [0.99799112]
[1 1 0] = [0.99988018]
[1 1 1] = [0.99720236]

Let’s see again the results from the previous post.

# Simple 1

[0 0 0] = [0.5]
[0 0 1] = [0.009664]
[0 1 0] = [0.44822538]
[0 1 1] = [0.00786466]
[1 0 0] = [0.99993704]
[1 0 1] = [0.99358931]
[1 1 0] = [0.9999225]
[1 1 1] = [0.99211997]

Do you see what just happened? In the previous NN with no hidden layer the prediction for [0 0 0] was 50% and for [0 1 0] it was 44%. With the new NN that has the hidden layer the prediction is much clearer, and the NN predicts that those values must most probably be 0. Therefore, with the same inputs and the same output, the new, more complex NN makes more accurate predictions.

A more complex NN doesn’t always make better predictions. Actually, it might be the opposite. If you want to dig deeper you can have a look at NN over-fitting. Most probably, even in this second case with the 32-node hidden layer, the model is over-fitting and maybe 8 nodes are more than enough, but I prefer to test this 32-node hidden layer in order to stress the MCUs with more load and get some insight into how these little boards cope with that load.

Evaluate on the MCUs

Now that we’ve designed, trained and evaluated our model in the Jupyter notebook, we’re going to test the NN on different MCUs.

What is important here is not whether the evaluation really works on the MCUs. I mean that’s just code, and of course it will work the same way and you’ll get similar results. Your results may just differ a bit between different MCUs, because we’re using doubles and the accuracy may vary.

C code

Regarding the NN prediction implementation in the C code, just have a look at the test_neural_network2() and benchmark_neural_network2() functions in the code. The rest is the same as I’ve described in the first post.

Supported serial commands

Again, please refer to the first post.

For this post the START=2 command was used in order to execute the benchmark with the second simple NN. In the previous post the benchmark results were obtained with the START=1 command. Keep in mind that if you want to switch from one mode to another you first need to send the STOP command.


You can find all the oscilloscope screenshots for the prediction benchmarks in the screenshots folder. The captures are the ones that have the simple2 in their filename. In the following table I’ve gathered all the results for the prediction execution time for each board. As the second NN takes more time you can ignore the toggle time as it’s insignificant. Here are the results:

| MCU | Prediction time (μsec) |
| --- | --- |
| stm32f103 @ 72MHz | 700 |
| stm32f103 @ 128MHz | 385 |
| Arduino Uno @ 8MHz | 5600 |
| Ard. Leonardo @ 16MHz | Oops! |
| Arduino DUE @ 84MHz | 686 |
| ESP8266-12E @ 160MHz | 392 |
| Teensy 3.2 @ 120MHz | 504 |
| Teensy 3.5 @ 168MHz | 363 |
| stm32f746 @ 216MHz | 127 |
| stm32f746 @ 295MHz | 92.8 |

As you can see from the above table I’ve lost the results for the Arduino Leonardo. But who cares. I mean it’s boringly slow anyway. I may try to re-run the test and update.

Now let’s think about a real-time application. As you can see the prediction time now has increased significantly. It’s interesting to see how much that time has increased. Let’s see the ratio between the NN in the first post and this.

| MCU | Prediction time ratio |
| --- | --- |
| stm32f103 @ 72MHz | 41.42 |
| stm32f103 @ 128MHz | 41.04 |
| Arduino Uno @ 8MHz | 48.95 |
| Ard. Leonardo @ 16MHz | |
| Arduino DUE @ 84MHz | 36.29 |
| ESP8266-12E @ 160MHz | 25.06 |
| Teensy 3.2 @ 120MHz | 42.857 |
| Teensy 3.5 @ 168MHz | 41.06 |
| stm32f746 @ 216MHz | 26.13 |
| stm32f746 @ 295MHz | 25.92 |

Let’s explain what this ratio is. This number shows how much slower the second NN is compared to the first NN on the specific CPU. So on the stm32f103 the second NN needs 41 times the time that the first NN needed to predict the output. Therefore, the bigger the number, the worse the effect the second NN had on the MCU. In those terms, the stm32f103 seems to scale much worse than the stm32f746 and the esp8266. The stm32f746 and the esp8266 really shine and scale much better than any other MCU. The reason, I guess, is the hardware FPU that those two have, which can explain the ratio difference, as the NN is actually just calculating dot products on doubles.
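If you want to reproduce the ratio column yourself, the following small Python sketch divides the prediction times of this post by the ones from the first post. The numbers are copied from the two result tables; the Leonardo is left out because its second result was lost.

```python
# Prediction times in usec, copied from the tables of part 1 and part 2
simple1_us = {
    "stm32f103 @ 72MHz": 16.9,
    "stm32f103 @ 128MHz": 9.38,
    "Arduino Uno @ 8MHz": 114.4,
    "Arduino DUE @ 84MHz": 18.9,
    "ESP8266-12E @ 160MHz": 15.64,
    "Teensy 3.2 @ 120MHz": 11.76,
    "Teensy 3.5 @ 168MHz": 8.84,
    "stm32f746 @ 216MHz": 4.86,
    "stm32f746 @ 295MHz": 3.58,
}
simple2_us = {
    "stm32f103 @ 72MHz": 700,
    "stm32f103 @ 128MHz": 385,
    "Arduino Uno @ 8MHz": 5600,
    "Arduino DUE @ 84MHz": 686,
    "ESP8266-12E @ 160MHz": 392,
    "Teensy 3.2 @ 120MHz": 504,
    "Teensy 3.5 @ 168MHz": 363,
    "stm32f746 @ 216MHz": 127,
    "stm32f746 @ 295MHz": 92.8,
}

# How many times slower the second NN is on each MCU
ratios = {mcu: simple2_us[mcu] / simple1_us[mcu] for mcu in simple2_us}

# Print from the best scaling MCU to the worst
for mcu, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{mcu:22s} {ratio:6.2f}")
```

Sorting by the ratio puts the esp8266 and the stm32f746 at the top and the Arduino Uno at the bottom.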

Therefore, here we have a good hint. If you want to run a NN on an MCU, first find one with a hard FPU, like a Cortex-M4/7 or the esp8266. Of those two, the stm32f746 is of course a much better option (but that also depends on the use case, so if you need a wifi connection then the esp8266 is also a good option). So, coming back to real-time applications, we need to keep in mind that the second NN is still a simple one, as we only have 3 inputs. If we had more inputs then we would need more time to get a prediction. Also, the closer we get to the millisecond area, the more MCUs are excluded from any real-time application that needs to make fast decisions. Of course, once again, it always depends on the project! If for example you had a NN whose inputs were the axes of a 3D accelerometer and a trained model that needed to predict a value according to those inputs, then maybe 700 μsec or even 500 μsec are ok. But they may not be! So it really depends on the problem you need to solve.


After finishing those tests I had mixed feelings. On one hand, I’ve managed to design, train and evaluate two simple NN models and test them successfully on all the MCUs. That was awesome. Of course, the performance is different and depends on the MCU. So, although I see some potential here, at the same time it seems that the performance drops quite a lot as the model complexity increases. But as I’ve said, it depends on the real use case you may have. You might be able to use an MCU to run the predict function, you might not. It all depends on the real-time requirements and the model complexity.

Let’s keep the fact that the tools are out there. There are many different MCUs, with different processing power and accelerators that might fit your use case. New Cortex-M cpus are now coming with NN accelerators. I believe it’s a good time now to start diving into the ML and the ways that it can be used with the various MCUs in the low embedded domain. Also there are many other HW platforms available in the market, like FPGAs with integrated application CPUs that can be used for ML. The market is growing a lot and now it’s a good time to get involved.

Update: next part is here.

Until then have fun!

Machine Learning on Embedded (Part 1)


Note: This post is the first in the series. Here you can find part 2, part 3, part 4 and part 5.

I’ve been following the whole machine learning hype closely since 2015, and after 4 years I can finally say that it’s mature enough for me to get involved and try to play and experiment with it in the low/mid embedded domain. Although it’s exciting to get involved with new techs immediately, I believe that engineers should just keep an eye on the ones that seem likely to be valuable in the future or have the potential to grow into something that can be used in their domain. But at the same time engineers must be moderate and wait for the “hype” to fade and then get the really valuable information. This is what happened with machine learning in my case. Now I finally feel that it’s the right time to dig into this domain more seriously and that the tools and frameworks are mature and simple to use.

And this brings us to the fact that implementing and developing the tools is one thing, and using them to solve a problem is a different thing. Until now, many engineers have worked hard on the development of these tools, and now it’s much easier for us to just use them to solve problems. Keras, for example, is exactly that. It’s a really mature and beautiful framework to use and now it’s very stable. On the other hand, when you wait for this to happen, you have a steeper learning curve in front of you, so from time to time it’s good to keep up to date with what’s going on in the domains that you’re interested in.

Anyway, this time I’ve decided to make a completely stupid project to find the limits and the use cases of ML in the embedded world. Well, don’t get me wrong, this is not research; it’s just an evaluation of the current status of the tools and how they perform with the current low embedded technologies. Nowadays, when most engineers hear embedded they think of some kind of ARM application CPU that runs Linux on an SBC. Well, sure, that’s embedded too, and there are many of those SBCs these days and they are really cheap, but embedded is also those dirt cheap 8, 16 and 32-bit RISC MCUs and also the Cortex-M series.

Let’s make some assumptions for the rest of this post. Let’s assume that low embedded is everything that is equal to or less than a Cortex-M MCU, and high embedded is all the rest: application CPUs that can actually run Linux. I know that there are also some smaller MMU-less MCUs that can run Linux, but let’s forget about that for now. Also, from now on I’ll refer to machine and deep learning as ML, just for convenience. Although the terminology in the field is getting standardized, I’ll try to keep it simple, even if I’m not using the proper convention in some cases. Otherwise this post will become like the ones I was reading myself in the beginning, which were really hard to follow. So, although there are some differences, let’s keep it simple. AI, deep learning, machine learning… I’ll just use ML! Also I’ll refer to a single neuron as a neuron or a node. Finally, I’ll refer to a neural network as NN.

This article will be split in 4 or 5 different posts. The first one (this one) will have some very generic information about ML and NN, but not in depth, as this is not the purpose of this post series. Also in this post we’ll implement a very simple NN with a single node with 3 inputs and 1 output and then run some benchmarks on various MCUs and analyze the results.

In the second part I’ll use the same MCUs but with a bit more complex NN that has the same inputs, but a hidden layer with 32 nodes and 1 output. This NN will be more accurate in its predictions (as we’ll see), compared to the simple NN; but at the same time it will need more processing time to run the forward prediction. Please don’t expect to learn the terminology and details on ML here, but it will be much easier to follow if you already know some basic things around NN.

Excited already? No? Well, don’t forget it’s a stupid project. There’s always a small excitement in doing something useless. So, let’s move on.


Spoiler. For me, one of the most interesting things in this stupid project was the number of different boards that I’ve used to run those benchmarks. I think what I liked most was the fact that I was able to test all these different boards with the same code. For sure, the stm32f103 (blue-pill) was more optimized as I’ve used my own low level cmake template, but nevertheless I enjoyed having most of my boards running the same neural network code. Well, I didn’t use any of my PSoC 4 & 5, STM8, LPC1110, LPC1768 and a few other boards I have around, as I didn’t have more time to spend on this. Maybe at some later point I’ll add the benchmarks for those, too.

STM32F103C8T6 (aka blue-pill)

This is my favorite board and my reference, so it couldn’t miss the party. I’ve run the benchmarks @72MHz and then I’ve overclocked the MCU @128MHz.


STM32F746

This is actually the STM32 F7 discovery board, which is a cute development board with a lot of peripherals and cool stuff on it, but in this case I’ve only used a serial port and a gpio pin. As I like to overclock the stm32s, I’ve managed to overclock this mcu @295MHz. Above that it didn’t work for me.

Arduino Uno

I guess I don’t need to write more about this. Everybody knows it. It runs on an ATmega328p and it’s the slowest MCU in this comparison.

Arduino Leonardo

This is another Arduino variant with the ATmega32U4 cpu, which is a bit faster than the ATmega328p.

Arduino DUE

This is an arduino board that runs on an Atmel SAM3X8E MCU, which is actually an ARM Cortex-M3 running at 84MHz. Quite fast MCU for its release date back then.

Teensy 3.2

The teensy is a very interesting board. It’s a bit expensive, sure. But it’s almost fully compatible with the Arduino IDE libraries and that makes it ideal for fast prototyping and testing. It’s based on a Cortex-M4 CPU and for the test I’ve used it overclocked @120MHz.

Teensy 3.5

This teensy board is using a Cortex-M4 CPU, too, but it runs at a faster clock. I’ve also used it overclocked @168MHz. The overclocking options for both teensy boards come easy and for free from within the Teensy plugin in the Arduino IDE. I had some issues with one library, but nothing difficult to solve. More details are in the README file in each MCU code folder.


ESP8266

Yep, we all know this board. An L106 32-bit RISC CPU running up to 160MHz.

A simple NN

OK, so let’s now jump to the interesting stuff. Everything that is related to this project, for all the posts, is in this bitbucket repo:

Although it’s not the best thing to have all these different things in one repo, it makes more sense as it makes it easier to maintain and update. During this post series I’ll use different parts from this repo, so not everything you see there is for this first post.

Before we begin, in case you want to learn some basics about NN, you can watch these videos on YouTube (1, 2, 3, 4) and also this playlist.

First let’s start with a simple NN. For this post, we’re going to use a single neuron with 3 inputs and 1 output. You can see that in the following picture.

In the above image we see the topology of a simple NN. That has 3x inputs and 1x output. I won’t get into the details of the math. In this case the output is simple to calculate and it’s:

y = a0*w0  +  a1*w1 + a2*w2

This is the dot product of a(n) and w(n), where n=0,1,2. Just be aware that a(n) is not a function, it just means a0, a1, a2. The same goes for w(n). So, a(n) are the inputs and w(n) are the so called weights. You can think of the weights as just numbers whose size controls the effect that each a(n) has on the output result. The higher w(n) is, the more a(n) affects y.

The output is not y, though. The output is the sigmoid of y, so:

output = sigmoid(y)

What sigmoid does is limit the output between 0 and 1. So the more negative y is, the nearer the output is to 0, and the more positive y is, the nearer it is to 1. In the ML world this function is called the activation function.
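The two formulas above fit in a few lines of Python. This is only a sketch to show the idea; the weights here are arbitrary placeholder values and not the trained ones.

```python
import math


def sigmoid(y):
    """Limit y to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-y))


def predict(inputs, weights):
    """output = sigmoid(a0*w0 + a1*w1 + a2*w2)"""
    y = sum(a * w for a, w in zip(inputs, weights))
    return sigmoid(y)


# Placeholder weights, only for illustration (NOT trained values)
weights = [2.0, -1.0, 0.5]

for inputs in ([0, 0, 0], [1, 0, 0], [0, 1, 0]):
    print(inputs, "=", predict(inputs, weights))
```

Note that with all inputs at 0 the dot product is 0 no matter what the weights are, so the output is always sigmoid(0), which is exactly 0.5.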

For this project we assume that a(n) is a single binary digit (0 or 1). Therefore, since we have 3 inputs then all the possible combinations are in the following table:

a0 a1 a2
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

For simplicity, you can think of those inputs as 3 buttons connected to 3 gpio pins on the MCU, whose state is either pressed or not pressed. Then, depending on their state, the output is also a binary value (0 or 1).

Training the model

Training the model means getting a set of inputs that we already know produce a specific output, and then training the NN according to these. Then we hope/expect that the NN is able to predict the output for unknown inputs that it hasn’t been trained on. The training is not done on the target, but separately on a workstation (or cloud) that has more processing power; in the end, only the prediction function is executed on the MCU. This model is very simple, although someone may argue that 2 inputs – 1 output is simpler :p . Even though it’s simple enough, we’ll do the training on a workstation, as it’s important to use some tools that make the workflow easier.

To do that, it’s better to use a Jupyter notebook for all the design, training, testing and evaluation. Also, Jupyter notebooks are the standard documents that you’ll find in most new github projects. The simplest way to install Jupyter and the other tools we need is using miniconda. I’ll be really quick on this. I’m using Ubuntu, so the following commands are for that OS.

  • Download miniconda from here
  • Install miniconda
  • Create a new environment and install the tools you’ll need
    # Create a new environment
    conda create -n nn-env python
    # Activate the environment
    conda activate nn-env
    # Now install those packages to that environment
    conda install -c conda-forge numpy
    conda install -c conda-forge jupyter
    conda install -c conda-forge scikit-learn
    conda install -c conda-forge tensorflow-gpu
    conda install -c conda-forge keras

    Not all of the above packages are needed for this example, but we’ll use them later.

  • Next git clone the repo for this project and run Jupyter.
    git clone
    cd machine-learning-for-embedded
    jupyter notebook &

    If everything goes right, then now you should be able to see the web-interface from Jupyter in your browser. If not, then I guess you need to do some google-fu. In this web interface you would see a folder with the name jupyter_notebooks. Just double click on that and there you’ll find all the notebooks for this project. The one we need for this post is the `Simple python NN.ipynb`. Just click on that.

    What you see in there is some mix of markdown text and python code. For the simple cases of the first two parts we’re going to implement the NN with just python code, without using any advanced library like tensorflow or keras. The reason for this is that we can write code that we can later convert to simple C and run tests on the different MCUs.

Again, I won’t go into the details of Jupyter notebooks and python. I guess there are plenty of tutorials on the internet that are much better than any explanation I can provide.

Let’s see the notebook now.

Note: In case you just want to view the notebook and evaluate your results, you don’t have to install Jupyter, but instead you can just view the notebook in the bitbucket repo here.

First we import some functions from numpy to simplify the code. Then we create a NeuralNetwork class that is able to train and evaluate our simple NN. Then we create a training set for our binary inputs. As we’ve seen before, 3 binary inputs have 8 possible combinations and we choose to use a train set of 4 inputs. That means that we’ll train our NN with only 4 out of 8 combinations and then expect the NN to be able to predict the rest by itself. So we train with the 50% of the possible values.

Then we create an array with the 4 inputs and the 4 outputs. After that we initialize the NeuralNetwork class and view the random weights. A note here is that the weights always have random values in the beginning. The meaning of training is to calculate those weights (if you prefer the mathematical view, it means finding where the slope of the error function is minimum or ideally zero). Another note is that when you run this notebook in your browser you may get different values after each training (you shouldn’t, but you may). Those values should be similar to the ones in the notebook, but they might also differ a bit. In this case, if you just want to evaluate your results with the C code that runs on the MCUs, then keep in mind that you may need to change the weights in the MCU code according to your weights. By default, the weights in the C code are the ones that you get in the repo’s notebooks without executing any cells.

Finally, we train the model to calculate the weights and then we evaluate the model with all the possible input combinations. For convenience I’m copying my results here:

[0 0 0] = [0.5]
[0 0 1] = [0.009664]
[0 1 0] = [0.44822538]
[0 1 1] = [0.00786466]
[1 0 0] = [0.99993704]
[1 0 1] = [0.99358931]
[1 1 0] = [0.9999225]
[1 1 1] = [0.99211997]

From the above output we see that for the values that we used during training the predictions are very accurate. This output is from the stm32f103; as you’ll find out, the Arduino builds don’t give you that floating point precision when they convert the doubles to strings. As I’ve mentioned before, in the output we get values from 0 to 1. That’s the prediction of the NN, and the closer it is to 0 or 1, the higher the probability that the output has that value (because in this example we happen to have a binary output, so it’s 0 or 1 anyway). So for the training inputs [[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]] we see that the accuracy is much better compared to unknown inputs like [0 0 0] and [0 1 0]. For the first input in particular, it’s not actually possible to say if it’s 0 or 1, as it stands right in the middle. Ouch!
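For those who don’t want to open the notebook, the training flow described above can be sketched in plain Python without numpy. Keep in mind that this is not the notebook’s code: the target outputs [0, 1, 1, 0], the 10000 iterations, the seed and the update rule (error times the sigmoid slope) are my assumptions, picked so that the results land close to the ones listed above.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(inputs, weights):
    """Forward prediction: sigmoid of the dot product."""
    return sigmoid(sum(a * w for a, w in zip(inputs, weights)))


def train(samples, targets, weights, iterations=10000):
    """Adjust the weights using the error scaled by the sigmoid slope."""
    for _ in range(iterations):
        adjustments = [0.0] * len(weights)
        for inputs, target in zip(samples, targets):
            out = predict(inputs, weights)
            delta = (target - out) * out * (1.0 - out)
            for i, a in enumerate(inputs):
                adjustments[i] += a * delta
        weights = [w + adj for w, adj in zip(weights, adjustments)]
    return weights


random.seed(1)  # arbitrary seed, only to make the run repeatable
training_inputs = [[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]]
training_outputs = [0, 1, 1, 0]  # assumed targets: the output follows a0

# The weights start with random values, training then calculates them
weights = [random.uniform(-1.0, 1.0) for _ in range(3)]
weights = train(training_inputs, training_outputs, weights)

# Evaluate all 8 possible input combinations
for n in range(8):
    inputs = [(n >> 2) & 1, (n >> 1) & 1, n & 1]
    print(inputs, "=", predict(inputs, weights))
```

The [0 0 0] case always gives 0.5 with this topology, because there is no bias and the dot product of all-zero inputs is 0 whatever the weights are.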

Evaluate on the MCUs

Now that we’ve designed, trained and evaluated our model in the Jupyter notebook, we’re going to test the NN on different MCUs.

What is important here is not actually whether the prediction really works on the MCUs. I mean that’s just code, of course it will work the same way and you’ll get similar results. Your results might differ a bit because we use doubles, and those may differ from one architecture to another. What is important, though, is the performance!

That’s all we care about eventually, right? And that was the main drive for me to create this project. To find out how those MCUs perform with simple or more complex NNs. Is it possible to run a NN in real-time? Does it even make sense to do that on an MCU? What should you expect? Is it worth it? Can those tiny MCUs give a good performance? What are the limits? Is it maybe better to convert a NN problem to an algorithmic one in order to run it on an MCU? Are nested ifs, lookup tables and Karnaugh maps still a better alternative? And a lot of other questions.

Just be sure that I’m not going to answer all those things here though, as there are a lot of different parameters per project and use case. But by doing this yourself, you should definitely get an idea of the performance, the potentials and the limits that exist with the current technologies.

The evaluation on the MCUs is split in 3 different cases. We have the stm32f103, which has its own code in the `code-stm32f103` folder. The stm32f746 also has its own code folder (code-stm32f746), as do the esp8266 and the arduino due. For the other arduinos and the teensy boards you can use the code-arduino folder.

Just a parenthesis here. People that read my blog often probably know that I’m more of a baremetal embedded guy. I enjoy doing stuff with more stripped down libraries, even CMSIS for the Cortex-M. Although I mention from time to time that I don’t like using the Arduino IDE or HAL libraries, I’ve also mentioned that I find these very useful for cases like this. Prototyping! Therefore, using those tools for prototyping is an excellent choice and a no-brainer. We need to choose our tools wisely and use them where they fit best every time. So when evaluating a case or project on different HW it always makes sense to use those tools for prototyping and then write the proper low embedded code where it’s needed. OK, enough with this parenthesis.

You’ll find details on how to build and run each code on each MCU in the README files in the project folders. Hence, I’ll only mention the serial protocol that I’m using and also how it works in the background.

C code

The C code is really simple for this example. The dot product and the sigmoid function are implemented in the neural_network.h/c files, and from the main.c file we just call the prediction() function (which is just the sigmoid(dot()) function). The same .h and .c files are used for all the different builds. Also, the weights for this example are in the double weights[] array in main.c and the inputs are in the double inputs[8][3] array, again in main.c. For now just ignore the double weights_1[32][3] and double weights_2[] arrays, which are used for part 2.

Finally, also two important functions for this example are the benchmark_neural_network() and test_neural_network(). Both are triggered with commands from the serial port. The test function will just print the prediction for all the input combinations in order to compare them with the jupyter notebook and the benchmark function will run a single prediction and at the same time toggle a pin in order to measure the time the function has taken with an oscilloscope.

Supported serial commands

In order to simplify testing I’ve created a couple of commands. In the case of the stm32 you can connect to the serial port at 115200 bps, and for the rest of the MCUs that use the .ino project you need to connect at 9600 bps (anyway, it’s either 9600 or 115200).

The supported commands are the following (all commands expect a newline in the end):


TEST=<mode>
where <mode>: 1 or 2

This command will evaluate all the 8 possible inputs by running the prediction using the calculated weights and will print the output. Then you can compare the output with the output from the jupyter notebook.

Mode 1, is using the simple1 NN and its weights. The simple1 NN is the one we use on this post with 3 inputs and 1 output.

Mode 2, is using the simple2 NN and its weights. The simple2 NN is the one that we use in part 2 with 3 inputs, a hidden layer with 32 nodes and 1 output.

Note: If you run the TEST commands on any arduino-built firmware you’ll get a bit disappointed, as by default the Serial.print function prints double values with only 2 decimals. That’s a bit crap. I’ve read that there are some ways to fix this, but it doesn’t really matter. It only matters that the predictions are correct enough. With the stm32 that’s not an issue; you’ll get pretty much the same accuracy as the python output.


START=<mode>
where <mode>: 1 or 2 (has the same meaning as before)

This command starts a timer that every 3 seconds will run the prediction function and also toggles a gpio in order to help us to make precision measurements. By default, the prediction is using the first input set [0 0 0]. That doesn’t really matter as it doesn’t affect the computation timing, but you can change it in the code if you like. You can verify that mode 1 is much faster than mode 2, but we’ll have a look at it at the next post.


STOP

The STOP command just stops the timer that was triggered with the START=<mode> command.
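If you prefer to script the tests instead of typing in a serial terminal, the commands can be built like in the following Python sketch. The helper `format_command` is hypothetical (it’s not part of the repo); it only builds the bytes to send, including the trailing newline that all commands expect. Writing those bytes to the port is left to whatever serial library you use (for example pyserial, at 115200 or 9600 bps depending on the board).

```python
def format_command(name, mode=None):
    """Build a serial command as bytes, terminated with a newline.

    Hypothetical helper: name is the command (e.g. "START" or "STOP")
    and mode is the optional <mode> argument (1 or 2).
    """
    text = name if mode is None else "{}={}".format(name, mode)
    return (text + "\n").encode("ascii")


# Run the benchmark with the second NN, then stop the timer
print(format_command("START", 2))
print(format_command("STOP"))
```

Remember that you need to send STOP before switching from one mode to the other.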


First I need to mention that the best way to measure the time that a code needs to run is by using an oscilloscope and a gpio pin. You can set the pin high just before you run your function, then run the function and then set the pin to low. Then by using the oscilloscope you can calculate the exact time the operation lasted.

There’s a catch though! Toggling a pin also takes some time, and that time is different for different hardware and even for different gpio libraries on the same hardware. For that reason, you’ll find that in the code I’m toggling the pin twice before running the NN prediction function. That way you can measure the time that those two toggles take and then subtract the average from the time that the prediction operation lasted. So, you measure the time of the two toggles, and if that time is Tt, then you measure the time between the HIGH and LOW of the prediction function, and the total time spent on the prediction will be:

Tp = Thl - (Tt/2)
Tp : Prediction time
Thl: Time of High-Low transition that includes the prediction function
Tt: Time of the two toggles
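As a quick worked example of the formula, here is a tiny Python sketch. The function name is made up, and using the 9.38 μsec and 0.290 μsec measurements of the stm32f103 @ 128MHz as Thl and Tt is my assumption of how the captures map to the formula.

```python
def prediction_time(thl, tt):
    """Tp = Thl - (Tt / 2)

    thl: HIGH to LOW time that includes the prediction function
    tt:  time of the two pin toggles
    Both values must use the same unit (usec here).
    """
    return thl - tt / 2.0


# Assumed mapping of the stm32f103 @ 128MHz captures to Thl and Tt
tp = prediction_time(thl=9.38, tt=0.290)
print("Tp = {:.3f} usec".format(tp))
```

For this board the correction is only about 1.5% of the measured time, which is why it matters mostly for the boards with slow gpio handling.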

Anyway, let’s not complicate things more. The above only helps when the prediction function is fast, or when different MCUs have similar times and you want to remove the overhead of any GPIO handling that may differ between them.

Note: I’ve included all the oscilloscope screenshots in the screenshots folder in the repo. Therefore, you can have a look on the oscilloscope output for each different MCU as I’m not going to post them all here (there are just too many).

Before posting the table of the results, these are the screenshots for the stm32f103 and the Arduino Uno. The name coding in the screenshots folder is <mcu>-<NN topology>-<frequency>-<capture>.png. That means that for the teensy 3.2 the screenshot for this simple example (simple1) and the prediction capture will be `teensy_3.2-simple1-120MHz-predict.png`. In the next post (part 2) the NN topology will be called simple2.

These are the captures for the toggle and prediction for stm32f103 and arduino uno.

stm32f103 @ 128MHz pin toggle time = 290 nsec

stm32f103 @ 128MHz prediction time = 9.38 μsec

Arduino Uno @ 8MHz pin toggle time = 15.5 μsec

Arduino Uno @ 8MHz prediction time = 114.4 μsec

Although you already get a rough idea, the next table summarizes everything.

| MCU | Pin toggle time (μsec) | Prediction time (μsec) |
| --- | --- | --- |
| stm32f103 @ 72MHz | 0.512 | 16.9 |
| stm32f103 @ 128MHz | 0.290 | 9.38 |
| Arduino Uno @ 8MHz | 15.5 | 114.4 |
| Ard. Leonardo @ 16MHz | 21 | 116 |
| Arduino DUE @ 84MHz | 8.8 | 18.9 |
| ESP8266-12E @ 160MHz | 1.58 | 15.64 |
| Teensy 3.2 @ 120MHz | 0.830 | 11.76 |
| Teensy 3.5 @ 168MHz | 0.572 | 8.84 |
| stm32f746 @ 216MHz | 0.157 | 4.86 |
| stm32f746 @ 295MHz | 0.115 | 3.58 |

As you can see from the above table, the higher the frequency the better the performance (oh, really?). Note that I haven’t subtracted the pin toggle time from the prediction time! Also note that although the Teensy 3.5 has better performance than the stm32f103 @ 128MHz, its pin toggle time is almost double… That’s because those arduino libraries are implemented on top of bloated functions, even for just setting or clearing a pin. Of course, the overclocked stm32f746 @ 295MHz is by far the fastest in all terms.

Also I’ve noticed something unrelated to the NN. If you look at the ratio of (prediction time)/(pin toggle time), you get some interesting numbers. Let’s see the following table:

| MCU | (prediction time)/(pin toggle time) |
| --- | --- |
| stm32f103 @ 72MHz | 33 |
| stm32f103 @ 128MHz | 32.34 |
| Arduino Uno @ 8MHz | 7.38 |
| Ard. Leonardo @ 16MHz | 5.52 |
| Arduino DUE @ 84MHz | 2.14 |
| ESP8266-12E @ 160MHz | 9.89 |
| Teensy 3.2 @ 120MHz | 14.16 |
| Teensy 3.5 @ 168MHz | 15.45 |
| stm32f746 @ 295MHz | 31.13 |

The above table shows what you can expect from your hardware and how those bloatware arduino libs hurt the overall performance. To be fair though, the NN code is not affected by the libraries, as it’s plain C code. But normally your MCU will also do other tasks and not only run the NN; therefore, everything else that the cpu does affects the NN performance, especially if the code uses bloated libraries. In this case we just toggle a pin and run a timer in the background, nothing else. Well, not true, the stm32f103 actually also runs a few other things in the background, but nevertheless it has the best prediction/toggle ratio. The Arduino DUE has the weirdest behavior, which doesn’t make sense, but it was consistent. I didn’t even bother to debug that, though. Anyway, the above table is the reason that I sometimes mention that prototyping is completely different from development. Prototyping is proof of concept, and after that, going into the low level will bring you the performance. If you don’t care about performance, then sure, pick the tool that suits your needs.


From this example we’ve seen that we can actually design, train, evaluate and test a NN with Jupyter and python and then run the forward prediction function on a small MCU. Isn’t that great? Yeah, I know… Using so many resources on those small MCUs to run a 3-input, 1-output NN deserves the title of the stupid project! Anyway, from the last tables we have some interesting results that you can also interpret as you see fit.

The most interesting is that we can actually use this NN for real-time applications! OK, don’t laugh. I know that this example is useless, but you can! Even the 114.4 usec of the Arduino is ok’ish for fast real-time applications. Of course, it depends on the case and the specs. I mean, if you expect your inputs to change faster than that, of course you can’t use it. But think buttons for now! 😛

It’s really fast and even the Arduino Uno can handle this NN; 100 μsec is really fast. Oh, wait. That brings us to another question. If they are buttons, then why not create a nested-if function and handle that much, much faster?

Even better, why not create a lookup table? Maybe even create a Karnaugh map of the inputs/outputs and reduce that to a couple of logic operations. That would work really, really fast!
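To make the lookup table idea a bit more concrete, here is a minimal sketch in Python. Keep in mind that the truth table values are made up for illustration (they are not the function the NN was trained on), and the names `LOOKUP`, `predict_lut` and `predict_logic` are hypothetical.

```python
# Hypothetical truth table for a 3-input, 1-output function.
# The index is built as (a << 2) | (b << 1) | c.
# These values are an assumption, not the data set of the NN.
LOOKUP = (0, 0, 1, 1, 0, 0, 1, 1)


def predict_lut(a: int, b: int, c: int) -> int:
    """Return the output for three binary inputs with a single table read."""
    index = (a << 2) | (b << 1) | c
    return LOOKUP[index]


def predict_logic(a: int, b: int, c: int) -> int:
    """The same made-up function after reducing it by hand.

    For this particular table the output simply follows input b,
    so the whole thing collapses to one operation.
    """
    return b


if __name__ == "__main__":
    # Print every input combination with both results side by side.
    for i in range(8):
        a, b, c = (i >> 2) & 1, (i >> 1) & 1, i & 1
        print(a, b, c, predict_lut(a, b, c), predict_logic(a, b, c))
```

On an MCU the same thing would probably be a `const` array in flash and one indexed read, which is hard to beat in speed.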

Well, as I said, this is a very simplified example. I mean, this is just for testing and is not meant to do anything really usable. But on the other hand, what if instead of 3 inputs we had 128? Or 512? Then it would be really difficult to make a Karnaugh map and simplify it. Or we would need to write a ton of if-else cases. And what would happen if we needed to change something in the input or output sets? Then it would also be quite some work in the code. Maybe the lookup table is still a valid and good solution, though. It will cost RAM or FLASH space, but the weights of the NN will also take a lot of space. So you would need to compare how much space each solution would use and then, if the NN needs less space, decide if less space is more important than execution speed.
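A rough way to do that space comparison is sketched below. It rests on a few assumptions that are mine: the inputs are binary, the lookup table packs one output bit per input combination, and the NN is fully connected with biases and 32-bit float weights. The function names are hypothetical.

```python
def lut_bytes(n_inputs: int, n_outputs: int = 1) -> int:
    """Bytes for a full lookup table over binary inputs, outputs packed as bits."""
    total_bits = (2 ** n_inputs) * n_outputs
    return (total_bits + 7) // 8


def nn_bytes(layer_sizes, bytes_per_weight: int = 4) -> int:
    """Bytes for weights and biases of a fully connected network.

    layer_sizes lists the node count of every layer, e.g. [3, 32, 1].
    bytes_per_weight=4 assumes float32 storage.
    """
    params = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        params += fan_in * fan_out + fan_out  # weights plus one bias per node
    return params * bytes_per_weight


if __name__ == "__main__":
    for n_inputs in (3, 16, 128):
        layers = [n_inputs, 32, 1]
        print(n_inputs, "inputs:",
              "lookup table", lut_bytes(n_inputs), "bytes,",
              "NN", layers, nn_bytes(layers), "bytes")
```

With 3 inputs the table wins easily, but it doubles with every extra input, while the first layer of the NN only grows linearly with the number of inputs.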

It’s important to realise that ML doesn’t make every problem we have better, nor is it a magic tool that solves all our engineering problems. It’s a tool that seems to have the potential to solve some issues that were very complicated to solve before. It may also apply to problems that we already have solutions for, where ML may provide some flexibility we didn’t have before.

In the next post, I’ll do the same for a bit more complex NN with 3 inputs, a hidden layer with 32 nodes and 1 output.

Until then have fun!

Losing the wagon


This post is not about a stupid-project, but it’s a bit more philosophical and it’s about losing the wagon. Well, life has many wagons, but let’s narrow it to the technological and engineering wagon. This is an engineering blog after all.

The last couple of days I was exploring the current state of the home automation domain and specifically of KNX. I started developing for the KNX bus back in 2007. The trigger was a friend of mine, who’s an electrical engineer and started talking about this fancy new KNX bus, which derived from the Instabus, around 2006-2007 (if I remember correctly). He got my attention as he had already made some KNX installations and soon I got involved in it. I was fascinated with it and I wanted to start building stuff around it.

The KNX standard was supposed to be an open standard at the time, but it wasn’t really. Back then there was only a little information around and you needed to buy the specifications (which were expensive). So, I had to do a lot of stuff by myself. The only thing that was available was the BCUSDK. This project started in 2005, but it was all that I needed. From this code I managed to extract and understand the protocol and most things around it. The details weren’t obvious, of course, because the code wasn’t documented, but having some KNX devices to experiment with and the code was enough to do everything I wanted. Also a friend of mine (also an engineer) got fascinated with it and soon we got our KNX certification and in no time we developed a whole platform around it. This included APIs, GUI interfaces and gateways to many standard protocols used at the time (IP, RS232, RS485) and gateways to GSM modules, GPS, alarm systems and several other things.

Well, it was brilliant at the time and there wasn’t anything like that back in 2007. We could beat any competition. And then… for some reason we just stopped. I don’t even remember the excuse at the time, to be honest. But I know the real reason now. That was around 2008-2009.

Now, I’ve checked again and the KNX automation domain has completely transformed into a huge market and code-base. Several different APIs for several programming languages exist. Python, C/C++, even a KNX module for Qt. I wrote a KNX module in Qt in 2008 by myself and now I’ve seen that last year there was a new module in Qt for that. After 10 years!

So, I was 10 years ahead of the market. I saw the wagon of this train more than 10 years ago and all its potential. I developed a whole system around it and I let it die, thus losing the wagon and the train. Now you can find almost everything and many things are open source, which is great. There is even a Yocto layer with all those tools included. It’s the meta-calaos.

Trust me, it may seem a bit disappointing to realize that you’ve lost the wagon and to see where the market has ended up today, knowing that you were there 10 years ago and just did nothing. But, is it really? When this happens, the most reasonable thing to do is ask yourself, why? Actually, not only a single why but several whys, and when you find the reasons you can make some decisions for yourself and even get to know yourself better.

And this is what this post is all about.

Some thoughts

I guess I’m not the only one that has been in this situation. Some of you know exactly what I’m talking about and have already been there at least once. Well, for me this was not the first time either. I’ve had that more than once, but the above case hit me harder as I was a pioneer and 10 years earlier than the rest of the market. Well, in my case I know why I’ve failed. The reason is that I’m a “lazy” engineer. I’ll come back to this phrase later.

I’ve seen many engineers in my life. Mostly “not-really-good” engineers, by my standards. Although in my professional career I’ve been told that I’m a good engineer and I know that I’m capable of doing stuff, at the same time I don’t consider myself a -good- engineer. What is a -good- engineer after all? No one should consider himself a -good- engineer. If you do that, then it’s over. Of course, when it comes to the professional aspect then you need to present yourself as a good engineer and it’s easier to do that if others already believe it. But in the end I consider myself just an engineer. And this is a good and a bad thing at the same time.

Being an engineer is only a part of what you are in your professional career. You’re not only an engineer. You are also a salesman, a manager, a director and a CEO. You’re everything at the same time. At least you become those things after a few years in your domain; it’s the natural evolution which is called experience. But it’s the proportion of these roles that you have that makes and drives your professional career. Some people are better managers than engineers. Others might have more “CEO-like” qualities than the rest. So, you can’t have all the qualities at a high level at the same time. You may have one or two, but it’s extremely rare that you have everything. But, is that really a problem?

For example, I’m a “lazy” engineer. Lazy doesn’t mean that I’m really too lazy to do something. Actually it’s the opposite. I can drive myself to finish a project and complete it in the best and most optimal way. But then I need to do something else. I can’t stay on that for a long time. I can’t devote myself to a single project or domain and stay there forever. If I try to do that, then it makes me lazy in the end. I get bored and I start to hate what I do. And thus, I’m a “lazy” engineer. Well, at least until now I haven’t found a project or domain that I would like to stay in forever.

But being a “lazy” engineer has its flaws. For example, in this case I was 10 years ahead of the market and then I got bored. So, I got lazy. Therefore, I had to just drop everything and go to the next challenge. Otherwise, I would doom myself to a situation where I would hate what I’m doing. Maybe some of you can understand this, maybe others can’t. Not every engineer has to be the same. We have different qualities and proportions of them and that’s fine!

I’ve met engineers that were not so skilled, but they devoted themselves to an idea and a project and they succeeded in making it their main source of income. Many of those projects and ideas were so easy for me to develop and implement that they were boring to even start. They were just too simple for me, from the engineering aspect. But, they were profitable! Some engineers struggled to do something which seemed so easy to me, and they made a profit out of it. Others didn’t, though. I believe those who did were also a bit lucky, but all of them were devoted and better salesmen than engineers. Being able to sell something is more important than being able to build it in the best possible engineering way. The product may have its flaws, it may need several iterations until it gets released, it may even be released and be crap from the engineering aspect. But does this matter in the end? If you make a profit and a business case out of it, then it’s successful in mainstream market terms.

You don’t have to be an expert in something to do stuff. I’ve programmed in more than 10 programming languages as a professional. I may be an expert in only 2-3 of them, but it doesn’t really matter. You don’t need to be an expert in any programming language to make something that works and is profitable. Writing code is the easiest thing to do. Does it matter if it’s the best code? If you do it the pythonic, or yoctonic or C++17 way? All the code you’ve ever written is just crap in the end. It’s a mess, unless it’s just a few lines that do a very specific thing. You might think you wrote the best code 5 years ago, and if you see that code today you’ll hate yourself for writing that crap. But, it doesn’t matter. Really. You become an expert in something more as a “professional skill” that will make it easier for you to find a better job; but if you want to realize your own ideas and make that a product, then it doesn’t matter if you’re an expert. It never mattered.

Therefore, who’s the successful engineer in the end? The one that managed to devote himself to a product, release it in the market and make a profit, or the one that didn’t? The one that is an expert in 1-2 things or the one who’s capable in 10+? The one that delivers fast or the one that delivers something robust and well-engineered? The one that sees 10 steps further or the one that can focus on the current step? Don’t try to answer this question too fast.

I think that success is to be satisfied with what you do and be happy with what you’ve achieved at the end of your work. This sentence is a bit vague though, because what makes you happy now won’t necessarily make you happy in the future. But do you really know the future? No. So, what is left is what makes you a happy engineer now. And if you’re happy then it probably means that you’re also successful in what you do.

Therefore, making your own best-selling project and profiting from your awesome idea is not necessarily what will make you a happy and successful engineer. So, first you need to focus and find what makes you happy as an engineer and, even more, whether engineering is actually making you happy at all. Because you might be a very good engineer and not be happy being an engineer. You need to know your assets and your values and what to expect from yourself.

Sure, it would be a great thing to become a successful engineer who has a profitable idea and makes a product out of it. But it’s not really necessary. Is it? It might happen, it might not. Maybe you even say it out loud to yourself sometimes, but in the back of your head you don’t really want it or believe it. Because in the end everything comes with a price and you already know it. So, if your idea becomes successful, then you need to devote yourself to it. You need to stop being an engineer and be a salesman, a CEO and whatever comes with it. But certainly not an engineer anymore. You will spend more time managing things and doing stuff unrelated to the engineering domain and you will fade as an engineer. Whether that’s good or bad depends on you. If you like managing things and prefer it to being an engineer, then it’s great! But you also need to be capable of managing things, not just like it. Therefore, you need to know what you want to be, you need to know if you have the proper skills for that and, if you don’t, try to develop them.

If you know what makes you happy, then do it, but first consider all the consequences and be certain that you can judge yourself and your skills right.

For me, being an engineer is not really a job. It’s just a hobby and it’s fun to explore new things and have different challenges. In my job I may not have the freedom to do exactly what I want every time, but I’m also doing a lot of stuff in my free time, without a profit. And I’m happy. It’s more like a lifestyle. What you do in your life should be fun. And I feel lucky that it’s still fun for me. So, I don’t really have regrets about missed opportunities, because all the missed (or not) opportunities brought me to this point today. In the end, the only thing that matters is to know what makes you happy. You don’t have to be the best in something or find the best idea and make a huge profit. All you have to be is happy with what you do.

If you’re lucky enough to be happy with what you do, then you are a successful engineer and no matter what wagons you’ve lost or are losing down the path, you’re always on your happiness wagon, doing the things that you like. And that’s the best wagon you can ever be on in your professional career.

Have fun!

STM32 cmake template (with more cool stuff)


While I’m still waiting for some parts for my next stupid project, I was a bit bored and decided to clean up the STM32 cmake template that I usually use for my bluepill projects. I mean, I was pretty happy with it until now and it was working fine, but there’s no better wasted time than doing the same thing again to get the same result and have the illusion that this time it’s better. So, this deserves a post in the stupid-projects.

Anyway, while I was about to waste my time on that, I thought it would be a nice thing to make it support a few different libraries. Cmake is something that you either love or hate. I do both. The good thing is that you can achieve the same result by following a number of different ways. This is a nice thing, but it can also be trouble. The reason is that, if there was only a single valid way to create a cmake build, then it would be difficult to make errors. You would make a lot of mistakes until you made it work, but when it worked, that would be the only right way. On the other hand, if there are many different ways to achieve the same result, then there are good and bad ways. And by some unknown universal law, the chance of choosing the worst way is much higher than that of selecting any other way, good or bad.

Therefore, cmake gets both love and hate from my side. In the end, it’s all about experience. If you do something very often, then after some time you learn to choose the better ways. But if you create a cmake project 1-2 times per year, then the next time you deal with your own CMakeLists.txt files you have to re-learn everything you’ve done, and you realise how many things you’ve done wrong or could have done better. Also, the cmake documentation reminds me of a law textbook. There is tons of information in there, but written in a way that stupid people like me can’t understand, so I need to read a cmake cookbook or see examples on the internet. Then everything gets clear.


I use the standard peripheral library from ST a lot. In general, I hate the monstrous HAL API and the use of C++ when it’s not really needed, but I like CubeMX, because it’s nice for calculating clocks and playing around with the pinout. Also, when I’m using the USB on the stm32f103c8t6 (blue-pill), I’m always using ST’s USB FS Device Driver, which is compatible with the standard peripheral library. That combination is beautiful. I’ve found a couple of bugs, which I’ve fixed, and everything is great. I can say that I don’t need anything else than that.

I know that there are plenty of people that like the HAL API and using C++ with the STM32 and that’s fine. If you like using that, then keep doing it. From my perspective, the HAL API doesn’t provide anything more than the stdperiph library and there are so many layers of software between the actual API and the CMSIS level that it doesn’t make sense. For me it’s too complicated and when it breaks it’s not just a matter of opening the mcu datasheet and finding the error; you also need to debug all that software layer in between. Sorry, no time for this. Regarding C++, I’ve written a post about it before here. Generally, there’s no right or wrong. But personally I prefer to write C++ when I’m developing a Qt app or when I really need some things from the language that can make my code cleaner, more portable and faster. If it can’t do that, then I see no reason to use it. Also, the libraries are C libraries with C++ wrappers in the headers. That means something. Therefore, I need to be convinced that C++ will actually be better than C for the specific project, otherwise I’ll go with C.

There is also another project that supports the stm32 and plenty of other mcus and it deserves more love. This is the libopencm3 project. That is actually a library that replaces the standard peripheral library from ST. This is a very nice library. It’s a low level library and based on CMSIS. It gets updated much more often than the stdperiph and the project is active. For example, I see now that the last update was a few hours ago (that doesn’t mean that it was for stm32f1) and at the same time the last version of the stdperiph was in 2012, so 7 years ago. Another neat thing with libopencm3 is that everyone can join the project and send commits to add functionality or fix bugs. I’m thinking of committing the stm32f1 overclocking patch I have to clock the stm at 128MHz, but I guess this won’t be accepted as it’s out of specs, but anyway I’ll try, he he. So, yeah, libopencm3 deserves more love and I think that sometimes it may also produce smaller code.

So I’ve decided to add support to the cmake template also for the libopencm3.

Finally, let’s go to FreeRTOS. I guess everyone knows what that is and I guess there are a lot of people that love it. Well, I respect rtos. I mean most of my work is done on embedded Linux, so I know about rtos. But still, until today, I never, never had to really use an rtos on a small embedded mcu. Until today there was nothing that I couldn’t do using state machines. I’ve also written a very nice and neat lib for state machines and I think I should open-source it at some point. Anyway, I never had to use an rtos on an stm32 or other small mcu, but I guess there are needs that other people have. From my perspective it seems that it simplifies things and produces less code and complexity, but on the other hand you lose more important things like full control of the runtime, and there’s also a hit in performance. But anyway, it’s fun to have it as an option for prototyping and writing small apps when you don’t want to mess with timers and interrupts.

Hence, in this cmake template you get all the above in the same project and you are able to select which libraries to enable by selecting the proper options in the cmake. But let’s have a look. This is the repo here:

After you clone the repo, there is a very interesting file that you should read. It’s supposed to be written in a way that is easier to understand, compared to the cmake documentation. Also, another important file is the build.sh script that handles all the details and runs cmake with the proper options.

So let’s see what those options are. The only thing you need to do to build the examples is to run the build.sh script with the proper parameters. Inside the build script you’ll find all the supported parameters, but not all of them need to be set every time.

  • TOOLCHAIN_DIR: This should point to your toolchain path.
  • CMAKE_TOOLCHAIN: This points to your cmake toolchain file. This file actually sets up the toolchain to be used. When working with the blue-pill, you won’t need to change that.
  • CLEANBUILD: This parameter is either true or false. When it’s true, the build script deletes the build folder, which means that you get a clean build. Very useful, especially if you’re making changes to your cmake files, in order to remove the cmake cache. By default it’s false.
  • ECLIPSE_IDE: This is either true or false. If it’s true, then cmake will also create Eclipse project files, so you can import the project in Eclipse and use it as an IDE to develop. That’s a handy option because of intellisense. By default it’s false, because I usually prefer VS Code.
  • USE_STDPERIPH_DRIVER: This option can be ON or OFF and enables or disables ST’s standard peripheral driver and the CMSIS lib. By default it’s set to OFF, so you need to explicitly set it to ON during build.
  • USE_STM32_USB_FS_LIB: This option can be ON or OFF and enables or disables ST’s USB FS Device Driver. By default it’s set to OFF, so you need to explicitly set it to ON during build.
  • USE_LIBOPENCM3: This option can be ON or OFF and enables or disables the libopencm3 library. By default it’s set to OFF, so you need to explicitly set it to ON during build. You can’t have this set to ON at the same time as USE_STDPERIPH_DRIVER.
  • USE_FREERTOS: This option can be ON or OFF and enables or disables the FreeRTOS library. By default it’s set to OFF, so you need to explicitly set it to ON during build.
  • SRC: With this option you can specify the source folder. You may have different source folders with different projects in the source/ folder. For example, in this template there are two folders, source/src_stdperiph and source/src_freertos, so you can select which one you want to build, as they are completely different projects and need different libraries.
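To illustrate how the ON/OFF switches above relate to each other, here is a small Python sketch that validates a set of options and turns them into `-D<name>=<value>` cache definitions, which is the standard way to pass options to cmake. This is not the build.sh script from the repo and it’s only an assumption that the script forwards the options like this; the `cmake_args` helper is hypothetical.

```python
def cmake_args(options: dict) -> list:
    """Translate ON/OFF library switches into -D arguments for cmake."""
    known = ("USE_STDPERIPH_DRIVER", "USE_STM32_USB_FS_LIB",
             "USE_LIBOPENCM3", "USE_FREERTOS")
    # Every library switch defaults to OFF, as described in the list above.
    merged = {name: "OFF" for name in known}
    for name, value in options.items():
        if name not in known:
            raise KeyError(f"unknown option: {name}")
        if value not in ("ON", "OFF"):
            raise ValueError(f"{name} must be ON or OFF, got {value!r}")
        merged[name] = value
    # The two peripheral libraries can't be enabled at the same time.
    if merged["USE_STDPERIPH_DRIVER"] == "ON" and merged["USE_LIBOPENCM3"] == "ON":
        raise ValueError(
            "USE_STDPERIPH_DRIVER and USE_LIBOPENCM3 are mutually exclusive")
    return [f"-D{name}={merged[name]}" for name in known]


if __name__ == "__main__":
    # Example: the FreeRTOS project, which is built on top of libopencm3.
    print(" ".join(cmake_args({"USE_LIBOPENCM3": "ON", "USE_FREERTOS": "ON"})))
```

Catching the invalid combination before cmake runs gives a clearer error than a failed build halfway through.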

The two example projects, as you can guess from the names, are for testing the stdperiph and the freertos/libopencm3 libs. To build those two projects you can run these commands:

# stdperiph

# FreeRTOS & LibopenCM3

# Create Eclipse projects files

So, yeah, pretty much that’s it. Easy and handy.


This was a re-write of my cmake template and as a cherry on top I’ve decided to add support for FreeRTOS and LibopenCM3. I’ll probably use libopencm3 more often in the future, or at least evaluate it enough to see how it performs, and regarding FreeRTOS, I think it’s a nice addition for prototyping and using tasks instead of writing code.

Finally, one note here. Be careful when you use the -flto flag in the GCC optimisations, because this completely breaks FreeRTOS. For example, you can build the freertos example and flash it on the stm and you get an LED toggling every 500ms, but if you add the -flto flag to the COMPILER_OPTIMISATION parameter in the main CMakeLists.txt file, then you’ll find out that vTaskDelay breaks and the pin toggles very fast.

Have fun!

EMC probe using RTL-SDR


This stupid project is a side project that I’ve done while I was preparing for another stupid project. And as it was a fun process, I’ve decided to create a post for it. Although it’s quite amazing, it’s actually quite useless for most of the home projects and completely useless for professional use. Anyway, in the next project I plan to use a 10MHz OCXO as a time reference. The idea was to use a rubidium reference, but it’s quite expensive and it didn’t fit in the cost expectancy I have for a stupid project. Of course, both OCXO and the Rb are enormous overkill, but this is the point of making stupid projects in the first place.

Anyway, a couple of weeks ago I saw Dave’s (EEVblog) video on youtube here. In this video Dave explains how to build a cheap EMC probe using a rigid coax cable, an LNA and a spectrum analyser. So, he built this $10 probe, but the spectrum analyser he used already costs lots of bucks. Of course, he mentions in the end that at some point he would make a video using an SDR instead of a spectrum analyser. Well, in my case I needed to measure how my OCXO behaves. Does it leak the 10MHz or harmonics in different places? And if yes, how can I reduce this RF leakage? So, I’ve decided to build an EMC probe myself. And this is what this post is about.


There are mainly two components that I’ve used for this project.


The SDR dongle I’ve used is the RTL-SDR. There are a few reasons for this decision, but the most important one is that it goes down to 500 kHz without any modification or any additional external hardware. Although the tuner used on the RTL-SDR (which is the Rafael Micro R820T/2) is rated at 24 – 1766 MHz, the V3 (3rd revision) supports a mode called direct sampling mode (dsm), which is software enabled and makes reception below 24MHz possible. This also makes it possible to use it as a spectrum analyser to detect the leakage of the 10MHz OCXO. So, definitely buy this specific dongle from here if you’re interested in this range. Just keep in mind that dsm is not intended to provide high accuracy, but it’s perfectly fine for this case.
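As a quick sanity check of which mode is needed for the OCXO’s fundamental and its first harmonics, here is a small Python sketch. The frequency limits are the ones mentioned in the text; treating them as hard boundaries is a simplification of mine, and the function names are hypothetical.

```python
TUNER_MIN_MHZ = 24.0    # lower limit of the R820T/2 tuner, from the text
TUNER_MAX_MHZ = 1766.0  # upper limit of the tuner, from the text
DSM_MIN_MHZ = 0.5       # 500 kHz, the lowest frequency mentioned for the V3


def reception_mode(freq_mhz: float) -> str:
    """Return which mode of the dongle covers the given frequency."""
    if DSM_MIN_MHZ <= freq_mhz < TUNER_MIN_MHZ:
        return "direct sampling"
    if TUNER_MIN_MHZ <= freq_mhz <= TUNER_MAX_MHZ:
        return "tuner"
    return "out of range"


def harmonics(fundamental_mhz: float, count: int) -> list:
    """Return the first `count` harmonics, starting with the fundamental."""
    return [fundamental_mhz * n for n in range(1, count + 1)]


if __name__ == "__main__":
    # The OCXO in this project outputs 10 MHz.
    for freq in harmonics(10.0, 5):
        print(f"{freq:6.1f} MHz -> {reception_mode(freq)}")
```

So the 10MHz fundamental needs the direct sampling mode, while the higher harmonics fall inside the normal range of the tuner.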

This is the dongle:

Another reason to buy that specific dongle is to support the RTL-SDR blog and the great work that Carl Laufer does (and the great support). In my case, the first dongle I got had a small issue, so I just contacted Carl and he immediately shipped another unit. Also his book is excellent for beginners and if you buy the kindle version you get free updates forever. Great stuff.

Semi-rigid Coax cable

This is the cable that you need to make the probe. Get the 30cm one, so you can make 2 probes. I’ve made a small and a big one and the large one is twice the size of the small one. This is what the cable looks like:

You may also need an antenna cable with SMA male-female ends. This can be used as an extension to the probe (one of the RTL-SDR options includes that cable). I’ll show you later the two options you have for using the probe with the rtl-sdr.


Regarding how to make the probe, I think Dave’s video that I’ve linked above explains everything in great detail; therefore, I will just add my 2 cents from my own experience. Keep in mind that those rigid cables are really tough to bend and their shielding is tough to cut. I’ve used a thick board marker and a pencil to bend the cables into the two different sizes. Also, in order to cut the shielding and create the gap on the top of the probe (that is explained in the video), you need to make sure that you have a new and sharp blade in your cutter, otherwise it will be really hard and messy. Finally, when you solder the end of the cable to the main cable body to create the loop, first turn up your soldering iron temperature (400+ degrees), then put some solder on the tip of the exposed core and use a high temperature resistant glove (or oven glove) to bend, press and hold the end of the cable so it touches the rigid shield, and solder the tip of the cable to the shielding. Then it will keep its shape and you can have a free hand to finish soldering the rest.

These are my probes after soldering them.

Then I’ve used a black rubber coating paint for insulation on the rigid shield. So this is how the probes are looking now.

These are the diameters of each probe.

So, the large one has almost double the radius and can therefore also probe a larger area (the enclosed loop area grows with the square of the radius). It would also be nice if the probe end was a true circle (and not elliptical), but for my use case it works just fine.

Now that you have the probes and the RTL-SDR, there are two ways to use them. One way is to use a USB extension to plug the dongle and then screw the probe straight on it. The other way is to plug your RTL-SDR dongle on a USB port then use an antenna extension cable to connect the probe to the dongle. The next pictures show the two setups.

Of course, you can also use the second setup with a USB extension. Anyway, I prefer the first one more, because that lowers the interference and signal attenuation as there is no long cable between the antenna and the dongle. Also the case of the dongle acts like a handle and makes it easier to hold. It might get a bit warm depending on the range you use it in, but not that much, though.

Next I’ve tested how much current the RTL-SDR dongle draws when it’s running in the direct sampling mode (dsm) and it seems that it’s not that much actually; in my case it was ~110mA.

The above one is the replacement dongle. The first dongle was drawing 200mA in the beginning and then this current started to increase more and more, the dongle was burning hot and then at some point it died. But the replacement is working just fine. This is the only picture of the old dongle that I’ve managed to get while it was still working.

The next thing is to select the software that you’re going to use. Since I’m limited to Linux, I’ve used the SDR# git repo with the Mono project run-time. There are a couple of things that you need to do in Ubuntu 18.04LTS in order to make the rtl-sdr work properly.

From now on RTL-SDR or dongle refers to the hardware and rtl-sdr to the software.

First you need to clone this repo:

In there you’ll find a file called `rtl-sdr.rules`. Just copy that to your udev rules and reload.

sudo cp rtl-sdr.rules /etc/udev/rules.d/
sudo udevadm control --reload-rules
sudo udevadm trigger

The above just makes sure that your user account has proper permissions for the USB device in /dev and can use libusb to send commands to the dongle. Then you need to blacklist some kernel modules. In Ubuntu, the dongle has a kernel module driver, as this device is meant to be used as a generic DVB-T receiver, but in order to use the device with SDR# you need to disable those drivers and use the rtl-sdr user space drivers. To identify if your distro has those default drivers run this command:

find /lib/modules/$(uname -r) -type f -name '*.ko' | grep rtl28

In my case I get this output.


So, I had to blacklist the modules. To do that in ubuntu create this file `/etc/modprobe.d/blacklist-rtlsdr.conf` and add these lines in there.
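The exact entries depend on which modules the find command reports on your system. As a rough sketch, the Python snippet below turns a list of module file paths into `blacklist <module>` lines, which is the syntax modprobe expects in such a file. The example paths are only placeholders for illustration (they are not my actual output) and the `blacklist_lines` helper is hypothetical.

```python
import os


def blacklist_lines(module_paths):
    """Turn kernel module file paths into modprobe 'blacklist' lines."""
    lines = []
    for path in module_paths:
        filename = os.path.basename(path.strip())
        if not filename.endswith(".ko"):
            continue  # skip anything that is not a kernel module file
        # Module names are conventionally written with underscores.
        name = filename[:-len(".ko")].replace("-", "_")
        line = f"blacklist {name}"
        if line not in lines:  # avoid duplicate entries
            lines.append(line)
    return lines


if __name__ == "__main__":
    # Placeholder paths; feed the real output of the find command instead.
    example_paths = [
        "/lib/modules/example/kernel/drivers/media/usb/dvb-usb-v2/dvb-usb-rtl28xxu.ko",
        "/lib/modules/example/kernel/drivers/media/dvb-frontends/rtl2832.ko",
    ]
    print("\n".join(blacklist_lines(example_paths)))
```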


You might have to reboot your system to check that this works properly. Now you can build the rtl-sdr repo and test the module. To do that:

git clone
cd rtl-sdr/
mkdir build
cd build/
cmake ../
make

Then the executable is located in the src/ folder. Keep in mind that this step is not really needed to run SDR#, but it’s good to be aware of it in case you need to test your dongle before you proceed with SDR#. Now, to install SDR# in Ubuntu do this:

sudo apt install mono-complete libportaudio2 librtlsdr0 librtlsdr-dev
git clone
cd sdrsharp
xbuild /p:TargetFrameworkVersion="v4.6" /p:Configuration=Release
cd Release/
ln -s /usr/lib/x86_64-linux-gnu/
ln -s /usr/lib/x86_64-linux-gnu/ librtlsdr.dll
mono SDRSharp.exe

If you have your dongle connected then you can start playing with it. For more details on how to use SDR# with the dongle, you’d better have a look at Carl’s book or at the rtl-sdr blog, as there are some posts that have info about it.

In my case I needed to use the direct sampling mode, so I’ve selected the RTL-SDR / USB device, clicked configure and, in the new dialog that opens, selected the Direct sampling (Q branch) option in the Sampling Mode. This will work only in the V3 of the RTL-SDR dongle! If you’re following these steps, now close the dialog box, check the Correct IQ checkbox, drag the audio AF Gain to zero (you don’t need audio feedback) and then you’re ready. I like to lower the contrast slider on the right so the waterfall is a bit more deep blue as the noise is gone. Also I uncheck the Snap to grid.

In case you have an older version of the RTL-SDR, or another dongle, then you need to do a small modification to get it working in the direct sampling mode, but you need to google it, depending on your dongle.

Testing the probes

For the tests, I’m going to use a cheap second-hand OCXO that I’ve bought from ebay. It’s made by Isotemp and the model number is OCXO143-141. This is the one:

This seems to be a customised OCXO as there’s no reference for it on the official company site. There is a 143-XXX series but no 143-141 model. Anyway, from the PDF file here, it’s the A-package. I’m not going to go into the details, but it’s an amazing OCXO for the price and you can find it very cheap on ebay. I got mine for 16 EUR. Its frequency output is 10MHz and, as I’ve said earlier, that’s lower than the minimum supported frequency of most SDR dongles, but the RTL-SDR can go down to this frequency when it’s set to direct sampling mode.

As you can see from the above image, you just need to connect the GND and the +5V and that’s it. Normally, the OCXO needs approx. 5 mins to stabilise, because the heating element needs some time to warm up the case and the crystal inside it to the operating temperature, which is 70 °C. Hence, be careful not to touch it after some time, because it might be very warm. When the OCXO is connected to the power supply it draws ~500mA @ 5V for a few seconds, but after that the current gradually drops and stabilises at approx. 176 mA.

The temperature of the case was around 46 °C, but with a lot of variation, so I guess that’s because of the reflective case, which makes it difficult for my IR temperature gun to measure the temperature precisely.

OK, so now that everything is set, let’s test the probes.

I’ve used the USB extension to connect the dongle to my Ubuntu workstation and I’ve connected the probe to the dongle’s RF input. Then I’ve built sdrsharp from the git source (I’ve explained the process earlier) and run the GUI. I selected the RTL-SDR / USB, configured the sample rate to 2.4MSPS and the sampling mode to Direct sampling (Q branch). Then I checked Correct IQ, unchecked Snap to grid and pressed Play. Simple.

This is the capture I’ve made when I’ve put the probe close to the 10MHz output of the OCXO.

So, it works!

After that I’ve used the probe on the PSU cables that were powering the OCXO and I’ve seen that the 10MHz was leaking back to the PSU. Also the leak was not all over the cable but in some spots only. In my use case I’ll use a USB cable to power the OCXO that will also power the rest of the circuit; therefore I’ll probably need some ferrite core cable clips, which I’ve already ordered as I was expecting that, but it will take some time to get them from China. Anyways, that’s also a topic for the next project, but I need the probe to at least minimise this leak.

Limitations and comparisons

Now I’ll try to guess how this DIY EMC probe using the RTL-SDR dongle compares to a spectrum analyser with proper EMC probes and then try to test the DIY probes with my Rigol DS1054Z, which has an FFT function. The list might not be complete as I don’t know all the specific details and I’m pretty much guessing here.

| Professional probe | DIY probe |
|---|---|
| + It’s calibrated | + Very cheap to make |
| + Just buy it and plug it in | + If you don’t care for accuracy it’s fine |
| + Well defined characteristics | + Fun to make one |
| – High price | – You need to build it by yourself |
| – No fun making it | – Not calibrated |

You can also buy a cheap set of EMC probes on ebay, of course. That costs around 40 EUR, which is not too much if you want to get a bit more serious or want to avoid the hassle of creating your own probes. It’s not in my plans to buy something like that, because I don’t need better accuracy or calibrated probes. I just need to be able to see if there’s a frequency there, and for that purpose the DIY probe is more than enough.

OK, so now let’s see how the RTL-SDR dongle compares to a spectrum analyser. Again, these are more guesses as I don’t have a spectrum analyser and I’m not an expert in the field.

| Spectrum Analyzer | DIY probe |
|---|---|
| + Wider real-time bandwidth | + OK’ish real-time bandwidth |
| + Very low noise | + OK’ish in terms of noise |
| + Tons of options and functions | + You can write your own functions using the rtl-sdr API and hack it |
| + Wider range (several GHz but for more $$$) | + Very, very cheap |
| – Much more expensive | + Portable (can be used even with your phone) |
| – Less portable | – Limited real-time bandwidth |
| – Can’t hack it | – Can’t be used when accuracy is needed |

Depending on the money you spend, you can get very expensive spectrum analysers that include amazing functionality. There are also spectrum analysers for hobbyists and amateurs (like the Rigol DSA700 series), but they cost from 750 (for 500MHz) up to 1100 EUR (for the 1 GHz). That’s a lot of money for a hobbyist! You can spend just 25 EUR for the RTL-SDR dongle and buy something else with the rest…

If I had to choose which is the most significant difference, I would probably say it’s the real-time bandwidth. For example, by using the RTL-SDR dongle you’re limited to approx. 2MHz of real-time bandwidth when processing chunks of data. With a spectrum analyser the real-time bandwidth can be up to 10-25MHz (or more). I know that the RTL-SDR dongle documentation is clear regarding the real-time bandwidth: it mentions that the max bandwidth is 1.6MHz (due to Nyquist) and that the 3.2MSPS or 3.2MHz figure comes from using the I/Q samples. On the other hand, I don’t know if the spectrum analyser specs are referring to Nyquist bandwidth or not. They may do, I don’t know, but nevertheless it’s a higher bandwidth in any case. Do you really need it though? At that price?
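
The Nyquist point can be put into numbers with a small sketch. It assumes the usual rule of thumb that real-valued sampling at a rate fs covers fs/2 of spectrum, while complex I/Q sampling at the same rate covers the full fs. The function name is made up for the illustration.

```python
def usable_bandwidth_hz(sample_rate_sps, complex_iq):
    """Spectrum width that can be represented without aliasing."""
    # Real samples: Nyquist limits the bandwidth to half the sample rate.
    # Complex I/Q samples: two values per sample instant, so the full
    # sample rate is available as bandwidth.
    return sample_rate_sps if complex_iq else sample_rate_sps / 2


for rate in (2_400_000, 3_200_000):
    real_bw = usable_bandwidth_hz(rate, complex_iq=False)
    iq_bw = usable_bandwidth_hz(rate, complex_iq=True)
    print(f"{rate / 1e6:.1f} MSPS: real {real_bw / 1e6:.1f} MHz, "
          f"I/Q {iq_bw / 1e6:.1f} MHz")
```

This is where the 1.6MHz and 3.2MHz figures for 3.2MSPS come from.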

The key point here is that for an amateur (or even professionals sometimes), it doesn’t really matter if what you see on the screen is 100% accurate. Most of the time you just want to see if there’s something there and if there’s a frequency in that bandwidth that pops up. So with the RTL-SDR dongle you may not get an accurate value of the magnitude of the frequencies that you see, and you also get more noise, but at least you can see that there is a frequency popping up in there.

Comparing with the FFT function of the DS1054Z

In order to compare the RTL-SDR with sdrsharp against the FFT function of the DS1054Z I can’t use the OCXO module as it’s too weak. So in this case, I’ve used the Siglent SDG1025 function generator, which is famous for not being a great performer, but is still just fine for any home project that I’ve done or may do in the future. I’ve set the output to 10MHz and 1Vpp, then used my DS1054Z to capture the output with the FFT math function, and at the same time I’ve used the DIY EMC probe to check what it captures from the SDG1025 cable.

Note: with the probe not connected I have an interference at ~9.6MHz, which I don’t know where it comes from, but I have a guess. It’s a lower frequency compared to the internal 28.8MHz TCXO used to clock both the RTL2832U and R820T2. My guess is that it’s a mirror image of the internal 28.8MHz TCXO, because 28.8/9.6 = 3, which is an integer. Btw, the 48MHz clock of the USB may also show an image at 9.6MHz, because 48/9.6 = 5. This is there even without the probe connected, so this makes me think it’s because of the TCXO. Anyway, it’s always a good thing to check first without the probe connected, to see if there are any frequencies already in the spectrum you’re interested in. This is what I’ve got without the probe connected:

You see the 9.6MHz on the left side. Here you see that there is also one at 9.490MHz, without the probe connected. I can’t figure out where that comes from. Because it’s near the 9.6 it’s a bit weird, but the TCXO is 1ppm, which means that even if it was coming from there, 1ppm @ 28.8MHz means 28.8Hz, so the worst case would be an image within about 10Hz of 9.6MHz (roughly 9.59999MHz). Maybe there’s some internal PLL in the RTL2832U to get the 48MHz (I couldn’t find the datasheets to figure this out) and it has a large offset? Nah, anyway… It’s just there.
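
The arithmetic behind that guess can be checked with a few lines. This is only a sketch of the reasoning, not a model of the dongle: it tests whether a spur sits at an integer fraction of a clock, and how far a ppm error could move it. The helper names and the tolerance are assumptions for the illustration.

```python
def subharmonic_order(clock_hz, spur_hz, tolerance=1e-6):
    """Return n if spur_hz is (almost exactly) clock_hz / n, else None."""
    ratio = clock_hz / spur_hz
    nearest = round(ratio)
    if nearest >= 1 and abs(ratio - nearest) <= tolerance * nearest:
        return nearest
    return None


def ppm_offset_hz(freq_hz, ppm):
    """Worst-case frequency error for an oscillator with the given ppm rating."""
    return freq_hz * ppm / 1e6


print(subharmonic_order(28.8e6, 9.6e6))    # integer ratio to the 28.8MHz clock?
print(subharmonic_order(48e6, 9.6e6))      # integer ratio to the 48MHz USB clock?
print(subharmonic_order(28.8e6, 9.49e6))   # the 9.490MHz spur

clock_error = ppm_offset_hz(28.8e6, 1)     # error at the oscillator itself
image_error = clock_error / 3              # error scaled down to the 9.6MHz image
print(f"clock error {clock_error:.1f} Hz, image error {image_error:.1f} Hz")
```

A 1ppm error moves the 9.6MHz image by only a few Hz, so it can’t explain a spur that is 110kHz away.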

Then the next picture is with probing the 10MHz output of the SDG1025 with the RTL-SDR dongle.

Now you can see that there are quite a few frequencies around the 10MHz. The SDG1025 is set to output a sine! It seems that the sine output of the function generator contains a lot of harmonics that also emit and the probe can capture them. In case of the OCXO I didn’t see the other spikes although the output is a square wave. Probably because the output of the OCXO was much weaker.

The next picture is the FFT output of the DS1054Z.


Here you can see that also the FFT of the Rigol shows that there are other frequencies around the 10MHz sine and also the freq counter shows that the output is not very stable. Probably that’s true as I didn’t wait for a long time for the function generator to settle. Btw, as I have the 10MHz OCXO, maybe I’ll use it as an external reference to the SDG1025. This would make it much, much better in terms of output stability (marked as a future project).

Another thing that deserves mention here is the bandwidth of the FFT on the Rigol. In this case I’ve set the FFT mode from trace to *memory in the FFT menu. That makes the screen update much slower (who cares), but you get a wider bandwidth. Also notice that the visible bandwidth here is 5MHz/Div, so you actually see 50MHz bandwidth on the screen.

Also, it’s worth mentioning that the RTL-SDR has much better resolution compared to the Rigol. That’s expected because the bandwidth is smaller, and that’s why you see several spikes around 10MHz in case of the dongle and only a single lobe on the FFT of the Rigol.
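
A rough way to see why a smaller bandwidth gives finer resolution: for a fixed FFT size, each bin covers sample rate / FFT size. In the sketch below the 2.4MSPS rate is the one used earlier, while the FFT size and the 1GSa/s scope rate are assumptions picked only to show the scale of the difference. They are not the real settings of SDR# or the Rigol.

```python
def fft_bin_width_hz(sample_rate_sps, fft_size):
    """Width of one FFT bin, i.e. the finest frequency step that can be told apart."""
    return sample_rate_sps / fft_size


FFT_SIZE = 4096  # assumed; the real FFT sizes of SDR# and the scope may differ

dongle_bin = fft_bin_width_hz(2_400_000, FFT_SIZE)      # rate used with the dongle
scope_bin = fft_bin_width_hz(1_000_000_000, FFT_SIZE)   # hypothetical 1 GSa/s capture

print(f"dongle: {dongle_bin:.1f} Hz per bin")
print(f"scope:  {scope_bin / 1e3:.1f} kHz per bin")
```

With bins that are hundreds of times wider, nearby spikes merge into one lobe.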

Now have a look at this weird thing:

Here you see that the bandwidth is 100MHz/Div, which I dunno, it doesn’t make sense, since this is a 50MHz oscilloscope (with the 100MHz hack, of course) and the displayed spectrum now is 600MHz. So, yeah…. I don’t know what’s going on in here, maybe I’m missing something, but I just guess that it’s because the FFT is just a mathematical function, so although there’s no real input data on the oscilloscope the display just shows the maths results.


First, I really loved the RTL-SDR dongle. Although I was aware of it for some years now, I never had any excuse (or time) to play with it. I’m lucky though, because now it’s much more mature in both hardware and software terms. The RTL-SDR V3 is awesome, because of the software enabled direct sampling mode that allows it to go below the 24MHz and also the rtl-sdr API is more mature/stable and also supports the direct sampling mode setting, too.

Now that I’ve spent some time with the rtl-sdr and I’ve seen the API, I’m really intrigued to use one of the many SBCs that I have everywhere around my workbench to build a standalone spectrum analyser. I’ve seen that there’s already a project for the beaglebone black which is called ViewRF. This is actually a Qt4 application that uses the rtl-sdr API and then draws on a canvas without the need of a window system, as it uses the QWS (Qt Window System). There are a few problems with that, though. First, it’s quite an old Qt version (which is fine), but I prefer to use the meta-qt5 layer to build my custom Yocto distros. For example, I already have the meta-allwinner-hx and meta-nanopi-neo4 layers that I could use to build a custom image for one of the supported boards. Those layers support the Qt5 API, but the QWS was removed in that version and that means that I would need to write a new app to do something like the ViewRF. Anyway, I need to investigate a bit more on this, because there are probably others that have already implemented this stuff a long time ago.

I’m also thinking of writing an app for one of the WiFi enabled SBCs that will run an http server and serve a GUI using websockets to send the data. I don’t know yet the amount of data that is coming from the dongle, so that might not work. Anyway, so much nice stuff to do and so many ideas, but not much time :/
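
A back-of-the-envelope estimate of that amount of data is easy to sketch. It assumes the raw rtl-sdr stream is interleaved 8-bit I and Q samples, so 2 bytes per complex sample. The FFT size, frame rate and one byte per bin in the second function are hypothetical choices for a waterfall-style GUI.

```python
BYTES_PER_IQ_SAMPLE = 2  # assumed: one unsigned byte for I plus one for Q


def raw_iq_bytes_per_second(sample_rate_sps):
    """Data rate of the raw I/Q stream coming out of the dongle."""
    return sample_rate_sps * BYTES_PER_IQ_SAMPLE


def spectrum_bytes_per_second(fft_size, frames_per_second, bytes_per_bin=1):
    """Data rate if only computed spectrum frames are sent to the browser."""
    return fft_size * frames_per_second * bytes_per_bin


raw = raw_iq_bytes_per_second(2_400_000)
frames = spectrum_bytes_per_second(fft_size=1024, frames_per_second=20)

print(f"raw I/Q:        {raw / 1e6:.1f} MB/s ({raw * 8 / 1e6:.1f} Mbit/s)")
print(f"spectrum only:  {frames / 1e3:.1f} kB/s")
```

Under these assumptions, pushing raw samples over WiFi looks heavy, while doing the FFT on the SBC and sending only the frames is tiny in comparison.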

For now, I just want to finish reading Carl’s book as there’s a ton of interesting information in there and it’s written in a way that really gets you into it. I may not have the time to do a lot of stuff with the RTL-SDR dongle in the future, but nevertheless I want to learn more about it and have the knowledge in the back of my head for future projects. It’s good to know what you can do with it and how it can be used. Still, the EMC probe with a standalone spectrum analyser is a nice project and I will do it at some point in the future. I’ve also found this interesting Yocto layer here. So much cool stuff, so little time.

Have fun!

NanoPi-Neo4 Yocto meta layer


Embedded Linux & GPU 3D acceleration… Say no more. When you get those words together then hell breaks loose. Well, it depends, though. If you’re using Yocto or Buildroot and choose a BSP that has all the pieces already connected, then you’re good to go. Just build, flash the image and you’re done. It’s even easier if you use a distro like Armbian, Raspbian or others: just download the image, flash it and you’re done. But what happens when you need to build up from scratch? In this case you need to connect the pieces together yourself and then you realize that it’s a bit more complicated. The Linux graphics stack is a huge pool of buzzwords, abbreviations, tons of code and APIs. It’s really huge. Let’s play with numbers a little. I got the git repo of linux-stable and I’ve used cloc to extract some numbers.

git clone git://
cd linux-stable
git checkout v5.0.4
cloc .

Cloc will do its magic and return some metrics like this:

$ cloc .
   68143 text files.
   67266 unique files.                                          
   19991 files ignored. v 1.74  T=245.01 s (206.9 files/s, 99334.7 lines/s)
Language                             files          blank        comment           code
C                                    26664        2638425        2520654       13228800
C/C++ Header                         19213         499711         920704        3754569
Assembly                              1328          47549         106234         275703
JSON                                   213              0              0         137286
make                                  2442           8986           9604          39369
Bourne Shell                           454           8586           7078          35343
Perl                                    54           5413           4000          27388
Python                                 116           3691           4060          19920
HTML                                     5            656              0           5440
yacc                                     9            697            376           4616
DOS Batch                               69            115              0           3540
PO File                                  5            791            918           3061
lex                                      8            330            321           2004
C++                                      8            290             82           1853
YAML                                    22            316            242           1849
Bourne Again Shell                      51            352            318           1722
awk                                     11            170            155           1386
TeX                                      1            108              3            915
Glade                                    1             58              0            603
NAnt script                              2            143              0            540
Markdown                                 2            133              0            423
Cucumber                                 1             28             49            166
Windows Module Definition                2             14              0            102
m4                                       1             15              1             95
CSS                                      1             27             28             72
XSLT                                     5             13             26             61
vim script                               1              3             12             27
Ruby                                     1              4              0             25
D                                        2              0              0             11
INI                                      1              1              0              6
sed                                      1              2              5              5
SUM:                                 50694        3216627        3574870       17546900

From the above you see that the Linux kernel is composed of 17.5 million lines of code, of which 13.2 million are plain C code. Now let’s try the same in the drivers/gpu folder:

$ cloc drivers/gpu
    4219 text files.
    4204 unique files.                                          
     328 files ignored. v 1.74  T=16.51 s (236.9 files/s, 170553.5 lines/s)
Language                     files          blank        comment           code
C/C++ Header                  1616          50271         146151        1320487
C                             2195         190710         166995         936436
Assembly                         2            454            354           1566
make                            98            382            937           1399
SUM:                          3911         241817         314437        2259888

So, the gpu drivers are approx 2.26 million code lines. That’s about 12.9% of the whole Linux kernel and that’s a lot. The whole arch/ folder is 1.6 million code lines (including asm lines), which is much less and contains all the supported kernel architectures.
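
The share can be recomputed directly from the two cloc SUM lines above. A minimal sketch, with the totals copied from those outputs and a helper name made up for the illustration:

```python
KERNEL_CODE_LINES = 17_546_900   # SUM "code" column for the whole tree
KERNEL_C_LINES = 13_228_800      # "C" row, "code" column
GPU_CODE_LINES = 2_259_888       # SUM "code" column for drivers/gpu


def share_percent(part, whole):
    """Percentage of `whole` that `part` represents."""
    return 100 * part / whole


print(f"drivers/gpu: {share_percent(GPU_CODE_LINES, KERNEL_CODE_LINES):.1f}% of the kernel")
print(f"plain C:     {share_percent(KERNEL_C_LINES, KERNEL_CODE_LINES):.1f}% of the kernel")
```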

So now that you realize the size of the graphics stack in the Linux kernel, you can safely guess that the deeper you get into the graphics stack, the more complex and difficult things get. Well, ok… I mean pretty much all the Linux kernel subsystems are another world, but if you’re an electronic engineer (like I am) then most of the subsystems do make sense, but geez… graphics is another beast.

Anyway, this time the stupid project was to make a Yocto layer for the incredible NanoPi-Neo4 board. Why incredible? Well, because it’s an RK3399 board in a very small form factor that only costs 50$. Can’t get better than this.

So, this idea was spinning in my head for some time now, but I couldn’t justify that it’s stupid enough to be promoted to a stupid project. But then I’ve decided to lower my standards and just accept the challenge.



Meet NanoPi-Neo4

You can read the specs of the board here, but I’ll also list the highlights which are:

  • Dual Cortex-A72 + Quad Core Cortex-A53
  • Quad Core Mali T864 (GL|ES 3.1/3.0/2.0)
  • 1GB DDR3
  • GbE
  • WiFi + BT 4.0
  • CSI
  • HDMI
  • 1x USB3.0
  • 1x USB2.0
  • 1x USB2.0 (pin header)

Yep, only 1GB RAM, but come on, for testing code on this CPU and doing evaluation without paying a lot it’s more than enough.

LCD Display

Well, it’s all about 3D graphics, so you need an LCD display. I have a small cheap 7″ LCD (1024×600) that I’ve bought from ebay for something like 30 EUR. Although it seems a bit expensive for its specs, on the other hand it has a controller that has HDMI, VGA and RCA output, is powered from a USB port and it has this nice stand.


There isn’t much to say about the project, except that it took some time to connect the pieces. Also, although I’ve finished the Yocto meta layer and everything worked fine, I’ve realized that I had left a lot of blurry points in the back of my head. Sometimes, most of us (engineers) just connect pieces together and usually we “monkey see, monkey do”, because of reasons (especially time constraint reasons). This might be fine when something works, even on the professional level, but when that happens to me it means sleepless nights, knowing that I’ve done something without having the full knowledge of why it eventually worked. So I’ve struggled a bit to find information on how things are really connected and working at the lower level. In my quest for the truth, I’ve met Myy in the Armbian forum and I want to thank him here, because he was really helpful in clarifying a lot of things. I consider Miouyouyou a great person not only because of his contribution to open source but also for his willingness to share his knowledge. Thanks mate!

Pretty much, what I’ve done was to use parts of the Armbian distro that supports the NanoPi-Neo4 (like u-boot and kernel patches), then use some parts from the meta-rockchip meta layer and then make the glue that connects everything together and also add some extra features. Why do this? Well, because the Armbian distro has a much more updated kernel and patches and a boot script setup that I really like, and I prefer to use a similar one in my layers. On the other hand, the meta-rockchip layer had the recipes for the various components for the RK chips, which are not updated though, so I had to update those, too.

Although I’ve created images for console, X11, Wayland, XWayland and Qt, I’ve only tried and tested the Wayland image, so be aware of that in case you’re here reading how to use the layer for X11 or Qt. If Wayland works then I guess QtWayland would work just fine, too. But sorry, there’s not much free time to test everything.

A few things about the graphics stack

The main components for the graphics support are the Mali blobs and the Mali DRM driver. DRM is the Direct Rendering Manager and it’s a Linux subsystem. It’s actually a Linux kernel API that runs in kernel space and exposes a set of functions that user-space applications can use to send commands and data to the GPU. It is meant to be a generic API that applications can use without caring what the underlying hardware is, so the application doesn’t care if you have an nvidia, amd, intel or mali GPU.

So now that the kernel provides us with the DRM, how do we access it? Here is where the hardware specific libdrm comes into the game. The next image shows where the libdrm stands in the chain (image taken from here).

In the above image you see that the DRM is the kernel subsystem and it provides an API (via ioctl() commands) that any user app can use. Of course, when you write a graphics application, you don’t want to call ioctl functions. Also, there’s another problem. Although there are some common ioctl functions which are generic and hardware independent, at the same time each vendor supports ioctl functions that are specific to its hardware. And this is where the trouble starts, because now every vendor needs to supply these specific functions, so you need an additional hardware specific driver. Therefore, the libdrm is composed of two different things: the libdrm core, which is hardware independent, and the libdrm driver, which is specific to the underlying hardware.

In case of RK3399, Rockchip provides this Mali specific libdrm driver that contains those hardware specific functions and you need to build and use this driver. Keep in mind that if the vendor doesn’t do regular updates to the libdrm then you might end up in a situation where your window manager (e.g. Wayland) doesn’t support an older API version anymore and that’s really a bad thing, because then you’re stuck with an old version and maybe also with an old kernel, depending on the dependencies of the various software layers.

So the conclusion is that libdrm is just a wrapper that simplifies the ioctl calls to the DRM kernel subsystem and provides a C API that is easier to use and also contains all the hardware specific functions.

Now that you have a way to send commands and data to the GPU you can do 3D graphics! Well… sure, but it’s not that simple. With the DRM driver you can get a buffer and start drawing stuff, but that doesn’t really mean anything. Why? Because, acceleration. Let’s try to put it simply. I don’t know if my example succeeds in doing this, but I’ll try anyways.

Have you ever used any paint software on your computer? If you haven’t, stop here, go extend your skillset by learning MS-Paint and then come back again. Let’s think of the simplest drawing program, where you only have a brush of a fixed 1-pixel size, you can select the color and if you click the button it only paints one pixel, so no drag and draw. Now, you need to create a black filled rectangle. How do you do it if you only have this brush tool? Well, you start painting pixels to create the four sides of the rectangle and then start clicking inside the rectangle on every pixel until you make it all black. How much time did that take? A lot. We’re lucky though, because in the next version there’s a rectangle tool and you can also choose to draw a rectangle that is already color filled. How long did that take? 1 click and just a few milliseconds to drag the mouse to select the size. This is the acceleration. It doesn’t matter if it’s 2D or 3D, there are tools that make things faster. Therefore, for 3D graphics there are APIs that accelerate the 3D graphics creation in so many ways that are beyond my understanding and involve tons of math and algorithms.
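
The paint analogy can be turned into a toy model. This is only a sketch of the idea and not how a GPU works: both functions produce the same canvas, but one counts a “click” per pixel while the other counts a single rectangle command. All the names here are made up for the illustration.

```python
def make_canvas(width, height, color=0):
    """A canvas is just a list of rows, each row a list of pixel colors."""
    return [[color] * width for _ in range(height)]


def fill_pixel_by_pixel(canvas, x0, y0, width, height, color):
    """The 1-pixel brush: one command (click) per pixel."""
    commands = 0
    for y in range(y0, y0 + height):
        for x in range(x0, x0 + width):
            canvas[y][x] = color
            commands += 1
    return commands


def fill_rect(canvas, x0, y0, width, height, color):
    """The rectangle tool: the caller issues one command, the tool does the rest."""
    for y in range(y0, y0 + height):
        canvas[y][x0:x0 + width] = [color] * width
    return 1


brush_canvas = make_canvas(640, 480)
tool_canvas = make_canvas(640, 480)

clicks = fill_pixel_by_pixel(brush_canvas, 10, 10, 100, 50, color=1)
commands = fill_rect(tool_canvas, 10, 10, 100, 50, color=1)

print(f"brush: {clicks} commands, rectangle tool: {commands} command")
print("same picture:", brush_canvas == tool_canvas)
```

The work still has to be done somewhere. The point of acceleration is that it moves from the caller into a dedicated, faster tool.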

So, now it should be clear that although you have the libdrm and a way to get access to a buffer and draw stuff on your display, that’s not enough. You need a way to accelerate this graphics creation and this is where several tools come into the game. There are plenty of tools (= APIs, libraries). For example there’s the Mesa 3D library. Mesa is actually a set of different 3D acceleration libraries and includes libraries like OpenGL, Direct3D, Vulkan, OpenCL and others. Each of these libraries may have other subsets, e.g. the OpenGL|ES in the OpenGL. It’s pretty much chaos in there. Tons of code and APIs, it’s a nightmare for the embedded engineers.

What all those libraries do is accelerate the 3D graphics creation. But how do they do that? This is done with software algorithms, of course, but that’s not enough. If that was all that’s happening then all the GPUs would have the same rendering performance. And not only that, but they would have the same rendering performance as any software renderer. Actually, OpenGL has its own software renderer (which is the default one and is used as a reference) and does all the rendering in pure software and CPU cycles. And here is the point where the competition between GPU vendors starts. Every vendor implements this acceleration in their silicon, so the GPU can implement, let’s say, the whole color rectangle creation in hardware with a simple ioctl call. So there is a specific hardware unit in the GPU that you can send a command to and it does that very fast. This is a very simplified example, of course. But that also means that each vendor implements the acceleration API in a different and hardware specific way. Therefore, every vendor provides its own implementation of these acceleration libraries and the vendors also like to provide only the pre-compiled blobs that do this, in order to protect their intellectual property.

The Mali GPU that the RK3399 uses is no different from the other vendors, so in order to really use 3D acceleration and support these 3D acceleration libraries you need to get these pre-compiled blob files from Rockchip and use them in place of the software renderers (like in case of OpenGL). Here is where libmali comes in. Libmali is those blob libraries from Mali (and Rockchip) that contain the hardware specific code. What it actually does is expose the various API functions of the acceleration libraries to the user and internally convert those API calls to the GPU specific ioctl calls. So there’s some code in there, which is pre-compiled in order to hide the implementation, and you just get the binary. In this case, the libmali supports the OpenGL, OpenCL and GBM all in one, at the same time, and in order to use it you need to build mesa, add the build files in your rootfs and then replace some Mesa libraries with the libmali blob. In this case, the same blob exports the API functions for multiple different *.so libraries from Mesa (e.g. libEGL, libEGLES, libOpenCL, libgbm and libwayland). Anyway, you don’t really need to get into the details of those things. You just need to remember that the vendor’s blob library file contains the APIs from different libraries in the same binary, which can replace all those libs by just creating a symbolic link from each of those libraries to the same blob.
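
The symbolic link trick itself is easy to try in a scratch directory. The sketch below uses placeholder file names (`libvendor_blob.so`, `libapi_a.so` and so on are hypothetical, not the real names in the image) and only shows that several library names can resolve to one and the same file.

```python
import os
import tempfile


def link_all_to_blob(lib_dir, blob_name, link_names):
    """Create one relative symlink per library name, all pointing to the blob."""
    for name in link_names:
        os.symlink(blob_name, os.path.join(lib_dir, name))


def resolve_targets(lib_dir, names):
    """Map each library name to the file name it finally resolves to."""
    return {
        name: os.path.basename(os.path.realpath(os.path.join(lib_dir, name)))
        for name in names
    }


with tempfile.TemporaryDirectory() as lib_dir:
    blob = "libvendor_blob.so"                                # hypothetical blob name
    api_libs = ["libapi_a.so", "libapi_b.so", "libapi_c.so"]  # hypothetical names

    # Stand-in for the real pre-compiled binary.
    with open(os.path.join(lib_dir, blob), "wb") as blob_file:
        blob_file.write(b"\x7fELF")

    link_all_to_blob(lib_dir, blob, api_libs)
    targets = resolve_targets(lib_dir, api_libs)

for name, target in targets.items():
    print(f"{name} -> {target}")
```

Any program that loads one of the link names ends up loading the blob, which is why a single binary can stand in for several libraries.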

I’ll come back to this with an example later, after I describe how to build the images.

Build the image

The meta layer for the NanoPi-Neo4 is located here:

There’s a quite thorough README file in the repo, so please read that first, because I’ll skip this step in here, in order to update only one place regarding the procedure. Pretty much you need to git clone the needed meta layers and then run the script to setup the environment and build the image you want. In this case, I’ll build and test the rk-image-testing for the rk-wayland, so my setup environment command is:

MACHINE=nanopi-neo4 DISTRO=rk-wayland source ./ buil

And to build the image I’ve run:

bitbake rk-image-testing

Hopefully, you won’t get any errors. After that step I’ve flashed the image on an SD card and booted Linux.

Now, let’s see something interesting… Remember that I’ve told you that we’ve replaced some mesa libs with the mali blobs? Let’s see that. In your /usr/lib folder you can find the libraries that mesa builds. These are for example:

  • /usr/lib/
  • /usr/lib/
  • /usr/lib/
  • /usr/lib/
  • /usr/lib/
  • /usr/lib/

All these are different libraries that target a different API. But see what happens next.

root:~# ll /usr/lib/ | grep
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
-rwxr-xr-x  1 root root 26306696 Mar 23  2019*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*

You see here? All the above libs are symbolic links to the blob that the `meta-nanopi-neo4/recipes-graphics/libgles/rockchip-mali_rk.bbappend` Yocto recipe added into the image. Furthermore, you can list the symbol table (= API calls) that this blob exports and get all the different API functions from all the different acceleration libraries. To do that, run this command:

readelf -s /usr/lib/

This will return the symbol table of the blob, which is a huge list of API calls. There you will find gl*, egl_*, gbm_* and cl* calls, all mixed up together in the same library. So the blob is a buffed up library that knows how to handle all these library calls, but the implementation is hidden in the binary blob and it’s specific to that GPU hardware and family.
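
To get a feeling for how the calls are mixed, the symbol names can be grouped by their API prefix. The sketch below works on a short hand-written sample instead of real `readelf` output, and the prefix rules are a simplification (real EGL functions, for example, are named `eglXxx` without an underscore).

```python
from collections import Counter


def classify_symbol(symbol):
    """Guess which API a symbol belongs to from its name prefix."""
    if symbol.startswith("gbm_"):
        return "GBM"
    for prefix, api in (("egl", "EGL"), ("gl", "OpenGL ES"), ("cl", "OpenCL")):
        rest = symbol[len(prefix):]
        # Require an upper-case letter after the prefix, so that e.g.
        # "close" is not mistaken for an OpenCL call.
        if symbol.startswith(prefix) and rest[:1].isupper():
            return api
    return "other"


def count_by_api(symbols):
    """Count how many symbols fall into each API family."""
    return Counter(classify_symbol(symbol) for symbol in symbols)


# A tiny sample of well-known API entry points, not the blob's actual symbol table.
sample_symbols = [
    "glClear", "glDrawArrays", "eglInitialize", "eglSwapBuffers",
    "gbm_create_device", "gbm_bo_create", "clGetPlatformIDs", "close",
]

for api, count in sorted(count_by_api(sample_symbols).items()):
    print(f"{api}: {count}")
```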

So to wrap things up, the blob is the super-lib that provides the 3D acceleration APIs (and hides the implementation) and that lib is able to send commands and data to the Mali specific libdrm driver, which provides the connection between the kernel DRM subsystem and the user-space. The result of the libdrm driver is a DRI device, which is located in /dev/dri

root:~# ll /dev/dri/card0
crw-rw---- 1 root video 226, 0 Mar 22 20:31 /dev/dri/card0

So, any user-space application that needs to use the GPU has to open the /dev/dri/card0 device and start sending commands and data. Therefore, an application that uses 3D acceleration needs to make calls to the renderer API to create the graphics in a buffer, and then send the result to /dev/dri/card0 using the proper DRM API.
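Before an application can open the device, the node has to exist and be a character device. The following Python sketch checks exactly that, using the major number 226 that the listing above shows for /dev/dri/card0. The helper names `describe_node` and `is_drm_node` are hypothetical, and the sketch only inspects the node; it does not send any DRM commands.

```python
import os
import stat

DRM_MAJOR = 226  # major number of DRM nodes, as seen in the listing above


def describe_node(path):
    """Return (is_char_device, major, minor), or None if path is missing."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    if not stat.S_ISCHR(st.st_mode):
        return (False, None, None)
    return (True, os.major(st.st_rdev), os.minor(st.st_rdev))


def is_drm_node(path):
    """True only if path is a character device with the DRM major number."""
    info = describe_node(path)
    return info is not None and info[0] and info[1] == DRM_MAJOR


if __name__ == "__main__":
    node = "/dev/dri/card0"
    print(node, describe_node(node), "DRM node:", is_drm_node(node))
```

Also note the `root video` ownership in the listing: a non-root application would need to belong to the video group to open the node.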

This is how the most important things are connected in the Linux graphics stack in the case of the RK3399. Of course, there are a lot of other implementations in the various layers of the stack, and other vendors may have a different approach and different APIs. For example, there is FBdev, Xorg, etc. If you like buzzwords, abbreviations and complexity, then the Linux graphics stack will be your best friend. If you just want to scratch the surface and be able to do simple stuff like setting up an image that supports 3D acceleration, I think the above things are enough. This provides the minimum knowledge (in my opinion) that you need to set up the graphics for a BSP. You can, of course, just copy-paste stuff from other implementations and “monkey see, monkey do”. This will probably also work, but it won’t help you much if something breaks and you need to debug.

One thing that I left out is that, from all the above, another important piece is still missing: the hardware-specific DRM driver. Although DRM is part of the kernel, it’s not really generic; only a part of the driver is generic. Hence Mali has its own driver. This driver in the kernel (for Mali and Rockchip) is `midgard_kbase`, located in drivers/gpu/arm/midgard/. In this folder, in the mali_kbase_core_linux.c file, you’ll find the `kbase_dt_ids` structure, which is this one here:

static const struct of_device_id kbase_dt_ids[] = {
    { .compatible = "arm,malit7xx" },
    { .compatible = "arm,mali-midgard" },
    { /* sentinel */ }
};
MODULE_DEVICE_TABLE(of, kbase_dt_ids);

Also, in arch/arm64/boot/dts/rockchip/rk3399.dtsi, which is included from the NanoPi-Neo4 device tree, you’ll find this entry here:

gpu: gpu@ff9a0000 {
    compatible = "arm,malit860",

This means that during the kernel boot, when the device tree is parsed, the kernel will find this device-tree entry and will load the proper driver. You can also retrieve the version like this:

root:~# cat /sys/module/midgard_kbase/version
r18p0-01rel0 (UK version 10.6)
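Going back to the matching step: the kernel binds a driver to a device-tree node by comparing the node’s `compatible` strings with the driver’s `of_device_id` table. Here is a simplified Python model of that idea. The driver table is the `kbase_dt_ids` list from above. The node list is an assumption: the device-tree snippet above shows only the first string of the property, so `"arm,mali-midgard"` is added here purely for illustration, and `match_compatible` is a made-up helper, not a kernel function.

```python
# Strings from the kbase_dt_ids table shown above.
DRIVER_TABLE = ["arm,malit7xx", "arm,mali-midgard"]

# ASSUMPTION: only "arm,malit860" is visible in the snippet above; the
# generic fallback string is a hypothetical second entry for illustration.
NODE_COMPATIBLE = ["arm,malit860", "arm,mali-midgard"]


def match_compatible(node_compatible, driver_table):
    """Return the first node string that the driver also lists, else None.

    A compatible property is ordered from most specific to most generic,
    so the first hit is the most specific match. This is a simplified
    model of the kernel's matching, not a copy of it.
    """
    for compat in node_compatible:
        if compat in driver_table:
            return compat
    return None


if __name__ == "__main__":
    print("matched on:", match_compatible(NODE_COMPATIBLE, DRIVER_TABLE))
    print("specific string only:", match_compatible(["arm,malit860"], DRIVER_TABLE))
```

In this model the specific string `"arm,malit860"` is not in the driver table, so the match can only happen through a more generic string that both sides list.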


To test the 3D graphics acceleration I used the glmark2-es2-drm tool. First I ran it like this to enjoy the benchmark on the screen.

root:~# glmark2-es2-drm

But this limits the frame rate to the vsync, so you’ll only get 60 fps. In order to test the raw performance of the GPU, you need to run the benchmark and render to an off-screen surface with this command:

glmark2-es2-drm --off-screen

This is the output that I get by running the previous command.

root:~# glmark2-es2-drm --off-screen
    glmark2 2017.07
    OpenGL Information
    GL_VENDOR:     ARM
    GL_RENDERER:   Mali-T860
    GL_VERSION:    OpenGL ES 3.2 v1.r14p0-01rel0-git(966ed26).1adba2a645140567eac3a1adfc8dc25d
[build] use-vbo=false: FPS: 118 FrameTime: 8.475 ms
[build] use-vbo=true: FPS: 134 FrameTime: 7.463 ms
[texture] texture-filter=nearest: FPS: 145 FrameTime: 6.897 ms
[texture] texture-filter=linear: FPS: 144 FrameTime: 6.944 ms
[texture] texture-filter=mipmap: FPS: 143 FrameTime: 6.993 ms
[shading] shading=gouraud: FPS: 122 FrameTime: 8.197 ms
[shading] shading=blinn-phong-inf: FPS: 114 FrameTime: 8.772 ms
[shading] shading=phong: FPS: 101 FrameTime: 9.901 ms
[shading] shading=cel: FPS: 98 FrameTime: 10.204 ms
[bump] bump-render=high-poly: FPS: 90 FrameTime: 11.111 ms
[bump] bump-render=normals: FPS: 125 FrameTime: 8.000 ms
[bump] bump-render=height: FPS: 125 FrameTime: 8.000 ms
libpng warning: iCCP: known incorrect sRGB profile
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 57 FrameTime: 17.544 ms
libpng warning: iCCP: known incorrect sRGB profile
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 22 FrameTime: 45.455 ms
[pulsar] light=false:quads=5:texture=false: FPS: 138 FrameTime: 7.246 ms
libpng warning: iCCP: known incorrect sRGB profile
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 25 FrameTime: 40.000 ms
libpng warning: iCCP: known incorrect sRGB profile
[desktop] effect=shadow:windows=4: FPS: 107 FrameTime: 9.346 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 35 FrameTime: 28.571 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 35 FrameTime: 28.571 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 38 FrameTime: 26.316 ms
[ideas] speed=duration: FPS: 68 FrameTime: 14.706 ms
[jellyfish] <default>: FPS: 75 FrameTime: 13.333 ms
[terrain] <default>: FPS: 5 FrameTime: 200.000 ms
[shadow] <default>: FPS: 50 FrameTime: 20.000 ms
[refract] <default>: FPS: 28 FrameTime: 35.714 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 132 FrameTime: 7.576 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 73 FrameTime: 13.699 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 131 FrameTime: 7.634 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 97 FrameTime: 10.309 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 65 FrameTime: 15.385 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 97 FrameTime: 10.309 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 96 FrameTime: 10.417 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 71 FrameTime: 14.085 ms
                                  glmark2 Score: 88

The important thing from the above output is that GL_VENDOR, GL_RENDERER and GL_VERSION are as expected. So the Mali-T860 GPU does the rendering and the version is OpenGL ES 3.2 (and the driver version is r14p0). This is great: we have all the latest and greatest stuff (as of the date this post is written) and we’re ready to use the 3D hardware acceleration.
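The per-scene lines are also easy to post-process if you want to compare runs. Here is a minimal Python sketch that parses the `[scene] options: FPS: … FrameTime: … ms` lines and averages the FPS. The `SAMPLE` text is a handful of lines copied from the output above, and the names `parse_results` and `mean_fps` are made up for this example.

```python
import re

# One result line: "[scene] options: FPS: <int> FrameTime: <float> ms"
RESULT_RE = re.compile(
    r"^\[(?P<scene>[\w-]+)\]\s+(?P<options>.*?):\s+FPS:\s+(?P<fps>\d+)"
    r"\s+FrameTime:\s+(?P<ms>[\d.]+)\s+ms\s*$"
)


def parse_results(output):
    """Return a list of (scene, options, fps, frame_time_ms) tuples."""
    results = []
    for line in output.splitlines():
        m = RESULT_RE.match(line.strip())
        if m:  # header and libpng warning lines simply do not match
            results.append(
                (m["scene"], m["options"], int(m["fps"]), float(m["ms"]))
            )
    return results


def mean_fps(results):
    """Average FPS over all parsed scenes, or None if there are none."""
    if not results:
        return None
    return sum(r[2] for r in results) / len(results)


# A few lines taken from the benchmark output above.
SAMPLE = """\
[build] use-vbo=false: FPS: 118 FrameTime: 8.475 ms
[build] use-vbo=true: FPS: 134 FrameTime: 7.463 ms
libpng warning: iCCP: known incorrect sRGB profile
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 57 FrameTime: 17.544 ms
[terrain] <default>: FPS: 5 FrameTime: 200.000 ms
"""

if __name__ == "__main__":
    results = parse_results(SAMPLE)
    for scene, options, fps, ms in results:
        # FrameTime is just 1000 / FPS, expressed in milliseconds.
        print(f"{scene:10s} {fps:4d} fps {ms:8.3f} ms (1000/fps = {1000 / fps:.3f})")
    print("mean FPS:", mean_fps(results))
```

If you feed it the full listing instead of the sample, the 33 result lines add up to 2904 FPS, so the mean is 88, which is the same as the reported glmark2 score. That suggests the score is simply the average FPS over all scenes.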

This is also a small video with the glmark2 rendering to the LCD screen.


Well, that was an interesting project. It started with just creating a Yocto meta layer for the NanoPi-Neo4 with 3D support. That actually worked quite fast, but then I realized that although I had some background on the Linux graphics stack, I wasn’t sure why it worked. It turned out to be a bit more complex than I thought, so I started digging deeper. I also got a lot of valuable input from Myy, who was kind enough to share a lot of insights on the subject; without his explanations it would have taken me much more time to unravel this.

I hope this Yocto layer is useful, but also keep in mind that it is not fully tested and it might not get very frequent updates, as I’m also maintaining other stuff in my (much less now) free time.

I promise the next project will be really stupid. I already know what it will be.

Have fun!