Component database (with LED indicators)


Finally, a really stupid project! You know, I was worrying that this blog was getting too serious with all those posts about machine learning and DevOps. But now it’s time for a genuinely stupid project. This time I’m going to describe why and how I’ve built my personal component database, and how I’ve extended and modified it to make it more friendly.

For the last 20 years my collection of components, various devices and peripherals has grown a lot. Dozens of various MCUs, SBCs and prototype PCBs, and hundreds of components like ICs, passive devices (resistors, capacitors, etc.) and even cables. That’s too much and very hard to organize and sort. Most EE and embedded engineers are already familiar with the problem. The usual solution is to buy component organizers with storage compartments of various sizes, sometimes even with large drawers to fit the bigger items. The next step is to use some invisible tape as labels on each drawer and write what’s inside.

While this is fine when you have a few components, when the list grows too much even that stops being useful. Especially after some time, when you’ve forgotten where things are actually located, you need to start reading the labels on every drawer. Sometimes you pass the right drawer and have to start from the beginning. If you’re like me, this can make you angry after a few seconds of reading labels. You see, when you have to start looking for a component it’s because you need it right now, in the middle of something; and if you stop the important task and start spending time searching for a component that is supposed to be organized, you start getting pissed. At least I do.

Therefore, the solution is to have a database (DB) that you can search for a component in and get its location. Having such a database is very convenient, because you can also add extra information, like an image of the component and the datasheet, and finally create a web interface that runs queries on the DB and displays all this information in a nice format.

For that reason, back in 2013 I created such a database and a web interface for my components, and until now I was running it on my bananapi BPI-M1. Since then though, the inventory has kept growing, and recently I realized that the database alone is not enough: I need something more in order to find the component I need a bit faster. The solution was already somewhere among the components I had, so I used an ESP8266 module and an addressable RGB LED strip.

Before continuing with the actual project details, this is a video of the end result.

If what you’ve just seen didn’t make any sense, then let’s continue with the post.


This is a list of the components that I’ve used for this project


This is the old classic BPI-M1

You don’t have to use this exact SBC; use whatever you have. The only reason I’ll use it in this example is that it’s my temporary file server and already runs a web server. The BPI-M1 is the first banana-pi board and it’s known for its SATA interface, because at the time it was one of the very few boards with a real SATA interface (and not a USB-to-SATA bridge or similar). Not only that, but it also has GbE, and you can achieve more than 60 MB/sec on a GbE network, which is quite amazing as most boards are much slower. I’m using this SBC as my home web server and home automation server, as well as temporary network storage for non-important data, and I also run a bunch of services on it like a DDNS client and other stuff. So in my case I already had this SBC running in the house for many years. Since 2014 it was running an old bananian distro and only recently I’ve updated it to one of the latest Armbian distros.

As I’ve mentioned, you can use whatever SBC you like or have, but I’ll explain the procedure for the BPI-M1 and Armbian, which should be the same for any SBC running Armbian and almost the same for other distros.

Of course you can use any web server on the internet if you already have one. I’ll explain later why there’s no problem hosting the web interface on an internet-facing server while the ESP8266 runs in your local network.

Components organizer

This box goes by many names: components organizer, storage box with drawers, storage organizer, etc. Whatever the name, this is how it looks:

There are many small plastic drawers in which you organize your components. I mean, you know what it is, right? I won’t explain more. As you’ve seen in the video, when the LED turns on the drawer is also lit. This kind of white or translucent plastic (I don’t know the proper English name) is good for this purpose, as it glows with the color of the RGB light. So prefer those organizers over non-transparent ones.

WS2812B RGB LED strip

This is the ws2812b RGB LED strip I’ve used.

You probably are more familiar with this strip format.

But it’s important to use the first strip for the reasons I’ll explain in a bit.

The good ol’ ws2812b is one of the first addressable RGB LEDs and even today it’s just fine if you don’t need fancy animations or generally fast blinking on large RGB strips. Strictly speaking it’s not really addressable, it only appears that way from the software perspective. You can control each LED individually, but to change the color of one LED you need to send the values for all the LEDs up to that index. That’s because the data are shifted from one LED to the next in the chain/strip: every 24 bits (3 colors, 8 bits per color), the data are shifted on to the next LED.
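To make the shifting concrete, here’s a tiny Python sketch (my own model, not project code) of how a chain of WS2812Bs consumes a data frame: each LED latches the first 24-bit word that reaches it and passes the rest down the chain, so lighting LED k requires transmitting data for every LED before it.

```python
# Simulate how a WS2812B chain consumes a data stream: each LED keeps
# the first 24-bit GRB word it receives and forwards the rest downstream.
def shift_frame(frame_bytes, num_leds):
    """Return the (G, R, B) tuple latched by each LED in the chain."""
    leds = []
    for i in range(num_leds):
        chunk = frame_bytes[i * 3:(i + 1) * 3]
        if len(chunk) < 3:
            break  # no more data reached this LED; it keeps its old color
        leds.append(tuple(chunk))
    return leds

# To light only the 3rd LED we still have to send data for the first two:
frame = bytes([0, 0, 0,      # LED 0: off
               0, 0, 0,      # LED 1: off
               255, 0, 0])   # LED 2: green at full (WS2812B order is GRB)
print(shift_frame(frame, 5))
```

This is also why fast animations on long strips get expensive: every update retransmits the whole frame up to the highest index you touch.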

As you can imagine, this strip is placed at the back of the organizer to illuminate the drawers. The reason you need the first format is that the drawers are spaced apart. The first strip has 10 LEDs/meter, which means 1 LED every 10cm, and that’s just fine for almost any organizer (except the big bottom drawer). The second strip has 60 LEDs/meter, which means there would be more than one LED behind every drawer, which is a great waste. So if you used the second strip you would either have to cut out every LED and resolder it with an extension cable, or skip LEDs and pay for a more expensive strip. Anyway, it doesn’t make sense to use the second strip; just use the first one.


The ESP8266 is an MCU with WiFi and there are many different modules based on it, as you can see here. There are also many PCB boards that use some of those variants; for example, the ESP-12F can be found on many different PCBs, the best known being the NodeMCU. In this project I’ll use the NodeMCU, because it’s the easiest for most people: you can power the module over a USB cable, flash it easily, and do development using the UART for debugging. In reality though, since I have a dozen ESP-01 modules with 512KB and 1MB flash, that’s what I’m using in my setup. I won’t get into the details of the ESP-01, because you need different connections to flash it. Since you only need 2 pins, one for the strip’s data output and one to reset the firmware configuration settings to defaults, the ESP-01 would fit fine, but with some tinkering to make the pins usable.

Anyway, let’s assume that we’ll use the NodeMCU for now, which looks like this:

Project details

Let’s see a few details for the project now. Let’s start with the project repo which is located here:

In there you’ll find a few different things. First is the www/ folder that contains the web interface, then the esp8266-firmware/ folder that contains the firmware for the NodeMCU (or any ESP-12E), and finally there is a Dockerfile that builds an image with a web server that you can use for your tests. I’ll explain every bit in more detail later, but for now let’s focus on how everything is connected. This is the functional diagram.

As you can see from the above diagram, there’s a web server (in this case Lighttpd) with PHP and SQLite3 support that listens on a specific port and IP address. This server can run either in a docker container (for testing or normal use) or on an SBC. In my case I did all the development with docker, and after everything was ready I uploaded the web interface to the BPI.

The diagram also shows that any web browser in the network can access the web interface and execute SQL queries to display data. The web browser is also able to send aREST requests via POST to a remote device that runs an aREST server, which in this case is the ESP8266.

Finally, you see that the ESP8266 is also connected in the network and accepts aREST requests. Also it drives a ws2812b RGB LED strip.

So what happens in normal usage is that you open the web interface with your web browser. By default the web server will return a list of all the components in the database (you can change that in the code if you like). Then you can type a keyword into the search text field. The keyword is looked up in the part name, description and tag fields of the database’s records (more on the DB records later), and the executed SQL query returns all the results and updates the web page with the matching entries. Then, depending on the record, you can view the image of the item, download the datasheet, or, if you’ve set an index in the Location field of the DB table, you’ll also see a button labeled “Show”.

The image and the datasheet are stored on the web server in the datasheet/ and images/ folders. Actually, the image that you see in the list is a thumbnail of the actual image, and the thumbnails are stored in the thumbs/ folder. The reason for having thumbnails is to load the web interface faster; if you click on a thumbnail the actual image is loaded, which can be in much better resolution with finer detail. By default the thumbs are max 320×320 pixels, but you can change that as you prefer.

When you press the “Show” button, a javascript function is executed in the web browser and an aREST request is posted to the IP address of the ESP8266. The ESP8266 receives the POST and executes the corresponding function; one of them turns on a LED on the RGB strip using the index passed in the request. The LED then turns on and remains lit for a programmable time, 5 seconds by default. It’s possible to have more than one LED lit at the same time, and each LED has its own timer. You can also light all the LEDs at the same time with a programmable color, in order to create a more romantic atmosphere in your working environment…
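As a sketch of what that request looks like on the wire, aREST exposes each registered function at /&lt;function&gt;?params=&lt;value&gt;; the helper and the IP address below are hypothetical, just to show the URL shape:

```python
# Build the URL for an aREST function call such as led_index.
# The "?params=" query key follows the aREST library's convention.
def arest_url(ip, function, value):
    return "http://%s/%s?params=%s" % (ip, function, value)

# e.g. turn on the LED with index 2 (the IP here is made up):
print(arest_url("192.168.1.42", "led_index", 2))
```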

Finally, you can send aREST commands to change the ESP8266 configuration and program the available parameters, which I’ll list later.

Web interface

The web interface code is in the www/lab/ folder of the repo. In case you want to use it on your currently running web server, just copy the lab/ folder to your parent web directory. Although the web interface has quite a few files, it’s very basic and simple. You can ignore the css/ and js/ folders as they just contain helper scripts.

Since the aREST requests are sent from your web browser, it’s necessary that your device (the one running the browser) is on the same network and subnet as the ESP8266. The web server itself can be anywhere: you can have the web interface running, for example, on your VPS (virtual private server) on the internet, and the aREST posts will still work fine as long as the browser is on the same network as the ESP8266.

The main file of the web interface is `www/lab/index.php`. This is the file that displays the item list, sends SQL queries to the web server and also posts aREST commands to the ESP8266. In the $(document).ready function you can see PHP code that first reads the ESP8266IP.txt file. This file is in the lab/ folder of the web server and contains the IP of the ESP8266; you need to change it to the IP of your own ESP8266 device. In the same function the web browser tries to detect whether the ESP8266 is actually on the network, and if it’s not you’ll get a warning.

The turn_on_led(index) function sends a POST to the ESP8266 aREST server with the index of the LED that needs to turn on.

The HTML form with the id searchform handles the “Search part” area in the web interface. When you type a keyword there and press the “Search” button, this code is executed:

// by default list all the parts in the database
$sql = "SELECT * from tbl_parts";
// check if a submit is done
if (isset($_POST['part']) && $_POST['part'] !== '') {
  // check for valid search string
  $part=$_POST['part'];	// get part
  // note: concatenating user input into SQL like this is open to SQL
  // injection; fine for a LAN toy, but don't expose it publicly
  $sql = 'SELECT * from tbl_parts WHERE Part LIKE "%'.$part.'%" OR Description LIKE "%'.$part.'%" OR Tags LIKE "%'.$part.'%"';
}
echo '<br>Results for query: '.$sql.'<br><br>';

This will reload the content of the page and will create a list of items with the results. The code for that is just right below the above snippet in the index.php file. The last interesting code in this file is that one here:

if (!empty($row['Location']) && $ip) {
  echo '<button onclick="turn_on_led(' . $row['Location'] . ')">Show</button>';
}

This code adds the “Show” button if the proper field in the DB is set and the ESP8266 is on the network.
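As noted in the search snippet above, concatenating user input straight into the SQL string is vulnerable to SQL injection. A parameterized version (sketched here in Python’s sqlite3 for brevity; the same idea applies with PHP’s prepared statements, and the table layout mirrors a subset of the repo’s tbl_parts) would look like:

```python
import sqlite3

# Minimal in-memory table with a subset of the tbl_parts fields.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tbl_parts (Part TEXT, Description TEXT, Tags TEXT)")
con.execute("INSERT INTO tbl_parts VALUES ('bluepill', 'STM32 board', 'mcu')")

def search_parts(keyword):
    # Placeholders let sqlite escape the keyword; no string concatenation.
    pattern = "%" + keyword + "%"
    cur = con.execute(
        "SELECT * FROM tbl_parts "
        "WHERE Part LIKE ? OR Description LIKE ? OR Tags LIKE ?",
        (pattern, pattern, pattern))
    return cur.fetchall()

print(search_parts("stm32"))
```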

There are also two other important pages, and these are the web/lab/upload.html and web/lab/upload.php files. These two pages are used to add new records to the database. The upload.html page opens in the web browser when you press the “Add new part” button, and this is what you see:

There you fill in all the text boxes you want and optionally select an image file and a datasheet to upload to the server. When you’ve filled in the data, press the “Add part to database” button, and if everything runs smoothly you’ll see this:

This means that a new record was inserted into the database without error, and if you press the “Return” button you get back to the main page. There’s a problem though. I wouldn’t use this method to create all the records in the database, because it’s very slow and it would take a lot of time. Also, this way you can’t edit or update a record. Therefore the preferred way to initially populate your database with many components is to use a program that can open and edit the database file.

In my case I’m using Ubuntu, so I’ll explain how to do that on Linux; Windows users can use the DB Browser for SQLite as well. To install it on Ubuntu just run this command:

sudo apt install sqlitebrowser

Then run the DB Browser for SQLite from your applications; when the program starts, open `www/lab/parts.db` and click on the “Browse Data” tab. In there you’ll see this:

Now, this is a much easier way to add records, but there’s also a drawback: the files won’t be uploaded and the thumbnails won’t be created automatically. There’s a solution for that, too.

So, the best way to fill your database is to run the docker web server, as this will help you test and also add the records faster. I’ll explain how to run the docker image in the next section; for now I’ll assume that you’re already doing this, so I don’t interrupt the explanation and the process.

The DB has the following fields:

  • Part
  • Description
  • Datasheet
  • Image
  • Manufacturer
  • Qty
  • Tags
  • Location

This is an example of a record as shown in the web interface

Here, “bluepill” is the Part field, the long description is the Description field, and the image and datasheet icons are shown because I’ve added the file names in the Image and Datasheet fields. The Manufacturer and Qty fields are not really important and are not shown anywhere. The Tags field is used as extra keywords for the DB SQL search queries, and finally, if you set a number in the Location field then the “Show” button appears and that number is the index of the LED on the addressable RGB strip.

What’s important is that you copy the image file into the web/lab/images folder and the datasheet PDF into the web/lab/datasheets folder, but in the DB fields you only write the filename including the extension, for example blue-pill.png and blue-pill.pdf. Finally, after you’ve filled the DB with all your parts and items, you can create the thumbs by running the `www/lab/scripts/` script from the parent directory of the repo like this:

cd www/lab

This script will create the thumbnails in the www/lab/thumbs folder. You can change the default thumb size like this:

MAX_WIDTH=240 MAX_HEIGHT=240 ./tools/
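The resizing math itself is simple: scale the image down so that neither dimension exceeds the maximum, while keeping the aspect ratio. A quick sketch of that calculation (my own helper, not the repo script):

```python
def thumb_size(width, height, max_w=320, max_h=320):
    """Scale (width, height) to fit inside max_w x max_h, keeping aspect ratio."""
    scale = min(max_w / width, max_h / height, 1.0)  # never upscale
    return round(width * scale), round(height * scale)

print(thumb_size(640, 480))   # landscape photo, scaled down
print(thumb_size(200, 100))   # already small, left untouched
```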

After you’ve finished with the DB changes, don’t forget to click the “Write Changes” button in the program; then you can upload all the files to your web server once you’re done testing with the docker web server image. Remember that if you click on an image in the item list, the original image is loaded and opens in the browser.

Testing with the docker web server

As I’ve mentioned earlier, testing with the actual web server that runs on your SBC or remote host is quite cumbersome, as you first need to set up the server, and even then it’s not very convenient to edit the files remotely, or edit them locally and upload them to test. You can do it if you like, but personally I would get tired of this soon. Therefore, I’ve added a Dockerfile to the repo, located in the `docker-lighttpd-php7.4-sqlite3/` folder. To build the docker image run the following command:

docker build --build-arg WWW_DATA_UID=$(id -u ${USER}) -t lighttpd-server docker-lighttpd-php7.4-sqlite3/

This command builds a docker image with the Lighttpd web server, PHP 7.4 and SQLite, which you can use for your tests. After the image is built you also need the docker-compose tool, which is not installed with docker; you need to install it separately with these commands:

sudo curl -L "$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

If you don’t use Linux then see here how to install docker-compose.

So, after you’ve built the image and installed docker-compose, you can run this command from the root directory of the repo:

docker-compose up

After running this command you should see this output:

Creating lab-web-db_webserver_1 ... done
Attaching to lab-web-db_webserver_1
webserver_1  | [14-Jan-2020 14:43:13] NOTICE: fpm is running, pid 7
webserver_1  | [14-Jan-2020 14:43:13] NOTICE: ready to handle connections

This command runs a container with the web server; it also mounts the web interface folder at /var/www/html/lab inside the container and overrides the /etc/lighttpd/lighttpd.conf and `/usr/local/etc/php/php.ini` files in the container. You can open the `docker-compose.yml` file to see the exact configuration of the running container. Now, to test the web server you need the IP of the container, and the easiest way to get it is to run this command in a new console window/tab:

docker-compose exec webserver /bin/sh

This command drops you into the container’s console, so you can run ifconfig and get the IP of the container, for example:

743e05812196:/var/www/html# ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:AC:12:00:02  
          inet addr:  Bcast:  Mask:
          RX packets:15 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:2430 (2.3 KiB)  TX bytes:0 (0.0 B)

In this case the IP of the container is the one shown in the inet addr field above, which means that you can open the web interface in your web browser using that IP as the URL:

That’s it! Now you can edit your files with any text editor, add stuff to your DB, and just refresh the web page in your browser to load the changes. That makes testing much easier.


Finally, there’s the firmware for the ESP8266! I’ll try to explain a few things about the firmware here, but generally it’s very easy and straightforward. The firmware is in the `esp8266-firmware/` folder of the repo. I’ve used Visual Studio Code (VSC) and the PlatformIO (PIO) plugin to write the firmware, as it makes doing stupid projects fast very easy, and I’ve also used the Arduino framework for the ESP8266. The nice thing with the PIO plugin in VS Code is that you don’t need to hunt down the libraries I’ve used yourself, as the plugin will handle those.

First open the `esp8266-firmware/` folder in VSC and install the PIO plugin from the extensions in the left menu of the IDE. When you’ve done that, click on the alien head in the left menu, which opens the PIO menu with the project tasks. Press “Build” and see what happens. If it builds OK, you’re good to go; if it complains about missing libraries, open the “View -> Terminal” menu of VSC and run these two commands in the terminal:

pio lib update
pio lib install

Those commands should install the missing libs, which are FastLED and aREST.

A note about aREST here. Although the library claims to be a true RESTful API, it’s not! There’s no distinction between GET, POST, PUT and the other supported REST verbs; whatever method you use, aREST handles it in the same callback… That’s a pity, because I had to implement some commands twice in order to get the POST/GET functionality. Anyway, a small rant; I don’t intend to fix this myself…

The main source file is src/main.cpp. There are a few things going on in there, but generally you need to define a few things before you build the firmware, which are:

  • def_ssid: The default SSID of your router
  • def_ssid_password: The SSID password
  • def_led_on_color: The default RGB color when a LED is activated
  • def_led_off_color: The default RGB color for deactivated LED
  • RESET_TO_DEF_PIN: The pin used to reset the configuration to the default values
  • STRIPE_DATA_PIN: The pin that drives the RGB LED strip
  • NUM_LEDS: The number of LEDS in the strip. This is the number of the drawers of your component organizer.

Normally, you would initially only change def_ssid and `def_ssid_password` for testing, and at some point also def_led_on_color.

Now you need to connect everything together, which is quite simple. Use a +5V power supply that can provide the needed power and current for the LED strip. In my case I measure approx 2.5 Amps with 100 LEDs when they are all set to white, which is the max. That means I need at least a 3A PSU (or 15W). Now connect the +5V and GND of the WS2812B strip to the PSU and then also make the following connections:

ESP8266 (NodeMCU) WS2812B
D2 Din

Also connect the D1 pin of the NodeMCU to GND via a 4K7 resistor. This pin is used for resetting the configuration to the default values: when it’s connected to GND that’s normal operation, and when it’s connected to +3V3 and you reset the ESP8266, the default configuration is loaded. Always remember to connect the resistor back to GND after resetting to defaults.

Finally, connect a USB cable between the ESP8266 and your computer, or a USB power supply in normal use.
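The PSU sizing mentioned above is simple arithmetic; a quick sketch using the 2.5 A figure I measured (the 20% headroom factor is my own rule of thumb, not from the project):

```python
# PSU sizing from the measured worst-case draw.
supply_v = 5.0      # WS2812B supply voltage
measured_a = 2.5    # measured draw with 100 LEDs at full white
headroom = 1.2      # ~20% margin so the PSU isn't run at its limit

min_current = measured_a * headroom
min_power = supply_v * min_current
print("PSU: at least %.1f A (%.1f W)" % (min_current, min_power))
```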

Getting back to the code, `load_default_configuration()` loads the default configuration; this function is called either when an invalid configuration is found (e.g. empty or corrupted conf) or when the D1 pin is HIGH.

When the program starts, the serial port is configured first and then the D1 input pin. Then the configuration is loaded from the EEPROM (actually the flash, in the case of the ESP8266) and the aREST API is configured with the exposed functions and the internal variables. Next the module connects to the SSID via WiFi and the main 100ms timer is set to run continuously. This timer is used for all the timeout actions of the LEDs. Finally, the WS2812B strip is initialized and all LEDs are set to the led_off_color value, which is Black (aka turned off) by default.

In the main loop, the code handles all the aREST requests, checks if it needs to save the configuration in the EEPROM (=flash), and also checks the timeouts of the LEDs: if any LED is active, its timeout is checked and when it’s reached the LED is turned off.
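The per-LED timer bookkeeping in that loop can be modeled like this (a simplified Python sketch of the logic, not the actual firmware code):

```python
# Each lit LED gets its own expiry time; the periodic tick turns off
# any LED whose timeout has been reached.
timers = {}  # LED index -> expiry timestamp in ms

def turn_on(index, now_ms, timeout_ms=5000):
    timers[index] = now_ms + timeout_ms

def tick(now_ms):
    """Called periodically; returns the LEDs to switch off."""
    expired = [i for i, t in timers.items() if now_ms >= t]
    for i in expired:
        del timers[i]  # in the firmware this also sets led_off_color
    return expired

turn_on(3, now_ms=0)
turn_on(7, now_ms=1000)
print(tick(4000))  # neither LED has expired yet
print(tick(5000))  # LED 3 reaches its 5 s timeout
```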

The rest of the functions, with the ICACHE_RAM_ATTR prefix, are the callbacks for the aREST API. This prefix is actually a macro that instructs the linker to place those functions in RAM instead of flash, so the code is accessed faster. Now let’s see the supported commands, which are listed in the following table:

Command          Description
led_index        The index of the LED to turn ON
led_on_color     The integer value of the CRGB color for ON
led_off_color    The integer value of the CRGB color for OFF
led_on_timeout   The integer value of the LED timeout in seconds
led_ambient      The integer value of the CRGB color for the ambient mode (for real-time selection)
—                The integer value of the CRGB color that is saved as the default ambient color
enable_ambient   Enables/disables the ambient mode
wifi_password    The WiFi password

You can find the CRGB value for each supported color in the FastLED library file which is located in .pio/libdeps/nodemcuv2/FastLED_ID126/pixeltypes.h. Then you can use a calculator to convert HEX values to integers.
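If you have a Python prompt handy you don’t even need a calculator; for example, for SkyBlue (0x87CEEB in pixeltypes.h):

```python
# CRGB colors in pixeltypes.h are 24-bit 0xRRGGBB values; aREST wants
# them as plain integers.
sky_blue = 0x87CEEB
print(sky_blue)        # → 8900331, the integer to paste into the command
print(hex(8900331))    # → 0x87ceeb, and back again
```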

In case you want to play around with the aREST commands, you can use a REST client like Insomnia, or you can use your browser’s address bar to send GET commands. In either case the URL for each command has the following format:


Therefore if the ESP8266 IP address is and you want to use the led_index command to turn on the 2nd LED of the strip, then paste this line to your web browser’s address bar and hit enter.

Or if you want to set the led_on_color to SkyBlue (which according to pixeltypes.h is 0x87CEEB in HEX or 8900331 in integer), then

Finally, you can get all the current aREST variable values with this URL:

This will return something like this:

{
  "variables": {
    "led_on_color": 1,
    "led_off_color": 0,
    "led_on_timeout": 5000,
    "led_ambient": 1,
    "enable_ambient": 0,
    "wifi_ssid": "    ",
    "wifi_password": "         "
  },
  "id": "1",
  "name": "lab-db",
  "hardware": "esp8266",
  "connected": true
}

Actually, since I’m using Insomnia for my tests, this is an example output from the tool.

Finally, I would like to mention that although you see I’m using the EEPROM.h header and functions, the configuration is actually stored in flash. The configuration is a struct, specifically struct tp_config, which has several members; the important ones are the preamble, the version and the crc. The preamble is just a fixed 16-bit word (see FLASH_PREAMBLE), 0xBEEF by default, used by the code to verify that what it copies from the EEPROM into the struct is a valid struct. The configuration version is compared with the firmware’s version, and if they differ (e.g. because of a firmware update) the check_eeprom_version() function is responsible for handling this. Therefore, if you add more variables to the configuration, its size will change, which means you need to bump the version in the firmware code and then also handle the migration in that function if you want to preserve the current configuration (e.g. the WiFi credentials); otherwise the easy way out is to just restore the defaults, which is what already happens in the code. Finally, the crc is the checksum of the configuration; if it doesn’t match when reading the data from the EEPROM, that is again an error condition and by default is handled by resetting the configuration to defaults.
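To illustrate the scheme (a simplified Python model; the real tp_config is a C struct with more members and its own CRC routine), a stored blob is accepted only when both the preamble and the CRC check out:

```python
import struct
import zlib

PREAMBLE = 0xBEEF
VERSION = 1

def pack_config(ssid: bytes) -> bytes:
    # preamble + version + payload, then a CRC over everything before it
    body = struct.pack("<HH32s", PREAMBLE, VERSION, ssid)
    return body + struct.pack("<I", zlib.crc32(body))

def load_config(blob: bytes):
    body, (crc,) = blob[:-4], struct.unpack("<I", blob[-4:])
    preamble, version, ssid = struct.unpack("<HH32s", body)
    if preamble != PREAMBLE or crc != zlib.crc32(body):
        return None  # invalid: the firmware would restore defaults here
    return version, ssid.rstrip(b"\x00")

blob = pack_config(b"my-wifi")
print(load_config(blob))
print(load_config(b"\x00" * len(blob)))  # corrupted blob is rejected
```

A version bump would be detected after this validation step, by comparing the stored version field against the firmware’s own.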

To build and upload the code now, just select PIO in the activity bar on the left. There you’ll see the project tasks, which include “Build” and “Upload and Monitor”. If you can’t see those, you probably opened the top repo folder; you need to open only the esp8266-firmware/ folder in VSC and then you should be able to see those tasks. When you click the build task, the terminal opens at the bottom of the UI and you should get something like this:

Building in release mode
Retrieving maximum program size .pio/build/nodemcuv2/firmware.elf
Checking size .pio/build/nodemcuv2/firmware.elf
Advanced Memory Usage is available via "PlatformIO Home > Project Inspect"
DATA:    [====      ]  38.2% (used 31316 bytes from 81920 bytes)
PROGRAM: [===       ]  28.1% (used 293584 bytes from 1044464 bytes)
====================================================== [SUCCESS] Took 2.26 seconds ======================================================

Terminal will be reused by tasks, press any key to close it.

That means the firmware is ready to be uploaded to the ESP8266, so click the “Upload and Monitor” task from the left menu and the flashing procedure will start. There’s a chance that the USB COM port is not found; in that case, if you’re using Linux, you need to add the proper udev rules (see how to do this here).


Before you proceed with installing the LEDs in the components organizer, you first need to check that everything works as expected. First, create your database; the easiest way, as I’ve mentioned, is using the docker image and the DB Browser for SQLite. After you’re done with the database you can do the tests with the ESP8266 against the docker container, so first you need to make the proper connections and connect the WS2812B LED strip to the ESP8266.

For this post I’ll use the NodeMCU (actually a clone) with the ESP-12E module. Just connect the D1 pin to GND with a 4K7 resistor, connect the GND of the module to the GND of the WS2812B, connect the D2 pin of the ESP8266 to the Din pin of the strip, and finally connect the ESP8266 with a USB cable to either your PC or a USB charger. Of course, before that you need to verify that your firmware is flashed and that the ESP8266 is able to connect to the WiFi router and responds to the http://ip-address/ aREST command with its current configuration.

Finally, connect the WS2812B to a +5V power supply that can provide enough current. Just a note here: although the ESP8266 is a 3V3 device and the WS2812B is a 5V one, you shouldn’t expect any problems with the voltage difference, and you shouldn’t need a level translator. From my experience I’ve never had any issues driving a WS2812B from a 3V3 MCU.

After all connections are made and everything is powered up, use the browser on the desktop machine that runs the docker container (not your smartphone) and verify that the LED lights up when you press the “Show” button of a component that has an index in its Location field in the DB. If that works, you’re good to go and can install the LED strip at the back of the components organizer.

LED strip comparison and details

As I’ve mentioned, I didn’t use the standard 30/60/144 LEDs-per-meter strips that you usually find on ebay. This is a 10 LEDs/meter strip, which means each LED is 10cm away from the next. That is perfect for this project for many reasons, but it also has a drawback, which I’ll discuss later.

For comparison, this is a photo of the 10 LEDs/m strip and another 60 LEDs/m strip I have, next to an ESP-01 module, which is one of the smallest ESP8266 modules available.

The strip on the left is 5m long, which means it has 50 LEDs, and the one on the right is 4m long with 240 LEDs. The good thing about the strip on the left is that it’s much thinner and more flexible, and the distance between the LEDs is ideal for the components organizer drawers.

The problem with the thinner strip, though, is that it has more resistance. Why is that a problem? Well, if it were only one strip (aka 5 meters) it would be fine, but since I’m using 2 strips (10 meters), as you can imagine the resistance grows for the LEDs that are far away from the power supply. Also, because the thinner strip is not a flexible PCB but has thin cables between the LEDs, the resistance is even higher.

In the next photo you can see the effect that this resistance has on the LEDs when two strips are connected together and the power supply is only connected to the one end of the strip.

As you can see, the strip that is near the PSU is bright white as it should be, but halfway down the next strip the LEDs start to show discoloration and instead of white light you get yellow. That’s because the length of the strip adds more resistance towards the end of the line and the voltage on the LEDs drops enough that they don’t get full power, so you get a dim light (and since the blue pixels need the highest forward voltage, they dim first, which makes the white look yellow).

The solution, as you may have already guessed, is to supply voltage to both ends of the strip. Therefore, the +5V PSU output is connected to both ends (you don’t have to connect both GNDs though, one is enough). The result is that the strip now has nearly equal voltage on both sides, so all the LEDs are lit the same. The next photo shows the 10 meter strip with the PSU connected to both ends.
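To get a feeling for why this happens, here’s a rough model of the voltage drop along the strip. All the numbers here (per-segment wiring resistance, per-LED current) are assumptions for illustration, not measurements of the actual strip:

```python
# Rough model of the voltage drop along a WS2812B strip. The segment
# resistance and per-LED current below are assumed values, only meant
# to show the shape of the problem.

N = 100          # LEDs on the full 10m strip (1 LED / 10cm)
R_SEG = 0.02     # ohms of thin wiring between two adjacent LEDs (assumed)
I_LED = 0.013    # amps per LED, from the 1.3A / 100 LEDs measurement
V_IN = 5.0

def worst_drop_single_feed(n, r_seg, i_led):
    """PSU on one end only: segment k carries the current of all LEDs after it."""
    drop, current = 0.0, n * i_led
    for _ in range(n):
        drop += current * r_seg
        current -= i_led
    return drop

# Feeding both ends is roughly like two half-strips each fed from one end,
# so the worst point (the middle) sees the drop of an n/2 strip.
single = worst_drop_single_feed(N, R_SEG, I_LED)
double = worst_drop_single_feed(N // 2, R_SEG, I_LED)

print(f"far end, single feed : {V_IN - single:.2f} V")
print(f"middle, dual feed    : {V_IN - double:.2f} V")
```

Even with these made-up numbers you can see that feeding both ends cuts the worst-case drop to roughly a quarter, which is why the discoloration disappears.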

You see now that the light is uniform along the whole length of the LED strip.

Finally, I’ve measured the maximum consumption of the 10m LED strip with all color pixels (RGB) set to the maximum value 0xFF. The result is:

So, it’s about 1.3A @ 5V, or 6.5W. Therefore a 5V/2A USB charger is more than enough for the strip including the ESP8266. I don’t plan to light all the LEDs at full white at the same time anyway, as I like dim lights. I’ll just remind you here that you can use the ambient light functionality in the ESP firmware and light your organizer in any color for a cool ambient effect in your lab.
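As a quick sanity check of that budget, here’s the arithmetic in a few lines. The 1.3A figure is the measurement above; the ESP8266 average current is an assumed ballpark value, not a measurement:

```python
# Back-of-the-envelope power budget from the measurement above.
n_leds = 100
i_total = 1.3            # A, measured with all pixels at 0xFF
v_supply = 5.0

i_per_led = i_total / n_leds         # average per LED, all white
p_strip = v_supply * i_total         # strip power in watts
i_esp = 0.17                         # A, rough ESP8266 ballpark (assumed)
headroom = 2.0 - (i_total + i_esp)   # what's left on a 5V/2A charger

print(f"{i_per_led*1000:.0f} mA per LED, {p_strip:.1f} W strip total")
print(f"headroom on a 5V/2A charger: {headroom:.2f} A")
```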

Installing the LEDs

The next step is to install the LED strip on the back of the organizer. This turned out to be the most time consuming task, even compared to writing the firmware… Anyway, since I have 3 organizers that I want to illuminate and each one has 33 drawers, I need 99 LEDs, so I got a strip with 100 LEDs at 1 LED / 10cm. The extra length is very convenient in this case.

Also, because I have 3 organizers and only 1 strip, I had to cut the strip and solder connectors, so if I need to move an organizer I can just disconnect it instead of removing the whole strip. And in case I need to add an extra organizer, I can extend the strip with another connector. Using the WS2812B is also nice because if one of the LEDs malfunctions I can replace it easily.

One last thing is to decide the LED indexes before you fill the DB. I should probably have mentioned that earlier, but I guess you’ll have read this far before doing any of the previous steps. Let’s see a simplified example of how I’ve laid out the LED strip on the back of the organizers.

The above diagram shows a simplified view of the 3 organizers and the increment direction of the LED indexes. Obviously it wouldn’t make much sense to label each drawer like that if you were using a sticker on each drawer and writing an index with a marker. But it makes perfect sense with a LED strip like the WS2812B, because that way you minimize the length of the strip. Of course you can use a different arrangement, like top-to-bottom instead of left-to-right. In my case I’ve used a quite unusual indexing, as you’ll see in the next picture of one of the organizers.
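To make the idea concrete, here’s one possible serpentine mapping from a drawer position to a strip index. This is not my actual layout (mine is more unusual, as mentioned); the 3x11 grid per organizer is just a hypothetical arrangement to show how such a mapping could be computed:

```python
# One possible serpentine mapping from a drawer position to a strip index.
# The 3x11 grid is a hypothetical layout, not the exact one used here;
# it just shows how snaking the strip keeps the indexes computable.

ROWS, COLS = 11, 3          # 33 drawers per organizer (assumed 3x11 grid)
PER_ORGANIZER = ROWS * COLS

def led_index(organizer, col, row):
    """organizer 0..2, col 0..COLS-1, row 0..ROWS-1 (top to bottom)."""
    base = organizer * PER_ORGANIZER
    if col % 2 == 0:
        # even columns run top to bottom
        return base + col * ROWS + row
    # odd columns run back up, so the strip snakes without long jumps
    return base + col * ROWS + (ROWS - 1 - row)

# first drawer of the middle organizer
print(led_index(1, 0, 0))   # -> 33
```

With a function like this you can fill the Location field in the DB once and never think about the physical routing of the strip again.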

Let me say here, that for my standards it’s the best I could do. I’m getting really bored and lazy when it comes to manual work like this, so I’m just doing everything as fast as I can and never look back again (literally in this case).

As you can see, there’s a single LED behind each drawer and I’ve used invisible tape to stick the flexible strip on the plastic rails in the back. I hope they won’t fall off soon, but generally I’ve used tons of invisible tape for various reasons and it has never failed me. Also, if you look at the indexing on the top, it doesn’t seem to make sense, especially compared to the indexes in the previous example. The reason is that this direction was more convenient for connecting the next strip. This is a calc sheet I’ve made. It probably doesn’t make sense, but the teal color represents the left organizer, the violet the middle and the orange the right organizer. Then (ABC) is the first column of the organizer, (DEF) the second, (GHI) the third and finally (JKL) is the fourth column of drawers. Yeah I know, my brain works a bit weird, but I find this very easy to understand and remember what I’ve done.

Anyway, you don’t have to use this, you can skip it and make an indexing map that is better for you.

Now, this is a photo of the back of all 3 organizers when I was done with this ridiculous amount of boring manual work…

At least it worked without any issues on the first boot. As you may see, I had to cut the strip at every end and solder connectors, and I also had to solder together the two cut pieces from the long strips to make the strip part for the last organizer. Oh, so boring!

In the above and also the next picture I’ve tested the ESP8266 and the strip with the indexes and the ambient functionality. This is after I’ve finished the whole installation.

Neat. Of course, you shouldn’t expect uniform illumination, as each drawer has different components, some are full, some almost empty. But still the result is really nice and it actually looks better in my lab than in the picture.

Finally, I’ve also made a custom USB cable to provide power to both the ESP8266 and the WS2812B strip. I’ve used a USB power pack capable of providing 2A @ 5V, which is more than enough for everything. This is the cable.

Next thing I’ll do when I get my 3D printer is to print a nice case to fit the ESP module.

Installing the web-interface to BPI-M1

In my case, after everything seemed to be working, I moved the whole www/lab folder from the repo to the BPI-M1. As I’ve already mentioned, I’m using the Armbian distro on my BPI-M1, which doesn’t have a web server, PHP or sqlite by default. Therefore, I’ll list the commands I’ve used to install those in Armbian. So, to install lighttpd with php and sqlite support, I’ve run these commands:

apt-get update
apt-get upgrade
apt-get install lighttpd php7.2-fpm php7.2-sqlite3 php-yaml

lighttpd-enable-mod fastcgi
lighttpd-enable-mod fastcgi-php

Then edit the `15-fastcgi-php.conf` file

vi /etc/lighttpd/conf-available/15-fastcgi-php.conf

and add this:

# -*- depends: fastcgi -*-
# /usr/share/doc/lighttpd/fastcgi.txt.gz

## Start an FastCGI server for php (needs the php7.2-fpm package)
fastcgi.server += ( ".php" =>
        ((
                "socket" => "/var/run/php/php7.2-fpm.sock",
                "broken-scriptfilename" => "enable"
        ))
)
Then uncomment this line in /etc/php/7.2/fpm/php.ini


And finally restart lighttpd

sudo service lighttpd force-reload

Now copy the lab/ folder from www/ in the repo to the /var/www/html folder, so that the index.php file ends up at /var/www/html/lab/index.php.

Finally, you need the right permissions on the web folder, so run this command in your SBC terminal (via ssh or serial console):

sudo chown www-data:www-data /var/www/html/lab

Now use any web browser (smartphone, laptop, e.t.c.), connect to the web interface and test again that everything works fine. Also test that the upload works with large pdf files. In case it doesn’t, have a look at the `www/php.ini` file and apply those params on your SBC, too.

If it’s working, then you’re done.


Here is a video of me playing around with the web interface, and also a video where I’m using the web browser on my workstation to enable the “ambient” feature of the ESP8266 firmware, which just lights all the WS2812B strip LEDs in a color and makes a nice color-power-consuming atmosphere in my home lab.

In the first few minutes I’m just fooling around with the database and I’m searching for “MCP” and “STM” keywords in there. I’m also using the “Show” button to demonstrate that the LED of the drawer that contains the part is lit. Then I test the ambient light function with my smartphone and finally with the workstation browser.

Generally, I’m always using my workstation and never the smartphone. The difference between the two cases is that when the web interface runs on the workstation, the HTML color object changes the color in real-time while you’re moving your mouse and you don’t have to press “set” every time (which you need to do in the mobile browser). You can see that at 3:13 of the video. In the code there are two javascript events, ambient_color() and save_ambient_color(). The first one is called from the oninput callback, which fires on any color update in real-time from the HTML color object, and the second one, onchange(), is triggered only when you close the color picker on the desktop web-browser or click “set” on the mobile browser. In the onchange() event the color is also saved in flash, so next time you press the “Enable ambient” button this color is used.

Some weird issues

During my experiments I had some weird behavior from a specific batch of ESP8266 modules. It seems that not all modules are made the same. These are the two modules I’ve used

On the left side is the LoLin NodeMCU module and on the right side is a generic cheap module I’ve bought from ebay some time ago with this description here: “NodeMCU ESP8266 ESP-12E V1.0 Wifi CP2102 IoT Lua 267 NEW”.

The problem I had is that when I flashed the same firmware on both devices, the generic ESP8266 module was jamming my smartphone’s WiFi connection, but the LoLin didn’t have any effect. The next image shows the effect both modules have on a Speedtest run.

On the left side is when the firmware runs on the LoLin module and on the right when it runs on the other one. You see that the difference is quite large, but the worst thing is that in the second case it made internet browsing on my smartphone very hard, as I couldn’t download web pages fast and it was lagging a lot.

The first thing I suspected was the transmit power. I believe that the second module’s transmit power is quite high and thus creates those issues. I thought of using my RTL-SDR (which I’ve used in this post here), but the problem with this dongle is that the Rafael Micro R820T/2 tuner can only tune up to 1766 MHz, which makes sniffing the 2.4GHz band impossible. I’ve searched and found an MMDS downconverter on aliexpress, but that doesn’t ship to Germany and the other options were quite expensive, so I’ve dropped the idea of digging in and finding out what’s going on. At some point I’m planning to get a HackRF One anyway, which can sniff up to 6GHz, so I may look into it again then.

Anyway, just have that in mind, because it may cause trouble in your WiFi network. For the ESP8266 arduino framework there’s also the `setOutputPower()` function that is supposed to control the TX power from 0 up to 20.5dBm (see here). I may play around with it at some point, but for now I’m using the LoLin module.

Generally, I’ll have a look at it in the future and write a post when I figure out why this is happening. I have 3 of those ESP8266 modules and it’s a shame to be unusable. I’ll try experimenting first with the `setOutputPower()`.

Edit 1:

I’ve just realized that the default behavior of WiFi.begin() is to configure STA + AP, which means that the ESP8266 is configured as both an access point (AP) and a client/station (STA). This is definitely an issue and affects the overall WiFi performance of the surrounding devices, but it still doesn’t explain the difference between the modules. Anyway, now I’m forcing STA-only operation by explicitly setting the mode in the code with `WiFi.mode(WIFI_STA)` before calling `WiFi.begin()`.


Edit 2:

After forcing the STA mode, the speedtest got a bit better on my smartphone, but the upload was still horrible. Next was to limit the TX power with the `WiFi.setOutputPower()` function. This mostly solved the issues I had, but not completely. Around 6-8 dBm I’ve seen a couple of disconnections on the ESP8266 while I was running the speedtest on the smartphone. Anyway, around 12.5 dBm seems to be the sweet spot for my home arrangement, but I’m still not quite happy with the upload speed on the speedtests.
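Keep in mind that dBm is a logarithmic unit, so dropping from the maximum 20.5 dBm to 12.5 dBm is a much bigger cut than it looks. The conversion is the standard one:

```python
# dBm is logarithmic: power_mW = 10 ** (dBm / 10). A small dB step is a
# big power step, which is why lowering setOutputPower() helps so much.
def dbm_to_mw(dbm):
    return 10 ** (dbm / 10)

for dbm in (20.5, 12.5, 8.0):
    print(f"{dbm:>5} dBm = {dbm_to_mw(dbm):6.1f} mW")
```

So the 12.5 dBm “sweet spot” is only about a sixth of the maximum TX power.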

For now I’ve ordered a couple more LoLin modules, because I only have one and it seems to work fine. All the other modules have this issue. When I get the HackRF, or if I find an MMDS downconverter, I’ll dig deeper into this…


This post ended up longer than I expected. Although all the different parts and code are simple, there are just many different things that compose the outcome. Also, I got a bit deeper into the development and testing tools (e.g. docker). Sometimes I tend to think that I’m using Docker too much, but on the other hand I’m happy that I keep my development workstation clean from the tons of packages and tools that I don’t use often, but only when making those stupid projects.

Regarding the project, I have to say that this DB was a savior for me, because many times I need components and I don’t know if I have them, and even worse, I have no idea where to start looking for them. Generally, I keep things organized, otherwise it would be impossible to keep track of every component; but no matter how organized I am, even if the database tells me that I have what I need, if I don’t remember where the component might be -which is the case most of the times- then I need to start looking everywhere.

For that reason making this interactive RGB LED strip to point to the correct drawer is a huge advantage for me. Sometimes now I just play with this even if I don’t really need a component, he he. It’s really nice to have.

On the other hand, if you’re thinking of implementing this, you’re free to use the code and do whatever you like with it. The only boring thing is to fill the database if you have a lot of components, but you don’t have to do it in one day. I’ve been filling this database for the last 6 years. What I do is, first I buy the components from ebay (usually) and then I find a nice image and the datasheet and add the component to the database.

Here’s a nice trick I’m using for some components. Since the thumbnail doesn’t have to be the same image as the one in the image/ folder, I use a different image in the thumb/ and image/ folders. In this case, when the list is shown the thumbnail image is displayed, and when I click on it the image from the image/ folder is displayed. This is useful for example with many MCUs, because I use an actual photo of the MCU for the thumbnail and the pinout image for the bigger image.

Another useful thing that I do: many times I can’t find a datasheet for a component but I’ve found a web page with info about it. Then I print the web page to a PDF file and upload that. Finally, sometimes I need to save more than one PDF for a component, in which case I merge two or more PDF files into one using one of the available online PDF mergers (e.g. this or that). That way I can work around the fact that only one PDF is allowed per component in the database.

Finally, what you’ll learn by doing this project is how to work with docker images and web servers, set up your SBC as a web server, a bit of REST (although aREST is not really a RESTful environment), and also a bit of PHP, javascript, SQLite, ESP8266 firmware and playing with LEDs.

That’s it, I hope you find this stupid project useful, for some reason.

Have fun!

Posts archive

This is a list of all the blog posts:

DevOps for Embedded (part 3)


Note: This is the third post of the DevOps for Embedded series. You can find the first post here and the second here.

In this series of posts, I’m explaining different ways of creating a reusable environment for developing and also building a template firmware for an STM32F103. The project itself, meaning the STM32 firmware, doesn’t really matter; it’s just a reference project to work with the presented examples. Instead of this firmware you can have anything else, less or more complicated. In this post though, I’m going to use another firmware, not the template I’ve used in the two previous posts, because I want to demonstrate how a test farm can be used. More on this later.

In the first post I’ve explained how to use Packer and Ansible to create a Docker CDE (common development environment) image and use it on your host OS to build and flash your code for a simple STM32F103 project. This docker image was then pushed to the docker hub repository and then I’ve shown how you can create a gitlab-ci pipeline that triggers on repo code changes, pulls the docker CDE image and then builds the code, runs tests and finally exports the artifact to the gitlab repo.

In the second post I’ve explained how to use Packer and Ansible to build an AWS EC2 AMI image that includes the tools that you need to compile the STM32 code. Also, I’ve shown how you can install the gitlab-runner in the AMI and then use an instance of this image with gitlab-ci to build the code when you push changes in the repo. Also, you’ve seen how you can use docker and your workstation to create multiple gitlab-runners and also how to use Vagrant to make the process of running instances easier.

Therefore, we’ve pretty much covered the CI part of the CI/CD pipeline, and I’ve shown a few different ways this can be done, so you can choose what fits your specific case.

In this post, I’ll try to scratch the surface of testing your embedded project in the CI/CD pipeline. Scratching the surface is literal, because fully automating the testing of an embedded project can even be impossible. But I’ll talk about this later.

Testing in the embedded domain

When it comes to embedded, there’s one important thing that would be great to include in your automation, and that’s testing. There are different types of testing, and a few of them apply only to the embedded domain and not to the software domain or other technology domains. That is because in an embedded project you have custom hardware that works with custom software and that may be connected to -or be a part of- a bigger and more complex system. So what do you test in this case? The hardware? The software? The overall system? The functionality? It can get really complicated.

Embedded projects can also be very complex themselves. Usually an embedded product consists of an MCU or application CPU and some hardware around it. The complexity is not only about the peripheral hardware around the MCU, but also the overall functionality of the hardware. Therefore, the hardware might be quite simple, but it might be difficult to test because of its functionality and/or the testing conditions. For example, it would be rather difficult to test your toggling LED firmware on the arduino in zero-gravity or at 100 g acceleration :p

Anyway, getting back to testing, it should be clear that it can become very hard to do. Nevertheless, the important thing is to know at least what you can test and what you can’t; therefore it’s mandatory to write down your specs and create a list of the tests that you would like to have, and then start to group those into tests that can be done in software and those that need some external hardware. Finally, you need to sort those groups by how easy each test is to implement.

You see, there are many different things that actually need testing, but let’s break this down.

Unit testing

Usually testing in the software domain is about running tests on specific code functions, modules or units (unit testing) and most of the times those tests are isolated, meaning that only the specific module is tested and not the whole software. Unit testing is nowadays used more and more in the lower embedded software (firmware), which is good. Most of the times though, embedded engineers, especially older ones like me, don’t like unit tests. I think the reason is that it takes a lot of time to write unit tests and also that many embedded engineers are fed up with how rapidly the domain is moving and can’t follow for various reasons.

There are a few things that you need to be aware of with unit tests. Unit tests are not a panacea, meaning they’re not a remedy for all the problems you may have. Unit tests only test the very specific thing you’ve programmed them to test, nothing more, nothing less. Unit tests will test a function, but they won’t prevent you from using that function in a way that will break your code; therefore you can only test very obvious things that you already thought might break your code. Of course, this is very useful, but it’s far from being a solution that makes you safe just by using it.

Most of the times, unit tests are useful to verify that a change you’ve made in your code, or in a function that is deep in the call chain of another function, doesn’t break anything that was working before. Again, keep in mind that this won’t make you safe. Bad coding is always bad coding no matter how many tests you run. For example, if you start pushing and popping things to the stack in your code, instead of leaving the compiler as the only owner of the stack, then you’re heading for trouble and no unit test can save you from that.

Finally, don’t get too excited with unit testing and start making tests for every function you have. You need to use it wisely and not spend too much time implementing tests that are not really needed. You need to reach a point where you are confident in the code you’re writing and can also re-use the code you’ve written in other projects. Therefore, focus on writing better code and write tests only when needed or for critical modules. This will also simplify your CI pipeline and make its maintenance easier.

Testing with mocks

Some tests can be done using either software or external hardware. Back in 2005 I wrote the firmware for a GPS tracker and I wanted to test both the software module that was receiving and processing the NMEA GPS input via a UART and also the application firmware. One way was to use a unit test to test the module, which is OK for testing the processing but not very useful for real-case scenario tests. Another way was to test the device out in the real world, which would be the best option, but as you can imagine it’s a tedious and hard process to debug the device and fix code while you’re driving. Finally, the last way was to mock the GPS receiver and replay data into the UART port. This is called mocking the hardware, and in this case I just wrote a simple VB6 app that replayed captured GPS NMEA data from another GPS receiver to the device under test (DUT).

There are several ways of mocking hardware. Mocking can be software-only and integrated in the test unit itself, or it can be external, like in the GPS tracker example. In most cases though, mocking is done in the unit tests without external hardware, and of course this is the “simplest” case and it’s preferred most of the times. I’ve used quotes around simplest, because sometimes mocking the hardware only with software integrated in the unit test can also be very hard.
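To illustrate the software-only variant, here’s a minimal sketch in Python (not the actual VB6 tool or the tracker’s code): a trivial NMEA field extractor is exercised by replaying a captured sentence instead of reading from a real UART. The parser and the sample sentence are made up for the example:

```python
# A tiny illustration of mocking a hardware input: instead of reading
# NMEA sentences from a real UART, we replay previously captured lines
# to the code under test. Parser and data are example-only.

def parse_gga(sentence):
    """Extract (lat, lon) strings from a $GPGGA sentence."""
    fields = sentence.split(",")
    if not fields[0].endswith("GGA"):
        raise ValueError("not a GGA sentence")
    return fields[2] + fields[3], fields[4] + fields[5]

# the "mocked UART": captured data replayed instead of live hardware
captured = [
    "$GPGGA,123519,4807.038,N,01131.000,E,1,08,0.9,545.4,M,46.9,M,,*47",
]

for line in captured:
    print(parse_gga(line))   # -> ('4807.038N', '01131.000E')
```

The point is that the code under test can’t tell whether the sentence came from a real receiver or from a replay buffer, which is exactly what makes the test repeatable.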

Personally I consider mocking to be different from unit tests, although sometimes you may find them together (e.g. in a unit testing framework that supports mocking) and maybe they are presented like they are the same thing. They’re not.

In the common development environment (CDE) image that I’ve also used in the previous posts there’s another Ansible role, which I didn’t mention on purpose, and that’s the cpputest framework, which you’ll find in the role `provisioning/roles/cpputest/tasks/main.yml`. In case you haven’t read the previous posts, you’ll need to clone this repo here to run the image build examples:

There are many testing frameworks for different programming languages, but since this example project template is limited to C/C++ I’ve used cpputest as a reference. Of course, in your case you can create another Ansible role and install the testing framework of your preference. Usually, those frameworks are easy to install and therefore also easy to add to your image. One thing worth mentioning here is that if the testing framework you want to use is included in the package repository of your image (e.g. apt for debian) and you’re going to install it during the image provisioning, keep in mind that this version might change if you re-build your image after some time and the package has been updated in the repo. Of course, the same issue stands for all the repo packages.

In general, if you want to play safe you need to have this in mind. It’s not wrong to use package repositories to install tools you need during provisioning. It’s just that you need to be aware that at some point one of the packages may be updated and introduce a different behavior than the expected one.

System and functional testing

This is where things are getting interesting in embedded. How do you test your system? How do you test the functionality? These are the real problems in embedded and this is where you should spend most of your time when thinking about testing.

First you need to make a plan. You need to write down the specs for testing your project. This is the most important part, and the embedded engineer is invaluable for bringing out the things that need testing (which are not always obvious). The idea behind compiling this list of needed tests is simple, and there are two questions that you have to answer: “what” and “how”. “What” do you need to test? “How” are you going to do that in a simple way?

After you make this list, then there are already several tools and frameworks that can help you on doing this, so let’s have a look.

Other testing frameworks

Oh boy… this is the part where I’ll try to be fair and distance myself from my personal experiences and beliefs. Just a brief story: I’ve evaluated many different testing frameworks and I’m not really satisfied with what’s available out there, as everything seems too complicated or bloated or just doesn’t make sense to me. Anyway, I’ll try not to rant about that a lot, but every time I ended up having to implement my own framework, or part of it, to get the job done.

The problem with all these frameworks in general is that they try to be very generic and do many things at the same time. Sometimes this becomes a problem, especially in cases where you need to solve specific problems that could be solved easily with a bash script, and eventually you spend more time trying to fit the framework to your case rather than having a simple and clean solution. The result many times will be a really ugly hack of the framework’s usage, which is hard to maintain, especially if you haven’t done it yourself and nobody has documented it.

When I refer to testing frameworks in this section I don’t mean unit testing frameworks, but higher level automation frameworks like Robot Framework, LAVA, Fuego Test System, tbot and several others. Please keep in mind that I don’t mean to be disrespectful to any of those frameworks or their users, and especially not their developers. These are great tools and they’re being developed by people that know well what testing is all about (well, they know better than me, tbh). I’m also sure that these frameworks fit excellently in many cases and are valuable tools to use. Personally, I prefer simplicity and tools that do a single thing and don’t have many dependencies; nothing more, nothing less.

As you can see, there are a lot of testing frameworks (many more than the ones I’ve listed). Many companies or developers also make their own testing tools. It’s very important for the developer/architect of the testing to feel comfortable and connected with the testing framework. This is why testing frameworks are important and why there are so many.

In this post I’ll use robot framework, only because I’ve seen many QAs using it for their tests. Personally I find this framework really hard to maintain and I wouldn’t use it myself, because many times it doesn’t make sense, especially for embedded. Generally, I think robot-framework targets web development, and when you integrate it into other domains it can get nasty, because of the syntax, the fact that the delimiters are multiple spaces, and also that by default all variables are strings. Also, sometimes it’s hard to use when you load library classes that have integer parameters and you need to pass those from the command line. Anyway, the current newer version 3.1.x handles that a bit better, but until now it was cumbersome.

Testing stages order and/or parallelization

The testing frameworks that I’ve mentioned above are the ones that are commonly used in CI/CD pipelines. In some CI tools, like gitlab-ci, you’re able to run multiple jobs in a stage, so in the testing stage you may run multiple testing jobs. These jobs may be anything, e.g. unit or functional tests or whatever you want, but it’s important to decide the architecture you’re going to use and whether it makes sense for your pipeline. Also be aware that you can’t run multiple jobs in the same stage if those jobs are all using the same hardware, unless you handle this explicitly. For example, you can’t have two jobs sending I2C commands to the target and expect to get valid results. Therefore, in the test stage you may run multiple jobs with unit tests but only one that accesses the target hardware (that depends on the case, of course, but you get the point).

Also, let’s assume that you have unit tests and functional tests that you need to run. There’s a question that you need to ask yourself: does it make sense to run functional tests before unit tests? Does it make sense to run them in parallel? Probably you’ll think that it’s quite inefficient to run complex and time consuming functional tests if your unit tests fail. Your unit tests probably need less time to run, so why not run them first and then run the functional tests? Also, if you run them in parallel and your unit test fails fast, then your functional test may take a lot of time until it completes, which means that your gitlab-runner (or Jenkins node e.t.c.) will be occupied running a test on an already failed build.

On the other hand, you may prefer to run them in parallel even if one of them fails, so at least you know early whether the other test passes and you don’t have to also make changes there. But this also means that in the next iteration, if your error is not fixed, you’ll lose the same amount of time again. Then again, by running the jobs in parallel, even if one occasionally fails, if you have many successful builds then you save a lot of time in the long term. At the same time, parallel stages mean more runners, but also each runner can be specific to a task, so you have one runner to build the code and another to run functional tests. This means that your builders may be x86_64 machines that use x86_64 toolchains, while the functional test runners are ARM SBCs that don’t need toolchains to build code.
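As a rough sketch of what such a layout could look like in gitlab-ci (the job names, runner tags and make targets here are made up for the example, not taken from the actual repo):

```yaml
# Hypothetical pipeline: unit tests run on a generic x86_64 docker runner,
# functional tests run on an ARM SBC runner that has the DUT attached.
stages:
  - build
  - test

build:
  stage: build
  tags: [docker]
  script: make release

unit-tests:
  stage: test
  tags: [docker]
  script: make unit-tests

functional-tests:
  stage: test
  tags: [sbc-farm]
  script: make functional-tests
```

The two test jobs share a stage, so they run in parallel, but only the `sbc-farm` runner ever touches the hardware, which avoids the shared-hardware conflict mentioned above.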

You see how quickly this escalates, and it gets even deeper than that. Therefore, there are many details that you need to consider. To make it worse, of course, you don’t need to pick one or the other, but you can implement a hybrid solution, or even change your solution in the middle of the project, because a different architecture may make more sense when the project starts, when it’s half-way, and when it’s done and only updates and maintenance are left.

DevOps is (and should be) an alive, flexible and dynamic process in embedded. You need to take advantage of this flexibility and use it in a way that makes more sense in each step of your project. You don’t have to stick to an architecture that worked fine in the beginning of the project, but later it became inefficient. DevOps need to be adaptive and this is why being as much as IaC as possible makes those transitions safer and with lower risk. If your infrastructure is in code and automated then you can bring it back if an architectural core change fails.

Well, for now, if you need one very rough takeaway out of this, then just try to stack up your CI pipeline stages in a logical way that makes sense.

Project example setup

For this post I’ll create a small farm with test servers that will operate on the connected devices under test (DUT). It’s important to make clear at this point that there are many different architectures that you can implement for this scenario, but I will try to cover only two of them. This doesn’t mean that these are the only ways to implement your farm, but I think having two examples to show the different ways that you can do this demonstrates the flexibility and may trigger you to find other architectures that are better for your project.

As in the previous posts I’ll use gitlab-ci to run CI/CD pipelines, therefore I’ll refer to the agents as runners. If it was Jenkins I would refer to them as nodes. Anyway, agents and runners will be used interchangeably. The same goes for stages and jobs when talking about pipelines.

A simple test server

Before getting into the CI/CD part, let’s see what a test server is. The test server is a hardware platform that is able to conduct tests on the DUT (device under test) and also prepare the DUT for those tests. That preparation needs to be done in a way that the DUT is not affected by any previous tests that ran on it. This can be achieved in different ways depending on the DUT, but in this case the “reset” state is when the firmware is flashed, the device is reset and any peripheral connections always have the same, predefined state.
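As a minimal sketch (not code from the actual repo), the “reset” step could look like the function below; `st-flash` is the only real tool used, and the `run` argument is injectable so the sequence can be checked without hardware:

```python
import subprocess


def reset_dut(firmware_path, address="0x8000000", run=subprocess.run):
    """Bring the DUT to a known state: flash the firmware, which also
    resets the MCU afterwards (st-flash --reset). Returns True on success."""
    result = run(["st-flash", "--reset", "write", firmware_path, address])
    return result.returncode == 0
```

On the real test server this would be followed by re-initializing any peripheral connections to their predefined state.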

For this post the DUT is the STM32F103C8T6 (blue-pill), but what should we use as a test server? The short answer is any Linux-capable SBC, but in this case I’ll use the Nanopi K1 Plus and the Nanopi Neo2-LTS. Although those are different SBCs, they share the same important component, which is the Allwinner H5 SoC that integrates an ARM Cortex-A53 CPU. The main differences are the CPU clock and the RAM size, which is 2GB for the K1 and 512MB for the Neo2. The reason I’ve chosen different SBCs is to also benchmark their differences. At the time this post is written, the Neo2 costs $20 and the K1 $35. The K1 may cost almost twice as much, but $35 on the other hand is really cheap for building a farm. You could also use a Raspberry Pi instead, but later on it will be more obvious why I’ve chosen those SBCs.

Choosing a Linux-capable SBC like these comes with great flexibility. First, you don’t need to build your own hardware, which is fine for most of the cases you’ll have to deal with. If your project demands, for some reason, specific hardware, then you may build only the part that is needed as a PCB which is connected to the SBC. The flexibility then extends to the OS that is running on the SBC, which is Linux. Having Linux running on your test server board is great for many reasons that I won’t go into in detail, but the most important is that you can choose from a variety of already existing software and frameworks for Linux.

So, what packages does this test server need to have? Well, that’s up to you and your project specs, but in this case I’ll use Python3, the Robot testing framework, Docker and a few other tools. It’s important to just grasp how this is set up and not which tools are used, because the tools are only important to a specific project; you can use whatever you like. During the next post I’ll explain how those tools are used in each case.

This is a simple diagram of how the test server is connected to the DUT.

The test server (SBC) has various I/Os and ports, like USB, I2C, SPI, GPIOs, OTG e.t.c. The DUT (STM32) is connected to the ST-Link programmer via the SWD interface and the ST-Link is connected via USB to the SBC. Also, the STM32F1 is connected to the SBC via an I2C interface and a GPIO (GPIO A0).

Since this setup is meant to be simple and just a proof of concept, it doesn’t need to implement any complicated task, and you can scale this example to whatever extent you need/want. Therefore, for this example project the STM32 device will implement a naive I2C GPIO expander. The device will have 8 GPIOs and a set of registers that can be accessed via I2C to program each GPIO’s functionality. The test server will run a robot test that will check that the device is connected and then configure a pin as output, set it high, read and verify the pin, then set it low and again read and verify the pin. Each of these tasks is a different test and if all tests pass then the testing is successful.

So, this is our simple example. Although it’s simple, it’s important to focus on the architecture and the procedure and not on the complexity of the system. Your project’s complexity may be much higher, but in the end, if you break things down, everything comes down to this simple case; it’s just many simple cases connected together. Of course, there’s a chance that your DUT can’t be 100% tested automatically for various reasons, but even if you can automate 40-50% of it, it’s a huge gain in time and cost.

Now let’s see the two different test farm architectures we’re going to test in this post.

Multi-agent pipeline

In multi-agent pipelines you just have multiple runners in the same pipeline. The agents may execute different actions in parallel in the same stage/job, or each agent may execute a different job in the pipeline sequence, one after another. Before going into the pros and cons of this architecture, let’s first see how this is realized in our simple example.

Although the above may seem a bit complicated, in reality it’s quite simple. It shows that when a new code commit is made, gitlab-ci starts executing the pipeline. In this pipeline there’s a build stage that compiles the code and runs unit tests on an AWS EC2 instance and, if everything goes well, the artifact (firmware binary) is uploaded to gitlab. In this case, it doesn’t really matter if it’s AWS or any other baremetal server or whatever infrastructure you have for building code. What matters is that the code build process is done on a different agent than the one that tests the DUT.

It’s quite obvious that the next stage, which is the automated firmware testing that includes flashing the firmware and running tests on the real hardware, can’t be done on an AWS EC2 instance. It also can’t be done on your baremetal build servers, which may be anywhere in your building. Therefore, in this stage you need to have another agent that will handle the testing, and that means that you can have a multi-stage/multi-agent pipeline. In this case the roles are very distinct and the test farm can also be somewhere remote.
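In gitlab-ci terms this split is typically expressed with runner tags; here’s a hedged sketch (the tag values and the build script name below are invented for this example):

```yaml
stages:
  - build
  - test

build_firmware:
  stage: build
  tags: [cloud-builder]   # picked up by the EC2/baremetal build runner
  script: ./
  artifacts:
    paths: [build-stm32/]

hardware_tests:
  stage: test
  tags: [test-farm]       # picked up only by the SBC test servers
  script: robot tests/robot-tests/stm32-fw-test.robot
```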

Now let’s see some pros and cons of this architecture.

Pros:

  • Isolation. Since the build servers are isolated, they can be re-used for other tasks while the pipeline waits for the next stage, which is the testing and which may take quite some time.
  • Flexibility. If you use orchestration (e.g. Kubernetes) then you can scale your build servers on demand. The same doesn’t hold for the test servers though.
  • Failure handling. If you use a monitoring tool or an orchestrator (e.g. Kubernetes), then if your build server fails, a new one is spawned and you’re back up and running automatically.
  • Maintenance. This can actually be both a pro and a con, but in this context it means that you maintain your build servers and test servers differently, which some consider better. My cup is half full on this; I think it depends on the use case.
  • Speed. The build servers can be much faster than the test servers, which means that the build stage will finish much faster and the runners will be available for other jobs.

Cons:

  • Maintenance. You’ll need to maintain two different infrastructures: the build servers and the test servers.
  • Costs. It costs more to run both infrastructures at the same time, but the costs may be negligible in real-use scenarios, especially for companies.
  • Different technologies. This is quite similar to the maintenance point, but it also means that you need to be able to integrate all those technologies into your pipeline, which might be a pain for small teams that don’t have the know-how and time to learn to do this properly. If this is not done properly at an early stage, then you may have to deal with a lot of problems later in development, and you may even need to re-structure your solution, which means more time and maybe more money.

These are not the only pros and cons; these are the generic ones, and there are others depending on your specific project. Nevertheless, if the multi-agent pipeline architecture is clear, then you can pretty much extract the pain points you may have in your specific case.

Single-agent pipeline

In the single-agent pipeline there’s only one runner that does everything. The runner will compile the firmware and then will also execute the tests. So the SBC will run the gitlab-runner agent, will have an ARM toolchain for the SoC architecture, and will run all the needed tools, like Docker, robot tests or whatever other testing framework is used. This sounds awesome, right? And it actually is. To visualize this setup see the following image.

In the above image you can see that the test server (in this case the nanopi-k1-plus) is running the gitlab-runner client and picks the build job from gitlab-ci on a new commit. Then it uploads the firmware artifact and runs the tests on the target. For this to happen, the test server needs to have all the needed tools and the STM32 toolchain (for the aarch64 architecture). Let’s see the pros and cons here:

Pros:

  • Maintenance. Only one infrastructure to maintain.
  • Cost. The running costs may be lower when using this architecture.

Cons:

  • Failures. If your agent fails for any reason, hardware or software, then you may need physical presence to fix it. The fix might be as simple as resetting the board or changing a cable, or in the worst case your SBC is dead and you need to replace it. Of course, this may also happen in the previous architecture, but in that case you don’t lose both a test server and a builder at once.
  • Speed. A Cortex-A53 is slower at building software than an x86 CPU (even if that’s a vCPU on a cloud cluster).
  • Support. Currently there are a few aarch64 builds for gitlab, but gitlab doesn’t officially support this architecture. This can also be true for other tools and packages, therefore you need to consider this before choosing your platform.

Again, these are the generic pros/cons and in no way the only ones. There may be more in your specific case, which you need to consider by extracting your project’s pain points.

So, the next question is how do we setup our test servers?

Test server OS image

There are two ways to set up your testing farm as IaC. I’ll only explain the first one briefly, and I’ll use the second one for this post’s example. The first way is to use an already existing OS image for your SBC, like a Debian or Ubuntu distro (e.g. Armbian), and then use a provisioner like Ansible to set up the image. Therefore, you only need to flash the image to the SD card, boot the board and then provision the OS. Let’s see the pros and cons of this solution:

Pros:

  • The image comes with an APT repository that you can use to install whatever you need using Ansible. This makes the image dynamic.
  • You can update your provisioning scripts at a later point and run them on all agents.
  • The provisioning scripts are versioned with git.
  • The provisioning scripts can be re-used on any targeted hardware as long as the OS distro is the same.

Cons:

  • You may need to safely back up the base OS image, as it might be difficult to find it again in the future and the provisioning may fail because of that.
  • The image is bloated with unnecessary packages.
  • With most of those images you first need to do the initial installation steps manually, as they have a loader that expects the user to manually create a new user for safety reasons (e.g. Armbian).
  • Not being a static image may become a problem in some cases, as someone may install packages that make the image unstable or create a different environment from the rest of the agents.
  • You need different images if you have different SBCs in your farm.

The strong argument for using this solution is the existence of an APT repository which comes from a well-established distro (e.g. Debian). In this scenario you would have to download or build your base image with a cross-toolchain, which is not that hard nowadays as there are many tools for that. Then you would need to store this base image for future use. Then flash an SD card for each agent and boot the agent while it’s connected to the network. Then run the Ansible playbook targeting your test farm (which means you need a hosts file that describes your farm). After that you’re ready to go.
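A minimal playbook for this first option might look like the sketch below (the `testfarm` group name and the package list are assumptions; the udev rule file is the one discussed later in this post):

```yaml
# playbook.yml -- run with: ansible-playbook -i hosts playbook.yml
- hosts: testfarm
  become: yes
  tasks:
    - name: Install the test server packages from the distro's APT repo
      apt:
        name: [python3, python3-smbus, docker.io, stlink-tools]
        state: present
        update_cache: yes

    - name: Install the udev rule that gives python access to the GPIOs
      copy:
        src: 60-python-gpio-permissions.rules
        dest: /etc/udev/rules.d/60-python-gpio-permissions.rules
```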

The second option, which I’ll actually use in this post, is to build a custom OS distro using Yocto. That means that your IaC is a meta layer that builds upon a Yocto BSP layer, using the default Poky distro as a base. One of the pros for me in this case (which is not necessarily your case, too) is that I’m the maintainer of the allwinner BSP layer for these SoCs… Anyway, let’s see the generic pros and cons of this option:

Pros:

  • The test-server-specific meta layer can support all the Yocto BSPs. It doesn’t matter if you have an RPi or any of the supported allwinner BSPs; the image that you’ll get in the end will be the same.
  • It’s a static image. That means all the versions will always be the same and it’s more difficult to break your environment.
  • It’s an IaC solution, since it’s a Yocto meta layer which can be hosted in a git repo.
  • It’s fully flexible and you can customize your image during the build without the need for provisioning.
  • Full control over your package repository, if you need one.
  • You can still use Ansible to provision your image even if you don’t use a package repo (of course, you can also build your own package repository using Yocto, which is very simple to do).
  • Yocto meta layers are re-usable.

Cons:

  • Although you can build your own package repository and build ipk, deb and rpm packages for your image, it doesn’t come as easy as using Debian’s APT, and you need to support this infrastructure in the future, too. This of course gives you more control, which is a good thing.
  • If you don’t use a package manager, then you can’t provision your test farm to add new functionality; you need to add this functionality to your Yocto image and then build and re-flash all the test servers. Normally this won’t happen often, but in case you want to play safe, you need to build your own package repository.
  • Yocto is much harder to use for building your image compared to just using Ansible on a standard mainstream image.

As I’ve already mentioned, in this test farm example I’ll use Yocto. You may think that for this example it’s too much, but I believe that it’s a much better solution in the long term; it’s a more proper IaC as everything is code and no binary images or external dependencies are needed. It’s also a good starting point to integrate Yocto into your workflow, as it has become the industry standard, and that’s for good reason. Therefore, integrating this technology into your infrastructure is beneficial not only for the project workflow but also for your general infrastructure, as it provides flexibility, integrates widely used and well-established technologies and expands your knowledge base.

Of course, you always need to consider whether the above fits your case if you’re working in an organisation. I mean, there’s no point in integrating a new stack into your workflow if you can do your job in a simpler way and you’re sure that you’ll never need this new stack in the current project’s workflow (or any near-future project). Always choose the simplest solution that you’re sure fits your skills and the project’s requirements, because this will bring fewer problems in the future and prevent you from using a newly acquired skill or tool in a wrong way. But if you see or feel that the solutions you can provide now can’t scale well, then it’s probably better to spend some time integrating a new technology rather than trying to fix your problems with ugly hacks that may break easily in the future (and probably they will).

Setting up the SBCs (nanopi-neo2 and nanopi-k1-plus)

The first component of the test farm is the test server. In this case I’ll use two different allwinner SBCs in order to show how flexible you can be by using the IaC concept in DevOps. To do that, I’ll build a custom Linux image using Yocto, and the result will be two identical images for two different SBCs. You can imagine how powerful that is, as you have full control over the image and you can also use almost any SBC.

To build the image you’ll need the meta-allwinner-hx layer, which is the SBC layer that supports all those different SBCs. Since I’m the maintainer of the layer, I update it quite often and I try to ensure that it works properly, although I can’t test it on all the supported SBCs. Then you need an extra layer that will sit on top of the BSP and create an image with all the needed tools, provide a pre-configured image that supports the externally connected hardware (e.g. st-link) and also provide the required services and environment for the tests. The source of the custom test server Yocto meta layer that I’m going to use is here:

In the file of this repo there are instructions on how to use this layer, but I’ll also explain how to build the image here. First you need to prepare the build environment; for that you need to create a folder for the image build and a folder for the sources.

mkdir -p yocto-testserver/sources
cd yocto-testserver/sources
git clone --depth 1
cd ../

The last command will git clone the test server meta layer repo into the sources/meta-test-server folder. Now you have two options; the first is to just run this script from the root folder.


This script will prepare everything and will also build the Docker image that is needed to build the Yocto image. That means that the Yocto image for the BSP will not use your host environment; it’s going to be built inside a Docker container. This is pretty much the same as what we did in the previous posts to build the STM32 firmware, but this time we build a custom Linux distribution for the test servers.

The other option to build the image (without the script) is to type each command of the script yourself in your terminal, so you can edit any commands if you like (e.g. the docker image and container name).

Assuming that you’ve built the Docker image and you’re now in the running container, you can run these commands to set up the build environment and then build the image:

DISTRO=allwinner-distro-console MACHINE=nanopi-k1-plus source ./ build
bitbake test-server-image

After this finishes, exit the running container; you can now flash the image to the SD card and check if the test server boots. To flash the image, you first need to find the path of your SD card (e.g. /dev/sdX) and then run this command:

sudo MACHINE=nanopi-k1-plus ./ /dev/sdX

If this command fails because you’re missing bmaptool, then you can install it on your host using apt:

sudo apt install bmap-tools

After you flash the SD card and boot the SBC (in this case the nanopi-k1-plus) with that image, you can just log in as root without any password. To do this you need to connect a USB-to-UART module and map the proper Tx, Rx and GND pins between the USB module and the SBC. Then you can open any serial terminal (for a Linux serial terminal I prefer using putty) and connect at 115200 baud. Then you should be able to view the boot messages and log in as root without a password.

Now, for a sanity check, you can run the `uname -a` command. This is what I get in my case (you’ll get a different output as the post gets older).

root:~# uname -a
Linux nanopi-k1-plus 5.3.13-allwinner #1 SMP Mon Dec 16 20:21:03 UTC 2019 aarch64 GNU/Linux

That means that the Yocto image runs the 5.3.13 SMP kernel and the architecture is aarch64. You can also test that docker, gitlab-runner and the toolchain are there. For example, you can just git clone the STM32 template code I’ve used in the previous posts and build the code. If that builds, then you’re good to go. In my case, I’ve used these commands on the nanopi-k1-plus:

git clone --recursive
cd stm32f103-cmake-template
time TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./

And this is the result I got:

Building the project in Linux environment
- removing build directory: /home/root/stm32f103-cmake-template/build-stm32
--- Pre-cmake ---
architecture      : stm32
distclean         : true
parallel          : 4
[ 95%] Linking C executable stm32-cmake-template.elf
   text	   data	    bss	    dec	    hex	filename
  14924	    856	   1144	  16924	   421c	stm32-cmake-template.elf
[ 95%] Built target stm32-cmake-template.elf
Scanning dependencies of target stm32-cmake-template.hex
Scanning dependencies of target stm32-cmake-template.bin
[100%] Generating stm32-cmake-template.hex
[100%] Generating stm32-cmake-template.bin
[100%] Built target stm32-cmake-template.hex
[100%] Built target stm32-cmake-template.bin

real	0m11.816s
user	0m30.953s
sys	0m3.973s

That means that the nanopi-k1-plus built the STM32 template firmware using 4 threads and the aarch64 GCC toolchain, and this took 11.8 secs. Yep, you read that right. The nanopi with the 4-core ARM Cortex-A53 is by far the slowest when it comes to building the firmware, even compared to AWS EC2. Do you really care about this though? Maybe not, maybe yes. Anyway, I’ll post the benchmark results later in this post.

The next test is to check if the SBC is able to flash the firmware on the STM32. First you can probe the st-link. In my case this is what I get:

root:~/stm32f103-cmake-template# st-info --probe
Found 1 stlink programmers
 serial: 513f6e06493f55564009213f
openocd: "\x51\x3f\x6e\x06\x49\x3f\x55\x56\x40\x09\x21\x3f"
  flash: 65536 (pagesize: 1024)
   sram: 20480
 chipid: 0x0410
  descr: F1 Medium-density device

That means that st-link is properly built and installed, and the udev rules work fine. The next step is to try to flash the firmware with this command:

st-flash --reset write build-stm32/src_stdperiph/stm32-cmake-template.bin 0x8000000

Then you should see something like this in the terminal:

st-flash 1.5.1
2019-12-20T23:56:22 INFO common.c: Loading device parameters....
2019-12-20T23:56:22 INFO common.c: Device connected is: F1 Medium-density device, id 0x20036410
2019-12-20T23:56:22 INFO common.c: SRAM size: 0x5000 bytes (20 KiB), Flash: 0x10000 bytes (64 KiB) in pages of 1024 bytes
2019-12-20T23:56:22 INFO common.c: Attempting to write 15780 (0x3da4) bytes to stm32 address: 134217728 (0x8000000)
Flash page at addr: 0x08003c00 erased
2019-12-20T23:56:24 INFO common.c: Finished erasing 16 pages of 1024 (0x400) bytes
2019-12-20T23:56:24 INFO common.c: Starting Flash write for VL/F0/F3/F1_XL core id
2019-12-20T23:56:24 INFO flash_loader.c: Successfully loaded flash loader in sram
 16/16 pages written
2019-12-20T23:56:25 INFO common.c: Starting verification of write complete
2019-12-20T23:56:25 INFO common.c: Flash written and verified! jolly good!

That means that the SBC managed to flash the firmware on the STM32. You can visually verify this by checking that the LED on gpio C13 is toggling every 500ms.

Great. By doing those steps we’ve verified that we’ve built a Docker image that can build the Yocto image for the nanopi-k1-plus, and that after booting the image, all the tools that are needed to build and flash the STM32 firmware are there and work fine.

The above sequence can be integrated into a CI pipeline that builds the image whenever there’s a change in the meta-test-server layer. Therefore, when you add more tools or change things in the image and push the changes, the pipeline builds the new Linux image that you can flash to the SD cards of the SBCs. The MACHINE parameter in the build command can be a parameter in the pipeline, so you can build different images for any SBC that you have. This makes things simple and automates your test farm infrastructure (in terms of the OS).
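For instance, the image build job could take MACHINE from a job matrix, so one pipeline produces an image per SBC (a sketch only; the job name is made up, and `parallel: matrix` requires a reasonably recent gitlab):

```yaml
build_image:
  stage: build
  parallel:
    matrix:
      - MACHINE: [nanopi-k1-plus, nanopi-neo2]
  script:
    # same commands as in the manual build, but with MACHINE parameterized
    - DISTRO=allwinner-distro-console MACHINE=$MACHINE source ./ build
    - bitbake test-server-image
```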

Note: You can also connect via ssh as root to the image without a password if you know the IP. Since DHCP is on by default for the image, if you don’t want to use a USB-to-serial module you can assign a static IP to the SBC in your router using its MAC address.

You may wonder here why I’ve used a Dockerfile to build the image that builds the Yocto image, instead of using Packer and Ansible like I did in the previous posts. Well, the only reason is that the Dockerfile comes already with the meta-allwinner-hx repo and it’s well tested and ready to be used. You can definitely use Packer to build your custom docker image that builds your Yocto image, but is that what you really want in this case? Think of the distributed Dockerfile that comes with the meta-allwinner-hx Yocto layer as what your vendor or distributor may supply you with for another SBC. In this case it’s always better to use what you get from the vendor, because you will probably get support when something goes wrong.

Therefore, don’t always create everything from scratch, but use whatever is available that makes your life easier. DevOps is not about creating everything from scratch, but about using whatever is simpler for your case.

Also in this case, there’s something else that you need to consider. Since I don’t have powerful build servers to build Yocto images, I have to do it on my desktop workstation. Building a Yocto image may require more than 16GB of RAM and 50GB of disk space, and 8 cores are pretty much the minimum.

Also, using AWS EC2 to build a Yocto image doesn’t make sense for this case either, since this would require a very powerful EC2 instance, a lot of time and a lot of space.

So, in your specific case you’ll need quite a powerful build server if you want to integrate the Yocto image build into your CI/CD (which you should). For this post, though, I’ll use my workstation to build the image and I’ll skip the details of creating a CI/CD pipeline for building the Yocto image, as it’s pretty much the same as any other CI/CD case, except that you’ll need a more powerful agent.

In case you need to have the Yocto docker image in a registry, then use Packer to build those images and upload them to a docker registry, or make a CI/CD pipeline that just builds the Yocto builder image from a Dockerfile (in this case the one in the meta-allwinner-hx repo) and then uploads that image to your remote/local registry. There are many possibilities; choose whichever is the best and easiest for you.

Again, the specific details for this test case are not important. What’s important is the procedure and the architecture.

Setting up the DUT

Now that you’ve set up your test server, it’s time to set up your DUT (the stm32f103 blue pill in this case). What’s important is how you connect the SBC and the STM32, as you need to power the STM32 and also connect the proper pins. The following table shows the connections you need to make (the pin numbers for the nanopi-neo2 are here and for the nanopi-k1-plus here).

SBC (nanopi-neo2/k1-plus)      STM32F103 (blue pill)
pin # : [function]             pin #
1 : [SYS_3.3V]                 3V3
3 : [I2C0_SDA]                 PB7
5 : [I2C0_SCL]                 PB6
6 : [GND]                      GND
7 : [GPIOG11]                  PA0

The ST-LINK USB module is connected to any available USB port of your SBC (the nanopi-neo2 has only one). Then you need to connect the pins of the module to the STM32. That’s straightforward: just connect the RST, SWDIO, SWCLK and GND of the module to the same pins on the STM32 (the SWDIO, SWCLK and GND are in the connector on the back, and only RST is near PB11). Do not connect the 3V3 power of the ST-LINK module to the STM32, as it’s already getting power from the SBC’s SYS_3.3V.

Since we’re using the I2C bus, we need to use two external pull-up resistors on the PB7 and PB6 pins. Just connect those two resistors between the pins and one of the available 3V3 pins of the STM32.

Finally, for debugging purposes you may want to connect the serial ports of the SBC and the STM32 to two USB-to-UART modules. I’ve done this for debugging; it’s generally useful while you’re setting up your farm, and when everything is working as expected you can remove it and only re-connect it when troubleshooting an agent in your farm.

As I’ve mentioned before, I won’t use the same STM32 template firmware I’ve used in the previous posts, but this time I’ll use a firmware that actually does something. The firmware code is located here:

Although the README file contains a lot of info about the firmware, I’ll give a brief description of it. This firmware makes the stm32f103 function as an I2C GPIO expander, like the ones that are used by many Linux SBCs. The MCU is connected to the SBC via I2C and a GPIO; the SBC is able to configure the stm32 pins using the I2C protocol and then also control the GPIO. The protocol is quite simple and it’s explained in detail in the README.

If you want to experiment with building the firmware locally on your host or the SBC, you can do so to get a bit familiar with the firmware, but it’s not necessary, as from now on everything will be done by the CI pipeline.

Automated tests

One last thing before we get to testing the farm is to explain what tests are in the firmware repo and how they are used. Let’s go back again to the firmware repo here:

In this repo you’ll see a folder named tests. In there you’ll find all the test scripts that the CI pipeline will use. The pipeline configuration yaml file is in the root folder of the repo and it’s named .gitlab-ci.yml, as in the previous posts. In this file you’ll see that there are 3 stages: build, flash and test. The build stage will build the code, the flash stage will flash the binary on the STM32 and finally the test stage will run a test using the robot-framework. Although I’m using robot-framework here, that could be any other framework or even plain scripts.

The robot test is the `tests/robot-tests/stm32-fw-test.robot` file. If you open this file you’ll see that it uses two Libraries. Those libraries are just python classes located in `tests/`; specifically it’s calling `tests/` and `tests/`. Open those files and have a look in there, too.

The `tests/` class implements the I2C-specific protocol that the STM32 firmware supports to configure the GPIOs. This class uses the python smbus package, which is installed by the meta-test-server Yocto layer. You’ll also see that this object supports probing the I2C bus to find if there’s an STM32 device running the firmware. This is done by reading the 0x00 and 0x01 registers that contain the magic word 0xBEEF (sorry, no vegan hex). Therefore, this function will read those two registers and if 0xBEEF is found, it means that the board is there and running the correct firmware. You could also add a firmware version (which you should), but I haven’t in this case. You’ll also find functions to read/write registers, set the configuration of the available GPIOs and set the pin values.
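The probe logic boils down to something like the following sketch (this is not the repo’s actual class; the I2C address and which register holds which byte of 0xBEEF are assumptions for illustration, and `bus` mimics the smbus.SMBus interface so a fake bus can be used off-target):

```python
MAGIC = 0xBEEF  # magic word expected in registers 0x00 and 0x01

class Stm32GpioExpanderSketch:
    """Hedged re-sketch of the probe; not the repo's actual class."""
    def __init__(self, bus, addr=0x08):  # 0x08: assumed I2C slave address
        self.bus = bus                   # real use: smbus.SMBus(0)
        self.addr = addr

    def probe(self):
        # Read the two magic-word registers and compare against 0xBEEF.
        high = self.bus.read_byte_data(self.addr, 0x00)  # assumed high byte
        low = self.bus.read_byte_data(self.addr, 0x01)   # assumed low byte
        return ((high << 8) | low) == MAGIC
```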

Next, the `tests/` class exposes a python interface to configure and control the SBC’s GPIOs. Have in mind that when you use the SBC’s GPIOs via sysfs in Linux, you need some additional udev rules in your OS for python to have the proper permissions. This rule is currently provided by the meta-allwinner-hx BSP layer in the file `meta-allwinner-hx/recipes-extended/udev/udev-python-gpio/60-python-gpio-permissions.rules`. Therefore, in case you use your own image, you need to add this rule to your udev rules, otherwise the GPIOs won’t be accessible with this python class.

The rest of the python files are just helpers; the above two files are the important ones. Back in the robot test script you can see the test cases, which I list here for convenience.

*** Test Cases ***
Probe STM32
     ${probe} =     stm32.probe
     Log  STM32GpioExpander exists ${probe}

Config output pin
     ${config} =    stm32.set_config    ${0}  ${0}  ${0}  ${1}  ${0}  ${0}
     Log  Configured gpio[0] as output as GPOA PIN0

Set stm32 pin high
     ${pin} =  stm32.set_pin  ${0}  ${1}
     Log  Set gpio[0] to HIGH

Read host pin
     host.set_config     ${203}  ${0}  ${0}
     ${pin} =  host.read_pin  ${203}
     Should Be Equal     ${pin}  1
     Log  Read gpio[0] value: ${pin}

Reset stm32 pin
     ${pin} =  stm32.set_pin  ${0}  ${0}
     Log  Set gpio[0] to LOW

Read host pin 2
     host.set_config     ${203}  ${0}  ${0}
     ${pin} =  host.read_pin  ${203}
     Should Be Equal     ${pin}  0
     Log  Read gpio[0] value: ${pin}

The first test is called “Probe STM32” and it just probes the STM32 device and reads the magic word (or the preamble if you prefer). This actually calls the STM32GpioExpander.probe() function, which returns True if the device exists, otherwise False. The next test, “Config output pin”, configures the STM32’s PA0 pin as output. “Set stm32 pin high” then sets PA0 to HIGH. “Read host pin” configures the SBC’s GPIOG11 (pin #203) as input and then reads whether the input, which is connected to the STM32 output PA0, is indeed HIGH. If the input reads “1” then the test passes. “Reset stm32 pin” sets PA0 back to LOW (“0”). Finally, “Read host pin 2” reads the GPIOG11 pin again and verifies that it’s LOW.

If all tests pass then it means that the firmware is flashed, the I2C bus is working, the protocol is working for both sides and the GPIOs are working properly, too.

Have in mind that the voltage levels of the SBC and the STM32 must be the same (3V3 in this case), so don’t connect a 5V device to the SBC; and if you do, you need to use a voltage level shifter IC. You may also want to connect a 5-10KΩ resistor in series between the two pins, just in case. I didn’t, but you should :p

From the .gitlab-ci.yml you can see that the robot command is the following:

robot --pythonpath . --outputdir ./results -v I2C_SLAVE:0x08 -v BOARD:nanopi_neo2 -t "Probe STM32" -t "Config output pin" -t "Set stm32 pin high" -t "Read host pin" -t "Reset stm32 pin" -t "Read host pin 2" ./robot-tests/

Yes, this is a long command. It passes the I2C slave address as a parameter, the board parameter and the list of tests to run. The I2C address can be omitted, as it’s hard-coded in the robot test for backwards compatibility: in pre-3.1.x versions all parameters are passed as strings, which doesn’t work with the STM32GpioExpander class. The robot command will only return 0 (which means no errors) if all tests pass.
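To illustrate the string issue: Robot passes command-line variables like I2C_SLAVE:0x08 to library keywords as strings in those versions. A defensive way to handle this in a library class (a hypothetical helper, not taken from the repo) is to normalize the value with int(x, 0), which auto-detects the 0x prefix:

```python
def to_int(value):
    """Accept 8, "8" or "0x08" and always return a plain integer."""
    # int(s, 0) parses "0x.." as hex and "8" as decimal
    return value if isinstance(value, int) else int(str(value), 0)

print(to_int("0x08"))  # 8
print(to_int("8"))     # 8
print(to_int(8))       # 8
```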

A couple of notes here. This robot test will run all tests in sequence and it will continue to the next test even if the previous one fails. You may not like this behavior, so you can change it by calling robot multiple times, once for each test, and skipping the rest if one fails. Anyway, at this point you know what the best scenario for you is, just have that in mind when you write your tests.
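The fail-fast alternative can be sketched like this. The run parameter is injectable only so the control flow can be shown without robot installed; robot itself returns a non-zero exit code when a test fails. The exact robot arguments here are shortened for illustration:

```python
import subprocess

# Run robot once per test case and stop at the first failure
TESTS = ["Probe STM32", "Config output pin", "Set stm32 pin high",
         "Read host pin", "Reset stm32 pin", "Read host pin 2"]

def run_one(name):
    return subprocess.call(["robot", "--pythonpath", ".", "--outputdir",
                            "./results", "-t", name, "./robot-tests/"])

def run_sequence(tests, run=run_one):
    """Run tests in order; return the first failing test name, or None."""
    for name in tests:
        if run(name) != 0:
            return name  # skip the remaining tests
    return None
```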

Testing the farm!

When everything is ready you just need to push a change to your firmware repo or manually trigger the pipeline. At that point you have your farm up and running. This is a photo of my small testing farm.

On the left side is the nanopi-k1-plus and on the right side the nanopi-neo2. Both run the same Yocto image, have an internet connection, have an ST-LINK v2 connected on the USB port, and the I2C and GPIO signals are connected between the SBC and the STM32. Also, the two UART ports of the SBCs are connected to my workstation for debugging. Although you only see two test servers in the photo, you could have as many as you like in your farm.

The next step is to make sure that the gitlab-runner is running on both SBCs and that they’re registered as runners in the repo. Normally, the gitlab-runner is a service in the Yocto image that runs a script which makes sure that if the SBC is not registered, it registers itself automatically to the project. To make this happen, you need to add two lines in your local.conf file when you build the Yocto image.

GITLAB_REPO = "stm32f103-i2c-io-expander"
GITLAB_TOKEN = "4sFaXz89yJStt1hdKM9z"

Of course, you should replace those values with yours. Just as a reminder (I’ve explained in the previous post how to get those values), GITLAB_REPO is just your repo name and GITLAB_TOKEN is the token that you get in the “Settings -> CI/CD -> Runners” tab in your gitlab repo.

In case the gitlab-runner doesn’t run on your SBC for any reason, just run this command in your SBC’s terminal


In my case everything seems to be working fine, therefore in my gitlab project I can see this in the “Settings -> CI/CD -> Runners” page.

Note: You need to disable the shared runners in the “Settings -> CI/CD -> Runners” page for this to work, otherwise any random runner will pick the job and fail.

The first one is the nanopi-k1-plus and the second one is the nanopi-neo2. You can verify those hashes against the token parameter in the /etc/gitlab-runner/config.toml file on each SBC.

Next, I’m triggering the pipeline manually to see what happens. In the gitlab web interface (after some failures until I got the pipeline up and running) I see that there are 3 stages and the build stage has started. The first job is picked by the nanopi-neo2 and you can see the log here. This is part of the log:

1 Running with gitlab-runner 12.6.0~beta.2049.gc941d463 (c941d463)
2   on nanopi-neo2 _JXcYZdJ
3 Using Shell executor... 00:00
5 Running on nanopi-neo2... 00:00
7 Fetching changes with git depth set to 50... 00:03
8 Reinitialized existing Git repository in /home/root/builds/_JXcYZdJ/0/dimtass/stm32f103-i2c-io-expander/.git/
9 From
10  * [new ref]         refs/pipelines/106633593 -> refs/pipelines/106633593

141 [100%] Built target stm32f103-i2c-io-expander.bin
142 [100%] Built target stm32f103-i2c-io-expander.hex
143 real	0m14.413s
144 user	0m35.717s
145 sys	0m4.805s
148 Creating cache build-cache... 00:01
149 Runtime platform                                    arch=arm64 os=linux pid=4438 revision=c941d463 version=12.6.0~beta.2049.gc941d463
150 build-stm32/src: found 55 matching files           
151 No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally. 
152 Created cache
154 Uploading artifacts... 00:03
155 Runtime platform                                    arch=arm64 os=linux pid=4463 revision=c941d463 version=12.6.0~beta.2049.gc941d463
156 build-stm32/src/stm32f103-i2c-io-expander.bin: found 1 matching files 
157 Uploading artifacts to coordinator... ok            id=392515265 responseStatus=201 Created token=dXApM1zs
159 Job succeeded

This means that the build succeeded, it took 14.413s and the artifact was uploaded. It’s also interesting that the runtime platform is arch=arm64, because the gitlab-runner is running on the nanopi-neo2, which has an ARM 64-bit CPU.

Next stage is the flash stage and the log is here. This is a part of this log:

1 Running with gitlab-runner 12.6.0~beta.2049.gc941d463 (c941d463)
2   on nanopi-neo2 _JXcYZdJ
3 Using Shell executor... 00:00
5 Running on nanopi-neo2... 00:00
7 Fetching changes with git depth set to 50... 00:03
8 Reinitialized existing Git repository in /home/root/builds/_JXcYZdJ/0/dimtass/stm32f103-i2c-io-expander/.git/
9 Checking out bac70245 as master...

30 Downloading artifacts for build (392515265)... 00:01
31 Runtime platform                                    arch=arm64 os=linux pid=4730 revision=c941d463 version=12.6.0~beta.2049.gc941d463
32 Downloading artifacts from coordinator... ok        id=392515265 responseStatus=200 OK token=dXApM1zs
34 $ st-flash --reset write build-stm32/src/stm32f103-i2c-io-expander.bin 0x8000000 00:02
35 st-flash 1.5.1
36 2020-01-02T14:41:27 INFO common.c: Loading device parameters....
37 2020-01-02T14:41:27 INFO common.c: Device connected is: F1 Medium-density device, id 0x20036410
38 2020-01-02T14:41:27 INFO common.c: SRAM size: 0x5000 bytes (20 KiB), Flash: 0x10000 bytes (64 KiB) in pages of 1024 bytes
39 2020-01-02T14:41:27 INFO common.c: Attempting to write 12316 (0x301c) bytes to stm32 address: 134217728 (0x8000000)
40 Flash page at addr: 0x08003000 erased
41 2020-01-02T14:41:28 INFO common.c: Finished erasing 13 pages of 1024 (0x400) bytes
42 2020-01-02T14:41:28 INFO common.c: Starting Flash write for VL/F0/F3/F1_XL core id
43 2020-01-02T14:41:28 INFO flash_loader.c: Successfully loaded flash loader in sram
44  13/13 pages written
45 2020-01-02T14:41:29 INFO common.c: Starting verification of write complete
46 2020-01-02T14:41:29 INFO common.c: Flash written and verified! jolly good!
51 Job succeeded

Nice, right? The log shows that this stage cleaned up the previous job, downloaded the firmware artifact and then used the st-flash tool and the ST-LINKv2 to flash the binary to the STM32, which was successful.

Finally, the last stage is the run_robot stage and the log is here. This is part of this log:

1 Running with gitlab-runner 12.6.0~beta.2049.gc941d463 (c941d463)
2   on nanopi-neo2 _JXcYZdJ
3 Using Shell executor... 00:00
5 Running on nanopi-neo2...

32 Downloading artifacts for build (392515265)... 00:01
33 Runtime platform                                    arch=arm64 os=linux pid=5313 revision=c941d463 version=12.6.0~beta.2049.gc941d463
34 Downloading artifacts from coordinator... ok        id=392515265 responseStatus=200 OK token=dXApM1zs
36 $ cd tests/ 00:02
37 $ robot --pythonpath . --outputdir ./results -v I2C_SLAVE:0x08 -v BOARD:nanopi_neo2 -t "Probe STM32" -t "Config output pin" -t "Set stm32 pin high" -t "Read host pin" -t "Reset stm32 pin" -t "Read host pin 2" ./robot-tests/
38 ==============================================================================
39 Robot-Tests                                                                   
40 ==============================================================================
41 Robot-Tests.Stm32-Fw-Test :: This test verifies that there is an STM32 board  
42 ==============================================================================
43 Probe STM32                                                           | PASS |
44 ------------------------------------------------------------------------------
45 Config output pin                                                     | PASS |
46 ------------------------------------------------------------------------------
47 Set stm32 pin high                                                    | PASS |
48 ------------------------------------------------------------------------------
49 Read host pin                                                         | PASS |
50 ------------------------------------------------------------------------------
51 Reset stm32 pin                                                       | PASS |
52 ------------------------------------------------------------------------------
53 Read host pin 2                                                       | PASS |
54 ------------------------------------------------------------------------------
55 Robot-Tests.Stm32-Fw-Test :: This test verifies that there is an S... | PASS |
56 6 critical tests, 6 passed, 0 failed
57 6 tests total, 6 passed, 0 failed
58 ==============================================================================
59 Robot-Tests                                                           | PASS |
60 6 critical tests, 6 passed, 0 failed
61 6 tests total, 6 passed, 0 failed
62 ==============================================================================
63 Output:  /home/root/builds/_JXcYZdJ/0/dimtass/stm32f103-i2c-io-expander/tests/results/output.xml
64 Log:     /home/root/builds/_JXcYZdJ/0/dimtass/stm32f103-i2c-io-expander/tests/results/log.html
65 Report:  /home/root/builds/_JXcYZdJ/0/dimtass/stm32f103-i2c-io-expander/tests/results/report.html
70 Job succeeded

Great. This log shows that the robot test ran on the SBC and all the tests were successful. Therefore, because all the tests passed, I get this:

Green is the favorite color of DevOps.

Let’s recap what was achieved. I’ve mentioned earlier that I’ll test two different architectures. This is the second one, in which the test server is also the agent that builds the firmware and then performs the firmware flashing and testing. This is a very powerful architecture, because a single agent runs all the different stages; therefore it’s easy to add more agents without worrying about adding more builders in the cloud. On the other hand, the build stage is slower on the SBC, and it would be much easier to add more builders in the cloud rather than adding SBCs. Of course, cloud builders are only capable of building the firmware faster, so the bottleneck will always be the flashing and testing stages.

What you should keep from this architecture example is that everything is included in a single runner (or agent or node). There are no external dependencies and everything is running in one place. You will decide if that’s good or bad for your test farm.

Testing the farm with cloud builders

Next I’ll implement the first architecture example that I’ve mentioned earlier, which is that one:

In this case I’ll create an AWS EC2 instance to build the firmware, the same way I did in the previous post here. The other two stages (flashing and testing) will be executed on the SBCs. So how do you do that?

Gitlab CI supports tags for the runners, so you can specify which stage runs on which runner. You can have multiple runners with different capabilities, and each runner type can have a custom tag that defines its capabilities. Therefore, I’ll use a different tag for the AWS EC2 instance and the SBCs, and each stage in the .gitlab-ci.yml will execute on runners with the specified tag. This is a very powerful feature, as you can have agents with different capabilities that serve multiple projects.

Since I don’t want to change the default .gitlab-ci.yml pipeline in my repo, I’ll post the yaml file you need to use here:

image:
    name: dimtass/stm32-cde-image:0.1
    entrypoint: [""]

variables:
    GIT_SUBMODULE_STRATEGY: recursive

stages:
    - build
    - flash
    - test

build:
    tags:
        - stm32-builder
    stage: build
    script: time TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON USE_DBGUART=ON SRC=src ./
    cache:
        key: build-cache
        paths:
            - build-stm32/src
    artifacts:
        paths:
            - build-stm32/src/stm32f103-i2c-io-expander.bin
        expire_in: 1 week

flash:
    tags:
        - test-farm
    stage: flash
    script: st-flash --reset write build-stm32/src/stm32f103-i2c-io-expander.bin 0x8000000
    cache:
        key: build-cache

run_robot:
    tags:
        - test-farm
    stage: test
    script:
        - cd tests/
        - robot --pythonpath . --outputdir ./results -v I2C_SLAVE:0x08 -v BOARD:nanopi_neo2 -t "Probe STM32" -t "Config output pin" -t "Set stm32 pin high" -t "Read host pin" -t "Reset stm32 pin" -t "Read host pin 2" ./robot-tests/

I’ll use this temporarily in the repo for testing, but you won’t find this file, as I’ll revert it back.

In this file you can see that each stage now has a tags entry that lists which runners can execute it. The Yocto image doesn’t set tags for the runners, therefore for this test I’ll remove the previous runners using the web interface and then run these commands on each test server:

gitlab-runner verify --delete
gitlab-runner register

Then you need to enter each parameter like that:

Please enter the gitlab-ci coordinator URL (e.g.
Please enter the gitlab-ci token for this runner:
Please enter the gitlab-ci description for this runner:
[nanopi-neo2]: nanopi-neo2
Please enter the gitlab-ci tags for this runner (comma separated):
Registering runner... succeeded                     runner=4sFaXz89
Please enter the executor: parallels, shell, ssh, virtualbox, docker+machine, docker-ssh+machine, docker, docker-ssh, kubernetes, custom:
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!

From the above the important bits are the description (in this case nanopi-neo2) and the tags (in this case test-farm).

Next, I did the same on the nanopi-k1-plus. Please ignore the manual steps here; these will be scripted or included in the Yocto image. This is only for testing and demonstrating the architecture.

Finally, to build the AWS EC2 instance I ran these commands:

git clone
cd stm32-cde-template
packer build stm32-cde-aws-gitlab-runner.json
ln -sf Vagrantfile_aws Vagrantfile

Now edit the Vagrantfile and fill in the aws_keypair_name and aws_ami_name, according to the ones you have in your web EC2 management console (I’ve also explained those steps in the previous post here).

vagrant up
vagrant ssh
ps -A | grep gitlab

The last command will display a running instance of the gitlab-runner, but in this case it’s registered to the stm32-cmake-template project from the previous post! That’s because this image was made for that project. Normally I would need to change the Ansible scripts to point to the new repo, but for the demonstration I’ll just stop the instance, remove the runner and add a new one that points to this post’s repo.

So, first I killed the gitlab-runner process in the AWS image like this:

ubuntu@ip-172-31-43-158:~$ ps -A | grep gitlab
 1188 ?        00:00:00 gitlab-runner
ubuntu@ip-172-31-43-158:~$ sudo kill -9 1188
ubuntu@ip-172-31-43-158:~$ gitlab-runner verify --delete
ubuntu@ip-172-31-43-158:~$ gitlab-runner register

And then I’ve filled the following:

Please enter the gitlab-ci coordinator URL (e.g.
Please enter the gitlab-ci token for this runner:
Please enter the gitlab-ci description for this runner:
[ip-172-31-43-158]: aws-stm32-agent
Please enter the gitlab-ci tags for this runner (comma separated):
Registering runner... succeeded                     runner=4sFaXz89
Please enter the executor: custom, docker, docker-ssh, shell, virtualbox, docker+machine, kubernetes, parallels, ssh, docker-ssh+machine:
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded! 

The important bits here are again the description, which is `aws-stm32-agent`, and the tag, which is `stm32-builder`. So you see that the tag of this runner is the same as the tag of the build stage in the .gitlab-ci.yml script, and the tags of the SBCs are the same as the tags of the flash and test stages.

That’s it! You can now verify the above in the “Settings -> CI/CD -> Runners” page in the repo. In my case I see this:

In the image above you see that now each runner has a different description and that the nanopi-neo2 and nanopi-k1-plus have the same tag, which is test-farm. You can of course do this with a script similar to the `meta-test-server/recipes-testtools/gitlab-runner/gitlab-runner/` in the meta-test-server image.

Here is the successful CI pipeline in which the build was done on the AWS EC2 instance and the flash and test stages were executed on the nanopi-k1-plus. You can see from this log here that the build was done on AWS; I’m pasting a part of the log:

1 Running with gitlab-runner 12.6.0 (ac8e767a)
2   on aws-stm32-agent Qyc1_zKi
3 Using Shell executor... 00:00
5 Running on ip-172-31-43-158... 00:00
7 Fetching changes with git depth set to 50... 00:01
8 Reinitialized existing Git repository in /home/ubuntu/builds/Qyc1_zKi/0/dimtass/stm32f103-i2c-io-expander/.git/

140 real	0m5.309s
141 user	0m4.248s
142 sys	0m0.456s
145 Creating cache build-cache... 00:00
146 Runtime platform                                    arch=amd64 os=linux pid=3735 revision=ac8e767a version=12.6.0
147 build-stm32/src: found 55 matching files           
148 No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally. 
149 Created cache
151 Uploading artifacts... 00:03
152 Runtime platform                                    arch=amd64 os=linux pid=3751 revision=ac8e767a version=12.6.0
153 build-stm32/src/stm32f103-i2c-io-expander.bin: found 1 matching files 
154 Uploading artifacts to coordinator... ok            id=393617901 responseStatus=201 Created token=yHW5azBz
156 Job succeeded

In line 152, you see that the arch is amd64, which is correct for the AWS EC2 image. Next, here and here are the logs for the flash and test stages. In those logs you can see that the runner name is now nanopi-k1-plus and the arch is arm64.

Also in this pipeline here, you see from this log here that the same AWS EC2 instance as before did the firmware build. Finally, from the logs here and here, you can verify that the nanopi-neo2 runner flashed its attached STM32 and ran the robot test on it.

Now I’ll try to explain the whole process in a simple way. First, the pipeline was triggered from the web interface. Then the gitlab server parsed the .gitlab-ci.yml and found that there are 3 stages and that each stage has a tag. Therefore, it stopped at the first stage and waited for any runner with that tag to poll the server. Let’s assume that at some point the gitlab-runner on one of the SBCs polls the server first (by default every runner polls every 3 secs). When the SBC connected to the server, it posted the repo name, the token, a few other details and also the runner tag. The server compared the tag of the runner and the stage, saw that they’re different, so it didn’t execute the stage on that runner and closed the connection.

After 1-2 secs the gitlab-runner on the AWS EC2 instance, which has the same tag as the build stage, connects to the gitlab server. The server verifies both the runner and the tag and then starts executing commands on the runner via a tunnel. First it git clones the repo to a temporary folder on the AWS EC2 instance storage and then executes the script line. The firmware is built and the server collects the artifact that was requested in the yaml file (see: artifacts in the build stage in .gitlab-ci.yml).

Now that the build stage is successful and the artifact is on the server, the server initiates the next stage and waits for any runner with the test-farm tag. Of course, the EC2 instance can’t execute this stage as it has a different tag. At some point one of the SBCs connects, and the server takes control of the SBC and executes the flash stage via a tunnel. Because there is an ST-LINKv2 USB module connected to the SBC and the STM32, the server uploads the firmware artifact to the SBC, which then flashes it to the STM32.

Finally, in the test stage the server uses the same runner, since it has the proper tag, and executes the robot test via the tunnel between the server and the runner. All tests pass and the success result is returned.
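The tag matching described above can be sketched as a toy dispatch loop. This is only a model of the behavior, not how the gitlab server is actually implemented; the runner names and tags match this post’s setup:

```python
# The server only hands a stage's job to a polling runner whose tags
# include the stage's tag.
RUNNERS = {
    "aws-stm32-agent": {"stm32-builder"},
    "nanopi-neo2": {"test-farm"},
    "nanopi-k1-plus": {"test-farm"},
}

def pick_runner(stage_tag, polling_order):
    """Return the first polling runner whose tags match the stage tag."""
    for name in polling_order:
        if stage_tag in RUNNERS[name]:
            return name
    return None  # keep waiting for a matching runner

# Even if an SBC polls first, the build stage waits for the EC2 runner:
print(pick_runner("stm32-builder", ["nanopi-neo2", "aws-stm32-agent"]))
# The flash/test stages can land on any SBC with the test-farm tag:
print(pick_runner("test-farm", ["aws-stm32-agent", "nanopi-k1-plus"]))
```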

That’s it. This is pretty much the whole process that happens in the background. Of course, the same process happened in the first architecture, with the only difference that because there were no tags in the stages and runners, all the stages ran on the same runner.


Let’s see a few numbers now. The interesting benchmark is the build time on the AWS EC2 instance and the two nanopi SBCs. The following table shows how much time the build took in each case.

Runner                              Time (secs)
AWS EC2 (t2.micro – 1 core)         5.345
Nanopi-neo2 (4 cores)               14.413
Nanopi-k1-plus (4 cores)            12.509
Ryzen 2700X (8-cores/16-threads)    1.914

As you can see from the above table, the t2.micro x86_64 instance is a far faster builder than the nanopi SBCs and, of course, my workstation (Ryzen 2700X) is even faster. Have in mind that this is a very simple firmware that builds quite fast anyway. You need to consider this when you decide your CI pipeline architecture: is build performance important for you?
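Just to put the numbers above in perspective, here is the same table expressed as ratios against the t2.micro time (plain arithmetic on the table values, nothing more):

```python
# Build times from the table, in seconds
times = {
    "AWS EC2 t2.micro": 5.345,
    "Nanopi-neo2": 14.413,
    "Nanopi-k1-plus": 12.509,
    "Ryzen 2700X": 1.914,
}

baseline = times["AWS EC2 t2.micro"]
for name, t in sorted(times.items(), key=lambda kv: kv[1]):
    print("%-18s %7.3fs  (%.2fx the t2.micro time)" % (name, t, t / baseline))
```

So the SBCs take roughly 2.3-2.7 times longer than the t2.micro for this simple firmware, a gap that would grow with a bigger codebase.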


That was a long post! If I had to condense all my conclusions from this post series into one sentence, it would be this:

DevOps for Embedded is tough!

I think I’ve made a lot of comments during these posts that support this, and the truth is that this was just a simple example. Of course, most of the things I’ve demonstrated are the same regardless of the complexity of the project. Where it gets really complicated is not the CI/CD architecture, as you have many available options and tools; there’s a tool for everything you might need! Of course, you need to be careful with this… Only use what is simple to understand and maintain in the future, and avoid complicated solutions and tools. Also avoid using a ton of different tools to do something that you can do with a few lines of code in any programming language.

Always go for the simplest solution, and the one that seems able to scale to your future needs. Do not assume that you won’t need to scale up in the future, even for the same project. Also do not assume that you have to find a single architecture for the whole project life, as this might change during the different project phases. This is why you need to follow one of the main best practices of DevOps, which is IaC. All your infrastructure needs to be code, so you can make changes easily and fast and be able to revert back to a known working state if something goes wrong. That way you can change your architecture easily, even mid-project.

I don’t know what your opinion is after digesting all the information in these posts, but personally I see room for cloud services in the embedded domain. You can use builders in the cloud and you can also run unit tests and mocks in the cloud. I believe it’s easier to maintain this kind of infrastructure and also to scale it up as you like using an orchestrator like Kubernetes.

On the other hand, cloud agents can’t run tests on the real hardware and be part of a testing farm. In that case your test farm needs real hardware, like the examples I’ve demonstrated in this post. Running a testing farm is a huge and demanding task, though, and the bigger the farm, the more you need to automate everything. Automation needs to be everywhere. You saw how many things are involved; even the OS image needs to be part of the IaC. Use Yocto or buildroot or any distro (and provision it with Ansible), and make a CI pipeline also for that image. Just try to automate everything.

Of course, there are things that can’t be automated. For example, if an STM32 board is malfunctioning, a power supply dies, or even a single cable has too much resistance and creates weird issues, then you can’t use a script to fix it. You also can’t write a script that physically adds more agents to your farm. All those things are manual work and take time, but you can build your farm in a way that minimizes the problems and the time needed to add new agents or debug issues. Use jigs, for example: use a 3D printer to create custom jigs in which the SBC and the STM32 board fit with proper cable management, and check your cables before installation.

Finally, every project is different and developers have different needs. Find what the pain points are for your project early and automate the solution.

If you’re an embedded engineer reading this, I hope you can see the need for DevOps in your project. And if you’re a DevOps engineer new to the embedded domain, then I hope you grasp the needs of this domain and how complicated it can be.

The next post will definitely be a stupid project. The last three were a bit serious… Until then,

Have fun!






DevOps for embedded (part 2)


Note: This is the second post of the DevOps for Embedded series. You can find the first post here.

In the previous post, I explained how to use Packer and Ansible to create a Docker CDE (common development environment) image and use it on your host OS to build and flash the code for a simple STM32F103 project. This docker image was then pushed to the docker hub repository, and I showed how you can create a gitlab-ci pipeline that triggers on repo code changes, pulls the docker CDE image, builds the code, runs tests and finally exports the artifact to the gitlab repo.

This is the most common case of a CI/CD pipeline in an embedded project. What was not demonstrated was the testing on the real hardware, which I’ll leave for the next post.

In this post, I’ll show how you can use a cloud service, specifically AWS, to perform the same task as in the previous post and create a cloud VM that builds the code either as a CDE or as part of a CI/CD pipeline. I’ll also show a couple of different architectures for doing this and discuss the pros and cons of each one. So, let’s dive into it.

Install aws CLI

In order to use the aws services you first need to install the aws CLI tool. At the time this post is written, there is version 1 of the aws CLI and also a new version 2, which is in preview for evaluation and testing. Therefore, I’ll use version 1 for this post. There’s a guide here on how to install the aws CLI, but it’s easy: you just need python3 and pip:

pip3 install awscli --upgrade --user

After that you can test the installation by running aws --version:

$ aws --version
aws-cli/1.16.281 Python/3.6.9 Linux/5.3.0-22-generic botocore/1.13.17

So, the version I’ll be using in this post is 1.16.281.

It’s important to know that if you get errors while following this guide, you should make sure you use the same versions for each piece of software that is used. The reason is that Packer and Ansible, for example, get updated quite often and sometimes the newer versions are not backwards compatible.

Create key pairs for AWS

This will be needed later in order to run an instance from the created AMI. When an instance is created you need a way to connect to it. The instance will get a public IP and the easiest way to connect to it is SSH. For security reasons you need to either create a private/public key pair on the AWS EC2 MC (MC = Management Console) or create one locally on your host and then upload the public key to AWS. Both solutions are fine, but they have some differences, and here you’ll find the documentation on this.

If you use the MC to create the keys, your browser will download a pem file, which you can use to connect to any instance that uses this pair. You can store this pem file in a safe place, distribute it or do whatever you like with it. If you create your keys that way, then you trust the MC backend for the key creation, which I assume is fine, but in case you have a better random generator or you want full control, you may not want the MC to create the keys.

The other way is to create the keys on your side. I assume you would want that for two reasons: either to re-use an existing ssh pair that you already have, or because you prefer your own random generator. In any case, AWS supports importing external keys, so just create the keys and upload the public key in your MC in the “Network & Security -> Key Pairs” tab.

In this example, I’m using my host’s key pair which is in my user’s ~/.ssh so if you haven’t created ssh keys to your host then just run this command:

ssh-keygen -t rsa -C ""

Of course, you need to change the dummy email with yours. When you run this command it will create two files (id_rsa and ) in your ~/.ssh folder. Have a look in there, just to be sure.

Now that you have your keys, open your AWS EC2 MC and upload your public ~/.ssh/ key in the “Network & Security -> Key Pairs” tab.

Creating a security group

Another thing you’ll need later is a security group for your AMI. This is just a set of firewall rules grouped under a name that you can use when you create new instances. The new instance will then get those rules applied. The most usual case is that you only need to allow ssh inbound connections and allow all outbound connections from the image.

To create a new group of rules, go to your AWS EC2 management console and click on the “NETWORK & SECURITY -> Security Groups” tab. Then click on the “Create Security Group” button and in the new dialog type “ssh-22-only” for the security group name and write your own description. Then, in the inbound tab, press the “Add Rule” button, select “SSH” in the Type column, select “Custom” in the Source column and enter the CIDR range you want to allow. Apply the changes and you’ll see your new ssh-22-only rule in the list.

Creating an AWS EC2 AMI

In the repo that I’ve used in the previous post to build the docker image with Packer, there’s also a json file to build the AWS EC2 AMI. AMI stands for Amazon Machine Image and it’s just a machine image in a special format that Amazon uses for its own purposes and infrastructure.

Before proceeding with the post you need to make sure that you have at least a free tier AWS account and that you’ve created your credentials.

To build the EC2 image you need to make sure that there’s a folder named .aws in your user’s home and that it contains the config and credentials files. If you don’t have these, it means that you need to configure aws using the CLI. To do this, run the following command in your host’s terminal:

aws configure

When this runs, you’ll need to fill in the AWS Access Key ID and the AWS Secret Access Key that you got when you created your credentials. You also need to select your region, which is the location where you want to store your AMI. This selection depends on your geo-location and there’s a list to choose the proper name depending on where you’re located. The list is in this link. In my case, since I’m staying in Berlin, it makes sense to choose eu-central-1 which is located in Frankfurt.

After you run the configuration, the ~/.aws folder and the needed files should be there. You can also verify it by cat’ing the files in the shell:

cat ~/.aws/{config,credentials}

The config file contains information about the region and the credentials file contains your AWS Access Key ID and the AWS Secret Access Key. Normally, you don’t want those keys lying around in plain text and you should use some kind of vault service, but let’s skip this step for now.
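For reference, both files follow the standard AWS CLI INI layout, roughly like this (the values below are placeholders):

```ini
# ~/.aws/config
[default]
region = eu-central-1

# ~/.aws/credentials
[default]
aws_access_key_id = <your access key id>
aws_secret_access_key = <your secret access key>
```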

If you haven’t cloned the repo I’ve used in the previous post already, then this is the repo you need to clone:

Have a look in the `stm32-cde-aws.json` file. I’ll also paste it here:

{
  "_comment": "Build an AWS EC2 image for STM32 CDE",
  "_author": "Dimitris Tassopoulos <>",
  "variables": {
    "aws_access_key": "",
    "aws_secret_key": "",
    "cde_version": "0.1",
    "build_user": "stm32"
  },
  "builders": [{
    "type": "amazon-ebs",
    "access_key": "{{user `aws_access_key`}}",
    "secret_key": "{{user `aws_secret_key`}}",
    "region": "eu-central-1",
    "source_ami": "ami-050a22b7e0cf85dd0",
    "instance_type": "t2.micro",
    "ssh_username": "ubuntu",
    "ami_name": "stm32-cde-image-{{user `cde_version`}} {{timestamp | clean_ami_name}}",
    "ssh_interface": "public_ip"
  }],
  "provisioners": [
    {
      "type": "shell",
      "inline": [
        "sleep 20"
      ]
    },
    {
      "type": "shell",
      "inline": [
        "sudo apt-get update",
        "sudo apt-get install -y python"
      ]
    },
    {
      "type": "ansible",
      "user": "ubuntu",
      "playbook_file": "provisioning/setup-cde.yml",
      "extra_arguments": [
        "-e env_build_user={{user `build_user`}}"
      ]
    }
  ]
}

There you see the variables section, in which there are two important variables: aws_access_key and aws_secret_key, and both are empty in the json file. In our case, since we have configured the aws CLI, that’s no problem, because Packer will use the aws CLI with our user’s credentials. Nevertheless, this is something that you need to take care of if the Packer build runs on a VM in a CI/CD pipeline, because in that case you’ll need to provide those credentials somehow. Usually you do this either by hard-coding the keys, which is not recommended; or by having the keys in your environment, which is better than having them hard-coded but still not ideal from the security aspect; or by using a safe vault that stores those keys so that the builder can retrieve and use them in a “safe” way. In this example, since the variables are empty, Packer expects to find them in the host environment.
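As a sketch of the environment-variable approach, Packer can pull values into the variables section with its `env` template function; the environment variable names here are my assumption, not what the repo’s json actually uses:

```json
"variables": {
  "aws_access_key": "{{env `AWS_ACCESS_KEY_ID`}}",
  "aws_secret_key": "{{env `AWS_SECRET_ACCESS_KEY`}}"
}
```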

Next, in the builders section you see that I’ve selected the “amazon-ebs” type (which is documented here). There you see the keys that are pulled from the host environment (which in this case is the ~/.aws configuration). The region, as you can see, is hard-coded in the json, so you need to change it depending on your location.

The source_ami field is also important as it points to the base AWS image that is going to be used as a basis to create the custom CDE image. When you run the Packer build, then in the background Packer will create a new instance from the source_ami, configured with the parameters inside the “builders” block of the stm32-cde-aws.json file. All the supported configuration parameters that you can use with Packer are listed here. In this case, the instance will be a t2.micro which has 1 vCPU, and the default snapshot will have 8GB of storage. After the instance is created, Packer will run the provisioner scripts and then it will create a new AMI from this instance, named by the ami_name that you’ve chosen in the json file.

You can verify this behavior later by running the packer build command while having your EC2 management console open in your browser. There you will see that while Packer is running it will create a temporary instance from the source_ami, do its stuff in there, then create the new AMI and finally terminate the temporary instance. It’s important to know what each instance state means. Terminated means that the instance is gone, it can’t be used anymore and its storage is gone, too. If you need to keep your storage after an instance is terminated then you need to create a block volume and mount it in your image; that’s not important for us now, but you can have a look in the AWS and Packer documentation on how to do that.

Let’s go back to how to find the image you want to use. This is a bit cumbersome and imho it could be better, because as it is now it’s a bit hard to figure out. There are two ways to do it. One is to use the aws-cli tool and list all the available images, but this is almost useless as it returns a huge json file with all the images. For example:

aws ec2 describe-images > ~/Downloads/ami-images.json
cat ~/Downloads/ami-images.json

Of course, you can use some filters if you like and limit the results, but still that’s not an easy way to do it. For example, to get all the official amazon ebs images, you can run this command:

aws ec2 describe-images --owners amazon > ~/Downloads/images.json

For more about filtering the results run this command to read the help:

aws ec2 describe-images help

There’s a somewhat better way to do this from your aws web console, though, but it’s a bit hidden and not easy to find. First go to your aws console here. There you’ll see a button named “Launch instance”; if you press it, a new page will open where you can search for available AMIs. The good thing with the web search is that you can also easily see if the image you need is “Free tier eligible”, which means that you can use it with your free subscription at no extra cost. In our case the image is ami-050a22b7e0cf85dd0, which is an ubuntu 16.04 x86_64 base image.

The next field in the json file is “instance_type”; for now it’s important to set it to t2.micro. You can find out more about the instance types here. The t2.micro instance for the ami-050a22b7e0cf85dd0 AMI is “free tier eligible”. You can find which instances are free tier eligible when you select the image in the web interface and press “Next” to proceed to the step in which you select the instance type; there you can see which of them are in the free tier. Keep in mind that the free-tier-eligible instance types depend on the source_ami.

The ssh_username in the json is the name that you need to use when you ssh into the image, and the ami_name is the name of the new image that is going to be created. In this case you can see that this name is dynamic, as it depends on the cde_version and the timestamp. The timestamp and clean_ami_name functions are provided by Packer.

The “provisioners” section, like in the previous post, uses Ansible to configure the image; but as you can see there are two additional provisioning steps. The first is a shell script that sleeps for 20 secs. This is an important step, because you need a delay in order to wait for the instance to boot before you’re able to ssh to it. If Ansible tries to connect before the instance is up, running and able to accept SSH connections, the image build will fail. Also, sometimes even if the SSH connection succeeds, the apt package manager may fail when the delay was too short, so always use a delay.
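As a side note, Packer provisioners also support a pause_before option that can serve the same purpose as the inline sleep; this is just a sketch of that option, not what the repo’s json actually uses:

```json
{
  "type": "ansible",
  "pause_before": "20s",
  "user": "ubuntu",
  "playbook_file": "provisioning/setup-cde.yml"
}
```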

The next provisioning step uses a shell sudo command (inside the instance) to update the APT repos and install Python. This step wasn’t needed in the case of the docker image creation with Packer. As I’ve mentioned in the previous post, although the base images from different providers are quite similar, they do have differences; in this case the AWS Ubuntu image doesn’t have Python installed, and Python is needed on the target for Ansible, even when Ansible connects remotely via SSH.

To build the AWS image, just run this command in your host terminal.

packer build stm32-cde-aws.json

After running the above command you should see an output like this:

$ packer build stm32-cde-aws.json 
amazon-ebs output will be in this color.

==> amazon-ebs: Prevalidating AMI Name: stm32-cde-image-0.1 1575473402
    amazon-ebs: Found Image ID: ami-050a22b7e0cf85dd0
==> amazon-ebs: Creating temporary keypair: packer_5de7d0fb-33e1-1465-13e0-fa53cf8f7eb5
==> amazon-ebs: Creating temporary security group for this instance: packer_5de7d0fd-b64c-ea24-b0e4-116e34b0bbf3
==> amazon-ebs: Authorizing access to port 22 from [] in the temporary security groups...
==> amazon-ebs: Launching a source AWS instance...
==> amazon-ebs: Adding tags to source instance
    amazon-ebs: Adding tag: "Name": "Packer Builder"
    amazon-ebs: Instance ID: i-01e0fd352782b1be3
==> amazon-ebs: Waiting for instance (i-01e0fd352782b1be3) to become ready...
==> amazon-ebs: Using ssh communicator to connect:
==> amazon-ebs: Waiting for SSH to become available...
==> amazon-ebs: Connected to SSH!
==> amazon-ebs: Provisioning with shell script: /tmp/packer-shell622070440
==> amazon-ebs: Provisioning with shell script: /tmp/packer-shell792469849
==> amazon-ebs: Provisioning with Ansible...
==> amazon-ebs: Executing Ansible: ansible-playbook --extra-vars packer_build_name=amazon-ebs packer_builder_type=amazon-ebs -o IdentitiesOnly=yes -i /tmp/packer-provisioner-ansible661166803 /.../stm32-cde-template/provisioning/setup-cde.yml -e ansible_ssh_private_key_file=/tmp/ansible-key093516356
==> amazon-ebs: Stopping the source instance...
    amazon-ebs: Stopping instance
==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Creating AMI stm32-cde-image-0.1 1575473402 from instance i-01e0fd352782b1be3
    amazon-ebs: AMI: ami-07362049ac21dd92c
==> amazon-ebs: Waiting for AMI to become ready...
==> amazon-ebs: Terminating the source AWS instance...
==> amazon-ebs: Cleaning up any extra volumes...
==> amazon-ebs: No volumes to clean up, skipping
==> amazon-ebs: Deleting temporary security group...
==> amazon-ebs: Deleting temporary keypair...
Build 'amazon-ebs' finished.

==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs: AMIs were created:
eu-central-1: ami-07362049ac21dd92c

That means that Packer created a new AMI in the eu-central-1 region. To verify the AMI creation you can also connect to your EC2 Management Console and view the AMI tab. In my case I got this:

Some clarifications now. This is just an AMI, not a running instance, and you can see in the image that its status is set to available. This means that the image is ready to be used and you can create/run an instance from it. So let’s try to do this and test if it works.

Create an AWS EC2 instance

Now that you have your AMI you can create instances. There are several ways to do that, like using your EC2 management console web interface, the CLI, or even Vagrant, which simplifies things a lot. I prefer using Vagrant, but let me explain why I think Vagrant is a good option by mentioning why the other options are not that good. First, it should be obvious why using the web interface to start/stop instances is not your best option. If not, just think what it would mean to do this every time you need to build your code: you open your browser, connect to your management console, perform several clicks to create the instance, open your terminal, connect to the instance and, after finishing your build, stop the instance again from the web interface. That takes time and it’s almost manual labor…

The next option would be the CLI. That’s actually not that bad, but the shell command would be a long train of options, so you would need a shell script, and you would have to deal with the CLI too much, which is not ideal in any case.

Finally, you can use Vagrant. With Vagrant the only thing you need is a file named Vagrantfile, which is a ruby script that contains the configuration of the instance you want to create from a given AMI. In there you can also define several other options; most importantly, since this file is just a text file, it can be pushed to a git repo and benefit from the versioning and re-usability git provides. So let’s see how to use Vagrant for this.

Installing Vagrant

To install vagrant you just need to download the binary from here. Then you can copy this file somewhere in your path, e.g. ~/.local/bin. To check that everything is ok, run this:

vagrant version

In my case it returns:

Installed Version: 2.2.6
Latest Version: 2.2.6

Now you need to install a plugin called vagrant-aws.

vagrant plugin install vagrant-aws

Because Vagrant works with boxes, you need to install a special dummy box that wraps the AWS EC2 functionality; it’s actually a proxy or gateway box to the AWS service:

vagrant box add aws-dummy

Now inside the `stm32-cde-template` git repo that you’ve cloned, create a symlink of Vagrantfile_aws to Vagrantfile like this.

ln -sf Vagrantfile_aws Vagrantfile

This will create a symlink that you can override at any point, as I’ll show later in this post.

Using Vagrant to launch an instance

Before you proceed with creating an instance with Vagrant you first need to fill in the configuration parameters in vagrant-aws-settings.yml. The only thing that this file does is store the variable values, so the Vagrantfile remains dynamic and portable. Having a portable Vagrantfile is very convenient, because you can just copy/paste the file between your git repos and only edit the vagrant-aws-settings.yml file with the proper values for your AMI.
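To give an idea, such a Vagrantfile can look roughly like the sketch below. This is an assumption of its shape based on the vagrant-aws plugin’s documented provider options, not the repo’s actual file:

```ruby
# Vagrantfile (sketch) -- reads the instance settings from the YAML file
require 'yaml'
settings = YAML.load_file('vagrant-aws-settings.yml')

Vagrant.configure('2') do |config|
  config.vm.box = 'aws-dummy'            # the dummy box installed earlier
  config.vm.provider :aws do |aws, override|
    aws.region          = settings['aws_region']
    aws.ami             = settings['aws_ami_name']
    aws.instance_type   = settings['aws_instance_type']
    aws.keypair_name    = settings['aws_keypair_name']
    aws.security_groups = settings['aws_security_groups']
    override.ssh.username         = settings['ssh_username']
    override.ssh.private_key_path = settings['ssh_private_key_path']
  end
end
```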

Therefore, open `vagrant-aws-settings.yml` with your editor and fill the proper values in there.

  • aws_region is the region that you want your instance to be created.
  • aws_keypair_name is the key pair that you’ve created earlier in your AWS EC2 MC. Use the name that you used in the MC, not the filename of the key in your host!
  • aws_ami_name this is the AMI name of the instance. You’ll get that in your “IMAGES -> AMIs” tab in your MC. Just copy the string in the “AMI ID” column and paste it in the settings yaml file. Note that the AMI name will be also printed in the end of the packer build command.
  • aws_instance_type that’s the instance type, as we’ve said t2.micro is eligible for free tier accounts.
  • aws_security_groups the security group you’ve created earlier for allowing ssh inbound connections (e.g. the ssh-22-only we’ve created earlier).
  • ssh_username the AMI username. Each base AMI has a default username and all the AMI that Packer builds inherits that username. See the full list of default usernames here. In our case it’s ubuntu.
  • ssh_private_key_path the path of the private ssh key that Vagrant will use to connect to the instance.
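Put together, the settings file would look something like this; the values are placeholders based on the examples in this post, so check the actual file in the repo for the exact key names:

```yaml
aws_region: "eu-central-1"
aws_keypair_name: "my-aws-keypair"        # the name you gave the pair in the MC
aws_ami_name: "ami-07362049ac21dd92c"     # the AMI ID printed by packer build
aws_instance_type: "t2.micro"
aws_security_groups: ["ssh-22-only"]
ssh_username: "ubuntu"
ssh_private_key_path: "~/.ssh/id_rsa"
```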

When you’ve filled in the proper values, run this command in the root folder of the git repo:

vagrant up

Then you’ll see Vagrant starting to create the new instance with the given configuration. The output should look similar to this:

$ vagrant up
Bringing machine 'default' up with 'aws' provider...
==> default: Warning! The AWS provider doesn't support any of the Vagrant
==> default: high-level network configurations (``). They
==> default: will be silently ignored.
==> default: Launching an instance with the following settings...
==> default:  -- Type: t2.micro
==> default:  -- AMI: ami-04928c5fa611e89b4
==> default:  -- Region: eu-central-1
==> default:  -- Keypair: id_rsa_laptop_lux
==> default:  -- Security Groups: ["ssh-22-only"]
==> default:  -- Block Device Mapping: []
==> default:  -- Terminate On Shutdown: false
==> default:  -- Monitoring: false
==> default:  -- EBS optimized: false
==> default:  -- Source Destination check: 
==> default:  -- Assigning a public IP address in a VPC: false
==> default:  -- VPC tenancy specification: default
==> default: Waiting for instance to become "ready"...
==> default: Waiting for SSH to become available...
==> default: Machine is booted and ready for use!
==> default: Rsyncing folder: /home/.../stm32-cde-template/ => /vagrant

You’ll notice that you’re back at the prompt in your terminal, so what happened? Well, Vagrant just created a new instance from our custom Packer AMI and left it in the running state. You can check in your MC and verify that the new instance is in the running state. So now it’s up and running and all you need to do is to connect to it and start executing commands.

To connect, there are two options. One is to use any SSH client and connect as the ubuntu user. To get the IP of the running instance, just click on the instance in the “INSTANCES -> Instances” tab in your MC and on the bottom area you’ll see “Public DNS:”. The string after this is the public DNS name. Copy it and then connect via ssh.

The other way, which is the easiest, is to use Vagrant and just type:

vagrant ssh

This will automatically connect you to the instance, so you don’t need to find the DNS name or even connect to your MC.

Building your code

Now that you’re inside your instance’s terminal, after running vagrant ssh you can build the STM32 template code like this:

git clone --depth 1 --recursive
cd stm32f103-cmake-template
time TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./

I’ve used the time command in order to benchmark the build time of the firmware. The result on my AWS instance is:

Building the project in Linux environment
[ 95%] Linking C executable stm32-cmake-template.elf
   text	   data	    bss	    dec	    hex	filename
  14924	    856	   1144	  16924	   421c	stm32-cmake-template.elf
[ 95%] Built target stm32-cmake-template.elf
Scanning dependencies of target stm32-cmake-template.bin
Scanning dependencies of target stm32-cmake-template.hex
[ 97%] Generating stm32-cmake-template.bin
[100%] Generating stm32-cmake-template.hex
[100%] Built target stm32-cmake-template.hex
[100%] Built target stm32-cmake-template.bin

real	0m5.318s
user	0m4.304s
sys	0m0.372s

So the code builds fine! If you check the firmware size against the previous post, you’ll see it’s the same. So what happened is that we’ve managed to create a CDE image for our project on the AWS cloud, start an instance, ssh to it and then build the code using the same toolchain and an almost identical environment as in the previous post. This is great.

Now every developer that has this Vagrantfile can create an instance from the same AMI and build the code. You can have as many instances as you like, or you can have one instance and multiple developers connecting to it. In the latter case, though, the developers need to connect to the instance using an SSH client instead of vagrant ssh.

Now there’s a question, though: what happens with the instance after you finish? Well, you can just type exit in the instance console; the SSH connection will close and you’ll be back at your host’s terminal. At that point the instance is still running in the AWS cloud. You can leave it running or destroy it by running this command on your host:

vagrant destroy

This command will terminate the instance, which you can verify in your AWS EC2 MC.

Note: When you destroy the instance, everything will disappear. You’ll lose any work you’ve done, the firmware binaries will be destroyed too, and you won’t be able to get them back, unless you’ve created a volume and mounted it into your instance, or uploaded the artifacts somewhere else.

Now let’s see a few more details about properly cleaning up your instances and images.

Cleaning up properly

Building an AMI with Packer doesn’t cost you anything, as it uses a t2.micro instance which is free tier, and when it finishes it also properly cleans up any storage that was used.

The AMI that has been built also occupies some space. If you think about it, it’s a full blown OS with a rootfs, and this needs to be stored somewhere. When the AMI is built, it occupies a piece of storage called a snapshot, and every new instance that runs from this AMI actually starts from this snapshot (think of forking in git); a new volume is then created which stores all the changes compared to the original snapshot. Therefore, more space is needed.

This storage, though, is not free. Well, the first 30GB of elastic block storage (EBS, which is what is used to store the EC2 volumes and snapshots) are free, but beyond that you pay depending on how much more you use. Therefore, if you have an AMI with 8GB of storage, it’s free. But if you create a couple more instances and exceed the 30GB storage limit, then you’ll need to pay for the extra storage, per month. Don’t be afraid, though, as the cost will be something like 1 cent per snapshot, so it’s not much at all.

EBS is not the same storage as the Amazon simple storage service (S3). S3 in the free-tier accounts is limited to 5GB and is the bucket storage you can use to save your data or meta-data. For the rest of the post, we’re only using EBS, not S3.

You can see how much EBS storage you’re using in the “ELASTIC BLOCK STORE” tab in your EC2 AWS MC. In there, there are two other tabs, Volumes and Snapshots. The Volumes tab lists the storage that your running instances use, while your AMI’s storage points to a snapshot. If you click on your AMI in the MC you’ll see that its block device points to a snapshot. Now click on the Snapshots tab and verify that the snapshot of the image is there.

What happens when you run vagrant up is that a new volume is created for the instance, derived from the AMI’s snapshot. Any changes you make there affect only the volume, not the AMI’s snapshot! When you run vagrant destroy, the volume that is mounted in that instance is deleted, and the AMI’s snapshot is still there for the next instance.

Therefore, each instance you run will have a volume, and until the instance is destroyed/terminated you’ll pay for the storage (if it exceeds the 30GB). The amount is negligible, but you still need to keep it in mind.

Regarding the snapshot billing, you can read more here.

Just keep in mind that if you don’t want to get charged for anything, then just terminate your instances, delete your AMIs, volumes and snapshots and you’re fine. Otherwise, you always need to make sure not to exceed the 30GB limit.

Finally, the 30GB of EBS is the limit at the date I’m writing this post. This may change at any time in the future, so always check it.

Code build benchmarks

I can’t help it, I love benchmarks. So, before moving on, let’s see how the solutions we’ve seen so far perform when building the STM32 firmware. I’ll compare my laptop (an i7-8750, 12x cores @ 2.2GHz and 16GB RAM), my workstation (Ryzen 2700X & 32GB RAM), the gitlab-ci runner and the AWS EC2 AMI we just built. These are the results using the time command:

             2700X    Laptop   GitLab-CI   AWS (t2.micro)
Build (secs) 1.046s   3.547s   5.971s      5.318s

I remind you that this is just a template project, so there’s not much code to build, but still the results are quite interesting. My workstation needs about 1 sec, my laptop that uses 12 threads needs 3.5 secs and the cloud instances need around 5-6 secs. Not bad, right? Nevertheless, I wouldn’t compare the gitlab-ci and aws instances too closely, as this is just a single run and the build time on each might also vary during the day with the server load (I guess). The important thing is that the difference is not huge, at least for this scenario. For a more complicated project (e.g. a Yocto build), you should expect the difference to be significant.

Using AWS in your CI/CD pipeline

Like in the previous post, I’ll show you how to use this AMI that you created with Packer in your CI/CD pipeline. This is where things are getting a bit more complicated, and I’ll try to explain why. Let’s see how the CI/CD pipeline actually works: you make some changes in the code and push them; then gitlab-ci is triggered automatically, picks a gitlab-runner from the list of compatible runners and sends all the settings to that runner to build the code.

By default those runners in gitlab are always running and poll the main gitlab-ci service for waiting builds, so gitlab-ci doesn’t initiate the communication, the runner does. Although gitlab provides a number of free-to-use runners, there’s a limit on the free time for which you can use them. Of course, this limitation applies only when using gitlab.com and not a local gitlab installation. gitlab allows you to use your own gitlab-runner even when you use the gitlab.com service, though. It doesn’t matter if the runner is your laptop, a bare-metal server or a cloud instance. All you need is to run the gitlab-runner client, which you can download from here (see the instructions in the link), and then do some configuration in the gitlab source code repo where your .gitlab-ci.yml is.

The following image shows the simple architecture of this scenario.

In the above example gitlab is hosting the repo and the repo has a gitlab-ci pipeline. The developer pulls the code, makes changes and pushes back to the repo. Then the AWS EC2 CDE instance polls the gitlab-ci service and when a tagged build is available it picks the job, runs the CI/CD pipeline and then returns the result (including any artifacts).
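The “tagged build” part is usually expressed in the repo’s .gitlab-ci.yml; a minimal sketch follows, where the tag name and the build script name are placeholders and not the actual ones from the repo:

```yaml
stages:
  - build

build_fw:
  stage: build
  tags:
    - stm32-cde    # only runners registered with this tag pick the job
  script:
    - ./build.sh   # placeholder for the repo's actual build command
```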

There are two problems here, though. The first is how the AWS CDE builder knows which build to execute, so that it doesn’t execute builds from other repos. The second, as you’ve probably noticed, is that the builder needs to always be running in order to poll the gitlab-ci service and pick up builds! That means the costs start adding up even if the builder is idle and doesn’t execute any builds.

For now, let’s see how to create an instance that is running the gitlab-runner.

What is an AWS EC2 gitlab-runner?

We can use Packer and Ansible to create an AMI that runs a gitlab-runner. There are a few things that we need to consider, though, in order to implement an architecture that is scalable, can be used in more than one project, and makes it easy to create as many runner instances as you need without further configuration.

The gitlab-runner needs a couple of settings in order to work properly. First, it needs to know the domain that it will poll for new jobs. As I’ve mentioned earlier, in gitlab the runner polls the server at pre-configured intervals and asks for pending jobs.

Then the gitlab-runner needs to know which jobs it is able/allowed to run. This makes sense, because if you think about it, a gitlab-runner is an OS image that comes with specific libraries and tools. Therefore, if a job needs libraries that are not included in the runner’s environment, the build will fail. Later in this post I’ll explain the available architectures that you can use.

Anyway, for now the runner needs to be aware of the projects it can run, and this is handled with a unique token per project that you can create in the gitlab project settings. When you create this token, gitlab-ci will forward your project’s builds only to runners that are registered with this token.
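After a runner registers with that token, the gitlab-runner client stores its settings in a config.toml file (by default in /etc/gitlab-runner/ when running as root). It looks roughly like the sketch below; the name and token values are placeholders, and note that the token stored here is the per-runner token that gitlab assigns on registration, not the registration token itself:

```toml
concurrent = 1

[[runners]]
  name = "stm32-cde-aws-runner"
  url = "https://gitlab.com/"
  token = "<per-runner token assigned by gitlab>"
  executor = "shell"
```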

Before proceeding further, let’s have a look at the architecture approach.

gitlab-runner images architecture approach

The gitlab-runner can be set up in a few different ways. One is having a generic runner (or image instance, when it comes to the cloud) with no dependencies and using the build pipeline to configure the running instance and install all the needed libraries and tools just before the build stage. Another approach is having multiple images that are only meant to build specific projects and only include specific dependencies. And finally, you can have a generic image that supports docker and can use other docker containers which have the needed dependencies installed to build the source code.

Which of the above scenarios sounds better for your case? Let’s have a look at the pros/cons of each solution.

Single generic image (no installed dependencies)
+ Easy to maintain your image as it’s only one (or just very few)
+ Less storage space needed (either for AWS snapshots or docker repos)
- On every build the code repo needs to prepare the environment
- If the build environment is very complex, then creating it will be time consuming and each build will take a lot of time
- Increased network traffic on every build

Multiple images with integrated dependencies
+ Builds are fast, because the environment is already configured
+ Source code pipeline is agnostic to the environment (.gitlab-ci.yml)
- A lot of storage needed for all the images, which may increase maintenance costs

Generic image that supports docker
+ Easy to maintain the image
+/- Multiple docker images need space and maintenance, but the maintenance is easy
- The image instance will always have to download a docker container on every different stage of the build (remember the previous post?)
- The build takes more time

From my perspective, tbh, it’s not very clear which architecture is better overall. All of them have their strengths and weaknesses. I guess this is something that you need to decide when you have your project specs and you know exactly what you need and expect from your infrastructure, and what other people (like devs) expect for their workflow. You may even decide to mix architectures in order to have fast developer builds and a more automated backbone infrastructure.

For our example I’ll go with the multiple-image approach that includes the dependencies. So, I’ll build a specific image for this purpose and then run instances from this image that will act as gitlab-runners and have the proper environment to build the code themselves.

Building the gitlab-runner AWS AMI

As I’ve mentioned earlier, I’ll go with the strategy of creating an AWS image that contains the STM32 CDE and also runs a gitlab-runner. This is a more compact solution, as the AMI won’t have to run docker and download a docker image to build the firmware, and the firmware build will also be a lot faster.

Note that the instance will always have to run in order to be able to pick jobs from the gitlab-ci server. Therefore, using docker or not doesn’t really affect the running costs; it only affects the time until the runner is available again for the next job, which is also important. So it’s not so much about running costs, but about better performance and faster availability.

Again you’ll need this repo here:

In the repo you’ll find several different packer json image configurations. The one that we’re interested in is the `stm32-cde-aws-gitlab-runner.json`, which will build an AWS AMI that includes the CDE and also has a gitlab-runner installed. There is a change that you need to make in the json, because the configuration doesn’t know the token of your gitlab project. Therefore, you need to go to your gitlab project’s “Settings -> CI/CD -> Runners”, copy the registration token from that page and paste it in the gitlab_runner_token field in the json file.
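If you want to script that step instead of editing the json by hand, a small sed one-liner does the trick. This is just a hedged sketch: the token value is a placeholder and the fallback line only exists so the snippet is self-contained outside the repo checkout.

```shell
# Hedged sketch: put your registration token into the packer config with sed.
# The `gitlab_runner_token` variable name comes from the json file in the repo;
# the token value below is a placeholder.
TOKEN="YOUR-REGISTRATION-TOKEN"
CONF="stm32-cde-aws-gitlab-runner.json"
# fallback so the snippet also works standalone, outside the repo checkout
[ -f "$CONF" ] || echo '{ "gitlab_runner_token": "" }' > "$CONF"
sed -i "s/\"gitlab_runner_token\": *\"[^\"]*\"/\"gitlab_runner_token\": \"${TOKEN}\"/" "$CONF"
grep gitlab_runner_token "$CONF"
```

This way the token can come from an environment variable in your automation, instead of being committed in the json.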

Then you need to build the image with Packer:

packer build stm32-cde-aws-gitlab-runner.json

This command will start building the AWS AMI and in the meantime you can watch the progress in your AWS EC2 Management Console (MC). In the end you will see something like this in your host console:

==> amazon-ebs: Stopping the source instance...
    amazon-ebs: Stopping instance
==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Creating AMI stm32-cde-image-gitlab-runner-0.1 1575670554 from instance i-038028cdda29c419e
    amazon-ebs: AMI: ami-0771050a755ad82ea
==> amazon-ebs: Waiting for AMI to become ready...
==> amazon-ebs: Terminating the source AWS instance...
==> amazon-ebs: Cleaning up any extra volumes...
==> amazon-ebs: No volumes to clean up, skipping
==> amazon-ebs: Deleting temporary security group...
==> amazon-ebs: Deleting temporary keypair...
Build 'amazon-ebs' finished.

==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs: AMIs were created:
eu-central-1: ami-0771050a755ad82ea

Note that in the 4th line the AMI name in this case is `stm32-cde-image-gitlab-runner-0.1` and not `stm32-cde-image-0.1` like in the previous post.

Now we’ll use Vagrant again to run an instance of the built AMI on AWS and verify that the gitlab-runner in the instance works properly, connects to gitlab-ci and picks up jobs.

Before running Vagrant, make sure that you edit the `vagrant-aws-settings.yml` file and set the proper aws_keypair_name and aws_ami_name. You’ll find the new AMI name in the AWS EC2 MC in the “IMAGES -> AMIs” tab, in the “AMI ID” column. After setting the proper values, run these commands:

ln -sf Vagrantfile_aws Vagrantfile
vagrant up
vagrant ssh
ps -A | grep gitlab

Normally, the ps command should show a running instance of gitlab-runner. You can also verify in your gitlab project in “Settings -> CI/CD -> Runners” that the runner is connected. In my case this is what ps returns:

ubuntu@ip-xx-xx-34-51:~$ ps -A | grep gitlab
 1165 ?        00:00:00 gitlab-runner

That means the gitlab-runner runs, but let’s also see the configuration to be sure:

ubuntu@ip-xx-xx-34-x51:~$ sudo cat /etc/gitlab-runner/config.toml
concurrent = 1
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "runner-20191206T224300"
  limit = 1
  url = ""
  token = "w_xchoJGszqGzCPd55y9"
  executor = "shell"

The token you see now in the config.toml is not the registration token of the source code project repo, but the token of the runner itself! Therefore, don’t think that’s an error. So in this case the token is `w_xchoJGszqGzCPd55y9` and if you go to your gitlab’s “Settings -> CI/CD -> Runners” web page you’ll see something similar to this:

You can see that the connected runner is w_xchoJG, so the gitlab-runner that runs on the AWS AMI we’ve just built seems to be working fine. But let’s build the code to be sure that the AWS gitlab-runner works. To do that, just go to your repo in “CI/CD -> Pipelines”, trigger a manual build and then click on the build icon to get into gitlab’s console output. In my case this is the output:

1 Running with gitlab-runner 12.5.0 (577f813d)
2   on runner-20191206T232941 jL5AyaAe
3 Using Shell executor... 00:00
5 Running on ip-172-31-32-40... 00:00
7 Fetching changes with git depth set to 50...
8 Initialized empty Git repository in /home/gitlab-runner/builds/jL5AyaAe/0/dimtass/stm32f103-cmake-template/.git/
9 Created fresh repository.
10 From
126 [ 95%] Linking C executable stm32-cmake-template.elf
127    text	   data	    bss	    dec	    hex	filename
128   14924	    856	   1144	  16924	   421c	stm32-cmake-template.elf
129 [ 95%] Built target stm32-cmake-template.elf
130 Scanning dependencies of target stm32-cmake-template.hex
131 Scanning dependencies of target stm32-cmake-template.bin
132 [ 97%] Generating stm32-cmake-template.bin
133 [100%] Generating stm32-cmake-template.hex
134 [100%] Built target stm32-cmake-template.bin
135 [100%] Built target stm32-cmake-template.hex
136 real	0m6.030s
137 user	0m4.372s
138 sys	0m0.444s
139 Creating cache build-cache... 00:00
140 Runtime platform                                    arch=amd64 os=linux pid=2486 revision=577f813d version=12.5.0
141 build-stm32/src_stdperiph: found 54 matching files 
142 No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally. 
143 Created cache
144 Uploading artifacts... 00:02
145 Runtime platform                                    arch=amd64 os=linux pid=2518 revision=577f813d version=12.5.0
146 build-stm32/src_stdperiph/stm32-cmake-template.bin: found 1 matching files 
147 Uploading artifacts to coordinator... ok            id=372257848 responseStatus=201 Created token=r3k6HRr8
148 Job succeeded

Success! Do you see the ip-172-31-32-40? This is the AWS instance. The AWS instance managed to build the code and also uploaded the built artifact back to gitlab-ci. Therefore, we managed to use packer to build an AWS EC2 AMI that we can use for building the code.

Although it seems that everything is fine, there is something that you need to consider, as this solution has a significant drawback that needs to be resolved. The problem is that the gitlab-runner registers to the gitlab-ci server when the AMI image is built. Therefore, any instance that runs from this image will have the same runner token and that’s a problem. That means that you can either run only a single instance from this image, or build several almost identical images with different tokens, or have a way to re-register the runner every time a new instance starts.

To fix this, you need to use a startup script that re-registers a new gitlab-runner every time the instance runs. There is documentation on how to use user_data to do this here. In this case, because I’m using Vagrant, all I have to do is edit the `vagrant-aws-settings.yml` file and add the `` name in the aws_startup_script: line, like this:


What will happen now is that Vagrant will pass the content of this script file to AWS and when an instance is running then the instance will register as a new gitlab-runner.
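Since the actual script name is not shown here, the following is only a hedged sketch of what such a startup script could look like; the gitlab URL, the registration token and the file name are all placeholders of my own, not the repo’s actual values.

```shell
# Hedged sketch: generate a user_data startup script that re-registers the
# runner on every boot; URL, token and file name are placeholders
cat > startup-register.sh <<'EOF'
#!/bin/bash
# drop any runner left over from a previous instance of this AMI
gitlab-runner unregister --all-runners
# register this instance as a fresh runner with its own token
gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.com/" \
  --registration-token "YOUR-REGISTRATION-TOKEN" \
  --executor shell \
  --name "runner-$(hostname)-$(date +%Y%m%dT%H%M%S)"
# make sure the already-installed service picks up the new config
gitlab-runner restart
EOF
chmod +x startup-register.sh
```

Because the name includes the hostname and a timestamp, every instance shows up as a distinct runner in the gitlab web interface.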

Another thing you need to do is to comment out the two last lines in the `provisioning/roles/gitlab-runner/tasks/main.yml` file, in order to disable registering a runner by default while the image is created.

# - name: Register GitLab Runner
#   import_tasks: gitlab-runner-register.yml

Then re-build the Packer image and run Vagrant again.

packer build stm32-cde-aws-gitlab-runner.json
vagrant up

In this case, the Packer image won’t register a runner while building, and then the vagrant up command will pass the aws_startup_script which is defined in the `vagrant-aws-settings.yml` file, so the gitlab-runner will be registered when the instance is running.

One last thing that remains is to automate the gitlab-runner un-registration. In this case with AWS you need to add your custom script in /etc/ec2-termination, as described here. For my example it’s not convenient to do that now, but you should be aware of this and probably implement it in your Ansible provisioning while building the image.

Therefore, with the current AMI, you’ll have to remove the outdated gitlab-runners from the gitlab web interface in your project’s “Settings -> CI/CD -> Runners” page.
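If removing them through the web interface becomes tedious, the runners can also be removed through GitLab’s REST API. Here’s a hedged sketch; the private token and the runner id are placeholders that you need to replace with your own values.

```
# Hypothetical cleanup of a stale runner via the GitLab API v4;
# replace the private token and the runner id with your own values
curl --request DELETE \
     --header "PRIVATE-TOKEN: <your-access-token>" \
     "https://gitlab.com/api/v4/runners/<runner-id>"
```

This could also be scripted into the /etc/ec2-termination hook mentioned above.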

Using docker as the gitlab-runner (Vagrant)

In the same repo you may have noticed two more files, `stm32-cde-docker-gitlab-runner.json` and `Vagrantfile_docker`. Probably, you’ve guessed right… Those two files can be used to build a docker image with the CDE including the gitlab-runner and the vagrant file to run a container of this image. To do this run the following commands on your host:

ln -sf Vagrantfile_docker Vagrantfile
vagrant up --provider=docker
vagrant ssh
ps -A | grep gitlab

With those commands you’ll build the image using vagrant and then run the instance, connect to it and verify that the gitlab-runner is running. Also check your gitlab project again to verify this and then re-run the pipeline to verify that the docker instance picks up the build. In this case, I’ve used my workstation, which is a Ryzen 2700X with 32GB RAM and an NVMe drive. Again, this time my workstation registered as a runner in the project and it worked fine. This is the result in the gitlab-ci output:

1 Running with gitlab-runner 12.5.0 (577f813d)
2   on runner-20191207T183544 scBgCx85
3 Using Shell executor... 00:00
5 Running on 00:00
7 Fetching changes with git depth set to 50...
8 Initialized empty Git repository in /home/gitlab-runner/builds/scBgCx85/0/dimtass/stm32f103-cmake-template/.git/
9 Created fresh repository.
10 From
126 [ 95%] Linking C executable stm32-cmake-template.elf
127    text	   data	    bss	    dec	    hex	filename
128   14924	    856	   1144	  16924	   421c	stm32-cmake-template.elf
129 [ 95%] Built target stm32-cmake-template.elf
130 Scanning dependencies of target stm32-cmake-template.bin
131 Scanning dependencies of target stm32-cmake-template.hex
132 [100%] Generating stm32-cmake-template.hex
133 [100%] Generating stm32-cmake-template.bin
134 [100%] Built target stm32-cmake-template.hex
135 [100%] Built target stm32-cmake-template.bin
136 real	0m1.744s
137 user	0m4.697s
138 sys	0m1.046s
139 Creating cache build-cache... 00:00
140 Runtime platform                                    arch=amd64 os=linux pid=14639 revision=577f813d version=12.5.0
141 build-stm32/src_stdperiph: found 54 matching files 
142 No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally. 
143 Created cache
144 Uploading artifacts... 00:03
145 Runtime platform                                    arch=amd64 os=linux pid=14679 revision=577f813d version=12.5.0
146 build-stm32/src_stdperiph/stm32-cmake-template.bin: found 1 matching files 
147 Uploading artifacts to coordinator... ok            id=372524980 responseStatus=201 Created token=KAc2ebBj
148 Job succeeded

It’s obvious that the runner now is more powerful, because the build only lasted 1.7 secs. Of course, for such a small codebase that’s a negligible difference compared to the gitlab-ci built-in runners and the AWS EC2 instances.

The important thing to keep in mind in this case is that when you run vagrant up, the container instance is not terminated after the Ansible provisioning has finished! That means that the container keeps running in the background after ansible ends, which is the reason the gitlab-runner is running when you connect to the image using vagrant ssh. This is a significant difference from other methods of running the docker container, as we’ll see in the next example.

Using docker as the gitlab-runner (Packer)

Finally, you can also use the packer json file to build the image and push it to your docker hub repository (if you like) and then use your console to run a docker container from this image. In this case, though, you need to manually run the gitlab-runner when you create a container from the docker image, because no background services are running on the container when it starts, unless you run them manually when you create the container or have an entry-point script that runs those services (e.g. the gitlab-runner).

Let’s see the simple case first, where you start a container with bash as the entry point.

docker run -it dimtass/stm32-cde-image-gitlab-runner:0.1 -c "/bin/bash"

If you’ve built your own image, then you need to replace the image name in the above command. After you run this command you’ll end up inside the container, and then if you run ps -A you’ll find that there’s pretty much nothing running in the background, including the gitlab-runner. Therefore, the current running container is like the CDE image we’ve built in the previous post. That means that this image can be used both as a CDE and as a gitlab-runner container! That’s important to keep in mind and you can take advantage of this default behavior of docker containers.

Now in the container terminal run this:

gitlab-runner run

This command will run the gitlab-runner in the container, which in turn will use the config in `/etc/gitlab-runner/config.toml` that Ansible installed. Then you can run the gitlab pipeline again to build using this runner. During your tests it’s good to run only a single runner for your project, so it’s always the one that picks up the job. I won’t paste the result again, but trust me, it works the same way as before.

The other way you can use the packer image is, instead of running the container with /bin/bash as the entry point (yes, you’ve guessed right), to use gitlab-runner run as the entry point. That way you can use the image for automations. To test this, stop any previously running containers and run this command:

docker run -it dimtass/stm32-cde-image-gitlab-runner:0.1 -c "/usr/bin/gitlab-runner run"

This command will run the gitlab-runner in the container and the terminal will block until you exit or quit. This is also an important thing to be careful with! gitlab-runner also supports running as a background service, like this:

docker run -it dimtass/stm32-cde-image-gitlab-runner:0.1 -c "/usr/bin/gitlab-runner start"

This won’t work! Docker by design will exit when the command exits and it will return the exit status. Therefore, if you don’t want your terminal to block when running the gitlab-runner run command, you need to run the container as a detached container. To do this, run this command:

docker run -d -it dimtass/stm32-cde-image-gitlab-runner:0.1 -c "/usr/bin/gitlab-runner run"

This command will return by printing the hash of the running container, but the container will keep running in the background as a detached container. Therefore, your runner will keep running. You can always use the docker ps -a command to verify which containers are running and which are exited.

If you think that you can just spawn 10 of these containers now, then don’t get excited, because this can’t be done. You see, the runner was registered during the Ansible provisioning stage, so at that point the runner got a token, which you can verify by comparing the `/etc/gitlab-runner/config.toml` configuration with your project’s gitlab web interface in “Settings -> CI/CD -> Runners”. Therefore, if you start running multiple containers, all of them will have the same token.

The above means that you have to deal with this somehow, otherwise you would need to build a new image every time you need a runner. Well, no worries, this can be done with a simple script like in the AWS case previously, which may seem a bit ugly, but it’s fine to use. Again, there are many ways to do that. I’ll explain two of them.

One way is to add an entry script in the docker image and then point to it as an entry point when you run a new container. BUT that means that you need to store the token in the image, which is not great, and it also means that this image will only be valid for that specific gitlab repo. That doesn’t sound great, right?

The other way is to have a script on your host and then run a new container by mounting that script and running it as the entry point. That way you keep the token on the host and you can also have a git hosted and versioned script stored on a different git repo (which brings several good things and helps automation). So, let’s see how to do this.

First have a look in the git repo (stm32-cde-template) and open the `` file. In this file you need to add the token from your repo (stm32f103-cmake-template in our case), or you can pass the token on the command line in your automation, but for this example just enter the token in the file.

What the script does is first un-register any previous local runner (local to the container, not globally) and then register a new runner and run it. So simple. The only not-so-simple thing is the weird command you need to execute in order to run a new container, which is:

docker run -d -v $(pwd)/ dimtass/stm32-cde-image-gitlab-runner:0.1 -c "/bin/bash /"

Note that you need to run this command in the top level directory of the stm32-cde-template repo. What that command does is create a detached container, mount the host’s local script and then execute that script at the container entry. Ugly command, but what it does is quite simple.

What to do now? Just run that command multiple times! 5-6 times will do. Now run docker ps -a on your host and also check your repo’s Runners in the settings. This is my repo after running this command:

Cool right? I have 6 gitlab runners, running on my workstation that can share 16 cores. OK, enough with that, now stop all the containers and remove them using docker stop and docker rm.


With this post we’re done with the simple ways of creating a common development environment (CDE) image that we can use to build firmware for the STM32. This image can also be used to create gitlab-runners that will build the firmware in a pipeline. You’ve noticed that there are many ways to achieve the same result and each way is better in some things and worse in others. It’s up to you to decide the best and simplest approach to solve your problem and base your CI/CD architecture on.

You need to experiment at least once with each case and document the pros and cons of each method yourself. Pretty much the same thing that I’ve done in the last two posts, but you need to do it yourself, because you may have other needs and want to get deeper into aspects that I didn’t cover myself. Actually, there are so many different projects and each project has its own specifications and needs, so it’s hard to tell which is the best solution for any case.

In the next post, I’ll get deeper into creating pipelines with testing farms for building the firmware, flashing and testing. You’ll see that there are cheap ways to create small farms and that these can be used in many different ways. Keep in mind that these posts are just an introduction to DevOps for embedded and only scratch the surface of this huge topic. You may find that the examples in these posts are enough for most cases, as not all projects are complicated, but they are still just scratching the surface when it comes to large scale projects, where things are much more complicated. In any case, even in large scale projects you start simple and then get deeper step-by-step. Until the next post…

Have fun!








DevOps for embedded (part 1)


Note: This is the first post of the DevOps for Embedded series.

Wow, it’s been some time since the last post. A lot has happened since then, but now it’s time to come back with a new series of posts around DevOps and embedded. Of course, embedded is a vast domain (as is DevOps), so I’ll start with a stupid project (of course) that just builds the code of a small MCU and then I’ll probably get into a small embedded Linux project with Yocto, because there are significant differences in the approach and strategies you need to use in each case, although the tools are pretty much the same. One thing that you’ll realize when you get familiar with the tools and the DevOps domain is that you can achieve the same result using different methodologies and architectures, but what matters at the end of the day is to find a solution which is simple for everyone to understand, re-usable, scalable and easy to maintain.

Oh, btw, this article is not meant for DevOps engineers, as they’re already familiar with these basic tools and the architecture of CI/CD pipelines. Maybe what’s interesting for DevOps engineers, especially those coming from the Web/Cloud/DB domain, are the problems that embedded engineers have and the tools that they’re using.

Note: the host OS used in the article is Ubuntu 18.04.3 LTS, therefore the commands and tools installation procedure is for this OS, but it should be quite similar for any OS.

What is DevOps?

Actually, there is a lot of discussion around DevOps and what exactly it is. I’ll try to describe it in a simple way. The word is an abbreviation and it means Development/Operations and according to wikipedia: “DevOps is a set of practices that combines software development (Dev) and information-technology operations (Ops) which aims to shorten the systems development life cycle and provide continuous delivery with high software quality.”

To put it in other words, it’s all about automation. Automate everything. Automate all the stages of development (+deployment/delivery) and every detail of the procedure. Ideally, a successful DevOps architecture would achieve this: if the building burns down to the ground, then if you just buy new equipment, you can have your infrastructure back within a couple of hours. If your infrastructure is in the cloud, then you can have it back within minutes. The automation ideally starts from people’s access key-cards for the building and ends at the product that clients get into their hands. Some would argue that the coffee machine should also be included in the automation process; in this case, if it has a CPU and a debug port, then it’s doable.

If you have to keep something from this post, it is that DevOps is about providing simple and scalable solutions to the problems that arise while you’re trying to fully automate your infrastructure.

Why DevOps?

It should be obvious by now that having everything automated not only saves you a lot of time (e.g. having your infrastructure back in case it “dies in a bizarre gardening accident” [Spinal Tap pun]), but, if it’s done right, it also makes it easier to track problematic changes in your infrastructure and revert or fix them fast, so it minimizes failures and hard-to-find issues. Also, because you will normally automate your infrastructure gradually, you will have to architect this automation in steps, which makes you understand your infrastructure better and deal more efficiently with issues and problems. This understanding of the core of your infrastructure is useful and it’s a nice opportunity to also question your current infrastructure architecture and processes and make changes that will help those processes in future developments.

It’s not necessary to automate 100% of your infrastructure, but it is important to at least automate procedures that are repeated and prone to frequent errors or issues that add delays to the development cycle. Usually those are the things that are altered by many people at the same time, for example code. In this case you need a basic framework and workflow that protects you from spending a lot of time debugging and trying to fix issues rather than adding new functionality.

What DevOps isn’t.

DevOps isn’t about the tools! There are so many tools, frameworks, platforms, languages, abbreviations and acronyms in DevOps that you’ll definitely get lost. There are tons of them! And to make it even worse, there are new things coming out almost every week. Of course, trying to integrate all the new trends into your automation infrastructure doesn’t make sense and shouldn’t be your objective. Each of these tools has its usage and you need to know exactly what you need. If you have time for research, then do your evaluation, but don’t force yourself to use something just because it’s hype.

In DevOps, and automation in general, you can achieve the same result in numerous different ways and with different tools. There’s no single or “right” way of doing something. It’s up to you to understand your problem well, know at least a few of the available tools and provide a simple solution. What’s important here is to always provide simple solutions. Simple enough for everyone in the team to understand, and easy to test, troubleshoot and debug.

So forget about having to use Kubernetes or Jenkins or whatever, just because others do. You only need to find the simplest solution for your problem.

DevOps and embedded

DevOps is usually found in cloud services. That means that most of the web servers and databases on the internet are based on DevOps practices, which makes a lot of sense, since you wouldn’t like your web page or corporate online platform to be offline for hours or days if something goes wrong. Everything needs to be automated and actually there’s a special branch of DevOps called Site Reliability Engineering (SRE) that hardens the infrastructure even more, up to the point, in some cases, of surviving even catastrophic global events. This blog, for example, is running on a VPS (virtual private server) in the cloud, but tbh I don’t think that my host service would survive a global catastrophic event; then again, if that happens, who cares about a blog.

So, what about embedded? How would you apply DevOps to your embedded workflow and infrastructure? Well, that’s easy, actually. Just think how you would automate everything you do, so that if you lose everything tomorrow (except your code, of course), you can be back on track in a few minutes or hours. For example, all your developers get new laptops and are able to start working in a few minutes, with all the tools, SDKs and toolchains at the same versions and the development environment the same for everyone. Also, your build servers are back online and you start your continuous integration, continuous delivery (CI/CD) in minutes.

How does this sound? Cool, right?

Well, it’s doable, of course, but there are many steps in between and the most important thing is to be able to understand what problems you need to solve to get there, what your expectations are in time, cost and resources, and how to architect such a solution in a simple and robust way. What also makes embedded interesting is that most of the time you have a hardware product on which you need to flash the firmware for the various MCUs, FPGAs and DSPs and then run automated tests to verify that everything works properly. If your project is large, you may even have farms with the specific hardware running and use them in your CI/CD pipeline.

Although it’s too early in the article, I’ll put a link to a presentation that BMW did a few years ago. They managed to automate a large scale embedded project for an automotive head unit. This is the YouTube presentation and this is the PDF document. I repeat that this is a large scale project with hundreds of developers, and also complicated in many aspects. Nevertheless, they successfully managed to automate the full process (at least that’s the claim).

Have in mind that while reading about this domain, you’ll often find alternative phrases like “CI/CD” or “CI pipeline” used instead of DevOps. Actually, CI/CD is considered part of the DevOps domain and it’s not an alternative abbreviation for DevOps, but if you know this then you won’t get confused while reading. You’ll also find many other abbreviations like IaC or IaaS, which are more commonly used in cloud computing, but still the target is the same: automate everything.

How to start?

So, now it should be quite clear what DevOps is and how it can help your project, and if you’re interested, then you’re probably wondering: how to start?

Wait… Before getting into details, let’s define a very basic and simple embedded project in which you need to develop firmware for a small MCU (e.g. an STM32). In this case, you need to write the code, compile it, flash it on the target, test it and, if everything works as expected, integrate the changes into the release branch and ship it. Although that’s a very basic project, there are many different ways to automate the whole pipeline and some ways may be better than others depending on the case. The important thing is to understand that each case is different and the challenge is to find the simplest and optimal solution for your specific problem and not over-engineer your solution.
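To give a rough idea of where this is heading, the build step of such a pipeline could be sketched in a gitlab-ci configuration like the following. This is only a hedged illustration: the build command and the artifact path are assumptions of mine, not an actual project file.

```yaml
# Hedged sketch of a minimal .gitlab-ci.yml build stage for such an STM32
# project; the build command and the artifact path are assumptions
stages:
  - build

build-firmware:
  stage: build
  script:
    - ./build.sh              # hypothetical build entry point
  artifacts:
    paths:
      - build/firmware.bin    # hypothetical firmware artifact
```

The flashing and testing stages are where embedded diverges from the usual cloud pipelines, and we’ll get to those later in the series.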

As you realize, whatever follows in this series of articles is not necessarily the best solution to every problem. It’s just one of the ways to solve this particular problem. Because the tools and the ways to do something are so many, I won’t get into details for all of them (I don’t even know all of the available tools out there, as there are too many, so there might be a simpler solution than the one I’m using).

Although there’s no single answer that covers all possible scenarios, there are a couple of things that need to be done in most cases. Therefore, I’ll list those things here and start to analyze a few of them with examples along the way. So, this is the list:

Note that not all of these steps can be fully automated. For example, testing may involve some manual tests, and most of the time automated user acceptance tests are difficult to achieve in embedded projects that may have various electronics around them.

Finally, this is not a comprehensive list, but it’s a basic skeleton for most of the projects. Let’s now see some of the above items.

Common development environment (CDE)

It’s very important to have a common development environment that is shared by all the developers and is also easy to reproduce, test, deploy and deliver itself. This is actually one of the most important things you need to apply to your workflow in the early stages, because it will save you from a lot of trouble down the path.

So, what’s a CDE? The assumption was that we’re going to develop firmware for an STM32 MCU. In this case, the CDE is all those things that are needed to compile your code. That includes your OS version, your system libraries, the toolchain, the compiler and linker flags and the way you build the firmware. What’s not important in this case is the IDE that you’re using, which makes sense if you think about it, and it’s also preferable and convenient, as developers can use their favorite editor to write code (including their favorite plugins). It is important, though, that each of those IDEs doesn’t pollute the project environment with unnecessary custom configuration and files, so precautions are needed there, too (git makes that easy, though).
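For example, a few typical `.gitignore` entries are usually enough to keep per-developer IDE files out of the repo; the editor folders listed here are just common examples, not a complete list.

```
# common IDE/editor clutter that shouldn't pollute the project repo
.vscode/
.idea/
*.sublime-workspace
.project
.cproject
```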

Why is using a CDE important? I’ll describe a real project case: recently I was working at a company where, despite my recommendations, developers refused to integrate a CDE into their development workflow. I had already created a CDE for building a quite complex Yocto distro, and nobody ever used it; I also couldn’t enforce its use, because in flat hierarchies the majority can also make bad choices. Eventually, developers were using their own workstation environments to build the Linux distro, which in the end led to a chaotic situation where the image wasn’t building the same way on all workstations; on some workstations it was even failing to build while building fine on others. As you can imagine, this is not sustainable and it makes development really hard and chaotic. At some point I just gave up on even trying to build the distro, as it needed too much time to solve issues rather than building. Having a CDE ensures that your project will build the same way wherever it runs, which is robust and consistent. This reason alone should actually be enough for everyone to consider adopting a CDE in the development workflow.

So how can you do this? Well, there are a few ways, but it is important to have an automated way to reproduce a versioned CDE. This is usually done by using a Virtual Machine (or virtual OS image) that is provisioned to include all the needed tools. There are quite a few virtualization solutions like VMWare, VirtualBox, Docker, and cloud VPS (Virtual Private Server) providers like AWS, Azure, e.t.c. Some of them are free, but for others you pay per usage or need to buy a license. Which one is best for you is something that you need to decide. Usually, if you don’t have any experience, you will probably need to buy support even for a free product, just to play it safe.

Therefore, in our project example the CDE will be a VM (or virtual OS) that includes all those tools and can be shared between developers who may even use different OSes as the base system on their workstations. See? This is also very useful because the CDE will be OS independent.

Since there are many virtualization technologies, you need to first decide which one suits your needs. VMWare and VirtualBox are virtual machine solutions, which means that they run a software machine on top of your host OS. This is a sandboxed machine that uses your hardware’s shared resources, therefore you can assign a number of CPUs, RAM and storage from your host system to the VM. As you can imagine, the performance of this setup is not that great and you may need more resources compared to other solutions.

Another solution is a containerized image, like Docker, which uses your hardware resources without a virtual machine layer in between. It’s still isolated (though not 100%) and since it’s containerized it can be transferred and re-used, too (like the VM solution). This solution has better performance compared to VMs, but its underlying implementation differs depending on the host OS: on a Linux host Docker performs great, but on a Windows host it’s actually implemented on top of a VM.

Then there are solutions like working on a remote VPS (virtual private server), which is either a baremetal or a cloud server (on your premises or at a remote location). These servers run an OS and host the CDE that you use to build your code; you edit your code locally on your host OS with your favorite IDE and then build it remotely on the VPS. The benefit of this solution is that the VPS can be faster than your laptop/workstation and can also be shared among other developers. The disadvantage is that you need a network connection, which can become a problem if the produced build artifacts that you need to test are large and the connection is slow (e.g. if the image is a 30+ GB Linux distro, that’s a lot of data to download even on fast networks).

What’s important though, whatever solution you choose, is to apply the IaC (Infrastructure as Code) concept, which means that the infrastructure is managed and provisioned through scripts, code or config files.

Now, let’s examine our specific case. Since we’re interested in building baremetal firmware on the STM32, it’s obvious that we don’t need a lot of resources or a powerful builder. Even if the project has a lot of code, the build may take 1-2 minutes, which is a fair time for a large code base. If the code base is smaller, it may be just a few seconds. Of course, if the build runs on a host that can run a parallel build and compile many source files at the same time, then it will also be faster. So in this case, which solution is preferred?

Well, all the solutions that I’ve mentioned earlier are suitable for this case; you can use VirtualBox or Docker (which are free), or a VPS like AWS, which is not free but doesn’t cost that much (especially for a company or organization) and also saves you the time, money and energy of running your own infrastructure 24/7, as you pay only for the time your image is running.

Because all these solutions seem suitable, there’s another question. How can you evaluate all of them and decide? Or even better, how can you keep an alternative solution available, so that it won’t cost you a lot of time and money when you decide to switch from one solution to another? Well, as you might guess, there are tools for this, like Packer or Vagrant, that simplify the process of creating different types of images from the same configuration. Although those two tools seem similar, they’re actually a bit different and it’s always a good idea to read more about them in the previous links. You can use both to create CDEs, but Packer is better for creating images (or boxes, which can also be used from Vagrant) and it can do so from a single configuration file that builds images for different providers at the same time. Providers in this context are Docker, Vagrant, AWS etc. Also, both tools support provisioning tools that allow you to customize your image.

There are many provisioning tools like Puppet, Chef, Ansible and even plain shell scripts, and Packer and Vagrant can pretty much use the same provisioners (link1 and link2). I won’t get into the details of each one of them, but although shell scripts seem like the simplest provisioning method, they’re really not. In this article series I’ll use Ansible.

Why Ansible? Because it’s more flexible, it makes it easier to handle complex cases, and it comes with a ton of ready-to-use examples, modules and roles (roles are just pre-defined and tested scripts for popular packages). That makes it better than plain shell scripts. Compared to Puppet and Chef, Ansible is “better” in our case because it connects to the target via SSH and configures it remotely, instead of requiring the provisioner to run on the target itself.

Another problem with shell scripts is that they can get too complicated if you want to cover all cases. For example, let’s assume that you want to create a directory on the target image; you use the mkdir command to create it. But if the directory already exists, the script will fail, unless you first check whether it exists before executing the command. Ansible does exactly that: it first checks the state it needs to change and, if the state is already the desired one, it doesn’t apply the change.
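To see this difference in a terminal, here’s a tiny sketch (the path is arbitrary, just for the demo): a plain mkdir fails on a second run, while mkdir -p, much like Ansible’s state checking, simply converges to the desired state:

```shell
dir=/tmp/cde-demo
rm -rf "$dir"
mkdir "$dir"                                       # first run: creates the directory
mkdir "$dir" 2>/dev/null || echo "already there"   # a second plain mkdir fails
mkdir -p "$dir" && echo "ok either way"            # -p is idempotent, like Ansible
```

Ansible’s file module with state: directory behaves like the last line: it reports “changed” only when it actually had to create something.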

Since we decided to create a Docker image and an AWS EC2 AMI, which have almost the same configuration, Packer seems a suitable solution. You could also use Vagrant, but let’s go with Packer for now. In both cases you need a Docker and an AWS account to proceed with the rest of the examples, as you’ll need to push those images to a remote server. If you don’t care about the cloud and pushing an image there, you can still use Packer to create only the Docker image and push it to the local repository on your computer; but in that case maybe go with Vagrant, as it provides a more user-friendly way to build, run and use the image without having to deal with the docker cli.

I won’t get into details here on how to create your AWS account and credentials, as this would take a lot of time and writing. You can find this information in the AWS EC2 documentation, which explains things much better than I could. Therefore, from now on I assume that you already have at least a free AWS account and your credentials (access and secret key).

In this article, we’re going to use these repos:

In the first repo you’ll find the files that Packer will use to build the image and the second repo is just a small STM32 template code sample that I use every time I start a new project for the STM32F103.

Before continuing with building the image, let’s have a look at the environment we need inside it. Looking at the source code, it’s obvious that we need CMake and an ARM GCC toolchain. Therefore, the image needs to include these, and we do that by using Ansible as the provisioner during the image build. Ansible uses playbooks and roles for provisioning targets. Again, I won’t get into the details of how to use Ansible in this article; you can find more information in the online documentation here.
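As a quick sanity check, once the image is built and you’re inside a container you can verify that the basic tools exist; a minimal sketch (the tool list is my assumption of the bare minimum this setup needs):

```shell
# Report which of the expected build tools are available on the PATH.
for tool in cmake make git; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: ok"
    else
        echo "$tool: MISSING"
    fi
done
```

Note that the ARM toolchain is installed under /opt/toolchains in this setup and is passed to the build explicitly, so it won’t show up on the PATH.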

Installing Packer

To install Packer, download the pre-compiled binary for your architecture here. For Ubuntu 18.04 that would be the Linux 64-bit one. Then copy it to /usr/bin (using sudo). You can also just copy it to your ~/.local/bin folder if you’re the only user that will use Packer.

To test if everything is ok, run:

packer -v

In my case it returns version 1.4.5.

Installing Ansible

To install Ansible on Ubuntu 18.04 you need to add an external repo and install it from there with the following commands:

sudo apt-add-repository ppa:ansible/ansible
sudo apt-get update
sudo apt-get install ansible -y

This is the official ppa repo, so it’s ok to use that. If you trust nobody, then you can see other installation methods here (you can also just use the git repo).

To test that everything is ok, run:

ansible --version

In my case the version is 2.9.1. Have in mind that since those tools are under constant development they’re updated regularly, so always use the documentation that matches your version.

Registering to docker and creating a repository

In this example we’ll use the docker hub and create a repository for this image and store it there. The reason for this is that the image will be accessible from anywhere and any docker host will have access to it. Of course, in case you don’t want to publish your image you can just create either a local repository on your workstation or have a local server in the network that stores the docker images.

In this case, I’ll use docker hub. So you need to create an account there and then create a repository with the name stm32-cde-image. Your image will then be in the <username>/stm32-cde-image repository; in my case, since my username is dimtass, it will be dimtass/stm32-cde-image.

Now that you have your repo, you need to go to your user account settings and create a token in order to be able to push images to your repositories. You can do this in Account Settings -> Security -> Access Tokens; follow the instructions there to create a token, then run a shell command with your username and the token to create your credentials locally on your system. After that you can push images to your account, so Packer can do the same.

Finally, in the stm32-cde-docker.json and stm32-cde-images.json files, you need to change the repository name in the docker post-processor to yours, otherwise the image build will fail.

Create the CDE images

Now that you have all the tools you need, open your terminal in the packer repo you cloned earlier and have a look at the stm32-cde-images.json, stm32-cde-aws.json and stm32-cde-docker.json files. These are the configuration files that packer will use to create the images. The stm32-cde-images.json file contains the configuration for both the docker and aws ec2 images, while the other two files are for each image separately (docker, aws).

In part-1 I’ll only explain how to build and use the docker image, so forget about anything related to AWS for now. In the stm32-cde-images.json and stm32-cde-docker.json files there is a builder of type docker. In this case, all that matters is to choose the base image from a repository. I’ve chosen ubuntu:16.04 as the base image, which is from the official docker repository and is quite similar to amazon’s EC2 image (there are significant differences, though). The base image is an official docker image that the docker client on your host OS will fetch from the docker hub and use as a base to create another image on top of it. This official image is a stripped-down ubuntu image on which you can install whatever packages, tools and libraries you need; the more you add, the larger the image gets. The base ubuntu:16.04 image may be just 123MB, but it can easily grow to a GB or more.

As I’ve noted earlier, although you can use Packer to build images for various different builder types, this doesn’t mean that all the base images from all the sources will be identical; there might be some insignificant differences between them. Nevertheless, a provisioner like Ansible handles this without issues, as it only applies changes that aren’t already in the image; that’s the reason Ansible is better than a generic shell script, which may fail on one specific image and not on another. Of course, if you use shell commands in your Ansible configuration, you just need to test that everything works as expected.

To create the docker image you need to clone this repo:

And use packer to build the stm32-cde-docker.json like this:

git clone
cd stm32-cde-template
packer build stm32-cde-docker.json

Note: before running the build command, you need to change the docker_repo variable to the docker repo you’ve created, and you also need to have created a token so that docker is able to push the image.

If everything goes fine, then you’ll see something like this:

==> docker: Creating a temporary directory for sharing data...
==> docker: Pulling Docker image: ubuntu:16.04
==> docker: Using docker communicator to connect:
==> docker: Provisioning with shell script: /tmp/packer-shell909138989
==> docker: Provisioning with Ansible...
==> docker: Committing the container
==> docker: Killing the container: 84a9c7842a090521f8dc9bd70c39bd99b6ce46b0d409da3cdf68b05404484b0f
==> docker: Running post-processor: docker-tag
    docker (docker-tag): Tagging image: sha256:0126157afb61d7b4dff8124fcc06a776425ac6c08eeb1866322a63fa1c3d3921
    docker (docker-tag): Repository: dimtass/stm32-cde-image:0.1
==> docker: Running post-processor: docker-push
Build 'docker' finished.

That means that your image was built without errors and was also pushed to the docker hub repository.

Using the Docker CDE image

OK, now we have an image; what do we do with it?

Here we need to make clear that this image is just the CDE. It’s not used in any way yet; it’s just available to be used however you like. This image can be used in different ways, of course. For example, developers can use the image to build their code locally on their workstation or remotely, but the same image can also be used to build the code in a continuous integration pipeline. The same image can also be used for testing, though in that case, as you’ll see, you need to use the provisioner to add more tools and packages (which is actually done already). Therefore, this image is the base image that we can use in our whole pipeline workflow.

Note that these images were created using only configuration files, which are versioned and stored in a git repo. This is nice because you can create branches to test your changes, add more tools and also keep a history log of all the changes, so you can track and debug any issues that were introduced. I hope it’s clear how convenient that is.

Depending on the kind of image you build, there’s a slightly different way to use it. For example, if you have a VirtualBox VM then you need to launch VBox and start the image; if you’ve used Vagrant (regardless of whether the Vagrant box is a VirtualBox or a Docker image), then you start the image using “vagrant up” and connect to it with “vagrant ssh”. If you’re using docker, then you run a new container from the image, attach to it, do your job and then stop/destroy the container (or leave it stopped and re-use it later). For AWS AMIs the procedure is a bit different. So, the tool and workflow you used to create the image define how you use it, but what happens in the background is pretty much the same in most cases (e.g. in the case of Docker, it doesn’t matter if you use Vagrant or the docker command line (cli), the result in the background is the same).

In this post I’ll only explain how to use the Docker image and I’ll explain how to use the Vagrant and AWS EC2 AMI in a next post.

So, let’s start with the docker image. Again, I won’t go into the details of the cli commands; I’ll just list them and explain the result. For more details you need to consult the docker cli reference manual.

In the previous steps we created the docker image and pushed it to our hub repository. Although I’ll use my repo’s image name, you need to change that to yours (my repo is also fine for testing, though). The repo we’re going to use to build the code is this one:

Therefore, the objective is to create a docker container from our new image and then clone the code repo and build it inside the docker container.

cd ~
docker run -it dimtass/stm32-cde-image:0.1 -c "/bin/bash"

This will create a new container, run it and attach to its bash terminal, so you should find your terminal’s prompt inside the running container. Now, to test that the container has all the tools that are needed and also build the firmware, all you have to do is git clone the stm32 firmware repo in the container and build it, like this:

cd /root
git clone
cd stm32f103-cmake-template
TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./

The above commands should be run inside the docker container. First you change directory to /root and then clone and build the template repo. In the build command (the last one), you need to pass the toolchain path; in this case Packer and Ansible installed the toolchain in `/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major`. Then you pass a few other parameters that are explained in the stm32f103-cmake-template repo, as you can also build a FreeRTOS template project or use libopencm3 instead of ST’s standard peripheral library.

The result of the build command in my case was:

[ 95%] Linking C executable stm32-cmake-template.elf
   text	   data	    bss	    dec	    hex	filename
  14924	    856	   1144	  16924	   421c	stm32-cmake-template.elf
[ 95%] Built target stm32-cmake-template.elf
Scanning dependencies of target stm32-cmake-template.bin
Scanning dependencies of target stm32-cmake-template.hex
[100%] Generating stm32-cmake-template.hex
[100%] Generating stm32-cmake-template.bin
[100%] Built target stm32-cmake-template.bin
[100%] Built target stm32-cmake-template.hex

That means that our CMake project for the STM32 was built fine and now we have a hex file that we can use. But, wait… how can you test this firmware? Is this how you develop? How do you make code changes and build them? Well, there are many options here. To flash the image from inside the container you can expose the USB st-link programmer to docker by using the --device option when creating the container with docker run -it. You can have a look at how to use this option here. We’ll get to that later.

The most important thing now, though, is how a developer can use an IDE to make changes in the code and then build the firmware. The option for this is to share part of your drive with the container using volumes. That way you have a folder that is mounted in the container, and you can clone the repo inside this folder. The result is that the repo will be cloned to your local folder, so you can use your IDE in your OS to open the code and make edits, and then build the code inside the container.

To do that, let’s first clean up the previous container. Run this command in your OS terminal (not the docker container); it will list the currently stopped or running containers.

docker ps -a

An example of this output may be:

CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS                     PORTS               NAMES
166f2ef0ff7d        dimtass/stm32-cde-image:0.1     "/bin/sh -c /bin/bash"   2 minutes ago       Up 2 minutes                                   admiring_jepsen

The above output means that there is a container running that uses the dimtass/stm32-cde-image. That’s the one I started earlier, so now I need to stop it (if it’s running) and then remove it, like this:

docker stop 166f2ef0ff7d
docker rm 166f2ef0ff7d

Note that `166f2ef0ff7d` is the container’s ID, which is unique for this container; you can use the docker cli tool to run actions on this specific container via its ID. The above commands just stop and then remove the container.

Now, you need to run a new container, and this time mount a local folder from your OS filesystem into the docker container. To do that, run these commands:

cd ~/Downloads
mkdir -p testing-docker-image/shared
cd testing-docker-image
docker run -v $(pwd)/shared:/home/stm32/shared -it dimtass/stm32-cde-image:0.1 -c "/bin/bash"

The last command will leave you inside the container’s bash. Then you need to change to the stm32 user and repeat the git clone.

cd /home/stm32
su stm32
git clone --recursive
cd stm32f103-cmake-template
TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./

There’s a trap here, though! For this to work, your OS user needs to have the same uid and gid as the stm32 user in the container (or the other way around). Therefore, if your user has uid/gid 1000/1000, the stm32 user in the container needs the same. If that’s not the case for you, you need to create a user in the container with the matching uid/gid (or modify the existing one). If they’re not the same, the file permissions will differ and you won’t be able to create/edit files.

For example, if your OS uid/gid is 1001 and the stm32 user in the container has 1000, then while the container is running, run this command in a terminal inside the docker container:

usermod -u 1001 stm32

The above command will change the uid of the stm32 user inside the container from 1000 to 1001 (use groupmod -g in the same way if the gid needs changing, too).
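A quick way to check both sides is the id command; here’s a sketch (the stm32 user name comes from this image, adapt it if yours differs):

```shell
# On the host: print your own uid/gid.
echo "host uid: $(id -u), gid: $(id -g)"
# Inside the container (run as root) you would check and, if needed, fix the ids:
#   id -u stm32; id -g stm32
#   usermod -u <host-uid> stm32
#   groupmod -g <host-gid> stm32
```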

Let’s now assume that the permissions are OK and the uid/gid are the same for both users in your OS and the container. That means you can use your favorite IDE to open the git project and make your changes, while the docker container runs in the background, and build the source whenever you like. Have in mind that the build will be a manual procedure; you won’t be able to press a button in your IDE to build the source in the container. Of course, you can achieve this functionality if you wish, as there are a few ways, but let’s not focus on this right now.

Finally, you’ve managed to create an image that is your common development environment and contains all the needed tools to build your code and test it. Any time you need to add more tools to your image, you can add more packages or roles to the Ansible provisioning files. Then all the developers will automatically get the latest updates when they create a new container to build the firmware. For this reason, it’s good to use the latest tag for the docker image in the repository, so the devs automatically get the latest image.

Flashing the firmware

To flash the firmware that was built with the docker image, there are two ways. One is to use your OS with the st-link tool installed, but this means that you need to find the tool for your OS and install it. Also, if some developers have different OSes and workstations with different architectures, they will probably have different versions of the tool, or not even the same tool. Therefore, having the st-flash tool in the docker image helps distribute the same flasher to all developers. In this case, there’s an Ansible role that installs st-flash in the image, so it’s there and you can use it for flashing.

As mentioned earlier, though, that’s not enough, because the docker container needs to have access to the USB st-link device. That’s not a problem with docker, though. In my case, as I’m running Ubuntu, it’s easy to find the USB device path. Just run the lsusb command in a terminal and get the description; in my case it’s:

Bus 001 Device 006: ID 0483:3748 STMicroelectronics ST-LINK/V2

That means that the ST-LINK V2 device was enumerated on bus 001 with device ID 006. For Linux and docker that means that if you mount the `/dev/bus/usb/001/` path into the docker container, you’ll be able to use the st-link device. One thing to note here is that this trick always expects the programmer to be connected on the same bus. Usually this is not a problem for a developer machine, but in case you want to generalize this you can mount /dev/bus/usb and docker will have access to any usb device that is registered. That’s a bad practice, though; for development, always use the same USB port.
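If you want to avoid reading the lsusb output by eye, the bus number can be extracted and turned into the mount path with a bit of shell; a sketch using my sample lsusb line (normally you’d feed it from `lsusb | grep ST-LINK`):

```shell
# Sample lsusb line hard-coded for the demo; the second field is the bus number.
line="Bus 001 Device 006: ID 0483:3748 STMicroelectronics ST-LINK/V2"
bus=$(echo "$line" | awk '{print $2}')
echo "/dev/bus/usb/${bus}"    # prints /dev/bus/usb/001
```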

Therefore, now remove any previous containers that are stopped/running (use docker ps -a to find them) and then create/run a new container that also mounts the /dev path for the usb, like this:

docker run -v $(pwd)/shared:/home/stm32/shared -v /dev/bus/usb/001:/dev/bus/usb/001 --privileged -it dimtass/stm32-cde-image:0.1 -c "/bin/bash"

Disconnect the device if it’s already connected and then re-connect it. Then, to verify that the device is enumerated in the docker container and that the udev rules are properly loaded, run this command:

st-info --probe

In my case this returns:

root@6ebcda502dd5:/# st-info --probe
Found 1 stlink programmers
 serial: 513f6e06493f55564009213f
openocd: "\x51\x3f\x6e\x06\x49\x3f\x55\x56\x40\x09\x21\x3f"
  flash: 65536 (pagesize: 1024)
   sram: 20480
 chipid: 0x0410
  descr: F1 Medium-density device

Great, so the device is enumerated fine in the docker container. Now, let’s clone the repo again, build it and flash the firmware. Before proceeding to this step, make sure that the stm32f103 (I’m using a blue-pill) is connected to the st-link and that the st-link is enumerated in the docker container. Then run these commands:

cd /home/stm32
su stm32
git clone --recursive
cd stm32f103-cmake-template
TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./

Now that the firmware is built, flash it with this command:

st-flash --reset write build-stm32/src_stdperiph/stm32-cmake-template.bin 0x8000000

You should see something like this:

stm32@6ebcda502dd5:~/stm32f103-cmake-template$ st-flash --reset write build-stm32/src_stdperiph/stm32-cmake-template.bin 0x8000000
st-flash 1.5.1
2019-12-02T21:18:04 INFO common.c: Loading device parameters....
2019-12-02T21:18:04 INFO common.c: Device connected is: F1 Medium-density device, id 0x20036410
2019-12-02T21:18:04 INFO common.c: SRAM size: 0x5000 bytes (20 KiB), Flash: 0x10000 bytes (64 KiB) in pages of 1024 bytes
2019-12-02T21:18:04 INFO common.c: Attempting to write 15780 (0x3da4) bytes to stm32 address: 134217728 (0x8000000)
Flash page at addr: 0x08003c00 erased
2019-12-02T21:18:04 INFO common.c: Finished erasing 16 pages of 1024 (0x400) bytes
2019-12-02T21:18:04 INFO common.c: Starting Flash write for VL/F0/F3/F1_XL core id
2019-12-02T21:18:04 INFO flash_loader.c: Successfully loaded flash loader in sram
 16/16 pages written
2019-12-02T21:18:05 INFO common.c: Starting verification of write complete
2019-12-02T21:18:05 INFO common.c: Flash written and verified! jolly good!

What just happened is that you ran a docker container on your machine, which may not have any of the tools needed to build and flash the firmware, and you managed to build the firmware and flash it anyway. In the same way, any other developer on the team, in the building or at home, can build the same firmware the same way, get the same binary and flash it with the same programmer version. This means that you’ve just ruled out a great part of the development chain that can cause weird issues, and therefore delays, in your development process.

Now every developer that has access to this image can build and flash the same firmware in the same way. Why wouldn’t somebody like that?

I understand that some embedded engineers may find this process cumbersome, but it isn’t really. If you think about it, the only additional steps for the developer are to create the container once and start it once on every boot; after that, development goes the same way as before. If you were using automations before, like pressing a button in the IDE to build and flash, that’s not a big issue either. There are ways to overcome this, as all modern IDEs support custom commands and buttons or running scripts, so you can use those to achieve the same functionality.
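For example, an IDE “build” button could just call a small wrapper script that runs the build inside an already running container with docker exec. This is only a sketch: the container name (stm32-cde) and the in-container build command are my assumptions and need to be adapted to your setup. It prints the command as a dry run; remove the echo/printf wrapper to actually execute it:

```shell
#!/bin/sh
# Hypothetical IDE build-button wrapper (dry run: prints the docker command).
CONTAINER=stm32-cde                                  # assumed container name
BUILD_CMD='cd /home/stm32/shared/stm32f103-cmake-template && cmake --build build-stm32'
printf "docker exec %s sh -c '%s'\n" "$CONTAINER" "$BUILD_CMD"
```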

Using gitlab-ci for CI/CD

Until now, I’ve explained how you can use the CDE image to build your source code as a developer, but in embedded projects it’s also important to have a CI/CD pipeline that builds the code and runs some scripts or tests automatically. For example, in your pipeline you might also want to run a style formatting tool (e.g. clang-format), or anything else you like, then run your tests; if everything passes, you get a successful build.

There are many CI/CD services like Jenkins, Travis-CI, gitlab-ci, buildbot, bamboo and many others, just to name a few. Each one of them has its pros and cons. I haven’t used all of them, so I can’t really comment on all their differences. I’ve only used gitlab-ci, jenkins, buildbot and bamboo, and again not in great depth. Bamboo is mainly used in companies that use Atlassian’s tools widely in their organisation, but you need to buy a license and it’s closed source. Jenkins is the most famous of all; it’s free, open-source and has a ton of plug-ins. Buildbot is used for Yocto projects, which means that it can handle complexity. Gitlab-CI is used in the gitlab platform and it’s also free to use.

From all the above, I personally prefer gitlab-ci, as it’s the most simple and straightforward service for creating pipelines, easily and without much complexity. For example, I find Jenkins pipelines to be very cumbersome and hard to maintain; it feels like the pipeline interface doesn’t fit well with the rest of the platform. Of course, there are people who prefer Jenkins or anything else over GitLab-CI. Anyway, there’s no right or wrong here; it has to do with your current infrastructure, needs and expertise. In my case, I find gitlab-ci dead simple to use and I can create a pipeline in no time.

To use gitlab-ci, you need to host your project on gitlab or install gitlab on your own server. In this example, I’m hosting the `stm32f103-cmake-template` project on bitbucket, github and gitlab at the same time, and all these repos are mirrors (I’ll explain in another post how I do this, but I can share the script that I’ve written to do it). Therefore, the same project can be found in all these repos:

For now, we care only for the gitlab repo. In your case, you need to either fork this repo, or create a new one with your code and follow the guide. Therefore, from now on, it doesn’t matter if I refer to the `stm32f103-cmake-template` project as this can be any project.

To use gitlab-ci, you only need a template file in your repo called `.gitlab-ci.yml`. This is a simple YAML file that configures the CI/CD of your project. To get more information and details on this, read the documentation here. In this example, I’ll only use simple, basic stuff so it’s easy to follow, but you can do much more complex things than what I’m doing in this post.

For our example, we’re going to use this repo here:

I’ve added a .gitlab-ci.yml file in the repo, that you can see here:

image:
  name: dimtass/stm32-cde-image:0.1
  entrypoint: [""]

stages:
  - build
  - test

build:
  stage: build
  script: TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./

test:
  stage: test
  script: echo "Dummy tests are done..."

The first thing you notice is the image entry. This points to the docker image that I’ve built with packer and pushed to the docker hub registry. This entry forces the gitlab-runner instance to download the image, run a container, and run the given stages inside that container. The entrypoint entry handles a problem that some images built with Packer have: the container can’t find the /bin/sh path. You can read more about the issue here. This already raises the flag that not all solutions behave identically in the end; depending on the tools and the way you use them, there may be specific cases you need to handle differently.

Lastly, there are two stages, build and test. The test stage doesn’t do anything in this case and is empty, but the build stage will run the build command(s); if any step of these stages fails, the pipeline fails.

After committing and pushing the .gitlab-ci.yml file to the repo, the project’s CI/CD service will automatically pick up the job and use any available shared runner from the stack to run the pipeline. This is an important piece of information: by default the shared gitlab runners use a generic docker image, which will fail to build your specific project, and that’s why you need to declare to gitlab-ci that it needs to pull your custom docker image (dimtass/stm32-cde-image in this case).

After picking up the job, the runner will start executing the steps of the yaml file. See how simple and easy it is to create this file. This is an important difference of gitlab-ci compared to other CI/CD services. I’ll post a couple of pictures from that build here:

In the first picture, you see the output of the runner and that it’s pulling the dimtass/stm32-cde-image and then fetches the git submodules. In the next image you can see that the build was successful and finally in the third image you see that both build and test stages completed successfully.

A few notes now. As you can imagine, there are no produced artifacts, meaning that the firmware is built and then gone. If you need to keep the artifact, you need to push/upload it to some kind of registry from which you can download it. Recently, though, gitlab supports hosting the build artifact for a specific time, so let’s use this! I’ve changed my .gitlab-ci.yml to this one:

    image:
      name: dimtass/stm32-cde-image:0.1
      entrypoint: [""]

    stages:
      - build
      - test

    build:
      stage: build
      script: TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./
      artifacts:
        paths:
          - build-stm32/src_stdperiph/stm32-cmake-template.bin
        expire_in: 1 week

    test:
      stage: test
      script: echo "Dummy tests are done..."

The above yml script will now also create an artifact that you can download, flash and test on your target. To download the artifact, go to your project’s CI/CD -> Pipelines, click the drop-down menu on the last build in the list and download the .zip file that contains the artifact. This is shown here:

Another thing you may notice during the pipeline execution is that each stage runs after the previous one is cleaned. Therefore, the test stage will download the docker image again, clone the git submodules again and then run the test stage. That doesn’t sound like what we really want, right? Don’t get me wrong, the default behavior is totally right and it seems very logical for each stage to run in a clean environment; it’s just that in some cases you don’t want this to happen. In that case you can run your tests in the build stage.

Finally, what’s useful in gitlab is the cache. This is useful because when you run a stage, you can cache a specific folder or file and tag it with a key. You can do this for several files/folders. Then in another stage, you can use the cache key to get access to those files/folders and run whatever actions you like. To do this, let’s change our .gitlab-ci.yml file into this:

    image:
      name: dimtass/stm32-cde-image:0.1
      entrypoint: [""]

    stages:
      - build
      - test

    build:
      stage: build
      script: TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./
      cache:
        key: build-cache
        paths:
          - build-stm32/src_stdperiph
      artifacts:
        paths:
          - build-stm32/src_stdperiph/stm32-cmake-template.bin
        expire_in: 1 week

    test:
      stage: test
      script: file build-stm32/src_stdperiph/stm32-cmake-template.bin
      cache:
        key: build-cache
As you can see in the above file, the build-cache stores the content of the build-stm32/src_stdperiph folder during the build stage, and the test stage then runs the file command on the binary, which is fetched from the cache. This is the output in gitlab:

Downloading artifacts for build (368146185)...
Downloading artifacts from coordinator... ok        id=368146185 responseStatus=200 OK token=jeC2trFF
$ file build-stm32/src_stdperiph/stm32-cmake-template.bin
build-stm32/src_stdperiph/stm32-cmake-template.bin: data
Job succeeded

Just beautiful. Simple and useful in many ways.


In this first post of the DevOps for embedded series, I’ve covered the creation of a docker image that can be used by developers as a CDE (common development environment) and by CI/CD services as a pipeline image. I’ve used Packer to create the image and Ansible to provision and configure it. Packer is useful because you can use the same configuration file to create almost identical images for different providers, like docker, AWS, etc. Also, Ansible is preferred over another provisioner because it supports many different providers and it configures the image remotely, so it doesn’t have to run locally on the image.

Then I’ve tested the CDE image by building a template firmware locally on my host OS, and I was also able to flash the firmware to the target MCU using the st-flash tool from the image; for this step, though, I had to share part of /dev/bus/… with the docker container, which some may argue is not optimal. Personally, I think it is just fine if you are targeting a specific port rather than your whole /dev directory.

Finally, I’ve used the same CDE image with gitlab-ci and created a CI/CD pipeline that runs a build and a test stage. I’ve also shown how you can use the artifact and cache capabilities of gitlab-ci. Although there are many different CI/CD services, I’ve chosen gitlab because I find it simple and straightforward to use, and the documentation is very good, too. If you want or have to use another CI/CD service, the procedure details will be different, but what you need to do is the same.

In the next post, I’ll show how you can use an AWS EC2 AMI instance (and the AWS tools in general) to do the same thing. Until then…

Have fun!

Control Siglent SDG1025 with python (bonus: add web access using any SBC)


When you’re an electronic engineer you have some standard equipment on your bench: a few multimeters, a soldering station, a bench-top PSU and an oscilloscope. Some of us also have a frequency counter and a waveform generator. If your equipment is quite new, it probably has some kind of communication interface to connect and control the device remotely. That’s very useful in some cases, because you can do a lot of fancy things with it. The problem is that if you’re a Linux user, you usually don’t have it that easy, as the EE world is dominated by Windows. Nevertheless, nowadays there are sometimes solutions that are also open source, which makes it easy to take full control.

This post is about how to control the Siglent SDG1000 waveform generator series using Python. I own an SDG1025, so I’ve only tested on this, but it should be the same for all the series.

Siglent SDG1025

A few words about the SDG1025, though if you’re reading this you probably already have it on your bench. The SDG1025 is an arbitrary function generator that can output a maximum frequency of 25MHz for sine, square and Gaussian noise, but less for other signals (e.g. 5MHz for pulse and arbitrary waves). It has a dual channel output and the second channel can also be used as a frequency counter.

This is the device user manual with more info and that’s how it looks like.

What makes the SDG1025 special is its price and the features that come with it. Price-wise, it is neither a hobbyist nor a pro instrument; it lies somewhere in the middle, but a very small and easy modification brings it into the semi-pro range. This modification is to replace the XTAL on the motherboard with a better TCXO, as the motherboard already has the footprint and it’s an easy modification to do. You can find 25MHz TCXOs with 0.1 or 0.3 ppm on ebay for ~10-15 EUR, and this will make the input/output much more accurate. It is questionable though whether those old-format TCXOs are capable of such low ppm accuracy, but in any case, with a bit more effort you can also use a new SMD TCXO as long as it has a TTL level output. This is a video that shows this modification, if you’re interested.

Another big plus of SDG1025 is the USB communication port. This port is used for communication between your workstation and the instrument and it supports the Virtual Instrument Software Architecture (VISA) API.

The only negative thing I can think of about the SDG1025, is that the fan is loud. I think that would be probably my next hack on the SDG1025, but I don’t use it that often to really consider doing it, yet.


VISA is a standard communication protocol made by various companies, but you probably know it from National Instruments’ NI-VISA, which is used in LabView. This protocol is available over Ethernet, GPIB, serial, USB etc., and it’s quite a simple protocol where the host queries the device and the device sends a response back to the host. The response can be some meaningful data or just an acknowledgement of the received command.

The problem with VISA is that almost all the tools and drivers are made for Windows. You see, LabView was a big thing in the Windows XP era (and I guess it still is in some industries), as it made it easier for people who didn’t know programming to create monitoring and control tools and automations. Of course, a lot of things have changed since then and, to be honest, I’m not aware of the current state of LabView. At some point I believe it was also used by engineers who did know programming, since by then LabView had become a de facto tool in the industry and people preferred the nice GUI elements it was offering. I remember at that time (early 2000’s) I was using Borland’s VCL components with Borland C++ and Delphi to create my own GUIs for instrument control, instead of LabView; I found it easier to use VCL and C++ than the LabView workflow. Oh, the VCL! It was an awesome library, ahead of its time, and I wonder why it didn’t conquer the market…

Anyway, because Windows is favored when it comes to those tools, Linux users are left behind. Times have changed though, and although Linux should by now be considered a main development platform, some companies are left a bit behind, or they provide solutions and support only to specific partners.

USB Port

Now let’s get back to the SDG1025. On the rear side of the generator there’s a USB port, as you can see in this image.

That USB port is used for communicating with your workstation and it supports two modes of operation, USBRAW and USBTMC.

USBTMC stands for USB Test and Measurement Class. You might know it already from those cheap 8-ch, 24MHz logic analyzers sold on ebay, some of which are based on the CY7C68013A EZ-USB. Those analyzers claim compatibility with the Sigrok software suite, which supports USBTMC.

Supporting USBTMC doesn’t mean that your device (e.g. the SDG1025) is supported. It only means that your device implements that class specification; therefore you could integrate your device into Sigrok if you like, but you’d have to add the support yourself. So, it’s just a class specification, not the device-specific command set, which is different and custom for each device and equipment.

So, let’s go back to the SDG1025. You can select the USB mode by pressing the Utility button and browse to this menu [Utility >> Interface >> USB Setup]. There you’ll find those two options:

In this menu you need to select the USBTMC option.

Now connect the USB cable to your Linux machine and run the lsusb command which should print a line like this.

$ lsusb
Bus 008 Device 002: ID f4ed:ee3a Shenzhen Siglent Co., Ltd. SDG1010 Waveform Generator (TMC mode)

That means that your workstation recognises the SDG1025 and that it’s also set to the correct mode (USBTMC). Now all that’s left is to be able to communicate with the device.

Host controller

Usually when you want to control equipment you use your workstation, but that’s not always the case, as you could also use a small Linux SBC. Therefore, in this post I’ll assume that the host controller is the nanopi-neo SBC, which is based on the Allwinner H3 with a quad-core Cortex-A7 that runs up to 1.2GHz. This small board costs $10 and it’s powerful enough to perform better in some tasks than other, more expensive SBCs.

Since we’re talking about Linux, you have the flexibility to use whatever hardware you like as the host controller. In this case, the nanopi-neo actually converts the SDG1025 into an Ethernet-controlled device; likewise, you could use an SBC with WiFi and control the SDG1025 over your WiFi network. Anyway, you get the idea, it’s very flexible. Also, the power consumption of these small SBCs is very low.

Therefore, from now on the commands that I’m going to use are the same regardless of the host controller. Finally, for the nanopi-neo I’ve built the Armbian 5.93 git version, but you can also download a pre-built image from here.

Communicate with the SDG1025

Let’s assume that you have booted your Linux host and you have a console terminal. In general, when it comes to python, the best practice is to always use a virtual environment. That makes things easier, because if you break something you don’t break your main environment, only the virtual one. To set up your host, create a new virtual environment and install the needed dependencies with these commands:

# Update system
sudo apt update
sudo apt -y upgrade

# Install needed packages
sudo apt install python3-venv
sudo apt install python3-pip
sudo apt install libusb-1.0-0-dev

# Create a new environment
mkdir ~/pyenv
cd ~/pyenv
python3 -m venv sdg1025
source ~/pyenv/sdg1025/bin/activate

# Install python dependencies
pip3 install setuptools
pip3 install pyusb

After running the “source ~/pyenv/sdg1025/bin/activate” command, you should see the environment name in your prompt. For example, on my nanopi-neo I see this prompt:

(sdg1025) dimtass@nanopineo:~$

That means that now you’re working in the virtual environment and all the python modules you install are installed in that environment. To deactivate the virtual environment, you run this command:

deactivate
Finally, remember that you always need to activate this environment when you want to communicate with the sdg1025 and use those tools:

source ~/pyenv/sdg1025/bin/activate

If everything works fine for you and you didn’t break anything in the process, but you don’t like using this virtual env, you can just install the same tools as before without creating the virtual env and skip those steps. From now on though, I’ll assume that the virtual env is used.

Now you need to create a udev rule, so your user gets permission to operate on the USB device, as by default only root has access to it. If you remember, in a previous step you ran the lsusb command and got an output similar to this:

$ lsusb
Bus 008 Device 002: ID f4ed:ee3a Shenzhen Siglent Co., Ltd. SDG1010 Waveform Generator (TMC mode)

From this output you need the USB VID and PID numbers which are the f4ed and ee3a in this case. Your SDG1025 should have the same VID/PID. Now you need to create a new udev rule with those IDs.
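If you want to grab those IDs in a script instead of reading them off the lsusb output, here is a small sketch of my own (the helper name is mine, not part of any tool mentioned here):

```python
import re

def usb_ids(lsusb_line):
    """Extract the (VID, PID) pair from a single lsusb output line.
    Returns None if the line doesn't contain an 'ID vvvv:pppp' field."""
    m = re.search(r"ID ([0-9a-f]{4}):([0-9a-f]{4})", lsusb_line)
    return m.groups() if m else None

line = ("Bus 008 Device 002: ID f4ed:ee3a Shenzhen Siglent Co., Ltd. "
        "SDG1010 Waveform Generator (TMC mode)")
print(usb_ids(line))   # ('f4ed', 'ee3a')
```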

sudo nano /etc/udev/rules.d/51-siglent-sdg1025.rules

And then copy this line and save the file.

SUBSYSTEM=="usb", ATTRS{idVendor}=="f4ed", ATTRS{idProduct}=="ee3a", MODE="0666"

Now you need to update and re-trigger the rules with this command:

sudo udevadm control --reload-rules && sudo udevadm trigger

If you haven’t already connected the USB cable from the SDG1025 to your host, you can do it now.


The last piece you need is the python-usbtmc module, which you can download from here:

This is the module that implements the USBTMC protocol and that you can use in your python scripts. To install it, run these commands:

git clone
cd python-usbtmc
sudo python setup.py install
cd ../
rm -rf python-usbtmc

If everything went smoothly, you should now be able to test the communication between the host and the SDG1025. To do this, launch python3 and you should see something like this:

(sdg1025) $ python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.

The next commands import the python-usbtmc module and list the USBTMC devices that are connected to the system:

>>> import usbtmc
>>> usbtmc.list_resources()

As you can see, the list_resources() function listed the SDG1025. Now you need to copy this exact string and use it to instantiate a new Instrument object.

>>> sdg = usbtmc.Instrument('USB::62000::60000::SDG10GA1234567::INSTR')
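A side note of my own: the two numeric fields in the VISA resource string are just the USB VID and PID in decimal (the string above uses placeholder numbers), so they map directly back to the hex IDs that lsusb printed earlier. A quick check in python:

```python
# Convert the hex VID/PID from lsusb into the decimal values that appear
# in the VISA resource string (f4ed:ee3a -> 62701/60986).
vid = int("f4ed", 16)   # 62701
pid = int("ee3a", 16)   # 60986
print("USB::%d::%d::<serial>::INSTR" % (vid, pid))
```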

In your case this string will be different, of course, as it also contains the device serial number. Now you can poll the device and get some basic information.

>>> print(sdg.ask("*IDN?"))
*IDN SDG,SDG1025,SDG10GA1234567,,04-00-00-30-28

The *IDN? is a standard protocol command that sends back some basic information; in this case it’s the manufacturer, the model, the serial number and the firmware version.
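Since the reply is comma-separated, it’s easy to split into its fields in a script. A small sketch (the function name is mine, not part of python-usbtmc), using the reply shown above:

```python
def parse_idn(response):
    """Split an *IDN? reply into its comma-separated fields.
    Strips the echoed '*IDN ' prefix if the device sends one."""
    payload = response.split(" ", 1)[1] if response.startswith("*IDN") else response
    return payload.split(",")

fields = parse_idn("*IDN SDG,SDG1025,SDG10GA1234567,,04-00-00-30-28")
print(fields[1])   # SDG1025 (the model field)
```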

You’re done! Now you can control the SDG1025 via USB.

Supported commands

The SDG1025 supports a lot of commands that can control pretty much all the functionality of the device, and there’s a full list of the supported commands in this document. Yes, it’s quite a large document and there are many commands in there, but you don’t really need them all, right? Still, you should have a look at all of them to decide which seem more interesting for your custom scripts. The most basic commands, I think, are BSWV (BASIC_WAVE) and OUTP (OUTPUT).

The BSWV sets or gets basic wave parameters and you can use it to control the output waveform. To read the current output setup of channel 1 run this:

>>> sdg.ask("C1:BSWV?")

As you can guess, C1 means channel 1, so for channel 2 you should run this

>>> sdg.ask("C2:BSWV?")

Therefore, if you want to set CH1 to a 3.3V, 10MHz square output, you need to run this:

>>> sdg.write('C1:BSWV WVTP,SQUARE,FRQ,10000000,AMP,3.3V')
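If you’re scripting many of these, it can help to build the command string from parameters. A minimal helper of my own (not part of python-usbtmc; the name and signature are assumptions for illustration):

```python
def bswv_command(channel, wave_type, freq_hz, amp_v):
    """Build a C<n>:BSWV command string for the SDG1025.
    channel: 1 or 2, wave_type: e.g. 'SQUARE', freq_hz: int, amp_v: str like '3.3'."""
    return "C%d:BSWV WVTP,%s,FRQ,%d,AMP,%sV" % (channel, wave_type, freq_hz, amp_v)

cmd = bswv_command(1, "SQUARE", 10000000, "3.3")
print(cmd)   # C1:BSWV WVTP,SQUARE,FRQ,10000000,AMP,3.3V
```

You would then pass the result to sdg.write(), exactly like the literal string above.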

As you can see though, the BSWV command only configures the output; it doesn’t also enable it. You need the OUTP command for that. Therefore, to enable the CH1 output with that configuration, run this

>>> sdg.write('C1:OUTP ON')

You can disable the output like this:

>>> sdg.write('C1:OUTP OFF')

Finally, you can query the output state like this

>>> sdg.ask('C1:OUTP?')

From the response, you see that you can also control the output load (LOAD) and the polarity (PLRT). NOR means that the polarity is normal, but you can set it to INVT if you want to invert it.
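Since this reply is also a comma-separated list, it’s easy to parse in a script. A small sketch of mine, assuming a reply of the form 'C1:OUTP ON,LOAD,HZ,PLRT,NOR' (check against what your device actually returns):

```python
def parse_outp(reply):
    """Parse a C<n>:OUTP? reply into a dict.
    Assumed reply form: 'C1:OUTP ON,LOAD,HZ,PLRT,NOR'."""
    payload = reply.split(" ", 1)[1]
    parts = payload.split(",")
    state = {"enabled": parts[0] == "ON"}
    # The remaining items come in KEY,VALUE pairs (LOAD, PLRT, ...)
    for key, value in zip(parts[1::2], parts[2::2]):
        state[key] = value
    return state

info = parse_outp("C1:OUTP ON,LOAD,HZ,PLRT,NOR")
print(info)   # {'enabled': True, 'LOAD': 'HZ', 'PLRT': 'NOR'}
```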

At this point it should be clear what you can do with this module and the supported commands. Now that you know you can script the device with python, you can imagine how many things you can do. It’s awesome!

Just read the Programming Guide document and do your magic.

Bonus: Create a web control using the nanopi-neo

This wouldn’t be a proper stupid project if this was missing, so I couldn’t resist the temptation. As bonus material I’ll show you how to use the nanopi-neo (or any other SBC) to do some basic control of the output via a web interface using python, flask and websockets. I won’t really get into the code details, but the code is free to use and edit and you can get it from this repo:

In order to run the code you need to follow all the commands in the previous sections to install a virtual env and python-usbtmc. Additionally, you need to install a couple more things to support flask and wtforms, on which my web interface is based. In my case I’ve used the nanopi-neo and also my workstation and it worked fine on both. Now type these commands:

pip3 install flask
pip3 install wtforms
pip3 install flask-socketio
pip3 install eventlet

The last command, which installs the eventlet package, may fail, but it doesn’t really matter. This package installs the eventlet async mode, which makes websockets faster. Without it the default mode is set to threading, which is still fast but not that fast (I’ve seen around 2ms ping times using websockets with eventlet and ~30ms with threading).

Now that you’ve installed the packages, you can run the code, but first you need to connect the USB cable on the nanopi-neo (or whatever) and the SDG1025. This is everything connected together (in the photo the web-interface is already running).

Now, run these commands (includes also the repo clone).

git clone
cd web-interface-for-sdg1025

And that’s it. You should see in the output messages like the following:

[SDG1025]: Connecting to the first available device: USB::62701::60986::SDG10GA4150816::INSTR
[SDG1025]: Connected to: SDG10GA4150816
Starting Flask web server...
Open this link to your browser:

That means that everything worked fine. Now, if the web app is running on the nanopi-neo (or wherever), open your browser and connect to its IP address on port 5000. You will see something like this:

Note: in the picture the frequency should be (HZ) and not (KHz), but I’ve fixed that in the repo.

When the interface is opened, it reads the current configuration of both channels and updates all the needed elements on the web interface. You can use the controls to play around with the interface. I’ll also post a video that I’ve made using the nanopi-neo running the web server and my tablet to control the SDG1025. Excuse some delays when using controls on my tablet; it’s because the touch screen is problematic. I won’t buy a new tablet until this one completely falls apart. For now it serves its purpose just fine.


Most new equipment comes with some kind of communication interface, which is really nice, as you can use it to perform a lot of automated tasks. The problem is that most of those devices support only Windows. This is an old habit for most companies, as they either can’t see the value in supporting Linux platforms, or their engineers are still stuck in the 00’s era, when Windows XP was the ultimate platform for engineers.

Anyway, whatever the reason, there is hope if your equipment supports a standard protocol or class, because then you can probably find (or develop your own) module and do whatever you like. For the SDG1025 I’ve used the python-usbtmc module, which is open and available here. I’ve also used Flask to develop a web app. The front-end is made with HTML and javascript (jquery + jquery-ui) and the back-end is written in Python. The reason I’ve used Flask is that it was easier to integrate the python-usbtmc module in the web app; Flask is really nice and mature, and it has a very nice back-end that integrates the web server, so you don’t need to install one. Anyway, I also like Flask, so I use it whenever python is involved.

This web app is a very basic one and I’ve made it just for fun. I’ve only supported setting the outputs, but you can use it as a base template to do whatever you like and add more functionality. You can make tons of automation using Python scripting and the SDG1025, it’s up to your imagination.

Have in mind that I’ve noticed a few times that the communication broke and I had to restart the device. I didn’t find the cause, as it didn’t happen often, but I suspect the device…

That was my last stupid project for the next few months as I plan to have some rest and make some proper holidays.

Have fun!

Benchmarking TensorFlow Lite for microcontrollers on Linux SBCs


([06.08.2019] Edit: I’ve added an extra section with more benchmarks for the nanopi-neo4)

([07.08.2019] Edit: I’ve added 3 more results that Shaw Tan posted in the comments)

In this post, I’ll show you the results of benchmarking the TensorFlow Lite for microcontrollers (tflite-micro) API, not on various MCUs this time, but on various Linux SBCs (Single-Board Computers). For this purpose I’ve used code that I originally wrote to test and compare the tflite-micro API, which is written in C++, with the tflite python API. This is the repo:

Then, I thought why not test the tflite-micro API on various SBCs that I have around.

For those who don’t know what an SBC is, it’s just a small computer with an application CPU, like the Raspberry Pi (RPi). Probably everyone knows the RPi, but there are more than 100 SBCs on the market, in all forms and sizes (e.g. RPi alone has currently released 13 different models).


Although I have plenty of those boards around, I didn’t use all of them; that would be very time consuming. I’ve selected quite a few though. Before I list the devices and their specs, here is a photo of the testing bench. A couple of SBCs missed the photo-shooting and joined the party later.

Top row, from left to right:

Bottom row, from left to right:

  • Nanopi Neo 4: Rockchip RK3399, Dual Cortex-A72 @ 1.8GHz + Quad Cortex-A53 @ 1.5GHz
  • Nanopi K1 Plus: Allwinner H5, Quad Cortex-A53 @ 1.3GHz, 3GB DDR3
  • Nanopi Neo2: Allwinner H5, Quad Cortex A53 @ 900MHz, 512MB DDR3
  • Orangepi Prime: Allwinner H5, Quad Cortex-A53 @ 1.3 GHz, 2GB DDR3

Missing from the photo:

  • Nanopi neo: Allwinner H3, Quad Cortex-A7 @ 1.2GHz, 512MB DDR3
  • Nanopi duo: Allwinner H2+, QuadCortex-A7 @ 1GHz, 512MB DDR3

I’ve tried to use the same distribution packages for each SBC, so on the majority of the boards the rootfs and packages are from Ubuntu 18.04, but the kernels are different versions. It’s impossible to have the exact same kernel version, as not every SoC has mainline support. Also, even if the SoC has mainline support, like the H5 and RK3399, the SBC itself probably doesn’t; therefore, each board needs a device-tree and several patches for a specific kernel version to be able to boot properly. There are build systems like Armbian which make things easier, but they support only a fraction of the available SBCs. SoC and SBC support in the mainline kernel has been an important issue for a long time now, but let’s continue.

In this table I’ve listed the image I’ve used for each board. These are the boards that I’ve benchmarked myself:

SBC Image Kernel
AML-S905X-CC Ubuntu 18.04 Headless 4.19.55
Raspberry Pi 3 B+ Ubuntu 18.04 Server 4.4.0-1102
Jetson nano Ubuntu 18.04 4.9
Nanopi Duo Armbian 5.93 4.19.63
Nanopi Neo Armbian 5.93 4.19.63
Nanopi Neo2 Armbian 5.93 4.19.63
Nanopi Neo 4 Armbian 5.93 4.4.179
Nanopi Neo 4 (2) Yocto build 4.4.176
Nanopi K1 Plus Armbian 5.93 4.19.63
Orangepi Prime Armbian 5.93 4.19.63
Beaglebone Black Debian 9.5

This table is for the other SBCs that Shaw Tan benchmarked and posted in the comments:

SBC Image Kernel
Google Coral Dev Board Debian Stretch aarch64 4.9.51-imx
Rock Pi 4B Debian Stretch aarch64 4.4.154
Raspberry Pi 4 Rasbian 4.19.59-v8

But why use tflite-micro?

That’s a good question. Why use the tflite-micro API on a Linux SBC? The tflite-micro is meant to be used with small microcontrollers, while there is the tflite C++ API for Linux, which is more powerful and complete. Let’s see the pros and cons of the C++ tflite API.

Pros:
  • Supports more Ops
  • Supports also Python (that doesn’t have to do with C++ but is a plus for the API)
  • Can scale to multi-core CPUs
  • GPU, TPU and NCS2 support
  • Can load flatbuffer models from the file system
  • Small binary (~300KB)

Cons:
  • It’s hard to build
  • A lot of dependencies
  • The whole API is integrated in the git repo and it’s very difficult to get only the needed files to create your custom build via Make or CMake
  • No pre-built binaries available for architectures other than x86_64

I’ve tried to build the API on a couple of SBCs and it failed for several different reasons. For now only the RPi seems to be supported, but again it’s quite difficult to re-use the library; it seems like you need to develop your application inside the tensorflow repo and it’s difficult to extract the headers. Maybe there is documentation somewhere on how to do this, but I couldn’t find it. I’ve also seen other people having the same issue (see here), but none of the suggestions worked for me (except on my x86_64 workstation). At some point I got tired of trying, so I quit. Also, the build took hours just to fail.

Then I thought, why not use the tflite-micro API, which worked fine on the STM32F7 before? Since it’s a simple library that builds on MCUs and its only dependency is flatbuffers, it should build everywhere else (including Linux on those SBCs). I tried this and it worked just fine. Then I thought to run some benchmarks on various SBCs I have around, to see how it performs and whether it’s usable.

tflite model

Before I present the results, let’s have a quick look at the tflite model. The model is the same one I’ve used in the tests in the “ML for embedded” series here, here and here. The complexity of this model is measured in multiply-and-accumulate (MACC) operations, which are the fundamental operation in ML.

For the specific mnist model I’ve used, the MACC count is 2,852,598. That means that to compute the output for any input, the CPU executes about 2.8 million MAC operations. That’s, I guess, a medium-size model for a NN, but definitely not a small one.
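For reference, per-layer MAC counts come from simple formulas, which you can sum over all layers to get the model total. A sketch with the standard formulas (the layer shapes below are illustrative, not the exact layers of the mnist model used here):

```python
def dense_macc(in_features, out_features):
    # A fully-connected layer does one MAC per (input, output) pair
    return in_features * out_features

def conv2d_macc(out_h, out_w, out_ch, k_h, k_w, in_ch):
    # Each output element needs a full kernel's worth of MACs
    return out_h * out_w * out_ch * k_h * k_w * in_ch

# e.g. a 3x3 conv producing a 26x26x32 map from a 28x28x1 input:
print(conv2d_macc(26, 26, 32, 3, 3, 1))   # 194688
```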

Because we’re now using the tflite-micro API, the model is not a file loaded from the file system; it’s embedded in the executable binary, as the flatbuffer model is just a byte array in a header file.

Build the benchmark

If you want to test on any other SBC, you just need g++ and cmake. If your distro supports apt, then run this:

sudo apt install cmake g++

Then you just need to clone the repo, build the benchmark and run it. Just be aware that the output filename changes depending on the CPU architecture. So for an aarch64 CPU run these commands:

git clone
cd tflite-micro-python-comparison/

And you’ll get the results.

(For a cortex-a7, the executable name will be mnist-tflite-micro-armv7l)
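The architecture suffix is simply the machine name the system reports (what `uname -m` prints). A tiny helper of my own that mirrors this naming scheme (the helper is not part of the repo, just an illustration):

```python
import platform

def benchmark_binary(arch=None):
    """Name of the produced benchmark executable, which embeds the CPU
    architecture, e.g. mnist-tflite-micro-aarch64 or mnist-tflite-micro-armv7l."""
    return "mnist-tflite-micro-" + (arch or platform.machine())

print(benchmark_binary("aarch64"))   # mnist-tflite-micro-aarch64
```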


Note: I haven’t done any custom optimization for any of the SBCs or tuned the compiler flags for each architecture. The compiler is left to automatically detect the best options depending on the system. Also, I’m running the stock images without even changing the performance governor, which is set to “ondemand” for all the SBCs. Finally, I haven’t applied any overclocking to any of the SBCs, therefore the CPU frequencies are whatever those images have set for the ondemand governor.

Ok, so now let’s see the results I got on those SBCs. I’ve created a table with the average time of 1000 inferences for each test. In this list I’ve also added the results from my workstation and, at the end, the results of the exact same code with the exact same tflite model running on the STM32F746, which is an MCU running baremetal firmware.

SBC Average for 1000 runs  (ms)
Ryzen 2700X (this is not SBC) 2.19
AML-S905X-CC 15.54
Raspberry Pi 3 B+ 13.47
Jetson nano 9.34
Nanopi Duo 36.76
Nanopi Neo 16
Nanopi Neo2 22.83
Nanopi-Neo4 5.82
Nanopi-Neo4 (2) 5.21
Nanopi K1 Plus 14.32
Orangepi Prime 18.40
Beaglebone Black 97.03
STM32F746 @ 216MHz 76.75
STM32F746 @ 288 MHz 57.95

The next table has results that Shaw Tan posted in the comments section:

SBC Average for 1000 runs  (ms)
Google Coral Dev Board 12.40
Rock Pi 4B 6.33
Raspberry Pi 4 8.35

There are many interesting results on this table.

First, the RK3399 cpu (nanopi-neo4) outperforms the Jetson nano and it’s only ~2.6 times slower than my Ryzen 2700X cpu. Amazing, right? This board costs only $50. Excellent performance in this specific test.

Then the Jetson nano needs 9.34 ms. Be aware that this is a single CPU thread time! If you remember, here the tflite python API scored 0.98 ms in MAXN mode and 2.42 in 5W mode. The tflite python API supported the GPU though and used CUDA acceleration. So, for this board, when using the GPU and the tflite you get 9.5x (MAXN mode) and 3.8x (5W mode) better performance. Nevertheless, 9.34 ms doesn’t sound that bad compared to the 5W mode.
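As a quick sanity check of my own, the speed-up factors quoted above are just the ratio of the two average inference times:

```python
# Cross-check the Jetson nano speed-up figures: single-thread tflite-micro
# time vs the tflite python API with GPU/CUDA acceleration.
micro_ms = 9.34  # tflite-micro, single CPU thread
maxn_ms = 0.98   # tflite python API, MAXN mode
low_ms = 2.42    # tflite python API, 5W mode

print("MAXN: %.2fx" % (micro_ms / maxn_ms))  # ~9.5x, as quoted above
print("5W:   %.2fx" % (micro_ms / low_ms))   # ~3.8x, as quoted above
```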

Beaglebone Black is the worst performer. It’s even slower than the STM32F7, which is rather expected, as the BBB is a single core running @1GHz and the code runs on top of Linux. So the OS has a huge impact on performance compared to baremetal. This raises an interesting question…

What would happen if tflite-micro ran baremetal on those application CPUs, without an OS? Interesting question; maybe I’ll get to it in another post some time in the future.

The rest of the SBCs lie somewhere in the middle.

Finally, it’s worth noting that the nanopi neo is an SBC that costs only $9.99! That’s an amazing price/performance ratio for running the tflite-micro API. This board amazed me with these results (given its price). It’s also supposed to be an LTS board, which means it will remain in stock for some time, though it’s not clear for how long.

Additional benchmarks on the nanopi-neo4 (RK3399)

Since the nanopi-neo4 performed better than the other SBCs, I’ve built an image for it using Yocto. Since I’m the owner and maintainer of this layer, I need to say that it’s actually a mix of the armbian u-boot and kernel versions and patches, and I’ve also used some parts from meta-sunxi, but I’ve done more integrations, like supporting the GPU and some other things.

Initially, I tried the armbian build because it’s easy for everyone to reproduce, but then I thought I’d also test my Yocto layer. Therefore I’ve used this repo here:

There are details on how to use the layer and build the image in the README file of that repo, but this is the command I’ve used to build the image I’ve tested (after setting up the repo as described in the README):

MACHINE=nanopi-neo4 DISTRO=rk-none source ./ build

Then, in the build/conf/local.conf file, I’ve added this line at the end of the file:

IMAGE_INSTALL += " cmake "

And finally I’ve built the image with this command:

bitbake rk-image-testing

Then I’ve flashed the image to an SD card:

# change sdX with your SD card
sudo umount /dev/sdX*
sudo dd if=build/tmp/deploy/images/nanopi-neo4/rk-image-testing-nanopi-neo4.wic.bz2 of=/dev/sdX status=progress
sudo eject /dev/sdX

This distro gave me slightly better results. I’ve also tried to benchmark each core separately by running the script like this:

taskset --cpu-list 0 ./build-aarch64/src/mnist-tflite-micro-aarch64

To test all the cores, replace the 0 in the above command with the core number, from 0 to 5. Cores [0-3] are the Cortex-A53 cores and [4-5] are the Cortex-A72 cores, which are faster. Without taskset, the OS will always run the script on an A72. These are the results:

Core number Results (ms)
0 12.39
1 12.40
2 12.39
3 12.93
4 5.21
5 5.21
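For reference, the same core pinning that taskset does externally can also be requested from inside the benchmark program. This is just a hedged sketch using Linux’s sched_setaffinity() and is not part of the original benchmark code:

```c
// Sketch: pin the calling process to one CPU core from inside the program,
// which is what `taskset --cpu-list <core>` does from the shell. Linux-only.
#define _GNU_SOURCE
#include <sched.h>

int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* pid 0 means "the calling process"; returns 0 on success */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Calling pin_to_core(4) before the inference loop would keep the benchmark on one of the A72 cores, and sched_getcpu() can be used to verify where the code actually runs.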

Therefore, compared to the armbian distro, there’s a slightly better performance of 0.61 ms. This might be due to the different kernel version, I don’t know, but the difference seems to be constant on every inference run I’ve tested.


From the above results I’ve come to some conclusions, which I’ll list as pros and cons of using the tflite-micro API on an SBC running Linux.


Pros:

  • Very simple to build and use
  • Minimal dependencies
  • Easy to extract the proper source and header files from the Tensorflow repo (compared to tflite)
  • Performance is not that bad (that’s a personal opinion though)
  • Portability. You can compile the exact same code on any CPU or MCU and also on Linux or baremetal (maybe also in Windows with MinGW, I haven’t tested).


Cons:

  • No multi-threading. tflite-micro only supports a single thread.
  • The model is embedded in the executable as a byte array, so if you want to be able to load tflite models from the filesystem, you need to implement your own parser, which just loads the tflite model file from the filesystem into a dynamically allocated byte array.
  • It’s slower compared to tflite C++ API
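The model-loading parser mentioned in the list above can be as simple as reading the file into a heap-allocated buffer. Here’s a rough sketch in C; the function name and error handling are mine, not from the project:

```c
// Sketch: load a .tflite flatbuffer from disk into a dynamically allocated
// byte array, so it can replace the byte array that is normally compiled
// into the executable. Function name and error handling are illustrative.
#include <stdio.h>
#include <stdlib.h>

unsigned char *load_model(const char *path, size_t *size_out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    /* find the file size */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    unsigned char *buf = malloc((size_t)size);
    if (buf && fread(buf, 1, (size_t)size, f) != (size_t)size) {
        free(buf);
        buf = NULL;
    }
    fclose(f);

    if (buf && size_out)
        *size_out = (size_t)size;
    return buf;
}
```

The returned buffer can then be handed to the tflite-micro model-loading call in place of the compiled-in array, and freed once the interpreter is torn down.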

From the above points (I may have missed a few, so I’ll update the list if something comes to mind), it’s clear that the performance of the tflite-micro API on a multi-core CPU under Linux will be worse compared to the tflite API. I only have numbers to compare tflite and tflite-micro on my Ryzen 2700X and the Jetson nano, but not on the other boards. See the table:

CPU tflite-micro/tflite speed ratio
Ryzen 2700X 10.63x
Jetson nano (5W) 9.46x
Jetson nano (MAXN) 3.86x

The above table shows that the tflite API is 10.63x faster than tflite-micro on the 2700X, and 9.46x and 3.86x faster on the Jetson nano (depending on the mode). This is a great difference, of course.

What I would keep from this benchmark is that tflite-micro is easy to build and use on any architecture, but there’s a performance impact. This impact is much larger if the SoC is multi-core, or has a GPU or any other accelerator that tflite-micro can’t use.

It’s up to you to decide if the performance of tflite-micro is enough for your model (depending on its MACs). If it is, then you can just use tflite-micro instead. Of course, don’t expect to run inference on real-time video, but for real-time audio it should probably be enough.

Anyway, with tflite-micro it’s very easy to test and evaluate your model on your SBC.

Have fun!

Controlling a 3D object in Unity3D with teensy and MPU-6050


Have a look at this image.

What does it look like? Is it the new Star Wars? Nope. It’s a real-time 3D rendering from a “game” I’ve built just to test my new stupid project with the MPU-6050 motion tracking sensor. Cool, isn’t it? Before I get there, let me do a short introduction.

The reason I’ve enjoyed this project so much is that I started it without knowing anything about 3D game engines, skyboxes or quaternions, and after 1.5 days I had this thing up and running! I don’t say this to praise myself. On the contrary, I mention it to praise the current state of free and open source software. Therefore, I’ll use some space here to praise the people that contribute to FOSS and make it easy and fast for others to experiment and prototype.

I don’t know for how long you’ve been engineers, hobbyists or anything related, but from my experience, trying to make something similar 10 or 15 years ago (and on a Linux machine) would have been a really hard and time-consuming procedure. Of course, I don’t mean getting the same 3D results; I mean a result relative to that era. Maybe I would have spent several months, because I would have had to do almost everything myself. Now, though, I’ve only written a few hundred lines of glue code and spent some time on YouTube, GitHub and some other sources around the web. Nowadays there are so many things that are free/open and available that it’s almost impossible not to find what you’re looking for. I’m not only talking about source code, but also tools, documentation, finished projects and resources (even open hardware). So many people share their knowledge and the outcome of their work and research that you can find almost everything, and you can learn almost everything from the internet, too. OK, it’s probably quite difficult to become a nuclear physicist using only online sources, but you can definitely learn anything related to programming and other engineering domains. It doesn’t matter why people decide to make their knowledge publicly available; all that matters is that it’s out there, available for everyone to use.

And thanks to those people, I’ve managed to install Unity3D on my Ubuntu, learn how to use it to make what I needed, find a free-to-use 3D model of the Millennium Falcon, use Blender on Linux to edit the 3D model, find a tool to create a skybox that resembles the universe, find an HID example for the Teensy 3.2 and the source code for the MPU6050 along with instructions on how to calibrate it, and finally dozens of posts with information about all those things. Things like how to use lens flares, mesh colliders to control flare visibility on cameras with flare layers, the event system for GUI interactions and several other things I wasn’t even aware of before, everything explained by other people in forums in a way that’s often far easier to read than the available documentation. Tons of information.

Then I just put all the pieces together and wrote the glue code and this is the result.

(Excuse the bad video quality; I’ve used my terrible smartphone camera.)

It is a great time to be an engineer or a hobbyist, with all those tools and information available at your fingertips. All you need is just some time for playing and making stupid projects 😉

All the source code and files for this project are available here:

Note: This post is not targeting 3D graphics developers in any way. It’s meant mostly for embedded engineers and hobbyists.

So, let’s dive into the project…


To make the project I’ve used various software tools and only two hardware components. Let’s see the components.

Teensy 3.2

You’re not limited to the Teensy 3.2; you can use any Teensy that supports the RawHID lib. The Teensy 3.2 is based on the NXP MK20DX256VLH7, which has a Cortex-M4 core running at 72 MHz that can easily be overclocked up to 96MHz from the Arduino IDE. It has a variety of peripherals and the pinout exposes everything you need to build various interesting projects. For this project I’ve only used the USB and I2C. The Teensy is not the cheapest MCU board out there, as it costs around $20, but it comes with great support and it’s great for prototyping.


MPU-6050

According to TDK (which manufactures the MPU-6050), this is a Six-Axis (Gyro + Accelerometer) MEMS MotionTracking device with an onboard Digital Motion Processor (DMP). According to the TDK web page I should use a ™ on every word less than 4 characters in the previous sentence. Anyway, to put it simply, it’s a package that contains a 3-axis gyro, a 3-axis accelerometer and a special processing unit that performs some complex calculations on the sensor’s data. You can find small boards with the sensor on ebay for ~1.5 EUR, which is dirt cheap. The sensor works with both 3V3 and 5V, which makes it easy to use with a very wide range of development boards and kits.

Connecting the hardware

The hardware connections are easy as both the Teensy and the mpu-6050 are voltage level compatible. The following table shows the connections you need to make.

Teensy 3.2 (pin) MPU-6050
18 SDA
19 SCL
23 INT

That’s it, you’re done. Now all you have to do is connect the Teensy 3.2 to your workstation, calibrate the sensor, and then build and flash the firmware.

Calibrating the MPU-6050

Not all MPUs are made the same. Since it’s a quite complex device, both the gyro and the accelerometer (accel) have tolerances. Those tolerances affect the readings you get; for example, if you place the sensor on a perfectly flat surface, you’ll get a reading that says it’s not flat, which means you’re reading the tolerances (offsets). Therefore, you first need to place the sensor on a flat surface and then use the readings to calibrate the sensor and minimize those offsets. Because every chip has different tolerances, you need to do this for every sensor; you can’t do it once for one sensor and then re-use the same values for others (even if they’re from the same production batch). The sensor supports uploading those offset values to it via some I2C registers, so it can include them in its own calculations, which in the end offloads the external CPU.

Normally, if you need maximum accuracy during calibration, you need an inclinometer, in order to place your sensor completely flat, and a vibration-free base. You can find inclinometers on ebay or amazon, from 7 EUR up to thousands of EUR. Of course, you get what you pay for. Keep in mind that an inclinometer is just a tilt sensor that has been calibrated in production. A cheap inclinometer may suck in many different ways; e.g. maybe it’s not even calibrated, or its calibration reference wasn’t calibrated, or the tilt sensor itself is crap. Anyway, for this project you don’t really need one.

For now, just place the sensor on a surface that seems flat. Also, because you’ve probably already soldered the pin headers, try to level the sensor relative to the surface. We really don’t need accuracy here, just an approximation, so make your eye the inclinometer.

Another important thing to know is that when you power up the sensor, the orientation is zeroed at the current orientation. That means that if the sensor is pointing south, then that direction will always be the starting point. This matters when you connect the sensor to the 3D object: you need to place the sensor flat, pointing at the object on your screen, and then power it on (connect the USB cable to the Teensy).

Note: Before you start the calibration you need to leave the module powered on for a few minutes in order for the temperature to stabilize. This is very important, don’t skip this step.

OK, so now you should have your sensor connected to the Teensy and resting on a flat(-ish) surface. Now you need to flash the calibration firmware. I’ve included two calibration source files in the repo. The easiest one to use is `mpu6050-calibration/mpu6050-raw-calibration/mpu6050-raw-calibration.ino`. I’ve got this source code from here.

Note: In order to be able to build the firmware in the Arduino IDE, you need to add this library here. The Arduino library in this repo contains both the MPU6050 and the I2Cdev libraries, which are needed by all the source files. Just copy the I2Cdev and MPU6050 folders from this folder into your Arduino library folder.

When you build and upload the `mpu6050-raw-calibration.ino` on the Teensy, then you also need to use the Arduino IDE to open the Serial Monitor. When you do that, then you’ll get this prompt repeatedly:

Send any character to start calibrating...

Press Enter in the input textbox of the Serial Monitor and the calibration will start. In my case there were a few iterations and then I got the calibration values in the output:

            ax	ay	az	gx	gy	gz
average values:		-7	-5	16380	0	1	0
calibration offsets:	1471	-3445	1355	-44	26	26

MPU-6050 is calibrated.

Use these calibration offsets in your code:

Now copy-paste the above code block in to your
'teensy-motion-hid/teensy-motion-hid.ino' file
in function setCalibrationValues().

As the message says, now just open the `teensy-motion-hid/teensy-motion-hid.ino` file and copy the mpu.set* function calls into the setCalibrationValues() function.

Advanced calibration

If you want to see a few more details regarding calibration and an example on how to use a PID controller for calibrating the sensor and then use a jupyter notebook to analyze the results, then continue here. Otherwise, you can skip this section as it’s not really needed.

In order to calculate the calibration offsets you can use a PID controller. For those who don’t know what a PID controller is, you’ll have to see this first (or, if you know how negative feedback on op-amps works, think of it as quite similar). Generally, it’s a control feedback loop that is used a lot in control systems (e.g. HVAC for room temperature, elevators, etc.).

Anyway, in order to calibrate the sensor using a PID controller, you need to build and upload the `mpu6050-calibration/mpu6050-pid-calibration/mpu6050-pid-calibration.ino` source code. I won’t get into the details of the code, but the important thing is that it uses 6 different PID controllers, one for each offset you want to calculate (3 for the accel axes and 3 for the gyro axes). This source code is a modification I’ve made of this repo here.

Again, you need to leave the sensor powered on for a few minutes before performing the calibration, and set it on a flat surface. When the firmware starts, it will spit out numbers on the serial monitor. Here’s an example:


And this goes on forever. Each line is a comma separated list of values and the meaning of those values from left to right is:

  • Average Acc X value
  • Average Acc X offset
  • Average Acc Y value
  • Average Acc Y offset
  • Average Acc Z value
  • Average Acc Z offset
  • Average Gyro X value
  • Average Gyro X offset
  • Average Gyro Y value
  • Average Gyro Y offset
  • Average Gyro Z value
  • Average Gyro Z offset

Now, all you need to do is let this run for a while (30-60 seconds) and then copy all the output from the serial monitor to a file named calibration_data.txt. The file actually already exists in the `/rnd/bitbucket/teensy-hid-with-unity3d/mpu6050-calibration` folder and contains the values from my sensor, so you can use those to experiment, or erase the file and put yours in its place. Also, be aware that when you copy the output from the serial monitor to the txt file, you need to delete any empty lines at the end for the notebook scripts to work; otherwise, you’ll get an error in the jupyter notebook.

Note: while you’re running the calibration firmware you need to make sure there are no vibrations on the surface. For example, if you put it on your desk, be sure there are no vibrations from you or any other equipment you may have running on the desk.

As I explain quite thoroughly in the notebook how to use it, I’ll keep it simple here. Also, from this point on I assume that you’ve read the jupyter notebook in the repo here.

You can use the notebook to visualize the PID controller output and also calculate the values to use for your sensor’s offsets. It’s interesting to see some plots here. As I mention in the notebook,  you can use the data_start_offset and data_end_offset, to plot different subsets of data for each axis.

This is the plot when data_start_offset=0 and data_end_offset=20.

Click on each image to zoom-in.

So, as you can see, in the first 20 samples the PID controller kicks in and tries to correct the error, and as the error is significant in the beginning, you get this steep slope. See how in the first 15 samples the error on the Acc X axis is corrected from more than -3500 to near zero. For the gyro, things are a bit different, as it’s more sensitive and fluctuates a bit more. Let’s see the same plots with data_start_offset=20 and data_end_offset=120.

In the above images, I’ve started from sample 20, in order to remove the steep slope of the first PID controller corrections. Now you can see that the data coming from the accel and gyro axes fluctuate quite a lot, and the PID tries on every iteration to correct this error. Of course, you’ll never get zero error, as the sensor is quite sensitive and there’s also thermal and electronic noise, as well as vibrations that you can’t sense but the sensor does. So, what you do in such cases is use the average value for each axis. Be careful, though: you don’t want to include the first samples in the average calculation for each axis, because that would give you values that are way off. As you can see in the notebook, I’m using skip_first_n_data to skip the first 100 samples and then calculate the average from the remaining values.
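The averaging step itself is trivial; here’s a sketch of it in C (the notebook does the equivalent in Python with skip_first_n_data):

```c
// Sketch: the averaging step used for the calibration offsets. Skip the
// first `skip` samples, where the PID controller is still converging, and
// average the rest.
#include <stddef.h>

double average_after(const double *samples, size_t count, size_t skip)
{
    if (skip >= count)
        return 0.0;

    double sum = 0.0;
    for (size_t i = skip; i < count; i++)
        sum += samples[i];
    return sum / (double)(count - skip);
}
```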

Finally, you can use the calculated values from the “Source code values” section and copy them into the firmware. You can use whatever method you like to calibrate the sensor; just be aware that if you try both methods you won’t get the same values! Even if you run the same test twice you won’t get the exact same values, but they should be similar.

HID Manager

In the hid_manager/ folder you’ll find the source code of a tool I’ve written and named hid_manager. Let me explain what it does and why it’s needed.

The HID manager is the software that receives the raw HID messages from the Teensy and copies the data into a buffer that is shared with Unity. Note that this code works only on Linux. If you want to use the project on Windows, you need to port this code; it’s actually the only OS-specific part of the project.

So, why use this HID manager? The problem with Unity3D and most similar software is that, although they have some kind of input API, it’s often quite complicated. I mean, I had a look at the API and tried to use it, but I quickly realized it would take too much effort and time to create my own custom input controller for Unity and then use it there. So, I decided to make it quick and dirty. In this case, though, I would say that quick and dirty is not a bad thing (except that it’s OS-specific). So, what is the easiest and fastest way to import real-time data into another piece of software running on the same host? Some kind of inter-process communication, of course. In this case, the easiest way was to use the Linux /tmp folder, which is mounted in the system’s RAM, and create a file buffer in /tmp that is shared between the Unity3D game application and the hid manager.

To do that I’ve written a script in hid_manager/, which makes sure to create this buffer and occupy 64 bytes of RAM. The USB HID packets I’m using are 64 bytes long, so we don’t need more RAM than that. Of course, I’m not using all the bytes, but it’s good to have more than the exact number needed. For example, the first test was to send only the Euler angles from the sensor, but then I realized I was affected by the gimbal lock effect, so I also had to add the quaternion values, which I was getting anyway from the sensor (I’ll come back to those later). So, having some spare size is always nice, and in this case the USB packet buffer is always 64 bytes, so you get it for free. The problem arises when you need more than 64 bytes; then you need some data serialization and packaging.
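One possible way to implement such a shared file buffer is shown below. This is a hedged sketch using open()/mmap() and not necessarily how hid_manager.c actually does it:

```c
// Sketch: create a fixed 64-byte file in /tmp and map it into memory, so
// two processes (e.g. the HID manager and the Unity game) can share it.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_PATH "/tmp/hid-shared-buffer"
#define BUF_SIZE 64

unsigned char *open_shared_buffer(void)
{
    int fd = open(BUF_PATH, O_RDWR | O_CREAT, 0666);
    if (fd < 0)
        return 0;

    /* make sure the file is exactly 64 bytes long before mapping it */
    if (ftruncate(fd, BUF_SIZE) != 0) {
        close(fd);
        return 0;
    }

    void *buf = mmap(0, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after closing the fd */
    return (buf == MAP_FAILED) ? 0 : (unsigned char *)buf;
}
```

Any process that maps the same file with MAP_SHARED sees the same 64 bytes, which is exactly the behavior the Unity side relies on.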

Also, note that in both teensy-motion-hid/teensy-motion-hid.ino and hid_manager/hid_manager.c the endianness is the same, which makes things a bit easier and faster.

In order to build the code, just run make inside the folder. Then you need to first flash the Teensy with the teensy-motion-hid.ino firmware and then run the manager using the script.


If you try to run the script before the Teensy is programmed, then you’ll get an error as the HID device won’t be found attached and enumerated on your system.

Note: if the HID manager is not running, you’ll get this message on the serial monitor output:

Unable to transmit packet

The HID manager supports two modes. The default mode just copies the incoming data from the HID device to the buffer. The second one is the debug mode, in which it also prints the 64 bytes it receives from the Teensy. To run the HID manager in debug mode, run this command:

DEBUG=1 ./

By running the above command you’ll get an output in your console similar to this:

$ DEBUG=1 ./ 
    This script starts the HID manager. You need to connect
    the teensy via USB to your workstation and flash the proper
    firmware that works with this tool.
    More info here:
Controlling a 3D object in Unity3D with teensy and MPU-6050
Usage for no debugging mode:
$ ./
Usage for debugging mode (it prints the HID incoming data):
$ DEBUG=1 ./
Debug ouput is set to: 1
Starting HID manager...
Denug mode enabled...
Open shared memfile
found rawhid device
recv 64 bytes: AB CD A9 D5 BD 37 4C 2B E5 BB 0B D3 BD BE 00 FC 7F 3F 00 00 54 3B 00 00 80 38 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 F9 29 00 00 01 00 00 00 E1 94 FF 1F
recv 64 bytes: AB CD 9D 38 E5 3B 28 E3 AB 37 B6 EA AB BE 00 FC 7F 3F 00 00 40 3B 00 00 00 00 00 00 80 B8 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 F9 29 00 00 01 00 00 00 E1 94 FF 1F
...

The 0xAB 0xCD bytes are just a preamble.

Note: I haven’t added any checks on the data, like verifying the preamble or having checksums etc., as there wasn’t a reason for it here. Otherwise, I would consider at least a naive checksum, like XOR’ing the bytes, which is fast.
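For completeness, the naive XOR checksum mentioned above could look like this (it’s not part of the actual firmware or the HID manager):

```c
// Sketch of a naive checksum: XOR all payload bytes together.
#include <stdint.h>

uint8_t xor_checksum(const uint8_t *data, int len)
{
    uint8_t sum = 0;
    for (int i = 0; i < len; i++)
        sum ^= data[i];
    return sum;
}
```

The sender would set packet[63] = xor_checksum(packet, 63), and the receiver would recompute the checksum over the same 63 bytes and compare it to the last byte.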

In the next video you can see on the left top side the quaternion data output from the Teensy and on the left bottom the raw hid data in the hid manager while it’s running in debug mode.

Of course, printing the data on both ends adds significant latency to the model’s motion.

Teensy 3.2 firmware

The firmware for the Teensy is located in teensy-motion-hid/teensy-motion-hid.ino. There’s not much to say here; just open the file with the Arduino IDE, then build and upload the firmware.

The important part of the code is these lines:

        mpu.dmpGetQuaternion(&q, fifoBuffer);

        un_hid_payload pl;
        pl.packet.preamble[0] = 0xAB;
        pl.packet.preamble[1] = 0xCD;

        mpu.dmpGetEuler(euler, &q);
        pl.packet.x.number = euler[0] * 180 / M_PI;
        pl.packet.y.number = euler[1] * 180 / M_PI;
        pl.packet.z.number = euler[2] * 180 / M_PI;
        pl.packet.qw.number = q.w;
        pl.packet.qx.number = q.x;
        pl.packet.qy.number = q.y;
        pl.packet.qz.number = q.z;

If ADD_EULER_TO_HID is enabled, then the Euler angles will also be added to the HID packet, but this might add a bit more latency.

Now that the data is copied from the Teensy to a shared memory buffer in /tmp, you can use Unity3D to read it and use it in your game. Before proceeding to the Unity3D section, though, let’s open a short parenthesis on the kind of data you get from the sensor.

Sensor data

As I’ve mentioned, the sensor does all the hard work and math to calculate the Euler and quaternion values from the 6-axis values in real time (which is a great acceleration). But what are those values, why do we need them, and why do I prefer to use only the quaternion? Usually I prefer to give just a quick explanation and let the pros explain it better than me, so I’ll do the same now.

The Euler angles are just the rotation angles for each axis in 3D space. In air navigation those angles are known as roll, pitch and yaw, and by knowing them you know your object’s rotation. You can also see this video, which explains them better than I do. There’s one problem with Euler angles: when two of the 3 axes are driven into a parallel configuration, you lose one degree of freedom. This video explains it in more detail.

As I’ve already mentioned, the sensor also calculates the quaternion values. Quaternions are much more difficult to explain, as a quaternion is a 4D number, and anything beyond 3D is difficult to visualize and explain. I’ll avoid explaining it myself and just post this link here, which explains quaternions and tries to represent them in 3D space. The important things you need to know are that quaternions don’t suffer from gimbal lock, they’re supported natively in Unity3D, and they’re supposed to make calculations faster for the CPU/GPU compared to vector calculations.

Unity3D project

For this project I wanted to interact with a 3D object on the screen using the mpu-6050. Then I remembered I’d seen a video on Unity3D that seemed nice, so when I saw there was a Linux build (not officially supported, though), I thought I’d give it a try. When I started the project I knew nothing about this software, but for doing simple things it seems nice. I had quite a few difficulties, but with some google-fu I fixed all the issues.

Installing Unity3D on Ubuntu is not pain-free, but it’s not that hard either, and when you succeed, it works great. I’ve downloaded the installer from here (always check the last post, which has the latest version) and to install Unity Hub I’ve followed this guide here. Unity3D is not open source, but it’s free for personal use and you need to create an account in order to get your free license. I guess I could have used an open 3D game engine, but since Unity was free and this is for personal use, I went for it. To install the same versions that I’ve used, run the following commands:

# install dependencies
sudo apt install libgtk2.0-0 libsoup2.4-1 libarchive13 libpng16-16 libgconf-2-4 lib32stdc++6 libcanberra-gtk-module

# install unity
chmod +x UnitySetup-2019.1.0f2
./UnitySetup-2019.1.0f2

# install unity hub
chmod +x UnityHubSetup.AppImage
./UnityHubSetup.AppImage

When you open the project, you’ll find some files in the bottom tab. The one that’s interesting for you is HID_Controller.cs, which in the repo is located at Unity3D-project/gyro-acc-test/Assets/HID_Controller.cs. The important bits in this file are the MemoryMappedFile object, which is instantiated in the Start() function and opens the shared file in /tmp that holds the mpu6050 data, and the Update() function.

void Start()
{
    hid_buf = new byte[64];

    // transform.Rotate(30.53f, -5.86f, -6.98f);
    Debug.developerConsoleVisible = true;
    Debug.Log("Unity3D demo started...");

    mmf = MemoryMappedFile.CreateFromFile("/tmp/hid-shared-buffer", FileMode.OpenOrCreate, "/tmp/hid-shared-buffer");
    using (var stream = mmf.CreateViewStream()) {
        var data = stream.Read(hid_buf, 0, 64);
        if (data > 0) {
            Debug.Log("Data in: " + data);
            float hid_x = System.BitConverter.ToSingle(hid_buf, 2);
            Debug.Log("x: " + hid_x);
        }
    }
}


// Update is called once per frame
void Update()
{
    using (var stream = mmf.CreateViewStream()) {
        var data = stream.Read(hid_buf, 0, 64);
        if (data > 0) {
            float qw = System.BitConverter.ToSingle(hid_buf, (int) DataIndex.qW);
            float qx = System.BitConverter.ToSingle(hid_buf, (int) DataIndex.qX);
            float qy = System.BitConverter.ToSingle(hid_buf, (int) DataIndex.qY);
            float qz = System.BitConverter.ToSingle(hid_buf, (int) DataIndex.qZ);
            transform.rotation = new Quaternion(-qy, -qz, qx, qw);
        }
    }
}

As you can see in the Start() function, the mmf MemoryMappedFile is created and attached to /tmp/hid-shared-buffer. Then there’s a dummy read from the file to make sure the stream works, and a debug message is printed. This code runs only once, when the HID_Controller class is created.

In the Update() function, the code creates a stream connected to the memory-mapped file, reads the data, parses the proper float values from the buffer and finally creates a Quaternion object from the 4D values and updates the object’s rotation.

You’ll also notice that the values in the quaternion are not in (x,y,z,w) order, but (-y,-z,x,w). This is weird, right? It happens for a couple of reasons that I’ll try to explain. On page 40 of this PDF datasheet you’ll find this image.

These are the X, Y, Z axes on the chip. Notice also the signs: they are positive in the direction shown and negative in the opposite direction. The dot on the top corner indicates where pin 1 is located on the plastic package. The next image is the stick I’ve made with the sensor board attached to it, on which I’ve drawn the dot position and the axes.

Therefore, you can see that the X and Y axes are swapped (compared to the PDF image), so the quaternion (x,y,z,w) becomes (y,x,z,w). But wait… in the code it’s (-y,-z,x,w)! Well, that troubled me for a moment too; then I read this in the documentation: “Most of the standard scripts in Unity assume that the y-axis represents up in your 3D world.”, which means you also need to swap Y and Z. But because X is now in Y’s place, you replace X with Z, so the quaternion (y,x,z,w) becomes (y,z,x,w). But wait! What about the “-” signs? Well, if you look at the first image again, it shows the sign for each axis. The way you hold the stick, compared to the moving object on the screen, reverses the rotation on the y and z axes, so the quaternion finally becomes (-y,-z,x,w). Well, I don’t know much about 3D graphics, Unity and quaternions, but at least the above explanation makes sense and it works, so… this must be it.
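The whole remapping boils down to a one-liner. As plain C (a hypothetical helper; the actual project does the same thing in C# when constructing the Unity Quaternion):

```c
// Sketch: the axis remapping described above as plain code.
// Sensor (x, y, z, w) -> Unity (-y, -z, x, w), for this particular
// mounting of the sensor board.
typedef struct { float x, y, z, w; } quat_t;

quat_t remap_for_unity(quat_t s)
{
    quat_t u = { -s.y, -s.z, s.x, s.w };
    return u;
}
```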

I’ve found the Millennium Falcon 3D model here, and it’s free to use (except that any Star Wars related stuff is trademarked, so you can’t really use it for anything professional). This is what I meant in the intro; all the software I’ve used so far was free or open. So A. Meerow, who built this model, made this 3D mesh in his free time, gave it away for free, and I was able to save dozens of hours of making the same thing. Thanks mate.

I had a few issues with the 3D model though, when I imported it in Unity. One issue was that there was a significant offset on one of the axes; another was that, because of what I’ve mentioned earlier, I had to export the model with the Y and Z axes swapped; and finally, the texture wasn’t applied properly when importing the .obj file, so I had to import the model in the .fbx format. To fix those things I’ve downloaded and used Blender. It was the first time I’ve used Blender, but it was quite straightforward to use for fixing those issues.

Blender is a free and open source 3D creation suite and I have to say that it looks like a beautiful and very professional tool. There are so many options, buttons and menus that it’s clear this is a professional grade tool. And it’s free! Amazing.

Anyway, after I’ve fixed those things and exported the model to the .fbx format, I wanted to change the default skybox in Unity3D to something that looks like deep space. So I’ve found another awesome free and open tool named Spacescape, which procedurally creates space skyboxes with stars and nebulas and also has a ton of options to play with. The funny thing was that I’ve tried to build it on my Ubuntu 18.04 but I had a lot of issues, as it’s based on a quite old Qt version and also needs some dependencies that failed to build. Anyway, I’ve downloaded the Windows executable and it worked fine with Wine (version 3.0). This is a screenshot of the tool running on my Ubuntu.

These are the default options and I’ve actually used them as the result was great.

Finally, I’ve just added some lights, a lens flare and a camera rotation in the Unity3D scene and it was done.

Play the game demo

In order to “play” the game (yeah, I know it’s not a game, it’s just a moving 3D object in a scene), you need to load the project from the Unity3D-project/gyro-acc-test/ folder. Then you build it (File >> Build Settings, or just press Ctrl+B), which creates an executable file named “teensy-wars.x86_64” and at the same time launches the game. After the build you can also just launch the teensy-wars.x86_64 executable directly.

Make sure that before you do this, you’re running the hid_manager in the background and that you’ve flashed Teensy with the teensy-motion-hid/teensy-motion-hid.ino firmware and the mpu-6050 sensor is connected properly.


I’m amazed with this project for many reasons. It took me 1.5 days to complete. Now that I’m writing those lines, I realize that I’ve spent more time documenting this project and writing the article than implementing it, and the reason for this is the power of open source: the available free software (free to use or open), the tons of available information in documents, manuals and forums and, finally and most importantly, the fact that people are willing to share their work and know-how. Of course, open source is not new to me, I’ve been doing this for years myself, but I was amazed that I was able to build something like this in such a short time without ever having used any of those tools before. Prototyping has become so easy nowadays. It’s really amazing.

Anyway, I’ve really enjoyed this project and I’ve enjoyed even more the realization of the power that everyone has in their hands nowadays. It’s really easy to build amazing stuff very fast and get good results. Of course, I need to mention that this can’t replace domain expertise in any way. But it’s still nice that engineers from one domain can jump into another, unknown domain, make something quick and dirty and get some idea of how things work.

Have fun!

Machine Learning on Embedded (Part 5)

Note: This post is the fifth in the series. Here you can find part 1, part 2, part 3 and part 4.


In the previous post here, I’ve used x-cube-ai with the STM32F746 to run a tflite model and benchmark the inference performance. In that post I’ve found that x-cube-ai is ~12.5x faster compared to TensorFlow Lite for Microcontrollers (tflite-micro) when running on the same MCU. Generally, the first 4 posts were focused on running the model inference on the edge, which means running the inference on the embedded MCU itself. This is actually the most important thing nowadays, as being able to run inferences on the edge on small MCUs means less consumption and, more importantly, no reliance on the cloud. What is the cloud here? It means that there is an inference accelerator in the cloud or, in layman’s terms, that the inference runs on a remote server somewhere on the network.

One thing to note is that the only reason I’m using the MNIST model is benchmarking and consistency with the previous posts. There’s no real reason to use this specific model in a scenario like this. The important thing here is not the model, but the model’s complexity. So any model with a complexity that matches your use-case scenario can be used. But as I’ve said, since I’ve used this model in the previous posts, I’ll use it also here.

So, what are the benefits of running the inference on the cloud?

Well, that depends. There are many parameters that define a decision like this. I’ll try to list a few of them.

  • It might be faster to run an inference on the cloud (that depends on other parameters too, though).
  • The MCU that you already have (or must use) is not capable of running the inference itself using e.g. tflite-micro or another API.
  • There is a robust network available.
  • The time a cloud inference takes (including the network transactions) is shorter than running it on the edge (=on the device).
  • If the target device runs on battery, it may be more energy efficient to use a cloud accelerator.
  • It’s possible to re-train your model and update the cloud without having to update the clients (as long as the input and output tensors don’t change).

What are the disadvantages of running the inference on the cloud?

  • You need a target with a network connection.
  • Networks are not always reliable.
  • The server hardware is a single point of failure: if the server fails, all the clients fail with it.
  • A cloud server is not energy efficient.
  • The cloud needs maintenance.

If you ask me, the most important advantage of edge devices is that they don’t rely on any external dependencies. And the most important advantage of the cloud is that it can be updated at any time, even on the fly.

In this post I’ll focus on running the inference on the cloud and using an MCU as a client to that service. Since I like embedded things, the cloud tflite server will be a Jetson nano running in the two supported power modes and the client will be an esp8266 NodeMCU running at 160MHz.

All the project files are in this repo:

Now let’s dive into it.


Let’s have a look at the components I’ve used.

ESP8266 NodeMCU

This is the esp8266 module with 4MB flash and the esp8266 core, which can run up to 160MHz. It has two SPI interfaces, one used for the onboard EEPROM and one free to use. It also has a 10-bit ADC channel which is limited to 1V max input signals. This is a significant limitation and we’ll see later why. You can find this module on ebay for ~1.5 EUR, which is dirt cheap. For this project I’ll use the Arduino library to create a TCP socket that connects to a server, sends an input array and then retrieves the inference result output.

Jetson nano development board

The jetson nano dev board is based on a quad-core ARM Cortex-A57 running @ 1.4GHz, with 4GB LPDDR4 and an NVIDIA Maxwell GPU with 128 CUDA cores. I’m using this board because tensorflow-gpu (which contains tflite) supports its GPU and therefore provides acceleration when running a model inference. This board doesn’t have WiFi or BT, but it has a mini-pcie connector (key-E), so you’re able to connect a WiFi-BT module. In this project I’ll just use an ethernet cable connected to a wireless router.

The Jetson nano supports two power modes. The default mode 0 is called MAXN and mode 1 is called 5W. You can verify which mode your board is running in with this command:

nvpmodel -q

And you can set the mode (e.g. mode 1 – 5W) like this:

# sets the mode to 5W
sudo nvpmodel -m 1

# sets the mode to MAXN
sudo nvpmodel -m 0

I’ll benchmark both modes in this post.

My workstation

I’ve also used my development workstation in order to do benchmark comparisons with the Jetson nano. The main specs are:

  • Ryzen 2700x @ 3700MHz (8 cores / 16 threads)
  • 32GB @ 3200MHz
  • GeForce GT 710 (No CUDA 🙁 )
  • Ubuntu 18.04
  • Kernel 4.18.20-041820-generic

Network setup

This is the network setup I’ve used for developing and testing/benchmarking the project. The esp8266 is connected to the router via WiFi and the workstation (2700x) and the jetson nano are connected via Ethernet (in the drawing replace TCP = ETH!).

This is a photo of the development setup.

Repo details

In the repo you’ll find several folders. Here I’ll list what each folder contains. I suggest you also read the README file in the repo, as it contains information that might not be available here and will always be kept up to date.

  • ./esp8266-tf-client: This folder contains the firmware for the esp8266
  • ./jupyter_notebook: This folder contains the .ipynb jupyter notebook which you can use on the server and includes the TfliteServer class (which will be explained later) and the tflite model file (mnist.tflite).
  • ./schema: The flatbuffers schema file I’ve used for the communication protocol
  • ./tcp-stress-tool: A C/C++ tool that I’ve written to stress and benchmark the tflite server.

esp8266 firmware

This folder contains the source code for the esp8266 firmware. To build it, open `esp8266-tf-client/esp8266-tf-client.ino` with the Arduino IDE (version > 1.8). Then you need to change a couple of variables in the code according to your network setup. In the source code you’ll find those values:

#define SSID "SSID"
#define SERVER_IP ""
#define SERVER_PORT 32001

You need to edit them according to your wifi network and router setup. So, use your wifi router’s SSID and password. The `SERVER_IP` is the IP of the computer that will run the python server and the `SERVER_PORT` is the server’s port; they both need to match the values in the python script. All the data in the communication between the client and the server are serialized with flatbuffers. This comes with quite a significant performance hit, but it’s quite necessary in this case. The client sends 3180 bytes on every transaction to the server, which are the serialized 784 floats of each 28×28 digit. The response from the server to the client is 96 bytes. These byte lengths are hardcoded, so if you make any changes you also need to change the definitions in the code. They are hard-coded in order to accelerate the network recv() routines, so they don’t wait for timeouts.
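As a sanity check of those hardcoded sizes: the 784 floats account for almost the entire 3180-byte request, with the remainder being flatbuffers framing (the 44-byte split below is my own arithmetic, not something taken from the code):

```python
FLOAT_SIZE = 4                     # sizeof(float) on both ends
DIGIT_PIXELS = 28 * 28             # 784 pixels per MNIST digit

tensor_bytes = DIGIT_PIXELS * FLOAT_SIZE   # raw float payload
framing_bytes = 3180 - tensor_bytes        # flatbuffers overhead

print(tensor_bytes, framing_bytes)  # 3136 44
```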

By default this project assumes that the esp8266 runs at 160MHz. In case you change this to 80MHz, then you also need to change the `MS_CONST` in the code like this:

#define MS_CONST 80000.0f

Otherwise the ms values will be wrong. I guess there’s an easier and automated way to do this, but yeah…
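My assumption (not verified against the firmware) is that `MS_CONST` is simply the CPU clock expressed in cycles per millisecond, so it could be derived from the clock instead of being hardcoded:

```python
def ms_const(cpu_hz):
    # Cycles per millisecond: dividing a cycle count by this gives ms.
    return cpu_hz / 1000.0

print(ms_const(160_000_000))  # 160000.0 (the default 160MHz value)
print(ms_const(80_000_000))   # 80000.0 (the 80MHz value from the text)
```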

The firmware supports 3 serial commands that you can send via the terminal. All the commands need to be terminated with a newline. The supported commands are:

  • TEST: Sends a single digit inference request to the server and prints the parsed response.
  • START=<SECS>: Triggers a TCP inference request to the server every <SECS> seconds. For example, if you want to poll the server every 5 secs, send `START=5` over the serial to the esp8266 (don’t forget the newline at the end).
  • STOP: Stops the timer that sends the periodic TCP inference requests.

To build and upload the firmware to the esp8266, read the README of the repo.

Using the Jupyter notebook

I’ve used the exact same tflite model that I’ve used in part 3 and part 4. The model is located in ./jupyter_notebook/mnist.tflite. You need to clone the repo on the Jetson nano (or on your workstation, if you prefer). From now on, instead of making a distinction between the Jetson nano and the workstation, I’ll just refer to them as “the cloud”, as it doesn’t really make any difference. Therefore, just clone the repo to your cloud server. This here is the jupyter notebook on bitbucket.

Benchmarking the inference on the cloud

The important sections in the notebook are 3 and 4. Section 3 is `Benchmark the inference on the Jetson-nano`. Here I assume that this runs on the nano, but it’s the same on any server. So, in this section I’m benchmarking the model inference with a random input. I’ve run this benchmark on both my workstation and the Jetson nano and these are the results I got. For reference, I’ll also add the numbers of the edge inference on the STM32F7 from the previous post using x-cube-ai.

Cloud server          ms (after 1000 runs)
My workstation        0.206410
Jetson nano (MAXN)    0.987536
Jetson nano (5W)      2.419758
STM32F746 @ 216MHz    76.754
STM32F746 @ 288MHz    57.959

The next table shows the performance differences between all the benchmarks.

          STM@216  STM@288  Nano 5W  Nano MAXN  2700x
STM@216   1        1.324    31.719   77.722     371.852
STM@288   0.755    1        23.952   58.69      280.795
Nano 5W   0.031    0.041    1        2.45       11.723
Nano MAXN 0.012    0.017    0.408    1          4.784
2700x     0.002    0.003    0.085    0.209      1

An example of how to read the above table is that the STM32F7@288 is 1.324x faster than the STM32F7@216. Also, the Ryzen 2700x is 371.8x times faster than the STM32F7@216 and the Jetson nano in MAXN mode is 2.45x times faster than the 5W mode, e.t.c.
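Each factor in that table is just the ratio of the per-inference times from the first table; for example:

```python
# Per-inference times in ms (from the first benchmark table)
times_ms = {
    "STM@216": 76.754,
    "STM@288": 57.959,
    "Nano 5W": 2.419758,
    "Nano MAXN": 0.987536,
    "2700x": 0.206410,
}

def speedup(slower, faster):
    # How many times 'faster' outperforms 'slower' per inference.
    return times_ms[slower] / times_ms[faster]

print(round(speedup("STM@216", "STM@288"), 3))    # 1.324
print(round(speedup("STM@216", "2700x"), 1))      # 371.9
print(round(speedup("Nano 5W", "Nano MAXN"), 2))  # 2.45
```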

What you should probably keep from the above table is that the Jetson nano is ~32x to 78x times faster than the STM32F7 at stock clocks. Also, the 2700x is only ~4.8x times faster than the nano in MAXN mode, which is very good performance for the nano if you think about its consumption, cost and size.

Therefore, the performance/cost and performance/consumption ratio is far better on the Jetson nano compared to 2700x. So it makes perfect sense to use this as a cloud tflite server. One use-case of this scenario is having a cloud accelerator running locally on a place that covers a wide area with WiFi and then having dozens of esp8266 clients that request inferences from the server.

Benchmarking the tflite cloud inference

To run the server you need to run the cell in section `4. Run the TCP server`. First you need to insert the correct IP of the cloud server (in my case, the IP of my Jetson nano). Then you run the cell. Alternatively, you can edit the `jupyter_notebook/TfliteServer/` file and change the IP (or the TCP port, if you like) in this code:

if __name__=="__main__":
    srv = TfliteServer('../mnist.tflite')
    srv.listen('', 32001)

Then on your terminal run:


This will run the server and you’ll get the following output.

dimtass@jetson-nano:~/rnd/tensorflow-nano/jupyter_notebook/TfliteServer$ python3
TfliteServer initialized
TCP server started at port: 32001

Now send the TEST command to the esp8266 via the serial terminal. When you do this, the following things happen:

  1. esp8266 serializes the 28×28 random array to a flatbuffer
  2. esp8266 connects to the TCP port of the server
  3. esp8266 sends the flatbuffer to the server
  4. Server de-serializes the flatbuffer
  5. Server converts the tensor from (784,) to (1, 28, 28, 1)
  6. Server runs the inference with the input
  7. Server serializes the output in a flatbuffer (including the time in ms of the inference operation)
  8. Server sends the output back to the esp8266
  9. esp8266 de-serializes the output
  10. esp8266 prints the result
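Step 5 of the list above is just a numpy reshape from the flat tensor the client sends to the 4D shape the tflite interpreter expects. A minimal sketch (not the actual server code):

```python
import numpy as np

# De-serialized input: 784 floats, as sent by the esp8266.
flat = np.random.rand(784).astype(np.float32)

# The tflite MNIST input tensor is (batch, height, width, channels).
tensor = flat.reshape(1, 28, 28, 1)

print(tensor.shape)  # (1, 28, 28, 1)
```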

This is what you get from the esp8266 serial output:

Request a single inference...
======== Results ========
Inference time in ms: 12.608528
out[0]: 0.080897
out[1]: 0.128900
out[2]: 0.112090
out[3]: 0.129278
out[4]: 0.079890
out[5]: 0.106956
out[6]: 0.074446
out[7]: 0.106730
out[8]: 0.103112
out[9]: 0.077702
Transaction time: 42.387493 ms

In this output, the “Inference time in ms” is the time in ms that the cloud server spent running the inference. Then you get the array of the 10 predictions for the output and, finally, the “Transaction time” is the total time of the whole procedure, i.e. the time that steps 1-9 took. At the same time, the output of the server is the following:

==== Results ====
Hander time in msec: 30.779839
Prediction results: [0.08089687 0.12889975 0.11208985 0.12927799 0.07988966 0.10695633
 0.07444601 0.10673008 0.10311186 0.07770159]
Predicted value: 3

The “Hander time in msec” is the time spent in the TCP reception handler (see the FbTcpHandler class in the jupyter_notebook/TfliteServer/ sources).

From the above benchmark with the esp8266 we need to keep the following two things:

  1. Out of the 42.38 ms, 12.60 ms was the inference run time, so everything else (serialization and network transactions) cost 29.78 ms on the local WiFi network. Therefore, the extra time was ~2.4x the inference run time itself.
  2. The whole operation took 42.38 ms, while the STM32F7 needed 76.75 ms @ 216MHz (or 57.96 ms @ 288MHz). That means that the cloud inference is 1.8x and 1.37x times faster, respectively.
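A quick check of that arithmetic, using the numbers from the two outputs above:

```python
total_ms = 42.387493       # "Transaction time" reported by the esp8266
inference_ms = 12.608528   # "Inference time" reported by the server

overhead_ms = total_ms - inference_ms   # serialization + network
ratio = overhead_ms / inference_ms      # overhead relative to inference

print(round(overhead_ms, 2))            # 29.78
print(round(ratio, 2))                  # 2.36

# Speedup vs the edge (STM32F7 numbers from the table earlier)
print(round(76.754 / total_ms, 2))      # 1.81
print(round(57.959 / total_ms, 2))      # 1.37
```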

Finally, as you probably already noticed, the protocol is very simple, so there are no checksums, server-client validation and other fail-safe mechanisms. Of course, that’s on purpose, as you can imagine. Otherwise, the complexity would be higher. But you need to consider those things if you’re going to design a system like this.

Benchmarking the tflite server

The tflite TCP server is just a python TCP socket listener, which means that by design it has much lower performance compared to a TCP server written in C/C++ or Java. Despite being aware of this limitation, I’ve chosen to go with this solution in order to integrate the server easily in the jupyter notebook, and it was also much faster to implement. Sadly, I’ve seen a great performance hit with this implementation and I would like to investigate a bit further (in the future) to verify if that’s because of the python implementation or something else. The results were pretty bad, especially for the Jetson nano.

In order to test the server, I’ve written a small C/C++ stress tool that spawns a user-defined number of TCP client threads and requests inferences from the server. Because it’s still early in my testing, I assume that the gpu can only run one inference at a time, therefore there’s a thread lock before any thread is able to call the inference function. This lock is in the jupyter_notebook/TfliteServer/ file, around this line:

output_data, time_ms = runInference(resp.Input().DigitAsNumpy())

One thing I would like to mention here is that it’s not that I’m too lazy to investigate every aspect of each framework in depth, it’s just that I don’t have the time, so I make logical assumptions. This is why I assume that I need to put a lock there, to prevent several simultaneous calls into the tensorflow API. Maybe this is handled in the API, I don’t know. Anyway, keep in mind that this is the reason the lock is there: all inference requests block and wait until the currently running inference is finished.

So, the easiest way to run some benchmarks is to run the TfliteServer on the server. First you need to edit the IP address in the __main__ function. Use the IP of the server, even if you run everything locally (even when I do this locally I use the real IP address). Then run the server:

cd jupyter_notebook/TfliteServer/

Then you can run the client and pass the server IP, port and number of threads on the command line. For example, I’ve run both the client and the server on my workstation, so the command I’ve used was:

cd tcp-stress-tool/
./tcp-stress-tool 32001 500

This will spawn 500 clients (each on its own thread) and request an inference from the python server. Because the output is quite big, I’ll only post the last line (but I’ve copied some logs in the results/ folder in the repo).

This tool will spawn a number of TCP clients and will request
the tflite server to run an inference on random data.
Warning: there is no proper input parsing, so you need to be
cautious and read the usage below.

tcp-stress-tool [server ip] [server port] [number of clients]

server ip:
server port: 32001
number of clients: 500

Spawning 500 TCP clients...
[thread=2] Connected
[thread=1] Connected
[thread=3] Connected


Total elapsed time: 31228.558064 ms
Average server inference time: 0.461818 ms

The output means that 500 TCP transactions and inferences were completed in 31.2 secs, with an average inference time of 0.46 ms. That means the inferences themselves only took about 0.23 secs in total (500 × 0.46 ms) and the remaining ~31 secs were spent in the comms and serializing the data. That seems way too much, right? I’m sure that this time should be less. On the Jetson nano it’s even worse, because I wasn’t able to run a test with 500 clients and many connections were rejected. With any number above 20 threads the python script can’t cope; I don’t know why. In the results/ folder you’ll find the following test results:

  • tcp-stress-jetson-nano-10c-5W.txt
  • tcp-stress-jetson-nano-50c-5W.txt
  • tcp-stress-jetson-nano-50c-MAXN.txt
  • tcp-stress-output-workstation-500c.txt

As you can guess from the filenames, Xc is the number of clients/threads and for the Jetson nano there are results for both modes (MAXN and 5W). This is a table with all the results:

Test       Threads  Total time (ms)  Avg. inference (ms)
Nano 5W    10       1057.1           3.645
Nano 5W    20       3094.05          4.888
Nano MAXN  10       236.13           2.41
Nano MAXN  20       3073.33          3.048
2700x      500      31228.55         0.461
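Using the numbers from the last row (the 500-client run on the 2700x), the actual inference work is a tiny fraction of the total time, which is what points to the comms/serialization side:

```python
clients = 500
avg_inference_ms = 0.461818   # from the stress tool's log
total_ms = 31228.55           # total elapsed time for all clients

# Since inferences are serialized behind the lock, their total
# time is simply count x average.
inference_total_s = clients * avg_inference_ms / 1000.0
overhead_s = total_ms / 1000.0 - inference_total_s

print(round(inference_total_s, 2))  # 0.23
print(round(overhead_s, 1))         # 31.0
```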

From those results, I’m quite sure that there’s something wrong with the python3 TCP server. Maybe at some point I’ll try something different. In any case, that concludes my tests, although there’s a question mark regarding the performance of the Jetson nano when it’s acting as a tflite server. For now, it seems that it can’t handle a lot of connections (with this implementation), but I’m quite certain this would be much different with a proper C/C++ server implementation.


With this post I’ve finished the main tests around ML I originally had in mind. I’ve explored how ML can be used with various embedded MCUs and I’ve tested both edge and cloud implementations. For the edge part of ML, I’ve tested a naive implementation and also two higher level APIs (the TensorFlow Lite for Microcontrollers API and x-cube-ai from ST). For the cloud part of ML, I’ve tested one of the most common and dirt cheap WiFi enabled MCUs, the esp8266.

I’ll mention here once again that, although I’ve used the MNIST example, that doesn’t really matter; it’s the NN model complexity that matters. By that I mean that although it doesn’t make any sense to send a 28×28 tensor from the esp8266 to the cloud to run a prediction on a digit, the model is still just fine for running benchmarks and drawing conclusions. Also, this (784,) input tensor stresses the network, which is good for performance tests.

One thing that you might be wondering at this point is: “which implementation is better?” There’s no single answer to this. It’s a per-case decision and it depends on several parameters around the specific requirements of the project, like cost, energy efficiency, location, environmental conditions and several other things. By doing those tests, though, I now have a clearer image of the capabilities and limitations of the current technology, which is a very good thing to have when you start a real project development. I hope that the readers who went through all the posts of this series are able to draw some conclusions about those tools and their limitations, and based on this knowledge can start evaluating more focused solutions that fit their project’s specs.

One thing that’s also important is that the whole ML domain is developing really fast and things are changing very fast, even in the next days or even hours. New APIs, new tools and new hardware are showing up. For example, numerous hardware vendors are now releasing products with some kind of NN acceleration (or AI, as they prefer to call it). I’ve read a couple of days ago that even Alibaba released a 16-core RISC-V processor (XuanTie 910) with AI acceleration. AmLogic released the A311D. Rockchip released the RK3399Pro. Also, Gyrfalcon released the Lightspeeur 2801S neural accelerator, to compete with Intel’s NCS2 and Google’s TPU. And many more Chinese manufacturers will release several other RISC-V CPUs with NN accelerators in the next few weeks and months. Therefore, as you can see, ML on the embedded domain is very hot.

I think I will return from time to time to the embedded ML domain in the future to sync with the current progress and maybe write a few more posts on the subject. But the next stupid-project will be something different. There’s still some clean up and editing I want to do in the first 2 posts in the series, though.

I hope you enjoyed this series as much as I did.

Have fun!

Machine Learning on Embedded (Part 4)


Note: This post is the fourth in the series. Here you can find part 1, part 2 and part 3.

For this post I’ve used the same MNIST model that I’ve trained for TensorFlow Lite for Microcontrollers (tflite-micro) and I’ve implemented the firmware on the 32F746GDISCOVERY by using ST’s X-CUBE-AI framework. But before diving into this, let’s do a recap and repeat some key points from the previous articles.

In part 1, I’ve done a naive implementation of a single neuron with 3 inputs and 1 output. Naive means that the inference was just plain C code, without any acceleration from the hardware. I’ve run those tests on various different MCUs and it was fun seeing even an arduino nano running this thing. I’ve also overclocked a few MCUs to see how the frequency increase scales with the inference performance.

In part 2, I’ve done another naive implementation of a NN with 3 inputs, 32 hidden nodes and 1 output. The result was as expected: as the NN complexity increases, the performance drops. Therefore, not all MCUs can provide the performance to run more complex NNs in real-time. The real-time part is subjective, though, because real-time can be anything from a few ns up to several hours, depending on the project. That means that if the inference of a deeper network needs 12 hours to run on your arduino and your data stream is 1 input per 12 hours and 2 minutes, then you’re fine. Anyway, I won’t debate on that, I think you know what I mean. But if you get an input sample every few ms, then you need something faster. Also, in the back of my head was to verify if this simple NN complexity is useful at all and if it can offer something more than lookup tables or algorithms.

In part 3, I was planning to use x-cube-ai from ST to port a Keras NN and then benchmark the inference, but after the hint I got in the comments from Raukk, I’ve decided to go with tflite-micro instead. Tflite-micro at that point seemed very appealing, because it’s a great idea to have a common API between the desktop, the embedded Linux and the MCU worlds. Think about it. It’s really great to be able to share (almost) the same code between those platforms. During the process though, I’ve found a lot of issues and I had to spend more time than I expected. You can read the full list of those issues in that post, but the main issue was that the performance was not good; actually, the inference was extremely slow. While finishing the article I figured out that I had the FPU disabled (duh!) and when I enabled it I got 3x better performance, but it was still slow.

Therefore, in this post I’ve implemented the exact same model in order to compare x-cube-ai and tflite-micro. As I’ve mentioned in the previous posts (and I’m repeating it now), Machine Learning (ML) on the low embedded end (=MCUs) is still a work in progress and there’s a lot of development in the various tools. If you think about it, the whole ML field has been changing rapidly for the last years and its introduction to microcontrollers is even more recent. It’s a very hot topic and domain right now. For example, while I was working on the tflite-micro post, the repo was updated several times, and I had to lock to a specific git version in order to finish the post.

Also, the same day I finished the post for the x-cube-ai, the new version 4.0.0 was released, which pushed back the post’s release. The new version supports importing tflite models and, because I’ve used a Keras model in my first implementation, I had to throw away quite some work that I had done… But I couldn’t do otherwise, as now I had the chance to use the exact same tflite model and not the Keras model it was ported from. Of course, I didn’t expect any differences, but it’s still better to compare the exact same models.

You’ll find all the source code for this project here:

So, let’s dive into it.


ST presents the X-CUBE-AI as an “STM32Cube Expansion Package part of the STM32Cube.AI ecosystem and extending STM32CubeMX capabilities with automatic conversion of pre-trained Neural Network and integration of generated optimized library into the user’s project“. Yeah, I know, fancy words. In plain English that means that it’s just a static library for the STM32 MCUs that uses the cmsis-dsp accelerations and a set of tools that convert various model formats to the format that the library can process. That’s it. And it works really well.

There’s also a very informative video here that shows the procedure you need to follow in order to create a new x-cube-ai project, and that’s the one I’ve also used to create the project in this repo. I believe it’s very straightforward and there’s no reason to explain anything more than that. The only different thing I always do is integrate the resulting code from STM32CubeMX into my cmake template.

So, x-cube-ai adds some tools in the CubeMX GUI that you can use to analyze the model, compress the weight values and validate the model on both the desktop and the target. With x-cube-ai you can create source code for 3 types of projects: SystemPerformance, Validation and ApplicationTemplate. For the first two you just compile, flash and run, so you don’t have to write any code yourself (unless you want to change the default behaviour). As you can see in the YouTube link I’ve posted, you choose the type of project in the “Pinout & Configuration” tab and then click on “Additional Software”. In that list, expand “X-CUBE-AI/Application” (be careful to select the proper (=latest?) version if you have many) and then, in the Selection column, select the type of the project you want to build.

Analyzing the model

I want to mention here that ST has done a great job on logging and displaying information about the model. You get a lot of information in CubeMX while preparing your model and you know beforehand the RAM/ROM size with the compression, the complexity, the ROM usage, the MACC and you can also see the complexity per layer. This is an example of the output I got when I analyzed the MNIST model.

Analyzing model 
Neural Network Tools for STM32 v1.0.0 (AI tools v4.0.0) 
-- Importing model 
-- Importing model - done (elapsed time 0.401s) 
-- Rendering model 
-- Rendering model - done (elapsed time 0.156s) 
Creating report file /home/dimtass/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/4.0.0/Utilities/linux/stm32ai_output/mnistkeras_analyze_report.txt 
Exec/report summary (analyze 0.558s err=0) 
model file      : /rnd/bitbucket/machine-learning-for-embedded/code-stm32f746-xcube/mnist.tflite 
type            : tflite (tflite) 
c_name          : mnistkeras 
compression     : 4 
quantize        : None 
L2r error       : NOT EVALUATED 
workspace dir   : /tmp/mxAI_workspace26422621629890969500934879814382 
output dir      : /home/dimtass/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/4.0.0/Utilities/linux/stm32ai_output 
model_name      : mnist 
model_hash      : 3be31e1950791ab00299d58cada9dfae 
input           : input_0 (item#=784, size=3.06 KiB, fmt=FLOAT32) 
input (total)   : 3.06 KiB 
output          : nl_7 (item#=10, size=40 B, fmt=FLOAT32) 
output (total)  : 40 B 
params #        : 93,322 (364.54 KiB) 
macc            : 2,852,598 
rom (ro)        : 263,720 (257.54 KiB) -29.35% 
ram (rw)        : 33,664 + 3,176 (32.88 KiB + 3.10 KiB) 
id  layer (type)        output shape      param #     connected to             macc           rom                 
0   input_0 (Input)     (28, 28, 1)                                                                               
    conv2d_0 (Conv2D)   (26, 26, 32)      320         input_0                  237,984        1,280               
    nl_0 (Nonlinearity) (26, 26, 32)                  conv2d_0                                                    
1   pool_1 (Pool)       (13, 13, 32)                  nl_0                                                        
2   conv2d_2 (Conv2D)   (11, 11, 64)      18,496      pool_1                   2,244,480      73,984              
    nl_2 (Nonlinearity) (11, 11, 64)                  conv2d_2                                                    
3   pool_3 (Pool)       (5, 5, 64)                    nl_2                                                        
4   conv2d_4 (Conv2D)   (3, 3, 64)        36,928      pool_3                   332,416        147,712             
    nl_4 (Nonlinearity) (3, 3, 64)                    conv2d_4                                                    
5   reshape_5 (Reshape) (576,)                        nl_4                                                        
    dense_5 (Dense)     (64,)             36,928      reshape_5                36,864         38,144 (c)          
    nl_5 (Nonlinearity) (64,)                         dense_5                  64                                 
6   dense_6 (Dense)     (10,)             650         nl_5                     640            2,600               
7   nl_7 (Nonlinearity) (10,)                         dense_6                  150                                
mnist p=93322(364.54 KBytes) macc=2852598 rom=257.54 KBytes ram=32.88 KBytes -29.35% 
Complexity by layer - macc=2,852,598 rom=263,720 
id      layer (type)        macc                                    rom                                     
0       conv2d_0 (Conv2D)   ||||                              8.3%  |                                 0.5%  
2       conv2d_2 (Conv2D)   |||||||||||||||||||||||||||||||  78.7%  ||||||||||||||||                 28.1%  
4       conv2d_4 (Conv2D)   |||||                            11.7%  |||||||||||||||||||||||||||||||  56.0%  
5       dense_5 (Dense)     |                                 1.3%  ||||||||                         14.5%  
5       nl_5 (Nonlinearity) |                                 0.0%  |                                 0.0%  
6       dense_6 (Dense)     |                                 0.0%  |                                 1.0%  
7       nl_7 (Nonlinearity) |                                 0.0%  |                                 0.0%  
Using TensorFlow backend. 
Analyze complete on AI model

This is the output that you get by just running the analyze tool on the imported tflite model in CubeMX. Lots of information there, but let’s focus on some really important info. As you can see, you know exactly how much ROM and RAM you need! You couldn’t do that with tflite-micro. In tflite-micro you either need to calculate this on your own, or you need to pick a heap size and try to load the model; if the heap isn’t enough and the allocator complains, you add more heap and repeat. That’s not very convenient, right? But with x-cube-ai you know exactly how much heap you need, at least for the model (then add more for your app). Great stuff.

Model RAM/ROM usage

So in this case the ROM needed for the model is 263,720 bytes. In part 3 it was 375,740 bytes (see section 3 in the jupyter notebook). That difference is not because I’ve used quantization, but because of the 4x compression I’ve selected for the weights in the tool (see the YouTube video, which does the same at time=3:21). Therefore, the decrease in the model’s ROM size comes from that compression. According to the tool that’s -29.35% compared to the original size. In the current project the model binary blob is in the `source/src/mnistkeras_data.c` file and it’s a C array like the one in the tflite-micro project. The equivalent file in the tflite-micro project was `source/src/inc/model_data.h`. Those sizes are without quantization, because I didn’t manage to convert the model to UINT8, as the TFLiteConverter converts the model only to INT8, which is not supported in tflite. I’m still puzzled by this; I can’t figure out why it’s happening and I couldn’t find any documentation or example on how to do it.
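As a quick sanity check (my own arithmetic, not output from the tool), the report’s numbers are self-consistent: 93,322 float32 weights at 4 bytes each give the reported uncompressed size, and the compressed ROM figure matches the stated -29.35%:

```python
params = 93_322                  # "params #" from the analyze report
uncompressed = params * 4        # float32 weights are 4 bytes each
compressed_rom = 263_720         # "rom (ro)" from the analyze report

print(round(uncompressed / 1024, 2))   # ~364.54 KiB, the reported params size
reduction = 1 - compressed_rom / uncompressed
print(f"-{reduction:.2%}")             # -29.35%, the reported compression gain
```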

Now, let’s look at the RAM usage. With x-cube-ai the RAM needed is only 36,840 bytes! With tflite-micro I needed 151,312 bytes (see the table in the “Model RAM Usage” section here). That’s about 4x less RAM, which is amazing. The reason is that tflite-micro’s micro_allocator expands the layers of the model into RAM, but with x-cube-ai that doesn’t happen. From the above report (and from what I’ve seen) it seems that the layers remain in ROM and the API only allocates RAM for the operations that need it.
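Again, just arithmetic from the report above and the previous post’s table, to put a number on that difference:

```python
ram_xcube = 33_664 + 3_176   # "ram (rw)" line from the report: activations + extra
ram_tflite = 151_312         # arena size measured in the tflite-micro post

print(ram_xcube)                          # 36840 bytes
print(round(ram_tflite / ram_xcube, 1))   # ~4.1x more RAM needed by tflite-micro
```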

As you can imagine, those two things (RAM and ROM usage) make x-cube-ai a much better option to start with. They even make it possible to run this model on MCUs with less RAM/ROM than the STM32F746, which is considered a buffed MCU. Huge difference in terms of resources.

Framework output project types

As I’ve mentioned previously, with x-cube-ai you can create 3 types of projects (SystemPerformance, Validation, ApplicationTemplate). Let’s see a few more details about those.

Note: for the SystemPerformance and Validation project types, I’ve included the bin files in the extras/ folder. You can only flash those on the STM32F746, which comes with the 32F746GDISCOVERY board.


SystemPerformance

As the name clearly implies, you can use this project type to benchmark the performance using random inputs. If you think about it, that’s all I would need for this post: just import the model, build this application and there you go, I have everything I need. That’s correct, but… I wanted to do the same thing I did in the previous project with tflite-micro and be able to use a comms protocol to upload inputs from hand-drawn digits from the jupyter notebook to the STM32F7, run the inference, get the output back and validate the result. Therefore, although this project type is enough for benchmarking, I still had work to do. But in case you just need to benchmark the MCU running the model inference, just build this; you don’t even have to write a single line of code. This is the serial output when this code runs (it’s a loop, but I only post one iteration).

Running PerfTest on "mnistkeras" with random inputs (16 iterations)...

Results for "mnistkeras", 16 inferences @216MHz/216MHz (complexity: 2852598 MACC)
 duration     : 73.785 ms (average)
 CPU cycles   : 15937636 -1352/+808 (average,-/+)
 CPU Workload : 7%
 cycles/MACC  : 5.58 (average for all layers)
 used stack   : 576 bytes
 used heap    : 0:0 0:0 (req:allocated,req:released) cfg=0

From the above output we can see that @216MHz (the default frequency) the average inference duration was 73.785 ms, plus some other info. OK, so now let’s push the frequency up a bit to @288MHz and see what happens.

Running PerfTest on "mnistkeras" with random inputs (16 iterations)...

Results for "mnistkeras", 16 inferences @288MHz/288MHz (complexity: 2852598 MACC)
 duration     : 55.339 ms (average)
 CPU cycles   : 15937845 -934/+1145 (average,-/+)
 CPU Workload : 5%
 cycles/MACC  : 5.58 (average for all layers)
 used stack   : 576 bytes
 used heap    : 0:0 0:0 (req:allocated,req:released) cfg=0

55.34 ms! It’s amazing. More about that later.
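That scaling is exactly what you’d expect: the cycle count is practically constant between the two runs, so the duration is just cycles divided by clock frequency. A quick check (my arithmetic) against the two outputs above:

```python
cycles = 15_937_636   # average CPU cycles from the 216MHz run above

for f_mhz in (216, 288):
    ms = cycles / (f_mhz * 1e6) * 1e3
    print(f"{ms:.3f} ms @ {f_mhz}MHz")   # 73.785 ms and 55.339 ms
```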


Validation

The Validation project type is the one to use if you want to validate your model with different inputs. There is a mode where you can validate on the target with either random or user-defined data. There is a pdf document here, named “Getting started with X-CUBE-AI Expansion Package for Artificial Intelligence (AI)”, and you can find the format of the user input in section 14.2; it’s just a csv file with comma-separated values.
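Producing such a user-data file is trivial. Here’s a minimal sketch that writes one flattened 28×28 sample as a row of comma-separated values (the one-sample-per-row layout is my reading of section 14.2, so double-check it against the pdf):

```python
import csv

# a dummy 28x28 "digit", normalized to [0, 1] floats like the network input
digit = [[0.0] * 28 for _ in range(28)]
digit[14][14] = 1.0

with open("digit.csv", "w", newline="") as f:
    # one sample per row: 784 comma-separated values
    csv.writer(f).writerow(v for row in digit for v in row)
```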

The default mode, random inputs, produces the following output (warning: a lot of text follows).

Starting AI validation on target with random data... 
Neural Network Tools for STM32 v1.0.0 (AI tools v4.0.0) 
-- Importing model 
-- Importing model - done (elapsed time 0.403s) 
-- Building X86 C-model 
-- Building X86 C-model - done (elapsed time 0.519s) 
-- Setting inputs (and outputs) data 
Using random input, shape=(10, 784) 
-- Setting inputs (and outputs) data - done (elapsed time 0.691s) 
-- Running STM32 C-model 
ON-DEVICE STM32 execution ("mnistkeras", /dev/ttyUSB0, 115200).. 
<Stm32com id=0x7f8fd8339ef0 - CONNECTED(/dev/ttyUSB0/115200) devid=0x449/STM32F74xxx msg=2.0> 
 0x449/STM32F74xxx @216MHz/216MHz (FPU is present) lat=7 Core:I$/D$ ART: 
 found network(s): ['mnistkeras'] 
 description    : 'mnistkeras' (28, 28, 1)-[7]->(1, 1, 10) macc=2852598 rom=257.54KiB ram=32.88KiB 
 tools versions : rt=(4, 0, 0) tool=(4, 0, 0)/(1, 3, 0) api=(1, 1, 0) "Fri Jul 26 14:30:06 2019" 
Running with inputs=(10, 28, 28, 1).. 
....... 1/10 
....... 2/10 
....... 3/10 
....... 4/10 
....... 5/10 
....... 6/10 
....... 7/10 
....... 8/10 
....... 9/10 
....... 10/10 
 RUN Stats    : batches=10 dur=4.912s tfx=4.684s 6.621KiB/s (wb=30.625KiB,rb=400B) 
Results for 10 inference(s) @216/216MHz (macc:2852598) 
 duration    : 78.513 ms (average) 
 CPU cycles  : 16958877 (average) 
 cycles/MACC : 5.95 (average for all layers) 
Inspector report (layer by layer) 
 n_nodes        : 7 
 num_inferences : 10 
Clayer  id  desc                          oshape          fmt       ms         
0       0   10011/(Merged Conv2d / Pool)  (13, 13, 32)    FLOAT32   11.289     
1       2   10011/(Merged Conv2d / Pool)  (5, 5, 64)      FLOAT32   57.406     
2       4   10004/(2D Convolutional)      (3, 3, 64)      FLOAT32   8.768      
3       5   10005/(Dense)                 (1, 1, 64)      FLOAT32   1.009      
4       5   10009/(Nonlinearity)          (1, 1, 64)      FLOAT32   0.006      
5       6   10005/(Dense)                 (1, 1, 10)      FLOAT32   0.022      
6       7   10009/(Nonlinearity)          (1, 1, 10)      FLOAT32   0.015      
                                                                    78.513 (total) 
-- Running STM32 C-model - done (elapsed time 5.282s) 
-- Running original model 
-- Running original model - done (elapsed time 0.100s) 
Exec/report summary (validate 0.000s err=0) 
model file      : /rnd/bitbucket/machine-learning-for-embedded/code-stm32f746-xcube/mnist.tflite 
type            : tflite (tflite) 
c_name          : mnistkeras 
compression     : 4 
quantize        : None 
L2r error       : 2.87924684e-03 (expected to be < 0.01) 
workspace dir   : /tmp/mxAI_workspace3396387792167015918690437549914931 
output dir      : /home/dimtass/.stm32cubemx/stm32ai_output 
model_name      : mnist 
model_hash      : 3be31e1950791ab00299d58cada9dfae 
input           : input_0 (item#=784, size=3.06 KiB, fmt=FLOAT32) 
input (total)   : 3.06 KiB 
output          : nl_7 (item#=10, size=40 B, fmt=FLOAT32) 
output (total)  : 40 B 
params #        : 93,322 (364.54 KiB) 
macc            : 2,852,598 
rom (ro)        : 263,720 (257.54 KiB) -29.35% 
ram (rw)        : 33,664 + 3,176 (32.88 KiB + 3.10 KiB) 
id  layer (type)        output shape      param #     connected to             macc           rom                 
0   input_0 (Input)     (28, 28, 1)                                                                               
    conv2d_0 (Conv2D)   (26, 26, 32)      320         input_0                  237,984        1,280               
    nl_0 (Nonlinearity) (26, 26, 32)                  conv2d_0                                                    
1   pool_1 (Pool)       (13, 13, 32)                  nl_0                                                        
2   conv2d_2 (Conv2D)   (11, 11, 64)      18,496      pool_1                   2,244,480      73,984              
    nl_2 (Nonlinearity) (11, 11, 64)                  conv2d_2                                                    
3   pool_3 (Pool)       (5, 5, 64)                    nl_2                                                        
4   conv2d_4 (Conv2D)   (3, 3, 64)        36,928      pool_3                   332,416        147,712             
    nl_4 (Nonlinearity) (3, 3, 64)                    conv2d_4                                                    
5   reshape_5 (Reshape) (576,)                        nl_4                                                        
    dense_5 (Dense)     (64,)             36,928      reshape_5                36,864         38,144 (c)          
    nl_5 (Nonlinearity) (64,)                         dense_5                  64                                 
6   dense_6 (Dense)     (10,)             650         nl_5                     640            2,600               
7   nl_7 (Nonlinearity) (10,)                         dense_6                  150                                
mnist p=93322(364.54 KBytes) macc=2852598 rom=257.54 KBytes ram=32.88 KBytes -29.35% 
Cross accuracy report (reference vs C-model) 
NOTE: the output of the reference model is used as ground truth value 
acc=100.00%, rmse=0.0007, mae=0.0003 
10 classes (10 samples) 
C0         0    .    .    .    .    .    .    .    .    .   
C1         .    0    .    .    .    .    .    .    .    .   
C2         .    .    2    .    .    .    .    .    .    .   
C3         .    .    .    0    .    .    .    .    .    .   
C4         .    .    .    .    0    .    .    .    .    .   
C5         .    .    .    .    .    1    .    .    .    .   
C6         .    .    .    .    .    .    0    .    .    .   
C7         .    .    .    .    .    .    .    2    .    .   
C8         .    .    .    .    .    .    .    .    5    .   
C9         .    .    .    .    .    .    .    .    .    0   
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_m_inputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_c_inputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_m_outputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_c_outputs.csv 
Creating /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_val_io.npz 
Evaluation report (summary) 
Mode                acc       rmse      mae       
X-cross             100.0%    0.000672  0.000304  
L2r error : 2.87924684e-03 (expected to be < 0.01) 
Creating report file /home/dimtass/.stm32cubemx/stm32ai_output/mnistkeras_validate_report.txt 
Complexity/l2r error by layer - macc=2,852,598 rom=263,720 
id  layer (type)        macc                          rom                           l2r error                     
0   conv2d_0 (Conv2D)   |||                     8.3%  |                       0.5%                                
2   conv2d_2 (Conv2D)   |||||||||||||||||||||  78.7%  |||||||||||            28.1%                                
4   conv2d_4 (Conv2D)   |||                    11.7%  |||||||||||||||||||||  56.0%                                
5   dense_5 (Dense)     |                       1.3%  ||||||                 14.5%                                
5   nl_5 (Nonlinearity) |                       0.0%  |                       0.0%                                
6   dense_6 (Dense)     |                       0.0%  |                       1.0%                                
7   nl_7 (Nonlinearity) |                       0.0%  |                       0.0%  2.87924684e-03 *              
fatal: not a git repository (or any of the parent directories): .git 
Using TensorFlow backend. 
Validation ended
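A note on reading the cross-accuracy matrix above: rows are the reference model’s classes and columns the C-model’s predictions (per the tool’s note about the reference output being the ground truth), so a 100% run means every sample sits on the diagonal. Indeed, the diagonal counts add up to the 10 random samples:

```python
# diagonal entries (C0..C9) of the 10x10 matrix from the random-data run above
diagonal = [0, 0, 2, 0, 0, 1, 0, 2, 5, 0]

print(sum(diagonal))   # 10 samples, all classified identically by both models
```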

I’ve also included a file, extras/digit.csv, which is the digit “2” (the same one used in the jupyter notebook); you can use it to verify the model on the target using the `extras/code-stm32f746-xcube-evaluation.bin` firmware and CubeMX. You just need to load the digit as the CubeMX input and validate the model on the target. This is part of the output when validating with that file:

Cross accuracy report (reference vs C-model) 
NOTE: the output of the reference model is used as ground truth value 
acc=100.00%, rmse=0.0000, mae=0.0000 
10 classes (1 samples) 
C0         0    .    .    .    .    .    .    .    .    .   
C1         .    0    .    .    .    .    .    .    .    .   
C2         .    .    1    .    .    .    .    .    .    .   
C3         .    .    .    0    .    .    .    .    .    .   
C4         .    .    .    .    0    .    .    .    .    .   
C5         .    .    .    .    .    0    .    .    .    .   
C6         .    .    .    .    .    .    0    .    .    .   
C7         .    .    .    .    .    .    .    0    .    .   
C8         .    .    .    .    .    .    .    .    0    .   
C9         .    .    .    .    .    .    .    .    .    0

The above output means that the network found the digit “2” with 100% accuracy.


ApplicationTemplate

This is the project type you want to build when you develop your own application. In this case CubeMX creates only the necessary code that wraps the x-cube-ai library. These are the app_x-cube-ai.h and app_x-cube-ai.c files located in the source/src folder (and in its inc/ folder). They are just wrapper files around the library and the model. You actually only need to call the generated init function and then you’re ready to run your inference.


The x-cube-ai static lib

Let’s see a few things about the x-cube-ai library. First and most important, it’s a closed-source, proprietary library. You won’t get the code for it, which for people like me is a big negative. I guess this is how ST tries to keep the library tied to their own hardware, which makes sense; but nevertheless I don’t like it. That means the only things you have access to are the header files in the `source/libs/AI/Inc` folder and the static library blob. The only insight you can get into the library is by using readelf to extract some information from the blob. I’ve added the output in `extras/elfread_libNetworkRuntime400_CM7_GCC.txt`.

From the output I can tell that this was built on a Windows machine by the user `fauvarqd`, lol. Very valuable information. OK, seriously now, you can also see the exported calls (which you could see anyway from the header files) and also the names of the object files used to build the library. Another trick, if you want to get more info, is to try to build the project after removing the dsp library. Then the linker will complain that it can’t find some functions, which means you can derive some of them. But does it really matter, though? No source code, no fun 🙁

I don’t like the fact that I don’t have access to the source, but it is what it is, so let’s move on.

Building the project

You can find the C++ cmake project here:

In the source/libs folder you’ll find all the necessary libraries, which are CMSIS, the STM32F7xx_HAL_Driver, flatbuffers and the x-cube-ai lib. All of these are built as static libraries and then the main.cpp app is linked against them. You will find the cmake files for those libs in source/cmake. The README file in the repo is quite thorough about the build options and the different builds. To build the code run this command:


If you want to enable overclocking then you can build like this:


Just be aware that you need to set the value you want for the clock in the source/src/main.cpp file on this line:

RCC_OscInitStruct.PLL.PLLN = 288; // Overclock

The default overclocking value is 288MHz, but you can experiment with higher values (in my case 288MHz was the maximum without hard-faults).

Also, if you overclock, you should also change the clock dividers on the APB1 and APB2 buses, otherwise those clocks will be too high and you’ll get hard-faults.

RCC_ClkInitStruct.APB1CLKDivider = RCC_HCLK_DIV4;
RCC_ClkInitStruct.APB2CLKDivider = RCC_HCLK_DIV2;

The build command will build the project in the build-stm32 folder. It’s interesting to see the resulting sizes of the libs and the binary file. The next table lists the sizes using the current latest gcc-arm-none-eabi-8-2019-q3-update toolchain from here. By the time you read this article, this may have already changed.

File                            Size
stm32f7-mnist-x_cube_ai.bin     339.5 kB
libNetworkRuntime400_CM7_GCC.a  414.4 kB

This is interesting. Let’s see now the differences between the resulted binary and the main AI libs (tflite-micro and x-cube-ai).

(sizes in kB)
          x-cube-ai   tflite-micro
binary    339.5       542.7
library   414.4       2867

As you can see from the above, both the binary and the library for x-cube-ai are much smaller. Regarding the binary, that’s because the model is smaller, as the weights are compressed. Regarding the libs, you can’t directly compare the sizes, as the implementation and the supported layers of tflite-micro are different; but it seems that the x-cube-ai library is much more optimized for this MCU and also more stripped down.

Supported commands in STM32F7 firmware

The code structure of this project in the repo is pretty much the same as the code in the 3rd post. In this case, though, I’ve only used a single command. I’ll copy-paste the needed text from the previous post.

After you build and flash the firmware on the STM32F7 (read the repo’s README for more detailed info), you can use a serial port to either send commands via a terminal like cutecom or interact with the jupyter notebook. The firmware supports two UART ports on the STM32F7. In the first case the commands are just ASCII strings, but in the second case it’s a binary flatbuffer schema. You can find the schema in `source/schema/schema.fbs` if you want to experiment and change stuff. In the firmware code the handling of the received UART commands is done in `source/src/main.cpp` in the `dbg_uart_parser()` function.

The command protocol is plain simple (115200,8,n,1) and its format is:

CMD=<ID>

where ID is a number and each number is a different command. So:
CMD=1, runs the inference on the hard-coded hand-drawn digit (see below)
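The ASCII side of the protocol is simple enough to sketch. This is my own toy parser for illustration, not the actual `dbg_uart_parser()` code from the firmware:

```python
def parse_cmd(line: str) -> int:
    """Parse an ASCII command like 'CMD=1' and return the command ID."""
    key, _, value = line.strip().partition("=")
    if key != "CMD" or not value.isdigit():
        raise ValueError(f"malformed command: {line!r}")
    return int(value)

print(parse_cmd("CMD=1\n"))   # 1 -> run inference on the hard-coded digit
```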

This is how I’ve connected the two UART ports in my case. Also have a look at the repo’s README for the exact pins on the connector.

Note: more copy-paste from the previous post follows, as many things are the same, but I have to include them here for consistency.

Use the Jupyter notebook with STM32F7

In the jupyter notebook here, there’s a description of how to evaluate the model on the STM32F7. There are actually two ways to do that: the first is to use the digit which is already integrated in the code and the other is to upload your hand-drawn digit to the STM32 for evaluation. In either case this will validate the model and also benchmark the NN. Therefore, all you need to do is build and upload the firmware, make the proper connections, run the jupyter notebook and follow the steps in “5. Load model and interpreter”.

I’ve written two custom Python classes which are used in the notebook. Those classes are located in the jupyter_notebook/ folder, each in its own subfolder.


The MnistDigitDraw class uses tkinter to create a small window in which you can draw your custom digit with your mouse.


In the left window you can draw your digit using your mouse. When you’re done, you can either press the Clear button if you’re not satisfied, or press the Inference button, which converts the digit to the format used for the inference (I now think that this button name is not the best I could have used, but anyway). This also displays the converted digit on the right side of the panel. This is an example.

Finally, you need to press the Export button to write the digit into a file, which can be used later in the notebook. Have in mind that the jupyter notebook can only execute one cell at a time. That means that as long as this window is open, the current cell is still running, so you first need to close the window by pressing the [x] button to proceed.

After you export the digit you can validate it in the next cells, either in the notebook or on the STM32F7.


The FbComm class handles the communication between the jupyter notebook and the STM32F7 (or another tool, which I’ll explain). FbComm supports two different communication channels: the first is serial comms using a serial port and the other is a TCP socket. There is a reason I’ve done this. Normally, the notebook communicates over the serial port to send data to and receive data from the STM32F7. Developing against this channel is slow, as it takes a lot of time to build and flash the firmware on the device every time. Therefore, I’ve written a small C++ tool in `jupyter_notebook/FbComm/test_cpp_app/fb_comm_test.cpp`. Actually it’s mainly C code for the sockets, but wrapped in a C++ file as flatbuffers need C++. Anyway, if you plan on changing stuff in the flatbuffer schema, it’s better to use this tool first to validate the protocol and the conversions, and when that’s done just copy-paste the code to the STM32F7 and expect that it should work.

When you switch to the STM32F7 then you can just use the same class but with the proper arguments for using the serial port.
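The idea behind FbComm, one class with two interchangeable transports, can be sketched like this (a toy TCP-only version of mine, not the actual class from the repo; the real serial backend presumably uses pyserial behind the same API):

```python
import socket

class Comm:
    """Minimal transport wrapper: a serial backend would expose the same
    send()/recv() API, so the notebook code doesn't care whether the
    STM32F7 or the C++ test tool sits on the other side."""

    def __init__(self, host: str, port: int):
        self.sock = socket.create_connection((host, port))

    def send(self, data: bytes) -> None:
        self.sock.sendall(data)

    def recv(self, n: int) -> bytes:
        return self.sock.recv(n)
```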


The files in this folder are generated by the flatc compiler, so you shouldn’t change anything in there. If you make any changes in `source/schema/schema.fbs`, then you need to re-run the flatc compiler to re-create the files. Have a look at the “Flatbuffers” section in the repo’s README on how to do this.

Benchmarking the x-cube-ai

The benchmark procedure was a bit easier with x-cube-ai compared to tflite-micro. I’ve just compiled the project with and without overclocking and run the inference several times from the jupyter notebook. As I’ve mentioned earlier, you don’t really have to do that, just use the SystemPerformance project from CubeMX and change the frequency; but that’s not as cool as uploading your hand-drawn digit, right? Anyway, here’s the table with the results:

216 MHz      288 MHz
76.754 ms    57.959 ms

Now let’s do a comparison between the tflite-micro and the x-cube-ai inference run times.

          x-cube-ai (ms)   tflite-micro (ms)   difference
216 MHz   76.754           923                 ~12.0x
288 MHz   57.959           692                 ~11.9x
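For what it’s worth, the ratio computed straight from the two benchmark tables (my arithmetic, the figures are from the posts):

```python
# (frequency, x-cube-ai ms, tflite-micro ms) from the benchmark tables
results = [(216, 76.754, 923), (288, 57.959, 692)]

for f_mhz, x_ai, tfl in results:
    print(f"@{f_mhz}MHz: {tfl / x_ai:.1f}x faster")   # ~12x at both frequencies
```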

Wow, that’s a huge difference, right? The x-cube-ai is roughly 12x faster. Great work they’ve done.

You might have noticed that the inference time is a bit higher now compared to the SystemPerformance project binary. I can only assume that this is because in that benchmark the outputs are not populated and are dropped. I’m not sure about this, but it’s my guess, as it seems to be consistent behaviour. Anyway, the difference is 2-3 ms, so I won’t ruin my day thinking more about it, as the results of my project are actually a bit faster than the default validation project.

Evaluating on the STM32F7

This is an example image of the digit I’ve drawn. The format is the standard grayscale 28×28 px image. That’s a uint8 grayscale image with values in [0, 255], but it’s normalized to [0, 1] floats, as the network’s input and output are float32.
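That normalization is a one-liner; a plain-Python sketch of what happens to the image before it’s sent for inference (the notebook does this with numpy, but the math is the same):

```python
# stand-in for the 28x28 uint8 grayscale digit (values 0..255)
img = [[(r * 28 + c) % 256 for c in range(28)] for r in range(28)]

# flatten to the 784-element float input the network expects, scaled to [0, 1]
x = [v / 255.0 for row in img for v in row]

assert len(x) == 784
assert all(0.0 <= v <= 1.0 for v in x)
```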

After running the inference on the target we get back this result on the jupyter notebook.

Comm initialized
Num of elements: 784
Sending image data
Receive results...
Command: 2
Execution time: 76.754265 msec
Out[9]: 0.000000
Out[8]: 0.000000
Out[7]: 0.000000
Out[6]: 0.000000
Out[5]: 0.000000
Out[4]: 0.000000
Out[3]: 0.000000
Out[2]: 1.000000
Out[1]: 0.000000
Out[0]: 0.000000

The output predicts that the input is number 2 and it’s 100% certain about it. Cool.
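The predicted class is simply the argmax of the ten float outputs; reproducing that conclusion from the values above:

```python
# Out[0]..Out[9] from the notebook output above
outs = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

digit = max(range(10), key=lambda i: outs[i])
print(digit, outs[digit])   # 2 1.0 -> "it's a 2", with certainty 1.0
```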

Things I liked and didn’t like about x-cube-ai

From what you’ve read above, you can pretty much conclude the pros of x-cube-ai by yourself; they actually make almost all the cons seem less important, but I’ll list them anyway. This is not yet a comparison with tflite-micro.


Pros:

  1. It’s lightning fast. The performance of this library is amazing.
  2. It’s very light, doesn’t use a lot of resources, and the resulting binary is small.
  3. The tool in CubeMX is able to compress the weights.
  4. The x-cube-ai tool is integrated nicely in the CubeMX interface, although it could be better.
  5. Great analysis reports that help you decide which MCU to use and which optimizations to make before you even start coding (regarding ROM and RAM usage).
  6. Supports importing models from Keras, tflite, Lasagne, Caffe and ConvNetJS, so you are not limited to one tool; the Keras support in particular is really nice.
  7. You can build and test the performance and validate your NN without having to write a single line of code. Just import your model and build the SystemPerformance or Validation application and you’re done.
  8. When you write your own application based on the template, you actually only have to use two functions: one to init the network and one to run your inference. That’s it.


Cons:

  1. It’s a proprietary library! No source code available. That’s a big, big problem for many reasons. I’ve never had a good experience with closed-source libraries, because when you hit a bug you’re f*cked: you can’t debug and solve it yourself, you need to file a bug report and then wait. And you might wait forever!
  2. ST’s support quite sucks if you’re an individual developer or a really small company. There is a forum, which relies on other developers’ help, but most of the time you might not get an answer. Sometimes you’ll see answers from ST staff, but don’t expect that to happen most of the time. If you’re a big player and you have support from component vendors like Arrow etc., then you can expect all the help you need.
  3. Lack of documentation. There’s only a pdf document here (UM2526). It has a lot of information, but a lot of information is still missing. Tbh, after I searched the x-cube-ai folders installed by CubeMX, I found more info and tools, but there’s no mention of those anywhere! I really didn’t like that. OK, now I know, so if you’re also desperate, then on your Linux box have a look at this path: ~/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/4.0.0/Documentation. That’s for the 4.0.0 version, so in your case it might be different.

TFLite-micro vs x-cube-ai

Disclaimer: I have nothing to do with ST and I’ve never even got a free sample from them. I had to do this, for what is following…

As you can see, x-cube-ai has more pros and fewer cons compared to tflite-micro. Tbh, I’ve also enjoyed working with x-cube-ai more than with tflite-micro, as it was much easier. The only thing about x-cube-ai that leaves a bitter taste is that it’s proprietary software. I can’t stress enough how much I don’t like this and all the problems it brings along. For example, let’s assume that tomorrow ST decides to pull the plug on this project: boom, everything is gone. That doesn’t sound very nice when you’re planning a long commitment to an API or tool. I insist on this because over the last 15-16 years I’ve seen it happen many times in my professional career, and you don’t want it to happen to your released product. Of course, if the API serves you well for your current project and you don’t plan on changing something critical, then it’s fine, go for it. But I really like the fact that tflite-micro is open.

I’m a bit puzzled about tflite-micro. At this point, the only reason I can think of for using tflite-micro over x-cube-ai is if you want to port your code from a tflite project that already runs on your application CPU (and Linux) to an MCU, to test and prototype and decide whether it’s worth switching to an MCU as a cheaper solution. Of course, the impact of tflite-micro on performance is something that needs consideration, and currently there’s no rule of thumb for how much slower it is compared to other APIs on specific hardware. For example, in the STM32F7 case (and for this specific model) it’s roughly 12x slower, but this figure might be different for another MCU. Anyway, you must be aware of these limitations, know what to really expect from tflite-micro and how much room you have for performance enhancement.

There is another nice thing about tflite-micro though. Because it’s open source, you can fork the git repo and spend time optimising the code for your specific hardware. The performance will definitely be much, much better, but I can’t really say how much, as it depends on so many things. Have in mind also that tflite-micro is written in C++ and some of its hocus pocus and template magic may have a negative impact on performance. But at least it remains a good alternative for prototyping, experimentation and developing on its core. And that’s the best thing about open source code.

Finally, x-cube-ai limits your options to the STM32 series. Don’t get me wrong, this MCU series is great and I use stm32 in many of my projects, but it’s always nice to have an alternative.


The x-cube-ai is fast. It’s also easy to use and develop on, it has those ready-to-build apps and the template to build your project, and everything comes in an all-in-one solution (CubeMX). But on the other hand it’s a black box, and don’t expect much support if you’re not a big player.

ST has been very active over the last year. I also liked the STM32-MP1 SBC they released, with Yocto support from day one and mainline kernel support. They are very active and serious. Although I still consider the whole HAL Driver library bloated (which it is, as I’ve shown in previous stupid-projects), I didn’t have any issues; but then again, I didn’t write much code for these last two projects (I had serious issues when I tried a few years ago).

Generally, this code is focused on the NN libs’ performance rather than the MCU peripheral library’s performance, but you still need to consider those things when you’re evaluating platforms for a new project.

From a brief look at the source code, though, it seems that you can use the x-cube-ai library without the HAL library, but you would need to port some bits to the LL library to use it with that instead. Anyway, that’s me; I guess most people are happy with HAL, so…

In my next post, I will use a jetson-nano to run the same inference using tflite (not micro) and an ESP8266 that will use a REST API. Also, TensorRT seems nice; I may try that for the next post too, we’ll see.

Update: Next post is available here.

Have fun!