Tensorflow 2.1.0 for microcontrollers benchmarks on Teensy 4.0


It’s been some time since I last updated the blog, but I was too busy with other stuff. I’ve updated the meta-allwinner-hx layer a couple of times and the last major update was to move the layer to the latest dunfell 3.1 version. Also, I’ve ordered an Artillery Genius 3D printer (I felt connected with its name, lol). I’ve wanted to get into 3D printing for many years now, but never really had the time. Not that I have time now, but as I have realized with many things in life, if you don’t just do it you’ll never find the time for it.

In other news, I’ve also received an STM32MP157C-DK2 from ST to evaluate and I would like to focus my next couple of posts on this SBC. I’ve already got fairly deep into the documentation and also pushed some PRs to the public git Yocto meta layer. The next thing is to start writing a couple of posts that test some interesting aspects of this SBC and list its pros and cons.

But before getting deep into the STM32MP1 I wanted to do something else first, which was to test my TensorFlow Lite for microcontrollers template on the i.MX RT1062 MCU, which is used on the Teensy 4.0 and is supposed to be one of the fastest MCUs.

So, is it really that fast? How does it perform compared to the STM32F746 I’ve used in the last post here?

About Teensy 4.0

Well, regarding the Teensy 4.0, I really like this board and I definitely love its ecosystem, so congrats to Paul Stoffregen for what he has created. He also seems extremely active in every aspect, like supporting the forums, the released boards, the middleware and the customers, and at the same time he also finds time to design and release new boards fast.

In one of my latest posts I figured out that the Teensy 4.0 (actually the imxrt1062) doesn’t have an onboard flash; this happens to me often because I don’t spend time reading and understanding all the specs of the boards I’m buying for evaluation. Therefore, the firmware is written to an external SPI NOR. There is also a custom bootloader and a host flashing tool that uploads the firmware. Luckily, I’ve also found out that this bootloader doesn’t do any other advanced things, like preparing the runtime environment or using custom libs that the firmware depends on, therefore you can just compile your code using the NXP SDK and CMSIS and then upload it to the SPI NOR, without using the Arduino framework.

This works great and I’ve also created a cmake template, but I haven’t found time to finish it properly and pack it in a state that can be published. Nevertheless, I’ve used this unpublished template to build the TF-Lite micro code with the Keras model I’m using in all the Machine Learning for Embedded posts. So, in this post I will present the results and give you access to the source code to try it yourself.

If you’ve read this far and you’ve also read the previous post, then pause for a moment and think. What results do you expect? Especially compared to the STM32F746, which is a 216MHz MCU, knowing that the iMX RT1060 runs at 600MHz and is a beast in terms of clock speed. How much performance difference would you expect?

Source code

Since I’ve explained most of the useful code in the previous post here, I won’t go into the details again. You can find the source code for the Teensy 4.0 here:


In the repo’s README there’s a thorough explanation of how to build and flash the firmware, so I’ll skip some details here. To build the code you can run this command:

./docker-build.sh "./build.sh"

This will build the code with the default options, which means that the uncompressed model is used and tflite doesn’t use the cmsis-nn acceleration library. To run the above command you need to have Docker installed on your system; it’s the preferred method to build this code, so that you get the same results as me. After running this command you should see this in your console:

[ 99%] Linking CXX executable flexspi_nor_release/imxrt1062-tflite-micro-mnist.elf
Print sizes: imxrt1062-tflite-micro-mnist.hex
   text	   data	    bss	    dec	    hex	filename
 642300	    484	  95376	 738160	  b4370	flexspi_nor_release/imxrt1062-tflite-micro-mnist.elf
[ 99%] Built target imxrt1062-tflite-micro-mnist.elf
Scanning dependencies of target imxrt1062-tflite-micro-mnist.hex
[100%] Generating imxrt1062-tflite-micro-mnist.hex
[100%] Built target imxrt1062-tflite-micro-mnist.hex

This means that the code was built without errors and the size of the code (the text column) is 642300 bytes. cmake also creates a HEX file from the elf, which is needed by the teensy cli flashing tool.

If you want to build the code with the compressed model and the cmsis-nn acceleration library then you need to run this command:

./docker-build.sh "USE_COMP_MODEL=ON USE_CMSIS_NN=ON ./build.sh"

After this command you should see this output:

[ 99%] Linking CXX executable flexspi_nor_release/imxrt1062-tflite-micro-mnist.elf
Print sizes: imxrt1062-tflite-micro-mnist.hex
   text	   data	    bss	    dec	    hex	filename
 387108	    708	 108152	 495968	  79160	flexspi_nor_release/imxrt1062-tflite-micro-mnist.elf
[ 99%] Built target imxrt1062-tflite-micro-mnist.elf
[100%] Generating imxrt1062-tflite-micro-mnist.hex
[100%] Built target imxrt1062-tflite-micro-mnist.hex

As you can see, the firmware is now much smaller because the compressed model is used.

To flash any of those firmwares you need to install the teensy_loader_cli tool, which is located in this repo here.


In order to see the debug output you need to connect a USB-to-UART module to the Teensy, as I’m not using any VCP interface in this case, to keep the code simple. I’m using UART2 and the pinmux for the Tx and Rx pins is as follows:

Function | Teensy pin
Tx | 14
Rx | 15

Then you need to open a terminal on your host (I’m using CuteCom) and send either 1 or 2 to the serial port. You can either send just the character or add a newline at the end, it doesn’t matter. The functions of those two commands are:

Command | Description
1 | Execute the ViewModel() function to output the details of the model
2 | Run the inference using a predefined digit (handwritten number 8)


Now the interesting part. I’ve built the code for 4 different cases.

Case | Description
1 | Uncompressed model, without CMSIS-NN @ 600MHz
2 | Compressed model, without CMSIS-NN @ 600MHz
3 | Uncompressed model, with CMSIS-NN @ 600MHz
4 | Compressed model, with CMSIS-NN @ 600MHz

The results are listed in the following table (all numbers are milliseconds):

Layer | [1] | [2] | [3] | [4]
DEPTHWISE_CONV_2D | 6.30 | 6.31 | 6.04 | 6.05
MAX_POOL_2D | 0.863 | 0.858 | 0.826 | 0.828
CONV_2D | 171.40 | 165.73 | 156.84 | 150.84
MAX_POOL_2D | 0.246 | 0.247 | 0.256 | 0.257
CONV_2D | 26.40 | 25.58 | 26.60 | 26.35
FULLY_CONNECTED | 3.00 | 0.759 | 3.02 | 1.81
FULLY_CONNECTED | 0.066 | 0.090 | 0.081 | 0.098
SOFTMAX | 0.035 | 0.037 | 0.033 | 0.035
Total time | 208.4 | 199.62 | 193.72 | 186.29

What do you think about these results? Hm… OK, let me take the best-case scenario from those results and compare it with the best results I got with the STM32F746 in the previous post.

Layer | i.MXRT1062 @ 600MHz | STM32F746 @ 288MHz
DEPTHWISE_CONV_2D | 6.05 | 18.68
MAX_POOL_2D | 0.828 | 2.45
CONV_2D | 150.84 | 124.54
MAX_POOL_2D | 0.257 | 0.72
CONV_2D | 26.35 | 17.49
SOFTMAX | 0.035 | 0.01
Total time | 186.29 | 165.02


The above results are so confusing to me that I’m starting to think that I may be doing something wrong here. I’ve double-checked all the compiler optimizations and flags so they are the same as for the STM32F7 and also checked that the imxrt1062 clock frequency is correct. Everything seems to be fine, yet I get those poor results with the exact same version of TensorFlow.

When I get to such cases and find it difficult to debug further, I either try to make logical assumptions about the problem, or I try to reach for some help (e.g. RTFM). In this case, I will only make assumptions, because I don’t really have the time to read the whole MCU user manual to try to figure out if there’s a specific flag in the MCU that may give some better results.

The first thing that comes to my mind is of course the storage from which the code is running. In the case of the STM32F746 there is a fast onboard flash with prefetch, which means that the CPU only waits a small amount of time for the next instructions to reach the execution pipeline. But in the case of the imxrt1062 the code is stored in an external SPI NOR. This means that each instruction first needs to be read via SPI before it reaches the execution pipeline, and this takes more time compared to the onboard flash. Hence, this is my theory of why the imxrt1062 has worse inference performance, although its core clock is 2x faster compared to the overclocked STM32F746 @ 288MHz.

So, what do you think? Does that make sense to you? Do you have another suggestion?


To be honest, I was expecting much better results from the Teensy 4.0 (I mean the imxrt1062). I didn’t expect it to be 2x faster, but I did expect a ~1.5x factor, and I was wrong. My assumption is that the lower performance is because running from the SPI NOR has a great performance hit in this case. I also assume that another MCU with the same imxrt core and a fast internal flash would perform much better than that.

So, is the Teensy 4.0 or the imxrt1062 crap? No! Not at all. I see a lot of performance gain in computation-demanding applications where the data is already stored in RAM. Also, the linker script for the imxrt1062 is written in a convenient way, so that you can easily place specific functions and data in the m_data and m_data2 areas (see source/MIMXRT1062xxxxx_flexspi_nor.ld in the repo). In GCC you can use the compiler’s section attribute to do this, for example:

void function(void)

Anyway, for this case the imxrt1062 doesn’t seem to perform well and is actually even slower than the STM32F746, which runs at a much slower clock.

There is a chance, of course, that I’m doing something wrong in my tests, so if I have any update I’ll post it here.

Have fun!

Tensorflow 2.1.0 for microcontrollers benchmarks on STM32F746


Over 8 months ago I started writing the Machine Learning for Embedded post series, which starts here. The 3rd post in this series (here) was about using tensorflow lite for microcontrollers on the STM32F746NGH6U (STM32F746-disco board). In that post I did some benchmarks and in the next post I compared the performance with the X-CUBE-AI framework from ST.

At that time the latest TF version was 1.14.0 and tflite-micro was still in an experimental state. The performance was quite bad compared to X-CUBE-AI. CMSIS-DSP and CMSIS-NN were not supported, and optimized or compressed models weren’t supported either. When I tried to use an optimized model I was getting an error that INT8 is not supported in the framework.

So, after almost a year TF is now at version 2.1.0, 2.2.0 is around the corner and tflite-micro is no longer experimental. Also, CMSIS-DSP/NN is now supported for many kernels, as are optimized models. Therefore, I felt like it made sense to give it another try.

For those tests I’m using the default Keras MNIST dataset and a quite large model with many weights. So, let’s see what has changed.

Source code

I’ve updated the repo I’ve used in post 3 and added two new tags. The previous version has the v1.14.0 tag and the new version has the v2.1.0 tag. The repo is located here:


Besides the TF version upgrade I’ve also made some minor changes in the code. The most important is that I’m now using a much more precise timer to benchmark each layer of the model. For that I’m using the Data Watchpoint and Trace unit (DWT) that is available on all Cortex-M MCUs. This provides a very accurate counter and the code for this in the repo is located in `source/src/inc/dwt_tools.h`. In order to use it properly I had to do some small modifications in the tflite-micro and specifically to the file `source/libs/tensorflow/tensorflow/lite/micro/micro_interpreter.cc`.

In this file the interpreter’s Invoke() function runs the inference by executing each layer of the model. Therefore, I’m using DWT to time the execution of each individual layer and then I’m adding that time to a separate variable. This is the code:

TfLiteStatus MicroInterpreter::Invoke() {
  if (initialization_status_ != kTfLiteOk) {
    TF_LITE_REPORT_ERROR(error_reporter_,
                         "Invoke() called after initialization failed\n");
    return kTfLiteError;
  }

  // Ensure tensors are allocated before the interpreter is invoked to avoid
  // difficult to debug segfaults.
  if (!tensors_allocated_) {
    TF_LITE_ENSURE_OK(&context_, AllocateTensors());
  }

  for (size_t i = 0; i < operators_->size(); ++i) {
    auto* node = &(node_and_registrations_[i].node);
    auto* registration = node_and_registrations_[i].registration;

    /* reset dwt */
    dwt_init();
    dwt_reset();

    if (registration->invoke) {
      TfLiteStatus invoke_status = registration->invoke(&context_, node);
      if (invoke_status == kTfLiteError) {
        TF_LITE_REPORT_ERROR(
            error_reporter_,
            "Node %s (number %d) failed to invoke with status %d",
            OpNameFromRegistration(registration), i, invoke_status);
        return kTfLiteError;
      } else if (invoke_status != kTfLiteOk) {
        return invoke_status;
      }
    }

    /* time spent in this layer */
    float time_ms = dwt_cycles_to_float_ms( dwt_get_cycles() );
    glb_inference_time_ms += time_ms;
    printf("%s: %f msec\n", OpNameFromRegistration(registration), time_ms);
  }
  return kTfLiteOk;
}

As you can see, dwt_init() initializes DWT and then dwt_reset() resets the counter. This is done inside the for loop that runs each layer. After the layer is executed, dwt_get_cycles() returns the clock cycles that the MCU spent, which are then converted to msec and printed to the UART. Finally, all the msec values are added to the glb_inference_time_ms variable, which is printed in main.c after the inference execution is done.

Pretty much, everything else is the same, except that I’ve also updated the CMSIS-DSP version from 1.6.0 to 1.7.0, but I wasn’t expecting any serious performance gain from this change.

Update: I’ve updated the 1.14.0 version and now it’s using the exact same versions for all the external libraries. Also, v1.14.0 now has its own branch if you want to test it.

Because of these changes I had to add two additional cmake options which are the following:

USE_CMSIS_NN: this enables/disables the usage of the cmsis-nn enabled kernels, which are located in `source/libs/tensorflow/tensorflow/lite/micro/kernels/cmsis-nn`. By default this option is OFF, therefore you need to explicitly enable it in the build script call, as I’ll show later.

USE_COMP_MODEL: this selects which model is going to be used. Currently there are two models, with compressed and uncompressed weights. The default option is OFF, so the uncompressed model is used, which is 2.1MB and therefore can only be used on MCUs that have that much FLASH.

Both models are just byte arrays which are the serialized flatbuffer of the model structure, including the model configuration (settings and layers) and the weights. This blob is then expanded at run time by the tflite-micro API to run the inference. The uncompressed model is located in the `source/src/inc/model_data_uncompressed.h` header and the compressed one in `source/src/inc/model_data_compressed.h`.

Finally, there’s a hard-coded digit in the flash which is the number 2. The digit is in file `source/src/inc/digit.h`. Of course, it doesn’t really matter if we use an actual digit or random tensor in the input in order to benchmark the inference, but since the digit is there, I’ll use that.


This firmware supports two UART ports, one for debugging and the other to be used with the Jupyter notebook in `jupyter_notebook/MNIST-TensorFlow.ipynb`. UART6 is used for printing debug messages and also for sending commands via the terminal. UART7 is used for communicating with the jupyter notebook and for sending/receiving flatbuffers. For this post I won’t use UART7, so I’ll just trigger the inference of the pre-defined digit, which is already in the FLASH, by sending a command via UART6. The STM32F7-discovery board has an Arduino-like header arrangement, therefore the pinout is the following:

Function Arduino connector Actual pin

The baudrate for this port is 115200 bps and there are only two supported commands, which are the following:

Command | Description
CMD=1 | Prints the model input and output (tensor size)
CMD=2 | Executes the inference for the default digit.

When the program starts it prints a line similar to this in the UART:

Program started @ 216000000...

As you can see the frequency is displayed when the firmware boots, so it’s easier to verify the clock.

Build the code

To build the code you just need a toolchain and cmake, but to make it easier I’ve added a script in the repo that uses the CDE image I’ve created in the DevOps for Embedded post series. I’m using the same CDE docker image to build all my repos in circleci and gitlab, so you can expect to get the same results as me.

To use the docker-build script you just need to run it like this:


This command will build the uncompressed model without cmsis-nn support and without overclocking, and the next command will build the firmware with the MCU overclocked to 288MHz and with the cmsis-nn kernels:


CLEANBUILD and USE_COMP_MODEL are not really necessary for the above commands, but I’ve added them for completeness. If you like you can have a look in the circleci builds here.


Finally, the best part. I’ve run 4 different tests, which are the following:

Case | Description
1 | Default kernels @ 216MHz
2 | CMSIS-NN kernels @ 216MHz
3 | Default kernels @ 288MHz
4 | CMSIS-NN kernels @ 288MHz

The results I got are the following (all times are in msec):

Layer | [1] | [2] | [3] | [4]
DEPTHWISE_CONV_2D | 25.19 | 24.9 | 18.89 | 18.68
MAX_POOL_2D | 3.25 | 3.27 | 2.44 | 2.45
CONV_2D | 166.25 | 166.29 | 124.58 | 124.54
MAX_POOL_2D | 0.956 | 0.96 | 0.71 | 0.72
CONV_2D | 23.327 | 23.32 | 17.489 | 17.49
FULLY_CONNECTED | 1.48 | 1.48 | 1.11 | 1.11
FULLY_CONNECTED | 0.03 | 0.03 | 0.02 | 0.02
SOFTMAX | 0.02 | 0.02 | 0.017 | 0.01
Total time | 220.503 | 220.27 | 165.256 | 165.02

Note: In the next tables I don’t specify if the benchmark is run on the compressed or the uncompressed model, because the performance is identical.

So, now let’s do a comparison between the 1.14.0 version I’ve used a few months ago and the new 2.1.0 version. This is the table with the results:

Default kernels @ 216MHz [1]
Layer | 1.14.0 | 2.1.0 | diff (msec)
DEPTHWISE_CONV_2D | 18.77 | 25.19 | 6.42
MAX_POOL_2D | 1.99 | 3.25 | 1.26
CONV_2D | 90.94 | 166.25 | 75.31
MAX_POOL_2D | 0.56 | 0.956 | 0.396
CONV_2D | 12.49 | 23.327 | 10.837
SOFTMAX | 0.01 | 0.02 | 0.01
Total time | 126.27 | 220.503 | 94.233


CMSIS-NN kernels @ 216MHz [2]
Layer | 1.14.0 | 2.1.0 | diff (msec)
DEPTHWISE_CONV_2D | 18.7 | 24.9 | 6.2
MAX_POOL_2D | 1.99 | 3.27 | 1.28
CONV_2D | 91.08 | 166.29 | 75.21
MAX_POOL_2D | 0.56 | 0.96 | 0.4
CONV_2D | 12.51 | 23.32 | 10.81
SOFTMAX | 0.01 | 0.02 | 0.01
Total time | 126.36 | 220.27 | 93.91


Default kernels @ 288MHz [3]
Layer | 1.14.0 | 2.1.0 | diff (msec)
DEPTHWISE_CONV_2D | 18.77 | 25.19 | 6.42
MAX_POOL_2D | 1.99 | 3.25 | 1.26
CONV_2D | 90.94 | 166.25 | 75.31
MAX_POOL_2D | 0.56 | 0.956 | 0.396
CONV_2D | 12.49 | 23.327 | 10.837
SOFTMAX | 0.01 | 0.02 | 0.01
Total time | 126.27 | 220.503 | 94.233


CMSIS-NN kernels @ 288MHz [4]
Layer | 1.14.0 | 2.1.0 | diff (msec)
DEPTHWISE_CONV_2D | 52 | 55.05 | -3.05
MAX_POOL_2D | 5 | 5.2 | -0.2
CONV_2D | 550 | 576.54 | -26.54
MAX_POOL_2D | 2 | 1.53 | 0.47
CONV_2D | 81 | 84.78 | -3.78
FULLY_CONNECTED | 2 | 2.27 | -0.27
FULLY_CONNECTED | 0 | 0.04 | -0.04
SOFTMAX | 0 | 0.02 | -0.02
Total time | 692 | 725.43 | -33.43


As you can imagine, I’m really struggling with those results, as it seems that the performance in version 2.1.0 is slightly worse, even though more layers now support cmsis-nn.

In version 1.14.0, only the depthwise_conv_2d was implemented with cmsis-nn as you can see here. But in the new 2.1.0 stable version more kernels are implemented as you can see here. Therefore, now conv_2d and fully connected are supported. Nevertheless, the performance seems to be worse…

Initially I thought that I was doing something terribly wrong, therefore I deliberately introduced compiler errors in those files, specifically in the parts where the cmsis-nn functions are used. The compiler did complain, so I was sure that at least those files were being compiled.

I don’t really have a strong opinion on why this is happening yet and I’m going to report it, just to verify that I’m not doing something wrong. My assumption is that I might have to enable an extra flag for the compiler or the code, and because I don’t know which one that might be, the inference uses the default non-optimized kernels.

I’ve also checked that the tensor type in the compressed model is the correct one. I did that by printing the tensor_type in the ConvertTensorType() function in `source/libs/tensorflow/tensorflow/lite/core/api/flatbuffer_conversions.cc`. When the compressed model is loaded I get `kTfLiteFloat16` and `TensorType_INT8` as a result, which means that the weights are indeed compressed. Therefore, I can’t really say why the performance is so slow…

I’ve also tried different options for the mfp16-format in the compiler. I’ve tried these two in the `source/CMakeLists.txt`:


But none of those make any difference whatsoever.

Another issue I’m seeing is that the compressed model doesn’t return valid results. For example, when I’m using the uncompressed model I’m getting this result:

Out[0]: 0.000000
Out[1]: 0.000000
Out[2]: 1.000000
Out[3]: 0.000000
Out[4]: 0.000000
Out[5]: 0.000000
Out[6]: 0.000000
Out[7]: 0.000000
Out[8]: 0.000000
Out[9]: 0.000000

This means that the inference returns a certainty of 100% for the correct digit. But when I’m using the compressed model with the same input I’m getting this output:

Out[0]: 0.093871
Out[1]: 0.100892
Out[2]: 0.099631
Out[3]: 0.106597
Out[4]: 0.099124
Out[5]: 0.096398
Out[6]: 0.099573
Out[7]: 0.101923
Out[8]: 0.103691
Out[9]: 0.098299

Which means that the inference cannot be trusted as there’s no definite result.


There’s something going on here… Either I’m missing some implementation details and I need to add an extra build flag, or there really isn’t any performance gain in the latest tflite-micro version.

I really hope that I’m doing something wrong, because as I’ve mentioned in previous posts, I really like the tflite-micro API, as it can be universal for all ARM MCUs that have a DSP unit. This means that you can write portable code that can be re-used on different MCUs from different vendors. For now it seems that X-CUBE-AI from ST performs far better compared to tflite-micro, but it’s a one-way solution.

Let’s see. I’ve posted this in the issue I had opened for the previous post and I’ll try to figure it out; if there’s any update I’ll post it here.

Have fun!

Using CCM on STM32F303CC


There are many silicon vendors that make MCUs and most of them use the same cores (e.g. ARM Cortex). Therefore, in order to compete with each other, vendors need to make themselves stand out from their competitors, and this is done in many different ways. Of course, the most important one is the price, but sometimes that’s not enough, because even a low price doesn’t mean that the controller fits your project. Therefore, vendors come up with different peripherals, clocks, power saving modes, etc.

Sometimes though, vendors provide some very interesting features in their cores, and in this post I will get down to the Core-Coupled Memory (CCM) that you can find in some STM32 MCUs. In this post I’ll use the STM32F303CC, as I’ve already written a cmake template project for it here, which I use for fast development and testing.


As I’ve said, in this post I’ll use the STM32F303CC and specifically the RobotDyn STM32-MINI (or black-pill) module. Well, don’t get confused, there are many different black-pill modules (some with an STM32F411, which I’ll use in a future stupid project). The one I’m using is this:

This beauty has 256KB flash, 40KB SRAM and 8KB of CCM RAM.

What is CCM?

The STM32F303 reference manual refers to the CCM as:

It is used to execute critical routines or to access data. It can be accessed by the CPU only. No DMA accesses are allowed. This memory can be addressed at maximum system clock frequency without wait state.

You can get a better explanation though in the application note AN4296. I’ll just copy part of the appnote here.

The CCM SRAM is tightly coupled with the Arm® Cortex® core, to execute the code at the maximum system clock frequency without any wait-state penalty. This also brings a significant decrease of the critical task execution time, compared to code execution from Flash memory. The CCM SRAM is typically used for real-time and computation intensive routines, like the following:

  • digital power conversion control loops (switch-mode power supplies, lighting)
  • field-oriented 3-phase motor control
  • real-time DSP (digital signal processing) tasks

When the code is located in CCM SRAM and data stored in the regular SRAM, the Cortex-M4 core is in the optimum Harvard configuration. A dedicated zero-wait-state memory is connected to each of its I-bus and D-bus (see the figures below) and can thus perform at 1.25 DMIPS/MHz, with a deterministic performance of 90 DMIPS in STM32F3 and 213 DMIPS in STM32G4. This also guarantees a minimal latency if the interrupt service routines are placed in the CCM SRAM.

The architecture of the CCM RAM is the following one:

As you can see, the CCM SRAM is connected only to the I-bus (S0 <-> M3) and the D-bus (S1 <-> M3). Since it has zero wait states, it’s the fastest RAM you can use.

Show me the code!

So how to use it then? First you need to clone this cmake repo from here:


This is a cmake project based on this template here and it’s configured to enable the CCM RAM area. By default the CCM RAM is only enabled in the linker file, which is `source/config/LinkerScripts/STM32F303xC/STM32F303VC_FLASH.ld`, but I also had to edit the startup file `source/libs/cmsis/device/startup_stm32f30x.s` to actually be able to use the CCM RAM. In the startup file you’ll find this code:

/* Copy the data segment initializers from flash to SRAM and CCMRAM */
  movs  r1, #0
  b  LoopCopyDataInit

CopyDataInit:
  ldr  r3, =_sidata
  ldr  r3, [r3, r1]
  str  r3, [r0, r1]
  adds  r1, r1, #4

LoopCopyDataInit:
  ldr  r0, =_sdata
  ldr  r3, =_edata
  adds  r2, r0, r1
  cmp  r2, r3
  bcc  CopyDataInit
  movs r1, #0
  b LoopCopyDataInit1

CopyDataInit1:
  ldr r3, =_siccmram
  ldr r3, [r3, r1]
  str r3, [r0, r1]
  adds r1, r1, #4

LoopCopyDataInit1:
  ldr r0, =_sccmram
  ldr r3, =_eccmram
  adds r2, r0, r1
  cmp r2, r3
  bcc CopyDataInit1
  ldr  r2, =_sbss
  b  LoopFillZerobss
/* Zero fill the bss segment. */

Also, in the linker file you can see each memory area and its size:

MEMORY
{
  FLASH (rx)      : ORIGIN = 0x08000000, LENGTH = 256K
  RAM (xrw)       : ORIGIN = 0x20000000, LENGTH = 40K
  MEMORY_B1 (rx)  : ORIGIN = 0x60000000, LENGTH = 0K
  CCMRAM (rw)     : ORIGIN = 0x10000000, LENGTH = 8K
}

As you can see, the SRAM area starts at address 0x20000000 and is 40K, and the CCMRAM starts at 0x10000000 and is 8K. It’s important to remember those addresses when debugging your code, because it will save you a lot of time if you know what you’re looking for and what to expect.

In the linker file I’ve also added an .sram section in order to be able to map functions in the RAM. You can see this here:

/* Initialized data sections goes into RAM, load LMA copy after code */
.data : 
{
  . = ALIGN(4);
  _sdata = .;        /* create a global symbol at data start */
  *(.data)           /* .data sections */
  *(.data*)          /* .data* sections */
  *(.sram)           /* .sram sections */
  *(.sram*)          /* .sram* sections */

  . = ALIGN(4);
  _edata = .;        /* define a global symbol at data end */
} >RAM AT> FLASH

The .sram and .sram* are the sections I’ve added myself. Also, in the same file you can find the .ccmram section here:

.ccmram :
{
  . = ALIGN(4);
  _sccmram = .;       /* create a global symbol at ccmram start */
  *(.ccmram)          /* .ccmram sections */
  *(.ccmram*)         /* .ccmram* sections */

  . = ALIGN(4);
  _eccmram = .;       /* create a global symbol at ccmram end */
} >CCMRAM AT> FLASH

In order to test the CCMRAM you need some reference code that can stress your CPU and RAM, and for that reason I’ve decided to use LZ4. LZ4 is a fast compression library which has a very small footprint, it’s written in pure C so it’s portable, and it has a lot more benefits that are irrelevant for now. From this library I’ll only use one compression function, without decompression or verification, as that doesn’t matter here. Since I only care about testing the performance, evaluating the library’s functionality is not critical for the task.

The LZ4 library is located in `source/libs/lz4` and I’ve written a cmake module which is located in `source/cmake/lz4.cmake`. As you can see it’s only a header and C source file.

In the main.c file the interesting code is the block size and count used for the compression test routine. The block size is just the size of the buffer that the compression routine will process and the block count is actually the number of the blocks that will be processed. There is an enum that defines those numbers:

enum {

The USE_BLOCK_COUNT and USE_BLOCK_SIZE are defined in the build.sh script which passes those variables in the cmake. The default values are:

: ${USE_CCM:="ON"}
: ${USE_SRAM:="OFF"}
: ${USE_BLOCK_COUNT:="512"}
: ${USE_BLOCK_SIZE:="8"}

From the syntax you can probably see that these parameters can be overridden when running the script, therefore you can use any block size and block count. For example, for my tests I’ve used two different block sizes, 8K and 16K, and left the default block count. Therefore, the build script needs to be run like this:

# 8K with 512 counts
# It's the same with:
# 16K with 512 counts

As you can see, I’ve used two different block sizes, 8K and 16K, and the count is 512. That means that the compression routine will process 512*1024*8 = 4MB of data in the first case and 8MB in the second. On the STM32F303CC there isn’t any 4MB or 8MB contiguous storage, which is why I’m using USE_BLOCK_COUNT. You can see what I’ve done in the source code and specifically in the testing function in main.c.

int test_lz4()
{
    LZ4_stream_t lz4Stream_body;
    LZ4_stream_t* lz4Stream = &lz4Stream_body;

    int  inpBufIndex = 0;

    LZ4_initStream(lz4Stream, sizeof (*lz4Stream));

    for(int i=0; i<BLOCK_COUNT; i++) {
        /* always read from the start of the SRAM */
        char* const inpPtr = (char*) ((uint32_t)0x20000000);
        const int inpBytes = BLOCK_SIZE;
        {
            char cmpBuf[LZ4_COMPRESSBOUND(BLOCK_SIZE)];
            const int cmpBytes = LZ4_compress_fast_continue(
                lz4Stream, inpPtr, cmpBuf, inpBytes, sizeof(cmpBuf), 1);
            if(cmpBytes <= 0) {
                break;
            }
        }
        inpBufIndex = (inpBufIndex + 1) % 2;
    }
    return 0;
}

As you can see from the above code, inpPtr points to the start of the SRAM at 0x20000000 and the code compresses the SRAM contents using the given block size, which is 8K or 16K. Remember that the SRAM is 40K, therefore if you try a block size bigger than that the CPU will hang and end up looping in the MemManage_Handler() or HardFault_Handler() exception handler in `source/src/stm32f30x_it.c`. That was actually part of my tests, too, in order to verify that it’s working as expected.

For many people with an embedded Linux background this might seem very strange, but on MCUs it’s fine to have access to the whole memory range and read from it. Some MCUs (including many STM32) have a memory protection unit (MPU) that can disable writes to defined memory areas. Most of the time this is used to protect the stack from growing beyond its limits, but it also has other uses.

Anyway, as you can also see in the previous code, BLOCK_COUNT is actually the bound of a for loop that reads the same SRAM area multiple times, therefore the 4MB and 8MB are not sequential storage but more like a “ring buffer” read of the SRAM.

Finally, the testing routine is called every 1 sec with this code here:

static inline void main_loop(void)
{
    /* 1 ms timer */
    if (glb_tmr_1ms) {
        glb_tmr_1ms = 0;
    }
    /* 1 sec timer: run the test and measure it */
    if (glb_tmr_1000ms >= 1000) {
        glb_tmr_1000ms = 0;
        glb_cntr = 0;
        DBG_PORT->ODR |= DBG_PIN;   /* debug pin high while the test runs */
        test_lz4();
        DBG_PORT->ODR &= ~DBG_PIN;
        TRACE(("lz4: %d\n", glb_cntr));
    }
}

The glb_tmr_* are volatile variables that are incremented every 1ms in the SysTick_Handler() interrupt function in `source/src/stm32f30x_it.c`. As you can see from the function declaration I’ve used the .ccmram attribute in order to place the interrupt handler in the CCMRAM, so it executes faster.

__attribute__((section(".ccmram")))
void SysTick_Handler(void)

Therefore, this is the magic line that you need to add to your functions in order to place them in the CCMRAM area:

__attribute__((section(".ccmram")))

The only thing that you need to make sure of is that the function you want to place in CCMRAM actually fits, but the linker will warn you anyway if it doesn’t.

The same way you can use another attribute to place code in the SRAM:

__attribute__((section(".sram")))

but I’ll get there later in a bit.

The last important thing is the `LZ4_compress_generic()` function, which is called from LZ4_compress_fast_continue() and does the actual compression. It’s located in the `source/libs/lz4/src/lz4.c` file. If you try to place the LZ4_compress_fast_continue() function in the CCMRAM this won’t work as it is larger than 8K, but you also don’t have to, as the `LZ4_compress_generic()` does the work.

The definition of the `LZ4_compress_generic()` function in the original source code here is this one:

LZ4_FORCE_INLINE int LZ4_compress_generic(...

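Do you see this `LZ4_FORCE_INLINE`? We don’t like that. Why? Because inline functions cannot be moved to the CCMRAM or the SRAM! An inlined function doesn’t have its own address, as its code becomes part of the caller. If you just use the following code it won’t work:

__attribute__((section(".ccmram")))
LZ4_FORCE_INLINE int LZ4_compress_generic(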

Instead you can see from the change I’ve made in `source/libs/lz4/src/lz4.c` that in order to be able to move the function to the CCMRAM you need to do this:

#if defined(USE_CCM)
__attribute__((section(".ccmram")))
#elif defined(USE_FLASH)
LZ4_FORCE_INLINE
#elif defined(USE_SRAM)
__attribute__((section(".sram")))
#endif
int LZ4_compress_generic(...

As you can see, the USE_CCM flag controls if the function is placed in the .ccmram area. The USE_FLASH flag controls if it’s placed in the flash but inlined, which is the library’s own optimization that forces the inlining of this critical function. Finally, the USE_SRAM flag places the function into the SRAM. Have in mind that if all flags are disabled, then the behavior is again to place the code in flash, but not inlined in the calling function. That means that the function will have its own address in the flash.

Exciting, isn’t it? OK, so before going to the benchmarks, let’s verify that those USE_* flags are actually working and what the result is. We can verify this in several ways. One is to print the function address in the firmware, which means that we need to build the firmware and then flash it on the target. But there’s a better and more proper way to do this. In Linux you can just use the readelf tool and see the address of any function.

Verifying the build flags and memory areas

Before I proceed with the verification, I’ll list here the memory areas of the STM32F303CC.

Memory   Start         Stop          Size (KB)
FLASH    0x0800 0000   0x0803 FFFF   256
SRAM     0x2000 0000   0x2000 9FFF   40
CCMRAM   0x1000 0000   0x1000 1FFF   8

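Now, let’s first build the code with this command:

USE_OVERCLOCKING=OFF USE_BLOCK_SIZE=8 USE_BLOCK_COUNT=512 USE_CCM=OFF USE_SRAM=OFF USE_FLASH=OFF ./build.sh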

You don’t really need to write all the flags since they do have default values, but I’m doing this here for clarity. This will build the code and create an elf, hex and bin file in the `build-stm32/src/` folder. Now you can use this command to get the LZ4_compress_generic address:

readelf -a build-stm32/src/stm32f303xc-ccm-test.elf | grep LZ4_compress_generic

This will return the following output:

363: 0800158d  3482 FUNC    GLOBAL DEFAULT    2 LZ4_compress_generic

From this response you can see that the function is located in 0x0800 158d, which means it’s located in the flash area. It also means that the function is not inlined, but a regular function with its own address.

Now let’s build with this command here:

USE_FLASH=ON ./build.sh

Again, use readelf to get the function address

readelf -a build-stm32/src/stm32f303xc-ccm-test.elf | grep LZ4_compress_generic

Hmm, it prints nothing! What’s going on? Is that correct? Yes! Why? Because USE_FLASH=ON means that the function is inlined in the `LZ4_compress_fast_continue()` function, therefore you need to run this command:

readelf -a build-stm32/src/stm32f303xc-ccm-test.elf | grep LZ4_compress_fast_

which will print something similar to this:

353: 080015b1 13674 FUNC    GLOBAL DEFAULT    2 LZ4_compress_fast_continu

Which means that this function is in the flash area (0x080015b1) and the LZ4_compress_generic() function is inlined in that function. This is why you don’t get an address for the LZ4_compress_generic(). Does it make sense now? OK, let’s see the next example, now try this command to build the firmware:

USE_SRAM=ON ./build.sh

Now check again the elf:

readelf -a build-stm32/src/stm32f303xc-ccm-test.elf | grep LZ4_compress_generic

This will print:

204: 08001999     8 FUNC    LOCAL  DEFAULT    2 __LZ4_compress_generic_ve
369: 200000ed  3482 FUNC    GLOBAL DEFAULT    6 LZ4_compress_generic

Now you see that the function address is located in 0x2000 00ed, therefore it’s located in the SRAM. That means that the flag works properly.

Now, test with this command:

USE_CCM=ON ./build.sh

Now check the elf file again:

readelf -a build-stm32/src/stm32f303xc-ccm-test.elf | grep LZ4_compress_generic

This will print:

204: 08001999     8 FUNC    LOCAL  DEFAULT    2 __LZ4_compress_generic_ve
369: 10000029  3482 FUNC    GLOBAL DEFAULT    7 LZ4_compress_generic

Now you see that the function is placed in the CCMRAM in 0x10000029. So, it works!
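
If you want to double-check the readelf addresses against the memory table without comparing hex numbers in your head, a small helper can do it. This is only a sketch: `region_of()` and the `region_t` enum are hypothetical names that are not part of the repo, and the ranges are copied from the memory table above. Note that readelf reports Thumb function symbols with bit 0 set, which is why all the addresses above are odd.

```c
#include <stdint.h>

/* Memory regions of the STM32F303CC, as listed in the table above. */
typedef enum {
    REGION_UNKNOWN,
    REGION_FLASH,
    REGION_SRAM,
    REGION_CCMRAM
} region_t;

/* Hypothetical helper: returns the region that a symbol address falls in. */
region_t region_of(uint32_t addr)
{
    /* Clear the Thumb bit to get the real address of the function. */
    addr &= ~(uint32_t)1;

    if (addr >= 0x08000000u && addr <= 0x0803FFFFu) return REGION_FLASH;
    if (addr >= 0x20000000u && addr <= 0x20009FFFu) return REGION_SRAM;
    if (addr >= 0x10000000u && addr <= 0x10001FFFu) return REGION_CCMRAM;
    return REGION_UNKNOWN;
}
```

For example, `region_of(0x200000ed)` returns `REGION_SRAM` and `region_of(0x10000029)` returns `REGION_CCMRAM`, which matches the two outputs above.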

Some of you may wonder what this `__LZ4_compress_generic_ve` symbol is (readelf truncates long names), which is printed when the function is placed in the SRAM or CCMRAM, and why it has an address in the flash. Judging by its name and its 8-byte size, it’s most likely a veneer: a tiny trampoline that the linker adds in flash, because the RAM address is too far away for a direct branch from the code in the flash. The more interesting question is how the function gets into the RAM in the first place. The only non-volatile storage on the MCU is the flash. SRAM and CCMRAM are volatile, which means that when the power is removed all data is gone. So how does the function end up in the SRAM or CCMRAM when you supply the MCU with power? Well, this is what the startup code does. It takes the functions that need to be in the RAM and copies their code from the flash in there. All the addresses are static, so the startup code just copies from and to pre-defined addresses. These addresses are set by the linker when you build the firmware, as the linker knows exactly what memory is available.
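
The copy step can be sketched in plain C. This is not the startup code from the repo, just a host-side illustration of the idea; `copy_section()` and its parameter names are hypothetical, and on the real target the three addresses would come from symbols defined in the linker script.

```c
#include <stdint.h>

/*
 * Copy a section word by word from its load address (in flash) to its
 * run address (in RAM). run_end points one word past the last word of
 * the section in RAM.
 */
void copy_section(const uint32_t *load_addr,
                  uint32_t *run_start,
                  const uint32_t *run_end)
{
    uint32_t *dst = run_start;

    while (dst < run_end) {
        *dst++ = *load_addr++;
    }
}
```

A startup routine would typically call something like this once per RAM section before jumping to main(), so by the time your code runs the function body is already sitting at its SRAM or CCMRAM address.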

Compilers and linkers are really interesting things, but I won’t spend more time on them now. Also, I’m not an expert on the subject (not even close). Therefore, I hope that at least it’s clear how things are put together so far and how these functions are placed from flash into different memory areas.

Build command

Before continuing with the benchmarks, let’s have a look at the build command. The syntax format is the following.


And this is the explanation of all the flags:

  • `USE_OVERCLOCKING`, ON: enable overclocking at 128MHz, OFF: 72MHz
  • `USE_BLOCK_SIZE`, number of bytes used as the block size. Default: 8, which means 8K
  • `USE_BLOCK_COUNT`, number of blocks used for the compression. Default: 512.
  • `USE_CCM`, ON: move compression function to CCMRAM
  • `USE_SRAM`, ON: move compression function to SRAM
  • `USE_FLASH`, ON: keep the compression function in FLASH and force it inline

These are some notes for the parameters:

  • Only one of the `USE_CCM`, `USE_SRAM`, `USE_FLASH` can be `ON`.
  • The processed size will be `USE_BLOCK_SIZE`*`USE_BLOCK_COUNT` (in KB)
  • The default processed size is 4MB (8K * 512)
  • The `USE_BLOCK_SIZE` cannot be larger than 20KB
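
The notes above can be written down as a small sanity check. This is just a sketch of the rules and not something that exists in build.sh or the firmware; the `build_cfg_t` struct and both function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the build flags described above. */
typedef struct {
    bool use_ccm;            /* USE_CCM */
    bool use_sram;           /* USE_SRAM */
    bool use_flash;          /* USE_FLASH */
    uint32_t block_size_kb;  /* USE_BLOCK_SIZE */
    uint32_t block_count;    /* USE_BLOCK_COUNT */
} build_cfg_t;

/* At most one placement flag may be ON and the block must fit in 20KB. */
bool cfg_is_valid(const build_cfg_t *cfg)
{
    int placements = (int)cfg->use_ccm + (int)cfg->use_sram + (int)cfg->use_flash;

    return placements <= 1 &&
           cfg->block_size_kb >= 1 &&
           cfg->block_size_kb <= 20;
}

/* Total amount of data that goes through the compressor, in KB. */
uint32_t cfg_processed_kb(const build_cfg_t *cfg)
{
    return cfg->block_size_kb * cfg->block_count;
}
```

With the defaults (block size 8, block count 512, all placement flags OFF) the configuration is valid and the processed size is 4096KB, which is the 4MB mentioned above.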

As I’ve mentioned, all the parameters already have default values, therefore you don’t have to write those long commands. You can change the default values in the build.sh script instead.

Using Docker

Instead of setting up a build environment, if you have docker you can use my CDE image to build the code without much hassle. Just clone the code like this:

cd ~/Downloads
git clone https://dimtass@bitbucket.org/dimtass/stm32f303-ccmram-test.git
cd stm32f303-ccmram-test

And then to build the CCM example, run this command:

docker run --rm -it -v $(pwd):/tmp -w=/tmp dimtass/stm32-cde-image:0.1 -c "USE_CCM=ON ./build.sh"

You can use any of the build commands I’ll mention in the next section by just placing them in the double quotes after the -c in the docker command.


Some benchmarks at last! Well, that’s always my favorite part and it always takes some time to get here, as it wouldn’t be beneficial for others if I didn’t explain how I got to this point. So, now that we verified that the flags are working, it’s time to start benchmarking. To make it even better I’ll benchmark the compression code at the maximum default MCU core frequency, which is 72MHz, and also when it’s overclocked at 128MHz.

To do this I’ve built the code with various flag combinations, then flashed it on the target and waited for the UART output to get the time in msec. I’ve also used a GPIO that toggles to verify that the printed time is valid and I can say for sure that it is. Therefore, this is the list of the commands I’ve used.

Flash benchmarks (non-inline function)





FLASH benchmarks (inline function)





SRAM benchmarks





CCMRAM benchmarks





Finally, this is a table with all the results. The table shows the execution time of the test_lz4() function and all numbers are in milliseconds. Therefore, the smaller the number the faster was the execution.

Block size @ clock   Flash (non-inline)   Flash (inline)   SRAM   CCMRAM
8K @72MHz            279                  304              251    172
8K @128MHz           156                  171              141    97
16K @72MHz           466                  631              496    340
16K @128MHz          262                  355              278    191

There are so many interesting things in this table!

  1. It’s clear who the winner is. CCM is faster than any other memory.
  2. SRAM doesn’t seem much faster compared to Flash, can you see this?
  3. Forcing the inline (LZ4_FORCE_INLINE) actually makes things worse for both block sizes! The compiler optimizations do a better job on their own, but the inline is forced by the library itself. Therefore, you need to actually remove it to gain more performance! Awesome finding.
  4. When the block size is 16K, the FLASH code (non-inline) is faster than the SRAM!

OK, so now let’s see how much faster the CCMRAM is in this case.

Block size @ clock   vs Flash (non-inline)   vs Flash (inline)   vs SRAM
8K @72MHz            47.45%                  55.46%              37.35%
8K @128MHz           46.64%                  55.22%              36.97%
16K @72MHz           31.26%                  59.93%              37.32%
16K @128MHz          31.34%                  60.07%              37.10%

As you can see from this table the CCM RAM is faster by 31% up to 60% and that’s a huge gain! Therefore, CCM is, as advertised, the fastest RAM that you can use on the STM32F303CC. It’s only a shame that it’s only 8K 🙁
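
The percentages in the table are consistent with the symmetric percentage difference, which is the difference of the two times divided by their average. That’s an observation from the numbers and not something spelled out in the text, so take the following as a sketch under that assumption; `percent_diff()` is a hypothetical helper.

```c
/*
 * Symmetric percentage difference between two execution times:
 * |a - b| divided by the average of a and b, times 100.
 */
double percent_diff(double a, double b)
{
    double diff = (a > b) ? (a - b) : (b - a);
    double mean = (a + b) / 2.0;

    return diff / mean * 100.0;
}
```

For the 8K @72MHz row, `percent_diff(279, 172)` gives about 47.45, `percent_diff(304, 172)` about 55.46 and `percent_diff(251, 172)` about 37.35, which are the three values of the first row.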


This stupid project was really fun. I’ve spotted by chance this CCM RAM in the datasheet and I thought, meh, let’s try it. I was expecting that it would be a bit faster, but I didn’t expect that the difference would be that great. 31% faster is a lot of performance gain, you can’t ignore this, especially in time critical code.

To be honest, I didn’t expect that the flash would be faster than the RAM, but I have a theory for this. My theory is that this happens because I’m using the RAM as an input to the compression function and when the block size is 16KB -which is almost all the RAM- then it seems that this slows down the R/W. It seems that in this case the CPU performs better when executing less code from the RAM. That’s my theory, but it doesn’t mean that it’s right. But in any case, with large blocks the STM32 performs better if the function is executed from the flash.

Finally, it seems that the LZ4_FORCE_INLINE in LZ4_compress_generic() makes performance worse, and the GCC compiler with the compiler and linker flags I’ve used does a better job on its own.

After this, I’ve also updated my cmake template for the STM32F303CC, so I’m able to use the __attribute__ directive for both .ccmram and .sram areas and place functions in there.

I hope you enjoyed this stupid project.

Have fun!

Using NXP SDK with Teensy 4.0


There’s a new Teensy in town. It’s the new Teensy 4.0 and it’s a small beast. Well, as you probably already know, Teensy boards are not only famous because of their nice small form factor and their Arduino compatibility, but also because of the ecosystem around them. There are so many tools and libraries for them that you can implement complex projects in a very short time.

Currently, you can use Teensy 4.0 only with the Arduino library and environment. That’s great of course, but if you follow this blog for some time, then you know me already. I like the low level peripheral libraries and CMSIS when I do my stupid projects.

Anyway, a couple of days ago I’ve ordered my Teensy 4.0 from PJRC and it actually arrived very fast, but I just found time today to play with it. First thing was to test that it works, so I’ve connected it and I’ve seen the LED flashing. Then followed the instructions and installed Arduino IDE and the support tools for Teensy and a few minutes later I verified that also the USB-CDC works and I could get messages on my terminal.

I also have Teensy 3.1, 3.5 and 3.6 and as a matter of fact, I’ve used Teensy 3.2 in this project here with the MPU-6050 in order to control a 3D object inside Unity3D. But that project was done using the Arduino library and it was implemented fast as the USB raw lib works stable on Teensy.

Anyway, back to Teensy 4.0, the next step was to use PlatformIO and test it there too, so I’ve connected the PCM5102A board I’ve also used in this post and in just a few minutes I verified that I was getting a sine signal on the DAC’s output. All great.

So, final thing was to find what SDK NXP provides for imxrt1062 and try to build an example.


After some searching I’ve found that the MCU that Teensy uses doesn’t have an internal flash but an external SPI NOR. Then I started wondering what’s going on, because I wasn’t familiar with the new imxrt1062 MCU (yep, it’s common for me to just order dev boards without RTFM first). Then a friend texted me and I told him what I was doing and he got triggered also and started looking at the Teensy 4.0, then at some point he told me that there was another Cortex-M0 on the board and later at some point I also had a look at the schematics and all became clear. BTW kudos for having the schematics open, that’s really cool.

So at that point, I knew that the bootloader is hardcoded and actually running on the external Cortex-M0 and that the bootloader uploads the firmware to the external NOR. Then I’ve looked in the source code of the Arduino core for teensy in github and I’ve seen that there is a custom bootdata.c file instead of the startup files that I’m used to for ARM Cortex-M. Yes, of course I know that you can write your own startup code in C (or asm), but so far I was always using some CMSIS peripheral libraries that come with an asm startup file, which sets up the IVT in the RAM and then passes execution to main.

So, I’ve decided to download the SDK for the imxrt1062 from here. Then I got interrupted (highest priority) from my 1.5 year old son and stopped all activities for several hours… When I got back at night, I’ve extracted the SDK and found out that actually there are startup files and the SDK is very similar to the Standard peripheral drivers library from STM, which was a good sign. So the next question was how to build a firmware that works on the Teensy using the SDK.

During that time I’ve asked a question on the PJRC forum and received the answer that there isn’t such a thing. That triggered a challenge inside me, of course, but on the other hand I don’t have much free time, so my initial thought was just to leave it be. Then Paul Stoffregen came and did a very brief description of the architecture and he gave me the two hints that actually made everything clear. The first was that the bootloader doesn’t do any magic (e.g. encryption, setting up the environment, etc.) and the second that the firmware needs to contain a 512-byte header with the boot data sector and the IVT.

When I’ve read that I smiled, because since the bootloader doesn’t do any exotic things to bring up the CPU, it means that the CPU starts from reset state and starts reading from the NOR. At that point I didn’t care for any other details, because that meant that things were much easier than I initially thought and I thought let’s try to build any firmware using the SDK and upload it to the NOR using the Teensy bootloader. Later, Paul also verified my assumption.

Therefore in the next part I’ll explain how to use the NXP SDK and CMSIS to build a firmware and upload it on the Teensy 4.0.


The following guide is tested on Linux (Ubuntu 18.04), but I’ve decided to use my CDE Docker image that I’ve introduced in the DevOps for Embedded post series. This makes things easier also for those who run Windows, but I haven’t tested that on Windows so I don’t know if it works.

Therefore, you need at least to install Docker, or use an Ubuntu VM if you don’t have Linux and the docker image doesn’t work on your Windows machine (I can’t think why that would happen, but anyway).

Build the SDK examples

First you need to download the SDK for imxrt1062 from here. Since there’s no direct link you need to browse in Processors -> i.MX -> RT -> MIMXRT1060 -> MIMXRT1062xxxxA and when it comes to select which packages you want in your SDK then just select all. After the SDK is downloaded, extract it to any folder you want. Here I’ll assume that the folder you’re using will be the `~/Downloads/SDK_2.7.0_MIMXRT1062xxxxA`. Therefore inside this folder you should see this directory tree:


Now cd to this directory:

cd ~/Downloads/SDK_2.7.0_MIMXRT1062xxxxA/boards/evkmimxrt1060/driver_examples/gpio/led_output/armgcc

This is the hello-world example code that simply toggles a gpio. If that works, then everything else should work, because that means that the CPU and the needed peripherals will get configured and since the SDK libs are based on CMSIS then all I need is working.

Here you need to have in mind that this SDK is based on a development board from NXP. Therefore the BSP is tailored for the MIMXRT1060-EVK board. This means that you need to take care of any differences between the Teensy pinout and the EVK for the example codes. For this reason you need to change the EXAMPLE_LED_GPIO_PIN definition in the code in the `gpio_led_output.c` file.

The pinout for Teensy 4.0 is here. From this table you can see that GPIO1.25 is the pin 23 on the Teensy board therefore in the `gpio_led_output.c` file you need to make this change:
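
A sketch of what that change could look like, assuming the example keeps using GPIO1 as the port and only the pin number has to change. The value 25 comes from the GPIO1.25 mapping mentioned above; `led_pin_mask()` is a hypothetical helper that is only there to show the resulting bit and it’s not part of the SDK example.

```c
#include <stdint.h>

/* Assumed replacement value: GPIO1.25 is pin 23 on the Teensy 4.0. */
#define EXAMPLE_LED_GPIO_PIN (25U)

/* Hypothetical helper: bit mask of the LED pin inside the GPIO1 data register. */
uint32_t led_pin_mask(void)
{
    return (uint32_t)1U << EXAMPLE_LED_GPIO_PIN;
}
```

The mask for pin 25 is 0x02000000, which is handy to know if you want to watch the GPIO1 data register in a debugger.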


That means that now the “LED” pin (from the code’s perspective) will be the pin 23. I didn’t use the actual Teensy LED pin because I wanted to verify the result with the oscilloscope. After you do this change, you’re actually done… Now you need to build the project.

The good thing is that NXP is using CMake! Thanks NXP. ST please keep notes. On the other hand the cmake files are tailored for the specific SDK examples, but nevertheless it’s better than nothing. I won’t go into the details on how cmake works, but in order to build the firmware you need to trigger the build inside the armgcc folder. Let me repeat the full path, just in case:

~/Downloads/SDK_2.7.0_MIMXRT1062xxxxA/boards/evkmimxrt1060/driver_examples/gpio/led_output/armgcc

Now assuming that you have Docker installed, open a terminal in your host, change to the armgcc folder and run this fugly command:

docker run --rm -it -v /home/$(whoami)/Downloads/SDK_2.7.0_MIMXRT1062xxxxA:/tmp -w=/tmp/boards/evkmimxrt1060/driver_examples/gpio/led_output/armgcc dimtass/stm32-cde-image:0.1 -c "ARMGCC_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major/ ./build_all.sh"

Yes, I know. That’s a monstrous one-liner, but if I explain what it does you’ll see that it’s quite simple. The above command will run a container using the `dimtass/stm32-cde-image:0.1` image. It will then mount the SDK top dir folder in the container’s /tmp folder and it will change the working dir to the armgcc folder that you currently are. Then inside the container will just run this command:

ARMGCC_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major/ ./build_all.sh

You can make a bash script to avoid using this long command. Anyway, this will build the cmake project and you’ll get this output in the end:

[ 95%] Building C object CMakeFiles/igpio_led_output.elf.dir/tmp/devices/MIMXRT1062/utilities/fsl_sbrk.c.obj
[100%] Linking C executable flexspi_nor_release/igpio_led_output.elf
[100%] Built target igpio_led_output.elf

Then you’ll see some new folders and one of them is called `flexspi_nor_release`. Inside this folder you’ll find the `igpio_led_output.elf` firmware. Last step is to convert the elf executable to a HEX file. You can use docker again for this by running another fugly command:

docker run --rm -it -v /home/$(whoami)/Downloads/SDK_2.7.0_MIMXRT1062xxxxA:/tmp -w=/tmp/boards/evkmimxrt1060/driver_examples/gpio/led_output/armgcc dimtass/stm32-cde-image:0.1 -c "/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major/bin/arm-none-eabi-objcopy -O ihex flexspi_nor_release/igpio_led_output.elf flexspi_nor_release/igpio_led_output.hex"

The above command will just execute this command inside the docker container:

/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major/bin/arm-none-eabi-objcopy -O ihex flexspi_nor_release/igpio_led_output.elf flexspi_nor_release/igpio_led_output.hex

That’s just the objcopy of the docker image gcc toolchain that converts the elf file to hex. That’s it! You can also add another target in the cmake to do this automatically, but for now it doesn’t matter.

Now, you need to either run the Teensy GUI and load that hex file or use `teensy_loader_cli`. In my case I’ve used the CLI like this:

teensy_loader_cli -v -w --mcu=imxrt1062 flexspi_nor_release/igpio_led_output.hex

Then connect the oscilloscope probe on the pin 23, connect the USB cable from your host to Teensy while holding the reset button on the board and then release the button. If all goes properly, you should see this output:

Teensy Loader, Command Line, Version 2.1
Read "flexspi_nor_release/igpio_led_output.hex": 9496 bytes, 0.5% usage
Waiting for Teensy device...
 (hint: press the reset button)
Found HalfKay Bootloader
Read "flexspi_nor_release/igpio_led_output.hex": 9496 bytes, 0.5% usage

Also on the oscilloscope you should see this

Now, just to be sure that code changes also work, in the gpio_led_output.c file change the time constant value in the SDK_DelayAtLeastUs() and remove one zero, so it becomes:


Now, re-build and re-flash and you should see something like this:

OK, now you’re sure that it works.


So, in this post I’ve explained how you can use the SDK peripheral library (which is based on the CMSIS) to build a firmware that runs on Teensy 4.0. The good thing is that it seems to be working fine, but I haven’t checked any examples other than the gpio. The next thing for me is to create a cmake template like those I have for the various STM32 MCUs and that I’m using in other projects that I post here. To do that I’ll use the current CMake files as a base, but pretty much I’ll have to re-write most of it.

The reason I’m not using the current cmake project from the SDK is that it’s based on another dev-kit, also the current cmake files target the current SDK’s file hierarchy and libraries and of course the size of the current SDK is too big to use as a template. So I’ll strip this down to a minimal template for my future projects. Then I need to think about another stupid project. Anyway, I hope that more people find this useful.

Have fun!

Biquad audio DSP filters using STM32F303CC (black-pill)


Filters! I used to love analog filters. I mean, I still like them; but it’s been many years since I’ve designed and used any advanced passive or active analog filters. I remember spending hours doing filter design using MathCad, plotting the Bode graphs and trying to trim the frequencies. Then I was implementing those filters using mostly opamps and trimming the component values and running tests to verify the filter. Well, at that time it was fun, now I’m experienced enough to know that this amount of detail in designing a simple filter is useless for 90% of real-life cases. At least for me. Most of the time a rough estimation done on a napkin is more than enough.

Of course, there are cases where filter accuracy is critical, but not in what I’m doing anyway. Now, even just a resistor and a capacitor are enough for filtering annoying signals or noise from a path, and accuracy hardly matters there. Nevertheless, I’ve enjoyed that era and I’ve quite missed it.

Also, back then filtering was done using analog parts and only some advanced DSP chips were starting to do real-time filtering in the digital domain and other complex funky stuff. Later on CPUs also got faster and advanced DSP started to become a standard thing for mainstream desktop computers. Then the MCUs also got faster and real-time DSP became possible on fast MCUs. Nowadays, you can pretty much use any MCU to implement basic filters and when it gets to ARM Cortex cores, there are even DSP CMSIS libraries that can handle the maths using even dedicated FPUs. Well, actually the STM32F303CC (aka black-pill) is one of them.

A few years back, in 2016, I was reading a book where the author was using the digital biquad filter topology to implement various different filters and I liked this approach so much, that I’ve ported those filters to C++ code. You can find that repo here:


Lately, this repo got my attention again, because I’ve seen that many people starred it and forked it and I was thinking, “man… I’ve done so much advanced stuff, but this simple weekend project got so much attention”. It seems that DSP and audio is a very hot domain for many people. Although I’m a musician myself, I’m a bit old school and I don’t use any digital effects or filters, so everything on my audio path is mostly analog. Nevertheless, DSP is definitely a huge domain with a lot of interesting stuff and filtering is only a small area of this vast domain.

Therefore, I thought to port those DSP filters to C and use an ARM Cortex-M4 to test them in real time. And thus this stupid project was born.


Those are the components and equipment I’ll use in this project.


There are many “black-pills” out there, but I’m referring to my favorite RobotDyn STM32 mini.

This board comes with an STM32F303CCT6 running at 72MHz (also I’ve tested it overclocked at 128MHz), 256KB ROM, 40KB RAM and plenty of timers and peripherals. In this project I’ll use a timer, an ADC and a DAC, but more on that later.


Of course, in order to test a filter you need an input signal. I have this arbitrary signal generator for this case, but you can use any other generator you like, as long as it’s able to create signals that are in the supported range of the STM32 (0-3.3V)


You also need an oscilloscope in order to verify the signal output from the DAC. Usually, I’m using my little TDS200 for testing that stuff, but for now I’ve used my Rigol 1054z in order to capture the screenshots you’ll see in the post. Since you have an input and an output, you’ll need two channels. Because I’ll only use audio frequencies, even a basic oscilloscope is more than enough for this purpose.

Filter theory

Oh, no, no, no, I’m joking, I won’t get into this. That’s a huge domain and there are so many books, web posts and youtube videos out there that explain those things far better than I could ever explain. Therefore, in this case you need to be already familiar with what is a low-pass filter (LPF), high-pass filter (HPF), all-pass filter (APF) and band-pass filter (BPF). Well, even the names are quite self-explanatory.

Now regarding the biquad filter, I can say that it is a generic form of a digital IIR filter. What it actually does is that it sums products of coefficients and sample values of the input and the output. It’s just maths, but all the magic happens in the calculation of those coefficients which control the behavior of the biquad filter. Actually, those coefficients control the two poles and the two zeros of the filter transfer function, therefore they control the type of filter you can implement. Since biquads have two poles and two zeros, they can implement first order and second order filters.


In order to test those filters with the STM32F303 we need one ADC to sample the output signal from the generator (which is the input for the STM32), then process the samples and finally convert the result sample to an analog signal using a DAC. That’s quite an easy thing to do with STM32, but what is important here is the sampling rate. Therefore, we need to sync all those stages and drive the sequence using a standard sample rate. Since the STM32 is quite fast, I’ll use a sample rate of 96000 samples/sec. For audio this is high fidelity, as it’s 4.8x the 20KHz audio frequency range. Well, for my ears it’s almost 8x, as I’ve lost my upper octave in an accident during my military service, lol. Yeah, army sucks in so many different ways…

To drive the sequence with 96KHz we need a timer. So in this case, I’ll use TIM1 to trigger the ADC, then a DMA channel to copy the ADC reading as fast as possible to the RAM, then apply the filter and then pass the sample to the DAC. So this is a simple diagram of the setup.
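
The timer period for a given sample rate is just a division, but it’s worth checking that it comes out as a whole number of timer ticks. A minimal sketch, assuming the timer is clocked at the same frequency as the core (that depends on the clock tree setup, which isn’t shown here); `timer_period_ticks()` is a hypothetical helper and not code from the repo.

```c
#include <stdint.h>

/*
 * Number of timer ticks between two samples.
 * Returns 0 if the sample rate doesn't divide the timer clock evenly,
 * or if the sample rate is 0.
 */
uint32_t timer_period_ticks(uint32_t timer_clk_hz, uint32_t sample_rate_hz)
{
    if (sample_rate_hz == 0u || (timer_clk_hz % sample_rate_hz) != 0u) {
        return 0u;
    }
    return timer_clk_hz / sample_rate_hz;
}
```

With a 72MHz timer clock, 96000 samples/sec gives exactly 750 ticks. With a 128MHz clock the division isn’t exact (about 1333.3), so the helper returns 0 and the real sample rate would be slightly off. Also keep in mind that on STM32 timers the auto-reload register is normally set to the period minus one.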

In the above diagram we see that there’s a function generator which is used to provide the input signal, in this case just a sinusoidal signal. Next there’s an optional anti-aliasing filter, which I’m not using in my tests. In this case the anti-aliasing filter doesn’t make sense, because the SDG1025 generator outputs a clean sine, but normally you would need that in order to filter frequencies over 20KHz, so their mirror images are not shown in the 20Hz-20KHz range that we care about.

Then it’s the MCU that uses the ADC to sample the input signal, then the DSP software algorithm (in this case the filters) and then the DAC that outputs the processed signal. Also, there’s a timer that triggers the ADC with a frequency equal to the sample rate we need. Then in the output, after the MCU, there’s an optional reconstruction filter, which again is a low pass filter that filters all frequencies above 20KHz. I’ll not use this filter in this test, because I like to see the DAC quantized signal, as I’m also altering the sampling rate during my tests. Finally, there’s the oscilloscope that displays the output.

As you can guess, it’s expected to have a phase delay between the input and the output (ADC to DAC) as it needs time to sample, process and convert the input. From my tests this phase shift is around 25 degrees as you can see later in the screenshots, which is just a few micro-seconds.
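
To get a feeling for what a phase shift means in time, divide it by 360 and by the signal frequency. This is a small sketch and `phase_delay_us()` is a hypothetical helper; the 10KHz used in the note below is just an assumed test frequency, not a value taken from the screenshots.

```c
/* Converts a phase shift in degrees at a given frequency to microseconds. */
double phase_delay_us(double phase_deg, double freq_hz)
{
    return (phase_deg / 360.0) / freq_hz * 1e6;
}
```

For example, 25 degrees at 10KHz is roughly 6.9 microseconds, which is less than one sample period at 96000 samples/sec (about 10.4 microseconds).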

This is the setup on my desk.

Code explanation

I’ve written a small cmake project for the stm32f303cc that implements all the above things and you can find the code here:


To clone the repo locally run:

git clone --recursive https://dimtass@bitbucket.org/dimtass/stm32f303-adc-dac-dsp.git

The supported filters in the code are:

  • First order all-pass filter (fo_apf)
  • First order high-pass filter (fo_hpf)
  • First order low-pass filter (fo_lpf)
  • First order high-shelving filter (fo_shelving_high)
  • First order low-shelving filter (fo_shelving_low)
  • Second order all-pass filter (so_apf)
  • Second order band-pass filter (so_bpf)
  • Second order band-stop filter (so_bsf)
  • Second order Butterworth band-pass filter (so_butterworth_bpf)
  • Second order Butterworth band-stop filter (so_butterworth_bsf)
  • Second order Butterworth high-pass filter (so_butterworth_hpf)
  • Second order Butterworth low-pass filter (so_butterworth_lpf)
  • Second order high-pass filter (so_hpf)
  • Second order Linkwitz-Riley high-pass filter (so_linkwitz_riley_hpf)
  • Second order Linkwitz-Riley low-pass filter (so_linkwitz_riley_lpf)
  • Second order Low-pass filter (so_lpf)
  • Second order parametric/peaking boost filter with constant-Q (so_parametric_cq_boost)
  • Second order parametric/peaking cut filter with constant-Q (so_parametric_cq_cut)
  • Second order parametric/peaking filter with non-constant-Q (so_parametric_ncq)

All the filters are based on the standard digital biquad filter (DBF), which is displayed here:

The mathematical formula for the DBF is the following:

y(n) = a0*x(n) + a1*x(n-1) + a2*x(n-2) - b1*y(n-1) - b2*y(n-2)
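As a minimal sketch (not the repo’s code), the difference equation can be written in plain C as below. The struct and function names are hypothetical; only the coefficient naming (a0, a1, a2 feed-forward, b1, b2 feedback) follows the formula.

```c
#include <assert.h>
#include <stddef.h>

/* Coefficients follow the formula: a* are feed-forward, b* are feedback. */
typedef struct {
    float a0, a1, a2, b1, b2;
} biquad_coeffs_t;

/* Delay line: the two previous inputs and the two previous outputs. */
typedef struct {
    float xz1, xz2, yz1, yz2;
} biquad_state_t;

/* Process one sample and update the delay line. */
float biquad_process(const biquad_coeffs_t *c, biquad_state_t *s, float xn)
{
    assert(c != NULL && s != NULL);
    float yn = c->a0 * xn + c->a1 * s->xz1 + c->a2 * s->xz2
             - c->b1 * s->yz1 - c->b2 * s->yz2;
    s->xz2 = s->xz1;
    s->xz1 = xn;
    s->yz2 = s->yz1;
    s->yz1 = yn;
    return yn;
}
```

Keeping the state in a struct, instead of file-scope variables, makes it possible to run several independent instances of the same filter type.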

Now, back to the repo: you can find this formula in source/libs/filter_lib/filter_common.h as a macro. Yes, that’s right, it’s a macro. I know that many people don’t like them, but it’s fine to use macros if you know what you’re doing and it also keeps the code DRY (Don’t Repeat Yourself). In my C++ code for those DSP filters, for example, I don’t use any macros, as classes make things much better. The macro is this one:

#define BIQUAD (m_coeffs.a0*xn + m_coeffs.a1*m_xnz1 + m_coeffs.a2*m_xnz2 - m_coeffs.b1*m_ynz1 - m_coeffs.b2*m_ynz2)

Well, although the README.md file in the repo is quite thorough, I’ll repeat most of the things I’ve written there in here, too. I’ve added an array of function pointers in order to be able to apply multiple filters on each sample. The code is in the source/main.c file and there you’ll find these lines:

#define NUM_OF_FILTERS 5
F_SIZE (*filter_p[NUM_OF_FILTERS])(F_SIZE sample);

The default array size is 5, which is more than enough, but you can increase it if you like. The reason for this array is to create more complex filters by stacking other filters. For example, the default filter in the repo is a Butterworth band-pass filter (BPF) composed of a high-pass filter (HPF) with a corner frequency of 5KHz and a low-pass filter (LPF) with a corner frequency of 10KHz. Therefore, the filter bandwidth is 5KHz. This is the code in main.c:

/* Set your filter here: */
so_butterworth_lpf_calculate_coeffs(10000, SAMPLE_RATE);
so_butterworth_hpf_calculate_coeffs(5000, SAMPLE_RATE);
filter_p[0] = &so_butterworth_hpf_filter;
filter_p[1] = &so_butterworth_lpf_filter;

In the above code you see that first the filters are initialized by calculating the coefficients, then an offset of 2048 is added to the HPF and then I’ve added the HPF filter in the first slot of the array of filters and then the LPF in the second. The filter processing on the sample happens in the DMA interrupt.

void DMA1_Channel1_IRQHandler(void)
{
    /* Test on DMA1 Channel1 Transfer Complete interrupt */
    if (DMA_GetITStatus(DMA1_IT_TC1)) {
        io.sample_ready = 1;
        io.dac_sample = io.adc_sample;
        /* Run the sample through every filter slot that is set */
        for (int i=0; i<NUM_OF_FILTERS; i++) {
            if (filter_p[i])
                io.dac_sample = filter_p[i](io.dac_sample);
        }
        DAC_SetChannel1Data(DAC1, DAC_Align_12b_R, io.dac_sample);
        irq_count++;

        /* Clear DMA1 Channel1 Half Transfer, Transfer Complete and Global interrupt pending bits */
        DMA_ClearITPendingBit(DMA1_IT_GL1);
    }
}

Also this function increments a counter (irq_count) on every interrupt; every second the main_loop() function prints this counter to the UART output and then resets it. This means that if the sampling frequency is 96000, then in your COM port terminal (I’m using CuteCom) you should see 96000 printed every second. If not, then there’s a problem somewhere in the code path or the sampling rate is too high. You can change the sampling rate by setting the `SAMPLE_RATE` you want in main.c:

#define SAMPLE_RATE 96000
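The filter chain described earlier, where each sample passes through every slot of filter_p that is set, can be tried on a host PC with a sketch like the following. The two stand-in filters and the set_filter() helper are hypothetical, and F_SIZE is assumed here to be a float type.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_OF_FILTERS 5
typedef float F_SIZE;   /* assumption: F_SIZE is a float type */

/* Hypothetical stand-in filters, only to make the chain observable. */
F_SIZE halve(F_SIZE sample)   { return sample * 0.5f; }
F_SIZE add_one(F_SIZE sample) { return sample + 1.0f; }

/* File-scope array, so unused slots start as NULL and are skipped. */
F_SIZE (*filter_p[NUM_OF_FILTERS])(F_SIZE sample);

/* Hypothetical helper: install a filter in a slot, with a bounds check. */
void set_filter(size_t slot, F_SIZE (*fn)(F_SIZE))
{
    assert(slot < NUM_OF_FILTERS);
    filter_p[slot] = fn;
}

/* Feed one sample through all installed filters, in slot order. */
F_SIZE run_filter_chain(F_SIZE sample)
{
    for (size_t i = 0; i < NUM_OF_FILTERS; i++) {
        if (filter_p[i])
            sample = filter_p[i](sample);
    }
    return sample;
}
```

With halve in slot 0 and add_one in slot 1, an input of 4 becomes 3; swapping the two slots gives 2.5, which shows that the slot order is the processing order.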

The pinout for this project is in the following table:

STM32 pin | Function
A0        | ADC in
A4        | DAC out

Build and flash the code

To build the code you need to run the build script, but you need to have a GCC toolchain and cmake installed. Generally, it’s easier just to use Docker and build the firmware using the docker image that I’m also using in other projects and that I’ve created in the DevOps for Embedded posts. To do that you only need Docker installed on your system and then run:


Or if you prefer to run the full command then:

docker run --rm -it -v `pwd`:/tmp -w=/tmp dimtass/stm32-cde-image:0.1 -c "./build.sh"

The above command will download the image if it’s not already available and then build the code in the build-stm32/ folder. Therefore, you’ll find a bin, hex and elf file in the build-stm32/src/ folder. Then you can flash it however you like. Personally I’m using stlink on Ubuntu and an ST-Link V2. If you have the same setup, then you can use my flash script and just run


Testing the filters

Assuming that you have a working setup now you can start playing with the filters! This gif is from the default filter in the code.

As you can see from the gif, the signal starts at 1KHz and then I increase the frequency up to 20KHz. At 1KHz the output is suppressed by the HPF; at 5KHz the output is -3dB compared to the input and then it keeps increasing until it reaches the same Vp-p as the input. As the frequency increases further, the output starts to be suppressed by the LPF; at 10KHz it is again at -3dB and it gets lower as the input frequency increases.

Nice stuff!

Next I’ve tested various filters and they seem to be working fine. I’m uploading a few pictures here for reference. For all the screenshots the sampling rate is 96KHz and the corner frequency is 5KHz.

This is the first-order all-pass filter (APF)

This is the first-order HPF

This is the first-order LPF

I think there’s no real benefit in uploading more screenshots. You can play around with the filters if you like, add as many filters as you want at the same time and see the result. What is interesting is the effect that the sample rate has on the DAC, therefore I’m uploading here 3 different sampling rates I’ve used (96KHz, 192KHz and 342KHz) while the input frequency is 20KHz.

It’s obvious that as the sampling rate increases the DAC output has better resolution.

You may wonder here why 342KHz and not 384KHz, which is more common as it’s two times 192KHz. Well, that’s because that’s the limit of the STM32! Actually, when the core runs at the default maximum frequency (72MHz), the maximum sampling rate is limited to 192KHz. Therefore, I had to overclock the STM32 to 128MHz in order to achieve this 342KHz sampling rate. In order to do the same you need to build the code with an extra flag, like this:

docker run --rm -it -v `pwd`:/tmp -w=/tmp dimtass/stm32-cde-image:0.1 -c "USE_OVERCLOCKING=ON ./build.sh"

The USE_OVERCLOCKING=ON will enable the oveclock_stm32f303() function in the main.c. Be aware that there might be a chance that this won’t work with your MCU, but it worked in all the black-pills I have around…
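A rough way to look at this limit is the number of core clock cycles available per sample. The sketch below is only arithmetic on the numbers quoted above (72MHz with 192KHz and 128MHz with 342KHz); the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Core clock cycles available between two consecutive samples
 * (integer division, so the result is rounded down). */
uint32_t cycles_per_sample(uint32_t core_hz, uint32_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return core_hz / sample_rate_hz;
}
```

cycles_per_sample(72000000, 192000) is 375 and cycles_per_sample(128000000, 342000) is 374, so both operating points leave about the same cycle budget per sample. That hints that the per-sample work sets the limit, although this is an inference from the arithmetic and not something measured here.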

Using the CMSIS-DSP library

As I’ve mentioned, the STM32F303CC has a Cortex-M4 core with a dedicated FPU. This means that you can use the CMSIS-DSP library that ARM provides, for which you can find more details here. This library comes with the CMSIS version of your MCU, and the version that comes with the STM32F30x_DSP_StdPeriph_Lib_V1.2.3 is 4.2, which is quite old, but this definitely doesn’t affect the single function we need to use.

So, in order to test the CMSIS-DSP lib you can only use the so_butterworth_lpf filter, because I didn’t implement the process function for all filters (you’ll see why in a bit). You also need to initialize a debug pin and use it to time the filter function. First add dbg_pin_init() right before you set up your filter, and set up only the so_butterworth_lpf. Your code in main() should look like this:


dbg_pin_init();
/* Set your filter here: */
so_butterworth_lpf_calculate_coeffs(10000, SAMPLE_RATE);
filter_p[0] = &so_butterworth_lpf_filter;

Then you need to change the DMA1_Channel1_IRQHandler() function like this:

void DMA1_Channel1_IRQHandler(void)
{
    /* Test on DMA1 Channel1 Transfer Complete interrupt */
    if (DMA_GetITStatus(DMA1_IT_TC1)) {
        io.sample_ready = 1;
        io.dac_sample = io.adc_sample;
        /* Debug pin high while the filter runs, so a scope can time it */
        DBG_PORT->ODR |= DBG_PIN;
        io.dac_sample = filter_p[0](io.dac_sample);
        DBG_PORT->ODR &= ~DBG_PIN;
        DAC_SetChannel1Data(DAC1, DAC_Align_12b_R, io.dac_sample);
        irq_count++;

        /* Clear DMA1 Channel1 Half Transfer, Transfer Complete and Global interrupt pending bits */
        DMA_ClearITPendingBit(DMA1_IT_GL1);
    }
}

This code sets the B7 pin high before calling the filter function and sets it low right after, therefore you can use the oscilloscope to measure the time. Finally, to use the CMSIS-DSP lib you need to build the firmware with this command:

docker run --rm -it -v `pwd`:/tmp -w=/tmp dimtass/stm32-cde-image:0.1 -c "USE_FPU=ON ./build.sh"

The USE_FPU flag controls the use of the CMSIS-DSP for the filter function. Finally, let’s check the filter function implementation before proceeding with the benchmarks. You’ll find it in the `source/libs/filters_lib/src/so_butterworth_lpf.c` file.

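F_SIZE so_butterworth_lpf_filter(F_SIZE sample)
{
    F_SIZE xn = sample;

#ifdef USE_FPU
    /* Biquad as a dot product: coefficients times input/output history */
    F_SIZE A[] = {m_coeffs.a0, m_coeffs.a1, m_coeffs.a2, -m_coeffs.b1, -m_coeffs.b2};
    F_SIZE B[] = {xn, m_xnz1, m_xnz2, m_ynz1, m_ynz2};
    F_SIZE yn = 0;
    arm_dot_prod_f32(A, B, 5, &yn);
#else
    F_SIZE yn = BIQUAD;
#endif

    /* Shift the delay line for the next sample */
    m_xnz2 = m_xnz1;
    m_xnz1 = xn;
    m_ynz2 = m_ynz1;
    m_ynz1 = yn;

    return(yn + m_offset);
}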

When USE_FPU is defined, then I use the arm_dot_prod_f32() function to calculate the dot product of two arrays, which are the coefficients and the input/output sample values. Let’s see the results now. Please keep in mind that on those screenshots I’ve used a sampling rate of 192KHz.

First, this is the result when using the CMSIS-DSP library.

As you can see from the screenshot, the execution time of the filter function is approx. 3 microseconds. Now let’s see without using the CMSIS-DSP library:

As you can see, now the filtering function takes 1.7 microseconds, which is almost half the time!

So, why is that happening? Well, I didn’t check the asm, but I guess that the memory copy operations needed to create the arrays that are passed to the function take a lot of time, and at the same time the compiler optimizations are good enough to make the code run fast even without the CMSIS-DSP library. You can have a look at the C flags that I’m using in cmake, but they are generally tuned for maximum performance.

Therefore, after those initial results I decided not to continue using the CMSIS-DSP lib for the filter function.
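For reference, the dot-product form used in the USE_FPU path computes the same sum as the explicit biquad expression. The host-side sketch below uses a plain C loop as a stand-in; it is not the CMSIS-DSP implementation, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C reference for what an n-element dot product computes.
 * This is NOT the CMSIS-DSP code, just the same arithmetic. */
float dot_prod(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Biquad output written as a dot product. Building A[] and B[] on every
 * call is the per-sample copy overhead suspected in the text. */
float biquad_as_dot(float a0, float a1, float a2, float b1, float b2,
                    float xn, float xz1, float xz2, float yz1, float yz2)
{
    const float A[5] = { a0, a1, a2, -b1, -b2 };
    const float B[5] = { xn, xz1, xz2, yz1, yz2 };
    return dot_prod(A, B, 5);
}
```

With only five multiply-accumulate operations per sample, filling two arrays is a noticeable share of the total work, which would be consistent with the measurements above.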

Another interesting thing is the distance between those pulses. Remember, the sampling rate is 192KHz and each pulse is one call of the filter function, but there’s also other code running at the same time, like the sys clock, the sampling rate timer, the ADC and DAC interrupts, blinking an LED, etc. Therefore, you can imagine how much time all those other things take, which is very little. That’s mostly because of the DMA and the interrupts.

Anyway, the sum of the high and low pulse times in the two cases is also interesting. Let’s see the next table:

Implementation | Filter time (usec) | Other (usec) | Sum (usec)
CMSIS-DSP      | 3.16               | 2.01         | 5.17
mathlib        | 1.8                | 3.4          | 5.2

You see that the period between filter function calls is ~5.2 usec, therefore the frequency is 1/5.2 usec = 192.3KHz, which is the expected sampling rate. Therefore, it’s obvious that the MCU is near its limits and we can’t use a faster sampling rate, but using the mathlib in this case gives us an additional 3.16 - 1.8 = 1.36 usec to use for other tasks. Neat.
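The same numbers can be turned into a per-sample time budget. This is a small sketch using only the values from the table; both function names are hypothetical.

```c
#include <assert.h>

/* Sample period in microseconds for a given sampling rate in Hz. */
double sample_period_us(double sample_rate_hz)
{
    assert(sample_rate_hz > 0.0);
    return 1e6 / sample_rate_hz;
}

/* Share of each sample period spent inside the filter function, in percent. */
double filter_load_percent(double filter_us, double period_us)
{
    assert(period_us > 0.0);
    return 100.0 * filter_us / period_us;
}
```

sample_period_us(192000.0) is about 5.21 usec, which matches the measured sums. filter_load_percent(3.16, 5.17) is roughly 61% for the CMSIS-DSP case and filter_load_percent(1.8, 5.2) is roughly 35% for the mathlib case.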


Well, that was a fun stupid project. So to summarize, I’ve ported my C++ filter library to C and then used an STM32F303CC to test the code and verify that the filters are working. By the way, the C port is also available in this repo as a standalone library.

There’s not much more to say really, the biquad filter seems to be working fine in all the filter versions. One thing that I need to clarify is why I had to add this 2048 offset only on the HPFs in order to make them work. I guess there’s something in the maths, but I’ll figure it out at some point later.
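A possible explanation, stated here as an assumption and not verified against the repo: a high-pass filter removes the DC component, so its output swings around zero, while a 12-bit DAC only accepts codes from 0 to 4095. Adding the mid-scale value 2048 would put the signal back in the middle of the DAC range. The sketch below illustrates that idea with a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define DAC_MID_SCALE 2048.0f   /* half of the 12-bit range 0..4095 */
#define DAC_MAX       4095.0f

/* Re-bias a zero-centred filter output and clamp it to the 12-bit range. */
uint16_t to_dac_code(float filtered)
{
    assert(filtered == filtered);   /* reject NaN before the integer cast */
    float v = filtered + DAC_MID_SCALE;
    if (v < 0.0f)
        v = 0.0f;
    if (v > DAC_MAX)
        v = DAC_MAX;
    return (uint16_t)v;
}
```

A low-pass filter passes DC, so if the input is already biased around mid-scale its output stays there, which would fit with the offset being needed only for the HPFs.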

Also, I’m satisfied with the ADC -> DMA -> process -> DAC speed. I can get a 192KHz sampling rate at the default MCU speed and almost double that when it’s overclocked. One thing that I could do, and may do at some point in the future, is to add another ADC and DAC channel, so I can have a stereo input/output. Also the phase delay is low, just a few micro-seconds.

I didn’t expect that CMSIS-DSP would be slower than the mathlib, but after seeing the results it probably should be expected, as the memcpy needed to create the arrays for `arm_dot_prod_f32()` takes quite some time.

To be honest, I don’t really see many real usage scenarios for such filters, as it’s very easy to implement them with passive components outside the MCU. Where it could be really useful, though, is when you need an adaptive or programmable filter. In this case, you can use this project and add UART commands to control the type of the filter, the corner frequency, Q and BW and the sampling rate in real-time. This would be awesome and very easy to do using this project as a template. Maybe I can do this in the future, but it’s a bit of a boring procedure for me for now and my time is a bit limited.
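As a rough idea of what such a UART command could look like, here is a minimal sketch of a parser for one line of text. The command format ("filter name, then corner frequency in Hz"), the struct and the function name are all hypothetical and not part of the repo.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical command format: "<filter-name> <corner-frequency-in-Hz>",
 * e.g. "so_butterworth_lpf 10000". */
typedef struct {
    char name[32];
    unsigned long fc_hz;
} filter_cmd_t;

/* Returns 1 on success, 0 if the line does not match the format. */
int parse_filter_cmd(const char *line, filter_cmd_t *out)
{
    assert(line != NULL && out != NULL);
    /* %31s leaves room for the terminating NUL in name[32] */
    return sscanf(line, "%31s %lu", out->name, &out->fc_hz) == 2;
}
```

The parsed name could then select one of the `*_calculate_coeffs()` functions and the matching entry for filter_p. On a small MCU a hand-written parser may be preferable to sscanf() because of code size.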

Just by estimating from the current results, I believe 192KHz stereo is not possible at the default 72MHz, but 96KHz should be OK.

Also if you plan to really use this, then you’ll probably need the reconstruction filter after the DAC to remove the quantization noise. A second-order LPF with 20KHz cut-off frequency would be fine. I would probably use an active filter with an opamp so I can also have output buffering and drive higher loads. Well, that depends on your case, but either way any 2nd order LPF would be fine.

I hope you enjoyed this little project.

Have fun!


DevOps for Embedded (part 3)


Note: This is the third post of the DevOps for Embedded series. You can find the first post here and the second here.

In this series of posts, I’m explaining different ways of creating a reusable environment for developing and also building a template firmware for an STM32F103. The project itself, meaning the STM32 firmware, doesn’t really matter and is just a reference project to work with the presented examples. Instead of this firmware you can have anything else, less or more complicated. In this post, though, I’m going to use another firmware, which is not the template I’ve used in the two previous posts, as I want to demonstrate how a test farm can be used. More on this later.

In the first post I’ve explained how to use Packer and Ansible to create a Docker CDE (common development environment) image and use it on your host OS to build and flash your code for a simple STM32F103 project. This docker image was then pushed to the docker hub repository and then I’ve shown how you can create a gitlab-ci pipeline that triggers on repo code changes, pulls the docker CDE image and then builds the code, runs tests and finally exports the artifact to the gitlab repo.

In the second post I’ve explained how to use Packer and Ansible to build an AWS EC2 AMI image that includes the tools that you need to compile the STM32 code. Also, I’ve shown how you can install the gitlab-runner in the AMI and then use an instance of this image with gitlab-ci to build the code when you push changes in the repo. Also, you’ve seen how you can use docker and your workstation to create multiple gitlab-runners and also how to use Vagrant to make the process of running instances easier.

Therefore, we pretty much covered the CI part of the CI/CD pipeline and I’ve shown a few different ways this can be done, so you can choose what fits your specific case.

In this post, I’ll try to scratch the surface of testing your embedded project in the CI/CD pipeline. Scratching the surface is literal, because fully automating the testing of an embedded project can even be impossible. But I’ll talk about this later.

Testing in the embedded domain

When it comes to embedded, there’s one important thing that it would be great to include in your automation and that’s testing. There are different types of testing and a few of them apply only to the embedded domain and not to the software domain or other technology domains. That is because in an embedded project you have custom hardware that works with custom software and that may be connected to, or be a part of, a bigger and more complex system. So what do you test in this case? The hardware? The software? The overall system? The functionality? It can get really complicated.

Embedded projects can also be very complex themselves. Usually an embedded product consists of an MCU or application CPU and some hardware around that. The complexity is not only about the peripheral hardware around the MCU, but also the overall functionality of the hardware. Therefore, the hardware might be quite simple, but it might be difficult to test it because of its functionality and/or the testing conditions. For example, it would be rather difficult to test your toggling LED firmware on the arduino in zero-gravity or 100 g acceleration :p

Anyway, getting back to testing, it should be clear that it can become very hard to do. Nevertheless, the important thing is to know at least what you can test and what you can’t; therefore it’s mandatory to write down your specs and create a list of the tests that you would like to have and then start to group those into tests that can be done in software and those that can be done with some external hardware. Finally, you need to sort those groups by how easy it is to implement each test.

You see, there are many different things that actually need testing, but let’s break this down.

Unit testing

Usually testing in the software domain is about running tests on specific code functions, modules or units (unit testing) and most of the time those tests are isolated, meaning that only the specific module is tested and not the whole software. Unit testing nowadays is used more and more in the lower embedded software (firmware), which is good. Most of the time, though, embedded engineers, especially older ones like me, don’t like unit tests. I think the reason behind that is that it takes a lot of time to write unit tests and also that many embedded engineers are fed up with how rapidly the domain is moving and can’t follow for various reasons.

There are a few things that you need to be aware of regarding unit tests. Unit tests are not a panacea, which means they’re not a remedy for all the problems you may have. Unit tests only test the very specific thing you’ve programmed them to test, nothing more, nothing less. Unit tests will test a function, but they won’t prevent you from using that function in a way that will break your code, therefore you can only test very obvious things that you already thought might break your code. Of course, this is very useful, but it’s far from being a solution that makes you safe just by using it.

Most of the time, unit tests are useful to verify that a change you’ve made in your code, or in a function that is deep in the call chain of another function, doesn’t break anything that was working before. Again, you need to have in mind that this won’t make you safe. Bad coding is always bad coding no matter how many tests you do. For example, if you start pushing and popping things to the stack in your code, instead of leaving the compiler to be the only owner of the stack, then you’re heading for trouble and no unit test can save you from this.

Finally, don’t get too excited with unit testing and start to make tests for any function you have. You need to use it wisely and not spend too much time implementing tests that are not really needed. You need to reach a point where you are confident in the code you’re writing and also re-use the code you’ve written for other projects. Therefore, focus on writing better code and write tests only when needed or for critical modules. This will also simplify your CI pipeline and make its maintenance easier.

Testing with mocks

Some tests can be done using either software or external hardware. Back in 2005 I wrote the firmware for a GPS tracker and I wanted to test both the software module that was receiving and processing the NMEA GPS input via a UART and also the application firmware. One way to do that was to use a unit test to test the module, which is OK for testing the processing but not very useful for real-case scenario tests. Another way was to test the device out in the real world, which would be the best option, but as you can imagine it’s a tedious and hard process to debug the device and fix code while you’re driving. Finally, the other way was to mock the GPS receiver and replay data to the UART port. This is called mocking the hardware, and in this case I just wrote a simple VB6 app that was replaying captured GPS NMEA data from another GPS receiver to the device under test (DUT).

There are several ways for mocking hardware. Mocking can be software only and also integrated in the test unit itself, or it can be external like the GPS tracker example. In most of the cases though, mocking is done in the unit tests without an external hardware and of course this is the “simplest” case and it’s preferred most of the times. I’ve used quotes in simplest, because sometimes mocking the hardware only with software integrated in the unit test can be also very hard.

Personally I consider mocking to be different from unit tests, although sometimes you may find them together (e.g. in a unit testing framework that supports mocking) and maybe they are presented as if they are the same thing. They’re not.
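To make the software-only flavour of mocking more concrete, here is a minimal sketch in C. The code under test reads bytes through a function pointer, so a test can swap the real UART for a mock that replays a captured buffer. All names are hypothetical, and the "unit" is deliberately tiny: it only computes the NMEA checksum, which is the XOR of the characters between '$' and '*'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The code under test reads bytes through this hook instead of touching
 * a real UART. It returns -1 when no more data is available. */
typedef int (*uart_read_fn)(void *ctx);

/* Mock UART: replays a captured, NUL-terminated buffer byte by byte. */
typedef struct {
    const char *data;
    size_t pos;
} replay_uart_t;

int replay_uart_read(void *ctx)
{
    replay_uart_t *u = ctx;
    if (u->data[u->pos] == '\0')
        return -1;
    return (unsigned char)u->data[u->pos++];
}

/* Unit under test: consume one "$...*" sentence and return the XOR of the
 * bytes between '$' and '*', or -1 if the stream ends too early. */
int nmea_checksum_from_uart(uart_read_fn read_byte, void *ctx)
{
    assert(read_byte != NULL);
    int c;
    uint8_t sum = 0;

    while ((c = read_byte(ctx)) != '$') {   /* skip until start of sentence */
        if (c < 0)
            return -1;
    }
    while ((c = read_byte(ctx)) != '*') {   /* accumulate until the '*' */
        if (c < 0)
            return -1;
        sum ^= (uint8_t)c;
    }
    return sum;
}
```

In the firmware the same function would receive a reader that pulls bytes from the real UART driver, while the test passes replay_uart_read with captured data.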

In the common development environment (CDE) image that I’ve also used in the previous posts there’s another Ansible role, which I didn’t mention on purpose, and that’s the cpputest framework, which you’ll find in this role: `provisioning/roles/cpputest/tasks/main.yml`. In case you haven’t read the previous posts, you’ll need to clone this repo here to run the image build examples:


There are many testing frameworks for different programming languages, but since this example project template is limited to C/C++ I’ve used cpputest as a reference. Of course, in your case you can create another Ansible role and install the testing framework of your preference. Usually, those frameworks are easy to install and therefore also easy to add to your image. One thing worth mentioning here is that if the testing framework you want to use is included in the package repository of your image (e.g. apt for debian) and you’re going to install it during the image provisioning, then keep in mind that its version might change if you re-build your image after some time and the package has been updated in the repo. Of course, the same issue applies to all the repo packages.

In general, if you want to play safe you need to have this in mind. It’s not wrong to use package repositories to install tools you need during provisioning. It’s just that you need to be aware of this and that at some point one of the packages may be updated and introduce a different behavior than expected.

System and functional testing

This is where things are getting interesting in embedded. How do you test your system? How do you test the functionality? These are the real problems in embedded and this is where you should spend most of your time when thinking about testing.

First you need to make a plan. You need to write down the specs for testing your project. This is the most important part and the embedded engineer is valuable in bringing out the things that need testing (which are not obvious). The idea of compiling this list of tests is simple and there are two questions that you have to answer, which are “what” and “how”. “What” do you need to test? “How” are you going to do that in a simple way?

After you make this list, there are already several tools and frameworks that can help you with this, so let’s have a look.

Other testing frameworks

Oh boy… this is the part where I will try to be fair and distance myself from my personal experiences and beliefs. Just a brief story: I’ve evaluated many different testing frameworks and I’m not really satisfied with what’s available out there, as everything seems too complicated or bloated or doesn’t make sense to me. Anyway, I’ll try not to rant about that a lot, but every time in the end I had to implement either my own framework or part of it to get the job done.

The problem with all these frameworks in general is that they try to be very generic and do many things at the same time. Thus sometimes this becomes a problem, especially in cases where you need to solve specific problems that could be solved easily with a bash script, and eventually you spend more time trying to fit the framework into your case rather than having a simple and clean solution. The result many times will be a really ugly hack of the framework usage, which is hard to maintain, especially if you haven’t done it yourself and nobody has documented it.

When I refer to testing frameworks in this section I don’t mean unit testing frameworks, but higher level automation frameworks like Robot Framework, LAVA, Fuego Test System, tbot and several others. Please, keep in mind that I don’t mean to be disrespectful to any of those frameworks or their users and mostly their developers. These are great tools and they’ve been developed by people that know well what testing is all about (well, they know better than me, tbh). I’m also sure that these frameworks fit excellently in many cases and are valuable tools to use. Personally, I prefer simplicity and I prefer simple tools that do a single thing and don’t have many dependencies, that’s it; nothing more, nothing less.

As you can see, there are a lot of testing frameworks (many more than the ones I’ve listed). Many companies or developers also make their own testing tools. It’s very important for the developer/architect of the testing to feel comfortable and connected with the testing framework. This is why testing frameworks are important and that’s why there are so many.

In this post I’ll use Robot Framework, only because I’ve seen many QAs using it for their tests. Personally I find this framework really hard to maintain and I wouldn’t use it myself, because many times it doesn’t make sense, especially for embedded. Generally I think Robot Framework targets web development, and when it comes to integrating it in other domains it can get nasty, because of the syntax, the fact that the delimiters are multiple spaces, and that by default all variables are strings. Also sometimes it’s hard to use when you load library classes that have integer parameters and you need to pass those from the command line. Anyway, the current newer version 3.1.x handles that a bit better, but until now it was cumbersome.

Testing stages order and/or parallelization

The testing frameworks that I’ve mentioned above are the ones that are commonly used in CI/CD pipelines. In some CI tools, like gitlab-ci, you’re able to run multiple tasks in a stage, so in the case of the testing stage you may run multiple testing tasks. These tasks may be anything, e.g. unit or functional tests or whatever you want, but it is important to decide the architecture you’re going to use and if that makes sense for your pipeline. Also be aware that you can’t run multiple tasks in the same stage if those tasks use the same hardware, unless you handle this explicitly. For example, you can’t have two tasks sending I2C commands to the target and expect that you get valid results. Therefore, in the test stage you may run multiple tasks with unit tests and only one that accesses the target hardware (that depends on the case, of course, but you get the point).

Also let’s assume that you have unit tests and functional tests that you need to run. There’s a question that you need to ask yourself. Does it make sense to run functional tests before unit tests? Does it make sense to run them in parallel? Probably you’ll think that it’s quite inefficient to run complex and time consuming functional tests if your unit tests fail. Probably your unit tests need less time to run, so why not run them first and then run the functional tests? Also, if you run them in parallel and your unit test fails fast, your functional test may still take a lot of time until it completes, which means that your gitlab-runner (or Jenkins node etc.) will be occupied running a test on an already failed build.

On the other hand, you may prefer to run them in parallel even if one of them fails, so at least you know early that the other tests pass and you don’t have to make changes there as well. But this also means that in the next iteration, if your error is not fixed, you’ll lose the same amount of time again. Also, by running the tasks in parallel, even if one occasionally fails, when you have many successful builds you save a lot of time in the long term. Parallel stages mean more runners, but at the same time each runner can be specific to each task, so you have one runner to build the code and another to run functional tests. The latter means that your builders may be x86_64 machines that use x86_64 toolchains and the functional test runners are ARM SBCs that don’t need to have toolchains to build code.

You see how quickly this escalates, and it gets even deeper than that. Therefore, there are many details that you need to consider. To make it worse, you don’t have to choose one or the other: you can implement a hybrid solution, or change your solution in the middle of your project, because a different architecture may make more sense when the project starts, when it’s half-way, and when it’s done and only updates and maintenance remain.

DevOps is (and should be) an alive, flexible and dynamic process in embedded. You need to take advantage of this flexibility and use it in a way that makes more sense in each step of your project. You don’t have to stick to an architecture that worked fine in the beginning of the project, but later became inefficient. DevOps needs to be adaptive and this is why being as much IaC as possible makes those transitions safer and with lower risk. If your infrastructure is in code and automated then you can bring it back if an architectural core change fails.

Well, for now, if you need to keep one rough takeaway out of this, then just try to stack up your CI pipeline stages in a logical way that makes sense.

Project example setup

For this post I’ll create a small farm with test servers that will operate on the connected devices under test (DUT). It’s important to make clear at this point that there are many different architectures that you can implement for this scenario, but I will try to cover only two of them. This doesn’t mean that these are the only ways to implement your farm, but I think having two examples to show the different ways that you can do this demonstrates the flexibility and may trigger you to find other architectures that are better for your project.

As in the previous posts I’ll use gitlab-ci to run CI/CD pipelines, therefore I’ll refer to the agents as runners. If it was Jenkins I would refer to them as nodes. Anyway, agents and runners will be used interchangeably. The same goes for stages and jobs when talking about pipelines.

A simple test server

Before getting into the CI/CD part let’s see what a test server is. The test server is a hardware platform that is able to conduct tests on the DUT (device under test) and also prepare the DUT for those tests. That preparation needs to be done in a way that the DUT is not affected from any previous tests that ran on it. That can be achieved in different ways depending the DUT, but in this case the “reset” state is when the firmware is flashed, the device is reset and any peripheral connections have always the same and predefined state.

For this post the DUT is the STM32F103C8T6 (blue-pill) but what should we use as a test server? The short answer is any Linux capable SBC, but in this case I’ll use the Nanopi K1 Plus and the Nanopi Neo2-LTS. Although those are different SBCs, they share the same important component, which is the Allwinner H5 SoC that integrates an ARM Cortex-A53 cpu. The main difference is the CPU clock and the RAM size, which is 2GB for the TK1 and 512MB for the Neo2. The reason I’ve chosen different SBCs is to also benchmark their differences. The cost of the Neo2 is $20 at the time the post is written and the K1 is $35. The price might be almost twice more for the K1, but $35 on the other hand is really cheap for building a farm. You could also use a rapberry pi instead, but later on it will be more obvious why I’ve chosen those SBCs.

Choosing a Linux capable SBC like these comes with great flexibility. First, you don’t need to build your own hardware, which is fine for most of the cases you’ll have to deal with. If your project demands, for some reason, specific hardware then you may build only the part that is needed on a PCB which is connected to the SBC. The flexibility then also extends to the OS that is running on the SBC, which is Linux. Having Linux running on your test server board is great for many reasons that I won’t go into in detail, but the most important is that you can choose from a variety of already existing software and frameworks for Linux.

So, what packages does this test server need to have? Well, that’s up to you and your project specs, but in this case I’ll use Python3, the robot testing framework, Docker and a few other tools. It’s important to grasp how this is set up and not which tools are used, because the tools only matter for a specific project and you can use whatever you like. Later on I’ll explain how those tools are used in each case.

This is a simple diagram of how the test server is connected on the DUT.

The test server (SBC) has various I/Os and ports, like USB, I2C, SPI, GPIOs, OTG etc. The DUT (STM32) is connected to the ST-Link programmer via the SWD interface and the ST-Link is connected via USB to the SBC. Also the STM32F1 is connected to the SBC via an I2C interface and a GPIO (GPIO A0).

Since this setup is meant to be simple and just a proof of concept, it doesn’t need to implement any complicated task, as you can scale this example to whatever extent you need or want. Therefore, for this example project the STM32 device will implement a naive I2C GPIO expander device. The device will have 8 GPIOs and a set of registers that can be accessed via I2C to program each GPIO’s functionality. The test server will run a robot test that will check that the device is connected and then configure a pin as output, set it high, read and verify the pin, then set it low and again read and verify the pin. Each of these tasks is a different test and if all tests pass then the testing is successful.

So, this is our simple example. Although it’s simple, it’s important to focus on the architecture and the procedure and not on the complexity of the system. Your project may be much more complex, but in the end if you break things down everything comes down to this simple case; it’s just many simple cases connected together. Of course, there’s a chance that your DUT can’t be 100% tested automatically for various reasons, but even if you can automate 40-50% of the tests it’s a huge gain in time and cost.

Now let’s see the two different test farm architectures we’re going to test in this post.

Multi-agent pipeline

In multi-agent pipelines you just have multiple runners in the same pipeline. The agents may execute different actions in parallel in the same stage/job or each agent may execute a different job in the pipeline sequence, one after another. Before going to the pros and cons of this architecture, let’s see first how this is realized in our simple example.

Although the above may seem a bit complicated, in reality it’s quite simple. It shows that when new code is committed, gitlab-ci starts executing the pipeline. In this pipeline there’s a build stage that compiles the code and runs unit tests on an AWS EC2 instance and if everything goes well, then the artifact (firmware binary) is uploaded to gitlab. In this case, it doesn’t really matter if it’s AWS, a baremetal server or whatever infrastructure you have for building code. What matters is that the code build process is done on a different agent than the one that tests the DUT.

It’s quite obvious that the next stage, which is the automated firmware testing and includes flashing the firmware and running tests on the real hardware, can’t be done on an AWS EC2 instance. It also can’t be done on your baremetal build servers which may be anywhere in your building. Therefore, in this stage you need another agent that will handle the testing, which means that you have a multi-stage/multi-agent pipeline. In this case the roles are very distinct and the test farm can also be at a remote location.

Now let’s see some pros and cons of this architecture.


Pros:
  • Isolation. Since the build servers are isolated they can be re-used for other tasks while the pipeline waits for the next stage, which is the testing and which may take quite some time.
  • Flexibility. If you use orchestration (e.g. Kubernetes) then you can scale your build servers on demand. The same doesn’t hold for the test servers though.
  • Failure handling. If you use a monitoring tool or an orchestrator (e.g. Kubernetes) then if your build server fails a new one is spawned and you’re back up and running automatically.
  • Maintenance. This can actually be both a pro and a con, but in this context it means that you maintain your build servers and test servers separately, which some consider better. My cup is half full on this, I think it depends on the use case.
  • Speed. The build servers can be much faster than the test servers, which means that the build stage will end much faster and the runners will be available for other jobs.


Cons:
  • Maintenance. You’ll need to maintain two different infrastructures, the build servers and the test servers.
  • Costs. It costs more to run both infrastructures at the same time, but the costs may be negligible in real use scenarios, especially for companies.
  • Different technologies. This is quite similar to the maintenance point, but it also means that you need to be able to integrate all those technologies into your pipeline, which might be a pain for small teams that lack the know-how and the time to learn to do this properly. If this is not done properly at an early stage then you may have to deal with a lot of problems later in development and you may even need to re-structure your solution, which means more time and maybe more money.

These are not the only pros and cons, but these are the generic ones and there are others depending on your specific project. Nevertheless, if the multi-agent pipeline architecture is clear then you can pretty much work out the pain points you may have in your specific case.

Single-agent pipeline

In the single-agent pipeline there’s only one runner that does everything. The runner will compile the firmware and then will also execute the tests. So the SBC will run the gitlab-runner agent, will have an ARM toolchain built for the SoC architecture and will run all the needed tools, like Docker, robot tests or any other testing framework that is used. This sounds awesome, right? And it actually is. To visualize this setup see the following image.

In the above image you can see that the test server (in this case the nanopi-k1-plus) is running the gitlab-runner client and picks up the build job from gitlab-ci on a new commit. Then it uploads the firmware artifact and runs the tests on the target. For this to happen the test server needs to have all the tools that are needed, including the STM32 toolchain (built for the aarch64 architecture). Let’s see the pros and cons here:


Pros:
  • Maintenance. Only one infrastructure to maintain.
  • Cost. The running costs may be less when using this architecture.


Cons:
  • Failures. If your agent fails for any reason, hardware or software, then you may need physical presence to fix this. The fix might be as simple as resetting the board or changing a cable, or in the worst case your SBC is dead and you need to replace it. Of course, this may also happen in the previous architecture, but in this case you haven’t only lost a test server but also a builder.
  • Speed. A Cortex-A53 is slower when building software than an x86 CPU (even if that’s a vCPU on a cloud cluster).
  • Support. Currently there are a few aarch64 builds for gitlab, but gitlab is not officially supported for this architecture. This can be true also for other tools and packages, therefore you need to consider this before choosing your platform.

Again these are the generic pros/cons and in no way the only ones. There may be more in your specific case, which you need to consider by extracting your project’s pain points.

So, the next question is how do we setup our test servers?

Test server OS image

There are two ways to set up your testing farm as IaC. I’ll only explain the first one briefly, and I’ll use the second one for this post’s example. The first way is to use an already existing OS image for your SBC, like a Debian or Ubuntu distro (e.g. Armbian), and then use a provisioner like Ansible to set up the image. Therefore, you only need to flash the image to the SD card, boot the board and then provision the OS. Let’s see the pros and cons of this solution:


Pros:
  • The image comes with an APT repository that you can use to install whatever you need using Ansible. This makes the image dynamic.
  • You can update your provisioning scripts at a later point and run them on all agents.
  • The provisioning scripts are versioned with git.
  • The provisioning scripts can be re-used on any targeted hardware as long as the OS distro is the same.


Cons:
  • You may need to safely back up the base OS image as it might be difficult to find it again in the future and the provisioning may fail because of that.
  • The image is bloated with unnecessary packages.
  • In most of those images you first need to do the initial installation steps manually, as they have a loader that expects the user to manually create a new user for safety reasons (e.g. Armbian).
  • Not being a static image may become a problem in some cases, as someone may install packages that make the image unstable or create a different environment from the rest of the agents.
  • You need different images if you have different SBCs in your farm.

The strong argument for this solution is the existence of an APT repository which comes from a well established distro (e.g. Debian). In this scenario you would have to download or build your base image with a cross-toolchain, which is not that hard nowadays as there are many tools for that. Then you would need to store this base image for future use, flash an SD card for each agent and boot the agent while it’s connected to the network. Then run the Ansible playbook targeting your test farm (which means you need a hosts file that describes your farm). After that you’re ready to go.

The second option, which I’ll actually use in this post, is to build a custom OS distro using Yocto. That means that your IaC is a meta layer that builds upon a Yocto BSP layer and uses the default Poky distro as a base. One of the pros for me in this case (which is not necessarily also your case), is that I’m the maintainer of the allwinner BSP layer for these SoCs… Anyway, let’s see the generic pros and cons of this option:


Pros:
  • The test server specific meta layer can support all the Yocto BSPs. It doesn’t matter if you have an RPi or any of the supported allwinner BSPs, the image that you’ll get in the end will be the same.
  • It’s a static image. That means all the versions will always be the same and it’s more difficult to break your environment.
  • It’s an IaC solution, since it’s a Yocto meta layer which can be hosted in a git repo.
  • It’s fully flexible and you can customize your image during the build without the need for provisioning.
  • Full control over your package repository, if you need one.
  • You can still use Ansible to provision your image even if you don’t use a package repo (of course you can also build your own package repository using Yocto, which is very simple to do).
  • Yocto meta layers are re-usable.


Cons:
  • Although you can build your own package repository and build ipk, deb and rpm packages for your image, it doesn’t come as easily as using Debian APT and you also need to support this infrastructure in the future. This of course gives you more control, which is a good thing.
  • If you don’t use a package manager, then you can’t provision your test farm to add new functionality; you need to add this functionality to your Yocto image and then build and re-flash all the test servers. Normally, this won’t happen often, but in case you want to play safe then you need to build your own package repository.
  • Yocto is much harder to use for building your image compared to just using Ansible on a standard mainstream image.

As I’ve already mentioned, in this test farm example I’ll use Yocto. You may think that for this example that’s too much, but I believe that it’s a much better solution in the long term; it’s a more proper IaC as everything is code and no binary images or external dependencies are needed. It’s also a good starting point to integrate Yocto into your workflow as it has become the industry standard and that’s for good reason. Therefore, integrating this technology into your infrastructure is beneficial not only for the project workflow but also for your general infrastructure, as it provides flexibility, integrates widely used and well established technologies and expands your knowledge base.

Of course, if you’re working in an organisation you always need to consider whether the above fits your case. I mean, there’s no point in integrating a new stack into your workflow if you can do your job in a simpler way and you’re sure that you’ll never need this new stack in the current project’s workflow (or any near future project). Always choose the simplest solution that you’re sure fits your skills and your project’s requirements, because this will bring fewer problems in the future and prevent you from using a newly acquired skill or tool in the wrong way. But if you see or feel that the solutions you can provide now can’t scale well, then it’s probably better to spend some time integrating a new technology rather than trying to fix your problems with ugly hacks that may break easily in the future (and they probably will).

Setting up the SBCs (nanopi-neo2 and nanopi-k1-plus)

The first component of the test farm is the test server. In this case I’ll use two different allwinner SBCs for that in order to show how flexible you can be by using the IaC concept in DevOps. To do that I’ll build a custom Linux image using Yocto and the result will be two identical images for different SBCs. You can imagine how powerful that is, as you have full control over the image and you can also use almost any SBC.

To build the image you’ll need the meta-allwinner-hx layer, which is the BSP layer that supports all those different SBCs. Since I’m the maintainer of the layer I update it quite often and I try to ensure that it works properly, although I can’t test it on all the supported SBCs. Anyway, you then need an extra layer that will sit on top of the BSP layer and create an image with all the tools that are needed, pre-configured to support the externally connected hardware (e.g. stlink) and also the required services and environment for the tests. The source of the custom test server Yocto meta layer that I’m going to use is here:


In the README.md file of this repo there are instructions on how to use this layer, but I’ll also explain how to build the image here. First you need to prepare the build environment; for that you need to create a folder for the image build and a folder for the sources.

mkdir -p yocto-testserver/sources
cd yocto-testserver/sources
git clone --depth 1 https://dimtass@bitbucket.org/dimtass/meta-test-server.git
cd ../

The git clone command will clone the test server meta layer repo into the sources/meta-test-server folder. Now you have two options. The first is to just run this script from the root folder.


This script will prepare everything and will also build the Docker image that is needed to build the Yocto image. That means that the Yocto image for the BSP will not use your host environment, but it’s going to be built inside a Docker container. This is pretty much the same that we did in the previous posts to build the STM32 firmware, but this time we build a custom Linux distribution for the test servers.

The other option to build the image (without the script) is to type each command in the script yourself to your terminal, so you can edit any commands if you like (e.g. the docker image and container name).

Assuming that you’ve built the Docker image and now you’re in the running container, you can run these commands to set up the build environment and then build the image:

DISTRO=allwinner-distro-console MACHINE=nanopi-k1-plus source ./setup-environment.sh build
bitbake test-server-image

After this is finished, exit the running container and you can now flash the image to the SD card and check if the test server boots. To flash the image, you first need to find the path of your SD card (e.g. /dev/sdX) and then run this command:

sudo MACHINE=nanopi-k1-plus ./flash_sd.sh /dev/sdX

If this command fails because you’re missing bmap-tools then you can install it on your host using apt:

sudo apt install bmap-tools

After you flash the SD card and boot the SBC (in this case the nanopi-k1-plus) with that image, you can just log in as root without any password. To do this you need to connect a USB-to-UART module and wire the proper Tx, Rx and GND pins between the module and the SBC. Then you can open any serial terminal (on Linux I prefer using putty) and connect at 115200 baud. You should then be able to see the boot messages and log in as root without a password.

Now as a sanity check you can run the `uname -a` command. This is what I get in my case (you’ll get a different output as the post gets older).

root:~# uname -a
Linux nanopi-k1-plus 5.3.13-allwinner #1 SMP Mon Dec 16 20:21:03 UTC 2019 aarch64 GNU/Linux

That means that the Yocto image runs on the 5.3.13 SMP kernel and the architecture is aarch64. You can also test that docker, gitlab-runner and the toolchain are there. For example you can just git clone the STM32 template code I’ve used in the previous posts and build the code. If that builds then you’re good to go. In my case, I’ve used these commands here in the nanopi-k1-plus:

git clone --recursive https://dimtass@bitbucket.org/dimtass/stm32f103-cmake-template.git
cd stm32f103-cmake-template
time TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./build.sh

And this is the result I got:

Building the project in Linux environment
- removing build directory: /home/root/stm32f103-cmake-template/build-stm32
--- Pre-cmake ---
architecture      : stm32
distclean         : true
parallel          : 4
[ 95%] Linking C executable stm32-cmake-template.elf
   text	   data	    bss	    dec	    hex	filename
  14924	    856	   1144	  16924	   421c	stm32-cmake-template.elf
[ 95%] Built target stm32-cmake-template.elf
Scanning dependencies of target stm32-cmake-template.hex
Scanning dependencies of target stm32-cmake-template.bin
[100%] Generating stm32-cmake-template.hex
[100%] Generating stm32-cmake-template.bin
[100%] Built target stm32-cmake-template.hex
[100%] Built target stm32-cmake-template.bin

real	0m11.816s
user	0m30.953s
sys	0m3.973s

That means that the nanopi-k1-plus built the STM32 template firmware using 4 threads and the aarch64 build of the arm-none-eabi GCC toolchain and this took 11.8 secs. Yep, you read that right. The nanopi with the 4-core ARM Cortex-A53 is by far the slowest when it comes to building the firmware, even compared to AWS EC2. Do you really care about this though? Maybe not, maybe yes. Anyway I’ll post the benchmark results later in this post.
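As a rough sanity check on the numbers above, the ratio of CPU time (user + sys) to wall-clock time (real) gives an idea of how many cores were busy on average during the build. This is just a small sketch using the values from the `time` output; the function name is only for illustration.

```python
def average_busy_cores(real_s, user_s, sys_s):
    """Return CPU time divided by wall-clock time."""
    return (user_s + sys_s) / real_s


# Values copied from the `time` output of the build on the nanopi-k1-plus
ratio = average_busy_cores(real_s=11.816, user_s=30.953, sys_s=3.973)
print(f"average busy cores: {ratio:.2f}")
```

The result is close to 3, which suggests that the 4 parallel jobs kept roughly three of the four cores busy on average, with the rest probably lost to the serial parts of the build like the cmake step and the linking.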

Next test is to check if the SBC is able to flash the firmware on the STM32. First you can probe the st-link. In my case this is what I get:

root:~/stm32f103-cmake-template# st-info --probe
Found 1 stlink programmers
 serial: 513f6e06493f55564009213f
openocd: "\x51\x3f\x6e\x06\x49\x3f\x55\x56\x40\x09\x21\x3f"
  flash: 65536 (pagesize: 1024)
   sram: 20480
 chipid: 0x0410
  descr: F1 Medium-density device

That means that the stlink tools are properly built and installed and the udev rules work fine. The next step is to try to flash the firmware with this command:

st-flash --reset write build-stm32/src_stdperiph/stm32-cmake-template.bin 0x8000000

Then you should see something like this in the terminal

st-flash 1.5.1
2019-12-20T23:56:22 INFO common.c: Loading device parameters....
2019-12-20T23:56:22 INFO common.c: Device connected is: F1 Medium-density device, id 0x20036410
2019-12-20T23:56:22 INFO common.c: SRAM size: 0x5000 bytes (20 KiB), Flash: 0x10000 bytes (64 KiB) in pages of 1024 bytes
2019-12-20T23:56:22 INFO common.c: Attempting to write 15780 (0x3da4) bytes to stm32 address: 134217728 (0x8000000)
Flash page at addr: 0x08003c00 erased
2019-12-20T23:56:24 INFO common.c: Finished erasing 16 pages of 1024 (0x400) bytes
2019-12-20T23:56:24 INFO common.c: Starting Flash write for VL/F0/F3/F1_XL core id
2019-12-20T23:56:24 INFO flash_loader.c: Successfully loaded flash loader in sram
 16/16 pages written
2019-12-20T23:56:25 INFO common.c: Starting verification of write complete
2019-12-20T23:56:25 INFO common.c: Flash written and verified! jolly good!

That means that the SBC managed to flash the firmware on the STM32. You can visually verify this by checking that the LED on gpio C13 is toggling every 500ms.
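If you want to cross-check the flashing log against the build output, the numbers line up nicely. The sketch below only uses values that appear in the two listings above (the `text` and `data` sizes from the build and the 1024-byte page size from st-flash); the variable names are mine.

```python
import math

PAGE_SIZE = 1024          # flash page size reported by st-flash
text, data = 14924, 856   # from the size table printed at the end of the build

# .bss is zero-initialised RAM, so it takes no space in the .bin file
image_size = text + data
pages = math.ceil(image_size / PAGE_SIZE)

print(image_size, hex(image_size), pages)  # 15780 0x3da4 16
```

So the 15780 (0x3da4) bytes that st-flash reports are the text and data sections together, and they need 16 pages of 1024 bytes, which matches the “16/16 pages written” line.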

Great. By doing those steps we verified that we’ve built a Docker image that can build the Yocto image for the nanopi-k1-plus and that, after booting the image, all the tools that are needed to build and flash the STM32 firmware are there and work fine.

The above sequence can be integrated in a CI pipeline that builds the image whenever there’s a change in the meta-test-server layer. Therefore, when you add more tools or change things in the image and push the changes, the pipeline builds the new Linux image that you can flash to the SD cards of the SBCs. The MACHINE parameter in the build command can be a parameter in the pipeline so you can build different images for any SBC that you have. This makes things simple and automates your test farm infrastructure (in terms of the OS).
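To make the MACHINE parameter idea a bit more concrete, here is a small Python sketch that composes the build line for each board in the farm. It re-uses the DISTRO, setup script and image name from the commands shown earlier; the helper name and the list of machines are my assumptions, and the machine names must match what the BSP layer actually supports.

```python
import shlex


def yocto_build_command(machine, distro="allwinner-distro-console",
                        image="test-server-image"):
    """Compose the shell line that builds the image for one SBC."""
    return (
        f"DISTRO={shlex.quote(distro)} MACHINE={shlex.quote(machine)} "
        f"source ./setup-environment.sh build && bitbake {shlex.quote(image)}"
    )


# Hypothetical list of boards in the farm
for machine in ("nanopi-k1-plus", "nanopi-neo2"):
    print(yocto_build_command(machine))
```

In a real pipeline the machine name would come from a pipeline variable instead of a hard-coded list, but the idea is the same.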

Note: You can also connect via ssh as root in the image without password if you know the IP. Since DHCP is on by default for the image then if you don’t want to use a USB to serial module then you can assign a static IP in your router to the SBC using its MAC address.

You may wonder here why I’ve used a Dockerfile to build the image that builds the Yocto image, instead of using Packer and Ansible like I did in the previous posts. Well, the only reason is that the Dockerfile already comes with the meta-allwinner-hx repo and it’s well tested and ready to be used. You can definitely use Packer to build your custom docker image that builds your Yocto image, but is that what you really want in this case? Think of the Dockerfile that is distributed with the meta-allwinner-hx Yocto layer as what your vendor or distributor may supply you with for another SBC. In this case it’s always better to use what you get from the vendor because you will probably get support when something goes wrong.

Therefore, don’t always create everything from scratch, but use whatever is available that makes your life easier. DevOps is not about creating everything from scratch but about using whatever is simpler for your case.

Also in this case, there’s something else that you need to consider. Since I don’t have powerful build servers to build Yocto images, I have to do it on my desktop workstation. Building a Yocto image may require more than 16GB of RAM and 50GB of disk space, and 8 cores are pretty much the minimum.

Also, using AWS EC2 to build a Yocto image doesn’t make sense either for this case, since this would require a very powerful EC2 instance, a lot of time and a lot of space.

So, in your specific case you’ll need quite a powerful build server if you want to integrate the Yocto image build in your CI/CD (which you should). For this post, though, I’ll use my workstation to build the image and I’ll skip the details of creating a CI/CD pipeline for building the Yocto image, as it’s pretty much the same as any other CI/CD case except that you’ll need a more powerful agent.

In case you need to have the Yocto docker image in a registry, then use Packer to build those images and upload them to a docker registry, or make a CI/CD pipeline that just builds the Yocto builder image from a Dockerfile (in this case the one in the meta-allwinner-hx repo) and then uploads that image to your remote/local registry. There are many possibilities; choose what is best and easiest for you.

Again, the specific details for this test case are not important. What’s important is the procedure and the architecture.

Setting up the DUT

Now that you’ve setup your test server it’s time to setup your DUT (the stm32f103 blue pill in this case). What’s important is how you connect the SBC and the STM32 as you need to power the STM32 and also connect the proper pins. The following table shows the connections you need to make (the pin number for nanopi-neo2 are here and for nanopi-k1-plus here).

SBC (nanopi-neo2/k1-plus) pin # : [function] | STM32F103 (blue pill) pin
1 : [SYS_3.3V]                               | 3V3
3 : [I2C0_SDA]                               | PB7
5 : [I2C0_SCL]                               | PB6
6 : [GND]                                    | GND
7 : [GPIOG11]                                | PA0

The ST-LINK USB module is connected to any available USB port of your SBC (the nanopi-neo2 has only one). Then you need to connect the pins of the module to the STM32. That’s straightforward: just connect the RST, SWDIO, SWCLK and GND of the module to the same pins on the STM32 (the SWDIO, SWCLK and GND are in the connector on the back and only RST is near PB11). Do not connect the 3V3 power of the ST-LINK module to the STM32 as it’s already getting power from the SBC’s SYS_3.3V.

Since we’re using the I2C bus we need two external pull-up resistors on the PB7 and PB6 pins. Just connect each resistor between its pin and one of the available 3V3 pins of the STM32.

Finally, for debugging purposes you may want to connect the serial ports of the SBC and the STM32 to two USB-to-UART modules. I’ve done this for debugging and it’s generally useful while you’re setting up your farm. When everything is working as expected you can remove them and only re-connect them when troubleshooting an agent in your farm.

As I’ve mentioned before, I won’t use the same STM32 template firmware I’ve used in the previous posts, but this time I’ll use a firmware that actually does something. The firmware code is located here:


Although the README file contains a lot of info about the firmware, I’ll give a brief description of it. This firmware makes the stm32f103 function as an I2C GPIO expander, like the ones that are used by many Linux SBCs. The MCU is connected to the SBC via I2C and a GPIO, so the SBC is able to configure the stm32 pins using the I2C protocol and then also control the GPIO. The protocol is quite simple and it’s explained in detail in the README.

If you want to experiment with building the firmware locally on your host or the SBC you can do it, so you get a bit familiar with the firmware, but it’s not necessary from now on as everything will be done by the CI pipeline.

Automated tests

One last thing before we get to testing the farm, is to explain what tests are in the firmware repo and how they are used. Let’s go back again in the firmware repo here:


In this repo you’ll see a folder that is named tests. In there you’ll find all the test scripts that the CI pipeline will use. The pipeline configuration yaml file is in the root folder of the repo and it’s named .gitlab-ci.yml, as in the previous posts. In this file you’ll see that there are 3 stages, which are build, flash and test. The build stage will build the code, the flash stage will flash the binary on the STM32 and finally the test stage will run a test using the robot-framework. Although I’m using robot-framework here, that could be any other framework or even plain scripts.

The robot test is the `tests/robot-tests/stm32-fw-test.robot` file. If you open this file you’ll see that it uses two Libraries. Those libraries are just python classes that are located in `tests/`, specifically `tests/STM32GpioExpander.py` and `tests/HostGpio.py`, so open those files and have a look in there, too.

The `tests/STM32GpioExpander.py` class implements the specific I2C protocol that the STM32 firmware supports to configure the GPIOs. This class uses the python smbus package, which is installed by the meta-test-server Yocto layer. You’ll also see that this object can probe the I2C bus to find out if there’s an STM32 device running the firmware. This is done by reading the 0x00 and 0x01 registers that contain the magic word 0xBEEF (sorry no vegan hex). If 0xBEEF is found then it means that the board is there and running the correct firmware. You could also add a firmware version (which you should), but I haven’t in this case. You’ll also find functions to read/write registers, set the configuration of the available GPIOs and set the pin values.
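To give an idea of what such a probe could look like, here is a minimal sketch. It’s not the actual code from `tests/STM32GpioExpander.py`; the function name, the byte order of the two registers and the `FakeBus` stand-in are my assumptions. The only real API it relies on is the `read_byte_data(address, register)` call that smbus bus objects provide.

```python
MAGIC_WORD = 0xBEEF


def probe(bus, i2c_addr):
    """Return True if registers 0x00/0x01 hold the magic word."""
    high = bus.read_byte_data(i2c_addr, 0x00)  # assumed: 0x00 holds the high byte
    low = bus.read_byte_data(i2c_addr, 0x01)   # assumed: 0x01 holds the low byte
    return ((high << 8) | low) == MAGIC_WORD


class FakeBus:
    """Stand-in for an smbus bus object so the sketch runs without hardware."""

    def __init__(self, registers):
        self.registers = registers

    def read_byte_data(self, i2c_addr, register):
        return self.registers.get(register, 0x00)


print(probe(FakeBus({0x00: 0xBE, 0x01: 0xEF}), 0x08))  # True
print(probe(FakeBus({}), 0x08))                         # False
```

On the SBC you would pass a real bus object (e.g. one created with `smbus.SMBus(0)`, assuming the device sits on I2C bus 0) instead of the fake one.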

Next, the `tests/HostGpio.py` class exposes a python interface to configure and control the SBC’s GPIOs. Keep in mind that when you use the SBC’s GPIOs via sysfs in Linux, you need some additional udev rules in your OS for python to have the proper permissions. These rules are currently provided by the meta-allwinner-hx BSP layer in the file `meta-allwinner-hx/recipes-extended/udev/udev-python-gpio/60-python-gpio-permissions.rules`. Therefore, in case you use your own image you need to add this rule to your udev rules, otherwise the GPIOs won’t be available to this python class.
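For those who haven’t used the sysfs GPIO interface before, the sketch below shows the general idea of reading an input pin through it. Again, this is not the code from `tests/HostGpio.py`; the class and method names are made up for illustration. The `root` argument exists only so that the class can be pointed to a fake directory when you try it on your workstation.

```python
from pathlib import Path


class SysfsGpio:
    """Tiny sysfs GPIO helper (illustrative, not the HostGpio class)."""

    def __init__(self, pin, root="/sys/class/gpio"):
        self.pin = pin
        self.root = Path(root)
        self.gpio_dir = self.root / f"gpio{pin}"

    def export(self):
        # The kernel creates gpio<N>/ after the pin number is written to `export`
        if not self.gpio_dir.exists():
            (self.root / "export").write_text(str(self.pin))

    def set_input(self):
        (self.gpio_dir / "direction").write_text("in")

    def read(self):
        return int((self.gpio_dir / "value").read_text().strip())
```

On the SBC the usage would be something like `gpio = SysfsGpio(203)`, then `gpio.export()`, `gpio.set_input()` and `gpio.read()`. Without the udev rules mentioned above, those file writes would fail with a permission error for a non-root user.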

The rest of the python files are just helpers, so the above two files are the important ones. Back to the robot test script you can see the test cases, which I list here for convenience.

*** Test Cases ***
Probe STM32
     ${probe} =     stm32.probe
     Log  STM32GpioExpander exists ${probe}

Config output pin
     ${config} =    stm32.set_config    ${0}  ${0}  ${0}  ${1}  ${0}  ${0}
     Log  Configured gpio[0] as output as GPOA PIN0

Set stm32 pin high
     ${pin} =  stm32.set_pin  ${0}  ${1}
     Log  Set gpio[0] to HIGH

Read host pin
     host.set_config     ${203}  ${0}  ${0}
     ${pin} =  host.read_pin  ${203}
     Should Be Equal     ${pin}  1
     Log  Read gpio[0] value: ${pin}

Reset stm32 pin
     ${pin} =  stm32.set_pin  ${0}  ${0}
     Log  Set gpio[0] to LOW

Read host pin 2
     host.set_config     ${203}  ${0}  ${0}
     ${pin} =  host.read_pin  ${203}
     Should Be Equal     ${pin}  0
     Log  Read gpio[0] value: ${pin}

The first test is called “Probe STM32” and it just probes the STM32 device and reads the magic word (or the preamble if you prefer). This will actually call the STM32GpioExpander.probe() function which returns True if the device exists, otherwise False. The next test “Config output pin” will configure the STM32’s PA0 pin as output. The next test “Set stm32 pin high” will set PA0 to HIGH. The next test “Read host pin” will configure the SBC’s GPIOG11 (pin #203) as input and then check whether the input, which is connected to the STM32 output PA0, is indeed HIGH. If the input reads “1” then the test passes. The next test “Reset stm32 pin” will set PA0 to LOW (“0”). Finally, the last test “Read host pin 2” will read the GPIOG11 pin again and verify that it’s low.
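In case you wonder where the number 203 comes from: on Allwinner SoCs the Linux GPIO number is normally derived from the port letter and the pin index, with 32 pins per port. A small sketch of that calculation follows; the function name is mine and the pin is written in the usual `PG11` form.

```python
def sunxi_gpio_number(pin_name):
    """Convert an Allwinner pin name like 'PG11' to its Linux GPIO number."""
    bank = ord(pin_name[1].upper()) - ord("A")  # PA -> 0, PB -> 1, ...
    return bank * 32 + int(pin_name[2:])


print(sunxi_gpio_number("PG11"))  # 203
```

Port G is the 7th port (index 6), so 6 * 32 + 11 gives 203, which is the number used in the robot test.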

If all tests pass then it means that the firmware is flashed, the I2C bus is working, the protocol is working for both sides and the GPIOs are working properly, too.

Keep in mind that the voltage levels for the SBC and the STM32 must be the same (3V3 in this case), so don’t connect a 5V device to the SBC; if you must, then you need to use a voltage level shifter IC. Also you may want to connect a 5-10KΩ resistor in series between the two pins, just in case. I didn’t but you should :p
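To see why a resistor in that range is a reasonable choice, assume the worst case where both pins are configured as outputs by mistake and drive opposite levels (this scenario is my assumption of what the “just in case” covers). Then the resistor is the only thing limiting the current, and Ohm’s law gives:

```python
def worst_case_current_ma(voltage, resistance_ohm):
    """Ohm's law: current in mA through the series resistor."""
    return voltage / resistance_ohm * 1000


for resistance in (5_000, 10_000):
    current = worst_case_current_ma(3.3, resistance)
    print(f"{resistance} ohm -> {current:.2f} mA")
```

That’s well below 1 mA in both cases, so a wrong pin configuration in a test shouldn’t stress either device.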

From the .gitlab-ci.yml you can see that the robot command is the following:

robot --pythonpath . --outputdir ./results -v I2C_SLAVE:0x08 -v BOARD:nanopi_neo2 -t "Probe STM32" -t "Config output pin" -t "Set stm32 pin high" -t "Read host pin" -t "Reset stm32 pin" -t "Read host pin 2" ./robot-tests/

Yes, this is a long command. It passes the I2C slave address and the board as parameters, followed by the list of tests to run. The I2C address can be omitted, as it’s hard-coded in the robot test for backwards compatibility: in pre-3.1.x versions all parameters are passed as strings, and this won’t work with the STM32GpioExpander class. The robot command only returns 0 (which means no errors) if all tests pass.
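One way to cope with a parameter that may arrive either as a string or as a number is to normalise it before it reaches the I2C code. The helper below is only a sketch; the name to_i2c_address is made up for this example and is not part of the repo. The 0x08 to 0x77 range check reflects the usual non-reserved 7-bit I2C address range.

```python
def to_i2c_address(value):
    """Accept an int or a string such as '0x08' or '8' and return an int."""
    if isinstance(value, int):
        address = value
    else:
        # Base 0 makes int() honour the 0x prefix as well as plain decimals.
        address = int(str(value).strip(), 0)
    if not 0x08 <= address <= 0x77:
        raise ValueError(f"not a usable 7-bit I2C address: {address:#04x}")
    return address


print(to_i2c_address("0x08"), to_i2c_address(8))
```

With something like this in the keyword library, the value given with -v I2C_SLAVE:0x08 would work whether the framework hands it over as text or as a number.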

A couple of notes here. This robot command runs all the tests in sequence and continues with the next test even if the previous one fails. You may not like this behavior; you can change it by calling robot once per test and skipping the remaining tests as soon as one fails. Anyway, at this point you know what’s the best scenario for you, but keep that in mind when you write your tests.
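The “one robot call per test, stop at the first failure” idea could look roughly like the following wrapper. This is a sketch, not something from the repo: build_command and run_until_first_failure are names invented here, and the runner is injectable so the logic can be tried without the hardware. The robot options are the same ones used in the command above.

```python
import subprocess

TESTS = [
    "Probe STM32",
    "Config output pin",
    "Set stm32 pin high",
    "Read host pin",
    "Reset stm32 pin",
    "Read host pin 2",
]


def build_command(test_name):
    """Build the robot command line for a single test."""
    return [
        "robot", "--pythonpath", ".", "--outputdir", "./results",
        "-v", "I2C_SLAVE:0x08", "-v", "BOARD:nanopi_neo2",
        "-t", test_name, "./robot-tests/",
    ]


def run_until_first_failure(tests, run=None):
    """Run the tests one by one; return (exit_code, tests_executed)."""
    if run is None:
        # Default runner: really call robot and use its return code.
        run = lambda cmd: subprocess.run(cmd).returncode
    executed = []
    for name in tests:
        executed.append(name)
        code = run(build_command(name))
        if code != 0:
            return code, executed   # skip the remaining tests
    return 0, executed
```

Note that with this approach every call writes to the same ./results folder, so you would probably want a separate output directory per test if you care about keeping all the reports.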

Testing the farm!

When everything is ready to go then you just need to push a change to your firmware repo or manually trigger the pipeline. At that point you have your farm up and running. This is a photo of my small testing farm.

On the left side is the nanopi-k1-plus and on the right side is the nanopi-neo2. Both are running the same Yocto image, have an internet connection, have an ST-LINK v2 connected on the USB port and the I2C and GPIO signals are connected between the SBC and the STM32. Also the two UART ports of the SBCs are connected to my workstation for debugging. Although you see only two test servers in the photo you could have as many as you like in your farm.

Next step is to make sure that the gitlab-runner is running on both SBCs and that they’re registered as runners in the repo. Normally, the gitlab-runner is a service in the Yocto image that runs a script which registers the SBC to the project automatically if it isn’t registered already. To make this happen, you need to add two lines to your local.conf file when you build the Yocto image.

GITLAB_REPO = "stm32f103-i2c-io-expander"
GITLAB_TOKEN = "4sFaXz89yJStt1hdKM9z"

Of course, you should replace those values with your own. Just as a reminder, although I’ve explained in the previous post how to get those values, the GITLAB_REPO is just your repo name and the GITLAB_TOKEN is the token that you get in the “Settings -> CI/CD -> Runners” tab in your gitlab repo.

In case that the gitlab-runner doesn’t run in your SBC for any reason then just run this command in your SBC’s terminal


In my case everything seems to be working fine, therefore in my gitlab project I can see this in the “Settings -> CI/CD -> Runners” page.

Note: You need to disable the shared runners in the “Settings -> CI/CD -> Runners” page for this to work, otherwise any random runner will pick the job and fail.

The first one is the nanopi-k1-plus and the second one is the nanopi-neo2. You can verify those hashes in your /etc/gitlab-runner/config.toml file in each SBC in the token parameter.

Next I’m triggering the pipeline manually to see what happens. In the gitlab web interface (after a few failures until I got the pipeline up and running) I see that there are 3 stages and that the build stage has started. The first job is picked up by the nanopi-neo2 and you can see the log here. This is part of this log:

1 Running with gitlab-runner 12.6.0~beta.2049.gc941d463 (c941d463)
2   on nanopi-neo2 _JXcYZdJ
3 Using Shell executor... 00:00
5 Running on nanopi-neo2... 00:00
7 Fetching changes with git depth set to 50... 00:03
8 Reinitialized existing Git repository in /home/root/builds/_JXcYZdJ/0/dimtass/stm32f103-i2c-io-expander/.git/
9 From https://gitlab.com/dimtass/stm32f103-i2c-io-expander
10  * [new ref]         refs/pipelines/106633593 -> refs/pipelines/106633593

141 [100%] Built target stm32f103-i2c-io-expander.bin
142 [100%] Built target stm32f103-i2c-io-expander.hex
143 real	0m14.413s
144 user	0m35.717s
145 sys	0m4.805s
148 Creating cache build-cache... 00:01
149 Runtime platform                                    arch=arm64 os=linux pid=4438 revision=c941d463 version=12.6.0~beta.2049.gc941d463
150 build-stm32/src: found 55 matching files           
151 No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally. 
152 Created cache
154 Uploading artifacts... 00:03
155 Runtime platform                                    arch=arm64 os=linux pid=4463 revision=c941d463 version=12.6.0~beta.2049.gc941d463
156 build-stm32/src/stm32f103-i2c-io-expander.bin: found 1 matching files 
157 Uploading artifacts to coordinator... ok            id=392515265 responseStatus=201 Created token=dXApM1zs
159 Job succeeded

This means that the build succeeded, it took 14.413s and the artifact is uploaded. It’s interesting also that the runtime platform is arch=arm64 because the gitlab-runner is running on the nanopi-neo2, which is an ARM 64-bit CPU.

Next stage is the flash stage and the log is here. This is a part of this log:

1 Running with gitlab-runner 12.6.0~beta.2049.gc941d463 (c941d463)
2   on nanopi-neo2 _JXcYZdJ
3 Using Shell executor... 00:00
5 Running on nanopi-neo2... 00:00
7 Fetching changes with git depth set to 50... 00:03
8 Reinitialized existing Git repository in /home/root/builds/_JXcYZdJ/0/dimtass/stm32f103-i2c-io-expander/.git/
9 Checking out bac70245 as master...

30 Downloading artifacts for build (392515265)... 00:01
31 Runtime platform                                    arch=arm64 os=linux pid=4730 revision=c941d463 version=12.6.0~beta.2049.gc941d463
32 Downloading artifacts from coordinator... ok        id=392515265 responseStatus=200 OK token=dXApM1zs
34 $ st-flash --reset write build-stm32/src/stm32f103-i2c-io-expander.bin 0x8000000 00:02
35 st-flash 1.5.1
36 2020-01-02T14:41:27 INFO common.c: Loading device parameters....
37 2020-01-02T14:41:27 INFO common.c: Device connected is: F1 Medium-density device, id 0x20036410
38 2020-01-02T14:41:27 INFO common.c: SRAM size: 0x5000 bytes (20 KiB), Flash: 0x10000 bytes (64 KiB) in pages of 1024 bytes
39 2020-01-02T14:41:27 INFO common.c: Attempting to write 12316 (0x301c) bytes to stm32 address: 134217728 (0x8000000)
40 Flash page at addr: 0x08003000 erased
41 2020-01-02T14:41:28 INFO common.c: Finished erasing 13 pages of 1024 (0x400) bytes
42 2020-01-02T14:41:28 INFO common.c: Starting Flash write for VL/F0/F3/F1_XL core id
43 2020-01-02T14:41:28 INFO flash_loader.c: Successfully loaded flash loader in sram
44  13/13 pages written
45 2020-01-02T14:41:29 INFO common.c: Starting verification of write complete
46 2020-01-02T14:41:29 INFO common.c: Flash written and verified! jolly good!
51 Job succeeded

Nice, right? The log shows that this stage cleaned up the previous job, downloaded the firmware artifact and then used the st-flash tool and the ST-LINKv2 to flash the binary to the STM32, and that this was successful.
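The numbers in the st-flash log can be cross-checked with a bit of arithmetic. The sketch below only uses values that appear in the log (12316 bytes, 1024-byte pages, base address 0x8000000); the variable names are just for this example.

```python
import math

FIRMWARE_BYTES = 12316       # "Attempting to write 12316 (0x301c) bytes"
PAGE_BYTES = 1024            # "in pages of 1024 bytes"
FLASH_BASE = 0x08000000      # address passed to st-flash

# A partially used page still has to be erased and written as a whole.
pages = math.ceil(FIRMWARE_BYTES / PAGE_BYTES)
last_page_addr = FLASH_BASE + (pages - 1) * PAGE_BYTES

print(pages, hex(last_page_addr))
```

This agrees with the “13 pages” and the last erased page at 0x08003000 that st-flash reports.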

Finally, the last stage is the run_robot stage and the log is here. This is part of this log:

1 Running with gitlab-runner 12.6.0~beta.2049.gc941d463 (c941d463)
2   on nanopi-neo2 _JXcYZdJ
3 Using Shell executor... 00:00
5 Running on nanopi-neo2...

32 Downloading artifacts for build (392515265)... 00:01
33 Runtime platform                                    arch=arm64 os=linux pid=5313 revision=c941d463 version=12.6.0~beta.2049.gc941d463
34 Downloading artifacts from coordinator... ok        id=392515265 responseStatus=200 OK token=dXApM1zs
36 $ cd tests/ 00:02
37 $ robot --pythonpath . --outputdir ./results -v I2C_SLAVE:0x08 -v BOARD:nanopi_neo2 -t "Probe STM32" -t "Config output pin" -t "Set stm32 pin high" -t "Read host pin" -t "Reset stm32 pin" -t "Read host pin 2" ./robot-tests/
38 ==============================================================================
39 Robot-Tests                                                                   
40 ==============================================================================
41 Robot-Tests.Stm32-Fw-Test :: This test verifies that there is an STM32 board  
42 ==============================================================================
43 Probe STM32                                                           | PASS |
44 ------------------------------------------------------------------------------
45 Config output pin                                                     | PASS |
46 ------------------------------------------------------------------------------
47 Set stm32 pin high                                                    | PASS |
48 ------------------------------------------------------------------------------
49 Read host pin                                                         | PASS |
50 ------------------------------------------------------------------------------
51 Reset stm32 pin                                                       | PASS |
52 ------------------------------------------------------------------------------
53 Read host pin 2                                                       | PASS |
54 ------------------------------------------------------------------------------
55 Robot-Tests.Stm32-Fw-Test :: This test verifies that there is an S... | PASS |
56 6 critical tests, 6 passed, 0 failed
57 6 tests total, 6 passed, 0 failed
58 ==============================================================================
59 Robot-Tests                                                           | PASS |
60 6 critical tests, 6 passed, 0 failed
61 6 tests total, 6 passed, 0 failed
62 ==============================================================================
63 Output:  /home/root/builds/_JXcYZdJ/0/dimtass/stm32f103-i2c-io-expander/tests/results/output.xml
64 Log:     /home/root/builds/_JXcYZdJ/0/dimtass/stm32f103-i2c-io-expander/tests/results/log.html
65 Report:  /home/root/builds/_JXcYZdJ/0/dimtass/stm32f103-i2c-io-expander/tests/results/report.html
70 Job succeeded

Great. This log shows that the robot test ran on the SBC and all the tests were successful. Therefore, because all the tests passed, I get this:

Green is the favorite color of DevOps.

Let’s summarize what was achieved. I’ve mentioned earlier that I’ll test two different architectures. This is the second one, in which the test server is also the agent that builds the firmware and then also performs the firmware flashing and testing. This is a very powerful architecture, because a single agent runs all the different stages, therefore it’s easy to add more agents without worrying about adding more builders in the cloud. On the other hand, the build stage is slower and it would be much easier to add more builders in the cloud rather than adding SBCs. Of course, cloud builders can only make the firmware build faster, therefore the bottleneck will always be the flashing and testing stages.

What you should keep from this architecture example is that everything is included in a single runner (or agent or node). There are no external dependencies and everything is running in one place. You will decide if that’s good or bad for your test farm.

Testing the farm with cloud builders

Next I’ll implement the first architecture example that I’ve mentioned earlier, which is that one:

In the above case I’ll create an AWS EC2 instance to build the firmware in the same way that I did it in the previous post here. The other two stages (flashing and testing) will be executed on the SBCs. So how do you do that?

Gitlab CI supports tags for the runners, so you can specify which stage runs on which runner. You can have multiple runners with different capabilities and each runner type can have a custom tag that defines its capabilities. Therefore, I’ll use a different tag for the AWS EC2 instance and for the SBCs, and each stage in the .gitlab-ci.yml will execute on a runner with the specified tag. This is a very powerful feature, as you can have agents with different capabilities that serve multiple projects.

Since I don’t want to change the default .gitlab-ci.yml pipeline in my repo, I’ll post the yaml file you need to use here:

    image:
        name: dimtass/stm32-cde-image:0.1
        entrypoint: [""]

    variables:
        GIT_SUBMODULE_STRATEGY: recursive

    stages:
        - build
        - flash
        - test

    build:
        tags:
            - stm32-builder
        stage: build
        script: time TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON USE_DBGUART=ON SRC=src ./build.sh
        cache:
            key: build-cache
            paths:
            - build-stm32/src
        artifacts:
            paths:
            - build-stm32/src/stm32f103-i2c-io-expander.bin
            expire_in: 1 week

    flash:
        tags:
            - test-farm
        stage: flash
        script: st-flash --reset write build-stm32/src/stm32f103-i2c-io-expander.bin 0x8000000
        cache:
            key: build-cache

    run_robot:
        tags:
            - test-farm
        stage: test
        script:
            - cd tests/
            - robot --pythonpath . --outputdir ./results -v I2C_SLAVE:0x08 -v BOARD:nanopi_neo2 -t "Probe STM32" -t "Config output pin" -t "Set stm32 pin high" -t "Read host pin" -t "Reset stm32 pin" -t "Read host pin 2" ./robot-tests/

I’ll use this temporarily in the repo for testing, but you won’t find this file, as I’ll revert it back.

In this file you can see that each stage now has a tags entry that lists which runners can execute it. The Yocto image doesn’t set tags for the runners, therefore for this test I’ll remove the previous runners using the web interface and then run these commands on each test server:

gitlab-runner verify --delete
gitlab-runner register

Then you need to enter each parameter like that:

Please enter the gitlab-ci coordinator URL (e.g. https://gitlab.com/):
Please enter the gitlab-ci token for this runner:
Please enter the gitlab-ci description for this runner:
[nanopi-neo2]: nanopi-neo2
Please enter the gitlab-ci tags for this runner (comma separated):
Registering runner... succeeded                     runner=4sFaXz89
Please enter the executor: parallels, shell, ssh, virtualbox, docker+machine, docker-ssh+machine, docker, docker-ssh, kubernetes, custom:
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded!

From the above the important bits are the description (in this case nanopi-neo2) and the tags (in this case test-farm).

Next, I did the same on the nanopi-k1-plus. Please ignore these manual steps; they will be scripted or included in the Yocto image. This is only for testing and to demonstrate the architecture.

Finally, to build the AWS EC2 instance I ran these commands:

git clone git@bitbucket.org:dimtass/stm32-cde-template.git
cd stm32-cde-template
packer build stm32-cde-aws-gitlab-runner.json
ln -sf Vagrantfile_aws Vagrantfile

Now edit the Vagrantfile and fill in the aws_keypair_name and aws_ami_name, according to the ones you have in your web EC2 management console (I’ve also explained those steps in the previous post here).

vagrant up
vagrant ssh
ps -A | grep gitlab

The last command will display a running instance of the gitlab-runner, but in this case it is registered to the stm32-cmake-template project of the previous post! That’s because this image was made for that project. Normally I would need to change the Ansible scripts and point them to the new repo, but for the demonstration I will just stop the instance, remove the runner and add a new one that points to this post’s repo.

So, first I killed the gitlab process in the aws image like this:

ubuntu@ip-172-31-43-158:~$ ps -A | grep gitlab
 1188 ?        00:00:00 gitlab-runner
ubuntu@ip-172-31-43-158:~$ sudo kill -9 1188
ubuntu@ip-172-31-43-158:~$ gitlab-runner verify --delete
ubuntu@ip-172-31-43-158:~$ gitlab-runner register

And then I’ve filled the following:

Please enter the gitlab-ci coordinator URL (e.g. https://gitlab.com/):
Please enter the gitlab-ci token for this runner:
Please enter the gitlab-ci description for this runner:
[ip-172-31-43-158]: aws-stm32-agent
Please enter the gitlab-ci tags for this runner (comma separated):
Registering runner... succeeded                     runner=4sFaXz89
Please enter the executor: custom, docker, docker-ssh, shell, virtualbox, docker+machine, kubernetes, parallels, ssh, docker-ssh+machine:
Runner registered successfully. Feel free to start it, but if it's running already the config should be automatically reloaded! 

The important bits here are again the description, which is `aws-stm32-agent`, and the tag, which is `stm32-builder`. Therefore, you see that the tag of this runner is the same as the tag in the build stage in the .gitlab-ci.yml script, and the tags of the SBCs are the same as the tags of the flash and test stages.

That’s it! You can now verify the above in the “Settings -> CI/CD -> Runners” page in the repo. In my case I see this:

In the image above you see that now each runner has a different description and the nanopi-neo2 and nanopi-k1-plus have the same tag which is test-farm. You can of course do this with a script similar to the `meta-test-server/recipes-testtools/gitlab-runner/gitlab-runner/initialize-gitlab-runner.sh` in the meta-test-server image.

Here is the successful CI pipeline in which the build was done on the AWS EC2 instance and the flash and test stages were executed on the nanopi-k1-plus. You can see from this log here that the build was done on AWS; I’m pasting a part of the log:

1 Running with gitlab-runner 12.6.0 (ac8e767a)
2   on aws-stm32-agent Qyc1_zKi
3 Using Shell executor... 00:00
5 Running on ip-172-31-43-158... 00:00
7 Fetching changes with git depth set to 50... 00:01
8 Reinitialized existing Git repository in /home/ubuntu/builds/Qyc1_zKi/0/dimtass/stm32f103-i2c-io-expander/.git/

140 real	0m5.309s
141 user	0m4.248s
142 sys	0m0.456s
145 Creating cache build-cache... 00:00
146 Runtime platform                                    arch=amd64 os=linux pid=3735 revision=ac8e767a version=12.6.0
147 build-stm32/src: found 55 matching files           
148 No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally. 
149 Created cache
151 Uploading artifacts... 00:03
152 Runtime platform                                    arch=amd64 os=linux pid=3751 revision=ac8e767a version=12.6.0
153 build-stm32/src/stm32f103-i2c-io-expander.bin: found 1 matching files 
154 Uploading artifacts to coordinator... ok            id=393617901 responseStatus=201 Created token=yHW5azBz
156 Job succeeded

In line 152, you see that the arch is amd64, which is correct for the AWS EC2 image. Next here and here are the logs for the flash and test stages. In those logs you can see the runner name now is the nanopi-k1-plus and the arch is arm64.

Also in this pipeline here, you see from this log here that the same AWS EC2 instance as before did the firmware build. Finally, from this log here and here, you can verify that the nanopi-neo2 runner flashed its attached STM32 and ran the robot test on it.

Now I’ll try to explain the whole process in a simple way. First, the pipeline was triggered on the web interface. Then the gitlab server parsed the .gitlab-ci.yml and found that there are 3 stages and that each stage has a tag. Therefore, it stopped at the first stage and waited for any runner with that tag to poll the server. Let’s assume that the gitlab-runner on one of the SBCs polled the server first (by default, every runner polls every 3 secs). When the SBC connected to the server, it posted the repo name, the token, a few other details and also the runner tag. The server compared the tag of the runner with the tag of the stage, saw that they were different, and so it didn’t execute the stage on that runner and closed the connection.

After 1-2 secs the gitlab-runner in the AWS EC2 instance with the same tag as the build stage connects to the gitlab server. Then the server verifies both the runner and the tag and then starts to execute commands to the runner via a tunnel. First it git clones the repo to a temporary folder in the AWS EC2 instance storage and then executes the script line. Then the firmware is built and the server downloads the artifact that was requested in the yaml file (see: artifacts in the build stage in .gitlab-ci.yml).

Now that the build stage is successful and the artifact is on the server, the server initiates the next stage and waits for any runner with the test-farm tag. Of course, the EC2 instance can’t execute this stage as it has a different tag. Then at some point one of the SBCs connects and the server takes control of the SBC and executes the flash stage using a tunnel. In this case, because both an actual ST-LINKv2 USB module and the STM32 are connected, the server uploads the firmware artifact to the SBC, which then flashes it to the STM32.

Finally, in the test stage the server uses the same runner, since it has the proper tag, and executes the robot test via the tunnel between the server and the runner. Then all tests pass and the success result is returned.

That’s it. This is pretty much the whole process that happened in the background. Of course, the same process happened in the first architecture with the only difference that because there wasn’t a tag in the stages and the runners, then all the stages were running on the same runner.
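The tag matching described above can be sketched in a few lines of Python. This is only a toy model of the rule “a runner may take a job when it carries every tag the job asks for”; the function names are invented for the illustration and the model ignores everything else the server checks (tokens, the “run untagged jobs” setting, protected branches and so on).

```python
def can_run(runner_tags, job_tags):
    """A runner is eligible when it has every tag the job requires."""
    return set(job_tags) <= set(runner_tags)


# Runner descriptions and tags as registered in this post.
RUNNERS = {
    "aws-stm32-agent": ["stm32-builder"],
    "nanopi-neo2": ["test-farm"],
    "nanopi-k1-plus": ["test-farm"],
}

# Stage names and tags as used in the .gitlab-ci.yml of this post.
STAGES = [
    ("build", ["stm32-builder"]),
    ("flash", ["test-farm"]),
    ("test", ["test-farm"]),
]


def eligible_runners(job_tags, runners=RUNNERS):
    """Return the names of all runners allowed to pick up the job."""
    return sorted(name for name, tags in runners.items()
                  if can_run(tags, job_tags))


for stage, tags in STAGES:
    print(stage, "->", eligible_runners(tags))
```

Whichever eligible runner polls the server first gets the job, which is why the build always lands on the EC2 instance while the flash and test stages land on one of the SBCs.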


Let’s see a few numbers now. The interesting benchmark is the build time on the AWS EC2 instance and on the two nanopi SBCs. In the following table you can see how much time the build took in each case.

Runner                              Time in secs
AWS EC2 (t2.micro – 1 core)         5.345s
Nanopi-neo2 (4 cores)               14.413s
Nanopi-k1-plus (4 cores)            12.509s
Ryzen 2700X (8-cores/16-threads)    1.914s

As you can see from the above table, the t2.micro x86_64 instance is a far faster builder than the nanopi SBCs, and of course my workstation (Ryzen 2700X) is faster still. Keep in mind that this is a very simple firmware that builds quite fast. Therefore, this is something you need to consider when you decide on your CI pipeline architecture: is build performance important for you?
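To put the table into perspective, the build times can be expressed relative to the EC2 instance. The snippet below only reuses the four measurements from the table; the relative_to helper is just a name picked for this example.

```python
BUILD_SECONDS = {
    "AWS EC2 (t2.micro)": 5.345,
    "Nanopi-neo2": 14.413,
    "Nanopi-k1-plus": 12.509,
    "Ryzen 2700X": 1.914,
}


def relative_to(baseline, timings=BUILD_SECONDS):
    """Express every build time as a multiple of the baseline's time."""
    base = timings[baseline]
    return {name: secs / base for name, secs in timings.items()}


ratios = relative_to("AWS EC2 (t2.micro)")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the EC2 build time")
```

So the SBCs need roughly 2.3 to 2.7 times as long as the t2.micro for this firmware, while the workstation needs a bit more than a third of the EC2 time.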


That was a long post! If I had to summarize all my conclusions from this post series in only one sentence, then this is it:

DevOps for Embedded is tough!

I think that I’ve made a lot of comments during those posts that verify this, and the truth is that this was just a simple example. Of course, most of the things I’ve demonstrated are the same regardless of the complexity of the project. The CI/CD architecture is not where it gets really complicated, as you have many available options and tools you can use. There’s a tool for everything that you might need! Of course you need to be careful with this… Only use what is simple to understand and maintain in the future, and avoid complicated solutions and tools. Also avoid using a ton of different tools to do something that you can do with a few lines of code in any programming language.

Always go for the simplest solution that looks like it can scale to your future needs. Do not assume that you won’t need to scale up in the future, even for the same project. Also do not assume that you have to find a single architecture for the whole project life, as this might change during the different project phases. This is why you need to follow one of the main best practices of DevOps, which is IaC (Infrastructure as Code). All your infrastructure needs to be code, so you can make changes easily and quickly and be able to revert to a known working state if something goes wrong. That way you can change your architecture easily even during the project.

I don’t know what your opinion is after consuming all the information in these posts, but personally I see room for cloud services in the embedded domain. You can use builders in the cloud and you can also run unit tests and mocks in the cloud. I believe it’s easier to maintain this kind of infrastructure and also to scale it up as you like using an orchestrator like Kubernetes.

On the other hand, the cloud agents can’t run tests on the real hardware and be part of a testing farm. In this case your test farm needs real hardware, like the examples I’ve demonstrated in this post. Running a testing farm is a huge and demanding task though, and the bigger the farm is, the more you need to automate everything. Automation needs to be everywhere. You saw that there are many things involved. Even the OS image needs to be part of the IaC. Use Yocto or buildroot or any distro (and provision it with Ansible). Make a CI pipeline for that image, too. Just try to automate everything.

Of course, there are things that can’t be automated. For example, if an STM32 board is malfunctioning, or a power supply dies, or even a single cable has a lot of resistance and creates weird issues, then you can’t use a script to fix this. You can’t write a script to create more agents in your farm. All those things are manual work and need time, but you can build your farm in a way that minimizes the problems and the time needed to add new agents or debug problems. Use jigs, for example. Use a 3D printer to create custom jigs that the SBC and the STM32 board fit into, with proper cable management, and also check your cables before the installation.

Finally, every project is different and developers have different needs. Find what the pain points are for your project early and automate the solution.

If you’re an embedded engineer and you’re reading this, I hope that you can see the need for DevOps in your project. And if you’re a DevOps engineer new to the embedded domain and reading this, then I hope you grasp the needs of this domain and how complicated it can be.

The next post will definitely be a stupid project. The last three were a bit serious… Until then,

Have fun!






Component database with LED indicators [Updated]


[Update: it’s been almost 10 months since I started using this project and it works great with no issues. The firmware is robust and the LED strips are great, too.]

Finally, a really stupid project! You know, I was worrying that this blog was getting too serious with all those posts about machine learning and DevOps. But now it’s time for a genuinely stupid project. This time I’m going to describe why and how I’ve built my personal component database and how I’ve extended it and made some modifications to make it more friendly.

Over the last 20 years my collection of components, various devices and peripherals has grown too much. Dozens of various MCUs, SBCs, prototype PCBs and hundreds of components like ICs, passive devices (resistors, capacitors, etc.), and even cables. That’s too much and very hard to organize and sort. Most EE and embedded engineers are already familiar with the problem. So, the usual solution to this is to buy component organizers with various sizes of storage compartments and sometimes even large drawers to fit the larger items. Then the next step is to use some invisible tape as labels on each drawer and write what’s inside.

While this is fine when you have a few components, when the list grows too much even that is not useful. Especially after some time, when you’ve forgotten where things are actually located, you need to start reading the labels on every drawer. Sometimes you even miss the right drawer and have to start again from the beginning. If you’re like me, this can get you angry after a few seconds of reading labels. You see, the thing is that when you start looking for a component it’s because you need it right now, as you’re in the middle of something; if you have to stop doing the important task and spend time searching for a component which is supposed to be organized, then you start getting pissed. At least I do.

Therefore, the solution is to have a database (DB) in which you can search for the component and get its location. Having such a database is very convenient, because you can also add additional information, like an image of the component and the datasheet, and finally create a web interface that is able to run queries on the DB and display all that information in a nice format.
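As an illustration of the idea, here is a minimal sketch of such a lookup using Python’s built-in sqlite3 module. The table layout, the column names and the sample rows are all assumptions made up for this example; they are not the schema of the actual database behind the web interface.

```python
import sqlite3


def open_demo_db():
    """Create an in-memory database with a hypothetical components table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE components (name TEXT, description TEXT, location TEXT)"
    )
    db.executemany(
        "INSERT INTO components VALUES (?, ?, ?)",
        [
            ("STM32F103C8T6", "ARM Cortex-M3 MCU", "A3"),
            ("ESP-01", "ESP8266 WiFi module", "B1"),
            ("WS2812B", "addressable RGB LED", "C2"),
        ],
    )
    return db


def find_component(db, term):
    """Return (name, location) for rows whose name or description match."""
    pattern = f"%{term}%"
    rows = db.execute(
        "SELECT name, location FROM components "
        "WHERE name LIKE ? OR description LIKE ? ORDER BY name",
        (pattern, pattern),
    )
    return rows.fetchall()


db = open_demo_db()
print(find_component(db, "stm32"))
```

The query uses bound parameters instead of string concatenation, which is worth keeping in mind for anything that takes its search term from a web form.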

For that reason, a few years back in 2013, I created such a database and a web interface for my components, and until now I was running this on my bananapi BPI-M1. Since then though, the inventory has kept growing and recently I’ve realized that the database is not enough and that I need to do something else in order to find the component I need a bit faster. The solution was already somewhere in the components I had, therefore I’ve used an ESP8266 module and an addressable RGB LED strip.

Before continuing with the actual project details, this is a video of the actual result that I got.

If what you’ve just seen didn’t make any sense, then let’s continue with the post.


This is a list of the components that I’ve used for this project


This is the old classic BPI-M1

You don’t have to use this exact SBC, you can use whatever you have. The reason I’ll use this one in this example is just because it’s my temporary file server and already runs a web server. BPI-M1 is the first banana-pi board and it’s known for its SATA interface, because at that time it was one of the very few boards that had a real SATA interface (and not a USB-to-SATA or similar). Not only that, but it also has GbE and you can achieve more than 60 MB/sec on a GbE network, which is quite amazing as most boards are much slower. I’m using this SBC as my home web server and home automation server, and also as temporary network storage for data that are not important, and I’m also running a bunch of services on it like the DDNS client and other stuff. Therefore, in my case I already had that SBC running in the house for many years. Since 2014 it had been running an old bananian distro and only recently I’ve updated it to one of the latest Armbian distros.

As I’ve mentioned, you can use whatever SBC you like/have, but I’ll explain the procedure for BPI-M1 and Armbian, which should be the same for any SBC running Armbian and almost the same for other distros.

Of course, you can use any web server on the internet if you already have one. I’ll explain later why there’s no problem hosting the web interface on a web server on the internet while at the same time running the ESP8266 in your local network.

Components organizer

This box has many names, such as components organizer, storage box with drawers, storage organizer, etc. Whatever the name is, this is how it looks:

There are many small plastic drawers in which you organize your components. I mean you know what it is, right? I won’t explain more. As you’ve seen in the video, when the LED turns on the drawer is also lit. This kind of white or transparent plastic (I don’t know the proper English name) is good for this purpose, as it gets illuminated with the color of the RGB light. So prefer those organizers over non-transparent ones.

WS2812B RGB LED strip

This is the ws2812b RGB LED strip I’ve used.

You probably are more familiar with this strip format.

But it’s important to use the first strip for the reasons I’ll explain in a bit.

The good ol’ ws2812b is one of the first addressable RGB LEDs and even today it’s just fine if you don’t need to do fancy animations or generally any fast blinking on large RGB strips. Also, it’s not really addressable, but it seems like that from the software perspective. The reason that the ws2812b isn’t really addressable is that, although you can control each LED individually, to change the color of one LED you need to send the values for all the LEDs that precede it on the strip (actually the colors up to the wanted index). That’s because the data are shifted from one LED to the next one in the chain/strip, so every 24 bits (3 colors, 8 bits per color) the data are shifted to the next LED.
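The “send everything up to the wanted index” behaviour can be shown with a small sketch that only builds the byte stream; it doesn’t deal with the strict bit timing the LEDs need on the wire. The build_frame function is a name made up for this example and is not taken from the firmware of this project. The only hardware fact it relies on is that the WS2812B expects each LED’s 24 bits in green, red, blue order.

```python
def build_frame(led_index, rgb, off=(0, 0, 0)):
    """Return the bytes to clock out so that LED number led_index shows rgb.

    led_index is 0-based; every LED before it is sent the 'off' color.
    """
    def grb(color):
        r, g, b = color
        return bytes((g, r, b))   # WS2812B wants green first

    # The first 24 bits stay in LED 0, the next 24 bits in LED 1, and so on,
    # so the wanted color has to come after led_index filler entries.
    return grb(off) * led_index + grb(rgb)


frame = build_frame(3, (255, 0, 0))   # red on the 4th LED
print(len(frame), frame.hex())
```

LEDs after the wanted index receive no new data, so they keep showing whatever they were showing before; to clear them you would have to send a frame that covers the whole strip.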

As you can imagine, this strip is placed at the back of the organizer in order to illuminate the drawers. The reason you need the first format is that there is some distance between each drawer and the next one. The first strip has 10 LEDs/meter, which means 1 LED every 10cm, which is just fine for almost any organizer (except the big bottom drawer). The second strip has 60 LEDs/meter, which means that there would be more than 1 LED behind every drawer, which is a great waste. So if you used the second strip you would either have to cut off every LED and resolder it using an extension cable, or skip LEDs and pay for a more expensive strip. Anyway, it doesn’t make sense to use the second strip, just use the first one.


The ESP8266 is an MCU with WiFi and there are many different modules with this MCU as you can see here. Also there are many different PCB boards that use some of those variations, for example the ESP-12F can be found on many different PCBs with the most known being the NodeMCU. In this project I’ll use the NodeMCU, because it’s easier for most people to use, as you can use a USB cable to power the module, flash it easily and do development using the UART for debugging. In reality though, since I have a dozen ESP-01 modules with 512KB and 1MB flash, I’m using one of those in my setup. I won’t get into the details for the ESP-01 though, because you need to make different connections in order to flash it. Since you only need 2 pins, one for the data output to the strip and one for resetting the firmware configuration settings to default, the ESP-01 would fit fine, but with some tinkering to make the pins usable.

Anyway, let’s assume that we’ll use the NodeMCU for now, which looks like this

Project details

Let’s see a few details for the project now. Let’s start with the project repo which is located here:


In there you’ll find a few different things. First is the www/ folder that contains the web interface, then there’s the esp8266-firmware/ folder that contains the firmware for the NodeMCU (or any ESP-12E) and finally there is a Dockerfile that builds an image with a web server that you can use for your tests. I’ll explain every bit in more detail later, but for now let’s focus on how everything is connected. This is the functional diagram.

As you can see from the above diagram there’s a web server (in this case Lighttpd) with PHP and SQLite3 support that listens on a specific port and IP address. This server can run on either a docker container (for testing or normal use) or an SBC. In my case I did all the development with docker and after everything was ready I uploaded the web interface to the BPI.

Then the diagram shows that any web browser in the network can access the web interface and execute SQL queries to display data. The web browser is also able to send aREST requests via POST to a remote device that runs an aREST server, which in this case is the ESP8266.

Finally, you see that the ESP8266 is also connected in the network and accepts aREST requests. Also it drives a ws2812b RGB LED strip.

So what happens in normal usage is that you open the web interface with your web browser. By default the web server will return a list with all the components in the database (you can change that in the code if you like). Then you can type the keyword you want to search for in the search text field. The keyword will be looked up in the part name, the description and the tag fields of the database’s records (more on the db records later), and the executed SQL query will return all the results and update the web page with the entries. Then, depending on the record, you can view the image of the item, download the datasheet, or if you’ve set an index in the Location field in the db table then you’ll also see a button with the label “Show”.

The image and the datasheet are stored on the web server in the images/ and datasheet/ folders. Actually the image that you see in the list is the thumbnail of the actual image and the thumbnails are stored in the thumbs/ folder. The reason for having thumbnails is to load the web interface faster, and if you click on a thumbnail then the actual image is loaded, which can have a much better resolution with finer detail. By default the thumbs are max 320×320 pixels, but you can also change that as you prefer.
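As a side note, fitting an image inside a maximum box like 320×320 usually means scaling it down while keeping the aspect ratio. The following Python sketch shows only that arithmetic. It is an assumption about how such a resize typically works, not a description of what the resize script in the repo does, and `thumb_size` is a made-up name.

```python
def thumb_size(width, height, max_w=320, max_h=320):
    """Return the (width, height) of a thumbnail that fits in max_w x max_h.

    The aspect ratio is preserved and images that are already small enough
    are left at their original size (the scale factor is capped at 1.0).
    """
    scale = min(max_w / width, max_h / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(thumb_size(1600, 1200))            # landscape photo
print(thumb_size(1000, 2000, 240, 240))  # portrait photo, smaller box
print(thumb_size(200, 100))              # already small enough
```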

When you press the “Show” button, a javascript function is executed in the web browser and an aREST request is posted to the IP address of the ESP8266. The ESP8266 then receives the POST and executes the corresponding function, one of which turns on a LED on the RGB strip using the index passed in the request. The LED will turn on and remain lit for a programmable time, which by default is 5 secs. It’s possible to have more than one LED lit at the same time and each LED has its own timer. You can also light all the LEDs at the same time with a programmable color in order to create a more romantic atmosphere in your working environment…

Finally, you can send aREST commands to change the ESP8266 configuration and program the available parameters, which I’ll list later.

Web interface

The web interface code is in the www/lab/ folder of the repo. In case you want to use it in your current running web server, then just copy the lab/ folder to your parent web directory. Although the web interface has quite a few files, it’s very basic and simple. You can ignore the css/ and the js/ folders as they just have helper scripts.

Since the aREST requests are sent from your web browser and not from the web server, it’s necessary that your device (web browser) is on the same network and subnet as the ESP8266. That also means the web interface itself can run anywhere, for example on your VPS (virtual private server) on the internet, and the aREST posts will work fine as long as you’re on the same network as the ESP8266.

The main file of the web interface is the `www/lab/index.php`. This is the file that displays the item list, sends SQL queries to the web server and also posts aREST commands to the ESP8266. In the $(document).ready function you can see that there’s some PHP code that first reads the ESP8266IP.txt file. This file is in the lab/ folder of the web server and it contains the IP of the ESP8266. It comes with a default IP, therefore you need to change that to the IP of your ESP8266 device. In the same function the web browser will try to detect if the ESP8266 is actually on the network and if it’s not then you’ll get a warning.

The turn_on_led(index) function sends a POST to the ESP8266 aREST server with the index of the LED that needs to turn on.

The HTML form with the id searchform is the form that handles the “Search part” area in the web interface. When you write a keyword in there and press the “Search” button, then this code is executed:

// by default list all the parts in the database
$sql = "SELECT * from tbl_parts";
// check if a submit is done (the exact condition is not shown here, this one is assumed)
if ($_SERVER['REQUEST_METHOD'] === 'POST') {
  // check for valid search string (again an assumed condition)
  if (!empty($_POST['part'])) {
    $part=$_POST['part'];	// get part
    $sql = 'SELECT * from tbl_parts WHERE Part LIKE "%'.$part.'%" OR Description LIKE "%'.$part.'%" OR Tags LIKE "%'.$part.'%"';
  }
}
echo '<br>Results for query: '.$sql.'<br><br>';

This will reload the content of the page and will create a list of items with the results. The code for that is just right below the above snippet in the index.php file. The last interesting code in this file is that one here:

if (!empty($row['Location']) && $ip) {
  echo '<button onclick="turn_on_led(' . $row['Location'] . ')">Show</button>';
}

This code will add the “Show” button if the proper field in the DB is set and the ESP8266 is on the network.
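One thing to keep in mind about the search query shown earlier is that the keyword is pasted directly into the SQL string, so a keyword with quotes in it can break or alter the query. On a home network that may be acceptable, but the usual fix is a parameterized query. The Python sketch below uses the standard `sqlite3` module and an in-memory table just to show the idea. The table and column names come from this post, while the column types, the sample rows and the `search_parts` helper are assumptions made for the example.

```python
import sqlite3


def search_parts(conn, keyword):
    """Search Part, Description and Tags with a bound LIKE pattern."""
    pattern = f"%{keyword}%"
    sql = ("SELECT Part, Description, Location FROM tbl_parts "
           "WHERE Part LIKE ? OR Description LIKE ? OR Tags LIKE ?")
    # The keyword is passed as a parameter, never concatenated into the SQL
    return conn.execute(sql, (pattern, pattern, pattern)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_parts "
             "(Part TEXT, Description TEXT, Tags TEXT, Location INTEGER)")
conn.executemany("INSERT INTO tbl_parts VALUES (?, ?, ?, ?)", [
    ("bluepill", "STM32F103 board", "stm32 mcu", 12),
    ("MCP2515", "CAN controller", "can spi", 40),
])

print(search_parts(conn, "stm"))            # normal keyword
print(search_parts(conn, '" OR "1"="1'))    # hostile keyword is just text
```

The same approach exists in PHP through prepared statements, so the web interface could be changed in a similar way.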

There are also two other pages that are important and these are the www/lab/upload.html and www/lab/upload.php files. These two pages are used to add new records to the database. The upload.html page is opened in the web browser when you press the “Add new part” button and this is what you see

There you fill in all the text boxes you want and also select an image file and a datasheet if you’d like to upload them to the server. When you’ve filled in all the data, press the “Add part to database” button and if everything runs smoothly then you’ll see this

This means that a new record was inserted in the database without error, and if you press the “Return” button then you get back to the main page. There’s a problem though. I wouldn’t use this way to create all the records in the database because it’s very slow and it will take a lot of time. Also, this way you can’t edit or update a record, therefore the preferred way to initially set up your database with many components is to use a program that can open and edit the database.

In my case I’m using Ubuntu, therefore I’ll explain how to do that with Linux, but in case you’re a Windows user you can also use the DB Browser for SQLite. To install it in Ubuntu just run this command:

sudo apt install sqlitebrowser

Then from your applications run the DB Browser for SQLite and when the program starts open the `www/lab/parts.db` and click on the “Browse Data” tab. In there you’ll see this

Now this is a much easier way to add records, but there’s also a drawback, because the files won’t be uploaded and the thumbnails won’t be created automatically. But there’s a solution for that, too.

So, the best way to fill your database is to run the docker web server, as this will help you to test and also add the records faster. I’ll explain how to run the docker image in the next section, so for now I assume that you’ve already done this, so I won’t interrupt the explanation and the process.

The DB has the following fields:

  • Part
  • Description
  • Datasheet
  • Image
  • Manufacturer
  • Qty
  • Tags
  • Location

This is an example of a record as shown in the web interface

In this record the bluepill is the Part field, the long description is the Description field and the image and datasheet icons are shown because I’ve added the file names in the Image and Datasheet fields. The Manufacturer and Qty fields are not really important and are not shown anywhere. The Tags field is used as an extra tag for DB SQL search queries and finally if you set a number in the Location field then the “Show” button is shown and that number is the index of the LED on the addressable RGB strip.

What is important is that you copy the image file in the www/lab/images folder and the datasheet pdf file in the www/lab/datasheets folder, but in the DB fields you only write the filename including the extension, for example blue-pill.png and blue-pill.pdf. Finally, after you fill the DB with all your parts and items, you can create the thumbs by running the `www/lab/scripts/batch_resize.sh` script from the parent directory of the repo like this

cd www/lab

This script will create the thumbnails in the www/lab/thumbs folder. You can change the default thumb size like this:

MAX_WIDTH=240 MAX_HEIGHT=240 ./tools/batch_resize.sh

After you’ve finished with the DB changes, then don’t forget to click the “Write Changes” button in the program and then you can upload all the files to your web server after you’ve done testing with the docker web server image. Remember that if you click on the image in the item list then the original image is loaded and opens in the browser.

Testing with the docker web server

As I’ve mentioned earlier, testing with the actual web server that runs on your SBC or web server is quite cumbersome as you first need to set up the server and then it’s still not very convenient to edit the files remotely or edit them locally and then upload them to test. You can do it if you like, but personally I would get tired with this soon. Therefore, I’ve added a Dockerfile in the repo which is located in the `docker-lighttpd-php7.4-sqlite3/` folder. To build the docker image run the following command

docker build --build-arg WWW_DATA_UID=$(id -u ${USER}) -t lighttpd-server docker-lighttpd-php7.4-sqlite3/

This command will build a docker image with the Lighttpd web server, PHP 7.4 and SQLite which can be used for your tests. After the image is built you also need the docker-compose tool, which is not installed with docker, so you need to install it separately with these commands

sudo curl -L "https://github.com/docker/compose/releases/download/1.25.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose

If you don’t use Linux then see here how to install docker-compose.

So, after you’ve built the image and you installed docker-compose then from the root directory of the repo you can run this command:

docker-compose up

After running this command you should see this output:

Creating lab-web-db_webserver_1 ... done
Attaching to lab-web-db_webserver_1
webserver_1  | [14-Jan-2020 14:43:13] NOTICE: fpm is running, pid 7
webserver_1  | [14-Jan-2020 14:43:13] NOTICE: ready to handle connections

This command will actually run a container with the web server and it will also mount the web interface folder in the /var/www/html/lab folder inside the container and will also override the /etc/lighttpd/lighttpd.conf and `/usr/local/etc/php/php.ini` files in the container. You can also open the `docker-compose.yml` file to see the exact configuration of the running container. Now to test the web server you need to get the IP of the server and to do that the easiest way is to run this command on a new console window/tab

docker-compose exec webserver /bin/sh

This command will get you in the container’s console, so you can run ifconfig and get the IP of the container, for example:

743e05812196:/var/www/html# ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:AC:12:00:02  
          inet addr:  Bcast:  Mask:
          RX packets:15 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:2430 (2.3 KiB)  TX bytes:0 (0.0 B)

In this case the IP of the container is the one shown in the inet addr field, which means that you can open the web interface in your web browser using that IP.

That’s it! Now you can edit your files with any text editor and add stuff in your DB and just refresh the web page in your browser to load the changes. That way the testing is done much easier.


Finally there’s the firmware for the ESP8266! I’ll try to explain a few things about the firmware here, but generally it’s very easy and straightforward. The firmware is in the `esp8266-firmware/` folder in the repo. I’ve used Visual Studio Code (VSC) and the platform.io (PIO) plugin to write the firmware as it’s very easy to do stupid projects fast and I’ve also used the Arduino framework with the ESP8266. The nice thing with the PIO plugin in VS Code is that you don’t need to find the libraries I’ve used yourself, as the plugin will handle those.

First open the `esp8266-firmware/` folder in VSC and then install the PIO plugin from the extensions in the left menu of the IDE. When you do that, then click on the alien head in the left menu, which opens the PIO menu in which you see the project tasks. Press the “Build” and see what happens. If that builds OK, then you’re good to go, but if it complains about missing libraries then click on the “View -> Terminal” menu of VSC and then in the terminal run those two commands:

pio lib update
pio lib install

Those commands should install the missing libs, which are FastLED and aREST.

A note about aREST here. Although the library claims that it’s a true RESTful API, it’s not! There’s no distinction between GET, POST, PUT and the other HTTP methods a REST API supports. Whatever method you use, aREST handles it in the same callback… That’s actually a pity because I had to implement some commands twice in order to implement the POST/GET functionality. Anyway, a small rant, but also I don’t intend to fix this myself…

The main source file is the src/main.cpp. There are a few things going on in there, but generally you need to define a few things before you build the firmware, which are:

  • def_ssid: The default SSID of your router
  • def_ssid_password: The SSID password
  • def_led_on_color: The default RGB color when a LED is activated
  • def_led_off_color: The default RGB color for deactivated LED
  • RESET_TO_DEF_PIN: The pin is used to reset the configuration to the default values
  • STRIPE_DATA_PIN: The pin that drives the RGB LED strip
  • NUM_LEDS: The number of LEDS in the strip. This is the number of the drawers of your component organizer.

Normally, you would only change initially the def_ssid and `def_ssid_password` for testing and then at some point also the def_led_on_color.

Now you need to connect everything together, which is quite simple. Use a +5V power supply that can provide the needed power and current for the LED strip. In my case I see approx 2.5 Amps with 100 LEDs when they are all set to white, which is the max. That means that I need at least a 3A PSU (or 15W). Now, connect the +5V and GND of the WS2812B strip to the PSU and then also make the following connections

ESP8266 (NodeMCU)    WS2812B
D2                   Din

Also connect the D1 pin of the NodeMCU to GND via a 4K7 resistor. This pin is used for resetting the configuration to the default values. When it’s connected to GND you get the normal operation, and when it’s connected to +3V3 and you reset the ESP8266 then the default configuration is loaded. Always remember to connect the resistor to GND again after resetting to defaults.

Finally connect a USB cable to ESP8266 and your computer or a USB power supply in normal use.

Getting back to the code, the `load_default_configuration()` will load the default configuration and this function is called either if a false configuration is found (e.g. empty or corrupted conf) or when the D1 pin is HIGH.

When the program starts and the main function is loaded, then the serial port is configured and then the D1 input pin. Then the configuration is loaded from the EEPROM (actually the flash in case of ESP8266) and next the aREST API is configured with the exposed functions and the internal variables. Then the module is connected to the SSID via WiFi and the main 100ms timer is set to continuous run. This timer is used for all the timeout actions of the LEDs. Finally, the WS2812B strip is initialized and all LEDs are set to the led_off_color value, which is Black (aka turned off) by default.

In the main loop, the code handles all the aREST requests, checks if it needs to save the configuration in the EEPROM (=flash) and also checks the timeout of every activated LED and turns the LED off when its timeout is reached.
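The per-LED timeout handling described above can be sketched in a few lines. This is a Python model of the idea and not the actual firmware code, which is C++ and may be structured differently. The class and method names are invented for the example; only the 100ms tick and the default timeout come from this post.

```python
TICK_MS = 100  # period of the main timer, as described in the text


class LedTimers:
    """Keeps one countdown per LED; 0 means the LED is off."""

    def __init__(self, num_leds, timeout_ms=5000):
        self.timeout_ms = timeout_ms
        self.remaining = [0] * num_leds

    def turn_on(self, index):
        # (Re)start the countdown for this LED only
        self.remaining[index] = self.timeout_ms

    def tick(self):
        """Advance time by one tick and return the LEDs that just expired."""
        expired = []
        for i, left in enumerate(self.remaining):
            if left > 0:
                left -= TICK_MS
                self.remaining[i] = max(left, 0)
                if left <= 0:
                    expired.append(i)  # the firmware would turn this LED off
        return expired

    def lit(self):
        return [i for i, left in enumerate(self.remaining) if left > 0]


timers = LedTimers(num_leds=100)
timers.turn_on(7)
timers.turn_on(42)
print(timers.lit())
```

Because every LED has its own counter, pressing “Show” for a second part while the first LED is still lit doesn’t affect the timeout of the first one.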

The rest of the functions with the ICACHE_RAM_ATTR prefix are the callbacks for the aREST API. This prefix in the function is actually a macro that instructs the linker to place those functions in RAM instead of flash, so the code is accessed faster. So let’s see now the supported commands which are listed in the following table

Command Description
The index of the LED to turn ON
The integer value of the CRGB color for ON
The integer value of the CRGB color for OFF
The integer value of the LED timeout in seconds
The integer value of the CRGB color for the ambient mode (for real-time selection)
The integer value of the CRGB color that is saved as the default ambient color
Enables/disables the ambient mode
The WiFi password

You can find the CRGB value for each supported color in the FastLED library file which is located in .pio/libdeps/nodemcuv2/FastLED_ID126/pixeltypes.h. Then you can use a calculator to convert HEX values to integers.
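If you don’t want to use a calculator, the conversion is just the three 8-bit color values packed into one number, with red in the top byte. Here’s a tiny Python sketch for that. The helper names are made up, and the SkyBlue value 0x87CEEB is the one from pixeltypes.h that I also use as an example further down.

```python
def crgb_to_int(r, g, b):
    """Pack 8-bit R, G, B values into the integer used by the commands."""
    return (r << 16) | (g << 8) | b


def int_to_crgb(value):
    """Unpack the integer back to an (r, g, b) tuple."""
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF


print(crgb_to_int(0x87, 0xCE, 0xEB))  # SkyBlue as integer
print(int("87CEEB", 16))              # same thing straight from the HEX string
print(int_to_crgb(8900331))
```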

In case you want to play around with the aREST commands, you can use a REST client like insomnia or you can use your browser and use the address bar to send GET commands. In any case the URL for each command has the following format:


Therefore if the ESP8266 IP address is and you want to use the led_index command to turn on the 2nd LED of the strip, then paste this line to your web browser’s address bar and hit enter.

Or if you want to set the led_on_color to SkyBlue (which according to pixeltypes.h is 0x87CEEB in HEX or 8900331 in integer), then

Finally, you can get all the current aREST variable values with this URL:

This will return something like this:

{
  "variables": {
    "led_on_color": 1,
    "led_off_color": 0,
    "led_on_timeout": 5000,
    "led_ambient": 1,
    "enable_ambient": 0,
    "wifi_ssid": "    ",
    "wifi_password": "         "
  },
  "id": "1",
  "name": "lab-db",
  "hardware": "esp8266",
  "connected": true
}

Actually, since I’m using Insomnia for my tests, this is an example output from the tool.

Finally, I would like to mention that the configuration, although you see that I’m using the EEPROM.h header and functions, is actually stored in the flash. The configuration is a struct, specifically the struct tp_config, which has several members, but the important ones are the preamble, the version and the crc. The preamble is just a fixed 16-bit word (see FLASH_PREAMBLE), which is 0xBEEF by default, and it’s used by the code to verify that what it copied from the EEPROM into the struct is a valid struct. The configuration version is used to compare the EEPROM configuration version with the firmware version, and if those are different (i.e. because of a firmware update) then the check_eeprom_version() function is responsible for handling this. Therefore, if you add more variables in the configuration then the size of the configuration will change, which means that you need to bump up the version in the firmware code and then also handle this in that function in case you want to preserve the current configuration (e.g. WiFi credentials). Otherwise the easy way is to just restore the defaults, which is what already happens in the code. Finally, the crc is the checksum of the configuration and if it’s not correct when reading the data from the EEPROM then it’s again an error condition, which by default is handled by resetting the configuration to defaults.
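To illustrate the preamble/version/crc checks, here is a Python model of the validation order. Keep in mind that this is only a sketch under assumptions: the real struct tp_config has more members, its exact layout isn’t shown in this post, and I’m using CRC-16/CCITT from Python’s `binascii` module here just as a stand-in for whatever checksum the firmware uses. `CONF_VERSION`, `LAYOUT`, `pack_config` and `load_config` are made-up names.

```python
import binascii
import struct

FLASH_PREAMBLE = 0xBEEF   # fixed 16-bit marker, as in the firmware
CONF_VERSION = 2          # hypothetical config version of the running firmware
LAYOUT = "<HHI"           # preamble, version, led_on_timeout (assumed members)


def pack_config(version, led_on_timeout):
    """Serialize the config and append a 16-bit checksum."""
    body = struct.pack(LAYOUT, FLASH_PREAMBLE, version, led_on_timeout)
    return body + struct.pack("<H", binascii.crc_hqx(body, 0xFFFF))


def load_config(blob):
    """Return (status, led_on_timeout); status is 'ok', 'migrate' or 'defaults'."""
    size = struct.calcsize(LAYOUT)
    if len(blob) != size + 2:
        return "defaults", None           # empty or truncated storage
    preamble, version, timeout = struct.unpack(LAYOUT, blob[:size])
    (crc,) = struct.unpack("<H", blob[size:])
    if preamble != FLASH_PREAMBLE:
        return "defaults", None           # not a config struct at all
    if crc != binascii.crc_hqx(blob[:size], 0xFFFF):
        return "defaults", None           # corrupted data
    if version != CONF_VERSION:
        return "migrate", timeout         # valid, but from another firmware
    return "ok", timeout


print(load_config(pack_config(CONF_VERSION, 5000)))
print(load_config(pack_config(1, 5000)))
print(load_config(b"\xff" * 10))          # what erased flash would look like
```

The 'migrate' case is where a function like check_eeprom_version() comes into play, and the 'defaults' case corresponds to loading the default configuration.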

To build and upload the code now, just select the PIO in the activity bar on the left. There you should see the project tasks, which include “Build” and “Upload and Monitor”. If you can’t see those then you probably opened the top repo folder, which means that you need to open only the esp8266-firmware/ folder in VSC and then you should be able to see those tasks. By clicking the build task the terminal should open at the bottom of the UI and you should get something like this

Building in release mode
Retrieving maximum program size .pio/build/nodemcuv2/firmware.elf
Checking size .pio/build/nodemcuv2/firmware.elf
Advanced Memory Usage is available via "PlatformIO Home > Project Inspect"
DATA:    [====      ]  38.2% (used 31316 bytes from 81920 bytes)
PROGRAM: [===       ]  28.1% (used 293584 bytes from 1044464 bytes)
====================================================== [SUCCESS] Took 2.26 seconds ======================================================

Terminal will be reused by tasks, press any key to close it.

That means that the firmware is ready to be uploaded to the ESP8266, so click the “Upload and Monitor” task from the left menu and the flash procedure will start. There’s a chance that the USB Comm port is not found, in this case if you’re using Linux it means that you need to add the proper udev rules (see how you do this here).


Before you proceed with installing the LEDs in the components organizer, you first need to check that everything works as expected. Therefore, first you need to create your database and the easiest way, as I’ve mentioned, is using the docker image and the DB Browser for SQLite. After you’re done with the database you can do the tests with the ESP8266 and the docker container, so first you need to make the proper connections and connect the WS2812B LED strip to the ESP8266.

For this post I’ll use the NodeMCU (actually a clone) with the ESP-12E module. Just connect the D1 pin with a 4K7 resistor to GND, also the GND of the module to the GND of the WS2812B, the D2 pin of the ESP8266 to the Din pin of the strip and finally connect the ESP8266 with a USB cable to either your PC or a USB charger. Of course, before that you need to verify that your firmware is flashed and the ESP8266 is able to connect to the WiFi router and also responds to the http://ip-address/ aREST command with its current configuration.

Finally, connect the WS2812B to a +5V power supply that can provide enough current. Just a note here, although the ESP8266 is a 3V3 device and the WS2812B is a 5V, you shouldn’t expect any problems with the voltage difference and you shouldn’t need a level translator. From my experience I never had any issues with any 3V3 MCU and the WS2812B.

After all connections are made and everything is powered up, use the desktop browser (not your smartphone) on the machine that runs the docker container and test that the LED lights up when you press the “Show” button of a component that has an index in the Location field in the DB. If that works then you’re good to go and install the LED strip in the back of the components organizer.

LED strip comparison and details

As I’ve mentioned I didn’t use the standard 30/60/144 LEDs per meter strips that you usually find on eBay. This is a 10 LEDs/meter strip, which means each LED is 10cm away from the next one. That is perfect for this project for many reasons, but it also has a drawback, which I’ll discuss later.

For comparison, this is a photo of the 10 LEDs/m strip and another 60 LEDs/m strip I have. The comparison is between them and an ESP-01 module, which is one of the smallest ESP8266 modules available.

The strip on the left is 5m long, which means it has 50 LEDs, and the one on the right side is 4m long and has 240 LEDs. The good thing about the strip on the left is that it’s much thinner and more flexible and the distance between the LEDs is ideal for the components organizer drawers.

The problem with the thinner strip though is that it has more resistance. Why is that a problem? Well, if it was only one strip (aka 5 meters) then it would be fine, but since I’m using 2x strips (10 meters) then as you can imagine the resistance grows for the LEDs that are far away from the power supply. Also, because the thinner strip is not a flexible PCB, but has thin cables between each LED, the resistance is even higher.

In the next photo you can see the effect that this resistance has on the LEDs when two strips are connected together and the power supply is only connected to the one end of the strip.

As you can see, the strip that is near the PSU is bright white as it should be, but the half of the next strip is starting to have discoloration and instead of white light you get yellow. That’s because the length of the strip adds more resistance in the end of the line and the voltage on the LEDs is dropped enough to not give full power to the LEDs, so you get a dim light (thus yellow).

The solution for that, as you may already have guessed, is to supply voltage to both ends of the strip. Therefore, the +5V PSU output is connected to both ends (you don’t have to connect both GND though, one is enough). The result is that now the strip will have equal voltage on both sides, so all the LEDs will be lit the same. The next photo shows the 10 meter strip with the PSU connected to both sides.

You see now that the light is uniform through the whole LED strip length.

Finally, I’ve measured the maximum consumption of the 10m LEDs strip when all color pixels (RGB) are boosted to the max value 0xFF. The result is:

So, it’s about 1.3A @ 5V or 6.5W. Therefore a USB 5V/2A charger is more than enough for the strip including the ESP8266. I don’t plan to light all the LEDs to full white at the same time anyway, as I like dim lights. I’ll just remind you here that you can use the ambient light functionality in the ESP firmware and light your organizer in any color to have a cool ambient effect in your lab.
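For reference, this is the arithmetic behind those numbers as a small Python sketch. The 1.3A and 5V values are my measurement from above and the 100 LEDs come from the 10m of strip at 10 LEDs/m. The function name is made up for the example, and the per-LED figure is only an average derived from that single measurement.

```python
def strip_power(current_a, voltage_v=5.0, num_leds=100):
    """Return (total watts, average mA per LED) for a measured current."""
    watts = current_a * voltage_v
    ma_per_led = current_a * 1000 / num_leds
    return watts, ma_per_led


watts, ma_per_led = strip_power(1.3)
headroom_a = 2.0 - 1.3   # what is left from a 2A charger for the ESP8266
print(round(watts, 2), round(ma_per_led, 2), round(headroom_a, 2))
```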

Installing the LEDs

Next step is to install the LED strip in the back of the organizer. It seems to me that this was the most time consuming task, even compared to writing the firmware… Anyway, since I have 3 organizers that I want to illuminate and each one has 33 drawers, it means I need 99 LEDs, so I got a LED strip with 100 LEDs with 1 LED / 10cm. The extra length is very convenient in this case.

Also, because I have 3 organizers and only 1 strip, I had to cut the strip and solder connectors, so when I have to move the organizers I don’t have to remove the strip or move them all together. Also in case I need to add an extra organizer I can extend the strip with another connector. Using the WS2812B is also nice because if one of the LEDs malfunctions I can replace it easily.

One last thing is to decide the LED indexes before you fill the DB. I should probably mention that earlier, but I guess you’ll read that far before doing any of the previous steps. Let’s see a simplified example of how I’ve laid out the LED strip in the back of the organizers

The above diagram shows a simplified view of the 3 organizers and the increment direction for the LED indexes. It’s obvious that it wouldn’t make much sense to label each drawer like that if you were using a sticker on each drawer and writing an index with a marker. But it makes perfect sense in case you have a LED strip like the WS2812B, because that way you minimize the length of the strip. Of course you can use a different arrangement, like going top-to-bottom instead of left-to-right. In my case I’ve used a quite unusual indexing, as you’ll see in the next picture of what I’ve done with one of the organizers.

Let me say here, that for my standards it’s the best I could do. I’m getting really bored and lazy when it comes to manual work like this, so I’m just doing everything as fast as I can and never look back again (literally in this case).

As you can see there’s a single LED behind each drawer and I’ve used invisible tape to stick the flexible strip on the plastic rails in the back. I hope they won’t fall off soon, but generally I’ve used tons of invisible tape for various reasons and it never failed me. Also, look at the indexing on the top, it doesn’t make sense, especially compared with the indexes I’ve used in the previous example. The reason is that this way the input of the strip ended up where it was more convenient to connect the next strip. This is a calc sheet I’ve made. It probably doesn’t make sense, but the teal color represents the left organizer, the violet the middle and the orange the right organizer. Then (ABC) is the first column of the organizer, (DEF) the second, (GHI) the third and finally (JKL) is the fourth column of drawers. Yeah I know, my brain works a bit weird, but I find this very easy to understand and to remember what I’ve done.

Anyway, you don’t have to use this, you can skip it and make an indexing map that is better for you.

Now, this is a photo of the back of all 3 organizers when I was done with this ridiculous amount of boring manual work…

At least it worked without any issues on the first boot. As you may see I had to cut the strip at every end and solder connectors and I also had to solder together the two pieces I cut from the long strips to make the strip part for the last organizer. Oh, so boring!

In the above and also the next picture I’ve tested the ESP8266 and the strip with the indexes and the ambient functionality. This is after I’ve finished with the whole installation.

Neat. Of course, you wouldn’t expect to have uniform illumination as each drawer has different components, some are full, some almost empty. But still the result is really nice and actually it looks better in my lab than in the picture.

Finally, I’ve also made a custom USB cable to provide power to both the ESP8266 and the WS2812B strip. I’ve used a USB power pack that is capable to provide 2A @ 5V, which is more than enough for everything. This is the cable.

Next thing I’ll do when I get my 3D printer is to print a nice case to fit the ESP module.

Installing the web-interface to BPI-M1

In my case, after everything seemed to be working I moved the whole www/lab folder from the repo to the BPI-M1. As I’ve already mentioned I’m using the Armbian distro on my BPI-M1, which doesn’t have a web server, PHP and sqlite by default. Therefore, I’ll list the commands I’ve used to install those in Armbian. So, to install lighttpd with php and sqlite support, I’ve run those commands:

apt-get update
apt-get upgrade
apt-get install lighttpd php7.2-fpm php7.2-sqlite3 php-yaml

lighttpd-enable-mod fastcgi
lighttpd-enable-mod fastcgi-php

Then edit the `15-fastcgi-php.conf` file

vi /etc/lighttpd/conf-available/15-fastcgi-php.conf

and add this:

# -*- depends: fastcgi -*-
# /usr/share/doc/lighttpd/fastcgi.txt.gz
# http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs:ConfigurationOptions#mod_fastcgi-fastcgi

## Start a FastCGI server for php (uses the php7.2-fpm socket)
fastcgi.server += ( ".php" =>
        ((
                "socket" => "/var/run/php/php7.2-fpm.sock",
                "broken-scriptfilename" => "enable"
        ))
)

Then uncomment this line in /etc/php/7.2/fpm/php.ini


And finally restart lighttpd

sudo service lighttpd force-reload

Now copy lab/ folder from the www/ in the repo to the /var/www/html folder, so the index.php file now should be in /var/www/html/lab/index.php.

Finally, you need the right permissions in the web folder, so run this command in your SBC terminal (via ssh or serial console)

sudo chown www-data:www-data /var/www/html/lab

Now use any web browser (smartphone, laptop, etc.) and connect to the web interface and test again that everything is working fine. Also test that the upload is working with large pdf files. In case it doesn’t, have a look in the `www/php.ini` file and add those params to your SBC, too.

If it’s working, then you’re done.


Here is a video of me playing around with the web interface and also a video in which I’m using the web browser on my workstation to enable the “ambient” feature of the ESP8266 firmware, which just lights all the WS2812B strip LEDs in one color and makes a nice, colorful (and power-consuming) atmosphere in my home lab.

In the first few minutes I’m just fooling around with the database and I’m searching for “MCP” and “STM” keywords in there. I’m also using the “Show” button to demonstrate that the LED of the drawer that contains the part is lit. Then I test the ambient light function with my smartphone and finally with the workstation browser.

Generally, I’m always using my workstation and never the smartphone. The difference between the two cases is that when the web interface runs on the workstation, the HTML color object changes the color in real-time while you’re moving your mouse, so you don’t have to press “set” every time (which you need to do with the mobile browser). You can see that at 3:13 of the video. In the code you can see two javascript functions, ambient_color() and save_ambient_color(). The first is called from the oninput event, which fires on any color update in real-time from the HTML color object, and the second is called from the onchange event, which is triggered only when you close the color picker on the desktop web-browser or click “set” on the mobile browser. In the case of the onchange event the color is also saved in flash, so next time you press the “Enable ambient” button this color is used.

Some weird issues

During my experiments I had some weird behavior from a specific batch of ESP8266 modules. It seems that not all modules are made the same. These are the two modules I’ve used

On the left side is the LoLin NodeMCU module and on the right side is a generic cheap module I’ve bought from ebay some time ago with this description here: “NodeMCU ESP8266 ESP-12E V1.0 Wifi CP2102 IoT Lua 267 NEW”.

The problem I had is that when I flashed the same firmware on both devices, the generic ESP8266 module was jamming my smartphone’s WiFi connection, while the LoLin didn’t have any effect. The next image shows the effect both modules have on a Speedtest run.

On the left side is when the firmware is running on the LoLin module and the right when it’s running on the other. You see that the difference is quite large, but the worst thing is that in the second case it was making the internet browsing with my smartphone very hard as I couldn’t download web pages fast and it was lagging a lot.

First thing I thought of was the transmit power. I believe that the second module’s transmit power is quite high and thus creates those issues. I thought of using my RTL-SDR (that I’ve used in this post here), but the problem with this dongle is that the Rafael Micro R820T/2 tuner is able to tune up to 1766 MHz, which makes sniffing the 2.4GHz band impossible. I’ve searched and found an MMDS downconverter on aliexpress, but that doesn’t ship to Germany and the other options were quite expensive, so I’ve dropped the idea of getting into it and finding out what’s going on. At some point I’m planning to get a HackRF One anyway, which is able to sniff up to 6GHz, so I may look into it again then.

Anyway, just have that in mind because it may cause troubles in your WiFi network. For the ESP8266 arduino framework there’s also the `setOutputPower()` function that is supposed to control the TX power from 0 up to 20.5dBm (see here). I may play around with it also at some point, but for now I’m using the LoLin module.

Generally, I’ll have a look at it in the future and write a post when I figure out why this is happening. I have 3 of those ESP8266 modules and it would be a shame if they were unusable. I’ll try experimenting first with the `setOutputPower()`.

Edit 1:

I’ve just realized that the default behavior for WiFi.begin() is to configure both STA + AP, which means that ESP8266 is configured as both access point (AP) and client/station (STA). This is definitely an issue and affects the overall WiFi performance of the surrounding devices, but still doesn’t explain the difference between the different modules. Anyway, now I’m forcing only STA by explicitly setting the mode in the code


Edit 2:

After forcing the STA mode, the speedtest got a bit better on my smartphone, but the upload was still horrible. The next step was to limit the TX power with the `WiFi.setOutputPower()` function. This solved most of the issues I had, but not completely. At around 6-8 dBm I’ve seen a couple of disconnections on the ESP8266 while I was running the speedtest on the smartphone. Anyway, around 12.5 dBm seems to be the sweet spot for my home arrangement, but I’m still not quite happy with the upload speed on the speedtests.
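
To put those dBm numbers in perspective, the conversion to milliwatts is P(mW) = 10^(dBm/10). Here is a minimal Python sketch of that arithmetic; the helper name is just something I made up for illustration and the levels are the ones mentioned in this post:

```python
def dbm_to_mw(dbm: float) -> float:
    """Convert a power level in dBm to milliwatts: P(mW) = 10^(dBm / 10)."""
    return 10 ** (dbm / 10)


# TX power levels mentioned in this post
levels_dbm = [6.0, 8.0, 12.5, 20.5]
for dbm in levels_dbm:
    print(f"{dbm:5.1f} dBm = {dbm_to_mw(dbm):6.1f} mW")
```

So dropping from the 20.5 dBm maximum to 12.5 dBm is an 8 dB step, which cuts the transmit power by roughly a factor of six.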

For now I’ve ordered a couple more LoLin modules, because I only have one that seems to be working fine. All the other modules have this issue. When I get the HackRF or if I find a MMDS downconverter I’ll get deeper into this…


This post ended up longer than I expected. Although all the different parts and the code are simple, there are many different things that compose the outcome. Also I got a bit deeper with the development and testing tools (e.g. docker). Sometimes I tend to think that I’m using Docker too much, but on the other hand I’m happy that I keep my development workstation clean from tons of packages and tools that I don’t use often, but only when making those stupid projects.

Regarding the project, I have to say that this DB was a savior for me, because many times I need components and I don’t know if I have them and, even more, I have no idea where to start looking for them. Generally, I keep things organized, otherwise it would be impossible to keep track of every component. But no matter how organised I am, even if I find in the database that I have what I need, if I don’t remember where the component might be (which is the case most of the time), then I need to start looking everywhere.

For that reason making this interactive RGB LED strip to point to the correct drawer is a huge advantage for me. Sometimes now I just play with this even if I don’t really need a component, he he. It’s really nice to have.

On the other hand, if you’re thinking of implementing this, then you’re free to use the code and do whatever you like with it. The only boring thing is to fill the database if you have a lot of components, but you don’t have to do it in one day. I’ve been filling this database for the last 6 years. What I do is, first I buy the components from ebay (usually) and then I find a nice image and the datasheet and I add the component to the database.

Here’s a nice trick I’m using for some components. Since the thumbnail doesn’t have to be the same image as the one in the image/ folder, I’m using a different image in the thumb/ and image/ folders. In this case, when the list is shown the thumbnail image is displayed, and when I click on it the image from the image/ folder is displayed. This is useful for example for many MCUs, because I use an actual photo of the MCU for the thumbnail and the pinout image for the bigger image.

Another useful thing that I do, is that many times I can’t find a datasheet for the component but I’ve found a web page that has info for that component. Then I print the web page in a PDF file and I’m uploading that. Finally, some times I need to save more than one PDF for a component, therefore in this case I merge two or more PDF files in one using one of the available online PDF mergers (e.g. this or that). That way I can overcome the fact that only one PDF is available for each component in the database.

Finally, what you’ll learn by doing this project is how to work with docker images and web servers, how to set up your SBC as a web server, a bit of REST (although aREST is not really a RESTful environment), and also a bit of PHP, javascript, SQLite, ESP8266 firmware and playing with LEDs.

That’s it, I hope you find this stupid project useful, for some reason.

Have fun!

DevOps for embedded (part 2)


Note: This is the second post of the DevOps for Embedded series. You can find the first post here and the last one here.

In the previous post, I’ve explained how to use Packer and Ansible to create a Docker CDE (common development environment) image and use it on your host OS to build and flash your code for a simple STM32F103 project. This docker image was then pushed to the docker hub repository and then I’ve shown how you can create a gitlab-ci pipeline that triggers on repo code changes, pulls the docker CDE image and then builds the code, runs tests and finally exports the artifact to the gitlab repo.

This is the most common case of a CI/CD pipeline in an embedded project. What was not demonstrated was the testing on the real hardware, which I’ll leave for the next post.

In this post, I’ll show how you can use a cloud service, specifically AWS, to perform the same task as in the previous post and create a cloud VM that builds the code either as a CDE or as part of a CI/CD pipeline. I’ll also show a couple of different architectures for doing this and discuss the pros and cons of each one. So, let’s dive into it.

Install aws CLI

In order to use the aws services you first need to install the aws CLI tool. At the time this post is written, there is version 1 of the aws CLI and there’s also a new version 2, which is in preview for evaluation and testing. Therefore, I’ll use version 1 for this post. There’s a guide here on how to install the aws CLI, but it’s easy: you just need python3 and pip:

pip3 install awscli --upgrade --user

After that you can test the installation by running aws --version:

$ aws --version
aws-cli/1.16.281 Python/3.6.9 Linux/5.3.0-22-generic botocore/1.13.17

So, the version I’ll be using in this post is 1.16.281.

It’s important to know that if you get errors while following this guide then you need to be sure that you use the same versions for each software that is used. The reason is that Packer and Ansible for example are getting updated quite often and sometimes the newer versions are not backwards compatible.
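
If you want to check the versions from a script instead of by eye, the banner that aws --version prints is easy to split into its components. This is just a small Python sketch under my own assumptions (the function name is made up, and the banner is the one from my host shown above):

```python
import re


def parse_versions(banner: str) -> dict:
    """Split an `aws --version` banner into {component: version}."""
    return dict(re.findall(r"(\S+?)/(\S+)", banner))


# Banner copied from the output shown in this post
banner = "aws-cli/1.16.281 Python/3.6.9 Linux/5.3.0-22-generic botocore/1.13.17"
versions = parse_versions(banner)
print(versions)

# Version used while writing this guide
expected = "1.16.281"
if versions["aws-cli"] != expected:
    print(f"warning: this guide was written with aws-cli {expected}")
```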

Create key pairs for AWS

This will be needed later in order to run an instance from the created AMI. When an instance is created you need somehow to connect to it. This instance will get a public IP and the easiest way to connect to it is via SSH. For security reasons you need to either create a private/public key pair on the AWS EC2 MC (MC = Management Console) or create one locally on your host and then upload the public key to AWS. Both solutions are fine, but they have some differences and here you’ll find the documentation for this.

If you use the MC to create the keys, then your browser will download a pem file, which you can use to connect to any instance that uses this pair. You can store this pem file in a safe place, distribute it or do whatever you like. If you create your keys that way, then you trust the MC backend for the key creation, which I assume is fine, but in case you have a better random generator or you want full control, then you may not want the MC to create the keys.

The other way is to create the keys on your side, if you prefer. I assume that you would do that for one of two reasons: either to re-use an existing ssh pair that you already have, or because you prefer to use your own random generator. In any case, AWS supports importing external keys, so just create the keys and upload them in your MC in the “Network & Security -> Key Pairs” tab.

In this example, I’m using my host’s key pair, which is in my user’s ~/.ssh, so if you haven’t created ssh keys on your host then just run this command:

ssh-keygen -t rsa -C "your_email@example.com"

Of course, you need to replace the dummy email with yours. When you run this command it will create two files (id_rsa and id_rsa.pub) in your ~/.ssh folder. Have a look in there, just to be sure.
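
If you prefer a scripted check over looking in the folder, something like the following Python sketch would do. The function name is hypothetical; it only verifies that both files exist and that the private key is not readable by group/others, which is what the ssh client normally insists on:

```python
import stat
from pathlib import Path


def check_keypair(ssh_dir: Path, name: str = "id_rsa") -> list:
    """Return a list of problems found with the key pair (empty list if OK)."""
    problems = []
    private, public = ssh_dir / name, ssh_dir / f"{name}.pub"
    for key in (private, public):
        if not key.is_file():
            problems.append(f"missing {key}")
    if private.is_file():
        mode = stat.S_IMODE(private.stat().st_mode)
        # Any permission bit for group or others is a problem for a private key
        if mode & 0o077:
            problems.append(f"{private} is accessible by group/others (mode {mode:o})")
    return problems


for problem in check_keypair(Path.home() / ".ssh"):
    print(problem)
```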

Now that you have your keys, open your AWS EC2 MC and upload your public ~/.ssh/id_rsa.pub key in the “Network & Security -> Key Pairs” tab.

Creating a security group

Another thing that you’ll need later is a security group for your AMI. This is just some firewall rules grouped in a tag that you can use when you create new instances. Then the new instance will get those rules and apply them to the instance. The most usual case is that you need to allow only ssh inbound connections and allow all the outbound connections from the image.

To create a new group of rules, go to your AWS EC2 management console and click on the “NETWORK & SECURITY -> Security Groups” tab. Then click on the “Create Security Group” button and in the new dialog type “ssh-22-only” for the Security group name, write your own description and then in the inbound tab press the “Add Rule” button, select “SSH” in the type column, select “Custom” in the Source column and type “”. Apply the changes and you’ll see your new ssh-22-only rule in the list.

Creating an AWS EC2 AMI

In the repo that I’ve used in the previous post to build the docker image with Packer, there’s also a json file to build the AWS EC2 AMI. AMI stands for Amazon Machine Image and it’s just a virtual machine image with a special format that Amazon uses for their purposes and infrastructure.

Before proceeding with the post you need to make sure that you have at least a free tier AWS account and that you’ve created your credentials.

To build the EC2 image you need to make sure that you have a folder named .aws in your user’s home and that the config and credentials files are inside it. If you don’t have this, then that means that you need to configure aws using the CLI. To do this, run the following command in your host’s terminal

aws configure

When this runs, you’ll need to fill in your AWS Access Key ID and the AWS Secret Access Key that you got when you’ve created your credentials. Also you need to select your region, which is the location where you want to store your AMI. This selection depends on your geo-location and there’s a list to choose the proper name depending on where you’re located. The list is in this link. In my case, because I’m staying in Berlin it makes sense to choose eu-central-1, which is located in Frankfurt.

After you run the configuration, the ~/.aws folder and the needed files should be there. You can also verify it by printing the files with cat in the shell.

cat ~/.aws/{config,credentials}

The config file contains information about the region and the credentials file contains your AWS Access Key ID and the AWS Secret Access Key. Normally, you don’t want those keys flapping around in text mode and you should use some kind of vault service, but let’s skip this step for now.
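
Both files use a simple INI-like layout, so they can also be read from a script. The next Python sketch shows the idea with Python’s configparser; the file contents here are not my real ones but the placeholder keys that AWS uses in its own documentation, and the function name is made up:

```python
import configparser

# Sample contents in the layout `aws configure` writes.
# The key values are AWS's documentation placeholders, not real keys.
SAMPLE_CONFIG = """\
[default]
region = eu-central-1
"""
SAMPLE_CREDENTIALS = """\
[default]
aws_access_key_id = AKIAIOSFODNN7EXAMPLE
aws_secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
"""


def read_profile(config_text, credentials_text, profile="default"):
    """Return the region and keys of a profile, with the secret masked."""
    config = configparser.ConfigParser(interpolation=None)
    config.read_string(config_text)
    creds = configparser.ConfigParser(interpolation=None)
    creds.read_string(credentials_text)
    secret = creds[profile]["aws_secret_access_key"]
    return {
        "region": config[profile]["region"],
        "access_key_id": creds[profile]["aws_access_key_id"],
        # Show only the first 4 characters, never print the full secret
        "secret_access_key": secret[:4] + "*" * (len(secret) - 4),
    }


print(read_profile(SAMPLE_CONFIG, SAMPLE_CREDENTIALS))
```

To use it on the real files, read ~/.aws/config and ~/.aws/credentials into strings and pass them in.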

If you haven’t cloned the repo I’ve used in the previous post already, then this is the repo you need to clone:

Have a look in the `stm32-cde-aws.json` file. I’ll also paste it here:

{
    "_comment": "Build an AWS EC2 image for STM32 CDE",
    "_author": "Dimitris Tassopoulos <dimtass@gmail.com>",
    "variables": {
      "aws_access_key": "",
      "aws_secret_key": "",
      "cde_version": "0.1",
      "build_user": "stm32"
    },
    "builders": [{
      "type": "amazon-ebs",
      "access_key": "{{user `aws_access_key`}}",
      "secret_key": "{{user `aws_secret_key`}}",
      "region": "eu-central-1",
      "source_ami": "ami-050a22b7e0cf85dd0",
      "instance_type": "t2.micro",
      "ssh_username": "ubuntu",
      "ami_name": "stm32-cde-image-{{user `cde_version`}} {{timestamp | clean_ami_name}}",
      "ssh_interface": "public_ip"
    }],
    "provisioners": [
      {
        "type": "shell",
        "inline": [
          "sleep 20"
        ]
      },
      {
        "type": "ansible",
        "user": "ubuntu",
        "playbook_file": "provisioning/setup-cde.yml",
        "extra_arguments": [
          "-e env_build_user={{user `build_user`}}"
        ]
      }
    ]
}

There you see the variables section, which contains two important variables, aws_access_key and aws_secret_key, and both are empty in the json file. In our case that’s no problem: since we have configured the aws CLI, Packer will pick up our user’s credentials from the ~/.aws folder. Nevertheless, this is something you need to take care of if the Packer build runs on a VM in a CI/CD pipeline, because in that case you’ll need to provide those credentials somehow. Usually you do this in one of three ways: by hard-coding the keys, which is not recommended; by having the keys in your environment, which is better than having them hard-coded but not ideal from the security aspect; or by using a safe vault that stores those keys, so the builder can retrieve them from there and use them in a “safe” way. In this example, since the variables are empty, Packer expects to find those in the host environment.

Next, in the builders section you see that I’ve selected the “amazon-ebs” type (which is documented here). There you see the keys that are pulled from the host environment (which in this case is ~/.aws). The region, as you can see, is hard-coded in the json, so you need to change it depending on your location.

The source_ami field is also important, as it points to the base AWS image that is going to be used as a basis to create the custom CDE image. When you run the build function of Packer, then in the background Packer will create a new instance from the source_ami and the instance will be set up with the configuration inside the “builders” block in the stm32-cde-aws.json file. All the supported configuration parameters that you can use with Packer are listed here. In this case, the instance will be a t2.micro, which has 1 vCPU, and the default snapshot will have 8GB of storage. After the instance is created, Packer will run the provisioner scripts and then create a new AMI from this instance and name it with the ami_name that you’ve chosen in the json file.

You can verify this behavior later by running the packer build function while having your EC2 management console open in your browser. There you will see that, while Packer is running, it will create a temporary instance from the source_ami, do its stuff in there, then create the new AMI and terminate the temporary instance. It’s important to know what each instance state means. Terminated means that the instance is gone and can’t be used anymore, and its storage is gone, too. If you need to keep your storage after an instance is terminated, then you need to create a block volume and mount it in your image. It’s not important for us now, but you can have a look in the documentation of AWS and Packer on how to do that.

Let’s go back on how to find the image you want to use. This is a bit cumbersome and imho it could be better, because as it is now it’s a bit hard to figure it out. To do that there are two ways, one is to use the aws-cli tool and list all the available images, but this is almost useless as it returns a huge json file with all the images. For example:

aws ec2 describe-images > ~/Downloads/ami-images.json
cat ~/Downloads/ami-images.json

Of course, you can use some filters if you like and limit the results, but still that’s not an easy way to do it. For example, to get all the images that are owned by Amazon, you can run this command:

aws ec2 describe-images --owners amazon > ~/Downloads/images.json

For more about filtering the results run this command to read the help:

aws ec2 describe-images help
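
Another option is to dump the json once and filter it offline. The following Python sketch shows the idea on a tiny inline sample that has the same shape as the describe-images output (an "Images" list with fields like ImageId, Name, Architecture and CreationDate). The image IDs and names in the sample are hypothetical, and so is the function name:

```python
import json

# Tiny sample in the shape `aws ec2 describe-images` returns.
# The IDs and names are hypothetical, for illustration only.
SAMPLE = json.loads("""
{"Images": [
  {"ImageId": "ami-00000000000000001", "Name": "example/ubuntu-16.04-amd64-20190901",
   "Architecture": "x86_64", "CreationDate": "2019-09-01T10:00:00.000Z"},
  {"ImageId": "ami-00000000000000002", "Name": "example/ubuntu-16.04-amd64-20191101",
   "Architecture": "x86_64", "CreationDate": "2019-11-01T10:00:00.000Z"},
  {"ImageId": "ami-00000000000000003", "Name": "example/ubuntu-18.04-arm64-20191101",
   "Architecture": "arm64", "CreationDate": "2019-11-01T10:00:00.000Z"}
]}
""")


def find_images(data, name_contains, arch="x86_64"):
    """Return the images whose name contains the given text, newest first."""
    matches = [
        img for img in data["Images"]
        if name_contains in img.get("Name", "") and img.get("Architecture") == arch
    ]
    return sorted(matches, key=lambda img: img["CreationDate"], reverse=True)


for img in find_images(SAMPLE, "ubuntu-16.04"):
    print(img["ImageId"], img["Name"])
```

For the real thing you would load the file you’ve saved (e.g. ~/Downloads/images.json) with json.load() and pass that to the function instead of the sample.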

There’s a slightly better way to do this from your aws web console, though it’s a bit hidden and not easy to find. To do that, first go to your aws console here. Then you’ll see a button named “Launch instance” and if you press that, a new page will open where you can search for available AMIs. The good thing with the web search is that you can also easily see if the image you need is “Free tier eligible”, which means that you can use it with your free subscription with no extra cost. In our case the image is the ami-050a22b7e0cf85dd0, which is an ubuntu 16.04 x86_64 base image.

Next field in the json file is the “instance_type”; for now it’s important to set it to t2.micro. You can find out more about the instance types here. The t2.micro instance for the ami-050a22b7e0cf85dd0 AMI is “free tier eligible”. You can find which instances are free tier eligible when you select the image in the web interface and press “Next” to proceed to the next step, in which you select the instance type and you can see which of them are in the free tier. Keep in mind that the free tier instance types depend on the source_ami.

The ssh_username in the json is the name that you need to use when you ssh into the image and the ami_name is the name of the new image that is going to be created. In this case you can see that this name is dynamic, as it depends on the cde_version and the timestamp. The timestamp and clean_ami_name are template functions provided by Packer.
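
To make the ami_name template a bit more concrete, here is a rough Python imitation of what it evaluates to. It’s only an approximation under my own assumptions: the function names mirror the Packer template functions, but the implementation is mine, and the character set is the one AWS documents as valid for AMI names.

```python
import re
import time


def clean_ami_name(name: str) -> str:
    """Approximation: replace characters that are not valid in an AMI name with '-'."""
    return re.sub(r"[^A-Za-z0-9()\[\] ./\-'@_]", "-", name)


def render_ami_name(cde_version: str, timestamp=None) -> str:
    """Imitate: stm32-cde-image-{{user `cde_version`}} {{timestamp | clean_ami_name}}"""
    if timestamp is None:
        timestamp = int(time.time())  # current Unix time in seconds
    return f"stm32-cde-image-{cde_version} {clean_ami_name(str(timestamp))}"


# Timestamp taken from the build log shown later in this post
print(render_ami_name("0.1", 1575473402))  # prints: stm32-cde-image-0.1 1575473402
```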

The “provisioners” section, like in the previous post, uses Ansible to configure the image, but as you can see there are two additional provisioning steps. The first is a shell script that sleeps for 20 secs. This is an important step, because you need a delay in order to wait for the image to boot before you’re able to ssh to it. If Ansible tries to connect before the instance is up and running and able to accept SSH connections, then the image build will fail. Also, if the delay is not long enough, then sometimes the apt package manager may fail even though the SSH connection succeeds, so always use a delay.
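
As a side note, when you script such waits yourself, an alternative to a fixed delay is to poll the SSH port until it accepts connections. This is not what the template does (it uses the plain sleep), just a minimal Python sketch of the polling idea with a made-up function name. Keep in mind that an open port 22 only tells you that sshd is up, not that the instance has finished its boot-time setup, so a delay may still be needed before using apt.

```python
import socket
import time


def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True as soon as host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: wait a bit and retry until the deadline
            time.sleep(interval)
    return False
```

You would call it with the public IP or DNS name of the instance and port 22.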

The next provisioning step uses a shell sudo command (inside the instance) to update the APT repos and install python. This step also wasn’t needed in the case of the docker image creation with Packer. As I’ve mentioned in the previous post, although the base images from different providers are quite the same, they do have differences and in this case the AWS Ubuntu image doesn’t have Python installed and Python is needed on the target for Ansible, even if it’s connected remotely via SSH.

To build the AWS image, just run this command in your host terminal.

packer build stm32-cde-aws.json

After running the above command you should see an output like this:

$ packer build stm32-cde-aws.json 
amazon-ebs output will be in this color.

==> amazon-ebs: Prevalidating AMI Name: stm32-cde-image-0.1 1575473402
    amazon-ebs: Found Image ID: ami-050a22b7e0cf85dd0
==> amazon-ebs: Creating temporary keypair: packer_5de7d0fb-33e1-1465-13e0-fa53cf8f7eb5
==> amazon-ebs: Creating temporary security group for this instance: packer_5de7d0fd-b64c-ea24-b0e4-116e34b0bbf3
==> amazon-ebs: Authorizing access to port 22 from [] in the temporary security groups...
==> amazon-ebs: Launching a source AWS instance...
==> amazon-ebs: Adding tags to source instance
    amazon-ebs: Adding tag: "Name": "Packer Builder"
    amazon-ebs: Instance ID: i-01e0fd352782b1be3
==> amazon-ebs: Waiting for instance (i-01e0fd352782b1be3) to become ready...
==> amazon-ebs: Using ssh communicator to connect:
==> amazon-ebs: Waiting for SSH to become available...
==> amazon-ebs: Connected to SSH!
==> amazon-ebs: Provisioning with shell script: /tmp/packer-shell622070440
==> amazon-ebs: Provisioning with shell script: /tmp/packer-shell792469849
==> amazon-ebs: Provisioning with Ansible...
==> amazon-ebs: Executing Ansible: ansible-playbook --extra-vars packer_build_name=amazon-ebs packer_builder_type=amazon-ebs -o IdentitiesOnly=yes -i /tmp/packer-provisioner-ansible661166803 /.../stm32-cde-template/provisioning/setup-cde.yml -e ansible_ssh_private_key_file=/tmp/ansible-key093516356
==> amazon-ebs: Stopping the source instance...
    amazon-ebs: Stopping instance
==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Creating AMI stm32-cde-image-0.1 1575473402 from instance i-01e0fd352782b1be3
    amazon-ebs: AMI: ami-07362049ac21dd92c
==> amazon-ebs: Waiting for AMI to become ready...
==> amazon-ebs: Terminating the source AWS instance...
==> amazon-ebs: Cleaning up any extra volumes...
==> amazon-ebs: No volumes to clean up, skipping
==> amazon-ebs: Deleting temporary security group...
==> amazon-ebs: Deleting temporary keypair...
Build 'amazon-ebs' finished.

==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs: AMIs were created:
eu-central-1: ami-07362049ac21dd92c

That means that Packer created a new AMI in the eu-central-1 regional server. To verify the AMI creation you can also connect to your EC2 Management Console and view the AMI tab. In my case I got this:

Some clarifications now. This is just an AMI, it’s not a running instance and you can see from the image that the status is set to available. This means that the image is ready to be used and you can create/run an instance from this. So let’s try to do this and test if it works.

Create an AWS EC2 instance

Now that you have your AMI you can create instances. There are several ways to create instances, like using your EC2 management console web interface, the CLI, or even Vagrant, which simplifies things a lot. I prefer using Vagrant, but let me explain why I think Vagrant is a good option by mentioning why the other options are not that good. First, it should be obvious why using the web interface to start/stop instances is not your best option. If not, then just think about what you need to do every time you need to build your code. You need to open your browser, connect to your management console, then perform several clicks to create the instance, then open your terminal and connect to the instance and, after finishing your build, stop the instance again from the web interface. That takes time and it’s almost manual labor…

Next option would be the CLI. That’s not that bad actually, but the shell command would be a long train of options, so you would need a shell script, and you would also have to deal with the CLI too much, which is not ideal in any case.
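
To give an idea of that train of options, this Python sketch just assembles (it doesn’t run) a run-instances command line from the values used in this post. The function name is made up; the flags are standard aws ec2 run-instances options, and the AMI ID, key pair and security group are the ones that show up elsewhere in this post.

```python
import shlex


def run_instances_command(image_id, instance_type, key_name, security_group):
    """Build the shell command that would launch one instance from an AMI."""
    args = [
        "aws", "ec2", "run-instances",
        "--image-id", image_id,
        "--count", "1",
        "--instance-type", instance_type,
        "--key-name", key_name,
        "--security-groups", security_group,
    ]
    # Quote every argument so the result is safe to paste in a shell
    return " ".join(shlex.quote(arg) for arg in args)


print(run_instances_command(
    "ami-07362049ac21dd92c", "t2.micro", "id_rsa_laptop_lux", "ssh-22-only"))
```

And that’s only for launching; you would still need to look up the instance ID and the public DNS name, and later stop or terminate the instance with more commands.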

Finally, you can use Vagrant. With Vagrant the only thing you need is a file named Vagrantfile, which is a ruby script file that contains the configuration of the instance you want to create from a given AMI. In there, you can also define several other options and, most importantly, since this file is just a text file, it can be pushed to a git repo and benefit from the versioning and re-usability git provides. Therefore let’s see how to use Vagrant for this.

Installing Vagrant

To install vagrant you just need to download the binary from here. Then you can copy this file somewhere to your path, e.g. ~/.local/bin. To check if everything is ok then run this:

vagrant version

In my case it returns:

Installed Version: 2.2.6
Latest Version: 2.2.6

Now you need to install a plugin called vagrant-aws.

vagrant plugin install vagrant-aws

Because Vagrant works with boxes, you need to install a special dummy box that wraps the AWS EC2 functionality and it’s actually a proxy or gateway box to the AWS service

vagrant box add aws-dummy https://github.com/mitchellh/vagrant-aws/raw/master/dummy.box

Now inside the `stm32-cde-template` git repo that you’ve cloned, create a symlink of Vagrantfile_aws to Vagrantfile like this.

ln -sf Vagrantfile_aws Vagrantfile

This will create a symlink that you can override at any point, as I’ll show later in this post.

Using Vagrant to launch an instance

Before you proceed with creating an instance with Vagrant you first need to fill in the configuration parameters in the vagrant-aws-settings.yml. The only thing that this file does is to store the variable values, so the Vagrantfile remains dynamic and portable. Having a Vagrantfile which is portable is very convenient, because you can just copy/paste the file between your git repos and only edit the vagrant-aws-settings.yml file and put in the proper values for your AMI.

Therefore, open `vagrant-aws-settings.yml` with your editor and fill the proper values in there.

  • aws_region is the region that you want your instance to be created.
  • aws_keypair_name is the key pair that you’ve created earlier in your AWS EC2 MC. Use the name that you used in the MC, not the filename of the key in your host!
  • aws_ami_name is the AMI that the instance is created from. Despite the name of the variable, the value is the AMI ID. You’ll find it in the “IMAGES -> AMIs” tab in your MC. Just copy the string in the “AMI ID” column and paste it in the settings yaml file. Note that the AMI ID is also printed at the end of the packer build output.
  • aws_instance_type that’s the instance type, as we’ve said t2.micro is eligible for free tier accounts.
  • aws_security_groups the security group you’ve created earlier for allowing ssh inbound connections (e.g. the ssh-22-only we’ve created earlier).
  • ssh_username the AMI username. Each base AMI has a default username and all the AMI that Packer builds inherits that username. See the full list of default usernames here. In our case it’s ubuntu.
  • ssh_private_key_path the path of the private ssh key that is used to connect to the instance.
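
The mistakes that are easy to make with these values (using the image name instead of the AMI ID, or pointing to the public key) can be caught with a few checks. The following Python sketch is only an illustration: it assumes the settings have already been loaded into a flat dictionary with the keys listed above (the actual layout of the yaml file may differ), and the function name is made up.

```python
import re

REQUIRED_KEYS = [
    "aws_region", "aws_keypair_name", "aws_ami_name", "aws_instance_type",
    "aws_security_groups", "ssh_username", "ssh_private_key_path",
]


def check_settings(settings: dict) -> list:
    """Return a list of problems found in the settings (empty list if OK)."""
    problems = [f"missing value for {key}" for key in REQUIRED_KEYS if not settings.get(key)]
    ami = str(settings.get("aws_ami_name") or "")
    # AMI IDs look like ami- followed by 8 or 17 hex digits
    if ami and not re.fullmatch(r"ami-(?:[0-9a-f]{8}|[0-9a-f]{17})", ami):
        problems.append("aws_ami_name should be the AMI ID (ami-...), not the image name")
    if str(settings.get("ssh_private_key_path") or "").endswith(".pub"):
        problems.append("ssh_private_key_path points to a public key")
    return problems


# Values taken from this post; the dictionary layout itself is an assumption
settings = {
    "aws_region": "eu-central-1",
    "aws_keypair_name": "id_rsa_laptop_lux",
    "aws_ami_name": "ami-04928c5fa611e89b4",
    "aws_instance_type": "t2.micro",
    "aws_security_groups": ["ssh-22-only"],
    "ssh_username": "ubuntu",
    "ssh_private_key_path": "~/.ssh/id_rsa",
}
print(check_settings(settings))
```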

When you fill the proper values, then in the root folder of the git repo run this command:

vagrant up

Then you’ll see Vagrant starting to create the new instance with the given configuration. The output should look similar to this:

$ vagrant up
Bringing machine 'default' up with 'aws' provider...
==> default: Warning! The AWS provider doesn't support any of the Vagrant
==> default: high-level network configurations (`config.vm.network`). They
==> default: will be silently ignored.
==> default: Launching an instance with the following settings...
==> default:  -- Type: t2.micro
==> default:  -- AMI: ami-04928c5fa611e89b4
==> default:  -- Region: eu-central-1
==> default:  -- Keypair: id_rsa_laptop_lux
==> default:  -- Security Groups: ["ssh-22-only"]
==> default:  -- Block Device Mapping: []
==> default:  -- Terminate On Shutdown: false
==> default:  -- Monitoring: false
==> default:  -- EBS optimized: false
==> default:  -- Source Destination check: 
==> default:  -- Assigning a public IP address in a VPC: false
==> default:  -- VPC tenancy specification: default
==> default: Waiting for instance to become "ready"...
==> default: Waiting for SSH to become available...
==> default: Machine is booted and ready for use!
==> default: Rsyncing folder: /home/.../stm32-cde-template/ => /vagrant

You’ll notice that you’re again in the prompt in your terminal, so what happened? Well, Vagrant just created a new instance from our custom Packer AMI and left it in running state. You can check in your MC and verify that the new instance is in running state. Therefore, now it’s up and running and all you need to do is to connect there and start executing commands.

To connect, there are two options. One is to use any SSH client and connect as the ubuntu user in the machine. To get the ip of the running instance just click on the instance in the “INSTANCES -> Instances” tab in your CM and then on the bottom area you’ll see this “Public DNS:”. The string after this, is the public DNS string. Copy that and then connect via ssh.

The other way, which is the easiest, is to use Vagrant and just type:

vagrant ssh

This will automatically connect you to the instance, so you don’t need to find the DNS name or even open your MC.

Building your code

Now that you’re inside your instance’s terminal after running vagrant ssh, you can build the STM32 template code like this:

git clone --depth 1 --recursive https://dimtass@bitbucket.org/dimtass/stm32f103-cmake-template.git
cd stm32f103-cmake-template
time TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./build.sh

I’ve used the time command to build the firmware in order to benchmark the build time. The result on my AWS instance is:

Building the project in Linux environment
[ 95%] Linking C executable stm32-cmake-template.elf
   text	   data	    bss	    dec	    hex	filename
  14924	    856	   1144	  16924	   421c	stm32-cmake-template.elf
[ 95%] Built target stm32-cmake-template.elf
Scanning dependencies of target stm32-cmake-template.bin
Scanning dependencies of target stm32-cmake-template.hex
[ 97%] Generating stm32-cmake-template.bin
[100%] Generating stm32-cmake-template.hex
[100%] Built target stm32-cmake-template.hex
[100%] Built target stm32-cmake-template.bin

real	0m5.318s
user	0m4.304s
sys	0m0.372s

So the code builds fine! Comparing the firmware size with the one from the previous post, you’ll see it’s the same. Therefore, what happened is that we’ve managed to create a CDE image for our project on the AWS cloud, started an instance, connected to it via SSH and then built the code using the same toolchain and an almost identical environment as in the previous post. This is great.

Now, every developer that has this Vagrantfile can create an instance from the same AMI and build the code. You can have as many instances as you like, or you can have one and let multiple developers connect to the same instance. In the latter case, though, the other developers need to connect to the instance using an SSH client instead of vagrant ssh.

Now there’s a question, though. What happens with the instance after you finish? Well, you can just type exit in the instance console and the SSH connection will close and you’ll be back to your host’s terminal. At that point the instance is still running in the AWS cloud. You can leave it running or destroy the instance by running this command on your host:

vagrant destroy

This command will terminate the instance and you can verify this in your AWS EC2 MC.

Note: When you destroy the instance everything will disappear, so you’ll lose any work you’ve done. The firmware binaries will also be destroyed and you won’t be able to revert the change, unless you’ve created a volume and mounted it in your instance or uploaded the artifacts somewhere else.

Now let’s see a few more details about properly cleaning up your instances and images.

Cleaning up properly

When building an AMI with Packer you don’t pay for it, as it uses a t2.micro instance which is in the free tier, and when it finishes it also properly cleans up any storage that was used.

The AMI that has been built also occupies some space. If you think about it, it’s a full blown OS, with a rootfs and also some storage, and this needs to be stored somewhere. When the AMI is built, it occupies a storage space which is called a snapshot, and every new instance that runs from this AMI actually starts from this snapshot (think of it like forking in git) and gets a new volume which stores all the changes compared to the original snapshot. Therefore, more space is needed.

This storage is not free, though. Well, the first 30GB of elastic block storage (EBS, which is the one used to store the EC2 volumes and snapshots) are free, but then you pay depending on how much more you use. Therefore, if you have an AMI with 8GB of storage then it’s free. If you then create a couple more instances and exceed the current 30GB storage limit, you’ll need to pay for the extra storage, per month. Don’t be afraid, though, as the cost will be something like 1 cent per snapshot, so it’s not that much at all.
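To make the free-tier arithmetic a bit more concrete, here is a minimal sketch that sums up a list of volume/snapshot sizes and prints how many GB end up above the limit. The 30GB value is the limit mentioned above and the billable_gb name is just something I made up for this example, so treat it as an illustration and not as an official AWS calculation.

```shell
#!/bin/sh
# Rough sketch: how many GB of EBS storage fall outside the free tier.
# FREE_TIER_GB=30 is the limit mentioned in the text; always check the current value.
FREE_TIER_GB=30

# billable_gb SIZE... -> prints the GB above the free-tier limit (0 if within it)
billable_gb() {
    total=0
    for size in "$@"; do
        total=$((total + size))
    done
    if [ "$total" -gt "$FREE_TIER_GB" ]; then
        echo $((total - FREE_TIER_GB))
    else
        echo 0
    fi
}
```

For example, billable_gb 8 prints 0 (a single 8GB AMI snapshot is within the limit), while billable_gb 8 8 8 8 8 prints 10, because one snapshot plus four 8GB instance volumes add up to 40GB.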

EBS is not the same storage as the Amazon Simple Storage Service (S3). S3 in the free-tier accounts is limited to 5GB and this is the bucket storage you can use to save your data or meta-data. For the rest of the post, we’re only using EBS, not S3.

You can see how much EBS storage you’re using in the “ELASTIC BLOCK STORE” section in your EC2 AWS MC. In there, there are two tabs, Volumes and Snapshots. The Volumes tab lists the storage that your instances use. Your AMI also uses some drive space, though, and this points to a snapshot. If you click on your AMI in the MC you’ll see that the block device points to a snapshot. Now click on the Snapshots tab and verify that this snapshot is there.

What happens when you run vagrant up is that a new volume is created for the instance, which derives from the AMI’s snapshot. Any changes you make there affect only the volume, not the AMI’s snapshot! When you run vagrant destroy, the volume that is attached to that instance is deleted and the AMI’s snapshot is still there for the next instance.

Therefore, each instance you run will have a volume and until the instance is destroyed/terminated you’ll pay for the storage (if it exceeds the 30GB). The amount is negligible, but you still need to keep this in mind.

Regarding the snapshot billing you can see here.

Just keep in mind that if you don’t want to get charged for anything, then just terminate your instances, delete your AMIs, the volumes and the snapshots and you’re fine. Otherwise, you always need to be careful not to exceed the 30GB limit.

Finally, the 30GB of EBS is the limit at the date I’m writing this post. This may change at any time in the future, so always check that.

Code build benchmarks

I can’t help it, I love benchmarks. So, before moving on let’s see how the solutions we’ve seen so far perform when building the STM32 firmware. I’ll compare my laptop (which is an i7-8750, 12 threads @ 2.2GHz and 16GB RAM), my workstation (Ryzen 2700X & 32GB RAM), the gitlab-ci and the AWS EC2 AMI we just built. These are the results using the time command:

               2700X    Laptop   GitLab-CI   AWS (t2.micro)
Build (secs)   1.046    3.547    5.971       5.318

I remind you that this is just a template project, so there’s not much code to build, but still the results are quite impressive. My laptop that uses 12 threads needs 3.5 secs and the cloud instances need around 5-6 secs. Not bad, right? Nevertheless, I wouldn’t compare the gitlab-ci and AWS instances that much, as this is just a single run and the build time on each might also change during the day depending on the servers’ load (I guess). The important thing is that the difference is not huge, at least for this scenario. For a more complicated project (e.g. a Yocto build), you should expect the difference to be significant.
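If you prefer to look at those numbers as ratios, this small helper divides a build time by a baseline. It’s only a sketch for this post’s single-run numbers (the ratio name is my own) and it uses awk for the floating point division, since plain sh arithmetic only handles integers.

```shell
#!/bin/sh
# ratio BASE VALUE -> prints VALUE/BASE rounded to two decimals.
# awk does the floating point math because $(( )) only handles integers.
ratio() {
    awk -v base="$1" -v val="$2" 'BEGIN { printf "%.2f\n", val / base }'
}
```

Using the 2700X as the baseline, ratio 1.046 3.547 prints 3.39 for the laptop, ratio 1.046 5.971 prints 5.71 for GitLab-CI and ratio 1.046 5.318 prints 5.08 for the AWS t2.micro instance.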

Using AWS in your CI/CD pipeline

Like in the previous post, I’ll show you how to use this AMI that you created with Packer in your CI/CD. This is where things get a bit more complicated and I’ll try to explain why. Let’s see how the CI/CD pipeline actually works. You make some changes in the code and then push the changes. Then gitlab-ci is triggered automatically, picks a gitlab-runner from the list of compatible runners and sends all the settings to that runner to build the code.

By default those runners in gitlab are always running and poll the main gitlab-ci service for waiting builds, so gitlab-ci doesn’t initiate the communication, the runner does. Although gitlab provides a number of free-to-use runners, there is a limit on the actual free time that you can use them. Of course, this limitation applies only when using gitlab.com and not a local gitlab installation. gitlab allows you to use your own gitlab-runner even when you use the gitlab.com service, though. It doesn’t matter if the runner is your laptop, a baremetal server or a cloud instance. All you need is to run the gitlab-runner client that you can download from here (see the instructions in the link) and then do some configuration in your gitlab source code repo where your .gitlab-ci.yml is.

The following image shows the simple architecture of this scenario.

In the above example gitlab is hosting the repo and the repo has a gitlab-ci pipeline. The developer pulls the code, makes changes and pushes back in the repo. Then the AWS EC2 CDE instance will poll the gitlab-ci service and when a tagged build is available it will pick the job, run the CI/CD pipeline and then return the result (including any artifacts).

There are two problems here, though. The first is how the AWS CDE builder knows which build to execute, so that it doesn’t execute builds from other repos. The second is that, as you’ve probably noticed, the builder needs to be always running in order to poll the gitlab-ci service and pick the builds! That means that the costs start adding up even if the builder is idle and doesn’t execute builds.

For now, let’s see how to create an instance that is running the gitlab-runner.

What is an AWS EC2 gitlab-runner?

We can use Packer and Ansible to create an AMI that runs a gitlab-runner. There are a few things that we need to consider, though, in order to implement an architecture that is scalable, can be used in more than one project and also makes it easy to create as many runner instances as you need without further configuration.

The gitlab-runner needs a couple of settings in order to work properly. First, it needs to know the domain that it will poll for new jobs. As I’ve mentioned earlier, in gitlab the runner polls the server at pre-configured intervals and asks for pending jobs.

Then the gitlab-runner needs to know which jobs it is able/allowed to run. This makes sense, because if you think about it, a gitlab-runner is an OS image and it comes with specific libraries and tools. Therefore, if the job needs libraries that are not included in the runner’s environment, then the build will fail. Later in this post I’ll explain which architectures are available for you to use.

Anyway, for now the runner needs to be aware of the project that it can run and this is handled with a unique token per project that you can create in the gitlab project settings. When you create this token, then gitlab-ci will forward your project’s builds only to runners that are registered with this token.

Before proceeding further let’s see the architecture approach.

gitlab-runner images architecture approach

The gitlab-runner can be a few different things. One is having a generic runner (or image instance when it comes to the cloud) with no dependencies and using the build pipeline to configure the running instance and install all the needed libraries and tools just before the build stage. Another approach is having multiple images that are only meant to build specific projects and only include specific dependencies. And finally you can have a generic image that supports docker and can use other docker containers which have the needed dependencies installed to build the source code.

Which of the above scenarios sounds better for your case? Let’s have a look at the pros/cons of each solution.

Single generic image (no installed dependencies)
+ Easy to maintain your image as it’s only one (or just very few)
+ Less storage space needed (either for AWS snapshots or docker repos)
- On every build the code repo needs to prepare the environment
- If the build environment is very complex, then creating it will be time consuming and each build will take a lot of time
- Increased network traffic on every build

Multiple images with integrated dependencies
+ Builds are fast, because the environment is already configured
+ Source code pipeline is agnostic to the environment (.gitlab-ci.yml)
- A lot of storage needed for all images, which may increase maintenance costs

Generic image that supports docker
+ Easy to maintain the image
+/- Multiple docker images need space and maintenance but the maintenance is easy
- The image instance will always have to download a docker container on every different stage of build (remember previous post?)
- The build takes more time

From my perspective, tbh it’s not very clear which architecture is overall better. All of them have their strengths and weaknesses. I guess this is something that you need to decide when you have your project specs and you know exactly what you need and what you expect from your infrastructure and what other people (like devs) expect for their workflow. You may even decide to mix architectures in order to have fast developer builds and more automated backbone infrastructure.

For our example I’ll go with the multiple image approach that includes the dependencies. So, I’ll build a specific image for this purpose and then run instances from this image that will act as gitlab-runners, which have the proper environment to build the code themselves.

Building the gitlab-runner AWS AMI

As I’ve mentioned earlier, I’ll go with the strategy to create an AWS image that contains the STM32 CDE and also runs a gitlab-runner. This is a more compact solution as the AMI won’t have to run docker and download a docker image to build the firmware and also the firmware build will be a lot faster.

Note that the instance will have to always run in order to be able to pick jobs from the gitlab-ci server. Therefore, using docker or not (which makes the build slower) doesn’t really affect the running costs; it only affects the time until the runner is available again for the next job, which is also important. So, it’s not so much about running costs, but about better performance and faster availability.

Again you’ll need this repo here:

In the repo you’ll find several different packer json image configurations. The one that we’re interested in is the `stm32-cde-aws-gitlab-runner.json`, which will build an AWS AMI that includes the CDE and also has a gitlab-runner installed. There is a change that you need to make in the json, because the configuration doesn’t know the token of your gitlab project. Therefore, you need to go to your gitlab project’s “Settings -> CI/CD -> Runners” page, copy the registration token from there and paste it in the gitlab_runner_token in the json file.

Then you need to build the image with Packer:

packer build stm32-cde-aws-gitlab-runner.json

This command will start building the AWS AMI and in the meantime you can see the progress in your AWS EC2 Management Console (MC). In the end you will see something like this in your host console:

==> amazon-ebs: Stopping the source instance...
    amazon-ebs: Stopping instance
==> amazon-ebs: Waiting for the instance to stop...
==> amazon-ebs: Creating AMI stm32-cde-image-gitlab-runner-0.1 1575670554 from instance i-038028cdda29c419e
    amazon-ebs: AMI: ami-0771050a755ad82ea
==> amazon-ebs: Waiting for AMI to become ready...
==> amazon-ebs: Terminating the source AWS instance...
==> amazon-ebs: Cleaning up any extra volumes...
==> amazon-ebs: No volumes to clean up, skipping
==> amazon-ebs: Deleting temporary security group...
==> amazon-ebs: Deleting temporary keypair...
Build 'amazon-ebs' finished.

==> Builds finished. The artifacts of successful builds are:
--> amazon-ebs: AMIs were created:
eu-central-1: ami-0771050a755ad82ea

Note in the 4th line, the AMI name in this case is `stm32-cde-image-gitlab-runner-0.1` and not `stm32-cde-image-0.1` like the previous post.

Now we’ll use Vagrant again to run an instance of the built AMI in the AWS and verify that the gitlab-runner in the instance works properly and connects and gets jobs from the gitlab-ci.

Before running Vagrant, make sure that you edit the `vagrant-aws-settings.yml` file and set the proper aws_keypair_name and aws_ami_name. You’ll find the new AMI name in the AWS EC2 MC in the “IMAGES -> AMIs” tab in the “AMI ID” column. After setting the proper values, run these commands:

ln -sf Vagrantfile_aws Vagrantfile
vagrant up
vagrant ssh
ps -A | grep gitlab

Normally you’ll see that the ps command shows a running instance of gitlab-runner. You can also verify in your gitlab project in “Settings -> CI/CD -> Runners” that the runner is connected. In my case this is what ps returns:

ubuntu@ip-xx-xx-34-51:~$ ps -A | grep gitlab
 1165 ?        00:00:00 gitlab-runner

That means the gitlab-runner runs, but let’s also see the configuration to be sure:

ubuntu@ip-xx-xx-34-x51:~$ sudo cat /etc/gitlab-runner/config.toml
concurrent = 1
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "runner-20191206T224300"
  limit = 1
  url = "https://gitlab.com"
  token = "w_xchoJGszqGzCPd55y9"
  executor = "shell"

The token you see now in the config.toml is not the token of the source code project repo, but the token of the runner! Therefore, don’t think that’s an error. So in this case the token is `w_xchoJGszqGzCPd55y9` and if you go to your gitlab’s “Settings -> CI/CD -> Runners” web page you’ll see something similar to this:

You see that the connected runner is the w_xchoJG, so the gitlab-runner that runs on the AWS AMI we’ve just built seems to be working fine. But let’s build the code to be sure that the AWS gitlab-runner works. To do that just go to your repo in the “CI/CD -> Pipelines” and trigger a manual build and then click on the build icon to get into the gitlab’s console output. In my case this is the output:

1 Running with gitlab-runner 12.5.0 (577f813d)
2   on runner-20191206T232941 jL5AyaAe
3 Using Shell executor... 00:00
5 Running on ip-172-31-32-40... 00:00
7 Fetching changes with git depth set to 50...
8 Initialized empty Git repository in /home/gitlab-runner/builds/jL5AyaAe/0/dimtass/stm32f103-cmake-template/.git/
9 Created fresh repository.
10 From https://gitlab.com/dimtass/stm32f103-cmake-template
126 [ 95%] Linking C executable stm32-cmake-template.elf
127    text	   data	    bss	    dec	    hex	filename
128   14924	    856	   1144	  16924	   421c	stm32-cmake-template.elf
129 [ 95%] Built target stm32-cmake-template.elf
130 Scanning dependencies of target stm32-cmake-template.hex
131 Scanning dependencies of target stm32-cmake-template.bin
132 [ 97%] Generating stm32-cmake-template.bin
133 [100%] Generating stm32-cmake-template.hex
134 [100%] Built target stm32-cmake-template.bin
135 [100%] Built target stm32-cmake-template.hex
136 real	0m6.030s
137 user	0m4.372s
138 sys	0m0.444s
139 Creating cache build-cache... 00:00
140 Runtime platform                                    arch=amd64 os=linux pid=2486 revision=577f813d version=12.5.0
141 build-stm32/src_stdperiph: found 54 matching files 
142 No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally. 
143 Created cache
144 Uploading artifacts... 00:02
145 Runtime platform                                    arch=amd64 os=linux pid=2518 revision=577f813d version=12.5.0
146 build-stm32/src_stdperiph/stm32-cmake-template.bin: found 1 matching files 
147 Uploading artifacts to coordinator... ok            id=372257848 responseStatus=201 Created token=r3k6HRr8
148 Job succeeded

Success! Do you see the ip-172-31-32-40? This is the AWS instance. The AWS instance managed to build the code and also uploaded the built artifact back in the gitlab-ci. Therefore we managed to use packer to build an AWS EC2 AMI that we can use for building the code.

Although it seems that everything is fine, there is something that you need to consider, as this solution has a significant drawback that needs to be resolved. The problem is that the gitlab-runner registers with the gitlab-ci server when the AMI image is built. Therefore, any instance that runs from this image will have the same runner token and that’s a problem. That means that you can either run only a single instance from this image (and therefore build several almost identical images with different tokens), or have a way to re-register the runner every time a new instance starts.

To fix this, you need to use a startup script that registers a new gitlab-runner every time the instance starts. There is documentation about how to use the user_data to do this in here. In this case, because I’m using Vagrant, all I have to do is to edit the `vagrant-aws-settings.yml` file and add the `register-gitlab-runner.sh` name in the aws_startup_script: line, like this:

aws_startup_script: register-gitlab-runner.sh

What will happen now is that Vagrant will pass the content of this script file to AWS and when an instance starts, it will register itself as a new gitlab-runner.
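To give you an idea of what such a startup script may look like, here is a minimal sketch. Note that this is not the actual `register-gitlab-runner.sh` from the repo; the GITLAB_URL, REG_TOKEN and GITLAB_RUNNER_BIN names are my assumptions for this example and you need to use the registration token from your own project’s “Settings -> CI/CD -> Runners” page.

```shell
#!/bin/sh
# Hypothetical sketch of a startup (user_data) script, NOT the repo's register-gitlab-runner.sh.
# GITLAB_URL, REG_TOKEN and GITLAB_RUNNER_BIN are assumed names for this example.
GITLAB_URL="${GITLAB_URL:-https://gitlab.com}"
REG_TOKEN="${REG_TOKEN:-your-registration-token}"

register_runner() {
    # Timestamped name, similar to the runner-20191206T224300 seen in config.toml
    name="runner-$(date +%Y%m%dT%H%M%S)"
    # GITLAB_RUNNER_BIN=echo prints the arguments instead of calling gitlab-runner
    "${GITLAB_RUNNER_BIN:-gitlab-runner}" register --non-interactive \
        --url "$GITLAB_URL" \
        --registration-token "$REG_TOKEN" \
        --executor shell \
        --name "$name"
}
```

A real startup script would simply call register_runner at its end. While testing, you can run it with GITLAB_RUNNER_BIN=echo to review the arguments before anything gets registered in your project.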

Another thing you need to do is to comment out the last two lines in the `provisioning/roles/gitlab-runner/tasks/main.yml` file, in order to disable registering a runner by default while the image is being created.

# - name: Register GitLab Runner
#   import_tasks: gitlab-runner-register.yml

Then re-build the Packer image and run Vagrant again.

packer build stm32-cde-aws-gitlab-runner.json
vagrant up

In this case, the Packer image won’t register a runner while building. Then the vagrant up command will pass the aws_startup_script, which is defined in the `vagrant-aws-settings.yml` file, and the gitlab-runner will be registered when the instance starts.

One last thing that remains is to automate the gitlab-runner un-registration. In this case with AWS you need to add your custom script in /etc/ec2-termination, as described in here. For my example it’s not convenient to do that now, but you should be aware of this and probably implement it in your Ansible provisioning while building the image.

Therefore, with the current AMI, you’ll have to remove the outdated gitlab-runners from the gitlab web interface in your project’s “Settings -> CI/CD -> Runners” page.
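If you decide to automate the un-registration, the script itself can be very small. The following is only a sketch under my own assumptions (the unregister_all_runners and GITLAB_RUNNER_BIN names are made up for this example) and I haven’t wired it into the termination hook; it just shows the gitlab-runner command that removes all the runners listed in the local config.toml from the gitlab server.

```shell
#!/bin/sh
# Hypothetical termination hook (not a script from the repo): removes every runner
# that is registered in the local config.toml from the gitlab server.
# GITLAB_RUNNER_BIN is an assumed override that makes dry runs easy,
# e.g. GITLAB_RUNNER_BIN=echo prints the arguments instead of calling gitlab-runner.
unregister_all_runners() {
    "${GITLAB_RUNNER_BIN:-gitlab-runner}" unregister --all-runners
}
```

The hook script would call unregister_all_runners before the instance goes away, so the stale runners don’t pile up in the “Settings -> CI/CD -> Runners” page.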

Using docker as the gitlab-runner (Vagrant)

In the same repo you may have noticed two more files, `stm32-cde-docker-gitlab-runner.json` and `Vagrantfile_docker`. Probably, you’ve guessed right… Those two files can be used to build a docker image with the CDE including the gitlab-runner and the vagrant file to run a container of this image. To do this run the following commands on your host:

ln -sf Vagrantfile_docker Vagrantfile
vagrant up --provider=docker
vagrant ssh
ps -A | grep gitlab

With those commands you’ll build the image using vagrant and then run the instance, connect to it and verify that gitlab-runner is running. Also check your gitlab project again to verify this and then re-run the pipeline to verify that the docker instance picks up the build. In this case, I’ve used my workstation, which is a Ryzen 2700X with 32GB RAM and an NVMe drive. This time my workstation registered as a runner in the project and, again, it worked fine. This is the result in the gitlab-ci output:

1 Running with gitlab-runner 12.5.0 (577f813d)
2   on runner-20191207T183544 scBgCx85
3 Using Shell executor... 00:00
5 Running on stm32-builder.dev... 00:00
7 Fetching changes with git depth set to 50...
8 Initialized empty Git repository in /home/gitlab-runner/builds/scBgCx85/0/dimtass/stm32f103-cmake-template/.git/
9 Created fresh repository.
10 From https://gitlab.com/dimtass/stm32f103-cmake-template
126 [ 95%] Linking C executable stm32-cmake-template.elf
127    text	   data	    bss	    dec	    hex	filename
128   14924	    856	   1144	  16924	   421c	stm32-cmake-template.elf
129 [ 95%] Built target stm32-cmake-template.elf
130 Scanning dependencies of target stm32-cmake-template.bin
131 Scanning dependencies of target stm32-cmake-template.hex
132 [100%] Generating stm32-cmake-template.hex
133 [100%] Generating stm32-cmake-template.bin
134 [100%] Built target stm32-cmake-template.hex
135 [100%] Built target stm32-cmake-template.bin
136 real	0m1.744s
137 user	0m4.697s
138 sys	0m1.046s
139 Creating cache build-cache... 00:00
140 Runtime platform                                    arch=amd64 os=linux pid=14639 revision=577f813d version=12.5.0
141 build-stm32/src_stdperiph: found 54 matching files 
142 No URL provided, cache will be not uploaded to shared cache server. Cache will be stored only locally. 
143 Created cache
144 Uploading artifacts... 00:03
145 Runtime platform                                    arch=amd64 os=linux pid=14679 revision=577f813d version=12.5.0
146 build-stm32/src_stdperiph/stm32-cmake-template.bin: found 1 matching files 
147 Uploading artifacts to coordinator... ok            id=372524980 responseStatus=201 Created token=KAc2ebBj
148 Job succeeded

It’s obvious that the runner now is more powerful because the build only lasted 1.7 secs. Of course, that’s a negligible difference compared to the gitlab-ci built-in runners and also the AWS EC2 instances.

The important thing to keep in mind in this case is that when you run vagrant up, the container instance is not terminated after the Ansible provisioning has finished! That means that the container keeps running in the background after Ansible ends, which is the reason that the gitlab-runner is running when you connect to the container using vagrant ssh. This is a significant difference compared to other methods of running the docker container, as we’ll see in the next example.

Using docker as the gitlab-runner (Packer)

Finally, you can also use the packer json file to build the image and push it to your docker hub repository (if you like) and then use your console to run a docker container from this image. In this case, though, you need to manually run the gitlab-runner when you create a container from the docker image, because no background services are running in the container when it starts, unless you run them manually when you create the container or have an entry-point script that runs those services (e.g. the gitlab-runner).

Let’s first see the simple case, in which you start a container with bash as the entry point.

docker run -it dimtass/stm32-cde-image-gitlab-runner:0.1 -c "/bin/bash"

If you’ve built your own image then you need to use its name in the above command. After you run this command you’ll end up inside the container and if you try to run ps -A you’ll find that there’s pretty much nothing running in the background, including the gitlab-runner. Therefore, the currently running container is like running the CDE image we’ve built in the previous post. That means that this image can be used both as a CDE and as a gitlab-runner container! That’s important to keep in mind and you can take advantage of this default behavior of docker containers.

Now in the container terminal run this:

gitlab-runner run

This command will run gitlab-runner in the container, which in turn will use the config in `/etc/gitlab-runner/config.toml` that Ansible installed. Then you can run the gitlab pipeline again to build using this runner. It’s good during your tests to run only a single runner for your project so it’s always the one that picks the job. I won’t paste the result again, but trust me it works the same way as before.

The other way you can use the packer image is that instead of running the container using /bin/bash as entry-point, yes you’ve guessed right, use the gitlab-runner run as entry point. That way you can use the image for doing automations. To test this stop any previous running containers and run this command:

docker run -it dimtass/stm32-cde-image-gitlab-runner:0.1 -c "/usr/bin/gitlab-runner run"

This command will run the gitlab-runner in the container and the terminal will block until you exit or quit. Here is another important thing to be careful about! gitlab-runner can also run as a background service, so you might be tempted to start it like this:

docker run -it dimtass/stm32-cde-image-gitlab-runner:0.1 -c "/usr/bin/gitlab-runner start"

This won’t work! Docker by design will exit when the command exits and it will return the exit status. Therefore, if you don’t want your terminal to block when running the gitlab-runner run command, you need to run the container as a detached container. To do this, run this command:

docker run -d -it dimtass/stm32-cde-image-gitlab-runner:0.1 -c "/usr/bin/gitlab-runner run"

This command will return by printing the hash of the running container, but the container will keep running in the background as a detached container. Therefore, your runner will keep running. You can always use the docker ps -a command to verify which containers are running and which are exited.

If you think that you can just spawn 10 of these containers now, then don’t get excited because this can’t be done. You see, the runner was registered during the Ansible provisioning state, so at that point the runner got a token, that you can verify by comparing the `/etc/gitlab-runner/config.toml` configuration and your project’s gitlab web interface in the “Settings -> CI/CD -> Runners”. Therefore, if you start running multiple containers then all of them will have the same token.

This means that you have to deal with this somehow, otherwise you would need to build a new image every time you need a runner. Well, no worries, this can be done with a simple script like in the AWS case previously, which may seem a bit ugly, but it’s fine to use. Again, there are many ways to do that. I’ll explain two of them.

One way is to add an entry script in the docker image and then you point to that as an entry point when you run a new container. BUT that means that you need to store the token in the image, which is not great and also that means that this image will be only valid to use with that specific gitlab repo. That doesn’t sound great, right?

The other way is to have a script on your host and then run a new container by mounting that script and run it in the entry-point. That way you keep the token on the host and also you can have a git hosted and versioned script that you can store on a different git repo (which brings several good things and helps automation). So, let’s see how to do this.

First have a look in the git repo (stm32-cde-template) and open the `register-gitlab-runner.sh` file. In this file you need to add your token from your repo (stm32f103-cmake-template in our case) or you can add the token in the command line in your automation, but for this example just enter the token in the file.

What the script does is that it first un-registers any previous local (in the container, not globally) runner and then registers a new runner and runs it. So simple. The only not so simple thing is the weird command you need to execute in order to run a new container, which is:

docker run -d -v $(pwd)/docker-entry-gitlab-runner.sh:/docker-entry-gitlab-runner.sh dimtass/stm32-cde-image-gitlab-runner:0.1 -c "/bin/bash /docker-entry-gitlab-runner.sh"

Note that you need to run this command in the top level directory of the stm32-cde-template repo. What that command does is that it creates a detached container, mounts the host’s local docker-entry-gitlab-runner.sh script and then executes that script as the container’s entry point. Ugly command, but what it does is quite simple.

What to do now? Just run that command multiple times! 5-6 times will do. Now run docker ps -a on your host and also check your repo’s Runners in the settings. This is my repo after running this command.

Cool right? I have 6 gitlab runners, running on my workstation that can share 16 cores. OK, enough with that, now stop all the containers and remove them using docker stop and docker rm.
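If you don’t want to type that long command again and again, you can wrap it in a small loop. The following is just a sketch with names I’ve made up (spawn_runners, cleanup_runners and the DOCKER override); it assumes that you run it from the top level directory of the stm32-cde-template repo and that you use the same image name as above.

```shell
#!/bin/sh
# Sketch: start N detached runner containers and clean them up afterwards.
IMAGE="dimtass/stm32-cde-image-gitlab-runner:0.1"
SCRIPT="docker-entry-gitlab-runner.sh"
DOCKER="${DOCKER:-docker}"   # assumed override: DOCKER=echo only prints the arguments

# spawn_runners N -> runs the docker command from above N times
spawn_runners() {
    count="$1"
    i=0
    while [ "$i" -lt "$count" ]; do
        "$DOCKER" run -d -v "$(pwd)/$SCRIPT:/$SCRIPT" "$IMAGE" -c "/bin/bash /$SCRIPT"
        i=$((i + 1))
    done
}

# Stops and removes every container that was created from IMAGE.
cleanup_runners() {
    ids=$("$DOCKER" ps -a -q --filter "ancestor=$IMAGE")
    [ -z "$ids" ] && return 0
    # $ids is intentionally unquoted, so every container id becomes its own argument
    "$DOCKER" stop $ids
    "$DOCKER" rm $ids
}
```

With this, spawn_runners 6 gives you the six runners and cleanup_runners stops and removes them. Keep in mind that cleanup_runners only deals with the containers, so the runners they registered may still show up in your project’s Runners page until you remove them.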


With this post we’re done with the simple ways of creating a common development environment (CDE) image that we can use for building a firmware for the STM32. This image can also be used to create gitlab-runners that will build the firmware in a pipeline. You’ve noticed that there are many ways to achieve the same result and each way is better in some things and worse than the other ways in others. It’s up to you to decide the best and simplest approach to solve your problem and base your CI/CD architecture on it.

You need to experiment at least once with each case and document the pros and cons of each method yourself. Pretty much the same thing that I’ve done in the last two posts, but you need to do it yourself, because you may have other needs and get deeper into aspects that I didn’t get into myself. Actually, there are so many different projects and each project has its own specifications and needs, so it’s hard to tell which is the best solution for every case.

In the next post, I’ll get deeper into creating pipelines with testing farms for building the firmware, flashing and testing. You’ll see that there are cheap ways to create small farms and that these can be used in many different ways. Keep in mind that these posts are just an introduction to DevOps for embedded and only scratch the surface of this huge topic. You may find that the examples in these posts are enough for most of the cases you’ll need, as not all projects are complicated, but large scale projects are much more complicated than what’s shown here. In any case, even in large scale projects you start simple and then get deeper step-by-step. Until the next post…

Have fun!








DevOps for embedded (part 1)


Note: This is the first post of the DevOps for Embedded series. Here you can find part 2, and part 3.

Wow, it’s been some time since the last post. A lot happened since then, but now it’s time to come back with a new series of posts around DevOps and embedded. Of course, embedded is a vast domain (as is DevOps), so I’ll start with a stupid project (of course) that just builds the code of a small MCU and then I’ll probably get into a small embedded Linux project with Yocto, because there are significant differences in the approach and strategies you need to use in each case, although the tools are pretty much the same. One thing that you’ll realize when you get familiar with the tools and the DevOps domain is that you can achieve the same result by using different methodologies and architectures, but what matters at the end of the day is to find a solution that is simple for everyone to understand, re-usable, scalable and easy to maintain.

Oh, btw this article is not meant for DevOps engineers, as they’re already familiar with these basic tools and the architecture of CI/CD pipelines. What may be interesting for DevOps engineers, especially those coming from the Web/Cloud/DB domain, are the problems that embedded engineers have and the tools that they’re using.

Note: the host OS used in the article is Ubuntu 18.04.3 LTS, therefore the commands and tools installation procedure is for this OS, but it should be quite similar for any OS.

What is DevOps?

Actually, there is a lot of discussion around DevOps and what exactly it is. I’ll try to describe it in a simple way. The word is an abbreviation and it means Development/Operations and according to wikipedia: “DevOps is a set of practices that combines software development (Dev) and information-technology operations (Ops) which aims to shorten the systems development life cycle and provide continuous delivery with high software quality.”

To put it in other words, it’s all about automation. Automate everything. Automate all the stages of development (+deployment/delivery) and every detail of the procedure. Ideally, a successful DevOps architecture would achieve this: if the building burns down to the ground, you just buy new equipment and you have your infrastructure back within a couple of hours. If your infrastructure is in the cloud, then you can have it back within minutes. The automation ideally starts from people’s access key-cards for the building and ends at the product that clients get in their hands. Some would argue that the coffee machine should also be included in the automation process; in that case, if it has a CPU and a debug port, then it’s doable.

If you have to keep one thing from this post, it’s that DevOps is about providing simple and scalable solutions to the problems that arise while you’re trying to fully automate your infrastructure.

Why DevOps?

It should be obvious by now that having everything automated not only saves you a lot of time (e.g. having your infrastructure back in case it “dies on a bizarre gardening accident” [SpinalTap pun]), but if it’s done right it also makes it easier to track problematic changes in your infrastructure and revert or fix them fast, so it minimizes failures and hard-to-find issues. Also, because you will normally automate your infrastructure gradually, you will have to architect this automation in steps, which makes you understand your infrastructure better and deal more efficiently with issues and problems. This understanding of the core of your infrastructure is useful and it’s also a nice opportunity to question your current infrastructure architecture and processes and therefore make changes that will help those processes in future developments.

It’s not necessary to automate 100% of your infrastructure, but it is important to at least automate procedures that are repeated and are prone to frequent errors or issues that add delays to the development cycle. Usually those are the things that are altered by many people at the same time, for example code. In this case you need a basic framework and workflow that protects you from spending a lot of time debugging and fixing issues rather than adding new functionality.

What DevOps isn’t.

DevOps isn’t about the tools! There are so many tools, frameworks, platforms, languages, abbreviations and acronyms in DevOps, that you’ll definitely get lost. There are tons of them! And to make it even worse there are new things coming out almost every week. Of course, trying to integrate all the new trends into your automation infrastructure doesn’t make sense and this shouldn’t be your objective. Each of these tools has its use and you need to know exactly what you need. If you have time for research, then do your evaluation, but don’t force yourself to use something just because it’s hyped.

In DevOps and automation in general you can achieve the same result in numerous different ways and with different tools. There’s not a single or “right” way of doing something. It’s up to you to understand your problem well, know at least a few of the available tools and provide a simple solution. What’s important here is to always provide simple solutions: simple enough for everyone in the team to understand and easy to test, troubleshoot and debug.

So forget the idea that you have to use Kubernetes or Jenkins or whatever just because others do. You only need to find the simplest solution for your problem.

DevOps and embedded

Usually DevOps is found in cloud services. That means that most of the web servers and databases on the internet are based on DevOps practices, which makes a lot of sense, since you wouldn’t like your web page or corporate online platform to be offline for hours or days if something goes wrong. Everything needs to be automated and actually there’s a special branch of DevOps called Site Reliability Engineering (SRE) that hardens the infrastructure even more, up to the point, in some cases, of surviving even catastrophic global events. This blog for example is running on a VPS (virtual private server) in the cloud, but tbh I don’t think that my hosting service would survive a global catastrophic event. Then again, if that happens, who cares about a blog.

So, what about embedded? How would you apply DevOps to your embedded workflow and infrastructure? Well, that’s easy, actually. Just think how you would automate everything you do, so that if you lose everything tomorrow (except your code, of course), you can be back on track in a few minutes or hours. For example, all your developers get new laptops and are able to start working in a few minutes, all the tools, SDKs and toolchains have the same versions and the development environment is the same for everyone. Also, your build servers are back online and you restart your continuous integration, continuous delivery (CI/CD) in minutes.

How does this sound? Cool, right?

Well, it’s doable, of course, but there are many steps in between and the most important is to understand what problems you need to solve to get there, what your expectations are in time, cost and resources and how to architect such a solution in a simple and robust way. What also makes embedded interesting is that most of the time you have a hardware product, so you need to flash the firmware on the various MCUs, FPGAs and DSPs and then run automated tests to verify that everything works properly. If your project is large, you may even have farms with the specific hardware running and use them in your CI/CD pipeline.

Although it’s too early in the article, I’ll put a link to a presentation that BMW did a few years ago. They managed to automate a large scale embedded project for an automotive head unit. This is the YouTube presentation and this is the PDF document. I repeat that this is a large scale project with hundreds of developers and it’s also complicated in many aspects. Nevertheless, they successfully managed to automate the full process (at least that’s the claim).

Keep in mind that the more you read about this domain, the more often you’ll find that phrases like “CI/CD” or “CI pipeline” are used instead of DevOps. Actually, CI/CD is considered part of the DevOps domain and it’s not an alternative abbreviation for DevOps, but if you know this then you won’t get confused while reading. You’ll also find many other abbreviations like IaC or IaaS, which are probably used more in cloud computing, but the target is still the same. Automate everything.

How to start?

So, now it should be quite clear what DevOps is and how it can help your project and if you’re interested, then you’re probably wondering, how to start?

Wait… Before getting into details, let’s define a very basic and simple embedded project in which you need to develop a firmware for a small MCU (e.g. STM32). In this case, you need to write the code, compile it, flash it on the target, test it and if everything works as expected then integrate the changes to the release branch and ship it. Although that’s a very basic project, there are many different ways to automate the whole pipeline and some ways may be better than others depending on the case. The important thing is to understand that each case is different and the challenge is to find the simplest and optimal solution for your specific problem without over-engineering it.

As you realize, whatever follows in this series of articles is not necessarily the best solution to every problem. It’s just one of the ways to solve this particular problem. Because there are so many tools and ways to do something, I won’t get into details for all of them (I don’t even know all of the available tools out there as there are too many, so there might be a simpler solution than the one I’m using).

Although there’s not a single answer that covers all the possible scenarios, there are a couple of things that need to be done in most cases. Therefore, I’ll list those things here and start to analyze a few of them and make examples during the process. So, this is the list:

Note that not all of these steps can be fully automated. For example, the testing may involve some manual tests and also, most of the time, automated user acceptance testing is difficult to achieve in embedded projects that may have various electronics around.

Finally, this is not a comprehensive list, but it’s a basic skeleton for most of the projects. Let’s now see some of the above items.

Common development environment (CDE)

It’s very important to have a common development environment that is shared by all the developers and that is itself easy to reproduce, test, deploy and deliver. This is actually one of the most important things you need to apply to your workflow in the early stages, because it will save you from a lot of trouble down the path.

So, what’s a CDE? The assumption was that we’re going to develop a firmware for an STM32 MCU. In this case, the CDE is all those things that are needed to compile your code. That includes your OS version, your system libraries, the toolchain, the compiler and linker flags and the way you build the firmware. What’s not important in this case is the IDE that you’re using, which makes sense if you think about it and it’s also convenient, as developers can use their favorite editor to write code (including their favorite plugins). It is important though that those IDEs don’t pollute the project environment with unnecessary custom configuration and files, so precautions are needed there, too (git makes that easy, though).

Why is using a CDE important? Here’s a real project case. Recently I was working in a company where, despite my recommendations, the developers refused to integrate a CDE in their development workflow. I had already created a CDE for building a quite complex Yocto distro, but nobody ever used it and I couldn’t enforce its use either, because in flat hierarchies the majority can also make bad choices. Eventually, developers were using their own workstation environments to build the Linux distro, which in the end led to a chaotic situation where the image wasn’t building the same way on all workstations: on some it was even failing to build, while at the same time it was building on others. As you can imagine this is not sustainable and it makes development really hard. At some point I just gave up on even trying to build the distro, as I was spending more time solving issues than building it. Having a CDE ensures that your project will build the same way wherever it runs, which gives you robustness and consistency. This reason alone should be enough for everyone to consider adopting a CDE in the development workflow.

So how can you do this? Well, there are a few ways, but it is important to have an automated way to reproduce a versioned CDE. This is usually done by using a virtual machine (or virtual OS image) that is provisioned to include all the needed tools. There are quite a few virtualization solutions like VMWare, VirtualBox, Docker and cloud VPS (Virtual Private Server) providers like AWS, Azure, etc. Some of them are free, but others charge per usage or require a license. Which one is best for you is something that you need to decide. Usually, if you don’t have any experience you will probably need to buy support even for a free product, just to play safe.

Therefore, in our project example the CDE will be a VM (or virtual OS) that includes all those tools and can be shared between developers, who may even use a different OS as the base system on their workstations. See? This is also very useful because the CDE will be OS independent.

Since there are many virtualization technologies, you need to first decide which one suits your needs. VMWare and VirtualBox are virtual machine solutions, which means that they run a software machine on top of your host OS. This is a sandboxed machine that uses your hardware’s shared resources, therefore you can assign a number of CPUs, RAM and storage from your host system to the VM. As you can imagine, the performance of this setup is not that great and you may need more resources compared to other solutions.

Another solution is using a containerized image, like Docker, that uses all your hardware resources without a virtual machine layer in between. It’s still isolated (not 100% though) and since it’s containerized it can be transferred and re-used, too (like the VM solution). This solution has better performance compared to VMs, but its underlying implementation is different depending on the host OS: on a Linux host Docker performs great, but on a Windows host it’s actually implemented as a VM.

Then there are solutions like working on remote VPS (virtual private servers), which are either baremetal or cloud servers (on your premises or at a remote location). These servers run an OS and host the CDE that you use to build your code, therefore you edit your code locally on your host OS with your favorite IDE and then build the code remotely on the VPS. The benefit of this solution is that the VPS can be faster than your laptop/workstation and can also be shared among other developers. The disadvantage is that you need a network connection and this may become a problem if the build artifacts that you need to test are large and the network connection is not fast (e.g. if the image is a 30+ GB Linux distro, that’s a lot of data to download even on fast networks).

What’s important though is that, whatever solution you choose, you apply the IaC (infrastructure as code) concept, which means that the infrastructure is managed and provisioned through scripts, code or config files.

Now, let’s examine our specific case. Since we’re interested in building a baremetal firmware for the STM32 it’s obvious that we don’t need a lot of resources or a powerful builder. Even if the project has a lot of code, the build may take 1-2 mins, which is a fair time for a large code base. If the code is not that large then it may take just a few seconds. Of course, if the build is running on a host that can run a parallel build and compile many source files at the same time, then the build will be even faster. So in this case, which solution is preferred?

Well, all the solutions that I’ve mentioned earlier are suitable for this case; therefore, you can use VirtualBox or Docker (which are free) or use a VPS like AWS, which is not free but doesn’t cost that much (especially for a company or organization) and also saves you the time, money and energy of running your own infrastructure 24/7, as you pay only for the time your image is running.

Because all these solutions seem to be suitable, there’s another question. How can you evaluate all of them and decide? Or even better, how can you keep an alternative solution available that won’t cost you a lot of time and money when you decide to change from one solution to another? Well, as you may guess there are tools for this, like Packer or Vagrant, that simplify the process of creating different types of images with the same configuration. Although those two tools seem similar, they’re actually a bit different and it’s always a good idea to read more about them in the previous links. You can use both to create CDEs, but Packer is better for creating images (or boxes that can also be used by Vagrant) and this can be done with a single configuration file that builds images for different providers at the same time. Providers in this context are Docker, Vagrant, AWS, etc. Also, both tools support provisioning tools that allow you to customize your image.

There are many provisioning tools like Puppet, Chef, Ansible and even shell scripts. Packer and Vagrant can use pretty much the same provisioners (link1 and link2). I won’t get into the details for each one of them, but although shell scripts seem to be the simplest way of provisioning, they’re not really. In this article series I’ll use Ansible.

Why Ansible? That’s because it’s more flexible, makes it easier to handle more complex cases and it comes with a ton of ready to use examples, modules and roles (roles are just pre-defined and tested scripts for popular packages). That makes it better than using shell scripts. Compared to Puppet and Chef, Ansible is “better” in our case, because it’s able to connect via SSH to the target and configure it remotely, instead of having to run the provisioner on the target itself.

Another problem with using shell scripts is that they may get too complicated in case you want to cover all cases. For example, let’s assume that you want to create a directory on the target image; then you need to use the mkdir command to create it. But if this directory already exists, then the script will fail, unless you first check whether the directory exists before you execute the command. Ansible does exactly that: it first checks the change that it needs to make and if the state is already the wanted one, then it doesn’t apply the change.
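To make the difference concrete, here is a minimal shell sketch of the check-before-change idea. The ensure_dir helper and its ok/changed messages are hypothetical (they only imitate the way Ansible reports a task), and a throwaway temp directory is used so it’s safe to run anywhere.

```shell
#!/bin/sh
# Sketch of the "check the state first" idea, using a throwaway directory.
# ensure_dir is an illustrative helper, not an Ansible or Packer command.
TARGET_DIR="$(mktemp -d)/toolchains"

# Create the directory only when it is missing and report what happened.
ensure_dir() {
    if [ -d "$1" ]; then
        echo "ok: $1"
    else
        mkdir "$1" && echo "changed: $1"
    fi
}

ensure_dir "$TARGET_DIR"   # first run: the directory is created
ensure_dir "$TARGET_DIR"   # second run: nothing to do, no failure
```

For a single directory mkdir -p would do the job, but a real provisioning script needs a check like this for every package, file, user and config change, and that’s the part a provisioner like Ansible takes off your hands.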

Since we decided to create a Docker image and an AWS EC2 AMI, which have almost the same configuration, Packer seems a suitable solution for this. You could also use Vagrant, but let’s go with Packer for now. In both cases, you need to have a Docker and an AWS account in order to proceed with the rest of the examples, as you’ll need to push those images to a remote server. If you don’t care about the cloud and pushing an image to it, then you can still use Packer to create only the Docker image, which can be stored in the local repository on your computer, but in this case maybe go with Vagrant, as it provides a more user friendly way to build, run and use the image without having to deal with the docker cli.

I won’t get into details here on how to create your AWS account and credentials, as this would need a lot of time and writing. You can find this information though in the AWS EC2 documentation, which explains things much better than me. Therefore, from now on I assume that you already have at least a free AWS account and your credentials (access and secret key).

In this article, we’re going to use these repos:


In the first repo you’ll find the files that Packer will use to build the image and the second repo is just a small STM32 template code sample that I use every time I start a new project for the STM32F103.

Before continuing building the image, let’s have a look at the environment that we need inside this image. By having a look at the source code it’s obvious that we need CMake and an ARM GCC toolchain. Therefore, the image needs to include these and we do that by using Ansible as the provisioner during the image build. Ansible uses playbooks and roles for provisioning targets. Again, I won’t get into the details of how to use Ansible in this article and you can find more information in the online documentation here.

Installing Packer

To install Packer, download the pre-compiled binary for your architecture here. For Ubuntu 18.04 that would be the Linux 64-bit. Then copy it to /usr/bin (using sudo). You can also just copy it to your ~/.local/bin folder if you’re the only user that will use Packer.
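As a sketch of the per-user option, the steps could look like the following. It assumes that the packer binary (extracted first, if the download comes as an archive) sits in the current directory; BIN_DIR is just a variable name picked for this example.

```shell
#!/bin/sh
# Sketch: install the downloaded packer binary for the current user only.
# Assumes the packer binary is in the current directory; BIN_DIR is illustrative.
BIN_DIR="$HOME/.local/bin"
mkdir -p "$BIN_DIR"

# Copy the binary only if it is actually here.
if [ -f ./packer ]; then
    cp ./packer "$BIN_DIR/packer"
    chmod +x "$BIN_DIR/packer"
fi

# Add the directory to PATH for this shell if it is not there already.
case ":$PATH:" in
    *":$BIN_DIR:"*) ;;
    *) PATH="$BIN_DIR:$PATH"; export PATH ;;
esac
```

The PATH change only lasts for the current shell, so for a permanent setup you’d add it to your shell startup file.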

To test if everything is ok, run:

packer -v

In my case it returns version 1.4.5.

Installing Ansible

To install Ansible on Ubuntu 18.04 you need to add an external repo and install it from there using the following commands:

sudo apt-add-repository ppa:ansible/ansible
sudo apt-get update
sudo apt-get install ansible -y

This is the official ppa repo, so it’s ok to use that. If you trust nobody, then you can see other installation methods here (you can also just use the git repo).

To test that everything is ok, run:

ansible --version

In my case the version is 2.9.1. Keep in mind that since those tools are under constant development they’re updated regularly, so you always need to use the documentation that matches your version.
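Before going further it may be handy to check in one go that the host tools used in this article are installed. The check_tools helper below is a hypothetical convenience function, not something that Packer or Ansible provide.

```shell
#!/bin/sh
# Sketch: check that the host tools used in this article are on the PATH.
# check_tools is a made-up helper name for illustration.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools packer ansible docker git || echo "Install the missing tools before continuing."
```

The function returns non-zero when at least one tool is missing, so it can also be used as a guard at the top of a build script.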

Registering to docker and creating a repository

In this example we’ll use the docker hub and create a repository for this image and store it there. The reason for this is that the image will be accessible from anywhere and any docker host will have access to it. Of course, in case you don’t want to publish your image you can just create either a local repository on your workstation or have a local server in the network that stores the docker images.

In this case, I’ll use docker hub. So you need to create an account there and then create a repository with the name stm32-cde-image. Your image will then be in the <username>/stm32-cde-image repository. So, in my case, since my username is dimtass, it will be dimtass/stm32-cde-image.

Now that you have your repo, you need to go to your user account settings and create a token in order to be able to push images to your repositories. You can do this in Account Settings -> Security -> Access Tokens. Follow the instructions there to create a token, then run a shell command with your username and the token to create your credentials locally on your system. After that, you can push images to your account, so Packer can do the same.
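The shell command for creating the local credentials is docker login. As a sketch, the token can be fed through stdin so that it doesn’t end up in your shell history. The helper name and the token file location are assumptions made for this example.

```shell
#!/bin/sh
# Sketch: log in to docker hub with an access token that is stored in a file.
# docker_login_with_token and the token file path are illustrative names.
docker_login_with_token() {
    user="$1"
    token_file="$2"
    if [ -z "$user" ] || [ ! -r "$token_file" ]; then
        echo "usage: docker_login_with_token <username> <token-file>" >&2
        return 1
    fi
    # --password-stdin reads the token from stdin instead of the command line.
    docker login --username "$user" --password-stdin < "$token_file"
}

# Example (uncomment and adjust to your account):
# docker_login_with_token dimtass "$HOME/.docker-hub-token"
```

If you go this way, keep the token file readable only by your user (e.g. chmod 600) or delete it after the login.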

Finally, in the stm32-cde-docker.json and stm32-cde-images.json files, you need to change the repository name in the docker post-processor and use yours, otherwise the image build will fail.

Create the CDE images

Now that you have all the tools you need, open your terminal in the packer repo you cloned earlier and have a look at the stm32-cde-images.json, stm32-cde-aws.json and stm32-cde-docker.json files. These are the configuration files that packer will use to create the images. The stm32-cde-images.json file contains the configuration for both docker and aws ec2 images and the other two files are for each image separately (docker, aws).

In part-1 I’ll only explain how to build and use the docker image, so forget about anything related to AWS for now. In the stm32-cde-images.json and stm32-cde-docker.json files there is a builder type which is for Docker. In this case, all that matters is to choose the image from any repository. I’ve chosen ubuntu:16.04 as the base image, which is from the official docker repository and is quite similar to Amazon’s EC2 image (there are significant differences though). The base image is actually an official docker image that the docker client on your host OS will fetch from the docker hub and use as a base to create another image on top of it. This official image is a stripped-down ubuntu image, on which you can install whatever packages, tools and libraries you need, which means that the more you add the larger the image gets. Therefore, the base ubuntu:16.04 image may be just 123MB, but this can easily grow to a GB or even more.

As I’ve noted earlier, although you can use Packer to build images with various different builder types, this doesn’t mean that the base images from all the sources will be the same; there might be some differences between them. Nevertheless, a provisioner like Ansible will handle this without issues, as it will only apply changes that don’t already exist in the image, and that’s the reason that Ansible is better than a generic shell script, as a script may fail on one specific image and not on another. Of course, there’s a chance that you’re using shell commands in your Ansible configuration; in this case you just need to test that everything works as expected.

To create the docker image you need to clone this repo:


And use packer to build the stm32-cde-docker.json like this:

git clone https://dimtass@bitbucket.org/dimtass/stm32-cde-template.git
cd stm32-cde-template
packer build stm32-cde-docker.json

Note: before running the build command, you need to change the docker_repo variable to point to the docker repo that you’ve created, and you also need to have created a token for docker to be able to push the image.

If everything goes fine, then you’ll see something like this:

==> docker: Creating a temporary directory for sharing data...
==> docker: Pulling Docker image: ubuntu:16.04
==> docker: Using docker communicator to connect:
==> docker: Provisioning with shell script: /tmp/packer-shell909138989
==> docker: Provisioning with Ansible...
==> docker: Committing the container
==> docker: Killing the container: 84a9c7842a090521f8dc9bd70c39bd99b6ce46b0d409da3cdf68b05404484b0f
==> docker: Running post-processor: docker-tag
    docker (docker-tag): Tagging image: sha256:0126157afb61d7b4dff8124fcc06a776425ac6c08eeb1866322a63fa1c3d3921
    docker (docker-tag): Repository: dimtass/stm32-cde-image:0.1
==> docker: Running post-processor: docker-push
Build 'docker' finished.

That means that your image was built without errors and it was also pushed to the docker hub repository.
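If you’d rather not edit the JSON file, Packer user variables can also be set from the command line with -var. The sketch below assumes that docker_repo is declared as a user variable in stm32-cde-docker.json (as the note about the build command suggests); build_cde_image is just a hypothetical wrapper name.

```shell
#!/bin/sh
# Sketch: validate and build the docker CDE image for a given docker hub repo.
# Assumption: docker_repo is a user variable in the template.
# build_cde_image is a made-up wrapper name, not part of Packer or the repo.
build_cde_image() {
    repo="$1"
    template="${2:-stm32-cde-docker.json}"
    if [ -z "$repo" ]; then
        echo "usage: build_cde_image <username>/stm32-cde-image [template.json]" >&2
        return 1
    fi
    # Check the template first, then build it with the overridden variable.
    packer validate -var "docker_repo=$repo" "$template" &&
        packer build -var "docker_repo=$repo" "$template"
}

# Example (run from the top level of the cloned repo):
# build_cde_image dimtass/stm32-cde-image
```

Running packer validate first catches syntax errors in the template before the build starts pulling the base image.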

Using the Docker CDE image

OK, now we have an image, what do we do with it?

Here we need to make clear that this image is just the CDE. It’s not yet used in any way and it’s just available to be used in any way you like. This image can be used in different ways, of course. For example, developers can use the image to build their code locally on their workstation or remotely, but the same image can be used to build the code during a continuous integration pipeline. The same image also can be used for testing, though in this case, as you’ll see you need to use the provisioner to add more tools and packages in there (which is actually done already). Therefore, this image is the base image that we can use in our whole pipeline workflow.

Note that these images were created by only using configuration files, which are versioned and stored in a git repo. This is nice, because that way you can create branches to test your changes, add more tools and also keep a history log of all the changes so you can track and debug any issues that were introduced. I hope it’s clear how convenient that is.

Depending on the kind of image you build, the way to use it is a bit different. For example, if you have a VirtualBox VM then you need to launch VBox and then start the image, or if you’ve used Vagrant (regardless of whether the Vagrant box is a VirtualBox or a Docker image), then you need to start the image using “vagrant up” and then use “vagrant ssh” to connect to it. If you’re using docker, then you need to run a new container from the image, attach to it, do your job and then stop/destroy the container (or leave it stopped and re-use it later). For AWS AMIs the procedure is a bit different. Therefore, the tool and the way you’ve used to create the image define how to use it. What happens in the background is pretty much the same in most cases (e.g. in case of Docker, it doesn’t matter if you use Vagrant or the docker command line (cli), the result in the background is the same).

In this post I’ll only explain how to use the Docker image and I’ll explain how to use the Vagrant and AWS EC2 AMI in a next post.

So, let’s start with the docker image. Again, I won’t go into the details of the cli commands, I’ll just list them and explain the result. For more details you need to consult the docker cli reference manual.

In the previous steps we created the docker image and pushed it to our docker.io hub repository. I’ll use my repo image name, which you should change to yours, but using my repo for testing is also fine. The test repo we’re going to use to build the code is this one:


Therefore, the objective is to create a docker container from our new image and then clone the code repo and build it inside the docker container.

cd ~
docker run -it dimtass/stm32-cde-image:0.1 -c "/bin/bash"

This will create a new container, run it and attach to its bash terminal, so you should find your terminal’s prompt inside the running container. Now, to test that the container has all the tools that are needed and that it can build the firmware, all you have to do is git clone the stm32 firmware repo in the container and build it, like this:

cd /root
git clone https://dimtass@bitbucket.org/dimtass/stm32f103-cmake-template.git
cd stm32f103-cmake-template
TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./build.sh

The above commands should be run inside the docker container. So, first you change dir to /root and then clone and build the template repo. In the build command (the last one), you need to pass the toolchain path, which in this case is `/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major`, as that’s where Packer and Ansible installed the toolchain. Then you need to pass a few other parameters that are explained in the stm32f103-cmake-template repo, as you can also build a FreeRTOS template project or use libopencm3 instead of the ST standard peripheral library.

The result of the build command in my case was:

[ 95%] Linking C executable stm32-cmake-template.elf
   text	   data	    bss	    dec	    hex	filename
  14924	    856	   1144	  16924	   421c	stm32-cmake-template.elf
[ 95%] Built target stm32-cmake-template.elf
Scanning dependencies of target stm32-cmake-template.bin
Scanning dependencies of target stm32-cmake-template.hex
[100%] Generating stm32-cmake-template.hex
[100%] Generating stm32-cmake-template.bin
[100%] Built target stm32-cmake-template.bin
[100%] Built target stm32-cmake-template.hex

That means that our CMake project for the STM32 was built fine and now we have a hex file that we can use. But, wait… how can you test this firmware? Is this how you develop? How do you make code changes and build them? Well, there are many options here. To flash the image from inside the container, you can expose the USB st-link programmer to docker by using the --device option when creating the container with the docker run -it command. You can have a look at how to use this option here. We’ll get to that later.

The most important thing now, though, is how a developer can use an IDE to make changes in the code and then build the firmware. The option for this is to share a folder of your drive with the container using volumes. That way you have a folder that is mounted in the container and you can clone the repo inside this folder. The result is that the repo is cloned to your local folder, so you can use the IDE in your OS to open the code and make edits, and then build the code inside the container.

To do that, let’s first clean up the previous container. Run this command in your OS terminal (not in the docker container) to list the current stopped or running containers:

docker ps -a

An example of this output may be:

CONTAINER ID        IMAGE                           COMMAND                  CREATED             STATUS                     PORTS               NAMES
166f2ef0ff7d        dimtass/stm32-cde-image:0.1     "/bin/sh -c /bin/bash"   2 minutes ago       Up 2 minutes                                   admiring_jepsen

The above output means that there is a container running that uses the dimtass/stm32-cde-image. That’s the one I started earlier, so now I need to stop it (if it’s running) and then remove it, like this:

docker stop 166f2ef0ff7d
docker rm 166f2ef0ff7d

Note that `166f2ef0ff7d` is the container’s ID, which is unique for this container; the docker cli tool uses it to act on this specific container. The above commands just stop and then remove the container.
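
If you don’t want to copy the ID by hand every time, the stop/remove step can be wrapped in a small helper. This is only a sketch: the function name `remove_containers_of` is mine, and it assumes that you want to get rid of every container that was created from the given image.

```shell
#!/bin/sh
# Remove every container (running or stopped) created from one image.
# The image defaults to the tag used in this post; pass another one as $1.
remove_containers_of() {
    image="${1:-dimtass/stm32-cde-image:0.1}"
    # -a: include stopped containers, -q: print only the IDs,
    # --filter ancestor=...: keep only containers created from that image
    ids=$(docker ps -a -q --filter "ancestor=${image}")
    if [ -z "$ids" ]; then
        echo "no containers found for ${image}"
        return 0
    fi
    # -f force-removes the containers, even if they are still running.
    # $ids is left unquoted on purpose, so every ID becomes its own argument.
    docker rm -f $ids
}
```

Then `remove_containers_of dimtass/stm32-cde-image:0.1` in your OS terminal does the same job as the two commands above, for all containers of that image at once.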

Now, you need to run a new container and this time mount a local folder from your OS filesystem to the docker container. To do that run these commands:

cd ~/Downloads
mkdir -p testing-docker-image/shared
cd testing-docker-image
docker run -v $(pwd)/shared:/home/stm32/shared -it dimtass/stm32-cde-image:0.1 -c "/bin/bash"

The last command will drop you inside the container’s bash. Then you need to change to the stm32 user and repeat the git clone, this time inside the shared folder, so the repo ends up on your host, too.

cd /home/stm32/shared
su stm32
git clone --recursive https://dimtass@bitbucket.org/dimtass/stm32f103-cmake-template.git
cd stm32f103-cmake-template
TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./build.sh

There’s a trap here, though! For this to work, your OS user needs to have the same uid and gid as the stm32 user in the container (or the opposite). Therefore, if your user has uid/gid = 1000/1000, then the stm32 user in the container needs the same. If that’s not the case for you, then you need to create a user in the container that has the same uid/gid, or change the existing one. If these are not the same, then the permissions will be different and you won’t be able to create/edit files.

For example, if your OS uid/gid is 1001 and the stm32 user in the container has 1000, then, while the container is running, run this command as root in the terminal inside the docker container:

usermod -u 1001 stm32

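The above command changes the uid of the stm32 user inside the container from 1000 to 1001. It doesn’t touch the gid; if the group id differs too, `groupmod -g 1001 stm32` changes it in the same way.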
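
To make the uid/gid check less error-prone, here is a minimal sketch that compares the two pairs and prints the commands that would bring the container user in line with the host user. The function name `plan_id_fix` is hypothetical, and I’m assuming that the group in the container has the same name as the user (stm32).

```shell
#!/bin/sh
# Print the commands (to be run as root inside the container) that make the
# container user match the host uid/gid. Prints nothing if they already match.
# usage: plan_id_fix HOST_UID HOST_GID CONT_UID CONT_GID [USER]
plan_id_fix() {
    host_uid=$1; host_gid=$2; cont_uid=$3; cont_gid=$4
    user="${5:-stm32}"   # assumption: the group is named like the user
    if [ "$host_gid" != "$cont_gid" ]; then
        echo "groupmod -g ${host_gid} ${user}"
    fi
    if [ "$host_uid" != "$cont_uid" ]; then
        echo "usermod -u ${host_uid} ${user}"
    fi
}
```

On the host you get your own values with `id -u` and `id -g`, and inside the container with `id -u stm32` and `id -g stm32`. For the example above, `plan_id_fix 1001 1001 1000 1000` prints one groupmod and one usermod line.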

Let’s now assume that the permissions are OK and the uid/gid are the same for both users in your OS and the container. That means that you can open the git project in your favorite IDE, make your changes, keep the docker container running in the background and build the source whenever you like. Keep in mind that the build is a manual procedure and you won’t be able to press a button in your IDE to build the source in the container. Of course, there are a few ways to achieve this functionality if you wish, but let’s not focus on this right now.

Finally, you’ve managed to create an image that is your common development environment and contains all the tools needed to build your code and test it. Any time you need to add more tools to your image, you can use the Ansible provisioning files and add more packages or roles in there. Then all the developers automatically get the latest updates when they create a new container to build the firmware. For this reason, it’s good to use the latest tag for the docker image in the repository, so the devs automatically get the latest image.

Flashing the firmware

There are two ways to flash the firmware that was built with the docker image. One is to use your OS with the st-link tools installed, but this means that you need to find the tools for your OS and install them. Also, if developers have different OSes and workstations with different architectures, then they will probably have different versions of the tool, or not even the same tool. Therefore, having the st-flash tool in the docker image helps to distribute the same flasher to all developers, too. In this case, there’s an Ansible role that installs st-flash in the image, so it’s there and you can use it for flashing.

As mentioned earlier, though, that’s not enough, because the docker container needs to have access to the USB st-link device. That’s not a problem with docker, though. In my case, as I’m running on Ubuntu, it’s easy to find the USB device path. Just run the lsusb command in a terminal and get the description; in my case it’s:

Bus 001 Device 006: ID 0483:3748 STMicroelectronics ST-LINK/V2

That means that the ST-LINK V2 device was enumerated on bus 001 as device number 006. For Linux and docker, that means that if you mount the `/dev/bus/usb/001/` path in the docker container, you’ll be able to use the st-link device. One thing to note here is that this trick always expects the programmer to be connected to the same port. Usually this is not a problem for a developer machine, but in case you want to generalize this, you can mount /dev/bus/usb and docker will have access to any USB device that is registered. That’s a bad practice, though; for development, always use the same USB port.
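
The mapping from the lsusb line to the path is mechanical, so it can be scripted. The following is a small sketch with two helper functions (the names are mine) that take one lsusb line and return the bus folder and the device node. It assumes the usual `Bus BBB Device DDD: ID vvvv:pppp ...` layout shown above and a single connected programmer.

```shell
#!/bin/sh
# Turn one lsusb line into the matching /dev/bus/usb paths.
# usage: usb_bus_dir "Bus 001 Device 006: ID 0483:3748 ..."

# Folder of the USB bus (field 2 of the line is the bus number).
usb_bus_dir() {
    printf '%s\n' "$1" | awk '{ print "/dev/bus/usb/" $2 }'
}

# Node of the device itself (field 4 is the device number plus a colon).
usb_dev_node() {
    printf '%s\n' "$1" | awk '{ sub(":", "", $4); print "/dev/bus/usb/" $2 "/" $4 }'
}
```

For example, `usb_bus_dir "$(lsusb -d 0483:3748)"` gives the folder to mount for the ST-LINK/V2 from the output above, where 0483:3748 is the vendor:product ID that lsusb printed.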

Therefore, now remove any previous containers that are stopped/running (use docker ps -a to find them) and then create/run a new container that also mounts the /dev path of the USB bus, like this:

docker run -v $(pwd)/shared:/home/stm32/shared -v /dev/bus/usb/001:/dev/bus/usb/001 --privileged -it dimtass/stm32-cde-image:0.1 -c "/bin/bash"

Disconnect the device if it’s already connected and then re-connect. Then to verify that the device is enumerated in the docker container and also the udev rules are properly loaded, run this command:

st-info --probe

In my case this returns:

root@6ebcda502dd5:/# st-info --probe
Found 1 stlink programmers
 serial: 513f6e06493f55564009213f
openocd: "\x51\x3f\x6e\x06\x49\x3f\x55\x56\x40\x09\x21\x3f"
  flash: 65536 (pagesize: 1024)
   sram: 20480
 chipid: 0x0410
  descr: F1 Medium-density device

Great, so the device is enumerated fine in the docker container. Now, let’s clone the repo again, build and flash the firmware. Before proceeding to this step, make sure that the stm32f103 (I’m using a blue-pill) is connected to the st-link and the st-link is enumerated in the docker container. Then run these commands:

cd /home/stm32
su stm32
git clone --recursive https://dimtass@bitbucket.org/dimtass/stm32f103-cmake-template.git
cd stm32f103-cmake-template
TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./build.sh

Now that the firmware is built, flash it with this command:

st-flash --reset write build-stm32/src_stdperiph/stm32-cmake-template.bin 0x8000000

You should see something like this:

stm32@6ebcda502dd5:~/stm32f103-cmake-template$ st-flash --reset write build-stm32/src_stdperiph/stm32-cmake-template.bin 0x8000000
st-flash 1.5.1
2019-12-02T21:18:04 INFO common.c: Loading device parameters....
2019-12-02T21:18:04 INFO common.c: Device connected is: F1 Medium-density device, id 0x20036410
2019-12-02T21:18:04 INFO common.c: SRAM size: 0x5000 bytes (20 KiB), Flash: 0x10000 bytes (64 KiB) in pages of 1024 bytes
2019-12-02T21:18:04 INFO common.c: Attempting to write 15780 (0x3da4) bytes to stm32 address: 134217728 (0x8000000)
Flash page at addr: 0x08003c00 erased
2019-12-02T21:18:04 INFO common.c: Finished erasing 16 pages of 1024 (0x400) bytes
2019-12-02T21:18:04 INFO common.c: Starting Flash write for VL/F0/F3/F1_XL core id
2019-12-02T21:18:04 INFO flash_loader.c: Successfully loaded flash loader in sram
 16/16 pages written
2019-12-02T21:18:05 INFO common.c: Starting verification of write complete
2019-12-02T21:18:05 INFO common.c: Flash written and verified! jolly good!
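
As a sanity check, the numbers in this log agree with the build output from earlier: the text and data columns (14924 and 856) add up to the 15780 bytes that st-flash writes, and 15780 bytes need 16 pages of 1024 bytes. A tiny sketch of that arithmetic, with function names that are only for illustration:

```shell
#!/bin/sh
# Bytes that end up in flash: code (text) plus initialised data (data).
# bss is zero-initialised RAM and is not stored in the image.
# usage: flash_bytes TEXT DATA
flash_bytes() { echo $(( $1 + $2 )); }

# Number of flash pages needed for an image, rounded up.
# usage: pages_needed BYTES PAGE_SIZE
pages_needed() { echo $(( ($1 + $2 - 1) / $2 )); }
```

So `flash_bytes 14924 856` gives 15780 and `pages_needed 15780 1024` gives 16, which is what the "16/16 pages written" line reports.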

What just happened is that you ran a docker container on your machine, which may not have any of the tools that are needed to build and flash the firmware, and still managed to build the firmware and flash it. In the same way, any other developer in the team, in the building or from home, can build the same firmware the same way, get the same binary and flash it with the same programmer version. This means that you’ve just taken out a great part of the development chain that can cause weird issues and therefore delays in your development process.

Now every developer that has access to this image can build and flash the same firmware in the same way. Why wouldn’t somebody like that?

I understand that some embedded engineers may find this process cumbersome, but it’s not really. If you think about it, the only additional step for the developer is to create the container once, start it once on every boot and then develop the same way as before. If you were using automations before, like pressing a button in the IDE to build and flash, that’s not a big issue either. All modern IDEs support custom commands and buttons or can run scripts, so you can use those to achieve the same functionality.
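
As an example of such a script, here is a minimal sketch that an IDE custom command could call to run the build in the already running container with docker exec. Several things are assumptions of mine and not part of the setup above: the container was started with `--name stm32-cde`, the repo was cloned in the shared folder, and `DRY_RUN` is just a made-up switch to print the command instead of running it.

```shell
#!/bin/sh
# Build the firmware inside an already running container.
# Assumptions: container name (docker run --name stm32-cde ...) and repo path.
CONTAINER="${CONTAINER:-stm32-cde}"
REPO_DIR="${REPO_DIR:-/home/stm32/shared/stm32f103-cmake-template}"
BUILD_CMD='TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./build.sh'

build_in_container() {
    # -u: run as the stm32 user, -w: working directory inside the container
    set -- docker exec -u stm32 -w "$REPO_DIR" "$CONTAINER" /bin/bash -c "$BUILD_CMD"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"    # only show what would be executed
    else
        "$@"
    fi
}
```

Point your IDE’s build button at a script that sources this file and calls `build_in_container`; the exit code of the build is passed back, so the IDE can tell whether it failed.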

Using gitlab-ci for CI/CD

Until now, I’ve explained how you can use the CDE image to build your source code as a developer, but in embedded projects it’s also important to have a CI/CD pipeline that builds the code and runs some scripts or tests automatically. For example, in your pipeline you might also want to run a style formatting tool (e.g. clang-format), or anything else you like, then run your tests and, if everything passes, get a successful build.

There are many CI/CD services, like Jenkins, Travis-CI, gitlab-ci, buildbot and bamboo, just to name a few. Each one of them has its pros and cons. I haven’t used all of them, so I can’t really comment on their differences. I’ve only used gitlab-ci, jenkins, buildbot and bamboo, and even those not in great depth. Bamboo is mainly used in companies that use Atlassian’s tools widely in their organisation, but you need to buy a license and it’s closed source. Jenkins is the most famous of all; it’s free, open-source and has a ton of plug-ins. Buildbot is used for Yocto projects, which means that it can handle complexity. Gitlab-CI is part of the gitlab platform and it’s also free to use.

Of all the above, I personally prefer gitlab-ci, as it’s the simplest and most straightforward service for creating pipelines without much complexity. For example, I find Jenkins pipelines to be very cumbersome and hard to maintain; it feels like the pipeline interface doesn’t fit well in the rest of the platform. Of course, there are people who prefer Jenkins or whatever else to GitLab-CI. Anyway, there’s no right or wrong here; it has to do with your current infrastructure, needs and expertise. In my case, I find gitlab-ci dead simple to use and I can create a pipeline in no time.

To use gitlab-ci, you need to host your project on gitlab or install gitlab on your own server. In this example, I’m hosting the `stm32f103-cmake-template` project on bitbucket, github and gitlab at the same time and all these repos are mirrors (I’ll explain in another post how I do this, but I can share the script that I’ve written for it). Therefore, the same project can be found in all these repos:


For now, we only care about the gitlab repo. In your case, you need to either fork this repo or create a new one with your code and follow the guide. Therefore, from now on, it doesn’t matter that I refer to the `stm32f103-cmake-template` project, as this can be any project.

To use gitlab-ci, you only need a file in your repo called `.gitlab-ci.yml`. This is a simple YAML file that configures the CI/CD of your project. For more information and details on this, read the documentation here. In this example, I’ll only use simple and basic stuff so it’s easy to follow, but you can do much more complex things than what I’m doing in this post.

For our example, we’re going to use this repo here:


I’ve added a .gitlab-ci.yml file in the repo, that you can see here:

    image:
      name: dimtass/stm32-cde-image:0.1
      entrypoint: [""]

    stages:
      - build
      - test

    build:
      stage: build
      script: TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./build.sh

    test:
      stage: test
      script: echo "Dummy tests are done..."

The first thing you notice is the image entry. This points to the docker image that I’ve built with Packer and pushed to the docker hub registry. This entry makes the gitlab-runner instance download the image, run a container and execute the given stages inside that container. Also, the entrypoint entry handles a problem that some images built with Packer have, which is that the container can’t find the /bin/sh path. You can see more about this issue here. This already raises the flag that not all solutions are handled the same in the end; depending on the tools you’re using and the way you’re using them, there might be some differences in the way you need to handle specific cases.

Lastly, there are two stages, build and test. The test stage doesn’t do anything useful in this case, it just prints a message, but the build stage runs the build command(s). If any step of these stages fails, then the pipeline fails.

After committing and pushing the .gitlab-ci.yml file to the repo, the project’s CI/CD service automatically picks up the job and uses any available shared runner from the stack to run the pipeline. This is important: by default, the shared gitlab runners use a default docker image, which will fail to build your specific project, and that’s the reason why you need to tell gitlab-ci to pull your custom docker image (dimtass/stm32-cde-image in this case).

After picking up the job, the runner starts executing the steps in the yaml file. See how simple and easy it is to create this file; this is an important difference between gitlab-ci and other CI/CD services. I’ll post a couple of pictures from that build here:

In the first picture, you see the output of the runner and that it’s pulling the dimtass/stm32-cde-image and then fetches the git submodules. In the next image you can see that the build was successful and finally in the third image you see that both build and test stages completed successfully.

A few notes now. As you can imagine, there are no produced artifacts, meaning that the firmware is built and then gone. If you need to keep the artifact, then you need to push/upload it to some kind of registry that you can download it from. Gitlab, though, now supports hosting the build artifacts for a specific time, so let’s use this! I’ve changed my .gitlab-ci.yml to this one:

image:
    name: dimtass/stm32-cde-image:0.1
    entrypoint: [""]

stages:
    - build
    - test

build:
    stage: build
    script: TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./build.sh
    artifacts:
        paths:
        - build-stm32/src_stdperiph/stm32-cmake-template.bin
        expire_in: 1 week

test:
    stage: test
    script: echo "Dummy tests are done..."

The above yml script will now also create an artifact that you can download, flash and test on your target. To download the artifact, go to your project’s CI/CD -> Pipelines, click the drop-down menu of the last build in the list and download the .zip file that contains the artifact. This is shown here:

Another thing you may notice during the pipeline execution is that each stage runs in an environment cleaned of the previous one. Therefore, the test stage will download the docker image again, fetch the repo and its submodules again and then run the test stage. That doesn’t sound like what we really want, right? Don’t get me wrong, the default behavior is totally right and it’s very logical for each stage to run in a clean environment; it’s just that in some cases you don’t want this to happen. In that case, you can run your tests in the build stage.

Finally, what’s useful in gitlab is the cache. When you run a stage, you can cache a specific folder or file and tag it with a key. You can do this for several files/folders. Then, in another stage, you can use the cache key to get access to these files/folders and run whatever actions you like. To do this, let’s change our .gitlab-ci.yml file into this:

image:
    name: dimtass/stm32-cde-image:0.1
    entrypoint: [""]

stages:
    - build
    - test

build:
    stage: build
    script: TOOLCHAIN_DIR=/opt/toolchains/gcc-arm-none-eabi-9-2019-q4-major CLEANBUILD=true USE_STDPERIPH_DRIVER=ON SRC=src_stdperiph ./build.sh
    cache:
        key: build-cache
        paths:
        - build-stm32/src_stdperiph
    artifacts:
        paths:
        - build-stm32/src_stdperiph/stm32-cmake-template.bin
        expire_in: 1 week

test:
    stage: test
    script: file build-stm32/src_stdperiph/stm32-cmake-template.bin
    cache:
        key: build-cache

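As you can see in the above file, the build-cache stores the content of the build-stm32/src_stdperiph folder during the build stage, and the test stage then runs the file command on the .bin from that folder. Note that in the log below the .bin actually arrives as the artifact of the build job, because gitlab-ci passes the artifacts of earlier stages to the jobs of later stages by default. Also keep in mind that the cache is a best-effort optimization and may not be present on every runner, so artifacts are the safer way to pass build results between stages. This is the output in gitlab: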

Downloading artifacts for build (368146185)...
Downloading artifacts from coordinator... ok        id=368146185 responseStatus=200 OK token=jeC2trFF
$ file build-stm32/src_stdperiph/stm32-cmake-template.bin
build-stm32/src_stdperiph/stm32-cmake-template.bin: data
Job succeeded

Just beautiful. Simply done and useful in many ways.
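
The file command only proves that the binary exists, so a slightly more useful test stage could also verify that the image is not empty and fits in the 64 KiB of flash that st-info reported for the stm32f103. This is a sketch of such a check; the function name `check_firmware` and the idea of calling it from the test stage are my assumptions, not something the template repo provides.

```shell
#!/bin/sh
# Fail when the firmware image is missing, empty or larger than the flash.
# usage: check_firmware FILE [MAX_BYTES]   (MAX_BYTES defaults to 64 KiB)
check_firmware() {
    bin=$1
    max=${2:-65536}
    # -s is true only for an existing file with a size greater than zero
    if [ ! -s "$bin" ]; then
        echo "FAIL: ${bin} is missing or empty"
        return 1
    fi
    size=$(wc -c < "$bin" | tr -d ' ')
    if [ "$size" -gt "$max" ]; then
        echo "FAIL: ${bin} is ${size} bytes, flash holds ${max}"
        return 1
    fi
    echo "OK: ${bin} is ${size} of ${max} bytes"
}
```

If you put this in a script in your repo, the test stage can call `check_firmware build-stm32/src_stdperiph/stm32-cmake-template.bin` and the non-zero return code will fail the pipeline when the image is too big.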


In this first post of the DevOps for embedded series, I’ve covered the creation of a docker image that can be used by developers as a CDE (common development environment) and by CI/CD services as a pipeline image. I’ve used Packer to create the image and Ansible to provision and configure it. Packer is useful because you can use it to create almost identical images from the same configuration file for different providers like docker, AWS, etc. Also, Ansible is preferred over other provisioners because it supports many different providers, too, and it configures the image remotely, so it doesn’t have to run locally on the image.

Then I’ve tested the CDE image by building a template firmware locally on my host OS and I was also able to flash the firmware on the target MCU using the st-flash tool from the image. For this step I had to share part of /dev/bus/… with the docker container, which some may argue is not optimal. Personally, I think it’s just fine if you are targeting a specific port rather than your whole /dev directory.

Finally, I’ve used the same CDE image with gitlab-ci and created a CI/CD pipeline that runs a build and a test stage. I’ve also shown how you can use the artifact and cache capabilities of gitlab-ci. Although there are many different CI/CD services, I’ve chosen gitlab because I find it simple and straightforward to use and the documentation is very good, too. If you want or have to use another CI/CD service, then the procedure details will be different, but what you need to do is the same.

In the next post, I’ll show how you can use an AWS EC2 AMI instance (and the AWS tools in general) to do the same thing. Until then…

Have fun!