Hacking a Sonoff S20 and adding an ACS712 current sensor


Warning! This post is only meant to document my work on hacking the Sonoff S20. This device is connected directly to 110/220V mains and if you try this on your own you risk your life. Do not attempt to open your device or follow this guide unless you're a trained electronics or electrical engineer.

It's been quite a long time since I've hacked a device for the blog. This time I've done this hack mainly for the next post of the "Using Elastic stack (ELK) on embedded" series. In both previous posts I've used a Linux SBC that runs an Elastic Beat client and publishes data to a remote Elasticsearch server. But for the next (and maaaybe the last) post I want to use an example of a baremetal embedded device, like the ESP8266. Therefore I had two choices: either just write a generic and simple firmware for an ESP module (e.g. ESP-01 or ESP-12E) or, even better, use an actual mainstream product and create an example of a real IoT application.

For that reason I've decided to use the Sonoff S20 smart socket that I'm using to turn my 3D printer on and off. A smart socket is just a device with a single power socket and an internal relay that controls power to the load; in this case the relay is controlled by an ESP8266. In my case, I'm using a Raspberry Pi 3 running OctoPi, and both the printer and the RPi are connected to a power strip which is plugged into the S20. When I want to print, I turn on the socket using my smartphone, the RPi boots and the printer is powered on. When the printing is done, OctoPi sends a termination message to the S20 and the RPi shuts down. Then, 20 seconds after the S20 has received the message, it turns off the power to everything.

The current setup doesn't make much sense to use with the Elastic stack, so I've decided to add a current sensor to the Sonoff S20, monitor the current that the printer draws and then publish the data to the Elasticsearch server.

I've also decided to split the hacking from the ELK part, so I'm making this post to document the hack. Again, I have to tell you not to try this yourself if you're not a trained professional.


Sonoff S20

According to this wiki, the Sonoff S20 is a WiFi wireless smart socket that can connect any home appliance or electric device via WiFi, allowing you to control it remotely with the eWeLink iOS/Android app. The device looks like this:

Since this device is using an ESP8266, it's hackable and you can write and upload your own firmware. There are also many clones and, to be honest, it's difficult to know if you're buying an original or a clone, especially from eBay. The schematics are also available here. The best-known firmware for the Sonoff S20 is the open source Tasmota firmware that you can find here. Of course, as you can guess, I'm using my own custom firmware because I only need to do some basic and specific stuff and I don't need all the bloat that comes with the other firmwares.


ACS712 current sensor

The ACS712 is a fully integrated, Hall-effect-based linear current sensor IC with 2.1 kVRMS isolation and a low-resistance current conductor. Although this IC only needs very few components around it, you can find it really cheap on eBay on a small breakout PCB, like this one here:

This IC comes in 3 flavors, rated at 5A, 20A and 30A. I'm using the 5A version, which means that the IC can sense up to 5A of AC/DC current and converts the input current to a voltage with a 185 mV/A step. That means that 1A changes the output by 185mV and 5A by 925mV.

The ugly truth

OK, here you might think that this is quite useless. Well, you're right, he he, but it's according to the blog's specs, so it's good to go. For those who wonder why it's useless: the 185 mV/A sensitivity of the ACS712 means that the full 5A range spans only 925mV. Since the range is -5A to +5A, according to the IC specs the output goes from ~1.5V at -5A, through 2.5V at 0A, up to ~3.5V at +5A.

Also, the ESP8266 ADC input is limited by design to 1V maximum, and at the same time the noise of the ESP8266 ADC is quite bad. Therefore the accuracy is not good at all. Also, when it comes to AC currents at 220V, 1A means 220VA and 5A means 1100VA, which is extremely high compared to a 3D printer's consumption of ~600W max with the heat-bed at max temperature. So, the readings from the ACS712 are not really accurate, but they are enough to see what's going on.

Also, keep in mind that with the ACS712 you can only measure current, and that is not enough to calculate real consumption: you also need the actual value of the AC voltage on your socket, which is not exactly 220V and actually fluctuates a lot. Ideally you would have another ADC measuring the output of the transformer, which is relative to the AC mains, and use this value to calculate real power.

Nevertheless, it’s fun to do it, so I’ll do it and use this device in the next post to publish the ACS712 readings to the Elasticsearch server.

Hacking the Sonoff S20

Well, hacking sounds really awesome, but in reality it was too easy to even consider it hacking, but anyway. So, the first problem I faced was that the ESP8266 is a 3V3 device and the ACS712 is a 5V device. The ACS712 is also one of those 5V-rated devices that really doesn't work at 3V3. Therefore, you actually need 5V, but the problem is that there isn't any 5V rail on the circuit.

First let’s have a look on my bench.

I've used a USB "microscope" to check the board and also take some pictures, but those cheap microscopes are so crap that in the end I've just used the magnifying lamp to do the soldering…

As you can find out from the schematics here, there is a 220V to 6.2V transformer on the PCB and then there’s an AMS1117-3.3 regulator to provide power to the low voltage circuit.

Therefore in order to provide the 5V I’ve soldered an AMS1117-5 on top of the 3V3 regulator.

That’s it, an AMS1117 sandwich. OK, I know, you’re thinking that this is a bad idea because now there’s no proper heat dissipation from the bottom regulator and it’s true. But I can live with that, as the heat is not that much. Generally though, this would be a really bad idea in other cases.

The next thing was to desolder the load cable (red) from the PCB and solder it on the IP+ pin of the ACS712, and then solder an extension cable from IP- back to the PCB.

Then I've removed the green block connector and the pin-header from the ACS712 PCB so it takes as little space as possible, and then I've soldered the GND and 5V wires.

In this picture the regulator sandwich is also more clearly visible. I've used low resistance wire-wrapping wire.

The next problem was that the ADC input on the ESP8266 is limited to 1V max, while the output of the ACS712 can be up to 3.5V (let's assume 5V just to be safe). Therefore, I've used a voltage divider on the output with a 220Ω and a 56Ω resistor, which means that at 5V I get 1.014V. For that I've used some 0805 1% SMD resistors, soldered them on the ACS712 PCB and soldered a wire between them.
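As a quick sanity check, the divider math can be verified with a few lines of C (this is just my illustrative sketch, not code from the repo):

```c
#include <assert.h>
#include <math.h>

/* Voltage divider output: Vx = Vo * R2 / (R1 + R2) */
static double divider_out(double vo, double r1, double r2)
{
    return vo * r2 / (r1 + r2);
}

/* Worst case: 5V on the ACS712 output with R1=220R, R2=56R
 * gives divider_out(5.0, 220.0, 56.0) ~= 1.014V, which is just
 * within the 1V ADC limit of the ESP8266 (1% resistor tolerance aside). */
```
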

Then I've used a hot glue gun to create a blob that holds the wires and prevents them from breaking.

I guess the most difficult part was soldering the wire from the voltage divider to the ADC pin of the ESP8266, because the pin on the IC is barely visible. Although I'm using a quite flat and big soldering tip on my Weller, the trick I use is to melt just a bit of solder on the wire end, then place the wire on the pin and finally touch it quite fast with the soldering iron tip. I can't really explain it any better, but it works fine for me.

That's quite clean soldering. Then I've used some hot glue to attach the ACS712 PCB to the relay, so it doesn't fly around the case.

The last thing was to solder the pin header for the UART connector that I've used for flashing and debugging the firmware.

From top to bottom the pinout is GND, TX, RX and VCC. If you use a USB-to-UART module to supply the ESP8266 with power, be aware that you need a 3V3 module and not a 5V one, and never connect the VCC pin of the module while the Sonoff is connected to the mains. At first I was using the 3V3 of the UART module to flash the device while it was disconnected, and after I verified that everything works as expected, I removed the VCC from the module and flashed the device while it was connected to the mains.

So this is the Sonoff before I close the case

And here is with the case closed.

As a test load during my tests I've used a hair dryer rated up to 2200W; it has cold and heat modes and also different fan and heating settings, so it was easy to test with various currents. Of course, using the hair dryer's max power with the ACS712-5A is not a good idea, as the dryer can draw up to 10A.

Short firmware description

OK, so now let’s have a look in the source code that I’ve used for testing the ACS712 and the Sonoff. You can find the repo here:


Reading the ADC value is the easy part, but there are a couple of things you need to do in order to convert this value to something meaningful. Since I'm measuring AC current, a snapshot of the ADC value is meaningless, as the current oscillates around zero and also takes negative values. In this case we need the RMS (root-mean-square) value. Before continuing further, I've compiled this list of equations.

I'm not going to explain everything though, as these are quite basic equations. (1) is the RMS formula: the square root of the mean of the squared sampled current values. The trick here is that you square each value, so it also works for negative values, as they all add up. Then, according to the ACS712 datasheet, in equation (2) α is the sensitivity, 185mV/A. This number means that for each Ampere the output voltage changes by 185mV.

Equation (3) gives Vx, the output of the voltage divider, where Vo is the instantaneous value of the ACS712 output. Equation (4) states that Vx is the ADC value divided by 1023. This is true because the max input voltage is 1V and the ADC is 10-bit. Here you also see that it's actually the delta of the ADC value that matters, and in (5) you see that the delta is the ADC value minus the ADC0 value.

The ADC0 value is the sampled ADC value of the ACS712 output when there's no current flowing. Normally you would expect it to be 511, which is half of the 1023 max value of the 10-bit ADC, but you actually need to calibrate your device to find this value. I'll explain how to do that later.

In equation (6) I've just substituted (3), (4) and (5) into (2). Finally, in (7) I've replaced the constant values with k to simplify the equation. In (8) and (9) you can see which values are constant and what their values are.

Therefore, in my case k is always ~0.026, but I haven't hard-coded that value; as you can see from the following code, you can change those values to whatever fits your setup.

#define ACS712_MVA    0.185   // This means 185mV/A, see the datasheet for the correct value
#define R1            220.0
#define R2            56.0
#define ADC_MAX_VAL   1023.0

float k_const = (1/ADC_MAX_VAL) * ((R1 + R2) / R2) * (1/ACS712_MVA);


Now in order to calibrate your device you need to enable the calibration mode in the code.


First you need to build and flash the firmware and then remove any load from the Sonoff S20. When the firmware starts, it will display the sampled ADC value. The code samples the output every 1ms and after 5 secs (5000 samples) calculates the average value. This average is the ADC0 value. Normally, according to the ACS712 datasheet, the zero-current output should be at 2.5V, but this may not always be true. This is the output of the firmware in my case:

Calibrating with 5000 samples.
Calibration value: 431
Calibration value: 431
Calibration value: 431
Calibration value: 431
Calibration value: 431

Based on that output, I’ve defined the CALIBRATION_VALUE in my code to be 431.
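The calibration itself is just an average of zero-current samples. A minimal sketch of the idea (my own illustration, not the repo's code):

```c
#include <assert.h>

/* Average n ADC samples taken with no load connected.
 * The result is the ADC0 zero-current offset (431 on my board),
 * which then becomes CALIBRATION_VALUE in the firmware. */
static int calibrate_adc0(const int *samples, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        sum += samples[i];
    return (int)(sum / n);
}
```

In the firmware this runs over 5000 samples taken every 1ms, as described above.
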


Testing the firmware

Now make sure that CALIBRATION_MODE is not defined in the code, then re-build and re-flash. After flashing is finished, the green LED blinks while the Sonoff tries to connect to the WiFi router, and when it's connected it stays constantly on. By default the relay is always off for safety reasons.

The firmware supports a REST API that is used to turn the relay on/off and also get the current Irms value. When the Sonoff is connected to the network, you can use your browser to control the relay. In the following URLs you need to change the IP address to the one of your device. To turn the relay on, paste the next URL in your web browser and hit Enter.

Then you’ll see this response in your browser:

{"return_value": 1, "id": "", "name": "", "hardware": "esp8266", "connected": true}

To turn off the relay use this URL:

Then you’ll get this response from the server (ESP8266)

{"return_value": 0, "id": "", "name": "", "hardware": "esp8266", "connected": true}

As you can see, the params variable in the relay URL defines the state of the relay: 0 is OFF and 1 is ON. The return value in the server's response is the actual value of relay_status in the code. The REST callback in the code just calls a function that controls the relay and then returns that variable. There's no true feedback other than the blue LED on the device that shows the relay is on, so be aware of that. This is the related code:

int restapi_relay_control(String command);

void SetRelay(bool onoff)
{
  if (onoff) {
    relay_status = 1;
    digitalWrite(GPIO_RELAY, HIGH);
  } else {
    relay_status = 0;
    digitalWrite(GPIO_RELAY, LOW);
  }
  Serial.println("Relay status: " + String(relay_status));
}

ICACHE_RAM_ATTR int restapi_relay_control(String command)
{
  SetRelay(command.toInt() == 1);
  return relay_status;
}

Finally, you can use the REST API to retrieve the RMS value of the current on the connected load. To do so, browse to this URL

The response will be something like this:

{"rms_value": 0.06, "id": "", "name": "", "hardware": "esp8266", "connected": true}

As you can see the rms_value field in the response is the RMS current in Amperes. This is the response when the hair dryer is on

{"rms_value": 6.43, "id": "", "name": "", "hardware": "esp8266", "connected": true}

As you can see, the RMS value now is 6.43A, which is more than the 5A limit of the ACS712-5A! That's not good, I know; don't do that. In my case, I've only used the high scale of the hair dryer for 1-2 seconds on purpose, which may not be enough to harm the IC. The overcurrent transient tolerance of the ACS712 is 100A for 100ms, which is quite high, therefore I hope that 6.5A for 2 secs is not enough to kill it.

One last thing regarding this firmware: the Sonoff button is used to toggle the relay status. The button uses debouncing, which is set to 25ms by default, but you can change that in the code here:

buttons[GPIO_BUTTON].attach( BUTTON_PINS[0] , INPUT_PULLUP  );


In this post I've explained how to "hack" a Sonoff S20 smart socket and documented the procedure. You shouldn't do this if you're not a trained electronics or electrical engineer, because mains AC can kill you.

To sum things up, I've provided a firmware that you can use to calibrate your ACS712 and the ADC readings; in normal operation you can also use it to control the relay and read the RMS current value. To switch between the two modes you need to re-build and re-flash the firmware. The reason for this is just to keep the firmware simple and be done with it as fast as possible. Of course, it could be done in a way that lets you switch between the two modes at runtime, for example using the onboard button (which is used for toggling the relay by default) or a REST command. I leave that as an exercise (I really like this motto when people are too bored to do things).

As I've mentioned, this post is just a preparation for the next one, which will use the Sonoff S20 as a node agent that publishes the RMS current and the relay status to a remote Elasticsearch server. Since this hack is out of scope for the next post, I've decided to write this one, as it's a quite long process, but also fun.

Normally I'm using this Sonoff S20 for my 3D printer, with a custom REST command to toggle the power from my smartphone or from the OctoPi server that runs on the RPi. I guess the 3D printer's consumption is not high enough to produce any meaningful data, but I'll try it anyway. Hope you liked this simple hack, and don't do this at home.

Have fun!

Using Elastic stack (ELK) on embedded [part 2]


Note: This is a series of posts on using the ELK on embedded. Here you can find part1.

In the previous post on the ELK on embedded I demonstrated the simplest example you can use: I've set up an Elasticsearch and a Kibana server using docker on my workstation, and then I've used this meta layer in a custom Yocto image on the nanopi-k1-plus to build the official metricbeat and test it. This was a really simple example, but at the same time very useful, because you can use the Yocto layer to build beats for your custom image and any ARM architecture using the Yocto toolchain.

In this post things will get a bit more complicated, but at the same time I'll demonstrate a fully custom solution for using ELK on your custom hardware with your custom beat. For this post I've chosen the STM32MP157C dev kit, which I've presented here and here. This makes things even more complicated, and I'll explain why later. So, let's have a look at the demo system architecture.

System Architecture

The following block diagram shows this project’s architecture.

As you can see from the above diagram, most things remain the same as in part 1 and the only thing that changes is the client; the extra complexity is all on that client. So let's see what the client does exactly.

The STM32MP1 SoC integrates a Cortex-M4 (CM4) and a Cortex-A7 (CA7) on the same SoC, and both have access to the same peripherals and address space. In this project I'm using 4x ADC channels which are available on the bottom Arduino connector of the board. The channels I'm using are A0, A1, A2 and A3. I'm also using a DMA stream to copy the ADC samples to memory with double buffering. Finally, the ADC peripheral is triggered by a timer. The sampled values are then sent to the CA7 using the OpenAMP IPC. Therefore, as you can see, the firmware is already complex enough.

At the same time the CA7 CPU is running a custom Yocto Linux image. On the user space, I’ve developed a custom elastic beat that reads the ADC data from the OpenAMP tty port and then publishes the data to a remote Elasticsearch server. Finally, I’m using a custom Kibana dashboard to monitor the ADC values using a standalone X96mini.

As you can see, the main complexity is mostly on the client, which is the common scenario that you're going to deal with in embedded. Next I'll explain all the steps needed to achieve this.

Setting up an Elasticsearch and Kibana server

This step was explained in the previous post in enough detail, so I'll save some space and time; you can read how to set it up here. You can use the part-2 folder of the repo, but be aware it's almost the same; the only thing I've changed is the versions of Elasticsearch and Kibana.

Proceed with the guide on the previous post, until the point that you verify that the server status is OK.


The above command is used when checking the status from the server itself; you need to use the server's IP when checking from the web client (the X96mini in this case).

Cortex-M4 firmware

So, let’s have a look at the firmware of the CM4. You can find the firmware here:


In the repo's README file you can find more information on how to build the firmware, but since I'm using Yocto I won't get into those details. The important files of the firmware are `source/src_hal/main.c` and `source/src_hal/stm32mp1xx_hal_msp.c`. Also, in `main.h` you'll find the main structures I'm using for the ADCs.

enum en_adc_channel {
  ADC_CH_AC = 0,
  ADC_CH_CT1,
  ADC_CH_CT2,
  ADC_CH_CT3,
  ADC_CH_NUM
};

#define NUMBER_OF_ADCS 2

struct adc_dev_t {
  ADC_TypeDef         *adc;
  uint8_t             adc_irqn;
  void (*adc_irq_handler)(void);
  DMA_Stream_TypeDef  *dma_stream;
  uint8_t             dma_stream_irqn;
  void (*stream_irq_handler)(void);
};

struct adc_channel_t {
  ADC_TypeDef         *adc;
  uint32_t            channel;
  GPIO_TypeDef        *port;
  uint16_t            pin;
};

extern struct adc_channel_t adc_channels[ADC_CH_NUM];
extern struct adc_dev_t adc_dev[NUMBER_OF_ADCS];

The `adc_dev_t` struct contains the details of the ADC peripheral, which in this case is ADC2, and `adc_channel_t` contains the channel configuration. As you can see, both are declared as arrays, because the SoC has 2x ADCs and I'm using 4x channels on ADC2. Both structs are initialized in main.c:

uint16_t adc1_values[ADC1_BUFFER_SIZE];  // 2 channels on ADC1
uint16_t adc2_values[ADC2_BUFFER_SIZE];  // 3 channels on ADC2

struct adc_channel_t adc_channels[ADC_CH_NUM] = {
  [ADC_CH_AC] = {
    .adc = ADC2,
    .channel = ADC_CHANNEL_6,
    .port = GPIOF,
    .pin = GPIO_PIN_14,
  },
  [ADC_CH_CT1] = {
    .adc = ADC2,
    .channel = ADC_CHANNEL_2,
    .port = GPIOF,
    .pin = GPIO_PIN_13,
  },
  [ADC_CH_CT2] = {
    .adc = ADC2,
    .channel = ADC_CHANNEL_0,
    .port = GPIOA,
    .pin = GPIO_PIN_0,
  },
  [ADC_CH_CT3] = {
    .adc = ADC2,
    .channel = ADC_CHANNEL_1,
    .port = GPIOA,
    .pin = GPIO_PIN_1,
  },
};

struct adc_dev_t adc_dev[NUMBER_OF_ADCS] = {
  [0] = {
    .adc = ADC2,
    .adc_irqn = ADC2_IRQn,
    .adc_irq_handler = &ADC2_IRQHandler,
    .dma_stream = DMA2_Stream1,
    .dma_stream_irqn = DMA2_Stream1_IRQn,
    .stream_irq_handler = &DMA2_Stream1_IRQHandler,
  },
  [1] = {
    .adc = ADC1,
    .adc_irqn = ADC1_IRQn,
    .adc_irq_handler = &ADC1_IRQHandler,
    .dma_stream = DMA2_Stream2,
    .dma_stream_irqn = DMA2_Stream2_IRQn,
    .stream_irq_handler = &DMA2_Stream2_IRQHandler,
  },
};
I find the above the easiest and most descriptive way to initialize such structures in C. I'm using those structs in the rest of the code in the various functions, and because the structs are generic it's easy to handle them with pointers.

In the case of the STM32MP1 you need to be aware that the pin-mux for both the CA7 and the CM4 is configured using the device tree. Therefore, the configuration is done when the kernel boots, and furthermore you need to plan which pin is used by which processor. Also, in order to run this firmware you need to enable PMIC access from the CM4, but I'll get back to that later when I describe how to use the Yocto image.

The next important part of the firmware is the Timer, ADC and DMA configuration, which is done in `source/src_hal/stm32mp1xx_hal_msp.c`. The timer is initialized in HAL_TIM_Base_MspInit() and MX_TIM2_Init(). The timer is only used to trigger the ADC at a constant period: every time the timer expires, it triggers an ADC conversion and then reloads.

The ADC is initialized and configured in the HAL_ADC_MspInit() and Configure_ADC() functions. DMA_Init_Channel() is called for every channel. Finally, the interrupts from the ADC/DMA are handled in HAL_ADC_ConvHalfCpltCallback() and HAL_ADC_ConvCpltCallback(). The reason there are two interrupts is that I'm using double buffering: each interrupt corresponds to one half of the buffer, so the first fires when the first half is filled and the second when the second half is filled. This means there is enough time, before the same interrupt triggers again, to process one half of the buffer and send the results to the CA7 using OpenAMP.

You need to be aware, though, that OpenAMP is a slow IPC, meant for control and not for exchanging fast or big data. Therefore, you need a quite slow timer triggering the ADC conversions, otherwise the interrupts will fire faster than OpenAMP can handle. To solve this you could use a larger buffer pool and send the data asynchronously from the main loop, rather than inside the interrupt. In this case, though, I'm just sending the ADC values inside the interrupt using a slow timer, for simplicity. Also, there's no reason to flood the Elasticsearch server with ADC data.
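The buffer pool idea could be sketched as a simple single-producer/single-consumer ring buffer, where the ISR only pushes samples and the main loop drains them whenever the IPC is free. This is my own sketch, not code from the repo:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define POOL_SIZE 8   /* must be a power of two for the index mask trick */

struct sample_pool {
    uint16_t buf[POOL_SIZE][4];     /* 4 ADC channels per entry */
    volatile uint32_t head, tail;   /* head: ISR writes, tail: main loop reads */
};

/* Called from the DMA/ADC interrupt: drop the sample set if the pool is full */
static bool pool_push(struct sample_pool *p, const uint16_t s[4])
{
    if (p->head - p->tail >= POOL_SIZE)
        return false;
    for (int i = 0; i < 4; i++)
        p->buf[p->head & (POOL_SIZE - 1)][i] = s[i];
    p->head++;
    return true;
}

/* Called from the main loop: returns false when the pool is empty */
static bool pool_pop(struct sample_pool *p, uint16_t s[4])
{
    if (p->head == p->tail)
        return false;
    for (int i = 0; i < 4; i++)
        s[i] = p->buf[p->tail & (POOL_SIZE - 1)][i];
    p->tail++;
    return true;
}
```

With this, the OpenAMP send happens at the main loop's pace and a burst of conversions only costs pool slots, not lost IPC messages.
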

There's also a smarter way to handle the flow rate of ADC values from the STM32MP1 to the Elasticsearch server. You can have a tick timer that sends ADC values at constant, less frequent intervals, and if you want to be able to "catch" important changes in the values, you can have an algorithmic filter that buffers the changes that fall outside configurable limits and reports only those values to the server. You can also use average values, if that's applicable to your case.
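One simple form of such a filter is a dead-band: only report a value when it has moved more than a configurable limit away from the last reported one. A minimal sketch of the idea (my own illustration, not repo code):

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Dead-band filter: report only changes larger than `limit`
 * relative to the last reported value. */
struct deadband {
    double last;    /* last value that was reported */
    double limit;   /* configurable change threshold */
};

static bool deadband_should_report(struct deadband *f, double value)
{
    if (fabs(value - f->last) > f->limit) {
        f->last = value;   /* remember what we reported */
        return true;
    }
    return false;          /* change too small, skip it */
}
```

Everything the filter rejects can still go into a running average that gets reported on the slow tick timer.
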

Finally, this is the part where the CM4 firmware transmits the ADC values in string format to the CA7:

void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
{
  /* Update status variable of DMA transfer */
  ubDmaTransferStatus = 1;
  /* Set LED depending on DMA transfer status */
  /* - Turn-on if DMA transfer is completed */
  /* - Turn-off if DMA transfer is not completed */

  if (hadc->Instance == ADC1) {
    sprintf((char*)virt_uart0.tx_buffer, "ADC[1.2]:%d,%d,%d,%d\n",
      adc1_values[0], adc1_values[1], adc1_values[2], adc1_values[3]);
  }
  else if (hadc->Instance == ADC2) {
    sprintf((char*)virt_uart0.tx_buffer, "ADC[2.2]:%d,%d,%d,%d\n",
      adc2_values[0], adc2_values[1], adc2_values[2], adc2_values[3]);
  }

  virt_uart0.rx_size = 0;
  virt_uart0_expected_nbytes = 0;
  virt_uart0.tx_size = strlen((char*)virt_uart0.tx_buffer);
  virt_uart0.tx_status = SET;
}

As you can see, the format looks like this:

ADC[x.y]:<ADC0>,<ADC1>,<ADC2>,<ADC3>

  • x is the ADC peripheral [1 or 2]
  • y is the double-buffer index, 1: first half, 2: second half
  • <ADCz> is the 12-bit value of the ADC sample
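A line in this format can be parsed with a single sscanf call. This is just an illustrative sketch; the actual parsing in the beat is done in Go, as shown later:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Parse a line like "ADC[2.2]:123,456,789,1011" into its fields.
 * Returns true only when all 6 fields were matched. */
static bool parse_adc_line(const char *line, int *adc, int *half, int vals[4])
{
    return sscanf(line, "ADC[%d.%d]:%d,%d,%d,%d",
                  adc, half, &vals[0], &vals[1], &vals[2], &vals[3]) == 6;
}
```
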


Now that we have the CM4 firmware, the next thing we need is the elastic beat client that sends the ADC sample values to the Elasticsearch server. Since this is a custom application, there isn't any existing beat that meets our needs, so we need to create our own! This is actually quite easy: all you need is the tool that comes with the elastic beats repo. There's also a guide on how to create a new beat here.

My custom beat repo for this post is this one here:


Note: This guide is for v7.9.x-v8.0.x and it might be different in other versions!

First you need to setup your Golang environment.

export GOPATH=/home/$USER/go
cd /home/$USER/go
mkdir -p src/github.com/elastic
cd src/github.com/elastic
git clone https://github.com/elastic/beats.git
cd beats

Now that you’re in the beats directory you need to run the following command in order to create a new beat

mage GenerateCustomBeat

This will start an interactive process in which you need to fill in some details about your custom beat. The following are the ones that I've used, but you should use your own personal data.

Enter the beat name [examplebeat]: adcsamplerbeat
Enter your github name [your-github-name]: dimtass
Enter the beat path [github.com/dimtass/adcsamplerbeat]: 
Enter your full name [Firstname Lastname]: Dimitris Tassopoulos
Enter the beat type [beat]: 
Enter the github.com/elastic/beats revision [master]:

After the script is done, the tool has created a new folder with your custom beat, so go in there and have a look.

cd $GOPATH/src/github.com/dimtass/adcsamplerbeat

As you can see, there are a lot of things already in the folder, but don't worry, not all of them are important to us. Actually, there are only 3 important files. Let's have a look at the configuration file first, which is located in `config/config.go`. In this file you need to add your custom configuration variables. In this case I need the beat to be able to open a Linux tty port and retrieve data, therefore I need a serial Go module. I've used the tarm/serial module, which you can find here.

So, to open a serial port I need the device path in the filesystem, the baudrate and the read timeout. Of course you could add more configuration options like parity, bit size and stop bits, but in this case that's not important, as it's not a generic beat and the serial port configuration is static. Therefore, to add the needed configuration parameters I've edited config/config.go and added these parameters:

type Config struct {
    Period        time.Duration `config:"period"`
    SerialPort    string        `config:"serial_port"`
    SerialBaud    int           `config:"serial_baud"`
    SerialTimeout time.Duration `config:"serial_timeout"`
}

var DefaultConfig = Config{
    Period:        1 * time.Second,
    SerialPort:    "/dev/ttyUSB0",
    SerialBaud:    115200,
    SerialTimeout: 50,
}

As you can see, there's a default config struct which contains the default values, but you can override those values using the yaml configuration file, as I'll explain in a bit. Now that you've edited this file, you need to run a command that generates all the necessary code based on this configuration. To do so, run this:

make update

The next important file is where the magic happens, and that's `beater/adcsamplerbeat.go`. It contains the main code of the beat, so you can add your custom functionality there. You can have a look at the file in detail here, but the interesting code is this:

ticker := time.NewTicker(bt.config.Period)
counter := 1
for {
    select {
    case <-bt.done:
        return nil
    case <-ticker.C:
    }

    buf := make([]byte, 512)
    n, _ := port.Read(buf)
    s := string(buf[:n])
    s1 := strings.Split(s, "\n") // split new lines
    if len(s1) > 2 && len(s1[1]) > 16 {
        fmt.Println("s1: ", s1[1])
        s2 := strings.SplitAfterN(s1[1], ":", 2)
        fmt.Println("s2: ", s2[1])
        s3 := strings.Split(s2[1], ",")
        fmt.Println("adc1_val: ", s3[0])
        fmt.Println("adc2_val: ", s3[1])
        fmt.Println("adc3_val: ", s3[2])
        fmt.Println("adc4_val: ", s3[3])
        adc1_val, _ := strconv.ParseFloat(s3[0], 32)
        adc2_val, _ := strconv.ParseFloat(s3[1], 32)
        adc3_val, _ := strconv.ParseFloat(s3[2], 32)
        adc4_val, _ := strconv.ParseFloat(s3[3], 32)

        event := beat.Event{
            Timestamp: time.Now(),
            Fields: common.MapStr{
                "type":     b.Info.Name,
                "counter":  counter,
                "adc1_val": adc1_val,
                "adc2_val": adc2_val,
                "adc3_val": adc3_val,
                "adc4_val": adc4_val,
            },
        }
        bt.client.Publish(event)
        counter++
        logp.Info("Event sent")
    }
}

Before this code, I’m just opening the tty port and send some dummy data to trigger the port and then everything happens in the above loop. In this loop the serial module reads data from the serial port and then parses the interesting information, which are the ADC sample values and constructs a beat event. Finally it publishes the event at the Elasticserver.

In this case I'm using a small trick. Since the data coming from the serial port arrive quite fast, the serial read buffer usually contains 3 or 4 samples, maybe more. For that reason I'm always parsing the second one (index 1), and the reason for that is to avoid dealing with more complex buffer parsing, because the beginning and the end of the buffer may not be complete. That way, I'm just splitting on newlines and dropping the first and the last strings, which may be incomplete. Because this is a trick, you might prefer a better implementation, but in this case I just need a fast prototype.

Finally, the last important file is the yaml configuration file, which in this case is `adcsamplerbeat.yml`. This file contains the configuration that overrides the default values and also the generic configuration of the beat, which means you need to configure the IP of the remote Elasticsearch server as well as any other configuration available for the beat, e.g. authentication etc. Be aware that the period parameter in the yml file refers to how often the beat client connects to the server to publish its data. This means that if the beat client collects 12 events per second, it will still connect only once per second and publish that data in one batch.

For now this configuration file is not that important because I’m going to use Yocto that will override the whole yaml file using the package recipe.

Building the Yocto image

Using Yocto is the key to this demonstration, and the reason is that with a Yocto recipe you can build any elastic beat for your specific architecture. For example, even for the official beats there are only aarch64 pre-built binaries and there aren’t any armhf or armel ones; but in this case the STM32MP1 is an armhf CPU, therefore it wouldn’t even be possible to find binaries for this CPU. This is where Yocto comes in handy and we can use its superpowers to also build our custom beat.

For the STM32MP1 I’ve created a BSP base layer that simplifies a lot the Yocto development and you can find it here:


I’ve written thorough details on how to use it and build an image in the repo README file, therefore I’ll skip this step and focus on the important stuff. First you need to build the stm32mp1-qt-eglfs-image image (well, you could also build another one, but this is what I did for other reasons which are irrelevant to this post). Then you need to flash the image on the STM32MP1. I didn’t add the firmware and the adcsamplerbeat recipes to the image; instead I’ve just built the recipes, scp’ed the deb files to the target and installed them using dpkg, which works just fine for developing and testing.

The firmware for the CM4 is already included in the meta-stm32mp1-bsp-base repo and the recipe is the `meta-stm32mp1-bsp-base/recipes-extended/stm32mp1-rpmsg-adcsampler/stm32mp1-rpmsg-adcsampler_git.bb`.

But you need to add the adcsampler recipe yourself. For that reason I’ve created another meta recipe layer which is this one:


To use this recipe you need to add it to your sources folder and then also add it to your bblayers.conf. First clone the repo in the sources folder.

cd sources
git clone https://gitlab.com/dimtass/meta-adcsamplebeat.git

Then, in the build folder where you’ve run the setup-environment.sh script, as explained in my BSP base repo, run this command

bitbake-layers add-layer ../sources/meta-adcsamplebeat

Now you should be able to build the adcsamplerbeat for the STM32MP1 with this command

bitbake adcsamplerbeat

After the build finishes the deb file should be in `build/tmp-glibc/deploy/deb/cortexa7t2hf-neon-vfpv4/adcsamplerbeat_git-r0_armhf.deb`.

That’s it! Now you should have everything you need. So first boot the STM32MP1, then scp the deb file from your build host and install it on the target like this

# Yocto builder
scp /path/to/yocto/build/tmp-glibc/deploy/deb/cortexa7t2hf-neon-vfpv4/adcsamplerbeat_git-r0_armhf.deb root@<stm32mp1ip>:/home/root

# On the STM32MP1
cd /home/root
dpkg -i adcsamplerbeat_git-r0_armhf.deb

Of course you need to do the same procedure for the CM4 firmware, meaning that you need to build it using bitbake

bitbake stm32mp1-rpmsg-adcsampler

And then scp the deb from `build/tmp-glibc/deploy/deb/cortexa7t2hf-neon-vfpv4/stm32mp1-rpmsg-adcsampler_git-r0_armhf.deb` to the STM32MP1 target and install it using dpkg.

Finally, you need to verify that everything is installed. To do that, have a look at the following paths and verify that the files exist


If the above files exist then probably everything is set up correctly.

Last thing is that you need to load the proper device tree otherwise the ADCs won’t work. To do that edit the `/boot/extlinux/extlinux.conf` file on the board and set the default boot mode to `stm32mp157c-dk2-m4-examples` like this

DEFAULT stm32mp157c-dk2-m4-examples

This mode will use the `/stm32mp157c-dk2-m4-examples.dtb` device tree file, which lets the CM4 enable the ADC LDOs using the PMIC.

Testing the adcsamplerbeat

Finally, it’s time for some fun. Now everything should be ready on the STM32MP1 and also the Elasticsearch and Kibana servers should be up and running. First thing you need to do is to execute the CM4 firmware. To do that run this command on the STM32MP1 target

fw_cortex_m4.sh start

This command will load the stm32mp157c-rpmsg-adcsampler.elf from /lib/firmware onto the CM4 and execute it. When this is done, the CM4 will start sampling the four ADC channels and send the results to the Linux user-space using OpenAMP.

Next, you need to run the adcsamplerbeat on the STM32MP1, but before you do that, verify that the `/etc/adcsamplerbeat/adcsamplerbeat.yml` file contains the correct configuration for the tty port and the remote Elasticsearch server. Also make sure that both the SBC and the server are on the same network. Then run this command:

adcsamplerbeat

You should see an output like that

root:~# adcsamplerbeat
Opening serial: %s,%d /dev/ttyRPMSG0 115200
s1:  ADC[2.2]:4042,3350,4034,4034
s2:  4042,3350,4034,4034
adc1_val:  4042
adc2_val:  3350
adc3_val:  4034
adc4_val:  4034

This means that the custom beat is working fine on the armhf CPU! Nice.

Now you need to verify that the Elastic server also gets the data from the beat. To do that, open a browser (I’m using the X96mini as a remote Kibana client) and type the Kibana URL. Wait for the UI to load and then, using your mouse and starting from the left menu, browse to this path

Management -> Stack Management -> Kibana -> Index Patterns -> Create index pattern

Then you should see this

If you already see those sources that you can select from in your results, then it means that it’s working. So now you need to type adcsamplerbeat-* in the index pattern name like this:

Finally, after applying and getting to the next step you should see something similar to this

As you can see from the above image, these are the data that are transmitted from the target to the server. In my case I’ve commented out some of these processors in my adcsamplerbeat.yml file

processors:
  - add_host_metadata: ~
  # - add_cloud_metadata: ~
  # - add_docker_metadata: ~

I suggest you do the same to minimize the data transmitted over the network. You could also comment out the host metadata, but that would make it difficult to trace which data in the database belong to which host when many exist.

Now we need to setup a dashboard to be able to visualize these data.

Setting up a custom Kibana dashboard

Kibana is very flexible and a great tool for creating custom dashboards. To create a new dashboard that displays the ADC sample data, use your mouse and starting from the left menu browse to

Dashboard -> Create new dashboard

You should see this

Now select the “Line” visualization and then you get the next screen

Ignore any other indexes and just select adcsamplerbeat-*. Then in the next screen you need to add a Y-axis for every ADC value and use the following configuration for each axis

Aggregation: Max
Field: adcX_val
Custom label: ADCx

Replace X and x with the ADC index (e.g. adc1_val and ADC1, etc.)

Finally, you need to add a bucket. The bucket is actually the X axis, and all you need there is to configure it like this

When you apply all changes you should see something similar to this

In the above plot there are 4x ADCs, but three of them are floating and their value is close to the max ADC range value, which is 4095 (for 12 bits).
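As a side note, converting those raw counts to a voltage is a one-liner; the sketch below assumes a 12-bit ADC and a 3.3V reference, which you should verify against the board’s actual reference voltage:

```go
package main

import "fmt"

// countsToVolts converts a raw 12-bit ADC reading to volts,
// assuming a 3.3V reference voltage (an assumption to check per board).
func countsToVolts(counts int) float64 {
	const vref = 3.3
	const maxCounts = 4095 // 2^12 - 1
	return float64(counts) * vref / maxCounts
}

func main() {
	for _, c := range []int{4042, 3350, 4034} {
		fmt.Printf("%d counts -> %.3f V\n", c, countsToVolts(c))
	}
	// prints:
	// 4042 counts -> 3.257 V
	// 3350 counts -> 2.700 V
	// 4034 counts -> 3.251 V
}
```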

That’s it! You’re done. Now you have your STM32MP1 sampling 4x ADCs and publishing the samples to an Elasticsearch server, and you can use Kibana to visualize the data on a custom dashboard.


So, that was quite a long post, but it was interesting to implement a close-to-real-case-scenario project using ELK with an embedded device. In this example I’ve used Yocto to build the Linux image for the STM32MP1, the firmware for the CM4 MCU and the Go beat module.

I’ve used the CM4 to sample 4x ADC channels using DMA and double buffering and then OpenAMP to send the buffers from the CM4 to the CA7 and the Linux user-space. Then the custom adcsamplerbeat elastic beat module published the ADC sampling data to the Elasticsearch server and finally I’ve used the Kibana web interface to create a dashboard and visualize the data.

This project might be a bit complex because of the STM32MP1, but other than that it’s quite simple. Nevertheless, it’s a fine example of how you can use all those interesting technologies together and build a project like this. The STM32MP157C is an interesting SBC to use in this case because you can connect whatever sensors you like on the various peripherals and then create a custom beat to publish the data.

By using this project as a template you can create very complex setups, with whatever data you like and create your custom filters and dashboards to monitor your data. If you use something like this, then share your experience in the comments.

Have fun!

Using Elastic stack (ELK) on embedded [part 1]


[Update – 04.09.2020]: Added info on how to use Yocto to build the official elastic beats

Data collection and visualization are two very important things that, when used properly, are actually very useful. In the last decade we’ve been overwhelmed with data and especially with how the data are visualized. In most cases, I believe, people don’t even understand what they see or how to interpret the data visualizations. It has become more important in the industry to present data in a visually pleasing way, rather than to actually get a meaning out of them. But that’s another story for a more philosophical post.

In this series of posts I won’t solve the above problem; I will probably contribute to making it even worse, and I’ll do that by explaining how to use Elasticsearch (and some other tools) with your embedded devices. An embedded device in this case can be either a Linux SBC or a micro-controller which is able to collect and send data (e.g. ESP8266). Since there are so many different use cases and scenarios, I’ll start with a simple concept in this post and then it will get more advanced in the next posts.

One thing that you need to have in mind is that data collection and presentation are things that go many centuries back. In the case of IT, though, there are only a few decades of history in presenting digital data. If you think about it, only the tools change as the technology advances, and as happens with all new things, those tools get more fancy, bloated and complicated, but at the same time more flexible. So nothing new here, just old concepts with new tools. Does that mean they’re bad? No, not at all. They are very useful when you use them right.

Elastic Stack

There are dozens of tools and frameworks to collect and visualize data. You can even implement your own simple framework to do that. For example, in this post I’ve designed an electronic load with a web interface. That’s pretty much data collection and visualization. OK, I know, you may say that it’s not really data collection because there is no database, but that doesn’t really mean anything, as you could have a circular buffer with the last 10 values and that would make it a “data collector”. Anyway, it doesn’t matter how complex your application and infrastructure are; the only thing that matters is that the ground concept is the same.
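To make the point concrete, such a primitive “data collector” is literally a few lines of code; the sketch below is just a toy ring buffer keeping the last N values, unrelated to any Elastic API:

```go
package main

import "fmt"

// Ring keeps only the last N values that were pushed into it.
type Ring struct {
	data []float64
	next int
	full bool
}

func NewRing(n int) *Ring { return &Ring{data: make([]float64, n)} }

// Push stores a value, overwriting the oldest one when full.
func (r *Ring) Push(v float64) {
	r.data[r.next] = v
	r.next = (r.next + 1) % len(r.data)
	if r.next == 0 {
		r.full = true
	}
}

// Values returns the stored values from oldest to newest.
func (r *Ring) Values() []float64 {
	if !r.full {
		return append([]float64{}, r.data[:r.next]...)
	}
	return append(append([]float64{}, r.data[r.next:]...), r.data[:r.next]...)
}

func main() {
	r := NewRing(10)
	for i := 1; i <= 12; i++ {
		r.Push(float64(i))
	}
	fmt.Println(r.Values()) // prints the last 10 values: [3 4 5 6 7 8 9 10 11 12]
}
```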

So, Elastic Stack (EStack) is a collection of open-source tools that collect, store and visualize data. There are many other frameworks, but I’ve chosen EStack because it’s open source, it’s nowadays a mature framework and it’s quite popular in the DevOps community. The main tools of the EStack are Elasticsearch (ES), Kibana (KB), Logstash (LS) and Beats, but there are also others. Here is a video that explains a bit better how those tools are connected together.

It’s quite simple to understand what they do, though. Elasticsearch is a database server that collects and stores data. Logstash and Beats are clients that send data to the database, and Kibana is just a user interface that presents the data on a web page. That’s it. Simple. Of course, that’s just the main concept of the tools; they offer much more than that and they add new functionality really fast.

Now, the way they do what they do, and the implementation, is what is quite complicated. So EStack is quite a large framework. Protocols, security, databases, visualization, management, alerts and configuration are what make those frameworks huge. The user or the administrator deals with less complexity, but in return the frameworks get larger, and the internal complexity makes the user much less able to debug or resolve any problems inside the infrastructure. So, you win some, you lose some.

Most of the time the motto is: if it works without problems for some time you’re lucky, but if it breaks you’re doomed. Of course at some point things break eventually, so this is where you need backups, because if your database gets corrupted then good luck with it if you have no backup.

Back to EStack… Elasticsearch is a server with a database. You can have multiple ES server nodes running at the same time and different types of data can be stored on each server, but you’re able to access all the data from the different nodes with a single query. Then, you can have multiple Logstash and Beats clients that connect to one or more nodes. The clients send data to the server and the server stores the data in the database. The difference from older similar implementations is that ES uses the json syntax it receives from the client to store the data in the DB. I’m not sure about the implementation details and I may be wrong in my assumption, but I assume that ES creates tables on the fly, with fields according to this json syntax, if they don’t already exist. So ES is aware of the client’s data formatting when receiving well-formatted data. Anyway, the important thing is that the data are stored in the DB in a way that you can run queries and filters to obtain information from the database.

The main difference between the Logstash and Beats clients is that Logstash is a generic client with multiple configurable functionalities that can be set up to do almost anything, while Beats are lightweight clients tailored to collect and send specific types of data. For example, you can configure Logstash to send 10 different types of data, or you can even have many different Beats that send data to a Logstash client, which then re-formats the data and sends them to an Elasticsearch server. On the other hand, each Beat deals only with a specific type of data; for example it may collect and send an overview of the system resources of the host (client), report the status of a specific server, or poll a log file, parse new lines, format the log data and send them to the server. Each Beat is different. Here you can search all the available community Beats and here are the official Beats.

Beats are just simple client programs written in Go that use the libbeat Go API and perform very specific tasks. You can use this API to write your own Beat clients in Go that do whatever you like. The API just provides the interface and implements the network communication, including authentication. At a higher level a Beat is split into two components: the component that collects the data and implements the business logic (e.g. reading a temperature/humidity sensor), and the publisher component that handles all the communication with the server, including authorization, timeouts etc. The diagram below is simple to understand.

As you can see, libbeat implements the publisher, so you only need to implement the custom/business logic. The problem with libbeat is that it’s only available in Go, which is unfortunate because a plethora, actually the majority, of small embedded IoT devices cannot execute Go. Gladly, there are some C++ implementations out there, like this one, that do the same in C++, but the problem with those implementations is that they only support basic authentication.

You should be aware, though, that the Go libbeat also handles buffering, which is a very important feature, and if you do your own implementation you should take care of that yourself. Buffering means that if the client loses the connection, even for hours or days, it stores the data locally, and when the connection is restored it sends all the buffered data to the server. That way you won’t have discontinued data in your server’s database. It also means that you need to choose an optimal data sampling rate, so your database doesn’t get huge in a short time.

One important thing I need to mention here is that you can write your own communication stack, as it’s just a simple HTTP POST with attached json-formatted data. So you can implement this on a simple baremetal device like the ESP8266, but the problem is that the authentication will be just basic auth. This means that the authentication can be a user/pass, but that’s really a joke as both are attached to an unencrypted plain HTTP POST. I guess that for your home and internal network it’s not much of a problem, but you need proper encryption if your devices are out in the wild.
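As an illustration of how thin that communication stack really is, here is a sketch that builds such a request with Go’s standard library; the server URL, index name and credentials are made-up placeholders, and your Elasticsearch version may expect a slightly different document path:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newIndexRequest builds the plain HTTP POST that a baremetal-style client
// would send to index one document. URL, index and credentials are placeholders.
func newIndexRequest(url, user, pass string, doc map[string]int) (*http.Request, error) {
	body, err := json.Marshal(doc)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest("POST", url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	// Basic auth is just base64(user:pass) in a header -- trivial to
	// implement on an MCU, but worthless without TLS on top.
	req.SetBasicAuth(user, pass)
	return req, nil
}

func main() {
	doc := map[string]int{"adc1_val": 4042, "adc2_val": 3350}
	req, err := newIndexRequest("http://elastic-server:9200/adcsampler/_doc", "user", "pass", doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL, req.Header.Get("Authorization"))
	// prints: POST http://elastic-server:9200/adcsampler/_doc Basic dXNlcjpwYXNz
	// A real client would now do: http.DefaultClient.Do(req)
}
```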

A simple example

To test a new stack, framework or technology you need to start with a simple example and create a proof of concept. As a first example in this series I’ll run an Elasticsearch (ES) server and a Kibana instance on my workstation which will act as the main server. The ES will collect all the data and the Kibana instance will provide the front-end to visualize the data. Then I’ll use an SBC and specifically the nanopi-k1-plus, which is my reference board to test my meta-allwinner-hx BSP layer. The SBC will run a Beat that collects metric data from the running OS and sends them to the Elasticsearch server. This is the system overview:

So, this is an interesting setup here. The server is my workstation, which is a Ryzen 2700X with 32GB RAM and various fast NVMe and SSD drives. The Elasticsearch and Kibana servers run inside docker containers on an NVMe. The host OS is Ubuntu 18.04.5 LTS.

It’s funny, but for the web interface client I’m using the X96 mini TV box… OK, so this is an ARM board based on the Amlogic S905X SoC, which is a quad-core Cortex-A53 running at 1.5GHz with 2GB RAM. It currently runs an Armbian image with Ubuntu Focal 20.04 on the 5.7.16 kernel. The reason I’ve selected this SBC as a web client is to “benchmark” the web interface, meaning I wanted to see how long it takes a low-spec device to load the interface and how it behaves. I’ll come back to this later; anyway, this is how it looks.

A neat little Linux box. Finally, I’ve used the nanopi-k1-plus with a custom Yocto Linux image using the meta-allwinner-hx BSP with the 5.8.5 kernel version. On the nanopi I’m running metricbeat. Metricbeat is a client that collects system resource and performance data and then sends them to the Elasticsearch server.

This setup might look a bit complicated, but it’s not really. It’s quite basic; I’m just using those SBCs with custom OSes, which makes it look complicated, but it’s really a very basic example and it’s better than running everything on the workstation host as docker containers. This setup is more fun and closer to real usage when it comes to embedded. Finally, that’s a photo of the real setup.

Running Elasticsearch and Kibana

For this project I’m running Elasticsearch and Kibana in docker containers on a Linux host and it makes total sense to do so. The reason is that the containers are sandboxed from the host and also, later in a real project, it makes more sense to have a fully provisioned setup as IaC (infrastructure as code); that way you’re able to spawn new instances and control them, maybe using a swarm manager.

So let’s see how to setup a very basic Elasticsearch and Kibana container using docker. First head to this repo here:


There you will see the “part-1” folder, which includes everything that you need for this example. In this post I’m using the latest Elastic Stack version, which is 7.9.0. First you need to have docker and docker-compose installed on your workstation host. Then you need to pull the Elasticsearch and Kibana images from the docker hub with these commands:

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.0
docker pull docker.elastic.co/kibana/kibana:7.9.0

Those two commands will download the images to your local registry. Now you can use those images to start the containers. I’ve already pushed the docker-compose file I’ve used in the repo, therefore all you need to do is to cd in the part-1 folder and run this command:

docker-compose -f docker-compose.yml up

This command will use the docker-compose.yml file and launch two container instances, one for Elasticsearch and one for Kibana. Let’s see the content of the file.

version: '3'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.0
    container_name: elastic_test_server
    environment:
      - bootstrap.memory_lock=true
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - /rnd2/elasticsearch/data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    networks:
      - elastic

  kibana:
    image: docker.elastic.co/kibana/kibana:7.9.0
    container_name: kibana_test
    ports:
      - 5601:5601
    environment:
      ELASTICSEARCH_URL: http://elastic_test_server:9200
      ELASTICSEARCH_HOSTS: http://elastic_test_server:9200
    networks:
      - elastic

networks:
  elastic:
    driver: bridge

You see that there are two services: elasticsearch and kibana. Each service uses the proper image and has a unique container name. The ES container has a few environment variables, but the important one is discovery.type, which declares the instance as a single-node. This matters because there’s only one node here; if you had more, you would need to configure those nodes in the yaml file so they can discover each other in the network. Another important thing is that the ES volume is mapped to the host’s physical drive. This way, when you kill and remove the container instance, the data (and the database) are not lost. Finally, the network and the network ports are configured.

The Kibana service configures the web server port, the environment variables that point to the ES server and the server host. It also puts the service in the same network as ES and, most importantly, the SERVER_HOST setting can point to the host’s IP address so the web server is also accessible from other web clients on the same network. If SERVER_HOST is left at its default (localhost), then you can only access the web interface from the localhost. The final touch is that the network is bridged.

Once you run the docker compose command, both the Elasticsearch and the Kibana servers should be up and running. In order to verify that everything works as expected, open your browser (e.g. on your localhost) and launch this address (the Kibana port mapped in the compose file):

http://localhost:5601


It may take some time to load the interface and it might give some extra information the first time you run the web app, but eventually you should see something like this

Everything should be green, but most importantly plugin:elasticsearch needs to be in ready status, which means that there’s communication between the Kibana app and the Elasticsearch server. If there’s no communication and both instances are running then there’s something wrong with the network setup.

Now it’s time to get some data!

Setting up the metricbeat client

As I’ve already mentioned, the metricbeat client will run on the nanopi-k1-plus. In my case I’ve just built the console image from the meta-allwinner-hx repo. You can use any SBC or any distro instead, as long as it’s arm64, and the reason is that there are only arm64 pre-built binaries for the official Beats. Not only that, but you can’t even find them on the download page here; you need to use a trick to download them.

So, you can use any SBC and distro or image (e.g. an Armbian image), as long as it’s compatible with arm64 and one of the available package formats. In my case I’ve built the Yocto image with deb support, therefore I need a deb package and the dpkg tool in the image. To download the deb package, open your browser and fetch this link.


Then scp the deb file to your SBC (or download it in there with wget) and then run:

dpkg -i metricbeat-7.9.0-arm64.deb

After installing the deb package, the executable will be installed in /usr/bin/metricbeat, a service file will be installed in /lib/systemd/system and some configuration files will be installed in /etc/metricbeat. The configuration file `/etc/metricbeat/metricbeat.yml` is important: this is the file where you set up the list of remote hosts (in this case the Elasticsearch server) and also configure the module and the metricsets that you want the nanopi to send to the ES server. To make it a bit easier, I’ve included the metricbeat.yml file I’ve used in this example in the repo folder, so you just need to scp this file to your SBC, but don’t forget to edit the hosts line and use the IP address of your ES server.

# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: [""]
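For completeness, the modules side of the same file is what selects which data get collected; a minimal inline sketch (metricset names may vary between metricbeat versions) could look like this:

```yaml
metricbeat.modules:
  - module: system
    period: 10s
    metricsets:
      - cpu
      - memory
      - network
      - filesystem
```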

Build Elastic beats in Yocto

I’ve written a software meta layer for Yocto that you can use to add the official Elastic beats into your image. The meta layer is here:


You can add the layer to your sources and to your bblayers.conf file and then add one or more of the following recipes to your image, using IMAGE_INSTALL.

  • elastic-beats-auditbeat
  • elastic-beats-filebeat
  • elastic-beats-heartbeat
  • elastic-beats-journalbeat
  • elastic-beats-metricbeat
  • elastic-beats-packetbeat

Although there are some template configuration files in the repo, it’s expected that you override them and use your own configuration yaml files. The configuration files are located in the `meta-elastic-beats/recipes-devops/elastic-beats/elastic-beats` folder in the repo. Also, you need to read the README carefully, because golang by default sets some files as read-only and if the work directory is not cleaned properly, then the build will fail.

Show me the data

At this point, we have a running instance of an Elasticsearch server, a Kibana server and a ready-to-go SBC with the metricbeat client. The next thing to do is to run the metricbeat client and start collecting data on the server. As I’ve mentioned earlier, the deb file also installs a service file, so you can either run the executable in your terminal or, even better, enable and start the service.

systemctl start metricbeat
systemctl enable metricbeat

In my case I’ve just executed the binary from the serial-tty console, but for long term usage of course the service makes more sense.

Now, if you wait for a bit, the server will start receiving metric data from the SBC. In order to view your data you need to open the Kibana web app in your browser. I’ve tried this on both my workstation and the X96mini in order to compare the performance. On my workstation it takes a couple of seconds to load the interface, but on the X96mini it took around 1.5-2 minutes to load! Yep, that means that the user interface is resource demanding, which is a bit unfortunate as it would be nice to have an option for a lighter web interface (maybe there is one and I’m not aware of it).

Next you need to click on the Kibana tab in the left menu and then click “Dashboard”. This will show a list of some template dashboards. You can implement your own dashboard and do whatever customizations you want in the theme and the components, but for now let’s use one of the templates. Since the nanopi sends system metric data, you need to select the “[Metricbeat System] Host overview ECS” template. You can limit the listed items by searching for the “system” keyword.

If your remote host (e.g. the nanopi) doesn’t show up automatically, then in the upper left corner you need to create a new filter and set “host.name” to the name of your SBC. To get the host name of your SBC, run uname -n in your console. In my case it is:


So, after applying the filter you should get your data. Just have in mind that it might take a few minutes to collect enough data to show something. The next two screenshots are from the X96mini.

Click on the images to view them in full screen. In the first screenshot you see that the web interface is using only 3% of CPU, but it uses 24.4% of the system’s 2GB RAM when it’s idle! In the next screenshot I’ve pressed the “Refresh” button on the web interface to see what happens to the system resources. In this case the web interface needed 120% of CPU, which means that more than one core is utilized. The next two screenshots display all the available data in this dashboard.

Well, that’s it! Now I can monitor my nanopi-k1-plus SBC using the Kibana web app. In case you have many SBCs around running, you can monitor all of them. Of course, monitoring is just one thing you can do. There are many more things that you can do with Kibana, like for example create alerts and send notifications using the APM interface. So, for example you can create alerts for when the storage is getting low, or communication is lost or whatever you like using the available data from the remote host in the server’s database. As you understand there’s no limit in what you can do.


In the first post, I’ve demonstrated a very basic use case of a remote SBC that sends metric data to an Elasticsearch server by using the metricbeat client. Then I’ve shown a template dashboard in Kibana that visualizes the data of the remote client. It can’t get simpler than that, really. Setting up the ES and Kibana server was easy using docker, but as I’ve mentioned I haven’t used any of the available security features; therefore you shouldn’t use this example for a real-case scenario especially if the clients are out in the wilderness of the internet.

The pros of using a solution like this are that it’s quite easy to set up the infrastructure, the project is open source and actively supported, and there’s also an active community that creates custom beats. On the negative side, the libbeat API is only available in Go, which makes it unusable for baremetal IoT devices, and the tools of the Elastic Stack are complex and bloated, which may make it hard to debug issues yourself when they arise. Of course, the complexity is expected as you get tons of features and functionality, actually more features than you will probably use. It’s the downside of all Swiss-army-knife solutions.

Is Elastic Stack only good for monitoring remote SBCs? No. It’s capable of many more things. You need to think about Elastic Stack like it’s just a generic stack that provides functionalities and features and what you do with this it’s up to your imagination. For example you can have hundreds or thousands of IoT devices and monitor all of them, create custom dashboards to visualize any data in any way you like, create reports, create alerts and many other things.

Where you benefit most with such tools is that they scale up more easily. You can have 5 temperature/humidity sensors in your house, or dozens of MOX sensors in an industrial environment, or hundreds of thousands environmental sensors around the world or a fleet of devices sending data to a single or multiple Elasticsearch servers. Then you can use Kibana to handle those data and create visualization dashboards and automations. Since the project is open-source, there’s actually no limit in what you can do with the information inside the database.

I hope that at some point there will be a generic C or even C++ API, similar to libbeat, that doesn’t have many dependencies, or even better no dependencies at all. This API should be implemented in a way that it can be used on baremetal devices that run a firmware, not an OS capable of running Go. This would be really cool.

So, what’s next? In the next posts I’ll show how to create custom beats to use in Linux SBCs and also how to connect baremetal devices to Elasticsearch and then use Kibana to visualize the data.

Are there other similar tools? Well, yes, there are a few of them, like Loggly and Splunk, but I believe that Elastic Stack could fit the IoT perfectly in the future. There are also alternatives to each component; for example, Grafana is an alternative to Kibana. I guess if you are in the market for adopting such a tool you need to do your own research and investigation.

This post just scratched the surface of the very basic things that you can do with Elastic Stack, but I hope it makes somewhat clear how you can use those tools in the IoT domain and what the potential and possibilities are. Personally I like it a lot and I think there's still some room for improvement to make it fit the embedded world.

Have fun!



A Yocto BSP layer for nanopi-neo3 and nanopi-r2s


Having a tech blog comes with some perks and one of them is receiving samples of various SBCs. This time FriendlyElec sent me two very nice boards, the new nanopi-neo3 and the nanopi-r2s. Both SBCs came with all the extra options, including a heatsink and case, and they look really great. So, I received them yesterday and I had to do something with them, right?

As you expect, if you've been reading this blog for some time, I won't do a product presentation or review, because there are already many reviews that pretty much cover everything. Nevertheless, I've done something more challenging and fun that might also be helpful for others: I've created a Yocto BSP layer for those boards that you can use to create your custom Linux distribution.

Similarly to the meta-allwinner-hx layer, I've based the BSP layer on the Armbian image. The reason is that the Armbian image is very well and actively supported, which means that you get updated kernels and also other features, including extended wireless support. Therefore, this Yocto layer applies the same u-boot and kernel patches as Armbian and at the same time allows you to create your own custom distribution and image.


The nanopi-r2s and nanopi-neo3 share the same SoC and almost the same schematic layout, which means that they also share the same device-tree. This is very convenient as it makes it easier to update both BSPs at the same time. At the same time, though, they also have significant differences which make them suitable for different purposes.

The nanopi-r2s is based on the Rockchip RK3328 SoC, which is a quad-core Cortex-A53 with each core running at 1.5GHz. You might already know this SoC as it's used a lot in the various Android TV boxes that are sold on eBay and many Chinese tech sites. In 2018 there was a peak in the release of such TV boxes. Nevertheless, the R2S is not a TV box; its PCB layout is more reminiscent of a network gateway.

This is the R2S I’ve received that came with the metal case.

And this is how it looks when opened.

The board has the following specs:

  • 1 GB DDR4 RAM
  • 1x internal GbE ethernet
  • 1x GbE ethernet (using a USB3.0 to Ethernet bridge IC)
  • 1x USB2.0 port
  • Micro SD
  • micro-USB power port
  • 10x pin-header with GPIOs, I2C, IR receiver and UART

As you can see from the above image, the case has an extruded rectangle which is used as the SoC heatsink. The problem I've seen is that the silicon pad on the SoC is not thick enough and therefore the contact is not that good, which means that the heat dissipation might be affected negatively. In my case I've resolved this by using a thicker thermal silicon pad on the SoC.

FriendlyElec provides an Ubuntu and an OpenWRT image for the board, which I haven't tested as I've implemented my own Yocto BSP layer and image. I've read some people complaining that it doesn't have some things they would like, but I guess nobody can be 100% satisfied with an SBC, and it's also not possible for one SBC to fit all use-cases. From my point of view this board is ideal if you need a quite powerful CPU and two GbE ports.

Personally I’m thinking about implementing a home automation gateway with this board and use one GbE to connect to a KNX/IP bus and the other to the home network, hence keep the two domains separated with a firewall which is more secure.


The nanopi-neo3 is a really compact SBC with the same RK3328 SoC; it's a bit bigger than the nanopi-neo2 board and smaller than the nanopi-neo4 (RK3399). The plastic case is really nice and I can imagine having it somewhere visible in my apartment, as it would visually fit just fine. The specs for this SBC are the following:

  • 2GB DDR4 (also 1GB available)
  • 1x USB3.0 port
  • 2x USB2.0 ports (in pinheader)
  • Micro SD
  • 5V fan connector (2-pin)
  • Type-C power
  • 26x pin-header with GPIOs, I2C, SPI, I2S, IR receiver and UART

As you can see, the main difference is that there's only one GbE port and there is an extra USB3.0 port instead. This configuration makes it more suitable for sharing a USB3.0 drive on your home network or for other IoT applications, as it has a pin-header with many available interfaces.

FriendlyElec provides an Ubuntu and an OpenWRT image with the 5.4 kernel, and the images are the same as for the nanopi-r2s. Actually, since the SoC and the device-tree are the same, you can boot both boards with the same OS on the SD card.

The Yocto BSP layer

As I’ve mentioned, generally I prefer custom build images with Yocto for the various SBCs that I’m using. That’s because I have more control on what’s in the image and also it’s more fun to build my own images. An extra benefit is that this also keeps me up to speed with Yocto and the various updates in the project. Also Yocto is perfect for tailoring the image the way you want it, provides provisioning and it’s fitted for continuous integration. Of course, you can use any other available pre-build image like the ones FriendlyElec provides or the Armbian images and then use Ansible for example to provision your image if you like. But Yocto is much more advanced and gives you access to every bit of the image.

OK, enough with blah-blah, the BSP repo is here:


I think the README file is quite thorough and explains all the steps you need to follow to build the image. Currently it supports the new Dunfell version and I may keep updating it in the future if I have time, as my main priority is the meta-allwinner-hx BSP updates.
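
For reference, wiring a BSP layer like this into a Yocto build usually comes down to two configuration entries. The layer path and machine names below are placeholders, so check the repo's README for the actual values:

```
# conf/bblayers.conf -- register the BSP layer (path is a placeholder)
BBLAYERS += "${TOPDIR}/../meta-nanopi"

# conf/local.conf -- pick the target board (machine names are assumptions)
MACHINE = "nanopi-neo3"
# MACHINE = "nanopi-r2s"
```

After that, a plain bitbake of your image recipe produces the SD card image.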

The image supports the 5.7.17 kernel version and it's based on the mainline stable repo (the patches are applied on top of the upstream repo), which is really nice. Generally Rockchip has better mainline support compared to other vendors, so most of the patches are meant to support specific boards and not so much the SoC.

I’ve added here the boot sequence of the nanopi-neo3 board with the custom Yocto image:

As you can see, the board boots up just fine with the custom image, which is based on the Poky distro. As you can imagine, from now on the sky is the limit with what you can do using these boards and Yocto…


The nanopi-neo3 and nanopi-r2s are two SBCs that are similar enough to be able to use the same Linux image, but at the same time they serve different purposes and use-cases. Personally, I like having them both in my collection and I've already thought of a project for the nanopi-r2s, which is a home automation gateway for KNX/IP. I'll come back with another post about that in the future, as this post is only about presenting this Yocto BSP layer for those boards.

I spent a fun afternoon implementing this Yocto layer and I'm glad that both boards worked fine. It also seems that this is the first time that those boards are supported by a Yocto BSP layer and I'm really interested in finding out how people are going to use it. If you do, please leave feedback here in the comments.

Have fun!

Using code from the internet as a professional


One of the most popular and misunderstood beliefs in engineering and especially in programming is that professional engineers shouldn’t use code from the internet. At least I’ve heard and read about this many times. Furthermore, using code from the internet automatically means that those who do it are bad engineers. Is there any truth to this statement, though? Is it right or wrong?

In this post I will try to give my 2 Euro-cents on this matter and try to explain that there’s not only black and white in programming or engineering, but there are many shades in between.

Be sure it’s not illegal

OK, so first things first. If and when you're using code from the internet, you need to first make sure that you comply with the source code's licence. It's illegal, and furthermore unethical, to use any source code in a way that is against its licence.

One thing about licences is that many people use them without really knowing what they mean. That's pretty much expected, because licences tend to be very complicated documents with many legal terms and sometimes no clear indication of how the code can be used. Also, there are far too many open-source licences. It's easy to get lost.

Because of that, sometimes authors use licences that don't really meet their vision (or criteria) of how they would like to share their code. So if you find source code that you really want to use, but the licence seems to be a restriction, it's just fine to contact the source code owner, ask about the licence, explain how you would like to use the code and get permission.

This is actually the reason that I'm using MIT almost exclusively, so it's clear to people that they can grab the code and do whatever they like with it. Therefore, always check the licence and don't be afraid to ask the author if you want to use the source code in a way that may not meet the current licence.

Reasons to use code from the internet

So, let’s start with this. Why use random source code from the internet? Well, there are many reasons, but I’ll try to list a few of them

  • You don’t know how to do it yourself
  • You’re too bored to write it yourself
  • It takes too much time and effort to write everything by yourself
  • You don’t have the time to read all the involved APIs and functions, so you use a shortcut
  • You think that someone else did it better than you can (= you’re not confident about yourself)
  • You want to do a proof of concept and then re-write parts or all of the code to adapt it to your needs
  • The code you found seems more elegant than your coding
  • This will save you time from your work so you can slack
  • You get too much pressure from your manager or the project and you want to just finish ASAP

All the above seem to be valid reasons and trust me, all of them are full of… traps. So, no matter the reason, there is always a trap in there and you need to be able to deal with it. If you don't know how to do something, then just using random code may lead you to a much worse position further down the path, because you won't be able to debug the code or solve problems. Therefore, if you don't know how to do it, it's only OK to use the code if you spend time to understand it and feel comfortable with it.

If you’re too bored to write it yourself, then you really have to be very confident about your skills and be sure that you can deal with issues later. Well, being bored in this job is also not something that will give you any joy in your work-life, so just be careful about it. I understand that not everyone enjoys their jobs and it’s also normal, but this creates also problems to the other people that may have to deal with it.

If it takes too much time to write the code and you need a shortcut, then be sure that you fully understand the code. If you don't, then it's a trap. The same goes for unknown APIs. It's OK to use functions that you don't really understand in depth, but again you need to be sure about your skills, or at least get a brief overview of the API, and if there is documentation then try to find the important information and warnings.

If you think that someone else did it better than you, then that's fine. You can't be an expert in everything, but you still need to be able to read the code and figure out how it works. Most of the time, if you see very complicated code, then there's probably something wrong with it; it doesn't necessarily mean that the code is advanced or better than yours. Most of the time, code should be simple and clean. The same goes if you find source code that seems more elegant than yours. Syntactic sugar and beautified code don't really mean “better” code.

If you want to use source code from the internet so you can slack, then you need to consider finding another job. Really. It's not good for your mental health. Also, you might just need a break; engineering is a tough job, no matter what other people think.

If you’re getting too much pressure and the only way to deal with the project pressure is to use code from the internet, then that’s also wrong for various reasons. In the end, you’ll probably have to do it anyways, even if you don’t like it, but still that’s a problem and it goes beyond yourself as it’s also bad for the project. Again, try to understand the code as much as possible.

Reasons not to use code from the internet

I guess for people that already have enough experience in the domain, it's clear that most projects are full of crap code. By definition all code is crap, because it's nearly impossible to get it perfectly right, from the specifications and design to the implementation. If you're lucky and really good, then the code won't suck that much, but still it won't be that good or perfect.

Nevertheless, it’s really common to see bad code in projects and the reason is not only that many engineers use code from the internet, but when they do they use it without understand it and without making proper modifications. Another reason is that many times the proof-of-concept code ends up to be a production code, which is a huge mistake. The reason for that is that most of the times there’s pressure from the upper management that if it works don’t fix it, but they don’t understand that this is just a PoC. You can’t do much about this and if there’s pressure then you need to just go along with it, knowing that the shitstorm may come to you in the future.

Often, the reason that engineers quit is that they realize the code base is unmanageable and they abandon the ship before it wrecks. Even that needs experience to do at the right time…

Now do you see what the impact of having really crappy and unmanageable code is? It's a disaster for everyone: for the current project developers, for the newcomers, for the product itself and for the company. Nobody likes to deal with that mess and nobody likes to be responsible for it.

But does that really have to do with just using code from the internet? Well, only partially. The main reason for such a disaster is the lack of experience, planning and good management. And it happens again and again all around the world, even in huge companies and projects.

Therefore, most of the time it seems that the problem is that developers use code from the internet and that ends up as an unmanageable code blob. Yes, it's true. But it's not entirely true. The problem is not the small or large pieces of code from the internet; the problem is that people don't understand the code, or they don't care about it, or they get too much pressure. So the code that comes from the internet is not the real cause of the problem. The problem is the people and the circumstances under which the code was used.


Personally, I like reading code, I use others' code often and I believe there's nothing wrong with that. That doesn't make me feel less professional and it doesn't hurt my work or my projects at all. But at the same time, I strongly believe that you should never use any code from the internet if you don't really understand it, because this will probably end up badly at some point later.

Also, nowadays there are so many APIs and frameworks that it's impossible to be an expert in everything. For that reason, nobody should expect you to be an expert in everything. If they do, then try to avoid those jobs. The only thing that you really must be an expert in is understanding in depth what you're doing and the reasons for doing it, and also being able to foresee the consequences of what you're doing. This comes with experience, though.

If you still lack the experience, then before you use random code from the internet, try hard to understand it. Spend time on it. Try it, test it, use it, change it, play with it and do your own experiments. This will give you the experience you need. By getting more and more experienced, you'll eventually find that pretty much everything is the same; it's just another piece of code or another API, and you'll feel comfortable with it and be able to understand it in no time. But if you don't do this, then you'll never be able to understand the code, even if you use it many times on different projects, and this is really bad and will eventually get you in trouble.

Also, be aware that the internet is full of crap code. Finding something that just works doesn't mean it's good to use. I've seen so much bad code that just works for the presented case, but should never be used in production. Sometimes code may work for the author, but that doesn't mean it will work for you, as even differences in the toolchains, the build and the runtime environment can have a great impact on the code's functionality. You always need to be really cautious about the code you find and be sure that you use it properly.

Therefore, I believe it's totally fine to use code from the internet as long as it's legal, you -really- understand it and you also adapt it to your own needs instead of just copy-pasting it and pushing it into production.

If you think that’s wrong then I’m interested to hear your opinion on the matter.

Have fun!


Measuring antennas with the nanovna v2


Many IoT projects use some kind of interface for transmitting data to a server (or the cloud, which is the fancy name nowadays). Back in my day (I'm not that old) that was a serial port (RS-232 or RS-485) sending data to a computer. Later, serial-to-ethernet modules (like the xport from Lantronix) replaced serial cables and then WiFi (or any other wireless technology like BT and LoRa) came into play. Anyway, today using a wireless module has become the standard, as it's easy and cheap to use and also doesn't need a cable to connect the two ends.

Actually, it’s amazing how cheap is to buy an ESP-01 today, which is based on the ESP8266. With less than $2 you can just add to your device an IEEE 802.11 b/g/n Wi-Fi module with integrated TR switch, balun, LNA, power amplifier and matching network and support for WEP, WPA and WPA2 authentication.

But all those awesome features come with some difficulties or issues, and in this post I'll focus on one of them: the antenna. I'll try to keep this post simple as I'm not an RF engineer, so if you are, then this post is most probably not for you as it's not advanced.


I guess that tons of WiFi modules are sold every year. Many modules, like the ESP-01, come with an integrated PCB antenna, but many others use external antennas. Some time ago I bought a WiFi camera for my 3D printer; that was before I switched to octopi. That camera is based on the ESP32, it has a fish-eye lens and it doesn't have an integrated antenna, but an SMA connector for an external one.

At some point I was curious how well those antennas actually perform and how they're affected by the surrounding environment. There are many ways that the performance of the antenna can be affected, like the material used to build the antenna itself, the position of the antenna, its rotation and also other factors. For example, the 3D printer is made from aluminum and steel, lots of it, which are both conductive, and this means that they affect the antenna. They can affect the antenna performance in many ways, e.g. shifting the resonance frequencies and maybe the gain.

I guess for most applications, and especially for this use case (the camera for the 3D printer), the effect is negligible, but again this is a blog for stupid projects, so let's do this.

Vector Network Analyzers

So, how do you measure the antenna's parameters and what are those parameters? First, an antenna has several parameters, like gain, resonance frequencies, return & insertion loss, impedance, standing wave ratio (SWR, i.e. impedance matching) and others. All those parameters are defined by the antenna design and also the environmental conditions. Since it's very hard to figure out and calculate all those different parameters while designing an antenna, it's easier to design the antenna using not-very-precise calculations and then trim it by testing and measuring.

Another thing is that even in the manufacturing stage, most of the time the antennas won't have the exact same parameters; therefore, you would need to test and measure again in order to trim them. In some designs this can be done a bit more easily, since there is a passive trimmer component.

In any case, in order to measure and test the antenna you need a device called a Vector Network Analyzer, or VNA. VNAs are instruments that actually measure coefficients, and from those coefficients you can then calculate the rest of the parameters. Pretty much all VNAs share the same principal design. The next two pictures show the block diagram of a VNA when measuring an I/O device (e.g. a filter) and an antenna.

The original image was taken from here and I’ve edited it. Click on the images to view them in larger size.

So, in the first image you see that the VNA has a signal generator which feeds the device under test (DUT) from one port and then receives the output of the DUT on a second port. Then the VNA is able to measure the incident, reflected and transmitted signals, and with some maths it can calculate the parameters. The VNA is actually only able to measure the reflection coefficient (S11), which depends on the incident and reflected signal before the DUT, and the transmission coefficient (S21), which depends on the transmitted signal after the DUT. An intuitive way of thinking about this is what happens when you try to fit a hose to a water pump while it's pumping out water. While you try to fit the hose, if the hose diameter doesn't fit the pump, then part of the water flow will get into the hose and some will spill out (reflected back?). But if the hose fits the pump perfectly, then all the flow will go through the hose.

In the case of an antenna, you can't really measure the transmitted energy in a simple way that is accurate enough for these kinds of measurements. Therefore, only one port is used and the VNA can just calculate the reflection coefficient. But most of the time this is enough to get valuable information and be able to trim your antenna. The S11 coefficient is important to calculate the antenna matching, and consequently the impedance matching defines whether your antenna will be able to transmit the maximum energy. If the impedance doesn't match, then your antenna will only transmit a portion of the incident energy.
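
To put some numbers on this: for a one-port measurement S11 is the reflection coefficient Γ = (Z − Z0)/(Z + Z0), and the return loss and VSWR follow directly from it. A quick sketch of the maths in Python (the 75Ω example load is just for illustration):

```python
import math

def gamma(z_load, z0=50.0):
    """Reflection coefficient of a load against the reference impedance."""
    return (z_load - z0) / (z_load + z0)

def return_loss_db(g):
    """Return loss in dB: how much of the incident wave is NOT reflected."""
    return -20.0 * math.log10(abs(g))

def vswr(g):
    """Voltage standing wave ratio; 1.0 means a perfect match."""
    return (1 + abs(g)) / (1 - abs(g))

# A perfectly matched 50-ohm antenna reflects nothing:
print(abs(gamma(50.0)))  # 0.0
# A 75-ohm load on a 50-ohm port reflects 20% of the voltage wave:
g = gamma(75.0)
print(round(abs(g), 2), round(vswr(g), 2), round(return_loss_db(g), 1))  # 0.2 1.5 14.0
```

So the closer the load sits to 50Ω, the higher the return loss and the closer the VSWR gets to 1.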

As I’ve mentioned earlier, the antenna will be affected also from other conditions and not only impedance matching, therefore a portable VNA is really helpful to measure the antenna on site and on different conditions.

For this post, I’m using the new NanoVNA v2 analyzer, which I’ll introduce next.

NanoVNA v2

Until a few years ago a VNA was very expensive and considered an exotic measurement instrument. A few decades ago a VNA would cost many times your apartment (or house) price, and a few years ago many times your yearly salary. Nowadays the IC manufacturing process and technology and the available processing power of the newer small and cheap MCUs have brought this cost down to… $50. Yes, that's just $50. This is so mind-blowing that you should get one even if you don't know what it is and you'll never use it…

Anyway, this is how it looks like:

There are many different sellers on eBay that sell this VNA with different options. The one I bought costs $60 and also includes a stylus pen for the touch screen, hard metallic cover plates for the top and bottom, a battery, coax cables and calibration standards (open, short and a 50Ω load). I've also printed this case here with my 3D printer and replaced the metal plates.

As you can see this is the V2, so there has to be a V1, and there is one. The main and most important difference between the V1 and V2 is the frequency range, which in V1 is from 50kHz up to 900MHz and in V2 is from 50kHz up to 3GHz! That's an amazing range for this small and cheap device and it also means that you can measure the 2.4GHz antennas used in WiFi. I'm not going to get into more details in this post, but you can see the full specs here.

As I’ve mentioned the TFT has a touch controller which is resistive, therefore you need a stylus pen to get precise control. Actually, this display is the ILI9341 which I’ve used in this post a few years back and I’ve managed to get an amazing refresh rate of 50 FPS using a dirt-cheap stm32f103 (blue-pill). As you can see from the above photo, the display is showing a Smith chart and four different graphs. This is the default display format but there are many more that you can get. If you’re interested you can have a look in the manual here.

The graphs are color-coded; in the above photo the green graph is the Smith plot, the yellow is port 1 (incident, reflected), the blue is port 2 (transmitted) and the last one is the phase. At the bottom you can see the center frequency and the span, which in this case are 2.4GHz and 800MHz, but depending on the mode you may see other information at the bottom of the screen. Again, for all the details you can read the manual, as the menu is quite large.

My favourite thing about the nanovna v2 is that the firmware is open source and available on GitHub. You can fork the firmware, make changes, build it and then upload it to the device. The MCU is the GD32F303CC, which seems to be either an STM32F303 clone or a side product from ST with a different part number to target even lower-cost markets. From the Makefile it seems that this MCU doesn't have an FPU, which is a bit unfortunate, as this kind of device would definitely benefit from one. Also, from the repo you can see that libopencm3 and mculib are used. One thing that could be better is CMake support and using docker for the image builds in order to get reproducible builds, which increases stability. Anyway, I'll probably do this myself at some point.
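
As a rough idea of what such a reproducible build setup could look like (the base image, package list and usage here are guesses, not the project's actual requirements):

```dockerfile
# Hypothetical container for building the nanovna v2 firmware with a
# pinned toolchain, so every build uses the exact same environment.
FROM debian:bullseye-slim
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        gcc-arm-none-eabi binutils-arm-none-eabi make git ca-certificates && \
    rm -rf /var/lib/apt/lists/*
WORKDIR /src
# Mount the firmware checkout at /src and run make inside the container:
#   docker run --rm -v "$PWD":/src nanovna-build make
CMD ["make"]
```

Pinning the distro release and toolchain packages is what makes the builds repeatable across machines.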

The only negative I can mention for this device is that the display touch-film is too glossy, which makes it hard to view under sunlight or even during the day inside a bright room. This can be solved, I guess, by using a matte non-reflective film like the ones used on smartphones (but not glass, because the touch membrane is resistive).

The antenna

As I’ve mentioned earlier, for this post I’ll only measure an external WiFi antenna which is common in many different devices. In my case this device is the open source TTGO T-Journal ESP32 camera here. This PCB ships with various different camera sensors and it usually costs ~$15. I bought this camera to use it with my 3D printer before I’ve installed octopi and to be honest I may revert back to it because it was more convenient than octopi. I’ve also printed a case for it and this is how it looks.

Above is just the PCB module and below is the module inside the case and the antenna connected. This is a neat little camera btw.

As you can guess, the antennas in those cheap camera modules are not meant to be very precise, but they do the job. As a matter of fact, it doesn't really make sense to even measure this antenna, because for its usage inside the house the performance is more than enough. It would make more sense with an antenna from which you need to get the maximum performance. Nevertheless, it's fine to use this one for playing around and learning.

This is an omni-directional antenna, which means that it transmits energy like a light-bulb glows light, in all directions. This is nice because it doesn't matter that much how you place the antenna and your camera, but at the same time the gain is low, as a lot of energy is wasted. The opposite would be a directional antenna, which behaves like a flashlight. The problem with those antennas from eBay is that they don't come with specs, so you know nothing about the antenna except that it is supposed to be suitable for the WiFi frequency range.


OK, now let’s measure this antenna using the nanovna and get its parameters. First thing you need to do before taking any measurements is calibrate the VNA. There are a couple of calibration methods, but the nanovna uses the SOLT method. SOLT means Short-Open-Load-Through and this refers to the port 1 load, therefore you need to run 4 different calibrations, one is by shorting the port, then leave the port open, then attaching a 50Ω load and finally connect port 1 through port 2. After those steps the device should be ready and calibrated. There is a nice video here, that demonstrates this better than I can.

The nice thing is that you can save your calibration data; in my case I've set the center frequency to 2.4GHz, calibrated the nanovna and then stored the calibration data in the flash. There are 5 memory slots available, but maybe you can increase that number in the firmware if needed (that's the nice thing about having access to the code).

These are some things that are very important during calibration and usage:

  • Tighten the SMA connectors properly. If you don't do that then all your measurements will be wrong; also, if you tighten them too much then you may break the PCB connector or crack the PCB and routes. The best way is to use a calibrated torque wrench, but since this would cost $20-$60, just be careful instead.
  • Do not touch any conductive parts during measurements. For example don’t touch the SMA connectors.
  • Keep the device away from metal surfaces during calibration.
  • Keep the antenna (or DUT) away from metal surfaces during measurements, unless you do it on purpose.
  • Make sure you have enough battery left or have a power bank with you.

In the next photo you can see the current test setup after the calibration.

As you can see, the antenna is connected on port 1 and I'm running the nanovna on the battery. The next photo was taken after the calibration at 2.4GHz was done and with the antenna connected.

From the above image you see that the antenna resonates at 2456 MHz, which is quite good. From the Smith plot you can see that the impedance doesn't match perfectly; it's a bit higher than 50Ω (54Ω) and it also has an 824pH inductance. That means that part of the incident energy is reflected at the resonance frequency.
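
Plugging those measured values into the mismatch formula Γ = (Z − Z0)/(Z + Z0) shows how small the reflection actually is; a quick Python sanity check of the figures above:

```python
import math

# The nanovna reported Z = 54 ohm in series with 824 pH at the 2456 MHz
# resonance; turn that into the reflection coefficient and return loss.
f = 2.456e9
z = 54 + 1j * 2 * math.pi * f * 824e-12   # ~54 + j12.7 ohm
g = (z - 50) / (z + 50)                   # reflection coefficient
rl = -20 * math.log10(abs(g))             # return loss in dB
print(round(abs(g), 2), round(rl, 1))     # 0.13 17.9
```

So roughly 13% of the voltage wave (under 2% of the power) is reflected at resonance, which is a perfectly usable match for a cheap antenna.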

Next you can see how the antenna parameters change in real time when a fork gets close to the antenna.

As you can see, the antenna is drastically affected, so you need to keep this in mind. Of course, in this case the fork gets quite close to the antenna.

Finally, I’ve used the whole setup near the 3D printer while printing to see how the printer affects the antenna in different positions. As you might be able to see from the following gif, the position does affect the antenna and also is seems that the head movement affects it, too.


I have to admit that although I'm not an RF engineer or experienced in this domain, I find the nanovna-v2 a great device. It's not only amazing what you can get nowadays for $60, but it's also nice to be able to experiment with this RF magic stuff. At some point in the future I'm planning to make custom antennas for drones and FPVs and also build my own FPV, so this will certainly be very useful then.

Even if you’re not doing any RF, this is a nice toy to play around with and to try to understand how it works and how antennas work and perform. Making measurements is really easy, and being a portable device makes it even better for experimenting. It’s really cool to see the antenna parameters in real time and try to explain what you see on the display.

As I’ve mentioned in the post, this use case is a bit pointless, as the camera just works fine inside the house and the distance from the wifi router is not that far. Still, it was another stupid-project which was fun and I’ve learned a few new tricks. Again, I really recommend this device, even if you’re not an RF engineer; just experiment with it. There are many variations and clones on eBay and most of them seem to be compatible with the open source firmware, but be aware that some variations may not be compatible.

Have fun!


Benchmarking the STM32MP1 IPC between the MCU and CPU (part 2)


In this post series I’m benchmarking the IPC between the Cortex-A7 (CA7) and the Cortex-M4 (CM4) of the STM32MP1 SoC. In the previous post here, I tested the default OpenAMP TTY driver. This IPC method is the direct buffer sharing mode, because OpenAMP is used as the buffer transfer API. My tests verified that this option is very slow for large data transfers. Also, a TTY interface is not really ideal to integrate into your code, for several reasons. For me, dealing with old interfaces and APIs in new code is not the best option when there are other ways to do it.

After seeing the results, my next idea was to replace the TTY interface with a Netlink socket. In order to do that though, I needed to write a custom kernel module, a raw OpenAMP firmware and also the userspace client that sends data to the Netlink socket. In this post I’ll briefly explain those things and also present the benchmark results and compare them with the TTY interface.

Test Yocto image

I’ve added the recipes with all the code in the BSP base layer here:


This is the same repo as the previous post, but I’ve now added 3 more recipes, which are the following:

To build the image using this BSP base layer, read the README.md file in the repo or the previous post. The README file is more than enough, though.

The repo of the actual code is here:


Kernel driver code

The kernel driver module code is this one here. As you can see, it’s quite a simple driver; no interrupts, DMA or anything fancy. In the init function I just register a new rpmsg driver.

ret = register_rpmsg_driver(&rpmsg_netlink_drv);

This line just registers the driver, but the driver will only be probed when a new service requests this specific driver’s id name, which is this one here:

static struct rpmsg_device_id rpmsg_netlink_driver_id_table[] = {
    { .name	= "rpmsg-netlink" },
    { },
};

Therefore, when a new device (or service) is added that requests this name, the probe function will be executed. I’ll show you later how the driver is actually triggered.

Then, regarding the probe function: when it’s triggered, a new Netlink kernel socket is created and a callback function is set in the cfg Netlink kernel configuration struct.

nl_sk = netlink_kernel_create(&init_net, NETLINK_USER, &cfg);
if(!nl_sk) {
    dev_err(dev, "Error creating socket.\n");
    return -10;
}

The callback for the Netlink kernel socket is this function:

static void netlink_recv_cbk(struct sk_buff *skb)

This callback is triggered when new data is received on this socket. It’s important to select a unit id that is not used by other Netlink kernel drivers. In this case I’m using this id:

#define NETLINK_USER 31

Inside the Netlink callback, the received data is parsed and then sent to the CM4 using the rpmsg (OpenAMP) API. As you can see from the code here, the driver doesn’t send the whole received buffer at once; it splits the data into blocks, as rpmsg has a buffer hard-coded to 512 bytes. Therefore, the limitation that we had in the previous post still remains, of course. The point, as I’ve mentioned, is to simplify the userspace client code and not use TTY.
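The splitting itself is straightforward. Here is a rough userspace model of what the driver’s Netlink callback does (the names here are hypothetical; the real driver calls the rpmsg send API, and in practice the 512-byte rpmsg buffer also has to fit the rpmsg header):

```c
#include <stddef.h>

#define RPMSG_MAX_BLOCK 512  /* the hard-coded rpmsg buffer size mentioned above */

static size_t sent_total;  /* bytes accounted by the stub below */

/* Stub standing in for the rpmsg send call; here it only counts the bytes */
static int send_block(const unsigned char *data, size_t len)
{
    (void)data;
    sent_total += len;
    return 0;
}

/* Split a large buffer received from Netlink into rpmsg-sized blocks,
 * the way the kernel driver does before forwarding them to the CM4. */
static size_t send_chunked(const unsigned char *buf, size_t len)
{
    size_t blocks = 0;
    while (len) {
        size_t n = len < RPMSG_MAX_BLOCK ? len : RPMSG_MAX_BLOCK;
        if (send_block(buf, n) < 0)
            break;
        buf += n;
        len -= n;
        blocks++;
    }
    return blocks;
}
```

So a 1300-byte Netlink message, for example, ends up as three rpmsg transfers (512 + 512 + 276).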

Finally, rpmsg_drv_cb() is the OpenAMP callback function and you can see the code here. This callback is triggered when the CM4 firmware sends data to the CA7 via rpmsg. In this case, the CM4 firmware sends the number of bytes that were received from the CA7 kernel driver (from the Netlink callback). The callback then sends this uint16_t back to the userspace application using Netlink.

Therefore, the userspace app sends/receives data to/from the kernel using Netlink, and the kernel sends/receives data to/from the CM4 firmware using rpmsg. Note that all these stages copy buffers! So, no zero-copy here but multiple memcpys, so we already expect some latency; we’ll see how much later.

CM4 firmware

The CM4 firmware code is here. This code is more complex than the kernel driver, but the interesting code is in the main.c file. The most important lines are:

#define RPMSG_SERVICE_NAME              "rpmsg-netlink"


OPENAMP_create_endpoint(&resmgr_ept, RPMSG_SERVICE_NAME, RPMSG_ADDR_ANY, rx_callback, NULL);

As you may have guessed, the RPMSG_SERVICE_NAME is the same as the kernel driver id name. These two names need to match, otherwise the kernel driver won’t be probed.

The rx_callback() function is the interrupt callback of the rpmsg on the firmware side. It only copies the buffer (more memcpys in the pipeline) and then the handling is done in main(), in this code:

if (rx_dev.rx_status == SET) {
  /* Message received: send back a message answer */
  rx_dev.rx_status = RESET;

  struct packet* in = (struct packet*) &rx_dev.rx_buffer[0];
  if (in->preamble == PREAMBLE) {
    in->preamble = 0;
    rpmsg_expected_nbytes = in->length;
    log_info("Expected length: %d\n", rpmsg_expected_nbytes);
  }

  log_info("RPMSG: %d/%d\n", rx_dev.rx_size, rpmsg_expected_nbytes);
  if (rx_dev.rx_size >= rpmsg_expected_nbytes) {
    rx_dev.rx_size = 0;
    rpmsg_reply[0] = rpmsg_expected_nbytes & 0xff;
    rpmsg_reply[1] = (rpmsg_expected_nbytes >> 8) & 0xff;
    log_info("RPMSG resp: %d\n", rpmsg_expected_nbytes);
    rpmsg_expected_nbytes = 0;

    if (OPENAMP_send(&resmgr_ept, rpmsg_reply, 2) < 0) {
      log_err("Failed to send message\r\n");
    }
  }
}
As you can see from the above code, the buffer is parsed, and if there is a valid packet in there, it extracts the length of the expected data; when that data has been received, it sends back the number of bytes using the OpenAMP API. That data is then received by the kernel and sent to userspace using Netlink.

User-space application

The userspace application code is here. If you browse the code you’ll find that it’s very similar to the previous post’s tty client; I’ve only made a few changes, like removing the tty and adding the Netlink socket class. Like in the previous post, a number of tests are added when the program starts, like this:


Then the tests are executed. What you may find interesting is the Netlink class code, and especially the part that sends/receives data to/from the kernel, which is this code here:

do {
    int n_tx = buffer_len < MAX_BLOCK_SIZE ?  buffer_len : MAX_BLOCK_SIZE;
    buffer_len -= n_tx;

    memset(&kernel, 0, sizeof(kernel));
    kernel.nl_family = AF_NETLINK;
    kernel.nl_groups = 0;

    memset(&iov, 0, sizeof(iov));
    iov.iov_base = (void *)m_nlh;
    iov.iov_len = n_tx;
    std::memset(m_nlh, 0, NLMSG_SPACE(n_tx));
    m_nlh->nlmsg_len = NLMSG_SPACE(n_tx);
    m_nlh->nlmsg_pid = getpid();
    m_nlh->nlmsg_flags = 0;

    std::memcpy(NLMSG_DATA(m_nlh), buffer, n_tx);

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = &kernel;
    msg.msg_namelen = sizeof(kernel);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    L_(ldebug) << "Sending " << n_tx << "/" << buffer_len;
    int err = sendmsg(m_sock_fd, &msg, 0);
    if (err < 0) {
        L_(lerror) << "Failed to send netlink message: " <<  err;

} while(buffer_len);

As you can see, the data is not sent to the kernel driver as a single buffer via the Netlink socket. The reason is that the kernel socket can only allocate a buffer equal to the page size, therefore if you try to send more than 4KB the kernel will crash. Therefore, we need to split the data into smaller blocks and send them via Netlink. There are ways to increase this size, but such a change would be global to the whole kernel, which means that all drivers would allocate larger buffers even if they don’t need them, and that’s a waste of memory.
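For reference, the receive side is simpler: the kernel driver’s reply carries just the uint16_t byte count as the Netlink payload, so after recvmsg() the client only has to pull it out with NLMSG_DATA(). A minimal sketch of that layout (no real socket involved here; the buffer is filled by hand purely to show where the payload sits):

```c
#include <linux/netlink.h>
#include <stdint.h>
#include <string.h>

/* Extract the 2-byte ack (number of bytes the CM4 received) from a
 * Netlink message, the way the client does after recvmsg(). */
static uint16_t parse_ack(const struct nlmsghdr *nlh)
{
    uint16_t nbytes = 0;
    if (nlh->nlmsg_len >= NLMSG_SPACE(sizeof(nbytes)))
        memcpy(&nbytes, NLMSG_DATA(nlh), sizeof(nbytes));
    return nbytes;
}

/* Build a fake reply in memory to show the layout; in reality the kernel
 * driver fills this message and it arrives through the socket. */
static uint16_t demo_roundtrip(uint16_t nbytes)
{
    union {
        struct nlmsghdr hdr;
        unsigned char raw[NLMSG_SPACE(sizeof(uint16_t))];
    } msg;

    memset(&msg, 0, sizeof(msg));
    msg.hdr.nlmsg_len = NLMSG_SPACE(sizeof(uint16_t));
    memcpy(NLMSG_DATA(&msg.hdr), &nbytes, sizeof(nbytes));
    return parse_ack(&msg.hdr);
}
```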

Benchmark results

To execute the test I’ve built the Yocto image using my BSP base layer, which includes all the recipes and installs everything in the image by default. What’s important is that the module is already loaded when the kernel boots, so there’s no need to modprobe it. Given this, you only need to upload the firmware to the CM4 and then execute the application. In this image, all the commands need to be executed in the /home/root path.

First load the firmware like this:

./fw_cortex_m4_netlink.sh start

When running this, the kernel will print those messages (you can use dmesg -w to read those).

[ 3997.439653] remoteproc remoteproc0: powering up m4
[ 3997.444869] remoteproc remoteproc0: Booting fw image stm32mp157c-rpmsg-netlink.elf, size 198364
[ 3997.452743]  mlahb:m4@10000000#vdev0buffer: assigned reserved memory node vdev0buffer@10042000
[ 3997.461387] virtio_rpmsg_bus virtio0: rpmsg host is online
[ 3997.467937]  mlahb:m4@10000000#vdev0buffer: registered virtio0 (type 7)
[ 3997.472245] virtio_rpmsg_bus virtio0: creating channel rpmsg-netlink addr 0x0
[ 3997.473121] remoteproc remoteproc0: remote processor m4 is now up
[ 3997.492511] rpmsg_netlink virtio0.rpmsg-netlink.-1.0: rpmsg-netlink created netlink socket

The last line is actually printed by our kernel driver module. This means that when the firmware was loaded, the driver’s probe function was triggered, because it was matched by the RPMSG_SERVICE_NAME in the firmware. Next, run the application like this:


This will execute all the tests. This is a sample output on my board.

- 21:27:31.237 INFO: Application started
- 21:27:31.238 INFO: Initialized netlink client.
- 21:27:31.245 INFO: Initialized buffer with CRC16: 0x1818
- 21:27:31.245 INFO: ---- Creating tests ----
- 21:27:31.245 INFO: -> Add test: size=512
- 21:27:31.245 INFO: -> Add test: size=1024
- 21:27:31.245 INFO: -> Add test: size=2048
- 21:27:31.246 INFO: -> Add test: size=4096
- 21:27:31.246 INFO: -> Add test: size=8192
- 21:27:31.246 INFO: -> Add test: size=16384
- 21:27:31.246 INFO: -> Add test: size=32768
- 21:27:31.246 INFO: ---- Starting tests ----
- 21:27:31.268 INFO: -> b: 512, nsec: 21384671, bytes sent: 20
- 21:27:31.296 INFO: -> b: 1024, nsec: 27190729, bytes sent: 20
- 21:27:31.324 INFO: -> b: 2048, nsec: 27436772, bytes sent: 20
- 21:27:31.361 INFO: -> b: 4096, nsec: 31332686, bytes sent: 20
- 21:27:31.419 INFO: -> b: 8192, nsec: 55592343, bytes sent: 20
- 21:27:31.511 INFO: -> b: 16384, nsec: 88094875, bytes sent: 20
- 21:27:31.681 INFO: -> b: 32768, nsec: 162541198, bytes sent: 20

The results start after the “Starting tests” string; b is the block size and nsec is the number of nanoseconds the whole transaction lasted. Ignore the “bytes sent” size, as it’s not correct and fixing it would be a lot of hassle (it would need a static counter in the kernel driver), which isn’t really worth the trouble. I’ve used the Linux precision timers, which are not very precise compared to the CM4 timers, but are enough for this test since the times are in the range of milliseconds. I’m also listing the results in the next table.

# of bytes (block) msec
512 21.38
1024 27.19
2048 27.43
4096 31.33
8192 55.59
16384 88.09
32768 162.54

Now let’s compare those numbers with the previous tests in the following table:

# of bytes TTY (msec) Netlink (msec) diff (msec)
512 11.97 21.38 +9.41
1024 15.32 27.19 +11.87
2048 21.74 27.43 +5.69
4096 37.64 31.33 -6.31
8192 – 55.59 –
16384 – 88.09 –
32768 – 162.54 –

These are interesting numbers. As you can see, up to 2KB of data the TTY implementation is faster, but at >=4KB the Netlink driver has better performance. It’s also important that the Netlink implementation doesn’t have the issue with the limited block size, so you can send more data using the netlink client API I’ve written. Well, the truth is that it does have a block size limit hard-coded in OpenAMP, but in this case, without the TTY interface, the ringbuffer seems to empty properly. That’s something that would need further investigation, but I don’t think I’ll have time for it.

From this table, in the case of the 32KB block, we see that the transfer rate is 201.6 KB/sec, which is almost double compared to the TTY implementation. This is much better performance, but again it’s far slower than the indirect buffer sharing mode, which I’ll test in the next post.
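To double-check that figure: 32768 bytes transferred in 162.54 ms is 32768 / 0.16254 ≈ 201,600 bytes/sec, i.e. ~201.6 KB/sec if you take 1 KB = 1000 bytes. A throwaway helper (not part of the benchmark code) for turning the table values into rates:

```c
/* Transfer rate in KB/sec (1 KB = 1000 bytes here) from a block size in
 * bytes and a duration in nanoseconds, as reported by the benchmark tool. */
static double rate_kb_per_sec(double bytes, double nsec)
{
    return bytes / (nsec / 1e9) / 1000.0;
}
```

For example, rate_kb_per_sec(32768, 162541198) gives roughly 201.6.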


In this post I’ve implemented a kernel driver that uses OpenAMP to exchange data with the CM4 and a netlink socket to exchange data with userspace. In this scenario the performance is worse than the default TTY implementation (/dev/ttyRPMSGx) for blocks smaller than 4KB, but faster for >=4KB. Also, my tests showed that for a 32KB block this implementation is twice as fast as the TTY.

Although the results are better than the previous post, this implementation still can’t be considered a good option if you need to transfer large amounts of data fast. Nevertheless, I would personally consider it a good option for smaller data sizes, because the interface from userspace is now much more convenient, as it’s based on a netlink socket. Therefore, you don’t need to interface with a TTY port anymore, and that is an advantage.

So, I’m quite happy with the results. I would definitely prefer this option over the TTY interface for data blocks larger than 2KB, because netlink is a friendlier API, at least to me. Maybe you have a different preference, but overall both solutions are only good for smaller block sizes.

In the next post, I’ll benchmark the indirect buffer sharing.

Have fun!

Benchmarking the STM32MP1 IPC between the MCU and CPU (part 1)


Update: The second part is here.

Long time no see. Well, having two babies at home is really not recommended for personal stupid projects. Nevertheless, I’ve found some time to dive into the STM32MP157C-DK2 dev kit that I received recently from ST. Let me say beforehand that this is a really great board; of course it has its pros and cons, which I will mention later in this post. Although it’s been a year since this board was released, I hadn’t had the chance to use it, and I’m glad I received this sample.

I need to mention that this post is not an introduction to the dev kit; it deals with more advanced concepts. One thing I like about some of the new SoCs released in the last couple of years is the integrated MCU alongside the CPU, sharing the same resources and peripherals. This is a really powerful option, as it allows you to split your functionality into real-time, critical applications that run on the MCU and more complex userspace applications that run on Linux on the CPU.

The MCU is an STM32 ARM Cortex-M4 core (CM4) running at 209MHz and the application CPU is a dual Cortex-A7 (CA7) running at 650MHz. Well, yes, this is a fast CM4 and a slow CA7. As I’ve said those two can share almost all the same peripherals and the OpenAMP API is used as an IPC to exchange data between those two.

I have quite some experience working with real-time Linux kernels and I really like this concept; that’s why I always support a PREEMPT-RT (P-RT) kernel in my meta-allwinner-hx Yocto meta layer. Although P-RT is really nice to have when low latency is needed, it’s still far too bloated and slow compared to baremetal firmware running on an MCU. I won’t get into the details of this, because I think it’s obvious. Therefore, the STM32MP1 and other similar SoCs combine the best of those two worlds: you can use the CM4 for critical timing applications and the CA7 for everything else, like the GUI, networking, remote updates e.t.c.

In the case of the STM32MP1, the OpenAMP framework is used for the IPC between the CM4 and CA7. When I went through the docs for the STM32MP1, the first thing that came to my mind was to benchmark and evaluate its performance, and this series of posts is about doing exactly that: finding out what options you have and how well they perform.

Reading the docs

One thing I like about ST is the very good documentation. Well, tbh it’s far from excellent, but it’s better than other documentation I’ve seen. Generally, bad or missing documentation is a well-known problem with almost all vendors, and in my opinion Texas Instruments and ST are far better compared to the rest (I’ll exclude NXP’s imx6 from this comparison, as it’s also well documented).

When you start with a new SoC and you already have experience with other SoCs, it’s easier to retrieve valuable information quickly and to search for the specific information you know is important. Nevertheless, I really enjoyed getting into the STM32MP1 documentation in the wiki, because I found really good explanations of the Linux graphics stack, trusted firmware, OP-TEE and the boot chain. For example, have a look at the DRM/KMS overview here, it’s really good.

Of course there are many things missing from the docs, but I also need to say that this is a quite complex SoC, and there are so many tools and frameworks involved in the development process that it would need tens of thousands of pages to explain everything, which is not manageable even for ST. For example, you’ll find yourself lost the moment you want to create your own Yocto distro, as there is very little info on how to do that; that’s the reason I’ve decided to write a template BSP layer that greatly simplifies the process.


While diving into the SoC, I couldn’t help myself fixing some things, so I’ve made some contributions, like this and this, and I’ve also created this meta layer here. I’m also planning to contribute a kernel module and a userspace application that will use netlink between userspace and the kernel module, but I’ll explain why later. This will be demonstrated in the next post, though, as it’s related to the IPC performance.

IPC options

OK, so let’s now see what options you currently get with the STM32MP1 when it comes to IPC between the CM4 and CA7. I’ve already mentioned OpenAMP, which ST calls the direct buffer exchange mode, but you can also use an indirect buffer exchange mode. Both are well explained here, but I’ll add a bit more information.

Generally, OpenAMP has a weakness baked into the implementation: the hard-coded buffer size limitation of 512 bytes. We’ll see later how this affects the transfer speed, but as you can imagine this size might not be enough in some cases. Therefore, before you decide which IPC option to choose, you need to know how much data, and how fast, you need to share between the CM4 and CA7.

Another problem with OpenAMP is that there’s no zero-copy or DMA support, which means that buffers are copied at both ends, so performance is expected to suffer. On top of that, the buffers are not cacheable, which costs further CPU cycles for fetching and copying them.

OpenAMP comes with 2 examples/implementations in the SDK. The first is a TTY device on the Linux side, which means that you can transfer data from/to userspace by interfacing with a tty com port. You can see this implementation here. If that sounds bad to you, then it is actually as bad as it sounds, because exchanging data that way really sucks, especially on the Linux side (interfacing a tty from code). The only case I think this is useful and makes sense is if you’re porting existing code from a device that used a CPU and an external MCU exchanging data via a UART port. Otherwise, it doesn’t make much sense to use it. Nevertheless, being loyal to the blog’s name, I’ve spent some time implementing this to see how it performs. More about this later.

The second implementation is raw OpenAMP, where you use a kernel module to interface with the CM4; the firmware implementation is here. In order to exchange data with userspace in this case, you need a kernel module like this one here. But as you can see, this is pretty much useless as-is, because you can’t interface with the module from userspace, therefore you need to write your own module and use ioctl or netlink (or anything else you like). More about this in the next post. (Edit: I’ve since seen that ST provides a module with IOCTL support here.)

On the other hand, there is also the indirect buffer option which, compared to OpenAMP, allows sharing much larger buffers (of any size), in cached memory, with zero-copy and also using DMA. This option is expected to perform much faster than OpenAMP and with much lower latency. To be honest, this sounds like a no-brainer to me in most cases, except when you just need to exchange a few control commands. I mean, it still makes sense for that last case, because you need OpenAMP even in the indirect buffer mode, as it’s used as the control interface for the shared buffer anyway.

Creating a Yocto image

In order to use or test any of the above things, you need to build and flash a distro for the board. Here I can complain about the non-existent documentation on how to do this. In my case that was not really an issue, but I guess many devs will eventually struggle with it. For that reason, I’ve created a BSP base layer for the STM32MP157C-DK2 board. This post, though, is not focused on the Yocto distro itself, so I’ll go through these things quickly so I can proceed further.

To build the distro I’ve used, you need this repo here:


The README file in the repo is quite thorough. So, in order to build the image you need to create a folder and run these commands:

cd stm32mp1-yocto
repo init -u https://bitbucket.org/dimtass/meta-stm32mp1-bsp-base/src/master/default.xml
repo sync

This will fetch all the needed repos and place them in the sources/ folder. Then in order to build the image you need to run these commands:

ln -s sources/meta-stm32mp1-bsp-base/scripts/setup-environment.sh .
MACHINE=stm32mp1-discotest DISTRO=openstlinux-eglfs source ./setup-environment.sh build
bitbake stm32mp1-qt-eglfs-image

Your host needs all the required dependencies installed in order to build the image. If this is the first time you’re doing this, then it’s better to use a docker image, like this one here. I’ve made this docker image to build the meta-allwinner-hx repo, but you can use the same one for this repo, like this:

docker build --build-arg userid=$(id -u) --build-arg groupid=$(id -g) -t allwinner-yocto-image .
docker run -it --name allwinner-builder -v $(pwd):/docker -w /docker allwinner-yocto-image bash
MACHINE=stm32mp1-discotest DISTRO=openstlinux-eglfs source ./setup-environment.sh build
bitbake stm32mp1-qt-eglfs-image

If you want to build the SDK to use it for compiling other code then run this:

bitbake -c populate_sdk stm32mp1-qt-eglfs-image

Then you can install the produced SDK to your /opt folder and then source it to build any code you might want to test.

Benchmark tool code

Assuming you have the above image built, you can flash it to the target and run the test. There is a brief explanation of how to flash the image to the dev kit in the meta-stm32mp1-bsp-base repo’s README.md file. After the image is flashed, the firmware for the CM4 is located at /lib/firmware/stm32mp157c-rpmsg-test.elf. There is also a script and an executable in /home/root. The recipes that install those files are located in `meta-stm32mp1-bsp-base/recipes-extended/stm32mp1-rpmsg-test`.

The code is located in this repo:


This repo contains both the CM4 firmware and the CA7 Linux userspace tool. The userspace tool uses the /dev/ttyRPMSG0 tty port to send blocks of data. As you can see here, I’m creating some tests at runtime with several block sizes and sending those blocks a number of times. For example, see the following:

tester.add_test(512, 1);
tester.add_test(512, 2);
tester.add_test(512, 4);
tester.add_test(512, 8);

The first line sends a block of 512 bytes one time, the second sends the 512-byte block two times, e.t.c. In the code you’ll see that some lines are commented out. The reason is that I found I couldn’t send more than 5-6KB per run; it seems that OpenAMP hangs on the kernel side and I need to close and re-open the tty port to send more data. As I’ve mentioned, the buffer size in OpenAMP is hardcoded to 512 bytes, so I assume the ringbuffer overflows at some point and the execution hangs, but I don’t know the real reason. Therefore, I wasn’t able to test more than 5KB.

The code for interfacing with the com port is here. I guess this class will be very useful to you if you want to interface with the ttyRPMSG in your own code, as I’ve used it many times and it seems robust. It’s also simple to use and integrate into your code.

Finally, the test class is here. I’m using a simple protocol with a header that contains a preamble and the length of the packet data. This simplifies the CM4 firmware, as it knows how much data to wait for and then acknowledges when the whole packet is received. Be aware that when you send a 512-byte block, the data doesn’t arrive on the CM4 at once in a single callback; OpenAMP may trigger more than one interrupt for the whole block, therefore knowing the length of the data at the start of the transaction is important.
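Based on the fields the firmware reads (in->preamble and in->length), the header can be modeled roughly like this; the exact definition, the PREAMBLE value and any packing attributes live in the repo, so treat these as placeholders:

```c
#include <stdint.h>
#include <string.h>

#define PREAMBLE 0xBEEF  /* placeholder; the real value is defined in the repo */

/* Packet header: the preamble marks the start of a transaction and the
 * length tells the CM4 how many bytes to expect before it acknowledges. */
struct packet {
    uint16_t preamble;
    uint16_t length;
    /* payload bytes follow */
};

/* Parse a header out of a receive buffer, as the firmware callback does;
 * returns 0 and fills *expected on success, -1 if the preamble is missing. */
static int parse_header(const uint8_t *rx, uint16_t *expected)
{
    struct packet in;
    memcpy(&in, rx, sizeof(in));  /* copy out to avoid unaligned access */
    if (in.preamble != PREAMBLE)
        return -1;
    *expected = in.length;
    return 0;
}
```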

On the CM4 side, the firmware is located here. The interesting code is in the main.c file, and the most important function is the tty callback, so I’ll also list the code here.

void VIRT_UART0_RxCpltCallback(VIRT_UART_HandleTypeDef *huart)
{
  /* copy received msg in a variable to send it back to the master processor in the main infinite loop */
  uint16_t recv_size = huart->RxXferSize < MAX_BUFFER_SIZE? huart->RxXferSize : MAX_BUFFER_SIZE-1;

  struct packet* in = (struct packet*) &huart->pRxBuffPtr[0];
  if (in->preamble == PREAMBLE) {
    in->preamble = 0;
    virt_uart0_expected_nbytes = in->length;
    log_info("length: %d\n", virt_uart0_expected_nbytes);
  }

  virt_uart0.rx_size += recv_size;
  log_info("UART0: %d/%d\n", virt_uart0.rx_size, virt_uart0_expected_nbytes);
  if (virt_uart0.rx_size >= virt_uart0_expected_nbytes) {
    virt_uart0.rx_size = 0;
    virt_uart0.tx_buffer[0] = virt_uart0_expected_nbytes & 0xff;
    virt_uart0.tx_buffer[1] = (virt_uart0_expected_nbytes >> 8) & 0xff;
    log_info("UART0 resp: %d\n", virt_uart0_expected_nbytes);
    virt_uart0_expected_nbytes = 0;
    virt_uart0.tx_size = 2;
    virt_uart0.tx_status = SET;
    // huart->RxXferSize = 0;
  }
}

As you can see, in this callback the code looks for the preamble and, if it’s found, it stores the expected payload length. The interrupt may then be triggered multiple times, and when all bytes have arrived, the CM4 sends back a response that includes the byte count; the userspace application then verifies that all bytes were sent. It’s as simple as that.

Benchmark results

To run the test you need to cd to the /home/root directory and then run this script.

./fw_cortex_m4.sh start

This will load the firmware stored in /lib/firmware onto the CM4 and execute it. When this happens you should see this in your serial console:

fw_cortex_m4.sh: fmw_name=stm32mp1-rpmsg-test.elf
[  162.549297] remoteproc remoteproc0: powering up m4
[  162.588367] remoteproc remoteproc0: Booting fw image stm32mp1-rpmsg-test.elf, size 704924
[  162.596199]  mlahb:m4@10000000#vdev0buffer: assigned reserved memory node vdev0buffer@10042000
[  162.607353] virtio_rpmsg_bus virtio0: rpmsg host is online
[  162.615159]  mlahb:m4@10000000#vdev0buffer: registered virtio0 (type 7)
[  162.620334] virtio_rpmsg_bus virtio0: creating channel rpmsg-tty-channel addr 0x0
[  162.622155] rpmsg_tty virtio0.rpmsg-tty-channel.-1.0: new channel: 0x400 -> 0x0 : ttyRPMSG0
[  162.633298] remoteproc remoteproc0: remote processor m4 is now up
[  162.648221] virtio_rpmsg_bus virtio0: creating channel rpmsg-tty-channel addr 0x1
[  162.671119] rpmsg_tty virtio0.rpmsg-tty-channel.-1.1: new channel: 0x401 -> 0x1 : ttyRPMSG1

This means that the kernel module is loaded, the two ttyRPMSG devices are created in /dev, the firmware is loaded on the CM4 and the firmware code is executing. In order to verify that the firmware is working, you can use a USB-to-UART module and connect the Rx pin to D0 and the Tx pin to D1 (UART7) of the Arduino-compatible header at the bottom of the board. When the firmware has booted you should see this:

[00000.008][INFO ]Cortex-M4 boot successful with STM32Cube FW version: v1.2.0 
[00000.016][INFO ]MAX_BUFFER_SIZE: 32768
[00000.019][INFO ]Virtual UART0 OpenAMP-rpmsg channel creation
[00000.025][INFO ]Virtual UART1 OpenAMP-rpmsg channel creation

Now you can execute the benchmark tool on the userspace by running this command in the console:

./tty-test-client /dev/ttyRPMSG0

Then the benchmark is executed and you’ll see the results on both consoles (CM4 and CA7). The CA7 output will be something like this:

- 15:43:54.953 INFO: Application started
- 15:43:54.954 INFO: Connected to /dev/ttyRPMSG0
- 15:43:54.962 INFO: Initialized buffer with CRC16: 0x1818
- 15:43:54.962 INFO: ---- Creating tests ----
- 15:43:54.962 INFO: -> Add test: block=512, blocks: 1
- 15:43:54.962 INFO: -> Add test: block=512, blocks: 2
- 15:43:54.962 INFO: -> Add test: block=512, blocks: 4
- 15:43:54.963 INFO: -> Add test: block=512, blocks: 8
- 15:43:54.963 INFO: -> Add test: block=1024, blocks: 1
- 15:43:54.964 INFO: -> Add test: block=1024, blocks: 2
- 15:43:54.964 INFO: -> Add test: block=1024, blocks: 4
- 15:43:54.964 INFO: -> Add test: block=1024, blocks: 5
- 15:43:54.964 INFO: -> Add test: block=2048, blocks: 1
- 15:43:54.964 INFO: -> Add test: block=2048, blocks: 2
- 15:43:54.964 INFO: -> Add test: block=4096, blocks: 1
- 15:43:54.964 INFO: ---- Starting tests ----
- 15:43:54.977 INFO: -> b: 512, n:1, nsec: 11970765, bytes sent: 512
- 15:43:54.996 INFO: -> b: 512, n:2, nsec: 18770380, bytes sent: 1024
- 15:43:55.027 INFO: -> b: 512, n:4, nsec: 31022063, bytes sent: 2048
- 15:43:55.083 INFO: -> b: 512, n:8, nsec: 56251848, bytes sent: 4096
- 15:43:55.099 INFO: -> b: 1024, n:1, nsec: 15322572, bytes sent: 1024
- 15:43:55.124 INFO: -> b: 1024, n:2, nsec: 24824116, bytes sent: 2048
- 15:43:55.168 INFO: -> b: 1024, n:4, nsec: 43830248, bytes sent: 4096
- 15:43:55.221 INFO: -> b: 1024, n:5, nsec: 53327292, bytes sent: 5120
- 15:43:55.243 INFO: -> b: 2048, n:1, nsec: 21742144, bytes sent: 2048
- 15:43:55.281 INFO: -> b: 2048, n:2, nsec: 37633885, bytes sent: 4096
- 15:43:55.318 INFO: -> b: 4096, n:1, nsec: 37649011, bytes sent: 4096

The results start after the “Starting tests” string; b is the block size, n is the number of blocks and nsec is the number of nanoseconds the whole transaction lasted. I’ve used the Linux precision timers, which are not very precise compared to the CM4 timers, but are enough for this test since the times are in the range of milliseconds. I’m also listing the results in the next table.

Block size # of blocks msec
512 1 11.97
512 2 18.77
512 4 31.02
512 8 56.25
1024 1 15.32
1024 2 24.82
1024 4 43.83
1024 5 53.32
2048 1 21.74
2048 2 37.63
4096 1 37.64

The results are interesting. As you can see, size matters: the larger the block size, the faster the transfer. Sending 4KB of data from the CA7 to the CM4 as 8x 512-byte blocks takes 56.25 ms, while a single 4096-byte block takes 37.64 ms, which is roughly 33% less time (or about 1.5x the throughput) with the bigger block size.

Another interesting result is that even in the best case the transfer rate is 4096 bytes in 37.64 msec, or about 106 KB/sec. That’s quite slow!

To sum up the results: if you want to use the direct buffer exchange mode (OpenAMP tty) then you’re limited to ~106 KB/sec, and to achieve even that speed you need the largest possible block size.


In this post I’ve provided a simple BSP base Yocto layer that might help you start a new STM32MP1 project. I’ve used this Yocto image to build a benchmark tool to test the direct buffer exchange mode, and both the firmware and the userspace tool are included as recipes in the BSP base repo.

The results are a bit disappointing when it comes to performance, and I’ve found that the larger the block size, the better the transfer speed. That means this mode is suited for sending small amounts of data (e.g. control commands), or for porting an application to the STM32MP1 that previously ran on a separate CPU and MCU. If you need better performance then the indirect mode seems to be the way to go, but I’ll verify this in a future post.

In the next post I’ll use the direct mode again, but this time I’ll write a small kernel driver module using netlink and test whether it performs any better. I don’t expect much out of this, but I want to do it anyway, because I find a tty interface to a modern program a bit cumbersome to handle and netlink is a more elegant solution.

In the conclusions section I would also like to list some pros and cons of the STM32MP157C-DK2 board (and the MP1 SoC).

Pros:
  • Very good Yocto support
  • Great documentation and datasheets
  • Many examples and code for the CM4
  • Support for OP-TEE secure RTOS
  • Very good power management (supported in both CM4 and CA7)
  • GbE
  • WiFi & BT 4.1
  • Arduino V3 connector
  • The CM4 firmware can be uploaded during boot (in u-boot) or in Linux
  • DSI display with touch
  • Direct and indirect buffer sharing mode
  • Provides an AI acceleration API (supports also TF-Lite)
  • Great and mature development tools

Cons:
  • The CA7 has a low clock (which, tbh, isn’t necessarily bad; in many cases this is a good thing).
  • You can’t use DSI and HDMI at the same time.
  • IPC is not that fast (in direct mode only)
  • The 40-pin expansion connector is under the LCD, so it’s hard to reach and fit the cables. A 90-degree angled connector would have been better.
  • Generally, because the LCD takes up one whole side of the board, it’s also difficult to use the Arduino connector; you need to rest the board on the LCD side and use the HDMI output. Well, there’s not much they could have done about this, but…
  • The packaging box could have been designed as a leveled base that keeps the Arduino connector accessible, something like the nVidia nano packaging box.
  • The Yocto documentation is lacking and inexperienced developers will find it difficult to get started (but that’s also a generic issue with Yocto, as its own documentation is incomplete and some important features are not documented).
  • No CMake support in the examples. I generally prefer CMake for many reasons, and I like, for example, that NXP provides their SDK with CMake instead of Makefiles (personal preference I guess, but CMake is widely used in most of the projects I’ve worked with).

Overall, this is a great but complicated board and it takes some effort to set up and optimize everything. It’s also quite involved to do a design from scratch in a small team, so if you deviate from the reference design you may need extra time to configure everything properly and bring up your custom board. Although I would be able to handle a full custom design with this board, I know it would take a lot of time to finalize it. Therefore, if you don’t use one of the many available ready-made solutions with the STM32MP1 SoC, keep in mind that a custom solution will take some effort. It’s up to you to weigh the development cost of a full custom design against using one of the many ready-made solutions that can get you started in no time.

I would definitely consider this SoC for applications that need good real-time performance (on the CM4 core) together with the CA7 core for the GUI and business logic. If you wonder what the difference is from a multi-core ARM CPU running at a high frequency, then you can take for granted that, depending on your use case, even a PREEMPT-RT kernel on an 8-core CPU running at 2GHz can’t beat the real-time performance of a 200MHz CM4.

Have fun!

Tensorflow 2.1.0 for microcontrollers benchmarks on Teensy 4.0


It’s been some time since I’ve updated the blog, but I was too busy with other stuff. I’ve updated the meta-allwinner-hx layer a couple of times, and the last major update was moving the layer to the latest dunfell 3.1 version. Also, I’ve ordered an Artillery Genius 3D printer (I felt connected with its name, lol). I’ve wanted to get into 3D printing for many years, but never really had the time. Not that I have time now, but as I’ve realized with many things in life, if you don’t just do it you’ll never find the time for it.

In other news, I’ve also received an STM32MP157C-DK2 from ST to evaluate, and I’d like to focus my next couple of posts on this SBC. I’ve already gotten fairly deep into the documentation and also pushed some PRs to the public Yocto meta layer. Next I’ll write a couple of posts testing some interesting aspects of this SBC and listing its pros and cons.

But before getting deep into the STM32MP1 I wanted to do something else first: test my TensorFlow Lite for microcontrollers template on the i.MX RT1062 MCU, which is used on the Teensy 4.0 and is supposed to be one of the fastest MCUs around.

So, is it really that fast? How does it perform compared to the STM32F746 I used in the last post here?

About Teensy 4.0

Regarding the Teensy 4.0, I really like this board and I definitely love its ecosystem; congrats to Paul Stoffregen for what he has created. He also seems extremely active in every aspect, supporting the forums, the released boards, the middleware and the customers, while at the same time finding the time to design and release new boards quickly.

In one of my latest posts I figured out that the Teensy 4.0 (actually the imxrt1062) doesn’t have internal flash; this happens to me often because I don’t spend time reading all the specs of the boards I buy for evaluation. Therefore, the firmware is stored in an external SPI NOR. There is also a custom bootloader and a host flashing tool that uploads the firmware. Gladly, I also found out that this bootloader doesn’t do any advanced things like preparing the runtime environment or providing custom libs that the firmware depends on, so you can just compile your code using the NXP SDK and CMSIS and upload it to the SPI NOR, avoiding the Arduino framework altogether.

This works great and I’ve also created a CMake template, but I haven’t found the time to finish it properly and pack it in a publishable state. Nevertheless, I’ve used this unpublished template to build the TF-Lite micro code with the Keras model I’m using in all the Machine Learning for Embedded posts. So, in this post I’ll present the results and give you access to the source code to try it yourself.

If you’ve read this far and you’ve also read the previous post, then pause for a moment and think: what results do you expect? Especially compared to the STM32F746, which is a 216MHz MCU, knowing that the i.MX RT1062 runs at 600MHz and is a beast in terms of clock speed. How much of a performance difference would you expect?

Source code

Since I’ve explained most of the useful code in the previous post here, I won’t go into the details again. You can find the source code for the Teensy 4.0 here:


In the repo’s README there’s a thorough explanation of how to build the firmware and how to flash it, so I’ll skip some details here. To build the code you can run this command:

./docker-build.sh "./build.sh"

This builds the code with the default options: the uncompressed model is used and tflite doesn’t use the cmsis-nn acceleration library. To run the above command you need Docker installed on your system; it’s the preferred way to build this code, so you get the same results as me. After running the command you should see this in your console:

[ 99%] Linking CXX executable flexspi_nor_release/imxrt1062-tflite-micro-mnist.elf
Print sizes: imxrt1062-tflite-micro-mnist.hex
   text	   data	    bss	    dec	    hex	filename
 642300	    484	  95376	 738160	  b4370	flexspi_nor_release/imxrt1062-tflite-micro-mnist.elf
[ 99%] Built target imxrt1062-tflite-micro-mnist.elf
Scanning dependencies of target imxrt1062-tflite-micro-mnist.hex
[100%] Generating imxrt1062-tflite-micro-mnist.hex
[100%] Built target imxrt1062-tflite-micro-mnist.hex

This means the code was built without errors and the size of the code is 642300 bytes. Also, cmake creates a HEX file from the ELF, which is needed by the teensy cli flashing tool.

If you want to build the code with the compressed model and the cmsis-nn acceleration library then you need to run this command:

./docker-build.sh "USE_COMP_MODEL=ON USE_CMSIS_NN=ON ./build.sh"

After this command you should see this output:

[ 99%] Linking CXX executable flexspi_nor_release/imxrt1062-tflite-micro-mnist.elf
Print sizes: imxrt1062-tflite-micro-mnist.hex
   text	   data	    bss	    dec	    hex	filename
 387108	    708	 108152	 495968	  79160	flexspi_nor_release/imxrt1062-tflite-micro-mnist.elf
[ 99%] Built target imxrt1062-tflite-micro-mnist.elf
[100%] Generating imxrt1062-tflite-micro-mnist.hex
[100%] Built target imxrt1062-tflite-micro-mnist.hex

As you can see, the firmware is now much smaller because the compressed model is used.

To flash either of those firmwares you need to install the teensy_loader_cli tool, which is located in this repo here.


To see the debug output you need to connect a USB-to-UART module to the Teensy, as I’m not using a VCP interface in this case, for simplicity. I’m using UART2 and the pinmux for the Tx and Rx pins is as follows:

FUNCTION Teensy pin
Tx 14
Rx 15

Then you need to open a terminal on your host (I’m using CuteCom) and send either 1 or 2 over the serial port. You can send just the character or add a newline at the end; it doesn’t matter. The two commands do the following:

Command Description
1 Execute the ViewModel() function to output the details of the model
2 Run the inference using a predefined digit (handwritten number 8)


Now the interesting part. I’ve built the code for the following 4 cases.

Case Description
1 Uncompressed model, without CMSIS-NN @ 600MHz
2 Compressed model, without CMSIS-NN @ 600MHz
3 Uncompressed model, with CMSIS-NN @ 600MHz
4 Compressed model, with CMSIS-NN @ 600MHz

The results are listed in the following table (all numbers are milliseconds):

Layer [1] [2] [3] [4]
DEPTHWISE_CONV_2D 6.30 6.31 6.04 6.05
MAX_POOL_2D 0.863 0.858 0.826 0.828
CONV_2D 171.40 165.73 156.84 150.84
MAX_POOL_2D 0.246 0.247 0.256 0.257
CONV_2D 26.40 25.58 26.60 26.35
FULLY_CONNECTED 3.00 0.759 3.02 1.81
FULLY_CONNECTED 0.066 0.090 0.081 0.098
SOFTMAX 0.035 0.037 0.033 0.035
Total time: 208.4 199.62 193.72 186.29

What do you think about these results? Hmm… OK, let me take the best-case scenario from those results and compare it with the best results I got with the STM32F746 in the previous post.

Layer i.MXRT1062 @ 600MHz STM32F746 @ 288MHz
DEPTHWISE_CONV_2D 6.05 18.68
MAX_POOL_2D 0.828 2.45
CONV_2D 150.84 124.54
MAX_POOL_2D 0.257 0.72
CONV_2D 26.35 17.49
FULLY_CONNECTED 1.81 1.11
FULLY_CONNECTED 0.098 0.02
SOFTMAX 0.035 0.01
Total time: 186.29 165.02


The above results are so confusing to me that I’m starting to think I may be doing something wrong here. I’ve double-checked that all the compiler optimizations and flags are the same as for the STM32F7 and also verified that the imxrt1062 clock frequency is correct. Everything seems fine, yet I get these poor results with the exact same version of TensorFlow.

When I hit such cases and find it difficult to debug further, I either try to make logical assumptions about the problem or I reach for some help (e.g. RTFM). In this case I’ll only make assumptions, because I don’t really have the time to read the whole MCU user manual to figure out if there’s a specific flag that might give better results.

The first thing that comes to mind is, of course, the storage the code runs from. The STM32F746 has fast internal flash with prefetch, which means the CPU only waits a short time for the next instructions to end up in the execution pipeline. But on the imxrt1062 the code is stored in an external SPI NOR, so each instruction first needs to be fetched via SPI before it enters the pipeline, and that takes longer than fetching from internal flash. Hence my theory why the imxrt1062 has worse inference performance, although its core clock is 2x faster than the overclocked STM32F746 @ 288MHz.

So, what do you think? Does that make sense to you? Do you have another suggestion?


To be honest, I was expecting much better results from the Teensy 4.0 (I mean the imxrt1062). I didn’t expect it to be 2x faster, but I did expect a ~1.5x factor; I was wrong in that assumption. My guess is that the lower performance is due to the SPI NOR, which incurs a significant performance hit in this case. I also assume that another MCU with the same core and fast internal flash would perform much better.

So, is the Teensy 4.0 or the imxrt1062 crap? No! Not at all. I see a lot of performance to gain in computationally demanding applications where the data is already in RAM. Also, the linker script for the imxrt1062 is written in a convenient way, so you can easily place specific functions and data in the m_data and m_data2 areas (see source/MIMXRT1062xxxxx_flexspi_nor.ld in the repo). In GCC you can use the compiler’s section attribute to do this, for example (the section name here is illustrative and depends on your linker script):

__attribute__((section(".ram_code")))
void function(void)
{
    /* time-critical code placed in RAM instead of the SPI NOR */
}

Anyway, in this case the imxrt1062 doesn’t seem to perform well; it’s actually slower than the STM32F746, which runs at a much slower clock.

There is a chance, of course, that I’m doing something wrong in my tests, so if I have any updates I’ll post them here.

Have fun!

Tensorflow 2.1.0 for microcontrollers benchmarks on STM32F746


Over 8 months ago I started the Machine Learning for Embedded post series, which begins here. The 3rd post in that series (here) was about using tensorflow lite for microcontrollers on the STM32746NGH6U (STM32F746-disco board). In that post I did some benchmarks, and in the next post I compared the performance with the X-CUBE-AI framework from ST.

At that time the latest TF version was 1.14.0 and tflite-micro was still in an experimental state. The performance was quite bad compared to X-CUBE-AI; CMSIS-DSP and CMSIS-NN were not supported, and neither were optimized or compressed models. When I tried to use an optimized model I got an error that INT8 is not supported by the framework.

Now, after almost a year, TF is at version 2.1.0 (with 2.2.0 around the corner) and tflite-micro is no longer experimental. CMSIS-DSP/NN is now supported by many kernels, as are optimized models. Therefore, I felt it made sense to give it another try.

For these tests I’m using the default Keras MNIST dataset and a fairly large model with many weights. So, let’s see what has changed.

Source code

I’ve updated the repo I used in post 3 and added two new tags: the previous version has the v1.14.0 tag and the new version the v2.1.0 tag. The repo is located here:


Besides the TF version upgrade, I’ve also made some minor changes in the code. The most important one is that I’m now using a much more precise timer to benchmark each layer of the model: the Data Watchpoint and Trace unit (DWT), which is available on most Cortex-M MCUs (Cortex-M3 and above). This provides a very accurate cycle counter, and the code for it is located in `source/src/inc/dwt_tools.h`. To use it properly I had to make some small modifications to tflite-micro, specifically to `source/libs/tensorflow/tensorflow/lite/micro/micro_interpreter.cc`.

This file contains the interpreter’s Invoke() function, which runs the inference by executing each layer of the model in turn. Therefore, I’m using DWT to time the execution of each individual layer and then adding the time to a separate variable. This is the code:

TfLiteStatus MicroInterpreter::Invoke() {
  if (initialization_status_ != kTfLiteOk) {
    error_reporter_->Report("Invoke() called after initialization failed\n");
    return kTfLiteError;
  }

  // Ensure tensors are allocated before the interpreter is invoked to avoid
  // difficult to debug segfaults.
  if (!tensors_allocated_) {
    TF_LITE_ENSURE_OK(&context_, AllocateTensors());
  }

  for (size_t i = 0; i < operators_->size(); ++i) {
    auto* node = &(node_and_registrations_[i].node);
    auto* registration = node_and_registrations_[i].registration;

    /* reset dwt */
    dwt_reset();
    if (registration->invoke) {
      TfLiteStatus invoke_status = registration->invoke(&context_, node);
      if (invoke_status == kTfLiteError) {
        error_reporter_->Report(
            "Node %s (number %d) failed to invoke with status %d",
            OpNameFromRegistration(registration), i, invoke_status);
        return kTfLiteError;
      } else if (invoke_status != kTfLiteOk) {
        return invoke_status;
      }
    }
    /* time the layer that was just executed */
    float time_ms = dwt_cycles_to_float_ms( dwt_get_cycles() );
    glb_inference_time_ms += time_ms;
    printf("%s: %f msec\n", OpNameFromRegistration(registration), time_ms);
  }
  return kTfLiteOk;
}

As you can see, dwt_init() initializes the DWT and dwt_reset() resets the counter; the reset is done inside the for loop that runs each layer. After a layer executes, dwt_get_cycles() returns the number of clock cycles the MCU spent on it, which is converted to msec and printed to the UART. Finally, all the msec values are accumulated in the glb_inference_time_ms variable, which is printed in main.c after the inference is done.

Pretty much everything else is the same, except that I’ve also updated CMSIS-DSP from 1.6.0 to 1.7.0, but I wasn’t expecting any serious performance gain from that change.

Update: I’ve updated the 1.14.0 version so it now uses the exact same versions of all the external libraries. Also, v1.14.0 now has its own branch if you want to test it.

Because of these changes I had to add two additional cmake options which are the following:

USE_CMSIS_NN: enables/disables the cmsis-nn enabled kernels, which are located in `source/libs/tensorflow/tensorflow/lite/micro/kernels/cmsis-nn`. By default this option is OFF, so you need to explicitly enable it in the build script call, as I’ll show later.

USE_COMP_MODEL: selects which model is used. Currently there are two models, one with compressed and one with uncompressed weights. The default is OFF, so the uncompressed model is used; this one is 2.1MB, so it can only be used on MCUs with that much flash.

Both models are just byte arrays containing the serialized flatbuffer of the model: the model configuration (settings and layers) and the weights. This blob is parsed at runtime by the tflite-micro API to run the inference. The uncompressed model is located in the `source/src/inc/model_data_uncompressed.h` header and the compressed one in `source/src/inc/model_data_compressed.h`.

Finally, there’s a hard-coded digit in the flash, the number 2, in the file `source/src/inc/digit.h`. Of course, it doesn’t really matter whether we use an actual digit or a random tensor as input to benchmark the inference, but since the digit is there, I’ll use it.


This firmware supports two UART ports: one for debugging and one for the Jupyter notebook in `jupyter_notebook/MNIST-TensorFlow.ipynb`. UART6 is used for printing debug messages and for sending commands via the terminal; UART7 is used for communicating with the Jupyter notebook and sending/receiving flatbuffers. For this post I won’t use UART7, so I’ll just trigger the inference of the pre-defined digit, which is already in the FLASH, by sending a command via UART6. The STM32F7-discovery board has an Arduino-like header arrangement, so the pinout is the following:

Function Arduino connector Actual pin

The baudrate for this port is 115200 bps and there are only two supported commands:

Command Description
CMD=1 Prints the model input and output (tensor size)
CMD=2 Executes the inference for the default digit.

When the program starts it prints a line similar to this in the UART:

Program started @ 216000000...

As you can see the frequency is displayed when the firmware boots, so it’s easier to verify the clock.

Build the code

To build the code you just need a toolchain and cmake, but to make it easier I’ve added a script in the repo that uses the CDE image I created in the DevOps for Embedded post series. I’m using the same CDE docker image to build all my repos in circleci and gitlab, so you can expect to get the same results as me.

To use the docker-build script you just need to run it like this:


This command builds the uncompressed model, without cmsis-nn support or overclocking, while the next command builds the firmware with the MCU overclocked to 288MHz and the cmsis-nn kernels:


CLEANBUILD and USE_COMP_MODEL are not strictly necessary in the above commands, but I’ve added them for completeness. If you like, you can have a look at the circleci builds here.


Finally, the best part. I’ve run 4 different tests:

Case Description
1 Default kernels @ 216MHz
2 CMSIS-NN kernels @ 216MHz
3 Default kernels @ 288MHz
4 CMSIS-NN kernels @ 288MHz

The results I got are the following (all times are in msec):

Layer [1] [2] [3] [4]
DEPTHWISE_CONV_2D 25.19 24.9 18.89 18.68
MAX_POOL_2D 3.25 3.27 2.44 2.45
CONV_2D 166.25 166.29 124.58 124.54
MAX_POOL_2D 0.956 0.96 0.71 0.72
CONV_2D 23.327 23.32 17.489 17.49
FULLY_CONNECTED 1.48 1.48 1.11 1.11
FULLY_CONNECTED 0.03 0.03 0.02 0.02
SOFTMAX 0.02 0.02 0.017 0.01
Total time: 220.503 220.27 165.256 165.02

Note: In the next tables I don’t specify if the benchmark is run on the compressed or the uncompressed model, because the performance is identical.

So, now let’s do a comparison between the 1.14.0 version I’ve used a few months ago and the new 2.1.0 version. This is the table with the results:

Default kernels @ 216MHz [1]
Layer 1.14.0 2.1.0 diff (msec)
DEPTHWISE_CONV_2D 18.77 25.19 6.42
MAX_POOL_2D 1.99 3.25 1.26
CONV_2D 90.94 166.25 75.31
MAX_POOL_2D 0.56 0.956 0.396
CONV_2D 12.49 23.327 10.837
SOFTMAX 0.01 0.02 0.01
TOTAL TIME= 126.27 220.503 94.233


CMSIS-NN kernels @ 216MHz [2]
Layer 1.14.0 2.1.0 diff (msec)
DEPTHWISE_CONV_2D 18.7 24.9 6.2
MAX_POOL_2D 1.99 3.27 1.28
CONV_2D 91.08 166.29 75.21
MAX_POOL_2D 0.56 0.96 0.4
CONV_2D 12.51 23.32 10.81
SOFTMAX 0.01 0.02 0.01
TOTAL TIME= 126.36 220.27 93.91


Default kernels @ 288MHz [3]
Layer 1.14.0 2.1.0 diff (msec)
DEPTHWISE_CONV_2D 18.77 25.19 6.42
MAX_POOL_2D 1.99 3.25 1.26
CONV_2D 90.94 166.25 75.31
MAX_POOL_2D 0.56 0.956 0.396
CONV_2D 12.49 23.327 10.837
SOFTMAX 0.01 0.02 0.01
TOTAL TIME= 126.27 220.503 94.233


CMSIS-NN kernels @ 288MHz [4]
Layer 1.14.0 2.1.0 diff (msec)
DEPTHWISE_CONV_2D 52 55.05 -3.05
MAX_POOL_2D 5 5.2 -0.2
CONV_2D 550 576.54 -26.54
MAX_POOL_2D 2 1.53 0.47
CONV_2D 81 84.78 -3.78
FULLY_CONNECTED 2 2.27 -0.27
FULLY_CONNECTED 0 0.04 -0.04
SOFTMAX 0 0.02 -0.02
TOTAL TIME= 692 725.43 -33.43


As you can imagine, I’m really struggling with these results, as it seems the performance of version 2.1.0 is somewhat worse, even though more layers now support cmsis-nn.

In version 1.14.0 only depthwise_conv_2d was implemented with cmsis-nn, as you can see here. In the new 2.1.0 stable version more kernels are implemented, as you can see here; now conv_2d and fully_connected are also supported. Nevertheless, the performance seems to be worse…

Initially I thought I was doing something terribly wrong, so I deliberately introduced compiler errors in those files, specifically in the parts where the cmsis-nn functions are used; the compiler did complain, so I was sure that at least the right files were being compiled.

I don’t really have a strong opinion yet on why this is happening, and I’m going to report it just to verify that I’m not doing something wrong. My assumption is that I might have to enable an extra flag for the compiler or the code, and since I don’t know which one that might be, the inference falls back to the default non-optimized kernels.

I’ve also checked that the tensor type in the compressed model is the correct one, by printing the tensor_type in the ConvertTensorType() function in `source/libs/tensorflow/tensorflow/lite/core/api/flatbuffer_conversions.cc`. When the compressed model is loaded I get `kTfLiteFloat16` and `TensorType_INT8` as results, which means the weights are indeed compressed. Therefore, I can’t really say why the performance is so slow…

I’ve also tried different options for the -mfp16-format compiler flag. I’ve tried these two in `source/CMakeLists.txt`:


But none of those made any difference whatsoever.

Another issue I’m seeing is that the compressed model doesn’t return valid results. For example, when I use the uncompressed model I get this result:

Out[0]: 0.000000
Out[1]: 0.000000
Out[2]: 1.000000
Out[3]: 0.000000
Out[4]: 0.000000
Out[5]: 0.000000
Out[6]: 0.000000
Out[7]: 0.000000
Out[8]: 0.000000
Out[9]: 0.000000

This means the inference returns 100% certainty for the correct digit. But when I use the compressed model with the same input I get this output:

Out[0]: 0.093871
Out[1]: 0.100892
Out[2]: 0.099631
Out[3]: 0.106597
Out[4]: 0.099124
Out[5]: 0.096398
Out[6]: 0.099573
Out[7]: 0.101923
Out[8]: 0.103691
Out[9]: 0.098299

This means the inference cannot be trusted, as there’s no definite result.


There’s something going on here… Either I’m missing some implementation detail and need to add an extra build flag, or there’s actually no performance gain in the latest tflite-micro version.

I really hope I’m doing something wrong, because as I’ve mentioned in previous posts I really like the tflite-micro API, as it can be universal for all ARM MCUs with a DSP unit. This means you can write portable code that can be re-used on different MCUs from different vendors. For now it seems that X-CUBE-AI from ST performs far better than tflite-micro, but it’s a vendor-specific, one-way solution.

Let’s see. I’ve posted about this in the issue I opened for the previous post, and I’ll try to figure it out; if there’s any update I’ll post it here.

Have fun!