Machine Learning on Embedded (Part 1)


Note: This post is the first in the series. Here you can find part 2part 3, part 4 and part 5.

Since 2015 I was following the whole machine learning hype closely and after 4 years I can finally say that is mature enough for me to get involved and try to play and experiment with it in the low/mid embedded domain. Although it’s exciting to get immediately involved to new techs, I believe that engineers should just keep an eye on the ones that seem to be valuable in the future or have potential to grow into something that can be used on their domain. But at the same time engineers must be moderate and wait for the “hype” to fade and then get the real valuable information. This is what happened with machine learning in my case. Now I finally feel that it’s the right time to dig in this domain more seriously and that the tools and frameworks are mature and simple to use.

And this bring us to the fact that, it’s different to implement and develop the tools and it’s a different thing to use them to solve a problem. Until now, many engineers worked hard on the development of these tools and now it’s much easier for us to just use them to solve problems. Keras, for example, it’s exactly that. It’s one really mature and beautiful framework to use and now it’s very stable. On the other hand, when you wait for this to happen, then you have a more steep learn curve in front of you, so from time to time it’s good to be updated with what’s going on in the domains that you’re interested.

Anyway, this time I’ve decided to make a completely stupid project to find the limits and the use cases of ML in the embedded world. Well, don’t get me wrong that’s not a research, it’s just evaluating the current status of the tools and how they perform with the current low embedded technologies. Nowadays, when most engineers hear embedded they think of some kind ARM application CPU that runs Linux on a SBC. Well, sure, that’s embedded too and there are many of those SBCs these days and they are really cheap, but embedded is also those dirt cheap 8, 16, 32-bit RISC MCUs  and also the Cortex-M series.

Let’s make some assumptions for the rest of this post. Let’s assume that low embedded is everything that is equal or less than a Cortex-M MCUs and high embedded all the rest application CPUs that can actually run Linux. I know that there also some smaller MMU-less MCUs that can run Linux, but let’s forget about that now. Also from now on I’ll refer to machine and deep learning as ML, just for convenience. Although the terminology on the field is getting standardized I’ll try to keep it simple, even if I’m not using the proper convention in some cases. Otherwise this post will become like those that I was reading my self in the beginning that were really hard to follow. So, although there are some differences, let’s keep it simple. AI, deep learning, machine learning… I’ll just use ML! Also I’ll refer to a single neural as a neural or a node. Finally, I’ll refer to neural network as NN.

This article will be split in 4 or 5 different posts. The first one (this one) will have some very generic information about ML and NN; but not in depth, as this is not the purpose of this post series. Also in this post we’ll implement a very simple NN with a single node with 3 inputs and 1 output and then run some benchmarks in various MCUs and analyze the results.

In the second part  I’ll use the same MCUs but with a bit more complex NN that has the same inputs, but a hidden layer with 32 nodes and 1 output. This NN will be more accurate in its predictions (as we’ll see), compared to the simple NN; but at the same time it will need more processing time to run the forward prediction. Please don’t expect to learn the terminology and details on ML here, but it will be much easier to follow if you already know some basic things around NN.

Excited already? No? Well, don’t forget it’s a stupid project. There’s always a small excitement in doing something useless. So, let’s move on.


Spoiler. For me, one of the most interesting thing in this stupid project was the amount of the different boards that I’ve used to run those benchmarks. I think what I liked most was the fact that I was able to test all these different boards  with the same code. For sure, the stm32f103 (blue-pill) was more optimized as I’ve used my own low level cmake template, but nevertheless I enjoyed having most of my boards running the same neural network code. Well, I didn’t used any of my PSoC 4 & 5, STM8, LPC1110, LPC1768 and a few other boards I have around, but I didn’t have more time to spend on this. Maybe at some later point I’ll add the benchmark for those, too.

STM32F103C8T6 (aka blue-pill)

This is my favorite board and my reference, so it couldn’t miss the party. I’ve run the benchmarks @72MHz and then I’ve overclocked the MCU @128MHz.


This is actually the STM32 F7 discovery board here, which is a cute development board with a lot of peripherals and cool stuff on it, but in this case I’ve only use a serial port and a gpio pin. As I like to overclock the stm32s, I’ve managed to overclock this mcu @295MHz. After that it didn’t work for me.

Arduino Uno

I guess I don’t need to write more about this. Everybody knows it. It runs on a ATmega328p and it’s the slowest MCU in this comparison.

Arduino Leonardo

This is another Arduino variant with the ATmega32 cpu, which is a bit faster than the ATmega328p.

Arduino DUE

This is an arduino board that runs on an Atmel SAM3X8E MCU, which is actually an ARM Cortex-M3 running at 84MHz. Quite fast MCU for its release date back then.

Teensy 3.2

The teensy is a very interesting board. It’s a bit expensive, sure. But it’s almost fully compatible with the Arduino IDE libraries and that makes it ideal for fast prototyping and testing. It’s based on a Cortex-M4 CPU and for the test I’ve used it overclocked @120MHz.

Teensy 3.5

This teensy board is using a Cortex-M4 CPU, too; but it runs on faster clocks. I’ve also used it overclocked @168MHz. The overclocking options for both teensy boards, are coming easy and for free from within the Teensy plugin in the Arduino IDE. I had some issues with one library but nothing difficult to solve. More details in the file on each MCU code folder.


Yep, we all now this board. An L106 32-bit RISC CPU running up to 160MHz.

A simple NN

OK, so let’s now jump to the interesting stuff. Everything that is related to this project and for all the posts are in this bitbucket repo:

Although it’s not the best thing to have all these different things in one repo, it makes more sense as it makes it easier to maintain and update. During this post series I’ll use different parts from this repo, so everything you see there are not only for the this first post.

Before we begin, in case that you want to learn some basics for NN then you can watch these videos in YouTube (1, 2, 3, 4) and also this playlist.

First let’s start with a simple NN. For this post, we’re going to use a single neural with 3 inputs and 1 output. You can see that in the following picture.

In the above image we see the topology of a simple NN. That has 3x inputs and 1x output. I won’t get into the details of the math. In this case the output is simple to calculate and it’s:

y = a0*w0  +  a1*w1 + a2*w2

This is the dot product of a(n) and w(n), where n=1,2,3. Just be aware that a(n) is not a function, it just means a0, a1, a2. The same for w(n). So, a(n) are the inputs and w(n) are the so called weights. You can think that weights are just numbers that their size control the effect that each a(n) has in the output result. The higher the w(n) is the more a(n) affects y.

The output is not y, thought. The output is the sigmoid of y, so:

output = sigmoid(y)

What sigmoid does is that it limits the output between 0 and 1. So the more negative y is then it’s near 0 and the more positive it’s near 1. In the ML world this function is called activation function.

For this project we assume that a(n) is a single binary digit (0 or 1). Therefore, since we have 3 inputs then all the possible combinations are in the following table:

a0 a1 a2
0 0 0
0 0 1
0 1 0
0 1 1
1 0 0
1 0 1
1 1 0
1 1 1

For simplicity, you can think of those inputs as 3 buttons connected to 3 gpio pins on the MCU and their state is either pressedor not pressed. Then depending their state, the output is also a binary value (0 or 1).

Training the model

Training the model means getting a set of inputs that we already know that they produce a specific output and then train the NN according to these. Then we hope/expect that the NN is able to predict the output for unknown inputs that hasn’t been trained on. The training is not done on the target, but it’s done separately on a workstation (or cloud) that has more processing power; and finally only execute the prediction function on the MCU. Although this model is very simple, someone may argue that 2 inputs – 1 output is simpler :p . Although it’s simple enough, we’ll do the training on a workstation as it’s important to use some tools that make the workflow easier.

To do that, is better to use a jupyter notebook to do all the design, training, testing and evaluation. Also Jupyter notebooks are the standard documents that you’ll find in most new github projects. The most simple way to install Jupyter and the other tools we need is using miniconda. I’ll be really quick on this. I’m using Ubuntu, so the following commands are for that OS.

  • Download miniconda from here
  • Install miniconda
  • Create a new environment and install the tools you’ll need
    # Create a new environment
    conda create -n nn-env python
    # Activate the environment
    conda activate nn-env
    # Now install those packages to that environment
    conda install -c conda-forge numpy
    conda install -c conda-forge jupyter
    conda install -c conda-forge scikit-learn
    conda install -c conda-forge tensorflow-gpu
    conda install -c conda-forge keras

    Not all of the above packages needed for this example, but we’ll use them later.

  • Next git clone the repo for this project and run Jupyter.
    git clone
    cd machine-learning-for-embedded
    jupyter notebook &

    If everything goes right, then now you should be able to see the web-interface from Jupyter in your browser. If not, then I guess you need to do some google-fu. In this web interface you would see a folder with the name jupyter_notebooks. Just double click on that and there you’ll find all the notebooks for this project. The one we need for this post is the `Simple python NN.ipynb`. Just click on that.

    What you see in there is some mix of markdown text and python code. For the simple cases of the first two parts we’re going to implement the NN with just python code, without using any advanced library like tensorflow or keras. The reason for this is that we can write code that we can later convert to simple C and run tests on the different MCUs.

Again, I won’t go into the details of Jupyter notebooks and python. I guess there are plenty of tutorials in internet that are much better from any explanation I can provide.

Let’s see the notepad now.

Note: In case you just want to view the notebook and evaluate your results, you don’t have to install Jupyter, but instead you can just view the notebook in the bitbucket repo here.

First we import some functions from numpy to simplify the code. Then we create a NeuralNetwork class that is able to train and evaluate our simple NN. Then we create a training set for our binary inputs. As we’ve seen before, 3 binary inputs have 8 possible combinations and we choose to use a train set of 4 inputs. That means that we’ll train our NN with only 4 out of 8 combinations and then expect the NN to be able to predict the rest by itself. So we train with the 50% of the possible values.

Then we create an array with the 4 inputs and the 4 outputs. After that we initialize the NeuralNetwork class and view the random weights. A note here is that the weights always have random values in the beginning. The meaning of training is to calculate those weights (if you prefer the mathematical approach is to find where the slope of the function, I’ve mentioned before, is minimum or ideally zero). Another note is that when you run this notebook in your browser you may get different values after each training (you shouldn’t but you may). Those values should be similar to the ones in the notebook, but they also might differ a bit. In this case, if you just want to evaluate your results with the C code that runs on the MCUs then have in mind that you may need to change the weights in the MCU code according to your weights. By default, the weights in the C code are the ones that you get in the repo’s notebooks without execute any cells.

Finally, we train the model to calculate the weights and then we evaluate the model with all the possible input combinations. For convenience I’m copying my results here:

[0 0 0] = [0.5]
[0 0 1] = [0.009664]
[0 1 0] = [0.44822538]
[0 1 1] = [0.00786466]
[1 0 0] = [0.99993704]
[1 0 1] = [0.99358931]
[1 1 0] = [0.9999225]
[1 1 1] = [0.99211997]

From the above output we see that for the values that we used during training the predictions are very accurate. This output is from the stm32f203, as you’ll find out all the Arduino compiled code don’t have that floating point precision when you convert the doubles to strings. As I’ve mentioned before in the output we get values from 0 to 1. That’s the prediction of the NN and the closer is to 0 or 1 then the higher is the possibility that the output has that value (because in this example it happens that we have binary output so it’s 0 or 1 anyways). So in case of the training inputs [[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]] we see that the accuracy is much better compared to the unknown inputs like [0 0 0] and [0 1 0]. Especially the first input it’s not actually possible to say if it’s 0 or 1 as it stands right in the middle. Ouch!

Evaluate on the MCUs

Now that we designed, trained and evaluated our model on the Jupyter notepad we’re going to test the NN on different MCUs.

What is important here is not actually if the prediction really works on the MCUs. I mean that’s just code, of course it will work the same way and you’ll get similar results. You results might differ a bit because as we use doubles that may differ from one architecture to other. What is important though, is the performance!

That’s all about we care eventually, right? And that was the main drive for me to create this project. To find out how do those MCUs perform in simple or more complex NNs? Is it possible to run a NN in real-time? Does it even have a meaning to do that on an MCU? What you should expect? Is it worth it? Can those tiny MCUs give a good performance? What are the limits? Is it maybe better to convert a NN problem to algorithmic in order to run it on a MCU? Are nested ifs, lookup tables, Karnaugh maps still a better alternative? And a lot of other questions.

Just be sure that I’m not going to answer all those things here though, as there are a lot of different parameters per project and use case. But by doing this yourself, you should definitely get an idea of the performance, the potentials and the limits that exist with the current technologies.

The evaluation on the MCUs is split in 3 different cases. We have the stm32f103 that has it’s own code folder in the `code-stm32f013` folder. Also the stm32f746 has it’s own code folder (code-stm32f746), as esp8266 and arduino due. For the other arduinos and teensy boards you can use the code-arduinofolder.

Just a parenthesis here. Probably people that read my blog more often, they know that I’m more a baremetal embedded guy. I enjoy doing stuff with more stripped down libraries even CMSIS for the Cortex-M. Although I’m mentioning from time to time that I don’t like using Arduino IDE or  HAL libraries, I’ve also mentioned that I find these very useful for cases like this. Prototyping! Therefore, using those tools for prototyping is an excellent choice and a no-brain decision. We need to choose and use our tools wisely and where they fit best every time. So evaluating a case or project on different HW it always make sense to use those tools for prototyping and then write the proper low embedded code where is needed. OK, enough with this parenthesis.

You’ll find details on how to build and run each code on each MCU in the README files in the project folders.  Hence, I’ll only mention the serial protocol that I’m using and also how it works in the background.

C code

The c code is really simple for this example. The dot product and the sigmoid function are implemented in the neural_network.h/c files and from the main.c file we just call the prediction() functions (which is just the sigmoid(dot()) function). The same .h and .c files are used for all the different codes. Also the weights for this example is the double weights[] array in main.c and the inputs are the double inputs[8][3] array again in the main.c function. For now just ignore the double weights_1[32][3] and double weights_2[] arrays, which are used for part 2.

Finally, also two important functions for this example are the benchmark_neural_network() and test_neural_network(). Both are triggered with commands from the serial port. The test function will just print the prediction for all the input combinations in order to compare them with the jupyter notebook and the benchmark function will run a single prediction and at the same time toggle a pin in order to measure the time the function has taken with an oscilloscope.

Supported serial commands

In order to simplify testing I’ve created a couple of commands. In case of stm32 you can connect to the serial port at 115kbps and for the rest MCUs that use the .ino project you need to connect at 9600 bps (anyway it’s either 9600 or 115200).

The supported commands are the following (all commands expect a newline in the end):


where <mode>: 1 or 2

This command will evaluate all the 8 possible inputs by running the prediction using the calculated weights and will print the output. Then you can compare the output with the output from the jupyter notebook.

Mode 1, is using the simple1 NN and its weights. The simple1 NN is the one we use on this post with 3 inputs and 1 output.

Mode 2, is using the simple2 NN and its weights. The simple2 NN is the one that we use on part 2 with 2 inputs, a hidden layer with 32 nodes and 1 output.

Note: If you run the TEST commands on any arduino build firmware you’ll get a bit disappointed as you for some reason the Serial.print function can only print double values with a 2 decimals. That’s a bit crap. I’ve read that there are some ways to fix this, but that it doesn’t really matter. It only matters that the predictions are correct enough. With stm32 that’s not an issue you will get pretty much the same accuracy with the python output.


where <mode>: 1 or 2 (has the same meaning as before)

This command starts a timer that every 3 seconds will run the prediction function and also toggles a gpio in order to help us to make precision measurements. By default, the prediction is using the first input set [0 0 0]. That doesn’t really matter as it doesn’t affect the computation timing, but you can change it in the code if you like. You can verify that mode 1 is much faster than mode 2, but we’ll have a look at it at the next post.


The STOP commands just stops the timer that was triggered with the START=<mode> command.


First I need to mention that the best way to measure the time that a code needs to run is by using an oscilloscope and a gpio pin. You can set the pin high just before you run your function, then run the function and then set the pin to low. Then by using the oscilloscope you can calculate the exact time the operation lasted.

There’s a catch though! When toggling a pin, that also takes some time and that time is different for different hardware and even gpio libraries for the same hardware. For that reason in the code you’ll find out that every time I’m toggling the pin twice before run the NN prediction function. That way you can measure the time that those two toggles spend and then subtract the average from the time that the prediction operation lasted. So, you measure the time of the two toggles and if that time is Tt then you measure the time between the HIGH and LOW of the prediction function and the total time spend for the predictions will be:

Tp = Thl – (Tt/2)
Tp : Prediction time
Thl: Time of High-Low transition that includes the prediction function
Tt: Time of the two toggles

Anyway, let’s not complicate things more. The above just helps only when the prediction function time is fast or different MCUs have similar time and you want to remove the overhead of any GPIO handling that may differ between different MCUs.

Note: I’ve included all the oscilloscope screenshots in the screenshots folder in the repo. Therefore, you can have a look on the oscilloscope output for each different MCU as I’m not going to post them all here (there are just too many).

Before posting the table of the results, these are the screenshots for the stm32f103 and the Arduino Uno. The name coding in the screenshots folder is <mcu>-<NN topology>-<frequency>-<capture>.png. That means that for the teensy 3.2 the ss for that simple example (simple1) and the pin toggle will be `teensy_3.2-simple1-120MHz-predict.png`. In the next post (part 2) the NN topology will be called simple2.

These are the captures for the toggle and prediction for stm32f103 and arduino uno.

stm32f103 @ 128MHz pin toggle time = 290 nsec

stm32f103 @ 128MHz prediction time = 9.38 μsec

Arduino Uno @ 8MHz pin toggle time = 15.5 μsec

Arduino Uno @ 8MHz prediction time = 114.4 μsec

Although you already get a rough idea, the next table summarizes everything.

MCU Pin toggle time (μsec) Prediction time (μsec)
stm32f103 @ 72MHz 0.512 16.9
stm32f103 @ 128MHz 0.290 9.38
Arduino Uno @ 8MHz 15.5 114.4
Ard. Leonardo @ 16MHz 21 116
Arduino DUE @ 84MHz 8.8 18.9
ESP8266-12E @ 160MHz 1.58 15.64
Teensy 3.2 @ 120MHz 0.830 11.76
Teensy 3.5 @ 168MHz 0.572 8.84
stm32f746 @ 216MHz 0.157 4.86
stm32f746 @ 295MHz 0.115 3.58

As you can see from the above table the higher the frequency the better the performance (o, really?). I haven’t substracted the pin toggle time from the prediction time! Also note that although the Teensy 3.5 has a better performance from the stm32f103@128MHz the pin toggle time is almost the double… That’s because those arduino libraries are implemented on top of bloated functions, even for just enable/disable a pin. Of course, the overclocked stm32f746 @ 295MHz is by far the fastest in all terms.

Also I’ve noticed something irrelevant with the NN. If you see the ratio of the (Prediction time)/(Pin toggle time), then you get some interesting numbers. Let’s see the following table:

MCU (prediction time)/(pin toggle time)
stm32f103 @ 72MHz 33
stm32f103 @ 128MHz 32.34
Arduino Uno @ 8MHz 7.38
Ard. Leonardo @ 16MHz 5.52
Arduino DUE @ 84MHz 2.14
ESP8266-12E @ 160MHz 9.89
Teensy 3.2 @ 120MHz 14.16
Teensy 3.5 @ 168MHz 15.45
stm32f746 @ 295MHz 31.13

The above table shows what you can expect from your hardware and how those bloatware arduino libs hurt the overall performance. To be fair though, the NN code is not affected from the libraries, as it’s plain C code. But normally your MCU will also do other tasks and not only run the NN; therefore, everything else that the cpu does affects the NN performance, especially if the code uses bloated libraries. In this case we just toggle a pin and running a timer in the background, nothing else. Well, not true, the stm32f103 actually runs also a few other stuff in the background, but nevertheless it has the best prediction/toggle ratio. The Arduino DUE has the most weird behavior, which doesn’t make sense, but it was consistent. I didn’t even bother to debug that, though. Anyway, the above table is the reason that sometimes I mention that prototyping is completely different from development. Prototyping is proof of concept, and after that going into the low level will bring you the performance. If you don’t care about performance, then sure pick the tool that suits your needs.


From this example we’ve seen that we can actually design, train, evaluate and test a NN with Jupyter and python and then run the forward prediction function on a small MCU. Isn’t that great? Yeah, I know… Using so much resources on those small MCUs to run a 3-input, 1-output NN deserves the title of the stupid project! Anyway, from the last tables we have some interesting results that you can also interpret as you think.

The most interesting is that we can actually use this NN for real-time applications! OK, don’t laugh. I know that this example it’s useless, but you can! Even the 114.4 usec of the Arduino is ok’ish for fast real-time applications. Of course, it depends on the case and the specs. I mean if you expect you inputs to change faster than that, of course you can’t use it. But think buttons for now! 😛

It’s really fast and even Arduino uno can handle this NN, 100 μsec is really fast. Oh, wait. That bring us on another question. If they are buttons then why not created a nested-if function and handle that much much faster.

Even better, why not create lookup table? Maybe even create a Karnaugh map of the inputs/outputs and reduce that to a couple of logic operations. That would work really really fast!

Well, as I said, this is a very simplified example. I mean, this is just for testing and is not meant to do anything really usable. But on the other hand think that what if instead of 3 inputs we had 128? Or 512? Then it would be really difficult to make a Karnaugh map and simplify it. Or we would need to write a ton of if-else cases. But what would happened if we needed to change something in the input or output sets? Then it would be also quite some work in the code. Maybe the lookup table is still a valid and good solution, though. It will cost RAM or FLASH space, but also the weights of the NN will get a lot of space. So you would need to compare how much space each solution would use and then if the NN needs less space then decide if less space is more important than speed execution.

It’s important to realise that ML doesn’t make better every problem we have, neither it’s a magic tool that solves all our engineering problems. It’s a tool that seems to have the potential to solve some issues that it was very complicated to solve before. And it may apply also to problems that we already have solutions for them, but ML may provide some flexibility we didn’t have before.

In the next post here, will do the same for a bit more complex NN with 3-inputs, a hidden layer with 32 nodes and 1-output.

Until then have fun!

Losing the wagon


This post is not about a stupid-project, but it’s a bit more philosophical and it’s about losing the wagon. Well, life has many wagons, but let’s narrow it to the technological and engineering wagon. This is an engineering blog after all.

The last couple of days I was exploring what’s the current state of the home automation domain and specifically for the KNX. I’ve started developing for the KNX bus back in 2007. The trigger was a friend of mine, who’s an electrical engineer and started talking about this fancy new KNX bus around in 2006-2007 (if I remember correctly) and which derived from the Instabus. He got my attention as he already have made some KNX installations and soon I got involved into it. I was fascinated with it and I wanted to start build stuff around it.

The KNX standard is supposed to be an open standard at the time, but it wasn’t really. Back then there were only few information around it and you needed to buy the specifications (which were expensive). So, I had to do a lot of stuff by my self. The only thing that was available it was the BCUSDK. This project started in 2005, but it was all that I needed. From this code I’ve managed to extract and understand the protocol and most things around it. The details weren’t obvious, of course, because the code wasn’t documented but having some KNX devices to experiment with and the code it was enough to do everything I wanted. Also a friend of mine (also an engineer) got fascinated with it and soon we got our KNX certification and in no time we’ve developed a whole platform around it. This included APIs, GUI interfaces and gateways to many standard protocols used at the time (IP, RS232, RS485) and gateways to GSM modules, GPS, Alarm systems and several other stuff.

Well, it was brilliant at the time and there wasn’t anything like that back in 2007. We could beat any competition. And then… for some reasons we just stopped. I don’t even remember the excuse at the time to be honest. But I know the real reason now. That was around in 2008-2009.

Now, I’ve checked again and the KNX automation domain has completely transformed to a huge market and code-base. Several different APIs for several programming languages exist. Python, C/C++, even a KNX module for Qt. I’ve wrote a KNX module in Qt in 2008 by myself and now I’ve seen that last year there was a new module in Qt for that. After 10 years!

So, I was 10 years ahead than the market. I’ve seen the wagon of this train more that 10 years ago and all its potentials. I’ve developed a whole system around it and I let it die, thus losing the wagon and the train. Now you can find almost everything and many stuff are open source, which is great. There is even a Yocto layer with all those tools included. It’s the meta-calaos.

Trust me, it may seem a bit disappointing to realize that you’ve lost the wagon and see how the market has ended today; and knowing that you were there 10 years ago and just did nothing. But, is it really? So, when this happens the most reasonable thing you need to do is ask yourself, why? Actually, not only a single why but several whys and then when you find the reasons and make some decisions for yourself, even knowing yourself better.

And this is what this post is all about.

Some thoughts

I guess I’m not the only one that had this situation. Some of you know exactly what I’m talking about and already being there at least once. Well, also for me this was not the first time. I had that more that once, but the above case hit me harder as I was pioneer and 10 years earlier than the rest of the market. Well, in my case I know why I’ve failed. The reason is that I’m a “lazy” engineer. I’ll come back to this phrase later.

I’ve seen many engineers in my life. Mostly “not-really-good” engineers, for my standards. Although, in my professional career I’ve been told that I’m a good engineer and I know that I’m capable to do stuff, at the same time, I don’t consider myself a -good- engineer. What is -good- engineer after all? No one should consider himself a -good- engineer. If you do that, then it’s over. Of course, when it comes to the professional aspect then you need to present yourself as a good engineer and it’s easier to do that if others already believe it. But in the end I just consider myself just an engineer. And this is a good and a bad thing at the same time.

Being an engineer is only a part of what you are in your professional career. You’re not only an engineer. You are also a salesman, a manager, a director and a CEO. You’re everything at the same time. At least you become those things after a few years in your domain, it’s the natural evolution which is called experience. But it’s the proportion of these analogies that you have and that makes and drives your professional career. Some people are better managers than engineers. Other might have more “CEO-like” qualities than the rest. So, you can’t have all the qualities in a high level at the same time. You may have one or two, but it’s extremely rare that you have everything. But, is that really a problem?

For example, I’m a “lazy” engineer. Lazy, doesn’t mean that I’m really lazy to do something. Actually it’s the opposite. I can drive myself to finish a project and complete it in the best and most optimal way. But then I need to do something else. I can’t stay on that for a long time. I can’t devote myself to a single project or domain and stay there forever. If I try to do that, then it makes me lazy in the end. I’m getting bored and I start hate what I do. And thus, I’m a “lazy” engineer. Well, at least until now I haven’t find a project or domain that I would like to stay forever.

But being a “lazy” engineer had its flaws. For example, in this case I was 10 years in advance compared to the market and then I got bored. So, I got lazy. Therefore, I had to just drop everything and go to the next challenge. Otherwise, I would doom myself in a situation that I would hate what I’m doing. Maybe some of you can understand this, maybe others don’t. It’s not necessary that every engineer is the same. We have different qualities and proportions of them and that’s fine!

I’ve met engineers that they are not so skilled, but they devoted themselves to an idea and a project and they succeeded to make it their main source of income. Many of those projects and ideas for me were so easy to develop and implement and even boring to even start doing them. They were just too simple for me, from the engineering aspect. But, they were profitable! And some engineers struggled to do something which for me seemed so easy and they made a profit out of it. Others didn’t, though. I believe those who did, were also a bit lucky, but all of them they were devoted and better salesman than engineers. Being able to sale something is more important that be able to build it in the best possible engineering way. The product may have it’s flaws, it may need several iterations until it gets released, it may even released and be a crap from the engineering aspect. But does this matter in the end? If it you make a profit and a business case out of it then it’s successful in mainstream market terms.

You don’t have to be an expert in something to do stuff. I’ve programmed in more than 10  programming languages as a professional. I may be an expert only on 2-3 of them, but it doesn’t really matter. You don’t need to be an expert in any programming language to make something that works and be profitable. Writing code is the most easy thing to do. Does it matter if it’s the best code? If you do it the pythonic, or yoctonic or C++17 way? All the code you ever written in the end is just crap. It’s a mess, unless it’s just a few lines that do a very specific thing. You might though you’ve written the best code 5 years ago and if you see that code today you’ll hate yourself for writing that crap. But, it doesn’t matter. Really. You become an expert in something, more as a “professional skill” that it will make it easier for you to find a better job; but if you want to realize your own ideas and make that a product, then it doesn’t matter if you’re an expert. It never mattered.

Therefore, who’s the successful engineer in the end? The one that managed to devote himself in a product and release it in the market and make a profit, or the one that one that didn’t? The one that is expert in 1-2 things or the one who’s capable in 10+? The one that delivers fast or the one that delivers something robust and well-engineered? The one that sees 10 steps further or the one that can focus on the current step? Don’t try to answer this question too fast.

I think that the success is to be satisfied in what you do and be happy with what you achieved in the end of your work. This sentence is a bit vague though, because what makes you happy now it doesn’t mean that it will make you happy in the future. But do you really know the future? No. So, what is left is what makes you a happy engineer now. And if you’re happy then it probably means that you’re also successful in what you do.

Therefore, making your own best-selling project and profit from your awesome idea is not necessary what will make you happy and a successful engineer. So, first you need to focus and find what makes you happy as an engineer and even more if engineering actually is making you happy at all. Because you might be a very good engineer and not be happy being an engineer. You need to know your assets and your values and what to expect from yourself.

Sure, it would be a great thing to become a successful engineer that will have a profitable idea and make a product out of it. But it’s not really necessary. Is it? It might happen, it might not. Maybe you even say it loud to yourself sometimes, but in the back of your head you don’t really want it or believe it. Because in the end everything comes with a price and you already know it. So, if your idea becomes successful, then you need to devote to it. You need to stop being an engineer and be a salesman, a CEO and whatever comes with it. But certainly not an engineer anymore. You will spend more time in managing things and do stuff unrelated to the engineering domain and you will fade as an engineer. That depends if it’s good or bad. If you like manage things and prefer it more than being an engineer then it’s great! But you also need to be capable with managing things, not just like it. Therefore, you need to know what you want to be and you need to know if you have the proper skills for that and if you don’t try to develop them.

If you know what makes you happy, then do it, but first consider all the consequences and be certain that you can judge yourself and skills right.

For me, being an engineer is not really a job. It’s just a hobby and it’s fun to explore new things and have different challenges. In my job I may don’t have the freedom to do exactly what I want every time, but I’m also doing a lot of stuff in my free time, without a profit. And I’m happy. It’s more like a lifestyle. What you do in your life, should be fun. And I feel lucky that it’s still fun for me. So, I don’t really have regrets about missing opportunities, because all the missing (or not) opportunities brought me to this point today. In the end, the only thing that matters is to know what makes you happy. You don’t have to be the best in something or find the best idea and make a huge profit. All you have to be is happy with what you do.

If you’re lucky enough to be happy with what you do, then you are a successful engineer and no matter what wagons you’ve lost or losing down the path, you’re always on your happiness wagon and do the things that you like. And that’s the best wagon you can ever be in your professional career.

Have fun!

STM32 cmake template (with more cool stuff)


While I’m still waiting for some parts for my next stupid project, I was a bit bored and decided to clean up my STM32 cmake template that I’m usually using for my bluepill projects. I mean, I was pretty happy with it since now and it was working fine, but there’s no better wasted time than doing the same thing again to get the same result and have the illusion that this time is better. So, this deserves a post
to the stupid-projects.

Anyway, while I was about to waste my time on that, I’ve though it would be a nice thing to make it support a few different libraries. Cmake, is something that you either love or hate. I do both. The good thing is that you can achieve the same
result by following a number of different ways. This is a nice thing, but also can be a trouble. The reason is that, if there was only a single valid solution or a way to do create a cmake build then it would be difficult to make errors. You would make a lot of mistakes until make it work, but when it worked, that would be the only right way. On the other hand, if there are many different ways to achieve the same result, then there are good and bad ways. And for some unknown universal law, the chance to choose the worst way is much higher that selecting every other way, good or bad.

Therefore, cmake gets both love and hate from my side. In the end, it’s all about experience. If you do something very often, then after some time you learn to choose the better ways. But if you create a cmake project 1-2 times per year, then then next time you deal with your own CMakeList.txt files and you have to re-learn everything you’ve done, then you realise how many things you’ve done wrong or you could do them better. Also the cmake documentation reminds me a law textbook. There are tons of information in there, but written in a way that stupid people like me can’t understand the documentation and need to read a cmake cookbook or see examples in the internet. Then everything gets clear.


I’m using a lot the standard peripheral library from ST. In general, I hate the monstrous HAL API and the use of C++ when it’s not really needed, but I like CubeMX, because it’s nice to calculate clocks and play around with the pinout. Also, when I’m using the USB on the stm32f103c8t6 (blue-pill), I’m always using the ST’s USB FS Device Driver that is compatible with the standard peripheral library. That combination is beautiful. I’ve found a couple bugs, which I’ve fixed and everything is great. I can say that I couldn’t need anything else than that.

I know that there are plenty people that like the HAL API and using C++ with the STM32 and that’s fine. If you like using that, then keep doing it. For my perspective, is that the HAL API is something that doesn’t provide anything more that the stdperiph library and also there are so many layers of software between the actual API and the CMSIS level, that it doesn’t make sense. For me it’s too much complicated and when it breaks it’s not just open the mcu datasheet and find the error, but you also need to debug all that software layer in between. Sorry, no time for this. Regarding the C++, I’ve wrote a post before here. Generally, there’s no right or wrong. But personally I prefer to write C++ when I’m developing a Qt app or when I really need some things that the language can make my code cleaner, more portable and faster. If it can’t do that, then I see no reason to use it. Also, the libraries are C libraries with C++ wrappers in the headers. That means something. Therefore, I need to be convinced that C++ will actually be better than C for the specific project, otherwise I’ll go with C.

There is also another project that supports the stm32 and plenty of other mcus and it deserves more love. This is the libopencm3 project. That is actually a library that replaces the standard peripheral library from ST. This is a very nice library. It’s low level library and based on CMSIS. It gets updated much more often that the stdperiph and the project is active. For example, I see now that the last update was a few hours ago (that doesn’t mean that it was for stm32f1) and at the same time the last version of the stdperiph was in 2012, so 7 years ago. Also another neat thing with libopencm3 is that everyone can join the project and send commits to add functionality or fix bugs. I’m thinking to commit the stm31f1 overclocking patch I have to clock the stm at 128MHz, but I guess this won’t be accepted as it’s out of specs, but anyway I’ll try, he he. So, yeah libopencm3 deserves more love and I think that sometimes you may also create smaller code.

So I’ve decided to add support to the cmake template also for the libopencm3.

Finally, let’s go to FreeRTOS. I guess, everyone knows what that is and I guess there are a lot of people that love it. Well, I respect rtos. I mean most of my work is done on embedded Linux, so I know about rtos. But still until today, I never, never had to really use an rtos on a small embedded mcu. Until today there was nothing that I couldn’t do using state machines. I’ve also written a very nice and neat lib for state machines and I think I should open-source it at some point. Anyway, I never had to use an rtos on an stm32 or other small mcu, but I guess there are needs that other people have. From my perspective it seems that simplifies things and produces less code and complexity, but on the other hand you loose more important things like full control of the runtime and also there’s a hit in performance. But anyway, it’s fun to have it as an option for prototyping and write small apps while you don’t want to mess with timers and interrupts.

Hence, in this cmake template you get all the above in the same project and you are able to select which libraries to enable by selecting the proper options in the cmake. But let’s have a look. This is the repo here:

After you clone the repo, there is a very interesting file that you should read. It’s supposed to written in a way that is easier to understand, compared to the cmake documentation. Also, another important file is the build.shscript that it handles all the details and runs cmake with the proper options.

So let’s see what those options are. The only thing you need to build the examples is to run the build.shscript with the proper parameters. Inside the build script you’ll find all the supported parameters, but not all of them are needed to be set everytime.

  • TOOLCHAIN_DIR: This points should point to your toolchain path
  • CMAKE_TOOLCHAIN: This points to your cmake toolchain file. This file actually sets up the toolchain to be used. When working with the blue-pill, you wouldn’t need to change that.
  • CLEANBUILD: This parameter is either true or false. When it’s true then the build script will delete the build folder and that means that you’ll get a clean build. Very useful, especially if you’re making changes to your cmake files, in order to remove the cmake cache. By default is false.
  • ECLIPSE_IDE: This is either true or false. If that’s true then the cmake will also create Eclipse project files so you can import the project in Eclipse and use it as an IDE to develop. That’s a handy option because of intellisense. By default is fault because I usually prefer the VS Code.
  • USE_STDPERIPH_DRIVER: This option can be ON or OFF and enables or disables the ST’s standard peripheral driver and the CMSIS lib. By default is set to OFF so you need to explicitly set it to ON during build.
  • USE_STM32_USB_FS_LIB: This option can be ON or OFF and enables or disables the ST’s USB FS Device Driver. By default is set to OFF so you need to explicitly set it to ON during build.
  • USE_LIBOPENCM3: This option can be ON or OFF and enables or disables the libopencm3 library. By default is set to OFF so you need to explicitly set it to ON during build. You can’t have this set to ON at the same time with the USE_STDPERIPH_DRIVER
  • USE_FREERTOS: This option can be ON or OFF and enables or disables the FreeRTOS library. By default is set to OFF so you need to explicitly set it to ON during build.
  • SRC: With this option you can specify the source folder. You may have different source folders with different projects in the source/ folder. For example in this template there are two folders the source/src_stdperiph and the source/src_freertos so you can select which one you want to build, as they have completely different projects and need different libraries.

The two example projects, as you can guess from the names, are for testing the stdperiph and the freertos/libopencm3 libs. To build those two projects you can run these commands:

# stdperiph

# FreeRTOS & LibopenCM3

# Create Eclipse projects files

So, yeah, pretty much that’s it. Easy and handy.


This was a re-write of my cmake template and as cherry on top I’ve decided to add the support for the FreeRTOS and LibopenCM3. I’ll probably use more often the libopencm3 in the future, ot at least evaluate it enough to see how it performs and regarding the FreeRTOS, I think it’s a nice addition for prototyping and use tasks instead of writing code.

Finally, one note here. Be careful when you use the -flto flag in the GCC optimisations, because this completely brakes the FreeRTOS. For example you can build the freertos example and flash it on the stm and you get a 500ms toggling LED, but it you add the -flto flag in the COMPILER_OPTIMISATION parameter in the main CMakeLists.txt file then you’ll find out that the vTaskDelay breaks and the pin toggling very fast.

Have fun!

EMC probe using RTL-SDR


This stupid project is a side project that I’ve done while I was preparing for another stupid project. And as it was a fun process, I’ve decided to create a post for it. Although it’s quite amazing, it’s actually quite useless for most of the home projects and completely useless for professional use. Anyway, in the next project I plan to use a 10MHz OCXO as a time reference. The idea was to use a rubidium reference, but it’s quite expensive and it didn’t fit in the cost expectancy I have for a stupid project. Of course, both OCXO and the Rb are enormous overkill, but this is the point of making stupid projects in the first place.

Anyway, a couple of weeks ago I’ve seen Dave’s (EEVblog) video in youtube here. In this video Dave explains how to build a cheap EMC probe using a rigid coax cable, an LNA and a spectrum analyser. So, he build this $10 probe, but the spectrum analyser he used costs lots of bucks already. He mentions of course in the end that he would at some point make a video by using a SDR instead of a spectrum analyser. Well, in my case I needed to measure how my OCXO behaves. Does it leak the 10MHz or harmonics in different places? And if yes, how can I reduce this RF leakage. So, I’ve decided to build an EMC probe myself. And this is what this post is about.


These are mainly two components that I’ve used for this project.


The SDR dongle I’ve used is the RTL-SDR. There are a few reasons for this decision, but the most important one is that it goes down to 500KHz without any modification or any additional external hardware. Although the tuner used on the RTL-SDR (which is Rafael Micro R820T/2) is rated at 24 – 1766 MHz, the V3 (3rd revision) supports a mode called direct sampling mode (dsm), which is software enabled and can make the reception under the 24MHz possible. This makes also possible to use it as a spectrum analyser to detect the leakage of the 10MHz OCXO. So, definitely buy this specific dongle from here if you’re interested for this range. Just have in mind that it’s not the intended use of dsm to have high accuracy, but it’s perfectly fine for this case.

This is the dongle:

Another reason to buy that specific dongle is to support the RTL-SDR blog the great work that Carl Laufer does (and the great support). In my case, the first dongle I got had a small issue and I’ve just contact Carl and he immediately shipped another unit. Also his book, is excellent for beginners and if you buy the kindle version you get free updates for ever. Great stuff.

Semi-rigid Coax cable

This is the cable that you need to make the probe. Get the 30cm one, so you can make 2 probes. I’ve made a small and big one and the large is twice the size of the small one. This how the cable looks like:

You may also need an antenna cable with an SMA male-female ends. This can be used as an extension to the probe (one of the RTL-SDR options includes that cable). I’ll show you later the two options you have how to use the probe with the rtl-sdr.


Regarding how to make the probe, I think Dave’s video that I’ve linked above, explains everything in great detail; therefore, I will just add my 2 cents from my own experience. Have in mind that those rigid cables are really tough to bend and cut their shielding. I’ve used a thick board marker and a pencil to bend the cables in the two different sizes. Also, in order to cut the shielding and create the gap on the top of the probe (that is explained in the video), you need to make sure that you have a new and sharp blade to your cutter, otherwise it will be really hard and messy. Finally, when you solder the end of the cable to the main cable body to create the loop, first put up you soldering iron temperature (400+ degrees), then put some solder in the tip of the exposed core and use a high temperature resistant glove (or oven glove) to bend, press and hold the end of the cable to touch the rigid shield and solder the tip of the cable to the shielding. Then it will keep its shape and you can have a free hand to finish soldering the rest.

These are my probes after soldering them.

Then I’ve used a black rubber coating paint for insulation on the rigid shield. So this is how the probes are looking now.

These are diameters of each probe.

So, the large one has almost the double radius and can also probe twice the area. It would be also nice if the probe end was a true circle (and not elliptic), but for my usage case it works just fine.

Now that you have the probes and the RTL-SDR, there are two ways to use them. One way is to use a USB extension to plug the dongle and then screw the probe straight on it. The other way is to plug your RTL-SDR dongle on a USB port then use an antenna extension cable to connect the probe to the dongle. The next pictures show the two setups.

Of course, you can also use the second setup with a USB extension. Anyway, I prefer the first more, because that lowers the interference and signal attenuation as there is no a long cable between the antenna and the dongle. Also the case of the dongle acts like a handle and makes it easier to hold it. It might get a bit warm depending the range you use it, but not that much, though.

Next I’ve tested how much current the RTL-SDR dongle draws when is running in the direct sampling mode (dsm) and it seems that’s not that much actually, so in my case was ~110mA.

The above one is the replacement dongle. The first dongle was drawing 200mA in the beginning and then this current started to increase more and more, the dongle was burning hot and then at some point it died. But the replacement is working just fine. This is the only picture of the old dongle that I’ve managed to get while it was still working.

Next thing is to select the software that you’re going to use. Since, I’m limited to Linux, I’ve used the SDR# git repo with the mono project run-time. There are a couple of things that you need to do in Ubuntu 18.04LTS in order to make the rtl-sdr work properly.

From now on RTL-SDR or dongle is referred to the hardware and rtl-sdr to the software.

First you need to clone this repo:

In there you’ll find a file called `rtl-sdr.rules`. Just copy that to your udev rules and reload.

sudo cp rtl-sdr.rules /etc/udev/rules.d/
sudo udevadm control --reload-rules
sudo udevadm trigger

The above just makes sure that your user account has proper permissions to the USB device in /dev and can use libusb to send commands to the dongle. Then you need to blacklist some kernel modules. In Ubuntu, the rtl-sdr has a module driver as this device is meant to be used a generic DVB-T receiver, but in order to use the device with SDR# you need to disable those drivers and use the rtl-sdr user space drivers. To identify if your distro has those default drivers run this command:

find /lib/modules/(uname -r) -type f -name '*.ko' | grep rtl28

In my case I get this output.


So, I had to blacklist the modules. To do that in ubuntu create this file `/etc/modprobe.d/blacklist-rtlsdr.conf` and add these lines in there.


You might have to reset your system to check that this works properly. Now you can build the rtl-sdr repo and test the module. To do that

git clone
cd rtl-sdr/
mkdir build
cd build/
cmake ../

Then the executable is located in the .src/ folder. Have in mind that this step is not really needed to run SDR#, but it’s good to be aware of it in case you need to test your dongle before you proceed with the SDR#. Now, to install SDR# in Ubuntu do this:

sudo apt install mono-complete libportaudio2 librtlsdr0 librtlsdr-dev
git clone
cd sdrsharp
xbuild /p:TargetFrameworkVersion="v4.6" /p:Configuration=Release
cd Release/
ln -s /usr/lib/x86_64-linux-gnu/
ln -s /usr/lib/x86_64-linux-gnu/ librtlsdr.dll
mono SDRSharp.exe

If you have your dongle connected then you can start play with it. For more details on how to use SDR# with the dongle, you better have a look to Carl’s book or have a look to the rtl-sdr blog as there are some posts that have info about it.

In my case I needed to use the direct sampling mode, so I’ve selected the RTL-SDR / USB device and then clicked configure and from the new dialog that opens select the Direct sampling (Q branch)option in the Sampling Mode. This will work only in the V3 of the RTL-SDR dongle! If you following these steps, now close the dialog box and check the Correct IQcheckbox, drag the audio AF Gain to zero (you don’t need audio feedback) and then you’re ready. I like to lower down the contrast slider in the right so the waterfall is a bit more deep blue as the noise is gone. Also I uncheck the Snap to grid.

In case that you have an older version of the RTL-SDR, or another dongle then you need to do a small modification to get it working in the direct sampling mode, but you need to google it, depending your dongle.

Testing the probes

For the tests, I’m going to use a cheap second-hand OCXO that I’ve bought from ebay and it’s made from Isotemp and the model number is OCXO143-141. This is the one:

This seems to be a customised OCXO as there’s no reference for it in the official cmpany site. There is a 143-XXX series but not 143-141 model. Anyway, from the PDF file here, it’s the A-package. I’m not going to go into the details, but it’s an amazing OCXO for the price and you can find it very cheap in ebay. I got mine for 16 EUR. Its frequency output is 10MHz and as I’ve said earlier that’s lower than the minimum supported frequency from most of the SDR dongles, but the RTL-SDR can go down to this frequency when is set to direct sampling mode.

As you can see from the above image, you just need to connect the GND and the +5V and that’s it. Normally, the OCXO needs approx. 5 mins to stabilise, because the heating element needs some time to warm up the case and the crystal inside it to the operating temperature which is 70 ºC. Hence, be aware not to touch it after some time, because it might be very warm. When the OCXO is connected to the power supply then it draws ~500mA current @ 5V for a few seconds, but after that gradually drops and it’s stable at approx 176 mA.

The temperature of the case was around 46 ºC, but with a lot of variation, so I guess that’s because of the reflective case, which makes difficult for my IR temperature gun to measure the temperature precisely.

Ok, so now that everything is set, lets test the probes.

I’ve used the USB extension to connect the dongle to my Ubuntu workstation and I’ve connected the probe on the dongle’s RF input. Then I’ve build the sdrsharp from the git source (I’ve explained the process earlier) and run the GUI. Selected the the RTL-SDR / USB, configured the sample rate to 2.4MSPS and the sampling mode to Direct sampling (Q branch). Then checked the Correct IQand unchecked the Snap to grid and pressed Play. Simple.

This is the capture I’ve made when I’ve put the probe close to the 10MHz output of the OCXO.

So, it works!

After that I’ve used to probe on the PSU cables that was powering the OCXO and I’ve seen that the 10MHz was leaking back to the PSU. Also the leak was not allover the cable but in some spots only. In my use case I’ll use a USB cable to power up the OCXO that will also power the rest circuit; therefore I’ll probably need some ferrite core cable clips, which I’ve already have ordered as I was expected that, but it will take some time to get them from China. Anyways, that’s also a topic for the next project, but I need the probe to at least minimise this leak.

Limitations and comparisons

Now I’ll try to guess how this DIY EMC probe using the RTL-SDR dongle compares to a spectrum analyser with proper EMC probes and then try to test the diy probes with my Rigol DS1054Z, which has an FFT function. The list might not be complete as I don’t know all the specific details and I’m pretty much guessing here.

Professional probe DIY probe
+ It’s calibrated + Very cheap to make
+ Just buy it and plug it in + If you don’t care for accuracy it’s fine
+ Well defined characteristics + Fun to make one
– High price – You need to build it by your self
– No fun making it – Not calibrated
    You can also buy a cheap set of EMC probes on ebay, of course. That costs around 40 EUR  which is that’s not too much if you want to get a bit more serious or want to avoid the hassle to create your own probes. It’s not in my plans to buy something like that, because I don’t need better accuracy or calibrated probes, I just need to be able to see if there’s a frequency there and for that purpose the DIY probe is more than perfect.

OK, so now let’s see how the RTL-SDR dongle compares to a spectrum analyser. Again, these are more guesses as I don’t have a spectrum analyser and I’m not expert in the field.

Spectrum Analyzer DIY probe
+ Wider real-time bandwidth + OK’ish real-time bandwidth
+ Very low noise + OK’ish in terms of noise
+ Tons of options and functions + You can write your own functions using the rtl-sdr API and hack it
+ Wider range (several GHz but for more $$$) + Very, very cheap
– Much more expensive + Portable (can be used even with your phone)
– Less portable – Limited real-time bandwidth
– Can’t hack it – Can’t be used when accuracy is needed

Depending the money you spend you can get very expensive spectrum analysers that include amazing functionality. There are also spectrum analysers for hobbyists and amateurs (like the Rigol DSO700 series), but they cost from 750 (for 500MHz) up to 1100 EUR (for the 1 GHz). That’s a lot of money for a hobbyist! You can spend just 25 EUR for the RTL-SDR dongle and buy something else with the rest…

If I had to choose which is the most significant difference, I would probably say it’s the real-time bandwidth. For example, by using the RTL-SDR dongle you’re limited in approx. 2MHz bandwidth real-time processing of data chunks. With a spectrum analyser the real-time bandwidth can be up to 10-25MHz (or more). I know that the RTL-SDR dongle documentation is clear regarding the real-time bandwidth and it mentions that the max bandwidth is 1.6MHz (due to Nyquist) and the 3.2MSPS or 3.2MHz is by using the I/Q samples. On the other hand, I don’t know if the spectrum analyser specs are referring to Nyquist bandwidth or not. They may do,  I don’t know, but nevertheless is higher bandwidth in any case. Do you really need it though? In that price?

The key point here is that for an amateur (or even professionals sometimes), it doesn’t really matter if what you see in the screen is 100% accurate. Most of the times you just want to see if there’s something there and if there’s a frequency in that bandwidth that pops up.  So with the RTL-SDR dongle you may not get an accurate value of the magnitude of the frequencies that you see and also more noise, but at least you can see that there is a frequency popping up in there.

Comparing with the FFT function of the DS1054Z

In order to compare the RTL-SDR and sdrsharp and the FFT function of the DS1054Z I can’t use the OCXO module as it’s too weak. So in this case, I’ve used the Siglent SDG1025 function generator, which is famous for not being a great performer, but still is just fine for any home project that I’ve done or I may do in the future. I’ve set the output to 10MHz and 1Vpp and then used my ds1054z to capture the output and use the FFT math function and at the same time I’ve used the DIY EMC probe to check what it captures from the SDG1025 cable.

Note: with the probe not connected I’ve an interference at ~9.6MHz, which I don’t know where it comes from, but I have a guess. It’s a lower frequency compared to the internal 28.8MHz TXCO used to clock both the RTL2832U and R820T2. My guess is that it’s a mirror image from the internal 28.8MHz TXCO, because 28.8/9.6 = 3, which is an integer. Btw, also the 48MHz freq of the USB may show an image at 9.6MHz, because 28*2 = 48/9.6 = 5. This is there even without the probe connected, so this makes me think it’s because of the TXCO. Anyway, it’s always a good thing to check first without the probe connected, to see if there are any frequencies already in the spectrum you’re interested. This is what I’ve got without the probe connected:

You see the 9.6MHz in the left side. Here you see that there is also one in 9.490MHz, without the probe connected. I can’t figure out where that comes from, because it’s near the 9.6 it’s a bit weird, but the TXCO is 1ppm, which means that even if it was coming from there 1ppm @ 28.8MHz means 28.8Hz and that means that the worst case is to had am image at 9.599MHz. Maybe there’s some internal PLL in the RTL2832U to get the 48MHz (I couldn’t find the datasheets to figure this out) and it has a large offset? Nah, anyway… It’s just there.

Then the next picture is with probing the 10MHz output of the SDG1025 with the RTL-SDR dongle.

Now you can see that there are quite a few frequencies around the 10MHz. The SDG1025 is set to output a sine! It seems that the sine output of the function generator contains a lot of harmonics that also emit and the probe can capture them. In case of the OCXO I didn’t saw the other spikes although the ouyput is a square wave. Probably because the output of the OCXO was much weaker.

The next picture is the FFT output of the ds1054z.


Here you can see that also the FFT of the Rigol shows that there are other frequencies around the 10MHz sine and also the freq counter shows that the output is not very stable. Probably that’s true as I didn’t wait for a long time for the function generator to settle. Btw, as I have the 10MHz OCXO, maybe I’ll use it as an external reference to the SDG1025. This would make it much, much better in terms of output stability (marked as a future project).

Another thing that deserves mention here is the bandwidth of the FFT on the Rigol. In this case I’ve set the FFT mode from trace to *memory in the FFT menu. That makes the screen update much slower (who cares), but you get a wider bandwidth. Also notice that the visible bandwidth here is 5MHz/Div, so you see actually see 50MHz bandwidth on the screen.

Also, it worth mention that the RTL-SDR has much better resolution compared to the Rigol. That’s expected because the bandwidth is smaller and that’s why you see the several spikes around 10MHz in case of the dongle and only a single lobe on the FFT of the Rigol.

Now have a look at this weird thing:

Here you see that the bandwidth is 100MHz/Div, which I dunno, it doesn’t make sense, since this is a 50MHz oscilloscope (with the 100MHz hack, of course) and the displayed spectrum now is 600MHz. So, yeah…. I don’t know what’s going on in here, maybe I’m missing something, but I just guess that it’s because the FFT is just a mathematical function, so although there’s no real input data on the oscilloscope the display just shows the maths results.


First, I really loved the RTL-SDR dongle. Although I was aware of it for some years now, I never had any excuse (or time) to play with it. I’m lucky though, because now it’s much more mature in both hardware and software terms. The RTL-SDR V3 is awesome, because of the software enabled direct sampling mode that allows it to go below the 24MHz and also the rtl-sdr API is more mature/stable and also supports the direct sampling mode setting, too.

Now that I’ve spent some time with the rtl-sdr and I’ve seen the API, I’m really intrigued to use one of the many SBCs that I have everywhere around my workbench to build a standalone spectrum analyser. I’ve seen that there’s already a project for the beaglebone black which is called ViewRF. This is actually a Qt4 application that uses the rtl-sdr API and then it draws on a canvas without the need of a window system as it uses the QWS (Qt Window System). There are a few problems with that, though. First it’s a quite old Qt version (which is fine), but I prefer to use the meta-qt5 layer to build my custom Yocto distros. For example, I already have the meta-allwinner-hx and meta-nanopi-neo4 layers that I could use to build a custom image for one of the supported boards. Those layers are supporting the Qt5 API, but the QWS was removed in that version and that means that I would need to write a new app to do something like the ViewRF. Anyway, I need to investigate a bit more on this, because there probably others that already implemented those stuff long time ago.

I’m also thinking to write an app for one the WiFi enabled SBCs that will run an http server and serve a GUI using websockets to send the data. I don’t know yet the amount of data that are coming from the dongle, so that might not work. Anyway, so many nice stuff to do and so many ideas, but no much time :/

For now, I just want to finish reading Carl’s book as there’s a ton of interesting information in there and is written in a way that really gets you into it. I may not have the time to do a lot of stuff with the RTL-SDR dongle in the future, but nevertheless I want to learn more about it and have the knowledge in the back of my head for future projects and it’s good to know what it can do with it and how it can be used. Still, the EMC probe with a standalone spectrum analyser is a nice project and I will do it at some point in the future. I’ve also found this interesting Yocto layer here. So many cool stuff, so less time.

Have fun!

NanoPi-Neo4 Yocto meta layer


Embedded Linux & GPU 3D acceleration… Say no more. When you get those words together then hell brakes loose. Well, it depends, though. If you’re using Yocto or Buildroot and choose a BSP that has all the pieces already connected, then you’re good to go. Just build, flash the image and you’re done. It’s even easier if you use a distro like Armbian, Raspbian or others, just download the image, flash it and you’re done. But what happens when you need to build up from the scratch? In this case you need to connect the pieces together yourself and then you realize that it’s a bit more complicated. The Linux graphics stack is a huge pool of buzzwords, abbreviations, tons of code and APIs. It’s really huge. Let’s play with numbers a little. I got the git repo of the linux-stable and I’ve used cloc to extract some numbers.

git clone git://
cd inux-stable
git checkout v5.0.4
cloc .

Cloc, will do its magic and will return some metrics like this:

$ cloc .
   68143 text files.
   67266 unique files.                                          
   19991 files ignored. v 1.74  T=245.01 s (206.9 files/s, 99334.7 lines/s)
Language                             files          blank        comment           code
C                                    26664        2638425        2520654       13228800
C/C++ Header                         19213         499711         920704        3754569
Assembly                              1328          47549         106234         275703
JSON                                   213              0              0         137286
make                                  2442           8986           9604          39369
Bourne Shell                           454           8586           7078          35343
Perl                                    54           5413           4000          27388
Python                                 116           3691           4060          19920
HTML                                     5            656              0           5440
yacc                                     9            697            376           4616
DOS Batch                               69            115              0           3540
PO File                                  5            791            918           3061
lex                                      8            330            321           2004
C++                                      8            290             82           1853
YAML                                    22            316            242           1849
Bourne Again Shell                      51            352            318           1722
awk                                     11            170            155           1386
TeX                                      1            108              3            915
Glade                                    1             58              0            603
NAnt script                              2            143              0            540
Markdown                                 2            133              0            423
Cucumber                                 1             28             49            166
Windows Module Definition                2             14              0            102
m4                                       1             15              1             95
CSS                                      1             27             28             72
XSLT                                     5             13             26             61
vim script                               1              3             12             27
Ruby                                     1              4              0             25
D                                        2              0              0             11
INI                                      1              1              0              6
sed                                      1              2              5              5
SUM:                                 50694        3216627        3574870       17546900

From the above you see that the Linux kernel is composed by 17.5 million text lines from which the 13.2 million are plain C code. Now let’s try the same at the same in the drivers/gpu folder:

$ cloc drivers/gpu
    4219 text files.
    4204 unique files.                                          
     328 files ignored. v 1.74  T=16.51 s (236.9 files/s, 170553.5 lines/s)
Language                     files          blank        comment           code
C/C++ Header                  1616          50271         146151        1320487
C                             2195         190710         166995         936436
Assembly                         2            454            354           1566
make                            98            382            937           1399
SUM:                          3911         241817         314437        2259888

So, the gpu drivers are approx 2.2 million code lines. That’s 12.5% of the whole Linux kernel and that’s a lot.  The whole arch/folder is 1.6 million code lines (including asm lines), which is much less and contains all the supported kernel architectures.

So now that you realize the size of the graphics stack in the Linux kernel, you can safely guess that the deeper you get into the graphics stack, the more complex and difficult things are getting. Well, ok… I mean pretty much all the Linux kernel subsystems are another world, but if you’re an electronic engineer (like I do) then most of subsystems do make sense, but geez… graphics is another beast.

Anyway, this time the stupid project was to make a Yocto layer for the incredible NanoPi-Neo4 board. Why incredible? Well, because it’s an RK3399 board in a very small factor that only costs 50$. Can’t get better that this.

So, this idea was spinning in my head for some time now, but I couldn’t justify that is stupid enough to be promoted to a stupid project. But then I’ve decided to lower my standards and just accept the challenge.



Meet NanoPi-Neo4

You can read the specs of the board here, but I’ll also list the highlights which are:

  • Dual Cortex-A72 + Quad Core Cortex-A53
  • Quad Core Mali T864 (GL|ES 3.1/3.0/2.0)
  • 1GB DDR3
  • GbE
  • WiFi + BT 4.0
  • CSI
  • HDMI
  • 1x USB3.0
  • 1x USB2.0
  • 1x USB2.0 (pin header)

Yep, only 1GB RAM, but come on for testing code on this CPU and do evaluation without pay a lot is more than enough.

LCD Display

Well, it’s all about 3D graphics, so you need an LCD display. I have a small cheap 7″ LCD (1024×600) that I’ve bought from ebay for something like 30 EUR. Although it seems a bit expensive for its specs, on the other hand it has a controller that has HDMI, VGA and RCA output, is powered from a USB port and it has this nice stand.


There isn’t much to say about the project, except that it took some time to connect the pieces. Also, although I’ve finished the yocto meta layer and everything worked fine I’ve realized that I had left a lot of blur points in the back of my head. Sometimes, most of us (engineers) we just connect pieces together and usually we “monkey see, monkey do”, because of reasons (especially time constraint reasons). This might be fine when something works, even on the professional level, but when that happens to me it means sleepless nights, knowing that I’ve done something without having the full knowledge why that eventually worked. So I’ve struggled a bit to find information how things are really connected and working in the lower level. In my quest for the truth, I’ve met Myy in the Armbian forum and I want to thank him here, because he was really helpful to clarify a lot of things. I consider Miouyouyou being a great person not only because his contribution to open source but also for his willing to share his knowledge. Thanks mate!

Pretty much, what I’ve done was to use parts of the Armbian distro that supports the NanoPi-Neo4 (like u-boot and kernel patches), then use some parts from the meta-rockchip meta layer and then make the glue that connects everything together and also add some extra features. Why do this? Well, because the Armbian distro has a much more updated kernel and patches and a boot script setup that I really like and prefer to use a similar one in my layers. On the other hand, the meta-rockchip layer had the recipes for the various components for the RK chips, which are not updated though, so I had to also update those, too.

Although, I’ve created images for console, X11, Wayland, XWayland and Qt, I’ve only tried and tested the Wayland image, so be aware of that, in case you’re here and reading how to use the layer for X11 or Qt. If Wayland works then I guess also the QtWayland would work just fine, too. But sorry, there’s no much free time to test everything.

A few things about the graphics stack

The main components for the graphics support are the Mali blobs and the Mali DRM driver. DRM is the Direct Rendering Manager and it’s a Linux subsystem. It’s actually a Linux kernel API that run in the kernel space and exposes a set of functions that user-space applications can send commands and data to the GPU. This meant to be a generic API that can be used from applications without care what’s the underlying hardware, so the application doesn’t care if you have an nvidia, amd, intel or mali GPU.

So now that  the kernel provide us with the DRM, how do we access it? Here is where the hardware specific libdrm comes into the game. The next image shows where the libdrm stands in the chain (image taken from here).

In the above image you see that the DRM is the kernel subsystem and it provides an API (via ioctl() commands) that any user app can use. Of course, when you write a graphics application, you don’t want to call ioctl functions. Also, there’s another problem. Although there are some common ioctl functions which are generic and hardware Independent, at the same time each vendor supports ioctl functions that are specific to its hardware. And this is where the trouble starts, because now every vendor needs to supply these specific functions, so you need an additional hardware specific driver. Therefore, the libdrm is composed by two different things, the libdrm core, which is hardware independent and the libdrm driver which is specific to the underlying hardware.

In case of RK3399, Rockchip provides this Mali specific libdrm driver that contains those hardware specific functions and you need to build and use this driver. Have in mind that if the vendor doesn’t do regular updates to the libdrm then you might end up in a situation that your window manager (e.g. Wayland) doesn’t support an older API version anymore and that’s really a bad thing, because then you’re stuck to an old version and maybe also in an old kernel, depending the dependencies of the various software layers.

So the conclusion is that libdrm is just a wrapper that simplifies the ioctl calls to the DRM kernel subsystem and provides a C API that is easier to use and also contains all the hardware specific functions.

Now that you have your way to send commands and data to the GPU you can do 3D graphics! Well… sure but it’s not that simple. With the DRM driver you can get a buffer and start drawing stuff, but that doesn’t really mean anything. Why? Because, acceleration. Let’s try to put it simple. I don’t know if my example succeeds to do this, but I’ll try anyways.

Have you ever used any paint software on your computer? If you haven’t stop here, go to extend your skillset by learning MS-Paint and then come back again. Let’s think the most simple drawing  program that you only have a brush of a fixed 1-pixel size, you can select the color and if you press the click button it only paints one pixel, so no drag and draw. Now, you need to create a black fill rectangle, how do you do it if you only have this brush tool? Well, you start paint pixels to create the four sides for the rectangle and then start clicking inside the rectangle on every pixel until you make it all black. How much time did that take? A lot. We’re lucky though, because in the next version there’s a rectangle tool and also you can chose to draw a rectangle that is already color filled. How long did that take? 1-click and just a few milliseconds needed to drag the mouse to select the size. This is the acceleration. It doesn’t matter if it’s 2D or 3D, there are tools that make things faster. Therefore, for 3D graphics there are APIs that accelerate the 3D graphics creation in so many ways that are beyond of my understanding and involve tons of math and algorithms.

So, now it should be clear that although you have the libdrm and a way to have access to a buffer and draw stuff on your display that’s not enough. You need a way to accelerate these graphics creation and this where several tools are coming in to the game. There are plenty of tools (= APIs, libraries). For example there’s the Mesa 3D library. Mesa is actually a set of different 3D acceleration libraries and includes libraries like OpenGL, Direct3D, Vulkan, OpenCL and others. Each of these libraries maybe have other subsets, e.g. the OpenGL|ES in the OpenGL. Pretty much is chaos in there. Tons of code and APIs, it’s a nightmare for the embedded engineers. What all those libraries do, is to accelerate the 3D graphics creation. But how do they do that? This is done in software algorithms, of course, but that’s not enough. If that was all that’s happening then all the GPUs would have the same rendering performance. And not only that, but they would have the same rendering performance with any software renderer. Actually, OpenGL has its own software renderer (which is the default one and is used as a reference) and does all the rendering in pure software aand CPU cycles. And here is the point where the competition between GPU vendors starts. Every vendor implements this acceleration in their silicon, so the GPU can implement let’s say the whole color rectangle creation in hardware with a simple ioctl call. So there is a specific hardware unit in the GPU that you can send a command and do that very fast. This is a very simplified example, of course. But that also means that each vendor implements the acceleration API in a different and hardware specific way. Therefore, every vendor provides it’s own implementation of these acceleration libraries and also the vendors like to provide only the pre-compiled blobs that do this in order to protect their intellectual property.

The Mali GPU that the RK3399 uses is no different from the other vendors, so in order to really use 3D acceleration and support these 3D acceleration libraries you need to get these pre-compiled  blob files from Rockchip and use them in place of the software renderers (like in case of OpenGL). Here is libmali. Libmali is those blob libraries from Mali (and Rockchip) that contain the hardware specific code. What it actually does is that exposes the various API functions of the acceleration libraries to the user and internally it converts those API calls to the GPU specific ioctl calls. So there’s some code in there, which is pre-compiled in order to hide the implementation and you just get the binary. In this case, the libmali supports the OpenGL, OpenCL and GBM all in one, at the same time and in order to use it you need to build mesa, add the build files in your rootfs and then replace some Mesa libraries with the libmali blob. In this case, the same blob exports the API functions for multiple different *.so libraries from Mesa (e.g. libEGL, libEGLES, libOpenCL, libgdm and libwayland). Anyway, you don’t really need to get into the details of those things, you just need to remember that the vendor’s blob library file contains the APIs from  different libraries in the same binary, whichcan replace all those libs by just create a symbolic link of those libraries to the same blob.

I’ll come back to this with an example later, after I describe how to build the images.

Build the image

The meta layer for the NanoPi-Neo4 is located here:

There’s a quite thorough README file in the repo, so please read that first, because I’ll skip this step in here, in order to update only one place regarding the procedure. Pretty much you need to git clone the needed meta layers and then run the script to setup the environment and build the image you want. In this case, I’ll build and test the rk-image-testing for the rk-wayland, so my setup environment command is:

MACHINE=nanopi-neo4 DISTRO=rk-wayland source ./ buil

And to build the image I’ve run:

bitbake rk-image-testing

Hopefully, you won’t get any errors. After that step I’ve flashed the image on an SD card and booted Linux.

Now, let’s see something interesting… Remember that I’ve told you that we’ve replaced some mesa libs with the mali blobs? Let’s see that. In your /usr/libfolder you can find the libraries that mesa builds. These are for example:

  • /usr/lib/
  • /usr/lib/
  • /usr/lib/
  • /usr/lib/
  • /usr/lib/
  • /usr/lib/

All these are different libraries that target a different API. But see what happens next.

root:~# ll /usr/lib/ | grep
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
-rwxr-xr-x  1 root root 26306696 Mar 23  2019*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*
lrwxrwxrwx  1 root root       10 Mar 23  2019 ->*

You see here? All the above libs are a symbolic link to the blob that the `meta-nanopi-neo4/recipes-graphics/libgles/rockchip-mali_rk.bbappend` yocto recipe added into the image. Furthermore, you can list the symbol table (= API calls) that this blob exports and get all the different API functions from all the different acceleration libraries. To do that, run this command:

readelf -s /usr/lib/

This will return the symbol table of the, which is a huge list of API calls. There you will find gl*, egl_*, gbm_* and cl* calls, all mixed up together in the same libary. So the blob is a buffed up library that knows how to handle all these library calls, but the implementation is hidden in the binary blob and it’s specific for that GPU hardware and family.

So to wrap up things, the is the super-lib that provides the 3D acceleration APIs (and hides the implementation) and that lib is able to send commands and data to the Mali specific libdrm driver, which provides the connection between the kernel DRM subsystem and the user-space. The result of the libdrm driver a DRI device which is located in /dev/dri

root:~# ll /dev/dri/card0
crw-rw---- 1 root video 226, 0 Mar 22 20:31 /dev/dri/card0

So, any user-space application that needs to use the GPU, has to open the /dev/dri/card0device and start sending commands and data. Therefore, if an application uses 3D acceleration, needs to make calls in the renderer API ( to create the graphics on a buffer and then send the result to the /dev/dri/card0using the proper DRM api.

This is how the most important things are connected Linux graphics stack in case of RK3399. Of course, there are a lot of other different implementations in the stack in the various layers and other vendors may have a different approach and APIs. For example, there is the FBdev, Xorg e.t.c. If you like buzzwords, abbreviations and complexity then the Linux graphics stack will be your best friend. If you just want to scratch the surface and be able to do simple stuff like setting up an image that supports 3D acceleration, I think the above things are enough. This provides the minimum knowledge (in my opinion) that you need to setup the graphics for a BSP. You can of course, just copy paste stuff from other implementations and “monkey see, monkey do”, this will probably also work, but this won’t help you much if something brakes and you need to debug.

One thing that I left out is that from all the above things there’s also another important piece missing and this is the hardware specific DRM driver. Although the DRM is part of the kernel is not really generic, only a part of this driver is generic. Hence Mali has it’s own driver. This driver in the kernel (for Mali and Rockchip) is the `midgard_kbase` and is located in drivers/gpu/arm/midgard/. In this folder in mali_kbase_core_linux.cfile you’ll find the `kbase_dt_ids` structure which is this one here:

static const struct of_device_id kbase_dt_ids[] = {
    { .compatible = "arm,malit7xx" },
    { .compatible = "arm,mali-midgard" },
    { /* sentinel */ }
MODULE_DEVICE_TABLE(of, kbase_dt_ids);

Also in arch/arm64/boot/dts/rockchip/rk3399.dti, which is included from the NanoPi-Neo4 device tree you’ll find this entry here:

gpu: gpu@ff9a0000 {
    compatible = "arm,malit860",

This means that during the kernel boot when the device-tree is parsed the kernel will find this device-tree entry and will load the proper driver. You can also retrieve the version like this:

root:~# cat /sys/module/midgard_kbase/version
r18p0-01rel0 (UK version 10.6)


To test the 3D graphics acceleration I’ve used the glmark2-es2-drm tool. First I’ve run it like that to enjoy the benchmark on the screen.

root:~# glmark2-es2-drm

But this will limit the framerate to the vsync and you’ll only get 60fps. In order to test the raw performance of the GPU you need to run the benchmark and render to an off-screen surface with this command:

glmark2-es2-drm --off-screen

By running the previous command this is the output that I’m getting.

root:~# glmark2-es2-drm --off-screen
    glmark2 2017.07
    OpenGL Information
    GL_VENDOR:     ARM
    GL_RENDERER:   Mali-T860
    GL_VERSION:    OpenGL ES 3.2 v1.r14p0-01rel0-git(966ed26).1adba2a645140567eac3a1adfc8dc25d
[build] use-vbo=false: FPS: 118 FrameTime: 8.475 ms
[build] use-vbo=true: FPS: 134 FrameTime: 7.463 ms
[texture] texture-filter=nearest: FPS: 145 FrameTime: 6.897 ms
[texture] texture-filter=linear: FPS: 144 FrameTime: 6.944 ms
[texture] texture-filter=mipmap: FPS: 143 FrameTime: 6.993 ms
[shading] shading=gouraud: FPS: 122 FrameTime: 8.197 ms
[shading] shading=blinn-phong-inf: FPS: 114 FrameTime: 8.772 ms
[shading] shading=phong: FPS: 101 FrameTime: 9.901 ms
[shading] shading=cel: FPS: 98 FrameTime: 10.204 ms
[bump] bump-render=high-poly: FPS: 90 FrameTime: 11.111 ms
[bump] bump-render=normals: FPS: 125 FrameTime: 8.000 ms
[bump] bump-render=height: FPS: 125 FrameTime: 8.000 ms
libpng warning: iCCP: known incorrect sRGB profile
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 57 FrameTime: 17.544 ms
libpng warning: iCCP: known incorrect sRGB profile
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 22 FrameTime: 45.455 ms
[pulsar] light=false:quads=5:texture=false: FPS: 138 FrameTime: 7.246 ms
libpng warning: iCCP: known incorrect sRGB profile
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 25 FrameTime: 40.000 ms
libpng warning: iCCP: known incorrect sRGB profile
[desktop] effect=shadow:windows=4: FPS: 107 FrameTime: 9.346 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 35 FrameTime: 28.571 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 35 FrameTime: 28.571 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 38 FrameTime: 26.316 ms
[ideas] speed=duration: FPS: 68 FrameTime: 14.706 ms
[jellyfish] <default>: FPS: 75 FrameTime: 13.333 ms
[terrain] <default>: FPS: 5 FrameTime: 200.000 ms
[shadow] <default>: FPS: 50 FrameTime: 20.000 ms
[refract] <default>: FPS: 28 FrameTime: 35.714 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 132 FrameTime: 7.576 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 73 FrameTime: 13.699 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 131 FrameTime: 7.634 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 97 FrameTime: 10.309 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 65 FrameTime: 15.385 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 97 FrameTime: 10.309 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 96 FrameTime: 10.417 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 71 FrameTime: 14.085 ms
                                  glmark2 Score: 88

The important thing from the above output is that the GL_VENDOR, GL_RENDERER and GL_VERSION are the expected. So the Mali-T860 GPU does the rendering and the version if the OpenGL|ES 3.2 (and the driver version is r14p0). This is great, we have all the greatest and latest stuff (the date that this post is written) and we’re ready to use the 3D hardware acceleration.

This is also a small video with the glmark2 rendering to the LCD screen.


Well, that was an interesting project. It started with just creating a Yocto meta layer for the NanoPi-Neo4 with 3D support. That actually worked quite fast, but then I’ve realized that although I had some background on the Linux graphics stack, I wasn’t sure why it worked and then I’ve realized that it was a bit more complex and start getting deeper. I also got a lot of valuable input from Myy, who was kind to share a lot of insights on the subject and without his explanations it would take me much more time to unravel this.

I hope this Yocto is layer is useful, but also have in mind that is not fully tested and it might not get very frequent updates as I’m also maintaining other stuff in my (much less now) free time.

I promise the next project will be really stupid. I already know what it will be.

Have fun!

Overclocking the stm32f303k8 (Nucleo-32 board)


Note: because I had to recover the db and there wasn’t a backup of this post, the oscilloscope screenshots and the comments were lost 🙁

Few weeks ago I’ve done a stupid project here in which I’ve designed a small deb board that is similar to the blue-pill board (which uses the stm32f103c8t6). The difference is that I’ve used the stm32f373c8t6. In the meantime, I’ve delayed the PCB order; because I wanted to do some changes, like use 0603 components and remove the extra pad areas for hand-soldering. And before do those changes, then one day BOOM! I’ve seen this NUCLEO-F303K8 board, that looks like the blue-pill and said, ok, lets order this and screw the custom board, who cares. Therefore, as a proper engineer, I’ve order two boards (one is never enough) without even read the details and after a couple of days these boards arrived.

Double win; I thought. It shipped faster and I don’t have to do any soldering. So, I got home and opened the packaging and BOOM! WTF is this?!? Why does it have 2 mcus? Where is the connector with the SWDIO and SWCLK pins to flash it with my st-link? Why there’s also an stm32f103c8t6 on the back side of the PCB? What’s going on here?

Oh, wait… Oh, nooo… And then I’ve realized all the reasons why you should always read the text around the pictures before you buy something. Of course, I’ve spontaneously forgiven myself, so I can repeat the same mistake the next time. After reading the manual (always RTFM), I’ve realized that the USB port is not connected on the stm32f303 and the stm32f103 was the on board st-link and I’ve also realized the reason that the board was a bit more expensive than I expected (though Arrow had a lower price compared to others). I’ve also realized that the stm32f303 was the small brother of the family, so no USB anyways, no much RAM and several other things. But, it still has all the other goodies like 2x DACs, opamp, ADCs, I2C, SPI e.t.c. Enough goodies for my happiness meter not to drop that much.


After the initial sock, I said to myself, ok let’s do something simple and stupid at least, just to see that it works. I’ve plugged the USB cable, the leds were flickering, cool. Then I’ve seen this ARM mbed enabled thing on the case and although I was sure that it would be something silly, I wanted to try it. So, after spending ~30mins I found out that I’m not the only one that does stupid things. ARM elevates that domain to a very high standard, which is difficult to compete. I’ve tried mbed before, but that was a lot of years ago when it started and it was meh, but I didn’t expect it to be still so much.. meh. I mean, who the heck is going to develop on an online IDE and compiler and do development on libs that can’t have access to fix bugs or hack? What a mess. Sorry ARM, but I really can’t understand who’s using this thing. This is not a “professional” tool and Hobbyists have much better tools like the various Whateverduinos. So after loosing 30 mins from my life, I’ve decided to make one of my standard cmake templates. Simple. But… I needed also to do something stupid on top, so I’ve decided to add in my template a way to overclock the stm32f303. Neat.

And yes… In this project you’ll get an stm32f303k8 overclocked up to 128MHz (if you’re lucky). And it seems that is working just fine, at least both of my boards worked fine with the overclock.


You can download the source code from here:

This template is quite simple, just a blinking LED and a USART port @115200 baud-rate. You can use this as a cmake template to build your own stuff on top of it. So, let’s have a look in the main()function to explain what is going on there.

int main(void)
    /* Initialize clock, enable safe clocks */
    sysclk_init(SYSCLK_SOURCE_INTERNAL, SYSCLK_72MHz, 1);
    RCC_ClocksTypeDef clocks;
    /* Get system frequency */
    SystemCoreClock = clocks.SYSCLK_Frequency;
    if (SysTick_Config(SystemCoreClock / 1000)) {
        /* Capture error */
        while (1);
    /* setup uart port */
    /* set callback for uart rx */
    dbg_uart.fp_dev_uart_cb = uart_rx_parser;
    dev_timer_add((void*) &dbg_uart, 5, (void*) &dev_uart_update, &dev_timer_list);
    /* Initialize led module */
    /* Attach led module to a timer */
    dev_timer_add((void*) &led_module, led_module.tick_ms, (void*) &dev_led_update, &dev_timer_list);
    /* Add a new led to the led module */
    dev_led_set_pattern(&led_status, 0b00001111);
    printf("Program started\n");
    printf("Freq: %lu\n", clocks.SYSCLK_Frequency);
    while(1) {
        GPIOB->ODR ^= GPIO_Pin_4;

The first function is the one that sets the system clock. The board doesn’t have an external XTAL, so the internal oscillator is used. The default frequency is 72MHz, which is the maximum officialfrequency. In order to overclock to 128MHz, you need to call the same function with the SYSCLK_128MHz parameter. Also, the first parameter of the function is selecting the clock source (I’ve only tested the internal as there’s no XTAL on this board) and the last parameter for using safe clocks for the other peripheral buses when using frequencies more that 64MHz. Therefore, when overclocking to 128MHz (or even 72MHz) if you want to be sure that you don’t exceed the official maximum frequencies for the PCLK1 and PCLK2 then set this flag to 1. Otherwise, everything will be overclocked. I live dangerously, so I’m always leaving this to 0. If that mcu had a USB device controller then when oveclocking it can’t succeed the frequency that is needed for the USB, so you need to stay at 72MHz max (only if you have a USB device, but the 303 doesn’t have anyways).

After the clock setting, you need to update the SystemCoreClock value with the new one and you can call the print_clocks(&clocks)function to display all the new clock values and verify that are correct. For example, in my case with the sys clock set to 128MHz I get this output from the com port:

Freq: 128000000
  HCLK: 128000000
  SYSCLK: 128000000
  PCLK1: 64000000
  PCLK2: 128000000
  HRTIM1CLK: 128000000
  ADC12CLK: 128000000
  ADC34CLK: 128000000
  USART1CLK: 64000000
  USART2CLK: 64000000
  USART3CLK: 64000000
  UART4CLK: 64000000
  UART5CLK: 64000000
  I2C1CLK: 8000000
  I2C2CLK: 8000000
  I2C3CLK: 8000000
  TIM1CLK: 128000000
  TIM2CLK: 0
  TIM3CLK: 0
  TIM8CLK: 128000000
  TIM15CLK: 128000000
  TIM16CLK: 128000000
  TIM17CLK: 536873712
  TIM20CLK: 128000000

I’m using the CuteCom for the serial port terminal, not only because I’m a contributor, but also these macros plugin I’ve added is very useful for doing development with mcus and UARTs.

On the above output, don’t mind the TIM17CLKvalue, there’s no such timer anyways.

To do some benchmarks with the different speeds, then you need to uncomment the #define BENCHMARK_MODEline in the main.cfile. By doing this the D12 pin (PB.4) will just toggle inside the main loop. So… it’s not really a benchmark, but still is something that is affected a lot by the system clock frequency.

The rest of the lines in the main()are just for setting up the LED and the UART module. I’ve developed those modules initially for the stm32f103, but it was easy to port them for the stm32f303.

One very important note though! In order to make the clock speeds to work properly, I had to make a change inside the official standard peripheral library. More specific in the `source/libs/StdPeriph_Driver/src/stm32f30x_rcc.c` file in line 876, I had to do this:

        /* HSI oscillator clock divided by 2 selected as PLL clock entry */
//        pllclk = (HSI_VALUE >> 1) * pllmull;
        pllclk = HSI_VALUE * pllmull;

You see, by default the library divides the HSI_VALUE by 2 (shifts 1 right), which is wrong (?). They probably do this because they enforce that way the /2 division of the HSI and do all the clock calculations depending on that. Therefore, if you overclock ,then all the calculated clock values for the other peripherals will be wrong because they are based on the assumption that the HSI value is always divided by 2. But that’s not true. Therefore, the baud rate of the USART will be wrong and although you set it to 115200bps, actually it will be 230400bps. If you want to use the PLL with the HSI as an input clock then you need a way to do fix this. Therefore, I’ve changed this line in the standard peripher library, so have that in mind in case you’re porting code to other projects. Also, have a look in the README.mdfile in the repo for additional info.


Finally! Let’s see some numbers. Again, this benchmark is stupid, just a toggling pin and measuring how the toggling frequency is affected from the system clock frequency. I’m adding all the images on the same collection so you can click on them and scroll.

NoteThe code is build with the -O3optimization flag.

So, in the above images you can see the frequency of the toggling pin for each system clock, but let’s add them on this table, too.

System clock (MHz) Toggling speed (MHz)
16 1.33
32 2.66
64 3.55
72 3.98
128 7.09

Indeed the toggling frequency scales with the system clock, which is expected. Isn’t it? Therefore, at the maximum system clock of 128MHz the bit-banging speed goes up to 7.09MHz, which is ok’ish and definitely not impressive at all. But anyway it is what it is. At least it scales properly with the frequency.

And if you think that this is slow, then you should think that most of the people develop with the debug flags enabled in order to use the debugger and sometimes they forget that they shouldn’t use it for development unless is really needed or there’s a specific problem that you need to figure out and you can’t by adding traces in you code. So let’s see the toggling speed at 125MHz with the debug flags on.

Ouch! Do you see that? 2.45MHz @ 125MHz system clock. That’s a huge impact in the speed; therefore, try not using debuggers when they are not essential to solve problems. From my experience, I’ve used a debugger maybe 5-6 times the last 15 years. My debugger was always a toggling pin and a trace with the uart port. Yeah, I know, that might take more time to debug something, but usually it’s not something that you do a lot. This time, I had to use it though, because in the standard peripheral library for this mcu, they’ve changed the interrupt names and because of that the UART interrupts were sending the code to the `LoopForever` asm in the source/libs/startup/startup_stm32.s. By using the debugger I’ve just seen that the interrupt was sending the code in there and then in the same file I’ve seen that the interrupt names for the USART1 and 2, were different compared to the stm32f103. That happened because I’ve ported my dev_uartlibrary from the stm32f103 to the stm32f303.


Before you buy something, always RTFM! That will save you some time and money. Nevertheless, I’m not completely disappointed for ordering this module. I’m more disappointed that I’ve seen that mbed is what it is. That’s sad. I think the time that some brilliant guys spend for this and I can’t really understand why. Anyway…

On the other hand, those mcus seem to be behave just fine at 125MHz. I’ve tried both modules and both were working fine. I guess I’ll do something with them in the near future. The on-board st-link is also (I guess) a cool feature, because you have everything you need on one module, but I would prefer if I just had the SW pins to use an external st-link and buy more modules in the half price (or even less). If anyone is thinking buying one of those, then they seem to be ok’ish.

Have fun!

Linux and the I2C and SPI interfaces (part 2)


Note: because I had to recover the db and there wasn’t a backup of this post, the images are only thumbnails and the comments were lost 🙁

In the previous post stupid project I’ve implemented a very simple I2C and SPI device with an Arduino that was interfaced with a raspberry pi and there I’ve done some tests by using various ways to communicate with the Arduino. There I’ve pretty much showed that for those two buses it doesn’t really matter if you write a kernel driver or you just use the user space to access the devices. Well, if you think about it, spidev is also a kernel driver and also when you access the I2C from the user-space there’s still a kernel driver that does the work. So writing a driver for a specific subsystem instead of just interface them with a custom user-space tool has its uses, but it’s also not necessary and needs some consideration before you decide to go to either way.

Still, there was something interesting missing from that stupid project. What about speed and performance? What are your options? How these options affect the general performance? Is there a magic solution for all the cases? Well, with this stupid project I will try to fail to answer those questions and make it even more worse to give any valuable information and answers to these questions.



For the last stupid project I’ve used a raspberry pi, but I’ve also provided the sources for you to use also the nanopi neo. This time I won’t be that kind so I’ll only use the nanopi neo. The reason is that I like its small format and it’s quite capable for the task and also I didn’t want to use a powerful board for this. So, it’s a low-mid tier board and also dirty cheap.


Last time I’ve used the Arduino nano. Well, nano it’s an excellent board to implement an I2C/SPI slave device in… minutes. But, it lacks performance. So, it’s completely incapable to stress out the nanopi neo. Therefore, we need something much faster and here comes the STM32F103 as it has all the good stuff in there. 72MHz (or up to 128MHz if you overclock it, which I did, of course), DMA for both the I2C and SPI and also very, very, very cheap (I’m talking about the famous blue-pill). Therefore, I’ve implemented an I2C and SPI slave that both use DMA for fast data transfers.


The rest of the components are exactly the same with the previous stupid project. So we have a whateverphoto-resistor (it doesn’t really matter) and a whatever LED.


This stupid project is focused actually on the Linux kernel. As everyone learns now in school the kernel comes in two main flavors, the SMP and PREEMPT-RT kernel. The first in the normal mainline kernel and the second one is the real-time patched version of the mainline. I won’t get into the details, but just to simplify, the main difference is that the PREEMPT-RT kernel, actually guarantees that any process that runs in the CPU will get a fair and predictable time of execution, which minimizes the latency of the application. Oversimplified, but that’s not a post about the Linux kernel.

Therefore, what happens if you have a couple of fastdevices that you want to interface under various conditions, like the CPU has a low or heavy background load? To find this out we actually need a fast slave, therefore the stm32f103 is just right for that, as we’ve seen in this stupid project that the SPI can achieve up to 63MHz by using DMA, which is way faster that the Arduino nano (and probably even the nanopi neo actually). So, by assuring that the slave device won’t be our bottleneck were good to go. Here you’ll find the repo for the project:

In order to build the nanopi neo image for both SMP and RT kernel you need Yocto, but again I won’t get into the details on that (you can read the file in the repo for more detailed info). Therefore, to switch between the SMP and RT kernel you need to use either of the following combinations in the build/conf/local.conf file:

PREFERRED_PROVIDER_virtual/kernel = "linux-stable"
PREFERRED_VERSION_linux-stable = "4.19%"

Or for the PREEMPT-RT:

PREFERRED_PROVIDER_virtual/kernel = "linux-stable-rt"
PREFERRED_VERSION_linux-stable-rt = "4.19%"

Also you should build the `​arduino-test-image` like the previous project (it’s the same image actually).

So, now let’s go to some fancy stuff. In the repo there is a tool in the linux-appfolder. You need to build this with the Yocto SDK or any other arm toolchain. This tool actually opens the i2c and spi devices and reads/writes data from them in a loop according to the options you’re passing in the command.

To make it a bit different, compared to the previous project, this time the SPI slave is the photo-resistor ADC and the I2C slave is the PWM LED (it was the opposite in the previous). Anyway, that doesn’t really matters, you can change that in the source code of the stm32f103 which is also available in the repo and you need also to build that and flash it on the mcu. Pretty much if you read the previous README file, it’s quite the same thing.


I’ve performed the benchmarks with the stm32f103 running on 72MHz and 128MHz, too; but there wasn’t any difference at all really, as I’ve limited the SPI bandwidth to 30MHz. The reason for that was actually the cabling that causing a lot of errors above that frequency and it seems that the problem was the nanopi neo and not the stm32f103. Still, the results are interesting and I was able to get valuable information.

I’ve performed various tests. First with two different kernels, SMP and PREEMPT-RT. Then for each kernel I’ve tested a range of SPI frequencies (50KHz, 1MHz, 2MHz, 5MHz, 10MHz, 20MHZ, 30MHz). The provided tool does that automatically, actually. Then for all the above cases I’ve tested the kernel with no load, then with a light load and then with heavy load. The light load was, guess what? printf of course. Well, printf might sound silly, but in a while loop does the trick because the uart buffer fills up pretty quickly and then the kernel will have to flush the buffer and send the data. For the heavy load I’ve just used the Linux stress tool. I’ve also included a calc file in the repo with my results.

So let’s get to the fun stuff. No. Before that I want also to say that there were two kind of benchmarks the first was a pin on the stm32f103 which was toggling every time a full read/write cycle was performed. That means that the Linux app was reading the ADC value of the photoresistor from the stm32f013, then it was writing that value to the I2C PWM LED on the stm32f103. Every time this cycle was performed a pin from the stm32f103 is toggling state. Therefore by measuring the time of a single pulse you actually get the time of the cycle.

Before proceed these are the kernel versions for the SMP and PREEMPT-RT kernels:

Linux nanopi-neo 4.19.91-allwinner #1 SMP ... 15:26:48 UTC 2019 armv7l GNU/Linux
Linux nanopi-neo 4.19.82-rt30-allwinner #1 SMP PREEMPT RT ... 20:12:29 UTC 2019 armv7l GNU/Linux

Let’s see the first three images. These are some oscilloscope probings with the SMP kernel and the SPI frequency at 500KHz, which is a very low frequency.

The first image is a zoom in on the stm32f103’s toggle pin. As I’ve said two toggles (or a pulse if you prefer) is a full read/write cycle, which means that in this time a 16-bit word on the SPI and a 3-bytes on the I2C are transferred. Because the I2C is much slower (100KHz) it has a strong affect on the speed compared to the SPI, but we try to emulate a real-life scenario here. In this case, the average cycle time is 475 μsecs (you see that the average is calculated for both low and high pulse).

The second and third screenshot display the toggle pin output when we’re running the light load with the printf. Wow! What’s going there? You see there are large gaps between the read/write cycles. Why? Well, that’s because of printf and the UART interface. UART is the dinosaur of the comm peripherals and its sloooow. In this case has only a 115200 bps baudrate. And why there are gaps, you’ll think. There are gaps because the printf is inside a while loop, which means that it fills the kernel UART buffer very fast and at some point the kernel needs to flush this buffer in order to create more space for the next bytes. During the flush is occupied with the UART peripheral which doesn’t support DMA in this case and there you have it… Huge gaps where the kernel flushes the UART buffer. We can extract valuable information from this, though. The middle picture shows that the kernel spends 328 ms to empty the UART buffer (see the BX-AX value on the top left corner). During this time you a gap. Then in the last picture you see that for the next 504 ms performs read/writes on the I2C/SPI. This behavior is with the default kernel affinity and scheduler priority per process.

Now let’s see the same output when the SPI is set to 30MHz which for the current setup seems to be the maximum frequency without getting errors.

In the first picture we see now that a full SPI/I2C read/write cycle takes 335 μsecs which is much faster compared to the 500KHz SPI speed. You can also see that the printf time is 550 ms and the SPI/I2C read/write time 218 ms, which means that the kernel uses the CPU for almost the same amount of time to empty the printf buffer, but also the SPI/I2C transactions are using the CPU for almost the half time. That seems that the kernel CPU time is tied to the SPI/I2C statistics.

Now let’s use the user-space tool to get some different numbers. In this case I’ll run the benchmark mode of the tool which counts the SPI/I2C read/write cycles per second. Each second is an iteration. Therefore, the tool also takes a parameter for how many iterations it will run. For example the following command, means that the tool will use /dev/i2c-0 and /dev/spidev-0 in benchmark mode (-m 1) and 20 iterations/runs/seconds (-r 20) and with the printf (light load) disabled (-p 0).

./linux-app -i /dev/i2c-0 -s /dev/spidev0.0 -m 1 -r 20 -p 0

After this test runs it will print some result, for example:

        SPI speed: 1000000 Hz (1000 KHz)
        SPI speed: 2000000 Hz (2000 KHz)
        SPI speed: 5000000 Hz (5000 KHz)
        SPI speed: 10000000 Hz (10000 KHz)
        SPI speed: 20000000 Hz (20000 KHz)
        SPI speed: 30000000 Hz (30000 KHz)

There you see that for each SPI speed the number of SPI/I2C read/write cycles are counted and printed. I won’t paste other data here, but I’ll use only the average values instead. You can have a look to the second sheet spreadsheet in the calc ods file for all the data.

So let’s see the average values when we use the benchmark mode, for 20 runs and the printf on and off.

SMP -m 1 -r 20 -p 0 -m 1 -r 20 -p 1
1MHz 2750.75 1561.55
2MHz 2843.75 1499.45
5MHz 2938.78 1427.05
10MHz 2936.2 1450.65
20MHz 2987 1902.6
30MHz 2986.6 1902.65

From the above table I make the following conclusions, there are almost twice as much SPI/I2C read/write cycles with the printf enabled in the loop and when using the SMP kernel. Wow, nobody expected that… And that when there no printf then after the 5MHz there’s no much difference in the number of cycles, but there is difference when the printf is enabled especially after the 20MHz. Anyway, as expected the faster the clock the more cycle counts.

But let’s now also enable a quite heavy load in the background and re-run those tests to see what happens. The full command I’ve used for the stress tools is:

stress --cpu 4 --io 4 --vm 2 --vm-bytes 128M

The above command means that the stress tool will utilize 4 cores, spawns 4 worker threads and two extra threads that spinning a malloc/free of 128MB each to utilize memory load. And these are the averages:

SMP -m 1 -r 20 -p 0 -m 1 -r 20 -p 1
1MHz 1733.7 1155.5
2MHz 1874.95 1186.9
5MHz 1760.65 1196.9
10MHz 1731.4 1154.65
20MHz 1698.7 1170.2
30MHz 1826.7 1298.75

Now with the background heave load we see of course a huge drop in the performance for the SMP kernel, for both cases with either the printf on or off. Here the frequency doesn’t really have a great impact, but it still performs better. Any increase in the performance that is more that the statistic error is welcome. Therefore even those 100-200 full read/write cycles are a better performance, it just doesn’t scale uniform as the previous  example that there wasn’t a background load.

Now let’s see the PREEMPT-RT kernel…


Let’s have a look to a couple of oscilloscope probings like we did in case of the SMT kernel.

In this case the average time for a full SPI/I2C cycle is 476 μsec. You can also see that the kernel performs read/write cycles for 504 ms and also spends 324 ms to flush the UART buffer. I will make the conclusions about how the SMP and the PREEMPT-RT are compared in the next section, so I’m continuing with the rest of the benchmarks.

These are the two tables for the PREEMPT-RT kernel with the benchmark result as the previous example.

PREEMPT-RT -m 1 -r 20 -p 0  -m 1 -r 20 -p 1
1MHz 2249.7 1448.55
2MHz 2254.2 1444.65
5MHz 2273.89 1447.55
10MHz 2281.45 1457.95
20MHz 2286.9 1457.55
30MHz 2297.85 1458.7

So, this means that the light load that printf adds to the CPU has a huge affect on the performance, although the kernel is the real-time kernel. That’s expected though, because real-time doesn’t mean that you’ll get the performance from each process that you get it’s running as the only process in the CPU, it just means that the scheduler will be fair and the process is guaranteed to get a minimum amount of time to execute frequently. Therefore, the whole performance is affected from any additional load as in the case of the SMP.

Now let’s see what’s happening when there’s a heavy background load as before. So I’ve used the exact same parameters for the stress tool as before and these are the results I’ve got.

PREEMPT-RT -m 1 -r 20 -p 0   -m 1 -r 20 -p 1
1MHz 1815.15 1398.7
2MHz 1930.25 1443.35
5MHz 1963.55 1399.9
10MHz 1929.9 1441.5
20MHz 2045.65 1472
30MHz 2002.05 1442.65

So, what we see here? Wow, right? There’s really no big difference compared to the previous table. It seems that although the load now is much higher, the performance impact was quite low. Why’s that? Well, that’s what the RT kernel does, it makes sure that your process will get a fair time to run and it will preempt other processes frequently, so there’s no process that will occupy the CPU more time that the others. Again, the printf has a great impact, because the problem relies in the implementation and there’s no DMA to unload the CPU from the task of sending bytes over the UART.


So let’s compare the results that we have from the two kernels and see what we got. I’ll create two new tables with the sum of the results for the light and heavy load. This is the table without the heavy background load.

1MHz 2750.75 2249.7 1561.55 1448.55
2MHz 2843.75 2254.2 1499.45 1444.65
5MHz 2938.79 2273.89 1427.05 1447.55
10MHz 2939.2 2281.45 1450.65 1457.95
20MHz 2987 2286.9 1902.6 1457.55
30MHz 2986.6 2297.85 1902.65 1458.7

In this table we see that without any load, the SMP kernel is much faster compared to RT. That’s happening because the scheduler is not really fair, but gives as much as processing time to the SPI/I2C and the benchmark tool as the rest of the processes are idle. Quite the same happens for the RT without the load, but still the CPU is forced to switch between also other tasks and processes that don’t have much to do, so the scheduler is more “fair”.

On the next two columns, the impact of the printf in the while loop has a huge affect on both kernels. Nevertheless, the SMP kernel gives more processing time to the benchmark tool and the SPI/I2C interfaces, therefore the SMP has approx. 450 more read/write cycles more in higher frequencies.

Another thing is obvious from the table is that the SPI/I2C read/writes scale with the frequency increment and the RT kernel is not. So for the RT kernel it doesn’t matter if the SPI bus is running on 1MHz or 30MHz. Cool, right? So that means that if you’re running on a RT kernel you don’t have to worry in optimizing your SPI to achieve the max frequency, because it doesn’t make any difference. But on the SMP you should definitely do any optimizations.

So in this case, it seems that the SMP kernel is much, much better for such use scenarios. What are those scenarios? Well, SPI displays are definitely one of those, for example. And this most probably is the same for every other peripheral that demands a high throughput (e.g. pcie, USB, e.t.c.)

Now let’s go to the next table that the benchmark is running with a heavy load in the background.

1MHz 1733.7 1815.15 1155.5 1398.7
2MHz 1874.95 1930.25 1186.9 1443.35
5MHz 1760.65 1963.55 1196.9 1399.9
10MHz 1731.4 1929.9 1154.65 1441.5
20MHz 1698.7 2045.65 1170.2 1472
30MHz 1826.7 2002.05 1298.75 1441.65

Wait, what? What happened here? In all benchmarks the RT kernel not only scores higher, but also if you see the full table in the calc file, you’ll see that there is smooth and consistent performance between each SPI/I2C read/write cycle for the RT kernel. The SMP kernel from the other hand, has a great variation between the cycles and also the average performance is lower. The performance difference between the SMP and RT is not huge, but its substantial. Who doesn’t want 100,200 or even 300 more SPI/I2C read/write cycles per second, right?

So what happened here? Well, as I’ve mentioned before, the RT scheduler is fair. Therefore, for the RT kernel you get the almost the same performance as you get with a lower load, because the kernel will more or less assign the CPU for the same amount of time. But, the performance on the SMP is getting a great impact, because now the kernel needs to assign more time to other processes that may need the kernel for more time. Hence, this difference between the last two tables.

OK, so what’s better then? What should I use? Which is better? Well… that depends. What are your needs? What your device is doing? For example, if you want to drive an SPI display with the max framerate possible then forget about RT, but on the same time make sure that there’s no other processes in your system that load the CPU that much, because then your framerate will drop even more compared to the RT kernel. Then why use the RT kernel? You would use the RT kernel if your system needs to perform specific tasks in a predictable way, even under heavy load. An example of that for example is audio or let’s say that you drive motors and you need minimum latency under every circumstance (no load, mid load, high load). In most cases the SMP kernel is what you need when a lot of IOs and a high throughput is needed and also almost every other case, except when you need low latency and predictable execution.

Another thing needs notice here is that the RT kernel is not just a plug n play thing that you boot in you OS and everything works just as fine as with SMP. Instead, there may a lot of underlying issues and bugs in there, that may have an undefined behaviour which is not triggered with the SMP kernel. This means that some drivers, subsystems, modules or interfaces, even a hardware may don’t be stable with the RT kernel. Of course, the same goes for the SMP, but at least the SMP is used much more widely and those issues are come to the surface and fixed sooner, compared to the RT kernel. Also, if your kernel is not a mainline kernel then it’s a hard and long process to convert it to a fully PREEMPT-RT kernel, as the patches for the RT kernel are targeting the mainline kernel only. So until all the PREEMP-RT patches become mainline and also we get to the point that your hardware supports those mainline versions, might take a looong time.

This post is just a stupid project and is not meant to be an extensive review, benchmark or versus between the SMP and the PREEMPT-RT. Don’t forget where you are. This is a stupid projects blog. And for that reason let’s see the SPI protoresistor and I2C PWM LED in action.

Linux and the I2C and SPI interfaces


Most of the people that reading that blog, except that they don’t have anything more interesting to do, are probably more familiar with the lower embedded stuff. I like the baremetal embedded stuff. Everything is simple, straight-forward, you work with registers and close to the hardware and you avoid all the bloatware between the hardware and the more complicated software layers, like an RTOS. Also, the code is faster, more efficient and you have the full control of everything. When you write a firmware, you’re the God.

And then an RTOS comes and says, well, I’m the god and I may give you some resources and time to spend with my CPU to run your pity firmware or buggy code. Then you become a semi-god, at best. So Linux is one of those annoying RTOSes that demotes you to a simple peasant and allows you to use only a part of the resources and only when it decides to. On the other hand, though, it gives you back a lot more benefits, like supporting multiple architectures and have an API for both the kernel and the user-space that you can re-use among those different architectures and hardware.

So, in this stupid project we’ll see how we can use a couple of hardware interfaces like I2C and SPI to establish a communication between the kernel and an external hardware. This will unveil the differences between those two worlds and you’ll see how those interfaces can be used in Linux in different ways and if one of those ways are better than the other.

Just a note here: I’ve tested this project in two different boards, on a raspberry pi 3 model B+ and on a nano-pi neo. Using the rpi is easier for most of the people, but I prefer the nano-pi neo as it’s smaller and much cheaper and it has everything you need. Therefore, in this post I will explain (not very thorough) how to make it work on the rpi, but in the file in the project repo, you’ll find how to use Yocto to build a custom image for the nano-pi and do the same thing. In case of the nano-pi you can also use a distro like armbian and build the modules by using the sources of the armbian build. There are so many ways to do this, so I’ll only focus one way here.



I tried to keep everything simple and cheap. For the Linux OS I’ve chosen to use the nanopi-neo board. This board costs around ~$12 and it has an Allwinner H3 cpu @800 or @1200MHz and a 512MB RAM. It also has various other interfaces, but we only care about the I2C and the SPI. This is the board:

You can see the full specs and pinout description here. You will find the guide how to use the nano-pi neo in the repo file in here:

Raspberry pi

I have several raspberry pi flying around here, but in this case I’ll use the latest Raspberry Pi 3 Model B+. That way I can justify to my self that I bought it for a reason and feel a bit better. In this guide I will explain how to make this stupid project with the rpi, as most of the people have access to this rather to a nano pi.

Arduino nano

Next piece of hardware is the arduino-nano. Why arduino? Well, it’s fast and easy, that’s why. I think the arduino is both bless and curse. If you have worked a lot with the baremetal stuff, arduino is like miracle. You just write a few line of codes and everything works. On the other hand, it’s also a trap. If you write a lot of code in there, you end up losing the reality of the baremetal and you become more and more lazy and forget about the real hardware. Anyway, because there’s no much time now, the Arduino API will do just fine! This is the Arduino nano:

Other stuff

You will also need a photoresistor, a LED and a couple of resistors. I’ll explain why in the next section. The photoresistor I’m using is a pre-historic component I’ve found in my inventory and the part name is VT33A603/2, but you can use whatever you have, it doesn’t really matter. Also, I’m using an orange led with a forward voltage of around 2.4V @ 70mA.


OK, that’s fine. But what’s the stupid project this time? I’ve though the most stupid thing you can build and would be a nice addition to my series of stupid projects. Let’s take a photo-resistor and the Arduino mini and use an ADC to read the resistance and then use the I2C interface as a slave to send the raw resistance value. This value actually the amount of the light energy that the resistor senses and therefore if you have the datasheets you can convert this raw resistance to a something more meaningful like lumens or whatever. But we don’t really care about this, we only care about the raw value, which will be something between 0 and 1023, as the avr mega328p (arduino nano) has a 10-bit ADC. Beautiful.

So, what we do with this photo-resistor? Well, we also use a PWM channel from the Arduino nano and we will drive a LED! The duty cycle of the PWM will control the LED brightness and we feed the mega328p with that value by using the SPI bus, so the Arduino will be also an SPI slave. The SPI word length will be 16-bit and from those only the 10-bit will be effective (same length as the ADC).

Yes, you’ve guessed right. From the Linux OS, we will read the raw photo-resistor value using the I2C interface and then feed back this value to the PWM LED using the SPI interface. Therefore, the LED will be brighter when we have more light in less bright in dark conditions. It’s like the auto-brightness of your mobile phone’s screen? Stupid right? Useless, but let’s see how to do that.


First you need to connect the photo-resistor and the LED to the Arduino nano as it’s shown in the next schematic.

As you can see the D3 pin of the Arduino nano will be the PWM output that drives the LED and the A3 pin is the ADC input. In my case I’ve used a 75Ω resistor to drive the LED to increase the brightness range, but you might have to use a higher value. It’s better to use a LED that can handle a current near the maximum current output of the Arduino nano. The same goes for the resistor that creates a voltage divider with the photo-resistor; use a value that is correct for your components.

Next you need to connect the I2C and SPI pins between the Arduino and the rpi (or nanopi-neo). Have in mind that the nano-pi neo has an rpi compatible pinout, so it’s the same pins. These are the connections:/

Signal Arduino RPI (and nano-pi neo)
/SS D10 24 (SPI0_CS)
SCK D13 23 (SPI0_CLK)
SDA A4 3 (I2C0_SDA)
SCL A5 5 (I2C0_SCL)

You will need to use two pull-up resistors for the SDA and SCL. I’ve used 10ΚΩ resistors, but you may have to check with an oscilloscope to choose the proper values.


You need to flash the Arduino with the proper firmware and also boot up the nanopi-neo with a proper Linux distribution. For the second thing you have two option that I’ll explain later. So, clone this repo from here:

There you will find the Arduino sketch in the arduino-i2c-spi folder. Use the Arduino IDE to build and to upload the firmware.

For the rpi download the standard raspbian strech lite image from here and flash an SD card with it. Don’t use the desktop image, because you won’t need any gui for this guide and also the lite image is faster. After you flash the image and boot the board there are a few debug tweaks you can do, like remove the root password and allow root passwordless connections. Yeah, I know it’s against the safety guidelines, don’t do this at your home and blah blah, but who cares? You don’t suppose to run your NAS with your dirty secrets on this thing, it’s only for testing this stupid project.

To remove the root password run this:

passwd -d root

Then just for fun also edit your /etc/passwdfile and remove the xfrom the root:x:... line. Then edit your /etc/ssh/sshd_config and edit it so the following lines are like that:

PermitRootLogin yes
#PubkeyAuthentication yes
PasswordAuthentication yes
PermitEmptyPasswords yes
UsePAM no

Now you should be able to ssh easily to the board like this:

ssh root@

Anyway, just flash the arduino firmware and the raspbian distro and then do all the proper connections and boot up everything.


Here’s the interesting part. How can you retrieve and send data from the nanopi-neo to the Arduino board? Of course, by using the I2C and the SPI, but how can you use those things inside Linux. Well, there are many ways and we’ll see a few of them.

Before you boot the raspbian image on the rpi3, you need to edit the /boot/config.txt and add this in the end of the file, in order to enable the uart console.

Raw access from user space using bash

Bash is cool (or any other shell). You can use it to read and write raw data from almost any hardware interface. From the bitbucket repo, have a look in the That’s a simple bash script that is able to read data from the I2C and then send data to the SPI using the spidev module. To do that the only thing you need to take care of is to load the spidev overlay and install the spi-tools. The problem with the debian stretch repos is that the spi-tools is not in the available packages, so you need to either build it yourself. To do this, just login as root and run the following commands on the armbian shell:

apt-get install -y git
apt-get install -y autotools-dev
apt-get install -y dh-autoreconf
apt-get install -y i2c-tools
git clone
cd spi-tools
autoreconf -fim
make install

Now that you have all the needed tools installed you need to enable the i2c0 and the spidev modules on the raspberry pi. To do that add this run the raspi-config tool and then browse to the Interfacing options and enable both I2C and SPI and then reboot. After that you will be able to see that there are these devices:



This means that you can use one of the bash scripts I’ve provided with the repo to read the photo-resistor value and also send a pwm value to the LED. First you need to copy the scripts from you workstation to the rpi, using scp (I assume that the IP is

cd bash-scripts
scp *.sh root@

Before you run the script you need to properly set up the SPI interface, using the spi-config tool that you installed earlier, otherwise the default speed is too high. To get the current SPI settings run this command:

pi-config -d /dev/spidev0.0 -q

If you don’t get the following then you need to configure the SPI.

/dev/spidev0.0: mode=0, lsb=0, bits=8, speed=10000000, spiready=0

To configure the SPI, run these commands:

spi-config -d /dev/spidev0.0 -m 0 -l 0 -b 8 -s 1000000

With the first command you configure the SPI in order to be consistent with the Arduino SPI bus configuration. Then you run the script. If you see the script then the sleep function is sleep for 0.05 secs, or 50ms. We do that for benchmark. Therefore, I’ve used the oscilloscope to see the time difference every SPI packet and the average is around 66ms (screenshots later on), instead of 50. Of course that includes the time to read from the I2C and also send to the SPI. Also, I’ve seen a few I2C failures with the following error:

mPTHD: 0x036e
mError: Read failed
PTHD: 0x
n\PTHD: 0x036d
xPTHD: 0x036e

Anyway, we’ve seen that way that we are able to read from the I2C and write to the SPI, without the need to write any custom drivers. Instead we used the spidev module which is available to the mainline kernel and also a couple of user-space tools. Cool!

Using a custom user-space utility in C

Now we’re going to write a small C utility that opens the /dev/i2c-1 and the /dev/spidev0.0 devices and read/writes to them like they are files. To do that you need to compile a small tool. You can do that on the rpi, but we’ll need to build some kernel modules later so let’s use a toolchain to do that.

The toolchain I’ve use is the `gcc-linaro-7.2.1-2017.11-x86_64_arm-linux-gnueabihf` and you can download it from here. You may seen that this is a 32-bit toolchain. Yep, although rpi is a 64-bit cpu, raspbian by default is 32-bit. Then just extract it to a folder (this folder in my case is /opt/toolchains) and then you can cross-build the tool to your desktop with these commands:

export CC=/opt/toolchains/gcc-linaro-7.2.1-2017.11-x86_64_arm-linux-gnueabihf/bin/arm-linux-gnueabihf-gcc
${CC} -o linux-app linux-app.c
scp linux-app root@

Then on the rpi terminal run this:

./linux-app /dev/i2c-1 /dev/spidev0.0

Again in the code I’ve used a sleep of 50ms and this time the oscilloscope shows that the average time between SPI packets is ~50ms (screenshots later on), which is a lot faster compared to the bash script. In the picture you see the average is 83, but that’s because sometimes there are some delays because of the OS like 200+ms, but that’s quite expected in a non PREEMPT-RT kernel. Also, I’ve noticed that there was no missing I2C packet with the executable. Nice!

You won’t need the spidev anymore, so you can run the raspi-config and disable the SPI, but leave the I2C as it is. Also, in the /boot/config.txt make sure that you have those:


Now reboot.

Use custom kernel driver modules

Now the more interesting stuff. Let’s build some drivers and use the device-tree to load the modules and see how the kernel really handles these type of devices using the IIO and the LED subsystems. First let’s build the IIO module. To do that you need first to set up the rpi kernel and the cross-toolchain to your workstation. To do that you need first to get the kernel from git and run some commands to prepare it.

git clone
cd linux

Now you need to checkout the correct hash/tag. To do this run this command to rpi console.

uname -a

In my case I get this:

Linux raspberrypi 4.14.79-v7+ #1159 SMP Sun Nov 4 17:50:20 GMT 2018 armv7l GNU/Linux

That means that the date this kernel was build was 04.11.2018. Then in the kernel repo to your workstation, run this:

git tag

And you will get a list of tags with dates. In my case the `tag:raspberrypi-kernel_1.20181112-1` seems to be the correct one, so check out to the one that is appopriate for your kernel, e.g.

git checkout tag:raspberrypi-kernel_1.20181112-1

Then run these commands:

export KERNEL=kernel7
export ARCH=arm
export CROSS_COMPILE=/opt/toolchains/gcc-linaro-7.2.1-2017.11-x86_64_arm-linux-gnueabihf/bin/arm-linux-gnueabihf-
export KERNEL_SRC=/rnd2/linux-dev/rpi-linux
make bcm2709_defconfig
make -j32 zImage modules dtbs

This will build the kernel, the modules and the device-tree files and it will create the Module.symversthat is needed to build our custom modules. Now on the same console (that the above parameters are exported) run this from the repo top level.

cd kernel_iio_driver
dtc -I dts -O dtb -o rpi-i2c-ard101ph.dtbo rpi-i2c-ard101ph.dts
scp ard101ph.ko root@
scp rpi-i2c-ard101ph.dtbo root@

Now, do the same thing for the led driver:

cd kernel_led_driver
dtc -I dts -O dtb -o rpi-spi-ardled.dtbo rpi-spi-ardled.dts
scp ardled.ko root@
scp rpi-spi-ardled.dtbo root@

And then run these commands to the rpi terminal:

mv ard101ph.ko /lib/modules/$(uname -r)/kernel/
mv ardled.ko /lib/modules/$(uname -r)/kernel/drivers/leds/drivers/iio/light/
depmod -a

And finally, edit the /boot/config.txtfile and make sure that those lines are in there:


And now reboot with this command:

systemctl reboot

After the reboot (and if everything went alright) you should be able to see the two new devices and also be able to read and write data like this:

cat /sys/bus/iio/devices/iio\:device0/in_illuminance_raw
echo 520 > /sys/bus/spi/devices/spi0.0/leds/ardled-0/brightness

Have in mind, that this is the 4.14.y kernel version and if you’re reading this and you have a newer version then a lot of things might be different.

So, now that we have our two awesome devices, we can re-write the bash script in order to use those kernel devices now. Well, the script is already in the bash-scripts folder so just scp the scripts and run this on the rpi:


The oscilloscope now shows an average period of 62.5ms, which is a bit faster compared to the raw read/write from bash in the first example, but the difference is too small to be significant.


Let’s see some pictures of the various oscilloscope screenshots. The first one is from the bash script and the spidev module:


The second is from the linux-app program that also used spidev and the /dev/i2c.

And the last one is by using the kernel’s iio and led subsystems.

So let’s make some conclusions about the speed in those different cases. It’s pretty clear that writing an app in the user space using the spidev and the /dev/spi is definitely the way to go as it’s the best option in terms of speed and robustness. Then the difference between using a bash script to read/write to the bus with different type of drivers [spidev & /dev/i2c] vs [leds & iio] is very small.

Then, why writing a driver for iio and leds in the first place if there’s no difference in performance? Exactly. In most of the cases it’s muck easier to write a user-space tool to control this kind of devices instead of writing a driver.

Then are those subsystems useless? Well, not really. There are useful if you use them right.

Let’s see a few bullet points, why writing a user-space app using standard modules like the spidev is good:

  • No need to know the kernel internal
  • Independent from the kernel version
  • You just compile on another platform without having to deal with hardware specific stuff.
  • Portability (pretty much a generalization of the above)
  • If the user space app crashes then the kernel remains intact
  • Easier to update the app than updating a kernel module
  • Less complicate compared to kernel drivers

On the other hand, writing or having a subsystem driver also has some nice points:

  • There are already a lot of kernel modules for a variety of devices
  • You can write a user space software which is independent from the hardware. For example if the app accesses an iio device then it doesn’t have to know how to handle the device and you can just change the device at any time (as long it’s compatible in iio terms)
  • You can hide the implementation if the driver is not a module but fixed in the kernel
  • It’s better for time critical operations with tight timings or interrupts
  • Might be a bit faster compare to a generic driver (e.g. spidev)
  • Keeps the hardware isolated from the user-space (that means that the user-space devs don’t have to know how to deal with the hardware)

These are pretty much the generic points that you may read about generally. As a personal preference I would definitely go for a user-space implementation in most of the cases, even if a device requires interrupts, but it’s not time critical. I would choose to write a driver only for very time critical systems. I mean, OK, writing and know how to write kernel drivers it’s a nice skill to have today and it’s not difficult. In the bottom line, I believe most of the times, even on the embedded domain, when it comes to I2C and SPI, you don’t have to write a kernel driver, unless we talking about ultra fast SPI that is more than 50MHz with DMAs and stuff like that. But there are very few cases that really needs that, like audio/video or a lot of data. Then in 95% of the rest cases the spidev and the user-space is fine. The spidev is even used for driving framebuffers, so it’s proven already. If you work in the embedded industry, then probably you know how to do both and choose the proper solution every time; but most of the time on a mainstream product you may choose to go with a driver because it’s “proper” rather “needed”.

Anyway, in this stupid project you pretty much seen how the SPI and I2C devices are used, how to implement your own I2C and SPI device using an Arduino and then interface with it in the Linux kernel, either by using the standard available drivers (like spidev) or by writing your custom subsystem driver.

Finally, this is a video where both kernel modules are loaded and the bash script reads the photo-resistor value every 50ms via I2C and then writes the value to the brightness value of the led device.

Have fun!

STM32F373CxT6 development board


Probably most of you people that browsing places like this, you already know the famous “blue pill” board that is based on the stm32f103c8t6 mcu. Actually, you can find a few projects here that are based on this mcu. It’s my favorite one, fast, nice peripherals, overclockable and dirt cheap. The whole PCB dev board costs around 1.7 EUR… I mean, look at this.

If you try to buy the components to build one and also order the PCBs from one the known cheap makers, that would cost more than 10 EUR per unit for a small quantity. But…

There’s a new player in town

Don’t get me wrong, the stm32f103 is still an excellent board and capable to do  most of the things I can image for small stupid projects. But there are also a few things that can’t do. And this is the gap that this new mcu comes to fill in. I’m talking about the stm32f373, of course. The STM32F373CxT6 (I’ll refer to it as f373 from now on) is pin to pin compatible with the STM32F103C8T6, but some pins are different and also is like its buffed brother. The new f373 also comes in a LQFP48 package, but there are significant differences, which I’ve summed in the following table.

STM32F373C8 STM32F103C8
Core Arm Cortex-M4 Arm Cortex-M3
RAM Size (kB) 16 20
Timers (typ) (16 bit) 12 4
Timers (typ) (32 bit) 2
A/D Converters (12-bit channels) 9 10
A/D Converters (16-bit channels) 8
D/A Converters (typ) (12 bit) 3
Comparator 2
SPI (typ) 3 2
I2S (typ) 3

So, you get a faster core, with DSP, FPU and MPU, DACs, 16-bit ADCs, I2S and more SPI channels, in the same package… This MCU is mind-blowing. I love it. But, of course is more expensive. Mostly I like the DACs because you can do a lot of stupid stuff with these and then the 16-bit ADC, because more bits means less noise, especially if you use proper software filters. The frequency is the same, so I don’t expect much difference in terms of raw speed, but I’ll test it at some point.

Also it’s great that ST already released a standard peripheral library for the mcu, so you don’t have to use this HAL crap bloatware. There’s a link for that here. The StdPeriphLib supports, now makes this part my favorite one, except that fact that… I haven’t tested it yet.

Where’s my pill?

For the f103 we have the blue pill. But where’s my f373 pill? There’s nothing out there yet. Only a very expensive dev board that costs around $250. See here. That’s almost 140 “blue pills”… Damn. Therefore, I had to design a board that is similar to the “blue pill” but uses the f373. Well, some of you already thought of why not remove the f103 from the “blue pill” and solder the f373 instead and there you are, you have the board. Well, that’s true… You can do that, some pins might need some special handling of course, but where’s the fun in this? We need to build a board and do a stupid project!

Well, in case you want to replace the f103 on the “blue pill” with the f373, then this is the list with the differences in the pin out.

PIN F103 F373
21 B10 E8
22 B11 E9
25 B12 VREFSD+
26 B13 B14
27 B14 B15
28 B15 D8
35 VSS_2 F7
36 VDD_2 F6

So the problematic pins are mainly the 35 & 36 and that’s because in case of the f373 the F7 is connected straight to ground and the F6 to 3V3 and that makes them unusable. According to the manual the F7 pin has these functionalities: I2C2_SDA, USART2_CK and the F6 these: SPI1_MOSI/I2S1_SD, USART3_RTS, TIM4_CH4, I2C2_SCL. That doesn’t mean that you can’t use the I2C2 as the I2C2_SDA is also available to pin 31 (PA10) and also the SPI1_MOSI is available to pin 41 (PB5) and I2C2_SCL to pin 30 (PA9). So no much harm, except the fact that those alternative pins overlap other functionalities, you might not be able to use.

Therefore, just change the f103 on the “blue pill” with the f373, if you like; but don’t fool yourself, this won’t be stupid enough, so it’s better to build a board from the scratch.

The board

I’ve designed copied the schematics of the “blue pill” and create a similar board with a few differences though. First, the components like resistors and capacitors are 0805 with larger pads, instead of 0603 on the original “blue pill”. The reason for that is to help older people like me with bad vision and shaky hands to solder these little things on the board. I’m planning to create a branch though with 0603 parts for the younger people. Soon…

I’ve used my favorite KiCAD for that as it’s free, opensource and every open-hardware (and not only) project should use that, too. I think that after the version 5, more and more people are joining the boat, finally.

This is the repo with the KiCAD files:

And this is the 3D render of the board:


You can replace the f103 on the “blue pill” with the f373, but that’s lame. Don’t do that. Always go with the hardest, more time consuming and more expensive solution and people will always love you. I don’t know why, it just works. So, instead of replace the IC, build a much more expensive board from the scratch, order the PCBs and the components, wait for the shipment and then solder everything by yourself. Then you have a great stupid project and I’ll be proud of you.

Have fun!

Electronic load using ESP8266


Long time, no see. Well, let me tell you a thing, parenting is the end of your free time to do your stupid projects. It took me a few months just to complete a very small stupid project like this… Anyway, this time I wanted to build an electronic load. There are so many designs out there with various different user interfaces, but I wanted something more useless than a pot to control the output current. Therefore, I though why not build an electronic load with a web interface? And which is the best option for that? Of course, esp8266. Well… not really, but we’ll see why later. Still, this is build on the esp8266 anyways.

Electronic load circuit

There are various ways to make an eload, but the easiest way to use a N-MOSFET and an opamp that it’s negative feedback drives the Gate-Source of the MOSFET. This is a screenshot of the main circuit:

You can find the kicad project for the eload circuit here:

In the above circuit there’s a first stage opamp that is driven from a DAC. This opamp amplifies the signal with a gain of 2.2 and then the second stage opamp drives the MOSFET. The source of the MOSFET is connected on the PSU under test and the source to an array of parallel 10Ω/1W resistors, which makes an effective 1Ω/10W resistor. The gate and source of the MOSFET are connected in the negative feedback loop of the opamp. This means the opamp will do what opamps do and will “mirror” the voltage on the (+) input to the (-) input. Which means that whatever voltage is applied on the (+) input then the opamp will drive the output in a way that both inputs have the same voltage. Because the gate and the source of the MOSFET are part of the feedback loop, then in this circuit it means the voltage on the source will be the same as the (+) input. Therefore, if you connect a load that is 1Ω then the current will be I=V/R and because R=1Ω, that means Ι=V. So, if 5V are applied in the (+) input then 5A will flow on the resistors.

It’s generally better to have resistors in parallel, because that means that the current will be split to these resistors, which means less heat and no need for extra cooling on a single resistor.

There is also another opamp which buffers the voltage on the negative loop and drives a voltage divider and an RC filter. We need a buffer there in order not to apply any load on the feedback, as the opamp will draw only a tiny current which doesn’t affect the performance of the circuit. The voltage divider is an 1/10 divider that limits the maximum 10V input to the max 1V allowed ADC input of the esp8266. Then there is a first grade RC filter with a pole at ~1Hz to filter out any noise, as we only care about the DC offset.

As the opamp is powered with a +10V then it means that if it’s a rail-to-rail opamp then we can have as much as 10A on the load. The circuit is design for a bit less though. But, let’s see the components.


For this project I’ve decided to build a custom PCB board instead of using a prototype breadboard and connect the different components with solder. The reason for that was “low noise” for the ADCs and DACs. Well, that didn’t work eventually, but still it’s a nice board. The main components I’ve used are:


This is the main component. This is the esp8266 module with 4MB flash and the esp8266 core which can run up to 160MHz. It has two SPI interfaces, one used for the onboard EEPROM and one it’s free to use. Also it has a 12-bit ADC channel which is limited to max 1V input signals. This is a great limitation and we’ll see later why. You can find this on ebay sold for ~1.5 EUR, which is dirt cheap.


As you see the vendor is the AI-THINKER and the model is ESP8266MOD and most of the times this is what you find in ebay, but you may also find modules with no markings.


This is an IC with 4x low noise single supply, rail-to-rail output opamps and a voltage range up to 16V, which is more than enough for the purpose. You can find the specs here. This costs around 0.85 EUR for single unit in Arrow or other electronic component distributors. I wouldn’t buy these ICs from ebay.


This is a N-channel logic level power MOSFET with a very low Rds(on) (0.018Ω @ 10V). We need a logic level MOSFET because the Vgs needs to be as low as possible. You can find all the specs here. Of course you can use any other MOSFET you like as long as it has the same footprint. That costs approx 1 EUR, but you can use a cheaper one.


This is a low noise 12-bit DAC with an internal VREF of 2.048V, a 2x gain capability and it also has an SPI interface. This thing is a beast. I love it. Easy to use and excellent precision for this kind of projects. This costs ~1.8 EUR. Well, yeeeeeah it’s a bit expensive, I know. We’ll get there soon.


This is a USB to serial IC that supports also the extra hardware signals like DTR and RTS. These are needed actually to simplify the flashing procedure of the esp8266, so you don’t have to use jumpers to put the chip in to programming mode. You can see the trick is used in the schematics, you just need two transistors and two resistors. This costs ~4 EUR… Another expensive component. Not good, not good. There are also other cheaper ones like the CP2102, but they are rare to find because they are very popular in the Chinese vendors that build those cheap USB to UART boards.


For the power supply we need a couple of stuff. Because the board is USB powered we get the 5V for free from the USB. But we also need 3.3V and 10V. For the first one it’s easy we just need a voltage regulator like the AMS1117-3.3 (or any other variation of the 1117 you can find). For the +10V that are needed to drive the opamps it’s a bit more complicated and for this purpose you need a charge pump IC like the SP6661. This IC can be connected in a way that can double it’s input voltage; therefore if we use the +5V from the USB we can get +10V. The cost of the SP6661 is ~1.5 EUR.

Project source code

The source code is really simple, because I’ve used the arduino libs for the esp8266, therefore I didn’t write much code. Although, for prototyping and also to do quick stuff the arduino libs are nice, I don’t like using them and I prefer to write everything baremetal. Actually, this is how I’ve started the project but at some point I figured out that it needed quite much time for a stupid project. Therefore, I decided to go with the arduino libs.

Originally, I’ve started with the cnlohr’s esp82xx, which although it’s an excellent start template to start with and it has quite a few nice things in there, like a custom small file system and websockets; at the same time it’s a bit tight to it’s default web interface and it’s quite an effort to strip out all the unused stuff and add your own things. Therefore, after already spending some time with this template and bulding on top of it, I’ve decided to go with the arduino libs for the esp8266 because in that case I just had to write literally a few lines of code.

You can find the source code here:

Also the kicad files are in there, so you can edit/order your own board. Read the file for more info how to build the binary and upload the web server files.

Web interface and functionality

Before I’ve build the PCB I’ve tested most of the functionality on a breadboard, just to be sure that the main stuff is working. 

Noise, noise, noise

Well, although the esp8266 has an ADC, there are two problems. First it’s maximum input voltage is 1V, which makes it prone to noise from external interference. Second it’s crap. When esp8266 is using it’s WiFi radio to transmit then there’s a lot of noise in the ADC, so much noise that it makes the ADC readings completely unusable even with analog external filters or software low pass or IIR filters. The error can even as much as 4-bit and that’s total crap, especially on low ADC values. Here we can see an image that when the DAC value is 0, then the ADC input is that crap signal.

That’s more than 200mV of noise. But, if I disconnect the RC filter output from the ADC and just probe it, then I get this output.

The noise drops to 78mV, which is still high, but much better than before. Therefore, you see that there’s a lot of noise in the system and especially there’s a lot of noise when the RC filter output is connected on the ADC; which means that the esp8266 creates a lot of noise itself. That’s really bad.

Anyway, the components are already expensive and the price is getting out of the scope for a stupid project, therefore I prefer having this crap ADC input instead of spending more for a low noise SPI 12-bit ADC. In the end of the article I will tell you what is the best option and one of my next stupid projects.

Web interface

That was what the whole project was made for. The web interface… I mean, ok, all the rest of the things could be made much less expensive and with better results in terms of noise, but let’s proceed with this. This is the web interface that is stored in the esp8266 upper 1M flash area.

The upper half of the screen is the ADC read value (converted to volts) and the lower half is the output value of the DAC (also converted to volts). The DAC value is very accurate and I’ve verified with my Fluke 87V in the whole range. I’ve also added a libreoffice calc sheet with the measured values. In real time the ADC value is changing constantly on the web interface, even with the RC filter and also a software low pass filter.

The web interface is using websockets in the background. There are many benefits on using websockets and you can find a lot of references in the internet. The days before websockets, for similar projects, I had to use ajax queries, which are more difficult to handle and also they can’t achieve the high speed comms of the websockets. There are a couple of excellent videos with websockets and esp8266, like this one here.

To control the current that will flow to the load resistors you just need to drag the slider and select the value that you want. You can actually only trust the slider value and just use the ADC reading as an approximate value and just verify that the Vgs is close to the wanted value.

This is the setup on my workbench that I’ve used to build the circuit and run the above tests.

You may see that I’ve solder a pin header pin on the ADC of the esp8266 board. That’s because the Lolin esp8266 board has a voltage divider in the ADC pin in order to protect the input from the 3V3 voltage you may insert as this is the reference voltage for the board.

PCB board

As I’ve mentioned I’ve created a PCB board for that and I’ve ordered it from seeedstudio, which by the time had the cheapest price. So for 5 PCBs, including shipping cost me 30 EUR. This is a 3D image from kicad, but the real one is black. I’ll post it when I get it and assemble it.

As you can see I’ve exported the same pinout as the Lolin board to make it usabke also for other tasks or tests. The heatsink will be much larger as 9A is quite a lot, but I couldn’t find larger 3D model (I think there’s a scale factor for the 3D model in kicad when you assign the 3D object to the footprint, but anyway).

The maximum current can be set up to 4.095 * 2.2 = ~9A. This is the 2.048 VREF of the DAC, multiplied by 2 with the DAC’s gain and multiplied by 2.2 from the gain of non-inverting opamp amplifier stage. Beware that if you tend to use that high current then you need a very good cooling solution. One way to limit the current in order not to make a mistake is to change the code and use the 1x gain and change also the values in the eload.js javascript file in the web interface. That change will limit the current to 4.5A.


Well, this was a really stupid project. It’s actually a failure in terms of cost and noise. It would be much better to use a microcontroller and replace most of the parts in there. For example instead of using a DAC, the USB-to-serial IC or an external ADC for less noise, you could use a microcontroller that has all those things integrated and also costs much less. For example a nice mcu for this would be the STM32F373CBT6 which has two 16-bit ADCs, two 16-bit DACs, supports USB device connection and other peripherals to use for an SPI display for example. This controller costs only around 5 EUR and it could replace a few parts. Also, you can implement a DFU bootloader to update both firmwares (STM32 and esp8266). Therefore, this might be a future stupid project upgrade. Actually, I’m trying to find an excuse to use the STM32F373CBT6 and build something around it…

Finally the prototype worked for me quite nice and I’ll wait for the board to arrive and test it with my PSU and see how accurate it is. I can’t see any reason why should anyone build this thing, but this is why stupid projects are all about, right?

Have fun!