Benchmarks with gcc, musl and clang, and how they can affect embedded development cost

Intro

Ok, I know that’s a long and pompous title, but it’s difficult to summarize the whole meaning of this post.

As you’ll find out on this blog, I won’t only write about stupid projects; I’ll also write posts about other things that I find interesting, mainly in the embedded world.

When dealing with embedded Linux there are several things to consider, because “embedded” usually refers to a vast ARM ecosystem that extends from the tiny Cortex-M0 processor (ARMv6-M) to a Cortex-A73 (ARMv8-A). Of course, you won’t see Linux on a Cortex-M0 CPU, but you may see it on a Cortex-M4.

There are two reasons for this post. The first one is to write down some thoughts about when it’s preferable to use bare metal, an RTOS or Linux in embedded products. The second reason was this video, which raises a question: once we’ve decided that we need Linux, what options and tools do we have and, most importantly, what happens to the code and binary size? In embedded you should care about size, because it can limit your options and it also affects your product cost, development time and budget.

Right now embedded is a hot topic and there are many things going on with the compilers, their optimizations and the system libraries. So what’s the deal with the compilers?

Compilers

In embedded systems you can find many different toolchains with fancy names, like ‘arm-none-eabi-’, ‘arm-linux-gnueabi’, ‘gcc-arm-embedded’, and the list goes on. Assuming a specific architecture (ARM in this case), these compilers are all different, but they can all be used to compile an application that doesn’t run on an OS. One of the main differences is the C library each one assumes: ‘arm-none-eabi’ assumes either no C library or newlib, while ‘arm-linux-gnueabi’ assumes the full-blown glibc (or eglibc). So, usually, if you cross-compile bare-metal source code for ARM (e.g. for an STM32 micro-controller) you should use ‘arm-none-eabi-’, and when you cross-compile a Linux kernel or a Linux user-space application you’ll use ‘arm-linux-gnueabi’.

OS or RTOS?

This is another question that usually comes up when dealing with a new project. Usually it’s pretty clear whether you need an OS or not, but the line between needing Linux and needing another embedded RTOS (eRTOS) is sometimes blurry. If you go with an eRTOS then you use arm-none-eabi for the whole project, but if you go with Linux then you’ll need arm-none-linux-gnueabi.

The difference between an eRTOS and Linux is huge. It’s much more complex to develop on an eRTOS compared to Linux, because on the eRTOS you’ll usually need to implement your own subsystems and underlying tasks, which Linux already provides. Also, Linux separates the kernel from the user-space applications, but on an eRTOS that separation is not always achievable and you need to write code in several different places in the eRTOS. On the other hand, Linux takes care of all the low-level mumbo jumbo and you only have to write your app using the kernel API. So what are the criteria for choosing? Well, that depends on many variables, like your existing code base, the development time, the platform tools, cost, etc., but now more than ever you also need to consider the size.

Size!

There’s no doubt that the size will be much smaller when using an embedded RTOS (eRTOS), but that also comes with higher complexity, and sometimes it’s also difficult to choose the right eRTOS, the one that will have the fewest problems, bugs and issues. In cases where the cost of a small external RAM and eMMC/SD is not that high, and Linux brings less complexity and a faster development time, you need to consider Linux. And this is what this post is all about.

Is it possible to shrink a Linux OS to a low cost embedded platform?

If the cost of plenty of RAM and big, fast storage is not an issue, then Linux is a no-brainer. But when there’s a cost restriction and the budget only allows a small RAM and cheap storage, you have to research whether the tools we have these days can deliver a small enough user-space code size.

How to shrink size?

At the time this post is written there are a few tools that target small binary sizes. First, gcc already has optimization flags that can be used to optimize for speed and/or size (e.g. -O3 and -Os). gcc also provides Link Time Optimization (LTO, the -flto flag), which can reduce the compiled size too. What LTO does is gather more details about the code at compile time and pass them to the linker, so the linker can use them to optimize further.

Apart from gcc there’s also the clang compiler, which is designed to fully replace gcc. Therefore, when you build a source with clang you don’t use gcc at all. Although clang is highly compatible with gcc, it doesn’t offer full compatibility, which means that you may not be able to compile your whole existing code base.

In addition to the compilers, we also have the system libraries, which play a significant role in the resulting binary size. glibc is the low-level system C library that gcc links against by default on Linux and it’s huge! You would never consider the system library size when developing desktop applications, but for small embedded systems glibc is a behemoth. There are various light-weight replacements for glibc on arm-none-linux-gnueabi, like musl. You can see a comparison of a few libraries here. Note that there are also lighter C libraries for arm-none-eabi, like newlib-nano (the --specs=nano.specs compiler option), which makes a great difference in code size, too.

So, to sum up, we have different optimization flags, compilers and system libraries. Now, we can proceed with the benchmarks by using all these different options and see what happens.
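
To make the test matrix concrete, here’s a minimal Python sketch that enumerates the compiler, library and flag combinations that show up as labels in the results below. It only builds the labels; how benchmark.sh turns each label into a real compiler command isn’t shown in this post, so the function and list names here are my own assumptions for illustration.

```python
from itertools import product

# Labels follow the result tables in this post. The mapping from a label
# to a real compiler invocation lives in benchmark.sh and is not shown here.
GCC_LIKE = ["gcc", "musl"]             # reported with and without -flto
CLANG_LIKE = ["clang", "clang+musl"]   # reported without -flto
OPT_LEVELS = ["", "-Os", "-O3"]
LTO = ["", "-flto"]


def build_matrix():
    """Return the benchmark labels in the order they are reported."""
    labels = []
    for tool in GCC_LIKE:
        for lto, opt in product(LTO, OPT_LEVELS):
            labels.append(" ".join(p for p in (tool, lto, opt) if p))
    for tool in CLANG_LIKE:
        for opt in OPT_LEVELS:
            labels.append(" ".join(p for p in (tool, opt) if p))
    return labels


if __name__ == "__main__":
    for label in build_matrix():
        print(label)
```

That gives 18 configurations per source file, which is why every result block below has 18 rows.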

Benchmarks

I’ve prepared a benchmark repository on bitbucket that you can use to run your own benchmarks:

https://bitbucket.org/dimtass/gcc_musl_clang_benchmark

You can git clone the repo and then run the benchmark.sh script followed by the C file and any extra flags you want, like this:

./benchmark.sh oggenc.c -lm

By default the binaries are created in the output/ folder and when the test is finished they are deleted; therefore, if you want to keep the binaries to do some tests then run the benchmark like this:

KEEP=true ./benchmark.sh oggenc.c -lm

I’ve included these 4 source files: test.c, bzip2.c, gcc.c and oggenc.c. test.c is a simple file I wrote and I found the rest of the files here. Each file has a different size and different code, and by comparing them we can get an overview of how well each configuration performs on real applications.

These are the details for the compilers and libraries on my Linux Mint 18 ‘Sarah’ 64-bit.

GCC version : gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
glibc version : (Ubuntu GLIBC 2.23-0ubuntu7) 2.23
clang version : clang version 3.8.0-2ubuntu4 (tags/RELEASE_380/final)
musl version : 1.1.16

And these are the results for each source file.

./benchmark.sh test.c
Testing file: test.c

gcc            :  8976
gcc -Os        :  8984
gcc -O3        :  8984
gcc -flto      :  8968
gcc -flto -Os  :  8920
gcc -flto -O3  :  8920
musl           :  7792
musl -Os       :  7792
musl -O3       :  7792
musl -flto     :  7784
musl -flto -Os :  7728
musl -flto -O3 :  7728
clang          :  7664
clang -Os      :  7800
clang -O3      :  7832
clang+musl     :  4792
clang+musl -Os :  4896
clang+musl -O3 :  4928

./benchmark.sh oggenc.c -lm
Testing file: oggenc.c -lm

gcc            :  2147072
gcc -Os        :  2028656
gcc -O3        :  2179880
gcc -flto      :  2140944
gcc -flto -Os  :  1974736
gcc -flto -O3  :  2067600
musl           :  2141096
musl -Os       :  2022552
musl -O3       :  2173776
musl -flto     :  2134952
musl -flto -Os :  1972976
musl -flto -O3 :  2069824
clang          :  2112544
clang -Os      :  2020600
clang -O3      :  2116544
clang+musl     :  2107952
clang+musl -Os :  2016264
clang+musl -O3 :  2110528

./benchmark.sh bzip2.c
Testing file: bzip2.c
gcc            :  138008
gcc -Os        :  79320
gcc -O3        :  115912
gcc -flto      :  130448
gcc -flto -Os  :  71456
gcc -flto -O3  :  107760
musl           :  136376
musl -Os       :  73512
musl -O3       :  114304
musl -flto     :  128840
musl -flto -Os :  69728
musl -flto -O3 :  106152
clang          :  129112
clang -Os      :  94568
clang -O3      :  113504
clang+musl     :  125840
clang+musl -Os :  90824
clang+musl -O3 :  109968

./benchmark.sh gcc.c
Testing file: gcc.c
gcc            :  6879424
gcc -Os        :  4263096
gcc -O3        :  6493624
gcc -flto      :  6772760
gcc -flto -Os  :  3886664
gcc -flto -O3  :  5889240
musl           :  failed
musl -Os       :  failed
musl -O3       :  failed
musl -flto     :  failed
musl -flto -Os :  failed
musl -flto -O3 :  failed
clang          :  7446328
clang -Os      :  4785560
clang -O3      :  6537192
clang+musl     :  failed
clang+musl -Os :  failed
clang+musl -O3 :  failed

Analyzing the results

Now that we have the results we need to analyze them, and the best way to do that is to visualize the data in a way that makes the results easy to compare. For that reason I chose to plot the size for each file against the compiler and the library that were used. The last plot is the sum of all the output file sizes, except for gcc.c, which failed to build with musl.

Let’s take this step by step. If you look at the source code of test.c you’ll find that it’s a very simple program that doesn’t do much. Here the combination of clang+musl shines: the resulting binary is about half the size of the gcc+glibc one (4792 vs 8976 bytes) and also much smaller than gcc+musl (7792 bytes). This is a very good sign for clang. Also, clang by itself managed to produce almost the same binary size as gcc+musl. When I saw these results I was surprised and I thought that this was groundbreaking news, and actually it is, but… after running the benchmarks with the rest of the files my feelings were mixed.

The oggenc.c benchmark shows that gcc+glibc+LTO+Os and gcc+musl+LTO+Os produce nearly the same size (1974736 and 1972976 bytes), which is about 2% smaller than the best clang build and about 8% smaller than the default gcc build.

bzip2.c shows that gcc+glibc+LTO+Os, gcc+musl+Os and gcc+musl+LTO+Os perform about the same (between 69728 and 73512 bytes) and better than the rest.
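
To put numbers on that, here’s a small sketch that computes how much each -Os build of bzip2.c saves relative to the default gcc build. The sizes are copied from the results table above; the function names are just for illustration.

```python
# bzip2.c binary sizes in bytes, copied from the results above.
BZIP2_SIZES = {
    "gcc": 138008,
    "gcc -Os": 79320,
    "gcc -flto -Os": 71456,
    "musl -Os": 73512,
    "musl -flto -Os": 69728,
    "clang -Os": 94568,
    "clang+musl -Os": 90824,
}


def saving_percent(baseline, size):
    """Size reduction of `size` relative to `baseline`, in percent."""
    return 100.0 * (baseline - size) / baseline


def savings_vs(baseline_label, sizes):
    """Savings of every configuration against one baseline configuration."""
    base = sizes[baseline_label]
    return {label: saving_percent(base, s) for label, s in sizes.items()}


if __name__ == "__main__":
    for label, pct in savings_vs("gcc", BZIP2_SIZES).items():
        print(f"{label:<15}: {pct:5.1f}% smaller than default gcc")
```

With these numbers, gcc -flto -Os saves about 48% and musl -flto -Os about 49.5%, while clang -Os saves about 31.5%.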

gcc.c failed to build with musl, and gcc+glibc+LTO+Os performed better than the others.

Finally, the most important result comes from adding up the sizes of all the builds, as this is closer to a real-life application: your rootfs size will be the sum of all its programs. Of course, only 3 programs don’t really reflect the 100, 1000 or 2000 programs that your rootfs may really have, but it’s still a good measure to get a rough view of what it might be. Looking at the sum of the sizes, it’s obvious that gcc+glibc+LTO+Os and gcc+musl+LTO+Os perform better than the others and clang is behind.
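
The sum can be reproduced with a few lines of Python. This sketch adds up the sizes of test.c, oggenc.c and bzip2.c for a handful of configurations, using the numbers from the tables above (gcc.c is left out because it failed with musl). The dictionary layout is just my choice for illustration.

```python
# Sizes in bytes, copied from the result tables above (gcc.c excluded).
SIZES = {
    "gcc":            {"test.c": 8976, "oggenc.c": 2147072, "bzip2.c": 138008},
    "gcc -flto -Os":  {"test.c": 8920, "oggenc.c": 1974736, "bzip2.c": 71456},
    "musl -flto -Os": {"test.c": 7728, "oggenc.c": 1972976, "bzip2.c": 69728},
    "clang -Os":      {"test.c": 7800, "oggenc.c": 2020600, "bzip2.c": 94568},
    "clang+musl -Os": {"test.c": 4896, "oggenc.c": 2016264, "bzip2.c": 90824},
}


def totals(sizes):
    """Sum the per-file sizes for every configuration."""
    return {label: sum(per_file.values()) for label, per_file in sizes.items()}


if __name__ == "__main__":
    # Print the configurations from smallest to largest total.
    for label, total in sorted(totals(SIZES).items(), key=lambda kv: kv[1]):
        print(f"{label:<15}: {total}")
```

For these five configurations the totals are 2055112 bytes for gcc -flto -Os and 2050432 bytes for musl -flto -Os, against 2122968 for clang -Os and 2294056 for the default gcc build.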

You can draw your own conclusions from the results or, even better, try it yourself.

Conclusion

From what I’ve seen, my opinion is that LTO combined with -Os makes the most significant difference compared to the other options. That means that the gcc compiler optimizations outperform clang’s, and it also seems that musl doesn’t give much more compared to glibc. Of course, the last statement isn’t always true. musl does make a huge difference on some programs (test.c) and not on others (oggenc and bzip2), and it also failed to build gcc. I think musl looks promising though. On the other hand, the clang results were very mixed. On the simple test.c application it produces a really small binary and outperforms all the other options, but when things get tough with more complex code it’s worse than gcc.

So, let’s go back to the first question. Is it worth it? Can these results help us decide on the thin line between choosing an RTOS and Linux for an embedded product? Well, don’t expect this answer from me. You are the one to decide what is right for you, but I’ll tell you what I would do.

Personally, I break embedded project needs into three categories: sequential execution, a couple of parallel threads and real multi-threading.

  • If your project can be implemented with sequential execution, you don’t need an RTOS or Linux; go with bare metal.
  • If your project needs 1 or 2 threads, that doesn’t mean that you need an RTOS. You can develop it using finite state machines (FSM); but if for some reason you think that your FSM will be hard to maintain (e.g. a complex GUI with user inputs and peripherals) then go for an RTOS.
  • If you have a couple of threads and the cost must be kept low, then go for an RTOS.
  • If you have multiple threads and the budget allows you to add some RAM and a small storage, go for Linux.
  • If your code base is on Linux then, of course, you go for Linux.
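
The checklist above can be sketched as a tiny decision function. This is only my rough reading of it: the parameter names and the exact thread thresholds are assumptions made for illustration, and the case of many threads with no budget for Linux isn’t covered by the list, so the sketch falls back to an RTOS there.

```python
def suggest_platform(threads, budget_allows_linux=False,
                     linux_code_base=False, fsm_maintainable=True):
    """Rough mapping of the checklist to 'bare metal', 'RTOS' or 'Linux'.

    threads: number of parallel threads the design needs (0 = sequential).
    The threshold of 2 threads is an assumption taken from the
    "1 or 2 threads" bullet.
    """
    if linux_code_base:
        return "Linux"
    if threads <= 0:
        return "bare metal"  # purely sequential execution
    if threads <= 2:
        # A state machine on bare metal, unless it would be hard to maintain.
        return "bare metal" if fsm_maintainable else "RTOS"
    if budget_allows_linux:
        return "Linux"
    return "RTOS"  # assumption: this case is not covered by the list


if __name__ == "__main__":
    print(suggest_platform(0))
    print(suggest_platform(2, fsm_maintainable=False))
    print(suggest_platform(8, budget_allows_linux=True))
```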

In any case it’s good to use optimizations: newlib-nano for bare metal, and compiler optimizations plus alternative system libraries for the OS.

Now, if you decide to go with Linux on a limited budget, then you’ll need all the size gains that the compiler and the alternative system libraries can offer. In that case you need to verify that all the core programs you need in your rootfs can be compiled with all the above optimizations and that the resulting size is small enough to meet your specs on RAM and storage.

It’s encouraging to know that there’s active development and effort in the right direction. Size matters; it can make the difference and bring Linux into low-budget projects. Of course, Linux is not a panacea for every embedded project and it shouldn’t be, but it has its uses and it can solve problems.

Also, many cheap Linux-enabled SBCs are showing up these days. For example, have a look at the Omega2, which is a $5 SBC with Linux, 64MB memory, 16MB storage, USB, WiFi, I2C, SPI, I2S and 15 GPIOs. On the other side, a cheap stm32f103c8t6 board (Cortex-M3) costs $2.5 and an stm32f407vet6 board costs $7. NXP ARM MCUs are much more expensive, as are Renesas, Cypress and others. Of course, the STM boards have more peripherals and they still have their use, but the future for most mainstream products seems to be the ultra-low-cost, IoT-ready Linux boards, and for that reason we need smaller binaries and better performing compilers and tools.

For those who only develop on small ARM MCUs, don’t worry! There are many reasons why low-end embedded will still be on the market in the coming years, no matter what happens, especially in industries that have to do with health, human life, security, defense, etc., where everything needs to be absolutely deterministic. An RTOS or any other OS is non-deterministic and, regardless of the budget and the costs, they can’t be used in these products as the regulations and approvals are very strict about this.

WiFi digital control DC power supply with web interface and USB

Intro

Welcome to my next stupid project!

Ok, this project is really stupid and I’m probably never going to use it in any of my next projects, but it was fun doing it nevertheless. I have a bunch of these adjustable LM2596 DC-DC boards in one of my “magic” component cabinets; you can find them quite cheap on ebay (~$1.5).

As you can see, it’s composed of a few components like an inductor, capacitors, resistors, etc. You apply a DC input voltage and you get a stepped-down DC output on the other side. You can control the output voltage with a 10KΩ pot (the blue block with the screw on top). That’s great. But… this means that every time you need to change the output voltage you need to turn the screw several times to get there. The good thing is that if you have a good multimeter you can get quite precise output voltages, as the pot is analog. The bad thing is that you need to do manual labor every time you need to change the output voltage. But not anymore.

The idea was simply to replace the analog pot with a digital one and then find a way to control it remotely. So, why not do that using a UART port or, even better, a USB port? Oh, wait… Why not make it WiFi controlled and also give it a web interface? And this is how stupid projects are made. To do that we’ll need the following components.

Components

Digi-pot

There are several digital pots on ebay and they are cheap, but the trick here is to find a digi-pot that has enough wiper points (or steps). Why do you need many steps? The number of steps of a digital pot defines its resolution, so for a 10KΩ pot with 100 steps (like the X9C103P) each step is 10KΩ/100=100Ω. That’s quite coarse when it’s used in a voltage divider like the pot on the LM2596 board. On the other hand, with a digi-pot like the MCP41010, which has 256 steps, each step is 10KΩ/256≈39Ω, so the resolution gets much higher. You can find these Microchip digi-pots on ebay for around $1.5 each.
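
The resolution arithmetic is easy to check with a couple of lines. This is just a sketch of the nominal numbers; it ignores the tolerance and the wiper resistance of the real parts, and the function name is mine.

```python
def ohms_per_step(total_ohms, steps):
    """Nominal resistance change per wiper step of a linear digi-pot."""
    return total_ohms / steps


if __name__ == "__main__":
    for name, steps in (("X9C103P", 100), ("MCP41010", 256)):
        print(f"{name}: {ohms_per_step(10_000, steps):.1f} ohm per step")
```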

The MCP41010 is controlled over an SPI interface, which means that you need a micro-controller. That’s cool, because it means that you can also use an ESP8266 WiFi module and, why not, a USB interface if the controller comes with one.
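
For reference, this is roughly what goes over the SPI bus. As far as I read the MCP41xxx datasheet, a write is a 16-bit frame: a command byte (0x11 means “write data to potentiometer 0”) followed by the 8-bit wiper step. The sketch below only builds the two bytes in Python to show the layout; it’s not the firmware code of this project, and the names are made up for illustration.

```python
# Command byte for the MCP41xxx family: bits C1:C0 = 01 (write data),
# bits P1:P0 = 01 (potentiometer 0), giving 0b00010001.
MCP41010_CMD_WRITE_POT0 = 0x11


def mcp41010_frame(step):
    """Build the 2-byte SPI frame that sets the wiper to `step` (0..255)."""
    if not 0 <= step <= 255:
        raise ValueError("wiper step must be in the range 0..255")
    return bytes([MCP41010_CMD_WRITE_POT0, step])


if __name__ == "__main__":
    # Mid-scale wiper position.
    print(mcp41010_frame(127).hex())
```

The micro-controller would pull CS low, clock out these two bytes MSB first and release CS again.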

Micro-controller

I like ARM processors. My favorite boards are those stm32f103c8t6 ones that you can find on ebay for $2. They are ultra-cheap, they have almost every interface that I need for my stupid projects and they also have a very nice API to program them. The stm32f103 has a USB port, more than the 2 UARTs that are needed and an SPI interface to control the digi-pot.

This board is powered either from the USB connector or from the 5V or 3.3V on-board pins. Never use both of them at the same time!

ESP8266

The ESP8266 shows up again in this stupid project, as in the first one. Copy-paste from the previous post follows: it’s easy to modify the source code with the SDK and you can also use any network-capable device to interact with it. I’ll write a separate post about the ESP8266 in the future and how you can write your own code for these modules. You can find them cheap on ebay for around $2.50. There are a few types of this module with different flash sizes (512KB, 1MB), but the only thing that you should care about now is that it needs to support a 9600 baud rate and not only 115200, because I’m using a software serial library for the arduino that behaves much better at lower baud rates. Be aware that the ESP8266 is a 3V3-only device. This is the module:

Output relay

On the output I’ve used a relay to turn the output of the LM2596 on and off. You may not want to do that, but it’s nice to have. There are many cheap variations of opto-coupled relays on ebay that cost $1-$2, like this one

I prefer the high-level-trigger relays because micro-controllers usually drive their pins low after a reset, which is safe as the relay doesn’t get activated when the MCU is reset. You can connect the positive voltage output of the reference power supply to the COM and NC terminals. Also, make sure that the relay is rated for the DC output you’re going to use, so don’t use a 30V relay to switch 40V. Finally, make sure that the relay can be activated with a 3V3 input trigger.

Step-down (AMS1117-3.3)

You’ll also need a step-down DC power supply module to power the components, like the AMS1117-3.3. There are some cheap pre-soldered AMS1117 modules on ebay; I’ve found 5pcs for less than $1, which means $0.20 each. They look like this

They are very convenient if you are using a breadboard or a double-sided prototype PCB, as they only have 3 pins and all the needed components are on the module. The module also has an LED that indicates when it’s powered.

Prototype breadboard

If you want to assemble the circuit without designing a PCB, then you can buy one of these cheap prototype breadboards on ebay for less than $1.

The image is from a pack of various PCB sizes; you’ll just need one that fits everything you need to solder.

ST-Link

Finally, you need an ST-Link programmer to upload the firmware. I have the original ST-Link V2 and the Segger J-Link programmers but, to be honest, most of the time I’m using one of these cheap ST-Link copies, which I find much more convenient to use in a limited workspace. Of course, I suggest that you buy the original ones, but if you also have limited space then buy one of those from ebay.

Making the stupid project

I haven’t built a PCB for this project; it’s only a proof of concept on my breadboard. These are the schematics of the circuit you need to build.

The stm32 is powered either from the USB port, when it’s connected to the PC, or from the AMS1117-3.3. You need to be careful with that: if you’re going to use a USB adapter or connect the stm32 to your PC, then you need to remove the K1 jumper. If you’re not going to use the USB interface, then place the jumper.

When you unsolder the blue pot from the LM2596 PSU module, you’ll have three empty holes on the PCB. The LM2596-POT1 and LM2596-POT2 terminals are connected to the PCB holes next to the OUT+ output. There are two main power inputs: one is VIN, which is connected to the AMS1117-3.3 and provides power to the circuit, and the other is PSU V+, which comes from the external power supply you’re going to use for the LM2596. Therefore, [LM2596-POT1/2] and [PSU V+ in/out] are connected to the LM2596 PSU. The USB_UART (P1) is not necessary; it’s just the debug UART port.

You can download the source files from my bitbucket project

https://bitbucket.org/dimtass/stm32f103_wifi_usb_psu

Read the README.md file as it has all the details you need, but I’ll still explain some things here. You don’t have to build the code to use it, but I suggest that you do, as you need to change a few parameters in the source files. The pre-built binaries of the latest build are located in the firmware folder, so you can use the ST-Link utility on Windows to flash the hex file or the st-flash utility on Linux to flash the bin file. To upload the firmware you just need to connect the USB cable to the stm32f103 board and run the flashing commands, which are in the README.md file.

If you need to build the code, you’ll find cmake files and scripts for both Windows and Linux. If you’re on Windows, you need to install cmake, make and a gcc toolchain for the ARM Cortex-M3. Read this to see how you can do that.

After you’ve set everything up, all you need to do is run build.cmd (on Windows) or build.sh (on Linux) to build the code. Cmake will create the binaries in the build-stm/src folder, and it will also create the .cproject and .project files in the build-stm folder. This means that you can import the project into your Eclipse CDT IDE and edit the code there.

One of the things you’ll probably have to edit is the IP address definition (HTTP_IP_ADDRESS) in the src/http_server.h file. This is the address that your AP assigns to the ESP8266 module, which means that you need to configure your AP router to always assign the same IP to the MAC address of the ESP8266. Then you can re-build the code and flash the new binary on the stm32. It’s recommended to erase the stm32’s flash before you upload the firmware, as the last 1K of the flash (address: 0x0800FC00) is used to store the configuration data, like the AP SSID, the password and the pre-defined POT values.
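
If you wonder where 0x0800FC00 comes from, it’s simply the start of the last 1K page of the flash. The sketch below redoes the arithmetic, assuming the usual STM32 flash base address of 0x08000000 and the 64KB of flash of the stm32f103c8t6; the constant names are mine and not taken from the firmware.

```python
FLASH_BASE = 0x08000000      # start of the on-chip flash on STM32 parts
FLASH_SIZE = 64 * 1024       # assumption: 64KB part (stm32f103c8t6)
CONFIG_SIZE = 1024           # the last 1K holds the configuration data


def config_address(base=FLASH_BASE, size=FLASH_SIZE, reserved=CONFIG_SIZE):
    """Start address of the configuration area at the end of the flash."""
    return base + size - reserved


if __name__ == "__main__":
    print(f"0x{config_address():08X}")
```

On a part with a different flash size the address would move accordingly, so it’s worth double-checking it against the linker script before flashing.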

The first time that the stm32 powers up after a firmware update, it will try to load the configuration data from the flash. If it doesn’t find a valid configuration then it creates a default one. With the default configuration the stm32 is not able to connect to any AP and the pre-defined POT values are all set to 127 (which means about 5KΩ). You can use the USB or UART interface to update the AP SSID and password in the configuration data. To do that, just connect the stm32 to your computer with a USB cable and open a terminal (I always prefer to use Bray’s terminal). Whatever terminal you use, make sure that LF is sent as CR & LF. Bray’s terminal has that option: check the [CR=LF] checkbox in the settings area and the [+CR] next to the [-> Send] button.

If you have already read the README.md file in the project folder you’ll know which commands to use. As an example, I’ll suppose that the AP SSID is “Router” and the WPA2 password is “MyPassword”; of course, you need to replace these with your own. To update the configuration, send the following commands in the terminal.

SSID=Router
PASS=MyPassword
RECONNECT
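
If you’d rather script this than type it in a terminal, the sketch below builds the same three commands as bytes. I’m assuming here that every command is terminated with CR+LF, based on the terminal settings described above; actually writing the bytes to the serial port (for example with a serial library of your choice) is left out, and the function names are hypothetical.

```python
def wifi_setup_commands(ssid, password):
    """Return the three configuration commands in the order they are sent."""
    return [f"SSID={ssid}", f"PASS={password}", "RECONNECT"]


def encode_commands(commands, eol=b"\r\n"):
    """Encode the commands as ASCII, each terminated by CR+LF (assumed)."""
    return b"".join(cmd.encode("ascii") + eol for cmd in commands)


if __name__ == "__main__":
    payload = encode_commands(wifi_setup_commands("Router", "MyPassword"))
    print(payload)
```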

The first command stores the SSID, the second stores the password and the third one initiates a reconnection to the router. If the last one doesn’t work, just remove and re-apply the power to the stm32. When the stm32 starts, the LED flashes every 250ms while the ESP is not connected, and when it gets connected it flashes every 500ms. When it’s connected you can finally open your browser and connect to the web interface by typing its IP address in the address bar. This is what you’ll get

This is a very simple html page with minimal javascript, which means that the web interface will not automatically restore the PSU state, so you don’t really know if the output is turned on or off and which pre-defined output is in use. Keep that in mind. On some browsers you may need to do a dummy click on a button first, so you can click the OFF button 1-2 times. As you see, the web interface is quite simple. In the first row there are the ‘Power & Trim’ buttons that you use to turn the output relay off or on and to trim the output voltage by changing the digi-pot value. The +/- buttons change the digi-pot step by 1 on every click and the --/++ buttons change the step by 5. You’ll need these buttons to trim the output and save the value to one of the pre-defined values.
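
The trim buttons boil down to adding or subtracting 1 or 5 from the current digi-pot step. Here’s a minimal sketch of that logic. Clamping the result to the 0..255 range is my assumption about what a sane implementation would do; I’m not quoting the firmware here, and the names are made up.

```python
# Step change per trim button, as described above.
BUTTON_DELTAS = {"-": -1, "+": 1, "--": -5, "++": 5}


def apply_button(step, button):
    """Return the new digi-pot step after one click, clamped to 0..255."""
    new_step = step + BUTTON_DELTAS[button]
    return max(0, min(255, new_step))


if __name__ == "__main__":
    step = 127
    for button in ("++", "++", "-"):
        step = apply_button(step, button)
    print(step)
```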

Let’s say that it’s the first time you open the web interface after a new firmware update and a full flash erase, so there isn’t any configuration data. Also suppose that you’ve chosen the pre-defined values (2V5), (3V3), (5V), (6V) and (12V) in the src/http_server.c file. Now you need to do the calibration procedure, as all the pre-defined values are set by default to digi-pot step 127 (about 5KΩ). First connect the input power [IN+/IN-] to the LM2596 (e.g. 12V) and a multimeter to the output to measure the voltage. Then click the OFF button 1-2 times, click the pre-defined value you need to set and click the ON button. On your multimeter you’ll see the output value that corresponds to the digi-pot’s step 127. Now you need to change the output. To do that, keep pressing the (--) or (++) button until you get close to the pre-defined value. Then use the (-/+) buttons to trim further and get the wanted output. Don’t expect the output to be very precise, as 256 steps on a 10K digi-pot still means about 39Ω per step. When you get the nearest value to the wanted one, press the (Save) button and this value will be stored. Then press the next pre-defined button (3V3) and repeat the procedure until you’ve set all the values; this is how the calibration is done. Don’t expect to get 12V on the output with a 12V input, as there’s about a 0.7V drop on the LM2596.

The digi-pot is a linear resistor and not logarithmic like the pot you’ve replaced on the LM2596. This means that if you use a 12V PSU as a reference, you’ll get a better resolution under 5V, but above this voltage the resolution drops significantly.

If for some reason you change the input voltage of the LM2596 then you need to repeat the calibration procedure. It’s also good to do that from time to time to be sure that it’s calibrated, and always check the output with a multimeter before you connect a circuit. If you need another output voltage, just connect your multimeter to the output, press the pre-defined voltage that’s closest to the wanted output and use the (+/-) buttons to get it.

Finally, you can do all these things by using the USB connection and the commands that are explained in the README.md file. You can also use the USB interface to set the exact digi-pot step value like this.

POT=127

You can set any value from 0 to 255.
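
As a small sanity check before sending a value, the sketch below formats the POT command and estimates the nominal resistance for a step on the 10KΩ, 256-step digi-pot. The resistance is only the ideal linear value (it ignores the wiper resistance and the part tolerance), and both function names are hypothetical.

```python
def pot_command(step):
    """Format the POT command for a wiper step in the range 0..255."""
    if not 0 <= step <= 255:
        raise ValueError("POT value must be in the range 0..255")
    return f"POT={step}"


def nominal_ohms(step, total_ohms=10_000, steps=256):
    """Ideal resistance for a step of a linear digi-pot."""
    return step * total_ohms / steps


if __name__ == "__main__":
    print(pot_command(127), f"is about {nominal_ohms(127):.0f} ohm")
```

Step 127 works out to roughly 4961Ω, which is the “5KΩ” default mentioned earlier.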

This is a screenshot of how the web interface actually looks in use.

You can edit the html in the src/http_server.c source file. As it’s quite cryptic in its C formatting, you can right click in your browser and view the page source instead, and then make any changes you like. Just don’t add too many things in there because the interface will take more time to load. You can also change the CSS styles to change the colors and then re-build the code and upload the firmware.

Summary

Well, this is a completely stupid project and it’s quite simple to build. All the needed components cost around $10. This is the list of the components and their approximate prices on ebay (you may find them even cheaper).

LM2596 PSU ~ $1.5
MCP41010 (10KΩ digi-pot) ~ $1.5
STM32F103C8T6 dev board ~ $2
ESP8266-05 module ~ $2.5
Optocouple relay board ~ $1.5
AMS1117-3.3 module ~ $0.20
double sided prototype breadboard ~ $1

This is an extremely bad photo of everything working on a breadboard.

Have fun!