Nanopi-R4S benchmarks with networking optimizations

Intro

In the last post I benchmarked the Nanopi-R2S and Nanopi-R4S. After a comment from tkaiser here, I verified that all IRQs are mapped to CPU0, which means that the default FriendlyWRT images are not optimized. These optimizations are easy to make though, so in this post I'll guide you through them, present the results and compare them with the default images.

SMP IRQ Affinity

SMP IRQ affinity is a Linux kernel feature that lets the user control which CPU handles the IRQ of a given device coming from the interrupt controller. On multi-core CPUs this makes a lot of sense, because the user can change the kernel's default behavior of putting all the work on CPU0. With SMP IRQ affinity the user can use a bit mask to control which core handles a specific interrupt.

The process has two steps. The first is to find the interrupt number of the device whose IRQ handling you want to move to another core. The second is to write the proper bit mask to the smp_affinity entry of that interrupt, which forwards the IRQ handling to the CPU core you want.
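
As a minimal sketch of those two steps (the interface name, IRQ number and mask here are just placeholders; the actual values for the Nanopi-R4S follow later in the post):

# Step 1: find the IRQ number of the device, e.g. the eth0 interface
grep eth0 /proc/interrupts

# Step 2: write the CPU bit mask (in hex) to that IRQ's smp_affinity entry,
# e.g. pin IRQ 27 to CPU4 (bit 4 -> mask 0x10)
echo 10 > /proc/irq/27/smp_affinity

# Verify the new affinity mask
cat /proc/irq/27/smp_affinity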

Another important reason to control the SMP affinity manually is that on multicore SoCs like the RK3399, which mix cores of different architectures, there is a chance that the more powerful cores are not used properly by default. That is actually the case with the RK3399, because here CPU0 is one of the quad Cortex-A53 cores clocked at 1.4GHz, so the faster dual Cortex-A72 cores clocked at 1.8GHz stay idle.

Receive Packet Steering (RPS)

RPS is very well explained here, but I'll try to simplify it a bit more. RPS is a kernel feature that, on multi-core systems, distributes the workload of network packet processing across different cores in order to speed it up. This might not be so important for your desktop, but for a router like the Nanopi-RxS it's very useful to be able to tweak this and pin the task to a specific core.

Again, RPS is controlled with a bit mask that defines which cores are in the kernel's distribution list, and you can also limit it to a single core. Therefore, one of the optimizations we can do is to assign a specific core, preferably one of the fast cores, to this task.
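
As a quick sketch of what that looks like in practice (eth0 and CPU4 are again just example values):

# Show the current RPS mask of the first RX queue of eth0 (0 means RPS is disabled)
cat /sys/class/net/eth0/queues/rx-0/rps_cpus

# Allow packet processing for this queue only on CPU4 (bit 4 -> mask 0x10)
echo 10 > /sys/class/net/eth0/queues/rx-0/rps_cpus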

FriendlyWRT

In this post for both Nanopi-R2S and R4S I’m using the custom FriendlyWRT distro from FriendlyElec. This is the version for Nanopi-R2S:

Linux FriendlyWrt 5.4.61 #1 SMP PREEMPT Fri Sep 4 15:12:58 CST 2020 aarch64 GNU/Linux

And this is the version of Nanopi-R4S:

Linux FriendlyWrt 5.4.75 #1 SMP PREEMPT Tue Nov 10 11:13:15 CST 2020 aarch64 GNU/Linux

Nanopi-R4S optimizations

Hopefully the above explanations are clear enough, so now I'll explain the two optimizations you need to do and how to do them. The Nanopi-R4S is based on the RK3399 SoC, which has a dual Cortex-A72 and a quad Cortex-A53 cluster. In FriendlyWRT the A72 is clocked at 1.8GHz and the A53 at 1.4GHz. Generally, the core clocks are not controlled only by the CPU itself; in Linux the min and max clock speeds are defined via the device-tree when the kernel loads. Therefore the max rated clock for the A72 might be 2.0GHz, but that doesn't mean this will be the max clock on every OS, and as I've mentioned, in the case of FriendlyWRT the max clock is set to 1.8GHz.

So the first optimization is to assign the interrupt handling of the eth0 and eth1 interfaces to a core other than CPU0, preferably the fastest cores. To view the max CPU frequency of each core you can use the following command:

root@FriendlyWrt:~# cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq
1416000
1416000
1416000
1416000
1800000
1800000

From the above result we see that CPU0-3 are the Cortex-A53 cores and CPU4-5 are the Cortex-A72 cores. By default all IRQs are served by CPU0, one of the slower cores, which can create an artificial bottleneck, since there are 2 fast cores sitting there doing nothing.

The next step is to figure out the IRQ numbers of eth0 and eth1. To do that I first ran an iperf test with my laptop as the client, connected to the LAN port of the Nanopi-R4S, and my workstation as the server, connected to the WAN port.

Then I printed /proc/interrupts to view the IRQ counters, and this is the result.

root@FriendlyWrt:~# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       
 15:      59197      27202      24660      23678      44323      32887     GICv3  30 Level     arch_timer
 17:       2733      16708      15550       8386      12996      11459     GICv3 113 Level     rk_timer
 18:          0          0          0          0          0          0  GICv3-23   0 Level     arm-pmu
 19:          0          0          0          0          0          0  GICv3-23   1 Level     arm-pmu
 20:          0          0          0          0          0          0     GICv3  37 Level     ff6d0000.dma-controller
 21:          0          0          0          0          0          0     GICv3  38 Level     ff6d0000.dma-controller
 22:          0          0          0          0          0          0     GICv3  39 Level     ff6e0000.dma-controller
 23:          0          0          0          0          0          0     GICv3  40 Level     ff6e0000.dma-controller
 24:          1          0          0          0          0          0     GICv3  81 Level     pcie-sys
 26:          0          0          0          0          0          0     GICv3  83 Level     pcie-client
 27:   10628015          0          0          0          0          0     GICv3  44 Level     eth0
 28:      28962          0          0          0          0          0     GICv3  97 Level     dw-mci
 29:          0          0          0          0          0          0     GICv3  58 Level     ehci_hcd:usb1
 30:          0          0          0          0          0          0     GICv3  60 Level     ohci_hcd:usb3
 31:          0          0          0          0          0          0     GICv3  62 Level     ehci_hcd:usb2
 32:          0          0          0          0          0          0     GICv3  64 Level     ohci_hcd:usb4
 33:          0          0          0          0          0          0     GICv3  94 Level     ff100000.saradc
 34:          0          0          0          0          0          0     GICv3  91 Level     ff110000.i2c
 35:          0          0          0          0          0          0     GICv3  67 Level     ff120000.i2c
 36:          0          0          0          0          0          0     GICv3  68 Level     ff160000.i2c
 38:        106          0          0          0          0          0     GICv3 132 Level     ttyS2
 39:          0          0          0          0          0          0     GICv3 129 Level     rockchip_thermal
 40:       3231          0          0          0          0          0     GICv3  89 Level     ff3c0000.i2c
 41:          0          0          0          0          0          0     GICv3  88 Level     ff3d0000.i2c
 44:          0          0          0          0          0          0     GICv3 147 Level     ff650800.iommu
 45:          0          0          0          0          0          0     GICv3  87 Level     ff680000.rga
 47:          0          0          0          0          0          0     GICv3 151 Level     ff8f3f00.iommu, ff8f0000.vop
 48:          0          0          0          0          0          0     GICv3 150 Level     ff903f00.iommu, ff900000.vop
 49:          0          0          0          0          0          0     GICv3  75 Level     ff914000.iommu
 50:          0          0          0          0          0          0     GICv3  76 Level     ff924000.iommu
 51:          0          0          0          0          0          0     GICv3  55 Level     ff940000.hdmi
 65:          0          0          0          0          0          0  rockchip_gpio_irq   5 Edge      GPIO Key Power
 67:          0          0          0          0          0          0  rockchip_gpio_irq   7 Edge      fe320000.dwmmc cd
113:          0          0          0          0          0          0  rockchip_gpio_irq  21 Level     rk808
114:          0          0          0          0          0          0  rockchip_gpio_irq  22 Edge      K1
166:         16          0          0          0          0          0  rockchip_gpio_irq  10 Level     stmmac-0:01
220:          0          0          0          0          0          0     GICv3  59 Level     rockchip_usb2phy
222:          0          0          0          0          0          0   ITS-MSI   0 Edge      PCIe PME, aerdrv
223:          0          0          0          0          0          0     GICv3 137 Level     xhci-hcd:usb5
224:          0          0          0          0          0          0     GICv3 142 Level     xhci-hcd:usb7
230:          0          0          0          0          0          0     rk808   5 Edge      RTC alarm
234:     855861          0          0          0          0          0   ITS-MSI 524288 Edge      eth1
IPI0:     11514     109149      55900      39843      31597     748488       Rescheduling interrupts
IPI1:       468       5518     259096     128778    4150594      21564       Function call interrupts
IPI2:         0          0          0          0          0          0       CPU stop interrupts
IPI3:         0          0          0          0          0          0       CPU stop (for crash dump) interrupts
IPI4:       976       2231       1880       1877       2373       2655       Timer broadcast interrupts
IPI5:     16670       6729       7059       4912      23162      13914       IRQ work interrupts
IPI6:         0          0          0          0          0          0       CPU wake-up interrupts
Err:          0

From the above output you can see that IRQ 27 is mapped to the eth0 interface and IRQ 234 to eth1. Now we can use the proper bit masks to assign those IRQs to other cores.

In this case I'll assign IRQ 27 (eth0) to CPU4 and IRQ 234 (eth1) to CPU5. Both CPU4 and CPU5 are the Cortex-A72 cores, the fast ones (1.8GHz). We need to do the same for the RPS of both interfaces. Conveniently, the bit masks are the same for both cases (IRQ affinity and RPS): the mask for eth0 is 0x10 and for eth1 it's 0x20, as the following table shows.

        CPU5   CPU4   CPU3   CPU2   CPU1   CPU0
eth0       0      1      0      0      0      0
eth1       1      0      0      0      0      0

As you can see in the above table, every cell is a bit, so we have 6 bits for the 6 cores. Therefore 01 0000 (= 0x10 in hex) is the bit mask for eth0 and 10 0000 (= 0x20) is the bit mask for eth1.
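
If you don't want to count bits by hand, the shell can compute the masks for you (just a convenience sketch):

# Mask for CPU4 (eth0)
printf '%x\n' $((1 << 4))    # prints 10
# Mask for CPU5 (eth1)
printf '%x\n' $((1 << 5))    # prints 20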

Then you need to create this script and make it executable.

#!/bin/sh
# CPU4 and CPU5 are the 1.8GHz cores
# Set CPU4 to handle eth0 IRQs
echo 10 > /proc/irq/27/smp_affinity
echo 10 > /sys/class/net/eth0/queues/rx-0/rps_cpus

# Set CPU5 to handle eth1 IRQs
echo 20 > /proc/irq/234/smp_affinity
echo 20 > /sys/class/net/eth1/queues/rx-0/rps_cpus

The first two lines write the mask to smp_affinity, which controls which core the IRQ is assigned to, and to rps_cpus, which controls which core does the network packet processing. Note that both files interpret the written value as a hexadecimal mask, so the 10 and 20 here mean 0x10 and 0x20.

The next two lines do the same for the eth1 interface.

In FriendlyWRT you can add a service in `/etc/init.d/` that calls this script if you want these changes to take effect when the system boots.
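
A minimal sketch of such a service, assuming the script above was saved as /root/net-affinity.sh (the name, path and START order here are my own choices):

#!/bin/sh /etc/rc.common
# /etc/init.d/net-affinity - apply the IRQ/RPS affinity masks at boot
START=99

start() {
    /root/net-affinity.sh
}

Then you enable it with `/etc/init.d/net-affinity enable` so it runs on every boot.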

Nanopi-R4S benchmarks

In the previous post here I ran several benchmarks using iperf, but I found that the two benchmarks that are most demanding for the device are the TCP test with two parallel streams and the UDP test, both run in both directions. This time I will run these tests using the default MTU of 1500 and also with 512 bytes, as this seems to be the size that various protocols prefer. The MTU defines the largest packet size that will be transmitted over the network.

The MTU change needs to be done on both the iperf server and the client, so in my case I had to use this command on both my workstation and my laptop.

sudo ip link set dev IF_NAME mtu 512

where IF_NAME is the name of your network interface, e.g. eth0, enp3s0, etc. You can verify the interface MTU with this command:

ip link list

The iperf server IP in these examples is 192.168.0.2 and the server is connected on the WAN port. The iperf client is 192.168.2.126 and it’s connected on the LAN port.

For the TCP test the commands I’ve used for the server and client are:

Server (WAN) iperf -s
Client (LAN) iperf -c 192.168.0.2 -t 120 -d -P 2

For the UDP test the commands I’ve used for the server and client are:

Server (WAN) iperf -s -u
Client (LAN) iperf -c 192.168.0.2 -u -t 120 -b 1000M -d

Nanopi-R4S benchmarks

These are the results for the Nanopi-R4S with and without using the network optimizations and using 512 and 1500 MTU.

MTU    TCP/UDP   Default (Mbits/sec)   Optimized (Mbits/sec)
1500   TCP       941                   941
1500   UDP       808                   742
512    TCP       575                   588
512    UDP       556                   (854)

This is the /proc/interrupts after the benchmark, to verify the core assignment to the IRQs.

           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
 27:        152          0          0          0   22505343          0     GICv3  44 Level     eth0
234:        171          0          0          0          0    9048229   ITS-MSI 524288 Edge      eth1

Viewing those results, I don't see any real benefit from assigning the IRQ and RPS to the fast cores; the throughput seems to be the same. The last UDP result in the optimized case, 854 Mbits/sec, is way faster than the non-optimized one, but I don't really trust it, because I'm getting a weird warning when the test ends.

WARNING: did not receive ack of last datagram after 10 tries.

Also, I'm not getting any stats regarding the acknowledged received data, and my gut feeling is that the UDP packets are actually lost when the MTU is set to 512. I think this needs more investigation, because it shouldn't happen.

Therefore, given these results and ignoring the result in the parentheses, I would say that these optimizations don’t really benefit the network performance of the Nanopi-R4S.

Nanopi-R2S optimizations

It seems that the FriendlyWRT distro for the Nanopi-R2S is already properly optimized by default: the eth0 IRQ is assigned to CPU1 and the `xhci-hcd:usb4` IRQ to CPU2, as you can see here:

           CPU0       CPU1       CPU2       CPU3
 29:          0     473493          0          0     GICv2  56 Level     eth0
167:          0          0      81033          0     GICv2  99 Level     xhci-hcd:usb4

In case you're not aware, the second GbE port of the Nanopi-R2S is a USB-to-GbE adapter connected to a USB 3.0 port, which in this case is `xhci-hcd:usb4`.

For that reason I don't see any point in re-running the benchmarks for this case.

Conclusions

As I've mentioned, in my opinion the network optimization of assigning different cores to the network interface interrupts doesn't really benefit the network performance of the Nanopi-R4S. I'm also not sure about the warning I get in the last UDP test with the MTU set to 512, where it seems the data actually gets lost, so I don't consider that a valid result.

Be aware that there might be other network optimizations which I’m not aware of, so this post might be incomplete.

Also the nice thing is that the FriendlyWRT distro for the Nanopi-R2S seems to have those optimizations already in place.

Personally, I would still apply the optimizations described in this post to the Nanopi-R4S, because it makes total sense to have them there. Keep in mind that in this test the device is just running the default distro, so there's no external hard drive connected, no extra services, etc. Normally, on a fully utilized system there will be more services running in the background, so it's a good strategy to have those optimizations anyway.

In any case, personally I like the Nanopi-R4S and I find that its performance is good enough for my needs. I hope this post helped if you own the device.

Have fun!

Benchmarking the NanoPi R4S

Intro

Note: I've written a complementary post on how to do some networking optimizations here.

This week I’ve received the new NanoPi R4S from FriendlyElec for evaluation purposes.

My original plan was to create a Yocto BSP layer for the board, as I also did for the NanoPi R2S in a recent post here. The way I usually create Yocto BSP layers for these SBCs is to use parts of the Armbian build tool, integrate them into bitbake recipes and then add them to a layer. The problem this time is that Armbian hasn't released support for this board yet, so I thought it was a good chance to benchmark the board itself.

Let’s first see the specs of the board.

NanoPi R4S specs

The board is based on the Rockchip RK3399 and the specs of the specific board I’ve received are:

  • Rockchip RK3399 SoC
    • 2x Cortex-A72 @1.8GHz
    • 4x Cortex-A53 @1.4GHz
    • Mali-T864 GPU
    • VPU capable of 4K VP9 and 4K 10bits H265/H264 60fps decoding
  • 4GB LPDDR4 RAM
  • RK808-D PMIC
  • 2x GbE ports
  • 2x USB 3.0 ports
  • Extension pin-headers
    • 2x USB 2.0
    • 1x SPI
    • 1x I2C
    • 1x UART (console)
    • RTC Battery
  • USB Type-C connector for power

What makes the board special is of course the dual GbE. One interface is integrated in the SoC and the other is a PCIe GbE which is connected on the SoC’s PCIe bus.

As you can guess, this board is meant to be used in custom router configurations, and that's the reason there's already a custom OpenWRT image for it, named FriendlyWRT. In my tests I've used this image, and actually the exact image that came on the SD card with the board. More details about the versions later.

The board I received also came with an aluminum case, which has a dual purpose: housing, of course, and also cooling the CPU. This is the SBC I received.

The Nanopi-R4S's case is very compact and only a bit bigger than that of its predecessor, the Nanopi-R2S, but it also packs more horsepower under the hood. You can see how the two compare in size.

As you can see in the above image I’ve done a modification on the case by drilling a hole above the USB power connector. I’ve done this hole in order to be able to use the UART console while the case is closed.

The Nanopi-R2S has the same issue, as you can't connect the UART console while the case is closed. Therefore, I've made the same modification to both cases. This is very easy to do, as aluminum is soft and easy to drill.

Here’s an image of the Nanopi-R2S with the case open.

I also had to use a cutter to trim the top of the dupont connectors in order for the case to close properly.

In the Nanopi-R2S I’ve drilled the hole above the reset button and on the Nanopi-R4S above the USB power connector.

Another thing I need to mention here is that I also had to change the thermal pad on the Nanopi-R2S, because it comes with a 0.5mm pad and I noticed the temperature was a bit high. When I changed it for a 1mm pad, things were much better. I guess you can use any thermal pad, but in my case I used this one that I got from eBay.

The thermal pad of the Nanopi-R4S seems to be fine so I haven’t changed that.

Test setup

The setup I've used for the benchmarks is very simple and only uses standard equipment that anyone can buy cheap. So, no expensive or smart Ethernet switches and no expensive cables. I think that way the results will reflect the most common scenario, which is using the board in a home or small office environment.

The switch I’ve used is the TP-Link TL-SG108, which is an 8-port GbE switch and it costs approx. 20 EUR here in Germany, which I believe is cheap. The cables I’ve used are some ultra-cheap 1m CAT6 which I use for my tests.

I've done two kinds of tests. The first tests only the WAN interface, using my workstation, which has an onboard GbE interface. This is the setup.

As you can see from the above diagram, both the workstation and the WAN interface of the Nanopi-RxS (I mean both R2S and R4S) are connected to the GbE switch. The LAN interface, which by default is bridged internally in FriendlyWRT, is connected to my laptop, which uses a USB-to-GbE adapter as it doesn't have an Ethernet connector.

I've tested this USB-to-GbE adapter thoroughly in many cases and it's perfectly capable of 1 Gbit, so don't worry about it, it won't affect the results.

Someone could argue here that the switch should be placed after the LAN port and not before the WAN port, because the WAN is meant to be connected to your ADSL/VDSL router. Well, that's one option, but this is also a valid configuration for many setups. For example, I prefer to have the WiFi router before the WAN and some of my devices after the LAN, so the devices connected before the WAN don't have access to the LAN devices via the bridge interface.

My workstation is a Ryzen 2700X and the Ethernet is an onboard interface on the ASRock Fatal1ty X470 Gaming K4 motherboard with the latest firmware at this date. The kernel and OS are the following:

PRETTY_NAME="Ubuntu 18.04.5 LTS"
Linux workstation 5.9.10-050910-generic #202011221708 SMP Sun Nov 22 18:07:21 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

The Laptop is a Lenovo 330S-15ARR with a Ryzen 2500u and the following kernel and OS.

PRETTY_NAME="Ubuntu 20.04.1 LTS"
Linux laptop 5.3.16-050316-generic #201912130343 SMP Fri Dec 13 08:45:06 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

The USB-to-GbE is based on the RTL8153.

Bus 002 Device 004: ID 0bda:8153 Realtek Semiconductor Corp. RTL8153 Gigabit Ethernet Adapter

This is the kernel and OS of the Nanopi-R4S

PRETTY_NAME="OpenWrt 19.07.4"
Linux FriendlyWrt 5.4.75 #1 SMP PREEMPT Tue Nov 10 11:13:15 CST 2020 aarch64 GNU/Linux

And this is the kernel and OS of the Nanopi-R2S

PRETTY_NAME="OpenWrt 19.07.1"
Linux FriendlyWrt 5.4.61 #1 SMP PREEMPT Fri Sep 4 15:12:58 CST 2020 aarch64 GNU/Linux

 

With this setup I've used iperf to benchmark the WAN interface and the bridged interface. The WAN interface gets an IP from the DHCP router, which is also connected to the GbE switch, and the Nanopi-RxS runs its own DHCP server for the bridge interface in the 192.168.2.0 range. Therefore the laptop gets an IP address in the 192.168.2.x range but still has access to all the IPs on the WAN side. For that reason, in the bridge tests the laptop always acts as the client.

Before seeing the benchmark results, this is the table of the IP addresses.

Workstation 192.168.0.2
Nanopi-R4S (WAN) 192.168.0.62
Nanopi-R4S (LAN) 192.168.2.1
Nanopi-R2S (WAN) 192.168.0.63
Laptop 192.168.2.128

I’ll do 6 different tests, which are described in the following table.

Test #   iperf client cmd         Description
1        -t 120                   TCP, 2 mins, default window size
2        -t 120 -w 65536          TCP, 2 mins, 128KB window size
3        -t 120 -w 131072         TCP, 2 mins, 256KB window size
4        -t 120 -d -P 2           TCP, 2 mins, default window size, 2x parallel
5        -u -t 120 -b 1000M       UDP, 2 mins, 1 Gbit/s target bandwidth
6        -u -t 120 -b 1000M -d    UDP, 2 mins, 1 Gbit/s target bandwidth, both directions

2 mins means that the test lasts for 2 minutes (120 secs), which is a pretty good duration as it means approx. 13GB of data transferred on a GbE link.
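
That 13GB figure follows directly from the line rate; roughly:

$941\,\mathrm{Mbit/s} \times 120\,\mathrm{s} \approx 112920\,\mathrm{Mbit} \approx 14.1\,\mathrm{GB} \approx 13.2\,\mathrm{GiB}$

which matches the ~13.2 GBytes that iperf reports (iperf prints GBytes in binary units).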

I've also tried several TCP window sizes, but usually the default size is the one you should focus on.

Finally, I've also added a TCP test with 2x parallel sockets, and for UDP I've tested both directions (server/client).

Let’s see the benchmarks now.

Nanopi-R4S WAN benchmarks

Laptop -> Switch -> Workstation

The first benchmark is to test the speed between the workstation and the Laptop on the GbE switch to verify the max speed this setup can achieve. These are the results:

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  289 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.78 port 54456 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.2 GBytes   941 Mbits/sec

For each test I list the commands I used on the server and the client, followed by the result that iperf outputs.

So, in this case we can see that the setup maximizes at 941 Mbits/sec.

Workstation -> Switch -> Nanopi-R4S (WAN)

Test #1

Nanopi-R4S iperf -s
Workstation iperf -c 192.168.0.62 -t 120
------------------------------------------------------------
Client connecting to 192.168.0.62, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.2 port 59820 connected with 192.168.0.62 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.2 GBytes   942 Mbits/sec

Test #2

Nanopi-R4S iperf -s
Workstation iperf -c 192.168.0.62 -t 120 -w 65536
------------------------------------------------------------
Client connecting to 192.168.0.62, TCP port 5001
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local 192.168.0.2 port 59850 connected with 192.168.0.62 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.2 GBytes   941 Mbits/sec

Test #3

Nanopi-R4S iperf -s -w 131072
Workstation iperf -c 192.168.0.62 -t 120 -w 131072
------------------------------------------------------------
Client connecting to 192.168.0.62, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  128 KByte)
------------------------------------------------------------
[  3] local 192.168.0.2 port 59854 connected with 192.168.0.62 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.1 GBytes   941 Mbits/sec

Test #4

Nanopi-R4S iperf -s
Workstation iperf -c 192.168.0.62 -t 120 -d -P 2
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.62, TCP port 5001
TCP window size:  391 KByte (default)
------------------------------------------------------------
[  5] local 192.168.0.2 port 32882 connected with 192.168.0.62 port 5001
[  4] local 192.168.0.2 port 32880 connected with 192.168.0.62 port 5001
[  6] local 192.168.0.2 port 5001 connected with 192.168.0.62 port 55152
[  7] local 192.168.0.2 port 5001 connected with 192.168.0.62 port 55154
[ ID] Interval       Transfer     Bandwidth
[  8]  0.0-120.0 sec  6.45 GBytes   462 Mbits/sec
[  6]  0.0-120.0 sec  6.46 GBytes   462 Mbits/sec
[  4]  0.0-120.5 sec  6.47 GBytes   461 Mbits/sec
[  5]  0.0-120.5 sec  6.46 GBytes   461 Mbits/sec
[SUM]  0.0-120.5 sec  12.9 GBytes   922 Mbits/sec

Test #5

Nanopi-R4S iperf -s -u
Workstation iperf -c 192.168.0.62 -u -t 120 -b 1000M
------------------------------------------------------------
Client connecting to 192.168.0.62, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.2 port 50200 connected with 192.168.0.62 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec
[  3] Sent 9764940 datagrams
[  3] Server Report:
[  3]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec   0.000 ms 2127956392/2137718708 (0%)

Test #6

Nanopi-R4S iperf -s -u
Workstation iperf -c 192.168.0.62 -u -t 120 -b 1000M -d
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.62, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  4] local 192.168.0.2 port 40576 connected with 192.168.0.62 port 5001 (peer 2.0.13)
[  3] local 192.168.0.2 port 5001 connected with 192.168.0.62 port 48176
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec
[  4] Sent 9765245 datagrams
[  3]  0.0-120.0 sec  9.11 GBytes   652 Mbits/sec   0.027 ms    0/6653321 (0%)
[  3] WARNING: ack of last datagram failed after 10 tries.
[  4] Server Report:
[  4]  0.0-120.0 sec  12.9 GBytes   926 Mbits/sec   0.000 ms 2128274496/2137718403 (0%)
[  4] 0.00-120.00 sec  1 datagrams received out-of-order

Nanopi-R4S WAN results

As you can see from the above benchmarks the Nanopi-R4S WAN interface is fully capable of GbE speed, which is really cool. I’ve also used the FriendlyWRT web interface to get some screenshots of the various Luci statistics using the default enabled collectd sensors. These are the results.

As you can see the CPU usage is ~35%, but most of the load is because of the docker daemon running in the background. Also note that the temperature is 60C without the heatsink and less than 40C with it. This happened because I added the heatsink in the middle of the test, but that turned out to be a good thing, because it shows the difference the heatsink makes.

As you can see the WAN interface reaches the max GbE speed of my setup, which is really great.

The consumption when Nanopi-R4S is idle is 0.44A at 5V and 1A during test #6.

Nanopi-R4S bridge benchmarks

So now let's test the bridged network interface, which is actually the most interesting benchmark for this device, as it shows its capabilities in the real-case scenario.

Workstation -> Switch -> Nanopi-R4S (WAN) -> Nanopi-R4S (Br0) -> Laptop (USB-to-GbE)

Test #1

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  204 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.126 port 39014 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.1 GBytes   940 Mbits/sec

Test #2

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120 -w 65536
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local 192.168.2.126 port 39196 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  9.50 GBytes   680 Mbits/sec

Test #3

Workstation iperf -s -w 131072
Laptop iperf -c 192.168.0.2 -t 120 -w 131072
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  128 KByte)
------------------------------------------------------------
[  3] local 192.168.2.126 port 39240 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  10.6 GBytes   758 Mbits/sec

Test #4

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120 -d -P 2
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  153 KByte (default)
------------------------------------------------------------
[  4] local 192.168.2.126 port 49010 connected with 192.168.0.2 port 5001
[  5] local 192.168.2.126 port 49012 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-120.0 sec  6.66 GBytes   476 Mbits/sec
[  5]  0.0-120.0 sec  6.49 GBytes   465 Mbits/sec
[SUM]  0.0-120.0 sec  13.2 GBytes   941 Mbits/sec

Test #5

Workstation iperf -s -u
Laptop iperf -c 192.168.0.2 -u -t 120 -b 1000M

Client connecting to 192.168.0.2, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.126 port 57264 connected with 192.168.0.2 port 5001
[  3] WARNING: did not receive ack of last datagram after 10 tries.
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec
[  3] Sent 9761678 datagrams

Test #6

Workstation iperf -s -u
Laptop iperf -c 192.168.0.2 -u -t 120 -b 1000M -d
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.2, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  4] local 192.168.2.126 port 39169 connected with 192.168.0.2 port 5001 (peer 2.0.10-alpha)
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-120.0 sec  13.4 GBytes   956 Mbits/sec
[  4] Sent 9757183 datagrams
[  4] Server Report:
[  4]  0.0-120.0 sec  7.84 GBytes   561 Mbits/sec   0.000 ms 2131996332/2137726465 (1e+02%)
[  4] 0.0000-120.0257 sec  1 datagrams received out-of-order

Nanopi-R4S bridge results

As you can see from the above benchmarks, the Nanopi-R4S can max out the GbE speed of my setup in bridged mode as well. You may notice that the speed dropped when using custom TCP window sizes, which I guess is because of mismatched window sizes inside the bridge, but I don't pay much attention to this as the default TCP window size works fine.

These are some screenshots from the web interface and the sensors.

As you can see the CPU load is less than 30%, and again most of the load is because of the docker daemon running in the background. Also, the temperature doesn't exceed 38C, which means the heatsink works really well.

Again, my personal opinion is that the Nanopi-R4S reaches the maximum performance of my network in bridge mode as well. Excellent.

Nanopi-R2S WAN Benchmarks

This post would be incomplete without comparative benchmarks between the Nanopi-R4S and the Nanopi-R2S. Therefore I've also run the same benchmarks on the R2S, for both the WAN and the bridge interface. These are the WAN results.

Test #1

Nanopi-R2S iperf -s
Workstation iperf -c 192.168.0.63 -t 120
------------------------------------------------------------
Client connecting to 192.168.0.63, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.2 port 60864 connected with 192.168.0.63 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.1 GBytes   941 Mbits/sec

Test #2

Nanopi-R2S iperf -s
Workstation iperf -c 192.168.0.63 -t 120 -w 65536
------------------------------------------------------------
Client connecting to 192.168.0.63, TCP port 5001
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local 192.168.0.2 port 60878 connected with 192.168.0.63 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.1 GBytes   937 Mbits/sec

Test #3

Nanopi-R2S iperf -s -w 131072
Workstation iperf -c 192.168.0.63 -t 120 -w 131072
------------------------------------------------------------
Client connecting to 192.168.0.63, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  128 KByte)
------------------------------------------------------------
[  3] local 192.168.0.2 port 60884 connected with 192.168.0.63 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.1 GBytes   938 Mbits/sec

Test #4

Nanopi-R2S iperf -s
Workstation iperf -c 192.168.0.63 -t 120 -d -P 2
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.63, TCP port 5001
TCP window size:  246 KByte (default)
------------------------------------------------------------
[  5] local 192.168.0.2 port 60890 connected with 192.168.0.63 port 5001
[  4] local 192.168.0.2 port 60888 connected with 192.168.0.63 port 5001
[  6] local 192.168.0.2 port 5001 connected with 192.168.0.63 port 42238
[  7] local 192.168.0.2 port 5001 connected with 192.168.0.63 port 42240

[ ID] Interval       Transfer     Bandwidth
[  6]  0.0-120.0 sec  5.65 GBytes   405 Mbits/sec
[  4]  0.0-120.0 sec  5.12 GBytes   366 Mbits/sec
[  5]  0.0-120.1 sec  4.68 GBytes   335 Mbits/sec
[SUM]  0.0-120.1 sec  9.79 GBytes   701 Mbits/sec
[  8]  0.0-120.0 sec  7.29 GBytes   521 Mbits/sec

Test #5

Nanopi-R2S iperf -s -u
Workstation iperf -c 192.168.0.63 -u -t 120 -b 1000M
------------------------------------------------------------
Client connecting to 192.168.0.63, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.2 port 39818 connected with 192.168.0.63 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec
[  3] Sent 9765428 datagrams
[  3] Server Report:
[  3]  0.0-120.0 sec  8.06 GBytes   577 Mbits/sec   0.000 ms 2131827393/2137718220 (0%)

Test #6

Nanopi-R2S iperf -s -u
Workstation iperf -c 192.168.0.63 -u -t 120 -b 1000M -d
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.63, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  4] local 192.168.0.2 port 50365 connected with 192.168.0.63 port 5001 (peer 2.0.13)
[  3] local 192.168.0.2 port 5001 connected with 192.168.0.63 port 58210
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec
[  4] Sent 9765269 datagrams
[  3]  0.0-120.0 sec  7.56 GBytes   541 Mbits/sec   0.040 ms    0/5520546 (0%)
[  3] WARNING: ack of last datagram failed after 10 tries.
[  4] Server Report:
[  4]  0.0-120.0 sec  8.05 GBytes   576 Mbits/sec   0.000 ms 2131838458/2137718379 (0%)
[  4] 0.00-120.00 sec  1 datagrams received out-of-order

Nanopi-R2S WAN results

As you can see from the above benchmarks, although the TCP test with the default window size reaches the maximum performance, the performance drops in the TCP parallel test.

These are the sensor graphs from the web interface.

As you can see the processor usage is ~60%, which is double that of the Nanopi-R4S, but again the docker daemon uses ~30% of the CPU. Also, the temperature seems to be higher than on the Nanopi-R4S, even with the better thermal pad I've used.

Nanopi-R2S bridge benchmarks

Let's now see the benchmarks of the bridge interface of the Nanopi-R2S.

Test #1

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  196 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.126 port 43028 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  9.51 GBytes   681 Mbits/sec

Test #2

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120 -w 65536
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local 192.168.2.126 port 43030 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  5.04 GBytes   360 Mbits/sec

Test #3

Workstation iperf -s -w 131072
Laptop iperf -c 192.168.0.2 -t 120 -w 131072
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local 192.168.2.126 port 43030 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  5.04 GBytes   360 Mbits/sec

Test #4

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120 -d -P 2
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  153 KByte (default)
------------------------------------------------------------
[  4] local 192.168.2.126 port 43042 connected with 192.168.0.2 port 5001
[  5] local 192.168.2.126 port 43044 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-120.0 sec  4.41 GBytes   315 Mbits/sec
[  4]  0.0-120.0 sec  4.63 GBytes   331 Mbits/sec
[SUM]  0.0-120.0 sec  9.04 GBytes   647 Mbits/sec

Test #5

Workstation iperf -s -u
Laptop iperf -c 192.168.0.2 -u -t 120 -b 1000M
------------------------------------------------------------
Client connecting to 192.168.0.2, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.126 port 52107 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec
[  3] Sent 9762311 datagrams
[  3] Server Report:
[  3]  0.0-120.0 sec  6.42 GBytes   460 Mbits/sec   0.000 ms 2133031099/2137721337 (1e+02%)

Test #6

Workstation iperf -s -u
Laptop iperf -c 192.168.0.2 -u -t 120 -b 1000M -d
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.2, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.126 port 54616 connected with 192.168.0.2 port 5001 (peer 2.0.10-alpha)
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  12.4 GBytes   890 Mbits/sec
[  3] Sent 9081045 datagrams
[  3] Server Report:
[  3]  0.0-120.2 sec   917 MBytes  64.0 Mbits/sec   0.000 ms 2137748578/2138402602 (1e+02%)
[  3] 0.0000-120.2481 sec  1 datagrams received out-of-order

Nanopi-R2S bridge results

As you can see from the above benchmarks, the Nanopi-R2S has trouble reaching the maximum performance of the network, and this seems to be an issue with the board's internal USB-to-GbE adapter. Nevertheless, some of you may find these results sufficient for your use case.

These are the sensor graphs from the web interface.

The consumption when Nanopi-R2S is idle is 0.3A at 5V and 0.58A during test #6.

Nanopi-R4S vs Nanopi-R2S

OK, some of you may want to choose between these two boards. I won't go into too much detail about the specific differences between them; instead I'll just add a table with the test results and a few notes after that.

Test   Nanopi-R4S (Mbits/sec)   Nanopi-R2S (Mbits/sec)
       WAN     Br0              WAN     Br0
#1     942     940              941     681
#2     941     680              937     360
#3     941     758              938     360
#4     922     941              701     647
#5     957     957              577     460
#6     926     561              576     64

From the above table you can see that the new Nanopi-R4S is faster than the R2S, especially when it comes to UDP transfers. In most cases the R4S reaches the maximum network capacity, while the R2S can't. Therefore the R4S clearly outperforms the R2S.

        Nanopi R4S                 Nanopi R2S
pros    + Performance              + Consumption
        + Up to 4GB RAM            + Smaller size
        + Nice case                + Nice case
        + Doesn't get hot          + Price
cons    – Price                    – Performance
        – A bit larger than R2S    – It gets a bit hot

The R4S costs $45 (1GB) and the R2S $22.

Conclusions

As I’ve mentioned, I’ve received this board from FriendlyArm as a sample for evaluation purposes. My plan was to create a Yocto BSP layer for this, but since the Armbian support is not ready yet I’ve decided to do a benchmark and a comparison between the R4S and the R2S.

Personally, I like the performance of this board and I think it's nice to use as a home router. Just keep in mind that compared to the R2S the performance is better, but it also consumes more power, as the RK3399 is a more power-hungry SoC than the RK3328.

Hope this post was useful.

Have fun!

Hacking a Sonoff S20 and adding an ACS712 current sensor

Intro

Warning! This post is only meant to document my work on how to hack the Sonoff S20. This device is connected directly to 110/220V and if you try to do this on your own you risk your life. Do not attempt to open your device or follow this guide unless you are a trained electronics or electrical engineer.

It's been quite a long time since I've hacked a device for the blog. This time I did this hack mainly for the next post in the "Using Elastic stack (ELK) on embedded" series. In both of the previous posts I used a Linux SBC that runs an Elastic Beat client and publishes data to a remote Elasticsearch server. But for the next (and maaaybe the last) post I want to use an example of a baremetal embedded device, like the ESP8266. Therefore I had two choices: either just write a generic and simple firmware for an ESP module (e.g. ESP-01 or ESP-12E), or, even better, use an actual mainstream product and create an example of a real IoT application.

For that reason I decided to use the Sonoff S20 smart socket that I use to turn my 3D printer on and off. A smart socket is just a device with a single power socket and an internal relay that controls power to the load; in this case the relay is controlled by an ESP8266. In my setup, a Raspberry Pi 3 running OctoPi and the printer are both connected to a power strip that is plugged into the S20. When I want to print, I turn on the socket from my smartphone, the RPi boots and the printer powers up. When the printing is done, OctoPi sends a termination message to the S20 and then shuts itself down. Then, 20 seconds after receiving the message, the S20 turns off the power to everything.

The current setup doesn't make much sense to use with the Elastic stack, so I decided to add a current sensor to the Sonoff S20 and monitor the current that the printer draws, then publish the data to the Elasticsearch server.

I've also decided to split the hacking from the ELK part, so this post only documents the hack. Again, I have to tell you not to try this yourself if you're not a trained professional.

Components

Sonoff S20

According to this wiki, the Sonoff S20 is a WiFi wireless smart socket that can connect any home appliance or electric device via WiFi, allowing you to control it remotely with the iOS/Android app eWeLink. The device looks like this:

Since this device uses an ESP8266, it's hackable and you can write and upload your own firmware. There are also many clones, and to be honest it's difficult to know whether you're buying an original or a clone, especially on eBay. The schematics are also available here. The best-known firmware for the Sonoff S20 is the open-source Tasmota firmware that you can find here. Of course, as you can guess, I'm using my own custom firmware, because I only need some basic and specific functionality and I don't need all the bloat that comes with the other firmwares.

ACS712

The ACS712 is a fully integrated, Hall-effect-based linear current sensor IC with 2.1 kVRMS isolation and a low-resistance current conductor. The IC only needs very few components around it, and you can find it really cheap on eBay on a small breakout PCB, like this one:

This IC comes in 3 flavors, rated at 5A, 20A and 30A. I'm using the 5A one, which means the IC can sense up to 5A of AC/DC current and converts the input current to a voltage with a 185 mV/A step. That means that 1A corresponds to 185mV and 5A to 925mV (roughly 1V).

The ugly truth

OK, here you might think that this is quite useless. Well, you're right, he he, but it's within the blog's specs, so it's good to go. For those wondering why it's useless: the 185 mV/A sensitivity of the ACS712 means that the whole 5A range spans only 925mV. Since the range is -5A to +5A, according to the IC specs the output goes from ~1.5V at -5A, through 2.5V at 0A, up to ~3.5V at +5A.

Also, the ESP8266 ADC input is limited by design to 1V maximum, and at the same time the ESP8266 ADC is quite noisy. Therefore the accuracy is not good at all. On top of that, when it comes to AC currents, 1A means 220VA and 5A (roughly 1V at the sensor output) means 1100VA, which is far above the consumption of a 3D printer, which is ~600W max with the heat-bed at max temperature. So the readings from the ACS712 are not really accurate, but they're enough to see what's going on.

Also, keep in mind that with the ACS712 you can only measure current, which is not enough to calculate real consumption, as you also need the actual VAC on your socket, which is not exactly 220V and actually fluctuates a lot. Ideally you would have another ADC measuring the output of the transformer, which is proportional to the AC mains, and use that value to calculate real power.

Nevertheless, it’s fun to do it, so I’ll do it and use this device in the next post to publish the ACS712 readings to the Elasticsearch server.

Hacking the Sonoff S20

Well, hacking sounds really awesome, but in reality it was too easy to even call it hacking, but anyway. So, the first problem I faced was that the ESP8266 is a 3V3 device and the ACS712 is a 5V device. Also, the ACS712 is one of those 5V-rated devices that really doesn't work at 3V3. Therefore you actually need 5V, but the problem is that there isn't any 5V on the circuit.

First let’s have a look on my bench.

I've used a USB "microscope" to check the board and also take some pictures, but those cheap microscopes are so crappy that in the end I just used the magnifying lamp to do the soldering…

As you can find out from the schematics here, there is a 220V to 6.2V transformer on the PCB and then there’s an AMS1117-3.3 regulator to provide power to the low voltage circuit.

Therefore in order to provide the 5V I’ve soldered an AMS1117-5 on top of the 3V3 regulator.

That’s it, an AMS1117 sandwich. OK, I know, you’re thinking that this is a bad idea because now there’s no proper heat dissipation from the bottom regulator and it’s true. But I can live with that, as the heat is not that much. Generally though, this would be a really bad idea in other cases.

The next thing was to desolder the load cable (red) from the PCB, solder it to the IP+ pin of the ACS712 and then solder an extension cable from IP- back to the PCB.

Then I removed the green terminal block and the pin header from the ACS712 PCB so it takes up as little space as possible, and then I soldered the GND and the 5V.

In this picture the regulator sandwich is also more visible. I've used low-resistance wire-wrapping wire.

The next problem was that the ADC input on the ESP8266 is limited to a max of 1V, while the output of the ACS712 can be up to 3.5V (let's assume 5V just to be safe). Therefore I used a voltage divider on the output with a 220Ω and a 56Ω resistor, which means that at 5V I get 1.014V. For that I used some 0805 1% SMD resistors, soldered them on the ACS712 PCB and soldered a wire between them.
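
Just to double-check the divider with those values:

$V_x = V_o \cdot \frac{R_2}{R_1 + R_2} = 5\,\mathrm{V} \cdot \frac{56}{220 + 56} \approx 1.014\,\mathrm{V}$

so even a worst-case 5V at the sensor output lands right around the 1V limit of the ESP8266 ADC, while the realistic ~3.5V maximum of the ACS712 maps to about 0.71V.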

Then I used the hot silicone glue gun to create a blob that holds the wires and prevents them from breaking.

I guess the most difficult part was soldering the wire from the voltage divider to the ADC pin of the ESP8266, because the pin on the IC is barely visible. Although I use a quite big, flat soldering tip on my Weller, the trick I use is to melt a bit of solder on the wire end, place the wire on the pin and finally touch it quickly with the iron tip. I can't really explain it better, but for me it works fine.

That’s quite clean soldering. Then I’ve used some hot silicone glue to attach the ACS712 PCB on the relay so it doesn’t fly around the case.

Last thing was to solder the pin header for the UART connector that I’ve used for flashing and debugging the firmware.

From top to bottom the pinout is GND, TX, RX and VCC. Be aware that if you use a USB-to-UART module to power the ESP8266, it needs to be a 3V3 module and not 5V, and you must never connect the VCC pin of the module while the Sonoff is on the mains. At first I used the 3V3 of the UART module to flash the device while it was disconnected from the mains, and after I verified that everything worked as expected, I removed the VCC wire from the module and flashed the device while it was connected to the mains.
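
For reference, flashing over this header with esptool would look something like the following. This is only an illustration, since the serial port, flash offset and binary name depend on your setup and build system; on the Sonoff the push button is on GPIO0, so holding it down while applying 3V3 power puts the ESP8266 in bootloader mode.

# Example only: adjust the port and firmware binary to your setup
esptool.py --port /dev/ttyUSB0 --baud 115200 write_flash 0x0 sonoff-acs712-firmware.bin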

So this is the Sonoff before I close the case

And here is with the case closed.

As a test load I used a hair dryer rated up to 2200W; it has a cold and a heat mode as well as different fan and heating settings, so it was easy to test with various currents. Of course, using the hair dryer's max power with the ACS712-5A is not a good idea, as the dryer can draw up to 10A.

Short firmware description

OK, so now let's have a look at the source code that I've used for testing the ACS712 and the Sonoff. You can find the repo here:

https://gitlab.com/dimtass/esp8266-sonoff-acs712-test-firmware
https://github.com/dimtass/esp8266-sonoff-acs712-test-firmware
https://bitbucket.org/dimtass/esp8266-sonoff-acs712-test-firmware

Reading the ADC value is the easy part, but there are a couple of things you need to do in order to convert that value into something meaningful. Since I'm measuring AC current, a single snapshot of the ADC value is meaningless, as the signal oscillates around zero and also takes negative values. In this case we need the RMS (root-mean-square) value. Before continuing further, I've compiled this list of equations.
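
Since the list itself doesn't survive in text form here, this is my reconstruction of it, based on the description that follows and the constants used in the code:

(1)  $I_{rms} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} I_i^2}$
(2)  $I = \frac{\Delta V_o}{\alpha}, \quad \alpha = 185\,\mathrm{mV/A}$
(3)  $V_x = V_o \cdot \frac{R_2}{R_1 + R_2}$
(4)  $\Delta V_x = \frac{\Delta ADC}{1023}$
(5)  $\Delta ADC = ADC - ADC_0$
(6)  $I = \Delta ADC \cdot \frac{1}{1023} \cdot \frac{R_1 + R_2}{R_2} \cdot \frac{1}{\alpha}$
(7)  $I = k \cdot (ADC - ADC_0)$
(8)  $k = \frac{1}{1023} \cdot \frac{R_1 + R_2}{R_2} \cdot \frac{1}{\alpha}$
(9)  $k = \frac{1}{1023} \cdot \frac{220 + 56}{56} \cdot \frac{1}{0.185} \approx 0.026$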

I'm not going to explain everything, as these are quite basic equations. (1) is the RMS formula: the square root of the mean of the squared current samples. The trick here is that you square each value, so it also works for negative values as they all add up. Then, according to the ACS712 datasheet, in equation (2) α is the sensitivity of 185mV/A, which means that for each ampere the output voltage changes by 185mV.

Equation (3) gives Vx, the output of the voltage divider, where Vo is the instantaneous value of the ACS712 output. Equation (4) states that Vx is the ADC value divided by 1023, which holds because the max input voltage is 1V and the ADC is 10-bit. You can also see that it's actually the delta of the ADC value that is used, and in (5) the delta is the ADC value minus the ADC0 value.

ADC0 is the sampled ADC value of the ACS712 output when no current is flowing. Normally you would expect that to be 511, which is half of the 1023 max value of the 10-bit ADC, but you actually need to calibrate your device to find this value. I'll explain how to do this later.

In equation (6) I've just substituted (3), (4) and (5) into (2). Finally, in (7) I've replaced the constant factors with k to simplify the equation, and in (8) and (9) you can see what the constants are and their values.

Therefore, in my case k is always 0.026, but I haven't hard-coded that value; as you can see in the following snippet, you can change those values to whatever fits your setup.

#define CALIBRATION_VALUE 444
#define ACS712_MVA    0.185   // This means 185mV/A, see the datasheet for the correct value
#define R1            220.0
#define R2            56.0
#define ADC_MAX_VAL   1023.0

float k_const = (1/ADC_MAX_VAL) * ((R1 + R2) / R2) * (1/ACS712_MVA);
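For reference, this is roughly how those constants end up in an Irms calculation; a minimal sketch based on the equations above (the function name and the sampling rate are made up for illustration and are not copied from the repo):

#define NUM_SAMPLES 5000

float calculate_irms(void)
{
  float sum_sq = 0.0f;
  for (int i = 0; i < NUM_SAMPLES; i++) {
    // Delta between the current sample and the calibrated zero-current value (ADC0)
    float delta = (float)analogRead(A0) - CALIBRATION_VALUE;
    // Instantaneous current: I = k * dADC
    float current = k_const * delta;
    sum_sq += current * current;   // accumulate I^2
    delayMicroseconds(200);        // example sampling interval, not the firmware's timer
  }
  // RMS: square root of the mean of the squared samples
  return sqrt(sum_sq / NUM_SAMPLES);
}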

Calibration

Now in order to calibrate your device you need to enable the calibration mode in the code.

#define CALIBRATION_MODE

First you need to build the firmware and then remove any load from the Sonoff S20. When the firmware starts it will display the sampled ADC value. The code samples the output every 1ms and after 5 secs (5000 samples) it calculates the average value. This ADC average value is the ADC0 value. Normally, according to the ACS712 datasheet, the output at zero current should be 2.5V, but this may not always be true. This is the output of the firmware in my case:

Calibrating with 5000 samples.
Calibration value: 431
Calibration value: 431
Calibration value: 431
Calibration value: 431
Calibration value: 431

Based on that output, I’ve defined the CALIBRATION_VALUE in my code to be 431.

#define CALIBRATION_VALUE 431

Testing the firmware

Now, make sure that CALIBRATION_MODE is not defined in the code and re-build and re-flash. After flashing is finished, the green LED flashes while the Sonoff tries to connect to the WiFi router and it stays constantly on once it’s connected. By default the relay is always off for safety reasons.

The firmware supports a REST API that is used to turn the relay on/off and also get the current Irms value. When the Sonoff is connected to the network, you can use your browser to control the relay. In the following URLs you need to change the IP address to the one that is right for your device. To turn the relay on, paste the next URL in your web browser and hit Enter.

http://192.168.0.76/relay?params=1

Then you’ll see this response in your browser:

{"return_value": 1, "id": "", "name": "", "hardware": "esp8266", "connected": true}

To turn off the relay use this URL:

http://192.168.0.76/relay?params=0

Then you’ll get this response from the server (ESP8266)

{"return_value": 0, "id": "", "name": "", "hardware": "esp8266", "connected": true}

As you can see, the params variable in the relay URL defines the state of the relay, where 0 is OFF and 1 is ON. The return value in the server’s response is the actual value of the relay_status variable in the code. The REST callback in the code just calls a function that controls the relay and then returns the variable. There’s no true feedback other than the blue LED on the device that indicates the relay is on, so be aware of that. This is the related code:

int restapi_relay_control(String command);

void SetRelay(bool onoff)
{
  if (onoff) {
    relay_status = 1;
    digitalWrite(GPIO_RELAY, HIGH);
  }
  else {
    relay_status = 0;
    digitalWrite(GPIO_RELAY, LOW);
  }
  Serial.println("Relay status: " + String(relay_status));
}

ICACHE_RAM_ATTR int restapi_relay_control(String command)
{
  SetRelay(command.toInt());
  return relay_status;
}
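For context, the JSON responses above follow the aREST library’s format, so the callback is presumably registered as a REST endpoint more or less like this (a sketch under that assumption, not the exact code from the repo):

#include <aREST.h>

aREST rest = aREST();
float rms_value = 0.0;

void setup_rest(void)
{
  // Exposes http://<ip>/relay?params=<0|1> and maps it to the callback above
  rest.function("relay", restapi_relay_control);
  // Exposes the latest RMS reading at http://<ip>/rms_value
  rest.variable("rms_value", &rms_value);
}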

Finally, you can use the REST API to retrieve the RMS value of the current on the connected load. To do so, browse to this URL

http://192.168.0.76/rms_value

The response will be something like this:

{"rms_value": 0.06, "id": "", "name": "", "hardware": "esp8266", "connected": true}

As you can see the rms_value field in the response is the RMS current in Amperes. This is the response when the hair dryer is on

{"rms_value": 6.43, "id": "", "name": "", "hardware": "esp8266", "connected": true}

As you can see the RMS value now is 6.43A, which is more than the 5A limit of the ACS712-5A! That’s not good, I know. Don’t do that. In my case, I’ve only used the highest setting of the hair dryer for 1-2 seconds on purpose, which may not be enough to harm the IC. It seems that the overcurrent transient tolerance of the ACS712 is 100A for 100ms, which is quite high, therefore I hope that 6.5A for 2 secs is not enough to kill it.

Last thing regarding this firmware is that the Sonoff button is used to toggle the relay status. The button is using debouncing which is set to 25ms by default, but you can change that in the code here:

buttons[GPIO_BUTTON].attach( BUTTON_PINS[0] , INPUT_PULLUP  );
buttons[GPIO_BUTTON].interval(25);
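For completeness, the toggle itself then only needs to check the debounced falling edge in the main loop; something along these lines (a sketch using the same Bounce2-style API, not the exact code from the repo):

void handle_button(void)
{
  buttons[GPIO_BUTTON].update();        // refresh the debouncer state
  if (buttons[GPIO_BUTTON].fell()) {    // debounced button press detected
    SetRelay(!relay_status);            // toggle the relay
  }
}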

Conclusions

In this post, I’ve explained how to “hack” a Sonoff-S20 smart socket and documented the procedure. You shouldn’t do this if you’re not a trained electronics or electrical engineer, because mains AC can kill you.

To sum things up, I’ve provided a firmware that you can use to calibrate your ACS712 and the ADC readings, and in normal operation you can also use it to control the relay and read the RMS current value. To switch between modes you need to re-build and re-flash the firmware. The reason for this is just to keep the firmware simple and be done with it as fast as possible. Of course, it could be done in a way that lets you switch between the two modes at runtime, for example using the onboard button (which toggles the relay by default) or a REST command. I leave that as an exercise (I really like this motto when people are bored to do things).

As I’ve mentioned, this post is just a preparation for the next one, which will use the Sonoff S20 as a node agent that publishes the RMS current and the relay status to a remote Elasticsearch server. Since this hack is out of the scope of the next post, I’ve decided to write it up separately, as it’s quite a long process, but also fun.

Normally, I’m using this Sonoff S20 for my 3D printer with a custom REST command to toggle the power from my smartphone or the OctoPi server that runs on a RPi. I guess the 3D printer’s consumption is not that high to get any meaningful data to show, but I’ll try it anyway. I hope you liked this simple hack, and don’t do this at home.

Have fun!

Using Elastic stack (ELK) on embedded [part 2]

Intro

Note: This is a series of posts on using the ELK on embedded. Here you can find part1.

In the previous post on ELK on embedded I’ve demonstrated the simplest example you can use. I’ve set up an Elasticsearch and a Kibana server using docker on my workstation and then I’ve used this meta layer on a custom Yocto image on the nanopi-k1-plus to build the official metricbeat and test it. This was a really simple example, but at the same time it’s very useful because you can use the Yocto layer to build beats for your custom image and any ARM architecture using the Yocto toolchain.

On this post things will get a bit more complicated, but at the same time I’ll demonstrate a full custom solution to use ELK on your custom hardware and with your custom beat. For this post, I’ve chosen the STM32MP157C dev kit, which I’ve presented here and here. This will make things even more complicated and I’ll explain later why. So, let’s have a look at the demo system architecture.

System Architecture

The following block diagram shows this project’s architecture.

As you can see from the above diagram, most things remain the same as in part-1 and the only thing that changes is the client. The extra complexity is also on that client, so let’s see what the client does exactly.

The STM32MP1 SoC integrates a Cortex-M4 (CM4) and a Cortex-A7 (CA7) on the same SoC and both have access to the same peripherals and address space. In this project I’m using 4x ADC channels which are available on the bottom Arduino connector of the board. The channels I’m using are A0, A1, A2 and A3. I’m also using a DMA stream to copy the ADC samples to memory with double buffering. Finally, the ADC peripheral is triggered by a timer. The sampled values are then sent to the CA7 using the OpenAMP IPC. Therefore, as you can see, the firmware is already complex enough.

At the same time the CA7 CPU is running a custom Yocto Linux image. On the user space, I’ve developed a custom elastic beat that reads the ADC data from the OpenAMP tty port and then publishes the data to a remote Elasticsearch server. Finally, I’m using a custom Kibana dashboard to monitor the ADC values using a standalone X96mini.

As you can see the main complexity is mostly on the client, which is the common scenario that you’re going to deal with in embedded. Next I’ll explain all the steps needed to achieve this.

Setting up an Elasticsearch and Kibana server

This step has been explained in the previous post with enough detail, therefore I’ll save some space and time; you can read how to set it up here. You can use the part-2 folder of the repo, though it’s almost the same. The only thing I’ve changed is the Elasticsearch and Kibana versions.

Proceed with the guide on the previous post, until the point that you verify that the server status is OK.

http://localhost:5601/status

The above URL is for checking the status from the server itself; you need to use the server’s IP when checking from the web client (the X96mini in this case).

Cortex-M4 firmware

So, let’s have a look at the firmware of the CM4. You can find the firmware here:

https://gitlab.com/dimtass/stm32mp1-rpmsg-adcsampler
https://github.com/dimtass/stm32mp1-rpmsg-adcsampler
https://bitbucket.org/dimtass/stm32mp1-rpmsg-adcsampler

In the repo’s README file you can read more information on how to build the firmware, but since I’m using Yocto I won’t get into those details. The important files of the firmware are the source/src_hal/main.c and `source/src_hal/stm32mp1xx_hal_msp.c`.  Also in main.h you’ll find the main structures I’m using for the ADCs.

enum en_adc_channel {
  ADC_CH_AC,
  ADC_CH_CT1,
  ADC_CH_CT2,
  ADC_CH_CT3,
  ADC_CH_NUM
};

#define NUMBER_OF_ADCS 2

struct adc_dev_t {
  ADC_TypeDef         *adc;
  uint8_t             adc_irqn;
  void (*adc_irq_handler)(void);
  DMA_Stream_TypeDef  *dma_stream;
  uint8_t             dma_stream_irqn;
  void (*stream_irq_handler)(void);
};

struct adc_channel_t {
  ADC_TypeDef         *adc;
  uint32_t            channel;
  GPIO_TypeDef        *port;
  uint16_t            pin;
};

extern struct adc_channel_t adc_channels[ADC_CH_NUM];
extern struct adc_dev_t adc_dev[NUMBER_OF_ADCS];

The `adc_dev_t` struct contains the details for the ADC peripheral, which in this case is the ADC2, and the adc_channel_t contains the channel configuration. As you can see both are declared as arrays, because the CM4 has 2x ADCs and I’m also using 4x channels on the ADC2. Both structs are initialized in the main.c

uint16_t adc1_values[ADC1_BUFFER_SIZE];  // 2 channels on ADC1
uint16_t adc2_values[ADC2_BUFFER_SIZE];  // 3 channels on ADC2


struct adc_channel_t adc_channels[ADC_CH_NUM] = {
  [ADC_CH_AC] = {
    .adc = ADC2,
    .channel = ADC_CHANNEL_6,
    .port = GPIOF,
    .pin = GPIO_PIN_14,
  },
  [ADC_CH_CT1] = {
    .adc = ADC2,
    .channel = ADC_CHANNEL_2,
    .port = GPIOF,
    .pin = GPIO_PIN_13,
  },
  [ADC_CH_CT2] = {
    .adc = ADC2,
    .channel = ADC_CHANNEL_0,
    .port = GPIOA,
    .pin = GPIO_PIN_0,
  },
  [ADC_CH_CT3] = {
    .adc = ADC2,
    .channel = ADC_CHANNEL_1,
    .port = GPIOA,
    .pin = GPIO_PIN_1,
  },
};


struct adc_dev_t adc_dev[NUMBER_OF_ADCS] = {
  [0] = {
    .adc = ADC2,
    .adc_irqn = ADC2_IRQn,
    .adc_irq_handler = &ADC2_IRQHandler,
    .dma_stream = DMA2_Stream1,
    .dma_stream_irqn = DMA2_Stream1_IRQn,
    .stream_irq_handler = &DMA2_Stream1_IRQHandler,
  },
  [1] = {
    .adc = ADC1,
    .adc_irqn = ADC1_IRQn,
    .adc_irq_handler = &ADC1_IRQHandler,
    .dma_stream = DMA2_Stream2,
    .dma_stream_irqn = DMA2_Stream2_IRQn,
    .stream_irq_handler = &DMA2_Stream2_IRQHandler,
  },
};

I find the above way the easiest and most descriptive to initialize such structures in C. I’m using those structs in the rest of the code in the various functions, and because those structs are generic it’s easy to handle them with pointers.

In case of the STM32MP1 you need to be aware that the pin-mux for both the CA7 and the CM4 is configured using the device tree. Therefore, the configuration is done when the kernel boots and furthermore you need to plan which pin is used by each processor. Also, in order to be able to run this firmware you need to enable the PMIC access from the CM4, but I’ll get back to that later when I describe how to use the Yocto image.

The next important part of the firmware is the Timer, ADC and DMA configuration, which is done in `source/src_hal/stm32mp1xx_hal_msp.c`. The timer is initialized in HAL_TIM_Base_MspInit() and MX_TIM2_Init(). The timer is only used to trigger the ADC at a constant period; every time the timer expires, it triggers an ADC conversion and then reloads.

The ADC is initialized and configured in the HAL_ADC_MspInit() and Configure_ADC() functions. The DMA_Init_Channel() is called for every channel. Finally, the interrupts from the ADC/DMA are handled in HAL_ADC_ConvHalfCpltCallback() and HAL_ADC_ConvCpltCallback(). The reason there are two interrupts is that I’m using double buffering and each callback corresponds to one half of the buffer: the half-complete callback fires when the first half is filled and the complete callback when the second half is filled. This means that there is enough time, before the same interrupt triggers again, to process the half that was just filled and send it to the CA7 using OpenAMP.
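In practice the double buffering comes almost for free when the DMA stream is configured in circular mode and the conversion is started on the whole buffer; roughly like this (a simplified sketch, the handle name hadc2 is an assumption and not taken from the repo):

/* Simplified sketch: with the DMA in circular mode, a single start call is
 * enough. The HAL then calls HAL_ADC_ConvHalfCpltCallback() when the first
 * half of adc2_values is filled and HAL_ADC_ConvCpltCallback() when the
 * second half is filled, and the DMA wraps around automatically. */
void start_adc2_sampling(void)
{
  HAL_ADC_Start_DMA(&hadc2, (uint32_t *)adc2_values, ADC2_BUFFER_SIZE);
}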

You need to be aware, though, that OpenAMP is a slow IPC which is meant for control and not for exchanging fast or big data. Therefore, you need to have a quite slow timer that triggers the ADC conversions, otherwise the interrupts will be faster than the OpenAMP transfers. To solve this you can use a larger buffer pool and send the data asynchronously in the main loop, rather than inside the interrupt. In this case, though, I’m just sending the ADC values inside the interrupt using a slow timer, for simplicity. Also, there’s no reason to flood the Elasticsearch server with ADC data.

There’s also a smarter way to handle the flow rate of ADC values from the STM32MP1 to the Elasticsearch server. You can have a tick timer that sends ADC values at constant, not too frequent, intervals, and if you want to be able to “catch” important changes in the values you can have an algorithmic filter that buffers the changes that fall outside of configurable limits and then reports only those values to the server. You can also use average values if that’s applicable to your case. A minimal sketch of such a filter follows.
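This is just an illustration of the idea and not part of the firmware; the threshold and the tick limit are arbitrary values:

#include <stdint.h>

/* Illustration only: report a sample set to the CA7 only when it differs from
 * the last reported one by more than a configurable threshold, or when a
 * maximum reporting interval has elapsed. */
#define ADC_CHANGE_THRESHOLD  50      /* in raw ADC counts, arbitrary value */
#define MAX_SILENCE_TICKS     1000    /* force a report after this many ticks */

static uint16_t last_reported[ADC_CH_NUM];
static uint32_t silent_ticks;

int should_report(const uint16_t *samples)
{
  for (int i = 0; i < ADC_CH_NUM; i++) {
    int diff = (int)samples[i] - (int)last_reported[i];
    if (diff < 0) diff = -diff;
    if (diff > ADC_CHANGE_THRESHOLD) {
      silent_ticks = 0;
      for (int j = 0; j < ADC_CH_NUM; j++) last_reported[j] = samples[j];
      return 1;   /* significant change: send this buffer */
    }
  }
  if (++silent_ticks >= MAX_SILENCE_TICKS) {
    silent_ticks = 0;
    for (int j = 0; j < ADC_CH_NUM; j++) last_reported[j] = samples[j];
    return 1;     /* heartbeat report */
  }
  return 0;       /* nothing interesting, skip */
}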

Finally, this is the part where the CM4 firmware transmits the ADC values in a string format to the CA7

void HAL_ADC_ConvCpltCallback(ADC_HandleTypeDef *hadc)
{
  
  /* Update status variable of DMA transfer */
  ubDmaTransferStatus = 1;
  
  /* Set LED depending on DMA transfer status */
  /* - Turn-on if DMA transfer is completed */
  /* - Turn-off if DMA transfer is not completed */
  BSP_LED_On(LED7);

  if (hadc->Instance == ADC1) {
    sprintf((char*)virt_uart0.tx_buffer, "ADC[1.2]:%d,%d,%d,%d\n",
      adc1_values[0], adc1_values[1], adc1_values[2], adc1_values[3]);
    printf((char*)virt_uart0.tx_buffer);
  }
  else if (hadc->Instance == ADC2) {
    sprintf((char*)virt_uart0.tx_buffer, "ADC[2.2]:%d,%d,%d,%d\n",
      adc2_values[0], adc2_values[1], adc2_values[2], adc2_values[3]);
    printf((char*)virt_uart0.tx_buffer);
  }

  virt_uart0.rx_size = 0;
  virt_uart0_expected_nbytes = 0;
  virt_uart0.tx_size = strlen((char*)virt_uart0.tx_buffer);
  virt_uart0.tx_status = SET;
}

As you can see the format looks like this (a small parsing sketch follows the list below):

ADC[x.y]:<ADC1>,<ADC2>,<ADC3>,<ADC4>\n

where:

  • x, is the ADC peripheral [1 or 2]
  • y, is the double-buffer index: 1 = first half, 2 = second half
  • <ADCz>, is the 12-bit value of the ADC sample
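On the receiving side a line in this format can be parsed quite easily; here is a small C sketch for illustration (the actual project parses it in Go inside the beat, as shown in the next section):

#include <stdio.h>

/* Illustration only: parses one "ADC[x.y]:v1,v2,v3,v4" line.
 * Returns 6 when the whole line was matched. */
int parse_adc_line(const char *line, int *adc, int *half, unsigned int *val)
{
  return sscanf(line, "ADC[%d.%d]:%u,%u,%u,%u",
                adc, half, &val[0], &val[1], &val[2], &val[3]);
}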

adcsamplerbeat

Now that we have the CM4 firmware, the next thing we need is the elastic beat client that sends the ADC sample values to the Elasticsearch server. Since this is a custom application there’s no beat available that meets our needs, therefore we need to create our own! Doing this is actually quite easy; all you need is the tool that comes with the elastic beats repo. There’s also a guide on how to create a new beat here.

My custom beat repo for this post is this one here:

https://github.com/dimtass/adcsamplerbeat

Note: This guide is for v7.9.x-v8.0.x and it might be different in other versions!

First you need to setup your Golang environment.

export GOPATH=/home/$USER/go
cd /home/$USER/go
mkdir -p src/github.com/elastic
cd src/github.com/elastic
git clone https://github.com/elastic/beats.git
cd beats

Now that you’re in the beats directory you need to run the following command in order to create a new beat

mage GenerateCustomBeat

This will start an interactive process that you need to fill some details about your custom beat. The following are the ones that I’ve used, but you need to use your own personal data.

Enter the beat name [examplebeat]: adcsamplerbeat
Enter your github name [your-github-name]: dimtass
Enter the beat path [github.com/dimtass/adcsamplerbeat]: 
Enter your full name [Firstname Lastname]: Dimitris Tassopoulos
Enter the beat type [beat]: 
Enter the github.com/elastic/beats revision [master]:

After the script is done, the tool has already created a new folder with your custom beat, so go in there and have a look.

cd $GOPATH/src/github.com/dimtass/adcsamplerbeat

As you can see there are already many things in the folder, but don’t worry, not all of them are important to us. Actually, there are only 3 important files. Let’s have a look at the configuration file first, which is located in `config/config.go`. In this file you need to add your custom configuration variables. In this case I need this Go beat to be able to open a Linux tty port and retrieve data, therefore I need a serial Go module. I’ve used the tarm/serial module which you can find here.

So, to open a serial port I need the device path in the filesystem, the baudrate and the read timeout. Of course you could add more configuration like parity, bit size and stop bits, but in this case that’s not important as it’s not a generic beat and the serial port configuration is static. Therefore, to add the needed configuration parameters I’ve edited the config/config.go file and added these parameters:

type Config struct {
    Period 			time.Duration `config:"period"`
    SerialPort		string	`config:"serial_port"`
    SerialBaud		int		`config:"serial_baud"`
    SerialTimeout	time.Duration	`config:"serial_timeout"`
}

var DefaultConfig = Config{
    Period: 1 * time.Second,
    SerialPort: "/dev/ttyUSB0",
    SerialBaud: 115200,
    SerialTimeout: 50,
}

As you can see there’s a default config struct which contains the default values, but you can override those values using the yaml configuration file, as I’ll explain in a bit. Now that you’ve edited this file you need to run a command that creates all the necessary code based on this configuration. To do so, run this:

make update

The next important file is where the magic happens and that’s the `beater/adcsamplerbeat.go` file. This contains the main code of the beat, so you can add your custom functionality in there. You can have a detailed look at the file here, but the interesting code is this one:

ticker := time.NewTicker(bt.config.Period)
    counter := 1
    for {
        select {
        case <-bt.done:
            return nil
        case <-ticker.C:
        }

        buf := make([]byte, 512)
        
        n, _ = port.Read(buf)
        s := string(buf[:n])
        s1 := strings.Split(s,"\n")	// split new lines
        if len(s1) > 2 && len(s1[1]) > 16 {
            fmt.Println("s1: ", s1[1])
            s2 := strings.SplitAfterN(s1[1], ":", 2)
            fmt.Println("s2: ", s2[1])
            s3 := strings.Split(s2[1], ",")
            fmt.Println("adc1_val: ", s3[0])
            fmt.Println("adc2_val: ", s3[1])
            fmt.Println("adc3_val: ", s3[2])
            fmt.Println("adc4_val: ", s3[3])
            adc1_val, _ := strconv.ParseFloat(s3[0], 32)
            adc2_val, _ := strconv.ParseFloat(s3[1], 32)
            adc3_val, _ := strconv.ParseFloat(s3[2], 32)
            adc4_val, _ := strconv.ParseFloat(s3[3], 32)

            event := beat.Event {
                Timestamp: time.Now(),
                Fields: common.MapStr{
                    "type":    b.Info.Name,
                    "counter": counter,
                    "adc1_val": adc1_val,
                    "adc2_val": adc2_val,
                    "adc3_val": adc3_val,
                    "adc4_val": adc4_val,
                },
            }
            bt.client.Publish(event)
            logp.Info("Event sent")
            counter++
        }
    }

Before this code, I’m just opening the tty port and sending some dummy data to trigger the port, and then everything happens in the above loop. In this loop the serial module reads data from the serial port, parses the interesting information, which is the ADC sample values, and constructs a beat event. Finally it publishes the event to the Elasticsearch server.

In this case I’m using a small trick. Since the data coming from the serial port arrive quite fast, the serial read buffer usually contains 3 or 4 samples, maybe more. For that reason I’m always parsing the second one (index 1), and the reason for that is to avoid having to deal with more complex buffer parsing, because the beginning and the end of the buffer may not be complete. That way I’m just splitting on newlines and dropping the first and the last strings, which may be incomplete. Because this is a trick, you might prefer a better implementation, but in this case I just needed a fast prototype.

Finally, the last important file is the configuration yaml file, which in this case is `adcsamplerbeat.yml`. This file contains the configuration that overrides the default values and also the generic configuration of the beat, which means that you need to configure the IP of the remote Elasticsearch server as well as any other configuration that is available for the beat, e.g. authentication etc. Be aware that the period parameter in the yml file refers to how often the beat client connects to the server to publish its data. This means that if the beat client collects 12 events per second then it will connect only once per second and publish all of that data.

For now this configuration file is not that important because I’m going to use Yocto that will override the whole yaml file using the package recipe.

Building the Yocto image

Using Yocto is the key to this demonstration and the reason is that by using a Yocto recipe you can build any elastic beat for your specific architecture. For example, even for the official beats there are only aarch64 pre-built binaries and there aren’t any armhf or armel ones, but in this case the STM32MP1 is an armhf CPU, therefore it wouldn’t even be possible to find binaries for this CPU. Therefore Yocto comes in handy and we can use its superpowers to also build our custom beat.

For the STM32MP1 I’ve created a BSP base layer that simplifies a lot the Yocto development and you can find it here:

https://gitlab.com/dimtass/meta-stm32mp1-bsp-base
https://github.com/dimtass/meta-stm32mp1-bsp-base
https://bitbucket.org/dimtass/meta-stm32mp1-bsp-base

I’ve written thorough details on how to use it and build an image in the repo README file, therefore I’ll skip this step and focus on the important stuff. First you need to build the stm32mp1-qt-eglfs-image image (well, you could also build another one, but this is what I did for other reasons which are irrelevant to this post). Then you need to flash the image on the STM32MP1. I didn’t add the firmware and the adcsamplerbeat recipes in the image; instead I’ve just built the recipes, then scp’ed the deb files to the target and installed them using dpkg, which works just fine for developing and testing.

The firmware for the CM4 is already included in the meta-stm32mp1-bsp-base repo and the recipe is the `meta-stm32mp1-bsp-base/recipes-extended/stm32mp1-rpmsg-adcsampler/stm32mp1-rpmsg-adcsampler_git.bb`.

But you need to add the adcsamplerbeat recipe yourself. For that reason I’ve created another meta recipe layer, which is this one:

https://gitlab.com/dimtass/meta-adcsamplebeat
https://github.com/dimtass/meta-adcsamplebeat
https://bitbucket.org/dimtass/meta-adcsamplebeat

To use this recipe you need to add it to your sources folder and then also add it to the bblayers. First clone the repo in the sources folder.

cd sources
git clone https://gitlab.com/dimtass/meta-adcsamplebeat.git

Then, in the build folder where you’ve run the setup-environment.sh script as explained in my BSP base repo, run this command

bitbake-layers add-layer ../sources/meta-adcsamplebeat

Now you should be able to build the adcsamplerbeat for the STM32MP1 with this command

bitbake adcsamplerbeat

After the build finishes the deb file should be in `build/tmp-glibc/deploy/deb/cortexa7t2hf-neon-vfpv4/adcsamplerbeat_git-r0_armhf.deb`.

That’s it! Now you should have all that you need. So first boot the STM32MP1 and then scp the deb file from your host builder and install it on the target like that

# Yocto builder
scp /path/to/yocto/build/tmp-glibc/deploy/deb/cortexa7t2hf-neon-vfpv4/adcsamplerbeat_git-r0_armhf.deb root@<stm32mp1ip>:/home/root

# On the STM32MP1
cd /home/root
dpkg -i adcsamplerbeat_git-r0_armhf.deb

Of course you need to do the same procedure for the CM4 firmware, meaning that you need to build it using bitbake

bitbake stm32mp1-rpmsg-adcsampler

And then scp the deb from `build/tmp-glibc/deploy/deb/cortexa7t2hf-neon-vfpv4/stm32mp1-rpmsg-adcsampler_git-r0_armhf.deb` to the STM32MP1 target and install it using dpkg.

Finally, you need to verify that everything is installed. To do that, have a look at the following paths and verify that the files exist

/usr/bin/fw_cortex_m4.sh
/lib/firmware/stm32mp157c-rpmsg-adcsampler.elf
/usr/bin/adcsamplerbeat
/usr/share/adcsamplerbeat/adcsamplerbeat

If the above files exist then probably everything is set up correctly.

Last thing is that you need to load the proper device tree otherwise the ADCs won’t work. To do that edit the `/boot/extlinux/extlinux.conf` file on the board and set the default boot mode to `stm32mp157c-dk2-m4-examples` like this

DEFAULT stm32mp157c-dk2-m4-examples

This mode will use the `/stm32mp157c-dk2-m4-examples.dtb` device tree file which lets the CM4 enable the ADC LDOs using the PMIC.

Testing the adcsamplerbeat

Finally, it’s time for some fun. Now everything should be ready on the STM32MP1 and also the Elasticsearch and Kibana servers should be up and running. First thing you need to do is to execute the CM4 firmware. To do that run this command on the STM32MP1 target

fw_cortex_m4.sh start

This command will load the stm32mp157c-rpmsg-adcsampler.elf from /lib/firmware onto the CM4 and execute it. When this is done, the CM4 will start sampling the four ADC channels and send the results to the Linux user-space using OpenAMP.

Next, you need to run the adcsamplerbeat on the STM32MP1, but before you do that verify that the `/etc/adcsamplerbeat/adcsamplerbeat.yml` file contains the correct configuration for the tty port and the remote Elasticsearch server. Also make sure that both the SBC and the server are on the same network. Then run this command:

adcsamplerbeat

You should see an output like that

root:~# adcsamplerbeat
Opening serial: %s,%d /dev/ttyRPMSG0 115200
s1:  ADC[2.2]:4042,3350,4034,4034
s2:  4042,3350,4034,4034
adc1_val:  4042
adc2_val:  3350
adc3_val:  4034
adc4_val:  4034

This means that the custom beat is working fine on the armhf CPU! Nice.

Now you need to verify that the Elasticsearch server also gets the data from the beat. To do that, open a browser (I’m using the X96mini as a remote Kibana client) and type the Kibana URL. Wait for the UI to load and then, using your mouse and starting from the left menu, browse to this path

Management -> Stack Management -> Kibana -> Index Patterns -> Create index pattern

Then you should see this

If you already see those sources that you can select from in your results, then it means that it’s working. So now you need to type adcsamplerbeat-* in the index pattern name, like this:

Finally, after applying and getting to the next step you should see something similar to this

As you can see from the above image these are the data that are transmitted from the target to the server. In my case I’ve commented out these processors in my adcsamplerbeat.yml file

processors:
  - add_host_metadata: ~
  # - add_cloud_metadata: ~
  # - add_docker_metadata: ~

I suggest you do the same to minimize the data transmitted over the network. You could also comment out the host, but this would then make it difficult to trace which data in the database belong to which host, if many hosts exist.

Now we need to setup a dashboard to be able to visualize these data.

Setting up a custom Kibana dashboard

Kibana is very flexible and a great tool to create custom dashboards. To create a new dashboard to display the ADC sample data you need to use your mouse and, starting from the left menu, browse to

Dashboard -> Create new dashboard

You should see this

Now select the “Line” visualization and then you get the next screen

Ignore any other indexes and just select adcsamplerbeat-* then in the next screen you need to add a Y-axis for every ADC value and also use the following configuration for each axis

Aggregation: Max
Field: adcX_val
Custom label: ADCx

Instead of X and x use the ADC index (e.g. adc1_val and ADC1 etc.)

Finally you need to add a bucket. The bucket is actually the X axis and in there all you need is to configure it like this

When you apply all changes you should see something similar to this

In the above plot there are 4x ADCs, but 3 of them are floating and their value is close to the max ADC range value, which is 4095 (for 12-bits).

That’s it! You’re done. Now you have your STM32MP1 sampling 4x ADC channels and publishing the samples to an Elasticsearch server, and you can use Kibana to visualize the data on a custom dashboard.

Conclusions

So, that was quite a long post, but it was interesting to implement a close-to-real-case-scenario project using ELK for an embedded device. In this example I’ve used Yocto to build the Linux image for the STM32MP1, the firmware for the CM4 MCU and also the Go beat module.

I’ve used the CM4 to sample 4x ADC channels using DMA and double buffering and then OpenAMP to send the buffers from the CM4 to the CA7 and the Linux user-space. Then the custom adcsamplerbeat elastic beat module published the ADC sampling data to the Elasticsearch server and finally I’ve used the Kibana web interface to create a dashboard and visualize the data.

This project might be a bit complex because of the STM32MP1, but other than that it’s quite simple. Nevertheless, it’s a fine example on how you can use all those interesting technologies together and build a project like that. The STM32MP157C is an interesting SBC to use in this case because you can connect whatever sensors you like on the various peripherals and then create a custom beat to publish the data.

By using this project as a template you can create very complex setups, with whatever data you like and create your custom filters and dashboards to monitor your data. If you use something like this, then share your experience in the comments.

Have fun!

Using Elastic stack (ELK) on embedded [part 1]

Intro

[Update – 04.09.2020]: Added info how to use Yocto to build the official elastic beats

Data collection and visualization are two very important things that, when used properly, are actually very useful. In the last decade we’ve been overwhelmed with data and especially with how the data are visualized. In most cases, I believe, people don’t even understand what they see or how to interpret the data visualizations. It has become more important in the industry to present data in a visually pleasing way, rather than to actually get a meaning out of them. But that’s another story for a more philosophical post.

In this series of posts I won’t solve the above problem, but I will probably contribute to making it even worse, and I’ll do that by explaining how to use Elasticsearch (and some other tools) with your embedded devices. An embedded device in this case can be either a Linux SBC or a micro-controller which is able to collect and send data (e.g. ESP8266). Since there are so many different use cases and scenarios, I’ll start with a simple concept in this post and then it will get more advanced in the next posts.

One thing that you need to keep in mind is that data collection and presentation go many centuries back. In the case of IT, though, there are only a few decades of history in presenting digital data. If you think about it, only the tools get different as the technology advances and, as happens with all new things, those tools are getting more fancy, bloated and complicated, but at the same time more flexible. So nothing new here, just old concepts with new tools. Does that mean that they’re bad? No, not at all. They are very useful when you use them right.

Elastic Stack

There are dozens of tools and frameworks to collect and visualize data. You can even implement your own simple framework to do that. For example, in this post I’ve designed an electronic load with a web interface. That’s pretty much data collection and visualization. OK, I know, you may say that it’s not really data collection because there is no database, but that doesn’t really mean anything, as you could have a circular buffer with the last 10 values and that would make it a “data collector”. Anyway, it doesn’t matter how complex your application and infrastructure are; the only thing that matters is that the ground concept is the same.

So, Elastic Stack (EStack) is a collection of open-source tools that collect, store and visualize data. There are many other frameworks, but I’ve chosen EStack because it’s open source, nowadays is a mature framework and it’s quite popular in the DevOps community. The main tools of the EStack are: Elasticsearch (ES), Kibana (KB), Logstash (LS) and Beats, but there are also others. Here is a video that explains a bit better how those tools are connected together.

It’s quite simple to understand what they do, though. Elasticsearch is a database server that collects and stores data. Logstash and Beats are clients that send data to the database and Kibana is just a user interface that presents the data on a web page. That’s it. Simple. Of course that’s only the main concept of the tools; they offer much more than that and they are adding new functionalities really fast.

Now, the way they do what they do and the implementation is what is quite complicated, so EStack is quite a large framework. Protocols, security, databases, visualization, management, alerts and configuration are what make those frameworks huge. Therefore, the user or the administrator deals with less complexity, but in return the frameworks get larger and the internal complexity makes the user much less able to debug or resolve any problems inside the infrastructure. So, you win some, you lose some.

Most of the time the motto is: if it works without problems for some time you’re lucky, but if it breaks you’re doomed. Of course at some point things break eventually, so this is where you need backups, because if your database gets corrupted then good luck with it if you have no backup.

Back to EStack… The Elasticsearch is a server with a database. You can have multiple ES server nodes running at the same time and different types of data can be stored on each server, but you’re able to have access to all the data from the different nodes with a single query. Then, you can have multiple Logstash and Beats clients that connect to one or more nodes. The clients send data to the server and the server stores the data in the database. The difference with older similar implementations is that ES uses the json syntax that it receives from the client to store the data in the DB. I’m not sure about the implementation details and I may be wrong in my assumption, but I assume that ES creates tables on the fly with fields according to this json syntax if they don’t already exist. So ES is aware of the client’s data formatting when receiving well-formatted data. Anyway, the important thing is that the data are stored in the DB in a way that you can run queries and filters to obtain information from the database.

The main difference between Logstash and Beats clients is that Logstash is a generic client with multiple configurable functionalities that can be configured to do almost anything, while Beats are lightweight clients which are tailored to collect and send specific types of data. For example you can configure Logstash to send 10 different types of data, or you can even have many different Beats that send data to a Logstash client, which then re-formats and sends the data to an Elasticsearch server. On the other hand, Beats can deal only with specific types of data; for example they can collect and send an overview of the system resources of the host (client), or send the status of a specific server, or poll a log file and then parse new lines, format the log data and send them back to the server. Each Beat is different. Here you can search all the available community Beats and here the official beats.

Beats are just simple client programs written in Go that use the libbeat Go API and perform very specific tasks. You can use this API to write your own beat clients in Go that can do whatever you like. The API just provides the interface and implements the network communication, including authentication. At the higher level a Beat is split into two components: the component that collects the data and implements the business logic (e.g. reading a temperature/humidity sensor) and the publisher component that handles all the communication with the server, including authorization, timeouts etc. The diagram below is simple to understand.

As you can see, libbeat implements the publisher, so you only need to implement the custom/business logic. The problem with libbeat is that it’s only available in Go, which is unfortunate because a plethora, and actually the majority, of small embedded IoT devices cannot run Go. Gladly there are some C++ implementations out there, like this one, that do the same in C++, but the problem with those implementations is that they only support basic authentication.

You should be aware, though, that the Go libbeat also handles buffering, which is a very important feature, and if you do your own implementation you should take care of that. Buffering means that if the client loses the connection, even for hours or days, it stores the data locally and when the connection is restored it sends all the buffered data to the server. That way you won’t have gaps in your server’s database. It also means that you need to choose an optimal data sampling rate, so your database doesn’t get huge in a short time.

One important thing that I need to mention here is that you can write your own communication stack, as it’s just a simple HTTP POST with attached json-formatted data. So you can implement this on a simple baremetal device like the ESP8266, but the problem is that the authentication will be just a basic one. This means that the authentication can be a user/pass, but that’s really a joke as both are attached in an unencrypted plain HTTP POST. I guess that for your home and internal network that’s not much of a problem, but you need proper encryption if your devices are out in the wild.
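To give an idea of how small such a client can be, here’s a rough ESP8266 (Arduino core) sketch that indexes a document with a plain HTTP POST; the server IP, the index name and the payload are made up for illustration, and there’s no authentication or TLS here:

#include <ESP8266WiFi.h>
#include <ESP8266HTTPClient.h>

void post_sample(float value)
{
  WiFiClient client;
  HTTPClient http;

  // Hypothetical Elasticsearch endpoint: index "sensors", auto-generated doc id
  http.begin(client, "http://192.168.0.2:9200/sensors/_doc");
  http.addHeader("Content-Type", "application/json");

  // Minimal JSON document with a single field
  String payload = String("{\"value\":") + String(value, 2) + "}";
  int code = http.POST(payload);   // Elasticsearch returns 201 when the doc is created

  Serial.printf("POST returned: %d\n", code);
  http.end();
}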

A simple example

To test a new stack, framework or technology you need to start with a simple example and create a proof of concept. As a first example in this series I’ll run an Elasticsearch (ES) server and a Kibana instance on my workstation which will act as the main server. The ES will collect all the data and the Kibana instance will provide the front-end to visualize the data. Then I’ll use an SBC and specifically the nanopi-k1-plus, which is my reference board to test my meta-allwinner-hx BSP layer. The SBC will run a Beat that collects metric data from the running OS and sends them to the Elasticsearch server. This is the system overview:

So, this is an interesting setup. The server is my workstation, which is a Ryzen 2700X with 32GB RAM and various fast NVMe and SSD drives. The Elasticsearch and Kibana servers are running inside docker containers on an NVMe. The host OS is Ubuntu 18.04.5 LTS.

It’s funny, but for the web interface client I’m using the X96 mini TV box… OK, so this is an ARM board based on the Amlogic S905X SoC, which is a quad-core Cortex-A53 running at 1.5GHz with 2GB RAM. It currently runs an Armbian image with Ubuntu Focal 20.04 on the 5.7.16 kernel. The reason I’ve selected this SBC as a web client is to “benchmark” the web interface, meaning I wanted to see how long it takes for a low-spec device to load the interface and how it behaves. I’ll come to this later on; anyway, this is how it looks.

A neat little Linux box. Finally, I’ve used the nanopi-k1-plus with a custom Yocto Linux image using the meta-allwinner-hx BSP with the 5.8.5 kernel version. On the nanopi I’m running the metricbeat. Metricbeat is a client that collects system resource and performance data and then sends them to the Elasticsearch server.

This setup might look a bit complicated, but it’s not really. It’s quite basic; I’m just using those SBCs with custom OSes, which makes it look a bit complicated, but it’s really a very basic example and it’s better than running everything on the workstation host as docker containers. This setup is more fun and closer to real usage when it comes to embedded. Finally, this is a photo of the real setup.

Running Elasticsearch and Kibana

For this project I’m running Elasticsearch and Kibana in docker containers on a Linux host and it makes total sense to do so. The reason is that the containers are sandboxed from the host and also, later in a real project, it makes more sense to have a fully provisioned setup as IaC (infrastructure as code); therefore you’re able to spawn new instances and control your instances, maybe using a swarm manager.

So let’s see how to setup a very basic Elasticsearch and Kibana container using docker. First head to this repo here:

https://bitbucket.org/dimtass/elastic-stack-on-embedded/src/master/
https://github.com/dimtass/elastic-stack-on-embedded.git
https://gitlab.com/dimtass/elastic-stack-on-embedded.git

There you will see the “part-1” folder which includes everything that you need for this example. In this post I’m using the latest Elastic Stack version, which is 7.9.0. First you need to have docker and docker-compose installed on your workstation host. Then you need to pull the Elasticsearch and Kibana images from the docker hub with these commands:

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.0
docker pull docker.elastic.co/kibana/kibana:7.9.0

Those two commands will download the images to your local registry. Now you can use those images to start the containers. I’ve already pushed the docker-compose file I’ve used in the repo, therefore all you need to do is to cd in the part-1 folder and run this command:

docker-compose -f docker-compose.yml up

This command will use the docker-compose.yml file and launch two container instances, one for Elasticsearch and one for Kibana. Let’s see the content of the file.

version: '3'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.0
    container_name: elastic_test_server
    environment:
      - bootstrap.memory_lock=true
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
      - TAKE_FILE_OWNERSHIP=true
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - /rnd2/elasticsearch/data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
    networks:
      - elastic
  kibana:
    image: docker.elastic.co/kibana/kibana:7.9.0
    container_name: kibana_test
    ports:
      - 5601:5601
    environment:
      ELASTICSEARCH_URL: http://elastic_test_server:9200
      ELASTICSEARCH_HOSTS: http://elastic_test_server:9200
      SERVER_HOST: 192.168.0.2
    networks:
      - elastic

networks:
  elastic:
    driver: bridge

 

You see that there are two services: elasticsearch and kibana. Each service uses the proper image and it has a unique container name. The ES container has a few environment variables, but the important one is the discovery.type which declares the instance as a single-node. This is important because there’s only one node and in case you had more then you need to configure those nodes in the yaml file, so they can discover each other in the network. Another important thing is that the ES volume is attached to the host’s physical drive. This is important so when you kill and remove the container instance, then the data (and the database) are not lost. Finally, the network and the network ports are configured.

The Kibana service configures the web server port, the environment variables that point to the ES server and the server host. It also puts the service in the same network with ES and, most importantly, it sets the host address to the host’s IP address so the web server is also accessible from other web clients on the same network. If SERVER_HOST is left to its default, then you can only access the web interface from the localhost. The final touch is that the network is bridged.

Once you run the docker compose command then both the Elasticsearch and the Kibana servers should be up and running. In order to test that everything works as expected then you need to open your browser (e.g. on your localhost) and launch this address

http://localhost:5601/status

It may take some time to load the interface and it might give some extra information the first time you run the web app, but eventually you should see something like this

Everything should be green, but most importantly plugin:elasticsearch needs to be in ready status, which means that there’s communication between the Kibana app and the Elasticsearch server. If there’s no communication and both instances are running then there’s something wrong with the network setup.

Now it’s time to get some data!

Setting up the metricbeat client

As I’ve already mentioned, the metricbeat client will run on the nanopi-k1-plus. In my case I’ve just built the console image from the meta-allwinner-hx repo. You can use any SBC or distro instead, as long as it’s arm64, and the reason for this is that there are only arm64 pre-built binaries for the official Beats. Not only that, but you can’t even find them in the download page here, and you need to use a trick to download them.

So, you can use any SBC and distro or image (e.g. an Armbian image) as long as it’s compatible with arm64 and one of the available packages. In my case I’ve built the Yocto image with deb support, therefore I need a deb package and the dpkg tool in the image. To download the deb package, open your browser and fetch this link.

https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-7.9.0-arm64.deb

Then scp the deb file to your SBC (or download it in there with wget) and then run:

dpkg -i metricbeat-7.9.0-arm64.deb

After installing the deb package, the executable will be installed in /usr/bin/metricbeat, a service file will be installed in /lib/systemd/system and some configuration files will be installed in /etc/metricbeat. The configuration file `/etc/metricbeat/metricbeat.yml` is important; this is the file where you set up the list of remote hosts (in this case the Elasticsearch server) and also configure the module and the metricsets that you want to send from the nanopi to the ES server. To make it a bit easier, I’ve included the metricbeat.yml file I’ve used in this example in the repo folder, so you just need to scp this file to your SBC, but don’t forget to edit the hosts line and use the IP address of your ES server.

# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["192.168.0.2:9200"]

Build Elastic beats in Yocto

I’ve written a software meta layer for Yocto that you can use to add the official Elastic beats into your image. The meta layer is here:

https://gitlab.com/dimtass/meta-elastic-beats
https://github.com/dimtass/meta-elastic-beats
https://bitbucket.org/dimtass/meta-elastic-beats

You can add the layer to your sources and to your bblayers.conf file and then add one or more of the following recipes to your image, using IMAGE_INSTALL.

  • elastic-beats-auditbeat
  • elastic-beats-filebeat
  • elastic-beats-heartbeat
  • elastic-beats-journalbeat
  • elastic-beats-metricbeat
  • elastic-beats-packetbeat

Although there are some template configuration files in the repo, it’s expected that you override them and use your own configuration yaml files. The configuration files are located in the `meta-elastic-beats/recipes-devops/elastic-beats/elastic-beats` folder in the repo. Also, you need to read the README carefully, because golang by default sets some files as read-only, and if the work directory is not cleaned properly the build will fail.

Show me the data

At this point, we have a running instance of an Elasticsearch server, a Kibana server and a ready-to-go SBC with the metricbeat client. Next thing to do now is to run the metricbeat client and start collecting data in the server. As I’ve mentioned earlier the deb file also installs a service file, so you can either run the executable in your terminal or even better enable and start the service.

systemctl start metricbeat
systemctl enable metricbeat

In my case I’ve just executed the binary from the serial-tty console, but for long term usage of course the service makes more sense.

Now, if you wait for a bit then the server will start receiving metric data from the SBC. In order to view your data you need to open the Kibana web interface app into your browser. I’ve tried this on both my workstation and the X96mini in order to compare the performance. On my workstation it takes a couple of seconds to load the interface, but on the X96mini it took around 1.5-2 minutes to load! Yep, that means that the user interface is resource demanding, which is a bit unfortunate as it would be nice to have an option for a lighter web interface (maybe there is and I’m not aware of it).

Next you need to click on the Kibana tab in the left menu and then click “Dashboard”. This will show a list of some template dashboards. You can implement your own dashboard and do whatever customizations you want in the theme and the components, but for now let’s use one of the templates. Since the nanopi sends system metric data, you need to select the “[Metricbeat System] Host overview ECS” template. You can limit the listed items if you search for the “system” keyword.

If your remote host (e.g. the nanopi) doesn’t show up automatically, then in the upper left corner you need to create a new filter and set the “host.name” to the one of your SBC. To get the host name of your SBC, run uname -n in your console. In my case it is:

host.name:"nanopi-k1-plus"

So, after applying the filter you should get your data. Just have in mind that it might need a few minutes to collect enough data to show something. The next two screenshots are from the X96mini.

Click on the images to view them in full screen. In the first screenshot you see that the web interface is using only 3% of the CPU, but it uses 24.4% of the system’s 2GB RAM when it’s idle! In the next screenshot I’ve pressed the “Refresh” button on the web interface to see what happens to the system resources. In this case the web interface needed 120% of the CPU, which means that more than one core is utilized. The next two screenshots display all the available data in this dashboard.

Well, that’s it! Now I can monitor my nanopi-k1-plus SBC using the Kibana web app. In case you have many SBCs running around, you can monitor all of them. Of course, monitoring is just one thing you can do. There are many more things that you can do with Kibana, like for example creating alerts and sending notifications using the APM interface. So, for example, you can create alerts for when the storage is getting low, or the communication is lost, or whatever you like, using the available data from the remote host in the server’s database. As you understand, there’s no limit to what you can do.

Conclusions

In the first post, I’ve demonstrated a very basic use case of a remote SBC that sends metric data to an Elasticsearch server by using the metricbeat client. Then I’ve shown a template dashboard in Kibana that visualizes the data of the remote client. It can’t get simpler than that, really. Setting up the ES and Kibana server was easy using docker, but as I’ve mentioned I haven’t used any of the available security features; therefore you shouldn’t use this example for a real-case scenario especially if the clients are out in the wilderness of the internet.

The pros of using a solution like this are that it’s quite easy to set up the infrastructure, the project is open source and actively supported, and there’s also an active community that creates custom beats. On the negative side, the libbeat API is only available in Go, which makes it unusable for baremetal IoT devices, and the tools of the Elastic Stack are complex and bloated, which may make it hard to debug issues yourself when they arise. Of course, the complexity is expected as you get tons of features and functionality, actually more features than you will probably use. It’s the downside of all the Swiss-army-knife solutions.

Is Elastic Stack only good for monitoring remote SBCs? No. It’s capable of many more things. You need to think about Elastic Stack as just a generic stack that provides functionalities and features, and what you do with it is up to your imagination. For example you can have hundreds or thousands of IoT devices and monitor all of them, create custom dashboards to visualize any data in any way you like, create reports, create alerts and many other things.

Where you benefit most with such tools is that they scale up more easily. You can have 5 temperature/humidity sensors in your house, or dozens of MOX sensors in an industrial environment, or hundreds of thousands of environmental sensors around the world, or a fleet of devices sending data to a single or multiple Elasticsearch servers. Then you can use Kibana to handle those data and create visualization dashboards and automations. Since the project is open-source, there’s actually no limit to what you can do with the information inside the database.

I hope that at some point there will be a generic C or even a C++ API similar to libbeat that doesn’t have many dependencies, or even better no dependencies at all. This API should be implemented in a way that can be used in baremetal devices that run a firmware and not an OS capable of running Go. This would be really cool.

So, what’s next? In the next posts I’ll show how to create custom beats to use in Linux SBCs and also how to connect baremetal devices to Elasticsearch and then use Kibana to visualize the data.

Are there other similar tools? Well, yes, there are a few of them, like Loggly and Splunk, but I believe that Elastic Stack could fit the IoT perfectly in the future. There are also alternatives to each component, for example Grafana is an alternative to Kibana. I guess if you are in the market for adopting such a tool you need to do your own research and investigation.

This post just scratched the surface of the very basic things that you can do with Elastic Stack, but I hope it makes it somewhat clear how you can use those tools in the IoT domain and what the potential and possibilities are. Personally I like it a lot and I think there’s still some room for improvement for it to fit the embedded world.

Have fun!

 

 

A Yocto BSP layer for nanopi-neo3 and nanopi-r2s

Intro

Having a tech blog comes with some perks and one of them is receiving samples of various SBCs. This time FriendlyElec sent me two very nice boards, the new nanopi-neo3 and the nanopi-r2s. Both SBCs came with all the extra options, including a heatsink and case, and they look really great. So, I’ve received them yesterday and I had to do something with them, right?

As you might expect, if you’ve been reading this blog for some time, I won’t do a product presentation or review, because there are already many reviews that pretty much cover everything. Nevertheless, I’ve done something more challenging and fun that might also be helpful for others: I’ve created a Yocto BSP layer for those boards that you can use to create your custom Linux distribution.

Similarly to meta-allwinner-hx, I've based the BSP layer on the Armbian image. The reason is that the Armbian image is very well and actively supported, which means you get updated kernels and other features, including extended wireless support. Therefore, this Yocto layer applies the same u-boot and kernel patches as Armbian and at the same time allows you to create your own custom distribution and image.

Nanopi-R2S

The nanopi-r2s and nanopi-neo3 share the same SoC and almost the same schematic layout, which means that they also share the same device-tree. This is very convenient as it makes it easier to update both BSPs at the same time. At the same time, though, they also have significant differences which make them suitable for different purposes.

The nanopi-r2s is based on the Rockchip RK3328 SoC, which is a quad-core Cortex-A53 with each core running at 1.5GHz. You might already know this SoC as it's used a lot in various Android TV boxes sold on eBay and many Chinese tech sites. In 2018 there was a peak in the release of such TV boxes. Nevertheless, the R2S is not a TV box; its PCB layout is more reminiscent of a network gateway.

This is the R2S I’ve received that came with the metal case.

And this is how it looks when opened.

The board has the following specs:

  • 1 GB DDR4 RAM
  • 1x internal GbE ethernet
  • 1x GbE ethernet (using a USB3.0 to Ethernet bridge IC)
  • 1x USB2.0 port
  • Micro SD
  • micro-USB power port
  • 10x pin-header with GPIOs, I2C, IR receiver and UART

As you can see from the above image, the case has an extruded rectangle which is used as the SoC heatsink. The problem I've seen is that the silicone pad on the SoC is not thick enough and therefore the contact is not that good, which means that the heat dissipation might be affected negatively. In my case I've resolved this by using a thicker thermal silicone pad on the SoC.

FriendlyElec provides Ubuntu and OpenWRT images for the board, which I haven't tested as I've implemented my own Yocto BSP layer and image. I've read some people complaining that the board doesn't have some things they would like, but I guess nobody can be 100% satisfied with an SBC and it's not possible for one SBC to fit all use cases. From my point of view this SBC is ideal if you need a quite powerful CPU and two GbE ports.

Personally I’m thinking about implementing a home automation gateway with this board and use one GbE to connect to a KNX/IP bus and the other to the home network, hence keep the two domains separated with a firewall which is more secure.

Nanopi-neo3

The nanopi-neo3 is a really compact SBC with the same RK3328 SoC; it's a bit bigger than the nanopi-neo2 board and smaller than the nanopi-neo4 (RK3399). The plastic case is really nice and I can imagine having it in a visible spot in my apartment, as it would visually fit just fine. The specs for this SBC are the following:

  • 2GB DDR4 (also 1GB available)
  • 1x USB3.0 port
  • 2x USB2.0 ports (in pinheader)
  • Micro SD
  • 5V fan connector (2-pin)
  • Type-C power
  • 26x pin-header with GPIOs, I2C, SPI, I2S, IR receiver and UART

As you can see the main difference is that there’s only one GbE port and there is also an extra USB3.0 port. This configuration makes it more suitable for sharing a USB3.0 drive in your home network or other IoT applications as it has a pin-header with many available interfaces.

FriendlyElec provides an Ubuntu and an OpenWRT image with the 5.4 kernel and the images are the same as for the nanopi-r2s. Actually, since the SoC and the device-tree are the same, you can boot both boards with the same OS on the SD card.

The Yocto BSP layer

As I’ve mentioned, generally I prefer custom build images with Yocto for the various SBCs that I’m using. That’s because I have more control on what’s in the image and also it’s more fun to build my own images. An extra benefit is that this also keeps me up to speed with Yocto and the various updates in the project. Also Yocto is perfect for tailoring the image the way you want it, provides provisioning and it’s fitted for continuous integration. Of course, you can use any other available pre-build image like the ones FriendlyElec provides or the Armbian images and then use Ansible for example to provision your image if you like. But Yocto is much more advanced and gives you access to every bit of the image.

OK, enough with blah-blah, the BSP repo is here:

https://bitbucket.org/dimtass/meta-nanopi-rockchip64/src/master/
https://github.com/dimtass/meta-nanopi-rockchip64
https://gitlab.com/dimtass/meta-nanopi-rockchip64

I think the README file is quite thorough and explains all the steps you need to follow to build the image. Currently it supports the new Dunfell version and I may keep updating it in the future if I have time, as my main priority is the meta-allwinner-hx BSP updates.

The image uses the 5.7.17 kernel version and it's based on the mainline stable repo (the patches are applied on the upstream repo), which is really nice. Generally, Rockchip has better mainline support compared to other vendors, so most of the patches are meant to support specific boards rather than the SoC itself.

I’ve added here the boot sequence of the nanopi-neo3 board with the custom Yocto image:

As you can see, the board boots up just fine with the custom image, which is based on the Poky distro. As you can imagine, from now on the sky is the limit with what you can do using these boards and Yocto…

Conclusions

The nanopi-neo3 and nanopi-r2s are two SBCs that are similar enough to be able to use the same Linux image, but at the same time they serve different purposes and use-cases. Personally, I like having them both in my collection and I've already thought of a project for the nanopi-r2s, which is a home automation gateway for KNX/IP. I'll come back with another post for that in the future, as this post is only about presenting this Yocto BSP layer for those boards.

I spent a fun afternoon implementing this Yocto layer and I'm glad that both boards work fine. It also seems that this is the first time that those boards are supported by a Yocto BSP layer and I'm really interested in finding out how people are going to use it. If you do, please leave feedback in the comments.

Have fun!

Using code from the internet as a professional

Intro

One of the most popular and misunderstood beliefs in engineering, and especially in programming, is that professional engineers shouldn't use code from the internet. At least I've heard and read this many times. Furthermore, according to this belief, using code from the internet automatically means that those who do it are bad engineers. Is there any truth to this statement, though? Is it right or wrong?

In this post I will try to give my 2 Euro-cents on this matter and try to explain that there’s not only black and white in programming or engineering, but there are many shades in between.

Be sure it’s not illegal

OK, so first things first. If and when you’re using code from the internet you need to first make sure that you comply with the source code licence. It’s illegal and furthermore not ethical to use any source code in a way that is against its licence.

One thing about licences is that many people use them without really knowing what they mean. That's pretty much expected, because licences tend to be very complicated documents with many legal terms and sometimes no clear indication of how the code can be used. Also, there are far too many open source licences. It's easy to get lost.

Because of that, sometimes authors use licences that don't really meet their vision (or criteria) of how they would like to share their code. Therefore, if you find some source code that you really want to use, but the licence seems to be a restriction, it's perfectly fine to contact the source code owner, ask about the licence, explain how you would like to use the code and get permission.

This is actually the reason that I'm using the MIT licence almost exclusively, so it's clear to people that they can grab the code and do whatever they like with it. Therefore, always check the licence and don't be afraid to ask the author if you want to use the source code in a way that may not be covered by the current licence.

Reasons to use code from the internet

So, let’s start with this. Why use random source code from the internet? Well, there are many reasons, but I’ll try to list a few of them

  • You don’t know how to do it yourself
  • You’re too bored to write it yourself
  • It takes too much time and effort to write everything by yourself
  • You don’t have the time to read all the involved APIs and functions, so you use a shortcut
  • You think that someone else did it better than you can (= you’re not confident about yourself)
  • You want to do a proof of concept and then re-write parts or all of the code to adapt it to your needs
  • The code you found seems more elegant than your coding
  • This will save you time from your work so you can slack
  • You get too much pressure from your manager or the project and you want to just finish ASAP

All the above seem to be valid reasons and, trust me, all of them are full of… traps. So, no matter the reason, there is always a trap in there and you need to be able to deal with it. If you don't know how to do something, then just using random code may lead you to a much worse position further down the path, because you won't be able to debug the code or solve problems. Therefore, if you don't know how to do it, it's only OK to use the code if you spend time understanding it and feel comfortable with it.

If you’re too bored to write it yourself, then you really have to be very confident about your skills and be sure that you can deal with issues later. Well, being bored in this job is also not something that will give you any joy in your work-life, so just be careful about it. I understand that not everyone enjoys their jobs and it’s also normal, but this creates also problems to the other people that may have to deal with it.

If it takes too much time to write the code and you need a shortcut, then be sure that you fully understand the code. If you don't, then it's a trap. The same goes for unknown APIs. It's OK to use functions that you don't really understand in depth, but again you need to be sure about your skills, or at least get a brief overview of the API, and if there is documentation then try to find the important information and warnings.

If you think that someone else did it better than you, then that's fine. You can't be an expert in everything, but you still need to be able to read the code and figure out how it works. Most of the time, if you see very complicated code, then there's probably something wrong with it; it doesn't necessarily mean that the code is advanced or better than yours. Most of the time, code should be simple and clean. The same goes if you find source code that seems more elegant than yours. Syntactic sugar and beautified code don't really mean "better" code.

If you want to use source code from the internet so you can slack, then you should consider finding another job. Really. It's not good for your mental health. Then again, you might just need a break; engineering is a tough job no matter what other people think.

If you’re getting too much pressure and the only way to deal with the project pressure is to use code from the internet, then that’s also wrong for various reasons. In the end, you’ll probably have to do it anyways, even if you don’t like it, but still that’s a problem and it goes beyond yourself as it’s also bad for the project. Again, try to understand the code as much as possible.

Reasons not to use code from the internet

I guess for people that already have enough experience in the domain it's clear that most projects are full of crap code. By definition all code is crap, because it's nearly impossible to get it perfectly right, from the specifications and design to the implementation. If you're lucky and really good then the code won't suck that much, but it still won't be that good or perfect.

Nevertheless, it’s really common to see bad code in projects and the reason is not only that many engineers use code from the internet, but when they do they use it without understand it and without making proper modifications. Another reason is that many times the proof-of-concept code ends up to be a production code, which is a huge mistake. The reason for that is that most of the times there’s pressure from the upper management that if it works don’t fix it, but they don’t understand that this is just a PoC. You can’t do much about this and if there’s pressure then you need to just go along with it, knowing that the shitstorm may come to you in the future.

Many times, the reason engineers quit is that they realize the code base is unmanageable, so they abandon ship before it wrecks. Even that takes experience to time right…

Now do you see what the impact of really crappy and unmanageable code is? It's a disaster for everyone: for the current project developers, for the newcomers, for the product itself and for the company. Nobody likes to deal with that mess and nobody likes to be responsible for it.

But does that really have to do with just using code from the internet? Well, only partially. The main reason for such a disaster is the lack of experience, planning and good management. And it happens again and again all around the world, even in huge companies and projects.

Therefore, most of the time it seems that the problem is that developers are using code from the internet and that it ends up as an unmanageable code blob. Yes, that's true. But it's not entirely true. The problem is not the small or large pieces of code from the internet; the problem is that people don't understand the code, or they don't care about it, or they get too much pressure. So, the code that comes from the internet is not the real cause of the problem. The problem is the people and the circumstances under which the code was used.

Conclusions

Personally, I like reading code, I use others' code often, and I believe there's nothing wrong with that. It doesn't make me feel less professional and it doesn't hurt my work or my projects at all. But at the same time, I strongly believe that you should never use any code from the internet if you don't really understand it, because this will probably end badly at some point later.

Also, nowadays there are so many APIs and frameworks that it's impossible to be an expert in everything. For that reason nobody should expect you to be an expert in everything; if they do, then try to avoid those jobs. The only thing that you really must be an expert at is understanding in depth what you're doing and the reasons for doing it, and being able to foresee the consequences of what you're doing. This comes with experience, though.

If you still lack the experience, then before you use random code from the internet, try hard to understand it. Spend time on it. Try it, test it, use it, change it, play with it and do your own experiments. This will give you the experience you need. By getting more and more experienced you'll eventually find that pretty much everything is the same; it's just another piece of code or another API, and you'll feel comfortable with it and be able to understand it in no time. But if you don't do this, then you'll never be able to understand the code even if you use it many times in different projects, and that is really bad and will eventually bring you trouble.

Also be aware that the internet is full of crap code. Finding something that just works doesn't mean it's good to use. I've seen so much bad code that just works for the presented case but should never be used in production code. Sometimes code may work for the author, but that doesn't mean it will work for you, as even differences in the toolchains, the build and the runtime environment can have a great impact on the code's functionality. You always need to be really cautious about the code you find and be sure that you use it properly.

Therefore, I believe it’s totally fine to use code from the internet as long it’s legal, you -really- understand it and also you adapt it to your own needs and not just copy-paste and push it in to the production.

If you think that’s wrong then I’m interested to hear your opinion on the matter.

Have fun!

 

Measuring antennas with the nanovna v2

Intro

Many IoT projects use some kind of interface for transmitting data to a server (or the cloud, which is the fancy name nowadays). Back in my days (I'm not that old) that was a serial port (RS-232 or RS-485) sending data to a computer. Later, serial-to-ethernet modules (like the XPort from Lantronix) replaced serial cables and then WiFi (or any other wireless technology like BT and LoRa) came into play. Anyway, today using a wireless module has become the standard, as it's easy and cheap to use and it doesn't need a cable to connect the two ends.

Actually, it’s amazing how cheap is to buy an ESP-01 today, which is based on the ESP8266. With less than $2 you can just add to your device an IEEE 802.11 b/g/n Wi-Fi module with integrated TR switch, balun, LNA, power amplifier and matching network and support for WEP, WPA and WPA2 authentication.

But all those awesome features come with some difficulties or issues and in this post I'll focus on one of them, which is the antenna. I'll try to keep this post simple as I'm not an RF engineer, so if you are one then this post is most probably not for you, as it's not advanced.

Antennas

I guess that many tons of WiFi modules are sold every year. Many modules, like the ESP-01, come with an integrated PCB antenna, but many others use external antennas. Some time ago I bought a WiFi camera for my 3D printer; that was before I switched to octopi. That camera is based on the ESP32, it has a fish-eye lens and it doesn't have an integrated antenna, but an SMA connector for an external one.

At some point I was curious how well those antennas actually perform and how they're affected by the surrounding environment. There are many ways that the performance of the antenna can be affected, like the material used to build the antenna itself, the position of the antenna, its rotation and other factors. For example, the 3D printer is made from aluminum and steel, lots of it, which are both conductive, and this means that they affect the antenna. They can affect the antenna performance in many ways, e.g. by shifting the resonance frequencies and maybe the gain.

I guess for most applications, and especially for this use case (the camera for the 3D printer), the effect is negligible, but again this is a blog for stupid projects, so let's do this.

Vector Network Analyzers

So, how do you measure an antenna's parameters and what are those parameters? First, an antenna has several parameters, like gain, resonance frequencies, return and insertion loss, impedance, standing wave ratio (SWR, i.e. impedance matching) and others. All those parameters are defined by the antenna design and also the environmental conditions. Since it's very hard to figure out and calculate all those different parameters while designing an antenna, it's easier to design the antenna using rough calculations and then trim it through testing and measurements.

Another thing is that even in the manufacturing stage, most of the time the antennas won't have the exact same parameters, so you would need to test and measure again in order to trim them. In some designs this is a bit easier since there is a passive trimmer component.

In any case, in order to measure and test the antenna you need a device called a Vector Network Analyzer, or VNA. VNAs are instruments that measure coefficients, and from those coefficients you can then calculate the rest of the parameters. Pretty much all VNAs share the same basic design. The next two pictures show the block diagram of a VNA when measuring an I/O device (e.g. a filter) and an antenna.

The original image was taken from here and I’ve edited it. Click on the images to view them in larger size.

So, in the first image you see that the VNA has a signal generator which feeds the device under test (DUT) from one port and then receives the output of the DUT on a second port. The VNA is then able to measure the incident, reflected and transmitted signals and with some maths it can calculate the parameters. The VNA is actually only able to measure the reflection coefficient (S11), which depends on the incident and reflected signals before the DUT, and the transmission coefficient (S21), which depends on the transmitted signal after the DUT. An intuitive way of thinking about this is what happens when you try to fit a hose to a water pump while it's pumping out water. While you try to fit the hose, if the hose diameter doesn't fit the pump then part of the water flow will get into the hose and some will spill out (reflected back?). But if the hose fits the pump perfectly then all the flow will go through the hose.

In the case of an antenna, you can't really measure the transmitted energy in a simple way that is accurate enough for this kind of measurement. Therefore, only one port is used and the VNA can only calculate the reflection coefficient. But most of the time this is enough to get valuable information and be able to trim your antenna. The S11 coefficient is used to calculate the antenna matching, and the impedance matching in turn defines whether your antenna will be able to transmit the maximum energy. If the impedances don't match, then your antenna will only transmit a portion of the incident energy.
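For reference, these are the standard textbook relations between the load (antenna) impedance, the reflection coefficient and the matching figures, with $Z_0 = 50\,\Omega$ for the nanovna:

$$\Gamma = S_{11} = \frac{Z - Z_0}{Z + Z_0}, \qquad \mathrm{RL} = -20\log_{10}\lvert\Gamma\rvert \ \mathrm{dB}, \qquad \mathrm{VSWR} = \frac{1 + \lvert\Gamma\rvert}{1 - \lvert\Gamma\rvert}$$

A perfect match means $\Gamma = 0$ (nothing reflected back), while $\lvert\Gamma\rvert = 1$ means all the incident energy is reflected.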

As I’ve mentioned earlier, the antenna will be affected also from other conditions and not only impedance matching, therefore a portable VNA is really helpful to measure the antenna on site and on different conditions.

For this post, I’m using the new NanoVNA v2 analyzer, which I’ll introduce next.

NanoVNA v2

Until a few years ago a VNA was very expensive and considered an exotic measurement instrument. A few decades ago a VNA would cost many times the price of your apartment (or house) and a few years ago many times your yearly salary. Nowadays the IC manufacturing process and technology and the available processing power of the newer small and cheap MCUs have brought this cost down to… $50. Yes, that's just $50. This is so mind-blowing that you should get one even if you don't know what it is and will never use it…

Anyway, this is how it looks like:

There are many different sellers on eBay that sell this VNA with different options. The one I bought costs $60 and also includes a stylus pen for the touch screen, hard metal cover plates for the top and bottom, a battery, coax cables and calibration load standards (open, short and 50Ω). I've also printed this case here with my 3D printer and replaced the metal plates.

As you can see this is the V2, so there has to be a V1, and there is one. The main and most important difference between the V1 and V2 is the frequency range, which in V1 is from 50kHz up to 900MHz and in V2 is from 50kHz up to 3GHz! That's an amazing range for this small and cheap device and it also means that you can measure the 2.4GHz antennas used in WiFi. I'm not going to get into more details in this post, but you can see the full specs here.

As I’ve mentioned the TFT has a touch controller which is resistive, therefore you need a stylus pen to get precise control. Actually, this display is the ILI9341 which I’ve used in this post a few years back and I’ve managed to get an amazing refresh rate of 50 FPS using a dirt-cheap stm32f103 (blue-pill). As you can see from the above photo, the display is showing a Smith chart and four different graphs. This is the default display format but there are many more that you can get. If you’re interested you can have a look in the manual here.

The graphs are color coded: in the above photo the green graph is the Smith plot, the yellow is port 1 (incident, reflected), the blue is port 2 (transmitted) and the green is the phase. At the bottom you can see the center frequency and the span, which in this case are 2.4GHz and 800MHz, but depending on the mode you may see other information at the bottom of the screen. Again, for all the details you can read the manual, as the menu is quite large.

My favourite thing about the nanovna v2 is that the firmware is open source and available on github. You can fork the firmware, make changes, build it and then upload it to the device. The MCU is the GD32F303CC, which seems to be either an STM32F303 clone or a side product from ST with a different part number to target even lower-cost markets. From the Makefile it seems that this MCU doesn't have an FPU, which is a bit unfortunate as this kind of device would definitely benefit from one. Also from the repo you can see that libopencm3 and mculib are used. One thing that could be better is CMake support and using docker for the builds in order to get reproducible builds, which increases stability. Anyway, I'll probably do this myself at some point.

The only negative I could mention for this device is that the display touch-film is too glossy, which makes it hard to view under sunlight or even during the day inside a bright room. This can be solved, I guess, by using a matte non-reflective film like the ones used on smartphones (but not glass, because the touch membrane is resistive).

The antenna

As I’ve mentioned earlier, for this post I’ll only measure an external WiFi antenna which is common in many different devices. In my case this device is the open source TTGO T-Journal ESP32 camera here. This PCB ships with various different camera sensors and it usually costs ~$15. I bought this camera to use it with my 3D printer before I’ve installed octopi and to be honest I may revert back to it because it was more convenient than octopi. I’ve also printed a case for it and this is how it looks.

Above is just the PCB module and below is the module inside the case and the antenna connected. This is a neat little camera btw.

As you can guess, the antennas in those cheap camera modules are not meant to be very precise, but they do the job. As a matter of fact, it doesn't really make sense to even measure this antenna, because for its usage inside the house the performance is more than enough. It would make more sense with an antenna from which you need to get the maximum performance. Nevertheless, it's fine to use this one for playing around and learning.

This is an omni-directional antenna, which means that it transmits energy like a light bulb emits light, in all directions. This is nice because it doesn't matter that much how you place the antenna and your camera, but at the same time the gain is low, as a lot of energy is wasted. The opposite would be a directional antenna, which behaves like a flashlight. The problem with those antennas from eBay is that they don't come with specs, so you know nothing about the antenna except that it is supposed to be suitable for the WiFi frequency range.

Measurements

OK, now let’s measure this antenna using the nanovna and get its parameters. First thing you need to do before taking any measurements is calibrate the VNA. There are a couple of calibration methods, but the nanovna uses the SOLT method. SOLT means Short-Open-Load-Through and this refers to the port 1 load, therefore you need to run 4 different calibrations, one is by shorting the port, then leave the port open, then attaching a 50Ω load and finally connect port 1 through port 2. After those steps the device should be ready and calibrated. There is a nice video here, that demonstrates this better than I can.

The nice thing is that you can save your calibration data, so in my case I've set the center frequency to 2.4GHz, calibrated the nanovna and then stored the calibration data in flash. There are 5 memory slots available, but maybe you can increase that number in the firmware if needed (that's the nice thing about having access to the code).

These are some things that are very important during calibration and usage:

  • Tighten the SMA connectors properly. If you don't, then all your measurements will be wrong; also, if you tighten them too much then you may break the PCB connector or crack the PCB and its routes. The best way is to use a torque wrench, but since that would cost $20-$60, it's fine to just be careful.
  • Do not touch any conductive parts during measurements. For example, don't touch the SMA connectors.
  • Keep the device away from metal surfaces during calibration.
  • Keep the antenna (or DUT) away from metal surfaces during measurements, unless you do it on purpose.
  • Make sure you have enough battery left or have a power bank with you.

In the next photo you can see the current test setup after the calibration.

As you can see, the antenna is connected to port 1 and I'm running the nanovna on battery. The next photo was taken after finishing the calibration at 2.4GHz, with the antenna connected.

From the above image you can see that the antenna resonates at 2456 MHz, which is quite good. From the Smith plot you can see that the impedance doesn't match perfectly: it's a bit higher than 50Ω (54Ω) and there is also an 824pH inductance. That means that part of the incident energy is reflected even at the resonance frequency.
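Just for fun, we can plug those readings into the formulas from the VNA section and get a rough idea of the match. This is only a back-of-the-envelope sketch: it treats the 54Ω plus 824pH reading as the complete series load impedance and ignores any measurement uncertainty.

/* Back-of-the-envelope match calculation from the nanovna readings.
 * Build with: gcc match.c -lm */
#include <stdio.h>
#include <complex.h>
#include <math.h>

int main(void)
{
    const double pi = 3.14159265358979;
    const double z0 = 50.0;                       /* system impedance */
    const double f  = 2.456e9;                    /* resonance frequency read from the screen */
    const double L  = 824e-12;                    /* series inductance read from the screen */
    double complex z = 54.0 + I * 2.0 * pi * f * L;

    double complex gamma = (z - z0) / (z + z0);   /* reflection coefficient (S11) */
    double mag   = cabs(gamma);
    double rl_db = -20.0 * log10(mag);            /* return loss in dB */
    double vswr  = (1.0 + mag) / (1.0 - mag);

    printf("|S11| = %.3f, return loss = %.1f dB, VSWR = %.2f\n", mag, rl_db, vswr);
    return 0;
}

This works out to roughly |S11| ≈ 0.13, around 18 dB of return loss and a VSWR of about 1.3, which is far from perfect but more than fine for a WiFi camera inside the house.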

Next you can see how the antenna parameters change in real-time when a fork is getting close to the antenna.

As you can see the antenna is drastically affected, therefore you need to have this in mind. Of course, in this case the fork is getting quite close to the antenna.

Finally, I’ve used the whole setup near the 3D printer while printing to see how the printer affects the antenna in different positions. As you might be able to see from the following gif, the position does affect the antenna and also is seems that the head movement affects it, too.

Conclusions

I have to admit that although I'm not an RF engineer or experienced in this domain, I find the nanovna-v2 a great device. It's not only amazing what you can get nowadays for $60, but it's also nice to be able to experiment with all that RF magic. At some point in the future I'm planning to make custom antennas for drones and FPV and also build my own FPV setup, so this will certainly be very useful then.

Even if you’re not doing any RF this is a nice toy to play around with and try to understand how it works and also how antennas work and perforn. Making measurements is really easy and being a portable device makes it even better for experimenting. It’s really cool to see the antenna parameters in real time and try to explain what you see on the display.

As I’ve mentioned in the post, this case scenario is a bit useless as the camera just works fine inside the house and also the distance from the wifi router is not that far. Still, it was another stupid-project which was fun and I’ve learned a few new tricks. Again, I really recommend this device, even if you’re not an RF engineer and try to experiment with it. There are many variations and clones in eBay and most of them seem to be compatible with the open source firmware, just be aware that there’s a case that some variations may not be compatible.

Have fun!

 

Benchmarking the STM32MP1 IPC between the MCU and CPU (part 2)

Intro

In this post series I’m benchmarking the IPC between the Cortex-A7 (CA7) and the Cortex-M4 (CM4) of the STM32MP1 SoC. In the previous post here, I’ve tested the default OpenAMP TTY driver. This IPC method is the direct buffer sharing mode, because the OpenAMP is used as the buffer transfer API. After my tests I’ve verified that this option is very slow for large data transfers. Also, a TTY interface is not really an ideal interface to add into your code, for several reasons. For me dealing with old interfaces and APIs with new code is not the best option as there are also other ways to do this.

After seeing the results, my next idea was to replace the TTY interface with a Netlink socket. In order to do that though, I needed to write a custom kernel module, a raw OpenAMP firmware and also the userspace client that sends data to the Netlink socket. In this post I’ll briefly explain those things and also present the benchmark results and compare them with the TTY interface.

Test Yocto image

I’ve added the recipes with all the code in the BSP base layer here:

https://bitbucket.org/dimtass/meta-stm32mp1-bsp-base/src/master/
https://github.com/dimtass/meta-stm32mp1-bsp-base
https://gitlab.com/dimtass/meta-stm32mp1-bsp-base

This is the same repo as in the previous post, but I've now added 3 more recipes: one for the kernel module, one for the CM4 firmware and one for the userspace client.

To build the image using this BSP base layer, read the README.md file in the repo or the previous post. The README file is more than enough, though.

The repo of the actual code is here:

https://bitbucket.org/dimtass/stm32mp1-rpmsg-netlink-example/src/master/
https://github.com/dimtass/stm32mp1-rpmsg-netlink-example
https://gitlab.com/dimtass/stm32mp1-rpmsg-netlink-example

Kernel driver code

The kernel driver module code is this one here. As you can see, this is quite a simple driver, with no interrupts, DMA or anything fancy. In the init function I just register a new rpmsg driver.

ret = register_rpmsg_driver(&rpmsg_netlink_drv);

This line just registers the driver, but the driver will only be probed when a new service requests this specific driver's id name, which is this one here:

static struct rpmsg_device_id rpmsg_netlink_driver_id_table[] = {
    { .name	= "rpmsg-netlink" },
    { },
};

Therefore, when a newly added device (or service) requests this name, the probe function will be executed. I'll show you later how the driver is actually triggered.

Regarding the probe function, when it's triggered a new Netlink kernel socket is created and a callback function is set in the cfg Netlink kernel configuration struct.

nl_sk = netlink_kernel_create(&init_net, NETLINK_USER, &cfg);
if(!nl_sk) {
    dev_err(dev, "Error creating socket.\n");
    return -10;
}

The callback for the Netlink kernel socket is this function:

static void netlink_recv_cbk(struct sk_buff *skb)

This callback is triggered when new data is received on this socket. It's important to select a unit id which is not used by other Netlink kernel drivers. In this case I'm using this id:

#define NETLINK_USER 31

Inside the Netlink callback, the received data is parsed and then sent to the CM4 using the rpmsg (OpenAMP) API. As you can see from the code here, the driver doesn't send the whole received buffer at once, but splits the data into blocks, as rpmsg has a hard-coded buffer limited to 512 bytes. Therefore, the limitation that we had in the previous post still remains, of course. The point, as I've mentioned, is to simplify the userspace client code and not use TTY.

Finally, rpmsg_drv_cb() is the OpenAMP callback function and you can see the code here. This callback is triggered when the CM4 firmware sends data to the CA7 via rpmsg. In this case, the CM4 firmware will send back the number of bytes that were received from the CA7 kernel driver (from the Netlink callback). Then the callback sends this uint16_t back to the userspace application using Netlink.

Therefore, the userspace app sends/receives data to/from the kernel using Netlink and the kernel sends/receives data to/from the CM4 firmware using rpmsg. Note that all these stages copy buffers! So, no zero-copy here, but multiple memcpys, so we already expect some latency; we'll see how much later.
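To make that flow a bit more concrete, this is a rough sketch of the two directions inside the kernel module; it's illustrative only and not the actual driver code from the repo. The RPMSG_BLOCK value, the client_pid bookkeeping and the names of the two functions are my own assumptions for the sketch.

/* Illustrative sketch of the two kernel-side data paths (not the repo code).
 * Assumes rpdev and nl_sk are the rpmsg device and netlink socket that were
 * set up in probe()/init(). */
#include <linux/kernel.h>
#include <linux/string.h>
#include <linux/netlink.h>
#include <net/netlink.h>
#include <net/sock.h>
#include <linux/rpmsg.h>

#define RPMSG_BLOCK 496   /* assumed usable payload of the 512-byte rpmsg buffer */

static struct rpmsg_device *rpdev;
static struct sock *nl_sk;
static u32 client_pid;    /* PID of the userspace client, taken from the netlink header */

/* userspace -> CM4: split the netlink payload into rpmsg-sized blocks */
static void netlink_recv_sketch(struct sk_buff *skb)
{
	struct nlmsghdr *nlh = (struct nlmsghdr *)skb->data;
	u8 *data = nlmsg_data(nlh);
	int remaining = nlmsg_len(nlh);

	client_pid = nlh->nlmsg_pid;
	while (remaining > 0) {
		int n = min_t(int, remaining, RPMSG_BLOCK);

		rpmsg_send(rpdev->ept, data, n);  /* one memcpy into the rpmsg buffer */
		data += n;
		remaining -= n;
	}
}

/* CM4 -> userspace: forward the 2-byte reply to the client PID */
static int rpmsg_recv_sketch(struct rpmsg_device *dev, void *data, int len,
			     void *priv, u32 src)
{
	struct sk_buff *skb_out = nlmsg_new(len, GFP_KERNEL);
	struct nlmsghdr *nlh;

	if (!skb_out)
		return -ENOMEM;
	nlh = nlmsg_put(skb_out, 0, 0, NLMSG_DONE, len, 0);
	memcpy(nlmsg_data(nlh), data, len);
	return nlmsg_unicast(nl_sk, skb_out, client_pid);
}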

CM4 firmware

The CM4 firmware code is here. This code is more complex than the kernel driver, but the interesting part is in the main.c file. The most important lines are

#define RPMSG_SERVICE_NAME              "rpmsg-netlink"

and

OPENAMP_create_endpoint(&resmgr_ept, RPMSG_SERVICE_NAME, RPMSG_ADDR_ANY, rx_callback, NULL);

As you may have guessed, RPMSG_SERVICE_NAME is the same as the kernel driver's id name. Those two names need to match, otherwise the kernel driver won't get probed.

The rx_callback() function is the rpmsg interrupt callback on the firmware side. It only copies the buffer (more memcpys in the pipeline) and then the handling is done in main(), in this code:

if (rx_dev.rx_status == SET)
{
  /* Message received: send back a message anwser */
  rx_dev.rx_status = RESET;

  struct packet* in = (struct packet*) &rx_dev.rx_buffer[0];
  if (in->preamble == PREAMBLE) {
    in->preamble = 0;
    rpmsg_expected_nbytes = in->length;
    log_info("Expected length: %d\n", rpmsg_expected_nbytes);                        
  }

  log_info("RPMSG: %d/%d\n", rx_dev.rx_size, rpmsg_expected_nbytes);
  if (rx_dev.rx_size >= rpmsg_expected_nbytes) {
    rx_dev.rx_size = 0;
    rpmsg_reply[0] = rpmsg_expected_nbytes & 0xff;
    rpmsg_reply[1] = (rpmsg_expected_nbytes >> 8) & 0xff;
    log_info("RPMSG resp: %d\n", rpmsg_expected_nbytes);
    rpmsg_expected_nbytes = 0;

    if (OPENAMP_send(&resmgr_ept, rpmsg_reply, 2) < 0) {
      log_err("Failed to send message\r\n");
      Error_Handler();
    }
  }
}

As you can see from the above code, the buffer is parsed and if there is a valid packet in there, the firmware extracts the expected data length; when all those data have been received, it sends back the number of bytes using the OpenAMP API. That reply is then received by the kernel and sent to userspace using Netlink.
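For reference, the struct packet used above is just a tiny header in front of the payload. The exact definition is in the repo; conceptually it looks something like the sketch below, where the field widths and the PREAMBLE value are my own assumptions based on how the code reads them.

/* Conceptual sketch of the test packet header (not the exact definition
 * from the repo): a magic preamble followed by the total payload length,
 * so the receiver knows how many rpmsg fragments to wait for. */
#include <stdint.h>

#define PREAMBLE 0xBEEFDEAD   /* placeholder value; the real one is defined in the repo */

struct packet {
    uint32_t preamble;   /* magic word that marks the start of a transaction */
    uint16_t length;     /* total number of payload bytes that will follow */
    uint8_t  data[];     /* payload, possibly split across several rpmsg buffers */
} __attribute__((packed));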

User-space application

The userspace application code is here. If you browse the code you'll find out that it's very similar to the previous post's tty client; I've only made a few changes, like removing the tty and adding the Netlink socket class. Like in the previous post, a number of tests are added when the program starts, like this:

  tester.add_test(512);
  tester.add_test(1024);
  tester.add_test(2048);
  tester.add_test(4096);
  tester.add_test(8192);
  tester.add_test(16384);
  tester.add_test(32768);

Then the tests are executed. What you may find interesting is the Netlink class code and especially the part that sends/receives data to/from the kernel, which is this code here. Have a look at this code:

do {
    int n_tx = buffer_len < MAX_BLOCK_SIZE ?  buffer_len : MAX_BLOCK_SIZE;
    buffer_len -= n_tx;

    memset(&kernel, 0, sizeof(kernel));
    kernel.nl_family = AF_NETLINK;
    kernel.nl_groups = 0;

    memset(&iov, 0, sizeof(iov));
    iov.iov_base = (void *)m_nlh;
    iov.iov_len = n_tx;
    
    std::memset(m_nlh, 0, NLMSG_SPACE(n_tx));
    m_nlh->nlmsg_len = NLMSG_SPACE(n_tx);
    m_nlh->nlmsg_pid = getpid();
    m_nlh->nlmsg_flags = 0;

    std::memcpy(NLMSG_DATA(m_nlh), buffer, n_tx);

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = &kernel;
    msg.msg_namelen = sizeof(kernel);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    L_(ldebug) << "Sending " << n_tx << "/" << buffer_len;
    int err = sendmsg(m_sock_fd, &msg, 0);
    if (err < 0) {
        L_(lerror) << "Failed to send netlink message: " <<  err;
        return(0);
    }

} while(buffer_len);

As you can see, the data is not sent to the kernel driver as a single buffer via the Netlink socket. The reason is that the kernel socket can only allocate a buffer equal to the page size, therefore if you try to send more than 4KB the kernel will crash. Therefore, we need to split the data into smaller blocks and send them via Netlink. There are ways to increase this size, but a change like this would be global to the whole kernel, which would mean that all drivers would allocate larger buffers even if they don't need them, and that's a waste of memory.
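The snippet above only shows the send path. For completeness, here's a rough sketch of how the 2-byte reply from the kernel driver could be read back on the same socket; this is not the actual class code from the repo, it just reuses the m_sock_fd and m_nlh members from the snippet above and assumes the driver answers with a single netlink message. It needs <sys/socket.h>, <linux/netlink.h>, <cstring> and <cstdint>.

// Rough sketch of the receive side (not the actual class code).
struct sockaddr_nl kernel;
struct iovec iov;
struct msghdr msg;

std::memset(&kernel, 0, sizeof(kernel));
kernel.nl_family = AF_NETLINK;

std::memset(&iov, 0, sizeof(iov));
iov.iov_base = (void *)m_nlh;
iov.iov_len = NLMSG_SPACE(MAX_BLOCK_SIZE);

std::memset(&msg, 0, sizeof(msg));
msg.msg_name = &kernel;
msg.msg_namelen = sizeof(kernel);
msg.msg_iov = &iov;
msg.msg_iovlen = 1;

ssize_t n = recvmsg(m_sock_fd, &msg, 0);   // blocks until the driver replies
if (n > 0) {
    uint16_t nbytes = 0;
    std::memcpy(&nbytes, NLMSG_DATA(m_nlh), sizeof(nbytes));
    // nbytes should match the number of bytes the CM4 firmware reported back
}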

Benchmark results

To execute the test I've built the Yocto image using my BSP base layer, which includes all the recipes and installs everything in the image by default. What is important is that the module is already loaded in the kernel when it boots, so there's no need to modprobe the module. Given this, you only need to upload the firmware to the CM4 and then execute the application. In this image, all the commands need to be executed in the /home/root path.

First load the firmware like this:

./fw_cortex_m4_netlink.sh start

When running this, the kernel will print these messages (you can use dmesg -w to read them).

[ 3997.439653] remoteproc remoteproc0: powering up m4
[ 3997.444869] remoteproc remoteproc0: Booting fw image stm32mp157c-rpmsg-netlink.elf, size 198364
[ 3997.452743]  mlahb:m4@10000000#vdev0buffer: assigned reserved memory node vdev0buffer@10042000
[ 3997.461387] virtio_rpmsg_bus virtio0: rpmsg host is online
[ 3997.467937]  mlahb:m4@10000000#vdev0buffer: registered virtio0 (type 7)
[ 3997.472245] virtio_rpmsg_bus virtio0: creating channel rpmsg-netlink addr 0x0
[ 3997.473121] remoteproc remoteproc0: remote processor m4 is now up
[ 3997.492511] rpmsg_netlink virtio0.rpmsg-netlink.-1.0: rpmsg-netlink created netlink socket

The last line is actually printed by our kernel driver module. This means that when the firmware loaded, the driver's probe function was triggered, because it was matched by the RPMSG_SERVICE_NAME in the firmware. Next, run the application like this:

./rpmsg-netlink-client

This will execute all the tests. This is a sample output on my board.

- 21:27:31.237 INFO: Application started
- 21:27:31.238 INFO: Initialized netlink client.
- 21:27:31.245 INFO: Initialized buffer with CRC16: 0x1818
- 21:27:31.245 INFO: ---- Creating tests ----
- 21:27:31.245 INFO: -> Add test: size=512
- 21:27:31.245 INFO: -> Add test: size=1024
- 21:27:31.245 INFO: -> Add test: size=2048
- 21:27:31.246 INFO: -> Add test: size=4096
- 21:27:31.246 INFO: -> Add test: size=8192
- 21:27:31.246 INFO: -> Add test: size=16384
- 21:27:31.246 INFO: -> Add test: size=32768
- 21:27:31.246 INFO: ---- Starting tests ----
- 21:27:31.268 INFO: -> b: 512, nsec: 21384671, bytes sent: 20
- 21:27:31.296 INFO: -> b: 1024, nsec: 27190729, bytes sent: 20
- 21:27:31.324 INFO: -> b: 2048, nsec: 27436772, bytes sent: 20
- 21:27:31.361 INFO: -> b: 4096, nsec: 31332686, bytes sent: 20
- 21:27:31.419 INFO: -> b: 8192, nsec: 55592343, bytes sent: 20
- 21:27:31.511 INFO: -> b: 16384, nsec: 88094875, bytes sent: 20
- 21:27:31.681 INFO: -> b: 32768, nsec: 162541198, bytes sent: 20

The results start after the "Starting tests" string; b is the block size and nsec is the number of nanoseconds that the whole transaction lasted. Ignore the "bytes sent" value as it's not correct and fixing it would be a lot of hassle, since it would need a static counter in the kernel driver, which isn't really worth the trouble. I've used the Linux high-resolution timers, which are not very precise compared to the CM4 timers, but they're enough for this test since the times are in the range of milliseconds. I'm also listing the results in the table below.
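By the way, the timing itself is nothing fancy; it boils down to wrapping each transaction with the POSIX monotonic clock, something along these lines (a simplified sketch; the actual test class in the repo may differ in details, and block_size is just a placeholder). It needs <time.h>, <stdio.h> and <stdint.h>.

/* Simplified sketch of timing a single test transaction in nanoseconds. */
struct timespec start, end;

clock_gettime(CLOCK_MONOTONIC, &start);
/* send the block over netlink and wait for the 2-byte reply here */
clock_gettime(CLOCK_MONOTONIC, &end);

int64_t nsec = (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
             + (end.tv_nsec - start.tv_nsec);
printf("-> b: %d, nsec: %lld\n", block_size, (long long)nsec);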

# of bytes (block)   msec
512                  21.38
1024                 27.19
2048                 27.43
4096                 31.33
8192                 55.59
16384                88.09
32768                162.54

Now let’s compare those number with the previous tests in the following table

# of bytes   TTY (msec)   Netlink (msec)   diff (msec)
512          11.97        21.38            9.41
1024         15.32        27.19            11.87
2048         21.74        27.43            5.69
4096         37.64        31.33            -6.31
8192         -            55.59            -
16384        -            88.09            -
32768        -            162.54           -

These are interesting numbers. As you can see, up to 2KB of data the TTY implementation is faster, but at >=4KB the Netlink driver has better performance. It's also important that the Netlink implementation doesn't have the issue with the limited block size, so you can send more data using the netlink client API I've written. Well, the truth is that the block size is still hard-coded in OpenAMP, but in this case, without the TTY interface, the ringbuffer seems to empty properly. That's something that would need further investigation, but I don't think I'll have time for it.

From this table, in the case of the 32KB block we see that the transfer rate is 201.6 KB/sec, which is almost double compared to the TTY implementation. This is much better performance, but again it's far slower than the indirect buffer sharing mode, which I'll test in the next post.

Conclusions

In this post I’ve implemented a kernel driver that uses OpenAMP to exchange data with the CM4 and a netlink socket to exchange data with the userspace. In this scenario the performance is worse compared to the default TTY implementation (/dev/ttyRPMSGx) for blocks smaller than 4KB, but it’s faster for >=4KB. Also my tests shown that if the block is 32KB then this implementation it’s twice as fast than the TTY.

Although the results are better than in the previous post, this implementation still can't be considered a good option if you need to transfer large amounts of data fast. Nevertheless, I would personally consider it a good option for smaller data sizes, because the interface from userspace is now much more convenient, as it's based on a netlink socket. Therefore, you don't need to interface with a TTY port anymore and that is an advantage.

So, I’m quite happy with the results. I would definitely prefer to use this option rather the TTY interface for data blocks more than 2KB, because netlink is more friendly API, at least to me. Maybe you have a difference preference, but overall those two solutions are only good for smaller block sizes.

In the next post, I’ll benchmark the indirect buffer sharing.

Have fun!

Benchmarking the STM32MP1 IPC between the MCU and CPU (part 1)

Intro

Update: The second part is here.

Long time no see. Well, having two babies at home is really not recommended for personal stupid projects. Nevertheless, I've found some time to dive into the STM32MP157C-DK2 dev kit that I recently received from ST. Let me say beforehand that this is a really great board; of course it has its pros and cons, which I will mention later in this post. Although it's been a year since this board was released, I hadn't had the chance to use it and I'm glad I received this sample.

I need to mention that this post is not an introduction to the dev kit and it's a bit advanced, as it deals with more advanced concepts. One thing I like about some of the new SoCs released in the last couple of years is the integrated MCU alongside the CPU, sharing the same resources and peripherals. This is a really powerful option as it allows you to split your functionality into real-time, critical applications that run on the MCU and more advanced userspace applications that run on Linux on the CPU.

The MCU is an STM32 ARM Cortex-M4 core (CM4) running at 209MHz and the application CPU is a dual Cortex-A7 (CA7) running at 650MHz. Well, yes, this is a fast CM4 and a slow CA7. As I've said, those two share almost all the same peripherals and the OpenAMP API is used as the IPC to exchange data between them.

I have quite some experience working with real-time Linux kernels and I really like this concept; that's why I always support a PREEMPT-RT (P-RT) kernel in my meta-allwinner-hx Yocto meta layer. Although P-RT is really nice to have when low latency is needed, it's still far too bloated and slow compared to a baremetal firmware running on an MCU. I won't get into the details of this, because I think it's obvious. Therefore, the STM32MP1 and other similar SoCs combine the best of those two worlds: you can use the CM4 for timing-critical applications and the CA7 for everything else, like the GUI, networking, remote updates etc.

In the case of the STM32MP1, the OpenAMP framework is used for the IPC between the CM4 and CA7. When I went through the docs for the STM32MP1, the first thing that came to my mind was to benchmark and evaluate its performance, and this series of posts is about doing exactly that and finding out what options you have and how well they perform.

Reading the docs

One thing I like about ST is the very good documentation. Well, tbh it's far from excellent, but it's better than other documentation I've seen. Generally, bad or missing documentation is a well-known problem with almost all vendors and in my opinion Texas Instruments and ST are far better compared to other vendors (I'll also exclude the NXP i.MX6, which is well documented too).

When you start with a new SoC and you already have experience with other SoCs, it's easier to retrieve valuable information faster and search for the specific information you know is important. Nevertheless, I really enjoyed getting into the STM32MP1 documentation in the wiki, because I found really good explanations of the Linux graphics stack, trusted firmware, OP-TEE and the boot chain. For example, have a look at the DRM/KMS overview here, it's really good.

Of course there are many things missing from the docs, but I also need to say that this is quite a complex SoC and there are so many tools and frameworks involved in the development process that it would take tens of thousands of pages to explain everything, and that is not manageable even for ST. For example, you'll find yourself lost the moment you want to create your own Yocto distro, as there is very little info on how to do that, and that's the reason I've decided to write a template BSP layer that greatly simplifies the process.

Contributions

While diving into the SoC, I couldn't help myself and fixed some things, therefore I've made some contributions, like this and this, and I've also created this meta layer here. I'm also planning to contribute a kernel module and a userspace application that will use netlink between userspace and the kernel module, but I'll explain why later. This will be demonstrated in the next post, though, as it's related to the IPC performance.

IPC options

OK, so let’s now see what are the current options that you get with the STM32MP1 when it comes to IPC between the CM4 and CA7. I’ve already mentioned the OpenAMP which ST names it the direct buffer exchange mode, but you can also use an indirect buffer exchange mode. Both are well explained here, but I’ll add a couple of more information here.

Generally, OpenAMP has a weakness that is built into the implementation, and that is the hard-coded buffer size limitation, which is set to 512 bytes. We'll see later how this affects the transfer speed, but as you can imagine this size might not be enough in some cases. Therefore, before you decide which IPC option you're going to choose, you need to be sure how much data you need to share between the CM4 and CA7 and how fast.

Another problem with OpenAMP is that there's no zero-copy or DMA support, which means that buffers are copied on all ends, and therefore the performance is expected to suffer. On top of that, the buffers are not cacheable, which means even more CPU cycles for fetching the buffers and then copying them.

OpenAMP comes with 2 examples/implementations in the SDK. The first is a TTY device on the Linux side, which means that you can transfer data from/to userspace by interfacing with a tty com port. You can see this implementation here. If that sounds bad to you, then it is actually as bad as it sounds, because exchanging data that way really sucks, especially on the Linux side (interfacing with a tty from code). The only case where I think this is useful and makes sense is if you port existing code from a device that used a CPU and an external MCU which were exchanging data via a UART port. Otherwise, it doesn't make much sense to use it. Nevertheless, being loyal to the blog's name, I've spent some time implementing this to see how it performs. More about this later.

The second implementation is the raw OpenAMP mode, where you use a kernel module to interface with the CM4; the firmware implementation is here. In order to exchange data with userspace in this case, you need a kernel module like this one here. But as you can see, that module is pretty much useless on its own because you can't interface with it from userspace, therefore you need to write your own module and use ioctl or netlink (or anything else you like). But more about this in the next post. (Edit: actually I've seen that ST provides a module with IOCTL support here).

On the other hand, there is also the indirect buffer option which, compared to OpenAMP, allows sharing much larger buffers (of any size), in cached memory, with zero-copy and also using DMA. This option is expected to perform much faster than OpenAMP and with much lower latency. To be honest, this sounds like a no-brainer to me for most cases, except when you just need to exchange a few control commands. It makes sense for that last case, because you need OpenAMP even in the indirect buffer mode, as it's used as the control interface for the shared buffer anyway.

Creating a Yocto image

In order to use or test any of the above things, you need to build and flash a distro for the board. Here I can complain about the non-existent documentation on how to do this. In my case that was not really an issue, but I guess many devs will eventually struggle with it. For that reason, I've created a BSP base layer for the STM32MP157C-DK2 board. This post, though, is not focused on how to deal with the Yocto distro, so I'll go through those things quite fast so I can proceed further.

To build the distro I’ve used, you need this repo here:

https://bitbucket.org/dimtass/meta-stm32mp1-bsp-base/src/master/
https://github.com/dimtass/meta-stm32mp1-bsp-base
https://gitlab.com/dimtass/meta-stm32mp1-bsp-base

The README file in the repo is quite thorough. So, in order to build the image you need to create a folder and run these commands:

cd stm32mp1-yocto
repo init -u https://bitbucket.org/dimtass/meta-stm32mp1-bsp-base/src/master/default.xml
repo sync

This will fetch all the needed repos and place them in the sources/ folder. Then in order to build the image you need to run these commands:

ln -s sources/meta-stm32mp1-bsp-base/scripts/setup-environment.sh .
MACHINE=stm32mp1-discotest DISTRO=openstlinux-eglfs source ./setup-environment.sh build
bitbake stm32mp1-qt-eglfs-image

Your host needs to have all the needed dependencies in order to build the image. If this is the first time you do this, then it's better to use a docker image instead, like this one here. I've made this docker image to build the meta-allwinner-hx repo, but you can use the same one for this repo, like this:

docker build --build-arg userid=$(id -u) --build-arg groupid=$(id -g) -t allwinner-yocto-image .
docker run -it --name allwinner-builder -v $(pwd):/docker -w /docker allwinner-yocto-image bash
MACHINE=stm32mp1-discotest DISTRO=openstlinux-eglfs source ./setup-environment.sh build
bitbake stm32mp1-qt-eglfs-image

If you want to build the SDK to use it for compiling other code then run this:

bitbake -c populate_sdk stm32mp1-qt-eglfs-image

Then you can install the produced SDK to your /opt folder and then source it to build any code you might want to test.

Benchmark tool code

Assuming that you have the above image built, you can flash it to the target and run the test. There is a brief explanation of how to flash the image to the dev kit in the meta-stm32mp1-bsp-base repo README.md file. After the image is flashed, the firmware for the CM4 is located in /lib/firmware/stm32mp157c-rpmsg-test.elf. There is also a script and an executable in /home/root. The recipes that install those files are located in `meta-stm32mp1-bsp-base/recipes-extended/stm32mp1-rpmsg-test`.

The code is located in this repo:

https://bitbucket.org/dimtass/stm32mp1-rpmsg-test/src/master/
https://github.com/dimtass/stm32mp1-rpmsg-test
https://gitlab.com/dimtass/stm32mp1-rpmsg-test

This repo contains both the CM4 firmware and the CA7 Linux userspace tool. The userspace tool uses the /dev/ttyRPMSG0 tty port to send blocks of data. As you can see here, I'm creating some tests at runtime with several block sizes and I'm sending those blocks a number of times. For example, see the following:

tester.add_test(512, 1);
tester.add_test(512, 2);
tester.add_test(512, 4);
tester.add_test(512, 8);

The first line sends a block of 512 bytes once. The second sends the 512-byte block two times, etc. In the code, you'll see that there are some lines that are commented out. The reason is that I found I couldn't send more than 5-6KB per run; it seems that OpenAMP hangs on the kernel side and I need to open/close the tty port to send more data. As I've mentioned, the buffer size in OpenAMP is hardcoded to 512 bytes, therefore I assume that the ringbuffer overflows at some point and the execution hangs, but I don't know the real reason. Therefore, I wasn't able to test more than 5KB.

The code for interfacing with the com port is here. I guess this class will be very useful to you if you want to interface with the ttyRPMSG in your code, as I’ve used it many times and it seems robust. It’s also simple to use and integrate into your code.
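
If you prefer not to use that class, a plain termios setup in raw mode is enough to talk to the rpmsg tty. This is just a minimal sketch of the idea and not the code of the class; the open_rpmsg_tty() name is only for this example:

#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

// Minimal sketch: open the rpmsg tty in raw mode, so the line discipline
// doesn't touch the data (no echo, no CR/LF translation).
static int open_rpmsg_tty(const char* dev)
{
    int fd = open(dev, O_RDWR | O_NOCTTY);
    if (fd < 0)
        return -1;

    struct termios tio;
    if (tcgetattr(fd, &tio) != 0) {
        close(fd);
        return -1;
    }
    cfmakeraw(&tio);        // raw mode: no echo, no special character handling
    tio.c_cc[VMIN] = 1;     // read() blocks until at least 1 byte is available
    tio.c_cc[VTIME] = 0;
    if (tcsetattr(fd, TCSANOW, &tio) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// usage: int fd = open_rpmsg_tty("/dev/ttyRPMSG0");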

Finally, the test class is here and I’m using a simple protocol with a header that contains a preamble and the length of the packet data. This simplifies the CM4 firmware, as it knows how much data to wait for and acknowledges when the whole packet is received. Be aware that when you send a 512-byte block, the data doesn’t arrive on the CM4 at once in a single callback; OpenAMP may trigger more than one interrupt for the whole block, therefore knowing the length of the data at the start of the transaction is important.
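
To give you an idea of the framing, this is roughly what the sender side looks like. Treat it as an illustrative sketch only: the packet_hdr field widths, the preamble value and the send_block() helper are assumptions for this example, the real definitions are in the repo.

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <vector>

// Illustrative only: field widths and the preamble value are assumptions.
struct __attribute__((packed)) packet_hdr {
    uint32_t preamble;   // fixed magic value the CM4 looks for
    uint16_t length;     // how many payload bytes follow the header
};

// Build one frame (header + payload) and push it through the tty in a
// single write(). The CM4 reads the length from the header, so it knows
// when the whole packet has arrived and can acknowledge it.
static int send_block(int fd, const uint8_t* payload, uint16_t len)
{
    const packet_hdr hdr = { 0xBEEFDEAD, len };      // hypothetical preamble value
    std::vector<uint8_t> frame(sizeof(hdr) + len);
    memcpy(frame.data(), &hdr, sizeof(hdr));
    memcpy(frame.data() + sizeof(hdr), payload, len);
    return write(fd, frame.data(), frame.size()) == (ssize_t) frame.size() ? 0 : -1;
}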

On the CM4 side, the firmware is located here. The interesting code is in the main.c file and the most important function is the tty callback, so I’ll also list the code here.

void VIRT_UART0_RxCpltCallback(VIRT_UART_HandleTypeDef *huart)
{

  /* Number of bytes received in this callback, clamped to the local buffer size */
  uint16_t recv_size = huart->RxXferSize < MAX_BUFFER_SIZE? huart->RxXferSize : MAX_BUFFER_SIZE-1;

  struct packet* in = (struct packet*) &huart->pRxBuffPtr[0];
  if (in->preamble == PREAMBLE) {
    in->preamble = 0;
    virt_uart0_expected_nbytes = in->length;
    log_info("length: %d\n", virt_uart0_expected_nbytes);                        
  }

  virt_uart0.rx_size += recv_size;
  log_info("UART0: %d/%d\n", virt_uart0.rx_size, virt_uart0_expected_nbytes);
  if (virt_uart0.rx_size >= virt_uart0_expected_nbytes) {
    virt_uart0.rx_size = 0;
    virt_uart0.tx_buffer[0] = virt_uart0_expected_nbytes & 0xff;
    virt_uart0.tx_buffer[1] = (virt_uart0_expected_nbytes >> 8) & 0xff;
    log_info("UART0 resp: %d\n", virt_uart0_expected_nbytes);
    virt_uart0_expected_nbytes = 0;
    virt_uart0.tx_size = 2;
    virt_uart0.tx_status = SET;
    // huart->RxXferSize = 0;
  }
}

As you can see, in this callback the code looks for the preamble and, if it’s found, it stores the expected number of bytes. The callback may then be triggered multiple times and when all the bytes have arrived, the CM4 sends back a response that contains the number of bytes, and the userspace application verifies that all bytes were sent. It’s as simple as that.
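
On the CA7 side, verifying that response boils down to reading those two bytes back and comparing the reported count with the length that was sent in the header. Again, this is a minimal sketch of the idea and not the tool’s exact code; the check_response() name is hypothetical:

#include <stdint.h>
#include <unistd.h>

// Minimal sketch: read the 2-byte response from the CM4 (the byte count,
// sent LSB first as in VIRT_UART0_RxCpltCallback) and compare it with the
// length that was put in the header of the packet we sent.
static bool check_response(int fd, uint16_t expected_len)
{
    uint8_t ack[2];
    size_t got = 0;
    while (got < sizeof(ack)) {          // the two bytes may arrive separately
        ssize_t ret = read(fd, ack + got, sizeof(ack) - got);
        if (ret <= 0)
            return false;                // error or EOF
        got += (size_t) ret;
    }
    const uint16_t reported = (uint16_t)(ack[0] | (ack[1] << 8));
    return reported == expected_len;
}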

Benchmark results

To run the test you need to cd to the /home/root directory and then run this script:

./fw_cortex_m4.sh start

This will load the firmware that is stored in /lib/firmware to the CM4 and execute it. When this happens you should see this in your serial console:

fw_cortex_m4.sh: fmw_name=stm32mp1-rpmsg-test.elf
[  162.549297] remoteproc remoteproc0: powering up m4
[  162.588367] remoteproc remoteproc0: Booting fw image stm32mp1-rpmsg-test.elf, size 704924
[  162.596199]  mlahb:m4@10000000#vdev0buffer: assigned reserved memory node vdev0buffer@10042000
[  162.607353] virtio_rpmsg_bus virtio0: rpmsg host is online
[  162.615159]  mlahb:m4@10000000#vdev0buffer: registered virtio0 (type 7)
[  162.620334] virtio_rpmsg_bus virtio0: creating channel rpmsg-tty-channel addr 0x0
[  162.622155] rpmsg_tty virtio0.rpmsg-tty-channel.-1.0: new channel: 0x400 -> 0x0 : ttyRPMSG0
[  162.633298] remoteproc remoteproc0: remote processor m4 is now up
[  162.648221] virtio_rpmsg_bus virtio0: creating channel rpmsg-tty-channel addr 0x1
[  162.671119] rpmsg_tty virtio0.rpmsg-tty-channel.-1.1: new channel: 0x401 -> 0x1 : ttyRPMSG1

This means that the kernel module is loaded, the two ttyRPMSG devices are created in /dev, the firmware is loaded on the CM4 and the firmware code is executing. In order to verify that the firmware is working, you can use a USB-to-UART module and connect its Rx pin to D0 and its Tx pin to D1 (UART7) of the Arduino-compatible header on the bottom of the board. When the firmware has booted you should see this:

[00000.008][INFO ]Cortex-M4 boot successful with STM32Cube FW version: v1.2.0 
[00000.016][INFO ]MAX_BUFFER_SIZE: 32768
[00000.019][INFO ]Virtual UART0 OpenAMP-rpmsg channel creation
[00000.025][INFO ]Virtual UART1 OpenAMP-rpmsg channel creation

Now you can execute the benchmark tool in userspace by running this command in the console:

./tty-test-client /dev/ttyRPMSG0

Then the benchmark is executed and you’ll see the results on both consoles (CM4 and CA7). The CA7 output will be something like this:

- 15:43:54.953 INFO: Application started
- 15:43:54.954 INFO: Connected to /dev/ttyRPMSG0
- 15:43:54.962 INFO: Initialized buffer with CRC16: 0x1818
- 15:43:54.962 INFO: ---- Creating tests ----
- 15:43:54.962 INFO: -> Add test: block=512, blocks: 1
- 15:43:54.962 INFO: -> Add test: block=512, blocks: 2
- 15:43:54.962 INFO: -> Add test: block=512, blocks: 4
- 15:43:54.963 INFO: -> Add test: block=512, blocks: 8
- 15:43:54.963 INFO: -> Add test: block=1024, blocks: 1
- 15:43:54.964 INFO: -> Add test: block=1024, blocks: 2
- 15:43:54.964 INFO: -> Add test: block=1024, blocks: 4
- 15:43:54.964 INFO: -> Add test: block=1024, blocks: 5
- 15:43:54.964 INFO: -> Add test: block=2048, blocks: 1
- 15:43:54.964 INFO: -> Add test: block=2048, blocks: 2
- 15:43:54.964 INFO: -> Add test: block=4096, blocks: 1
- 15:43:54.964 INFO: ---- Starting tests ----
- 15:43:54.977 INFO: -> b: 512, n:1, nsec: 11970765, bytes sent: 512
- 15:43:54.996 INFO: -> b: 512, n:2, nsec: 18770380, bytes sent: 1024
- 15:43:55.027 INFO: -> b: 512, n:4, nsec: 31022063, bytes sent: 2048
- 15:43:55.083 INFO: -> b: 512, n:8, nsec: 56251848, bytes sent: 4096
- 15:43:55.099 INFO: -> b: 1024, n:1, nsec: 15322572, bytes sent: 1024
- 15:43:55.124 INFO: -> b: 1024, n:2, nsec: 24824116, bytes sent: 2048
- 15:43:55.168 INFO: -> b: 1024, n:4, nsec: 43830248, bytes sent: 4096
- 15:43:55.221 INFO: -> b: 1024, n:5, nsec: 53327292, bytes sent: 5120
- 15:43:55.243 INFO: -> b: 2048, n:1, nsec: 21742144, bytes sent: 2048
- 15:43:55.281 INFO: -> b: 2048, n:2, nsec: 37633885, bytes sent: 4096
- 15:43:55.318 INFO: -> b: 4096, n:1, nsec: 37649011, bytes sent: 4096

The results start after the “Starting tests” string; b is the block size, n is the number of blocks and nsec is the number of nanoseconds the whole transaction lasted. I’ve used the Linux precision timers, which are not as precise as the CM4 hardware timers, but that’s enough for this test since the times are in the range of milliseconds. I’m also listing the results in the next table.

Block size (bytes) | # of blocks | msec
512  | 1 | 11.97
512  | 2 | 18.77
512  | 4 | 31.02
512  | 8 | 56.25
1024 | 1 | 15.32
1024 | 2 | 24.82
1024 | 4 | 43.83
1024 | 5 | 53.32
2048 | 1 | 21.74
2048 | 2 | 37.63
4096 | 1 | 37.64

The results are quite interesting. As you can see, size matters! The larger the block size, the faster the transfer rate. Sending 4KB of data from the CA7 to the CM4 as 8x 512-byte blocks needs 56.25 ms, while a single 4096-byte block takes 37.64 ms, which is roughly 33% less time (about 1.5x the throughput) with the bigger block size.

Another interesting result is that in the best case the transfer rate is 4096 bytes in 37.64 msec, which is 4096 / 0.03764 ≈ 108,800 bytes/sec, or ~106 KB/sec. That’s too slow!

To sum up the results: if you want to use the direct buffer exchange mode (OpenAMP tty), then you’re limited to ~106KB/sec, and in order to reach even that speed you need the largest possible block size.

Conclusions

In this post I’ve provided a simple BSP base Yocto layer that might be helpful for starting a new STM32MP1 project. I’ve used this Yocto image to build a benchmark tool and test the direct buffer exchange mode; the firmware and the userspace tool are included as recipes in the BSP base repo.

The results are a bit disappointing when it comes to performance, and I’ve found that the larger the block size, the better the transfer speed. That means this mode is meant for sending small amounts of data (e.g. control commands) or for porting an application to the STM32MP1 that was previously running on a separate CPU and MCU. If you need better performance, then the indirect mode seems to be the way to go, but I’ll verify this in a future post.

In the next post I’ll be using the direct mode again, but this time I’ll write a small kernel driver module using netlink and test if there’s any performance gain. I don’t expect much out of this, but I want to do it anyway because I find a tty interface a bit cumbersome to handle in a modern program, and netlink is a more elegant solution.

In the conclusions section I would also like to list some pros/cons of the STM32MP157C-DK2 board (and MP1 SoC).

Pros:

  • Very good Yocto support
  • Great documentation and datasheets
  • Many examples and code for the CM4
  • Support for OP-TEE secure RTOS
  • Very good power management (supported in both CM4 and CA7)
  • GbE
  • WiFi & BT 4.1
  • Arduino V3 connector
  • The CM4 firmware can be uploaded during boot (in u-boot) or in Linux
  • DSI display with touch
  • Direct and indirect buffer sharing mode
  • Provides an AI acceleration API (supports also TF-Lite)
  • Great and mature development tools

Cons:

  • The CA7 has a low clock (which tbh doesn’t mean it’s bad; for many use cases this is actually a good thing).
  • You can’t use DSI and HDMI at the same time.
  • IPC is not that fast (in direct mode only)
  • The 40-pin expansion connector is under the LCD, so it’s hard to reach and fit the cables. It would be better if it was a 90-degree angled connector.
  • Generally, because the LCD takes up one whole side of the board, it’s also difficult to use the Arduino connector. Therefore, you need to lay the board on the LCD side and use the HDMI output. Well, they couldn’t have done much about this, but…
  • The packaging box could be designed so that it can be used as a raised base that makes the Arduino connector accessible, something like the nVidia Jetson Nano packaging box.
  • The Yocto documentation is lacking and it will be difficult for inexperienced developers to get started (but that’s also a generic issue with Yocto, as the official documentation is not complete and some important features are not documented).
  • No CMake support in the examples. Generally, I prefer CMake for many reasons and I like, for example, that NXP provides their SDK with CMake instead of Makefiles (personal preference I guess, but CMake is also widely used in most of the projects I’ve worked on).

Overall, this is a great but complicated board and it takes some effort to set up and optimize everything. Even more, it’s quite complicated to do a design from scratch in a small team, so if you deviate from the reference design you might need more time to configure everything properly and bring up your custom board. Although I would be able to handle a full custom design with this board, I know it would take a lot of time to finalize it. Therefore, if you don’t use one of the many ready-made solutions with the STM32MP1 SoC, then keep in mind that a custom solution will take some effort. It’s up to you to weigh the development cost of a full custom design against using one of the many other available SoCs that can help you get started in no time.

I would definitely take this SoC into account for applications that need good real-time performance (using the CM4 core), with the CA7 core handling the GUI and the business logic. If you wonder what the difference is compared to a multi-core ARM CPU running at high frequency, then you can take for granted that -depending on your use case- even a PREEMPT-RT kernel on an 8-core CPU running at 2GHz can’t beat the real-time performance of a 200MHz CM4.

Have fun!