Measuring antennas with the NanoVNA v2

Intro

Many IoT projects use some kind of interface for transmitting data to a server (or "the cloud", which is the fancy name nowadays). Back in my days (I'm not that old) that was a serial port (RS-232 or RS-485) sending data to a computer. Later, serial-to-ethernet modules (like the XPort from Lantronix) replaced serial cables and then WiFi (or any other wireless technology like BT and LoRa) came into play. Anyway, today using a wireless module has become the standard, as it's easy and cheap to use and it doesn't need a cable to connect the two ends.

Actually, it's amazing how cheap it is to buy an ESP-01 today, which is based on the ESP8266. For less than $2 you can add to your device an IEEE 802.11 b/g/n Wi-Fi module with an integrated TR switch, balun, LNA, power amplifier and matching network, and support for WEP, WPA and WPA2 authentication.

But all those awesome features come with some difficulties or issues and in this post I'll focus on one of them, which is the antenna. I'll try to keep this post simple as I'm not an RF engineer, so if you are one then this post is most probably not for you, as it's not advanced.

Antennas

I guess that many tons of WiFi modules are sold every year. Many modules, like the ESP-01, come with an integrated PCB antenna, but many others use external antennas. Some time ago I bought a WiFi camera for my 3D printer, before I switched to OctoPi. That camera is based on the ESP32, it has a fish-eye lens and it doesn't have an integrated antenna, but an SMA connector for an external one.

At some point I was curious how well those antennas actually perform and how they're affected by the surrounding environment. There are many things that can affect the performance of an antenna, like the material used to build the antenna itself, its position, its orientation and other factors. For example, the 3D printer is made from aluminum and steel, lots of it, which are both conductive and therefore affect the antenna. They can affect the antenna performance in many ways, e.g. by shifting the resonance frequencies and maybe the gain.

I guess that for most applications, and especially for this use case (the camera for this 3D printer), the effect is negligible, but again this is a blog for stupid projects, so let's do this.

Vector Network Analyzers

So, how do you measure an antenna's parameters and what are those parameters? First, an antenna has several parameters, like gain, resonance frequencies, return & insertion loss, impedance, standing wave ratio (SWR, i.e. impedance matching) and others. All those parameters are defined by the antenna design and also by the environmental conditions. Since it's very hard to figure out and calculate all those different parameters while designing an antenna, it's easier to design the antenna using rough calculations and then trim it through testing and measurements.

Another thing is that even in the manufacturing stage, most of the time the antennas won't have the exact same parameters, therefore you need to test and measure again in order to trim them. In some designs this is a bit easier because a passive trimmer component is included.

In any case, in order to measure and test the antenna you need a device called a Vector Network Analyzer, or VNA. VNAs are instruments that actually measure coefficients, and from those coefficients you can then calculate the rest of the parameters. Pretty much all VNAs share the same principle design. The next two pictures show the block diagram of a VNA when measuring a two-port device (e.g. a filter) and an antenna.

The original image was taken from here and I’ve edited it. Click on the images to view them in larger size.

So, in the first image you see that the VNA has a signal generator which feeds the device under test (DUT) from one port and then receives the output of the DUT on a second port. The VNA is then able to measure the incident, reflected and transmitted signals and with some maths it can calculate the parameters. The VNA is actually only able to measure the reflection coefficient (S11), which depends on the incident and reflected signal before the DUT, and the transmission coefficient (S21), which depends on the transmitted signal after the DUT. An intuitive way of thinking about this is what happens when you try to fit a hose to a water pump while it's pumping out water. While you try to fit the hose, if the hose diameter doesn't fit the pump, then part of the water flow will get into the hose and some will spill out (reflected back?). But if the hose fits the pump perfectly, then all the flow will go through the hose.
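
To put this in the standard textbook notation (nothing specific to the NanoVNA), the two S-parameters are just ratios of the reflected and transmitted waves to the incident wave:

\[ S_{11} = \frac{b_1}{a_1} \qquad\qquad S_{21} = \frac{b_2}{a_1} \]

where \(a_1\) is the incident wave at port 1, \(b_1\) the wave reflected back at port 1 and \(b_2\) the wave that comes out of port 2.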

In the case of an antenna, you can't really measure the transmitted energy in a simple way that is accurate enough for these kinds of measurements. Therefore, only one port is used and the VNA can just calculate the reflection coefficient. But most of the time this is enough to get valuable information and be able to trim your antenna. The S11 coefficient is used to calculate the antenna matching, and the impedance matching defines whether your antenna will be able to transmit the maximum energy. If the impedances don't match, then your antenna will only transmit a portion of the incident energy.
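
For completeness, these are the standard relations behind that statement (again textbook formulas, not something specific to this device): the reflection coefficient follows from the antenna and reference impedances, and its magnitude tells you how much power bounces back instead of being radiated:

\[ \Gamma = S_{11} = \frac{Z_L - Z_0}{Z_L + Z_0} \qquad \frac{P_{refl}}{P_{inc}} = |\Gamma|^2 \qquad \mathrm{VSWR} = \frac{1 + |\Gamma|}{1 - |\Gamma|} \]

Here \(Z_0\) is the reference impedance (50Ω in this case) and \(Z_L\) is the antenna impedance; a perfect match (\(Z_L = Z_0\)) gives \(\Gamma = 0\), so nothing is reflected.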

As I've mentioned earlier, the antenna will also be affected by other conditions and not only by impedance matching, therefore a portable VNA is really helpful to measure the antenna on site and under different conditions.

For this post, I’m using the new NanoVNA v2 analyzer, which I’ll introduce next.

NanoVNA v2

Until a few years ago a VNA was very expensive and considered an exotic measurement instrument. A few decades ago a VNA would cost many times your apartment (or house) price and a few years ago many times your yearly salary. Nowadays the IC manufacturing process and technology and the available processing power of the newer small and cheap MCUs have brought this cost down to… $50. Yes, that's just $50. This is so mind blowing that you should get one even if you don't know what it is and you'll never use it…

Anyway, this is how it looks like:

There are many different sellers on eBay that sell this VNA with different options. The one I bought costs $60 and also includes a stylus pen for the touch screen, a hard metallic cover plate for the top and bottom, a battery, coax cables and calibration standards (open, short and 50Ω load). I've also printed this case here with my 3D printer and replaced the metal plates.

As you can see this is the V2, so there has to be a V1, and there is one. The main and most important difference between the V1 and V2 is the frequency range, which in V1 is from 50kHz up to 900MHz and in V2 is from 50kHz up to 3GHz! That's an amazing range for this small and cheap device and it also means that you can measure the 2.4GHz antennas used in WiFi. I'm not going to get into more details in this post, but you can see the full specs here.

As I've mentioned, the TFT has a resistive touch controller, therefore you need a stylus pen to get precise control. Actually, this display is the ILI9341, which I've used in this post a few years back where I managed to get an amazing refresh rate of 50 FPS using a dirt-cheap stm32f103 (blue-pill). As you can see from the above photo, the display is showing a Smith chart and four different graphs. This is the default display format but there are many more that you can get. If you're interested you can have a look at the manual here.

The graphs are color coded, therefore in the above photo the green graph is the Smith plot, the yellow is port 1 (incident/reflected), the blue is port 2 (transmitted) and the remaining one is the phase. At the bottom you can see the center frequency and the span, which in this case are 2.4GHz and 800MHz, but depending on the mode you may see other information there. Again, for all the details you can read the manual, as the menu is quite large.

My favourite thing about the NanoVNA v2 is that the firmware is open source and available on github. You can fork the firmware, make changes, build it and then upload it to the device. The MCU is the GD32F303CC, which is a GigaDevice part and seems to be an STM32F303 clone aimed at even lower-cost markets. From the Makefile it seems that this MCU doesn't have an FPU, which is a bit unfortunate as this kind of device would definitely benefit from one. Also from the repo you can see that libopencm3 and mculib are used. One thing that could be better is CMake support and using docker for the image builds in order to get reproducible builds, which increases stability. Anyway, I'll probably do this myself at some point.

The only negative thing I could say about this device is that the display touch-film is too glossy, which makes it hard to view under sunlight or even during the day inside a bright room. This can be solved, I guess, by using a matte non-reflective film like the ones used on smartphones (but not glass, because the touch membrane is resistive).

The antenna

As I've mentioned earlier, for this post I'll only measure an external WiFi antenna which is common in many different devices. In my case this device is the open source TTGO T-Journal ESP32 camera here. This PCB ships with various different camera sensors and it usually costs ~$15. I bought this camera to use it with my 3D printer before I installed OctoPi and to be honest I may revert back to it, because it was more convenient than OctoPi. I've also printed a case for it and this is how it looks.

Above is just the PCB module and below is the module inside the case and the antenna connected. This is a neat little camera btw.

As you can guess, the antennas in those cheap camera modules are not meant to be very precise, but they do the job. As a matter of fact, it doesn't really make sense to even measure this antenna, because for its usage inside the house the performance is more than enough. It would make more sense for an antenna from which you need to get the maximum performance. Nevertheless, it's fine to use this one for playing around and learning.

This is an omni-directional antenna, which means that it transmits energy in all directions, like a light bulb glows light. This is nice because it doesn't matter that much how you place the antenna and your camera, but at the same time the gain is low as a lot of energy is wasted. The opposite would be a directional antenna, which behaves like a flashlight. The problem with those antennas from eBay is that they don't come with specs, so you know nothing about the antenna except that it is supposed to be suitable for the WiFi frequency range.

Measurements

OK, now let's measure this antenna using the NanoVNA and get its parameters. The first thing you need to do before taking any measurements is to calibrate the VNA. There are a couple of calibration methods, but the NanoVNA uses the SOLT method. SOLT means Short-Open-Load-Through and refers to the port 1 load, therefore you need to run 4 different calibration steps: first short the port, then leave the port open, then attach a 50Ω load and finally connect port 1 through to port 2. After those steps the device should be ready and calibrated. There is a nice video here that demonstrates this better than I can.

The nice thing is that you can save your calibration data, therefore in my case I've set the center frequency to 2.4GHz, calibrated the NanoVNA and then stored the calibration data in the flash. There are 5 memory slots available, but maybe you can increase that number in the firmware if needed (that's the nice thing about having access to the code).

These are some things that are very important during calibration and usage:

  • Tighten the SMA connectors properly. If you don't do that then all your measurements will be wrong, but if you tighten them too much you may break the PCB connector or crack the PCB and its traces. The best way is to use a variable torque wrench, but since that would cost $20-$60, it's better to just be careful.
  • Do not touch any conductive parts during measurements. For example don’t touch the SMA connectors.
  • Keep the device away from metal surfaces during calibration.
  • Keep the antenna (or DUT) away from metal surfaces during measurements, unless you do it on purpose.
  • Make sure you have enough battery left or that you have a power bank with you.

In the next photo you can see the current test setup after the calibration.

As you can see, the antenna is connected to port 1 and I'm running the NanoVNA on battery. The next photo was taken after the calibration at 2.4GHz was done and with the antenna connected.

From the above image you see that the antenna resonates at 2456 MHz, which is quite good. From the Smith plot you can see that the impedance doesn't match perfectly; it's a bit higher than 50Ω (54Ω) and it also has an 824pH inductance. That means that part of the incident energy is reflected at the resonance frequency.
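
Just as a rough sanity check (assuming the marker readout means a series 54Ω + 824pH equivalent at the resonance frequency), plugging those values into the formulas from earlier gives a pretty decent match:

\[ Z_L = 54 + j\,2\pi \cdot 2.456\,\mathrm{GHz} \cdot 824\,\mathrm{pH} \approx 54 + j12.7\ \Omega \]
\[ |\Gamma| = \left| \frac{Z_L - 50}{Z_L + 50} \right| \approx 0.13 \qquad \mathrm{VSWR} \approx 1.3 \qquad \mathrm{RL} = -20\log_{10}|\Gamma| \approx 18\,\mathrm{dB} \]

In other words, only about 1.6% of the incident power (\(|\Gamma|^2\)) is reflected back at resonance.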

Next you can see how the antenna parameters change in real-time when a fork is getting close to the antenna.

As you can see the antenna is drastically affected, therefore you need to have this in mind. Of course, in this case the fork is getting quite close to the antenna.

Finally, I've used the whole setup near the 3D printer while printing to see how the printer affects the antenna in different positions. As you might be able to see from the following gif, the position does affect the antenna and it also seems that the head movement affects it, too.

Conclusions

I have to admit that although I'm not an RF engineer or experienced in this domain, I find the NanoVNA v2 a great device. It's not only amazing what you can get nowadays for $60, but it's also nice to be able to experiment with all this RF magic. At some point in the future I'm planning to make custom antennas for drones and FPV and also build my own FPV, so this will certainly be very useful then.

Even if you're not doing any RF, this is a nice toy to play around with and try to understand how it works and also how antennas work and perform. Making measurements is really easy and it being a portable device makes it even better for experimenting. It's really cool to see the antenna parameters in real time and try to explain what you see on the display.

As I've mentioned in the post, this use case is a bit pointless, as the camera just works fine inside the house and the distance from the WiFi router is not that far. Still, it was another stupid-project which was fun and I've learned a few new tricks. Again, I really recommend this device, even if you're not an RF engineer; just experiment with it. There are many variations and clones on eBay and most of them seem to be compatible with the open source firmware, but be aware that some variations may not be compatible.

Have fun!

 

Benchmarking the STM32MP1 IPC between the MCU and CPU (part 2)

Intro

In this post series I'm benchmarking the IPC between the Cortex-A7 (CA7) and the Cortex-M4 (CM4) of the STM32MP1 SoC. In the previous post here, I tested the default OpenAMP TTY driver. This IPC method is the direct buffer sharing mode, because OpenAMP is used as the buffer transfer API. After my tests I verified that this option is very slow for large data transfers. Also, a TTY interface is not really an ideal interface to add into your code, for several reasons. For me, dealing with old interfaces and APIs in new code is not the best option when there are other ways to do it.

After seeing the results, my next idea was to replace the TTY interface with a Netlink socket. In order to do that though, I needed to write a custom kernel module, a raw OpenAMP firmware and also the userspace client that sends data to the Netlink socket. In this post I’ll briefly explain those things and also present the benchmark results and compare them with the TTY interface.

Test Yocto image

I’ve added the recipes with all the code in the BSP base layer here:

https://bitbucket.org/dimtass/meta-stm32mp1-bsp-base/src/master/
https://github.com/dimtass/meta-stm32mp1-bsp-base
https://gitlab.com/dimtass/meta-stm32mp1-bsp-base

This is the same repo as in the previous post, but I've now added 3 more recipes: one for the kernel module, one for the CM4 firmware and one for the userspace Netlink client.

To build the image using this BSP base layer, read the README.md file in the repo or the previous post. The README file is more than enough, though.

The repo of the actual code is here:

https://bitbucket.org/dimtass/stm32mp1-rpmsg-netlink-example/src/master/
https://github.com/dimtass/stm32mp1-rpmsg-netlink-example
https://gitlab.com/dimtass/stm32mp1-rpmsg-netlink-example

Kernel driver code

The kernel driver module code is this one here. As you can see this is a quite simple driver, so no interrupts, DMA or anything fancy. In the init function I just register a new rpmsg driver.

ret = register_rpmsg_driver(&rpmsg_netlink_drv);

This line just registers the driver, but the driver will only be probed when a new service requests this specific driver's id name, which is this one here

static struct rpmsg_device_id rpmsg_netlink_driver_id_table[] = {
    { .name	= "rpmsg-netlink" },
    { },
};

Therefore, when a new device (or service) is added that requests this name, the probe function will be executed. I'll show you later how the driver is actually triggered.
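
For context, this is roughly how those pieces are tied together in an rpmsg driver. This is a minimal sketch and the probe/remove function names are my assumptions, not the exact code from the repo:

#include <linux/rpmsg.h>

static struct rpmsg_driver rpmsg_netlink_drv = {
    .drv.name  = KBUILD_MODNAME,
    .id_table  = rpmsg_netlink_driver_id_table, /* matches "rpmsg-netlink" */
    .probe     = rpmsg_netlink_probe,           /* hypothetical name: runs when the service appears */
    .remove    = rpmsg_netlink_remove,          /* hypothetical name: cleanup */
    .callback  = rpmsg_drv_cb,                  /* runs when the CM4 sends data back */
};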

Then regarding the probe function: when it's triggered, a new Netlink kernel socket is created and a callback function is set in the cfg Netlink kernel configuration struct.

nl_sk = netlink_kernel_create(&init_net, NETLINK_USER, &cfg);
if(!nl_sk) {
    dev_err(dev, "Error creating socket.\n");
    return -10;
}
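
The cfg struct referenced above is not shown in the snippet; it presumably looks something like this minimal sketch, where only the input callback really matters for this driver:

#include <linux/netlink.h>

static struct netlink_kernel_cfg cfg = {
    .input = netlink_recv_cbk, /* called for every message sent to the NETLINK_USER socket */
};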

The callback for the Netlink kernel socket is this function

static void netlink_recv_cbk(struct sk_buff *skb)

This callback is triggered when new data is received on this socket. It's important to select a unit id which is not used by other Netlink kernel drivers. In this case I'm using this id

#define NETLINK_USER 31

Inside the Netlink callback, the received data is parsed and then sent to the CM4 using the rpmsg (OpenAMP) API. As you can see from the code here, the driver doesn't send the whole received buffer at once, but splits the data into blocks, as rpmsg has a hard-coded buffer limited to 512 bytes. Therefore, the limitation that we had in the previous post still remains, of course. The point, as I've mentioned, is to simplify the userspace client code and not use TTY.
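
A minimal sketch of what that splitting loop boils down to (the function and constant names are illustrative, not the repo's exact code; the 496-byte payload assumes the usual 512-byte vring buffer minus the 16-byte rpmsg header):

#include <linux/kernel.h> /* min_t() */
#include <linux/rpmsg.h>

#define RPMSG_MAX_PAYLOAD 496

static int send_to_cm4(struct rpmsg_device *rpdev, const u8 *data, size_t len)
{
    size_t sent = 0;

    while (sent < len) {
        size_t chunk = min_t(size_t, len - sent, RPMSG_MAX_PAYLOAD);
        /* rpmsg_send() blocks until a vring buffer is free and copies the chunk into it */
        int ret = rpmsg_send(rpdev->ept, (void *)(data + sent), chunk);

        if (ret)
            return ret;
        sent += chunk;
    }
    return 0;
}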

Finally, rpmsg_drv_cb() is the OpenAMP callback function and you can see the code here. This callback is triggered when the CM4 firmware sends data to the CA7 via rpmsg. In this case, the CM4 firmware sends back the number of bytes that were received from the CA7 kernel driver (from the Netlink callback). Then the callback sends this uint16_t back to the userspace application using Netlink.
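
Sending that 2-byte reply from the kernel back to the userspace process only needs a few Netlink helper calls. This is a hedged sketch of how it could look (the pid would be the one saved from the last Netlink message received from userspace):

#include <linux/string.h>
#include <net/netlink.h>
#include <net/sock.h>

static void reply_to_userspace(struct sock *nl_sk, u32 pid, u16 nbytes)
{
    struct sk_buff *skb = nlmsg_new(sizeof(nbytes), GFP_KERNEL);
    struct nlmsghdr *nlh;

    if (!skb)
        return;

    nlh = nlmsg_put(skb, 0, 0, NLMSG_DONE, sizeof(nbytes), 0);
    if (!nlh) {
        kfree_skb(skb);
        return;
    }

    memcpy(nlmsg_data(nlh), &nbytes, sizeof(nbytes));
    nlmsg_unicast(nl_sk, skb, pid); /* consumes the skb */
}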

Therefore, the userspace app sends/receives data to/from the kernel using Netlink and the kernel sends/receives data to/from the CM4 firmware using rpmsg. Note that all these stages copy buffers! So, no zero-copy here, but multiple memcpys, so we already expect some latency; we'll see how much later.

CM4 firmware

The CM4 firmware code is here. This code is more complex than the kernel driver, but the interesting code is in the main.c file. The most important lines are

#define RPMSG_SERVICE_NAME              "rpmsg-netlink"

and

OPENAMP_create_endpoint(&resmgr_ept, RPMSG_SERVICE_NAME, RPMSG_ADDR_ANY, rx_callback, NULL);

As you may have guessed, the RPMSG_SERVICE_NAME is the same as the kernel driver id name. Those two names need to match, otherwise the kernel driver won't get probed.

The rx_callback() function is the rpmsg interrupt callback on the firmware side. This only copies the buffer (more memcpys in the pipeline) and then the handling is done in main(), in this code

if (rx_dev.rx_status == SET)
{
  /* Message received: send back a message answer */
  rx_dev.rx_status = RESET;

  struct packet* in = (struct packet*) &rx_dev.rx_buffer[0];
  if (in->preamble == PREAMBLE) {
    in->preamble = 0;
    rpmsg_expected_nbytes = in->length;
    log_info("Expected length: %d\n", rpmsg_expected_nbytes);                        
  }

  log_info("RPMSG: %d/%d\n", rx_dev.rx_size, rpmsg_expected_nbytes);
  if (rx_dev.rx_size >= rpmsg_expected_nbytes) {
    rx_dev.rx_size = 0;
    rpmsg_reply[0] = rpmsg_expected_nbytes & 0xff;
    rpmsg_reply[1] = (rpmsg_expected_nbytes >> 8) & 0xff;
    log_info("RPMSG resp: %d\n", rpmsg_expected_nbytes);
    rpmsg_expected_nbytes = 0;

    if (OPENAMP_send(&resmgr_ept, rpmsg_reply, 2) < 0) {
      log_err("Failed to send message\r\n");
      Error_Handler();
    }
  }
}

As you can see from the above code, the buffer is parsed and if there is a valid packet in there, it extracts the length of the expected data. When all those data are received, it sends back the number of bytes using the OpenAMP API. That response is then received by the kernel and sent to userspace using Netlink.

User-space application

The userspace application code is here. If you browse the code you'll find out that it is very similar to the previous post's tty client; I've only made a few changes like removing the tty and adding the Netlink socket class. Like in the previous post, a number of tests are added when the program starts, like this

  tester.add_test(512);
  tester.add_test(1024);
  tester.add_test(2048);
  tester.add_test(4096);
  tester.add_test(8192);
  tester.add_test(16384);
  tester.add_test(32768);

Then the tests are executed. What you may find interesting is the Netlink class code and especially the part that sends/receives data to/from the kernel, which is this code here. Have a look at this code:

do {
    int n_tx = buffer_len < MAX_BLOCK_SIZE ?  buffer_len : MAX_BLOCK_SIZE;
    buffer_len -= n_tx;

    memset(&kernel, 0, sizeof(kernel));
    kernel.nl_family = AF_NETLINK;
    kernel.nl_groups = 0;

    memset(&iov, 0, sizeof(iov));
    iov.iov_base = (void *)m_nlh;
    iov.iov_len = n_tx;
    
    std::memset(m_nlh, 0, NLMSG_SPACE(n_tx));
    m_nlh->nlmsg_len = NLMSG_SPACE(n_tx);
    m_nlh->nlmsg_pid = getpid();
    m_nlh->nlmsg_flags = 0;

    std::memcpy(NLMSG_DATA(m_nlh), buffer, n_tx);

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = &kernel;
    msg.msg_namelen = sizeof(kernel);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    L_(ldebug) << "Sending " << n_tx << "/" << buffer_len;
    int err = sendmsg(m_sock_fd, &msg, 0);
    if (err < 0) {
        L_(lerror) << "Failed to send netlink message: " <<  err;
        return(0);
    }

} while(buffer_len);

As you can see, the data is not sent as a single buffer to the kernel driver via the Netlink socket. The reason is that the kernel socket can only allocate a buffer equal to the page size, therefore if you try to send more than 4KB the kernel will crash. Therefore, we need to split the data into smaller blocks and send them via Netlink. There are some ways to increase this size, but a change like this would be global to the whole kernel and it would mean that all drivers allocate larger buffers even if they don't need them, which is a waste of memory.

Benchmark results

To execute the test I've built the Yocto image using my BSP base layer, which includes all the recipes and installs everything by default in the image. What is important is that the module is already loaded in the kernel when it boots, so there is no need to modprobe the module. Given this, you only need to upload the firmware to the CM4 and then execute the application. In this image, all the commands need to be executed in the /home/root path.

First load the firmware like this:

./fw_cortex_m4_netlink.sh start

When running this, the kernel will print those messages (you can use dmesg -w to read those).

[ 3997.439653] remoteproc remoteproc0: powering up m4
[ 3997.444869] remoteproc remoteproc0: Booting fw image stm32mp157c-rpmsg-netlink.elf, size 198364
[ 3997.452743]  mlahb:m4@10000000#vdev0buffer: assigned reserved memory node vdev0buffer@10042000
[ 3997.461387] virtio_rpmsg_bus virtio0: rpmsg host is online
[ 3997.467937]  mlahb:m4@10000000#vdev0buffer: registered virtio0 (type 7)
[ 3997.472245] virtio_rpmsg_bus virtio0: creating channel rpmsg-netlink addr 0x0
[ 3997.473121] remoteproc remoteproc0: remote processor m4 is now up
[ 3997.492511] rpmsg_netlink virtio0.rpmsg-netlink.-1.0: rpmsg-netlink created netlink socket

The last line is actually printed by our kernel driver module. This means that when the firmware loaded, the driver's probe function was triggered, because it matched the RPMSG_SERVICE_NAME in the firmware. Next, run the application like this:

./rpmsg-netlink-client

This will execute all the tests. This is a sample output on my board.

- 21:27:31.237 INFO: Application started
- 21:27:31.238 INFO: Initialized netlink client.
- 21:27:31.245 INFO: Initialized buffer with CRC16: 0x1818
- 21:27:31.245 INFO: ---- Creating tests ----
- 21:27:31.245 INFO: -> Add test: size=512
- 21:27:31.245 INFO: -> Add test: size=1024
- 21:27:31.245 INFO: -> Add test: size=2048
- 21:27:31.246 INFO: -> Add test: size=4096
- 21:27:31.246 INFO: -> Add test: size=8192
- 21:27:31.246 INFO: -> Add test: size=16384
- 21:27:31.246 INFO: -> Add test: size=32768
- 21:27:31.246 INFO: ---- Starting tests ----
- 21:27:31.268 INFO: -> b: 512, nsec: 21384671, bytes sent: 20
- 21:27:31.296 INFO: -> b: 1024, nsec: 27190729, bytes sent: 20
- 21:27:31.324 INFO: -> b: 2048, nsec: 27436772, bytes sent: 20
- 21:27:31.361 INFO: -> b: 4096, nsec: 31332686, bytes sent: 20
- 21:27:31.419 INFO: -> b: 8192, nsec: 55592343, bytes sent: 20
- 21:27:31.511 INFO: -> b: 16384, nsec: 88094875, bytes sent: 20
- 21:27:31.681 INFO: -> b: 32768, nsec: 162541198, bytes sent: 20

The results start after the "Starting tests" string; b is the block size and nsec is the number of nanoseconds that the whole transaction lasted. Ignore the "bytes sent" size as it's not correct; fixing it would be a lot of hassle, as it would need a static counter in the kernel driver, which isn't really worth the trouble. I've used the Linux precision timers, which are not as precise as the CM4 timers, but that's enough for this test since the times are in the range of milliseconds. I'm also listing the results in the next table.

# of bytes (block)    msec
512                   21.38
1024                  27.19
2048                  27.43
4096                  31.33
8192                  55.59
16384                 88.09
32768                 162.54
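
As a side note, the nanosecond values above come from the Linux monotonic clock. This is a minimal sketch of how such a measurement is typically taken (an assumption about the tool's internals, not its actual code):

#include <stdint.h>
#include <time.h>

static int64_t elapsed_ns(const struct timespec *start, const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
           + (end->tv_nsec - start->tv_nsec);
}

/* usage:
 *   struct timespec t0, t1;
 *   clock_gettime(CLOCK_MONOTONIC, &t0);
 *   run_one_test();
 *   clock_gettime(CLOCK_MONOTONIC, &t1);
 *   // elapsed_ns(&t0, &t1) is the nsec value reported per test
 */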

Now let's compare those numbers with the previous tests in the following table

# of bytes    TTY (msec)    Netlink (msec)    diff (msec)
512           11.97         21.38             +9.41
1024          15.32         27.19             +11.87
2048          21.74         27.43             +5.69
4096          37.64         31.33             -6.31
8192          n/a           55.59             n/a
16384         n/a           88.09             n/a
32768         n/a           162.54            n/a

These are interesting numbers. As you can see, up to 2KB of data the TTY implementation is faster, but at >=4KB the Netlink driver has better performance. It's also important that the Netlink implementation doesn't have the issue with the limited block size, so you can send more data using the netlink client API I've written. Well, the truth is that it does have a limited block size hard-coded in OpenAMP, but in this case, without the TTY interface, the ringbuffer seems to empty properly. That's something that would need further investigation, but I don't think I'll have time for this.

From this table, in the case of the 32KB block we see that the transfer rate is 201.6 KB/sec (32768 bytes in 162.54 msec), which is almost double that of the TTY implementation. This is much better performance, but it's still far slower than the indirect buffer sharing mode, which I'll test in the next post.

Conclusions

In this post I've implemented a kernel driver that uses OpenAMP to exchange data with the CM4 and a netlink socket to exchange data with userspace. In this scenario the performance is worse compared to the default TTY implementation (/dev/ttyRPMSGx) for blocks smaller than 4KB, but it's faster for >=4KB. Also, my tests showed that with a 32KB block this implementation is twice as fast as the TTY one.

Although the results are better than in the previous post, this implementation still cannot be considered a good option if you need to transfer large amounts of data fast. Nevertheless, personally I would consider it a good option for smaller data sizes, because now the interface from userspace is much more convenient as it's based on a netlink socket. Therefore, you don't need to interface with a TTY port anymore and that is an advantage.

So, I'm quite happy with the results. I would definitely prefer this option over the TTY interface for data blocks larger than 2KB, because netlink is a friendlier API, at least to me. Maybe you have a different preference, but overall those two solutions are only good for smaller block sizes.

In the next post, I’ll benchmark the indirect buffer sharing.

Have fun!

Benchmarking the STM32MP1 IPC between the MCU and CPU (part 1)

Intro

Update: The second part is here.

Long time no see. Well, having two babies at home is really not recommended for personal stupid projects. Nevertheless, I've found some time to dive into the STM32MP157C-DK2 dev kit that I received recently from ST. Let me say beforehand that this is a really great board and of course it has its pros and cons, which I will mention later in this post. Although it's been a year since this board was released, I hadn't had the chance to use it and I'm glad I received this sample.

I need to mention that this post is not an introduction to the dev kit and it's a bit advanced, as it deals with more advanced concepts. One thing I like about some new SoCs released in the last couple of years is the integrated MCU alongside the CPU, which shares the same resources and peripherals. This is a really powerful option as it allows you to split your functionality into real-time and critical applications that run on the MCU and more advanced userspace applications that run on Linux on the CPU.

The MCU is an STM32 ARM Cortex-M4 core (CM4) running at 209MHz and the application CPU is a dual Cortex-A7 (CA7) running at 650MHz. Well, yes, this is a fast CM4 and a slow CA7. As I've said, those two share almost all the same peripherals and the OpenAMP API is used as the IPC to exchange data between them.

I have quite some experience working with real-time Linux kernels and I really like this concept, that's why I always support a PREEMPT-RT (P-RT) kernel in my meta-allwinner-hx Yocto meta layer. Although P-RT is really nice to have when low latency is needed, it's still far too bloated and slow compared to baremetal firmware running on an MCU. I won't get into the details of this, because I think it's obvious. Therefore, the STM32MP1 and other similar SoCs combine the best of those two worlds and you can use the CM4 for critical timing applications and the CA7 for everything else, like the GUI, networking, remote updates etc.

In the case of the STM32MP1, the OpenAMP framework is used for the IPC between the CM4 and CA7. When I went through the docs for the STM32MP1, the first thing that came to my mind was to benchmark and evaluate its performance, and this series of posts is about doing exactly that: finding out what options you have and how well they perform.

Reading the docs

One thing I like about ST is the very good documentation. Well, tbh it's far from excellent, but it's better compared to other documentation I've seen. Generally, bad or missing documentation is a well known problem with almost all vendors and in my opinion Texas Instruments and ST are far better compared to other vendors (I'll exclude the NXP imx6, which is also well documented).

When you start with a new SoC and you already have experience with other SoCs, it's easier to retrieve valuable information faster and also search for the specific information you know is important. Nevertheless, I really enjoyed getting into the STM32MP1 documentation in the wiki, because I found really good explanations of the Linux graphics stack, the trusted firmware, OP-TEE and the boot chain. For example, have a look at the DRM/KMS overview here, it's really good.

Of course there are many things missing from the docs, but I also need to say that this is a quite complex SoC and there are so many tools and frameworks involved in the development process that it would need tens of thousands of pages to explain everything, and this is not manageable even for ST. For example, you'll find yourself lost the moment you want to create your own Yocto distro, as there is very little info on how to do that, and that's the reason I've decided to write a template BSP layer that greatly simplifies the process.

Contributions

While diving into the SoC, I couldn't help myself and fixed some things, so I've made some contributions, like this and this, and I've also created this meta layer here. I'm also planning to contribute a kernel module and a userspace application that use netlink between userspace and the kernel module, but I'll explain why later. This will be demonstrated in the next post though, as it's related to the IPC performance.

IPC options

OK, so let's now see what the current options are that you get with the STM32MP1 when it comes to IPC between the CM4 and CA7. I've already mentioned OpenAMP, which ST calls the direct buffer exchange mode, but you can also use an indirect buffer exchange mode. Both are well explained here, but I'll add a couple more points.

Generally, OpenAMP has a weakness which is inherent to the implementation and that is the hard-coded buffer size limitation, which is set to 512 bytes. We'll see later how this affects the transfer speed, but as you can imagine this size might not be enough in some cases. Therefore, before you decide which IPC option to choose, you need to be sure how much data you need to share between the CM4 and CA7 and how fast.

Another problem with OpenAMP is that there's no zero-copy or DMA support, which means that buffers are copied at every stage and therefore the performance is expected to be affected by this. On top of that the buffers are not cacheable, which means further CPU cycles for fetching the buffers and then copying them.

OpenAMP comes with 2 examples/implementations in the SDK. The first is a TTY device on the Linux side, which means that you can transfer data from/to userspace by interfacing with a tty com port. You can see this implementation here. If that sounds bad to you, then it is actually as bad as it sounds, because exchanging data that way really sucks, especially on the Linux side (interfacing a tty with code). The only way I think this is useful and makes sense is if you port existing code from a device that used a CPU and an external MCU exchanging data via a UART port. Otherwise, it doesn't make much sense to use it. Nevertheless, being loyal to the blog's name, I've spent some time implementing this to see how it performs. More about this later.
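
To give an idea of what "interfacing a tty with code" looks like on the Linux side, here is a minimal sketch. The device name matches the ttyRPMSG port used later, but the rest is illustrative and not the repo's actual client code:

#include <fcntl.h>
#include <stdio.h>
#include <termios.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/ttyRPMSG0", O_RDWR | O_NOCTTY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* raw mode: no echo, no line buffering, no CR/LF mangling */
    struct termios tio;
    tcgetattr(fd, &tio);
    cfmakeraw(&tio);
    tcsetattr(fd, TCSANOW, &tio);

    const char msg[] = "hello CM4";
    write(fd, msg, sizeof(msg));

    char resp[16];
    ssize_t n = read(fd, resp, sizeof(resp)); /* blocks until the CM4 replies */
    printf("received %zd bytes\n", n);

    close(fd);
    return 0;
}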

The second implementation is the raw OpenAMP one, where you use a kernel module to interface with the CM4; the firmware implementation is here. In order to exchange data with userspace in this case you need a kernel module like this one here. But as you can see, this is pretty much useless on its own because you can't interface with the module from userspace, therefore you need to write your own module and use ioctl or netlink (or anything else you like). But more about this in the next post. (Edit: actually I've seen that ST provides a module with IOCTL support here).

On the other hand, there is also the indirect buffer option which, compared to OpenAMP, allows sharing much larger buffers (of any size), in cached memory, with zero-copy and also using DMA. This option is expected to perform much faster compared to OpenAMP and with much lower latency. To be honest, this sounds like a no-brainer to me in most cases, except when you just need to exchange a few control commands. I mention the last case because you need OpenAMP even in the indirect buffer mode, as it's used as the control interface for the shared buffer anyway.

Creating a Yocto image

In order to use or test any of the above things you need to build and flash a distro for the board. Here I can complain about the non-existent documentation on how to do this. In my case that was not really an issue, but I guess many devs will eventually struggle with this. For that reason, I've created a BSP base layer for the STM32MP157C-DK2 board. This post is not focused on how to deal with the Yocto distro though, so I'll go through those things quickly so I can proceed further.

To build the distro I’ve used, you need this repo here:

https://bitbucket.org/dimtass/meta-stm32mp1-bsp-base/src/master/
https://github.com/dimtass/meta-stm32mp1-bsp-base
https://gitlab.com/dimtass/meta-stm32mp1-bsp-base

The README file in the repo is quite thorough. So, in order to build the image you need to create a folder and run these commands:

mkdir stm32mp1-yocto
cd stm32mp1-yocto
repo init -u https://bitbucket.org/dimtass/meta-stm32mp1-bsp-base/src/master/default.xml
repo sync

This will fetch all the needed repos and place them in the sources/ folder. Then in order to build the image you need to run these commands:

ln -s sources/meta-stm32mp1-bsp-base/scripts/setup-environment.sh .
MACHINE=stm32mp1-discotest DISTRO=openstlinux-eglfs source ./setup-environment.sh build
bitbake stm32mp1-qt-eglfs-image

Your host needs to have all the needed dependencies installed in order to build the image. If this is the first time you're doing this, then it's better to use a docker image instead, like this one here. I've made this docker image to build the meta-allwinner-hx repo, but you can use the same one for this repo, like this:

docker build --build-arg userid=$(id -u) --build-arg groupid=$(id -g) -t allwinner-yocto-image .
docker run -it --name allwinner-builder -v $(pwd):/docker -w /docker allwinner-yocto-image bash
MACHINE=stm32mp1-discotest DISTRO=openstlinux-eglfs source ./setup-environment.sh build
bitbake stm32mp1-qt-eglfs-image

If you want to build the SDK to use it for compiling other code then run this:

bitbake -c populate_sdk stm32mp1-qt-eglfs-image

Then you can install the produced SDK to your /opt folder and then source it to build any code you might want to test.

Benchmark tool code

Assuming that you have the above image built, you can flash it to the target and run the test. There is a brief explanation of how to flash the image to the dev kit in the meta-stm32mp1-bsp-base repo README.md file. After the image is flashed, the firmware for the CM4 is located in /lib/firmware/stm32mp157c-rpmsg-test.elf. There is also a script and an executable in /home/root. The recipes that install those files are located in `meta-stm32mp1-bsp-base/recipes-extended/stm32mp1-rpmsg-test`.

The code is located in this repo:

https://bitbucket.org/dimtass/stm32mp1-rpmsg-test/src/master/
https://github.com/dimtass/stm32mp1-rpmsg-test
https://gitlab.com/dimtass/stm32mp1-rpmsg-test

This repo contains both the CM4 firmware and the CA7 Linux userspace tool. The userspace tool uses the /dev/ttyRPMSG0 tty port to send blocks of data. As you can see here, I'm creating some tests at runtime with several block sizes and I'm sending those blocks a number of times. For example, see the following:

tester.add_test(512, 1);
tester.add_test(512, 2);
tester.add_test(512, 4);
tester.add_test(512, 8);

The first line sends a block of 512 bytes one time. The second sends the 512-byte block two times, etc. In the code you'll see that some lines are commented out. The reason is that I've found that I couldn't send more than 5-6KB per run; it seems that OpenAMP hangs on the kernel side and I need to open/close the tty port to send more data. As I've mentioned, the buffer size in OpenAMP is hard-coded to 512 bytes, therefore I assume that the ringbuffer is overflowing at some point and the execution hangs, but I don't know the real reason. Therefore, I wasn't able to test more than 5KB.

The code for interfacing with the com port is here. I guess this class will be very useful to you if you want to interface with the ttyRPMSG in your own code, as I've used it many times and it seems robust. It's also simple to use and integrate into your code.

Finally, the test class is here and I'm using a simple protocol with a header that contains a preamble and the length of the packet data. This simplifies the CM4 firmware as it knows how much data to wait for and then acknowledges when the whole packet is received. Be aware that when you send a 512-byte block, the data doesn't arrive on the CM4 at once in a single callback; OpenAMP may trigger more than one interrupt for the whole block, therefore knowing the length of the data at the start of the transaction is important.
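
For reference, the header implied by the firmware snippet below boils down to something like this minimal sketch (the field widths and the preamble value are my assumptions, not the exact definition from the repo):

#include <stdint.h>

#define PREAMBLE 0xABCD1234u /* illustrative value, not the one used in the repo */

struct packet {
    uint32_t preamble; /* marks the start of a transaction */
    uint16_t length;   /* total payload length the CM4 should expect */
    uint8_t  data[];   /* payload follows, possibly split across several callbacks */
} __attribute__((packed));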

On the CM4 side, the firmware is located here. The interesting code is in the main.c file and the most important function is the tty callback here, so I'll also list the code below.

void VIRT_UART0_RxCpltCallback(VIRT_UART_HandleTypeDef *huart)
{

  /* copy the received msg into a variable to send it back to the master processor in the main infinite loop */
  uint16_t recv_size = huart->RxXferSize < MAX_BUFFER_SIZE? huart->RxXferSize : MAX_BUFFER_SIZE-1;

  struct packet* in = (struct packet*) &huart->pRxBuffPtr[0];
  if (in->preamble == PREAMBLE) {
    in->preamble = 0;
    virt_uart0_expected_nbytes = in->length;
    log_info("length: %d\n", virt_uart0_expected_nbytes);                        
  }

  virt_uart0.rx_size += recv_size;
  log_info("UART0: %d/%d\n", virt_uart0.rx_size, virt_uart0_expected_nbytes);
  if (virt_uart0.rx_size >= virt_uart0_expected_nbytes) {
    virt_uart0.rx_size = 0;
    virt_uart0.tx_buffer[0] = virt_uart0_expected_nbytes & 0xff;
    virt_uart0.tx_buffer[1] = (virt_uart0_expected_nbytes >> 8) & 0xff;
    log_info("UART0 resp: %d\n", virt_uart0_expected_nbytes);
    virt_uart0_expected_nbytes = 0;
    virt_uart0.tx_size = 2;
    virt_uart0.tx_status = SET;
    // huart->RxXferSize = 0;
  }
}

As you can see, in this callback the code looks for the preamble and if it's found then it stores the expected bytes length. The interrupt may then be triggered multiple times and when all bytes have arrived, the CM4 sends back a response that includes the number of bytes, and the userspace application verifies that all bytes were sent. It's as simple as that.

Benchmark results

To run the test you need to cd to the /home/root directory and then run this script:

./fw_cortex_m4.sh start

This will load the firmware that is stored in /lib/firmware to the CM4 and execute it. When this happens you should see this in your serial console:

fw_cortex_m4.sh: fmw_name=stm32mp1-rpmsg-test.elf
[  162.549297] remoteproc remoteproc0: powering up m4
[  162.588367] remoteproc remoteproc0: Booting fw image stm32mp1-rpmsg-test.elf, size 704924
[  162.596199]  mlahb:m4@10000000#vdev0buffer: assigned reserved memory node vdev0buffer@10042000
[  162.607353] virtio_rpmsg_bus virtio0: rpmsg host is online
[  162.615159]  mlahb:m4@10000000#vdev0buffer: registered virtio0 (type 7)
[  162.620334] virtio_rpmsg_bus virtio0: creating channel rpmsg-tty-channel addr 0x0
[  162.622155] rpmsg_tty virtio0.rpmsg-tty-channel.-1.0: new channel: 0x400 -> 0x0 : ttyRPMSG0
[  162.633298] remoteproc remoteproc0: remote processor m4 is now up
[  162.648221] virtio_rpmsg_bus virtio0: creating channel rpmsg-tty-channel addr 0x1
[  162.671119] rpmsg_tty virtio0.rpmsg-tty-channel.-1.1: new channel: 0x401 -> 0x1 : ttyRPMSG1

This means that the kernel module is loaded, the two ttyRPMSG devices are created in /dev, the firmware is loaded on the CM4 and the firmware code is executed. In order to verify that the firmware is working you can use a USB to UART module and connect the Rx pin to D0 and the Tx pin to D1 (UART7) of the Arduino compatible header on the bottom of the board. When the firmware has booted you should see this:

[00000.008][INFO ]Cortex-M4 boot successful with STM32Cube FW version: v1.2.0 
[00000.016][INFO ]MAX_BUFFER_SIZE: 32768
[00000.019][INFO ]Virtual UART0 OpenAMP-rpmsg channel creation
[00000.025][INFO ]Virtual UART1 OpenAMP-rpmsg channel creation

Now you can execute the benchmark tool from userspace by running this command in the console:

./tty-test-client /dev/ttyRPMSG0

Then the benchmark is executed and you’ll see the results on both consoles (CM4 and CA7). The CA7 output will be something like this:

- 15:43:54.953 INFO: Application started
- 15:43:54.954 INFO: Connected to /dev/ttyRPMSG0
- 15:43:54.962 INFO: Initialized buffer with CRC16: 0x1818
- 15:43:54.962 INFO: ---- Creating tests ----
- 15:43:54.962 INFO: -> Add test: block=512, blocks: 1
- 15:43:54.962 INFO: -> Add test: block=512, blocks: 2
- 15:43:54.962 INFO: -> Add test: block=512, blocks: 4
- 15:43:54.963 INFO: -> Add test: block=512, blocks: 8
- 15:43:54.963 INFO: -> Add test: block=1024, blocks: 1
- 15:43:54.964 INFO: -> Add test: block=1024, blocks: 2
- 15:43:54.964 INFO: -> Add test: block=1024, blocks: 4
- 15:43:54.964 INFO: -> Add test: block=1024, blocks: 5
- 15:43:54.964 INFO: -> Add test: block=2048, blocks: 1
- 15:43:54.964 INFO: -> Add test: block=2048, blocks: 2
- 15:43:54.964 INFO: -> Add test: block=4096, blocks: 1
- 15:43:54.964 INFO: ---- Starting tests ----
- 15:43:54.977 INFO: -> b: 512, n:1, nsec: 11970765, bytes sent: 512
- 15:43:54.996 INFO: -> b: 512, n:2, nsec: 18770380, bytes sent: 1024
- 15:43:55.027 INFO: -> b: 512, n:4, nsec: 31022063, bytes sent: 2048
- 15:43:55.083 INFO: -> b: 512, n:8, nsec: 56251848, bytes sent: 4096
- 15:43:55.099 INFO: -> b: 1024, n:1, nsec: 15322572, bytes sent: 1024
- 15:43:55.124 INFO: -> b: 1024, n:2, nsec: 24824116, bytes sent: 2048
- 15:43:55.168 INFO: -> b: 1024, n:4, nsec: 43830248, bytes sent: 4096
- 15:43:55.221 INFO: -> b: 1024, n:5, nsec: 53327292, bytes sent: 5120
- 15:43:55.243 INFO: -> b: 2048, n:1, nsec: 21742144, bytes sent: 2048
- 15:43:55.281 INFO: -> b: 2048, n:2, nsec: 37633885, bytes sent: 4096
- 15:43:55.318 INFO: -> b: 4096, n:1, nsec: 37649011, bytes sent: 4096

The results start after the "Starting tests" string; b is the block size, n is the number of blocks and nsec is the number of nanoseconds that the whole transaction lasted. I've used the Linux precision timers, which are not as precise as the CM4 timers, but that's enough for this test since the times are in the range of milliseconds. I'm also listing the results in the next table.

Block size    # of blocks    msec
512           1              11.97
512           2              18.77
512           4              31.02
512           8              56.25
1024          1              15.32
1024          2              24.82
1024          4              43.83
1024          5              53.32
2048          1              21.74
2048          2              37.63
4096          1              37.64

The results are interesting enough. As you can see, size matters! The larger the block size, the faster the transfer rate. Sending 4KB of data from the CA7 to the CM4 in 8x 512-byte blocks needs 56.25 msec, while a single 4096-byte block takes 37.64 msec, which is roughly 1.5x faster (about 33% less time) with the bigger block size.

Another interesting result is that in the best case the transfer rate is 4096 bytes in 37.64 msec, or about 106 KB/sec. That's too slow!

To sum up the results, you can see that in case you want to use the direct buffer exchange mode (OpenAMP tty) then you’re limited at ~106KB/sec and in order to achieve this speed you need the largest possible block size.

Conclusions

In this post I've provided a simple BSP base Yocto layer that might be helpful for starting a new STM32MP1 project. I've used this Yocto image to build a benchmark tool and test the direct buffer exchange mode; the firmware and the userspace tool are included as recipes in the BSP base repo.

The results are a bit disappointing when it comes to performance and I've found that the larger the block size, the better the transfer speed gets. That means you should use this mode when sending a small amount of data (e.g. control commands) or if you want to port an application to the STM32MP1 that was previously running on a separate CPU and MCU. If you need better performance then the indirect mode seems to be better, but I'll verify this in a future post.

In the next post I'll be using the direct mode again, but this time I'll write a small kernel driver module using netlink and test if there's any better performance. I don't expect much out of this, but I want to do it anyway because I find using a tty interface in a modern program a bit cumbersome to handle and netlink is a more elegant solution.

In the conclusions section I would also like to list some pros/cons of the STM32MP157C-DK2 board (and MP1 SoC).

Pros:

  • Very good Yocto support
  • Great documentation and datasheets
  • Many examples and code for the CM4
  • Support for OP-TEE secure RTOS
  • Very good power management (supported in both CM4 and CA7)
  • GbE
  • WiFi & BT 4.1
  • Arduino V3 connector
  • The CM4 firmware can be uploaded during boot (in u-boot) or in Linux
  • DSI display with touch
  • Direct and indirect buffer sharing mode
  • Provides an AI acceleration API (supports also TF-Lite)
  • Great and mature development tools

Cons:

  • The CA7 has a low clock (which tbh doesn't mean it's bad; for many cases this is a good thing).
  • You can’t use DSI and HDMI at the same time.
  • IPC is not that fast (in direct mode only)
  • The 40-pin expansion connector is under the LCD and it’s hard to reach and fit the cables. It would be better if it had a 90 degree angle connector.
  • Generally, because the LCD takes up the whole of one side of the board, it's also difficult to use the Arduino connector. Therefore, you need to put the board on the LCD side and use the HDMI. Well, they couldn't have done much about this, but…
  • The packaging box could be designed in a way that it can be used as a leveled base that makes the Arduino connector accessible. Something like the nVidia nano packaging box.
  • The Yocto documentation is lacking and it will be difficult for inexperienced developers to get started with (but that's also a generic issue with Yocto, as its own documentation is not complete either and some important features are not documented).
  • No CMake support in the examples. Generally, I prefer CMake for many reasons and I like, for example, that NXP provides their SDK with CMake instead of Makefiles (personal preference I guess, but CMake is also widely used in most of the projects I've worked on).

Overall, this is a great and complicated board and it takes some effort to set up and optimize everything. Even more, it's quite complicated to do a design from scratch in a small team, therefore if you deviate from the default design you might need more time to configure everything properly and bring up your custom board. Although I would be able to handle a full custom design with this board, I know it would take a lot of time to finalize it. Therefore, if you don't use one of the many available ready solutions with the STM32MP1 SoC, have in mind that it will take some effort to do your custom solution. It's up to you to decide between the development cost of a full custom design and using one of the many available ready solutions that can help you start in no time.

I would definitely take this SoC into account for applications that need good real-time performance (using the CM4 core) while using the CA7 core for the GUI and business logic. If you wonder what the difference is compared to a multi-core ARM CPU running at high frequency, you can take for granted that (depending on your use case) even a PREEMPT-RT kernel on an 8-core running at 2GHz can't beat the real-time performance of a 200MHz CM4.

Have fun!