Nanopi-R4S benchmarks with networking optimizations

Intro

In the last post, I benchmarked the Nanopi-R2S and Nanopi-R4S. After a comment from tkaiser here, I verified that all IRQs are mapped to CPU0, which means that the default FriendlyWRT image for the Nanopi-R4S is not optimized. Those optimizations are easy to make though, so in this post I will guide you through them, present the results and compare them with the default image.

SMP IRQ Affinity

SMP IRQ affinity is a Linux kernel feature that lets the user control which CPU handles the IRQ that a certain device raises through the interrupt controller. On multi-core CPUs that makes a lot of sense, because the user can change the default behavior, which in this case leaves all the work to CPU0. With SMP IRQ affinity the user sets a bit mask that controls which cores handle a specific interrupt.

Hence, the process has two stages. The first is to find the interrupt number of the device whose IRQ handling you want to move to another core. The second is to write a proper bit mask to the smp_affinity file of that interrupt. This mask is then used to route the IRQ handling to the CPU core you want.

Another important reason to control the SMP affinity manually is that in multi-core SoCs like the RK3399, which include cores with different architectures, the kernel might not use the more powerful cores by default. This is actually the case with the RK3399: CPU0 is one of the four Cortex-A53 cores, clocked at 1.4GHz, so the two faster Cortex-A72 cores, clocked at 1.8GHz, stay idle.

Receive Packet Steering (RPS)

RPS is very well explained here, but I will try to simplify it a bit more. RPS is a kernel feature that, in multi-core systems, distributes the work of handling network packets across different cores in order to speed up the process. This might not be so important for your desktop, but for a router like the Nanopi-RxS it’s very important to be able to tweak it and assign a specific core to the task.

Again, RPS uses a bit mask that controls which cores are in the kernel’s distribution list, and you can also set a single core. Therefore, one of the optimizations we can do is to assign a specific core, preferably one of the fast ones, to that task.

FriendlyWRT

In this post for both Nanopi-R2S and R4S I’m using the custom FriendlyWRT distro from FriendlyElec. This is the version for Nanopi-R2S:

Linux FriendlyWrt 5.4.61 #1 SMP PREEMPT Fri Sep 4 15:12:58 CST 2020 aarch64 GNU/Linux

And this is the version of Nanopi-R4S:

Linux FriendlyWrt 5.4.75 #1 SMP PREEMPT Tue Nov 10 11:13:15 CST 2020 aarch64 GNU/Linux

Nanopi-R4S optimizations

Hopefully the above explanations are clear enough, so now I’ll explain the two optimizations that you need to do and how to do them. The Nanopi-R4S is based on the RK3399 SoC, which has two Cortex-A72 cores and four Cortex-A53 cores. In FriendlyWRT the A72 cores are clocked at 1.8GHz and the A53 cores at 1.4GHz. Generally, the core clocks are not controlled only by the CPU itself; in Linux the min and max clock speeds are defined via the device-tree when the kernel loads. Therefore, the max rated clock for the A72 might be 2.0GHz, but that doesn’t mean it will be the max clock on every OS and, as I’ve mentioned, in FriendlyWRT the max clock is set to 1.8GHz.

So, the first optimization is to assign the interrupt handling of the eth0 and eth1 interfaces to a core other than CPU0, preferably one of the fastest cores. To view the max CPU frequency of each core you can use the following command:

root@FriendlyWrt:~# cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq
1416000
1416000
1416000
1416000
1800000
1800000

From the above result we see that CPU0-3 are the Cortex-A53 cores and CPU4-5 are the Cortex-A72 cores. By default all IRQs are served by CPU0, which is one of the slower cores, and this can create an artificial bottleneck while the 2 fast cores sit there doing nothing.
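
If you’d rather not read the frequencies by eye, the same check can be scripted. This is only a minimal sketch, not something FriendlyWRT ships: the function names `list_cpu_freqs` and `fastest_cpus` are made up for this example, and it assumes every core exposes `cpufreq/cpuinfo_max_freq` as in the output above.

```shell
#!/bin/sh
# Sketch only: helper names are made up for this example.

# Print "<cpu index> <max freq in kHz>" for every core that exposes cpufreq.
list_cpu_freqs() {
    for dir in /sys/devices/system/cpu/cpu[0-9]*; do
        [ -r "$dir/cpufreq/cpuinfo_max_freq" ] || continue
        echo "${dir##*cpu} $(cat "$dir/cpufreq/cpuinfo_max_freq")"
    done
}

# Read "<cpu index> <freq>" lines on stdin and print the indexes of the
# cores that share the highest frequency, one per line.
fastest_cpus() {
    awk '
        { freq[$1] = $2; if ($2 + 0 > max) max = $2 + 0 }
        END { for (cpu in freq) if (freq[cpu] + 0 == max) print cpu }
    ' | sort -n
}
```

Running `list_cpu_freqs | fastest_cpus` on the Nanopi-R4S should print 4 and 5, given the frequencies listed above.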

The next step is to figure out the IRQ numbers for eth0 and eth1. To do that, I first ran an iperf test with my Laptop as the client, connected on the LAN port of the Nanopi-R4S, and my workstation as the server, connected on the WAN port.

Then I printed /proc/interrupts to view the IRQ counters, and this is the result.

root@FriendlyWrt:~# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       
 15:      59197      27202      24660      23678      44323      32887     GICv3  30 Level     arch_timer
 17:       2733      16708      15550       8386      12996      11459     GICv3 113 Level     rk_timer
 18:          0          0          0          0          0          0  GICv3-23   0 Level     arm-pmu
 19:          0          0          0          0          0          0  GICv3-23   1 Level     arm-pmu
 20:          0          0          0          0          0          0     GICv3  37 Level     ff6d0000.dma-controller
 21:          0          0          0          0          0          0     GICv3  38 Level     ff6d0000.dma-controller
 22:          0          0          0          0          0          0     GICv3  39 Level     ff6e0000.dma-controller
 23:          0          0          0          0          0          0     GICv3  40 Level     ff6e0000.dma-controller
 24:          1          0          0          0          0          0     GICv3  81 Level     pcie-sys
 26:          0          0          0          0          0          0     GICv3  83 Level     pcie-client
 27:   10628015          0          0          0          0          0     GICv3  44 Level     eth0
 28:      28962          0          0          0          0          0     GICv3  97 Level     dw-mci
 29:          0          0          0          0          0          0     GICv3  58 Level     ehci_hcd:usb1
 30:          0          0          0          0          0          0     GICv3  60 Level     ohci_hcd:usb3
 31:          0          0          0          0          0          0     GICv3  62 Level     ehci_hcd:usb2
 32:          0          0          0          0          0          0     GICv3  64 Level     ohci_hcd:usb4
 33:          0          0          0          0          0          0     GICv3  94 Level     ff100000.saradc
 34:          0          0          0          0          0          0     GICv3  91 Level     ff110000.i2c
 35:          0          0          0          0          0          0     GICv3  67 Level     ff120000.i2c
 36:          0          0          0          0          0          0     GICv3  68 Level     ff160000.i2c
 38:        106          0          0          0          0          0     GICv3 132 Level     ttyS2
 39:          0          0          0          0          0          0     GICv3 129 Level     rockchip_thermal
 40:       3231          0          0          0          0          0     GICv3  89 Level     ff3c0000.i2c
 41:          0          0          0          0          0          0     GICv3  88 Level     ff3d0000.i2c
 44:          0          0          0          0          0          0     GICv3 147 Level     ff650800.iommu
 45:          0          0          0          0          0          0     GICv3  87 Level     ff680000.rga
 47:          0          0          0          0          0          0     GICv3 151 Level     ff8f3f00.iommu, ff8f0000.vop
 48:          0          0          0          0          0          0     GICv3 150 Level     ff903f00.iommu, ff900000.vop
 49:          0          0          0          0          0          0     GICv3  75 Level     ff914000.iommu
 50:          0          0          0          0          0          0     GICv3  76 Level     ff924000.iommu
 51:          0          0          0          0          0          0     GICv3  55 Level     ff940000.hdmi
 65:          0          0          0          0          0          0  rockchip_gpio_irq   5 Edge      GPIO Key Power
 67:          0          0          0          0          0          0  rockchip_gpio_irq   7 Edge      fe320000.dwmmc cd
113:          0          0          0          0          0          0  rockchip_gpio_irq  21 Level     rk808
114:          0          0          0          0          0          0  rockchip_gpio_irq  22 Edge      K1
166:         16          0          0          0          0          0  rockchip_gpio_irq  10 Level     stmmac-0:01
220:          0          0          0          0          0          0     GICv3  59 Level     rockchip_usb2phy
222:          0          0          0          0          0          0   ITS-MSI   0 Edge      PCIe PME, aerdrv
223:          0          0          0          0          0          0     GICv3 137 Level     xhci-hcd:usb5
224:          0          0          0          0          0          0     GICv3 142 Level     xhci-hcd:usb7
230:          0          0          0          0          0          0     rk808   5 Edge      RTC alarm
234:     855861          0          0          0          0          0   ITS-MSI 524288 Edge      eth1
IPI0:     11514     109149      55900      39843      31597     748488       Rescheduling interrupts
IPI1:       468       5518     259096     128778    4150594      21564       Function call interrupts
IPI2:         0          0          0          0          0          0       CPU stop interrupts
IPI3:         0          0          0          0          0          0       CPU stop (for crash dump) interrupts
IPI4:       976       2231       1880       1877       2373       2655       Timer broadcast interrupts
IPI5:     16670       6729       7059       4912      23162      13914       IRQ work interrupts
IPI6:         0          0          0          0          0          0       CPU wake-up interrupts
Err:          0

From the above output you can see that IRQ number 27 is mapped to the eth0 interface and IRQ number 234 is mapped to eth1. Now we can use the proper bit masks to assign those IRQs to another core.
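
Instead of reading the IRQ numbers off the table by hand, they can be looked up by interface name. A minimal sketch, assuming the interface name is the last column of its row as in the output above; `irq_for_name` is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: irq_for_name is a made-up helper, not part of FriendlyWRT.
# Reads /proc/interrupts-style text on stdin and prints the IRQ number of
# every row whose last column is exactly the given name.
irq_for_name() {
    awk -v name="$1" '$NF == name { sub(/:$/, "", $1); print $1 }'
}
```

With the table above, `irq_for_name eth0 < /proc/interrupts` would print 27 and `irq_for_name eth1 < /proc/interrupts` would print 234.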

In this case, I’ll assign IRQ 27 (eth0) to CPU4 and IRQ 234 (eth1) to CPU5. Both CPU4 and CPU5 are Cortex-A72 cores, the fast ones (1.8GHz). We need to do the same for the RPS of both interfaces. Fortunately, the bit masks are the same in both cases (IRQ affinity and RPS), so the bit mask for eth0 is 0x10 and for eth1 is 0x20, as the following table shows.

      CPU5  CPU4  CPU3  CPU2  CPU1  CPU0
eth0     0     1     0     0     0     0
eth1     1     0     0     0     0     0

As you can see in the above table, every cell is a bit, so we have 6 bits for 6 cores. Therefore 01 0000 (=0x10 in hex) is the bit mask for eth0 and 10 0000 (=0x20) is the bit mask for eth1.
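
The same masks can be computed rather than worked out by hand: the mask for CPU n is 1 shifted left by n, printed in hex. A minimal sketch; `cpus_mask` is a made-up helper name, and it prints the value without the 0x prefix, since that is the form the script further down writes.

```shell
#!/bin/sh
# Sketch only: cpus_mask is a made-up helper name.
# Print the hex bit mask (no 0x prefix) that selects the given CPU indexes.
cpus_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$((mask | (1 << cpu)))
    done
    printf '%x\n' "$mask"
}

cpus_mask 4      # CPU4 only (eth0)
cpus_mask 5      # CPU5 only (eth1)
cpus_mask 4 5    # both fast cores
```

The three calls print 10, 20 and 30. The last one is only there to show how a mask with more than one core is built; it is not used in this post.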

Then you need to create this script and make it executable.

#!/bin/sh
# CPU4 and CPU5 are the 1.8GHz cores
# Set CPU4 to handle eth0 IRQs
echo 10 > /proc/irq/27/smp_affinity
echo 10 > /sys/class/net/eth0/queues/rx-0/rps_cpus

# Set CPU5 to handle eth1 IRQs
echo 20 > /proc/irq/234/smp_affinity
echo 20 > /sys/class/net/eth1/queues/rx-0/rps_cpus

The first two commands write the mask to smp_affinity, which controls which core the IRQ is assigned to, and to rps_cpus, which controls which core does the network packet processing. Both files take the mask as a hexadecimal number, which is why the script writes 10 and 20.

The next two commands do the same for the eth1 interface.

In FriendlyWRT you can add a service in `/etc/init.d/` that calls this script, if you want these changes to take effect when the system boots.
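
As a rough sketch of such a service, an OpenWrt-style init script could look like the following. The file name `/etc/init.d/irq-affinity` and the script path `/root/irq-affinity.sh` are placeholders made up for this example, and I’m assuming FriendlyWRT keeps the stock OpenWrt `/etc/rc.common` mechanism.

```shell
#!/bin/sh /etc/rc.common
# Hypothetical /etc/init.d/irq-affinity (name and paths are placeholders).

# Start late in the boot sequence.
START=99

start() {
    # Placeholder path: wherever you saved the script shown above.
    /root/irq-affinity.sh
}
```

After making the file executable, `/etc/init.d/irq-affinity enable` should register it to run at boot.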

Nanopi-R4S benchmarks

In the previous post here, I ran several benchmarks using iperf, but I found that the two most demanding benchmarks for the device are the TCP test with two parallel threads and the UDP test, when both are done in both directions. This time I will run these tests using the default MTU, which is 1500, and also with an MTU of 512 bytes, as this seems to be the size that various protocols prefer. The MTU size defines the largest packet size that will be transmitted over the network.

This change in MTU needs to be done on the iperf server and client, therefore in my case I had to use this command on my workstation and laptop.

sudo ip link set dev IF_NAME mtu 512

where IF_NAME is the name of your network interface, e.g. eth0, enp3s0, etc. You can verify the interface MTU with this command:

ip link list
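
If you only want the MTU value, for example to check it from a script before starting a test run, it can be filtered out of the `ip` output. A minimal sketch; `mtu_from_ip_link` is a made-up helper name and eth0 is just an example interface.

```shell
#!/bin/sh
# Sketch only: mtu_from_ip_link is a made-up helper name.
# Reads `ip link` output on stdin and prints the value following "mtu"
# on the first line that has one.
mtu_from_ip_link() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "mtu") { print $(i + 1); exit } }'
}
```

Usage would be `ip -o link show dev eth0 | mtu_from_ip_link`. Reading `/sys/class/net/eth0/mtu` should give the same number without any parsing.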

The iperf server IP in these examples is 192.168.0.2 and the server is connected on the WAN port. The iperf client is 192.168.2.126 and it’s connected on the LAN port.

For the TCP test the commands I’ve used for the server and client are:

Server (WAN) iperf -s
Client (LAN) iperf -c 192.168.0.2 -t 120 -d -P 2

For the UDP test the commands I’ve used for the server and client are:

Server (WAN) iperf -s -u
Client (LAN) iperf -c 192.168.0.2 -u -t 120 -b 1000M -d

Nanopi-R4S benchmarks

These are the results for the Nanopi-R4S with and without using the network optimizations and using 512 and 1500 MTU.

MTU   TCP/UDP  Results Mbits/sec (default)  Results Mbits/sec (optimized)
1500  TCP      941                          941
1500  UDP      808                          742
512   TCP      575                          588
512   UDP      556                          (854)

This is the /proc/interrupts after the benchmark, to verify the core assignment to the IRQs.

           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5
 27:        152          0          0          0   22505343          0     GICv3  44 Level     eth0
234:        171          0          0          0          0    9048229   ITS-MSI 524288 Edge      eth1
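
This check can also be scripted, so it is easy to repeat after every run. A minimal sketch that assumes the first input line is the CPU header row, as in /proc/interrupts; `busiest_cpu` is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: busiest_cpu is a made-up helper name.
# Reads /proc/interrupts-style text on stdin and prints the CPU column
# with the highest count for the given IRQ number.
busiest_cpu() {
    awk -v irq="$1:" '
        NR == 1 { ncpu = NF; next }      # header row: one field per CPU
        $1 == irq {
            best = 0; max = -1
            for (cpu = 0; cpu < ncpu; cpu++) {
                count = $(cpu + 2) + 0   # counters start at field 2
                if (count > max) { max = count; best = cpu }
            }
            print "CPU" best
        }
    '
}
```

With the counters above, `busiest_cpu 27 < /proc/interrupts` would print CPU4 and `busiest_cpu 234 < /proc/interrupts` would print CPU5.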

Looking at those results, I don’t see any real benefit from assigning the IRQ and RPS to the fast cores. The throughput seems to be the same. The last UDP result for the optimized case, 854 Mbits/sec, is way faster than the non-optimized one, but I don’t really trust it, because I’m getting a weird warning when the test ends.

WARNING: did not receive ack of last datagram after 10 tries.

Also, I’m not getting any stats about the data that the receiver acknowledged, and my gut feeling is that UDP packets are actually lost when the MTU is set to 512. I think this needs more investigation, because it shouldn’t happen.

Therefore, given these results and ignoring the result in the parentheses, I would say that these optimizations don’t really benefit the network performance of the Nanopi-R4S.

Nanopi-R2S optimizations

It seems that the FriendlyWRT distro for the Nanopi-R2S is properly optimized by default: the eth0 IRQ is assigned to CPU1 and `xhci-hcd:usb4` is assigned to CPU2, as you can see here:

           CPU0       CPU1       CPU2       CPU3
 29:          0     473493          0          0     GICv2  56 Level     eth0
167:          0          0      81033          0     GICv2  99 Level     xhci-hcd:usb4

If you’re not aware, the second GbE port of the Nanopi-R2S is a USB-to-GbE adapter connected to a USB3.0 port, in this case `xhci-hcd:usb4`.

For that reason I don’t see any point in re-running the benchmarks for this case.

Conclusions

As I’ve mentioned, in my opinion the network optimization of assigning different cores to the network interface interrupts doesn’t really benefit the network performance of the Nanopi-R4S. I’m also not sure about the warning I’m getting in the last UDP test with the MTU set to 512, where it seems that data are actually getting lost, so I don’t consider this a valid result.

Be aware that there might be other network optimizations which I’m not aware of, so this post might be incomplete.

Also the nice thing is that the FriendlyWRT distro for the Nanopi-R2S seems to have those optimizations already in place.

Personally, I would still add the optimizations described in this post to the Nanopi-R4S, because it makes total sense to have them there. Keep in mind that in this test the device is just running the default distro stuff, so there’s no external hard drive connected, no extra services, etc. Normally, on a system that is fully utilized, there will be more services running in the background, so it’s a good strategy to have those optimizations anyway.

In any case, personally I like the Nanopi-R4S and I find that its performance is good enough for my needs. Hope this post helped if you own the device.

Have fun!

Benchmarking the NanoPi R4S

Intro

Note: I’ve written a complementary post on how to do some networking optimizations here.

This week I’ve received the new NanoPi R4S from FriendlyElec for evaluation purposes.

My original plan was to create a Yocto BSP layer for the board, as I did with the NanoPi R2S in a recent post here. The way I usually create Yocto BSP layers for those SBCs is that I take parts of the Armbian build tool, integrate them in bitbake recipes and then add them to a layer. The problem this time is that Armbian hasn’t released support for this board yet, so I thought it’s a good chance to benchmark the board itself.

Let’s first see the specs of the board.

NanoPi R4S specs

The board is based on the Rockchip RK3399 and the specs of the specific board I’ve received are:

  • Rockchip RK3399 SoC
    • 2x Cortex-A72 @1.8GHz
    • 4x Cortex-A53 @1.4GHz
    • Mali-T864 GPU
    • VPU capable of 4K VP9 and 4K 10bits H265/H264 60fps decoding
  • 4GB LPDDR4 RAM
  • RK808-D PMIC
  • 2x GbE ports
  • 2x USB 3.0 ports
  • Extension pin-headers
    • 2x USB 2.0
    • 1x SPI
    • 1x I2C
    • 1x UART (console)
    • RTC Battery
  • USB Type-C connector for power

What makes the board special is of course the dual GbE. One interface is integrated in the SoC and the other is a PCIe GbE which is connected on the SoC’s PCIe bus.

As you can guess, this board is meant to be used in custom router configurations, and that’s the reason there’s already a custom OpenWRT image for it, named FriendlyWRT. Therefore, in my tests I’ve used this image, and actually the one that came on the SD card with the board. More details about the versions later.

Also, the board I’ve received came with an aluminum case, which has a dual purpose: it houses the board, of course, and it’s also used to cool down the CPU. This is the SBC I’ve received.

The Nanopi-R4S’s case is very compact and only a bit bigger than that of its predecessor, the Nanopi-R2S, but it also packs more horsepower under the hood. You can see how those two compare in size.

As you can see in the above image, I’ve modified the case by drilling a hole above the USB power connector, in order to be able to use the UART console while the case is closed.

The Nanopi-R2S has the same issue, as you can’t connect the UART console while the case is closed. Therefore, I’ve done the same modification to both cases. This is very easy to do, as aluminum is soft and easy to drill.

Here’s an image of the Nanopi-R2S with the case open.

I also had to use a cutter to trim the top of the dupont connectors in order for the case to close properly.

In the Nanopi-R2S I’ve drilled the hole above the reset button and on the Nanopi-R4S above the USB power connector.

Another thing I need to mention here is that I also had to change the thermal pad on the Nanopi-R2S, because it comes with a 0.5mm pad and I’ve seen that the temperature was a bit high. When I replaced it with a 1mm thick pad, it was much better. I guess you can use any thermal pad, but in my case I’ve used this one that I got from ebay.

The thermal pad of the Nanopi-R4S seems to be fine so I haven’t changed that.

Test setup

The setup I’ve used for the benchmarks is very simple, and I’ve only used standard equipment that anyone can buy cheap. So, no expensive or smart Ethernet switches or expensive cables. I think that way the results reflect the most common scenario, which is using the board in a home or small office environment.

The switch I’ve used is the TP-Link TL-SG108, which is an 8-port GbE switch and it costs approx. 20 EUR here in Germany, which I believe is cheap. The cables I’ve used are some ultra-cheap 1m CAT6 which I use for my tests.

I’ve done two kinds of tests. The first tests only the WAN interface, using my workstation, which has an onboard GbE interface. This is the setup.

As you can see from the above diagram, the workstation is connected to the GbE switch, as is the WAN interface of the Nanopi-RxS (I mean both R2S and R4S). The LAN interface, which by default is bridged internally in FriendlyWRT, is connected to my Laptop, which uses a USB-to-GbE adapter as it doesn’t have a LAN connector.

I’ve thoroughly tested this USB-to-GbE adapter in many cases and it’s perfectly capable of 1 Gbit. So don’t worry about it, it won’t affect the results.

Someone could argue here that the switch should be placed after the LAN and not before the WAN, because the WAN is meant to be connected to your ADSL/VDSL router. Well, that’s one option, but this one is also valid for many setups. For example, I prefer to have the WiFi router before the WAN and some of my devices after the LAN, so the devices connected before the WAN don’t have access to the LAN devices via the bridge interface.

My workstation is a Ryzen 2700X and the Ethernet is an onboard interface on the ASRock Fatal1ty X470 Gaming K4 motherboard with the latest firmware at this date. The kernel and OS are the following:

PRETTY_NAME="Ubuntu 18.04.5 LTS"
Linux workstation 5.9.10-050910-generic #202011221708 SMP Sun Nov 22 18:07:21 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

The Laptop is a Lenovo 330S-15ARR with a Ryzen 2500u and the following kernel and OS.

PRETTY_NAME="Ubuntu 20.04.1 LTS"
Linux laptop 5.3.16-050316-generic #201912130343 SMP Fri Dec 13 08:45:06 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

The USB-to-GbE is based on the RTL8153.

Bus 002 Device 004: ID 0bda:8153 Realtek Semiconductor Corp. RTL8153 Gigabit Ethernet Adapter

This is the kernel and OS of the Nanopi-R4S

PRETTY_NAME="OpenWrt 19.07.4"
Linux FriendlyWrt 5.4.75 #1 SMP PREEMPT Tue Nov 10 11:13:15 CST 2020 aarch64 GNU/Linux

And this is the kernel and OS of the Nanopi-R2S

PRETTY_NAME="OpenWrt 19.07.1"
Linux FriendlyWrt 5.4.61 #1 SMP PREEMPT Fri Sep 4 15:12:58 CST 2020 aarch64 GNU/Linux

With this setup, I’ve used iperf to benchmark the WAN interface and the bridged interface. The WAN interface gets an IP from the DHCP router, which is also connected to the GbE switch, and the Nanopi-RxS runs its own DHCP server for the bridge interface, in the 192.168.2.0 subnet. Therefore, the Laptop gets an IP address in the 192.168.2.x range, but it does have access to all the IPs in the WAN. For that reason, in the bridge tests the Laptop always acts as the client.

Before we see the benchmark results, this is the table of the IP addresses.

Workstation 192.168.0.2
Nanopi-R4S (WAN) 192.168.0.62
Nanopi-R4S (LAN) 192.168.2.1
Nanopi-R2S (WAN) 192.168.0.63
Laptop 192.168.2.128

I’ll do 6 different tests, which are described in the following table.

Test #  iperf client cmd       Description
1       -t 120                 TCP, 2 mins, default window size
2       -t 120 -w 65536        TCP, 2 mins, 64KB requested window size (iperf reports 128KB)
3       -t 120 -w 131072       TCP, 2 mins, 128KB requested window size (iperf reports 256KB)
4       -t 120 -d -P 2         TCP, 2 mins, default window size, 2x parallel, both directions
5       -u -t 120 -b 1000M     UDP, 2 mins, 1 Gbit/s
6       -u -t 120 -b 1000M -d  UDP, 2 mins, 1 Gbit/s, both directions

2 mins means that the test lasts for 2 minutes (or 120 secs), which is a pretty good duration, as it means approx. 13GB of data over GbE.

I’ve also tried several TCP window sizes, but usually the default size is the one that you should focus on.
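
As a quick sanity check of that figure: at the roughly 941 Mbits/sec that iperf reports on this setup, 120 seconds carry 941e6 × 120 / 8 ≈ 14.1e9 bytes. A minimal sketch of the calculation, assuming iperf counts a GByte as 2^30 bytes (which is consistent with the transfer sizes in the outputs below); `transfer_gbytes` is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: transfer_gbytes is a made-up helper name.
# Print how many GBytes (2^30 bytes, an assumption about iperf's units)
# are transferred at <Mbits/sec> for <seconds>.
transfer_gbytes() {
    awk -v mbps="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", mbps * 1000000 * secs / 8 / 1073741824 }'
}

transfer_gbytes 941 120
```

This prints 13.1, which is in line with the 13.1 to 13.2 GBytes that iperf shows for the 120 second TCP runs.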

Finally, I’ve also added a 2x parallel socket test for TCP and for UDP I’ve tested both directions (server/client).

Let’s see the benchmarks now.

Nanopi-R4S WAN benchmarks

Laptop -> Switch -> Workstation

The first benchmark is to test the speed between the workstation and the Laptop on the GbE switch to verify the max speed this setup can achieve. These are the results:

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  289 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.78 port 54456 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.2 GBytes   941 Mbits/sec

In the tables I’ll be adding the commands that I’m using on each server/client and then I’ll add the result that iperf outputs.

So, in this case we can see that the setup maxes out at 941 Mbits/sec.

Workstation -> Switch -> Nanopi-R4S (WAN)

Test #1

Nanopi-R4S iperf -s
Workstation iperf -c 192.168.0.62 -t 120
------------------------------------------------------------
Client connecting to 192.168.0.62, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.2 port 59820 connected with 192.168.0.62 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.2 GBytes   942 Mbits/sec

Test #2

Nanopi-R4S iperf -s
Workstation iperf -c 192.168.0.62 -t 120 -w 65536
------------------------------------------------------------
Client connecting to 192.168.0.62, TCP port 5001
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local 192.168.0.2 port 59850 connected with 192.168.0.62 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.2 GBytes   941 Mbits/sec

Test #3

Nanopi-R4S iperf -s -w 131072
Workstation iperf -c 192.168.0.62 -t 120 -w 131072
------------------------------------------------------------
Client connecting to 192.168.0.62, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  128 KByte)
------------------------------------------------------------
[  3] local 192.168.0.2 port 59854 connected with 192.168.0.62 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.1 GBytes   941 Mbits/sec

Test #4

Nanopi-R4S iperf -s
Workstation iperf -c 192.168.0.62 -t 120 -d -P 2
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.62, TCP port 5001
TCP window size:  391 KByte (default)
------------------------------------------------------------
[  5] local 192.168.0.2 port 32882 connected with 192.168.0.62 port 5001
[  4] local 192.168.0.2 port 32880 connected with 192.168.0.62 port 5001
[  6] local 192.168.0.2 port 5001 connected with 192.168.0.62 port 55152
[  7] local 192.168.0.2 port 5001 connected with 192.168.0.62 port 55154
[ ID] Interval       Transfer     Bandwidth
[  8]  0.0-120.0 sec  6.45 GBytes   462 Mbits/sec
[  6]  0.0-120.0 sec  6.46 GBytes   462 Mbits/sec
[  4]  0.0-120.5 sec  6.47 GBytes   461 Mbits/sec
[  5]  0.0-120.5 sec  6.46 GBytes   461 Mbits/sec
[SUM]  0.0-120.5 sec  12.9 GBytes   922 Mbits/sec

Test #5

Nanopi-R4S iperf -s -u
Workstation iperf -c 192.168.0.62 -u -t 120 -b 1000M
------------------------------------------------------------
Client connecting to 192.168.0.62, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.2 port 50200 connected with 192.168.0.62 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec
[  3] Sent 9764940 datagrams
[  3] Server Report:
[  3]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec   0.000 ms 2127956392/2137718708 (0%)

Test #6

Nanopi-R4S iperf -s -u
Workstation iperf -c 192.168.0.62 -u -t 120 -b 1000M -d
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.62, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  4] local 192.168.0.2 port 40576 connected with 192.168.0.62 port 5001 (peer 2.0.13)
[  3] local 192.168.0.2 port 5001 connected with 192.168.0.62 port 48176
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec
[  4] Sent 9765245 datagrams
[  3]  0.0-120.0 sec  9.11 GBytes   652 Mbits/sec   0.027 ms    0/6653321 (0%)
[  3] WARNING: ack of last datagram failed after 10 tries.
[  4] Server Report:
[  4]  0.0-120.0 sec  12.9 GBytes   926 Mbits/sec   0.000 ms 2128274496/2137718403 (0%)
[  4] 0.00-120.00 sec  1 datagrams received out-of-order

Nanopi-R4S WAN results

As you can see from the above benchmarks, the Nanopi-R4S WAN interface is fully capable of GbE speed, which is really cool. I’ve also used the FriendlyWRT web interface to get some screenshots of the various LuCI statistics, using the collectd sensors that are enabled by default. These are the results.

As you can see, the CPU usage is ~35%, but most of the load is because of the docker daemon that is running in the background. Also note that the temperature is 60C without the heatsink and less than 40C with the heatsink. This happened because I added the heatsink in the middle of the test, but that was eventually a good thing, because I’ve seen the difference that it makes.

As you can see the WAN interface reaches the max GbE speed of my setup, which is really great.

The consumption of the Nanopi-R4S is 0.44A at 5V (about 2.2W) when idle and 1A (about 5W) during test #6.

Nanopi-R4S bridge benchmarks

So, now let’s test the bridge network interface, which is actually the most interesting benchmark for this device, as it shows its capabilities in a real-case scenario.

Workstation -> Switch -> Nanopi-R4S (WAN) -> Nanopi-R4S (Br0) -> Laptop (USB-to-GbE)

Test #1

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  204 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.126 port 39014 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.1 GBytes   940 Mbits/sec

Test #2

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120 -w 65536
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local 192.168.2.126 port 39196 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  9.50 GBytes   680 Mbits/sec

Test #3

Workstation iperf -s -w 131072
Laptop iperf -c 192.168.0.2 -t 120 -w 131072
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  128 KByte)
------------------------------------------------------------
[  3] local 192.168.2.126 port 39240 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  10.6 GBytes   758 Mbits/sec

Test #4

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120 -d -P 2
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  153 KByte (default)
------------------------------------------------------------
[  4] local 192.168.2.126 port 49010 connected with 192.168.0.2 port 5001
[  5] local 192.168.2.126 port 49012 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-120.0 sec  6.66 GBytes   476 Mbits/sec
[  5]  0.0-120.0 sec  6.49 GBytes   465 Mbits/sec
[SUM]  0.0-120.0 sec  13.2 GBytes   941 Mbits/sec

Test #5

Workstation iperf -s -u
Laptop iperf -c 192.168.0.2 -u -t 120 -b 1000M

Client connecting to 192.168.0.2, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.126 port 57264 connected with 192.168.0.2 port 5001
[  3] WARNING: did not receive ack of last datagram after 10 tries.
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec
[  3] Sent 9761678 datagrams

Test #6

Workstation iperf -s -u
Laptop iperf -c 192.168.0.2 -u -t 120 -b 1000M -d
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.2, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  4] local 192.168.2.126 port 39169 connected with 192.168.0.2 port 5001 (peer 2.0.10-alpha)
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-120.0 sec  13.4 GBytes   956 Mbits/sec
[  4] Sent 9757183 datagrams
[  4] Server Report:
[  4]  0.0-120.0 sec  7.84 GBytes   561 Mbits/sec   0.000 ms 2131996332/2137726465 (1e+02%)
[  4] 0.0000-120.0257 sec  1 datagrams received out-of-order

Nanopi-R4S bridge results

As you can see from the above benchmarks, the Nanopi-R4S can max out the GbE speed of my setup in bridged mode. The speed dropped when using custom TCP window sizes, which I guess is because of mismatched window sizes internally in the bridge, but I don’t pay much attention to this as the default TCP window size works fine.
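
One way to reason about the window-size effect is the usual rule of thumb that a single TCP stream cannot move more than one window of data per round trip. The sketch below only illustrates that bound: the round-trip time was not measured in these tests, so the RTT values are hypothetical, and the function name is made up for this example.

```python
def window_limited_mbps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound for one TCP stream: one full window per round trip."""
    return window_bytes * 8 / rtt_seconds / 1e6


# Window sizes iperf reported in tests #2 and #3 (128 and 256 KByte).
# The RTT values are assumptions for illustration, not measurements.
for window_kbyte in (128, 256):
    for rtt_ms in (0.5, 1.0, 2.0):
        bound = window_limited_mbps(window_kbyte * 1024, rtt_ms / 1000)
        print(f"{window_kbyte} KByte window, {rtt_ms} ms RTT: <= {bound:.0f} Mbits/sec")
```

Under these assumptions a fixed window only becomes the bottleneck once the round trip under load reaches the millisecond range, which may or may not be what happens inside the bridge.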

These are some screenshots from the web interface and the sensors.

As you can see, the CPU load is less than 30% and again most of the load comes from the docker daemon running in the background. Also, the temperature doesn’t exceed 38°C, which means the heatsink works really well.

Again, my personal opinion is that the Nanopi-R4S reaches the maximum performance of my network in bridge mode as well. Excellent.

Nanopi-R2S WAN Benchmarks

This post would be incomplete without comparing the benchmarks of the Nanopi-R4S and the Nanopi-R2S. Therefore, I’ve also executed the same benchmarks on the R2S on both the WAN and the bridge interface. These are the WAN results.

Test #1

Nanopi-R2S iperf -s
Workstation iperf -c 192.168.0.63 -t 120
------------------------------------------------------------
Client connecting to 192.168.0.63, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.2 port 60864 connected with 192.168.0.63 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.1 GBytes   941 Mbits/sec

Test #2

Nanopi-R2S iperf -s
Workstation iperf -c 192.168.0.63 -t 120 -w 65536
------------------------------------------------------------
Client connecting to 192.168.0.63, TCP port 5001
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local 192.168.0.2 port 60878 connected with 192.168.0.63 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.1 GBytes   937 Mbits/sec

Test #3

Nanopi-R2S iperf -s -w 131072
Workstation iperf -c 192.168.0.63 -t 120 -w 131072
------------------------------------------------------------
Client connecting to 192.168.0.63, TCP port 5001
TCP window size:  256 KByte (WARNING: requested  128 KByte)
------------------------------------------------------------
[  3] local 192.168.0.2 port 60884 connected with 192.168.0.63 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.1 GBytes   938 Mbits/sec

Test #4

Nanopi-R2S iperf -s
Workstation iperf -c 192.168.0.63 -t 120 -d -P 2
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.63, TCP port 5001
TCP window size:  246 KByte (default)
------------------------------------------------------------
[  5] local 192.168.0.2 port 60890 connected with 192.168.0.63 port 5001
[  4] local 192.168.0.2 port 60888 connected with 192.168.0.63 port 5001
[  6] local 192.168.0.2 port 5001 connected with 192.168.0.63 port 42238
[  7] local 192.168.0.2 port 5001 connected with 192.168.0.63 port 42240

[ ID] Interval       Transfer     Bandwidth
[  6]  0.0-120.0 sec  5.65 GBytes   405 Mbits/sec
[  4]  0.0-120.0 sec  5.12 GBytes   366 Mbits/sec
[  5]  0.0-120.1 sec  4.68 GBytes   335 Mbits/sec
[SUM]  0.0-120.1 sec  9.79 GBytes   701 Mbits/sec
[  8]  0.0-120.0 sec  7.29 GBytes   521 Mbits/sec

Test #5

Nanopi-R2S iperf -s -u
Workstation iperf -c 192.168.0.63 -u -t 120 -b 1000M
------------------------------------------------------------
Client connecting to 192.168.0.63, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.0.2 port 39818 connected with 192.168.0.63 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec
[  3] Sent 9765428 datagrams
[  3] Server Report:
[  3]  0.0-120.0 sec  8.06 GBytes   577 Mbits/sec   0.000 ms 2131827393/2137718220 (0%)

Test #6

Nanopi-R2S iperf -s -u
Workstation iperf -c 192.168.0.63 -u -t 120 -b 1000M -d
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.63, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  4] local 192.168.0.2 port 50365 connected with 192.168.0.63 port 5001 (peer 2.0.13)
[  3] local 192.168.0.2 port 5001 connected with 192.168.0.63 port 58210
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec
[  4] Sent 9765269 datagrams
[  3]  0.0-120.0 sec  7.56 GBytes   541 Mbits/sec   0.040 ms    0/5520546 (0%)
[  3] WARNING: ack of last datagram failed after 10 tries.
[  4] Server Report:
[  4]  0.0-120.0 sec  8.05 GBytes   576 Mbits/sec   0.000 ms 2131838458/2137718379 (0%)
[  4] 0.00-120.00 sec  1 datagrams received out-of-order

Nanopi-R2S WAN results

As you can see from the above benchmarks, although the TCP test with the default window size reaches the maximum performance, the performance drops in the TCP parallel test.
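
To compare the runs without reading every log by eye, the summary lines can be pulled out with a small script. This is a minimal sketch that assumes the iperf 2 text format shown in the logs above; the regular expression and the `parse_summaries` helper are illustrative names, not part of iperf.

```python
import re

# Matches iperf 2 summary lines such as:
# [SUM]  0.0-120.1 sec  9.79 GBytes   701 Mbits/sec
SUMMARY_LINE = re.compile(
    r"\[\s*(?P<stream>\d+|SUM)\]\s+"
    r"[\d.]+-\s*[\d.]+\s+sec\s+"
    r"[\d.]+\s+[KMG]?Bytes\s+"
    r"(?P<rate>[\d.]+)\s+(?P<rate_unit>[KMG]?bits/sec)"
)

TO_MBITS = {"bits/sec": 1e-6, "Kbits/sec": 1e-3, "Mbits/sec": 1.0, "Gbits/sec": 1e3}


def parse_summaries(log: str) -> list[tuple[str, float]]:
    """Return (stream id, bandwidth in Mbits/sec) for every summary line."""
    return [
        (m["stream"], float(m["rate"]) * TO_MBITS[m["rate_unit"]])
        for m in SUMMARY_LINE.finditer(log)
    ]


# Summary lines copied from the R2S WAN test #4 output above.
WAN_TEST_4 = """\
[  6]  0.0-120.0 sec  5.65 GBytes   405 Mbits/sec
[  4]  0.0-120.0 sec  5.12 GBytes   366 Mbits/sec
[  5]  0.0-120.1 sec  4.68 GBytes   335 Mbits/sec
[SUM]  0.0-120.1 sec  9.79 GBytes   701 Mbits/sec
[  8]  0.0-120.0 sec  7.29 GBytes   521 Mbits/sec
"""

for stream, mbits in parse_summaries(WAN_TEST_4):
    print(stream, mbits)
```

In this output, streams 4 and 5 add up to the 701 Mbits/sec reported on the [SUM] line, which is the figure used for the parallel test.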

These are the sensor graphs from the web interface.

As you can see, the processor usage is ~60%, which is double that of the Nanopi-R4S, but again the docker daemon uses ~30% of the CPU. Also, the temperature seems to be higher than on the Nanopi-R4S, even with the better thermal setup I’ve used.

Nanopi-R2S bridge benchmarks

Let’s now see the benchmarks of the bridge interface for the Nanopi-R2S.

Test #1

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  196 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.126 port 43028 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  9.51 GBytes   681 Mbits/sec

Test #2

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120 -w 65536
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local 192.168.2.126 port 43030 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  5.04 GBytes   360 Mbits/sec

Test #3

Workstation iperf -s -w 131072
Laptop iperf -c 192.168.0.2 -t 120 -w 131072
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  128 KByte (WARNING: requested 64.0 KByte)
------------------------------------------------------------
[  3] local 192.168.2.126 port 43030 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  5.04 GBytes   360 Mbits/sec

Test #4

Workstation iperf -s
Laptop iperf -c 192.168.0.2 -t 120 -d -P 2
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.2, TCP port 5001
TCP window size:  153 KByte (default)
------------------------------------------------------------
[  4] local 192.168.2.126 port 43042 connected with 192.168.0.2 port 5001
[  5] local 192.168.2.126 port 43044 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-120.0 sec  4.41 GBytes   315 Mbits/sec
[  4]  0.0-120.0 sec  4.63 GBytes   331 Mbits/sec
[SUM]  0.0-120.0 sec  9.04 GBytes   647 Mbits/sec

Test #5

Workstation iperf -s -u
Laptop iperf -c 192.168.0.2 -u -t 120 -b 1000M
------------------------------------------------------------
Client connecting to 192.168.0.2, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.126 port 52107 connected with 192.168.0.2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  13.4 GBytes   957 Mbits/sec
[  3] Sent 9762311 datagrams
[  3] Server Report:
[  3]  0.0-120.0 sec  6.42 GBytes   460 Mbits/sec   0.000 ms 2133031099/2137721337 (1e+02%)

Test #6

Workstation iperf -s -u
Laptop iperf -c 192.168.0.2 -u -t 120 -b 1000M -d
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.0.2, UDP port 5001
Sending 1470 byte datagrams, IPG target: 11.22 us (kalman adjust)
UDP buffer size:  208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.126 port 54616 connected with 192.168.0.2 port 5001 (peer 2.0.10-alpha)
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-120.0 sec  12.4 GBytes   890 Mbits/sec
[  3] Sent 9081045 datagrams
[  3] Server Report:
[  3]  0.0-120.2 sec   917 MBytes  64.0 Mbits/sec   0.000 ms 2137748578/2138402602 (1e+02%)
[  3] 0.0000-120.2481 sec  1 datagrams received out-of-order

Nanopi-R2S bridge results

As you can see from the above benchmarks, the Nanopi-R2S has trouble reaching the maximum performance of the network, and this seems to be an issue with the internal USB-to-GbE of the board. Nevertheless, some of you may be satisfied with those results for your use case.
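
The lost/total counters in the UDP server reports above look implausible (the totals are far larger than the number of datagrams actually sent), so a rough delivery estimate can be derived from the received byte count instead. The sketch below assumes that iperf's GBytes and MBytes are 1024-based, which is consistent with the "Sent ... datagrams" lines multiplied by 1470 bytes; the helper name is made up for this example.

```python
DATAGRAM_BYTES = 1470  # datagram size reported by iperf in the logs above
GIB = 1024 ** 3  # assumption: iperf "GBytes" are 1024-based
MIB = 1024 ** 2  # assumption: iperf "MBytes" are 1024-based


def delivered_fraction(sent_datagrams: int, received_bytes: float) -> float:
    """Estimate the share of sent datagrams that reached the server."""
    received_datagrams = received_bytes / DATAGRAM_BYTES
    return received_datagrams / sent_datagrams


# Values copied from the two R2S bridge UDP runs above.
one_way = delivered_fraction(9762311, 6.42 * GIB)
bidirectional = delivered_fraction(9081045, 917 * MIB)

print(f"one-way UDP: {one_way:.0%} delivered")
print(f"bidirectional UDP: {bidirectional:.0%} delivered")
```

These fractions are estimates from rounded log values, but they agree with the ratio of the server-side to the client-side bandwidth figures.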

These are the sensor graphs from the web interface.

The consumption when Nanopi-R2S is idle is 0.3A at 5V and 0.58A during test #6.
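
For reference, the current readings translate to power as P = V × I. A tiny sketch using only the figures above (the function name is illustrative):

```python
SUPPLY_VOLTS = 5.0  # supply voltage stated above


def power_watts(current_amps: float, volts: float = SUPPLY_VOLTS) -> float:
    """Electrical power drawn at the given current."""
    return volts * current_amps


idle_watts = power_watts(0.30)  # idle reading
load_watts = power_watts(0.58)  # reading during test #6

print(f"idle: {idle_watts:.1f} W, under load: {load_watts:.1f} W")
```

So the R2S draws roughly 1.5 W when idle and about 2.9 W during the heaviest test.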

Nanopi-R4S vs Nanopi-R2S

OK, some of you may want to choose between those two boards. I won’t get into too many details about the specific differences between them. Instead, I’ll just add a table here with the test results and a few notes after it.

Test    Nanopi-R4S (Mbits/sec)    Nanopi-R2S (Mbits/sec)
        WAN       Br0             WAN       Br0
#1      942       940             941       681
#2      941       680             937       360
#3      941       758             938       360
#4      922       941             701       647
#5      957       957             577       460
#6      926       561             576       64

From the above table you can see that the new Nanopi-R4S is faster than the R2S, especially when it comes to UDP data transfers. The R4S reaches the maximum network capacity in most cases, but the R2S can’t. Therefore, the R4S has better performance than the R2S.
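
To put a number on the difference, the table can be turned into per-test ratios. This is just a sketch over the values listed above; the `speedups` helper and the dictionary layout are illustrative.

```python
# test number: (R4S WAN, R4S Br0, R2S WAN, R2S Br0), all in Mbits/sec
RESULTS = {
    1: (942, 940, 941, 681),
    2: (941, 680, 937, 360),
    3: (941, 758, 938, 360),
    4: (922, 941, 701, 647),
    5: (957, 957, 577, 460),
    6: (926, 561, 576, 64),
}


def speedups(results: dict) -> dict:
    """Return {test: (WAN ratio, Br0 ratio)} of R4S over R2S throughput."""
    return {
        test: (r4s_wan / r2s_wan, r4s_br0 / r2s_br0)
        for test, (r4s_wan, r4s_br0, r2s_wan, r2s_br0) in results.items()
    }


for test, (wan, br0) in speedups(RESULTS).items():
    print(f"#{test}: WAN x{wan:.2f}, Br0 x{br0:.2f}")
```

The ratios stay close to 1 for the single-stream TCP tests on WAN and grow for the bridge and UDP tests.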

        Nanopi R4S                  Nanopi R2S
pros    + Performance               + Consumption
        + Up to 4GB RAM             + Smaller size
        + Nice case                 + Nice case
        + Doesn’t get hot           + Price
cons    – Price                     – Performance
        – A bit larger than R2S     – It gets a bit hot

The R4S costs $45 (1GB) and the R2S $22.

Conclusions

As I’ve mentioned, I’ve received this board from FriendlyArm as a sample for evaluation purposes. My plan was to create a Yocto BSP layer for this, but since the Armbian support is not ready yet I’ve decided to do a benchmark and a comparison between the R4S and the R2S.

Personally, I like the performance of this board and I think it’s a nice choice for a home router. Just keep in mind that, compared with the R2S, the performance is better but it also consumes more power, as the RK3399 is a more power-hungry SoC than the RK3328.

Hope this post was useful.

Have fun!