Putting a systemd service behind a VPN

It recently occurred that I had the need to route all traffic by a certain application through a VPN. While premade solutions for this exist using Docker (especially in combination with BitTorrent clients), I preferred looking for something that really just isolates what is needed and leaves the rest alone.

As it turns out systemd has good support for Linux namespaces and sharing them between units.

The basic plan here is as follows:

  1. Create a network namespace

  2. Get internet access into our namespace

  3. Enable the VPN inside the namespace

  4. Run the intended application inside the namespace

Namespace & NAT

While systemd can create a namespace it is not able to link it to a name for interaction with ip netns. This is not strictly necessary but simplifies further setup a lot so we'll take care of it manually.

/etc/systemd/system/netns@.service:

[Unit]
Description=Network namespace %i
StopWhenUnneeded=true

[Service]
Type=oneshot
RemainAfterExit=yes

# systemd creates a new network namespace for us
PrivateNetwork=yes

# Create new named namespace using ip netns
# (this ensures that things like /var/run/netns are properly setup)
ExecStart=/sbin/ip netns add %i

# Drop the network namespace that ip netns just created
ExecStart=/bin/umount /var/run/netns/%i

# Re-use the same name for the network namespace that systemd put us in
ExecStart=/bin/mount --bind /proc/self/ns/net /var/run/netns/%i

# Clean up the name when we're done
ExecStop=/sbin/ip netns delete %i

Starting this unit gets us a namespace with a single interface (lo for localhost) inside. What we want to do next is configure a virtual link between the namespace and the "outside" and a simple NAT setup on top of that.

/etc/systemd/system/veth-setup@.service:

[Unit]
Description=Setup veth for network namespace %i
Requires=netns@%i.service
Requires=systemd-networkd.service
After=netns@%i.service

[Service]
Type=oneshot
RemainAfterExit=yes

# Load $ADDRESS and $ROUTES from here
EnvironmentFile=/etc/veth-setup.%i.conf

# Create a veth pair (vg = "veth guest", vh = "veth host")
ExecStart=/sbin/ip link add vh-%i type veth peer name vg-%i

# Move one end into the netns and set it up
ExecStart=/sbin/ip link set dev vg-%i netns %i
ExecStart=/sbin/ip netns exec %i ip l set vg-%i up
ExecStart=/sbin/ip netns exec %i ip a add dev vg-%i $ADDRESS
ExecStart=/sbin/ip netns exec %i sh -c 'echo "$ROUTES" | while read -r args; do ip r add $args; done'
ExecStart=/sbin/ip netns exec %i sysctl -w net.ipv6.conf.vg-%i.disable_ipv6=1

The units are generic but you will have to decide on a name for your namespace now. I will be using 'example'.

/etc/veth-setup.example.conf:

ADDRESS=192.168.255.1/30
ROUTES='default via 192.168.255.0
192.168.0.0/16 via 192.168.255.0'

In this case the 192.168.0.0/16 route is needed so I can reach the software from my home network despite the VPN routing. If you have a more complex setup consider using a reverse proxy.

For convenience reasons we'll let systemd-networkd handle everything related to the host side including NAT setup.

/etc/systemd/network/vh-example.network:

[Match]
Name=vh-example
Driver=veth

[Network]
Address=192.168.255.0/30
IPMasquerade=ipv4
LinkLocalAddressing=no

Joining the namespace

At this point you will want modify the unit that enables the VPN connection to move it into the namespace. Run e.g. systemctl edit openvpn-client@nordvpn.service and configure as follows:

[Unit]
Requires=veth-setup@example.service
After=veth-setup@example.service
JoinsNamespaceOf=netns@example.service

[Service]
PrivateNetwork=true

Next do something similar (note the two different lines) for the service you wanted to isolate in the first place:

[Unit]
Requires=openvpn-client@nordvpn.service
After=openvpn-client@nordvpn.service
JoinsNamespaceOf=netns@example.service

[Service]
PrivateNetwork=true

At this point if you reload systemd and start the service, systemd will first create the namespace, run the setup commands, start the VPN service and then run the initial service. Inside the namespace everything will be routed through the VPN while the rest of the system is entirely unaffected.

Leak protection

To safeguard against bugs or misconfigurations you may want to make sure only VPN traffic can exit from the namespace. Here's an example with iptables and port 1194/UDP for VPN traffic:

-A FORWARD -s 192.168.255.0/30 -p udp -m udp --dport 1194 -j ACCEPT
-A FORWARD -s 192.168.255.0/30 -d 192.168.0.0/16 -j ACCEPT
-A FORWARD -s 192.168.255.0/30 -j DROP

Ad-hoc router using OpenWrt in a VM

/images/openwrt_crop.png

During writing of the last post I actually bricked my home router after installing a custom image. At the same time I didn't manage to get TFTP recovery working [1] so I started searching for ways to restore my home network for now to worry about fixing the router later.

The choice fell on using an ARM board to run OpenWrt, which required some creative workarounds. This post documents them.

What I want to replace:

  • router-switch combo: 1x WAN, 5x LAN, no WiFi

  • OS: OpenWrt

What I have:

  • 64-bit ARM SBC, capable of hardware virtualization

  • USB 3.0 Ethernet adapter (if your SBC has 2x LAN natively that works too)

  • 6-port Ethernet switch

Network setup

eth0 will be the LAN, eth1 the WAN. The LAN bridge also gets an IP address for management (pick one outside your DHCP pool) so you can SSH to the host even if the OpenWrt VM is not in operation.

/etc/systemd/network/br0.netdev:

[NetDev]
Name=br0
Kind=bridge

/etc/systemd/network/br1.netdev:

[NetDev]
Name=br1
Kind=bridge

/etc/systemd/network/br0.network:

[Match]
Name=br0
[Bridge]
MulticastRouter=permanent
[Network]
Address=192.168.2.254/24
Gateway=192.168.2.1
DNS=192.168.2.1
IPv6AcceptRA=no
LinkLocalAddressing=no

/etc/systemd/network/br1.netdev:

[Match]
Name=br1
[Bridge]
MulticastRouter=permanent
[Network]
DHCP=no
IPv6AcceptRA=no
LinkLocalAddressing=no

/etc/systemd/network/eth0.network:

[Match]
Name=eth0
[Network]
Bridge=br0

/etc/systemd/network/eth1.network:

[Match]
Name=eth1
[Network]
Bridge=br1

Virtual Machine

I had to virtualize OpenWrt because in terms of SBC it supports a few select platforms (eg. Raspberry Pi), but not any I own.

Installation:

cd /root
wget "https://downloads.openwrt.org/snapshots/targets/armvirt/64/openwrt-armvirt-64-"{rootfs-ext4.img.gz,Image}
gunzip openwrt-armvirt-64-rootfs-ext4.img.gz
echo 'allow all' >/etc/qemu/bridge.conf
systemctl enable guest.service

/root/run.sh:

#!/bin/bash
cd /root
exec qemu-system-aarch64 -enable-kvm -cpu host \
        -nographic -M virt,highmem=off -m 128 -smp 4 \
        -kernel openwrt-armvirt-64-Image -append "root=fe00" \
        -drive file=openwrt-armvirt-64-rootfs-ext4.img,format=raw,if=none,id=hd0 \
        -device virtio-blk-pci,drive=hd0 \
        -nic bridge,br=br0,model=virtio -nic bridge,br=br1,model=virtio

/etc/systemd/system/guest.service:

[Unit]
Description=QEMU guest
After=network.target
[Service]
ExecStart=/root/run.sh
Restart=always
RestartSec=10s
[Install]
WantedBy=multi-user.target

OpenWrt

If the default OpenWrt configuration has address conflicts with the rest of your network, you can adjust it ahead of time like so:

ip l add dev br0 type bridge
ip l add dev br1 type bridge
./run.sh
# log into openwrt
  uci set network.eth0.ipaddr=192.168.7.1
  uci commit
  poweroff

Now suppose you want to import the configuration of your old OpenWrt install, except: it has totally different interface names all over! Not a problem either as you can rename them by adding the following lines to /etc/rc.local:

ip l set eth0 name lan1
ip l set eth1 name wan

If you did everything right up until this point you can shut down the SBC, plug in all the cables, let it reboot and an OpenWrt router should appear in your network at http://192.168.1.1 just as if it was a real device.

Future thoughts

  • Setting net.core.default_qdisc = pfifo_fast on the host can save some CPU time, not sure whether this breaks QoS

  • If you have a WiFi USB adapter you could pull it into the VM and enjoy wireless too! See here for an example

  • When running this setup long term it would make sense to strip down the host OS

  • The QEMU serial console could be bound to a socket or PTS to interact with it even when guest.service is running

OpenWrt build notes

This post contains Q&A-style notes on compiling software for OpenWrt or compiling OpenWrt itself.

SDK or Source code?

If you want to cross-compile software to run on an OpenWrt device or rebuild/patch existing OpenWrt packages, the SDK suffices.

The SDK can be found under the <target>/<device> folders on downloads.openwrt.org, note that the snapshots are rebuilt frequently and not a stable target to build for. The source is at https://github.com/openwrt/openwrt, make sure to check out a branch/tag if you don't want the development branch.

Ok, now how do I set it up?

Build tools (tested on Ubuntu 22.04):

sudo apt update
sudo apt install build-essential gawk gcc-multilib flex git gettext libncurses5-dev libssl-dev python3-distutils unzip zlib1g-dev

Feeds:

cd openwrt
./scripts/feeds update -a && ./scripts/feeds install -a

Don't forget to do all this as an unprivileged user (even if you have e.g. a container), the build tools don't like being run as a root.

[Source] Target and package selection

Run make menuconfig. The first three options select the device to target. Most of the rest are for packages, < > means a package will not be built, <*> will build it into the final firmware image (preinstalled), <M> will build it as an installable package file.

Pressing / will open up search.

Configuring is annoying so you can keep the .config file for next time. Copy it back and run make oldconfig or yes "" | make oldconfig if you don't like answering questions.

[Source] Building everything

Run nice make -j$(nproc).

The firmware and sysupgrade images will end up in bin/targets/, packages also under bin/packages/.

How do I build a single package?

make packages/NAMEHERE/compile, that's it.

[SDK] How do I build software that's not in OpenWrt?

Haven't had to do that yet, the answer is probably here.

[SDK] I don't care about packaging, where's the cross-compiler?
./staging_dir/toolchain-<whatever>/bin/<arch>-openwrt-linux-gcc
(This likely won't lead you to success unless all you need is a simple C program.)
How do I apply patches to packages?

There's probably multiple ways but the most convenient is overriding the source tree.

You should have a git checkout with your modifications (commit them!) somewhere. Then:

ln -s /somewhere/.git package/network/services/odhcpd/git-src
make package/odhcpd/{clean,compile} V=s CONFIG_SRC_TREE_OVERRIDE=y
[Source] Which kernel version will I get?

Check for KERNEL_PATCHVER in target/linux/<target>/Makefile.

If a KERNEL_TESTING_PATCHVER is defined too you can switch to the newer kernel by enabling "Global build settings > Use the testing kernel version".

Note that OpenWrt kernels are heavily patched so you can't really use a version other than the predefined ones even if you wanted to.

[Source] Ricing your build flags

If you've ever used Gentoo you will find this fun. [1]

Under "Advanced configuration options" you can configure global compiler flags (TARGET_OPTIMIZATION) and the ones used by the kernel (KERNEL_CFLAGS).

Tell me more!

Headless Raspberry Pi OS virtualization, 64-bit edition

With Raspbian (now named Raspberry Pi OS) having been released as 64-bit I can finally write a proper sequel to the previous post that dealt with virtualizing ARM/Linux distributions headlessly using QEMU.

You can read the original article here: Virtualizing Raspbian (or any ARM/Linux distro) headless using QEMU. Since the process is the same I will skip detailed explanations here.

Native emulation in QEMU

QEMU includes a raspi3b machine type and emulates UART, SD, USB controllers and more. This is enough for working headless usage.

With the root filesystem prepared and the appropriate files extracted from the boot partition the command line woud look as follows:

qemu-system-aarch64 -M raspi3b -kernel kernel8.img -dtb bcm2710-rpi-3-b.dtb \
        -drive file=rootfs.qcow2,if=sd -usb -device usb-net,netdev=u1 -netdev user,id=u1 \
        -append "root=/dev/mmcblk0 rw console=ttyAMA0" -nographic

The system boots up fine (with a few errors here and there) and is usable but I don't suggest using it like this.

A better alternative

The virt machine type is much better suited for this, our plan is to attach both the disk and network via Virtio.

For the kernel (and modules) we'll grab the linux-aarch64 package from Arch Linux ARM.

Extracting the root filesystem into a virtual disk image

Download Raspberry Pi OS (64-bit) from the official website, then run the script below or follow the steps manually.

The only difference from before is that we have to unlock the "root" user so we can actually log in later.

#!/bin/bash -e
input=2022-04-04-raspios-bullseye-arm64-lite.img
[ -f $input ]

mkdir mnt
cp --reflink=auto $input source.img
truncate -s 10G source.img
echo ", +" | sfdisk -N 2 source.img
dev=$(sudo losetup -fP --show source.img)
[ -n "$dev" ]
sudo resize2fs ${dev}p2
sudo mount ${dev}p2 ./mnt -o rw
sudo sed '/^PARTUUID/d' -i ./mnt/etc/fstab
sudo sed '/^root:/ s|\*||' -i ./mnt/etc/shadow
remove_services=rpi-eeprom-update,hciuart,dphys-swapfile,rng-tools-debian
sudo bash -c "rm -f \
        ./mnt/etc/systemd/system/multi-user.target.wants/{$remove_services}.service \
        ./mnt/etc/rc?.d/?01{$remove_services}"
sudo umount ./mnt
sudo chmod a+r ${dev}p2
qemu-img convert -O qcow2 ${dev}p2 rootfs.qcow2
sudo losetup -d $dev
rm source.img; rmdir mnt

The kernel and initramfs

Kernel

Extract the kernel like this:

tar -xvf linux-aarch64*.pkg.tar.* --strip-components=1 boot/Image.gz

Building an initramfs

The differences to the previous iteration of the script are:

  • Recompressing of zstd kernel modules as gzip (no busybox support for it)

  • Busybox isn't downloaded. You need to compile it for 64-bit ARM yourself and insert the path [1]

#!/bin/bash -e
pkg=$(echo linux-aarch64-*.pkg.tar.*)
[ -f "$pkg" ]

mkdir initrd; pushd initrd
mkdir bin dev mnt proc sys
tar -xaf "../$pkg" --strip-components=1 usr/lib/modules
rm -rf lib/modules/*/kernel/{sound,drivers/{net/{wireless,ethernet},media,gpu,iio,staging,scsi}}
find lib/modules -name '*.zst' -exec zstd -d --rm {} ';'
find lib/modules -name '*.ko' -exec gzip -9 {} ';'
install -p /FILL/ME/IN/busybox-aarch64 bin/busybox
cat >init <<"SCRIPT"
#!/bin/busybox sh
busybox mount -t proc none /proc
busybox mount -t sysfs none /sys
busybox mount -t devtmpfs none /dev

for mod in virtio-pci virtio-blk virtio-net; do
        busybox modprobe $mod
done

busybox mount -o rw /dev/vda /mnt || exit 1

busybox umount /proc
busybox umount /sys
busybox umount /dev

exec busybox switch_root /mnt /sbin/init
SCRIPT
chmod +x bin/busybox init
bsdtar --format newc --uid 0 --gid 0 -cf - -- * | gzip -9 >../initrd.gz
popd; rm -r initrd

Booting the virtual machine

With the initramfs built, we have all parts needed to actually run the VM:

qemu-system-aarch64 -M virt -cpu cortex-a53 -m 2048 -smp 4 -kernel Image.gz -initrd initrd.gz \
        -drive file=rootfs.qcow2,if=virtio -nic user,model=virtio \
        -append "console=ttyAMA0" -nographic
After a bit of booting you should be greeted by Debian GNU/Linux 11 raspberrypi ttyAMA0 and a login prompt.
You can log in as "root" without a password.

Installing UniFi Network on Raspberry Pi OS 11 (bullseye)

Installing an Unifi controller on a Raspberry Pi seems like a straightforward task until you notice the section with system requirements.

The software requires a MongoDB version before 4.0. The last version that satisfies this is 3.7.9 which is almost four years old at the time of writing. You may find old versions packaged on the MongoDB website or in other repositories but certainly not for ARM. The second problem is that MongoDB dropped 32-bit support in version 3.4 so the latest we can actually use is 3.2.22 (also 4 years old).

In the end I was unable to find a build of MongoDB 3.2 that could run on a Pi which leaves only the option of compiling from source. This is what I ended up doing, it required lots of trial and error [1] before it succeeded. To (hopefully) save someone else time I put up the final Debian package for download.

Instructions

Requirements:

  1. Add the Unifi repository

apt install ca-certificates apt-transport-https
echo 'deb https://www.ui.com/downloads/unifi/debian stable ubiquiti' >/etc/apt/sources.list.d/100-ubnt-unifi.list
apt-key adv --keyserver keyserver.ubuntu.com --recv 06E85760C0A52C50
apt update
  1. Install required packages.

apt install ./mongodb-server_3.2.22_armhf.deb unifi
  1. Create a mongod wrapper script

printf '%s\n' '#!/bin/bash' 'exec /usr/bin/mongod --journal "$@"' >/usr/lib/unifi/bin/mongod
chmod +x /usr/lib/unifi/bin/mongod
  1. Enable and start the Unifi service: systemctl enable --now unifi

  2. Wait a while, it can take about 5 minutes until the controller is reachable at https://IP:8443/.

Not actually using a Raspberry Pi?

On a normal amd64 Debian system this whole ordeal gets much simpler:

  • Unifi installation works as described as above

  • mongodb-org-server_3.6.23_amd64.deb can be downloaded from the MongoDB website and runs out-of-the-box

  • OpenJDK 8 can be pulled from AdoptOpenJDK's repo since it's not available on recent Debian versions anymore

As of Unifi 7.3 the JDK support has changed to version 11, so you don't need a separate repo anymore.


Overview of VPN providers that support IPv6

Finding commercial VPN providers that provide real IPv6 support (as opposed to just blackholing traffic) is pretty hard so I decided to write down the ones I know about. The amount of information present depends on the provider's documentation and whether I have tested them myself.

Provider Access method IPv6 egress? Connect via IPv6? IP inside tunnel?1 Port forwarding possible?2 Last updated
AirVPN OpenVPN / Wireguard Yes
(NAT)
Yes Private Yes Feb 2023 (T)
AzireVPN OpenVPN / Wireguard Yes
(NAT)
No Public No Feb 2023 (T)
Hide.me OpenVPN / Wireguard Yes
(?)
? ? ? Nov 2021
IVPN Wireguard Yes
(NAT)
No Private No Nov 2021 (T)
Mullvad OpenVPN Yes
(NAT)
No Private No May 2023 (T)
Wireguard Yes
(NAT)
Yes Private No
Njalla OpenVPN Yes Yes Public No need (public IP) Apr 2022 (T)
OpenVPN (NAT option) Yes
(NAT)
Yes Private No
Wireguard Yes
(NAT)
Yes
OVPN.com OpenVPN Yes No Public No need (public IP) Feb 2023 (T)
Wireguard Yes
(NPTv6)
OVPN.to OpenVPN Yes
(NAT)
Yes Private Yes Feb 2023
Perfect Privacy OpenVPN Yes
(?)
? ? ? Nov 2021
VPN Secure OpenVPN Yes
(?)
? ? ? May 2023

1: This is mainly important for address selection according to RFC3484. If this column says "Private" your OS will prefer IPv4 when connected to the VPN despite presence of an IPv6 address.
2: Specificially, whether IPv6 port forwarding works.

Setting up Smokeping in a systemd-nspawn container

Smokeping is a nifty tool that continuously performs network measurements (such as ICMP ping tests) and graphs the results in a web interface. It can help you assess performance and detect issues in not only your own but also upstream networks.

/images/smokeping_last_864000.png

This is not how your graphs should look.

This article details setup steps for running Smokeping in a systemd-nspawn container with some additional requirements:

  • IPv6 probes must work

  • The container will directly use the host network so that no routing, NAT or address assignment needs to be set up.

  • To reduce disk and runtime footprint the container will run Alpine Linux

Container setup

First we need to set up an Alpine Linux root filesystem in a folder.
Usage is simple: ./alpine-container.sh /var/lib/machines/smokeping

Next we'll boot into the container to configure everything: systemd-nspawn -b -M smokeping -U

If not already done edit /etc/apk/repositories to add the community repo.
Additionally, you have to touch /etc/network/interfaces so that the network initscript can start up later (even though there is nothing to configure).

Install all required packages: apk add smokeping fping lighttpd ttf-dejavu

Make sure that fping works by running e.g. fping ::1.

lighttpd

Next is the lighttpd configuration inside /etc/lighttpd.

Get rid of all the examples: mv lighttpd.conf lighttpd.conf.bak && grep -v '^#' lighttpd.conf.bak | uniq >lighttpd.conf

There are multiple changes to be done in lighttpd.conf:

  • change server.groupname = "smokeping", the CGI process will need access to smokeping's files.

  • add server.port = 8081 and server.use-ipv6 = "enable"

  • configure mod_fastcgi for Smokeping by appending the following:

server.modules += ("mod_fastcgi")
fastcgi.server = (
        ".cgi" => ((
                "bin-path" => "/usr/share/webapps/smokeping/smokeping.cgi",
                "socket" => "/tmp/smokeping-fastcgi.socket",
                "max-procs" => 1,
        )),
)

We also need to link smokeping's files into the webroot: ln -s /usr/share/webapps/smokeping/ /var/www/localhost/htdocs/smokeping

smokeping

Next is the smokeping configuration located at /etc/smokeping/config.

The most important change here is to set cgiurl to the URL smokeping will be externally reachable at, like so:
cgiurl = http://your_server_here:8081/smokeping/smokeping.cgi

Smokeping's configuration [2] isn't super obvious if you haven't done it before so I'll provide an example here (this replaces the Probes and Targets sections):

*** Probes ***

+ FPing
binary = /usr/sbin/fping

+ FPing6
binary = /usr/sbin/fping

*** Targets ***
probe = FPing

menu = Top
title = Network Latency Grapher
remark =

+ targets
menu = IPv4 targets

++ google
menu = Google
title = Example Target: Google (IPv4)
host = 8.8.4.4

+ targets6
menu = IPv6 targets
probe = FPing6

++ google
menu = Google
title = Example Target: Google (IPv6)
host = 2001:4860:4860::8844

Lastly, grant the CGI process write access to the image folder: chmod g+w /var/lib/smokeping/.simg

Final container setup

Set services to run on boot: rc-update add smokeping && rc-update add lighttpd
Then shut down the container using poweroff.

We need to tell systemd-nspawn not to create a virtual network when the container is started as a service.
Do this by creating /etc/systemd/nspawn/smokeping.nspawn:
[Exec]
KillSignal=SIGTERM

[Network]
VirtualEthernet=no

Finally start up the container: systemctl start systemd-nspawn@smokeping
If this does not work due to private users you are running on old systemd [3] and can try again with PrivateUsers=no in the Exec section.

You can now visit http://your_server_here:8081/smokeping/smokeping.cgi and should see a mostly empty page with a sidebar containing "Charts", "IPv4 targets" and "IPv6 targets" on the left.


Fully unprivileged VMs with User Mode Linux (UML) and SLIRP User Networking

A few months ago I wanted to test something that involved OpenVPN on an old, small VPS I rented.

The VPS runs on OpenVZ, which shares a kernel with the host and comes with one important constraint:
TUN/TAP support needs to be manually enabled, which on this VPS it was not.

Maybe run a VM instead? Nope, KVM is not available.

At this point it would've been easier to give up or temporarily rent another VPS, but I really wanted to run the test on this particular one.

Enter: User Mode Linux

User Mode Linux (UML) is a way to run the Linux kernel as an user-mode process on another Linux system, no root or special setup required.

At its time it was considered to be useful for virtualization, sandboxing and more. These days it's well past its prime but it still exists in the Linux source and more importantly works.

You'd build a kernel binary like this:

git clone --depth=100 https://github.com/torvalds/linux
cd linux
make ARCH=um defconfig
nice make ARCH=um -j4
strip vmlinux

The virtual machine will require a root filesystem image, you can obtain one via the usual ways such as debootstrap (Debian/Ubuntu) or pacstrap (Arch) which I won't cover here.

Networking

Now onto the next issue: How is networking supported in User Mode Linux?

UML provides a number of options for network connectivity [1]: attaching to TUN/TAP, purely virtual networks and SLIRP
TUN/TAP is out of question, a virtual network doesn't help us so that leaves only SLIRP.

SLIP is a very old protocol [2] designed to carry IP packets over a serial line. SLIRP describes the use of this protocol to share the host's Internet connection over serial.
The SLIRP application exposes a virtual network to the client and performs NAT internally.

The standard slirp implementation is found on sourceforge: https://sourceforge.net/projects/slirp/
Its last release was in 2006 and the tarball even includes a file named security.patch that swaps out a few sprintf for snprintf and adds /* TODO: length check */ in other places.
At this point it was obvious that this wasn't going to work.

Rewriting slirp

The only logical thing to do now is to rewrite slirp so that it works.

Although slirp itself is dead the concept lives on as libslirp, which is notably used by QEMU [3] and VirtualBox.
libslirp's API is still a bit too low-level so I chose to use libvdeslirp.
SLIP is a simple protocol and not too complicated to implement, the rest is just passing packets around with Ethernet (un-)wrapping and a tiny bit of ARP.

Here's the code: https://gist.github.com/sfan5/696ad2f03f05a3e13952c44f7b767e81

Usage

You'll need:

  • vmlinux: the User Mode Linux kernel

  • root_fs: a root filesystem image

  • /path/to/slirp: the compiled slirp binary (build it using the Makefile that comes with it)

At this point your virtualized Linux system is just one command away:

./vmlinux mem=256M rw eth0=slirp,,/path/to/slirp
Once logged in you need to manually configure the network like this:
ip a add dev eth0 10.0.2.1 && ip l set eth0 up && ip r add default dev eth0
echo nameserver 10.0.2.3 >/etc/resolv.conf
While you enter these you should see --- slirp ready --- on the console you ran vmlinux on.

You can forward port(s) from outside into the VM by editing the commented out code in slirp.c.


Dealing with glibc faccessat2 breakage under systemd-nspawn

Backstory

A few months ago I stumbled upon this report on Red Hat's bugzilla.

The gist of it is that glibc began to make use of the new faccessat2 syscall, which when running under older systemd-nspawn is filtered to return EPERM. This misdirects glibc into assuming a file or folder cannot be accessed, when in reality nspawn just doesn't know the syscall.

A fix was submitted to systemd [1] but it turned out this didn't only affect nspawn, but also needed to be fixed in various container runtimes and related software [2] [3] [4] [5]. Hacking around it in glibc [6] or the kernel [7] was proposed, with both (rightfully) rejected immediately.

I pondered what an awful bug that was and was glad I didn't have to deal with this mess.


Fast forward to last week, I upgraded an Arch Linux installation I had running in a container. Immediately after the update pacman refused to work entirely, complaining it "could not find or read" /var/lib/pacman when this directory clearly existed (I checked).

A few minutes later (and after noticing the upgrade to glibc 2.33) it hit me that this was the exact bug I read about months ago. And, worse, that I'd have to deal with a lot more since I have multiple servers that run containers on systemd-nspawn.

Binary patching systemd-nspawn to fix the seccomp filter

If you hit this bug with one of your containers you have exactly one option: upgrade systemd on the host system to v247 or later.

Aside from the fact that upgrading something as central as systemd isn't exactly risk free, I couldn't do that even if I wanted. There is no backported systemd for Ubuntu 18.04 LTS.

This calls for another option: Patching systemd yourself to fix the bug.

Without further ado, here's a Python script doing exactly that. I've tested that it performs the correct patch on Debian 10, Ubuntu 18.04 and Ubuntu 20.04. There are also plenty safeguards that it shouldn't break anything no matter what (no warranty though).

#!/usr/bin/env python3
import subprocess, re, os
path = "/usr/bin/systemd-nspawn"
print("Looking at %s" % path)
proc = subprocess.Popen(["objdump", "-w", "-d", path], stdout=subprocess.PIPE, encoding="ascii")
instr = list(m.groups() for m in (re.match(r'\s*([0-9a-f]+):\s*([0-9a-f ]+)\s{4,}(.+)', line) for line in proc.stdout) if m)
if proc.wait() != 0: raise RuntimeError("objdump returned error")
p_off, p_old, p_new = None, None, None
for i, (addr, b, asm) in enumerate(instr):
        if asm.startswith("call") and "<seccomp_init_for_arch@" in asm:
                print("Found function call at 0x%s:\n  %s%s" % (addr, b, asm))
                for addr, b, asm in instr[i-1:i-12:-1]:
                        m = re.match(r'mov\s+\$0x([0-9a-f]+)\s*,\s*%edx', asm)
                        if m:
                                print("Found argument at 0x%s:\n  %s%s" % (addr, b, asm))
                                m = int(m.group(1), 16)
                                if m == 0x50026:
                                        print("...but it's already patched, nothing to do.")
                                        exit(0)
                                if m != 0x50001: raise RuntimeError("unexpected value")
                                p_off, p_old = int(addr, 16), bytes.fromhex(b)
                                if len(p_old) != 5: raise RuntimeError("unexpected instr length")
                                p_new = b"\xba\x26\x00\x05\x00"
                                break
                        if re.search(r'%[re]?dx|^(call|pop|j[a-z])', asm): break # likely went too far
                break
if not p_off: raise RuntimeError("no patch location found")
print("Patching %d bytes at %d from <%s> to <%s>" % (len(p_old), p_off, p_old.hex(), p_new.hex()))
with open(path, "r+b") as f:
        if os.pread(f.fileno(), len(p_old), p_off) != p_old: raise RuntimeError("contents don't match")
        os.pwrite(f.fileno(), p_new, p_off)
print("OK.")

Running the above script (as root) will attempt to locate certain related instructions in /usr/bin/systemd-nspawn, attempt to patch one of them and hopefully end with an output of "OK.".

What does the binary patch change? Essentially it makes the following change to the compiled code of nspawn-seccomp.c:

 log_debug("Applying allow list on architecture: %s", seccomp_arch_to_string(arch));

-r = seccomp_init_for_arch(&seccomp, arch, SCMP_ACT_ERRNO(EPERM));
+r = seccomp_init_for_arch(&seccomp, arch, SCMP_ACT_ERRNO(ENOSYS));
 if (r < 0)
         return log_error_errno(r, "Failed to allocate seccomp object: %m");

Instead of EPERM, both blocked and unknown syscalls now return ENOSYS back to the libc. This isn't ideal either (error handling code might get a bit confused) but it is more correct and allows glibc to not catastrophically fail upon attempting to use faccessat2,

Unfortunately the same change cannot [8] be applied to processes in already running containers, you have to restart them.

Further reading

Running Windows 10 for ARM64 in a QEMU virtual machine

/images/2020-08-04_scrot.png

Since the development stages of Windows 10, Microsoft has been releasing a version of Windows that runs on 64-bit ARM (AArch64) based CPUs. Despite some hardware shipping with Windows 10 ARM [1] [2] [3] this port has received little attention and you can barely find programs that run on it.

Naturally, I wanted to try this out to see if it worked. And it turned out it does!

Getting the ISO

I'm not aware of any official page that lets you download an ARM64 ISO, so this part relies on community-made solutions instead:

In the MDL forums I looked for the right ESD download link and used an ESD>ISO conversion script (also found there) to get a bootable ISO.

Alternatively adguard's download page provides similar scripts that download and pack an ISO for you, though they're pretty slow in my experience.

There's one more important point:

I had no success booting version 2004 or 20H2 (specifically: 19041.388 / 19041.423) so I went with version 1909 (18363.592) instead.

Starting with 18363.1139 Windows seems to require the virtualization extension [7] to be enabled. I had initially used an older version before finding this out, the command line below has now been corrected accordingly. (Updated 2020-12-27)

Installation

Before we begin we also need:

  • the Virtio driver ISO

  • an appropriately sized disk image (qemu-img create -f qcow2 disk.qcow2 40G)

  • QEMU_EFI.fd extracted from the edk2.git-aarch64 RPM found here

The qemu command line then looks as follows:

isoname=19042.631.201119-2058.20H2_RELEASE_SVC_REFRESH_CLIENTENTERPRISE_VOL_A64FRE_DE-DE.ISO
virtio=~/Downloads/virtio-win.iso
qemu-system-aarch64 -M virt,virtualization=true -cpu cortex-a53 -smp 4 -m 4096 \
        -device qemu-xhci -device usb-kbd -device usb-tablet \
        -drive file=disk.qcow2,if=virtio \
        -nic user,model=virtio \
        -drive file="$isoname",media=cdrom,if=none,id=cdrom -device usb-storage,drive=cdrom \
        -drive file="$virtio",media=cdrom,if=none,id=drivers -device usb-storage,drive=drivers \
        -bios QEMU_EFI.fd -device ramfb

You can then follow the installation process as normal. Before partitioning the disks the setup will ask you to load disk drivers, these can be found at viostor/w10/ARM64/ on the virtio cdrom.

Video output

The above command line already takes these limitations into account, these sections are for explanation only.

A previous blogpost on running Windows 10 ARM in QEMU has used a patched EDK2 to get support for standard VGA back in. It's not clear to me why EDK2 removed support if it was working, but this is not a solution I wanted to use either way.

It turns out [4] that the options on ARM are limited to virtio gpu and ramfb. Virtio graphics are Linux-only so that leaves ramfb.

Attaching disk images

Since the virt machine has no SATA controller we cannot attach a hard disk to the VM the usual way, I went with virtio here instead. It would have been possible to do this over usb-storage, this works out of the box and would have saved us all the work with virtio drivers (except for networking [5]).

This also means something else (which has wasted me quite some time): You cannot use -cdrom.

If you do, EDK2 will boot the Windows CD fine but setup will ask you to load drivers early (because it cannot find its own CD). None of the virtio drivers can fix this situation, leaving you stuck with no clear indication what went wrong.

After installation

The onboarding process has a few hiccups (in particular device detection), if you retry it a few times it'll let you continue anyway.

High CPU Usage

After the first boot I noticed two regsvr32.exe processes at 100% CPU that didn't seem to finish in reasonable time.

Further investigation with Process Explorer [6] showed these belonging to Windows' printing service. Since I don't want to print in this VM anyway, the affected service can just be stopped and disabled:

sc stop "Spooler"
sc config "Spooler" start= disabled

Networking

We're still missing the network driver from the virtio cdrom. Unfortunately the NetKVM driver doesn't seem to be properly signed, so you have to enable loading unsigned drivers first (and reboot!):

bcdedit /set testsigning on

Afterwards the right driver can be installed from the device manager (NetKVM/w10/ARM64 on cdrom).

General Performance Tweaks

These aren't specific to Windows 10 ARM or Virtual Machines, but can be pretty useful to stop your VM from acting sluggish.

REM Disable Windows Search Indexing
sc stop "WSearch"
sc config "WSearch" start= disabled
REM Disable Automatic Defragmentation
schtasks /Delete /TN "\Microsoft\Windows\Defrag\ScheduledDefrag" /F
REM Disable Pagefile
wmic computersystem set AutomaticManagedPagefile=FALSE
wmic pagefileset delete
REM Disable Hibernation
powercfg -h off

Higher Display Resolution

The default resolution is 800x600, but you can change it in the UEFI menu (press F2 or Esc during boot).

But first you will need vars-template-pflash.raw from the same edk package as earlier as UEFI will store its settings in there.
Add the following to your qemu args: -drive file=vars-template-pflash.raw,if=pflash,index=1

The display resolution can then be increased up to 1024x768 under Device Manager > OVMF Platform Configuration.

Wrapping up

With a bit of preparation it is possible to run Windows 10 ARM in a virtual machine. Although the emulation is somewhat slow you could feasibly use this to test one or two programs.

If you have ARM64 hardware with sufficient specs and KVM support, it should be possible to use -enable-kvm -cpu host for native execution speed, though I haven't had a chance to see how this performs yet.