Dealing with glibc faccessat2 breakage under systemd-nspawn

Backstory

A few months ago I stumbled upon this report on Red Hat's bugzilla.

The gist of it is that glibc began to make use of the new faccessat2 syscall, which when running under older systemd-nspawn is filtered to return EPERM. This misdirects glibc into assuming a file or folder cannot be accessed, when in reality nspawn just doesn't know the syscall.
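
The failure mode can be illustrated with a small model. This is a simplified sketch, not glibc's actual code: access_check and the two callables passed to it are hypothetical stand-ins for the libc wrapper and the two syscalls, and they use the raw syscall convention of returning 0 or a negative errno value. The point is that only ENOSYS triggers the fallback to the older syscall, so an EPERM from the filter is passed straight to the caller.

```python
import errno

def access_check(faccessat2, faccessat_legacy):
    """Simplified model of a libc wrapper that prefers a newer syscall.

    Both arguments are hypothetical callables returning 0 on success or a
    negative errno value, mimicking the raw syscall return convention.
    """
    ret = faccessat2()
    if ret == -errno.ENOSYS:
        # "No such syscall": fall back to the older implementation.
        return faccessat_legacy()
    # Any other result, including -EPERM, is taken at face value.
    return ret

def legacy_ok():
    return 0  # the directory exists and is accessible

# Old nspawn filter: the unknown syscall is answered with EPERM.
print(access_check(lambda: -errno.EPERM, legacy_ok))   # -1, i.e. "permission denied"
# A filter answering ENOSYS: the fallback runs and succeeds.
print(access_check(lambda: -errno.ENOSYS, legacy_ok))  # 0
```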

A fix was submitted to systemd [1], but it turned out that the problem wasn't limited to nspawn: it also had to be fixed in various container runtimes and related software [2] [3] [4] [5]. Hacking around it in glibc [6] or the kernel [7] was proposed, and both proposals were (rightfully) rejected immediately.

I pondered what an awful bug that was and was glad I didn't have to deal with this mess.


Fast forward to last week: I upgraded an Arch Linux installation I had running in a container. Immediately after the update, pacman refused to work entirely, complaining it "could not find or read" /var/lib/pacman when this directory clearly existed (I checked).

A few minutes later (and after noticing the upgrade to glibc 2.33) it hit me that this was the exact bug I had read about months ago. Worse, I'd have to deal with it many more times, since I have multiple servers that run containers on systemd-nspawn.

Binary patching systemd-nspawn to fix the seccomp filter

If you hit this bug with one of your containers you have exactly one option: upgrade systemd on the host system to v247 or later.

Aside from the fact that upgrading something as central as systemd isn't exactly risk free, I couldn't do that even if I wanted to: there is no backported systemd for Ubuntu 18.04 LTS.

This calls for another option: Patching systemd yourself to fix the bug.

Without further ado, here's a Python script doing exactly that. I've tested that it performs the correct patch on Debian 10, Ubuntu 18.04 and Ubuntu 20.04. There are also plenty of safeguards so that it shouldn't break anything no matter what (no warranty though).

#!/usr/bin/env python3
import subprocess, re, os, sys
path = "/usr/bin/systemd-nspawn"
print("Looking at %s" % path)
# Disassemble the binary; every instruction becomes (address, hex bytes, assembly text)
proc = subprocess.Popen(["objdump", "-w", "-d", path], stdout=subprocess.PIPE, encoding="ascii")
instr = list(m.groups() for m in (re.match(r'\s*([0-9a-f]+):\s*([0-9a-f ]+)\s{4,}(.+)', line) for line in proc.stdout) if m)
if proc.wait() != 0: raise RuntimeError("objdump returned error")
p_off, p_old, p_new = None, None, None
for i, (addr, b, asm) in enumerate(instr):
        if asm.startswith("call") and "<seccomp_init_for_arch@" in asm:
                print("Found function call at 0x%s:\n  %s%s" % (addr, b, asm))
                # Walk backwards over up to 11 preceding instructions, looking for the
                # one that loads the third argument (the default action) into %edx
                for addr, b, asm in reversed(instr[max(0, i-11):i]):
                        m = re.match(r'mov\s+\$0x([0-9a-f]+)\s*,\s*%edx', asm)
                        if m:
                                print("Found argument at 0x%s:\n  %s%s" % (addr, b, asm))
                                value = int(m.group(1), 16)
                                # 0x50026 == SCMP_ACT_ERRNO(ENOSYS), 0x50001 == SCMP_ACT_ERRNO(EPERM)
                                if value == 0x50026:
                                        print("...but it's already patched, nothing to do.")
                                        sys.exit(0)
                                if value != 0x50001: raise RuntimeError("unexpected value")
                                p_off, p_old = int(addr, 16), bytes.fromhex(b)
                                if len(p_old) != 5: raise RuntimeError("unexpected instr length")
                                p_new = b"\xba\x26\x00\x05\x00" # mov $0x50026,%edx
                                break
                        if re.search(r'%[re]?dx|^(call|pop|j[a-z])', asm): break # likely went too far
                break
if p_off is None: raise RuntimeError("no patch location found")
print("Patching %d bytes at %d from <%s> to <%s>" % (len(p_old), p_off, p_old.hex(), p_new.hex()))
# The address printed by objdump is used as the file offset; if the two differ,
# the comparison below fails and nothing is written
with open(path, "r+b") as f:
        if os.pread(f.fileno(), len(p_old), p_off) != p_old: raise RuntimeError("contents don't match")
        os.pwrite(f.fileno(), p_new, p_off)
print("OK.")

Running the above script (as root) will attempt to locate certain related instructions in /usr/bin/systemd-nspawn, attempt to patch one of them and hopefully end with an output of "OK.".
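
The last safeguard in the script, comparing the bytes on disk with the expected instruction before overwriting them, can be tried out without touching a real binary. The following is a minimal sketch under that assumption: patch_bytes is a hypothetical helper (not part of the script above), and the scratch file contains made-up padding around the five instruction bytes.

```python
import os
import tempfile

def patch_bytes(path, offset, old, new):
    """Overwrite `old` with `new` at `offset`, but only if `old` is really there."""
    if len(old) != len(new):
        raise ValueError("replacement must have the same length")
    with open(path, "r+b") as f:
        if os.pread(f.fileno(), len(old), offset) != old:
            raise RuntimeError("contents don't match")
        os.pwrite(f.fileno(), new, offset)

# Scratch file standing in for the real binary (the 0x90 padding is made up).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x90\x90" + b"\xba\x01\x00\x05\x00" + b"\x90\x90")
    scratch = tmp.name

patch_bytes(scratch, 2, b"\xba\x01\x00\x05\x00", b"\xba\x26\x00\x05\x00")
with open(scratch, "rb") as f:
    patched = f.read()
os.unlink(scratch)
print(patched.hex())  # 9090ba260005009090
```

Calling patch_bytes a second time with the same arguments would raise "contents don't match", because the old bytes are no longer present at that offset.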

What does the binary patch change? Essentially it makes the following change to the compiled code of nspawn-seccomp.c:

 log_debug("Applying allow list on architecture: %s", seccomp_arch_to_string(arch));

-r = seccomp_init_for_arch(&seccomp, arch, SCMP_ACT_ERRNO(EPERM));
+r = seccomp_init_for_arch(&seccomp, arch, SCMP_ACT_ERRNO(ENOSYS));
 if (r < 0)
         return log_error_errno(r, "Failed to allocate seccomp object: %m");

Instead of EPERM, both blocked and unknown syscalls now return ENOSYS back to the libc. This isn't ideal either (error handling code might get a bit confused) but it is more correct and allows glibc to not catastrophically fail upon attempting to use faccessat2.
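
The constants 0x50001 and 0x50026 in the script, as well as the five replacement bytes, follow directly from this source change. The sketch below recomputes them, assuming the x86 Linux errno values (EPERM = 1, ENOSYS = 38) and libseccomp's definition of SCMP_ACT_ERRNO(x) as 0x00050000 | (x & 0xffff); the two helper functions are named for illustration only.

```python
import struct

EPERM, ENOSYS = 1, 38  # errno values on x86 Linux

def scmp_act_errno(err):
    # Mirrors libseccomp's SCMP_ACT_ERRNO(x) macro
    return 0x00050000 | (err & 0xFFFF)

def mov_imm32_edx(value):
    # x86 "mov $imm32,%edx": opcode 0xBA followed by a little-endian imm32
    return b"\xba" + struct.pack("<I", value)

print(hex(scmp_act_errno(EPERM)), mov_imm32_edx(scmp_act_errno(EPERM)).hex())
# 0x50001 ba01000500
print(hex(scmp_act_errno(ENOSYS)), mov_imm32_edx(scmp_act_errno(ENOSYS)).hex())
# 0x50026 ba26000500
```

Both instructions are five bytes long, which is why the patch can be done in place without shifting any other code.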

Unfortunately the same change cannot [8] be applied to processes in already running containers; you have to restart them.
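
To see which answer a container currently gets, the syscall can be probed directly from inside it. This is a rough sketch, not a definitive test: probe_faccessat2 is a hypothetical helper, the syscall number 439 is the x86-64 value (treat it as an assumption on other architectures), and the call goes through libc's generic syscall() function via ctypes.

```python
import ctypes
import errno
import os
import sys

SYS_faccessat2 = 439  # x86-64 syscall number; an assumption elsewhere
AT_FDCWD = -100       # "relative to the current directory"

def probe_faccessat2(path=b"/"):
    """Return "ok", or the errno name the raw faccessat2 syscall fails with."""
    if not sys.platform.startswith("linux"):
        return "unsupported"
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    ctypes.set_errno(0)
    ret = libc.syscall(ctypes.c_long(SYS_faccessat2), ctypes.c_long(AT_FDCWD),
                       ctypes.c_char_p(path), ctypes.c_long(os.F_OK),
                       ctypes.c_long(0))
    if ret == 0:
        return "ok"
    err = ctypes.get_errno()
    return errno.errorcode.get(err, str(err))

print(probe_faccessat2())
```

A result of EPERM for a path that obviously exists most likely points to the unpatched filter. ENOSYS means glibc's fallback will kick in, and "ok" means the syscall is let through entirely.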

Further reading