Dealing with glibc faccessat2 breakage under systemd-nspawn
Backstory
The gist of it is that glibc began to make use of the new faccessat2 syscall, which, when running under an older systemd-nspawn, is filtered to return EPERM.
This misdirects glibc into assuming a file or folder cannot be accessed, when in reality nspawn just doesn't know the syscall.
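To make the failure mode concrete, here is a minimal Python model of a libc wrapper that prefers a newer syscall and falls back to an older one. It is only a sketch of the logic, not glibc's actual code: the function names and the callables standing in for syscalls are made up for illustration, and return values mimic raw syscall results (0 or a negative errno).

```python
import errno

def faccessat_with_fallback(new_syscall, old_syscall):
    """Model of a libc wrapper that prefers a newer syscall.

    Both arguments are hypothetical callables returning 0 on success
    or a negative errno value, mimicking raw syscall results.
    """
    ret = new_syscall()
    # Only ENOSYS means "this syscall is not known here";
    # every other result, including EPERM, is taken at face value.
    if ret != -errno.ENOSYS:
        return ret
    return old_syscall()

def working_old_syscall():
    return 0  # the file is accessible via the older syscall

# Unknown syscall reported as ENOSYS: the fallback runs and succeeds.
print(faccessat_with_fallback(lambda: -errno.ENOSYS, working_old_syscall))
# Unknown syscall reported as EPERM: no fallback, the error is returned.
print(faccessat_with_fallback(lambda: -errno.EPERM, working_old_syscall))
```

The second call is the nspawn situation: the older syscall would have worked fine, but it is never tried because EPERM looks like a genuine answer.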
A fix was submitted to systemd [1], but it turned out that the problem was not limited to nspawn: various container runtimes and related software needed fixing too [2] [3] [4] [5]. Hacking around it in glibc [6] or the kernel [7] was proposed, and both proposals were (rightfully) rejected immediately.
I pondered what an awful bug that was and was glad I didn't have to deal with this mess.
Fast forward to last week, I upgraded an Arch Linux installation I had running in a container.
Immediately after the update, pacman refused to work entirely, complaining it "could not find or read" /var/lib/pacman when this directory clearly existed (I checked).
A few minutes later (and after noticing the upgrade to glibc 2.33) it hit me that this was the exact bug I had read about months ago.
And, worse, that I'd have to deal with a lot more since I have multiple servers that run containers on systemd-nspawn.
Binary patching systemd-nspawn to fix the seccomp filter
Aside from the fact that upgrading something as central as systemd isn't exactly risk-free, I couldn't do that even if I wanted to: there is no backported systemd for Ubuntu 18.04 LTS.
Without further ado, here's a Python script that binary patches systemd-nspawn instead. I've tested that it performs the correct patch on Debian 10, Ubuntu 18.04 and Ubuntu 20.04. There are also plenty of safeguards, so it shouldn't break anything no matter what (no warranty though).
    #!/usr/bin/env python3
    import subprocess, re, os

    path = "/usr/bin/systemd-nspawn"
    print("Looking at %s" % path)
    proc = subprocess.Popen(["objdump", "-w", "-d", path], stdout=subprocess.PIPE, encoding="ascii")
    instr = list(m.groups() for m in (re.match(r'\s*([0-9a-f]+):\s*([0-9a-f ]+)\s{4,}(.+)', line) for line in proc.stdout) if m)
    if proc.wait() != 0:
        raise RuntimeError("objdump returned error")
    p_off, p_old, p_new = None, None, None
    for i, (addr, b, asm) in enumerate(instr):
        if asm.startswith("call") and "<seccomp_init_for_arch@" in asm:
            print("Found function call at 0x%s:\n %s%s" % (addr, b, asm))
            for addr, b, asm in instr[i-1:i-12:-1]:
                m = re.match(r'mov\s+\$0x([0-9a-f]+)\s*,\s*%edx', asm)
                if m:
                    print("Found argument at 0x%s:\n %s%s" % (addr, b, asm))
                    m = int(m.group(1), 16)
                    if m == 0x50026:
                        print("...but it's already patched, nothing to do.")
                        exit(0)
                    if m != 0x50001:
                        raise RuntimeError("unexpected value")
                    p_off, p_old = int(addr, 16), bytes.fromhex(b)
                    if len(p_old) != 5:
                        raise RuntimeError("unexpected instr length")
                    p_new = b"\xba\x26\x00\x05\x00"
                    break
                if re.search(r'%[re]?dx|^(call|pop|j[a-z])', asm):
                    break # likely went too far
            break
    if not p_off:
        raise RuntimeError("no patch location found")
    print("Patching %d bytes at %d from <%s> to <%s>" % (len(p_old), p_off, p_old.hex(), p_new.hex()))
    with open(path, "r+b") as f:
        if os.pread(f.fileno(), len(p_old), p_off) != p_old:
            raise RuntimeError("contents don't match")
        os.pwrite(f.fileno(), p_new, p_off)
    print("OK.")
Running the above script (as root) will attempt to locate certain related instructions in /usr/bin/systemd-nspawn, attempt to patch one of them and hopefully end with an output of "OK.".
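To illustrate how the script locates those instructions, the following sketch runs its two regular expressions against a single disassembly line. The sample line is hypothetical: it merely imitates the layout of `objdump -w -d` output, and the address is made up.

```python
import re

# Hypothetical line in the style of `objdump -w -d` output;
# the address is made up for illustration.
sample = "   1c3f5:\tba 01 00 05 00       \tmov    $0x50001,%edx"

# Same patterns the script uses: address, raw bytes, assembly text.
line_re = re.compile(r'\s*([0-9a-f]+):\s*([0-9a-f ]+)\s{4,}(.+)')
mov_re = re.compile(r'mov\s+\$0x([0-9a-f]+)\s*,\s*%edx')

addr, raw, asm = line_re.match(sample).groups()
insn_bytes = bytes.fromhex(raw)  # whitespace inside `raw` is ignored
value = int(mov_re.match(asm).group(1), 16)

print(hex(int(addr, 16)), insn_bytes.hex(), hex(value))
```

The script applies the first pattern to every line of the disassembly, then walks backwards from the call to seccomp_init_for_arch until the second pattern matches the instruction that loads the third argument.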
What does the binary patch change? Essentially it makes the following change to the compiled code of nspawn-seccomp.c:
     log_debug("Applying allow list on architecture: %s", seccomp_arch_to_string(arch));

    -r = seccomp_init_for_arch(&seccomp, arch, SCMP_ACT_ERRNO(EPERM));
    +r = seccomp_init_for_arch(&seccomp, arch, SCMP_ACT_ERRNO(ENOSYS));
     if (r < 0)
             return log_error_errno(r, "Failed to allocate seccomp object: %m");
Instead of EPERM, both blocked and unknown syscalls now return ENOSYS back to the libc.
This isn't ideal either (error handling code might get a bit confused), but it is more correct and allows glibc to not fail catastrophically upon attempting to use faccessat2.
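For the curious, the magic numbers in the script (0x50001, 0x50026 and the five replacement bytes) can be derived by hand. The sketch below assumes the libseccomp definition of SCMP_ACT_ERRNO(x) as 0x00050000 | (x & 0xffff) and the x86 encoding of a 32-bit immediate move into %edx (opcode 0xBA followed by the little-endian value); the helper names are mine.

```python
import struct

# Linux errno values, hard-coded so the result does not depend on the host OS.
EPERM, ENOSYS = 1, 38

def scmp_act_errno(err):
    # Mirrors libseccomp's SCMP_ACT_ERRNO(x): 0x00050000 | (x & 0xffff)
    return 0x00050000 | (err & 0xFFFF)

def mov_imm32_edx(value):
    # x86 encoding of `mov $imm32, %edx`: opcode 0xBA + little-endian imm32
    return struct.pack("<BI", 0xBA, value)

old = mov_imm32_edx(scmp_act_errno(EPERM))
new = mov_imm32_edx(scmp_act_errno(ENOSYS))
print(old.hex(), "->", new.hex())
```

Both instructions are five bytes long, which is why the patch can simply overwrite the old one in place.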
Unfortunately, the same change cannot [8] be applied to processes in already running containers; you have to restart them.
Further reading
"The inherent fragility of seccomp()": https://lwn.net/Articles/738694/
and the references below ↓