Overview
For development purposes I'm creating a Docker Swarm/Kubernetes infrastructure. It will consist of two nodes running as full operating system LXC containers (running init). Once they are set up, I'll install Docker and Kubernetes in these nodes. The LXC containers will run Debian images (bookworm release).
There are plenty of tutorials on how to get LXC configured and running. However, I'm following the official Debian guide on LXC as it covers details specific to Debian and its kernel version/configuration.
The containers will run in unprivileged mode. This means that processes running as root inside a container are mapped to a regular user on the main host. This is achieved with subordinate user ids (man 5 subuid).
The containers' network interfaces will be "Virtual Ethernet Devices" (man 4 veth). LXC will use a dedicated lxcbr0 bridge interface and all container interfaces will be added to this bridge. I opt for static IP addressing, so this will be configured in dnsmasq (keyed on the containers' hardware (MAC) addresses).
LXC and configuration check
Installation:
# apt install lxc libvirt0 libpam-cgfs bridge-utils uidmap
Configuration check:
# lxc-checkconfig
Depending on the system configuration, the above command may report missing cgroups.
In my case I had to update the GRUB configuration and reboot.
File /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="systemd.unified_cgroup_hierarchy=0"
Then update grub and reboot:
# update-grub
# reboot
Once the system boots, check lxc-checkconfig again.
LXC networking
I'm configuring LXC to use a separate bridge interface. This is controlled via USE_LXC_BRIDGE in the /etc/default/lxc-net file.
USE_LXC_BRIDGE="true"
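The rest of /etc/default/lxc-net controls the bridge parameters. On my system the remaining defaults look roughly like this (a sketch; exact values may differ between lxc versions, but they match the 10.0.3.x addressing used below):
LXC_BRIDGE="lxcbr0"
LXC_ADDR="10.0.3.1"
LXC_NETMASK="255.255.255.0"
LXC_NETWORK="10.0.3.0/24"
LXC_DHCP_RANGE="10.0.3.2,10.0.3.254"
LXC_DHCP_MAX="253"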
The bridge comes up after restarting lxc-net.
# /etc/init.d/lxc-net restart
Restarting lxc-net (via systemctl): lxc-net.service.
The bridge interface should now be present.
# ifconfig lxcbr0
lxcbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 10.0.3.1 netmask 255.255.255.0 broadcast 10.0.3.255
ether 00:16:3e:00:00:00 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
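Since bridge-utils was installed earlier, bridge membership can also be inspected with brctl; once containers are running, their veth interfaces should be listed under lxcbr0:
# brctl show lxcbr0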
Configuring defaults
The system-wide LXC configuration resides in /etc/lxc/default.conf. In my case it is as follows:
lxc.net.0.type = veth
lxc.net.0.link = lxcbr0
lxc.net.0.flags = up
lxc.apparmor.profile = generated
lxc.apparmor.allow_nesting = 1
However, unprivileged containers are started as a regular user, which will have a custom (but similar) LXC configuration. For unprivileged containers the AppArmor profile has to be changed.
Unprivileged containers
Adding a dedicated user
I'm adding a new user (lxcuser). At the moment I'm not interested in logging in as this user, so I'm setting the shell to /usr/sbin/nologin. This user account will only be accessible to root via su with a shell argument.
# useradd -s /usr/sbin/nologin -d /home/lxcuser --create-home lxcuser
Upon creation of the new user account, new subordinate uid and gid ranges are added to the system. Check:
# grep lxcuser /etc/subuid
lxcuser:231072:65536
# grep lxcuser /etc/subgid
lxcuser:231072:65536
The specific numbers depend on other user accounts. When there are no other user accounts, the first number will usually be 100000. I have other accounts in the system, so a different id map was allocated.
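If no range gets allocated automatically, recent versions of usermod can add one by hand (a sketch; pick a range that does not overlap existing entries in /etc/subuid and /etc/subgid):
# usermod --add-subuids 231072-296607 --add-subgids 231072-296607 lxcuser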
Allow the regular user to create virtual interfaces
File /etc/lxc/lxc-usernet
:
# cat /etc/lxc/lxc-usernet
lxcuser veth lxcbr0 10
This allows lxcuser to create veth interfaces that will be added to lxcbr0 (at most 10 of them).
Unprivileged userns clone (sysctl)
If unprivileged user namespaces are not enabled, update sysctl.conf.
File /etc/sysctl.conf:
kernel.unprivileged_userns_clone=1
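The setting can be applied without a reboot by reloading /etc/sysctl.conf:
# sysctl -p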
LXC configuration (regular user)
Once the dedicated account has been created, switch to it using su.
# su - lxcuser -s /bin/bash
The custom user configuration for LXC resides in ~/.config/lxc/default.conf. It is very similar to the system-wide defaults, but the AppArmor profile is changed and idmaps are added.
lxc.net.0.type = veth
lxc.net.0.link = lxcbr0
lxc.net.0.flags = up
lxc.apparmor.profile = unconfined
lxc.apparmor.allow_nesting = 1
lxc.idmap = u 0 231072 65535
lxc.idmap = g 0 231072 65535
For the uid/gid maps use the ids from /etc/subuid and /etc/subgid, respectively.
Creating containers
Switch to the dedicated user account:
# su - lxcuser -s /bin/bash
Create a container. If the lxc-create command fails with an "Unable to fetch gpg key from keyserver" message, the GPG keyserver has to be configured. It can be set as an environment variable (DOWNLOAD_KEYSERVER) or the server address may be passed as an option.
$ lxc-create --template download \
--name node-a \
-- \
--dist debian --release bookworm -a amd64 \
--keyserver hkps://keyserver.ubuntu.com:443
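Alternatively, the keyserver can be supplied through the environment variable mentioned above; this should be equivalent:
$ DOWNLOAD_KEYSERVER="hkps://keyserver.ubuntu.com:443" lxc-create \
--template download --name node-a \
-- \
--dist debian --release bookworm -a amd64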
Start the container:
$ lxc-start node-a
Attach:
$ lxc-attach node-a
root@node-a:/#
root@node-a:/# grep NAME /etc/os-release
PRETTY_NAME="Debian GNU/Linux bookworm/sid"
NAME="Debian GNU/Linux"
Checking a container
This should be a full operating system container, so attach to it and confirm that init is running.
root@node-a:/# ps 1
PID TTY STAT TIME COMMAND
1 ? Ss 0:00 /sbin/init
Processes run by root in a container should be mapped to lxcuser on the main host. Execute a command in a container and check the ps output on the main host.
Container:
root@node-a:/# sleep infinity
Main host:
# ps waux | grep infinity
231072 19926 0.0 0.0 5416 676 pts/4 S+ 15:03 0:00 sleep infinity
Indeed, although the process in a container runs as root, the process on the main host belongs to a regular user.
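The mapping can also be read straight from the kernel: given the idmap configured earlier, /proc/1/uid_map inside the container should show uid 0 mapped to the start of the subordinate range.
root@node-a:/# cat /proc/1/uid_map
         0     231072      65535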
Container networking: DHCP, Firewall, NAT
If IPv6 support is not needed in a container, it can be turned off:
(root@container) # echo 'net.ipv6.conf.all.disable_ipv6=1' >> /etc/sysctl.conf
(root@container) # echo 'net.ipv6.conf.default.disable_ipv6=1' >> /etc/sysctl.conf
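Reload and verify (the sysctl should read back as 1):
(root@container) # sysctl -p
(root@container) # cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1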
DHCP
By default, container IP addresses are assigned by DHCP.
DHCP is served by dnsmasq on the main host. Static container IPs can be configured as well.
I choose to set MAC addresses for individual containers and add corresponding entries in dnsmasq.
Example: setting the MAC address for the node-a container.
File ~/.local/share/lxc/node-a/config:
lxc.net.0.hwaddr = 00:00:00:00:00:0a
Entries in dnsmasq (file /etc/dnsmasq.conf, main host):
domain-needed
bogus-priv
except-interface=wlan0
expand-hosts
dhcp-range=lxc,10.0.3.100,10.0.3.200,12h
dhcp-option=lxc,option:router,10.0.3.1
dhcp-host=00:00:00:00:00:0a,node-a,10.0.3.100
dhcp-host=00:00:00:00:00:0b,node-b,10.0.3.101
log-queries
log-dhcp
conf-dir=/etc/dnsmasq.d/,*.conf
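dnsmasq only reads its configuration on startup, so restart it on the main host (assuming the stock systemd service) before restarting the container:
# systemctl restart dnsmasq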
Restart the container:
$ lxc-stop node-a
$ lxc-start node-a
Check:
$ lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6 UNPRIVILEGED
node-a RUNNING 0 - 10.0.3.100 - true
NAT
A permissive iptables setup (main host):
iptables -t filter -A INPUT -i lxcbr0 -j ACCEPT
iptables -t filter -A FORWARD -i lxcbr0 -j ACCEPT
iptables -t filter -A FORWARD -o lxcbr0 -j ACCEPT
iptables -t filter -A OUTPUT -o lxcbr0 -j ACCEPT
iptables -t nat -A POSTROUTING -s 10.0.3.0/24 ! -d 10.0.3.0/24 -j MASQUERADE
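Note that masquerading also relies on IP forwarding (net.ipv4.ip_forward=1), which lxc-net normally enables. The rules above do not survive a reboot; one way to persist them, assuming the iptables-persistent package, is:
# apt install iptables-persistent
# iptables-save > /etc/iptables/rules.v4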
Create another container
$ lxc-create --template download --name node-b \
-- \
--dist debian --release bookworm -a amd64 \
--keyserver hkps://keyserver.ubuntu.com:443
Set the hardware address for the container's ethernet device (in ~/.local/share/lxc/node-b/config):
[...]
# Network configuration
lxc.net.0.type = veth
lxc.net.0.link = lxcbr0
lxc.net.0.hwaddr = 00:00:00:00:00:0b
lxc.net.0.flags = up
[...]
Start and inspect the container.
$ lxc-start node-b
$ lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6 UNPRIVILEGED
node-a RUNNING 0 - 10.0.3.100 - true
node-b RUNNING 0 - 10.0.3.101 - true
If the MAC address were different from the one configured in dnsmasq, the container would be assigned a different IP.
DNS
DNS for the containers is handled by dnsmasq, so the nodes should be able to resolve each other's hostnames.
Ping test: node-a -> node-b.
root@node-a:/# ping node-b -c 1
PING node-b (10.0.3.101) 56(84) bytes of data.
64 bytes from node-b (10.0.3.101): icmp_seq=1 ttl=64 time=0.128 ms
--- node-b ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.128/0.128/0.128/0.000 ms
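For completeness, the container's resolver should point at the dnsmasq instance on the bridge address, handed out via DHCP; /etc/resolv.conf inside a container should contain something like:
root@node-a:/# cat /etc/resolv.conf
nameserver 10.0.3.1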