Overview
For development purposes I'm creating a Docker Swarm/Kubernetes infrastructure. It will consist of two nodes running as full operating system LXC containers (running init). Once they are set up, I'll install Docker and Kubernetes in these nodes. The LXC containers will run Debian images (bookworm release).
There are plenty of tutorials on how to get LXC configured and running. However, I'm following the official Debian guide on LXC as it covers details specific to Debian and its kernel version/configuration.
The containers will run in unprivileged mode. This means that processes running as root inside a container are mapped to a regular user on the main host. This is achieved with subordinate user ids (man 5 subuid).
The containers' network interfaces will be "Virtual Ethernet Devices" (man 4 veth). LXC will use a dedicated lxcbr0 bridge interface and all container interfaces will be added to this bridge. I opt for static IP addressing, so this will be configured in dnsmasq (keyed on the containers' hardware (MAC) addresses).
LXC and configuration check
Installation:
# apt install lxc libvirt0 libpam-cgfs bridge-utils uidmap
Configuration check:
# lxc-checkconfig
Depending on the system configuration, the above command may report missing cgroups.
In my case I had to update the GRUB configuration and reboot.
File /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="systemd.unified_cgroup_hierarchy=0"
Then update grub and reboot:
# update-grub
# reboot
Once the system boots, check lxc-checkconfig again.
LXC networking
I'm configuring LXC to use a separate bridge interface. This is controlled via USE_LXC_BRIDGE in the /etc/default/lxc-net file.
USE_LXC_BRIDGE="true"
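The rest of /etc/default/lxc-net controls the bridge parameters. On my system the remaining defaults look roughly like this (a sketch; exact values may differ between lxc versions, but they match the 10.0.3.x addressing used below):
LXC_BRIDGE="lxcbr0"
LXC_ADDR="10.0.3.1"
LXC_NETMASK="255.255.255.0"
LXC_NETWORK="10.0.3.0/24"
LXC_DHCP_RANGE="10.0.3.2,10.0.3.254"
LXC_DHCP_MAX="253"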
The bridge comes up after restarting lxc-net.
# /etc/init.d/lxc-net restart
Restarting lxc-net (via systemctl): lxc-net.service.
The bridge interface should now be present.
# ifconfig lxcbr0
lxcbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 10.0.3.1 netmask 255.255.255.0 broadcast 10.0.3.255
ether 00:16:3e:00:00:00 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
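Since bridge-utils was installed earlier, bridge membership can also be inspected with brctl; once containers are running, their veth interfaces should be listed under lxcbr0:
# brctl show lxcbr0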
Configuring defaults
The system-wide LXC configuration resides in /etc/lxc/default.conf. In my case it is as follows:
lxc.net.0.type = veth
lxc.net.0.link = lxcbr0
lxc.net.0.flags = up
lxc.apparmor.profile = generated
lxc.apparmor.allow_nesting = 1
However, unprivileged containers are started as a regular user, which will have a custom (but similar) LXC configuration. For unprivileged containers the AppArmor profile has to be changed.
Unprivileged containers
Adding a dedicated user
I'm adding a new user (lxcuser). At the moment I'm not interested in logging in as this user, so I'm setting the shell to /usr/sbin/nologin. This user account will only be accessible to root via su with a shell argument.
# useradd -s /usr/sbin/nologin -d /home/lxcuser --create-home lxcuser
Upon creation of the new user account, new subordinate uid and gid ranges are added to the system. Check:
# grep lxcuser /etc/subuid
lxcuser:231072:65536
# grep lxcuser /etc/subgid
lxcuser:231072:65536
The specific numbers depend on other user accounts. When there are no other user accounts, the first number will usually be 100000. I have other accounts in the system, so a different id map was allocated.
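If no range gets allocated automatically, recent versions of usermod can add one by hand (a sketch; pick a range that does not overlap existing entries in /etc/subuid and /etc/subgid):
# usermod --add-subuids 231072-296607 --add-subgids 231072-296607 lxcuser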
Allow the regular user to create virtual interfaces
File /etc/lxc/lxc-usernet
:
# cat /etc/lxc/lxc-usernet
lxcuser veth lxcbr0 10
This allows lxcuser to create veth interfaces that will be added to lxcbr0 (at most 10 of them).
Unprivileged userns clone (sysctl)
If unprivileged user namespaces are not enabled, update sysctl.conf.
File /etc/sysctl.conf:
kernel.unprivileged_userns_clone=1
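The setting can be applied without a reboot by reloading /etc/sysctl.conf:
# sysctl -p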
LXC configuration (regular user)
Once the dedicated account has been created, switch to it using su.
# su - lxcuser -s /bin/bash
The custom user configuration for LXC resides in ~/.config/lxc/default.conf. It is very similar to the system-wide defaults, but the AppArmor profile is changed and idmaps are added.
lxc.net.0.type = veth
lxc.net.0.link = lxcbr0
lxc.net.0.flags = up
lxc.apparmor.profile = unconfined
lxc.apparmor.allow_nesting = 1
lxc.idmap = u 0 231072 65535
lxc.idmap = g 0 231072 65535
For the uid/gid maps use the ids from /etc/subuid and /etc/subgid, respectively.
Creating containers
Switch to the dedicated user account:
# su - lxcuser -s /bin/bash
Create a container. If the lxc-create command fails with an "Unable to fetch gpg key from keyserver" message, the GPG keyserver has to be configured. It can be set as an environment variable (DOWNLOAD_KEYSERVER) or the server address may be passed as an option.
$ lxc-create --template download \
--name node-a \
-- \
--dist debian --release bookworm -a amd64 \
--keyserver hkps://keyserver.ubuntu.com:443
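Alternatively, the keyserver can be supplied through the environment variable mentioned above; this should be equivalent:
$ DOWNLOAD_KEYSERVER="hkps://keyserver.ubuntu.com:443" lxc-create \
--template download --name node-a \
-- \
--dist debian --release bookworm -a amd64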
Start the container:
$ lxc-start node-a
Attach:
$ lxc-attach node-a
root@node-a:/#
root@node-a:/# grep NAME /etc/os-release
PRETTY_NAME="Debian GNU/Linux bookworm/sid"
NAME="Debian GNU/Linux"
Checking a container
This should be a full operating system container, so attach to it and confirm that init is running.
root@node-a:/# ps 1
PID TTY STAT TIME COMMAND
1 ? Ss 0:00 /sbin/init
Processes run by root in a container should be mapped to lxcuser on the main host. Execute a command in a container and check the ps output on the main host.
Container:
root@node-a:/# sleep infinity
Main host:
# ps waux | grep infinity
231072 19926 0.0 0.0 5416 676 pts/4 S+ 15:03 0:00 sleep infinity
Indeed, although the process in a container runs as root, the process on the main host belongs to a regular user.
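The mapping can also be read straight from the kernel: given the idmap configured earlier, /proc/1/uid_map inside the container should show uid 0 mapped to the start of the subordinate range.
root@node-a:/# cat /proc/1/uid_map
         0     231072      65535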
Container networking: DHCP, Firewall, NAT
If IPv6 support is not needed in a container, it can be turned off:
(root@container) # echo 'net.ipv6.conf.all.disable_ipv6=1' >> /etc/sysctl.conf
(root@container) # echo 'net.ipv6.conf.default.disable_ipv6=1' >> /etc/sysctl.conf
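Reload and verify (the sysctl should read back as 1):
(root@container) # sysctl -p
(root@container) # cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1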
DHCP
By default, container IP addresses are assigned by DHCP.
DHCP is served by dnsmasq on the main host. Static container IPs can be configured as well.
I choose to set MAC addresses for individual containers and add corresponding entries in dnsmasq.
Example: setting the MAC address for the node-a container.
File ~/.local/share/lxc/node-a/config:
lxc.net.0.hwaddr = 00:00:00:00:00:0a
Entries in dnsmasq (file /etc/dnsmasq.conf, main host):
domain-needed
bogus-priv
except-interface=wlan0
expand-hosts
dhcp-range=lxc,10.0.3.100,10.0.3.200,12h
dhcp-option=lxc,option:router,10.0.3.1
dhcp-host=00:00:00:00:00:0a,node-a,10.0.3.100
dhcp-host=00:00:00:00:00:0b,node-b,10.0.3.101
log-queries
log-dhcp
conf-dir=/etc/dnsmasq.d/,*.conf
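dnsmasq only reads its configuration on startup, so restart it on the main host (assuming the stock systemd service) before restarting the container:
# systemctl restart dnsmasq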
Restart the container:
$ lxc-stop node-a
$ lxc-start node-a
Check:
$ lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6 UNPRIVILEGED
node-a RUNNING 0 - 10.0.3.100 - true
NAT
A permissive iptables setup (main host):
iptables -t filter -A INPUT -i lxcbr0 -j ACCEPT
iptables -t filter -A FORWARD -i lxcbr0 -j ACCEPT
iptables -t filter -A FORWARD -o lxcbr0 -j ACCEPT
iptables -t filter -A OUTPUT -o lxcbr0 -j ACCEPT
iptables -t nat -A POSTROUTING -s 10.0.3.0/24 ! -d 10.0.3.0/24 -j MASQUERADE
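Note that masquerading also relies on IP forwarding (net.ipv4.ip_forward=1), which lxc-net normally enables. The rules above do not survive a reboot; one way to persist them, assuming the iptables-persistent package, is:
# apt install iptables-persistent
# iptables-save > /etc/iptables/rules.v4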
Create another container
$ lxc-create --template download --name node-b \
-- \
--dist debian --release bookworm -a amd64 \
--keyserver hkps://keyserver.ubuntu.com:443
Set the hardware address for the container's ethernet device (in ~/.local/share/lxc/node-b/config):
[...]
# Network configuration
lxc.net.0.type = veth
lxc.net.0.link = lxcbr0
lxc.net.0.hwaddr = 00:00:00:00:00:0b
lxc.net.0.flags = up
[...]
Start and inspect the container.
$ lxc-start node-b
$ lxc-ls -f
NAME STATE AUTOSTART GROUPS IPV4 IPV6 UNPRIVILEGED
node-a RUNNING 0 - 10.0.3.100 - true
node-b RUNNING 0 - 10.0.3.101 - true
If the MAC address were different from the one configured in dnsmasq, the container would be assigned a different IP.
DNS
DNS for the containers is handled by dnsmasq, so the nodes should be able to resolve each other's hostnames.
Ping test: node-a -> node-b.
root@node-a:/# ping node-b -c 1
PING node-b (10.0.3.101) 56(84) bytes of data.
64 bytes from node-b (10.0.3.101): icmp_seq=1 ttl=64 time=0.128 ms
--- node-b ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.128/0.128/0.128/0.000 ms
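For completeness, the container's resolver should point at the dnsmasq instance on the bridge address, handed out via DHCP; /etc/resolv.conf inside a container should contain something like:
root@node-a:/# cat /etc/resolv.conf
nameserver 10.0.3.1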