SELinux Policy for OpenShift Containers

While exploring Cilium on OpenShift, I ran into the problem below. This post is a summary of what is required for a container to run properly on OpenShift, where SELinux is turned on by default.

The Problem

After deploying Cilium with Hubble enabled, the hubble-relay pod cannot reach the Cilium agent's UNIX socket and keeps logging:

level=warning msg="Failed to create peer client for peers synchronization; will try again after the timeout has expired" error="context deadline exceeded" subsys=hubble-relay target="unix:///var/run/cilium/hubble.sock"

Simulated App

To simulate the hubble-relay scenario, here is a small Go echo server that listens on a UNIX socket:

package main

import (
	"io"
	"net"
	"os"

	"github.com/sirupsen/logrus"
)

func main() {
	// Resolve the socket path; default to /var/run/app.sock.
	socketFile := os.Getenv("APP_UNIX_SOCK")
	if socketFile == "" {
		socketFile = "/var/run/app.sock"
	}

	// Remove any stale socket file left over from a previous run.
	if err := os.RemoveAll(socketFile); err != nil {
		logrus.Fatalf("Failed to remove socket file: %v", err)
	}

	listener, err := net.Listen("unix", socketFile)
	if err != nil {
		logrus.Fatalf("Failed to listen on socket file: %s", err)
	}
	defer listener.Close()

	for {
		client, err := listener.Accept()
		if err != nil {
			logrus.Fatalf("Error on accept: %s", err)
		}
		logrus.WithFields(logrus.Fields{
			"local":  client.LocalAddr(),
			"remote": client.RemoteAddr(),
		}).Infof("client connected")

		// Echo everything the client sends back to the client.
		go func(c net.Conn) {
			defer c.Close()
			io.Copy(c, c)
		}(client)
	}
}

For the client app, we just use netcat to connect to the Unix socket.
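For example, from inside the client pod once everything is deployed (full demo later):

nc -U /var/run/app/app.sock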

Compile the app, build the container image, and push it into the OpenShift internal registry.
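A minimal sketch of the build-and-push steps, assuming a Dockerfile that compiles the Go app and that the internal registry's default route is exposed; the route hostname placeholder and tags are illustrative:

# Build locally; the Dockerfile is assumed to compile the Go app.
$ podman build -t selinux-test:v1.0 .
# Log in to the internal registry via its default route (hostname is an assumption).
$ podman login -u $(oc whoami) -p $(oc whoami -t) default-route-openshift-image-registry.apps.<cluster-domain>
$ podman tag selinux-test:v1.0 default-route-openshift-image-registry.apps.<cluster-domain>/selinux-test/selinux-test:v1.0
$ podman push default-route-openshift-image-registry.apps.<cluster-domain>/selinux-test/selinux-test:v1.0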

Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: server
  labels:
    app: server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: server
  template:
    metadata:
      labels:
        app: server
    spec:
      containers:
      - name: server
        image: image-registry.openshift-image-registry.svc:5000/selinux-test/selinux-test:v1.0
        env:
        - name: APP_UNIX_SOCK
          value: /var/run/app/app.sock
        volumeMounts:
        - name: unix-socket
          mountPath: /var/run/app
        securityContext:
          privileged: true
      volumes:
      - name: unix-socket
        hostPath:
          path: /var/run/app
          type: DirectoryOrCreate

The server app mounts a hostPath volume and creates the UNIX socket on it. Note that the container runs in privileged mode.

The client's deployment is shown below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: client
  labels:
    app: client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: client
  template:
    metadata:
      labels:
        app: client
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - server
            topologyKey: kubernetes.io/hostname
      containers:
      - name: client
        image: alpine
        command:
        - sh
        - -c
        - apk add netcat-openbsd; while true; do sleep 30; done
        volumeMounts:
        - name: unix-socket
          mountPath: /var/run/app
      volumes:
      - name: unix-socket
        hostPath:
          path: /var/run/app
          type: Directory

The pod is scheduled onto the same node as the server pod through the podAffinity rule. The UNIX socket on the node is mounted through the hostPath volume.

Notice the client is not running in privileged mode, which matches how Cilium is deployed: as a CNI plugin, the Cilium agent runs in privileged mode to access and manage the required resources on the host, while hubble-relay, being a client application, does not.

First, assign the privileged SCC to the default service account so that the privileged server pod is admitted. Assuming the project is named selinux-test:
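# project name assumed from the image path above
$ oc adm policy add-scc-to-user privileged -z default -n selinux-test

Apply the server and client deployment resources. The server pod runs successfully. Now let's exec into the pods: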

$ oc exec -it server-6c7794f66f-xwzq2 -- sh
/app # cd /var/run/app/
/run/app # ls -l
total 0
srwxr-xr-x 1 root root 0 Oct 2 04:12 app.sock

Sure enough, the server process created the UNIX socket. However, check the client pod,

$ oc exec -it client-7db86d679-282pf -- sh
/ # cd /var/run/app
/run/app # ls -l
ls: can't open '.': Permission denied
total 0

Though we are running as root, we cannot access the socket file. But why?

Who denied the access?

SSH into the node where the two pods run and watch the audit log:

$ ssh core@192.168.XX.XXX
$ sudo tail -f /var/log/audit/audit.log | grep AVC
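If the audit tooling is available on the node, ausearch can also pull the recent AVC records directly:

$ sudo ausearch -m avc -ts recent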

Now it is clear. Whenever we run ls -l in the client pod's exec shell, a denial is logged:

type=AVC msg=audit(1633158753.927:70800): avc:  denied  { read } for  pid=3514030 comm="ls" name="app" dev="tmpfs" ino=35452717 scontext=system_u:system_r:container_t:s0:c364,c644 tcontext=system_u:object_r:container_var_run_t:s0 tclass=dir permissive=0

The source context, scontext, belongs to the container_t process that is trying to access the resource. If we search for processes with that context,

$ ps -efZ | grep system_u:system_r:container_t:s0:c364,c644
system_u:system_r:container_t:s0:c364,c644 root 3067720 3067708 0 05:50 ? 00:00:00 /usr/bin/pod
system_u:system_r:container_t:s0:c364,c644 root 3067766 3067750 0 05:50 ? 00:00:00 sh -c while true; do sleep 30; done
system_u:system_r:container_t:s0:c364,c644 root 3497336 3497319 0 07:09 pts/0 00:00:00 sh
system_u:system_r:container_t:s0:c364,c644 root 3573040 3067766 0 07:23 ? 00:00:00 sleep 30

It is the client pod.

The target context, tcontext, is container_var_run_t. Run the following command to check the labels of the directory and the sock file:

$ sudo ls -laZ /var/run/app
total 0
drwxr-xr-x. 2 root root system_u:object_r:container_var_run_t:s0 60 Oct 2 04:12 .
drwxr-xr-x. 45 root root system_u:object_r:var_run_t:s0 1180 Sep 30 13:22 ..
srwxr-xr-x. 1 root root system_u:object_r:container_var_run_t:s0 0 Oct 2 04:12 app.sock

By default, an OpenShift container process runs with the container_t label, which can access files labeled container_file_t. For example, if we look up the container's root filesystem path from the runtime,

$ sudo crictl inspect 95b9c8d4d5330...
"root": {
    "path": "/var/lib/containers/storage/overlay/4103fc186497f834605dedff1baa7808dd99af20d92e0ed54b0b89f01774fc65/merged"
},
...

and run ls -lZ on that path:

$ sudo ls -lZ /var/lib/containers/storage/overlay/4103fc186497f834605dedff1baa7808dd99af20d92e0ed54b0b89f01774fc65/merged
total 8
drwxr-xr-x. 2 root root system_u:object_r:container_file_t:s0:c364,c644 4096 Aug 27 11:05 bin
drwxr-xr-x. 2 root root system_u:object_r:container_file_t:s0:c364,c644 6 Aug 27 11:05 dev
drwxr-xr-x. 1 root root system_u:object_r:container_file_t:s0:c364,c644 36 Oct 2 05:50 etc
drwxr-xr-x. 2 root root system_u:object_r:container_file_t:s0:c364,c644 6 Aug 27 11:05 home
...

All these files are accessible to the client container because the labels match. But app.sock and its directory carry the label system_u:object_r:container_var_run_t:s0, so the access is denied by SELinux.

On the other hand, the server container running in privileged mode carries the spc_t (super privileged container) label, which can access all resources on the host:

$ ps -efZ | grep spc_t
system_u:system_r:spc_t:s0      root     2529710 2529698  0 04:12 ?        00:00:00 ./serving

Fixing the issue

The quick fix is to relabel the directory and the socket file as container_file_t on the node:

$ sudo chcon --type container_file_t /var/run/app
$ sudo chcon --type container_file_t /var/run/app/app.sock

Check the file’s label again,

$ ls -laZ /var/run/app/
total 0
drwxr-xr-x. 2 root root system_u:object_r:container_file_t:s0 60 Oct 2 04:12 .
drwxr-xr-x. 45 root root system_u:object_r:var_run_t:s0 1180 Sep 30 13:22 ..
srwxr-xr-x. 1 root root system_u:object_r:container_file_t:s0 0 Oct 2 04:12 app.sock
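Note that chcon changes are not persistent: a filesystem relabel (e.g. restorecon against the default contexts) would reset the type. A sketch of a persistent alternative, assuming the semanage tool is available on the node, is to register a file-context rule and relabel:

# register a default file context for the directory and everything under it
$ sudo semanage fcontext -a -t container_file_t "/var/run/app(/.*)?"
$ sudo restorecon -Rv /var/run/app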

Now, going back to the client pod's exec shell, the client can access the directory and the socket file, and the echo works as expected:

/run/app # ls
app.sock
/run/app # nc -U app.sock
123
123
Echo this line
Echo this line

Create an SELinux Policy Package

Let's use the audit2allow tool to generate a policy module from the recorded denials as a type enforcement (TE) file:

$ sudo grep avc /var/log/audit/audit.log | audit2allow -m myapp | tee myapp.te

module myapp 1.0;

require {
    type container_var_run_t;
    type container_t;
    class dir read;
    class sock_file write;
}

#============= container_t ==============
allow container_t container_var_run_t:dir read;
allow container_t container_var_run_t:sock_file write;

Create a policy module from the TE file, package it into a policy package (pp) file, and load it into the kernel:

$ sudo checkmodule -M -m -o myapp.mod myapp.te
$ sudo semodule_package -o myapp.pp -m myapp.mod
$ sudo semodule -i myapp.pp
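To confirm the module is loaded:

$ sudo semodule -l | grep myapp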

Run the above commands on all worker nodes. The SELinux policy will then allow the server and client apps to run on any node.

The same applies to fixing the Cilium hubble-relay problem.
