Containers Under The Hood

Moshe Beladev
Jun 6, 2020

The container ecosystem is surrounded by huge hype: almost every infrastructure contains a sleek Kubernetes cluster, or at least some Docker containers running in the wild.

I think containers are a great tool for deploying your software without worrying about issues like missing dependencies and service coordination. But have you ever wondered what's under the hood?

Let’s take a look at the following command:

Curiously, we got a new shell on something that looks like another machine (believe me, it’s not). Even more interesting, when running ps we can see only our process, and it appears with PID 1, which is usually assigned to init/systemd.
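The original snippet is not reproduced here, but the experiment can be re-created with a short Go program. This is a sketch, assuming a Linux kernel with unprivileged user namespaces enabled; the flags come from the standard syscall package:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"syscall"
)

// runInNewPIDNamespace runs a shell command in fresh user, UTS, PID, and
// mount namespaces and returns its trimmed output. The user namespace is
// what lets this work without root on most modern kernels.
func runInNewPIDNamespace(command string) (string, error) {
	cmd := exec.Command("sh", "-c", command)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUSER | syscall.CLONE_NEWUTS |
			syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
		UidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: os.Getuid(), Size: 1}},
		GidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: os.Getgid(), Size: 1}},
	}
	out, err := cmd.Output()
	return strings.TrimSpace(string(out)), err
}

func main() {
	// $$ expands to the shell's own PID; inside the new PID namespace it is 1,
	// because the shell is the first process created there.
	pid, err := runInNewPIDNamespace("echo $$")
	if err != nil {
		fmt.Fprintln(os.Stderr, "clone failed (user namespaces may be disabled):", err)
		return
	}
	fmt.Println("PID inside the namespace:", pid)
}
```

The shell believes it is PID 1 for the same reason our container shell did: it is the first process in its own PID namespace.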

File System:

One of the great advantages of containers is the ease and consistency of your deployment. No more missing libraries or other dependencies — you get your application batteries included.

The Docker image is just a bundle that includes a filesystem plus some metadata. All we have to do is manipulate the process’s view of the file system.

There are two techniques for achieving this part: chroot and pivot_root.
chroot is a syscall that changes the root directory of a single process, whereas pivot_root changes the root directory for all the processes in our mount namespace (if the term "namespace" scares you, don't worry, we will cover it soon).

One of the reasons pivot_root is preferred over chroot is that the latter is easier to escape from (mostly because the old root is still accessible), while with pivot_root you can unmount the old root entirely.

When we run multiple container instances we don’t really need to clone the whole image, which can be very expensive. This is where CoW (Copy-On-Write) and the layered file system come to the rescue.

The CoW concept is very simple: all instances share the same common layer, and once one of them needs to make a change, it gets its own copy and can do whatever it wants there without bothering the other instances. So instead of copying the whole image upon a change, we keep a common read-only branch and write the new changes to a separate layer (this is the basic idea; you can read more about overlayfs, which is now part of the mainline kernel).
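An overlay mount can be sketched from Go like this. The directory names are hypothetical and the mount call itself needs root; the interesting part is the option string, which names the read-only lower layer, the writable upper layer, and a work directory overlayfs uses for atomic copy-up:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// overlayOpts builds the option string overlayfs expects.
func overlayOpts(lower, upper, work string) string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", lower, upper, work)
}

func main() {
	// Hypothetical layer directories for the demo.
	for _, d := range []string{"/tmp/lower", "/tmp/upper", "/tmp/work", "/tmp/merged"} {
		os.MkdirAll(d, 0755)
	}
	opts := overlayOpts("/tmp/lower", "/tmp/upper", "/tmp/work")
	// Writes under /tmp/merged land in the upper layer only; the lower layer
	// stays read-only and shared between instances -- the copy-on-write trick.
	if err := syscall.Mount("overlay", "/tmp/merged", "overlay", 0, opts); err != nil {
		fmt.Fprintln(os.Stderr, "mount failed (root is required):", err)
	}
}
```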

Another important concept in terms of the file system is mounts, or more specifically “bind mount” which basically maps between two inodes in the VFS (Virtual File System).

There are a few modes for this mount type, differentiated by the way mount events propagate between the two points we link:
- shared: all the replicas are fully synchronized
- slave: only mount and umount events are passed in one direction
- private: no propagations will be received or forwarded
- unbindable: like private + other bindings are forbidden

This mount type is really useful when you want to bring external configuration or volumes into the container (e.g., /tmp as writable storage).

Cool, so we know how the dependencies are handled. But, why don’t we see other processes when running ps? That is where Linux Namespaces fit in.

Namespaces are a Linux feature that lets us isolate our process and create a restricted view of system resources like network interfaces, mounts, and processes. This isolation is good not only for security and access control, but also for making sure we won’t collide with other applications or services on the same host, and for getting a consistent point of view.

There are some syscalls that help manage those namespaces:
- clone: creates a new process inside new namespaces
- unshare: moves the calling process into new namespaces
- setns: joins an existing namespace, given a namespace file descriptor

Here is a list of some Linux namespaces you should be familiar with:
- Mount: filesystem mount points
- UTS: host and domain names
- IPC: interprocess communication resources
- PID: processes tree
- Network: network interfaces
- User: UID/GID

We can list our process’s namespaces, which are accessible under /proc/&lt;pid&gt;/ns.
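Reading those entries needs no privileges, since they are just symlinks. A small Go helper that prints them:

```go
package main

import (
	"fmt"
	"os"
)

// listNamespaces returns the namespace links of a process, e.g.
// "pid:[4026531836]". Two processes with the same number share that namespace.
func listNamespaces(pid string) ([]string, error) {
	dir := "/proc/" + pid + "/ns"
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, e := range entries {
		target, err := os.Readlink(dir + "/" + e.Name())
		if err != nil {
			continue // some entries may be restricted
		}
		out = append(out, target)
	}
	return out, nil
}

func main() {
	ns, err := listNamespaces("self")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	for _, n := range ns {
		fmt.Println(n)
	}
}
```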

So, we have pretty good isolation (well, not really) with those namespaces. But what happens if one process starts to hog the RAM? Its neighboring containers will probably not like it.

Neighboring containers on the same host can be really annoying without good isolation

It’s time to meet Control Groups (aka cgroups) — a Linux mechanism that helps us manage system resources like CPU, memory, and network by setting quotas and access control.

In order to add a process to a group (or more generally a task, which can also be a thread), you just need to write its ID to the relevant tasks file:

/sys/fs/cgroup/<resource-type>/<user-class>/tasks

and from now on, the cgroups configuration will be applied to your process.

Pay attention that when you create a directory inside /sys/fs/cgroup, the kernel will auto-fill it with default values.

For example, if you want to change the maximum amount of RAM a process can consume, simply write the desired limit to /sys/fs/cgroup/memory/&lt;name&gt;/memory.limit_in_bytes and that’s it; the cgroups mechanism takes it from there.

A nice way to get our process’s memory usage; it can be very useful for monitoring and profiling

A few words about security. By now it should be clear that containers and virtual machines are two different things: containers share the host’s kernel, which makes it easier to break out. I’ll briefly go over some of the common tools and technologies that try to make an attacker’s life harder:
- AppArmor/SELinux: both are LSM (Linux Security Module) implementations that provide MAC (Mandatory Access Control), used to allow or deny access to resources
- Seccomp: lets us install syscall filters
- Capabilities: break up the monolithic root privilege, reducing the risk that compromising a single process hands an attacker full control

Let’s Get Our Hands Dirty

After discussing the theory, let’s implement a minimal container runtime using the concepts we just covered:

First, we need to make sure we don’t see the host process anymore:

Now, let’s check whether our memory limit (~100MB) is enforced:

We can see that our process was killed when it tried to allocate more memory than it was allowed to.

Voilà! We got our minimal container up and running, functioning the way we want.

A few things to mention:

  • The code is written in Go but can be implemented in many other ways.
  • You should NOT implement your own container runtime for real use. This example works, but we took some shortcuts to keep the code clear and simple (for example, we wrote the cgroups attributes from within the container process itself, which means the process can change them to exceed its limits, and we used chroot instead of pivot_root).

I hope you now have a better understanding of what a container is and how it works internally. There is a lot more to cover, but I think this is a really good start and you can take it from here.

Feel free to contact me with any questions, corrections, or just to talk :-)

References:
https://blog.gojekengineering.com/building-containers-from-scratch-c2368a8c8701
https://www.youtube.com/watch?v=8fi7uSYlOdc
https://ericchiang.github.io/post/containers-from-scratch
