In last week’s post, I talked about container images and how they empower you to build your container apps really quickly by utilizing software that was already installed. This week, I’m going to talk about how we create those images. This topic will also dive a little deeper into image layers, something I didn’t really want to discuss separately from the building of images.
This post will have some examples in it that you are more than welcome to try out yourself. I will be using Docker for my examples, but most of the concepts apply to other container image formats as well. I will also not be covering all of the Dockerfile instructions in this post, but I will cover the main ones that are required in the majority of images. You can install Docker locally from docker.io or you can run these commands interactively from a site like play-with-docker.com.
Back to Saving the World
Recall that last week we were building robots to fight off an alien invasion. You utilized parts that you already had around so that you could quickly assemble your army. The question now is: where did those parts come from? They weren’t just constructed out of thin air — there was a design behind each of those parts that someone followed in order to build the part. In the same manner, container images are built from a specification that dictates the contents of the image.
For Docker, this specification is called a Dockerfile. The specification consists of a series of instructions that are executed from top to bottom. The ordering of the instructions can be important, but as I’ll explain later, the results of everything executed in the Dockerfile are captured as part of the final image (unless you use some special tools to modify the image after it is built). A simple hello world Dockerfile looks like this:
CMD echo 'Hello World!'
We build an image from this using the docker build command:
$ docker build -t hello_world:1.0 .
The -t argument tags the resulting image with a meaningful name and version. The version (in this case 1.0) is optional and will default to the magic version latest. You don’t have to specify a tag if you don’t want to, but then you will have to refer to the image by its ID, which is simply a hex string. Then we can run the image in a container using the docker run command:
$ docker run hello_world:1.0
As with the build command, the version is not required, but it will default to the magic latest version, which may or may not exist (some images are tagged with a latest version, some are not).
The first instruction in every Dockerfile is the FROM instruction. This instruction specifies the foundation upon which we’ll be building our image. In some cases, this will be a simulated OS layer such as Ubuntu, Fedora, or CentOS. In other cases, it might be an application server such as JBoss or Tomcat. You can technically use any existing image as the foundation for your image, but you are required to include that foundation (it is not possible to have a Dockerfile without a FROM instruction). This is not a big hurdle, and typically you are going to want some basic commands or functionality in your image. There are some really thin base images, such as Alpine, that provide a good foundation in a really small size. You can just specify an image tag name if you want:
FROM alpine
or you can specify the name and version:
FROM alpine:3.7
If you don’t specify a version, it defaults to the latest version, which may or may not be defined (I have a special tool called dockertags that you can use to look up the available versions of images on Docker Hub).
The next instruction that you will commonly use is RUN. This instruction is executed when building the image and is usually used in combination with a package management tool such as yum or apt to install software into the image. You can run any executable that already exists in the portion of the image created so far. You cannot, however, run executables that are outside of the image (as you will see in a moment, you can copy files in as needed, and those can be executed, but typically you will be running commands from the base layer or from software you installed with package managers). You use the RUN instruction just as if you were executing commands in a shell script:
RUN apt-get update
RUN apt-get install -y python-dev curl
If you need to add files to your image, you can use the COPY or ADD instructions. Like the RUN instruction, these are executed when building the image. Most of the time, you will want to include files that are located adjacent to the Dockerfile, and so you should use the COPY instruction. The ADD instruction is similar, except that you can also specify a URL as the source.
COPY ./myscript.py /app/myscript.py
ADD https://github.com/anyuser/magic/scripts/script.py /app/magic.py
The final instructions I will talk about today are the CMD and ENTRYPOINT instructions, which specify the default command to run when a container is created (you can also use the instructions together, with ENTRYPOINT specifying the executable and CMD specifying the default arguments). You can override the command that is executed when the container is created, but it is easier to do with CMD than with ENTRYPOINT. You can also omit these instructions, but if you then attempt to create a container without specifying a command to run, you will get an error. Explaining the difference between CMD and ENTRYPOINT is complicated (and the syntax of the instructions is also complicated), so I will refer you to this blog post which explains the differences in detail. Most of the time, you’ll use CMD, unless you intend for the user not to override the command:
CMD python /app/myscript.py
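As a quick sketch of the combined form (reusing the script path from the example above; the --verbose argument is purely illustrative), ENTRYPOINT can fix the executable while CMD supplies overridable default arguments:

```dockerfile
# The executable is fixed by ENTRYPOINT; only the arguments in CMD
# are replaced when the user passes arguments to docker run.
ENTRYPOINT ["python", "/app/myscript.py"]
# Default argument, used only when no arguments are given at run time.
CMD ["--verbose"]
```

With this Dockerfile, running the container with extra arguments (for example, docker run myimage --quiet) executes python /app/myscript.py --quiet instead of the default.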
There are other instructions I have not talked about today, but these four types of instructions are the most common. There are some instructions that simply provide metadata (such as MAINTAINER or LABEL), others that specify environmental data (such as ENV or ARG), and finally others that change the context in which instructions are run (such as USER or WORKDIR). Consult the Dockerfile reference for more information about these other instructions.
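To give a feel for those other instruction families, here is a hedged sketch (the label value, paths, and user are made-up examples, not recommendations):

```dockerfile
FROM alpine:3.7
# Metadata-only instruction: recorded in the image, changes nothing at run time
LABEL maintainer="jane@example.com"
# ARG is only available while building; ENV persists into running containers
ARG build_date=unknown
ENV APP_HOME=/app
# WORKDIR and USER change the context for the instructions that follow
WORKDIR $APP_HOME
USER nobody
CMD echo "running from $APP_HOME"
```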
One thing I have not talked about at all in this series so far is the concept of layers. For each instruction in a Dockerfile, a layer is created in the image. For example, the following Dockerfile will result in an image with five layers:
FROM debian:stretch
RUN apt-get -qq update
RUN apt-get -qq install -y python-dev curl ruby
COPY ./demo.py /app/demo.py
CMD python /app/demo.py
If I build this image using the docker build command, it will output the result of each instruction:
$ docker build -t dockerfile_demo .
Sending build context to Docker daemon 3.072kB
Step 1/5 : FROM debian:stretch
stretch: Pulling from library/debian
cc1a78bfd46b: Pull complete
Status: Downloaded newer image for debian:stretch
Step 2/5 : RUN apt-get -qq update
---> Running in efffc5e1f4e1
Removing intermediate container efffc5e1f4e1
Step 3/5 : RUN apt-get -qq install -y python-dev curl ruby
---> Running in 9815a0f4b9b8
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libpython2.7-minimal:amd64.
Running hooks in /etc/ca-certificates/update.d...
Removing intermediate container 9815a0f4b9b8
Step 4/5 : COPY ./demo.py /app/demo.py
Step 5/5 : CMD python /app/demo.py
---> Running in adbe22ce62a8
Removing intermediate container adbe22ce62a8
Successfully built e16c9d28a871
Successfully tagged dockerfile_demo:latest
There are a few things this output tells us. Each instruction results in a layer, and each layer is identified by a hash (you can see the hash of the final image, e16c9d28a871, in the last lines of the output). The RUN and CMD instructions are executed in a container (that’s right folks — using containers to build images) and the results are added as a layer (the results for RUN are generally the files created, whereas the results for CMD are metadata files that will be used by the container).
These layers that were created while building the image are immutable. That is to say, each additional layer doesn’t change anything in the previous layer. The layer instead captures files that were added or removed by running the instructions. There are a couple of reasons why Docker does it this way:
- Layers are reusable — in some cases, two Dockerfiles will execute the same instruction resulting in layers that are identical. This means that we can cache those layers and skip rebuilding them.
- Docker only rebuilds layers that have changed — because of the caching, Docker can skip rebuilding layers if they haven’t changed and they are cached. There is one caveat to this: Docker can only skip rebuilding a layer if it hasn’t changed and it is either the first layer or the previous layer was also retrieved from the cache. That is to say, if I have a layer that has changed, all of the following layers will have to be rebuilt. The reason is that those instructions may result in a layer that is different from the previous time they were built (if those instructions rely on any state in the filesystem, it is possible to have a different result).
- Docker only transmits layers that it needs to transmit — when you build an image, the final result is stored in your local image cache. To save space, Docker only keeps one copy of each layer, even if the image is built multiple times. New layers will be added, but layers that haven’t changed will remain as is. When you push or pull images to or from remote registries, Docker will check to see which layers already exist in the destination and only transmit the ones that aren’t already there.
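One practical consequence of the caching rules above is instruction ordering: put the instructions that change most often (such as copying your application code) as late in the Dockerfile as possible. A sketch, reusing the demo files from earlier (the package list is illustrative):

```dockerfile
FROM debian:stretch
# Dependencies change rarely, so this layer is usually served from the cache
RUN apt-get -qq update && apt-get -qq install -y python-dev
# Application code changes often; because COPY comes last, editing demo.py
# only forces the layers from this point on to be rebuilt
COPY ./demo.py /app/demo.py
CMD python /app/demo.py
```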
In a sense, layers are an optimization technique. However, if you have a lot of layers (and recall that each layer contains all of the output of the instruction for that layer), it can also increase the size of images considerably. There are a couple of options you can use to help reduce the size of the image:
- Use a tool like Docker Squash — there are tools that can flatten the layers by essentially merging the resulting filesystem in each layer together. This can be helpful when there are layers that add a bunch of files that get removed by later layers. The downside of this technique is that the resulting layer is less reusable.
- Use a multi-stage Dockerfile — in 2017 Docker introduced a way to split building an image into stages. The stages were demarcated by additional FROM instructions, and you could use the COPY command to copy files from one stage to another. This was especially helpful for Dockerfiles where a lot of instructions and tools were needed to build an executable or filesystem structure inside the image, but a lot of tools and libraries were not actually needed to run the container from the final image. This also enabled developers to use heavy base images in early stages as needed and then use a light image as the base for the final image.
- Chain commands together in RUN instructions — often you will find yourself running commands like yum or apt in multiple instructions in order to install several different packages. You can, however, run all of these commands together in one instruction by chaining them together with ampersands:
RUN apt-get update && apt-get install ... && apt-get clean
This can result in much smaller layers.
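To make the multi-stage idea above concrete, here is a hedged sketch (the Go program, image tags, and paths are placeholders): the full toolchain lives only in the first stage, and the final image contains just the compiled artifact.

```dockerfile
# Stage 1: a heavy image with the full build toolchain
FROM golang:1.10 AS builder
WORKDIR /src
COPY ./main.go .
# CGO_ENABLED=0 produces a static binary that can run on Alpine
RUN CGO_ENABLED=0 go build -o /bin/app main.go

# Stage 2: a light final image; only the binary is copied over
FROM alpine:3.7
COPY --from=builder /bin/app /bin/app
CMD ["/bin/app"]
```

The layers created in the builder stage are discarded from the final image; only the layers of the second stage (plus the copied binary) are shipped.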
This week I talked about how you build images and how images are made up of layers. I also covered a few techniques that you can use to reduce the number of layers or the overall size of the constructed image. Now that I have covered the three main steps of building containers (specification -> image -> container), the next post will shift the focus to orchestration: using Docker Compose for orchestrating multiple containers, then introducing Kubernetes, which adds load management to orchestration. That will take us to where OpenShift comes into the picture for the world of containers.
Originally published on May 15, 2018.