Quite often, the most time-consuming part of the build process is installing dependencies. This process is traditionally slow because package managers choose stability over performance. And this makes perfect sense: if something terrible happens (like a power failure or a kernel panic), the system must remain usable.

However, stability is not particularly important when building an image: if the build fails, the system discards the image anyway, and you have to start over. Therefore, if we could signal to the package manager that we don’t care much about data integrity, we could speed up our builds.

In this post, I will explain how to use `eatmydata` to speed up some operations using Debian-based images as an example.

First of all, why are package managers slow? The dpkg FAQ says (the same is true for rpm and probably other package managers):

To guarantee that the filesystem data is always consistent and safe, dpkg performs fsync(2)s on its database and files unpacked from packages. Newer filesystems (like btrfs or ext4) that implement delayed allocation do require those fsync(2)s as they trade data safety for performance, and expect programs to performs those fsync(2)s, but at the same time they have shown very poor performance on the behaviour they require from all applications, or they might end up producing zero-length files on system crashes or abrupt shutdowns.

And the fsync(2) man page describes what that call does:

fsync() transfers all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor to the disk device (or another permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has been completed. As well as flushing the file data, fsync() also flushes the metadata information associated with the file.

The above means that the more files a package contains, the longer it will take to install.
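To get a feel for that cost, here is a rough sketch that writes a batch of small files twice: once forcing every file to disk, and once leaving the flushing to the kernel. It assumes GNU `dd` (for `conv=fsync`); the `write_files` helper and the file count are made up for illustration, and the timing uses whole seconds only. The result depends heavily on your disk and filesystem (on tmpfs the difference may disappear entirely).

```shell
# Hypothetical helper: write COUNT 4 KiB files into DIR.
# In "sync" mode each file is fsync'ed before dd exits (GNU dd's conv=fsync),
# which is roughly what dpkg does for every file it unpacks.
write_files() {
    dir=$1
    count=$2
    mode=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ "$mode" = "sync" ]; then
            dd if=/dev/zero of="$dir/file$i" bs=4k count=1 conv=fsync 2>/dev/null
        else
            dd if=/dev/zero of="$dir/file$i" bs=4k count=1 2>/dev/null
        fi
        i=$((i + 1))
    done
}

# Compare both modes in a throwaway directory.
workdir=$(mktemp -d)
for mode in sync nosync; do
    mkdir "$workdir/$mode"
    start=$(date +%s)
    write_files "$workdir/$mode" 200 "$mode"
    end=$(date +%s)
    echo "$mode: $((end - start))s"
done
rm -rf "$workdir"
```

Point `workdir` at a directory on a real disk and raise the file count if both runs report 0 seconds.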

Is there a way to disable those fsync() calls? Fortunately, yes: we can use the so-called LD_PRELOAD trick. And we can automate this even further 🙂

Enter eatmydata.

How can we use it?

First of all, we need to install eatmydata:

apt-get -qq install eatmydata

Then, prefix invocations of the package manager with `eatmydata`, like this:

-apt-get -qq install subversion unzip default-mysql-client nodejs npm
+eatmydata apt-get -qq install subversion unzip default-mysql-client nodejs npm
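As far as I can tell, the `eatmydata` prefix does little more than apply the LD_PRELOAD trick to that one command. The function below is my own stand-in for it, not the actual wrapper script: `preload_eatmydata` is a hypothetical name, and it assumes `libeatmydata.so` is installed somewhere the dynamic loader searches by default.

```shell
# Hypothetical stand-in for the eatmydata wrapper: run one command with
# libeatmydata.so preloaded. Any LD_PRELOAD value that is already set is
# kept and appended after the library.
preload_eatmydata() {
    env LD_PRELOAD="libeatmydata.so${LD_PRELOAD:+ $LD_PRELOAD}" "$@"
}
```

With that in place, `preload_eatmydata apt-get -qq install subversion` should behave much like the `eatmydata` version of the command above. Only the command you pass in (and its children) sees the preloaded library; the rest of the build is unaffected.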

In a real-world example, eatmydata cut the image build time by about a third (circa 33%):

  • Time to build the image without eatmydata: 08:10
  • Time to build the image with eatmydata: 05:28

Not bad, especially if you build containers a lot or your CI provider charges you per minute of build time 🙂
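For the record, the 33% follows directly from the two timings above. A small sketch of the arithmetic; `to_seconds` and `saved_percent` are helper names I made up, and they assume two-digit `MM:SS` values:

```shell
# Convert a two-digit MM:SS value to seconds. The leading zero is stripped
# so that "08" is not parsed as an (invalid) octal number.
to_seconds() {
    m=${1%%:*}
    s=${1##*:}
    echo $(( ${m#0} * 60 + ${s#0} ))
}

# Percentage of build time saved, rounded down.
saved_percent() {
    before=$(to_seconds "$1")
    after=$(to_seconds "$2")
    echo $(( (before - after) * 100 / before ))
}

saved_percent 08:10 05:28   # prints 33
```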

You can use eatmydata with any program, not only with package managers. Subversion, Git, etc., are likely to be faster when run under eatmydata. Test suites that use an embedded database like SQLite are another example: eatmydata can make them run a bit faster, because SQLite calls fsync to make its writes durable.
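If the same script has to run both inside the image and on machines where eatmydata may not be installed, a small guard keeps it working everywhere. This is only a sketch; `maybe_eatmydata` is a name I made up:

```shell
# Hypothetical helper: run the command under eatmydata when it is on PATH,
# otherwise run it unchanged.
maybe_eatmydata() {
    if command -v eatmydata >/dev/null 2>&1; then
        eatmydata "$@"
    else
        "$@"
    fi
}
```

Then wrap whatever is fsync-heavy, for example `maybe_eatmydata git clone <url>` or the command that starts your test suite.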

There is another way to use eatmydata with all programs: the LD_PRELOAD trick mentioned above. To use it, you need to add this line to your Dockerfile (of course, after you have eatmydata installed):

ENV LD_PRELOAD=libeatmydata.so

This injects libeatmydata.so into every dynamically linked program run in the container, which effectively turns off the fsync() family of functions. Because ENV values persist in the final image, make sure not to do this if your container needs to write data safely, for example to an external data storage.
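If you set LD_PRELOAD globally but still have a command or two whose writes must be durable, you can opt those out individually. A minimal sketch, with `run_with_fsync` being a hypothetical helper name:

```shell
# Hypothetical helper: run one command with LD_PRELOAD removed from its
# environment, so fsync() works normally again. The subshell keeps the
# change from leaking into the calling shell.
run_with_fsync() {
    (
        unset LD_PRELOAD
        "$@"
    )
}
```

For example, an entrypoint script could start the real service with `run_with_fsync your-service`, where `your-service` is a placeholder, while the build steps keep the speed-up.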

Speeding up Docker Builds With eatmydata