Quite often, one of the most time-consuming parts of the build process is installing dependencies. This step is traditionally slow because package managers favor stability over performance. And this makes perfect sense: if something terrible happens (like a power failure or kernel panic), the system must remain in a usable state.
However, stability is not very important when you build an image: if the build fails, the system discards the image anyway, and you have to start over. Therefore, if we could hint to the package manager that we don’t care much about the integrity of the data, we could speed up our builds.
In this post, I will try to explain how to use eatmydata to speed up some operations, using Debian-based images as an example.
First of all, why are package managers slow? The dpkg FAQ says (the same is true for rpm and probably other package managers):
To guarantee that the filesystem data is always consistent and safe, dpkg performs fsync(2)s on its database and files unpacked from packages. Newer filesystems (like btrfs or ext4) that implement delayed allocation do require those fsync(2)s as they trade data safety for performance, and expect programs to performs those fsync(2)s, but at the same time they have shown very poor performance on the behaviour they require from all applications, or they might end up producing zero-length files on system crashes or abrupt shutdowns.
And this is how the fsync(2) man page describes the call:
fsync() transfers all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor to the disk device (or another permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has been completed. As well as flushing the file data, fsync() also flushes the metadata information associated with the file.
The above means that the more files a package contains, the more time it will take to install it.
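To get a feel for that cost, here is a rough sketch that writes the same small files twice: once letting the kernel buffer the writes, and once forcing an fsync after every file. It assumes a dd that understands conv=fsync (GNU coreutils does); the write_files helper, the file count, and the directory names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write COUNT 4 KiB files into DIR.
# $1 = target directory, $2 = number of files, $3 = extra dd option (may be empty)
write_files() {
    dir=$1
    count=$2
    extra=$3
    mkdir -p "$dir"
    i=0
    while [ "$i" -lt "$count" ]; do
        # $extra is left unquoted on purpose: an empty value adds no argument.
        dd if=/dev/zero of="$dir/file$i" bs=4096 count=1 $extra 2>/dev/null
        i=$((i + 1))
    done
}

workdir=$(mktemp -d)

# Buffered: the kernel decides when the data reaches the disk.
write_files "$workdir/buffered" 200 ""

# Synced: dd calls fsync() on every file before it exits.
write_files "$workdir/synced" 200 "conv=fsync"

# Remove the scratch directory when done: rm -rf "$workdir"
```

Timing each call (for example, with the time keyword in bash) should show the second loop taking noticeably longer on a disk-backed filesystem. On tmpfs, where fsync() has nothing to flush, the difference may vanish.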
Is there a way to disable those fsync() calls? Fortunately, yes: we can use the so-called LD_PRELOAD trick. And we can automate this even further 🙂
How can we use it?
First of all, we need to install eatmydata:
apt-get -qq install eatmydata
Then, prefix invocations of the package manager with eatmydata, like this:
-apt-get -qq install subversion unzip default-mysql-client nodejs npm
+eatmydata apt-get -qq install subversion unzip default-mysql-client nodejs npm
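If the same build script has to run on images where eatmydata may not be installed, one option is a small wrapper that falls back to running the command unchanged. This is only a sketch; the function name maybe_eatmydata is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: run a command through eatmydata when it is installed,
# otherwise run the command as-is.
maybe_eatmydata() {
    if command -v eatmydata >/dev/null 2>&1; then
        eatmydata "$@"
    else
        "$@"
    fi
}
```

With this in place, the install step becomes maybe_eatmydata apt-get -qq install subversion unzip default-mysql-client nodejs npm, and the script keeps working (just more slowly) where eatmydata is missing.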
In a real-world example, eatmydata has shown a nice performance boost (circa 33%):
- Time to build the image without
- Time to build the image with
Not bad. Especially if you have to build containers a lot or when your CI provider charges you per minute of the build time 🙂
You can use eatmydata with any program, not only with package managers. Subversion, Git, etc. are likely to be faster when used together with eatmydata. Another example is a test suite that uses an embedded database like SQLite: eatmydata can make it run a bit faster because of the way SQLite uses fsync().
There is another way to use eatmydata with all programs: the LD_PRELOAD trick mentioned above. To use it, you need to add this line to your Dockerfile (of course, after you have installed eatmydata):
ENV LD_PRELOAD=libeatmydata.so
This will inject libeatmydata.so into all programs run by the system, which will effectively turn off the fsync() family of functions. Just make sure not to do this if your container needs to write some information to external data storage.
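If only some steps should skip fsync(), the variable can be set for a single command instead of globally. Here is a minimal sketch, assuming libeatmydata.so is installed where the dynamic loader can find it; the with_preload name is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: preload libeatmydata.so for one command only.
# For external commands, the assignment applies to that command's environment
# and does not leak into the rest of the script.
with_preload() {
    LD_PRELOAD=libeatmydata.so "$@"
}
```

In a Dockerfile, the same idea would be a single line such as RUN LD_PRELOAD=libeatmydata.so apt-get -qq install subversion, which leaves later steps and the running container with their normal fsync() behaviour.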