Quite often, one of the most time-consuming parts of the build process is installing dependencies. This step is traditionally slow because package managers choose stability over performance. And this makes perfect sense: if something terrible happens (like a power failure or a kernel panic), the system must remain in a usable state.
However, stability is not very important when you build an image: if the build fails, the system discards the image anyway, and you have to start over. Therefore, if we could hint to the package manager that we don't care much about the integrity of the data, we could speed up our builds.
In this post, I will explain how to use eatmydata to speed up some operations, using Debian-based images as an example.
First of all, why are package managers slow? The dpkg FAQ says (the same is true for rpm and probably other package managers):
To guarantee that the filesystem data is always consistent and safe, dpkg performs fsync(2)s on its database and files unpacked from packages. Newer filesystems (like btrfs or ext4) that implement delayed allocation do require those fsync(2)s as they trade data safety for performance, and expect programs to performs those fsync(2)s, but at the same time they have shown very poor performance on the behaviour they require from all applications, or they might end up producing zero-length files on system crashes or abrupt shutdowns.
And this is how the fsync(2) man page describes the call:
fsync() transfers all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor to the disk device (or another permanent storage device) so that all changed information can be retrieved even if the system crashes or is rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has been completed. As well as flushing the file data, fsync() also flushes the metadata information associated with the file.
The above means that the more files a package contains, the more time it will take to install it.
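To get a feel for that cost on your own machine, here is a rough sketch that writes a batch of small files twice: once with ordinary buffered writes and once forcing a flush after every file. The helper name write_files, the COUNT variable, and the 4 KiB file size are made up for illustration. It assumes a dd that supports conv=fsync (GNU coreutils does), and date only has one-second resolution, so increase COUNT if both numbers come out as zero.

```shell
#!/bin/sh
# Sketch: compare buffered writes with writes that are flushed one by one.
# write_files is a hypothetical helper; COUNT and the file size are arbitrary.

COUNT=${COUNT:-500}

# write_files DIR [DD_OPTION...]: create COUNT small files in DIR.
# Extra arguments are passed straight to dd, e.g. conv=fsync (GNU dd),
# which makes dd flush the file to disk before it exits.
write_files() {
    target=$1
    shift
    mkdir -p "$target"
    i=0
    while [ "$i" -lt "$COUNT" ]; do
        dd if=/dev/zero of="$target/file_$i" bs=4096 count=1 "$@" 2>/dev/null
        i=$((i + 1))
    done
}

workdir=$(mktemp -d)

start=$(date +%s)
write_files "$workdir/buffered"              # plain buffered writes
middle=$(date +%s)
write_files "$workdir/synced" conv=fsync     # one flush per file
end=$(date +%s)

echo "buffered: $((middle - start))s, fsync per file: $((end - middle))s"
```

The gap depends heavily on the filesystem and the storage underneath. On a RAM-backed filesystem such as tmpfs the two runs should take about the same time, while on a real disk the second run is usually the slower one.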
Is there a way to disable those fsync() calls? Fortunately, yes: we can use the so-called LD_PRELOAD trick. And we can automate this even further.
Enter eatmydata.
How can we use it?
First of all, we need to install eatmydata:
apt-get -qq install eatmydata
Then, prefix invocations of the package manager with eatmydata, like this:
-apt-get -qq install subversion unzip default-mysql-client nodejs npm
+eatmydata apt-get -qq install subversion unzip default-mysql-client nodejs npm
In a real-world example, eatmydata has shown a nice performance boost (circa 33%):
- Time to build the image without eatmydata: 08:10
- Time to build the image with eatmydata: 05:28
Not bad, especially if you build containers a lot or if your CI provider charges you per minute of build time.
You can use eatmydata with any program, not only with package managers. Subversion, Git, etc. are likely to be faster when used together with eatmydata. Another example: if your test suite uses an embedded database like SQLite, eatmydata can make the test suite run a bit faster because of the way SQLite uses fsync.
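If the same script has to run both in images that have eatmydata and on machines that don't, a small wrapper can make the prefix optional. This is only a sketch: maybe_eatmydata is a made-up helper name, not something the eatmydata package ships, and the echo call stands in for your real test runner or VCS command.

```shell
#!/bin/sh
# maybe_eatmydata is a hypothetical helper, not part of the eatmydata package.
# It runs the given command under eatmydata when the wrapper is installed,
# and runs the command unchanged otherwise.
maybe_eatmydata() {
    if command -v eatmydata >/dev/null 2>&1; then
        eatmydata "$@"
    else
        "$@"
    fi
}

# Placeholder command; substitute your test runner or VCS call here.
maybe_eatmydata echo "command finished"
```

Because the fallback branch simply runs the command as-is, the script behaves the same either way, only slower when eatmydata is missing.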
There is another way to use eatmydata with all programs: the LD_PRELOAD trick mentioned above. To use it, you need to add this line to your Dockerfile (of course, after you have eatmydata installed):
ENV LD_PRELOAD=libeatmydata.so
This will inject libeatmydata.so into all programs run in the image, both during the rest of the build and in containers started from it, which effectively turns off the fsync() family of functions. Just make sure not to do this if your container needs to write data to external storage.
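If the global setting is too broad for your case, one option is to scope the preload to individual commands. The sketch below uses a made-up helper, with_preload, and assumes the dynamic linker can find libeatmydata.so under that bare name, as the ENV line above does. As far as I know, this is roughly what the eatmydata wrapper does for you.

```shell
#!/bin/sh
# with_preload is a hypothetical helper: it sets LD_PRELOAD for a single
# external command and leaves the rest of the shell session untouched.
# Assumes the dynamic linker can resolve libeatmydata.so by its bare name.
with_preload() {
    LD_PRELOAD=libeatmydata.so "$@"
}

# The variable is visible to the wrapped command...
with_preload sh -c 'echo "inside: ${LD_PRELOAD:-unset}"'
# ...but commands that follow it do not inherit it from the helper.
sh -c 'echo "after: ${LD_PRELOAD:-unset}"'
```

In a Dockerfile, the same idea is a RUN line that starts with LD_PRELOAD=libeatmydata.so followed by the command, so only that build step skips fsync(). If the library is not installed, the dynamic linker should print a warning and run the command without it.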