
liljenzin
Contributor

Bumping the discussion in #377 with a proof of concept.

IMHO the file_stat_matches sloppiness option is not useful for several reasons.

Time stamps are updated when switching between branches; they don't match when files are checked out into different worktrees; and they cannot be trusted alone, since manifests contain relative file paths. On top of that, all included files still need to be hashed for each compilation unit that uses them.

Hard facts from building a large program, taking the Chromium browser as an example: with the current approach, over 18 million files will be hashed to build about 34000 compilation units. With the suggested patch this goes down to hashing about 74000 files in the initial build and 26500 files in repeated builds from a hot ccache. This saves 7-8 minutes of serial build time for me, which is significant, although the real-time gain for parallel builds is smaller and depends on the number of cores for natural reasons.

Note that this is a draft pull request with a patch I wrote just for the sake of discussion. I am not sure this is the best way to do it, but it demonstrates that a cache can be efficient with close to zero overhead and as an embedded feature, thus not introducing new dependencies on external infrastructure such as caching daemons.

========================================================

The inode cache is a process shared cache that maps from device, inode,
mtime to saved hash results. The cache is stored persistently in a
single file that is mapped into shared memory by running processes,
allowing computed hash values to be reused both within and between
builds.

The chosen technical solution works for Linux and might work on other
POSIX platforms, but is not meant to be supported on non-POSIX
platforms such as Windows.

Use 'ccache -o inode_cache=true/false' to activate/deactivate the cache.
Use 'ccache -o inode_cache_file=/path/to/cache/file' to set a custom
cache-file location. Defaults to '{cache-dir}/inode-cache'.
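To make the idea concrete, here is a minimal sketch of the lookup scheme. This is not the actual patch: the names (g_inode_cache, lookup, insert) are hypothetical, and a plain std::map stands in for the real memory-mapped shared file.

```cpp
#include <sys/stat.h>

#include <cstdint>
#include <map>
#include <optional>
#include <tuple>

// Sketch only: file identity is (device, inode), and mtime changes
// whenever the content does, so the triple keys the saved digest.
using Key = std::tuple<dev_t, ino_t, std::int64_t, std::int64_t>;  // dev, ino, mtime (s, ns)
using Digest = std::uint64_t;

std::map<Key, Digest> g_inode_cache;  // in the patch this lives in shared memory

static Key make_key(const struct stat& st) {
    return {st.st_dev, st.st_ino, st.st_mtim.tv_sec, st.st_mtim.tv_nsec};
}

// Returns the saved digest if the same (device, inode, mtime) was seen before.
std::optional<Digest> lookup(const struct stat& st) {
    auto it = g_inode_cache.find(make_key(st));
    if (it == g_inode_cache.end()) {
        return std::nullopt;
    }
    return it->second;
}

void insert(const struct stat& st, Digest digest) {
    g_inode_cache[make_key(st)] = digest;
}
```

A hit skips rehashing the file entirely; a change to the content bumps mtime and falls through to the slow path.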

@liljenzin liljenzin marked this pull request as ready for review April 13, 2020 10:04
@jrosdahl
Member

Interesting! I unfortunately don't have much time to think about this right now, though.

@jrosdahl jrosdahl added discussion feature New or improved feature labels Apr 13, 2020
@afbjorklund
Contributor

afbjorklund commented Apr 14, 2020

I think I missed something here: why would the inode change when a file is updated?

We have investigated reusing hash contents before, that time using memcached.
(i.e. using some existing key-value store, rather than writing our own for ccache)

Here is some old code, with "stat": 3.7-maint...afbjorklund:memcached-hashed
Also tried mmap (with lmdb), but there are just so many ccache processes created...

It has a great win, but same downfall as the other sloppy options for a silver bullet.

@liljenzin
Contributor Author

I think I missed something here: why would the inode change when a file is updated?

I will try to explain this way.

When calling hash_source_code_file() to hash "includes/foo.h", the result depends on the content of the specified file. It also depends on the provided seed and configuration options.

Caching the result based on (filename, mtime) would be a terrible idea since "includes/foo.h" is relative and could resolve to different files depending on current directory. It could also resolve to different files if files have been moved around or the directory structure has changed.

As (device, inode) makes up the identity of a file, and mtime is always updated when the content changes, (device, inode, mtime) must resolve to the same content over time; otherwise the system violates assumptions made by other build tools like "make".

The saved hash value is made independent of the provided seed by always hashing files using a fixed seed, saving the resulting hash digest, and then hashing the saved digest into the provided seed instead. This uses the hash-combine trick to make saved hash values context independent. Configuration settings that change the hash result (currently sloppy_time_macros) must be added to the key, though, which means (device, inode, mtime, sloppy_time_macros) is used as the combined key that identifies a saved entry in the cache.

We have investigated reusing hash contents before, that time using memcached.
(i.e. using some existing key-value store, rather than writing our own for ccache)

Here is some old code, with "stat": 3.7-maint...afbjorklund:memcached-hashed

It is always a tradeoff between generic solutions that solve generic problems and homebrewed solutions that do exactly what you need and nothing more.

I guess memcached is not zero-overhead because it runs in a separate process and we have to communicate over a socket, causing context switching? I also guess integrating with memcached requires about the same amount of code as doing it all ourselves? And that memcached also requires additional administrative work for end users?

Please correct me if I'm wrong.

Also tried mmap (with lmdb), but there's just so many ccache processes created...

In this case a single file is mapped into shared memory by concurrent processes. This is basically how libc.so and other shared libraries are loaded every time a ccache process is spawned. Nothing would be fast if the kernel couldn't handle this very efficiently.
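A minimal sketch of the mechanism (POSIX only; path and size below are illustrative, and the real patch adds layout and synchronization on top): each process maps the same file with MAP_SHARED, so the pages exist once in the page cache and writes by one process are immediately visible to the others.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>

// Map one cache file into shared memory; errors reduced to returning nullptr.
void* map_shared_cache(const char* path, std::size_t size) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        return nullptr;
    }
    if (ftruncate(fd, static_cast<off_t>(size)) != 0) {
        close(fd);
        return nullptr;
    }
    void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    return mem == MAP_FAILED ? nullptr : mem;
}
```

Since the kernel backs every mapping of the file with the same pages, a short-lived ccache process pays essentially only the cost of the mmap call itself.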

The easiest way to find out is to build and test the patch yourself. Either it is fast or it isn't.

It has a great win, but same downfall as the other sloppy options for a silver bullet.

Exactly what downfalls do you see? And exactly how is it sloppy?

@afbjorklund
Contributor

Ah, OK. I used the absolute path as the key for the same thing (instead of device, inode), then.
You are probably right about the administrative overhead; I just spawned it from make...

Using memcached back then was natural since we already used it for distributed cache.
As it ended up not being integrated anyway, it would make sense to start over and try.

It has a great win, but same downfall as the other sloppy options for a silver bullet.

Exactly what downfalls do you see? And exactly how is it sloppy?

It has the usual mtime caveats, if you change content and reset timestamp you lose...

The "checksum always" strategy was the safe-and-slow option. But as opt-in, sure!

@liljenzin
Contributor Author

liljenzin commented Apr 14, 2020

It has the usual mtime caveats, if you change content and reset timestamp you lose...

In the actual implementation I added ctime and size to the key, thus tightening such loopholes, since you can't modify mtime without also updating ctime. I didn't mention it earlier because I didn't want to complicate the explanation of the principle with all the details.
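A sketch of the tightened key (names are hypothetical, not the patch's): even if mtime is reset after an edit, ctime will have advanced, so the combined key still changes.

```cpp
#include <sys/stat.h>

#include <cstdint>
#include <tuple>

// Hypothetical tightened cache key: ctime and size are added because mtime
// can be set back from user space, but ctime cannot.
using TightKey = std::tuple<dev_t, ino_t,
                            std::int64_t, std::int64_t,   // mtime (s, ns)
                            std::int64_t, std::int64_t,   // ctime (s, ns)
                            off_t>;                       // size

TightKey tight_key(const struct stat& st) {
    return {st.st_dev, st.st_ino,
            st.st_mtim.tv_sec, st.st_mtim.tv_nsec,
            st.st_ctim.tv_sec, st.st_ctim.tv_nsec,
            st.st_size};
}
```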

The "checksum always" strategy was the safe-and-slow option. But as opt-in, sure!

You can always violate contracts by editing the actual disk blocks as a last resort. The point is that other parts of a build system make the same assumptions, and if you violate them, ccache will not be the only build tool that breaks.

The feature is not sloppy in the way that it will break by accident when used on a healthy system. If you have to be creative to break it, then it is not a flaw that will hit innocent users.

@afbjorklund
Contributor

The feature is not sloppy in the way that it will break by accident when used on a healthy system. If you have to be creative to break it, then it is not a flaw that will hit innocent users.

It was not meant as sloppy as in dirty, and there is an inherent risk in all caching... Like you say, if the setup is healthy then it should be reliable. Hope that someone gets a chance to look at it!

@jrosdahl
Member

Hard facts from building a large program [...] This saves 7-8 minutes in serial build time for me [...]

Thanks for the numbers. Not that it matters much, but it would be interesting to know how much of the 7-8 minutes come from avoiding hashing and how much come from avoiding I/O.

By the way, do you use the file_clone or hard_link mode?

The inode cache is a process shared cache that maps from device, inode,
mtime to saved hash results. The cache is stored persistently in a
single file that is mapped into shared memory by running processes,
allowing computed hash values to be reused both within and between
builds.

This indeed sounds like a much better approach than the path-based one envisioned in #377. Thanks! I have nothing against this approach in principle. Well, maybe the only concern I have is if your use case is "non-edge" enough to motivate increasing the ccache implementation complexity.

@liljenzin
Contributor Author

Thanks for the numbers. Not that it matters much, but it would be interesting to know how much of the 7-8 minutes come from avoiding hashing and how much come from avoiding I/O.

It seems to be from reduced hashing. Before implementing the cache I used valgrind/callgrind for profiling and found that about 60-70% of CPU time was spent in hashing, and most of this time was eliminated by the cache. The mentioned build was performed on a machine with 128 GB RAM, which effectively eliminates all reads from physical disk since all source files and all output fit into buffer space. That doesn't mean I/O is free, but there is no latency from waiting for disk.

By the way, do you use the file_clone or hard_link mode?

None of them, mainly because I/O pressure looks very low in general when I build, provided sufficient RAM and fast NVMe drives. At least I have not seen any measurable gain when trying these options in the past.

The inode cache is a process shared cache that maps from device, inode,
mtime to saved hash results. The cache is stored persistently in a
single file that is mapped into shared memory by running processes,
allowing computed hash values to be reused both within and between
builds.

This indeed sounds like a much better approach than the path-based one envisioned in #377. Thanks! I have nothing against this approach in principle. Well, maybe the only concern I have is if your use case is "non-edge" enough to motivate increasing the ccache implementation complexity.

I think the important question is whether ccache performance matters when you get a hit, or if hits are already fast enough. If it matters, it might be hard to find other changes that achieve similar improvements without increasing complexity even more.

Note that even trivial compilation units, such as the "Hello World!" program, gain from the cache, because a single standard header often includes a huge number of other files.

$ cat t.cc
#include <iostream>

int main() {
	std::cout << "Hello World!" << std::endl;
}

$ ccache -o inode_cache=false
$ time for ((i=0;i<10000;++i)); do ccache g++ -c t.cc; done
real	0m37,334s
user	0m26,053s
sys	0m11,462s

$ ccache -o inode_cache=true
$ time for ((i=0;i<10000;++i)); do ccache g++ -c t.cc; done

real	0m19,883s
user	0m9,167s
sys	0m10,981s

Which is 87% faster even for a trivial program. :-)

Thanks for the review comments btw. Will look at them when I get some spare time the coming week.

@afbjorklund
Contributor

Seems to be from reduced hashing. Before implementing the cache I used valgrind/callgrind for profiling and found about 60-70% of cpu time was spent in hashing, while most of this time got eliminated by the cache

You can also use the built-in ccache tracing, to get something you can load into chrome: #280

./configure --enable-tracing

export CCACHE_INTERNAL_TRACE=1

liljenzin and others added 16 commits May 3, 2020 11:02
Co-authored-by: Joel Rosdahl <joel@rosdahl.net>
liljenzin added 2 commits May 25, 2020 20:26
Apparently posix_fallocate() needs read permission on the open file,
otherwise it will not work on NFS since the emulation is client side.
@jrosdahl jrosdahl merged commit 213d988 into ccache:master May 31, 2020
@jrosdahl
Member

Thanks!

@jrosdahl jrosdahl added this to the 4.0 milestone May 31, 2020
@jrosdahl jrosdahl changed the title Inode cache for file hashes (proof of concept) Inode cache for file hashes May 31, 2020
jrosdahl added a commit to jrosdahl/ccache that referenced this pull request Jun 18, 2020
Unintended or not, ccache#577 (213d988) changed the behavior of “ccache
--hash-file” to use hash_binary_file, which essentially performs
hash(hash(path)) if the i-node cache is enabled, otherwise hash(path).
This means that “ccache --hash-file” behaves differently depending on if
i-node cache is enabled and also that it’s no longer usable for
benchmarking purposes.

Fix this by simply using “hash_file” again.
jrosdahl added a commit that referenced this pull request Jun 18, 2020