I have a deterministic failure: a cache miss across two workflows. It has taken me a whole day to find the fault behind the failure, and still I have no way to help it.
the setting
I have set up a repository specifically to work on this issue. It will be our running example. https://github.com/kindaro/release The logs be my witness.
the task
The task is to write a workflow release.yml that makes a draft GitHub release, builds some executables and attaches them to said release. This workflow waits for another workflow, haskell-ci.yml, to complete. release.yml should pick up the cache made by haskell-ci.yml whenever it is there. It is known to be there when the target operating system is Linux.
the failure
The failure is that the cache made by haskell-ci.yml is not seen by release.yml.
the hunt
I have undertaken the following steps to find the fault behind the failure:
- Check if the cache is not seen across workflows because the action is triggered by a tag.
  I made the action run on master. However, the failure still showed up.
- Check that the key is matching.
  To this end, I replaced the key in both workflows by the same string without variables. However, the failure still showed up.
- Look at the caches through GitHub API.
  This is the URL I looked at. https://api.github.com/repos/kindaro/release/actions/caches I spotted that there are caches with the same key but different versions. This is an error. However, this error may have any of a few roots: wrong paths, wrong compression, wrong version of the action.
- Check if the version of the action is the same for both workflows.
  I spotted that it is different! I set it to be the same. However, the failure still showed up.
- Reverse engineer the version values.
  The version values are hashes, defined in an upstream library. With some trial, I found how the version values of my caches are made:

      > [["gzip"], ["zstd"], ["zstd-without-long"], []].map((compression) => {
          return require('crypto')
            .createHash('sha256')
            .update(["~/.cabal/store"].concat(compression).concat(["1.0"]).join('|'))
            .digest('hex')
        })
      [
        '97150208f15627752f4bcfa20bf9811d3688b5b274ffe014984351184e875a74',
        'bb4f75bb6ca5843bf5c49253ee4d5d67796506fa9441ca6cffe69d7960a2bcd4',
        '27a066cbe2873e20fde52127b6017ab7615ab4b954a020235b8fe7737035cbca',
        '1cf722fcd6cb72f17b62fd7954c4ae9118802bdb10b3ef077b4ec06d1f66bfbe'
      ]

  You may see that the latter two hashes match the version values of my caches. So, the error is that one workflow does compression with gzip and another with zstd. Why could that happen? The workflows must be running on different operating systems.
- Find out if the workflows are running on the same operating system.
  I checked my workflows and it turned out that:
  - haskell-ci.yml specifies ubuntu-18.04
  - release.yml specifies ubuntu-latest

  I set both to run on ubuntu-18.04. However, the failure still showed up.
- Look at all the logs again and see if the compression correlates with the name of the workflow.
  Truly it turns out that release.yml always does zstd but haskell-ci.yml always does gzip.
- Switch on debug output.
  Debug output told me that zstd is not found in the environment run by haskell-ci.yml.
- Check if zstd is killed by an earlier step.
  I inserted steps that check if zstd is there. It is not there from the start.
- Look again at haskell-ci.yml.
  haskell-ci.yml specifies a container, so all jobs are run in that container. Who knows why, but this container does not have zstd. Therefore, the action picks a different compression. Therefore, the version of the cache is different. Therefore, cache miss. This is my conclusion from evidence.
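The last link of that chain (different compression, therefore different version) can be checked with a short Node.js sketch. It assumes the recipe found by trial above: the paths, then the compression name if there is one, then the salt "1.0", joined with "|" and hashed with SHA-256. The name cacheVersion is made up for illustration and is not taken from the action.

```javascript
const crypto = require('crypto');

// Hypothetical helper that mirrors the recipe reverse engineered above.
// `compression` is an array so that "no compression component" can be [].
function cacheVersion(paths, compression) {
  const components = paths.concat(compression).concat(['1.0']);
  return crypto.createHash('sha256').update(components.join('|')).digest('hex');
}

const paths = ['~/.cabal/store'];

// Same paths, different compression component.
const withoutComponent = cacheVersion(paths, []);
const withZstd = cacheVersion(paths, ['zstd-without-long']);

console.log('no compression component:', withoutComponent);
console.log('zstd-without-long:       ', withZstd);
console.log('same version?            ', withoutComponent === withZstd);
```

Since the compression name is part of the hashed string, two workflows that agree on key and paths still end up with different versions as soon as one of them lacks zstd.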
the outcome
You now see how hard it is to find the fault behind a failure like this, where at every step of the way the view is further obscured. The fault is not local — it is found in another workflow script, and in another section thereof.
- It is not clear what the version hash for a given cache is. The action does not say anything about it. This is easy to make better: make the action say what the version hash is, every time. On top of that, on a cache miss, make the action say a warning if there are caches whose key matches but whose version hash does not.
- It is not clear what the version hash is made of. It is a hash: it cannot be undone, only guessed at. How far easier it would be if the version were a JSON string. Is there a justification from security for this hashing? If so, the action should say what it makes the version hash from, when it makes a cache, so that I can spot the difference.
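To make the two asks above concrete, here is a rough Node.js sketch of what I have in mind. It is not the action's code: describeVersion and explainMiss are made-up names, the hashing recipe is the one I found by trial, the cache entries are assumed to be plain objects with key and version fields like those in the API listing, and the sample data is invented.

```javascript
const crypto = require('crypto');

// Hypothetical: give the version hash together with a readable JSON
// account of what it was made from.
function describeVersion(paths, compression) {
  const components = paths.concat(compression).concat(['1.0']);
  const hash = crypto.createHash('sha256').update(components.join('|')).digest('hex');
  return { hash, madeFrom: JSON.stringify({ paths, compression, salt: '1.0' }) };
}

// Hypothetical: on a cache miss, build a warning for every cache whose
// key matches but whose version does not.
function explainMiss(caches, key, version) {
  return caches
    .filter((cache) => cache.key === key && cache.version !== version)
    .map((cache) =>
      `warning: a cache with key "${cache.key}" exists, ` +
      `but its version ${cache.version} differs from ${version}`);
}

// Invented sample data: one stored cache, made without zstd.
const stored = describeVersion(['~/.cabal/store'], []);
const wanted = describeVersion(['~/.cabal/store'], ['zstd-without-long']);
const caches = [{ key: 'cabal-store', version: stored.hash }];

console.log(`looking for version ${wanted.hash}, made from ${wanted.madeFrom}`);
for (const line of explainMiss(caches, 'cabal-store', wanted.hash)) {
  console.log(line);
}
```

With output of this kind in the log, the mismatch in compression would have been plain to see at the first cache miss.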
Even when the fault is found, I now have to go to some length to run release.yml in the same container as haskell-ci.yml, or delete zstd, or take some other otherwise needless steps.
- There is no way to tell the action what compression I want.
Can this be improved?