Cache miss because of different compression. #918

@kindaro

I have a deterministic failure: a cache miss across two workflows. It took me a whole day to find the fault behind the failure, and I still have no good way to fix it.

the setting

I have set up a repository specifically to work on this issue (https://github.com/kindaro/release). It will be our running example. The logs be my witness.

the task

The task is to write a workflow release.yml that makes a draft GitHub release, builds some executables and attaches them to said release. This workflow waits for another workflow haskell-ci.yml to complete. release.yml should pick up the cache made by haskell-ci.yml whenever it is there. It is known to be there when the target operating system is Linux.

the failure

The failure is that the cache made by haskell-ci.yml is not seen by release.yml.

the hunt

I have undertaken the following steps to find the fault behind the failure:

  1. Check if the cache is not seen across workflows because the action is triggered by a tag.

    I made the action run on master. However, the failure still showed up.

  2. Check that the key is matching.

    To this end, I replaced the key in both workflows by the same string without variables. However, the failure still showed up.

  3. Look at the caches through GitHub API.

    I looked at https://api.github.com/repos/kindaro/release/actions/caches and spotted that there are caches with the same key but different versions. This is the error. However, it may have any of a few roots: wrong paths, wrong compression, or a wrong version of the action.

  4. Check if the version of the action is the same for both workflows.

    I spotted that it is different! I set it to be the same. However, the failure still showed up.

  5. Reverse engineer the version values.

    The version values are hashes, defined in an upstream library. With some trial and error, I found how the version values of my caches are made:

    > [["gzip"], ["zstd"], ["zstd-without-long"], [ ]].map((compression) => {return (require('crypto').createHash('sha256').update(["~/.cabal/store"].concat(compression).concat(["1.0"]).join('|')).digest('hex'))})
    [
      '97150208f15627752f4bcfa20bf9811d3688b5b274ffe014984351184e875a74',
      'bb4f75bb6ca5843bf5c49253ee4d5d67796506fa9441ca6cffe69d7960a2bcd4',
      '27a066cbe2873e20fde52127b6017ab7615ab4b954a020235b8fe7737035cbca',
      '1cf722fcd6cb72f17b62fd7954c4ae9118802bdb10b3ef077b4ec06d1f66bfbe'
    ]
    

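    You may see that the latter two hashes, the ones for zstd-without-long and for no compression component at all, match the version values of my caches. The hash for gzip as an explicit component matches nothing, so presumably the empty component stands for gzip. So, the error is that one workflow compresses with gzip and the other with zstd. Why could that happen? My first guess was that the workflows run on different operating systems.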

  6. Find out if the workflows are running on the same operating system.

    I checked my workflows and it turned out that:

    • haskell-ci.yml specifies ubuntu-18.04
    • release.yml specifies ubuntu-latest

    I set both to run on ubuntu-18.04. However, the failure still showed up.

  7. Look at all the logs again and see if the compression correlates with the name of the workflow.

    Truly it turns out that release.yml always does zstd but haskell-ci.yml always does gzip.

  8. Switch on debug output.

    Debug output told me that zstd is not found in the environment run by haskell-ci.yml.

  9. Check if zstd is removed by an earlier step.

    I inserted steps that check if zstd is there. It is not there from the start.

  10. Look again at haskell-ci.yml.

    haskell-ci.yml specifies a container, so all its jobs run in that container. Who knows why, but this container does not have zstd. Therefore, the action picks a different compression. Therefore, the version of the cache is different. Therefore, cache miss. This is my conclusion from the evidence.
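
Steps 3 and 5 of the hunt can be sketched as a small Node.js script. It is only a sketch: it assumes the hashing recipe from the one-liner in step 5 (paths, an optional compression component, and the salt "1.0", joined by "|"), and it assumes the cache list has entries with `key` and `version` fields, like the response at the URL in step 3. The helper names `cacheVersion`, `guessCompression` and `keysWithSeveralVersions` are made up for illustration and are not part of the action.

```javascript
const crypto = require('crypto');

// Assumption: same recipe as the one-liner in step 5, i.e. sha256 over the
// paths, an optional compression component and the salt "1.0", joined by "|".
function cacheVersion(paths, compression) {
  const components = [...paths, ...(compression ? [compression] : []), '1.0'];
  return crypto.createHash('sha256').update(components.join('|')).digest('hex');
}

// The four candidates tried in step 5; `null` means "no compression component".
const CANDIDATES = [
  { label: 'gzip', component: 'gzip' },
  { label: 'zstd', component: 'zstd' },
  { label: 'zstd-without-long', component: 'zstd-without-long' },
  { label: '(no component)', component: null },
];

// Hypothetical helper: name the candidate whose hash equals an observed
// version value, or return null when none of them matches.
function guessCompression(paths, observedVersion) {
  const hit = CANDIDATES.find(
    (c) => cacheVersion(paths, c.component) === observedVersion
  );
  return hit ? hit.label : null;
}

// Hypothetical helper: list the keys that exist under more than one version.
// `caches` is assumed to be an array of objects with `key` and `version`.
function keysWithSeveralVersions(caches) {
  const byKey = new Map();
  for (const { key, version } of caches) {
    if (!byKey.has(key)) byKey.set(key, new Set());
    byKey.get(key).add(version);
  }
  return [...byKey]
    .filter(([, versions]) => versions.size > 1)
    .map(([key, versions]) => ({ key, versions: [...versions] }));
}

// Usage: round-trip a version through the guesser.
const paths = ['~/.cabal/store'];
console.log(guessCompression(paths, cacheVersion(paths, 'zstd-without-long')));
```

Feeding the version values from the API response to `guessCompression` would show at a glance which compression each cache was saved with, which is the comparison made by hand in step 5.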

the outcome

You now see how hard it is to find the fault behind a failure like this, where at every step of the way the view is further obscured. The fault is not local — it is found in another workflow script, and in another section thereof.

  • It is not clear what the version hash for a given cache is. The action does not say anything about it. This is easy to make better: make the action print the version hash every time. On top of that, on a cache miss, make the action emit a warning if there are caches whose key matches but whose version hash does not.
  • It is not clear what the version hash is made of. It is a hash: it cannot be undone, only guessed at. How much easier it would be if the version were a JSON string. Is there a security justification for this hashing? If so, the action should say what it makes the version hash from when it makes a cache, so that I can spot the difference.
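
The two suggestions above could look roughly like the following sketch. It is not how the action works today. `describeVersion` and `explainMiss` are hypothetical names, the hashing recipe is the one guessed at in step 5, and the cache list is assumed to be an array of objects with `key` and `version` fields.

```javascript
const crypto = require('crypto');

// Hypothetical: return the version hash together with the components it was
// made from, so that both can be printed when a cache is saved or restored.
// Assumption: components are joined by "|" and salted with "1.0" (see step 5).
function describeVersion(paths, compression) {
  const components = [...paths, ...(compression ? [compression] : []), '1.0'];
  const version = crypto
    .createHash('sha256')
    .update(components.join('|'))
    .digest('hex');
  return { version, components };
}

// Hypothetical: build a message for a cache miss. Returns null when a cache
// with the wanted key and version exists, so there is no miss to explain.
function explainMiss(caches, wantedKey, wantedVersion) {
  const sameKey = caches.filter((c) => c.key === wantedKey);
  if (sameKey.some((c) => c.version === wantedVersion)) {
    return null;
  }
  const otherVersions = [...new Set(sameKey.map((c) => c.version))];
  if (otherVersions.length === 0) {
    return `Cache miss: no cache with key "${wantedKey}".`;
  }
  return (
    `Cache miss: key "${wantedKey}" exists, but only with version(s) ` +
    `${otherVersions.join(', ')}; wanted ${wantedVersion}.`
  );
}

// Usage: a cache saved without a compression component, looked up with zstd.
const saved = describeVersion(['~/.cabal/store'], null);
const wanted = describeVersion(['~/.cabal/store'], 'zstd');
console.log(JSON.stringify(wanted.components));
console.log(explainMiss([{ key: 'store', version: saved.version }], 'store', wanted.version));
```

Printing the components next to the hash keeps the hash as the stored value while making a mismatch readable without reverse engineering.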

Even when the fault is found, I now have to go to some length to run release.yml in the same container as haskell-ci.yml, or delete zstd, or take some other otherwise needless steps.

  • There is no way to tell the action what compression I want.
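
Until the compression can be chosen, a check like the one from step 9 at least makes the difference visible in both workflows. A minimal sketch, assuming Node.js is available in the job and that "zstd can be started from PATH" is a fair approximation of what the action looks for (the action's own detection logic is not reproduced here); `hasZstd` is a made-up name:

```javascript
const { spawnSync } = require('child_process');

// Returns true when a `zstd` executable can be started from PATH and exits
// successfully. A missing executable surfaces as `result.error`.
function hasZstd() {
  const result = spawnSync('zstd', ['--version'], { encoding: 'utf8' });
  return !result.error && result.status === 0;
}

console.log(`zstd available: ${hasZstd()}`);
```

If the two workflows print different answers, their caches will likely end up with different version hashes.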

Can this be improved?

Metadata

Labels

area:cache-version (Issues related to version of a cache), documentation (Improvements or additions to documentation)
