Cache miss because of different compression.

I have a deterministic failure: a cache miss across two workflows. It took me a whole day to find the fault behind the failure, and I still have no good way to fix it.

### the setting

I have set up a repository specifically to work on this issue. It will be our running example. <https://github.com/kindaro/release> The logs be my witness.

### the task

The task is to write a workflow `release.yml` that makes a draft GitHub release, builds some executables and attaches them to said release. This workflow waits for another workflow `haskell-ci.yml` to complete. `release.yml` should pick up the cache made by `haskell-ci.yml` whenever it is there. It is known to be there when the target operating system is Linux.


### the failure

The failure is that the cache made by `haskell-ci.yml` is not seen by `release.yml`.

### the hunt

I have undertaken the following steps to find the fault behind the failure:

1. Check if the cache is not seen across workflows because the action is triggered by a tag.

   I made the action run on `master`. However, the failure still showed up.

1. Check that the key is matching.

   To this end, I replaced the key in both workflows by the same string without variables. However, the failure still showed up.

1. Look at the caches through GitHub API.

   This is the URL I looked at. <https://api.github.com/repos/kindaro/release/actions/caches> I spotted that there are caches with the same key but different versions. This is an error. However, this error may have any of a few roots: wrong paths, wrong compression, wrong version of the action.

1. Check if the version of the action is the same for both workflows.

   I spotted that it is different! I set it to be the same. However, the failure still showed up.

1. Reverse engineer the version values.

   [The version values are hashes, defined in an upstream library.](https://github.com/actions/toolkit/blob/500d0b42fee2552ae9eeb5933091fe2fbf14e72d/packages/cache/src/internal/cacheHttpClient.ts#L73-L90) With some trial, I found how the version values of my caches are made:

    ```
    > [["gzip"], ["zstd"], ["zstd-without-long"], [ ]].map((compression) => {return (require('crypto').createHash('sha256').update(["~/.cabal/store"].concat(compression).concat(["1.0"]).join('|')).digest('hex'))})
    [
      '97150208f15627752f4bcfa20bf9811d3688b5b274ffe014984351184e875a74',
      'bb4f75bb6ca5843bf5c49253ee4d5d67796506fa9441ca6cffe69d7960a2bcd4',
      '27a066cbe2873e20fde52127b6017ab7615ab4b954a020235b8fe7737035cbca',
      '1cf722fcd6cb72f17b62fd7954c4ae9118802bdb10b3ef077b4ec06d1f66bfbe'
    ]
    ```

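    You may see that the latter two hashes match the version values of my caches. As far as I can read the linked code, `gzip` adds no compression component at all, so the last hash (empty list) stands for `gzip` and the third one for `zstd-without-long`. So, the error is that one workflow compresses with `gzip` and the other with `zstd`. Why could that happen? My first guess was that the workflows run on different operating systems.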

1. Find out if the workflows are running on the same operating system.

   I checked my workflows and it turned out that:
   * `haskell-ci.yml` specifies `ubuntu-18.04`
   * `release.yml` specifies `ubuntu-latest`

   I set both to run on `ubuntu-18.04`. However, the failure still showed up.

1. Look at all the logs again and see if the compression correlates with the name of the workflow.

   Truly it turns out that `release.yml` always does `zstd` but `haskell-ci.yml` always does `gzip`.

1. Switch on debug output.

   Debug output told me that `zstd` is not found in the environment run by `haskell-ci.yml`.

1. Check if `zstd` is removed by an earlier step.

   I inserted steps that check if `zstd` is there. It is not there from the start.

1. Look again at `haskell-ci.yml`.

   `haskell-ci.yml` specifies a container, so all jobs are run in that container. Who knows why, but this container does not have `zstd`. Therefore, the action picks a different compression. Therefore, the version of the cache is different. Therefore, cache miss. This is my conclusion from the evidence.
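
For anyone who has to repeat the reverse engineering step, here is a minimal sketch of the guessing game as a small Node script. It uses the same recipe as my one-liner (paths, an optional compression component and the salt `1.0`, joined with `|` and hashed with SHA-256). The function names `cacheVersion` and `guessCompression` are my own, not part of the action or the toolkit, and the recipe is only what I inferred from the linked code at that commit.

```javascript
const nodeCrypto = require('crypto');

// Hash the same components as the one-liner in the hunt:
// paths, then an optional compression component, then the salt "1.0".
// `component` is null when no compression component is added.
function cacheVersion(paths, component) {
  const components = paths.concat(component ? [component] : [], ['1.0']);
  return nodeCrypto.createHash('sha256').update(components.join('|')).digest('hex');
}

// Candidate compression components I tried. The labels are mine.
const candidates = [
  { label: 'gzip', component: 'gzip' },
  { label: 'zstd', component: 'zstd' },
  { label: 'zstd-without-long', component: 'zstd-without-long' },
  { label: 'none', component: null },
];

// Return the label of the candidate whose hash equals the observed
// version, or 'unknown' when no candidate reproduces it.
function guessCompression(paths, observedVersion) {
  const hit = candidates.find(
    (c) => cacheVersion(paths, c.component) === observedVersion
  );
  return hit ? hit.label : 'unknown';
}
```

Calling `guessCompression(['~/.cabal/store'], version)` with a `version` value copied from the caches API tells which candidate, if any, produced it. The paths have to be written exactly as in the workflow file, since they are part of the hash.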

### the outcome

You now see how hard it is to find the fault behind a failure like this, where at every step of the way the view is further obscured. The fault is not local — it is found in another workflow script, and in another section thereof.

* **It is not clear what the version hash for a given cache is.** The action does not say anything about it. This is easy to make better: make the action say what the version hash is, every time. On top of that, on a cache miss, make the action print a warning if there are caches whose key matches but whose version hash does not.
* **It is not clear what the version hash is made of.** It is a hash: it cannot be undone, only guessed at. How much easier it would be if the version were a JSON string. Is there a security justification for this hashing? If so, the action should say what it makes the version hash from when it makes a cache, so that I can spot the difference.
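
To make the first wish concrete, here is a rough sketch of the warning I have in mind. It is not how the action works today. It assumes a list of cache records with `key` and `version` fields, which is what I saw in the `actions_caches` array at the API URL above. The function names and the sample values are hypothetical.

```javascript
// Caches that share the wanted key but were saved under another version.
function findVersionMismatches(caches, key, version) {
  return caches.filter((c) => c.key === key && c.version !== version);
}

// Build the message the action could print on a cache miss.
function describeMiss(caches, key, version) {
  const others = findVersionMismatches(caches, key, version);
  if (others.length === 0) {
    return `Cache miss for key "${key}".`;
  }
  const versions = [...new Set(others.map((c) => c.version))].join(', ');
  return (
    `Cache miss for key "${key}" at version ${version}, ` +
    `but the same key exists at other version(s): ${versions}. ` +
    'Paths or compression may differ between the jobs.'
  );
}

// Hypothetical records, shortened for reading. Real versions are 64 hex digits.
const sampleCaches = [
  { key: 'cabal-store', version: 'aaaa' },
  { key: 'cabal-store', version: 'bbbb' },
  { key: 'other-key', version: 'aaaa' },
];

console.log(describeMiss(sampleCaches, 'cabal-store', 'bbbb'));
```

A message like this would have pointed me at the version mismatch in the first minutes instead of after a day.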

Even when the fault is found, I now have to go to some length to run `release.yml` in the same container as `haskell-ci.yml`, or delete `zstd`, or take some other otherwise needless steps.

* **There is no way to tell the action what compression I want.**
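
Until there is such a setting, a job can at least report which compression it is likely to get. Below is a minimal sketch of the kind of check I put into my workflows, written as a Node script. It assumes only what the debug output suggested: that the action falls back to `gzip` when `zstd` is not found in the environment. The helper names are mine, and the real selection logic in the action may look at more than this.

```javascript
const { spawnSync } = require('child_process');

// True when `name --version` can be started and exits with status 0.
// A missing executable shows up as `result.error` (ENOENT).
function commandExists(name) {
  const result = spawnSync(name, ['--version'], { stdio: 'ignore' });
  return !result.error && result.status === 0;
}

// Assumption: some zstd variant is used when zstd is present, gzip otherwise.
function likelyCompression() {
  return commandExists('zstd') ? 'zstd' : 'gzip';
}

console.log(
  `zstd on PATH: ${commandExists('zstd')}, likely compression: ${likelyCompression()}`
);
```

Running this as the first step in both workflows would show at a glance whether the two environments disagree.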

Can this be improved?
