Skip to content

Conversation

taylorsilva
Copy link
Member

@taylorsilva taylorsilva commented Feb 25, 2025

Changes proposed by this PR

We found UUID collisions occurring in our cluster. We thought it was
impossible but I decided to look into the library we were using for UUID
generation and found this issue: nu7hatch/gouuid#28

Which would explain some of the errors I've seen where the only possible
explanation was that we got the same UUID twice.

Notes to reviewer

The error we were seeing was this:

atc.tracker.notify.run.get-step.perform-get.failed-to-create-container-in-garden:
new container: container "bba06975-46f6-4bbe-73f5-9e39a869719a": already exists

I dug into how we generate this handle and found we use the nu7hatch/gouuid to generate the UUID. We trusted that this UUID would be truly random so of course the database doesn't have any protections against two containers being made with the same handle. We also don't do any checks ourselves that the handle has been used. Doing so would be a waste of CPU for the web and database assuming that your UUID library generates unique UUID's.

I'm trusting this new library more since it is maintained by Google and they're probably using this in their own production apps.

Related PR concourse/houdini#2

Release Note

  • Use github.com/google/uuid to generate UUID's (v4). The previous library incorrectly implemented UUID generation and would sometimes generate the same UUID twice. Therefore it was possible for two containers or volumes to be created with the same UUID. The second container/volume would fail to create due to the UUID collision.

analytically and others added 3 commits February 25, 2025 12:00
Signed-off-by: Mathias Bogaert <mathias.bogaert@gmail.com>
Co-authored-by: Taylor Silva <dev@taydev.net>
In the case where we failed to create a container we would log the error
twice in the web node and also mark it as failed twice. The second call
to mark the container as failed would itself fail and return an error
because the container is already marked as failed.

Signed-off-by: Taylor Silva <dev@taydev.net>
We found UUID collisions occuring in our cluster. We thought it was
impossible but I decided to look into the library we were using for UUID
generation and found this issue: nu7hatch/gouuid#28

Which would explain some of the errors I've seen where the only possible
explanation was that we got the same UUID twice.

Signed-off-by: Taylor Silva <dev@taydev.net>
@taylorsilva taylorsilva requested a review from a team as a code owner February 25, 2025 18:07
taylorsilva added a commit to concourse/s3-resource that referenced this pull request Feb 25, 2025
See concourse/concourse#9083 for details

Also made some nitpick updates based on gopls deprecation warnings

Signed-off-by: Taylor Silva <dev@taydev.net>
taylorsilva added a commit to concourse/s3-resource that referenced this pull request Feb 25, 2025
See concourse/concourse#9083 for details

Also made some nitpick updates based on gopls deprecation warnings

Signed-off-by: Taylor Silva <dev@taydev.net>
taylorsilva added a commit to concourse/semver-resource that referenced this pull request Feb 25, 2025
See concourse/concourse#9083 for details

Signed-off-by: Taylor Silva <dev@taydev.net>
@taylorsilva taylorsilva force-pushed the replace-uuid-lib branch 3 times, most recently from 91ae9da to 5abd00f Compare February 25, 2025 21:53
on systems where the test was slow to run it seemed that this testing of
the batch duration would flake a lot. I think there was also a general
race condition from accessing the NewRelicBatch array since it does get
cleared out in a go routine. I left NewRelicBatch public still because
it makes the testing setup easier, but now there's a Batch() function
that can be used to safely access the array.

Signed-off-by: Taylor Silva <dev@taydev.net>
@taylorsilva taylorsilva merged commit 8dc27d0 into master Feb 26, 2025
11 checks passed
@taylorsilva taylorsilva deleted the replace-uuid-lib branch February 26, 2025 01:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants