Skip to content
This repository was archived by the owner on Feb 22, 2022. It is now read-only.

Conversation

cirocosta
Copy link
Collaborator

@cirocosta cirocosta commented Dec 1, 2018

What this PR does / why we need it:

Previously, StatefulSets were used in order to work around some
issues that users had with having pods registering & deregistering
from ATC ( see concourse/concourse#1457).

From what I understand, we were not even making use of the idea of
bringing state together across restarts / new pods being created:

rm -rf ${CONCOURSE_WORK_DIR:-/concourse-work-dir}/*

After the introduction of --ephemeral, the logic of automatically pruning a worker
that has gone away is suepr simplified: after retire, it'll be automatically pruned.

That said, I guess we can go back to the best way of deploying concourse: assuming
that the workers are ephemeral in all ways (including state).

As a result of the changes, --ephemeral comes set to true by default.

Which issue this PR fixes

Special notes for your reviewer:

I think in a close future we can have StatefulSets back in order to have persistence but I'd say that for now (especially with 4.2.1 that doesn't include some GC improvements) we are much better off going with a simple Deployment that is totally stateless.

Another argument for having the StatefulSet would be to increase the size of the disk where the worker state lives if needed. Maybe there's a way around this? Maybe we should not completely ditch the StatefulSet and try hard to go straight to the scenario where we have either SS or Deployments?

Update: Actually, now I think it's actually pretty important to have the support for StatefulSet so we can support scenarios where disk IO is the highest priority

Wdyt?

Checklist

  • DCO signed
  • Chart Version bumped
  • Variables are documented in the README.md

@helm-bot helm-bot added Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 1, 2018
@cirocosta cirocosta force-pushed the concourse/ephemeral-workers branch from 036f59a to 835760f Compare December 3, 2018 23:53
@helm-bot helm-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 3, 2018
@cirocosta cirocosta force-pushed the concourse/ephemeral-workers branch from 835760f to 3dfe9a9 Compare December 4, 2018 01:59
@helm-bot helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 4, 2018
@stale
Copy link

stale bot commented Jan 3, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2019
@cirocosta cirocosta force-pushed the concourse/ephemeral-workers branch from 3dfe9a9 to ddedb7b Compare January 5, 2019 15:35
@helm-bot helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 5, 2019
@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2019
@cirocosta cirocosta force-pushed the concourse/ephemeral-workers branch from ddedb7b to d241a56 Compare January 5, 2019 15:54
@helm-bot helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 5, 2019
@cirocosta cirocosta changed the title [stable/concourse] Replaces worker SS by Deployment [stable/concourse] Replaces worker StatefulSet by Deployment Jan 5, 2019
@voor
Copy link
Contributor

voor commented Jan 14, 2019

Since you're using a deployment now instead of a statefulset you can update the blurb on Scaling the Chart as well.

@william-tran
Copy link
Collaborator

Yes this is so awesome and moves towards an ideal view of operating concourse on k8s; if a worker dies, a new one can come up without any expectation of state left behind by the previous incarnation. I would only be comfortable with removing the option of statefulset+pvc if we ensured that space on the node is reclaimed when the worker restarts. We need to test that out on recent versions of k8s using each baggage claim driver. Early on (late 2017) I noticed that space was never being reclaimed and I eventually ran out of disk on the node, causing trouble for any other pods running on the node and not reclaimable until restarted the host vm. Having a PVC made it possible to deal with that reality by tossing the disk when it became full.

@william-tran
Copy link
Collaborator

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 14, 2019
@paulczar
Copy link
Collaborator

let's rebase this and get it ready to merge!

@cirocosta
Copy link
Collaborator Author

That's a pretty valid concern @william-tran , especially when it comes to the btrfs baggageclaim driver.

I'll rebase the PR for now but still consider it on hold until we have in our test suite a test that verifies such behavior and ensures that for either btrfs, overlay or naive we have that cleaned up (prioritizing this today) - I just went through them and there's not one that specifically verifies that in an end-to-end manner.

Please let me know if there are any further scenarios you think would be interesting to cover before we move forward!

thx!

@cirocosta cirocosta force-pushed the concourse/ephemeral-workers branch from d241a56 to b97eaca Compare January 15, 2019 15:02
@helm-bot helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 15, 2019
@cirocosta cirocosta force-pushed the concourse/ephemeral-workers branch from b97eaca to 9cb6da3 Compare January 15, 2019 15:03
@helm-bot helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 15, 2019
@au-hao
Copy link

au-hao commented Jan 15, 2019

We are facing the issue (concourse/concourse#1780) in the latest version of concourse. Would it be possible after introducing this feature, the issue could be solved? If so, could @cirocosta you help on testing it as well?

@cirocosta
Copy link
Collaborator Author

cirocosta commented Jan 20, 2019

Hey,

We created an issue under concourse/concourse to keep track of a "leak" that happens when you use btrfs on a containerized environment: concourse/concourse#3091

Another possible issue that I can see is when making use of local disks instead of persistent dir (with btrfs) - without the clean up that emptyDir would have (managed by the kubelet), the local disk would soon be filled with btrfs images there are there's no place in the code where a clean up would happen on termination right now.

The idea is to include https://github.com/concourse/baggageclaim/blob/1bd5c2e7660df90ebef1c27ea8b7bdcb4840f85a/fs/btrfs.go#L67-L101 in the lifecycle (at termination) so that when retireing, the worker could have its things cleaned up.

The only sad part of it is that these changes would arrive only in the 5.x series (which is close to being released but not there yet).

thx!

@cirocosta cirocosta force-pushed the concourse/ephemeral-workers branch from 9cb6da3 to fab640d Compare January 23, 2019 16:17
@helm-bot helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 23, 2019
Previously, `StatefulSet`s were used in order to work around some
issues that users had with having pods registering & deregistering
from ATC.

After the introduction of `--ephemeral`, such logic can be simplified
such that we can get back to the best way of deploying concourse: assuming
that the workers are ephemeral in all ways (including state).

Now, by default `--ephemeral` is set and persistence is disabled (for workers -
postgresql still defaults to being persistent).

Given that we're not using StatefulSets anymore, there's also no need
of having a headless service anymore.

Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>
@cirocosta cirocosta force-pushed the concourse/ephemeral-workers branch from fab640d to d3ad546 Compare February 16, 2019 23:23
@helm-bot helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 16, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cirocosta

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 16, 2019
@stale
Copy link

stale bot commented Mar 31, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 31, 2019
@stale
Copy link

stale bot commented Apr 14, 2019

This issue is being automatically closed due to inactivity.

@stale stale bot closed this Apr 14, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[stable/concourse] Pods/workers bounce between running and retiring
7 participants