[stable/concourse] Replaces worker StatefulSet by Deployment #9668

cirocosta · 2018-12-01T00:52:04Z

What this PR does / why we need it:

Previously, StatefulSets were used in order to work around some
issues that users had with having pods registering & deregistering
from ATC ( see concourse/concourse#1457).

From what I understand, we were not even making use of the idea of
bringing state together across restarts / new pods being created:

charts/stable/concourse/templates/worker-statefulset.yaml

Line 46 in 43a94e6

rm -rf ${CONCOURSE_WORK_DIR:-/concourse-work-dir}/*

After the introduction of --ephemeral, the logic of automatically pruning a worker
that has gone away is suepr simplified: after retire, it'll be automatically pruned.

That said, I guess we can go back to the best way of deploying concourse: assuming
that the workers are ephemeral in all ways (including state).

As a result of the changes, --ephemeral comes set to true by default.

Which issue this PR fixes

fixes [stable/concourse] Pods/workers bounce between running and retiring #8765

Special notes for your reviewer:

I think in a close future we can have StatefulSets back in order to have persistence but I'd say that for now (especially with 4.2.1 that doesn't include some GC improvements) we are much better off going with a simple Deployment that is totally stateless.

~~Another argument for having the StatefulSet would be to increase the size of the disk where the worker state lives if needed. Maybe there's a way around this?~~ Maybe we should not completely ditch the StatefulSet and try hard to go straight to the scenario where we have either SS or Deployments?

Update: Actually, now I think it's actually pretty important to have the support for StatefulSet so we can support scenarios where disk IO is the highest priority

Wdyt?

Checklist

DCO signed
Chart Version bumped
Variables are documented in the README.md

stale · 2019-01-03T03:32:25Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

voor · 2019-01-14T16:12:05Z

Since you're using a deployment now instead of a statefulset you can update the blurb on Scaling the Chart as well.

william-tran · 2019-01-14T20:07:18Z

Yes this is so awesome and moves towards an ideal view of operating concourse on k8s; if a worker dies, a new one can come up without any expectation of state left behind by the previous incarnation. I would only be comfortable with removing the option of statefulset+pvc if we ensured that space on the node is reclaimed when the worker restarts. We need to test that out on recent versions of k8s using each baggage claim driver. Early on (late 2017) I noticed that space was never being reclaimed and I eventually ran out of disk on the node, causing trouble for any other pods running on the node and not reclaimable until restarted the host vm. Having a PVC made it possible to deal with that reality by tossing the disk when it became full.

william-tran · 2019-01-14T22:46:32Z

/hold

paulczar · 2019-01-14T22:47:17Z

let's rebase this and get it ready to merge!

cirocosta · 2019-01-15T14:48:11Z

That's a pretty valid concern @william-tran , especially when it comes to the btrfs baggageclaim driver.

I'll rebase the PR for now but still consider it on hold until we have in our test suite a test that verifies such behavior and ensures that for either btrfs, overlay or naive we have that cleaned up (prioritizing this today) - I just went through them and there's not one that specifically verifies that in an end-to-end manner.

Please let me know if there are any further scenarios you think would be interesting to cover before we move forward!

thx!

au-hao · 2019-01-15T17:24:22Z

We are facing the issue (concourse/concourse#1780) in the latest version of concourse. Would it be possible after introducing this feature, the issue could be solved? If so, could @cirocosta you help on testing it as well?

cirocosta · 2019-01-20T14:08:01Z

Hey,

We created an issue under concourse/concourse to keep track of a "leak" that happens when you use btrfs on a containerized environment: concourse/concourse#3091

Another possible issue that I can see is when making use of local disks instead of persistent dir (with btrfs) - without the clean up that emptyDir would have (managed by the kubelet), the local disk would soon be filled with btrfs images there are there's no place in the code where a clean up would happen on termination right now.

The idea is to include https://github.com/concourse/baggageclaim/blob/1bd5c2e7660df90ebef1c27ea8b7bdcb4840f85a/fs/btrfs.go#L67-L101 in the lifecycle (at termination) so that when retireing, the worker could have its things cleaned up.

The only sad part of it is that these changes would arrive only in the 5.x series (which is close to being released but not there yet).

thx!

Previously, `StatefulSet`s were used in order to work around some issues that users had with having pods registering & deregistering from ATC. After the introduction of `--ephemeral`, such logic can be simplified such that we can get back to the best way of deploying concourse: assuming that the workers are ephemeral in all ways (including state). Now, by default `--ephemeral` is set and persistence is disabled (for workers - postgresql still defaults to being persistent). Given that we're not using StatefulSets anymore, there's also no need of having a headless service anymore. Signed-off-by: Ciro S. Costa <cscosta@pivotal.io>

k8s-ci-robot · 2019-02-16T23:23:36Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cirocosta

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~stable/concourse/OWNERS~~ [cirocosta]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

stale · 2019-03-31T22:28:32Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

stale · 2019-04-14T23:13:19Z

This issue is being automatically closed due to inactivity.

helm-bot added Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 1, 2018

k8s-ci-robot requested review from frodenas and william-tran December 1, 2018 00:52

cirocosta force-pushed the concourse/ephemeral-workers branch from 036f59a to 835760f Compare December 3, 2018 23:53

helm-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 3, 2018

cirocosta force-pushed the concourse/ephemeral-workers branch from 835760f to 3dfe9a9 Compare December 4, 2018 01:59

helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 4, 2018

This was referenced Dec 9, 2018

Horizontal Pod Autoscaler for workers #7758

Closed

Create Internal helm-deployed Concourse Instance concourse/concourse#2876

Closed

stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 3, 2019

cirocosta force-pushed the concourse/ephemeral-workers branch from 3dfe9a9 to ddedb7b Compare January 5, 2019 15:35

helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 5, 2019

stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2019

cirocosta force-pushed the concourse/ephemeral-workers branch from ddedb7b to d241a56 Compare January 5, 2019 15:54

helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 5, 2019

cirocosta changed the title ~~[stable/concourse] Replaces worker SS by Deployment~~ [stable/concourse] Replaces worker StatefulSet by Deployment Jan 5, 2019

william-tran mentioned this pull request Jan 14, 2019

[stable/concourse] Address PR backlog and add more maintainers #10617

Closed

12 tasks

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 14, 2019

cirocosta force-pushed the concourse/ephemeral-workers branch from d241a56 to b97eaca Compare January 15, 2019 15:02

helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 15, 2019

cirocosta force-pushed the concourse/ephemeral-workers branch from b97eaca to 9cb6da3 Compare January 15, 2019 15:03

helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 15, 2019

cirocosta mentioned this pull request Jan 18, 2019

Baggageclaim btrfs driver "leaks" loopback device attachments concourse/concourse#3091

Closed

This was referenced Jan 21, 2019

topgun/k8s: ensure worker lifecycle works concourse/concourse#3111

Closed

Workers should be smart enough not to fill up their disk concourse/concourse#1751

Closed

cirocosta force-pushed the concourse/ephemeral-workers branch from 9cb6da3 to fab640d Compare January 23, 2019 16:17

helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 23, 2019

cirocosta force-pushed the concourse/ephemeral-workers branch from fab640d to d3ad546 Compare February 16, 2019 23:23

helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 16, 2019

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 16, 2019

cirocosta mentioned this pull request Feb 28, 2019

hh: make use of local disks concourse/hush-house#12

Closed

stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 31, 2019

stale bot closed this Apr 14, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[stable/concourse] Replaces worker StatefulSet by Deployment #9668

[stable/concourse] Replaces worker StatefulSet by Deployment #9668

Uh oh!

cirocosta commented Dec 1, 2018 •

edited

Loading

Uh oh!

stale bot commented Jan 3, 2019

Uh oh!

voor commented Jan 14, 2019

Uh oh!

william-tran commented Jan 14, 2019

Uh oh!

william-tran commented Jan 14, 2019

Uh oh!

paulczar commented Jan 14, 2019

Uh oh!

cirocosta commented Jan 15, 2019

Uh oh!

au-hao commented Jan 15, 2019

Uh oh!

cirocosta commented Jan 20, 2019 •

edited

Loading

Uh oh!

k8s-ci-robot commented Feb 16, 2019

Uh oh!

stale bot commented Mar 31, 2019

Uh oh!

stale bot commented Apr 14, 2019

Uh oh!

Uh oh!

[stable/concourse] Replaces worker StatefulSet by Deployment #9668

[stable/concourse] Replaces worker StatefulSet by Deployment #9668

Uh oh!

Conversation

cirocosta commented Dec 1, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does / why we need it:

Which issue this PR fixes

Special notes for your reviewer:

Checklist

Uh oh!

stale bot commented Jan 3, 2019

Uh oh!

voor commented Jan 14, 2019

Uh oh!

william-tran commented Jan 14, 2019

Uh oh!

william-tran commented Jan 14, 2019

Uh oh!

paulczar commented Jan 14, 2019

Uh oh!

cirocosta commented Jan 15, 2019

Uh oh!

au-hao commented Jan 15, 2019

Uh oh!

cirocosta commented Jan 20, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Feb 16, 2019

Uh oh!

stale bot commented Mar 31, 2019

Uh oh!

stale bot commented Apr 14, 2019

Uh oh!

Uh oh!

cirocosta commented Dec 1, 2018 •

edited

Loading

cirocosta commented Jan 20, 2019 •

edited

Loading