Skip to content

Conversation

aoldershaw
Copy link
Contributor

@aoldershaw aoldershaw commented Feb 25, 2021

What does this PR accomplish?

The current atc/worker package is extremely fragmented with a ton of indirection and internal interfaces, presumably for the sake of easier testing. Here's a (loosely defined) dependency graph for the existing atc/worker package I put together when trying to understand how things currently interact:

worker_dep_graph

The tests all test against these internal interfaces against auto-generated fakes and break upon any sort of code restructuring.

This PR aims to tear out the existing atc/worker package and replace it with the same logic, but with much less indirection - pretty much all abstraction (besides the firm interfaces - Garden, Baggageclaim, and the DB) is limited to method calls. It brings a new test-suite that tests at a higher level against the interfaces that Garden-based workers actually depend on - Garden, Baggageclaim, and the DB - using hand-written in-memory fakes (or, in the case of the DB, an actual Postgres server).

It also introduces tries to come up with some more clean interfaces for implementing other runtimes (esp. k8s) and puts these types in the atc/runtime package.

Below is the new model of the world:
worker_dep_graph (1)

In particular, the big box in the first diagram has become the two packages atc/worker and atc/worker/gardenruntime above.

Changes proposed by this PR:

Still a wip, see the individual commits for much more detailed commit messages

Notes to reviewer:

Release Note

  • We will now error when a suitable worker does not exist rather than waiting forever.

Contributor Checklist

Reviewer Checklist

  • Code reviewed
  • Tests reviewed
  • Documentation reviewed
  • Release notes reviewed
  • PR acceptance performed
  • New config flags added? Ensure that they are added to the
    BOSH and
    Helm packaging; otherwise, ignored for
    the integration
    tests

    (for example, if they are Garden configs that are not displayed in the
    --help text).

@aoldershaw
Copy link
Contributor Author

FYI the intention is to take logic that previously existed in worker.Client and move it into the exec package - so, exec.{Get,Check,Put,Task}Step would interact directly with the runtime abstractions, which will hopefully be general enough that implementing a k8s or nomad runtime is as simple as implementing a (fairly) simple interface that is unaware of the business logic of these steps

@aoldershaw aoldershaw force-pushed the rewrite-atc-worker branch 3 times, most recently from 819f0f4 to 113d140 Compare February 28, 2021 20:04
@evanchaoli
Copy link
Contributor

Hey @aoldershaw, not sure if you have noticed #6495 that may impact this refactor.

@aoldershaw
Copy link
Contributor Author

@evanchaoli thanks, i've seen that PR, seems cool! I'm still not sure what will become of this PR, it's still very much a WIP - but if we decide to go forward with this PR, we'll be sure to bring your commits over

@aoldershaw aoldershaw force-pushed the rewrite-atc-worker branch 3 times, most recently from 7a5249a to 8c0d280 Compare March 8, 2021 01:28
@aoldershaw aoldershaw force-pushed the rewrite-atc-worker branch 6 times, most recently from 8b95a78 to f066e18 Compare March 14, 2021 17:39
@muntac
Copy link
Contributor

muntac commented Mar 15, 2021

@aoldershaw thanks for that great dependency graph! would it be easy enough to produce the same graph for the refactored code?

@aoldershaw
Copy link
Contributor Author

@muntac absolutely! I'm planning on writing out more in depth how things are structured, but hopefully this is a decent start:

worker_dep_graph (1)

The primary goal is to find a common interface that can be implemented for other runtimes (e.g. a k8s runtime). One thing I didn't illustrate in the original dependency graph is how external packages interact with the atc/worker package, which was kind of a mixed bag - a ton of functionality was baked into the atc/worker.Worker interface, which wasn't really a general interface, having a lot of Garden/Baggageclaim specifics.

So, the atc/runtime package includes a new set of interfaces that can hopefully be more general - this consists of Worker, Container, Volume, and Process (not included in the diagram). Worker provides the ability to construct Containers and Volumes, and you can start Processes on Containers to run things.

The atc/worker package provides helpers for interacting with the runtime types. Specifically, there are two main components - Pool (which represents the pool of workers, allowing you to select a worker for running a container and locate Volumes and Containers), and Streamer (which understands how to stream data between Volumes, and stream files out of Volumes). Implementations of the atc/runtime types make use of Pool and Streamer to locate external volumes and stream data from them.

atc/worker/gardenruntime is an implementation of the atc/runtime interfaces that uses Garden for running containers and Baggageclaim for managing volumes. External packages aren't aware of any Garden/Baggageclaim specifics - they're only aware of atc/runtime

atc/exec (and atc/resource) implement the logic for all the step types (most relevant here are task, get, put, check, but also steps like load_var and set_pipeline which stream files out of volumes). In a simplified way, these steps select a runtime.Worker using worker.Pool, then create a runtime.Container on that worker, then run a runtime.Process on that container. The GetStep is a bit more complicated, and has a resource.Getter helper to achieve a similar thing).

@aoldershaw aoldershaw changed the title [wip] simplify atc/worker package and test at a higher level [wip] simplify atc/worker package and extract runtime abstractions Mar 16, 2021
@aoldershaw aoldershaw changed the title [wip] simplify atc/worker package and extract runtime abstractions simplify atc/worker package and extract runtime abstractions Mar 22, 2021
@muntac
Copy link
Contributor

muntac commented Mar 22, 2021

Still wrapping my head around this. Initially tried to account for the interactions of the atc/worker package with the rest of the atc i.e. what are the files outside the worker that rely on it and which functionality is being used. This is to ensure the refactor accounts for all of this, and to get a high level idea for what functionality the rest of the atc expects from the atc/worker package.

Files that use atc/worker:

(relative to atc/)
api/handler.go (for containerserver)
api/containerserver/hijack.go
api/containerserver/server.go
api/artifactserver/create.go
api/artifactserver/server.go

atccmd/command.go

engine/build_step_delegate.go
engine/builder.go
engine/check_delegate.go
engine/delegate_factory.go
engine/get_delegate.go
engine/put_delegate.go
engine/task_delegate.go

exec/artifact_input_step.go
exec/artifact_output_step.go
exec/build_step_delegate.go
exec/check_step.go
exec/get_step.go
exec/load_var_step.go
exec/put_step.go
exec/set_pipeline_step.go
exec/task_config_source.go
exec/task_step.go

Still figuring out the functionality bit, and will then dive back into commits.

Current interface exposed by pool

pool
-> NewPool(WorkerProvider)
-> FindContainer(lager.Logger, int, string) (Container, bool, error)
-> FindVolume(lager.Logger, int, string) (Volume, bool, error)
-> CreateVolume(lager.Logger, VolumeSpec, WorkerSpec, db.VolumeType) (Volume, error)
-> ContainerInWorker(lager.Logger, db.ContainerOwner, WorkerSpec) (bool, error)
-> SelectWorker(context.Context, db.ContainerOwner,ContainerSpec,WorkerSpec,ContainerPlacementStrategy) (Client, error)

Feel free to suggest a different approach.

@aoldershaw
Copy link
Contributor Author

aoldershaw commented Mar 23, 2021

Still figuring out the functionality bit

@muntac would it help if I went over each of the interactions with atc/worker in a bit more depth?

api/containerserver/hijack.go - this is the API handler that deals with running fly hijack/fly intercept, which allows you to run a process (typically a shell, but not necessarily) on an existing container. So, it interacts with atc/worker to locate a container, start a process, continually update the TTY (esp. terminal size), and store some state in the DB ("the container was hijacked at this time, so don't GC it")


api/artifactserver/create.go - this is the API handler used by fly execute to upload artifacts to a worker, where artifacts refers to local inputs that you upload to Concourse. It interacts with atc/worker to create an empty volume on a random worker and then stream in the artifact contents. This artifact (and its corresponding volume) can then be used to run steps (specifically, the task step that's run in fly execute). It is used by the ArtifactInput step

exec/artifact_input_step.go - this step runs in builds created by fly execute. It interacts with atc/worker to locate the volume that was previously created in api/artifactserver/create.go and makes it available to other steps.

exec/artifact_output_step.go - this step also runs in builds created by fly execute to mark a volume created by the task as a downloadable artifact. Previously, it interacted with atc/worker to locate the volume - now, it only interacts with the DB to mark a volume as an artifact **. Once an artifact has been "output" by this step, it can be downloaded in the api via api/artifactserver/get.go

** This is due to the refactor...previously, we used "Artifact" to mean two things - the uploadable/downloadable volumes I've been describing that are used in fly execute, but also as a general handle to a volume (formerly runtime.Artifact). If that's confusing, I'm not surprised - runtime.Artifact didn't make much sense to me to be honest, since it just seemed like an abstraction on top of volumes.

api/artifactserver/get.go - as I alluded to, this is the counterpart to exec/artifact_output_step.go - it's used by fly execute to download the task's outputs after it finishes. It interacts with atc/worker to locate the volume and stream the contents out.

So, the high-level flow for fly execute is (was):

fly execute -c task.yml -i some-input=./input -o some-output=./output
  -> POST api/artifactserver/create.go
    -> volume := atc/worker.Pool.CreateVolume() // for the artifact some-input
    -> volume.InitializeArtifact() // mark in the DB that this is an artifact
    -> volume.StreamIn(req.Body)
  -> Create a build with an ArtifactInputStep and ArtifactOutputStep (and a TaskStep of course)
    -> Run ArtifactInputStep
      -> atc/worker.Pool.LocateVolume() // find the artifact created earlier
      -> Register runtime.Artifact on build // basically just keep track of the fact that `some-input` points to that volume
    -> Run TaskStep
      -> Will give more details later, but this also keeps track of the output volume for `some-output`
    -> Run ArtifactOutputStep
      -> volume := atc/worker.Pool.LocateVolume() // find the volume that `some-output`, created in the task step, points to
      -> volume.InitializeArtifact() // mark in the DB that this is an artifact
  -> GET api/artifactserver/get.go
    -> volume := atc/worker.Pool.LocateVolume() // find the artifact volume created in ArtifactOutputStep
    -> volume.StreamOut() // download the contents via baggageclaim

The various delegates in engine/*_delegate.go use an abstraction called the worker.ArtifactSourcer. This abstraction was removed in the refactor. In effect, worker.ArtifactSourcer's only purpose was to convert from a type that the exec package understands (runtime.Artifact) to a type that the worker package understands (worker.ArtifactSource).

worker.ArtifactSource is an interface that simply tells you whether an artifact (volume) is on a worker. It was previously used to model both input volumes (which also implemented worker.StreamableArtifactSource, which allows for streaming the volume to another worker), as well as task caches (which are not streamable, and simply remain on the worker they were created on). These abstraction was removed in favour of runtime.Volume (which can stream in/out) and worker.Streamer (which knows how to stream data to/from runtime.Volumes, optionally using p2p streaming). Now, the exec package just references runtime.Volumes directly, so there's no need for a conversion layer like we had with worker.ArtifactSourcer.

All that to say the engine no longer depends on worker post-refactor (except for basic dependency injection, since the engine package threads the worker dependencies to the exec.*Steps). There is one exception to that, actually - engine/task_delegate.go implements a SelectWorker method which uses the worker.Pool to, well, select a worker! The difference between this method and worker.Pool.SelectWorker is that the latter doesn't retry when no worker is available, while the former does (it loops until a worker is available). Note that with #6635, all steps will share the same codepath (which exists on the worker.Pool itself), so we can get rid of engine.TaskDelegate.SelectWorker.

@muntac
Copy link
Contributor

muntac commented Mar 24, 2021

@aoldershaw saw this but forgot to respond. This is great. Thanks so much!

@aoldershaw
Copy link
Contributor Author

@muntac glad to hear it! still working on it, haven't gotten to the exec package interactions yet, but will update that comment soon

@aoldershaw aoldershaw force-pushed the rewrite-atc-worker branch 3 times, most recently from 8fa5652 to f10ec1f Compare April 16, 2021 17:24
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
this commit adds a new abstraction, runtime.Artifact, that is simply
some chunk of data that can be streamed out. runtime.Artifact is a
generalization of runtime.Volume. the exec package only interacts with
runtime.Artifact - it doesn't care whether the Artifact is from a Volume
or anything else.

note that we don't currently have any types of Artifact that aren't
Volumes - however, this means we could support mounting volumes to
workers from content that is generated on the ATC in memory (rather than
needing a dummy task that creates an output with the file contents,
which is much more heavy weight).

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
...for the fewest-build-containers strategy. now that this is cached, it
should be very quick to access

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
#7336 sped up the TSA tests by compiling the test
binary with a package constant set. however, this package was moved, so
we weren't setting the constant properly. unfortunately, this doesn't
result in a compilation error, and fails silently

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
@aoldershaw aoldershaw merged commit 8893b0e into master Aug 26, 2021
@aoldershaw aoldershaw deleted the rewrite-atc-worker branch August 26, 2021 18:44
aoldershaw added a commit that referenced this pull request Aug 27, 2021
I gather these setting the workdir isn't required for these steps. After
#6597, the number of volumes on a given worker shot
up by ~50% (on a stress testing environment that solely runs check+get
steps) - this is a result of the extra volume per check+get container.
these volumes don't have a significant impact since they're empty, but
they will mess up people's metrics and placement strategies

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
aoldershaw added a commit that referenced this pull request Aug 27, 2021
I gather these setting the workdir isn't required for check steps. After
#6597, the number of volumes on a given worker shot
up by ~50% (on a stress testing environment that solely runs check+get
steps) - this is a result of the extra volume per check container.
these volumes don't have a significant impact since they're empty, but
they will mess up people's metrics and placement strategies

Signed-off-by: Aidan Oldershaw <aoldershaw@pivotal.io>
@clarafu clarafu changed the title simplify atc/worker package and extract runtime abstractions Simplify atc/worker package and extract runtime abstractions Sep 7, 2021
@clarafu clarafu added the release/undocumented This didn't warrant being documented or put in release notes. label Sep 7, 2021
@clarafu clarafu added enhancement and removed misc labels Sep 20, 2021
taylorsilva added a commit to concourse/docs that referenced this pull request Feb 20, 2025
concourse/concourse#6597 changed how the `path` works in Concourse and
allowed users to specify an absolute path. The exact commit is this one:

concourse/concourse@2ba0164

This has been the current behaviour since 7.5.0 so I don't think there's
any going back on this behaviour now. I think it may make caching stuff
a lot easier for folks to work with. I'm generally in favour of this
change which I think was kinda snuck in. It was a massive PR.

Signed-off-by: Taylor Silva <dev@taydev.net>
taylorsilva added a commit that referenced this pull request Feb 20, 2025
The behaviour of this was changed back in 7.5.0 by
#6597

Previously only relative paths were allowed. No tests were added to
handle these new cases. In testing this I also found that paths with
parent directories were also allowed and would work. You'd get the input
mounted above the task's CWD. This seems like a bad thing to encourage
so I've locked things down to only allow absolute and relative paths.
Allowing parent directories seems like an easy footgun for users to fall
onto.

We now have test coverage for:
- Allowing absolute paths in task inputs/outputs
- Ignoring paths with parent directory references in them

Signed-off-by: Taylor Silva <dev@taydev.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deploy enhancement release/undocumented This didn't warrant being documented or put in release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants