We have a product which customers install and run themselves. Customers want a disaster recovery strategy for how they operate this product, so we want to ship them Concourse pipelines containing jobs/tasks/resources which encapsulate extracting a backup from the product, encrypting the backup artifact, and later decrypting and restoring it to the product.
The downloaded backup artifact can be several terabytes in size. It may contain sensitive data that the customer wishes to keep confidential, hence the desire to encrypt the backup artifact. We plan to use symmetric key cryptography for encryption and decryption; the key itself is of course also sensitive and must be kept confidential.
There are open questions about how we will implement these Concourse pipelines, as there are various tradeoffs in terms of UX, security, performance, and pluggability (third-party partners may provide the encryption solution). If the Concourse team were to implement new security and/or performance features that we could leverage in some potential implementations but not others, that would certainly influence the tradeoff decisions we make. Here are some of the tradeoffs we're considering:
- The backup pipeline may generate the key, in which case the key must be stored somewhere for later use in a recovery pipeline. In this case, the key may be:
  - pushed out as a PUT of an existing resource type;
  - a PUT of a custom resource type; or
  - uploaded "at runtime" within a task script.
- The backup pipeline may instead GET the key, where the onus is on the customer to make sure it's already stored somewhere. Again, the options are:
  - a GET of an existing resource type;
  - a GET of a custom resource type; or
  - downloading "at runtime" within a task script.
- The download of the backup artifact and subsequent encryption could happen in the same task or in two consecutive steps (a rough sketch of the two-step shape follows this list).
- Likewise, the decryption of the stored backup artifact and subsequent restore into the product could happen in one task or two.
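
To make one combination concrete, here's a rough sketch of a backup job that GETs a customer-provided key, splits extraction and encryption into two tasks, and PUTs the encrypted artifact via an existing resource type. Everything in it is hypothetical: the `s3` resources, bucket names, key format, and the `extract-backup` command are stand-ins for whatever the product and a partner's encryption solution actually provide.

```yaml
# Sketch only: resource names, buckets, and the extract-backup command
# are hypothetical placeholders, not part of any real product.
resources:
- name: encryption-key          # customer pre-stores the key; the pipeline GETs it
  type: s3
  source: {bucket: customer-secrets, versioned_file: backup.key}
- name: encrypted-backups       # encrypted artifact is PUT via an existing resource type
  type: s3
  source: {bucket: customer-backups, regexp: backup-(.*).enc}

jobs:
- name: backup
  plan:
  - get: encryption-key
  - task: extract-backup        # step 1: pull the plaintext backup out of the product
    config:
      platform: linux
      image_resource: {type: registry-image, source: {repository: ubuntu}}
      outputs: [{name: plaintext-backup}]
      run:
        path: sh
        args: ["-ec", "extract-backup > plaintext-backup/backup.tar"]
  - task: encrypt-backup        # step 2: the plaintext volume is handed to a second container
    config:
      platform: linux
      image_resource: {type: registry-image, source: {repository: ubuntu}}
      inputs: [{name: plaintext-backup}, {name: encryption-key}]
      outputs: [{name: encrypted-backup}]
      run:
        path: sh
        args:
        - -ec
        - |
          openssl enc -aes-256-cbc -pbkdf2 \
            -pass file:encryption-key/backup.key \
            -in plaintext-backup/backup.tar \
            -out encrypted-backup/backup-1.enc
  - put: encrypted-backups
    params: {file: encrypted-backup/backup-*.enc}
```

The one-task variant would collapse the two `task` steps into a single `run` script that pipes the extraction command straight into `openssl`, which avoids ever materializing the plaintext artifact as a task output volume at all.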
My primary concern is the confidentiality of the sensitive data. These solution options have different implications for how many places (volumes/caches/containers) plaintext copies of the backup artifact live in, how long they live for, and how trivially they can be accessed (e.g. via `fly intercept`). Ideally, I would want to control the volume/resource/container caching and retention policy so that the artifact lives for the absolute minimum time it's needed and cannot be accessed by `fly intercept` or any other means. Likewise for the key.
My secondary concern, although it really is a big one that can't be ignored, is performance. I suspect that if the backup+encrypt (or decrypt+restore) work is done in separate tasks, there can be major performance penalties in having to shuttle around the potentially-several-terabyte backup artifact. What are the expected relative performance characteristics of:
- having one task;
- having two tasks which are tagged to ensure they run on the same worker (sketched below); and
- having two tasks with no particular worker tags?
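
For reference, the second option is just the two-task job from the earlier sketch with matching tags on both steps. One caveat, based on my understanding of how tags work: a tag selects a pool of matching workers, so this only guarantees co-location if exactly one worker registers the tag.

```yaml
# Sketch only: the task configs are no-op placeholders standing in for the
# real extract/encrypt tasks from the earlier example.
jobs:
- name: backup
  plan:
  - task: extract-backup
    tags: [backup]          # constrains this step to workers tagged `backup`
    config: &noop
      platform: linux
      image_resource: {type: registry-image, source: {repository: ubuntu}}
      run: {path: "true"}
  - task: encrypt-backup
    tags: [backup]          # same tag => same worker pool; with a single tagged
    config: *noop           # worker, the artifact never leaves that machine
```

My understanding is that when consecutive steps land on different workers, the intermediate volume has to be streamed between them, so for a multi-terabyte artifact I'd expect the third option to be dominated by that transfer.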