Add cvmfs snapshotter to containerd #2467
Conversation
Adding @AkihiroSuda to cc, as he worked on similar issues in https://github.com/AkihiroSuda/filegrain
🎉 Also, can we have …
snapshots/proxy/proxy.go (outdated):

```go
	"github.com/containerd/containerd/snapshots"
	protobuftypes "github.com/gogo/protobuf/types"
)

func init() {
```
Shouldn't this be configurable?
Sorry, it was not supposed to be in the PR.
I reverted the test.
@AkihiroSuda, at the moment this snapshotter works only with thin images; however, it is trivial to make it work with normal ones. It is a little tricky to decide what to allow here; I am sure that someone would prefer a very rigid interface that works only with … Most likely, if we want to support regular images we will base the implementation on the overlayfs snapshotter. Internally at CERN we are developing utilities to translate normal images into thin ones; it is still a work in progress and in its infancy, but it is open source: https://github.com/cvmfs/docker-graphdriver/tree/devel/daemon
Also cc @stevvooe, as we've discussed this a couple of times in the past.
"syscall" | ||
|
||
"github.com/pkg/errors" | ||
logg "github.com/sirupsen/logrus" |
Don't use logrus directly. Just use `log.L`.
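A minimal sketch of what the suggested change looks like, assuming containerd's `log` package (the function and message names here are illustrative):

```go
package snapshot // illustrative package name

import (
	"context"

	"github.com/containerd/containerd/log"
)

// report logs through containerd's log package instead of importing
// logrus directly: log.L is the package-level entry, and log.G(ctx)
// returns the logger attached to the context.
func report(ctx context.Context, err error) {
	log.L.WithError(err).Warn("snapshot cleanup failed")
	log.G(ctx).Debug("removing snapshot")
}
```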
```go
type snapshotter struct {
	root        string
	ms          *storage.MetaStore
	asyncRemove bool
```
Describe what these fields are for.
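For instance, the fields could be documented along these lines (a sketch; the descriptions are inferred from the overlay snapshotter this code appears to be based on):

```go
type snapshotter struct {
	// root is the snapshotter's root directory, holding the metadata
	// database and the per-snapshot directories.
	root string
	// ms is the transactional metadata store tracking snapshot keys,
	// parents, and active/committed state.
	ms *storage.MetaStore
	// asyncRemove defers directory deletion to a later cleanup pass
	// instead of removing synchronously in Remove.
	asyncRemove bool
}
```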
```go
	if err != nil {
		return err
	}
```
I'm not seeing any actual filesystem operations happening here. How does this actually work?
```go
	root         string
	ms           *storage.MetaStore
	asyncRemove  bool
	cvmfsManager util.ICvmfsManager
```
This isn't accessed anywhere. What is this for?
I'm extremely confused: this PR has a snapshotter that doesn't really do anything but call into the transactional store. It looks like a clone of overlay, but I am not understanding the differences. I think it would be most helpful if you could pare down this proposal to the components that affect containerd and snapshotters. For example, we know what containers are, so there is no need to explain that. Specifically, I would focus on the following questions: …
That said, I think this is a good direction, and having support at this level will be a huge benefit. However, we need to make sure that cvmfs is the right direction and won't incur baggage that other users have to carry.
@stevvooe Many thanks for looking into this! I think I can address some of your points; for the implementation details I'll leave it to @siscia. Our intention is not necessarily to make a cvmfs-specific snapshotter. Cvmfs in this PR is mainly a demonstrator. We'd like to have a snapshotter that can start a container with a root file system in a mounted read-only directory -- in the same way you'd do a chroot. The directories with the container root file systems could then sit on any local or network mount point. Regarding cvmfs itself, it is used mostly in the scientific world. Fully open sourced, BSD licensed, on github, documented. Besides high-energy physics, it is used for instance also by LIGO (gravitational wave detectors), the EUCLID space mission, bio science, some industry users, etc. From the numbers we know, there are more than 1 billion files under management and on the order of 100k world-wide installed clients. Fuse performance is less of an issue than one might think. Most of the calls are … But again, we'd be happy with anything that lets us operate containerd on container images extracted in a directory.
Hi @stevvooe, the PR was indeed based on the overlay snapshotter. I was in doubt whether to just call the overlay snapshotter methods here and there, but in the end I opted to duplicate them; we can fix that. The main difference is in the … function. Those directories are where the cvmfs file system is mounted, so instead of getting the directories as tarballs and decompressing them inside …, the layers are already available there. Let me now answer your questions.
On the client, i.e. the machine that reads the files inside CVMFS, we only have read-only rights; this is what makes the whole concept possible. If we also allowed writing files, it would be an NFS, with all the problems that implies. However, containerd requires writing into its root. A possible solution would be to use overlayfs to create a writable layer on top of the files provided by CVMFS, but then you would have two layers of overlayfs running one on top of the other, which is not possible.
As Jakob already mentioned, yes, it can be run by anybody.
We would need a server running the CVMFS server and to connect the CVMFS client to it. I do agree that it is some work, but it is not impossible.
We are definitely getting comparable performance with FUSE when the "cache is warm". When the data is not yet in the system and needs to be retrieved via the network it is of course slower, but this should be compared with starting a container that is not on the system yet and needs to be downloaded. Are you looking for specific benchmarks? I can definitely prepare some for the sake of discussion, even though I have already linked some in the first post.
If we really want to go filegrain, we need to leave behind the concept of layers and tarballs. Our interoperable solution has been to create these thin layers, which are just like normal layers but do not contain the "data"; instead they are the "recipe" to get the data. Can this approach be made more general, to work with a wider range of technologies? The answer is clearly yes. I believe it could help the discussion to make clear what a thin layer really is. The JSON below shows the shape of the thin layer for redis:4.
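(A sketch: the digests and URLs are placeholders rather than the real redis:4 values, and the field names follow my reading of the docker-graphdriver repository, so the exact format may differ. The Go types show how a consumer might model the recipe.)

```go
package main

import (
	"encoding/json"
	"fmt"
)

// thinLayer and thinImage model the recipe described below: one entry
// per layer of the original image, each carrying a URL that locates the
// layer contents inside a CVMFS repository.
type thinLayer struct {
	Digest string `json:"digest"`
	URL    string `json:"url"`
}

type thinImage struct {
	Version string      `json:"version"`
	Layers  []thinLayer `json:"layers"`
}

func main() {
	// Placeholder recipe standing in for the redis:4 thin layer.
	recipe := []byte(`{
		"version": "0.1",
		"layers": [
			{"digest": "sha256:aaaa...", "url": "cvmfs://thin.example.cern.ch/layers/aaaa..."},
			{"digest": "sha256:bbbb...", "url": "cvmfs://thin.example.cern.ch/layers/bbbb..."}
		]
	}`)

	var img thinImage
	if err := json.Unmarshal(recipe, &img); err != nil {
		panic(err)
	}
	for _, l := range img.Layers {
		fmt.Println(l.Digest, "->", l.URL)
	}
}
```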
For each layer in the normal docker image we have a URL; what the snapshotter does is identify where the layers are in the filesystem and provide them inside a mountpoint. The code itself needs to be cleaned up, and I am going to work on that quite soon, but I would really appreciate it if this conversation keeps moving forward. Please, if I wasn't clear on any particular point, let me know.
Follow-up: #2943
Hi all,
this PR is an attempt to merge the world of HEP (High Energy Physics) and the wider industry with respect to the distribution of executables and software.
I believe there is still some work to do, but I am definitely keen on gathering feedback from a larger community.
I will start with a little bit of background about what software distribution looks like inside CERN and other HEP computing centers. Then I will explain how this can be exploited by the wider computing industry, show some of the work already done, and finally go into the details of how we can merge these two worlds.
Background
CERN has used computing technology since basically forever; here we will focus on the analysis of the data that comes from the accelerator and on physics simulation.
To a first approximation, both can be considered embarrassingly parallel problems, so the time needed to get a result is strictly correlated to the amount of computing resources we use.
Hence it is mandatory for us to move data and software onto the computing nodes as fast as possible; here we are focusing mostly on the software side.
Possible approaches
There are several possible solutions to the problem of provisioning computing nodes with software.
The most naive one is to simply use the operating system package manager (`apt` or `yum`). This approach can work on a small scale, where there are not too many nodes to provision, where the software stack is limited, and where reproducibility of the results is not of great interest. With a lot of care and enough resources all these limitations can be overcome, but it will be extremely expensive.
A more sophisticated approach exploits container technology; in this way it is easier to guarantee the reproducibility of the results.
This approach is limited by the size of the software stack: if we try to create containers with the whole software stack that the analysts need, those containers will be too big to be manageable.
On the other hand, creating a lot of small images, each for a unique task, is unmanageable from the point of view of complexity.
Moreover, moving images is quite network intensive, even if the containers can be cached, and network bandwidth is a very precious resource that we would like to use mostly for moving data.
CERN approach
CERN provided a solution to this problem in the form of CVMFS (the CernVM File System).
CVMFS was born in a technology niche, to solve a problem unique to CERN, a few years before container technology was mainstream. The first version of the software was released in 2008 (exactly 10 years ago), one year earlier than the first appearance of the Go language and five years earlier than the first release of docker.
This allowed it to take a completely different route from the one that the computing industry has taken so far, with different trade-offs that we believe are very interesting to explore today.
Repository
The main idea of CVMFS is to provide an HTTP-reachable software repository in which to install the necessary software and dependencies.
The contents of the repository are content-addressable, so it is possible to identify and download each element unambiguously.
Finally, a catalog of the contents of the repository is provided as an SQLite file.
Hence, by communicating with the repository it is possible to download the software catalog, identify which piece of software we are interested in, download it, and finally run it.
This approach allows having a huge amount of software installed.
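To make the content-addressing concrete, here is a hedged sketch of fetching one object over HTTP. The `/data/<two hex chars>/<rest>` sharding mirrors how CVMFS lays out its object store, but the repository URL and hash below are hypothetical:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetchObject downloads a single content-addressed object from an
// HTTP-reachable repository: the object's hash alone identifies it.
func fetchObject(repoURL, hash string, dst io.Writer) error {
	url := fmt.Sprintf("%s/data/%s/%s", repoURL, hash[:2], hash[2:])
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("fetch %s: %s", url, resp.Status)
	}
	_, err = io.Copy(dst, resp.Body)
	return err
}

func main() {
	// Hypothetical repository and object hash.
	repo := "http://cvmfs.example.org/cvmfs/sw.example.org"
	if err := fetchObject(repo, "0123456789abcdef0123456789abcdef01234567", os.Stdout); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```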
Client
Analyzing and using the repository catalog by hand would be tedious and error prone, so we provide a client for CVMFS.
The client automatically connects to the known repositories, downloads the manifests, and populates a specific directory with the content of the repository.
The implementation of the client is based on FUSE. This allows deferring the download of a file to the moment when the software is actually required (the `open` syscall) and using the information in the catalog to "virtually" populate the directory (the `readdir` syscall). This saves a lot of network bandwidth, since only the opened files are downloaded, at the cost of higher latency in the `open` syscall, since the file needs to be downloaded first. Of course, this can be mitigated if we know in advance which files are needed.
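A minimal sketch of the lazy-download idea (not the real CVMFS client, which is a FUSE filesystem: the in-memory catalog map here stands in for the SQLite catalog, and `fetchObject` is the sketch from the previous section):

```go
import (
	"fmt"
	"os"
	"path/filepath"
)

// lazyOpen opens path, downloading its content-addressed backing object
// into a local cache directory only on first access; later opens are
// served entirely from the local cache.
func lazyOpen(repoURL, cacheDir, path string, catalog map[string]string) (*os.File, error) {
	hash, ok := catalog[path] // the real client resolves this via the SQLite catalog
	if !ok {
		return nil, fmt.Errorf("%s: not in catalog", path)
	}
	cached := filepath.Join(cacheDir, hash)
	if _, err := os.Stat(cached); os.IsNotExist(err) {
		// First open: pay the network cost now, serve locally afterwards.
		f, err := os.Create(cached)
		if err != nil {
			return nil, err
		}
		if err := fetchObject(repoURL, hash, f); err != nil {
			f.Close()
			return nil, err
		}
		f.Close()
	}
	return os.Open(cached)
}
```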
CVMFS Workflow
The intended workflow for CVMFS is to create a software repository with all the necessary components and dependencies, then install and connect the client on each computing machine and treat all the computing nodes as homogeneous.
At the cost of setting up the repository once, we: …
Introducing containers
Containers are the industry solution to a similar problem.
They provide similar ease in managing dependencies and reproducing computing environments.
However, they take a different approach to distributing the files.
The container filesystem is split into layers, and each layer is a tarfile that gets compressed and distributed.
At the cost of downloading and storing the layers, we obtain the capability to: …
It is a different approach, not better or worse, but simply with different tradeoffs.
Merging the two approaches
Rather recent research from Harter et al., Slacker: Fast Distribution with Lazy Docker Containers, shows that only ~7% of the downloaded bytes are actually used by a container and that, at the same time, downloading the image accounts for a considerable share of container startup time.
We believe that using CVMFS could help in this regard: by providing the possibility of addressing single files, we could download only the ~7% that is used, saving bandwidth and startup time.
In order to merge these two approaches, we introduce the concept of a thin image: a normal docker image that contains only a single OCI layer with a single file, the `thin.json` recipe. This recipe is then used to recreate the normal docker image on disk, but using files coming from CVMFS, bringing all the advantages mentioned above.
This approach has already been implemented as a docker plugin, with great results: a large reduction in startup time and in the bandwidth used.
Technical details
In order to integrate with the wider docker ecosystem, we start from a standard docker image and convert it into a "thin image".
We store every layer inside a CVMFS repository and construct the recipe file, `thin.json`. The snapshotter in this PR is modelled on the standard overlayfs snapshotter, with the addition of reading this recipe file and using its content to provide the lowerdir to mount inside containerd.
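In rough terms, the mount construction could look like the sketch below, assuming each recipe URL maps onto a directory under the local /cvmfs mountpoint (`thinImage` is the recipe type sketched earlier; the option layout follows containerd's stock overlay snapshotter, not the PR's exact code):

```go
import (
	"strings"

	"github.com/containerd/containerd/mount"
)

// mountsFromRecipe turns the layer list of a thin.json recipe into one
// overlay mount: the CVMFS-backed directories form the read-only
// lowerdir stack, while a normal snapshot directory takes the writes.
func mountsFromRecipe(img thinImage, upperDir, workDir string) []mount.Mount {
	lowers := make([]string, 0, len(img.Layers))
	for _, l := range img.Layers {
		// e.g. cvmfs://thin.example.cern.ch/layers/aaaa...
		//  ->  /cvmfs/thin.example.cern.ch/layers/aaaa...
		lowers = append(lowers, "/cvmfs/"+strings.TrimPrefix(l.URL, "cvmfs://"))
	}
	return []mount.Mount{{
		Type:   "overlay",
		Source: "overlay",
		Options: []string{
			"workdir=" + workDir,
			"upperdir=" + upperDir,
			"lowerdir=" + strings.Join(lowers, ":"),
		},
	}}
}
```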
Goal of this PR
This PR is a first attempt to bring the same functionality to containerd, exploiting a rudimentary snapshotter. Our goal at the moment is mostly to create awareness of our solution and to gather feedback from the wider community.
Reference
CVMFS Documentation: http://cvmfs.readthedocs.io/en/stable/
CC
@jblomer @radupopescu @gganis @lukasheinrich @rochaporto