
@siscia siscia commented Jul 16, 2018

Hi all,

this PR is an attempt to bring together the world of HEP (High Energy Physics) and the wider industry with respect to the distribution of executables and software.

I believe there is still some work to do, but I am definitely keen on gathering feedback from a larger community.

I will start with a little background on what software distribution looks like inside CERN and other HEP computing centers. Then I will explain how this can be exploited by the wider computing industry, show some of the work already done and, finally, go into the details of how we can merge these two worlds.

Background

CERN has relied on computing technology practically forever; here we will focus on the analysis of the data that comes from the accelerator and on physics simulations.

To a first approximation, both problems can be considered embarrassingly parallel, so the time needed to get a result is directly tied to the amount of computing resources we use.

Hence it is mandatory for us to move data and software onto the computing nodes as fast as possible; here we focus mostly on the software side.

Possible approaches

There are several possible solutions to the problem of provisioning computing nodes with software.

The most naive one is to simply use the operating system package manager (apt or yum). This approach can work at a small scale, where there are not too many nodes to provision, the software stack is limited, and reproducibility of the results is not of great interest.
With a lot of care and enough resources all these limitations can be overcome, but it would be extremely expensive.

A more sophisticated approach exploits container technology, which makes it easier to guarantee the reproducibility of the results.
This approach is limited by the size of the software stack: if we try to create containers with the entire software stack that the analysts need, those containers will be too big to be manageable.
On the other hand, creating a lot of small images, each for a unique task, becomes unmanageable in terms of complexity.
Moreover, moving images is quite network intensive, even if the containers can be cached, and network bandwidth is a very precious resource that we would like to use mostly for moving data.

CERN approach

CERN provided a solution for this problem in the form of CVMFS (Cern Virtual Machine File System).

CVMFS was born in a "technology niche" to solve a problem unique to CERN, a few years before container technology became mainstream. The first version of the software was released in 2008 (exactly 10 years ago), 1 year before the first appearance of the Go language, and 5 years before the first release of Docker.
This allowed it to take a completely different route from the one that the computing industry has taken so far, with different trade-offs that we believe are very interesting to explore today.

Repository

The main idea of CVMFS is to provide an HTTP-reachable software repository in which to install the necessary software and dependencies.

The contents of the repository are content-addressable, so it is possible to identify and download each element unambiguously.

Finally, a catalog of the contents of the repository is provided as an SQLite file.

Hence, by communicating with the repository, it is possible to download the software catalog, identify which piece of software we are interested in, download it, and finally run it.

This approach allows a huge amount of software to be installed.
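
To make the idea concrete, here is a minimal sketch of a content-addressable download, assuming the repository serves objects under a /data/<hash> path; the URL layout and the helper below are simplified illustrations, not the actual CVMFS wire format or catalog handling.

package repo

import (
    "crypto/sha1"
    "fmt"
    "io"
    "net/http"
    "os"
)

// fetchObject downloads one content-addressed object and verifies that the
// bytes really hash to the requested name, so each element of the repository
// is identified unambiguously.
func fetchObject(repoURL, hash, dest string) error {
    resp, err := http.Get(fmt.Sprintf("%s/data/%s", repoURL, hash))
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    out, err := os.Create(dest)
    if err != nil {
        return err
    }
    defer out.Close()

    // Write to disk while hashing the stream.
    h := sha1.New()
    if _, err := io.Copy(io.MultiWriter(out, h), resp.Body); err != nil {
        return err
    }
    if got := fmt.Sprintf("%x", h.Sum(nil)); got != hash {
        return fmt.Errorf("hash mismatch: want %s, got %s", hash, got)
    }
    return nil
}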

Client

Analyzing and using the repository catalog by hand would be tedious and error-prone, so we provide a client for CVMFS.

The client automatically connects to the known repository, downloads the manifests and populates a specific directory with the content of the repository.

The implementation of the client is based on FUSE; this allows deferring the download of a file to the moment when the software is actually required (the open syscall) and using the information in the catalog to "virtually" populate the directory (the readdir syscall).

This saves a lot of network bandwidth since only the opened files are downloaded, at the cost of higher latency in the open syscall, since the file needs to be downloaded first. Of course, this can be mitigated if we know in advance which files are needed.
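
As a rough illustration of this lazy-download idea (all types and the fetch callback below are hypothetical, this is not the actual CVMFS client code), a catalog-backed filesystem can answer directory listings from the catalog alone and fetch content only when a file is opened:

package lazyfs

import "os"

// CatalogEntry describes one file as recorded in the repository catalog.
type CatalogEntry struct {
    Path string // path as exposed in the mounted directory
    Hash string // content hash used to fetch the data on demand
}

type lazyFS struct {
    catalog map[string]CatalogEntry           // loaded from the SQLite catalog
    cache   map[string][]byte                 // content already downloaded
    fetch   func(hash string) ([]byte, error) // network download, by hash
}

// List answers a readdir-style request from the catalog alone, so browsing
// the tree needs no network traffic.
func (fs *lazyFS) List() []CatalogEntry {
    entries := make([]CatalogEntry, 0, len(fs.catalog))
    for _, e := range fs.catalog {
        entries = append(entries, e)
    }
    return entries
}

// Open downloads the content only on first access and caches it afterwards,
// which is where the extra open-syscall latency comes from.
func (fs *lazyFS) Open(path string) ([]byte, error) {
    entry, ok := fs.catalog[path]
    if !ok {
        return nil, os.ErrNotExist
    }
    if data, ok := fs.cache[entry.Hash]; ok {
        return data, nil // warm cache: near-native latency
    }
    data, err := fs.fetch(entry.Hash) // cold cache: pay the download once
    if err != nil {
        return nil, err
    }
    fs.cache[entry.Hash] = data
    return data, nil
}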

CVMFS Workflow

The intended workflow for CVMFS is to create a software repository with all the necessary components and dependencies.

Then install and connect the client on each computing machine and treat all the computing nodes as homogeneous.

At the cost of setting up the repository once we:

  1. Avoid the cost of managing dependency issues in different nodes
  2. Obtain file level granularity on the download of the software stack saving an enormous amount of bandwidth
  3. Provide a simple way to reproduce the same environment

Introducing containers

Containers are the industry solution to a similar problem.

They provide similar ease in managing dependencies and reproducing computing environments.

However, they take a different approach to distributing the files.

The container filesystem is split into layers, and each layer is a tarfile that gets compressed and distributed.

At the cost of downloading and storing the layers, we obtain:

  1. The ability to cache common layers
  2. Native latency when opening a file

It is a different approach, neither better nor worse, simply with different trade-offs.

Merging the two approaches

A rather recent study by Harter et al., Slacker: Fast Distribution with Lazy Docker Containers, shows that only ~7% of the downloaded bytes are actually used by a container and that -- at the same time -- downloading the images accounts for a considerable amount of a container's startup time.

We believe that using CVMFS could help in this regard: since it provides the possibility of addressing a single file, we could download only the 7% that is actually used, saving bandwidth and startup time.

In order to merge these two approaches we introduce the concept of a thin image: a normal Docker image that contains only a single OCI layer with a single file, the thin.json recipe.
This recipe is then used to recreate the normal Docker image on disk, but using files coming from CVMFS, bringing all the advantages mentioned above.
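
For illustration, the recipe maps to a very small structure in Go; this is a sketch mirroring the thin.json example shown further down in this thread, not code taken from the PR.

package thin

import (
    "encoding/json"
    "io/ioutil"
)

// ThinImage mirrors the thin.json recipe: instead of carrying the layer
// data, it records where each layer can be fetched from.
type ThinImage struct {
    Version string      `json:"version"`
    Origin  string      `json:"origin"` // e.g. "library/redis:4@https://registry-1.docker.io/v2"
    Layers  []ThinLayer `json:"layers"`
}

// ThinLayer points to one layer of the original image.
type ThinLayer struct {
    Digest string `json:"digest"`
    URL    string `json:"url"` // e.g. "cvmfs://<repo>/layers/<digest>"
}

// ReadRecipe loads and parses a thin.json file from disk.
func ReadRecipe(path string) (*ThinImage, error) {
    data, err := ioutil.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var t ThinImage
    if err := json.Unmarshal(data, &t); err != nil {
        return nil, err
    }
    return &t, nil
}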

This approach has already been implemented as a Docker plugin with great results, showing a large reduction in both startup time and bandwidth used.

Technical details

In order to integrate with the wider Docker ecosystem we start from a standard Docker image and convert it into a "thin image".
We store every layer inside a CVMFS repository and construct the recipe file, thin.json.

The snapshotter in this PR is modelled on the standard overlayfs snapshotter, with the addition of reading this recipe file and using its content to provide the lowerdir to mount inside containerd.
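
Roughly, the mounts returned by such a snapshotter have the following shape. This is only a sketch, assuming the recipe's layer URLs have already been resolved to directories under /cvmfs; the helper name and plumbing are illustrative, the real logic lives in the snapshotter in this PR.

package cvmfs

import (
    "fmt"
    "strings"

    "github.com/containerd/containerd/mount"
)

// overlayMounts builds the overlay mount for a snapshot whose read-only
// layers are already available as directories under /cvmfs (taken from the
// thin.json recipe). upper and work are the usual writable overlay
// directories managed by the snapshotter.
func overlayMounts(lowerDirs []string, upper, work string) []mount.Mount {
    options := []string{
        fmt.Sprintf("workdir=%s", work),
        fmt.Sprintf("upperdir=%s", upper),
        // e.g. lowerdir=/cvmfs/<repo>/layers/<digest1>:/cvmfs/<repo>/layers/<digest2>
        fmt.Sprintf("lowerdir=%s", strings.Join(lowerDirs, ":")),
    }
    return []mount.Mount{{
        Type:    "overlay",
        Source:  "overlay",
        Options: options,
    }}
}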

Goal of this PR

This PR is a first attempt to bring the same functionality to containerd by means of a rudimentary snapshotter.

Our goal at the moment is mostly to create awareness of our solution and gather feedback from the wider community.

Reference

CVMFS Documentation: http://cvmfs.readthedocs.io/en/stable/

CC
@jblomer @radupopescu @gganis @lukasheinrich @rochaporto

@lukasheinrich
Copy link

adding @AkihiroSuda to cc as he worked on similar issues in https://github.com/AkihiroSuda/filegrain

@AkihiroSuda
Member

🎉
Does it work with "non-thin" images?

Also, can we have ctr images pull --unpack-cvmfs that creates thin.json for non-thin images, although I'm not sure it should be in this repo?

"github.com/containerd/containerd/snapshots"
protobuftypes "github.com/gogo/protobuf/types"
)

func init() {
Member

Shouldn't this be configurable?

Author

Sorry, it was not supposed to be in the PR.

I reverted the test.

@siscia
Author

siscia commented Jul 16, 2018

@AkihiroSuda, at the moment this snapshotter works only with thin images; however, it is trivial to make it work with normal ones. Here it is a little tricky to decide what to allow: I am sure that some would prefer a very rigid interface that works only with thin images, while others would prefer something more pragmatic that also allows the use of regular images.

Most likely, if we want to support regular images, we will base the implementation on the overlayfs snapshotter.

Internally at CERN we are developing utilities to translate normal images into thin ones; it is still a work in progress and in its infancy, but it is open source: https://github.com/cvmfs/docker-graphdriver/tree/devel/daemon

@rochaporto

Also @stevvooe as we've discussed this a couple times in the past

"syscall"

"github.com/pkg/errors"
logg "github.com/sirupsen/logrus"
Member

Don't use logrus directly. Just use log.L.

type snapshotter struct {
root string
ms *storage.MetaStore
asyncRemove bool
Member

Describe what these fields are for.

if err != nil {
return err
}

Member

I'm not seeing any actual filesystem operations happening here. How does this actually work?

root string
ms *storage.MetaStore
asyncRemove bool
cvmfsManager util.ICvmfsManager
Member

This isn't accessed anywhere. What is this for?

@stevvooe
Member

I'm extremely confused: this PR has a snapshotter that doesn't really do anything but call into the transactional store. It looks like a clone of overlay, but I am not understanding the differences.

I think it would be most helpful if you could pare down this proposal to the components that affect containerd and snapshotters. For example, we know what containers are, so no need to explain that. Specifically, I would focus on the following questions:

  1. Why can't this be done with the existing snapshotters mounted on the fuse filesystem backed by cern vm fs (or other backend)?
  2. Can anyone run a cern vm fs? If so, how? What is the current adoption rate? Consider that it might be better to preserve this as a specific build of containerd for use at cern.
  3. How are we going to test this in CI?
  4. Let's see some real performance numbers: I do think you can reduce disk space with this approach, but there is no way you are getting native performance with FUSE.
  5. Let's look at this in the context of filegrain (someone already mentioned this). Are there other approaches to thin containers that don't rely on cernvmfs? Can we make this more generic so that others can get benefits?

That said, I think this is a good direction and having support at this level will be a huge benefit. However, we need to make sure that cvmfs is the right direction and won't incur baggage that other users have to carry.

@jblomer

jblomer commented Jul 18, 2018

@stevvooe Many thanks for looking into this! I think I can address some of your points, for the implementation details I'll leave it to @siscia.

Our intention is not necessarily to make a cvmfs specific snapshotter. Cvmfs in this PR is mainly a demonstrator. We'd like to have a snapshotter that can start a container with a root file system in a mounted read-only directory -- in the same way you'd do a chroot. The directories with the container root file systems could then sit on any local or network mount point.

Regarding cvmfs itself, it is used mostly in the scientific world. Fully open source, BSD licensed, on GitHub, documented. Besides high-energy physics, it is used for instance also by LIGO (gravitational wave detectors), the EUCLID space mission, bioscience, some industry users, etc. From the numbers we know, more than 1 billion files are under management, with on the order of 100k clients installed worldwide.

FUSE performance is less of an issue than one might think. Most of the calls are stat() and they are covered by the Linux VFS buffers. Also, the performance penalty is limited to the startup period, after which the software is in memory. After cache warm-up, the penalty is practically negligible (early measurements, more recent ones).

But again, we'd be happy with anything that lets us operate containerd on container images extracted in a directory.

@siscia
Author

siscia commented Jul 18, 2018

Hi @stevvooe,

the PR was indeed based on the overlay snapshotter. I was in doubt whether to actually just call the overlay snapshotter methods here and there, but in the end I opted to duplicate them; we can fix that.

The main difference is in the function (o *snapshotter) mounts on line 415, which calls (o *snapshotter) mountCVMFS at L:308. The interesting part of this function is how it gets the lowerdir to mount using overlay: indeed, on line 341 it uses getLowerFilesystem, defined at L:292.

getLowerFilesystem is where the real magic happens: it reads the thin.json file inside the directory and, from that file, infers where the actual content of the layers is and mounts those directories as lowerdir.

Those directories are where the cvmfs filesystem is mounted, so /cvmfs/REPO/layers by convention.

So, instead of getting the directories as tarballs and decompressing them inside $containerd_root/io.containerd.snapshotter.v1.overlayfs/..., we already have those directories under /cvmfs/REPO/layers/...; the advantage is that everything mounted and used via CVMFS works at file granularity. We are no longer downloading the whole tarball, but only the files we are interested in.
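
Conceptually, the translation from the recipe to the lowerdir paths looks like the sketch below (an illustration of the idea only, not the exact getLowerFilesystem code in this PR):

package cvmfs

import (
    "fmt"
    "strings"
)

// thinLayer is one entry of the "layers" array in thin.json.
type thinLayer struct {
    Digest string `json:"digest"`
    URL    string `json:"url"`
}

// lowerDirs translates the cvmfs:// layer URLs of a recipe into the
// directories where the cvmfs client exposes them, following the
// /cvmfs/REPO/layers/<digest> convention described above.
func lowerDirs(layers []thinLayer) ([]string, error) {
    dirs := make([]string, 0, len(layers))
    for _, l := range layers {
        // "cvmfs://cd.cern.ch/layers/<digest>" -> "/cvmfs/cd.cern.ch/layers/<digest>"
        if !strings.HasPrefix(l.URL, "cvmfs://") {
            return nil, fmt.Errorf("unsupported layer url: %s", l.URL)
        }
        dirs = append(dirs, "/cvmfs/"+strings.TrimPrefix(l.URL, "cvmfs://"))
    }
    return dirs, nil
}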

Let me now answer your question.

  1. Why can't this be done with the existing snapshotters mounted on the fuse filesystem backed by cern vm fs (or other backend)?

On the client, i.e. the machine that reads the files inside CVMFS, we only have read-only rights; this is what makes the whole concept possible. If we also allowed writing files it would be an NFS, with all the problems that implies. However, containerd requires write access to its root. A possible solution would be to use overlayfs to create a writeable layer on top of the files provided by CVMFS, but then you would have two layers of overlayfs running one on top of the other, which is not possible.

  2. Can anyone run a cern vm fs? If so, how? What is the current adoption rate? Consider that it might be better to preserve this as a specific build of containerd for use at cern.

As Jakob already mentioned, yes, it can be run by anybody.

  3. How are we going to test this in CI?

We would need a machine running the CVMFS server and to connect the CVMFS client to it; I do agree that it is some work, but it is not impossible.

  4. Let's see some real performance numbers: I do think you can reduce disk space with this approach, but there is no way you are getting native performance with FUSE.

We are definitely getting comparable performance with FUSE when the "cache is warm"; when the data is not yet in the system and needs to be retrieved via the network it is of course slower, but this should be compared with starting a container that is not yet in the system and needs to be downloaded.

Are you looking for specific benchmarks? I can definitely prepare some for the sake of discussion, even though I have already linked some in the first post.

  5. Let's look at this in the context of filegrain (someone already mentioned this). Are there other approaches to thin containers that don't rely on cernvmfs? Can we make this more generic so that others can get benefits?

If we really want to go filegrain we need to leave behind the concept of layers and tarballs.

Our interoperable solution has been to create these thin layers, which are just like normal layers except that they do not contain the "data" but rather the "recipe" to get the data.

Can this approach be made more general to work with a wider range of technologies? The answer is clearly a yes.

I believe it would help the discussion to make clear what a thin layer really is.

The JSON below is the thin layer for redis:4:

{
  "version": "1.0",
  "origin": "library/redis:4@https://registry-1.docker.io/v2",
  "layers": [
    {
      "digest": "683abbb4ea60e108164f1d351e7bcf13daf45941137d800086447874df05f48e",
      "url": "cvmfs://cd.cern.ch/layers/683abbb4ea60e108164f1d351e7bcf13daf45941137d800086447874df05f48e"
    },
    {
      "digest": "259238e792d86e23dab13fbcfcadf090333328ad9f80894544316437461f0d1b",
      "url": "cvmfs://cd.cern.ch/layers/259238e792d86e23dab13fbcfcadf090333328ad9f80894544316437461f0d1b"
    },
    {
      "digest": "78399601c709f0e252523e534db03152a1b3f017c9f7c756d68791ef07bc5d0b",
      "url": "cvmfs://cd.cern.ch/layers/78399601c709f0e252523e534db03152a1b3f017c9f7c756d68791ef07bc5d0b"
    },
    {
      "digest": "f397da4746012e36f5550d3eb830d81a522a1910417d6b3e1bd1ba046dfb8133",
      "url": "cvmfs://cd.cern.ch/layers/f397da4746012e36f5550d3eb830d81a522a1910417d6b3e1bd1ba046dfb8133"
    },
    {
      "digest": "c57de4edc390ed763d75e015294194b9147207a5ad77b6a3d01cd7ee22b0b010",
      "url": "cvmfs://cd.cern.ch/layers/c57de4edc390ed763d75e015294194b9147207a5ad77b6a3d01cd7ee22b0b010"
    },
    {
      "digest": "b2ea05c9d9a1b925610552e146549112aa893634e6a25ad18fd0cc295aec1cac",
      "url": "cvmfs://cd.cern.ch/layers/b2ea05c9d9a1b925610552e146549112aa893634e6a25ad18fd0cc295aec1cac"
    }
  ]
}

For each layer in the normal Docker image we have a URL; what the snapshotter does is identify where the layers are in the filesystem and provide them inside a mountpoint.

The code itself needs to be cleaned up, and I am going to work on that quite soon, but I would really appreciate it if this conversation kept moving forward.

Please let me know if I wasn't clear on any particular point.

@siscia
Author

siscia commented Jan 23, 2019

follow on #2943
