Skip to content

Conversation

crosbymichael
Copy link
Member

@crosbymichael crosbymichael commented Jul 3, 2018

This implements the shimv2 API proposed in #2426. The tests are using the new io.containerd.oci.v1 runtime under the new TaskManger implementing the v2 APIs.

Runtime v2

Runtime v2 introduces a first class shim API for runtime authors to integrate with containerd.
The shim API is minimal and scoped to the execution lifecycle of a container.

Binary Naming

Users specify the runtime they wish to use when creating a container.
The runtime can also be changed via a container update.

> ctr run --runtime io.containerd.runc.v1

When a user specifies a runtime name, io.containerd.runc.v1, they will specify the name and version of the runtime.
This will be trasnlated by containerd into a binary name for the shim.

io.containerd.runc.v1 -> containerd-shim-runc-v1

containerd keeps the containerd-shim-* prefix so that users can ps aux | grep containerd-shim to see running shims on their system.

Shim Authoring

This section is dedicated to runtime authors wishing to build a shim.
It will detail how the API works and different considerations when building shim.

Commands

Container information is provided to a shim in two ways.
The OCI Runtime Bundle and on the Create rpc request.

start

Each shim MUST implement a start subcommand.
This command will launch new shims.
The start command MUST accept the following flags:

  • -namespace the namespace for the container
  • -id the id of the container
  • -address the address of the containerd's main socket
  • -publish-binary the binary path to publish events back to containerd

The start command, as well as all binary calls to the shim, has the bundle for the container set as the cwd.

The start command MUST return an address to a shim for containerd to issue API requests for container operations.

The start command can either start a new shim or return an address to an existing shim based on the shim's logic.

delete

Each shim MUST implement a delete subcommand.
This command allows containerd to delete any container resources created, mounted, and/or run by a shim when containerd can no longer communicate over rpc.
This happens if a shim is SIGKILL'd with a running container.
These resources will need to be cleaned up when containerd looses the connection to a shim.
This is also used when containerd boots and reconnects to shims.
If a bundle is still on disk but containerd cannot connect to a shim, the delete command is invoked.

The delete command MUST accept the following flags:

  • -namespace the namespace for the container
  • -id the id of the container
  • -address the address of the containerd's main socket
  • -publish-binary the binary path to publish events back to containerd

The delete command will be executed in the container's bundle as its cwd.

Host Level Shim Configuration

containerd does not provide any host level configuration for shims via the API.
If a shim needs configuration from the user with host level information across all instances, a shim specific configuration file can be setup.

Container Level Shim Configuration

On the create request, there is a generic *protobuf.Any that allows a user to specify container level configuration for the shim.

message CreateTaskRequest {
	string id = 1;
	...
	google.protobuf.Any options = 10;
}

A shim author can create their own protobuf message for configuration and clients can import and provide this information is needed.

I/O

I/O for a container is provided by the client to the shim via fifo on Linux, named pipes on Windows, or log files on disk.
The paths to these files are provided on the Create rpc for the initial creation and on the Exec rpc for additional processes.

message CreateTaskRequest {
	string id = 1;
	bool terminal = 4;
	string stdin = 5;
	string stdout = 6;
	string stderr = 7;
}
message ExecProcessRequest {
	string id = 1;
	string exec_id = 2;
	bool terminal = 3;
	string stdin = 4;
	string stdout = 5;
	string stderr = 6;
}

Containers that are to be launched with an interactive terminal will have the terminal field set to true, data is still copied over the files(fifos,pipes) in the same way as non interactive containers.

Root Filesystems

The root filesytems for the containers is provided by on the Create rpc.
Shims are responsible for managing the lifecycle of the filesystem mount during the lifecycle of a container.

message CreateTaskRequest {
	string id = 1;
	string bundle = 2;
	repeated containerd.types.Mount rootfs = 3;
	...
}

The mount protobuf message is:

message Mount {
	// Type defines the nature of the mount.
	string type = 1;
	// Source specifies the name of the mount. Depending on mount type, this
	// may be a volume name or a host path, or even ignored.
	string source = 2;
	// Target path in container
	string target = 3;
	// Options specifies zero or more fstab style mount options.
	repeated string options = 4;
}

Shims are responsible for mounting the filesystem into the rootfs/ directory of the bundle.
Shims are also responsible for unmounting of the filesystem.
During a delete binary call, the shim MUST ensure that filesystem is also unmounted.
Filesystems are provided by the containerd snapshotters.

Other

Unsupported rpcs

If a shim does not or cannot implement an rpc call, it MUST return a github.com/containerd/containerd/errdefs.ErrNotImplemented error.

ttrpc

ttrpc is the only currently supported protocol for shims.
It works with standard protobufs and GRPC services as well as generating clients.
The only difference between grpc and ttrpc is the wire protocol.
ttrpc removes the http stack in order to save memory and binary size to keep shims small.
It is recommended to use ttrpc in your shim but grpc support is also in development.

Closes #2426

client.go Outdated
@@ -79,8 +79,12 @@ func New(address string, opts ...ClientOpt) (*Client, error) {
return nil, err
}
}
rt := "io.containerd.oci.v1"
if runtime.GOOS != "linxu" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

linux

Copy link
Contributor

@mlaventure mlaventure left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far, don't have the time to finish goign through it all atm, but I'll try to get another pass at it.

client.go Outdated
@@ -79,8 +79,12 @@ func New(address string, opts ...ClientOpt) (*Client, error) {
return nil, err
}
}
rt := "io.containerd.oci.v1"
if runtime.GOOS != "linxu" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

linux

func main() {
if err := shim.Run(oci.New); err != nil {
logrus.WithError(err).Error("shim run")
fmt.Fprintf(os.Stderr, "containerd-shim-process-v1: %s\n", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

containerd-shim-oci-v1

@@ -871,70 +875,6 @@ func TestContainerKillAll(t *testing.T) {
}
}

func TestShimSigkilled(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not handled anymore?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at the test and it really wasn't testing much, killing the shim then killing the process. I don't think it was a valid test or it was and changed sometime in the past. Felt safe to remove.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It checks that the container process (redis) is correctly cleaned up if the shim crash/is sigkilled (e.g. by oom).
It doesn't actually kill the process, it checks that it died.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, my bad, i'll fix it. i missed the kill 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this test is still removed in the latest commit? or did I miss a move somewhere else?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so I researched this more and the test is only valid for v1 when container was still running and didn't have to reconnect to a shim again. The reason is that it sets an exit handler that does the "cleanup on dead shim" code. However, this only runs if the shim process that containerd is doing a waidpid on returns and checks the shim pid.

I think its safe to remove this as it doesn't really work, it only works in the tests because the daemon isn't restarted. If we want to keep it, i can do some refactoring but it will take a little bit of work to make this work across daemon reboots with running shims. I can do that in a followup PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense.

I haven't yet got the time to compile and play around with the new API to see how it actually behave.

Having a way to start a "cleanup shim" in such scenario may be a cleaner solution for v2 though (and if that fails, things are left to the user to clean).

if wg != nil {
defer wg.Done()
}

stats, err := cg.Stat(cgroups.IgnoreNotExist)
ctx := namespaces.WithNamespace(context.Background(), t.Namespace())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No timeout / cancel / Sync ? We could end up with several go routine getting the stats on the same task, no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are sync'd in the caller, hard to have a timeout for this because under load, collection could be slower

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, right. I missed the waitgroup, my bad.

IoGID int
WorkDir string

id string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, but could we bundle exported variables together?

}
}()
// create base directories
for _, d := range paths {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the separation between the base dir and the last dir in the path?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base should always be created while mkdir should return an exists error if something is already there

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason behind doing them in 2 loops if the remove all happens on error anyway? Also what does it mean if Mkdir returns an exist on the second part, seems odd that it would both return an exists error and delete the directory.


outw, err := fifo.OpenFifo(ctx, stdout, syscall.O_WRONLY, 0)
if err != nil {
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanup stdin if created?

}
outr, err := fifo.OpenFifo(ctx, stdout, syscall.O_RDONLY, 0)
if err != nil {
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing cleanup of in/outw

@jterry75
Copy link
Contributor

jterry75 commented Jul 4, 2018

@crosbymichael - This is a really good start. I can rebase my windows side changes on this and resubmit

@codecov-io
Copy link

codecov-io commented Jul 5, 2018

Codecov Report

Merging #2434 into master will decrease coverage by 0.03%.
The diff coverage is 0%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2434      +/-   ##
==========================================
- Coverage   44.76%   44.73%   -0.04%     
==========================================
  Files          92       93       +1     
  Lines        9483     9490       +7     
==========================================
  Hits         4245     4245              
- Misses       4555     4562       +7     
  Partials      683      683
Flag Coverage Δ
#linux 48.96% <0%> (-0.02%) ⬇️
#windows 41.03% <ø> (ø) ⬆️
Impacted Files Coverage Δ
sys/reaper.go 0% <ø> (ø) ⬆️
sys/reaper_linux.go 0% <0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 02579c8...d53a96f. Read the comment docs.

if werr == nil {
err2 = os.RemoveAll(work)
if err2 == nil {
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrapping this could be useful, I don't think the message includes the path

}

func (b *binary) Delete(ctx context.Context) (*runtime.Exit, error) {
cmd, err := shimCommand(ctx, b.runtime, b.containerdAddress, b.bundle, "-debug", "delete")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this command not given the bundle id anywhere?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

its not needed and comes on the Create request

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this exec with the delete arg intended to do?

args = append(args, cmdArgs...)
name := getShimBinaryName(runtime)
if !filepath.IsAbs(name) {
_, err = exec.LookPath(name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently name will never be absolute, but is it worth considering a case where name is in the same directory as self but not in the lookup path? In that case the name might be better found at os.Join(os.Dir(self), name). Unless there is a plan here to document setting PATH on containerd for the runtime binary installation location.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think for now, we say that it must be in PATH. i'll remove this abs code

// newShim starts and returns a new shim
func newShim(ctx context.Context, bundle *Bundle, runtime, containerdAddress string, events *exchange.Exchange, rt *runtime.TaskList) (_ *shim, err error) {
if err := identifiers.Validate(bundle.ID); err != nil {
return nil, errors.Wrapf(err, "invalid task id")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: either Wrap or add the ID?

}()
select {
case <-time.After(1 * time.Second):
terminate(s.shimPid)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return or log this error?


func (s *shim) Exec(ctx context.Context, id string, opts runtime.ExecOpts) (runtime.Process, error) {
if err := identifiers.Validate(id); err != nil {
return nil, errors.Wrapf(err, "invalid exec id")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

include the ID?

}
return runtime.State{}, errdefs.ErrNotFound
}
var status runtime.Status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check the StatusUnknown task type or have a runtime.UnknownStatus?

@crosbymichael crosbymichael force-pushed the shimv2 branch 2 times, most recently from 263acdb to 7976ef8 Compare July 9, 2018 18:21
@crosbymichael crosbymichael changed the title [wip] Runtime v2 (shim API) Runtime v2 (shim API) Jul 9, 2018
@crosbymichael crosbymichael force-pushed the shimv2 branch 2 times, most recently from aeb0941 to 578358e Compare July 9, 2018 20:57
@@ -547,32 +585,26 @@ func getTasksMetrics(ctx context.Context, filter filters.Filter, tasks []runtime
case "id":
return t.ID(), true
case "namespace":
return t.Info().Namespace, true
// return t.Info().Namespace, true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formatting here is off

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a default case and explanation for not using now?

@crosbymichael crosbymichael force-pushed the shimv2 branch 4 times, most recently from 8deea2c to 9a1e7b9 Compare July 11, 2018 19:13
@containerd containerd deleted a comment from crosbymichael Jul 11, 2018
@containerd containerd deleted a comment from crosbymichael Jul 11, 2018
if err != nil {
return nil, errors.Wrapf(err, "%s", out)
}
address := string(out)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be trimmed? Also wondering if we should just take the stdout rather than combined output.

You previously expressed that this extra connect could cause some additional time, although mostly negligible. Could we still pass in the socket and if the returned address doesn't match just remove it and create a new connection, or would that end up being even longer?

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Copy link
Member

@dmcgowan dmcgowan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

option go_package = "github.com/containerd/containerd/runtime/v2/runc/options;options";

message Options {
bool no_pivot_root = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this isn't popular, but let's document the intent of these fields.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
// each container and allows reattaching to the IO and receiving the exit status
// for the container processes.
service Task {
rpc State(StateRequest) returns (StateResponse);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to document what each method does.

@stevvooe
Copy link
Member

I don't see anything wildly controversial here. It would be good to document the methods and fields used in GRPC.

@crosbymichael
Copy link
Member Author

@stevvooe I was going to do documentation in a separate PR. I'll take care of proto comments there as well. I wanted to make a document for shim authors, "Building a shim 101"

@Random-Liu
Copy link
Member

Random-Liu commented Jul 18, 2018

@crosbymichael Will take a look at this PR tomorrow. Sorry for the delay.

This readme specifies shim api and authoring docs.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
@crosbymichael
Copy link
Member Author

Update the description and added a readme for shim authors.

Runtime v2

Runtime v2 introduces a first class shim API for runtime authors to integrate with containerd.
The shim API is minimal and scoped to the execution lifecycle of a container.

Binary Naming

Users specify the runtime they wish to use when creating a container.
The runtime can also be changed via a container update.

> ctr run --runtime io.containerd.runc.v1

When a user specifies a runtime name, io.containerd.runc.v1, they will specify the name and version of the runtime.
This will be trasnlated by containerd into a binary name for the shim.

io.containerd.runc.v1 -> containerd-shim-runc-v1

containerd keeps the containerd-shim-* prefix so that users can ps aux | grep containerd-shim to see running shims on their system.

Shim Authoring

This section is dedicated to runtime authors wishing to build a shim.
It will detail how the API works and different considerations when building shim.

Commands

Container information is provided to a shim in two ways.
The OCI Runtime Bundle and on the Create rpc request.

start

Each shim MUST implement a start subcommand.
This command will launch new shims.
The start command MUST accept the following flags:

  • -namespace the namespace for the container
  • -id the id of the container
  • -address the address of the containerd's main socket
  • -publish-binary the binary path to publish events back to containerd

The start command, as well as all binary calls to the shim, has the bundle for the container set as the cwd.

The start command MUST return an address to a shim for containerd to issue API requests for container operations.

The start command can either start a new shim or return an address to an existing shim based on the shim's logic.

delete

Each shim MUST implement a delete subcommand.
This command allows containerd to delete any container resources created, mounted, and/or run by a shim when containerd can no longer communicate over rpc.
This happens if a shim is SIGKILL'd with a running container.
These resources will need to be cleaned up when containerd looses the connection to a shim.
This is also used when containerd boots and reconnects to shims.
If a bundle is still on disk but containerd cannot connect to a shim, the delete command is invoked.

The delete command MUST accept the following flags:

  • -namespace the namespace for the container
  • -id the id of the container
  • -address the address of the containerd's main socket
  • -publish-binary the binary path to publish events back to containerd

The delete command will be executed in the container's bundle as its cwd.

Host Level Shim Configuration

containerd does not provide any host level configuration for shims via the API.
If a shim needs configuration from the user with host level information across all instances, a shim specific configuration file can be setup.

Container Level Shim Configuration

On the create request, there is a generic *protobuf.Any that allows a user to specify container level configuration for the shim.

message CreateTaskRequest {
	string id = 1;
	...
	google.protobuf.Any options = 10;
}

A shim author can create their own protobuf message for configuration and clients can import and provide this information is needed.

I/O

I/O for a container is provided by the client to the shim via fifo on Linux, named pipes on Windows, or log files on disk.
The paths to these files are provided on the Create rpc for the initial creation and on the Exec rpc for additional processes.

message CreateTaskRequest {
	string id = 1;
	bool terminal = 4;
	string stdin = 5;
	string stdout = 6;
	string stderr = 7;
}
message ExecProcessRequest {
	string id = 1;
	string exec_id = 2;
	bool terminal = 3;
	string stdin = 4;
	string stdout = 5;
	string stderr = 6;
}

Containers that are to be launched with an interactive terminal will have the terminal field set to true, data is still copied over the files(fifos,pipes) in the same way as non interactive containers.

Root Filesystems

The root filesytems for the containers is provided by on the Create rpc.
Shims are responsible for managing the lifecycle of the filesystem mount during the lifecycle of a container.

message CreateTaskRequest {
	string id = 1;
	string bundle = 2;
	repeated containerd.types.Mount rootfs = 3;
	...
}

The mount protobuf message is:

message Mount {
	// Type defines the nature of the mount.
	string type = 1;
	// Source specifies the name of the mount. Depending on mount type, this
	// may be a volume name or a host path, or even ignored.
	string source = 2;
	// Target path in container
	string target = 3;
	// Options specifies zero or more fstab style mount options.
	repeated string options = 4;
}

Shims are responsible for mounting the filesystem into the rootfs/ directory of the bundle.
Shims are also responsible for unmounting of the filesystem.
During a delete binary call, the shim MUST ensure that filesystem is also unmounted.
Filesystems are provided by the containerd snapshotters.

Other

Unsupported rpcs

If a shim does not or cannot implement an rpc call, it MUST return a github.com/containerd/containerd/errdefs.ErrNotImplemented error.

ttrpc

ttrpc is the only currently supported protocol for shims.
It works with standard protobufs and GRPC services as well as generating clients.
The only difference between grpc and ttrpc is the wire protocol.
ttrpc removes the http stack in order to save memory and binary size to keep shims small.
It is recommended to use ttrpc in your shim but grpc support is also in development.

@Random-Liu
Copy link
Member

The api LGTM overall. One thing is that can we include required events? e.g. TaskExit (required), OOMEvents (optional/required?) etc.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
@crosbymichael
Copy link
Member Author

@Random-Liu ok, updated the README with events.

@dmcgowan dmcgowan merged commit 94cfce6 into containerd:master Jul 18, 2018
@thaJeztah
Copy link
Member

it's merged! 🎉 🙌👏👏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[proposal] Shim API v2
10 participants