RFC: Distributed Pipeline execution via Workers #107

@Skarlso

Abstract

This document discusses the problem of executing pipelines in a distributed
manner.

Table of Contents

  1. Introduction
  2. Problem Statement
  3. Terminology
  4. Architecture Diagram
  5. Proposed Worker Distribution Model
  6. Managing Workers
  7. Worker Tags
  8. The Worker RPC API
  9. Gaia Master - Agent
  10. Scheduling Jobs
  11. Implementation Approach

Introduction

Problem Statement

The problem poses the following set of challenges for Gaia:

  1. Manage workers
    • See what pipeline is running on which worker at any given point in time
    • Add / Delete / Suspend workers
    • Add specific environment variables to the worker
  2. Choose, either automatically or manually, which pipeline should run on which
    worker.
  3. Label the workers so the user knows whether a worker is a Windows or a Linux
    machine, whether the Go, Python, or Java SDK is available on it, and so on.

Terminology

Gaia Master: The Gaia Master is a running instance of Gaia, launched via make or the
released Gaia binary.
Worker: A worker is a server which is connected to the Gaia Master and has
certain capabilities, such as which SDKs it supports or which operating system
is installed on it.
Pipeline: A pipeline is a configured entity with a set of jobs.
Job: A job is a single running task, such as creating a user. A pipeline can have multiple jobs.
RPC: Remote Procedure Call.

Architecture Diagram

[Figure: distributed workers]

Proposed Worker Distribution Model

The proposed model which aims to solve this problem is laid out as follows.

Managing Workers

Workers will be managed through a set of API endpoints.
All workers are stored in the database with a designated set of labels,
an assigned name, and an IP address.

The endpoints will be Delete and Suspend. Adding workers is taken care of
by the Gaia agent, so that operation is not supported here.

Delete: Delete simply removes the server from the rotation. It won't restart
the server or shut it down; it only deletes it from the database which
holds the worker instances.
Suspend: Suspending a worker takes it out of rotation but does not delete it.
While suspended, the worker will not be able to run any pipelines. This is a good option
if some kind of maintenance needs to be performed on the machine.
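
The difference between Delete and Suspend can be sketched in Go as follows. This is a minimal illustration, not the planned implementation: the `Registry` type, its in-memory map (standing in for the database), the `Suspended` flag, and the `InRotation` helper are all assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// Worker holds what the RFC says is stored per worker: labels, a name and an IP.
// The Suspended flag is an assumed way to represent the suspended state.
type Worker struct {
	Name      string
	IP        string
	Labels    []string
	Suspended bool
}

// ErrWorkerNotFound is returned when no worker with the given name is stored.
var ErrWorkerNotFound = errors.New("worker not found")

// Registry is a hypothetical in-memory stand-in for the worker database.
type Registry struct {
	mu      sync.Mutex
	workers map[string]*Worker
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{workers: make(map[string]*Worker)}
}

// Add stores a worker. In the RFC this step is performed by the Gaia agent.
func (r *Registry) Add(w Worker) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.workers[w.Name] = &w
}

// Delete removes the stored record only; the machine itself is left untouched.
func (r *Registry) Delete(name string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.workers[name]; !ok {
		return ErrWorkerNotFound
	}
	delete(r.workers, name)
	return nil
}

// Suspend keeps the record but takes the worker out of rotation.
func (r *Registry) Suspend(name string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	w, ok := r.workers[name]
	if !ok {
		return ErrWorkerNotFound
	}
	w.Suspended = true
	return nil
}

// InRotation returns the sorted names of workers that may receive pipelines.
func (r *Registry) InRotation() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	names := make([]string, 0, len(r.workers))
	for name, w := range r.workers {
		if !w.Suspended {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	r := NewRegistry()
	r.Add(Worker{Name: "worker-1", IP: "10.0.0.1", Labels: []string{"linux"}})
	r.Add(Worker{Name: "worker-2", IP: "10.0.0.2", Labels: []string{"windows"}})
	_ = r.Suspend("worker-1")
	fmt.Println(r.InRotation())
}
```

In this sketch a suspended worker stays in the store and could be resumed later, whereas a deleted worker would have to register again through the agent.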

Worker Tags

The workers will need to be tagged with what kind of resource they are providing. For example:

name      tags
Worker 1  Ubuntu Linux 64bit
Worker 2  Windows 10 64bit
Worker 3  Debian Linux 64bit

When a pipeline is first created, the pipeline creation window needs to specify what kind of resources it requires. The available tags should be offered in a drop-down list for ease of use. Tags can be created when a Worker is created and saved to Gaia. Workers can also be tagged manually on the Worker Manager screen.
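
One possible way to match a pipeline's required tags against worker tags is sketched below in Go. It assumes that each word in the table above is a separate tag and that matching is case-insensitive; neither is stated in this RFC, and the `EligibleWorkers` and `hasAllTags` names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Worker carries a name and its tags. Splitting "Ubuntu Linux 64bit" into
// three separate tags is an assumption made for this example.
type Worker struct {
	Name string
	Tags []string
}

// hasAllTags reports whether the worker carries every required tag,
// comparing case-insensitively.
func hasAllTags(w Worker, required []string) bool {
	have := make(map[string]bool, len(w.Tags))
	for _, t := range w.Tags {
		have[strings.ToLower(t)] = true
	}
	for _, r := range required {
		if !have[strings.ToLower(r)] {
			return false
		}
	}
	return true
}

// EligibleWorkers returns the workers that satisfy all required tags,
// preserving the input order.
func EligibleWorkers(workers []Worker, required []string) []Worker {
	var out []Worker
	for _, w := range workers {
		if hasAllTags(w, required) {
			out = append(out, w)
		}
	}
	return out
}

func main() {
	workers := []Worker{
		{Name: "Worker 1", Tags: []string{"Ubuntu", "Linux", "64bit"}},
		{Name: "Worker 2", Tags: []string{"Windows", "10", "64bit"}},
		{Name: "Worker 3", Tags: []string{"Debian", "Linux", "64bit"}},
	}
	for _, w := range EligibleWorkers(workers, []string{"linux", "64bit"}) {
		fmt.Println(w.Name)
	}
}
```

With the three workers from the table, a pipeline that requires `linux` and `64bit` would be eligible for Worker 1 and Worker 3.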

The Worker RPC API

The Workers will talk to the Gaia Master via a set of defined RPC interfaces.
These are as follows:

// RegisterWorker will take a worker struct which contains the following information:
// Security: This will be protected by the TLS connection between master and worker.
// IP: The address of the worker
// Name: The name of the worker which typically can be `hostname`.
// Operating System: The OS of the worker to save as a label.
// SDK: The SDK the worker has.
rpc RegisterWorker(Worker) {}

// RunPipeline will take a pipeline and execute it. This is a bi-directional endpoint.
// Pipeline struct:
// ID: ID of the pipeline
// Repo: The git repository for the pipeline. This is needed because the worker needs
// to build the pipeline.
rpc RunPipeline(Pipeline) returns (Success) {}

rpc GetAllPipelines(Worker) returns (Pipelines) {}
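
To make the message shapes more concrete, here is a rough Go sketch of the master side of `RegisterWorker` and `GetAllPipelines`, with no network layer. The field types, the `Master` type, the `Assign` helper, the error values, and the reading of `GetAllPipelines` as "pipelines assigned to this worker" are all assumptions for illustration; the RFC only lists the field names.

```go
package main

import (
	"errors"
	"fmt"
)

// Worker mirrors the fields listed for RegisterWorker. Types are assumed.
type Worker struct {
	IP   string
	Name string
	OS   string
	SDK  string
}

// Pipeline mirrors the fields listed for RunPipeline. Types are assumed.
type Pipeline struct {
	ID   int
	Repo string
}

var (
	// ErrInvalidWorker is returned when a registration lacks a name or an IP.
	ErrInvalidWorker = errors.New("worker needs a name and an IP")
	// ErrUnknownWorker is returned for workers that never registered.
	ErrUnknownWorker = errors.New("worker is not registered")
)

// Master is a hypothetical in-memory stand-in for the Gaia Master.
type Master struct {
	workers  map[string]Worker     // keyed by worker name
	assigned map[string][]Pipeline // pipelines per worker name
}

// NewMaster returns a master with no registered workers.
func NewMaster() *Master {
	return &Master{
		workers:  make(map[string]Worker),
		assigned: make(map[string][]Pipeline),
	}
}

// RegisterWorker stores the worker after a minimal sanity check.
func (m *Master) RegisterWorker(w Worker) error {
	if w.Name == "" || w.IP == "" {
		return ErrInvalidWorker
	}
	m.workers[w.Name] = w
	return nil
}

// Assign is a hypothetical helper that attaches a pipeline to a worker.
func (m *Master) Assign(workerName string, p Pipeline) error {
	if _, ok := m.workers[workerName]; !ok {
		return ErrUnknownWorker
	}
	m.assigned[workerName] = append(m.assigned[workerName], p)
	return nil
}

// GetAllPipelines returns the pipelines assigned to the given worker.
func (m *Master) GetAllPipelines(w Worker) ([]Pipeline, error) {
	if _, ok := m.workers[w.Name]; !ok {
		return nil, ErrUnknownWorker
	}
	return m.assigned[w.Name], nil
}

func main() {
	m := NewMaster()
	w := Worker{IP: "10.0.0.5", Name: "build-host", OS: "linux", SDK: "go"}
	if err := m.RegisterWorker(w); err != nil {
		fmt.Println("register failed:", err)
		return
	}
	_ = m.Assign(w.Name, Pipeline{ID: 1, Repo: "https://example.com/org/pipeline.git"})
	pipelines, _ := m.GetAllPipelines(w)
	fmt.Println(len(pipelines), pipelines[0].Repo)
}
```

In a real implementation these methods would sit behind the RPC service definitions above and the TLS connection between master and worker.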

Gaia Master - Agent

The current Gaia implementation will remain and will be designated as Gaia Master.
The master will be a hub for the workers to connect to, get pipelines from, and report
back to on the current state of the pipelines they are running.

As such, Gaia Master will no longer be solely responsible for building and distributing
binaries. Since the operating system of the worker determines the format of the
binaries, the workers will build their own binaries.

This means a worker will get a repository to pull code from and will then perform the
same build and run steps that Gaia performs today. This will not involve duplicating
code, however, since all of that functionality will live in the worker package. Gaia Master
will use this package by setting the worker to localhost.

The Workers will need to have the go-plugin extracted because HashiCorp's plugin
system does not support RPC calls over the network; only localhost communication
is allowed. Pipeline execution and the communication of job runs and state
changes all happen through RPC.

Scheduling Jobs

Job scheduling will also have to move into the workers. Each worker will schedule
the parallel execution of its own jobs, while Gaia Master will have to schedule and manage
which worker pipelines are distributed to. This means that the workers will need an indicator
that signals when they are too busy to accept more pipelines.
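
A busy indicator and the master-side choice of worker could look roughly like the Go sketch below. The capacity-based definition of "busy" and the least-loaded selection rule are assumptions chosen for illustration; the RFC does not prescribe either, and `PickWorker`, `Running`, and `Capacity` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Worker tracks how many pipelines it runs and how many it can take.
// Both counters are assumed; the RFC only asks for a "too busy" indicator.
type Worker struct {
	Name     string
	Running  int
	Capacity int
}

// Busy is the indicator: true once the worker has reached its capacity.
func (w Worker) Busy() bool {
	return w.Running >= w.Capacity
}

// ErrAllBusy is returned when no worker can accept another pipeline.
var ErrAllBusy = errors.New("all workers are busy")

// PickWorker returns the index of the non-busy worker with the fewest
// running pipelines. Ties go to the worker that appears first.
func PickWorker(workers []Worker) (int, error) {
	best := -1
	for i, w := range workers {
		if w.Busy() {
			continue
		}
		if best == -1 || w.Running < workers[best].Running {
			best = i
		}
	}
	if best == -1 {
		return -1, ErrAllBusy
	}
	return best, nil
}

func main() {
	workers := []Worker{
		{Name: "worker-1", Running: 2, Capacity: 2},
		{Name: "worker-2", Running: 1, Capacity: 4},
		{Name: "worker-3", Running: 3, Capacity: 4},
	}
	i, err := PickWorker(workers)
	if err != nil {
		fmt.Println(err)
		return
	}
	workers[i].Running++ // the master records the new assignment
	fmt.Println("scheduled on", workers[i].Name)
}
```

Here the master only decides where a pipeline goes; how the chosen worker runs its jobs in parallel stays the worker's own concern, as described above.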

Where jobs are built

Currently, once a user initiates a pipeline build, that pipeline is saved and built on Gaia Master.
This has to change in order for the worker to be able to run the pipeline: the binary
needs to be built on the worker. However, Gaia Master also needs to be aware of the jobs
and does pre-validation, which means it also needs to build the pipeline.

Scenario 1:

We build the pipeline on both the Gaia Master and the Worker. This gives us immediate validation of the pipeline, but we have to duplicate the building process.

Scenario 2:

We only build on the worker and just save the pipeline on the master to track it. Validation is deferred until the pipeline is actually built on one of the workers, but the building process isn't duplicated.
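
Under Scenario 2 the master would need to track a pipeline whose validity is not yet known. A minimal Go sketch of that bookkeeping follows; the state names, the `TrackedPipeline` type, and the `Save` and `ReportBuild` functions are hypothetical and only illustrate the deferred-validation idea.

```go
package main

import (
	"errors"
	"fmt"
)

// ValidationState describes what the master knows about a saved pipeline.
type ValidationState int

const (
	Pending ValidationState = iota // saved on the master, not built anywhere yet
	Valid                          // a worker built it successfully
	Invalid                        // a worker failed to build it
)

// String makes the state readable when printed.
func (s ValidationState) String() string {
	switch s {
	case Pending:
		return "pending"
	case Valid:
		return "valid"
	default:
		return "invalid"
	}
}

// ErrBuildFailed is a sample error a worker might report back.
var ErrBuildFailed = errors.New("build failed on worker")

// TrackedPipeline is what the master keeps when it does not build itself.
type TrackedPipeline struct {
	Name   string
	Repo   string
	State  ValidationState
	Reason string // set only when State is Invalid
}

// Save records the pipeline on the master without building it.
func Save(name, repo string) *TrackedPipeline {
	return &TrackedPipeline{Name: name, Repo: repo, State: Pending}
}

// ReportBuild applies a worker's build result: nil means the build passed.
func (p *TrackedPipeline) ReportBuild(buildErr error) {
	if buildErr != nil {
		p.State = Invalid
		p.Reason = buildErr.Error()
		return
	}
	p.State = Valid
	p.Reason = ""
}

func main() {
	p := Save("create-user", "https://example.com/org/pipeline.git")
	fmt.Println(p.State) // still unvalidated right after saving
	p.ReportBuild(nil)   // a worker reports a successful build
	fmt.Println(p.State)
}
```

The trade-off shows up in the `Pending` state: until some worker reports back, the master cannot tell the user whether the pipeline is valid.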

Implementation Approach

  1. Extract all functionality regarding running and building pipelines, including
    the SDK and the go-plugin facility, into a worker package. This should not change
    the current behavior of Gaia. All tests should still pass, including the WebHook
    capability, which should still be able to just call build. The worker package should
    take care of building and distributing the binary.

  2. Create the API which handles most worker-related operations, but don't
    bother extracting it yet.

  3. Create an Agent binary which calls back to the master's RPC API and registers a
    server as a worker.

  4. Implement the management of the servers under Settings on the left of the admin
    screen.
