Skip to content

CML contexts #311

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
Oct 29, 2020
Merged

CML contexts #311

merged 33 commits into from
Oct 29, 2020

Conversation

DavidGOrtega
Copy link
Contributor

@DavidGOrtega DavidGOrtega commented Oct 14, 2020

This PR introduces contexts, this is the ability to be able to setup a repo and token to the command, so you can run under gitlab CI and comment in GH scm setting the repo and token in the command.
Additionally the repo and sha is taken from the repo itself making future integrations faster if the CI do a full clone of the repo instead of a mirror like GL does with gitlab.

Benefits:

  • Integrations are easier since we are not looking for specific CI ENV variables but extracting the repo, sha, etc... from the repo.
  • Commands are independent to be run locally or in a non CI environment.
  • We could add another kind of drivers, slack, confluence, etc...
  • Very easy to work with you on premise instances.

Example of a GL runner commenting under a GH repo setting repo_token to GH.

Note that token and commit-sha is not needed since the first can be extracted from the env variables (secret) and the second is inferred from the repo itself.:

train:
  script:
    - echo "Training..."
    - echo 'Hi from cml!' >> report.md
    - cml-send-comment --repo=https://github.com/iterative/3_tensorboard  report.md
    - echo "Trained!"

example of your own github on premise:

train:
  script:
    - echo "Training..."
    - echo 'Hi from cml!' >> report.md
    - cml-send-comment --repo=https://myentreprisegithub.com/iterative/3_tensorboard  --driver=github report.md
    - echo "Trained!"

* token

* clean env before tests

* print env

* yaml

* yaml

* yaml

* yaml

* check

* fix repo env

* fix repo env

* fix repo env

* fix repo env

* check name

* env github tests

* env github tests

* log check

* log check

* no check tessts

* enviromental tests

* workflow eenv

* gitlab uploads and github_token

* snapshots
@DavidGOrtega DavidGOrtega changed the title [wip] CML contexts CML contexts Oct 19, 2020
@DavidGOrtega DavidGOrtega marked this pull request as draft October 19, 2020 16:30
Copy link
Contributor

@elleobrien elleobrien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DavidGOrtega looks pretty nice- everything makes sense. Two big comments to discuss:

  1. I don't think I fully understand where we would fit in the code for, say, Jenkins as my CI server- right now, the client code looks for environmental variables that are special to their runners. But I'm trying to imagine what the logic would be for Bitbucket with a Jenkins CI integration- since it looks like registering runners, creating comments, and calling the API all still happens in the same module as runner environmental variables are read in (GitlabClient.js and GithubClient.js). How would I do it?

  2. Because we define properties and methods of the clients in GitlabClient.js and GithubClient.js, and then again define those same methods and clients in cml.js (although in a very high-level way), it almost seems like there's duplicated code. It makes very clear sense why this pattern is used, but I'm concerned about maintenance with two extremely similar "levels". Do you think this is a concern?

Nice work, and looking forward to chatting at our meeting!

@@ -4,6 +4,12 @@ on: [push, pull_request]

env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
TEST_GITHUB_TOKEN: ${{ secrets.TEST_GITHUB_TOKEN }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clear, does this mean:

in order to test cml, we'll need to supply a repo url, sha, and access token as secrets for each provider (gitlab and github)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in order to test it independently we need to setup env variables

  TEST_GITHUB_TOKEN: ${{ secrets.TEST_GITHUB_TOKEN }} 
  TEST_GITHUB_REPO: https://github.com/iterative/cml_qa_tests_dummy/
  TEST_GITHUB_SHA: 62edc8b3f46a60b3fe1e5c08fd3e0046d350ee29
  TEST_GITLAB_TOKEN: ${{ secrets.TEST_GITLAB_TOKEN }}
  TEST_GITLAB_REPO: https://gitlab.com/iterative.ai/cml_qa_tests_dummy/
  TEST_GITLAB_SHA: c4c13286e78dc252dd2611f31a755f10d343fbd4

repo: RUNNER_REPO,
token: repo_token
});
const IS_GITHUB = cml.driver === 'github';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point me to where IS_GITHUB is used? Not sure I see it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lines 68, 115 and 150 in the same file

@@ -0,0 +1,95 @@
jest.setTimeout(20000);

const GithubClient = require('./GithubClient');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very very minor, but should we use GitHub & GitLab stylizations for variables throughout( instead of Github and Gitlab?

src/cml.js Outdated
const { repo } = opts;
const { GITHUB_REPOSITORY, CI_PROJECT_URL } = process.env;

if (repo && repo.startsWith('https://github.com')) return 'github';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, this will make it easy to add as many clients as we want to the logic!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can actually infer the driver automatically in the environment or even setup the driver, repo and token to be pointing whatever we want

src/cml.js Outdated
};

const { driver = env_driver(), repo, token } = opts;
this.driver = driver;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

js question: in this context, what variable will this be set to? I know it's to do with global, just trying to understand logic

Copy link
Contributor Author

@DavidGOrtega DavidGOrtega Oct 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Points to the object itself, it's the classic OOP you know. JS supports class notation.

RUNNER_IMAGE = 'dvcorg/cml:latest'
RUNNER_IMAGE = 'dvcorg/cml:latest',

RUNNER_DRIVER,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the flags and some js modules use driver, but others use client- if I understand correctly, they're referring to the same thing. Is this intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont remember to have set any client as a parameter, however you are right the driver points to one of the clients GitlabClient or GitlabClient.
Im changing them for Gitlab and GitHub and are going to be inside the drivers folder. To be more clear

src/cml.js Outdated

let mime, uri;

if (gitlab_uploads) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to move the gitlab_uploads logic out of CML.publish()? The only reason I ask is that if we expand to more clients with their own unique methods for uploading artifacts, I could imagine this logic getting hard to maintain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is where it should go actually, the real logic is in the client already which is calling the gitlab api but here we define the business logic somehow you have to decide which upload is going to be used cml's native one or one provide by the client/driver

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @elleobrien. This should be gitlab's client business on how to publish. Otherwise, all the logic is spread around CML, github and gitlab and it will grow exponentially with new clients.

A more abstract interface might be required like publich_or_upload() (a better name is needed).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmpetrov it is. The logic to publish in Gitlab is inside the Gitlab client. In CML there is only a decision. Take into account that even of you work in Gitlab you can still publish with CML storage. So its just an if

if (gitlab_uploads) {
      const client = get_client({ ...this, driver: 'gitlab' });
      ({ mime, uri } = await client.publish(opts));
    } else {
      ({ mime, uri } = await upload(opts));
    }

If you see all the specific publish code is inside the Gitlab client

Copy link
Contributor Author

@DavidGOrtega DavidGOrtega Oct 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we were setting that business logic we would be adding the CML upload in every client

} else {
      ({ mime, uri } = await upload(opts));
    }

* token

* clean env before tests

* print env

* yaml

* yaml

* yaml

* yaml

* check

* fix repo env

* fix repo env

* fix repo env

* fix repo env

* check name

* env github tests

* env github tests

* log check

* log check

* no check tessts

* enviromental tests

* workflow eenv

* gitlab uploads and github_token

* snapshots

* log env

* log env

* update also github.context

* No log
Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change makes the existing API cleaner. But I still see some opportunities in making API even better.

In the change. it tries to abstract out the providers (GH/GL) from logic but keep the separation on the public API level (CML class). It looks like we need to separate the public class (CML), the providers/drivers logic (it is mostly done), and the installation logic (this is part of CML for some reason).

Now: CML API (with some logic) --> Providers (with logic)
Should be: CML API (no logic, dummy proxy)--> Providers Factory (no logic, just abstracting out the providers) --> Providers(all logic is here)

Also, it is not clear from the change on how to use different systems:

  1. how to work with GitHub provider but GitLab CI/CD?
  2. how we suppose to integrate Jenkins (CI/CD) with GitLab/GitHub provider?

src/cml.js Outdated
}

env_repo() {
return get_client(this).env_repo();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These one-line methods seems a bit strange - clear logic duplication. You have already abstracted out the provider type by a factory method. Now you suppose to call the methods directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not a factory but a facade. CML api its defining some methods and then its calling the proper underlying code. That code is wrapped inside every client.

I dont see code duplication but expose the methods that it should expose. Users that would use the CML SDK should not dig into the drivers/clients. Just only know that several operations are available in the CML.

src/cml.js Outdated

async comment_create(opts = {}) {
const client = get_client(this);
return await client.comment_create(opts);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here but with two lines, not one.

src/cml.js Outdated

let mime, uri;

if (gitlab_uploads) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @elleobrien. This should be gitlab's client business on how to publish. Otherwise, all the logic is spread around CML, github and gitlab and it will grow exponentially with new clients.

A more abstract interface might be required like publich_or_upload() (a better name is needed).

src/cml.js Outdated
throw new Error('driver unknown!');
};

class CML {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is CML (and cml.js) the right name? The code ends up using CML for different objects like:

const CML = require('../src/cml');
…
const cml = new CML(opts);

One of the big ideas behind this redesign is to separate different concepts - GitHub/GitLab and CI/CDs executors. But it seems like we still using a monolithic CML concept.

I'd suggest renaming this one to something like ScmHostingDriver with the factory method and abstract (empty) methods (or without methods since it is JS).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this CML is a general, user-facing API - it makes sense to keep it here and keep the name. But all the client's specific logic needs to be extracted. Including the abstract *Driver class with the factory method.

const client = get_client({ ...this, driver: 'gitlab' });
({ mime, uri } = await client.publish(opts));
} else {
({ mime, uri } = await upload(opts));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there is one case is not covered - publish(gitlab_uploads=false) for GitLab client.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

publish(gitlab_uploads=false) goes to default upload the else

@DavidGOrtega
Copy link
Contributor Author

@dmpetrov

Now: CML API (with some logic) --> Providers (with logic)
Should be: CML API (no logic, dummy proxy)--> Providers Factory (no logic, just abstracting out the providers) --> Providers(all logic is here)

Its a facade. All the business logic (input data check, output data, etc...) should be done in the Facade Class, also its the one that exposes and defines all the methods available. It could be the class to be exposed in a microservice in example.

One of the big ideas behind this redesign is to separate different concepts - GitHub/GitLab and CI/CDs executors. But it seems like we still using a monolithic CML concept.

Its not. We are defining a common API to all the vendors. with equivalent inputs and outputs. I dont see monolithic to have a class

Also, it is not clear from the change on how to use different systems:
how to work with GitHub provider but GitLab CI/CD?
how we suppose to integrate Jenkins (CI/CD) with GitLab/GitHub provider?

Now every command is totally independent from the environment we can setup a repo and a token and input variables. The missing input variables has to be:

  • Aligned by the CI integration
  • Aligned by us

Before we were not able to comment on a GH repo from an GL runner, simply because we were not making the commands independents from the ENV.

As I say the SHA needed could be automatically set by the CI in its own SHA or we can actually extract it from the ENV or we will have to even do calls to APIs, its going to deepend on the integration, but its possible because we have a non ENV dependant code anymore.

@dmpetrov
Copy link
Member

Before we were not able to comment on a GH repo from an GL runner, simply because we were not making the commands independents from the ENV.

A code example would be really useful. Like GH repo with GL runner.

* git sha

* commit_sha

* log

* log gh sha

* log gh sha

* show origin

* remote as repo

* remote as repo

* tests

* remove one test not working on gitlab

* space
@DavidGOrtega DavidGOrtega changed the base branch from master to release/contexts October 28, 2020 16:14
@DavidGOrtega DavidGOrtega changed the base branch from release/contexts to master October 28, 2020 16:16
@DavidGOrtega DavidGOrtega marked this pull request as ready for review October 28, 2020 17:20
Copy link
Member

@dmpetrov dmpetrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great change! I especially like the SHA/ENV part.

A few comments are inline - please address before emerging.

[boolean]
[deprecated: Use native instead] [boolean]
--native Uses driver's native capabilities to upload assets instead
of CML's backend. [boolean]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backend might be not the best term in the context of CI/CD. how about ... instead of external CML's assets storage

.boolean('native')
.describe(
'native',
"Uses driver's native capabilities to upload assets instead of CML's backend."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backend might be the best word - please see the comment above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im confused here... not sure if you are asking to change native to backend or you are pointing is not the best word... A few iterations before I called it backend

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I mean "backend might be NOT the best word" (even in the text). and I suggested changing the description, not the option name.

@@ -17,9 +17,17 @@ describe('CML e2e', () => {
--gitlab-uploads Uses GitLab uploads instead of CML storage. Use GitLab
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

expect(output).toMatchInlineSnapshot(...) - it might trigger false alarms if library will change.
It is not a problem now, but it worth changing in the future in a more reliable way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by the library I mean yargs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, every time that we change the specs we have to update the template. It's a common practice in web dev to control that everything is looking as it should.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that it can change even without text changes - from the lib change (they might decide to change the default formatting).

const infer_driver = (opts = {}) => {
const { repo } = opts;
if (repo && repo.includes('github.com')) return 'github';
if (repo && repo.includes('gitlab.com')) return 'gitlab';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could these checks be just a bit more rigorous?
For example, I might have a local repo in my file system like ~/src/my-gitlab.com-repo/ which might be no GitLab repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repo has to be an URL. Its the url to be used by the API. can not be anything else and when its going to be used it will fail if not.


const { GITHUB_REPOSITORY, CI_PROJECT_URL } = process.env;
if (GITHUB_REPOSITORY) return 'github';
if (CI_PROJECT_URL) return 'gitlab';
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need similar variables for GL? Why not?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by CI_PROJECT_URL :) Why it is not just GitLab like GITLAB_REPOSITORY ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI_PROJECT_URL is an env variable that GL CI generates. Its not our choose to name it that way. What we are doing here is try to guess if the driver is not set nor the repo either where are we being run.
This is useful for on premise instances also.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we should do if none is defined? Should we throw a meaningful exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing! later on the exception will be raised when creating the Driver

if (repo && repo.includes('github.com')) return 'github';
if (repo && repo.includes('gitlab.com')) return 'gitlab';

const { GITHUB_REPOSITORY, CI_PROJECT_URL } = process.env;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is none of the variables are defined?
Is it ok if it is defined but empty or wrong address (does it make sense to check URL pattern)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are specific env variables for specific vendors. We are not even looking at the value we just need to 'see' them to 'know' which kind driver it is in case its not specified by the user

src/cml.js Outdated
if (driver === 'github') return new Github({ repo, token });
if (driver === 'gitlab') return new Gitlab({ repo, token });

throw new Error('driver unknown!');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be more informative to include the wrong name in the error message

const { upload, exec } = require('./utils');

const uri_no_trailing_slash = (uri) => {
return uri.endsWith('/') ? uri.substr(0, uri.length - 1) : uri;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there triml/trimr or rstrip/lstrip in JS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hahaha no! I made one a few iterations before but I removed it. We could use underscore or a library like that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants