
Conversation

michelvocks
Member

Fixes #100 and also #2

@codecov-io

codecov-io commented Sep 3, 2018

Codecov Report

Merging #101 into master will decrease coverage by 0.23%.
The diff coverage is 36.11%.


@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
- Coverage   64.46%   64.23%   -0.24%     
==========================================
  Files          22       24       +2     
  Lines        1911     2044     +133     
==========================================
+ Hits         1232     1313      +81     
- Misses        541      580      +39     
- Partials      138      151      +13
Impacted Files          Coverage Δ
scheduler/scheduler.go  71.83% <31.57%> (-2.73%) ⬇️
plugin/plugin.go        67.5% <41.17%> (ø)
plugin/grpc.go          0% <0%> (ø)
security/vault.go       76.69% <0%> (+3%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9622b72...1b8aa7a.

@michelvocks michelvocks requested a review from Skarlso September 3, 2018 12:05
@michelvocks michelvocks self-assigned this Sep 3, 2018
@michelvocks michelvocks added the "Ready To Merge" label Sep 3, 2018
@michelvocks
Member Author

Ping @Skarlso

@Skarlso
Member

Skarlso commented Sep 3, 2018

How is this a coverage decrease? :D

@Skarlso
Member

Skarlso commented Sep 3, 2018

Oh! Job results feature. :P

@Skarlso
Member

Skarlso commented Sep 3, 2018

\o/ for plugin tests.

@michelvocks
Member Author

@Skarlso Yeah, it seems that when a file has no tests at all, it doesn't count toward coverage at all. 🤷‍♂️

@Skarlso
Member

Skarlso commented Sep 4, 2018

Hhhaha. Going to pick this up for a manual test later today and merge it when everything is okay. :)

@Skarlso
Member

Skarlso commented Sep 4, 2018

Also, you misspelled javaExecueatable and Python too. :P And the comments. :D

}

// Connect to plugin(pipeline)
if err := pS.Validate(); err != nil {

The error message that follows this is no longer valid. It didn't fail because it couldn't connect; it failed because the validation failed, whatever that means in the plugin.

 // Connect to plugin(pipeline)
-if err := pS.Connect(c, nil); err != nil {
+if err := pS.Validate(); err != nil {
 	gaia.Cfg.Logger.Debug("cannot connect to pipeline", "error", err.Error(), "pipeline", p)

This also isn't valid any longer since this is a validation error now.

@michelvocks
Member Author

@Skarlso PHAL 🤗

@Skarlso
Member

Skarlso commented Sep 5, 2018

@michelvocks What's PHAL? :D

Either one of these?

Rank Abbr. Meaning  
PHAL Public Health Association of Latvia (est. 1995)  
PHAL Paint Horse Association Luxembourg  
PHAL Portland Harbour Authority Limited (UK)

@michelvocks
Member Author

Please have a look 😆

@Skarlso
Member

Skarlso commented Sep 5, 2018

LGTM! :)

@Skarlso Skarlso added the "Needs Work" label and removed the "Ready To Merge" label Sep 6, 2018
@Skarlso
Member

Skarlso commented Sep 6, 2018

Dude! A failing pipeline just killed Gaia:

2018-09-06T06:37:49.007+0200 [DEBUG] plugin.failing_golang: 2018/09/06 06:37:49 CreateDeployment has been finished!
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x4 pc=0x1501680]

goroutine 1309 [running]:
github.com/gaia-pipeline/gaia/plugin.(*Plugin).Execute(0xc42063c6c0, 0xc42055c150, 0xc4204aad20, 0xc42055c150)
        /Users/hannibal/golang/src/github.com/gaia-pipeline/gaia/plugin/plugin.go:184 +0x290
github.com/gaia-pipeline/gaia/scheduler.executeJob(0x69192d8a, 0xc4200e4400, 0x12, 0xc4205de900, 0x3e, 0xc420287820, 0x1, 0x4, 0xc4200e44a0, 0x15, ...)
        /Users/hannibal/golang/src/github.com/gaia-pipeline/gaia/scheduler/scheduler.go:334 +0x151
created by github.com/gaia-pipeline/gaia/scheduler.(*Scheduler).executeScheduler
        /Users/hannibal/golang/src/github.com/gaia-pipeline/gaia/scheduler/scheduler.go:581 +0xb29
exit status 2
make: *** [dev] Error 1

The repo I used: https://github.com/Skarlso/go-example

@Skarlso Skarlso left a comment

Failing pipeline job resulted in Gaia crashing.

@michelvocks
Member Author

Thanks @Skarlso . I also found some other issues which I need to fix before we can merge that 🤦‍♂️

@michelvocks michelvocks added the "Ready To Merge" label and removed the "Needs Work" label Sep 9, 2018
@michelvocks
Member Author

@Skarlso please have a look again. When a job returns ErrorExitPipeline, the pipeline is stopped but still marked as successful.

@Skarlso Skarlso force-pushed the fail_pipeline_when_job_fails branch from 4257072 to c63d0bc Compare September 11, 2018 19:34
@Skarlso
Member

Skarlso commented Sep 11, 2018

LGTM now my man. Good to merge at your leisure @michelvocks.

@michelvocks
Member Author

@Skarlso We cannot set FailPipeline when an error occurs, because the pipeline should not fail. It should just exit, without failing. I will have a look again 🤦‍♂️ 😄

@Skarlso
Member

Skarlso commented Sep 12, 2018

@michelvocks wait. I don't get it. Why shouldn't the pipeline fail if there is an error? It shouldn't pass. If there is an error, make it fail. I don't get that reasoning? :)

@Skarlso
Member

Skarlso commented Sep 12, 2018

Oh, like, for example, if you want a cleanup at the end to be executed no matter whether a job failed, right? Yeah, in that case we don't want to mark the pipeline as failed. :/ Sorry. :)

@michelvocks
Member Author

@Skarlso The main idea was to safely exit a pipeline. So for example:

You've got a pipeline with three jobs which should be executed one after the other.
During the execution of job two you find out that job three does not need to be executed at all, so you want to safely exit the pipeline without failing it.

Anyway, let me rework that. 😄

@Skarlso Skarlso added the "Needs Work" label and removed the "Ready To Merge" label Sep 13, 2018
@Skarlso
Member

Skarlso commented Sep 15, 2018

But wait. I just realised what you are writing. In my case it wasn't a safe exit, though: I returned an error in the job. Isn't that supposed to fail the pipeline? Or does cleanup still need to run?

@Skarlso
Member

Skarlso commented Sep 19, 2018

So, I still don't understand. I found the bug, which was a timing issue with closing finish in case of a job execution failure. But this will still exit the pipeline and make it fail, because the job failed.

Also, I found that the forever-running loops are pretty fragile. We need to add timeouts to them, or else we get into the trouble of an unfinished, stuck pipeline which you can only kill by killing Gaia. That is unacceptable.

I'll add some timeout selects here and there to fail the loop if after a certain time there is nothing happening.

Also, we might want to add user interaction, like cancelling the whole run midway.

Here's a sample run of a job failing: the pipeline stops and cleanup no longer runs. But the status of the pipeline will also be failed. Is this what you want? :D

[screenshot: pipeline run, 2018-09-19 22:04:51]

@Skarlso
Member

Skarlso commented Sep 19, 2018

BTW, another interesting one: I was printing out the status from the workflow after it had supposedly replaced the items, but on every loop it was the same output:

2018-09-19T21:57:59.149+0200 [DEBUG] plugin.def-failing_golang: 2018/09/19 21:57:59 MigrateDB has been finished!
job:  DB Migration success
job Create DB User is in state waiting for execution
job DB Migration is in state waiting for execution
job Create K8S Namespace is in state waiting for execution
job Create K8S Deployment is in state waiting for execution
job Create K8S Service is in state waiting for execution
job Create K8S Ingress is in state waiting for execution
job Clean up is in state waiting for execution
looping forever
looping forever

Waiting for execution. Isn't that supposed to be updating? This is the part:

for wl := range mw.Iter() {
	fmt.Printf("job %s is in state %s\n", wl.job.Title, wl.job.Status)
	if !wl.done {
		allWLDone = false
	}
}

Where this part should have these updated to done:

// Job is done
wl := mw.GetByID(j.ID)
wl.done = true
mw.Replace(*wl)

Oh, this is the workload status, pfff, not the job status... Silly me. Still, should those guys really be "waiting for execution" even after they finished? :)

@Skarlso
Member

Skarlso commented Sep 19, 2018

Damn it. :( No, I didn't fix it. :( It made it a tad better, but it's still hanging in an infinite loop without any channels being notified of anything.

@Skarlso
Member

Skarlso commented Sep 19, 2018

Alright, to summarise: I know it's a timing issue with the channels, and I know that it's happening in
func (s *Scheduler) executeScheduler(r *gaia.PipelineRun, pS Plugin) {, where the for loop ends up running forever.

@Skarlso
Member

Skarlso commented Sep 21, 2018

I found the problem. Fixing it..

@Skarlso
Member

Skarlso commented Sep 21, 2018

@michelvocks So the problem is the following... :)

In here, at if !finalize && j.FailPipeline {, you clean up the executeScheduler channel of jobs still waiting to be executed, and then you close the channel. So nothing will be triggered any more.

Which is fine, if a job failed, we don't want to continue execution since that would only lead to more failures if the next job depended on the previous one doing something.

However, then you go: if there are parallel jobs still running, we want them to finish and not abort, if I understand correctly. Which is, again, reasonable and fine. However... Cleanup, which is the next job in line, is also in the started state according to the workload, which only has two states, started and done.

This hangs forever, because the channel is closed, so the job won't run; it just waits for things to happen, but no channel is ever signalled again, thus this guy:

// wait until finished
<-notFinishedWL.finishedSig

blocks forever.

What I will do is introduce another state called inProgress, so that we know a job is currently running and will wait for it to finish. But if it's merely started, we will ignore it and not run it, since the pipeline failed.

EDIT: OR! I just simply add singleWL.job.Status == gaia.JobRunning to the condition. OMG that might just work.
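
That EDIT idea — blocking only on workloads whose job is actually running — can be sketched like this; the types and names here are simplified stand-ins, not Gaia's real ones:

```go
package main

import "fmt"

// Hypothetical, simplified versions of the workload bookkeeping
// discussed above.
type jobStatus string

const (
	jobWaiting jobStatus = "waiting for execution"
	jobRunning jobStatus = "running"
)

type workload struct {
	status jobStatus
	done   bool
}

// shouldWaitFor sketches the fixed condition: only block on a
// workload's finished signal when its job is actually running.
// A job that never left the waiting state can be skipped, because
// the jobs channel is already closed and it will never start.
func shouldWaitFor(wl workload) bool {
	return !wl.done && wl.status == jobRunning
}

func main() {
	fmt.Println(shouldWaitFor(workload{status: jobRunning})) // true: in flight, wait for it
	fmt.Println(shouldWaitFor(workload{status: jobWaiting})) // false: never started, skip it
}
```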

@Skarlso
Member

Skarlso commented Sep 21, 2018

Yiss. Fixed it.

An interesting point is that executeJob was sending one last triggerSave call, which failed because we close the channel beforehand, so that's a send on a closed channel.

I put the close in a defer; that way it should happen afterwards, but still prevent the select from triggering again.

@Skarlso
Member

Skarlso commented Sep 21, 2018

Nope... damn it.

panic: send on closed channel

goroutine 1081 [running]:
github.com/gaia-pipeline/gaia/scheduler.executeJob.func1(0xc420546840, 0xc420626000)
        /Users/gbrautigam/gohome/src/github.com/gaia-pipeline/gaia/scheduler/scheduler.go:326 +0x66
github.com/gaia-pipeline/gaia/scheduler.executeJob(0xcbe6c422, 0xc42018ef20, 0x12, 0xc42018d900, 0x3e, 0xc420010220, 0x1, 0x4, 0xc42018efc0, 0x15, ...)
        /Users/gbrautigam/gohome/src/github.com/gaia-pipeline/gaia/scheduler/scheduler.go:337 +0x143
created by github.com/gaia-pipeline/gaia/scheduler.(*Scheduler).executeScheduler
        /Users/gbrautigam/gohome/src/github.com/gaia-pipeline/gaia/scheduler/scheduler.go:582 +0xa2b
exit status 2
make: *** [dev] Error 1

still fails.

@Skarlso
Member

Skarlso commented Sep 21, 2018

Ideally, the writer should close the channel. But since multiple goroutines are using it, I'm opting to have this particular part of the process handling simply leave the channel alone.

There is no performance penalty in leaving a channel open, and it will be garbage-collected anyway.
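
A minimal illustration of why leaving the channel open is the safe choice here: a send on a closed channel always panics (the panic in the trace above), while a send on an open channel is harmless. `trySend` is purely illustrative:

```go
package main

import "fmt"

// trySend reproduces the hazard from the trace above: sending on a
// closed channel panics. It recovers from that panic so the two
// outcomes can be compared side by side.
func trySend(ch chan struct{}) (ok bool) {
	defer func() {
		if recover() != nil {
			ok = false // the "send on closed channel" panic
		}
	}()
	ch <- struct{}{}
	return true
}

func main() {
	open := make(chan struct{}, 1) // buffered and left open: send succeeds
	fmt.Println(trySend(open))     // true

	closed := make(chan struct{})
	close(closed)
	fmt.Println(trySend(closed)) // false: this is what killed the scheduler
}
```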

@Skarlso Skarlso added the "needs review" label and removed the "Needs Work" label Sep 21, 2018
@michelvocks
Member Author

Hey @Skarlso. I've recovered from my illness and am now trying to get back on track. I just tried to follow all your comments, but I'm a bit lost. Give me some time to get back into the code 🤗

@michelvocks
Member Author

@Skarlso I found the problem. Like you said, when a job fails in a pipeline, I have to wait until the parallel jobs are finished.
In the goroutine where I check whether parallel jobs are still running, I took a pointer to the current object of the iteration. This is not allowed, because that object is constantly changing. I had to do a deep copy or (what I did now) just keep a pointer to the actual finish channel.
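The aliasing bug described above is a classic Go pitfall and can be reproduced in a few lines; the workload type and job titles are hypothetical stand-ins:

```go
package main

import "fmt"

type workload struct{ title string }

// collectAliased shows the buggy pattern: every saved pointer refers to
// the same reused variable, so after the loop they all hold the last value.
func collectAliased(items []workload) []*workload {
	var cur workload
	var out []*workload
	for _, it := range items {
		cur = it
		out = append(out, &cur)
	}
	return out
}

// collectCopied shows the fix: take a copy per iteration (or, as in the
// PR, keep only a pointer to the one stable field you need, such as the
// finish channel).
func collectCopied(items []workload) []*workload {
	var out []*workload
	for _, it := range items {
		it := it // per-iteration copy
		out = append(out, &it)
	}
	return out
}

func main() {
	items := []workload{{"job A"}, {"job B"}, {"job C"}}
	fmt.Println(collectAliased(items)[0].title) // job C: aliased to the last item
	fmt.Println(collectCopied(items)[0].title)  // job A: each pointer is distinct
}
```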

Also, I found that the forever-running loops are pretty fragile. We need to add timeouts to them, or else we get into the trouble of an unfinished, stuck pipeline which you can only kill by killing Gaia. That is unacceptable.

You are absolutely right, but maybe we should do that in a separate (follow-up) PR? What do you think? 🤗

@michelvocks
Member Author

Ah, by the way: here is a pipeline which returns a specific error. This error tells Gaia that the job failed and the pipeline should be aborted, but not marked as failed: https://github.com/michelvocks/gaia-test-pipeline

@Skarlso
Member

Skarlso commented Sep 23, 2018

@michelvocks Cool stuff dude. Yeah do the timeouts in a separate PR definitely. :)

@Skarlso
Member

Skarlso commented Sep 23, 2018

I would also try to clean up the channels later on. :)

@michelvocks michelvocks merged commit 767ef36 into gaia-pipeline:master Sep 24, 2018