
Conversation

michelvocks
Member

Fixes #100 and also #2

@codecov-io

codecov-io commented Sep 3, 2018

Codecov Report

Merging #101 into master will decrease coverage by 0.23%.
The diff coverage is 36.11%.


@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
- Coverage   64.46%   64.23%   -0.24%     
==========================================
  Files          22       24       +2     
  Lines        1911     2044     +133     
==========================================
+ Hits         1232     1313      +81     
- Misses        541      580      +39     
- Partials      138      151      +13
Impacted Files          Coverage Δ
scheduler/scheduler.go  71.83% <31.57%> (-2.73%) ⬇️
plugin/plugin.go        67.5% <41.17%> (ø)
plugin/grpc.go          0% <0%> (ø)
security/vault.go       76.69% <0%> (+3%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9622b72...1b8aa7a.

@michelvocks michelvocks requested a review from Skarlso September 3, 2018 12:05
@michelvocks michelvocks self-assigned this Sep 3, 2018
@michelvocks michelvocks added the "Ready To Merge" label Sep 3, 2018
@michelvocks
Member Author

Ping @Skarlso

@Skarlso
Member

Skarlso commented Sep 3, 2018

How is this a coverage decrease? :D

@Skarlso
Member

Skarlso commented Sep 3, 2018

Oh! Job results feature. :P

@Skarlso
Member

Skarlso commented Sep 3, 2018

\o/ for plugin tests.

@michelvocks
Member Author

@Skarlso Yeah, it seems that when a file has no tests at all, it doesn't count toward coverage at all. 🤷‍♂️

@Skarlso
Member

Skarlso commented Sep 4, 2018

Hhhaha. Going to pick this up for a manual test later today and merge it when everything is okay. :)

@Skarlso
Member

Skarlso commented Sep 4, 2018

Also, you misspelled javaExecueatable and Python too. :P And the comments. :D

}

// Connect to plugin(pipeline)
if err := pS.Validate(); err != nil {

The error message that follows this is no longer valid. It didn't fail because it couldn't connect; it failed because the validation failed, whatever that means in the plugin.

 // Connect to plugin(pipeline)
-if err := pS.Connect(c, nil); err != nil {
+if err := pS.Validate(); err != nil {
 	gaia.Cfg.Logger.Debug("cannot connect to pipeline", "error", err.Error(), "pipeline", p)

This also isn't valid any longer since this is a validation error now.

@michelvocks
Member Author

@Skarlso PHAL 🤗

@Skarlso
Member

Skarlso commented Sep 5, 2018

@michelvocks What's PHAL? :D

Either one of these?

Rank Abbr. Meaning  
PHAL Public Health Association of Latvia (est. 1995)  
PHAL Paint Horse Association Luxembourg  
PHAL Portland Harbour Authority Limited (UK)

@michelvocks
Member Author

Please have a look 😆

@Skarlso
Member

Skarlso commented Sep 5, 2018

LGTM! :)

@Skarlso Skarlso added the "Needs Work" label and removed the "Ready To Merge" label Sep 6, 2018
@Skarlso
Member

Skarlso commented Sep 6, 2018

Dude! A failing pipeline just killed Gaia:

2018-09-06T06:37:49.007+0200 [DEBUG] plugin.failing_golang: 2018/09/06 06:37:49 CreateDeployment has been finished!
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x4 pc=0x1501680]

goroutine 1309 [running]:
github.com/gaia-pipeline/gaia/plugin.(*Plugin).Execute(0xc42063c6c0, 0xc42055c150, 0xc4204aad20, 0xc42055c150)
        /Users/hannibal/golang/src/github.com/gaia-pipeline/gaia/plugin/plugin.go:184 +0x290
github.com/gaia-pipeline/gaia/scheduler.executeJob(0x69192d8a, 0xc4200e4400, 0x12, 0xc4205de900, 0x3e, 0xc420287820, 0x1, 0x4, 0xc4200e44a0, 0x15, ...)
        /Users/hannibal/golang/src/github.com/gaia-pipeline/gaia/scheduler/scheduler.go:334 +0x151
created by github.com/gaia-pipeline/gaia/scheduler.(*Scheduler).executeScheduler
        /Users/hannibal/golang/src/github.com/gaia-pipeline/gaia/scheduler/scheduler.go:581 +0xb29
exit status 2
make: *** [dev] Error 1

The repo I used: https://github.com/Skarlso/go-example

@Skarlso Skarlso left a comment

Failing pipeline job resulted in Gaia crashing.

@michelvocks
Member Author

Thanks @Skarlso . I also found some other issues which I need to fix before we can merge that 🤦‍♂️

@michelvocks michelvocks added the "Ready To Merge" label and removed the "Needs Work" label Sep 9, 2018
@michelvocks
Member Author

@Skarlso please have a look again. When a job returns ErrorExitPipeline, the pipeline is stopped but still marked as successful.

@Skarlso Skarlso force-pushed the fail_pipeline_when_job_fails branch from 4257072 to c63d0bc Compare September 11, 2018 19:34
@Skarlso
Member

Skarlso commented Sep 11, 2018

LGTM now my man. Good to merge at your leisure @michelvocks.

@michelvocks
Member Author

@Skarlso We cannot set FailPipeline when an error occurs, because the pipeline should not fail. It should just exit, without failing. I will have a look again 🤦‍♂️ 😄

@Skarlso
Member

Skarlso commented Sep 12, 2018

@michelvocks wait. I don't get it. Why shouldn't the pipeline fail if there is an error? It shouldn't pass. If there is an error, make it fail. I don't get that reasoning? :)

@Skarlso
Member

Skarlso commented Sep 12, 2018

Oh, like, for example, if you want a cleanup at the end to be executed no matter whether a job failed, right? Yeah, in that case we don't want to mark the pipeline as failed. :/ Sorry. :)

@michelvocks
Member Author

@Skarlso The main idea was to safely exit a pipeline. So for example:

You've got a pipeline with three jobs which should be executed one after the other.
During the execution of job two you find out that job three does not need to be executed at all, so you want to safely exit the pipeline without failing it.

Anyway, let me rework that. 😄

@Skarlso Skarlso added the "Needs Work" label and removed the "Ready To Merge" label Sep 13, 2018
@Skarlso
Member

Skarlso commented Sep 15, 2018

But wait. I just realised what you are writing. In my case it wasn't a safe exit, though: I returned an error in the job. Isn't that supposed to fail the pipeline? Or does cleanup still need to run?

@Skarlso
Member

Skarlso commented Sep 19, 2018

So, I still don't understand. I found the bug, which was a timing issue with closing finish in case of a job execution failure. But this will still exit the pipeline and make it fail, because the job failed.

Also, I found that the forever-running loops are pretty fragile. We need to add timeouts to them, or else we get into the trouble of an unfinished, stuck pipeline which you can only kill by killing Gaia. That is unacceptable.

I'll add some timeout selects here and there to fail the loop if after a certain time there is nothing happening.

Also, we might want to add user interaction, like cancelling the whole run midway.

Here's a sample run of a job failing: the pipeline stops and cleanup no longer runs. But the status of the pipeline will also be failed. Is this what you want? :D

[screenshot: pipeline run, 2018-09-19 22:04:51]

@Skarlso
Member

Skarlso commented Sep 19, 2018

BTW, another interesting one: I was printing out the status from the workflow after it had supposedly replaced the items, but on every loop it was the same output:

2018-09-19T21:57:59.149+0200 [DEBUG] plugin.def-failing_golang: 2018/09/19 21:57:59 MigrateDB has been finished!
job:  DB Migration success
job Create DB User is in state waiting for execution
job DB Migration is in state waiting for execution
job Create K8S Namespace is in state waiting for execution
job Create K8S Deployment is in state waiting for execution
job Create K8S Service is in state waiting for execution
job Create K8S Ingress is in state waiting for execution
job Clean up is in state waiting for execution
looping forever
looping forever

Waiting for execution. Isn't that supposed to be updating? This is the part:

for wl := range mw.Iter() {
	fmt.Printf("job %s is in state %s\n", wl.job.Title, wl.job.Status)
	if !wl.done {
		allWLDone = false
	}
}

Where this part should have these updated to done:

// Job is done
wl := mw.GetByID(j.ID)
wl.done = true
mw.Replace(*wl)

Oh, this is the workload status, pfff, not the job status... Silly me. Still, should those guys really be "waiting for execution" even after they finished? :)

@Skarlso
Member

Skarlso commented Sep 19, 2018

Damn it. :( No, I didn't fix it. :( It made it a tad better, but it's still hanging in an infinite loop without any channels being notified of anything.

@Skarlso
Member

Skarlso commented Sep 19, 2018

Alright, to summarise: I know it's a timing issue with the channels, and I know that it's happening in
func (s *Scheduler) executeScheduler(r *gaia.PipelineRun, pS Plugin) {, where the for loop ends up running forever.

@Skarlso
Member

Skarlso commented Sep 21, 2018

I found the problem. Fixing it..

@Skarlso
Member

Skarlso commented Sep 21, 2018

@michelvocks So the problem is the following... :)

In here, at if !finalize && j.FailPipeline {, you clean up the executeScheduler channel of jobs still waiting to be executed, and then you close the channel. So nothing will be triggered any more.

Which is fine, if a job failed, we don't want to continue execution since that would only lead to more failures if the next job depended on the previous one doing something.

However, then you go: if there are parallel jobs still running, we want them to finish and not abort, if I understand correctly. Which is, again, reasonable and fine. However... Cleanup, which is the next job in line, is also in the started state according to the workload, which only has two states, started and done.

This hangs forever, because the channel is closed, so the job won't run; it just waits for things to happen, but no channel is ever signalled again, thus this guy:

// wait until finished
<-notFinishedWL.finishedSig

blocks forever.

What I will do is introduce another state called inProgress, so that we know a job is currently running and will wait for it to finish. But if it's merely started, we will ignore it and not run it, since the pipeline failed.

EDIT: OR! I just simply add singleWL.job.Status == gaia.JobRunning to the condition. OMG that might just work.
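
That EDIT idea — blocking only on workloads whose job is actually running — can be sketched like this; the types and names here are simplified stand-ins, not Gaia's real ones:

```go
package main

import "fmt"

// Hypothetical, simplified versions of the workload bookkeeping
// discussed above.
type jobStatus string

const (
	jobWaiting jobStatus = "waiting for execution"
	jobRunning jobStatus = "running"
)

type workload struct {
	status jobStatus
	done   bool
}

// shouldWaitFor sketches the fixed condition: only block on a
// workload's finished signal when its job is actually running.
// A job that never left the waiting state can be skipped, because
// the jobs channel is already closed and it will never start.
func shouldWaitFor(wl workload) bool {
	return !wl.done && wl.status == jobRunning
}

func main() {
	fmt.Println(shouldWaitFor(workload{status: jobRunning})) // true: in flight, wait for it
	fmt.Println(shouldWaitFor(workload{status: jobWaiting})) // false: never started, skip it
}
```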

@Skarlso
Member

Skarlso commented Sep 21, 2018

Yiss. Fixed it.

An interesting point is that executeJob was sending one last triggerSave call, which failed because we close the channel beforehand, so that's a send on a closed channel.

I put the close in a defer; that way it should happen afterwards, but still prevent the select from triggering again.

@Skarlso
Member

Skarlso commented Sep 21, 2018

Nope... damn it.

panic: send on closed channel

goroutine 1081 [running]:
github.com/gaia-pipeline/gaia/scheduler.executeJob.func1(0xc420546840, 0xc420626000)
        /Users/gbrautigam/gohome/src/github.com/gaia-pipeline/gaia/scheduler/scheduler.go:326 +0x66
github.com/gaia-pipeline/gaia/scheduler.executeJob(0xcbe6c422, 0xc42018ef20, 0x12, 0xc42018d900, 0x3e, 0xc420010220, 0x1, 0x4, 0xc42018efc0, 0x15, ...)
        /Users/gbrautigam/gohome/src/github.com/gaia-pipeline/gaia/scheduler/scheduler.go:337 +0x143
created by github.com/gaia-pipeline/gaia/scheduler.(*Scheduler).executeScheduler
        /Users/gbrautigam/gohome/src/github.com/gaia-pipeline/gaia/scheduler/scheduler.go:582 +0xa2b
exit status 2
make: *** [dev] Error 1

still fails.

@Skarlso
Member

Skarlso commented Sep 21, 2018

Ideally, the writer should close the channel. But since multiple goroutines are using it, I'm opting to have this particular part of the process handling simply leave the channel alone.

There is no performance penalty in leaving a channel open, and it will be garbage-collected anyway.
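
A minimal illustration of why leaving the channel open is the safe choice here: a send on a closed channel always panics (the panic in the trace above), while a send on an open channel is harmless. `trySend` is purely illustrative:

```go
package main

import "fmt"

// trySend reproduces the hazard from the trace above: sending on a
// closed channel panics. It recovers from that panic so the two
// outcomes can be compared side by side.
func trySend(ch chan struct{}) (ok bool) {
	defer func() {
		if recover() != nil {
			ok = false // the "send on closed channel" panic
		}
	}()
	ch <- struct{}{}
	return true
}

func main() {
	open := make(chan struct{}, 1) // buffered and left open: send succeeds
	fmt.Println(trySend(open))     // true

	closed := make(chan struct{})
	close(closed)
	fmt.Println(trySend(closed)) // false: this is what killed the scheduler
}
```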

@Skarlso Skarlso added the "needs review" label and removed the "Needs Work" label Sep 21, 2018
@michelvocks
Member Author

Hey @Skarlso. I've recovered from my illness and am now trying to get back on track. I just tried to follow all your comments, but I'm a bit lost. Give me some time to get back into the code 🤗

@michelvocks
Member Author

@Skarlso I found the problem. Like you said, when a job fails in a pipeline, I have to wait until the parallel jobs are finished.
In the goroutine where I check whether parallel jobs are still running, I took a pointer to the current object of the iteration. This is not allowed, because that object is constantly changing. I had to do a deep copy or (what I did now) just keep a pointer to the actual finish channel.
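The aliasing bug described above is a classic Go pitfall and can be reproduced in a few lines; the workload type and job titles are hypothetical stand-ins:

```go
package main

import "fmt"

type workload struct{ title string }

// collectAliased shows the buggy pattern: every saved pointer refers to
// the same reused variable, so after the loop they all hold the last value.
func collectAliased(items []workload) []*workload {
	var cur workload
	var out []*workload
	for _, it := range items {
		cur = it
		out = append(out, &cur)
	}
	return out
}

// collectCopied shows the fix: take a copy per iteration (or, as in the
// PR, keep only a pointer to the one stable field you need, such as the
// finish channel).
func collectCopied(items []workload) []*workload {
	var out []*workload
	for _, it := range items {
		it := it // per-iteration copy
		out = append(out, &it)
	}
	return out
}

func main() {
	items := []workload{{"job A"}, {"job B"}, {"job C"}}
	fmt.Println(collectAliased(items)[0].title) // job C: aliased to the last item
	fmt.Println(collectCopied(items)[0].title)  // job A: each pointer is distinct
}
```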

Also, I found that the forever-running loops are pretty fragile. We need to add timeouts to them, or else we get into the trouble of an unfinished, stuck pipeline which you can only kill by killing Gaia. That is unacceptable.

You are absolutely right, but maybe we should do that in a separate (follow-up) PR? What do you think? 🤗

@michelvocks
Member Author

Ah, by the way: here is a pipeline which returns a specific error. This error tells Gaia that the job failed and the pipeline should be aborted, but not marked as failed: https://github.com/michelvocks/gaia-test-pipeline

@Skarlso
Member

Skarlso commented Sep 23, 2018

@michelvocks Cool stuff dude. Yeah do the timeouts in a separate PR definitely. :)

@Skarlso
Member

Skarlso commented Sep 23, 2018

I would also try to clean up the channels later on. :)

@michelvocks michelvocks merged commit 767ef36 into gaia-pipeline:master Sep 24, 2018