Skip to content

Conversation

HonakerM
Copy link
Contributor

@HonakerM HonakerM commented Jul 25, 2024

What this PR does / why we need it:
This PR refactors the model trainer abstraction to be a more generic "job" abstraction. With this job abstraction we are able to implement both inferences and trainings in the background and poll their status/results.

This type of functionality has multiple use-cases inside Caikit with a few listed below:

  • Executing a long running inference which could be impacted by network issues
  • Tracking and managing inferences while they're running e.g. polling the status of an inference or cancelling a request.
  • Better support coordination between pods to support things like load distribution

This PR creates new http and grpc endpoints for the job predictions:

HTTP

POST /api/v1/task/<task>/job - create
GET /api/v1/task/<task>/job - get status
GET /api/v1/task/<task>/job/results - get results
DELETE /api/v1/task/<task>/job - cancel

GRPC:

caikit.runtime.SampleLib.SampleLibJobService.<task>StartPredictionJob - create job
caikit.runtime.SampleLib.SampleLibJobService.<task>PredictionJobStatus - get status
caikit.runtime.SampleLib.SampleLibJobService.<task>GetPredictionJobResult - get results
caikit.runtime.SampleLib.SampleLibJobService.<task>CancelPredictionJob - cancel

Special notes for your reviewer:

If applicable:

  • this PR contains documentation
  • this PR contains unit tests
  • this PR has been tested for backwards compatibility

HonakerM added 5 commits July 21, 2024 15:45
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from 6efcfae to 815c02a Compare July 25, 2024 11:57
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from 815c02a to c0419bb Compare July 31, 2024 19:11
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch 2 times, most recently from e19b10d to 875f66b Compare July 31, 2024 20:56
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from 875f66b to ba034eb Compare July 31, 2024 21:19
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from 394a6a4 to 6badeab Compare August 1, 2024 16:01
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from 6badeab to 1a10065 Compare August 2, 2024 14:46
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from 1a10065 to 38b3ced Compare August 2, 2024 14:47
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from c049d97 to c38fdaa Compare August 2, 2024 15:52
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from e403758 to 854323e Compare August 4, 2024 17:43
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM marked this pull request as ready for review August 5, 2024 13:22
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from b49d449 to b4be1ee Compare August 15, 2024 17:34
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Copy link
Collaborator

@gabe-l-hart gabe-l-hart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A pile of NITs still (mostly copy-pasta) and a few structural things I still think need to be addressed to avoid API breaking changes:

  1. Preserve kwarg names for ModelFuture to avoid breaking existing trainers
  2. Don't switch 200 -> 202 (I kind of hate this one, but I think it's important)
  3. Let's see if we can solve the py-to-proto duplicate import issue too!

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from 3416581 to 1db2be0 Compare August 19, 2024 19:50
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from 21c487f to e30724e Compare August 20, 2024 16:42
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from f682417 to e8a31f8 Compare August 20, 2024 17:11
@HonakerM HonakerM requested a review from gabe-l-hart August 20, 2024 17:59
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM requested a review from tozhangkai August 23, 2024 14:20
Copy link
Collaborator

@gabe-l-hart gabe-l-hart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple more NITs and a small outstanding list:

  • We need to figure out if changing TrainingStatus -> JobStatus in the TrainingStatusResponse is an API breaking change
  • job_id -> prediction_id in data model to mirror training_id for training jobs
  • Duplicate log code search (we should really put in a CI check for this)

@@ -43,7 +43,7 @@ class ModelPointer(DataObjectBase):
@dataobject(RUNTIME_PACKAGE)
class TrainingStatusResponse(DataObjectBase):
training_id: Annotated[str, FieldNumber(1)]
state: Annotated[TrainingStatus, FieldNumber(2)]
state: Annotated[JobStatus, FieldNumber(2)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we figured out whether this will cause an overall API breaking change by changing the name of the protobuf?

class PredictionJobInfoRequest(DataObjectBase):
"""DataModel to request information about a PredictionJob"""

job_id: Annotated[str, FieldNumber(1)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're officially differentiating training_id and prediction_id elsewhere, I think maybe this should be named prediction_id instead of job_id? That would also mirror TrainingInfoRequest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I'll make the update. I originally thought of making it more generic but since training is specific predictions should be as well

self.notify_backends_with_context(model_id, context)

# Retrieve the model from the model manager
log.debug("<RUN52259029D>", "Retrieving model '%s'", model_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate log code! Let's make sure we do a grep for these. You could also just remove them on any messages debug or lower

## Interface-agnostic entrypoints ##
####################################

def get_job_result(self, job_id: str) -> DataObjectBase:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do job_id -> prediction_id in the interface, I think these should all change to <verb>_prediction_<noun>

HonakerM and others added 5 commits August 26, 2024 12:06
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>
Signed-off-by: Michael Honaker <37811263+HonakerM@users.noreply.github.com>
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>
Signed-off-by: Michael Honaker <37811263+HonakerM@users.noreply.github.com>
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>
Signed-off-by: Michael Honaker <37811263+HonakerM@users.noreply.github.com>
Co-authored-by: Gabe Goodhart <gabe.l.hart@gmail.com>
Signed-off-by: Michael Honaker <37811263+HonakerM@users.noreply.github.com>
Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@HonakerM HonakerM force-pushed the model_background_task branch from 0e8065e to c03ef83 Compare August 26, 2024 16:37
@HonakerM HonakerM requested a review from gabe-l-hart August 26, 2024 16:50
Copy link
Collaborator

@gabe-l-hart gabe-l-hart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ONE. MORE. NIT!

Comment on lines 269 to 275
Args:
future_id (str): The ID string from the original prediction
submission's ModelFuture

Returns:
model_future (JobPredictorFutureBase): The future handle
to the job which holds the status of the in-flight prediction.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Args:
future_id (str): The ID string from the original prediction
submission's ModelFuture
Returns:
model_future (JobPredictorFutureBase): The future handle
to the job which holds the status of the in-flight prediction.
Args:
prediction_id (str): The ID string from the original prediction
submission's ModelFuture
Returns:
prediction_future (JobPredictorFutureBase): The future handle
to the job which holds the status of the in-flight prediction.

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
Copy link
Collaborator

@gabe-l-hart gabe-l-hart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ship it! I agree that we should call this an API breaking change even though it's very minimal. We should spell out exactly what would break in the release notes.

Signed-off-by: Michael Honaker <Michael.Honaker@ibm.com>
@gabe-l-hart gabe-l-hart merged commit bd6e8a8 into caikit:main Aug 26, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants