-
Notifications
You must be signed in to change notification settings - Fork 90
Adaptive Load Controller main loop #535
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adaptive Load Controller main loop #535
Conversation
update from master
merge from upstream
merge from upstream
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…double Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…variable_setter Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
merge from upstream
…ent.Output turns out not to include the status Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…plugin names, log thresholds only once per session Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, it's quite cool to see everything come together. I feel it reads like poetry :-) except for the special way we treat duration, which still feels a tiny bit off; maybe a code level comment can help here. Somehow I'm also wondering if duration isn't something we should allow implementing as a parameter, like qps. E.g. to allow one to research a question like "how does test duration influence convergence speed at a steady rps".
Left a few small other remarks.
@pamorgan Please review and assign back to me once done. |
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some nits but LGTM.
/** | ||
* Performs an adaptive load session consisting of the Adjusting Stage and the | ||
* Testing Stage. Adjusting Stage: Runs a series of short benchmarks, checks metrics according to | ||
* MetricSpecs, and adjusts load up or down based on the result; returns an error if convergence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: extra quote.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reformatted this
* computed by a StepController plugin. Metric values are obtained through MetricsPlugins. Testing | ||
* Stage: When the optimal load is found, runs one long benchmark to validate it. | ||
* | ||
* @param nighthawk_service_stub A Nighthawk Service gRPC stub. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Maybe stub is more appropriate as a member in the class than an argument to the method. I don't think we expect it to change on a multiple call scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is intended to be stateless helpers and it's only a class to make mocking easier. We will have a chance to design a more convenient client that might be stateful when we merge RemoteProcessImpl
and NighthawkServiceClientImpl
.
absl::StatusOr<nighthawk::adaptive_load::AdaptiveLoadSessionOutput> PerformAdaptiveLoadSession( | ||
nighthawk::client::NighthawkService::StubInterface* nighthawk_service_stub, | ||
const nighthawk::adaptive_load::AdaptiveLoadSessionSpec& spec, Envoy::TimeSource& time_source); | ||
class AdaptiveLoadController { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be renamed to interface now, since we now a pure methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The convention for Nighthawk seems to be that everything under include
should be pure but no existing interfaces have Interface
in their name.
* controller.PerformAdaptiveLoadSession(&nighthawk_service_stub, spec); | ||
* | ||
* @param nighthawk_service_client A helper that executes Nighthawk Service benchmarks given a | ||
* gRPC stub. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason behing seperating the client from the stub?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for historical reasons: We had a helper that made it easier to call the stub, then moved the helper into its own class because it's only easy to mock out methods when they are in a class.
I'm not sure what will happen with this, but we may add a more convenient Nighthawk Service client wrapper object that contains the stub, and then take this wrapper instead. Or the stub might be the best thing for the public API to take.
* metrics from MetricsPlugins, scores the results, and consults a StepController plugin to | ||
* determine the next load and detect convergence. All plugins are specified through the | ||
* AdaptiveLoadSessionSpec proto. | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a comment on the thread safety? Can you run concurrent sessions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
AdaptiveLoadSessionSpec spec = session_spec_proto_helper_.SetSessionSpecDefaults(input_spec); | ||
absl::Status validation_status = session_spec_proto_helper_.CheckSessionSpec(spec); | ||
if (!validation_status.ok()) { | ||
return validation_status; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we log reasons of failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
absl::StrCat("Step controller determined that it can never converge: ", doom_reason)); | ||
} | ||
|
||
absl::StatusOr<BenchmarkResult> result_or = PerformAndAnalyzeNighthawkBenchmark( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you comment on why calling this again after complete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (this is the testing stage)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW we have #539 to consider making the testing stage an optional thing, decided by the StepController, so this special call might go away.
return absl::DeadlineExceededError(absl::StrFormat( | ||
"Failed to converge before deadline of %.2f seconds.", time_limit_ns.count() / 1e9)); | ||
} | ||
} while (!step_controller->IsConverged() && !step_controller->IsDoomed(doom_reason)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually while loops are more readable than do/while. Is there a reason not to use a simple while loop here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We had discussed this, I think the motivation for do/while is that IsConverged
and IsDoomed
don't have meaningful answers until the loop body has executed.
} | ||
BenchmarkResult result = result_or.value(); | ||
*output.mutable_adjusting_stage_results()->Add() = result; | ||
step_controller->UpdateAndRecompute(result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It took me a second to follow the logic because step_controller->GetCurrentCommandLineOptions and step_controller->UpdateAndRecompute(result); are called in 2 different methods.
I think it would be easier to follow if you moved step_controller->GetCurrentCommandLineOptions to this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just moved UpdateAndRecompute into the helper, which hopefully makes the same amount of sense. I had moved GetCurrentCommandLineOptions to the helper to reduce duplication in the main loop.
absl::flat_hash_map<std::string, MetricsPluginPtr> name_to_custom_metrics_plugin_map; | ||
for (const envoy::config::core::v3::TypedExtensionConfig& config : | ||
spec.metrics_plugin_configs()) { | ||
absl::StatusOr<MetricsPluginPtr> metrics_plugin_or = LoadMetricsPlugin(config); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: usually it's better to perform the load right after validations and pass in the metric plugins to avoid the assert. No action needed if it's too much of code refactoring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, it's a bit awkward for historical reasons -- filed #547
…o helper Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, it is obvious that a lot of work was done here to improve readability. This PR only contains the business logic of the adaptive controller. The main functions and test cases are easy to follow. Thank you for taking all the additional steps.
This is the main function of the Adaptive Load Controller library:
Fixes #485.
Part 9 of splitting PR #483.