Skip to content

Conversation

eric846
Copy link
Contributor

@eric846 eric846 commented Sep 14, 2020

This is the main function of the Adaptive Load Controller library:

  • Check the input proto for errors
  • Apply default input values
  • Adjusting Stage loop:
    • Get the latest dynamically generated CommandLineOptions from the StepController
    • Run a short benchmark with the Nighthawk Service
    • Obtain values from MetricsPlugins
    • Score metrics using ScoringFunction plugins
    • Report scores back to the StepController, which recalculates the load for the next iteration
    • Check for convergence deadline exceeded
    • Check for StepController convergence
    • Check for StepController doom
  • Testing Stage: Run one long benchmark on the Nighthawk Service at the converged load

Fixes #485.

Part 9 of splitting PR #483.

eric846 added 30 commits June 1, 2020 17:23
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…double

Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…variable_setter

Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…ent.Output turns out not to include the status

Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
…plugin names, log thresholds only once per session

Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
@eric846 eric846 marked this pull request as ready for review September 14, 2020 23:09
@eric846 eric846 added the waiting-for-review A PR waiting for a review. label Sep 14, 2020
Copy link
Member

@oschaaf oschaaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM, it's quite cool to see everything come together. I feel it reads like poetry :-) except for the special way we treat duration, which still feels a tiny bit off; maybe a code level comment can help here. Somehow I'm also wondering if duration isn't something we should allow implementing as a parameter, like qps. E.g. to allow one to research a question like "how does test duration influence convergence speed at a steady rps".
Left a few small other remarks.

@dubious90 dubious90 requested a review from pamorgan September 15, 2020 15:40
@dubious90
Copy link
Contributor

@pamorgan Please review and assign back to me once done.

Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
oschaaf
oschaaf previously approved these changes Sep 15, 2020
Copy link
Member

@oschaaf oschaaf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
pamorgan
pamorgan previously approved these changes Sep 18, 2020
Copy link
Contributor

@pamorgan pamorgan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some nits but LGTM.

/**
* Performs an adaptive load session consisting of the Adjusting Stage and the
* Testing Stage. Adjusting Stage: Runs a series of short benchmarks, checks metrics according to
* MetricSpecs, and adjusts load up or down based on the result; returns an error if convergence
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra quote.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reformatted this

* computed by a StepController plugin. Metric values are obtained through MetricsPlugins. Testing
* Stage: When the optimal load is found, runs one long benchmark to validate it.
*
* @param nighthawk_service_stub A Nighthawk Service gRPC stub.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Maybe stub is more appropriate as a member in the class than an argument to the method. I don't think we expect it to change on a multiple call scenario.

Copy link
Contributor Author

@eric846 eric846 Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is intended to be stateless helpers and it's only a class to make mocking easier. We will have a chance to design a more convenient client that might be stateful when we merge RemoteProcessImpl and NighthawkServiceClientImpl.

absl::StatusOr<nighthawk::adaptive_load::AdaptiveLoadSessionOutput> PerformAdaptiveLoadSession(
nighthawk::client::NighthawkService::StubInterface* nighthawk_service_stub,
const nighthawk::adaptive_load::AdaptiveLoadSessionSpec& spec, Envoy::TimeSource& time_source);
class AdaptiveLoadController {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be renamed to interface now, since we now a pure methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The convention for Nighthawk seems to be that everything under include should be pure but no existing interfaces have Interface in their name.

* controller.PerformAdaptiveLoadSession(&nighthawk_service_stub, spec);
*
* @param nighthawk_service_client A helper that executes Nighthawk Service benchmarks given a
* gRPC stub.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason behing seperating the client from the stub?

Copy link
Contributor Author

@eric846 eric846 Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for historical reasons: We had a helper that made it easier to call the stub, then moved the helper into its own class because it's only easy to mock out methods when they are in a class.

I'm not sure what will happen with this, but we may add a more convenient Nighthawk Service client wrapper object that contains the stub, and then take this wrapper instead. Or the stub might be the best thing for the public API to take.

* metrics from MetricsPlugins, scores the results, and consults a StepController plugin to
* determine the next load and detect convergence. All plugins are specified through the
* AdaptiveLoadSessionSpec proto.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment on the thread safety? Can you run concurrent sessions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

AdaptiveLoadSessionSpec spec = session_spec_proto_helper_.SetSessionSpecDefaults(input_spec);
absl::Status validation_status = session_spec_proto_helper_.CheckSessionSpec(spec);
if (!validation_status.ok()) {
return validation_status;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we log reasons of failures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

absl::StrCat("Step controller determined that it can never converge: ", doom_reason));
}

absl::StatusOr<BenchmarkResult> result_or = PerformAndAnalyzeNighthawkBenchmark(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment on why calling this again after complete.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (this is the testing stage)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW we have #539 to consider making the testing stage an optional thing, decided by the StepController, so this special call might go away.

return absl::DeadlineExceededError(absl::StrFormat(
"Failed to converge before deadline of %.2f seconds.", time_limit_ns.count() / 1e9));
}
} while (!step_controller->IsConverged() && !step_controller->IsDoomed(doom_reason));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually while loops are more readable than do/while. Is there a reason not to use a simple while loop here?

Copy link
Contributor Author

@eric846 eric846 Sep 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had discussed this, I think the motivation for do/while is that IsConverged and IsDoomed don't have meaningful answers until the loop body has executed.

}
BenchmarkResult result = result_or.value();
*output.mutable_adjusting_stage_results()->Add() = result;
step_controller->UpdateAndRecompute(result);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a second to follow the logic because step_controller->GetCurrentCommandLineOptions and step_controller->UpdateAndRecompute(result); are called in 2 different methods.
I think it would be easier to follow if you moved step_controller->GetCurrentCommandLineOptions to this method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just moved UpdateAndRecompute into the helper, which hopefully makes the same amount of sense. I had moved GetCurrentCommandLineOptions to the helper to reduce duplication in the main loop.

absl::flat_hash_map<std::string, MetricsPluginPtr> name_to_custom_metrics_plugin_map;
for (const envoy::config::core::v3::TypedExtensionConfig& config :
spec.metrics_plugin_configs()) {
absl::StatusOr<MetricsPluginPtr> metrics_plugin_or = LoadMetricsPlugin(config);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: usually it's better to perform the load right after validations and pass in the metric plugins to avoid the assert. No action needed if it's too much of code refactoring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it's a bit awkward for historical reasons -- filed #547

@mum4k mum4k added waiting-for-changes A PR waiting for comments to be resolved and changes to be applied. and removed waiting-for-review A PR waiting for a review. labels Sep 18, 2020
…o helper

Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
@eric846 eric846 added waiting-for-review A PR waiting for a review. and removed waiting-for-changes A PR waiting for comments to be resolved and changes to be applied. labels Sep 19, 2020
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
Signed-off-by: eric846 <56563761+eric846@users.noreply.github.com>
@eric846 eric846 requested a review from pamorgan September 21, 2020 17:18
@mum4k mum4k assigned mum4k and unassigned dubious90 Sep 22, 2020
Copy link
Collaborator

@mum4k mum4k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, it is obvious that a lot of work was done here to improve readability. This PR only contains the business logic of the adaptive controller. The main functions and test cases are easy to follow. Thank you for taking all the additional steps.

@mum4k mum4k merged commit 401f09e into envoyproxy:master Sep 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
waiting-for-review A PR waiting for a review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider making Adaptive Load Controller main API a class interface
5 participants