Skip to content

Conversation

myidpt
Copy link

@myidpt myidpt commented Mar 15, 2018

The ClientRunner runs as a thread to automatically update the KeyCertBundle. It executes the following loop:

  1. It sleeps until the current certificate near expiration (enters the grace period).
  2. When it wakes up, it calls the upstream CA to provision new key/certs.
  3. If unexpected error happens, the thread will pass error back to the caller through the error channel, and terminate.

This is a library can be used by components that needs to provision certificates with CAs: node agent, and Istio CAs in the multi-cluster use case.

@myidpt myidpt requested a review from a team March 15, 2018 18:35
@istio-merge-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
We suggest the following additional approver: wattli

Assign the PR to them by writing /assign @wattli in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

log.Infof("Retrieve new key and certs.")
certBytes, certChainBytes, privateKeyBytes, err := c.client.RetrieveNewKeyCert()
if err != nil {
//log.Errorf("error retrieving the key and cert %v, abort auto rotation", err)
Copy link

@incfly incfly Mar 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this line?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I will remove it.

@wattli
Copy link
Contributor

wattli commented Mar 15, 2018

Add more description to this PR? What is the 'auto cert rotation' used for?

@myidpt
Copy link
Author

myidpt commented Mar 15, 2018

@wattli Done.

}
} else {
log.Errorf("Error getting TTL from cert: %v. Rotate immediately.", ttlErr)
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: check error first: if ttlErr != nil { log & continue}, this can reduce the nesting level.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm... We don't want to continue, we need to execute the following lines. I don't know how to save the nesting here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, didn't catch that. In general, Go encourages to handle error first. https://github.com/golang/go/wiki/CodeReviewComments#in-band-errors, this is a recoverable error, but still be better to check error first.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

// ClientRunner interacts with the upstream CA to maintain the KeyCertBundle valid.
type ClientRunner interface {
// Run the ClientRunner loop (blocking call)
Run(stopCh <-chan struct{}, errCh chan<- error)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding Stop() method and makes stopCh as struct's field? So that callers don't have to maintain the mapping of stopCh and the runner instance. Will also be consistent with Start/Stop style interface we used in other places.

Copy link
Author

@myidpt myidpt Mar 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually we have a lot of code passing the channels:
https://github.com/istio/istio/search?utf8=%E2%9C%93&q=stop%28%29&type=
Is the Start/Stop style a golang common practice? If so, we can use it here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a Go thing, but we use this in security/cmd/... , eg. nodeagent
workloadhandler

Just like standard library time.Ticker.Stop(), I think calling Stop() is simpler than passing info towards a channel for caller. Also that makes this functionality part of the interface itself.

"istio.io/istio/security/pkg/util"
)

// ClientRunner interacts with the upstream CA to maintain the KeyCertBundle valid.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming it CertRefresher/Refresher maybe more informative as that's what it actually does.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit this is not straightforward. But CertRefresher/Refresher does not say it's a continuously running routine.
Maybe keep this name and change it when we have a better name? Added a TODO here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With CertRefresher { Start/Run(), Stop()} I think it gives the user some rough idea. We can change Run/Start to
StartLoop or something if want to emphasize the continuously running behavior.

The library runs CAClient to do something. I just feel we should name by what rather than how/implementation. Otherwise lib user has to rely on doc string to understand what it does.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. Maybe KeyCertBundleRotator() is better? Because the key is also rotated :)

// Run periodically rotates the key/cert of the Istio CA by interacting with the upstream Istio CA.
// It is a blocking function that should run as a go routine.
func (c *clientRunner) Run(stopCh <-chan struct{}, errCh chan<- error) {
for {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will a bool channel be enough?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer an error channel because the caller is able to have different behavior based on the error, also the error can be logged with more context (the caller can wrap the error with more context).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nvm, I select the stopCh but seems like github's code review is does not support that...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Yes, I think it doesn't matter much whether a boolean or a dumb struct, but I can change it anyway.

}
}

// Run periodically rotates the key/cert of the Istio CA by interacting with the upstream Istio CA.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on description, IIUC, this library is intended to used by both NodeAgent refreshing workload key/cert from IstioCA and IstioCA's own key/cert from a higher level CA.

Is that case, should we not call out Istio CA here, just refers to "key/cert" and the CA that issues this key/cert?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this lib is to be used by both node agent and Isito CA.
I don't think I understand your last sentence well. Can you explain it a little more?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another example of github not supports highlight selection. :)

I mean since when this lib used for nodeagent, this key cert can belong to workload, we shouldn't say "key/cert of Istio CA", and same to "upstream Istio CA" -> "upstream CA".

Please correct me if I misunderstand the term.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Will update them.

}

type clientRunner struct {
keycert pkiutil.KeyCertBundle
Copy link

@incfly incfly Mar 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we'll have one Runner instance per key/cert, and call Run() in a goroutine. If a NodeAgent has to refresh 200 service account's key, cert, does that mean to spawn 200 goroutines?

Can we make this libraries support multiple key/cert refresh, and in the loop, we find the earliest-to-be-expired key/cert to send CSR?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's definitely something we should support. I will add a todo here. For the multi-cluster CA use case, we just use a single KeyCertBundle, that's why I start with a single KeyCertBundle here.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@codecov
Copy link

codecov bot commented Mar 15, 2018

Codecov Report

Merging #4295 into master will increase coverage by 1%.
The diff coverage is 86%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #4295    +/-   ##
=======================================
+ Coverage      71%     71%    +1%     
=======================================
  Files         317     314     -3     
  Lines       28911   28862    -49     
=======================================
+ Hits        20247   20254     +7     
+ Misses       7482    7376   -106     
- Partials     1182    1232    +50
Impacted Files Coverage Δ
security/pkg/caclient/keycertbundlerotator.go 86% <86%> (ø)
pilot/pkg/config/kube/ingress/conversion.go 32% <0%> (-68%) ⬇️
pilot/pkg/config/kube/ingress/controller.go 40% <0%> (-52%) ⬇️
mixer/pkg/api/grpcServer.go 82% <0%> (-18%) ⬇️
pilot/pkg/config/monitor/monitor.go 69% <0%> (-6%) ⬇️
security/pkg/workload/server.go 80% <0%> (-4%) ⬇️
mixer/adapter/kubernetesenv/cache.go 83% <0%> (-2%) ⬇️
mixer/adapter/kubernetesenv/kubernetesenv.go 68% <0%> (-1%) ⬇️
mixer/adapter/rbac/rbac.go 10% <0%> (-1%) ⬇️
pilot/pkg/proxy/envoy/v1/route.go 93% <0%> (-1%) ⬇️
... and 43 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 59764d1...3369f9c. Read the comment docs.

Copy link

@incfly incfly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!


type keyCertBundleRotatorImpl struct {
// TODO: Support multiple KeyCertBundles.
keycert pkiutil.KeyCertBundle
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe sort alphabetically?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

@myidpt myidpt changed the title Add ClientRunner for automatic cert rotation with the upstream CA. Add KeyCertBundleRotator for automatic cert rotation with the upstream CA. Mar 17, 2018
@myidpt myidpt merged commit 2c7116e into istio:master Mar 17, 2018
@myidpt myidpt deleted the clientrunner branch March 17, 2018 02:55
PiotrSikora added a commit to PiotrSikora/istio that referenced this pull request Sep 13, 2018
Pulling the following changes from github.com/istio/proxy:

f498337 Fix a bug in origin authenticator that wrongly treats empty origin methods as pass (istio#1962)
c352de0 Mixer Client: Add support for TCP local attributes (istio#1967)
2c563c6 remove not used path patcher functions (istio#1966)
490d26f Update Envoy SHA to latest with TCP proxy fixes. (istio#1964)
4cc4b7c Mixer Client uses Node metadata to populate Mixer attributes (istio#1961)
cf23357 Fix the peerIsOptional and originIsOptional for authn filter. (istio#1959)
cc6e58e support per-path JWT validation. (istio#1928)
a5dd1aa mixer: clear route cache on header update (istio#1946)

Pulling the following changes from github.com/envoyproxy/envoy:

f936fc60f ssl: serialize accesses to SSL socket factory contexts (istio#4345)
e34dcd62a Fix crash in tcp_proxy (istio#4323)
ae6a25222 router: fix matching when all domains have wildcards (istio#4326)
aa06142ff test: Stop fake_upstream methods from accidentally succeeding (istio#4232)
5d731878f rbac: update the authenticated.user to a StringMatcher. (istio#4250)
c6bfc7d9a time: Event::TimeSystem abstraction to make it feasible to inject time with simulated timers (istio#4257)
752483ea9 Fixing the fix (istio#4333)
83487f6f3 tls: update BoringSSL to ab36a84b (3497). (istio#4338)
7bc210e02 test: fixing interactions between waitFor and ignore_spurious_events (istio#4309)
69474b398 admin: order stats in clusters json admin (istio#4306)
2d155f901 ppc64le build (istio#4183)
07efc6dc6 fix static initialization fiasco problem (istio#4314)
0b7e3b5e0 test: Remove declared but undefined class methods (istio#4297)
1485a1304 lua: make sure resetting dynamic metadata wrapper when request info is marked dead
d243cd62e test: set to zero when start_time exceeds limit (istio#4328)
0a1e92acc test: fix heap use-after-free in ~IntegrationTestServer. (istio#4319)
cddc732c7 CONTRIBUTING: Document 'kick-ci' trick. (istio#4335)
f13ef2464 docs: remove reference to deprecated value field (istio#4322)
e947a2766 router: minor doc fixes in stream idle timeout (istio#4329)
0c2e998af tcp-proxy: fixing a TCP proxy bug where we attempted to readDisable a closed connection (istio#4296)
00ffe44a2 utility: fix strftime overflow handling. (istio#4321)
af1183c28 Re-enable TcpProxySslIntegrationTest and make the tests pass again. (istio#4318)
35534617b fuzz: fix H2 codec fuzzer post istio#4262. (istio#4311)
42f604853 Proto string issue fix (istio#4320)
9c492a01d Support Envoy to fetch secrets using SDS service. (istio#4256)
a8572192f ratelimit: revert `revert rate limit failure mode config` and add tests (istio#4303)
1d34172bd dns: fix exception unsafe behavior in c-ares callbacks. (istio#4307)
121242340 alts: add gRPC TSI socket (istio#4153)
f0363ae63 fuzz: detect client-side resets in H2 codec fuzzer. (istio#4300)
01aa3f820 test: hopefully deflaking echo integration test (istio#4304)
1fc0f4ba2 ratelimit: link legacy proto when message is being used (istio#4308)
aa4481e6b fix rare List::remove(&target) segfault (istio#4244)
89e0f23ba headers: fixing fast fail of size-validation (istio#4269)
97eba5918 build: bump googletest version. (istio#4293)
0057e22d9 fuzz: avoid false positives in HCM fuzzer. (istio#4262)
9d094e590 Revert ac0bd74f6f9716e3a44d1412f795317c30ca770a (istio#4295)
ddb28a4a1 Add validation context provider (istio#4264)
3b47cbabb added histogram latency information to Hystrix dashboard stream (istio#3986)
cf87d50cd docs: update SNI FAQ. (istio#4285)
f952033a4 config: fix update empty stat for eds (istio#4276)
329e591d3 router: Add ability of custom headers to rely on per-request data (istio#4219)
68d20b46c  thrift: refactor build files and imports (istio#4271)
5fa8192a3 access_log: log requested_server_name in tcp proxy (istio#4144)
fa45bb48f fuzz: libc++ clocks don't like nanos. (istio#4282)
53f8944f7 stats: add symbol table for future stat name encoding (istio#3927)
c987b425b test infra: Remove timeSource() from the ClusterManager api (istio#4247)
cd171d9a9 websocket: tunneling websockets (and upgrades in general) over H2 (istio#4188)
b9dc5d9a0 router: disallow :path/host rewriting in request_headers_to_add. (istio#4220)
0c9101127 network: skip socket options and source address for UDS client connections (istio#4252)
da1857d59 build: fixing a downstream compile error by noting explicit fallthrough (istio#4265)
9857cfe2a fuzz: cleanup per-test environment after each fuzz case. (istio#4253)
52beb067d test: Wrap proto string in std::string before comparison (istio#4238)
f5e219edc extensions/thrift_proxy: Add header matching to thrift router (istio#4239)
c9ce5d2b1 fuzz: track read_disable_count bidirectionally in codec_impl_fuzz_test. (istio#4260)
35103b353 fuzz: use nanoseconds for SystemTime in RequestInfo. (istio#4255)
ba6ba9883 fuzz: make runtime root hermetic in server_fuzz_test. (istio#4258)
b0a901480 time: Add 'format' test to ensure no one directly instantiates Prod*Time from source. (istio#4248)
85674603b access_log: support beginning of epoch in START_TIME. (istio#4254)
28d5f4118 proto: unify envoy_proto_library/api_proto_library. (istio#4233)
f7d3cb638 http: fix allocation bug introduced in istio#4211. (istio#4245)

Fixes istio#8310.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
istio-testing pushed a commit that referenced this pull request Sep 17, 2018
* Update Proxy SHA to latest with TCP proxy fixes.

Pulling the following changes from github.com/istio/proxy:

f498337 Fix a bug in origin authenticator that wrongly treats empty origin methods as pass (#1962)
c352de0 Mixer Client: Add support for TCP local attributes (#1967)
2c563c6 remove not used path patcher functions (#1966)
490d26f Update Envoy SHA to latest with TCP proxy fixes. (#1964)
4cc4b7c Mixer Client uses Node metadata to populate Mixer attributes (#1961)
cf23357 Fix the peerIsOptional and originIsOptional for authn filter. (#1959)
cc6e58e support per-path JWT validation. (#1928)
a5dd1aa mixer: clear route cache on header update (#1946)

Pulling the following changes from github.com/envoyproxy/envoy:

f936fc60f ssl: serialize accesses to SSL socket factory contexts (#4345)
e34dcd62a Fix crash in tcp_proxy (#4323)
ae6a25222 router: fix matching when all domains have wildcards (#4326)
aa06142ff test: Stop fake_upstream methods from accidentally succeeding (#4232)
5d731878f rbac: update the authenticated.user to a StringMatcher. (#4250)
c6bfc7d9a time: Event::TimeSystem abstraction to make it feasible to inject time with simulated timers (#4257)
752483ea9 Fixing the fix (#4333)
83487f6f3 tls: update BoringSSL to ab36a84b (3497). (#4338)
7bc210e02 test: fixing interactions between waitFor and ignore_spurious_events (#4309)
69474b398 admin: order stats in clusters json admin (#4306)
2d155f901 ppc64le build (#4183)
07efc6dc6 fix static initialization fiasco problem (#4314)
0b7e3b5e0 test: Remove declared but undefined class methods (#4297)
1485a1304 lua: make sure resetting dynamic metadata wrapper when request info is marked dead
d243cd62e test: set to zero when start_time exceeds limit (#4328)
0a1e92acc test: fix heap use-after-free in ~IntegrationTestServer. (#4319)
cddc732c7 CONTRIBUTING: Document 'kick-ci' trick. (#4335)
f13ef2464 docs: remove reference to deprecated value field (#4322)
e947a2766 router: minor doc fixes in stream idle timeout (#4329)
0c2e998af tcp-proxy: fixing a TCP proxy bug where we attempted to readDisable a closed connection (#4296)
00ffe44a2 utility: fix strftime overflow handling. (#4321)
af1183c28 Re-enable TcpProxySslIntegrationTest and make the tests pass again. (#4318)
35534617b fuzz: fix H2 codec fuzzer post #4262. (#4311)
42f604853 Proto string issue fix (#4320)
9c492a01d Support Envoy to fetch secrets using SDS service. (#4256)
a8572192f ratelimit: revert `revert rate limit failure mode config` and add tests (#4303)
1d34172bd dns: fix exception unsafe behavior in c-ares callbacks. (#4307)
121242340 alts: add gRPC TSI socket (#4153)
f0363ae63 fuzz: detect client-side resets in H2 codec fuzzer. (#4300)
01aa3f820 test: hopefully deflaking echo integration test (#4304)
1fc0f4ba2 ratelimit: link legacy proto when message is being used (#4308)
aa4481e6b fix rare List::remove(&target) segfault (#4244)
89e0f23ba headers: fixing fast fail of size-validation (#4269)
97eba5918 build: bump googletest version. (#4293)
0057e22d9 fuzz: avoid false positives in HCM fuzzer. (#4262)
9d094e590 Revert ac0bd74f6f9716e3a44d1412f795317c30ca770a (#4295)
ddb28a4a1 Add validation context provider (#4264)
3b47cbabb added histogram latency information to Hystrix dashboard stream (#3986)
cf87d50cd docs: update SNI FAQ. (#4285)
f952033a4 config: fix update empty stat for eds (#4276)
329e591d3 router: Add ability of custom headers to rely on per-request data (#4219)
68d20b46c  thrift: refactor build files and imports (#4271)
5fa8192a3 access_log: log requested_server_name in tcp proxy (#4144)
fa45bb48f fuzz: libc++ clocks don't like nanos. (#4282)
53f8944f7 stats: add symbol table for future stat name encoding (#3927)
c987b425b test infra: Remove timeSource() from the ClusterManager api (#4247)
cd171d9a9 websocket: tunneling websockets (and upgrades in general) over H2 (#4188)
b9dc5d9a0 router: disallow :path/host rewriting in request_headers_to_add. (#4220)
0c9101127 network: skip socket options and source address for UDS client connections (#4252)
da1857d59 build: fixing a downstream compile error by noting explicit fallthrough (#4265)
9857cfe2a fuzz: cleanup per-test environment after each fuzz case. (#4253)
52beb067d test: Wrap proto string in std::string before comparison (#4238)
f5e219edc extensions/thrift_proxy: Add header matching to thrift router (#4239)
c9ce5d2b1 fuzz: track read_disable_count bidirectionally in codec_impl_fuzz_test. (#4260)
35103b353 fuzz: use nanoseconds for SystemTime in RequestInfo. (#4255)
ba6ba9883 fuzz: make runtime root hermetic in server_fuzz_test. (#4258)
b0a901480 time: Add 'format' test to ensure no one directly instantiates Prod*Time from source. (#4248)
85674603b access_log: support beginning of epoch in START_TIME. (#4254)
28d5f4118 proto: unify envoy_proto_library/api_proto_library. (#4233)
f7d3cb638 http: fix allocation bug introduced in #4211. (#4245)

Fixes #8310.

Signed-off-by: Piotr Sikora <piotrsikora@google.com>

* Update go-control-plane API and fix test fails

* Fixed the pilot rbac build fail due to envoyproxy/envoy#4250
* Fixed the istioctl test fail due to envoyproxy/envoy#4306

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants