-
Notifications
You must be signed in to change notification settings - Fork 8k
Ability to tap into istio-agent to route debug messages to istiod #31749
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Skipping CI for Draft Pull Request. |
/test all |
/retest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a high level this looks like the right approach. Nice work!
pkg/istio-agent/xds_proxy.go
Outdated
@@ -464,7 +473,11 @@ func (p *XdsProxy) handleUpstreamResponse(con *ProxyConnection) { | |||
forwardToEnvoy(con, resp) | |||
} | |||
default: | |||
forwardToEnvoy(con, resp) | |||
if strings.HasPrefix(resp.TypeUrl, "istio.io/debug/") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is a way to tag the response in some way. hypothetically the tap may want to support all types, including envoy types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to use DiscoveryRequest.ResponseNonce, and have the DiscoveryResponse.Nonce contain a matching value? I realize this is backwards from how it is used, but it would let the requests and replies be matched precisely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if this change involved no changes to pilot's code at all. If that is impossible we can compromise, but ideally its all client side. Maybe resourceName could encode it? I will think on this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the latest, the only change to Pilot is to use Event Generator instead of API Generator.
pkg/istio-agent/xds_proxy.go
Outdated
if !ok { | ||
requestsQueue = make([]chan *discovery.DiscoveryResponse, 0, 1) | ||
} | ||
requestsQueue = append(requestsQueue, responseChannel) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really need multiple concurrently queued requests? or just a single one per-typeurl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I am not sure its even possible to send 2 requests in a row without a response? Istiod will block the second request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imagine multiple 3rd party tools pulling at the same time. I can re-factor the code so that if multiple tools hit the debug tap at once that block on the agent rather than going up together to Istiod.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think that would work better, and agree its useful to be able to have multiple requests
pkg/istio-agent/xds_proxy.go
Outdated
|
||
func (p *XdsProxy) makeTapHandler() func(w http.ResponseWriter, req *http.Request) { | ||
return func(w http.ResponseWriter, req *http.Request) { | ||
serviceAccount := env.RegisterStringVar("SERVICE_ACCOUNT", "default", "").Get() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
long term we should use the entire same metadata we use for real XDS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which shouldn't be too bad, just a lot of plumbing data around
pkg/istio-agent/xds_proxy.go
Outdated
|
||
p.connected.requestsChan <- req | ||
|
||
return responseChannel, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we do just have one enque-able, I think this function can also just block until it gets a response as well?
pkg/istio-agent/xds_proxy.go
Outdated
httpMux := http.NewServeMux() | ||
httpMux.HandleFunc("/debug/", p.makeTapHandler()) | ||
|
||
HTTPAddr := ":15009" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this was for demo, but we cannot listen on 0.0.0.0. Either UDS or localhost.
pilot/pkg/xds/ads.go
Outdated
@@ -242,6 +245,10 @@ func (s *DiscoveryServer) processRequest(req *discovery.DiscoveryRequest, con *C | |||
return s.pushXds(con, push, versionInfo(), con.Watched(req.TypeUrl), request) | |||
} | |||
|
|||
func isXdsEvent(req *discovery.DiscoveryRequest) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer not to have this logic if possible. I think I know why we do - to avoid treating the REQ as an ACK and being ignored? I think we could also set empty responceNonce as an option?
pkg/istio-agent/xds_proxy.go
Outdated
@@ -464,7 +473,11 @@ func (p *XdsProxy) handleUpstreamResponse(con *ProxyConnection) { | |||
forwardToEnvoy(con, resp) | |||
} | |||
default: | |||
forwardToEnvoy(con, resp) | |||
if strings.HasPrefix(resp.TypeUrl, "istio.io/debug/") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great if this change involved no changes to pilot's code at all. If that is impossible we can compromise, but ideally its all client side. Maybe resourceName could encode it? I will think on this
@esnible I may lack of context for the PR's purpose. So please correct me if I miss something. We now have an experimental command Basically it supports our current /debug/xxxz command. |
pkg/istio-agent/xds_proxy.go
Outdated
p.tapMutex.Lock() | ||
requestsQueue, ok := p.tapRequests[resp.TypeUrl] | ||
if ok { | ||
requestChan := requestsQueue[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like multiple concurrent requests are supported. How can we be sure the first request is the one that matches this response?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed at the UX meeting, multiple concurrent requests are not needed. The latest change eliminates that.
With the design you commented on, the only parameter is the TypeUrl, and we check that it matches. The intention was that although the performance might be unfair, the results would be correct -- everyone sees the TypeUrl they asked for.
/test all |
1 similar comment
/test all |
/test integ-pilot-multicluster-tests_istio |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great
pkg/istio-agent/xds_proxy.go
Outdated
return nil, fmt.Errorf("proxy not connected to Istiod") | ||
} | ||
|
||
defer p.tapMutex.Unlock() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I suppose this works too but it seems more common to add this immediately after the .Lock()
pkg/istio-agent/xds_proxy.go
Outdated
@@ -464,7 +471,11 @@ func (p *XdsProxy) handleUpstreamResponse(con *ProxyConnection) { | |||
forwardToEnvoy(con, resp) | |||
} | |||
default: | |||
forwardToEnvoy(con, resp) | |||
if strings.HasPrefix(resp.TypeUrl, "istio.io/debug/") { | |||
p.tapResponseChannel <- resp |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if pilot pushes a few responses and some times out this may deadlock? since there is no receiver trying to read from the channel. This would impact the real XDS traffic
pkg/istio-agent/xds_proxy.go
Outdated
typeURL := fmt.Sprintf("istio.io%s", req.URL.Path) | ||
response, err := p.tapRequest(&discovery.DiscoveryRequest{ | ||
Node: &envoy_v3.Node{ | ||
Id: "sidecar~1.1.1.1~debug~cluster.local", // TODO listen for and use Envoy's ID? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need it at all?
} | ||
|
||
// Send Istiod's response as HTTP body | ||
j, err := json.Marshal(response) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for xds marshal we probably need jsonpb?
/test all |
pkg/istio-agent/xds_proxy.go
Outdated
"os" | ||
"path" | ||
"strings" | ||
"sync" | ||
"time" | ||
|
||
discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3" | ||
// _ "github.com/envoyproxy/go-control-plane/envoy/service/status/v3" // needed for piggyback marshal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is commented out. Is it still needed?
pkg/istio-agent/xds_proxy.go
Outdated
// Send Istiod's response as HTTP body | ||
jsonm := &jsonpb.Marshaler{Indent: " "} | ||
w.Header().Add("Content-Type", "application/json") | ||
_ = jsonm.Marshal(w, response) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Can the line "_ = jsonm.Marshal(w, response)" be simplified to "jsonm.Marshal(w, response)"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's there for lint
. If I don't do it, Istio's build process says Error return value of jsonm.Marshal
is not checked.
It would be possible to Marshal to a buffer, check the result, and then send that result to w
. It seemed like a lot of work, given that it isn't really possible for this code to fail.
d11b87b
to
d52ce0f
Compare
/retest |
// TODO move this to just directly using the resource TypeUrl | ||
g = s.Generators["api"] // default to "MCP" generators - any type supported by store | ||
if strings.HasPrefix(typeURL, "istio.io/debug/") { | ||
g = s.Generators["event"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One issue - when we send a debug request, we register a watch for that type. This watch is never removed. As a result, all config changes will re-push a new debug response. Example logs:
2021-04-05T21:06:24.639273Z info ads Push debounce stable[10] 1: 101.013956ms since last change, 101.013815ms since last push, full=true
2021-04-05T21:06:24.639415Z info ads XDS: Pushing:2021-04-05T21:06:24Z/8 Services:5 ConnectedEndpoints:2 Version:2021-04-05T21:06:24Z/8
2021-04-05T21:06:24.640212Z info ads CDS: PUSH for node:echo-cb96f8d94-ssbm8.default resources:24 size:17.1kB
2021-04-05T21:06:24.640247Z info ads EDS: PUSH for node:echo-cb96f8d94-ssbm8.default resources:6 size:1.2kB empty:0 cached:6/6
2021-04-05T21:06:24.641634Z info ads LDS: PUSH for node:echo-cb96f8d94-ssbm8.default resources:16 size:64.8kB
2021-04-05T21:06:24.641921Z info ads RDS: PUSH for node:echo-cb96f8d94-ssbm8.default resources:6 size:4.3kB
2021-04-05T21:06:24.641971Z info ads istio.io/debug/edsz: PUSH for node:echo-cb96f8d94-ssbm8.default resources:0 size:0B
2021-04-05T21:06:24.642000Z info ads istio.io/debug/configz: PUSH for node:echo-cb96f8d94-ssbm8.default resources:0 size:0B
2021-04-05T21:06:24.642010Z info ads istio.io/debug/config_dump with 0 ResourceNames
2021-04-05T21:06:24.642028Z info ads istio.io/debug/config_dump: PUSH for node:echo-cb96f8d94-ssbm8.default resources:0 size:0B
2021-04-05T21:06:24.642075Z info ads istio.io/debug/syncz: PUSH for node:echo-cb96f8d94-ssbm8.default resources:2 size:134B
2021-04-05T21:06:24.656383Z info ads EDS: PUSH request for node:echo-cb96f8d94-ssbm8.default resources:15 size:3.0kB empty:0 cached:15/15
// TODO move this to just directly using the resource TypeUrl | ||
g = s.Generators["api"] // default to "MCP" generators - any type supported by store | ||
if strings.HasPrefix(typeURL, "istio.io/debug/") { | ||
g = s.Generators["event"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be s.Generators[TypeDebug]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need Generators[TypeDebug]
if the typeURL is istio.io/debug
(no trailing slash). If the type URL is istio.io/debug/syncz
, we do need a generator, but it is the event generator, not the debug one.
if err != nil { | ||
t.Fatalf("couldn't curl sidecar: %v", err) | ||
} | ||
if !strings.Contains(out, "\"typeUrl\": \"istio.io/debug/syncz\"") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe we should attempt unmarshal it to make sure its valid, then we can do better checks as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I am now trying to unmarshal using jsonpb. If that fails I try again as json. Created #31997 to explain the problem.
/retest |
@@ -81,6 +81,8 @@ var ( | |||
proxyXDSViaAgent = env.RegisterBoolVar("PROXY_XDS_VIA_AGENT", true, | |||
"If set to true, envoy will proxy XDS calls via the agent instead of directly connecting to istiod. This option "+ | |||
"will be removed once the feature is stabilized.").Get() | |||
proxyXDSDebugViaAgent = env.RegisterBoolVar("PROXY_XDS_DEBUG_VIA_AGENT", true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we backport please make it false
. This is good for master though
…tio#31749) * Proof-of-concept * Use port 15009 * Send nonce with debug request * Use event generator for debug types * Remove debug printf * Refactor * Cleanup on close * test case, improved JSON serialization, simplify xDS request * Test case fix * Remove incorrect comment * Release note * Feature was added, not updated * Use port 15014, allow disabling with proxyMetadata.PROXY_XDS_DEBUG_VIA_AGENT=false * Move agent debug back to 15009 * Support for debug on single resources, and don't watch debug queries * typo * Fix test
This is for #22274
This PR causes the sidecar agent to listen on port 15009. (Currently plaintext and open, but that isn't the wish for the final version.)
When a request to :15009/debug/, for example :15009/debug/syncz, sends an XDS request with that TypeURL to Istiod and returns the response from Istiod. I did it this way so I could test with the browser, we may want to switch to require a POST of the actual XDS request.
@costinm @howardjohn @therealmitchconnors Does this match your expectations in terms of the direction?