Skip to content

Ability to tap into istio-agent to route debug messages to istiod #31749

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Apr 9, 2021

Conversation

esnible
Copy link
Contributor

@esnible esnible commented Mar 28, 2021

This is for #22274

This PR causes the sidecar agent to listen on port 15009. (Currently plaintext and open, but that isn't the wish for the final version.)

When a request to :15009/debug/, for example :15009/debug/syncz, sends an XDS request with that TypeURL to Istiod and returns the response from Istiod. I did it this way so I could test with the browser, we may want to switch to require a POST of the actual XDS request.

kubectl exec deployment/httpbin -c istio-proxy -- curl -s localhost:8080/debug/syncz
{
  "versionInfo": "2021-03-31T16:39:37Z/107",
  "resources": [
    {
      "@type": "type.googleapis.com/envoy.service.status.v3.ClientConfig",
      "node": {
        "id": "httpbin-86446ddffb-qz4fb.default"
      },
      "xdsConfig": [
        {
          "status": "SYNCED",
          "listenerConfig": {

          }
...

@costinm @howardjohn @therealmitchconnors Does this match your expectations in terms of the direction?

@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Mar 28, 2021
@google-cla google-cla bot added the cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. label Mar 28, 2021
@istio-testing istio-testing added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Mar 28, 2021
@istio-testing
Copy link
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@esnible
Copy link
Contributor Author

esnible commented Mar 28, 2021

/test all

@esnible
Copy link
Contributor Author

esnible commented Mar 29, 2021

/retest

Copy link
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high level this looks like the right approach. Nice work!

@@ -464,7 +473,11 @@ func (p *XdsProxy) handleUpstreamResponse(con *ProxyConnection) {
forwardToEnvoy(con, resp)
}
default:
forwardToEnvoy(con, resp)
if strings.HasPrefix(resp.TypeUrl, "istio.io/debug/") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there is a way to tag the response in some way. hypothetically the tap may want to support all types, including envoy types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to use DiscoveryRequest.ResponseNonce, and have the DiscoveryResponse.Nonce contain a matching value? I realize this is backwards from how it is used, but it would let the requests and replies be matched precisely.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if this change involved no changes to pilot's code at all. If that is impossible we can compromise, but ideally its all client side. Maybe resourceName could encode it? I will think on this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the latest, the only change to Pilot is to use Event Generator instead of API Generator.

if !ok {
requestsQueue = make([]chan *discovery.DiscoveryResponse, 0, 1)
}
requestsQueue = append(requestsQueue, responseChannel)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need multiple concurrently queued requests? or just a single one per-typeurl?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I am not sure its even possible to send 2 requests in a row without a response? Istiod will block the second request

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine multiple 3rd party tools pulling at the same time. I can re-factor the code so that if multiple tools hit the debug tap at once that block on the agent rather than going up together to Istiod.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think that would work better, and agree its useful to be able to have multiple requests


func (p *XdsProxy) makeTapHandler() func(w http.ResponseWriter, req *http.Request) {
return func(w http.ResponseWriter, req *http.Request) {
serviceAccount := env.RegisterStringVar("SERVICE_ACCOUNT", "default", "").Get()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

long term we should use the entire same metadata we use for real XDS

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which shouldn't be too bad, just a lot of plumbing data around


p.connected.requestsChan <- req

return responseChannel, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we do just have one enque-able, I think this function can also just block until it gets a response as well?

httpMux := http.NewServeMux()
httpMux.HandleFunc("/debug/", p.makeTapHandler())

HTTPAddr := ":15009"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this was for demo, but we cannot listen on 0.0.0.0. Either UDS or localhost.

@@ -242,6 +245,10 @@ func (s *DiscoveryServer) processRequest(req *discovery.DiscoveryRequest, con *C
return s.pushXds(con, push, versionInfo(), con.Watched(req.TypeUrl), request)
}

func isXdsEvent(req *discovery.DiscoveryRequest) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to have this logic if possible. I think I know why we do - to avoid treating the REQ as an ACK and being ignored? I think we could also set empty responceNonce as an option?

@@ -464,7 +473,11 @@ func (p *XdsProxy) handleUpstreamResponse(con *ProxyConnection) {
forwardToEnvoy(con, resp)
}
default:
forwardToEnvoy(con, resp)
if strings.HasPrefix(resp.TypeUrl, "istio.io/debug/") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if this change involved no changes to pilot's code at all. If that is impossible we can compromise, but ideally its all client side. Maybe resourceName could encode it? I will think on this

@irisdingbj
Copy link
Member

irisdingbj commented Mar 30, 2021

@esnible I may lack of context for the PR's purpose. So please correct me if I miss something. We now have an experimental command istioctl x internal-debug and it can send debug request to our secured debug interface. ( It will dispatch the xds request to our internal http debug server under the hood). If you provide a pod for this command, it will run using this pod's identify. (eg: istioctl x internal-debug productpage-xxx.default) . Wondering can we reuse it ?

Basically it supports our current /debug/xxxz command.

p.tapMutex.Lock()
requestsQueue, ok := p.tapRequests[resp.TypeUrl]
if ok {
requestChan := requestsQueue[0]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like multiple concurrent requests are supported. How can we be sure the first request is the one that matches this response?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed at the UX meeting, multiple concurrent requests are not needed. The latest change eliminates that.
With the design you commented on, the only parameter is the TypeUrl, and we check that it matches. The intention was that although the performance might be unfair, the results would be correct -- everyone sees the TypeUrl they asked for.

@esnible
Copy link
Contributor Author

esnible commented Mar 31, 2021

/test all

1 similar comment
@esnible
Copy link
Contributor Author

esnible commented Mar 31, 2021

/test all

@esnible
Copy link
Contributor Author

esnible commented Mar 31, 2021

/test integ-pilot-multicluster-tests_istio

Copy link
Member

@howardjohn howardjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great

cc @hzxuzhonghu @ramaraochavali

return nil, fmt.Errorf("proxy not connected to Istiod")
}

defer p.tapMutex.Unlock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I suppose this works too but it seems more common to add this immediately after the .Lock()

@@ -464,7 +471,11 @@ func (p *XdsProxy) handleUpstreamResponse(con *ProxyConnection) {
forwardToEnvoy(con, resp)
}
default:
forwardToEnvoy(con, resp)
if strings.HasPrefix(resp.TypeUrl, "istio.io/debug/") {
p.tapResponseChannel <- resp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if pilot pushes a few responses and some times out this may deadlock? since there is no receiver trying to read from the channel. This would impact the real XDS traffic

typeURL := fmt.Sprintf("istio.io%s", req.URL.Path)
response, err := p.tapRequest(&discovery.DiscoveryRequest{
Node: &envoy_v3.Node{
Id: "sidecar~1.1.1.1~debug~cluster.local", // TODO listen for and use Envoy's ID?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need it at all?

}

// Send Istiod's response as HTTP body
j, err := json.Marshal(response)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for xds marshal we probably need jsonpb?

@esnible
Copy link
Contributor Author

esnible commented Mar 31, 2021

/test all

@esnible esnible marked this pull request as ready for review March 31, 2021 19:52
@esnible esnible requested review from a team as code owners March 31, 2021 19:52
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Mar 31, 2021
"os"
"path"
"strings"
"sync"
"time"

discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
// _ "github.com/envoyproxy/go-control-plane/envoy/service/status/v3" // needed for piggyback marshal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is commented out. Is it still needed?

// Send Istiod's response as HTTP body
jsonm := &jsonpb.Marshaler{Indent: " "}
w.Header().Add("Content-Type", "application/json")
_ = jsonm.Marshal(w, response)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Can the line "_ = jsonm.Marshal(w, response)" be simplified to "jsonm.Marshal(w, response)"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's there for lint. If I don't do it, Istio's build process says Error return value of jsonm.Marshal is not checked.

It would be possible to Marshal to a buffer, check the result, and then send that result to w. It seemed like a lot of work, given that it isn't really possible for this code to fail.

@esnible esnible requested review from a team as code owners March 31, 2021 21:38
@esnible esnible force-pushed the agent-piggyback-v2 branch from d11b87b to d52ce0f Compare April 1, 2021 18:19
@esnible esnible added the do-not-merge/hold Block automatic merging of a PR. label Apr 5, 2021
@esnible esnible removed the do-not-merge/hold Block automatic merging of a PR. label Apr 5, 2021
@esnible
Copy link
Contributor Author

esnible commented Apr 5, 2021

/retest

// TODO move this to just directly using the resource TypeUrl
g = s.Generators["api"] // default to "MCP" generators - any type supported by store
if strings.HasPrefix(typeURL, "istio.io/debug/") {
g = s.Generators["event"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue - when we send a debug request, we register a watch for that type. This watch is never removed. As a result, all config changes will re-push a new debug response. Example logs:

2021-04-05T21:06:24.639273Z     info    ads     Push debounce stable[10] 1: 101.013956ms since last change, 101.013815ms since last push, full=true
2021-04-05T21:06:24.639415Z     info    ads     XDS: Pushing:2021-04-05T21:06:24Z/8 Services:5 ConnectedEndpoints:2  Version:2021-04-05T21:06:24Z/8
2021-04-05T21:06:24.640212Z     info    ads     CDS: PUSH for node:echo-cb96f8d94-ssbm8.default resources:24 size:17.1kB
2021-04-05T21:06:24.640247Z     info    ads     EDS: PUSH for node:echo-cb96f8d94-ssbm8.default resources:6 size:1.2kB empty:0 cached:6/6
2021-04-05T21:06:24.641634Z     info    ads     LDS: PUSH for node:echo-cb96f8d94-ssbm8.default resources:16 size:64.8kB
2021-04-05T21:06:24.641921Z     info    ads     RDS: PUSH for node:echo-cb96f8d94-ssbm8.default resources:6 size:4.3kB
2021-04-05T21:06:24.641971Z     info    ads     istio.io/debug/edsz: PUSH for node:echo-cb96f8d94-ssbm8.default resources:0 size:0B
2021-04-05T21:06:24.642000Z     info    ads     istio.io/debug/configz: PUSH for node:echo-cb96f8d94-ssbm8.default resources:0 size:0B
2021-04-05T21:06:24.642010Z     info    ads     istio.io/debug/config_dump with 0 ResourceNames
2021-04-05T21:06:24.642028Z     info    ads     istio.io/debug/config_dump: PUSH for node:echo-cb96f8d94-ssbm8.default resources:0 size:0B
2021-04-05T21:06:24.642075Z     info    ads     istio.io/debug/syncz: PUSH for node:echo-cb96f8d94-ssbm8.default resources:2 size:134B
2021-04-05T21:06:24.656383Z     info    ads     EDS: PUSH request for node:echo-cb96f8d94-ssbm8.default resources:15 size:3.0kB empty:0 cached:15/15

// TODO move this to just directly using the resource TypeUrl
g = s.Generators["api"] // default to "MCP" generators - any type supported by store
if strings.HasPrefix(typeURL, "istio.io/debug/") {
g = s.Generators["event"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be s.Generators[TypeDebug]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need Generators[TypeDebug] if the typeURL is istio.io/debug (no trailing slash). If the type URL is istio.io/debug/syncz, we do need a generator, but it is the event generator, not the debug one.

if err != nil {
t.Fatalf("couldn't curl sidecar: %v", err)
}
if !strings.Contains(out, "\"typeUrl\": \"istio.io/debug/syncz\"") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe we should attempt unmarshal it to make sure its valid, then we can do better checks as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I am now trying to unmarshal using jsonpb. If that fails I try again as json. Created #31997 to explain the problem.

@esnible esnible added the do-not-merge/hold Block automatic merging of a PR. label Apr 6, 2021
@esnible
Copy link
Contributor Author

esnible commented Apr 7, 2021

/retest

@esnible esnible removed the do-not-merge/hold Block automatic merging of a PR. label Apr 8, 2021
@@ -81,6 +81,8 @@ var (
proxyXDSViaAgent = env.RegisterBoolVar("PROXY_XDS_VIA_AGENT", true,
"If set to true, envoy will proxy XDS calls via the agent instead of directly connecting to istiod. This option "+
"will be removed once the feature is stabilized.").Get()
proxyXDSDebugViaAgent = env.RegisterBoolVar("PROXY_XDS_DEBUG_VIA_AGENT", true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we backport please make it false. This is good for master though

@istio-testing istio-testing merged commit f464909 into istio:master Apr 9, 2021
shonecyx pushed a commit to shonecyx/istio that referenced this pull request Dec 24, 2021
…tio#31749)

* Proof-of-concept

* Use port 15009

* Send nonce with debug request

* Use event generator for debug types

* Remove debug printf

* Refactor

* Cleanup on close

* test case, improved JSON serialization, simplify xDS request

* Test case fix

* Remove incorrect comment

* Release note

* Feature was added, not updated

* Use port 15014, allow disabling with proxyMetadata.PROXY_XDS_DEBUG_VIA_AGENT=false

* Move agent debug back to 15009

* Support for debug on single resources, and don't watch debug queries

* typo

* Fix test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/user experience cla: yes Set by the Google CLA bot to indicate the author of a PR has signed the Google CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants