-
Notifications
You must be signed in to change notification settings - Fork 2.2k
Add the Workload Identities for azureDns #5308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add the Workload Identities for azureDns #5308
Conversation
Hi @karlschriek. Thanks for your PR. I'm waiting for a cert-manager member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I am removing the "WIP" label since it it seems that otherwise no one will look at this. It is not ready to be merged yet though. |
This comment was marked as outdated.
This comment was marked as outdated.
@inteon Couldn't we use the same principal as we do for pod identity / managed identity and check if |
@karlschriek sorry for the confusion, @MattiasAng corrected my comment; we currently already have the the current cert-manager/pkg/issuer/acme/dns/azuredns/azuredns.go Lines 88 to 103 in b84ea96
|
I see! I'll add that same check then.
Yes, both managed identities and workload identities are "ambient", at
least as far as I understand how you define it. A managed identity is
attached to the Node that cert-manager is running on (and all Pods on the
Node can use the identity). A workload identity is attached specifically to
the cert-manager Pod. In the implementation it comes down to the same thing
though, i.e. an identity is available and with a flag we can choose to use
it
…On Fri, 14 Oct 2022, 17:40 Tim Ramlot, ***@***.***> wrote:
@karlschriek <https://github.com/karlschriek> sorry for the confusion,
@MattiasAng <https://github.com/MattiasAng> corrected my comment; we
currently already have the --issuer-ambient-credentials and
--cluster-issuer-ambient-credentials flags that make sure we have no
unwanted lateral privilege escalation between Issuers.
I'm not sure what the difference is between these ambient credentials and
the proposed Workload Identities in this PR, but it is necessary that we
only use workload identities when allowed by the ambient-credentials flag.
the current ambient-credentials implementation can be found here:
https://github.com/cert-manager/cert-manager/blob/b84ea96d73938c40c9499c302d2641ff76fb641c/pkg/issuer/acme/dns/azuredns/azuredns.go#L88-L103
—
Reply to this email directly, view it on GitHub
<#5308 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGBE4OGOFGYE6GJZ7LXXQBDWDF5HVANCNFSM53WJGAUQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Always attempt to use AWI and thereafter MSI when ambient credentials are enabled.
3c36099
to
ac15a4d
Compare
Thanks for your pull request. Before we can look at it, you'll need to add a 'DCO signoff' to your commits. 📝 Please follow instructions in the contributing guide to update your commits with the DCO Full details of the Developer Certificate of Origin can be found at developercertificate.org. The list of commits missing DCO signoff:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: karlschriek The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Ok I have now incorporated the discussion points above. Have also significantly simplified the PR. Previously I was introducing a new flag "useWorkloadIdentity" which needed to be explicitly toggled to "true" for the Azure Workload Identity be used. I've now abandoned this and have integrated with the current approach with the "ambient" flag. It now works as follows: If the checks are passed that ambient credentials should be used, then we first attempt to use the most specific ambient credentials (Workload Identity). If a Workload Identity is not found, we proceed to use the Managed Identity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Karl,
I've read through the code and spotted one error that I think should be checked and returned.
I'll add ok-to-test, so that you can see the tests pass.
Do you intend to add some unit tests to https://github.com/cert-manager/cert-manager/blob/ac15a4dff11dec37edb11d4ac6fe77c46af0444b/pkg/issuer/acme/dns/azuredns/azuredns_test.go ?
Did you run those tests after changing this code? If so, please paste the results in a comment, because those tests are skipped in our CI environment.
/ok-to-test
spt, err := getWorkloadIdentity(env) | ||
if spt != nil { | ||
return spt, nil | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This err
should be checked and returned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to add to that, there could be multiple reasons why a token cannot be obtained. E.g. I could easily imagine a case where someone would forget to create a federated-credential identity resource, then the error would be this (taken from my experiments with external-dns):
external-dns-595b67dbc4-7pqz7 external-dns time="2022-10-26T17:01:16Z" level=error msg="azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/64842ced-4781-416f-81ff-482b7f562581/resourceGroups/aks-rg/providers/Microsoft.Network/dnsZones?api-version=2018-05-01: StatusCode=400 -- Original Error: adal: Refresh request failed. Status Code = '400'. Response body: {\"error\":\"invalid_request\",\"error_description\":\"AADSTS70021: No matching federated identity record found for presented assertion. Assertion Issuer: 'https://westeurope.oic.prod-aks.azure.com/bb9ebc7f-b105-4532-95ca-abf3431ff466/c4f4982c-33b9-4f49-b973-8f0095c5c551/'. Assertion Subject: 'system:serviceaccount:external-dns:external-dns'. Assertion Audience: 'api://AzureADTokenExchange'. https://docs.microsoft.com/en-us/azure/active-directory/develop/workload-identity-federation\\r\\nTrace ID: d9593b57-d1f7-41f3-9ef7-a236250f6500\\r\\nCorrelation ID: 3bb467a9-eb14-4291-a1b8-8502e75ebfac\\r\\nTimestamp: 2022-10-26 17:01:16Z\",\"error_codes\":[70021],\"timestamp\":\"2022-10-26 17:01:16Z\",\"trace_id\":\"d9593b57-d1f7-41f3-9ef7-a236250f6500\",\"correlation_id\":\"3bb467a9-eb14-4291-a1b8-8502e75ebfac\",\"error_uri\":\"https://login.microsoftonline.com/error?code=70021\"} Endpoint https://login.microsoftonline.com/bb9ebc7f-b105-4532-95ca-abf3431ff466/oauth2/token?api-version=1.0"
That makes me think that it's better not to implicitly rely on the if spt != nil
check, we should rather have an explicit way to force cert-manager to use Workload Identity. Then error handling and the code itself would become more straight-forward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about that more, if the ambient flag is the way to go in cert-manager, then we can use something like this:
// Use Workload Identity if present
if os.Getenv("AZURE_FEDERATED_TOKEN_FILE") != "" {
return getWorkloadIdentity(env)
}
instead of:
spt, err := getWorkloadIdentity(env)
if spt != nil {
return spt, nil
}
To me, that seems to be concise enough, and no errors are suppressed.
(Upd): a more comprehensive example: https://github.com/cert-manager/cert-manager/compare/master...weisdd:cert-manager:feature/azure-workload-identity?expand=1
awiClientId := os.Getenv("AZURE_CLIENT_ID") | ||
awiTenantId := os.Getenv("AZURE_TENANT_ID") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these values available from env
(azure.Environment
) ?
Do these environment variables get added to the cert-manager Pod by Mutating Admission Webhook ?
Add a comment explaining where these environment variables come from with a link to the workload identity federation documentaion.
} | ||
|
||
logf.Log.V(logf.InfoLevel).Info("No Azure Workload Identity found: attempting to authenticate with an Azure Managed Identity (MSI)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logf.Log.V(logf.InfoLevel).Info("No Azure Workload Identity found: attempting to authenticate with an Azure Managed Identity (MSI)") | |
logf.Log.V(logf.InfoLevel).Info("No Azure Workload Identity found: attempting to authenticate with an Azure Managed Service Identity (MSI)") |
@@ -84,19 +111,26 @@ func getAuthorization(env azure.Environment, clientID, clientSecret, subscriptio | |||
} | |||
return spt, nil | |||
} | |||
logf.Log.V(logf.InfoLevel).Info("No ClientID found: authenticating azuredns with managed identity (MSI)") | |||
logf.Log.V(logf.InfoLevel).Info("No ClientID found: attempting to authenticate with ambient credentials (Azure Workload Identity or Azure Managed Identity, in that order)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logf.Log.V(logf.InfoLevel).Info("No ClientID found: attempting to authenticate with ambient credentials (Azure Workload Identity or Azure Managed Identity, in that order)") | |
logf.Log.V(logf.InfoLevel).Info("No ClientID found: attempting to authenticate with ambient credentials (Azure Workload Identity or Azure Managed Service Identity, in that order)") |
jwtBytes, err := ioutil.ReadFile(os.Getenv("AZURE_FEDERATED_TOKEN_FILE")) | ||
if err != nil { | ||
return nil, fmt.Errorf("Failed to get Azure Workload Identity token for file: %v", err) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the AZURE_*
environment variables are empty or unset, these errors might be quite confusing.
If AZURE_FEDERATED_TOKEN_FILE is empty or not set, will it be clear from this error what has gone wrong?
Perhaps there needs to be some early checks that all these env vars are set and not-empty.
Can AZURE_CLIENT_ID and AZURE_TENNANT_ID be empty or unset for any reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically, ClientId is not required as an annotation on the service account when using Azure Workload. This means the environment variable can be absent. In that case it's up to the application to pass it in manually (obtained via other means).
Keda uses this approach wherein they allow you to override the client id on a per usage basis, which is quite important when talking to many different resources.
In context here, suppose you want to use a different MSI to authenticate with different azure dns zones from the same cluster. If you place the clientId annotation on the certmanager service account, that's what the env var would be in the certmanager pod... but you'd need a way to specify the other MSI. In such a situation, the likely approach would be to not set the annotation (and thus the env var) but instead pass it in via some other means--for example the dns01 stanza similar to MSI now.
Perhaps those cases have been covered here...I'm honestly not sure as I'm not familiar with this code base. I'm just someone who's been using AZWI since private preview and actively monitoring the PR/issues for apps like this (blocking our full adoption). I figured I'd call it out based on how I read your question, just in case. If I'm off base, ignore me...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure that is true. The Workload Identity webhook needs the Client ID (which is set as an annotation on the service account) in order to resolve the identity and request the token. It then injects that token, as well as the env vars AZURE_FEDERATED_TOKEN_FILE, AZURE_CLIENT_ID and AZURE_TENANT_ID.
If those are not set then the Workload Identity didn't get attached correctly (or some other process overwrote them)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure that is true. The Workload Identity webhook needs the Client ID
I can assure you that it's 100% optional. We use it in such a fashion. The only required metadata is the azure.workload.identity/use
label on the ServiceAccount
which is what the mutating webhook actually triggers off of. See the list of annotations/labels here. Note that nothing is marked as required except for the label. If omitted only the tenantId/token-file environment variables are injected into the pod. Don't take my word though -- it's easy enough to verify
The ClientId
as a whole is required in the sense that by the time you make the request for your access-token* you must have one. A consuming application must look to the environment variable as a default/fallback and offer a way to customize/override if/where appropriate.
The annotation is optional because you might be in a scenario where the single service account needs to act as multiple MSIs, which is a supported scenario by AZWI and something they promote. This is exactly how we use it with Keda; the single Keda pod needs to authenticate with multiple resources as different MSIs. For example, Keda scales application A1
based on Storage account S1
but scales application A2
based on Storage account S2
. With Pod Identity you needed to run Keda under a single "god MSI" that had access to both sets of storage. With AZWI, however, the point is you should be working in a mode of least-privilege meaning independent MSIs scoped to the individual applications and their resources. If you were to annotate the KEDA ServiceAccount
with the clientId for A1/S1
then processing for A2
might gain inadvertent access to unowned resources if you missed providing the overriden clientId, defeating the purpose.
In cases where multiple MSI's may be bound to a single SA* you'd more than likely not want a default value. Instead, you'd leave the annotation blank and rely on the application to provide an alternative means by which to provide the ClientId. For example, in Keda it's an extra field on their TriggerAuthentication
CRD.
In the context of cert-manager assume you have multiple DNS zones, with different MSI's / permissions to manage them. Multiple/distinct accounts is already supported indirectly with pod-identity
via multiple bindings. It's also supported natively assuming you want to manage the clientId/tenantId/resourceGroup
+ clientSecretSecretRef
in the dns01
stanza yourself. AZWI is designed to defeat the shortcomings of both of these approaches, but without allowing an override--presumably in dns01
--you're essentially crippling it. IMO it's half-baked without it.
* In Azure, federated credentials are a sub-resource of an MSI. In the context of AKS, each credential is bound to a specific cluster + service account. One Service Account can therefore have associated credentials for many MSIs. The token injected into the pod by the mutating webhook from the OIDC issuer (the cluster) is for the SA and is itself MSI agnostic. That token is then exchanged with AAD alongside an explicit clientId provided by the application. The web-hook itself cares nothing about the clientId; it does however provides a simple mechanism to inject the environment variable for the common case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, thanks for the explanation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pinkfloydx33 Are you thinking about something like that: https://github.com/cert-manager/cert-manager/compare/master...weisdd:cert-manager:feature/azure-workload-identity?expand=1 ? I haven't tested it yet, just wrote it in accordance with your suggestion :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@weisdd Yeah that looks to be about what I was thinking. Personally I would just pass it opt.ClientID
/managedIdentity.ClientID
directly since the rest of the adal.ManagedIdentityOptions
struct (ie IdentityResourceID
) is useless to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some other thoughts:
I think(?; * see further) it's fine to key off the prescense of the AZURE_FEDERATED_TOKEN_FILE
variable. But at the same time, even if the token file is present we do allow the operator to specify a ServicePrincipal directly (ie. azureDns.clientID
/ the first ~12lines of getAuthorization
) and bypass AZWI. It's a little at odds with the below, but I think that's safe as the user would've explicitly needed to provide credentials.
The IdentityResourceId
however isn't used by AZWI, but it has a purpose in the context of pod-identity. If someone specified it, they'd likely(?) not be expecting AZWI. But because of the precense of the token variable we'd silently ignore the setting, even though we eagerly handle the SP case.
That might be fine; I don't know why you'd be trying to use both AZWI and podIdentity (outside of a cut-over maybe?). Perhaps an error should be raised if:
AZURE_FEDERATED_TOKEN_FILE
is set, and- [the actual token file is found(?), and]
managedIdentity.resourceID
is set
Or AZWI should be skipped if:
managedIdentity.resourceID
is set
I'd lean on the former (the error). Point being I think in the absence of a new CRD or application level flag (to select MSI-"mode") the order of operations should be obvious/spelled out clearly.
* When deciding to use AZWI, is precense of the one variable enough? Should we check both it and AZURE_TENANT_ID
exist? Or perhaps that the file actually exists? If all conditions aren't met, then fallback to the existing code? I can't see why someone would set the variable themselves so perhaps not a big deal...just thinking aloud.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pinkfloydx33 Sorry for the delay in my answers, several things overlapped at work, so quite busy.
Amongst the environment variables, I chose AZURE_FEDERATED_TOKEN_FILE
just because it felt like the safest option. Potentially, cert-manager can have an option to override tenant-id through tenantID
, so it's kind of fine to have other configuration means, whereas the federated token itself will always be read from disk as it's set to expire, simply incompatible with any static configuration option.
As for the sequence of authentication methods and additional errors, I'm not sure how tolerant cert-manager is expected to be for this type of misconfiguration (where you have certain settings set for a few different methods). It could feel like over-engineering, though if the code is expected to have a strict validation, it would be easy to add it.
For me, it's fine to just mention in the documentation that there's a certain set of supported authentication methods, and they're evaluated in a particular order. Plus, we already share some hints through log messages.
As for me passing opt
instead of opt.ClientID
, I just wanted to make it resemble the further call to adal.NewServicePrincipalTokenFromManagedIdentity
, but it's fine to change that :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem. I was "thinking out loud", just to talk it through. I think anything is fine so long as the documentation
- reflects the order/prerequisites, and
- indicates the AZWI clientId can be overridden via
azureDns.managedIdentity.clientID
- (?) mentions that
azureDns.managedIdentity.resourceID
is unused in this scenario / cannot influence AZWI
return nil, fmt.Errorf("failed to retrieve OAuth config: %v", err) | ||
} | ||
|
||
spt, err := adal.NewServicePrincipalTokenFromFederatedToken(*oauthConfig, awiClientId, jwt, env.ResourceManagerEndpoint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@karlschriek first of all, thanks for your contribution!
I noticed that you implemented a similar patch for external-dns here: kubernetes-sigs/external-dns#2885 with a comment (although note, it does not deal with ensuring that the refreshed token is read from disk)
. As there have been no changes since July, and the PR sadly stays unreviewed till this day, yesterday I decided it'd be fair to try coming up with my own implementation that would tackle that issue while still being based on adal
.
So, basically, the federated token is valid for 1h, whereas the AAD token that is received in exchange for federated - for 24h (docs). As per my understanding, after ~24h the implicitly called .Refresh()
method for spt
would start producing errors, because the spt
struct would have a secret with stale jwt
, thus failing to get authorized against ActiveDirectoryEndpoint
.
adal
doesn't seem to offer any native support for tracking changes in the file with federated token. The only sane method that I was able to find by looking through the library was taking advantage of customRefreshFunc
, which is supposed to return a fresh access token or an error if any. To avoid having to reinvent the wheel, I wanted to reuse as many existing methods as possible.
The final logic is the following: whenever an access token is about to expire (or just empty right after spt
is generated), the library would automatically call a wrapper function that would read the federated token from disk and use it to obtain a new access token by invoking the standard process through explicit call of .Refresh()
method.
You could take a look at my implementation here: kubernetes-sigs/external-dns#3111
I'm not very familiar with cert-manager code yet and especially whether it has any issuer re-initialization logic in case of token renewal errors, but I guess we'll need to have the same custom wrapper in place here.
I'm still testing the wrapper in external-dns, but as customRefreshFunc
is successfully called right after external-dns starts, I don't foresee any issues.
WDYT? And have you had a chance to run the patched version of cert-manager for more than 1 day? (just curious what you'd see in logs and how cert-manager behaves)
@wallrj FYI
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update on external-dns: I've run it with my patch for over 24h, the access token got refreshed just fine. That convinces me that it's worth porting the wrapper to cert-manager as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems like a perfectly good solution to me! Because you mention external-dns specifically, we currently work around the refresh issue by just restarting the whole Pod every hour ;)
I'm unfamiliar with the practices/policies of this repository. Should the OP be updated to reflect this information? It's currently out of sync with the implementation. Further questions:
I imagine it'd be a seperate PR anyways, but at least up to this point there shouldn't be a direct need outside of version bumps since existing constructs are being leveraged, correct? |
Hi @karlschriek @wallrj, I hope you found @pinkfloydx33's and my reviews useful. As mentioned in multiple threads, it would be great to fix token renewal and also introduce support for multiple identities. You could see potential code here: https://github.com/cert-manager/cert-manager/compare/master...weisdd:cert-manager:feature/azure-workload-identity?expand=1 Please, let me know if, how and in which form I can help you to drive it all further. |
I find the suggestions very sensible, but I will not find time to implement
them anytime soon. Would someone else like to have a go?
…On Mon, 7 Nov 2022, 15:11 Igor Beliakov, ***@***.***> wrote:
Hi @karlschriek <https://github.com/karlschriek> @wallrj
<https://github.com/wallrj>,
I hope you found @pinkfloydx33 <https://github.com/pinkfloydx33>'s and my
reviews useful. As mentioned in multiple threads, it would be great to fix
token renewal and also introduce support multiple identities.
You could see potential code here:
https://github.com/cert-manager/cert-manager/compare/master...weisdd:cert-manager:feature/azure-workload-identity?expand=1
Later this week, I'll spend some time testing it with multiple identities
(my own setup doesn't need that, so it's more to help with what
@pinkfloydx33 <https://github.com/pinkfloydx33> suggested), though I
don't expect much changes in code.
And my interest is not to push my code, but rather to help you to
introduce a robust support for Workload Identity.
Please, let me know if, how and in which form I can help you to drive it
all further.
Thanks!
—
Reply to this email directly, view it on GitHub
<#5308 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AGBE4OFCAQOI4HUA5SK62M3WHEE2DANCNFSM53WJGAUQ>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
@karlschriek sure, I'll be happy to prepare another PR with these changes incorporated and to write some documentation as well. - I need to prepare a Workload Identity-enabled setup at work anyway. ETA should be somewhere between this Friday and next Wednesday, depending on the amount of side tasks at work. And I guess @pinkfloydx33 would be willing to test it :) |
Absolutely |
@pinkfloydx33 I've just opened a new PR: #5570 Let's continue further discussion there. I hope it'll work just fine in your sandbox. |
closing in favour of #5570 |
Signed-off-by: Karl Schriek karl.schriek@bma.network
Pull Request Motivation
This PR adds the ability to to authenticate against Azure using Workload Identities. Closes #5085
Bumps
adal
version to 0.9.20 in order to make use ofNewServicePrincipalTokenFromFederatedToken
Kind
feature
Release Note
Work in Progress!
I will need some help to get this across the line. In particular, I am fairly certain that the zz_generated.. files need to be regenerated, but am not sure how. Could also use some guidance on where to add tests. Happy also to have a discussion on how I implemented this in the API. I added a boolean field "UseWorkloadIdentity". If set to true, all other fields (client ID, managed identity etc.) are ignored. If set to false, everything works exactly the way it previously did. This may or may not be the best way to do this. It is a bit hard for me to judge since I am not familiar enough with the other methods.
Also, where should I document this change?