-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Transparent Encryption: rolling key updates #7450
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
2331e07
to
9225ee7
Compare
test-me-please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks awesome. I assume some of the warnings are leftovers from debugging.
9225ee7
to
ca2eac5
Compare
Removed warnings, converted Timer->Sleep, and fixed a couple typos |
test-me-please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, mostly just logging type feedback, plus one question on input validation.
} | ||
defer file.Close() | ||
return loadIPSecKeys(file) | ||
} | ||
|
||
func loadIPSecKeys(r io.Reader) error { | ||
func loadIPSecKeys(r io.Reader) (uint8, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: For future review, it'd be nice to place an example input above here. Bit tricky to really look this over and understand the string manipulation below without an example. I see associated unit tests so not so worried, could be a minor follow-up comment in another commit/PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've got a set of cleanups/pretty comments I'll add it to those.
} | ||
authname := s[0] | ||
spi = uint8(spiI) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I notice that all of your examples start with spi = 1. Is 0 special in any way (eg non-encrypted)? (If special, maybe needs validation, if not, feedback is a no-op here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spi = 0 is no-encryption. Can add a comment and document in documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree would be nice, been wondering about the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a valid configuration in the file? Should it be rejected in some way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A configuration like '0 algo foo ...` yeah thats a bit bogus.
ipSecKeysGlobal[""] = ipSecKey | ||
} | ||
|
||
scopedLog.WithError(err).Warning("newtimer: oldSPI %u new spi %u", oldSpi, ipSecKey.Spi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this warning message in a form that you expect users to be able to react to? It seems a little cryptic at a glance. Is this a warning type situation or merely informational?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two more comments:
- There is an "spi" field inherited from the
scopedLog
above, reading the logs it is likely not to be obvious what that represents vs. these other two spis here - Log messages should avoid placing variable content into the message field, the new/old SPIs should be reported through the standard mechanism (ie
WithFields(...)
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That warning snuck in should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its not really a warning I was just debugging earlier and forgot to remove it. In fact its not even a timer anymore so newtimer also is nonsense.
if oldSpi != ipSecKey.Spi { | ||
go func() { | ||
time.Sleep(linux_defaults.IPsecKeyDeleteDelay) | ||
ipsecDeleteXfrmSpi(ipSecKey.Spi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to log after the sleep, but before deleting here? I think I'd want to know if somehow the forwarding is changed at an arbitrary point in future if I need to debug something going wrong here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can add add a dbg statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't happen terribly often should it? Just wondering whether Info would be fine for this. Probably stick with Debug unless others also would agree with Info.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should only happen when keys change so Info might be nice.
return __encap_and_redirect_with_nodeid(skb, tunnel->ip4, seclabel, monitor); | ||
if (tunnel->key) { | ||
if (from_host) | ||
return enacap_and_redirect_nomark_ipsec(skb, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, typo: enacap
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm and I must have cut'n'pasted it throughout? ... Will fix because I can't think up what I meant by enacap at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do this in another rename patch seeing its started earlier in the tree.
} | ||
|
||
authkey, err := decodeIPSecKey(s[1]) | ||
spiI, err := strconv.Atoi(s[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is spiI? ID, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here spi is security parameter index and I fell into hungarian naming convention it seems just to indicate that spiI is the integer version here giving spiI
. We do need to be a bit careful here because the spi is shifted into the mark field and if we shift a uint8 << 12 well there is nothing left by C standards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With mentioned cleanups addressed, LGTM.
ca2eac5
to
9225085
Compare
@joestringer @borkmann comments addressed take a look. I'll have a cleanup patch series to do renaming of functions (per comment that came from earlier) and a bit of commentary to show file layouts. Note once CI issue is resolved and CI PR is pushed we will get an example format in test directory but I agree nice to have inline in code as well. |
test-me-please |
It is possible that ipv6 updates may fail due to missing kernel modules or iptable6 filters. In this case we still want the ipv4 rules to be in place so reorder ipv4 before ipv6 setup. Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Currently, OUT xfrm rules use full (src,dst,spi) tuple. The original thinking on this was that we wanted to ensure matches only on relavent IP addresses. However now both state and policy are further restricted by mark values we can drop the src piece without worrying about having unintended matches. Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Currently, rolling updates may get stuck due to a time period between when some set of nodes have started with encryption enabled and another set exist without encryption enabled. At this point these two sets of nodes can only communicate from non-encrypted to encrypted set. The set with encryption enabled will encrypt traffic that will in turn be dropped by the set that has yet to enable encryption. To resolve we make encryption a property of the endpoint id. Keeping the key identifier with the endpoint to inform cilium which key should be used during an upgrade. Because we use the mark space to identify keys we limit the key space to two keys currently. After this key secrets will need to be updated to include an id field as follows, keys: "1 hmac(sha256) 0123456789abcdef0123456789abcdef cbc(aes) 0123456789abcdef0123456789abcdef" Where '1' is the id here. IDs are enforced to be less than 16. This is a bit arbitrary we could go as high as 256 without hitting mark bit limits. However, 16 feels sufficient and we cant take bits back later so start low and bump up if needed. The id '0' is a special id and should not be used it represents the absence of keys in the datapath. Signed-off-by: John Fastabend <john.fastabend@gmail.com>
After a timeout delete xfrm rules not matching current version. In this way we can perform non-disruptive updates to the version. This still requires a cilium restart however until we attach an inotify event to the file. xfrm state cleanup is still an over-approximation because states do not use mark values yet. A follow up series can clean this up. Signed-off-by: John Fastabend <john.fastabend@gmail.com>
9225085
to
91522fe
Compare
test-me-please |
When keys are being updated with a rolling update there is a period where both previous (possibly null key) and current key need to be in use to keep rolling update going. This series allows multiple keys to exist at the same time and removes the previous keys after a delay. Now rolling updates work as expected and we also clean up old keys which was not the case previously.
With this PR key format is changing to incorporate versions. The new format is,
If the version field is omitted (previous keys) a version of '1' is chosen.
Tested on GKE Ubuntu and local boxes by deploying Cilium and upgrading from null key -> version 0x01 -> version 0x02 and verified old rules were removed after timeout.
Note: Current key deletion delay is 5 minutes, after the timer the keys will be removed. May need some experimenting to see if this is sufficient or we should extend it.
Note: System is not robust against users recycling key versions in a key deletion window yet. So if a user adds a key=1, transitions to key=2, and then back to key=1 in a timer delay all keys will be lost. Will fix later but easy answer is to tell users not to do this.
This change is