Skip to content

Conversation

jrfastab
Copy link
Contributor

@jrfastab jrfastab commented Mar 20, 2019

When keys are being updated with a rolling update there is a period where both previous (possibly null key) and current key need to be in use to keep rolling update going. This series allows multiple keys to exist at the same time and removes the previous keys after a delay. Now rolling updates work as expected and we also clean up old keys which was not the case previously.

With this PR key format is changing to incorporate versions. The new format is,

  version algo key algo key

If the version field is omitted (previous keys) a version of '1' is chosen.

Tested on GKE Ubuntu and local boxes by deploying Cilium and upgrading from null key -> version 0x01 -> version 0x02 and verified old rules were removed after timeout.

Note: Current key deletion delay is 5 minutes, after the timer the keys will be removed. May need some experimenting to see if this is sufficient or we should extend it.
Note: System is not robust against users recycling key versions in a key deletion window yet. So if a user adds a key=1, transitions to key=2, and then back to key=1 in a timer delay all keys will be lost. Will fix later but easy answer is to tell users not to do this.


This change is Reviewable

@jrfastab jrfastab requested review from a team March 20, 2019 15:13
@jrfastab jrfastab requested review from a team as code owners March 20, 2019 15:13
@jrfastab jrfastab requested review from a team March 20, 2019 15:13
@jrfastab jrfastab requested a review from a team as a code owner March 20, 2019 15:13
@jrfastab jrfastab force-pushed the rolling-updates branch 4 times, most recently from 2331e07 to 9225ee7 Compare March 20, 2019 16:18
@coveralls
Copy link

coveralls commented Mar 20, 2019

Coverage Status

Coverage decreased (-0.05%) to 42.598% when pulling 91522fe on rolling-updates into 8ac4d10 on master.

@jrfastab
Copy link
Contributor Author

test-me-please

Copy link
Member

@tgraf tgraf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks awesome. I assume some of the warnings are leftovers from debugging.

@jrfastab
Copy link
Contributor Author

Removed warnings, converted Timer->Sleep, and fixed a couple typos

@jrfastab
Copy link
Contributor Author

test-me-please

@tgraf tgraf added pending-review area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/major This PR introduces major new functionality to Cilium. labels Mar 21, 2019
Copy link
Member

@joestringer joestringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, mostly just logging type feedback, plus one question on input validation.

}
defer file.Close()
return loadIPSecKeys(file)
}

func loadIPSecKeys(r io.Reader) error {
func loadIPSecKeys(r io.Reader) (uint8, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: For future review, it'd be nice to place an example input above here. Bit tricky to really look this over and understand the string manipulation below without an example. I see associated unit tests so not so worried, could be a minor follow-up comment in another commit/PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got a set of cleanups/pretty comments I'll add it to those.

}
authname := s[0]
spi = uint8(spiI)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I notice that all of your examples start with spi = 1. Is 0 special in any way (eg non-encrypted)? (If special, maybe needs validation, if not, feedback is a no-op here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spi = 0 is no-encryption. Can add a comment and document in documentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree would be nice, been wondering about the same.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a valid configuration in the file? Should it be rejected in some way?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A configuration like '0 algo foo ...` yeah thats a bit bogus.

ipSecKeysGlobal[""] = ipSecKey
}

scopedLog.WithError(err).Warning("newtimer: oldSPI %u new spi %u", oldSpi, ipSecKey.Spi)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this warning message in a form that you expect users to be able to react to? It seems a little cryptic at a glance. Is this a warning type situation or merely informational?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two more comments:

  • There is an "spi" field inherited from the scopedLog above, reading the logs it is likely not to be obvious what that represents vs. these other two spis here
  • Log messages should avoid placing variable content into the message field, the new/old SPIs should be reported through the standard mechanism (ie WithFields(...)).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That warning snuck in should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its not really a warning I was just debugging earlier and forgot to remove it. In fact its not even a timer anymore so newtimer also is nonsense.

if oldSpi != ipSecKey.Spi {
go func() {
time.Sleep(linux_defaults.IPsecKeyDeleteDelay)
ipsecDeleteXfrmSpi(ipSecKey.Spi)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to log after the sleep, but before deleting here? I think I'd want to know if somehow the forwarding is changed at an arbitrary point in future if I need to debug something going wrong here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can add add a dbg statement.

Copy link
Member

@joestringer joestringer Mar 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shouldn't happen terribly often should it? Just wondering whether Info would be fine for this. Probably stick with Debug unless others also would agree with Info.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should only happen when keys change so Info might be nice.

return __encap_and_redirect_with_nodeid(skb, tunnel->ip4, seclabel, monitor);
if (tunnel->key) {
if (from_host)
return enacap_and_redirect_nomark_ipsec(skb,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, typo: enacap ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm and I must have cut'n'pasted it throughout? ... Will fix because I can't think up what I meant by enacap at the moment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll do this in another rename patch seeing its started earlier in the tree.

}

authkey, err := decodeIPSecKey(s[1])
spiI, err := strconv.Atoi(s[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is spiI? ID, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here spi is security parameter index and I fell into hungarian naming convention it seems just to indicate that spiI is the integer version here giving spiI. We do need to be a bit careful here because the spi is shifted into the mark field and if we shift a uint8 << 12 well there is nothing left by C standards.

Copy link
Member

@borkmann borkmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With mentioned cleanups addressed, LGTM.

@jrfastab
Copy link
Contributor Author

@joestringer @borkmann comments addressed take a look. I'll have a cleanup patch series to do renaming of functions (per comment that came from earlier) and a bit of commentary to show file layouts. Note once CI issue is resolved and CI PR is pushed we will get an example format in test directory but I agree nice to have inline in code as well.

@jrfastab
Copy link
Contributor Author

test-me-please

@jrfastab jrfastab requested a review from joestringer March 22, 2019 03:58
@tgraf tgraf added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 22, 2019
It is possible that ipv6 updates may fail due to missing kernel
modules or iptable6 filters. In this case we still want the ipv4
rules to be in place so reorder ipv4 before ipv6 setup.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Currently, OUT xfrm rules use full (src,dst,spi) tuple. The original
thinking on this was that we wanted to ensure matches only on relavent
IP addresses. However now both state and policy are further restricted
by mark values we can drop the src piece without worrying about having
unintended matches.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Currently, rolling updates may get stuck due to a time period between
when some set of nodes have started with encryption enabled and another
set exist without encryption enabled. At this point these two sets of
nodes can only communicate from non-encrypted to encrypted set. The
set with encryption enabled will encrypt traffic that will in turn be
dropped by the set that has yet to enable encryption.

To resolve we make encryption a property of the endpoint id. Keeping
the key identifier with the endpoint to inform cilium which key
should be used during an upgrade. Because we use the mark space
to identify keys we limit the key space to two keys currently.

After this key secrets will need to be updated to include an id
field as follows,

  keys: "1 hmac(sha256) 0123456789abcdef0123456789abcdef cbc(aes) 0123456789abcdef0123456789abcdef"

Where '1' is the id here. IDs are enforced to be less than 16. This
is a bit arbitrary we could go as high as 256 without hitting mark
bit limits. However, 16 feels sufficient and we cant take bits back
later so start low and bump up if needed. The id '0' is a special id
and should not be used it represents the absence of keys in the
datapath.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
After a timeout delete xfrm rules not matching current version. In
this way we can perform non-disruptive updates to the version. This
still requires a cilium restart however until we attach an inotify
event to the file.

xfrm state cleanup is still an over-approximation because states do
not use mark values yet. A follow up series can clean this up.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
@jrfastab
Copy link
Contributor Author

test-me-please

@jrfastab jrfastab removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/datapath Impacts bpf/ or low-level forwarding details, including map management and monitor messages. release-note/major This PR introduces major new functionality to Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants