Skip to content

Conversation

jrajahalme
Copy link
Member

@jrajahalme jrajahalme commented Mar 24, 2021

Add Cilium daemon option --tofqdns-idle-connection-grace-period to allow previously active connections to keep a DNS name/IP mapping alive for future connections for a user-defined duration (default 0s).

Keep latest expiry time for DNS Zombie entries that result from multiple DNS results.

Add the domain names to the DNS GC info-level log message.

Added a new daemon option `--tofqdns-idle-connection-grace-period`.

@jrajahalme jrajahalme added area/daemon Impacts operation of the Cilium daemon. release-note/minor This PR changes functionality that users may find relevant to operating Cilium. labels Mar 24, 2021
@jrajahalme jrajahalme requested review from a team March 24, 2021 22:59
@jrajahalme jrajahalme requested review from a team as code owners March 24, 2021 22:59
@jrajahalme jrajahalme requested review from nebril, kkourt and twpayne March 24, 2021 22:59
@jrajahalme jrajahalme requested a review from aditighag March 24, 2021 22:59
@jrajahalme jrajahalme marked this pull request as draft March 24, 2021 23:00
@jrajahalme
Copy link
Member Author

test-me-please

@jrajahalme jrajahalme marked this pull request as ready for review March 24, 2021 23:11
@jrajahalme
Copy link
Member Author

Had to make the default non-zero to have the command reference show the default value, so made it "60s", which prints in command reference as "1m0s".

@jrajahalme
Copy link
Member Author

test-me-please

@jrajahalme jrajahalme requested review from a team as code owners March 25, 2021 04:07
@jrajahalme jrajahalme requested a review from joestringer March 25, 2021 04:07
@jrajahalme
Copy link
Member Author

test-me-please

@jrajahalme
Copy link
Member Author

test-docs-please

@jrajahalme
Copy link
Member Author

jrajahalme commented Mar 25, 2021

Four design considerations:

  1. The new --tofqdns-idle-connection-grace-period option takes effect only if the source POD opens a connection during the (possibly extended, minimum 1 hour by default) DNS TTL period. When conntrack GC finds this connection it adds the new grace period to the CT GC start time, and uses it as the DNS Zombie AliveAt time, causing the DNS Zombie GC to consider the IP/name mapping alive up to this point. Further CT GC rounds will then extend this AliveAt time even further so that finally when the connection is GCd from conntrack, that timestamp + the grace period is the (so far) final AliveAt time for the IP/name mapping, allowing new connections to be opened during this time even after all the old connections have been terminated, upto the length of the grace period. Alternatively, we could have instructed users to simply supply a longer --tofqdns-min-ttl value to extend the TTL of all DNS responses.
  2. The value of the new command line option is formatted as Go time.Duration string representation. A plain number is rejected and causes Cilium agent to not start. Some of the existing FQDN options also use this format, but others use an int interpreted as seconds. Maybe an int as seconds is a better choice?
  3. The default value of the new option is now "1m0s" (or "60s"), so this changes the behavior of all DNS zombie entries. This is likely good for CI coverage, but may be undesirable for production use. I first tried to use default of "0s", but then the command reference would not show the default value at all, but we can include the 0 default in the usage text itself.
  4. The new option is really about new connections opened after old ones, not about "idle" connections. Conntrack keeps idle connections in the tables already, right? So maybe the new option is misnamed?

Different DNS queries may have different expiry times. Keep that
latest/furthest expiry time instead of blindly overwriting the old
time with a new one when updating an existing entry. This makes the
combined expiry time of the zombie entry not depend on the order in
which the queries were processed.

Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
Signed-off-by: Jarno Rajahalme <jarno@covalent.io>
@jrajahalme
Copy link
Member Author

test-me-please

@jrajahalme
Copy link
Member Author

Changed the default of the new option to 0s.

@pchaigno pchaigno added the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 25, 2021
@pchaigno
Copy link
Member

pchaigno commented Mar 25, 2021

@jrajahalme I believe a rebase will fix the failing tests. It's currently failing because Jenkins is executing the new test code with the old agent code (can be seen by looking at the flag, the new flags are missing from the agent logs).

EDIT: Nevermind, you just did rebase :-) They likely will pass now.

@pchaigno pchaigno removed the dont-merge/needs-rebase This PR needs to be rebased because it has merge conflicts. label Mar 25, 2021
@jrajahalme
Copy link
Member Author

Known flake #13071 on test-1.20-4.9

@jrajahalme
Copy link
Member Author

test-1.20-4.9

@jrajahalme
Copy link
Member Author

Looks like a test flake on ConformanceKind1.19 / installation-and-connectivity: https://github.com/cilium/cilium/pull/15458/checks?check_run_id=2197348595

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon Impacts operation of the Cilium daemon. release-note/minor This PR changes functionality that users may find relevant to operating Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants