pkg/fswatcher: Abstract away fsnotify in tests #37909

glibsm · 2025-02-27T22:51:52Z

Replace the filesystem setup in tests with a channel of events, effectively mocking out the fsnotify functionality.

A number of tests were flaky and even with the best of intentions on a local machine it was not sending a reliable number of notifications per action. For example, a simple create and write to a file would emit either 1 or 2 events even with some delay between the actions.

This is an attempt to deterministically stabilize the test to cover some critical functionality while leaving fsnotify to test it's own internals.

Fixes #25666

Replace the filesystem setup in tests with a channel of events, effectively mocking out the fsnotify functionality. A number of tests were flaky and even with the best of intentions on a local machine it was not sending a reliable number of notifications per action. For example, a simple create and write to a file would emit either 1 or 2 events even with some delay between the actions. This is an attempt to deterministically stabilize the test to cover some critical functionality while leaving fsnotify to test it's own internals. Signed-off-by: Glib Smaga <code@gsmaga.com>

rolinh · 2025-02-28T10:22:50Z

I am not sure the updated tests actually test what matters now. One of the goal of the tests was to test symlink indirection where a symlink points to another symlink and we're not testing that anymore. This is important because this is typically how files appear when they are mounted in a container (at least from what I remember). Effectively, this means a regression in fsnotify could break the TLS certificate hot reloading capabilities we built which can have some pretty bad consequences on a cluster. So on the one hand, it's good that we tackle a flaky test, on the other we're not testing what really matters now.

glibsm · 2025-02-28T20:01:30Z

I am not sure the updated tests actually test what matters now. One of the goal of the tests was to test symlink indirection where a symlink points to another symlink and we're not testing that anymore. This is important because this is typically how files appear when they are mounted in a container (at least from what I remember). Effectively, this means a regression in fsnotify could break the TLS certificate hot reloading capabilities we built which can have some pretty bad consequences on a cluster. So on the one hand, it's good that we tackle a flaky test, on the other we're not testing what really matters now.

It's trivial to restore that part with the current setup, however that is the most flaky part of that test because sometimes you get even no events at all when creating a file and adding a symlink to it (no I don't know the reasons).

Also, it's not really clear what was being tested previously. It only verified that the event name changed, without validating what the events actually were.

I can try to put symlink back and add delays between each fs operations, but that will still result in flakes for this test.

gandro · 2025-03-03T11:41:20Z

pkg/fswatcher/fswatcher_test.go

-	require.NoError(t, err)
-	assertEventName(nestedFile)
-
-	// symlink $tmp/foo/symlink -> $tmp/foo/bar (this will emit an event on indirectSymlink)


As Robin pointed out, emulating the sequence of events emitted by changes to symlinks is something we really want to test, since this is the whole point of we we have fswatcher in the first place - to deal with the spooky action at a distance that K8s' symlink-based volume mount scheme (documented here) emits.

The new test for example doesn't seem to emit any RENAME events, for which the fswatcher package has specific logic for.

As to mocking: I'm fine with mocking the events received by fsnotify to make the tests more deterministic, since I don't expect the fsnotify sequence to change much. But the sequence of events that we mock really should emulate the sequence of events we expect from a K8s volume mount update, which the old test attempted to emulate. There is some more examples on the intricacies here:

#14565 (comment)

My suggestion therefore:
Let's record the generated fsnotify.Event events emitted by the file operations done in this old test (which should also contain some RENAME and CHMOD events for example) then let's them using the func(evs chan<- fsnotify.Event) mechanism that you introduced in the new framework.

This way we get both: Coverage for the intricate sequence of events (with all the CHMOD and RENAME events) we expect K8s to produce on volume map changes, but also deterministic tests that hopefully are no longer flaky.

Let's record the generated fsnotify.Event events emitted by the file operations done in this old test (which should also contain some RENAME and CHMOD events for example) then let's them using the func(evs chan<- fsnotify.Event) mechanism that you introduced in the new framework.

I like the idea, probably the best direction to go to stabilize the test.

However, those events are not consistent (the root cause for flakiness in the test). Which version of them do we take? Do I put a bunch of sleeps between the fs operations in the test and try to gather the happy case scenario of what we would actually expect?

Sometimes fsnotify doesn't even issue a file create event for example.

Sometimes fsnotify doesn't even issue a file create event for example.

Yes it's inconsistent and if you're testing under OSX you probably get a different set of events (kqueue vs inotify). I quickly went through the fsnotify repo and there is one release per year, people wondering whether it is still maintained, and also some very scary issues.

I like the idea, probably the best direction to go to stabilize the test.

The test yes, the code I'm not so sure. If we can't get consistent events from fsnotify in a controlled environment like testing, then I don't see how we could get consistency once deployed. If our (tested) code is robust but fsnotify is flacky, CI will be happy but we still have an issue (only now hidden by our CI — arguably worst than the status quo).

Proposition: let's get rid of fsnotify. fswatcher is a complex logic to handle the k8s specific symlink dance on top of a flacky maybe-unmaintained go module. Instead, poll the files every ten seconds or so (less in testing). In the process, we can rationalize the pattern as we already have it implemented at least twice for dynamic flow export configuration and dynamic metrics configuration (the latter check the hash of the file which we could generalize).

What do you think?

Proposition: let's get rid of fsnotify. fswatcher is a complex logic to handle the k8s specific symlink dance on top of a flacky maybe-unmaintained go module. Instead, poll the files every ten seconds or so (less in testing). In the process, we can rationalize the pattern as we already have it implemented at least twice for dynamic flow export configuration and dynamic metrics configuration (the latter check the hash of the file which we could generalize).

What do you think?

I'm a fan of this approach, and have successfully implemented it in the past in another project instead of relying on system events to notify me of the changes.

I will try to remove the fsnotify dep from fswatcher, not sure if in this PR or open a new one.

There are some other packages not just fswatcher that rely on fsnotify:

❯ ag fsnotify/fsnotify --ignore vendor --ignore go.mod --ignore go.sum -l daemon/cmd/cni/config.go pkg/clustermesh/common/config.go pkg/ipmasq/ipmasq.go pkg/datapath/linux/ipsec/ipsec_linux.go pkg/fswatcher/fswatcher_test.go pkg/fswatcher/fswatcher.go pkg/policy/directory/watcher.go

Perhaps we can start with just fswatcher but encapsulate the functionality into a watcher of some kind that can be easily re-used elsewhere with a different path?

I like the approach of polling too!

How about we do the "replay events" solution in this PR to stabilize the test and open a new issue to actually replace fsnotify ?

How about we do the "replay events" solution in this PR to stabilize the test and open a new issue to actually replace fsnotify ?

Works for me!

glibsm · 2025-03-26T20:43:10Z

Opened a new PR which implements what we talked about in regards to replacing fsnotify with interval polling.

glibsm added area/CI Continuous Integration testing issue or flake release-note/misc This PR makes changes that have no direct user impact. labels Feb 27, 2025

glibsm requested review from a team as code owners February 27, 2025 22:51

glibsm requested review from aspsk and rolinh February 27, 2025 22:51

rolinh added sig/hubble area/hubble Impacts hubble server or relay labels Feb 28, 2025

kaworu requested a review from gandro March 3, 2025 10:54

gandro reviewed Mar 3, 2025

View reviewed changes

glibsm mentioned this pull request Mar 26, 2025

pkg/fswatcher: Rewrite without underlying use of fsnotify #38537

Merged

glibsm closed this Mar 26, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

pkg/fswatcher: Abstract away fsnotify in tests #37909

pkg/fswatcher: Abstract away fsnotify in tests #37909

Uh oh!

glibsm commented Feb 27, 2025

Uh oh!

rolinh commented Feb 28, 2025

Uh oh!

glibsm commented Feb 28, 2025

Uh oh!

gandro Mar 3, 2025 •

edited

Loading

Uh oh!

glibsm Mar 3, 2025

Uh oh!

kaworu Mar 4, 2025

Uh oh!

glibsm Mar 5, 2025

Uh oh!

gandro Mar 6, 2025

Uh oh!

jelonka Mar 12, 2025 •

edited

Loading

Uh oh!

kaworu Mar 12, 2025

Uh oh!

glibsm commented Mar 26, 2025

Uh oh!

Uh oh!

pkg/fswatcher: Abstract away fsnotify in tests #37909

pkg/fswatcher: Abstract away fsnotify in tests #37909

Uh oh!

Conversation

glibsm commented Feb 27, 2025

Uh oh!

rolinh commented Feb 28, 2025

Uh oh!

glibsm commented Feb 28, 2025

Uh oh!

gandro Mar 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

glibsm Mar 3, 2025

Choose a reason for hiding this comment

Uh oh!

kaworu Mar 4, 2025

Choose a reason for hiding this comment

Uh oh!

glibsm Mar 5, 2025

Choose a reason for hiding this comment

Uh oh!

gandro Mar 6, 2025

Choose a reason for hiding this comment

Uh oh!

jelonka Mar 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kaworu Mar 12, 2025

Choose a reason for hiding this comment

Uh oh!

glibsm commented Mar 26, 2025

Uh oh!

Uh oh!

gandro Mar 3, 2025 •

edited

Loading

jelonka Mar 12, 2025 •

edited

Loading