Skip to content

Conversation

glibsm
Copy link
Member

@glibsm glibsm commented Feb 27, 2025

Replace the filesystem setup in tests with a channel of events, effectively mocking out the fsnotify functionality.

A number of tests were flaky and even with the best of intentions on a local machine it was not sending a reliable number of notifications per action. For example, a simple create and write to a file would emit either 1 or 2 events even with some delay between the actions.

This is an attempt to deterministically stabilize the test to cover some critical functionality while leaving fsnotify to test it's own internals.

Fixes #25666

Replace the filesystem setup in tests with a channel of
events, effectively mocking out the fsnotify functionality.

A number of tests were flaky and even with the best of
intentions on a local machine it was not sending a reliable
number of notifications per action. For example, a simple
create and write to a file would emit either 1 or 2 events
even with some delay between the actions.

This is an attempt to deterministically stabilize the test
to cover some critical functionality while leaving fsnotify
to test it's own internals.

Signed-off-by: Glib Smaga <code@gsmaga.com>
@glibsm glibsm added area/CI Continuous Integration testing issue or flake release-note/misc This PR makes changes that have no direct user impact. labels Feb 27, 2025
@glibsm glibsm requested review from a team as code owners February 27, 2025 22:51
@glibsm glibsm requested review from aspsk and rolinh February 27, 2025 22:51
@rolinh
Copy link
Member

rolinh commented Feb 28, 2025

I am not sure the updated tests actually test what matters now. One of the goal of the tests was to test symlink indirection where a symlink points to another symlink and we're not testing that anymore. This is important because this is typically how files appear when they are mounted in a container (at least from what I remember). Effectively, this means a regression in fsnotify could break the TLS certificate hot reloading capabilities we built which can have some pretty bad consequences on a cluster. So on the one hand, it's good that we tackle a flaky test, on the other we're not testing what really matters now.

@rolinh rolinh added sig/hubble area/hubble Impacts hubble server or relay labels Feb 28, 2025
@glibsm
Copy link
Member Author

glibsm commented Feb 28, 2025

I am not sure the updated tests actually test what matters now. One of the goal of the tests was to test symlink indirection where a symlink points to another symlink and we're not testing that anymore. This is important because this is typically how files appear when they are mounted in a container (at least from what I remember). Effectively, this means a regression in fsnotify could break the TLS certificate hot reloading capabilities we built which can have some pretty bad consequences on a cluster. So on the one hand, it's good that we tackle a flaky test, on the other we're not testing what really matters now.

It's trivial to restore that part with the current setup, however that is the most flaky part of that test because sometimes you get even no events at all when creating a file and adding a symlink to it (no I don't know the reasons).

Also, it's not really clear what was being tested previously. It only verified that the event name changed, without validating what the events actually were.

I can try to put symlink back and add delays between each fs operations, but that will still result in flakes for this test.

@kaworu kaworu requested a review from gandro March 3, 2025 10:54
require.NoError(t, err)
assertEventName(nestedFile)

// symlink $tmp/foo/symlink -> $tmp/foo/bar (this will emit an event on indirectSymlink)
Copy link
Member

@gandro gandro Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As Robin pointed out, emulating the sequence of events emitted by changes to symlinks is something we really want to test, since this is the whole point of we we have fswatcher in the first place - to deal with the spooky action at a distance that K8s' symlink-based volume mount scheme (documented here) emits.

The new test for example doesn't seem to emit any RENAME events, for which the fswatcher package has specific logic for.

As to mocking: I'm fine with mocking the events received by fsnotify to make the tests more deterministic, since I don't expect the fsnotify sequence to change much. But the sequence of events that we mock really should emulate the sequence of events we expect from a K8s volume mount update, which the old test attempted to emulate. There is some more examples on the intricacies here:

#14565 (comment)

My suggestion therefore:
Let's record the generated fsnotify.Event events emitted by the file operations done in this old test (which should also contain some RENAME and CHMOD events for example) then let's them using the func(evs chan<- fsnotify.Event) mechanism that you introduced in the new framework.

This way we get both: Coverage for the intricate sequence of events (with all the CHMOD and RENAME events) we expect K8s to produce on volume map changes, but also deterministic tests that hopefully are no longer flaky.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's record the generated fsnotify.Event events emitted by the file operations done in this old test (which should also contain some RENAME and CHMOD events for example) then let's them using the func(evs chan<- fsnotify.Event) mechanism that you introduced in the new framework.

I like the idea, probably the best direction to go to stabilize the test.

However, those events are not consistent (the root cause for flakiness in the test). Which version of them do we take? Do I put a bunch of sleeps between the fs operations in the test and try to gather the happy case scenario of what we would actually expect?

Sometimes fsnotify doesn't even issue a file create event for example.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes fsnotify doesn't even issue a file create event for example.

Yes it's inconsistent and if you're testing under OSX you probably get a different set of events (kqueue vs inotify). I quickly went through the fsnotify repo and there is one release per year, people wondering whether it is still maintained, and also some very scary issues.

I like the idea, probably the best direction to go to stabilize the test.

The test yes, the code I'm not so sure. If we can't get consistent events from fsnotify in a controlled environment like testing, then I don't see how we could get consistency once deployed. If our (tested) code is robust but fsnotify is flacky, CI will be happy but we still have an issue (only now hidden by our CI — arguably worst than the status quo).

Proposition: let's get rid of fsnotify. fswatcher is a complex logic to handle the k8s specific symlink dance on top of a flacky maybe-unmaintained go module. Instead, poll the files every ten seconds or so (less in testing). In the process, we can rationalize the pattern as we already have it implemented at least twice for dynamic flow export configuration and dynamic metrics configuration (the latter check the hash of the file which we could generalize).

What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposition: let's get rid of fsnotify. fswatcher is a complex logic to handle the k8s specific symlink dance on top of a flacky maybe-unmaintained go module. Instead, poll the files every ten seconds or so (less in testing). In the process, we can rationalize the pattern as we already have it implemented at least twice for dynamic flow export configuration and dynamic metrics configuration (the latter check the hash of the file which we could generalize).

What do you think?

I'm a fan of this approach, and have successfully implemented it in the past in another project instead of relying on system events to notify me of the changes.

I will try to remove the fsnotify dep from fswatcher, not sure if in this PR or open a new one.

There are some other packages not just fswatcher that rely on fsnotify:

❯ ag fsnotify/fsnotify --ignore vendor --ignore go.mod --ignore go.sum -l
daemon/cmd/cni/config.go
pkg/clustermesh/common/config.go
pkg/ipmasq/ipmasq.go
pkg/datapath/linux/ipsec/ipsec_linux.go
pkg/fswatcher/fswatcher_test.go
pkg/fswatcher/fswatcher.go
pkg/policy/directory/watcher.go

Perhaps we can start with just fswatcher but encapsulate the functionality into a watcher of some kind that can be easily re-used elsewhere with a different path?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the approach of polling too!

Copy link

@jelonka jelonka Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we do the "replay events" solution in this PR to stabilize the test and open a new issue to actually replace fsnotify ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about we do the "replay events" solution in this PR to stabilize the test and open a new issue to actually replace fsnotify ?

Works for me!

@glibsm
Copy link
Member Author

glibsm commented Mar 26, 2025

Opened a new PR which implements what we talked about in regards to replacing fsnotify with interval polling.

@glibsm glibsm closed this Mar 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/CI Continuous Integration testing issue or flake area/hubble Impacts hubble server or relay release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI: github.com/cilium/cilium/pkg/fswatcher timeout panic
5 participants