-
Notifications
You must be signed in to change notification settings - Fork 3.4k
pkg/fswatcher: Abstract away fsnotify in tests #37909
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Replace the filesystem setup in tests with a channel of events, effectively mocking out the fsnotify functionality. A number of tests were flaky and even with the best of intentions on a local machine it was not sending a reliable number of notifications per action. For example, a simple create and write to a file would emit either 1 or 2 events even with some delay between the actions. This is an attempt to deterministically stabilize the test to cover some critical functionality while leaving fsnotify to test it's own internals. Signed-off-by: Glib Smaga <code@gsmaga.com>
I am not sure the updated tests actually test what matters now. One of the goal of the tests was to test symlink indirection where a symlink points to another symlink and we're not testing that anymore. This is important because this is typically how files appear when they are mounted in a container (at least from what I remember). Effectively, this means a regression in fsnotify could break the TLS certificate hot reloading capabilities we built which can have some pretty bad consequences on a cluster. So on the one hand, it's good that we tackle a flaky test, on the other we're not testing what really matters now. |
It's trivial to restore that part with the current setup, however that is the most flaky part of that test because sometimes you get even no events at all when creating a file and adding a symlink to it (no I don't know the reasons). Also, it's not really clear what was being tested previously. It only verified that the event name changed, without validating what the events actually were. I can try to put symlink back and add delays between each fs operations, but that will still result in flakes for this test. |
require.NoError(t, err) | ||
assertEventName(nestedFile) | ||
|
||
// symlink $tmp/foo/symlink -> $tmp/foo/bar (this will emit an event on indirectSymlink) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As Robin pointed out, emulating the sequence of events emitted by changes to symlinks is something we really want to test, since this is the whole point of we we have fswatcher
in the first place - to deal with the spooky action at a distance that K8s' symlink-based volume mount scheme (documented here) emits.
The new test for example doesn't seem to emit any RENAME events, for which the fswatcher package has specific logic for.
As to mocking: I'm fine with mocking the events received by fsnotify to make the tests more deterministic, since I don't expect the fsnotify sequence to change much. But the sequence of events that we mock really should emulate the sequence of events we expect from a K8s volume mount update, which the old test attempted to emulate. There is some more examples on the intricacies here:
My suggestion therefore:
Let's record the generated fsnotify.Event
events emitted by the file operations done in this old test (which should also contain some RENAME and CHMOD events for example) then let's them using the func(evs chan<- fsnotify.Event)
mechanism that you introduced in the new framework.
This way we get both: Coverage for the intricate sequence of events (with all the CHMOD and RENAME events) we expect K8s to produce on volume map changes, but also deterministic tests that hopefully are no longer flaky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's record the generated fsnotify.Event events emitted by the file operations done in this old test (which should also contain some RENAME and CHMOD events for example) then let's them using the func(evs chan<- fsnotify.Event) mechanism that you introduced in the new framework.
I like the idea, probably the best direction to go to stabilize the test.
However, those events are not consistent (the root cause for flakiness in the test). Which version of them do we take? Do I put a bunch of sleeps between the fs operations in the test and try to gather the happy case scenario of what we would actually expect?
Sometimes fsnotify doesn't even issue a file create event for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes fsnotify doesn't even issue a file create event for example.
Yes it's inconsistent and if you're testing under OSX you probably get a different set of events (kqueue vs inotify). I quickly went through the fsnotify repo and there is one release per year, people wondering whether it is still maintained, and also some very scary issues.
I like the idea, probably the best direction to go to stabilize the test.
The test yes, the code I'm not so sure. If we can't get consistent events from fsnotify in a controlled environment like testing, then I don't see how we could get consistency once deployed. If our (tested) code is robust but fsnotify is flacky, CI will be happy but we still have an issue (only now hidden by our CI — arguably worst than the status quo).
Proposition: let's get rid of fsnotify. fswatcher is a complex logic to handle the k8s specific symlink dance on top of a flacky maybe-unmaintained go module. Instead, poll the files every ten seconds or so (less in testing). In the process, we can rationalize the pattern as we already have it implemented at least twice for dynamic flow export configuration and dynamic metrics configuration (the latter check the hash of the file which we could generalize).
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Proposition: let's get rid of fsnotify. fswatcher is a complex logic to handle the k8s specific symlink dance on top of a flacky maybe-unmaintained go module. Instead, poll the files every ten seconds or so (less in testing). In the process, we can rationalize the pattern as we already have it implemented at least twice for dynamic flow export configuration and dynamic metrics configuration (the latter check the hash of the file which we could generalize).
What do you think?
I'm a fan of this approach, and have successfully implemented it in the past in another project instead of relying on system events to notify me of the changes.
I will try to remove the fsnotify dep from fswatcher, not sure if in this PR or open a new one.
There are some other packages not just fswatcher that rely on fsnotify:
❯ ag fsnotify/fsnotify --ignore vendor --ignore go.mod --ignore go.sum -l
daemon/cmd/cni/config.go
pkg/clustermesh/common/config.go
pkg/ipmasq/ipmasq.go
pkg/datapath/linux/ipsec/ipsec_linux.go
pkg/fswatcher/fswatcher_test.go
pkg/fswatcher/fswatcher.go
pkg/policy/directory/watcher.go
Perhaps we can start with just fswatcher but encapsulate the functionality into a watcher of some kind that can be easily re-used elsewhere with a different path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the approach of polling too!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we do the "replay events" solution in this PR to stabilize the test and open a new issue to actually replace fsnotify ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we do the "replay events" solution in this PR to stabilize the test and open a new issue to actually replace fsnotify ?
Works for me!
Opened a new PR which implements what we talked about in regards to replacing fsnotify with interval polling. |
Replace the filesystem setup in tests with a channel of events, effectively mocking out the fsnotify functionality.
A number of tests were flaky and even with the best of intentions on a local machine it was not sending a reliable number of notifications per action. For example, a simple create and write to a file would emit either 1 or 2 events even with some delay between the actions.
This is an attempt to deterministically stabilize the test to cover some critical functionality while leaving fsnotify to test it's own internals.
Fixes #25666