Skip to content

Conversation

joestringer
Copy link
Member

Since merging clustermesh and kvstoremesh together, the code was still
quite separate, registering multiple Hives and duplicate Cells (+
duplicate configuration). This could cause some options to be not
responsive to user configuration, as well as just unnecessarily running
extra instances of core logic that is not necessary to run - for
instance, two pprof or two gops servers. This may also fix issues where
some clustermesh controllers do not appear in debug status output.

Consolidate these into a single Hive with all the common cells, and
expose a new 'clustermesh-apiserver hive' command to inspect the
hierarchy.

@joestringer joestringer requested a review from a team as a code owner February 13, 2025 16:30
@joestringer joestringer added the release-note/misc This PR makes changes that have no direct user impact. label Feb 13, 2025
@joestringer joestringer requested a review from giorio94 February 13, 2025 16:30
@joestringer joestringer force-pushed the pr/joe/clustermesh-hive branch from 6f022a2 to ae64045 Compare February 13, 2025 18:48
@joestringer
Copy link
Member Author

/test

@joestringer joestringer marked this pull request as draft February 13, 2025 19:39
@giorio94
Copy link
Member

Perhaps I'm missing some context about this change, but it is intentional that the clustermesh/kvstoremesh hives are separate, as each is run by its own individual sub-command (in turn executed in two separate containers, if both are enabled at the same time). I don't think there's the risk that any option is not responsive, because they are registered to the corresponding sub-command, as well as only one instance of gops/pprof and alike is guaranteed to be executed at any time depending on which command is executed.

It could be theoretically possible to consolidate everything into a single command, but I'm not sure that would have a lot of benefits, at the cost of worse separation of concerns (e.g., only one needs to access the k8s api-server, etcd clients are separate so that rate limiting of one doesn't impact the other and so on). It would also require a quite significant refactoring to allow enabling/disabling the different components independently, while that currently achieved via the separate sub-commands.

@joestringer
Copy link
Member Author

OK thanks a lot for that context. I haven't fully formed my thoughts on what this should look like, but here's the background I'm coming from:

  • We previously had a bunch of bugs in flag handling because the same flag was registered more than once in a single binary. I've been combing through the codebase to find patterns that could cause this to happen and fix them.
  • To help with the above, I tried to add a hive subcommand similar to what we have for other binaries in the tree so I could inspect the hive for this binary.
  • I noticed that there are multiple hives declared in this single binary. We generally expect there to be one Hive and for it to be static, so it is easy to determine right from initial bootstrap the full set of Cells. Given this wasn't the case, I tried combining the Hive so I could inspect it, and followed the rabbit hole from there.
  • As part of this I realized that the following Cells are common across both commands:
      	cell.Config(option.DefaultLegacyClusterMeshConfig),
      	cell.Config(cmtypes.DefaultClusterInfo),
      	cell.Config(pprofConfig),
    
      	pprof.Cell,
      	gops.Cell(defaults.EnableGops, defaults.GopsPortApiserver),
    
      	health.HealthAPIServerCell,
    
      	cmmetrics.Cell,
      	controller.Cell,
      	kvstore.Cell,
      	store.Cell,
    
      	cell.Provide(func(ss syncstate.SyncState) *kvstore.ExtraOptions {
      		return &kvstore.ExtraOptions{
      			BootstrapComplete: ss.WaitChannel(),
      		}
      	}),
    

I think that largely leads us to this PR.

Looking forward:

  1. I don't think there is currently a problem with overlapping flags in this binary.
  2. I'm not sure if it's worth tweaking the top-level usage of hives. I could maybe look into creating two Hive subcommands to inspect the individual "commands" rather than one top-level Hive.
  3. It might simplify the code a bit if the common Cells listed above are combined into a "clustermesh-common" Cell which could then be imported once from both commands, so the lists of common Cells are in one place.

More than anything, this gave me an excuse to poke around in the top level of this command to try to understand the functionality better, so that goal of mine is also achieved. Another alternative option going forward is just for me to abandon this PR.

@giorio94
Copy link
Member

Thanks a lot for the context, makes sense! And obviously feel free to ask if you have questions about this binary or clustermesh in general.

  • To help with the above, I tried to add a hive subcommand similar to what we have for other binaries in the tree so I could inspect the hive for this binary.

The commands to inspect the hive are already available, although as part of the individual sub-commands (not at the root level):

  • $ gor clustermesh-apiserver/main.go clustermesh hive
  • $ gor clustermesh-apiserver/main.go kvstoremesh hive

Essentially, the root of the clustermesh-apiserver is just a container for the other commands, and it does nothing by itself (besides printing the help message).

I noticed that there are multiple hives declared in this single binary. We generally expect there to be one Hive and for it to be static, so it is easy to determine right from initial bootstrap the full set of Cells. Given this wasn't the case, I tried combining the Hive so I could inspect it, and followed the rabbit hole from there.

Yes, the clustermesh-apiserver is a bit special in this respect as it bundles multiple independent functionalities as separate sub-commands, essentially to avoid requiring separate containers/binaries that have already proved problematic in the past. The hive configuration is still static as for e.g., the Cilium agent, but there are actually two, one for each core command (the other commands are instead more trivial and implemented the old way).

As part of this I realized that the following Cells are common across both commands: [...]

Yep, these could be grouped together in a dedicated module and imported by both. It wouldn't change anything from the functional point of view, but I agree it would remove a bit of code duplication.

@joestringer
Copy link
Member Author

OK, I'll refocus this PR on just creating the common Cell in that case - I think we'll get most of the cleanup benefits from that alone. I'm still not 100% confident about my statement "[there's no] problem with overlapping flags in this binary." given the way Hive is used in these binaries, but I have an idea about how to prove that out.

@joestringer joestringer force-pushed the pr/joe/clustermesh-hive branch from ae64045 to 69ed03f Compare February 14, 2025 22:38
@joestringer
Copy link
Member Author

/ci-clustermesh

@joestringer
Copy link
Member Author

/test

Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking care of this. A couple of comments inline.

@joestringer joestringer marked this pull request as draft February 18, 2025 16:46
auto-merge was automatically disabled February 18, 2025 16:46

Pull request was converted to draft

@joestringer joestringer force-pushed the pr/joe/clustermesh-hive branch from 69ed03f to 5759727 Compare February 18, 2025 22:18
@joestringer
Copy link
Member Author

/test

There are several components which rely on the pprof Cell, with
different configuration parameters for each. In order to improve code
reuse, require the pprof configuration as a parameter for the Cell and
expect the caller to define the default configuration for those flags.

Signed-off-by: Joe Stringer <joe@cilium.io>
This function has been unused since the introduction of the CRD sync
promise. Remove it as it is no longer needed.

Fixes: d8102b9 ("k8s/synced: Add CRDSync promise")
Signed-off-by: Joe Stringer <joe@cilium.io>
This flag was previously declared in two places for the agent and
clustermesh-apiserver to define the configuration, however it's used
only from this synced package. Move the flag configuration into this
package to consolidate the declaration into one place for better code
reuse.

The agent hive command reference here presumably changes because the
flag was migrated from the traditional global flags reference over into
the Hive, so the Hive infrastructure is now aware of the flag.

Signed-off-by: Joe Stringer <joe@cilium.io>
These "legacy" configurations are common across both
clustermesh/kvstoremesh binaries, so we can reuse the same flag
declarations for both. While we're at it, explicitly define the default
configuration for the log options.

Signed-off-by: Joe Stringer <joe@cilium.io>
An upcoming change will move this into a common package. In preparation,
move this out to the top level so that the diff in the subsequent commit
is clearer to follow. No logical changes here.

Signed-off-by: Joe Stringer <joe@cilium.io>
Several of the core Cells that are reused by clustermesh and kvstoremesh
are declared separately. In order to simplify the logic, this commit
moves those common Cells into a single declaration at the top level and
directly passes that common Cell into each Hive when it is created on
startup.

The gops,pprof cells are kept separately declared in order to retain the
different port configuration to avoid conflicts when running on the same
IP. The Health API server is centralized, but the health endpoints APIs
are kept separate in order to allow them to diverge in future (though
they are currently identical).

Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer joestringer force-pushed the pr/joe/clustermesh-hive branch from 5759727 to 15eb083 Compare February 18, 2025 23:15
@joestringer
Copy link
Member Author

/ci-clustermesh

@joestringer
Copy link
Member Author

/test

@joestringer joestringer marked this pull request as ready for review February 19, 2025 01:49
@joestringer joestringer requested review from a team as code owners February 19, 2025 01:49
Copy link
Member

@giorio94 giorio94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Member

@dylandreimerink dylandreimerink left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a nice change. Can't find anything wrong with this

@joestringer joestringer added this pull request to the merge queue Feb 25, 2025
@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label Feb 25, 2025
Merged via the queue into main with commit e0e397e Feb 25, 2025
303 checks passed
@joestringer joestringer deleted the pr/joe/clustermesh-hive branch February 25, 2025 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants