Skip to content

Conversation

apolcyn
Copy link
Contributor

@apolcyn apolcyn commented Jan 7, 2021

The primary change here is to have grpc_dns_lookup_ares_locked return a unique_ptr<AresRequestInterface>.

Conversion of the remaining C-code in grpc_ares_wrapper.cc is left for more followups.


This change is Reviewable

@apolcyn apolcyn requested a review from markdroth as a code owner January 7, 2021 02:25
@apolcyn apolcyn added lang/core release notes: no Indicates if PR should not be in release notes labels Jan 7, 2021
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a step in the right direction!

Please let me know if you have any questions. Thanks!

Reviewed 11 of 11 files at r1.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @apolcyn)


src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc, line 104 at r1 (raw file):

  bool resolving_ = false;
  /// the pending resolving request
  std::unique_ptr<AresRequestInterface> pending_request_ = nullptr;

No need for = nullptr.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 47 at r1 (raw file):

namespace grpc_core {

class AresRequestInterface {

This doesn't really need to be an interface, since there's only ever one concrete implementation of it. If the goal here is just to hide the internal state, then let's just make all of the data members private (which they should be in a class anyway).


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 53 at r1 (raw file):

};

/* Asynchronously resolve \a name. Use \a default_port if a port isn't

Might as well switch to C++-style comments while we're at it.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 54 at r1 (raw file):

/* Asynchronously resolve \a name. Use \a default_port if a port isn't
   designated in \a name, otherwise use the port in \a name. grpc_ares_init()

Please update comments to refer to the new function names (e.g., AresInit() instead of grpc_ares_init()).


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 71 at r1 (raw file):

  being held by the caller. The returned AresRequestInterface object is owned
  by the caller and it is safe to destroy after on_done is called back. */
extern std::unique_ptr<AresRequestInterface> (*LookupAresLocked)(

Can this be a static method of AresRequestInterface? Or would that be too hard, given that it needs to be overridable in tests?


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 81 at r1 (raw file):

/* Initialize gRPC ares wrapper. Must be called at least once before
   grpc_resolve_address_ares(). */
grpc_error* AresInit(void);

Can these two be static methods of AresRequestInterface?


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 96 at r1 (raw file):

/* Exposed in this header for C-core tests only */
extern void (*AresTestOnlyInjectConfig)(ares_channel channel);

This can be in a nested internal namespace.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 67 at r1 (raw file):

struct grpc_ares_request : public AresRequestInterface {
  explicit grpc_ares_request(

No need for explicit here.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 581 at r1 (raw file):

                                             ServerAddressList* addresses) {
  if (GRPC_TRACE_FLAG_ENABLED(grpc_trace_cares_address_sorting)) {
    log_address_sorting_list(r, *addresses, "input");

Why move this from here into the callers? Seems like this adds duplication.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 1118 at r1 (raw file):

  if (resolve_as_ip_literal_locked(name, default_port, addrs)) {
    grpc_ares_complete_request_locked(r.get());
    return std::move(r);

No need for std::move() in a return statement. That happens automatically.

Same thing throughout.

Copy link
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc, line 104 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for = nullptr.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 47 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This doesn't really need to be an interface, since there's only ever one concrete implementation of it. If the goal here is just to hide the internal state, then let's just make all of the data members private (which they should be in a class anyway).

I just did that to avoid needing to declare classes in the header file, but I've updated to put the grpc_ares_request into the header (renamed into AresRequest).

I haven't yet made any data members of AresRequest private though, only since that would expand the scope of this PR a lot.

PLMK if you'd like to do that in this PR, though.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 53 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Might as well switch to C++-style comments while we're at it.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 54 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please update comments to refer to the new function names (e.g., AresInit() instead of grpc_ares_init()).

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 71 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Can this be a static method of AresRequestInterface? Or would that be too hard, given that it needs to be overridable in tests?

Right, the global method seems easiest to override for tests, so I'll tend to keep as is if OK.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 81 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Can these two be static methods of AresRequestInterface?

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 96 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This can be in a nested internal namespace.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 67 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for explicit here.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 581 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why move this from here into the callers? Seems like this adds duplication.

I moved this out since I wanted to log service addresses and grpclb addresses with slightly differently, but I've put this back into the sorting helper by adding a parameter to differentiate the logs.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 1118 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for std::move() in a return statement. That happens automatically.

Same thing throughout.

I had issues with a distrib test using an old gcc, which motivated this. I think that issue is no longer relevant though now that AresRequest isn't a derived class.;

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this!

Please let me know if you have any questions. Thanks!

Reviewed 8 of 8 files at r2.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @apolcyn)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 47 at r1 (raw file):

Previously, apolcyn wrote…

I just did that to avoid needing to declare classes in the header file, but I've updated to put the grpc_ares_request into the header (renamed into AresRequest).

I haven't yet made any data members of AresRequest private though, only since that would expand the scope of this PR a lot.

PLMK if you'd like to do that in this PR, though.

It does seem like it would be better to just convert the whole module to C++ in a single PR, if that's not too hard. But I can live with it being split up if you'd prefer; if so, just add a TODO comment here indicating that we plan to finish converting it to idiomatic C++ in a subsequent PR.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 49 at r2 (raw file):

typedef struct grpc_ares_ev_driver grpc_ares_ev_driver;

struct AresRequest {

Please document what this struct represents, how it's expected to be used, lifetime semantics, etc.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 61 at r2 (raw file):

  /// Initialize the gRPC ares wrapper. Must be called at least once before
  /// grpc_core::ResolveAddressAres().

No need for grpc_core:: in any of these comments, since we're already in that namespace.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 68 at r2 (raw file):

  /// wrapper only if it has been called the same number of times as
  /// grpc_core::AresInit().
  static void Cleanup(void);

Suggest calling this Shutdown(), for consistency with the term we use throughout core.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 103 at r2 (raw file):

/// designated in \a name, otherwise it uses the port in \a name. AresInit()
/// must be called at least once before this function. The returned
/// AresRequest object is owned by the caller and it is safe to destroy

Since the return value is a unique_ptr<>, we don't need to explicitly state that ownership is transferred; that's implicit in the API now.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 581 at r1 (raw file):

Previously, apolcyn wrote…

I moved this out since I wanted to log service addresses and grpclb addresses with slightly differently, but I've put this back into the sorting helper by adding a parameter to differentiate the logs.

Are there external callers of this API, or is it only exposed for testing purposes? If the former, then it might be better to just have the caller log a separate message with the context before calling this function.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

} grpc_ares_hostbyname_request;

static void AresRequest_ref_locked(AresRequest* r);

When we do finish converting this to C++, we should probably just make AresRequest inherit from InternallyRefCounted<>. (Internally, see go/grpc-c-core-ref-counted-types-update for expected usage.)

Copy link
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 8 of 11 files reviewed, 7 unresolved discussions (waiting on @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 47 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It does seem like it would be better to just convert the whole module to C++ in a single PR, if that's not too hard. But I can live with it being split up if you'd prefer; if so, just add a TODO comment here indicating that we plan to finish converting it to idiomatic C++ in a subsequent PR.

I've went ahead and converted the rest of the module to C++


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 49 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please document what this struct represents, how it's expected to be used, lifetime semantics, etc.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 61 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for grpc_core:: in any of these comments, since we're already in that namespace.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 68 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest calling this Shutdown(), for consistency with the term we use throughout core.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 103 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Since the return value is a unique_ptr<>, we don't need to explicitly state that ownership is transferred; that's implicit in the API now.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 581 at r1 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Are there external callers of this API, or is it only exposed for testing purposes? If the former, then it might be better to just have the caller log a separate message with the context before calling this function.

This is only exposed for testing purposes


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

When we do finish converting this to C++, we should probably just make AresRequest inherit from InternallyRefCounted<>. (Internally, see go/grpc-c-core-ref-counted-types-update for expected usage.)

Thanks for the link to go/grpc-c-core-ref-counted-types-update.

After looking at this for a while, the IMO cleanest setup I've found is to keep AresRequest as a unique_ptr in it's external owner, and to introduce a small internal wrapper (which is DualRefCounted) over the AresRequest, and this wrapper is responsible for deciding when to schedule on_done.

My argument is that there are three different sets of owners of the AresRequest object:

  1. The single external owner, i.e. the caller of LookupAresLocked

  2. All pending DNS resolutions (e.g. one ref for A records, another for AAAA, SRV, etc.)

  3. Timer and I/O callbacks

The contract with the external owner is that the AresRequest object is safe to delete as soon as on_done is scheduled, so we just need to make sure that owners 2) and 3) are all gone first, and then we're safe to schedule on_done.

Owners 2) and 3) are different though, because as soon as all owners in 2) are gone, we want to cancel/shutdown/remove all owners in 3) as soon as possible.

So, the internal wrapper object uses owners in 2) as strong refs and owners in 3) as weak refs, and we schedule on_done in the wrapper object's dtor.

PLMK

@apolcyn apolcyn changed the title Convert grpc_ares_wrapper.h APIs to C++ Convert grpc_ares_wrapper to C++ Jan 20, 2021
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! I'm really excited to see this code being modernized.

My only really significant comment is the one about the ref-counting semantics. Most of the others are relatively minor.

Please let me know if you have any questions. Thanks!

Reviewed 3 of 3 files at r4.
Reviewable status: all files reviewed, 27 unresolved discussions (waiting on @apolcyn)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 55 at r4 (raw file):

/// (A queries, AAAA queries, etc.). An AresRequest is created with a call
/// to LookupAresLocked, and it's safe to destroy as soon as the \a on_done
/// callback passed to LookupAresLocked is ran. Meanwhile, a name resolution

s/is ran/has been run/


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 59 at r4 (raw file):

class AresRequest final {
 public:
  static std::unique_ptr<AresRequest> LookupAresLockedImpl(

Suggest calling this Create().


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 60 at r4 (raw file):

 public:
  static std::unique_ptr<AresRequest> LookupAresLockedImpl(
      const char* dns_server, const char* name, const char* default_port,

Please pass in all three of these parameters as absl::string_view.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 61 at r4 (raw file):

  static std::unique_ptr<AresRequest> LookupAresLockedImpl(
      const char* dns_server, const char* name, const char* default_port,
      grpc_pollset_set* interested_parties, grpc_closure* on_done,

Let's use std::function<> instead of grpc_closure.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 64 at r4 (raw file):

      std::unique_ptr<grpc_core::ServerAddressList>* addrs,
      std::unique_ptr<grpc_core::ServerAddressList>* balancer_addrs,
      char** service_config_json, int query_timeout_ms,

Let's use std::string* for service_config_json.

Same thing throughout.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 65 at r4 (raw file):

      std::unique_ptr<grpc_core::ServerAddressList>* balancer_addrs,
      char** service_config_json, int query_timeout_ms,
      std::shared_ptr<grpc_core::WorkSerializer> work_serializer);

As part of our long-term strategy to move the c-ares integration out of the client_channel code and into the iomgr code, it might be a good idea to remove the WorkSerializer here and instead just use a mutex internally. That way, callers that don't use a WorkSerializer (basically everything except the client_channel code) can use this API without needing to provide one, and the client_channel code can be responsible for hopping back into the WorkSerializer when its callback function is invoked.

This doesn't need to be done as part of this PR, though. We can just add a TODO for now.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 69 at r4 (raw file):

  /// Cancel the pending request. Must be called while holding the
  /// WorkSerializer that was used to call \a LookupAresLocked.
  void ShutdownLocked();

I think the original name made more sense here. This is for cancelling an individual request, not shutting down the whole subsystem.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 92 at r4 (raw file):

  /// AresRequest object (the one that invoked LookupAresLocked) can safely
  /// destroy it.
  class OnDoneScheduler final : public DualRefCounted<OnDoneScheduler> {

If we do keep this class for some reason (see my comment elsewhere for why I don't think we should have it), it should be private within AresRequest, since it's not needed outside of the implementation.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 109 at r4 (raw file):

 private:
  explicit AresRequest(

No need for explicit, since there is more than one argument here.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 109 at r4 (raw file):

 private:
  explicit AresRequest(

If we make AresRequest InternallyRefCounted<> as I suggested elsewhere, then we'll probably want to make this ctor public. We can document that callers should use the factory method instead of instantiating this directly, but making this private will make it impossible to use MakeOrphanable<> even from within the factory method.

For related discussion, see https://abseil.io/tips/134.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 154 at r4 (raw file):

  }

  struct FdNode {

Let's make this a class (private data members with appropriate names, etc).


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 154 at r4 (raw file):

  }

  struct FdNode {

As per the style guide, within each section (public, private, etc), types (including structs and classes) should be defined before methods.

https://google.github.io/styleguide/cppguide.html#Declaration_Order


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 202 at r4 (raw file):

  std::unique_ptr<grpc_core::ServerAddressList>* balancer_addresses_out_;
  // the pointer to receive the service config in JSON
  char** service_config_json_out_;

Would be nice to use std::string* here.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 210 at r4 (raw file):

  std::shared_ptr<grpc_core::WorkSerializer> work_serializer_;
  // a list of fds that this request is currently using.
  FdNode* fds_ = nullptr;

Can we use an STL container here instead of the current C-style linked list approach? Also, can we use std::unique_ptr<>, so that the compiler can enforce ownership? For example, we could use something like std::set<std::unique_ptr<FdNode>> here.

Alternatively, maybe we could even do something like std::map<ares_socket_t, std::unique_ptr<FdNode>>, to make lookups faster in AresRequest::PopFdNodeLocked().


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, apolcyn wrote…

Thanks for the link to go/grpc-c-core-ref-counted-types-update.

After looking at this for a while, the IMO cleanest setup I've found is to keep AresRequest as a unique_ptr in it's external owner, and to introduce a small internal wrapper (which is DualRefCounted) over the AresRequest, and this wrapper is responsible for deciding when to schedule on_done.

My argument is that there are three different sets of owners of the AresRequest object:

  1. The single external owner, i.e. the caller of LookupAresLocked

  2. All pending DNS resolutions (e.g. one ref for A records, another for AAAA, SRV, etc.)

  3. Timer and I/O callbacks

The contract with the external owner is that the AresRequest object is safe to delete as soon as on_done is scheduled, so we just need to make sure that owners 2) and 3) are all gone first, and then we're safe to schedule on_done.

Owners 2) and 3) are different though, because as soon as all owners in 2) are gone, we want to cancel/shutdown/remove all owners in 3) as soon as possible.

So, the internal wrapper object uses owners in 2) as strong refs and owners in 3) as weak refs, and we schedule on_done in the wrapper object's dtor.

PLMK

This ownership model seems very different from the way we typically do things, and I think it's a lot more complex than what we really need. I don't think we should need to differentiate between the refs held by individual DNS queries and those that correspond to timers and I/O callbacks.

I am assuming here that we have the following properties:

  • When a DNS query is started, that triggers I/O, which will eventually invoke a callback.
  • When a DNS query finishes, all I/O associated with that query has been completed. (There may be additional I/O callbacks outstanding at this point, but if so they can be cancelled as soon as we know that they're not needed.)

If those properties hold, then I don't think we actually need to hold any refs for the DNS queries themselves; it should be sufficient to hold refs only for the underlying I/O callbacks. We just need to make sure that the last I/O callback for a given DNS query processes the result of the DNS query before it releases its ref.

To say this another way, it seems like we actually have two distinct problems here:

  1. Deciding when it's safe to destroy the AresRequest object.
  2. Deciding when to invoke the on-done callback.

I think this will be simpler if we pull those two problems apart and have a separate solution for each of them. I think we should use ref-counting for (1) but a separate mechanism for (2).

For (1), I think we should make the AresRequest object itself be InternallyRefCounted<>. The internal refs will be held by the I/O and timer callbacks, but the caller doesn't have to know or care about any of that. The caller can just use an OrphanablePtr<> to hold this object, and it can orphan the object whenever it's done with the request. Orphaning the object will cancel any in-flight DNS queries, so the underlying I/O callbacks will return failures, which will cause us to invoke the on-done callback with an error. It will also release the caller's ref.

For (2), I think we can simply have the AresRequest object contain an integer data member that indicates the number of in-flight queries. Whenever we launch a query, we increment it; whenever a query completes, we decrement it. At the end of completing a query (after starting any subsequent queries that may be needed), if the count is zero, then we invoke the on-done callback. (Or if we need something more complex for some reason, we could do something like having a separate bool for each DNS query to indicate if it's complete.)

This approach will de-couple the logic for invoking the on-done callback from the ref-counting logic. It will avoid the need for this OnDoneScheduler class. And it will allow us to invoke the on-done callback as soon as we have the result, independently of whether we're still waiting for some straggling callbacks to run before destroying the AresRequest object itself.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 60 at r4 (raw file):

namespace grpc_core {

struct AresHostByNameRequest {

Please document what this struct represents and how it is used.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 61 at r4 (raw file):

struct AresHostByNameRequest {
  explicit AresHostByNameRequest(RefCountedPtr<AresRequest::OnDoneScheduler> o,

No need for explicit here, since there is more than one parameter.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 96 at r4 (raw file):

};

AresRequest::AresRequest(

All methods should be defined in the same order in which they appear in the .h file.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 217 at r4 (raw file):

      "err=%s",
      this, shutting_down_, grpc_error_string(error));
  // TODO(apolcyn): can we remove this check on shutting_down_?

I don't think this can be removed. It's needed to deal with race conditions. Internally, see go/grpc-c-core-closure-slides for details.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 437 at r4 (raw file):

                        const std::string& logging_prefix) {
  if (GRPC_TRACE_FLAG_ENABLED(grpc_trace_cares_address_sorting)) {
    LogAddressSortingList(r, *addresses, (logging_prefix + "-input").c_str());

Prefer absl::StrCat() over using the + operator.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 462 at r4 (raw file):

                                         int /*timeouts*/,
                                         struct hostent* hostent) {
  std::unique_ptr<AresHostByNameRequest> hr(

It seems a little inconsistent that we use an AresHostByNameRequest struct for the A and AAAA lookups, but we don't have any such struct for the SRV or TXT lookups. Is there some reasonable way that we can make this more uniform, either by not using a struct for any of them, or by using a struct for all of them?

If we go with the latter approach, we could use a real class for each query instead of just a struct, and then the OnHostByNameDoneLocked(), OnSRVQueryDoneLocked(), and OnTXTDoneLocked() methods could be moved into the respective classes.

If there's no way to do this without making things more convoluted, then don't worry about it. It's a nice-to-have, not a requirement.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 523 at r4 (raw file):

    std::string error_msg = absl::StrFormat(
        "C-ares status is not ARES_SUCCESS qtype=%s name=%s is_balancer=%d: %s",
        hr->qtype, hr->host.c_str(), hr->is_balancer, ares_strerror(status));

No need for .c_str() here; it's fine to pass a std::string (or an absl::string_view) to any of the absl string functions (e.g., StrFormat(), StrCat(), etc).

Same thing throughout.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 551 at r4 (raw file):

              o, std::string(srv_it->host), htons(srv_it->port),
              true /* is_balancer */, AF_INET6 /* address_family */);
          ares_gethostbyname(r->channel_, hr->host.c_str(), hr->address_family,

This can be moved into the AresHostByNameRequest ctor.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 581 at r4 (raw file):

      static_cast<AresRequest::OnDoneScheduler*>(arg));
  AresRequest* r = o->parent();
  const std::string kServiceConfigAttributePrefix = "grpc_config=";

This should go back to being declared as a global constant of type char*, so that we don't do an unnecessary allocation every time we enter this function.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 593 at r4 (raw file):

  for (result = reply; result != nullptr; result = result->next) {
    if (result->record_start &&
        memcmp(result->txt, kServiceConfigAttributePrefix.data(),

absl::StartsWith(result->txt, kServiceConfigAttributePrefix)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 714 at r4 (raw file):

  grpc_resolved_address addr;
  std::string hostport =
      grpc_core::JoinHostPort(target_host_, atoi(target_port_.c_str()));

No need for grpc_core:: anywhere in this code, because we're already in that namespace.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 863 at r4 (raw file):

namespace {

// A GrpcResolveAddressAresRequest maintains the state need to

I think the need for this alternative interface will go away when we move the c-ares integration into the iomgr code.

Copy link
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 11 files reviewed, 27 unresolved discussions (waiting on @apolcyn and @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This ownership model seems very different from the way we typically do things, and I think it's a lot more complex than what we really need. I don't think we should need to differentiate between the refs held by individual DNS queries and those that correspond to timers and I/O callbacks.

I am assuming here that we have the following properties:

  • When a DNS query is started, that triggers I/O, which will eventually invoke a callback.
  • When a DNS query finishes, all I/O associated with that query has been completed. (There may be additional I/O callbacks outstanding at this point, but if so they can be cancelled as soon as we know that they're not needed.)

If those properties hold, then I don't think we actually need to hold any refs for the DNS queries themselves; it should be sufficient to hold refs only for the underlying I/O callbacks. We just need to make sure that the last I/O callback for a given DNS query processes the result of the DNS query before it releases its ref.

To say this another way, it seems like we actually have two distinct problems here:

  1. Deciding when it's safe to destroy the AresRequest object.
  2. Deciding when to invoke the on-done callback.

I think this will be simpler if we pull those two problems apart and have a separate solution for each of them. I think we should use ref-counting for (1) but a separate mechanism for (2).

For (1), I think we should make the AresRequest object itself be InternallyRefCounted<>. The internal refs will be held by the I/O and timer callbacks, but the caller doesn't have to know or care about any of that. The caller can just use an OrphanablePtr<> to hold this object, and it can orphan the object whenever it's done with the request. Orphaning the object will cancel any in-flight DNS queries, so the underlying I/O callbacks will return failures, which will cause us to invoke the on-done callback with an error. It will also release the caller's ref.

For (2), I think we can simply have the AresRequest object contain an integer data member that indicates the number of in-flight queries. Whenever we launch a query, we increment it; whenever a query completes, we decrement it. At the end of completing a query (after starting any subsequent queries that may be needed), if the count is zero, then we invoke the on-done callback. (Or if we need something more complex for some reason, we could do something like having a separate bool for each DNS query to indicate if it's complete.)

This approach will de-couple the logic for invoking the on-done callback from the ref-counting logic. It will avoid the need for this OnDoneScheduler class. And it will allow us to invoke the on-done callback as soon as we have the result, independently of whether we're still waiting for some straggling callbacks to run before destroying the AresRequest object itself.

The main problem I'm worried about with de-coupling the scheduling of on_done and the destruction of the AresRequest is that this makes it harder to reason about the correctness of the AresRequest lifetime.

If they on_done and the AresRequest dtor are coupled, then because we don't schedule on_done until the AresRequest has no internal refs left, the entity that constructed the AresRequest also manages its deletion, and in this way we can basically guarantee that DNS resolutions don't leave any leaks.

If we leave around stray timer and I/O callbacks after scheduling on_done, I'd be worried running into a hard-to-debug scenario where e.g. timer or I/O callbacks don't getting cleaned up in a timely manner after the resolution is done, and this kind of leak might not be easily caught in e.g. unit tests.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 11 files reviewed, 27 unresolved discussions (waiting on @apolcyn and @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, apolcyn wrote…

The main problem I'm worried about with de-coupling the scheduling of on_done and the destruction of the AresRequest is that this makes it harder to reason about the correctness of the AresRequest lifetime.

If they on_done and the AresRequest dtor are coupled, then because we don't schedule on_done until the AresRequest has no internal refs left, the entity that constructed the AresRequest also manages its deletion, and in this way we can basically guarantee that DNS resolutions don't leave any leaks.

If we leave around stray timer and I/O callbacks after scheduling on_done, I'd be worried running into a hard-to-debug scenario where e.g. timer or I/O callbacks don't getting cleaned up in a timely manner after the resolution is done, and this kind of leak might not be easily caught in e.g. unit tests.

Handling these remaining timer and I/O callbacks is actually the exact case that the InternallyRefCounted<> pattern was created for. If we were always afraid that there might be a bug that prevents us from cleaning the object up, then we'd never be able to use this pattern anywhere.

Is there some reason that it's somehow more likely that a stray timer or I/O callback might not be cleaned up in this case than in any other case where we use InternallyRefCounted<>? If not, I don't think this is a good reason not to use that pattern.

Copy link
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 11 files reviewed, 27 unresolved discussions (waiting on @apolcyn and @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 55 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/is ran/has been run/

Done. But changed to "has began to run" to make it clear that it can be destroyed from within the callback


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 59 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest calling this Create().

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 60 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please pass in all three of these parameters as absl::string_view.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Handling these remaining timer and I/O callbacks is actually the exact case that the InternallyRefCounted<> pattern was created for. If we were always afraid that there might be a bug that prevents us from cleaning the object up, then we'd never be able to use this pattern anywhere.

Is there some reason that it's somehow more likely that a stray timer or I/O callback might not be cleaned up in this case than in any other case where we use InternallyRefCounted<>? If not, I don't think this is a good reason not to use that pattern.

There's not a particular reason that makes this code more likely to have timer or fd leaks, but the interactions are complicated enough that I'd be weary of at least introducing new bugs.

For example, with the approach that delays on_done until the object is safe to delete, one nice thing is that if a code change introduces a bug that causes such a leak, nothing will work at all and we'll catch it right away. But if we leave around stray refs, such a bug could more easily go undetected -- i.e. it makes it now difficult to test that the code is correct.

I can get behind the desire to be consistent with other cases in the code base, but I also think that the approach of delaying on_done until the object is safe to delete fits better with the C++ style guide's recommendation that objects be deleted by the code that creates them if it's possible.

What do you think if we got rid of the OnDoneScheduler object, but still delayed on_done until the object is safe to delete (we could perhaps do this with a counter of pending queries, and some kind of MaybeScheduleOnDone helper at the end of every query, timer, and I/O callback) ?

Thanks for the review so far though, I'm still working on getting back to the rest of the comments.

Copy link
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 5 of 11 files reviewed, 27 unresolved discussions (waiting on @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 61 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Let's use std::function<> instead of grpc_closure.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 64 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Let's use std::string* for service_config_json.

Same thing throughout.

Done, but changed to absl::optional<std::string>*, since it looks like we need to differentiate no service config with empty string service config


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 65 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As part of our long-term strategy to move the c-ares integration out of the client_channel code and into the iomgr code, it might be a good idea to remove the WorkSerializer here and instead just use a mutex internally. That way, callers that don't use a WorkSerializer (basically everything except the client_channel code) can use this API without needing to provide one, and the client_channel code can be responsible for hopping back into the WorkSerializer when its callback function is invoked.

This doesn't need to be done as part of this PR, though. We can just add a TODO for now.

Added a TODO.

While we can definitely make this not take a WorkSerializer as a parameter, I'm a little worried that using a Mutex internally may cause more complicated code than with a WorkSerializer, but we can feasibly use either one internally.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 69 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think the original name made more sense here. This is for cancelling an individual request, not shutting down the whole subsystem.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 92 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If we do keep this class for some reason (see my comment elsewhere for why I don't think we should have it), it should be private within AresRequest, since it's not needed outside of the implementation.

Latest approach gets rid of this class


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 109 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for explicit, since there is more than one argument here.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 109 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

If we make AresRequest InternallyRefCounted<> as I suggested elsewhere, then we'll probably want to make this ctor public. We can document that callers should use the factory method instead of instantiating this directly, but making this private will make it impossible to use MakeOrphanable<> even from within the factory method.

For related discussion, see https://abseil.io/tips/134.

Made this public


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 154 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Let's make this a class (private data members with appropriate names, etc).

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 154 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As per the style guide, within each section (public, private, etc), types (including structs and classes) should be defined before methods.

https://google.github.io/styleguide/cppguide.html#Declaration_Order

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 202 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Would be nice to use std::string* here.

changed to absl::optional<std::string>* to differentiate non-existant and empty as described above.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 210 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Can we use an STL container here instead of the current C-style linked list approach? Also, can we use std::unique_ptr<>, so that the compiler can enforce ownership? For example, we could use something like std::set<std::unique_ptr<FdNode>> here.

Alternatively, maybe we could even do something like std::map<ares_socket_t, std::unique_ptr<FdNode>>, to make lookups faster in AresRequest::PopFdNodeLocked().

Good idea, that nicely simplified things


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, apolcyn wrote…

There's not a particular reason that makes this code more likely to have timer or fd leaks, but the interactions are complicated enough that I'd be weary of at least introducing new bugs.

For example, with the approach that delays on_done until the object is safe to delete, one nice thing is that if a code change introduces a bug that causes such a leak, nothing will work at all and we'll catch it right away. But if we leave around stray refs, such a bug could more easily go undetected -- i.e. it makes it now difficult to test that the code is correct.

I can get behind the desire to be consistent with other cases in the code base, but I also think that the approach of delaying on_done until the object is safe to delete fits better with the C++ style guide's recommendation that objects be deleted by the code that creates them if it's possible.

What do you think if we got rid of the OnDoneScheduler object, but still delayed on_done until the object is safe to delete (we could perhaps do this with a counter of pending queries, and some kind of MaybeScheduleOnDone helper at the end of every query, timer, and I/O callback) ?

Thanks for the review so far though, I'm still working on getting back to the rest of the comments.

The latest approach gets rid of the OnDoneScheduler class and the DualRefCounted idea, but replaces with a more manual approach which uses a MaybeCallOnDone helper function, and the new pending_queries_, timeout_done_ and backup_poller_done_ fields.

I'm wondering if you think this new approach can look simpler and more reasonable. Please let me know.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 60 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please document what this struct represents and how it is used.

Latest approach gets rid of this struct


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 61 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for explicit here, since there is more than one parameter.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 96 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

All methods should be defined in the same order in which they appear in the .h file.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 217 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I don't think this can be removed. It's needed to deal with race conditions. Internally, see go/grpc-c-core-closure-slides for details.

I think this still might be able to be removed since CancelLocked should be idempotent, improved the TODO


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 437 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Prefer absl::StrCat() over using the + operator.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 462 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It seems a little inconsistent that we use an AresHostByNameRequest struct for the A and AAAA lookups, but we don't have any such struct for the SRV or TXT lookups. Is there some reasonable way that we can make this more uniform, either by not using a struct for any of them, or by using a struct for all of them?

If we go with the latter approach, we could use a real class for each query instead of just a struct, and then the OnHostByNameDoneLocked(), OnSRVQueryDoneLocked(), and OnTXTDoneLocked() methods could be moved into the respective classes.

If there's no way to do this without making things more convoluted, then don't worry about it. It's a nice-to-have, not a requirement.

Done, changed these to use classes which are different subclasses of AresQuery


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 523 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for .c_str() here; it's fine to pass a std::string (or an absl::string_view) to any of the absl string functions (e.g., StrFormat(), StrCat(), etc).

Same thing throughout.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 551 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This can be moved into the AresHostByNameRequest ctor.

I moved this to static methods of the various AresQuery child classes. I believe we can't actually have this in the ctor because the c-ares API's can potentially run their callbacks inline, in which case we'd run the dtor before finishing the ctor completely.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 581 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This should go back to being declared as a global constant of type char*, so that we don't do an unnecessary allocation every time we enter this function.

What do you think about just changing to an absl::string_view (done in the latest update) ?

Because string literals have static storage duration, I want to say most compilers should avoid any allocation.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 593 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

absl::StartsWith(result->txt, kServiceConfigAttributePrefix)

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 714 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for grpc_core:: anywhere in this code, because we're already in that namespace.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 863 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I think the need for this alternative interface will go away when we move the c-ares integration into the iomgr code.

Agree

@apolcyn
Copy link
Contributor Author

apolcyn commented Jan 26, 2021

Comments addresses (though with the ref counting scheme still in discussion).

Thanks for the review so far. A heads up that there still seems to be something happening wrong on Windows, but I think this PR can still otherwise be good for another look.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really shaping up nicely!

Please let me know if you have any questions. Thanks!

Reviewed 3 of 6 files at r6, 2 of 2 files at r7, 2 of 2 files at r8.
Reviewable status: all files reviewed, 20 unresolved discussions (waiting on @apolcyn)


src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc, line 422 at r8 (raw file):

  auto on_done = [this](grpc_error* error) {
    OnResolvedLocked(error);
    GRPC_ERROR_UNREF(error);

Ownership of this error should be passed to OnResolvedLocked(), as per https://github.com/grpc/grpc/blob/master/doc/core/grpc-error.md#ownership-rules.


src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc, line 425 at r8 (raw file):

  };
  pending_request_ = LookupAresLocked(
      dns_server_, name_to_resolve_, kDefaultPort, interested_parties_, on_done,

std::move(on_done), or just declare the lambda inline here.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 55 at r4 (raw file):

Previously, apolcyn wrote…

Done. But changed to "has began to run" to make it clear that it can be destroyed from within the callback

How about "has been called"?


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 64 at r4 (raw file):

Previously, apolcyn wrote…

Done, but changed to absl::optional<std::string>*, since it looks like we need to differentiate no service config with empty string service config

Good catch!


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 71 at r8 (raw file):

      std::shared_ptr<grpc_core::WorkSerializer> work_serializer);

  static std::unique_ptr<AresRequest> Create(

As per style guide, factory methods should go before ctors.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 103 at r8 (raw file):

  class AresQuery {
   protected:
    explicit AresQuery(AresRequest* r);

Please call the parameter request, as per general naming rules:

https://google.github.io/styleguide/cppguide.html#General_Naming_Rules

Same thing throughout (parameter names, data members, etc).


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 105 at r8 (raw file):

    explicit AresQuery(AresRequest* r);

    ~AresQuery();

Base class dtor must be virtual, as per style guide.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 110 at r8 (raw file):

  };

  class AddressQuery : AresQuery {

All inheritence must be public, as per style guide.

Same thing throughout.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 200 at r8 (raw file):

    bool writable_registered_ = false;
    // if the fd has been shutdown yet from grpc iomgr perspective
    bool already_shutdown_ = false;

Suggest removing the already_ prefix here.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, apolcyn wrote…

The latest approach gets rid of the OnDoneScheduler class and the DualRefCounted idea, but replaces with a more manual approach which uses a MaybeCallOnDone helper function, and the new pending_queries_, timeout_done_ and backup_poller_done_ fields.

I'm wondering if you think this new approach can look simpler and more reasonable. Please let me know.

I'd still prefer to just make this InternallyRefCounted<>, for the following reasons:

  • We have a standard pattern in C-core for how we handle cases like these, and I think we need to have a compelling reason to do something different here. I'm not really seeing one in this case.
  • I don't think this code is that much more complex than any of the other classes inside C-core that have internal refs for pending callbacks, so I don't think it's actually that much more likely to cause a leak.
  • If we did have a bug and it did cause a leak, the asan tests would catch it anyway, so delaying the on-done callback doesn't actually give us any additional safety. But it can delay sending the response back to the caller, which can hurt latency.
  • The approach of having to depend on 3 different fields here is actually more complex and harder to understand than simply using InternallyRefCounted<>.

src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 96 at r4 (raw file):

Previously, apolcyn wrote…

Done.

You can split them out by class -- e.g., define the methods of AresRequest::AresQuery first, then the methods of AresRequest::AddressQuery, then AresRequest::SRVQuery, then AresRequest::TXTQuery, then AresRequest::FdNode, then AresRequest itself.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 581 at r4 (raw file):

Previously, apolcyn wrote…

What do you think about just changing to an absl::string_view (done in the latest update) ?

Because string literals have static storage duration, I want to say most compilers should avoid any allocation.

Yeah, that works too.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 72 at r8 (raw file):

      polled_fd_factory_(NewGrpcPolledFdFactory(work_serializer_)),
      query_timeout_ms_(query_timeout_ms),
      on_done_(on_done) {}

std::move()


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 85 at r8 (raw file):

  AddressQuery* q =
      new AddressQuery(r, host, port, is_balancer, address_family);
  ares_gethostbyname(r->channel_, q->host_.c_str(), q->address_family_,

Please add a comment explaining why this can't be done in the ctor.

Same for all of the query types.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 258 at r8 (raw file):

  // Found a service config record.
  if (result != nullptr) {
    *r->service_config_json_out_ =

This would probably be more efficient if written like this:

std::vector<absl::string_view> service_config_parts = {
    absl::string_view(reinterpret_cast<const char*>(result->txt), result->length)
        .substr(kServiceConfigAttributePrefix.size())
};
for (result = result->next; result != nullptr && !result->record_start;
     result = result->next) {
  service_config_parts.emplace_back(
      reinterpret_cast<const char*>(result->txt), result->length);
}
*r->service_config_json_out_ = absl::StrJoin(service_config_parts, "");

This basically avoids allocating any strings until we know the final length.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 274 at r8 (raw file):

  ares_free_data(reply);
  return;
fail:

Instead of using goto, I suggest declaring a local lambda at the top of the function that contains the error-handling code, like this:

auto on_error = [&]() {
  // ...error handling code...
};

Then, in all of the places where we currently say goto fail;, we can instead just say on_error(); return;.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 442 at r8 (raw file):

  backup_poller_done_ = true;
  if (!shutting_down_ && error == GRPC_ERROR_NONE) {
    for (auto& it : fds_) {

Suggest using p here instead of it, since this is actually a pair, not an iterator.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 477 at r8 (raw file):

// I/O readable/writable callbacks with these filedescriptors.
void AresRequest::NotifyOnEventLocked() {
  std::map<ares_socket_t, std::unique_ptr<AresRequest::FdNode>> active_fds;

No need for AresRequest:: here, since we're already in that scope.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 486 at r8 (raw file):

        ares_socket_t s = socks[i];
        if (fds_[s] == nullptr) {
          fds_[s] = absl::make_unique<AresRequest::FdNode>(

Same here.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 506 at r8 (raw file):

  // are therefore no longer in use, so they can be shut down and removed from
  // the list.
  for (auto& it : fds_) {

Suggest changing it to p.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 807 at r8 (raw file):

  // Look up name using c-ares lib.
  r->ContinueAfterCheckLocalhostAndIPLiteralsLocked(dns_server);
done:

It would be nice to avoid the goto. In this case, there's only two statements after the goto, so we can probably just inline them in the places where we're currently using goto.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 814 at r8 (raw file):

void AresRequest::CancelLocked() {
  shutting_down_ = true;
  for (auto& it : fds_) {

Suggest using p instead of it.

Copy link
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 12 files reviewed, 20 unresolved discussions (waiting on @apolcyn and @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc, line 422 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Ownership of this error should be passed to OnResolvedLocked(), as per https://github.com/grpc/grpc/blob/master/doc/core/grpc-error.md#ownership-rules.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc, line 425 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

std::move(on_done), or just declare the lambda inline here.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 55 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

How about "has been called"?

done


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 71 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

As per style guide, factory methods should go before ctors.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 103 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please call the parameter request, as per general naming rules:

https://google.github.io/styleguide/cppguide.html#General_Naming_Rules

Same thing throughout (parameter names, data members, etc).

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 105 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Base class dtor must be virtual, as per style guide.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 110 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

All inheritence must be public, as per style guide.

Same thing throughout.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 200 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest removing the already_ prefix here.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

I'd still prefer to just make this InternallyRefCounted<>, for the following reasons:

  • We have a standard pattern in C-core for how we handle cases like these, and I think we need to have a compelling reason to do something different here. I'm not really seeing one in this case.
  • I don't think this code is that much more complex than any of the other classes inside C-core that have internal refs for pending callbacks, so I don't think it's actually that much more likely to cause a leak.
  • If we did have a bug and it did cause a leak, the asan tests would catch it anyway, so delaying the on-done callback doesn't actually give us any additional safety. But it can delay sending the response back to the caller, which can hurt latency.
  • The approach of having to depend on 3 different fields here is actually more complex and harder to understand than simply using InternallyRefCounted<>.

I'm open to using InternallyRefCounted if we still want to use it. I can see the consistency motivation. But I still have a couple of arguments against it:

If we did have a bug and it did cause a leak, the asan tests would catch it anyway, so delaying the on-done callback doesn't actually give us any additional safety.

There are still a number of potential leaks which won't be caught by asan.

I experimented with using InternallyRefCounted, and then applied the following patch to induce bugs that leak both our timers and our fd's:

diff --git a/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc b/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc
index aec44ba4afe..b2ae4b3d0fc 100644
--- a/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc
+++ b/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc
@@ -489,7 +489,7 @@ AresRequest::~AresRequest() {
 }
 
 void AresRequest::Orphan() {
-  CancelLocked();
+  //CancelLocked();
   Unref();
 }
 
@@ -720,9 +720,9 @@ void AresRequest::DecrementPendingQueries() {
     // shut down any remaining fds.
     // TODO(apolcyn): just run CancelLocked() ?
     // Unref();
-    shutting_down_ = true;
-    grpc_timer_cancel(&query_timeout_);
-    grpc_timer_cancel(&ares_backup_poll_alarm_);
+    //shutting_down_ = true;
+    //grpc_timer_cancel(&query_timeout_);
+    //grpc_timer_cancel(&ares_backup_poll_alarm_);
     ServerAddressList* addresses = addresses_out_->get();
     if (addresses != nullptr) {
       AddressSortingSort(this, addresses, "service-addresses");

.... and I confirmed that even with these bugs, //test/cpp/naming:resolver_component_tests_runner_invoker will pass without any complaints from asan.

FWIW, the timer leaks won't be caught because during grpc_shutdown, we cancel any still-pending timer callbacks as a part of timer thread shutdown.

And the fd leaks won't be caught I believe because they're registered in some global data structures which bypass asans checks.

We could perhaps catch the fd leaks in our tests if we ran them under GRPC_ABORT_ON_LEAKS=true (we might want be doing this for all tests anyways).

I'm not sure about the timer leaks, but there might be something similar to GRPC_ABORT_ON_LEAKS that could catch them too.

But overall, these bugs couldn't sneak in as easily if we avoided the stray refs, so I think there are real safety benefits to avoiding them.

it can delay sending the response back to the caller, which can hurt latency.

I'm not keen on doing this for performance reasons, since this would only be a micro-optimization and it's on the control plane anyways.

The approach of having to depend on 3 different fields here is actually more complex and harder to understand than simply using InternallyRefCounted<>.

FWIW, the InternallyRefCounted approach will still rely on the same pending_queries_ field and surrounding logic. We can, however, delete the timeout_done_ and backup_poller_done_ fields by using InternallyRefCounted.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 96 at r4 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

You can split them out by class -- e.g., define the methods of AresRequest::AresQuery first, then the methods of AresRequest::AddressQuery, then AresRequest::SRVQuery, then AresRequest::TXTQuery, then AresRequest::FdNode, then AresRequest itself.

done


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 72 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

std::move()

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 85 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please add a comment explaining why this can't be done in the ctor.

Same for all of the query types.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 258 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This would probably be more efficient if written like this:

std::vector<absl::string_view> service_config_parts = {
    absl::string_view(reinterpret_cast<const char*>(result->txt), result->length)
        .substr(kServiceConfigAttributePrefix.size())
};
for (result = result->next; result != nullptr && !result->record_start;
     result = result->next) {
  service_config_parts.emplace_back(
      reinterpret_cast<const char*>(result->txt), result->length);
}
*r->service_config_json_out_ = absl::StrJoin(service_config_parts, "");

This basically avoids allocating any strings until we know the final length.

Nice change! done


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 274 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Instead of using goto, I suggest declaring a local lambda at the top of the function that contains the error-handling code, like this:

auto on_error = [&]() {
  // ...error handling code...
};

Then, in all of the places where we currently say goto fail;, we can instead just say on_error(); return;.

Changed this to just inline the code behind the goto label before each return, per your other comment


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 442 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest using p here instead of it, since this is actually a pair, not an iterator.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 477 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

No need for AresRequest:: here, since we're already in that scope.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 486 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Same here.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 506 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest changing it to p.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 814 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Suggest using p instead of it.

Done.

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good!

Please let me know if you have any questions. Thanks!

Reviewed 3 of 3 files at r9.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @apolcyn)


src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc, line 422 at r8 (raw file):

Previously, apolcyn wrote…

Done.

You also need to change OnResolvedLocked() to unref it.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 101 at r9 (raw file):

  // some book keeping on the relevant AresRequest that's needed to carry
  // out the overall resolution.
  class AresQuery {

Do we actually need this base class? Why not just have each subclass store the request itself?


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, apolcyn wrote…

I'm open to using InternallyRefCounted if we still want to use it. I can see the consistency motivation. But I still have a couple of arguments against it:

If we did have a bug and it did cause a leak, the asan tests would catch it anyway, so delaying the on-done callback doesn't actually give us any additional safety.

There are still a number of potential leaks which won't be caught by asan.

I experimented with using InternallyRefCounted, and then applied the following patch to induce bugs that leak both our timers and our fd's:

diff --git a/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc b/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc
index aec44ba4afe..b2ae4b3d0fc 100644
--- a/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc
+++ b/src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc
@@ -489,7 +489,7 @@ AresRequest::~AresRequest() {
 }
 
 void AresRequest::Orphan() {
-  CancelLocked();
+  //CancelLocked();
   Unref();
 }
 
@@ -720,9 +720,9 @@ void AresRequest::DecrementPendingQueries() {
     // shut down any remaining fds.
     // TODO(apolcyn): just run CancelLocked() ?
     // Unref();
-    shutting_down_ = true;
-    grpc_timer_cancel(&query_timeout_);
-    grpc_timer_cancel(&ares_backup_poll_alarm_);
+    //shutting_down_ = true;
+    //grpc_timer_cancel(&query_timeout_);
+    //grpc_timer_cancel(&ares_backup_poll_alarm_);
     ServerAddressList* addresses = addresses_out_->get();
     if (addresses != nullptr) {
       AddressSortingSort(this, addresses, "service-addresses");

.... and I confirmed that even with these bugs, //test/cpp/naming:resolver_component_tests_runner_invoker will pass without any complaints from asan.

FWIW, the timer leaks won't be caught because during grpc_shutdown, we cancel any still-pending timer callbacks as a part of timer thread shutdown.

And the fd leaks won't be caught I believe because they're registered in some global data structures which bypass asans checks.

We could perhaps catch the fd leaks in our tests if we ran them under GRPC_ABORT_ON_LEAKS=true (we might want be doing this for all tests anyways).

I'm not sure about the timer leaks, but there might be something similar to GRPC_ABORT_ON_LEAKS that could catch them too.

But overall, these bugs couldn't sneak in as easily if we avoided the stray refs, so I think there are real safety benefits to avoiding them.

it can delay sending the response back to the caller, which can hurt latency.

I'm not keen on doing this for performance reasons, since this would only be a micro-optimization and it's on the control plane anyways.

The approach of having to depend on 3 different fields here is actually more complex and harder to understand than simply using InternallyRefCounted<>.

FWIW, the InternallyRefCounted approach will still rely on the same pending_queries_ field and surrounding logic. We can, however, delete the timeout_done_ and backup_poller_done_ fields by using InternallyRefCounted.

All of your concerns about asan not catching these problems would also apply to almost every other place where we use InternallyRefCounted<>, so I don't think that's a good argument for doing something different here. We have an established pattern for how we handle this kind of use-case in C-core, and I think that to justify doing something different, there needs to be an argument why this case is actually different from all of the others. Right now, I'm not hearing such an argument.

I'd be fine using GRPC_ABORT_ON_LEAKS=true to run resolver_component_tests_runner_invoker. You could also investigate using it for other tests if you want. I assume that there are reasons why it's not enabled by default for all of our tests, but I'm not sure.

I don't know if there's an equivalent to GRPC_ABORT_ON_LEAKS for timers, but I'm also not too worried about timer callbacks leaking. Even if we forget to cancel a timer, the timer itself will fire normally at some point, and the resources will get cleaned up then.

However, if the timer is set to something like 5 minutes, that can cause significant latency for RPCs. That seems like a bad failure mode -- we can live with delaying that long to free resources a lot more easily than we can live with delaying RPCs for that long.

I understand that we'll still need pending_queries_, but I think that's better, because we'll have two different mechanisms for two different things. The pending_queries_ mechanism will be used for deciding when to invoke on_done, and the ref-counting mechanism will be used to decide when to clean up the object. The two mechanisms don't really need to interact directly at all, so they're both simpler to understand.

Please go ahead and change this code to use InternallyRefCounted<>.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 262 at r9 (raw file):

    }
  }
  // Foquery und a service config record.

Looks like an accidental edit -- should say "Found".

Copy link
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc, line 422 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

You also need to change OnResolvedLocked() to unref it.

OnResolvedLocked is currently unreffing it.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 101 at r9 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Do we actually need this base class? Why not just have each subclass store the request itself?

yeah this base class isn't providing a lot, latest updates get rid of it


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

All of your concerns about asan not catching these problems would also apply to almost every other place where we use InternallyRefCounted<>, so I don't think that's a good argument for doing something different here. We have an established pattern for how we handle this kind of use-case in C-core, and I think that to justify doing something different, there needs to be an argument why this case is actually different from all of the others. Right now, I'm not hearing such an argument.

I'd be fine using GRPC_ABORT_ON_LEAKS=true to run resolver_component_tests_runner_invoker. You could also investigate using it for other tests if you want. I assume that there are reasons why it's not enabled by default for all of our tests, but I'm not sure.

I don't know if there's an equivalent to GRPC_ABORT_ON_LEAKS for timers, but I'm also not too worried about timer callbacks leaking. Even if we forget to cancel a timer, the timer itself will fire normally at some point, and the resources will get cleaned up then.

However, if the timer is set to something like 5 minutes, that can cause significant latency for RPCs. That seems like a bad failure mode -- we can live with delaying that long to free resources a lot more easily than we can live with delaying RPCs for that long.

I understand that we'll still need pending_queries_, but I think that's better, because we'll have two different mechanisms for two different things. The pending_queries_ mechanism will be used for deciding when to invoke on_done, and the ref-counting mechanism will be used to decide when to clean up the object. The two mechanisms don't really need to interact directly at all, so they're both simpler to understand.

Please go ahead and change this code to use InternallyRefCounted<>.

While updating to use InternallyRefCounted, I ran into another reason to not use it: the c-ares lookup takes a grpc_pollset_set parameter, and GrpcPolledFd's need to use that pollset set. In particular, the posix GrpcPolledFd implementation needs to add its fds to the pollset set and then remove them later (it does this in the GrpcPolledFd dtor).

The on_done callback holds an implicit ref over this pollset set param, and thus calling on_done before we're done using the pollset set is unsafe (I noticed this with crashes un the resolver_component_tests_runner_invoker under epollex).


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 807 at r8 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

It would be nice to avoid the goto. In this case, there's only two statements after the goto, so we can probably just inline them in the places where we're currently using goto.

goto removed from earlier updates


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 262 at r9 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Looks like an accidental edit -- should say "Found".

Done.

Copy link
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, apolcyn wrote…

While updating to use InternallyRefCounted, I ran into another reason to not use it: the c-ares lookup takes a grpc_pollset_set parameter, and GrpcPolledFd's need to use that pollset set. In particular, the posix GrpcPolledFd implementation needs to add its fds to the pollset set and then remove them later (it does this in the GrpcPolledFd dtor).

The on_done callback holds an implicit ref over this pollset set param, and thus calling on_done before we're done using the pollset set is unsafe (I noticed this with crashes un the resolver_component_tests_runner_invoker under epollex).

We can perhaps fix this latter issue by creating a new pollset set that belongs to the AresRequest object, adding it to the pollset_set param passed to LookupAresLocked, and having internal GrpcPolledFd objects use the pollset set belonging to the AresRequest object.

Working on this...

Copy link
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 8 of 13 files reviewed, 4 unresolved discussions (waiting on @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, apolcyn wrote…

We can perhaps fix this latter issue by creating a new pollset set that belongs to the AresRequest object, adding it to the pollset_set param passed to LookupAresLocked, and having internal GrpcPolledFd objects use the pollset set belonging to the AresRequest object.

Working on this...

It turns out that creating a pollset_set internal to the AresRequest object won't work actually, because, at least under the poll poller, we need to delete pollset set B from pollset set A, before destroying B, if we ever previously added B to A. Otherwise, we wind up with a dangling pointer from pollset set A to B.

This bug was caught by //test/core/http:httpcli_test@poller=poll FTR, see https://source.cloud.google.com/results/invocations/4498d7e7-b1fe-41b7-9b6f-c53f27d92b3f/targets/%2F%2Ftest%2Fcore%2Fhttp:httpcli_test@poller%3Dpoll/log

So we need to leverage the implicit ref that on_done holds over the LookupAresLocked pollset set parameter, so long as we need to use that pollset set for anything.

I think this is reason for sticking with the non-refcounted approach.

Copy link
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 8 of 13 files reviewed, 4 unresolved discussions (waiting on @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, apolcyn wrote…

It turns out that creating a pollset_set internal to the AresRequest object won't work actually, because, at least under the poll poller, we need to delete pollset set B from pollset set A, before destroying B, if we ever previously added B to A. Otherwise, we wind up with a dangling pointer from pollset set A to B.

This bug was caught by //test/core/http:httpcli_test@poller=poll FTR, see https://source.cloud.google.com/results/invocations/4498d7e7-b1fe-41b7-9b6f-c53f27d92b3f/targets/%2F%2Ftest%2Fcore%2Fhttp:httpcli_test@poller%3Dpoll/log

So we need to leverage the implicit ref that on_done holds over the LookupAresLocked pollset set parameter, so long as we need to use that pollset set for anything.

I think this is reason for sticking with the non-refcounted approach.

Sorry for the back and forth here.

I've fixed this last issue with InternallyRefCounted by deleting the fds from the LookupAresLocked pollset set param during GrpcPolledFd::ShutdownLocked, which is guaranteed to be called on every fd, before on_done is called, ensuring liveness of the pollset set during the deletion.

So the code now uses InternallyRefCounted again.

PTAL

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good! Remaining comments are mostly minor.

Please let me know if you have any questions. Thanks!

Reviewed 2 of 5 files at r10, 8 of 8 files at r11.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @apolcyn)


src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc, line 422 at r8 (raw file):

Previously, apolcyn wrote…

OnResolvedLocked is currently unreffing it.

You're right. Sorry, somehow I thought it wasn't.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver.h, line 59 at r11 (raw file):

 public:
  virtual ~GrpcPolledFdFactory() {}
  /* Creates a new wrapped fd for the current platform.

Please use C++-style comments.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 135 at r2 (raw file):

Previously, apolcyn wrote…

Sorry for the back and forth here.

I've fixed this last issue with InternallyRefCounted by deleting the fds from the LookupAresLocked pollset set param during GrpcPolledFd::ShutdownLocked, which is guaranteed to be called on every fd, before on_done is called, ensuring liveness of the pollset set during the deletion.

So the code now uses InternallyRefCounted again.

PTAL

This looks good!

I agree that the pollset_set lifetime nonsense is annoying. I hope that all of that will go away as part of moving to the new polling model.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 288 at r11 (raw file):

AresRequest::FdNode::FdNode(RefCountedPtr<AresRequest> request,
                            std::unique_ptr<GrpcPolledFd> grpc_polled_fd)
    : request_(request), grpc_polled_fd_(std::move(grpc_polled_fd)) {

std::move(request)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 289 at r11 (raw file):

                            std::unique_ptr<GrpcPolledFd> grpc_polled_fd)
    : request_(request), grpc_polled_fd_(std::move(grpc_polled_fd)) {
  GRPC_CARES_TRACE_LOG("request:%p new fd: %s", request.get(),

This will need to look at request_ if we use std::move() above.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 731 at r11 (raw file):

    // After setting shutting_down_ = true, NotifyOnEventLocked will
    // shut down any remaining fds.
    // TODO(apolcyn): just run CancelLocked() ?

Any reason not to do this?


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 732 at r11 (raw file):

    // shut down any remaining fds.
    // TODO(apolcyn): just run CancelLocked() ?
    // Unref();

Why is this comment needed?

Copy link
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 11 of 15 files reviewed, 5 unresolved discussions (waiting on @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver.h, line 59 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use C++-style comments.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 288 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

std::move(request)

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 289 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

This will need to look at request_ if we use std::move() above.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 731 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Any reason not to do this?

Done. I've changed this method to ShutdownIOLocked though, just since this is now a private method and ShutdownIOLocked seemed to reflect what it does better.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 732 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why is this comment needed?

whoops! removed

Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the ping on this -- I'm sorry it fell off my radar!

The remaining comments are minor, so feel free to merge after addressing them.

Thanks!

Reviewed 2 of 4 files at r12, 7 of 7 files at r13.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @apolcyn)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver.h, line 1 at r13 (raw file):

/*

Please use C++-style comments here.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver.h, line 81 at r13 (raw file):

}  // namespace grpc_core

#endif /* GRPC_CORE_EXT_FILTERS_CLIENT_CHANNEL_RESOLVER_DNS_C_ARES_GRPC_ARES_EV_DRIVER_H \

Please use a C++-style comment here as well.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 55 at r13 (raw file):

/// An AresRequest is a handle over a complete name resolution process
/// (A queries, AAAA queries, etc.). An AresRequest is created with a call
/// to LookupAresLocked, and it the name resolution process begins when

s/it the/the/


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 799 at r11 (raw file):

                       nullptr /* args */);
    // Let the address sorter figure out which one should be tried first.
    AddressSortingSort(this, addresses_out_->get(), "service-addresses");

Why don't we need this anymore?

Copy link
Contributor Author

@apolcyn apolcyn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @markdroth)


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver.h, line 1 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use C++-style comments here.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_ev_driver.h, line 81 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Please use a C++-style comment here as well.

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.h, line 55 at r13 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

s/it the/the/

Done.


src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc, line 799 at r11 (raw file):

Previously, markdroth (Mark D. Roth) wrote…

Why don't we need this anymore?

We now only schedule the user-provided on_done callback effectively from DecrementPendingQueries, and that method first sorts the to-be-returned addresses.

@apolcyn
Copy link
Contributor Author

apolcyn commented Mar 17, 2021

Manually squashed commits and force pushed

@apolcyn
Copy link
Contributor Author

apolcyn commented Mar 17, 2021

artifact build windows: #25545

@apolcyn
Copy link
Contributor Author

apolcyn commented Mar 17, 2021

basic tests c#: #25748
python windows: #25749
portability tests windows: #25751

@apolcyn apolcyn merged commit 2ee7017 into grpc:master Mar 17, 2021
apolcyn added a commit to apolcyn/grpc that referenced this pull request Mar 18, 2021
apolcyn added a commit that referenced this pull request Mar 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lang/core release notes: no Indicates if PR should not be in release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants