
Conversation

fjahr (Contributor) commented May 23, 2023

Fixes #11368

This requires libevent 2.2.1, which has so far only been released as an alpha.

For more context see #11368 (comment) specifically; this uses the feature mentioned there, libevent#592.

DrahtBot (Contributor) commented May 23, 2023

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/27731.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

No conflicts as of last run.

// Limit the maximum number of open connections to prevent exhausting
// the file descriptor limit. When the http server gets overwhelmed it
// will respond with 503 Service Unavailable.
evhttp_set_max_connections(http, workQueueDepth * 2);
Member:

I am not sure this actually keeps us within the file descriptor limit. For example if you run:

$ ulimit -n 175
$ src/bitcoind
Warning: Reducing -maxconnections from 125 to 15, because of system limitations.
Bitcoin Core starting

I think (but have not tested) that if I then receive 15 inbound connections I could still run out of file descriptors for the http server?

Perhaps if we also modify this line:

nMaxConnections = std::min(nFD - MIN_CORE_FILEDESCRIPTORS - MAX_ADDNODE_CONNECTIONS - NUM_FDS_MESSAGE_CAPTURE, nMaxConnections);

...to consider the workqueue depth too, then it could work?

Contributor Author:

Yeah, I think it will be hard to find a number where there isn't a specific limit and specific circumstances that would still cause problems. If we set the limit dynamically based on what's really available there would be a range where we would effectively set evhttp_set_max_connections(http, 0); which would probably mean that we couldn't even shut down the server via the RPC interface.

When maxconnections is reduced to 15 that doesn't mean there are only 15 file descriptors left; more are reserved than are actually needed at all times. But that part of the code could certainly use improvement, see #27539 and #27730. I think changing the maxconnections calculation is not necessarily linked to this change here, because it could reduce the risk of running out of FDs regardless. Also, note that the node will still start if you set ulimit -n 150, and it will then have maxconnections at 0. So whether that even makes sense is another question to discuss: if we alter the calculation of maxconnections but still let the node run regardless of the result, we haven't really prevented file descriptor exhaustion completely. On the other hand, it may be too limiting if we set the requirement for available file descriptors so high that we can be absolutely sure the node will never encounter exhaustion under any circumstances.

So, I am not sure what the optimal number to use would be. I think we can still tweak this later so it covers the most common use cases best. What I am currently concerned about is whether this actually improves the situation in #11368 for the default file descriptor limits on the most commonly used OSes. macOS was among the lowest as far as I remember, usually 256 I think, so I am currently aiming at showing that it's an improvement for that case. But I am struggling to reproduce the original issue on my new machine right now, so I am not 100% sure anymore that this actually needs fixing. Though I am on a different machine now than I was back then.

@@ -419,6 +419,15 @@ bool InitHTTPServer()
int workQueueDepth = std::max((long)gArgs.GetIntArg("-rpcworkqueue", DEFAULT_HTTP_WORKQUEUE), 1L);
LogPrintfCategory(BCLog::HTTP, "creating work queue of depth %d\n", workQueueDepth);

#if LIBEVENT_VERSION_NUMBER >= 0x02020001
if (event_get_version_number() >= 0x02020001) {
Member:

Won't runtime linking fail to find the evhttp_set_max_connections symbol?

Contributor Author:

Hm, how would that happen? I didn't have any issues when testing this code with 2.2.1-alpha and 2.1.12-stable.

Member:

I'm thinking of the case where you build with 2.2.1, then try to run it with 2.1.12. (If that's not supported, the runtime check is unnecessary)

same time if this is under your control. It is hard to give general advice
since this depends on your system but if you make several hundred requests at
once you are definitely at risk of encountering this issue.
When too many connections are opened quickly the interface will start to
Member:

This loses the correct docs for built-against-older-libevent

DrahtBot commented Feb 5, 2024

🤔 There hasn't been much activity lately and the CI seems to be failing.

If no one reviewed the current pull request by commit hash, a rebase can be considered. While the CI failure may be a false positive, the CI hasn't been running for some time, so there may be a real issue hiding as well. A rebase triggers the latest CI and makes sure that no silent merge conflicts have snuck in.

1 similar comment from DrahtBot on May 5, 2024.

fjahr force-pushed the 2023-05-fd-exhaust branch from 8b0d64d to 3a66391 on May 5, 2024 03:38
fjahr (Contributor, Author) commented May 5, 2024

Pushed a rebase, though currently I am working on removing libevent as a dependency, which should be a quicker solution than waiting for 2.2.1 :) I will address the feedback if 2.2.1 lands anytime soon though!

DrahtBot removed the CI failed label on May 5, 2024
maflcko (Member) commented May 6, 2024

> removing libevent

Nice. Is there a tracking issue?

fjahr (Contributor, Author) commented May 29, 2024

> removing libevent
>
> Nice. Is there a tracking issue?

Not yet, but should be open soon (TM).

@VzxPLnHqr commented:
I am not familiar with the intricacies of this PR, but is it possible for bitcoind to detect this impending crash before it happens and just return some sort of 500 error?

I ran into this issue (using v28.0) when running a straightforward script which was calling getblock for each block, starting from genesis. After a couple thousand blocks bitcoind crashed with the file descriptor error.

After some trial and error I have found that putting a 100 millisecond delay between http requests at least has kept bitcoind from crashing (on my system with very standard commodity hardware), but of course this is very slow and will take days to run. lsof -u bitcoin | wc -l seems to hover between 1100 and 1500 files open for me.

Reading through this issue, it seems we are either waiting for bitcoin core to upgrade to using libevent 2.2.1, or we are waiting for libevent to be removed entirely as a dependency.

However, in the interim it would be really very helpful if, instead of bitcoind crashing, it at least stops responding to the rpc calls (lets the client time out) or returns some 500 error. Then clients can at least know that they are overwhelming the server and back off.

fjahr (Contributor, Author) commented Jan 31, 2025

> Reading through this issue, it seems we are either waiting for bitcoin core to upgrade to using libevent 2.2.1, or we are waiting for libevent to be removed entirely as a dependency.

I would open this PR for review if libevent 2.2.1 were available, but it has been in alpha since May 2023: https://github.com/libevent/libevent/releases. So upgrading libevent isn't really an option yet, but the removal is in the works, see #31194.

> However, in the interim it would be really very helpful if, instead of bitcoind crashing, it at least stops responding to the rpc calls (lets the client time out) or returns some 500 error. Then clients can at least know that they are overwhelming the server and back off.

I don't think there is a simple fix. There have been a few attempts over the years to address this issue outside of libevent, but none have been merged, and I would guess there wouldn't be considerable interest from reviewers given the parallel effort to remove libevent altogether. So I don't believe an interim solution would have a high chance of getting merged given our limited resources for review, particularly in this area of the code base (which is also a motivating factor for the removal of the dependency, by the way). I don't want to discourage anyone from trying to find a simple fix and opening a PR, though.

Successfully merging this pull request may close these issues.

bitcoin core crashes when too many rpc calls are made
6 participants