-
Notifications
You must be signed in to change notification settings - Fork 37.8k
net: 0.19 seeds update #16999
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
net: 0.19 seeds update #16999
Conversation
Only 48 IPv6 seeds might be slightly concerning, dunno. It looks like they're nearly all eliminated in the ASN filer step (introduced in #15840). |
The script output table might be easier to read if you put the numbers first (ideally padded). The suspicious hosts check removes suspiciously few entries :-) Can you print a breakdown of the IPv6 ASNs and node count? |
Yea, only 16 "suspicious hosts" are defined in the script, which were added a long time ago. TBH I don't really think it makes much sense to maintain such a list in the repository (would make sense to pass it in as an external file, I think). (But, this is not the time for a large discussion about this script, that can be done at any time between releases, the point now is to get a list of nodes that is acceptable for merge and is at least more recent than the current one that was last updated for 0.17)
Will have a look at that later. |
It looks like the problem is not the ASN sifting itself, but the way the limiting works. There used to be a limit of 512 IPv4 peers, and no limit on IPv6 or Tor peers (because there are only so few of the latter). However the PR to include the IPv6 addresses in the ASN sifting also included them in the 512 limit. It's split more or less according to how many IPv4/IPv6 there are. Will make limits per-network instead. |
- Change regular expression to cover recent versions, as well as subversions with custom uacomment, and improve readability. - Vary uptime requirements per network (onions are allowed to have less uptime, to make sure we get enough of them) - Add deduplication step (to allow simple concatentation of multiple seeds files). - Log of number of nodes (per network) after every step.
Handle the multiple ports per IP case (as that's a criterion later).
1387bf5
to
3b09f2b
Compare
Looks better now:
|
A max of 2 seeds per ASN seems a bit strict. E.g. AS33915 contains all Dutch Vodafone mobile (not reachable) and Ziggo cable users (long living IP). It's fine for IPv4 because we still get the max of 512 for that net. Maybe make it 10 for IPv6? The odds of randomly connecting to two nodes on the same ASN still seems negligible. Should we bump |
All good suggestions, but I'd say, post-0.19. I've already had to do much more maintenance of this script than intended just before a release. |
ACK 3b09f2b Something to consider when we do improve the script, is that it's not very deterministic. E.g. I ran the script on sipa's data from a few days later plus my data (uploaded above), and 400 of the resulting nodes are different. Even though our numbers are similar at each step:
I tried moving the deduplication step after the uptime and sort steps (commit). I ran it, committed, fetched a slightly newer version of sipa's seed, and ran it again. Now only 140 nodes are different, which is still not great. |
See also: |
Looks good. |
Only looked at the address txt file (I'll let other folks look at the python changes and the equivalence between the txt file and the compiled-in seeds), and found the following groups with ore than two IPs per ASN.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed Python changes. Some minor typos and nits but otherwise looks like reasonable improvements.
ips = [ip for ip in ips if ip is not None] | ||
print('Skip entries with invalid address: %s' % (ip_stats(ips)), file=sys.stderr) | ||
# Skip duplicattes (in case multiple seeds files were concatenated) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/duplicattes/duplicates
@@ -121,6 +121,31 @@ def filtermultiport(ips): | |||
hist[ip['sortkey']].append(ip) | |||
return [value[0] for (key,value) in list(hist.items()) if len(value)==1] | |||
|
|||
def lookup_asn(net, ip): | |||
''' | |||
Look up the asn for an IP (4 or 6) address by querying cymry.com, or None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/cymry/cymru
nit: s/asn/ASN
I'm not going to fix up comment typos here, feel free to do that later.
The deduplication step randomizes the input order (by putting it into a dict, which uses hashing with a random seed in python). This is intentional, because the input (potentially) consists of concatenated nodes files from different sources. If it only picks the first valid nodes it'd effectively ignore your input. Sure, a deterministic shuffling could be done.
Wait, you're saying that the two IPs per ASN filtering isn't only non-deterministic, it doesn't even work? |
Removed these in a new commit as suggested by TheBlueMatt:
|
ACK 0218171. I also checked that |
0218171 contrib: Remove invalid nodes from seeds list (Wladimir J. van der Laan) 3b09f2b net: 0.19 hardcoded seeds update (Wladimir J. van der Laan) 801d341 contrib: makeseeds: More fancy output (Wladimir J. van der Laan) ed76299 contrib: makeseeds: Limit per network, instead of total (Wladimir J. van der Laan) c254a9e contrib: makeseeds: dedup by ip,port (Wladimir J. van der Laan) 3314d87 contrib: makeseeds: Factor out ASN lookup (Wladimir J. van der Laan) 301c2b1 contrib: makeseeds: Improve logging and filtering (Wladimir J. van der Laan) Pull request description: - contrib: Improve makeseeds script - net: 0.19 hardcoded seeds update Sources: - http://bitcoin.sipa.be/seeds.txt.gz (Sipa) - https://github.com/bitcoin/bitcoin/files/3671913/dnsseed.dump.tar.gz (Sjors) Output: ``` Initial: IPv4 418690, IPv6 55861, Onion 2747 Skip entries with invalid address: IPv4 418690, IPv6 55861, Onion 2747 After removing duplicates: IPv4 409220, IPv6 54028, Onion 2717 Skip entries from suspicious hosts: IPv4 409219, IPv6 54028, Onion 2717 Enforce minimal number of blocks: IPv4 106719, IPv6 46342, Onion 2621 Require service bit 1: IPv4 106384, IPv6 46241, Onion 2542 Require minimum uptime: IPv4 5300, IPv6 1153, Onion 201 Require a known and recent user agent: IPv4 4642, IPv6 1060, Onion 141 Filter out hosts with multiple bitcoin ports: IPv4 4642, IPv6 1060, Onion 141 Look up ASNs and limit results, both per ASN and globally: IPv4 464, IPv6 48, Onion 141 ``` ACKs for top commit: Sjors: ACK 0218171. I also checked that `chainparamsseeds.h` is generated from `nodes_main.txt`. Sounds like we should look at this script a bit more outside release moments :-) Tree-SHA512: c1f5795fe88d14800c4da918387368d51e85f4319f2ce3c0359851d041767e2883f32b1da371bba22bd5f0b442ac3e5ea7d685c233ad2cc4045c930f973b0aa2
6866259 net: Hardcoded seeds update for 0.21 (Wladimir J. van der Laan) 36e875b contrib: Add new versions to makeseeds.py and update gitignore (RandyMcMillan) Pull request description: Stats: ``` IPv4 IPv6 Onion Pass 426728 59523 7900 Initial 426728 59523 7900 Skip entries with invalid address 426728 59523 7900 After removing duplicates 426727 59523 7900 Skip entries from suspicious hosts 123226 51785 7787 Enforce minimal number of blocks 121710 51322 7586 Require service bit 1 4706 1427 3749 Require minimum uptime 4124 1098 3681 Require a known and recent user agent 4033 1075 3681 Filter out hosts with multiple bitcoin ports 512 140 512 Look up ASNs and limit results per ASN and per net ``` I've credited RandyMcMillan for the first commit because of #20190. There are at least enough onions this time! Number of IPv6 nodes that pass all the requirements seems similar to last time in #18506. For the next major release we'll want TORv3 hardcoded peers as well. This makes no sense now as there are hardly any. But it'd make sense to think about how to collect them because they cannot come from the DNS seeds. ### Reviewing ``` 2020-10-28 12:04:45 jnewbery wumpus: Do you have any suggestions for how to review #20237 ? 2020-10-28 12:28:37 wumpus jnewbery: previous PRs like it might be a guide there (#18506, #16999), e.g. people could try to repeat the last step in https://github.com/bitcoin/bitcoin/tree/master/contrib/seeds#seeds and see if it ends up with the same .h file, you could also repeat the entire process but as the list of peers from the seeder will be different every time that will give a (slightly, hopefully) 2020-10-28 12:28:37 wumpus different output 2020-10-28 12:49:40 wumpus testing what part of the peers are connectable is also useful 2020-10-28 12:51:05 wumpus or to go deeper, whether most part of the nodes are 'good nodes' and not say spy nodes, but i don't know what means of testing ``` ACKs for top commit: jonatack: ACK 6866259 Tree-SHA512: 6b913ec92932de03304301a0cbf7b4a912ed09d890b019deeb449b8fa787c4994222368c6bf08b3c6e2bfa474442612e1c9de9327ec46ba59c37a5f38af50c75
Sources:
Output: