Skip to content

Conversation

laanwj
Copy link
Member

@laanwj laanwj commented Sep 30, 2019

  • contrib: Improve makeseeds script
  • net: 0.19 hardcoded seeds update

Sources:

Output:

Initial: IPv4 418690, IPv6 55861, Onion 2747
Skip entries with invalid address: IPv4 418690, IPv6 55861, Onion 2747
After removing duplicates: IPv4 409220, IPv6 54028, Onion 2717
Skip entries from suspicious hosts: IPv4 409219, IPv6 54028, Onion 2717
Enforce minimal number of blocks: IPv4 106719, IPv6 46342, Onion 2621
Require service bit 1: IPv4 106384, IPv6 46241, Onion 2542
Require minimum uptime: IPv4 5300, IPv6 1153, Onion 201
Require a known and recent user agent: IPv4 4642, IPv6 1060, Onion 141
Filter out hosts with multiple bitcoin ports: IPv4 4642, IPv6 1060, Onion 141
Look up ASNs and limit results, both per ASN and globally: IPv4 464, IPv6 48, Onion 141

@laanwj
Copy link
Member Author

laanwj commented Sep 30, 2019

Only 48 IPv6 seeds might be slightly concerning, dunno. It looks like they're nearly all eliminated in the ASN filer step (introduced in #15840).

@Sjors
Copy link
Member

Sjors commented Sep 30, 2019

The script output table might be easier to read if you put the numbers first (ideally padded). The suspicious hosts check removes suspiciously few entries :-)

Can you print a breakdown of the IPv6 ASNs and node count?

@fanquake fanquake added this to the 0.19.0 milestone Sep 30, 2019
@laanwj
Copy link
Member Author

laanwj commented Oct 1, 2019

The suspicious hosts check removes suspiciously few entries :-)

Yea, only 16 "suspicious hosts" are defined in the script, which were added a long time ago. TBH I don't really think it makes much sense to maintain such a list in the repository (would make sense to pass it in as an external file, I think).

(But, this is not the time for a large discussion about this script, that can be done at any time between releases, the point now is to get a list of nodes that is acceptable for merge and is at least more recent than the current one that was last updated for 0.17)

Can you print a breakdown of the IPv6 ASNs and node count?

Will have a look at that later.

@laanwj
Copy link
Member Author

laanwj commented Oct 1, 2019

It looks like the problem is not the ASN sifting itself, but the way the limiting works. There used to be a limit of 512 IPv4 peers, and no limit on IPv6 or Tor peers (because there are only so few of the latter). However the PR to include the IPv6 addresses in the ASN sifting also included them in the 512 limit. It's split more or less according to how many IPv4/IPv6 there are.

Will make limits per-network instead.

laanwj added 6 commits October 1, 2019 11:38
- Change regular expression to cover recent versions, as well as
  subversions with custom uacomment, and improve readability.
- Vary uptime requirements per network (onions are allowed to have less
  uptime, to make sure we get enough of them)
- Add deduplication step (to allow simple concatentation of multiple seeds files).
- Log of number of nodes (per network) after every step.
Handle the multiple ports per IP case (as that's a criterion later).
@laanwj laanwj force-pushed the 2019_10_seeds_update branch from 1387bf5 to 3b09f2b Compare October 1, 2019 09:40
@laanwj
Copy link
Member Author

laanwj commented Oct 1, 2019

Looks better now:

  IPv4   IPv6  Onion Pass                                               
418690  55861   2747 Initial
418690  55861   2747 Skip entries with invalid address
410342  54085   2718 After removing duplicates
410341  54085   2718 Skip entries from suspicious hosts
107151  46381   2622 Enforce minimal number of blocks
106814  46278   2543 Require service bit 1
  5381   1166    202 Require minimum uptime
  4690   1068    141 Require a known and recent user agent
  4655   1062    141 Filter out hosts with multiple bitcoin ports
   512    110    141 Look up ASNs and limit results per ASN and per net

@Sjors
Copy link
Member

Sjors commented Oct 1, 2019

A max of 2 seeds per ASN seems a bit strict. E.g. AS33915 contains all Dutch Vodafone mobile (not reachable) and Ziggo cable users (long living IP). It's fine for IPv4 because we still get the max of 512 for that net.

Maybe make it 10 for IPv6? The odds of randomly connecting to two nodes on the same ASN still seems negligible.

Should we bump MIN_BLOCKS as well?

@laanwj
Copy link
Member Author

laanwj commented Oct 1, 2019

Maybe make it 10 for IPv6? The odds of randomly connecting to two nodes on the same ASN still seems negligible
Should we bump MIN_BLOCKS as well?

All good suggestions, but I'd say, post-0.19. I've already had to do much more maintenance of this script than intended just before a release.

@fanquake fanquake requested a review from TheBlueMatt October 1, 2019 12:49
@Sjors
Copy link
Member

Sjors commented Oct 1, 2019

ACK 3b09f2b

Something to consider when we do improve the script, is that it's not very deterministic. E.g. I ran the script on sipa's data from a few days later plus my data (uploaded above), and 400 of the resulting nodes are different. Even though our numbers are similar at each step:

  IPv4   IPv6  Onion Pass                                               
418727  55869   2746 Initial
418727  55869   2746 Skip entries with invalid address
410413  54115   2718 After removing duplicates
410412  54115   2718 Skip entries from suspicious hosts
107223  46411   2622 Enforce minimal number of blocks
106885  46308   2543 Require service bit 1
  5376   1164    202 Require minimum uptime
  4685   1066    141 Require a known and recent user agent
  4648   1060    141 Filter out hosts with multiple bitcoin ports
   512    116    141 Look up ASNs and limit results per ASN and per net

I tried moving the deduplication step after the uptime and sort steps (commit). I ran it, committed, fetched a slightly newer version of sipa's seed, and ran it again. Now only 140 nodes are different, which is still not great.

@maflcko
Copy link
Member

maflcko commented Oct 1, 2019

Something to consider when we do improve the script, is that it's not very deterministic.

See also:

@jonasschnelli
Copy link
Contributor

Looks good.
ACK 801d341 - I just did some random manual checks against my data source (https://bitcointools.jonasschnelli.ch/dnsseed.dump.tar.gz) and it seems like that a lot of the new IPs are not listed in my seeder dump. Though I guess that is okay.
Did some manual checks with telnet.

@TheBlueMatt
Copy link
Contributor

TheBlueMatt commented Oct 1, 2019

Only looked at the address txt file (I'll let other folks look at the python changes and the equivalence between the txt file and the compiled-in seeds), and found the following groups with ore than two IPs per ASN.
212.33.204.190:8333 should likely be removed as it is RPKI-Invalid (max-length 20, but announced as a 22), so it isn't gonna be accessible from a number of networks. The 2002:: IPs are 6to4 and likely also shouldn't be included (except maybe as the corresponding IPv4 addresses).

174 38.143.66.107:8333
174 [2001:550:3d05:156::100]:8333
174 [2602:ffb6:4:739e:f816:3eff:fe00:c2b3]:8333
4134 14.18.140.45:5559
4134 180.97.80.213:8333
4134 182.150.55.96:8333
4134 222.186.43.66:8333
6939 [2001:470:1f1d:61f:cd1a::109]:42434
6939 [2001:470:5:41e::3001]:8333
6939 [2002:2f5a:562a::2f5a:562a]:8333
6939 [2002:b6ff:3dca::b6ff:3dca]:28364

Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed Python changes. Some minor typos and nits but otherwise looks like reasonable improvements.

ips = [ip for ip in ips if ip is not None]
print('Skip entries with invalid address: %s' % (ip_stats(ips)), file=sys.stderr)
# Skip duplicattes (in case multiple seeds files were concatenated)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/duplicattes/duplicates

@@ -121,6 +121,31 @@ def filtermultiport(ips):
hist[ip['sortkey']].append(ip)
return [value[0] for (key,value) in list(hist.items()) if len(value)==1]

def lookup_asn(net, ip):
'''
Look up the asn for an IP (4 or 6) address by querying cymry.com, or None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/cymry/cymru

nit: s/asn/ASN

@laanwj
Copy link
Member Author

laanwj commented Oct 2, 2019

I'm not going to fix up comment typos here, feel free to do that later.

Something to consider when we do improve the script, is that it's not very deterministic.

The deduplication step randomizes the input order (by putting it into a dict, which uses hashing with a random seed in python).

This is intentional, because the input (potentially) consists of concatenated nodes files from different sources. If it only picks the first valid nodes it'd effectively ignore your input.

Sure, a deterministic shuffling could be done.

and found the following groups with ore than two IPs per ASN.

Wait, you're saying that the two IPs per ASN filtering isn't only non-deterministic, it doesn't even work?

@laanwj
Copy link
Member Author

laanwj commented Oct 2, 2019

Removed these in a new commit as suggested by TheBlueMatt:

174 38.143.66.107:8333
174 [2001:550:3d05:156::100]:8333
174 [2602:ffb6:4:739e:f816:3eff:fe00:c2b3]:8333
4134 14.18.140.45:5559
4134 180.97.80.213:8333
4134 182.150.55.96:8333
4134 222.186.43.66:8333
6939 [2001:470:1f1d:61f:cd1a::109]:42434
6939 [2001:470:5:41e::3001]:8333
6939 [2002:2f5a:562a::2f5a:562a]:8333
6939 [2002:b6ff:3dca::b6ff:3dca]:28364

@Sjors
Copy link
Member

Sjors commented Oct 2, 2019

ACK 0218171. I also checked that chainparamsseeds.h is generated from nodes_main.txt. Sounds like we should look at this script a bit more outside release moments :-)

laanwj added a commit that referenced this pull request Oct 2, 2019
0218171 contrib: Remove invalid nodes from seeds list (Wladimir J. van der Laan)
3b09f2b net: 0.19 hardcoded seeds update (Wladimir J. van der Laan)
801d341 contrib: makeseeds: More fancy output (Wladimir J. van der Laan)
ed76299 contrib: makeseeds: Limit per network, instead of total (Wladimir J. van der Laan)
c254a9e contrib: makeseeds: dedup by ip,port (Wladimir J. van der Laan)
3314d87 contrib: makeseeds: Factor out ASN lookup (Wladimir J. van der Laan)
301c2b1 contrib: makeseeds: Improve logging and filtering (Wladimir J. van der Laan)

Pull request description:

  - contrib: Improve makeseeds script
  - net: 0.19 hardcoded seeds update

  Sources:
  - http://bitcoin.sipa.be/seeds.txt.gz (Sipa)
  - https://github.com/bitcoin/bitcoin/files/3671913/dnsseed.dump.tar.gz (Sjors)

  Output:
  ```
  Initial: IPv4 418690, IPv6 55861, Onion 2747
  Skip entries with invalid address: IPv4 418690, IPv6 55861, Onion 2747
  After removing duplicates: IPv4 409220, IPv6 54028, Onion 2717
  Skip entries from suspicious hosts: IPv4 409219, IPv6 54028, Onion 2717
  Enforce minimal number of blocks: IPv4 106719, IPv6 46342, Onion 2621
  Require service bit 1: IPv4 106384, IPv6 46241, Onion 2542
  Require minimum uptime: IPv4 5300, IPv6 1153, Onion 201
  Require a known and recent user agent: IPv4 4642, IPv6 1060, Onion 141
  Filter out hosts with multiple bitcoin ports: IPv4 4642, IPv6 1060, Onion 141
  Look up ASNs and limit results, both per ASN and globally: IPv4 464, IPv6 48, Onion 141
  ```

ACKs for top commit:
  Sjors:
    ACK 0218171. I also checked that `chainparamsseeds.h` is generated from `nodes_main.txt`. Sounds like we should look at this script a bit more outside release moments :-)

Tree-SHA512: c1f5795fe88d14800c4da918387368d51e85f4319f2ce3c0359851d041767e2883f32b1da371bba22bd5f0b442ac3e5ea7d685c233ad2cc4045c930f973b0aa2
@laanwj laanwj merged commit 0218171 into bitcoin:master Oct 2, 2019
@laanwj laanwj mentioned this pull request Oct 2, 2019
11 tasks
laanwj added a commit that referenced this pull request Nov 3, 2020
6866259 net: Hardcoded seeds update for 0.21 (Wladimir J. van der Laan)
36e875b contrib: Add new versions to makeseeds.py and update gitignore (RandyMcMillan)

Pull request description:

  Stats:

  ```
    IPv4   IPv6  Onion Pass
  426728  59523   7900 Initial
  426728  59523   7900 Skip entries with invalid address
  426728  59523   7900 After removing duplicates
  426727  59523   7900 Skip entries from suspicious hosts
  123226  51785   7787 Enforce minimal number of blocks
  121710  51322   7586 Require service bit 1
    4706   1427   3749 Require minimum uptime
    4124   1098   3681 Require a known and recent user agent
    4033   1075   3681 Filter out hosts with multiple bitcoin ports
     512    140    512 Look up ASNs and limit results per ASN and per net
  ```
  I've credited RandyMcMillan for the first commit because of #20190.

  There are at least enough onions this time! Number of IPv6 nodes that pass all the requirements seems similar to last time in #18506.

  For the next major release we'll want TORv3 hardcoded peers as well. This makes no sense now as there are hardly any. But it'd make sense to think about how to collect them because they cannot come from the DNS seeds.

  ### Reviewing
  ```
  2020-10-28 12:04:45     jnewbery  wumpus: Do you have any suggestions for how to review #20237 ?
  2020-10-28 12:28:37     wumpus  jnewbery: previous PRs like it might be a guide there (#18506, #16999), e.g. people could try to repeat the last step in https://github.com/bitcoin/bitcoin/tree/master/contrib/seeds#seeds and see if it ends up with the same .h file, you could also repeat the entire process but as the list of peers from the seeder will be different every time that will give a (slightly, hopefully)
  2020-10-28 12:28:37     wumpus  different output
  2020-10-28 12:49:40     wumpus  testing what part of the peers are connectable is also useful
  2020-10-28 12:51:05     wumpus  or to go deeper, whether most part of the nodes are 'good nodes' and not say spy nodes, but i don't know what means of testing
  ```

ACKs for top commit:
  jonatack:
    ACK 6866259

Tree-SHA512: 6b913ec92932de03304301a0cbf7b4a912ed09d890b019deeb449b8fa787c4994222368c6bf08b3c6e2bfa474442612e1c9de9327ec46ba59c37a5f38af50c75
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants