Skip to content

Conversation

hyl
Copy link
Contributor

@hyl hyl commented May 26, 2021

Hi folks!

We're testing OutbackCDX on an ARM server as a deduplication source for Heritrix3 and we've come across a couple of issues that I've patched in this branch. I'm opening this as a draft PR because I just wanted to check that I'm not missing something with our configuration before patching something that doesn't exist 😁

This PR achieves the following:

  1. Upgraded rocksdbjni to 6.20.3 so that we can access ARM-compatible .so binaries & fixed deprecation warnings
  2. Added a command line option to strip any digest schemes from incoming hashes on CDX lines.

To explain 2. a little further, we encountered an issue whereby sending a digest with a digest scheme (in our case, sha1:) prepended to it corrupted the digest that was then stored in RocksDB and returned via the CDX API. For example, sending a value of sha1:VUICY64ZXNWY6YWTAXPQAI4HGDF46Q2X actually ended up being stored and retrieved as a format similar to SHAVUICY64ZXNWY6YWTAXPQAI4HGDF46AS=====, which was not the hash of the object (and therefore unable to be used as either the digest or for dedupeByHash operations).

The solution I've implemented is a command line flag which, when enabled, checks for a : value in an incoming CDX line's digest. If it exists, it treats the value of the digest as the section after the : - this operation is only performed if the digest contains a :.

The reason I've put it in as a draft PR is just to, firstly, check that behaviour detailed above has been seen before, and - if so - just to confirm that we'd want to do this digest stripping in outback?

The other query I had was around test suites: I couldn't find a natural place to add a test suite for the new command line flag but more than happy to add if we can figure out where is best.

I'm also happy to strip out the work done to upgrade RocksDB & open a separate PR if we decide that the digest scheme stripping is best done externally to OutbackCDX :)

hyl added 2 commits May 26, 2021 17:09
Fixes deprecated calls + allows building & running inside ARM environments
At present, CDX lines prepended with sha1: cause the ingest into outbackcdx to create a garbled digest.

This change allows outbackcdx to receive CDX lines in default hertrix3 format, with any digest scheme stripped from the incoming digest.
@ato
Copy link
Member

ato commented May 27, 2021

The reason I've put it in as a draft PR is just to, firstly, check that behaviour detailed above has been seen before, and - if so - just to confirm that we'd want to do this digest stripping in outback?

It's intentional that we decode the base32 digest to store it in less bytes but not good that we silently corrupt non-base32 data instead of rejecting it with a bad request error. I'm not familiar with any tools that generate CDX records with sha1: prefixes so in that sense I haven't seen this behaviour before.

Rather than adding a command-line option to strip prefixes I'd prefer to just hard enable it since it won't affect existing users who use that de facto standard CDX11 format which doesn't include prefixes and if there's anyone else who has sha1: prefixed input they're going to need to reindex anyway to deal with the corruption. That I think simplifies the implementation and also means we don't have to worry about how to write tests with different sets of command-line options enabled.

Comment on lines 253 to 256
// remove the protocol, if applicable
if(this.stripDigestScheme && capture.digest.contains(":")) {
capture.digest = capture.digest.split(":")[1];
}
Copy link
Member

@ato ato May 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned in previous comment I'd prefer if we just enable this for everyone rather than adding an option for it. Then this stripping logic can go in Capture.fromCdxLine() alongside all the other field parsing and CDX variant handling logic.

Copy link
Member

@ato ato left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RocksDB upgrade looks good.

hyl added 2 commits May 27, 2021 15:00
As per PR feedback. Also added new Capture test for CDX digest scheme stripping
@hyl hyl marked this pull request as ready for review May 27, 2021 16:50
@hyl hyl requested a review from ato May 27, 2021 16:50
@hyl
Copy link
Contributor Author

hyl commented May 27, 2021

Hi @ato ,

Thanks for the feedback! I've made the requested changes :)

@ato ato merged commit 724f9a0 into nla:master May 28, 2021
ato added a commit that referenced this pull request May 28, 2021
- Upgrade to RocksDB 6.20.3 (Jamie Hoyle) #102
- Strip prefixes like "sha1:" from the digest field when parsing CDX input (Jamie Hoyle) #102
@ato
Copy link
Member

ato commented May 28, 2021

Thanks! Released as 0.11.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants