-
Notifications
You must be signed in to change notification settings - Fork 20
Added support for stripping digest schemes, upgraded RocksDB version to 6.20.3 #102
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Fixes deprecated calls + allows building & running inside ARM environments
At present, CDX lines prepended with sha1: cause the ingest into outbackcdx to create a garbled digest. This change allows outbackcdx to receive CDX lines in default hertrix3 format, with any digest scheme stripped from the incoming digest.
It's intentional that we decode the base32 digest to store it in less bytes but not good that we silently corrupt non-base32 data instead of rejecting it with a bad request error. I'm not familiar with any tools that generate CDX records with sha1: prefixes so in that sense I haven't seen this behaviour before. Rather than adding a command-line option to strip prefixes I'd prefer to just hard enable it since it won't affect existing users who use that de facto standard CDX11 format which doesn't include prefixes and if there's anyone else who has sha1: prefixed input they're going to need to reindex anyway to deal with the corruption. That I think simplifies the implementation and also means we don't have to worry about how to write tests with different sets of command-line options enabled. |
src/outbackcdx/Webapp.java
Outdated
// remove the protocol, if applicable | ||
if(this.stripDigestScheme && capture.digest.contains(":")) { | ||
capture.digest = capture.digest.split(":")[1]; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in previous comment I'd prefer if we just enable this for everyone rather than adding an option for it. Then this stripping logic can go in Capture.fromCdxLine()
alongside all the other field parsing and CDX variant handling logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RocksDB upgrade looks good.
As per PR feedback. Also added new Capture test for CDX digest scheme stripping
Hi @ato , Thanks for the feedback! I've made the requested changes :) |
Thanks! Released as 0.11.0. |
Hi folks!
We're testing OutbackCDX on an ARM server as a deduplication source for Heritrix3 and we've come across a couple of issues that I've patched in this branch. I'm opening this as a draft PR because I just wanted to check that I'm not missing something with our configuration before patching something that doesn't exist 😁
This PR achieves the following:
.so
binaries & fixed deprecation warningsTo explain 2. a little further, we encountered an issue whereby sending a digest with a digest scheme (in our case,
sha1:
) prepended to it corrupted the digest that was then stored in RocksDB and returned via the CDX API. For example, sending a value ofsha1:VUICY64ZXNWY6YWTAXPQAI4HGDF46Q2X
actually ended up being stored and retrieved as a format similar toSHAVUICY64ZXNWY6YWTAXPQAI4HGDF46AS=====
, which was not the hash of the object (and therefore unable to be used as either the digest or for dedupeByHash operations).The solution I've implemented is a command line flag which, when enabled, checks for a
:
value in an incoming CDX line's digest. If it exists, it treats the value of the digest as the section after the:
- this operation is only performed if the digest contains a:
.The reason I've put it in as a draft PR is just to, firstly, check that behaviour detailed above has been seen before, and - if so - just to confirm that we'd want to do this digest stripping in outback?
The other query I had was around test suites: I couldn't find a natural place to add a test suite for the new command line flag but more than happy to add if we can figure out where is best.
I'm also happy to strip out the work done to upgrade RocksDB & open a separate PR if we decide that the digest scheme stripping is best done externally to OutbackCDX :)