Pywb-style CDXJ input and output formats #100
Merged
This adds support for Pywb's CDXJ format. We follow the Pywb convention of emitting numeric values as JSON strings, but also accept JSON numbers on input.
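For illustration only, here is a minimal Python sketch of that string-versus-number behaviour. It is not the code in this PR: the function names, the `NUMERIC_FIELDS` list, and the sample record are assumptions, and the line layout (`urlkey timestamp {json}`) is the usual Pywb CDXJ shape.

```python
import json

# Assumed list of numeric fields; the actual set handled by the PR may differ.
NUMERIC_FIELDS = ("status", "length", "offset")


def parse_cdxj_line(line):
    """Split 'urlkey timestamp {json}' and normalise numeric fields to int."""
    urlkey, timestamp, json_block = line.rstrip("\n").split(" ", 2)
    fields = json.loads(json_block)
    for name in NUMERIC_FIELDS:
        if name in fields:
            # Accepts both "200" (Pywb-style string) and 200 (JSON number).
            fields[name] = int(fields[name])
    return urlkey, timestamp, fields


def format_cdxj_line(urlkey, timestamp, fields):
    """Emit a CDXJ line with numeric fields written as JSON strings."""
    out = {k: str(v) if k in NUMERIC_FIELDS else v for k, v in fields.items()}
    return f"{urlkey} {timestamp} {json.dumps(out)}"


if __name__ == "__main__":
    sample = ('org,example)/ 20200101000000 '
              '{"url": "http://example.org/", "status": 200, "length": "1234"}')
    print(format_cdxj_line(*parse_cdxj_line(sample)))
```

In this sketch a record that arrives with `"status": 200` is written back out with `"status": "200"`, so output stays uniform regardless of how the input was typed.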
Support for arbitrarily named extension fields is not included yet and will be added separately, as it requires a new version of the index storage format. Similarly, our current index version doesn't really support the notion of missing fields, so we map missing fields to "-" or -1 as appropriate for storage. This is a bit hacky but should generally work for now.
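As a rough sketch of the missing-field mapping (again not the actual implementation; the field lists and the `to_storage` name are assumptions made for the example):

```python
# Assumed field groupings; the real index schema may differ.
STRING_FIELDS = ("mime", "digest", "filename")
NUMERIC_FIELDS = ("status", "length", "offset")


def to_storage(fields):
    """Fill in sentinel values for fields absent from a parsed CDXJ record."""
    record = {}
    for name in STRING_FIELDS:
        # Missing string fields are stored as "-".
        record[name] = fields.get(name, "-")
    for name in NUMERIC_FIELDS:
        # Missing numeric fields are stored as -1.
        record[name] = int(fields.get(name, -1))
    return record


if __name__ == "__main__":
    print(to_storage({"mime": "text/html", "status": "200"}))
```

The hacky part is that, once stored, a field that was genuinely given as "-" or -1 cannot be told apart from one that was missing.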
Other proposed CDXJ variants (such as "OpenWayback CDXJ") are not supported.
CC @ikreymer @anjackson
Closes #48