Releases: nla/outbackcdx
1.0.0
New features
- Added a CBOR-based index version 5 which supports storing arbitrary CDXJ fields
- Added query params method and requestBody to support replay of POST requests (currently requires a pywb patch)
- Added omitSelfRedirects query param and --omit-self-redirects CLI option, which omit from results any records that redirect to themselves after URL canonicalization
- API for compacting an index
- API for upgrading an index (see #117 for instructions for now)
- API for exporting data for statistics (/cube)
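The self-redirect filtering described above can be pictured roughly as follows. This is only an illustrative sketch, not OutbackCDX's implementation: simple_canon is a toy stand-in for real URL canonicalization, and the record fields url, status and redirect are hypothetical names chosen for the example.

```python
from urllib.parse import urlsplit


def simple_canon(url):
    """Toy canonicalizer (an assumption): ignores scheme, a leading
    "www." and a trailing slash. Real CDX canonicalization does far more."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def omit_self_redirects(records):
    """Drop 3xx records whose redirect target canonicalizes to the
    same key as the record's own URL."""
    kept = []
    for rec in records:
        is_redirect = rec["status"].startswith("3")
        target = rec.get("redirect", "-")
        if (is_redirect and target != "-"
                and simple_canon(rec["url"]) == simple_canon(target)):
            continue  # redirects to itself after canonicalization
        kept.append(rec)
    return kept


records = [
    {"url": "http://example.org/", "status": "301",
     "redirect": "https://www.example.org/"},
    {"url": "http://example.org/a", "status": "302",
     "redirect": "http://example.org/b"},
    {"url": "http://example.org/c", "status": "200", "redirect": "-"},
]
filtered = omit_self_redirects(records)
```

In this sketch the first record is dropped because both URLs reduce to the same canonical key, while the genuine redirect and the 200 response are kept.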
Changes
- Updated to RocksDB 8.1.1.1
Release 0.11.0
Release 0.10.0
0.9.1
Fixes storage of CDX lines such as those in CDX 9 or 10 format which lack a compressed size field. #99 (Kristinn Sigurdsson)
Previously OutbackCDX would return "0" instead of "-" for these records causing clients to think they were zero-length records. If you were affected by this you will need to reinsert the affected records into the index.
0.9.0
- Added support for Pywb's __wb_post_data query parameter. #91 (Kai Jauslin)
- CDX parsing: Timestamps shorter than 14 digits are now padded with trailing zeroes so they sort correctly. #95 (Kristinn Sigurdsson)
- XmlQuery: OutbackCDX now supports OpenWayback's count and start_page parameters for pagination.
- XmlQuery: the numreturned and numresults values are now returned in the XML output for better compatibility with OpenWayback's default templates and pagination. #98
To maintain the ability to stream results without holding them in memory, the <request> element has been moved to the end of the XML output, after the <results> element. This was tested against OpenWayback, but any other clients of the XmlQuery protocol may be affected. Please let us know if you encounter any compatibility problems.
When more records match the query than the count parameter asks for, OutbackCDX now needs to scan potentially many more records in order to calculate numresults. This will slow down queries that match a lot of records but only return a fraction of them. A new --max-num-results command-line option was added to constrain the number of extra records that will be scanned. This defaults to 10,000. You may wish to decrease it if you don't care about numresults, or increase it if you want to paginate deeper than this in OpenWayback. This only affects queries using the XML protocol.
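The trade-off between count and --max-num-results can be sketched as below. This is a simplified model under stated assumptions, not the actual OutbackCDX code: the function name paginate is hypothetical, start_page is left out, and the cap is applied to the records scanned beyond the returned page, as the text above describes.

```python
from itertools import islice


def paginate(matches, count, max_num_results=10_000):
    """Return (page, numreturned, numresults) for an iterable of matches.

    Only `count` records are kept; up to `max_num_results` further
    records are scanned purely to be counted, then scanning stops.
    """
    it = iter(matches)
    page = list(islice(it, count))
    # Count extra matching records without storing them, up to the cap.
    extra = sum(1 for _ in islice(it, max_num_results))
    return page, len(page), len(page) + extra


# 25 matching records, page size 10, cap of 5 extra records scanned:
page, numreturned, numresults = paginate(range(25), count=10, max_num_results=5)
```

With a small cap, numresults under-reports the true total (15 rather than 25 here), which is the price of not scanning every matching record.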
0.8.0
- HMAC protected WARC record URLs (see README for details)
- OpenWayback's count and start_page parameters in the XML CDX server protocol are now supported
- Very basic support for POST data (in conjunction with Pywb and Pywb's cdx-indexer)
- CDX records with timestamps shorter than 14 digits are now padded with zeroes.
- Timeouts and logging for replication requests
- --context-path option to deploy at a path other than /
- Exceptions during queries are now logged and reported
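The timestamp padding mentioned above amounts to right-filling with zeroes up to 14 digits. A minimal sketch follows; the helper name pad_timestamp is hypothetical, and the numeric comparison assumes timestamps are ordered as numbers, which is an assumption made here for illustration.

```python
def pad_timestamp(ts, width=14):
    """Pad a short CDX timestamp with trailing zeroes to `width` digits.
    Timestamps that are already `width` digits or longer are unchanged."""
    return ts.ljust(width, "0")


short = "2019"             # year-only timestamp
full = "20181231235959"    # full 14-digit timestamp from the previous year

# Compared as numbers, the unpadded value sorts before the 2018 record...
unpadded_sorts_first = int(short) < int(full)
# ...while the padded value sorts after it, as expected chronologically.
padded_sorts_after = int(pad_timestamp(short)) > int(full)
```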
0.7.0 - replication, fuzzy canon, filter plugins and more
Big new features contributed by James Kafader and Noah Levitt from the Archive-It team at the Internet Archive:
- Replication
- Fuzzy canonicalization using pywb-style rules.yaml
- Request logging
- RocksDB upgraded to 6.0.1
- Command line options for tweaking performance
- Filter plugins
- New query params
  - urlkey - alternative to url, bypasses outbackcdx canonicalization
  - from - minimum timestamp
  - to - maximum timestamp
  - collapse aka collapseToFirst
  - collapseToLast
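As an illustration of the from and to query params listed above, a query URL might be assembled like this. The base URL http://localhost:8080/myindex and the helper build_query_url are assumptions made for the example; check the README for the exact endpoint layout of your deployment.

```python
from urllib.parse import urlencode


def build_query_url(base, url, from_ts=None, to_ts=None):
    """Build a CDX query URL restricted to an optional timestamp range."""
    params = {"url": url}
    if from_ts is not None:
        params["from"] = from_ts  # minimum timestamp
    if to_ts is not None:
        params["to"] = to_ts      # maximum timestamp
    return base + "?" + urlencode(params)


# Hypothetical index location; captures of example.org/ during 2019.
query_url = build_query_url(
    "http://localhost:8080/myindex",
    "example.org/",
    from_ts="20190101000000",
    to_ts="20191231235959",
)
```

urlencode percent-encodes the url value, so characters such as "/" and "?" in the target URL do not clash with the query string itself.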
0.6.0
- CDX meta/robotflags field support
- CDX10 support
- CDX11+3 support (Neil Munro)
- ?badLines=skip option (Noah Levitt and Madison Scott-Clary)
- Filter plugin API (Madison Scott-Clary)
- Added option to limit max open SST files (with default value heuristic suggested by Noah Levitt)
- Regex filters in query API
- Fix for hang on large requests
- Experimental option to use Undertow web server instead of nanohttpd
- Upgraded various dependencies (notably urlcanon)