Append POST data to canonized url #91
Merged
Problem
Currently, POST data query parameters are ignored by the outbackcdx indexing process (__wb_post_data coming from pywb cdx-indexer or __warc_post_data coming from cdxj-indexer with -11 flag). This leads to playback problems of recorded webpages such as http://corona-data.ch. This webpage POSTs several XHR calls to the same endpoint, but with different parameters. A more detailed description of the problem can be found at webrecorder/pywb#585.
Analysis
The POST data is not indexed because outbackcdx uses its own canonicalizer based on the original url. The POST data query parameter is only added to the urlkey (surt-canonized url) by the cdx-indexer.
Proposed solution
If POST data is available in the passed urlkey at indexing time, it is copied and appended to the new surt after canonicalization. This is a change that can be done at indexing time in a locally isolated place (fromCdxLine). It does not interfere with the current canonicalization and the canonicalization does not change the base64 encoded POST-data value.
Testing with pywb
The proposed approach has been tested with pywb using the following configuration:
api_url: http://outbackcdx/collection?closest={closest}&sort=closest&url={alt_url}
The page http://corona-data.ch has been played back successfully.
Possible issues within pywb
I think using alt_url in the template is not optimal, since it bypasses the fuzzy rulesets of pywb. I'm working on a solution for pywb to make the RemoteIndexSource work better with the __wb_post_data and the url field, but will follow this on a separate track.
Possible effects on existing systems
If new CDX data is indexed with POST-data included (cdx-indexer -p), there could possibly be issues in replay with exact matches. I'm also not sure what this means for OpenWayback with OutbackCDX (do we need to index twice, with and without POST-data, to make this work?)
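To make the proposed change and its effect on exact matches more concrete, here is a minimal Python sketch of the idea. It is an illustration only, not the Java code in fromCdxLine: the function names, the sample urlkeys and the base64 values are hypothetical, and the re-canonicalized surt is passed in as a plain string instead of being computed by the outbackcdx canonicalizer.

```python
import re

# Matches the POST data parameter named in this PR (either variant) inside a
# urlkey and captures it as "name=value". The value runs up to the next "&".
_POST_DATA_RE = re.compile(r"[?&]((?:__wb_post_data|__warc_post_data)=[^&]*)")


def extract_post_data(urlkey):
    """Return the 'name=value' POST data parameter from urlkey, or None."""
    match = _POST_DATA_RE.search(urlkey)
    return match.group(1) if match else None


def append_post_data(canonical_surt, urlkey):
    """Copy the POST data parameter from the incoming urlkey onto the surt
    that was re-canonicalized from the original url.

    canonical_surt stands in for the canonicalizer output (assumption:
    it does not already contain the POST data parameter).
    """
    post_data = extract_post_data(urlkey)
    if post_data is None:
        # No POST data in the incoming line: keep the key unchanged.
        return canonical_surt
    separator = "&" if "?" in canonical_surt else "?"
    return canonical_surt + separator + post_data


if __name__ == "__main__":
    # Hypothetical urlkeys for two POSTs to the same endpoint.
    surt = "ch,corona-data)/api"
    first = append_post_data(surt, "ch,corona-data)/api?__wb_post_data=eyJpZCI6MX0=")
    second = append_post_data(surt, "ch,corona-data)/api?__wb_post_data=eyJpZCI6Mn0=")
    print(first)
    print(second)
    print(first != second)  # the two requests now get distinct index keys
```

With keys built this way, two POSTs to the same endpoint no longer collapse into one index entry. The flip side is the concern raised above: a lookup key that carries no POST data will not exactly match a key that does.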