@kaij kaij commented Oct 29, 2020

Problem

Currently, the outbackcdx indexing process ignores POST data query parameters (__wb_post_data coming from pywb cdx-indexer or __warc_post_data coming from cdxj-indexer with -11 flag). This causes playback problems for recorded webpages such as http://corona-data.ch. This webpage sends several XHR POST requests to the same endpoint, but with different parameters. A more detailed description of the problem can be found at webrecorder/pywb#585.
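To illustrate the collision, here is a minimal sketch in Python. The endpoint path, the request bodies, and the helper name are made up for the example; the only details taken from the description are the __wb_post_data parameter name and the base64 encoding of its value.

```python
import base64


def urlkey_with_post_data(surt, post_body):
    """Append the base64-encoded POST body to a SURT key (illustrative only)."""
    encoded = base64.b64encode(post_body).decode("ascii")
    separator = "&" if "?" in surt else "?"
    return f"{surt}{separator}__wb_post_data={encoded}"


# Hypothetical endpoint and bodies: two XHR POSTs to the same URL.
surt = "ch,corona-data)/api/query"
body_a = b"metric=cases"
body_b = b"metric=deaths"

# Without POST data in the key, both captures are indexed under one key,
# so a lookup cannot tell the two responses apart.
keys_without = {surt for _ in (body_a, body_b)}

# With POST data appended, each capture gets its own key.
keys_with = {urlkey_with_post_data(surt, body) for body in (body_a, body_b)}

print(len(keys_without), len(keys_with))
```

With the POST data dropped, the index holds one key for two different responses; with it appended, there are two distinct keys.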

Analysis

The POST data is not indexed because outbackcdx uses its own canonicalizer, which works from the original URL. The POST data query parameter is only added to the urlkey (the SURT-canonicalized URL) by the cdx-indexer, so it is lost when outbackcdx recomputes the key.

Proposed solution

If POST data is available in the passed urlkey at indexing time, it is copied and appended to the new surt after canonicalization. This is a change that can be done at indexing time in a locally isolated place (fromCdxLine). It does not interfere with the current canonicalization and the canonicalization does not change the base64 encoded POST-data value.
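The idea can be sketched roughly as follows. This is a Python illustration and not the actual change to fromCdxLine; the function names and the sample keys are hypothetical, and it assumes the POST data appears as an ordinary name=value query parameter in the incoming urlkey.

```python
# Parameter prefixes named in the problem description.
POST_DATA_PREFIXES = ("__wb_post_data=", "__warc_post_data=")


def extract_post_data(urlkey):
    """Return the 'name=value' POST-data parameter from urlkey, or None."""
    query_start = urlkey.find("?")
    if query_start == -1:
        return None
    for param in urlkey[query_start + 1:].split("&"):
        if param.startswith(POST_DATA_PREFIXES):
            return param
    return None


def append_post_data(new_surt, original_urlkey):
    """Copy the POST-data parameter, unchanged, onto the re-canonicalized SURT."""
    post_data = extract_post_data(original_urlkey)
    if post_data is None:
        # Plain GET captures keep exactly the key the canonicalizer produced.
        return new_surt
    separator = "&" if "?" in new_surt else "?"
    return new_surt + separator + post_data


# Hypothetical keys for illustration.
incoming = "ch,corona-data)/api?__wb_post_data=Zm9vPWJhcg=="
recanonicalized = "ch,corona-data)/api"
print(append_post_data(recanonicalized, incoming))
```

Because the parameter is copied verbatim after canonicalization, the base64 value is never passed through the canonicalizer, and records without POST data are left untouched.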

Testing with pywb

The proposed approach has been tested with pywb using the following configuration: api_url: http://outbackcdx/collection?closest={closest}&sort=closest&url={alt_url}. The page http://corona-data.ch has been played back successfully.
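For orientation, this is roughly what a lookup URL built from that template could look like. The sketch only fills the placeholders with plain string formatting; how pywb itself substitutes and encodes these values is not shown here and is an assumption, as are the timestamp, the endpoint path, and the idea that alt_url carries the POST data appended to the URL.

```python
from urllib.parse import quote

# Template taken from the configuration above.
API_URL_TEMPLATE = (
    "http://outbackcdx/collection"
    "?closest={closest}&sort=closest&url={alt_url}"
)


def build_lookup_url(template, closest, alt_url):
    """Fill the template, percent-encoding each value (assumed behaviour)."""
    return template.format(
        closest=quote(closest, safe=""),
        alt_url=quote(alt_url, safe=""),
    )


# Hypothetical values for illustration.
lookup = build_lookup_url(
    API_URL_TEMPLATE,
    closest="20201029000000",
    alt_url="http://corona-data.ch/api?__wb_post_data=eA==",
)
print(lookup)
```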

Possible issues within pywb

I think using alt_url in the template is not optimal, since it bypasses the fuzzy rulesets of pywb. I'm working on a solution for pywb to make the RemoteIndexSource work better with __wb_post_data and the url field, but will pursue this on a separate track.

Possible effects on existing systems

If new CDX data is indexed with POST data included (cdx-indexer -p), there could be issues in replay with exact matches. I'm also not sure what this means for OpenWayback with OutbackCDX (do we need to index twice, with and without POST data, to make this work?)

ato commented Oct 30, 2020

This seems reasonable. OpenWayback doesn't support POST playback as far as I know, so I don't think there's any effect. Well, I guess it means it won't incorrectly play back a POST as a GET, but that seems like a good thing.
