Contributor

@makyo makyo commented May 3, 2019

Work-in-progress branch for commenting, not for landing (yet).

@@ -29,6 +29,7 @@ public static Response queryIndex(Web.Request request, Index index, Iterable<Fil
Iterable<Capture> captures = query.execute(index);

boolean outputJson = "json".equals(request.param("output"));
// Check request headers for Accept: application/ors+cdxj and write CDXJ instead if requested.
Member

pywb's CDX server uses a format=cdxj query parameter.
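A rough sketch of how the two could coexist: honour a `format=cdxj` query parameter first, then the existing `output=json`, and fall back to the `Accept` header mentioned in the diff comment. `OutputFormat`, `Format`, and `choose` are hypothetical names, not anything in this codebase, and the helper takes plain strings because how the request type exposes headers is not shown here.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of this PR.
final class OutputFormat {
    enum Format { CDX, JSON, CDXJ }

    // outputParam / formatParam are raw query parameter values (may be null),
    // acceptHeader is the raw Accept header value (may be null).
    static Format choose(String outputParam, String formatParam, String acceptHeader) {
        // pywb-style query parameter takes priority.
        if ("cdxj".equalsIgnoreCase(formatParam)) {
            return Format.CDXJ;
        }
        // Existing behaviour from the diff: output=json.
        if ("json".equals(outputParam)) {
            return Format.JSON;
        }
        // Fall back to content negotiation; q-values are ignored in this sketch.
        if (acceptHeader != null) {
            for (String part : acceptHeader.split(",")) {
                String mediaType = part.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
                if (mediaType.equals("application/ors+cdxj")) {
                    return Format.CDXJ;
                }
            }
        }
        return Format.CDX;
    }
}
```

Giving the query parameter priority keeps pywb-style clients working without custom headers; the precedence order is a judgment call and could just as well be reversed.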

capture.isCdxj = true;
capture.timestamp = Long.parseLong(keyVal[1]);
capture.rawValue = keyVal[2];
// parse JSON from capture.rawValue and store in capture.cdxjValue
Member

Should this also populate the fixed CDX fields from their CDXJ equivalents, if present?

What do we store on disk, the raw input JSON? A re-serialized version (potentially stripping excess whitespace)? A more compact encoding like Smile or CBOR? Or perhaps just the extension fields that don't exist in vanilla CDX11?

The latter would keep the index smaller but might lead to rejecting CDXJ inputs if there's a type error (e.g. a string/object/array in an integer field). Might be good to reject them anyway, so we get failures up front at indexing time rather than at (CDX11) query time?
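To make the last option concrete, here is a minimal sketch of splitting an already-parsed CDXJ object into fixed fields and extension fields, rejecting type errors at indexing time. Everything here is an assumption for illustration: `CdxjFields` and its methods are hypothetical, the JSON is assumed to have been parsed into a `Map` by whatever parser we pick, and the key names (`url`, `mime`, `status`, `digest`, `length`, `offset`, `filename`) are the ones I believe pywb emits, which should be checked.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical holder for illustration only; not part of this PR.
final class CdxjFields {
    String url, mime, digest, filename;
    long status = -1, length = -1, offset = -1; // -1 means "not present"
    // Anything without a fixed-field equivalent; only this would need storing as JSON.
    final Map<String, Object> extensions = new LinkedHashMap<>();

    static CdxjFields fromParsedJson(Map<String, Object> json) {
        CdxjFields f = new CdxjFields();
        for (Map.Entry<String, Object> e : json.entrySet()) {
            String key = e.getKey();
            Object value = e.getValue();
            switch (key) {
                case "url":      f.url = asString(key, value); break;
                case "mime":     f.mime = asString(key, value); break;
                case "digest":   f.digest = asString(key, value); break;
                case "filename": f.filename = asString(key, value); break;
                case "status":   f.status = asLong(key, value); break;
                case "length":   f.length = asLong(key, value); break;
                case "offset":   f.offset = asLong(key, value); break;
                default:         f.extensions.put(key, value);
            }
        }
        return f;
    }

    private static String asString(String key, Object value) {
        if (value instanceof String) return (String) value;
        throw new IllegalArgumentException("CDXJ field '" + key + "' must be a string");
    }

    // Accepts JSON integers and numeric strings; rejects objects, arrays, etc.
    private static long asLong(String key, Object value) {
        if (value instanceof Integer || value instanceof Long) {
            return ((Number) value).longValue();
        }
        if (value instanceof String) {
            try {
                return Long.parseLong((String) value);
            } catch (NumberFormatException ex) {
                // fall through to the rejection below
            }
        }
        throw new IllegalArgumentException("CDXJ field '" + key + "' must be an integer");
    }
}
```

With this shape, a record like `{"status": ["200"]}` fails when it is indexed, so a CDX11 query never sees a malformed value.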
