EliotJones

Run integration tests against parts 0000 and 0001 (~2000 files) of the Common Crawl PDF corpus at https://digitalcorpora.s3.amazonaws.com/s3_browser.html#corpora/files/CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/0000-0999/

The action is mostly a ChatGPT special, so I have no idea if it works. That's the joy of GHA, though: there's no way to make it work without trying a load of different things, so I expect it will need a few further iterations.
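
For anyone wanting to try the corpus step locally before iterating on the workflow, here is a minimal sketch of how the two archive parts might be addressed and how the PDFs inside a downloaded part could be listed. The direct-download prefix in `BASE_URL` is an assumption inferred from the browser link above, and `part_url` and `pdf_members` are hypothetical helper names; none of this is taken from the action in this PR.

```python
import zipfile

# ASSUMPTION: direct-download prefix inferred from the s3_browser link in the
# description; the PR itself does not state the download URL.
BASE_URL = (
    "https://digitalcorpora.s3.amazonaws.com/corpora/files/"
    "CC-MAIN-2021-31-PDF-UNTRUNCATED/zipfiles/0000-0999/"
)


def part_url(part: int) -> str:
    """Return the assumed URL of one corpus archive (hypothetical helper)."""
    # Parts are named with four zero-padded digits, e.g. 0000 and 0001.
    return f"{BASE_URL}{part:04d}.zip"


def pdf_members(archive) -> list[str]:
    """List the PDF entries of a zip archive given as a path or file object."""
    with zipfile.ZipFile(archive) as zf:
        return sorted(
            name for name in zf.namelist() if name.lower().endswith(".pdf")
        )


if __name__ == "__main__":
    # Print the assumed URLs for the two parts named in the description.
    for part in (0, 1):
        print(part_url(part))
```

The sketch deliberately does no downloading; fetching each URL and passing the saved file to `pdf_members` would give the list of documents to feed to the integration tests.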

BobLd merged commit 4bf746c into master on Jul 19, 2025
2 checks passed
BobLd deleted the new-common-crawl-corpus-gha branch on July 19, 2025 at 10:49
2 participants