Releases: allenai/olmocr

v0.3.4

31 Aug 03:25

What's new

Commits

56b08d5 Bump version to v0.3.4 for release
f3cdc78 Pushing new version
edd0980 Reverting version changes that broke, vllm 0.10.1 is not good
2779266 Transformers version bump needed also
03c7479 VLLM version bump
3eec580 Docker ignore
6be12c2 Baseline tests for blanks
ad33672 fix
c7aa217 Scripts to run benchmarks better
59321af Merge pull request #319 from haydn-jones/main
4dbf951 Merge pull request #313 from tongliang11/patch-1

v0.3.3

15 Aug 19:54

What's new

Commits

c492615 Bump version to v0.3.3 for release
cee12cc New version
76405b5 Lints
69c33ab Trying to keep queue loaded more
7c98673 Pipeline fixes for OMP_NUM_THREADS
b9238b8 Fix for floaty amount

v0.3.2

14 Aug 21:07

What's new

Commits

618777c Bump version to v0.3.2 for release
5532493 Pipeline should be improved to limit CPU usage on page renders
3a36ee2 Cleanup
a863d04 Cleanup page rendering cpu limits

v0.3.1

14 Aug 18:09

What's new

Commits

0dd4fe8 Bump version to v0.3.1 for release
7e8f9e4 New version
0a8cd93 Better queue management again
3867924 Removing extra files
dc5c45e Deps
7b3b935 VLLM bump
4431b48 Better tracking of semaphore release on bigger jobs
4efd3f5 AI2 Internal budgeting
9f8df23 Readme updates

v0.3.0

13 Aug 21:56

What's new

Commits

36ca700 Bump version to v0.3.0 for release
3e5351c version bump
894c617 Merge pull request #303 from allenai/jakep/olmocr_v03
be1f845 Fixing issue with blank documents
6216896 Accidentally committed too many files

v0.2.3

04 Aug 20:43

What's new

Commits

6417b2e Merge branch 'main' of https://github.com/allenai/olmocr
75a8b05 Bump version to v0.2.3 for release
f3aedf2 Bumping version
becd15d Reformatting fix
d6591c0 Saving extra metadata that will be useful for finetuning
7c09895 Trying fix for transformers benchmark
8712534 Fix for docker ignore
0536c0e Lint fixes
08b263b Cumulative rotation support
5e991b6 Merge pull request #291 from haydn-jones/main

v0.2.2

04 Aug 18:05

What's new

Commits

c89b66b Bump version to v0.2.2 for release
1286f10 version bump
168953c Lowered memory usage check per #290
ed8a5d1 Ok fixed rotation stuff finally
e0158df Adding test file
6cdcb06 Removing some dead code and adding tests
4d773cc Adding pytest asyncio
0ff6919 Lint fix
a8d5299 Trying to add a test for rotation correction
1255f64 Better error messages
c106114 Fix for languages "no" in yaml
df52cb0 Small fixes for transformers test runner
cf1912d Some transformer bench ideas
26c6281 Formatting
4a70b2e Docs and parameter groups
fc983ca README

v0.2.1

23 Jul 22:00

What's new

Commits

476e20c Bump version to v0.2.1 for release
2545408 Minor release cleaning up a few pipeline things
54719b6 Fixed
c63e97f Default max model len cleanup
4acc85e bolds in tables
c13f5aa Readme
44cb957 Readmes
783cacd Merge branch 'main' of https://github.com/allenai/olmocr
ce32ceb Hopefully a cleaner pipeline

v0.2.0

23 Jul 15:47

What's new

Commits

b4c5913 Bump version to v0.2.0 for release
35a5329 New version
0a6b2fe Lints - bringing back files
56296d6 Bringing back a few files
6e82724 Lint fixes
5ec4967 New default model
a4752b5 Merge remote-tracking branch 'origin/main' into jakep/new_trainer
9ef3fd7 Adjusting temp by attempt
60c3944 More configs
f44d03f Don't break on errors
8eb3786 Fixing compressor again
da6bc45 Fix for compressor
eb200e7 Fixing some default configs on quantizer
0aa7479 More calibration samples by default
6e48012 Trying out an idea for dataset augmentation
df960cb 2 epoch config just to try
a326c96 Default to 1288
b88c71e Rounding to better image size, full soups
75bfa6a Adding full soup configs
0f733ff Fixes to compare vllm script
16145a4 Need accelerate
4785759 Adding some souping support to prepare checkpoint
2b63855 Compare has better downloader
0b40bd3 Better docker ignore
d21a164 Fixing async stuff
3ca305d Adding some souping configs
c0bf310 Fixing import
31c834d Constants
5ea4e8a Compare vllm script
939a76a Adding a compare vllm checkpoint script
2460895 Working on comparing to vllm
e6c9823 Adding more pipeline retry stats, compress code fixed
4dbbf91 Compression script
feb2dab Adjust config
022f437 w8a8-int8 version
5a4a836 Calibration
9115a02 Fixes
4b0960b Test
ee69faa Dataset
bd92f08 Errors propagated
fcd373d Calibration stuff
2218bf8 Merge branch 'jakep/new_trainer_vllm092' into jakep/new_trainer
b5f480d Working on calibration set for compressor, seems like qwen2.5 is not working
3f9fc8b Better compressor hopefully
287c827 Starting to cleanup and merge yaml front matter stuff in
1092213 Merge branch 'jakep/new_traininer_nojson_newprompt' into jakep/new_trainer
679063a Adding some more logging to compressor
43ae28d Prepare checkpoint works for older models too
f306a52 Compress fix
01360ba Compressor script
1ede76d Cleaning up compress and prepare checkpoint scripts
a5a0cd7 Trying a few more configs
384a1b1 Qwen 2 config too
24a3fb8 128batch config, wsd config
0c773c4 Let's do a 1280 no anchor yaml
da5f8f2 wsd config
336b000 Adding wsd as an option
69581cc More config fixes
ca8e503 Ugh, lost some training runs because files got saved to the wrong place
02f0706 Reverting back to json pipeline as it seems better by default
8ae9104 Calling it with a new name
3976cee Adding 8192 cap on day2 config
ca2609c No doc anchoring version
560a585 Configs with proper names
53cc1a0 Fixed json configuration
2c54c6d Allow unicode in json
b1ab996 Day 2 json config
a1c2ee8 More workers by default
d26ae4b Easier way to test configs
a7e2f71 Start a preemptible one at least once
6d6476b One idea for resume fix
2a20607 Get rid of fused
59f11c7 Better names
210d170 Adding a standard JSON output option
6f2a426 Fresh prompt configs
5e8017b Oops
4a6ef91 Matching old trainer config
5e2f703 Trying some config changes
94d7900 Default configs are better
56e51ea Improving regex even more
98df1d5 Adding max length option
abdc907 Pipeline fix
e691ea1 Better regex for structured decoding, adding some new prompts to train with
a651cf0 Adding guided regex decoder
748e2ae With yaml formatted responses, make sure response finishes with code stop
9bf8e9e Preparing pipeline for new format
c6c1fbd Better prepare checkpoint script
8dcfdd0 Checkpoint prep tool
c029ccd Added a few more configs to try
79a7818 New trainer launch script for beaker
dcf026a Better script
9f0f912 Ugh
1d007d1 Perhaps fixing default config
e7020c7 More configs
7cf9879 Image 1600 configuration
d2ef9d7 Four basic training configs for new version
a3ad61b Small config updates
ee8bd9b Better resume logic I hope
208fabc Validating on process pool
4f46f10 At least get resuming from checkpoints to work perhaps
2375079 Torch compile off, gives warnings and no speed boost, padding to do multi batch is not working either
c11120a Trying to do batch size > 1
5c2d69a Some cleanup stuff
e86511e Weka fix
656dbef Frontier configs
e2f2d36 More typos
ea72ea2 Ugh stupid fix
55a737c script
ba49fd5 frontier train script let's see what happens
bde6f29 Bf16 only
44dd966 Wandb fixes
f8071c7 Loss config
a399741 Naming config entries better
8e5e18f Checking that anchor text works for each pdf page when initializing dataloader
dc7fff5 Collator fix
12b5cc3 Lowering size of default data load for testing
c36b5df Cleanup collator
887190e Cleanup
330f465 Small fixes
214c44d Reporting to wandb, better eval dataset loading
600d967 Config changes
850b598 Sdpa
b96454b Merge branch 'main' into jakep/new_trainer
58e4fad torchvision requirement
1451dd1 weka
680377c Example config
dee3730 Gantry stuff
0d7836b Basic attempt to run trainer script
d7e5037 New trainer launch script cleanups
91e7b5c Claude generated train script
0ebc35c Basic train config loader for datasets
b93c262 Prepping new config stuff
e9828cd Lints, adding more perf tracking to pipeline
9ab742b Outputting finished output tok/sec as well
cc0c62a Adding more workers by default to improve bench perf
43c94fe Benchmark update
b1e064f Run benchmark script will also start a job to convert 10k docs from olmocr-mix to check performance
3d72f34 Fixing prepare_olmocrmix
c93ac4a Cleaned up loader
6033881 Cleaning up dataloader
cfe9aa1 Ok, dataloader from start to finish is running, now to write a trainer
105d590 Dataloader progress
9f50bda More refactoring
6a360fa Cleanup
d17bef8 Working on a more pipeliney thing
d0df380 Cleaning data loader
5bbc1ff Parsing and validating front matter
aedc295 Image params to loader
9a390e3 Validating that we get single pages
0689676 Rendering the pdfs in the dataloader
352287c Starting on dataloader
0e17b50 Ok, looks like we have a nice extractor script for the dataset
f19f7c1 Almost done extracting
f0d8ff7 First attempt at new trainer code

v0.1.76

23 Jun 22:06

What's new

Commits

24a2f9b Bump version to v0.1.76 for release
cd93ca5 Version bump
ecce181 Merge pull request #256 from allenai/jakep/dockerfix
0c6d199 Update README.md
ec5c5b6 Updating pareto plots
6c51829 Some helper scripts
626952a Adding news
9d26079 README updates
69524cb Updating bench readme