Blank Page causes hallucinated text from "Art of War" #292

@sbutcher

🐛 Describe the bug

When processing a blank page of a scan (the previous page is only very faintly visible in mirror image), the resulting output is a chunk of "The Art of War".

Image
2025-08-04 20:21:47,644 - __main__ - INFO - FINAL METRICS SUMMARY
2025-08-04 20:21:47,644 - __main__ - INFO - ================================================================================
2025-08-04 20:21:47,644 - __main__ - INFO - Total elapsed time: 169.56 seconds
2025-08-04 20:21:47,644 - __main__ - INFO - Total Server Input tokens: 172,396
2025-08-04 20:21:47,644 - __main__ - INFO - Total Server Output tokens: 38,438
2025-08-04 20:21:47,644 - __main__ - INFO - Finished input tokens: 163,352
2025-08-04 20:21:47,644 - __main__ - INFO - Finished output tokens: 38,028
2025-08-04 20:21:47,644 - __main__ - INFO - Completed pages: 128
2025-08-04 20:21:47,644 - __main__ - INFO - Failed pages: 0
2025-08-04 20:21:47,644 - __main__ - INFO - Page Failure rate: 0.00%
2025-08-04 20:21:47,644 - __main__ - INFO - 
2025-08-04 20:21:47,644 - __main__ - INFO - Pages finished by attempt number:
2025-08-04 20:21:47,644 - __main__ - INFO -   Attempt 0: 126 pages (98.4%) - Cumulative: 126 (98.4%)
2025-08-04 20:21:47,644 - __main__ - INFO -   Attempt 1: 1 pages (0.8%) - Cumulative: 127 (99.2%)
2025-08-04 20:21:47,644 - __main__ - INFO -   Attempt 7: 1 pages (0.8%) - Cumulative: 128 (100.0%)
2025-08-04 20:21:47,644 - __main__ - INFO - Server Input tokens/sec rate: 1016.75
2025-08-04 20:21:47,644 - __main__ - INFO - Server Output tokens/sec rate: 226.70
2025-08-04 20:21:47,644 - __main__ - INFO - Finished Input tokens/sec rate: 963.41
2025-08-04 20:21:47,644 - __main__ - INFO - Finished Output tokens/sec rate: 224.28
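To spot pages that needed unusually many retries, the "Pages finished by attempt number" lines in the summary above can be pulled out with a small script. This is only a sketch: the regular expression is written against the log format shown here, and the helper names (`pages_by_attempt`, `late_attempts`) are made up for illustration and are not part of olmocr. The summary does not say which page needed the late attempt, so any link to the blank page would be a guess.

```python
import re

# Matches summary lines such as "Attempt 7: 1 pages (0.8%) - Cumulative: ..."
ATTEMPT_RE = re.compile(r"Attempt (\d+): (\d+) pages")

# Lines copied from the metrics summary in this report.
SAMPLE = """\
2025-08-04 20:21:47,644 - __main__ - INFO -   Attempt 0: 126 pages (98.4%) - Cumulative: 126 (98.4%)
2025-08-04 20:21:47,644 - __main__ - INFO -   Attempt 1: 1 pages (0.8%) - Cumulative: 127 (99.2%)
2025-08-04 20:21:47,644 - __main__ - INFO -   Attempt 7: 1 pages (0.8%) - Cumulative: 128 (100.0%)
"""


def pages_by_attempt(log_text: str) -> dict[int, int]:
    """Map attempt number -> number of pages that finished on that attempt."""
    return {int(m.group(1)): int(m.group(2)) for m in ATTEMPT_RE.finditer(log_text)}


def late_attempts(counts: dict[int, int], min_attempt: int = 2) -> dict[int, int]:
    """Keep only the attempts at or beyond min_attempt (hypothetical cutoff)."""
    return {a: n for a, n in counts.items() if a >= min_attempt}


if __name__ == "__main__":
    counts = pages_by_attempt(SAMPLE)
    print(counts)
    print(late_attempts(counts))
```

In this run, one page only finished on attempt 7, which may be worth checking against the blank page.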

I am using the latest Docker container with this command line:

python -m olmocr.pipeline ./localworkspace --markdown --pdfs book.pdf

Model:

2025-08-04 20:18:58,247 - __main__ - INFO - Downloading model with hugging face 'allenai/olmOCR-7B-0725-FP8'
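A possible workaround for the blank-page case described above is to pre-screen pages and skip ones that are nearly empty before they reach the model. The sketch below is not something olmocr does, as far as this report shows. It assumes the page has already been rendered to 8-bit grayscale pixel values (for example with Pillow's `Image.convert("L")` and `getdata()`). The function name and both thresholds are hypothetical and would need tuning on real scans, since faint show-through from the previous page has to stay below the "dark" cutoff.

```python
from typing import Sequence


def is_probably_blank(
    gray: Sequence[int],
    dark_threshold: int = 128,          # assumed cutoff: pixels below this count as ink
    max_dark_fraction: float = 0.001,   # assumed tolerance for specks and noise
) -> bool:
    """Return True when almost no pixel is darker than dark_threshold.

    `gray` is a flat sequence of 8-bit grayscale values (0 = black, 255 = white).
    """
    if not gray:
        return True  # nothing rendered, treat as blank
    dark = sum(1 for value in gray if value < dark_threshold)
    return dark / len(gray) <= max_dark_fraction


if __name__ == "__main__":
    # Mostly white page with light-gray show-through (values stay above the cutoff).
    show_through_page = [250] * 9000 + [225] * 1000
    # Page where 5% of the pixels are dark ink.
    text_page = [250] * 9500 + [20] * 500

    print(is_probably_blank(show_through_page))
    print(is_probably_blank(text_page))
```

Pages flagged this way could be written out as empty instead of being sent for OCR. A fixed threshold is the simplest choice here, and it will misclassify pages with very light print or heavy show-through.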

Versions

> python --version
Python 3.12.11
> pip freeze
aiohappyeyeballs==2.6.1
aiohttp==3.12.14
aiosignal==1.4.0
airportsdata==20250706
annotated-types==0.7.0
anyio==4.9.0
astor==0.8.1
attrs==25.3.0
beaker-py==2.4.7
beautifulsoup4==4.13.4
blake3==1.0.5
bleach==6.2.0
blinker==1.9.0
boto3==1.39.12
botocore==1.39.12
cached_path==1.7.3
cachetools==5.5.2
certifi==2025.7.14
cffi==1.17.1
chardet==4.0.0
charset-normalizer==3.4.2
click==8.2.1
cloudpickle==3.1.1
compressed-tensors==0.10.2
cryptography==45.0.5
cupy-cuda12x==13.5.1
dbus-python==1.2.18
defusedxml==0.7.1
Deprecated==1.2.18
depyf==0.18.0
dill==0.4.0
diskcache==5.6.3
distro==1.9.0
distro-info==1.1+ubuntu0.2
dnspython==2.7.0
einops==0.8.1
email_validator==2.2.0
eval_type_backport==0.2.2
fastapi==0.116.1
fastapi-cli==0.0.8
fastapi-cloud-cli==0.1.4
fastjsonschema==2.21.1
fastrlock==0.8.3
filelock==3.18.0
flashinfer-python @ https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.6.post1%2Bcu128torch2.7-cp39-abi3-linux_x86_64.whl
Flask==3.1.1
frozenlist==1.7.0
fsspec==2025.7.0
ftfy==6.3.1
fuzzysearch==0.8.0
gguf==0.17.1
google-api-core==2.25.1
google-auth==2.40.3
google-cloud-core==2.4.3
google-cloud-storage==2.19.0
google-crc32c==1.7.1
google-genai==1.27.0
google-resumable-media==2.7.2
googleapis-common-protos==1.70.0
greenlet==3.2.3
grpcio==1.73.1
h11==0.16.0
hf-xet==1.1.5
httpcore==1.0.9
httplib2==0.20.2
httptools==0.6.4
httpx==0.28.1
huggingface-hub==0.33.4
idna==3.10
img2pdf==0.6.1
importlib-metadata==4.6.4
interegular==0.3.3
itsdangerous==2.2.0
jeepney==0.7.1
Jinja2==3.1.6
jiter==0.10.0
jmespath==1.0.1
jsonschema==4.25.0
jsonschema-specifications==2025.4.1
jupyter_client==8.6.3
jupyter_core==5.8.1
jupyterlab_pygments==0.3.0
keyring==23.5.0
lark==1.2.2
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
lingua-language-detector==2.1.1
llguidance==0.7.30
llvmlite==0.44.0
lm-format-enforcer==0.10.11
lxml==6.0.0
markdown-it-py==3.0.0
markdown2==2.5.3
MarkupSafe==3.0.2
mdurl==0.1.2
mistral_common==1.8.2
mistralai==1.9.3
mistune==3.1.3
more-itertools==8.10.0
mpmath==1.3.0
msgpack==1.1.1
msgspec==0.19.0
multidict==6.6.3
nbclient==0.10.2
nbconvert==7.16.6
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.5
ninja==1.11.1.4
numba==0.61.2
numpy==2.2.6
nvidia-cublas-cu12==12.8.3.14
nvidia-cuda-cupti-cu12==12.8.57
nvidia-cuda-nvrtc-cu12==12.8.61
nvidia-cuda-runtime-cu12==12.8.57
nvidia-cudnn-cu12==9.7.1.26
nvidia-cufft-cu12==11.3.3.41
nvidia-cufile-cu12==1.13.0.11
nvidia-curand-cu12==10.3.9.55
nvidia-cusolver-cu12==11.7.2.55
nvidia-cusparse-cu12==12.5.7.53
nvidia-cusparselt-cu12==0.6.3
nvidia-nccl-cu12==2.26.2
nvidia-nvjitlink-cu12==12.8.61
nvidia-nvtx-cu12==12.8.55
oauthlib==3.2.0
olmocr @ file:///build
openai==1.90.0
opencv-python-headless==4.12.0.88
orjson==3.11.0
outlines==0.1.11
outlines_core==0.1.26
packaging==25.0
pandocfilters==1.5.1
partial-json-parser==0.2.1.1.post6
pikepdf==9.10.2
pillow==11.3.0
platformdirs==4.3.8
playwright==1.54.0
prometheus-fastapi-instrumentator==7.1.0
prometheus_client==0.22.1
propcache==0.3.2
proto-plus==1.26.1
protobuf==5.29.5
psutil==7.0.0
py-cpuinfo==9.0.0
pyasn1==0.6.1
pyasn1_modules==0.4.2
pybase64==1.4.1
pycountry==24.6.1
pycparser==2.22
pydantic==2.11.7
pydantic-extra-types==2.10.5
pydantic_core==2.33.2
pyee==13.0.0
Pygments==2.19.2
PyGObject==3.42.1
PyJWT==2.3.0
pyparsing==2.4.7
pypdf==5.8.0
pypdfium2==4.30.0
python-apt==2.4.0+ubuntu4
python-dateutil==2.9.0.post0
python-debian==0.1.43+ubuntu1.1
python-dotenv==1.1.1
python-json-logger==3.3.0
python-magic==0.4.27
python-multipart==0.0.20
PyYAML==6.0.2
pyzmq==27.0.0
RapidFuzz==3.13.0
ray==2.48.0
referencing==0.36.2
regex==2024.11.6
requests==2.32.4
rich==13.9.4
rich-toolkit==0.14.8
rignore==0.6.4
rpds-py==0.26.0
rsa==4.9.1
s3transfer==0.13.1
safetensors==0.5.3
scipy==1.16.0
SecretStorage==3.3.1
sentencepiece==0.2.0
sentry-sdk==2.33.2
sequence_align==0.3.0
setuptools==79.0.1
shellingham==1.5.4
six==1.17.0
smart_open==7.3.0.post1
sniffio==1.3.1
soupsieve==2.7
starlette==0.47.2
sympy==1.14.0
syntok==1.4.4
tenacity==8.5.0
tiktoken==0.9.0
tinycss2==1.4.0
tinyhost==0.4.18
tokenizers==0.21.2
torch==2.7.0+cu128
torchaudio==2.7.0+cu128
torchvision==0.22.0+cu128
tornado==6.5.1
tqdm==4.67.1
traitlets==5.14.3
transformers==4.52.4
triton==3.3.0
typer==0.16.0
typing-inspection==0.4.1
typing_extensions==4.14.1
ubuntu-pro-client==8001
unattended-upgrades==0.1
urllib3==2.5.0
uv==0.8.2
uvicorn==0.35.0
uvloop==0.21.0
vllm==0.9.2
wadllib==1.3.6
watchfiles==1.1.0
wcwidth==0.2.13
webencodings==0.5.1
websockets==15.0.1
Werkzeug==3.1.3
wrapt==1.17.2
xformers==0.0.30
xgrammar==0.1.19
yarl==1.20.1
zipp==1.0.0
zstandard==0.23.0
