
LimitExceededException when calling the StartDocumentTextDetection operation #21

@ethanscorey

Thank you for building this incredibly useful tool! I've found a lot of use for it recently, but I think I may have pushed it a bit beyond the scale it's built for.

I ran the command from the demo (s3-ocr start s3-ocr-demo --all -a ocr.json) on an S3 bucket that contains ~2,500 PDFs. It started Textract jobs for the first 102 PDFs in the bucket and then raised the following exception:

Traceback (most recent call last):
  File "/home/ethan/miniconda3/envs/nj_deaths/bin/s3-ocr", line 8, in <module>
    sys.exit(cli())
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/s3_ocr/cli.py", line 137, in start
    response = textract.start_document_text_detection(
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/botocore/client.py", line 508, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/botocore/client.py", line 915, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.LimitExceededException: An error occurred (LimitExceededException) when calling the StartDocumentTextDetection operation: Open jobs exceed maximum concurrent job limit

While it's fairly clear what caused the exception (running too many jobs at once), there's no obvious way to avoid it—aside from, of course, OCRing fewer PDFs at once, but who wants to do that?!

Is there a way to tell s3-ocr to chunk the work, so that jobs beyond the concurrent limit are queued until the running jobs finish?
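For what it's worth, here is a minimal sketch of one possible workaround, written directly against boto3 rather than s3-ocr's internals: catch LimitExceededException and retry with exponential backoff, so that jobs beyond the concurrent limit wait for running jobs to drain. The function name start_with_backoff and the pdf_keys list are hypothetical; a real fix would presumably live around the start_document_text_detection call in s3_ocr/cli.py.

```python
import time

import boto3
from botocore.exceptions import ClientError

textract = boto3.client("textract")


def start_with_backoff(bucket, key, max_retries=8):
    """Start a Textract job for one PDF, backing off when the concurrent job limit is hit."""
    delay = 1.0
    for _ in range(max_retries):
        try:
            response = textract.start_document_text_detection(
                DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
            )
            return response["JobId"]
        except ClientError as e:
            # Re-raise anything that isn't the concurrent-job-limit error
            if e.response["Error"]["Code"] != "LimitExceededException":
                raise
            time.sleep(delay)  # wait for some running jobs to finish, then retry
            delay = min(delay * 2, 60)
    raise RuntimeError(f"gave up on {key} after {max_retries} attempts")


# Hypothetical usage over the keys that still need OCR:
# for key in pdf_keys:
#     job_id = start_with_backoff("s3-ocr-demo", key)
```

Polling get_document_text_detection to track how many jobs are still running would be a more precise throttle, but simple backoff may be enough to keep the queue under the limit.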


Labels: bug (Something isn't working), enhancement (New feature or request)
