Closed
Labels: bug (Something isn't working), enhancement (New feature or request)
Description
Thank you for building this incredibly useful tool! I've found a lot of use for it recently, but I think I may have pushed it a bit beyond the scale it's built for.
I ran the line you included in the demo (`s3-ocr start s3-ocr-demo --all -a ocr.json`) on an S3 bucket that contains ~2,500 PDFs. It started Textract jobs for the first 102 PDFs in the bucket, but then it raised the following exception:
```
Traceback (most recent call last):
  File "/home/ethan/miniconda3/envs/nj_deaths/bin/s3-ocr", line 8, in <module>
    sys.exit(cli())
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/s3_ocr/cli.py", line 137, in start
    response = textract.start_document_text_detection(
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/botocore/client.py", line 508, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ethan/miniconda3/envs/nj_deaths/lib/python3.10/site-packages/botocore/client.py", line 915, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.LimitExceededException: An error occurred (LimitExceededException) when calling the StartDocumentTextDetection operation: Open jobs exceed maximum concurrent job limit
```
While it's fairly clear what caused the exception (starting too many jobs at once), there's no obvious way to avoid it, aside from, of course, OCRing fewer PDFs at once, but who wants to do that?!
Is there a way to tell s3-ocr to chunk the jobs so that jobs that exceed the limit are queued to wait until the other jobs finish?
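One way this could work, as a rough sketch rather than anything s3-ocr actually does today: wrap each `start_document_text_detection` call in a retry loop that backs off and waits whenever Textract reports `LimitExceededException`, so jobs past the concurrent limit queue up client-side until slots free. The helper name `start_with_retry` and its parameters are hypothetical; the retry logic is written generically so it can be shown without live AWS credentials, but in real code you would catch `textract.exceptions.LimitExceededException` from boto3.

```python
import time


def start_with_retry(start_job, max_retries=20, base_delay=1.0, sleep=time.sleep):
    """Call start_job() until it succeeds, backing off on limit errors.

    start_job: zero-argument callable that starts one Textract job and
        returns its job id (e.g. a closure over
        textract.start_document_text_detection(...)).
    sleep: injectable for testing; defaults to time.sleep.

    Retries with exponential backoff (capped at 60s) whenever the raised
    exception looks like a LimitExceededException; any other error is
    re-raised immediately.
    """
    delay = base_delay
    for _ in range(max_retries):
        try:
            return start_job()
        except Exception as exc:
            # With boto3 you would catch
            # textract.exceptions.LimitExceededException directly;
            # matching on the class name keeps this sketch self-contained.
            if "LimitExceeded" not in type(exc).__name__:
                raise
            sleep(delay)
            delay = min(delay * 2, 60.0)
    raise RuntimeError(f"gave up after {max_retries} retries")
```

A chunking loop over all PDFs would then call `start_with_retry` once per key, so the tool naturally stalls at the concurrency ceiling instead of crashing partway through the bucket.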