Releases · dbmdz/solr-ocrhighlighting

Changed

It is now possible to use ExternalUtf8Filter in an OCR field definition and still use inline OCR, mixed with external OCR.
Improved performance when highlighting with hl.ocr.scorePassages=false: we now only analyze as many passages as the number of snippets requested.

Fixed

Version constraints in the Solr plugin repository have been corrected. They were previously pinned to Solr 9.0, which was wrong.
When highlighting with hl.ocr.scorePassages=false, the snippets returned were the last snippets in the document instead of the first, as intended. This has been fixed; we now return only the first snippets in the document.
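
For illustration, a request that uses unscored highlighting might be assembled as in the sketch below. The core URL, the field name `ocr_text` and the helper name are hypothetical. `hl.ocr.fl` and `hl.snippets` are assumed here to be the parameters that select the OCR field and the number of snippets, so check them against the plugin documentation before relying on them.

```python
from urllib.parse import urlencode


def build_unscored_highlight_query(base_url, query, ocr_field, snippets=3):
    """Build a select URL that asks for unscored OCR snippets.

    Only hl.ocr.scorePassages comes from the release notes; the other
    parameter names are assumptions about the plugin's query interface.
    """
    params = {
        "q": query,
        "hl": "true",
        "hl.ocr.fl": ocr_field,           # assumed: OCR field to highlight
        "hl.snippets": snippets,          # assumed: number of snippets requested
        "hl.ocr.scorePassages": "false",  # return the first snippets, unscored
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = build_unscored_highlight_query(
    "http://localhost:8983/solr/ocr/select", "ocr_text:kalender", "ocr_text", snippets=2
)
```

With scoring disabled, the number of passages analyzed is bounded by the snippet count, so a small `snippets` value keeps the request cheap.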

Changed

Respect global request limits (time, memory, CPU) from the [*allowed Solr
query parameters][1] in addition to our own OCR-specific limit.
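
A minimal sketch of how a client might attach these limits to a request follows. `timeAllowed` is the parameter documented at the linked page. `cpuAllowed` and `memAllowed` are inferred from the `*allowed` pattern, and `hl.ocr.timeAllowed` is assumed to be the name of the OCR-specific limit, so verify all of them (and their units) for your Solr and plugin versions. The helper name is hypothetical.

```python
def with_request_limits(params, time_allowed=None, cpu_allowed=None,
                        mem_allowed=None, ocr_time_allowed=None):
    """Return a copy of `params` with the limits that were given added.

    Parameter names other than timeAllowed are assumptions; limits left
    as None are simply not sent.
    """
    limits = {
        "timeAllowed": time_allowed,
        "cpuAllowed": cpu_allowed,              # assumed name
        "memAllowed": mem_allowed,              # assumed name
        "hl.ocr.timeAllowed": ocr_time_allowed, # assumed OCR-specific limit
    }
    merged = dict(params)
    merged.update({name: value for name, value in limits.items() if value is not None})
    return merged


limited = with_request_limits({"q": "ocr_text:kalender", "hl": "true"}, time_allowed=500)
```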

Fixed

Responses that were already partial by the time they reached the highlighting stage
during distributed request processing resulted in errors. This has been fixed.

[1] https://solr.apache.org/guide/solr/latest/query-guide/common-query-parameters.html#timeallowed-parameter

Breaking Changes

When using Solr >= 9.8 with the plugin included via the <lib> directive, the Solr JVM needs to be launched with -Dsolr.config.lib.enabled=true for the plugin to work.

Changed

MiniOCR page fragment parsing is now more robust regarding the order of attributes and additional whitespace
Added a small command-line tool to convert ALTO and hOCR to MiniOCR in util/miniocr.py
Removed usage of deprecated API for retrieving stored field values for Solr versions that support it

Changed

During indexing, we now only need a single pass through the input files instead of
two. This is in preparation for the S3 storage backend, where we don't have the luxury
of relying on a page cache to paper over our inefficiencies.

Fixed

Fix bug that resulted in missed matches during highlighting (#442, thanks @schmika!)
Fix bug that resulted in incomplete reads from the input file under some circumstances (#441, thanks @schmika!)
Compatibility with Solr 9.7

This release brings major performance and stability improvements; upgrading is highly recommended.

Changed:

Add support for multithreaded highlighting. Uses all available logical CPU cores by default and can be tweaked with the numHighlightingThreads and maxQueuedPerThread attributes on the OcrHighlightComponent in solrconfig.xml.
Removed PageCacheWarmer, which is no longer needed thanks to multithreading support.
Completely refactored, simplified and optimized the I/O stack to reduce the number of file system reads and allocations/data copies during highlighting, accounting for a significant performance improvement over previous versions (4-8 times faster in a synthetic benchmark that was not I/O-bound).
We no longer memory-map files for reading. Benchmarking revealed that memory mapping did not improve performance with the new I/O stack (probably due to the reduced number of actual reads); on the contrary, dropping it improved performance for many concurrent queries. A huge drawback of the memory-mapped approach was that in the presence of I/O errors like disappearing mounts, truncated files, etc., the JVM could simply crash (due to the kernel sending a SIGBUS signal when encountering an I/O error).
When locating breaks in the forward direction, we used to put the break point at the end of the limiting element's opening tag. With the new implementation, the break point is at the start of the limiting element's opening tag, i.e. no part of the limiting element is contained in the created section. This leads to a small change in the scores assigned to passages (since BM25 uses the length of the scored content in its calculations).
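
The change in forward break placement can be pictured with the sketch below. It is a simplified illustration on a plain string, not the plugin's implementation; the function name, the `line` limiting element and the sample markup are hypothetical.

```python
import re


def find_forward_break(text, start, limit_tag, at_tag_start=True):
    """Return the offset of the next forward break at or after `start`.

    at_tag_start=True  -> break at the start of the limiting opening tag (new behaviour)
    at_tag_start=False -> break at the end of the limiting opening tag (old behaviour)
    Falls back to len(text) if no limiting element follows.
    """
    # Match "<tag" followed by whitespace, ">" or "/" so that e.g. "<linebreak" is not hit.
    pattern = re.compile(r"<" + re.escape(limit_tag) + r"[\s>/]")
    match = pattern.search(text, start)
    if match is None:
        return len(text)
    if at_tag_start:
        return match.start()
    tag_end = text.find(">", match.start())
    return len(text) if tag_end == -1 else tag_end + 1


ocr = '<w>foo</w> <w>bar</w><line id="l2"><w>baz</w>'
old_section = ocr[:find_forward_break(ocr, 0, "line", at_tag_start=False)]
new_section = ocr[:find_forward_break(ocr, 0, "line")]
```

Here `old_section` still ends with the `<line id="l2">` opening tag, while `new_section` stops right before it. The new section is shorter, which is why length-sensitive scores shift slightly.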

Fixed:

When using source pointers with multiple files, the plugin no longer leaks file descriptors. We previously didn't close the currently opened file when opening the next one.

Changed:

Add support for Solr 9.6
Removed unused classes
Refactored timeout logic to match new approach used in Solr >= 9.5
Dependency Updates

Changed:

Missing files no longer fail the complete search request; instead, the OCR
highlighting for the affected document is skipped
Add support for Solr 9.5
Updated documentation with warning for Solr 9 users to disable security sandboxing
when using pointers to external files

Fixed:

Regular highlighting works again when no hl field can be determined (#404)
Passage building across more than two concatenated files now works (#422)

Changed:

Add support for Solr 9.4
Improved sanitization of broken OCR XML during parsing

Fixed:

More robust bytecode patching for Solr 7/8
Frontend in example setup is working again

Another bugfix release, fixing some edge cases with 'odd' OCR files.

Bugfixes:

hOCR: Fix truncated passages during highlighting due to incomplete forward passes while parsing candidate passages.
All Formats: Use an iterative solution for skipping empty words instead of a recursive strategy, which could lead to stack overflows when encountering OCR files with many empty words.
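
The difference between the two strategies can be sketched as follows. This is an illustration in Python rather than the plugin's code, and the function names and the list-of-strings word model are assumptions made for the example.

```python
def skip_empty_recursive(words, idx=0):
    """Recursive strategy: one stack frame per empty word skipped."""
    if idx >= len(words):
        return None
    if words[idx].strip():
        return idx
    return skip_empty_recursive(words, idx + 1)


def skip_empty_iterative(words, idx=0):
    """Iterative strategy: constant stack depth regardless of input."""
    while idx < len(words):
        if words[idx].strip():
            return idx
        idx += 1
    return None


# A document with a long run of empty words before the first real one.
words = [""] * 100_000 + ["Kalender"]
first_word = skip_empty_iterative(words)
```

On input like this, the recursive variant needs one call per empty word and exhausts the call stack long before reaching the first real word, whereas the loop simply advances an index.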

Other Changes:

We now have pre-releases in the Solr repository that can be used to experiment with the latest changes in the plugin before the official release. For users not using the repository, a pre-release build is also pushed to the GitHub Releases page on every update to the repository.

Releases: dbmdz/solr-ocrhighlighting

Release 0.9.4: Fixes and improvements for unscored highlighting

0.9.3: Support for limits API and fix for partial responses

Release 0.9.2: Solr 9.8 compatibility, MiniOCR improvements

0.9.1: Solr 9.7 compatibility, fixes

0.9.0: Major Performance Improvements

0.8.6: Solr 9.6 Support

0.8.5

0.8.4

WIP build (use at own risk)

0.8.3
