Add support for arbitrary adapters via a config file

Since there are now many feature requests for different file formats, many of which have no nice and fast Rust library, I think the best solution is to allow specifying "custom" preprocessors via a config file.

This raises the question of how this would differ from just using `rg` with the `--pre` flag directly:

* Compressed caching
  I'm very happy with this feature of rga: extractors are often very slow, and with the zstd-compressed cache most extractions are both very small and very fast to read, while adding barely any overhead on the initial run. This is hard or impossible to reproduce in a simple extraction script ([see my original pdfextract.sh](https://gist.github.com/phiresky/bb51d2e6712c0f160d6fb7594eecf9f9)).
* Archive recursion
  rga can recurse into archives and return contents at any depth as a binary stream. The same can be implemented for other things that aren't strictly archives, such as a PDF file that contains images, where the images may be searched by a different extractor.
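
To illustrate the caching point, here is a minimal sketch of an extraction cache that only runs the slow extractor when a file version has not been seen before. It is not rga's actual implementation: the `ExtractionCache` type, the `(path, mtime)` key, and the in-memory `HashMap` are assumptions made for illustration, and the zstd compression and on-disk storage are left out entirely.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for the extraction cache.
/// Key: (file path, modification time in seconds). Value: extracted text.
/// A real cache would compress the values and persist them on disk.
struct ExtractionCache {
    entries: HashMap<(String, u64), String>,
}

impl ExtractionCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// Return the cached text for this file version, or run `extract`
    /// once and remember its output for later lookups.
    fn get_or_extract<F>(&mut self, path: &str, mtime_secs: u64, extract: F) -> &str
    where
        F: FnOnce() -> String,
    {
        self.entries
            .entry((path.to_string(), mtime_secs))
            .or_insert_with(extract)
            .as_str()
    }
}
```

Keying on the modification time means a changed file is extracted again, while repeated searches over unchanged files skip the extractor completely.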


Future additions that might be possible here (no promises) and that will probably not appear in rg core:

* Declarative post-processing options
  Just as the PDF extractor already adds the page number to the pdftotext output by counting ASCII page-break characters, there might be some post-processing steps that could be defined in the config file, so that they are implemented in fast Rust with no effort on the part of the file-type handler.
* Not running a separate program for each file, but using something like a file-type-handler server instead
  In current usage the extractor is always slow enough that the initialization time is more or less irrelevant, but this might not always be the case.
  For example, tools like tesseract load neural networks into memory on startup, which can be a significant overhead. I think those are evaluated on the CPU, but with GPU-based compute it would be even worse.
* More pipeable adapters
  It might be useful to add adapters that are more like text-conversion tools (such as removing broken characters (#26, #46) or changing encodings (#5, #47)) that could then be added as a step before or after the usual adapters.
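
As a rough illustration of the post-processing idea, the sketch below prefixes each line with a page number by splitting the text on the ASCII form feed character (0x0C), which pdftotext emits between pages. The function name `prefix_page_numbers` and the exact `Page N: ` prefix format are assumptions for this example, not necessarily what rga does internally.

```rust
/// Prefix every line of `text` with "Page N: ". N starts at 1 and
/// increases at each ASCII form feed (0x0C), used here as the page break.
/// (Hypothetical helper; the prefix format is an assumption.)
fn prefix_page_numbers(text: &str) -> String {
    let mut out = String::new();
    // Each chunk between form feeds is treated as one page.
    for (idx, page) in text.split('\x0c').enumerate() {
        for line in page.lines() {
            out.push_str(&format!("Page {}: {}\n", idx + 1, line));
        }
    }
    out
}
```

A step like this could be enabled declaratively per adapter, so the external tool only has to emit plain text with page breaks.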

---

The baseline implementation of this should be pretty easy; more features can be added later. The main decisions are the config file format, whether or not to change the existing SpawningFileAdapters to build on top of this, and how to document it.
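
To make the discussion a bit more concrete, here is one possible shape for a custom adapter entry once it has been loaded from the config file. Everything in it is hypothetical: the `CustomAdapter` struct, its field names, and the helper functions are assumptions for illustration, and the file format itself (and how it is parsed) is deliberately left open.

```rust
use std::path::Path;
use std::process::Command;

/// Hypothetical description of one custom adapter from the config file.
struct CustomAdapter {
    /// Name used to refer to the adapter, e.g. in logs or on the CLI.
    name: String,
    /// File extensions (without the dot) this adapter handles.
    extensions: Vec<String>,
    /// Program to spawn for extraction.
    binary: String,
    /// Arguments passed to the program.
    args: Vec<String>,
}

/// Pick the first adapter whose extension list matches `path`,
/// comparing extensions case-insensitively.
fn find_adapter<'a>(adapters: &'a [CustomAdapter], path: &Path) -> Option<&'a CustomAdapter> {
    let ext = path.extension()?.to_str()?;
    adapters
        .iter()
        .find(|a| a.extensions.iter().any(|e| e.eq_ignore_ascii_case(ext)))
}

/// Build (but do not run) the command described by an adapter.
fn build_command(adapter: &CustomAdapter) -> Command {
    let mut cmd = Command::new(&adapter.binary);
    cmd.args(&adapter.args);
    cmd
}
```

With a shape like this, the existing spawning adapters could in principle be expressed as built-in entries of the same type, which is one way to answer the question of whether they should build on top of the custom mechanism.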
