
Add TextSplitter as Preprocessor node #153

@lalitpagaria

Description


When this project was started, the initial intention was to handle only short text. But now that we have added Google News and Crawler sources, there is a need to handle longer text as well.
As we know, most BERT-based models support a maximum of 512 tokens (with a few exceptions like BigBird). Currently the Analyzer ignores (#113) the excess text.

The idea now is to introduce a TextSplitter to split longer text and feed the chunks to the Analyzer. But this introduces another complexity with Analyzer predictions: how do we combine the inferences from multiple chunks into a final prediction? Right now no proper solution exists for this scenario; we can only try a few approaches like voting, averaging, or the one suggested here: https://discuss.huggingface.co/t/new-pipeline-for-zero-shot-text-classification/681/84.

For the sake of simplicity, let's first implement TextSplitter. For this purpose, let's take inspiration from the Haystack splitter, while also adding context like chunk_id, passage_id, etc. into the metadata.
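A minimal sketch of what such a splitter could look like. All names here (TextChunk, TextSplitter, max_words, the meta keys) are hypothetical illustrations, not the actual obsei or Haystack API; real splitting would likely be token-based rather than word-based.

```python
# Hypothetical sketch of a TextSplitter node: splits long text into
# word-window chunks and attaches context (doc_id, chunk_id, total_chunks)
# to each chunk's metadata, as proposed in this issue.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TextChunk:
    text: str
    meta: Dict[str, Any] = field(default_factory=dict)


class TextSplitter:
    def __init__(self, max_words: int = 300, overlap: int = 0):
        # overlap lets adjacent chunks share context at their boundary
        assert 0 <= overlap < max_words
        self.max_words = max_words
        self.overlap = overlap

    def split(self, text: str, doc_id: str = "doc") -> List[TextChunk]:
        words = text.split()
        step = self.max_words - self.overlap
        chunks = [
            TextChunk(
                text=" ".join(words[start:start + self.max_words]),
                meta={"doc_id": doc_id, "chunk_id": chunk_id},
            )
            for chunk_id, start in enumerate(range(0, len(words), step))
        ]
        # record how many chunks the document produced, so a downstream
        # aggregator knows when it has seen them all
        for chunk in chunks:
            chunk.meta["total_chunks"] = len(chunks)
        return chunks
```

A tokenizer-aware version would count model tokens instead of whitespace words, since 512 tokens rarely maps cleanly onto a fixed word count.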

For inference aggregation, we can later add another node, which we may call InferenceAggregator. It will aggregate the Analyzer results on the text chunks to compute the final inference.
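As a sketch of the aggregation step, the snippet below implements the two simple strategies mentioned above (averaging the per-label scores, and majority voting on each chunk's top label). The class name, strategy names, and score-dict shape are all assumptions for illustration.

```python
# Hypothetical InferenceAggregator: combines per-chunk classification
# scores (label -> probability) into a single final prediction.
from collections import Counter
from typing import Dict, List


class InferenceAggregator:
    def __init__(self, strategy: str = "average"):
        assert strategy in ("average", "vote")
        self.strategy = strategy

    def aggregate(self, chunk_scores: List[Dict[str, float]]) -> Dict[str, float]:
        if self.strategy == "average":
            # mean of each label's score across all chunks
            labels = chunk_scores[0].keys()
            return {
                label: sum(scores[label] for scores in chunk_scores) / len(chunk_scores)
                for label in labels
            }
        # "vote": each chunk votes for its highest-scoring label,
        # and the result is the fraction of votes per label
        votes = Counter(max(scores, key=scores.get) for scores in chunk_scores)
        total = sum(votes.values())
        return {label: count / total for label, count in votes.items()}
```

Averaging keeps score magnitudes, while voting is more robust to a single chunk with an extreme score; weighting chunks by length would be a natural refinement.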


Metadata


    Labels

    enhancement (New feature or request)
