
Conversation

@akar5h (Contributor) commented Jul 10, 2021

Minor changes to make the Analyzer work with the TextSplitter for long texts.

@akar5h (Contributor, Author) commented Jul 10, 2021

Hi, these are some minor changes undoing #113, making the analyzer work for long texts.

@akar5h (Contributor, Author) commented Jul 10, 2021

However, the processing time with the current text splitter and analyzer is large for long documents; we need to look for ways to optimize it.

@lalitpagaria (Collaborator) commented

@akar5h Thanks for the PR.
I will review this on Monday.

@lalitpagaria (Collaborator) commented

@akar5h Until we have an inference aggregator, users will never get the original output. Can you please let me know your thoughts here?

I thought we were doing this -

But now I have to clarify something. In my experience, the max_length / sequence length parameter in BertModel (BERT-like models) does not necessarily mean text length: the text length can be beyond 512 and the input tokens can still be fewer than 512.
So would it not be more optimal, instead of splitting texts, to split the tokenizer output (the tokenizer.batch_encode_plus output, i.e. input_ids and attention_mask) into sequences of length 512 and feed those splits to the model?
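The split-the-encoding idea could be sketched like this in plain Python (the function name and numbers are illustrative, not Obsei's actual API); the windows are cut from the already-encoded ids rather than the raw text:

```python
# Sketch: split an already-encoded sequence (input_ids plus attention_mask,
# as returned by tokenizer.batch_encode_plus) into windows of at most
# max_length tokens, instead of splitting the raw text beforehand.

def split_encoding(input_ids, attention_mask, max_length=512):
    """Return (ids, mask) chunks of at most max_length tokens each."""
    chunks = []
    for start in range(0, len(input_ids), max_length):
        end = start + max_length
        chunks.append((input_ids[start:end], attention_mask[start:end]))
    return chunks

# Example: a 1200-token document becomes chunks of 512, 512 and 176 tokens.
ids = list(range(1200))
mask = [1] * 1200
chunks = split_encoding(ids, mask)
print([len(c[0]) for c in chunks])  # [512, 512, 176]
```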

@akar5h (Contributor, Author) commented Jul 12, 2021

Hi, yes, that was the way to go. However, since we are working with the Pipeline and not directly with tokenizer outputs and models, we cannot implement this as is.

So would it not be more optimal that instead of splitting texts, we split the tokenizer output ( tokenizer.batch_encode_plus output) like input_ids , attention mask into sequences of length 512 and feed these splits to the model ?

There are modules in tokenizers called pre-tokenizers (tokenizers.pre_tokenizers). I used pre-tokenization to estimate where to split the document, but we can only estimate the split point rather than cut exactly at 512 tokens, since pre-tokenization does not include WordPiece tokenization.
I created an alternate text splitter using this pre-tokenize logic: https://github.com/akar5h/obsei/blob/text-splitter/obsei/preprocessor/text_splitter_v2.py - can you have a look?

There was no difference in model processing time between this proposed way and the current TextSplitter.
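As a rough illustration of that pre-tokenize estimate (a plain whitespace split stands in here for tokenizers.pre_tokenizers, and the function name and word budget are invented for the sketch): because WordPiece can break one word into several sub-tokens, a word budget below 512 only approximates the 512-token limit.

```python
# Sketch: estimate split points from whitespace word boundaries, the rough
# equivalent of what a pre-tokenizer exposes. The split is approximate: the
# true (WordPiece) token count per chunk is only known after full tokenization.

def split_by_word_estimate(text, max_words=400):
    """Split text at word boundaries into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

doc = ("lorem " * 1000).strip()        # a 1000-word document
parts = split_by_word_estimate(doc)
print(len(parts), max(len(p.split()) for p in parts))  # 3 400
```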

@akar5h (Contributor, Author) commented Jul 12, 2021

However, if you follow this link: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=n9qywopnIrJH&uniqifier=1
and scroll down to "Preprocessing the training data", you can see how the Hugging Face tokenizer provides a built-in way to split longer documents.
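The behaviour that notebook relies on (overflowing tokens with a stride) can be mimicked in plain Python on a list of token ids: consecutive windows overlap so nothing is lost at a chunk boundary. Names and numbers here are illustrative, not the notebook's code.

```python
# Sketch of sliding-window splitting with a stride: each window is at most
# max_length tokens, and consecutive windows share `stride` tokens of overlap.

def sliding_windows(token_ids, max_length=512, stride=128):
    """Return overlapping windows over token_ids."""
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows

ids = list(range(1000))
wins = sliding_windows(ids)
print(len(wins), len(wins[0]), wins[1][0])  # 3 512 384
```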

@lalitpagaria (Collaborator) commented

Cool, thanks Akarsh, I will have a look.
Please bear with me regarding the delay in review, as I am a bit busy on the personal front.

@akar5h (Contributor, Author) commented Jul 12, 2021

No problem. The last part, the InterfaceAggregator, is not so straightforward since we have multi_class labels, so I'll take some time with that myself.

@lalitpagaria (Collaborator) commented

@akar5h I had a look at your PR. I feel #113 is okay for now. If users do not want truncation, they will use the TextSplitter anyway. What we need is a merge module, i.e. a way to merge inferences after the analyzer makes them.

My idea is to have multiple configurations of the TextSplitter: honour sentence boundaries, honour passage boundaries, add the title to each chunk, etc. This will add value to users' preprocessing capabilities.

There are a few use cases some Obsei users are asking for. A major one is determining the sentiment of a customer call/chat conversation.

Let me know when you are available next week; we can schedule a call and discuss more.

@akar5h (Contributor, Author) commented Jul 14, 2021

Hi @lalitpagaria, I am finalizing the InterfaceAggregator module; it is a work in progress. I took the liberty of creating PostProcessing classes, just like the Preprocessors, for the InterfaceAggregator, and will post a PR on this by tomorrow.
The ZeroShotClassification pipeline does not leave us many options for fully utilizing transformer functionality, so aggregation is simple logic for now. We can have multiple configurations for different aggregation strategies in the future.

Yes, both sentence boundaries and paragraph boundaries can be implemented for the TextSplitter using spaCy.

My idea is to have multiple configuration of TextSplitter: honour sentence boundary, honour passage boundary, add title to each chunk, etc etc. Which will add value to user's pro processing capabilities.
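A minimal sketch of such a sentence-boundary-honouring splitter, with a naive regex standing in for spaCy's sentence segmentation (the function name and character budget are invented for illustration):

```python
import re

# Sketch: pack whole sentences into chunks without exceeding a character
# budget, so no sentence is cut mid-way. A simple regex approximates the
# sentence segmentation that spaCy would provide.

def split_on_sentences(text, max_chars=200):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

text = "First sentence here. " * 20
chunks = split_on_sentences(text)
print([len(c) for c in chunks])  # [188, 188, 41]
```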

Sure, let's discuss over a call this Saturday, or the coming Monday/Tuesday after 6?

There are few use cases some of Obsei users are asking. Major one is to determine sentiment of customer call/chat conversation.
Let me know when you are available next week we can schedule a call and discuss more.

@akar5h (Contributor, Author) commented Jul 17, 2021

Hi @lalitpagaria, can you have a look at these merge functions? The type check is failing because of TextPayload.meta vs BasePayload.meta.

Kindly suggest any changes, and advise on how to handle the type check (mypy).

@lalitpagaria (Collaborator) left a review comment

@akar5h Thank you very much for this PR.
I liked your design for the Aggregator.
I have a few superficial comments; we can merge this PR after that.

@akar5h (Contributor, Author) commented Jul 21, 2021

The suggested changes:

  • InferenceAggregator name change 😅
  • chunk ID and total chunks in the splitter
  • remove this method if there is no custom initialisation in the constructor
  • add type checks
  • payload_aggregator

Todos / feedback required (@lalitpagaria, please comment):

  • finalize between most_common and max_value
  • standardise Meta, or another hack for meta (vars(meta) is not working)
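For reference, the two candidate aggregation strategies could look roughly like this (illustrative sketch with invented names, not the PR's actual code):

```python
from collections import Counter

# Each chunk of a split document yields a (label, score) prediction; the
# aggregator must reduce these to one document-level label.

def aggregate_most_common(chunk_predictions):
    """Majority vote over the per-chunk labels."""
    labels = [label for label, _ in chunk_predictions]
    return Counter(labels).most_common(1)[0][0]

def aggregate_max_value(chunk_predictions):
    """Label of the single highest-scoring chunk."""
    return max(chunk_predictions, key=lambda p: p[1])[0]

# The two strategies can disagree: a single very confident chunk wins under
# max_value, while the label seen in most chunks wins under most_common.
preds = [("positive", 0.55), ("negative", 0.98), ("positive", 0.60)]
print(aggregate_most_common(preds))  # positive
print(aggregate_max_value(preds))    # negative
```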

@lalitpagaria (Collaborator) commented

@akar5h I am making a few changes to your PR. I have some uncommitted changes; I will push them tomorrow.

@lalitpagaria (Collaborator) commented

@akar5h I have made some refactoring and added it to your PR.
Please see if these changes make sense.

@shahrukhx01 Would you like to take a look into changes as well?

@lalitpagaria lalitpagaria added the "breaking changes" label Jul 26, 2021
@lalitpagaria lalitpagaria changed the title from "TextSplitter and Analyzer merge #163, #164" to "Adding InferenceAggregator" Jul 26, 2021
@lalitpagaria lalitpagaria linked an issue Jul 26, 2021 that may be closed by this pull request
@shahrukhx01 (Collaborator) commented

LGTM

@akar5h (Contributor, Author) commented Jul 26, 2021

the refactor makes sense. LGTM.

@lalitpagaria (Collaborator) left a review comment

LGTM!

@lalitpagaria lalitpagaria merged commit 9273f19 into obsei:master Jul 26, 2021
@lalitpagaria lalitpagaria added the "enhancement" label Oct 5, 2021
Labels: breaking changes, enhancement, post-processor
Successfully merging this pull request may close issue: Add InferenceAggregator node