
Conversation

@akar5h (Contributor) commented Jul 10, 2021

Minor changes to make the Analyzer work with the TextSplitter for long texts.

@akar5h (Contributor, Author) commented Jul 10, 2021

Hi, these are some minor changes undoing #113, making the analyzer work for long texts.

@akar5h (Contributor, Author) commented Jul 10, 2021

However, the processing time with the current text splitter and analyzer is large for long documents; we need to look for ways to optimize it.

@lalitpagaria (Collaborator) commented

@akar5h Thanks for the PR.
I will review this on Monday.

@lalitpagaria (Collaborator) commented

@akar5h Until we have an inference aggregator, users will never get the original output. Can you please let me know your thoughts here?

I thought we were doing this -

But now I have to clarify something. In my experience, the max_length / sequence length parameter in BertModel (BERT-like models) does not necessarily mean text length: the text length can be beyond 512 and the input tokens can still be fewer than 512.
So would it not be more optimal, instead of splitting texts, to split the tokenizer output (the tokenizer.batch_encode_plus output, i.e. input_ids and attention_mask) into sequences of length 512 and feed those splits to the model?
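The split-the-encoding idea could be sketched like this in plain Python (the function name and numbers are illustrative, not Obsei's actual API); the windows are cut from the already-encoded ids rather than the raw text:

```python
# Sketch: split an already-encoded sequence (input_ids plus attention_mask,
# as returned by tokenizer.batch_encode_plus) into windows of at most
# max_length tokens, instead of splitting the raw text beforehand.

def split_encoding(input_ids, attention_mask, max_length=512):
    """Return (ids, mask) chunks of at most max_length tokens each."""
    chunks = []
    for start in range(0, len(input_ids), max_length):
        end = start + max_length
        chunks.append((input_ids[start:end], attention_mask[start:end]))
    return chunks

# Example: a 1200-token document becomes chunks of 512, 512 and 176 tokens.
ids = list(range(1200))
mask = [1] * 1200
chunks = split_encoding(ids, mask)
print([len(c[0]) for c in chunks])  # [512, 512, 176]
```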

@akar5h (Contributor, Author) commented Jul 12, 2021

Hi, yes, that was the way to go. However, since we are working with the Pipeline and not directly with tokenizer outputs and models, we cannot implement this as is.

So would it not be more optimal that instead of splitting texts, we split the tokenizer output ( tokenizer.batch_encode_plus output) like input_ids , attention mask into sequences of length 512 and feed these splits to the model ?

There are modules in tokenizers called pre-tokenizers (tokenizers.pre_tokenizers). I used pre-tokenization to estimate where to split the document, but we can only estimate the split point rather than cut exactly at 512 tokens, since pre-tokenization does not include WordPiece tokenization.
I created an alternate text splitter using this pre-tokenize logic: https://github.com/akar5h/obsei/blob/text-splitter/obsei/preprocessor/text_splitter_v2.py - can you have a look?

There was no difference in model processing time between this proposed way and the current TextSplitter.
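As a rough illustration of that pre-tokenize estimate (a plain whitespace split stands in here for tokenizers.pre_tokenizers, and the function name and word budget are invented for the sketch): because WordPiece can break one word into several sub-tokens, a word budget below 512 only approximates the 512-token limit.

```python
# Sketch: estimate split points from whitespace word boundaries, the rough
# equivalent of what a pre-tokenizer exposes. The split is approximate: the
# true (WordPiece) token count per chunk is only known after full tokenization.

def split_by_word_estimate(text, max_words=400):
    """Split text at word boundaries into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

doc = ("lorem " * 1000).strip()        # a 1000-word document
parts = split_by_word_estimate(doc)
print(len(parts), max(len(p.split()) for p in parts))  # 3 400
```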

@akar5h (Contributor, Author) commented Jul 12, 2021

However, if you follow this link: https://colab.research.google.com/github/huggingface/notebooks/blob/master/examples/question_answering.ipynb#scrollTo=n9qywopnIrJH&uniqifier=1
and scroll down to "Preprocessing the training data", you can see how the Hugging Face tokenizer provides a built-in way to split longer documents.
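The behaviour that notebook relies on (overflowing tokens with a stride) can be mimicked in plain Python on a list of token ids: consecutive windows overlap so nothing is lost at a chunk boundary. Names and numbers here are illustrative, not the notebook's code.

```python
# Sketch of sliding-window splitting with a stride: each window is at most
# max_length tokens, and consecutive windows share `stride` tokens of overlap.

def sliding_windows(token_ids, max_length=512, stride=128):
    """Return overlapping windows over token_ids."""
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows

ids = list(range(1000))
wins = sliding_windows(ids)
print(len(wins), len(wins[0]), wins[1][0])  # 3 512 384
```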

@lalitpagaria (Collaborator) commented

Cool, thanks Akarsh, I will have a look.
Please bear with me regarding the delay in review, as I am a bit busy on the personal front.

@akar5h (Contributor, Author) commented Jul 12, 2021

No problem. The last part, the InterfaceAggregator, is not so straightforward since we have multi_class labels, so I'll take some time with that myself.

@lalitpagaria (Collaborator) commented

@akar5h I had a look at your PR. I feel #113 is okay for now. If users do not want truncation, they will use the TextSplitter anyway. What we need is a merge module, i.e. a way to merge inferences after the analyzer makes them.

My idea is to have multiple configurations of the TextSplitter: honour sentence boundaries, honour passage boundaries, add the title to each chunk, etc. This will add value to users' preprocessing capabilities.

There are a few use cases some Obsei users are asking for. A major one is determining the sentiment of a customer call/chat conversation.

Let me know when you are available next week; we can schedule a call and discuss more.

@akar5h (Contributor, Author) commented Jul 14, 2021

Hi @lalitpagaria, I am finalizing the InterfaceAggregator module; it is a work in progress. I took the liberty of creating PostProcessing classes, just like the Preprocessors, for the InterfaceAggregator, and will post a PR on this by tomorrow.
The ZeroShotClassification pipeline does not leave us many options for fully utilizing transformer functionality, so aggregation is simple logic for now. We can have multiple configurations for different aggregation strategies in the future.

Yes, both sentence boundaries and paragraph boundaries can be implemented for the TextSplitter using spaCy.

My idea is to have multiple configuration of TextSplitter: honour sentence boundary, honour passage boundary, add title to each chunk, etc etc. Which will add value to user's pro processing capabilities.
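A minimal sketch of such a sentence-boundary-honouring splitter, with a naive regex standing in for spaCy's sentence segmentation (the function name and character budget are invented for illustration):

```python
import re

# Sketch: pack whole sentences into chunks without exceeding a character
# budget, so no sentence is cut mid-way. A simple regex approximates the
# sentence segmentation that spaCy would provide.

def split_on_sentences(text, max_chars=200):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

text = "First sentence here. " * 20
chunks = split_on_sentences(text)
print([len(c) for c in chunks])  # [188, 188, 41]
```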

Sure, let's discuss over a call this Saturday, or the coming Monday/Tuesday after 6?

There are few use cases some of Obsei users are asking. Major one is to determine sentiment of customer call/chat conversation.
Let me know when you are available next week we can schedule a call and discuss more.

@akar5h (Contributor, Author) commented Jul 17, 2021

Hi @lalitpagaria, can you have a look at these merge functions? The type check is failing because of TextPayload.meta vs BasePayload.meta.

Kindly suggest any changes, and advise on how to handle the type check (mypy).

@lalitpagaria (Collaborator) left a review comment

@akar5h Thank you very much for this PR.
I liked your design for the Aggregator.
I have a few superficial comments; we can merge this PR after that.

@akar5h (Contributor, Author) commented Jul 21, 2021

The suggested changes:

  • InferenceAggregator name change 😅
  • chunk ID and total chunks in the splitter
  • remove this method if there is no custom initialisation in the constructor
  • add type checks
  • payload_aggregator

Todos / feedback required (@lalitpagaria, please comment):

  • finalize between most_common and max_value
  • standardise Meta, or another hack for meta (vars(meta) is not working)
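For reference, the two candidate aggregation strategies could look roughly like this (illustrative sketch with invented names, not the PR's actual code):

```python
from collections import Counter

# Each chunk of a split document yields a (label, score) prediction; the
# aggregator must reduce these to one document-level label.

def aggregate_most_common(chunk_predictions):
    """Majority vote over the per-chunk labels."""
    labels = [label for label, _ in chunk_predictions]
    return Counter(labels).most_common(1)[0][0]

def aggregate_max_value(chunk_predictions):
    """Label of the single highest-scoring chunk."""
    return max(chunk_predictions, key=lambda p: p[1])[0]

# The two strategies can disagree: a single very confident chunk wins under
# max_value, while the label seen in most chunks wins under most_common.
preds = [("positive", 0.55), ("negative", 0.98), ("positive", 0.60)]
print(aggregate_most_common(preds))  # positive
print(aggregate_max_value(preds))    # negative
```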

@lalitpagaria (Collaborator) commented

@akar5h I am making a few changes to your PR. I have some uncommitted changes; I will push them tomorrow.

@lalitpagaria (Collaborator) commented

@akar5h I have made some refactoring and added it to your PR.
Please see if these changes make sense.

@shahrukhx01 Would you like to take a look into changes as well?

@lalitpagaria lalitpagaria added the "breaking changes" label Jul 26, 2021
@lalitpagaria lalitpagaria changed the title from "TextSplitter and Analyzer merge #163, #164" to "Adding InferenceAggregator" Jul 26, 2021
@lalitpagaria lalitpagaria linked an issue Jul 26, 2021 that may be closed by this pull request
@shahrukhx01 (Collaborator) commented

LGTM

@akar5h (Contributor, Author) commented Jul 26, 2021

the refactor makes sense. LGTM.

@lalitpagaria (Collaborator) left a review comment

LGTM!

@lalitpagaria lalitpagaria merged commit 9273f19 into obsei:master Jul 26, 2021
@lalitpagaria lalitpagaria added the "enhancement" label Oct 5, 2021
Labels: breaking changes, enhancement, post-processor
Successfully merging this pull request may close issue: Add InferenceAggregator node