Add analyzer batching #118
Conversation
@shahrukhx01 design-wise your changes are good. It is nice that you have used a Generator. It seems a few transformers pipelines already support batching, so we can rely on their strategy for batching. In this case we can pass List[str] to the transformers pipeline.
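A minimal sketch of that idea, assuming a Hugging Face transformers pipeline (the pipeline task, texts, and batch size are illustrative, not the obsei code):

```python
from transformers import pipeline

# Illustrative pipeline; obsei's analyzers wrap similar pipelines internally.
classifier = pipeline("sentiment-analysis")

texts = [
    "The delivery was quick and the product works great.",
    "Support never replied to my ticket.",
]

# Passing a List[str] lets the pipeline handle batching itself;
# batch_size controls how many texts go through the model at once.
results = classifier(texts, batch_size=8)
for text, result in zip(texts, results):
    print(text, "->", result)
```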
@lalitpagaria Batching on GPU does make the inference roughly 4x faster for larger batches. See here: should I go ahead and make changes for the other analyzers? Any changes/recommendations to the current code before I do that?
Nice. Can you please add one more test to your benchmark: generate text randomly, just to rule out any intermediate caching.
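One way to generate random texts for such a benchmark (purely illustrative, not the code used in this PR):

```python
import random
import string

def random_texts(n: int, words_per_text: int = 12, word_len: int = 6) -> list:
    """Generate n pseudo-random texts so repeated runs cannot hit a cache."""
    texts = []
    for _ in range(n):
        words = [
            "".join(random.choices(string.ascii_lowercase, k=word_len))
            for _ in range(words_per_text)
        ]
        texts.append(" ".join(words))
    return texts

# Example: 64 unique texts to feed the batched analyzer.
samples = random_texts(64)
```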
Just one small comment, otherwise it looks fine to me.
@lalitpagaria Generating a new random text for each entry gives slower performance; however, larger batches are still approx. 2x faster than using a batch size of 1.
All fine. Please add it to the other analyzers.
@lalitpagaria CI would fail in some instances because some test cases are based on single-instance inference, e.g., the NER analyzer.
Update: I have added batching to all transformers-based analyzers. For the Vader- and Presidio-based analyzers I'll have to double-check whether batching is supported and what the optimal way of doing it there is.
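For reference, the batching pattern described here could look roughly like this (the helper name and structure are illustrative, not the exact obsei implementation):

```python
from typing import Generator, List

def batchify(texts: List[str], batch_size: int) -> Generator[List[str], None, None]:
    """Yield successive batches of texts for the analyzer to process."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]

# Each batch can then be passed as a List[str] to a transformers pipeline,
# while analyzers without batch support fall back to per-text calls.
```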
Vader doesn't support a batch call:

```python
def polarity_scores(self, text):
    """
    Return a float for sentiment strength based on the input text.
    Positive values are positive valence, negative value are negative
    valence.
    """
    # convert emojis to their textual descriptions
    text_no_emoji = ""
    prev_space = True
    for chr in text:
        if chr in self.emojis:
            # get the textual description
            description = self.emojis[chr]
            if not prev_space:
                text_no_emoji += ' '
            text_no_emoji += description
            prev_space = False
        else:
            text_no_emoji += chr
            prev_space = chr == ' '
    text = text_no_emoji.strip()

    sentitext = SentiText(text)
```
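Since Vader only scores one text at a time, a batched analyzer would have to fall back to a per-text loop, roughly like this (a sketch, not the obsei code; the helper name is made up):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def score_batch(texts):
    # Vader has no batch API, so "batching" is just iterating over the texts.
    return [analyzer.polarity_scores(text) for text in texts]

scores = score_batch(["I love this!", "This is terrible."])
```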
Presidio doesn't support a batch call either:

```python
def analyze(
    self,
    text: str,
    language: str,
    entities: Optional[List[str]] = None,
    correlation_id: Optional[str] = None,
    score_threshold: Optional[float] = None,
    return_decision_process: Optional[bool] = False,
    ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
) -> List[RecognizerResult]:
```

Neither does the Anonymizer:

```python
def anonymize(
    self,
    text: str,
    analyzer_results: List[RecognizerResult],
    operators: Optional[Dict[str, OperatorConfig]] = None
) -> EngineResult:
```
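Likewise, a batch of texts would have to be processed one text at a time with Presidio; a rough sketch under that assumption (the helper name is made up):

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def anonymize_batch(texts, language="en"):
    # No batch API: analyze and anonymize each text individually.
    outputs = []
    for text in texts:
        results = analyzer.analyze(text=text, language=language)
        outputs.append(
            anonymizer.anonymize(text=text, analyzer_results=results).text
        )
    return outputs
```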
@lalitpagaria I have updated the test cases and added batching to all transformer analyzers; for the others it's not possible, as those libraries expect one text at a time. Please review and let me know if any change is needed.
Overall the PR is good. Thanks for working on it.
This is happening because of the last change, let me take a look.
@lalitpagaria fixed, return type for classification pipeline of NER is
@shahrukhx01 Thanks for the PR.
@lalitpagaria you are welcome! :)
@lalitpagaria I have added batching on ZeroShotAnalyzer to get your feedback on the changes I made and the design in general. I used the following code to benchmark performance; it seems a batch size of 1 outperforms larger batches, which is counter-intuitive. This may change on GPU; I have only tested on local CPU.
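The original benchmark snippet did not survive in this thread; a comparable sketch using the transformers zero-shot pipeline directly (model, labels, texts, and batch sizes are all illustrative assumptions) would be:

```python
import time
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
labels = ["positive", "negative", "neutral"]
texts = ["Sample review text number %d about the product." % i for i in range(64)]

# Time the same workload at different batch sizes to compare throughput.
for batch_size in (1, 8, 32):
    start = time.perf_counter()
    classifier(texts, candidate_labels=labels, batch_size=batch_size)
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.2f}s")
```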