
Conversation

@shahrukhx01 (Collaborator) commented Jun 2, 2021

@lalitpagaria I have added batching to ZeroShotAnalyzer to get your feedback on the changes I made and on the design in general. I used the following code to benchmark performance. Surprisingly, a batch size of 1 outperforms larger batches, which is counter-intuitive; this may change on GPU, as I have only tested on a local CPU.

# NOTE: the import paths below are indicative and may differ across obsei versions.
from obsei.analyzer.classification_analyzer import (
    ClassificationAnalyzerConfig,
    ZeroShotClassificationAnalyzer,
)
from obsei.analyzer.base_analyzer import AnalyzerRequest

if __name__ == "__main__":
    import timeit  # for benchmarking execution time

    start = timeit.default_timer()
    GOOD_TEXT = """If anyone is interested... these are our hosts. I can’t recommend them enough, Abc & Pbc.

    The unit is just lovely, you go to sleep & wake up to this incredible place, and you have use of a Weber grill and a ridiculously indulgent hot-tub under the stars"""

    BAD_TEXT = """I had the worst experience ever with XYZ in Egypt. Bad Cars, asking to pay in cash,  do not have enough fuel,  do not open AC,  wait far away from my location until the trip is cancelled,  call and ask about the destination then cancel, and more. Worst service."""

    MIXED_TEXT = """I am mixed"""

    # 24 repetitions of the three sample texts -> 72 texts in total
    TEXTS = [GOOD_TEXT, BAD_TEXT, MIXED_TEXT] * 24
    zero_shot_analyzer = ZeroShotClassificationAnalyzer(
        model_name_or_path="typeform/mobilebert-uncased-mnli",
    )
    batch_size = 4
    labels = ["facility", "food", "comfortable", "positive", "negative"]

    source_responses = [
        AnalyzerRequest(processed_text=text, source_name="sample") for text in TEXTS
    ]
    analyzer_responses = zero_shot_analyzer.analyze_input(
        source_response_list=source_responses,
        analyzer_config=ClassificationAnalyzerConfig(
            labels=labels, batch_size=batch_size
        ),
    )
    stop = timeit.default_timer()

    print("Time: ", stop - start)
    print("Batch size: ", batch_size)
    print("Total Texts: ", len(TEXTS))
    """for res in analyzer_responses:
        print(res.segmented_data)
    """
    assert len(analyzer_responses) == len(TEXTS)

    for analyzer_response in analyzer_responses:
        assert len(analyzer_response.segmented_data) == len(labels)
        assert "positive" in analyzer_response.segmented_data
        assert "negative" in analyzer_response.segmented_data

@lalitpagaria (Collaborator)

@shahrukhx01 design-wise your changes are good. It is nice that you have used a generator.
Regarding performance for batch_size > 1, can you please check on Colab with a GPU environment?

It seems a few transformers pipelines already support batching, so we can rely on their batching strategy. In that case we can pass a List[str] to the transformers pipeline.
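
For reference, a minimal sketch (not the PR's code) of delegating batching to the pipeline itself; the model name is the one from the benchmark above, and the texts and labels are just placeholder examples:

from transformers import pipeline

# The zero-shot pipeline accepts a list of texts directly, so batching can be
# delegated to transformers itself.
classifier = pipeline(
    "zero-shot-classification",
    model="typeform/mobilebert-uncased-mnli",
)

texts = ["The room was lovely and spotless.", "Worst service I have ever had."]
labels = ["positive", "negative"]

# Passing List[str] returns one result dict (labels + scores) per input text.
results = classifier(texts, candidate_labels=labels)
for text, result in zip(texts, results):
    print(text, "->", dict(zip(result["labels"], result["scores"])))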

@lalitpagaria linked an issue Jun 2, 2021 that may be closed by this pull request
@shahrukhx01 (Collaborator, Author) commented Jun 2, 2021

@lalitpagaria Batching on GPU does make the inference roughly 4x faster for larger batches. See here:
Analyzer Benchmarking Colab Notebook

Should I go ahead and make the changes for the other analyzers? Any changes or recommendations for the current code before I do that?

@lalitpagaria (Collaborator)

Nice. Can you please add one more test to your benchmark: generate the text randomly, just to rule out any intermediate caching.
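
A rough sketch of how the benchmark texts could be generated randomly (the word length, word count, and total number of texts are arbitrary assumptions):

import random
import string

def random_text(num_words: int = 40, word_len: int = 8) -> str:
    # Build a pseudo-sentence of random lowercase "words" so no two inputs repeat.
    words = (
        "".join(random.choices(string.ascii_lowercase, k=word_len))
        for _ in range(num_words)
    )
    return " ".join(words)

# Replace the repeated GOOD/BAD/MIXED samples with freshly generated texts.
TEXTS = [random_text() for _ in range(72)]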

@lalitpagaria (Collaborator)

Just one small comment; otherwise it looks fine to me.

@shahrukhx01 (Collaborator, Author)

Nice. Can you please add one more test to your benchmark: generate the text randomly, just to rule out any intermediate caching.

@lalitpagaria generating a fresh random text for each input does give slower performance; however, it is still approximately 2x faster than using a batch size of 1.
Colab Notebook

@lalitpagaria (Collaborator)

All fine. Please add it to the other analyzers.

@shahrukhx01 (Collaborator, Author)

@lalitpagaria CI would fail in some cases because some test cases are based on single-instance inference, e.g., the NER analyzer.

@shahrukhx01 (Collaborator, Author)

Update: I have added batching to all transformers-based analyzers. For the Vader- and Presidio-based analyzers I'll have to double-check whether batching is supported and what the optimal way to do it there would be.

@shahrukhx01 (Collaborator, Author) left a comment

Update: I have added batching to all transformers-based analyzers. For the Vader- and Presidio-based analyzers I'll have to double-check whether batching is supported and what the optimal way to do it there would be.

Vader doesn't support batched calls:

def polarity_scores(self, text):
        """
        Return a float for sentiment strength based on the input text.
        Positive values are positive valence, negative value are negative
        valence.
        """
        # convert emojis to their textual descriptions
        text_no_emoji = ""
        prev_space = True
        for chr in text:
            if chr in self.emojis:
                # get the textual description
                description = self.emojis[chr]
                if not prev_space:
                    text_no_emoji += ' '
                text_no_emoji += description
                prev_space = False
            else:
                text_no_emoji += chr
                prev_space = chr == ' '
        text = text_no_emoji.strip()

        sentitext = SentiText(text)
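
Since polarity_scores() accepts only a single string, the most an analyzer can do is iterate over the batch; a hedged sketch of that fallback (not the Vader analyzer's actual code):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()

def analyze_batch(texts):
    # "Batching" for Vader is just a Python loop over single-text calls.
    return [vader.polarity_scores(text) for text in texts]

scores = analyze_batch(["I love this place", "Terrible experience"])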

@shahrukhx01 (Collaborator, Author)

Update: I have added batching to all transformers-based analyzers. For the Vader- and Presidio-based analyzers I'll have to double-check whether batching is supported and what the optimal way to do it there would be.

Presidio doesn't support batched calls either:
Analyzer

def analyze(
        self,
        text: str,
        language: str,
        entities: Optional[List[str]] = None,
        correlation_id: Optional[str] = None,
        score_threshold: Optional[float] = None,
        return_decision_process: Optional[bool] = False,
        ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
    ) -> List[RecognizerResult]:

Anonymizer

def anonymize(
            self,
            text: str,
            analyzer_results: List[RecognizerResult],
            operators: Optional[Dict[str, OperatorConfig]] = None
    ) -> EngineResult:
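
Likewise, both Presidio engines operate on one text at a time, so a batch has to be processed in a loop; a rough sketch using the default engine configuration (not the Presidio analyzer's actual code):

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def anonymize_batch(texts, language="en"):
    # analyze() and anonymize() each take a single text, so iterate over the batch.
    anonymized = []
    for text in texts:
        results = analyzer.analyze(text=text, language=language)
        anonymized.append(anonymizer.anonymize(text=text, analyzer_results=results).text)
    return anonymized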

@shahrukhx01 (Collaborator, Author)

@lalitpagaria I have updated the test cases and added batching to all transformer-based analyzers; for the others it's not possible, as those libraries expect one text at a time. Please review and let me know if any changes are needed.

@lalitpagaria (Collaborator)

Overall the PR is good. Thanks for working on it.
I have a few minor comments.

@lalitpagaria (Collaborator)

@shahrukhx01 CI is failing https://github.com/lalitpagaria/obsei/pull/118/checks?check_run_id=2737130061

@shahrukhx01 (Collaborator, Author)

Happening because of the last change; let me take a look.

@shahrukhx01 (Collaborator, Author)

@lalitpagaria Fixed; the return type of the NER classification pipeline is List[List[Dict]]. Everything is good to go now!
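
For context, a minimal sketch of the batched output shape the fix handles (the model name and sample texts are only illustrative; the actual analyzer may load a different model):

from transformers import pipeline

ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

texts = ["Hugging Face is based in New York", "Angela Merkel visited Paris"]

# With a list input the pipeline returns List[List[Dict]]:
# one inner list of entity dicts per input text.
batched_entities = ner(texts)
for text, entities in zip(texts, batched_entities):
    print(text, [(e["word"], e["entity"]) for e in entities])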

@lalitpagaria merged commit b87cef8 into obsei:master Jun 3, 2021
@lalitpagaria (Collaborator)

@shahrukhx01 Thanks for the PR.

@shahrukhx01 (Collaborator, Author)

@lalitpagaria you are welcome! :)

@lalitpagaria added the enhancement label Oct 5, 2021
Successfully merging this pull request may close this issue: Batch call to pipeline in Analyzers