
Conversation

@shahrukhx01 (Collaborator) commented Jun 2, 2021

@lalitpagaria I have added batching to ZeroShotAnalyzer to get your feedback on the changes I made and on the design in general. I used the following code to benchmark performance. Surprisingly, a batch size of 1 outperforms larger batches, which is counter-intuitive; this may change on GPU, as I have only tested on a local CPU.

# NOTE: the import paths below are indicative and may differ across obsei versions.
from obsei.analyzer.classification_analyzer import (
    ClassificationAnalyzerConfig,
    ZeroShotClassificationAnalyzer,
)
from obsei.analyzer.base_analyzer import AnalyzerRequest

if __name__ == "__main__":
    import timeit  # for benchmarking execution time

    start = timeit.default_timer()
    GOOD_TEXT = """If anyone is interested... these are our hosts. I can’t recommend them enough, Abc & Pbc.

    The unit is just lovely, you go to sleep & wake up to this incredible place, and you have use of a Weber grill and a ridiculously indulgent hot-tub under the stars"""

    BAD_TEXT = """I had the worst experience ever with XYZ in Egypt. Bad Cars, asking to pay in cash,  do not have enough fuel,  do not open AC,  wait far away from my location until the trip is cancelled,  call and ask about the destination then cancel, and more. Worst service."""

    MIXED_TEXT = """I am mixed"""

    # 24 repetitions of the three sample texts -> 72 texts in total
    TEXTS = [GOOD_TEXT, BAD_TEXT, MIXED_TEXT] * 24
    zero_shot_analyzer = ZeroShotClassificationAnalyzer(
        model_name_or_path="typeform/mobilebert-uncased-mnli",
    )
    batch_size = 4
    labels = ["facility", "food", "comfortable", "positive", "negative"]

    source_responses = [
        AnalyzerRequest(processed_text=text, source_name="sample") for text in TEXTS
    ]
    analyzer_responses = zero_shot_analyzer.analyze_input(
        source_response_list=source_responses,
        analyzer_config=ClassificationAnalyzerConfig(
            labels=labels, batch_size=batch_size
        ),
    )
    stop = timeit.default_timer()

    print("Time: ", stop - start)
    print("Batch size: ", batch_size)
    print("Total Texts: ", len(TEXTS))
    """for res in analyzer_responses:
        print(res.segmented_data)
    """
    assert len(analyzer_responses) == len(TEXTS)

    for analyzer_response in analyzer_responses:
        assert len(analyzer_response.segmented_data) == len(labels)
        assert "positive" in analyzer_response.segmented_data
        assert "negative" in analyzer_response.segmented_data

@lalitpagaria (Collaborator)

@shahrukhx01 design-wise your changes are good. It is nice that you have used a generator.
Regarding performance for batch_size > 1, can you please check on Colab with a GPU environment?

It seems a few transformers pipelines already support batching, so we can rely on their batching strategy. In that case we can pass a List[str] to the transformers pipeline.
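
For reference, a minimal sketch (not the PR's code) of delegating batching to the pipeline itself; the model name is the one from the benchmark above, and the texts and labels are just placeholder examples:

from transformers import pipeline

# The zero-shot pipeline accepts a list of texts directly, so batching can be
# delegated to transformers itself.
classifier = pipeline(
    "zero-shot-classification",
    model="typeform/mobilebert-uncased-mnli",
)

texts = ["The room was lovely and spotless.", "Worst service I have ever had."]
labels = ["positive", "negative"]

# Passing List[str] returns one result dict (labels + scores) per input text.
results = classifier(texts, candidate_labels=labels)
for text, result in zip(texts, results):
    print(text, "->", dict(zip(result["labels"], result["scores"])))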

@lalitpagaria linked an issue Jun 2, 2021 that may be closed by this pull request
@shahrukhx01 (Collaborator, Author) commented Jun 2, 2021

@lalitpagaria Batching on GPU does make the inference roughly 4x faster for larger batches. See here:
Analyzer Benchmarking Colab Notebook

Should I go ahead and make the changes for the other analyzers? Any changes or recommendations for the current code before I do that?

@lalitpagaria (Collaborator)

Nice. Can you please add one more test to your benchmark: generate the text randomly, just to rule out any intermediate caching.
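
A rough sketch of how the benchmark texts could be generated randomly (the word length, word count, and total number of texts are arbitrary assumptions):

import random
import string

def random_text(num_words: int = 40, word_len: int = 8) -> str:
    # Build a pseudo-sentence of random lowercase "words" so no two inputs repeat.
    words = (
        "".join(random.choices(string.ascii_lowercase, k=word_len))
        for _ in range(num_words)
    )
    return " ".join(words)

# Replace the repeated GOOD/BAD/MIXED samples with freshly generated texts.
TEXTS = [random_text() for _ in range(72)]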

@lalitpagaria (Collaborator)

Just one small comment; otherwise it looks fine to me.

@shahrukhx01 (Collaborator, Author)

Nice. Can you please add one more test to your benchmark: generate the text randomly, just to rule out any intermediate caching.

@lalitpagaria generating a fresh random text for each input does give slower performance; however, it is still approximately 2x faster than using a batch size of 1.
Colab Notebook

@lalitpagaria (Collaborator)

All fine. Please add it to the other analyzers.

@shahrukhx01 (Collaborator, Author)

@lalitpagaria CI would fail in some cases because some test cases are based on single-instance inference, e.g., the NER analyzer.

@shahrukhx01 (Collaborator, Author)

Update: I have added batching to all transformers-based analyzers. For the Vader- and Presidio-based analyzers I'll have to double-check whether batching is supported and what the optimal way to do it there would be.

@shahrukhx01 (Collaborator, Author) left a comment

Update: I have added batching to all transformers-based analyzers. For the Vader- and Presidio-based analyzers I'll have to double-check whether batching is supported and what the optimal way to do it there would be.

Vader doesn't support batched calls:

def polarity_scores(self, text):
        """
        Return a float for sentiment strength based on the input text.
        Positive values are positive valence, negative value are negative
        valence.
        """
        # convert emojis to their textual descriptions
        text_no_emoji = ""
        prev_space = True
        for chr in text:
            if chr in self.emojis:
                # get the textual description
                description = self.emojis[chr]
                if not prev_space:
                    text_no_emoji += ' '
                text_no_emoji += description
                prev_space = False
            else:
                text_no_emoji += chr
                prev_space = chr == ' '
        text = text_no_emoji.strip()

        sentitext = SentiText(text)
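
Since polarity_scores() accepts only a single string, the most an analyzer can do is iterate over the batch; a hedged sketch of that fallback (not the Vader analyzer's actual code):

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()

def analyze_batch(texts):
    # "Batching" for Vader is just a Python loop over single-text calls.
    return [vader.polarity_scores(text) for text in texts]

scores = analyze_batch(["I love this place", "Terrible experience"])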

@shahrukhx01 (Collaborator, Author)

Update: I have added batching to all transformers-based analyzers. For the Vader- and Presidio-based analyzers I'll have to double-check whether batching is supported and what the optimal way to do it there would be.

Presidio doesn't support batched calls either:
Analyzer

def analyze(
        self,
        text: str,
        language: str,
        entities: Optional[List[str]] = None,
        correlation_id: Optional[str] = None,
        score_threshold: Optional[float] = None,
        return_decision_process: Optional[bool] = False,
        ad_hoc_recognizers: Optional[List[EntityRecognizer]] = None,
    ) -> List[RecognizerResult]:

Anonymizer

def anonymize(
            self,
            text: str,
            analyzer_results: List[RecognizerResult],
            operators: Optional[Dict[str, OperatorConfig]] = None
    ) -> EngineResult:
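
Likewise, both Presidio engines operate on one text at a time, so a batch has to be processed in a loop; a rough sketch using the default engine configuration (not the Presidio analyzer's actual code):

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def anonymize_batch(texts, language="en"):
    # analyze() and anonymize() each take a single text, so iterate over the batch.
    anonymized = []
    for text in texts:
        results = analyzer.analyze(text=text, language=language)
        anonymized.append(anonymizer.anonymize(text=text, analyzer_results=results).text)
    return anonymized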

@shahrukhx01 (Collaborator, Author)

@lalitpagaria I have updated the test cases and added batching to all transformer-based analyzers; for the others it's not possible, as those libraries expect one text at a time. Please review and let me know if any changes are needed.

@lalitpagaria (Collaborator)

Overall the PR is good. Thanks for working on it.
I have a few minor comments.

@lalitpagaria (Collaborator)

@shahrukhx01 CI is failing https://github.com/lalitpagaria/obsei/pull/118/checks?check_run_id=2737130061

@shahrukhx01 (Collaborator, Author)

Happening because of the last change; let me take a look.

@shahrukhx01 (Collaborator, Author)

@lalitpagaria Fixed; the return type of the NER classification pipeline is List[List[Dict]]. Everything is good to go now!
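
For context, a minimal sketch of the batched output shape the fix handles (the model name and sample texts are only illustrative; the actual analyzer may load a different model):

from transformers import pipeline

ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

texts = ["Hugging Face is based in New York", "Angela Merkel visited Paris"]

# With a list input the pipeline returns List[List[Dict]]:
# one inner list of entity dicts per input text.
batched_entities = ner(texts)
for text, entities in zip(texts, batched_entities):
    print(text, [(e["word"], e["entity"]) for e in entities])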

@lalitpagaria merged commit b87cef8 into obsei:master Jun 3, 2021
@lalitpagaria (Collaborator)

@shahrukhx01 Thanks for the PR.

@shahrukhx01 (Collaborator, Author)

@lalitpagaria you are welcome! :)

@lalitpagaria added the enhancement label Oct 5, 2021
Successfully merging this pull request may close this issue: Batch call to pipeline in Analyzers