Conversation

@kevinhu (Contributor) commented on Jun 7, 2025:

This PR makes a couple of small refactors to improve the performance of logit processing. On an L40S GPU, this translates to a roughly 3x speedup.

  1. In `get_token_spans`, repeated `getattr` calls to `tokenizer.cls_token`, `tokenizer.sep_token`, and `tokenizer.pad_token` dominate the runtime; they are replaced with a one-time lookup when the tokenizer is first instantiated.
  2. In `token_to_char_probs`, the per-token for loop assigning `char_probs` is replaced with vectorized indexing.
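The second refactor can be sketched as follows. This is an illustrative NumPy sketch (not the PR's actual code; the function name, mask construction, and the assumption that special tokens are skipped via a membership check are mine) of replacing a per-token loop with a single fancy-indexing assignment:

```python
import numpy as np

def token_to_char_probs_sketch(text, tokens, token_logits, special_tokens, offsets_mapping):
    """Sketch: scatter each non-special token's logits onto the last
    character of its span with one vectorized assignment instead of a
    Python-level for loop."""
    char_probs = np.full((len(text), token_logits.shape[1]), -np.inf)
    # Boolean mask of "real" tokens (not CLS/SEP/PAD).
    valid = np.array([t not in special_tokens for t in tokens])
    # Exclusive end offset of each valid token span -> index of its last char.
    end_indices = np.array([end for (_, end) in offsets_mapping])[valid] - 1
    # Single vectorized assignment replaces the per-token loop body.
    char_probs[end_indices] = token_logits[valid]
    return char_probs
```

The speedup comes from moving the scatter into one NumPy call, so the work happens in C rather than in a Python loop over tokens.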

Traces from pyinstrument:

Before: [screenshot: pyinstrument trace, 2025-06-06 16:51]

After: [screenshot: pyinstrument trace, 2025-06-06 16:50]

@markus583 markus583 requested review from Copilot and markus583 June 23, 2025 03:05
@markus583 markus583 self-assigned this Jun 23, 2025
@Copilot left a comment (marked as outdated).
@markus583 markus583 requested a review from Copilot June 23, 2025 03:15
@Copilot (Copilot AI) left a comment:


Pull Request Overview

This PR refactors token postprocessing to improve performance by caching special tokens and vectorizing the assignment of character probabilities. Key changes include:

  • Caching tokenizer special tokens to avoid repeated getattr calls.
  • Replacing an iterative loop with vectorized indexing in token_to_char_probs.
  • Updating function calls across multiple modules to use the new special_tokens parameter.
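The special-token caching mentioned above follows a common pattern that can be sketched like this. The wrapper class and method below are illustrative, not wtpsplit's actual classes; the assumption is a Hugging Face-style tokenizer exposing `cls_token`, `sep_token`, and `pad_token` attributes:

```python
class TokenizerWrapper:
    """Sketch of the caching pattern: resolve the tokenizer's special
    tokens once at construction instead of via repeated getattr calls
    inside the per-batch hot path."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # One-time lookup; attribute access on HF tokenizers goes through
        # non-trivial property logic, so doing it per call is costly.
        self.special_tokens = {
            token
            for token in (
                getattr(tokenizer, "cls_token", None),
                getattr(tokenizer, "sep_token", None),
                getattr(tokenizer, "pad_token", None),
            )
            if token is not None
        }

    def get_token_spans(self, tokens, offsets_mapping):
        # Hot path now does cheap membership tests on a precomputed set.
        return [
            (start, end)
            for token, (start, end) in zip(tokens, offsets_mapping)
            if token not in self.special_tokens
        ]
```

Passing the precomputed `special_tokens` set down to `token_to_char_probs` is then just an extra parameter on the call sites, which is what the per-file changes above reflect.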

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `wtpsplit/utils/__init__.py` | Updated `get_token_spans` and `token_to_char_probs` for the efficiency refactor. |
| `wtpsplit/train/evaluate.py` | Adjusted the `token_to_char_probs` call to pass the cached `special_tokens`. |
| `wtpsplit/evaluation/intrinsic_pairwise.py` | Modified the `token_to_char_probs` call to include the new `special_tokens`. |
| `wtpsplit/evaluation/adapt.py` | Updated `token_to_char_probs` calls with the new `special_tokens` parameter. |
| `wtpsplit/__init__.py` | Cached `special_tokens` in the initializer and updated usage accordingly. |
Comments suppressed due to low confidence (1)

`wtpsplit/utils/__init__.py:445`

  • Consider updating the function docstring to specify that the `special_tokens` parameter should be a list of tokens, explaining its role in replacing repeated `getattr` calls.

`def token_to_char_probs(text, tokens, token_logits, special_tokens, offsets_mapping):`

@markus583 (Collaborator) commented:
Thanks for the suggestions, looks all good to me! I will merge it.

@markus583 markus583 merged commit 8733cbf into segment-any-text:main Jun 23, 2025