Conversation

@kevinhu (Contributor) commented on Jun 7, 2025:

This PR makes a couple of small refactors to improve the performance of logit processing. On an L40S GPU, this translates to a roughly 3x speedup.

  1. In `get_token_spans`, repeated `getattr` calls to `tokenizer.cls_token`, `tokenizer.sep_token`, and `tokenizer.pad_token` dominate the runtime; they are replaced with a one-time lookup when the tokenizer is first instantiated.
  2. In `token_to_char_probs`, the per-token for loop assigning `char_probs` is replaced with vectorized indexing.
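The second refactor can be sketched as follows. This is an illustrative NumPy sketch (not the PR's actual code; the function name, mask construction, and the assumption that special tokens are skipped via a membership check are mine) of replacing a per-token loop with a single fancy-indexing assignment:

```python
import numpy as np

def token_to_char_probs_sketch(text, tokens, token_logits, special_tokens, offsets_mapping):
    """Sketch: scatter each non-special token's logits onto the last
    character of its span with one vectorized assignment instead of a
    Python-level for loop."""
    char_probs = np.full((len(text), token_logits.shape[1]), -np.inf)
    # Boolean mask of "real" tokens (not CLS/SEP/PAD).
    valid = np.array([t not in special_tokens for t in tokens])
    # Exclusive end offset of each valid token span -> index of its last char.
    end_indices = np.array([end for (_, end) in offsets_mapping])[valid] - 1
    # Single vectorized assignment replaces the per-token loop body.
    char_probs[end_indices] = token_logits[valid]
    return char_probs
```

The speedup comes from moving the scatter into one NumPy call, so the work happens in C rather than in a Python loop over tokens.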

Traces from pyinstrument:

Before: [screenshot: pyinstrument trace, 2025-06-06 16:51]

After: [screenshot: pyinstrument trace, 2025-06-06 16:50]

@markus583 markus583 requested review from Copilot and markus583 June 23, 2025 03:05
@markus583 markus583 self-assigned this Jun 23, 2025
@Copilot left a comment (marked as outdated).
@markus583 markus583 requested a review from Copilot June 23, 2025 03:15
@Copilot (Copilot AI) left a comment:


Pull Request Overview

This PR refactors token postprocessing to improve performance by caching special tokens and vectorizing the assignment of character probabilities. Key changes include:

  • Caching tokenizer special tokens to avoid repeated getattr calls.
  • Replacing an iterative loop with vectorized indexing in token_to_char_probs.
  • Updating function calls across multiple modules to use the new special_tokens parameter.
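The special-token caching mentioned above follows a common pattern that can be sketched like this. The wrapper class and method below are illustrative, not wtpsplit's actual classes; the assumption is a Hugging Face-style tokenizer exposing `cls_token`, `sep_token`, and `pad_token` attributes:

```python
class TokenizerWrapper:
    """Sketch of the caching pattern: resolve the tokenizer's special
    tokens once at construction instead of via repeated getattr calls
    inside the per-batch hot path."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        # One-time lookup; attribute access on HF tokenizers goes through
        # non-trivial property logic, so doing it per call is costly.
        self.special_tokens = {
            token
            for token in (
                getattr(tokenizer, "cls_token", None),
                getattr(tokenizer, "sep_token", None),
                getattr(tokenizer, "pad_token", None),
            )
            if token is not None
        }

    def get_token_spans(self, tokens, offsets_mapping):
        # Hot path now does cheap membership tests on a precomputed set.
        return [
            (start, end)
            for token, (start, end) in zip(tokens, offsets_mapping)
            if token not in self.special_tokens
        ]
```

Passing the precomputed `special_tokens` set down to `token_to_char_probs` is then just an extra parameter on the call sites, which is what the per-file changes above reflect.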

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `wtpsplit/utils/__init__.py` | Updated `get_token_spans` and `token_to_char_probs` for the efficiency refactor. |
| `wtpsplit/train/evaluate.py` | Adjusted the `token_to_char_probs` call to pass the cached `special_tokens`. |
| `wtpsplit/evaluation/intrinsic_pairwise.py` | Modified the `token_to_char_probs` call to include the new `special_tokens`. |
| `wtpsplit/evaluation/adapt.py` | Updated `token_to_char_probs` calls with the new `special_tokens` parameter. |
| `wtpsplit/__init__.py` | Cached `special_tokens` in the initializer and updated usage accordingly. |
Comments suppressed due to low confidence (1)

`wtpsplit/utils/__init__.py:445`

  • Consider updating the function docstring to specify that the `special_tokens` parameter should be a list of tokens, explaining its role in replacing repeated `getattr` calls.

`def token_to_char_probs(text, tokens, token_logits, special_tokens, offsets_mapping):`

@markus583 (Collaborator) commented:
Thanks for the suggestions, looks all good to me! I will merge it.

@markus583 markus583 merged commit 8733cbf into segment-any-text:main Jun 23, 2025