
Conversation

pjs102793 (Contributor)

This PR replaces `if not tokenizer:` with `if tokenizer is None:` to avoid an unnecessary call to the tokenizer's `__len__` method, improving efficiency.
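Schematically, the patch is the one-line guard swap described above (shown as a bare diff; the surrounding context in wtpsplit's source is omitted):

```diff
- if not tokenizer:
+ if tokenizer is None:
```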

```python
>>> from wtpsplit import SaT

# PyTorch GPU
>>> sat_sm = SaT("sat-3l-sm")
>>> sat_sm.half().to("cuda")  # optional: half precision on CUDA
>>> text = "this is a test this is another test"
>>> %timeit list(sat_sm.split(text))
# 52.7 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# PR patch, PyTorch GPU
>>> %timeit list(sat_sm.split(text))
# 2.45 ms ± 24.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# ONNX CPU
>>> model_ort = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> %timeit list(model_ort.split(text))
# 59.3 ms ± 1.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# PR patch, ONNX CPU
>>> %timeit list(model_ort.split(text))
# 14.7 ms ± 699 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

When `if not tokenizer:` is executed, Python's truthiness test falls back to the class's `__len__` special method (since no `__bool__` is defined), incurring a fixed overhead of approximately 40 ms per call.

To avoid this overhead, we check for the tokenizer's presence with a direct `is None` comparison instead, eliminating this bottleneck.
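The fallback is easy to reproduce in isolation. Below is a minimal, self-contained sketch of the mechanism; `SlowLenTokenizer` is a hypothetical stand-in, not wtpsplit's actual tokenizer class:

```python
import time

class SlowLenTokenizer:
    """Toy stand-in for a tokenizer whose __len__ is expensive
    (e.g. it has to materialize a large vocabulary)."""

    def __len__(self):
        time.sleep(0.04)  # simulate the ~40 ms fixed cost observed in the PR
        return 250_000

tok = SlowLenTokenizer()

# Truthiness check: with no __bool__ defined, Python falls back to __len__,
# so this branch pays the full ~40 ms even though tok is clearly "present".
start = time.perf_counter()
if not tok:
    pass
print(f"if not tok:     {time.perf_counter() - start:.4f} s")

# Identity check: compares object identity only; __len__ is never called.
start = time.perf_counter()
if tok is None:
    pass
print(f"if tok is None: {time.perf_counter() - start:.6f} s")
```

Because `is None` compares object identity, it runs in constant time no matter how expensive `__len__` is.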

markus583 (Collaborator)

What a nice find, thanks a lot for this! I will merge it.
