🏷️ Add chunker types #762

evaline-ju · 2024-09-04T17:02:51Z

What this PR does / why we need it:

"Chunkers" can be thought of as servers that "chunk" the target modality. For text, these can be tokenizers or sentence splitters. Reference: https://github.com/foundation-model-stack/fms-guardrails-orchestrator/blob/main/docs/architecture/adrs/001-orchestrator.md

This PR essentially "open-sources" the chunker stream result and the chunker task, so that chunker implementations can be built against at least the chunker task.

The chunker task and stream result are very similar to the existing "tokenization" task and stream result, with the addition of index tracking in the streaming case. The indices record which tokens/text from the input stream (say, from a text generation server) map to which sentences in a chunker result, if the chunker is a sentence splitter.
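To make the index tracking concrete, here is a minimal sketch of a sentence-splitting chunker over a stream of text pieces. It does not use the caikit data model: `ChunkResult` and `chunk_sentences` are hypothetical names, the sentence rule (split on `.`, `!`, `?`) is deliberately naive, and `input_end_index` is an assumed name for the second input-stream pointer. The point is only to show character offsets and input-stream indices being tracked side by side.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class ChunkResult:
    """Hypothetical stand-in for a chunker stream result (not the caikit class)."""

    text: str               # the emitted sentence
    start: int              # char offset of the sentence in the concatenated text
    end: int                # char offset one past the end of the sentence
    input_start_index: int  # index of the first stream item contributing to it
    input_end_index: int    # index of the last stream item contributing to it (assumed name)


def chunk_sentences(stream: Iterable[str]) -> Iterator[ChunkResult]:
    """Naive sentence splitter: a sentence ends at '.', '!' or '?'."""
    buffer = ""      # text of the sentence being built
    char_start = 0   # char offset where the buffer begins
    first_item = 0   # stream index of the buffer's first character
    last_item = 0    # stream index of the buffer's latest character
    for item_index, piece in enumerate(stream):
        for ch in piece:
            if not buffer:
                first_item = item_index
            buffer += ch
            last_item = item_index
            if ch in ".!?":
                end = char_start + len(buffer)
                yield ChunkResult(buffer, char_start, end, first_item, last_item)
                char_start = end
                buffer = ""
    if buffer:
        # Flush any trailing text that never reached a sentence terminator.
        end = char_start + len(buffer)
        yield ChunkResult(buffer, char_start, end, first_item, last_item)


# Usage: pieces as they might arrive from a text generation stream.
for chunk in chunk_sentences(["Hello wor", "ld. How", " are", " you?"]):
    print(chunk)
```

In this sketch the char span (`start`, `end`) says where a sentence sits in the text, while the input indices say which stream items produced it; the two are tracked independently.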

Special notes for your reviewer:

If applicable:

this PR contains documentation
this PR contains unit tests
this PR has been tested for backwards compatibility

Signed-off-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>
Co-authored-by: gkumbhat <kumbhat.gaurav@gmail.com>

Signed-off-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>

gabe-l-hart

One little future-proofing thought on field numbers

gabe-l-hart · 2024-09-04T19:18:56Z

caikit/interfaces/nlp/data_model/text.py

+    # Below 2 represent pointer from the input stream.
+    # These are different from start and processed_index index returned from
+    # TokenizationStreamResult, which refers to the char span
+    input_start_index: Annotated[int, FieldNumber(5)]


Small thought: It might be good to use a fully different set of field numbers for derived messages like this, so that if additional fields are added to the parent, the parent doesn't need to be aware of all its children. I'd suggest going with something like 20 and 21.

Good point ^^. Otherwise, there may be cases where the parent adds a new field and the field numbers now overlap.

Sure. Technically this will be a breaking change for current usage (in the code this was ported from), but this can be accounted for.
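The overlap concern in this thread can be illustrated with a small, library-free sketch. The parent field names and numbers below are hypothetical (they are not taken from `TokenizationStreamResult`), and `input_end_index` is an assumed name for the second derived field; only field number 5 and the suggested 20/21 come from the discussion.

```python
from typing import Dict, List


def find_collisions(parent: Dict[str, int], child: Dict[str, int]) -> List[int]:
    """Return the sorted field numbers claimed by both parent and child."""
    return sorted(set(parent.values()) & set(child.values()))


# Hypothetical parent message layout: field name -> field number.
parent_fields = {"field_a": 1, "field_b": 2, "field_c": 3, "field_d": 4}

# Derived-message fields numbered right after the parent's, versus in a separate range.
child_low = {"input_start_index": 5, "input_end_index": 6}
child_high = {"input_start_index": 20, "input_end_index": 21}

# Today neither layout collides with the parent.
assert find_collisions(parent_fields, child_low) == []

# Later the parent grows a field and takes the next free number, 5.
parent_v2 = {**parent_fields, "new_parent_field": 5}

print(find_collisions(parent_v2, child_low))   # [5]
print(find_collisions(parent_v2, child_high))  # []
```

Reserving a distinct range for the derived message means whoever extends the parent can pick the next sequential number without checking every child.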

Signed-off-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>

gkumbhat

Looks good to me!

evaline-ju and others added 2 commits September 4, 2024 10:53

🏷️ Add chunker types

7c14094

Signed-off-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>
Co-authored-by: gkumbhat <kumbhat.gaurav@gmail.com>

🎨 Lint long line

43cc9e0

Signed-off-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>

evaline-ju marked this pull request as ready for review September 4, 2024 19:16

evaline-ju requested review from gabe-l-hart, joerunde, prashantgupta24, gkumbhat, hickeyma, alex-jw-brooks, tharapalanivel and aluu317 as code owners September 4, 2024 19:16

gabe-l-hart reviewed Sep 4, 2024

🔧 Bump field numbers

401fa51

Signed-off-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>

gkumbhat approved these changes Sep 4, 2024

evaline-ju merged commit d0551af into caikit:main Sep 4, 2024
8 checks passed

evaline-ju deleted the chunkers-dm branch September 4, 2024 22:52

This was referenced Sep 5, 2024

🏷️ Import new chunker type #764

Merged

Chunkers update to use open-sourced chunkers types foundation-model-stack/fms-guardrails-orchestrator#187

Closed
