Skip to content

Conversation

evaline-ju
Copy link
Collaborator

@evaline-ju evaline-ju commented Sep 4, 2024

What this PR does / why we need it:

"Chunkers" can be thought of as servers that “chunk” the target modality. For text, this can look like tokenizers or sentence splitters. ref. https://github.com/foundation-model-stack/fms-guardrails-orchestrator/blob/main/docs/architecture/adrs/001-orchestrator.md

This PR essentially "open-sources" the chunker stream result and chunker task, to enable chunker implementations at least with the chunker task.

The chunker task and stream result are very similar to the existing "tokenization" task and stream result, with the addition of tracking indices in the stream case. This helps track information in the case of what tokens/text (say from a text generation server) map to which sentences in a chunker result, if the chunker is a sentence splitter.

Special notes for your reviewer:

If applicable:

  • this PR contains documentation
  • this PR contains unit tests
  • this PR has been tested for backwards compatibility

evaline-ju and others added 2 commits September 4, 2024 10:53
Signed-off-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>

Co-authored-by: gkumbhat <kumbhat.gaurav@gmail.com>
Signed-off-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>
Copy link
Collaborator

@gabe-l-hart gabe-l-hart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One little future-proofing thought on field numbers

# Below 2 represent pointer from the input stream.
# These are different from start and processed_index index returned from
# TokenizationStreamResult, which refers to the char span
input_start_index: Annotated[int, FieldNumber(5)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small thought: It might be good to use a fully different set of field numbers for derived messages like this so that if additional fields are added to the parent, they don't need to be aware of all children. I'd suggest going with something like 20 and 21 or something like that.

Copy link
Contributor

@gkumbhat gkumbhat Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point ^^.. otherwise, there may be cases where parent adds new field and now things are overlapping

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, technically this will be a breaking change for current usage (from where this was ported from) but this can be accounted for

Signed-off-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>
Copy link
Contributor

@gkumbhat gkumbhat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@evaline-ju evaline-ju merged commit d0551af into caikit:main Sep 4, 2024
8 checks passed
@evaline-ju evaline-ju deleted the chunkers-dm branch September 4, 2024 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants