-
Notifications
You must be signed in to change notification settings - Fork 70
🏷️ Add chunker types #762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🏷️ Add chunker types #762
Conversation
Signed-off-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com> Co-authored-by: gkumbhat <kumbhat.gaurav@gmail.com>
Signed-off-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One little future-proofing thought on field numbers
# Below 2 represent pointer from the input stream. | ||
# These are different from start and processed_index index returned from | ||
# TokenizationStreamResult, which refers to the char span | ||
input_start_index: Annotated[int, FieldNumber(5)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small thought: It might be good to use a fully different set of field numbers for derived messages like this so that if additional fields are added to the parent, they don't need to be aware of all children. I'd suggest going with something like 20
and 21
or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point ^^.. otherwise, there may be cases where parent adds new field and now things are overlapping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, technically this will be a breaking change for current usage (from where this was ported from) but this can be accounted for
Signed-off-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me!
What this PR does / why we need it:
"Chunkers" can be thought of as servers that “chunk” the target modality. For text, this can look like tokenizers or sentence splitters. ref. https://github.com/foundation-model-stack/fms-guardrails-orchestrator/blob/main/docs/architecture/adrs/001-orchestrator.md
This PR essentially "open-sources" the chunker stream result and chunker task, to enable chunker implementations at least with the chunker task.
The chunker task and stream result are very similar to the existing "tokenization" task and stream result, with the addition of tracking indices in the stream case. This helps track information in the case of what tokens/text (say from a text generation server) map to which sentences in a chunker result, if the chunker is a sentence splitter.
Special notes for your reviewer:
If applicable: