
Conversation

fairydreaming (Collaborator)

This is the second PR in a series of PRs adding support for the T5 and FLAN-T5 models.

This PR adds an implementation of the Unigram tokenizer used in the T5 and FLAN-T5 models. It also adds the T5 model architecture, tensors, and model header parameters, so that the tokenizer can be tested with the llama-tokenize command.
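
As background (this is not the PR's code), here is a minimal sketch of how Unigram tokenization can be implemented with a forward dynamic-programming pass over the input. All names are illustrative, and the real implementation additionally normalizes the input with the precompiled character map and matches pieces with a trie rather than substring lookups:

    // Sketch of Unigram (SentencePiece-style) tokenization via dynamic programming.
    // All identifiers here are illustrative; they are not the names used in the PR.
    #include <algorithm>
    #include <cfloat>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct best_ending {     // best tokenization that ends at a given byte offset
        int    token_id;     // id of the last piece in that tokenization
        size_t prev_offset;  // byte offset where that last piece starts
        float  score_sum;    // total log-probability accumulated so far
    };

    // vocab maps a piece (substring) to its (token id, log-probability) pair
    std::vector<int> unigram_tokenize(
            const std::string & text,
            const std::unordered_map<std::string, std::pair<int, float>> & vocab,
            int unk_id, float unk_score) {
        const size_t n = text.size();
        // every offset starts out unreachable (-FLT_MAX); offset 0 has score 0
        std::vector<best_ending> best(n + 1, {unk_id, 0, -FLT_MAX});
        best[0] = {unk_id, 0, 0.0f};

        for (size_t start = 0; start < n; ++start) {
            if (best[start].score_sum == -FLT_MAX) {
                continue;  // no tokenization reaches this offset
            }
            // try every piece beginning at `start` (a real implementation walks a trie)
            for (size_t len = 1; start + len <= n; ++len) {
                int   id;
                float score;
                const auto it = vocab.find(text.substr(start, len));
                if (it != vocab.end()) {
                    id = it->second.first;  score = it->second.second;
                } else if (len == 1) {
                    id = unk_id;            score = unk_score;  // single-byte fallback to UNK
                } else {
                    continue;
                }
                const float cand = best[start].score_sum + score;
                if (cand > best[start + len].score_sum) {
                    best[start + len] = {id, start, cand};
                }
            }
        }

        // walk back from the end of the input to recover the best segmentation
        std::vector<int> out;
        for (size_t pos = n; pos > 0; pos = best[pos].prev_offset) {
            out.push_back(best[pos].token_id);
        }
        std::reverse(out.begin(), out.end());
        return out;
    }

The tokenization_results / best_tokenization snippet discussed further down in this thread follows the same idea: one best-scoring entry per input offset, initialized to -FLT_MAX so that any real tokenization overrides it.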

@mofosyne added the Review Complexity : Medium label (generally requires more time to grok, but manageable by beginner to medium expertise level) on Jun 24, 2024
llama : fix to prevent crashes when precompiled_charsmap is not present
llama.cpp (outdated)
Comment on lines 4977 to 4979
vocab.n_precompiled_charsmap = gguf_get_arr_n(ctx, precompiled_charsmap_keyidx);
vocab.precompiled_charsmap = (char *) malloc(vocab.n_precompiled_charsmap);
memcpy((void*) vocab.precompiled_charsmap, gguf_get_arr_data(ctx, precompiled_charsmap_keyidx), vocab.n_precompiled_charsmap);
Member

There's a memory leak here. Use std::vector<char> instead of:

    uint32_t n_precompiled_charsmap = 0;
    char * precompiled_charsmap = NULL;
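
A minimal sketch of what the vector-based version might look like, reusing the gguf_get_arr_n / gguf_get_arr_data calls from the snippet above (the member and variable names here are assumptions, not necessarily the exact code that landed):

    // in the vocab struct: the vector owns its storage, so no manual malloc/free
    std::vector<char> precompiled_charsmap;

    // in llm_load_vocab, with `ctx` and `precompiled_charsmap_keyidx` as in the snippet above
    const size_t n_charsmap = gguf_get_arr_n(ctx, precompiled_charsmap_keyidx);
    const char * charsmap_data = (const char *) gguf_get_arr_data(ctx, precompiled_charsmap_keyidx);
    vocab.precompiled_charsmap.assign(charsmap_data, charsmap_data + n_charsmap);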

@fairydreaming (Collaborator, Author) commented on Jun 25, 2024

Good catch! I replaced it with a vector as suggested, but I had to move the endianness correction code from llm_tokenizer_ugm to llm_load_vocab: the reference to vocab is const in the tokenizer, so manipulating the precompiled_charsmap vector's buffer there would require const casts.
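
For context, a minimal sketch of the kind of load-time endianness correction described above. It assumes the SentencePiece precompiled_charsmap layout (a 4-byte little-endian size of the XCDA trie followed by 32-bit trie entries), and all names are illustrative rather than the actual code from this PR:

    // Call this only on big-endian targets (e.g. behind an #ifdef), right after the
    // blob has been copied into the std::vector<char> at load time, so the tokenizer
    // can later treat the vocab as const.
    #include <algorithm>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    static uint32_t bswap32(uint32_t x) {
        return (x >> 24) | ((x >> 8) & 0x0000FF00u) | ((x << 8) & 0x00FF0000u) | (x << 24);
    }

    static void correct_charsmap_endianness(std::vector<char> & charsmap) {
        if (charsmap.size() < sizeof(uint32_t)) {
            return;
        }
        // the first 4 bytes hold the size of the XCDA trie blob
        uint32_t xcda_size = 0;
        std::memcpy(&xcda_size, charsmap.data(), sizeof(xcda_size));
        xcda_size = bswap32(xcda_size);
        std::memcpy(charsmap.data(), &xcda_size, sizeof(xcda_size));

        // swap every 32-bit trie entry that follows the size field
        const size_t end = std::min(charsmap.size(), sizeof(uint32_t) + (size_t) xcda_size);
        for (size_t off = sizeof(uint32_t); off + sizeof(uint32_t) <= end; off += sizeof(uint32_t)) {
            uint32_t v = 0;
            std::memcpy(&v, charsmap.data() + off, sizeof(v));
            v = bswap32(v);
            std::memcpy(charsmap.data() + off, &v, sizeof(v));
        }
    }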

@fairydreaming merged commit 6fcbf68 into ggml-org:master on Jun 25, 2024
Comment on lines +14067 to +14070
// initialize score_sum to -FLT_MAX so it will be always lower than sums of token scores
std::vector<struct best_tokenization> tokenization_results(input_len + 1, {0, 0, -FLT_MAX});
// at the beginning tokenization score is zero
tokenization_results[0] = { 0, 0, 0 };
Member

Is this supposed to be:

        // initialize score_sum to -FLT_MAX so it will be always lower than sums of token scores
        std::vector<struct best_tokenization> tokenization_results(input_len + 1, {vocab.special_unk_id, 0, -FLT_MAX});
        // at the beginning tokenization score is zero
        tokenization_results[0] = { vocab.special_unk_id, 0, 0 };

Currently, a string consisting of a single space character tokenizes to a single PAD token [0], while the AutoTokenizer returns an empty array of tokens in this case. With the change above, llama.cpp returns a single UNK token [2], which is still incorrect, or at least does not match the AutoTokenizer result.

@fairydreaming (Collaborator, Author) commented on Jul 2, 2024

I added a commit to PR #8141 that adds an early return in the tokenizer when the normalized input is empty, matching the behavior of the SentencePiece implementation and fixing this problem: 78675f3
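
For illustration only (this is not the code in 78675f3), a minimal sketch of such an early return; the `normalize` helper here is a hypothetical stand-in for the tokenizer's UGM normalization step, reduced to whitespace trimming just so the sketch compiles:

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for UGM normalization; the real step applies the
    // precompiled charsmap rules, not just whitespace trimming.
    static std::string normalize(const std::string & text) {
        const size_t b = text.find_first_not_of(' ');
        if (b == std::string::npos) {
            return "";
        }
        const size_t e = text.find_last_not_of(' ');
        return text.substr(b, e - b + 1);
    }

    static void tokenize(const std::string & text, std::vector<int32_t> & output) {
        const std::string normalized = normalize(text);
        if (normalized.empty()) {
            // SentencePiece produces no tokens when the normalized input is empty,
            // so return without emitting a stray PAD or UNK token
            return;
        }
        // ... Viterbi segmentation of `normalized` would follow here ...
        (void) output;
    }

For an input of a single space, `normalize` returns an empty string, so `tokenize` emits no tokens, matching the AutoTokenizer behavior described above.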
