This repository was archived by the owner on Jul 30, 2025. It is now read-only.

Why do we not set the ignore_index of FusedCrossEntropy to bos_id? #83

@larrylawl

Hi,

Can I check why you did not ignore the loss for the bos token (<s>)?

loss_func = FusedCrossEntropyLoss()

I noticed that the preprocessing causes the remainder of the binary file to be the bos token (<s>).

builder.write_reminder()
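As I understand the packed-dataset builder, each chunk buffer is pre-filled with the separator/bos id and then overwritten with real tokens; `write_reminder()` flushes the last, partially filled buffer, so its tail stays all bos. A minimal sketch of that behavior (the `CHUNK_SIZE` value and the bos id of 1 are assumptions for illustration):

```python
# Hedged sketch: the chunk buffer starts as all <s> and only the
# positions actually written get real tokens, so the flushed
# remainder is a run of bos tokens.
CHUNK_SIZE = 8
BOS_ID = 1  # <s> id in the Llama tokenizer

buffer = [BOS_ID] * CHUNK_SIZE   # buffer pre-filled with bos
real_tokens = [5, 9, 4]          # only 3 real tokens remain at EOF
buffer[:len(real_tokens)] = real_tokens

print(buffer)  # [5, 9, 4, 1, 1, 1, 1, 1]
```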

Consequently, my model checkpoint (not TinyLlama's) outputs poor qualitative results:

Prompt: "(very long text of 1027 tokens). How many yards longer was the longest passing touchdown than the shortest?"
Output: "<s>ending<s>end<s>ent<s>ended"

Interestingly, my model's previous checkpoint (100B tokens earlier) performed okay.

I'm trying to fix this by specifying the loss function to ignore the <s> idx (i.e. 1). I think this is a correct fix, but I'm not sure it addresses the underlying issue (the issue should have plagued our model from the start; why did it only appear at this iteration step?).
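To make the proposed fix concrete, here is a plain-Python sketch of what an `ignore_index` of 1 would do: positions whose target is the bos id contribute neither loss nor gradient, and the mean is taken over the remaining positions only. (The real `FusedCrossEntropyLoss` is a fused CUDA kernel; the logits and the bos id of 1 below are assumptions for illustration.)

```python
import math

BOS_ID = 1  # <s> id in the Llama tokenizer (assumed)

def cross_entropy(logits, targets, ignore_index=None):
    """Mean token-level cross-entropy, skipping targets == ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked position: no loss, no gradient
        # loss = -log_softmax(row)[target] = log(sum exp(row)) - row[target]
        log_z = math.log(sum(math.exp(x) for x in row))
        losses.append(log_z - row[target])
    return sum(losses) / len(losses)

logits = [[2.0, 0.5, 0.1], [0.1, 3.0, 0.2], [1.5, 0.3, 0.9]]
targets = [0, 1, 1]  # the last two positions target the bos id

plain = cross_entropy(logits, targets)
masked = cross_entropy(logits, targets, ignore_index=BOS_ID)
# masked averages over position 0 only; the two bos targets are dropped
```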
