Closed
I'm pretty sure this comment:
should instead say:
```python
# Sizes are [batch_size, 1, 1, to_seq_length]
# So we can broadcast to [batch_size, num_heads, from_seq_length, to_seq_length]
```
When masking out tokens for attention, it doesn't matter what happens to attention *from* padding tokens, only that there is no attention *to* padding tokens.
I don't believe the code does what the comment currently suggests, because that would be an implementation flaw.
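To illustrate the broadcasting in question, here is a minimal NumPy sketch (not the actual library code) that assumes BERT-style additive masking with a large negative bias. A mask of shape `[batch_size, 1, 1, to_seq_length]` broadcasts over the `num_heads` and `from_seq_length` axes, so every query row has the same padded key positions zeroed out:

```python
import numpy as np

batch, heads, from_len, to_len = 2, 4, 5, 5

rng = np.random.default_rng(0)
scores = rng.normal(size=(batch, heads, from_len, to_len))

# 1 = real token, 0 = padding; shape [batch_size, to_seq_length].
# These example values are made up for illustration.
pad_mask = np.array([[1, 1, 1, 0, 0],
                     [1, 1, 1, 1, 0]], dtype=np.float64)

# Reshape to [batch_size, 1, 1, to_seq_length] so it broadcasts
# across the num_heads and from_seq_length dimensions.
ext = pad_mask[:, None, None, :]
additive = (1.0 - ext) * -10000.0  # large negative bias on padded keys

# Broadcasts to [batch_size, num_heads, from_seq_length, to_seq_length].
masked = scores + additive

# Row-wise softmax over the to_seq_length (key) axis.
probs = np.exp(masked - masked.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

# Attention *to* padding key positions is effectively zero for every
# query row, while rows for padding *queries* still sum to 1 -- which
# is harmless, matching the point above.
print(probs[0, :, :, 3:].max())
```

The printed maximum is vanishingly small, confirming that broadcasting along the last axis masks keys (attention *to* padding), not queries.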