Skip to content

strategy="ffd" implements BFD packing #3645

@toslunar

Description

@toslunar

The docstring says

            - `"ffd"` (First Fit Decreasing): Slower but preserves sequence boundaries. Sequences are never cut in the
                middle.

but the implemented algorithm searches the bin with the least remaining capacity, which is called Best-Fit-Decreasing in the paper https://arxiv.org/abs/2404.10830.

  • The algorithm before the optimization:

    trl/trl/data_utils.py

    Lines 507 to 517 in fef915e

    # Optimized bin packing using a priority queue approach
    bins_by_remaining = defaultdict(list) # remaining_space -> [bin_indices]
    bins = [] # [(current_length, seq_indices)]
    for i, (seq_len, _start, _end) in enumerate(sequences):
    # Find bins with enough space using the dictionary
    placed = False
    for remaining in range(seq_len, seq_length + 1):
    if bins_by_remaining[remaining]:
    # Use the first available bin with this remaining space
    bin_idx = bins_by_remaining[remaining].pop()
  • In v0.19.0 https://github.com/huggingface/trl/blob/v0.19.0/trl/data_utils.py#L551-L561, it looks like Appendix B (An Illustration of optimized BFD Algorithm) of https://arxiv.org/pdf/2404.10830v2

Metadata

Metadata

Assignees

No one assigned

    Labels

    🐛 bugSomething isn't working📚 documentationImprovements or additions to documentation

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions