
[Bug]: OpenAI - Reaching max tokens per request. Count discrepancy between local token count and the count given by the OpenAI API #728

@smoya

Description

We've identified a significant discrepancy between the token counting performed by the local tiktoken library and the actual token counting performed by the OpenAI Embeddings API. This issue causes pgai to unexpectedly hit the max token limit per request when generating embeddings.

It seems some undocumented, pessimistic estimation happens when the request reaches the OpenAI API servers. We didn't find any official docs, just this very brief issue in their forums.

Details

When generating embeddings for documents that together reach approximately 300K tokens (the OpenAI Embeddings API limit per request):

  1. We use the tiktoken encoder to count tokens locally and build a batch as close as possible to 300K tokens.
  2. When these documents are sent to the OpenAI API, it reports a much higher token count (approximately 50% more).
  3. This results in errors like: openai.BadRequestError: Error code: 400 - {'error': {'message': 'Requested 454148 tokens, max 300000 tokens per request', 'type': 'max_tokens_per_request', 'param': None, 'code': 'max_tokens_per_request'}}

Key Observation: The discrepancy appears to be related to the length of the input array, not just the total token count. When sending fewer items in the array (even when approaching the same token limit), the discrepancy is much smaller or nonexistent. The more items in the input array, the more pessimistic the API's token estimation becomes compared to local counting.
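For illustration, if the inflation really does scale with the number of array items, one rough way to model it locally is to pad the count with a per-item overhead. The OVERHEAD_PER_ITEM value and the helper below are hypothetical guesses, not documented OpenAI behavior:

import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

# Hypothetical padding per array item; OpenAI does not document the real overhead.
OVERHEAD_PER_ITEM = 50

def pessimistic_token_estimate(inputs):
    """Local token count plus a guessed per-item overhead, to approximate the API's higher count."""
    local = sum(len(encoder.encode(text)) for text in inputs)
    return local + OVERHEAD_PER_ITEM * len(inputs)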

Impact

  1. When large datasets are used for vector embeddings, we might hit unexpected rate limits or token limits
  2. The pgai vectorizer might fail to process large documents that appear to be under the token limit based on local counting
  3. Users may need to implement more complex chunking strategies than expected based solely on local token counts

Potential Solutions

  1. Add a safety margin, e.g. apply a coefficient to the max token limit when counting locally (see the sketch after this list).
  2. Reduce the number of items in input arrays by consolidating smaller inputs into larger chunks, since fewer array items appear to result in less token count inflation.
  3. Any other option I can't think of at the moment.
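
As a rough sketch of options 1 and 2 combined (assuming a guessed safety coefficient, not a documented OpenAI value), batches could be built against a reduced local budget:

import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

MAX_TOKENS_PER_REQUEST = 300_000
SAFETY_COEFFICIENT = 0.6  # guessed margin, based on the ~50% inflation observed above

def batch_inputs(docs):
    """Group documents into batches whose local token count stays under the reduced budget."""
    budget = int(MAX_TOKENS_PER_REQUEST * SAFETY_COEFFICIENT)
    batches, current, current_tokens = [], [], 0
    for doc in docs:
        tokens = len(encoder.encode(doc))
        if current and current_tokens + tokens > budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(doc)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches

The coefficient would have to be tuned empirically; the ~50% inflation reported above suggests something well below 1.0.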

Regardless, we might need to update our documentation to warn users about this discrepancy.

pgai extension affected

No response

pgai library affected

No response

PostgreSQL version used

17

What operating system did you use?

Any

What installation method did you use?

Not applicable

What platform did you run on?

Not applicable

Relevant log output and stack trace

How can we reproduce the bug?

import tiktoken
import openai  # the openai client reads OPENAI_API_KEY from the environment

# text-embedding-ada-002 and text-embedding-3-small both use the cl100k_base encoding,
# so the local count applies to either model.
encoder = tiktoken.encoding_for_model("text-embedding-ada-002")

raw_inputs = [
    # long
    "This is the larger Lorem ipsum. Lorem ipsum dolor sit amet consectetur adipiscing elit rutrum eros ultricies, convallis porttitor commodo purus urna id duis mattis. Dictum pellentesque sollicitudin dignissim quam in eu scelerisque donec neque, fames metus vestibulum torquent sed erat velit facilisis magna vivamus, mauris nam cras aptent euismod sagittis quis sociis. Eleifend imperdiet commodo nisl phasellus non sodales nisi interdum blandit, turpis luctus viverra erat vivamus ac mus et, sem hendrerit dis natoque sollicitudin conubia eros himenaeos.",
    
    # short
    "This is shorter Lorem ipsum doc. Lorem ipsum dolor sit amet consectetur adipiscing elit rutrum eros ultricies, convallis porttitor commodo purus urna id duis mattis."
]

for template_index, doc_base in enumerate(raw_inputs):
    print(f"\n----- input {template_index + 1} -----")
    print(f"Input: {doc_base[:40]}...")
    
    inputs = []
    encoded_inputs = []
    encoded_length = 0
    
    # keep appending the same base document until the local count reaches 300K
    while encoded_length < 300000:
        new_input = doc_base
        encoded = encoder.encode(new_input)
        new_tokens = len(encoded)
        
        if encoded_length + new_tokens <= 300000:
            inputs.append(new_input)
            encoded_inputs.append(encoded)
            encoded_length += new_tokens
        else:
            # close enough, stop adding
            break

    print(f"Local calculated tokens: {encoded_length}")
    print(f"Number of inputs: {len(inputs)}")
    
    try:
        # the API reports its own token count in resp.usage
        resp = openai.embeddings.create(model="text-embedding-3-small", input=inputs)
        print(f"API counted: {resp.usage.prompt_tokens} tokens")
        
    except Exception as e:
        print(f"Request errored: {e}")

Are you going to work on the bugfix?

None
