Description
We've identified a significant discrepancy between the token counting performed by the local tiktoken library and the actual token counting performed by the OpenAI Embeddings API. This causes pgai to unexpectedly hit the maximum token limit per request when generating embeddings.
It seems some undocumented, pessimistic token estimation happens when the request reaches the OpenAI API servers. We didn't find any official documentation, only this very brief thread in their forums.
Details
When the documents sent for embedding add up to approximately 300K tokens (the OpenAI embeddings API limit per request):
- We use the tiktoken encoder to count tokens locally and build a request that gets as close as possible to 300K tokens.
- When these documents are sent to the OpenAI API, it reports a much higher token count (approximately 50% more).
- This results in errors like:
openai.BadRequestError: Error code: 400 - {'error': {'message': 'Requested 454148 tokens, max 300000 tokens per request', 'type': 'max_tokens_per_request', 'param': None, 'code': 'max_tokens_per_request'}}
Key Observation: The discrepancy appears to be related to the length of the input array, not just the total token count. When sending fewer items in the array (even when approaching the same total token count), the discrepancy is much smaller or nonexistent. The more items in the input array, the more pessimistic the API's token estimation becomes compared to the local count.
Impact
- When embedding large datasets, we might hit unexpected rate limits or token limits
- The pgai vectorizer might fail to process large documents that appear to be under the token limit based on local counting
- Users may need to implement more complex chunking strategies than expected based solely on local token counts
Potential Solutions
- Add a safety margin, e.g. apply a coefficient to the effective max token limit when counting locally (see the sketch below).
- Reduce the number of items in input arrays by consolidating smaller inputs into larger chunks, since fewer array items appear to result in less token-count inflation.
- Any other option we haven't thought of yet.
Regardless, we might need to update our documentation to warn users about this discrepancy.
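For the first two options, here is a minimal sketch (not pgai's current implementation) of what margin-adjusted local batching could look like. The 0.65 coefficient, the batch_inputs helper, and the use of the cl100k_base encoding are assumptions chosen for illustration, not values or names confirmed by OpenAI or pgai.

import tiktoken

# Assumptions for this sketch: an arbitrary 0.65 safety coefficient to absorb the
# roughly 50% inflation observed above, and the cl100k_base encoding used by the
# OpenAI embedding models.
API_LIMIT = 300_000
SAFETY_COEFFICIENT = 0.65  # hypothetical value, would need tuning against real API behaviour
EFFECTIVE_LIMIT = int(API_LIMIT * SAFETY_COEFFICIENT)

encoder = tiktoken.get_encoding("cl100k_base")

def batch_inputs(documents):
    """Greedily pack documents into batches whose local token count stays under
    the margin-adjusted limit, leaving headroom for the API's pessimistic estimate."""
    batches, current, current_tokens = [], [], 0
    for doc in documents:
        doc_tokens = len(encoder.encode(doc))
        if current and current_tokens + doc_tokens > EFFECTIVE_LIMIT:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(doc)
        current_tokens += doc_tokens
        # Note: a single document larger than EFFECTIVE_LIMIT would still need
        # to be chunked before it reaches this point.
    if current:
        batches.append(current)
    return batches

# Each batch would then be sent as its own embeddings request; consolidating many
# tiny documents into fewer, larger ones (option 2) further reduces the number of
# array items per request.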
pgai extension affected
No response
pgai library affected
No response
PostgreSQL version used
17
What operating system did you use?
Any
What installation method did you use?
Not applicable
What platform did you run on?
Not applicable
Relevant log output and stack trace
How can we reproduce the bug?
import tiktoken, openai
encoder = tiktoken.encoding_for_model("text-embedding-ada-002")
raw_inputs = [
# long
"This is the larger Lorem ipsum. Lorem ipsum dolor sit amet consectetur adipiscing elit rutrum eros ultricies, convallis porttitor commodo purus urna id duis mattis. Dictum pellentesque sollicitudin dignissim quam in eu scelerisque donec neque, fames metus vestibulum torquent sed erat velit facilisis magna vivamus, mauris nam cras aptent euismod sagittis quis sociis. Eleifend imperdiet commodo nisl phasellus non sodales nisi interdum blandit, turpis luctus viverra erat vivamus ac mus et, sem hendrerit dis natoque sollicitudin conubia eros himenaeos.",
# short
"This is shorter Lorem ipsum doc. Lorem ipsum dolor sit amet consectetur adipiscing elit rutrum eros ultricies, convallis porttitor commodo purus urna id duis mattis."
]
for template_index, doc_base in enumerate(raw_inputs):
print(f"\n----- input {template_index + 1} -----")
print(f"Input: {doc_base[:40]}...")
inputs = []
encoded_inputs = []
encoded_length = 0
while encoded_length < 300000:
new_input = doc_base
encoded = encoder.encode(new_input)
new_tokens = len(encoded)
if encoded_length + new_tokens <= 300000:
inputs.append(new_input)
encoded_inputs.append(encoded)
encoded_length += new_tokens
else:
# close enough, stop adding
break
print(f"Local calculated tokens: {encoded_length}")
print(f"Number of inputs: {len(inputs)}")
try:
resp = openai.embeddings.create(model="text-embedding-3-small", input=inputs)
print(f"API counted: {resp.usage.prompt_tokens} tokens")
except Exception as e:
print(f"Request errored: {e}")
Are you going to work on the bugfix?
None