
[Bug]: OpenAI - Reaching max tokens per request. Count discrepancy between local token count and the count given by the OpenAI API #728

@smoya

Description

We've identified a significant discrepancy between the token counting performed by the local tiktoken library and the actual token counting performed by the OpenAI Embeddings API. This issue causes pgai to unexpectedly hit the max token limit per request when generating embeddings.

It seems some undocumented, pessimistic estimation happens when the request reaches the OpenAI API servers. We didn't find any official docs, just this very brief issue in their forums.

Details

When generating embeddings for documents that together reach approximately 300K tokens (the OpenAI Embeddings API limit per request):

  1. We use the tiktoken encoder to count tokens locally and build a batch as close as possible to 300K tokens.
  2. When these documents are sent to the OpenAI API, it reports a much higher token count (approximately 50% more).
  3. This results in errors like: openai.BadRequestError: Error code: 400 - {'error': {'message': 'Requested 454148 tokens, max 300000 tokens per request', 'type': 'max_tokens_per_request', 'param': None, 'code': 'max_tokens_per_request'}}

Key Observation: The discrepancy appears to be related to the length of the input array, not just the total token count. When sending fewer items in the array (even when approaching the same token limit), the discrepancy is much smaller or nonexistent. The more items in the input array, the more pessimistic the API's token estimation becomes compared to local counting.
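For illustration, if the inflation really does scale with the number of array items, one rough way to model it locally is to pad the count with a per-item overhead. The OVERHEAD_PER_ITEM value and the helper below are hypothetical guesses, not documented OpenAI behavior:

import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

# Hypothetical padding per array item; OpenAI does not document the real overhead.
OVERHEAD_PER_ITEM = 50

def pessimistic_token_estimate(inputs):
    """Local token count plus a guessed per-item overhead, to approximate the API's higher count."""
    local = sum(len(encoder.encode(text)) for text in inputs)
    return local + OVERHEAD_PER_ITEM * len(inputs)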

Impact

  1. When large datasets are used for vector embeddings, we might hit unexpected rate limits or token limits
  2. The pgai vectorizer might fail to process large documents that appear to be under the token limit based on local counting
  3. Users may need to implement more complex chunking strategies than expected based solely on local token counts

Potential Solutions

  1. Add a safety margin, e.g. apply a coefficient to the max token limit when counting locally (see the sketch after this list).
  2. Reduce the number of items in input arrays by consolidating smaller inputs into larger chunks, since fewer array items appear to result in less token count inflation.
  3. Any other option I can't think of at the moment.
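
As a rough sketch of options 1 and 2 combined (assuming a guessed safety coefficient, not a documented OpenAI value), batches could be built against a reduced local budget:

import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

MAX_TOKENS_PER_REQUEST = 300_000
SAFETY_COEFFICIENT = 0.6  # guessed margin, based on the ~50% inflation observed above

def batch_inputs(docs):
    """Group documents into batches whose local token count stays under the reduced budget."""
    budget = int(MAX_TOKENS_PER_REQUEST * SAFETY_COEFFICIENT)
    batches, current, current_tokens = [], [], 0
    for doc in docs:
        tokens = len(encoder.encode(doc))
        if current and current_tokens + tokens > budget:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(doc)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches

The coefficient would have to be tuned empirically; the ~50% inflation reported above suggests something well below 1.0.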

Regardless, we might need to update our documentation to warn users about this discrepancy.

pgai extension affected

No response

pgai library affected

No response

PostgreSQL version used

17

What operating system did you use?

Any

What installation method did you use?

Not applicable

What platform did you run on?

Not applicable

Relevant log output and stack trace

How can we reproduce the bug?

import tiktoken
import openai  # the openai client reads OPENAI_API_KEY from the environment

# text-embedding-ada-002 and text-embedding-3-small both use the cl100k_base encoding,
# so the local count applies to either model.
encoder = tiktoken.encoding_for_model("text-embedding-ada-002")

raw_inputs = [
    # long
    "This is the larger Lorem ipsum. Lorem ipsum dolor sit amet consectetur adipiscing elit rutrum eros ultricies, convallis porttitor commodo purus urna id duis mattis. Dictum pellentesque sollicitudin dignissim quam in eu scelerisque donec neque, fames metus vestibulum torquent sed erat velit facilisis magna vivamus, mauris nam cras aptent euismod sagittis quis sociis. Eleifend imperdiet commodo nisl phasellus non sodales nisi interdum blandit, turpis luctus viverra erat vivamus ac mus et, sem hendrerit dis natoque sollicitudin conubia eros himenaeos.",
    
    # short
    "This is shorter Lorem ipsum doc. Lorem ipsum dolor sit amet consectetur adipiscing elit rutrum eros ultricies, convallis porttitor commodo purus urna id duis mattis."
]

for template_index, doc_base in enumerate(raw_inputs):
    print(f"\n----- input {template_index + 1} -----")
    print(f"Input: {doc_base[:40]}...")
    
    inputs = []
    encoded_inputs = []
    encoded_length = 0
    
    # keep appending the same base document until the local count reaches 300K
    while encoded_length < 300000:
        new_input = doc_base
        encoded = encoder.encode(new_input)
        new_tokens = len(encoded)
        
        if encoded_length + new_tokens <= 300000:
            inputs.append(new_input)
            encoded_inputs.append(encoded)
            encoded_length += new_tokens
        else:
            # close enough, stop adding
            break

    print(f"Local calculated tokens: {encoded_length}")
    print(f"Number of inputs: {len(inputs)}")
    
    try:
        # the API reports its own token count in resp.usage
        resp = openai.embeddings.create(model="text-embedding-3-small", input=inputs)
        print(f"API counted: {resp.usage.prompt_tokens} tokens")
        
    except Exception as e:
        print(f"Request errored: {e}")

Are you going to work on the bugfix?

None
