feat: support prompt cache for anthropic models in sap ai core provider #4683
Conversation
🦋 Changeset detected. Latest commit: 4dfdfbb
The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
Hi @saoudrizwan @ocasta181 @NightTrek, I am using SAP AI Core + Sonnet daily; the cache is critical for improving API response time and saving cost. Thanks!
Calling contributors to sapaicore.ts: @tjandy98 @lizzzcai @schardosin. If you could please help test and review this PR, that would be greatly appreciated! 🙏
Hi @tjandy98, could you help test whether this works locally? Thanks.
@@ -361,11 +391,15 @@ export class SapAiCoreHandler implements ApiHandler {
	if (data.metadata?.usage) {
		const inputTokens = data.metadata.usage.inputTokens || 0
		const outputTokens = data.metadata.usage.outputTokens || 0
		const cacheReadInputTokens = data.metadata.usage.cacheReadInputTokens || 0
From my testing, cache usage information is not returned when using converse-stream. Despite that, it is implicitly reflected in usage.inputTokens showing a low token count (due to the cache). This appears to be specific to AI Core only.
Thanks for the review.
You are right, SAP AI Core doesn't return the cache usage information, neither cache read nor cache write.
Here is one example of the metadata payload I found in the streamed response from the Sonnet model on SAP AI Core:
MetadataEvent(usage=TokenUsage(input_tokens=10, output_tokens=145, total_tokens=23150, cache_read_input_tokens=None, cache_write_input_tokens=None), metrics=Metrics(latency_ms=5109))
- input_tokens=10
- output_tokens=145
- total_tokens=23150
I believe the cache mechanism works, so that some input tokens are written to the cache and some are read from the cache.
I think the best approach right now is to calibrate the input_token like this:
input_token = total_tokens - output_tokens
Yes, the cache mechanism does work.
Here is an example of a raw event:
{"metrics":{"latencyMs":17329},"p":"abcdefghijklmnopqrstuvwxyzAB","usage":{"cacheReadInputTokenCount":1490,"cacheReadInputTokens":1490,"cacheWriteInputTokenCount":0,"cacheWriteInputTokens":0,"inputTokens":4,"outputTokens":832,"totalTokens":2326}}
total_tokens = 2326
output_tokens = 832
input_tokens = 4
cache_read_input_tokens = 2326 - 832 - 4 = 1490
The calculation logic may be adjusted according to the example calculation above.
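A minimal sketch of that derivation, assuming the usage field names shown in the raw event above; this is illustrative, not the code in this PR:

```typescript
// Derive cache reads when the backend omits the explicit cache fields.
// Field names follow the raw Converse-style event above (assumption).
function deriveCacheReadTokens(usage: {
	inputTokens?: number
	outputTokens?: number
	totalTokens?: number
	cacheReadInputTokens?: number
	cacheWriteInputTokens?: number
}): number {
	// Prefer the explicit field when the backend returns it.
	if (usage.cacheReadInputTokens != null) {
		return usage.cacheReadInputTokens
	}
	const total = usage.totalTokens ?? 0
	const output = usage.outputTokens ?? 0
	const input = usage.inputTokens ?? 0
	const cacheWrite = usage.cacheWriteInputTokens ?? 0
	// Attribute the unexplained remainder to cache reads,
	// e.g. 2326 - 832 - 4 - 0 = 1490 for the event above.
	return Math.max(0, total - output - input - cacheWrite)
}
```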
Agree.
The equation should be:
totalTokens = cacheReadInputTokenCount + cacheWriteInputTokenCount + inputTokens + outputTokens
I will close this PR, as another PR for the same purpose has been merged.
I am creating another PR to calibrate the inputToken; otherwise the token count in the context window is incorrect: it might show only a few tokens used when the prompt has actually already exceeded the context window limit.
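A hedged sketch of that calibration idea, using the equation above (field and function names are illustrative, not the follow-up PR's code):

```typescript
// Hypothetical calibration for context-window tracking (assumption, not merged code).
// Since totalTokens = cacheRead + cacheWrite + inputTokens + outputTokens,
// everything except outputTokens occupies the input side of the context window.
function contextWindowInputTokens(usage: {
	inputTokens: number
	outputTokens: number
	totalTokens: number
}): number {
	return Math.max(usage.inputTokens, usage.totalTokens - usage.outputTokens)
}
```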
Related Issue
Issue: N/A
Description
This PR adds prompt cache support for Anthropic models in the SAP AI Core provider, specifically sonnet-3.7, sonnet-4, and opus-4.
Prompt caching is a significant way to save cost.
I mostly referred to the Bedrock and Anthropic providers to build the prompt cache function; SAP AI Core exposes the Anthropic API in Bedrock's Converse API format. A rough sketch of the cache-point placement is shown below.
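The sketch below illustrates how cache points can be placed in a Converse-style payload. The cachePoint content block follows the Bedrock Converse API convention; the function name and exact shape used by applyCacheControlToMessages() in sapaicore.ts may differ, so treat this as an assumption-laden illustration rather than the PR's code.

```typescript
// Illustrative cache-point placement in a Bedrock-Converse-style payload (assumption).
const cachePoint = { cachePoint: { type: "default" } }

function buildCachedPayload(systemPrompt: string, messages: { role: string; content: any[] }[]) {
	// Cache the large, stable system prompt.
	const system = [{ text: systemPrompt }, cachePoint]

	// Add a cache point after the latest message so the growing
	// conversation prefix can be reused on the next turn.
	const cached = messages.map((msg, i) =>
		i === messages.length - 1 ? { ...msg, content: [...msg.content, cachePoint] } : msg,
	)

	return { system, messages: cached }
}
```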
Test Procedure
I tested the same prompt on the same code base both with and without the cache. These are the token usages:
Without prompt cache:
With prompt cache applied to the system prompt:
With prompt cache applied to both the system prompt and user-assistant messages:
As illustrated, with only the system prompt cached, 10,000 tokens are cached; with both the system prompt and the user-assistant messages cached, the input token count drops to a single digit, showing that even more tokens are served from the cache.
Type of Change
Pre-flight Checklist
- Tests pass (npm test) and code is formatted and linted (npm run format && npm run lint)
- Changeset created with npm run changeset (required for user-facing changes)
Screenshots
Additional Notes
Caveat: according to my testing with my own SAP AI Core account with prompt cache enabled, SAP AI Core doesn't return cache_read and cache_write.
Users will see a very small input token count, but it doesn't reflect the real token usage: most prompt tokens are either cache_read or cache_write, which leaves the reported input token count small.
Important
Adds prompt caching for SAP AI Core models sonnet-3.7, sonnet-4, and opus-4 to reduce costs.
- Adds prompt caching for the sonnet-3.7, sonnet-4, and opus-4 models in sapaicore.ts.
- Updates sapAiCoreModels in api.ts to reflect caching support and pricing.
- Adds applyCacheControlToMessages() in sapaicore.ts to manage cache points in messages.
- Updates createMessage() in sapaicore.ts to include cache points in payloads.
This description was created by for 855025c. You can customize this summary. It will automatically update as commits are pushed.