Conversation

@GTxx (Contributor) commented Jul 6, 2025

Related Issue

Issue: N/A

Description

  • What problem does this PR solve?
    A: This PR adds prompt caching support for Anthropic models in the SAP AI Core provider, specifically sonnet-3.7, sonnet-4, and opus-4.
  • Why were these changes introduced and what purpose do they serve?
    A: Prompt caching is a significant way to reduce API cost.
  • For larger changes, provide context about your approach and reasoning
    A: I mostly followed the Bedrock and Anthropic providers when building the prompt-cache support, since SAP AI Core exposes the Anthropic API in Bedrock's Converse API format (see the payload sketch below).
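
For reference, here is a minimal sketch of what a cache point looks like in a Converse-style payload. The shape follows Bedrock's Converse API (cachePoint content blocks of type "default"); the exact request shape SAP AI Core accepts is an assumption here, not the provider's actual code.

// Hypothetical Converse-style request with cache points; field names assume Bedrock's Converse format.
const payload = {
  system: [
    { text: "You are Cline, a coding assistant..." },
    { cachePoint: { type: "default" } }, // cache everything in the system prompt up to this block
  ],
  messages: [
    {
      role: "user",
      content: [
        { text: "Refactor the streaming handler." },
        { cachePoint: { type: "default" } }, // cache the conversation up to the latest user turn
      ],
    },
  ],
}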

Test Procedure

I ran the same prompt against the same code base with and without caching. These are the token usages:

Without prompt cache:

inputToken = 15601
outputTokens = 185
totalTokens = 15786

With prompt cache applied to the system prompt:

inputToken = 3753
outputTokens = 454
totalTokens = 16190

With prompt cache applied to both the system prompt and user-assistant messages:

inputToken = 4
outputTokens = 213
totalTokens = 16025

As it illustrates, with only the system prompt cache, 10,000 tokens are cached; and with both system prompt cached and user-assistant messages cached, the input token count drops to single digit, showing more tokens are cached.
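
As a rough cross-check, assuming totalTokens = inputTokens + outputTokens + cache read + cache write (see the caveat under Additional Notes), the cached share of each run works out to approximately:

no cache:                        15,601 + 185 = 15,786 total, nothing cached
system prompt cached:            16,190 - 3,753 - 454 ≈ 11,983 tokens read from / written to cache
system prompt + messages cached: 16,025 - 4 - 213 ≈ 15,808 tokens read from / written to cache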

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • [x] ✨ New feature (non-breaking change which adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • ♻️ Refactor Changes
  • 💅 Cosmetic Changes
  • 📚 Documentation update
  • 🏃 Workflow Changes

Pre-flight Checklist

  • Changes are limited to a single feature, bugfix or chore (split larger changes into separate PRs)
  • Tests are passing (npm test) and code is formatted and linted (npm run format && npm run lint)
  • I have created a changeset using npm run changeset (required for user-facing changes)
  • I have reviewed contributor guidelines

Screenshots

Additional Notes

Caveat: In my testing with my own SAP AI Core account (prompt cache enabled), SAP AI Core does not return cache_read and cache_write counts.
Users will therefore see a very small input token count that does not reflect the real token usage, because

prompt tokens = input tokens + cache read tokens + cache write tokens

and most prompt tokens fall under cache_read or cache_write, leaving the reported input token count small.


Important

Adds prompt caching for SAP AI Core models sonnet-3.7, sonnet-4, and opus-4 to reduce costs.

  • Behavior:
    • Adds prompt caching for sonnet-3.7, sonnet-4, and opus-4 models in sapaicore.ts.
    • Caching is enabled by default to reduce costs.
    • Updates sapAiCoreModels in api.ts to reflect caching support and pricing.
  • Implementation:
    • Introduces applyCacheControlToMessages() in sapaicore.ts to manage cache points in messages (a rough sketch follows this list).
    • Modifies createMessage() in sapaicore.ts to include cache points in payloads.
  • Testing:
    • Demonstrates token usage reduction with caching enabled, showing significant savings in input tokens.
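
As a rough illustration of the approach (not the PR's actual code), a helper like applyCacheControlToMessages() could append a cachePoint block to the last two user messages, so the previous turn is read from cache while the newest turn is written to it. The types and field names below assume Bedrock's Converse format.

// Sketch only; the real applyCacheControlToMessages() in sapaicore.ts may differ.
type ContentBlock = { text?: string; cachePoint?: { type: "default" } }
type ConverseMessage = { role: "user" | "assistant"; content: ContentBlock[] }

function applyCacheControlToMessages(messages: ConverseMessage[]): ConverseMessage[] {
  // Indices of the last two user messages.
  const userIndices = messages
    .map((message, index) => (message.role === "user" ? index : -1))
    .filter((index) => index !== -1)
    .slice(-2)

  // Append a cache point to each of those messages; leave the rest untouched.
  return messages.map((message, index): ConverseMessage =>
    userIndices.includes(index)
      ? { ...message, content: [...message.content, { cachePoint: { type: "default" } }] }
      : message,
  )
}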

This description was created by Ellipsis for 855025c.

changeset-bot bot commented Jul 10, 2025

🦋 Changeset detected

Latest commit: 4dfdfbb

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package:
  • claude-dev (Minor)


@GTxx (Contributor, Author) commented Jul 23, 2025

Hi @saoudrizwan, @ocasta181, @NightTrek,

I use SAP AI Core + Sonnet daily, and caching is critical for improving API response time and saving cost.
Could you review this PR and provide feedback?

Thanks

@saoudrizwan (Contributor) commented Jul 26, 2025

Calling contributors to sapaicore.ts @tjandy98 @lizzzcai @schardosin

If you could please help test and review this PR, that would be greatly appreciated! 🙏

@lizzzcai (Contributor) commented Aug 6, 2025

Hi @tjandy98, can you help test whether this works locally? Thanks.

@ncryptedV1 ncryptedV1 mentioned this pull request Aug 6, 2025
@@ -361,11 +391,15 @@ export class SapAiCoreHandler implements ApiHandler {
if (data.metadata?.usage) {
const inputTokens = data.metadata.usage.inputTokens || 0
const outputTokens = data.metadata.usage.outputTokens || 0
const cacheReadInputTokens = data.metadata.usage.cacheReadInputTokens || 0
@tjandy98 (Contributor) commented Aug 9, 2025

From my testing, cache usage information is not returned when using converse-stream. Despite that, it is implicitly reflected in usage.inputTokens, which shows a low token count (due to the cache). This appears to be specific to AI Core.

@GTxx (Contributor, Author) replied:

Thanks for the review.

You are right: SAP AI Core doesn't return the cache usage information, neither cache read nor cache write.

Here is one example of a metadata payload I found in the streamed response from a Sonnet model on SAP AI Core:

MetadataEvent(usage=TokenUsage(input_tokens=10, output_tokens=145, total_tokens=23150, cache_read_input_tokens=None, cache_write_input_tokens=None), metrics=Metrics(latency_ms=5109))
  • input_tokens=10
  • output_tokens=145
  • total_tokens=23150

I believe the cache mechanism works, so some input tokens are written to the cache and some are read from it.

I think the best option right now is to calibrate the input token count as:

input_token = total_tokens - output_tokens

Contributor replied:

Yes, the cache mechanism does work.

Here is an example of a raw event:

{"metrics":{"latencyMs":17329},"p":"abcdefghijklmnopqrstuvwxyzAB","usage":{"cacheReadInputTokenCount":1490,"cacheReadInputTokens":1490,"cacheWriteInputTokenCount":0,"cacheWriteInputTokens":0,"inputTokens":4,"outputTokens":832,"totalTokens":2326}}

total_tokens = 2326
output_tokens = 832
input_tokens = 4
cache_read_input_tokens = 2326 - 832 - 4 = 1490

The calculation logic may be adjusted according to the example calculation above

@GTxx (Contributor, Author) replied:

Agree.

The equation should be:
totalTokens = cacheReadInputTokenCount + cacheWriteInputTokenCount + inputTokens + outputTokens

I will close this PR since another PR for the same purpose has been merged.

I am creating another PR to calibrate the inputToken value; otherwise the token count shown for the context window is wrong: it could report only a few tokens used when the conversation has already exceeded the context window limit.
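
For illustration, a minimal sketch of that calibration in TypeScript, assuming the Converse-style usage fields shown in the raw event above (the follow-up PR's actual code may differ):

// Sketch only: derive the prompt-side token count when SAP AI Core omits the cache counters.
interface ConverseUsage {
  inputTokens?: number
  outputTokens?: number
  totalTokens?: number
  cacheReadInputTokens?: number
  cacheWriteInputTokens?: number
}

function calibratedInputTokens(usage: ConverseUsage): number {
  const input = usage.inputTokens ?? 0
  const output = usage.outputTokens ?? 0
  const cacheRead = usage.cacheReadInputTokens ?? 0
  const cacheWrite = usage.cacheWriteInputTokens ?? 0

  // When cache counters are present, totalTokens = cacheRead + cacheWrite + input + output,
  // so the uncached input can be reported as-is.
  if (cacheRead > 0 || cacheWrite > 0) {
    return input
  }
  // Otherwise fall back to totalTokens - outputTokens so context-window accounting
  // covers everything on the prompt side (uncached input plus cached tokens).
  return Math.max((usage.totalTokens ?? 0) - output, input)
}

With the raw event above, totalTokens - outputTokens = 2,326 - 832 = 1,494, which matches inputTokens (4) + cacheReadInputTokens (1,490).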

@GTxx GTxx closed this Aug 9, 2025
dtrugman pushed a commit to dtrugman/cline that referenced this pull request Aug 24, 2025