# Anthropic Prompt Caching in Laravel: A Practical Guide

**Author:** Mozex | **Published:** 2026-04-20 | **Tags:** Laravel, PHP, AI, Tutorial, Anthropic | **URL:** https://mozex.dev/blog/18-anthropic-prompt-caching-in-laravel-a-practical-guide

---


Running Sevantia, our AI chat product, I watched the same 2,800-token system prompt get re-billed on every single user message. Multiply that by thousands of conversations a day and you're paying Anthropic to re-read the same instructions into Claude's context over and over. Prompt caching fixes this. It took me longer than it should have to get the mental model right, and the docs skip over the parts that trip up most people.

This is the guide I wish I'd had when I first enabled it.

<!--more-->

Examples target [`mozex/anthropic-php`](https://mozex.dev/docs/anthropic-php) and [`mozex/anthropic-laravel`](https://mozex.dev/docs/anthropic-laravel) on Claude Opus 4.7. Pricing and thresholds are current as of the April 2026 Anthropic docs; they've shifted before and will again, so verify the [official Anthropic caching docs](https://docs.claude.com/en/docs/build-with-claude/prompt-caching) before baking numbers into a spreadsheet.

## The 60-second mental model

Anthropic's API lets you mark a cutoff point in your prompt. Everything up to and including that point gets hashed and stored. Next time you send a request whose prefix matches that hash, you pay a fraction of the normal input rate.

Four numbers do the work, priced per million tokens on Claude Opus 4.7:

- Base input: $5
- 5-minute cache write: $6.25 (1.25x input)
- 1-hour cache write: $10 (2x input)
- Cache read (hit): $0.50 (0.1x input)

The first time you send a cacheable prefix, you pay 1.25x to write it. Every read after that, within the TTL, is 90% cheaper than sending those tokens fresh. The math works out aggressively in your favor: one read inside the 5-minute window already recovers the write premium and puts you ahead of sending the prefix fresh. For the 1-hour TTL (2x write, same 0.1x read), two reads are enough to fully recover the higher write cost.
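That break-even claim is easy to sanity-check with the per-million rates above. A quick sketch in plain PHP arithmetic, nothing SDK-specific:

```php
<?php

// Per-million-token rates for Claude Opus 4.7, from the list above.
$base    = 5.00;   // fresh input
$write5m = 6.25;   // 5-minute cache write (1.25x)
$write1h = 10.00;  // 1-hour cache write (2x)
$read    = 0.50;   // cache read (0.1x)

// Cost of serving the same prefix (1 + N) times:
// cached = one write plus N reads; fresh = every send at the base rate.
$cached5m = fn (int $reads): float => $write5m + $reads * $read;
$cached1h = fn (int $reads): float => $write1h + $reads * $read;
$fresh    = fn (int $sends): float => $sends * $base;

// 5-minute tier: $cached5m(1) = 6.75 vs $fresh(2) = 10.00, ahead after one read.
// 1-hour tier:   $cached1h(1) = 10.50 vs $fresh(2) = 10.00, one read is not enough;
//                $cached1h(2) = 11.00 vs $fresh(3) = 15.00, two reads recover the 2x write.
```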

Each read also refreshes the TTL for free. A chatbot that keeps serving messages every few minutes will hold its cache indefinitely as long as traffic doesn't go silent for the full window.

## The minimum token trap

The single most common reason people think caching "didn't work" is that their prompt is too short. Cache writes silently no-op below a per-model threshold. No error, no warning.

As of April 2026:

- Claude Opus 4.7, Opus 4.6, Opus 4.5, and Haiku 4.5: **4,096 tokens** minimum
- Claude Sonnet 4.6, Haiku 3.5, and Haiku 3: **2,048 tokens**
- Claude Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, and Sonnet 3.7: **1,024 tokens**

If your cached prefix is even one token below the threshold, you pay full-price input tokens on every request, and the response usage shows zeros for both cache creation and cache reads. It's not a bug; the behavior is documented, just easy to miss. More on how to verify this below.
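If you want a guard in code, the table above is small enough to encode. The model ID substrings below are my assumptions about how the identifiers are spelled; map them to whatever you actually pass, and re-check the thresholds against the current docs before trusting this:

```php
<?php

// Per-model minimum cacheable prefix sizes, per the April 2026 table above.
// The substring matching is illustrative; adjust it to your own model IDs.
function minimumCacheableTokens(string $model): int
{
    return match (true) {
        str_contains($model, 'opus-4-7'),
        str_contains($model, 'opus-4-6'),
        str_contains($model, 'opus-4-5'),
        str_contains($model, 'haiku-4-5') => 4096,

        str_contains($model, 'sonnet-4-6'),
        str_contains($model, 'haiku-3-5'),
        str_contains($model, 'haiku-3') => 2048,

        // Sonnet 4.5, Opus 4.1, Opus 4, Sonnet 4, Sonnet 3.7 (and,
        // optimistically, anything unrecognized) fall through here.
        default => 1024,
    };
}
```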

## Turning it on with anthropic-php

The SDK (which I maintain, and explain the history of in [why I built a PHP client for Anthropic's Claude API](https://mozex.dev/blog/9-why-i-built-a-php-client-for-anthropics-claude-api)) passes request payloads as plain associative arrays, so `cache_control` is just another key you add. Nothing special at the SDK layer.

Here's the minimal version with a long system prompt cached for 5 minutes:

```php
use Anthropic\Laravel\Facades\Anthropic;

$response = Anthropic::messages()->create([
    'model' => 'claude-opus-4-7',
    'max_tokens' => 1024,
    'system' => [
        [
            'type' => 'text',
            'text' => $longSystemPrompt,
            'cache_control' => ['type' => 'ephemeral'],
        ],
    ],
    'messages' => [
        ['role' => 'user', 'content' => $userMessage],
    ],
]);
```

Notice that `system` is an array of content blocks, not a plain string. That's the shape you need for cache control to attach to a specific block. If you pass a string system prompt, there's nowhere to hang `cache_control`.
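Because that block shape is easy to get wrong, I wrap it in a small helper. This is my own convention, not part of the SDK:

```php
<?php

/**
 * Wrap a plain string system prompt in the content-block shape that
 * cache_control needs. Pass '1h' as $ttl for the 1-hour tier; null
 * keeps the default 5-minute TTL.
 */
function cacheableSystem(string $prompt, ?string $ttl = null): array
{
    $cacheControl = ['type' => 'ephemeral'];

    if ($ttl !== null) {
        $cacheControl['ttl'] = $ttl;
    }

    return [
        [
            'type' => 'text',
            'text' => $prompt,
            'cache_control' => $cacheControl,
        ],
    ];
}

// Usage in the request payload: 'system' => cacheableSystem($longSystemPrompt),
```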

For the 1-hour TTL, add a `ttl` field:

```php
'cache_control' => ['type' => 'ephemeral', 'ttl' => '1h'],
```

## Automatic vs explicit breakpoints

Anthropic offers two flavors, and they work together.

**Explicit breakpoints** let you place up to 4 `cache_control` markers in a single request, typically one each for tools, system, and specific message boundaries. Use this when the structure of your request is stable: fixed tool set, fixed system prompt, growing conversation.

**Automatic caching** lets you put a single `cache_control` at the top level of the request. The system picks the last cacheable block and keeps moving the breakpoint forward as the conversation grows. Use this for multi-turn chat where you don't want to manually re-mark the breakpoint on every turn.

The two are compatible. If you combine them, the automatic breakpoint consumes one of your 4 slots:

```php
$response = Anthropic::messages()->create([
    'model' => 'claude-opus-4-7',
    'max_tokens' => 1024,
    'cache_control' => ['type' => 'ephemeral'],
    'system' => [
        [
            'type' => 'text',
            'text' => $systemPrompt,
            'cache_control' => ['type' => 'ephemeral'],
        ],
    ],
    'tools' => $tools,
    'messages' => $conversation,
]);
```

This pattern pins the system prompt with its own breakpoint (so a growing conversation can't knock it out) and lets the automatic breakpoint ride along with the latest message.

## Verifying it actually worked

Every response includes usage fields that the SDK exposes directly:

```php
$response->usage->inputTokens;                // fresh tokens
$response->usage->cacheCreationInputTokens;   // tokens written to cache
$response->usage->cacheReadInputTokens;       // tokens read from cache
$response->usage->outputTokens;
```

If both `cacheCreationInputTokens` and `cacheReadInputTokens` are zero on a request you expected to hit cache, your prefix didn't meet the minimum token threshold. Fix it by expanding the cached content (more context, a longer system prompt) or by switching to a model with a lower threshold.

For a breakdown by TTL window:

```php
$response->usage->cacheCreation?->ephemeral5mInputTokens;
$response->usage->cacheCreation?->ephemeral1hInputTokens;
```

In Sevantia I log these on every response into an `ai_requests` table alongside user and conversation IDs. Graphing the cache-hit ratio over time is the only way to notice when a deploy silently breaks your caching (more on that in a minute).
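The row itself is just the four usage numbers plus a derived ratio. Here's roughly how I shape it before the insert; the column names and the hit-ratio definition are my own conventions, not anything the SDK prescribes:

```php
<?php

/**
 * Build the log row for one response. The ratio is the fraction of
 * input-side tokens that were served from cache.
 */
function usageRow(int $fresh, int $cacheWrite, int $cacheRead, int $output): array
{
    $totalInput = $fresh + $cacheWrite + $cacheRead;

    return [
        'input_tokens' => $fresh,
        'cache_creation_input_tokens' => $cacheWrite,
        'cache_read_input_tokens' => $cacheRead,
        'output_tokens' => $output,
        'cache_hit_ratio' => $totalInput > 0 ? $cacheRead / $totalInput : 0.0,
    ];
}
```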

## Invalidation is stricter than you think

This is where most people lose money. The cached prefix is an exact-match hash of everything up to the breakpoint, in order. Any change anywhere in that prefix invalidates the cache and bills you the write premium again (1.25x for the 5-minute tier, 2x for the 1-hour tier) to rewrite it.

The cascade works top-down in this order:

1. Tools
2. System prompt
3. Messages

Changing tools invalidates everything downstream. Changing the system prompt invalidates messages. Changing a message somewhere in the middle invalidates everything after it.

Things that quietly break the cache and surprise people:

- Adding or reordering a tool definition. Even a whitespace change in a tool description.
- Toggling citations. This modifies the system prompt internally, which invalidates the system and messages caches.
- Toggling extended thinking or changing the thinking budget. This one is narrower than people expect: only the messages cache is invalidated. Tools and system stay valid.
- Adding an image anywhere in the conversation. That invalidates message blocks.

I watched a client ship a "harmless" system prompt tweak (a comma, I'm not kidding) and double their API bill overnight because every active conversation rewrote its cached prefix on the next turn. If you're going to iterate on your system prompt, do it behind a feature flag and watch the cache creation-to-read ratio.
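One cheap guard against that failure mode: fingerprint everything above the breakpoint at deploy time and diff it against the previous build. The helper below is my own convention, not an SDK feature, and it only catches changes in payloads you construct yourself, not server-side ones like citation toggles:

```php
<?php

// Hash the prefix the way the cache does conceptually: exact bytes, in order.
// json_encode is deterministic for a given array, so a reorder, a rename, or
// even a whitespace tweak in a tool description changes the fingerprint.
function prefixFingerprint(array $tools, array $systemBlocks): string
{
    return hash('sha256', json_encode([$tools, $systemBlocks]));
}
```

Store the fingerprint alongside each release; a changed hash on deploy tells you every active conversation is about to pay a cache rewrite.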

## The pricing math, concrete

Take a 3,000-token system prompt on Claude Opus 4.7 and a session where the same user sends 20 messages. The system prompt is identical on every turn.

Without caching, you pay $5/M to re-send those 3,000 tokens on every message. Twenty messages, 60,000 tokens, **$0.30** on the system prompt alone.

With 5-minute caching:

- First message writes the cache: 3,000 tokens at $6.25/M = $0.019
- The next 19 messages read it: 3,000 × 19 tokens at $0.50/M = $0.029
- Each read also refreshes the TTL, so the cache stays hot as long as the user keeps typing
- Total: **$0.048**

That's 84% off the cost of the system prompt portion of the bill. Your actual savings on a full session depend on the ratio between your cached prefix and everything else (user messages, assistant responses, output tokens), but the formula is always the same: cached tokens get billed at 10% of base input after the first write.

Caching is a bet on reuse. The break-even is tiny (one read for the 5-minute tier, two for the 1-hour tier), so caching only loses money when a written prefix is read back fewer times than that break-even; the worst case is a prefix that's never read at all.
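The session arithmetic above is compact enough to check in a few lines of plain PHP, no SDK involved:

```php
<?php

// The 20-message session above: 3,000-token system prompt, Opus 4.7 rates.
$promptTokens = 3_000;
$turns = 20;

// Without caching: every turn re-sends the prompt at the $5/M base rate.
$noCache = $turns * $promptTokens * 5.00 / 1_000_000;           // 0.30

// With 5-minute caching: one write at $6.25/M, then 19 reads at $0.50/M.
$write  = $promptTokens * 6.25 / 1_000_000;                     // 0.01875
$reads  = ($turns - 1) * $promptTokens * 0.50 / 1_000_000;      // 0.0285
$cached = $write + $reads;                                      // 0.04725

$savings = 1 - $cached / $noCache;                              // ~0.84
```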

## When caching is wrong

It's not free and it's not universal. Skip it when:

- Your prompts are below the model's minimum threshold. The cache call silently no-ops; you lose nothing, but you also gain nothing.
- You send single-shot requests with unique system prompts. You pay 1.25x to write a cache you'll never read.
- Your system prompt is templated per user and rarely reused. Same problem.
- You're running ad-hoc experiments where the prompt shape changes every few minutes. Cache write cost will exceed any read savings.

## What I actually do in Sevantia

Three rules that have held up in production:

1. Cache the system prompt with a 5-minute TTL on every chat session. Break-even happens within the first few user messages, and the cache refreshes itself as long as the user keeps typing.
2. Use automatic caching for the conversation history. Explicit breakpoints on long conversations are a maintenance chore; the automatic version just works.
3. Log `cacheCreationInputTokens` and `cacheReadInputTokens` on every response, and alert if the cache-read ratio drops below 70% for any given model. A drop usually means a deploy changed something in the cached prefix.

There's no magic to this. Prompt caching is a straightforward mechanism with strict invalidation rules and clear pricing. The reason people leave money on the table is almost always that they enabled it, never verified it actually worked, and never checked the usage numbers after shipping. If you do the one thing of graphing `cacheReadInputTokens` as a percentage of total input tokens, you'll catch the other mistakes quickly.

Anthropic's own [prompt caching docs](https://docs.claude.com/en/docs/build-with-claude/prompt-caching) are the authoritative reference for thresholds and pricing; both move occasionally. If you're still choosing between AI SDKs for the integration layer, I covered that in [which AI package to actually use in Laravel](https://mozex.dev/blog/5-which-ai-package-should-you-actually-use-in-laravel). And if you're running anthropic-php directly, the `CreateResponseUsage` object is where every caching decision gets validated or falsified.