AWS Bedrock Prompt Cost Estimator

Estimate per-request and monthly cost across models, regions, and caching scenarios.

Pricing effective: Apr 16, 2026 · Last updated: Apr 16, 2026 · Estimates only; confirm against the official Bedrock pricing page for billing.

How Bedrock pricing works

AWS Bedrock bills per million tokens, with separate rates for input and output, and prices vary by model and by region. For Anthropic's Claude family, prompt caching introduces two additional line items: cache write (when Anthropic stores a cacheable portion of your prompt) and cache read (when a subsequent request reuses that cached portion). Cache reads cost roughly 10% of the standard input rate, while cache writes run about 1.25× standard input. For workloads with repeated prompt prefixes, that trade adds up to real savings.
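The billing model above can be sketched as a small function. The rates and multipliers here are illustrative placeholders (the 1.25× write and 0.1× read factors come from the text; the $3 / $15 per-million-token figures are hypothetical), not actual Bedrock prices:

```python
# Bedrock-style billing sketch: separate per-million-token rates for input,
# output, cache writes (~1.25x input), and cache reads (~0.1x input).

def request_cost(input_tokens, output_tokens, input_rate, output_rate,
                 cache_write_tokens=0, cache_read_tokens=0):
    """Rates are USD per million tokens; returns USD for one request."""
    cache_write_rate = input_rate * 1.25   # write premium per the text
    cache_read_rate = input_rate * 0.10    # read discount per the text
    return (
        input_tokens * input_rate
        + output_tokens * output_rate
        + cache_write_tokens * cache_write_rate
        + cache_read_tokens * cache_read_rate
    ) / 1_000_000

# Example: 3,000 input tokens, 500 output, no caching, at $3 / $15 per MTok.
cost = request_cost(3_000, 500, input_rate=3.0, output_rate=15.0)  # -> 0.0165
```

Multiply the per-request figure by expected monthly request volume to get a monthly estimate.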

Which model should I choose?

  • Claude Opus 4.7 — highest reasoning quality. Best for complex multi-step tasks, long analyses, or code generation where correctness matters more than cost. Most expensive per token.
  • Claude Sonnet 4.6 — the workhorse balance of quality and cost. A good default for most production workloads. Supports prompt caching.
  • Claude Haiku 4.5 — fastest, cheapest Claude. Good for classification, extraction, simple transforms at scale.
  • Amazon Nova Pro / Lite — Amazon's first-party models. Competitive on price and integrated natively with Bedrock features.
  • Meta Llama 3.3 70B — open-weight, low per-token cost, fine for general chat and summarization.
  • Mistral Large 2 — strong on European languages and structured output formats.
  • Titan Text Embeddings V2 — Amazon's embedding model. Input-only billing (no output tokens).

Prompt caching ROI — when does it pay off?

The math is straightforward: a cache write costs about 1.25× normal input, a cache read about 0.1× normal input. If you reuse a prompt prefix at least twice, caching saves money; if you reuse it a dozen times, the savings are huge. The patterns where this matters most are agent loops with long system prompts, RAG pipelines with fixed knowledge blocks, and chatbots with structured persona and rules in the prefix. For one-shot queries with no shared context, keep caching off — paying 1.25× for a cache write that's never read is pure waste.
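The break-even math above can be checked directly. Normalizing the standard input rate to 1.0 and using the 1.25× write and 0.1× read multipliers from the text:

```python
# For a prefix reused n times: uncached cost grows as n, while cached cost is
# one write (1.25x) plus n-1 reads (0.1x each). Caching wins once n >= 2.

def caching_saves(n_reuses: int) -> bool:
    uncached = 1.0 * n_reuses                  # full input rate every time
    cached = 1.25 + 0.10 * (n_reuses - 1)      # one write, then cheap reads
    return cached < uncached

# caching_saves(1) -> False (pure waste, as the text notes)
# caching_saves(2) -> True  (1.35 vs 2.0)
```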

Why token counts are estimates

Anthropic does not publish its tokenizer, so the counts shown here use a widely used heuristic of roughly 4 characters per token for English prose. Code, tables, emoji, and non-English text tokenize differently. Expect actual token counts within about ±15% of the estimate for typical English prompts. If you need exact counts, run your prompt through the Bedrock Messages API and read the usage field on the response. For budgeting and model-comparison decisions, the estimate here is more than accurate enough.
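The heuristic is one line of code. This is the rough character-count approximation described above, not Anthropic's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    # ~4 characters per token holds for typical English prose; code, tables,
    # emoji, and non-English text can deviate well beyond the +/-15% band.
    return max(1, round(len(text) / 4))

estimate_tokens("a" * 400)  # -> 100
```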

Regional pricing: does it matter?

AWS charges different per-token rates by region. US regions (us-east-1, us-west-2) are usually the cheapest, with EU and Asia-Pacific regions slightly more expensive. The price delta is typically 5–15%, which can add up at scale — but not enough to move your workload halfway around the world. Latency usually wins: if your users are in Singapore, the round-trip penalty of calling us-east-1 dwarfs any savings, so host in ap-southeast-1 and accept the premium.
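The "cheapest region that meets your latency requirement" rule can be sketched as a filter-then-minimize. The per-region rates and latencies below are hypothetical illustrations, not published AWS figures:

```python
# Hypothetical per-region input rates (USD per MTok) and round-trip latencies
# as seen from a user in Singapore.
regions = {
    "us-east-1":      {"input_rate": 3.00, "latency_ms": 230},
    "eu-west-1":      {"input_rate": 3.25, "latency_ms": 120},
    "ap-southeast-1": {"input_rate": 3.40, "latency_ms": 25},
}

def pick_region(max_latency_ms: float) -> str:
    # Keep only regions within the latency budget, then take the cheapest.
    ok = {r: v for r, v in regions.items() if v["latency_ms"] <= max_latency_ms}
    return min(ok, key=lambda r: ok[r]["input_rate"])

pick_region(100)  # -> "ap-southeast-1": only it meets the budget, premium accepted
pick_region(300)  # -> "us-east-1": all qualify, so the cheapest wins
```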

Frequently asked questions

Does this tool include Provisioned Throughput pricing?

No. This estimates on-demand pricing, which is what the vast majority of applications use. Provisioned Throughput is a capacity-reservation model that only makes sense at very high sustained throughput — if you're there, AWS will already be helping you model it.

Why do I see different prices in different regions?

AWS charges by region because the underlying compute and data-transfer costs vary. US regions are baseline; EU and APAC regions are incrementally more expensive. It's worth picking the cheapest region that meets your latency and compliance requirements — not the cheapest region absolutely.

How often do Bedrock prices change?

Rarely, but occasionally. We stamp the effective date and last-updated date on the pricing data so you can tell how fresh it is. If you notice a discrepancy with the official AWS Bedrock pricing page, let us know and we'll refresh our data.

What about cross-region inference profiles?

Cross-region inference uses a home region's pricing for billing purposes even when requests are served from other regions. For estimation, pick the home region of your inference profile.

How do I count output tokens before I actually run the prompt?

You can't know exactly, but you can bound it: pick a realistic "max tokens" ceiling based on the response type you need (a JSON classification response is tiny; a long-form article is thousands). Start generous; tighten once you have real usage data. Many production systems overestimate output by 2–3× compared to actual usage.
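Bounding output cost with a max-tokens ceiling, as described above, is a one-line calculation. The $15 per-million-token output rate here is an illustrative placeholder:

```python
def max_output_cost(max_tokens: int, output_rate: float) -> float:
    """Worst-case USD output cost for one request at the given ceiling."""
    return max_tokens * output_rate / 1_000_000

# A JSON classification capped at 200 tokens vs a long-form article at 4,000:
small = max_output_cost(200, output_rate=15.0)    # -> 0.003
large = max_output_cost(4_000, output_rate=15.0)  # -> 0.06
```

Once real usage data arrives, replace the ceiling with an observed per-request average; as noted above, the bound often overshoots actual usage by 2–3×.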
