Key Takeaway
AWS is integrating Cerebras AI inference hardware into its data centers and plans to deliver the capability through AWS Bedrock for open-source LLMs and Amazon Nova models. The collaboration pairs Amazon Trainium processors with Cerebras systems to accelerate inference workloads and increase token throughput.
Cerebras AI Inference Hardware Arrives on AWS – Key Points
The Story
AWS will deploy Cerebras hardware in its cloud data centers and make Cerebras AI inference capabilities available through AWS Bedrock under a multiyear partnership announced March 13. The system splits prompt processing and answer generation across Amazon Trainium processors and the Cerebras Wafer-Scale Engine, creating a disaggregated architecture designed for high-speed token output.
The companies say this approach addresses growing demand for fast inference in coding assistants, chatbots, and interactive applications. Reuters reported the service could come online in the second half of 2026, while the companies indicate the first deployments may reach customers within months.
The Facts
AWS plans to deploy Cerebras processors inside its own data centers under a multiyear deal announced March 13.
AWS will integrate specialized chips designed by Cerebras into its infrastructure, bringing Cerebras AI inference capabilities directly to the AWS cloud rather than offering the hardware only through Cerebras's own systems.
The service is expected to be available through AWS Bedrock.
AWS Bedrock will provide access to open-source large language models and Amazon Nova models running on infrastructure optimized for Cerebras AI inference, positioning Bedrock as the central gateway for generative AI services on AWS.
The companies are developing a disaggregated inference architecture.
Amazon Trainium processors perform the prefill stage (processing the input prompt), while Cerebras WSE systems handle decode (generating output tokens one at a time). This architecture separates the two main steps of inference so each processor runs the workload it handles most efficiently.
The AWS deployment links Cerebras chips with Amazon Trainium3 processors.
The systems are connected through Amazon networking technology, enabling coordinated execution of the two inference stages.
The companies say the architecture can deliver 5× more high-speed token capacity.
By combining Trainium with Cerebras hardware, the system is designed to expand inference throughput without increasing the overall hardware footprint.
Demand for fast inference is rising sharply in AI development workflows.
Cerebras says agentic coding produces roughly 15× more tokens per request than standard conversational queries, increasing the need for infrastructure optimized for real-time inference performance.
Prefill and decode stages have different hardware requirements.
Prefill is compute-heavy but relatively light on memory bandwidth, while decode is memory-bandwidth-bound because it must reread model weights for every generated token. The architecture assigns each stage to hardware specialized for that workload.
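To make the split concrete, here is a minimal, purely illustrative Python sketch of how a disaggregated pipeline might route the two stages to separate hardware pools. The `PrefillPool`, `DecodePool`, and `disaggregated_generate` names are hypothetical placeholders, not AWS, Bedrock, or Cerebras APIs.

```python
# Illustrative sketch only: these classes are hypothetical and do not
# represent any real AWS, Bedrock, or Cerebras interface. They show the
# general idea of disaggregated inference, where prompt processing
# (prefill) and token generation (decode) run on different hardware pools.

from dataclasses import dataclass


@dataclass
class KVCache:
    """Attention key/value state produced by prefill and consumed by decode."""
    data: bytes  # placeholder for the serialized cache tensors


class PrefillPool:
    """Compute-heavy stage: processes the full prompt in parallel."""

    def run(self, prompt: str) -> KVCache:
        # In a real system this would run on accelerators suited to large
        # matrix multiplications (the article attributes this stage to
        # Trainium) and hand the resulting KV cache to the decode pool.
        return KVCache(data=prompt.encode())


class DecodePool:
    """Memory-bandwidth-heavy stage: generates tokens one at a time."""

    def run(self, cache: KVCache, max_tokens: int) -> list[str]:
        # Each decode step rereads the model weights, so this stage benefits
        # from very high memory bandwidth (attributed here to Cerebras).
        return [f"token_{i}" for i in range(max_tokens)]


def disaggregated_generate(prompt: str, max_tokens: int = 8) -> list[str]:
    cache = PrefillPool().run(prompt)            # stage 1: prefill
    return DecodePool().run(cache, max_tokens)   # stage 2: decode


if __name__ == "__main__":
    print(disaggregated_generate("Explain disaggregated inference."))
```

In a production system the two pools would live on different machines and the KV cache would be transferred over the network, which is why the companies emphasize the interconnect between Trainium and Cerebras systems.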
Trainium handles compute-heavy stages while Cerebras handles token generation.
Amazon Trainium processors focus on prompt processing, while Cerebras CS-3 systems, built around the wafer-scale processor, generate the output tokens.
Cerebras has also become a major supplier of inference infrastructure to OpenAI.
Earlier reporting described a multiyear agreement worth more than $10 billion for up to 750 megawatts of compute capacity between 2026 and 2028 to support ChatGPT and other latency-sensitive workloads.
Cerebras wafer-scale processors are central to the company’s strategy.
The WSE-3 integrates about 900,000 AI cores and roughly 4 trillion transistors on a single wafer-scale chip, designed to reduce communication bottlenecks common in GPU clusters.
AWS and Cerebras will support both aggregated and disaggregated deployments.
Disaggregated configurations are intended for stable, large-scale workloads, while traditional architectures remain available for mixed inference patterns. Financial terms of the partnership were not disclosed.
Background / Context
Cerebras has focused on building infrastructure specifically for inference speed rather than general-purpose AI computing. The company is valued at about $23.1 billion and has pursued partnerships with major AI developers seeking alternatives to GPU-heavy architectures. The AWS collaboration places Cerebras AI inference systems inside one of the world’s largest cloud platforms, expanding their reach to enterprise customers running generative AI applications.
On January 16, 2026, OpenAI signed a multi-year deal valued at over $10 billion with Cerebras Systems to secure large-scale, low-latency compute capacity through 2028, underscoring both its acute infrastructure shortage and a deliberate strategy to diversify beyond Nvidia GPUs, with a specific focus on real-time inference performance.
Numbers that Matter
- 3,000 tokens per second: Claimed token generation speed achieved by Cerebras systems in some deployments.
- 15× more tokens per query: Estimated increase in token demand for agentic coding workflows compared with chat.
- 5× more high-speed token capacity: Claimed throughput increase from the Trainium–Cerebras architecture.
- $23.1 billion: Estimated valuation of Cerebras.
- $10 billion: Value of a separate compute agreement between OpenAI and Cerebras.
- 750 megawatts: Planned compute capacity from that agreement between 2026 and 2028.
- 900,000 AI cores and 4 trillion transistors: Approximate specifications of the Cerebras WSE-3 processor.
Why This Matters
The partnership signals a shift in cloud AI infrastructure toward specialized hardware for inference rather than relying exclusively on GPU clusters. By embedding Cerebras AI inference systems directly inside AWS infrastructure and pairing them with Trainium processors, the companies aim to deliver faster responses for coding tools, chatbots, and interactive AI services. If the performance gains hold up in production, the model could influence how future AI data centers are designed.
This article was drafted with the assistance of generative AI. All facts and details were reviewed and confirmed by an editor prior to publication.