Cerebras Launches the World’s Fastest AI Inference
Source: Cerebras
Cerebras Systems, the pioneer in high-performance AI compute, announced Cerebras Inference, the fastest AI inference solution in the world. Delivering 1,800 tokens per second for Llama 3.1 8B and 450 tokens per second for Llama 3.1 70B, Cerebras Inference is 20 times faster than NVIDIA GPU-based solutions in hyperscale clouds.
Inference is the fastest-growing segment of AI compute and constitutes approximately 40% of the total AI hardware market. The advent of high-speed AI inference, exceeding 1,000 tokens per second, is comparable to the introduction of broadband internet, unleashing vast new opportunities and heralding a new era for AI applications. Cerebras' 16-bit accuracy and 20x faster inference calls empower developers to build next-generation AI applications, such as AI agents, that perform complex, multi-step tasks in real time.
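To give a rough sense of what these throughput figures mean in practice, the sketch below streams a completion from an OpenAI-compatible chat endpoint and estimates output tokens per second. The base URL, model identifier, and environment variable are placeholders for illustration, not confirmed details of the Cerebras Inference API.

```python
import os
import time
from openai import OpenAI  # pip install openai

# Hypothetical OpenAI-compatible endpoint; the base URL and model
# name below are placeholders, not confirmed Cerebras values.
client = OpenAI(
    base_url="https://api.example-inference.com/v1",
    api_key=os.environ["INFERENCE_API_KEY"],
)

start = time.perf_counter()
chunks = 0
stream = client.chat.completions.create(
    model="llama-3.1-8b",  # placeholder model identifier
    messages=[{"role": "user", "content": "Explain broadband internet in 200 words."}],
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries roughly one token of output.
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1
elapsed = time.perf_counter() - start

# Chunk count approximates output tokens; a tokenizer would give an exact figure.
print(f"~{chunks / elapsed:.0f} output tokens/second")
```

At the quoted 1,800 tokens per second, a 500-token response would arrive in well under half a second, which is the kind of turnaround interactive applications need.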
“Cerebras has taken the lead in Artificial Analysis’ AI inference benchmarks. Cerebras is delivering speeds an order of magnitude faster than GPU-based solutions for Meta’s Llama 3.1 8B and 70B AI models. We are measuring speeds above 1,800 output tokens per second on Llama 3.1 8B, and above 446 output tokens per second on Llama 3.1 70B – a new record in these benchmarks,” said Micah Hill-Smith, Co-Founder and CEO of Artificial Analysis.
“DeepLearning.AI has multiple agentic workflows that require prompting an LLM repeatedly to get a result. Cerebras has built an impressively fast inference capability which will be very helpful to such workloads,” said Dr. Andrew Ng, Founder of DeepLearning.AI.
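To make the "prompting an LLM repeatedly" pattern concrete, here is a minimal sketch of a sequential agentic workflow in which each step's output feeds the next prompt, so end-to-end latency is the sum of every call and scales directly with per-call inference speed. The client follows the placeholder setup in the earlier sketch; the steps are illustrative, not a DeepLearning.AI workflow.

```python
# Minimal sequential "agentic" pipeline: each LLM call consumes the
# previous call's output, so the steps cannot overlap and total
# latency grows with every round trip.
STEPS = [
    "Draft an outline for a short post about fast AI inference.",
    "Expand this outline into a full draft:\n{prev}",
    "Edit this draft for clarity and concision:\n{prev}",
]

def run_pipeline(client, model: str) -> str:
    prev = ""
    for template in STEPS:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": template.format(prev=prev)}],
        )
        prev = response.choices[0].message.content
    return prev
```

Because the calls are strictly sequential, a 10x faster model makes the whole pipeline roughly 10x faster, which is why multi-step agentic workloads benefit disproportionately from inference speed.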