Huawei Launches Revolutionary AI Inference Technology to Bridge China's Performance Gap

Published: August 12, 2025 17:45

Huawei unveiled its breakthrough UCM (Unified Cache Manager) technology today at the 2025 Financial AI Inference Applications Forum, targeting the critical bottleneck in China's AI inference capabilities.


The KV-cache-centered acceleration suite integrates multiple caching algorithms to hierarchically manage inference memory data, expanding context windows while delivering high-throughput, low-latency performance that significantly reduces per-token inference costs.
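Huawei has not published UCM's internals, but the hierarchical management described above can be illustrated with a generic two-tier cache: recently used KV blocks stay in a small fast tier, and least-recently-used blocks are demoted to a larger slow tier. Everything here (the `TieredKVCache` name, the LRU policy, the tier sizes) is a hypothetical sketch, not Huawei's API.

```python
from collections import OrderedDict

class TieredKVCache:
    """Minimal sketch of hierarchical KV-cache management:
    hot blocks live in a small fast tier, cold blocks are
    demoted to a larger slow tier and promoted on reuse."""

    def __init__(self, fast_capacity):
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # e.g. HBM: small, low latency
        self.slow = {}             # e.g. DRAM/SSD: large, higher latency

    def put(self, key, kv_block):
        self.fast[key] = kv_block
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            cold_key, cold_block = self.fast.popitem(last=False)
            self.slow[cold_key] = cold_block   # demote LRU block

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)         # refresh recency
            return self.fast[key]
        if key in self.slow:
            block = self.slow.pop(key)
            self.put(key, block)               # promote on reuse
            return block
        return None

cache = TieredKVCache(fast_capacity=2)
cache.put("layer0:tok0", b"kv0")
cache.put("layer0:tok1", b"kv1")
cache.put("layer0:tok2", b"kv2")               # demotes tok0 to the slow tier
assert "layer0:tok0" in cache.slow
assert cache.get("layer0:tok0") == b"kv0"      # promoted back on access
```

Because cold blocks spill to capacity tiers instead of being discarded, the effective context window grows beyond what the fast memory alone could hold.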


Huawei plans to open-source UCM in September 2025, initially launching through the MindSpore community before contributing to mainstream inference engine platforms and sharing with industry storage vendors and ecosystem partners.


[Image source: Huawei]


Addressing China's Inference Performance Challenge


The AI industry has pivoted from "pursuing maximum model capabilities" to "optimizing inference experience," with inference performance now serving as the definitive metric for AI model value and commercial viability.


UCM enables on-demand memory flow across HBM, DRAM, and SSD storage media based on memory access patterns, while integrating sparse attention algorithms for deep compute-storage collaboration. In long-sequence scenarios, the technology delivers 2-22x improvements in TPS (tokens per second), substantially reducing per-token inference costs.
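UCM's specific sparse attention algorithms are not public, but the general pattern is to attend only to the most relevant cached positions rather than every stored key-value pair, cutting both compute and the amount of KV data that must be fetched from slower tiers. A minimal top-k sketch (the function name and parameters are illustrative assumptions):

```python
import numpy as np

def topk_sparse_attention(q, K, V, k):
    """Attend only to the k most relevant cached positions --
    a generic sparse-attention sketch, not UCM's actual algorithm."""
    scores = K @ q / np.sqrt(q.shape[0])   # relevance of each cached key
    keep = np.argsort(scores)[-k:]         # indices of the top-k positions
    w = np.exp(scores[keep] - scores[keep].max())
    w /= w.sum()                           # softmax over kept positions only
    return w @ V[keep]                     # weighted sum of kept values

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((1000, 64))        # 1,000 cached key vectors
V = rng.standard_normal((1000, 64))
out = topk_sparse_attention(q, K, V, k=32) # reads 32 of 1,000 KV entries
assert out.shape == (64,)
```

Here each decode step touches 32 of 1,000 cached entries, which is why sparse attention pairs naturally with tiered storage: most of the cache can sit in cheaper, slower media without being read on every step.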


The performance gap is stark: leading international AI models achieve output speeds of 200 tokens/s (5 ms latency), while China's mainstream AI models typically deliver under 60 tokens/s (50-100 ms latency).
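The throughput and latency figures are two views of the same quantity: steady-state decode throughput implies a per-token interval. A quick check confirms the 200 tokens/s figure matches the quoted 5 ms, while 60 tokens/s implies roughly 17 ms between tokens (the article's 50-100 ms figures may refer to a different latency metric, such as time to first token):

```python
def per_token_interval_ms(tokens_per_second):
    """Steady-state decode interval implied by a throughput figure."""
    return 1000.0 / tokens_per_second

assert per_token_interval_ms(200) == 5.0   # matches the quoted 5 ms
print(round(per_token_interval_ms(60), 1)) # ~16.7 ms per token at 60 tok/s
```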


As AI applications penetrate real-world scenarios, exponential growth in user scale and request volumes has created massive token-processing demands. The resulting operational costs—including server maintenance and power consumption—have made maximizing the intelligence delivered per token a core industry objective.


Real-World Implementation


Huawei's AI inference acceleration solution, combining UCM with OceanStor A-series AI storage, is currently piloting with China UnionPay across three smart finance scenarios: customer voice analysis, marketing strategy development, and office assistance applications.


In office assistance deployments, the solution supports inference over sequences exceeding 170,000 tokens, eliminating the memory and compute bottlenecks that typically constrain ultra-long-sequence model operations.
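A back-of-the-envelope calculation shows why sequences of this length demand tiered caching: the KV cache for 170,000 tokens on a hypothetical 32-layer model with a 4096-wide hidden state in fp16 already approaches the capacity of a single accelerator's HBM. The model dimensions below are illustrative assumptions, not figures from the article.

```python
def kv_cache_bytes(seq_len, n_layers, hidden_size, bytes_per_elem=2):
    """KV-cache footprint: keys + values for every layer and position."""
    return 2 * n_layers * seq_len * hidden_size * bytes_per_elem

gib = kv_cache_bytes(170_000, 32, 4096) / 2**30
print(f"{gib:.0f} GiB")  # 83 GiB for this hypothetical fp16 model
```

At roughly 83 GiB for the cache alone, before weights and activations, spilling cold KV blocks to DRAM and SSD is what makes such sequence lengths practical.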