AI Has Revolutionized Server Architecture: Memory No Longer Needs to Reside on Each Machine

## The Shift in Memory Management for AI Servers

The memory shortage currently gripping the technology sector isn’t just a consumer problem; it also affects the major players in AI. As companies increasingly rely on AI models deployed in data centers, demand for memory has risen to unprecedented levels. That pressure challenges a longstanding assumption of server architecture: that each machine must depend solely on its own internal RAM.

### Rethinking Memory Allocation

The concept of a “memory godbox” emerges as a solution. Instead of being limited to its local memory, a system can draw on a larger pool of memory shared across multiple machines. This mirrors how storage already works, where data can live on many devices, whether local or on the network. With a similar model for RAM, memory can be reassigned to whichever machine needs it at a given moment, which improves both utilization and flexibility.
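As a rough illustration of the pooling idea, the sketch below models a shared pool that grants capacity to hosts and takes it back as demand changes. It is a toy accounting model only: the `MemoryPool` class, the host names, and the sizes are hypothetical and do not correspond to any vendor's API or to how CXL hardware actually exposes memory.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryPool:
    """Toy model of a shared memory pool; all names here are hypothetical."""
    capacity_gb: int
    grants: dict[str, int] = field(default_factory=dict)  # host -> GB granted

    @property
    def free_gb(self) -> int:
        return self.capacity_gb - sum(self.grants.values())

    def allocate(self, host: str, size_gb: int) -> bool:
        """Grant size_gb to host if the pool has room; return success."""
        if size_gb <= 0 or size_gb > self.free_gb:
            return False
        self.grants[host] = self.grants.get(host, 0) + size_gb
        return True

    def release(self, host: str, size_gb: int) -> None:
        """Return up to size_gb from host back to the pool."""
        remaining = max(self.grants.get(host, 0) - size_gb, 0)
        if remaining:
            self.grants[host] = remaining
        else:
            self.grants.pop(host, None)


pool = MemoryPool(capacity_gb=1024)
pool.allocate("host-a", 768)          # host-a has a demand spike
print(pool.allocate("host-b", 512))   # False: only 256 GB left
pool.release("host-a", 512)           # host-a's demand drops
print(pool.allocate("host-b", 512))   # True: freed capacity moves to host-b
```

The point of the sketch is that capacity stranded on one machine in a conventional server becomes available to another as soon as it is released.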

### The Rise of Compute Express Link (CXL)

For years, Compute Express Link (CXL) has promised a more modular architecture for server components. It was slow to gain traction at first, but its relevance has surged amid the current memory crunch. CXL offers a cache-coherent interface that connects processors, accelerators, and memory over the PCIe physical layer. While the idea of decoupling resources is simple, implementing it is hard, because every component must keep working seamlessly with the others.

#### Evolution of CXL Standards

CXL has improved with each revision. Initially, it expanded a server’s memory through modules attached to PCIe slots. CXL 2.0 made memory pooling possible, allowing several machines to draw from a common pool. That came with a limitation: memory could be reassigned from one machine to another, but multiple systems could not access the same region simultaneously to work on shared data. CXL 3.0 marks a notable advance by adding memory sharing between hosts, although some technical limitations remain.
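The difference between pooling and sharing can be sketched with a toy model. Here a non-shareable region stands in for CXL 2.0-style pooling (one host at a time, reassignable), and a shareable region stands in for CXL 3.0-style sharing. The `Region` class and its methods are hypothetical, and the sketch ignores the cache-coherence machinery that real hardware needs to make sharing safe.

```python
class Region:
    """Toy memory region; `shareable` stands in for CXL 3.0-style sharing."""

    def __init__(self, name: str, shareable: bool):
        self.name = name
        self.shareable = shareable
        self.hosts: set[str] = set()

    def attach(self, host: str) -> bool:
        """Attach host; a non-shareable region accepts one host at a time."""
        if self.hosts and not self.shareable and host not in self.hosts:
            return False
        self.hosts.add(host)
        return True

    def detach(self, host: str) -> None:
        self.hosts.discard(host)


pooled = Region("pooled", shareable=False)
shared = Region("shared", shareable=True)

print(pooled.attach("host-a"))  # True
print(pooled.attach("host-b"))  # False: must wait for host-a to detach
pooled.detach("host-a")
print(pooled.attach("host-b"))  # True: region reassigned to host-b

print(shared.attach("host-a"), shared.attach("host-b"))  # True True
```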

### Crucial Memory Needs for AI

As reported by The Next Platform, the shortfall in AI performance is not merely a compute problem; it is rooted in memory limitations. High Bandwidth Memory (HBM), which sits alongside GPUs, is fast but limited in capacity and expensive. Training processes vast datasets to build a model, but memory pressure intensifies during inference, when a trained model is used to answer queries.

### The Importance of KV Cache

A language model generates each response incrementally, token by token, which requires a working memory structure known as the KV (key-value) cache. It stores the attention keys and values computed for earlier tokens, so the model can maintain context without recomputing them at every step. For services handling many users at once, this cache can balloon, occasionally surpassing the memory consumed by the model itself.
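A back-of-the-envelope calculation shows how the cache can outgrow the model. The sketch below uses the standard sizing rule (two tensors, keys and values, per layer, per token), with assumed figures that are not from the article: a hypothetical model with 32 layers, 32 KV heads of dimension 128, 16-bit values, 7 billion parameters, a 4,096-token context, and 64 concurrent sessions.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Size of the KV cache in bytes for a transformer decoder."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 1024 ** 3

# Hypothetical model configuration (assumed values, for illustration only).
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

per_token = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=1, batch=1)
total = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096, batch=64)
weights = 7_000_000_000 * 2  # assumed 7B parameters at 2 bytes each

print(f"per token: {per_token / 1024**2:.1f} MiB")   # per token: 0.5 MiB
print(f"KV cache:  {total / GIB:.1f} GiB")           # KV cache:  128.0 GiB
print(f"weights:   {weights / GIB:.1f} GiB")         # weights:   13.0 GiB
```

Under these assumptions the cache is roughly ten times the size of the weights, and it grows linearly with both context length and the number of concurrent users.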

### Moving from Theory to Reality

Rethinking how memory is provisioned is not just theoretical. Companies such as Panmnesia, Liqid, and UnifabriX are building systems that allocate memory across multiple machines, using CXL switches and pools of DDR5. Enfabrica’s Emfasys system illustrates the trend, supporting up to 18 TB of DDR5 per memory server and 144 TB in a complete rack. The industry is moving not only to add memory capacity but also to change how that memory is allocated, so that AI workloads are not held back by it.
