XPU Encyclopedia
Learn about AI compute hardware terminology, specifications, and concepts
🏢 XPU Vendors
Major manufacturers of AI accelerators and their product portfolios. Each vendor brings unique architectural approaches and optimization strategies to AI compute.
Alibaba
Founded: N/A
AI accelerator manufacturer with 1 active product.
AMD
Founded: 1969 (GPU division from ATI acquisition 2006)
Second-largest GPU manufacturer, competing aggressively in datacenter AI with CDNA architecture. ROCm software stack provides CUDA alternative. Strong price-performance positioning.
AWS
Founded: Inferentia: 2019, Trainium: 2021
Amazon's custom AI chips designed for cost-effective inference (Inferentia) and training (Trainium). Tightly integrated with AWS infrastructure. Price-performance leaders for specific workloads.
Axelera AI
Founded: N/A
AI accelerator manufacturer with 0 active products.
Baidu
Founded: N/A
AI accelerator manufacturer with 1 active product.
Biren Technology
Founded: N/A
AI accelerator manufacturer with 1 active product.
Cambricon
Founded: N/A
AI accelerator manufacturer with 1 active product.
Cerebras
Founded: 2016
Revolutionary wafer-scale processor - the largest chip ever built. A single CS-3 wafer (WSE-3) contains 900,000 cores and 44GB of on-chip memory. Eliminates traditional GPU bottlenecks for massive models.
d-Matrix
Founded: N/A
AI accelerator manufacturer with 0 active products.
Enflame Technology
Founded: N/A
AI accelerator manufacturer with 1 active product.
Etched
Founded: N/A
AI accelerator manufacturer with 1 active product.
FuriosaAI
Founded: 2017
Korean startup focused on energy-efficient inference accelerators. Industry-leading efficiency (40+ TFLOPs/Watt). Software-centric approach with compiler optimization.
Google
Founded: 1998 (TPU: 2015)
Pioneered custom AI chips with TPU (Tensor Processing Unit). Designed from scratch for TensorFlow/JAX workloads. Available only through Google Cloud. Industry-leading efficiency metrics.
Graphcore
Founded: 2016
Intelligence Processing Unit (IPU) with unique architecture featuring massive parallelism and In-Processor-Memory. Strong in graph neural networks and alternative model architectures.
Groq
Founded: 2016
Language Processing Unit (LPU) achieving industry-leading inference latency (1.5ms first-token). Eliminates GPU bottlenecks through deterministic scheduling. Optimized for LLM serving.
Huawei
Founded: 1987 (Ascend: 2018)
Ascend AI processors competing with NVIDIA in China. HiSilicon design with full-stack AI framework (MindSpore). Strong in Chinese market.
Hygon
Founded: N/A
AI accelerator manufacturer with 0 active products.
Iluvatar CoreX
Founded: N/A
AI accelerator manufacturer with 1 active product.
Intel
Founded: 1968 (AI accelerators: Habana acquisition 2019)
Traditional CPU leader entering AI accelerator market with Gaudi and Data Center GPU Max series. Leveraging OneAPI for software portability. Strong enterprise relationships.
Intel Habana
Founded: N/A
AI accelerator manufacturer with 2 active products.
Kalray
Founded: N/A
AI accelerator manufacturer with 0 active products.
Lightmatter
Founded: N/A
AI accelerator manufacturer with 0 active products.
Meta
Founded: N/A
AI accelerator manufacturer with 1 active product.
Microsoft
Founded: N/A
AI accelerator manufacturer with 1 active product.
Moore Threads
Founded: N/A
AI accelerator manufacturer with 1 active product.
Mythic
Founded: N/A
AI accelerator manufacturer with 0 active products.
NVIDIA
Founded: 1993
Market leader in AI accelerators with 80%+ datacenter GPU market share. Pioneered GPU computing for AI with CUDA ecosystem. Known for Tensor Cores, NVLink interconnect, and comprehensive software stack.
Qualcomm
Founded: 1985 (AI accelerators: ~2015)
Mobile AI leader with Neural Processing Units in Snapdragon chips. Edge AI focus with power efficiency for smartphones, IoT, and automotive applications.
Rebellions
Founded: N/A
AI accelerator manufacturer with 1 active product.
SambaNova
Founded: N/A
AI accelerator manufacturer with 1 active product.
Tencent
Founded: N/A
AI accelerator manufacturer with 0 active products.
Tenstorrent
Founded: N/A
AI accelerator manufacturer with 2 active products.
Untether AI
Founded: N/A
AI accelerator manufacturer with 0 active products.
☁️ Cloud Providers
Cloud computing providers offering GPU and XPU rentals on-demand. Compare providers by pricing, availability, and hardware options.
Verified Providers
AWS
global
Amazon Web Services - Largest cloud provider with P5 (H100), P4 (A100), and custom Inferentia/Trainium chips. Global availability.
Azure
global
Microsoft Azure - Enterprise-focused with NVIDIA GPUs and integration with Azure AI services. Global presence.
CoreWeave
us
AI-focused GPU cloud built around large NVIDIA GPU fleets. Popular for large-scale training and inference.
Genesis Cloud
eu
European GPU cloud provider with focus on AI/ML workloads.
Google Cloud
global
Google Cloud Platform - TPU pioneer with exclusive access to TPU v4/v5. Also offers NVIDIA GPUs. Strong ML infrastructure.
Lambda Labs
global
GPU cloud specialist offering competitive pricing on NVIDIA GPUs. Popular with AI researchers and startups.
OCI (Oracle)
global
Oracle Cloud Infrastructure - Enterprise cloud offering NVIDIA GPU instances and large GPU clusters.
Paperspace
global
Developer-friendly GPU cloud with notebooks and deployment tools. Good for experimentation.
RunPod
global
Community-driven GPU cloud with spot pricing. Flexible and affordable for ML workloads.
Community Providers
Vast.ai
global
Marketplace for GPU compute connecting buyers with hardware owners. Extremely low prices.
Want to list your service?
If you're a cloud provider offering GPU or XPU compute, we'd love to list you here. Contact us to get verified and featured.
🖥️ XPU Types
GPU (Graphics Processing Unit)
Originally designed for graphics rendering, GPUs have become the dominant hardware for AI training and inference. They excel at parallel processing, making them ideal for matrix operations in deep learning. Modern datacenter GPUs from NVIDIA, AMD, and Intel are optimized specifically for AI workloads.
Examples: NVIDIA H100, AMD MI300X, Intel Data Center GPU Max
TPU (Tensor Processing Unit)
Custom AI accelerators developed by Google specifically for tensor operations in neural networks. TPUs are designed from the ground up for AI workloads and offer high efficiency for training and inference, particularly for models built with TensorFlow/JAX.
Examples: Google TPU v5p, TPU v4, TPU v5e
NPU (Neural Processing Unit)
Specialized processors optimized for neural network inference, often found in edge devices and mobile processors. NPUs typically offer lower power consumption and are designed for specific AI tasks like image recognition, speech processing, or recommendation systems.
Examples: Huawei Ascend, Apple Neural Engine, Qualcomm AI Engine
IPU (Intelligence Processing Unit)
Graphcore's specialized processor architecture designed for both training and inference. IPUs use a unique massively parallel architecture with thousands of independent processors and use In-Processor-Memory for ultra-low latency data access.
Examples: Graphcore Bow IPU, IPU-M2000
LPU (Language Processing Unit)
Groq's specialized architecture optimized specifically for large language model inference. The LPU achieves industry-leading inference latency (as low as 1.5ms) by eliminating traditional GPU bottlenecks like memory bandwidth limitations.
Example: Groq LPU
WSE (Wafer-Scale Engine)
Cerebras's revolutionary wafer-scale processor - the largest chip ever built. Instead of cutting a silicon wafer into hundreds of small chips, the entire wafer becomes a single massive processor; the current WSE-3 packs 900,000 cores and 44GB of on-chip memory for unprecedented performance.
Example: Cerebras WSE-3
📊 Performance Metrics
TFLOPs (TeraFLOPS)
Trillions of floating-point operations per second. This is the theoretical peak computational throughput of a processor. Higher TFLOPs generally means faster AI training and inference, though real-world performance depends on many factors including memory bandwidth, software optimization, and workload characteristics.
Example: NVIDIA H100 delivers 1,979 TFLOPs at FP8 precision
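To see what a TFLOPs figure means in practice, here is a minimal back-of-the-envelope sketch that converts a peak rating plus an assumed utilization into an estimated training time for a given FLOP budget. All of the numbers below are illustrative placeholders, not vendor or benchmark data.
```python
# Rough training-time estimate from a FLOP budget (illustrative numbers only).
PEAK_TFLOPS = 1979          # assumed peak rating (e.g. FP8 dense), in TFLOPs
UTILIZATION = 0.40          # assumed fraction of peak actually sustained
NUM_GPUS = 1024             # assumed cluster size
TOTAL_FLOPS = 3.15e23       # assumed total FLOPs for one training run

sustained_flops = PEAK_TFLOPS * 1e12 * UTILIZATION * NUM_GPUS
seconds = TOTAL_FLOPS / sustained_flops
print(f"~{seconds / 86400:.1f} days at {UTILIZATION:.0%} of peak on {NUM_GPUS} GPUs")
```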
Tokens per Second
For large language models (LLMs), performance is often measured in tokens processed per second. One token is roughly 3-4 characters of text. Higher tokens/sec means faster text generation (inference) or faster training. This metric is more practical than TFLOPs for comparing LLM performance.
Example: H200 can process ~15,000 tokens/sec training LLaMA 70B
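As a quick illustration of what a tokens/sec figure means for a user, the sketch below converts an assumed per-user decode rate into response time and rough text volume; the throughput and response length are assumptions.
```python
# Translate a tokens/sec figure into response time and rough text volume.
tokens_per_sec = 100        # assumed per-user decode throughput
response_tokens = 500       # assumed response length in tokens
chars_per_token = 3.5       # rough average from the definition above

print(f"Response time: {response_tokens / tokens_per_sec:.1f} s")
print(f"Approx. characters generated: {response_tokens * chars_per_token:.0f}")
```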
Latency (Time to First Token)
The time between sending a prompt to an LLM and receiving the first token of the response. Critical for interactive applications like chatbots. Measured in milliseconds (ms). Lower latency means more responsive applications.
Example: Groq LPU achieves 1.5ms first-token latency - industry leading
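Time to first token is easy to measure around any streaming client. The sketch below is a generic timing harness, not any particular vendor's API; the fake_stream generator is a hypothetical stand-in you would replace with a real streaming call.
```python
import time

def measure_ttft(stream):
    """Measure time-to-first-token for any iterator that yields tokens."""
    start = time.perf_counter()
    first = next(stream)                      # block until the first token arrives
    ttft_ms = (time.perf_counter() - start) * 1000
    return first, ttft_ms

# Stand-in generator for illustration; swap in a real streaming client.
def fake_stream():
    time.sleep(0.05)                          # pretend the model takes 50 ms
    yield "Hello"

token, ttft = measure_ttft(fake_stream())
print(f"first token {token!r} after {ttft:.1f} ms")
```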
Images per Second
For image generation models like Stable Diffusion, performance is measured in how many images can be generated per second. This depends on resolution, number of diffusion steps, and batch size.
Example: L40S generates 3.2 images/sec at 1024x1024, 50 steps
MFU (Model FLOPs Utilization)
The percentage of theoretical peak performance actually achieved in practice. Google uses MFU to measure how efficiently TPUs are being utilized. Higher MFU (50-60%+) indicates excellent software optimization and minimal bottlenecks.
Example: TPU v5p achieves 58% MFU on GPT-3 training
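A minimal sketch of the usual MFU calculation, using the common approximation of ~6 FLOPs per parameter per token for dense transformer training. Every input below is an assumed placeholder, not a measured result.
```python
# MFU = achieved model FLOPs per second / theoretical peak FLOPs per second.
params = 70e9                 # assumed model size (parameters)
tokens_per_sec = 15_000       # assumed measured training throughput (whole cluster)
num_chips = 16                # assumed accelerator count
peak_flops_per_chip = 989e12  # assumed peak (e.g. BF16 dense), in FLOPs/s

achieved = 6 * params * tokens_per_sec          # ~6 FLOPs per param per token
peak = num_chips * peak_flops_per_chip
print(f"MFU: {achieved / peak:.1%}")
```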
TFLOPs per Watt
Power efficiency metric showing how much compute performance you get per watt of power consumed. Critical for datacenter operators concerned with electricity costs and cooling. Higher values mean better efficiency.
Example: FuriosaAI Warboy achieves 40 TFLOPs/Watt - extremely efficient
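To connect efficiency to operating cost, this hedged sketch turns an assumed power draw and throughput into energy (and electricity cost) per million tokens; all figures are placeholders.
```python
# Energy and electricity cost per million generated tokens (illustrative).
power_watts = 700           # assumed accelerator power draw
tokens_per_sec = 3_000      # assumed aggregate inference throughput
price_per_kwh = 0.10        # assumed electricity price in $/kWh

joules_per_token = power_watts / tokens_per_sec
kwh_per_million = joules_per_token * 1e6 / 3.6e6      # 1 kWh = 3.6e6 J
print(f"{kwh_per_million:.2f} kWh and ${kwh_per_million * price_per_kwh:.2f} "
      f"of electricity per million tokens")
```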
🎯 Precision Types
FP32 (32-bit Floating Point)
Full precision floating-point format. Highest accuracy but slowest performance and highest memory usage. Rarely used for modern AI training or inference. Primarily used for scientific computing where numerical precision is critical.
FP16 (16-bit Floating Point)
Half precision format that offers 2x performance improvement over FP32 with minimal accuracy loss for most AI workloads. Widely supported across all modern GPUs. Good balance of performance and accuracy.
BF16 (Brain Float 16)
Google's 16-bit format optimized for deep learning. Uses the same exponent range as FP32 but with reduced precision. Better numerical stability than FP16 for training. Now widely adopted across NVIDIA, AMD, and Intel GPUs as the preferred training format.
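A tiny demonstration of the range difference, assuming PyTorch is available: the same value overflows to infinity in FP16 but stays finite (if coarsely rounded) in BF16, because BF16 keeps the full FP32 exponent range.
```python
import torch

# FP16 overflows above ~65504; BF16 shares FP32's exponent range.
x = torch.tensor(70000.0)
print(x.to(torch.float16))    # tensor(inf, dtype=torch.float16)  -> overflow
print(x.to(torch.bfloat16))   # tensor(70144., dtype=torch.bfloat16) -> finite, coarsely rounded
```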
FP8 (8-bit Floating Point)
Newest precision format offering 2x performance improvement over FP16/BF16. Supported on latest generation hardware (NVIDIA Hopper, AMD CDNA 3). Requires careful calibration but enables massive throughput gains with acceptable accuracy for many models.
INT8 (8-bit Integer)
Integer quantization format primarily used for inference. Offers excellent performance and memory efficiency but requires quantization-aware training or post-training quantization. Popular for deploying models at scale.
INT4 / INT2
Ultra-low precision formats for extreme efficiency. Used in specialized scenarios where model size and inference speed are critical. Requires advanced quantization techniques to maintain acceptable accuracy.
🔌 Form Factors
SXM (Server PCI Express Module)
NVIDIA's proprietary high-performance form factor for datacenter GPUs. Features a socket-based design with direct connection to NVLink for multi-GPU communication. Requires specialized server motherboards. Offers highest performance and power delivery (up to 1000W+).
Used by: H100, H200, A100, B200
OAM (OCP Accelerator Module)
Open Compute Project standard for AI accelerator modules. Vendor-neutral specification supported by multiple manufacturers. Enables interoperability and standardized server designs. Competing standard to NVIDIA's SXM.
Used by: AMD MI300X, Intel Gaudi 2/3
PCIe (PCI Express)
Standard expansion card format that fits in any PCIe x16 slot. Most flexible and widely compatible option. Lower power delivery (typically 300-450W) compared to SXM/OAM. Good for workstations and smaller deployments.
Used by: L40S, L4, A40, most consumer/workstation GPUs
Mezzanine Card
Compact form factor that connects directly to a motherboard via specialized connector. Used primarily for inference accelerators in servers where multiple cards need to be densely packed. Lower power consumption than PCIe cards.
Used by: AWS Inferentia, some edge inference accelerators
⚙️ Key Specifications
VRAM (Video RAM / HBM)
High-bandwidth memory used by GPUs and AI accelerators. Measured in gigabytes (GB). More VRAM allows training larger models and processing bigger batches. Modern AI accelerators use HBM2e, HBM3, or HBM3e for maximum bandwidth. HBM is stacked directly on the chip package for extremely high bandwidth.
Example: H200 has 141GB HBM3e - enough for large model inference
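A hedged back-of-the-envelope check of whether a model fits in a given amount of HBM, counting weights plus KV cache. The model shape and serving settings below are assumptions chosen for illustration.
```python
# Does a model fit in HBM? Weights plus KV cache, all figures illustrative.
GB = 1e9
params = 70e9               # assumed parameter count
bytes_per_param = 2         # FP16/BF16 weights
layers, kv_heads, head_dim = 80, 8, 128   # assumed 70B-class model shape
seq_len, batch = 8192, 8    # assumed serving configuration
kv_bytes = 2                # FP16 KV cache

weights_gb = params * bytes_per_param / GB
# KV cache: 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes
kv_gb = 2 * layers * kv_heads * head_dim * seq_len * batch * kv_bytes / GB
print(f"weights ~{weights_gb:.0f} GB, KV cache ~{kv_gb:.0f} GB, "
      f"total ~{weights_gb + kv_gb:.0f} GB vs 141 GB HBM")
# At 1 byte/param (FP8/INT8) the weights drop to ~70 GB and this configuration fits.
```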
Memory Bandwidth
Measured in GB/s (gigabytes per second). Determines how quickly data can be moved between memory and compute cores. Critical bottleneck for AI workloads. Higher bandwidth enables better utilization of compute resources.
Example: H200 delivers 4,800 GB/s - critical for large models
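For single-stream LLM decoding, throughput is often bounded by how fast the model weights can be streamed from memory once per generated token. A hedged estimate with assumed figures:
```python
# Upper bound on single-sequence decode speed when the weights are read
# from HBM once per generated token (illustrative figures).
bandwidth_gbps = 4800       # assumed memory bandwidth, GB/s
weight_bytes_gb = 70        # assumed 70B-parameter model at 1 byte/param (FP8/INT8)

max_tokens_per_sec = bandwidth_gbps / weight_bytes_gb
print(f"bandwidth-bound ceiling: ~{max_tokens_per_sec:.0f} tokens/sec per sequence")
```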
TDP (Thermal Design Power)
Maximum power consumption measured in watts (W). Determines cooling requirements and datacenter power/cooling costs. Training GPUs typically consume 350-1000W. Inference accelerators are often more power-efficient at 75-350W.
Example: H100 SXM: 700W, L4: 72W
NVLink / Infinity Fabric
High-speed interconnects for multi-GPU communication. NVLink (NVIDIA) and Infinity Fabric (AMD) enable GPUs to communicate much faster than PCIe, critical for distributed training. Measured in GB/s of bidirectional bandwidth.
Interconnect Topology
How multiple accelerators are connected in a system. Common topologies include:
- NVSwitch/Full mesh: Every GPU can communicate with every other GPU at full speed
- Ring: GPUs connected in a circle, data passes through intermediate GPUs
- Tree: Hierarchical structure, some GPUs closer than others
🏗️ Architecture Generations
NVIDIA Architectures
- Blackwell (2024): B200 - Latest generation with FP4 support
- Hopper (2022): H100, H200 - Transformer Engine, FP8
- Ampere (2020): A100, A40, A10 - First datacenter GPU with BF16
- Volta (2017): V100 - First Tensor Core GPU
AMD Architectures
- CDNA 3 (2023): MI300X - Chiplet design, 192GB HBM3
- CDNA 2 (2021): MI250X, MI210 - Dual-die MCM design (MI250X), major FP64 uplift
- CDNA (2020): MI100 - First AMD datacenter AI GPU
Intel Architectures
- Gaudi 3 (2024): Latest generation, competitive with H100
- Gaudi 2 (2022): Second-generation Gaudi, Intel's first broadly deployed AI accelerator after the Habana acquisition
- Ponte Vecchio (2023): Data Center GPU Max - tiles/chiplets
💼 Workload Types
LLM Training
Training large language models like GPT, LLaMA, or Mistral. Extremely compute-intensive, requires massive parallelism across many GPUs. Memory bandwidth and inter-GPU communication are critical. Measured in tokens/sec or hours to train a full model.
LLM Inference
Running trained language models to generate text. Latency-sensitive for interactive applications. Can be bandwidth-bound for large models. Batch processing improves throughput. Measured in tokens/sec or first-token latency.
Image Generation
Creating images using diffusion models like Stable Diffusion, Midjourney, or DALL-E. Less memory-intensive than LLM training but requires good compute throughput. Measured in images/sec.
Computer Vision
Image classification, object detection, segmentation. Smaller models than LLMs, often can run efficiently on inference-optimized accelerators. Real-time video processing requires low latency.
Recommendation Systems
Serving personalized recommendations (e-commerce, social media, ads). Characterized by large embedding tables and high-throughput inference. Memory capacity and bandwidth critical. Measured in inferences/sec.
HPC (High-Performance Computing)
Scientific simulations, weather modeling, molecular dynamics. Requires FP64 (double precision) support. Different workload characteristics than AI. Measured in FP64 TFLOPs.
💰 Pricing Models
CAPEX (Capital Expenditure)
Upfront purchase price of hardware. You own the equipment. Higher initial cost but lower long-term cost for continuous usage. Must account for:
- Purchase price of GPUs/accelerators
- Server infrastructure costs
- Networking equipment
- Datacenter space and cooling
- Maintenance and replacement over time
Example: H100 street price ~$30,000-40,000 per GPU
OPEX (Operating Expenditure)
Pay-as-you-go cloud pricing. Measured per hour or per token. No upfront cost but higher long-term cost for continuous usage. Includes compute, memory, networking, and datacenter costs. Good for:
- Variable workloads
- Short-term projects
- Avoiding upfront investment
- Elastic scaling needs
Example: AWS p5.48xlarge with 8x H100: $98.32/hour
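A hedged break-even sketch comparing buying against renting, using round numbers in the spirit of the examples above. The overhead multiplier and power cost are assumptions; only the cloud price is taken from the example.
```python
# CAPEX vs OPEX break-even, illustrative numbers only.
purchase_per_gpu = 35_000       # assumed GPU street price, $
overhead_factor = 1.6           # assumed server/network/datacenter overhead multiplier
power_cost_per_gpu_hour = 0.10  # assumed electricity + cooling, $/GPU-hour
cloud_per_gpu_hour = 98.32 / 8  # from the p5.48xlarge example above (8x H100)

capex_per_gpu = purchase_per_gpu * overhead_factor
breakeven_hours = capex_per_gpu / (cloud_per_gpu_hour - power_cost_per_gpu_hour)
print(f"break-even after ~{breakeven_hours:,.0f} GPU-hours "
      f"(~{breakeven_hours / 24 / 365:.1f} years of continuous use)")
```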
Cost per Million Tokens
For LLM inference, many cloud providers charge per token rather than per hour. More predictable costs based on actual usage. Typical range: $0.08-$5.00 per million tokens depending on model size and provider.
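A small sketch of how an hourly GPU price and a serving throughput translate into cost per million tokens; the per-GPU price and throughput below are assumptions.
```python
# Convert an hourly price and throughput into $/million tokens (illustrative).
price_per_hour = 12.29          # assumed per-GPU on-demand price, $
tokens_per_sec = 3_000          # assumed aggregate serving throughput per GPU

tokens_per_hour = tokens_per_sec * 3600
cost_per_million = price_per_hour / tokens_per_hour * 1e6
print(f"~${cost_per_million:.2f} per million tokens")
```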
Reserved Instances / Committed Use
Discounted cloud pricing in exchange for committing to usage over 1-3 years. Can save 30-70% vs on-demand pricing. Good middle ground between CAPEX and on-demand OPEX.