XPU Encyclopedia

Learn about AI compute hardware terminology, specifications, and concepts

🏢 XPU Vendors

Major manufacturers of AI accelerators and their product portfolios. Each vendor brings unique architectural approaches and optimization strategies to AI compute.

Alibaba

Founded: N/A

View Products →

AI accelerator manufacturer with 1 active product.

Active Products
1
Total Products
1
Avg VRAM
Peak TFLOPs
Focus Area
AI Acceleration

AMD

Founded: 1969 (GPU division from ATI acquisition 2006)

View Products →

Second-largest GPU manufacturer, competing aggressively in datacenter AI with CDNA architecture. ROCm software stack provides CUDA alternative. Strong price-performance positioning.

Active Products
10
Total Products
10
Avg VRAM
118GB
Peak TFLOPs
2,100
Focus Area
Training & Inference, Open software ecosystem

AWS

Founded: 2006 (Inferentia: 2019, Trainium: 2021)

View Products →

Amazon's custom AI chips designed for cost-effective inference (Inferentia) and training (Trainium). Tightly integrated with AWS infrastructure. Price-performance leaders for specific workloads.

Active Products
3
Total Products
3
Avg VRAM
32GB
Peak TFLOPs
680
Focus Area
Cost-optimized inference and training

Axelera AI

Founded: N/A

View Products →

AI accelerator manufacturer with 0 active products.

Active Products
0
Total Products
0
Avg VRAM
Peak TFLOPs
Focus Area
AI Acceleration

Baidu

Founded: N/A

View Products →

AI accelerator manufacturer with 1 active product.

Active Products
1
Total Products
1
Avg VRAM
32GB
Peak TFLOPs
Focus Area
AI Acceleration

Biren Technology

Founded: N/A

View Products →

AI accelerator manufacturer with 1 active product.

Active Products
1
Total Products
1
Avg VRAM
64GB
Peak TFLOPs
Focus Area
AI Acceleration

Cambricon

Founded: N/A

View Products →

AI accelerator manufacturer with 1 active product.

Active Products
1
Total Products
1
Avg VRAM
48GB
Peak TFLOPs
256
Focus Area
AI Acceleration

Cerebras

Founded: 2016

View Products →

Revolutionary wafer-scale processor - the largest chip ever built. A single CS-3 wafer contains 900,000 cores and 44GB of on-chip memory. Eliminates traditional GPU bottlenecks for massive models.

Active Products
1
Total Products
1
Avg VRAM
44GB
Peak TFLOPs
Focus Area
Extreme-scale AI training

d-Matrix

Founded: N/A

View Products →

AI accelerator manufacturer with 0 active products.

Active Products
0
Total Products
0
Avg VRAM
Peak TFLOPs
Focus Area
AI Acceleration

Enflame Technology

Founded: N/A

View Products →

AI accelerator manufacturer with 1 active product.

Active Products
1
Total Products
1
Avg VRAM
32GB
Peak TFLOPs
Focus Area
AI Acceleration

Etched

Founded: N/A

View Products →

AI accelerator manufacturer with 1 active product.

Active Products
1
Total Products
1
Avg VRAM
144GB
Peak TFLOPs
Focus Area
AI Acceleration

FuriosaAI

Founded: 2017

View Products →

Korean startup focused on energy-efficient inference accelerators. Industry-leading efficiency (40+ TFLOPs/Watt). Software-centric approach with compiler optimization.

Active Products
1
Total Products
1
Avg VRAM
32GB
Peak TFLOPs
Focus Area
Power-efficient inference

Google

Founded: 1998 (TPU: 2015)

View Products →

Pioneered custom AI chips with TPU (Tensor Processing Unit). Designed from scratch for TensorFlow/JAX workloads. Available only through Google Cloud. Industry-leading efficiency metrics.

Active Products
4
Total Products
4
Avg VRAM
32GB
Peak TFLOPs
918
Focus Area
Custom silicon for Google workloads, Cloud-only

Graphcore

Founded: 2016

View Products →

Intelligence Processing Unit (IPU) with unique architecture featuring massive parallelism and In-Processor-Memory. Strong in graph neural networks and alternative model architectures.

Active Products
2
Total Products
2
Avg VRAM
4GB
Peak TFLOPs
Focus Area
Novel architectures, Graph networks

Groq

Founded: 2016

View Products →

Language Processing Unit (LPU) achieving industry-leading inference latency (1.5ms first-token). Eliminates GPU bottlenecks through deterministic scheduling. Optimized for LLM serving.

Active Products
1
Total Products
1
Avg VRAM
230GB
Peak TFLOPs
Focus Area
Ultra-low latency LLM inference

Huawei

Founded: 1987 (Ascend: 2018)

View Products →

Ascend AI processors competing with NVIDIA in China. HiSilicon design with full-stack AI framework (MindSpore). Strong in Chinese market.

Active Products
1
Total Products
1
Avg VRAM
64GB
Peak TFLOPs
Focus Area
China market, Full-stack AI

Hygon

Founded: N/A

View Products →

AI accelerator manufacturer with 0 active products.

Active Products
0
Total Products
0
Avg VRAM
Peak TFLOPs
Focus Area
AI Acceleration

Iluvatar CoreX

Founded: N/A

View Products →

AI accelerator manufacturer with 1 active product.

Active Products
1
Total Products
1
Avg VRAM
48GB
Peak TFLOPs
300
Focus Area
AI Acceleration

Intel

Founded: 1968 (AI accelerators: Habana acquisition 2019)

View Products →

Traditional CPU leader entering AI accelerator market with Gaudi and Data Center GPU Max series. Leveraging OneAPI for software portability. Strong enterprise relationships.

Active Products
2
Total Products
2
Avg VRAM
88GB
Peak TFLOPs
419
Focus Area
Inference-optimized, Enterprise AI

Intel Habana

Founded: N/A

View Products →

AI accelerator manufacturer with 2 active products.

Active Products
2
Total Products
2
Avg VRAM
112GB
Peak TFLOPs
1,835
Focus Area
AI Acceleration

Kalray

Founded: N/A

View Products →

AI accelerator manufacturer with 0 active products.

Active Products
0
Total Products
0
Avg VRAM
Peak TFLOPs
Focus Area
AI Acceleration

Lightmatter

Founded: N/A

View Products →

AI accelerator manufacturer with 0 active products.

Active Products
0
Total Products
0
Avg VRAM
Peak TFLOPs
Focus Area
AI Acceleration

Meta

Founded: N/A

View Products →

AI accelerator manufacturer with 1 active product.

Active Products
1
Total Products
1
Avg VRAM
128GB
Peak TFLOPs
Focus Area
AI Acceleration

Microsoft

Founded: N/A

View Products →

AI accelerator manufacturer with 1 active product.

Active Products
1
Total Products
1
Avg VRAM
64GB
Peak TFLOPs
700
Focus Area
AI Acceleration

Moore Threads

Founded: N/A

View Products →

AI accelerator manufacturer with 1 active product.

Active Products
1
Total Products
1
Avg VRAM
32GB
Peak TFLOPs
Focus Area
AI Acceleration

Mythic

Founded: N/A

View Products →

AI accelerator manufacturer with 0 active products.

Active Products
0
Total Products
0
Avg VRAM
Peak TFLOPs
Focus Area
AI Acceleration

NVIDIA

Founded: 1993

View Products →

Market leader in AI accelerators with 80%+ datacenter GPU market share. Pioneered GPU computing for AI with CUDA ecosystem. Known for Tensor Cores, NVLink interconnect, and comprehensive software stack.

Active Products
30
Total Products
31
Avg VRAM
522GB
Peak TFLOPs
360,000
Focus Area
Training & Inference, Gaming GPUs repurposed for AI

Qualcomm

Founded: 1985 (AI accelerators: ~2015)

View Products →

Mobile AI leader with Neural Processing Units in Snapdragon chips. Edge AI focus with power efficiency for smartphones, IoT, and automotive applications.

Active Products
1
Total Products
1
Avg VRAM
16GB
Peak TFLOPs
Focus Area
Edge AI, Mobile devices

Rebellions

Founded: N/A

View Products →

AI accelerator manufacturer with 1 active product.

Active Products
1
Total Products
1
Avg VRAM
Peak TFLOPs
Focus Area
AI Acceleration

SambaNova

Founded: N/A

View Products →

AI accelerator manufacturer with 1 active product.

Active Products
1
Total Products
1
Avg VRAM
640GB
Peak TFLOPs
Focus Area
AI Acceleration

Tencent

Founded: N/A

View Products →

AI accelerator manufacturer with 0 active products.

Active Products
0
Total Products
0
Avg VRAM
Peak TFLOPs
Focus Area
AI Acceleration

Tenstorrent

Founded: N/A

View Products →

AI accelerator manufacturer with 2 active products.

Active Products
2
Total Products
2
Avg VRAM
16GB
Peak TFLOPs
364
Focus Area
AI Acceleration

Untether AI

Founded: N/A

View Products →

AI accelerator manufacturer with 0 active products.

Active Products
0
Total Products
0
Avg VRAM
Peak TFLOPs
Focus Area
AI Acceleration

☁️ Cloud Providers

Cloud computing providers offering GPU and XPU rentals on-demand. Compare providers by pricing, availability, and hardware options.

Verified Providers

AWS

global

Verified

Amazon Web Services - Largest cloud provider with P5 (H100), P4 (A100), and custom Inferentia/Trainium chips. Global availability.

7 GPUs listed
Visit →
Azure

global

Verified

Microsoft Azure - Enterprise-focused with NVIDIA GPUs and integration with Azure AI services. Global presence.

5 GPUs listed
Visit →
CoreWeave

us

Verified
Pricing coming soon
Visit →
Genesis Cloud

eu

Verified

European GPU cloud provider with focus on AI/ML workloads.

Pricing coming soon
Visit →
Google Cloud

global

Verified

Google Cloud Platform - TPU pioneer with exclusive access to TPU v4/v5. Also offers NVIDIA GPUs. Strong ML infrastructure.

7 GPUs listed
Visit →
Lambda Labs

global

Verified

GPU cloud specialist offering competitive pricing on NVIDIA GPUs. Popular with AI researchers and startups.

2 GPUs listed
Visit →
OCI (Oracle)

global

Verified
Pricing coming soon
Visit →
Paperspace

global

Verified

Developer-friendly GPU cloud with notebooks and deployment tools. Good for experimentation.

Pricing coming soon
Visit →
RunPod

global

Verified

Community-driven GPU cloud with spot pricing. Flexible and affordable for ML workloads.

Pricing coming soon
Visit →

Community Providers

Vast.ai

global

Marketplace for GPU compute connecting buyers with hardware owners. Extremely low prices.

Pricing coming soon
Visit →

Want to list your service?

If you're a cloud provider offering GPU or XPU compute, we'd love to list you here. Contact us to get verified and featured.

🖥️ XPU Types

GPU (Graphics Processing Unit)

Originally designed for graphics rendering, GPUs have become the dominant hardware for AI training and inference. They excel at parallel processing, making them ideal for matrix operations in deep learning. Modern datacenter GPUs from NVIDIA, AMD, and Intel are optimized specifically for AI workloads.

Examples: NVIDIA H100, AMD MI300X, Intel Data Center GPU Max

TPU (Tensor Processing Unit)

Custom AI accelerators developed by Google specifically for tensor operations in neural networks. TPUs are designed from the ground up for AI workloads and offer high efficiency for training and inference, particularly for models built with TensorFlow/JAX.

Examples: Google TPU v5p, TPU v4, TPU v5e

NPU (Neural Processing Unit)

Specialized processors optimized for neural network inference, often found in edge devices and mobile processors. NPUs typically offer lower power consumption and are designed for specific AI tasks like image recognition, speech processing, or recommendation systems.

Examples: Huawei Ascend, Apple Neural Engine, Qualcomm AI Engine

IPU (Intelligence Processing Unit)

Graphcore's specialized processor architecture designed for both training and inference. IPUs use a unique massively parallel architecture with thousands of independent processors and In-Processor-Memory for ultra-low-latency data access.

Examples: Graphcore Bow IPU, IPU-M2000

LPU (Language Processing Unit)

Groq's specialized architecture optimized specifically for large language model inference. The LPU achieves industry-leading inference latency (as low as 1.5ms) by eliminating traditional GPU bottlenecks like memory bandwidth limitations.

Example: Groq LPU

WSE (Wafer-Scale Engine)

Cerebras's revolutionary wafer-scale processor - the largest chip ever built. Instead of cutting a silicon wafer into hundreds of small chips, the entire wafer becomes a single massive processor; the WSE-3 packs 900,000 cores and 44GB of on-chip memory for unprecedented performance.

Example: Cerebras WSE-3

📊 Performance Metrics

TFLOPs (TeraFLOPS)

Trillions of floating-point operations per second. This is the theoretical peak computational throughput of a processor. Higher TFLOPs generally means faster AI training and inference, though real-world performance depends on many factors including memory bandwidth, software optimization, and workload characteristics.

Example: NVIDIA H100 delivers 1,979 TFLOPs at FP8 precision
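
To make the unit concrete: a single large matrix multiply takes 2 x M x N x K floating-point operations, so a quoted TFLOPs rate translates directly into wall-clock time. The sketch below uses assumed matrix sizes and an assumed sustained (not peak) rate, purely for illustration.

```python
# What a TFLOP buys: operations in one matrix multiply and the time it takes.
# Matrix sizes and the sustained rate are assumed, illustrative values.

M = N = K = 16_384                 # multiply two 16,384 x 16,384 matrices
flops = 2 * M * N * K              # one multiply + one add per output element per K step
sustained_tflops = 800             # assumed sustained throughput, below theoretical peak

seconds = flops / (sustained_tflops * 1e12)
print(f"{flops / 1e12:.1f} trillion operations, ~{seconds * 1e3:.1f} ms at {sustained_tflops} TFLOPs")
```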

Tokens per Second

For large language models (LLMs), performance is often measured in tokens processed per second. One token is roughly 3-4 characters of text. Higher tokens/sec means faster text generation (inference) or faster training. This metric is more practical than TFLOPs for comparing LLM performance.

Example: H200 can process ~15,000 tokens/sec when training LLaMA 70B
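
To see why tokens/sec is the practical unit, the sketch below converts a sustained training rate into calendar time for a fixed token budget; the 1-trillion-token budget is an arbitrary assumption, and real runs divide the work across many accelerators.

```python
# Rough sketch: training wall-clock time implied by a sustained tokens/sec rate.
# The token budget is an arbitrary assumption for illustration.

tokens_per_sec = 15_000               # e.g. the H200 figure quoted above
token_budget = 1_000_000_000_000      # assumed 1T training tokens

days = token_budget / tokens_per_sec / 86_400
print(f"~{days:,.0f} days at this aggregate rate")
```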

Latency (Time to First Token)

The time between sending a prompt to an LLM and receiving the first token of the response. Critical for interactive applications like chatbots. Measured in milliseconds (ms). Lower latency means more responsive applications.

Example: Groq LPU achieves 1.5ms first-token latency - industry leading
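
A minimal way to measure time-to-first-token yourself, assuming a hypothetical stream_tokens() generator standing in for whatever streaming interface your serving stack exposes (not a real library call):

```python
import time

def time_to_first_token(stream_tokens, prompt: str) -> float:
    """Seconds between sending the prompt and receiving the first streamed token.

    `stream_tokens` is a hypothetical callable that yields tokens for a prompt;
    substitute the streaming interface of your own serving stack.
    """
    start = time.perf_counter()
    for _ in stream_tokens(prompt):
        return time.perf_counter() - start  # stop at the very first token
    raise RuntimeError("stream produced no tokens")
```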

Images per Second

For image generation models like Stable Diffusion, performance is measured in how many images can be generated per second. This depends on resolution, number of diffusion steps, and batch size.

Example: L40S generates 3.2 images/sec at 1024x1024, 50 steps
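
Throughput like this maps directly onto cost per image; the sketch below combines the 3.2 images/sec figure with an assumed hourly rental price (an illustrative placeholder, not a real quote).

```python
# Rough sketch: cost per generated image from throughput and an hourly price.
# The hourly rate is an assumed, illustrative number.

images_per_sec = 3.2     # e.g. the L40S figure quoted above
usd_per_hour = 1.50      # assumed GPU rental price

images_per_hour = images_per_sec * 3600
print(f"{images_per_hour:,.0f} images/hour, ~${usd_per_hour / images_per_hour:.5f} per image")
```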

MFU (Model FLOPs Utilization)

The percentage of theoretical peak performance actually achieved in practice. Google uses MFU to measure how efficiently TPUs are being utilized. Higher MFU (50-60%+) indicates excellent software optimization and minimal bottlenecks.

Example: TPU v5p achieves 58% MFU on GPT-3 training
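
MFU can be estimated from measured throughput using the common approximation of ~6 FLOPs per parameter per token for transformer training; the sketch below is a back-of-the-envelope check with assumed illustrative numbers, not Google's exact methodology.

```python
# Back-of-the-envelope MFU estimate for transformer training.
# Uses the common ~6 FLOPs per parameter per token approximation.

def estimate_mfu(params: float, tokens_per_sec: float, peak_tflops: float) -> float:
    achieved_flops = 6 * params * tokens_per_sec  # FLOPs/s actually delivered
    peak_flops = peak_tflops * 1e12               # theoretical peak FLOPs/s
    return achieved_flops / peak_flops

# Assumed illustrative inputs: 70B-parameter model, 325 tokens/sec per chip,
# 312 TFLOPs BF16 peak per chip.
print(f"MFU ≈ {estimate_mfu(70e9, 325, 312):.1%}")
```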

TFLOPs per Watt

Power efficiency metric showing how much compute performance you get per watt of power consumed. Critical for datacenter operators concerned with electricity costs and cooling. Higher values mean better efficiency.

Example: FuriosaAI Warboy achieves 40 TFLOPs/Watt - extremely efficient
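
The ratio itself is a one-line calculation; the sketch below also restates it as energy per unit of work, with all figures assumed for illustration.

```python
# Efficiency sketch: TFLOPs per watt and joules per teraFLOP of work (illustrative numbers).

peak_tflops = 989    # assumed accelerator peak throughput
tdp_watts = 700      # assumed board power

tflops_per_watt = peak_tflops / tdp_watts
joules_per_teraflop = tdp_watts / peak_tflops  # energy to execute 10**12 operations at this rate
print(f"{tflops_per_watt:.2f} TFLOPs/W, {joules_per_teraflop:.2f} J per 10^12 operations")
```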

🎯 Precision Types

FP32 (32-bit Floating Point)

Full precision floating-point format. Highest accuracy but slowest performance and highest memory usage. Rarely used for modern AI training or inference. Primarily used for scientific computing where numerical precision is critical.

FP16 (16-bit Floating Point)

Half precision format that offers 2x performance improvement over FP32 with minimal accuracy loss for most AI workloads. Widely supported across all modern GPUs. Good balance of performance and accuracy.

BF16 (Brain Float 16)

Google's 16-bit format optimized for deep learning. Uses the same exponent range as FP32 but with reduced precision. Better numerical stability than FP16 for training. Now widely adopted across NVIDIA, AMD, and Intel GPUs as the preferred training format.

FP8 (8-bit Floating Point)

Newest precision format offering 2x performance improvement over FP16/BF16. Supported on latest generation hardware (NVIDIA Hopper, AMD CDNA 3). Requires careful calibration but enables massive throughput gains with acceptable accuracy for many models.

INT8 (8-bit Integer)

Integer quantization format primarily used for inference. Offers excellent performance and memory efficiency but requires quantization-aware training or post-training quantization. Popular for deploying models at scale.

INT4 / INT2

Ultra-low precision formats for extreme efficiency. Used in specialized scenarios where model size and inference speed are critical. Requires advanced quantization techniques to maintain acceptable accuracy.
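
The most immediate effect of precision choice is memory footprint. The sketch below tabulates approximate weight storage for an assumed 70B-parameter model at each precision (weights only; activations, optimizer state, and KV cache add more).

```python
# Approximate weight-storage footprint of a 70B-parameter model at different precisions.
# Weights only; activations, optimizer state, and KV cache come on top.

BYTES_PER_PARAM = {"FP32": 4.0, "FP16/BF16": 2.0, "FP8/INT8": 1.0, "INT4": 0.5}
params = 70e9  # assumed model size

for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt:>10}: ~{params * nbytes / 1e9:,.0f} GB of weights")
```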

🔌 Form Factors

SXM (Server PCI Express Module)

NVIDIA's proprietary high-performance form factor for datacenter GPUs. Features a socket-based design with direct connection to NVLink for multi-GPU communication. Requires specialized server motherboards. Offers highest performance and power delivery (up to 1000W+).

Used by: H100, H200, A100, B200

OAM (OCP Accelerator Module)

Open Compute Project standard for AI accelerator modules. Vendor-neutral specification supported by multiple manufacturers. Enables interoperability and standardized server designs. Competing standard to NVIDIA's SXM.

Used by: AMD MI300X, Intel Gaudi 2/3

PCIe (PCI Express)

Standard expansion card format that fits in any PCIe x16 slot. Most flexible and widely compatible option. Lower power delivery (typically 300-450W) compared to SXM/OAM. Good for workstations and smaller deployments.

Used by: L40S, L4, A40, most consumer/workstation GPUs

Mezzanine Card

Compact form factor that connects directly to a motherboard via specialized connector. Used primarily for inference accelerators in servers where multiple cards need to be densely packed. Lower power consumption than PCIe cards.

Used by: AWS Inferentia, some edge inference accelerators

⚙️ Key Specifications

VRAM (Video RAM / HBM)

High-bandwidth memory used by GPUs and AI accelerators. Measured in gigabytes (GB). More VRAM allows training larger models and processing bigger batches. Modern AI accelerators use HBM2e or HBM3 for maximum bandwidth. HBM is stacked directly on the chip package for extremely high bandwidth.

Example: H200 has 141GB HBM3e - enough for large model inference
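
A rough way to check whether a model fits a given VRAM budget for inference is weights plus KV cache plus overhead; the sketch below uses the standard decoder-only KV-cache formula with assumed, illustrative model dimensions.

```python
# Rough inference-memory estimate: weights + KV cache + overhead.
# All model dimensions below are assumed, illustrative values.

def inference_vram_gb(params, bytes_per_param, n_layers, kv_heads, head_dim,
                      seq_len, batch, kv_bytes=2, overhead=1.2):
    weights = params * bytes_per_param
    kv_cache = 2 * n_layers * kv_heads * head_dim * seq_len * batch * kv_bytes  # K and V
    return (weights + kv_cache) * overhead / 1e9

# Assumed 70B-class model: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 weights.
print(f"~{inference_vram_gb(70e9, 2, 80, 8, 128, seq_len=8192, batch=8):.0f} GB")
```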

Memory Bandwidth

Measured in GB/s (gigabytes per second). Determines how quickly data can be moved between memory and compute cores. Critical bottleneck for AI workloads. Higher bandwidth enables better utilization of compute resources.

Example: H200 delivers 4,800 GB/s - critical for large models
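
For single-stream LLM decoding, bandwidth often sets the ceiling: every generated token requires streaming the full weight set from memory. The sketch below computes that bandwidth-bound upper limit under assumed, illustrative values.

```python
# Bandwidth-bound upper limit on single-stream decode speed:
# each new token must read all model weights from memory at least once.

bandwidth_gbs = 4800   # e.g. the H200 figure quoted above
model_gb = 140         # assumed: 70B parameters stored at FP16

max_tokens_per_sec = bandwidth_gbs / model_gb
print(f"<= ~{max_tokens_per_sec:.0f} tokens/sec per stream (ignores compute and KV-cache reads)")
```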

TDP (Thermal Design Power)

Maximum power consumption measured in watts (W). Determines cooling requirements and datacenter power/cooling costs. Training GPUs typically consume 350-1000W. Inference accelerators are often more power-efficient at 75-350W.

Example: H100 SXM: 700W, L4: 72W
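
TDP translates directly into an electricity bill; the sketch below folds in a PUE factor for cooling overhead. The electricity price and PUE are assumed placeholders.

```python
# Electricity-cost sketch for one accelerator running continuously (illustrative assumptions).

tdp_watts = 700        # e.g. H100 SXM from the example above
pue = 1.3              # assumed power usage effectiveness (cooling/facility overhead)
usd_per_kwh = 0.12     # assumed electricity price
hours_per_year = 8760

kwh_per_year = tdp_watts / 1000 * pue * hours_per_year
print(f"~{kwh_per_year:,.0f} kWh/year ≈ ${kwh_per_year * usd_per_kwh:,.0f}/year in electricity")
```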

Interconnect Bandwidth (NVLink / Infinity Fabric)

High-speed interconnects for multi-GPU communication. NVLink (NVIDIA) and Infinity Fabric (AMD) enable GPUs to communicate much faster than PCIe, critical for distributed training. Measured in GB/s of bidirectional bandwidth.

Interconnect Topology

How multiple accelerators are connected in a system. Common topologies include (see the communication-cost sketch after this list):

  • NVSwitch/Full mesh: Every GPU can communicate with every other GPU at full speed
  • Ring: GPUs connected in a circle, data passes through intermediate GPUs
  • Tree: Hierarchical structure, some GPUs closer than others
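
Topology matters because collective operations move a predictable amount of data. The sketch below estimates per-GPU traffic and time for a ring all-reduce; the gradient size and link speed are assumed, illustrative values rather than specs of any particular system.

```python
# Ring all-reduce traffic: each GPU sends and receives 2*(N-1)/N of the buffer.
# Gradient size and link bandwidth are assumed, illustrative values.

n_gpus = 8
gradient_gb = 140   # assumed: 70B parameters of FP16 gradients
link_gbs = 450      # assumed per-GPU interconnect bandwidth (GB/s)

traffic_gb = 2 * (n_gpus - 1) / n_gpus * gradient_gb
print(f"{traffic_gb:.0f} GB per GPU per all-reduce, ~{traffic_gb / link_gbs:.2f} s at {link_gbs} GB/s")
```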

🏗️ Architecture Generations

NVIDIA Architectures

  • Blackwell (2024): B200 - Latest generation with FP4 support
  • Hopper (2022): H100, H200 - Transformer Engine, FP8
  • Ampere (2020): A100, A40, A10 - First datacenter GPU with BF16
  • Volta (2017): V100 - First Tensor Core GPU

AMD Architectures

  • CDNA 3 (2023): MI300X - Chiplet design, 192GB HBM3
  • CDNA 2 (2021): MI250X, MI210 - Dual-die MI250X design, 128GB HBM2e
  • CDNA (2020): MI100 - First AMD datacenter AI GPU

Intel Architectures

  • Gaudi 3 (2024): Latest generation, competitive with H100
  • Gaudi 2 (2022): First major AI accelerator from Intel
  • Ponte Vecchio (2023): Data Center GPU Max - tiles/chiplets

💼 Workload Types

LLM Training

Training large language models like GPT, LLaMA, or Mistral. Extremely compute-intensive, requires massive parallelism across many GPUs. Memory bandwidth and inter-GPU communication are critical. Measured in tokens/sec or hours to train a full model.
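
A common back-of-the-envelope for training scale is ~6 FLOPs per parameter per token; the sketch below turns that into GPU-hours under an assumed per-GPU peak and utilization. All inputs are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope LLM training budget: total FLOPs ≈ 6 * parameters * tokens.
# All inputs are assumed, illustrative values.

params = 70e9
tokens = 2e12
peak_tflops = 989   # assumed per-GPU peak
mfu = 0.40          # assumed sustained model FLOPs utilization

gpu_seconds = 6 * params * tokens / (peak_tflops * 1e12 * mfu)
gpu_hours = gpu_seconds / 3600
print(f"~{gpu_hours:,.0f} GPU-hours (≈ {gpu_hours / 1024 / 24:.0f} days on 1,024 GPUs)")
```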

LLM Inference

Running trained language models to generate text. Latency-sensitive for interactive applications. Can be bandwidth-bound for large models. Batch processing improves throughput. Measured in tokens/sec or first-token latency.

Image Generation

Creating images using diffusion models like Stable Diffusion, Midjourney, or DALL-E. Less memory-intensive than LLM training but requires good compute throughput. Measured in images/sec.

Computer Vision

Image classification, object detection, segmentation. Smaller models than LLMs, often can run efficiently on inference-optimized accelerators. Real-time video processing requires low latency.

Recommendation Systems

Serving personalized recommendations (e-commerce, social media, ads). Characterized by large embedding tables and high-throughput inference. Memory capacity and bandwidth critical. Measured in inferences/sec.

HPC (High-Performance Computing)

Scientific simulations, weather modeling, molecular dynamics. Requires FP64 (double precision) support. Different workload characteristics than AI. Measured in FP64 TFLOPs.

💰 Pricing Models

CAPEX (Capital Expenditure)

Upfront purchase price of hardware. You own the equipment. Higher initial cost but lower long-term cost for continuous usage. Must account for:

  • Purchase price of GPUs/accelerators
  • Server infrastructure costs
  • Networking equipment
  • Datacenter space and cooling
  • Maintenance and replacement over time

Example: H100 street price ~$30,000-40,000 per GPU

OPEX (Operating Expenditure)

Pay-as-you-go cloud pricing. Measured per hour or per token. No upfront cost but higher long-term cost for continuous usage. Includes compute, memory, networking, and datacenter costs. Good for:

  • Variable workloads
  • Short-term projects
  • Avoiding upfront investment
  • Elastic scaling needs

Example: AWS p5.48xlarge with 8x H100: $98.32/hour
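
One way to weigh CAPEX against OPEX is a break-even comparison: owning a GPU (purchase price plus an overhead factor for server, power, and cooling) versus renting by the hour. The sketch below uses the illustrative figures quoted above; the 1.5x overhead factor and 3-year amortization are assumptions.

```python
# Buy-vs-rent break-even sketch (assumptions drawn from the illustrative figures above).

purchase_usd = 35_000                # assumed H100 street price (midpoint of the range above)
overhead_factor = 1.5                # assumed server, networking, power, cooling, space
amortization_years = 3
cloud_usd_per_gpu_hour = 98.32 / 8   # from the 8x H100 example above

owned_total = purchase_usd * overhead_factor
owned_per_hour = owned_total / (amortization_years * 8760)
breakeven_hours = owned_total / cloud_usd_per_gpu_hour

print(f"Owning ≈ ${owned_per_hour:.2f}/GPU-hour if fully utilized; "
      f"cloud spend matches the purchase after ~{breakeven_hours:,.0f} hours "
      f"(~{breakeven_hours / 8760:.1f} years of 24/7 rental)")
```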

Cost per Million Tokens

For LLM inference, many cloud providers charge per token rather than per hour. More predictable costs based on actual usage. Typical range: $0.08-$5.00 per million tokens depending on model size and provider.
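
If you self-host, an hourly GPU price and a measured serving throughput convert into the same per-token unit; the sketch below uses assumed, illustrative numbers.

```python
# Convert an hourly GPU price and measured throughput into $ per million tokens.
# Both inputs are assumed, illustrative values.

usd_per_hour = 12.29      # assumed per-GPU hourly rate
tokens_per_sec = 2_500    # assumed aggregate serving throughput (batched requests)

usd_per_million = usd_per_hour / (tokens_per_sec * 3600) * 1e6
print(f"${usd_per_million:.2f} per million tokens")
```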

Reserved Instances / Committed Use

Discounted cloud pricing in exchange for committing to usage over 1-3 years. Can save 30-70% vs on-demand pricing. Good middle ground between CAPEX and on-demand OPEX.