Keynote - Enabling Generative AI on the Edge - Cormac Brick, Principal Engineer, Google
# Generative AI on the Edge: A Deep Dive into Innovations and Tools
## Introduction
Good morning! My name is Cormac Brick, and I’m excited to share insights about how generative AI is becoming increasingly popular on the edge. This growth is driven by significant advancements in the PyTorch ecosystem and the broader open model community. Edge developers are leveraging these tools to create innovative applications that provide instant responses, work offline, and deliver personalized experiences while respecting privacy by keeping data local.
The idea of deploying generative AI on edge devices might have seemed futuristic a few years ago, but today it’s no longer just a vision—it’s a reality. With the right compute power, models, and tools, developers are building cutting-edge applications that run seamlessly on mobile, desktop, IoT devices, or even within browsers.
---
## Compute Power: The Backbone of Generative AI
When discussing generative AI, one cannot overlook the importance of compute power. Over the past year, there has been a lot of attention focused on AI’s massive computational demands. However, what often goes unnoticed is the quieter revolution happening in mobile NPUs (Neural Processing Units).
In 2024 alone, it’s estimated that five zettaOPS of compute power will be shipped through mobile NPUs, a figure that highlights the immense potential of mobile ecosystems. While GPUs and CPUs also play a significant role in this ecosystem, the advancements in mobile NPUs are particularly noteworthy.
For context, let’s compare the projected compute power for NVIDIA H100s and mobile NPUs:
- **NVIDIA H100s**: 4 zettaOPS of compute power.
- **Mobile NPUs**: approximately 5 zettaOPS of compute power.
These numbers underscore the growing parity between high-end data center GPUs and mobile hardware, making generative AI deployment on edge devices more feasible than ever before.
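To make the scale of these figures concrete, here is a back-of-envelope check; the per-device throughput and shipment volume below are illustrative assumptions, not numbers from the talk.

```python
# Back-of-envelope check of the "~5 zettaOPS through mobile NPUs" figure.
# Both inputs are illustrative assumptions, not figures from the keynote.
ops_per_npu = 4e12          # assume an average mobile NPU sustains ~4 TOPS
devices_shipped = 1.25e9    # assume ~1.25 billion NPU-equipped devices ship per year

total_ops = ops_per_npu * devices_shipped
print(f"Aggregate compute shipped: {total_ops / 1e21:.1f} zettaOPS")  # -> 5.0 zettaOPS
```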
---
## Model Quality: From Large to Small, Innovation is Everywhere
The past year has seen remarkable progress in both large-scale and smaller open models. While large language models (LLMs) continue to dominate the headlines, smaller models are also making significant strides.
### Key Observations:
1. **Large Open Models**: The blue bars on our chart reveal a clear upward trend in innovation, with LLMs delivering exciting new capabilities to the open ecosystem.
2. **Smaller Models (Less than 4B Parameters)**: These models are equally important due to their smaller memory footprint and ability to run efficiently on edge devices.
For instance, consider the journey of **Gemma**:
- At the start of this year, Gemma shipped as a modest 2B-parameter model.
- By the end of last month, it had evolved into the more capable Gemma 2, showcasing how quickly smaller models are closing the gap with larger ones in performance and features.
This trend is particularly promising for developers looking to deploy generative AI on devices with limited computational resources.
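A quick sketch of why the sub-4B class matters on the edge: weight memory alone shrinks dramatically with parameter count and quantization. The parameter count and byte sizes below are standard arithmetic for illustration, not measurements from the talk.

```python
# Approximate weight-only memory footprint of a ~2B-parameter model
# (activations and KV cache excluded) at common precisions.
params = 2e9
for precision, bytes_per_param in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    gigabytes = params * bytes_per_param / 1e9
    print(f"{precision}: ~{gigabytes:.1f} GB of weights")
# fp16: ~4.0 GB, int8: ~2.0 GB, int4: ~1.0 GB
```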
---
## Deploying Generative AI: A Comparison to Classic ML
Deploying generative AI models on edge devices differs significantly from traditional machine learning (ML) deployments. Let’s break down the differences:
### Classic ML Deployment:
1. **Workflow**: Define, train, and export a model using tools like PyTorch or TensorFlow.
2. **Runtime**: Deploy on edge devices with a runtime such as ONNX Runtime or TensorRT.
3. **Application Logic**: Wrap the model in simple application logic and call runtime APIs.
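For contrast with the generative AI flow described next, here is a minimal sketch of that classic export path; the MobileNet model and ONNX target are just one common example of the pattern, not the specific stack used in the talk.

```python
import torch
import torchvision

# Classic ML flow: take a trained PyTorch model and export one static
# graph artifact for an edge runtime (here, ONNX for ONNX Runtime).
model = torchvision.models.mobilenet_v3_small(weights="DEFAULT").eval()
example_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, (example_input,), "mobilenet_v3_small.onnx")
```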
### Generative AI Deployment:
- The process is more complex due to the need for dynamic interactions, such as text generation or image synthesis.
- This complexity has led to the emergence of boutique frameworks designed specifically for generative AI on edge devices.
However, there’s good news: the PyTorch ecosystem is actively addressing these challenges. Key priorities include:
1. **Keeping It in PyTorch**: There’s no need to switch tools when deploying generative AI. You can stay within PyTorch for model development and deployment.
2. **Single Model Artifact**: Export a single model artifact that includes all weights while allowing flexibility for different deployment scenarios (e.g., prefill and decode operations).
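To make the second priority concrete, here is a minimal sketch of a multi-signature export using the open-source ai-edge-torch package, assuming a hypothetical `MyDecoderOnlyLM` module; the signature-chaining API shown reflects the package's documented pattern as I understand it, so check the current release for the exact call shape.

```python
import torch
import ai_edge_torch  # pip install ai-edge-torch

model = MyDecoderOnlyLM().eval()  # hypothetical PyTorch LLM module

# Example inputs for the two entry points: a full prompt for prefill
# and a single token for decode.
prefill_tokens = torch.zeros((1, 512), dtype=torch.int32)
decode_token = torch.zeros((1, 1), dtype=torch.int32)

# One exported artifact, two signatures sharing the same weights.
edge_model = (
    ai_edge_torch.signature("prefill", model, (prefill_tokens,))
    .signature("decode", model, (decode_token,))
    .convert()
)
edge_model.export("decoder_lm.tflite")
```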
---
## AI Edge Torch: A Tool for Modern Deployment
To meet the demands of generative AI on edge devices, our team has developed **AI Edge Torch**, a PyTorch-native library designed specifically for edge deployment. Here’s how it works:
1. **Model Development**: Build models in PyTorch, using tools like `torchtune` for fine-tuning.
2. **Optimization**: Use AI Edge Torch’s optimized layers to improve performance, and validate your models directly within the PyTorch environment.
3. **Conversion and Runtime**: Convert your model into a format suitable for edge devices and run it with **LiteRT**, the renamed and evolved version of TensorFlow Lite.
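As an illustration of steps 2 and 3, here is a hedged convert-and-validate sketch with ai-edge-torch, using a small vision model for brevity; the tolerance value is arbitrary, and the exact API surface may differ between releases.

```python
import numpy as np
import torch
import torchvision
import ai_edge_torch

model = torchvision.models.resnet18(weights="DEFAULT").eval()
sample_input = (torch.randn(1, 3, 224, 224),)

# Convert the PyTorch model into a LiteRT (formerly TensorFlow Lite) flatbuffer.
edge_model = ai_edge_torch.convert(model, sample_input)

# Validate the converted model against the original, still inside Python.
torch_output = model(*sample_input).detach().numpy()
edge_output = edge_model(*sample_input)
assert np.allclose(torch_output, edge_output, atol=1e-4), "outputs diverged"

edge_model.export("resnet18.tflite")
```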
AI Edge Torch supports a variety of smaller models, including:
- TinyLlama
- Gemma
- OpenELM
- Phi-2
- SmolLM
- T5
- Stable Diffusion
These models are optimized for edge devices, ensuring they run efficiently on mobile GPUs and other hardware.
---
## TinyLlama: A Case Study in Efficiency
Here’s a quick look at how TinyLlama, a small language model, was implemented using PyTorch. The core of the implementation fits in roughly 30 lines and demonstrates the power of staying in PyTorch for edge deployments:
```python
import torch
from torch import nn


class TinyLlama(nn.Module):
    def __init__(self, d_model=2048, n_heads=32, vocab=32000):  # illustrative sizes
        super().__init__()
        # Standard nn modules here; a full implementation would also add
        # embeddings plus an optimized RoPE / KV-cache implementation.
        self.decoder_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, x, memory):
        return self.lm_head(self.decoder_layer(x, memory))


model = TinyLlama().eval()

# Export the model; prefill and decode entry points can be captured by
# re-running torch.export with differently shaped example inputs.
example = (torch.randn(1, 16, 2048), torch.randn(1, 16, 2048))
exported = torch.export.export(model, example)
torch.export.save(exported, "tiny_llama.pt2")
```
This example highlights how developers can create highly optimized models that run efficiently on edge devices while maintaining flexibility for different deployment scenarios.
---
## Model Explorer: Visualizing and Debugging Models
To ensure optimal performance and usability, we’ve developed **Model Explorer**, a tool designed to visualize and analyze generative AI models. Here’s what it does:
1. **Handling Large Models**: The tool supports massive models, including Gemma 2 (with 2,000 nodes) and internal models with up to 50,000 nodes.
2. **Model Hierarchy**: Visualize the model hierarchy, from high-level blocks down to specific layers like attention mechanisms.
3. **Metadata and Insights**: View metadata about the model, such as node names and performance metrics.
4. **Custom Overlays**: Add custom JSON overlays to provide additional insights, such as heatmaps for runtime latency.
For example, when analyzing TinyLlama on a CPU, the tool reveals that the final fully connected layer has low runtime latency, a useful data point when deciding where to focus optimization effort.
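If you want to try this yourself, a minimal way to open a converted model in the viewer is sketched below; it assumes the ai-edge-model-explorer pip package and a hypothetical `tiny_llama.tflite` artifact name, and the per-node latency overlay itself would come from a separate JSON file whose exact schema is documented by the tool.

```python
# Launch Model Explorer on a converted model
# (assumes `pip install ai-edge-model-explorer`).
import model_explorer

# Opens the interactive graph viewer in the browser; custom per-node data,
# such as a latency heatmap, can be layered on from a JSON file in the UI.
model_explorer.visualize("tiny_llama.tflite")
```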
---
## Performance Benchmarks and Examples
AI Edge Torch has been benchmarked against hand-written models developed internally, and the results are encouraging:
- AI Edge Torch performs within 10% of those hand-written baselines on edge devices.
We’ve also built real-world examples of applications running on mobile GPUs, showcasing how generative AI can deliver fast, responsive experiences on the edge.
---
## Conclusion and Call to Action
Generative AI on the edge is at an exciting inflection point, with:
- A growing amount of compute power available for deployment.
- Rapid innovation in both large and small models.
- Tools like AI Edge Torch and Model Explorer making deployment more accessible and efficient.
If you’re interested in learning more about these tools or want to dive deeper into generative AI on the edge, we invite you to:
1. Explore the open-source AI Edge Torch library.
2. Check out our poster session for a hands-on demonstration of Model Explorer.
3. Join us at the exhibit booth to see real devices running generative AI models in action.
Thank you for your interest in this space, and we’re excited to see what you’ll build with these tools!