Overview
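In areas such as audio, video, and complex data analysis, the enormous potential of AI has yet to be fully tapped. For lack of guidance and practical use cases, many of today's professionals still struggle to apply AI innovations to these diverse domains.
This comprehensive guide fills that gap. It is written for intermediate to advanced machine learning engineers, data scientists, and researchers. Author Nicole Koenigstein takes readers deep into the varied applications of Transformer models, deepening theoretical understanding while emphasizing actionable strategies for real-world use. The book offers a "grand unified theory" of Transformers: foundational insights that will keep you at the forefront of the field, however the state-of-the-art models evolve.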
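With this book, you will learn to apply Transformers to:
Non-text domains such as image, video, and music generation.
Reasoning models, coding agents, and multi-agent architectures.
Optimization strategies for training time and inference time.
Production deployment, runtime engineering, and hardware efficiency optimization.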
Table of Contents
Preface
1. From First Principles to State-of-the-Art Transformers
Transformer Basics
Tokenizer: Text Representation in the Transformer
Token and Positional Embeddings
Attention Mechanism
Encoder and Decoder Parts
Enhancements in Transformer Design: Longer Context and Attention Variations
Longer Context Windows with Better Performance
Attention Mechanism Variations
Conclusion
2. Transformers for Time Series
Understanding the Intricacies of Time Series Data
Autocorrelation and Partial Autocorrelation
Cointegration
Cross-Correlation
Stationarity
Trend and Seasonality
Preparing a Dataset
Time Series Modeling in Various Application Domains
Tokenizing Time Series Data
Chronos: Learning the Language of Time Series
Fine-Tuning Chronos
PatchTST: A Time Series Is Worth 64 Words
Fine-Tuning PatchTST on Historical IBM Stock Prices
TimesFM: A Decoder-Only Time Series Foundation Model
Fine-Tuning TimesFM on Hourly Energy Consumption Data
AnomalyBERT for Self-Supervised Anomaly Detection
Conclusion
3. Transformers for Vision Tasks
Overview of Different Vision Tasks
Embeddings and Tokenization for Vision Models
Key Strategies for Improving the Robustness and Effectiveness of Vision Tasks
Swin Transformer V2
Image Classification with Swin Transformer V2
Segment Anything
Fine-Tuning SAM on a Custom Dataset
Segment Anything in Images and Videos
Segment Videos and Images with Concept Prompts
Conclusion
4. Transformers for Image Generation
Introduction to Generative Image Models
Diffusion Models: What's That Noise About?
Classifier-Free Guidance in Diffusion Models
Scalable Diffusion Models with Transformers
Generating Images with the DiT
PixArt-α
Generating Images with PixArt-Σ
Diffusion Vision Transformers for Image Generation
Interpretable Features with Diffusion Transformers
Conclusion
5. Transformers for Video Generation
Hidden Effectiveness of Latent Diffusion
LTX-Video: Video in Realtime
Latte: Structured Detail, Poured into Every Video Frame
Tora: From Trajectory to Storyline, One Frame at a Time
Conclusion
6. From Sound to Token and Back: Transformers in the Audio Domain
From Waveforms to Spectrograms: Understanding the Structure of Audio Data
Audio as a Waveform
Sampling Rate and the Nyquist Theorem
Amplitude, Bit Depth, and Quantization
The Frequency Domain and Fourier Transform
Spectrograms and the Short-Time Fourier Transform
The Mel Spectrogram and Perceptual Scaling
Phase, Reconstruction, and Vocoders
Audio Modeling in Various Application Domains
Transformer Architectures for Audio: From Perception to Foundational Intelligence
The Rise of Speech Transformers: The Impact of Whisper
Audio Foundation Models: Unifying Understanding, Generation, and Conversation
Qwen2-Audio
Transcribing a Meeting with Kimi-Audio
Segment Anything in Audio
Beyond Text and Speech: Transformers as Music Composers
Conclusion
7. Reinforcement Learning Transformers
Getting Started with Reinforcement Learning
Foundational Concepts in Reinforcement Learning
Online and Offline Reinforcement Learning
Model-Based and Model-Free Approaches
On-Policy Versus Off-Policy Reinforcement Learning
Temporal Difference Learning
World Models in Reinforcement Learning
Transformers in Reinforcement Learning
Decision Transformer
Going Live: Online Decision Transformer
A Brave New World: Stochastic Transformer-Based World Model
TWISTER: Transformer-Based World Models with Contrastive Predictive Coding
Conclusion
8. Embracing the Era of Experience: Transformers for Planning, Reasoning, and Coding
From Human Data to Lived Experience
Learning to Reason: From Pretraining to Reinforcement Learning
DeepSeek-R1: Reinforcing Reasoning Capabilities
Qwen3: Unified Reasoning with Dynamic Control
Qwen3-Coder: Agentic Reasoning for Open-Ended Coding
Kimi K2: Open Agentic Intelligence at Scale
Muon: Scaling Optimization for the Agentic Era
Inference with Kimi K2
Scaling Reasoning at Test-Time: Smarter, Not Just Bigger
Adaptive Branching Monte Carlo Tree Search (AB-MCTS)
The RethinkMCTS Framework for Code Generation
The S* Framework for Code Generation
Conclusion
9. From Scripts to Thinking: AI Agents for Complex Tasks
Autonomy: What's Possible at the Moment?
Designing Agent Workflows
Multi-Agent Architectures
Agentic Communication: The Right Context Is All You Need
Beyond Context: How to Help Agents Remember
Agent Memory Types
Going Global and Lifelong
The Human Factor: Steering Agent Actions
Common Patterns for Human-in-the-Loop
Solving GitHub Issues with Coding Agents
Conclusion
10. Smarter, Better, Faster, Stronger: Optimizing LLMs and AI Agents
Training-Time Intelligence: Reinforcement Learning for Agents
Beyond Hand-Crafted Rewards: How RULER Works
Training in Practice: ART in a Market Scenario
Reason Smarter, Not Harder: Adaptive Compute Allocation
The Delta Incentive: Enforcing Efficiency
Open Innovation: Community-Driven RL Frameworks
The Checkpoint Engine: Systems-Level Optimization for LLM Policy Updates
Conclusion
11. Deploying Transformer Models
Choosing Between Open and Closed Source
Understanding the Architecture You're Deploying
Deploying Decoder-Only Models
Runtime Engineering for Decoder-Only Models
Security Considerations for Decoder-Only Deployments
Building Applications with Coding Models
Evaluating LLM Deployments in Production
Cost Efficiency and Hardware Comparison
Quantization
Test-Time Low-Rank Adaptation in Vision-Language Models
Conclusion
12. Where to Go Next: From Models to Intelligent Systems
Combining Capabilities: SAM 3 Agent
The Science of Scaling Agentic Systems
Conclusion
Index