Ch.00
Advanced DL: Large Models and Generative AI Paradigm
Advanced Deep Learning (Ch.00) is the entry point that connects “why models got so large” with “how generative AI systems actually work.” We go beyond learning representations from data: how large Transformers build contextual understanding, predict the next token, and then how we align, control, and deploy those models for real users.
An advanced roadmap toward large generative models
This roadmap gradually fills from Ch01 onward, showing how each chapter contributes to the full system.
What you will learn in Ch.01–Ch.29
- Ch.01 Transformer 1: Self-Attention and Parallelization
- Ch.02 Transformer 2: Positional Encoding and Feed-Forward
- Ch.03 Transformer Lineage: Encoder (BERT) vs Decoder (GPT)
- Ch.04 Attention Optimization: FlashAttention and Sparse Attention
- Ch.05 Vision Transformer (ViT) and Image Patches
- Ch.06 Swin Transformer: Hierarchical Windows and Global Context
- Ch.07 Vision Models: Local CNN vs Global ViT
- Ch.08 PEFT 1: PEFT and LoRA
- Ch.09 QLoRA and Quantization: Tuning When Smaller
- Ch.10 Value Alignment and RLHF: Matching Human Preferences
- Ch.11 DPO: Aligning with Preferences without Reinforcement Learning
- Ch.12 RAG: Reducing Hallucinations with Retrieval
- Ch.13 LLM Agents: Models That Use Tools
- Ch.14 Master CNNs: Kernels, Stride, Padding & Backbone Evolution
- Ch.15 Object Detection: R-CNN Family vs YOLO (Bounding Boxes)
- Ch.16 Image Segmentation: U-Net and DeepLab (Pixel-Level Understanding)
- Ch.17 Grad-CAM and XAI: Where CNNs Look
- Ch.18 Graph Neural Networks (GNN): Message Passing to Neighbors
- Ch.19 Autoencoder: Compress and Reconstruct
- Ch.20 VAE: A Generative Space in Probability
- Ch.21 GAN Basics: Generator vs Discriminator
- Ch.22 Conditional GAN: Generate on Condition
- Ch.23 Diffusion 1: Add Noise, Then Denoise
- Ch.24 Diffusion 2: Diffusing in Latent Space
- Ch.25 Vision-Language Models and CLIP: Images and Text Together (CNN Meets LLM)
- Ch.26 Speech Recognition and Audio: Sound to Text
- Ch.27 Model Compression and Knowledge Distillation
- Ch.28 Inference Optimization and Deployment: From Servers to Browser Runtimes
- Ch.29 Advanced DL Wrap-Up: Architecture and Future
What is Advanced DL? (Generative AI system view)
Foundation models / LLMs are trained with the objective of predicting the next token. In other words, they maximize the likelihood of each token given its preceding context, `Σ_t log P(x_t | x_{<t})`, learning language flow and patterns that go beyond simple grammar.
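The next-token objective can be made concrete with a small sketch: given one row of model logits per position and the true next-token ids, the loss is the average negative log-likelihood. The array shapes and values below are toy assumptions for illustration.

```python
import numpy as np

def next_token_nll(logits, targets):
    """Average negative log-likelihood of the true next tokens.

    logits:  (T, V) unnormalized scores, one row per position.
    targets: (T,) true next-token ids.
    """
    # Numerically stable log-softmax over the vocabulary dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out log P(target | context) at each position, average, negate.
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy example: 3 positions, vocabulary of 5 tokens.
logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                   [0.1, 3.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 4.0, 0.1]])
targets = np.array([0, 1, 3])
loss = next_token_nll(logits, targets)
```

Training a foundation model is, at its core, minimizing this quantity over enormous amounts of text.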
A practical way to understand generative AI is to split it into stages: pretraining (broad knowledge), instruction / SFT (follow user intent), and alignment (preference, safety, and reduced hallucinations).
The backbone is mostly Transformers. Self-attention creates “token-to-token” context, and feed-forward + normalization layers refine it so the model stays consistent even with long contexts.
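The "token-to-token context" that self-attention creates can be sketched in a few lines of numpy: each token's query is matched against every token's key, the resulting weights are softmax-normalized, and the values are mixed accordingly. This is a minimal single-head version with no masking or multi-head split; shapes and random weights are illustrative assumptions.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (no masking).

    X: (T, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_k) projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # context-mixed values

rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

Because every position is computed with the same matrix products, all tokens are processed in parallel, which is exactly what makes Transformers train efficiently on GPUs.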
Bigger models can improve capability, but they also make training less stable and dramatically increase cost. Advanced DL therefore focuses on more than accuracy: training stability, efficiency (compute/memory), and reproducibility.
In the real world, generative AI is judged by trust: truthfulness, safety, and reliability. Achieving that requires alignment, evaluation, and control mechanisms.
Finally, deployment constraints (latency, cost, server limits) matter. So advanced DL continues from training to inference optimization, compression, and serving strategies.
In production, systems usually follow a pipeline like `text/image -> tokenization -> context window -> Transformer -> decoding (greedy/beam/sample)`. Decoding strategy and prompt design strongly affect output quality.
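The decoding step at the end of that pipeline is worth seeing in isolation. A minimal sketch of two of the strategies mentioned above, greedy and temperature sampling, over one step of toy logits (the values and the `decode_step` helper are assumptions for illustration):

```python
import numpy as np

def decode_step(logits, strategy="greedy", temperature=1.0, rng=None):
    """Pick the next token id from one step of model logits.

    greedy: argmax; sample: draw from softmax(logits / temperature).
    """
    if strategy == "greedy":
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()                      # softmax over the vocabulary
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([1.0, 3.0, 0.5])
greedy_id = decode_step(logits)               # deterministic: highest logit
sampled_id = decode_step(logits, "sample", temperature=0.8,
                         rng=np.random.default_rng(42))
```

Greedy decoding always returns the same continuation, while lowering or raising the temperature trades determinism against diversity, which is one reason the same model can produce very different-feeling outputs.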
Alignment and control can be done in multiple ways. For example, RLHF / DPO uses preferences to improve the model, and RAG retrieves external knowledge to ground answers.
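The retrieval half of RAG reduces to "embed the query, embed the documents, return the nearest ones." A deliberately tiny sketch, where the corpus, vocabulary, and bag-of-words "embedding" are toy assumptions (real systems use learned dense embeddings and a vector index):

```python
import numpy as np

# Hypothetical toy corpus; the "embedding" is just a word-count vector.
VOCAB = ["attention", "diffusion", "noise", "token", "transformer"]
DOCS = [
    "transformer attention token",   # doc 0: about Transformers
    "diffusion noise denoise",       # doc 1: about diffusion
]

def embed(text):
    words = text.split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

def retrieve(query, docs, k=1):
    """Return indices of the k docs most cosine-similar to the query."""
    q = embed(query)
    sims = [q @ embed(d) / (np.linalg.norm(q) * np.linalg.norm(embed(d)) + 1e-9)
            for d in docs]
    return list(np.argsort(sims)[::-1][:k])

top = retrieve("how does transformer attention work", DOCS)
```

The retrieved documents are then pasted into the prompt as context, so the model answers from grounded text instead of purely from its parameters.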
From a product perspective, tool use, caching/batching, and optimization such as quantization or knowledge distillation are part of the whole stack. The same base model can feel very different depending on how you run it.
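Quantization, one of the serving-side optimizations above, can be illustrated with symmetric per-tensor int8 quantization: store weights as 8-bit integers plus one float scale, and accept a small, bounded rounding error. A minimal sketch with an assumed random weight matrix:

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor int8 quantization: W ≈ scale * q."""
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)
max_err = np.abs(W - W_hat).max()   # bounded by scale / 2
```

Storing `q` instead of `W` cuts memory by 4x versus float32, which is the basic trade behind int8 serving and, combined with low-rank adapters, behind QLoRA-style fine-tuning.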
This section ties the whole Advanced DL track to how you might reason about it in exam-style questions. Next-token prediction in pretraining builds broad language ability and connects to probabilistic generation and representation learning. Instruction tuning and SFT shape how models follow user intent, which brings in data formatting and fine-tuning.
Alignment addresses preferences, safety, and truthfulness through ideas like preference learning and reward modeling. RAG and grounded generation lean on retrieval, embeddings, and assembling context to reduce ungrounded answers. Inference optimization targets latency and cost with quantization, caching, distillation, and similar serving-side tools.