Ch.00
Advanced DL: Large Models and Generative AI Paradigm
Advanced Deep Learning (Ch.00) is the entry point that connects “why models got so large” with “how generative AI systems actually work.” We go beyond learning representations from data: how large Transformers build contextual understanding, predict the next token, and then how we align, control, and deploy those models for real users.
An advanced roadmap toward large generative models
This roadmap gradually fills from Ch01 onward, showing how each chapter contributes to the full system.
What you will learn in Ch.01–Ch.29
- Ch.01 Transformer 1: Self-Attention and Parallelization
- Ch.02 Transformer 2: Positional Encoding and Feed-Forward
- Ch.03 Transformer Lineage: Encoder (BERT) vs Decoder (GPT)
- Ch.04 Attention Optimization: FlashAttention and Sparse Attention
- Ch.05 Vision Transformer (ViT) and Image Patches
- Ch.06 Swin Transformer: Hierarchical Windows and Global Context
- Ch.07 Vision Models: Local CNN vs Global ViT
- Ch.08 PEFT 1: PEFT and LoRA
- Ch.09 QLoRA and Quantization: Tuning When Smaller
- Ch.10 Value Alignment and RLHF: Matching Human Preferences
- Ch.11 DPO: Aligning with Preferences without Reinforcement Learning
- Ch.12 RAG: Reducing Hallucinations with Retrieval
- Ch.13 LLM Agents: Models That Use Tools
- Ch.14 Master CNNs: Kernels, Stride, Padding & Backbone Evolution
- Ch.15 Object Detection: R-CNN Family vs YOLO (Bounding Boxes)
- Ch.16 Image Segmentation: U-Net and DeepLab (Pixel-Level Understanding)
- Ch.17 Grad-CAM and XAI: Where CNNs Look
- Ch.18 Graph Neural Networks (GNN): Message Passing to Neighbors
- Ch.19 Autoencoder: Compress and Reconstruct
- Ch.20 VAE: A Generative Space in Probability
- Ch.21 GAN Basics: Generator vs Discriminator
- Ch.22 Conditional GAN: Generate on Condition
- Ch.23 Diffusion 1: Add Noise, Then Denoise
- Ch.24 Diffusion 2: Diffusing in Latent Space
- Ch.25 Vision-Language Models and CLIP: Images and Text Together (CNN Meets LLM)
- Ch.26 Speech Recognition and Audio: Sound to Text
- Ch.27 Model Compression and Knowledge Distillation
- Ch.28 Inference Optimization and Deployment: From Servers to Browser Runtimes
- Ch.29 Advanced DL Wrap-Up: Architecture and Future
What is Advanced DL? (Generative AI system view)
Foundation models / LLMs are trained with the objective of predicting the next token. In other words, they maximize the likelihood of each token given its preceding context, `Σ_t log P(x_t | x_{<t})`, learning language flow and patterns that go beyond simple grammar.
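The next-token objective can be made concrete with a small sketch: given one row of model logits per position and the true next-token ids, the loss is the average negative log-likelihood. The array shapes and values below are toy assumptions for illustration.

```python
import numpy as np

def next_token_nll(logits, targets):
    """Average negative log-likelihood of the true next tokens.

    logits:  (T, V) unnormalized scores, one row per position.
    targets: (T,) true next-token ids.
    """
    # Numerically stable log-softmax over the vocabulary dimension.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out log P(target | context) at each position, average, negate.
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy example: 3 positions, vocabulary of 5 tokens.
logits = np.array([[2.0, 0.1, 0.1, 0.1, 0.1],
                   [0.1, 3.0, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 4.0, 0.1]])
targets = np.array([0, 1, 3])
loss = next_token_nll(logits, targets)
```

Training a foundation model is, at its core, minimizing this quantity over enormous amounts of text.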
A practical way to understand generative AI is to split it into stages: pretraining (broad knowledge), instruction / SFT (follow user intent), and alignment (preference, safety, and reduced hallucinations).
The backbone is mostly Transformers. Self-attention creates “token-to-token” context, and feed-forward + normalization layers refine it so the model stays consistent even with long contexts.
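The "token-to-token context" that self-attention creates can be sketched in a few lines of numpy: each token's query is matched against every token's key, the resulting weights are softmax-normalized, and the values are mixed accordingly. This is a minimal single-head version with no masking or multi-head split; shapes and random weights are illustrative assumptions.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (no masking).

    X: (T, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_k) projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # token-to-token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # context-mixed values

rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

Because every position is computed with the same matrix products, all tokens are processed in parallel, which is exactly what makes Transformers train efficiently on GPUs.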
Bigger models can improve capability, but they also make training less stable and dramatically increase cost. Advanced DL therefore focuses on more than accuracy: training stability, efficiency (compute/memory), and reproducibility.
In the real world, generative AI is judged by trust: truthfulness, safety, and reliability. Achieving that requires alignment, evaluation, and control mechanisms.
Finally, deployment constraints (latency, cost, server limits) matter. So advanced DL continues from training to inference optimization, compression, and serving strategies.
In production, systems usually follow a pipeline like `text/image -> tokenization -> context window -> Transformer -> decoding (greedy/beam/sample)`. Decoding strategy and prompt design strongly affect output quality.
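The decoding step at the end of that pipeline is worth seeing in isolation. A minimal sketch of two of the strategies mentioned above, greedy and temperature sampling, over one step of toy logits (the values and the `decode_step` helper are assumptions for illustration):

```python
import numpy as np

def decode_step(logits, strategy="greedy", temperature=1.0, rng=None):
    """Pick the next token id from one step of model logits.

    greedy: argmax; sample: draw from softmax(logits / temperature).
    """
    if strategy == "greedy":
        return int(np.argmax(logits))
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()                      # softmax over the vocabulary
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(logits), p=probs))

logits = np.array([1.0, 3.0, 0.5])
greedy_id = decode_step(logits)               # deterministic: highest logit
sampled_id = decode_step(logits, "sample", temperature=0.8,
                         rng=np.random.default_rng(42))
```

Greedy decoding always returns the same continuation, while lowering or raising the temperature trades determinism against diversity, which is one reason the same model can produce very different-feeling outputs.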
Alignment and control can be done in multiple ways. For example, RLHF / DPO uses preferences to improve the model, and RAG retrieves external knowledge to ground answers.
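The retrieval half of RAG reduces to "embed the query, embed the documents, return the nearest ones." A deliberately tiny sketch, where the corpus, vocabulary, and bag-of-words "embedding" are toy assumptions (real systems use learned dense embeddings and a vector index):

```python
import numpy as np

# Hypothetical toy corpus; the "embedding" is just a word-count vector.
VOCAB = ["attention", "diffusion", "noise", "token", "transformer"]
DOCS = [
    "transformer attention token",   # doc 0: about Transformers
    "diffusion noise denoise",       # doc 1: about diffusion
]

def embed(text):
    words = text.split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

def retrieve(query, docs, k=1):
    """Return indices of the k docs most cosine-similar to the query."""
    q = embed(query)
    sims = [q @ embed(d) / (np.linalg.norm(q) * np.linalg.norm(embed(d)) + 1e-9)
            for d in docs]
    return list(np.argsort(sims)[::-1][:k])

top = retrieve("how does transformer attention work", DOCS)
```

The retrieved documents are then pasted into the prompt as context, so the model answers from grounded text instead of purely from its parameters.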
From a product perspective, tool use, caching/batching, and optimization such as quantization or knowledge distillation are part of the whole stack. The same base model can feel very different depending on how you run it.
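Quantization, one of the serving-side optimizations above, can be illustrated with symmetric per-tensor int8 quantization: store weights as 8-bit integers plus one float scale, and accept a small, bounded rounding error. A minimal sketch with an assumed random weight matrix:

```python
import numpy as np

def quantize_int8(W):
    """Symmetric per-tensor int8 quantization: W ≈ scale * q."""
    scale = np.abs(W).max() / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32)
q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)
max_err = np.abs(W - W_hat).max()   # bounded by scale / 2
```

Storing `q` instead of `W` cuts memory by 4x versus float32, which is the basic trade behind int8 serving and, combined with low-rank adapters, behind QLoRA-style fine-tuning.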
This section ties the whole Advanced DL track to how you might reason about it in exam-style questions. Next-token prediction in pretraining builds broad language ability and connects to probabilistic generation and representation learning. Instruction tuning and SFT shape how models follow user intent, which brings in data formatting and fine-tuning.
Alignment addresses preferences, safety, and truthfulness through ideas like preference learning and reward modeling. RAG and grounded generation lean on retrieval, embeddings, and assembling context to reduce ungrounded answers. Inference optimization targets latency and cost with quantization, caching, distillation, and similar serving-side tools.