Ch.08

PEFT 1: PEFT and LoRA

Skip one giant table—use two small pieces

Don’t rewrite a huge weight table from scratch. Multiply a tall skinny B by a wide short A to build the update. When the middle width r is small, you train far fewer numbers.
The grids show an example size (4×2 · 2×5 → 4×5).
[Diagram: ① B is tall (4×2, rows×cols), ② the middle width r is narrow, ③ A is wide (2×5), ④ their product is one table: B (4×2) × A (2×5) = ΔW (4×5), with the narrowest width in the middle]
You train roughly the two small matrices’ worth of numbers—far less than tuning every cell of the full table. LoRA updates only those two pieces.

PEFT & LoRA: fine-tune with few trainable weights

Keep the main highway ($W_0$) frozen; lay narrow LoRA ramps ($A,B$) to steer the task.

Keep the big pretrained chunk (W₀) as-is. Only small A and B learn to nudge the output. The wide lane and the narrow shortcut meet to form the final answer.

Load backbone → Freeze W₀ → Train A·B → α/r scale → Output / merge

Training flow at a glance

  1. ① Load backbone: Start from the big model that’s already trained.
  2. ② Freeze: Leave the original weights alone; training signal goes through LoRA and the top layers.
  3. ③ Train LoRA: Learn only the two small pieces (A and B) to gently nudge the output.
  4. ④ Dial the strength: Set how much LoRA matters with a scale (often α divided by r; details vary by code).
  5. ⑤ Use, merge, ship: After it looks good, fold the update into the base weights or release adapters only.
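The five steps above can be sketched end to end in NumPy. This is a minimal toy, not a real training loop: the sizes, seed, learning rate, and the random regression target are all invented for illustration, and the gradients are derived by hand for a squared-error loss. The zero init for B follows the common LoRA convention that the bypass starts as a no-op.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 10, 2, 4.0

# ① Load backbone: pretend W0 is a pretrained, already-useful weight matrix.
W0 = rng.normal(size=(d_out, d_in))

# ② Freeze: W0 is never updated below.
# ③ Train LoRA only: B starts at zero (a common init), A starts small.
B = np.zeros((d_out, r))
A = 0.01 * rng.normal(size=(r, d_in))
scale = alpha / r                          # ④ the strength dial

# Toy task: a random linear target the frozen backbone cannot match alone.
X = rng.normal(size=(d_in, 64))
Y = rng.normal(size=(d_out, d_in)) @ X

lr, losses = 0.02, []
for _ in range(1000):
    H = W0 @ X + scale * (B @ (A @ X))     # h = W0 x + ΔW x
    E = H - Y                              # residual
    n = X.shape[1]
    losses.append(0.5 * np.sum(E ** 2) / n)
    gB = scale * E @ (A @ X).T / n         # gradients flow to B and A only;
    gA = scale * B.T @ E @ X.T / n         # W0 receives no update at all
    B -= lr * gB
    A -= lr * gA

# ⑤ Merge (optional): fold the learned correction into one matrix.
W_merged = W0 + scale * B @ A
```

The loss falls even though only the two small matrices move, and the merged matrix reproduces the two-branch output exactly.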
Think of training an AI model as remodeling a house. The ViT from Chapter 05 and huge language models are like a stately mansion already built—a pretrained model. If you want to reshape it to your taste, tearing down every pillar and wall to rebuild is full fine-tuning. It can boost performance, but it costs massive memory and storage—and a lot of time.
The remedy is PEFT (Parameter-Efficient Fine-Tuning). PEFT says: keep the structural frame of the house as-is, and only bring in the small custom furniture (adapters) you really need. It is a smart, economical playbook.
The best-known member of the PEFT family is LoRA (Low-Rank Adaptation). You leave the model’s enormous knowledge matrix alone and lay a very narrow bypass (a low-rank matrix) beside it, like a shortcut. In symbols, you build the new learnable knowledge $\Delta W$ from the product of two smaller matrices, $BA$, instead of rewriting the giant matrix wholesale.
The model’s output becomes $h = W_0 x + \Delta W x$: you add the newly taught part ($\Delta W x$) on top of what was already there ($W_0 x$). This chapter unpacks LoRA in plain language: how it almost magically lets you handle big models lightly, even on your own machine.

Reading the math (one LoRA linear layer)

1) Pretrained weights & freezing
In one sentence: keep the big matrix $W_0$ as-is; only learn a small correction $\Delta W$.
For weights $W_0 \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}}$, input $x$, and output $h$, LoRA often reads as “base output + a little extra”:
$h = W_0 x + \Delta W x$
- $W_0 x$: what the pretrained model already does (usually frozen)
- $\Delta W x$: the extra part you tune for your task (LoRA mainly trains $\Delta W$)
Full fine-tuning edits every entry of $W_0$; LoRA usually freezes $W_0$ and only fits the correction.
[Diagram: one LoRA linear layer]
One layer: the top branch is the frozen $W_0$, giving $W_0 x$; the bottom branch is the trainable $A \to B$ path scaled by $\alpha/r$, yielding $\Delta W x$. They add to $h = W_0 x + \Delta W x$.
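The “both branches, then add” structure fits in a few lines of NumPy. This is a sketch with arbitrary sizes and seed; the zero init for B reflects the common LoRA convention that the bypass contributes nothing before training:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 9, 3, 6.0

W0 = rng.normal(size=(d_out, d_in))   # frozen top branch
B = np.zeros((d_out, r))              # common init: B = 0 ...
A = rng.normal(size=(r, d_in))        # ... A random, so ΔW = (α/r)BA = 0
x = rng.normal(size=d_in)

h = W0 @ x + (alpha / r) * (B @ (A @ x))   # both branches, then add

# Before any training, the bottom branch contributes nothing:
assert np.allclose(h, W0 @ x)
```

Once training moves B off zero, the bottom branch starts nudging the output while the top branch stays fixed.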
2) Writing $\Delta W$ in low rank
In one sentence: don’t store $\Delta W$ as one huge dense matrix; build it from two smaller matrices $B$ and $A$ with a narrow middle width $r$.
Pick $r \ll \min(d_{\mathrm{in}}, d_{\mathrm{out}})$, with $B \in \mathbb{R}^{d_{\mathrm{out}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\mathrm{in}}}$:
$\Delta W = \frac{\alpha}{r} B A$
- $BA$: the “shape” of the correction passing through the narrow bottleneck $r$
- $\alpha/r$: a dial for how strongly the correction blends into the output (the exact factoring varies by code)
Trainable count ≈ entries in $B$ plus entries in $A$, i.e. $d_{\mathrm{out}} r + r d_{\mathrm{in}}$, far smaller than rewriting all of $W_0$.
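Plugging in the square layer used in this chapter’s drills ($d_{\mathrm{out}} = d_{\mathrm{in}} = 64$, $r = 4$) makes the count concrete:

```python
d_out = d_in = d = 64            # one square d×d layer
r = 4                            # narrow middle width
full = d_out * d_in              # tuning every cell of W0
lora = d_out * r + r * d_in      # entries of B plus entries of A
print(full, lora)                # 4096 512 → lora = 2dr, an 8× saving here
```

For a square layer this simplifies to $2dr$, which is exactly the shortcut formula the problem bank uses later.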
3) Why low rank?
Intuition: many task-specific changes may live in a few main directions, not in every possible direction, so a small middle rank $r$ can still work.
- Larger $r$ → more capacity, but more compute/storage
- Where you attach LoRA (e.g. attention $W_Q, W_K, \dots$ vs. the FFN) changes what you observe in practice
4) Optional merge for inference
In one sentence: if you don’t want an extra “LoRA side path” at runtime, fold the learned correction into one weight matrix.
After training, add $\Delta W$ to $W_0$:
$W = W_0 + \Delta W$
Then a plain linear layer (one matmul) can run inference with lower latency. Numeric details depend on the framework.
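A quick NumPy check of the merge (shapes and values are arbitrary): the folded matrix must reproduce the two-branch output exactly, since matrix multiplication distributes over the sum.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r, alpha = 6, 5, 2, 8.0
W0 = rng.normal(size=(d_out, d_in))
B = rng.normal(size=(d_out, r))        # pretend these were just trained
A = rng.normal(size=(r, d_in))
x = rng.normal(size=d_in)

two_matmuls = W0 @ x + (alpha / r) * (B @ (A @ x))   # training-time layout
W = W0 + (alpha / r) * (B @ A)                       # fold ΔW into W0 once
one_matmul = W @ x                                   # deployment-time layout

assert np.allclose(two_matmuls, one_matmul)
```

The merge is done once, offline; afterwards inference pays for a single matmul per layer, just like the original model.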

PEFT & LoRA: Fine-Tune by Updating Only a Few Parameters

1. Why PEFT? (Keep the frame, change the furniture)
* Concept: Models like ChatGPT can have tens of billions of parameters. Full fine-tuning updates all of them—it needs a powerful GPU and huge checkpoints. PEFT means any technique that trains only a tiny fraction (often under 1%) to make the model smart efficiently.
* Analogy: Nobody buys a new phone just for a winter vibe—you slap on a winter phone case (PEFT). Swap cases for different uses; swap light adapters for different tasks.
* Practice: Great value, but if the adapter is too small or data too scarce, the model may not absorb new knowledge—tuning learning rate and setup matters.
2. LoRA’s secret: two small matrices instead of one giant one
* Concept: A deep model is built from big tables of numbers; call the original one $W_0$. LoRA freezes $W_0$ and learns two small matrices $B$ and $A$. The key identity is $\Delta W = BA$.
* Analogy: Reprinting a 1,000-page encyclopedia ($W_0$) to fix a typo is wasteful. Instead, stick a small sticky note ($BA$) on the right page.
* Detail: $A$ compresses the information to its essentials; $B$ expands it back, so you train far fewer numbers than editing the whole matrix.
3. LoRA’s cockpit: rank $r$ and scale $\alpha$
* Concept: Two dials matter: rank $r$ and scale $\alpha$.
* Rank $r$ (how many side lanes): Think “size of the sticky note.” $r=8$ is an 8-lane bypass; $r=16$ is wider: smarter, but more memory.
* Scale $\alpha$ (how hard to blend new knowledge): An amplifier for how strongly $BA$ mixes with the frozen $W_0$.
* Tip: If VRAM is tight, lower $r$; if the model won’t fit your data, try raising $r$ a bit.
4. Fits Chapters 04 & 05—and a preview of Chapter 09
* Concept: Remember attention from Ch.04–05? LoRA often “sticks notes” on $Q$, $K$, $V$: wherever there is a linear weight matrix, you can attach LoRA.
* Preview: Even with LoRA, the big house ($W_0$) is still heavy in memory. Chapter 09 meets quantization plus LoRA, the QLoRA duo, to slim the backbone further.

Why it matters

Anyone can train a giant model on a spare GPU
Once you needed a datacenter to customize huge models. With LoRA, trainable parameters can drop to 1/100 or 1/1000—so startups and students can still adapt massive models to their domain (medicine, finance, law, …) on limited GPUs.
Chameleon models: the magic of multi-task deployment
Turning one model into a translator, coder, and chef used to mean storing three full copies. With PEFT you keep one shared brain (base) and only store light LoRA adapters per role—swap adapters, not the whole model, for far simpler ops.
Smarter forgetting: reduce catastrophic forgetting
Full fine-tuning can make the model “forget” its basics, like learning college math and losing the times tables. LoRA freezes $W_0$ and trains only the side path, so you can add new knowledge while keeping generalization safer.
Efficiency in the big picture
Chapters 01–03 gave you the skeleton (linear layers); this chapter answers how to add muscle efficiently. If you’ve grasped parameter-efficient updates, you’re ready for heavier optimization tricks.

How it is used

Step 1: Freeze the base, train LoRA only (training strategy)
1. Freeze the pretrained weights (lock them).
2. Attach empty LoRA modules at key joints (attention layers, etc.).
3. Train only those LoRA modules. It’s faster than full FT—if data is small, watch a validation set for overfitting.
Step 2: Merge for deployment (merge & inference)
After training, you can “print” what’s on the sticky note ($BA$) into the encyclopedia ($W_0$): a weight merge.
* $W_{\mathrm{merged}} = W_0 + BA$
Merged, the model doesn’t need two matmuls at inference, so you avoid extra latency.
Step 3: Pick your PEFT outfit
* Adapter: small networks tacked onto each layer.
* Prompt tuning: learn virtual tokens in front of the prompt.
* LoRA: softly edits weight matrices with few constraints—today’s default in industry for quality vs cost.
✅ Practical checklist
* Quality: if answers are weak, raise the rank ($r$), the scale ($\alpha$), or widen `target_modules`.
* Memory: shrink batch first; next chapter’s QLoRA (quantization + LoRA) is the next lever.
* Reproducibility: log which layers got LoRA and which $r$ you used, so you can rebuild runs later.
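A minimal sketch of such a run log; the field names here are invented for illustration (any structured format alongside your checkpoint works):

```python
import json

# Hypothetical run log: which modules got LoRA, and with what dials.
run = {
    "base_checkpoint": "my-backbone-v1",   # which pretrained model
    "lora_target_layers": ["W_Q", "W_V"],  # where adapters were attached
    "r": 8,                                # rank of the bottleneck
    "alpha": 16,                           # scale dial
    "learning_rate": 1e-4,
    "seed": 42,
}
record = json.dumps(run, indent=2)         # save next to the adapter weights
```

With this record you can rebuild the exact adapter placement later, which is what the checklist item is asking for.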

Summary

One-liner: freeze the huge pretrained $W_0$ and put new updates in the low-rank form $\Delta W = BA$ so trainable parameters drop dramatically.
Links: on the linear-layer and attention skeleton from Ch.01–03, LoRA and PEFT are the most economical, smart way to adapt the model to you.
Practice: balance rank ($r$), scale ($\alpha$), and target modules; if memory still hurts, hand off to Chapter 09’s QLoRA (quantization) for lighter optimization.

How to approach problems

LoRA/PEFT items pair a frozen $W_0$ with a low-rank $BA$ and ask about trainable counts, $r$, and $\alpha$. ViT patch drills (Chapter 05 review) use $(H/p) \times (W/p)$; one square $d \times d$ layer often has about $2dr$ LoRA parameters in the common factored form.
① Numeric patterns you may see (same shape as the bank)
LoRA trainable-parameter count (aggregate): for one layer with $d=64$, $r=4$, approximate $2dr = 512$

Patch grid (config): 8 patches along each side → $8 \times 8 = 64$ cells

Patch count (vote-style): input $32 \times 32$, patch $16 \times 16$, no CLS → $(32/16)^2 = 4$

Patch tokens (ensemble): $224 \times 224$, patch $16$, exclude CLS → $(224/16)^2 = 14^2 = 196$

Dense attention scale (ensemble): $N = 20$ tokens → $N^2 = 400$
Example (concept) — “Main goal of PEFT?”
② few extra parameters → 2.

Example (calc) — “$d=64$, $r=4$, approximate LoRA params $\approx 2dr$?” → $2 \cdot 64 \cdot 4 = 512$ → 512.

Example (T/F) — “Usually train all of $W_0$ in LoRA.” → Usually frozen → 0.

Example (application) — “Ship many tasks cheaply?”
② save PEFT adapters only → 2.

ViT review — $224 \times 224$, patch $16$, no CLS → $14^2 = 196$.
Definition — “LoRA always learns a full-rank $\Delta W$ directly.” → False (low-rank). 0

T/F — “Increasing $r$ usually increases LoRA trainable parameters.” → True. 1

Choice — “What should you combine with LoRA in Chapter 09 to save more VRAM?”
① quantization
② delete labels only → 1

Calc — “$d=32$, $r=2$, $2dr$?” → 128.