Ch.08
PEFT 1: PEFT and LoRA
Skip one giant table—use two small pieces
Don’t rewrite a huge weight table from scratch. Multiply a tall skinny B by a wide short A to build the update. When the middle width r is small, you train far fewer numbers.
The grids show an example size (4×2 · 2×5 → 4×5).
① B: tall · ② r: narrow · ③ A: wide · ④ One table
You train roughly the two small matrices’ worth of numbers—far less than tuning every cell of the full table. LoRA updates only those two pieces.
PEFT & LoRA: fine-tune with few trainable weights
Keep the main highway ($W_0$) frozen; lay narrow LoRA ramps ($A,B$) to steer the task.
Keep the big pretrained chunk (W₀) as-is. Only small A and B learn to nudge the output. The wide lane and the narrow shortcut meet to form the final answer.
Load backbone → Freeze W₀ → Train A·B → α/r scale → Output / merge
Training flow at a glance
- ① Load backbone: Start from the big model that’s already trained.
- ② Freeze: Leave the original weights alone; training signal goes through LoRA and the top layers.
- ③ Train LoRA: Learn only the two small pieces (A and B) to gently nudge the output.
- ④ Dial the strength: Set how much LoRA matters with a scale (often α divided by r; details vary by code).
- ⑤ Use, merge, ship: After it looks good, fold the update into the base weights or release adapters only.
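The flow above can be sketched as a single LoRA linear layer in NumPy (sizes here are hypothetical, chosen only for illustration): $W_0$ stays frozen, only $A$ and $B$ would be trained, and the correction is blended with $\alpha/r$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 4, 8.0   # hypothetical sizes and scale

W0 = rng.normal(size=(d_out, d_in))      # pretrained weight: frozen, never updated
A = rng.normal(size=(r, d_in)) * 0.01    # small trainable matrix (compress)
B = np.zeros((d_out, r))                 # small trainable matrix (expand), init to 0

def lora_forward(x):
    # base output + scaled low-rank correction
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
h = lora_forward(x)
# with B initialized to zero, the correction starts at exactly zero,
# so the layer initially behaves like the frozen backbone:
assert np.allclose(h, W0 @ x)
```

A common initialization (small random $A$, zero $B$) makes the adapter a no-op at step 0, so training starts from the pretrained behavior.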
Think of training an AI model as remodeling a house. The ViT from Chapter 05 and huge language models are like a stately mansion already built—a pretrained model. If you want to reshape it to your taste, tearing down every pillar and wall to rebuild is full fine-tuning. It can boost performance, but it costs massive memory and storage—and a lot of time.
The remedy is PEFT (Parameter-Efficient Fine-Tuning). PEFT says: keep the structural frame of the house as-is, and only bring in the small custom furniture (adapters) you really need—a smart, economical playbook.
The best-known member of the PEFT family is LoRA (Low-Rank Adaptation). You leave the model’s enormous knowledge matrix alone and lay a very narrow bypass (a low-rank matrix) beside it—a shortcut. In symbols, you build new learnable knowledge from the product of two smaller matrices, $\Delta W = BA$, instead of rewriting the giant matrix wholesale.
The model’s output becomes $h = W_0 x + BAx$: you add the newly taught part ($BAx$) on top of what was already there ($W_0 x$). This chapter unpacks LoRA in plain language—how it almost magically lets you handle big models lightly, even on your own machine.
Reading the math (one LoRA linear layer)
1) Pretrained weights & freezing
In one sentence: keep the big matrix $W_0$ as-is; only learn a small correction $\Delta W$.
For weights $W_0$, input $x$, and output $h$, LoRA often reads as “base output + a little extra”: $h = W_0 x + \Delta W x$
- $W_0 x$: what the pretrained model already does (usually frozen)
- $\Delta W x$: the extra part you tune for your task (LoRA mainly trains $\Delta W$)
Full fine-tuning edits every entry of $W_0$; LoRA usually freezes $W_0$ and only fits the correction $\Delta W$.
2) Writing in low rank
In one sentence: don’t store $\Delta W$ as one huge dense matrix—build it from two smaller matrices $B$ and $A$ with a narrow middle width $r$.
Pick $\Delta W = \frac{\alpha}{r} BA$, with $B \in \mathbb{R}^{d_{\text{out}} \times r}$, $A \in \mathbb{R}^{r \times d_{\text{in}}}$, $r \ll \min(d_{\text{in}}, d_{\text{out}})$:
- $BA$: the “shape” of the correction passing through the narrow bottleneck
- $\alpha$: a dial for how strongly the correction blends into the output (exact factoring varies by code)
Trainable count ≈ entries in $B$ plus entries in $A$, i.e. $r \cdot d_{\text{out}} + r \cdot d_{\text{in}} = r(d_{\text{out}} + d_{\text{in}})$—far smaller than rewriting all $d_{\text{out}} \times d_{\text{in}}$ entries of $W_0$.
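To make the count concrete, a quick check with illustrative transformer-scale sizes ($d_{\text{in}} = d_{\text{out}} = 4096$, $r = 8$; these specific numbers are assumptions, not from the text):

```python
d_in, d_out, r = 4096, 4096, 8    # illustrative sizes

full = d_out * d_in               # entries updated by full fine-tuning
lora = r * (d_out + d_in)         # entries in B (d_out*r) plus A (r*d_in)

print(full, lora, lora / full)    # 16777216 65536 0.00390625
```

Here LoRA trains roughly 0.4% of the entries of one weight matrix—the source of the “under 1%” figure quoted for PEFT.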
3) Why low rank?
Intuition: many task-specific changes may live in a few main directions, not in every possible direction—so a small middle rank can still work.
- Larger $r$ → more capacity, but more compute/storage
- Where you attach LoRA (e.g. attention vs FFN) changes what you observe in practice
4) Optional merge for inference
In one sentence: if you don’t want an extra “LoRA side path” at runtime, fold the learned correction into one weight matrix.
After training, add $\frac{\alpha}{r} BA$ to $W_0$: $W_{\text{merged}} = W_0 + \frac{\alpha}{r} BA$
Then a plain linear layer (one matmul) can run inference with lower latency. Numeric details depend on the framework.
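A sketch of the merge in NumPy, assuming the $\Delta W = \frac{\alpha}{r} BA$ scaling above (shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 8, 2, 4.0

W0 = rng.normal(size=(d, d))   # frozen base weight
B = rng.normal(size=(d, r))    # learned LoRA factors
A = rng.normal(size=(r, d))

# fold the learned correction into a single weight matrix
W_merged = W0 + (alpha / r) * (B @ A)

x = rng.normal(size=d)
two_matmuls = W0 @ x + (alpha / r) * (B @ (A @ x))  # adapter kept as a side path
one_matmul = W_merged @ x                           # merged: one plain linear layer
assert np.allclose(two_matmuls, one_matmul)
```

Because the merge is exact (up to floating-point rounding), the merged model gives the same outputs with no extra side-path cost at inference.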
PEFT & LoRA: Fine-Tune by Updating Only a Few Parameters
1. Why PEFT? (Keep the frame, change the furniture)
* Concept: Models like ChatGPT can have tens of billions of parameters. Full fine-tuning updates all of them—it needs a powerful GPU and huge checkpoints. PEFT is the umbrella term for any technique that trains only a tiny fraction of the parameters (often under 1%) to adapt the model efficiently.
* Analogy: Nobody buys a new phone just for a winter vibe—you slap on a winter phone case (PEFT). Swap cases for different uses; swap light adapters for different tasks.
* Practice: Great value, but if the adapter is too small or data too scarce, the model may not absorb new knowledge—tuning learning rate and setup matters.
2. LoRA’s secret: two small matrices instead of one giant one
* Concept: A deep model is built from big tables of numbers—call the original one $W_0$. LoRA freezes $W_0$ and learns two small matrices $A$ and $B$. The key identity is $W = W_0 + BA$.
* Analogy: Reprinting a 1,000-page encyclopedia ($W_0$) to fix a typo is wasteful. Instead, stick a small sticky note ($BA$) on the right page.
* Detail: $A$ compresses information to essentials; $B$ expands it back—so you train far fewer numbers than editing the whole matrix.
3. LoRA’s cockpit: rank and scale
* Concept: Two dials matter: rank $r$ and scale $\alpha$.
* Rank $r$ (how many side lanes): Think “size of the sticky note.” $r=8$ is an 8-lane bypass; $r=16$ is wider—smarter but more memory.
* Scale $\alpha$ (how hard to blend new knowledge): An amplifier for how strongly $BAx$ mixes with the frozen $W_0 x$.
* Tip: If VRAM is tight, lower $r$; if the model won’t fit your data, try raising $r$ a bit.
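In the Hugging Face `peft` library (one common implementation; the parameter names below are from that library, while the `target_modules` names are model-specific assumptions), the two dials map to `r` and `lora_alpha`:

```python
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=8,                                  # rank: width of the bottleneck
    lora_alpha=16,                        # scale: update is multiplied by alpha/r
    target_modules=["q_proj", "v_proj"],  # which linear layers get adapters (model-dependent)
    lora_dropout=0.05,
)
# model = get_peft_model(base_model, config)  # wraps an already-loaded backbone
```

This is a configuration sketch, not a full training script; which modules to target depends on the backbone’s layer names.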
4. Fits Chapters 04 & 05—and a preview of Chapter 09
* Concept: Remember attention from Ch.04–05? LoRA often “sticks notes” on $W_Q$, $W_K$, $W_V$—wherever there is a linear weight matrix, you can attach LoRA.
* Preview: Even with LoRA, the big house ($W_0$) is still heavy in memory. Chapter 09 meets quantization plus LoRA—the QLoRA duo—to slim the backbone further.
Why it matters
Anyone can adapt a giant model on a modest GPU
Once you needed a datacenter to customize huge models. With LoRA, trainable parameters can drop to 1/100 or 1/1000—so startups and students can still adapt massive models to their domain (medicine, finance, law, …) on limited GPUs.
Chameleon models: the magic of multi-task deployment
Turning one model into a translator, coder, and chef used to mean storing three full copies. With PEFT you keep one shared brain (base) and only store light LoRA adapters per role—swap adapters, not the whole model, for far simpler ops.
Smarter forgetting: reduce catastrophic forgetting
Full fine-tuning can make the model “forget” basics—like learning college math and losing times tables. LoRA freezes $W_0$ and trains only the side path, so you can add new knowledge while keeping generalization safer.
Efficiency in the big picture
Chapters 01–03 gave you the skeleton (linear layers); this chapter answers how to add muscle efficiently. If you’ve grasped parameter-efficient updates, you’re ready for heavier optimization tricks.
How it is used
Step 1: Freeze the base—train LoRA only (training strategy)
1. Freeze the pretrained weights (lock them).
2. Attach empty LoRA modules at key joints (attention layers, etc.).
3. Train only those LoRA modules. It’s faster than full FT—if data is small, watch a validation set for overfitting.
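A minimal PyTorch sketch of steps 1–3, assuming a single linear layer stands in for the backbone (all names here are illustrative, not from a specific library):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # 1) freeze the pretrained weights
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # 2) attach LoRA pieces
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(16, 16))
# 3) train only the LoRA parameters
trainable = [p for p in layer.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-3)

x, target = torch.randn(2, 16), torch.randn(2, 16)
loss = ((layer(x) - target) ** 2).mean()
loss.backward()
opt.step()
assert layer.base.weight.grad is None   # frozen weights received no gradient
```

Note that only `A` and `B` appear in the optimizer; the base weight never accumulates gradients, which is what keeps memory and checkpoint sizes small.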
Step 2: Merge for deployment (merge & inference)
After training, you can “print” what’s on the sticky note ($BA$) into the encyclopedia ($W_0$)—a weight merge:
* $W_{\text{merged}} = W_0 + \frac{\alpha}{r} BA$
Once merged, the model doesn’t need two matmuls at inference, so you avoid extra latency.
Step 3: Pick your PEFT outfit
* Adapter: small networks tacked onto each layer.
* Prompt tuning: learn virtual tokens in front of the prompt.
* LoRA: softly edits weight matrices with few constraints—today’s default in industry for quality vs cost.
✅ Practical checklist
* Quality: if answers are weak, raise rank ($r$), scale ($\alpha$), or widen `target_modules`.
* Memory: shrink batch first; next chapter’s QLoRA (quantization + LoRA) is the next lever.
* Reproducibility: log which layers got LoRA and which $r$ and $\alpha$ you used—so you can rebuild runs later.
Summary
One-liner — Freeze the huge pretrained $W_0$ and put new updates in the low-rank form $\Delta W = BA$ so trainable parameters drop dramatically.
Links: On the linear layer & attention skeleton from Ch.01–03, LoRA and PEFT are the most economical, smart way to adapt the model to you.
Practice: Balance rank ($r$), scale ($\alpha$), and target modules; if memory still hurts, hand off to Chapter 09 QLoRA (quantization) for lighter optimization.
How to approach problems
LoRA/PEFT items pair a frozen $W_0$ with a low-rank $BA$, and ask about trainable counts, $r$, and $\alpha$. ViT patch drills (Chapter 05 review) use $(\text{image size}/\text{patch size})^2$; one square $d \times d$ layer often has ~$2rd$ LoRA params in the common factored form.
① Numeric patterns you may see (same shape as the bank)
LoRA trainable-parameter count (aggregate) — For one square layer with $d = 32$, $r = 8$, approximate $2rd = 2 \cdot 8 \cdot 32$ = 512
Patch grid (config) — 8 patches along each side → 64 cells
Patch count (vote-style) — input $32 \times 32$, patch $16 \times 16$, no CLS → $(32/16)^2 = 4$
Patch tokens (ensemble) — input $224 \times 224$, patch $16 \times 16$, exclude CLS → $(224/16)^2 = 196$
Dense attention scale (ensemble) — $20$ tokens → $20^2 = 400$ pairwise scores
Example (concept) — “Main goal of PEFT?” → ② adapt with few extra parameters → 2.
Example (calc) — “$d = 32$, $r = 8$, approximate LoRA params ~?” → $2 \cdot 8 \cdot 32$ → 512.
Example (T/F) — “Usually train all of $W_0$ in LoRA.” → False: $W_0$ is usually frozen → 0.
Example (application) — “Ship many tasks cheaply?” → ② save PEFT adapters only → 2.
ViT review — $224 \times 224$, patch $16$, no CLS → 196.
Definition — “LoRA always learns a full-rank $\Delta W$ directly.” → False (low-rank). 0
T/F — “Increasing $r$ usually increases LoRA trainable parameters.” → True. 1
Choice — “What should you combine with LoRA in Chapter 09 to save more VRAM?”
① quantization
② delete labels only → 1
Calc — “$d = 8$, $r = 8$, $2rd$?” → 128.
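The drill numbers above can all be verified with a few lines of arithmetic (using the $2rd$ count for a square layer and $(\text{image}/\text{patch})^2$ for patch grids):

```python
# LoRA params for a square d x d layer in the 2rd factored form
assert 2 * 8 * 32 == 512        # d=32, r=8
assert 2 * 8 * 8 == 128         # d=8,  r=8

# ViT patch counts: (image_size // patch_size) ** 2, no CLS token
assert (32 // 16) ** 2 == 4     # 32x32 input, patch 16
assert (224 // 16) ** 2 == 196  # 224x224 input, patch 16
assert 8 ** 2 == 64             # 8 patches per side -> 64 cells

# dense attention: every token attends to every token
assert 20 ** 2 == 400           # 20 tokens -> 400 pairwise scores
```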