Ch.08

PEFT 1: PEFT and LoRA

Skip one giant table—use two small pieces

Don’t rewrite a huge weight table from scratch. Multiply a tall skinny B by a wide short A to build the update. When the middle width r is small, you train far fewer numbers.
The grids show an example size (4×2 · 2×5 → 4×5).
[Diagram: ① B is tall (4×2, rows×cols), ② the middle width r is narrow, ③ A is wide (2×5), ④ their product is one table: B (4×2) × A (2×5) = ΔW (4×5), with the narrowest width in the middle]
You train roughly the two small matrices’ worth of numbers—far less than tuning every cell of the full table. LoRA updates only those two pieces.

PEFT & LoRA: fine-tune with few trainable weights

Keep the main highway ($W_0$) frozen; lay narrow LoRA ramps ($A,B$) to steer the task.

Keep the big pretrained chunk (W₀) as-is. Only small A and B learn to nudge the output. The wide lane and the narrow shortcut meet to form the final answer.

Load backbone → Freeze W₀ → Train A·B → α/r scale → Output / merge

Training flow at a glance

  1. ① Load backbone: Start from the big model that’s already trained.
  2. ② Freeze: Leave the original weights alone; training signal goes through LoRA and the top layers.
  3. ③ Train LoRA: Learn only the two small pieces (A and B) to gently nudge the output.
  4. ④ Dial the strength: Set how much LoRA matters with a scale (often α divided by r; details vary by code).
  5. ⑤ Use, merge, ship: After it looks good, fold the update into the base weights or release adapters only.
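The five steps above can be sketched end to end in NumPy. This is a minimal toy, not a real training loop: the sizes, seed, learning rate, and the random regression target are all invented for illustration, and the gradients are derived by hand for a squared-error loss. The zero init for B follows the common LoRA convention that the bypass starts as a no-op.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 10, 2, 4.0

# ① Load backbone: pretend W0 is a pretrained, already-useful weight matrix.
W0 = rng.normal(size=(d_out, d_in))

# ② Freeze: W0 is never updated below.
# ③ Train LoRA only: B starts at zero (a common init), A starts small.
B = np.zeros((d_out, r))
A = 0.01 * rng.normal(size=(r, d_in))
scale = alpha / r                          # ④ the strength dial

# Toy task: a random linear target the frozen backbone cannot match alone.
X = rng.normal(size=(d_in, 64))
Y = rng.normal(size=(d_out, d_in)) @ X

lr, losses = 0.02, []
for _ in range(1000):
    H = W0 @ X + scale * (B @ (A @ X))     # h = W0 x + ΔW x
    E = H - Y                              # residual
    n = X.shape[1]
    losses.append(0.5 * np.sum(E ** 2) / n)
    gB = scale * E @ (A @ X).T / n         # gradients flow to B and A only;
    gA = scale * B.T @ E @ X.T / n         # W0 receives no update at all
    B -= lr * gB
    A -= lr * gA

# ⑤ Merge (optional): fold the learned correction into one matrix.
W_merged = W0 + scale * B @ A
```

The loss falls even though only the two small matrices move, and the merged matrix reproduces the two-branch output exactly.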
Think of training an AI model as remodeling a house. The ViT from Chapter 05 and huge language models are like a stately mansion already built—a pretrained model. If you want to reshape it to your taste, tearing down every pillar and wall to rebuild is full fine-tuning. It can boost performance, but it costs massive memory and storage—and a lot of time.
The remedy is PEFT (Parameter-Efficient Fine-Tuning). PEFT says: keep the structural frame of the house as-is, and only bring in the small custom furniture (adapters) you really need. It is a smart, economical playbook.
The best-known member of the PEFT family is LoRA (Low-Rank Adaptation). You leave the model’s enormous knowledge matrix alone and lay a very narrow bypass (a low-rank matrix) beside it, like a shortcut. In symbols, you build the new learnable knowledge $\Delta W$ from the product of two smaller matrices, $BA$, instead of rewriting the giant matrix wholesale.
The model’s output becomes $h = W_0 x + \Delta W x$: you add the newly taught part ($\Delta W x$) on top of what was already there ($W_0 x$). This chapter unpacks LoRA in plain language: how it almost magically lets you handle big models lightly, even on your own machine.

Reading the math (one LoRA linear layer)

1) Pretrained weights & freezing
In one sentence: keep the big matrix $W_0$ as-is; only learn a small correction $\Delta W$.
For weights $W_0 \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}}$, input $x$, and output $h$, LoRA often reads as “base output + a little extra”:
$h = W_0 x + \Delta W x$
- $W_0 x$: what the pretrained model already does (usually frozen)
- $\Delta W x$: the extra part you tune for your task (LoRA mainly trains $\Delta W$)
Full fine-tuning edits every entry of $W_0$; LoRA usually freezes $W_0$ and only fits the correction.
[Diagram: one LoRA linear layer]
One layer: the top branch is the frozen $W_0$, giving $W_0 x$; the bottom branch is the trainable $A \to B$ path scaled by $\alpha/r$, yielding $\Delta W x$. They add to $h = W_0 x + \Delta W x$.
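The “both branches, then add” structure fits in a few lines of NumPy. This is a sketch with arbitrary sizes and seed; the zero init for B reflects the common LoRA convention that the bypass contributes nothing before training:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 6, 9, 3, 6.0

W0 = rng.normal(size=(d_out, d_in))   # frozen top branch
B = np.zeros((d_out, r))              # common init: B = 0 ...
A = rng.normal(size=(r, d_in))        # ... A random, so ΔW = (α/r)BA = 0
x = rng.normal(size=d_in)

h = W0 @ x + (alpha / r) * (B @ (A @ x))   # both branches, then add

# Before any training, the bottom branch contributes nothing:
assert np.allclose(h, W0 @ x)
```

Once training moves B off zero, the bottom branch starts nudging the output while the top branch stays fixed.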
2) Writing $\Delta W$ in low rank
In one sentence: don’t store $\Delta W$ as one huge dense matrix; build it from two smaller matrices $B$ and $A$ with a narrow middle width $r$.
Pick $r \ll \min(d_{\mathrm{in}}, d_{\mathrm{out}})$, with $B \in \mathbb{R}^{d_{\mathrm{out}} \times r}$ and $A \in \mathbb{R}^{r \times d_{\mathrm{in}}}$:
$\Delta W = \frac{\alpha}{r} B A$
- $BA$: the “shape” of the correction passing through the narrow bottleneck $r$
- $\alpha/r$: a dial for how strongly the correction blends into the output (the exact factoring varies by code)
Trainable count ≈ entries in $B$ plus entries in $A$, i.e. $d_{\mathrm{out}} r + r d_{\mathrm{in}}$, far smaller than rewriting all of $W_0$.
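Plugging in the square layer used in this chapter’s drills ($d_{\mathrm{out}} = d_{\mathrm{in}} = 64$, $r = 4$) makes the count concrete:

```python
d_out = d_in = d = 64            # one square d×d layer
r = 4                            # narrow middle width
full = d_out * d_in              # tuning every cell of W0
lora = d_out * r + r * d_in      # entries of B plus entries of A
print(full, lora)                # 4096 512 → lora = 2dr, an 8× saving here
```

For a square layer this simplifies to $2dr$, which is exactly the shortcut formula the problem bank uses later.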
3) Why low rank?
Intuition: many task-specific changes may live in a few main directions, not in every possible direction, so a small middle rank $r$ can still work.
- Larger $r$ → more capacity, but more compute/storage
- Where you attach LoRA (e.g. attention $W_Q, W_K, \dots$ vs. the FFN) changes what you observe in practice
4) Optional merge for inference
In one sentence: if you don’t want an extra “LoRA side path” at runtime, fold the learned correction into one weight matrix.
After training, add $\Delta W$ to $W_0$:
$W = W_0 + \Delta W$
Then a plain linear layer (one matmul) can run inference with lower latency. Numeric details depend on the framework.
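A quick NumPy check of the merge (shapes and values are arbitrary): the folded matrix must reproduce the two-branch output exactly, since matrix multiplication distributes over the sum.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r, alpha = 6, 5, 2, 8.0
W0 = rng.normal(size=(d_out, d_in))
B = rng.normal(size=(d_out, r))        # pretend these were just trained
A = rng.normal(size=(r, d_in))
x = rng.normal(size=d_in)

two_matmuls = W0 @ x + (alpha / r) * (B @ (A @ x))   # training-time layout
W = W0 + (alpha / r) * (B @ A)                       # fold ΔW into W0 once
one_matmul = W @ x                                   # deployment-time layout

assert np.allclose(two_matmuls, one_matmul)
```

The merge is done once, offline; afterwards inference pays for a single matmul per layer, just like the original model.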

PEFT & LoRA: Fine-Tune by Updating Only a Few Parameters

1. Why PEFT? (Keep the frame, change the furniture)
* Concept: Models like ChatGPT can have tens of billions of parameters. Full fine-tuning updates all of them—it needs a powerful GPU and huge checkpoints. PEFT means any technique that trains only a tiny fraction (often under 1%) to make the model smart efficiently.
* Analogy: Nobody buys a new phone just for a winter vibe—you slap on a winter phone case (PEFT). Swap cases for different uses; swap light adapters for different tasks.
* Practice: Great value, but if the adapter is too small or data too scarce, the model may not absorb new knowledge—tuning learning rate and setup matters.
2. LoRA’s secret: two small matrices instead of one giant one
* Concept: A deep model is built from big tables of numbers; call the original one $W_0$. LoRA freezes $W_0$ and learns two small matrices $B$ and $A$. The key identity is $\Delta W = BA$.
* Analogy: Reprinting a 1,000-page encyclopedia ($W_0$) to fix a typo is wasteful. Instead, stick a small sticky note ($BA$) on the right page.
* Detail: $A$ compresses the information to its essentials; $B$ expands it back, so you train far fewer numbers than editing the whole matrix.
3. LoRA’s cockpit: rank $r$ and scale $\alpha$
* Concept: Two dials matter: rank $r$ and scale $\alpha$.
* Rank $r$ (how many side lanes): Think “size of the sticky note.” $r=8$ is an 8-lane bypass; $r=16$ is wider: smarter, but more memory.
* Scale $\alpha$ (how hard to blend new knowledge): An amplifier for how strongly $BA$ mixes with the frozen $W_0$.
* Tip: If VRAM is tight, lower $r$; if the model won’t fit your data, try raising $r$ a bit.
4. Fits Chapters 04 & 05—and a preview of Chapter 09
* Concept: Remember attention from Ch.04–05? LoRA often “sticks notes” on $Q$, $K$, $V$: wherever there is a linear weight matrix, you can attach LoRA.
* Preview: Even with LoRA, the big house ($W_0$) is still heavy in memory. Chapter 09 meets quantization plus LoRA, the QLoRA duo, to slim the backbone further.

Why it matters

Anyone can train a giant model on a spare GPU
Once you needed a datacenter to customize huge models. With LoRA, trainable parameters can drop to 1/100 or 1/1000—so startups and students can still adapt massive models to their domain (medicine, finance, law, …) on limited GPUs.
Chameleon models: the magic of multi-task deployment
Turning one model into a translator, coder, and chef used to mean storing three full copies. With PEFT you keep one shared brain (base) and only store light LoRA adapters per role—swap adapters, not the whole model, for far simpler ops.
Smarter forgetting: reduce catastrophic forgetting
Full fine-tuning can make the model “forget” its basics, like learning college math and losing the times tables. LoRA freezes $W_0$ and trains only the side path, so you can add new knowledge while keeping generalization safer.
Efficiency in the big picture
Chapters 01–03 gave you the skeleton (linear layers); this chapter answers how to add muscle efficiently. If you’ve grasped parameter-efficient updates, you’re ready for heavier optimization tricks.

How it is used

Step 1: Freeze the base, train LoRA only (training strategy)
1. Freeze the pretrained weights (lock them).
2. Attach empty LoRA modules at key joints (attention layers, etc.).
3. Train only those LoRA modules. It’s faster than full FT—if data is small, watch a validation set for overfitting.
Step 2: Merge for deployment (merge & inference)
After training, you can “print” what’s on the sticky note ($BA$) into the encyclopedia ($W_0$): a weight merge.
* $W_{\mathrm{merged}} = W_0 + BA$
Merged, the model doesn’t need two matmuls at inference, so you avoid extra latency.
Step 3: Pick your PEFT outfit
* Adapter: small networks tacked onto each layer.
* Prompt tuning: learn virtual tokens in front of the prompt.
* LoRA: softly edits weight matrices with few constraints—today’s default in industry for quality vs cost.
✅ Practical checklist
* Quality: if answers are weak, raise the rank ($r$), the scale ($\alpha$), or widen `target_modules`.
* Memory: shrink batch first; next chapter’s QLoRA (quantization + LoRA) is the next lever.
* Reproducibility: log which layers got LoRA and which $r$ you used, so you can rebuild runs later.
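A minimal sketch of such a run log; the field names here are invented for illustration (any structured format alongside your checkpoint works):

```python
import json

# Hypothetical run log: which modules got LoRA, and with what dials.
run = {
    "base_checkpoint": "my-backbone-v1",   # which pretrained model
    "lora_target_layers": ["W_Q", "W_V"],  # where adapters were attached
    "r": 8,                                # rank of the bottleneck
    "alpha": 16,                           # scale dial
    "learning_rate": 1e-4,
    "seed": 42,
}
record = json.dumps(run, indent=2)         # save next to the adapter weights
```

With this record you can rebuild the exact adapter placement later, which is what the checklist item is asking for.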

Summary

One-liner: freeze the huge pretrained $W_0$ and put new updates in the low-rank form $\Delta W = BA$ so trainable parameters drop dramatically.
Links: on the linear-layer and attention skeleton from Ch.01–03, LoRA and PEFT are the most economical, smart way to adapt the model to you.
Practice: balance rank ($r$), scale ($\alpha$), and target modules; if memory still hurts, hand off to Chapter 09’s QLoRA (quantization) for lighter optimization.

How to approach problems

LoRA/PEFT items pair a frozen $W_0$ with a low-rank $BA$ and ask about trainable counts, $r$, and $\alpha$. ViT patch drills (Chapter 05 review) use $(H/p) \times (W/p)$; one square $d \times d$ layer often has about $2dr$ LoRA parameters in the common factored form.
① Numeric patterns you may see (same shape as the bank)
LoRA trainable-parameter count (aggregate): for one layer with $d=64$, $r=4$, approximate $2dr = 512$

Patch grid (config): 8 patches along each side → $8 \times 8 = 64$ cells

Patch count (vote-style): input $32 \times 32$, patch $16 \times 16$, no CLS → $(32/16)^2 = 4$

Patch tokens (ensemble): $224 \times 224$, patch $16$, exclude CLS → $(224/16)^2 = 14^2 = 196$

Dense attention scale (ensemble): $N = 20$ tokens → $N^2 = 400$
Example (concept) — “Main goal of PEFT?”
② few extra parameters → 2.

Example (calc) — “$d=64$, $r=4$, approximate LoRA params $\approx 2dr$?” → $2 \cdot 64 \cdot 4 = 512$ → 512.

Example (T/F) — “Usually train all of $W_0$ in LoRA.” → Usually frozen → 0.

Example (application) — “Ship many tasks cheaply?”
② save PEFT adapters only → 2.

ViT review — $224 \times 224$, patch $16$, no CLS → $14^2 = 196$.
Definition — “LoRA always learns a full-rank $\Delta W$ directly.” → False (low-rank). 0

T/F — “Increasing $r$ usually increases LoRA trainable parameters.” → True. 1

Choice — “What should you combine with LoRA in Chapter 09 to save more VRAM?”
① quantization
② delete labels only → 1

Calc — “$d=32$, $r=2$, $2dr$?” → 128.