Ch.09

QLoRA and Quantization: Tuning When Smaller

Quant flow: FP16 → INT8 → FP16

Map high-precision values to INT8 with scale α, then divide back when computing.

FP16 input → max(abs) → compute α → INT8 quant → de-quant → check restore

Step by step

  1. FP16: original float vector.
  2. max(abs): pick the scale reference.
  3. α: scale factor for the range.
  4. INT8: multiply, round, store as integers.
  5. De-quant: divide by α to restore floats.
  6. QLoRA: compressed backbone, with only LoRA trained.

QLoRA & quantization

Compress the frame (4-bit); keep custom LoRA furniture sharp.

Use the de-quantized $W_0^{(q)}$ for the backbone; only the LoRA matrices $A, B$ learn.

load 4-bit → freeze backbone → train LoRA → forward · save

QLoRA training flow

  1. Load pretrained LLM in 4-bit.
  2. Keep quantized backbone fixed.
  3. Train LoRA in FP16/BF16 only.
  4. Save adapters · deploy.
In Chapter 08, LoRA made what you newly train thin, but the huge pretrained weights $W_0$ still dominate memory. It is like trying to redecorate a mansion: even if you only change the furniture, the building itself is still enormous.
This chapter adds quantization to lighten the building and QLoRA, which combines that with LoRA: a compressed, frozen backbone plus sharp LoRA notes you actually tune.
Analogy: Like turning a huge 4K film into phone-friendly MP4—the story stays, the file shrinks. Big language models become easier to adapt on everyday hardware.

Reading the formulas (quantization · QLoRA)

1. Bit width & memory — why compress?
One-line summary: How many bits you use to store each weight changes both file size and GPU VRAM.
A deep model holds billions of numbers (weights). Keeping them in FP32 (32-bit) is accurate but heavy. Quantization means: store shorter integer codes (INT8, 4-bit, …) and restore near-float values only when you actually compute.
Analogy: Like choosing RAW (huge) vs JPEG (small) for a photo—the scene looks similar, but storage shrinks.
| Format | ~bytes/weight | Plain words |
|---|---|---|
| FP32 | 4 | full precision |
| FP16 | 2 | half size |
| INT8 | 1 | 8-bit integer box |
| INT4 | 0.5 | 4-bit extreme diet |
Read the table row by row: fewer bits → fewer bytes per weight. For teaching, “32-bit → 4-bit” is roughly 32÷4 = 8 → about 8× lighter. Bank items like 32/4 and 16/4 use this same idea.
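The bytes-per-weight column can be turned into a rough memory estimate. The sketch below is a minimal illustration, assuming a 7B-parameter model (the size used elsewhere in this chapter) and counting weights only; activations, optimizer state, and the per-block scale values are deliberately ignored. The function names are illustrative, not part of any library.

```python
# Bytes per weight, copied from the format table in this section.
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[fmt] / 1e9


def teaching_ratio(src: str, dst: str) -> float:
    """How many times lighter `dst` is than `src`, by bit width only."""
    return BYTES_PER_WEIGHT[src] / BYTES_PER_WEIGHT[dst]


NUM_PARAMS = 7e9  # assumption: a 7B-class model

for fmt in BYTES_PER_WEIGHT:
    print(f"{fmt}: {weight_memory_gb(NUM_PARAMS, fmt):.1f} GB")

print(teaching_ratio("FP32", "INT4"))  # 8.0
print(teaching_ratio("FP16", "INT4"))  # 4.0
```

Note that the 8× figure applies to FP32 → INT4; starting from FP16 the same arithmetic gives 4×.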
QLoRA link: Chapter 08 LoRA shrinks what you train, but $W_0$ is still large. QLoRA compresses $W_0$ itself (often 4-bit), then tunes only LoRA in high precision.
[Figure: one QLoRA layer. Top branch: frozen 4-bit backbone, $W_0^{(q)} x$. Bottom branch: trainable LoRA, $BAx$. The two branches are added to give $h$.]
How to read the figure: Input $x$ (left) flows through the top gray box (frozen compressed backbone) and bottom violet box (trainable LoRA), meets + in the middle, and becomes output $h$ (right). Gray = 4-bit · frozen; violet = what you train.
2. INT8 — split storage and compute
One line: store short integers; divide by α right before you multiply.
Plain words: keep numbers in small boxes, unfold to wider decimals only for real math.
Four steps:
(1) max absolute value: 5.4
(2) scale α = 127 ÷ 5.4 ≈ 23.5, so the peak maps to 127
(3) multiply and round: 1.2 × 23.5 ≈ 28.2 → 28
(4) restore: 28 ÷ 23.5 ≈ 1.2
$$q_i = \mathrm{round}(x_i \cdot \alpha), \quad \hat{x}_i = q_i / \alpha$$
Mnemonic: multiply & round → store, divide → restore.
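The four steps can be sketched in a few lines of plain Python. This is a minimal illustration using the chapter's example values (1.2, -0.5, 5.4); the helper names, the clipping to ±127, and the guard for an all-zero input are assumptions added for the sketch, not something the chapter prescribes.

```python
def absmax_scale(values, qmax=127):
    """Pick alpha so that the largest absolute value maps to qmax."""
    peak = max(abs(v) for v in values)
    return qmax / peak if peak > 0 else 1.0  # guard: all-zero input (assumption)


def quantize(values, alpha, qmax=127):
    """Multiply, round, and clip into the symmetric INT8 range."""
    return [max(-qmax, min(qmax, round(v * alpha))) for v in values]


def dequantize(codes, alpha):
    """Divide by alpha to get approximate floats back."""
    return [q / alpha for q in codes]


values = [1.2, -0.5, 5.4]
alpha = absmax_scale(values)        # 127 / 5.4, about 23.5
codes = quantize(values, alpha)
restored = dequantize(codes, alpha)

print(codes)  # [28, -12, 127]
print([round(v, 2) for v in restored])
```

The restored values are close to the originals but not identical: rounding in the store step is where the information is lost.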
3. One QLoRA layer — wide path (book) + narrow path (LoRA)
Input $x$ splits: ① frozen compressed book + ② LoRA → add for $h$.
Analogy: top gray = compressed encyclopedia (unfold only when computing); bottom purple = your note (only this learns).
$$h \approx W_0^{(q)} x + \frac{\alpha_{\mathrm{lora}}}{r} B A x$$
Plain reading:
- $W_0^{(q)} x$: pretrained reaction (restore before multiply)
- $B A x$: your add-on (LoRA only trains)
- $\alpha_{\mathrm{lora}}/r$: how strong LoRA mixes in
Wrap-up: answer = book part + LoRA part. Book fixed, LoRA tuned.
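To make the two paths concrete, here is a toy forward pass in plain Python. It is only a sketch under stated assumptions: the backbone is stored as INT8-style integer codes with a single scale (real QLoRA uses 4-bit block-wise codes), the matrices are tiny made-up values, and `qlora_forward` is a hypothetical helper name.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dequantize(codes, alpha):
    """Restore approximate float weights from integer codes."""
    return [[q / alpha for q in row] for row in codes]


def qlora_forward(codes, alpha, A, B, x, alpha_lora):
    """h ≈ W0_deq x + (alpha_lora / r) * B (A x)."""
    r = len(A)                               # LoRA rank = number of rows of A
    base = matvec(dequantize(codes, alpha), x)   # frozen "book" path
    update = matvec(B, matvec(A, x))             # trainable "note" path
    return [b + (alpha_lora / r) * u for b, u in zip(base, update)]


codes = [[28, -12], [127, 0]]   # stored integers (toy values)
alpha = 127 / 5.4               # quantization scale from the INT8 example
A = [[1.0, 1.0]]                # r x d_in  (r = 1)
B = [[0.1], [-0.2]]             # d_out x r
x = [1.0, 1.0]

h = qlora_forward(codes, alpha, A, B, x, alpha_lora=2.0)
print([round(v, 4) for v in h])
```

Setting every entry of `B` to zero makes the LoRA path vanish, so the output is then just the de-quantized backbone's answer.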
4. More keywords — NF4 · double quantization · structure
NF4 (NormalFloat 4-bit)
Weights often cluster near zero. NF4 uses tighter codes where values are dense and sparser codes in the tails—often less quality loss than a naive 4-bit grid. In quizzes, remember distribution-aware 4-bit.
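The idea of "tighter codes where values are dense" can be illustrated by comparing an evenly spaced 16-level grid with a grid built from quantiles of a standard normal distribution. This is an illustration of the principle only: the levels computed below are not the actual NF4 code table, and the construction (bin midpoints in probability, rescaled to [-1, 1]) is an assumption chosen for simplicity.

```python
from statistics import NormalDist


def uniform_levels(bits=4):
    """Evenly spaced levels on [-1, 1] (a naive 4-bit grid)."""
    n = 2 ** bits
    return [-1 + 2 * i / (n - 1) for i in range(n)]


def quantile_levels(bits=4):
    """Levels at normal quantiles: dense near zero, sparse in the tails.

    Illustrative only; NOT the real NF4 table.
    """
    n = 2 ** bits
    normal = NormalDist()  # mean 0, standard deviation 1
    raw = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    peak = max(abs(v) for v in raw)
    return [v / peak for v in raw]


def gaps(levels):
    """Spacing between neighbouring levels."""
    return [b - a for a, b in zip(levels, levels[1:])]


q_gaps = gaps(quantile_levels())
u_step = gaps(uniform_levels())[0]

# Central gap is narrower than the uniform step; outermost gap is wider.
print(q_gaps[7] < u_step < q_gaps[0])  # True
```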
Double quantization
Each block has its own α. With many blocks, the α table itself grows, so double quantization compresses α again. Small per-block savings add up.
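The saving can be estimated with simple arithmetic. The sketch below assumes one 32-bit α per block of 64 weights, and for double quantization assumes the α values are themselves stored in 8 bits with one 32-bit second-level scale per 256 of them. These block sizes match commonly cited QLoRA settings but are assumptions here, as are the function names.

```python
def plain_overhead_bits(block_size, scale_bits):
    """Extra bits per weight when every block stores one scale."""
    return scale_bits / block_size


def double_quant_overhead_bits(block_size, scale_bits,
                               outer_block_size, outer_scale_bits):
    """Extra bits per weight when the scales are quantized again."""
    first_level = scale_bits / block_size
    second_level = outer_scale_bits / (block_size * outer_block_size)
    return first_level + second_level


plain = plain_overhead_bits(block_size=64, scale_bits=32)
double = double_quant_overhead_bits(block_size=64, scale_bits=8,
                                    outer_block_size=256, outer_scale_bits=32)

NUM_PARAMS = 7e9  # assumption: a 7B-class model
saved_gb = (plain - double) * NUM_PARAMS / 8 / 1e9

print(plain)                 # 0.5
print(round(double, 3))      # 0.127
print(round(saved_gb, 2))    # 0.33
```

Under these assumptions the scale table drops from about 0.5 to about 0.127 bits per weight, which is small per block but adds up to roughly a third of a gigabyte at 7B parameters.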
Structure (backbone + adapters)
Picture one shared compressed 7B backbone and small LoRA files per task (medical, legal, FAQ, …). You do not need a full copy of the giant model each time—swap thin adapters instead.
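A quick parameter count shows why adapters are "thin". The sketch assumes a single square 4096 × 4096 weight matrix and rank r = 8; both numbers are hypothetical and chosen only to make the arithmetic easy, with A of shape r × d_in and B of shape d_out × r as in the layer formula.

```python
def full_params(d_in, d_out):
    """Weights in one dense layer."""
    return d_in * d_out


def lora_params(d_in, d_out, r):
    """Weights in the LoRA pair: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


d_in = d_out = 4096   # hypothetical layer size
r = 8                 # hypothetical LoRA rank

dense = full_params(d_in, d_out)
adapter = lora_params(d_in, d_out, r)

print(dense)            # 16777216
print(adapter)          # 65536
print(dense // adapter) # 256
```

For this layer the adapter holds 1/256 of the weights of the dense matrix, which is why keeping one adapter per task is far cheaper than keeping one model copy per task.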

QLoRA & quantization: handle giant models lightly

1. What is quantization? — big numbers into small boxes
Model weights are usually decimals. FP32/FP16 = many-digit precise decimals; INT8/4-bit = short integer slots.
Quantization = store short, restore similar decimals when computing. Not “integers only forever.”
Analogy: RAW → JPEG—¼~⅛ the size, shape mostly the same.
Tiny example: store `14` instead of `0.0137`, restore when needed.
2. How quantization shrinks numbers — set a ruler (scale)
First find the largest number in the group (largest absolute value). Example: among `1.2`, `-0.5`, `5.4` → use 5.4.
In plain words: pick scale α so that the peak fits integer 127. Here α = 127 ÷ 5.4 ≈ 23.5.
Multiply each value by α, round → store 28 for 1.2. Later 28 ÷ 23.5 ≈ 1.2 to restore (de-quantization).
One line: multiply & round to store → divide to restore. Bank values 28, 127, α≈24 follow this.
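Restoring 28 ÷ 23.5 gives about 1.19 rather than exactly 1.2, so it helps to know how large that gap can get. Because rounding moves a scaled value by at most 0.5, the restored value should differ from the original by at most 0.5 ÷ α. The sketch below checks this numerically over a grid of inputs; the grid and helper names are assumptions made for the illustration.

```python
def quantize(x, alpha):
    """Multiply and round to the nearest integer code."""
    return round(x * alpha)


def dequantize(q, alpha):
    """Divide by alpha to restore an approximate float."""
    return q / alpha


alpha = 127 / 5.4                                # scale from the example
samples = [i / 100 for i in range(-540, 541)]    # -5.40 ... 5.40

worst_error = max(abs(x - dequantize(quantize(x, alpha), alpha))
                  for x in samples)
half_step = 0.5 / alpha                          # theoretical bound

print(round(half_step, 4))                       # 0.0213
print(worst_error <= half_step + 1e-9)           # True
```

A larger α (a smaller peak value in the group) means finer steps and a smaller worst-case error, which is one reason scales are chosen per block rather than for the whole model.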
3. NF4 — smarter 4-bit
Weights often cluster near zero. NF4 uses tighter bins where values are common—remember distribution-aware 4-bit.
Double quantization also compresses the α ruler list when there are thousands of blocks.
Details in the formula guide. Here: store small, restore when using + NF4 = smarter 4-bit.
4. QLoRA — quantization + LoRA together
Chapter 08 LoRA: keep the big book ($W_0$) fixed, train only a small note (LoRA).
QLoRA adds: make the book light too.
- $W_0$: 4-bit compressed, frozen
- LoRA: train in clear numbers (FP16/BF16) for your data
Analogy: encyclopedia on microfilm, your note in sharp ink. Unfold a clear page only when you read.
One line: compressed book answer + LoRA add-on. Formulas in the formula guide.

Why it matters

1. More people can tune big models
Full fine-tuning of multi-billion-parameter models is memory-heavy, yet more people want models that fit their data and tasks. QLoRA compresses and freezes the backbone and trains adapters only, making custom tuning reachable on ordinary GPUs.
Chapter 08 LoRA cut trainable size; QLoRA also shrinks the frozen backbone itself.
2. NF4 & double quantization
Naive 4-bit can shake quality. NF4 is distribution-aware 4-bit—often less quality loss than a blunt 4-bit grid. Double quantization shrinks α tables too; across thousands of blocks, small savings accumulate.
3. One backbone, many adapters
A compressed backbone plus small LoRA files lets you aim the same large model at different tasks without storing a full 7B copy each time—swap thin adapter files instead.
Chapter 08 taught swapping post-it notes (LoRA); this chapter also lightens the encyclopedia on the shelf. Together they form QLoRA.
4. Why split the picture — less confusion
Memory: compressed frozen book + trainable LoRA + batch size & sequence length matter too.
Compute: store small, restore before multiply. Habit: small storage, wider math.
Training: book fixed, LoRA only changes. If fit is weak, check data, learning rate, rank, 4-bit vs 8-bit first.

How it is used

Step 1: Bring in a compressed backbone
To start QLoRA, you first load a strong pretrained model $W_0$ in a light storage form. Weights sit on disk and in VRAM as short integer codes (often 4-bit / NF4), and only when the network actually multiplies matrices are they briefly restored to something like FP16/BF16 for stable math.
Analogy: Instead of shelving the full encyclopedia, you keep a microfilm copy and only unfold a sharp page when you read. That is why this chapter’s first habit is “store small, compute in a wider dtype.”
Step 2: Add LoRA and fine-tune
Freeze $W_0$. The 4-bit-packed part usually does not learn; attach small LoRA on needed layers and train only LoRA in FP16/BF16.
Only LoRA changes. Each answer ≈ compressed book + LoRA add-on—same Chapter 08 picture, but the book is light too.
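"Book fixed, LoRA learns" can be shown with a toy training loop in plain Python. This is a hand-written sketch, not how a real framework does it: the matrices are tiny made-up values, the loss is a simple squared error, the gradients are derived by hand for that loss, and `train_step` is a hypothetical helper. The point is only that the update touches `A` and `B` and never `W0`.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def train_step(W0, A, B, x, y, scale, lr):
    """One gradient step on A and B only; W0 is read but never written."""
    u = matvec(A, x)                                   # A x, length r
    h = [b + scale * d for b, d in zip(matvec(W0, x), matvec(B, u))]
    err = [hi - yi for hi, yi in zip(h, y)]
    loss = 0.5 * sum(e * e for e in err)

    # Gradients of the squared-error loss with respect to B and A.
    grad_B = [[scale * e * uk for uk in u] for e in err]
    Bt_err = [sum(B[i][k] * err[i] for i in range(len(B)))
              for k in range(len(u))]
    grad_A = [[scale * g * xj for xj in x] for g in Bt_err]

    new_A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    new_B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    return new_A, new_B, loss


W0 = [[1.0, 0.0], [0.0, 1.0]]   # stands in for the de-quantized frozen backbone
A = [[0.5, -0.5]]               # r x d_in, small non-zero start
B = [[0.0], [0.0]]              # d_out x r, zero start so LoRA adds nothing at first
x, y = [1.0, 2.0], [2.0, 1.0]   # one toy example and its target

losses = []
for _ in range(50):
    A, B, loss = train_step(W0, A, B, x, y, scale=1.0, lr=0.1)
    losses.append(loss)

print(losses[0])                 # 1.0
print(losses[-1] < losses[0])    # True
```

In a real setup the same separation is achieved by marking backbone parameters as non-trainable and passing only adapter parameters to the optimizer.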
Step 3: Tune data and dials
Good adaptation needs both what you teach (data) and how hard / how thin you train (hyperparameters). Prepare examples that match the task (instructions, dialogue, FAQs), then adjust learning rate, epochs, LoRA rank (adapter thickness), and whether you stay at 4-bit or relax toward 8-bit.
If VRAM is still tight, people often shrink batch size or sequence length first. If answers look weak, it is more natural to revisit data, rank, and learning rate than to unfreeze the whole backbone right away.
Step 4: What you keep afterward
After training you typically keep one shared compressed backbone and small LoRA files per task. For the same 7B-class model, you can picture swapping only the adapter for medical vs legal vs in-house FAQ use.
Quantization flow: max → pick α → multiply, round, store → divide to restore. QLoRA stacks frozen 4-bit $W_0$ + high-precision LoRA only on top. In one line: carry the big book once, in a light form, and write many thin notes (adapters) for different needs.

Summary

In one sentence: freeze a light 4-bit book ($W_0$), tune only a small LoRA: that is QLoRA.
Ch.08 → 09: LoRA shrinks what you train; quantization shrinks the book size.
Four ideas (plain)
1. Quantization — store as short integers, ÷α restore when using
2. NF4 & double quant — smarter 4-bit + compress α tables
3. QLoRA — light book + LoRA · book fixed, LoRA learns
4. Structure — one book, swap LoRA per task
Picture: store small · restore before compute · train LoRA only.
Traps: fine-tune whole 4-bit book (×) · INT4 = half FP16 (×; 16÷4 = 4, so about a quarter of FP16, while the teaching ~8× is FP32 → INT4).

How to approach the problems

For quantization · QLoRA, start with “frozen compressed backbone + LoRA adapter.” Chapter 08 LoRA shrinks what you train; QLoRA also shrinks $W_0$. On numeric items, link bit ratios (32÷4 = 8) and ÷α restore to steps that produce values like 28, 127, 24.
Below are bank-shaped examples—begin with concept and true/false.

Example (concept)
“Which is closest to QLoRA?
① Retrain all of $W_0$ in FP32
② 4-bit compressed backbone + high-precision LoRA only
③ Delete labels”
→ Answer 2
Why? The backbone stays frozen; only adapters adapt.

Example (true/false · ox)
“In QLoRA you usually fine-tune the entire 4-bit $W_0$ in FP16.”
→ Usually frozen → Answer 0
① Numeric styles you may see in a session (same shape as the bank)
Example (bit ratio · vote): FP32 (32-bit) → INT4 (4-bit) teaching factor? → $32 \div 4 = 8$
Example (INT8 round · vote): with $\alpha \approx 23.5$, $\mathrm{round}(1.2 \times 23.5)$ → 28
Example (quant scale · aggregate): max abs $5.4$ → $\alpha \approx \mathrm{round}(127/5.4) = 24$
Example (power · config): $2^4$ is closest to? → 16
Example (scenario)
“When LoRA alone still runs out of VRAM, the next lever?
① Drop labels
② Compress the backbone with 4-bit QLoRA
③ Remove attention”
→ Answer 2
Example (NF4 · concept)
“Best description of NF4?
① RNN activation
② 4-bit, distribution-aware codes
③ Optimizer name”
→ Answer 2
Example (double quant · concept)
“Benefit of double quantization?
① Auto higher LoRA rank
② Extra savings on α tables
③ Auto larger batch”
→ Answer 2
Example (O/X · memory)
“INT4 uses about half the memory of FP16 (teaching).”
→ About a quarter, not half (16÷4 = 4×) → Answer 0 (the teaching ~8× gap is FP32 → INT4)
Example (O/X · QLoRA)
“QLoRA combines quantization and LoRA.”
→ Answer 1
Example (choice · NF4 bits)
“How many bits is NF4 closest to?”
→ 4
Example (choice · INT8)
“Common max positive INT8 code?”
→ 127