Ch.19

Autoencoder: Compress and Reconstruct

Like summarizing text on a sticky note, then rewriting it in full.

The encoder compresses the input $x$ to a latent bottleneck $z$; the decoder expands $z$ to $\hat{x}$. Lower reconstruction loss means outputs closer to the input.

[Diagram: Input $x$ → Encoder → Bottleneck $z$ → Decoder → $\hat{x}$ → Loss]

Training flow

  1. Input: feed $x$.
  2. Encoder: map $x$ to $z$.
  3. Bottleneck: the small $z$ summarizes the information.
  4. Decoder: map $z$ to $\hat{x}$.
  5. Loss: minimize the mismatch between $x$ and $\hat{x}$.
When images or other high-dimensional data $x$ are fed into the network, the model first encodes a compact summary code $z$ (the latent representation), then decodes it back to $\hat{x}$ with the same shape as the input; this architecture is an autoencoder. Training minimizes the reconstruction loss between $x$ and $\hat{x}$. This is classic unsupervised learning: there are no class labels; the data itself is the target.
A narrow bottleneck enables dimensionality reduction and anomaly detection. Chapter 18 (VAE) adds a probabilistic latent model for generation; here we build the compress–reconstruct foundation.
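As a concrete sketch of this compress-and-reconstruct loop, the block below assumes PyTorch and flattened 784-dimensional inputs scaled to [0, 1]; the layer widths and the bottleneck size k = 32 are illustrative choices, not values fixed by the text.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Encoder f_theta: x -> z, decoder g_phi: z -> x_hat, with a narrow bottleneck."""
    def __init__(self, d=784, k=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(d, 128), nn.ReLU(),
            nn.Linear(128, k),                # bottleneck z, with k << d
        )
        self.decoder = nn.Sequential(
            nn.Linear(k, 128), nn.ReLU(),
            nn.Linear(128, d), nn.Sigmoid(),  # outputs in [0, 1], matching the scaled input
        )

    def forward(self, x):
        z = self.encoder(x)                   # compress
        return self.decoder(z)                # reconstruct

model = AutoEncoder()
x = torch.rand(16, 784)                       # stand-in batch of 16 flattened 28x28 images
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)       # reconstruction loss between x and x_hat
```

Note that no labels appear anywhere: the input itself is the training target.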

Reading the formulas (autoencoder)

1. Encoder and decoder in one line
$z = f_\theta(x)$, $\hat{x} = g_\phi(z)$. Loss example: $\mathcal{L} = \|x - \hat{x}\|_2^2$.
- $z$: Latent code at the bottleneck
- $\hat{x}$: Reconstructed output
[Diagram: $x$ → Encoder → $z$ (bottleneck) → Decoder → $\hat{x}$; compare $x$ and $\hat{x}$ for the reconstruction loss]
In one line: $x$ is compressed to a narrow $z$, expanded to $\hat{x}$, and compared to $x$.
2. Bottleneck and compression
Input dimension $d$, latent dimension $k \ll d$: the compression ratio is about $k/d$.
- Smaller $k$: stronger compression (more information loss possible)
- Larger $k$: easier reconstruction, weaker summarization
3. Linear AE and PCA
With linear activations and an MSE loss, a linear autoencoder is closely related to projecting onto the principal directions, as in PCA (the exact correspondence depends on the data and constraints).
- Nonlinear activations allow richer representations
4. Practical tips
Match data scale; adjust bottleneck and depth; use DAE for robust features when needed.

Autoencoder: Compress and Reconstruct

1. Symmetric encoder–decoder structure
Concept: The encoder $f_\theta$ maps the input $x$ to a latent vector $z = f_\theta(x)$; the decoder $g_\phi$ maps $z$ to $\hat{x} = g_\phi(z)$. The dimension of $z$ is forced into a much smaller bottleneck than the original input.
Intuition: Like a witness describing a face to a sketch artist with a few traits ($z$) instead of every pixel; the decoder redraws the face from that summary.
2. Loss: how close is the reconstruction?
Concept: For continuous real-valued features, MSE $\frac{1}{d}\sum_i (x_i - \hat{x}_i)^2$ is typical; for $[0,1]$ grayscale images, BCE is also used.
Intuition: Like overlaying the original and the copy and scoring per-pixel mismatch.
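A small sketch of the two losses side by side (assuming PyTorch; `x` and `x_hat` are stand-in tensors with values in [0, 1], where `x_hat` would normally come from the decoder):

```python
import torch
import torch.nn.functional as F

x = torch.rand(8, 784)       # "original", scaled to [0, 1]
x_hat = torch.rand(8, 784)   # stand-in reconstruction

mse = F.mse_loss(x_hat, x)               # mean of (x_i - x_hat_i)^2, for real-valued features
bce = F.binary_cross_entropy(x_hat, x)   # per-pixel cross-entropy, for [0, 1] grayscale targets
print(mse.item(), bce.item())
```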
3. Why the bottleneck matters
If $z$ were as large as $x$, the network could trivially copy the input (identity). A narrow bottleneck forces the model to keep only real patterns in $z$.
Practice (anomaly detection): Train on normal images only; high reconstruction error on novel “abnormal” inputs flags defects.
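A sketch of that recipe, assuming the `model` from the earlier block has already been trained on normal data; the validation tensor, the new batch, and the 99th-percentile threshold are illustrative assumptions:

```python
import torch

def reconstruction_error(model, x):
    """Per-sample mean squared error between x and its reconstruction."""
    with torch.no_grad():
        x_hat = model(x)
    return ((x - x_hat) ** 2).mean(dim=1)

x_val_normal = torch.rand(256, 784)   # stand-in: held-out NORMAL samples
x_new = torch.rand(32, 784)           # stand-in: incoming samples to screen

# Pick a threshold from normal data only, e.g. the 99th percentile of its errors.
threshold = torch.quantile(reconstruction_error(model, x_val_normal), 0.99)

# Samples the autoencoder reconstructs poorly are flagged as anomalies.
is_anomaly = reconstruction_error(model, x_new) > threshold
```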
4. Denoising autoencoder (DAE)
Use: Add noise or masking, then train to recover the clean target. The model learns more robust features that ignore superficial corruption.
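One denoising training step might look like the following sketch (reusing the `model` above; Gaussian corruption with an illustrative noise level, and the clean `x` kept as the target):

```python
import torch
import torch.nn.functional as F

x = torch.rand(16, 784)                                      # clean batch, scaled to [0, 1]
x_noisy = (x + 0.3 * torch.randn_like(x)).clamp(0.0, 1.0)    # corrupt the INPUT only

x_hat = model(x_noisy)                                       # reconstruct from the corrupted version
loss = F.mse_loss(x_hat, x)                                  # ...but compare against the clean target
```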
5. What is the latent space?
Concept: The latent space is the low-dimensional vector space where the encoder's codes $z$ live, not the raw pixel/input space. Each sample becomes one point (a coordinate vector) in this space; after training, similar inputs often land nearby, while different patterns map farther apart, so the space can acquire geometric structure.
In an autoencoder: The bottleneck dimension $k$ is the latent space dimension. The decoder $g_\phi$ maps points in this space back to the high-dimensional $\hat{x}$. (Chapter 18 VAE adds a probability model on this space for sampling and generation.)
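To make "one point per sample" concrete, this small sketch (reusing the `model` and batch `x` assumed above) encodes a batch and inspects distances between latent codes; small distances indicate inputs the encoder treats as similar:

```python
import torch

with torch.no_grad():
    z = model.encoder(x)              # shape (batch, k): one k-dimensional point per sample

dist = torch.cdist(z, z)              # pairwise Euclidean distances in latent space
dist.fill_diagonal_(float("inf"))     # ignore each point's distance to itself
nearest = dist.argmin(dim=1)          # index of each sample's nearest latent neighbour
```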
6. What is PCA?
Concept: PCA (Principal Component Analysis) is a linear dimensionality-reduction method: it finds the directions in which the data variance is largest, in order, and uses them as orthogonal axes called principal components. Projecting the data onto the first few axes yields a low-dimensional summary that keeps as much variance as possible (the variance along the discarded axes is lost).
Versus autoencoders: PCA uses linear maps only; autoencoders with nonlinear activations can learn richer, curved structure. On complex data, AEs are often more flexible. (A linear AE trained with MSE connects to PCA intuition under certain conditions.)
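For contrast, the PCA baseline is a few lines with scikit-learn (a sketch; `X` is a stand-in data matrix of shape (n_samples, d), and 32 components mirror the bottleneck size used earlier):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 784)                 # stand-in data matrix

pca = PCA(n_components=32)                    # linear "encoder": top 32 principal axes
Z = pca.fit_transform(X)                      # low-dimensional codes, analogous to z
X_hat = pca.inverse_transform(Z)              # linear "decoder": reconstruct from the projection

print(pca.explained_variance_ratio_.sum())    # fraction of variance kept by the 32 axes
print(((X - X_hat) ** 2).mean())              # reconstruction error of the purely linear model
```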

Why it matters

Beyond PCA: powerful dimensionality reduction
PCA, as described above, is essentially linear dimensionality reduction. Autoencoders, by contrast, use nonlinear activations, so they can compress high-dimensional data more flexibly, for example down to 2–3 dimensions for visualization.
Unsupervised feature learning
Labeling is expensive. An AE can extract features $z$ from raw data alone; a pretrained encoder is a strong starting point for transfer learning into classifiers.
Gateway to generative AI
Beyond compression, tweaking the latent $z$ to synthesize new faces or images leads to VAEs and GANs.

How it is used

Step 1: Normalize and scale
Map image pixels from 0–255 to $[0,1]$ with min–max scaling, or standardize per channel. Keep the RGB channel order $(R, G, B)$ fixed and apply the same preprocessing to every batch. Inconsistent scaling changes the MSE gradients and can slow or destabilize training.
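A minimal sketch of Step 1, assuming NumPy arrays of uint8 pixels in (R, G, B) order; the per-channel statistics are computed once from training data and reused for every batch:

```python
import numpy as np

# Stand-in batch: uint8 images of shape (N, H, W, 3), channel order (R, G, B)
images = np.random.randint(0, 256, size=(64, 32, 32, 3), dtype=np.uint8)

# Option A: min-max scaling of 0-255 pixels into [0, 1]
x_minmax = images.astype(np.float32) / 255.0

# Option B: per-channel standardization with training-set statistics
mean = x_minmax.mean(axis=(0, 1, 2))           # one mean per channel
std = x_minmax.std(axis=(0, 1, 2)) + 1e-8      # one std per channel
x_standardized = (x_minmax - mean) / std       # apply the SAME mean/std to every batch
```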
Step 2: Architecture, bottleneck kkk, and loss
Images: prefer a convolutional AE (CAE) to preserve locality. Vectors or sequences: use 1D convolutions or fully connected stacks. Bottleneck $k$: smaller $k$ → stronger compression but more detail loss; larger $k$ → easier reconstruction but weaker summarization; pick $k$ with the validation loss. Outputs in $\mathbb{R}$ → MSE; $[0,1]$-like grayscale → consider BCE.
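A sketch of a small convolutional AE for 1×28×28 images (assuming PyTorch; the channel counts and the bottleneck k are illustrative and would be tuned against validation loss as described above):

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    def __init__(self, k=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, k),                              # bottleneck z
        )
        self.decoder = nn.Sequential(
            nn.Linear(k, 32 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),   # 7 -> 14
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(), # 14 -> 28
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x_img = torch.rand(8, 1, 28, 28)
print(ConvAE()(x_img).shape)   # torch.Size([8, 1, 28, 28]): same shape as the input
```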
Step 3: Training loop, output activation, stability
Backpropagate the MSE or BCE loss for each minibatch. For $[0,1]$ targets, put a sigmoid on the decoder's last layer. Use Adam (or similar), a learning-rate schedule, and gradient clipping if needed. Split into train/validation sets; if validation loss worsens, try early stopping, dropout/weight decay, or a denoising AE.
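A skeleton of that loop under stated assumptions: `model`, `train_loader`, and `val_loader` already exist, targets are in [0, 1], and the optimizer settings and patience are illustrative:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
best_val, patience, bad_epochs = float("inf"), 5, 0

for epoch in range(100):
    model.train()
    for x_batch, _ in train_loader:                     # labels are ignored: unsupervised
        x_hat = model(x_batch)
        loss = F.mse_loss(x_hat, x_batch)                # or F.binary_cross_entropy for [0, 1] targets
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # optional stabilizer
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = sum(F.mse_loss(model(x), x).item() for x, _ in val_loader) / len(val_loader)

    if val_loss < best_val:                              # track the best validation loss
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                       # early stopping
            break
```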
Step 4: Evaluation, plots, downstream
Do not rely on the loss curve alone; inspect $\hat{x}$ directly. Project the latent $z$ to 2D (e.g., with t-SNE) to see structure or outliers. For anomaly detection, train on normal data only and set a reconstruction-error threshold on a validation set. Freeze or fine-tune the encoder for few-label classification or clustering.
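For the visual check on the latent codes, a sketch with scikit-learn's t-SNE (assuming a trained `model` and a few hundred encoded samples; the perplexity value is an illustrative default):

```python
import torch
from sklearn.manifold import TSNE

x_sample = torch.rand(500, 784)                  # stand-in: a few hundred inputs to inspect
with torch.no_grad():
    z = model.encoder(x_sample).numpy()          # latent codes, shape (500, k)

z_2d = TSNE(n_components=2, perplexity=30).fit_transform(z)   # (500, 2) coordinates
# Scatter-plot z_2d (e.g. with matplotlib) and look for clusters or isolated outliers.
```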
Uses at a glance
  • Anomaly detection: train on normal data only → flag high reconstruction error
  • Denoising: DAE maps corrupted input → clean target
  • Dimensionality reduction / visualization: small $z$ or a 2D projection of $z$
  • Pretraining: reuse the encoder as a front end for transfer

Summary

One-liner: The encoder squeezes data through a narrow bottleneck $z$; the decoder maps it back to $\hat{x}$; training minimizes the reconstruction error so the network discovers salient structure.
Links: Combine Dense and CNN blocks for encoder/decoder; CAEs help on complex spatial data.
Next (Chapter 18): VAE places a probability distribution on zzz for generation.

Problem-solving notes

Autoencoder items are easiest if you keep the one-liner $z = f_\theta(x)$, $\hat{x} = g_\phi(z)$ in mind, with the goal of a reconstruction loss matching $x$ to $\hat{x}$. At the bottleneck, usually $k \ll d$. For one fully connected layer $d \to k$, count about $d \cdot k$ weights plus $k$ biases. The flattened image length is height × width (× 3 for RGB); the patch count (without CLS) is $(H/p) \times (W/p)$, the same line of reasoning as the ViT patch grid (Chapter 5 review).
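The counting in this note, as a quick check (the numbers 784, 32, 224, and 16 are illustrative):

```python
d, k = 784, 32                 # flattened 28x28 grayscale input, bottleneck size
params = d * k + k             # one fully connected layer d -> k: weights plus biases
print(params)                  # 25120

H, W, p = 224, 224, 16         # image size and patch size, ViT-style
patches = (H // p) * (W // p)  # 14 * 14 = 196 patches (no CLS token)
print(patches)
```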
Anomaly detection: train the reconstruction on normal data only, then flag samples with large reconstruction error. A denoising AE maps corrupted inputs toward clean targets for robust features. Use MSE for real-valued pixels; BCE is common for $[0,1]$ grayscale. When $k/d$ or a percentage appears, align the numerator and denominator carefully.
A convolutional AE stacks CNN encoder/decoder blocks to keep local structure (Chapter 12). If $k$ is too large, the net can approach an identity copy; questions often test the compression vs. expressivity trade-off when shrinking $k$.
The next chapter's VAE puts a probability model on the latent $z$ for generation. If the stem says probabilistic latent or sampling/generation, think VAE.