Ch.19
Autoencoder: Compress and Reconstruct
Like summarizing text on a sticky note, then rewriting it in full.
The encoder compresses the input x into a latent bottleneck z; the decoder expands z back into a reconstruction x̂. Lower reconstruction loss means the output x̂ is closer to the input x.
Input x → Encoder → Bottleneck z → Decoder → x̂ → Loss
Training flow
- ① Input: feed x.
- ② Encoder: map x to z.
- ③ Bottleneck: a small z summarizes the information.
- ④ Decoder: map z to x̂.
- ⑤ Loss: minimize the mismatch between x and x̂.
Feeding images or other high-dimensional data x into a network, the model first encodes a compact summary code z (the latent representation), then decodes it back to an output x̂ with the same shape as x; this is an autoencoder. Training minimizes the reconstruction loss between x and x̂. This is classic unsupervised learning: there are no class labels; the data itself is the target.
A narrow bottleneck enables dimensionality reduction and anomaly detection. Chapter 18 (VAE) adds a probabilistic latent model for generation; here we build the compress–reconstruct foundation.
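To make the compress-then-reconstruct loop concrete, here is a minimal sketch of a fully connected autoencoder in PyTorch. The layer sizes (784 → 32 → 784), the ReLU/sigmoid choices, and the dummy batch are illustrative assumptions, not values from the text.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Fully connected autoencoder: x -> z (bottleneck) -> x_hat."""
    def __init__(self, d_in=784, d_latent=32):
        super().__init__()
        # Encoder: compress the input x down to the latent code z
        self.encoder = nn.Sequential(
            nn.Linear(d_in, 128), nn.ReLU(),
            nn.Linear(128, d_latent),
        )
        # Decoder: expand z back to a reconstruction x_hat with the same shape as x
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, 128), nn.ReLU(),
            nn.Linear(128, d_in), nn.Sigmoid(),  # sigmoid for [0, 1]-scaled pixels
        )

    def forward(self, x):
        z = self.encoder(x)       # bottleneck code
        x_hat = self.decoder(z)   # reconstruction
        return x_hat, z

model = AutoEncoder()
x = torch.rand(16, 784)                       # dummy batch of flattened 28x28 images
x_hat, z = model(x)
loss = nn.functional.mse_loss(x_hat, x)       # reconstruction loss between x_hat and x
```

The only target the loss ever sees is the input itself, which is exactly what makes this unsupervised.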
Reading the formulas (autoencoder)
1. Encoder and decoder in one line
z = f(x), x̂ = g(z). Loss example: L(x, x̂) = ‖x − x̂‖² (MSE).
- z: latent code at the bottleneck
- x̂: reconstructed output
2. Bottleneck and compression
Input dimension d, latent dimension k: the bottleneck keeps roughly k/d of the original dimensions, i.e. a compression factor of about d/k (a short numeric sketch follows this list).
- Smaller k: stronger compression (more information loss possible)
- Larger k: easier reconstruction, weaker summarization
3. Linear AE and PCA
With linear activations and MSE loss, the learned solution is closely linked to PCA's principal directions (the exact correspondence depends on the data and constraints).
- Nonlinear activations allow richer representations
4. Practical tips
Match the data scale; adjust the bottleneck size k and the network depth; use a denoising AE (DAE) for more robust features when needed.
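As a quick numeric illustration of items 1 and 2 above, the toy values below (d = 784, k = 32, and a fake reconstruction) are assumptions chosen only to show the arithmetic.

```python
import numpy as np

d, k = 784, 32                            # assumed input and latent dimensions
print(f"keeps {k/d:.1%} of the dimensions, roughly {d/k:.1f}x compression")

x = np.random.rand(d)                     # dummy input
x_hat = x + 0.05 * np.random.randn(d)     # pretend reconstruction with small error
mse = np.mean((x - x_hat) ** 2)           # reconstruction loss L(x, x_hat)
print(f"MSE = {mse:.5f}")
```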
Autoencoder: Compress and Reconstruct
1. Symmetric encoder–decoder structure
Concept: The encoder maps the input x to a latent vector z; the decoder maps z back to a reconstruction x̂. The dimension of z is forced into a much smaller bottleneck than the original input.
Intuition: Like a witness describing a face to a sketch artist with a few traits (z) instead of every pixel; the decoder redraws the face from that summary.
2. Loss: how close is the reconstruction?
Concept: For continuous real-valued features, MSE is typical; for pixels scaled to [0, 1] (e.g., grayscale images), BCE is also used.
Intuition: Like overlaying the original and the copy and scoring per-pixel mismatch.
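A small sketch of the two common reconstruction losses in PyTorch; it assumes the reconstructions already lie in [0, 1] (e.g., after a sigmoid output), which BCE requires. The tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

x = torch.rand(8, 784)       # targets in [0, 1]
x_hat = torch.rand(8, 784)   # reconstructions in [0, 1] (e.g., after a sigmoid)

mse = F.mse_loss(x_hat, x)                 # typical for real-valued features
bce = F.binary_cross_entropy(x_hat, x)     # common for [0, 1]-scaled grayscale pixels
```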
3. Why the bottleneck matters
If z were as large as x, the network could trivially copy the input (an identity map). A narrow bottleneck forces the model to keep only the real patterns in z.
Practice (anomaly detection): Train on normal images only; high reconstruction error on novel “abnormal” inputs flags defects.
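One way to turn reconstruction error into a defect flag, as a sketch: it assumes an autoencoder like the earlier fully connected sketch (returning x_hat and z), hypothetical tensors x_val_normal and x_new, and a 95th-percentile cutoff chosen only for illustration.

```python
import torch

def reconstruction_errors(model, x):
    """Per-sample mean squared reconstruction error."""
    with torch.no_grad():
        x_hat, _ = model(x)
    return ((x - x_hat) ** 2).mean(dim=1)

# Calibrate a threshold on held-out *normal* data only
errs_normal = reconstruction_errors(model, x_val_normal)
threshold = torch.quantile(errs_normal, 0.95)       # assumed 95th-percentile cutoff

# Flag new samples whose error exceeds the threshold as anomalies
errs_new = reconstruction_errors(model, x_new)
is_anomaly = errs_new > threshold
```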
4. Denoising autoencoder (DAE)
Use: Add noise or masking, then train to recover the clean target. The model learns more robust features that ignore superficial corruption.
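A denoising variant changes only what is fed to the encoder; the loss still targets the clean input. In the sketch below the Gaussian noise level (0.2) and the reuse of the earlier model and batch x are assumptions.

```python
import torch
import torch.nn.functional as F

noise_std = 0.2                                     # assumed corruption strength
x_noisy = x + noise_std * torch.randn_like(x)       # corrupt the input
x_noisy = x_noisy.clamp(0.0, 1.0)                   # keep pixels in [0, 1]

x_hat, _ = model(x_noisy)                           # reconstruct from the corrupted input
loss = F.mse_loss(x_hat, x)                         # but match the *clean* target x
```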
5. What is the latent space?
Concept: The latent space is the low-dimensional vector space where the encoder’s codes live—not the raw pixel/input space. Each sample becomes one point (a coordinate vector) in this space; after training, similar inputs often land nearby, while different patterns map farther apart, so the space can acquire geometric structure.
In an autoencoder: The bottleneck dimension of z is the latent-space dimension. The decoder maps points in this space back to the high-dimensional x̂. (Chapter 18 VAE adds a probability model on this space for sampling and generation.)
6. What is PCA?
Concept: PCA (Principal Component Analysis) is a linear dimensionality-reduction method: it finds the directions of largest data variance, in order, and uses them as orthogonal axes called principal components. Projecting data onto the first few axes yields a low-dimensional summary that keeps as much variance as possible (the variance along the discarded axes is lost).
Versus autoencoders: PCA uses linear maps only; autoencoders with nonlinear activations can learn richer, curved structure. On complex data, AEs are often more flexible. (A linear AE trained with MSE connects to PCA intuition under certain conditions.)
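A minimal PCA sketch with scikit-learn for comparison with the autoencoder's bottleneck; the random data and the 2-component setting are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(1000, 784)         # dummy data: 1000 samples, 784 features

pca = PCA(n_components=2)             # keep the 2 highest-variance directions
Z = pca.fit_transform(X)              # low-dimensional summary (like the AE's z)
X_rec = pca.inverse_transform(Z)      # linear reconstruction back to 784-D

print(pca.explained_variance_ratio_)  # variance kept along each principal axis
```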
Why it matters
Beyond PCA: powerful dimensionality reduction
PCA, as described above, is essentially linear dimensionality reduction. Autoencoders, by contrast, use nonlinear activations to compress and visualize high-dimensional data in 2–3D more flexibly.
Unsupervised feature learning
Labeling is expensive. An AE can extract features from raw data alone; a pretrained encoder is a strong starting point for transfer learning into classifiers.
Gateway to generative AI
Beyond compression, tweaking the latent code z to synthesize new faces or images leads to VAEs and GANs.
How it is used
Step 1: Normalize and scale
Map image pixels from 0–255 to [0, 1] with min–max scaling, or standardize per channel. Keep the RGB channel order fixed and apply the same preprocessing to every batch. Inconsistent scaling changes the MSE gradients and can slow or destabilize training.
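A common preprocessing recipe as a sketch; the per-channel mean/std values and the dummy uint8 batch are placeholders, not values from the text.

```python
import numpy as np

def minmax_scale(images_uint8):
    """Map 0-255 pixel values into [0, 1]."""
    return images_uint8.astype(np.float32) / 255.0

def standardize(images, mean, std):
    """Per-channel standardization with fixed (pre-computed) statistics."""
    return (images - mean) / std

x = minmax_scale(np.random.randint(0, 256, size=(16, 28, 28, 3), dtype=np.uint8))
# Use the SAME mean/std for every batch (placeholder values shown here)
x = standardize(x, mean=np.array([0.5, 0.5, 0.5]), std=np.array([0.25, 0.25, 0.25]))
```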
Step 2: Architecture, bottleneck k, and loss
Images: prefer a convolutional AE (CAE) to preserve locality. Vectors or sequences: use 1D convolutions or fully connected stacks. Bottleneck k: smaller → stronger compression but more detail loss; larger → easier reconstruction but weaker summarization; pick it with the validation loss. Real-valued outputs → MSE; [0, 1]-scaled grayscale → consider BCE.
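A compact convolutional encoder/decoder sketch for Step 2; the channel counts, strides, and the 1×28×28 input size are assumptions.

```python
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    """Convolutional AE for 1x28x28 images: conv downsampling, transposed-conv upsampling."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),                                             # 7x7 -> 14x14
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),                                          # 14x14 -> 28x28
        )

    def forward(self, x):
        z = self.encoder(x)        # spatial bottleneck feature map
        return self.decoder(z), z
```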
Step 3: Training loop, output activation, stability
Backpropagate MSE or BCE each minibatch. For [0, 1] targets, put a sigmoid on the decoder's last layer. Use Adam (or similar), a learning-rate schedule, and gradient clipping if needed. Split train/validation; if validation loss worsens, try early stopping, dropout/weight decay, or a denoising AE.
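A minimal training loop matching Step 3, as a sketch: the train_loader/val_loader objects, the BCE choice, the learning rate, epoch count, and clipping norm are assumptions, and it presumes an autoencoder like the earlier fully connected sketch whose decoder ends in a sigmoid.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # assumed learning rate

for epoch in range(20):                          # assumed epoch count
    model.train()
    for (x,) in train_loader:                    # unsupervised: no labels needed
        x_hat, _ = model(x)
        loss = F.binary_cross_entropy(x_hat, x)  # decoder ends in sigmoid for [0, 1] targets
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # optional gradient clipping
        optimizer.step()

    # Validation loss: watch this for early stopping
    model.eval()
    with torch.no_grad():
        val_loss = sum(F.binary_cross_entropy(model(x)[0], x) for (x,) in val_loader)
```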
Step 4: Evaluation, plots, downstream
Do not rely on the loss curve alone; visually inspect the reconstructions x̂. Project the latent codes z to 2D (e.g., with t-SNE) to see structure or outliers. For anomaly detection, train on normal data only and set a reconstruction-error threshold on a validation set. Freeze or fine-tune the encoder for few-label classification or clustering.
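A sketch of the latent-space inspection step; it assumes the flat latent z from the fully connected sketch, a hypothetical held-out batch x_val, and scikit-learn's TSNE with matplotlib for plotting.

```python
import torch
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

model.eval()
with torch.no_grad():
    _, z = model(x_val)                    # latent codes for a validation batch

z_2d = TSNE(n_components=2).fit_transform(z.numpy())   # project z to 2D
plt.scatter(z_2d[:, 0], z_2d[:, 1], s=4)
plt.title("Latent space (t-SNE projection)")
plt.show()
```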
Uses at a glance
| Goal | Idea |
|---|---|
| Anomaly detection | Train on normal data only → flag high reconstruction error |
| Denoising | DAE: corrupted input → clean target |
| Dim. reduction / viz | Small bottleneck z or a 2D projection of z |
| Pretraining | Reuse the encoder as a front end for transfer |
Summary
One-liner: The encoder squeezes the data x through a narrow bottleneck z; the decoder maps z back to x̂; training minimizes the reconstruction error so the network discovers salient structure.
Links: Combine Dense and CNN blocks for encoder/decoder; CAEs help on complex spatial data.
Next (Chapter 18): VAE places a probability distribution on for generation.
Problem-solving notes
Autoencoder items are easiest if you keep the one-liner x → z → x̂ and the goal in mind: a reconstruction loss matching x̂ to x. At the bottleneck, usually dim(z) ≪ dim(x). For one fully connected layer d_in → d_out, count about d_in × d_out weights + d_out biases. The flattened image length is height × width (×3 for RGB); the patch count (no CLS token) is (H/P) × (W/P), the same line of reasoning as the ViT patch/grid (Chapter 5 review).
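A quick check of the counting rules above; the layer sizes and image dimensions are made-up exam-style numbers, not values from the text.

```python
# Fully connected layer d_in -> d_out: weights + biases
d_in, d_out = 784, 32
params = d_in * d_out + d_out          # 784*32 + 32 = 25,120

# Flattened image length and ViT-style patch count (no CLS token)
H, W, C, P = 224, 224, 3, 16
flat_len = H * W * C                   # 224*224*3 = 150,528
patches = (H // P) * (W // P)          # 14*14 = 196
print(params, flat_len, patches)
```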
Anomaly detection: train the reconstruction on normal data, then flag samples with large reconstruction error. A denoising AE maps corrupted inputs toward clean targets for robust features. Use MSE for real-valued pixels; BCE is common for [0, 1] grayscale. When a compression ratio or a percentage appears, align the numerator and denominator carefully.
A convolutional AE stacks CNN encoders/decoders to keep local structure (Chapter 12). If z is too large, the net can approach an identity copy; questions often test the compression vs. expressivity trade-off when shrinking z.
The next chapter's VAE puts a probability model on the latent z for generation. If the stem says probabilistic latent, sampling, or generation, think VAE.