Ch.04
Matrix Multiplication and Linear Transformation: Math That Manipulates Space
[Figure] Square grid → skewed grid: the same coordinates on the left land in one step on the right—the whole grid stretches with them. The orange "A" badge marks the matrix (linear map) of this figure; it multiplies the left-hand coordinates to produce the point on the right. On one plane, the square grid becomes a parallelogram under $A$, and $x$ moves to $Ax$. Remember this picture.
$T$ is just a name (a function symbol) for the linear map "multiply by $A$". So $T(x)$ means "apply $T$ to $x$", which is exactly $Ax$.
Blue region = the parallelogram spanned by the two column vectors of $A$. For inputs $x$ from the unit square, the output $Ax$ always lies inside that region (it is reachable as a combination of the columns).
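To make the "combination of columns" reading concrete, here is a minimal NumPy sketch; the matrix and vector values are illustrative, not taken from the figure:

```python
import numpy as np

# An illustrative 2x2 matrix: its columns span the blue parallelogram.
A = np.array([[2.0, 1.0],
              [0.0, 1.0]])
x = np.array([0.5, 0.5])          # a point inside the unit square

direct = A @ x                                 # matrix-vector product
by_columns = x[0] * A[:, 0] + x[1] * A[:, 1]   # same thing as a combination of columns

print(direct, by_columns)         # identical: [1.5 0.5] [1.5 0.5]
```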
Matrix multiplication is not just a tedious pile of additions and multiplications. A matrix plays the same role as a smart filter in a digital photo editor—rotating, twisting, and compressing raw data. This chapter dives into linear transformations: putting one datum (a vector) through an editor (a matrix) and mapping it into a genuinely different “space.” We unpack what it means, mathematically, for the backbone of deep learning models to work the way it does.
Matrix multiplication and linear maps: edit space with full control
1. Linear transformation: the “Free Transform” tool in an image editor
Concept: Imagine an image drawn on a transparent grid, opened in Photoshop. Dragging corners to stretch diagonally, rotating 45°, or shearing—all of that is a linear transformation in geometry.
Strict rules: This tool has two rules that must never be broken. First, the origin at the center of the image stays fixed after the transform. Second, lines that were straight stay straight (no bending), and lines that were parallel stay parallel.
2. Matrix × vector ($Ax$): applying a filter to the raw image
Concept: Here $x$ is the “original data (position of a point)” with no effect applied yet, and $A$ is a smart filter (transform rule) that shears at specific angles and scales. Applying the filter is written $Ax$ (matrix $A$ acts on $x$).
In deep learning: One neural network layer uses this to build $y = Wx + b$.
* $W$ (weight matrix): shears data into angles/scales that are easier for the model to analyze (linear map).
* $b$ (bias vector): drags the sheared result sideways, like moving a layer in an editor (translation).
The result after “deform + shift” is passed to the next layer (a minimal sketch follows below).
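A small NumPy sketch of one such layer, with made-up sizes (3 inputs, 2 outputs) purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))   # weight matrix: maps a 3-D input to a 2-D output
b = np.zeros(2)                   # bias vector: shifts the transformed result
x = np.array([1.0, -0.5, 2.0])    # one input vector

y = W @ x + b                     # linear map (W x), then translation (+ b)
print(y.shape)                    # (2,)
```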
3. Matrix × matrix ($AB$): stacking filters in order
Concept: Multiplying $A$ and $B$ means applying two editing filters one after another. In $(AB)x = A(Bx)$, expressions flow right to left, so you apply $B$ first, then overlay $A$ on the result.
Key fact ($AB \neq BA$): “Stretch horizontally by 2, then rotate 90°” gives a tall image; “rotate 90°, then stretch horizontally by 2” gives a wide image. Order changes the outcome, so $AB \neq BA$ in general (multiplication is not commutative).
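A quick NumPy check of this exact pair of filters (90° rotation and a horizontal stretch by 2), as an illustrative sketch:

```python
import numpy as np

R = np.array([[0.0, -1.0],        # rotate 90 degrees counter-clockwise
              [1.0,  0.0]])
S = np.array([[2.0, 0.0],         # stretch horizontally by 2
              [0.0, 1.0]])

x = np.array([1.0, 1.0])

print(R @ (S @ x))                # stretch first, then rotate  -> [-1.  2.]
print(S @ (R @ x))                # rotate first, then stretch  -> [-2.  1.]
print(np.allclose(R @ S, S @ R))  # False: the matrix product is not commutative
```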
4. Matching dimensions: plugging compatible cables
Concept: When stacking filters, connectors must match: the number of columns of the left matrix must equal the number of rows of the right matrix.
Key formula: an $m \times n$ matrix times an $n \times p$ matrix absorbs the touching $n$ and outputs an $m \times p$ result. In code, the transpose $W^\top$ flips a table so batch data and weights line up cleanly as $XW^\top$.
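A small shape check in NumPy (all sizes here are invented for illustration):

```python
import numpy as np

A = np.ones((4, 3))               # 4x3
B = np.ones((3, 5))               # 3x5: inner sizes match (3 == 3)
print((A @ B).shape)              # (4, 5): the touching 3 is absorbed

X = np.ones((8, 3))               # batch of 8 samples, 3 features each
W = np.ones((2, 3))               # layer weights: 3 features in, 2 out
print((X @ W.T).shape)            # (8, 2): the transpose makes the cables fit
```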
5. Worked example: projecting 3D onto the $xy$-plane
Example 1 The linear map that sends $(x, y, z)$ to the $xy$-plane (dropping the third coordinate) is:

$$T(x, y, z) = (x, y, 0), \qquad A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$

Rule: $T(v) = Av$. Here $Av$ is the same product computed row by row (dot products).
Step 1 — Set up For $v = (x, y, z)^\top$,

$$Av = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \end{pmatrix}$$

Step 2 — Dot each row of $A$ with $v$ ($i$-th entry = row $i$ · $v$): row 1 gives $x$, row 2 gives $y$, row 3 gives $0$.
Step 3 — Stack the entries: $Av = (x, y, 0)^\top$.
So $x$ and $y$ are unchanged; $z$ becomes $0$. Geometrically this is orthogonal projection onto the $xy$-plane through the origin—a single matrix multiply encodes “remove one axis.” It ties back to the dot-product / projection view from Ch.02.
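The same projection in NumPy, as a one-line matrix multiply (the test vector is arbitrary):

```python
import numpy as np

# Projection onto the xy-plane: a third row of zeros kills the z-coordinate.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]])
v = np.array([3.0, -2.0, 5.0])

print(A @ v)                      # [ 3. -2.  0.]: x and y survive, z is dropped
```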
Practitioner’s recap: Matrix multiplication treats data not as a flat list of numbers but as points in space that $Ax$ dynamically transforms. When stacking layers, matching shapes ($m \times n$ times $n \times p$) comes first; never forget that order ($AB \neq BA$) completely changes the result.
The magic of parallelism: millions of pixels in one shot
A single high-res photo can have millions of pixels. Looping with a for-loop over each pixel would choke the CPU and make training impractical. Matrix multiplication packs those numbers into one huge table (matrix) and encodes the transform as another matrix, so the “apply a filter” picture becomes one multiply.
GPUs are built so thousands of cores share this work. Batch GEMM in TensorFlow/PyTorch stacks many samples as rows of $X$, multiplies by $W^\top$ once, and pushes the whole mini-batch through $Y = XW^\top + b$. Deep learning digests huge data quickly because matrices are a common format hardware can parallelize.
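A sketch of the same idea in PyTorch; the layer sizes and batch size are arbitrary, and `torch.nn.Linear` stores $W$ and $b$ and computes $XW^\top + b$ for the whole batch at once:

```python
import torch

batch = torch.randn(64, 128)      # 64 samples stacked as rows, 128 features each
layer = torch.nn.Linear(128, 32)  # holds W (32x128) and b (32)

out = layer(batch)                # one GEMM: X @ W.T + b for the whole mini-batch
print(out.shape)                  # torch.Size([64, 32])
```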
One shared language across AI
Whether it is Netflix recommendations, Tesla lane detection, or ChatGPT, the bottom layer keeps running $y = Wx + b$. Fully connected layers, embeddings, attention scores—different names, same matrix × matrix pattern.
With this mindset, shape mismatches are easier to debug: mismatched inner sizes are like mismatched cable specs. Once this “shared language” clicks, papers, code, and logs in different domains read off the same map.
1) Transformers and attention: a “map of attention” via matrices
Attention scores how much each word attends to every other word. $QK^\top$ fills a table of raw scores for “how much this word looks at that word.” Softmax and a weighted sum with $V$ complete scaled dot-product attention, $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$. In one line: a matrix multiply builds the relation graph, then the same algebra mixes the values.
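A minimal NumPy sketch of scaled dot-product attention (single head, no masking; the sequence length and dimensions are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 5, 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_k))   # queries
K = rng.standard_normal((seq_len, d_k))   # keys
V = rng.standard_normal((seq_len, d_k))   # values

scores = Q @ K.T / np.sqrt(d_k)           # raw "who looks at whom" table
weights = softmax(scores, axis=-1)        # each row sums to 1
out = weights @ V                         # weighted mix of the values
print(out.shape)                          # (5, 8)
```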
2) Linear layers and batch training: the whole mini-batch at once
Fully connected layers repeat $y = Wx + b$. In training, stack samples as rows of $X$ and compute $Y = XW^\top + b$ in one go. Convolutions, unfolded, are also large matrix multiplies, which is why frameworks lean on GEMM.
3) Embeddings and recommendations: comparing meaning vectors
Users, items, and words become vectors; dot products and matrix multiplies yield similarity and scores for search, ranking, and recommendations—one matrix summarizes “who is close to whom.”
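For example, a user-item score table built from embeddings, as a hedged NumPy sketch (the embedding counts and dimension are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
users = rng.standard_normal((100, 16))    # 100 user vectors in a 16-D "meaning" space
items = rng.standard_normal((500, 16))    # 500 item vectors in the same space

scores = users @ items.T                  # (100, 500): one matrix of "who is close to whom"
top5 = np.argsort(-scores[0])[:5]         # best 5 items for user 0 by dot-product score
print(top5)
```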
1) PCA and dimensionality reduction: shadows of thousand-D clouds
Humans struggle past 3D, but data often live in hundreds or thousands of dimensions. PCA uses the covariance matrix, picks eigenvector directions (axes of largest variance), and projects data onto them. As a linear map, that flattens irrelevant directions. A 2D scatter plot is literally a shadow of a high-D cloud pressed down by matrices.
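A compact NumPy sketch of that projection; the dataset here is random noise, purely to show the shapes involved:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))       # 1000 samples in 50 dimensions
X = X - X.mean(axis=0)                    # center the cloud

cov = (X.T @ X) / (len(X) - 1)            # covariance matrix (50x50)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
top2 = eigvecs[:, -2:]                    # two directions of largest variance

shadow = X @ top2                         # linear map: project onto those axes
print(shadow.shape)                       # (1000, 2): ready for a scatter plot
```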
2) Geometric preview: eigenvalues and determinants later
Even one linear map stretches different directions differently; the special directions are eigenvectors, and the stretch factors are eigenvalues. The column-parallelogram picture and the compositions you learn here make Ch.05’s invertibility, determinants, and spectra feel familiar.
3) Computer graphics: matrices all the way to the screen
Games and CAD use homogeneous coordinates and matrix multiplies for rotate, translate, and perspective. Rendering 3D to 2D is still moving coordinates with one matrix. Deep learning and CG look different but share the same matrix toolbox for space.
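As a small illustration, here is a rotate-then-translate of one 2D point with a single homogeneous 3×3 matrix (the angle and offset are arbitrary):

```python
import numpy as np

theta = np.pi / 2                          # rotate 90 degrees
tx, ty = 3.0, 1.0                          # then translate by (3, 1)

M = np.array([[np.cos(theta), -np.sin(theta), tx],
              [np.sin(theta),  np.cos(theta), ty],
              [0.0,            0.0,           1.0]])

p = np.array([1.0, 0.0, 1.0])              # point (1, 0) in homogeneous coordinates
print(M @ p)                               # [3. 2. 1.]: rotated to (0, 1), shifted to (3, 2)
```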
The table lists shape rules and identities. Examples sketch typical steps.
| Symbol | Meaning |
|---|---|
| $AB$ | Defined when cols of $A$ = rows of $B$ |
| $(AB)_{ij}$ | Dot product of row $i$ of $A$ and column $j$ of $B$ |
| $Ax$ | Vector of the dot products of each row of $A$ with $x$ |
| $(AB)^\top = B^\top A^\top$ | Transpose reverses the order of the product |
| $(AB)x = A(Bx)$ | Composition: apply $B$ first, then $A$ |
| $Y = XW^\top + b$ | FC layer over a batch $X$ (rows are samples) |
① Shapes: inner dimensions must match ($m \times n$ times $n \times p$ gives $m \times p$).
② Batch: the same $W$ applied to every row of $X$ → one GEMM.
Worked examples
Example 1 — shape
Q: $A$ is $m \times n$ and $B$ is $n \times p$. What is the shape of $AB$?
A: $m \times p$.
Example 2 — order
Q: Which matrix represents “apply $B$, then $A$”?
A: $AB$ (the later map goes on the left).
Example 3 — transpose
Q: $(AB)^\top = {}$?
A: $B^\top A^\top$.
Example 4 — column
Q: What is $Ae_2$, with $e_2$ the second standard basis vector?
A: The second column of $A$.
Example 5 — batch
Q: How is a linear layer applied in one shot, with the rows of $X$ as samples?
A: Often $Y = XW^\top + b$.
Practice problems
For a linear map $T$, which of the following always holds?