Everyone's AI
Ch.04

Matrix Multiplication and Linear Transformation: Math That Manipulates Space

Math diagram for this chapter

[Figure: a square grid in the input plane (labeled "ℝ² · input") is sent by the matrix $A$ to a skewed grid in the output plane ("ℝ² · output"). The same coordinates on the left land in one step on the right; the whole grid stretches with them. The orange "A" badge marks the linear map for this figure: it multiplies the left-hand coordinates $\mathbf{x}$ to produce the point $A\mathbf{x}$ on the right. A gray dashed outline shows the square you would recover if $A = I$. Note: ℝ² here is the 2D real space of coordinate pairs, not the regression R-squared.]

On one plane, the square grid becomes a parallelogram under $A$, and $\mathbf{x}$ moves to $A\mathbf{x}$.

Remember this picture

$T(\mathbf{x})=A\mathbf{x}$. $T$ is just a name (a function symbol) for the linear map "multiply by $A$". So $T(\mathbf{x})$ means "apply $T$ to $\mathbf{x}$", which is exactly $A\mathbf{x}$.
Blue region = the parallelogram spanned by the two column vectors of $A$. For any $\mathbf{x}$ in the unit square, the output $T(\mathbf{x})=A\mathbf{x}$ lies inside that region: it is a combination of the columns with coefficients between 0 and 1.
Matrix multiplication is not just a tedious pile of additions and multiplications. A matrix plays the same role as a smart filter in a digital photo editor—rotating, twisting, and compressing raw data. This chapter dives into linear transformations: putting one datum (a vector) through an editor (a matrix) and mapping it into a genuinely different "space." We unpack what it means, mathematically, for the backbone $\mathbf{y} = W\mathbf{x} + \mathbf{b}$ of deep learning models to work the way it does.

Matrix multiplication and linear maps: edit space with full control

1. Linear transformation: the "Free Transform" tool in an image editor
Concept: Imagine an image drawn on a transparent grid, opened in Photoshop. Dragging corners to stretch diagonally, rotating 45°, or shearing—all of that is a linear transformation in geometry.
Strict rules: This tool has two rules that must never break. First, the origin $(0,0)$ at the center of the image stays fixed after the transform. Second, lines that were straight stay straight (no bending), and lines that were parallel stay parallel.
2. Matrix × vector ($A\mathbf{x}$): applying a filter to the raw image
Concept: Here $\mathbf{x}$ is the "original data (position of a point)" with no effect applied yet, and $A$ is a smart filter (transform rule) that shears at specific angles and scales. Applying the filter is written $A\mathbf{x}$ (matrix $A$ acts on $\mathbf{x}$).
In deep learning: One neural network layer uses this to build $\mathbf{y} = W\mathbf{x} + \mathbf{b}$.
* $W$ (weight matrix): shears data into angles/scales that are easier for the model to analyze (linear map).
* $\mathbf{b}$ (bias vector): drags the sheared result sideways like moving a layer in an editor (translation).
The result $\mathbf{y}$ after "deform + shift" is passed to the next layer.
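The "deform + shift" step can be sketched in a few lines of NumPy; the matrix and bias values below are made up purely for illustration:

```python
import numpy as np

# One layer's affine map y = Wx + b (shapes and values are illustrative).
W = np.array([[2.0, 0.0],
              [1.0, 1.0]])   # weight matrix: shears/scales the input (linear map)
b = np.array([0.5, -1.0])    # bias vector: translates the sheared result

x = np.array([1.0, 2.0])     # input point
y = W @ x + b                # deform with W, then shift by b
print(y)                     # y = [2.5, 2.0]
```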
3. Matrix × matrix ($AB$): stacking filters in order
Concept: Multiplying $A$ and $B$ means applying two editing filters one after another. In $AB$, expressions flow right to left, so you apply $B$ first, then overlay $A$ on the result.
Key fact ($AB \neq BA$): "Stretch horizontally by 2, then rotate 90°" gives a tall image; "rotate 90°, then stretch horizontally by 2" gives a wide image. Order changes the outcome, so $AB \neq BA$ (multiplication is not commutative).
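The order effect is easy to check numerically, here with a horizontal 2× stretch and a 90° rotation (any two such matrices would do):

```python
import numpy as np

S = np.array([[2.0, 0.0],   # stretch horizontally by 2
              [0.0, 1.0]])
R = np.array([[0.0, -1.0],  # rotate 90° counterclockwise
              [1.0,  0.0]])

x = np.array([1.0, 0.0])
print(R @ S @ x)            # stretch first, then rotate: [0., 2.] (tall)
print(S @ R @ x)            # rotate first, then stretch: [0., 1.]
print(np.allclose(R @ S, S @ R))  # False: the products differ
```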
4. Matching dimensions: plugging compatible cables
Concept: When stacking filters, connectors must match: the number of columns of the left matrix must equal the number of rows of the right matrix.
Key formula: $(m\times n)$ times $(n\times p)$ absorbs the touching $n$ and outputs $(m\times p)$. In code, transpose flips a table so batch data $X$ and weights $W$ line up cleanly as $Y = XW^{\mathsf T}$.
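A minimal NumPy sketch of the cable rule; all shapes below are arbitrary examples:

```python
import numpy as np

A = np.random.rand(4, 7)   # (m × n)
B = np.random.rand(7, 3)   # (n × p): the inner 7s match, like compatible cables
C = A @ B
print(C.shape)             # (4, 3): the touching n is absorbed

# Mismatched inner sizes fail, like plugging in the wrong cable:
try:
    _ = B @ A              # (7×3) @ (4×7): 3 != 4
except ValueError:
    print("shape mismatch")

# The transpose trick: rows of X are samples, rows of W are output units.
X = np.random.rand(5, 7)   # batch of 5 samples, 7 features each
W = np.random.rand(3, 7)   # 3 outputs × 7 inputs (PyTorch-style layout)
Y = X @ W.T
print(Y.shape)             # (5, 3)
```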
5. Worked example: projecting 3D onto the $xy$-plane
Example 1 The linear map that sends $(x_1,x_2,x_3)$ to the plane $z=0$ (dropping the third coordinate) is:
$$A=\begin{pmatrix}1&0&0\\0&1&0\\0&0&0\end{pmatrix}$$
Rule: $\mathbf{x}\mapsto A\mathbf{x}$. Here is the same product row by row (dot products).
Step 1 — Set up. For $\mathbf{x}=(x_1,x_2,x_3)^{\mathsf T}$,
$$A\mathbf{x}=\begin{pmatrix}1&0&0\\0&1&0\\0&0&0\end{pmatrix}\begin{pmatrix}x_1\\x_2\\x_3\end{pmatrix}$$
Step 2 — Dot each row of $A$ with $\mathbf{x}$ ($i$th entry = row $i$ · $\mathbf{x}$):
$$\begin{aligned} y_1 &= 1\cdot x_1+0\cdot x_2+0\cdot x_3 = x_1,\\ y_2 &= 0\cdot x_1+1\cdot x_2+0\cdot x_3 = x_2,\\ y_3 &= 0\cdot x_1+0\cdot x_2+0\cdot x_3 = 0. \end{aligned}$$
Step 3 — Stack the entries:
$$A\mathbf{x}=\begin{pmatrix}y_1\\y_2\\y_3\end{pmatrix}=\begin{pmatrix}x_1\\x_2\\0\end{pmatrix}$$
So $x_1$ and $x_2$ are unchanged; $x_3$ becomes $0$. Geometrically this is orthogonal projection onto the $xy$-plane through the origin—a single matrix multiply encodes "remove one axis." It ties back to the dot-product / projection view from Ch.02.
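The projection can be verified numerically; the input vector below is an arbitrary example:

```python
import numpy as np

# Projection onto the xy-plane: keep x1 and x2, zero out x3.
A = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 0]], dtype=float)

x = np.array([3.0, -2.0, 5.0])
y = A @ x
print(y)                    # [3., -2., 0.]: the z-component is removed

# Each output entry is the dot product of a row of A with x:
rows = np.array([A[i] @ x for i in range(3)])
print(np.allclose(y, rows)) # True
```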
Practitioner's recap: Matrix multiplication treats data not as a flat list of numbers but as dynamic spatial transforms via $\mathbf{y} = W\mathbf{x} + \mathbf{b}$. When stacking layers, matching shapes ($(m \times n) \times (n \times p)$) comes first; never forget that order ($AB \neq BA$) completely changes the result.
The magic of parallelism: millions of pixels in one shot
A single high-res photo can have millions of pixels. Looping with a for-loop over each pixel would choke the CPU and make training impractical. Matrix multiplication packs those numbers into one huge table (matrix) and encodes the transform as another matrix, so the “apply a filter” picture becomes one multiply.
GPUs are built so thousands of cores share this work. Batched GEMM in TensorFlow/PyTorch stacks many samples as rows of $X$, multiplies by $W$ once, and pushes the whole mini-batch through $Y = XW^{\mathsf T}$. Deep learning digests huge data quickly because matrices are a common format hardware can parallelize.
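A small experiment (sizes are illustrative) showing that one batched multiply reproduces the per-item loop exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10_000, 3))      # 10k "pixels", 3 channels each
A = rng.random((3, 3))           # one transform encoded as a matrix

# Per-pixel loop (what a naive CPU for-loop would do):
slow = np.stack([A @ x for x in X])

# One matrix multiply over the whole batch:
fast = X @ A.T

print(np.allclose(slow, fast))   # True: same result, one GEMM call
```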
One shared language across AI
Whether Netflix recommendations, Tesla lane detection, or ChatGPT, the bottom layer keeps running $Y = XW^{\mathsf T}$. Fully connected layers, embeddings, attention scores—different names, same matrix × matrix pattern.
With this mindset, shape mismatches are easier to debug: mismatched inner sizes are like mismatched cable specs. Once this “shared language” clicks, papers, code, and logs in different domains read off the same map.
1) Transformers and attention: a “map of attention” via matrices
Attention scores how much each word attends to every other word. $QK^{\mathsf T}$ fills raw scores for "how much this word looks at that word." Softmax and a weighted sum with $V$ complete scaled dot-product attention. In one line: matrix multiply builds the relation graph, then the same algebra mixes values.
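A toy NumPy sketch of scaled dot-product attention with made-up shapes (4 tokens, dimension 8); real implementations add masking, batching, and multiple heads:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # queries: 4 tokens × dim 8 (illustrative)
K = rng.standard_normal((4, 8))   # keys
V = rng.standard_normal((4, 8))   # values

scores = Q @ K.T / np.sqrt(8)        # raw "who looks at whom" table (4×4)
weights = softmax(scores, axis=-1)   # each row sums to 1
out = weights @ V                    # weighted mix of values (4×8)
print(weights.sum(axis=-1))          # [1. 1. 1. 1.]
print(out.shape)                     # (4, 8)
```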
2) Linear layers and batch training: the whole mini-batch at once
Fully connected layers repeat $\mathbf{y} = W\mathbf{x} + \mathbf{b}$. In training, stack $N$ samples as rows of $X$ and compute $Y = XW^{\mathsf T} + \mathbf{1}\mathbf{b}^{\mathsf T}$ in one go. Convolutions, unfolded, are also large matrix multiplies, which is why frameworks lean on GEMM.
3) Embeddings and recommendations: comparing meaning vectors
Users, items, and words become vectors; dot products and matrix multiplies yield similarity and scores for search, ranking, and recommendations—one matrix summarizes “who is close to whom.”
1) PCA and dimensionality reduction: shadows of thousand-D clouds
Humans struggle past 3D, but data often live in hundreds or thousands of dimensions. PCA uses the covariance matrix, picks eigenvector directions (axes of largest variance), and projects data onto them. As a linear map, that flattens irrelevant directions. A 2D scatter plot is literally a shadow of a high-D cloud pressed down by matrices.
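A bare-bones PCA sketch along these lines, using the eigenvectors of the covariance matrix (library routines typically use an SVD instead; the data here is random for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))  # 5-D cloud

Xc = X - X.mean(axis=0)               # center the data
C = Xc.T @ Xc / (len(Xc) - 1)         # covariance matrix (5×5)
vals, vecs = np.linalg.eigh(C)        # eigh returns ascending eigenvalues
W = vecs[:, ::-1][:, :2]              # top-2 eigenvectors as columns

shadow = Xc @ W                       # project: the 2-D "shadow" of the cloud
print(shadow.shape)                   # (200, 2)
```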
2) Geometric preview: eigenvalues and determinants later
Even one linear map stretches different directions differently; the special directions are eigenvectors, and their stretch factors are eigenvalues. The parallelogram spanned by the columns and the compositions $AB$ you learn here will make Ch.05's invertibility, determinants, and eigenvalue spectra feel familiar.
3) Computer graphics: matrices all the way to the screen
Games and CAD use homogeneous coordinates and matrix multiplies for rotate, translate, and perspective. Rendering 3D to 2D is still moving coordinates with one matrix. Deep learning and CG look different but share the same matrix toolbox for space.
The table lists shape rules and identities. Examples sketch typical steps.
| Symbol | Meaning |
| --- | --- |
| $AB$ | Defined when the number of columns of $A$ equals the number of rows of $B$ |
| $(AB)_{ij}$ | Dot product of row $i$ of $A$ and column $j$ of $B$ |
| $A\mathbf{x}$ | Vector of dot products of the rows of $A$ with $\mathbf{x}$ |
| $(AB)^{\mathsf T}$ | $B^{\mathsf T}A^{\mathsf T}$ |
| Composition | $\mathbf{x}\mapsto A(B\mathbf{x})=(AB)\mathbf{x}$ |
| FC layer | $\mathbf{y}=W\mathbf{x}+\mathbf{b}$ |
① Shapes: inner dimensions must match.
② Batch: the same $W$ applied to each row → one GEMM.
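The transpose and composition identities above are easy to spot-check numerically (the shapes below are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 7))
B = rng.random((7, 3))
x = rng.random(3)

print(np.allclose((A @ B).T, B.T @ A.T))       # transpose rule: (AB)^T = B^T A^T
print(np.allclose(A @ (B @ x), (A @ B) @ x))   # composition: A(Bx) = (AB)x
```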

Worked examples

Example 1 — shape
Q: $A$ is $4\times 7$, $B$ is $7\times 3$. $AB$?
A: $4\times 3$.

Example 2 — order
Q: Matrix for "$B$ then $A$"?
A: $AB$.

Example 3 — transpose
Q: $(AB)^{\mathsf T}$?
A: $B^{\mathsf T}A^{\mathsf T}$.

Example 4 — column
Q: $A\mathbf{e}_2$?
A: Second column of $A$.

Example 5 — batch
Q: One-shot linear layer with rows as samples?
A: Often $XW^{\mathsf T}$.
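Example 4 can be checked directly (the matrix below is an arbitrary example):

```python
import numpy as np

# A @ e_j picks out column j of A (here j = 2, i.e. index 1 in NumPy).
A = np.arange(1.0, 10.0).reshape(3, 3)
e2 = np.array([0.0, 1.0, 0.0])           # second standard basis vector
print(np.allclose(A @ e2, A[:, 1]))      # True: A e2 is the second column
```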

Practice problems

For a linear map $T(\mathbf{x})=A\mathbf{x}$, which always holds?