Linear Algebra Operations Every ML Engineer Should Know
Machine learning is applied linear algebra. This is not a metaphor. When you define a neural network layer, you are specifying a matrix multiplication. When you compute a loss gradient, you are performing a matrix transpose and multiply. When you reduce dimensions with PCA, you are computing an eigendecomposition.
This post is a practical reference — not a textbook chapter. For each operation, we cover what it does, where it appears in ML, and the implementation details that matter in production.
1. Matrix Multiplication
The single most important operation in ML. If you understand nothing else, understand this.
```python
# The rule: (m, n) @ (n, p) -> (m, p)
# Inner dimensions must match. Result has outer dimensions.
import numpy as np

A = np.array([[1, 2], [3, 4]])  # (2, 2)
B = np.array([[5, 6], [7, 8]])  # (2, 2)
C = A @ B                       # (2, 2)
# C = [[1*5+2*7, 1*6+2*8],
#      [3*5+4*7, 3*6+4*8]]
#   = [[19, 22],
#      [43, 50]]
```
Where it appears
- Every dense/linear layer: `y = Wx + b`
- Attention mechanisms: `Q @ K.T` and `softmax(scores) @ V`
- Convolutions (implemented as matrix multiplication via im2col)
- Batch normalization (the normalization itself is element-wise, but the mean/variance statistics can be written as matrix products with a ones vector)
Production considerations
Matrix multiplication is O(n^3) for square matrices (with the standard algorithm), but hardware (GPUs, TPUs) is specifically designed to parallelize it; NVIDIA's tensor cores execute a small fixed-size matrix multiply-accumulate (e.g. 4x4) per clock cycle. The practical bottleneck is usually memory bandwidth (moving data to and from the compute units), not arithmetic throughput.
For small matrices (under 5x5), it is worth verifying a multiplication by hand before scaling up.
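One concrete consequence of the cubic cost is that parenthesization matters. A minimal sketch (random matrices, illustrative only): when a vector is involved, multiplying the chain right-to-left avoids ever materializing a large intermediate matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
B = rng.standard_normal((500, 500))
v = rng.standard_normal(500)

# (A @ B) @ v builds a 500x500 intermediate: ~2 * 500**3 flops.
# A @ (B @ v) is just two matrix-vector products: ~2 * 2 * 500**2 flops.
slow = (A @ B) @ v
fast = A @ (B @ v)
print(np.allclose(slow, fast))  # Same result, very different cost
```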
2. Transpose
Swaps rows and columns. Turns (m, n) into (n, m).
```python
A = np.array([[1, 2, 3],
              [4, 5, 6]])  # (2, 3)
A_T = A.T                  # (3, 2)
# [[1, 4],
#  [2, 5],
#  [3, 6]]
```
Where it appears
- Gradient computation: `dW = X^T @ dL/dY`
- Covariance matrix: `Sigma = X^T @ X / (n-1)`
- Normal equation: `beta = (X^T X)^(-1) X^T y`
- Symmetric matrices: `A = A^T` (covariance matrices are always symmetric)
Key property: (AB)^T = B^T A^T. The transpose of a product reverses the order. This rule appears constantly in gradient derivations.
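The reversal rule is easy to check numerically; a quick sketch with rectangular matrices, where the shapes make the order reversal obvious:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

# (AB)^T = B^T A^T -- both sides have shape (2, 3)
assert np.allclose((A @ B).T, B.T @ A.T)
```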
3. Determinant
A scalar value that encodes several properties of a square matrix at once.
```python
import numpy as np

# 2x2 determinant by the formula ad - bc
a, b, c, d = 1.0, 2.0, 3.0, 4.0
A = np.array([[a, b], [c, d]])
det_A = a * d - b * c       # -2.0
# For any size
det_A = np.linalg.det(A)
```
Where it appears
- Checking if a matrix is invertible (`det != 0`)
- Gaussian distributions: the normalization constant includes `|Sigma|^(-1/2)`
- Volume scaling: the determinant measures how a transformation changes volume
- Change of variables in probability: the Jacobian determinant
Practical note
In production, you rarely compute determinants directly. Instead, you use decompositions (LU, Cholesky) and compute the determinant as a product of diagonal elements. For log-determinants (common in Gaussian models), this avoids numerical overflow.
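A quick illustration of why log-determinants matter, using `np.linalg.slogdet` (which works from a decomposition internally rather than forming the determinant directly):

```python
import numpy as np

# A diagonal matrix with 500 entries of 10 has det = 10**500,
# far beyond float64's maximum (~1.8e308).
A = np.diag(np.full(500, 10.0))
print(np.linalg.det(A))              # inf -- direct computation overflows
sign, logdet = np.linalg.slogdet(A)
print(sign, logdet)                  # 1.0, 500 * log(10) ≈ 1151.29
```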
4. Matrix Inverse
The matrix A^(-1) such that A @ A^(-1) = I.
```python
import numpy as np

# 2x2 inverse formula
a, b, c, d = 1.0, 2.0, 3.0, 4.0
A = np.array([[a, b], [c, d]])
A_inv = (1 / (a * d - b * c)) * np.array([[d, -b], [-c, a]])
# General inverse
A_inv = np.linalg.inv(A)
# Better: solve the system directly
rhs = np.array([1.0, 0.0])
x = np.linalg.solve(A, rhs)  # Same result as A_inv @ rhs, but faster and more stable
```
Where it appears
- Ordinary least squares: `beta = (X^T X)^(-1) X^T y`
- Gaussian processes: inverting the kernel matrix
- Kalman filter: inverting the innovation covariance
- Newton's method: `x_{n+1} = x_n - H^(-1) @ gradient`
Production rule
Never call `np.linalg.inv()` in production code. Use `np.linalg.solve()` or a decomposition. Both cost O(n^3), but explicit inversion does more work and amplifies rounding error on ill-conditioned matrices. Regularization (`A + lambda*I`) improves conditioning.
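A sketch of the preferred pattern applied to the regularized normal equations (random data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = rng.standard_normal(100)
lam = 1e-2

A = X.T @ X + lam * np.eye(10)       # Regularization improves conditioning
b = X.T @ y
beta = np.linalg.solve(A, b)         # Preferred: one factorization, one solve
beta_via_inv = np.linalg.inv(A) @ b  # Same answer, more work, less stable
assert np.allclose(beta, beta_via_inv)
```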
5. Eigendecomposition
Decomposes a diagonalizable square matrix into A = Q @ Lambda @ Q^(-1), where Lambda is a diagonal matrix of eigenvalues and the columns of Q are the corresponding eigenvectors. (Not every square matrix is diagonalizable, but the symmetric matrices common in ML always are.)
```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # Symmetric example matrix
eigenvalues, eigenvectors = np.linalg.eig(A)
# For symmetric matrices (covariance, Gram), use eigh (faster, more stable)
eigenvalues, eigenvectors = np.linalg.eigh(A)
```
Where it appears
- PCA: Eigenvectors of the covariance matrix are the principal components
- Spectral clustering: Eigenvectors of the graph Laplacian
- Google PageRank: Dominant eigenvector of the link matrix
- Stability analysis: Eigenvalues of the Hessian determine optimization landscape
- Matrix powers: `A^n = Q @ Lambda^n @ Q^(-1)` (efficient for large n)
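The matrix-power identity in the last bullet can be checked directly; a sketch for a small symmetric matrix (for which `Q^(-1) = Q^T`):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])        # Symmetric, eigenvalues 1 and 3
evals, Q = np.linalg.eigh(A)

n = 10
A_n = Q @ np.diag(evals**n) @ Q.T  # A^n via the eigendecomposition
assert np.allclose(A_n, np.linalg.matrix_power(A, n))
```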
```python
# PCA implementation
k = 2                                  # Number of components to keep
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (len(X) - 1)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# Sort descending
idx = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, idx[:k]]  # Top k components
X_reduced = X_centered @ components
```
6. Singular Value Decomposition (SVD)
The most powerful decomposition in numerical linear algebra. Works on any matrix (not just square).
```python
U, S, Vt = np.linalg.svd(X)
# U:  (m, m) left singular vectors
# S:  (min(m, n),) singular values
# Vt: (n, n) right singular vectors (transposed)
# X ≈ U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  (rank-k approximation)
```
Where it appears
- PCA: SVD of the data matrix directly (more numerically stable than eigendecomposition of covariance)
- Recommender systems: Matrix factorization via truncated SVD
- Image compression: Keep top-k singular values
- Pseudoinverse: `A+ = V @ diag(1/S) @ U^T`
- Solving least squares: `np.linalg.lstsq` uses SVD internally
- Low-rank approximation: LoRA (Low-Rank Adaptation) for fine-tuning large language models
```python
# Image compression example
import numpy as np
from PIL import Image

img = np.array(Image.open('photo.jpg').convert('L'))  # Grayscale
U, S, Vt = np.linalg.svd(img, full_matrices=False)
# Keep top 50 singular values (out of e.g. 1000)
k = 50
compressed = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
# Original: m*n values. Compressed: m*k + k + k*n values.
# For a 1000x1000 image with k=50: 100,050 vs 1,000,000 (~10x compression)
```
7. Norms
Norms measure the "size" of vectors and matrices. Different norms have different properties and ML applications.
```python
import numpy as np

x = np.array([3.0, -4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

# Vector norms
L1   = np.sum(np.abs(x))      # 7.0 -- Manhattan distance, promotes sparsity
L2   = np.sqrt(np.sum(x**2))  # 5.0 -- Euclidean distance
Linf = np.max(np.abs(x))      # 4.0 -- maximum absolute value

# Matrix norms
frobenius = np.sqrt(np.sum(A**2))                       # Element-wise L2
spectral  = np.max(np.linalg.svd(A, compute_uv=False))  # Largest singular value
```
ML applications
- L2 regularization (Ridge / weight decay): `loss + lambda * ||W||_2^2`
- L1 regularization (Lasso): `loss + lambda * ||W||_1` (produces sparse weights)
- Gradient clipping: rescale the gradient if `||g||_2 > threshold`
- Normalization layers: dividing by the standard deviation is dividing by the L2 norm of the centered features, up to a constant factor
- Cosine similarity: `cos(a,b) = (a . b) / (||a||_2 * ||b||_2)`
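Gradient clipping from the list above is a two-liner; a minimal sketch (the function name is illustrative, not from any particular library):

```python
import numpy as np

def clip_grad_norm(g, max_norm):
    """Rescale g so that ||g||_2 <= max_norm (global-norm clipping)."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

g = np.array([3.0, 4.0])                       # ||g||_2 = 5
print(np.linalg.norm(clip_grad_norm(g, 1.0)))  # ~1.0
```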
8. Hadamard (Element-wise) Product
Not a "standard" linear algebra operation, but ubiquitous in ML. Multiply corresponding elements.
```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = A * B  # Element-wise in numpy (NOT matrix multiplication)
# C[i, j] = A[i, j] * B[i, j]  ->  [[5, 12], [21, 32]]
```
Where it appears
- Attention masking: Zero out positions by multiplying with a 0/1 mask
- Gating mechanisms: LSTM gates multiply element-wise with cell state
- Feature interactions: Element-wise products of embeddings
- Dropout: Element-wise multiplication with random binary mask
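Dropout from the last bullet is just a Hadamard product with a random mask; a minimal sketch using the standard inverted-dropout scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
p = 0.5  # Drop probability

# Zero each element with probability p; scale survivors by 1/(1-p)
# so the expected activation is unchanged.
mask = (rng.random(x.shape) >= p) / (1 - p)
out = x * mask  # Hadamard product with the mask
```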
9. Cholesky Decomposition
For symmetric positive definite matrices (like covariance matrices), A = L @ L^T where L is lower triangular.
```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])  # Symmetric positive definite
L = np.linalg.cholesky(A)
# Solve Ax = b efficiently:
# 1. Solve L y = b     (forward substitution)
# 2. Solve L^T x = y   (backward substitution)
```
Where it appears
- Gaussian process inference (inverting kernel matrices efficiently)
- Sampling from multivariate Gaussians: `samples = L @ z + mean` where `z ~ N(0, I)`
- Log-determinant computation: `log|A| = 2 * sum(log(diag(L)))`
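The sampling recipe above in full; with enough samples the empirical covariance matches the target (seeded random draws, illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])

L = np.linalg.cholesky(cov)
z = rng.standard_normal((2, 100_000))  # z ~ N(0, I)
samples = (L @ z).T + mean             # samples ~ N(mean, cov)

print(np.round(np.cov(samples.T), 2))  # ≈ cov
```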
Cholesky is twice as fast as general LU decomposition and is numerically very stable for the matrices that appear in ML (which are usually symmetric positive definite after regularization).
10. Matrix Rank and Condition Number
The rank tells you how many linearly independent rows (or columns) a matrix has. The condition number tells you how numerically stable operations on that matrix will be.
```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0001]])  # Nearly singular
rank = np.linalg.matrix_rank(A)
cond = np.linalg.cond(A)
# High condition number = numerically unstable
# Rule of thumb: as cond approaches 1/machine_epsilon, expect problems
# For float64: machine_epsilon ≈ 2.2e-16, so cond > ~1e16 is bad
```
ML implications
- Rank-deficient feature matrices indicate multicollinearity (redundant features)
- High condition number explains training instability and exploding gradients
- Low-rank approximation (SVD truncation, LoRA) is effective when the data is inherently low-rank
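A sketch of the first bullet: appending an exactly redundant feature (the sum of two existing columns) drops the rank and blows up the condition number.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
# Append a redundant column: the sum of columns 0 and 1.
X_bad = np.hstack([X, X[:, :1] + X[:, 1:2]])

print(np.linalg.matrix_rank(X_bad))  # 3, not 4 -> rank-deficient
print(np.linalg.cond(X_bad))         # Enormous: numerically singular
```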
Putting It All Together
Here is a mental model for how these operations connect in a typical ML pipeline:
```python
# Data preparation
X = load_data()                  # Matrix representation
X = X - X.mean(axis=0)           # Centering (vector subtraction)
X = X / X.std(axis=0)            # Scaling (element-wise division)

# Dimensionality reduction (eigendecomposition or SVD)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = X @ Vt[:k].T         # Matrix multiplication

# Model (linear regression via the normal equation)
I = np.eye(X.shape[1])           # lam: regularization strength
beta = np.linalg.solve(X.T @ X + lam * I, X.T @ y)  # Solve instead of invert

# Neural network (forward pass)
h1 = relu(X @ W1 + b1)           # Matmul + element-wise
h2 = relu(h1 @ W2 + b2)          # Matmul + element-wise
output = h2 @ W3 + b3            # Matmul

# Gradient (backward pass)
dW3 = h2.T @ d_output                # Transpose + matmul
dW2 = h1.T @ (d_h2 * relu_grad(h2))  # Transpose + Hadamard + matmul
```
Every line is a matrix operation. The entire ML pipeline — from raw data to trained model — is a sequence of multiplications, transpositions, decompositions, and element-wise operations. Master these ten operations, and you have the tools to understand any ML paper, debug any dimension mismatch, and optimize any computational bottleneck.