Linear Algebra Operations Every ML Engineer Should Know
Machine learning is applied linear algebra. This is not a metaphor. When you define a neural network layer, you are specifying a matrix multiplication. When you compute a loss gradient, you are performing a matrix transpose and multiply. When you reduce dimensions with PCA, you are computing an eigendecomposition.
This post is a practical reference — not a textbook chapter. For each operation, we cover what it does, where it appears in ML, and the implementation details that matter in production.
1. Matrix Multiplication
The single most important operation in ML. If you understand nothing else, understand this.
```python
# The rule: (m, n) @ (n, p) -> (m, p)
# Inner dimensions must match. Result has outer dimensions.
import numpy as np

A = np.array([[1, 2], [3, 4]])  # (2, 2)
B = np.array([[5, 6], [7, 8]])  # (2, 2)
C = A @ B                       # (2, 2)
# C = [[1*5+2*7, 1*6+2*8],
#      [3*5+4*7, 3*6+4*8]]
#   = [[19, 22],
#      [43, 50]]
```
Where it appears
- Every dense/linear layer: `y = Wx + b`
- Attention mechanisms: `Q @ K.T` and `softmax(scores) @ V`
- Convolutions (implemented as matrix multiplication via im2col)
- Batch normalization (the normalization itself is element-wise, but the mean/variance statistics can be written as matrix products with a ones vector)
Production considerations
Matrix multiplication is O(n^3) for square matrices (with the standard algorithm), but hardware (GPUs, TPUs) is specifically designed to parallelize it; NVIDIA's tensor cores execute a small fixed-size matrix multiply-accumulate (e.g. 4x4) per clock cycle. The practical bottleneck is usually memory bandwidth (moving data to and from the compute units), not arithmetic throughput.
For small matrices (under 5x5), it is worth verifying a multiplication by hand before scaling up.
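One concrete consequence of the cubic cost is that parenthesization matters. A minimal sketch (random matrices, illustrative only): when a vector is involved, multiplying the chain right-to-left avoids ever materializing a large intermediate matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
B = rng.standard_normal((500, 500))
v = rng.standard_normal(500)

# (A @ B) @ v builds a 500x500 intermediate: ~2 * 500**3 flops.
# A @ (B @ v) is just two matrix-vector products: ~2 * 2 * 500**2 flops.
slow = (A @ B) @ v
fast = A @ (B @ v)
print(np.allclose(slow, fast))  # Same result, very different cost
```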
2. Transpose
Swaps rows and columns. Turns (m, n) into (n, m).
```python
A = np.array([[1, 2, 3],
              [4, 5, 6]])  # (2, 3)
A_T = A.T                  # (3, 2)
# [[1, 4],
#  [2, 5],
#  [3, 6]]
```
Where it appears
- Gradient computation: `dW = X^T @ dL/dY`
- Covariance matrix: `Sigma = X^T @ X / (n-1)`
- Normal equation: `beta = (X^T X)^(-1) X^T y`
- Symmetric matrices: `A = A^T` (covariance matrices are always symmetric)
Key property: (AB)^T = B^T A^T. The transpose of a product reverses the order. This rule appears constantly in gradient derivations.
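The reversal rule is easy to check numerically; a quick sketch with rectangular matrices, where the shapes make the order reversal obvious:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

# (AB)^T = B^T A^T -- both sides have shape (2, 3)
assert np.allclose((A @ B).T, B.T @ A.T)
```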
3. Determinant
A scalar value that encodes several properties of a square matrix at once.
```python
import numpy as np

# 2x2 determinant by the formula ad - bc
a, b, c, d = 1.0, 2.0, 3.0, 4.0
A = np.array([[a, b], [c, d]])
det_A = a * d - b * c       # -2.0
# For any size
det_A = np.linalg.det(A)
```
Where it appears
- Checking if a matrix is invertible (`det != 0`)
- Gaussian distributions: the normalization constant includes `|Sigma|^(-1/2)`
- Volume scaling: the determinant measures how a transformation changes volume
- Change of variables in probability: the Jacobian determinant
Practical note
In production, you rarely compute determinants directly. Instead, you use decompositions (LU, Cholesky) and compute the determinant as a product of diagonal elements. For log-determinants (common in Gaussian models), this avoids numerical overflow.
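A quick illustration of why log-determinants matter, using `np.linalg.slogdet` (which works from a decomposition internally rather than forming the determinant directly):

```python
import numpy as np

# A diagonal matrix with 500 entries of 10 has det = 10**500,
# far beyond float64's maximum (~1.8e308).
A = np.diag(np.full(500, 10.0))
print(np.linalg.det(A))              # inf -- direct computation overflows
sign, logdet = np.linalg.slogdet(A)
print(sign, logdet)                  # 1.0, 500 * log(10) ≈ 1151.29
```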
4. Matrix Inverse
The matrix A^(-1) such that A @ A^(-1) = I.
```python
import numpy as np

# 2x2 inverse formula
a, b, c, d = 1.0, 2.0, 3.0, 4.0
A = np.array([[a, b], [c, d]])
A_inv = (1 / (a * d - b * c)) * np.array([[d, -b], [-c, a]])
# General inverse
A_inv = np.linalg.inv(A)
# Better: solve the system directly
rhs = np.array([1.0, 0.0])
x = np.linalg.solve(A, rhs)  # Same result as A_inv @ rhs, but faster and more stable
```
Where it appears
- Ordinary least squares: `beta = (X^T X)^(-1) X^T y`
- Gaussian processes: inverting the kernel matrix
- Kalman filter: inverting the innovation covariance
- Newton's method: `x_{n+1} = x_n - H^(-1) @ gradient`
Production rule
Never call `np.linalg.inv()` in production code. Use `np.linalg.solve()` or a decomposition. Both cost O(n^3), but explicit inversion does more work and amplifies rounding error on ill-conditioned matrices. Regularization (`A + lambda*I`) improves conditioning.
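A sketch of the preferred pattern applied to the regularized normal equations (random data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = rng.standard_normal(100)
lam = 1e-2

A = X.T @ X + lam * np.eye(10)       # Regularization improves conditioning
b = X.T @ y
beta = np.linalg.solve(A, b)         # Preferred: one factorization, one solve
beta_via_inv = np.linalg.inv(A) @ b  # Same answer, more work, less stable
assert np.allclose(beta, beta_via_inv)
```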
5. Eigendecomposition
Decomposes a diagonalizable square matrix into A = Q @ Lambda @ Q^(-1), where Lambda is a diagonal matrix of eigenvalues and the columns of Q are the corresponding eigenvectors. (Not every square matrix is diagonalizable, but the symmetric matrices common in ML always are.)
```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])  # Symmetric example matrix
eigenvalues, eigenvectors = np.linalg.eig(A)
# For symmetric matrices (covariance, Gram), use eigh (faster, more stable)
eigenvalues, eigenvectors = np.linalg.eigh(A)
```
Where it appears
- PCA: Eigenvectors of the covariance matrix are the principal components
- Spectral clustering: Eigenvectors of the graph Laplacian
- Google PageRank: Dominant eigenvector of the link matrix
- Stability analysis: Eigenvalues of the Hessian determine optimization landscape
- Matrix powers: `A^n = Q @ Lambda^n @ Q^(-1)` (efficient for large n)
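The matrix-power identity in the last bullet can be checked directly; a sketch for a small symmetric matrix (for which `Q^(-1) = Q^T`):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])        # Symmetric, eigenvalues 1 and 3
evals, Q = np.linalg.eigh(A)

n = 10
A_n = Q @ np.diag(evals**n) @ Q.T  # A^n via the eigendecomposition
assert np.allclose(A_n, np.linalg.matrix_power(A, n))
```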
```python
# PCA implementation
k = 2                                  # Number of components to keep
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (len(X) - 1)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
# Sort descending
idx = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, idx[:k]]  # Top k components
X_reduced = X_centered @ components
```
6. Singular Value Decomposition (SVD)
The most powerful decomposition in numerical linear algebra. Works on any matrix (not just square).
```python
U, S, Vt = np.linalg.svd(X)
# U:  (m, m) left singular vectors
# S:  (min(m, n),) singular values
# Vt: (n, n) right singular vectors (transposed)
# X ≈ U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  (rank-k approximation)
```
Where it appears
- PCA: SVD of the data matrix directly (more numerically stable than eigendecomposition of covariance)
- Recommender systems: Matrix factorization via truncated SVD
- Image compression: Keep top-k singular values
- Pseudoinverse: `A+ = V @ diag(1/S) @ U^T`
- Solving least squares: `np.linalg.lstsq` uses SVD internally
- Low-rank approximation: LoRA (Low-Rank Adaptation) for fine-tuning large language models
```python
# Image compression example
import numpy as np
from PIL import Image

img = np.array(Image.open('photo.jpg').convert('L'))  # Grayscale
U, S, Vt = np.linalg.svd(img, full_matrices=False)
# Keep top 50 singular values (out of e.g. 1000)
k = 50
compressed = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
# Original: m*n values. Compressed: m*k + k + k*n values.
# For a 1000x1000 image with k=50: 100,050 vs 1,000,000 (~10x compression)
```
7. Norms
Norms measure the "size" of vectors and matrices. Different norms have different properties and ML applications.
```python
import numpy as np

x = np.array([3.0, -4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

# Vector norms
L1   = np.sum(np.abs(x))      # 7.0 -- Manhattan distance, promotes sparsity
L2   = np.sqrt(np.sum(x**2))  # 5.0 -- Euclidean distance
Linf = np.max(np.abs(x))      # 4.0 -- maximum absolute value

# Matrix norms
frobenius = np.sqrt(np.sum(A**2))                       # Element-wise L2
spectral  = np.max(np.linalg.svd(A, compute_uv=False))  # Largest singular value
```
ML applications
- L2 regularization (Ridge / weight decay): `loss + lambda * ||W||_2^2`
- L1 regularization (Lasso): `loss + lambda * ||W||_1` (produces sparse weights)
- Gradient clipping: rescale the gradient if `||g||_2 > threshold`
- Normalization layers: dividing by the standard deviation is dividing by the L2 norm of the centered features, up to a constant factor
- Cosine similarity: `cos(a,b) = (a . b) / (||a||_2 * ||b||_2)`
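Gradient clipping from the list above is a two-liner; a minimal sketch (the function name is illustrative, not from any particular library):

```python
import numpy as np

def clip_grad_norm(g, max_norm):
    """Rescale g so that ||g||_2 <= max_norm (global-norm clipping)."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

g = np.array([3.0, 4.0])                       # ||g||_2 = 5
print(np.linalg.norm(clip_grad_norm(g, 1.0)))  # ~1.0
```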
8. Hadamard (Element-wise) Product
Not a "standard" linear algebra operation, but ubiquitous in ML. Multiply corresponding elements.
```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = A * B  # Element-wise in numpy (NOT matrix multiplication)
# C[i, j] = A[i, j] * B[i, j]  ->  [[5, 12], [21, 32]]
```
Where it appears
- Attention masking: Zero out positions by multiplying with a 0/1 mask
- Gating mechanisms: LSTM gates multiply element-wise with cell state
- Feature interactions: Element-wise products of embeddings
- Dropout: Element-wise multiplication with random binary mask
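Dropout from the last bullet is just a Hadamard product with a random mask; a minimal sketch using the standard inverted-dropout scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
p = 0.5  # Drop probability

# Zero each element with probability p; scale survivors by 1/(1-p)
# so the expected activation is unchanged.
mask = (rng.random(x.shape) >= p) / (1 - p)
out = x * mask  # Hadamard product with the mask
```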
9. Cholesky Decomposition
For symmetric positive definite matrices (like covariance matrices), A = L @ L^T where L is lower triangular.
```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])  # Symmetric positive definite
L = np.linalg.cholesky(A)
# Solve Ax = b efficiently:
# 1. Solve L y = b     (forward substitution)
# 2. Solve L^T x = y   (backward substitution)
```
Where it appears
- Gaussian process inference (inverting kernel matrices efficiently)
- Sampling from multivariate Gaussians: `samples = L @ z + mean` where `z ~ N(0, I)`
- Log-determinant computation: `log|A| = 2 * sum(log(diag(L)))`
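The sampling recipe above in full; with enough samples the empirical covariance matches the target (seeded random draws, illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])
cov = np.array([[2.0, 0.6],
                [0.6, 1.0]])

L = np.linalg.cholesky(cov)
z = rng.standard_normal((2, 100_000))  # z ~ N(0, I)
samples = (L @ z).T + mean             # samples ~ N(mean, cov)

print(np.round(np.cov(samples.T), 2))  # ≈ cov
```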
Cholesky is twice as fast as general LU decomposition and is numerically very stable for the matrices that appear in ML (which are usually symmetric positive definite after regularization).
10. Matrix Rank and Condition Number
The rank tells you how many linearly independent rows (or columns) a matrix has. The condition number tells you how numerically stable operations on that matrix will be.
```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0001]])  # Nearly singular
rank = np.linalg.matrix_rank(A)
cond = np.linalg.cond(A)
# High condition number = numerically unstable
# Rule of thumb: as cond approaches 1/machine_epsilon, expect problems
# For float64: machine_epsilon ≈ 2.2e-16, so cond > ~1e16 is bad
```
ML implications
- Rank-deficient feature matrices indicate multicollinearity (redundant features)
- High condition number explains training instability and exploding gradients
- Low-rank approximation (SVD truncation, LoRA) is effective when the data is inherently low-rank
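A sketch of the first bullet: appending an exactly redundant feature (the sum of two existing columns) drops the rank and blows up the condition number.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
# Append a redundant column: the sum of columns 0 and 1.
X_bad = np.hstack([X, X[:, :1] + X[:, 1:2]])

print(np.linalg.matrix_rank(X_bad))  # 3, not 4 -> rank-deficient
print(np.linalg.cond(X_bad))         # Enormous: numerically singular
```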
Putting It All Together
Here is a mental model for how these operations connect in a typical ML pipeline:
```python
# Data preparation
X = load_data()                  # Matrix representation
X = X - X.mean(axis=0)           # Centering (vector subtraction)
X = X / X.std(axis=0)            # Scaling (element-wise division)

# Dimensionality reduction (eigendecomposition or SVD)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = X @ Vt[:k].T         # Matrix multiplication

# Model (linear regression via the normal equation)
I = np.eye(X.shape[1])           # lam: regularization strength
beta = np.linalg.solve(X.T @ X + lam * I, X.T @ y)  # Solve instead of invert

# Neural network (forward pass)
h1 = relu(X @ W1 + b1)           # Matmul + element-wise
h2 = relu(h1 @ W2 + b2)          # Matmul + element-wise
output = h2 @ W3 + b3            # Matmul

# Gradient (backward pass)
dW3 = h2.T @ d_output                # Transpose + matmul
dW2 = h1.T @ (d_h2 * relu_grad(h2))  # Transpose + Hadamard + matmul
```
Every line is a matrix operation. The entire ML pipeline — from raw data to trained model — is a sequence of multiplications, transpositions, decompositions, and element-wise operations. Master these ten operations, and you have the tools to understand any ML paper, debug any dimension mismatch, and optimize any computational bottleneck.