Linear Algebra Operations Every ML Engineer Should Know

March 10, 2025 · 12 min read

Machine learning is applied linear algebra. This is not a metaphor. When you define a neural network layer, you are specifying a matrix multiplication. When you compute a loss gradient, you are performing a matrix transpose and multiply. When you reduce dimensions with PCA, you are computing an eigendecomposition.

This post is a practical reference — not a textbook chapter. For each operation, we cover what it does, where it appears in ML, and the implementation details that matter in production.

1. Matrix Multiplication

The single most important operation in ML. If you understand nothing else, understand this.

# The rule: (m, n) @ (n, p) → (m, p)
# Inner dimensions must match. Result has outer dimensions.

import numpy as np
A = np.array([[1, 2], [3, 4]])      # (2, 2)
B = np.array([[5, 6], [7, 8]])      # (2, 2)
C = A @ B                            # (2, 2)
# C = [[1*5+2*7, 1*6+2*8],
#      [3*5+4*7, 3*6+4*8]]
# C = [[19, 22],
#      [43, 50]]

Where it appears

Dense (fully connected) layers, attention scores (Q @ K^T), embedding projections, and convolutions, which are typically lowered to matrix multiplications via im2col. Essentially every forward and backward pass is dominated by matmuls.

Production considerations

Matrix multiplication is O(n^3) for square matrices, but hardware (GPUs, TPUs) is specifically designed to parallelize it. NVIDIA's tensor cores perform 4x4 matmul in a single clock cycle. The practical bottleneck is usually memory bandwidth (moving data to/from compute units), not arithmetic throughput.

For small matrices (under 5x5), you can verify your multiplication by hand before scaling up.

2. Transpose

Swaps rows and columns. Turns (m, n) into (n, m).

A = np.array([[1, 2, 3],
              [4, 5, 6]])      # (2, 3)
A_T = A.T                      # (3, 2)
# [[1, 4],
#  [2, 5],
#  [3, 6]]

Where it appears

Gradient computations in backpropagation (dW = x^T @ d_out), Gram and covariance matrices (X^T @ X), and switching data between (batch, features) and (features, batch) layouts.

Key property: (AB)^T = B^T A^T. The transpose of a product reverses the order. This rule appears constantly in gradient derivations.
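The reversal rule is easy to verify numerically. A quick check with arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))

# (AB)^T has shape (4, 2); B^T A^T is (4, 3) @ (3, 2) -> (4, 2)
lhs = (A @ B).T
rhs = B.T @ A.T
assert np.allclose(lhs, rhs)
```

Note that A^T B^T would not even have compatible shapes here, which is a useful memory aid: the reversal is forced by the dimensions.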

3. Determinant

A scalar value that encodes several properties of a square matrix at once.

# 2x2 determinant
a, b, c, d = 1.0, 2.0, 3.0, 4.0      # example entries
A = np.array([[a, b], [c, d]])
det_A = a*d - b*c                    # -2.0

# For any size
det_A = np.linalg.det(A)

Where it appears

The log-likelihood of a multivariate Gaussian (via log det Sigma), change-of-variables in normalizing flows (the log-det of the Jacobian), and checking whether a transformation is invertible (det = 0 means it is not).

Practical note

In production, you rarely compute determinants directly. Instead, you use decompositions (LU, Cholesky) and compute the determinant as a product of diagonal elements. For log-determinants (common in Gaussian models), this avoids numerical overflow.
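A minimal sketch of the log-determinant trick on a small synthetic SPD matrix. NumPy's slogdet does the same thing internally via a decomposition:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 5))
A = X.T @ X / len(X) + 1e-3 * np.eye(5)    # SPD, covariance-like

# Stable log-determinant via Cholesky: det(A) = prod(diag(L))^2,
# so log det(A) = 2 * sum(log(diag(L))) -- no overflow
L = np.linalg.cholesky(A)
logdet_chol = 2.0 * np.sum(np.log(np.diag(L)))

# NumPy's built-in returns (sign, log|det|)
sign, logdet_np = np.linalg.slogdet(A)
assert sign > 0
assert np.isclose(logdet_chol, logdet_np)
```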

4. Matrix Inverse

The matrix A^(-1) such that A @ A^(-1) = I.

# 2x2 inverse formula
a, b, c, d = 1.0, 2.0, 3.0, 4.0      # example entries; requires a*d - b*c != 0
A = np.array([[a, b], [c, d]])
A_inv = (1/(a*d - b*c)) * np.array([[d, -b], [-c, a]])

# General inverse
A_inv = np.linalg.inv(A)

# Better: solve the system directly
x = np.linalg.solve(A, b)  # Equivalent to x = A_inv @ b but faster and more stable

Where it appears

The normal equation for linear regression, Gaussian conditioning (Kalman filters, Gaussian processes), and the Mahalanobis distance, which involves Sigma^(-1).

Production rule

Never call np.linalg.inv() in production code. Use np.linalg.solve() or a decomposition. Direct inversion is O(n^3) and numerically unstable for ill-conditioned matrices. Regularization (A + lambda*I) improves conditioning.

5. Eigendecomposition

Decomposes a square matrix into A = Q * Lambda * Q^(-1), where Lambda is a diagonal matrix of eigenvalues and Q contains the eigenvectors.

eigenvalues, eigenvectors = np.linalg.eig(A)

# For symmetric matrices (covariance, Gram), use eigh (faster, more stable)
eigenvalues, eigenvectors = np.linalg.eigh(A)
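A quick numerical check of the decomposition on an arbitrary symmetric matrix. For symmetric A, Q is orthogonal, so Q^(-1) = Q^T:

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
A = (M + M.T) / 2                        # symmetrize

w, Q = np.linalg.eigh(A)                 # eigenvalues ascending, orthonormal Q
A_rebuilt = Q @ np.diag(w) @ Q.T         # A = Q Lambda Q^T
assert np.allclose(A, A_rebuilt)
assert np.allclose(Q.T @ Q, np.eye(4))   # columns of Q are orthonormal
```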

Where it appears

# PCA implementation
X_centered = X - X.mean(axis=0)
cov = X_centered.T @ X_centered / (len(X) - 1)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Sort descending and keep the top k components
k = 2                                    # number of components to keep
idx = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, idx[:k]]
X_reduced = X_centered @ components

6. Singular Value Decomposition (SVD)

The most powerful decomposition in numerical linear algebra. Works on any matrix (not just square).

U, S, Vt = np.linalg.svd(X)
# U: (m, m) left singular vectors
# S: (min(m,n),) singular values
# Vt: (n, n) right singular vectors (transposed)
# X ≈ U[:, :k] @ diag(S[:k]) @ Vt[:k, :]  (rank-k approximation)
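One way to see why SVD truncation is "best": by the Eckart-Young theorem, the spectral-norm error of the optimal rank-k approximation is exactly the (k+1)-th singular value. A sketch with a random matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 5))
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
X_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]   # best rank-k approximation

# Spectral-norm error of the truncation equals the next singular value
err = np.linalg.norm(X - X_k, ord=2)
assert np.isclose(err, S[k])
```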

Where it appears

# Image compression example
from PIL import Image
img = np.array(Image.open('photo.jpg').convert('L'))  # Grayscale

U, S, Vt = np.linalg.svd(img, full_matrices=False)

# Keep top 50 singular values (out of e.g. 1000)
k = 50
compressed = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Original: m*n values. Compressed: m*k + k + k*n values.
# For 1000x1000 image, k=50: 100,050 vs 1,000,000 (10x compression)

7. Norms

Norms measure the "size" of vectors and matrices. Different norms have different properties and ML applications.

# Vector norms
L1 = np.sum(np.abs(x))           # Manhattan distance, promotes sparsity
L2 = np.sqrt(np.sum(x**2))       # Euclidean distance
Linf = np.max(np.abs(x))         # Maximum absolute value

# Matrix norms
Frobenius = np.sqrt(np.sum(A**2))  # Element-wise L2
spectral = np.max(np.linalg.svd(A, compute_uv=False))  # Largest singular value
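The manual formulas above agree with np.linalg.norm's ord parameter, which is what you would reach for in practice. A quick check:

```python
import numpy as np

x = np.array([3.0, -4.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

# Vector norms: manual formula vs. np.linalg.norm
assert np.isclose(np.sum(np.abs(x)),     np.linalg.norm(x, ord=1))      # L1 = 7
assert np.isclose(np.sqrt(np.sum(x**2)), np.linalg.norm(x))             # L2 = 5 (default)
assert np.isclose(np.max(np.abs(x)),     np.linalg.norm(x, ord=np.inf)) # Linf = 4

# Matrix norms
assert np.isclose(np.sqrt(np.sum(A**2)), np.linalg.norm(A, ord='fro'))
assert np.isclose(np.max(np.linalg.svd(A, compute_uv=False)),
                  np.linalg.norm(A, ord=2))
```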

ML applications

L1 regularization (lasso) promotes sparse weights; L2 regularization (weight decay) shrinks them smoothly; gradient clipping bounds the L2 norm of an update; spectral normalization constrains the largest singular value of a weight matrix.

8. Hadamard (Element-wise) Product

Not a "standard" linear algebra operation, but ubiquitous in ML. Multiply corresponding elements.

C = A * B  # Element-wise in numpy (NOT matrix multiplication)
# C[i,j] = A[i,j] * B[i,j]
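Dropout is the canonical example. A minimal sketch using inverted dropout (p_keep is an illustrative value):

```python
import numpy as np

rng = np.random.default_rng(6)
h = rng.standard_normal((4, 8))            # activations

# Dropout is a Hadamard product with a random 0/1 mask
p_keep = 0.8
mask = (rng.random(h.shape) < p_keep).astype(h.dtype)
h_dropped = h * mask / p_keep              # inverted-dropout scaling

# Positions where the mask is 0 are exactly zeroed out
assert np.all(h_dropped[mask == 0] == 0)
```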

Where it appears

Gating in LSTMs and GRUs, attention masking, dropout, and the activation-derivative term in backpropagation.

9. Cholesky Decomposition

For symmetric positive definite matrices (like covariance matrices), A = L @ L^T where L is lower triangular.

L = np.linalg.cholesky(A)
# Solve Ax = b efficiently:
# 1. Solve Ly = b (forward substitution)
# 2. Solve L^T x = y (backward substitution)
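A sketch of the two-step solve on a small synthetic SPD system. Here np.linalg.solve stands in for dedicated triangular solvers; in production, scipy.linalg.solve_triangular would exploit the triangular structure:

```python
import numpy as np

rng = np.random.default_rng(7)
M = rng.standard_normal((5, 5))
A = M @ M.T + 1e-3 * np.eye(5)    # symmetric positive definite
b = rng.standard_normal(5)

L = np.linalg.cholesky(A)         # A = L L^T
y = np.linalg.solve(L, b)         # forward substitution:  L y = b
x = np.linalg.solve(L.T, y)       # backward substitution: L^T x = y

assert np.allclose(A @ x, b)      # x solves the original system
```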

Where it appears

Sampling from a multivariate Gaussian (x = mu + L @ z), Gaussian process regression, Kalman filters, and solving regularized normal equations.

Cholesky is twice as fast as general LU decomposition and is numerically very stable for the matrices that appear in ML (which are usually symmetric positive semi-definite after regularization).

10. Matrix Rank and Condition Number

The rank tells you how many linearly independent rows (or columns) a matrix has. The condition number tells you how numerically stable operations on that matrix will be.

rank = np.linalg.matrix_rank(A)
cond = np.linalg.cond(A)

# High condition number = numerically unstable
# Rule of thumb: if cond > 1/machine_epsilon, expect problems
# For float64: machine_epsilon ≈ 2.2e-16, so cond > ~1e16 is bad
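A sketch of how near-collinearity blows up the condition number, and how a ridge term repairs it (eps and lam are illustrative values):

```python
import numpy as np

# Two nearly identical columns make the matrix nearly singular
eps = 1e-10
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + eps]])
cond = np.linalg.cond(A)
assert cond > 1e9                          # enormous condition number

# Regularization (the ridge trick) lifts the smallest eigenvalue
lam = 1e-3
G = A.T @ A + lam * np.eye(2)
assert np.linalg.cond(G) < cond            # orders of magnitude better
```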

ML implications

A rank-deficient X^T @ X signals collinear features and non-unique regression solutions; a high condition number amplifies numerical error in training and inference; low-rank structure in weight matrices is what makes compression methods like LoRA work. Regularization (adding lambda to the diagonal) lifts the smallest eigenvalues and lowers the condition number.

Putting It All Together

Here is a mental model for how these operations connect in a typical ML pipeline:

# Data preparation
X = load_data()                           # Matrix representation
X = X - X.mean(axis=0)                    # Centering (vector subtraction)
X = X / X.std(axis=0)                     # Scaling (element-wise division)

# Dimensionality reduction (Eigendecomposition or SVD)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = X @ Vt[:k].T                  # Matrix multiplication

# Model (ridge regression via regularized normal equations)
lam = 1e-2                                # 'lambda' is a reserved word in Python
I = np.eye(X.shape[1])
beta = np.linalg.solve(X.T @ X + lam * I, X.T @ y)  # solve, not inverse

# Neural network (forward pass)
h1 = relu(X @ W1 + b1)                    # Matmul + element-wise
h2 = relu(h1 @ W2 + b2)                   # Matmul + element-wise
output = h2 @ W3 + b3                     # Matmul

# Gradient (backward pass)
dW3 = h2.T @ d_output                     # Transpose + matmul
dW2 = h1.T @ (d_h2 * relu_grad(h2))       # Transpose + Hadamard + matmul

Every line is a matrix operation. The entire ML pipeline — from raw data to trained model — is a sequence of multiplications, transpositions, decompositions, and element-wise operations. Master these ten operations, and you have the tools to understand any ML paper, debug any dimension mismatch, and optimize any computational bottleneck.