Original Research

Matrix Operation Benchmark — NumPy Performance Across Matrix Sizes

Real benchmark data for 8 NumPy matrix operations across 6 matrix sizes (10x10 to 1000x1000). Each measurement is the average of 5 runs with standard deviation, executed on Apple Silicon with Accelerate BLAS.

By Michael Lip · Updated April 2026

Methodology

Benchmarks were executed on April 11, 2026 using Python 3.9 with NumPy on macOS (Apple Silicon) with Apple Accelerate as the BLAS backend. Each operation was timed using time.perf_counter() with 5 repetitions per (operation, size) pair. Input matrices are random float64 values from np.random.randn(n, n). Operations per second = 1000 / avg_ms. All times in milliseconds. The determinant overflow warning at large sizes is expected and does not affect timing accuracy.
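The timing loop described above can be sketched as follows. The helper name `bench` and the choice of `np.linalg.inv` as the example operation are illustrative, not the benchmark's actual code; the structure (perf_counter, 5 repetitions, randn inputs, ops/second = 1000 / avg_ms) mirrors the stated methodology:

```python
import time
import numpy as np

def bench(op, n, repeats=5):
    """Time op on a random n x n float64 matrix; return (avg_ms, std_ms)."""
    a = np.random.randn(n, n)
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        op(a)
        samples.append((time.perf_counter() - t0) * 1000.0)  # seconds -> ms
    return float(np.mean(samples)), float(np.std(samples))

avg_ms, std_ms = bench(np.linalg.inv, 100)
ops_per_second = 1000.0 / avg_ms  # same formula the table uses
```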

| Operation | Matrix Size | Avg Time (ms) | Std Dev (ms) | Ops/Second | Complexity |
|---|---|---|---|---|---|
| matmul | 10x10 | 0.006 | 0.010 | 166,667 | O(n^3) |
| matmul | 50x50 | 0.208 | 0.406 | 4,808 | O(n^3) |
| matmul | 100x100 | 0.012 | 0.002 | 83,333 | O(n^3) |
| matmul | 250x250 | 0.096 | 0.001 | 10,417 | O(n^3) |
| matmul | 500x500 | 0.676 | 0.071 | 1,479 | O(n^3) |
| matmul | 1000x1000 | 5.447 | 0.888 | 184 | O(n^3) |
| inverse | 10x10 | 0.010 | 0.008 | 100,000 | O(n^3) |
| inverse | 50x50 | 0.211 | 0.351 | 4,739 | O(n^3) |
| inverse | 100x100 | 0.095 | 0.013 | 10,526 | O(n^3) |
| inverse | 250x250 | 0.558 | 0.033 | 1,792 | O(n^3) |
| inverse | 500x500 | 2.670 | 0.110 | 375 | O(n^3) |
| inverse | 1000x1000 | 15.627 | 0.670 | 64 | O(n^3) |
| determinant | 10x10 | 0.300 | 0.592 | 3,333 | O(n^3) |
| determinant | 50x50 | 0.133 | 0.238 | 7,519 | O(n^3) |
| determinant | 100x100 | 0.390 | 0.701 | 2,564 | O(n^3) |
| determinant | 250x250 | 0.229 | 0.005 | 4,367 | O(n^3) |
| determinant | 500x500 | 1.377 | 0.643 | 726 | O(n^3) |
| determinant | 1000x1000 | 4.486 | 0.127 | 223 | O(n^3) |
| eigenvalues | 10x10 | 0.340 | 0.629 | 2,941 | O(n^3) |
| eigenvalues | 50x50 | 0.474 | 0.401 | 2,110 | O(n^3) |
| eigenvalues | 100x100 | 1.919 | 0.129 | 521 | O(n^3) |
| eigenvalues | 250x250 | 12.876 | 0.102 | 78 | O(n^3) |
| eigenvalues | 500x500 | 48.411 | 0.341 | 21 | O(n^3) |
| eigenvalues | 1000x1000 | 237.491 | 2.326 | 4 | O(n^3) |
| svd | 10x10 | 0.291 | 0.539 | 3,436 | O(n^3) |
| svd | 50x50 | 0.661 | 0.844 | 1,513 | O(n^3) |
| svd | 100x100 | 0.863 | 0.043 | 1,159 | O(n^3) |
| svd | 250x250 | 5.305 | 0.410 | 189 | O(n^3) |
| svd | 500x500 | 22.130 | 0.771 | 45 | O(n^3) |
| svd | 1000x1000 | 105.221 | 1.920 | 10 | O(n^3) |
| transpose | 10x10 | 0.001 | 0.001 | 1,000,000 | O(n^2) |
| transpose | 50x50 | 0.002 | 0.001 | 500,000 | O(n^2) |
| transpose | 100x100 | 0.004 | 0.001 | 250,000 | O(n^2) |
| transpose | 250x250 | 0.026 | 0.008 | 38,462 | O(n^2) |
| transpose | 500x500 | 0.103 | 0.031 | 9,709 | O(n^2) |
| transpose | 1000x1000 | 0.641 | 0.035 | 1,560 | O(n^2) |
| trace | 10x10 | 0.002 | 0.002 | 500,000 | O(n) |
| trace | 50x50 | 0.002 | 0.001 | 500,000 | O(n) |
| trace | 100x100 | 0.002 | 0.001 | 500,000 | O(n) |
| trace | 250x250 | 0.002 | 0.002 | 500,000 | O(n) |
| trace | 500x500 | 0.004 | 0.004 | 250,000 | O(n) |
| trace | 1000x1000 | 0.003 | 0.002 | 333,333 | O(n) |
| qr | 10x10 | 0.029 | 0.024 | 34,483 | O(n^3) |
| qr | 50x50 | 0.972 | 1.830 | 1,029 | O(n^3) |
| qr | 100x100 | 0.184 | 0.013 | 5,435 | O(n^3) |
| qr | 250x250 | 1.079 | 0.038 | 927 | O(n^3) |
| qr | 500x500 | 4.586 | 0.150 | 218 | O(n^3) |
| qr | 1000x1000 | 24.492 | 0.313 | 41 | O(n^3) |

Key Findings

Eigenvalue computation is the bottleneck. At 1000x1000, eigenvalues take 237.5 ms — 44x slower than matmul (5.4 ms) and 2.3x slower than SVD (105.2 ms). This reflects the iterative QR algorithm's convergence requirements versus direct computation in matmul.

Trace is essentially free. At O(n) complexity, trace takes 0.003 ms even at 1000x1000 — it simply sums n diagonal elements. Transpose at O(n^2) is the next fastest, requiring a memory copy but no computation.
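A short illustration of why these two operations are cheap. Note that plain `a.T` in NumPy is a zero-copy view, so a transpose benchmark presumably materializes the result; the `np.ascontiguousarray` call below is one way to force the O(n^2) copy, and is an assumption rather than the benchmark's actual code:

```python
import numpy as np

a = np.random.randn(4, 4)

# trace sums only the n diagonal entries: O(n) work
assert np.isclose(np.trace(a), a.diagonal().sum())

t_view = a.T                         # zero-copy view, no data moved
t_copy = np.ascontiguousarray(a.T)   # forces the O(n^2) element copy
assert t_view.base is a              # the view shares a's buffer
assert t_copy.base is not a          # the copy does not
```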

BLAS makes matmul deceptively fast. The 100x100 matmul (0.012 ms) is faster in absolute time than 50x50 (0.208 ms). The 50x50 standard deviation (0.406 ms, nearly twice the mean) points to one-time warm-up cost on the earliest calls (library dispatch, thread-pool startup, cache population) inflating the small-size average; once that overhead amortizes, BLAS also achieves better vectorization and cache utilization at larger sizes.

Scaling follows cubic complexity. From 500x500 to 1000x1000 (2x dimension), matmul goes from 0.676 to 5.447 ms (8.1x), inverse from 2.670 to 15.627 ms (5.9x), and eigenvalues from 48.4 to 237.5 ms (4.9x). Matmul tracks the theoretical 8x almost exactly, while the below-8x scaling for inverse and eigenvalues indicates better hardware utilization at larger sizes.
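The ratios quoted above can be checked directly from the table:

```python
# Avg times (ms) from the benchmark table, 500x500 vs 1000x1000
times_500 = {"matmul": 0.676, "inverse": 2.670, "eigenvalues": 48.411}
times_1000 = {"matmul": 5.447, "inverse": 15.627, "eigenvalues": 237.491}

# Doubling n at O(n^3) predicts an 8x slowdown
ratios = {op: times_1000[op] / times_500[op] for op in times_500}
# matmul ~8.1x, inverse ~5.9x, eigenvalues ~4.9x
```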

Frequently Asked Questions

How fast is NumPy matrix multiplication?

NumPy matmul is extremely fast thanks to BLAS. A 100x100 matmul takes 0.012 ms, 500x500 takes 0.676 ms, and 1000x1000 takes 5.447 ms. NumPy delegates to hardware-optimized routines (Accelerate, OpenBLAS, MKL) using SIMD, cache blocking, and multi-threading.

Why is eigenvalue computation so much slower than matrix multiplication?

Eigenvalue computation does roughly 10x the floating-point work of matmul: both are O(n^3), but the constant factor is much larger. At 1000x1000, eigenvalues take 237.5 ms vs 5.4 ms for matmul, 44x slower. The eigensolver reduces the matrix to Hessenberg form and then applies iterative QR steps that converge gradually, while matmul is a single direct pass over the data.
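One practical consequence, not measured in this benchmark: when a matrix is known to be symmetric, NumPy's dedicated symmetric solver (`np.linalg.eigvalsh`) takes a cheaper path than the general iterative one and returns real eigenvalues directly. A small sketch:

```python
import numpy as np

a = np.random.randn(200, 200)
sym = (a + a.T) / 2  # symmetrize to get a symmetric test matrix

w_general = np.linalg.eigvals(sym)   # general dense eigensolver
w_sym = np.linalg.eigvalsh(sym)      # symmetric solver: faster, real output

# Same spectrum up to ordering and round-off
assert np.allclose(np.sort(w_general.real), np.sort(w_sym))
```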

What is the fastest matrix operation in NumPy?

Trace is fastest — it sums diagonal elements in O(n) time, taking 0.003 ms at 1000x1000. Transpose is second at 0.641 ms (O(n^2) copy). Neither invokes BLAS or LAPACK. Among the accelerated operations, matmul is fastest because it maps to the heavily optimized BLAS Level 3 GEMM routine.

How does matrix size affect NumPy performance?

Doubling the dimension increases time by ~8x for O(n^3) operations. However, BLAS achieves better utilization at larger sizes — a 1000x1000 matmul runs at higher sustained GFLOPS than a 100x100 one thanks to better cache, SIMD, and multi-core utilization.
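Sustained throughput makes this concrete. Using the standard 2n^3 FLOP count for a dense matmul and the averaged times from the table (the helper name below is illustrative):

```python
def matmul_gflops(n, avg_ms):
    """Sustained GFLOPS for an n x n matmul, assuming 2*n**3 FLOPs."""
    return (2.0 * n**3) / (avg_ms * 1e-3) / 1e9

small = matmul_gflops(100, 0.012)    # 100x100 at 0.012 ms, ~167 GFLOPS
large = matmul_gflops(1000, 5.447)   # 1000x1000 at 5.447 ms, ~367 GFLOPS
```

The larger problem sustains roughly twice the throughput of the smaller one, even though both use the same BLAS routine.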

Which BLAS backend does NumPy use and does it matter?

NumPy supports Accelerate (macOS), OpenBLAS (Linux), MKL (Anaconda), and BLIS. For 1000x1000 matmul, MKL and Accelerate are typically 10-30% faster than OpenBLAS. Check your backend with np.show_config(). For eigenvalue/SVD, LAPACK quality matters more than BLAS backend.
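The backend check mentioned above, with the output captured so it can be inspected programmatically rather than just printed (a minimal sketch):

```python
import io
import contextlib
import numpy as np

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    np.show_config()  # prints the BLAS/LAPACK build configuration
config_text = buf.getvalue()
# config_text names the linked backend (e.g. Accelerate, OpenBLAS, MKL)
```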