Comprehensive laboratory for algorithm implementations, Big-O exploration, and CPU vs GPU benchmarking.
Most algorithm repositories are reference collections — code you read, not code you measure. This one is different in three ways.
Every algorithm is benchmarked, not just implemented. The built-in benchmark runner (benchmarks/run_benchmarks.py) measures wall-clock time across configurable input sizes and exports CSV results. Every number in the Key Findings table below is reproducible on your own hardware with a single command.
GPU kernels ship alongside CPU implementations — with automatic fallbacks. The algorithms/gpu/ tree contains CuPy and Numba CUDA implementations of sorting, FFT, convolution, sparse matrix-vector multiply, BFS frontier expansion, prefix scan, and reduction. Every GPU path falls back to NumPy/SciPy when no CUDA device is present, so the full test suite passes on CPU-only machines out of the box.
An interactive Dash dashboard visualises your data, not a screenshot. benchmarks/dashboard_app.py renders a live CPU vs GPU speedup chart, empirical Big-O curves, and per-algorithm breakdown — driven by your own benchmark CSV. The 16.6× GPU speedup on Radix sort at n = 100k is the headline result, but the dashboard shows why: sustained superlinear scaling with array size, not a one-off win at a cherry-picked n.
Almost no Python algorithm repos benchmark anything. Fewer still run on the GPU. This one does both.
algorithms/
├── cpu/ # Standard Python / NumPy implementations
│ ├── backtracking/ # N-Queens
│ ├── data_structures/ # Stack, Queue, LinkedList, DoublyLinkedList,
│ │ # BST, AVL, Trie, SkipList, BloomFilter, MinHeap, UnionFind
│ ├── dynamic_prog/ # Memoized Fibonacci, 0/1 Knapsack
│ ├── geometry/ # Convex hull, Closest pair
│ ├── graphs/ # BFS, DFS, Dijkstra
│ ├── greedy/ # Coin change
│ ├── math/ # GCD, Sieve of Eratosthenes
│ ├── searching/ # Linear, Binary, Jump, Fibonacci search
│ ├── sorting/ # Bubble, Insertion, Selection, Merge, Quick,
│ │ # Heap, Shell, Counting, Radix, Timsort
│ └── strings/ # KMP, Rabin-Karp, Suffix array
└── gpu/ # CuPy / Numba CUDA kernels (optional; CPU fallbacks included)
├── math/ # Matrix multiply (CuPy)
├── sorting/ # GPU sort, parallel sort
├── convolution.py # 1D convolution
├── fft.py # FFT
├── numba_kernels.py # vec_add, matmul (Numba CUDA)
├── numba_spmv.py # Sparse matrix-vector multiply (Numba CUDA)
├── radix_cupy.py # Sort via CuPy
├── reduction.py # Sum / max reduction
├── scan.py # Prefix sum
└── sparse_ops.py # SpMV wrapper (Numba → CuPy → CPU fallback)
assets/ # Static assets (dashboard CSS theme)
benchmarks/ # CLI benchmark runner, CSV output, interactive Dash app
core/ # Shared utilities: timer decorator, data generators
docs/ # Algorithm complexity reference and GPU setup guide
tests/ # pytest test suite (118 tests)
- Python 3.14 on Windows / Linux / macOS
- Optional GPU: NVIDIA GPU with CUDA toolkit (targeting CUDA 13.x). All GPU modules include CPU fallbacks.
1. Install core dependencies
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt2. Optional: GPU setup (CuPy + Numba)
Install a CuPy wheel matching your CUDA runtime:
pip install cupy-cuda13xFor Numba CUDA kernels, conda/mamba is recommended on Windows to avoid binary compatibility issues:
conda create -n pyalg python=3.14 -c conda-forge numba cudatoolkit
conda activate pyalg
pip install -r requirements.txtVerify your GPU environment:
import importlib.util
print('cupy present:', importlib.util.find_spec('cupy') is not None)
try:
import numba
print('numba version:', numba.__version__)
print('numba.cuda available:', numba.cuda.is_available())
except Exception as e:
print('numba:', e)3. Run tests
python -m pytest -q4. Run benchmarks
# Single group
python benchmarks/run_benchmarks.py --group sort --sizes 1000 10000 --repeat 3 --out benchmarks/results_sort.csv
# Full sweep
python benchmarks/run_benchmarks.py --full --sizes 100 1000 10000 --repeat 3 --out benchmarks/results_full.csv
# Merge CSVs
python benchmarks/run_benchmarks.py --merge-inputs "benchmarks/results_*.csv" --merge-out benchmarks/results_combined.csvVerify GPU toolchain before GPU-heavy runs:
python benchmarks/gpu_smoke.py5. Interactive dashboard
python benchmarks/dashboard_app.py --csv benchmarks/results_full.csv| Category | Implementations |
|---|---|
| Sorting | Bubble, Insertion, Selection, Merge, Quick, Heap, Shell, Counting, Radix, Timsort |
| Searching | Linear, Binary, Jump, Fibonacci |
| Graphs | BFS, DFS, Dijkstra |
| Dynamic Programming | Memoized Fibonacci, 0/1 Knapsack |
| Data Structures | Stack, Queue, LinkedList, DoublyLinkedList, BST, AVL, Trie, SkipList, BloomFilter, MinHeap, UnionFind |
| Strings | KMP, Rabin-Karp, Suffix Array |
| Geometry | Convex Hull (monotone chain), Closest Pair |
| Math | GCD (Euclidean), Sieve of Eratosthenes |
| Backtracking | N-Queens |
| Greedy | Coin Change |
| GPU — Sorting | CuPy Sort, Parallel Sort (CuPy w/ CPU fallback), Radix Sort (CuPy) |
| GPU — Math | FFT, Convolution, Vec Add (Numba), Matmul naive + tiled (Numba CUDA shared memory), Matmul (CuPy), Prefix Scan, Sum/Max/Min/Mean Reduce |
| GPU — Sparse | SpMV CSR (Numba CUDA → CuPy → CPU fallback) |
| GPU — Graphs | BFS frontier expansion (CuPy SpMV level-sync → CPU fallback) |
| File | Contents |
|---|---|
docs/algorithm-complexity.md |
Big-O time and space reference for all implemented algorithms |
docs/gpu-kernels.md |
GPU toolchain installation, verification, best practices |
docs/next-iteration-scaffold.md |
Iteration planning scaffold, benchmark protocol, acceptance gates, and report template |
benchmarks/README.md |
Benchmark runner usage, CSV workflow, dashboard guide |
Measured on this machine (CPU-only fallback paths where no GPU is available). Run your own sweep to get GPU numbers.
| Algorithm | GPU wins at n ≈ | Peak speedup (n=100k) | Note |
|---|---|---|---|
| Radix sort (CuPy) | 1 000 | 16.6× | Monotonically improves with n |
| SpMV CSR | 1 000 | 1.1× | Stable; transfer overhead limits gain |
| FFT | 100 000 | 1.3× | GPU slower than NumPy below n=50k |
| matmul | 100 000 | 1.0× | Break-even; tiled kernel needed for dense gains |
| BFS (frontier) | never (at these sizes) | 0.5× | Graph irregularity kills GPU parallelism |
| max/sum reduce | never (at these sizes) | 0.9× | NumPy BLAS baseline is hard to beat |
FFT anomaly at n=5 000 (0.16× = 6× slower): CuPy warmup + PCIe transfer overhead dominates small arrays. Radix is the portfolio headline: sustained superlinear scaling on GPU.
Generate the interactive speedup chart:
python benchmarks/dashboard_app.py --csv benchmarks/results_full.csv
# Open http://127.0.0.1:8050 → "CPU vs GPU Speedup" tabRegenerate static plots after a new benchmark run:
python benchmarks/plot_results.py --csv benchmarks/results_full.csv --out assets/plots