Everyone uses PyTorch and NumPy, but very few people actually know what happens under the hood when you multiply two tensors. I decided to build two deep learning engines completely from scratch just to see how deep the rabbit hole goes, and to answer a single question.
Spoiler alert: I actually pulled it off lmao.
This repository contains two backend engines built entirely from the ground up:
- macrograd: A pure C backend utilizing a custom Bump Arena Allocator.
- minigrad: A C++ backend relying on `std::vector` and RAII principles.
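For readers who have not met a bump arena before, here is a minimal sketch of the idea in C. It is an illustration under assumptions, not macrograd's actual code: the `Arena` struct and every `arena_*` function name are hypothetical, and pre-faulting is approximated by writing to the whole buffer once at startup. Allocation only advances an offset, and a checkpoint is just a saved offset, which is why rolling back costs $O(1)$.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names for illustration; not macrograd's real API. */
typedef struct {
    unsigned char *base;  /* start of the backing buffer */
    size_t capacity;      /* total bytes reserved up front */
    size_t offset;        /* bytes handed out so far */
} Arena;

/* Reserve the whole buffer once and write to every byte, so the pages
   are already mapped before any timed region starts. A non-zero fill
   is used so the compiler cannot fold malloc+memset into calloc. */
int arena_init(Arena *a, size_t capacity) {
    a->base = malloc(capacity);
    if (a->base == NULL) return -1;
    memset(a->base, 0xA5, capacity);
    a->capacity = capacity;
    a->offset = 0;
    return 0;
}

/* Bump allocation: round the current address up to 'align'
   (a power of two), then advance the offset past the new block. */
void *arena_alloc(Arena *a, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    uintptr_t cur = (uintptr_t)a->base + a->offset;
    uintptr_t aligned = (cur + (align - 1)) & ~((uintptr_t)align - 1);
    size_t start = a->offset + (size_t)(aligned - cur);
    if (start > a->capacity || size > a->capacity - start) return NULL;
    a->offset = start + size;
    return a->base + start;
}

/* A checkpoint is just the current offset. */
size_t arena_checkpoint(const Arena *a) { return a->offset; }

/* Restoring drops everything allocated after the checkpoint in O(1). */
void arena_restore(Arena *a, size_t mark) {
    assert(mark <= a->offset);
    a->offset = mark;
}

void arena_destroy(Arena *a) {
    free(a->base);
    a->base = NULL;
    a->capacity = 0;
    a->offset = 0;
}
```

In a training loop, one plausible usage is to take a checkpoint after allocating the weights, allocate the per-step temporaries, and restore the checkpoint at the end of each step so the same already-mapped memory is reused.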
I benchmarked both engines against PyTorch and NumPy across varying matrix sizes (N=10 to N=2000), stacking optimizations one by one. The performance journey was insane, dropping execution time from an abysmal 17 seconds down to just 0.20 seconds, officially dethroning PyTorch on a single machine.
I documented the entire process. If you want to see exactly how I stripped away abstractions to achieve bare metal performance, follow the main storyline:
- 00. Methodology: Defining the exact hardware environment (i7-13650HX, AVX2) and mathematically calculating the absolute physical compute limits of my silicon.
- 01. The Naive Baseline: Writing the standard O(N^3) triple for loop and discovering it was roughly 73x slower than PyTorch.
- 02. Beating PyTorch: Perfecting memory access and unlocking hardware threading via pthreads and OpenMP to finally push execution time down to 0.20s and secure the win.
- 03. End-to-End ML Training: Proving that my optimized C engine can train a Multi-Layer Perceptron up to 2x faster than PyTorch's ATen backend by utilizing an $O(1)$ Arena memory checkpoint to eliminate page faults.
- 04. Scaling to MNIST: The Framework Overhead Reversal. Analyzing how scaling to a 60,000-image dataset completely dilutes PyTorch's framework overhead (GIL, dynamic ATen graphs, dispatcher), allowing Intel MKL to strike back. (Also detailing a massive, unresolved Pthreads energy anomaly for future work.)
- 05. Diagnosing Erratic Execution Times: Laying out an experimental pipeline to diagnose why single-threaded, CPU-intensive workloads experience 20-second time jumps (investigating Thermal Throttling, E-core thread migration, and FPU Denormal traps).
- 06. The Limits of Custom SIMD: A realization of the "BLAS Wall". Understanding why a custom AVX2 SIMD implementation cannot beat the hand-tuned assembly micro-kernels and GotoBLAS memory packing of industry-standard libraries like Intel MKL.
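To make the starting point of the storyline concrete, here is a minimal sketch of the O(N^3) triple loop next to one common memory-access fix, the i-k-j loop order. The function names are hypothetical, and the reordering is my assumption of a typical first step; the reports may use a different sequence of optimizations. Both functions assume square, row-major `float` matrices.

```c
#include <assert.h>
#include <stddef.h>

/* Naive i-j-k order: C = A * B for row-major n x n matrices.
   The inner loop walks B down a column, jumping n floats per step. */
void matmul_naive(const float *A, const float *B, float *C, size_t n) {
    assert(A && B && C);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}

/* i-k-j order: same arithmetic, but the inner loop now reads a row of B
   and updates a row of C, so both accesses are contiguous in memory. */
void matmul_ikj(const float *A, const float *B, float *C, size_t n) {
    assert(A && B && C);
    for (size_t i = 0; i < n * n; i++) {
        C[i] = 0.0f;  /* C is accumulated into, so clear it first */
    }
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            float a = A[i * n + k];  /* reused across the whole inner loop */
            for (size_t j = 0; j < n; j++) {
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}
```

The two versions do the same number of multiply-adds; only the order in which memory is touched changes, which is what makes the contiguous inner loop friendlier to the cache and to auto-vectorization.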
For those interested in the raw hardware mechanisms, Cache Lines, and OS-level virtual memory interactions, I've compiled my extra profiling data into dedicated systems-engineering deep dives:
- A1. Cache Misses & Tiling: Hooking into the Linux `perf_event_open` syscall to mathematically prove how column-major traversal wastes 93.75% of cache line bandwidth, causing 1 billion cache misses, and how matrix tiling fixed it.
- A2. OS Jitter & Allocators: Exposing the hidden cost of Demand Paging. Proving why `std::vector` (which maps virtual pages lazily and faults on write) triggers thousands of minor page faults compared to a pre-faulted C bump allocator (zero).
- A3. Power Consumption Analysis: Utilizing the Linux RAPL interface to measure the exact microjoule energy cost of training. Proving the "Race to Sleep" concept and demonstrating why optimization is inherently green.
- A4. The Branch Predictor & Loop Unrolling: Proving that PyTorch executes 8.5 Billion fewer branches than standard C++ loops at scale, while explaining the dangerous tradeoff of overflowing the Instruction Cache (L1i) with massive loop unrolling.
- A5. Compiler Optimization Flags: An A/B test of GCC flags (`-O0` through `-Ofast`), proving how relaxing strict IEEE math compliance triggers auto-vectorization for a massive 5.0x speedup, and exposing the `-march=native` hardware trap.
- A6. Amdahl's Law and Scaling: A thread-scaling sweep from 1 to 20 threads, proving how asymmetric CPU architectures (P-cores vs E-cores) cause immediate performance drops, and why memory-bound code physically cannot scale across threads.
- A7. The Roofline Model: Calculating the physical GB/s limits of the hardware during execution, proving mathematically that PyTorch's 606 GFLOPS achieved ~74% of the absolute physical compute limit of the silicon.
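The 93.75% figure in A1 is consistent with a 64-byte cache line and 4-byte floats (both sizes are my assumption here): a strided access uses 4 of the 64 bytes it pulls in, so 1 - 4/64 = 93.75% of the line goes unused. Tiling attacks this by working on small blocks that stay in cache while they are reused. Below is a minimal sketch of a tiled multiply. The function name and the `TILE` value of 32 are hypothetical, not the tuned parameters from the reports.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32  /* hypothetical tile edge; would be tuned per cache size */

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Tiled C = A * B for row-major n x n matrices. The three outer loops
   pick a block of C, A, and B; the three inner loops do an i-k-j
   multiply restricted to that block, so its data is reused while hot. */
void matmul_tiled(const float *A, const float *B, float *C, size_t n) {
    assert(A && B && C);
    for (size_t i = 0; i < n * n; i++) {
        C[i] = 0.0f;  /* C is accumulated into across k-blocks */
    }
    for (size_t ii = 0; ii < n; ii += TILE) {
        size_t i_end = min_size(ii + TILE, n);  /* clamp partial edge tiles */
        for (size_t kk = 0; kk < n; kk += TILE) {
            size_t k_end = min_size(kk + TILE, n);
            for (size_t jj = 0; jj < n; jj += TILE) {
                size_t j_end = min_size(jj + TILE, n);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < j_end; j++) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The clamping with `min_size` lets the sketch handle sizes that are not a multiple of the tile edge, which matters for a sweep that includes small sizes like N=10.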
- `/macrograd/`: The C engine source code
- `/minigrad/`: The C++ engine source code
- `/benchmarking/`: The raw C++ and Python benchmarking scripts, Chrome profiler trace generators, and JSONL data dumps
- `/reports/`: My technical writeups and systems-level proofs on the profiling discoveries