CPP-vs-Torch: Outpacing PyTorch from scratch

Everyone uses PyTorch and NumPy, but very few people actually know what happens under the hood when you multiply two tensors. I decided to build two deep learning engines completely from scratch just to see how deep the rabbit hole goes, and to answer a single question:

Can a solo dev build a matrix multiplication engine in raw C/C++ that beats PyTorch?

Spoiler alert: I actually pulled it off lmao.

This repository contains two backend engines built entirely from the ground up:

macrograd: A pure C backend utilizing a custom Bump Arena Allocator.
minigrad: A C++ backend relying on std::vector and RAII principles.
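
To make the bump arena idea concrete, here is a minimal sketch of what such an allocator can look like. It is not macrograd's actual code: the names (Arena, arena_create, arena_alloc, arena_checkpoint, arena_restore, arena_destroy) and the 4096-byte page size are assumptions made for illustration. The point is that an allocation is just an offset bump, and rewinding to a checkpoint is a single assignment.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical bump arena: one upfront allocation, pointer-bump allocs, O(1) rewind.
struct Arena {
    std::uint8_t* base;   // start of the backing buffer
    std::size_t capacity; // total bytes available
    std::size_t offset;   // bytes handed out so far
};

// Allocate the backing buffer once and write one byte per page so the OS
// maps the pages now instead of on first use. 4096 is an assumed page size.
inline Arena arena_create(std::size_t capacity) {
    Arena a{static_cast<std::uint8_t*>(std::malloc(capacity)), capacity, 0};
    if (a.base == nullptr) {
        a.capacity = 0;
        return a;
    }
    volatile std::uint8_t* touch = a.base;
    for (std::size_t i = 0; i < capacity; i += 4096) touch[i] = 0;
    return a;
}

// Bump-allocate `size` bytes. `align` must be a power of two; it is applied to
// the offset, so it is only honored up to the alignment malloc gave `base`.
// Returns nullptr when the arena is full.
inline void* arena_alloc(Arena& a, std::size_t size,
                         std::size_t align = alignof(std::max_align_t)) {
    const std::size_t aligned = (a.offset + align - 1) & ~(align - 1);
    if (aligned > a.capacity || size > a.capacity - aligned) return nullptr;
    a.offset = aligned + size;
    return a.base + aligned;
}

// Remember the current high-water mark.
inline std::size_t arena_checkpoint(const Arena& a) { return a.offset; }

// Drop everything allocated since `mark` in O(1): no per-object frees.
inline void arena_restore(Arena& a, std::size_t mark) { a.offset = mark; }

// Release the backing buffer.
inline void arena_destroy(Arena& a) {
    std::free(a.base);
    a = Arena{nullptr, 0, 0};
}
```

In a training loop, a checkpoint taken before the forward pass and restored after the weight update would let every step reuse the same already-mapped memory.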

I benchmarked both engines against PyTorch and NumPy across varying matrix sizes (N=10 to N=2000), stacking optimizations one by one. The performance journey was insane, dropping execution time from an abysmal 17 seconds down to just 0.20 seconds, officially dethroning PyTorch on a single machine.
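
For readers who want to reproduce this kind of measurement, a timing helper might look like the sketch below. It is not the harness from /benchmarking/; time_seconds and matmul_gflops are hypothetical names. The only fixed fact used is that multiplying two N x N matrices takes N^3 multiplications and N^3 additions, so 2N^3 floating-point operations.

```cpp
#include <chrono>
#include <cstddef>

// Wall-clock seconds taken by one call to `fn`, measured with a monotonic clock.
template <typename Fn>
double time_seconds(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Throughput of one n x n by n x n matmul: n^3 multiplies plus n^3 adds.
inline double matmul_gflops(std::size_t n, double seconds) {
    const double flops = 2.0 * static_cast<double>(n) * n * n;
    return flops / seconds / 1e9;
}
```

A single run is noisy, so in practice one would repeat the call several times and report the minimum or median.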

Part 1: The Optimization Journey

I documented the entire process. If you want to see exactly how I stripped away abstractions to achieve bare metal performance, follow the main storyline:

00. Methodology: Defining the exact hardware environment (i7-13650HX, AVX2) and mathematically calculating the absolute physical compute limits of my silicon.
01. The Naive Baseline: Writing the standard O(N^3) triple for loop and discovering it was roughly 73x slower than PyTorch.
02. Beating PyTorch: Perfecting memory access and unlocking hardware threading via pthreads and OpenMP to finally push execution time down to 0.20s and secure the win.
03. End-to-End ML Training: Proving that my optimized C engine can train a Multi-Layer Perceptron up to 2x faster than PyTorch's ATen backend by utilizing an O(1) Arena memory checkpoint to eliminate page faults.
04. Scaling to MNIST: The Framework Overhead Reversal. Analyzing how scaling to a 60,000-image dataset completely dilutes PyTorch's framework overhead (GIL, dynamic ATen graphs, dispatcher), allowing Intel MKL to strike back. (Also detailing a massive, unresolved pthreads energy anomaly for future work.)
05. Diagnosing Erratic Execution Times: Laying out an experimental pipeline to diagnose why single-threaded, CPU-intensive workloads experience 20-second time jumps (investigating thermal throttling, E-core thread migration, and FPU denormal traps).
06. The Limits of Custom SIMD: A realization of the "BLAS Wall". Understanding why a custom AVX2 SIMD implementation cannot beat the hand-tuned assembly micro-kernels and GotoBLAS memory packing of industry-standard libraries like Intel MKL.
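
The gap between steps 01 and 02 starts with memory access order. The sketch below shows the textbook O(N^3) triple loop next to a loop-reordered variant. It assumes square, row-major float matrices stored in std::vector. The function names are hypothetical and this is not the repo's actual kernel, only an illustration of the access-pattern change.

```cpp
#include <cstddef>
#include <vector>

// Naive i-j-k order: the inner loop reads B down a column, so consecutive
// iterations touch addresses n floats apart.
inline void matmul_naive(const std::vector<float>& A, const std::vector<float>& B,
                         std::vector<float>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;
        }
    }
}

// i-k-j order: the inner loop walks one row of B and one row of C with unit
// stride, which is friendlier to caches and to auto-vectorization.
inline void matmul_ikj(const std::vector<float>& A, const std::vector<float>& B,
                       std::vector<float>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) C[i * n + j] = 0.0f;
        for (std::size_t k = 0; k < n; ++k) {
            const float a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j) {
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}
```

Each value of i writes a different row of C, so the outer loop is the natural place to split work across threads.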

Part 2: Hardware & OS Deep Dives

For those interested in the raw hardware mechanisms, Cache Lines, and OS-level virtual memory interactions, I've compiled my extra profiling data into dedicated systems-engineering deep dives:

A1. Cache Misses & Tiling: Hooking into the Linux perf_event_open syscall to mathematically prove how column-major traversal wastes 93.75% of cache line bandwidth, causing 1 billion cache misses, and how matrix tiling fixed it.
A2. OS Jitter & Allocators: Exposing the hidden cost of Demand Paging. Proving why std::vector (whose pages are mapped lazily and fault on first write) triggers thousands of minor page faults, while a pre-faulted C bump allocator triggers zero.
A3. Power Consumption Analysis: Utilizing the Linux RAPL interface to measure the exact microjoule energy cost of training. Proving the "Race to Sleep" concept and demonstrating why optimization is inherently green.
A4. The Branch Predictor & Loop Unrolling: Proving that PyTorch executes 8.5 Billion fewer branches than standard C++ loops at scale, while explaining the dangerous tradeoff of overflowing the Instruction Cache (L1i) with massive loop unrolling.
A5. Compiler Optimization Flags: An A/B test of GCC flags (-O0 through -Ofast), proving how relaxing strict IEEE math compliance triggers auto-vectorization for a massive 5.0x speedup, and exposing the -march=native hardware trap.
A6. Amdahl's Law and Scaling: A thread-scaling sweep from 1 to 20 threads, proving how asymmetric CPU architectures (P-cores vs E-cores) cause immediate performance drops, and why memory-bound code physically cannot scale across threads.
A7. The Roofline Model: Calculating the physical GB/s limits of the hardware during execution, proving mathematically that PyTorch's 606 GFLOPS achieved ~74% of the absolute physical compute limit of the silicon.
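
The 93.75% figure in A1 is what you get if each fetched cache line is 64 bytes and only one 4-byte float from it is used before the line is evicted: 1 - 4/64 = 0.9375. Both sizes are assumptions here, since they are not stated above. The sketch below encodes that arithmetic and shows one common form of matrix tiling. The names wasted_line_fraction and matmul_tiled are hypothetical, and this is not the repo's kernel.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Fraction of each fetched cache line that goes unused when only one
// element per line is consumed before the line is evicted.
constexpr double wasted_line_fraction(std::size_t line_bytes, std::size_t elem_bytes) {
    return 1.0 - static_cast<double>(elem_bytes) / static_cast<double>(line_bytes);
}

// Tiled matmul on square, row-major matrices. Work proceeds in
// tile x tile blocks so each block of A, B and C is reused while it is
// still cache-resident. `tile` must be greater than zero; C must hold n*n floats.
inline void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                         std::vector<float>& C, std::size_t n, std::size_t tile) {
    std::fill(C.begin(), C.end(), 0.0f);
    for (std::size_t ii = 0; ii < n; ii += tile) {
        const std::size_t i_end = std::min(ii + tile, n);
        for (std::size_t kk = 0; kk < n; kk += tile) {
            const std::size_t k_end = std::min(kk + tile, n);
            for (std::size_t jj = 0; jj < n; jj += tile) {
                const std::size_t j_end = std::min(jj + tile, n);
                // Multiply one block of A by one block of B, accumulating into C.
                for (std::size_t i = ii; i < i_end; ++i) {
                    for (std::size_t k = kk; k < k_end; ++k) {
                        const float a = A[i * n + k];
                        for (std::size_t j = jj; j < j_end; ++j) {
                            C[i * n + j] += a * B[k * n + j];
                        }
                    }
                }
            }
        }
    }
}
```

The right tile size depends on the cache sizes of the target CPU and is best found by measurement rather than picked up front.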

Project structure

/macrograd/: The C engine source code
/minigrad/: The C++ engine source code
/benchmarking/: The raw C++ and Python benchmarking scripts, Chrome profiler trace generators, and JSONL data dumps
/reports/: My technical writeups and systems-level proofs on the profiling discoveries
