2x speed-up via 1/2 hardware concurrency #1
TL;DR: use navigator.hardwareConcurrency / 2 in main-worker.js
I maintain two open source Flutter libraries for cross-platform ML, covering macOS, iOS, Android, Windows, and Linux.
FONNX wraps the ONNX runtime.
FLLAMA wraps llama.cpp - except on web.
I saw your post in /r/localllama a few days ago (I'm refulgentis).
Today, I looked at the code: it is the first project in many months to run llama.cpp on WASM. Excellent work.
Also this week, I updated FLLAMA's llama.cpp version and hit a really interesting issue on Android: loading a 3B model took 3 minutes instead of the usual 15 seconds. The cause turned out to be setting the number of threads equal to the number of CPU cores; changing it from 4 to 2 fixed the load time and made inference much faster too.
After playing around with this project for an hour trying to speed it up, I realized the same trick worked.
It may seem hacky, but I recommend changing the use of navigator.hardwareConcurrency to navigator.hardwareConcurrency / 2 (a rough sketch follows the list below):
- I don't 100% understand why it helps so much, other than the general reasons (threads can get starved for data, etc.)
- In my experience, it is also best practice for ML on the web generally. Nearly all the ONNX web implementations I've seen do the same thing.
- Part of me thinks it has something to do with a change in llama.cpp, because my slow Android load appeared sometime between llama.cpp commits ceebbb5b21b971941b2533210b74bf359981006c and 7930a8a6e89a04c77c51e3ae5dc1cd8e845b6b8f. But that is unlikely: the Android problem was an extremely slow model load, while inference speed stayed the same.
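To make the suggestion concrete, here is a minimal sketch of the change in main-worker.js. I'm assuming the worker computes a thread count once and passes it to whatever starts the WASM module; nThreads and the call site below are hypothetical, and the real names in this repo may differ.

```js
// Minimal sketch (hypothetical names): choose the thread count once in main-worker.js.
// navigator.hardwareConcurrency reports logical cores and is available inside workers.
const logicalCores = navigator.hardwareConcurrency || 4;     // fall back if the API is missing
const nThreads = Math.max(1, Math.floor(logicalCores / 2));  // 1/2 hardware concurrency, never below 1

// Wherever the module is launched, pass nThreads instead of the full core count.
// Illustrative call site only, not this repo's exact API:
// startLlama(modelUrl, prompt, nThreads);
```

The Math.max(1, ...) clamp keeps single-core devices working; the benchmarks below show why the full core count is the wrong default.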
Benchmarks (M2 Max/Ultra MacBook Pro), number of threads → tokens per second:

Phi2:
1 thread: 1.2
6 threads: 6.5 (1/2 hardware concurrency)
8 threads: 7.4
12 threads: 3.6 (hardware concurrency)

Mistral:
1 thread: 13
6 threads: 62 (1/2 hardware concurrency)
8 threads: 65
12 threads: 26 (hardware concurrency)