2x speed-up via 1/2 hardware concurrency #1
TL;DR: use navigator.hardwareConcurrency / 2 in main-worker.js
I maintain two open source Flutter libraries for cross-platform ML, covering macOS, iOS, Android, Windows, and Linux.
FONNX wraps the ONNX runtime.
FLLAMA wraps llama.cpp - except on web.
I saw your post in /r/localllama a few days ago (I'm refulgentis).
Today, I looked at the code: it is the first project in many months to run llama.cpp on WASM. Excellent work.
Also this week, I updated FLLAMA's llama.cpp version and hit a really interesting issue on Android: loading a 3B model took 3 minutes instead of the usual 15 seconds. The cause turned out to be setting the number of threads equal to the number of CPU cores; changing it from 4 to 2 fixed the load time and made inference much faster too.
After playing around with this project for an hour trying to speed it up, I realized the same trick worked.
It may seem hacky, but I recommend changing the use of navigator.hardwareConcurrency to navigator.hardwareConcurrency / 2 (a rough sketch follows the list below):
- I don't 100% understand why it helps so much, other than the general reasons (threads can get starved for data, etc.)
- In my experience, it is also best practice for ML on the web generally. Nearly all the ONNX web implementations I've seen do the same thing.
- Part of me thinks it has something to do with a change in llama.cpp, because my slow Android load appeared sometime between llama.cpp commits ceebbb5b21b971941b2533210b74bf359981006c and 7930a8a6e89a04c77c51e3ae5dc1cd8e845b6b8f. But that is unlikely: the Android problem was an extremely slow model load, while inference speed stayed the same.
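To make the suggestion concrete, here is a minimal sketch of the change in main-worker.js. I'm assuming the worker computes a thread count once and passes it to whatever starts the WASM module; nThreads and the call site below are hypothetical, and the real names in this repo may differ.

```js
// Minimal sketch (hypothetical names): choose the thread count once in main-worker.js.
// navigator.hardwareConcurrency reports logical cores and is available inside workers.
const logicalCores = navigator.hardwareConcurrency || 4;     // fall back if the API is missing
const nThreads = Math.max(1, Math.floor(logicalCores / 2));  // 1/2 hardware concurrency, never below 1

// Wherever the module is launched, pass nThreads instead of the full core count.
// Illustrative call site only, not this repo's exact API:
// startLlama(modelUrl, prompt, nThreads);
```

The Math.max(1, ...) clamp keeps single-core devices working; the benchmarks below show why the full core count is the wrong default.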
Benchmarks (M2 Max/Ultra MacBook Pro), number of threads → tokens per second:

Phi2:
1 thread: 1.2
6 threads: 6.5 (1/2 hardware concurrency)
8 threads: 7.4
12 threads: 3.6 (hardware concurrency)

Mistral:
1 thread: 13
6 threads: 62 (1/2 hardware concurrency)
8 threads: 65
12 threads: 26 (hardware concurrency)