Issue description
Running llama.cpp through node-llama-cpp is much slower, to the point of being unusable, compared to running llama-cli directly on Windows ARM.
Expected Behavior
- Run `npm install node-llama-cpp`
- Either run `npx --no node-llama-cpp chat` or create a chat session and prompt it in Node.js
- Generation runs at a speed comparable to downloading the same release from https://github.com/ggml-org/llama.cpp/releases and running llama-cli with the same model.
Actual Behavior
- Run `npm install node-llama-cpp`
- Either run `npx --no node-llama-cpp chat` or create a chat session and prompt it in Node.js
- Generation runs considerably slower, probably around 10x as slow, to the point of being unusable: with any added context the machine hangs for minutes before producing the first token and then generates at maybe 2 tps, whereas llama-cli starts generating almost instantly at around 30 tps with the same model.
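For concreteness, the throughput numbers above can be expressed with a tiny helper (illustrative only; `tokensPerSecond` is not part of node-llama-cpp — I just timed the generation manually):

```javascript
// Illustrative helper (not part of node-llama-cpp): compute tokens/second
// from a generated-token count and the elapsed wall-clock time.
function tokensPerSecond(tokenCount, elapsedMs) {
    if (elapsedMs <= 0)
        throw new RangeError("elapsedMs must be positive");
    return tokenCount / (elapsedMs / 1000);
}

// Roughly what I observe with the same model and prompt:
console.log(tokensPerSecond(60, 30_000));  // node-llama-cpp: ~2 tps
console.log(tokensPerSecond(900, 30_000)); // llama-cli: ~30 tps
```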
Steps to reproduce
What I tried:
- Building llama.cpp from source myself. My first attempt was slower than the public release too, but then I was able to generate a comparable build using:

```shell
cmake -B build -G "Ninja" ^
  -DLLAMA_NATIVE=OFF ^
  -DLLAMA_BUILD_SHARED_LIB=ON ^
  -DBUILD_SHARED_LIBS=ON ^
  -DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=ON ^
  -DCMAKE_BUILD_TYPE=Release ^
  -DCMAKE_C_COMPILER="C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang.exe" ^
  -DCMAKE_CXX_COMPILER="C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang++.exe" ^
  -DCMAKE_C_FLAGS="-O3 -flto -march=armv8.2-a+dotprod" ^
  -DCMAKE_CXX_FLAGS="-O3 -flto -march=armv8.2-a+dotprod"
```
This build ran at the same speed as the public release of llama.cpp.
- I then ran `npx --no node-llama-cpp source download --skipBuild` in my project folder to download the same source into node_modules/node-llama-cpp/llama/llama.cpp.
- I deleted the prebuilt binaries in node_modules/@node-llama-cpp/ to make sure they would not be used.
- I then put the same flags from my successful direct build into environment variables in the Windows Native ARM Command Prompt:

```shell
set NODE_LLAMA_CPP_CMAKE_OPTION_DGGML_NATIVE=OFF
set NODE_LLAMA_CPP_CMAKE_OPTION_DGGML_OPENMP=OFF
set NODE_LLAMA_CPP_CMAKE_OPTION_DCMAKE_C_COMPILER="C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang.exe"
set NODE_LLAMA_CPP_CMAKE_OPTION_DCMAKE_CXX_COMPILER="C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang++.exe"
set NODE_LLAMA_CPP_CMAKE_OPTION_DCMAKE_C_FLAGS="-O3 -flto -march=armv8.2-a+dotprod"
set NODE_LLAMA_CPP_CMAKE_OPTION_DCMAKE_CXX_FLAGS="-O3 -flto -march=armv8.2-a+dotprod"
set DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=ON
set DBUILD_SHARED_LIBS=ON
```

- I then ran `npx --no node-llama-cpp source build`. It builds and confirms at the end that the flags were used:
```
√ Compiled llama.cpp
√ Removed temporary files

To use the binary you've just built, use this code:

import {getLlama} from "node-llama-cpp";

const llama = await getLlama({
    gpu: false,
    cmakeOptions: {
        DCMAKE_C_COMPILER: "C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang.exe",
        DCMAKE_C_FLAGS: "-O3 -flto -march=armv8.2-a+dotprod",
        DCMAKE_CXX_COMPILER: "C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang++.exe",
        DCMAKE_CXX_FLAGS: "-O3 -flto -march=armv8.2-a+dotprod",
        DGGML_NATIVE: "OFF",
        DGGML_OPENMP: "OFF"
    }
});
```
And yet, running `npx --no node-llama-cpp chat` is just as slow as with the prebuilt llama.cpp in the steps listed above under "Actual Behavior".
What are the exact differences between `npx --no node-llama-cpp source build` and `cmake --build build --config Release`? Or does llama-cli.exe perhaps apply some sort of configuration flag at runtime that node-llama-cpp does not? Note that Vulkan doesn't need to be used on ARM; on the CPU alone, the speed is normally still pretty decent.
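For what it's worth, my working assumption (not verified against the node-llama-cpp source) is that the `NODE_LLAMA_CPP_CMAKE_OPTION_*` environment variables get mapped to `-D` CMake arguments roughly like this; `cmakeArgsFromEnv` and the mapping itself are my guesses, not the library's actual code:

```javascript
// Sketch of my ASSUMPTION about how node-llama-cpp might turn
// NODE_LLAMA_CPP_CMAKE_OPTION_* environment variables into CMake -D arguments.
// This is not the library's actual implementation.
const PREFIX = "NODE_LLAMA_CPP_CMAKE_OPTION_";

function cmakeArgsFromEnv(env) {
    return Object.entries(env)
        .filter(([key]) => key.startsWith(PREFIX))
        .map(([key, value]) => `-D${key.slice(PREFIX.length)}=${value}`);
}

console.log(cmakeArgsFromEnv({
    NODE_LLAMA_CPP_CMAKE_OPTION_DGGML_NATIVE: "OFF",
    PATH: "C:\\Windows" // unrelated variables should be ignored
}));
// → [ '-DDGGML_NATIVE=OFF' ]
```

If the mapping works anything like this, note that my variable names above already include the leading `D` (e.g. `DGGML_NATIVE`), which would yield `-DDGGML_NATIVE=OFF`; whether that doubled `D` matters is part of what I'd like to understand.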
My Environment
| Dependency | Version |
|---|---|
| Operating System | Apple macOS running Windows 11 ARM via Parallels |
| CPU | M3 Max |
| Node.js version | 22.18.0 |
| Typescript version | 5.8.3 |
| node-llama-cpp version | 3.16.0 |
`npx --yes node-llama-cpp inspect gpu` output:

```
OS: Windows 10.0.26200 (arm64)
Node: 22.18.0 (arm64)
TypeScript: 5.8.3
node-llama-cpp: 3.16.0

Prebuilt binaries: b8095
Cloned source: b8095

Vulkan: Vulkan is detected, but using it failed
To resolve errors related to Vulkan, see the Vulkan guide: https://node-llama-cpp.withcat.ai/guide/vulkan

CPU model: Apple Silicon
Used RAM: 55.49% (4.44GB/7.99GB)
Free RAM: 44.5% (3.56GB/7.99GB)
```
Additional Context
No response
Relevant Features Used
- Metal support
- CUDA support
- Vulkan support
- Grammar
- Function calling
Are you willing to resolve this issue by submitting a Pull Request?
Yes, I have the time, and I know how to start.