
bug: Windows on ARM not running at full possible speed #556

@rogerdcarvalho


Issue description

Running llama.cpp through node-llama-cpp is much slower than running the same release through llama-cli on Windows on ARM, to the point of being unusable.

Expected Behavior

  1. `npm install node-llama-cpp`
  2. Either run `npx --no node-llama-cpp chat` or create a chat session and prompt it in Node.js.
  3. Generation runs at a speed comparable to downloading the same release from https://github.com/ggml-org/llama.cpp/releases and running llama-cli with the same model.

Actual Behavior

  1. `npm install node-llama-cpp`
  2. Either run `npx --no node-llama-cpp chat` or create a chat session and prompt it in Node.js.
  3. Generation runs considerably slower, probably about 10x as slow, to the point of not really being usable: with any context added, the computer hangs for minutes before generating the first token and then produces maybe 2 tokens/s, whereas llama-cli starts generating almost instantly at ~30 tokens/s with the same model.

Steps to reproduce

What I tried:

  1. Building llama.cpp from source myself. My first attempt was also slower than the public release, but I was then able to produce a build of comparable speed using:

    ```
    cmake -B build -G "Ninja" ^
      -DLLAMA_NATIVE=OFF ^
      -DLLAMA_BUILD_SHARED_LIB=ON ^
      -DBUILD_SHARED_LIBS=ON ^
      -DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=ON ^
      -DCMAKE_BUILD_TYPE=Release ^
      -DCMAKE_C_COMPILER="C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang.exe" ^
      -DCMAKE_CXX_COMPILER="C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang++.exe" ^
      -DCMAKE_C_FLAGS="-O3 -flto -march=armv8.2-a+dotprod" ^
      -DCMAKE_CXX_FLAGS="-O3 -flto -march=armv8.2-a+dotprod"
    ```

    This build ran at the same speed as the public release of llama.cpp.
  2. I then ran `npx --no node-llama-cpp source download --skipBuild` in my project folder to download the same source inside `node_modules/node-llama-cpp/llama/llama.cpp`.
  3. I deleted the prebuilt binaries in `node_modules/@node-llama-cpp/` to make sure they would not be used.
  4. I then set the same flags as in my successful direct build as environment variables in the Windows native ARM command prompt:

    ```
    set NODE_LLAMA_CPP_CMAKE_OPTION_DGGML_NATIVE=OFF
    set NODE_LLAMA_CPP_CMAKE_OPTION_DGGML_OPENMP=OFF
    set NODE_LLAMA_CPP_CMAKE_OPTION_DCMAKE_C_COMPILER="C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang.exe"
    set NODE_LLAMA_CPP_CMAKE_OPTION_DCMAKE_CXX_COMPILER="C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang++.exe"
    set NODE_LLAMA_CPP_CMAKE_OPTION_DCMAKE_C_FLAGS="-O3 -flto -march=armv8.2-a+dotprod"
    set NODE_LLAMA_CPP_CMAKE_OPTION_DCMAKE_CXX_FLAGS="-O3 -flto -march=armv8.2-a+dotprod"
    set DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=ON
    set DBUILD_SHARED_LIBS=ON
    ```
  5. I then ran `npx --no node-llama-cpp source build`. It builds and confirms at the end that the flags were used:

    ```
    √ Compiled llama.cpp
    √ Removed temporary files

    To use the binary you've just built, use this code:

    import {getLlama} from "node-llama-cpp";

    const llama = await getLlama({
        gpu: false,
        cmakeOptions: {
            DCMAKE_C_COMPILER: "C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang.exe",
            DCMAKE_C_FLAGS: "-O3 -flto -march=armv8.2-a+dotprod",
            DCMAKE_CXX_COMPILER: "C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang++.exe",
            DCMAKE_CXX_FLAGS: "-O3 -flto -march=armv8.2-a+dotprod",
            DGGML_NATIVE: "OFF",
            DGGML_OPENMP: "OFF"
        }
    });
    ```
And yet, running `npx --no node-llama-cpp chat` is just as slow as with the prebuilt llama.cpp in the steps listed above under "Actual Behavior".

What are the exact differences between `npx --no node-llama-cpp source build` and `cmake --build build --config Release`? Or does llama-cli.exe perhaps apply some configuration flag at runtime that node-llama-cpp does not? Note that Vulkan doesn't need to be used on ARM; on the CPU alone the speed is normally still pretty decent.

My Environment

| Dependency | Version |
| --- | --- |
| Operating System | Apple macOS running Windows 11 ARM via Parallels |
| CPU | M3 Max |
| Node.js version | 22.18.0 |
| TypeScript version | 5.8.3 |
| node-llama-cpp version | 3.16.0 |

`npx --yes node-llama-cpp inspect gpu` output:

```
OS: Windows 10.0.26200 (arm64)
Node: 22.18.0 (arm64)
TypeScript: 5.8.3

node-llama-cpp: 3.16.0
Prebuilt binaries: b8095
Cloned source: b8095

Vulkan: Vulkan is detected, but using it failed
To resolve errors related to Vulkan, see the Vulkan guide: https://node-llama-cpp.withcat.ai/guide/vulkan

CPU model: Apple Silicon
Used RAM: 55.49% (4.44GB/7.99GB)
Free RAM: 44.5% (3.56GB/7.99GB)
```

Additional Context

No response

Relevant Features Used

  • Metal support
  • CUDA support
  • Vulkan support
  • Grammar
  • Function calling

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, and I know how to start.
