Issue description
Running llama.cpp through node-llama-cpp is much slower, to the point of being unusable, compared to running llama-cli directly on Windows ARM.
Expected Behavior
- Run `npm install node-llama-cpp`
- Either run `npx --no node-llama-cpp chat` or create a chat session and prompt it in Node.js
- Generation runs at a speed comparable to downloading the same release from https://github.com/ggml-org/llama.cpp/releases and running llama-cli with the same model.
Actual Behavior
- Run `npm install node-llama-cpp`
- Either run `npx --no node-llama-cpp chat` or create a chat session and prompt it in Node.js
- Generation runs considerably slower, probably around 10x as slow, to the point of being unusable: with any added context the machine hangs for minutes before producing the first token and then generates at maybe 2 tps, whereas llama-cli starts generating almost instantly at around 30 tps with the same model.
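For concreteness, the throughput numbers above can be expressed with a tiny helper (illustrative only; `tokensPerSecond` is not part of node-llama-cpp — I just timed the generation manually):

```javascript
// Illustrative helper (not part of node-llama-cpp): compute tokens/second
// from a generated-token count and the elapsed wall-clock time.
function tokensPerSecond(tokenCount, elapsedMs) {
    if (elapsedMs <= 0)
        throw new RangeError("elapsedMs must be positive");
    return tokenCount / (elapsedMs / 1000);
}

// Roughly what I observe with the same model and prompt:
console.log(tokensPerSecond(60, 30_000));  // node-llama-cpp: ~2 tps
console.log(tokensPerSecond(900, 30_000)); // llama-cli: ~30 tps
```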
Steps to reproduce
What I tried:
- Building llama.cpp from source myself. My first attempt was slower than the public release too, but then I was able to generate a comparable build using:

```shell
cmake -B build -G "Ninja" ^
  -DLLAMA_NATIVE=OFF ^
  -DLLAMA_BUILD_SHARED_LIB=ON ^
  -DBUILD_SHARED_LIBS=ON ^
  -DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=ON ^
  -DCMAKE_BUILD_TYPE=Release ^
  -DCMAKE_C_COMPILER="C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang.exe" ^
  -DCMAKE_CXX_COMPILER="C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang++.exe" ^
  -DCMAKE_C_FLAGS="-O3 -flto -march=armv8.2-a+dotprod" ^
  -DCMAKE_CXX_FLAGS="-O3 -flto -march=armv8.2-a+dotprod"
```
This build ran at the same speed as the public release of llama.cpp.
- I then ran `npx --no node-llama-cpp source download --skipBuild` in my project folder to download the same source into node_modules/node-llama-cpp/llama/llama.cpp.
- I deleted the prebuilt binaries in node_modules/@node-llama-cpp/ to make sure they would not be used.
- I then put the same flags from my successful direct build into environment variables in the Windows Native ARM Command Prompt:

```shell
set NODE_LLAMA_CPP_CMAKE_OPTION_DGGML_NATIVE=OFF
set NODE_LLAMA_CPP_CMAKE_OPTION_DGGML_OPENMP=OFF
set NODE_LLAMA_CPP_CMAKE_OPTION_DCMAKE_C_COMPILER="C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang.exe"
set NODE_LLAMA_CPP_CMAKE_OPTION_DCMAKE_CXX_COMPILER="C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang++.exe"
set NODE_LLAMA_CPP_CMAKE_OPTION_DCMAKE_C_FLAGS="-O3 -flto -march=armv8.2-a+dotprod"
set NODE_LLAMA_CPP_CMAKE_OPTION_DCMAKE_CXX_FLAGS="-O3 -flto -march=armv8.2-a+dotprod"
set DCMAKE_WINDOWS_EXPORT_ALL_SYMBOLS=ON
set DBUILD_SHARED_LIBS=ON
```

- I then ran `npx --no node-llama-cpp source build`. It builds and confirms at the end that the flags were used:
```
√ Compiled llama.cpp
√ Removed temporary files

To use the binary you've just built, use this code:

import {getLlama} from "node-llama-cpp";

const llama = await getLlama({
    gpu: false,
    cmakeOptions: {
        DCMAKE_C_COMPILER: "C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang.exe",
        DCMAKE_C_FLAGS: "-O3 -flto -march=armv8.2-a+dotprod",
        DCMAKE_CXX_COMPILER: "C:/msys64/clangarm64/bin/aarch64-w64-mingw32-clang++.exe",
        DCMAKE_CXX_FLAGS: "-O3 -flto -march=armv8.2-a+dotprod",
        DGGML_NATIVE: "OFF",
        DGGML_OPENMP: "OFF"
    }
});
```
And yet, running `npx --no node-llama-cpp chat` is just as slow as with the prebuilt llama.cpp in the steps listed above under "Actual Behavior".
What are the exact differences between `npx --no node-llama-cpp source build` and `cmake --build build --config Release`? Or does llama-cli.exe perhaps apply some sort of configuration flag at runtime that node-llama-cpp does not? Note that Vulkan doesn't need to be used on ARM; on the CPU alone, the speed is normally still pretty decent.
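For what it's worth, my working assumption (not verified against the node-llama-cpp source) is that the `NODE_LLAMA_CPP_CMAKE_OPTION_*` environment variables get mapped to `-D` CMake arguments roughly like this; `cmakeArgsFromEnv` and the mapping itself are my guesses, not the library's actual code:

```javascript
// Sketch of my ASSUMPTION about how node-llama-cpp might turn
// NODE_LLAMA_CPP_CMAKE_OPTION_* environment variables into CMake -D arguments.
// This is not the library's actual implementation.
const PREFIX = "NODE_LLAMA_CPP_CMAKE_OPTION_";

function cmakeArgsFromEnv(env) {
    return Object.entries(env)
        .filter(([key]) => key.startsWith(PREFIX))
        .map(([key, value]) => `-D${key.slice(PREFIX.length)}=${value}`);
}

console.log(cmakeArgsFromEnv({
    NODE_LLAMA_CPP_CMAKE_OPTION_DGGML_NATIVE: "OFF",
    PATH: "C:\\Windows" // unrelated variables should be ignored
}));
// → [ '-DDGGML_NATIVE=OFF' ]
```

If the mapping works anything like this, note that my variable names above already include the leading `D` (e.g. `DGGML_NATIVE`), which would yield `-DDGGML_NATIVE=OFF`; whether that doubled `D` matters is part of what I'd like to understand.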
My Environment
| Dependency | Version |
|---|---|
| Operating System | Apple macOS running Windows 11 ARM via Parallels |
| CPU | M3 Max |
| Node.js version | 22.18.0 |
| Typescript version | 5.8.3 |
| node-llama-cpp version | 3.16.0 |
`npx --yes node-llama-cpp inspect gpu` output:

```
OS: Windows 10.0.26200 (arm64)
Node: 22.18.0 (arm64)
TypeScript: 5.8.3
node-llama-cpp: 3.16.0

Prebuilt binaries: b8095
Cloned source: b8095

Vulkan: Vulkan is detected, but using it failed
To resolve errors related to Vulkan, see the Vulkan guide: https://node-llama-cpp.withcat.ai/guide/vulkan

CPU model: Apple Silicon
Used RAM: 55.49% (4.44GB/7.99GB)
Free RAM: 44.5% (3.56GB/7.99GB)
```
Additional Context
No response
Relevant Features Used
- Metal support
- CUDA support
- Vulkan support
- Grammar
- Function calling
Are you willing to resolve this issue by submitting a Pull Request?
Yes, I have the time, and I know how to start.