gguf-split: split and merge gguf per batch of tensors #6135
phymbert merged 5 commits into ggml-org:master
Conversation
Interesting approach. I think allowing splitting by file size would be more intuitive (and usually more appropriate, since file size is usually the limiting factor, e.g. 4G for FAT or 50G for HF). The current code also makes the workflow a bit awkward with a lot of extra writes. It shouldn't be too hard to call
Thanks for looking into this feature. Your PR overall LGTM, just don't forget to include the Makefile.
This would be useful for my wllama, since loading 5MB-10MB chunks in parallel will be faster in a web environment. So I'm looking forward to the implementation in llama_model_loader.
For the syscalls that @Artefact2 proposed, we can implement them in v2 of the PR, I think; for now it's already a good start for testing whether the modification to llama_model_loader works or not.
Thanks. You cannot exactly predict the size of the GGUF, as tensor sizes can vary, and we want to have valid GGUF files (i.e. not truncated as in your example) for later on having
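To make this concrete, a size-budgeted split that keeps every GGUF valid can only be best-effort, because each split must contain whole tensors. A minimal sketch of such a greedy packer (`plan_splits` is a hypothetical helper, not code from this PR):

```python
def plan_splits(tensor_sizes: list[int], max_bytes: int) -> list[list[int]]:
    """Greedily group whole tensors so each split stays under max_bytes.
    A single tensor larger than the budget still gets its own split,
    so the size limit cannot be guaranteed exactly."""
    splits: list[list[int]] = []
    current: list[int] = []
    used = 0
    for size in tensor_sizes:
        if current and used + size > max_bytes:
            splits.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        splits.append(current)
    return splits
```

Note how a 10-byte tensor under an 8-byte budget still lands alone in its own (oversized) split rather than being truncated across files.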
@ggerganov Hi Georgi, can I merge and continue on common?
* gguf-split: split and merge gguf files per tensor
* gguf-split: build with make toolchain
* gguf-split: rename `--split-tensors-size` to `--split-max-tensors`. Set general.split_count KV to all split
* split : minor style + fix compile warnings
* gguf-split: remove --upload not implemented

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Motivation
Distributing and storing GGUF files is difficult for 13b+ models, especially in f16. A lot of issues can happen during file transfers.
Typically, they need to be transferred from Hugging Face to an internal storage like S3, MinIO, Git LFS, Nexus or Artifactory, then downloaded by the inference server and stored locally (or on a Kubernetes PVC, for example). Also, they cannot be stored in a Dockerfile, but IMHO this is for the good.
This PR introduces a `gguf-split` CLI to ease the split and merge of multiple GGUF files. Examples:
`--split`

`--merge`

References
Notes
If this approach is accepted, we can later on adapt `llama_load_model_from_file` and `llama_load_model_from_url` to support the `general.split_count` KV in GGUF.

`mmap` is not used in this first implementation, neither are `copy_file_range` iops.

The only split strategy supported at the moment is `--split-max-tensors N`, which will create split GGUFs with at most N tensors each, regardless of their byte size. Later on, another split strategy based on max file size can be introduced.
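The `--split-max-tensors N` strategy amounts to a fixed-size partition of the tensor list. The sketch below illustrates this; `split_by_max_tensors` and the `-%05d-of-%05d` naming in `split_path` are illustrative assumptions, not code from this PR:

```python
def split_by_max_tensors(tensor_names: list[str], n_max: int) -> list[list[str]]:
    """Partition tensors into consecutive groups of at most n_max,
    regardless of their byte size (illustrative sketch)."""
    return [tensor_names[i:i + n_max]
            for i in range(0, len(tensor_names), n_max)]

def split_path(prefix: str, i: int, n_split: int) -> str:
    # Assumed naming pattern for the i-th split (1-based) out of n_split.
    return f"{prefix}-{i:05d}-of-{n_split:05d}.gguf"
```

A size-based strategy would replace the fixed `n_max` grouping with a byte-budget packer, while keeping the same per-split file naming.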