convert-hf : support direct Q8_0 conversion #7234
Conversation
I didn't notice these on my first pass.
Hi, this PR breaks model conversion on my system. I was using an older version of numpy. However, I am hoping that it is possible to allow it to work with numpy 1.22 as it did before this commit, as a fallback? A lot of toolchains might still be on slightly older versions of numpy, and forcing the use of the latest version may not be ideal.
```python
if tensor_dtype == np.uint8:
    block_size, type_size = GGML_QUANT_SIZES[raw_dtype]
    if tensor_shape[-1] % type_size != 0:
        raise ValueError(f"Quantized tensor row size ({tensor_shape[-1]}) is not a multiple of {dtype.name} type size ({type_size})")
    tensor_shape = tuple(tensor_shape[:-1]) + (tensor_shape[-1] // type_size * block_size,)
```
This has broken copying of tensors on i-quants (and probably several others as well); using

```
./gguf-new-metadata.py foo.IQ4_NL.gguf bar.gguf
```

you now get

```
ValueError: Quantized tensor row size (4096) is not a multiple of IQ4_NL type size (18)
```
The issue seems to be that the `type_size` is off by 2; however, I don't see why the tensor should be reshaped in this scenario, so this should probably be re-evaluated.
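For reference, a quick check of the numbers (a sketch using `gguf-py`'s `GGML_QUANT_SIZES` table, the same lookup as in the snippet above): the divisibility check fails for a row size counted in elements but passes for one counted in bytes, which presumably means the shape reaching this code is in the wrong unit.

```python
from gguf.constants import GGML_QUANT_SIZES, GGMLQuantizationType

# IQ4_NL packs blocks of 32 elements into 18 bytes (a 2-byte f16 scale
# plus 16 bytes of 4-bit values) -- hence 18 rather than 16.
block_size, type_size = GGML_QUANT_SIZES[GGMLQuantizationType.IQ4_NL]  # (32, 18)

n_elements = 4096
print(n_elements % type_size)  # 10 -> a row size in elements fails the check

n_bytes = n_elements // block_size * type_size  # 2304 bytes for a 4096-element row
print(n_bytes % type_size)                 # 0 -> a row size in bytes passes
print(n_bytes // type_size * block_size)   # 4096 -> and maps back to elements
```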
Thanks for finding this!
I think it also breaks copying of all other quantized tensors in `gguf-new-metadata`.
Sorry about that.
I think I found a way to fix this while also simplifying what happens to the shape in the round-trip between `GGUFReader` and `GGUFWriter`. See #7483.
This adds `Q8_0` conversion to `convert-hf-to-gguf.py`, and it results in EXACTLY the same files as if converted with `./quantize` from an `f32` model.

Note that this was NOT the case for `convert.py`, because it rounds to nearest even and divides by the scale, while the reference implementation in `ggml-quants.c` rounds away from zero and multiplies by the inverse of the scale. (A sketch of the rounding difference follows.)
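As a minimal illustration (not code from this PR; the values are made up so that the quotients land exactly on rounding ties):

```python
import numpy as np

# One Q8_0 block stores 32 weights as int8 plus a single f16 scale d = max(|x|) / 127.
x = np.array([2.5, -2.5, 127.0], dtype=np.float32)  # shortened "block", for brevity
d = np.abs(x).max() / 127  # 1.0 here, so x / d lands exactly on .5 ties

# convert.py's approach: divide by the scale, round half to even (numpy's default).
q_even = np.round(x / d)                    # [  2.,  -2., 127.]

# ggml-quants.c's approach: multiply by the inverse scale, round half away from zero.
inv_d = 1.0 / d if d else 0.0
y = x * inv_d
q_away = np.trunc(y + np.copysign(0.5, y))  # [  3.,  -3., 127.]
```

On such ties the two schemes differ by one quantized step, so the resulting files can't be byte-identical even though both are valid `Q8_0` quantizations.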
Summary of changes

* Add the missing `self.gguf_writer.add_file_type(self.ftype)` for `StableLMModel`, `InternLM2Model`, `PlamoModel`, `QwenModel`, `BaichuanModel`, and `XverseModel`.
* Add `gguf-py/gguf/quants.py` to put the `Q8_0` implementation there, and also move the `bf16` conversion in there.
  * This made `bf16` conversion faster, from 40-60 MB/s on my machine to 104 MB/s.
* Make `GGUFWriter` support arbitrary quants with the `np.uint8` dtype when a `raw_dtype` is also specified (see the sketch after this list).
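A sketch of what that last point enables (the file name, tensor name, and contents below are placeholders): pre-quantized data can be handed to `GGUFWriter` as raw `np.uint8` bytes together with its quantization type.

```python
import numpy as np
from gguf import GGUFWriter, GGMLQuantizationType

writer = GGUFWriter("dummy.gguf", "llama")

# Q8_0 packs each block of 32 elements into 34 bytes (an f16 scale + 32 int8 values),
# so a row of 4096 elements becomes 4096 // 32 * 34 == 4352 bytes of raw data.
q8_0_data = np.zeros((16, 4096 // 32 * 34), dtype=np.uint8)  # placeholder contents
writer.add_tensor("blk.0.ffn_up.weight", q8_0_data, raw_dtype=GGMLQuantizationType.Q8_0)
```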
TODO:

* Rename `Model.extra_f16_tensors` to `Model.extra_quantized_tensors`
* Maybe change `Q8_0` in `convert.py` to round in the same way as the reference implementation?

Testing
To be sure this Python implementation of `Q8_0` really is working in the exact same way as the reference implementation from `ggml-quants.c`, I'm testing conversion and quantization of a bunch of different model architectures.

I recently got a big external hard drive, which makes storing the output of these tests much easier.
I'm doing pretty much this for every model architecture tested below (using the `{ftype}` templating for `--outfile` introduced in #7158); a sketch of the loop follows. I'd say there is some suspense when the checksums begin to appear. Will they match?
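Roughly, each test looks like this (a sketch only: the model directory, output paths, and file names are placeholders, not the exact commands used):

```python
# Assumes a llama.cpp checkout with the `quantize` binary built.
import hashlib
import subprocess

model_dir = "models/TinyLlama-1.1B-Chat-v1.0"  # any of the models listed below
outfile = "out/model-{ftype}.gguf"             # {ftype} is filled in by the script (#7158)

# Convert to f32, and directly to Q8_0 with the new --outtype option.
for ftype in ("f32", "q8_0"):
    subprocess.run(["python3", "convert-hf-to-gguf.py", model_dir,
                    "--outtype", ftype, "--outfile", outfile], check=True)

# Reference: quantize the f32 conversion to Q8_0 with the C implementation.
subprocess.run(["./quantize", outfile.format(ftype="f32"),
                "out/model-q8_0-ref.gguf", "q8_0"], check=True)

def sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# The direct conversion should be byte-identical to the reference quantization.
print(sha256(outfile.format(ftype="q8_0")))
print(sha256("out/model-q8_0-ref.gguf"))
```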
* (`torch.float32`) https://huggingface.co/BAAI/bge-small-en-v1.5
* (`torch.bfloat16`) https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
* (`torch.float32`) https://huggingface.co/jtatman/TinyMistral-248m-v2.5-4x-Moe
* (`torch.bfloat16`) https://huggingface.co/smallcloudai/Refact-1_6B-fim
* (`torch.float16`) https://huggingface.co/pansophic/rocket-3B
  * The original model is `f16`, but I'm quantizing from `f32` anyway for consistency with the other tests.
  * AhA! Looking at the diff of the `gguf-dump` of each model shows this is a metadata problem! The `ftype` was missing! After adding the `ftype` in 2b1e5ea: CHECKSUMS MATCH!!!
* (`torch.float16`)
  * Is it a problem with `torch.float16` models? Bloom is in `float16`, yet it still worked for it. AHA, the problem is that the `ftype` is not put in the model!!! After adding the `ftype` in 2b1e5ea: CHECKSUMS MATCH!!!!
* (`torch.float32`) https://huggingface.co/jondurbin/bagel-dpo-2.8b-v0.2
* (`torch.float16`) https://huggingface.co/bigscience/bloom-560m
* (`torch.bfloat16`) https://huggingface.co/Qwen/Qwen-1_8B-Chat
  * Like with `StableLMModel`, the `ftype` is missing. After adding the `ftype` in 2b1e5ea: checksums match!
* (`torch.bfloat16`) https://huggingface.co/internlm/internlm2-chat-1_8b
  * The `ftype` was missing, but I noticed it before first converting. After adding the `ftype` in 2b1e5ea: checksums match!
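The missing field can also be checked directly, without diffing full `gguf-dump` outputs; a small sketch using `gguf-py`'s `GGUFReader` (the path is a placeholder):

```python
from gguf import GGUFReader

reader = GGUFReader("out/model-q8_0.gguf")  # placeholder path
field = reader.get_field("general.file_type")
if field is None:
    print("ftype missing!")  # the metadata problem seen above
else:
    print(int(field.parts[field.data[-1]][0]))  # 7 corresponds to MOSTLY_Q8_0
```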