iq2_xxs: tune quantization by ikawrakow · Pull Request #5320 · ggml-org/llama.cpp

ikawrakow · 2024-02-04T09:02:48Z

We get slightly better perplexity (PPL), and we cut quantization time nearly in half.

The trick is to first quantize without forcing points onto the E8 lattice. We can then use a narrower search range around the block scale obtained that way, which is what gives the significant reduction in quantization time.

The code becomes simpler too, so it is a win-win.
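
To make the two-stage idea concrete, here is a minimal Python sketch. It is not the code from this PR: the group size, the levels, the toy codebook (grid points whose coordinate sum is a multiple of 4, standing in for the lattice), the candidate ranges, and all function names are assumptions chosen for illustration. It only shows the shape of the approach: a cheap unconstrained pass finds a good scale, and the costly codebook search is then tried for just a few scales near it.

```python
import itertools
import numpy as np

GROUP = 4                                   # toy group size (assumption)
LEVELS = np.array([-3.0, -1.0, 1.0, 3.0])   # toy quantization levels (assumption)

# Toy codebook standing in for the lattice: grid points whose coordinate
# sum is a multiple of 4. This is NOT the real iq2_xxs grid.
CODEBOOK = np.array(
    [p for p in itertools.product((-3, -1, 1, 3), repeat=GROUP) if sum(p) % 4 == 0],
    dtype=float,
)

def snap_free(groups):
    """Round every coordinate to the nearest level, ignoring the codebook."""
    idx = np.abs(groups[..., None] - LEVELS).argmin(axis=-1)
    return LEVELS[idx]

def snap_codebook(groups):
    """Move every group to its nearest codebook point (the costly step)."""
    d2 = ((groups[:, None, :] - CODEBOOK[None, :, :]) ** 2).sum(axis=-1)
    return CODEBOOK[d2.argmin(axis=1)]

def search_scale(x, snap, scales):
    """Return (error, scale, quants) with the lowest squared error over `scales`."""
    groups = x.reshape(-1, GROUP)
    best = None
    for s in scales:
        q = snap(groups / s)
        err = float(((groups - s * q) ** 2).sum())
        if best is None or err < best[0]:
            best = (err, float(s), q)
    return best

def quantize_two_stage(x, wide, narrow):
    """Wide unconstrained search first, then a narrow constrained search."""
    s_init = np.abs(x).max() / LEVELS.max()
    _, s_free, _ = search_scale(x, snap_free, s_init * wide)   # cheap pass
    return search_scale(x, snap_codebook, s_free * narrow)     # few costly tries

rng = np.random.default_rng(0)
block = rng.standard_normal(32)       # one toy block of weights
wide = np.linspace(0.5, 1.5, 41)      # 41 candidate scale factors
narrow = np.linspace(0.9, 1.1, 5)     # 5 candidate scale factors

# Baseline: run the constrained search over the full wide range.
s_init = np.abs(block).max() / LEVELS.max()
err_full, s_full, _ = search_scale(block, snap_codebook, s_init * wide)
err_fast, s_fast, q_fast = quantize_two_stage(block, wide, narrow)

print(f"full search : {len(wide)} codebook passes, error {err_full:.4f}")
print(f"two-stage   : {len(narrow)} codebook passes, error {err_fast:.4f}")
```

In this sketch the constrained search runs 5 times instead of 41. How close the resulting error stays to the full search depends on the data and on how narrow the second range is.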

Here is a comparison between PPL with this PR and PR #4773 for a context length of 4096 tokens:

Model	File size (GiB)	PPL Master	PPL PR
Mistral-7B	1.855	6.446	6.448
LLaMA-v2-7B	1.728	7.067	7.048
LLaMA-v2-13B	3.295	5.728	5.672
LLaMA-v2-70B	17.03	4.079	4.057
Mixtral-8x7B	11.44	4.948	4.904
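
To put the size of these differences in perspective, the short Python snippet below computes the relative PPL change per model from the numbers in the table. Only the table values come from this PR; the variable names are illustrative.

```python
# (model, PPL on master, PPL with this PR), copied from the table above
rows = [
    ("Mistral-7B", 6.446, 6.448),
    ("LLaMA-v2-7B", 7.067, 7.048),
    ("LLaMA-v2-13B", 5.728, 5.672),
    ("LLaMA-v2-70B", 4.079, 4.057),
    ("Mixtral-8x7B", 4.948, 4.904),
]

# Relative change in percent; negative means the PR has lower (better) PPL.
rel_change = {name: 100.0 * (pr - master) / master for name, master, pr in rows}

for name, change in rel_change.items():
    print(f"{name:13s} {change:+.2f}%")

improved = [name for name, change in rel_change.items() if change < 0]
print(f"{len(improved)} of {len(rows)} models have lower PPL with the PR")
```

Four of the five models improve, by up to about 1% (LLaMA-v2-13B), while Mistral-7B is marginally higher (about 0.03%).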

sorasoras · 2024-02-04T09:33:47Z

Could this be applied to other IQ quants?

ikawrakow · 2024-02-05T08:44:42Z

> Could this be applied to other IQ quants?

Yes, but with much less gain. That is, one either gets an increase in PPL when reducing the scale search range as aggressively as here, or one keeps about the same PPL with a much smaller speedup.

Nexesenex · 2024-02-05T09:56:13Z

Any noticeable speed-up you can offer us at close to equal perplexity is interesting for the CPU poor, @ikawrakow!

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

iq2_xxs: tune quantization

f3798f7

ggerganov approved these changes Feb 5, 2024

ikawrakow merged commit 6fdfa2e into master Feb 5, 2024

ikawrakow deleted the ik/iq2xxs_tune branch February 5, 2024 08:46
