54 changes: 28 additions & 26 deletions docs/model_zoo.md
@@ -2,32 +2,34 @@

Scale hosts the following models in the LLM Engine Model Zoo:

| Model Name | Inference APIs Available | Fine-tuning APIs Available | Inference Frameworks Available | Inference max total tokens (prompt + response) |
| ------------------------ | ------------------------ | -------------------------- | ------------------------------------------ | ---------------------------------------------- |
| `llama-7b` | ✅ | ✅ | deepspeed, text-generation-inference | 2048 |
| `llama-2-7b` | ✅ | ✅ | text-generation-inference, vllm | 4096 |
| `llama-2-7b-chat` | ✅ | | text-generation-inference, vllm | 4096 |
| `llama-2-13b` | ✅ | | text-generation-inference, vllm | 4096 |
| `llama-2-13b-chat` | ✅ | | text-generation-inference, vllm | 4096 |
| `llama-2-70b` | ✅ | ✅ | text-generation-inference, vllm | 4096 |
| `llama-2-70b-chat` | ✅ | | text-generation-inference, vllm | 4096 |
| `falcon-7b` | ✅ | | text-generation-inference, vllm | 2048 |
| `falcon-7b-instruct` | ✅ | | text-generation-inference, vllm | 2048 |
| `falcon-40b` | ✅ | | text-generation-inference, vllm | 2048 |
| `falcon-40b-instruct` | ✅ | | text-generation-inference, vllm | 2048 |
| `mpt-7b` | ✅ | | deepspeed, text-generation-inference, vllm | 2048 |
| `mpt-7b-instruct` | ✅ | ✅ | deepspeed, text-generation-inference, vllm | 2048 |
| `flan-t5-xxl` | ✅ | | deepspeed, text-generation-inference | 2048 |
| `mistral-7b` | ✅ | ✅ | vllm | 8000 |
| `mistral-7b-instruct` | ✅ | ✅ | vllm | 8000 |
| `mixtral-8x7b` | ✅ | | vllm | 32768 |
| `mixtral-8x7b-instruct` | ✅ | | vllm | 32768 |
| `codellama-7b` | ✅ | ✅ | text-generation-inference, vllm | 16384 |
| `codellama-7b-instruct` | ✅ | ✅ | text-generation-inference, vllm | 16384 |
| `codellama-13b` | ✅ | ✅ | text-generation-inference, vllm | 16384 |
| `codellama-13b-instruct` | ✅ | ✅ | text-generation-inference, vllm | 16384 |
| `codellama-34b` | ✅ | ✅ | text-generation-inference, vllm | 16384 |
| `codellama-34b-instruct` | ✅ | ✅ | text-generation-inference, vllm | 16384 |
| `zephyr-7b-alpha` | ✅ | | text-generation-inference, vllm | 32768 |
| `zephyr-7b-beta` | ✅ | | text-generation-inference, vllm | 32768 |
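Each request's prompt and generated tokens must together fit within the cap in the last column. A minimal sketch of budgeting a response length against that cap (the helper and the `MAX_TOTAL_TOKENS` mapping below are illustrative, not part of LLM Engine; the limits are copied from the table):

```python
# Max total tokens (prompt + response) per model, from the table above.
MAX_TOTAL_TOKENS = {
    "llama-2-7b": 4096,
    "mistral-7b": 8000,
    "mixtral-8x7b-instruct": 32768,
    "zephyr-7b-beta": 32768,
}

def max_response_tokens(model: str, prompt_tokens: int) -> int:
    """Largest max_new_tokens that keeps prompt + response within the cap."""
    budget = MAX_TOTAL_TOKENS[model] - prompt_tokens
    if budget <= 0:
        raise ValueError(f"prompt already exceeds the context window of {model}")
    return budget
```

For example, a 1,000-token prompt against `mistral-7b` leaves 7,000 tokens for the response.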

## Usage

```diff
@@ -165,6 +165,8 @@
 "codellama-34b-instruct",
 "mistral-7b",
 "mistral-7b-instruct",
+"mixtral-8x7b",
+"mixtral-8x7b-instruct",
 "mammoth-coder-llama-2-7b",
 "mammoth-coder-llama-2-13b",
 "mammoth-coder-llama-2-34b",
```
```diff
@@ -210,6 +212,7 @@
 # Can also see 13B, 34B there too
 "llama-2": {"max_model_len": None, "max_num_batched_tokens": 4096},
 "mistral": {"max_model_len": 8000, "max_num_batched_tokens": 8000},
+"mixtral": {"max_model_len": 32768, "max_num_batched_tokens": 32768},
 "zephyr": {"max_model_len": 32768, "max_num_batched_tokens": 32768},
 }
```
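The keys of this mapping act as model-name prefixes. A hedged sketch of how a concrete model name might be resolved against it (the longest-prefix matching below is an assumption for illustration, not the repository's actual lookup code):

```python
VLLM_MODEL_CONFIG = {
    "llama-2": {"max_model_len": None, "max_num_batched_tokens": 4096},
    "mistral": {"max_model_len": 8000, "max_num_batched_tokens": 8000},
    "mixtral": {"max_model_len": 32768, "max_num_batched_tokens": 32768},
    "zephyr": {"max_model_len": 32768, "max_num_batched_tokens": 32768},
}

def config_for(model_name: str) -> dict:
    # Prefer the longest matching prefix so more specific keys win
    # if overlapping prefixes are ever added.
    matches = [k for k in VLLM_MODEL_CONFIG if model_name.startswith(k)]
    if not matches:
        raise KeyError(f"no vLLM config for {model_name}")
    return VLLM_MODEL_CONFIG[max(matches, key=len)]
```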

7 changes: 6 additions & 1 deletion model-engine/model_engine_server/inference/vllm/Dockerfile
```diff
@@ -1,8 +1,13 @@
-FROM nvcr.io/nvidia/pytorch:22.12-py3
+FROM nvcr.io/nvidia/pytorch:23.09-py3
 
 RUN pip uninstall torch -y
 COPY requirements.txt /workspace/requirements.txt
 RUN pip install -r requirements.txt
 
+# install special version of megablocks
+RUN pip install git+https://github.com/stanford-futuredata/megablocks.git@5897cd6f254b7b3edf7a708a3a3314ecb54b6f78#egg=megablocks
+
 RUN wget https://github.com/peak/s5cmd/releases/download/v2.2.1/s5cmd_2.2.1_Linux-64bit.tar.gz
 RUN tar -xvzf s5cmd_2.2.1_Linux-64bit.tar.gz
 
 COPY vllm_server.py /workspace/vllm_server.py
```
```diff
@@ -1,3 +1,3 @@
 ray==2.6.3
-vllm==0.2.0
-pydantic==1.10.12
+vllm==0.2.5
+pydantic==1.10.13
```
```diff
@@ -58,6 +58,8 @@ def get_default_supported_models_info() -> Dict[str, ModelInfo]:
 ),
 "mistral-7b": ModelInfo("mistralai/Mistral-7B-v0.1", None),
 "mistral-7b-instruct": ModelInfo("mistralai/Mistral-7B-Instruct-v0.1", None),
```
> **Contributor:** let's also update mistral-7b-instruct to use the newer version released today: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2. We can do this in a follow-up PR, but it would be good to do as well.

> **Contributor (author):** should we add it as a separate model instead of replacing the current one? Also, we should do that as a follow-up PR.

> **Contributor:** Personally I wouldn't make it a separate model; I would just add the weights to s3 and use them in favor of the v0.1 weights. I suppose there is some value to having both models for completeness, though. @yunfeng-scale, thoughts on this?

> **Contributor:** don't think there's a need to add it as a new model.
```diff
+"mixtral-8x7b": ModelInfo("mistralai/Mixtral-8x7B-v0.1", None),
+"mixtral-8x7b-instruct": ModelInfo("mistralai/Mixtral-8x7B-Instruct-v0.1", None),
 "mammoth-coder-llama-2-7b": ModelInfo("TIGER-Lab/MAmmoTH-Coder-7B", None),
 "mammoth-coder-llama-2-13b": ModelInfo("TIGER-Lab/MAmmoTH-Coder-13B", None),
 "mammoth-coder-llama-2-34b": ModelInfo("TIGER-Lab/MAmmoTH-Coder-34B", None),
```
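Each entry pairs a zoo model name with the Hugging Face repository holding its weights. A self-contained sketch of this registry pattern (the `ModelInfo` field names below are assumed from the call sites, not taken from the repository):

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ModelInfo:
    hf_repo: str                # Hugging Face repo id for the weights
    s3_repo: Optional[str]      # optional S3 location; None means pull from HF

def get_default_supported_models_info() -> Dict[str, ModelInfo]:
    # Subset of the registry, showing the two Mixtral entries added here.
    return {
        "mixtral-8x7b": ModelInfo("mistralai/Mixtral-8x7B-v0.1", None),
        "mixtral-8x7b-instruct": ModelInfo("mistralai/Mixtral-8x7B-Instruct-v0.1", None),
    }
```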