Make tokenize CLI tool have nicer command line arguments. #6188
Merged
mofosyne merged 5 commits into ggml-org:master (May 25, 2024)
Conversation
Contributor (Author)
Just noticed the CI doesn't run... maybe because it's my first PR and I'm not on an allowlist? Do I have a way to run the compilation tests somehow myself? Edit: oh, just as I wrote this I see things building. Never mind.
ggerganov approved these changes (Mar 21, 2024)
Before this commit, tokenize was a simple CLI tool like this:

    tokenize MODEL_FILENAME PROMPT [--ids]

This simple tool loads the model, takes the prompt, and shows the tokens llama.cpp is interpreting.

This changeset makes the tokenize tool more sophisticated, and more useful for debugging and troubleshooting:

    tokenize [-m, --model MODEL_FILENAME]
             [--ids]
             [--stdin]
             [--prompt]
             [-f, --file]
             [--no-bos]
             [--log-disable]

It also behaves more nicely on Windows now, interpreting and rendering Unicode from command line arguments and pipes no matter what code page the user has set on their terminal.
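As a rough illustration of the new interface, a long-option scan over argv for the flags listed above might look like the sketch below. This is hypothetical code, not the actual PR implementation: the struct name, field names, and error handling are all assumptions; only the flag names come from the synopsis.

```cpp
#include <cstring>
#include <string>

// Hypothetical option holder for the flags in the synopsis above.
struct tokenize_opts {
    std::string model;         // -m, --model MODEL_FILENAME
    std::string prompt;        // --prompt
    std::string file;          // -f, --file
    bool ids         = false;  // --ids: print integer token IDs only
    bool use_stdin   = false;  // --stdin: read the prompt from stdin
    bool no_bos      = false;  // --no-bos: do not prepend the BOS token
    bool log_disable = false;  // --log-disable
};

// Returns false on an unknown flag or a missing value.
static bool parse_opts(int argc, const char ** argv, tokenize_opts & o) {
    for (int i = 1; i < argc; i++) {
        const char * a = argv[i];
        // Consume the next argv entry as this flag's value, if present.
        auto need_val = [&](std::string & dst) {
            if (i + 1 >= argc) return false;
            dst = argv[++i];
            return true;
        };
        if (!strcmp(a, "-m") || !strcmp(a, "--model")) {
            if (!need_val(o.model)) return false;
        } else if (!strcmp(a, "--prompt")) {
            if (!need_val(o.prompt)) return false;
        } else if (!strcmp(a, "-f") || !strcmp(a, "--file")) {
            if (!need_val(o.file)) return false;
        } else if (!strcmp(a, "--ids")) {
            o.ids = true;
        } else if (!strcmp(a, "--stdin")) {
            o.use_stdin = true;
        } else if (!strcmp(a, "--no-bos")) {
            o.no_bos = true;
        } else if (!strcmp(a, "--log-disable")) {
            o.log_disable = true;
        } else {
            return false; // unknown argument
        }
    }
    return true;
}
```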
0e5b526 to cd7b5f7
cebtenzzre reviewed (Mar 26, 2024)
ggerganov approved these changes (Mar 27, 2024)
ggerganov (Member) left a comment:

> It will still recognize the old form (i.e. simple positional arguments) just to not surprise people. Although I would myself like to remove it entirely, to simplify the thing. Not sure anyone actually uses this tool except for ad-hoc testing like I do. Opinions on completely removing the "old style arguments"?

Yes, let's remove the old style arguments to simplify.
…guments. It must now be invoked with long --model, --prompt etc. arguments only. Shortens the code.
jdh1166 approved these changes (Apr 25, 2024)
ggerganov reviewed (May 9, 2024)
Collaborator
Appears everyone is happy so far with this PR; will merge once the CI issue in the main branch is all sorted.
The tokenize CLI tool is one of the tools in examples/*. It's a pretty short and simple tool that takes arguments like this:

    tokenize MODEL_FILENAME PROMPT [--ids]

It would load the model, read the prompt, and then print a list of tokens it interpreted, or, if --ids was given, just the integer values.
This changeset makes the command a bit more sophisticated, with more options:

    tokenize [-m, --model MODEL_FILENAME]
             [--ids]
             [--stdin]
             [--prompt]
             [-f, --file]
             [--no-bos]
             [--log-disable]
It will still recognize the old form (i.e. simple positional arguments) just to not surprise people. Although I would myself like to remove it entirely, to simplify the thing. Not sure anyone actually uses this tool except for ad-hoc testing like I do. Opinions on completely removing the "old style arguments"?
Motivation: I've been using this tool for my own tests in tokenization-divergence investigations. I find it useful for quick ad-hoc tests on text tokenization and comparisons. In particular, I wanted it to behave nicely if you give it a filename or pipe into it from stdin.

I took my hacks and cleaned them up into nicer-looking command line arguments, following the style and argument names of some other CLI tools I saw. In general I also added some error checking, so you are more likely to get a readable error than a segfault if you did something wrong.
Draft because I need to test some of the argument combinations and also Windows, and I want to see the CI results on GitHub here. I think the stdin reading as written might be sketchy on Windows if you try to physically type letters, which would now become a feature of tokenize.

(std::cin does not have .is_open()? I got really confused when trying to write code that checks whether we read from stdin properly without syscall failures, and trying to figure out whether the code checks for syscall failures in a waterproof way. I'm a C programmer, not a C++ one, dammit.)
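On that last point, one workable pattern (a sketch, not the PR's actual code) is to read everything and then inspect the stream's state bits: std::istream has no is_open(), but badbit is what distinguishes a hard read failure from an ordinary end of input.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Sketch: read an entire stream, reporting only hard I/O errors.
// A normal end of input sets eofbit (plus failbit from the final getline),
// while a failed underlying read sets badbit -- that is the bit to check.
static bool read_all(std::istream & in, std::string & out) {
    std::string line;
    while (std::getline(in, line)) {
        out += line;
        out += '\n';
    }
    return !in.bad(); // bad() == hard I/O error; plain EOF is fine
}
```

Called as read_all(std::cin, buf), this treats EOF as success and only flags badbit as a failure, which sidesteps the is_open() question entirely.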