BitTokens: Efficient numeracy in language models through single-token number embeddings

LLMs perform poorly on arithmetic tasks, requiring excessive reasoning tokens to achieve good performance. We propose BitTokens, a novel encoding strategy that represents any number as a single token using its IEEE 754 binary floating-point representation. This single-token number encoding allows language models to solve arithmetic tasks both effectively and efficiently.

How to use BitTokens

To get started, check out our interactive Jupyter notebook (bittokens_notebook.ipynb).

A more detailed implementation of BitTokens can be found in the bittoken_embedding.py file.
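As a rough illustration of the underlying idea, the sketch below converts a Python float to its IEEE 754 bit pattern and back. It is not the code from bittoken_embedding.py: the function names are hypothetical, and it assumes 64-bit double precision, which may differ from the width and layout used in the repository.

```python
import struct


def float_to_bits(x: float) -> list[int]:
    """Return the 64 IEEE 754 double-precision bits of x, most significant bit first."""
    # Pack as a big-endian double, then reinterpret the same 8 bytes as an unsigned integer.
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(as_int >> shift) & 1 for shift in range(63, -1, -1)]


def bits_to_float(bits: list[int]) -> float:
    """Inverse of float_to_bits: rebuild the float from 64 bits."""
    as_int = 0
    for bit in bits:
        as_int = (as_int << 1) | bit
    (x,) = struct.unpack(">d", struct.pack(">Q", as_int))
    return x


if __name__ == "__main__":
    bits = float_to_bits(-6.25)
    # Layout of a double: 1 sign bit, 11 exponent bits, 52 mantissa bits.
    sign, exponent, mantissa = bits[0], bits[1:12], bits[12:]
    print("sign:", sign)
    print("exponent:", "".join(map(str, exponent)))
    print("mantissa:", "".join(map(str, mantissa)))
    print("round trip:", bits_to_float(bits))
```

Because the bit vector has a fixed length regardless of the number's magnitude or precision, it can in principle serve as the input to a single-token embedding.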

Setup

Package manager UV

Tip

We recommend the fast package manager uv for dependency management, but any other package manager works. For that case we provide a requirements.txt file; replace uv run with python in the commands below.

Download and install the fast package manager UV.

# Download and install uv with python version >=3.13
curl -LsSf https://astral.sh/uv/install.sh | sh

Sync uv environment

# Installs python 3.13, torch 2.11, and other dependencies
uv sync

Install remaining dependencies:

Note

At the time of writing, there is no official pre-built wheel for FlashAttention with torch=2.11 and python=3.13. Until one is available, we install FlashAttention from a separately obtained wheel file, as in the first command below.

Tip

Installing FlashAttention sometimes fails. If you run into an error, please refer to the official install guide.

uv pip install "flash_attn-2.8.3+cu12torch2.11cxx11abiTRUE-cp313-cp313-linux_x86_64.whl" # replace with `flash-attn==2.8.3 --no-build-isolation` once official wheel available
uv pip install git+https://github.com/KellerJordan/Muon

Prepare Environment

Create an .env file and define the following variables:

PROJECT_PATH=... # Absolute path to the 'BitTokens/' folder
DATA_PATH=...    # Absolute path to data folder

# [Optional] If you want to use the eval_scripts
OPENROUTER_API_KEY=...

For convenience, load the .env file into your shell before running the commands below.
```
source .env
```
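To catch a missing or relative path early, the .env file can be checked with a few lines of Python. This is a minimal sketch, not part of the repository: parse_env and find_problems are hypothetical helpers, and the parser only understands the simple KEY=value lines with optional trailing comments shown above.

```python
from pathlib import Path

# Variables the README asks for; OPENROUTER_API_KEY is optional and not checked.
REQUIRED_KEYS = ("PROJECT_PATH", "DATA_PATH")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, ignoring blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.split(" #", 1)[0].strip()  # drop a trailing inline comment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip().strip('"')
    return values


def find_problems(values: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the file looks fine."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in values:
            problems.append(f"{key} is not set")
        elif not Path(values[key]).is_absolute():
            problems.append(f"{key} is not an absolute path: {values[key]}")
    return problems


if __name__ == "__main__":
    env_file = Path(".env")
    if env_file.is_file():
        for problem in find_problems(parse_env(env_file.read_text())):
            print(problem)
    else:
        print("No .env file found in the current directory")
```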

Get the datasets

Exact paper dataset

To reproduce the manuscript-style training commands below, download the exact synthetic number-problem dataset used by the paper. Set DATA_PATH to the directory where the files should be placed, then run:

uv run --with huggingface_hub hf download KreitnerL/BitTokens-dataset --repo-type dataset --local-dir "$DATA_PATH"

Dataset page: https://huggingface.co/datasets/KreitnerL/BitTokens-dataset

The dataset contains all synthetic number-problem CSV files referenced by the BitToken configs and the FoNE, xVal, significant-digit, token-digit, and base-10 baseline configs. It includes the standard arithmetic tasks plus the hard tasks: Exponentiation, Mean, and Std. It also includes the binary-uniform curriculum files used by BitTokens where referenced by the configs.

The hosted dataset has 37 CSV files: 14 train CSVs, 14 validation CSVs, and 9 test CSVs. It intentionally does not include FineWeb-derived .txt files; those should be downloaded from the public FineWeb dataset instead.

The hosted CSV files keep only the columns required for training and evaluation: prompt, text_prompt, answer, difficulty, and difficulty_sd.
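After downloading, a quick sanity check can confirm that each CSV has the columns listed above. The following is a sketch under assumptions: missing_columns is a hypothetical helper, and the script assumes the CSV files sit somewhere under the directory named by the DATA_PATH environment variable (falling back to the current directory if it is not exported).

```python
import csv
import os
from pathlib import Path

# Column names as listed in this README.
REQUIRED_COLUMNS = ["prompt", "text_prompt", "answer", "difficulty", "difficulty_sd"]


def missing_columns(lines) -> list[str]:
    """Return the required columns absent from the CSV header.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    header = next(csv.reader(lines), [])
    return [column for column in REQUIRED_COLUMNS if column not in header]


if __name__ == "__main__":
    data_path = Path(os.environ.get("DATA_PATH", "."))
    for csv_path in sorted(data_path.rglob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            missing = missing_columns(handle)
        status = "ok" if not missing else f"missing {missing}"
        print(f"{csv_path.name}: {status}")
```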

FineWeb text data

The multitask configs mix the synthetic number-problem data with text data. Download FineWeb from its original public Hugging Face dataset rather than from this repo:

uv run --with huggingface_hub hf download HuggingFaceFW/fineweb \
  --repo-type dataset \
  --include "sample/10BT/*.parquet" \
  --local-dir "$DATA_PATH"

Decode the downloaded parquet files to text files:

uv run "$PROJECT_PATH/data_generation/decode_fineweb.py" \
  --folder_dir "$DATA_PATH/sample/10BT/" \
  --save_path "$DATA_PATH/"

The training configs expect the FineWeb text files at $DATA_PATH/000_00000_train.txt and $DATA_PATH/val_text.txt. If your decoded files have different names, create those train/validation text files under $DATA_PATH before launching training.
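Whether the two expected text files are in place can be checked before launching a long run. This is a small sketch, not a repository script: missing_text_files is a hypothetical helper, and it reads DATA_PATH from the environment, so the variable must be exported or the path passed in explicitly.

```python
import os
from pathlib import Path

# File names the training configs expect, as stated in this README.
EXPECTED_TEXT_FILES = ("000_00000_train.txt", "val_text.txt")


def missing_text_files(data_path) -> list[str]:
    """Return the expected FineWeb text files that do not exist under data_path."""
    root = Path(data_path)
    return [name for name in EXPECTED_TEXT_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    data_path = os.environ.get("DATA_PATH", ".")
    missing = missing_text_files(data_path)
    if missing:
        print(f"Missing under {data_path}: {', '.join(missing)}")
    else:
        print("All expected FineWeb text files are present")
```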

Regenerate synthetic number problems

You can also generate fresh number problems locally. This is useful for development, but it will not produce the exact same examples used in the paper, so training results may differ.

Generate the number problems for each task for each phase:

# Decimal version (used for all base-10 baselines and for testing)
uv run "$PROJECT_PATH/data_generation/data_generation_v2.py" --save_dir "$DATA_PATH"
# Binary version (used for BitToken training)
uv run "$PROJECT_PATH/data_generation/data_generation_v2.py" --save_dir "$DATA_PATH" --significant_digits_distribution binary_uniform

Download and decode FineWeb as described above if you want to run the mixed numeric/text multitask configs.

Running experiments

To train a BitToken model in a multitask setting similar to the one in the manuscript, run:

uv run "$PROJECT_PATH/train.py" --load_config_from "$PROJECT_PATH/configs/config_bittoken_multiTask.py" --tqdm --verbose --deterministic --seed 999

Note

The first run has a longer startup time because it tokenizes the entire dataset first and stores it in a cache directory under $DATA_PATH/.

Training has been tested on an NVIDIA DGX A100 80GB GPU. Results are stored in the folder $PROJECT_PATH/trained.

Citation

If you find our work useful, please cite our ICML 2026 paper:

@inproceedings{
    kreitner2026bittokens,
    title={Efficient numeracy in language models through single-token number embeddings},
    author={Linus Kreitner and Paul Hager and Jonathan Mengedoht and Georgios Kaissis and Daniel Rueckert and Martin J. Menten},
    booktitle={Forty-third International Conference on Machine Learning},
    year={2026},
    url={https://openreview.net/forum?id=Bh4Ubk80M8}
}
