KreitnerL/BitTokens

BitTokens: Efficient numeracy in language models through single-token number embeddings

arXiv paper | ICML 2026 Spotlight | BitTokens Website | Jupyter notebook (browser) | Hugging Face dataset | MIT License

LLMs perform poorly on arithmetic tasks, requiring excessive reasoning tokens to achieve good performance. We propose BitTokens, a novel encoding strategy that represents any number as a single token using its IEEE 754 binary floating-point representation. This single-token number encoding allows language models to solve arithmetic tasks both effectively and efficiently.

[Figure 1]
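To make the idea concrete, here is a minimal sketch of how a number maps to its IEEE 754 bit pattern and back. It assumes the 64-bit double-precision format and uses only Python's standard struct module. The function names float_to_bits and bits_to_float are illustrative and are not taken from bittoken_embedding.py, which remains the reference implementation.

```python
import struct


def float_to_bits(x: float) -> list[int]:
    """Return the 64 IEEE 754 double-precision bits of x, sign bit first."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack(">Q", struct.pack(">d", x))
    return [(as_int >> i) & 1 for i in range(63, -1, -1)]


def bits_to_float(bits: list[int]) -> float:
    """Inverse of float_to_bits: rebuild the double from its 64 bits."""
    as_int = 0
    for bit in bits:
        as_int = (as_int << 1) | bit
    (x,) = struct.unpack(">d", struct.pack(">Q", as_int))
    return x


bits = float_to_bits(-6.25)
# Assumed layout (binary64): 1 sign bit, 11 exponent bits, 52 mantissa bits.
sign, exponent, mantissa = bits[0], bits[1:12], bits[12:]
print(sign, exponent, mantissa[:4])
print(bits_to_float(bits))
```

In this sketch every number, regardless of magnitude, becomes a fixed-length vector of 64 bits, which is the property that lets a model treat it as one token rather than a sequence of digit tokens.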

How to use BitTokens

To get started, check out our interactive Jupyter notebook.

A more detailed implementation of BitTokens can be found in the bittoken_embedding.py file.

Setup

Package manager UV

Tip

We recommend using the fast package manager uv for dependency management, but you may use any other package manager. For that case, we provide an additional requirements.txt file. If you do not use uv, replace uv run with python in the commands below.

  1. Download and install the fast package manager UV.
    # Download and install uv (the project requires python version >=3.13)
    curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Sync uv environment
    # Installs python 3.13, torch 2.11, and other dependencies
    uv sync

Install remaining dependencies:

Note

At the time of writing, there exists no official pre-built wheel for FlashAttention with torch=2.11 and python=3.13. We use this approach instead.

Tip

FlashAttention can be difficult to install. If you run into an error, please refer to the official install guide.

uv pip install "flash_attn-2.8.3+cu12torch2.11cxx11abiTRUE-cp313-cp313-linux_x86_64.whl" # replace with `flash-attn==2.8.3 --no-build-isolation` once official wheel available
uv pip install git+https://github.com/KellerJordan/Muon

Prepare Environment

  1. Create an .env file and define the following variables:

    PROJECT_PATH=... # Absolute path to the 'BitTokens/' folder
    DATA_PATH=...    # Absolute path to data folder
    
    # [Optional] If you want to use the eval_scripts
    OPENROUTER_API_KEY=...
  2. For convenience, load the .env file into your shell so that the commands below can expand these variables.

    source .env
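Before launching anything, it can help to check that the required variables are actually defined. The following is a minimal sketch, not part of the repository: parse_env and missing_required are hypothetical helpers, the example paths are placeholders, and the parser is deliberately simplistic (it treats every # as the start of a comment, so values containing # are not supported).

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop inline comments
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_required(values, required=("PROJECT_PATH", "DATA_PATH")):
    """Return the required variable names that are unset or empty."""
    return [key for key in required if not values.get(key)]


# Placeholder content; in practice, read the text of your own .env file.
example = """
PROJECT_PATH=/home/user/BitTokens # Absolute path to the 'BitTokens/' folder
DATA_PATH=/home/user/data         # Absolute path to data folder
OPENROUTER_API_KEY=
"""
env = parse_env(example)
print(missing_required(env))
```

OPENROUTER_API_KEY is left out of the required set because it is only needed for the eval_scripts.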

Get the datasets

Exact paper dataset

To reproduce the manuscript-style training commands below, download the exact synthetic number-problem dataset used by the paper. Set DATA_PATH to the directory where the files should be placed, then run:

uv run --with huggingface_hub hf download KreitnerL/BitTokens-dataset --repo-type dataset --local-dir "$DATA_PATH"

Dataset page: https://huggingface.co/datasets/KreitnerL/BitTokens-dataset

The dataset contains all synthetic number-problem CSV files referenced by the BitToken configs and the FoNE, xVal, significant-digit, token-digit, and base-10 baseline configs. It includes the standard arithmetic tasks plus the hard tasks: Exponentiation, Mean, and Std. It also includes the binary-uniform curriculum files used by BitTokens where referenced by the configs.

The hosted dataset has 37 CSV files: 14 train CSVs, 14 validation CSVs, and 9 test CSVs. It intentionally does not include FineWeb-derived .txt files; those should be downloaded from the public FineWeb dataset instead.

The hosted CSV files keep only the columns required for training and evaluation: prompt, text_prompt, answer, difficulty, and difficulty_sd.
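As a quick sanity check after downloading, the column list above can be verified with the standard csv module. This is a sketch under assumptions: missing_columns is a hypothetical helper, and the sample row is made up purely for illustration and does not come from the dataset.

```python
import csv
import io

REQUIRED_COLUMNS = ("prompt", "text_prompt", "answer", "difficulty", "difficulty_sd")


def missing_columns(csv_file) -> list[str]:
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [column for column in REQUIRED_COLUMNS if column not in header]


# Made-up in-memory sample; replace with open(path, newline="") for a real file.
sample = io.StringIO(
    "prompt,text_prompt,answer,difficulty,difficulty_sd\n"
    "1+2,What is 1+2?,3,0.1,0.0\n"
)
print(missing_columns(sample))
```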

FineWeb text data

The multitask configs mix the synthetic number-problem data with text data. Download FineWeb from its original public Hugging Face dataset rather than from this repo:

uv run --with huggingface_hub hf download HuggingFaceFW/fineweb \
  --repo-type dataset \
  --include "sample/10BT/*.parquet" \
  --local-dir "$DATA_PATH"

Decode the downloaded parquet files to text files:

uv run $PROJECT_PATH/data_generation/decode_fineweb.py \
  --folder_dir "$DATA_PATH/sample/10BT/" \
  --save_path "$DATA_PATH/"

The training configs expect the FineWeb text files at $DATA_PATH/000_00000_train.txt and $DATA_PATH/val_text.txt. If your decoded files have different names, create those train/validation text files under $DATA_PATH before launching training.
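A small check like the following can confirm that both text files are in place before training starts. It is a sketch only: missing_text_files is a hypothetical helper, and the temporary directory stands in for your real DATA_PATH.

```python
import tempfile
from pathlib import Path

# File names the training configs expect under DATA_PATH (from the text above).
EXPECTED_TEXT_FILES = ("000_00000_train.txt", "val_text.txt")


def missing_text_files(data_path) -> list[str]:
    """Return the expected FineWeb text files that do not exist in data_path."""
    root = Path(data_path)
    return [name for name in EXPECTED_TEXT_FILES if not (root / name).is_file()]


# Demonstration with a temporary directory that contains only the validation file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "val_text.txt").write_text("placeholder\n")
    print(missing_text_files(tmp))  # the train file is reported as missing
```

To check a real setup, pass the value of DATA_PATH instead, for example missing_text_files(os.environ["DATA_PATH"]) after importing os.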

Regenerate synthetic number problems

You can also generate fresh number problems locally. This is useful for development, but it will not produce the exact same examples used in the paper, so training results may differ.

  1. Generate the number problems for each task for each phase:
    # Decimal version (used for all base-10 baselines and for testing)
    uv run $PROJECT_PATH/data_generation/data_generation_v2.py --save_dir $DATA_PATH
    # Binary version (used for BitToken training)
    uv run $PROJECT_PATH/data_generation/data_generation_v2.py --save_dir $DATA_PATH --significant_digits_distribution binary_uniform
  2. Download and decode FineWeb as described above if you want to run the mixed numeric/text multitask configs.

Running experiments

To train a BitToken model in a multitask setting similar to the one in the manuscript, run:

uv run $PROJECT_PATH/train.py --load_config_from $PROJECT_PATH/configs/config_bittoken_multiTask.py --tqdm --verbose --deterministic --seed 999

Note

The first run has a longer startup time because it tokenizes the entire dataset first and stores it in a cache directory under $DATA_PATH/.

This has been tested on an NVIDIA DGX A100 80GB GPU. The results will be stored in the folder $PROJECT_PATH/trained.

Citation

If you find our work useful, please cite our ICML 2026 paper:

@inproceedings{
    kreitner2026bittokens,
    title={Efficient numeracy in language models through single-token number embeddings},
    author={Linus Kreitner and Paul Hager and Jonathan Mengedoht and Georgios Kaissis and Daniel Rueckert and Martin J. Menten},
    booktitle={Forty-third International Conference on Machine Learning},
    year={2026},
    url={https://openreview.net/forum?id=Bh4Ubk80M8}
}
