Description
Fast read access to experiment files is essential. Alternatives such as LMDB or Hugging Face Datasets may offer better performance in some scenarios. The current Parquet dataset is broken and unlikely to work satisfactorily. Benchmarking is needed to determine the tradeoffs.
Potential candidates: LMDB, Hugging Face Datasets, memory-mapped .npy arrays (PolarBERT)
Some of these formats provide fast random access (like SQLite), while others are read sequentially and therefore require randomization on write. As a result, the user experience differs between them. We should consider whether and how we can support both regimes.
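To make the two regimes concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not a proposed implementation: the `events` table, the fixed-size `(event_id, value)` record layout, and all function names are hypothetical. SQLite stands in for the random-access formats, and a flat binary file stands in for the sequential ones.

```python
import random
import sqlite3
import struct

# Hypothetical fixed-size record: int32 event_id followed by a float32 value.
RECORD = struct.Struct("<if")


def write_sqlite(path, events):
    """Random-access regime: write in any order, shuffle later at read time."""
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE events (event_id INTEGER PRIMARY KEY, value REAL)")
    con.executemany("INSERT INTO events VALUES (?, ?)", events)
    con.commit()
    con.close()


def read_sqlite_shuffled(path, seed):
    """Draw the keys in random order and fetch each row by key."""
    con = sqlite3.connect(path)
    ids = [row[0] for row in con.execute("SELECT event_id FROM events")]
    random.Random(seed).shuffle(ids)
    query = "SELECT event_id, value FROM events WHERE event_id = ?"
    rows = [con.execute(query, (i,)).fetchone() for i in ids]
    con.close()
    return rows


def write_sequential(path, events, seed):
    """Sequential regime: the order is fixed on write, so shuffle here."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    with open(path, "wb") as f:
        for event_id, value in shuffled:
            f.write(RECORD.pack(event_id, value))


def read_sequential(path):
    """Stream records front to back; the reader cannot reorder cheaply."""
    with open(path, "rb") as f:
        while chunk := f.read(RECORD.size):
            yield RECORD.unpack(chunk)
```

The difference for the user is where the shuffle seed lives: with random access it is a read-time argument and can change every epoch, while with sequential files it is baked in when the file is written, so a new order means rewriting the file (or keeping several pre-shuffled copies).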
Acceptance Criteria