Welcome to the public repository for benchmarking Recallr AI against other memory providers (Supermemory, Mem0) on the LongMemEval (Oracle) benchmark.
| Provider | Strategy | Pass@1 Accuracy |
|---|---|---|
| Recallr AI | Agentic | 466/500 (93.2%) |
| Recallr AI | Low Latency | 439/500 (87.8%) |
| Recallr AI | Balanced | 428/500 (85.6%) |
| Mem0 | Non Graph | 313/500 (62.6%) |
| Mem0 | Graph | 311/500 (62.2%) |
| Supermemory | Default | 159/500 (31.8%) |
| Provider | Strategy | Min | P25 | Median | P95 | Max |
|---|---|---|---|---|---|---|
| Recallr AI | Low Latency | 0.234 | 0.265 | 0.299 | 0.408 | 0.750 |
| Recallr AI | Balanced | 1.032 | 1.132 | 1.198 | 1.575 | 3.548 |
| Recallr AI | Agentic | 5.125 | 6.194 | 6.997 | 8.619 | 20.095 |
| Mem0 | Non Graph | 0.489 | 0.504 | 0.786 | 1.787 | 6.171 |
| Mem0 | Graph | 0.697 | 0.746 | 0.961 | 2.692 | 10.458 |
| Supermemory | Default | 0.392 | 0.851 | 1.301 | 3.293 | 4.242 |
| Question Type | Agentic | Balanced | Low Latency |
|---|---|---|---|
| Knowledge Update | 92.3% | 94.9% | 97.4% |
| Multi-session | 89.5% | 91.0% | 91.0% |
| Single-session Assistant | 100.0% | 26.8% | 26.8% |
| Single-session Preference | 100.0% | 93.3% | 96.7% |
| Single-session User | 100.0% | 95.7% | 98.6% |
| Temporal Reasoning | 89.5% | 92.5% | 97.0% |
| Question Type | Non Graph | Graph |
|---|---|---|
| Knowledge Update | 76.9% | 75.6% |
| Multi-session | 65.4% | 63.2% |
| Single-session Assistant | 19.6% | 19.6% |
| Single-session Preference | 90.0% | 90.0% |
| Single-session User | 90.0% | 90.0% |
| Temporal Reasoning | 48.9% | 50.4% |
| Question Type | Default |
|---|---|
| Knowledge Update | 60.3% |
| Multi-session | 35.3% |
| Single-session Assistant | 3.6% |
| Single-session Preference | 20.0% |
| Single-session User | 30.0% |
| Temporal Reasoning | 27.1% |
Below are the commands used to run and evaluate each of the benchmark scripts on 500 records from longmemeval_oracle.json.
Run the benchmark:
uv run python3 run_recallr_longmemeval.py \
--data-path data/longmemeval/longmemeval_oracle.json \
--start-index 0 --end-index 499 \
--parallelism 20 --output-dir runsEvaluate the results:
uv run python3 evaluate_runs.py \
--provider recallr \
--benchmark-version oracle \
--requests-per-minute 200Run the benchmark:
uv run python3 run_mem0_longmemeval.py \
--data-path data/longmemeval/longmemeval_oracle.json \
--start-index 0 --end-index 499 \
--parallelism 20 --output-dir runsEvaluate the results:
uv run python3 evaluate_runs.py \
--provider mem0 \
--benchmark-version oracle \
--requests-per-minute 200Run the benchmark:
uv run python3 run_supermemory_longmemeval.py \
--data-path data/longmemeval/longmemeval_oracle.json \
--start-index 0 --end-index 499 \
--parallelism 20 --output-dir runsEvaluate the results:
uv run python3 evaluate_runs.py \
--provider supermemory \
--benchmark-version oracle \
--requests-per-minute 200Contributions are welcome! If you want to add new memory providers, datasets, or optimize existing strategies, feel free to open a pull request or submit an issue.