Memory consumption of tree sequence statistics

When the output dimension of a statistic is large, so is the memory consumption.

The following example calculates the pairwise distances between all samples in a single tree, and it requires a bit over 7 GB of RAM even for a small number of samples (1000).

```python
import msprime
import numpy as np
import tskit


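def pairwise_distance_branch(ts: tskit.TreeSequence, samples):
    # One singleton sample set per sample; use the sample's node ID,
    # not its position in the list.
    sample_sets = [[u] for u in samples]
    # All unordered pairs (i, j) with i < j, as indexes into sample_sets.
    indexes = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            indexes.append((i, j))

    # The output has one value per index pair.
    div = ts.divergence(sample_sets, indexes=indexes, mode="branch")
    return div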


print(msprime.__version__)
print(tskit.__version__)
ts = msprime.simulate(1000, random_seed=12345)
div = pairwise_distance_branch(ts, [i for i in ts.samples()])
```
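
For a sense of scale, here is a small sketch of how large the requested output is. It assumes the statistic returns one 64-bit float per index pair for a single window; `num_pairs` is a name introduced here purely for illustration.

```python
def num_pairs(n: int) -> int:
    """Number of unordered pairs (i, j) with i < j among n samples."""
    return n * (n - 1) // 2


n = 1000
output_dim = num_pairs(n)
# Assumption: one float64 (8 bytes) per pair, single window.
output_bytes = output_dim * 8

print(output_dim)  # 499500
print(f"{output_bytes / 1e6:.1f} MB")  # 4.0 MB
```

Under that assumption the result itself is only about 4 MB, so the output array alone does not account for the observed usage.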

The versions are:
msprime 0.7.4
tskit 0.2.3

From talking to @petrelharp about this, it appears that some or most of the RAM use may be attributable to memoization during the calculation, which he suspects may not be necessary.
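
As a rough sanity check on that hypothesis, the sketch below estimates the memory needed if one output-sized array of 64-bit floats were kept per node of the tree. This is only a back-of-envelope model and not a description of what tskit actually does; `cached_bytes` is a hypothetical helper, and the node count assumes a binary tree with `num_samples` leaves.

```python
def cached_bytes(num_samples: int, bytes_per_value: int = 8) -> int:
    """Bytes used if every node stored one value per sample pair (assumed model)."""
    # A binary tree with n leaves has 2n - 1 nodes.
    num_nodes = 2 * num_samples - 1
    # One output value per unordered pair of samples.
    output_dim = num_samples * (num_samples - 1) // 2
    return num_nodes * output_dim * bytes_per_value


estimate = cached_bytes(1000)
print(f"{estimate / 2**30:.1f} GiB")
```

For 1000 samples this comes to roughly 7.4 GiB, which is in the same range as the usage reported above, so per-node storage of this kind would be one plausible explanation.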
