Native serialization to a stream for FlatIndex #280

Merged
razdoburdin merged 9 commits into intel:dev/razdoburdin_streaming from razdoburdin:serialization_flat_index
Mar 4, 2026

@razdoburdin
Contributor

Reopening of #275 for the developer branch.

@mergify
mergify Bot commented Mar 2, 2026

⚠️ The sha of the head commit of this PR conflicts with #275. Mergify cannot evaluate rules on this PR. Once #275 is merged or closed, Mergify will resume processing this PR. ⚠️


template <typename T = void> class StreamWriter : public Writer<T, StreamWriter<T>> {
public:
StreamWriter(std::ostream& os)
Member

The Header structure that FileWriter writes at the beginning of a file seems to carry important information, including:

  • magic number and uuid, for versioning;
  • stored data size.

Why does StreamWriter not populate the same header?

General question:
How are we going to handle cases where several objects have to be stored in or loaded from one stream?
E.g. for the Vamana index we have to store/load the configuration, the graph, and the data (where the data may contain 2 simple datasets in the LVQ/LeanVec cases).

Contributor Author

For FileWriter the ostream is seekable (we know that it is an fstream), so we can insert a placeholder, write the data, calculate the size of the data written, and replace the placeholder with the actual header. For StreamWriter, however, the ostream may be non-seekable, so we can't use the same trick with a placeholder Header.
I see two options here:

  1. Create a temporary seekable stringstream and use it as a buffer. But this creates a 2x memory overhead during serialization.
  2. Extract all required information from metadata. In this case we don't need Header.

I have used the first approach (with stringstream) for toml::table serialization: the metadata are small, so the overhead looks like an acceptable trade-off in this case.
But for the main data I am trying to implement the second option (without overhead). I haven't started work on Vamana yet, so I am not sure whether the metadata contains all the required information in that case.
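The two header strategies discussed above can be sketched as follows. This is only an illustration under stated assumptions: the `Header` struct, its fields, and both function names are hypothetical and do not reproduce the library's actual `Header`, `FileWriter`, or `StreamWriter`.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical fixed-size header; the real Header layout is not shown in this thread.
struct Header {
    std::uint64_t magic = 0x1234ABCDu;
    std::uint64_t payload_bytes = 0;
};

// Seekable stream: write a placeholder header, write the payload,
// then seek back and overwrite the placeholder with the real size.
inline void write_with_placeholder(std::ostream& os, const std::vector<float>& data) {
    const std::ostream::pos_type start = os.tellp();
    Header header{};
    os.write(reinterpret_cast<const char*>(&header), sizeof(header));
    os.write(reinterpret_cast<const char*>(data.data()),
             static_cast<std::streamsize>(data.size() * sizeof(float)));
    const std::ostream::pos_type end = os.tellp();
    header.payload_bytes = static_cast<std::uint64_t>(end - start) - sizeof(header);
    os.seekp(start);
    os.write(reinterpret_cast<const char*>(&header), sizeof(header));
    os.seekp(end); // restore the write position for subsequent writes
}

// Possibly non-seekable stream (option 1): serialize into a temporary
// in-memory buffer first, so the size is known before the header is emitted.
// The payload exists twice in memory for a moment, hence the 2x overhead.
inline void write_with_buffer(std::ostream& os, const std::vector<float>& data) {
    std::ostringstream buffer(std::ios::out | std::ios::binary);
    buffer.write(reinterpret_cast<const char*>(data.data()),
                 static_cast<std::streamsize>(data.size() * sizeof(float)));
    const std::string bytes = buffer.str();
    Header header{};
    header.payload_bytes = bytes.size();
    os.write(reinterpret_cast<const char*>(&header), sizeof(header));
    os.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}
```

Both functions produce the same bytes; the second one never calls `tellp` or `seekp` on the destination stream, which is what makes it usable with pipes or sockets at the cost of the temporary copy.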

Member

@rfsaliev rfsaliev Mar 4, 2026

So, I would add a test for flat+LVQ/LeanVec to validate that multi-dataset cases are handled properly.

Member

@rfsaliev rfsaliev left a comment

LGTM
Except for the concerns regarding multiple data/datasets in 1 stream, which are to be verified in the next steps during implementation of Vamana index support.

@razdoburdin razdoburdin merged commit c6c42c4 into intel:dev/razdoburdin_streaming Mar 4, 2026
37 checks passed
razdoburdin added a commit that referenced this pull request Apr 16, 2026
This PR adds native stream serialization to all SVS index types, as an
alternative to the existing (legacy) directory-based serialization. It
avoids filesystem round-trips of the data. The native serialization
does not require the stream to be seekable, so no additional
restrictions are introduced.

See the following PRs for details:
#280,
#281,
#285,
#286,
#289,
#292,
#294,
#296,
#299

Main changes are:
1. A CRTP base `Archiver` extracts the binary I/O primitives (`write_size`,
`read_size`, `write_name`, `read_name`, `read_from_istream`) from
`DirectoryArchiver`. `DirectoryArchiver` and the new `StreamArchiver` class
both inherit from `Archiver`. `StreamArchiver` has its own magic number
("SVS_STRM") to distinguish native streams from directory archives.
2. The monolithic `Writer` is split via CRTP into two derived classes:
`FileWriter` owns an `std::ofstream`, writes a header, and flushes in its
destructor; `StreamWriter` wraps an external `std::ostream&` and does no
header/lifecycle management. This allows `io::save(data, os)` to write
vector data directly to any stream.
3. The `save(stream)` in the orchestrator `Impl` classes no longer does
temp-dir->pack. Instead it calls `impl().save(stream)` directly.
4. The dispatching between the new (native) and old (legacy) deserialization
is done in the orchestrators. `Deserializer::build(is)` reads the magic
number and exposes `is_native()` to choose the path.

---------

Co-authored-by: Dmitry Razdoburdin <drazdobu@intel.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Rafik Saliev <rafik.f.saliev@intel.com>
Co-authored-by: ethanglaser <42726565+ethanglaser@users.noreply.github.com>

2 participants