A Zig binding for NMSLIB (Non-Metric Space Library), providing efficient similarity search and k-NN graph construction for high-dimensional data.
NMSLIB-ZIG is a Zig wrapper around NMSLIB, a popular library for approximate nearest neighbor search. This binding exposes NMSLIB's functionality through a clean Zig API with proper memory management and error handling. It uses a custom C wrapper (nmslib_c.cpp) to bridge between Zig and the C++ NMSLIB library.
- Multiple Data Types: Support for dense vectors, sparse vectors, uint8 vectors, and string data
- Various Distance Metrics: L2, cosine similarity, sparse metrics, and more
- Multiple Indexing Methods: HNSW, VP-Tree, Sequential Search, and others
- Efficient Memory Management: Custom allocator integration with Zig's allocator system
- Batch Operations: Efficient batch insertion and querying
- Save/Load Indexes: Persist indexes to disk for reuse
- Thread Pool Support: Configurable thread pool for parallel queries
- Comprehensive Error Handling: Proper error mapping from C to Zig errors
The project consists of three main components:
- lib.zig: The main Zig API that provides idiomatic Zig bindings
- nmslib_c.cpp: A C wrapper around the C++ NMSLIB library
- nmslib_c.h: C header defining the interface between Zig and C++
The architecture uses:
- @cImport to import the C interface into Zig
- Custom allocator callbacks for memory management across language boundaries
- Thread-local error tracking for detailed error reporting
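To illustrate the thread-local error tracking idea, here is a minimal sketch in plain Zig. All names (`setLastError`, `lastError`, the 256-byte buffer) are assumptions made for the example; the binding's real error state may well live in the C++ wrapper and look different.

```zig
const std = @import("std");

// Hypothetical names throughout; this is not the binding's actual code.
// Each thread gets its own copy, so concurrent failures do not overwrite each other.
threadlocal var last_error_buf: [256]u8 = undefined;
threadlocal var last_error_len: usize = 0;

/// Format a message into the calling thread's buffer.
fn setLastError(comptime fmt: []const u8, args: anytype) void {
    const written = std.fmt.bufPrint(&last_error_buf, fmt, args) catch {
        // The message did not fit: store a short fixed fallback instead.
        const fallback = "error message too long";
        @memcpy(last_error_buf[0..fallback.len], fallback);
        last_error_len = fallback.len;
        return;
    };
    last_error_len = written.len;
}

/// Return the calling thread's most recent message (empty if none was set).
fn lastError() []const u8 {
    return last_error_buf[0..last_error_len];
}

fn readFromOtherThread(len_out: *usize) void {
    len_out.* = lastError().len;
}

pub fn main() !void {
    setLastError("invalid dim: {d}", .{-1});

    // A second thread sees its own, still empty, error state.
    var other_len: usize = 99;
    const t = try std.Thread.spawn(.{}, readFromOtherThread, .{&other_len});
    t.join();

    std.debug.print("main: \"{s}\", other thread length: {d}\n", .{ lastError(), other_len });
}
```

Keeping the message per thread means a caller can read the details of its own failed call without a lock.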
- Zig (master or recent version)
- C++ compiler with C++17 support
- OpenMP support (for parallel operations)
zig build
zig build test

- Sparse Vectors: Element IDs must be 1-based (not 0-based), positive, and strictly increasing within each vector. This is a requirement of the underlying NMSLIB library.
- Dense Vectors: Require a dim parameter specifying dimensionality for L2 and cosine spaces.
- Dense UInt8 Vectors: Require integer distance type (.Int) and are optimized for SIFT-style descriptors.
- String Objects: Levenshtein distance ("leven") requires integer distance type (.Int).
Certain space types require specific distance types:
- "leven" (Levenshtein) → .Int distance type only
- "l2sqr_sift" → .Int distance type and 128-D uint8 vectors
- Vector spaces ("l2", "l2sqr", "cosine") → require "dim" parameter
const std = @import("std");
const nmslib = @import("lib.zig");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Create space parameters (specify dimensionality)
    var params = try nmslib.Params.init(allocator);
    defer params.deinit();
    try params.add("dim", .{ .Int = 4 });

    // Create index with L2 distance and HNSW method
    var index = try nmslib.Index.init(
        allocator,
        "l2", // Space type
        params, // Space parameters
        "hnsw", // Indexing method
        .DenseVector, // Data type
        .Float, // Distance type
    );
    defer index.deinit();

    // Add data points
    const data = [_][]const f32{
        &[_]f32{ 1.0, 0.0, 0.0, 0.0 },
        &[_]f32{ 0.0, 1.0, 0.0, 0.0 },
        &[_]f32{ 0.0, 0.0, 1.0, 0.0 },
    };
    const ids = [_]i32{ 1, 2, 3 };
    try index.addDenseBatch(&data, &ids);

    // Build the index
    try index.buildIndex(null, false);

    // Perform k-NN query
    const query = nmslib.QueryPoint{ .DenseVector = &[_]f32{ 1.0, 0.1, 0.0, 0.0 } };
    const result = try index.knnQuery(query, 2);
    defer result.deinit();

    std.debug.print("Found {} neighbors\n", .{result.ids.len});
    for (result.ids, result.distances) |id, dist| {
        std.debug.print("ID: {}, Distance: {d:.4}\n", .{ id, dist });
    }
}

const nmslib = @import("lib.zig");
var index = try nmslib.Index.init(
    allocator,
    "cosinesimil_sparse",
    null,
    "hnsw",
    .SparseVector,
    .Float,
);
defer index.deinit();

// Sparse vectors as arrays of {id, value} pairs
// Note: IDs must be 1-based, positive, and strictly increasing (NMSLIB requirement)
const data = [_][]const nmslib.SparseElem{
    &[_]nmslib.SparseElem{
        .{ .id = 1, .value = 1.0 },
        .{ .id = 5, .value = 2.0 },
    },
    &[_]nmslib.SparseElem{
        .{ .id = 2, .value = 1.0 },
        .{ .id = 10, .value = 3.0 },
    },
};
try index.addSparseBatch(&data, null);
try index.buildIndex(null, false);

const query = nmslib.QueryPoint{
    .SparseVector = &[_]nmslib.SparseElem{
        .{ .id = 1, .value = 1.0 },
    },
};
const result = try index.knnQuery(query, 5);
defer result.deinit();

var index = try nmslib.Index.init(
    allocator,
    "leven", // Levenshtein distance
    null,
    "hnsw",
    .ObjectAsString,
    .Int, // Levenshtein requires integer distance type
);
defer index.deinit();

const data = [_][]const u8{ "hello", "world", "test" };
try index.addStringBatch(&data, null);
try index.buildIndex(null, false);

const query = nmslib.QueryPoint{ .ObjectAsString = "hello" };
const result = try index.knnQuery(query, 2);
defer result.deinit();

// Save index
try index.save("my_index.bin", true); // true = save data

// Load index
var loaded_index = try nmslib.Index.load(
    allocator,
    "my_index.bin",
    .DenseVector,
    .Float,
    true, // true = load data
);
defer loaded_index.deinit();

const queries = [_][]const f32{
    &[_]f32{ 1.0, 0.0, 0.0 },
    &[_]f32{ 0.0, 1.0, 0.0 },
};

// Note: thread_pool_size parameter is currently unused in the implementation
const batch_result = try index.knnQueryBatch(&queries, 10, null);
defer batch_result.deinit();

for (batch_result.results, 0..) |result, i| {
    std.debug.print("Query {}: Found {} neighbors\n", .{ i, result.ids.len });
}

// Find all points within a distance radius
// Note: Only works with DenseVector data type
const query = &[_]f32{ 1.0, 0.0, 0.0 };
const result = try index.rangeQuery(query, 0.5); // radius = 0.5
defer result.deinit();

The main index structure for similarity search.
Methods:
- init() - Create a new index
- deinit() - Free the index
- reset() - Reset the index and clear data storage
- addDenseBatch() - Add dense vector batch
- addSparseBatch() - Add sparse vector batch (IDs must be 1-based, positive, strictly increasing)
- addUInt8Batch() - Add uint8 vector batch
- addStringBatch() - Add string data batch
- buildIndex() - Build the index structure
- clearIndexCache() - Clear the built index cache
- knnQuery() - Perform k-nearest neighbor query
- knnQueryBatch() - Batch k-NN queries
- rangeQuery() - Find all neighbors within radius
- save() - Save index to disk
- load() - Load index from disk
- getDistance() - Get distance between two points
- getDataPoint() - Retrieve a data point
- borrowDataPointString() - Borrow string data point (no copy)
- borrowDataDense() - Borrow dense vector data point (no copy)
- borrowDataSparse() - Borrow sparse vector data point (no copy)
- dataQty() - Get number of data points
- setQueryTimeParams() - Set query-time parameters
- setThreadPoolSize() - Configure thread pool size
- getThreadPoolSize() - Get current thread pool size
- getSpaceType() - Get the space type string
- getMethod() - Get the indexing method string
- getDataType() - Get the data type enum
Parameter management for spaces and indexes.
Methods:
- init() - Create new parameter set
- deinit() - Free parameter set
- add() - Add a parameter (key: []const u8, value: ParamValue)
- has() - Check if a parameter exists
- fromSlice() - Create from key-value pairs
Union type for parameter values:
- String - []const u8 string value
- Int - i32 integer value
- Double - f64 floating-point value
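A tagged union with these three variants can be handled exhaustively with a `switch`. The sketch below re-declares a stand-in `ParamValue` locally and renders a parameter as a `key=value` string; the `describe` helper is hypothetical, and how the binding actually passes parameters to NMSLIB is not shown here.

```zig
const std = @import("std");

// Local stand-in with the three variants listed above; not the binding's own declaration.
const ParamValue = union(enum) {
    String: []const u8,
    Int: i32,
    Double: f64,
};

/// Hypothetical helper: write "key=value" into buf and return the written slice.
fn describe(buf: []u8, key: []const u8, value: ParamValue) ![]u8 {
    // The switch must cover every variant, so adding a variant later forces an update here.
    return switch (value) {
        .String => |s| std.fmt.bufPrint(buf, "{s}={s}", .{ key, s }),
        .Int => |i| std.fmt.bufPrint(buf, "{s}={d}", .{ key, i }),
        .Double => |d| std.fmt.bufPrint(buf, "{s}={d}", .{ key, d }),
    };
}

pub fn main() !void {
    var buf: [64]u8 = undefined;
    const text = try describe(&buf, "dim", .{ .Int = 4 });
    std.debug.print("{s}\n", .{text});
}
```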
Enum of supported data types:
- DenseVector - Dense floating-point vectors
- SparseVector - Sparse vectors (id, value pairs)
- DenseUInt8Vector - Dense uint8 vectors
- ObjectAsString - String objects
Enum of distance computation types:
- Float - Floating-point distances
- Int - Integer distances
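Since this README states that the "leven" and "l2sqr_sift" spaces accept only integer distances, a caller may want to check the combination before creating an index. The following is a minimal sketch of such a check; `DistType`, `requiredDistType`, and `isCompatible` are hypothetical names, and the table covers only the two constraints mentioned in this document.

```zig
const std = @import("std");

// Hypothetical mirror of the two distance types listed above.
const DistType = enum { Float, Int };

/// Returns the distance type a space requires, or null when this sketch knows of no constraint.
fn requiredDistType(space: []const u8) ?DistType {
    if (std.mem.eql(u8, space, "leven")) return .Int;
    if (std.mem.eql(u8, space, "l2sqr_sift")) return .Int;
    return null;
}

/// True when the chosen distance type does not conflict with the space.
fn isCompatible(space: []const u8, dist: DistType) bool {
    const required = requiredDistType(space) orelse return true;
    return required == dist;
}

pub fn main() void {
    std.debug.print("leven with Float: {}\n", .{isCompatible("leven", .Float)});
    std.debug.print("leven with Int: {}\n", .{isCompatible("leven", .Int)});
    std.debug.print("l2 with Float: {}\n", .{isCompatible("l2", .Float)});
}
```

Checking up front gives a clearer message than waiting for index creation to fail with an error such as SpaceIncompatible.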
Union type for query data, tagged by DataType:
- DenseVector - Slice of f32 values
- SparseVector - Slice of SparseElem structs
- DenseUInt8Vector - Slice of u8 values
- ObjectAsString - Slice of u8 (string bytes)
Union type for stored data points, tagged by DataType:
- DenseVector - Slice of f32 values
- SparseVector - Slice of SparseElem structs
- DenseUInt8Vector - Slice of u8 values
- ObjectAsString - Slice of u8 (string bytes)
Structure for sparse vector elements:
- id - u32 element ID (1-based, must be strictly increasing)
- value - f32 element value
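The ID rule (1-based and strictly increasing) is easy to check before handing data to the library. Here is a minimal sketch of such a pre-check; the local `SparseElem` is a stand-in with the two fields described above, and `isValidSparseVector` is a hypothetical helper, not part of the binding.

```zig
const std = @import("std");

// Stand-in with the same two fields as described above.
const SparseElem = struct { id: u32, value: f32 };

/// Returns true when every ID is >= 1 and strictly greater than the one before it.
/// An empty vector passes this check.
fn isValidSparseVector(elems: []const SparseElem) bool {
    var prev: u32 = 0; // 0 is never a valid ID, so it doubles as "nothing seen yet"
    for (elems) |e| {
        if (e.id <= prev) return false; // rejects 0, duplicates, and out-of-order IDs
        prev = e.id;
    }
    return true;
}

pub fn main() void {
    const ok = [_]SparseElem{ .{ .id = 1, .value = 1.0 }, .{ .id = 5, .value = 2.0 } };
    const zero_based = [_]SparseElem{ .{ .id = 0, .value = 1.0 }, .{ .id = 3, .value = 2.0 } };
    std.debug.print("ok: {}, zero_based: {}\n", .{
        isValidSparseVector(&ok),
        isValidSparseVector(&zero_based),
    });
}
```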
Result of a k-NN or range query containing:
- ids - Slice of result IDs (trimmed to actual results)
- distances - Slice of distances (trimmed to actual results)
- full_ids - Full allocated buffer for IDs
- full_distances - Full allocated buffer for distances
- used - Number of results returned
- allocator - Allocator used for memory management
- deinit() - Free the result
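One plausible reason for keeping both trimmed and full slices is that a Zig allocator must be given back the slice it originally handed out, with its original length. The sketch below re-creates that layout under assumed field types (`i32` IDs, `f32` distances); the `Result` struct and its `setUsed` helper are hypothetical and only illustrate the pattern.

```zig
const std = @import("std");

// Hypothetical re-creation of the layout described above; not the binding's definition.
const Result = struct {
    ids: []i32,
    distances: []f32,
    full_ids: []i32,
    full_distances: []f32,
    used: usize,
    allocator: std.mem.Allocator,

    /// Allocate room for up to `capacity` neighbors; the trimmed views start empty.
    fn init(allocator: std.mem.Allocator, capacity: usize) !Result {
        const full_ids = try allocator.alloc(i32, capacity);
        errdefer allocator.free(full_ids);
        const full_distances = try allocator.alloc(f32, capacity);
        return Result{
            .ids = full_ids[0..0],
            .distances = full_distances[0..0],
            .full_ids = full_ids,
            .full_distances = full_distances,
            .used = 0,
            .allocator = allocator,
        };
    }

    /// Record how many entries were filled and re-point the trimmed views.
    fn setUsed(self: *Result, used: usize) void {
        std.debug.assert(used <= self.full_ids.len);
        self.used = used;
        self.ids = self.full_ids[0..used];
        self.distances = self.full_distances[0..used];
    }

    /// Free the full buffers; the trimmed views may be shorter than what was allocated.
    fn deinit(self: Result) void {
        self.allocator.free(self.full_ids);
        self.allocator.free(self.full_distances);
    }
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    var result = try Result.init(gpa.allocator(), 10); // asked for k = 10
    defer result.deinit();
    result.setUsed(3); // pretend only 3 neighbors were found
    std.debug.print("capacity {d}, returned {d}\n", .{ result.full_ids.len, result.ids.len });
}
```

Callers iterate over `ids` and `distances` and never need to consult `used` or the full buffers directly.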
Result of batch k-NN queries containing:
- results - Array of QueryResult structures (one per query)
- allocator - Allocator used for memory management
- deinit() - Free all results and the array
- l2 - Euclidean (L2) distance
- cosinesimil - Cosine similarity
- cosinesimil_sparse - Cosine similarity for sparse vectors
- l2_sift - L2 distance optimized for SIFT descriptors
- leven - Levenshtein distance (edit distance)
- And many more (see NMSLIB documentation)
- hnsw - Hierarchical Navigable Small World graphs (recommended)
- sw-graph - Small World graph
- vptree - Vantage Point Tree
- seq_search - Sequential search (brute force)
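To make the brute-force baseline concrete, here is a small self-contained sketch of exact k-NN under Euclidean distance, which is conceptually what a sequential search computes. It does not call NMSLIB; `l2` and `bruteForceKnn` are hypothetical helpers, and results are identified by position in the data slice rather than by user-supplied IDs.

```zig
const std = @import("std");

/// Euclidean distance between two vectors of equal length.
fn l2(a: []const f32, b: []const f32) f32 {
    std.debug.assert(a.len == b.len);
    var sum: f32 = 0;
    for (a, b) |x, y| {
        const d = x - y;
        sum += d * d;
    }
    return @sqrt(sum);
}

/// Scans every point and keeps the k = out_ids.len closest, nearest first.
/// Returns how many output slots were filled (fewer than k if data is small).
fn bruteForceKnn(
    data: []const []const f32,
    query: []const f32,
    out_ids: []usize,
    out_dists: []f32,
) usize {
    std.debug.assert(out_ids.len == out_dists.len);
    var found: usize = 0;
    for (data, 0..) |point, idx| {
        const dist = l2(point, query);
        // Walk left past every kept entry that is farther than the new point.
        var pos: usize = found;
        while (pos > 0 and out_dists[pos - 1] > dist) : (pos -= 1) {}
        if (pos >= out_ids.len) continue; // buffer full and this point is not closer
        // Shift worse entries one slot right, dropping the last when the buffer is full.
        var j: usize = if (found < out_ids.len) found else out_ids.len - 1;
        while (j > pos) : (j -= 1) {
            out_ids[j] = out_ids[j - 1];
            out_dists[j] = out_dists[j - 1];
        }
        out_ids[pos] = idx;
        out_dists[pos] = dist;
        if (found < out_ids.len) found += 1;
    }
    return found;
}

pub fn main() void {
    const data = [_][]const f32{
        &[_]f32{ 1.0, 0.0, 0.0, 0.0 },
        &[_]f32{ 0.0, 1.0, 0.0, 0.0 },
        &[_]f32{ 0.0, 0.0, 1.0, 0.0 },
    };
    var ids: [2]usize = undefined;
    var dists: [2]f32 = undefined;
    const n = bruteForceKnn(&data, &[_]f32{ 1.0, 0.1, 0.0, 0.0 }, &ids, &dists);
    for (ids[0..n], dists[0..n]) |id, dist| {
        std.debug.print("index {d}, distance {d:.4}\n", .{ id, dist });
    }
}
```

An exact scan like this costs one distance computation per stored point, which is why it is mainly useful for small datasets or as a ground truth when measuring the recall of an approximate method.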
The library provides comprehensive error handling through Zig's error system:
pub const Error = error{
    NullPointer,
    InvalidArgument,
    OutOfMemory,
    BufferTooSmall,
    SpaceIncompatible,
    QueryTooLarge,
    InvalidSparseElement,
    IndexBuildFailed,
    QueryExecutionFailed,
    DataIOFailed,
    PluginRegistrationFailed,
    Internal,
    Runtime,
    IndexNotBuilt,
    IndexAlreadyBuilt,
};

All operations return !T (error union types), allowing proper error propagation with try.
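As a rough illustration of how C status codes can be turned into a Zig error set and then handled by a caller, consider the sketch below. The numeric codes and the `mapStatus` function are invented for the example, and only a subset of the error names is used; the binding's actual mapping is not documented here.

```zig
const std = @import("std");

// Subset of the error names shown above; the numeric codes are assumptions.
const Error = error{ NullPointer, InvalidArgument, OutOfMemory, Internal };

/// Hypothetical translation of a C-style status code into a Zig error.
fn mapStatus(code: c_int) Error!void {
    switch (code) {
        0 => return,
        1 => return error.NullPointer,
        2 => return error.InvalidArgument,
        3 => return error.OutOfMemory,
        else => return error.Internal, // unknown codes collapse into a generic error
    }
}

pub fn main() void {
    // `try` would propagate the error; `catch` with a switch handles selected cases locally.
    mapStatus(2) catch |err| switch (err) {
        error.InvalidArgument => std.debug.print("bad argument, check parameters\n", .{}),
        else => std.debug.print("other failure: {s}\n", .{@errorName(err)}),
    };
}
```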
- Index creation and data insertion should be done from a single thread
- Queries can be performed from multiple threads after index creation
- Use setThreadPoolSize() to configure parallel query execution
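The build-once, query-from-many-threads pattern can be sketched with `std.Thread` and a stand-in for the index. `SharedIndex`, `nearest`, and `worker` are hypothetical; the stand-in is a read-only slice of numbers, so this shows only the threading shape, not the behavior of the real index.

```zig
const std = @import("std");

// Stand-in for an index that has been fully built and is only read afterwards.
const SharedIndex = struct { points: []const f32 };

/// Returns the position of the stored value closest to `query`.
fn nearest(index: *const SharedIndex, query: f32) usize {
    var best: usize = 0;
    var best_dist: f32 = std.math.floatMax(f32);
    for (index.points, 0..) |p, i| {
        const d = (p - query) * (p - query);
        if (d < best_dist) {
            best_dist = d;
            best = i;
        }
    }
    return best;
}

/// Each worker writes to its own output slot, so no locking is needed.
fn worker(index: *const SharedIndex, query: f32, out: *usize) void {
    out.* = nearest(index, query);
}

pub fn main() !void {
    // "Build" on a single thread first.
    const points = [_]f32{ 0.0, 1.0, 2.0, 3.0 };
    const index = SharedIndex{ .points = &points };

    // Then query from several threads at once.
    const queries = [_]f32{ 0.2, 2.9 };
    var results: [queries.len]usize = undefined;
    var threads: [queries.len]std.Thread = undefined;
    for (&threads, 0..) |*t, i| {
        t.* = try std.Thread.spawn(.{}, worker, .{ &index, queries[i], &results[i] });
    }
    for (threads) |t| t.join();

    for (queries, results) |q, r| {
        std.debug.print("query {d} -> position {d}\n", .{ q, r });
    }
}
```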
- Index Method Selection: HNSW is recommended for most use cases (good balance of speed and accuracy)
- Parameters: Tune space and index parameters for your specific use case
- Batch Operations: Use batch methods for better performance when inserting multiple data points
- Memory: Dense vectors store a value for every dimension, so they use more memory than sparse vectors when most values are zero
- Build vs Query Time: More complex index methods (like HNSW) take longer to build but provide faster queries
Run the included tests to verify functionality:
zig build test

Contributions are welcome! Please ensure:
- Code follows Zig style conventions
- All tests pass
- New features include tests
- Documentation is updated
This project wraps NMSLIB, which is licensed under the Apache License 2.0. Please refer to the NMSLIB project for its license terms.