
[REVIEW] Add multimodal reader/writer pipeline and scoped benchmarks#1517

Merged
VibhuJawa merged 63 commits into NVIDIA-NeMo:main from VibhuJawa:minimal_multi_modal_implimentation
Mar 4, 2026
Conversation


@VibhuJawa VibhuJawa commented Feb 18, 2026

Summary

Adds a row-wise multimodal ingestion and write path for WebDataset tar shards (MINT-1T style), with lazy materialization support for local, remote, and tar-archived binary content.

What this enables

  • Ingest MINT-1T-style WebDataset tar shards into a normalized row-wise schema (MultiBatchTask)
  • Filter rows with optional materialization (e.g. JPEG aspect-ratio filtering)
  • Write to Parquet with configurable compression and optional materialize-on-write
  • Three automatic I/O strategies for binary content: range reads, tar extraction, and direct file reads
  • Multi-frame TIFF extraction via frame_index in source_ref
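A hedged sketch of how the strategy selection could work, given a row's source_ref locator. Only source_ref, frame_index, byte_offset/byte_size, and the three strategy names appear in this summary; the exact field layout and the helper name below are illustrative assumptions:

```python
def pick_strategy(source_ref: dict) -> str:
    """Illustrative dispatch over the three binary I/O strategies.

    Assumed locator fields: tar_path (archive containing the member),
    byte_offset/byte_size (raw byte-range hints), path (plain file).
    """
    if source_ref.get("tar_path") is not None:
        # Byte-range hints allow reading the member without scanning the tar.
        if source_ref.get("byte_offset") is not None and source_ref.get("byte_size") is not None:
            return "range_read"
        # No hints: open the archive and extract the member sequentially.
        return "tar_extract"
    # Content lives outside any archive: read the file directly.
    return "direct_read"

ref = {"tar_path": "shard-00000.tar", "byte_offset": 1536, "byte_size": 2048, "frame_index": 0}
print(pick_strategy(ref))  # range_read
```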

Architecture

WebDataset tar shards
        |
        v
┌─────────────────────────┐
│  WebdatasetReader       │  CompositeStage: FilePartitioning + WebdatasetReaderStage
│  (io/reader.py)         │  Parses tar members -> normalized multimodal rows
└────────┬────────────────┘
         |  MultiBatchTask (Arrow/Pandas)
         v
┌─────────────────────────┐
│  Filter Stages          │  e.g. MultimodalJpegAspectRatioFilterStage
│  (stages.py)            │  Row-wise filtering with optional materialization
└────────┬────────────────┘
         |
         v
┌─────────────────────────┐
│  MultimodalParquet-     │  MultimodalParquetWriterStage
│  WriterStage            │  Parquet output with optional materialize-on-write
│  (io/writers/tabular.py)│  Supports snappy/zstd compression, configurable row groups
└─────────────────────────┘

By the numbers

| Metric | Count |
| --- | --- |
| New source files | 15 |
| New test files | 5 |
| Source lines (multimodal module + task) | 1,484 |
| Test lines | 895 |
| Benchmark script lines | 158 |
| Tutorial lines | 84 |
| Documentation (README) lines | 153 |
| Total new lines (multimodal) | 2,774 |
| New classes | 12 (8 public, 4 internal) |
| Unit tests | 39 |

Key components added

| Component | Path | Description |
| --- | --- | --- |
| MultiBatchTask | nemo_curator/tasks/multimodal.py | Task type wrapping Arrow/Pandas with a 9-column reserved schema (MULTIMODAL_SCHEMA) plus user passthrough columns |
| WebdatasetReader | stages/multimodal/io/reader.py | CompositeStage that decomposes into FilePartitioningStage + WebdatasetReaderStage |
| WebdatasetReaderStage | stages/multimodal/io/readers/webdataset.py | Parses tar members into normalized rows with source_ref locators and optional materialize-on-read |
| MultimodalParquetWriterStage | stages/multimodal/io/writers/tabular.py | Writes Parquet with optional materialize-on-write, snappy/zstd compression |
| BaseMultimodalFilterStage | stages/multimodal/stages.py | Base filter with row-validity checks and shared materialization iterator |
| MultimodalJpegAspectRatioFilterStage | stages/multimodal/stages.py | Filters JPEG images by aspect ratio bounds (requires Pillow) |
| Materialization utils | stages/multimodal/utils/materialization.py | Three-strategy dispatch: range read (fs.cat_ranges), tar extract, direct read |
| split_table_by_group_max_bytes | nemo_curator/core/utils.py | Splits Arrow tables by byte budget while keeping group rows together |
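The group-preserving split behind split_table_by_group_max_bytes can be sketched in plain Python. This sketch operates on a list of row dicts carrying a per-row nbytes estimate rather than on an actual Arrow table, and assumes rows of the same group are contiguous; a simplified illustration, not the utility itself:

```python
def split_by_group_max_bytes(rows: list[dict], group_key: str, max_bytes: int) -> list[list[dict]]:
    """Chunk `rows` so each chunk stays under `max_bytes` where possible,
    never splitting rows that share the same group value."""
    chunks: list[list[dict]] = []
    current: list[dict] = []
    current_bytes = 0
    i, n = 0, len(rows)
    while i < n:
        # Find the end of the contiguous run sharing this group value.
        j = i
        while j < n and rows[j][group_key] == rows[i][group_key]:
            j += 1
        group = rows[i:j]
        group_bytes = sum(r["nbytes"] for r in group)
        # Flush the current chunk if adding this group would exceed the budget.
        if current and current_bytes + group_bytes > max_bytes:
            chunks.append(current)
            current, current_bytes = [], 0
        current.extend(group)  # a single oversized group still stays whole
        current_bytes += group_bytes
        i = j
    if current:
        chunks.append(current)
    return chunks
```

A group larger than max_bytes is emitted as its own oversized chunk, since splitting it would break the keep-group-rows-together invariant.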

Minimal usage

from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.multimodal.io import WebdatasetReader, MultimodalParquetWriterStage
from nemo_curator.stages.multimodal.stages import MultimodalJpegAspectRatioFilterStage

pipeline = Pipeline(name="mint1t_pipeline")
pipeline.add_stage(WebdatasetReader(
    source_id_field="pdf_name",
    file_paths="/data/mint1t/shards/",
))
pipeline.add_stage(MultimodalJpegAspectRatioFilterStage(drop_invalid_rows=True))
pipeline.add_stage(MultimodalParquetWriterStage(
    path="/output/parquet/",
    materialize_on_write=True,
    mode="overwrite",
))
pipeline.run()

Benchmarking

Added benchmarking/scripts/multimodal_mint1t_benchmark.py with two nightly benchmark entries in nightly-benchmark.yaml:

| Benchmark | Executor | Materialize | CPUs | Description |
| --- | --- | --- | --- | --- |
| multimodal_mint1t_xenna | Xenna | off | 64 | Throughput baseline (no binary materialization) |
| multimodal_mint1t_xenna_materialize | Xenna | on-write | 64 | End-to-end with binary content fetched at write time |

Both benchmarks ship with enabled: false so activation is opt-in. Tracked metrics: throughput_rows_per_sec, num_rows, num_output_files, materialize_error_count.

The benchmark script supports:

  • Configurable executor (--executor xenna|ray_data)
  • Materialize-on-read and materialize-on-write toggles
  • Parquet compression and row group size tuning
  • Output metrics collection (modality counts, file sizes, error rates)

Other changes

  • nemo_curator/stages/base.py: Added _time_metric() context manager for timing code blocks within stages
  • nemo_curator/core/utils.py: Added split_table_by_group_max_bytes() for splitting Arrow tables by byte budget
  • pyproject.toml: Added Pillow to the image_cpu dependency group
  • benchmarking/scripts/utils.py: Added collect_parquet_output_metrics() for post-run output analysis
  • .gitignore: Added AGENTS.md
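The _time_metric() helper above can be sketched as a minimal context manager. Only the method name comes from this PR; the surrounding class and the _metrics dict are assumptions for illustration:

```python
import contextlib
import time
from collections.abc import Iterator

class StageWithTiming:
    """Minimal sketch of per-metric wall-clock timing within a stage."""

    def __init__(self) -> None:
        self._metrics: dict[str, float] = {}

    @contextlib.contextmanager
    def _time_metric(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate, so repeated blocks under the same name sum up.
            self._metrics[name] = self._metrics.get(name, 0.0) + (time.perf_counter() - start)

stage = StageWithTiming()
with stage._time_metric("materialize_s"):
    time.sleep(0.01)
```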

Tests

39 unit tests across 3 test files (895 lines):

  • test_multimodal_core.py (25 tests) -- source_ref parsing, classify_rows dispatch, all three materialization strategies (direct read, tar extract, range read), range deduplication, mixed-strategy batches, edge cases (empty tasks, missing paths), JPEG aspect-ratio filter, split_table_by_group_max_bytes, row-validity mask, composite stage decomposition
  • test_multimodal_reader.py (6 tests) -- custom field mapping, default passthrough, content type resolution, image tokens with frame index, empty output schema, field validation errors
  • test_multimodal_writer.py (3 tests) -- materialization error handling, direct content path without tar key, DataFrame index preservation

Documentation

Full design reference in nemo_curator/stages/multimodal/README.md covering schema, source_ref format, materialization strategies, usage example, and file layout.

@copy-pr-bot

copy-pr-bot Bot commented Feb 18, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@VibhuJawa VibhuJawa changed the title [WIP] Add multimodal MINT1T reader/writer pipeline and scoped benchmarks [WIP] Add multimodal reader/writer pipeline and scoped benchmarks Feb 18, 2026
@VibhuJawa VibhuJawa force-pushed the minimal_multi_modal_implimentation branch from c8b60e8 to 3b3433c Compare February 18, 2026 07:30
@VibhuJawa VibhuJawa requested a review from Copilot February 18, 2026 07:49

Copilot AI left a comment


Pull request overview

This PR introduces a comprehensive multimodal data processing pipeline for NVIDIA NeMo Curator, enabling WebDataset (MINT-1T style) to Parquet conversion with filtering and materialization capabilities. The implementation follows the established task-centric API design and provides both reader and writer stages specifically designed for multimodal data containing text, images, and metadata.

Changes:

  • Adds MultiBatchTask for row-wise multimodal records with a standardized schema supporting sample_id, position, modality, content_type, text_content, binary_content, metadata_source, metadata_json, and materialize_error fields
  • Implements WebdatasetReader (composite stage) and MultimodalParquetWriter with support for optional binary materialization, configurable parquet backends (pandas/pyarrow), and error tracking for failed materializations
  • Adds MultimodalJpegAspectRatioFilterStage for filtering multimodal rows based on JPEG aspect ratios, with support for loading images from source when binary content is not already present
  • Introduces performance instrumentation via _time_metric context manager on ProcessingStage and split_table_by_group_max_bytes utility for group-preserving table splitting
  • Includes comprehensive test coverage (3 test files, 6 tests passing), tutorial pipeline, and benchmark infrastructure with nightly configuration entries (currently disabled)

Reviewed changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| nemo_curator/tasks/multimodal.py | Core MultiBatchTask implementation with multimodal schema, conversion methods, and metadata parsing utilities |
| nemo_curator/tasks/__init__.py | Export MultiBatchTask, update copyright year to 2026 |
| nemo_curator/stages/base.py | Add _time_metric helper, move imports to top level, update copyright to 2026 |
| nemo_curator/core/utils.py | Add split_table_by_group_max_bytes utility for group-preserving table splitting |
| nemo_curator/stages/multimodal/stages.py | Implement base annotator/filter stages and JPEG aspect ratio filter |
| nemo_curator/stages/multimodal/io/readers/webdataset.py | WebDataset reader stage parsing tar shards into multimodal rows |
| nemo_curator/stages/multimodal/io/readers/base.py | Base multimodal reader contract |
| nemo_curator/stages/multimodal/io/readers/__init__.py | Export WebdatasetReaderStage |
| nemo_curator/stages/multimodal/io/reader.py | Composite WebdatasetReader with file partitioning |
| nemo_curator/stages/multimodal/io/writers/multimodal.py | Parquet writer with optional materialization and error tracking |
| nemo_curator/stages/multimodal/io/writers/base.py | Base multimodal writer with output mode handling |
| nemo_curator/stages/multimodal/io/writers/__init__.py | Export writer classes |
| nemo_curator/stages/multimodal/io/writer.py | User-facing MultimodalParquetWriter alias |
| nemo_curator/stages/multimodal/io/__init__.py | Export reader and writer for public API |
| nemo_curator/stages/multimodal/__init__.py | Export multimodal stages |
| tests/stages/multimodal/conftest.py | Shared test fixtures for multimodal tests |
| tests/stages/multimodal/test_multimodal_reader.py | Tests for WebDataset reader functionality |
| tests/stages/multimodal/test_multimodal_writer.py | Tests for parquet writer with materialization |
| tests/stages/multimodal/test_multimodal_core.py | Tests for task methods, filter stages, and utilities |
| tests/stages/multimodal/__init__.py | Test package marker |
| tutorials/multimodal/mint1t_mvp_pipeline.py | Example pipeline for MINT1T processing |
| benchmarking/scripts/utils.py | Add collect_parquet_output_metrics helper |
| benchmarking/scripts/multimodal_mint1t_benchmark.py | Benchmark script for multimodal pipeline performance |
| benchmarking/nightly-benchmark.yaml | Add disabled benchmark entries for multimodal processing |


Copilot AI left a comment


Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 2 comments.


Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 6 comments.


Copilot AI left a comment


Pull request overview

Copilot reviewed 28 out of 28 changed files in this pull request and generated 9 comments.

@abhinavg4

Hi, the outputs still look wrong:
[screenshot omitted]

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>

greptile-apps Bot commented Mar 2, 2026

Additional Comments (2)

benchmarking/scripts/multimodal_mint1t_benchmark.py, line 98
Only the first output file is validated; if multiple Parquet files are written, ordering issues in the other files won't be detected.

Consider validating all output files, or at least a random sample.


benchmarking/scripts/multimodal_mint1t_benchmark.py, line 55
Aspect ratio bounds are hardcoded; consider adding --min-aspect-ratio and --max-aspect-ratio arguments to test different filtering thresholds.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Previously all passthrough fields were broadcast to every row. This adds
per_image_fields and per_text_fields which distribute list values 1:1 to
their respective non-None content rows, with sample-level passthrough
now only on the metadata row. Includes length-mismatch warnings,
ValueError for non-list per-modality values, and centralized helper
methods.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
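The 1:1 per-modality distribution described in this commit might look like the following. This is a hypothetical sketch: only the 1:1 distribution to non-None content rows, the length-mismatch warning, and the ValueError for non-list values come from the commit message; the row layout and names are assumptions:

```python
def distribute_per_modality(rows: list[dict], per_image_fields: dict[str, list]) -> list[dict]:
    """Assign the i-th value of each per-image list to the i-th image row."""
    image_rows = [r for r in rows if r.get("modality") == "image"]
    for field, values in per_image_fields.items():
        if not isinstance(values, list):
            msg = f"per-image field {field!r} must be a list"
            raise ValueError(msg)
        if len(values) != len(image_rows):
            print(f"warning: {field!r} has {len(values)} values for {len(image_rows)} image rows")
        # Distribute 1:1 to image rows in row order; extras are ignored.
        for row, value in zip(image_rows, values):
            row[field] = value
    return rows

rows = [
    {"modality": "text", "text_content": "intro"},
    {"modality": "image"},
    {"modality": "image"},
]
distribute_per_modality(rows, {"caption": ["cat", "dog"]})
print(rows[1]["caption"], rows[2]["caption"])  # cat dog
```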

greptile-apps Bot commented Mar 2, 2026

Additional Comments (4)

nemo_curator/stages/base.py, line 329
Incorrect return-type annotation for @contextlib.contextmanager generator

A function decorated with @contextlib.contextmanager is a generator function; its body should be annotated with the yield type (Iterator[None] or Generator[None, None, None]), not the context-manager wrapper type. The current annotation contextlib.AbstractContextManager[None] is the type of what callers receive after the decorator is applied, but mypy/pyright treat it as the generator's return type and will raise a type error.

    @contextlib.contextmanager
    def _time_metric(self, name: str) -> Iterator[None]:

You'll also need to add Iterator to the imports at the top of base.py:

from typing import TYPE_CHECKING, Any, Generic, Iterator, TypeVar, final

nemo_curator/stages/multimodal/utils/materialization.py, line 238
Narrow exception scope in _read_direct_file

Only OSError is caught here, while the similar _fill_tar_extract_rows catches (OSError, tarfile.TarError) and _fill_range_read_rows catches OSError after the url_to_fs guard. For remote filesystems backed by HTTP/S3/GCS, fsspec may raise additional exception types (e.g., aiohttp.ClientError, botocore.exceptions.ClientError) that are not guaranteed to be wrapped as OSError. An unhandled exception here would surface as a pipeline crash rather than a per-row materialize_error.

Consider catching a broader base class (at minimum Exception) and recording the failure in the return value, consistent with how the other two I/O strategy helpers behave:

def _read_direct_file(path: str, storage_options: dict[str, object]) -> bytes | None:
    try:
        with fsspec.open(path, mode="rb", **storage_options) as fobj:
            return fobj.read()
    except Exception:
        return None

benchmarking/scripts/utils.py, line 163
validate_parquet_ordering loads all three columns into memory for every sample

pd.read_parquet(parquet_path, columns=["sample_id", "position", "modality"]) pulls the full decompressed content of those three columns into a single DataFrame. For a benchmark parquet file with millions of rows this can be expensive. More importantly, the groupby loop calls content["position"].tolist() and sorted(positions) for every sample, which is O(n log n) per-sample.

A subtle correctness issue: df.groupby("sample_id", sort=False) uses the default dropna=True, so any rows where sample_id is null will be silently excluded from validation, masking potential data integrity issues.

Consider adding dropna=False to catch null sample IDs:

    for sample_id, group in df.groupby("sample_id", sort=False, dropna=False):

nemo_curator/stages/multimodal/io/readers/webdataset.py, line 218
TIFF frame counter not reset per resolved file

frame_counter is a single integer that increments for every TIFF image token encountered in a sample, regardless of which TIFF member (content_key) the token resolves to. This encodes a "global TIFF frame index within the sample" rather than "frame index within a specific TIFF file."

For the common MINT-1T pattern (one multi-frame TIFF per sample, all tokens resolving to the same member) this is correct. However, if a sample references multiple distinct TIFF members, the frame indices for the second and subsequent files will be wrong. For example, with tokens resolving to [tiff_a, tiff_b, tiff_a], the emitted frame indices are [0, 1, 2]. The second reference to tiff_a should get frame 1, and tiff_b should get frame 0 — not 1. During materialization _extract_tiff_frame would either silently read the wrong frame or return None for an out-of-range index.

Consider tracking frame counts per resolved member name instead of a single global counter:

frame_counter_per_member: dict[str, int] = {}
...
if content_key is not None and is_multiframe_candidate:
    frame_index = frame_counter_per_member.get(content_key, 0)
    frame_counter_per_member[content_key] = frame_index + 1

…ory to interleaved

Rename the task class, schema, module, stage directory, and all stage
classes to use "Interleaved" naming, reflecting the interleaved nature
of the data format:

- MultiBatchTask -> InterleavedBatch
- MULTIMODAL_SCHEMA -> INTERLEAVED_SCHEMA
- tasks/multimodal.py -> tasks/interleaved.py
- stages/multimodal/ -> stages/interleaved/
- BaseMultimodalAnnotatorStage -> BaseInterleavedAnnotatorStage
- BaseMultimodalFilterStage -> BaseInterleavedFilterStage
- MultimodalAspectRatioFilterStage -> InterleavedAspectRatioFilterStage
- MultimodalParquetWriterStage -> InterleavedParquetWriterStage
- BaseMultimodalReader -> BaseInterleavedReader
- BaseMultimodalWriter -> BaseInterleavedWriter

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor

greptile-apps Bot commented Mar 3, 2026

Additional Comments (6)

nemo_curator/stages/interleaved/io/readers/webdataset.py, line 105
Range reads silently corrupt data for compressed tar archives

info.offset_data is populated for all tars — compressed or not. For .tar.gz / .tgz files opened with mode="r:*", the underlying file object is a GzipFile, whose tell() returns the position in the decompressed stream. Python's tarfile module records this decompressed position as TarInfo.offset_data.

Later, _fill_range_read_rows in materialization.py calls:

blobs = fs.cat_ranges(dedup_paths, starts, ends)

…where starts/ends come from this offset_data. But fs.cat_ranges reads raw bytes from the compressed .tar.gz file at those byte positions, which are offsets into the gzip bitstream — not decompressed image data. The returned bytes are garbage, and there is no error signal because the read succeeds from the filesystem's perspective.

The tar_extract path is unaffected since tf.extractfile() handles decompression correctly.

Fix: skip populating byte_offset/byte_size when the tar path is compressed:

# Only populate byte-range hints for uncompressed tars; compressed tars
# require sequential decompression, so offset_data is the decompressed-
# stream position rather than the raw file byte offset.
_COMPRESSED_TAR_EXTS = (".tar.gz", ".tgz")
is_compressed = ctx.tar_path.endswith(_COMPRESSED_TAR_EXTS)

if ctx.member_info and content_key in ctx.member_info and not is_compressed:
    info = ctx.member_info[content_key]
    byte_offset = info.offset_data
    byte_size = info.size

This ensures compressed-tar rows fall into the tar_extract branch in _classify_rows, which is always correct.
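The uncompressed-tar case can be sanity-checked with the standard library alone: for a plain tar, a member's offset_data is a raw byte offset into the archive, so slicing the archive bytes recovers the payload exactly, which is what the range-read strategy relies on:

```python
import io
import tarfile

# Build an uncompressed tar in memory with a single member.
payload = b"JPEG-bytes-placeholder"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    info = tarfile.TarInfo(name="sample.jpg")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
raw = buf.getvalue()

# offset_data points at the member's first payload byte in the raw archive,
# mimicking what fs.cat_ranges does with the byte_offset/byte_size hints.
with tarfile.open(fileobj=io.BytesIO(raw), mode="r:") as tf:
    member = tf.getmember("sample.jpg")
    assert raw[member.offset_data : member.offset_data + member.size] == payload
```

With a .tar.gz written via mode="w:gz", the same slice would return gzip bitstream bytes rather than the payload, which is exactly the silent corruption described above.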


nemo_curator/stages/interleaved/stages.py, line 108
iter_materialized_bytes re-materializes the full task, not the filtered df

materialize_task_binary_content(task) materializes all rows in the original task, ignoring the df argument passed to this method. If a caller has pre-filtered or transformed df (e.g., removing rows, reordering), the positional alignment between df.reset_index(drop=True) and materialized_df.reset_index(drop=True) breaks: materialized_df.iloc[pos] will return the bytes for a different row than df.index[pos].

In the current usage inside _image_keep_mask, df is the output of task.to_pandas().copy() from process(), so both df and task.to_pandas() originate from the same underlying data and the indices happen to align. However, the method signature implies df can differ, making this a latent correctness trap for future subclass overrides.

Consider either:

  1. Removing the df parameter from iter_materialized_bytes and documenting that it always operates on the full task, or
  2. Constructing a task from df before calling materialize_task_binary_content so the materialization applies only to the rows actually in df:
def iter_materialized_bytes(self, task, df, row_mask):
    # Build a task wrapping only the passed-in df to avoid full re-materialization
    filtered_task = InterleavedBatch(
        task_id=task.task_id, dataset_name=task.dataset_name,
        data=df.reset_index(drop=True), _metadata=task._metadata,
        _stage_perf=task._stage_perf,
    )
    materialized_df = materialize_task_binary_content(filtered_task).to_pandas().reset_index(drop=True)
    ...

nemo_curator/stages/interleaved/io/writers/base.py, line 96
User write_kwargs can override index=False

write_kwargs.update(self.write_kwargs) will overwrite the safe default {"index": False} if the user passes "index" in their write_kwargs. When index is True (or its default for DataFrames with a non-RangeIndex), a spurious __index_level_0__ column appears in every output Parquet file. Use setdefault so the user value takes precedence while maintaining a safe fallback:

        write_kwargs: dict[str, Any] = {}
        write_kwargs.update(self.write_kwargs)
        write_kwargs.setdefault("index", False)

nemo_curator/stages/interleaved/utils/materialization.py, line 148
OSError catch scope can overwrite already-materialized rows on late I/O failure

The outer except (OSError, tarfile.TarError) block covers both opening the archive and iterating members. If an OSError is raised mid-iteration (for example during extracted.read() for a corrupt member), control jumps to the except handler, which then overwrites error_values[idx] for all rows in keyed_rows, including rows already successfully materialized in earlier loop iterations. Those rows end up with binary_content set but also a non-null materialize_error, which is a contradictory state.

Consider separating the archive-open failure from per-member failures:

try:
    # Open the archive exactly once; OpenFile.open() returns the file object.
    fobj = fsspec.open(path, mode="rb", **storage_options).open()
    tf = tarfile.open(fileobj=fobj, mode="r:*")
except (OSError, tarfile.TarError):
    for idx, *_ in keyed_rows:
        error_values[idx] = "failed to open archive"
    continue

try:
    for idx, member, frame_idx in keyed_rows:
        # per-member try/except here
        ...
finally:
    tf.close()
    fobj.close()

This prevents a single corrupt member from marking previously-successful rows as failed.


benchmarking/scripts/multimodal_mint1t_benchmark.py, line 57
Aspect-ratio bounds are hardcoded and not exposed as CLI arguments

min_aspect_ratio=1.0 and max_aspect_ratio=2.0 are hardcoded constants, not controllable from the command line. For a benchmarking script this limits configurability and may produce misleading throughput numbers if the fixed filter happens to drop most samples on a particular dataset.

Consider adding:

parser.add_argument("--min-aspect-ratio", type=float, default=1.0)
parser.add_argument("--max-aspect-ratio", type=float, default=2.0)

and forwarding them to InterleavedAspectRatioFilterStage.


nemo_curator/stages/interleaved/io/readers/webdataset.py, line 373
materialize_on_read skips TIFF frame extraction and doesn't record failures

Two inconsistencies with the write-time materialization path:

  1. Frame extraction skipped: source_ref may carry a frame_index for multi-frame TIFFs (populated in _image_rows). When materialize_on_read=True, the raw tar-member bytes are stored verbatim in binary_content; _extract_tiff_frame is never called. The write-time path (via materialize_task_binary_content → _fill_tar_extract_rows) correctly handles frame_index. A pipeline that toggles between read-time and write-time materialization will produce different binary_content for the same TIFF document.

  2. Silent failure not recorded: If _extract_tar_member returns None (member not found in the tar), binary_content is left None but materialize_error is never populated. Downstream consumers have no way to distinguish "not yet materialized" from a genuine extraction failure — unlike the write-time path, which explicitly records "missing member 'X'" in materialize_error.

Both the frame-extraction call and the failure-recording behaviour should be added here to keep the two materialization modes symmetric.

Interactive notebook showing the end-to-end interleaved pipeline:
read WebDataset tar → inspect schema → display interleaved document
(text + inline images) → filter by aspect ratio → write to Parquet.
Includes executed outputs with rendered MINT-1T sample data.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor

greptile-apps Bot commented Mar 3, 2026

Additional Comments (2)

nemo_curator/stages/interleaved/utils/materialization.py, line 25
Unconditional top-level PIL import breaks all interleaved imports without Pillow

from PIL import Image as _Image is an unconditional module-level import. PIL is only needed inside _extract_tiff_frame, but because materialization.py is re-exported via utils/__init__.py, any import of a stage from the interleaved package (including WebdatasetReader and InterleavedParquetWriterStage) will raise ImportError: No module named 'PIL' on environments that don't have the image_cpu extras installed.

The import chain that triggers this:

io/__init__.py
  → writers/tabular.py
    → writers/base.py
      → utils/__init__.py
        → materialization.py  ← PIL imported here unconditionally

stages.py already handles this correctly with a guarded import:

try:
    from PIL import Image
except ImportError:
    Image = None

materialization.py should follow the same pattern and guard the call-site in _extract_tiff_frame:

try:
    from PIL import Image as _Image
except ImportError:
    _Image = None  # type: ignore[assignment]

Then in _extract_tiff_frame, add an early guard before using _Image.open:

if _Image is None:
    return None  # PIL not available; cannot extract TIFF frame

benchmarking/scripts/multimodal_mint1t_benchmark.py, line 57
--no-materialize-on-write baseline still materializes all images via the filter stage

InterleavedAspectRatioFilterStage is added to the pipeline unconditionally. Its content_keep_mask → _image_keep_mask → iter_materialized_bytes path calls materialize_task_binary_content(task) for every image row, regardless of the --materialize-on-read / --materialize-on-write flags.

This means the multimodal_mint1t_xenna benchmark entry (labeled "Throughput baseline (no binary materialization)" in the PR description) still performs a full byte-range or tar-extract I/O pass over all image content during the filter stage. The benchmark description and the --no-materialize-on-write flag are therefore misleading: it is not a zero-materialization baseline.

Consider either:

  1. Making the filter stage conditional on an --aspect-ratio-filter flag so it can be omitted for the raw throughput baseline.
  2. Updating the benchmark description / YAML comment to clarify that materialization occurs at filter time even in the "no-write-materialize" variant.

- Use IPython.display.Image (image/png output) instead of display(HTML)
  so images render on GitHub, JupyterLab, and VS Code
- Add HuggingFace download cell (mlfoundations/MINT-1T-PDF-CC-2024-18)
  with MINT1T_TAR_PATH env var override for local data
- Fix all ruff lint errors (type annotations, import order, magic numbers)

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor

greptile-apps Bot commented Mar 3, 2026

Additional Comments (4)

nemo_curator/stages/interleaved/stages.py, line 116
Position re-indexing after filtering may produce non-contiguous positions

After keep_mask removes some image rows, cumcount() is applied to the remaining rows sorted by position, producing 0-based counter values. However, content_by_position.sort_values("position") sorts already-filtered rows by their original positions. If groupby(...).cumcount() produces positions 0, 1, 2 but the original dtype is pd.ArrowDtype(pa.int32()) (from task.to_pandas(types_mapper=pd.ArrowDtype)), the .astype(filtered["position"].dtype) coercion and subsequent .loc assignment may silently no-op or raise in certain pandas/pyarrow version combinations due to ArrowDtype setitem semantics.

Additionally, sort_values(["sample_id", "position"]) on the returned filtered DataFrame sorts metadata rows (position=-1) before content rows within each sample_id. But if a sample's only metadata row was dropped by a custom drop_invalid_rows=False subclass, the sort would still work — this edge case seems unhandled, but depends on subclass behaviour.

Consider adding an explicit reset_index(drop=True) before returning to ensure downstream code gets a clean 0-based index:

return filtered.sort_values(["sample_id", "position"]).reset_index(drop=True)

nemo_curator/stages/interleaved/io/writers/base.py, line 96
User write_kwargs silently overrides index=False

write_kwargs.update(self.write_kwargs) on line 95 will overwrite the safe default {"index": False} set on line 94 if the caller passes write_kwargs={"index": True} (or any other value for "index"). While unlikely in practice, this could cause the DataFrame index to be written as a column into Parquet output. Use setdefault to protect the default:

        write_kwargs: dict[str, Any] = {}
        write_kwargs.update(self.write_kwargs)
        write_kwargs.setdefault("index", False)

nemo_curator/core/utils.py, line 229
avg_bytes_per_row can be very inaccurate for sparse binary columns

table.nbytes / n averages all bytes (including large_binary image buffers) uniformly across rows. For tables where only image rows carry large binary_content payloads and text/metadata rows have None, this estimate is dominated by the image rows. A batch of 90 text rows + 10 image rows of 1 MB each would compute ~10 MB / 100 ≈ 100 KB/row, leading to splits that could be 10× larger than max_batch_bytes for image-heavy samples.

Consider computing the estimate from the image rows only, or documenting the approximation prominently so callers set max_batch_bytes conservatively:

# Current approximation: uniform average across all rows.
# For tables with sparse binary_content this can over-estimate splits.
avg_bytes_per_row = table.nbytes / n
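
The distortion is easy to quantify with the numbers above (illustrative arithmetic only):

```python
# 90 text/metadata rows with negligible payload, 10 image rows of 1 MB each.
n_text, n_image = 90, 10
image_bytes = 1_000_000
total_bytes = n_image * image_bytes  # text rows contribute ~0 bytes

# The uniform average spreads the image payload across every row.
avg_bytes_per_row = total_bytes / (n_text + n_image)  # 100 KB/row

# A splitter budgeting max_batch_bytes by this average would pack
# max_batch_bytes / avg rows per split; if those rows are all image
# rows, the split's real size overshoots the budget by 10x.
overshoot = image_bytes / avg_bytes_per_row
```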

nemo_curator/stages/interleaved/io/readers/webdataset.py, line 210
frame_counter is global per sample, not per content_key

frame_counter increments for every TIFF image row in the sample regardless of which tar member (content_key) it refers to. For a sample with two separate single-frame TIFF files (slide1.tiff and slide2.tiff), the first gets frame_index=0 and the second gets frame_index=1.

During materialization, _extract_tiff_frame(slide2_bytes, 1) hits if frame_index >= getattr(img, "n_frames", 1): return None for a single-frame TIFF, causing a spurious materialize_error on a perfectly valid file.

The counter should be tracked per unique content_key so each distinct TIFF member starts its own frame numbering at 0. A dict keyed by content_key (e.g. frame_counter_per_key) replaces the flat frame_counter variable and correctly handles both the multi-frame-same-member case and the separate-files case.
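
A self-contained sketch of the per-member counter (the helper name is made up for illustration):

```python
def assign_frame_indices(content_keys: list[str]) -> list[int]:
    """Return a 0-based frame index per image token, counted per tar member."""
    frame_counters: dict[str, int] = {}
    indices = []
    for key in content_keys:
        idx = frame_counters.get(key, 0)
        frame_counters[key] = idx + 1
        indices.append(idx)
    return indices

# Two independent single-frame TIFFs each start at frame 0.
separate = assign_frame_indices(["slide1.tiff", "slide2.tiff"])
# Three tokens resolving to one multi-frame TIFF count up normally.
shared = assign_frame_indices(["doc.tiff", "doc.tiff", "doc.tiff"])
```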

The quickstart notebook embeds base64-encoded PNG thumbnails that
trigger false positives in detect-secrets. Updated the baseline to
whitelist these alongside the pre-existing entries.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor
Regenerated .secrets.baseline to include entries from the updated
interleaved quickstart notebook (base64-encoded image outputs and
hex strings from Parquet metadata) so the detect-secrets CI check
passes.

Signed-off-by: Varun Jawa <vjawa@nvidia.com>
Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor

greptile-apps Bot commented Mar 3, 2026

Additional Comments (3)

nemo_curator/stages/interleaved/utils/materialization.py, line 25
Unconditional PIL import will break module import without Pillow

from PIL import Image as _Image is a top-level, unconditional import. Pillow is declared as an optional dependency in the image_cpu group in pyproject.toml, not a required one.

Because materialization.py is re-exported by utils/__init__.py (which imports materialize_task_binary_content), and utils/__init__.py is in turn imported by stages.py, io/writers/base.py, and io/readers/webdataset.py, any import from nemo_curator.stages.interleaved will raise ImportError on systems where Pillow is not installed — even for purely text-based pipelines that never touch images.

The fix is to guard the import and raise a clear RuntimeError only at the call site where _Image is actually needed:

try:
    from PIL import Image as _Image
except ImportError:
    _Image = None  # type: ignore[assignment]

# ... then inside _extract_tiff_frame:
def _extract_tiff_frame(tiff_bytes: bytes, frame_index: int) -> bytes | None:
    if _Image is None:
        raise RuntimeError(
            "Pillow is required for TIFF frame extraction. "
            "Install the 'image_cpu' dependency group."
        )
    ...

This pattern is already used correctly in stages.py for the Image import (lines 31–34).


benchmarking/scripts/utils.py, line 98
Task results serialized with pickle — portability and security risk

pickle.dumps(results["tasks"]) produces a binary blob that is tied to the exact Python/library environment used to create it. Loading it with a different version of NeMo Curator or its dependencies can silently produce incorrect objects or raise AttributeError/ModuleNotFoundError. Additionally, pickle files loaded from untrusted sources can execute arbitrary code.

For a benchmark output that is meant to be inspected or compared later, prefer a serialization format that is both version-safe and human-readable. If the full task objects are genuinely needed, at minimum document that the .pkl file must be loaded with the same Python environment and should never be loaded from untrusted paths.

        import json as _json
        # Persist task summaries as JSON for portability; skip non-serializable fields.
        task_summary = [
            {k: v for k, v in t.__dict__.items() if isinstance(v, (str, int, float, bool, type(None)))}
            for t in results["tasks"]
        ] if results["tasks"] else []
        (output_path / "tasks.json").write_text(_json.dumps(task_summary, indent=2))

Alternatively, if preserving the full objects via pickle is a hard requirement, add a comment explaining that constraint and the version-pinning expectation.


nemo_curator/stages/interleaved/io/readers/webdataset.py, line 384
binary_content set to None even on failed frame extraction without early continue

When _extract_tiff_frame returns None (frame index out of bounds or malformed TIFF), raw_bytes = extracted sets raw_bytes = None, and then line 384 unconditionally writes None into binary_content. The materialize_error message is correctly set, but the unconditional assignment on line 384 also runs for the raw_bytes is None branch (missing member), overwriting the already-None field with None again.

This is harmless today but fragile: if binary_content is ever pre-populated (e.g. by a subclass), the unconditional assignment would silently clear it. A continue after each error path or restructuring the assignment to else would make the intent explicit:

raw_bytes = self._extract_tar_member(tf, content_key, read_ctx.byte_cache)
if raw_bytes is None:
    row["materialize_error"] = f"missing member '{content_key}'"
    continue
frame_index = parsed_ref.get("frame_index")
if frame_index is not None:
    extracted = _extract_tiff_frame(raw_bytes, frame_index)
    if extracted is None:
        row["materialize_error"] = f"failed to extract frame {frame_index} from '{content_key}'"
        continue
    raw_bytes = extracted
row["binary_content"] = raw_bytes

The --fields default hardcoded field names (language_id_whole_page_fasttext,
previous_word_count) that ~5.3% of MINT-1T samples lack, causing a
ValueError. The default is now None (auto-discovery) to match the benchmark behavior.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor

greptile-apps Bot commented Mar 3, 2026

Additional Comments (5)

nemo_curator/stages/interleaved/utils/materialization.py, line 25
Pillow hard import breaks non-image workflows

from PIL import Image as _Image is a top-level import, meaning any code that does from nemo_curator.stages.interleaved.utils import materialize_task_binary_content will raise ImportError if Pillow is not installed — even for pipelines that only process JPEG or text content and never touch multi-frame TIFFs.

Pillow is listed in the optional image_cpu dependency group (pyproject.toml), so this will silently break base installations.

Consider guarding it the same way stages.py does:

try:
    from PIL import Image as _Image
except ImportError:
    _Image = None

Then inside _extract_tiff_frame, guard the call:

def _extract_tiff_frame(tiff_bytes: bytes, frame_index: int) -> bytes | None:
    if _Image is None:
        return None  # or raise RuntimeError with install hint
    ...

nemo_curator/stages/interleaved/io/readers/webdataset.py, line 214
frame_counter not reset per distinct TIFF file

frame_counter is a single integer that increments across all TIFF images in a sample, regardless of which tar member they resolve to. This is only correct when all image tokens reference the same multi-page TIFF file.

When a sample has two distinct TIFF files (e.g. images = ["doc_a.tiff", "doc_b.tiff"]):

  • doc_a.tiff → frame_index=0
  • doc_b.tiff → frame_index=1 ✗ (should be 0: it is the first frame of its own file)

At materialization time, doc_b.tiff would extract frame 1 instead of frame 0, silently returning the wrong image.

Consider tracking frame counts per resolved content_key:

frame_counters: dict[str, int] = {}
...
if content_key is not None and is_multiframe_candidate:
    frame_index = frame_counters.get(content_key, 0)
    frame_counters[content_key] = frame_index + 1

nemo_curator/stages/interleaved/io/readers/webdataset.py, line 200
image_member_name not consulted for MIME type when content_key has no extension

mimetypes.guess_type(content_key or image_member_name or "") relies on Python's short-circuit or: image_member_name is only used when content_key is falsy (None or ""). When content_key is a non-empty extensionless string (e.g. a numeric token "0001" that happens to be in member_names), mimetypes.guess_type("0001") returns (None, None) and the fallback to image_member_name (which may be "0001.tiff") is never tried.

The knock-on effect is that is_multiframe_candidate = content_type == "image/tiff" becomes False, so multi-frame TIFF frame indexing is silently skipped for those rows.

            content_type, _ = mimetypes.guess_type(content_key or "")
            if content_type is None:
                content_type, _ = mimetypes.guess_type(image_member_name or "")
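
The short-circuit is easy to verify with the standard library:

```python
import mimetypes

# An extensionless member token defeats guess_type entirely.
no_ext = mimetypes.guess_type("0001")  # (None, None)

# Falling back to the resolved member name recovers the MIME type.
content_key, image_member_name = "0001", "0001.tiff"
content_type, _ = mimetypes.guess_type(content_key or "")
if content_type is None:
    content_type, _ = mimetypes.guess_type(image_member_name or "")
```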

nemo_curator/stages/interleaved/io/readers/webdataset.py, line 384
raw_bytes overwritten with None after TIFF extraction failure

When _extract_tiff_frame returns None (extraction error), the materialize_error is correctly set. However raw_bytes = extracted (line 383) then unconditionally sets raw_bytes = None, and row["binary_content"] = raw_bytes (line 384) stores None. While the end result is the same (no binary content on failure), the implicit overwrite of raw_bytes makes the control flow harder to reason about and differs from the explicit continue pattern used in _fill_tar_extract_rows.

Consider adding a continue after setting the error to make the intent explicit and consistent:

if frame_index is not None:
    extracted = _extract_tiff_frame(raw_bytes, frame_index)
    if extracted is None:
        row["materialize_error"] = f"failed to extract frame {frame_index} from '{content_key}'"
        row["binary_content"] = None
        continue
    raw_bytes = extracted
row["binary_content"] = raw_bytes



benchmarking/scripts/multimodal_mint1t_benchmark.py, line 57
Aspect-ratio filter materializes all images even when both materialize flags are off

InterleavedAspectRatioFilterStage always calls iter_materialized_bytes for every image row in order to compute aspect ratios, regardless of whether materialize_on_read or materialize_on_write are set. When both flags are False (the benchmark's default), the filter stage still triggers a full materialization round-trip for all images just to filter them — but those bytes are then discarded and not written to the Parquet output.

This means the "no materialize" benchmark configuration still pays the I/O cost of binary materialization for every image. If this is intentional (e.g. to validate filtering works end-to-end), it should be documented. If not, consider adding a skip_filter_if_unmaterialized option or using a lazy/metadata-only filter path.

- validate_and_project_source_fields now logs a warning and fills None
  when requested passthrough fields are absent from a source sample,
  instead of raising ValueError (heterogeneous data resilience).
- _extract_per_modality_fields warns when a per-modality field is missing
  from the sample (previously silently ignored).
- Changed type-mismatch raise from ValueError to TypeError (TRY004 fix).
- Removed metadata_json from INTERLEAVED_SCHEMA -- passthrough fields
  already carry all source data; the redundant JSON blob is no longer needed.
- Updated all tests, notebook, and README to match.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor

greptile-apps Bot commented Mar 3, 2026

Additional Comments (5)

nemo_curator/stages/interleaved/io/readers/webdataset.py, line 200
frame_counter not reset per unique content_key

frame_counter is shared across all TIFF images in a sample regardless of which member file they resolve to. If a sample's images list contains tokens that resolve to different TIFF member files, every file after the first will receive an incorrect (too-large) frame_index.

Example: A sample with images = ["a.tiff", "b.tiff"] where both members exist in the tar:

  • a.tiff → frame_index=0 ✅, frame_counter=1
  • b.tiff → frame_index=1 ❌ (should be 0 for an independent TIFF)

During write-time materialization, _fill_tar_extract_rows will open b.tiff and try to seek to frame 1, which either silently returns None (setting materialize_error) or extracts the wrong frame — depending on how many frames b.tiff actually has.

The counter should be tracked per content_key so that each distinct TIFF file begins its own 0-based frame sequence:

frame_counters: dict[str, int] = {}
# ...
if content_key is not None and is_multiframe_candidate:
    frame_index = frame_counters.get(content_key, 0)
    frame_counters[content_key] = frame_index + 1

nemo_curator/stages/interleaved/io/readers/webdataset.py, line 376
binary_content unconditionally overwritten even on TIFF frame-extraction failure

When _extract_tiff_frame returns None, line 374 sets raw_bytes = extracted = None, then line 376 sets row["binary_content"] = None. This correctly signals failure alongside materialize_error, but it also silently discards the original raw TIFF bytes that were successfully read from the tar.

If a caller later encounters binary_content is None + materialize_error set, they know materialisation failed; but the full-TIFF bytes that were available are unrecoverable. Consider only overwriting raw_bytes when extraction succeeds, so the original bytes can be preserved (even if the caller must handle the frame-extraction error separately):

if frame_index is not None:
    extracted = _extract_tiff_frame(raw_bytes, frame_index)
    if extracted is None:
        row["materialize_error"] = f"failed to extract frame {frame_index} from '{content_key}'"
    else:
        raw_bytes = extracted  # only replace on success
row["binary_content"] = raw_bytes

nemo_curator/stages/interleaved/stages.py, line 129
Orphaned metadata rows when all content rows are filtered

content_keep_mask (e.g. _image_keep_mask) only updates the mask for image rows; all other rows default to True. Combined with _basic_row_validity_mask, which always passes through metadata rows, a sample whose every image is rejected will still emit a single metadata row with position=-1 and no accompanying content rows.

Downstream consumers (e.g. the parquet writer, validate_parquet_ordering) should handle this case, but it may produce unexpected output — particularly for pipelines that assume at least one content row per metadata row. If fully-empty samples should be dropped, consider adding a post-filter step that removes metadata rows whose sample_id no longer appears in any content row.
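
One possible post-filter step, sketched on plain row dicts rather than the real DataFrame (the function name is illustrative):

```python
def drop_orphaned_metadata(rows: list[dict]) -> list[dict]:
    """Drop metadata rows (position == -1) whose sample has no content rows left."""
    live_samples = {r["sample_id"] for r in rows if r["position"] >= 0}
    return [r for r in rows if r["position"] >= 0 or r["sample_id"] in live_samples]

rows = [
    {"sample_id": "a", "position": -1},  # metadata with surviving content: kept
    {"sample_id": "a", "position": 0},
    {"sample_id": "b", "position": -1},  # all content filtered out: dropped
]
cleaned = drop_orphaned_metadata(rows)
```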


nemo_curator/core/utils.py, line 236
Byte-budget estimate uses a single average, ignoring binary column variance

avg_bytes_per_row = table.nbytes / n is a global average across every column, including binary_content (large_binary). When binary_content is materialized, a single image row can be orders of magnitude larger than a text or metadata row — making the average a poor predictor for any individual group's size. This may cause splits that are significantly over or under the max_batch_bytes budget.

Consider computing the average per-modality or using pyarrow.Table.slice(...).nbytes for a more accurate per-group estimate, at least as an option for callers that materialize binary content.


nemo_curator/stages/interleaved/utils/materialization.py, line 158
Fragile cache key using id(raw_bytes) risks stale hits

_resolve_frame caches by (id(raw_bytes), frame_idx). CPython may reuse a memory address after a bytes object is garbage-collected; if this happens within a single _scatter_range_blobs call, the cache could return stale bytes for the wrong member.

A more robust key would be the actual content (e.g. hash(raw_bytes)) or the logical identity of the member. Alternatively, since TIFF frame extraction is cheap relative to I/O, removing the cache entirely and calling _extract_tiff_frame directly avoids the risk without meaningful overhead.
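
A content-derived key, sketched with hashlib (illustrative; keying on the logical member identity would work equally well):

```python
import hashlib

def cache_key(raw_bytes: bytes, frame_idx: int) -> tuple[str, int]:
    """Cache key stable across object lifetimes, unlike id(raw_bytes)."""
    return (hashlib.sha256(raw_bytes).hexdigest(), frame_idx)

a = b"fake-tiff-payload"
b = bytes(bytearray(a))  # equal content in a freshly allocated object
```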

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor

greptile-apps Bot commented Mar 3, 2026

Additional Comments (3)

nemo_curator/stages/interleaved/utils/materialization.py, line 220
_scatter_range_blobs ValueError from strict=True not caught

The try/except block only guards the fs.cat_ranges() call (lines 212–218), but _scatter_range_blobs is invoked at line 220, outside the except. Inside _scatter_range_blobs, the call zip(range_keys, blobs, strict=True) will raise ValueError if cat_ranges returns a list whose length differs from range_keys. While this is unlikely with well-behaved fsspec implementations, certain remote filesystems or partially-implemented drivers have been known to return unexpected result counts.

If this exception escapes, it will propagate through materialize_task_binary_content and crash the pipeline stage rather than logging a per-row error. Consider wrapping the call:

try:
    blobs = fs.cat_ranges(dedup_paths, starts, ends)
    _scatter_range_blobs(blobs, range_keys, unique_ranges, binary_values, error_values)
except (OSError, RuntimeError, ValueError) as exc:
    logger.warning(f"cat_ranges failed for {path} ({len(entries)} ranges): {exc}")
    for idx, *_ in entries:
        error_values[idx] = "cat_ranges failed"
    continue

benchmarking/scripts/multimodal_mint1t_benchmark.py, line 105
Ordering validation covers only the first output file

validate_parquet_ordering is called exclusively on parquet_files[0]. A pipeline run with many input shards will produce multiple Parquet files, and row-ordering bugs (e.g., cross-sample interleaving at shard boundaries) may only manifest in files other than the first one.

Consider sampling a few files (e.g., first, middle, last) or validating all files if the dataset is small enough:

files_to_check = parquet_files[:1]
if len(parquet_files) > 2:
    files_to_check.append(parquet_files[len(parquet_files) // 2])
if len(parquet_files) > 1:
    files_to_check.append(parquet_files[-1])
for check_file in files_to_check:
    result = validate_parquet_ordering(check_file)
    if not result["valid"]:
        ordering_valid = False
        logger.error("Ordering validation failed on {}: {}", check_file.name, result["errors"])
    else:
        logger.info("Ordering validation passed on {}", check_file.name)

nemo_curator/stages/interleaved/io/readers/webdataset.py, line 202
TIFF frame_counter not scoped per resolved member

frame_counter is a sample-level integer that increments for every image token resolving to a TIFF, regardless of which TIFF member it resolves to. This is correct only when all image tokens in a sample reference the same multi-frame TIFF (the standard MINT-1T pattern).

When a sample contains image tokens that resolve to different TIFF files, each gets an incorrect frame index derived from the shared counter rather than a per-member counter. For example, a sample with two distinct single-frame TIFF members would assign frame_index=0 to the first and frame_index=1 to the second — the second should be 0.

Consider using a per-member counter (a dict mapping member name to frame count) so that frame indices are correct regardless of how many distinct TIFF members appear in a sample.

…, orphaned metadata, fragile cache

- Track frame_index per content_key so distinct TIFF members each start at 0
- Preserve original TIFF bytes in binary_content when frame extraction fails
- Drop metadata rows whose sample has no remaining content after filtering
- Remove _resolve_frame id()-based cache; inline _extract_tiff_frame directly

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor

greptile-apps Bot commented Mar 3, 2026

Additional Comments (4)

nemo_curator/stages/interleaved/io/readers/webdataset.py, line 377
Full TIFF bytes written to binary_content on failed frame extraction

When materialize_on_read=True and TIFF frame extraction fails (_extract_tiff_frame returns None), the code sets materialize_error correctly but then falls through to row["binary_content"] = raw_bytes at line 376, where raw_bytes still holds the full multi-frame TIFF rather than None.

The write-time path in _fill_tar_extract_rows (materialization.py lines 141-145) avoids this by using continue when extraction fails:

if payload is None:
    error_values[idx] = f"failed to extract frame {frame_idx} from '{member}'"
    continue          # ← skips binary_values[idx] assignment entirely
binary_values[idx] = payload

The fix is to skip assigning binary_content when extraction fails:

if raw_bytes is None:
    row["materialize_error"] = f"missing member '{content_key}'"
else:
    frame_index = parsed_ref.get("frame_index")
    if frame_index is not None:
        extracted = _extract_tiff_frame(raw_bytes, frame_index)
        if extracted is None:
            row["materialize_error"] = f"failed to extract frame {frame_index} from '{content_key}'"
            continue   # or skip setting binary_content
        raw_bytes = extracted
    row["binary_content"] = raw_bytes

Without this fix, consumers of binary_content for a failed TIFF-frame row receive the raw multi-frame TIFF (wrong frame or format), while materialize_error signals failure — creating an inconsistent state where the field appears populated but is incorrect.


nemo_curator/stages/interleaved/utils/materialization.py, line 148
Outer exception handler overwrites successfully processed rows

When OSError or tarfile.TarError is raised inside the inner processing loop (e.g., from extracted.read() at line 131 on a corrupted member), the outer except block on lines 146-148 marks every row in keyed_rows with "failed to read path" — including rows that were already successfully processed (i.e., those where binary_values[idx] was already populated and error_values[idx] was set to None).

Example scenario:

  • A tar has 3 members; the first two extract fine, the third raises tarfile.TarError
  • The outer except resets all 3 to "failed to read path", discarding the already-correct results for rows 1 and 2

Consider tracking the last successfully-processed index or catching the error per-row to preserve successful results, e.g.:

try:
    key_cache[member] = extracted.read() if extracted is not None else None
except (OSError, tarfile.TarError) as e:
    key_cache[member] = None
    error_values[idx] = f"error reading member '{member}': {e}"
    continue

nemo_curator/stages/interleaved/utils/materialization.py, line 234
Narrow exception list may miss filesystem-specific errors

_read_direct_file catches (OSError, RuntimeError, ValueError). While OSError covers most POSIX/Win32 errors (including PermissionError, FileNotFoundError), some fsspec backends raise backend-specific errors that don't inherit from any of these (e.g., botocore.exceptions.ClientError for S3, gRPC status errors for GCS). In those cases the function raises instead of returning None, causing the entire pipeline step to fail rather than recording a per-row error.

Consider catching Exception with a narrow log to avoid unhandled crashes:

except Exception as exc:
    logger.warning("Failed to read {}: {}", path, exc)
    return None

nemo_curator/stages/interleaved/stages.py, line 132
sort_values called at end of annotate but not guaranteed stable within groups

filtered.sort_values(["sample_id", "position"]) at the end of annotate reorders rows for the caller. If the caller passes a DataFrame with mixed dtypes for position (e.g., ArrowDtype(pa.int32()) from the Arrow-backed path), pandas sort_values may silently coerce or behave unexpectedly. Consider explicitly resetting the index after the sort to avoid callers depending on the original index:

        return filtered.sort_values(["sample_id", "position"]).reset_index(drop=True)


greptile-apps Bot commented Mar 3, 2026

Additional Comments (3)

nemo_curator/stages/interleaved/utils/materialization.py, line 25
Hard Pillow import breaks module import for users without image_cpu

from PIL import Image as _Image is an unconditional top-level import. Since materialization.py is pulled in by utils/__init__.py, which is imported by every interleaved stage, any user who hasn't installed the image_cpu optional dependency will get an ImportError when importing any interleaved stage — even if they never process TIFFs.

PIL is only used in _extract_tiff_frame (line 100), so it should be a lazy or soft import:

try:
    from PIL import Image as _Image
except ImportError:
    _Image = None  # type: ignore[assignment]

Then guard _extract_tiff_frame to raise a helpful error if PIL is needed but missing.


benchmarking/scripts/utils.py, line 158
Position validation fails for pipelines without a filter/reindex stage

expected = list(range(len(positions))) assumes content positions are the dense set [0, 1, ..., n-1]. However, the reader assigns positions from enumerate() without densifying them — when None-padded lists are skipped, positions remain sparse. For example:

  • Input: texts = [None, "hello", None, "world"]
  • Reader produces positions: [1, 3] (from enumerate, but only non-None rows)
  • Validation expects: [0, 1]
  • Result: False validation error

This only passes today in the benchmark because InterleavedAspectRatioFilterStage runs and reindexes positions via cumcount(). But for any pipeline that reads and writes without a filter stage, validate_parquet_ordering will raise false errors.

Consider validating that positions are strictly increasing within a sample rather than requiring dense 0..n-1:

if positions != sorted(positions):
    errors.append(f"sample={sample_id}: content positions {positions} are not monotone")
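
Note that comparing against sorted(positions) would also accept duplicate positions; a pairwise check enforces the strictly-increasing property directly:

```python
def positions_strictly_increasing(positions: list[int]) -> bool:
    """True when each content position is strictly greater than the previous one."""
    return all(a < b for a, b in zip(positions, positions[1:]))
```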

nemo_curator/stages/interleaved/utils/materialization.py, line 148
Outer except block overwrites error state for already-successfully-materialized rows

When a tar opens successfully and some members are extracted without error, but a later extracted.read() call (line 131) raises OSError mid-iteration, the outer except (OSError, tarfile.TarError) block fires and sets error_values[idx] = "failed to read path" for all rows in keyed_rows — including those already successfully processed earlier in the loop. This leaves rows with binary_values[idx] = <valid bytes> paired with error_values[idx] = "failed to read path", which will confuse any consumer inspecting materialize_error to determine success.

Track which rows have already been resolved before the exception occurs, so the outer handler only marks truly unresolved rows:

resolved: set[int] = set()
try:
    with fsspec.open(path, ...) as fobj, tarfile.open(...) as tf:
        for idx, member, frame_idx in keyed_rows:
            ...
            binary_values[idx] = payload
            error_values[idx] = None
            resolved.add(idx)
except (OSError, tarfile.TarError):
    for idx, *_ in keyed_rows:
        if idx not in resolved:
            error_values[idx] = "failed to read path"


@abhinavg4 abhinavg4 left a comment


Looks good. Thanks a ton for your efforts Vibhu

Comment thread nemo_curator/tasks/multimodal.py Outdated
Comment thread nemo_curator/tasks/interleaved.py

@dataclass
class MultiBatchTask(Task[pa.Table | pd.DataFrame]):
"""Task carrying row-wise multimodal records.

Just having position means it's interleaved right? How about interleaved_multimodal?

msg = f"Unsupported data type: {type(self.data)}"
raise TypeError(msg)

def validate(self) -> bool:

But base is just True right? For us, I'm not sure if the end user would agree this is what it means for a task to be a valid task.

New parametrized and data-driven tests across 6 test files covering
all previously untested branches in materialization, validation utils,
InterleavedBatch task, filter/annotator stages, readers, and writers.

Signed-off-by: Vibhu Jawa <vjawa@nvidia.com>
Made-with: Cursor

greptile-apps Bot commented Mar 4, 2026

Additional Comments (2)

nemo_curator/stages/interleaved/io/readers/webdataset.py, line 376
binary_content retains full TIFF bytes when frame extraction fails

When _extract_tiff_frame returns None (frame index out of range or unsupported format), materialize_error is set correctly on line 373. However, raw_bytes is never cleared — it still holds the full multi-frame TIFF data. The unconditional row["binary_content"] = raw_bytes on line 376 (outside the else block) then stores the full TIFF bytes in binary_content despite the error state. Downstream consumers that check materialize_error to guard access to binary_content will still receive corrupted data.

This contradicts the write-time materialization path (via _fill_tar_extract_rows in materialization.py), which uses continue to skip setting binary_values[idx] when frame extraction fails.

                    if frame_index is not None:
                        extracted = _extract_tiff_frame(raw_bytes, frame_index)
                        if extracted is None:
                            row["materialize_error"] = f"failed to extract frame {frame_index} from '{content_key}'"
                            raw_bytes = None
                        else:
                            raw_bytes = extracted
                row["binary_content"] = raw_bytes

nemo_curator/stages/base.py, line 329
Incorrect return type annotation on _time_metric

@contextlib.contextmanager decorates a generator function. The correct return type annotation for a generator function is -> Iterator[None], not -> contextlib.AbstractContextManager[None]. The latter is the type of the decorated return value, but the function signature should reflect the generator.

    @contextlib.contextmanager
    def _time_metric(self, name: str) -> Iterator[None]:

Also add from collections.abc import Iterator to the imports at the top of the file.
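
A minimal self-contained version of the corrected signature (class and attribute names are illustrative):

```python
import contextlib
import time
from collections.abc import Iterator

class StageMetrics:
    """Sketch of a stage with a timing context manager annotated as a generator."""

    def __init__(self) -> None:
        self.timings: dict[str, float] = {}

    @contextlib.contextmanager
    def _time_metric(self, name: str) -> Iterator[None]:
        # The generator yields once; the decorator turns it into a context manager.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name] = time.perf_counter() - start

m = StageMetrics()
with m._time_metric("read"):
    pass
```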
