
Multisegment Header Trailer Removal#836

Merged
yruslan merged 3 commits into AbsaOSS:master from capitalone-contributions:multisegment-header-trailer on Mar 25, 2026

Conversation

@alexzvk
Contributor

@alexzvk alexzvk commented Mar 24, 2026

Some EBCDIC files contain headers and trailers whose schemas differ from the schema of the actual data. Currently, to avoid reading these headers and trailers, one must manually parse the copybook and calculate file_start_offset and file_end_offset in bytes. This PR lets the user designate copybook root-level records as the header and trailer, so that file_start_offset and file_end_offset are calculated automatically and those records are removed from the schema.
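The new options can be used like this from PySpark (a usage sketch, not taken from the PR itself; the copybook path, record names, and data path are hypothetical, and running it requires a Spark session with the spark-cobol package on the classpath):

```python
# Hypothetical usage sketch of the options added in this PR.
# Option names (record_header_name, record_trailer_name) come from the PR;
# paths and record names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("cobol")
      .option("copybook", "/path/to/copybook.cpy")   # hypothetical path
      .option("record_header_name", "HEADER-REC")    # root-level record treated as file header
      .option("record_trailer_name", "TRAILER-REC")  # root-level record treated as file trailer
      .load("/path/to/data.dat"))                    # hypothetical path

# file_start_offset / file_end_offset are derived from the byte sizes of
# HEADER-REC and TRAILER-REC, and both records are excluded from df.schema.
```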

Summary by CodeRabbit

  • New Features

    • Added record_header_name and record_trailer_name options to designate specific records as headers or trailers in multisegment COBOL files. Designated records are automatically excluded from the output schema, and file offsets are computed based on their byte sizes.
  • Documentation

    • Updated documentation with new multisegment file options for header and trailer configuration.
  • Chores

    • Added VS Code workspace settings to gitignore.

@alexzvk alexzvk requested a review from yruslan as a code owner March 24, 2026 22:10
@coderabbitai
Contributor

coderabbitai bot commented Mar 24, 2026

Walkthrough

This PR introduces support for COBOL copybook header and trailer records. Users can now specify record names that should be excluded from output via record_header_name and record_trailer_name options. The feature computes byte offsets from copybook structure and filters excluded records from schema and extraction pipelines.

Changes

Cohort / File(s) Summary
Configuration & Parameter Parsing
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala, CobolParametersParser.scala, ReaderParameters.scala
Added recordHeaderName, recordTrailerName options to CobolParameters; extended CobolParametersParser to parse these options, compute recordsToExclude set, and validate against conflicting offset settings; added recordsToExclude field to ReaderParameters.
Schema & Record Extraction
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala, spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala, cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala
Updated both CobolSchema implementations to accept recordsToExclude and filter excluded records from schema generation and size calculations; extended RecordExtractors.extractRecord and extractHierarchicalRecord to accept and apply record exclusion filtering.
Reader & Source Integration
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala, spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/builder/SparkCobolOptionsBuilder.scala
Enhanced DefaultSource with new resolveHeaderTrailerOffsets logic to parse copybook, locate header/trailer records, derive byte offsets, and configure readers accordingly; updated SparkCobolOptionsBuilder to pass recordsToExclude to record extraction.
Iterator Updates
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/FixedLenNestedRowIterator.scala, VarLenHierarchicalIterator.scala, VarLenNestedIterator.scala
Updated all three iterator implementations to pass readerProperties.recordsToExclude into their respective RecordExtractors method calls.
Tests & Documentation
.gitignore, README.md, spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test13cCopybookHeaderTrailerSpec.scala
Added .vscode/ to .gitignore; documented new record_header_name and record_trailer_name options in README; added comprehensive integration test suite validating header/trailer record exclusion, offset resolution, schema filtering, and error handling.
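The parsing and validation behavior described above for CobolParametersParser can be sketched as follows (a simplified Python model under stated assumptions — the real implementation is Scala, and the function name and error messages here are illustrative):

```python
def compute_records_to_exclude(header_name=None, trailer_name=None,
                               file_start_offset=0, file_end_offset=0):
    """Sketch of the validation rules described in the walkthrough:
    header/trailer names conflict with explicit offsets, and the same
    record cannot serve as both header and trailer."""
    if header_name and file_start_offset:
        raise ValueError("record_header_name cannot be combined with file_start_offset")
    if trailer_name and file_end_offset:
        raise ValueError("record_trailer_name cannot be combined with file_end_offset")
    if header_name and header_name == trailer_name:
        raise ValueError("the same record cannot be both header and trailer")
    # The set of root-level record names to drop from schema and extraction.
    return {name for name in (header_name, trailer_name) if name}
```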

Sequence Diagram

sequenceDiagram
    actor User
    participant ParameterParser as CobolParametersParser
    participant DefaultSource
    participant CopybookParser
    participant Schema as CobolSchema
    participant RecordExtractor as RecordExtractors
    participant Output

    User->>ParameterParser: Provide record_header_name,<br/>record_trailer_name options
    ParameterParser->>ParameterParser: Parse & validate options<br/>Compute recordsToExclude
    ParameterParser->>DefaultSource: Pass CobolParameters<br/>with recordsToExclude
    DefaultSource->>CopybookParser: Load & parse copybook
    CopybookParser-->>DefaultSource: AST with root records
    DefaultSource->>DefaultSource: resolveHeaderTrailerOffsets:<br/>Locate header/trailer records,<br/>compute byte offsets
    DefaultSource->>Schema: Create CobolSchema<br/>with recordsToExclude
    Schema->>Schema: Filter excluded records<br/>from copybook structure
    Schema-->>DefaultSource: Filtered schema
    DefaultSource->>RecordExtractor: Call extractRecord<br/>with recordsToExclude
    RecordExtractor->>RecordExtractor: Iterate root groups,<br/>exclude filtered records
    RecordExtractor-->>Output: Extract data records only
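The resolveHeaderTrailerOffsets step in the diagram can be modeled roughly as below (a simplified Python sketch; the actual Scala logic walks the parsed copybook AST rather than a flat list, and the function signature is illustrative):

```python
def resolve_header_trailer_offsets(root_records, header_name=None, trailer_name=None):
    """root_records: ordered list of (record_name, byte_size) pairs from the copybook.

    Returns (file_start_offset, file_end_offset, remaining_record_names):
    offsets derived from the header/trailer byte sizes, and the root-level
    records left in the schema after exclusion."""
    sizes = dict(root_records)
    for name in (header_name, trailer_name):
        if name is not None and name not in sizes:
            raise ValueError(f"Record '{name}' not found in the copybook")
    start = sizes[header_name] if header_name else 0
    end = sizes[trailer_name] if trailer_name else 0
    excluded = {header_name, trailer_name} - {None}
    remaining = [name for name, _ in root_records if name not in excluded]
    return start, end, remaining
```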

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Poem

🐰 Headers skip and trailers flee,
With record names we're wild and free!
Offset bytes now dance in place,
Pure data shines with bounded grace,
One feature hops—excluded with style!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed. The title 'Multisegment Header Trailer Removal' accurately reflects the core functionality: enabling users to specify and automatically exclude header/trailer records from multisegment EBCDIC files.
  • Docstring Coverage — ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala (1)

114-115: Add Scaladoc entries for the new header/trailer options.

The new recordHeaderName / recordTrailerName fields are part of the public parameter model but not listed in the class @param docs yet.

📝 Suggested doc patch
  * `@param` metadataPolicy          Specifies the policy of metadat fields to be added to the Spark schema
+ * `@param` recordHeaderName        Optional root-level record name to treat as file header
+ * `@param` recordTrailerName       Optional root-level record name to treat as file trailer
  * `@param` options                 Options passed to 'spark-cobol'.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala`
around lines 114 - 115, The Scaladoc for the CobolParameters class is missing
`@param` entries for the new optional fields recordHeaderName and
recordTrailerName; update the class Scaladoc to include two `@param` lines
describing these fields (their purpose as optional names for record header and
trailer records, expected format and behavior when None) so the public parameter
model docs reflect the new options and their semantics.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala (1)

142-142: Update Scaladoc to include the new constructor field.

ReaderParameters now has recordsToExclude, but the class Scaladoc @param list doesn’t document it yet.

📝 Suggested doc patch
  * `@param` inputFileNameColumn     A column name to add to the dataframe. The column will contain input file name for each record similar to 'input_file_name()' function
  * `@param` metadataPolicy          Specifies the policy of metadat fields to be added to the Spark schema
+ * `@param` recordsToExclude        Root-level record names (normalized) to exclude from schema and extraction output
  * `@param` options                 Options passed to spark-cobol
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala`
at line 142, Update the Scaladoc for the ReaderParameters class to document the
newly added constructor field recordsToExclude: locate the class definition for
ReaderParameters and add an `@param` entry describing recordsToExclude (type
Set[String], default Set.empty) in the Scaladoc `@param` list so it explains its
purpose and expected values; ensure the wording matches existing style and
mentions default behavior when not provided.
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test13cCopybookHeaderTrailerSpec.scala (1)

155-227: LGTM!

Good negative test coverage for validation errors:

  1. record_header_name + file_start_offset incompatibility
  2. record_trailer_name + file_end_offset incompatibility
  3. Non-existent record name
  4. Same record for header and trailer

Consider adding a test for trailer-only exclusion (symmetric to the header-only test) for completeness.

💡 Optional: Add trailer-only exclusion test
test("Test only record_trailer_name without header") {
  // Test data without header record at start
  val testDataNoHeader: Array[Byte] = Array[Byte](
    // Data record 1: ALICE 0100 (10 bytes DATA-REC)
    0x41, 0x4C, 0x49, 0x43, 0x45, 0x20, 0x30, 0x31, 0x30, 0x30,
    // Data record 2: BOB   0200 (10 bytes DATA-REC)
    0x42, 0x4F, 0x42, 0x20, 0x20, 0x20, 0x30, 0x32, 0x30, 0x30,
    // Trailer record: TRL 0002 (8 bytes)
    0x54, 0x52, 0x4C, 0x20, 0x30, 0x30, 0x30, 0x32
  )

  // Copybook without header definition
  val copybookNoHeader =
    """       01  DATA-REC.
      |           05  NAME           PIC X(6).
      |           05  AMOUNT         PIC 9(4).
      |       01  TRAILER-REC.
      |           05  TRL-TAG        PIC X(4).
      |           05  TRL-COUNT      PIC 9(4).
      |""".stripMargin

  withTempBinFile("test13c", ".dat", testDataNoHeader) { tempFile =>
    val df = spark
      .read
      .format("cobol")
      .option("copybook_contents", copybookNoHeader)
      .option("encoding", "ascii")
      .option("record_trailer_name", "TRAILER-REC")
      .option("schema_retention_policy", "collapse_root")
      .load(tempFile)

    assert(df.count() == 2)
    val fieldNames = df.schema.fieldNames
    assert(fieldNames.contains("NAME"))
    assert(!fieldNames.contains("TRL_TAG"))
  }
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test13cCopybookHeaderTrailerSpec.scala`
around lines 155 - 227, Add a new test mirroring the header-only exclusion test
named "Test only record_trailer_name without header" that uses withTempBinFile
to write binary testDataNoHeader (two DATA-REC records plus one TRAILER-REC), a
copybook string that defines DATA-REC and TRAILER-REC (but no header), then read
with spark.read.format("cobol") using .option("copybook_contents",
copybookNoHeader), .option("encoding", "ascii"), .option("record_trailer_name",
"TRAILER-REC") and .option("schema_retention_policy", "collapse_root"); assert
the resulting DataFrame count equals 2 and that the schema contains DATA-REC
fields (e.g., "NAME") but does not contain trailer fields (e.g., "TRL_TAG") to
verify trailer-only exclusion behavior.
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (1)

422-440: Consider adding recordsToExclude support to the builder.

The CobolSchemaBuilder.build() method does not support recordsToExclude, meaning schemas created via the builder cannot exclude header/trailer records. This is likely acceptable if the builder is intended for simpler use cases, but it creates an API inconsistency.

💡 Optional: Add builder support for recordsToExclude
   class CobolSchemaBuilder(copybook: Copybook) {
     // ... existing fields ...
     private var metadataPolicy: MetadataPolicy = MetadataPolicy.Basic
+    private var recordsToExclude: Set[String] = Set.empty

+    def withRecordsToExclude(recordsToExclude: Set[String]): CobolSchemaBuilder = {
+      this.recordsToExclude = recordsToExclude
+      this
+    }

     def build(): CobolSchema = {
       // ... existing code ...
       new CobolSchema(
         copybook,
         schemaRetentionPolicy,
         isDisplayAlwaysString,
         strictIntegralPrecision,
         inputFileNameField,
         generateRecordId,
         generateRecordBytes,
         corruptFieldsPolicy,
         generateSegIdFieldsCnt,
         segmentIdProvidedPrefix,
-        metadataPolicy
+        metadataPolicy,
+        recordsToExclude
       )
     }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala`
around lines 422 - 440, The builder's build() method currently omits the
recordsToExclude setting; update CobolSchemaBuilder.build() (the def build():
CobolSchema method) to pass the builder's recordsToExclude into the CobolSchema
constructor (add the recordsToExclude argument to the new CobolSchema(...)
call), and ensure the builder exposes a recordsToExclude field/parameter used
here; if the CobolSchema constructor signature already accepts recordsToExclude,
simply add it to the argument list, otherwise update the constructor to accept
and store it as well.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0d27b798-a188-4da3-a652-7e938d5407ae

📥 Commits

Reviewing files that changed from the base of the PR and between dc80bd4 and b81c59f.

📒 Files selected for processing (14)
  • .gitignore
  • README.md
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/FixedLenNestedRowIterator.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/VarLenHierarchicalIterator.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/VarLenNestedIterator.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/builder/SparkCobolOptionsBuilder.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test13cCopybookHeaderTrailerSpec.scala

Comment on lines +1618 to 1619
| .option("record_trailer_name", "TRAILER") | Assuming a copybook definition represents a trailer, offsets the file read end by the number of bytes in that trailer and excludes the record definition from the output schema. |
##### Helper fields generation options

⚠️ Potential issue | 🟡 Minor

Add a blank line after the multisegment options table.

Line 1618 is immediately followed by a heading, which triggers markdownlint MD058 (tables should be surrounded by blank lines).

✅ Minimal markdown fix
 | .option("record_trailer_name", "TRAILER")                                             | Assuming a copybook definition represents a trailer, offsets the file read end by the number of bytes in that trailer and excludes the record definition from the output schema.                                                                                                                                                                                                                                                                                            |
+
 ##### Helper fields generation options    
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 1618-1618: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 1618 - 1619, The README has a Markdown lint issue
MD058 because the table ending with the `.option("record_trailer_name",
"TRAILER")` row is immediately followed by the heading `##### Helper fields
generation options`; insert a single blank line between the table and that
heading so the table is surrounded by blank lines (i.e., add one empty line
after the table row containing `.option("record_trailer_name", "TRAILER")` and
before the `##### Helper fields generation options` heading).

Collaborator

@yruslan yruslan left a comment


This is awesome! Thank you for your contribution! A feature like this for handling headers and footers was requested several times, so glad you contributed the solution.

@yruslan
Collaborator

yruslan commented Mar 25, 2026

We will release the new version of Cobrix tomorrow

@yruslan yruslan merged commit c9ee54c into AbsaOSS:master Mar 25, 2026
6 checks passed
