
Multisegment Header Trailer Removal#836

Merged
yruslan merged 3 commits into AbsaOSS:master from capitalone-contributions:multisegment-header-trailer on Mar 25, 2026

Conversation

@alexzvk
Contributor

@alexzvk alexzvk commented Mar 24, 2026

Some EBCDIC files contain headers and trailers whose schemas differ from the schema of the actual data. Currently, to avoid reading these headers and trailers, one must manually parse the copybook and calculate file_start_offset and file_end_offset in bytes. This PR lets the user designate copybook root-level records as the header and trailer, so that file_start_offset and file_end_offset are calculated automatically and those records are removed from the schema.
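The new options can be used like this from PySpark (a usage sketch, not taken from the PR itself; the copybook path, record names, and data path are hypothetical, and running it requires a Spark session with the spark-cobol package on the classpath):

```python
# Hypothetical usage sketch of the options added in this PR.
# Option names (record_header_name, record_trailer_name) come from the PR;
# paths and record names are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("cobol")
      .option("copybook", "/path/to/copybook.cpy")   # hypothetical path
      .option("record_header_name", "HEADER-REC")    # root-level record treated as file header
      .option("record_trailer_name", "TRAILER-REC")  # root-level record treated as file trailer
      .load("/path/to/data.dat"))                    # hypothetical path

# file_start_offset / file_end_offset are derived from the byte sizes of
# HEADER-REC and TRAILER-REC, and both records are excluded from df.schema.
```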

Summary by CodeRabbit

  • New Features

    • Added record_header_name and record_trailer_name options to designate specific records as headers or trailers in multisegment COBOL files. Designated records are automatically excluded from the output schema, and file offsets are computed based on their byte sizes.
  • Documentation

    • Updated documentation with new multisegment file options for header and trailer configuration.
  • Chores

    • Added VS Code workspace settings to gitignore.

@alexzvk alexzvk requested a review from yruslan as a code owner March 24, 2026 22:10
@coderabbitai
Contributor

coderabbitai bot commented Mar 24, 2026

Walkthrough

This PR introduces support for COBOL copybook header and trailer records. Users can now specify record names that should be excluded from output via record_header_name and record_trailer_name options. The feature computes byte offsets from copybook structure and filters excluded records from schema and extraction pipelines.

Changes

Cohort / File(s) Summary
Configuration & Parameter Parsing
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala, CobolParametersParser.scala, ReaderParameters.scala
Added recordHeaderName, recordTrailerName options to CobolParameters; extended CobolParametersParser to parse these options, compute recordsToExclude set, and validate against conflicting offset settings; added recordsToExclude field to ReaderParameters.
Schema & Record Extraction
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala, spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala, cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala
Updated both CobolSchema implementations to accept recordsToExclude and filter excluded records from schema generation and size calculations; extended RecordExtractors.extractRecord and extractHierarchicalRecord to accept and apply record exclusion filtering.
Reader & Source Integration
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala, spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/builder/SparkCobolOptionsBuilder.scala
Enhanced DefaultSource with new resolveHeaderTrailerOffsets logic to parse copybook, locate header/trailer records, derive byte offsets, and configure readers accordingly; updated SparkCobolOptionsBuilder to pass recordsToExclude to record extraction.
Iterator Updates
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/FixedLenNestedRowIterator.scala, VarLenHierarchicalIterator.scala, VarLenNestedIterator.scala
Updated all three iterator implementations to pass readerProperties.recordsToExclude into their respective RecordExtractors method calls.
Tests & Documentation
.gitignore, README.md, spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test13cCopybookHeaderTrailerSpec.scala
Added .vscode/ to .gitignore; documented new record_header_name and record_trailer_name options in README; added comprehensive integration test suite validating header/trailer record exclusion, offset resolution, schema filtering, and error handling.
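The parsing and validation behavior described above for CobolParametersParser can be sketched as follows (a simplified Python model under stated assumptions — the real implementation is Scala, and the function name and error messages here are illustrative):

```python
def compute_records_to_exclude(header_name=None, trailer_name=None,
                               file_start_offset=0, file_end_offset=0):
    """Sketch of the validation rules described in the walkthrough:
    header/trailer names conflict with explicit offsets, and the same
    record cannot serve as both header and trailer."""
    if header_name and file_start_offset:
        raise ValueError("record_header_name cannot be combined with file_start_offset")
    if trailer_name and file_end_offset:
        raise ValueError("record_trailer_name cannot be combined with file_end_offset")
    if header_name and header_name == trailer_name:
        raise ValueError("the same record cannot be both header and trailer")
    # The set of root-level record names to drop from schema and extraction.
    return {name for name in (header_name, trailer_name) if name}
```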

Sequence Diagram

sequenceDiagram
    actor User
    participant ParameterParser as CobolParametersParser
    participant DefaultSource
    participant CopybookParser
    participant Schema as CobolSchema
    participant RecordExtractor as RecordExtractors
    participant Output

    User->>ParameterParser: Provide record_header_name,<br/>record_trailer_name options
    ParameterParser->>ParameterParser: Parse & validate options<br/>Compute recordsToExclude
    ParameterParser->>DefaultSource: Pass CobolParameters<br/>with recordsToExclude
    DefaultSource->>CopybookParser: Load & parse copybook
    CopybookParser-->>DefaultSource: AST with root records
    DefaultSource->>DefaultSource: resolveHeaderTrailerOffsets:<br/>Locate header/trailer records,<br/>compute byte offsets
    DefaultSource->>Schema: Create CobolSchema<br/>with recordsToExclude
    Schema->>Schema: Filter excluded records<br/>from copybook structure
    Schema-->>DefaultSource: Filtered schema
    DefaultSource->>RecordExtractor: Call extractRecord<br/>with recordsToExclude
    RecordExtractor->>RecordExtractor: Iterate root groups,<br/>exclude filtered records
    RecordExtractor-->>Output: Extract data records only
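The resolveHeaderTrailerOffsets step in the diagram can be modeled roughly as below (a simplified Python sketch; the actual Scala logic walks the parsed copybook AST rather than a flat list, and the function signature is illustrative):

```python
def resolve_header_trailer_offsets(root_records, header_name=None, trailer_name=None):
    """root_records: ordered list of (record_name, byte_size) pairs from the copybook.

    Returns (file_start_offset, file_end_offset, remaining_record_names):
    offsets derived from the header/trailer byte sizes, and the root-level
    records left in the schema after exclusion."""
    sizes = dict(root_records)
    for name in (header_name, trailer_name):
        if name is not None and name not in sizes:
            raise ValueError(f"Record '{name}' not found in the copybook")
    start = sizes[header_name] if header_name else 0
    end = sizes[trailer_name] if trailer_name else 0
    excluded = {header_name, trailer_name} - {None}
    remaining = [name for name, _ in root_records if name not in excluded]
    return start, end, remaining
```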

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Poem

🐰 Headers skip and trailers flee,
With record names we're wild and free!
Offset bytes now dance in place,
Pure data shines with bounded grace,
One feature hops—excluded with style!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — ✅ Passed. Check skipped: CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed. The title 'Multisegment Header Trailer Removal' accurately reflects the core functionality: enabling users to specify and automatically exclude header/trailer records from multisegment EBCDIC files.
  • Docstring Coverage — ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (4)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala (1)

114-115: Add Scaladoc entries for the new header/trailer options.

The new recordHeaderName / recordTrailerName fields are part of the public parameter model but not listed in the class @param docs yet.

📝 Suggested doc patch
  * `@param` metadataPolicy          Specifies the policy of metadat fields to be added to the Spark schema
+ * `@param` recordHeaderName        Optional root-level record name to treat as file header
+ * `@param` recordTrailerName       Optional root-level record name to treat as file trailer
  * `@param` options                 Options passed to 'spark-cobol'.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala`
around lines 114 - 115, The Scaladoc for the CobolParameters class is missing
`@param` entries for the new optional fields recordHeaderName and
recordTrailerName; update the class Scaladoc to include two `@param` lines
describing these fields (their purpose as optional names for record header and
trailer records, expected format and behavior when None) so the public parameter
model docs reflect the new options and their semantics.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala (1)

142-142: Update Scaladoc to include the new constructor field.

ReaderParameters now has recordsToExclude, but the class Scaladoc @param list doesn’t document it yet.

📝 Suggested doc patch
  * `@param` inputFileNameColumn     A column name to add to the dataframe. The column will contain input file name for each record similar to 'input_file_name()' function
  * `@param` metadataPolicy          Specifies the policy of metadat fields to be added to the Spark schema
+ * `@param` recordsToExclude        Root-level record names (normalized) to exclude from schema and extraction output
  * `@param` options                 Options passed to spark-cobol
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala`
at line 142, Update the Scaladoc for the ReaderParameters class to document the
newly added constructor field recordsToExclude: locate the class definition for
ReaderParameters and add an `@param` entry describing recordsToExclude (type
Set[String], default Set.empty) in the Scaladoc `@param` list so it explains its
purpose and expected values; ensure the wording matches existing style and
mentions default behavior when not provided.
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test13cCopybookHeaderTrailerSpec.scala (1)

155-227: LGTM!

Good negative test coverage for validation errors:

  1. record_header_name + file_start_offset incompatibility
  2. record_trailer_name + file_end_offset incompatibility
  3. Non-existent record name
  4. Same record for header and trailer

Consider adding a test for trailer-only exclusion (symmetric to the header-only test) for completeness.

💡 Optional: Add trailer-only exclusion test
test("Test only record_trailer_name without header") {
  // Test data without header record at start
  val testDataNoHeader: Array[Byte] = Array[Byte](
    // Data record 1: ALICE 0100 (10 bytes DATA-REC)
    0x41, 0x4C, 0x49, 0x43, 0x45, 0x20, 0x30, 0x31, 0x30, 0x30,
    // Data record 2: BOB   0200 (10 bytes DATA-REC)
    0x42, 0x4F, 0x42, 0x20, 0x20, 0x20, 0x30, 0x32, 0x30, 0x30,
    // Trailer record: TRL 0002 (8 bytes)
    0x54, 0x52, 0x4C, 0x20, 0x30, 0x30, 0x30, 0x32
  )

  // Copybook without header definition
  val copybookNoHeader =
    """       01  DATA-REC.
      |           05  NAME           PIC X(6).
      |           05  AMOUNT         PIC 9(4).
      |       01  TRAILER-REC.
      |           05  TRL-TAG        PIC X(4).
      |           05  TRL-COUNT      PIC 9(4).
      |""".stripMargin

  withTempBinFile("test13c", ".dat", testDataNoHeader) { tempFile =>
    val df = spark
      .read
      .format("cobol")
      .option("copybook_contents", copybookNoHeader)
      .option("encoding", "ascii")
      .option("record_trailer_name", "TRAILER-REC")
      .option("schema_retention_policy", "collapse_root")
      .load(tempFile)

    assert(df.count() == 2)
    val fieldNames = df.schema.fieldNames
    assert(fieldNames.contains("NAME"))
    assert(!fieldNames.contains("TRL_TAG"))
  }
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test13cCopybookHeaderTrailerSpec.scala`
around lines 155 - 227, Add a new test mirroring the header-only exclusion test
named "Test only record_trailer_name without header" that uses withTempBinFile
to write binary testDataNoHeader (two DATA-REC records plus one TRAILER-REC), a
copybook string that defines DATA-REC and TRAILER-REC (but no header), then read
with spark.read.format("cobol") using .option("copybook_contents",
copybookNoHeader), .option("encoding", "ascii"), .option("record_trailer_name",
"TRAILER-REC") and .option("schema_retention_policy", "collapse_root"); assert
the resulting DataFrame count equals 2 and that the schema contains DATA-REC
fields (e.g., "NAME") but does not contain trailer fields (e.g., "TRL_TAG") to
verify trailer-only exclusion behavior.
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (1)

422-440: Consider adding recordsToExclude support to the builder.

The CobolSchemaBuilder.build() method does not support recordsToExclude, meaning schemas created via the builder cannot exclude header/trailer records. This is likely acceptable if the builder is intended for simpler use cases, but it creates an API inconsistency.

💡 Optional: Add builder support for recordsToExclude
   class CobolSchemaBuilder(copybook: Copybook) {
     // ... existing fields ...
     private var metadataPolicy: MetadataPolicy = MetadataPolicy.Basic
+    private var recordsToExclude: Set[String] = Set.empty

+    def withRecordsToExclude(recordsToExclude: Set[String]): CobolSchemaBuilder = {
+      this.recordsToExclude = recordsToExclude
+      this
+    }

     def build(): CobolSchema = {
       // ... existing code ...
       new CobolSchema(
         copybook,
         schemaRetentionPolicy,
         isDisplayAlwaysString,
         strictIntegralPrecision,
         inputFileNameField,
         generateRecordId,
         generateRecordBytes,
         corruptFieldsPolicy,
         generateSegIdFieldsCnt,
         segmentIdProvidedPrefix,
-        metadataPolicy
+        metadataPolicy,
+        recordsToExclude
       )
     }
   }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala`
around lines 422 - 440, The builder's build() method currently omits the
recordsToExclude setting; update CobolSchemaBuilder.build() (the def build():
CobolSchema method) to pass the builder's recordsToExclude into the CobolSchema
constructor (add the recordsToExclude argument to the new CobolSchema(...)
call), and ensure the builder exposes a recordsToExclude field/parameter used
here; if the CobolSchema constructor signature already accepts recordsToExclude,
simply add it to the argument list, otherwise update the constructor to accept
and store it as well.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0d27b798-a188-4da3-a652-7e938d5407ae

📥 Commits

Reviewing files that changed from the base of the PR and between dc80bd4 and b81c59f.

📒 Files selected for processing (14)
  • .gitignore
  • README.md
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/FixedLenNestedRowIterator.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/VarLenHierarchicalIterator.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/VarLenNestedIterator.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/builder/SparkCobolOptionsBuilder.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test13cCopybookHeaderTrailerSpec.scala

Comment on lines +1618 to 1619
| .option("record_trailer_name", "TRAILER") | Assuming a copybook definition represents a trailer, offsets the file read end by the number of bytes in that trailer and excludes the record definition from the output schema. |
##### Helper fields generation options

⚠️ Potential issue | 🟡 Minor

Add a blank line after the multisegment options table.

Line 1618 is immediately followed by a heading, which triggers markdownlint MD058 (tables should be surrounded by blank lines).

✅ Minimal markdown fix
 | .option("record_trailer_name", "TRAILER")                                             | Assuming a copybook definition represents a trailer, offsets the file read end by the number of bytes in that trailer and excludes the record definition from the output schema.                                                                                                                                                                                                                                                                                            |
+
 ##### Helper fields generation options    
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 1618-1618: Tables should be surrounded by blank lines

(MD058, blanks-around-tables)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 1618 - 1619, The README has a Markdown lint issue
MD058 because the table ending with the `.option("record_trailer_name",
"TRAILER")` row is immediately followed by the heading `##### Helper fields
generation options`; insert a single blank line between the table and that
heading so the table is surrounded by blank lines (i.e., add one empty line
after the table row containing `.option("record_trailer_name", "TRAILER")` and
before the `##### Helper fields generation options` heading).

Collaborator

@yruslan yruslan left a comment


This is awesome! Thank you for your contribution! A feature like this for handling headers and footers was requested several times, so glad you contributed the solution.

@yruslan
Collaborator

yruslan commented Mar 25, 2026

We will release the new version of Cobrix tomorrow

@yruslan yruslan merged commit c9ee54c into AbsaOSS:master Mar 25, 2026
6 checks passed
