Multisegment Header Trailer Removal#836
Conversation
Walkthrough

This PR introduces support for COBOL copybook header and trailer records. Users can now specify record names that should be excluded from output via the `record_header_name` and `record_trailer_name` options.
Sequence Diagram

```mermaid
sequenceDiagram
    actor User
    participant ParameterParser as CobolParametersParser
    participant DefaultSource
    participant CopybookParser
    participant Schema as CobolSchema
    participant RecordExtractor as RecordExtractors
    participant Output
    User->>ParameterParser: Provide record_header_name,<br/>record_trailer_name options
    ParameterParser->>ParameterParser: Parse & validate options<br/>Compute recordsToExclude
    ParameterParser->>DefaultSource: Pass CobolParameters<br/>with recordsToExclude
    DefaultSource->>CopybookParser: Load & parse copybook
    CopybookParser-->>DefaultSource: AST with root records
    DefaultSource->>DefaultSource: resolveHeaderTrailerOffsets:<br/>Locate header/trailer records,<br/>compute byte offsets
    DefaultSource->>Schema: Create CobolSchema<br/>with recordsToExclude
    Schema->>Schema: Filter excluded records<br/>from copybook structure
    Schema-->>DefaultSource: Filtered schema
    DefaultSource->>RecordExtractor: Call extractRecord<br/>with recordsToExclude
    RecordExtractor->>RecordExtractor: Iterate root groups,<br/>exclude filtered records
    RecordExtractor-->>Output: Extract data records only
```
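The schema-filtering step in the flow above can be sketched with a minimal model. This is a hedged illustration only: the `Group` case class and `normalize` helper are hypothetical stand-ins for Cobrix's actual copybook AST types, which are richer than this.

```scala
// Hypothetical minimal model of the exclusion step; Cobrix's real AST types differ.
case class Group(name: String, sizeBytes: Int)

// Record names are assumed to be matched case-insensitively after trimming.
def normalize(name: String): String = name.trim.toUpperCase

def filterExcluded(roots: Seq[Group], recordsToExclude: Set[String]): Seq[Group] =
  roots.filterNot(g => recordsToExclude.contains(normalize(g.name)))

val roots = Seq(Group("HEADER-REC", 8), Group("DATA-REC", 10), Group("TRAILER-REC", 8))
val kept  = filterExcluded(roots, Set("HEADER-REC", "TRAILER-REC"))
// Only DATA-REC remains; per the diagram, the header's byte size would feed the
// start offset and the trailer's byte size the end offset of the file read.
```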
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (4)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala (1)
114-115: Add Scaladoc entries for the new header/trailer options.

The new `recordHeaderName`/`recordTrailerName` fields are part of the public parameter model but are not listed in the class `@param` docs yet.

📝 Suggested doc patch

```diff
 * @param metadataPolicy Specifies the policy of metadat fields to be added to the Spark schema
+* @param recordHeaderName Optional root-level record name to treat as file header
+* @param recordTrailerName Optional root-level record name to treat as file trailer
 * @param options Options passed to 'spark-cobol'.
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala` around lines 114-115: the Scaladoc for the CobolParameters class is missing `@param` entries for the new optional fields recordHeaderName and recordTrailerName; update the class Scaladoc to include two `@param` lines describing these fields (their purpose as optional names for record header and trailer records, expected format, and behavior when None) so the public parameter model docs reflect the new options and their semantics.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala (1)
142-142: Update Scaladoc to include the new constructor field.

`ReaderParameters` now has `recordsToExclude`, but the class Scaladoc `@param` list doesn't document it yet.

📝 Suggested doc patch

```diff
 * @param inputFileNameColumn A column name to add to the dataframe. The column will contain input file name for each record similar to 'input_file_name()' function
 * @param metadataPolicy Specifies the policy of metadat fields to be added to the Spark schema
+* @param recordsToExclude Root-level record names (normalized) to exclude from schema and extraction output
 * @param options Options passed to spark-cobol
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala` at line 142: update the Scaladoc for the ReaderParameters class to document the newly added constructor field recordsToExclude: locate the class definition for ReaderParameters and add an `@param` entry describing recordsToExclude (type Set[String], default Set.empty) in the Scaladoc `@param` list so it explains its purpose and expected values; ensure the wording matches the existing style and mentions default behavior when not provided.

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test13cCopybookHeaderTrailerSpec.scala (1)
155-227: LGTM!

Good negative test coverage for validation errors:

- `record_header_name` + `file_start_offset` incompatibility
- `record_trailer_name` + `file_end_offset` incompatibility
- Non-existent record name
- Same record for header and trailer
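The four error cases above suggest validation logic roughly like the following. This is a hedged sketch; the method name and signature are hypothetical illustrations, not the actual CobolParametersParser code:

```scala
// Hypothetical sketch of the validations the negative tests exercise.
def validateHeaderTrailerOptions(headerName: Option[String],
                                 trailerName: Option[String],
                                 fileStartOffset: Int,
                                 fileEndOffset: Int,
                                 rootRecordNames: Set[String]): Unit = {
  if (headerName.nonEmpty && fileStartOffset > 0)
    throw new IllegalArgumentException("'record_header_name' cannot be used together with 'file_start_offset'")
  if (trailerName.nonEmpty && fileEndOffset > 0)
    throw new IllegalArgumentException("'record_trailer_name' cannot be used together with 'file_end_offset'")
  if (headerName.nonEmpty && headerName == trailerName)
    throw new IllegalArgumentException("The same record cannot be both header and trailer")
  (headerName.toSeq ++ trailerName.toSeq).foreach { name =>
    if (!rootRecordNames.contains(name.trim.toUpperCase))
      throw new IllegalArgumentException(s"Record '$name' is not found among root-level copybook records")
  }
}
```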
Consider adding a test for trailer-only exclusion (symmetric to the header-only test) for completeness.
💡 Optional: Add trailer-only exclusion test
```scala
test("Test only record_trailer_name without header") {
  // Test data without header record at start
  val testDataNoHeader: Array[Byte] = Array[Byte](
    // Data record 1: ALICE 0100 (10 bytes DATA-REC)
    0x41, 0x4C, 0x49, 0x43, 0x45, 0x20, 0x30, 0x31, 0x30, 0x30,
    // Data record 2: BOB 0200 (10 bytes DATA-REC)
    0x42, 0x4F, 0x42, 0x20, 0x20, 0x20, 0x30, 0x32, 0x30, 0x30,
    // Trailer record: TRL 0002 (8 bytes)
    0x54, 0x52, 0x4C, 0x20, 0x30, 0x30, 0x30, 0x32
  )

  // Copybook without header definition
  val copybookNoHeader =
    """      01 DATA-REC.
      |         05 NAME   PIC X(6).
      |         05 AMOUNT PIC 9(4).
      |      01 TRAILER-REC.
      |         05 TRL-TAG   PIC X(4).
      |         05 TRL-COUNT PIC 9(4).
      |""".stripMargin

  withTempBinFile("test13c", ".dat", testDataNoHeader) { tempFile =>
    val df = spark
      .read
      .format("cobol")
      .option("copybook_contents", copybookNoHeader)
      .option("encoding", "ascii")
      .option("record_trailer_name", "TRAILER-REC")
      .option("schema_retention_policy", "collapse_root")
      .load(tempFile)

    assert(df.count() == 2)

    val fieldNames = df.schema.fieldNames
    assert(fieldNames.contains("NAME"))
    assert(!fieldNames.contains("TRL_TAG"))
  }
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test13cCopybookHeaderTrailerSpec.scala` around lines 155-227: add a new test mirroring the header-only exclusion test named "Test only record_trailer_name without header" that uses withTempBinFile to write binary testDataNoHeader (two DATA-REC records plus one TRAILER-REC), a copybook string that defines DATA-REC and TRAILER-REC (but no header), then read with spark.read.format("cobol") using .option("copybook_contents", copybookNoHeader), .option("encoding", "ascii"), .option("record_trailer_name", "TRAILER-REC") and .option("schema_retention_policy", "collapse_root"); assert the resulting DataFrame count equals 2 and that the schema contains DATA-REC fields (e.g., "NAME") but does not contain trailer fields (e.g., "TRL_TAG") to verify trailer-only exclusion behavior.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala (1)
422-440: Consider adding `recordsToExclude` support to the builder.

The `CobolSchemaBuilder.build()` method does not support `recordsToExclude`, meaning schemas created via the builder cannot exclude header/trailer records. This is likely acceptable if the builder is intended for simpler use cases, but it creates an API inconsistency.

💡 Optional: Add builder support for recordsToExclude

```diff
 class CobolSchemaBuilder(copybook: Copybook) {
   // ... existing fields ...
   private var metadataPolicy: MetadataPolicy = MetadataPolicy.Basic
+  private var recordsToExclude: Set[String] = Set.empty

+  def withRecordsToExclude(recordsToExclude: Set[String]): CobolSchemaBuilder = {
+    this.recordsToExclude = recordsToExclude
+    this
+  }

   def build(): CobolSchema = {
     // ... existing code ...
     new CobolSchema(
       copybook,
       schemaRetentionPolicy,
       isDisplayAlwaysString,
       strictIntegralPrecision,
       inputFileNameField,
       generateRecordId,
       generateRecordBytes,
       corruptFieldsPolicy,
       generateSegIdFieldsCnt,
       segmentIdProvidedPrefix,
-      metadataPolicy
+      metadataPolicy,
+      recordsToExclude
     )
   }
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala` around lines 422 - 440, The builder's build() method currently omits the recordsToExclude setting; update CobolSchemaBuilder.build() (the def build(): CobolSchema method) to pass the builder's recordsToExclude into the CobolSchema constructor (add the recordsToExclude argument to the new CobolSchema(...) call), and ensure the builder exposes a recordsToExclude field/parameter used here; if the CobolSchema constructor signature already accepts recordsToExclude, simply add it to the argument list, otherwise update the constructor to accept and store it as well.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0d27b798-a188-4da3-a652-7e938d5407ae
📒 Files selected for processing (14)

- .gitignore
- README.md
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/extractors/record/RecordExtractors.scala
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/FixedLenNestedRowIterator.scala
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/VarLenHierarchicalIterator.scala
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/iterator/VarLenNestedIterator.scala
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/builder/SparkCobolOptionsBuilder.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/schema/CobolSchema.scala
- spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/integration/Test13cCopybookHeaderTrailerSpec.scala
| .option("record_trailer_name", "TRAILER") | Assuming a copybook definition represents a trailer, offsets the file read end by the number of bytes in that trailer and excludes the record definition from the output schema. |
##### Helper fields generation options
Add a blank line after the multisegment options table.
Line 1618 is immediately followed by a heading, which triggers markdownlint MD058 (tables should be surrounded by blank lines).
✅ Minimal markdown fix

```diff
 | .option("record_trailer_name", "TRAILER") | Assuming a copybook definition represents a trailer, offsets the file read end by the number of bytes in that trailer and excludes the record definition from the output schema. |
+
 ##### Helper fields generation options
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| .option("record_trailer_name", "TRAILER") | Assuming a copybook definition represents a trailer, offsets the file read end by the number of bytes in that trailer and excludes the record definition from the output schema. |

##### Helper fields generation options
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 1618-1618: Tables should be surrounded by blank lines
(MD058, blanks-around-tables)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@README.md` around lines 1618 - 1619, The README has a Markdown lint issue
MD058 because the table ending with the `.option("record_trailer_name",
"TRAILER")` row is immediately followed by the heading `##### Helper fields
generation options`; insert a single blank line between the table and that
heading so the table is surrounded by blank lines (i.e., add one empty line
after the table row containing `.option("record_trailer_name", "TRAILER")` and
before the `##### Helper fields generation options` heading).
yruslan
left a comment
This is awesome! Thank you for your contribution! A feature like this for handling headers and footers was requested several times, so glad you contributed the solution.
We will release the new version of Cobrix tomorrow.
Some EBCDIC files contain headers and trailers whose schemas differ from the schema of the actual data. Currently, to avoid reading these headers and trailers, one must manually parse the copybook and calculate `file_start_offset` and `file_end_offset` in bytes. This PR allows the user to specify copybook root-level records, have `file_start_offset` and `file_end_offset` calculated automatically, and remove these records from the schema.
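As a rough usage sketch of the contrast described above: the paths, offset values, and record names below are placeholders; `record_header_name`/`record_trailer_name` are the options this PR adds, while `file_start_offset`/`file_end_offset` are the pre-existing options.

```scala
// Before this PR: offsets computed by hand from the copybook layout
val dfManual = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("file_start_offset", "100") // header size in bytes, computed manually
  .option("file_end_offset", "50")    // trailer size in bytes, computed manually
  .load("/path/to/data")

// With this PR: name the records; offsets are derived from their byte sizes
// and the records are dropped from the output schema
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("record_header_name", "HEADER-REC")
  .option("record_trailer_name", "TRAILER-REC")
  .load("/path/to/data")
```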
Summary by CodeRabbit
New Features
- `record_header_name` and `record_trailer_name` options to designate specific records as headers or trailers in multisegment COBOL files. Designated records are automatically excluded from the output schema, and file offsets are computed based on their byte sizes.

Documentation
Chores