feat(es): add dump helper and bulk-ndjson restore path #415
Add `elastic es helpers dump` for exporting indices as bulk-format NDJSON using PIT + search_after, and extend `bulk-ingest` with a `bulk-ndjson` source format so the dump output can be streamed back into `_bulk`. Use case: capture a remote index for local debugging. Inspired by escli-rs (https://github.com/Anaethelion/escli-rs); ports the dump/load feature set into the elastic/cli helper conventions.
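For readers unfamiliar with the pattern, the PIT + `search_after` loop mentioned above can be sketched roughly as follows. `PitClient`, `Hit`, and `dumpIndex` are hypothetical names made up for this illustration and are not the PR's transport or helper API; only the request body fields (`pit`, a sort on `_shard_doc`, `search_after`) follow the Elasticsearch search API.

```typescript
// Hypothetical minimal client surface (assumption); the real CLI transport differs.
interface Hit { _id: string, _source: unknown, sort: unknown[] }
interface PitClient {
  openPit: (index: string, keepAlive: string) => Promise<string>
  search: (body: Record<string, unknown>) => Promise<{ pitId?: string, hits: Hit[] }>
  closePit: (id: string) => Promise<void>
}

// Yields one page of hits at a time and always closes the PIT, even when the
// consumer stops early or a request throws.
async function * dumpIndex (
  client: PitClient,
  index: string,
  size = 1000,
  keepAlive = '1m'
): AsyncGenerator<Hit[]> {
  let pitId = await client.openPit(index, keepAlive)
  try {
    let searchAfter: unknown[] | undefined
    while (true) {
      const page = await client.search({
        size,
        pit: { id: pitId, keep_alive: keepAlive },
        sort: [{ _shard_doc: 'asc' }],
        ...(searchAfter !== undefined ? { search_after: searchAfter } : {})
      })
      // The server may hand back a refreshed PIT id; keep using the latest one.
      if (page.pitId !== undefined) pitId = page.pitId
      if (page.hits.length === 0) return
      yield page.hits
      // The sort values of the last hit are the cursor for the next page.
      searchAfter = page.hits[page.hits.length - 1].sort
    }
  } finally {
    await client.closePit(pitId)
  }
}
```

Because the cursor is the last hit's sort values rather than a server-side scroll context, the only state left on the cluster is the PIT, which the `finally` block releases.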
✅ MegaLinter analysis: Success
Supersedes the previous regeneration commit, which ran against a locally-symlinked node_modules with a stale yaml version. The actual drift on main was commander 14 -> 15 from #410, which had updated the lockfile but not NOTICE.txt.
Adds a "Dump and restore an index" subsection under the `es` section with the round-trip example, a flag reference for `dump`, and a note on the new `bulk-ingest --source-format bulk-ndjson` mode.
Anaethelion
left a comment
Just a small improvement.
Track the active PIT id and output fd in mutable refs visible to a SIGINT/SIGTERM handler that releases both before the process exits with code 130. The per-index `finally` and the signal handler share the same refs (and null them eagerly) so the two cleanup paths can't race into a double-close. Listeners are removed when the handler returns. Addresses Anaethelion review feedback on PR #415.
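The shared-refs idea in this commit message can be illustrated with a small sketch. Everything named here (`DumpRefs`, `ProcessLike`, `makeCleanup`, `installSignalHandlers`) is hypothetical and chosen for the example; the actual commit may structure this differently. `ProcessLike` stands in for the parts of Node's `process` object the sketch needs.

```typescript
// Refs shared by the per-index `finally` path and the signal handler.
interface DumpRefs { pitId: string | null, fd: number | null }

// Stand-in for the bits of Node's `process` used below (assumption, so the
// sketch stays self-contained).
interface ProcessLike {
  once: (signal: 'SIGINT' | 'SIGTERM', fn: () => void) => unknown
  off: (signal: 'SIGINT' | 'SIGTERM', fn: () => void) => unknown
  exit: (code: number) => void
}

function makeCleanup (
  refs: DumpRefs,
  closePit: (id: string) => Promise<void>, // hypothetical: wraps the close-PIT request
  closeFd: (fd: number) => void // hypothetical: e.g. a thin wrapper over fs.closeSync
): () => Promise<void> {
  return async () => {
    // Take the values and null the refs before any await, so a second caller
    // (finally block or signal handler) finds nothing left to release.
    const { pitId, fd } = refs
    refs.pitId = null
    refs.fd = null
    if (fd !== null) closeFd(fd)
    if (pitId !== null) await closePit(pitId)
  }
}

// Returns a function that removes the listeners again.
function installSignalHandlers (proc: ProcessLike, cleanup: () => Promise<void>): () => void {
  const onSignal = (): void => {
    // Exit with 130 whether or not cleanup succeeded.
    void cleanup().catch(() => {}).then(() => proc.exit(130))
  }
  proc.once('SIGINT', onSignal)
  proc.once('SIGTERM', onSignal)
  return () => {
    proc.off('SIGINT', onSignal)
    proc.off('SIGTERM', onSignal)
  }
}
```

Nulling the refs synchronously is what makes the two cleanup paths safe to interleave: whichever runs first owns the release, and the other becomes a no-op.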
Found something to address after approval
JoshMock
left a comment
A few nits, and a few larger concerns. Overall looks awesome!
Regardless, I'm on PTO after today so don't wait to merge on my account. Worst case scenario, we have to make a second pass for some of my proposed improvements. 🖤
| Run `elastic es <command> --help` for all available options on any command. | ||
| ##### Dump and restore an index |
Can you duplicate or move these docs into the ./docs/ directory as a guide? Probably worth adding similar guides for the other helpers as well, honestly, but we can handle that in a separate PR.
| } catch (err) { | ||
| throw new Error(`bulk-ndjson: invalid action line at line ${lineNum}: ${err instanceof Error ? err.message : String(err)}`, { cause: err }) | ||
| } | ||
| if (parsed == null || typeof parsed !== 'object' || Array.isArray(parsed)) { |
A nit, probably for later, but it would be a nice improvement to standardize on using node:assert for all input validation of this sort (basically anywhere Zod or JSON schemas are not used for validation).
| const pairs: string[] = [] | ||
| let action: string | undefined | ||
| let lineNum = 0 | ||
| let nonEmptyCount = 0 | ||
| for (const line of raw.split('\n')) { | ||
| lineNum++ | ||
| const trimmed = line.trim() | ||
| if (trimmed.length === 0) continue | ||
| nonEmptyCount++ |
The "every other line" loop scheme appears to overlook the fact that a delete action won't have a paired document on the following line. Also, update actions require documents to be wrapped in a {"doc": ...} envelope.
Since this ingest path will always consume the output of a dump command, maybe BULK_ACTIONS should just be reduced to index and create, or just index?
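A minimal sketch of what the suggested restriction could look like is given here. The function name `parseBulkPairs` and the exact error wording are assumptions for illustration; only the `BULK_ACTIONS` name and the `index`/`create` restriction come from this thread.

```typescript
// Only actions that are always followed by a plain document line are accepted.
const BULK_ACTIONS = new Set(['index', 'create'])

// Splits raw NDJSON into "action\ndoc\n" pairs, skipping blank lines.
function parseBulkPairs (raw: string): string[] {
  const pairs: string[] = []
  let action: string | undefined
  let lineNum = 0
  for (const line of raw.split('\n')) {
    lineNum++
    const trimmed = line.trim()
    if (trimmed.length === 0) continue
    if (action === undefined) {
      // Expecting an action line: a JSON object with exactly one allowed key.
      let parsed: unknown
      try {
        parsed = JSON.parse(trimmed)
      } catch {
        throw new Error(`bulk-ndjson: invalid action line at line ${lineNum}`)
      }
      if (parsed === null || typeof parsed !== 'object' || Array.isArray(parsed)) {
        throw new Error(`bulk-ndjson: action at line ${lineNum} is not an object`)
      }
      const keys = Object.keys(parsed)
      if (keys.length !== 1 || !BULK_ACTIONS.has(keys[0])) {
        throw new Error(`bulk-ndjson: unsupported action at line ${lineNum}`)
      }
      action = trimmed
    } else {
      // Expecting the document line that belongs to the pending action.
      pairs.push(`${action}\n${trimmed}\n`)
      action = undefined
    }
  }
  if (action !== undefined) {
    throw new Error('bulk-ndjson: last action line has no document line')
  }
  return pairs
}
```

Rejecting `delete` and `update` up front means the every-other-line pairing can never drift out of step with the input.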
| return JSON.parse(input.query) | ||
| } | ||
| if (input.query_file != null) { | ||
| const raw = readRawInput(input.query_file) |
If query_file is a - (shorthand for stdin), will this attempt to read a filename literally called - or will it do the right thing?
| * Reuses retry, concurrency, and progress reporting from the main flow; the only difference | ||
| * is that the input is already bulk-shaped, so each pair is sent through verbatim. | ||
| */ | ||
| async function runBulkNdjson (opts: BulkIngestInput, transport: EsClient): Promise<JsonValue> { |
opts.pipeline and opts.routing are ignored in this function. Should they be handled here, or do we expect the bulk-ndjson format to never include those?
| const actionLine = addId | ||
| ? actionPrefix + JSON.stringify(hit._id) + actionSuffix | ||
| : actionPrefix | ||
| write(`${actionLine}\n${JSON.stringify(hit._source)}\n`) |
Every hit will call write, which then calls writeSync, executing a filesystem call for each line. This could be optimized by using a writable file stream, or manually buffering rows and only writing when a byte threshold is exceeded, to significantly speed up dumps with more than a few thousand documents.
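The manual-buffering option mentioned in this comment might look something like the sketch below. `createBufferedWriter` and its `sink` callback are hypothetical names, and string length (UTF-16 code units) is used as a cheap stand-in for a byte count; a Node.js implementation could measure with `Buffer.byteLength` instead.

```typescript
// Collects small writes and forwards them to `sink` in larger chunks.
// `sink` would wrap the real output call (for example a writeSync on the fd).
function createBufferedWriter (sink: (chunk: string) => void, maxChars = 64 * 1024) {
  let parts: string[] = []
  let buffered = 0

  const flush = (): void => {
    if (parts.length === 0) return
    sink(parts.join(''))
    parts = []
    buffered = 0
  }

  const write = (text: string): void => {
    parts.push(text)
    buffered += text.length
    // Only hit the sink once enough data has accumulated.
    if (buffered >= maxChars) flush()
  }

  return { write, flush }
}
```

The caller has to invoke `flush()` once after the last hit so the tail of the dump is not lost.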
- dump: --query-file '-' now reads from stdin instead of trying to open a
file literally named '-'.
- dump: batch per-page hits into a single write to cut writeSync calls
from O(docs) to O(pages); on a multi-million-doc dump this is the
difference between being syscall-bound and network-bound.
- bulk-ndjson: restrict accepted actions to `index` and `create`. The
pair parser assumes every action is followed by a doc line, which is
not true for `delete` (no doc) or `update` (needs `{"doc": ...}`
envelope). The producer this format is designed for (dump) only emits
`index`, so rejecting the others is safer than silently corrupting
input.
- bulk-ndjson: apply --pipeline and --routing as URL query params so
they affect every action in the batch without rewriting pre-formatted
action lines.
- docs: move the long-form dump-and-restore guide to docs/cli/stack/es/
helpers/dump-and-restore.md; README links to it.
Addresses JoshMock review on PR #415.
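To illustrate the `--pipeline` / `--routing` item above: the Elasticsearch `_bulk` endpoint accepts `pipeline` and `routing` as query parameters, so the request path can carry them without touching the pre-formatted action lines. The helper below is a sketch with a hypothetical name (`bulkPath`) and option shape, not the code from this commit.

```typescript
// Builds the request path for a bulk call. Query params apply to every action
// in the batch, so the NDJSON body can be sent through unchanged.
function bulkPath (index: string | undefined, opts: { pipeline?: string, routing?: string }): string {
  const base = index !== undefined ? `/${encodeURIComponent(index)}/_bulk` : '/_bulk'
  const params: string[] = []
  if (opts.pipeline !== undefined) params.push(`pipeline=${encodeURIComponent(opts.pipeline)}`)
  if (opts.routing !== undefined) params.push(`routing=${encodeURIComponent(opts.routing)}`)
  return params.length > 0 ? `${base}?${params.join('&')}` : base
}
```

For example, `bulkPath('logs', { pipeline: 'p1' })` yields `/logs/_bulk?pipeline=p1`, while `bulkPath(undefined, {})` yields `/_bulk` for input whose action lines carry their own `_index`.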
✅ Elastic Docs Style Checker (Vale): No issues found on modified lines.
Follow-up: filed #443 to track the memory profile of The dump side streams (one PIT page at a time, bounded by |
Summary
- `elastic es helpers dump`: exports one or more indices as bulk-format NDJSON (action + `_source` line pairs) using a per-index Point-in-Time + `search_after` sorted by `_shard_doc`, with `--size`, `--keep-alive`, `--output` (file or stdout), `--skip-index-name`, `--add-id`, `--query` (inline JSON), and `--query-file` (file or `-` for stdin).
- `elastic es helpers bulk-ingest` with a `bulk-ndjson` source format that streams pre-formatted action+doc line pairs verbatim into `_bulk`, so `dump` output round-trips through the existing ingester. `--index` is now optional in this mode (action lines may carry `_index`).
- `utils dump` / `utils load`: the use case is capturing a remote index for local debugging.
Why
Round-trip export/import for debugging is a recurring ask. `scroll-search` writes only `_source` lines (one-way, not re-ingestable) and `bulk-ingest` previously generated action lines from scratch, so it couldn't consume already-shaped bulk NDJSON. `dump` produces the right shape and the new `bulk-ndjson` source format closes the loop with no third command.
Design notes
- `search_after` gives consistent reads without leaving long-lived scroll contexts behind, and is what escli-rs uses. PIT is closed in a `finally` block so it doesn't leak on transport errors.
- `bulk-ndjson` as a source format, not a parallel command. Reuses the existing `bulk-ingest` plumbing: `resolveRawInputs` for file/dir/stdin, generalised `splitIntoBatches<T>` with a `sizeOf` callback, plus the same retry/concurrency/progress-reporter helpers.
- `--json` without `--output` is rejected with a clear error, because streamed NDJSON and a stats JSON blob would otherwise collide on stdout.
Example
Test plan
- Tests for `dump` (PIT + search_after, multi-index, query inline/file, `--output` file vs stdout, `--skip-index-name`, `--add-id`, PIT cleanup on error, missing PIT id, empty query file, `--json` requires `--output`), 10 new for `bulk-ndjson` ingest (verbatim `_bulk` body, `/{index}/_bulk` routing, byte-size batching, odd-line / non-bulk-action validation, multi-file dir, `--data-file` + `--data-dir` conflict, empty file, no glob match, schema rejection of missing `--index` for other formats).
- `npm test`: 1462/1462 pass.
- `npm run test:lint`, `npm run build`, `tsc --noEmit` clean.
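As a companion to the design note about a generalised `splitIntoBatches<T>` with a `sizeOf` callback (and the byte-size batching case in the test plan), here is one way such a helper could be written. The signature and the greedy packing strategy are assumptions for illustration; the PR's actual helper may differ.

```typescript
// Greedily packs items into batches whose summed size stays within maxSize.
// `sizeOf` lets the same helper serve documents, pre-formatted pairs, etc.
function splitIntoBatches<T> (items: T[], maxSize: number, sizeOf: (item: T) => number): T[][] {
  const batches: T[][] = []
  let current: T[] = []
  let currentSize = 0
  for (const item of items) {
    const size = sizeOf(item)
    // Start a new batch when adding this item would exceed the limit.
    // An item larger than maxSize still gets a batch of its own.
    if (current.length > 0 && currentSize + size > maxSize) {
      batches.push(current)
      current = []
      currentSize = 0
    }
    current.push(item)
    currentSize += size
  }
  if (current.length > 0) batches.push(current)
  return batches
}
```

With pre-formatted pairs, `sizeOf` would measure the length of the whole `action\ndoc\n` string, so a pair is never split across two `_bulk` requests.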