feat(es): add dump helper and bulk-ndjson restore path #415

Open
MattDevy wants to merge 11 commits into main from claude/admiring-nash-2e349b

Conversation

@MattDevy

Summary

  • Adds elastic es helpers dump — exports one or more indices as bulk-format NDJSON (action + _source line pairs) using a per-index Point-in-Time + search_after sorted by _shard_doc, with --size, --keep-alive, --output (file or stdout), --skip-index-name, --add-id, --query (inline JSON), and --query-file (file or - for stdin).
  • Extends elastic es helpers bulk-ingest with a bulk-ndjson source format that streams pre-formatted action+doc line pairs verbatim into _bulk, so dump output round-trips through the existing ingester. --index is now optional in this mode (action lines may carry _index).
  • Inspired by escli-rs's utils dump / utils load — the use case is capturing a remote index for local debugging.

Why

Round-trip export/import for debugging is a recurring ask. scroll-search writes only _source lines (one-way, not re-ingestable) and bulk-ingest previously generated action lines from scratch, so it couldn't consume already-shaped bulk NDJSON. dump produces the right shape and the new bulk-ndjson source format closes the loop with no third command.
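For illustration, the "right shape" is the `_bulk` pairing of one action line followed by one `_source` line. The sketch below is not the PR's code: `Hit`, `PairOptions`, and `toBulkPair` are hypothetical names, and the default of including `_index` while omitting `_id` is an assumption inferred from the `--skip-index-name` and `--add-id` flags.

```typescript
// Hypothetical shape of a search hit; only the fields used here.
interface Hit {
  _index: string
  _id: string
  _source: Record<string, unknown>
}

// Hypothetical options mirroring --skip-index-name and --add-id.
interface PairOptions {
  skipIndexName?: boolean
  addId?: boolean
}

// Render one hit as a bulk-format pair: an `index` action line, then the
// document's _source on the next line. Each line ends with a newline, as
// the _bulk endpoint expects.
function toBulkPair (hit: Hit, opts: PairOptions = {}): string {
  const meta: Record<string, string> = {}
  if (opts.skipIndexName !== true) meta._index = hit._index
  if (opts.addId === true) meta._id = hit._id
  const action = JSON.stringify({ index: meta })
  return `${action}\n${JSON.stringify(hit._source)}\n`
}
```

With `skipIndexName` set, the action line carries no `_index`, which is what lets the restore side pick a new target with `--index`.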


Design notes

  • PIT, not scroll. PIT + search_after gives consistent reads without leaving long-lived scroll contexts behind, and is what escli-rs uses. PIT is closed in a finally block so it doesn't leak on transport errors.
  • bulk-ndjson as a source format, not a parallel command. Reuses the existing bulk-ingest plumbing: resolveRawInputs for file/dir/stdin, generalised splitIntoBatches<T> with a sizeOf callback, plus the same retry/concurrency/progress-reporter helpers.
  • --json without --output is rejected with a clear error, because streamed NDJSON and a stats JSON blob would otherwise collide on stdout.
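A minimal sketch of the PIT + search_after loop described in the first design note, assuming a hypothetical `PitClient` interface in place of the real transport. The request body fields (`pit`, `sort` on `_shard_doc`, `search_after`) follow the Elasticsearch search API; everything else, including `dumpIndex` and `onPage`, is illustrative. For brevity the sketch reuses the original PIT id and ignores any updated id a real search response may return.

```typescript
// Hypothetical minimal client surface; a real transport would wrap the
// Elasticsearch open-PIT, search, and close-PIT endpoints.
interface PageHit { _source: unknown, sort: unknown[] }
interface PitClient {
  openPit: (index: string, keepAlive: string) => Promise<string>
  search: (body: Record<string, unknown>) => Promise<PageHit[]>
  closePit: (id: string) => Promise<void>
}

// Page through one index and hand each page to `onPage`.
// Returns the number of hits seen.
async function dumpIndex (
  client: PitClient,
  index: string,
  size: number,
  keepAlive: string,
  onPage: (hits: PageHit[]) => void
): Promise<number> {
  const pitId = await client.openPit(index, keepAlive)
  let total = 0
  let searchAfter: unknown[] | undefined
  try {
    while (true) {
      const body: Record<string, unknown> = {
        size,
        pit: { id: pitId, keep_alive: keepAlive },
        sort: [{ _shard_doc: 'asc' }]
      }
      if (searchAfter !== undefined) body.search_after = searchAfter
      const hits = await client.search(body)
      if (hits.length === 0) break
      onPage(hits)
      total += hits.length
      // The sort values of the last hit become the cursor for the next page.
      searchAfter = hits[hits.length - 1].sort
    }
  } finally {
    // Runs on success and on thrown errors, so the PIT is not leaked.
    await client.closePit(pitId)
  }
  return total
}
```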

Example

# Export a remote index, omit _index so it can be re-targeted, write to file
elastic --use-context remote es helpers dump \
  --indices my-prod-idx --skip-index-name \
  --query '{"range":{"@timestamp":{"gte":"now-1h"}}}' \
  --output dump.ndjson

# Re-ingest into a local cluster under a new index name
elastic --use-context local es helpers bulk-ingest \
  --source-format bulk-ndjson --index local-copy --data-file dump.ndjson

Test plan

  • Unit tests: 16 new for dump (PIT + search_after, multi-index, query inline/file, --output file vs stdout, --skip-index-name, --add-id, PIT cleanup on error, missing PIT id, empty query file, --json requires --output), 10 new for bulk-ndjson ingest (verbatim _bulk body, /{index}/_bulk routing, byte-size batching, odd-line / non-bulk-action validation, multi-file dir, --data-file+--data-dir conflict, empty file, no glob match, schema rejection of missing --index for other formats).
  • npm test — 1462/1462 pass.
  • npm run test:lint, npm run build, tsc --noEmit clean.
  • Branch coverage 90.35% (threshold 90%).
  • Manual end-to-end against a real cluster — leaving for reviewer / next pass.

Add `elastic es helpers dump` for exporting indices as bulk-format NDJSON
using PIT + search_after, and extend `bulk-ingest` with a `bulk-ndjson`
source format so the dump output can be streamed back into `_bulk`. Use
case: capture a remote index for local debugging.

Inspired by escli-rs (https://github.com/Anaethelion/escli-rs); ports the
dump/load feature set into the elastic/cli helper conventions.
@github-actions

github-actions Bot commented Jun 16, 2026

MegaLinter analysis: Success

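| Descriptor | Linter | Files | Fixed | Errors | Warnings | Elapsed time |
| --- | --- | --- | --- | --- | --- | --- |
| ✅ COPYPASTE | jscpd | yes | | no | no | 9.7s |
| ✅ REPOSITORY | gitleaks | yes | | no | no | 60.58s |
| ✅ REPOSITORY | git_diff | yes | | no | no | 0.48s |
| ✅ REPOSITORY | secretlint | yes | | no | no | 38.06s |
| ✅ REPOSITORY | trivy | yes | | no | no | 16.98s |
| ✅ TYPESCRIPT | eslint | 5 | | 0 | 0 | 5.22s |
| ✅ YAML | yamllint | 1 | | 0 | 0 | 0.92s |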

MattDevy added 3 commits June 16, 2026 12:35
Supersedes the previous regeneration commit, which ran against a
locally-symlinked node_modules with a stale yaml version. The actual
drift on main was commander 14 -> 15 from #410, which had updated
the lockfile but not NOTICE.txt.
Adds a "Dump and restore an index" subsection under the `es` section
with the round-trip example, a flag reference for `dump`, and a note on
the new `bulk-ingest --source-format bulk-ndjson` mode.

@Anaethelion Anaethelion left a comment

Just a small improvement.

Comment thread src/es/helpers/dump.ts
Track the active PIT id and output fd in mutable refs visible to a
SIGINT/SIGTERM handler that releases both before the process exits with
code 130. The per-index `finally` and the signal handler share the same
refs (and null them eagerly) so the two cleanup paths can't race into a
double-close. Listeners are removed when the handler returns.

Addresses Anaethelion review feedback on PR #415.
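The shared-ref cleanup described in this commit can be pictured with the sketch below. It is not the code from `dump.ts`: `CleanupRefs`, `Releasers`, and `releaseAll` are hypothetical names, and the release callbacks are synchronous here for simplicity, whereas closing a PIT is a network call in practice. The point it shows is the take-and-null order: each ref is cleared before its resource is released, so whichever path runs second finds `null` and does nothing.

```typescript
// Mutable refs shared by the per-index `finally` path and the signal path.
interface CleanupRefs {
  pitId: string | null
  fd: number | null
}

// Hypothetical release callbacks standing in for "close PIT" and "close fd".
interface Releasers {
  closePit: (id: string) => void
  closeFd: (fd: number) => void
}

// Safe to call from both cleanup paths: the refs are nulled eagerly,
// so a second call cannot release the same resource twice.
function releaseAll (refs: CleanupRefs, releasers: Releasers): void {
  const pitId = refs.pitId
  refs.pitId = null
  if (pitId !== null) releasers.closePit(pitId)

  const fd = refs.fd
  refs.fd = null
  if (fd !== null) releasers.closeFd(fd)
}
```

In a Node CLI, a handler registered for `SIGINT` and `SIGTERM` could call a function like this and then exit with code 130, while the `finally` block calls the same function on the normal path.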
@MattDevy MattDevy requested a review from Anaethelion June 17, 2026 12:43
Anaethelion previously approved these changes Jun 17, 2026
@Anaethelion Anaethelion dismissed their stale review June 17, 2026 16:20

Found something to address after approval

@JoshMock JoshMock left a comment

A few nits, and a few larger concerns. Overall looks awesome!

Regardless, I'm on PTO after today so don't wait to merge on my account. Worst case scenario, we have to make a second pass for some of my proposed improvements. 🖤

Comment thread README.md Outdated

Run `elastic es <command> --help` for all available options on any command.

##### Dump and restore an index

Can you duplicate or move these docs into the ./docs/ directory as a guide? Probably worth adding similar guides for the other helpers as well, honestly, but we can handle that in a separate PR.

} catch (err) {
  throw new Error(`bulk-ndjson: invalid action line at line ${lineNum}: ${err instanceof Error ? err.message : String(err)}`, { cause: err })
}
if (parsed == null || typeof parsed !== 'object' || Array.isArray(parsed)) {

A nit, probably for later, but it would be a nice improvement to standardize on using node:assert for all input validation of this sort (basically anywhere Zod or JSON schemas are not used for validation).

Comment on lines +114 to +123
const pairs: string[] = []
let action: string | undefined
let lineNum = 0
let nonEmptyCount = 0

for (const line of raw.split('\n')) {
  lineNum++
  const trimmed = line.trim()
  if (trimmed.length === 0) continue
  nonEmptyCount++

The "every other line" loop scheme appears to overlook the fact that a delete action won't have a paired document on the following line. Also, update actions require documents to be wrapped in a {"doc": ...} envelope.

Since this ingest path will always consume the output of a dump command, maybe BULK_ACTIONS needs to just be reduced to index and create, or just index?
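To make the concern concrete, here is a rough sketch of a pair parser that only accepts `index` and `create`, so every action line is guaranteed to have a document line after it. It is an illustration, not the parser in this PR: `ALLOWED_ACTIONS` and `parseBulkPairs` are hypothetical names, and the error messages are made up.

```typescript
// Actions that are always followed by exactly one document line.
const ALLOWED_ACTIONS = new Set(['index', 'create'])

// Split raw NDJSON into verbatim "action\ndoc\n" pairs, rejecting anything
// that would break the every-other-line assumption.
function parseBulkPairs (raw: string): string[] {
  const pairs: string[] = []
  let action: string | undefined
  let lineNum = 0

  for (const line of raw.split('\n')) {
    lineNum++
    const trimmed = line.trim()
    if (trimmed.length === 0) continue

    if (action === undefined) {
      let parsed: unknown
      try {
        parsed = JSON.parse(trimmed)
      } catch {
        throw new Error(`invalid action line at line ${lineNum}`)
      }
      if (parsed == null || typeof parsed !== 'object' || Array.isArray(parsed)) {
        throw new Error(`action at line ${lineNum} is not a JSON object`)
      }
      const keys = Object.keys(parsed)
      if (keys.length !== 1 || !ALLOWED_ACTIONS.has(keys[0])) {
        // delete has no document line; update needs a {"doc": ...} envelope.
        throw new Error(`unsupported bulk action at line ${lineNum}`)
      }
      action = trimmed
    } else {
      pairs.push(`${action}\n${trimmed}\n`)
      action = undefined
    }
  }

  if (action !== undefined) {
    throw new Error('action line without a document line at end of input')
  }
  return pairs
}
```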

Comment thread src/es/helpers/dump.ts Outdated
  return JSON.parse(input.query)
}
if (input.query_file != null) {
  const raw = readRawInput(input.query_file)

If query_file is a - (shorthand for stdin), will this attempt to read a filename literally called - or will it do the right thing?

* Reuses retry, concurrency, and progress reporting from the main flow; the only difference
* is that the input is already bulk-shaped, so each pair is sent through verbatim.
*/
async function runBulkNdjson (opts: BulkIngestInput, transport: EsClient): Promise<JsonValue> {

opts.pipeline and opts.routing are ignored in this function. Should they be handled here, or do we expect the bulk-ndjson format to never include those?

Comment thread src/es/helpers/dump.ts Outdated
const actionLine = addId
  ? actionPrefix + JSON.stringify(hit._id) + actionSuffix
  : actionPrefix
write(`${actionLine}\n${JSON.stringify(hit._source)}\n`)

Every hit will call write, which then calls writeSync, executing a filesystem call for each line. This could be optimized by using a writable file stream, or manually buffering rows and only writing when a byte threshold is exceeded, to significantly speed up dumps with more than a few thousand documents.
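As a sketch of the second option (manual buffering with a byte threshold), something along these lines would collapse many small writes into a few large ones. `Sink`, `BufferedWriter`, and `utf8Length` are hypothetical names; in a Node CLI the sink might wrap `fs.writeSync`, which is left out here so the sketch has no runtime dependencies.

```typescript
// Hypothetical sink; receives one large chunk per flush.
type Sink = (chunk: string) => void

// Count UTF-8 bytes without relying on Node-specific APIs.
function utf8Length (text: string): number {
  let bytes = 0
  for (const ch of text) {
    const cp = ch.codePointAt(0) as number
    bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4
  }
  return bytes
}

class BufferedWriter {
  private parts: string[] = []
  private bytes = 0
  private readonly sink: Sink
  private readonly thresholdBytes: number

  constructor (sink: Sink, thresholdBytes: number) {
    this.sink = sink
    this.thresholdBytes = thresholdBytes
  }

  // Queue text and flush once the buffered size reaches the threshold.
  write (text: string): void {
    this.parts.push(text)
    this.bytes += utf8Length(text)
    if (this.bytes >= this.thresholdBytes) this.flush()
  }

  // Emit everything buffered so far as a single chunk. Must be called once
  // at the end so the tail of the dump is not lost.
  flush (): void {
    if (this.parts.length === 0) return
    this.sink(this.parts.join(''))
    this.parts = []
    this.bytes = 0
  }
}
```

Batching per search page, as an alternative, gets a similar effect with less code, at the cost of tying the write size to `--size`.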

MattDevy added 2 commits June 19, 2026 12:51
- dump: --query-file '-' now reads from stdin instead of trying to open a
  file literally named '-'.
- dump: batch per-page hits into a single write to cut writeSync calls
  from O(docs) to O(pages); on a multi-million-doc dump this is the
  difference between being syscall-bound and being network-bound.
- bulk-ndjson: restrict accepted actions to `index` and `create`. The
  pair parser assumes every action is followed by a doc line, which is
  not true for `delete` (no doc) or `update` (needs `{"doc": ...}`
  envelope). The producer this format is designed for (dump) only emits
  `index`, so rejecting the others is safer than silently corrupting
  input.
- bulk-ndjson: apply --pipeline and --routing as URL query params so
  they affect every action in the batch without rewriting pre-formatted
  action lines.
- docs: move the long-form dump-and-restore guide to docs/cli/stack/es/
  helpers/dump-and-restore.md; README links to it.

Addresses JoshMock review on PR #415.
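The `--pipeline` / `--routing` change can be sketched as building the request path, since `_bulk` accepts both as URL query parameters that apply to every action in the request unless an action line overrides them. `BulkTarget` and `bulkPath` are hypothetical names used only for this illustration.

```typescript
// Hypothetical subset of the bulk-ingest options that affect the URL.
interface BulkTarget {
  index?: string
  pipeline?: string
  routing?: string
}

// Build the _bulk path, leaving the pre-formatted action lines untouched.
function bulkPath (target: BulkTarget): string {
  const base = target.index != null
    ? `/${encodeURIComponent(target.index)}/_bulk`
    : '/_bulk'
  const params: string[] = []
  if (target.pipeline != null) params.push(`pipeline=${encodeURIComponent(target.pipeline)}`)
  if (target.routing != null) params.push(`routing=${encodeURIComponent(target.routing)}`)
  return params.length > 0 ? `${base}?${params.join('&')}` : base
}
```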
@github-actions

github-actions Bot commented Jun 19, 2026

🔍 Preview links for changed docs

@github-actions

github-actions Bot commented Jun 19, 2026

✅ Elastic Docs Style Checker (Vale)

No issues found on modified lines!


The Vale linter checks documentation changes against the Elastic Docs style guide. To use Vale locally or report issues, refer to Elastic style guide for Vale.

@MattDevy

Follow-up: filed #443 to track the memory profile of bulk-ingest.

The dump side streams (one PIT page at a time, bounded by --size), but the ingest side reads the full input into memory before flushing the first batch. This applies to all source formats, including the new bulk-ndjson, not just my addition. Large dumps (100 MB+) inflate memory use; multi-GB dumps run out of memory. The fix needs a readline-based streaming reader for every input source and a swap to csv-parse (streaming) for CSV, so it's broader than this PR.
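Roughly, the streaming shape could look like the sketch below: consume lines from any `AsyncIterable<string>` and yield size-bounded batches as they fill, so memory is bounded by the batch size and not by the input size. `batchLines` is a hypothetical name and is not what #443 will necessarily implement. It batches individual lines; a bulk-ndjson version would need to treat each action+doc pair as one unit.

```typescript
// Group lines into batches whose total size stays at or under maxBytes.
// A single line larger than maxBytes is emitted as its own batch.
async function * batchLines (
  lines: AsyncIterable<string>,
  maxBytes: number,
  sizeOf: (line: string) => number
): AsyncGenerator<string[]> {
  let batch: string[] = []
  let bytes = 0
  for await (const line of lines) {
    if (line.trim().length === 0) continue
    const n = sizeOf(line)
    // Flush before adding a line that would push the batch over the limit.
    if (batch.length > 0 && bytes + n > maxBytes) {
      yield batch
      batch = []
      bytes = 0
    }
    batch.push(line)
    bytes += n
  }
  if (batch.length > 0) yield batch
}
```

In Node, the interface returned by `readline.createInterface` over a file read stream is async-iterable line by line, so it could serve as the `lines` argument.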
