fix: Bench crate compilation + criterion parser #19

Merged: strawgate merged 5 commits into master from worktree-grok-regex-udfs on Mar 29, 2026
@strawgate
Owner

Summary

  • Fix StdoutSink/StdoutFormat not being publicly re-exported from logfwd-output (Rust 2024 edition glob re-export visibility change)
  • Replace them in the bench with NullSink, which is already defined there
  • Include the criterion output parser fix from PR #17 (fix: Parse criterion output in bench workflow)

Test plan

  • cargo check -p logfwd-bench --benches compiles
  • CI green
  • Manual workflow dispatch produces populated issue

🤖 Generated with Claude Code

strawgate and others added 5 commits March 29, 2026 02:41
Add two new DataFusion scalar UDFs for structured log parsing:

- regexp_extract(string, pattern, group_index): Spark-compatible regex
  extraction returning capture group at given index (0=full match, 1+=groups)
- grok(string, pattern): Logstash-style grok pattern parsing returning a
  Struct with one field per named capture (%{PATTERN:name} syntax)

Grok includes 25+ built-in patterns (IP, WORD, NUMBER, TIMESTAMP_ISO8601,
LOGLEVEL, etc). Both UDFs are registered automatically in SqlTransform.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
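
To illustrate the `%{PATTERN:name}` syntax described in the commit above, here is a minimal sketch of how a grok expression might be expanded into a regex with named capture groups. It uses only the standard library. The function names `builtin_patterns` and `expand_grok` and the three-entry pattern table are hypothetical, and the regex bodies are illustrative rather than the ones shipped with the UDF.

```rust
use std::collections::HashMap;

/// Hypothetical pattern table. The commit mentions 25+ built-ins;
/// these three entries and their regex bodies are illustrative only.
fn builtin_patterns() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("WORD", r"\b\w+\b"),
        ("NUMBER", r"[+-]?\d+(?:\.\d+)?"),
        ("LOGLEVEL", r"(?i:trace|debug|info|warn|error|fatal)"),
    ])
}

/// Expand `%{PATTERN:name}` into `(?P<name>...)` and a bare `%{PATTERN}`
/// into a non-capturing `(?:...)`. Text outside `%{...}` is copied as-is.
fn expand_grok(pattern: &str) -> Result<String, String> {
    let table = builtin_patterns();
    let mut out = String::new();
    let mut rest = pattern;

    while let Some(start) = rest.find("%{") {
        out.push_str(&rest[..start]);
        let after = &rest[start + 2..];
        let end = after
            .find('}')
            .ok_or_else(|| "unterminated %{ in grok pattern".to_string())?;
        let token = &after[..end];

        // Split "PATTERN:name" into its two halves; the name is optional.
        let (pattern_name, field) = match token.split_once(':') {
            Some((p, f)) => (p, Some(f)),
            None => (token, None),
        };
        let regex = table
            .get(pattern_name)
            .ok_or_else(|| format!("unknown grok pattern: {pattern_name}"))?;

        match field {
            Some(f) => out.push_str(&format!("(?P<{f}>{regex})")),
            None => out.push_str(&format!("(?:{regex})")),
        }
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

Under these assumptions, `expand_grok("%{LOGLEVEL:level} took %{NUMBER:ms}ms")` yields a regex with two named groups, `level` and `ms`, which a UDF could then map to the fields of the returned Struct.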
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New `logfwd-bench` crate with criterion benchmarks covering the full
pipeline: scanner (all fields + pushdown), CRI parse/reassemble,
DataFusion transforms (passthrough, filter, projection, regexp_extract,
grok), zstd compression, output sinks, and end-to-end pipelines.

Nightly GitHub Actions workflow runs benchmarks on master, parses
results into a markdown table, and posts as a GitHub issue (closing
the previous one to avoid clutter).

Also fixes clippy warnings in UDF code (Default impls, collapsible ifs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Criterion doesn't support --output-format bencher. Parse its native
format instead (bench name on own line, followed by time/thrpt lines).
Also capture stderr since criterion writes there.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
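
The parsing approach described in the commit above (bench name, then `time:` and `thrpt:` lines) can be sketched as follows. This is not the workflow's actual parser, whose language and details are not shown here. The names `BenchResult`, `middle_estimate`, and `parse_criterion` are hypothetical, and the sketch assumes each `time:` or `thrpt:` line holds three bracketed `value unit` pairs, with the middle pair being the point estimate.

```rust
/// One parsed benchmark row (hypothetical shape for the markdown table).
#[derive(Debug, PartialEq)]
struct BenchResult {
    name: String,
    time: String,
    throughput: Option<String>,
}

/// Pull the middle `value unit` pair out of `[lo unit mid unit hi unit]`.
/// Returns None for lines with a different shape, such as percentage changes.
fn middle_estimate(line: &str) -> Option<String> {
    let inner = line.split_once('[')?.1.split_once(']')?.0;
    let parts: Vec<&str> = inner.split_whitespace().collect();
    if parts.len() == 6 {
        Some(format!("{} {}", parts[2], parts[3]))
    } else {
        None
    }
}

/// Parse criterion's native text output into rows.
fn parse_criterion(output: &str) -> Vec<BenchResult> {
    let mut results = Vec::new();
    let mut pending_name: Option<String> = None;

    for line in output.lines() {
        let trimmed = line.trim();
        if trimmed.is_empty() || trimmed.starts_with("Benchmarking") {
            continue; // progress lines carry no results
        }
        if let Some((before, _)) = trimmed.split_once("time:") {
            // Short names share the line with `time:`; long names sit on
            // the previous, unindented line.
            let name = if before.trim().is_empty() {
                pending_name.take()
            } else {
                Some(before.trim().to_string())
            };
            if let (Some(name), Some(time)) = (name, middle_estimate(trimmed)) {
                results.push(BenchResult { name, time, throughput: None });
            }
        } else if trimmed.starts_with("thrpt:") {
            // Attach throughput to the most recent row, without overwriting.
            if let (Some(last), Some(t)) = (results.last_mut(), middle_estimate(trimmed)) {
                if last.throughput.is_none() {
                    last.throughput = Some(t);
                }
            }
        } else if !line.starts_with(char::is_whitespace) {
            pending_name = Some(trimmed.to_string());
        }
    }
    results
}
```

Because criterion writes its report to stderr in some configurations, a caller would feed this function the combined stdout and stderr text, as the commit notes.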
These types aren't re-exported publicly from logfwd-output in Rust 2024
edition. Replace with NullSink which already exists in the bench.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
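
For readers unfamiliar with the idea, a null sink in a benchmark is a sink that accepts every batch and discards it, so the measurement reflects the pipeline rather than terminal or file I/O. The sketch below is an assumption-laden illustration: the `Sink` trait and its `send` signature are hypothetical stand-ins, not the real `logfwd-output` API, and the bench's own `NullSink` may look different.

```rust
use std::io;

/// Hypothetical stand-in for an output trait; the real trait in
/// `logfwd-output` is not shown in this PR and may differ.
trait Sink {
    fn send(&mut self, batch: &[u8]) -> io::Result<()>;
}

/// Discards every batch, keeping only cheap counters.
#[derive(Default)]
struct NullSink {
    bytes_seen: u64,
    batches: u64,
}

impl Sink for NullSink {
    fn send(&mut self, batch: &[u8]) -> io::Result<()> {
        // black_box keeps the optimizer from eliding the work that
        // produced `batch` just because nothing consumes it.
        self.bytes_seen += std::hint::black_box(batch).len() as u64;
        self.batches += 1;
        Ok(())
    }
}
```

Compared with a stdout-backed sink, this keeps benchmark output clean and removes a dependency on types the output crate does not export publicly.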
@strawgate strawgate merged commit 6f7ba87 into master Mar 29, 2026
1 check failed
@strawgate strawgate deleted the worktree-grok-regex-udfs branch March 29, 2026 07:45
strawgate added a commit that referenced this pull request Apr 11, 2026
Systematic audit found 30 potential issues from agent-authored code.
After parallel verification by 11 subagents, 13 were confirmed real,
8 were by-design, 5 were won't-fix, and 5 were false positives.
This commit fixes all 12 actionable confirmed findings.

**High severity:**
- Fix OtherStr panic: OTLP sink crashed on non-string attribute types
  (e.g., hash() UDF returning UInt64). Replaced unreachable!() with
  array_value_to_string(). Removed dead str_value() function. (#7)
- Fix silent struct drop: non-conflict Struct columns now log a warning
  before being skipped, matching the resource struct behavior. (#6)

**Medium severity:**
- Fix scanner contract drift: SCANNER_CONTRACT.md said "no escape
  decoding" but implementation decodes since PR #885. Updated doc. (#19)
- Deduplicate calendar math: made core's Kani-verified days_from_civil
  public; arrow's wrapper now delegates instead of reimplementing. (#21)
- Centralize metadata keys: added METADATA_RESOURCE_KEY and
  METADATA_RESOURCE_PREFIX constants to field_names.rs, replacing 15
  bare string literals across 4 files / 3 crates. (#15)
- Add TypedColumn::Bytes variant: OTAP bytes attributes now round-trip
  as BinaryArray instead of being hex-encoded to strings. (#16)
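
For context on the calendar-math item above, a `days_from_civil` function conventionally maps a proleptic Gregorian date to the number of days since 1970-01-01. The sketch below follows the widely used era-based algorithm published by Howard Hinnant. It is an assumption that the core crate's Kani-verified version has the same shape, and the signature here is hypothetical.

```rust
/// Days since 1970-01-01 for a proleptic Gregorian date.
/// Assumes `m` is in 1..=12 and `d` is a valid day for that month.
/// Illustrative only; not the repository's verified implementation.
pub fn days_from_civil(y: i64, m: u32, d: u32) -> i64 {
    // Treat March as the first month so the leap day falls at year end.
    let y = if m <= 2 { y - 1 } else { y };
    // A 400-year era always contains exactly 146097 days.
    let era = (if y >= 0 { y } else { y - 399 }) / 400;
    let yoe = y - era * 400; // year of era, 0..=399
    let mp = (m as i64 + 9) % 12; // March = 0, ..., February = 11
    let doy = (153 * mp + 2) / 5 + d as i64 - 1; // day of shifted year
    let doe = yoe * 365 + yoe / 4 - yoe / 100 + doy; // day of era
    era * 146_097 + doe - 719_468 // shift the epoch to 1970-01-01
}
```

Having one public function that a wrapper delegates to, as the commit describes, means a proof or test suite over this arithmetic only has to be maintained in one place.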

**Low severity:**
- Deduplicate WELL_KNOWN arrays: star_schema.rs now delegates to
  field_names::matches_any() instead of maintaining a local copy. Added
  logfwd-types dependency to logfwd-arrow. (#13)
- Centralize _raw column name: added field_names::RAW constant. (#12)
- Extract MAX_REQUEST_BODY_SIZE: shared constant in receiver_http.rs
  replaces 3 independent definitions. (#27)
- Import DEFAULT_RETRY_AFTER_SECS: otap_sink and arrow_ipc_sink now
  import from http_classify instead of redefining. (#29)
- Name timing defaults: pipeline build.rs and input_build.rs now use
  named constants instead of inline unwrap_or literals. (#30)
- Add timestamp diagnostic: tracing::debug!() on timestamp parse
  fallback for operator visibility. (#1)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>