Replace page classifier with dit, add -fpt flag#2404
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughReplaces the internal Naive Bayes page classifier with the external Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as CLI / Flags
participant Runner as Runner
participant Dit as dit.Classifier
participant KB as KnowledgeBase
CLI->>Runner: start enumeration (flags include -fpt / outputs)
Runner->>Dit: ExtractPageType(headlessBody, body)
alt dit available
Dit-->>Runner: {PageType, Forms}
Runner->>KB: populate entry with PageType, Forms, pHash
else classifier nil or error
Dit-->>Runner: error / nil
Runner->>KB: populate entry with pHash (no PageType/Forms)
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
runner/runner.go (1)
717-722:⚠️ Potential issue | 🟡 MinorFix unchecked error return from
finput.Close().The pipeline failure indicates that the error return value of
finput.Closeis not checked. This violates theerrchecklinter rule.🐛 Proposed fix
defer finput.Close() + defer func() { + if err := finput.Close(); err != nil { + gologger.Warning().Msgf("Could not close input file '%s': %s\n", r.options.InputFile, err) + } + }() - defer finput.Close()Alternatively, since
deferis already used, you can suppress the linter for this specific case if the error is intentionally ignored:- defer finput.Close() + defer finput.Close() //nolint:errcheck // best-effort close
🤖 Fix all issues with AI agents
In `@README.md`:
- Line 65: Update the Go version note for the httpx installation: replace or
make the `>=1.25.0` requirement in the README (the string "`>=1.25.0`" that
currently references Go for `httpx`) conditional or placeholder until the
correct Go version is finalized, and ensure consistency with the `go.mod` module
declaration (`go` directive) by updating both the README text and the `go.mod`
`go` version to the final approved Go version once determined.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@runner/runner.go`:
- Around line 655-669: The PageType value stored in Runner.classifyPage may be a
non-string type and later code does resp.KnowledgeBase["PageType"].(string),
causing a panic; change the insertion so PageType is coerced to a stable string
(e.g., use fmt.Sprint or fmt.Sprintf("%v", result.Type)) when setting
kb["PageType"] = ..., keep the existing nil/empty handling and ensure Forms
logic is unchanged so downstream filters can safely type-assert to string.
🧹 Nitpick comments (2)
runner/runner.go (2)
433-438: Initialization fallback is OK, but consider gating/noise + explicit nil on error.Proceeding with a nil classifier is fine, but:
- This warning will show even when page-type classification is irrelevant to the user’s chosen flags. Consider initializing
ditonly when-json/-csv/-fpt(or similar) is enabled to reduce surprise/noise.- Minor clarity: explicitly set
runner.ditClassifier = nilon error (so future refactors don’t accidentally use a partially-initialized value).
2639-2640: Consider preferringheadlessBody(when available) for classification accuracy.Right now KnowledgeBase classification uses
respDataeven whenscanopts.ScreenshotproducedheadlessBody. If the goal is better login/captcha/etc detection and form extraction on JS-rendered pages, you may want:
classifyPage(headlessBody, pHash)whenheadlessBody != ""- else fall back to
respDataThis is not strictly required for correctness, but it likely improves real-world detection rates.
f43056f to
9ba4754
Compare
dwisiswant0
left a comment
There was a problem hiding this comment.
Any deprecation process should be planned with a clear roadmap and timeline. Pulling support abruptly without giving users enough time to prepare/migrate is just going to cause UX frustration and/or unnecessary (downstream) breakage.
9ba4754 to
41696ba
Compare
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
runner/runner.go (1)
1130-1137: Consider renaminglogFilteredErrorPagetologFilteredPage.The function
logFilteredErrorPageis now used for all filtered page types (login, captcha, parked, etc.), not just error pages. The current name is misleading given the broader-fptfunctionality.♻️ Suggested rename
- logFilteredErrorPage(r.options.OutputFilterErrorPagePath, resp.URL) + logFilteredPage(r.options.OutputFilterErrorPagePath, resp.URL)Also rename the function definition at line 1564 and consider renaming
OutputFilterErrorPagePathtoOutputFilterPagePathfor consistency.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@runner/runner.go` around lines 1130 - 1137, The helper name logFilteredErrorPage is misleading because it's used for all filtered page types; rename the function to logFilteredPage (and update its definition) and rename the option OutputFilterErrorPagePath to OutputFilterPagePath on the options struct; then update every call site (e.g., the code that currently calls logFilteredErrorPage(r.options.OutputFilterErrorPagePath, resp.URL)) to call logFilteredPage(r.options.OutputFilterPagePath, resp.URL) so names are consistent across the function, its calls, and the option field.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@runner/runner.go`:
- Around line 1130-1137: The helper name logFilteredErrorPage is misleading
because it's used for all filtered page types; rename the function to
logFilteredPage (and update its definition) and rename the option
OutputFilterErrorPagePath to OutputFilterPagePath on the options struct; then
update every call site (e.g., the code that currently calls
logFilteredErrorPage(r.options.OutputFilterErrorPagePath, resp.URL)) to call
logFilteredPage(r.options.OutputFilterPagePath, resp.URL) so names are
consistent across the function, its calls, and the option field.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (5)
.github/workflows/build-test.ymlis excluded by!**/*.yml.github/workflows/functional-test.ymlis excluded by!**/*.yml.github/workflows/release-binary.ymlis excluded by!**/*.yml.github/workflows/release-test.ymlis excluded by!**/*.ymlgo.sumis excluded by!**/*.sum
📒 Files selected for processing (9)
DockerfileREADME.mdcommon/pagetypeclassifier/clf.gobcommon/pagetypeclassifier/dataset.txtcommon/pagetypeclassifier/pagetypeclassifier.gocommon/pagetypeclassifier/pagetypeclassifier_test.gogo.modrunner/options.gorunner/runner.go
💤 Files with no reviewable changes (3)
- common/pagetypeclassifier/pagetypeclassifier.go
- common/pagetypeclassifier/dataset.txt
- common/pagetypeclassifier/pagetypeclassifier_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
- runner/options.go
- README.md
- Dockerfile
- Add Deprecated comment to OutputFilterErrorPage field - Coerce PageType to string for safe type assertion
1e193fc to
0e3ceee
Compare
Proposed changes
Replace built-in Naive Bayes page classifier with dit (20 page types, 8 form types, 79 field types). Add
-fpt/-filter-page-typeflag for filtering by any page type(s). Deprecate-fepas alias for-fpt error.common/pagetypeclassifier/withdit.Classifier-fptflag (e.g.-fpt login,captcha,parked)-fepwith info messageFormswith form type and field classificationsCloses #2403
Proof
httpx -u https://github.com/login -json— KnowledgeBase showsPageType: login+ Forms-fpt loginfilters login pages,-fpt errorfilters error pages-fpt login,errorfilters multiple types, case-insensitive-fepbackward compat filters error pages + shows deprecation messagego build ./...andgo test ./...passChecklist
Summary by CodeRabbit
Chores
New Features
Deprecations
Removals
Documentation