Replace page classifier with dit, add -fpt flag #2404

Merged: Mzack9999 merged 10 commits into dev from feat/dit-page-classifier on Mar 4, 2026

Conversation

@dogancanbakir dogancanbakir commented Feb 16, 2026

Proposed changes

Replace the built-in Naive Bayes page classifier with dit (20 page types, 8 form types, 79 field types). Add the -fpt/-filter-page-type flag for filtering by one or more page types. Deprecate -fep, keeping it as an alias for -fpt error.

  • Replace common/pagetypeclassifier/ with dit.Classifier
  • Add -fpt flag (e.g. -fpt login,captcha,parked)
  • Deprecate -fep with info message
  • KnowledgeBase now includes Forms with form type and field classifications
  • Bump Go to 1.25.7, update CI/CD workflows and Dockerfile

Closes #2403

Proof

  • httpx -u https://github.com/login -json — KnowledgeBase shows PageType: login + Forms
  • -fpt login filters login pages, -fpt error filters error pages
  • -fpt login,error filters multiple types, case-insensitive
  • -fep backward compat filters error pages + shows deprecation message
  • go build ./... and go test ./... pass
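
The filtering behaviour listed above (multiple types, case-insensitive) can be sketched roughly as follows. `isFilteredPageType` is an assumed name for illustration; the actual check in runner/runner.go may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// isFilteredPageType reports whether pageType matches any requested filter,
// ignoring case. An empty pageType (page was not classified) never matches.
// Hypothetical helper, not taken from the PR.
func isFilteredPageType(pageType string, filters []string) bool {
	if pageType == "" {
		return false
	}
	for _, f := range filters {
		if strings.EqualFold(strings.TrimSpace(f), pageType) {
			return true
		}
	}
	return false
}

func main() {
	filters := []string{"login", "ERROR"}
	for _, pt := range []string{"Login", "error", "parked", ""} {
		fmt.Printf("%q filtered=%v\n", pt, isFilteredPageType(pt, filters))
	}
}
```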

Checklist

  • Pull request is created against the dev branch
  • All checks passed (lint, unit/integration/regression tests etc.) with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)

Summary by CodeRabbit

  • Chores

    • Bumped Go toolchain to 1.25.7 and updated base builder image; refreshed dependency set.
  • New Features

    • Added -fpt / --filter-page-type flag to filter output by page type.
  • Deprecations

    • Deprecated -fep / --filter-error-page; retained for backward compatibility with a deprecation notice.
  • Removals

    • Removed the previous page-type classifier, its dataset, and associated tests.
  • Documentation

    • README updated with new flag examples and Go requirement.

@auto-assign auto-assign bot requested a review from dwisiswant0 February 16, 2026 09:48
coderabbitai bot commented Feb 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Replaces the internal Naive Bayes page classifier with the external dit classifier; removes common/pagetypeclassifier files and dataset; adds --filter-page-type (-fpt) flag and deprecates -fep; updates runner to call dit.ExtractPageType (includes Forms in KnowledgeBase) and bumps Go/tooling versions.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Build & Toolchain: Dockerfile, go.mod | Bump Go builder image and go directive to 1.25.7; add github.com/happyhackingspace/dit v0.0.9; update/remove several indirect deps. |
| Documentation: README.md | Raise Go requirement to >=1.25.0; add --filter-page-type/-fpt flag docs and example; mark -fep as a deprecated alias pointing to -fpt. |
| Removed classifier package: common/pagetypeclassifier/... (common/pagetypeclassifier/dataset.txt, common/pagetypeclassifier/pagetypeclassifier.go, common/pagetypeclassifier/pagetypeclassifier_test.go) | Delete old Naive Bayes classifier, dataset, and tests; remove PageTypeClassifier, New(), Classify(), HTML-to-text helpers, and related tests. |
| CLI Flags / Options: runner/options.go | Add OutputFilterPageType goflags.StringSlice and --filter-page-type/-fpt; keep -fep as deprecated alias that maps to page type error when used. |
| Runner / Classifier integration: runner/runner.go | Replace pagetypeclassifier import/field with dit (ditClassifier); conditionally initialize classifier (warn on failure); add classifyPage(headlessBody, body, pHash) which calls dit.ExtractPageType and populates KnowledgeBase with PageType and Forms. |
| Healthcheck minor change: runner/healthcheck.go | Switched from test.WriteString(fmt.Sprintf(...)) to fmt.Fprintf(&test, ...) for health check output formatting only. |
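
For readers unfamiliar with the healthcheck change, the following sketch shows the general pattern: `*strings.Builder` implements `io.Writer`, so `fmt.Fprintf` can write into it directly and the intermediate string from `fmt.Sprintf` is avoided. The `report` function and its labels are illustrative and not copied from runner/healthcheck.go.

```go
package main

import (
	"fmt"
	"strings"
)

// report builds a small text report. Illustrative only; the labels are
// assumptions rather than the actual health check output.
func report(goos, arch string) string {
	var b strings.Builder
	// Before: b.WriteString(fmt.Sprintf("Operating System: %s\n", goos))
	// After: format straight into the builder.
	fmt.Fprintf(&b, "Operating System: %s\n", goos)
	fmt.Fprintf(&b, "Architecture: %s\n", arch)
	return b.String()
}

func main() {
	fmt.Print(report("linux", "amd64"))
}
```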

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI / Flags
    participant Runner as Runner
    participant Dit as dit.Classifier
    participant KB as KnowledgeBase

    CLI->>Runner: start enumeration (flags include -fpt / outputs)
    Runner->>Dit: ExtractPageType(headlessBody, body)
    alt dit available
        Dit-->>Runner: {PageType, Forms}
        Runner->>KB: populate entry with PageType, Forms, pHash
    else classifier nil or error
        Dit-->>Runner: error / nil
        Runner->>KB: populate entry with pHash (no PageType/Forms)
    end
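
The two branches of the diagram can be sketched in Go as shown here. This is a sketch under stated assumptions: `pageClassifier`, `pageResult`, and the `Classify` method are hypothetical stand-ins and do not reflect dit's real types or signatures; only the fallback behaviour (always keep pHash, add PageType and Forms when classification succeeds) is taken from the walkthrough.

```go
package main

import (
	"errors"
	"fmt"
)

// pageResult and pageClassifier are hypothetical stand-ins, NOT dit's API.
type pageResult struct {
	Type  string
	Forms []string
}

type pageClassifier interface {
	Classify(html string) (*pageResult, error)
}

// classifyPage always records pHash and adds PageType/Forms only when a
// classifier is available and returns a usable result.
func classifyPage(c pageClassifier, html string, pHash uint64) map[string]interface{} {
	kb := map[string]interface{}{"pHash": pHash}
	if c == nil {
		return kb // classifier failed to initialize earlier
	}
	res, err := c.Classify(html)
	if err != nil || res == nil || res.Type == "" {
		return kb
	}
	kb["PageType"] = res.Type
	if len(res.Forms) > 0 {
		kb["Forms"] = res.Forms
	}
	return kb
}

// stubClassifier lets the sketch run without any external dependency.
type stubClassifier struct{ fail bool }

func (s stubClassifier) Classify(html string) (*pageResult, error) {
	if s.fail {
		return nil, errors.New("classification failed")
	}
	return &pageResult{Type: "login", Forms: []string{"login"}}, nil
}

func main() {
	fmt.Println(classifyPage(stubClassifier{}, "<form></form>", 42))
	fmt.Println(classifyPage(nil, "<form></form>", 42))
}
```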

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I swapped my Bayes for shiny new Dit,
Twenty types now—what a clever fit!
Flags renamed, the old one waves goodbye,
Forms join the knowledge, under the sky. 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main changes: replacing the page classifier with dit and adding the -fpt flag, which are the core objectives. |
| Linked Issues check | ✅ Passed | The PR implements all main coding requirements: replaces pagetypeclassifier with dit, adds -fpt flag with deprecation of -fep, removes the old package, and extends KnowledgeBase with Forms data. |
| Out of Scope Changes check | ✅ Passed | All changes are in scope: Go version bumps (1.24.5→1.25.7), dependency updates (dit, removal of html-to-markdown), Dockerfile updates, and healthcheck refactoring align with upgrading and replacing components. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
runner/runner.go (1)

717-722: ⚠️ Potential issue | 🟡 Minor

Fix unchecked error return from finput.Close().

The pipeline failure indicates that the error return value of finput.Close is not checked. This violates the errcheck linter rule.

🐛 Proposed fix
-				defer finput.Close()
+				defer func() {
+					if err := finput.Close(); err != nil {
+						gologger.Warning().Msgf("Could not close input file '%s': %s\n", r.options.InputFile, err)
+					}
+				}()

Alternatively, since defer is already used, you can suppress the linter for this specific case if the error is intentionally ignored:

-				defer finput.Close()
+				defer finput.Close() //nolint:errcheck // best-effort close
🤖 Fix all issues with AI agents
In `@README.md`:
- Line 65: Update the Go version note for the httpx installation: replace or
make the `>=1.25.0` requirement in the README (the string "`>=1.25.0`" that
currently references Go for `httpx`) conditional or placeholder until the
correct Go version is finalized, and ensure consistency with the `go.mod` module
declaration (`go` directive) by updating both the README text and the `go.mod`
`go` version to the final approved Go version once determined.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@runner/runner.go`:
- Around line 655-669: The PageType value stored in Runner.classifyPage may be a
non-string type and later code does resp.KnowledgeBase["PageType"].(string),
causing a panic; change the insertion so PageType is coerced to a stable string
(e.g., use fmt.Sprint or fmt.Sprintf("%v", result.Type)) when setting
kb["PageType"] = ..., keep the existing nil/empty handling and ensure Forms
logic is unchanged so downstream filters can safely type-assert to string.
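
A small sketch of the concern raised here: if the classifier returns a named string type, a plain `.(string)` assertion on the stored value fails, while coercing with `fmt.Sprint` at insertion time keeps downstream assertions safe. The `pageType` type and both helpers are hypothetical and only model the situation described in the comment.

```go
package main

import "fmt"

// pageType is a hypothetical named type standing in for a classifier result
// whose Type field is not a plain string.
type pageType string

// storePageType coerces any value to a string before storing it, so later
// kb["PageType"].(string) assertions cannot panic.
func storePageType(kb map[string]interface{}, t interface{}) {
	if t == nil {
		return
	}
	s := fmt.Sprint(t)
	if s == "" {
		return
	}
	kb["PageType"] = s
}

// pageTypeOf reads the value back using the comma-ok form, which never panics.
func pageTypeOf(kb map[string]interface{}) (string, bool) {
	s, ok := kb["PageType"].(string)
	return s, ok && s != ""
}

func main() {
	kb := map[string]interface{}{}
	storePageType(kb, pageType("login"))
	fmt.Println(pageTypeOf(kb))

	// Storing the named type directly would make the assertion fail.
	raw := map[string]interface{}{"PageType": pageType("login")}
	fmt.Println(pageTypeOf(raw))
}
```

Using the comma-ok assertion on the read side is a second line of defence even when the write side already coerces.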
🧹 Nitpick comments (2)
runner/runner.go (2)

433-438: Initialization fallback is OK, but consider gating/noise + explicit nil on error.

Proceeding with a nil classifier is fine, but:

  • This warning will show even when page-type classification is irrelevant to the user’s chosen flags. Consider initializing dit only when -json/-csv/-fpt (or similar) is enabled to reduce surprise/noise.
  • Minor clarity: explicitly set runner.ditClassifier = nil on error (so future refactors don’t accidentally use a partially-initialized value).

2639-2640: Consider preferring headlessBody (when available) for classification accuracy.

Right now KnowledgeBase classification uses respData even when scanopts.Screenshot produced headlessBody. If the goal is better login/captcha/etc detection and form extraction on JS-rendered pages, you may want:

  • classifyPage(headlessBody, pHash) when headlessBody != ""
  • else fall back to respData

This is not strictly required for correctness, but it likely improves real-world detection rates.
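
A minimal sketch of that fallback, assuming both bodies are available as strings; `classificationInput` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// classificationInput prefers the JS-rendered headless body when it has
// content and falls back to the raw response body otherwise.
func classificationInput(headlessBody, respBody string) string {
	if strings.TrimSpace(headlessBody) != "" {
		return headlessBody
	}
	return respBody
}

func main() {
	fmt.Println(classificationInput("", "<html>raw</html>"))
	fmt.Println(classificationInput("<html>rendered</html>", "<html>raw</html>"))
}
```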

@dogancanbakir dogancanbakir force-pushed the feat/dit-page-classifier branch from f43056f to 9ba4754 Compare February 16, 2026 10:35
@dwisiswant0 dwisiswant0 left a comment

Any deprecation process should be planned with a clear roadmap and timeline. Pulling support abruptly without giving users enough time to prepare/migrate is just going to cause UX frustration and/or unnecessary (downstream) breakage.

@tarunKoyalwar tarunKoyalwar removed their request for review February 17, 2026 14:50
@Mzack9999 Mzack9999 left a comment

  • Merge conflict
  • Add // Deprecated: use xxx syntax for the CLI option
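
For reference, Go's convention is a doc-comment paragraph starting with `Deprecated:`, which tools such as gopls and staticcheck recognize. The sketch below applies it to an options field and folds the old flag into the new one. `OutputFilterPageType` is named in the walkthrough; `OutputFilterErrorPage` and `applyDeprecatedFlags` are assumptions made for this example, and the printed message wording is illustrative.

```go
package main

import "fmt"

// Options is a trimmed, hypothetical stand-in for the runner options struct.
type Options struct {
	// OutputFilterPageType holds the page types passed to -fpt.
	OutputFilterPageType []string

	// OutputFilterErrorPage is set by -fep (field name assumed).
	//
	// Deprecated: use OutputFilterPageType with the value "error" instead.
	OutputFilterErrorPage bool
}

// applyDeprecatedFlags folds -fep into the -fpt list and reports whether a
// deprecation notice should be shown to the user.
func applyDeprecatedFlags(o *Options) (notice bool) {
	if !o.OutputFilterErrorPage {
		return false
	}
	for _, t := range o.OutputFilterPageType {
		if t == "error" {
			return true // already requested explicitly
		}
	}
	o.OutputFilterPageType = append(o.OutputFilterPageType, "error")
	return true
}

func main() {
	o := &Options{OutputFilterPageType: []string{"login"}, OutputFilterErrorPage: true}
	if applyDeprecatedFlags(o) {
		fmt.Println("-fep is deprecated, use -fpt error") // illustrative wording
	}
	fmt.Println(o.OutputFilterPageType)
}
```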

dogancanbakir commented Mar 3, 2026

@dwisiswant0

> Any deprecation process should be planned with a clear roadmap and timeline. Pulling support abruptly without giving users enough time to prepare/migrate is just going to cause UX frustration and/or unnecessary (downstream) breakage.

-fep is not being removed. It still works as before, it maps to -fpt error behind the scenes and shows a deprecation info message. Backward compatibility is preserved.

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
runner/runner.go (1)

1130-1137: Consider renaming logFilteredErrorPage to logFilteredPage.

The function logFilteredErrorPage is now used for all filtered page types (login, captcha, parked, etc.), not just error pages. The current name is misleading given the broader -fpt functionality.

♻️ Suggested rename
-				logFilteredErrorPage(r.options.OutputFilterErrorPagePath, resp.URL)
+				logFilteredPage(r.options.OutputFilterErrorPagePath, resp.URL)

Also rename the function definition at line 1564 and consider renaming OutputFilterErrorPagePath to OutputFilterPagePath for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@runner/runner.go` around lines 1130 - 1137, The helper name
logFilteredErrorPage is misleading because it's used for all filtered page
types; rename the function to logFilteredPage (and update its definition) and
rename the option OutputFilterErrorPagePath to OutputFilterPagePath on the
options struct; then update every call site (e.g., the code that currently calls
logFilteredErrorPage(r.options.OutputFilterErrorPagePath, resp.URL)) to call
logFilteredPage(r.options.OutputFilterPagePath, resp.URL) so names are
consistent across the function, its calls, and the option field.
ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ba4754 and 41696ba.

⛔ Files ignored due to path filters (5)
  • .github/workflows/build-test.yml is excluded by !**/*.yml
  • .github/workflows/functional-test.yml is excluded by !**/*.yml
  • .github/workflows/release-binary.yml is excluded by !**/*.yml
  • .github/workflows/release-test.yml is excluded by !**/*.yml
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (9)
  • Dockerfile
  • README.md
  • common/pagetypeclassifier/clf.gob
  • common/pagetypeclassifier/dataset.txt
  • common/pagetypeclassifier/pagetypeclassifier.go
  • common/pagetypeclassifier/pagetypeclassifier_test.go
  • go.mod
  • runner/options.go
  • runner/runner.go
💤 Files with no reviewable changes (3)
  • common/pagetypeclassifier/pagetypeclassifier.go
  • common/pagetypeclassifier/dataset.txt
  • common/pagetypeclassifier/pagetypeclassifier_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • runner/options.go
  • README.md
  • Dockerfile

@dogancanbakir dogancanbakir force-pushed the feat/dit-page-classifier branch from 1e193fc to 0e3ceee Compare March 3, 2026 18:04
@Mzack9999 Mzack9999 merged commit 3390c41 into dev Mar 4, 2026
14 checks passed
@Mzack9999 Mzack9999 deleted the feat/dit-page-classifier branch March 4, 2026 12:06
@Mzack9999 Mzack9999 added the Type: Enhancement Most issues will probably ask for additions or changes. label Mar 4, 2026

3 participants