GSoC 2026 Module B — Week 3: Stage 2 LLM relevance classifier#947

Open
manshusainishab wants to merge 5 commits into OWASP:main from manshusainishab:module_b_w3

@manshusainishab

Contributor

Summary

Adds Stage 2 of Module B (Noise/Relevance Filter): an LLM classifier that labels
each content chunk as KNOWLEDGE, NOISE, or UNCERTAIN under the
recall-first rule. Builds on the Week 1 schemas and Week 2 regex/sanitize stages.

This PR is self-contained (classifier + prompt + config + tests). Pipeline
wiring, the queue/DB model, and the CLI entry point come in later weeks.

What's added

  • application/utils/noise_filter/config_loader.py — loads Module B settings
    from CRE_NOISE_FILTER_* environment variables into a typed NoiseFilterConfig
    (model, batch size, per-chunk char cap, confidence threshold), with defaults.
  • application/utils/noise_filter/prompts.py — the recall-first system prompt
    and a few-shot block (5 KNOWLEDGE / 3 NOISE / 2 UNCERTAIN worked examples), plus
    a helper that renders a numbered batch of chunks into the user prompt.
  • application/utils/noise_filter/llm_classifier.py — LLMClassifier, which
    classifies a list of ChangeRecords and returns one ClassifyResult per record.
  • application/tests/noise_filter/llm_classifier_test.py — 14 unit tests,
    fully mocked (no network calls).
  • .env.example — documents the four new CRE_NOISE_FILTER_* variables.
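
As a rough illustration of the prompt helper described above, the sketch below renders a numbered batch of chunks and applies the per-chunk character cap. It is not the code in prompts.py: the `Chunk` class, the `render_batch` name, and the `[index] heading` layout are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for the ChangeRecord fields the prompt uses."""

    heading_path: str
    text: str


def render_batch(chunks: list[Chunk], max_chars: int = 1500) -> str:
    """Render chunks as a numbered block, capping each text at max_chars."""
    blocks = []
    for index, chunk in enumerate(chunks):
        body = chunk.text[:max_chars]  # per-chunk character cap
        blocks.append(f"[{index}] {chunk.heading_path}\n{body}")
    # A blank line between entries keeps the chunk boundaries visible.
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demo = [Chunk("Auth > Sessions", "Rotate session IDs on login."),
            Chunk("Changelog", "Bumped version.")]
    print(render_batch(demo, max_chars=20))
```

Numbering each chunk in the prompt gives the model an index to echo back, which is what lets results be mapped to input order.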

How the classifier works

  • Sends each chunk's heading_path + text to a dedicated lightweight model via
    LiteLLM (default gemini/gemini-2.5-flash-lite).
  • Processes records in batches (CRE_NOISE_FILTER_BATCH_SIZE, default 10), one
    request per batch, and maps results back to input order by index.
  • Requests a strict JSON-schema response; if the provider doesn't support strict
    mode, falls back to JSON-object mode.
  • Truncates each chunk to CRE_NOISE_FILTER_MAX_CHARS (default 1500) before sending.
  • Retries on rate-limit/quota errors using the existing CRE_LLM_MAX_RETRIES /
    CRE_LLM_RETRY_SLEEP_SECONDS settings.
  • Returns UNCERTAIN (confidence 0.0) for any unparseable, malformed, or invalid
    output, and marks a whole batch UNCERTAIN if the LLM call fails — so a bad
    response never aborts a run.
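
The batching and fail-safe behaviour above can be sketched as follows. This is a minimal illustration under stated assumptions, not the LLMClassifier code: the `Verdict` type, the `parse_batch` and `classify_all` names, the injected `call_llm` callable, and the `{"results": [{"index", "label", "confidence"}]}` response shape are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable

LABELS = {"KNOWLEDGE", "NOISE", "UNCERTAIN"}


@dataclass(frozen=True)
class Verdict:
    label: str
    confidence: float


UNCERTAIN = Verdict("UNCERTAIN", 0.0)


def parse_batch(raw: str, size: int) -> list[Verdict]:
    """Map model output back to input order; anything invalid stays UNCERTAIN."""
    results = [UNCERTAIN] * size
    try:
        items = json.loads(raw)["results"]  # assumed response shape
    except (ValueError, KeyError, TypeError):
        return results
    if not isinstance(items, list):
        return results
    for item in items:
        if not isinstance(item, dict):
            continue
        index, label, conf = item.get("index"), item.get("label"), item.get("confidence")
        if isinstance(index, bool) or not isinstance(index, int) or not 0 <= index < size:
            continue
        if not isinstance(label, str) or label not in LABELS:
            continue
        if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            continue
        results[index] = Verdict(label, float(conf))
    return results


def classify_all(texts: list[str], call_llm: Callable[[list[str]], str],
                 batch_size: int = 10) -> list[Verdict]:
    """One request per batch; a failed call marks the whole batch UNCERTAIN."""
    verdicts: list[Verdict] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        try:
            raw = call_llm(batch)
        except Exception:
            verdicts.extend([UNCERTAIN] * len(batch))
            continue
        verdicts.extend(parse_batch(raw, len(batch)))
    return verdicts
```

Pre-filling the result list with UNCERTAIN means a missing, duplicated, or out-of-range index can never shift another chunk's verdict.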

Configuration

| Variable | Default | Purpose |
| --- | --- | --- |
| CRE_NOISE_FILTER_LLM_MODEL | gemini/gemini-2.5-flash-lite | Classification model (LiteLLM string) |
| CRE_NOISE_FILTER_BATCH_SIZE | 10 | Chunks per LLM request |
| CRE_NOISE_FILTER_MAX_CHARS | 1500 | Per-chunk character cap before sending |
| CRE_NOISE_FILTER_CONFIDENCE_THRESHOLD | 0.8 | Minimum confidence to enqueue a KNOWLEDGE verdict |

The model is Gemini, so it authenticates with the existing GEMINI_API_KEY;
no new credential is required.
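
For readers who want a picture of how these variables could become a typed config, here is a minimal sketch. It is an approximation rather than the contents of config_loader.py: the `model` field name and the optional `env` mapping parameter are assumptions, and the `__post_init__` range checks follow the validation suggested in the review on this PR rather than anything confirmed in the description.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class NoiseFilterConfig:
    model: str = "gemini/gemini-2.5-flash-lite"
    batch_size: int = 10
    max_chars: int = 1500
    confidence_threshold: float = 0.8

    def __post_init__(self) -> None:
        # Fail fast on values the classifier could not use sensibly.
        if self.batch_size < 1:
            raise ValueError("batch_size must be >= 1")
        if self.max_chars < 1:
            raise ValueError("max_chars must be >= 1")
        if not 0.0 <= self.confidence_threshold <= 1.0:
            raise ValueError("confidence_threshold must be within [0.0, 1.0]")


def load_config(env: Optional[Mapping[str, str]] = None) -> NoiseFilterConfig:
    """Build the config from CRE_NOISE_FILTER_* variables, using defaults if unset."""
    env = os.environ if env is None else env
    defaults = NoiseFilterConfig()
    return NoiseFilterConfig(
        model=env.get("CRE_NOISE_FILTER_LLM_MODEL", defaults.model),
        batch_size=int(env.get("CRE_NOISE_FILTER_BATCH_SIZE", defaults.batch_size)),
        max_chars=int(env.get("CRE_NOISE_FILTER_MAX_CHARS", defaults.max_chars)),
        confidence_threshold=float(
            env.get("CRE_NOISE_FILTER_CONFIDENCE_THRESHOLD", defaults.confidence_threshold)
        ),
    )
```

Passing a plain dict as `env` keeps such a loader easy to unit-test without touching the process environment.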

Testing

  • 14 new unit tests covering prompt content, batch ordering and splitting,
    malformed/invalid/empty output handling, the JSON-schema fallback, rate-limit
    retry and exhaustion, and truncation.
  • Full suite: 369 passing, 1 skipped, 0 failures.
  • black --check clean across the repo.
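
To give a flavour of what a fully mocked rate-limit test can look like, here is a small sketch built on unittest.mock. It does not reproduce the tests in llm_classifier_test.py: `RateLimitError` and `completion_with_retry` are hypothetical stand-ins defined inside the example, and the injected `sleep` argument is an assumption that lets the test run without real delays.

```python
import time
from unittest import mock


class RateLimitError(Exception):
    """Hypothetical stand-in for the provider's rate-limit/quota exception."""


def completion_with_retry(call, max_retries=3, sleep_seconds=1.0, sleep=time.sleep):
    """Call `call()`; on RateLimitError wait, then retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimitError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(sleep_seconds)


def test_retry_then_success():
    # Two rate-limit failures followed by a successful response.
    call = mock.Mock(side_effect=[RateLimitError(), RateLimitError(), "ok"])
    sleep = mock.Mock()
    assert completion_with_retry(call, max_retries=3, sleep_seconds=0.5, sleep=sleep) == "ok"
    assert call.call_count == 3
    assert sleep.call_count == 2
    sleep.assert_called_with(0.5)


def test_retry_exhaustion():
    # Every attempt is rate-limited, so the error must propagate.
    call = mock.Mock(side_effect=RateLimitError())
    sleep = mock.Mock()
    try:
        completion_with_retry(call, max_retries=2, sleep_seconds=0.0, sleep=sleep)
    except RateLimitError:
        pass
    else:
        raise AssertionError("expected RateLimitError")
    assert call.call_count == 3  # one initial attempt plus two retries
    assert sleep.call_count == 2
```

In the PR itself the retry count and delay come from CRE_LLM_MAX_RETRIES and CRE_LLM_RETRY_SLEEP_SECONDS; the keyword arguments here only mirror that idea.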

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Contributor

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: ae4eea5a-e93b-4300-b537-4c11d1beb121

📥 Commits

Reviewing files that changed from the base of the PR and between c30a29c and aba6ce6.

📒 Files selected for processing (2)
  • application/tests/noise_filter/llm_classifier_test.py
  • application/utils/noise_filter/llm_classifier.py

Walkthrough

Adds Module B noise-filter configuration, prompt construction, and an LLM classifier that batches ChangeRecord inputs, retries rate-limited calls, truncates long text, and parses ordered verdicts. Includes unit tests for prompt content, batching, malformed output, fallback behavior, retries, and truncation.

Changes

Noise Filter Module B

| Layer | File(s) | Summary |
| --- | --- | --- |
| Configuration surface | .env.example, application/utils/noise_filter/config_loader.py | Module B env defaults, typed config loading, and the new settings block are added together. |
| Prompt contract | application/utils/noise_filter/prompts.py | Defines the base system prompt, few-shot examples, JSON renderers, and user prompt builder for the classifier. |
| Batch classification flow | application/utils/noise_filter/llm_classifier.py | Adds batched LLM calls, strict-schema fallback, rate-limit retries, truncation, and result parsing into ordered verdicts. |
| Classifier test coverage | application/tests/noise_filter/llm_classifier_test.py, application/tests/noise_filter/config_loader_test.py | Adds tests for prompt text, ordering, batching, malformed output, response-format fallback, retries, truncation, and config loading/validation. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • Pa04rth
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 27.78%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly summarizes the main change: a Stage 2 LLM relevance classifier for Module B. |
| Description check | ✅ Passed | The description is detailed and directly matches the changeset, covering the classifier, prompts, config, tests, and env docs. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@application/utils/noise_filter/config_loader.py`:
- Around line 29-36: NoiseFilterConfig currently allows invalid values to be
constructed directly, so add invariant checks inside its dataclass
initialization path. Update NoiseFilterConfig to validate batch_size >= 1,
max_chars >= 1, and confidence_threshold between 0.0 and 1.0 in __post_init__,
so every construction route fails fast before llm_classifier uses these fields.
Keep the checks centralized in NoiseFilterConfig so direct callers and
load_config() both get the same validation behavior.

In `@application/utils/noise_filter/llm_classifier.py`:
- Around line 151-163: The fallback in llm_classifier.py is too broad: the
try/except around self._completion_with_retry in the strict_schema path retries
on every exception, even non-capability failures. Update the logic in the
classifier method that builds messages and calls _completion_with_retry so only
errors signalling that strict schema is unsupported (for example the provider's
BadRequestError for strict schema) trigger the json_object retry, and let all
other exceptions propagate without a second attempt. Keep the warning/logging
specific to the capability fallback path so the retry only happens when strict
schema is genuinely unsupported.
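
One possible shape for the narrowed fallback is sketched below. It is illustrative only: `BadRequestError` is a locally defined stand-in for whatever exception the provider raises, `complete_with_schema_fallback` and the injected `completion` callable are hypothetical names, and a real implementation may also need to inspect the error message to confirm the failure is about strict schema support.

```python
import logging

logger = logging.getLogger(__name__)


class BadRequestError(Exception):
    """Hypothetical stand-in for the provider's bad-request exception."""


def complete_with_schema_fallback(completion, messages, strict_format):
    """Try strict JSON-schema mode; fall back to JSON-object mode only when
    the provider rejects the request. Any other exception propagates."""
    try:
        return completion(messages, response_format=strict_format)
    except BadRequestError as exc:
        # Capability fallback path: strict schema appears unsupported.
        logger.warning("Strict schema rejected (%s); retrying in json_object mode", exc)
        return completion(messages, response_format={"type": "json_object"})
```

Catching a single exception type keeps network errors, timeouts, and exhausted rate-limit retries from triggering a second, equally doomed request.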

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 6ed01a44-6a87-42ec-9e1b-c7220c2dfc79

📥 Commits

Reviewing files that changed from the base of the PR and between 4485936 and 02d4bd4.

📒 Files selected for processing (5)
  • .env.example
  • application/tests/noise_filter/llm_classifier_test.py
  • application/utils/noise_filter/config_loader.py
  • application/utils/noise_filter/llm_classifier.py
  • application/utils/noise_filter/prompts.py
