week_2: Module C (The Librarian) — C.0 input boundary: SectionValidator + ExplicitLinkResolver#925
week_2: Module C (The Librarian) — C.0 input boundary: SectionValidator + ExplicitLinkResolver#925PRAteek-singHWY wants to merge 6 commits into
Conversation
dataset Contracts + regression ruler before any pipeline code, per the OIE RFC's 'test before the code' directive. RFC OWASP#734 envelopes (KnowledgeItem in, LinkProposal/ReviewItem out) as Pydantic v2, drift-guarded against the vendored owasp-graph schemas; TRACT hub-firewall + multi-link scoring; 319-row golden dataset derived from standards_cache.sqlite with --check drift detection. One prod edit: pydantic>=2,<3 pin.
…nts, fail-fast build, edge cases
|
Warning Review limit reached
More reviews will be available in 45 minutes and 18 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughAdds Module C (Librarian): RFC JSON schemas and Pydantic models, env-config loading, explicit-link resolution, hub firewalling, scoring, C.0 input boundary conversion, golden dataset generation, evaluation, and unit tests. ChangesModule C Librarian Implementation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (6)
application/utils/librarian/schemas.py (2)
272-290: 💤 Low valueConsider making the
extrafield handling explicit for clarity.The docstring (Line 276-277) states this model "tolerates extra fields so B can extend the row without breaking C," relying on Pydantic v2's default behavior of
extra="ignore". While this works correctly, being explicit would improve code clarity and prevent confusion if Pydantic defaults change in the future.📝 Suggested improvement
class KnowledgeQueueItem(BaseModel): """Read-side mirror of Module B's `knowledge_queue` Postgres row. Per master guide §1.2: C reads these rows and synthesizes the RFC `KnowledgeItem` envelope from them. Not a wire contract; tolerates extra fields so B can extend the row without breaking C. """ + model_config = ConfigDict(extra="ignore") id: str🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/librarian/schemas.py` around lines 272 - 290, The KnowledgeQueueItem Pydantic model currently relies on Pydantic v2's default of ignoring extra fields; make this explicit by adding an explicit model config to the KnowledgeQueueItem class (e.g., set model_config = {"extra": "ignore"} or the equivalent Config/ModelConfig pattern used across the codebase) so BaseModel subclass KnowledgeQueueItem clearly documents and enforces tolerant handling of unknown fields from Module B.
99-107: 💤 Low valueConsider aligning the
languagefield default with the JSON schema.The JSON schema (
knowledge-item.jsonLine 43) specifies"language": { "type": "string", "default": "en" }, but the Pydantic model haslanguage: Optional[str] = None. While JSON Schemadefaultvalues are not enforced during validation (they're hints for tooling), this creates a semantic mismatch: the schema documents that language should default to"en", but the model defaults toNone.If the intent is for language to always be
"en"when unspecified, consider:language: str = "en"If
Noneis intentionally allowed (meaning "language unknown"), the current model is correct but consider updating the JSON schema to clarify this.Since round-trip tests are passing (per PR description), this may be intentional or the tests may not validate default value behavior strictly.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/librarian/schemas.py` around lines 99 - 107, The KnowledgeContent Pydantic model's language field currently allows None but the JSON schema documents a default of "en"; to align them, change the field in the KnowledgeContent class from Optional[str] = None to a non-optional string with default "en" (i.e., make language: str = "en") so the model provides the documented default; if instead None is intended, update the JSON schema to remove or change the default—refer to the KnowledgeContent class and the language field to implement the change.requirements.txt (1)
1-119: ⚖️ Poor tradeoffConsider deduplicating dependency entries.
The file contains duplicate entries that may cause confusion:
compliance-trestle(lines 3, 35)setuptools(lines 12, 33)SQLAlchemy(lines 34, 100)psycopg2-binary(lines 27, 61)playwright(lines 26, 54)scikit_learn/scikit-learn(lines 30, 94 — naming inconsistency)When pip encounters duplicates, it uses the last occurrence, making earlier entries misleading. Consider consolidating these into single entries.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@requirements.txt` around lines 1 - 119, requirements.txt contains duplicate and inconsistent package entries; remove redundant lines and consolidate into single canonical entries for each package (e.g., keep one compliance-trestle, one setuptools, one SQLAlchemy, one psycopg2-binary, one playwright) and normalize scikit-learn to the correct package name (scikit-learn) so pip behavior is deterministic; ensure any required version specifiers are preserved when merging and remove exact duplicates (e.g., duplicate PyYAML/version lines) so the final file lists each dependency only once.scripts/build_golden_dataset.py (1)
261-261: ⚡ Quick winRename unused loop variable to follow convention.
The variable
node_idis not used within the loop body. Per Python convention, prefix unused variables with underscore.♻️ Proposed fix
- for node_id, name, section_id, text, cre_concat in rows: + for _node_id, name, section_id, text, cre_concat in rows:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@scripts/build_golden_dataset.py` at line 261, The loop binding in the for statement "for node_id, name, section_id, text, cre_concat in rows:" uses an unused variable node_id; rename it to _node_id to follow Python convention for unused variables and avoid linter warnings. Update the for header to "for _node_id, name, section_id, text, cre_concat in rows:" and ensure there are no references to node_id inside the loop (if any, replace them with the intended variable or raise if needed).application/utils/librarian/knowledge_source.py (2)
16-20: ⚡ Quick winSimplify abstract method body.
The
raise NotImplementedErrorin the abstract method body is redundant. The@abstractmethoddecorator already enforces that subclasses must implement this method.♻️ Proposed fix
class KnowledgeSource(ABC): `@abstractmethod` def items(self) -> Iterator[KnowledgeQueueItem]: """Yield knowledge_queue rows awaiting classification.""" - raise NotImplementedError + ...🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/librarian/knowledge_source.py` around lines 16 - 20, Remove the redundant raise in the abstract method: in the KnowledgeSource class remove the "raise NotImplementedError" from the items method (keep the `@abstractmethod` decorator and the docstring or replace the body with a simple pass) so subclasses are still required to implement KnowledgeSource.items without the unnecessary explicit exception.
9-9: ⚡ Quick winRemove unused import.
The
jsonmodule is imported but never used. Line 34 usesKnowledgeQueueItem.model_validate_json, which is a Pydantic method that handles JSON parsing internally.♻️ Proposed fix
-import json from abc import ABC, abstractmethod from typing import Iterator🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/librarian/knowledge_source.py` at line 9, Remove the unused import of the json module at the top of knowledge_source.py; since the code uses KnowledgeQueueItem.model_validate_json (a Pydantic method) for JSON parsing, delete the "import json" line to avoid an unused-import warning and keep imports minimal.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@requirements.txt`:
- Line 66: Update the pydantic version range in requirements.txt to exclude
vulnerable 2.x releases by changing the spec for "pydantic" from
"pydantic>=2,<3" to "pydantic>=2.4.0,<3" so that installations will use the
first patched 2.4.0+ release; locate the "pydantic>=2,<3" entry and replace it
with "pydantic>=2.4.0,<3".
---
Nitpick comments:
In `@application/utils/librarian/knowledge_source.py`:
- Around line 16-20: Remove the redundant raise in the abstract method: in the
KnowledgeSource class remove the "raise NotImplementedError" from the items
method (keep the `@abstractmethod` decorator and the docstring or replace the body
with a simple pass) so subclasses are still required to implement
KnowledgeSource.items without the unnecessary explicit exception.
- Line 9: Remove the unused import of the json module at the top of
knowledge_source.py; since the code uses KnowledgeQueueItem.model_validate_json
(a Pydantic method) for JSON parsing, delete the "import json" line to avoid an
unused-import warning and keep imports minimal.
In `@application/utils/librarian/schemas.py`:
- Around line 272-290: The KnowledgeQueueItem Pydantic model currently relies on
Pydantic v2's default of ignoring extra fields; make this explicit by adding an
explicit model config to the KnowledgeQueueItem class (e.g., set model_config =
{"extra": "ignore"} or the equivalent Config/ModelConfig pattern used across the
codebase) so BaseModel subclass KnowledgeQueueItem clearly documents and
enforces tolerant handling of unknown fields from Module B.
- Around line 99-107: The KnowledgeContent Pydantic model's language field
currently allows None but the JSON schema documents a default of "en"; to align
them, change the field in the KnowledgeContent class from Optional[str] = None
to a non-optional string with default "en" (i.e., make language: str = "en") so
the model provides the documented default; if instead None is intended, update
the JSON schema to remove or change the default—refer to the KnowledgeContent
class and the language field to implement the change.
In `@requirements.txt`:
- Around line 1-119: requirements.txt contains duplicate and inconsistent
package entries; remove redundant lines and consolidate into single canonical
entries for each package (e.g., keep one compliance-trestle, one setuptools, one
SQLAlchemy, one psycopg2-binary, one playwright) and normalize scikit-learn to
the correct package name (scikit-learn) so pip behavior is deterministic; ensure
any required version specifiers are preserved when merging and remove exact
duplicates (e.g., duplicate PyYAML/version lines) so the final file lists each
dependency only once.
In `@scripts/build_golden_dataset.py`:
- Line 261: The loop binding in the for statement "for node_id, name,
section_id, text, cre_concat in rows:" uses an unused variable node_id; rename
it to _node_id to follow Python convention for unused variables and avoid linter
warnings. Update the for header to "for _node_id, name, section_id, text,
cre_concat in rows:" and ensure there are no references to node_id inside the
loop (if any, replace them with the intended variable or raise if needed).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: c707f462-c16f-4413-acca-0aa2f5c45f32
📒 Files selected for processing (28)
application/tests/librarian/__init__.pyapplication/tests/librarian/config_loader_test.pyapplication/tests/librarian/dataset_test.pyapplication/tests/librarian/explicit_link_resolver_test.pyapplication/tests/librarian/fixtures/golden_dataset.jsonapplication/tests/librarian/fixtures/golden_dataset.schema.jsonapplication/tests/librarian/fixtures/sample_knowledge_queue.jsonlapplication/tests/librarian/hub_firewall_test.pyapplication/tests/librarian/schemas_test.pyapplication/tests/librarian/scoring_test.pyapplication/tests/librarian/section_validator_test.pyapplication/utils/librarian/__init__.pyapplication/utils/librarian/_rfc_schemas/knowledge-item.jsonapplication/utils/librarian/_rfc_schemas/link-proposal.jsonapplication/utils/librarian/_rfc_schemas/locator.jsonapplication/utils/librarian/_rfc_schemas/proposed-link.jsonapplication/utils/librarian/_rfc_schemas/review-item.jsonapplication/utils/librarian/_rfc_schemas/source-ref.jsonapplication/utils/librarian/config_loader.pyapplication/utils/librarian/explicit_link_resolver.pyapplication/utils/librarian/hub_firewall.pyapplication/utils/librarian/knowledge_source.pyapplication/utils/librarian/schemas.pyapplication/utils/librarian/scoring.pyapplication/utils/librarian/section_validator.pyrequirements.txtscripts/build_golden_dataset.pyscripts/evaluate_librarian.py
…inkResolver The C.0 deterministic input boundary, per the proposal's W2/W3 'data preparation layer' (named validator, not normalizer: the RFC assigns text normalization to Module A; C validates and adapts, never transforms text). - section_validator.py: typed-error validation of both upstream shapes (B's knowledge_queue row + RFC KnowledgeItem envelope) into an internal Section; synthesizes RFC identity fields (chunk_id/artifact_id/source/ locator) from B's reduced row; strips volatile audit metadata. - explicit_link_resolver.py: deterministic ddd-ddd fast path, no ML. Fail-safe: only a single known reference auto-links; unknown or conflicting references route to review. - evaluate_librarian.py: harness now runs every golden row through C.0, prints per-slice validation pass rate, and gates the explicit slice at 100% resolver correctness (exit 1 on regression). Gate: PASS 5/5. - Table-driven tests for every rejection class and resolver outcome. mypy --strict clean, black clean, 66 tests green.
b69a735 to
b4a6aa2
Compare
…ASP#925) GHSA pydantic ReDoS affects >=2.0.0,<2.4.0; first patched in 2.4.0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ASP#925) GHSA pydantic ReDoS affects >=2.0.0,<2.4.0; first patched in 2.4.0.
d99a64a to
73aa15e
Compare
…ASP#925) GHSA pydantic ReDoS affects >=2.0.0,<2.4.0; first patched in 2.4.0.
week_2: Module C (The Librarian) — C.0 deterministic input boundary: SectionValidator + ExplicitLinkResolver
Overview
This is the Week 2 deliverable for Module C (The Librarian): C.0, the deterministic input boundary. Before any retrieval or ML, every incoming chunk passes two deterministic stages:
Sectionthe pipeline consumes, and rejects bad input with typed errors.ddd-ddd)? If exactly one known id is cited, link it directly with no ML. Unknown or conflicting references never auto-link — they route to human review (fail-safe).A naming note for reviewers: the plan documents called this component "SectionNormalizer", but the RFC (#734) assigns text normalization to Module A (harvest + normalize + chunk), with Module B's sanitizer on top. By the time text reaches Module C it is contractually clean, and re-cleaning it here would silently drift C from what A hashed and B classified. This component validates and adapts — it never transforms text — so it is named SectionValidator.
Scope: 4 new files + 1 updated (the eval harness). No frontend, no migrations, no DB access, no behaviour change to OpenCRE proper.
What changed
section_validator.pyknowledge_queuerow and the full RFCKnowledgeItemenvelope — into a frozen internalSection. Synthesizes the RFC identity fields from B's row (chunk_id = chk:{repo}@{sha}:{path},artifact_id = art:{repo}:{path},source,locator). Strips volatile audit metadata (llm_reasoning, filter stages, run timestamps) so downstream stages can never key decisions on it. Rejections are typed (MalformedKnowledgeItemError,EmptyTextError,UnsupportedLanguageError,NotKnowledgeError); raw PydanticValidationErrornever escapes the boundary.explicit_link_resolver.py\b\d{3}-\d{3}\b) extraction + resolution against an injected set of known CRE ids. Outcomes:resolved(single known id → auto-link),no_reference(continue to the semantic path, W3+),unknown_reference/conflicting_references(→ review, with the known ids preserved as suggestions). The known-id set is injected so this module stays dependency-free: the harness seeds it from the golden dataset today; the DB-backedcre.external_idregistry arrives with the retriever (W3).evaluate_librarian.pypredict()makes its first real predictions (explicit path only; semantic path still stubbed).section_validator_test.py,explicit_link_resolver_test.pyen-GBaccepted,frrejected), and every resolver outcome including pattern boundaries (027-5555andCVE-2024-1234-567must not match). One test asserts the boundary never leaks a raw Pydantic error.How the pieces connect
flowchart TB subgraph UPSTREAM["Upstream shapes (Week 1 contracts)"] row["KnowledgeQueueItem<br/>(Module B's reduced row)"] ki["KnowledgeItem<br/>(RFC envelope)"] end subgraph C0["C.0 — input boundary (this PR)"] val["section_validator.py<br/>validate · adapt · synthesize identity"] sec["Section<br/>(internal, frozen)"] res["explicit_link_resolver.py<br/>regex ddd-ddd, no ML"] end subgraph OUT["Outcomes"] link["resolved →<br/>deterministic link"] sem["no_reference →<br/>semantic path (W3+)"] rev["unknown / conflicting →<br/>human review"] err["typed errors:<br/>Malformed · EmptyText ·<br/>UnsupportedLanguage · NotKnowledge"] end subgraph HARNESS["Eval harness (updated)"] eval["evaluate_librarian.py<br/>pass rate per slice ·<br/>explicit gate 100% (exit 1 on fail)"] end row --> val ki --> val val --> sec val -. reject .-> err sec --> res res --> link res --> sem res --> rev sec --> eval res --> evalResults
mypy --strictclean on both new modules (repo flags)blackcleanWhat is intentionally not here
Retriever, embeddings, cross-encoder, SafetyGuard, decision engine, CLI wiring, and DB models/migrations are later weeks (W3–W8 per the proposal). The semantic path in the harness still predicts nothing — Week 3 plugs the retriever into the same
predict()seam.How to verify locally
🤖 Generated with Claude Code