week_3: Module C (The Librarian) — C.1 candidate retriever (in-memory + pgvector) + pipeline switch#937
Conversation
dataset Contracts + regression ruler before any pipeline code, per the OIE RFC's 'test before the code' directive. RFC OWASP#734 envelopes (KnowledgeItem in, LinkProposal/ReviewItem out) as Pydantic v2, drift-guarded against the vendored owasp-graph schemas; TRACT hub-firewall + multi-link scoring; 319-row golden dataset derived from standards_cache.sqlite with --check drift detection. One prod edit: pydantic>=2,<3 pin.
…nts, fail-fast build, edge cases
…inkResolver The C.0 deterministic input boundary, per the proposal's W2/W3 'data preparation layer' (named validator, not normalizer: the RFC assigns text normalization to Module A; C validates and adapts, never transforms text). - section_validator.py: typed-error validation of both upstream shapes (B's knowledge_queue row + RFC KnowledgeItem envelope) into an internal Section; synthesizes RFC identity fields (chunk_id/artifact_id/source/ locator) from B's reduced row; strips volatile audit metadata. - explicit_link_resolver.py: deterministic ddd-ddd fast path, no ML. Fail-safe: only a single known reference auto-links; unknown or conflicting references route to review. - evaluate_librarian.py: harness now runs every golden row through C.0, prints per-slice validation pass rate, and gates the explicit slice at 100% resolver correctness (exit 1 on regression). Gate: PASS 5/5. - Table-driven tests for every rejection class and resolver outcome. mypy --strict clean, black clean, 66 tests green.
|
Warning Review limit reached
More reviews will be available in 45 minutes and 18 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits. 🚦 How do rate limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
WalkthroughThis PR adds Module C (“The Librarian”) with RFC-aligned schemas and Pydantic models, environment-based configuration, C.0 section validation, explicit CRE resolution, semantic retrieval, evaluation utilities, golden dataset tooling, and CLI integration. ChangesModule C — The Librarian
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
…ipeline switch The semantic search step. For sections with no explicit CRE id, embed the text and cosine-rank the CRE vector hub, returning the top-K (default 20) candidates as an RFC RetrievalAudit (reranked[] empty until W4). Two interchangeable backends behind one retrieve() seam, selected by CRE_LIBRARIAN_RETRIEVER_BACKEND: - CandidateRetriever: in-memory sklearn cosine (SQLite/CI/harness) - PgVectorRetriever: Postgres-side <=> cosine over embedding_vec build_retriever() is the factory. Reuses OpenCRE's embedding stack (PromptHandler.get_text_embeddings, db.get_embeddings_by_doc_type). Dim gate fails loudly on empty/ragged hubs and query/hub width mismatch. CLI switch (--run_librarian / --librarian_dry_run) runs the pipeline dry-run; harness --use_live_embeddings measures retrieval recall@k. The pgvector schema migration is a W8 deliverable (validated against real Postgres there); W3 ships only the code path, unit-tested via a fake connection. 82 librarian tests pass.
73f6a52 to
78b1cb1
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
application/utils/librarian/knowledge_source.py (1)
29-35:⚠️ Potential issue | 🟠 MajorWrap
model_validate_jsonin error handling to prevent a single malformed line from aborting the entire run.At line 34, an uncaught
ValidationErrorfrommodel_validate_jsonwill propagate out of the generator and terminate the iteration incre_main, causing the entire run to fail. The codebase already handles similar validation errors elsewhere (e.g.,section_validator.pycatchesValidationErrorand continues); apply the same pattern here by catching the exception and logging a warning before continuing to the next line.Suggested fix
+import logging import json from abc import ABC, abstractmethod from typing import Iterator +from pydantic import ValidationError + from application.utils.librarian.schemas import KnowledgeQueueItem +logger = logging.getLogger(__name__) + @@ def items(self) -> Iterator[KnowledgeQueueItem]: with open(self._path, encoding="utf-8") as fh: - for line in fh: + for line_no, line in enumerate(fh, start=1): line = line.strip() if line: - yield KnowledgeQueueItem.model_validate_json(line) + try: + yield KnowledgeQueueItem.model_validate_json(line) + except ValidationError as exc: + logger.warning( + "Skipping malformed knowledge_queue row at line %d: %s", + line_no, + exc, + ) + continue🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/librarian/knowledge_source.py` around lines 29 - 35, The items method in the KnowledgeSource class does not handle ValidationError exceptions that can be raised by model_validate_json when processing malformed lines. Wrap the KnowledgeQueueItem.model_validate_json call in a try-except block to catch ValidationError, log a warning message when validation fails for a particular line, and use continue to skip that line and proceed to the next one instead of allowing the exception to propagate and terminate the entire iteration. This pattern is already implemented elsewhere in the codebase such as in section_validator.py and should be applied consistently here.
🧹 Nitpick comments (2)
application/utils/librarian/__init__.py (1)
12-12: ⚡ Quick winUpdate this scope note to reflect current module reality.
Line 12 says there is “No linking logic yet,” but this PR stack now includes explicit resolution and semantic retrieval in Module C. Keeping this sentence stale can mislead future maintainers.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/utils/librarian/__init__.py` at line 12, Update the scope documentation at line 12 in the __init__.py file to reflect the current state of the librarian module. Remove or revise the statement that says "No linking logic yet" since the PR stack now includes explicit resolution and semantic retrieval functionality in Module C. Update the scope note to accurately describe what the module currently contains and supports.application/tests/librarian/config_loader_test.py (1)
12-19: ⚡ Quick winAdd
retriever_backendassertions to complete the config contract checks.The default and override tests validate most fields but skip
retriever_backend, which is part ofLibrarianConfigand a key routing control.Proposed test additions
class TestConfigLoaderDefaults(unittest.TestCase): @@ def test_defaults_when_env_unset(self): with mock.patch.dict(os.environ, {}, clear=True): cfg = load_config() + self.assertEqual(cfg.retriever_backend, "in_memory") self.assertEqual(cfg.crossencoder_model, "cross-encoder/ms-marco-MiniLM-L-6-v2") @@ class TestConfigLoaderOverrides(unittest.TestCase): OVERRIDES = { + "CRE_LIBRARIAN_RETRIEVER_BACKEND": "pgvector", "CRE_LIBRARIAN_CROSSENCODER_MODEL": "cross-encoder/other", @@ def test_env_overrides_apply(self): with mock.patch.dict(os.environ, self.OVERRIDES, clear=True): cfg = load_config() + self.assertEqual(cfg.retriever_backend, "pgvector") self.assertEqual(cfg.crossencoder_model, "cross-encoder/other")Also applies to: 27-35, 40-47
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@application/tests/librarian/config_loader_test.py` around lines 12 - 19, The test assertions for the configuration loader are incomplete and missing validation for the `retriever_backend` field, which is a key part of the LibrarianConfig contract. Add self.assertEqual assertions to verify the `retriever_backend` field matches the expected value in both the default configuration test and the override configuration tests (including the sections at lines 27-35 and 40-47), following the same pattern used for other fields like crossencoder_model, top_k_retrieval, batch_size, and conformal_alpha.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@application/tests/librarian/config_loader_test.py`:
- Around line 20-23: The test_config_is_frozen method has two issues: it doesn't
isolate os.environ when calling load_config(), which can cause spurious failures
if local CRE_LIBRARIAN_* environment variables are set (other tests in the same
file correctly use mock.patch.dict(os.environ, {}, clear=True)), and it uses
assertRaises(Exception) which is too broad and can mask unrelated failures. Wrap
the load_config() call with mock.patch.dict(os.environ, {}, clear=True) to
isolate the environment, and change assertRaises(Exception) to
assertRaises(FrozenInstanceError) from the dataclasses module to assert the
specific exception that should be raised when attempting to modify a frozen
dataclass instance.
In `@application/tests/librarian/dataset_test.py`:
- Around line 110-114: Add a timeout parameter to the subprocess.run() call that
executes the determinism check with _BUILD_SCRIPT and "--check" arguments. This
will prevent the test from hanging indefinitely if the subprocess stalls due to
database locks or I/O issues. Include an appropriate timeout value (in seconds)
in the subprocess.run() function to ensure the test fails fast rather than
blocking the entire test suite.
In `@application/utils/librarian/schemas.py`:
- Line 67: The Pydantic model fields are currently using plain Optional[str]
types which do not enforce RFC format validation, allowing invalid payloads
through and breaking contract parity. Fix this by updating six fields across the
models: SourceRef.url and SourceRef.committed_at, Locator.url,
KnowledgeItem.filtered_at, LinkProposal.classified_at, and
ReviewItem.created_at. Replace all URI fields (url and similar) with AnyUrl type
imported from pydantic module, and replace all date-time fields (committed_at,
filtered_at, classified_at, created_at) with datetime type imported from the
standard library. Pydantic v2 will natively validate RFC 3339 format for
datetime fields and enforce valid URI schemes for AnyUrl fields, maintaining RFC
schema contract requirements.
In `@application/utils/librarian/section_validator.py`:
- Around line 163-166: Replace the assert statement that checks if item.content
is not None with an explicit if check followed by raising
MalformedKnowledgeItemError. The assert statement on line 164 gets stripped
during optimized Python execution (python -O), causing subsequent calls to
item.content.text and item.content.language in _require_text() and
_require_language() to fail with AttributeError instead of the expected typed
exception. Use an if statement that validates item.content is not None, and if
it fails, raise MalformedKnowledgeItemError with an appropriate message about
missing content to ensure consistent validation behavior and error reporting at
this boundary regardless of Python optimization flags.
In `@scripts/build_golden_dataset.py`:
- Line 261: The loop variable `node_id` in the for loop that iterates over
`rows` is never used in the loop body, which triggers a B007 lint error. Rename
this unused loop variable to an underscore `_` to indicate that it is
intentionally unused and to comply with coding guidelines, ensuring that the
loop still correctly unpacks all the values from each row tuple.
---
Outside diff comments:
In `@application/utils/librarian/knowledge_source.py`:
- Around line 29-35: The items method in the KnowledgeSource class does not
handle ValidationError exceptions that can be raised by model_validate_json when
processing malformed lines. Wrap the KnowledgeQueueItem.model_validate_json call
in a try-except block to catch ValidationError, log a warning message when
validation fails for a particular line, and use continue to skip that line and
proceed to the next one instead of allowing the exception to propagate and
terminate the entire iteration. This pattern is already implemented elsewhere in
the codebase such as in section_validator.py and should be applied consistently
here.
---
Nitpick comments:
In `@application/tests/librarian/config_loader_test.py`:
- Around line 12-19: The test assertions for the configuration loader are
incomplete and missing validation for the `retriever_backend` field, which is a
key part of the LibrarianConfig contract. Add self.assertEqual assertions to
verify the `retriever_backend` field matches the expected value in both the
default configuration test and the override configuration tests (including the
sections at lines 27-35 and 40-47), following the same pattern used for other
fields like crossencoder_model, top_k_retrieval, batch_size, and
conformal_alpha.
In `@application/utils/librarian/__init__.py`:
- Line 12: Update the scope documentation at line 12 in the __init__.py file to
reflect the current state of the librarian module. Remove or revise the
statement that says "No linking logic yet" since the PR stack now includes
explicit resolution and semantic retrieval functionality in Module C. Update the
scope note to accurately describe what the module currently contains and
supports.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: a0a9bece-b59f-4c00-a238-d4dd2b76ad18
📒 Files selected for processing (34)
.env.exampleapplication/cmd/cre_main.pyapplication/tests/librarian/__init__.pyapplication/tests/librarian/candidate_retriever_test.pyapplication/tests/librarian/config_loader_test.pyapplication/tests/librarian/dataset_test.pyapplication/tests/librarian/explicit_link_resolver_test.pyapplication/tests/librarian/fixtures/golden_dataset.jsonapplication/tests/librarian/fixtures/golden_dataset.schema.jsonapplication/tests/librarian/fixtures/sample_knowledge_queue.jsonlapplication/tests/librarian/hub_firewall_test.pyapplication/tests/librarian/schemas_test.pyapplication/tests/librarian/scoring_test.pyapplication/tests/librarian/section_validator_test.pyapplication/utils/librarian/__init__.pyapplication/utils/librarian/_rfc_schemas/knowledge-item.jsonapplication/utils/librarian/_rfc_schemas/link-proposal.jsonapplication/utils/librarian/_rfc_schemas/locator.jsonapplication/utils/librarian/_rfc_schemas/proposed-link.jsonapplication/utils/librarian/_rfc_schemas/review-item.jsonapplication/utils/librarian/_rfc_schemas/source-ref.jsonapplication/utils/librarian/candidate_retriever.pyapplication/utils/librarian/config_loader.pyapplication/utils/librarian/explicit_link_resolver.pyapplication/utils/librarian/hub_firewall.pyapplication/utils/librarian/knowledge_source.pyapplication/utils/librarian/schemas.pyapplication/utils/librarian/scoring.pyapplication/utils/librarian/section_validator.pycre.pyrequirements.txtscripts/benchmark_retriever.pyscripts/build_golden_dataset.pyscripts/evaluate_librarian.py
Resolve the CodeRabbit review comments on the Module C PR: - schemas.py: enforce RFC format parity — AnyUrl for url fields, datetime for committed_at/filtered_at/classified_at/created_at - section_validator.py: replace assert with a typed MalformedKnowledgeItemError guard (survives python -O) - knowledge_source.py: skip+log malformed JSONL rows instead of aborting the whole iteration on ValidationError - config_loader_test.py: isolate os.environ and assert the specific FrozenInstanceError - dataset_test.py: add a timeout to the determinism subprocess call - build_golden_dataset.py: rename unused loop var node_id -> _node_id - section_validator_test.py: assert committed_at as the typed datetime Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve the CodeRabbit review comments on the Module C PR: - schemas.py: enforce RFC format parity — AnyUrl for url fields, datetime for committed_at/filtered_at/classified_at/created_at - section_validator.py: replace assert with a typed MalformedKnowledgeItemError guard (survives python -O) - knowledge_source.py: skip+log malformed JSONL rows instead of aborting the whole iteration on ValidationError - config_loader_test.py: isolate os.environ and assert the specific FrozenInstanceError - dataset_test.py: add a timeout to the determinism subprocess call - build_golden_dataset.py: rename unused loop var node_id -> _node_id - section_validator_test.py: assert committed_at as the typed datetime
…ASP#925) GHSA pydantic ReDoS affects >=2.0.0,<2.4.0; first patched in 2.4.0.
3c447bf to
695a016
Compare
week_3: Module C (The Librarian) — C.1 semantic candidate retriever + pipeline switch
Overview
Week 1 built the judge (golden dataset + scorer + harness). Week 2 built the parts that must never guess — the C.0 input boundary and the deterministic explicit-CRE fast path. This is the first part that does guess.
The problem: the explicit fast path only fires when a CRE id (
ddd-ddd) is literally written in the chunk. Most chunks just describe a security concept in plain English, so for those C must find which of OpenCRE's CRE nodes the text is about. That is a search problem — and the first place in the pipeline that can be wrong.This PR's role: build the search step (C.1). Embed the section text, cosine-match it against the CRE-node vector hub, and return the top-K (default 20) rank-ordered candidates — the shortlist the W4 cross-encoder reranks and later weeks turn into a confident yes/no. Because Week 1 already built the judge, this first ML component gets a measured score from day one (recall@k, see Results).
Scope: 1 new module + benchmark + 1 new test, plus config/CLI/harness wiring. No frontend, no migration, no behaviour change to OpenCRE proper. (The pgvector schema migration lands in W8 — see "not here".)
What changed
candidate_retriever.py(new)retrieve()seam, both emitting the same RFCRetrievalAudit(candidates[]rank-ordered withscore_vector;reranked[]empty until W4).CandidateRetriever— in-memory sklearn cosine over an injectedembed_fn+ immutableCandidatePool.PgVectorRetriever— pushes the cosine into Postgres via the<=>operator over anembedding_veccolumn (scored1 - distance), never loading the hub into RAM.build_retriever(backend, ...)is the factory. DI keeps the module import-light and hermetically testable (same pattern as W2's resolver). Reuses OpenCRE's embedding stack (PromptHandler.get_text_embeddings,db.get_embeddings_by_doc_type,cosine_similarity).candidate_retriever.py,config_loader.pyRetrieverBackendenum +CRE_LIBRARIAN_RETRIEVER_BACKEND. The RFC is silent on retrieval tech (it mandates only the audit trail), so the choice is ours.in_memoryis the only backend that runs on SQLite, so CI/dev/harness use it;pgvectoris the proposal's primary for prod Postgres.candidate_retriever.pyCandidatePool.from_mappingrejects an empty or ragged hub;retrieveasserts query-vector width == stored CRE-vector width. A silent dim mismatch makes every cosine score meaningless → fails loudly (DimensionMismatchError). Postgres enforces this structurally via the fixed-widthvector(dim)column.config_loader.py,.env.exampleCRE_LIBRARIAN_RETRIEVER_BACKEND(validated), plus a documented Librarian env block.cre.py,cre_main.py--run_librarian/--librarian_dry_run(+--librarian_source): read knowledge-queue sections → explicit resolver (C.0.5) →build_retriever-selected backend (C.1) → log the shortlist per section. Writes land W8, so this is dry-run only;--run_librarianwithout writes warns and behaves identically.benchmark_retriever.py(new)embedding_veccolumn present.evaluate_librarian.py--use_live_embeddingsmeasures retrieval recall@k over the positive slice against the real DB + embedding model. Offline (CI default) it stays honest: the retriever is wired, but recall@k needs live embeddings — there is no offline number because the pool must be the real CRE vectors, and seeding it from golden text is exactly the leakage the hub firewall strips.candidate_retriever_test.py(new)score_vector/cre_namepopulation, W3 audit shape; pgvector parsing + SQL-binding via a fake recording connection (no DB needed); and thebuild_retrieverfactory routing + arg-mismatch guards.How the pieces connect
flowchart TB sec["Section (from C.0, W2)"] res["explicit_link_resolver (C.0.5, W2)<br/>regex ddd-ddd, no ML"] subgraph C1["C.1 — semantic retriever (this PR)"] embed["embed_fn(text)<br/>(PromptHandler.get_text_embeddings)"] factory["build_retriever(backend)"] inmem["CandidateRetriever<br/>sklearn cosine + top-K<br/>(SQLite / CI / harness)"] pg["PgVectorRetriever<br/>embedding_vec <=> query<br/>(Postgres prod)"] audit["RetrievalAudit<br/>candidates[] (score_vector)<br/>reranked[] = [] (W4)"] end rev["unknown / conflicting -> review (W2)"] rerank["CrossEncoderRanker (C.2, W4)<br/>reranks shortlist -> top 5"] sec --> res res -- "resolved" --> done["deterministic link"] res -- "no_reference" --> embed res -- "unknown / conflicting" --> rev embed --> factory factory --> inmem factory --> pg inmem --> audit pg --> audit audit --> rerankResults
score_vectorset,reranked[]empty (W4 fills it)What is intentionally not here
The cross-encoder rerank (C.2, W4), decision/threshold routing (C.3–C.4, W5), SafetyGuard (W5), and graph writes (W8) are later weeks.
reranked[]is therefore empty, and the pipeline is dry-run only.The pgvector schema migration is intentionally a W8 deliverable.
PgVectorRetriever+ thebuild_retrieverfactory + the benchmark ship here (the W3 code deliverable), fully unit-tested via a fake connection. But theembedding_veccolumn it queries is created by an Alembic migration that the deliverables plan schedules for W8 (live integration) — where it can be validated against real Postgres (flask db upgrade/downgrade,make alembic-guardrail, backfill, ivfflat tuning). Until thenpgvectoris selectable by config but only runnable once W8 lands the column;in_memoryis the default and the only backend used on SQLite/CI. Shipping the retriever code ahead of the schema keeps this PR free of an unvalidated shared-table migration.How to verify locally