Conversation
…routing Addresses issues #93, #94, and #95. Issue #93 – Driver dependency - Move psycopg2-binary to an optional 'postgres' extra - Add optional pyodbc and aioodbc as a new 'mssql' extra - Remove unconditional top-level `import psycopg2`; replace the `psycopg2.errors.UndefinedObject` check with `_is_undefined_object_error`, which lazily imports psycopg2 when available and falls back to checking the SQLSTATE pgcode (42704) otherwise Issue #94 – Hardcoded async DSN rewriting - Add `make_async_dsn()` helper that parses the DSN dialect with SQLAlchemy's `make_url` and rewrites the driver via `_ASYNC_DRIVER_MAP` (postgresql → asyncpg, mssql → aioodbc); raises ValueError for unknown dialects instead of silently producing a broken DSN - Replace the hardcoded `db_dsn.replace("postgresql://", ...)` in `create_db_engine` with a call to `make_async_dsn` Issue #95 – PostgreSQL search_path schema selection - Replace `SET search_path TO <schema>` (PostgreSQL-only) with `engine.execution_options(schema_translate_map={None: schema_name})`, which is dialect-neutral and works for PostgreSQL, MS-SQL, and DuckDB - The `set_db_settings` connect-event hook is now only wired up for DuckDB's `file_search_path`; update its docstring accordingly - Add `schema_name` parameter to `get_metadata` and pass it to `MetaData.reflect(engine, schema=schema_name)` so reflection targets the right schema without relying on search_path - Update `make_tables_file` and `remove_db_tables` to forward schema_name to `get_metadata` Also adds: - examples/omop-mssql/ with README documenting all eight MS-SQL challenges (challenges 1–3 link to GitHub issues #93–#95) - tests/test_utils_mssql.py with 19 unit tests covering all new helpers and the schema_translate_map behaviour Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add sqlalchemy.dialects.mssql import to serialize_metadata.py
- Map PostgreSQL-only type strings to cross-dialect equivalents:
TSVECTOR → sqltypes.Text (no MS-SQL equivalent)
BYTEA → sqltypes.LargeBinary (renders as VARBINARY(MAX) on MS-SQL)
CIDR → sqltypes.String(43) (stored as plain string)
REAL → sqltypes.REAL (generic, replaces postgresql.REAL)
- Add _mssql_varbinary_parser to handle VARBINARY, VARBINARY(n),
VARBINARY(max), and VARBINARY(MAX)
- Register parsers for MS-SQL-native types: UNIQUEIDENTIFIER, DATETIMEOFFSET(n),
DATETIME2(n), BINARY(n), MONEY, SMALLMONEY, IMAGE, TINYINT, SMALLDATETIME,
NTEXT, SQL_VARIANT, ROWVERSION
- Place DATETIME2/DATETIMEOFFSET before DATETIME in parsy.alt() to avoid
prefix-collision: ordered-choice parsers must try longer names first
- Rename time_type parameter pg_type → tz_type (no behavioural change)
- Add mssql.UNIQUEIDENTIFIER → UUID generator mapping in make.py
- Add tests/test_serialize_metadata_mssql.py: 50 tests covering all new
MS-SQL types, PostgreSQL degradation mappings, and regression coverage
for all pre-existing type strings
- Update examples/omop-mssql/README.md challenge 4 with issue link #96
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add SERIAL, BIGSERIAL and SMALLSERIAL to the type parser so orm.yaml files using PostgreSQL pseudo-types can be loaded when targeting MS-SQL. Add a @compiles(CreateColumn, "mssql") hook that strips the IDENTITY modifier from DDL — MS-SQL rejects explicit INSERTs into IDENTITY columns without SET IDENTITY_INSERT ON, which datafaker never issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Three new tests for _is_undefined_object_error: - pyodbc-style error (SQLSTATE in args[0], no pgcode) → False, documenting that the current pgcode-based fallback does not cover MS-SQL/pyodbc errors - pgcode=None → False - SQLAlchemy ProgrammingError wrapper → only e.orig matches, not the wrapper itself, confirming the call-site convention in remove_vocab_foreign_key_constraints Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
set_db_settings is DuckDB-only; fixing the autocommit toggle in isolation would leave the SET syntax still broken for MS-SQL. Both should be addressed together if the function is ever extended. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add Setup section to README covering poetry install --extras mssql, homebrew ODBC driver installation, odbcinst.ini registration, and .env configuration with the correct driver 18 DSN format - Add .env.example with placeholder DSN for safe version control (.env itself is gitignored) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All commands now use relative paths (./orm.yaml etc.) and a cd note makes clear they must be run from examples/omop-mssql/ so the .env file is picked up correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ODBC Driver 18 defaults to Encrypt=yes with certificate validation. SQL Server's self-signed certificate fails that check, causing a login timeout. TrustServerCertificate=yes bypasses validation for local/dev containers where a proper cert is not in place. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…loses #101) Strip leading schema qualifier from FK targets before constructing ForeignKey objects so that 3-part orm.yaml references such as mimic100.concept.concept_id resolve correctly against a MetaData whose tables carry no schema prefix. Add @compiles(CreateTable, "mssql") hook to remove ON DELETE CASCADE from MS-SQL DDL, preventing error 1785 when multiple FK columns on one table all point at the same vocabulary table (common in OMOP-style schemas). Both fixes have tests. Also adds pytest to dev dependencies and the omop-mssql example config.yaml. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove the remove_mssql_identity DDL hook that stripped IDENTITY from CREATE TABLE, and stop supplying explicit PK values in generated df.py files. SQLAlchemy now omits single-column integer PKs from INSERT statements and SQL Server generates them via IDENTITY(1,1). Also fix column_value() to use NEWID() instead of random() on MS-SQL, which has no random() function. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Consolidate the OMOP example under examples/omop-mssql. The mimic_omop orm.yaml becomes orm_full.yaml; a trimmed orm.yaml and a copy_vocabulary.sql helper script are added alongside it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace PostgreSQL-only raw SQL in configure-generators with SQLAlchemy expression API so the same code works on all supported dialects: - generators/mimesis.py: use sqlalchemy.extract() + cast() instead of EXTRACT(YEAR FROM ...) string; compiled SQL strings stored on MimesisDateTimeGenerator now use DATEPART(year,...) on MS-SQL and EXTRACT(YEAR FROM ...) on PostgreSQL/DuckDB. - generators/base.py: use func.stdev on MS-SQL and func.stddev elsewhere instead of the hard-coded STDDEV() string that MS-SQL does not recognise. - make.py: remove dead _YEAR_SUMMARY_QUERY constant and GeneratorInfo.summary_query field (stored but never read anywhere in the codebase). - tests/test_generators_dialect.py: new tests asserting that the compiled SQL contains the correct function name per dialect. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…loses #106) Replace raw RANDOM()/LIMIT SQL in choice.py with SQLAlchemy expression API: - New _choice_stmt() helper builds SELECT expressions using func.random()/ func.newid() and .limit()/.subquery(); SQLAlchemy compiles these to RANDOM()/LIMIT on PostgreSQL/DuckDB and NEWID()/TOP on MS-SQL automatically. - ChoiceGenerator.__init__() gains an optional dialect= parameter; the stored _query string (written to src-stats.yaml for later make-stats execution) is now compiled against the source engine's dialect at configure-generators time. - ChoiceGeneratorFactory.get_generators() builds both live queries as SQLAlchemy select() expressions and passes engine.dialect to all ChoiceGenerator constructors. - The suppress-only subquery no longer contains ORDER BY without TOP, which MS-SQL rejects in derived table expressions. - tests/test_generators_dialect.py: 7 new tests covering stored query SQL for all four sample/suppress combinations and both dialect-correct live queries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…orFactory ChoiceGeneratorFactory.get_generators() was using table(table_name) to build the FROM clause, which creates a lightweight TableClause with no schema information. On MS-SQL databases with a non-default schema (e.g. dbo.person vs myschema.person) this produced 'Invalid object name' errors. Fix: use column.table directly as the select_from() target. This is the actual SQLAlchemy Table object reflected from the source DB, which carries schema metadata and compiles to a fully-qualified name (e.g. [myschema].[person] on MS-SQL, myschema.person on PostgreSQL). Add test_schema_qualified_table_appears_in_from to verify the schema name appears in the compiled SQL for both MS-SQL and PostgreSQL dialects. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Buckets.make_buckets() and Buckets.__init__() used table(table_name) (a schema-less TableClause) and raw text() SQL, so queries against schema-qualified databases produced unqualified table names and were rejected with "Invalid object name" on MS-SQL. Both methods now accept an optional src_table parameter (a SQLAlchemy Table object with schema metadata). When provided, select_from() uses it directly, preserving the schema qualifier in the FROM clause. Callers in continuous.py and mimesis.py pass column.table. Buckets.__init__() is also converted from raw text() to a SQLAlchemy expression using func.floor() and group_by(floor_expr), which avoids the GROUP BY alias pattern that MS-SQL rejects. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Six locations in generators and the interactive shell still emitted RANDOM() and LIMIT which MS-SQL does not support. Group A — interactive shell (base.py, table.py, generators.py): Replace sqlalchemy.text() f-strings with SQLAlchemy expression API. func.newid() vs func.random() chosen per dialect.name == "mssql"; .limit(n) compiles to TOP n / LIMIT n automatically. Group B — CovariateQuery (continuous.py): Add dialect_name parameter to __init__(); _inner_query() now emits SELECT TOP n … ORDER BY NEWID() on MS-SQL and SELECT * … ORDER BY RANDOM() LIMIT n elsewhere. dialect_name passed from get_generators(). Group C — MissingnessType (missingness.py): sampled_query() accepts dialect_name; returns the MS-SQL TOP/NEWID form when dialect_name == "mssql". do_sampled() passes self.sync_engine.dialect.name. Tests added in test_generators_dialect.py (CovariateQuery, Missingness) and new test_interactive_dialect.py (peek, _get_column_data, print_column_data) covering both MS-SQL and PostgreSQL. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace sa_table(self.table_name()) — which creates a schema-less TableClause — with self.table_metadata() in do_peek(), print_column_data(), print_row_data(), and _get_column_data(). This fixes "Invalid object name 'person'" (42S02) on MS-SQL when tables live in a non-default schema (e.g. dbo.person). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
schema_translate_map (set by create_db_engine) only rewrites SQLAlchemy Core constructs — not text() raw SQL. Three sites still built text() with bare table names, causing 42S02 'Invalid object name' on MS-SQL schemas: - do_counts (base.py): converted to select_from(self.table_metadata()) so the translate map applies, same pattern as do_peek / _get_column_data. - _get_generators_from_buckets (continuous.py): now receives the Table object (not just a name string) and uses select_from(src_table); also replaces raw LN/STDDEV SQL with func.log/func.stddev_samp for dialect correctness. - MultivariateNormalGeneratorFactory.get_generators (continuous.py): qualifies the table name via schema_qualified_name() before passing to CovariateQuery so the stored SQL string contains the correct schema prefix at create-data time; the unqualified name is kept for SRC_STATS config keys. Adds schema_qualified_name() utility to utils.py that reads the engine's schema_translate_map execution option. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…-generators _get_aggregate_query builds SQL stored in config.yaml's src-stats section and executed by make-stats via text() — not affected by schema_translate_map. Two call sites now pass schema_qualified_name(entry.name, engine) so the stored SQL contains the schema prefix (e.g. mimic100.person). SELECT_AGGREGATE_RE regex updated to match schema.table forms, and the auto__ name check now strips the schema prefix before comparing with the config key so PredefinedGenerator still recognises its own clauses when re-reading a config that was written with schema-qualified table names. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…oses #108) ChoiceGenerator compiled self._query at construction time using table(table_name), which treated a schema-qualified string like "mimic100.person" as a single identifier and produced [mimic100.person] on MS-SQL. Using column.table dropped the schema because dict_to_metadata creates Table objects without schema set. Fix: pass the schema-qualified name (from schema_qualified_name()) as table_sql and use text(table_sql) in _choice_stmt, which compiles verbatim without quoting. The SRC_STATS config key and self.table_name remain unqualified; only the stored SQL string is qualified. Also refactors _get_aggregate_query to qualify the table name internally rather than requiring callers to pre-qualify it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description for the MSSQL work in issue 109