Mssql by myyong · Pull Request #110 · SAFEHR-data/datafaker

myyong · 2026-05-26T09:57:09Z

Description for the MSSQL work in issue 109

…routing Addresses issues #93, #94, and #95. Issue #93 – Driver dependency - Move psycopg2-binary to an optional 'postgres' extra - Add optional pyodbc and aioodbc as a new 'mssql' extra - Remove unconditional top-level `import psycopg2`; replace the `psycopg2.errors.UndefinedObject` check with `_is_undefined_object_error`, which lazily imports psycopg2 when available and falls back to checking the SQLSTATE pgcode (42704) otherwise Issue #94 – Hardcoded async DSN rewriting - Add `make_async_dsn()` helper that parses the DSN dialect with SQLAlchemy's `make_url` and rewrites the driver via `_ASYNC_DRIVER_MAP` (postgresql → asyncpg, mssql → aioodbc); raises ValueError for unknown dialects instead of silently producing a broken DSN - Replace the hardcoded `db_dsn.replace("postgresql://", ...)` in `create_db_engine` with a call to `make_async_dsn` Issue #95 – PostgreSQL search_path schema selection - Replace `SET search_path TO <schema>` (PostgreSQL-only) with `engine.execution_options(schema_translate_map={None: schema_name})`, which is dialect-neutral and works for PostgreSQL, MS-SQL, and DuckDB - The `set_db_settings` connect-event hook is now only wired up for DuckDB's `file_search_path`; update its docstring accordingly - Add `schema_name` parameter to `get_metadata` and pass it to `MetaData.reflect(engine, schema=schema_name)` so reflection targets the right schema without relying on search_path - Update `make_tables_file` and `remove_db_tables` to forward schema_name to `get_metadata` Also adds: - examples/omop-mssql/ with README documenting all eight MS-SQL challenges (challenges 1–3 link to GitHub issues #93–#95) - tests/test_utils_mssql.py with 19 unit tests covering all new helpers and the schema_translate_map behaviour Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add sqlalchemy.dialects.mssql import to serialize_metadata.py - Map PostgreSQL-only type strings to cross-dialect equivalents: TSVECTOR → sqltypes.Text (no MS-SQL equivalent) BYTEA → sqltypes.LargeBinary (renders as VARBINARY(MAX) on MS-SQL) CIDR → sqltypes.String(43) (stored as plain string) REAL → sqltypes.REAL (generic, replaces postgresql.REAL) - Add _mssql_varbinary_parser to handle VARBINARY, VARBINARY(n), VARBINARY(max), and VARBINARY(MAX) - Register parsers for MS-SQL-native types: UNIQUEIDENTIFIER, DATETIMEOFFSET(n), DATETIME2(n), BINARY(n), MONEY, SMALLMONEY, IMAGE, TINYINT, SMALLDATETIME, NTEXT, SQL_VARIANT, ROWVERSION - Place DATETIME2/DATETIMEOFFSET before DATETIME in parsy.alt() to avoid prefix-collision: ordered-choice parsers must try longer names first - Rename time_type parameter pg_type → tz_type (no behavioural change) - Add mssql.UNIQUEIDENTIFIER → UUID generator mapping in make.py - Add tests/test_serialize_metadata_mssql.py: 50 tests covering all new MS-SQL types, PostgreSQL degradation mappings, and regression coverage for all pre-existing type strings - Update examples/omop-mssql/README.md challenge 4 with issue link #96 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@compiles

Add SERIAL, BIGSERIAL and SMALLSERIAL to the type parser so orm.yaml files using PostgreSQL pseudo-types can be loaded when targeting MS-SQL. Add a @compiles(CreateColumn, "mssql") hook that strips the IDENTITY modifier from DDL — MS-SQL rejects explicit INSERTs into IDENTITY columns without SET IDENTITY_INSERT ON, which datafaker never issues. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Three new tests for _is_undefined_object_error: - pyodbc-style error (SQLSTATE in args[0], no pgcode) → False, documenting that the current pgcode-based fallback does not cover MS-SQL/pyodbc errors - pgcode=None → False - SQLAlchemy ProgrammingError wrapper → only e.orig matches, not the wrapper itself, confirming the call-site convention in remove_vocab_foreign_key_constraints Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

set_db_settings is DuckDB-only; fixing the autocommit toggle in isolation would leave the SET syntax still broken for MS-SQL. Both should be addressed together if the function is ever extended. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add Setup section to README covering poetry install --extras mssql, homebrew ODBC driver installation, odbcinst.ini registration, and .env configuration with the correct driver 18 DSN format - Add .env.example with placeholder DSN for safe version control (.env itself is gitignored) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

All commands now use relative paths (./orm.yaml etc.) and a cd note makes clear they must be run from examples/omop-mssql/ so the .env file is picked up correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ODBC Driver 18 defaults to Encrypt=yes with certificate validation. SQL Server's self-signed certificate fails that check, causing a login timeout. TrustServerCertificate=yes bypasses validation for local/dev containers where a proper cert is not in place. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@compiles

…loses #101) Strip leading schema qualifier from FK targets before constructing ForeignKey objects so that 3-part orm.yaml references such as mimic100.concept.concept_id resolve correctly against a MetaData whose tables carry no schema prefix. Add @compiles(CreateTable, "mssql") hook to remove ON DELETE CASCADE from MS-SQL DDL, preventing error 1785 when multiple FK columns on one table all point at the same vocabulary table (common in OMOP-style schemas). Both fixes have tests. Also adds pytest to dev dependencies and the omop-mssql example config.yaml. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove the remove_mssql_identity DDL hook that stripped IDENTITY from CREATE TABLE, and stop supplying explicit PK values in generated df.py files. SQLAlchemy now omits single-column integer PKs from INSERT statements and SQL Server generates them via IDENTITY(1,1). Also fix column_value() to use NEWID() instead of random() on MS-SQL, which has no random() function. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Consolidate the OMOP example under examples/omop-mssql. The mimic_omop orm.yaml becomes orm_full.yaml; a trimmed orm.yaml and a copy_vocabulary.sql helper script are added alongside it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace PostgreSQL-only raw SQL in configure-generators with SQLAlchemy expression API so the same code works on all supported dialects: - generators/mimesis.py: use sqlalchemy.extract() + cast() instead of EXTRACT(YEAR FROM ...) string; compiled SQL strings stored on MimesisDateTimeGenerator now use DATEPART(year,...) on MS-SQL and EXTRACT(YEAR FROM ...) on PostgreSQL/DuckDB. - generators/base.py: use func.stdev on MS-SQL and func.stddev elsewhere instead of the hard-coded STDDEV() string that MS-SQL does not recognise. - make.py: remove dead _YEAR_SUMMARY_QUERY constant and GeneratorInfo.summary_query field (stored but never read anywhere in the codebase). - tests/test_generators_dialect.py: new tests asserting that the compiled SQL contains the correct function name per dialect. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…loses #106) Replace raw RANDOM()/LIMIT SQL in choice.py with SQLAlchemy expression API: - New _choice_stmt() helper builds SELECT expressions using func.random()/ func.newid() and .limit()/.subquery(); SQLAlchemy compiles these to RANDOM()/LIMIT on PostgreSQL/DuckDB and NEWID()/TOP on MS-SQL automatically. - ChoiceGenerator.__init__() gains an optional dialect= parameter; the stored _query string (written to src-stats.yaml for later make-stats execution) is now compiled against the source engine's dialect at configure-generators time. - ChoiceGeneratorFactory.get_generators() builds both live queries as SQLAlchemy select() expressions and passes engine.dialect to all ChoiceGenerator constructors. - The suppress-only subquery no longer contains ORDER BY without TOP, which MS-SQL rejects in derived table expressions. - tests/test_generators_dialect.py: 7 new tests covering stored query SQL for all four sample/suppress combinations and both dialect-correct live queries. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…orFactory ChoiceGeneratorFactory.get_generators() was using table(table_name) to build the FROM clause, which creates a lightweight TableClause with no schema information. On MS-SQL databases with a non-default schema (e.g. dbo.person vs myschema.person) this produced 'Invalid object name' errors. Fix: use column.table directly as the select_from() target. This is the actual SQLAlchemy Table object reflected from the source DB, which carries schema metadata and compiles to a fully-qualified name (e.g. [myschema].[person] on MS-SQL, myschema.person on PostgreSQL). Add test_schema_qualified_table_appears_in_from to verify the schema name appears in the compiled SQL for both MS-SQL and PostgreSQL dialects. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Buckets.make_buckets() and Buckets.__init__() used table(table_name) (a schema-less TableClause) and raw text() SQL, so queries against schema-qualified databases produced unqualified table names and were rejected with "Invalid object name" on MS-SQL. Both methods now accept an optional src_table parameter (a SQLAlchemy Table object with schema metadata). When provided, select_from() uses it directly, preserving the schema qualifier in the FROM clause. Callers in continuous.py and mimesis.py pass column.table. Buckets.__init__() is also converted from raw text() to a SQLAlchemy expression using func.floor() and group_by(floor_expr), which avoids the GROUP BY alias pattern that MS-SQL rejects. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Six locations in generators and the interactive shell still emitted RANDOM() and LIMIT which MS-SQL does not support. Group A — interactive shell (base.py, table.py, generators.py): Replace sqlalchemy.text() f-strings with SQLAlchemy expression API. func.newid() vs func.random() chosen per dialect.name == "mssql"; .limit(n) compiles to TOP n / LIMIT n automatically. Group B — CovariateQuery (continuous.py): Add dialect_name parameter to __init__(); _inner_query() now emits SELECT TOP n … ORDER BY NEWID() on MS-SQL and SELECT * … ORDER BY RANDOM() LIMIT n elsewhere. dialect_name passed from get_generators(). Group C — MissingnessType (missingness.py): sampled_query() accepts dialect_name; returns the MS-SQL TOP/NEWID form when dialect_name == "mssql". do_sampled() passes self.sync_engine.dialect.name. Tests added in test_generators_dialect.py (CovariateQuery, Missingness) and new test_interactive_dialect.py (peek, _get_column_data, print_column_data) covering both MS-SQL and PostgreSQL. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace sa_table(self.table_name()) — which creates a schema-less TableClause — with self.table_metadata() in do_peek(), print_column_data(), print_row_data(), and _get_column_data(). This fixes "Invalid object name 'person'" (42S02) on MS-SQL when tables live in a non-default schema (e.g. dbo.person). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

schema_translate_map (set by create_db_engine) only rewrites SQLAlchemy Core constructs — not text() raw SQL. Three sites still built text() with bare table names, causing 42S02 'Invalid object name' on MS-SQL schemas: - do_counts (base.py): converted to select_from(self.table_metadata()) so the translate map applies, same pattern as do_peek / _get_column_data. - _get_generators_from_buckets (continuous.py): now receives the Table object (not just a name string) and uses select_from(src_table); also replaces raw LN/STDDEV SQL with func.log/func.stddev_samp for dialect correctness. - MultivariateNormalGeneratorFactory.get_generators (continuous.py): qualifies the table name via schema_qualified_name() before passing to CovariateQuery so the stored SQL string contains the correct schema prefix at create-data time; the unqualified name is kept for SRC_STATS config keys. Adds schema_qualified_name() utility to utils.py that reads the engine's schema_translate_map execution option. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…-generators _get_aggregate_query builds SQL stored in config.yaml's src-stats section and executed by make-stats via text() — not affected by schema_translate_map. Two call sites now pass schema_qualified_name(entry.name, engine) so the stored SQL contains the schema prefix (e.g. mimic100.person). SELECT_AGGREGATE_RE regex updated to match schema.table forms, and the auto__ name check now strips the schema prefix before comparing with the config key so PredefinedGenerator still recognises its own clauses when re-reading a config that was written with schema-qualified table names. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…oses #108) ChoiceGenerator compiled self._query at construction time using table(table_name), which treated a schema-qualified string like "mimic100.person" as a single identifier and produced [mimic100.person] on MS-SQL. Using column.table dropped the schema because dict_to_metadata creates Table objects without schema set. Fix: pass the schema-qualified name (from schema_qualified_name()) as table_sql and use text(table_sql) in _choice_stmt, which compiles verbatim without quoting. The SRC_STATS config key and self.table_name remain unqualified; only the stored SQL string is qualified. Also refactors _get_aggregate_query to qualify the table name internally rather than requiring callers to pre-qualify it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

myyong and others added 30 commits April 10, 2026 15:35

Update to ignore opiates

c0897d4

Merge branch 'main' of https://github.com/SAFEHR-data/datafaker

834a20b

Template omop orm.yaml, config.yaml and df.py files

a053742

Fix ordered list formatting

675dd92

Fix pre-commit hooks.

54d13b5

Fix headers that broke pre-commit hook

1c42c49

Organised UI commands

b311776

Add commit link to challenge 5 in MS-SQL README

fabb0cc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add issue #98 link for challenge 6 (UUID type mapping) to MS-SQL README

dfb698f

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add commit links for challenge 7 to MS-SQL README

0925a93

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge branch 'ui-update' into mssql

3c3c20a

Fix Steps paths in MS-SQL README to run from example directory

db2e48b

All commands now use relative paths (./orm.yaml etc.) and a cd note makes clear they must be run from examples/omop-mssql/ so the .env file is picked up correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Rename mimic_omop example to omop-mssql

40fc121

Consolidate the OMOP example under examples/omop-mssql. The mimic_omop orm.yaml becomes orm_full.yaml; a trimmed orm.yaml and a copy_vocabulary.sql helper script are added alongside it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add omop-postgresql example and missing test fixture

1857e76

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Update omop-postgresql README to use relative paths

ad98acf

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

myyong and others added 4 commits May 22, 2026 13:13

Config files for omop-mssql and omop-postgresl

3aefaa2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mssql#110

Mssql#110
myyong wants to merge 34 commits into
mainfrom
mssql

myyong commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

myyong commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant