Update normalization parameters and add estimator params validation #210

avolkov-intel wants to merge 6 commits into IntelPython:main
Conversation
```diff
 "data": {
   "dataset": "hepmass",
-  "split_kwargs": { "train_size": 0.1, "test_size": null }
+  "split_kwargs": { "train_size": 0.1, "test_size": null },
```
Seems like this is the only case where benchmark behavior changes - is it intended?
I think it was done for a reason, but let me check the convergence for both options.
Suggested change:

```diff
-"dataset" : "cifar",
-"split_kwargs": { "ignore" : true },
+"dataset": "cifar",
+"split_kwargs": { "ignore": true },
```
```diff
 "dataset": "mnist",
 "split_kwargs": { "train_size": 20000, "test_size": null },
-"preprocessing_kwargs": { "normalize": false }
+"preprocessing_kwargs": {"normalize" : null}
```
Suggested change:

```diff
-"preprocessing_kwargs": {"normalize" : null}
+"preprocessing_kwargs": { "normalize": null }
```
A minor adjustment, but it looks like spaces have been added before colons in a few places throughout the configs.
```diff
 },
 "data": {
-  "preprocessing_kwargs": { "normalize": true }
+  "preprocessing_kwargs": { "normalize": "standard" }
```
Would the jsons with the datasets be able to override this? For example:
https://github.com/avolkov-intel/scikit-learn_bench/blob/a2a75152d6bae2cf3bcdf350125369b04f289b33/configs/regular/knn.json#L8
It's not clear from the docs how it would work when specified more than once.
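For a dataset-level setting to override the common one, the loader would presumably need to deep-merge the two JSON objects, with the more specific config winning on key collisions. This is a minimal sketch of that idea; `deep_merge` is a hypothetical helper, not the repo's actual config-resolution code:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge ``override`` into ``base`` without mutating either.

    Nested dicts are merged key by key; any other value in ``override``
    replaces the corresponding value in ``base``.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# The common template sets "standard" normalization; the dataset config
# overrides it with null (i.e. None), keeping any sibling keys intact.
common = {"data": {"preprocessing_kwargs": {"normalize": "standard"}}}
dataset_cfg = {"data": {"preprocessing_kwargs": {"normalize": None}}}
print(deep_merge(common, dataset_cfg))
# {'data': {'preprocessing_kwargs': {'normalize': None}}}
```

With last-wins merging like this, "specified more than once" would simply mean the most specific config takes effect, but that behavior should be confirmed against the docs.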
```
tabulate
fastparquet
h5py
openml
```
Looks like you need to rebase from the current main.
```python
else:
    logger.warning(f'Unknown "{normalize}" normalization type.')
if scaler is not None:
    return pd.DataFrame(scaler.fit_transform(x), columns=x.columns, index=x.index)
```
Wouldn't this make it ignore return_type == np.ndarray?
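One way to avoid that is to branch on the requested type before wrapping the result in a DataFrame. This is only a sketch of the idea, not the PR's code; the `normalize` signature and the `"standard"`/`"minmax"` names here are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def normalize(x: pd.DataFrame, kind: str = "standard", return_type=pd.DataFrame):
    """Scale ``x`` with the named scaler, honoring the requested return type."""
    scalers = {"standard": StandardScaler, "minmax": MinMaxScaler}
    scaler_cls = scalers.get(kind)
    if scaler_cls is None:
        # Unknown normalization type: pass data through unchanged
        # (a real implementation would also log a warning here).
        return x.to_numpy() if return_type is np.ndarray else x
    transformed = scaler_cls().fit_transform(x)  # always returns an ndarray
    if return_type is np.ndarray:
        return transformed
    return pd.DataFrame(transformed, columns=x.columns, index=x.index)
```

The key point is that `fit_transform` already yields an `np.ndarray`, so the DataFrame wrapping should only happen when a DataFrame was actually requested.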
Description
Estimator parameter validation is useful when you want to override a parameter such as n_jobs via -p algorithm:estimator_params:n_jobs=64 across the benchmarks. Currently this fails because some estimators don't support n_jobs, so you would have to run those benchmarks separately. With this change, the parameter is simply ignored by estimators that don't support it, and a warning is shown.
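The validation idea can be sketched with scikit-learn's `get_params()`, which lists the parameters an estimator accepts. This is an illustration of the approach, not the PR's actual implementation; `filter_params` is a hypothetical helper:

```python
import warnings

from sklearn.cluster import DBSCAN
from sklearn.linear_model import Ridge


def filter_params(estimator_cls, params: dict) -> dict:
    """Keep only the parameters ``estimator_cls`` supports, warning on the rest."""
    supported = estimator_cls().get_params().keys()
    valid = {}
    for name, value in params.items():
        if name in supported:
            valid[name] = value
        else:
            warnings.warn(
                f"{estimator_cls.__name__} does not support {name!r}; ignoring it."
            )
    return valid


print(filter_params(DBSCAN, {"n_jobs": 64}))  # {'n_jobs': 64}
print(filter_params(Ridge, {"n_jobs": 64}))   # {} (Ridge has no n_jobs; a warning is emitted)
```

This way a single `-p` override can be applied to a whole benchmark suite, and estimators without the parameter just skip it instead of raising.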
Checklist:
- Completeness and readability
- Testing