Update normalization parameters and add estimator params validation #210

avolkov-intel wants to merge 6 commits into IntelPython:main
Conversation
```diff
 "data": {
   "dataset": "hepmass",
-  "split_kwargs": { "train_size": 0.1, "test_size": null }
+  "split_kwargs": { "train_size": 0.1, "test_size": null },
```
Seems like this is the only case where benchmark behavior changes - is it intended?
I think it was done for a reason, but let me check the convergence for both options.
Suggested change:

```diff
-"dataset" : "cifar",
-"split_kwargs": { "ignore" : true },
+"dataset": "cifar",
+"split_kwargs": { "ignore": true },
```
```diff
 "dataset": "mnist",
 "split_kwargs": { "train_size": 20000, "test_size": null },
-"preprocessing_kwargs": { "normalize": false }
+"preprocessing_kwargs": {"normalize" : null}
```
Suggested change:

```diff
-"preprocessing_kwargs": {"normalize" : null}
+"preprocessing_kwargs": { "normalize": null }
```
A minor adjustment, but it looks like spaces have been added before colons in a few places throughout the configs.
```diff
 },
 "data": {
-  "preprocessing_kwargs": { "normalize": true }
+  "preprocessing_kwargs": { "normalize": "standard" }
```
Would the jsons with the datasets be able to override this? For example:
https://github.com/avolkov-intel/scikit-learn_bench/blob/a2a75152d6bae2cf3bcdf350125369b04f289b33/configs/regular/knn.json#L8
It's not clear from the docs how it would work when specified more than once.
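For a dataset-level setting to override the common one, the loader would presumably need to deep-merge the two JSON objects, with the more specific config winning on key collisions. This is a minimal sketch of that idea; `deep_merge` is a hypothetical helper, not the repo's actual config-resolution code:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge ``override`` into ``base`` without mutating either.

    Nested dicts are merged key by key; any other value in ``override``
    replaces the corresponding value in ``base``.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# The common template sets "standard" normalization; the dataset config
# overrides it with null (i.e. None), keeping any sibling keys intact.
common = {"data": {"preprocessing_kwargs": {"normalize": "standard"}}}
dataset_cfg = {"data": {"preprocessing_kwargs": {"normalize": None}}}
print(deep_merge(common, dataset_cfg))
# {'data': {'preprocessing_kwargs': {'normalize': None}}}
```

With last-wins merging like this, "specified more than once" would simply mean the most specific config takes effect, but that behavior should be confirmed against the docs.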
```
tabulate
fastparquet
h5py
openml
```
Looks like you need to rebase from the current main.
```python
else:
    logger.warning(f'Unknown "{normalize}" normalization type.')
if scaler is not None:
    return pd.DataFrame(scaler.fit_transform(x), columns=x.columns, index=x.index)
```
Wouldn't this make it ignore return_type == np.ndarray?
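One way to avoid that is to branch on the requested type before wrapping the result in a DataFrame. This is only a sketch of the idea, not the PR's code; the `normalize` signature and the `"standard"`/`"minmax"` names here are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def normalize(x: pd.DataFrame, kind: str = "standard", return_type=pd.DataFrame):
    """Scale ``x`` with the named scaler, honoring the requested return type."""
    scalers = {"standard": StandardScaler, "minmax": MinMaxScaler}
    scaler_cls = scalers.get(kind)
    if scaler_cls is None:
        # Unknown normalization type: pass data through unchanged
        # (a real implementation would also log a warning here).
        return x.to_numpy() if return_type is np.ndarray else x
    transformed = scaler_cls().fit_transform(x)  # always returns an ndarray
    if return_type is np.ndarray:
        return transformed
    return pd.DataFrame(transformed, columns=x.columns, index=x.index)
```

The key point is that `fit_transform` already yields an `np.ndarray`, so the DataFrame wrapping should only happen when a DataFrame was actually requested.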
Description
Estimator parameter validation is useful when you want to override a parameter such as n_jobs via -p algorithm:estimator_params:n_jobs=64 across the benchmarks. Currently this fails because some estimators don't support n_jobs, so you would have to run those benchmarks separately. With this change, the parameter is simply ignored by estimators that don't support it, and a warning is shown.
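The validation idea can be sketched with scikit-learn's `get_params()`, which lists the parameters an estimator accepts. This is an illustration of the approach, not the PR's actual implementation; `filter_params` is a hypothetical helper:

```python
import warnings

from sklearn.cluster import DBSCAN
from sklearn.linear_model import Ridge


def filter_params(estimator_cls, params: dict) -> dict:
    """Keep only the parameters ``estimator_cls`` supports, warning on the rest."""
    supported = estimator_cls().get_params().keys()
    valid = {}
    for name, value in params.items():
        if name in supported:
            valid[name] = value
        else:
            warnings.warn(
                f"{estimator_cls.__name__} does not support {name!r}; ignoring it."
            )
    return valid


print(filter_params(DBSCAN, {"n_jobs": 64}))  # {'n_jobs': 64}
print(filter_params(Ridge, {"n_jobs": 64}))   # {} (Ridge has no n_jobs; a warning is emitted)
```

This way a single `-p` override can be applied to a whole benchmark suite, and estimators without the parameter just skip it instead of raising.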
Checklist:
- Completeness and readability
- Testing