@W-17427085: Set ANNOY related dependencies to be optional (#3858)
Changes:
- Remove `"annoy", "numpy", "pandas", "scikit-learn"` from dependencies
under `pyproject.toml` and add them under optional dependencies
- Created flag `OPTIONAL_DEPENDENCIES_AVAILABLE` in `select_utils.py` to
indicate whether the ANNOY related dependencies are present. If these
optional dependencies are not available, Levenshtein Distance based
selection still applies, even for a high volume of records (i.e.
`complexity_constant >= 1000`).
- Skipped the pytests in `test_select_utils.py` that depend on `pandas`
and the ANNOY related optional dependencies
- Added a warning message for non-zero similarity scores when using
ANNOY (for a high volume of records). Updated the docs as well
- Added an additional workflow to run all unit tests with all optional
dependencies installed
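
The fallback described in the bullets above can be sketched roughly as follows. Only the flag name `OPTIONAL_DEPENDENCIES_AVAILABLE`, the `complexity_constant >= 1000` cutoff, and the list of optional packages come from the commit message. The helper `choose_strategy`, the constant `HIGH_VOLUME_CUTOFF`, and the returned labels are hypothetical and are not taken from `select_utils.py`. The sketch also assumes that Levenshtein-based selection is the default below the cutoff.

```python
# Sketch of an optional-dependency guard; NOT the actual select_utils.py code.
try:
    import numpy  # noqa: F401
    import pandas  # noqa: F401
    import sklearn  # noqa: F401  (import name of the "scikit-learn" package)
    from annoy import AnnoyIndex  # noqa: F401

    OPTIONAL_DEPENDENCIES_AVAILABLE = True
except ImportError:
    # At least one optional package is missing, so ANNOY cannot be used.
    OPTIONAL_DEPENDENCIES_AVAILABLE = False

# Cutoff quoted in the commit message (complexity_constant >= 1000).
HIGH_VOLUME_CUTOFF = 1000


def choose_strategy(
    complexity_constant: int,
    deps_available: bool = OPTIONAL_DEPENDENCIES_AVAILABLE,
) -> str:
    """Hypothetical helper: use ANNOY only for high volume with deps installed."""
    if complexity_constant >= HIGH_VOLUME_CUTOFF and deps_available:
        return "annoy"
    # Assumed default: Levenshtein for low volume or when deps are missing.
    return "levenshtein"
```

Passing `deps_available` explicitly keeps the decision testable on machines where the optional packages are not installed.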
File changed: docs/data.md (3 additions, 0 deletions)
@@ -352,6 +352,9 @@ This parameter is **optional**; if not specified, no threshold will be applied a
This feature is particularly useful during version upgrades, where records that closely match can be selected, while those that do not match sufficiently can be inserted into the target org.

+**Important Note:**
+For high volumes of records, an approximation algorithm is applied to improve performance. In such cases, setting a threshold of `0` may not guarantee the selection of exact matches, as the algorithm can assign a small non-zero similarity score to exact matches. To ensure accurate selection, it is recommended to set the threshold to a small value slightly greater than `0`, such as `0.1`. This ensures both precision and efficiency in the selection process.
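
The effect described in the note can be illustrated with a minimal sketch. It assumes the score behaves like a distance (`0` means identical) and that a record is selected when its score does not exceed the threshold. The function `decide` and the score `0.003` are hypothetical, chosen only for illustration, and do not come from the project.

```python
def decide(score: float, threshold: float) -> str:
    """Hypothetical rule: select when the score is within the threshold."""
    return "select" if score <= threshold else "insert"


exact_score = 0.0     # what an exact comparison would report for identical records
approx_score = 0.003  # made-up small score an approximate search might report

# With a threshold of 0, the approximate score pushes an exact match to "insert".
strict = decide(approx_score, threshold=0.0)

# A small positive threshold such as 0.1 absorbs the approximation error.
tolerant = decide(approx_score, threshold=0.1)
```

Under these assumptions `strict` is `"insert"` and `tolerant` is `"select"`, which matches the recommendation to use a threshold slightly above `0` for high volumes of records.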