ML-DSA SHAKE x4 verify #6

Open
mdcornu wants to merge 5 commits into master from ml_dsa_shake_x4_val
mdcornu commented Apr 27, 2026

Owner
Checklist
  • documentation is added or updated
  • tests are added or updated

mdcornu and others added 4 commits April 28, 2026 09:29
Changes:
- Adds new SHAKE x4 API to perform 4 SHAKE operations in parallel when AVX512VL is supported.
- Adds AVX512VL Keccak x4 assembly module (keccak1600x4-avx512vl).
- Adds internal SHA3 x4 APIs/context in sha3.h and wrappers in sha3_x4.c modules.
- Adds runtime dispatch for ML-DSA sample operations with an OSSL_ML_DSA_SAMPLE_OPS vtable.
  Callers obtain the correct implementation via ossl_ml_dsa_sample_ops(), which returns
  either the generic scalar ops functions, or the AVX512VL multi-buffer ops depending
  on the build and CPU capabilities.
- Adds x86-64 multi-buffer function implementation into ml_dsa_sample_hw_x86_64.inc,
  included in ml_dsa_sample.c when KECCAK1600_ASM and x86_64 are defined.
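
The runtime dispatch described in the list above can be sketched roughly as follows. This is only an illustration: the `sample_ops` struct, its single `sample` member, the two toy sampling functions and `get_sample_ops()` are hypothetical stand-ins, not the members or signatures of `OSSL_ML_DSA_SAMPLE_OPS` or `ossl_ml_dsa_sample_ops()`, which are not shown in this conversation. The capability check uses the GCC/Clang builtin `__builtin_cpu_supports` as an assumption, not necessarily the check the PR uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vtable: one function pointer per sampling operation. */
typedef struct {
    const char *name;
    void (*sample)(uint32_t *out, size_t n, uint32_t seed);
} sample_ops;

/* Toy scalar path: one value per iteration. */
static void sample_scalar(uint32_t *out, size_t n, uint32_t seed)
{
    for (size_t i = 0; i < n; i++)
        out[i] = seed + (uint32_t)i;
}

/* Toy "x4" path: same results, produced four values per iteration. */
static void sample_x4(uint32_t *out, size_t n, uint32_t seed)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4)
        for (size_t lane = 0; lane < 4; lane++)
            out[i + lane] = seed + (uint32_t)(i + lane);
    for (; i < n; i++) /* leftover values that do not fill four lanes */
        out[i] = seed + (uint32_t)i;
}

static const sample_ops scalar_ops = { "scalar", sample_scalar };
static const sample_ops x4_ops = { "x4", sample_x4 };

/* Assumed capability check; returns 0 on non-x86 builds. */
static int cpu_has_avx512vl(void)
{
#if (defined(__x86_64__) || defined(__i386__)) \
    && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512vl") != 0;
#else
    return 0;
#endif
}

/* Callers ask once for the table and then call through it. */
static const sample_ops *get_sample_ops(void)
{
    return cpu_has_avx512vl() ? &x4_ops : &scalar_ops;
}
```

A caller would write `get_sample_ops()->sample(buf, n, seed)` and never name a specific implementation, which keeps the CPU check in one place.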

Co-authored-by: Tomasz Kantecki <tomasz.kantecki@intel.com>
Signed-off-by: Marcel Cornu <marcel.d.cornu@intel.com>
Add a new `sha3_x4_internal_test` target and recipe to validate the
internal SHAKE x4 implementation against scalar SHA3 reference paths.

Cover SHAKE-128 and SHAKE-256 in one-shot and incremental modes, plus
multi-absorb and multi-squeeze cases across varied input and output
sizes. Tests are skipped when AVX512VL extensions are not available.
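
The comparison pattern described above (a 4-lane implementation checked lane by lane against a scalar reference, with the input absorbed in more than one call) might look like the sketch below. It is an assumption-laden illustration: the hash is FNV-1a used as a stand-in, not SHAKE, and `toy_ctx`, `toy_x4_ctx` and every function name are hypothetical rather than taken from `sha3.h` or the new test.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_OFFSET 0xcbf29ce484222325ULL /* FNV-1a 64-bit offset basis */
#define TOY_PRIME  0x100000001b3ULL      /* FNV-1a 64-bit prime */

typedef struct { uint64_t h; } toy_ctx;       /* scalar reference state */
typedef struct { uint64_t h[4]; } toy_x4_ctx; /* four lanes of state */

static void toy_init(toy_ctx *c) { c->h = TOY_OFFSET; }

static void toy_absorb(toy_ctx *c, const uint8_t *in, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        c->h ^= in[i];
        c->h *= TOY_PRIME;
    }
}

static void toy_x4_init(toy_x4_ctx *c)
{
    for (int l = 0; l < 4; l++)
        c->h[l] = TOY_OFFSET;
}

/* All four lanes absorb the same number of bytes in lockstep. */
static void toy_x4_absorb(toy_x4_ctx *c, const uint8_t *in[4], size_t len)
{
    for (size_t i = 0; i < len; i++)
        for (int l = 0; l < 4; l++) {
            c->h[l] ^= in[l][i];
            c->h[l] *= TOY_PRIME;
        }
}

/*
 * Absorb `len` bytes per lane in two calls, split at `split`, and compare
 * each lane with a one-shot scalar run. Returns 1 on a full match.
 */
static int check_x4_against_scalar(const uint8_t *in[4], size_t len,
                                   size_t split)
{
    toy_x4_ctx x4;
    const uint8_t *tail[4];

    if (split > len)
        return 0;
    toy_x4_init(&x4);
    toy_x4_absorb(&x4, in, split);
    for (int l = 0; l < 4; l++)
        tail[l] = in[l] + split;
    toy_x4_absorb(&x4, tail, len - split);

    for (int l = 0; l < 4; l++) {
        toy_ctx ref;

        toy_init(&ref);
        toy_absorb(&ref, in[l], len);
        if (ref.h != x4.h[l])
            return 0;
    }
    return 1;
}
```

Looping `split` from 0 to `len` covers the one-shot case (split at either end) and every incremental split in between.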

Signed-off-by: Marcel Cornu <marcel.d.cornu@intel.com>
Signed-off-by: Marcel Cornu <marcel.d.cornu@intel.com>
Signed-off-by: Marcel Cornu <marcel.d.cornu@intel.com>
@mdcornu mdcornu force-pushed the ml_dsa_shake_x4_val branch from 093dca0 to 9791bc1 Compare April 28, 2026 13:59
Add a new CI workflow that runs AVX512VL specific tests under Intel SDE
v10.8, since GitHub Actions runners do not currently have AVX512 hardware.
SDE emulates AVX512 instructions and spoofs CPUID so the AVX512 code
paths can be exercised.

Two jobs are included: linux (ubuntu-latest) and windows (windows-2022).
Each job builds OpenSSL with no-shared and enable-fips, then runs the
following tests under `sde64 -skx` (Skylake-X, AVX512F+BW+DQ+VL):

- ml_dsa_internal_test: exercises AVX512VL ML-DSA sampling
- sha3_x4_internal_test: exercises AVX512VL SHAKE x4 functions
- openssl fipsinstall: runs the full FIPS KAT suite (including ML-DSA
  and SHA3 self-tests) against the FIPS provider under emulation

Signed-off-by: Marcel Cornu <marcel.d.cornu@intel.com>
mdcornu pushed a commit that referenced this pull request Jun 10, 2026
TSAN seems to be having a problem with atomic_load_ptr and
atomic_store_ptr.  Both are, by default, __ATOMIC_RELAXED operations.

According to the tsan docs, it flags these operations as a race because,
while they are indivisible, they create no happens-before constraint,
meaning they can be reordered.

An exemplar race that is reported is:

WARNING: ThreadSanitizer: data race (pid=2139404)
  Read of size 4 at 0x723400002308 by thread T39:
    #0 EVP_MD_up_ref crypto/evp/digest.c:995 (threadstest+0x45032d) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #1 evp_md_up_ref crypto/evp/digest.c:974 (threadstest+0x450242) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #2 ossl_method_up_ref crypto/property/property.c:201 (threadstest+0x4b7a55) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #3 ossl_method_store_cache_get_locked crypto/property/property.c:941 (threadstest+0x4b9922) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #4 ossl_method_store_cache_get crypto/property/property.c:974 (threadstest+0x4b9a47) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #5 inner_evp_generic_fetch crypto/evp/evp_fetch.c:314 (threadstest+0x458186) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #6 evp_generic_fetch crypto/evp/evp_fetch.c:404 (threadstest+0x4586dc) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #7 EVP_MD_fetch crypto/evp/digest.c:985 (threadstest+0x4502d7) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #8 derive_kdk crypto/rsa/rsa_ossl.c:472 (threadstest+0x4cf738) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #9 rsa_ossl_private_decrypt crypto/rsa/rsa_ossl.c:646 (threadstest+0x4d0174) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #10 RSA_private_decrypt crypto/rsa/rsa_crpt.c:48 (threadstest+0x4c6971) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #11 rsa_decrypt providers/implementations/asymciphers/rsa_enc.c:321 (threadstest+0x51cab7) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #12 EVP_PKEY_decrypt crypto/evp/asymcipher.c:280 (threadstest+0x44a9ca) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #13 thread_shared_evp_pkey test/threadstest.c:966 (threadstest+0x404be7) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #14 thread_run test/threadstest.h:67 (threadstest+0x40132d) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)

  Previous write of size 8 at 0x723400002308 by main thread (mutexes: write M0):
    #0 memset <null> (libtsan.so.2+0x4c1eb) (BuildId: 40906101a3a1e1f1ececafafda314aee009d688a)
    #1 CRYPTO_zalloc crypto/mem.c:228 (threadstest+0x48679d) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #2 evp_md_new crypto/evp/digest.c:758 (threadstest+0x44f35e) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #3 evp_md_from_algorithm crypto/evp/digest.c:839 (threadstest+0x44f885) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #4 construct_evp_method crypto/evp/evp_fetch.c:230 (threadstest+0x457ec9) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #5 ossl_method_construct_this crypto/core_fetch.c:110 (threadstest+0x4801bf) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #6 algorithm_do_map crypto/core_algorithm.c:77 (threadstest+0x47f7a3) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #7 algorithm_do_this crypto/core_algorithm.c:122 (threadstest+0x47f987) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #8 ossl_provider_doall_activated crypto/provider_core.c:1609 (threadstest+0x49a42a) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #9 ossl_algorithm_do_all crypto/core_algorithm.c:164 (threadstest+0x47fb14) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #10 ossl_method_construct crypto/core_fetch.c:157 (threadstest+0x4803d0) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #11 inner_evp_generic_fetch crypto/evp/evp_fetch.c:333 (threadstest+0x4583a2) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #12 evp_generic_fetch crypto/evp/evp_fetch.c:404 (threadstest+0x4586dc) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #13 EVP_MD_fetch crypto/evp/digest.c:985 (threadstest+0x4502d7) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #14 derive_kdk crypto/rsa/rsa_ossl.c:472 (threadstest+0x4cf738) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #15 rsa_ossl_private_decrypt crypto/rsa/rsa_ossl.c:646 (threadstest+0x4d0174) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #16 RSA_private_decrypt crypto/rsa/rsa_crpt.c:48 (threadstest+0x4c6971) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #17 rsa_decrypt providers/implementations/asymciphers/rsa_enc.c:321 (threadstest+0x51cab7) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)
    #18 EVP_PKEY_decrypt crypto/evp/asymcipher.c:280 (threadstest+0x44a9ca) (BuildId: f34377d95e3c1d13ab9aa3204d2f1f7840d1c84a)

What tsan is saying here is that the memset in evp_md_new may get
re-ordered such that the contents of the EVP_MD may still be getting
zeroed at the time we have (a) found the EVP_MD in the method store
cache, and (b) attempted to do an up_ref on it.

This is plainly impossible, especially given that, in order to reach the
method store cache, it must be placed in the method store algorithm
sparse array, which still requires taking the method store write
lock.  But for some reason tsan fails to see the memory fence that
creates.

It seems the simplest way to correct this is, when running under tsan,
to use __ATOMIC_ACQUIRE and __ATOMIC_RELEASE in
CRYPTO_atomic_[load|store]_ptr so that tsan sees the proper memory
ordering.
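
A minimal sketch of that idea, assuming GCC or Clang: pick the memory order at compile time depending on whether ThreadSanitizer is enabled. The `UNDER_TSAN`, `PTR_LOAD_ORDER` and `PTR_STORE_ORDER` macros and the `sketch_*` function names are hypothetical; this is not the actual change to CRYPTO_atomic_load_ptr or CRYPTO_atomic_store_ptr.

```c
#include <stddef.h>

/* GCC defines __SANITIZE_THREAD__; Clang exposes __has_feature(). */
#if defined(__SANITIZE_THREAD__)
# define UNDER_TSAN 1
#elif defined(__has_feature)
# if __has_feature(thread_sanitizer)
#  define UNDER_TSAN 1
# endif
#endif

#ifdef UNDER_TSAN
/* Acquire/release pairs give tsan a visible happens-before edge. */
# define PTR_LOAD_ORDER  __ATOMIC_ACQUIRE
# define PTR_STORE_ORDER __ATOMIC_RELEASE
#else
/* Default: indivisible, but no ordering constraint. */
# define PTR_LOAD_ORDER  __ATOMIC_RELAXED
# define PTR_STORE_ORDER __ATOMIC_RELAXED
#endif

/* Hypothetical wrappers around the compiler's atomic builtins. */
static void *sketch_atomic_load_ptr(void **p)
{
    return __atomic_load_n(p, PTR_LOAD_ORDER);
}

static void sketch_atomic_store_ptr(void **p, void *v)
{
    __atomic_store_n(p, v, PTR_STORE_ORDER);
}
```

In a normal build the wrappers stay relaxed, so the stronger ordering is only paid for in sanitizer runs.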

Reviewed-by: Saša Nedvědický <sashan@openssl.org>
Reviewed-by: Bob Beck <beck@openssl.org>
Reviewed-by: Nikola Pajkovsky <nikolap@openssl.org>
MergeDate: Tue Jun  9 18:17:19 2026
(Merged from openssl#31018)