Conversation

martin-frbg
Collaborator

@martin-frbg martin-frbg commented Aug 18, 2025

eventually fixes #5414

#define C6 x22 //Constant6: N*SVLs
#define C2 x19 //Constant2: N + SVLs
#define C3 x20 //Constant3: K*SVLs + SVLs
#define C4 x21 //Constant4: SVLs-2
Contributor

Changing this register from x20 to x21 requires the dependent changes below:
At line 65: sub w21, w21, #2
At line 202: cmp w13, w21

Collaborator Author

@martin-frbg martin-frbg Aug 20, 2025

Oops, sorry, I had already corrected this locally but pushed the wrong version. Unfortunately, the correction has no effect on the wrong xscblat3 test results seen for odd M. Contrary to my expectations, this PR also does not fix the divergence between SGEMM and SGEMMT seen in test_sgemmt of utest/openblas_utest_ext that was flagged in #5414.

if (strcmp(gotoblas_corename(), "armv9sme") == 0 || strcmp(gotoblas_corename(), "vortexm4") == 0)
// if (support_sme1())
#endif
if (order == CblasRowMajor && m==lda && n ==ldb && k==ldc && beta == 0 && alpha == 1.0 && TransA == CblasNoTrans && TransB == CblasNoTrans&& SGEMM_DIRECT_PERFORMANT(m,n,k)) {
Contributor

For RowMajor, shouldn't the leading dimension check be (lda==k && ldb==n && ldc==n)?

Collaborator Author

Normally yes, but the arguments have already been reshuffled at this point (I think; I'll recheck when I get back to this later this week).
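To make the two conventions concrete, here is a minimal sketch of the "tightly packed" test in row-major form and in the column-major form that results from treating row-major C = A*B as column-major C^T = B^T * A^T. The function names are hypothetical and the swap shown is the textbook one; whether the interface code has already applied such a swap at the point of the check is exactly what still needs rechecking.

```c
#include <assert.h>
#include <stdbool.h>

/* Row-major C(m x n) = A(m x k) * B(k x n), no transposes.
 * Rows are contiguous, so a tightly packed matrix has a leading
 * dimension equal to its number of columns. */
static bool packed_row_major(long m, long n, long k,
                             long lda, long ldb, long ldc)
{
    (void)m; /* m is not a leading dimension of any row-major operand */
    return lda == k && ldb == n && ldc == n;
}

/* Column-major: columns are contiguous, so a tightly packed matrix
 * has a leading dimension equal to its number of rows. */
static bool packed_col_major(long m, long n, long k,
                             long lda, long ldb, long ldc)
{
    (void)n; /* n is not a leading dimension of any column-major operand */
    return lda == m && ldb == k && ldc == m;
}

/* Assumed reshuffle (illustrative only): swap m <-> n and A <-> B,
 * which also swaps lda <-> ldb, then apply the column-major test. */
static bool packed_row_major_via_swap(long m, long n, long k,
                                      long lda, long ldb, long ldc)
{
    return packed_col_major(n, m, k, ldb, lda, ldc);
}

/* Exhaustively confirm that both formulations agree on small inputs. */
static void check_swap_equivalence(void)
{
    const long m = 2, n = 3, k = 4;
    for (long lda = 1; lda <= 5; lda++)
        for (long ldb = 1; ldb <= 5; ldb++)
            for (long ldc = 1; ldc <= 5; ldc++)
                assert(packed_row_major(m, n, k, lda, ldb, ldc) ==
                       packed_row_major_via_swap(m, n, k, lda, ldb, ldc));
}
```

With the swap applied, the column-major condition expressed in the swapped variables is equivalent to the row-major condition suggested above, so the correct form of the check depends on which set of variables is in scope.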




int CNAME(BLASLONG M, BLASLONG N, BLASLONG K)
Contributor

So this helper function checks for cases where the SME implementation wouldn't be performant?
Do these checks apply specifically to the Apple M4 only?

Collaborator Author

In principle I'd expect them to be relevant for future SME hardware as well - I think it is unlikely that the direct path will outperform Goto's block algorithm at every matrix size and shape (and we should have an SME GEMM kernel compatible with that at some point - there already is a draft PR that only lacks the TRMM part).
This is just a quick copy of the x86_64 implementation for now, so the numbers will need to be tuned once we're certain that the code is correct.
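For readers unfamiliar with this kind of helper, the general shape is a cheap size test that returns nonzero when the direct path is expected to win. The sketch below is only an illustration: the function name, both thresholds, and the "N not a multiple of 4" rule are assumptions chosen for the example, not the values or rules in this PR or in the x86_64 code.

```c
#include <stdbool.h>

/* Placeholder thresholds, NOT taken from OpenBLAS; real values
 * would have to be found by benchmarking on the target hardware. */
#define HYP_MAX_WORK           (1ULL << 22)
#define HYP_MAX_WORK_UNALIGNED (1ULL << 20)

/* Hypothetical heuristic: prefer the direct path only for small problems. */
static bool direct_path_worthwhile(long m, long n, long k)
{
    if (m <= 0 || n <= 0 || k <= 0)
        return false;

    /* Total multiply-add count as a rough measure of problem size. */
    unsigned long long work = (unsigned long long)m * n * k;

    /* Large problems: assume the blocked (packing) algorithm wins. */
    if (work >= HYP_MAX_WORK)
        return false;

    /* Illustrative rule: awkward N lowers the cutoff. */
    if ((n & 3) != 0 && work >= HYP_MAX_WORK_UNALIGNED)
        return false;

    return true;
}
```

Keeping the thresholds as named constants makes the later tuning pass a matter of changing two numbers per target rather than editing the logic.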

Successfully merging this pull request may close these issues.

test_extensions/test_sgemmt.c fails with SME on Apple M4