Split VORTEXM4 from VORTEX target and fix SGEMM_DIRECT support for SME-capable targets #5423
base: develop
… reserved on MacOS
…ang compatibility
…t_performant for ARM64
#define C6 x22 //Constant6: N*SVLs
#define C2 x19 //Constant2: N + SVLs
#define C3 x20 //Constant3: K*SVLs + SVLs
#define C4 x21 //Constant4: SVLs-2
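To sanity-check what each renamed register is meant to hold, the arithmetic in the `#define` comments can be mirrored on the host. This is a minimal sketch: `sme_consts` and `compute_consts` are hypothetical names, not part of OpenBLAS, and it assumes "SVLs" means the number of fp32 elements per streaming vector (SVL in bits / 32). In AArch64 assembly `w21` is the low 32-bit view of `x21`, so any instruction that hard-codes `w20`/`w21` has to move together with the `#define`.

```c
#include <assert.h>

/* Hypothetical host-side mirror of the register constants named in the
 * kernel comments. svls = fp32 elements per streaming vector. */
typedef struct {
    long c2; /* Constant2: N + SVLs       (x19) */
    long c3; /* Constant3: K*SVLs + SVLs  (x20) */
    long c4; /* Constant4: SVLs - 2       (x21, read as w21) */
    long c6; /* Constant6: N*SVLs         (x22) */
} sme_consts;

static sme_consts compute_consts(long n, long k, long svls)
{
    assert(svls >= 2); /* C4 = SVLs - 2 must not go negative */
    sme_consts c;
    c.c2 = n + svls;
    c.c3 = k * svls + svls;
    c.c4 = svls - 2;
    c.c6 = n * svls;
    return c;
}
```

For example, with a 512-bit streaming vector (16 fp32 lanes), N = 7 and K = 5 give C2 = 23, C3 = 96, C4 = 14 and C6 = 112.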
Changing C4 from x20 to x21 will require the following dependent changes:
At line 65: sub w21, w21, #2
At line 202: cmp w13, w21
Oops, sorry: I had already corrected this locally but pushed the wrong version. Unfortunately, this correction has no effect on the wrong xscblat3 test results seen for odd M. Contrary to my expectations, this PR also does not fix the divergence between SGEMM and SGEMMT in test_sgemmt of utest/openblas_utest_ext that was flagged in #5414.
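When chasing the odd-M failures and the SGEMM/SGEMMT mismatch, a naive reference can help tell which side is wrong. This is a hedged sketch, not the xscblat3 or utest code: `ref_sgemm`, `ref_sgemmt_lower` and `max_lower_diff` are hypothetical helpers, and it assumes row-major storage with no transposes. GEMMT updates only one triangle of a square C, so its lower triangle should match the lower triangle of the full GEMM result.

```c
#include <assert.h>
#include <stddef.h>

/* Naive row-major reference, no transposes:
 * C = alpha*A*B + beta*C with A m x k, B k x n, C m x n. */
static void ref_sgemm(size_t m, size_t n, size_t k, float alpha,
                      const float *a, size_t lda, const float *b, size_t ldb,
                      float beta, float *c, size_t ldc)
{
    assert(lda >= k && ldb >= n && ldc >= n);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * lda + p] * b[p * ldb + j];
            /* BLAS convention: with beta == 0, C is not read on input. */
            c[i * ldc + j] = alpha * acc + (beta == 0.0f ? 0.0f : beta * c[i * ldc + j]);
        }
}

/* Same update restricted to the lower triangle (j <= i) of an n x n C,
 * as a lower-triangle GEMMT would do; the strict upper part is untouched. */
static void ref_sgemmt_lower(size_t n, size_t k, float alpha,
                             const float *a, size_t lda, const float *b, size_t ldb,
                             float beta, float *c, size_t ldc)
{
    assert(lda >= k && ldb >= n && ldc >= n);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j <= i; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * lda + p] * b[p * ldb + j];
            c[i * ldc + j] = alpha * acc + (beta == 0.0f ? 0.0f : beta * c[i * ldc + j]);
        }
}

/* Largest absolute difference over the lower triangles of two n x n matrices. */
static float max_lower_diff(size_t n, const float *x, const float *y, size_t ld)
{
    float d = 0.0f;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j <= i; j++) {
            float e = x[i * ld + j] - y[i * ld + j];
            if (e < 0.0f) e = -e;
            if (e > d) d = e;
        }
    return d;
}
```

Running the kernel under test and `ref_sgemm` side by side for small odd M (for example M = 5) and comparing element-wise is one way to see whether only the last, partial row block goes wrong.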
…me as matrix dimension
if (strcmp(gotoblas_corename(), "armv9sme") == 0 || strcmp(gotoblas_corename(), "vortexm4") == 0)
// if (support_sme1())
#endif
if (order == CblasRowMajor && m == lda && n == ldb && k == ldc && beta == 0 && alpha == 1.0 && TransA == CblasNoTrans && TransB == CblasNoTrans && SGEMM_DIRECT_PERFORMANT(m,n,k)) {
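The dispatch gate can be pulled out into a standalone predicate to make each condition easy to test in isolation. This is a hedged sketch: `use_sgemm_direct` and the `layout`/`trans` enums are hypothetical stand-ins for the CBLAS types, `performant` stands for the already-evaluated result of `SGEMM_DIRECT_PERFORMANT(m,n,k)`, and it assumes the core-name check closed by the `#endif` is compiled in (presumably a DYNAMIC_ARCH build). The leading-dimension comparisons are copied as written, not endorsed.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for CblasRowMajor/CblasColMajor and
 * CblasNoTrans/CblasTrans. */
enum layout { ROW_MAJOR, COL_MAJOR };
enum trans { NO_TRANS, TRANS };

/* Returns 1 when the condition in the PR would take the SGEMM_DIRECT path.
 * `performant` is the precomputed result of SGEMM_DIRECT_PERFORMANT(m,n,k). */
static int use_sgemm_direct(const char *corename, enum layout order,
                            enum trans ta, enum trans tb,
                            long m, long n, long k,
                            long lda, long ldb, long ldc,
                            float alpha, float beta, int performant)
{
    assert(corename != NULL);
    /* Core-name gate: only the SME-capable targets named in the PR. */
    if (strcmp(corename, "armv9sme") != 0 && strcmp(corename, "vortexm4") != 0)
        return 0;
    return order == ROW_MAJOR && m == lda && n == ldb && k == ldc
        && beta == 0.0f && alpha == 1.0f
        && ta == NO_TRANS && tb == NO_TRANS
        && performant != 0;
}
```

Unlike the original `&&` chain, this sketch evaluates the performance check before the other conditions; that does not change the result, only when the heuristic runs.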
For RowMajor, shouldn't the leading-dimension check be (lda==k && ldb==n && ldc==n)?
Normally, yes, but the arguments have already been reshuffled at this point (I think; I'll recheck when I get back to this later this week).
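The reviewer's point can be checked with two small predicates. This sketch assumes the values reaching the condition are the caller's original row-major arguments; the author suspects they may already have been swapped, which it does not model. `packed_row_major` and `pr_condition` are hypothetical names. For densely packed row-major operands with no transposes, A (m x k) has lda == k, B (k x n) has ldb == n, and C (m x n) has ldc == n.

```c
#include <assert.h>

/* Densely packed row-major, no transposes:
 *   A is m x k -> lda == k
 *   B is k x n -> ldb == n
 *   C is m x n -> ldc == n
 * This is the check the reviewer suggests. */
static int packed_row_major(long m, long n, long k, long lda, long ldb, long ldc)
{
    (void)m; /* m does not constrain any leading dimension in row-major */
    return lda == k && ldb == n && ldc == n;
}

/* The condition as written in the PR, applied to unswapped arguments. */
static int pr_condition(long m, long n, long k, long lda, long ldb, long ldc)
{
    return m == lda && n == ldb && k == ldc;
}
```

On densely packed inputs the PR's condition holds only when m == k == n, so tests that use square matrices alone would not distinguish the two checks.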
int CNAME(BLASLONG M, BLASLONG N, BLASLONG K)
So this helper function checks for cases where the SME implementation would not be performant?
Do these checks apply only to Apple M4?
In principle, I'd expect them to be relevant for future SME hardware as well: I think it is unlikely that the direct path will outperform Goto's blocked algorithm at all matrix sizes and shapes. (We should also have an SME GEMM kernel compatible with that algorithm at some point; there is already a draft PR that only lacks the TRMM part.)
This is just a quick copy of the x86_64 implementation for now, so the thresholds will need to be tuned once we're certain the code is correct.
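To make the question concrete, a size-based heuristic of this kind might reject large problems, where the blocked algorithm's packing cost is amortized, and be stricter when N does not fill whole vectors. The sketch below is hypothetical throughout: the function name, both thresholds, and the 16-lane vector width are placeholders, not the values from the x86_64 code and not tuned for Apple M4.

```c
#include <assert.h>

/* Hypothetical placeholder thresholds, NOT taken from OpenBLAS; real values
 * would have to be tuned per core. */
#define DIRECT_MAX_MNK        (256LL * 256LL * 256LL)
#define DIRECT_MAX_MNK_RAGGED (128LL * 128LL * 128LL)
/* Assumed fp32 lanes per streaming vector (512-bit SVL / 32 bits). */
#define FP32_LANES 16

/* Returns 1 if the direct (unpacked) path is expected to be faster than
 * the blocked Goto-style path, 0 otherwise. */
static int sgemm_direct_performant_sketch(long m, long n, long k)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    long long mnk = (long long)m * n * k;
    /* Large problems: the blocked path amortizes its packing cost. */
    if (mnk >= DIRECT_MAX_MNK)
        return 0;
    /* N not a multiple of the vector width: partial vectors make the
     * direct path less attractive sooner (hypothetical rule). */
    if (n % FP32_LANES != 0 && mnk >= DIRECT_MAX_MNK_RAGGED)
        return 0;
    return 1;
}
```

Keeping the thresholds as named macros makes per-core tuning a matter of changing two constants once the kernel's correctness is settled.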
eventually fixes #5414