Implementing SVE in [SD]AXPY Kernels for A64FX and Graviton3E #5426
base: develop
@@ -32,6 +32,10 @@ SGEMVNKERNEL = gemv_n_sve_v1x3.c
DGEMVNKERNEL = gemv_n_sve_v1x3.c
SGEMVTKERNEL = gemv_t_sve_v1x3.c
DGEMVTKERNEL = gemv_t_sve_v1x3.c

SAXPYKERNEL = axpy_sve.c
DAXPYKERNEL = axpy_sve.c
Since you used the SVE vector length in the implementation instead of hardcoding the vector width, the kernel should work on NEOVERSEV2 as well. Please check this on Graviton4 and add it to KERNEL.NEOVERSEV2 too.
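To illustrate the reviewer's point about not hardcoding the vector width, here is a minimal sketch of a vector-length-agnostic DAXPY loop. It is not the code in axpy_sve.c: the name daxpy_vla_sketch is hypothetical, only unit stride is handled, and a scalar fallback is included so the file also builds on targets without SVE. The SVE path assumes the ACLE intrinsics from arm_sve.h.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper (not the PR's kernel): y[i] += da * x[i], unit stride. */
void daxpy_vla_sketch(int64_t n, double da, const double *x, double *y)
{
    if (n <= 0 || da == 0.0) return;

#if defined(__ARM_FEATURE_SVE)
    /* Number of doubles per vector, read at run time: 8 on a 512-bit
       implementation, 4 on 256-bit, 2 on 128-bit. */
    const int64_t vl = (int64_t)svcntd();
    const svfloat64_t va = svdup_f64(da);

    for (int64_t i = 0; i < n; i += vl) {
        /* Predicate is true for lanes with i + lane < n, so the last
           partial vector needs no separate scalar tail loop. */
        svbool_t pg = svwhilelt_b64(i, n);
        svfloat64_t vx = svld1_f64(pg, x + i);
        svfloat64_t vy = svld1_f64(pg, y + i);
        vy = svmla_f64_x(pg, vy, vx, va);   /* vy + vx * va */
        svst1_f64(pg, y + i, vy);
    }
#else
    /* Scalar fallback for targets without SVE. */
    for (int64_t i = 0; i < n; i++) y[i] += da * x[i];
#endif
}
```

Because the loop step and the predicate both come from the hardware at run time, the same binary should run unchanged on cores with different SVE vector widths, which is why the reviewer expects the kernel to carry over to NEOVERSEV2.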
BLASLONG sve_size = SV_COUNT();

if (n < 0) return (0);
if (da == 0.0) return (0);
Why can't these two checks be combined into one?
Thank you for your comments.
The approach you mention would also work, but I followed the existing code in kernel/arm/axpy.c#L45-L46.
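For reference, the two forms discussed in this thread behave identically. The sketch below uses hypothetical stand-in functions (early_exit_separate and early_exit_combined are not names from the PR) that return 1 when the kernel would return early and 0 otherwise.

```c
/* Hypothetical stand-ins for the early-exit logic, not code from the PR. */

/* Two separate checks, in the style of kernel/arm/axpy.c. */
int early_exit_separate(long n, double da)
{
    if (n < 0) return 1;
    if (da == 0.0) return 1;
    return 0;
}

/* The same condition written as a single check. */
int early_exit_combined(long n, double da)
{
    if (n < 0 || da == 0.0) return 1;
    return 0;
}
```

Either form is a matter of style. Keeping the two separate checks matches the existing reference kernel, which makes the SVE version easier to compare against it.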
Hi @hideaki-motoki, thanks for the PR! I have added a few comments.
Resolves #5417.


This change improves the performance of [SD]AXPY on both A64FX and Graviton3E.

The graphs below show the single-thread performance improvement of [D]AXPY on A64FX and Graviton3E, respectively. Performance improved by 2.57 times on the A64FX and 1.13 times on the Graviton3E.

I have confirmed that this optimization also yields performance benefits for Level 2 BLAS kernels that utilize [SD]AXPY, such as [SD]SPMV and [SD]GER.
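As a rough illustration of why a faster AXPY carries over to Level 2 routines, the sketch below writes a rank-1 update (GER, A := A + alpha * x * y^T) as one AXPY per column of a column-major matrix. The function names axpy_ref and dger_sketch are hypothetical, unit strides are assumed, and this is a simplified model rather than the actual OpenBLAS driver code.

```c
#include <stddef.h>

/* Hypothetical reference AXPY: y[i] += da * x[i], unit stride. */
static void axpy_ref(size_t n, double da, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++) y[i] += da * x[i];
}

/* Hypothetical GER sketch: A is m x n, column-major, leading dimension lda.
   Column j of A receives (alpha * y[j]) * x, which is exactly one AXPY. */
void dger_sketch(size_t m, size_t n, double alpha,
                 const double *x, const double *y,
                 double *a, size_t lda)
{
    for (size_t j = 0; j < n; j++) {
        axpy_ref(m, alpha * y[j], x, a + j * lda);
    }
}
```

In this formulation all of the floating-point work happens inside the AXPY call, so any speedup in the AXPY kernel is inherited by the rank-1 update.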