@mentOS31 mentOS31 commented Aug 22, 2025

What
In addition to the existing blocking and non-blocking collective calls,
this PR adds persistent collective calls to COLL/UCC.
For example:

mpirun \
  -mca coll_ucc_enable 1 -mca coll_ucc_priority 100 \
  -mca coll_ucc_cts reduce_init \
  osu_reduce_persistent

How
This adds a <coll>_init function for each collective operation,
using the UCC_COLL_ARGS_FLAG_PERSISTENT flag [1].
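
As a rough illustration of the idea (not the code in this PR), the sketch below builds the flags of a collective-arguments structure in one place and adds the persistent bit only on request. The `demo_*` types and the `DEMO_*` constants are hypothetical stand-ins with arbitrary values; real code would include <ucc/api/ucc.h> and use ucc_coll_args_t with UCC_COLL_ARGS_FIELD_FLAGS and UCC_COLL_ARGS_FLAG_PERSISTENT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for illustration only. The numeric values are arbitrary and
 * are NOT the values UCC uses. */
enum { DEMO_FIELD_FLAGS = 1u << 0 };
enum { DEMO_FLAG_IN_PLACE = 1u << 0, DEMO_FLAG_PERSISTENT = 1u << 1 };

typedef struct demo_coll_args {
    uint64_t mask;   /* which optional fields are valid */
    uint64_t flags;  /* per-operation flags */
} demo_coll_args_t;

/* Compute all flags in a single expression; the persistent bit is set
 * only when the caller asks for a persistent operation. */
static void demo_set_flags(demo_coll_args_t *args, bool in_place,
                           bool persistent)
{
    assert(NULL != args);
    args->flags = (uint64_t)((in_place   ? DEMO_FLAG_IN_PLACE   : 0)
                           | (persistent ? DEMO_FLAG_PERSISTENT : 0));
    /* Mark the flags field as valid only if at least one flag is set. */
    args->mask  = (0 != args->flags) ? DEMO_FIELD_FLAGS : 0;
}
```

Passing the persistent choice as a bool lets the blocking, non-blocking, and persistent entry points share one argument-building helper.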

Reference
[1] "Unified Collective Communications (UCC) Specification Version 1.0" (2022/02/18)
https://openucx.github.io/ucc/api/v1.0/pdf/ucc.pdf
- UCC_COLL_ARGS_FLAG_PERSISTENT in Section 8.8.4.2 "ucc_coll_args_flags_t"

hasegawa.kento added 3 commits August 21, 2025 14:33
Signed-off-by: hasegawa.kento <hasegawa.kento@fujitsu.com>
Signed-off-by: hasegawa.kento <hasegawa.kento@fujitsu.com>
Signed-off-by: hasegawa.kento <hasegawa.kento@fujitsu.com>
Member

@bosilca bosilca left a comment

I have a few nitpick comments for the entire commit:

  • 'iniz' has little meaning. How about init_common instead?
  • 'a' as a suffix has no meaning, while 'alias' does!
  • In the original code, the call to the UCC collective _init was multiline, with clear separation between the send and receive arguments. Please maintain that structure, and add the persistent bool flag in front of the ucc_module.

@@ -592,6 +648,19 @@ OBJ_CLASS_INSTANCE(mca_coll_ucc_req_t, ompi_request_t,

int mca_coll_ucc_req_free(struct ompi_request_t **ompi_req)
{
    if (MPI_REQUEST_NULL != ompi_req[0]) {
Member

I'm mostly certain that this function cannot be called with MPI_REQUEST_NULL, as MPI_REQUEST_NULL is a static object and cannot be added to any free_list. Did you see it happen? If not, please remove this protection.

UCC_ERROR("ucc_collective_post failed: %s", ucc_status_string(rc_ucc));
coll_req->super.req_complete = REQUEST_COMPLETED;
coll_req->super.req_state = OMPI_REQUEST_INACTIVE;
if (OMPI_SUCCESS == rc) {
Member

Please ensure that coll_req->super.req_status.MPI_ERROR reflects the rc error.
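
One way to read this request is sketched below, under stated assumptions: `demo_request_t` and its fields are hypothetical stand-ins for the request members quoted above (`status_error` plays the role of req_status.MPI_ERROR), and the constants are arbitrary. The point is only that the failure path records the error code in the status before the request is marked complete and inactive.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the values are arbitrary. */
enum { DEMO_SUCCESS = 0, DEMO_ERROR = -1 };
enum { DEMO_REQUEST_PENDING = 0, DEMO_REQUEST_COMPLETED = 1 };
enum { DEMO_STATE_INACTIVE = 1, DEMO_STATE_ACTIVE = 2 };

typedef struct demo_request {
    int req_complete;
    int req_state;
    int status_error;  /* plays the role of req_status.MPI_ERROR */
} demo_request_t;

/* Failure path after a post error: store rc in the status first, then
 * mark the request completed and inactive so a waiter sees the error. */
static int demo_fail_request(demo_request_t *req, int rc)
{
    assert(NULL != req);
    req->status_error = rc;
    req->req_complete = DEMO_REQUEST_COMPLETED;
    req->req_state    = DEMO_STATE_INACTIVE;
    return rc;
}
```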

@@ -61,6 +62,7 @@ struct mca_coll_ucc_component_t {
ucc_lib_attr_t ucc_lib_attr;
ucc_coll_type_t cts_requested;
ucc_coll_type_t nb_cts_requested;
ucc_coll_type_t pc_cts_requested;
Contributor

Does pc stand for persistent collective? Maybe ps_cts_requested is better.

@@ -132,6 +134,34 @@ struct mca_coll_ucc_module_t {
mca_coll_base_module_t* previous_scatter_module;
mca_coll_base_module_iscatter_fn_t previous_iscatter;
mca_coll_base_module_t* previous_iscatter_module;
mca_coll_base_module_allreduce_init_fn_t previous_allreduce_init;
Contributor

please keep alignment

@@ -57,6 +57,10 @@ static inline ucc_status_t mca_coll_ucc_allgatherv_init(const void *sbuf, size_t
}
};

if (true == persistent) {
Contributor

Maybe make it part of line 37, where the other flags are set?


if ((bz > 0) && (bp != 0)) {
    bp[0] = '\0';
}
Contributor

Does it make sense to continue this function if bp is NULL or bz is 0?

if (0 == strcmp(cp_suffix, "_init")) {
    if ((bz > 0) && (bp != 0)) {
        if (blen >= bz) {
            return 0 /* XXX internal error */;
Contributor

Return -1 to separate the error from other possible results.
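
A minimal sketch of that convention follows, assuming a hypothetical helper (`demo_strip_init_suffix` is not a function from this PR). It copies a collective name without a trailing "_init" into a caller buffer and returns 1 if the suffix was present, 0 if it was not, and -1 on bad arguments or when the buffer is too small, so an internal error can never be confused with "no suffix".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for illustration.
 * Returns  1: name ended in "_init" and the base name was copied,
 *          0: no suffix, name copied unchanged,
 *         -1: invalid arguments or buffer too small. */
static int demo_strip_init_suffix(const char *name, char *buf, size_t bufsz)
{
    static const char suffix[] = "_init";
    const size_t slen = sizeof(suffix) - 1;
    size_t nlen, blen;
    int has_suffix;

    /* Bail out early instead of continuing with an unusable buffer. */
    if (NULL == name || NULL == buf || 0 == bufsz) {
        return -1;
    }
    buf[0] = '\0';

    nlen = strlen(name);
    /* Require a non-empty base name in front of the suffix. */
    has_suffix = (nlen > slen) && (0 == strcmp(name + nlen - slen, suffix));
    blen = has_suffix ? nlen - slen : nlen;
    if (blen >= bufsz) {
        return -1;  /* distinct from the 0 / 1 results */
    }
    memcpy(buf, name, blen);
    buf[blen] = '\0';
    return has_suffix;
}
```

A caller can then test `rc < 0` for errors and use the 0 or 1 result only to decide which collective table to consult.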

@@ -87,6 +88,34 @@ static void mca_coll_ucc_module_clear(mca_coll_ucc_module_t *ucc_module)
ucc_module->previous_scatter_module = NULL;
ucc_module->previous_iscatter = NULL;
ucc_module->previous_iscatter_module = NULL;
ucc_module->previous_allreduce_init = NULL;
Contributor

please keep alignment

@@ -157,18 +215,33 @@ static void mca_coll_ucc_init_default_cts(void)
n_cts = opal_argv_count(cts);
cm->cts_requested = disable ? COLL_UCC_CTS : 0;
cm->nb_cts_requested = disable ? COLL_UCC_CTS : 0;
cm->pc_cts_requested = 0; /* XXX PC currently disabled by default */
Contributor

I'm not sure how this is supposed to work; maybe I'm missing something. It seems that disable mode doesn't work for persistent collectives the way it does for the others. If the user gives ^allreduce_init, it means "enable every persistent collective except allreduce". However, since pc_cts_requested is initially 0, clearing a bit won't change anything.
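
The behaviour being asked about can be sketched as follows. This is an assumption-laden illustration, not the component's parser: `demo_pc_table`, `DEMO_PC_ALL`, and `demo_parse_pc_cts` are hypothetical, and the bit values are arbitrary. The key point is that a list starting with `^` has to begin from the full mask, so that clearing the listed bits has an effect.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table of persistent collectives; bit values are arbitrary. */
static const struct { const char *name; uint64_t bit; } demo_pc_table[] = {
    { "allreduce_init", 1u << 0 },
    { "bcast_init",     1u << 1 },
    { "reduce_init",    1u << 2 },
};
#define DEMO_PC_ALL ((uint64_t)0x7)

/* "a,b"  enables only a and b (start from an empty mask).
 * "^a,b" enables everything except a and b (start from the full mask). */
static uint64_t demo_parse_pc_cts(const char *list)
{
    const size_t n = sizeof(demo_pc_table) / sizeof(demo_pc_table[0]);
    int disable;
    uint64_t mask;
    const char *p;

    assert(NULL != list);
    disable = ('^' == list[0]);
    mask    = disable ? DEMO_PC_ALL : 0;
    p       = disable ? list + 1 : list;

    while ('\0' != *p) {
        size_t len = strcspn(p, ",");  /* length of the current token */
        for (size_t i = 0; i < n; i++) {
            if (strlen(demo_pc_table[i].name) == len &&
                0 == strncmp(p, demo_pc_table[i].name, len)) {
                if (disable) {
                    mask &= ~demo_pc_table[i].bit;
                } else {
                    mask |= demo_pc_table[i].bit;
                }
            }
        }
        p += len;
        if (',' == *p) {
            p++;  /* skip the separator */
        }
    }
    return mask;
}
```

Unknown names are silently ignored in this sketch; a real parser would probably want to warn about them.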

@@ -530,11 +565,32 @@ mca_coll_ucc_module_disable(mca_coll_base_module_t *module,
UCC_UNINSTALL_COLL_API(comm, ucc_module, reduce);
UCC_UNINSTALL_COLL_API(comm, ucc_module, ireduce);
UCC_UNINSTALL_COLL_API(comm, ucc_module, gather);
/* UCC_UNINSTALL_COLL_API(comm, ucc_module, igather); */
Contributor

Why is it commented out?

Comment on lines +696 to +698
if (true != coll_req->super.req_persistent) {
    continue;
}
Contributor

Isn't this an error?
