
Model conversion with OLIVE and running with the ONNX GenAI runtime: Error 5005 on NPU models, and GPU error in Microsoft.Neutron.OpenAI.Provider #1679

Description

@leestott

When converting the model to the ONNX GenAI runtime, the model converts successfully using OLIVE.

When running the converted model with AITK or Foundry Local:

NPU - Error 5005, Qualcomm Snapdragon issue (microsoft/Foundry-Local#67)
GPU - Error in Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx.LoadModelAsync

GPU error log:

2025-08-07 20:27:35.858 [info] Information: Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx [1400]  2025-08-07T20:27:35.857568+08:00 Loading model:gpt-oss-20b-cuda-gpu

2025-08-07 20:27:35.865 [info] Error: Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx [1402]  2025-08-07T20:27:35.8645887+08:00 Failed loading model:gpt-oss-20b-cuda-gpu error: [Load model from /home/lokinfey/.aitk/models/Microsoft/gpt-oss-20b-cuda-gpu/model.onnx failed:This is an invalid model. In Node, ("/model/layers.0/attn/GroupQueryAttention", GroupQueryAttention, "com.microsoft", -1) : ("/model/layers.0/attn/qkv_proj/Add/output_0": tensor(float16),"","","past_key_values.0.key": tensor(float16),"past_key_values.0.value": tensor(float16),"/model/attn_mask_reformat/attn_mask_subgraph/Sub/Cast/output_0": tensor(int32),"/model/attn_mask_reformat/attn_mask_subgraph/Gather/Cast/output_0": tensor(int32),"cos_cache": tensor(float16),"sin_cache": tensor(float16),"","","model.layers.0.attn.sinks": tensor(float16),) -> ("/model/layers.0/attn/GroupQueryAttention/output_0": tensor(float16),"present.0.key": tensor(float16),"present.0.value": tensor(float16),) , Error Node(/model/layers.0/attn/GroupQueryAttention) with schema(com.microsoft::GroupQueryAttention:1) has input size 12 not in range [min=7, max=9].,    at Microsoft.ML.OnnxRuntimeGenAI.Result.VerifySuccess(IntPtr) + 0x56

  at Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx.LoadModelAsync(String, String, String, CancellationToken) + 0x730

  at Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderBase`1.<EnsureModelLoadedAsync>d__44.MoveNext() + 0x549]

2025-08-07 20:27:35.866 [info] Information: Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx [1401]  2025-08-07T20:27:35.864944+08:00 Finish loading model:gpt-oss-20b-cuda-gpu elapsed time:00:00:00.0073828

2025-08-07 20:27:35.905 [error] Failed loading model gpt-oss-20b-cuda-gpu. Load model from /home/lokinfey/.aitk/models/Microsoft/gpt-oss-20b-cuda-gpu/model.onnx failed:This is an invalid model. In Node, ("/model/layers.0/attn/GroupQueryAttention", GroupQueryAttention, "com.microsoft", -1) : ("/model/layers.0/attn/qkv_proj/Add/output_0": tensor(float16),"","","past_key_values.0.key": tensor(float16),"past_key_values.0.value": tensor(float16),"/model/attn_mask_reformat/attn_mask_subgraph/Sub/Cast/output_0": tensor(int32),"/model/attn_mask_reformat/attn_mask_subgraph/Gather/Cast/output_0": tensor(int32),"cos_cache": tensor(float16),"sin_cache": tensor(float16),"","","model.layers.0.attn.sinks": tensor(float16),) -> ("/model/layers.0/attn/GroupQueryAttention/output_0": tensor(float16),"present.0.key": tensor(float16),"present.0.value": tensor(float16),) , Error Node(/model/layers.0/attn/GroupQueryAttention) with schema(com.microsoft::GroupQueryAttention:1) has input size 12 not in range [min=7, max=9]. 

(A second load attempt at 20:27:41 fails with the identical error.)
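The decisive line in the trace is the schema check: the exported graph's GroupQueryAttention node carries 12 inputs (including the gpt-oss-specific model.layers.0.attn.sinks input), but the com.microsoft::GroupQueryAttention:1 schema registered in the installed runtime accepts only 7 to 9, which points to a version mismatch between the exporter and the runtime. A minimal stdlib-only stand-in for that arity check (operator name and bounds taken from the error message; this mirrors, not reproduces, the runtime's validation):

```python
def check_input_arity(op_type: str, n_inputs: int, min_in: int, max_in: int) -> None:
    """Mirror the runtime's schema validation that rejects the node."""
    if not (min_in <= n_inputs <= max_in):
        raise ValueError(
            f"Node({op_type}) with schema(com.microsoft::{op_type}:1) "
            f"has input size {n_inputs} not in range [min={min_in}, max={max_in}]."
        )

# The exported gpt-oss graph wires 12 inputs into GroupQueryAttention
# (qkv projection, past key/value, seqlen tensors, cos/sin caches, attention
# sinks, ...), but the installed runtime's schema allows only 7-9 inputs:
try:
    check_input_arity("GroupQueryAttention", 12, 7, 9)
except ValueError as e:
    print(e)
```

A model exported against a newer operator contract than the runtime supports will fail this check at load time regardless of how the conversion itself went.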

To Reproduce
Steps to reproduce the behavior:

  1. Convert the GPT OSS model using OLIVE (4 x H200 required)
  2. Run the converted model in the AITK Playground
  3. The error above is received for NPU or GPU
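Since the conversion machine and the runtime machine differ here, a first diagnostic for the exporter/runtime mismatch is comparing the installed package versions on both sides. A small stdlib-only sketch (the distribution names listed are assumptions; adjust to the packages actually installed):

```python
from importlib import metadata

def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    out = {}
    for pkg in packages:
        try:
            out[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            out[pkg] = None
    return out

# Packages involved in this issue; None means the package is absent here.
print(installed_versions(["olive-ai", "onnxruntime-gpu", "onnxruntime-genai"]))
```

If the converting environment carries a newer onnxruntime than the one AITK or Foundry Local loads the model with, the GroupQueryAttention schema mismatch above is the expected symptom.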

Expected behavior
The converted model should load and run in the playground.
