Hi,
This PR proposes an A2 blueprint. It is similar to this blueprint: https://github.com/oracle-quickstart/oci-ai-blueprints/blob/main/docs/sample_blueprints/model_serving/cpu-inference/cpu-inference-mistral-vm.json
and aims to reach similar performance for the mistral:7b-instruct-q8_0 and llama3.1:8b-instruct-q8_0 models.
Our internal benchmarks gave the following numbers for A2:
mistral:7b-instruct-q8_0
b=1, t=8, tg_throughput=6.27, pp_throughput=446.95, ttft=0.15
llama3.1:8b-instruct-q8_0
b=1, t=8, tg_throughput=5.63, pp_throughput=419.10, ttft=0.20
and the following numbers for E4:
mistral:7b-instruct-q8_0
b=1, t=4, tg_throughput=6.91, pp_throughput=470.62, ttft=0.15
llama3.1:8b-instruct-q8_0
b=1, t=4, tg_throughput=6.54, pp_throughput=427.20, ttft=0.17
The main difference in the blueprint is that A2 uses 8 cores, versus 4 cores for E4.
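
For reviewers who want to quantify "similar performance", here is a minimal sketch that computes the relative difference of A2 against E4 for each metric. The numbers are copied from the benchmarks above; the dictionary layout and the names `A2`, `E4`, `relative_diff` and `compare` are illustrative assumptions and are not part of the blueprint.

```python
# Benchmark numbers copied from the PR description (b=1 in all runs).
# The data layout and helper names are hypothetical, for illustration only.
A2 = {
    "mistral:7b-instruct-q8_0": {"tg_throughput": 6.27, "pp_throughput": 446.95, "ttft": 0.15},
    "llama3.1:8b-instruct-q8_0": {"tg_throughput": 5.63, "pp_throughput": 419.10, "ttft": 0.20},
}
E4 = {
    "mistral:7b-instruct-q8_0": {"tg_throughput": 6.91, "pp_throughput": 470.62, "ttft": 0.15},
    "llama3.1:8b-instruct-q8_0": {"tg_throughput": 6.54, "pp_throughput": 427.20, "ttft": 0.17},
}


def relative_diff(a2_value: float, e4_value: float) -> float:
    """Return the A2 value relative to the E4 value, in percent."""
    return (a2_value - e4_value) / e4_value * 100.0


def compare(a2: dict, e4: dict) -> dict:
    """Build {model: {metric: percent difference of A2 vs E4}}."""
    return {
        model: {
            metric: relative_diff(a2[model][metric], e4[model][metric])
            for metric in a2[model]
        }
        for model in a2
    }


if __name__ == "__main__":
    for model, metrics in compare(A2, E4).items():
        print(model)
        for metric, pct in metrics.items():
            # Negative is worse for the throughput metrics;
            # for ttft a positive value means A2 is slower.
            print(f"  {metric}: {pct:+.2f}%")
```

Running the script prints one signed percentage per metric and model, which makes it easier to see where A2 is close to E4 and where the gap is larger.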