LLM accuracy benchmark - MMLU


Set up a harness to benchmark an LLM on MMLU

0 lm-evaluation-harness Approach

I first tried lm-eval, but couldn't get it to work against the Nebius endpoint:

# pip install "lm-eval[api]"
lm_eval --model local-completions \
        --model_args model=openai/gpt-oss-120b,base_url=https://api.studio.nebius.com/v1,api_key=${NEBIUS_API_KEY} \
        --tasks mmlu \
        --num_fewshot 5
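
Before blaming the harness, it is worth confirming that the endpoint answers a plain chat completion at all. Below is a minimal sanity check with the stock OpenAI client, reusing the same base URL and model as the command above (it assumes NEBIUS_API_KEY is exported). If this works while lm-eval fails, the problem is in the harness wiring rather than the API; one hedged guess is the base_url, since lm-eval's local-completions examples point it at the full /v1/completions route rather than /v1.

import os
from openai import OpenAI

# Same endpoint and model that the lm-eval command targets
client = OpenAI(base_url="https://api.studio.nebius.com/v1",
                api_key=os.environ["NEBIUS_API_KEY"])
response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Reply with the single letter A."}],
    temperature=0.0,
)
print(response.choices[0].message.content)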

1 deepeval Approach

This working approach uses deepeval, which supports multiple benchmarks and allows custom benchmark code. The following code was generated by Gemini, with some output-format fixes:

import os
from openai import OpenAI
# DeepEval imports
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.benchmarks import MMLU
from deepeval.benchmarks.mmlu.task import MMLUTask
# --- Step 1: Define Your Chat Completion Model ---
# We must wrap our API call in a class that deepeval recognizes.
# It must inherit from DeepEvalBaseLLM.
class OpenAIChatCompletionModel(DeepEvalBaseLLM):
    def __init__(self, base_url: str, model: str, api_key: str):
        # We'll use the OpenAI client
        self.client = OpenAI(base_url=base_url, 
                             api_key=api_key)
        self.model = model
    def load_model(self):
        # This method is required, but we don't need to do
        # anything here since the client is already initialized.
        return self
    def generate(self, prompt: str) -> str:
        """
        Generates a synchronous response from the chat completion endpoint.
        """
        try:
            response = self.client.chat.completions.create(
                model=self.model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0, # multiple-choice answers should be deterministic, so temperature 0
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"Error during generation: {e}")
            return ""
    async def a_generate(self, prompt: str) -> str:
        """
        Generates an asynchronous response.
        For simplicity, this example just wraps the synchronous
        call. For production, you'd use `AsyncOpenAI`.
        """
        return self.generate(prompt)
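    # In production, a_generate could await the async client instead; a
    # hedged sketch (not part of the original script), assuming the same
    # endpoint and credentials:
    #
    #     from openai import AsyncOpenAI
    #     # created once, e.g. in __init__:
    #     # self.async_client = AsyncOpenAI(base_url=base_url, api_key=api_key)
    #     response = await self.async_client.chat.completions.create(
    #         model=self.model,
    #         messages=[{"role": "user", "content": prompt}],
    #         temperature=0.0,
    #     )
    #     return response.choices[0].message.content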
    def get_model_name(self):
        return self.model
# --- Step 2: Configure and Run the MMLU Benchmark ---
def run_mmlu_test(base_url: str, model: str, api_key: str):
    # 1. Instantiate your custom model
    # The endpoint, model name, and API key are passed in by the caller
    chat_model = OpenAIChatCompletionModel(base_url=base_url, 
                                           model=model,
                                           api_key=api_key)
    # 2. Initialize the MMLU benchmark
    # We select just two tasks for a quick example.
    # Running all 57 tasks can take a very long time and cost a lot.
    # The default `n_shots=5` provides 5 examples to the model
    # to help it answer in the correct 'A', 'B', 'C', or 'D' format.
    benchmark = MMLU(
        tasks=[
            MMLUTask.COLLEGE_BIOLOGY,
            MMLUTask.ASTRONOMY
        ],
        n_shots=5  # 5-shot is the MMLU standard
    )
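    # To run the complete benchmark instead, MMLU() with no `tasks`
    # argument should default to all 57 subjects (assumption based on
    # deepeval's defaults):
    #
    #     benchmark = MMLU(n_shots=5)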
    print(f"Running MMLU benchmark on model: {chat_model.get_model_name()}")
    print(f"Tasks: {[task.value for task in benchmark.tasks]}")
    print("This may take a few minutes...")
    # 3. Run the evaluation
    # This will download the dataset and send requests to your
    # model for each test case.
    results = benchmark.evaluate(model=chat_model)
    # 4. Print the results
    print("--- MMLU Evaluation Complete ---")
    print(f"Overall Score: {results.overall_accuracy:.4f}")
    print("\nScores by Task:")
    for index, row in benchmark.task_scores.iterrows():
        print(f"- {row["Task"]}: {row["Score"]}")
# --- Step 3: Execute the script ---
if __name__ == "__main__":
    # Check if API key is set
    if "NEBIUS_API_KEY" not in os.environ:
        print("Error: NEBIUS_API_KEY environment variable not set.")
        print("Please set your API key: export NEBIUS_API_KEY")
    else:
        run_mmlu_test(base_url="https://api.studio.nebius.com/v1", 
                      model="openai/gpt-oss-120b",
                      api_key=os.getenv("NEBIUS_API_KEY"))
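
Since benchmark.task_scores behaves like a pandas DataFrame (the iterrows loop above relies on that), the per-task numbers can also be persisted for later comparison. A small optional follow-up, placed after benchmark.evaluate() has run:

# Optional: save per-task MMLU scores for later comparison
benchmark.task_scores.to_csv("mmlu_task_scores.csv", index=False)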
