Evaluation and Benchmarks

Mneno provides local evaluation infrastructure for retrieval, context building, and compaction. It does not bundle benchmark datasets, external evaluators, telemetry, or analytics uploads.

Evaluate search

from mneno import MemoryClient

# Enable tracing so the evaluation can report trace events.
client = MemoryClient(trace_enabled=True)
memory = client.add("User is building Mneno.", importance=0.9)

# Supplying the relevant IDs turns on the ranking metrics.
result = client.evaluate_search(
    "What is the user building?",
    relevant_memory_ids=[memory.id],
    limit=10,
)

print(result.result_count, result.candidate_count, result.latency_ms)
print(result.metrics)

When relevant IDs are supplied, search evaluation reports precision@k, recall@k, and MRR. It also reports scanned and selected counts, latency, decision count, and trace event count.
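For readers unfamiliar with these ranking metrics, the sketch below shows their textbook definitions in plain Python. It is not Mneno's implementation: the function names are hypothetical, and Mneno may differ in details such as how it handles fewer than k results. Note that MRR is a mean over several queries; for a single query it reduces to the reciprocal rank.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k slots filled by relevant items (divides by k)."""
    if k <= 0:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top k."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is returned."""
    relevant = set(relevant_ids)
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
print(precision_at_k(ranked, relevant, 2))  # one of the top two is relevant
print(recall_at_k(ranked, relevant, 2))     # one of two relevant items found
print(reciprocal_rank(ranked, relevant))    # first hit is at rank 2
```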

Evaluate context

result = client.evaluate_context(
    "What is the user building?",
    budget=1200,
    relevant_memory_ids=[memory.id],
)

print(result.included_count, result.excluded_count)
print(result.estimated_tokens, result.budget)

Context evaluation includes token efficiency, budget utilization, inclusion and exclusion reason counts, relevance, latency, and trace coverage.
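As a rough illustration of two of these figures, the sketch below computes budget utilization as the share of the budget consumed by the estimated tokens, and tallies reasons with a counter. Both helper names and the reason strings are hypothetical, and the formulas are one plausible reading rather than Mneno's documented definitions.

```python
from collections import Counter


def budget_utilization(estimated_tokens, budget):
    """Share of the token budget used by the built context (assumed formula)."""
    if budget <= 0:
        return 0.0
    return estimated_tokens / budget


def reason_counts(reasons):
    """Count how often each inclusion or exclusion reason occurs."""
    return dict(Counter(reasons))


# Hypothetical values: a 1200-token budget with 600 tokens used.
print(budget_utilization(600, 1200))
print(reason_counts(["over_budget", "low_score", "over_budget"]))
```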

Evaluate compaction

result = client.evaluate_compaction()

print(result.before_count, result.after_count)
print(result.reduction_ratio)

Compaction evaluation previews changes by default and does not mutate storage. Pass apply=True only when you want the evaluation to write the compacted result back to storage.
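The preview-by-default pattern can be illustrated with a small stand-in that does not use Mneno at all. Everything here is an assumption made for illustration: the list-backed store, the deduplication used as "compaction", and the reduction ratio defined as (before - after) / before.

```python
from dataclasses import dataclass


@dataclass
class CompactionPreview:
    before_count: int
    after_count: int

    @property
    def reduction_ratio(self) -> float:
        # Assumed definition: share of entries removed by compaction.
        if self.before_count == 0:
            return 0.0
        return (self.before_count - self.after_count) / self.before_count


def evaluate_compaction(store: list, apply: bool = False) -> CompactionPreview:
    """Measure a compaction; only touch the store when apply=True."""
    # Stand-in compaction: drop exact duplicates, keeping first occurrences.
    compacted = list(dict.fromkeys(store))
    preview = CompactionPreview(len(store), len(compacted))
    if apply:
        store[:] = compacted  # mutate in place only on explicit request
    return preview


store = ["a", "b", "a", "c"]
preview = evaluate_compaction(store)
print(preview.before_count, preview.after_count, preview.reduction_ratio)
print(len(store))  # the store is unchanged after a preview
```

Keeping mutation behind an explicit flag means an evaluation can run repeatedly against the same data and report the same numbers each time.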

Serialize operation results

All evaluation result models provide stable JSON-compatible helpers:

payload = result.to_dict()
json_text = result.to_json()

Build a benchmark report

report = client.build_evaluation_report(
    benchmark_name="local-smoke-test",
    metrics=result.metrics,
    trace_ids=result.trace_ids,
    summary="Compaction evaluation complete",
)

benchmark_payload = client.export_benchmark_result(report)

Benchmark exports use this versioned envelope:

{
  "format": "mneno.benchmark.result",
  "version": 1,
  "benchmark": "local-smoke-test",
  "created_at": "2026-06-07T12:00:00Z",
  "metrics": [],
  "traces": [],
  "metadata": {}
}
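A consumer of these exports might check the envelope before reading it. The sketch below is a minimal, hypothetical validator based only on the fields shown above. It assumes the export has already been parsed into a dict, and validate_envelope is not part of Mneno.

```python
EXPECTED_FORMAT = "mneno.benchmark.result"
SUPPORTED_VERSION = 1

# Field names and types taken from the envelope shown above.
REQUIRED_FIELDS = {
    "format": str,
    "version": int,
    "benchmark": str,
    "created_at": str,
    "metrics": list,
    "traces": list,
    "metadata": dict,
}


def validate_envelope(payload):
    """Return a list of problems; an empty list means the envelope looks valid."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    # Only check values when the field is present with the right type.
    fmt = payload.get("format")
    if isinstance(fmt, str) and fmt != EXPECTED_FORMAT:
        problems.append(f"unexpected format: {fmt}")
    version = payload.get("version")
    if isinstance(version, int) and version != SUPPORTED_VERSION:
        problems.append(f"unsupported version: {version}")
    return problems


envelope = {
    "format": "mneno.benchmark.result",
    "version": 1,
    "benchmark": "local-smoke-test",
    "created_at": "2026-06-07T12:00:00Z",
    "metrics": [],
    "traces": [],
    "metadata": {},
}
print(validate_envelope(envelope))
```

Checking the version before reading the payload lets a reader reject envelopes it does not understand instead of misinterpreting them.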

Implement a benchmark adapter

External benchmark packages implement BenchmarkAdapter:

from mneno import BenchmarkAdapter, EvaluationReport, MemoryClient


class LocalAdapter:
    name = "local"

    def prepare(self, client: MemoryClient) -> None:
        self.client = client

    def run(self) -> EvaluationReport:
        result = self.client.evaluate_search("Mneno")
        return self.client.build_evaluation_report(
            benchmark_name=self.name,
            metrics=result.metrics,
            trace_ids=result.trace_ids,
        )


adapter: BenchmarkAdapter = LocalAdapter()
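Because LocalAdapter satisfies BenchmarkAdapter without subclassing it, the interface appears to be structural. The sketch below shows how a runner could drive such adapters. It uses a hypothetical stand-in protocol, AdapterLike, rather than Mneno's own type, and treats the client and report as opaque values; it is an assumption about usage, not the library's API.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class AdapterLike(Protocol):
    """Hypothetical stand-in mirroring the adapter shape shown above."""

    name: str

    def prepare(self, client: Any) -> None: ...

    def run(self) -> Any: ...


def run_adapters(adapters, client):
    """Prepare and run each adapter, collecting reports keyed by adapter name."""
    reports = {}
    for adapter in adapters:
        if not isinstance(adapter, AdapterLike):
            raise TypeError(f"{adapter!r} does not look like an adapter")
        adapter.prepare(client)
        reports[adapter.name] = adapter.run()
    return reports


class EchoAdapter:
    """Toy adapter that returns a dict instead of a real report."""

    name = "echo"

    def prepare(self, client):
        self.client = client

    def run(self):
        return {"benchmark": self.name, "client": self.client}


print(run_adapters([EchoAdapter()], client="fake-client"))
```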

Future LOCOMO, LongMemEval, and BEAM adapters belong in the separate Mneno Bench distribution. They can consume these typed results and versioned trace exports without adding benchmark dependencies to core.
