Inference Scout Elo Ranking System

Introduction

The Elo rating system, originally designed for chess, has been adapted to evaluate the performance of Scout nodes in delivering LLM results within Chasm's CoDEX. This document explains how the Elo rating system is applied to assess the efficiency and reliability of Scout nodes.

We use the Elo ranking system because it rates scouts relative to one another, which makes it fairer than a fixed threshold, and because its incentives are consistent with a Nash equilibrium: no scout gains by deviating from honest, high-quality service. Scouts with higher Elo rankings are incentivized to maintain their performance and avoid malicious behavior, since poor or malicious behavior results in a loss of rank. Conversely, scouts with lower Elo rankings can quickly move up the ranks by outperforming higher-ranked scouts, which promotes a dynamic and competitive environment that rewards genuine performance improvements.
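
The rank dynamics described above follow from the standard Elo update rule. The sketch below uses the classic chess formulation (a 400-point scale and a K-factor of 32). These constants and the function names `expected_score` and `update_ratings` are illustrative assumptions, not confirmed parameters of Chasm's implementation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation: probability-like score for A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    rating_a: float,
    rating_b: float,
    score_a: float,
    k: float = 32.0,  # assumed K-factor; the real value is not documented here
) -> tuple[float, float]:
    """Return new ratings for A and B after one comparison.

    score_a is 1.0 if A outperformed B, 0.0 if B outperformed A, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Zero-sum update: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # A lower-ranked scout beating a higher-ranked one gains a lot...
    print(update_ratings(1200.0, 1600.0, 1.0))
    # ...while a higher-ranked scout beating a lower-ranked one gains little.
    print(update_ratings(1600.0, 1200.0, 1.0))
```

With these assumed constants and a 400-point gap, the underdog gains about 29 points for a win, while the favorite gains only about 3 for the same result. This asymmetry is what lets lower-ranked scouts climb quickly and makes a lapse costly for higher-ranked ones.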

Metrics for Elo Calculation

Performance Metrics

Scout nodes are evaluated based on three key performance metrics for each response:

  1. End-to-End Latency (EEL):

    • Measures the time taken for a request to be processed and a response to be delivered.

    • Scored on a scale from 0 to 1.

  2. Output Tokens per Second (OTS):

    • Evaluates the number of tokens generated by the LLM per second.

    • Scored on a scale from 0 to 1.

  3. Server Uptime (SU):

    • Tracks the availability and reliability of the server.

    • Scored on a scale from 0 to 1.

The overall performance score is calculated using the following formula:

Performance = k_1 * EEL + k_2 * SU + k_3 * OTS

where k_1, k_2, and k_3 are weighting factors that balance the contribution of each metric to the overall performance score.
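
As a minimal sketch, the weighted sum above can be computed as follows. The default weights (0.4, 0.3, 0.3) and the names `Weights` and `performance_score` are hypothetical; the source does not state the actual values of k_1, k_2, and k_3, nor whether they sum to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Hypothetical values chosen for illustration only.
    k1: float = 0.4  # weight for End-to-End Latency (EEL)
    k2: float = 0.3  # weight for Server Uptime (SU)
    k3: float = 0.3  # weight for Output Tokens per Second (OTS)


def performance_score(
    eel: float, su: float, ots: float, w: Weights = Weights()
) -> float:
    """Performance = k_1 * EEL + k_2 * SU + k_3 * OTS, each metric in [0, 1]."""
    for name, value in (("EEL", eel), ("SU", su), ("OTS", ots)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return w.k1 * eel + w.k2 * su + w.k3 * ots


if __name__ == "__main__":
    print(performance_score(eel=0.8, su=1.0, ots=0.6))
```

If the weights are chosen to sum to 1, as in this sketch, the overall score also stays within the 0 to 1 range, which keeps it comparable across scouts.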

Evaluating Performance

Performance is assessed across both miners and validators, because a validator can also function as a miner. This dual role calls for a single evaluation mechanism that covers both, so that ratings stay consistent and reliable across the network.

Quality Check Using Health Checks

In addition to performance metrics, the Elo rating system also incorporates quality checks to evaluate the effectiveness of LLM responses. This is achieved through the LLM-as-a-judge method, which compares the quality of the LLM response to that of an average LLM.

Health Check Process

  • Quality Assessment:

    • The server performs a quality check on the LLM response.

    • This involves using an LLM-as-a-judge to evaluate the accuracy, relevance, and coherence of the response.

  • Comparison to Average LLM:

    • The response is compared against the benchmark set by an average LLM.

    • The comparison helps determine the relative quality and effectiveness of the response.
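
The health check process above can be sketched as a comparison between two judged responses. Everything named here is an assumption made for illustration: the `Judge` signature, the `margin` tie band, and the `toy_judge` stand-in, which only counts prompt words echoed in the response and is not a real LLM judge. A real deployment would call an actual LLM to score accuracy, relevance, and coherence.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, response) -> quality score in [0, 1].
Judge = Callable[[str, str], float]


def health_check(
    prompt: str,
    scout_response: str,
    baseline_response: str,
    judge: Judge,
    margin: float = 0.05,  # assumed tie band; not specified in the source
) -> float:
    """Compare a scout's response with an average-LLM baseline.

    Returns 1.0 if the scout is better, 0.0 if worse, 0.5 if within the margin.
    """
    scout_quality = judge(prompt, scout_response)
    baseline_quality = judge(prompt, baseline_response)
    if scout_quality > baseline_quality + margin:
        return 1.0
    if scout_quality < baseline_quality - margin:
        return 0.0
    return 0.5


def toy_judge(prompt: str, response: str) -> float:
    """Stand-in judge: fraction of prompt words that appear in the response."""
    words = set(prompt.lower().split())
    if not words:
        return 0.0
    text = response.lower()
    return sum(1 for word in words if word in text) / len(words)


if __name__ == "__main__":
    print(health_check("elo rating", "The Elo rating system", "I do not know", toy_judge))
```

Returning 1.0, 0.5, or 0.0 mirrors the win, draw, and loss outcomes that an Elo-style update consumes, which is one plausible way such a check could feed into a rating.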

Incorporating Health Checks into Elo Rating

The results of the health checks are integrated into the Elo rating alongside the performance metrics. As a result, each Scout node's rating reflects both how quickly and reliably it responds and how good its responses are.
