Comparing Rating Systems
This page provides a side-by-side comparison of the rating systems implemented in Elote, helping you choose the right system for your specific use case.
Overview Comparison
Feature | Elo | Glicko | ECF | DWZ | Ensemble |
---|---|---|---|---|---|
Origin | Chess (1960s) | Chess (1995) | England (1950s) | Germany (1990s) | Meta-system |
Complexity | Low | Medium | Low | Medium | High |
Uncertainty Tracking | No | Yes (RD) | No | Partial | Depends on components |
Expected Score Formula | Logistic | Modified Logistic | Linear | Logistic | Weighted Average |
Inactivity Handling | No | Yes | No | Partial | Depends on components |
Implementation Difficulty | Easy | Moderate | Easy | Moderate | Complex |
Computational Cost | Low | Medium | Low | Medium | High |
Typical Use Cases | General purpose | Sparse competitions | English chess | Youth development | Complex domains |
Mathematical Formulation
System | Expected Outcome Formula |
---|---|
Elo | \(E_A = \frac{1}{1 + 10^{(R_B - R_A) / 400}}\) |
Glicko | \(E(A, B) = \frac{1}{1 + 10^{-g(RD_B) \times (r_A - r_B) / 400}}\), where \(g(RD) = \frac{1}{\sqrt{1 + 3 q^2 RD^2 / \pi^2}}\) and \(q = \ln(10) / 400\) |
ECF | \(E_A = 0.5 + \frac{R_A - R_B}{F}\), where F is typically 120 |
DWZ | \(W_e = \frac{1}{1 + 10^{-(R_A - R_B) / 400}}\) |
Ensemble | \(E_{ensemble} = \sum_{i=1}^{n} w_i \times E_i\), where the weights \(w_i\) are non-negative and sum to 1 |
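The formulas in the table can be checked numerically with a few standalone functions. This is a sketch written directly from the formulas above, not Elote's implementation; clipping the ECF value to [0, 1] is an added assumption, since the raw linear formula can leave that range.

```python
import math

# Constant used by Glicko: q = ln(10) / 400
Q = math.log(10) / 400


def elo_expected(r_a: float, r_b: float) -> float:
    """Logistic expected score of A against B (the Elo and DWZ formula)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def glicko_g(rd: float) -> float:
    """Glicko attenuation factor: 1 when RD is 0, smaller as RD grows."""
    return 1.0 / math.sqrt(1.0 + 3.0 * Q**2 * rd**2 / math.pi**2)


def glicko_expected(r_a: float, r_b: float, rd_b: float) -> float:
    """Glicko expected score of A against B, discounted by B's rating deviation."""
    return 1.0 / (1.0 + 10 ** (-glicko_g(rd_b) * (r_a - r_b) / 400))


def ecf_expected(r_a: float, r_b: float, f: float = 120.0) -> float:
    """Linear ECF-style expected score, clipped to [0, 1] (clipping is an assumption)."""
    return min(1.0, max(0.0, 0.5 + (r_a - r_b) / f))


def ensemble_expected(predictions: list[float], weights: list[float]) -> float:
    """Weighted average of component predictions; weights are normalized to sum to 1."""
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)


e_elo = elo_expected(1500, 1400)
e_glicko = glicko_expected(1500, 1400, rd_b=350)
print(f"Elo/DWZ:  {e_elo:.3f}")
print(f"Glicko:   {e_glicko:.3f}")
print(f"ECF:      {ecf_expected(120, 110):.3f}")
print(f"Ensemble: {ensemble_expected([e_elo, e_glicko], [0.5, 0.5]):.3f}")
```

With the opponent's RD at 350, Glicko pulls the prediction toward 0.5 relative to Elo; with RD = 0 the factor \(g\) equals 1 and the two formulas coincide.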
Key Parameters
System | Key Parameters |
---|---|
Elo | K-factor (determines the magnitude of rating changes) |
Glicko | Initial rating, Initial RD, Volatility, Tau |
ECF | K-factor, F-factor (conversion factor) |
DWZ | Initial rating, Development coefficient |
Ensemble | Component systems, Weights |
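The K-factor enters through the standard Elo update \(R'_A = R_A + K (S_A - E_A)\), where \(S_A\) is 1 for a win, 0.5 for a draw and 0 for a loss. The standalone sketch below (not Elote's code) shows that doubling K doubles the rating change, and that with a shared K the two players' changes cancel out.

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Logistic expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return the new (r_a, r_b) after one game; score_a is 1, 0.5 or 0."""
    delta = k * (score_a - elo_expected(r_a, r_b))
    # With a shared K-factor, A's gain is exactly B's loss (zero-sum).
    return r_a + delta, r_b - delta


for k in (16, 32):
    new_a, new_b = elo_update(1500, 1400, score_a=1.0, k=k)
    print(f"K={k}: A {new_a:.1f}, B {new_b:.1f}")
```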
Strengths and Weaknesses
Elo
Strengths:
- Simple to understand and implement
- Widely recognized and used
- Works well with sufficient data
- Zero-sum in two-player games (when both players use the same K-factor)

Weaknesses:
- No uncertainty measurement
- Requires many matches for accuracy
- A single fixed K-factor must trade fast adaptation for new players against stability for established ones
- Doesn’t handle inactivity well
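A common mitigation for the fixed K-factor problem is a K schedule that depends on experience and rating. The thresholds below are loosely modeled on FIDE-style rules and should be treated as illustrative assumptions, not as Elote's behavior. Note that once the two players can have different K values, an update is no longer guaranteed to be zero-sum.

```python
def k_factor(games_played: int, rating: float) -> float:
    """Illustrative K schedule: large for newcomers, smaller for established and top players."""
    if games_played < 30:
        return 40.0
    if rating < 2400:
        return 20.0
    return 10.0


def elo_gain(r_a: float, r_b: float, score_a: float, k: float) -> float:
    """Rating change for A after one game under the standard Elo update."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    return k * (score_a - expected_a)


# Same result (a win against an equally rated opponent) for three kinds of player
for games, rating in [(5, 1500), (200, 1500), (200, 2500)]:
    k = k_factor(games, rating)
    print(f"games={games:3d} rating={rating}: K={k:.0f}, gain={elo_gain(rating, rating, 1.0, k):+.1f}")
```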
Glicko
Strengths:
- Tracks rating reliability
- Handles inactivity appropriately
- More accurate for sparse competitions
- Rating deviation shows which ratings are still unreliable, which helps matchmaking

Weaknesses:
- More complex to implement
- Higher computational requirements
- More parameters to tune
- Less intuitive interpretation
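In Glicko-1, inactivity is handled by growing the rating deviation before each rating period: \(RD' = \min(\sqrt{RD^2 + c^2 t}, 350)\), where \(t\) is the number of periods since the player last competed and \(c\) is a system constant. This standalone sketch is not Elote's code; the value of `c` is an illustrative assumption, chosen so that an RD of 50 returns to 350 after 100 idle periods.

```python
import math

MAX_RD = 350.0  # RD of an unrated player; RD never grows beyond this


def inflate_rd(rd: float, periods_inactive: int, c: float) -> float:
    """Glicko-1 pre-period step: uncertainty grows while a player is inactive."""
    return min(math.sqrt(rd**2 + c**2 * periods_inactive), MAX_RD)


# Illustrative c: an RD of 50 climbs back to 350 after 100 idle periods.
c = math.sqrt((MAX_RD**2 - 50.0**2) / 100)

for t in (0, 1, 10, 100):
    print(f"after {t:3d} idle periods: RD = {inflate_rd(50.0, t, c):.1f}")
```

A larger RD then shrinks \(g(RD)\) in the expected-score formula, so predictions involving a long-inactive player move toward 0.5.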
ECF
Strengths:
- Linear relationship is easy to calculate
- Designed for English chess ecosystem
- Simple to understand
- Long history of use

Weaknesses:
- The linear expected score only stays within [0, 1] for moderate grade differences (\(|R_A - R_B| \le F/2\))
- Regional focus
- Less theoretical justification
- No uncertainty tracking
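The limited range follows directly from the linear form of the ECF formula: with \(F = 120\), \(E_A\) reaches 1 at a difference of 60 and exceeds it beyond that. This sketch evaluates the raw formula from the table, without any clipping, to show where it stops being a valid probability.

```python
def ecf_expected_raw(r_a: float, r_b: float, f: float = 120.0) -> float:
    """Unclipped linear expected score from the comparison table."""
    return 0.5 + (r_a - r_b) / f


def valid_range(f: float = 120.0) -> float:
    """Largest grade difference for which the linear formula stays within [0, 1]."""
    return f / 2


for diff in (0, 30, 60, 90):
    e = ecf_expected_raw(100 + diff, 100)
    flag = "" if 0.0 <= e <= 1.0 else "  <- not a valid probability"
    print(f"difference {diff:3d}: E_A = {e:.2f}{flag}")
```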
DWZ
Strengths:
- Handles youth development well
- Age and experience factors
- Good for tournament play
- National standardization

Weaknesses:
- Complex calculation
- Regional focus
- Parameter sensitivity
- Less international recognition
Ensemble
Strengths:
- Combines strengths of multiple systems
- More robust predictions
- Adaptable to different domains
- Graceful degradation

Weaknesses:
- Most complex to implement
- Highest computational cost
- Requires weight tuning
- Less interpretable
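Weight tuning can be as simple as a grid search over the blend weight that minimizes log loss on held-out games. This sketch assumes two component systems have already produced win probabilities for the same games; the numbers are toy values for illustration, not results from real data, and grid search is only one of several possible tuning methods.

```python
import math


def log_loss(preds: list[float], outcomes: list[int]) -> float:
    """Mean binary cross-entropy of predicted win probabilities."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(preds, outcomes):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(preds)


def tune_weight(p1: list[float], p2: list[float], outcomes: list[int], steps: int = 10) -> float:
    """Grid-search w in [0, 1] for the blend w * p1 + (1 - w) * p2."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blend = [w * a + (1 - w) * b for a, b in zip(p1, p2)]
        loss = log_loss(blend, outcomes)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w


# Toy predictions from two hypothetical component systems on four held-out games
p_system_a = [0.9, 0.2, 0.8, 0.3]
p_system_b = [0.4, 0.6, 0.5, 0.7]
outcomes = [1, 0, 1, 0]  # 1 = the first-listed player won
print(f"best weight for system A: {tune_weight(p_system_a, p_system_b, outcomes):.1f}")
```

Because log loss of a linear blend is convex in the weight, a coarse grid is usually enough to find a good region before refining.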
Choosing the Right System
Consider the following factors when choosing a rating system:
1. Data Density: How frequently do competitors face each other?
   - Sparse data: consider Glicko
   - Dense data: Elo may be sufficient
2. Domain Specifics:
   - Chess in England: ECF
   - Chess in Germany: DWZ
   - Youth development: DWZ
   - General purpose: Elo or Glicko
3. Computational Resources:
   - Limited resources: Elo or ECF
   - Sufficient resources: Glicko, DWZ, or Ensemble
4. Uncertainty Importance:
   - Critical to track uncertainty: Glicko
   - Uncertainty less important: Elo or ECF
5. Complexity Tolerance:
   - Need simple explanation: Elo or ECF
   - Can handle complexity: Glicko, DWZ, or Ensemble
6. Prediction Accuracy:
   - Highest accuracy needed: consider Ensemble
   - Reasonable accuracy sufficient: any individual system
Code Comparison
Here’s a quick comparison of how to use each system in Elote:
```python
from elote import EloCompetitor, GlickoCompetitor, ECFCompetitor, DWZCompetitor, EnsembleCompetitor

# Elo
elo_player = EloCompetitor(initial_rating=1500, k_factor=32)
# Glicko
glicko_player = GlickoCompetitor(initial_rating=1500, initial_rd=350, volatility=0.06)
# ECF (grades use their own, smaller scale)
ecf_player = ECFCompetitor(initial_rating=120, k_factor=16, f_factor=120)
# DWZ
dwz_player = DWZCompetitor(initial_rating=1600, initial_development_coeff=30)
# Ensemble: (component competitor, weight) pairs
ensemble_player = EnsembleCompetitor(
    rating_systems=[
        (EloCompetitor(initial_rating=1500), 0.5),
        (GlickoCompetitor(initial_rating=1500), 0.5),
    ]
)

# The method calls are the same for all systems, but each player must be
# compared with an opponent rated by the same system (illustrative ratings).
matchups = {
    "Elo": (elo_player, EloCompetitor(initial_rating=1400)),
    "Glicko": (glicko_player, GlickoCompetitor(initial_rating=1400)),
    "ECF": (ecf_player, ECFCompetitor(initial_rating=100)),
    "DWZ": (dwz_player, DWZCompetitor(initial_rating=1400)),
    "Ensemble": (
        ensemble_player,
        EnsembleCompetitor(
            rating_systems=[
                (EloCompetitor(initial_rating=1400), 0.5),
                (GlickoCompetitor(initial_rating=1400), 0.5),
            ]
        ),
    ),
}

for name, (player, opponent) in matchups.items():
    # Get the expected score before the game
    print(f"{name} expected: {player.expected_score(opponent):.2%}")
    # Record a win for the player
    player.beat(opponent)
```
Empirical Comparison
While theoretical comparisons are useful, the best way to choose a rating system is through empirical testing on your specific domain. Elote makes it easy to experiment with different systems and compare their predictive accuracy.
Here’s a simple approach to compare systems:
1. Split your historical match data into training and testing sets, ideally chronologically so that ratings never see future games.
2. Train each rating system on the training data.
3. Evaluate prediction accuracy on the test data.
4. Choose the system with the best performance for your specific use case.
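The four steps can be sketched as follows. To stay self-contained, the sketch uses a hypothetical `MiniElo` class rather than Elote's competitors, splits matches chronologically, and scores accuracy as the fraction of test games whose winner was given a better-than-even chance. The match list is toy data, and the two K values stand in for two candidate systems.

```python
from collections import defaultdict


class MiniElo:
    """Hypothetical minimal Elo system, used only to illustrate the evaluation loop."""

    def __init__(self, k: float = 32.0, initial: float = 1500.0) -> None:
        self.k = k
        self.ratings: dict[str, float] = defaultdict(lambda: initial)

    def expected(self, a: str, b: str) -> float:
        return 1.0 / (1.0 + 10 ** ((self.ratings[b] - self.ratings[a]) / 400))

    def record(self, winner: str, loser: str) -> None:
        delta = self.k * (1.0 - self.expected(winner, loser))
        self.ratings[winner] += delta
        self.ratings[loser] -= delta


def evaluate(system: MiniElo, matches: list[tuple[str, str]], train_fraction: float = 0.7) -> float:
    """Train on the earliest matches, then measure accuracy on the rest."""
    split = round(len(matches) * train_fraction)
    train, test = matches[:split], matches[split:]
    for winner, loser in train:
        system.record(winner, loser)
    # Ratings are frozen during testing; count games where the winner was favored.
    correct = sum(1 for winner, loser in test if system.expected(winner, loser) > 0.5)
    return correct / len(test)


# Toy (winner, loser) pairs in chronological order
MATCHES = [
    ("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("B", "C"),
    ("A", "C"), ("A", "B"), ("B", "C"), ("A", "C"), ("C", "B"),
]

for k in (16, 32):
    print(f"K={k}: test accuracy {evaluate(MiniElo(k=k), MATCHES):.2f}")
```

Updating ratings as the test games are played (walk-forward evaluation) is a common alternative, and probabilistic scores such as log loss or the Brier score usually separate systems better than raw accuracy.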
Remember that no rating system is universally best; the right choice depends on your specific requirements, data characteristics, and domain constraints.