One of the best ways to test one engine against another is by using SPRT, which stands for "Sequential Probability Ratio Test." One program that can run such matches is cutechess-cli, through its -sprt parameter.
Explaining all the mathematics behind SPRT is beyond the scope of this book, even if I understood every detail of it. I'm a programmer, not a statistician.
It is sufficient to understand the basics.
You run a match between two engines. This match does not have a set length, but to make sure it doesn't run 'forever', we normally set a very large maximum number of games, such as 20,000. To make SPRT work, we need to set two hypotheses: one we hope is true (H1), and one we hope is not true (H0, the null hypothesis). We allow for a 5% chance that the SPRT test gives us the wrong result, so we are 95% confident that the result is correct. (You can lower the 5% margin, but the test will then run a lot longer.)
Let's say we have a new engine, called version NEW. We also have an old version, which we call version OLD.
Now we state:
- H1: Engine NEW is at least 1 Elo stronger than engine OLD.
- H0: Engine NEW is NOT more than 5 Elo stronger than engine OLD.
- Error margin: 5%.
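To get a feeling for how small a difference of 1 or 5 Elo really is, it helps to convert Elo into an expected score. The sketch below uses the standard logistic Elo formula; the function name `expected_score` is only illustrative and does not come from cutechess-cli.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score (between 0 and 1) for a player who is elo_diff Elo stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Print the expected score for a few Elo differences.
for elo in (0, 1, 5, 100):
    print(f"{elo:>4} Elo -> {expected_score(elo) * 100:.2f}% expected score")
```

A difference of 1 Elo corresponds to an expected score of roughly 50.1%, and 5 Elo to roughly 50.7%. Telling those two apart is why an SPRT run can need thousands of games.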
When running cutechess-cli, we give it the -sprt parameter:
-sprt elo0=1 elo1=5 alpha=0.05 beta=0.05
(Alpha and Beta have nothing to do with Alpha/Beta searching. In this case, Alpha is the chance that we accept H1 while we shouldn't, and Beta is the chance that we accept H0 while we shouldn't; i.e., a chance of 5% that we get the wrong result from the test.)
A cutechess-cli command could look like this:
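In the textbook formulation of SPRT (Wald's test), Alpha and Beta are turned into two stopping bounds for a running log-likelihood ratio (LLR). The sketch below shows that conversion; the function name `sprt_bounds` is an assumption made for illustration, and nothing here is taken from the cutechess-cli source code.

```python
import math


def sprt_bounds(alpha: float, beta: float) -> tuple[float, float]:
    """Return Wald's (lower, upper) stopping bounds for the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))   # at or below this: accept H0
    upper = math.log((1.0 - beta) / alpha)   # at or above this: accept H1
    return lower, upper


lower, upper = sprt_bounds(0.05, 0.05)
print(f"accept H0 when LLR <= {lower:.3f}, accept H1 when LLR >= {upper:.3f}")
```

With alpha=0.05 and beta=0.05 the bounds are roughly -2.94 and +2.94. Smaller error margins push the bounds further apart, which is why a stricter test runs longer.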
cutechess-cli \
    -engine conf="Rustic Alpha 3.15.100" \
    -engine conf="Rustic Alpha 3.1.112" \
    -each \
        tc=inf/10+0.1 \
        book="/home/marcel/Chess/OpeningBooks/gm1950.bin" \
        bookdepth=4 \
    -games 2 -rounds 2500 -repeat 2 -maxmoves 200 \
    -sprt elo0=0 elo1=10 alpha=0.05 beta=0.05 \
    -concurrency 4 \
    -ratinginterval 10 \
    -pgnout "/home/marcel/Chess/sprt.pgn"
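With this command, we're testing 3.15.100 (NEW) against 3.1.112 (OLD). We run 2500 rounds of 2 games each, for at most 5,000 games, and each engine plays the same opening once with white and once with black. The time control is 10 seconds + 0.1 seconds increment. Note that this particular command uses elo0=0 and elo1=10 instead of the elo0=1 and elo1=5 from the example above.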
Now we start the match between our NEW and OLD (previous) engine version, and cutechess-cli will start to play games.
Let's say that, after 400 games, NEW is 100 Elo stronger than OLD. This could still change if you play enough games, but that is the point of SPRT testing: as soon as cutechess-cli has seen enough games to accept one of the two hypotheses within the 5% error margin, it stops the match, which saves a lot of time.
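The stopping rule can be sketched with a commonly used normal approximation of the log-likelihood ratio, computed from the win/draw/loss counts so far. This is a simplified illustration under that assumption, not necessarily the calculation cutechess-cli performs internally, and the names `llr` and `sprt_status` are hypothetical.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Logistic Elo formula: expected score for a player elo_diff Elo stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def llr(wins: int, draws: int, losses: int, elo0: float, elo1: float) -> float:
    """Approximate log-likelihood ratio of elo1 versus elo0 (normal approximation)."""
    games = wins + draws + losses
    if games == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / games                  # average score per game
    var = (wins + 0.25 * draws) / games - mean * mean    # per-game score variance
    if var <= 0.0:
        return 0.0                                       # not enough variety yet
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return games * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def sprt_status(wins, draws, losses, elo0, elo1, alpha, beta) -> str:
    """Return 'accept H1', 'accept H0' or 'continue' for the results so far."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    value = llr(wins, draws, losses, elo0, elo1)
    if value >= upper:
        return "accept H1"
    if value <= lower:
        return "accept H0"
    return "continue"


# Made-up results, checked against elo0=1 elo1=5 alpha=0.05 beta=0.05.
print(sprt_status(490, 300, 210, 1, 5, 0.05, 0.05))     # NEW clearly ahead
print(sprt_status(1285, 1200, 1515, 1, 5, 0.05, 0.05))  # NEW clearly behind
print(sprt_status(30, 40, 30, 1, 5, 0.05, 0.05))        # too early to tell
```

The win/draw/loss counts in the usage lines are invented for illustration. In a real run, the status would be re-evaluated after every finished game, and the match stops the first time it is no longer "continue".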
At that point, cutechess-cli compares the result in Elo against the stated hypotheses. With a result of +100 Elo for engine NEW, we can see:
- H1: NEW is at least 1 Elo stronger than OLD. True, because 100 > 1.
- H0: NEW is NOT more than 5 Elo stronger than OLD. False, because it IS more than 5 Elo stronger (100 Elo is more than 5 Elo).
So H1 is accepted, and we have determined that NEW is a stronger engine than OLD, and by roughly how much, measured in self-play Elo.
If NEW scored -20 Elo, then we would have had this result:
- H1: NEW is at least 1 Elo stronger than OLD. False, because -20 < 1.
- H0: NEW is NOT more than 5 Elo stronger than OLD. True, because -20 is indeed less than 5 Elo.
So H0 is accepted, which means that our NEW engine is not stronger than our OLD engine; it is actually weaker, and we should not include this feature. (At least, not yet: a feature which causes a strength loss now could cause a strength gain when added on top of other features, so it's worth trying again later.)
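The Elo figures used in these comparisons, such as +100 or -20, can be derived from the match score. The sketch below inverts the usual logistic Elo formula; it assumes that formula and uses hypothetical function names, since the text does not say how cutechess-cli computes its own Elo estimate.

```python
import math


def elo_from_score(score: float) -> float:
    """Convert a score fraction (strictly between 0 and 1) into an Elo difference."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_from_results(wins: int, draws: int, losses: int) -> float:
    """Elo difference implied by a win/draw/loss record (a draw counts as half a point)."""
    games = wins + draws + losses
    return elo_from_score((wins + 0.5 * draws) / games)


# Made-up record: 196 wins, 120 draws, 84 losses is a 64% score over 400 games.
print(f"{elo_from_results(196, 120, 84):+.1f} Elo")
```

A 64% score works out to roughly +100 Elo, and a record with more losses than wins yields a negative number. A 100% or 0% score has no finite Elo value, which is why the function rejects it.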
As long as the measured difference between NEW and OLD stays between 1 and 5 Elo, the SPRT test keeps running, because both hypotheses are still true: 3 Elo is "at least 1 Elo stronger", but it is also "not more than 5 Elo stronger." If this result doesn't change, cutechess-cli would run forever; that is why we set a match limit of something like 20,000 games.
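Why a result between the two Elo values can keep the test running can be illustrated with a rough estimate of the match length. The sketch below assumes the normal approximation of the log-likelihood ratio, an assumed draw ratio, and Wald's simple "bound divided by average drift per game" estimate. It ignores several effects and only gives an order of magnitude; all names are hypothetical.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Logistic Elo formula: expected score for a player elo_diff Elo stronger."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def llr_drift_per_game(true_elo, elo0, elo1, draw_ratio) -> float:
    """Average change of the approximate LLR per game at a given true strength."""
    mu = expected_score(true_elo)
    if mu - draw_ratio / 2.0 < 0.0 or 1.0 - mu - draw_ratio / 2.0 < 0.0:
        raise ValueError("draw_ratio is impossible for this Elo difference")
    var = mu - draw_ratio / 4.0 - mu * mu    # variance of one game's score
    s0, s1 = expected_score(elo0), expected_score(elo1)
    return (s1 - s0) * (2.0 * mu - s0 - s1) / (2.0 * var)


def rough_games_needed(true_elo, elo0, elo1, alpha, beta, draw_ratio) -> float:
    """Very rough estimate of the number of games before the test stops."""
    drift = llr_drift_per_game(true_elo, elo0, elo1, draw_ratio)
    if abs(drift) < 1e-12:
        return math.inf                      # the LLR does not move on average
    upper = math.log((1.0 - beta) / alpha)
    lower = math.log(beta / (1.0 - alpha))
    return (upper if drift > 0 else lower) / drift


# Assumed draw ratio of 40%; elo0=1 elo1=5 alpha=0.05 beta=0.05 as in the example.
for elo in (-20, 1, 3, 5, 10, 100):
    games = rough_games_needed(elo, 1, 5, 0.05, 0.05, 0.4)
    print(f"true difference {elo:>4} Elo: roughly {games:,.0f} games")
```

The estimate is small when the true difference is far outside the 1 to 5 Elo window and becomes enormous near the middle of it, where each game barely moves the ratio. That is the situation the game limit protects against.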