THE BRIDGE WORLD

Bridge Rating System for Partnerships

by John Curley

November, 2010

"The introduction of chess rating systems has probably done more to influence the popularity of tournament chess than any other single factor. . . "
—Mark E. Glickman, A comprehensive guide to chess ratings

Introduction

The ACBL has used the masterpoint system for many years. It provides masterpoint rewards to players who do well in club games and tournaments, and maintains the accumulated totals in a member database to provide players with a measure of progress.

The masterpoint system suffers a number of deficiencies compared to systems in other games and sports. Because masterpoints are cumulative, two players with the same masterpoints could have very different strengths, one perhaps being a weaker player who has played more often or with stronger partners.

Contrast this with chess where ratings go up and down based on performance in the context of the strength of the field. A player has an expected performance calculated from his rating and that of his opponents. A strong player must do much better than the average player in a tournament to avoid losing rating points. Chess players have an accurate measure of progress over time.

A more fundamental issue with the masterpoint system is that it is based on individuals, whereas bridge has no easy measure of a player's performance except indirectly by seeing how partnerships perform in competition. It is really pairs who compete, and any rating scheme that does not reflect this can't be accurate.

In bridge then, it seems more sensible to rate pairs, something that we can do relatively accurately with a chess-like system. A pairs rating system would give partnerships a reliable measure of their strength and enable them to track growth. Players could compare performance with different partners. This document describes a system and explores technical issues. The scheme is proposed for the ACBL, but could be used by the WBF and other bridge organizations.

It is only the arrival of cheap computing power and the internet that makes chess-like rating feasible for bridge on a large scale. The appendix discusses computer issues.

Proposed scheme

The proposed pairs rating uses the "Elo" scheme implemented for chess in 1960 and used successfully in other games and sports. A good description is in the Wikipedia article on the Elo rating system. Our proposed scheme differs from chess in that the rating is attached not to the individual player but to the pair, meaning the partnership.

Here is an example of how the system is applied. Say we have a Mitchell movement in a club game and east-west pairs ew1, ew2,... have ratings r1, r2, ... The ratings give rise to expected performance levels for these pairs in competition. The formula for ew1's expected performance versus ew2, based on the difference between ratings r1 and r2, using the Elo system is

1 / [ 1 + 10^( (r2 - r1) / 400 ) ]          (A)

Thus if the first pair has a 200-point edge in ratings over the second, its performance expectation is about 0.76, meaning that roughly three times out of four, ew1 should come out ahead of ew2 in the contest. By adding the expected performances of ew1 in turn against ew2, ew3, etc., we arrive at ew1's expected performance, e, for the session, a value that predicts the number of pairs ew1 should beat.

For ew1's actual performance, a, count one for each pair it beats, one-half for a tie, 0 for a loss. The degree to which ew1 did better or worse than expected is taken as evidence that its rating should be incremented or decremented, as follows:

rating adjustment = K * ( a - e )          (B)

where K is a factor that determines the rating points that can be gained or lost per session. This rating adjustment is added to r1. (A small modification to formula B is discussed later.)
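
As a concrete illustration, here is a minimal Python sketch of formulas A and B. The ratings 1700 and 1500 are illustrative, and K = 32 is taken from the examples later in the document.

    def expected_score(r1, r2):
        """Formula A: expected performance of the pair rated r1 against the pair rated r2."""
        return 1.0 / (1.0 + 10 ** ((r2 - r1) / 400.0))

    def rating_adjustment(actual, expected, k=32):
        """Formula B: rating points gained or lost from one encounter."""
        return k * (actual - expected)

    # ew1 holds a 200-point edge over ew2 and finishes ahead of it (actual score 1):
    e = expected_score(1700, 1500)                # about 0.76
    print(round(rating_adjustment(1.0, e), 1))    # about +7.7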

The bridge "session" is the basis for rating adjustments. A session is continuous play of perhaps 20-30 boards in a team or pairs game. For example, in an afternoon plus evening event with four sections ("A", "B", "C", "D"), there are eight independent sessions for rating purposes.

These calculations would be performed monthly from tournament reports filed electronically with the ACBL. The ACBL would post updated ratings on its website. See the appendix for a discussion of computer issues.

In formula B the US Chess Federation (USCF) uses a K value of 32, but it sets K larger for newcomers to allow faster convergence from provisional ratings towards true levels. The same reasoning applies in bridge for new partnerships -- see "Unrated Pairs" and "Choice of K" below.

Unrated pairs and provisional ratings

It will often be the case that a pair hasn't played together in competition; this will happen much more frequently than the 'unrated player' does in chess. However, each player's history with other partners, information not available in chess, can provide an initial provisional rating, derived from the ratings of the partnerships in which the two are separately involved.

To assign an initial provisional rating, a simple idea is to average all ratings, including provisional ratings, of pairs in which the two individuals are separately involved. Refinements are available: perhaps the average is weighted according to some combination of currency, the number of sessions each rating is based on, and other factors. Individuals with no rated partnerships may have masterpoints, which serve as a guide to an initial rating. To kick-start the system as a whole, masterpoints are the obvious guide to setting the initial partnership rating.
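
Here is a minimal sketch of the simple averaging idea. The weighting refinements are omitted, and the masterpoint-based fallback value of 1200 is a placeholder for illustration only.

    DEFAULT_FROM_MASTERPOINTS = 1200   # placeholder for a masterpoint-derived start value

    def initial_provisional_rating(ratings_of_a, ratings_of_b):
        """Average all pair ratings (provisional or established) in which the two
        prospective partners are separately involved; fall back to masterpoints."""
        pool = list(ratings_of_a) + list(ratings_of_b)
        if not pool:
            return DEFAULT_FROM_MASTERPOINTS
        return sum(pool) / len(pool)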

Provisional ratings are then revised based on performance until enough play has transpired that the rating becomes 'established' -- say after five or ten sessions of play. This provisional period also gives new partnerships time to get comfortable with their systems. Probably, a pair's rating should have a stale-date of several years, after which, if the pair re-enters competition, their old rating is considered provisional.

Because provisional ratings are less accurate than established ratings, they should be removed from the formula used to adjust established ratings to prevent spreading error. Of course, the full formula should be used to adjust the provisional ratings themselves based on session performance. This means that rating updates for pairs with established ratings are based only on their performance against similar 'established' pairs in the session, while provisional pairs get updates based on performance against other provisionals as well as the established 'reference' pairs, the latter perhaps given extra weight.
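
In code, the distinction might look like the following sketch, where field is an illustrative mapping from each pair in the session to its rating and an established/provisional flag.

    def rating_opponents(pair, field):
        """field: pair id -> (rating, is_established).
        Established pairs are measured only against other established pairs;
        provisional pairs are measured against the whole field."""
        _, established = field[pair]
        return [p for p, (_, est) in field.items()
                if p != pair and (est or not established)]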

Choice of K

Consider the Elo model in USCF chess, where players have ratings from 100 to about 2800, with an average of 1500. Using formula A above, a player with a rating advantage of 200 over an opponent has an expectation of about 0.76, and can be expected to win three games out of four. With K=32, a win brings him 8 rating points while a loss costs 24. His and his opponent's ratings have a standard deviation of about 14, a measure of average rating fluctuation per game. Over time his average rating doesn't change if the ratings are accurate.

A bridge session takes about the same time as a chess game. It is desirable to have similar game expectation and standard deviation in both games. In a hypothetical situation where only two pairs compete, formulas A and B above work fine, and this is almost applicable to KO or Swiss teams, discussed later [but beware pitfalls in the use of pair ratings in team games; see the next section]. However, in a pairs contest or B-A-M teams we have multiple opponents per session. Should K and other factors change in the formulas above?

The answer (discussion in appendix) is that in formula B, K stays the same but the "actual minus expected" is divided by the square root of n, where n is the number of opponents. This brings the standard deviations in line with the two-pair case. The rating adjustment formula when facing more than one opposing pair becomes:

rating adjustment = K * ( a - e ) / sqrt(n)          (C)

If for example there are 13 pairs, n=12, with square root 3.5, and this change brings the rating fluctuations of a 13-pair contest in line with the 2-pair case. Formulas B and C are identical if n=1.
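
A sketch of formula C applied to one pairs session follows; the ratings, results, and the 13-pair field are illustrative.

    import math

    def session_rating_adjustment(own_rating, opp_ratings, results, k=32):
        """opp_ratings: ratings of the n opposing pairs; results: 1, 0.5 or 0 against each."""
        n = len(opp_ratings)
        e = sum(1 / (1 + 10 ** ((r - own_rating) / 400)) for r in opp_ratings)  # formula A, summed
        a = sum(results)
        return k * (a - e) / math.sqrt(n)

    # 13-pair field: a pair rated 1700 faces twelve pairs rated 1500 and beats ten of them.
    print(round(session_rating_adjustment(1700, [1500] * 12, [1] * 10 + [0] * 2), 1))
    # e is about 9.1, a is 10, so the adjustment is roughly +8.2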

For board-a-match teams, an additional change is needed. With certain assumptions discussed in the next section, a team can be thought of as having an effective rating of

r = ( r1 + r2 ) / sqrt(2)

where r1 and r2 are the ratings of the constituent pairs. Formula A predicts team-versus-team performance, and formula C with appropriate 'n' calculates rating adjustments for a B-A-M session. Rating adjustments apply equally to both pairs in the team.

Examples in the rest of the document assume a value K=32. As in chess, a larger K is desirable for provisionally rated players to allow faster convergence, and a smaller K is desirable in the rarefied upper echelons.

Adjustments to pairs rating based on team play

There are some caveats with adjusting pairs ratings based on performance in team play, discussed at the end of this section. But to describe one approach . . . in a team contest, suppose the ratings of the constituent pairs of team X are r1 and r2, and for team Y, r3 and r4. On the basis of the Elo model, team X and team Y have effective ratings

rX = ( r1 + r2 ) / sqrt(2)          (D1)

rY = ( r3 + r4 ) / sqrt(2)          (D2)

With this change, formula A is used to calculate expected team performance.

As to rating point adjustments, for sessions of 20+ boards, formula C (with n=1) is applied, and the calculated adjustment applies equally to constituent pairs. If any pair is unrated or provisionally rated, no 'established' rating points change.

If pairs on team X are rated 200 points above Y, the rating fluctuation per session is about 10, smaller than one might expect because, in effect, the pairs on each team pool gains or losses.

Swiss teams have shorter team-versus-team encounters, and we need to set 'n' in formula C according to the number of opposing teams per session; often n=4 in 7-board Swiss.
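
Putting formulas D1, D2, A, and C together for a teams session, a sketch might look like this; the ratings and the single-opponent example are illustrative, and the established/provisional caveat above is omitted.

    import math

    def team_rating(r1, r2):
        """Formulas D1/D2: effective team rating from its two pair ratings."""
        return (r1 + r2) / math.sqrt(2)

    def team_session_adjustment(own_pairs, opposing_teams, results, k=32):
        """own_pairs: (r1, r2); opposing_teams: one (r3, r4) per opposing team;
        results: 1, 0.5 or 0 against each. The returned adjustment is applied
        equally to both constituent pairs of the team."""
        rX = team_rating(*own_pairs)
        n = len(opposing_teams)
        e = sum(1 / (1 + 10 ** ((team_rating(*t) - rX) / 400)) for t in opposing_teams)
        a = sum(results)
        return k * (a - e) / math.sqrt(n)

    # Head-to-head session (n=1): our pairs are rated 200 points above theirs, and we win.
    print(round(team_session_adjustment((1700, 1700), [(1500, 1500)], [1]), 1))  # a modest gain, roughly +5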

It is tempting to use the size of the win or loss in imps or B-A-M boards in the calculations of expected performance and rating point adjustments. Some fudge factors may (or may not) improve accuracy at a cost of complexity. Such refinements could be left for future research.

Caveat: There are significant assumptions behind the use of team game results to adjust pair ratings. The formulas D1 and D2 don't reflect the different character of pairs versus team games, or that a pair may perform much better in one venue than the other. Furthermore the simple additive model (D1 and D2) suggested by the Elo system may not be useful -- are we simply handing out "ACBL rating points" to each winning constituent pair, albeit giving negative rating points to the losers? On the other hand, if D1 and D2 do correlate strongly with team performance then some adjustment to ratings of constituent pairs is justified, because the adjustments decrease overall variance of the four pair ratings from their supposed true values.

So one big question is . . . if, due to the significant differences between pairs and team play, a pair's rating goes up like a rocket in one, then drops like a rock in the other, what does the rating mean? Should pair ratings measure only pairs play, or do we need two ratings for the same partnership?

Here is a simple approach for discussion. Let the pair rating r be based only on pairs play. It's got lots of data and should be solid. Give the pair an additional rating factor t based on its performance in team play: when the partnership is in a team event, use r+t values for each pair in D1 and D2, and based on results, calculate rating adjustments for the t values. Then the pair's rating r measures pairs play, while t (which may be positive or negative) describes the degree to which the partnership performs better or worse in team play than in pairs.
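
A sketch of the r-plus-t bookkeeping; the function and parameter names are illustrative only.

    import math

    def team_expectation(rt1, rt2, rt3, rt4):
        """Formula A applied to D1/D2 effective ratings built from r + t values."""
        rX = (rt1 + rt2) / math.sqrt(2)
        rY = (rt3 + rt4) / math.sqrt(2)
        return 1 / (1 + 10 ** ((rY - rX) / 400))

    def updated_t(t, actual, expected, k=32, n=1):
        """After a team session only the t factor moves; r is left to pairs play."""
        return t + k * (actual - expected) / math.sqrt(n)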

This isn't the last word, but it is easy to do, is testable over time, and is probably of interest to players.

Team ratings

A team considered as a unit can be given an independent rating based on performance against other teams using the Elo system, with rating adjustments applied per session as a simple win-tie-loss value versus computed expectation. If structural details, such as which partnerships, with what strengths, play which sessions, aren't factored in, the variance in the calculation of the team's expected performance could be quite high; the ratings themselves would then have high variance and perhaps be less useful than they could be. At the other extreme, one can imagine sophisticated models of expected performance by two competing teams that include fatigue factors in multi-session matches, analysis of records of previous head-to-heads, evaluation of captain or coach, etc.

Here is a straightforward approach. Define a "rated team" to be two constituent pairs playing a teams session. Each pair may of course belong to several rated teams (sometimes in the same contest, e.g., a knockout team with three pairs). The rated team is given a provisional rating based on considerations similar to the pairs case. Formulas A and C apply analogously. Rating points flow on a per-session basis. Rated teams are tracked in the database, and the computer issues are nearly identical to the pairs case. The process is transparent. One can assess the strength of a three-pair team using a (possibly weighted) average of its three rated teams.

With this scheme, let's revisit the murky question of giving a pair a team-play rating ("tp-rating") on the premise that a team rating is correlated with the average of the tp-ratings of its two pairs. This is a least-squares problem similar to that discussed in the section on individual ratings, below. Begin with current estimates of the tp-ratings for all constituent pairs of all rated teams. Define the "error" for a rated team as the team rating minus the average of its two tp-ratings. Beginning with popular pair P, the pair belonging to the largest number of teams, find the average error of those teams and add twice that amount to P's tp-rating, so as to make the average error equal to zero. Continue with the next most popular pair, and so on for all pairs. Now iterate this process until the tp-ratings converge. This approach seems more solid (if more opaque) than the scheme discussed in the last section.
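
A sketch of that iteration follows. The container names are illustrative, and a fixed number of sweeps stands in for a proper convergence test.

    def fit_tp_ratings(teams, tp, sweeps=50):
        """teams: team id -> (pair_a, pair_b, team_rating); tp: pair id -> tp-rating estimate."""
        # Order pairs by how many rated teams they belong to, most popular first.
        counts = {}
        for a, b, _ in teams.values():
            counts[a] = counts.get(a, 0) + 1
            counts[b] = counts.get(b, 0) + 1
        order = sorted(counts, key=counts.get, reverse=True)
        for _ in range(sweeps):
            for p in order:
                errors = [tr - (tp[a] + tp[b]) / 2
                          for a, b, tr in teams.values() if p in (a, b)]
                # Adding twice the average error makes that average zero for p's teams.
                tp[p] += 2 * sum(errors) / len(errors)
        return tp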

Individual ratings

If individual contests were in vogue, one could develop a solid individual rating based on the Elo model. In the absence of good individual data however, we can attempt to derive them from pair ratings. Only time will tell how accurately they reflect the individual's strength.

Here is one way to proceed. Let's hypothesize that the established pair rating is the average of the to-be-determined individual ratings of the two in the pair. Set each individual's rating to the weighted average pair rating of the (typically) several partnerships of which he or she is a part, where the weight depends on currency and number of sessions played. Individual ratings would mirror the 100 to 2800 point spread of the pair ratings.

At this stage, in many cases, pair ratings won't equal the average of individuals' ratings. Therefore fiddle with the individual ratings of each pair until the discrepancy between the average of the individual ratings and the pair rating is not too different from discrepancies in all the other pairs, i.e. until discrepancies flatten out. This may be done repeatedly until the individual ratings converge.

What we have is called an over-determined sparse linear system, for which we want a best-fit solution (with constraints: no one wants a negative rating just to satisfy the math!). There are sophisticated iterative techniques available to solve it. For an example of a bridge site solving a similar problem to produce ratings, see the Colorado Springs Power Rating site, especially "Explanation of Power Ratings", by Chris Champion.
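
A simple relaxation sketch of the "flattening" iteration described above follows; a production system would use a proper constrained sparse least-squares solver, and the names and the floor value are illustrative.

    FLOOR = 100   # constraint: no individual rating below the bottom of the scale

    def fit_individual_ratings(pairs, indiv, sweeps=100, step=0.5):
        """pairs: (player_a, player_b) -> pair rating; indiv: player -> current estimate.
        Repeatedly nudge both players of each pair so that the average of their
        individual ratings moves toward the pair rating."""
        for _ in range(sweeps):
            for (a, b), pair_rating in pairs.items():
                gap = pair_rating - (indiv[a] + indiv[b]) / 2   # this pair's discrepancy
                indiv[a] = max(FLOOR, indiv[a] + step * gap)
                indiv[b] = max(FLOOR, indiv[b] + step * gap)
        return indiv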

Individual ratings would certainly be of interest (who is carrying whom?) and could be useful for other purposes. Individuals may want to partner someone who has a certain strength relative to themselves, for example. On-line players, with typically many partners, would have a pretty good guide to their strength.

One problem is that the calculation is complex and the rating isn't transparent, leading perhaps to lack of trust in the figure. Indeed, one disadvantage of the Colorado Springs system, which calculates pair ratings jointly with individual ratings in a long complex computation, is the lack of transparency compared to a chess-like rating system in which the flow of rating points is known after each game.

Effect of the rating system on the game

Mark Glickman's comment from the world of chess: "The introduction of chess rating systems has probably done more to influence the popularity of tournament chess than any other single factor . . . " The average chess player knows he won't win the tournament, but to come away with a 50 point rating increase is very satisfying. He tends to play his best even if doing badly because his rating is on the line.

In a pairs session at bridge, however, if a pair isn't doing well, they may gamble more than they should, or lose interest and drop a board. This introduces an undesirable randomness to the game. If, however, their 'standing' were on the line, they would be more likely to adopt a steady and focused game throughout to minimize loss and maximize gain. Bridge would be better for it. Rating point flow would be posted at the end of the session, and even without a win, pairs could take satisfaction in point gains.

Why should pairs worry about standing in the ratings, apart from say personal interest? Well, in chess, some tournaments are restricted to masters, or grandmasters. If ACBL used pair ratings like that, partnerships that wanted to qualify for such events would be motivated to maximize rating gain in games and tournaments, not just try to "win". Off-percentage actions or sloppy play on any hand would in the long run be punished in the ratings, admittedly just a little at a time.

Sometimes a pair will gamble for a top late in the contest in hopes of going from contender to winner. Over time, this may have a small negative impact on its ratings. Whether the pair gambles to win or plays steady to gain rating points probably depends on why the pair is playing.

Monitoring inflation and deflation

Chess has 50 years of experience tinkering with the rating system to improve it, in particular trying to control inflation and deflation which creep in as the population changes over time.

For bridge it would be valuable to maintain an approximately constant rating standard but it will be some time before problems arise for which measures need to be taken. It could be valuable (and interesting to ACBL members) to log rating data and distributions (percentiles) annually for purposes of monitoring inflationary trends with an eye to maintaining standards.

Appendix

A1. Computer issues

The ACBL website gives a membership figure of 160,000 members playing some 3 million "tables" annually. How many active partnerships can we expect? An upper bound is 6 million, two per annual "table" with no one playing twice with the same partner. A lower bound is 80,000, if each ACBL member has only one partner. A guess would be 500,000 to a million.

How many sessions? Assuming 10 tables per session, perhaps 300,000 per year, 30,000 per month, 1,000 per day. Before the internet and cheap computers pairs rating would have been impossible.

Most clubs and tournaments use ACBLscore to produce reports that are emailed to ACBL headquarters. Let's assume the reports have the information required to process pair ratings. The suggested process follows these lines. Say it's the end of the month and ratings have been posted on the website. For the next 4 weeks, new reports come in (at 1000/day) containing club and tournament identifiers, session identifiers, date, time, and session results.

Create a session id for each session. For each pair in a session find its rating in the database (or if the pair is not in the database, create a record with a provisional rating) and compute expected performance, actual performance, and rating adjustment using formulas A and C above. Some logic is required to handle established versus provisional pairs.

[Note: It's desirable during the month to base calculations on the published ratings, then at end of month sum the adjustments to update to the rating, rather than maintain a running rating value.]

Now write an entry to the "rating change" table containing the pair id, session id, and rating adjustment.

At end of month, for each pair in the month's "rating change table", sum the adjustments and add the result to last month's rating. Update the rating database and post it to the website. Done!
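
A sketch of that cycle, using in-memory stand-ins for the rating database and the "rating change" table; the established/provisional logic is omitted, and the default starting rating of 1200 is a placeholder only. Per the note above, calculations during the month use last month's published ratings, and adjustments are summed only at month's end.

    import math
    from collections import defaultdict

    published = {}                      # pair id -> rating as posted last month
    change_table = defaultdict(list)    # pair id -> [(session id, adjustment), ...]

    def process_session(session_id, results, k=32):
        """results: pair id -> actual score for the session (1 per pair beaten, 0.5 per tie)."""
        for pair in results:
            published.setdefault(pair, 1200)            # placeholder provisional rating
        for pair, actual in results.items():
            r = published[pair]
            opponents = [p for p in results if p != pair]
            e = sum(1 / (1 + 10 ** ((published[p] - r) / 400)) for p in opponents)
            adjustment = k * (actual - e) / math.sqrt(len(opponents))
            change_table[pair].append((session_id, adjustment))

    def month_end():
        """Sum each pair's adjustments and add them to last month's published rating."""
        for pair, changes in change_table.items():
            published[pair] += sum(adj for _, adj in changes)
        change_table.clear()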

Members will want look-up facilities. A search on the ratings database should produce current rating as well as recent rating adjustments per the monthly "rating change" tables. The adjustments could be hot-linked using the session id to the relevant session reports.

In addition, members might like a history of their rating over time, percentile figures, top pairs in their area, etc.

A2. Choice of K when multiple opponents

Take the hypothetical situation where only two pairs compete, one rated 200 points higher with expectation 0.76. As in chess, the two pairs' ratings are stable around their true values with a standard deviation of about 14. Now clone the second pair 11 times such that 13 pairs are competing, one rated 200 points above the other twelve. What happens? The ratings are stable over time, but the expected rating fluctuation for the strong pair is about 50, much higher than in the two-pair case. This is disconcerting considering that the strong pair sees the two contests as pretty much identical.

To make the standard deviations compatible, let's think of the session as a set of 12 mini matches, with expectation for each adjusted to the shorter length. The statistical upshot is that in formula B, K stays the same but the "actual minus expected" is divided by the square root of n, where n is the number of opponents. The rating adjustment formula when facing more than one opposing pair becomes:

rating adjustment = K*( a - e ) / sqrt(n)

In our example, n=12, with square root 3.5, and this change brings the rating fluctuations of the 13-pair contest in line with the 2-pair case. Note that formula B and the above formula are identical if n=1.
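
These figures can be checked with a small simulation. The 200-point edge and K = 32 are as in the example; ties are ignored and the true ratings are held fixed.

    import math, random, statistics

    def adjustment_sd(n_opponents, scale_by_sqrt_n, k=32, edge=200, trials=20000):
        """Standard deviation of the per-session rating adjustment for the strong pair."""
        p_win = 1 / (1 + 10 ** (-edge / 400))            # about 0.76 per opponent
        expected = n_opponents * p_win
        divisor = math.sqrt(n_opponents) if scale_by_sqrt_n else 1
        samples = []
        for _ in range(trials):
            actual = sum(random.random() < p_win for _ in range(n_opponents))
            samples.append(k * (actual - expected) / divisor)
        return statistics.pstdev(samples)

    print(round(adjustment_sd(1, False)))    # two-pair case: about 14
    print(round(adjustment_sd(12, False)))   # 13-pair case, unscaled: roughly 47, the "about 50" above
    print(round(adjustment_sd(12, True)))    # with the sqrt(n) scaling: back to about 14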
