Ramgopal Prajapat:

Learnings and Views

ML Based Ranking for WTA Players

By: Ram on Dec 04, 2022

Summary

For the WTA, players are ranked on points earned over the latest 52 weeks. The cumulative points count results from a maximum of 16 tournaments for singles and 11 for doubles. In this blog, we will use historical rankings and match statistics to develop an ML-based model that predicts player rankings.
High Level Approach
•    Context
•    Data
•    Analysis 
•    Feature Engineering
•    Model Development
•    Insights and Concluding Thoughts

Context and Plan

The WTA (Women's Tennis Association) is the principal organizing body of women's professional tennis and governs its own tour worldwide. Its website provides a lot of data about individual players, as well as tour match results and current rankings.

Methods to rank players:

  1. Points-based system, where the points earned depend on the type of tournament and the stage reached. This is the current official way of ranking players.
  2. Player performance (matches won and lost) and the ranking gap between the players. This is the method used here.

In this blog, the aim is to show the steps for this WTA player ranking method. We will use a dataset prepared from the data available here. Of course, a more comprehensive set of features can be developed further.

Key hypotheses for ranking based on the data available

  • Number of matches won in the latest 6 months
  • Number of matches lost in the latest 6 months
  • Average rank gap when matches won/lost
  • Min and Max ranking in the latest 6 months
  • Total number of matches played
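
As a rough illustration of how such six-month counts could be derived from match-level data, here is a minimal pandas sketch. The table layout and the column names (match_date, winner_id, loser_id) are assumptions made for this example and are not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical match-level table: one row per match (column names are assumed).
matches = pd.DataFrame({
    "match_date": pd.to_datetime(
        ["2021-05-01", "2021-11-15", "2021-12-20", "2022-01-02"]
    ),
    "winner_id": [1, 1, 2, 1],
    "loser_id": [2, 2, 1, 2],
})

# Keep only matches played in the six months up to the ranking date.
ranking_date = pd.Timestamp("2022-01-03")
window_start = ranking_date - pd.DateOffset(months=6)
recent = matches[
    (matches["match_date"] > window_start)
    & (matches["match_date"] <= ranking_date)
]

# Count wins and losses per player inside the window.
won = recent["winner_id"].value_counts().rename("matches_won")
lost = recent["loser_id"].value_counts().rename("matches_lost")

# Align on player id; a player missing from one side gets a count of 0.
features = pd.concat([won, lost], axis=1).fillna(0).astype(int)
features["matches_played"] = features["matches_won"] + features["matches_lost"]
print(features)
```

The same filter can be repeated for each ranking date to build one feature row per player and date.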

Additional derived features we could consider are:

  • Win percentages
  • Ratio of rank gap for won and lost matches
  • Variation of ranking in 6 months
  • Number of times ranked in the latest 6 months

Currently, player rankings are calculated using method 1, and we will use that official ranking as the label for the ranking model.

Data

The project has a data folder containing two data files (CSV): the first holds the features and the second the rankings. You can download the files from here.

We need to merge these two data frames, but first we should delete duplicate rows for each combination of player id and ranking date.
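
A minimal sketch of the deduplicate-then-merge step is shown below. The column names (player_id, ranking_date, matches_won, rank) and the small inline frames are assumptions standing in for the two CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (features and rankings).
features = pd.DataFrame({
    "player_id": [1, 1, 2, 2],
    "ranking_date": ["2022-01-03", "2022-01-03", "2022-01-03", "2022-01-10"],
    "matches_won": [5, 5, 3, 4],
})
rankings = pd.DataFrame({
    "player_id": [1, 2, 2],
    "ranking_date": ["2022-01-03", "2022-01-03", "2022-01-10"],
    "rank": [1, 2, 2],
})

keys = ["player_id", "ranking_date"]

# Keep one row per (player, ranking date) in each frame before merging;
# duplicated keys would otherwise multiply rows in the merged result.
features = features.drop_duplicates(subset=keys, keep="first")
rankings = rankings.drop_duplicates(subset=keys, keep="first")

# validate="one_to_one" makes pandas raise if duplicate keys remain.
player_rank_performance = features.merge(
    rankings, on=keys, how="inner", validate="one_to_one"
)
print(player_rank_performance)
```

Using validate here is a design choice: it turns a silent row explosion into an explicit error.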

Analysis

We can review the data using descriptive analysis and visualizations.

  • Number of players ranked overall
  • Count of players for each of the ranking dates
  • Range of dates for the rankings.

Missing values and outliers should be treated. For missing values, we have adopted the simple approach of replacing them with 0.
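
The checks listed above and the zero-fill could look roughly like the following sketch. The frame and its column names are hypothetical placeholders for the merged data.

```python
import pandas as pd

# Hypothetical merged frame; column names are assumptions for illustration.
df = pd.DataFrame({
    "player_id": [1, 2, 3, 1, 2],
    "ranking_date": pd.to_datetime(
        ["2022-01-03", "2022-01-03", "2022-01-03", "2022-01-10", "2022-01-10"]
    ),
    "matches_won": [5, None, 2, 6, 3],
    "matches_lost": [1, 2, None, 1, 2],
})

# Number of players ranked overall.
n_players = df["player_id"].nunique()

# Count of players for each ranking date.
players_per_date = df.groupby("ranking_date")["player_id"].nunique()

# Range of dates covered by the rankings.
date_range = (df["ranking_date"].min(), df["ranking_date"].max())

# Simple missing-value treatment: replace missing match statistics with 0.
stat_cols = ["matches_won", "matches_lost"]
missing_before = df[stat_cols].isna().sum()
df[stat_cols] = df[stat_cols].fillna(0)

print(n_players, date_range)
print(players_per_date)
print(missing_before)
```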

Feature Engineering

We can create features such as total matches played and the percentage of matches won or lost.
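
A sketch of these derived features is given below, under the assumption that the data has columns named matches_won, matches_lost, avg_rank_gap_won and avg_rank_gap_lost (hypothetical names). Players with no matches are given 0 rather than a division-by-zero result.

```python
import numpy as np
import pandas as pd

# Hypothetical per-player statistics; column names are assumptions.
df = pd.DataFrame({
    "matches_won": [5, 0, 2],
    "matches_lost": [1, 0, 2],
    "avg_rank_gap_won": [10.0, 0.0, 4.0],
    "avg_rank_gap_lost": [5.0, 0.0, 8.0],
})

# Total matches played.
df["matches_played"] = df["matches_won"] + df["matches_lost"]

# Win and loss percentages; a zero denominator becomes NaN, then 0.
played = df["matches_played"].replace(0, np.nan)
df["win_pct"] = (df["matches_won"] / played).fillna(0.0)
df["loss_pct"] = (df["matches_lost"] / played).fillna(0.0)

# Ratio of the average rank gap in won matches to that in lost matches.
gap_lost = df["avg_rank_gap_lost"].replace(0, np.nan)
df["rank_gap_ratio"] = (df["avg_rank_gap_won"] / gap_lost).fillna(0.0)

print(df)
```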

We can also think about additional features that could be created. Some that can be considered:

  • Match details: points or set up, location of the match (home vs. neutral location), etc.
  • Tournament Type and rounds

Normalize the features: all features are scaled to the range 0 to 1.
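
One way to do this scaling is plain min-max normalisation, sketched below with pandas. The feature names are hypothetical, and the constant column is included only to show how a zero range can be handled.

```python
import pandas as pd

# Hypothetical feature frame; column names are assumptions.
df = pd.DataFrame({
    "ranking_date": ["2022-01-03"] * 3,
    "matches_played": [6, 0, 4],
    "win_pct": [0.8, 0.0, 0.5],
    "constant_col": [1, 1, 1],
})
feature_cols = ["matches_played", "win_pct", "constant_col"]

col_min = df[feature_cols].min()
col_range = df[feature_cols].max() - col_min

# A constant column has zero range; dividing by 1 instead maps it to 0.
col_range = col_range.replace(0, 1)

# Min-max scaling: (x - min) / (max - min), applied column by column.
player_rank_performance_normalised = df.copy()
player_rank_performance_normalised[feature_cols] = (
    df[feature_cols] - col_min
) / col_range

print(player_rank_performance_normalised)
```

When a separate test sample is used, it is usually safer to compute the minimum and range on the training rows only and reuse them for the test rows.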

Ranking Model Development

A boosted-tree (XGBoost) pairwise ranking algorithm is used for model development. Ranks must be learned within each ranking date, so we need to define the groups, i.e. the number of rows for each ranking date. Below is the sample code.

# Group sizes per ranking date; the rows must be sorted by ranking_date so that each group is contiguous
groups = player_rank_performance_normalised.groupby('ranking_date').size().to_numpy()

Also, we need to create train and test samples, as in any supervised modelling task.
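
For a ranking model, one reasonable option is to split by ranking date rather than by row, so that each date stays whole in either the train or the test sample. The sketch below assumes hypothetical column names and a 70/30 split by date; it is not necessarily the split used in the notebook.

```python
import pandas as pd

# Hypothetical normalised frame; column names are assumptions.
df = pd.DataFrame({
    "ranking_date": ["2022-01-10"] * 2 + ["2022-01-03"] * 3 + ["2022-01-17"] * 2,
    "player_id": [1, 2, 1, 2, 3, 1, 3],
    "win_pct": [0.9, 0.4, 0.8, 0.5, 0.2, 0.7, 0.3],
    "rank": [1, 2, 1, 2, 3, 1, 2],
})

# Rows of one ranking date must be contiguous for group sizes to line up.
df = df.sort_values("ranking_date", kind="stable").reset_index(drop=True)

# Earliest 70% of ranking dates go to train, the rest to test.
dates = df["ranking_date"].unique()  # in sorted order after the sort above
n_train_dates = max(1, int(len(dates) * 0.7))
train_dates = dates[:n_train_dates]

train = df[df["ranking_date"].isin(train_dates)]
test = df[~df["ranking_date"].isin(train_dates)]

X_train, y_train = train[["win_pct"]], train["rank"]
X_test, y_test = test[["win_pct"]], test["rank"]

# Group sizes are computed separately for each sample.
groups_train = train.groupby("ranking_date").size().to_numpy()
groups_test = test.groupby("ranking_date").size().to_numpy()

print(groups_train, groups_test)
```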

Once the ML-based ranking model is trained, we can use the model object to predict ranking scores.

The predicted scores can be converted to ranks within each ranking date using the pandas rank method.

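# Predict ranking scores with the trained model
# (assumes X_train holds the feature rows of player_rank_performance_normalised, in the same order)
pred_wta_values = model_wta.predict(X_train)
player_rank_performance_normalised['pred_value'] = pred_wta_values
# Rank the predicted scores within each ranking date
player_rank_performance_normalised['pred_rank'] = player_rank_performance_normalised.groupby('ranking_date')['pred_value'].rank(method='dense')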

Now the model performance can be assessed with a cross-tab of predicted against actual rankings.
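
A small sketch of such a cross-tab with pandas follows. The values are made up for illustration, and the column names rank and pred_rank are assumptions.

```python
import pandas as pd

# Made-up actual and predicted ranks for two ranking dates.
df = pd.DataFrame({
    "ranking_date": ["2022-01-03"] * 4 + ["2022-01-10"] * 4,
    "rank": [1, 2, 3, 4, 1, 2, 3, 4],
    "pred_rank": [1, 2, 4, 3, 1, 3, 2, 4],
})

# Rows are actual ranks, columns are predicted ranks.
confusion = pd.crosstab(
    df["rank"], df["pred_rank"], rownames=["actual"], colnames=["predicted"]
)

# Share of rows where the predicted rank equals the actual rank.
exact_match_rate = (df["rank"] == df["pred_rank"]).mean()

print(confusion)
print(exact_match_rate)
```

Counts on the diagonal are exact matches; counts away from it show how far the predicted rank is from the actual one.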

Insights and Concluding Thoughts

In a perfect scenario, we would expect values only on the diagonal. But the model is far from perfect, as we have used only a handful of features. Model performance is OK for the top rankings but degrades for lower ranks. Additional features can be considered to improve performance, and the hyperparameters can also be tuned.

The model performance can be measured with NDCG, MRR and additional metrics.
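
As an illustration, NDCG@k and the reciprocal rank for a single ranking date can be computed with NumPy as sketched below. Two choices here are assumptions rather than part of the original analysis: relevance is taken as linear in the actual rank, and the "relevant" player for the reciprocal rank is the one actually ranked first. MRR is then the mean of the reciprocal ranks over all ranking dates.

```python
import numpy as np


def dcg(relevance):
    """Discounted cumulative gain of relevances listed in ranked order."""
    relevance = np.asarray(relevance, dtype=float)
    discounts = np.log2(np.arange(2, relevance.size + 2))
    return float(np.sum(relevance / discounts))


def ndcg_at_k(actual_rank, pred_rank, k):
    """NDCG@k for one ranking date; rank 1 is best in both inputs."""
    actual_rank = np.asarray(actual_rank)
    pred_rank = np.asarray(pred_rank)
    # Assumption: relevance is linear in the actual rank, best player highest.
    relevance = actual_rank.max() - actual_rank + 1
    order_pred = np.argsort(pred_rank, kind="stable")
    order_ideal = np.argsort(actual_rank, kind="stable")
    return dcg(relevance[order_pred][:k]) / dcg(relevance[order_ideal][:k])


def reciprocal_rank(actual_rank, pred_rank):
    """1 / predicted position of the player who is actually ranked first."""
    actual_rank = np.asarray(actual_rank)
    pred_rank = np.asarray(pred_rank)
    return 1.0 / float(pred_rank[np.argmin(actual_rank)])


actual = [1, 2, 3, 4]
perfect = [1, 2, 3, 4]
swapped = [2, 1, 3, 4]  # top two players predicted in the wrong order

print(ndcg_at_k(actual, perfect, k=4))
print(ndcg_at_k(actual, swapped, k=4))
print(reciprocal_rank(actual, swapped))
```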

 

Jupyter Notebook with full code
