By: Ram on Dec 04, 2022
In the WTA, players are ranked based on points earned over the latest 52 weeks. The cumulative points are based on a maximum of 16 tournaments for singles and 11 for doubles. In this blog, we will use historical rankings and match statistics to develop an ML-based model to predict player rankings.
High Level Approach
• Context
• Data
• Analysis
• Feature Engineering
• Model Development
• Insights and Concluding Thoughts
The WTA (Women's Tennis Association) is the principal organizing body of women's professional tennis and governs its own tour worldwide. Its website provides a lot of data about individual players, as well as tour matches with results and current ranks.
Methods to rank players:
In this blog, the aim is to show the steps for building a WTA player ranking method. We will use a dataset prepared from the data available here. Of course, a more comprehensive set of features can be developed further.
Key hypotheses for ranking based on the data available
Additional derived features we could consider are
Currently, player rankings are calculated using method 1, and we will use that as the label for the ranking model.
We have the following project structure. Two data files (CSV) are available in the data folder: the first contains the features and the second the rankings. You can download the files from here
We need to merge these two data frames, and we should delete duplicate rows for each combination of player id and ranking date.
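As a minimal sketch of the merge step, the snippet below builds two small stand-in data frames instead of reading the real CSV files. The column names (`player_id`, `ranking_date`, `matches_won`, `rank`) are assumptions for illustration and may differ from the actual files.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files; column names are assumed.
features = pd.DataFrame({
    "player_id": [1, 1, 2, 2],
    "ranking_date": ["2022-01-03", "2022-01-03", "2022-01-03", "2022-01-10"],
    "matches_won": [5, 5, 3, 4],
})
rankings = pd.DataFrame({
    "player_id": [1, 2, 2],
    "ranking_date": ["2022-01-03", "2022-01-03", "2022-01-10"],
    "rank": [1, 2, 2],
})

keys = ["player_id", "ranking_date"]

# Keep one row per (player, ranking date) in each frame before merging;
# duplicated keys would otherwise multiply rows in the join.
features = features.drop_duplicates(subset=keys, keep="first")
rankings = rankings.drop_duplicates(subset=keys, keep="first")

player_rank_performance = features.merge(rankings, on=keys, how="inner")
print(player_rank_performance)
```

Dropping duplicates before the merge, rather than after, keeps the join one-to-one on the key columns.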
We can review the data using descriptive analysis and visualizations.
Missing values and outliers should be treated. For missing values, we have adopted the simple approach of replacing them with 0.
We can create features such as total matches played, % won/lost, etc.
We can also think about additional features that could be created. Some of these can be considered
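A small sketch of this treatment step follows. Replacing missing values with 0 is the approach described above. The percentile capping used for outliers is only an assumed example, since the blog does not say which outlier method was applied, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical feature columns
df = pd.DataFrame({
    "matches_won": [10.0, np.nan, 7.0, 250.0],
    "aces": [30.0, 12.0, np.nan, 18.0],
})
feature_cols = ["matches_won", "aces"]

# Outlier treatment (assumed approach): cap each feature at its
# 1st and 99th percentile. Missing values are ignored by quantile().
lower = df[feature_cols].quantile(0.01)
upper = df[feature_cols].quantile(0.99)
df[feature_cols] = df[feature_cols].clip(lower=lower, upper=upper, axis=1)

# Missing values: replace with 0
df[feature_cols] = df[feature_cols].fillna(0)
print(df)
```

Capping is done before filling so that the inserted zeros do not shift the percentile thresholds.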
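These derived features can be sketched as below, assuming the feature file has won and lost counts per player and ranking date. The column names `matches_won` and `matches_lost` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "player_id": [1, 2, 3],
    "matches_won": [8, 3, 0],
    "matches_lost": [2, 3, 0],
})

df["matches_played"] = df["matches_won"] + df["matches_lost"]

# Guard against division by zero for players with no matches in the window
played = df["matches_played"].replace(0, np.nan)
df["pct_won"] = (df["matches_won"] / played).fillna(0.0)
df["pct_lost"] = (df["matches_lost"] / played).fillna(0.0)
print(df)
```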
Normalize the features: all features are scaled to the range 0 to 1.
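One way to do this is min-max scaling, sketched here with a hypothetical helper `min_max_scale` and toy columns. The blog does not show the exact scaling code, so treat this as an illustration only.

```python
import pandas as pd

def min_max_scale(df, cols):
    """Return a copy of df with each column in cols rescaled to [0, 1]."""
    scaled = df.copy()
    for col in cols:
        lo, hi = df[col].min(), df[col].max()
        span = hi - lo
        # A constant column carries no information; map it to 0
        # instead of dividing by zero.
        scaled[col] = (df[col] - lo) / span if span != 0 else 0.0
    return scaled

df = pd.DataFrame({
    "matches_played": [10, 6, 0],
    "pct_won": [0.8, 0.5, 0.0],
    "constant_flag": [1, 1, 1],
})
player_rank_performance_normalised = min_max_scale(
    df, ["matches_played", "pct_won", "constant_flag"]
)
print(player_rank_performance_normalised)
```

Tree-based models are generally insensitive to monotonic rescaling, so this step mainly helps when features are inspected side by side or reused in other model types.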
A boosted-tree (XGBoost) pairwise ranking algorithm is used for model development. We need to produce ranks within each ranking date, so we need to define the groups. Below is the sample code.
# Rows must be sorted by ranking_date so the group sizes line up with the row order
groups = player_rank_performance_normalised.groupby('ranking_date').size().to_numpy()
We also need to create train and test samples, as in any supervised modelling.
Once the ML-based ranking model is developed, we can use the model object to predict ranking scores.
The predicted values can be converted to ranks using the pandas rank method.
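For a ranking model, the split is usually made on whole groups so that every ranking date stays intact in either train or test. The sketch below assumes a split by date with the earliest 75% of dates used for training. The split ratio and column names are assumptions, not the blog's actual settings.

```python
import pandas as pd

# Toy data: four ranking dates with a few players each (hypothetical values)
df = pd.DataFrame({
    "ranking_date": ["2022-01-03"] * 3 + ["2022-01-10"] * 2
                    + ["2022-01-17"] * 3 + ["2022-01-24"] * 2,
    "player_id": [1, 2, 3, 1, 2, 1, 2, 3, 1, 3],
    "pct_won": [0.9, 0.6, 0.4, 0.8, 0.7, 0.9, 0.5, 0.3, 0.7, 0.6],
    "rank": [1, 2, 3, 1, 2, 1, 2, 3, 1, 2],
})

# Sort so that rows of the same ranking date are contiguous
df = df.sort_values(["ranking_date", "player_id"]).reset_index(drop=True)

# Split on whole dates: earliest 75% for training, the rest for testing
dates = sorted(df["ranking_date"].unique())
cutoff = int(len(dates) * 0.75)
train = df[df["ranking_date"].isin(dates[:cutoff])]
test = df[df["ranking_date"].isin(dates[cutoff:])]

# Group sizes in row order, one entry per ranking date
groups_train = train.groupby("ranking_date", sort=False).size().to_numpy()
groups_test = test.groupby("ranking_date", sort=False).size().to_numpy()
print(groups_train, groups_test)
```

Arrays like `groups_train` are what a group-aware ranker expects alongside the features and labels, for example through the `group` argument of `XGBRanker.fit` in XGBoost.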
# Predict ranking scores with the trained model
# (X_train must hold the same rows, in the same order, as the data frame below)
pred_wta_values = model_wta.predict(X_train)
player_rank_performance_normalised['pred_value'] = pred_wta_values
# Rank the predicted scores within each ranking date
player_rank_performance_normalised['pred_rank'] = player_rank_performance_normalised.groupby('ranking_date')['pred_value'].rank(method='dense')
Now the model performance can be assessed with a cross tab of predicted versus actual rankings.
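A minimal sketch of such a cross tab with `pandas.crosstab` is shown below. The data is made up for illustration, and the actual ranking column is assumed to be named `rank`.

```python
import pandas as pd

# Toy results for two ranking dates (hypothetical values)
results = pd.DataFrame({
    "ranking_date": ["2022-01-03"] * 3 + ["2022-01-10"] * 3,
    "rank": [1, 2, 3, 1, 2, 3],
    "pred_rank": [1, 3, 2, 1, 2, 3],
})

# Rows: actual rank, columns: predicted rank
confusion = pd.crosstab(
    results["rank"], results["pred_rank"],
    rownames=["actual_rank"], colnames=["pred_rank"],
)
print(confusion)

# Share of rows where the predicted rank equals the actual rank
exact_match_rate = (results["rank"] == results["pred_rank"]).mean()
print(round(exact_match_rate, 3))
```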
In a perfect scenario, we would expect values only on the diagonal. But the model is far from perfect, as we have used only a handful of features. Model performance is acceptable for the top rankings but degrades for lower ranks. Additional features can be considered to improve the model performance, and the hyperparameters can also be optimized.
The model performance can also be measured with ranking metrics such as NDCG (normalized discounted cumulative gain) and MRR (mean reciprocal rank).
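To make these two metrics concrete, here is a small pure-Python sketch computed per ranking date and then averaged. Two choices are assumptions made for illustration: relevance is derived from the actual rank (the best-ranked player gets the highest gain), and the reciprocal rank is taken for the actual number-one player. Other definitions are possible.

```python
import math

def dcg(gains):
    # Discounted cumulative gain: the gain at 1-based position p
    # is divided by log2(p + 1).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(actual_ranks, pred_ranks, k):
    # Assumed relevance: a better (smaller) actual rank means a larger gain
    worst = max(actual_ranks)
    gains = [worst - r + 1 for r in actual_ranks]
    # Order players by predicted rank, best first
    order = sorted(range(len(pred_ranks)), key=lambda i: pred_ranks[i])
    predicted_gains = [gains[i] for i in order][:k]
    ideal = dcg(sorted(gains, reverse=True)[:k])
    return dcg(predicted_gains) / ideal if ideal > 0 else 0.0

def reciprocal_rank(actual_ranks, pred_ranks):
    # Position at which the actual number-one player is predicted
    top_player = actual_ranks.index(min(actual_ranks))
    order = sorted(range(len(pred_ranks)), key=lambda i: pred_ranks[i])
    return 1.0 / (order.index(top_player) + 1)

# One (actual_ranks, pred_ranks) pair per ranking date; toy values
weeks = [
    ([1, 2, 3], [1, 2, 3]),  # perfect prediction
    ([1, 2, 3], [2, 1, 3]),  # top two players swapped
]
mean_ndcg = sum(ndcg_at_k(a, p, k=3) for a, p in weeks) / len(weeks)
mrr = sum(reciprocal_rank(a, p) for a, p in weeks) / len(weeks)
print(round(mean_ndcg, 4), mrr)
```

Both metrics weight mistakes near the top of the ranking more heavily than mistakes further down.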