With soccer’s January transfer window set to open in a matter of days, rumours about potential blockbuster transfers are already brewing.
Given the chaotic nature of this past summer’s transfer window amidst a global pandemic, there is one question on everyone’s mind: will the transfer rumours circulating about their team come true, or amount to nothing more than speculation?
As some rumours are merited, while others appear implausible, it seems like an obvious challenge to see if this discrepancy could be modelled using readily available data.
Whereas this problem was previously approached using a player’s historical data, I decided to use text data collected from various online blogs, particularly articles written about transfer rumours in the past transfer window from numerous sports news sites. Since this data comes from a completed transfer window, the outcome is already determined, making this a supervised learning problem.
The Data
The text data was obtained from sports news sites such as NBC Sports, Sports Mole, Daily Star, and the Manchester Evening News. The data represents a mixture of players from various leagues.
Players themselves were selected so there was a near-even split of players who completed a transfer and those who did not, providing a binary classification problem of “Did the player complete a transfer or not?” with the respective labels being Y for yes or N for no. The players selected are shown in the graphic below.
To get an idea of what characteristics in the data may influence the classification, I conducted a detailed exploration. Immediately, it came to mind that I should see how much text data I had collected for each player. A large disparity between players could make the prediction task more difficult.
Looking at the above graphics, aside from a few outliers, the amount of text data collected for each player is on par with the others. Next, to get a sense of what is referenced by the collected text data, I created a word cloud visual. I considered three cases for the visuals: one for each of the two player labels and one for the text of all players combined.
From these three word clouds, it is clear that writers frequently reference player and club names in their articles, which makes sense as they would be referencing possible destinations for the player in question. Although the word clouds are mostly similar, the Completed Transfers word cloud contains more definitive language such as deal, made, linked, and going, indicating that I needed a model complex enough to pick up such differences in the data.
Additionally, the word clouds showed that many writers tend to use names heavily, whether those of players, leagues, or clubs. To see if this was happening only for select players or for all players in the dataset, I used named entity recognition to identify words in the data that could be attributed as a name. These were then tallied for each player’s text data to summarise how names are used in the text of each player.
Of the players written about using the most named entities, only two completed transfers. One of these players is Jack Butland (Stoke to Crystal Palace). Butland also has the most text data of any single player, which would likely account for his higher count of named entities. For the other players, the pattern suggests that writers reporting on less conclusive rumours reference multiple teams, managers, and other news sources as a way to back their claims. These are all named entities and could contribute to this pattern.
I explored some of the named entities identified by this procedure to see exactly what sort of words and terms were considered named entities.
Mario Götze
As Mario Götze is a player who completed his transfer (free transfer to PSV Eindhoven), only two named entities appeared: Mario Götze and Niko Kovac, the manager of AS Monaco, another team he was linked with this past summer.
John Stones
John Stones is one player who did not complete a transfer, yet had a large number of named entities used in the text collected about him. The named entities referenced in his text were: Ferran Torres, Nathan Ake, Txiki Begiristain, Kalidou Koulibaly, Nicolas Otamendi, Eric Garcia, Aymeric Laporte, Mikel Arteta, Jose Giminez, Frank Lampard, Sky Sports, and Antonio Rudiger. These named entities include his teammates (Nicolas Otamendi), players at different teams in his position (Nathan Ake), and opponent managers who may have considered him an option (Mikel Arteta, Frank Lampard).
The case of Stones helps support the issue we identified above. A broad range of players appears in his text ranging from teammates to possible replacements and even rival teams’ managers. Many fans of the game would likely not see this as reliable transfer news as there is no definitive claim that Stones will be transferred.
The last aspect of the data I wanted to dive into was its polarity. In binary text classification problems, this can be a strong indicator of the target label. If this is the case, it is a highly valuable piece of information. The polarity of a text is its ability to appear positive or negative to a reader. The idea here is that the media will positively discuss a more promising transfer or one viewed as a good fit.
The graphic above shows that three of the players spoken about the most positively, Nelson Semedo (FC Barcelona to Wolves), Diogo Jota (Wolves to Liverpool), and Ruben Dias (SL Benfica to Manchester City), also completed transfers. This implies that the way a writer communicates about a player poised for a transfer may be indicative of whether the transfer is completed. Furthermore, most of the players who did not complete transfers are not at either of the two extremes when it comes to text polarity.
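To make the idea of polarity concrete, here is a minimal lexicon-based sketch. The word lists and the polarity function are hypothetical and are not the tool used for this analysis; a dedicated sentiment library would rely on a far larger lexicon and handle negation, which this sketch does not.

```python
import re

# Tiny illustrative lexicons (assumptions, not taken from the analysis).
POSITIVE = {"good", "great", "excellent", "promising", "impressive"}
NEGATIVE = {"bad", "poor", "unlikely", "struggling", "doubt"}


def polarity(text: str) -> float:
    """Return a score in [-1, 1]: (positive - negative) / matched words.

    Text with no lexicon words is treated as neutral (0.0).
    """
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched
```

Averaging such a score over all the text collected for one player would give a single per-player value that can be compared across the two labels.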
Exploring these various characteristics confirms that certain patterns exist, which provide a way to differentiate between a completed and uncompleted transfer in the data. We can find such patterns in the text's polarity, the usage of named entities, and the text's content itself. From these observations, I proceeded to preprocess the data and then construct a suitable model.
Preprocessing
To preprocess the data, I used standard text data procedures. Named entities were replaced with a <NAME> token so that the model treats all names equally. A similar approach was taken for numbers and links appearing in the text as I replaced them with <NUM> and <LINK> tokens, respectively.
Finally, special characters and stop words were removed, and the remaining words in the text data were stemmed and then lowercased. Stemming transforms words to their root form to reduce the number of unique words in the data; for example, played becomes play.
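The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the stop-word list is a tiny hand-picked sample, naive_stem is a crude stand-in for a real stemmer, and names are passed in by hand rather than detected by named entity recognition. Words are lowercased before stemming here purely to keep the suffix check simple.

```python
import re

# Small illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "at"}


def naive_stem(word: str) -> str:
    """Very crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str, names: list[str]) -> list[str]:
    # Replace links first so digits inside URLs are not turned into <NUM>.
    text = re.sub(r"https?://\S+", " <LINK> ", text)
    # Replace known names (supplied by hand here; the article used NER).
    for name in names:
        text = text.replace(name, " <NAME> ")
    text = re.sub(r"\d+(?:[.,]\d+)*", " <NUM> ", text)
    # Keep placeholder tokens and alphabetic words; drop special characters.
    tokens = re.findall(r"<[A-Z]+>|[A-Za-z]+", text)
    cleaned = []
    for tok in tokens:
        if tok.startswith("<"):
            cleaned.append(tok)  # placeholder tokens are kept as they are
        elif tok.lower() not in STOP_WORDS:
            cleaned.append(naive_stem(tok.lower()))
    return cleaned
```

For example, preprocess("John Stones played 90 minutes, see https://example.com/a1", ["John Stones"]) reduces the sentence to a short list of placeholder tokens and stemmed words.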
Model Construction
To convert the text data to numeric features required by machine learning algorithms, I used n-grams of sizes 1 to 3. These sizes were chosen for their ability to cover multiple word combinations without becoming overly specific by using large n-gram sizes. To identify the data's hidden patterns highlighted in my exploratory analysis, a model capable of capturing complex patterns was required, leading me to select a neural network as my model of choice. The parameters were determined through hyperparameter tuning and resulted in a neural network with three hidden layers of size 16, 8 and 4.
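As an illustration of how n-grams of sizes 1 to 3 turn tokens into numeric features, here is a minimal pure-Python sketch. The function names are hypothetical, and counting occurrences is an assumption; the article does not say whether raw counts or another weighting was used.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Yield every contiguous n-gram of length n_min..n_max as a string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_vocabulary(documents):
    """Map each distinct n-gram in the training documents to a column index."""
    vocab = sorted({g for doc in documents for g in ngrams(doc)})
    return {g: idx for idx, g in enumerate(vocab)}


def vectorize(tokens, vocab):
    """Count n-gram occurrences; n-grams outside the vocabulary are ignored."""
    counts = Counter(ngrams(tokens))
    vec = [0] * len(vocab)
    for gram, count in counts.items():
        if gram in vocab:
            vec[vocab[gram]] = count
    return vec
```

A three-token document such as ["deal", "is", "done"] yields six features: three unigrams, two bigrams, and one trigram. The feature count grows quickly with document length, which is one reason to cap the n-gram size at 3.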
The dataset I worked with was relatively small row-wise, so to validate the results I performed cross-validation, training on the text data for 80% of the players and testing on the other 20%. This produced a training accuracy of 81.25% and a test accuracy of 45%. We can likely attribute the low test accuracy to the test set only ever containing four players, so misclassifying even one or two of them significantly skews the result. The small dataset makes analyzing these results particularly difficult.
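The player-level split can be sketched as below. This is a minimal illustration, assuming 20 players (as implied by a four-player test set at a 20% split) and five folds; the player identifiers and function names are hypothetical.

```python
import random


def player_folds(players, k=5, seed=0):
    """Split players into k disjoint test folds, so that all text about a
    player falls on the same side of the train/test boundary."""
    shuffled = list(players)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


players = [f"player_{i}" for i in range(20)]  # hypothetical identifiers
folds = player_folds(players)
splits = []
for test_players in folds:
    train_players = [p for p in players if p not in test_players]
    # A model would be fit on train_players and scored on test_players here.
    splits.append((train_players, test_players))
```

With only four players in a test fold, the fold accuracy can only take the values 0, 0.25, 0.5, 0.75, or 1, so each misclassified player moves it by 25 percentage points.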
To better understand the results, I analyzed the test-set players who received the worst predictive accuracy. These included Ousmane Dembélé and Georginio Wijnaldum; both play for large clubs and are highly sought after due to their recent success at the individual and club levels. With such popularity, writers may describe these players in language similar to that used for a player with an actual pending transfer.
Lastly, the results indicate that the model was better at identifying players who completed a transfer versus players who did not.
January Transfer Window
The upcoming January transfer window provides teams with an excellent opportunity to bring in reinforcements to close out what has already been a taxing season.
Lionel Messi recently announced that he wouldn't be releasing a decision on his future until the summer, which likely rules out a move for the Barcelona ace this coming January. It is a decision many fans are either dreading or awaiting impatiently.
Instead, I decided to deploy my model on three players whose names have circulated in transfer news of late. The first comes from our very own MLS: Mark McKenzie. The 21-year-old defender has impressed for Philadelphia Union and seems set to move overseas to KRC Genk.
The second player is Luka Jovic. In the 2018-19 season, Jovic had a breakout season with Eintracht Frankfurt, netting 17 goals in the Bundesliga and another 10 in European competitions, which led Real Madrid to snatch him up. Since then he has struggled, and he looks set to try to rediscover himself with Wolverhampton, who have teetered on the edge of breaking into what is considered the Premier League's elite. In recent seasons, they have been a thorn in the side of the teams in the top six.
Lastly, the struggling Arsenal are frantically looking for defensive back-up, which may come in the form of Brighton's Tariq Lamptey, a highly touted English prospect. Hector Bellerin has recently hit a tough run of form, making this transfer all the more sensible.
I wanted to see how the aforementioned model would classify these possible transfers, so I collected data on the above three players in the same manner as for the training data.
Running the model for these three players produced the prediction that both McKenzie and Lamptey will complete transfers in this coming January transfer window, while Jovic will most likely stay put at Real Madrid. It will be interesting to see how these predictions hold up in the coming weeks and provide a real-world outlook on a research-oriented topic.
Conclusions
As mentioned, the main limiting issue with this project was the lack of data. However, the data collected was sufficient for a strong lead into this area of text analytics for soccer, an area untapped thus far. The abundance of text data on the web makes collecting the data and verifying its reliability a significant challenge.
The exploratory work that went into this analysis highlighted some reliable patterns that likely apply to less refined data. If more data were to be collected, such as tweets or interviews, these patterns may be handy in developing a robust model capable of exploiting the wide range of sources to accomplish the outlined task.
The results of the classification task show a discrepancy between the training and test accuracies. This suggests that the model is overfitting, although the lack of data prevents this claim from being made confidently. Regardless, with an initial training accuracy of 81.25%, a workable beginner model has been demonstrated. Any future work to improve the model should focus more on preparing the data or discovering a neural network architecture that generalizes beyond the training data more successfully.
This project acts as a proof of concept that is worth pursuing. A subsequent attempt that solves the highlighted issues would be a useful tool for soccer clubs, companies, those in the betting industry, and writers looking to get ahead of the market.
Text data retrieved from Bleacher Report, Bleacher Report, Bleacher Report, Bleacher Report, Birmingham Mail, Birmingham Mail, BuliNews, BuliNews, Caught Offside, Daily Mail, Daily Star, Euro Football Rumours, ESPN Football, Football Espana, Football Espana, Football Espana, Football London, Football London, Football Transfer Tavern, Give Me Sport, Inside Futbol, Ligue 1 News, Manchester Evening News, Manchester Evening News, Manchester Evening News, MLS Soccer, NBC Sports, Pink Un, Sportsmole, SportsMole, SportsMole, SportsMole, SportsMole, SportsMole, SportsMole, SportsMole, Stoke Sentinel, Stoke Sentinel, Talk Sport, Talk Sport, TalkSport, Talk Sport, Talk Sport, Talk Sport, TBR Football, The Athletic, The Express, The Express, The Guardian, The Independent, The Inquirer, The Mirror, The Short Fuse, The Sun, The Sun, The Sun, The Sun, Tribuna, 90MIN, and 90MIN
Cover photo credited to Getty Images