
Exploring the English Premier League Data
Soccer is arguably the most popular and most watched sport in the world, with leagues in almost every country. In America, the sport has grown with the establishment of Major League Soccer. The English Premier League is widely considered to be the best league in the world.
In this project, I explore the English Premier League (EPL) dataset to answer the following questions:
How are scoring goals and match excitement levels related?
Does home team possession lead to more goals being scored in a game?
Does team possession % lead to higher team rating?
Does getting yellow cards make teams more vulnerable to conceding goals?
I used the following Python libraries in the project:
pandas
Matplotlib
SciPy
NumPy
Introduction to the data
The dataset I have chosen to explore is a single CSV file from kaggle.com, published by Sanjeet Singh Naik, who is a Dataset Expert on Kaggle. The dataset is named “Football Data : Top 5 Leagues” and is publicly accessible. It also contains data from other leagues in Europe, but for this project I am focusing specifically on the English Premier League data.csv file. The dataset link is listed under References.
The following features from the dataset will be useful in the later stages of the project:
Match Excitement (a rating which measures the excitement level of a match)
Home Team Rating (rating given to team playing at their home ground)
Home Team Possession % (% of the time the home team had the ball in their possession)
Home Team Goals Scored (total goals scored in a match by the home team)
Total Goals Scored (total goals scored in a match by both teams)
Home Team Goals Conceded (total number of goals allowed by the home team in a match)
Home Team Yellow Cards (yellow cards are warnings given by the referee to players who commit fouls during open play; this column contains the total number of yellow cards received by the home team in a match)
Pre-Processing the Data
After exploring the dataset, I found that it was well constructed and did not need much cleaning. I checked whether the columns listed above contained any NaN values, because I will be using those columns to create visualizations in the next step. Most columns are of integer or float type because they hold match statistics, and since most of the data is numerical we can visualize it with the Matplotlib library. I also added a new column named “Total goals scored” to the dataframe, which sums the goals scored by both teams in a game. This column will be used in a scatter plot to check its relationship with match excitement level.
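The two preprocessing steps described above can be sketched as follows. This is only an illustration on made-up rows: the column names home_goals, away_goals and match_excitement are hypothetical stand-ins for the real CSV headers, and dropping incomplete rows is an assumption about how missing values might be handled, not necessarily what was done in the project.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and invented rows; the real CSV headers may differ.
matches = pd.DataFrame({
    "home_goals": [2, 0, 3, 1],
    "away_goals": [1, 0, 2, np.nan],
    "match_excitement": [6.8, 3.1, 8.9, 5.0],
})

# Count missing values per column before building any plots.
missing_per_column = matches.isna().sum()
print(missing_per_column)

# Assumption: rows missing a value needed for the plots are simply dropped.
clean = matches.dropna(subset=["home_goals", "away_goals"]).copy()

# Total goals in a match = home team goals + away team goals.
clean["total_goals"] = clean["home_goals"] + clean["away_goals"]
print(clean)
```

With real data, pd.read_csv on the downloaded file would replace the hand-built dataframe.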
Visualizations
The scatter plot below shows the relationship between the total goals scored in a game and the match excitement level. I expected them to be moderately correlated, because goals are very hard to score in the Premier League and many exciting games end scoreless. But the r value shows that they are highly correlated.
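For readers who want to reproduce this kind of plot, here is a minimal sketch of a scatter plot annotated with Pearson's r. The numbers are invented for illustration and are not taken from the dataset, and the variable names are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Made-up values for illustration only; not taken from the dataset.
total_goals = np.array([0, 1, 2, 3, 4, 5])
excitement = np.array([2.5, 4.0, 4.8, 6.5, 7.0, 9.1])

# Pearson's r measures the strength of the linear relationship (-1 to 1).
r, p_value = stats.pearsonr(total_goals, excitement)

fig, ax = plt.subplots()
ax.scatter(total_goals, excitement)
ax.set_xlabel("Total goals scored")
ax.set_ylabel("Match excitement")
ax.set_title(f"Goals vs. excitement (r = {r:.2f})")
# Call plt.show() here when running interactively.
```

The r value is placed in the title so the chart and the statistic can be read together.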

The visualization below compares home team possession % with home team goals scored. Since the points are scattered with no clear pattern, the relationship between the two variables is weak. I initially thought that it would be a moderately positive relationship, because logically the more time you have the ball in your possession, the more chances you can create and score. But the r value shows that higher possession does not always lead to scoring more goals.

The two scatter plots below show possession % against team rating for home and away teams. Judging by where the points cluster and by the r value, there is a relationship between how much time a team has the ball and the rating it is given, but it is only weak to moderate. I would have expected a higher correlation, because most fans like to watch attacking, possession-based soccer with more forward and risk-taking play.
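A side-by-side comparison like this one could be drawn roughly as shown here. The dataframe is invented and the column names (home_possession, home_rating, away_possession, away_rating) are hypothetical; the real headers in the CSV may differ.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical column names and invented rows for illustration.
df = pd.DataFrame({
    "home_possession": [62, 48, 55, 39, 70, 51],
    "home_rating": [7.1, 6.4, 6.9, 6.6, 7.0, 6.2],
    "away_possession": [38, 52, 45, 61, 30, 49],
    "away_rating": [6.3, 6.8, 6.5, 6.9, 6.4, 6.7],
})

# One panel per side, sharing the y-axis so ratings are directly comparable.
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
r_values = {}
for ax, side in zip(axes, ["home", "away"]):
    x = df[f"{side}_possession"]
    y = df[f"{side}_rating"]
    r_values[side] = x.corr(y)  # Series.corr uses Pearson's r by default
    ax.scatter(x, y)
    ax.set_xlabel(f"{side.title()} possession %")
    ax.set_title(f"{side.title()} team (r = {r_values[side]:.2f})")
axes[0].set_ylabel("Team rating")
```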


This chart examines whether a home team becomes more vulnerable to conceding goals as it collects more yellow cards. The r value shows that the two variables have a very weak relationship, which matches what I expected. One likely reason is that team managers often instruct their players to commit fouls to break the momentum of the opposition and prevent dangerous attacks.
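Words such as "very weak", "moderate" and "highly correlated" are used throughout this section. The helper below shows one way such labels could be assigned consistently from an r value. The cut-offs are an assumption based on a common rule of thumb, not thresholds used in the project, and describe_strength is a hypothetical name.

```python
def describe_strength(r):
    """Label the size of a correlation coefficient.

    The thresholds are one common rule of thumb (an assumption);
    other sources draw the boundaries differently.
    """
    magnitude = abs(r)  # the sign gives direction, not strength
    if magnitude < 0.2:
        return "very weak"
    if magnitude < 0.4:
        return "weak"
    if magnitude < 0.6:
        return "moderate"
    if magnitude < 0.8:
        return "strong"
    return "very strong"


# Example values chosen only to exercise each branch.
for value in (0.05, -0.3, 0.5, 0.95):
    print(value, describe_strength(value))
```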

Conclusions
One of the things that I wanted to see was whether match excitement and the number of goals scored in a match were related. After visualizing the data in a scatter plot and calculating the correlation coefficient for the two variables, we can conclude that the more goals are scored, the higher the match excitement level tends to be.
Another big talking point amongst soccer enthusiasts is whether higher ball possession % leads to more goals. Many soccer pundits believe that greater ball possession leads to more goals and ultimately more wins, and most managers' playing philosophy is moving towards a high-tempo, pressing, possession-heavy style. But the second visualization shows that home team possession is only weakly related to home team goals, so games can be won with less of the ball.
Pundits also tend to give more credit to teams that dominated possession, which some might disagree with. The visualizations do show that higher ball possession is associated with a higher team rating, which may reflect fans of the team enjoying the tactics and philosophy installed by the manager.
The final visualization examines whether a more physical team concedes more goals. The weak relationship means that just because a team has committed more fouls, it does not mean that it will concede more goals. In soccer, two yellow cards given to the same player result in that player's ejection from the game, which gives the opposition a great advantage because they have an additional player on the field. So a common assumption is that a player who gets one yellow card will be less aggressive, which may give the opposition the edge to try to nick a goal and get a favorable result. The visualization suggests that a highly penalized team is not more prone to conceding goals.
References
https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
https://www.kaggle.com/sanjeetsinghnaik/football-data-top-5-leagues