Udacity Nanodegree Project1: Investigate a Dataset - [European Soccer Database] - 09/10 Season

Name: Oke Oladunsi

To skip the technical details, jump to the Conclusions section for a summary of the facts and findings from this study.

Table of Contents

Introduction

Dataset Description

Tip: This soccer database comes from Kaggle and is well suited for data analysis and machine learning. It contains data on soccer matches, players, and teams from 11 European countries, covering 2008 to 2016.

Question(s) for Analysis

During my investigation I will try to answer the following questions:

Tip: What are the top 20 performing teams (by goals scored) in Europe's top five (5) leagues for the 2009/2010 season?

Tip: Are there any correlations between goals scored or conceded and games won per team?

Tip: Is there a significant relationship between the number of home games won and a team's position at the end of the season?

I will investigate these and other questions throughout this process.

Data Wrangling

Tip: In this section of the report, I load the data, check it for cleanliness, and then trim and clean the dataset for analysis. Each step is documented and each cleaning decision is justified.

General Properties

Data Cleaning (N.B.: for this project I downloaded the database to my PC and access it through the SQL functions provided by pandas.)

Throughout this project the following packages are used:

conda install pandas numpy matplotlib

(sqlite3 is part of the Python standard library and does not need to be installed separately.)

Loading data into pandas dataframe
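A minimal sketch of how the tables could be loaded, assuming the Kaggle file is saved locally as `database.sqlite` (an assumed file name). The helper `load_tables` is a hypothetical name introduced for illustration, and a tiny in-memory database with made-up rows stands in for the real file so the example runs on its own.

```python
import sqlite3

import pandas as pd


def load_tables(conn, table_names):
    """Read each named table into a DataFrame, keyed by table name."""
    return {
        name: pd.read_sql_query(f'SELECT * FROM "{name}"', conn)
        for name in table_names
    }


# With the real data the connection would point at the downloaded file, e.g.
# conn = sqlite3.connect("database.sqlite")  # assumed file name
# Here an in-memory stand-in with toy rows keeps the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO Country VALUES (?, ?)", [(1, "England"), (2, "Spain")]
)
conn.commit()

tables = load_tables(conn, ["Country"])
countries = tables["Country"]
print(countries)
```

Keeping the loaded tables in a dictionary makes it easy to inspect each one with `.info()` or `.head()` before cleaning.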

The total number of players who played in Europe during this period can be found by running the cell below.

Team attributes can be found by running the cell below.

N.B.: The only column with missing values is buildUpPlayDribbling.
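One way to confirm which columns are incomplete is to count missing values per column. The sketch below uses a small made-up frame in place of the real team attributes table; only the column name `buildUpPlayDribbling` comes from this report, and the other columns and all values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the team attributes table (values are invented).
team_attributes = pd.DataFrame(
    {
        "team_api_id": [1, 2, 3, 4],
        "buildUpPlaySpeed": [60, 52, 47, 70],
        "buildUpPlayDribbling": [np.nan, 48.0, np.nan, 41.0],
        "buildUpPlayPassing": [50, 56, 54, 70],
    }
)

# Count missing values in every column, then keep only the incomplete ones.
missing_per_column = team_attributes.isnull().sum()
incomplete_columns = missing_per_column[missing_per_column > 0].index.tolist()
print(incomplete_columns)
```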

The European countries and their respective leagues can be found by running the cell below.
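A possible query for pairing each country with its league is sketched below. It assumes a `Country` table with `id` and `name` columns and a `League` table with `id`, `country_id`, and `name` columns; the in-memory database and its rows are toy stand-ins for the downloaded file.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the downloaded database (toy rows, assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Country (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE League (id INTEGER PRIMARY KEY, country_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO Country VALUES (?, ?)", [(1, "England"), (2, "Spain")]
)
conn.executemany(
    "INSERT INTO League VALUES (?, ?, ?)",
    [(10, 1, "England Premier League"), (20, 2, "Spain LIGA BBVA")],
)
conn.commit()

# Join each league to its country through the country_id foreign key.
query = """
SELECT c.name AS country, l.name AS league
FROM League AS l
JOIN Country AS c ON c.id = l.country_id
ORDER BY c.name
"""
country_leagues = pd.read_sql_query(query, conn)
print(country_leagues)
```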

The total number of matches in the database can be found by running the cell below.

The total number of seasons in the database can be found by running the cell below.
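Counting matches and seasons might look like the following sketch. It assumes the match table has a `season` column holding strings such as `2009/2010`; the frame below is a toy stand-in with invented rows.

```python
import pandas as pd

# Toy stand-in for the match table (invented rows, assumed column names).
matches = pd.DataFrame(
    {
        "season": ["2008/2009", "2009/2010", "2009/2010", "2010/2011"],
        "home_team_goal": [1, 2, 0, 3],
        "away_team_goal": [1, 0, 1, 2],
    }
)

total_matches = len(matches)                           # one row per match
seasons = sorted(matches["season"].unique())           # distinct season labels
matches_per_season = matches.groupby("season").size()  # matches in each season

print(total_matches, len(seasons))
print(matches_per_season)
```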

Exploratory Data Analysis

Tip: At this point I have generated several CSV files containing the standings for each league across the 8 seasons. The analysis that follows is based on these generated CSV files.

Research Question 1 (What are the top 20 teams with the most goals in the 2009/2010 season in Europe's top five leagues?)
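How the standings files were generated is not shown in this report. As a rough sketch only, a league table can be derived from match results by stacking each match once from the home side and once from the away side, then summing per team. The function `build_standings`, the column names, and the toy results are all assumptions, and the tie-breakers used here (goal difference, then goals scored) are a simplification, since some leagues use head-to-head records instead.

```python
import pandas as pd


def build_standings(matches):
    """Aggregate match results into one row per team (3 points per win, 1 per draw)."""
    keep = ["team", "goals_for", "goals_against"]
    # View every match from the home team's side and from the away team's side.
    home = matches.rename(
        columns={"home_team": "team", "home_goals": "goals_for",
                 "away_goals": "goals_against"}
    )[keep]
    away = matches.rename(
        columns={"away_team": "team", "away_goals": "goals_for",
                 "home_goals": "goals_against"}
    )[keep]
    rows = pd.concat([home, away], ignore_index=True)

    rows["played"] = 1
    rows["wins"] = (rows["goals_for"] > rows["goals_against"]).astype(int)
    rows["draws"] = (rows["goals_for"] == rows["goals_against"]).astype(int)
    rows["losses"] = (rows["goals_for"] < rows["goals_against"]).astype(int)

    totals = ["played", "wins", "draws", "losses", "goals_for", "goals_against"]
    table = rows.groupby("team", as_index=False)[totals].sum()
    table["points"] = 3 * table["wins"] + table["draws"]
    table["goal_difference"] = table["goals_for"] - table["goals_against"]

    # Simplified ordering: points, then goal difference, then goals scored.
    table = table.sort_values(
        ["points", "goal_difference", "goals_for"], ascending=False
    ).reset_index(drop=True)
    table.insert(0, "position", list(range(1, len(table) + 1)))
    return table


# Toy results for three hypothetical teams.
toy_matches = pd.DataFrame(
    {
        "home_team": ["A", "B", "C"],
        "away_team": ["B", "C", "A"],
        "home_goals": [2, 1, 0],
        "away_goals": [0, 1, 3],
    }
)
standings = build_standings(toy_matches)
print(standings)
# standings.to_csv("league_season.csv", index=False)  # hypothetical output file
```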

I will continue this analysis later; for now I have turned the cells below into markdown so that I can meet the submission deadline.

Tip: I wanted to investigate how teams in the top five leagues performed over the course of 5 seasons, but for now I will only look at one season.

    spain_liga_league_10_11 = pd.read_csv("spain_liga_bbva_2010_2011.csv")
    english_premier_league_10_11 = pd.read_csv("england_premier_league_2010_2011.csv")
    france_ligue_1_10_11 = pd.read_csv("france_ligue_1_2010_2011.csv")
    germany_1_bundesliga_10_11 = pd.read_csv("germany_1._bundesliga_2010_2011.csv")
    italy_seria_a_10_11 = pd.read_csv("italy_serie_a_2010_2011.csv")

    spain_liga_league_11_12 = pd.read_csv("spain_liga_bbva_2011_2012.csv")
    english_premier_league_11_12 = pd.read_csv("england_premier_league_2011_2012.csv")
    france_ligue_1_11_12 = pd.read_csv("france_ligue_1_2011_2012.csv")
    germany_1_bundesliga_11_12 = pd.read_csv("germany_1._bundesliga_2011_2012.csv")
    italy_seria_a_11_12 = pd.read_csv("italy_serie_a_2011_2012.csv")

    spain_liga_league_12_13 = pd.read_csv("spain_liga_bbva_2012_2013.csv")
    english_premier_league_12_13 = pd.read_csv("england_premier_league_2012_2013.csv")
    france_ligue_1_12_13 = pd.read_csv("france_ligue_1_2012_2013.csv")
    germany_1_bundesliga_12_13 = pd.read_csv("germany_1._bundesliga_2012_2013.csv")
    italy_seria_a_12_13 = pd.read_csv("italy_serie_a_2012_2013.csv")

The top 20 clubs across Europe's top five (5) leagues for the 2009/2010 season can be found by running the cell below.
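A sketch of one way to rank teams across leagues: stack the per-league standings into a single frame and take the rows with the largest goal totals. The function `top_scoring_teams`, the column names `team` and `goals_for`, and the placeholder data are assumptions, since the layout of the generated CSV files is not shown here.

```python
import pandas as pd


def top_scoring_teams(league_tables, n=20):
    """Combine per-league standings and return the n teams with the most goals."""
    combined = pd.concat(
        [table.assign(league=league) for league, table in league_tables.items()],
        ignore_index=True,
    )
    top = combined.nlargest(n, "goals_for")
    return top[["team", "league", "goals_for"]].reset_index(drop=True)


# Placeholder standings for two hypothetical leagues (invented numbers).
league_tables = {
    "League X": pd.DataFrame({"team": ["A", "B"], "goals_for": [80, 55]}),
    "League Y": pd.DataFrame({"team": ["C", "D"], "goals_for": [90, 60]}),
}

top_three = top_scoring_teams(league_tables, n=3)
print(top_three)
```

With the real files, each dictionary value would be a frame read with `pd.read_csv`, and `n` would stay at its default of 20.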

Answers to the questions asked above

Research Question 2 (Are there any correlations between goals scored and games won per team?)
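One common way to quantify this kind of relationship is the Pearson correlation coefficient, which `Series.corr` computes by default. The sketch below is illustrative only: the column names and the numbers are invented and are not results from the dataset.

```python
import pandas as pd

# Invented standings for five hypothetical teams.
standings = pd.DataFrame(
    {
        "goals_for": [30, 40, 50, 60, 70],
        "goals_against": [60, 55, 45, 35, 30],
        "wins": [8, 10, 14, 18, 22],
    }
)

# Pearson correlation ranges from -1 (perfect inverse) to +1 (perfect direct).
r_scored = standings["goals_for"].corr(standings["wins"])
r_conceded = standings["goals_against"].corr(standings["wins"])

print(f"goals scored vs wins:   {r_scored:.3f}")
print(f"goals conceded vs wins: {r_conceded:.3f}")
```

A scatter plot of the same two columns is a useful companion, because a single coefficient can hide outliers or a non-linear pattern.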

Answers to the questions asked above

Research Question 3 (Is there a significant relationship between the number of home games won and a team's position at the end of the season?)

These are the other facts that we can deduce from the dataset.
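Because final position is a rank rather than a quantity, a rank-based measure such as the Spearman correlation is one reasonable choice here. The sketch below computes it as the Pearson correlation of the ranked values, which avoids any dependency beyond pandas. The column names and the numbers are invented for illustration.

```python
import pandas as pd

# Invented final table for five hypothetical teams.
standings = pd.DataFrame(
    {
        "position": [1, 2, 3, 4, 5],
        "home_wins": [15, 13, 13, 9, 6],
    }
)

# Spearman correlation = Pearson correlation of the ranks (ties get average ranks).
position_rank = standings["position"].rank()
home_wins_rank = standings["home_wins"].rank()
rho = position_rank.corr(home_wins_rank)

print(f"Spearman rho: {rho:.3f}")
```

A strongly negative value would mean that teams with more home wins tend to finish nearer the top, since position 1 is the best. Judging statistical significance would additionally need a p-value, which this sketch does not compute.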

Conclusions

Tip: After carrying out the analysis on this dataset, I was able to answer the questions asked in the Questions section.

From the analysis I carried out on this dataset, I was able to determine the following:

  1. Chelsea scored the most goals in all of Europe.
  2. Camp Nou, home of FC Barcelona, is the most difficult place in all of Europe to get a win.
  3. If a team scores between 59 and 100 goals, there is a possibility of qualifying for European cup competitions.
  4. Any team that concedes between 59 and 90 goals will probably finish in the lower end of the table at the end of the season.

The observations listed above suggest that the more goals a team scores, the better its chances of winning more matches. We can also deduce that the best form of defense any team can have is scoring more goals, since the top 20 highest-scoring teams also conceded fewer goals. We can also see that the teams that conceded the most goals in their respective leagues were most often relegated.

Limitations

Tip: Finally, the players dataset needs more intensive cleaning, which I could not achieve before the deadline. I have to put further analysis on hold at this point so that I am not prevented from carrying on with the work I intend to do. Once I have submitted, I hope to carry out more analysis and learn more from this interesting dataset.

References
  1. Connor Anderson, Kevin Velasco, and Alex Shropshire. Hypothesis Testing European Soccer Data; Home Field Advantage, Ideal Formations, and Inter-League Attributes Explored in Python. May 11, 2019.
  2. Yiou Wang. Data Analyses of European Soccer. August 2020.
  3. Soccer Data Analysis.