Data Visualization on IPL Dataset

Hey people, we've performed a few basic EDA commands on the IPL dataset. Now we are going to learn how to visualize the data using the Seaborn and Matplotlib libraries.

Why do we need to visualize data?

Fact: Humans remember more of what they see than of what they hear.

A picture is worth a thousand words. Instead of reading a table value by value, we can plot it and take in far more of the data at a glance. People also find it easier to understand what a picture represents than to interpret raw numbers.

Data can be represented in several ways, depending on its shape and on the relationships we want to show. We will cover as many chart types as we can so that you get a good feel for visualization.

Bar Graph

  • A bar graph compares a value across categories by drawing one bar per category, so it shows the relationship between two variables.

  • Here, we've used Matplotlib's pyplot module and the Seaborn library to draw it (imported as plt and sns).

  • xticks controls the tick labels on the x-axis, which here are the labels under each bar.

  • xlabel and ylabel set the text that names the x-axis and the y-axis, so the reader knows which column is plotted on which axis.

  • title gives the bar graph a title, as its name suggests. In the same way, show displays the finished graph.

    Bar Graph Code snippet
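To tie these functions together, here is a minimal sketch of a bar graph. The original snippet is not reproduced in this text, so the DataFrame, its team and wins columns, and the numbers in it are made-up placeholders rather than real IPL results.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up sample data; the column names 'team' and 'wins' are assumptions.
wins = pd.DataFrame({
    "team": ["Team A", "Team B", "Team C", "Team D"],
    "wins": [9, 8, 6, 5],
})

plt.figure(figsize=(8, 5))
ax = sns.barplot(x="team", y="wins", data=wins)  # one bar per team

plt.xticks(rotation=45)            # tilt the bar labels so long names do not overlap
plt.xlabel("Team")                 # text that names the x-axis
plt.ylabel("Matches Won")          # text that names the y-axis
plt.title("Matches Won per Team")
# Add plt.show() here to open the plot window when running this as a script.
```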

Line Chart

  • The figure function lets us set the size of the figure in which the graph is drawn.

  • We pass the names of the columns we want to plot to the lineplot function.

Histogram

  • A histogram looks similar to a bar graph, but it groups a numeric column into ranges (bins) and draws the bars side by side with no gaps, which makes the distribution easy to read.

  • The first line of code, plt.figure(figsize=(10, 6)), creates a new figure with a width of 10 inches and a height of 6 inches.

  • This is the size of the plot that will be displayed.

      sns.histplot(df['home_runs'], bins=20, kde=True, color='blue', label='Home Team Runs')
      sns.histplot(df['away_runs'], bins=20, kde=True, color='red', label='Away Team Runs')
    

    These two lines use the Seaborn library to create two histograms, one for the home team runs and one for the away team runs.

  • The bins parameter controls the number of bins in the histogram, and the kde parameter controls whether to overlay a smoothed version of the distribution (called a kernel density estimate).

  • The color parameter controls the color of the bars, and the label parameter sets the name shown for them in the legend.

      plt.xlabel('Runs')
      plt.ylabel('Frequency')
      plt.title('Distribution of Runs Scored by Home and Away Teams')
    

    These lines set the labels for the x-axis and y-axis and the title of the plot.

      plt.legend()
      plt.show()
    

Scatter Plot

The first line of code, plt.figure(figsize=(10, 6)), creates a new figure with a width of 10 inches and a height of 6 inches. This is the size of the plot that will be displayed.

The next line of code

sns.scatterplot(x='home_score', y='away_score', data=df)

uses the Seaborn library to create a scatter plot of the home team score vs. the away team score. The x and y parameters specify the columns in the df DataFrame that contain the home team score and away team score, respectively.

The next three lines of code

plt.xlabel('Home Team Score')
plt.ylabel('Away Team Score')
plt.title('Scatter Plot of Home Team vs. Away Team Scores')

set the labels for the x-axis, y-axis, and title of the plot.

The final line of code

plt.show()

displays the plot.
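For reference, the scatter plot steps combine into the short sketch below. The five rows of scores are made up so that the example is self-contained.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up scores standing in for the real home_score and away_score columns.
df = pd.DataFrame({
    "home_score": [150, 172, 138, 190, 165],
    "away_score": [148, 160, 141, 175, 170],
})

plt.figure(figsize=(10, 6))
ax = sns.scatterplot(x="home_score", y="away_score", data=df)  # one point per match

plt.xlabel("Home Team Score")
plt.ylabel("Away Team Score")
plt.title("Scatter Plot of Home Team vs. Away Team Scores")
# Add plt.show() here to open the plot window when running this as a script.
```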

Pie Chart

  • The df['decision'].value_counts() call counts the number of loans in each category of decision (approved or denied) and returns the counts as a pandas Series.

  • Calling plot(kind='pie', autopct='%1.1f%%') on that Series then creates a pie chart, where each slice represents the share of loans in each category.

  • The autopct parameter specifies the format of the labels on the pie chart. In this case, the labels are formatted as percentages with one decimal place.
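Here is a minimal sketch of those two steps. The decision column and its four values are placeholders invented for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample data with two categories.
df = pd.DataFrame({"decision": ["approved", "denied", "approved", "approved"]})

counts = df["decision"].value_counts()  # Series: category -> number of rows

plt.figure(figsize=(6, 6))
ax = counts.plot(kind="pie", autopct="%1.1f%%")  # slice labels such as 75.0%
ax.set_ylabel("")                                # hide the default axis label
plt.title("Share of Each Decision")
# Add plt.show() here to open the plot window when running this as a script.
```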

Heatmap

A heatmap is a graphical representation of data where values are depicted by color.

Heatmaps can be used to show a variety of things, such as:

  • The distribution of data: A heatmap can be used to show the distribution of data in a dataset. For example, a heatmap of the temperatures in a city can show how the temperatures vary across the city.

  • The relationship between two variables: A heatmap can be used to show the relationship between two variables. For example, a heatmap of the correlation between customer satisfaction and product price can show how customer satisfaction changes with product price.

  • The importance of features: A heatmap can be used to show the importance of features in a machine learning model. For example, a heatmap of the feature importance in a decision tree can show which features are most important for the model's predictions.

sns.heatmap(correlation_matrix, cmap='coolwarm', annot=True, fmt=".2f")

uses the Seaborn library to create a heatmap of the correlation matrix. The cmap parameter specifies the colormap to use, and the annot parameter specifies whether to annotate the heatmap with the correlation coefficients. The fmt parameter specifies the format of the annotations. In this case, the annotations are formatted as two-decimal-place floats.
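The text does not show where correlation_matrix comes from. One common way, assumed here, is to call DataFrame.corr() on the numeric columns. The random data below is only a stand-in for the real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic numeric columns (not real IPL data), seeded for repeatability.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "home_runs": rng.normal(160, 20, 100),
    "away_runs": rng.normal(150, 25, 100),
    "season": rng.integers(2018, 2023, 100),
})

# Pairwise Pearson correlation between every pair of columns.
correlation_matrix = df.corr()

plt.figure(figsize=(8, 6))
ax = sns.heatmap(correlation_matrix, cmap="coolwarm", annot=True, fmt=".2f")
plt.title("Correlation Matrix")
# Add plt.show() here to open the plot window when running this as a script.
```

The diagonal is always 1.00 because every column is perfectly correlated with itself.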

Box Plot

  • The sns.set(font_scale=1.2) function is used to set the font scale for all subsequent plots.

  • This means that the font size of all the text in the plot will be 1.2 times the default size. This can be useful for making the text in plots easier to read.

  • The boxplot() function is used to create a box plot. A box plot is a graphical representation of the distribution of data.

  • It shows the median, interquartile range, and outliers for each group of data. The median is the middle value in the distribution, and the interquartile range is the range of values between the first and third quartiles.

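  • Outliers are values that are much higher or lower than the rest of the data. By default, they are the points lying more than 1.5 times the interquartile range beyond the first or third quartile, and they are drawn as individual markers.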
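The sketch below draws a box plot and then recomputes the median, quartiles, and outliers for one group by hand, to show what the box encodes. The season and home_runs values are made up for the example.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

sns.set(font_scale=1.2)  # scale every font in later plots to 1.2x the default

# Made-up sample data standing in for the real columns.
df = pd.DataFrame({
    "season": [2021] * 5 + [2022] * 5,
    "home_runs": [120, 140, 150, 160, 250, 130, 145, 155, 165, 175],
})

plt.figure(figsize=(10, 6))
ax = sns.boxplot(x="season", y="home_runs", data=df)  # one box per season
plt.xlabel("Season")
plt.ylabel("Home Team Runs")
plt.title("Home Team Runs per Season")

# The statistics behind the 2021 box, computed with pandas.
runs_2021 = df.loc[df["season"] == 2021, "home_runs"]
median = runs_2021.median()
q1, q3 = runs_2021.quantile(0.25), runs_2021.quantile(0.75)
iqr = q3 - q1  # interquartile range
# Points beyond 1.5 * IQR from the quartiles are treated as outliers.
outliers = runs_2021[(runs_2021 < q1 - 1.5 * iqr) | (runs_2021 > q3 + 1.5 * iqr)]
print(median, q1, q3, iqr, outliers.tolist())
# Add plt.show() here to open the plot window when running this as a script.
```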

Bubble Chart

sns.scatterplot(x='home_runs', y='away_runs', data=df, hue='season', size='season', sizes=(50, 200))

  • This uses the Seaborn library to create a bubble chart of the home_runs and away_runs columns in the DataFrame df.

  • The x and y parameters specify the columns to plot on the x-axis and y-axis, respectively.

  • The hue parameter specifies the column to use for coloring the bubbles, and the size parameter specifies the column to use for sizing the bubbles.

  • The sizes parameter specifies the range of sizes for the bubbles.
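A self-contained version of the bubble chart might look like this. The six rows are made-up values that only mimic the shape of the real columns.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up sample data standing in for the real columns.
df = pd.DataFrame({
    "home_runs": [150, 165, 140, 180, 172, 158],
    "away_runs": [145, 170, 152, 160, 181, 149],
    "season": [2020, 2020, 2021, 2021, 2022, 2022],
})

plt.figure(figsize=(10, 6))
# hue colors each bubble by season; size scales its area by season,
# with areas kept between 50 and 200 points squared.
ax = sns.scatterplot(x="home_runs", y="away_runs", data=df,
                     hue="season", size="season", sizes=(50, 200))

plt.xlabel("Home Team Runs")
plt.ylabel("Away Team Runs")
plt.title("Home vs. Away Runs by Season")
# Add plt.show() here to open the plot window when running this as a script.
```

Using the same column for hue and size, as the original call does, encodes the season twice. Mapping size to a different numeric column would let the chart show a third variable.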

That is all I've learned about data visualization so far. I hope you learned something about visualizing data too.

In the next article, we are going to implement basic Machine Learning algorithms on this dataset and try a few things, such as:

  • Regression

  • Building a model to predict the best batsman and bowler in the IPL.

To avoid missing further articles, please follow me, and leave a comment if you have any doubts regarding data visualization.