For readers with a background in R, ggplot will already be familiar: the ggplot package for Python is virtually identical to the R package. For readers unfamiliar with R and ggplot, ggplot has a number of distinct advantages over some other graphics packages:

- With the use of "building blocks", superposition of information is extremely simple.
- Multiple data sets can easily be presented in the same plot.
- Faceting can be done without the use of loops.
- There is a highly active community of users, with tutorials, guides and questions and answers available online.
- Geoms are intuitive to use.
- The package is built upon one of the major schools of data visualization.
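
The "building blocks" idea in the first bullet can be illustrated with a toy sketch. The `Plot` class below is hypothetical and is not the real ggplot implementation; it only shows how the `+` operator can stack layers onto a base object while leaving the base reusable.

```python
class Plot:
    """Toy stand-in for a plot object; NOT the real ggplot class."""

    def __init__(self, data, layers=()):
        self.data = data
        self.layers = tuple(layers)

    def __add__(self, layer):
        # Return a new Plot so the original can be reused as a base.
        return Plot(self.data, self.layers + (layer,))


base = Plot(data=[1, 2, 3])
with_points = base + "points"
with_points_and_line = with_points + "smooth line"

print(base.layers)                  # the base is unchanged
print(with_points_and_line.layers)  # both layers, in the order added
```

This is the pattern the rest of the tutorial relies on: a base object is built once and different geoms are added to it to produce different plots.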

In this tutorial we will cover some of the most important and impressive features of ggplot. We will use the built-in datasets; more information on those can be found at http://docs.ggplot2.org/current/. ggplot will work with most datasets, provided the data is in a readable format.

We will cover the following topics in this tutorial:

If `ggplot` is not installed, it can be installed using `pip install` on the command line or from the Git repository: https://github.com/yhat/ggplot.

In [5]:

```
#pip install numpy
#pip install pandas
#pip install matplotlib
#pip install ggplot
```

We will begin by importing three packages needed for ggplot. `numpy` gives us access to the mathematical functions we need, `pandas` gives us access to dataframes, the preferred data structure in R, and `matplotlib.pyplot` gives us access to what is considered Python's more traditional graphics and visualization package.

Without the `%matplotlib inline` line, the graphs will open in a separate window.

In [6]:

```
import numpy as np
import pandas as pd
#import scipy as sp
import matplotlib.pyplot as plt
%matplotlib inline
from ggplot import *
#from scipy.stats import iqr
```

We will use two of `ggplot`'s built-in datasets: `diamonds` and `mtcars`. Below we call `.head()` to display the first rows of each dataset.

In [50]:

```
# Treat gear as a categorical variable so it can be used to color plots later
mtcars['gear'] = mtcars['gear'].astype('category')
print(mtcars.head())
print(diamonds.head())
```

Our first plot is a scatterplot of miles per gallon (`mpg`) versus horsepower (`hp`) of the `mtcars` dataset. We start by creating a ggplot object.

In [10]:

```
mpg_hp_scatter = ggplot(mtcars, aes(x = 'hp', y = 'mpg'))
type(mpg_hp_scatter)
```

Out[10]:

Next we add `geom_point()` to our object to make our scatterplot. We will also add labels and a title to our plot, using `xlab` and `ylab` to add axis labels and `ggtitle` to add the title.

In [11]:

```
mpg_hp_scatter = mpg_hp_scatter + \
ylab('Miles per Gallon') + \
xlab("Horsepower") + \
ggtitle("Miles per Gallon vs. Horsepower")
mpg_hp_scatter_wg = mpg_hp_scatter + geom_point()
mpg_hp_scatter_wg
```

Out[11]:

The plot above shows the miles per gallon versus the horsepower of the `mtcars` dataset. There is a clear negative relationship between the two variables: as horsepower increases, miles per gallon decreases.
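
A negative relationship like this one can also be summarized with a single number, the Pearson correlation coefficient. The sketch below is dependency-free, and the `hp` and `mpg` values in it are made up for illustration; they are not taken from `mtcars`.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical (hp, mpg) pairs for illustration only, not the real dataset.
hp = [90, 110, 150, 180, 245]
mpg = [30.0, 21.0, 19.0, 16.0, 14.0]

r = pearson_r(hp, mpg)
print(round(r, 2))  # a value close to -1 indicates a strong negative relationship
```

With the real data loaded as a pandas dataframe, `mtcars['hp'].corr(mtcars['mpg'])` should give the corresponding value directly.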

Below we have plotted the same scatterplot with the addition of a third variable, `gear`, which colors the points of the plot.

In [51]:

```
mpg_hp_wc_scatter = ggplot(mtcars, aes(x = 'hp', y = 'mpg', color = 'gear'))
mpg_hp_wc_scatter = mpg_hp_wc_scatter + \
ylab('Miles per Gallon') + \
xlab("Horsepower") + \
ggtitle("Miles per Gallon vs. Horsepower\nColored by Gear")
mpg_hp_wc_scatter_wg = mpg_hp_wc_scatter + geom_point()
mpg_hp_wc_scatter_wg
```

Out[51]:

Continuing with the `mtcars` dataset, we will graph a histogram of `mpg`.

In [34]:

```
mpg_hist = ggplot(mtcars, aes(x = "mpg"))
mpg_hist = mpg_hist + \
xlab("Miles per Gallon") + \
ylab("Frequency") + \
ggtitle("Frequency of Miles per Gallon")
mpg_hist_wg = mpg_hist + geom_histogram()
mpg_hist_wg
```

Out[34]:

We can see in the graph above that mpg is skewed right with a center around 17. Since the bin width and the number of bins are presentation choices rather than properties of the data, we can adjust their values until we get the best possible histogram.

We will use the Freedman-Diaconis rule, a statistical formula for selecting a suitable bin width. Our function `best_binwidth` implements the Freedman-Diaconis rule in Python. The plot below is the same plot with an updated bin width.

In [35]:

```
def best_binwidth(x):
    # Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3)
    q75, q25 = np.percentile(x, [75, 25])
    iqr = float(q75 - q25)
    return 2 * iqr / len(x) ** (1 / 3.0)

mpg_hist_bb_wg = mpg_hist + geom_histogram(binwidth = best_binwidth(mtcars['mpg']))
mpg_hist_bb_wg
```

Out[35]:
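
As a worked example of the rule, the Freedman-Diaconis bin width is 2 × IQR / n^(1/3), where IQR is the interquartile range and n is the number of observations. The sketch below uses only the standard library and a made-up sample (not `mtcars`); the "inclusive" quantile method interpolates linearly and should agree with the default behaviour of `np.percentile`.

```python
import math
import statistics


def freedman_diaconis_width(values):
    """Bin width suggested by the Freedman-Diaconis rule."""
    # Quartiles with linear interpolation between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return 2 * iqr / len(values) ** (1 / 3.0)


# Made-up sample for illustration; not taken from mtcars.
sample = [10, 12, 14, 15, 15, 16, 18, 21]

width = freedman_diaconis_width(sample)
# The number of bins follows from the data range divided by the bin width.
n_bins = math.ceil((max(sample) - min(sample)) / width)
print(width, n_bins)
```

Here the quartiles are 13.5 and 16.5, so the IQR is 3; with n = 8 the cube root is 2, giving a bin width of about 3 and four bins across the range of 11.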

In [36]:

```
mpg_hist_wg_lines = mpg_hist + \
geom_histogram() + \
geom_vline(x = mtcars["mpg"].mean(), color = "red") + \
geom_vline(x = mtcars["mpg"].median(), color = "green")
mpg_hist_wg_lines
```

Out[36]:

In [40]:

```
mpg_density = ggplot(mtcars, aes(x = "mpg"))
mpg_density = mpg_density + \
xlab("mpg") +\
ylab("Frequency") +\
ggtitle("Frequency of MPG")
mpg_density_wg = mpg_density + geom_density()
mpg_density_wg
```

Out[40]:

The plot above uses `geom_density` to create a density plot, continuing our `mpg` example.

In comparison to the histograms above, our density plot shows the distribution of mpg with a single line. We continue to see the right skew we discussed earlier and the center between 15 and 20. Some readers find density plots easier to read than histograms and vice versa; for many of the visualizations the reader will need, the choice is a matter of preference.

ggplot can also display multiple distributions if we split the variable based on another variable. To do this, we add a parameter to the `aes` mapping in our initial `ggplot` call. In our example below, we have set the parameter `color` to `gear`, one of the other variables in the dataset.

In [52]:

```
mpg_g_density = ggplot(mtcars, aes(x = "mpg", color = "gear"))
mpg_g_density = mpg_g_density + \
xlab("mpg") +\
ylab("Frequency") +\
ggtitle("Frequency of MPG\nSplit by Gear")
mpg_g_density_wg = mpg_g_density + geom_density()
mpg_g_density_wg
```

Out[52]:

We can see that the distributions of mpg are different for each of the gears. Thus, splitting by a categorical variable can add to our understanding of another variable.

Since histograms and density plots are different plot types of the same data, we can call `geom_histogram` instead of `geom_density` to get the histogram version of the plot above.

In [53]:

```
mpg_g_hist_wg = mpg_g_density + geom_histogram()
mpg_g_hist_wg
```

Out[53]:

Bar charts are often confused with histograms due to their visual similarities. However, they display different types of data, and different information can be gained from each plot type. Where histograms display one quantitative variable, bar charts display one categorical variable. Since histograms display quantitative data, the skew, mean and median are all meaningful statistics. Since a bar chart displays the categories of a categorical variable, none of those statistics make sense for it.
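
The difference shows up in how the bar heights are computed. In the sketch below, which uses made-up values rather than the real datasets, a bar chart simply counts each category, while a histogram first groups numeric values into bins (the bin width of 5 is an arbitrary choice for the example).

```python
from collections import Counter

# Hypothetical values for illustration only.
cuts = ["Ideal", "Fair", "Ideal", "Good", "Ideal", "Good"]  # categorical
mpg = [14.3, 15.2, 18.7, 21.0, 21.4, 22.8]                   # quantitative

# Bar chart: one bar per category, height = number of occurrences.
bar_heights = Counter(cuts)

# Histogram: each value is assigned to a bin, labelled by its lower edge.
bin_width = 5
hist_heights = Counter(int(v // bin_width) * bin_width for v in mpg)

print(bar_heights.most_common())
print(sorted(hist_heights.items()))
```

The histogram bins have a natural order and spacing, which is why skew and center are meaningful there, whereas the categories of the bar chart could be listed in any order.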

To create a bar chart in ggplot, we use `geom_bar`. Below, we have plotted `cut` from the `diamonds` dataset.

In [19]:

```
cut_bar = ggplot(diamonds, aes(x = "cut"))
cut_bar = cut_bar + \
xlab("Cut") +\
ylab("Frequency") +\
ggtitle("Distribution of Cut")
cut_bar_wg = cut_bar + geom_bar()
cut_bar_wg
```

Out[19]:

In a bar chart neither the order of the categories nor the width of the bars matters. Generally the important part of the bar chart is the height of the bars. In our example, Ideal is the most frequent in our dataset, while Fair is the least frequent.

If we wanted to see the frequency of one categorical variable split by another categorical variable, we have two options. The first option is a stacked bar chart, where each bar of the first variable is colored by another variable. Below we have created a stacked bar chart of `cut` split by `color` by adding the `position` parameter to `geom_bar` and setting it equal to `stack`.

In [20]:

```
cut_c_bar = ggplot(diamonds, aes(x = "cut", fill = "color"))
cut_c_bar = cut_c_bar + \
xlab("Cut") +\
ylab("Frequency") +\
ggtitle("Stacked Bar Chart of Cut\nColored by Diamond Color")
cut_c_bar_stack_wg = cut_c_bar + geom_bar(stat = "identity", position = "stack")
cut_c_bar_stack_wg
```

Out[20]:

In our example, we see that the color G represents the largest part of the Ideal column, while in the Very Good column it appears with the same frequency as H.

The other way to display the data is with a side-by-side bar chart. This is done by changing the `position` parameter to `dodge`. Below we have plotted the same data with `position` changed.

In [21]:

```
cut_c_bar = ggplot(diamonds, aes(x = "cut", fill = "color"))
cut_c_bar = cut_c_bar + \
xlab("Cut") +\
ylab("Frequency") +\
ggtitle("Side-by-Side Bar Chart of Cut\nColored by Diamond Color")
cut_c_bar_dodge_wg = cut_c_bar + geom_bar(stat = "identity", position = "dodge")
cut_c_bar_dodge_wg
```

Out[21]:

To demonstrate faceting, we will recreate `mpg_density_wg`. We will facet the dataset by `gear` so we can best compare it against our `mpg_g_density_wg` from earlier.

In [54]:

```
mpg_g_density_facet = ggplot(mtcars, aes(x = "mpg", color = "gear"))
mpg_g_density_facet = mpg_g_density_facet + \
xlab("mpg") +\
ylab("Frequency") +\
ggtitle("Frequency of MPG\nSplit by Gear")
mpg_g_density_facet_wg = mpg_g_density_facet + geom_density() + facet_grid("gear")
mpg_g_density_facet_wg
```

Out[54]:

The plot above displays the same information as `mpg_g_density_wg`, with the different distributions on different plots.
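
Conceptually, faceting partitions the rows by the facet variable and draws one panel per group; `facet_grid` spares us from writing that loop ourselves. A minimal sketch of the grouping step, using made-up `(gear, mpg)` rows rather than the real dataset:

```python
from collections import defaultdict

# Hypothetical (gear, mpg) rows for illustration only.
rows = [(3, 18.7), (4, 21.0), (4, 22.8), (3, 14.3), (5, 26.0)]

# Group the mpg values by gear; each group would become one panel.
panels = defaultdict(list)
for gear, mpg in rows:
    panels[gear].append(mpg)

for gear in sorted(panels):
    print(gear, panels[gear])
```

Each key of `panels` corresponds to one facet, and the values under it are the data that facet would display.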

We can also split plots of categorical distributions with facets. We will recreate `cut_bar` and facet our bar charts by `color`.

In [23]:

```
cut_bar = ggplot(diamonds, aes(x = "cut"))
cut_bar = cut_bar + \
xlab("Cut") +\
ylab("Frequency") +\
ggtitle("Distribution of Cut")
cut_bar_facet_wg = cut_bar + geom_bar() + facet_grid("color")
cut_bar_facet_wg
```

Out[23]:

Using `stat_smooth`, we can visualize different regression lines on scatter plots and other such plots. As an example we will plot `disp` versus `hp` from `mtcars`. In other words, we will hypothesize that `disp` is dependent on `hp`.

In [24]:

```
disp_hp = ggplot(mtcars, aes(x = "hp", y = "disp"))
disp_hp = disp_hp + \
xlab("Horsepower") +\
ylab("Displacement") +\
ggtitle("Displacement vs. Horsepower")
disp_hp_wg = disp_hp + geom_point()
disp_hp_wg
```

Out[24]:

We see a somewhat positive, linear relationship between horsepower and displacement; smoothing lines may display that relationship more clearly.

When we add `stat_smooth` without any other parameters defined, we get the same plot as above with the default smooth added.

In [25]:

```
disp_hp_smooth_wg = disp_hp + geom_point() + stat_smooth()
disp_hp_smooth_wg
```

Out[25]:

The smooth that `stat_smooth` chose appears linear with larger error bands near the ends of the data, adding evidence to our suspicions.

If we were curious about a particular model type, we could change the method of `stat_smooth` with the `method` parameter.

In [ ]:

```
disp_hp_loess_wg = disp_hp + geom_point() + stat_smooth(method = "loess")
disp_hp_loess_wg
```

Above is the same plot with a `loess` smooth.

We can remove the standard error bands from the smooth line with the parameter `se`.

In [58]:

```
disp_hp_lmse_wg = disp_hp + geom_point() + stat_smooth(method = "lm", se = False)
disp_hp_lmse_wg
```

Out[58]:
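
For intuition about what a straight-line smooth such as `method = "lm"` draws, the line is presumably an ordinary least squares fit. The sketch below computes the slope and intercept of such a fit by hand; the `hp` and `disp` values are made up so that they lie exactly on a known line, and they are not taken from `mtcars`.

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values lying exactly on disp = 2 * hp - 50.
hp = [100, 150, 200, 250]
disp = [150, 250, 350, 450]

slope, intercept = least_squares_line(hp, disp)
print(slope, intercept)
```

Because the made-up points fall exactly on a line, the fit recovers a slope of 2 and an intercept of -50; with real, noisy data the line instead minimizes the squared vertical distances to the points.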

Below we have plotted the density of `price` of the `diamonds` dataset using the standard ggplot theme.

In [59]:

```
price_density = ggplot(diamonds, aes(x = "price"))
price_density = price_density + \
xlab("Price") + \
ylab("Frequency") + \
ggtitle("Frequency of Price")
price_density_wg = price_density + geom_density()
price_density_wg
```

Out[59]:

Below we have plotted the same plot with our own theme, aptly named `our_theme`.

In [9]:

```
our_theme = theme(axis_text=element_text(size=8, color='black'), x_axis_text=element_text(angle=45))
price_density_theme = price_density + geom_density() + our_theme
price_density_theme
```

Out[9]:
