Isaac Haberman (Ihaberma) and Adin Adler (aadler)

For our final project we analyzed the video game Defense of the Ancients 2 (DOTA). For a full explanation of the game and further resources, check out http://dota2.gamepedia.com/Dota_2. DOTA is a game played by ten players, five per team, who compete in matches that generally last about forty-five minutes. Before each match, players choose a unique hero without replacement from a pool of 113 heroes, 111 of which appear in the dataset; the heroes numbered 24 and 108 are missing. Throughout the match, players earn gold and experience, which help them level up their characters. Gold and experience are reset after each match.

We analyzed the data from two CSV files we received from Feedless, a company creating an AI coaching bot for DOTA. The two CSV files are titled `MatchOverview` and `MatchDetail`.

`MatchOverview` contains 23875 rows, with each row corresponding to a unique match. The data frame contains 12 columns, keyed by match id.

The first column identifies the match, usually a large number like 2503037971.

The next ten columns contain the heroes picked in the match; columns one through five represent team zero, and columns six through ten represent team one. Each column can contain any integer from 1 to 113 inclusive, skipping 24 and 108, which correspond to heroes not yet released to the public.

The final column is a boolean variable representing the winning team; 0 if team zero won and 1 if team one won.

`MatchDetail` contains 202319 rows and 23 columns with data on each player's experience and gold at 5-minute intervals up to 45 minutes. For example, `e_5` is a player's experience at the 5-minute mark and `g_5` is their gold at the 5-minute mark.

The first 3 columns identify the player, character, and match id.

The next 10 columns detail the experience values at 5 minute intervals, from 0 minutes to 45 minutes inclusive.

The final 10 columns detail the gold values at 5 minute intervals, from 0 minutes to 45 minutes inclusive.
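The two schemas described above can be sketched with miniature stand-in frames; all values below are fabricated for illustration, only the column names and counts come from the dataset description:

```python
import pandas as pd

# Miniature stand-in for MatchOverview: one match id, ten hero picks, one outcome.
overview = pd.DataFrame([{
    "match_id": 2503037971,
    "hero_1": 5, "hero_2": 7, "hero_3": 12, "hero_4": 33, "hero_5": 54,
    "hero_6": 2, "hero_7": 9, "hero_8": 41, "hero_9": 88, "hero_10": 101,
    "first_5_won": True,
}])

# Miniature stand-in for MatchDetail: three id columns, then experience and
# gold snapshots every 5 minutes from 0 to 45 (fabricated values).
times = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
detail_row = {"match_id": 2503037971, "player_id": 0, "hero_id": 5.0}
for t in times:
    detail_row["e_%d" % t] = 100 * t  # experience at minute t
    detail_row["g_%d" % t] = 80 * t   # gold at minute t
detail = pd.DataFrame([detail_row])

# 1 id + 10 picks + 1 outcome = 12 columns; 3 ids + 10 exp + 10 gold = 23 columns.
assert overview.shape[1] == 12 and detail.shape[1] == 23
```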

Through our analysis, we hoped to determine whether a team would win a match based on its heroes' gold and experience throughout the match.

We began our analysis by loading the two datasets and reading over the data types and some of the observations.

In [16]:

```
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
%matplotlib inline
from ggplot import *
import sklearn
```

In [17]:

```
def load_data(file1, file2):
    overview = pd.read_csv(file1)
    detail = pd.read_csv(file2)
    return overview, detail

overview, detail = load_data("MatchOverview.csv", "MatchDetail.csv")
print overview.dtypes
print "---------"
print detail.dtypes
print overview.head()
print "---------"
print detail.head()
```

We combined the two data sets into a single data frame, `df`. Using pandas's `merge`, we joined the two data sets on `match_id`, removing all match ids that did not have a counterpart in the other data set. Our new data set is keyed by match id, player id, and character id. Unfortunately, each row carried an extra column, as both data frames contained the row's player id.

In [18]:

```
df = overview.merge(detail,on="match_id")
print "Rows in merged df:", len(df)
print "df.dtypes:", df.dtypes
print "----------"
print df.tail(5)
print len(df['match_id'].unique())
```
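The removal of unmatched match ids happens because `merge` defaults to an inner join. A toy illustration, with fabricated ids and values:

```python
import pandas as pd

# Two small frames that share only some match ids (all values fabricated).
left = pd.DataFrame({"match_id": [1, 2, 3], "first_5_won": [True, False, True]})
right = pd.DataFrame({"match_id": [2, 3, 4], "e_20": [5000, 6200, 4100]})

# Default how="inner": only match ids present in both frames survive.
merged = left.merge(right, on="match_id")
assert sorted(merged["match_id"].tolist()) == [2, 3]
```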

After merging the data frames, we cleaned the data by:

- Tallying how many wins each team had. When splitting our data into training and testing sets, we preserved the ratio of wins to losses for each team. This avoided the rare case where the training set consists entirely of wins by one team, which would obviously ruin predictions.
- Setting `hero_id` and `player_id` to strings, so they would be interpreted as factors and not as quantitative variables.
- Resetting experience and gold columns for times past the end of the match to the last non-zero value. This removed zeros that could invalidate our correlations and regressions.
- Changing `player_id` to `team_id`, a binary variable representing team, as we were interested in team performance and not individual performance.

In [32]:

```
def clean_data(tempdf):
    newdf = tempdf
    newdf = newdf.assign(team = newdf.apply(lambda x: 0 if x.player_id < 5 else 1, axis=1))
    hero_cols = ['hero_1', 'hero_2', 'hero_3', 'hero_4', 'hero_5',
                 'hero_6', 'hero_7', 'hero_8', 'hero_9', 'hero_10', 'hero_id']
    newdf[hero_cols] = newdf[hero_cols].astype(str)
    # Check that every game has a full roster of 10 players.
    # At least one game does not!
    C = Counter(newdf['match_id'])
    less = [x for x in C.keys() if C[x] < 10]
    # Drop the short-roster game so we only keep full game data.
    newdf = newdf[newdf['match_id'] != less[0]]
    # Change all post-match zeros into the last non-zero value.
    newdf = newdf.apply(lambda x: set_zero(x), axis=1)
    # Count the number of games each team wins.
    counts = newdf['first_5_won'].value_counts()
    newdf.columns = newdf.columns.str.replace('g_5', 'g_05')
    newdf.columns = newdf.columns.str.replace('e_5', 'e_05')
    newdf = newdf.assign(won = newdf.apply(lambda x: 1 if (x.first_5_won and x.team == 0) or (x.first_5_won == False and x.team == 1) else 0, axis=1))
    newdf.drop('first_5_won', axis=1, inplace=True)
    newdf.to_csv("output.csv")
    return newdf, counts

# Change the trailing zeros after a match ends to the last non-zero
# gold/experience value.
def set_zero(row):
    value = 5
    while value <= 40:
        e1 = 'e_' + str(value + 5)
        e2 = 'e_' + str(value)
        g1 = 'g_' + str(value + 5)
        g2 = 'g_' + str(value)
        if row[e1] == 0:
            row[e1] = row[e2]
            row[g1] = row[g2]
        value += 5
    return row

df = overview.merge(detail, on="match_id")
df, counts = clean_data(df)
print df.dtypes
```
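The forward-fill in `set_zero` can be illustrated on a single fabricated row for a match that ended around minute 32, so the 35-, 40-, and 45-minute snapshots were logged as zero:

```python
import pandas as pd

# Fabricated single-player row; the match ended around minute 32.
row = pd.Series({"e_30": 9000, "g_30": 7000,
                 "e_35": 0, "g_35": 0,
                 "e_40": 0, "g_40": 0,
                 "e_45": 0, "g_45": 0})

# Same idea as set_zero: copy the last non-zero value forward in time.
for t in range(30, 45, 5):
    if row["e_%d" % (t + 5)] == 0:
        row["e_%d" % (t + 5)] = row["e_%d" % t]
        row["g_%d" % (t + 5)] = row["g_%d" % t]

assert row["e_45"] == 9000 and row["g_45"] == 7000
```

This keeps the final pre-match-end totals in every later column instead of zeros, which would otherwise drag down correlations and regressions.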

Prior to modeling, we plotted a few of the variables from our dataset. The resulting plots and our analysis are below.

In [5]:

```
hero_bar = ggplot(df, aes(x = 'hero_id'))
hero_bar = hero_bar + \
xlab("Hero ID") + \
ylab("Frequency") + \
ggtitle("Frequency of Hero ID")
hero_bar + geom_bar() + theme(x_axis_text=element_text(angle=90, size = 5))
```

Out[5]:

The bar plot above shows the frequency with which each hero was picked. Since the distribution of picks is far from uniform, we considered `hero_id` as a predictor.

In [6]:

```
match = df[df['match_id'] == 2502981287]

def colSum(df, choice, team):
    c_sum = sum(df.loc[df['team'] == team, choice])
    return c_sum

def timeRow(df, time):
    if time == 5:
        g = "g_05"
        e = "e_05"
    else:
        g = "g_" + str(time)
        e = "e_" + str(time)
    index = [time]
    row_d = {
        "time": time,
        "gold_0": colSum(df, g, 0),
        "exp_0": colSum(df, e, 0),
        "gold_1": colSum(df, g, 1),
        "exp_1": colSum(df, e, 1)
    }
    row = pd.DataFrame(data=row_d, index=index)
    return row

g_e = timeRow(match, 0)
for i in range(5, 50, 5):
    g_e = g_e.append(timeRow(match, i), ignore_index=True)
g_e = g_e.reset_index()
g_e = g_e.drop('index', 1)
g_e_melt = pd.melt(g_e, id_vars="time")
g_e_point = ggplot(g_e_melt, aes(x="time", y="value", color="variable"))
g_e_point = g_e_point + \
    xlab("Time") + \
    ylab("Value") + \
    ggtitle("Experience and Gold over Time")
g_e_point + geom_line()
```

Out[6]:

Above, we plotted the two teams' experience and gold throughout one match, `2502981287`. The red and green lines represent experience, while the blue and purple lines represent gold. In the example above, team 1 (the second team) won the match. Looking at the differences in experience, there was a shift in who was winning between the 30th and 35th minute; the change in gold follows between the 35th and 40th minutes. Based on this plot and other matches we have plotted, I (Isaac) posit that experience and gold from earlier than 20 minutes before the end of the match will not be significant predictors and will add little to our model.

The plot below shows the same information, but with team 1's experience and gold subtracted from team 0's. The peak of the experience line is at 20 minutes, followed by an immediate downwards trend and a more substantial downwards trend between the 30th and 40th minutes. The significant negative difference between the experiences is similar information to the swap in experience curves above. The gold curve tracks the experience curve for most of the match and only begins to differ greatly from experience towards the 35th minute. We saw this same feature on the plot above with the convergence towards zero shown above as well.

The gold and experience curves often swing in favor of one team and then the other, rather than one team staying ahead throughout. The issue this raises is that gold and experience at a single point in time (say `g_25` and `e_25`) cannot reliably predict the winner, because the curves fluctuate. I (Adin) posit that we should be concerned with values throughout the match: winning the game produces large spikes of gold and experience, so no specific time in the game will be an accurate predictor on its own.

In [7]:

```
g_e_diff = g_e
g_e_diff['exp_diff'] = g_e['exp_0'] - g_e['exp_1']
g_e_diff['gold_diff'] = g_e['gold_0'] - g_e['gold_1']
g_e_diff = g_e_diff.drop('exp_0', 1)
g_e_diff = g_e_diff.drop('exp_1', 1)
g_e_diff = g_e_diff.drop('gold_0', 1)
g_e_diff = g_e_diff.drop('gold_1', 1)
g_e_diff_melt = pd.melt(g_e_diff, id_vars = "time")
g_e_diff_point = ggplot(g_e_diff_melt, aes(x = "time", y = "value", color = "variable"))
g_e_diff_point = g_e_diff_point + \
xlab("Time") + \
ylab("Value") + \
ggtitle("Experience and Gold over Time")
g_e_diff_point + geom_line()
```

Out[7]:

In [8]:

```
g_melt = pd.melt(df[["g_0", "g_05", "g_10", "g_15", "g_20", "g_25", "g_30", "g_35", "g_40", "g_45"]])
g_hist = ggplot(g_melt, aes(x = "value"))
g_hist = g_hist + \
xlab("Gold") + \
ylab("Frequency") + \
ggtitle("Distribution of Gold\nFaceted by Time")
g_hist + geom_histogram() + facet_grid("variable")
```

Out[8]:

The faceted plots above show the distribution of gold at times 0 through 45 minutes. Distributions grow wider with increasing time: certain players play well, rewarding them with more gold, while others play poorly, keeping their gold totals closer to 0.

On average, the rate of increase in gold value increases as the game progresses. As players accumulate more gold, they collect more items, which in turn allows for better killing ability and subsequently more gold. In DOTA, the bounty for a kill is based on the net worth of the player who dies, as well as the difference in gold between the two teams.

The later in the game, the higher the net worth of the dying player. As a result, deaths closer to the end of the game cause a larger shift in net worth for the two teams.

In [9]:

```
e_melt = pd.melt(df[["e_0", "e_05", "e_10", "e_15", "e_20", "e_25", "e_30", "e_35", "e_40", "e_45"]])
e_hist = ggplot(e_melt, aes(x = "value"))
e_hist = e_hist + \
xlab("Experience") + \
ylab("Frequency") + \
ggtitle("Distribution of Experience\nFaceted by Time")
e_hist + geom_histogram() + facet_grid("variable")
```

Out[9]:

The faceted plot above shows experience plotted against the different times in the dataset. Similar to the gold dataset, as time increases, the distributions of experience widen.

This plot shows us that the variance of gold and experience widens as the game progresses. Since the variance is larger, the later times may be more influential as predictors, since the curves for the two teams may differ by more.

Similar to above, a dying player rewards the other team with more experience the higher their own experience is, so there tends to be larger shifts later in the game.

After our adventure into plotting and EDA, we explored Elo ratings of the heroes (further explanation of skill ratings can be found at this link: http://www.herbrich.me/papers/trueskill.pdf). Using a k of 10, our ratings ranged from 818 to 1168 with a standard deviation of 67. We chose k = 10 after testing values of k between 5 and 25. We liked 10 because it produced a roughly symmetric distribution, with both the maximum and minimum values close to two standard deviations from the mean.
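For a single head-to-head pairing, the standard Elo update we rely on looks like this (ratings below are fabricated for illustration):

```python
# Standard Elo update for one pairing.
def expected_score(r_a, r_b):
    # Probability that A beats B under the Elo model.
    return 1.0 / (1 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, a_won, k=10):
    # Move A's rating toward its actual result by k times the surprise.
    e_a = expected_score(r_a, r_b)
    return r_a + k * ((1 if a_won else 0) - e_a)

r_a, r_b = 1000, 1100          # fabricated ratings; A is the underdog
new_a = elo_update(r_a, r_b, a_won=True)
assert new_a > r_a             # an upset win gains more than k/2 points
```

Our `elo` function below applies the same expected-score formula to the summed ratings of the five heroes on each team, then nudges every hero on a team by that team's update.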

In [37]:

```
def roundTraditional(val, digits=0):
    return int(round(val + 10**(-len(str(val)) - 1)))

heroes = {str(float(i)): [1000] for i in range(1, 114)}

def elo(group, heroes, k=10):
    groups = group.groupby(['team'])
    for name, group in groups:
        if group['team'].iloc[0] == 0:
            team0Score = 0
            team0Heroes = group['hero_id']
            for hero in team0Heroes:
                team0Score += heroes[hero][-1]
            if group['won'].iloc[0] == 1:
                team0Won = 1
                team1Won = 0
            else:
                team0Won = 0
                team1Won = 1
        else:
            team1Score = 0
            team1Heroes = group['hero_id']
            for hero in team1Heroes:
                team1Score += heroes[hero][-1]
    # Once both team scores are known, update every hero's rating.
    for hero in team0Heroes:
        heroes[hero].append(roundTraditional(heroes[hero][-1] + k * (team0Won - 1.0/(1 + 10**((team1Score - team0Score)/400.0)))))
    for hero in team1Heroes:
        heroes[hero].append(roundTraditional(heroes[hero][-1] + k * (team1Won - 1.0/(1 + 10**((team0Score - team1Score)/400.0)))))
    return heroes

grouped = df.groupby(['match_id'])
for name, group in grouped:
    heroes = elo(group, heroes, 10)
```

In [38]:

```
number = {str(float(i)): 0 for i in range(1, 114)}

def MakeElo(x, number, heroes):
    elo = heroes[x.hero_id][number[x.hero_id]]
    number[x.hero_id] = number[x.hero_id] + 1
    return elo

df = df.assign(Elo = df.apply(lambda x: MakeElo(x, number, heroes), axis=1))
```

In [91]:

```
df54 = df[df['hero_id'] == "54.0"]
dfElo = pd.melt(df54, value_vars = ['Elo'], id_vars = ['match_id'])
#dfElo["hero_id"] = df["hero_id"].astype(float)
times = list(range(1, len(dfElo) + 1))
dfElo['time'] = times
elo_point = ggplot(dfElo, aes(x = "time", y = "value"))
elo_point = elo_point + \
xlab("Time") + \
ylab("Value") + \
ggtitle("Elo Rating vs. Time")
elo_point + geom_line()
```

Out[91]:

After adding the Elo ratings to our data set, we began our modeling. We split the data into train and test data and tested two theories across multiple different algorithms and two theories mathematically. Our four theories were:

Experience Theory: A team's summed experience at time 20 as the predictor for victory. Our EDA showed that the greatest disparities between teams began around 20 minutes. Therefore, we theorize that victory can be predicted from the team with more experience at time 20.

Elo Theory: A team's summed Elo ratings as the predictor for victory. Our foray into Elo ratings showed that there was fluctuation between ratings between matches, but, generally, the preferred heroes were the ones with higher Elo ratings.

Full Theory: Use all the variables in our data set and a variety of machine learning algorithms to find the best model.

Twenty Theory: A combination of theories 1 and 3. All the variables except for experience and gold before time 20 will be used in the same machine learning algorithms with the same test data as the Full Theory.

I (Isaac) expected that, since the Twenty Theory is so similar to the Full Theory and experience from time 20 onward matters most, the Full and Twenty Theories would differ little in accuracy.

In [93]:

```
df = df.iloc[np.random.permutation(len(df))]
key = np.random.rand(len(df)) < 0.8
f_train_X = df[key].drop(["match_id", 'won'],1)
f_train_y = df[key]['won']
f_test_X = df[~key].drop(["match_id", 'won'],1)
f_test_y = df[~key]['won']
g_train_X = df[key].drop(["match_id", "g_0", "g_05", "g_10", "g_15", "e_0", "e_05", "e_10", "e_15", "won"],1)
g_train_y = df[key]['won']
g_test_X = df[~key].drop(["match_id", "g_0", "g_05", "g_10", "g_15", "e_0", "e_05", "e_10", "e_15", "won"],1)
g_test_y = df[~key]['won']
```
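The random mask above only approximately preserves the win ratio we tallied during cleaning. As a sketch of an alternative (assuming a reasonably recent scikit-learn), `train_test_split` with the `stratify` argument enforces the ratio exactly; the features and labels below are fabricated:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Fabricated feature matrix and win labels (60% wins, 40% losses).
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 60 + [0] * 40)

# stratify=y keeps the win/loss ratio identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

assert abs(y_train.mean() - 0.6) < 1e-9
assert abs(y_test.mean() - 0.6) < 1e-9
```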

In [104]:

```
def guessExp(group):
    sum0 = np.sum(group[group['team'] == 0]['e_20'])
    sum1 = np.sum(group[group['team'] == 1]['e_20'])
    if sum0 > sum1:
        group = group.assign(Exp_predicted = group.apply(lambda x: 1 if x.team == 0 else 0, axis=1))
    else:
        group = group.assign(Exp_predicted = group.apply(lambda x: 1 if x.team == 1 else 0, axis=1))
    return group

groups_exp = df.groupby(['match_id'])
i = 0
for name, group in groups_exp:
    if i == 0:
        data = guessExp(group)
    else:
        data = data.append(guessExp(group))
    i += 1
```

In [105]:

```
print "Experience Accuracy: ", np.sum(data['won'] == data['Exp_predicted'])/float(len(data))
```

In [ ]:

```
def guessElo(group):
    sum0 = np.sum(group[group['team'] == 0]['Elo'])
    sum1 = np.sum(group[group['team'] == 1]['Elo'])
    if sum0 > sum1:
        group = group.assign(ELO_predicted = group.apply(lambda x: 1 if x.team == 0 else 0, axis=1))
    else:
        group = group.assign(ELO_predicted = group.apply(lambda x: 1 if x.team == 1 else 0, axis=1))
    return group

groups = df.groupby(['match_id'])
i = 0
for name, group in groups:
    if i == 0:
        data = guessElo(group)
    else:
        data = data.append(guessElo(group))
    i += 1
```

In [95]:

```
print "Elo Accuracy: ", np.sum(data['won'] == data['ELO_predicted'])/float(len(data))
```

Our Elo Theory worked on the same principles as our Experience Theory, but only produced an accuracy of 66.13%. We believe the lower accuracy of this theory is due to the instability of Elo ratings across players and time and that the Elo ratings may not have been as stable as we theorized above. Since we cannot control for different players using different heroes, the Elo ratings had more fluctuations and instability across our data set than we would have liked.

While testing our first two theories required only summation, our other two theories required machine learning algorithms. The first algorithm we tested was a random forest. We chose a random forest because we theorized it would partition the data into win conditions, highlighting the best predictors of victory. The random forest produced the best accuracy of any algorithm we tested: 86.42% for the Full Theory and 86.45% for the Twenty Theory.

In [96]:

```
from sklearn.ensemble import RandomForestClassifier
f_rf_model = RandomForestClassifier(n_estimators=50, max_depth=None, min_samples_split=2, random_state=0)
f_rf_model.fit(f_train_X,f_train_y)
g_rf_model = RandomForestClassifier(n_estimators=50, max_depth=None, min_samples_split=2, random_state=0)
g_rf_model.fit(g_train_X,g_train_y)
f_rf_score = f_rf_model.score(f_test_X,f_test_y)
g_rf_score = g_rf_model.score(g_test_X,g_test_y)
print "Full score: ", f_rf_score
print "Twenty score: ", g_rf_score
```

In [97]:

```
from sklearn.linear_model import LogisticRegression
f_lr_model = LogisticRegression()
f_lr_model.fit(f_train_X,f_train_y)
g_lr_model = LogisticRegression()
g_lr_model.fit(g_train_X,g_train_y)
f_lr_score = f_lr_model.score(f_test_X,f_test_y)
g_lr_score = g_lr_model.score(g_test_X,g_test_y)
print "Full score: ", f_lr_score
print "Twenty score: ", g_lr_score
```

After testing a plethora of models, we concluded that the Experience Theory is the best way to predict victory in a DOTA match. While the full random forest produced a higher accuracy, the difference was minute. Additionally, the Full Theory requires variables that exist only after a match ends, while the Experience Theory can be computed midgame without a hit to accuracy.