Traditionally, economists have categorized modern economic growth with the beginning of the Industrial Revolution in the early 1800’s. In this report, we will analyze what factors that led to economic growth in the pre-Industrial Revolution era, the primitive accumulation. We theorize that countries with access to Atlantic Trade had the greatest economic growth. We further theorize, that countries with relatively free institutions grew from greater access to Atlantic Trade and, therefore, greater economic growth in the primitive accumulation.

To test our theory, we will be using RAJ.csv, a data set containing economic and political information on countries during the primitive accumulation. There are 224 observations of the 14 variables: country, year, urbanization (fraction of population living in cities and towns) which will serve as our proxy for economic growth, population (thousands), ratio of country’s Atlantic coast-line (miles) to total land area (square miles), an increasing rating of constraint of the executive branch (1 - 7), a rating of prior constraint of the executive branch, index of the volume of Atlantic trade across all countries, indicators for countries in Western and Eastern Europe, the number of wars the country engaged in, indicator of a Protestant country, indicator of country belonging to the Holy Roman Empire and the estimate of per-capita GDP in current dollars. We have removed China, India, Japan and Turkey from the data set as they are missing essential data needed for the report and are non-European countries.

Our initial model, `urb.lm`

, is a basic linear model with `urbanization`

as the response and few of the other variables as predictors.

```
urb.lm <- lm(urbanization ~ factor(country) + factor(year) +
factor(westernEurope) + (atlTrade:coastToArea) + ordered(initialConstr) +
ordered(initialConstr):atlTrade:coastToArea, data = raj)
```

The model’s coefficients are summarized below. Five of the countries have positive coefficients, meaning there would be greater `urbanization`

than if not in that `country`

; Albania, Belgium, Italy, Netherlands and Spain. As time passes, `urbanization`

increases as expected. and the interaction between `atlTrade`

and `coastToArea`

is positive as expected. Interestingly, the coefficients of `WesternEurope`

and `initialConstr`

were NA, indicating a perfect colinearity with one of the other variables. Prior knowledge of the European theater tells us that any country with Atlantic Coastline would be a country in Western Europe as Eastern European countries are either not on the Atlantic Seaboard or are landlocked. We theorize that the collinearity for `initialConstr`

might relate to the interaction term present. Further testing is needed, to investigate these effects.

Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|

(Intercept) | 0.1200 | 0.017 | 6.90 | 9.7e-11 |

factor(country)Austria | -0.0670 | 0.020 | -3.30 | 1.2e-03 |

factor(country)Belgium | 0.1200 | 0.021 | 5.90 | 2.5e-08 |

factor(country)Bulgaria | -0.0300 | 0.020 | -1.50 | 1.4e-01 |

factor(country)Czech | -0.0970 | 0.020 | -4.80 | 3.9e-06 |

factor(country)Denmark | -0.1200 | 0.034 | -3.50 | 7.2e-04 |

factor(country)England | -0.1100 | 0.033 | -3.40 | 8.3e-04 |

factor(country)Finland | -0.1100 | 0.020 | -5.60 | 8.1e-08 |

factor(country)France | -0.0430 | 0.021 | -2.10 | 4.0e-02 |

factor(country)Germany | -0.0440 | 0.026 | -1.70 | 9.4e-02 |

factor(country)Greece | -0.0650 | 0.020 | -3.20 | 1.5e-03 |

factor(country)Hungary | -0.0570 | 0.020 | -2.90 | 5.0e-03 |

factor(country)Ireland | -0.1300 | 0.023 | -5.90 | 2.7e-08 |

factor(country)Italy | 0.0670 | 0.020 | 3.30 | 1.0e-03 |

factor(country)Netherlands | 0.0480 | 0.034 | 1.40 | 1.6e-01 |

factor(country)Norway | -0.1000 | 0.030 | -3.30 | 1.3e-03 |

factor(country)Poland | -0.0930 | 0.020 | -4.60 | 8.9e-06 |

factor(country)Portugal | -0.0430 | 0.034 | -1.30 | 2.0e-01 |

factor(country)Romania | -0.0890 | 0.020 | -4.40 | 1.8e-05 |

factor(country)Russia | -0.0960 | 0.020 | -4.70 | 4.7e-06 |

factor(country)Serbia | -0.0850 | 0.020 | -4.20 | 4.4e-05 |

factor(country)Spain | 0.0280 | 0.021 | 1.30 | 1.9e-01 |

factor(country)Sweden | -0.0870 | 0.020 | -4.30 | 2.7e-05 |

factor(country)Switzerland | -0.0650 | 0.020 | -3.20 | 1.6e-03 |

factor(year)1400 | 0.0017 | 0.012 | 0.15 | 8.8e-01 |

factor(year)1500 | 0.0061 | 0.012 | 0.51 | 6.1e-01 |

factor(year)1600 | 0.0130 | 0.012 | 1.10 | 2.8e-01 |

factor(year)1700 | 0.0240 | 0.012 | 1.90 | 6.0e-02 |

factor(year)1750 | 0.0220 | 0.013 | 1.70 | 8.6e-02 |

factor(year)1800 | 0.0550 | 0.013 | 4.20 | 3.8e-05 |

factor(year)1850 | 0.0560 | 0.013 | 4.20 | 3.9e-05 |

atlTrade:coastToArea | 0.7100 | 0.270 | 2.60 | 1.0e-02 |

atlTrade:coastToArea:ordered(initialConstr).L | 1.0000 | 0.730 | 1.40 | 1.6e-01 |

atlTrade:coastToArea:ordered(initialConstr).Q | 0.4700 | 0.640 | 0.73 | 4.7e-01 |

atlTrade:coastToArea:ordered(initialConstr).C | 0.4000 | 0.410 | 0.99 | 3.3e-01 |

atlTrade:coastToArea:ordered(initialConstr)^4 | 0.1300 | 0.270 | 0.49 | 6.2e-01 |

We will analyze how `urb.lm`

fits the data, looking at outliers, influential points, and residuals before checking the cross-validated mean squared error.

Below are the Rstudent and Cook’s Distance plots of our initial model. There are a few worrisome points, having Cook’s Distance greater than 0.05 and Rstudent above 2: Albania in 1800 where urbanization quadruples, Belgium post Black Plague (1400), where urbanization triples and England in 1850 where population doubles. A cursory glance of history does not reveal much about changes in Albania in 1800, however its Cook’s Distance is abnormally large and we will therefore remove it. There are ties between urbanization and the Black Plague. The Industrial Revolution begins in Britain in the 1800’s and thus should explains its outlier status. Those three points have been removed and the model rerun.

We have plotted the residuals and the Q-Q Plot of `urb.lm`

. The residuals appeared without pattern but centered around 0.01. The Q-Q Plot appears semi-linear with discrepancies throughout. Neither of these plots are indicative of a strong model, further checking is required.

```
urb.lm.resid <- residuals(urb.lm)
urb.lm.fit <- fitted(urb.lm)
urb.lm.df <- data.frame(Residuals = urb.lm.resid, Fitted = urb.lm.fit)
#QQPlot of Residuals
urb.lm.sd <- sqrt(deviance(urb.lm)/df.residual(urb.lm))
urb.lm.residSD <- urb.lm$residuals/urb.lm.sd
lm.res <- ggplot(urb.lm.df) +
geom_point(aes(Fitted, Residuals)) +
xlab('Fitted') +
ylab('Residuals') +
geom_hline(aes(yintercept = 0)) +
ggtitle("Linear Model Residuals")
#Plot the qqplot of the Sd's
lm.qq <- ggplot() + geom_point(aes(qnorm(ppoints(urb.lm.residSD)),
sort(urb.lm.residSD))) +
xlab('Theoretical Quantiles') +
ylab('Observed Quantiles') +
geom_line(aes(qnorm(ppoints(urb.lm.residSD)),
qnorm(ppoints(urb.lm.residSD)))) +
ggtitle("Linear Model Q-Q Plot")
grid.arrange(lm.res, lm.qq, nrow = 1)
```

```
#Professors Code
cv.model <- function(data, model, formulae, nfolds=5) {
data <- na.omit(data)
formulae <- sapply(formulae, as.formula)
n <- nrow(data)
fold.labels <- sample(rep(1:nfolds, length=n))
mses <- matrix(NA, nrow=nfolds, ncol=length(formulae))
colnames <- as.character(formulae)
for (fold in 1:nfolds) {
test.rows <- which(fold.labels == fold)
train <- data[-test.rows,]
test <- data[test.rows,]
for (form in 1:length(formulae)) {
current.model <- model(formula=formulae[[form]], data=train)
predictions <- predict(current.model, newdata=test)
test.responses <- eval(formulae[[form]][[2]], envir=test)
test.errors <- test.responses - predictions
mses[fold, form] <- mean(test.errors^2)
}
}
return(colMeans(mses))
}
mse.lm <- cv.model(raj, lm, "urbanization ~ factor(country) + factor(year) +
factor(westernEurope) + (atlTrade:coastToArea) +
ordered(initialConstr) +
ordered(initialConstr):atlTrade:coastToArea")
```

Instead of returnung the simple mean squared error, we felt it would be better suited to cross validate the mean squared error. Using 5-fold cross validation, our algorithm returns a mean squared error of 0.0015457.

After testing a few different models, We chose to use a generalized linear model with different variables and interactions.

To help decide which variables to use, we used a decision tree, to see which variables were important for `urbanization`

. From our variable importance graph, we see that `initalConstr`

is the most important variable followed by `execConstr`

and `year`

. Our least important variables were `country`

, `coastToArea`

and `population`

. While the decision tree was useful for observing the importance of the variables, we did some brief testing of models without the least important variables, and they performed remarkably poorer than the final model we chose.

```
#Decision Tree
set.seed(100)
urb.tree <- train(urbanization ~ ., data = raj,
method = "ctree",
trControl = trainControl(method = "cv", number = 5),
tuneGrid = expand.grid(mincriterion = 0.95))
#Plotting important variables
plot(varImp(urb.tree))
```

We tested a few different models cross-validated mean squared errors and chose the one with the smallest value. Below are the ones we tested.

```
tm1 <- cv.model(raj, glm, "urbanization ~ factor(year) +
(atlTrade:coastToArea) +
ordered(initialConstr) +
ordered(initialConstr):atlTrade:coastToArea")
tm2 <- cv.model(raj, glm, "urbanization ~ factor(country) +
(atlTrade:coastToArea) +
ordered(initialConstr) +
ordered(initialConstr):atlTrade:coastToArea")
tm3 <- cv.model(raj, glm, "urbanization ~ factor(country) +
(atlTrade:coastToArea) +
ordered(initialConstr):atlTrade:coastToArea")
tm4 <- cv.model(raj, glm, "urbanization ~ factor(country) +
(atlTrade:coastToArea) +
ordered(initialConstr):atlTrade")
```

The first test model `tm1`

, without `country`

or `westernEurope`

had a cross-validated mean squared error of 0.0029591. After `tm1`

, we realized, while the decision tree did not see the importance of `country`

, the model needed it to perform well. The removal of `westernEurope`

had no effect on the model, probably due its collinearity. The second model, `tm2`

, included `country`

and removed `year`

. Its cross-validated mean squared error was 0.0020516. While `tm2`

performed better than `tm1`

, performed worse than `urb.lm`

. Our third test model, `tm3`

, removed `initialConstr`

, its cross-validated mean square error was 0.0018385. While `tm3`

performed well, we had a hunch that removing the `coastToArea`

interaction term could further improve the model. `tm4`

was our final test model, its cross-validated mean square error was 0.0017693 and was the best of all of the four models. We felt confident in `tm4`

to continue our analysis.

As mentioned above, we used `tm4`

as our final model. `tm4`

, now referred to as `urb.glm`

had some differences from `urb.lm`

. We had removed the binary indicator for `westernEurope`

due to its collinearity issues, as well as `year`

as it did not have much of an effect on the model. We removed `coastToArea`

from the larger interaction term, but left the interaction term between `initialConstr`

and `atlTrade`

as per our initial theory. We will continue to use the data with outliers removed based on our understanding that these points would still be outliers and overly influential. The model is written below.

```
urb.glm <- glm(urbanization ~ factor(country) +
(atlTrade:coastToArea) +
atlTrade:ordered(initialConstr), data = raj)
```

`urb.glm`

’s coefficients are summarized below. Similarly to `urb.lm`

, only five countries have positive coefficients. There is one insignificant country; Bulgaria. The interaction between `atlTrade`

and `areaToCoast`

is still significant and has a positive coefficient, indicating greater `urbanization`

(economic growth) per their interaction. Our new variable, the interaction between `atlTrade`

and `initialConstr`

is significant on all levels with a negative coefficient on the low level, but positive effects on the other levels. Interestingly, it appears that there is a point of initial executive constraint at which its influence no longer changes and the coefficient stabilizes between levels.

Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|

(Intercept) | 0.0960 | 0.0150 | 6.20 | 3.8e-09 |

factor(country)Austria | -0.0480 | 0.0190 | -2.60 | 1.1e-02 |

factor(country)Belgium | 0.0980 | 0.0240 | 4.10 | 7.9e-05 |

factor(country)Bulgaria | -0.0110 | 0.0190 | -0.59 | 5.6e-01 |

factor(country)Czech | -0.0790 | 0.0230 | -3.50 | 5.6e-04 |

factor(country)Denmark | -0.1600 | 0.0240 | -6.70 | 3.6e-10 |

factor(country)England | -0.1100 | 0.0280 | -3.80 | 1.7e-04 |

factor(country)Finland | -0.0950 | 0.0190 | -4.90 | 2.1e-06 |

factor(country)France | -0.0370 | 0.0230 | -1.60 | 1.0e-01 |

factor(country)Germany | -0.0400 | 0.0190 | -2.10 | 3.4e-02 |

factor(country)Greece | -0.0470 | 0.0190 | -2.50 | 1.3e-02 |

factor(country)Hungary | -0.0400 | 0.0230 | -1.80 | 7.6e-02 |

factor(country)Ireland | -0.0530 | 0.0250 | -2.10 | 3.6e-02 |

factor(country)Italy | 0.0720 | 0.0230 | 3.10 | 2.2e-03 |

factor(country)Netherlands | 0.0880 | 0.0240 | 3.70 | 3.5e-04 |

factor(country)Norway | -0.1000 | 0.0190 | -5.40 | 2.5e-07 |

factor(country)Poland | -0.0880 | 0.0230 | -3.80 | 2.1e-04 |

factor(country)Portugal | -0.0690 | 0.0250 | -2.70 | 7.1e-03 |

factor(country)Romania | -0.0710 | 0.0190 | -3.80 | 2.3e-04 |

factor(country)Russia | -0.0770 | 0.0190 | -4.10 | 6.0e-05 |

factor(country)Serbia | -0.0660 | 0.0190 | -3.50 | 5.2e-04 |

factor(country)Spain | 0.0480 | 0.0240 | 2.00 | 4.7e-02 |

factor(country)Sweden | -0.0510 | 0.0240 | -2.20 | 3.3e-02 |

factor(country)Switzerland | -0.0480 | 0.0230 | -2.10 | 3.6e-02 |

atlTrade:coastToArea | 1.2000 | 0.0920 | 13.00 | 1.0e-25 |

atlTrade:ordered(initialConstr).L | -0.0050 | 0.0025 | -2.00 | 4.8e-02 |

atlTrade:ordered(initialConstr).Q | 0.0094 | 0.0025 | 3.70 | 2.6e-04 |

atlTrade:ordered(initialConstr).C | 0.0150 | 0.0035 | 4.20 | 3.8e-05 |

atlTrade:ordered(initialConstr)^4 | 0.0150 | 0.0029 | 5.20 | 7.2e-07 |

Like we did with `urb.lm`

, we analyzed `urb.glm`

, Looking at outliers, influential points, residuals before checking the cross-validated mean squared error. After checking `urb.glm`

, we compared the two models.

Below are the rstudent and Cooks Distance plots of `urb.glm`

. Since we have removed the most worrisome outliers with `urb.lm`

, the rstudent and Cook’s Distance plot do not show many worrisome points.

```
#Review Rstudent
par(mfrow = c(1,2))
plot(rstudent(urb.glm))
abline(h=qt(.025,df=nrow(raj)-2),col="red")
abline(h=qt(.975,df=nrow(raj)-2),col="red")
#cook's distance
plot(cooks.distance(urb.glm))
title("After Removal", outer = TRUE)
```

Below we have plotted the residuals and a Q-Q plot of `urb.glm`

. Our residuals appear patternless and centered around zero. Our Q-Qplot appears mostly linear with some variability on the right tail.

```
urb.glm.resid <- residuals(urb.glm)
urb.glm.fit <- fitted(urb.glm)
urb.glm.df <- data.frame(Residuals = urb.glm.resid, Fitted = urb.glm.fit)
#QQPlot of Residuals
urb.glm.sd <- sqrt(deviance(urb.glm)/df.residual(urb.glm))
urb.glm.residSD <- urb.glm$residuals/urb.glm.sd
glm.res <- ggplot(urb.glm.df) +
geom_point(aes(Fitted, Residuals)) +
xlab('Fitted') +
ylab('Residuals') +
geom_hline(aes(yintercept = 0)) +
ggtitle("Linear Model Residuals")
#Plot the qqplot of the Sd's
glm.qq <- ggplot() + geom_point(aes(qnorm(ppoints(urb.glm.residSD)),
sort(urb.glm.residSD))) +
xlab('Theoretical Quantiles') +
ylab('Observed Quantiles') +
geom_line(aes(qnorm(ppoints(urb.glm.residSD)),
qnorm(ppoints(urb.glm.residSD)))) +
ggtitle("Linear Model Q-Q Plot")
grid.arrange(glm.res, glm.qq, nrow = 1)
```

`mse.glm <- cv.model(raj, glm, "urbanization ~ factor(country) + (atlTrade:coastToArea) + atlTrade:ordered(initialConstr)")`

Using the function we created earlier, `urb.glm`

has a cross-validated mean squared error of 0.0017667.

After analyzing `urb.lm`

and `urb.glm`

separately, we decided to compare the two models with a few different techniques.

```
resample <- function(x){
return(sample(x, size = length(x), replace = TRUE))
}
resampler.res <- function(model, column) {
temp.frame <- raj
new.data <- fitted(model) + resample(residuals(model))
temp.frame[,column] <- new.data
return(temp.frame)
}
joint.estimator <- function(dat){
fit.lm <- lm(dat[,"urbanization"] ~ factor(dat[,"country"]) + factor(dat[,"year"]) +
factor(dat[,"westernEurope"]) + ordered(dat[,"initialConstr"]) +
dat[,"atlTrade"]:dat[,"coastToArea"] +
ordered(dat[,"initialConstr"]):dat[,"atlTrade"]:dat[,"coastToArea"])
mse.lm <- mean(fit.lm$residuals^2)
fit.glm <- glm(dat[,"urbanization"] ~ factor(dat[,"country"]) +
(dat[,"atlTrade"]:dat[,"coastToArea"]) +
dat[,"atlTrade"]:ordered(dat[,"initialConstr"]))
mse.lm <- mean(fit.glm$residuals^2)
return(mse.lm - mse.glm)
}
#bootstrap for CI's
mse.bootstrap <- function(B, model1, model2, column) {
t.hat = mean(model1$residuals^2) - mean(model2$residuals^2)
boots <- replicate(B, joint.estimator(resampler.res(model1, column)))
return((sum(boots >= t.hat) + 1)/ (B + 1))
}
pv <- mse.bootstrap(1000, urb.lm, urb.glm, "urbanization")
```

We initially tested the differences between the two models with the use of Akaike Information Criterion and Bayesian Information Criterion, in both tests `urb.glm`

returned a better result than `urb.lm`

. We compared the two Q-Q plots side by side, and `urb.glm`

’s plot appears more linear, while `urb.lm`

’s has severe discrepancies near the tails and the entirety of the right-hand side.

While our preliminary testing had given us evidence that `urb.glm`

was the better model, we decided additional testing was necessary. We ran a hypothesis test between bootstrapped mean squared errors. The probability under the basic linear model of observing a large gap in mean squared errors is 0.0009. Using an alpha level of 0.05, we find the result significant. Meaning, there is evidence to suggest that `urb.lm`

does worse than `urb.glm`

.

We have tested and compared our two models. While `urb.lm`

performed well according to our cross-validated mean squared error, it has severe issues with collinearity. `urb.glm`

, Performed better on all of our testing, residuals and mean squared error. All results give credence, to our belief that `urb.glm`

is a better model than `urb.lm`

.

Our initial theory is well explained by `urb.glm`

. Our original theory posited that economic growth pre-Industrial Revolution is best explained by Atlantic trade and a free society. `urb.glm`

Included significant terms for Atlantic Trade both in interaction with Atlantic coastal area and with initial executive constraint in a country. Similarly, the initial constraint in interaction with Atlantic trade is significant at all levels, indicating its importance to the model and evidence to our theory. Interestingly the coefficient for initial constraint begins negative, moves positive and levels out. We believe that this indicates that there is a limit to how free a country can be at which point executive constraint does not affect economic growth. Conversely, having little executive constraint negatively affects economic growth. We have found evidence to suggest that economic growth during the primitive accumulation was driven by Atlantic trade and executive constraint.