Data Computing
November 16, 2016
We have spent most of our time on two subjects:
Visualization works well with 1-3 variables, and in some situations can work with more variables.
Glyph: geom_path() or “Sankey”, Annotations: rivers and towns
Aesthetics: (x, y): longitude and latitude; size: size of army; color: advance or retreat.
Not bad! …although I don’t blame you if you prefer Charles Minard to ggplot2
p.s. I can’t take credit for most of the code; I did the same thing I often recommend you do:
If we need to relate more variables, a visualization may not suffice.
Formal mathematical representations provide an alternative:
lm()
Purpose: Figure out how to raise sales of a brand of carseats.
head(ISLR::Carseats %>% rename(CompP=CompPrice, Ads=Advertising, Pop=Population,
                               Shelf=ShelveLoc, Edu=Education))

| Sales | CompP | Income | Ads | Pop | Price | Shelf | Age | Edu | Urban | US |
|---|---|---|---|---|---|---|---|---|---|---|
| 9.50 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes |
| 11.22 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes |
| 10.06 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes |
| 7.40 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes |
| 4.15 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No |
| 10.81 | 124 | 113 | 13 | 501 | 72 | Bad | 78 | 16 | No | Yes |
Carseats <-
  ISLR::Carseats %>%
  mutate(rel_price = Price / CompPrice)
mod1 <-
  Carseats %>%
  lm(Sales ~ rel_price + Population + Education + Advertising,
     data = .)
coef(mod1)
##  (Intercept)     rel_price    Population     Education   Advertising
## 17.497352098 -10.907759402  -0.000139262  -0.055051345   0.136911760
summary(mod1)
##
## Call:
## lm(formula = Sales ~ rel_price + Population + Education + Advertising,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.005 -1.470 -0.118 1.310 5.280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.750e+01 8.728e-01 20.048 < 2e-16 ***
## rel_price -1.091e+01 6.673e-01 -16.345 < 2e-16 ***
## Population -1.393e-04 7.470e-04 -0.186 0.852
## Education -5.505e-02 4.051e-02 -1.359 0.175
## Advertising 1.369e-01 1.651e-02 8.294 1.74e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.108 on 395 degrees of freedom
## Multiple R-squared: 0.4483, Adjusted R-squared: 0.4427
## F-statistic: 80.25 on 4 and 395 DF, p-value: < 2.2e-16
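To make the coefficients concrete, here is a minimal sketch that plugs the first row of the table above into the fitted equation by hand. The coefficient values are copied from the coef(mod1) output; the names b, x, and pred are illustrative and not part of the original slides.

```r
# Coefficients copied from the coef(mod1) output above
b <- c(intercept   = 17.497352098,
       rel_price   = -10.907759402,
       Population  = -0.000139262,
       Education   = -0.055051345,
       Advertising =  0.136911760)

# First row of the Carseats table: Price = 120, CompPrice = 138,
# Population = 276, Education = 17, Advertising = 11
x <- c(1, 120 / 138, 276, 17, 11)

# Predicted Sales is the sum of coefficient * value (intercept times 1)
pred <- sum(b * x)
pred
```

The result is about 8.5, compared with the observed Sales of 9.50 for that row. If mod1 is available, predict(mod1, newdata = ...) should give the same number without the manual arithmetic.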
library(rpart)
library(rpart.plot)  # provides prp()
mod2 <-
  Carseats %>%
  rpart(Sales ~ rel_price + Price + CompPrice + Advertising + Population + Education + Income, data = .)
prp(mod2)

mod3 <- Carseats %>% rpart(Sales ~ ., data = .)
prp(mod3)

mod4 <- Carseats %>%
  rpart(Sales ~ rel_price + Advertising + ShelveLoc + Population, data = .)
prp(mod4)

library(ggdendro)  # provides ggdendrogram()
Dists <- dist(mtcars)
Dendrogram <- hclust(Dists)
ggdendrogram(Dendrogram)

Activity: Problems 17.1 & 17.2 (DC p. 154)
The assignment is worth a total of 10 points.
[1 point] Turn in HTML with embedded .Rmd file (e.g. “DataComputing simple” template)
mod1 using the CompleteCases data
fullModel using all variables in the CompleteCases data and comment on the comparison of fullModel vs mod1 in terms of log-likelihood
smallModel by excluding variables from the fullModel without sacrificing performance (as measured by log-likelihood)
smallModel
Use the following command to load the data:
Houses <- read.file("http://tiny.cc/dcf/houses-for-sale.csv")
price based on living_area, bathrooms, & fireplaces (hint: party::ctree())
houseModel (the one in the book is clearly wrong)
[3 points] Answer questions 2, 3, & 4
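For the log-likelihood comparison asked for in the first problem, one possible sketch of the mechanics uses base R's logLik(). The built-in mtcars data and the model names small and full are stand-ins chosen for illustration; they are not the CompleteCases data or the models from the book.

```r
# Illustration only: mtcars stands in for the assignment's data
small <- lm(mpg ~ wt, data = mtcars)
full  <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# logLik() returns the log-likelihood of a fitted model
ll_small <- as.numeric(logLik(small))
ll_full  <- as.numeric(logLik(full))

# For nested linear models, adding variables never lowers the log-likelihood,
# so the interesting question is how large the gain is
ll_full - ll_small
```

A small gain suggests the extra variables contribute little, which is the same reasoning behind dropping variables from a full model without sacrificing performance.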