Data Computing
November 16, 2016
We have spent most of our time on two subjects:
Visualization works well with 1-3 variables, and in some situations can work with more variables.
Glyph: geom_path() or “Sankey”
Annotations: rivers and towns
Aesthetics: (x, y) = latitude and longitude; size = size of army; color = advance or retreat.
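To make the glyph/aesthetic mapping concrete, here is a minimal ggplot2 sketch. The `troops` data frame and all of its values are made up for illustration (they are not Minard's figures), and the column names are assumptions rather than anything from the original slide code.

```r
library(ggplot2)

# Hypothetical toy data (NOT Minard's real numbers): one row per waypoint.
troops <- data.frame(
  long      = c(24.0, 28.0, 32.0, 37.6, 32.0, 28.0, 24.0),
  lat       = c(54.9, 54.5, 54.8, 55.8, 54.6, 54.2, 54.4),
  survivors = c(340000, 280000, 145000, 100000, 55000, 20000, 10000),
  direction = c("advance", "advance", "advance", "advance",
                "retreat", "retreat", "retreat"),
  stringsAsFactors = FALSE
)

# Glyph: a path. Aesthetics: position = (long, lat),
# size = army size, color = advance or retreat.
march <- ggplot(troops, aes(x = long, y = lat)) +
  geom_path(aes(size = survivors, color = direction), lineend = "round") +
  labs(x = "Longitude", y = "Latitude")
```

Printing `march` draws the plot. Recent ggplot2 releases prefer `linewidth` over `size` for line thickness and may warn about `size` here.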
Not bad! …although I don’t blame you if you prefer Charles Minard to ggplot2
p.s. I can’t take credit for most of the code; I did the same thing I often recommend you do:
If we need to relate more variables, a visualization may not suffice.
Formal mathematical representations provide an alternative:
lm()
Purpose: Figure out how to raise sales of a brand of carseats.
library(dplyr)  # provides %>% and rename()
head(ISLR::Carseats %>% rename(CompP = CompPrice, Ads = Advertising, Pop = Population,
                               Shelf = ShelveLoc, Edu = Education))
Sales | CompP | Income | Ads | Pop | Price | Shelf | Age | Edu | Urban | US |
---|---|---|---|---|---|---|---|---|---|---|
9.50 | 138 | 73 | 11 | 276 | 120 | Bad | 42 | 17 | Yes | Yes |
11.22 | 111 | 48 | 16 | 260 | 83 | Good | 65 | 10 | Yes | Yes |
10.06 | 113 | 35 | 10 | 269 | 80 | Medium | 59 | 12 | Yes | Yes |
7.40 | 117 | 100 | 4 | 466 | 97 | Medium | 55 | 14 | Yes | Yes |
4.15 | 141 | 64 | 3 | 340 | 128 | Bad | 38 | 13 | Yes | No |
10.81 | 124 | 113 | 13 | 501 | 72 | Bad | 78 | 16 | No | Yes |
Carseats <-
ISLR::Carseats %>%
mutate(rel_price = Price / CompPrice)
mod1 <-
Carseats %>%
lm(Sales ~ rel_price + Population + Education + Advertising,
data = .)
coef(mod1)
##   (Intercept)     rel_price    Population     Education   Advertising
##  17.497352098 -10.907759402  -0.000139262  -0.055051345   0.136911760
summary(mod1)
##
## Call:
## lm(formula = Sales ~ rel_price + Population + Education + Advertising,
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.005 -1.470 -0.118 1.310 5.280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.750e+01 8.728e-01 20.048 < 2e-16 ***
## rel_price -1.091e+01 6.673e-01 -16.345 < 2e-16 ***
## Population -1.393e-04 7.470e-04 -0.186 0.852
## Education -5.505e-02 4.051e-02 -1.359 0.175
## Advertising 1.369e-01 1.651e-02 8.294 1.74e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.108 on 395 degrees of freedom
## Multiple R-squared: 0.4483, Adjusted R-squared: 0.4427
## F-statistic: 80.25 on 4 and 395 DF, p-value: < 2.2e-16
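One way to read the fitted model is to plug a single row back into it by hand. The sketch below uses the coefficients printed by `coef(mod1)` and the first row of the Carseats table (Price 120, CompPrice 138, Population 276, Education 17, Advertising 11). The variable names `b`, `x`, and `predicted_sales` are only illustrative.

```r
# Coefficients copied from the coef(mod1) output.
b <- c(intercept   = 17.497352098,
       rel_price   = -10.907759402,
       Population  = -0.000139262,
       Education   = -0.055051345,
       Advertising = 0.136911760)

# First row of the Carseats table; the leading 1 multiplies the intercept.
# rel_price = Price / CompPrice = 120 / 138.
x <- c(1, 120 / 138, 276, 17, 11)

# A linear model's prediction is the sum of coefficient * value.
predicted_sales <- sum(b * x)
predicted_sales
```

The result can be compared with that row's observed Sales of 9.50. With `mod1` in the workspace, `predict(mod1)[1]` should give the same number up to rounding of the printed coefficients.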
library(rpart)
library(rpart.plot)  # provides prp()
mod2 <-
Carseats %>%
rpart(Sales ~ rel_price + Price + CompPrice + Advertising + Population + Education + Income, data=.)
prp(mod2)
mod3 <- Carseats %>% rpart(Sales ~ ., data=.)
prp(mod3)
mod4 <- Carseats %>%
rpart(Sales ~ rel_price + Advertising + ShelveLoc + Population, data=.)
prp(mod4)
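A tree can also be inspected without a plot. The sketch below is only a stand-in: it uses `mtcars` (which ships with R) rather than Carseats so that it runs on its own, with `mpg` playing the role of `Sales`. The choice of predictors is arbitrary.

```r
library(rpart)

# Stand-in data: mtcars instead of Carseats; mpg is the response.
tree <- rpart(mpg ~ wt + hp + disp + cyl, data = mtcars)

# Text listing of the splits and the mean response in each leaf.
print(tree)

# A regression tree predicts the mean response of the leaf a case falls into.
leaf_means <- predict(tree)

# How many cars land in each leaf.
table(round(leaf_means, 1))
```

Because every prediction is a leaf average, the tree can only output as many distinct values as it has leaves. That is a useful contrast with `lm()`.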
library(ggdendro)  # provides ggdendrogram()
Dists <- dist(mtcars)
Dendrogram <- hclust(Dists)
ggdendrogram(Dendrogram)
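Two follow-ups are often useful with a dendrogram: putting the variables on a common scale first, and cutting the tree into a fixed number of groups. The sketch below shows one common way to do both in base R. The choice of three clusters is arbitrary, and standardizing is an option to consider, not necessarily what the original example intended.

```r
# dist() on raw mtcars is dominated by large-valued columns such as disp and hp;
# scale() standardizes each column to mean 0 and standard deviation 1.
scaled <- scale(mtcars)

# Hierarchical clustering (hclust() defaults to complete linkage).
hc <- hclust(dist(scaled))

# Cut the dendrogram into k groups; k = 3 is an arbitrary choice here.
clusters <- cutree(hc, k = 3)

# Number of cars assigned to each cluster.
table(clusters)
```

`clusters` is a named integer vector, so `clusters["Fiat 128"]` looks up the group of a single car.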
Activity: Problems 17.1 & 17.2 (DC p. 154)
The assignment is worth a total of 10 points.
[1 point] Turn in HTML with embedded .Rmd file (e.g. “DataComputing simple” template)
mod1 using the CompleteCases data
fullModel using all variables in the CompleteCases data, and comment on the comparison of fullModel vs mod1 in terms of log-likelihood
smallModel by excluding variables from the fullModel without sacrificing performance (as measured by log_likelihood)
smallModel
Use the following command to load the data
Houses <- read.file("http://tiny.cc/dcf/houses-for-sale.csv")
price based on living_area, bathrooms, & fireplaces (hint: party::ctree())
houseModel (the one in the book is clearly wrong)
[3 points] Answer questions 2, 3, & 4
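For the log-likelihood comparison in Problem 17.1, the general idea can be sketched with base R. This is only an illustration under assumptions: it uses linear models on `mtcars` and `stats::logLik()`, whereas the assignment's models and its log-likelihood function may be of a different kind.

```r
# Stand-in models on mtcars: a small model and one using every other column.
small <- lm(mpg ~ wt, data = mtcars)
full  <- lm(mpg ~ ., data = mtcars)

# logLik() reports the log-likelihood and the number of estimated parameters (df).
ll_small <- logLik(small)
ll_full  <- logLik(full)

ll_small
ll_full

# For nested linear models fit to the same data, adding variables
# cannot lower the log-likelihood.
gain <- as.numeric(ll_full) - as.numeric(ll_small)
gain
```

A small gain bought with many extra parameters suggests some of the added variables can be dropped without losing much performance.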