Introduction

I attended

PI Study #6 「Rでデータころがし職人（ハンズオン）」 - PIStudy | Doorkeeper

and, while I was at it, practiced ggplot during the session.

These are notes for my own future reference.

Data source

Loading the data

df <- read.csv("~/Projects/DataAnalytics/RHandsOn/dataCompact/children/lightsource.csv")
head(df)

##   id      date sex  age height weight
## 1  1 2005/11/1   1 4.47   1022   20.6
## 2  2 2005/11/1   1 2.11    835   11.8
## 3  3 2005/11/1   2 3.56    958   13.6
## 4  4 2005/11/1   1 2.91    909   12.4
## 5  5 2005/11/1   1 4.25   1044   16.6
## 6  6 2005/11/1   1 3.93   1023   16.2

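I want a bar chart of the counts by sex

Apparently factor is enough, with no need for melt
melt seems to be for when you want to lay plots out side by side?
factor makes a variable be treated as categorical rather than continuous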

library(ggplot2)
p <- ggplot(df, aes(factor(sex)))
p + geom_bar()

f:id:yusuke0h:20141129204343p:plain

Without factor, sex is treated as a continuous variable:

p <- ggplot(df, aes(sex))
p + geom_bar()

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

f:id:yusuke0h:20141129204402p:plain
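The difference between the two plots above comes down to the type of the variable. Here is a minimal sketch of that, using a made-up data frame `toy` rather than the workshop CSV (the names `toy`, `sex_f` and `p_bar` are only for illustration):

```r
library(ggplot2)

# Hypothetical stand-in for the workshop data: sex coded as 1 or 2.
toy <- data.frame(sex = c(1, 1, 2, 1, 2, 2, 1))

# As a plain number, sex is a continuous variable.
is.numeric(toy$sex)

# As a factor, it has exactly two discrete levels.
toy$sex_f <- factor(toy$sex)
levels(toy$sex_f)
table(toy$sex_f)

# geom_bar() counts the rows per level, so there is one bar per category.
p_bar <- ggplot(toy, aes(sex_f)) + geom_bar()
p_bar
```

Converting once with factor() and storing the result in a column also avoids repeating factor(sex) inside every aes() call.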

Histogram of age

p <- ggplot(df, aes(age))
p + geom_histogram()

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

f:id:yusuke0h:20141129204417p:plain
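The stat_bin message above goes away once binwidth is given explicitly. A small sketch with made-up ages (the data frame `toy` and the value 0.25 are assumptions for illustration, not values from the workshop):

```r
library(ggplot2)

# Hypothetical ages in years, standing in for df$age.
set.seed(1)
toy <- data.frame(age = runif(200, min = 2, max = 5))

# binwidth is given in the units of the x variable:
# 0.25 here means quarter-year bins.
p_hist <- ggplot(toy, aes(age)) + geom_histogram(binwidth = 0.25)
p_hist
```

Which binwidth reads best depends on the data, so it is worth trying a few values.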

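Age histograms by sex

Is this where melt finally becomes necessary?
→ No: I am not trying to plot age and sex next to each other as separate variables

library(reshape2)
mdf <- melt(df, id.vars="id", measure.vars=c("sex", "age"))

The following is enough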

p <- ggplot(df, aes(age, colour=factor(sex), fill=factor(sex))) + geom_histogram()
p + facet_grid(sex ~ .)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

f:id:yusuke0h:20141129204449p:plain
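For contrast, here is a rough sketch of a case where melt does help: putting different measured variables (say height and weight) into separate panels of one figure. The data frame `toy` is a made-up miniature, and the panel layout is just one possible choice:

```r
library(reshape2)
library(ggplot2)

# Hypothetical miniature of the data set.
toy <- data.frame(
  id     = 1:4,
  sex    = c(1, 1, 2, 2),
  height = c(1022, 835, 958, 909),
  weight = c(20.6, 11.8, 13.6, 12.4)
)

# Long format: one row per (id, variable) pair.
long <- melt(toy, id.vars = c("id", "sex"),
             measure.vars = c("height", "weight"))
long

# Now "variable" can drive the facets: one panel per measured variable,
# each with its own x scale.
p_long <- ggplot(long, aes(value)) +
  geom_histogram(bins = 5) +
  facet_wrap(~ variable, scales = "free_x", ncol = 1)
p_long
```

Splitting one variable by sex only needs facet_grid(sex ~ .) on the original data frame, which is why melt was unnecessary above.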

In the same way, look at the histograms of height and weight

p <- ggplot(df, aes(height, colour=factor(sex), fill=factor(sex))) + geom_histogram()
p + facet_grid(sex ~ .)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

f:id:yusuke0h:20141129204509p:plain

p <- ggplot(df, aes(weight, colour=factor(sex), fill=factor(sex))) + geom_histogram()
p + facet_grid(sex ~ .)

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

f:id:yusuke0h:20141129204541p:plain

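Checking whether the heights of three-year-olds are normally distributed

Filter to three-year-olds
↓
Look at the summary statistics
↓
Plot a histogram
↓
Run the test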

df3 <- subset(df, trunc(age) == 3)
summary(df3$height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     848     932     962     962     992    1070

p <- ggplot(df3, aes(height, colour=factor(sex), fill=factor(sex))) + geom_histogram()
p

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## Warning: position_stack requires constant width: output may be incorrect

f:id:yusuke0h:20141129204602p:plain

shapiro.test(df3$height)

## 
##  Shapiro-Wilk normality test
## 
## data:  df3$height
## W = 0.9944, p-value = 0.3415

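With p-value = 0.3415, p > 0.05, so the null hypothesis of normality is not rejected: we cannot say the heights are not normally distributed.
It sounds roundabout, but apparently that is how statistics phrases it (a large p-value does not prove normality).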
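To get a feel for how to read the p-value, here is a small sketch that runs the same test on two made-up samples, one drawn from a normal distribution and one clearly skewed. Both samples and their parameters are assumptions for illustration, not the workshop data:

```r
set.seed(42)

# Hypothetical samples, not the workshop data.
normal_like <- rnorm(300, mean = 962, sd = 45)  # drawn from a normal distribution
skewed      <- rexp(300, rate = 1)              # strongly right-skewed

res_normal <- shapiro.test(normal_like)
res_skewed <- shapiro.test(skewed)

# For a truly normal sample the p-value is above 0.05 about 95% of the time.
res_normal$p.value

# For the skewed sample the p-value is tiny, so normality is rejected.
res_skewed$p.value
```

Looking at a Q-Q plot with qqnorm() and qqline() alongside the test is a common complementary check.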

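Overlaying the histogram and the density curve

The trick is to call aes again inside the histogram layer and define the y axis as density (..density..) instead of count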

p <- ggplot(df3, aes(height, colour=factor(sex), fill=factor(sex))) + geom_histogram(aes(y= ..density..), fill=NA) + geom_density(fill=NA)
p

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

## Warning: position_stack requires constant width: output may be incorrect

f:id:yusuke0h:20141129204621p:plain
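As a variation, a fitted normal curve can be drawn on top of the same plot for a visual comparison. This is only a sketch on made-up heights (`toy` is hypothetical). It uses after_stat(density), which newer ggplot2 versions (3.3.0 and later) offer in place of ..density..; on the older version used above, keep ..density..:

```r
library(ggplot2)

# Hypothetical heights standing in for df3$height.
set.seed(3)
toy <- data.frame(height = rnorm(300, mean = 962, sd = 45))

p_overlay <- ggplot(toy, aes(height)) +
  # Histogram scaled to density so that it shares a y axis with the curves.
  geom_histogram(aes(y = after_stat(density)), binwidth = 10,
                 fill = NA, colour = "grey40") +
  # Kernel density estimate of the sample.
  geom_density() +
  # Normal curve using the sample mean and standard deviation.
  stat_function(fun = dnorm,
                args = list(mean = mean(toy$height), sd = sd(toy$height)),
                linetype = "dashed")
p_overlay
```

If the dashed curve and the density estimate roughly agree, that matches what the Shapiro-Wilk result suggests.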

Regression analysis on the full data set

df <- read.csv("~/Projects/DataAnalytics/RHandsOn/dataCompact/children/lightsourceclean.csv")

# Simple linear regression (fitted separately for each sex because of the colour grouping)
p <- ggplot(df, aes(x=weight, y=height, colour=factor(sex))) + geom_point() + geom_smooth(method = "lm")
plot(p)

f:id:yusuke0h:20141129204633p:plain

# Default (loess) smoothing instead of simple linear regression
p <- ggplot(df, aes(x=weight, y=height, colour=factor(sex))) + geom_point() + geom_smooth()
plot(p)

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

f:id:yusuke0h:20141129204654p:plain
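The straight lines drawn by geom_smooth(method = "lm") correspond to ordinary lm() fits, one per colour group. A minimal sketch of getting the actual coefficients, on made-up data where the true relation (intercept 600, slope 25) is an assumption chosen only for illustration:

```r
set.seed(7)
n <- 100

# Hypothetical data: sex coded 1/2, weight drawn uniformly.
toy <- data.frame(
  sex    = rep(c(1, 2), each = n),
  weight = runif(2 * n, min = 10, max = 22)
)
# Assumed linear relation plus noise, not estimated from the workshop data.
toy$height <- 600 + 25 * toy$weight + rnorm(2 * n, sd = 20)

# One simple regression per sex, mirroring the per-colour fits in the plot.
fits <- lapply(split(toy, toy$sex), function(d) lm(height ~ weight, data = d))

# Rows: intercept and slope. Columns: sex 1 and sex 2.
coefs <- sapply(fits, coef)
coefs
```

summary(fits[["1"]]) would additionally show standard errors and R-squared for one group.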

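Regression on a two-dimensional graph is easy to follow because it can be plotted, but there are usually several explanatory variables, which makes it hard to grasp intuitively. For example, to predict beer sales (the response variable), the explanatory variables could include temperature, TV commercials, fireworks festivals, the weather and so on, and at that point plotting the regression becomes difficult. Still, all it does is build an equation, so I suppose the only way is to get the picture in two dimensions and fill in the rest in my head.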
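Even when it cannot be plotted, the multi-variable case is the same lm() call with more terms in the formula. Here is a sketch of the beer example with entirely made-up data. The column names and the "true" coefficients (4, 30, 60) are assumptions invented for illustration:

```r
set.seed(11)
n <- 200

# Hypothetical daily records.
beer <- data.frame(
  temperature = runif(n, 15, 35),     # degrees Celsius
  tv_ad       = rbinom(n, 1, 0.3),    # 1 if a commercial aired that day
  fireworks   = rbinom(n, 1, 0.1)     # 1 if there was a fireworks festival
)
# Assumed relation plus noise, used only to generate the toy data.
beer$sales <- 50 + 4 * beer$temperature + 30 * beer$tv_ad +
  60 * beer$fireworks + rnorm(n, sd = 10)

# Multiple regression: one coefficient per explanatory variable.
fit <- lm(sales ~ temperature + tv_ad + fireworks, data = beer)
coef(fit)

# Prediction for a hot day with a commercial and no festival.
pred <- predict(fit, newdata = data.frame(temperature = 30, tv_ad = 1,
                                          fireworks = 0))
pred
```

Each coefficient reads as the change in sales when that variable goes up by one unit with the others held fixed, which is the "equation" the two-dimensional picture generalizes to.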

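Thoughts

Statistics

I want to study statistics and multivariate analysis properly. I still do not really understand things like scales of measurement in statistics, or how multivariate analysis differs from statistics.

R Markdown is handy

Source code and figures go straight into HTML, which is extremely convenient. Until now I wrote markdown, saved each figure separately, and dug back through the source code, so being able to do all that seamlessly is a big relief.

I want to know what comes next

Studying visualization and analysis techniques is fine, but

how to actually make this useful
how to turn it into action
how to form hypotheses
how to create value

are things I still do not understand well. I want to learn more about what comes after the analysis, and what comes before it.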

みずぎわブログ

Notes on technical topics and day-to-day thoughts

Visualizing body measurement data with ggplot2