匯入套件
匯入資料 資料為主計處薪資及生產力統計資料,這裡只以服務業來進行分析
service <- read_csv("C:/Users/Daniel Chiang/Desktop/HW5/service.csv")
## Parsed with column specification:
## cols(
## career = col_character(),
## employee = col_integer(),
## averagewage = col_integer(),
## uaualwage = col_integer(),
## unusualwage = col_integer(),
## overtime = col_integer(),
## averageworkhr = col_double(),
## normalworkhr = col_double(),
## overtimeworkhr = col_double(),
## `in` = col_double(),
## out = col_double(),
## num = col_integer()
## )
進行相關性分析
require(reshape2)
## Loading required package: reshape2
require(scales)
## Loading required package: scales
##
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
##
## col_factor
serCor<-cor(service[,c(3,9:11)])
serMelt<-melt(serCor,varnames = c("x","y"),value.name = "Correlation")
serMelt<-serMelt[order(serMelt$Correlation),]
ggplot(serMelt,aes(x=x,y=y))+
geom_tile(aes(fill=Correlation))+
scale_fill_gradient2(low="black",mid="white",high="darkblue",guide=guide_colorbar(ticks=FALSE,barheight=10),limits=c(-1,1))+
theme_minimal()+
labs(x=NULL,y=NULL)
黑色表示負相關,藍色表示正相關,可以看出薪資和離職呈現負相關,而加班工時雖也和離職成正相關,新進部分,和薪資呈現負相關,但相關性較小;和加班工時成正相關,推估可能薪資較低希望可以有加班的機會。
進行t檢定
t.test(service$averagewage,alternative = "two.sided",mu=22000)
##
## One Sample t-test
##
## data: service$averagewage
## t = 5.1221, df = 11, p-value = 0.0003324
## alternative hypothesis: true mean is not equal to 22000
## 95 percent confidence interval:
## 37038.75 59701.75
## sample estimates:
## mean of x
## 48370.25
randT<-rt(30000,df=NROW(service)-1)
serTTest<-t.test(service$averagewage,alternative = "two.sided",mu=22000)
ggplot(data.frame(x=randT))+
geom_density(aes(x=x),fill="grey",color="grey")+
geom_vline(xintercept = serTTest$statistic)+
geom_vline(xintercept = mean(randT)+c(-2,2)*sd(randT),linetype=2)
假設薪資為22k進行預測,虛線為平均數左右兩個標準差,實現為t統計量,可以看到它離開分佈有一段距離,我們可知道平均值不等於22k,所以基本上在服務業可能起薪低於平均,但長期來看,薪資會再提高,平均不只22k。
ANOVA分析
services <- read_csv("C:/Users/Daniel Chiang/Desktop/HW5/services.csv")
## Parsed with column specification:
## cols(
## year = col_integer(),
## career = col_character(),
## employee = col_integer(),
## averagewage = col_integer(),
## uaualwage = col_integer(),
## unusualwage = col_integer(),
## overtime = col_integer(),
## averageworkhr = col_double(),
## normalworkhr = col_double(),
## overtimeworkhr = col_double(),
## `in` = col_double(),
## out = col_double()
## )
serAnova<-aov(averagewage~career-1 ,services)
summary(serAnova)
## Df Sum Sq Mean Sq F value Pr(>F)
## career 12 1.204e+11 1.004e+10 4219 <2e-16 ***
## Residuals 36 8.563e+07 2.379e+06
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
F value很大代表組跟組間差異大,但組內差異小,所以career中各行業別無明顯差異。
用找出薪資不同的行業來佐證
serout<-ddply(services,"career",summarise,
averagewage.mean=mean(services$averagewage),
averagewage.sd=sd(services$averagewage),
Length=NROW(services$averagewage),
tfrac=qt(p=0.90,df=Length-1),
Lower=averagewage.mean-tfrac*averagewage.sd/sqrt(Length),
Upper=averagewage.mean+tfrac*averagewage.sd/sqrt(Length))
ggplot(serout,aes(x=averagewage.mean,y=career))+geom_point()+
geom_errorbarh(aes(xmin=Lower,xmax=Upper),height=0.3)
從圖表可看出不同職業別的平均值和信賴區間,個職業別間無非常明顯之差異,可佐證ANOVA分析。