完成網路爬蟲,從 The American Presidency Project 網站 http://www.presidency.ucsb.edu/inaugurals.php 上取得58屆總統的名字和就職演說的連結,再透過 InaugTestFunction
取得連結內就職演說文字,經文本清理存成TestMining.R
。
Sys.setlocale(category='LC_ALL', locale='C')
## [1] "C"
source("TextMining.R")
library(Matrix)
library(stats)
library(cluster)
library(devtools)
#install_github("vqv/ggbiplot")
library(ggbiplot)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
製作 TF-IDF 詞頻矩陣。
tdm <- TermDocumentMatrix(corpus2)
N = tdm$ncol
tf <- apply(tdm, 2, sum)
idfCal <- function(word_doc){ log2( N / nnzero(word_doc) ) }
idf <- apply(tdm, 1, idfCal)
doc.tfidf <- as.matrix(tdm)
for(i in 1:nrow(tdm)){
for(j in 1:ncol(tdm)){
doc.tfidf[i,j] <- (doc.tfidf[i,j] / tf[j]) * idf[i]
}
}
因為要比較58屆總統就職演說的同/異質性,資料應是58屆總統就職演說的文本,而不是詞語,因此將詞頻矩陣doc.tfidf
先行轉置,存成data
。 接著,透過 Elbow Method 找出最佳集群數目(optimal number of clusters)。由於沒有明確的 elbow ,因此我找出num_clusters
使組內平方和(wss)小於0.5,暫定其為最佳集群數目。
#Transpose dataframe
data = t(doc.tfidf)
#K-Means clustering: elbow method to determine optimum numbers of clusters
wss <- 2:58
for (i in 2:58)
{
wss[i] <- sum(kmeans(data, centers=i, nstart = 20, algorithm = c("Lloyd"))$withinss)
}
plot(2:58, wss[2:58], type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
num_clusters = length(which(unlist(wss) > 0.5)) +1
num_clusters
## [1] 10
從上述假設,得到最佳集群數目,將58屆總統就職演說作 K-Means Clustering。
#K-Means clustering: perform K-Means clustering
kmeansOut = kmeans(data, num_clusters, nstart = 20, algorithm = c("Lloyd"))
testdata = data
#K-Means clustering: perform PCA
data.pca = prcomp(data)
data.kmeans = as.factor(kmeansOut$cluster)
kmeans_clustering <- as.data.frame(data.kmeans)
kmeans_clustering
## data.kmeans
## 1 3
## 2 3
## 3 8
## 4 3
## 5 3
## 6 3
## 7 3
## 8 3
## 9 3
## 10 3
## 11 6
## 12 7
## 13 4
## 14 3
## 15 3
## 16 3
## 17 3
## 18 3
## 19 3
## 20 3
## 21 3
## 22 3
## 23 3
## 24 3
## 25 3
## 26 3
## 27 3
## 28 3
## 29 3
## 30 3
## 31 3
## 32 2
## 33 3
## 34 3
## 35 9
## 36 3
## 37 1
## 38 10
## 39 3
## 40 3
## 41 3
## 42 3
## 43 3
## 44 3
## 45 3
## 46 3
## 47 3
## 48 3
## 49 3
## 50 5
## 51 3
## 52 3
## 53 3
## 54 3
## 55 3
## 56 3
## 57 3
## 58 3
以視覺化呈現 K-Means Clustering 結果。可以看出絕大多數的就職演說集中在同一集群,其餘零星分布,沒有明顯的分群現象,顯示文本內容同質性高,未受時代風潮、當代重要議題等影響。
#K-Means clustering: biplot for Principal Components
g <- ggbiplot(data.pca, obs.scale = 1, var.scale = 1,
groups = data.kmeans, ellipse = TRUE,
circle = TRUE, labels = rownames(data))
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal',
legend.position = 'top')
print(g)