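Topic: Visualizing the Content of IMDb's 500 Most Popular Films of 2017

This report uses a word cloud to visualize the plot descriptions of the 500 most popular 2017 feature films listed on IMDb.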

Installing packages

install.packages('xml2')
install.packages('rvest')
install.packages('NLP')
install.packages('tm')
install.packages('stringr')
install.packages('RColorBrewer')
install.packages('wordcloud')
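
Re-running the install lines above downloads every package again. A minimal sketch, assuming the same seven packages, that first checks which ones are missing (the names `pkgs` and `missing_pkgs` are illustrative and not part of the original script):

```r
# Packages used in this report (same list as the install.packages() calls above).
pkgs <- c("xml2", "rvest", "NLP", "tm", "stringr", "RColorBrewer", "wordcloud")

# Names of the packages that are not yet in the local library.
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

# Uncomment to install only what is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```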

Loading packages

library('xml2')
library('rvest')
library('NLP')
library('tm')
library('stringr')
library('RColorBrewer')
library('wordcloud')

Web scraping

#Specifying the URLs of the pages to be scraped
url_100 <- 'http://www.imdb.com/search/title?count=100&release_date=2017,2017&title_type=feature&page=1&ref_=adv_nxt'
url_200 <- 'http://www.imdb.com/search/title?count=100&release_date=2017,2017&title_type=feature&page=2&ref_=adv_nxt'
url_300 <- 'http://www.imdb.com/search/title?count=100&release_date=2017,2017&title_type=feature&page=3&ref_=adv_nxt'
url_400 <- 'http://www.imdb.com/search/title?count=100&release_date=2017,2017&title_type=feature&page=4&ref_=adv_nxt'
url_500 <- 'http://www.imdb.com/search/title?count=100&release_date=2017,2017&title_type=feature&page=5&ref_=adv_nxt'

#Reading the HTML code from the website
webpage_100 <- read_html(url_100)
webpage_200 <- read_html(url_200)
webpage_300 <- read_html(url_300)
webpage_400 <- read_html(url_400)
webpage_500 <- read_html(url_500)

#Using CSS selectors to scrape the description section
description_data_html_100 <- html_nodes(webpage_100,'.text-muted+ .text-muted , .ratings-bar+ .text-muted')
description_data_html_200 <- html_nodes(webpage_200,'.text-muted+ .text-muted , .ratings-bar+ .text-muted')
description_data_html_300 <- html_nodes(webpage_300,'.text-muted+ .text-muted , .ratings-bar+ .text-muted')
description_data_html_400 <- html_nodes(webpage_400,'.text-muted+ .text-muted , .ratings-bar+ .text-muted')
description_data_html_500 <- html_nodes(webpage_500,'.text-muted+ .text-muted , .ratings-bar+ .text-muted')
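
The five near-identical blocks above differ only in the page number, so the same steps could be expressed once and applied to each page. The following is a sketch under that assumption: the URL pattern and CSS selector are copied from the code above, while `base_url`, `urls`, `selector`, and `scrape_descriptions` are hypothetical names introduced here. The network call is left commented out.

```r
# URL template copied from the code above; %d is the page number.
base_url <- "http://www.imdb.com/search/title?count=100&release_date=2017,2017&title_type=feature&page=%d&ref_=adv_nxt"
urls <- sprintf(base_url, 1:5)

# CSS selector copied from the code above.
selector <- ".text-muted+ .text-muted , .ratings-bar+ .text-muted"

# Hypothetical helper: read one page and return its description texts.
scrape_descriptions <- function(url) {
  page <- rvest::read_html(url)
  nodes <- rvest::html_nodes(page, selector)
  rvest::html_text(nodes)
}

# Requires network access and the rvest package, so it is not run here:
# description_list <- lapply(urls, scrape_descriptions)
```

Whether the selector still matches depends on IMDb's current page markup, which may have changed since this report was written.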

Text cleaning

First convert the scraped description nodes to text, then combine the text strings from the five pages (500 films in total).

#Converting the description data to text
description_data_100 <- html_text(description_data_html_100)
description_data_200 <- html_text(description_data_html_200)
description_data_300 <- html_text(description_data_html_300)
description_data_400 <- html_text(description_data_html_400)
description_data_500 <- html_text(description_data_html_500)

#Combine the five pages element-wise: element i holds the i-th description from each page
description_data <- paste(description_data_100, description_data_200, description_data_300, description_data_400, description_data_500, sep = " ")

Inspect the combined text:

#Let's have a look at the description
head(description_data)
## [1] "\nA young blade runner's discovery of a long-buried secret leads him to track down former blade runner Rick Deckard, who's been missing for thirty years. \nFrank, a single man raising his child prodigy niece Mary, is drawn into a custody battle with his mother. \nA missing teenage girl. A brutal and tormented enforcer on a rescue mission. Corrupt power and vengeance unleash a storm of violence that may lead to his awakening. \nA young couple go on an adventurous vacation to Thailand only to find themselves haunted by a malevolent spirit after naively disrespecting a Ghost House. \nA couple who can't stop fighting embark on a last-ditch effort to save their marriage: turning their fights into songs and starting a band."
## [2] "\nHaving taken her first steps into a larger world in Star Wars: Episode VII - The Force Awakens (2015), Rey continues her epic journey with Finn, Poe, and Luke Skywalker in the next chapter of the saga. \nA high school student named, Light Turner, discovers a mysterious notebook that has the power to kill anyone whose name is written within its pages and launches a secret crusade to rid the world of criminals. \nAn art student taps into a rich source of creative inspiration after the accidental slaughter of her rapist. An unlikely vigilante emerges, set out to avenge college girls whose attackers ...                See full summary »\n \nGaurav, a simple hard working guy for a white collar job visits Mumbai for a meeting where his doppelganger is about to bring chaos in his life. \nA journalist strikes up a romantic relationship with notorious drug lord Pablo Escobar."
## [3] "\nA group of bullied kids band together when a shapeshifting demon, taking the appearance of a clown, begins hunting children. \nTwo brothers attempt to pull off a heist during a NASCAR race in North Carolina. \nA francophone S.Q. officer and an anglophone O.P.P officer reunite to investigate a large car theft ring led by an Italian mobster. \nA government clerk on election duty in the conflict ridden jungle of Central India tries his best to conduct free and fair voting despite the apathy of security forces and the looming fear of guerrilla attacks by communist rebels. \nA hectic wedding party held in an 17th century French palace comes together with the help of the behind-the-scenes staff."
## [4] "\nFueled by his restored faith in humanity and inspired by Superman's selfless act, Bruce Wayne enlists the help of his newfound ally, Diana Prince, to face an even greater enemy. \nIn the near future, Major is the first of her kind: A human saved from a terrible crash, who is cyber-enhanced to be a perfect soldier devoted to stopping the world's most dangerous criminals. \nColombian drug kingpin Jesus Morales secretly pays for the services of a sniper nicknamed \"The Devil,\" capable of killing one-by-one the enemies of anyone who hires him. With no adversaries left alive, Morales grows stronger and gains control of more smuggling routes into the United States. The DEA, alarmed by this threat to the country, sends agent Kate Estrada, who has been following Morales ...                See full summary »\n \nNot long after John Chambers and his family arrive at their new home in a small country town of Pennsylvania, John begins to experience sleep paralysis. Lying there paralyzed, trapped ...                See full summary »\n \n\"Botoks\" is intended to be a record of the authentic history of strong, determined and expressive physicians who struggle with life's decisions and problems: discrimination, maternity ...                See full summary »\n"
## [5] "\nWhen their headquarters are destroyed and the world is held hostage, the Kingsman's journey leads them to the discovery of an allied spy organization in the US. These two elite secret organizations must band together to defeat a common enemy. \nTwo friends on a road trip compete for the affections of a handsome man when their flight is redirected due to a hurricane. \nIn 13th century Ireland, a group of monks must escort a sacred relic across an Irish landscape fraught with peril. \nA teenage girl and her little brother must survive a wild 24 hours during which a mass hysteria of unknown origins causes parents to turn violently on their own kids. \nA man awakens in an empty house that he is unable to leave. Battling fatigue, injury and amnesia, and guided only by a cryptic voice on his phone, he begins piecing together fractured ...                See full summary »\n"
## [6] "\nThe story of psychologist William Moulton Marston, the polyamorous relationship between his wife and his mistress, the creation of his beloved comic book character Wonder Woman, and the controversy the comic generated. \nA newly released prison gangster is forced by the leaders of his gang to orchestrate a major crime with a brutal rival gang on the streets of Southern California. \nThe Zookeeper's Wife tells the account of keepers of the Warsaw Zoo, Antonina and Jan Zabinski, who helped save hundreds of people and animals during the German invasion. \nAn oily, amoral estate agent is preyed upon by one of his victims, who quietly moves into his flat and, unseen, begins a deliciously malicious campaign of revenge. Two Pigeons is a dark comedy with a sinister streak. \nCondorito embarks in a hilarious adventure to save the planet and his loved ones from an evil alien."
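
Each element printed above contains five descriptions because paste() with several vectors combines them element-wise, whereas c() would stack them into one longer vector. A small sketch with invented toy vectors (`page1` and `page2` are illustrative, not scraped data) shows the difference:

```r
# Toy stand-ins for two scraped pages (invented for illustration).
page1 <- c("plot A1", "plot A2")
page2 <- c("plot B1", "plot B2")

# Element-wise: one string per position, each holding one entry from every page.
elementwise <- paste(page1, page2, sep = " ")

# Stacked: one string per description.
stacked <- c(page1, page2)

length(elementwise)  # 2
length(stacked)      # 4

# Once collapsed into a single string, both contain the same words.
words_a <- sort(strsplit(paste(elementwise, collapse = " "), " ")[[1]])
words_b <- sort(strsplit(paste(stacked, collapse = " "), " ")[[1]])
identical(words_a, words_b)  # TRUE
```

Since the text is later collapsed into one string, either form yields the same set of words as long as every page returns the same number of descriptions. If one page returned fewer matches, paste() would recycle the shorter vector and repeat some descriptions, which c() avoids.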

Collapse the text into a single string and clean it:

#Combine as one string
description_data <- paste(description_data, collapse = " ")

#Data-Preprocessing: removing '\n'
description_data2 <- gsub("\n","",description_data)

#Data-Preprocessing: removing non-words
description_data2 <- gsub("\\W"," ",description_data2)

#Data-Preprocessing: removing digits
description_data2 <- gsub("\\d"," ",description_data2)

#Data-Preprocessing: changing all to lower case
description_data2 <- tolower(description_data2)

#Data-Preprocessing: removing stopwords
description_data2 <- removeWords(description_data2,stopwords())

#Data-Preprocessing: removing single letters (the text is already lower case)
description_data2 <- gsub("\\b[a-z]\\b"," ",description_data2)

#Data-Preprocessing: removing irrelevant words left over from "See full summary"
#(removeWords matches whole words only, so words such as "seeks" or "fully" stay intact)
description_data2 <- removeWords(description_data2, c("see", "full", "summary"))

#Data-Preprocessing: removing whitespaces
description_data2 <- stripWhitespace(description_data2)

#Data-Preprocessing: split up processed string into a list of separate words
textbag <- str_split(description_data2, "\\s+")

#Data-Preprocessing: unlist textbag into separate characters
textbag <- unlist(textbag)
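
When filler words are removed with gsub(), a bare pattern such as "see" also matches inside longer words, while word boundaries (`\\b`) restrict the match to whole words. A minimal base R sketch on an invented toy string (`toy` and `squish` are illustrative names, not part of the original script):

```r
# Toy input, invented for illustration (not scraped data).
toy <- "she seeks a fully armed crew see full summary"

# Substring match: also cuts into "seeks" and "fully".
substring_removed <- gsub("see|full|summary", " ", toy)

# Whole-word match: \\b limits the pattern to complete words.
word_removed <- gsub("\\b(see|full|summary)\\b", " ", toy, perl = TRUE)

# Collapse repeated whitespace and trim both ends.
squish <- function(x) trimws(gsub("\\s+", " ", x))

squish(substring_removed)  # "she ks a y armed crew"
squish(word_removed)       # "she seeks a fully armed crew"
```

The fragments "ks" and "y" in the first result would otherwise end up as spurious tokens in the word list.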

Word cloud

The word cloud suggests that the most frequent themes among IMDb's 500 most popular films of 2017 are plots related to family and life.

wordcloud(textbag, min.freq = 10, random.order = FALSE, scale = c(3.5, 0.5), colors = brewer.pal(6, "Dark2"))
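
A reading of the word cloud can be checked against plain word counts. The sketch below tabulates frequencies on an invented toy vector (`toy_textbag` is illustrative; in this report the input would be `textbag`) and keeps the words that reach a minimum count:

```r
# Toy word vector, invented for illustration; not the scraped data.
toy_textbag <- c("family", "life", "family", "young", "life", "family")

# Count each word and sort from most to least frequent.
word_freq <- sort(table(toy_textbag), decreasing = TRUE)

# Keep words that appear at least min_freq times
# (2 for this toy example; the word cloud above uses min.freq = 10).
min_freq <- 2
frequent <- word_freq[word_freq >= min_freq]

names(frequent)  # "family" "life"
```

Counts obtained this way may differ slightly from what wordcloud() draws, since that function applies its own text processing when it is given raw words without frequencies.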