Google Search Result Encoding in Chinese
Hi all,
I've been working on an R program for mining Google search results.
So far my code runs well, except for a problem with Traditional Chinese encoding.
I'm working in a Linux environment.
Google <- function(input) {
  require(XML)
  require(stringr)
  require(RCurl)

  hits <- GoogleHits(input)
  if (hits >= 1000) {
    start.num = seq(0, 900, 100)
  } else if (hits < 1000) {
    start.num = seq(0, hits, 100)
  }

  for (i in 1:length(start.num)) {
    start = start.num[i]
    url <- paste("https://www.google.com/search?as_epq=", input,
                 "&as_occt=title&num=100&ie=UTF-8&start=", start, sep = "")
    CAINFO = paste(system.file(package = "RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")
    script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)

    # using htmlParse() to re-organize the whole structure
    # the Chinese encoding shows quite well here
    # (but the structure is not a vector)
    doc <- htmlParse(script)

    # Wanna extract out the searched keyword
    # which is tagged by "<b>keyword</b>"
    # here, I take the keyword "統計" for example
    extract <- str_extract_all(html_str, "<b>統計</b>")
    # here is the problem... which extract only takes a vector as an argument
    # so below will return an error
    print(extract)
  }
}
The problems I've run into are noted in the code comments. To summarize:
1) Without htmlParse(), the extracted data is not displayed as recognizable Chinese characters.
2) If I convert the data into a vector (by applying script <- lapply(url, getURL)), str_extract_all() can be used, but the encoding problem comes back.
In addition, the Chinese I mean here is Traditional Chinese.
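To make the goal concrete without hitting the network, here is a minimal sketch of one possible direction: query the parsed document directly instead of running a regex over it. The HTML string below is a made-up stand-in for what getURL() returns (not real Google output), and passing encoding = "UTF-8" to htmlParse() is an assumption about what the page needs. In the real function, calling getURL() with .encoding = "UTF-8" might also be worth trying, though I have not confirmed that it fixes the display problem.

```r
library(XML)

# Hypothetical stand-in for the page text returned by getURL();
# "\u7d71\u8a08" is the keyword "統計" written with Unicode escapes.
html_str <- paste0(
  "<html><body>",
  "<h3><b>\u7d71\u8a08</b>\u5b78</h3>",
  "<p>second hit: <b>\u7d71\u8a08</b></p>",
  "</body></html>"
)

# Parse the string and state the encoding explicitly (assumed to be UTF-8).
doc <- htmlParse(html_str, asText = TRUE, encoding = "UTF-8")

# xpathSApply() returns an ordinary character vector holding the text
# of every <b> node, so no regex over the parsed tree is needed.
bold_text <- xpathSApply(doc, "//b", xmlValue)

# Keep only the <b> nodes whose text is exactly the keyword.
keyword <- "\u7d71\u8a08"
hits <- bold_text[bold_text == keyword]
print(hits)
```

Because xpathSApply() hands back a plain character vector, the result could also be passed on to str_extract_all() if a regex is still wanted.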
Any comments or suggestions are truly appreciated!
Thanks in advance.
