
Google Search Result Encoding in Chinese

Hi all,

I've been working on an R program for mining Google search results.

So far my code runs well, except for a problem with Traditional Chinese encoding.
I'm working in a Linux environment.

Google <- function(input)
{
  require(XML)
  require(stringr)
  require(RCurl)

  # GoogleHits() is defined elsewhere (not shown)
  hits <- GoogleHits(input)

  # request results in pages of 100, up to the first 1000 hits
  if (hits >= 1000) {
    start.num <- seq(0, 900, 100)
  } else {
    start.num <- seq(0, hits, 100)
  }

  # CA bundle shipped with RCurl, used for the https request
  CAINFO <- paste(system.file(package = "RCurl"), "/CurlSSL/ca-bundle.crt", sep = "")

  for (i in 1:length(start.num)) {
    start <- start.num[i]
    url <- paste("https://www.google.com/search?as_epq=", input,
                 "&as_occt=title&num=100&ie=UTF-8&start=", start, sep = "")

    script <- getURL(url, followlocation = TRUE, cainfo = CAINFO)

    # htmlParse() rebuilds the whole document structure;
    # the Chinese characters display correctly here
    # (but the result is not a character vector)
    doc <- htmlParse(script)

    # I want to extract the search keyword,
    # which is tagged as "<b>keyword</b>";
    # here I use the keyword "統計" as an example
    extract <- str_extract_all(doc, "<b>統計</b>")

    # here is the problem: str_extract_all() only takes a character vector
    # as its argument, so the call above returns an error
    print(extract)
  }
}

The problems I ran into are all described in the code comments:

1) Without htmlParse(), the extracted text is not displayed as recognizable Chinese characters.

2) If I convert the data into a vector (by applying script <- lapply(url, getURL)), str_extract_all() can be used, but the encoding problem comes back.
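For anyone trying to reproduce problem 2 without hitting Google, here is a minimal sketch of what may be going on. It assumes (this is not confirmed) that the downloaded bytes are valid UTF-8 but arrive in R as a string with no declared encoding. The byte values below are simply the UTF-8 encoding of "統計" and stand in for a real response. If that assumption holds, passing .encoding = "UTF-8" to getURL() might be worth trying as well.

```r
# Stand-in for downloaded content: the UTF-8 bytes of the keyword "統計"
bytes <- as.raw(c(0xe7, 0xb5, 0xb1, 0xe8, 0xa8, 0x88))

# rawToChar() yields a string whose encoding is not declared
txt <- rawToChar(bytes)
Encoding(txt)

# Declare the bytes to be UTF-8; the bytes themselves are not changed
Encoding(txt) <- "UTF-8"
Encoding(txt)

# The string now compares equal to the keyword written with Unicode escapes
identical(txt, "\u7d71\u8a08")
```

Marking the encoding only helps when the bytes really are UTF-8; if the page came back in another encoding such as Big5, iconv() would be needed to convert it instead.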

To be clear, the Chinese I mean here is Traditional Chinese.
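One direction that might sidestep both problems is to keep the parsed document and pull the <b> nodes out with XPath instead of running a regular expression over the raw HTML. The sketch below is only an illustration under assumptions: the page string is a made-up stand-in for what getURL() returns, and the real Google markup may well differ.

```r
library(XML)

# The keyword "統計", written with Unicode escapes so the script does not
# depend on the locale or on the encoding of the source file
keyword <- "\u7d71\u8a08"

# Hypothetical stand-in for one page of search results
page <- paste0(
  "<html><body>",
  "<h3><a href=\"#\"><b>", keyword, "</b> first title</a></h3>",
  "<h3><a href=\"#\">a title without the keyword</a></h3>",
  "<h3><a href=\"#\">second title <b>", keyword, "</b></a></h3>",
  "</body></html>"
)

# State the encoding explicitly so the parser does not have to guess
doc <- htmlParse(page, asText = TRUE, encoding = "UTF-8")

# xpathSApply() returns an ordinary character vector,
# one element per <b> node in the document
bold <- xpathSApply(doc, "//b", xmlValue)

# Keep only the nodes whose text is exactly the keyword
hits <- bold[bold == keyword]
length(hits)
```

Because bold is a plain character vector, it can also be passed to str_extract_all() or any other string function if further matching is needed.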

Any comments or suggestions are truly appreciated!

Thanks in advance.