
Webcrawl with R

I have a problem I would like some help with. I need to write a piece of R code that loads a CSV file. The CSV file contains one column, named "Link", and for each row the code needs to download the content of that link and write it to a separate file. So far I have managed to find and modify the piece of code shown below (thanks to Christopher Gandrud and co-authors).

library(foreign)
library(RCurl)

# Read the CSV file; it has a single column, "Link", with one URL per row
addresses <- read.csv(">>PATH TO CSV FILE<<")

# addresses is a data frame, so the loop runs once with i equal to the whole
# "Link" column; getURL() is vectorised, so full.text holds every page at once
for (i in addresses) full.text <- getURL(i)

text <- data.frame(full.text)

outpath <- ">>PATH TO SPECIFIED FOLDER<<"

# Write the content downloaded for each link to its own numbered .txt file
x <- 1:nrow(text)

for (i in x) {
  write(as.character(text[i, 1]), file = paste(outpath, "/", i, ".txt", sep = ""))
}

The code actually works perfectly, BUT the problem is that I am overloading the server with requests: after the content of the first 100-150 links has been downloaded correctly, the remaining files are just empty. I know this is the cause because I have tested it many times with a decreasing number of links; if I only download 100 links at a time there is no problem, but above 100 it starts to fail. Nonetheless, I need to add a couple of things to this piece of code for it to become a good crawler for this particular task.

I have split my problem in two, because solving problem one should fix the issue temporarily.

  1. I want to use Sys.sleep() after every 100 downloads, so the code fires requests for the first 100 links, pauses for x seconds, and then fires the next 100 requests, and so on (see the first sketch after this list).

  2. Once all rows/links in my dataset/CSV file have been downloaded, I need the code to check each output file against two conditions: it must not be empty, and it must not contain a certain error message the server returns in some special cases. If either condition is true, the file name (link number) should be saved into a vector I can work with from there (see the second sketch after this list).
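
Here is a rough sketch of what I have in mind for point 1 — only a guess at a structure, assuming the CSV column is called "Link", with placeholder paths and a made-up pause of 30 seconds. The idea is to fetch the links one at a time and call Sys.sleep() after every 100th download:

library(RCurl)

addresses <- read.csv(">>PATH TO CSV FILE<<", stringsAsFactors = FALSE)
links     <- addresses$Link            # assuming the column is named "Link"
outpath   <- ">>PATH TO SPECIFIED FOLDER<<"

for (i in seq_along(links)) {
  # Download one link and write it straight to its own numbered .txt file
  full.text <- getURL(links[i])
  write(full.text, file = paste(outpath, "/", i, ".txt", sep = ""))

  # Pause after every 100th request so the server gets a break
  if (i %% 100 == 0) {
    Sys.sleep(30)                      # placeholder: x seconds
  }
}

Downloading and writing inside the same loop would also mean that whatever has been fetched so far is already on disk if the run stops halfway.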
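
And a sketch of what I mean by point 2, again only a guess: I have not written out the server's real error message, so "SERVER ERROR MESSAGE" below is a placeholder for whatever text it actually returns, and the number of files is a placeholder too. The result should be a vector of the link numbers whose files are empty or contain that message:

outpath    <- ">>PATH TO SPECIFIED FOLDER<<"
n.links    <- 250                      # placeholder: however many links were downloaded
error.text <- "SERVER ERROR MESSAGE"   # placeholder: the server's actual error text

bad.links <- c()

for (i in 1:n.links) {
  filename <- paste(outpath, "/", i, ".txt", sep = "")
  content  <- paste(readLines(filename, warn = FALSE), collapse = "\n")

  # Flag the file if it is empty or contains the error message
  if (nchar(content) == 0 || grepl(error.text, content, fixed = TRUE)) {
    bad.links <- c(bad.links, i)
  }
}

bad.links                              # link numbers that need to be fetched again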

Wow, this question suddenly got pretty long. I realize it is a big question and I am asking a lot. It is for my master's thesis, which is not really about R programming, but I need to download the content from a lot of websites that I have been given access to. Next I have to analyze that content, which is what my thesis is actually about. Any suggestions/comments are welcome.

library(foreign)
library(RCurl)

addresses <- read.csv("~/Dropbox/Speciale/Mining/Input/Extract post - Dear Lego n(250).csv")

for (i in addresses) {
  if (i == 50) {
    print("Why wont this work?")
    Sys.sleep(10)
    print(i)
  } else {
    print(i)
  }
}

"And then a whole list over the links loaded in. No "Why wont this work" at i == 50" followed by

Warning message:
In if (i == 100) { :
  the condition has length > 1 and only the first element will be used
full.text <- getURL(i)
text <- data.frame(full.text)
outpath <- "~/Dropbox/Speciale/Mining/Output"
x <- 1:nrow(text)
for (i in x) {
  write(as.character(text[i, 1]), file = paste(outpath, "/", i, ".txt", sep = ""))
}

Is anyone able to help me further?