A Web Crawler for Renren: A Ruby Approach
Recently, I learned some basics of Ruby through Why’s (Poignant) Guide to Ruby, and I wrote an automated crawler that visits the home pages of other people on Renren, the Chinese counterpart of Facebook.
Here is the source code. You may save it as renren-crawler.rb.
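require 'net/http'
require 'cgi' # the library name is lowercase; 'CGI' fails on case-sensitive file systems

$email = '[email protected]' # Enter Your Renren Email
$password = 'my_password' # Enter Your Renren Password

# A Renren user id together with the depth at which it was found.
class User
  attr_accessor :id, :depth

  def initialize(id = $id, depth = 0)
    @id = id
    @depth = depth
  end
end
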
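class Spider
  def initialize
    @connect = Net::HTTP.start('www.renren.com', 80)
    # Log in. On Ruby 1.9 and later, post and get return only the response object.
    resp0 = @connect.post('/PLogin.do', "email=#{CGI.escape($email)}&password=#{CGI.escape($password)}&origURL=http%3A%2F%2Fwww.renren.com%2Fhome&domain=renren.com")
    # Follow the login redirect, sending back the cookie we just received.
    resp1 = @connect.get(resp0['location'], { 'Cookie' => resp0['set-cookie'] })
    # Combine the cookies from both responses; compact skips a missing header.
    @cookie = [resp0['set-cookie'], resp1['set-cookie']].compact.join('; ')
    @start = User.new(230154727) # the user the walk starts from
    @punish = 0 # consecutive pages that contained no profile links
    @step = 0   # total number of pages requested
  end
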
  def rands
    idlist = [@start.id] # current path; the last element is the page being visited
    visited = {}         # id => { :name, :rev } for every user seen so far
    begin
      @step += 1
      cid = idlist.last
      # On Ruby 1.9 and later, get returns only the response; read the body from it.
      resp = @connect.get('/profile.do?id=' + cid.to_s, { 'Cookie' => @cookie })
      body = resp.body.to_s
      if visited.member?(cid)
        visited[cid][:rev] += 1
      else
        # Extract the user's name from the block that starts at "username" and ends at "h1>".
        username = body.match(/username(.|\n)+h1>/)
        if username
          username = username[0].delete("\n")
          username = username.slice((username.index('>') + 1)...username.index('<'))
        end
        visited[cid] = { :name => username, :rev => 0 }
      end
      print @step, '@', idlist.length, '[', visited[cid][:rev], ']', ':', "\t#", cid, "\t", visited[cid][:name]
      clist = body.scan(/profile\.do\?id=\d{0,20}/).uniq
      if clist.length > 0
        @punish = 0
        # Keep only numeric ids that have not been visited yet.
        clist.map! { |c| c.slice((c.index('=') + 1)..-1).to_i }.delete_if { |c| c == 0 || visited.member?(c) }
        if clist.length > 0
          idlist << clist[rand(clist.length)]
        else
          idlist.pop
          print "\tDead End!"
        end
      else
        # No profile links at all: probably a CAPTCHA page rather than a real profile.
        @punish += 1
        idlist.pop
        print "\tPunish?", @punish
        if @punish > 1
          print "\a\a\a\nENTER THE CAPTCHA IN YOUR BROWSER AND THE SPIDER WILL RETRY IN 15 SECONDS!\n"
          sleep 15
        end
      end
      print "\n"
    end while idlist.length > 0
  end
end

spider = Spider.new
spider.rands
Here are some explanations for the code. The Spider class's initialize method logs in with your email address and password and keeps the session cookies for later requests. So before you run the script, set $email and $password near the top of the file to your own email and password.
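To make the login step a bit more concrete, here is a minimal sketch of how the form body sent to /PLogin.do could be built. It uses URI.encode_www_form from the standard library instead of the hand-written string in the script; the helper name login_body and the sample credentials are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: builds the URL-encoded form body for the login POST.
# The field names are the ones used in the script above.
def login_body(email, password)
  URI.encode_www_form(
    'email'    => email,
    'password' => password,
    'origURL'  => 'http://www.renren.com/home',
    'domain'   => 'renren.com'
  )
end

# Sample credentials, for illustration only.
puts login_body('user@example.com', 'p&ss word')
```

Special characters such as @, & and spaces are escaped for you, so a password containing them cannot break the form.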
Most of the functionality is in the rands method. Here is the strategy that rands uses:
- It starts from a specific user. By default that is me, but you can change it by replacing the id passed to User.new in Spider's initialize method with your own id.
- It checks whether the home page the crawler is visiting links to any users that haven’t been visited yet.
- If it does, the crawler picks one of them at random and visits it. Otherwise, it takes a step back to the previous user and repeats the check.
- Renren has a self-defense mechanism: it requires you to enter a CAPTCHA when you have visited 100 people. So when you hear 3 beeps (the spider is screaming, because it really hurts when hitting a CAPTCHA) and see the warning, open anyone’s home page in your browser and respond to the challenge. The crawler will retry in 15 seconds. If the connection is re-established, everything goes on as before; otherwise it keeps screaming.
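The strategy above is essentially a random walk with backtracking. The following sketch runs it on a tiny in-memory "site" instead of Renren, so you can try it without logging in. The PAGES hash and the names unvisited_ids and random_walk are made up for illustration, and the CAPTCHA handling is left out, so a page without links is simply treated as a dead end.

```ruby
# Hypothetical stand-in for Renren: user id => HTML of that user's home page.
PAGES = {
  1 => '<a href="profile.do?id=2">a</a> <a href="profile.do?id=3">b</a>',
  2 => '<a href="profile.do?id=1">back</a>',
  3 => '<a href="profile.do?id=4">c</a>',
  4 => ''
}

# Collect the ids linked from a page, dropping duplicates and visited users.
def unvisited_ids(html, visited)
  html.scan(/profile\.do\?id=(\d+)/).flatten.map(&:to_i).uniq
      .reject { |id| id.zero? || visited.key?(id) }
end

# Walk randomly from start; path plays the role of idlist in the script.
# Returns the ids in the order they were first visited.
def random_walk(pages, start, rng = Random.new)
  path = [start]
  visited = {}
  order = []
  until path.empty?
    current = path.last
    unless visited.key?(current)
      visited[current] = true
      order << current
    end
    candidates = unvisited_ids(pages.fetch(current, ''), visited)
    if candidates.empty?
      path.pop                               # dead end: take a step back
    else
      path << candidates.sample(random: rng) # pick an unvisited user at random
    end
  end
  order
end

p random_walk(PAGES, 1)
```

Every user is pushed onto the path at most once, so the walk always ends after it has backed out of the starting page.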
So this is the big picture. Now open CMD (Windows) or a terminal (Unix-based systems), change to the directory where you saved the file, and rock it! (You need to have Ruby installed, otherwise it won’t work.)
$ ruby renren-crawler.rb
And the result goes like this:
1@1[0]: #230154727 姜子麟
2@2[0]: #230070920 张瑞勋
3@3[0]: #179475049 杨冉
4@4[0]: #229909528 卢天亮ayake
5@5[0]: #233278906 许子岳
6@6[0]: #239304543 黄一纯~晔
7@7[0]: #230288268 蒋成皿——jcm
8@8[0]: #231204752 王炫烨mrsmsr
9@9[0]: #232148174 李欣意
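Going by the print statement in rands, each line reads: step counter, "@", current path length, revisit count in brackets, then the user id and name. If you want to process the output further, a small parser along these lines may help. The names LINE and parse_line are made up for illustration, and a trailing note such as "Dead End!" would end up inside the name field.

```ruby
# Hypothetical pattern for one line of crawler output, e.g. "4@4[0]: #229909528 name".
LINE = /\A(?<step>\d+)@(?<path_length>\d+)\[(?<revisits>\d+)\]:\s*#(?<id>\d+)\s*(?<name>.*)\z/

# Returns a hash with the fields of one output line, or nil if it does not match.
def parse_line(line)
  m = LINE.match(line.chomp)
  return nil unless m
  {
    step:        m[:step].to_i,
    path_length: m[:path_length].to_i,
    revisits:    m[:revisits].to_i,
    id:          m[:id].to_i,
    name:        m[:name]
  }
end

p parse_line("4@4[0]:\t#229909528\t卢天亮ayake")
```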
Isn’t it cool? Have fun.