A Web Crawler for Renren: A Ruby Approach

Recently, I learned some basics of Ruby through Why’s (Poignant) Guide to Ruby, and I wrote an automated crawler that visits other people’s home pages on Renren, the Chinese counterpart of Facebook.
Here is the source code. You may save it as renren-crawler.rb.

require 'net/http'
require 'cgi' # library names are case-sensitive on most systems
$email = 'user@example.com' # Enter Your Renren Email
$password = 'my_password' # Enter Your Renren Password
class User
  attr_accessor :id, :depth
  def initialize(id, depth = 0)
    @id = id
    @depth = depth
  end
end
class Spider
  def initialize()
    @connect = Net::HTTP.start('www.renren.com', 80)
    resp0, body0 = @connect.post('/PLogin.do', "email=#{CGI::escape($email)}&password=#{CGI::escape($password)}&origURL=http%3A%2F%2Fwww.renren.com%2Fhome&domain=renren.com")
    resp1, body1 = @connect.get(resp0['location'], {"Cookie"=>resp0['set-cookie']})
    cookie = resp0['set-cookie'] + '; ' + resp1[ 'set-cookie' ]
    @start = User.new(230154727)
    @cookie = cookie
    @punish = 0
    @step = 0
  end
  def rands
    idlist = [@start.id]
    visited = {}
    begin
      @step += 1
      cid = idlist.last
      # Net::HTTP#get returns only the response on Ruby 1.9+, so read the body from it
      resp = @connect.get('/profile.do?id=' + cid.to_s, { "Cookie" => @cookie })
      body = resp.body
      if visited.member?(cid)
        visited[cid][:rev] += 1
      else
        username = body.match(/username(.|\n)+h1>/)
        if username
          username = username[0].delete("\n")
          username = username.slice((username.index('>')+1)...username.index('<'))
        end
        visited.merge!({cid=>{:name=>username, :rev=>0}})
      end
      print @step, '@', idlist.length, '[', visited[cid][:rev], ']', ':', "\t#", cid, "\t", visited[cid][:name]
      clist = body.scan(/profile.do\?id=\d{0,20}/).uniq
      if clist.length > 0
        @punish = 0
        clist.map!{|c| c.slice((c.index('=')+1)..-1).to_i}.delete_if{|c| c == 0 or visited.member?(c)}
        if clist.length > 0
          idlist << clist[rand(clist.length)]
        else
          idlist.delete_at(-1)
          print "\tDead End!"
        end
      else
        @punish += 1
        idlist.delete_at(-1)
        print "\tPunish?", @punish
        if (@punish > 1)
          print "\a\a\a\nENTER THE CAPTCHA IN YOUR BROWSER AND THE SPIDER WILL RETRY IN 15 SECONDS!\n"
          sleep 15
        end
      end
      print "\n"
    end while idlist.length > 0
  end
end
spider=Spider.new()
spider.rands

Here are some explanations of the code. The initialize method of the Spider class logs in with your email address and password. So before you run the script, set $email and $password near the top of the file to your own credentials.
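The script builds the login request body by hand with CGI::escape. As a minimal sketch of that encoding step only, the standard library's URI.encode_www_form can produce the same kind of string from a hash. The helper name login_form and the sample credentials are made up for illustration, and nothing is sent over the network here.

```ruby
require 'uri'

# Hypothetical helper (not part of the original script): builds the same
# kind of form body that the script posts to /PLogin.do.
def login_form(email, password)
  URI.encode_www_form(
    'email'    => email,
    'password' => password,
    'origURL'  => 'http://www.renren.com/home',
    'domain'   => 'renren.com'
  )
end

# Special characters such as '@', '&' and spaces are escaped for us.
puts login_form('user@example.com', 'my pass&word')
```

Letting the library do the escaping means a password containing & or = cannot break the request body.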
Most of the functionality is in the rands method. Here is the strategy it uses:

  1. It starts from a specific user. By default that is me, but you can change it by replacing the id passed to User.new in the initialize method of Spider with your own id.
  2. It checks whether the home page it is currently visiting links to any users that haven’t been visited yet.
  3. If so, it picks one at random and visits it. Otherwise, it takes a step back and repeats step 2.
  4. Renren has a self-defense mechanism: it requires you to enter a CAPTCHA once you have visited 100 people. So when you hear 3 beeps (the spider is screaming, because it really hurts to hit a CAPTCHA) and see the warning, open anyone’s home page in your browser and respond to the challenge. The crawler will retry in 15 seconds. If the connection is re-established, everything goes on as before; otherwise it keeps screaming.
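Steps 1 to 3 can be tried out without touching the network. The following is only a sketch of the same idea, not the crawler itself: the PAGES hash stands in for Renren, and its ids and HTML snippets are invented, as are the helper names linked_ids and random_walk.

```ruby
# Toy "site": each user id maps to the HTML of that user's home page.
# The ids and pages are invented for illustration.
PAGES = {
  1 => '<a href="profile.do?id=2">a</a> <a href="profile.do?id=3">b</a>',
  2 => '<a href="profile.do?id=1">a</a> <a href="profile.do?id=3">b</a>',
  3 => '<a href="profile.do?id=1">a</a>'
}

# Pull the distinct user ids out of a page, similar to the scan in rands.
def linked_ids(html)
  html.scan(/profile\.do\?id=(\d+)/).flatten.map(&:to_i).uniq
end

# Random walk with backtracking: returns ids in order of first visit.
def random_walk(start, pages, rng = Random.new)
  path = [start] # current trail; the last element is the page being viewed
  order = []     # ids in the order they were first visited
  until path.empty?
    current = path.last
    order << current unless order.include?(current)
    fresh = linked_ids(pages.fetch(current, '')) - order
    if fresh.empty?
      path.pop                              # dead end: take a step back
    else
      path << fresh[rng.rand(fresh.length)] # visit a random unvisited user
    end
  end
  order
end

p random_walk(1, PAGES, Random.new(42))
```

The walk always starts at user 1 and ends once it has backed all the way out of the trail, by which point every reachable user has been visited exactly once.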

So this is the big picture. Now open your CMD (Windows) or Terminal (Unix-based systems), change to the directory where you saved the file, and rock it! (You need Ruby installed, otherwise it won’t work.)

$ ruby renren-crawler.rb

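And the result goes like this (each line shows the step number, the length of the current trail after the @, the number of revisits in brackets, and then the user’s id and name):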

1@1[0]:	#230154727	姜子麟
2@2[0]:	#230070920	张瑞勋
3@3[0]:	#179475049	杨冉
4@4[0]:	#229909528	卢天亮ayake
5@5[0]:	#233278906	许子岳
6@6[0]:	#239304543	黄一纯~晔
7@7[0]:	#230288268	蒋成皿——jcm
8@8[0]:	#231204752	王炫烨mrsmsr
9@9[0]:	#232148174	李欣意

Isn’t it cool? Have fun.
