Level 3: Render with PhantomJS

Sometimes a web page is too complex to find the underlying API request. It's time to meet the power of PhantomJS.

To use PhantomJS, you should have PhantomJS installed. If you are running pyspider in all mode, PhantomJS is enabled automatically when its executable is found in the PATH.
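You can replicate that PATH lookup yourself with the standard library. This is only a convenience sketch (the helper name `phantomjs_available` is made up, not part of pyspider):

```python
import shutil

def phantomjs_available():
    """Return True if a `phantomjs` executable can be found on PATH,
    which is the condition under which pyspider enables it in all mode."""
    return shutil.which('phantomjs') is not None
```

If this returns False, install PhantomJS and make sure its directory is on your PATH before starting pyspider.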

Make sure PhantomJS is working by running:

$ pyspider phantomjs

Continue with the rest of the tutorial if the output is:

Web server running on port 25555

Use PhantomJS

When pyspider is connected to PhantomJS, you can enable this feature by adding the parameter fetch_type='js' to self.crawl. Here we use PhantomJS to scrape the channel list of http://www.twitch.tv/directory/game/Dota%202, which is loaded with AJAX as discussed in Level 2:

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://www.twitch.tv/directory/game/Dota%202',
                   fetch_type='js', callback=self.index_page)

    def index_page(self, response):
        return {
            "url": response.url,
            "channels": [{
                "title": x('.title').text(),
                "viewers": x('.info').contents()[2],
                "name": x('.info a').text(),
            } for x in response.doc('.stream.item').items()]
        }

The snippet above uses the PyQuery API to handle the list of streams. You can find the complete reference in the PyQuery complete API documentation.

Running JavaScript on Page

We will try to scrape images from http://www.pinterest.com/categories/popular/ in this section. Only 25 images are shown at the beginning; more images are loaded when you scroll to the bottom of the page.

To scrape as many images as possible, we can use the js_script parameter to set a function-wrapped piece of JavaScript that simulates the scroll action:

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://www.pinterest.com/categories/popular/',
                   fetch_type='js', js_script="""
                   function() {
                       window.scrollTo(0,document.body.scrollHeight);
                   }
                   """, callback=self.index_page)

    def index_page(self, response):
        return {
            "url": response.url,
            "images": [{
                "title": x('.richPinGridTitle').text(),
                "img": x('.pinImg').attr('src'),
                "author": x('.creditName').text(),
            } for x in response.doc('.item').items() if x('.pinImg')]
        }

  • The script is executed after the page is loaded (this can be changed via the js_run_at parameter).
  • We scroll once after the page is loaded. You can scroll multiple times using setTimeout; PhantomJS will fetch as many items as possible before the timeout arrives.
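A multi-scroll script could look like the sketch below: the scroll count and 1-second interval are arbitrary choices, not values from the tutorial, and you would pass the string as the js_script argument to self.crawl just like in the example above.

```python
# Hypothetical js_script that scrolls several times before PhantomJS
# takes its snapshot; each scroll is scheduled 1000 ms after the last.
js_script = """
function() {
    var times = 5;
    function scroll() {
        window.scrollTo(0, document.body.scrollHeight);
        if (--times > 0) {
            setTimeout(scroll, 1000);
        }
    }
    scroll();
}
"""
```

Keep the total scroll time well under the fetch timeout, or PhantomJS will return before the last batches of items have loaded.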

Online demo: http://demo.pyspider.org/debug/tutorial_pinterest