codeblog code is freedom — patching my itch

12/20/2013

DOM scraping

Filed under: Blogging,Debian,General,Ubuntu,Ubuntu-Server,Web — kees @ 11:16 pm

For a long time now I’ve used mechanize (via either Perl or Python) for doing website interaction automation. Stuff like playing web games, checking the weather, or reviewing my balance at the bank. However, as the use of javascript continues to increase, it’s getting harder and harder to screen-scrape without actually processing DOM events. To do that, really only browsers are doing the right thing, so getting attached to an actual browser DOM is generally the only way to do any kind of web interaction automation.

It seems the thing furthest along this path is Selenium. Initially, I spent some time trying to make it work with Firefox, but gave up. Instead, this seems to work nicely with Chrome via the Chrome WebDriver. And even better, all of this works out of the box on Ubuntu 13.10 via python-selenium and chromium-chromedriver.

Running /usr/lib/chromium-browser/chromedriver2_server from chromium-chromedriver starts a network listener on port 9515. This is the WebDriver API that Selenium can talk to. When requests are made, chromedriver2_server spawns Chrome, and all the interactions happen against that browser.

Since I prefer Python, I avoided the Java interfaces and focused on the Python bindings:

#!/usr/bin/env python
import sys
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys

caps = webdriver.DesiredCapabilities.CHROME

browser = webdriver.Remote("http://localhost:9515", caps)

browser.get("https://bank.example.com/")
assert "My Bank" in browser.title

try:
    elem = browser.find_element_by_name("userid")
    elem.send_keys("username")

    elem = browser.find_element_by_name("password")
    elem.send_keys("wheee my password" + Keys.RETURN)
except NoSuchElementException:
    print "Could not find login elements"
    sys.exit(1)

assert "Account Balances" in browser.title

xpath = "//div[text()='Balance']/../../td[2]/div[contains(text(),'$')]"
balance = browser.find_element_by_xpath(xpath).text

print balance

browser.close()

This would work pretty great, but if you need to save any state between sessions, you’ll want to be able to change where Chrome stores data (since by default in this configuration, it uses an empty temporary directory via --user-data-dir=). Happily, various things about the browser environment can be controlled, including the command line arguments. This is configurable by expanding the “desired capabilities” variable:

caps = webdriver.DesiredCapabilities.CHROME
caps["chromeOptions"] = {
        "args": ["--user-data-dir=/home/user/somewhere/to/store/your/session"],
    }

A great thing about this is that you get to actually watch the browser do its work. However, in cases where this interaction is going to be fully automated, you likely won’t have a Xorg session running, so you’ll need to wrap the WebDriver in one (since it launches Chrome). I used Xvfb for this:

#!/bin/bash
# Start WebDriver under fake X and wait for it to be listening
xvfb-run /usr/lib/chromium-browser/chromedriver2_server &
pid=$!
while ! nc -q0 -w0 localhost 9515; do
    sleep 1
done

the-chrome-script
rc=$?

# Shut down WebDriver
kill $pid

exit $rc

Alternatively, all of this could be done in the python script too, but I figured it’s easier to keep the support infrastructure separate from the actual test script itself. I actually leave the xvfb-run call external too, so it’s easier to debug the browser in my own X session.

One bug I encountered was that the WebDriver’s cache of the browser’s DOM can sometimes get out of sync with the actual browser’s DOM. I didn’t find a solution to this, but managed to work around it. I’m hoping later versions fix this. :)

© 2013, Kees Cook. This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.
Creative Commons License

Share and Enjoy:
  • Digg
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • Technorati
  • Identi.ca
  • Reddit
  • Twitter

2 Comments »

  1. You know about phantomjs, right? Headless webkit. Way easier than automating an actual graphical browser window with selenium. And you may find casperjs useful for writing automation scripts.

    Comment by sil — 12/21/2013 @ 10:54 am

  2. I second for PhantomJS. Super fast (its Chrome without the… chrome), running it doesn’t take over your browser and you don’t see windows stealing focus. The last version are very stable. You can even run it on a server!

    It support a selenium mode if you still want to use selenium and your existing python script (so its a drop-in replacement for chrome). And as sil said, you can also script it directly with javascript.

    Comment by Laurent B-Roy — 12/23/2013 @ 6:37 am

RSS feed for comments on this post. TrackBack URL

Leave a comment

Powered by WordPress