Website Scraper: April 2013

Sunday, 28 April 2013

OutWit Hub: Web-scraping made easy

I read a blog post earlier this term on web scraping and decided to check it out. I started with the suggested software and quickly realized that only a few really good web-scraping tools are available that also support Mac OS. So, after reading a few reviews, I landed on OutWit Hub.

OutWit Hub has two versions: Basic and Pro. The difference is in the available tools. In Basic, the "words" tool isn't available; this tool lets you see the frequency of any word as it occurs on the page you are currently viewing. Several of the scraping tools are disabled as well. I've upgraded to Pro; it's only $60 per year, and I was curious to see what else it can do.

I'm not a computer scientist, by a long shot, but I have a general grasp of coding and how computers operate. For this reason, I really like OutWit Hub. The tutorials on their site are incredible. They walk you through examples, and you can interact with the UI while the tutorial is running. Also, a lot of the tools are pretty intuitive to use. If you're not sold on getting the Pro version, I'd encourage you to visit their website and download the free version just to check out the tutorials. They're really great.

I've tried the software on several examples just to test it. I needed to get all of the emails off of an organization's website, so instead of copy/pasting everything and praying for the best, I used the "email" feature in OutWit, and the names and emails of every member on the page populated an exportable table. #boom

Then, I wanted to see if it could be harnessed for Twitter and Facebook. So, using the source-code approach to scraping, I was able to extract text from the loaded parts of my Twitter and Facebook feeds. I ran into two problems: I didn't know enough about the coding to make the scraper dynamic enough to work through pages that hadn't loaded yet, and I didn't know how to automate it to build a larger dataset (i.e. run the scraper over a set amount of time by continuously reloading the page and harvesting the data). It's possible; I just didn't figure it out.

So, I've recorded a video tutorial on how to use OutWit Hub Pro's scraper feature to scrape the loaded part of your Facebook news feed. Below are the written instructions, and the video at the bottom gives you the visual.

Essentially, you will:
1.) Launch OutWit Hub (presuming you've downloaded and upgraded to Pro).
2.) Log in to your profile on Facebook.
3.) Take note of whatever text you want to capture, as a reference point for when you look in the code. This assumes you don't know how to read HTML. For example, if the first post on your news feed says "Hey check out this video!", then take note of that statement.
4.) Click the "scrapers" item on the left side of the screen.
5.) In the search window, type in the text "Hey check out this video" and observe the indicators in the code that mark the beginning and end of that text.
6.) In the window below the code, click the "New" button.
7.) Type in a name for the scraper.
8.) Click the checkbox in row 1 of the window.
9.) Enter a title/description for the information you're collecting in the first column. Using the same example: "Stuff friends say on FB" or "Text". It really only matters if you're going to be extracting other data from the same page and want to keep it separate.
10.) Under the "Marker Before" column, type in the HTML code that you identified as the beginning of the data that you want to extract.
11.) Repeat step 10 for the next column, using the HTML code that you identified as the end of the data.
12.) Click "Execute".
13.) Your data is now available for export in several formats: CSV, Excel, SQL, HTML, TXT.
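
For readers who want to see what the before/after markers are doing, here is a rough Python sketch of the same idea. It is not OutWit Hub's implementation; the function name extract_between and the sample markup are made up for illustration, and real Facebook source looks quite different.

```python
import re


def extract_between(source, marker_before, marker_after):
    """Return every piece of text found between the two markers."""
    # Escape the markers so they are matched literally, then capture
    # the shortest possible run of text between them.
    pattern = re.escape(marker_before) + r"(.*?)" + re.escape(marker_after)
    return [match.strip() for match in re.findall(pattern, source, flags=re.DOTALL)]


# Hypothetical page source, used only to demonstrate the call.
sample = (
    '<span class="msg">Hey check out this video!</span>'
    '<span class="msg">Nice!</span>'
)
rows = extract_between(sample, '<span class="msg">', "</span>")
print(rows)
```

Each entry in rows corresponds to one row of the table that the scraper would produce.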

Here is a YouTube video of me using it to extract and display comments made by my Facebook friends that appeared on my news feed.

Source: http://auburnbigdata.blogspot.in/2013/04/outwit-hub-web-scraping-made-easy.html

Note:

Roze Tailer is an experienced web scraping consultant and writes articles on screen scraping services, website scrapers, Yellow Pages scraping, Amazon data scraping and product information scraping.

Building A Concurrent Web Scraper With Python

Another week, and during my internet travels I stumbled upon a blog post by Aditya Bhargava titled “Building a concurrent web scraper with haskell”. I’m not a Haskell programmer and my experience with it is extremely limited. Most of the post read like a cryptic magic spell!

Anyway, it was still an interesting read and has inspired me to try my hand at writing something similar. So I reached out, and the quickest thing to hand was Python. On Linux, just fire up a terminal, drop into your favorite text editor, and away you go.

The Process

It is usually a good idea to think through and plan out the basics of the solution design. Our target is a web page, and our end result should be all images from it downloaded to disk. The target web page contains links to the images we want to scrape. Once we have a list of images, instead of downloading them one at a time, the goal will be to download them “concurrently”.

The entire process can be broken down into the following four distinct steps:

    Get contents of page
    Parse and create a list of image links
    For each link start a new thread
    Download a single image

Now that we have a description of each step (or “function”) we can think about each function’s name and its input and output:

    “from page” url -> html text
    “all links” html text -> list of links
    “in parallel” function, list of links -> start thread
    “download image” link -> file saved to disk

You may have noticed that the output of one function is the input of another. This has been designed so that each function can be fed into the next one. Thus we can (at least in pseudo code) describe the entire program call:

    in parallel ( download image, all links ( from page ( url ) ) )

Voila, we are all done! Erm, OK, perhaps not quite; we just need to create the code for each function:

Let's create the code in the order that the functions are called. First, start off with the parallel call. This function will simply take a “function” to invoke and a list of links. We can iterate through the list and, for each link, start a new thread that calls the function with the link as a parameter.
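
The original listing is not reproduced here, so what follows is only a minimal sketch of the parallel call, assuming Python 3 and the standard threading module. The name in_parallel comes from the pseudo code above; returning the list of threads is an addition so that the caller can wait for them.

```python
import threading


def in_parallel(fn, links):
    """Start one thread per link, each calling fn(link); return the threads."""
    threads = []
    for link in links:
        thread = threading.Thread(target=fn, args=(link,))
        thread.start()
        threads.append(thread)
    return threads
```

Calling join() on each returned thread blocks until every download has finished.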

To get the html contents for a given url we can use Python’s urllib module:
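
A sketch of that step, assuming Python 3, where the urlopen function lives in urllib.request. Decoding as UTF-8 and replacing undecodable bytes is an assumption made to keep the example short.

```python
from urllib.request import urlopen


def from_page(url):
    """Fetch the page at url and return its contents as text."""
    with urlopen(url) as response:
        # Assumes the page is UTF-8; bad bytes are replaced, not raised.
        return response.read().decode("utf-8", errors="replace")
```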

Let’s now create the function that downloads each image. We need to provide a filename. We could parse it out of the link that is passed in, but being a lazy programmer I’ve decided to just create a GUID instead :)

Now all that is left is really the meat of the program: the task of parsing out the image links. To do this I have decided to use regex (shudder; I know, I know, you can’t parse HTML with regex). Being naughty sometimes is OK, right?

Finally, now that all the functions have been created, we can call them and enjoy the images being downloaded concurrently. And what better web page than reddit’s /r/pics?
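
One way this could look, again assuming Python 3. The uuid module supplies the GUID; the ".jpg" extension, the optional directory parameter and the returned path are assumptions added for illustration.

```python
import os
import uuid
from urllib.request import urlopen


def download_image(link, directory="."):
    """Download link and save it under a random GUID-based filename."""
    # The extension is a guess; the real image type is not inspected.
    filename = os.path.join(directory, str(uuid.uuid4()) + ".jpg")
    with urlopen(link) as response, open(filename, "wb") as out:
        out.write(response.read())
    return filename
```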
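
The author's actual pattern is not shown, so the regex below is a guess: it collects the src attribute of every img tag. Like any regex over HTML it will miss unusual markup, and it makes no attempt to resolve relative links.

```python
import re

# Matches <img ... src="..."> with single or double quotes (an assumed pattern).
IMG_SRC = re.compile(r'<img[^>]+src=["\']([^"\']+)["\']', re.IGNORECASE)


def all_links(html):
    """Return the src value of every img tag found in html."""
    return IMG_SRC.findall(html)
```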
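
Putting the pieces together might look like the sketch below. It is a reconstruction under the same assumptions as before (Python 3, an assumed img-tag regex, ".jpg" filenames), not the author's original listing; the scrape wrapper is a hypothetical helper that mirrors the pseudo code call and waits for the threads to finish.

```python
import re
import threading
import uuid
from urllib.request import urlopen


def from_page(url):
    """url -> html text"""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def all_links(html):
    """html text -> list of image links (assumed pattern)"""
    return re.findall(r'<img[^>]+src=["\']([^"\']+)["\']', html, re.IGNORECASE)


def download_image(link):
    """link -> file saved to the current directory under a GUID name"""
    with urlopen(link) as response, open(str(uuid.uuid4()) + ".jpg", "wb") as out:
        out.write(response.read())


def in_parallel(fn, links):
    """function, list of links -> one started thread per link"""
    threads = [threading.Thread(target=fn, args=(link,)) for link in links]
    for thread in threads:
        thread.start()
    return threads


def scrape(url):
    """Download every image linked from url; return how many threads ran."""
    threads = in_parallel(download_image, all_links(from_page(url)))
    for thread in threads:
        thread.join()
    return len(threads)
```

Calling scrape with the address of /r/pics would then start the downloads. Two caveats: the sketch assumes the image links are absolute URLs, and a site may reject requests that arrive with urllib's default headers.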

Conclusion

Programming in Python is fun because the code almost flows in the way you imagine it! Naturally, this simple example is most certainly not production ready. However, the basics are the same. Extending this example would include limiting the number of concurrent threads, adding error handling and perhaps recursively following links on the website. Something for the reader to try out :)
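
As a starting point for the first two extensions, here is one possible sketch using the standard concurrent.futures module. The name in_parallel_limited and the default of four workers are assumptions; the function caps the number of threads and collects exceptions instead of letting a failed download disappear silently.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def in_parallel_limited(fn, links, max_workers=4):
    """Run fn(link) for each link on at most max_workers threads.

    Returns (results, errors): results maps each link to fn's return
    value, errors maps each failing link to the exception it raised.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, link): link for link in links}
        for future in as_completed(futures):
            link = futures[future]
            try:
                results[link] = future.result()
            except Exception as exc:
                errors[link] = exc
    return results, errors
```

The errors dictionary can then be inspected or retried once the pool has finished.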

The full 23 lines of source code are included below. Until next time have fun Pythoning!

Source: http://www.codingninja.co.uk/building-a-concurrent-web-scraper-with-python/

Note: