Friday, 31 May 2013

Why Web Scraping Software Won't Help


How do you get a continuous stream of data from websites without getting blocked? Scraping logic depends on the HTML the web server sends in response to page requests; if anything changes in that output, it is most likely going to break your scraper setup.

If you are running a website that depends on continuously updated data from other websites, it can be dangerous to rely on software alone.

Some of the challenges you should think about:

1. Webmasters keep changing their websites to make them more user friendly and better looking, which in turn breaks a scraper's delicate data extraction logic (see the sketch after this list).

2. IP address blocking: if you keep scraping a website from your office around the clock, your IP address is going to get blocked by its "security guards" one day.

3. Websites are increasingly using better ways to deliver data, such as Ajax and client-side web service calls, making it harder and harder to scrape data from them. Unless you are an expert in programming, you will not be able to get the data out.

4. Think of a situation where your newly set up website has started flourishing and suddenly the dream data feed you relied on stops. In today's market of abundant alternatives, your users will switch to a service that is still serving them fresh data.
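
To make challenge 1 concrete, here is a minimal sketch in C# (the URL and the markup pattern are invented for illustration). The regex is welded to one exact HTML shape; the day the webmaster renames a class or restructures the page, the match fails and the data silently stops flowing.

    using System;
    using System.Net.Http;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    class FragileScraper
    {
        static async Task Main()
        {
            using var client = new HttpClient();
            // Hypothetical target page, used purely for illustration.
            var html = await client.GetStringAsync("https://example.com/stock-quote");

            // This pattern only works while the site keeps this exact markup.
            var match = Regex.Match(html, @"<span class=""quote-price"">([\d.]+)</span>");
            if (match.Success)
                Console.WriteLine($"Price: {match.Groups[1].Value}");
            else
                // A harmless-looking redesign lands here and the feed dries up.
                Console.WriteLine("Scraper broken: markup changed.");
        }
    }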

Getting over these challenges

Let experts help you: people who have been in this business for a long time and serve clients day in and day out. They run their own servers, which are there to do just one job, extract data. IP blocking is no issue for them, as they can switch servers in minutes and get the scraping exercise back on track.


Source: http://ezinearticles.com/?Why-Web-Scraping-Software-Wont-Help&id=4550594

Wednesday, 29 May 2013

Beating Scraper Sites

I've gotten a few emails recently asking me about scraper sites and how to beat them. I'm not sure anything is 100% effective, but you can probably use them to your advantage (somewhat). If you're unsure about what scraper sites are:

A scraper site is a website that pulls all of its information from other websites using web scraping. In essence, no part of a scraper site is original. A search engine is not an example of a scraper site. Sites such as Yahoo and Google gather content from other websites and index it so you can search the index for keywords. Search engines then display snippets of the original site content which they have scraped in response to your search.

In the last few years, due to the advent of the Google AdSense web advertising program, scraper sites have proliferated at an amazing rate for spamming search engines. Open content sites such as Wikipedia are a common source of material for scraper sites.

from the main article at Wikipedia.org

Now it should be noted that having a vast array of scraper sites hosting your content may lower your rankings in Google, as you are sometimes perceived as spam. So I recommend doing everything you can to prevent that from happening. You won't be able to stop every scraper, but you'll be able to benefit from the ones you don't.

Things you can do:

Include links to other posts on your site in your posts.

Include your blog name and a link to your blog on your site.

Manually whitelist the good spiders (Google, MSN, Yahoo, etc.).

Manually blacklist the bad ones (scrapers).

Automatically block visitors that request too many pages at once.

Automatically block visitors that disobey robots.txt.

Use a spider trap: you have to be able to block access to your site by IP address, which is done through .htaccess (I do hope you're using a Linux server). Create a new page that logs the IP address of anyone who visits it (don't set up banning yet, if you see where this is going). Then set up your robots.txt to disallow that page. Next you must place a link to it in one of your pages, but hidden where a normal user will not click it; use a table set to display:none or something similar. Now wait a few days, as the good spiders (Google etc.) have a cache of your old robots.txt and could accidentally ban themselves; wait until they have the new one before enabling autobanning. Track the progress on the page that collects IP addresses. When you feel confident (and have added all the major search spiders to your whitelist for extra protection), change that page so it logs and automatically bans each IP that views it, redirecting them to a dead-end page. That should take care of quite a few of them.
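
For anyone who wants to see the moving parts, here is a minimal sketch of the two files involved. The trap path and IP addresses are placeholders, and the Apache 2.2-style directives assume your host allows .htaccess overrides.

    # robots.txt -- well-behaved spiders are told to stay away from the trap
    User-agent: *
    Disallow: /spider-trap.html

    # .htaccess -- the trap page appends a Deny line for each IP it logs
    Order Allow,Deny
    Allow from all
    Deny from 203.0.113.7
    Deny from 198.51.100.42

The hidden link itself can be as simple as an anchor to /spider-trap.html inside a display:none container: a scraper harvesting every href will follow it, while a human reader never sees it.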


Source: http://ezinearticles.com/?Beating-Scraper-Sites&id=597843


Saturday, 18 May 2013

Website Data Scraping

Website data scraping is one of the fastest growing sectors of web data scraping, web data mining and mailing database development.

Website data scraping also plays a crucial role in data scraping, data extraction and data mining. Getting the best value out of a scraped database is the main concern of anybody wishing to use it. However, not all outsourcing providers offering data extraction and website data scraping services deliver them with outstanding quality. That is why it is important to be careful when looking for data extraction and website data scraping services.

Website data scraping is not as easy a task as people generally think. You will face many complexities when you go for bulk data scraping from a particular site. The most common problem is IP blocking, in which the hosting website blocks your IP so you can no longer access it; another problem is duplication. In the case of duplication, you need to process the database and remove the duplicated records. If you go for a database of higher volume, such as 2 million records, then checking for duplicates and eliminating them is a really tough task.
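
As an illustration of the deduplication step, here is a minimal C# sketch (the file names and the choice of key column are assumptions). A HashSet keyed on a normalised field lets you drop duplicates from millions of records in a single pass:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class Dedupe
    {
        static void Main()
        {
            // Records scraped earlier, one per line, e.g. "name,phone,city".
            var seen = new HashSet<string>();
            using var reader = new StreamReader("scraped.csv");
            using var writer = new StreamWriter("deduped.csv");

            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Deduplicate on the second column (assumed to be the phone number).
                var key = line.Split(',').ElementAtOrDefault(1)?.Trim().ToLowerInvariant();
                // Add returns false when the key was already seen, so duplicates are skipped.
                if (key != null && seen.Add(key))
                    writer.WriteLine(line);
            }
        }
    }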

We have experienced professionals who handle complex projects and deliver a high quality database, and we can provide the final output in any format you require, such as Excel, Access, CSV or MySQL.

The most commonly scraped websites are listed below:

    Data scraping from yellow pages, google map
    Data scraping from yell, yelp, citysearch, freeindex
    Data scraping from white pages, google local business
    Data scraping from super pages, linkedin, twitter
    Data scraping from ebay, amazon, shopping websites
    Data scraping from whois, exhibitor online, clicksmart, bluebook
    Data scraping from youtube, monster, jobfinder, alexa, domaintools
    Data scraping from carfinder, myshopping, 1startgallery, nahc
    Data scraping from gumtree, kijiji, backpage, 1800dentists, lawyer
    Data scraping from 123notary, spafinder, eat2eat, clickbank, chefmoz
    Data scraping from tripadvisor, bookit, futurashop, efollett
    Data scraping from zoominfo, hotfrog, uscity, hoovers, switchboard, b2bindex

Source: http://www.housedom.com/data-scraping

Thursday, 16 May 2013

.NET & C# Website Scraper

This is a website scraper implemented in C# and .NET that uses regular expressions (regex) to scrape only the text directly related to the website content.

This website scraper was developed by Kevin Rio, a Miami website developer, for the purpose of collecting information and content from affiliate sites, such as e-commerce applications that do not provide an API or an externally available database. It is also useful for SEO professionals evaluating competitor websites.

This is a completely free website scraper that can be used in any commercial or private application.

What This Website Scraper Does

This web scraper collects a website address as input from the user and retrieves all of the HTML code from that site. It then filters all of the HTML and JavaScript code using regular expressions and leaves only the website’s text for the user to explore. It provides the text in an easy to copy text area and a variable that is easy to manipulate and extend in your own scripts.
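
The download itself is the authoritative version; the following is just a minimal C# sketch of the same approach, not Kevin Rio's code: fetch the HTML, strip script and style blocks and then every remaining tag with regular expressions, and keep the visible text.

    using System;
    using System.Net.Http;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    class TextScraper
    {
        static async Task Main()
        {
            // Replace with the website address supplied by the user.
            var url = "https://example.com/";
            using var client = new HttpClient();
            var html = await client.GetStringAsync(url);

            // Remove script and style blocks first, then every remaining tag.
            var text = Regex.Replace(html, @"<(script|style)[^>]*>.*?</\1>", "",
                                     RegexOptions.Singleline | RegexOptions.IgnoreCase);
            text = Regex.Replace(text, @"<[^>]+>", " ");
            // Collapse the whitespace left behind by the removed markup.
            text = Regex.Replace(text, @"\s+", " ").Trim();

            Console.WriteLine(text);
        }
    }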

The download provides all of the files including the Visual Studio solution file.

Source: http://www.krio.me/dot-net-c-sharp-web-scraper/

Sunday, 5 May 2013

Website Scraper

A website scraper is a software program that performs the whole process of web scraping. Web scraping usually means extracting unstructured or badly structured data from the Internet and storing it in almost any format on your computer. Website scrapers are very popular nowadays among businessmen (especially those involved in e-commerce, real estate, automotive, retail and many other fields of business) and Internet users all over the world. They help to get the necessary information in time and to analyze it.

A website scraper collects the necessary data from target websites. It can be used even by law enforcement professionals to extract data for use in court; businessmen can use website scrapers for marketing research or business intelligence; entrepreneurs use such programs to harvest e-mails and monitor business trends; market specialists can use them to analyze market trends and keep reports.

The WebSundew website scraper is a data extraction tool and one of the best examples of web scraping programs on the market. It is offered as a family of four editions (Lite, Standard, Professional and Enterprise), all designed to meet the needs of users and make the copy-and-paste process as easy and comfortable as possible.

If you're looking for a website scraper, have a look at the WebSundew web scraping tool. It copies all the required data from a particular website to your computer in no time and with no effort. You need no special programming skills to work with the program; all you need is a computer and a licensed copy of WebSundew.

Source: http://www.websundew.com/website-scraper

Friday, 3 May 2013

Word of the Day: Scraper Website

A scraper website is a web site that copies content from another website. Occasionally, just a single page is ‘scraped’ and used illegally on another website.

Scraper sites often add their own ads to the copied web pages after deleting the ad code from them. Scraper websites will often hit on popular news stories and try to get placed on top-ranked search results pages. Sometimes the pages are copied carelessly and contain broken links or incorrect directory paths to the photos and other graphics located on the original website’s server. When this happens, the photos or graphics are missing from the scraper site.


Occasionally the scraper website producer will change the page slightly to conform to other original parts of the scraper website. Two screen shots in the source article illustrate scraping. The first shows an original article, ‘Elliptical vs. treadmill: Which machine really delivers?’, published by the Daily Herald on March 9, 2009. The second shows a scraper page from Gold’s Gym in Middletown, New York, ‘How does the elliptical really compare to the treadmill’, published May 20, 2009. The text is almost identical, but if you read it you can see that one of the names in the articles is attributed to a different geographic location: ‘Arlington Heights Personal Trainer Mark Bostrom’ is changed to ‘Gold’s Gym Personal Trainer Mark Bostrom.’ Some other names in the pages posted on the web are changed similarly.

A scraper site’s use of information from other sites without permission violates copyright law, unless the copied content is in the public domain.

Source: http://www.arlingtoncardinal.com/2010/11/word-of-the-day-scraper-website/

Note:

Justin Stephens is an experienced web scraping consultant who writes articles on screen scraping services, website scrapers, Yellow Pages scraping, Amazon data scraping and product information scraping.

Rise Above the Tedious Tasks Using Web Scrapers

There is a huge amount of data available through websites. However, most people have found that copying data from a website directly into a working spreadsheet or database can be a tiresome process. Manual data entry from internet sources can quickly become too expensive as the required hours add up. Clearly, an automated method for gathering information from HTML-based websites can offer significant cost savings.

Web scrapers are computer programs that can collect information directly from the internet. They are capable of browsing the web, assessing the contents of a particular site, and pulling data points into a structured spreadsheet or database. Many companies use such program packages to get the information they require, for instance to evaluate prices, track changes to online content and perform online research. Now let’s look at how web scrapers might support data collection for a wide range of purposes.

Improving on Manual Entry Methods
Using a computer’s copy and paste functionality, or simply retyping text from a website, is inefficient and costly. A web scraper can browse through a series of websites, decide what data is important, and copy that information into a structured database, spreadsheet or other program. Some software packages let a user record a macro by performing a series of actions once and then having the computer remember and automate them.

Every user can effectively act as their own programmer, expanding their capacity to process websites. These applications can also interface with databases and spreadsheets to manage information automatically as it is copied from a website.
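
As a minimal sketch of that website-to-spreadsheet idea in C# (the URLs and the markup pattern are made up), a scraper can pull one data point per page and append it to a CSV file that any spreadsheet opens directly:

    using System;
    using System.IO;
    using System.Net.Http;
    using System.Text.RegularExpressions;
    using System.Threading.Tasks;

    class CsvScraper
    {
        static async Task Main()
        {
            // Hypothetical product pages; in practice the list would come
            // from a sitemap or a crawl.
            var urls = new[]
            {
                "https://example.com/products/1",
                "https://example.com/products/2",
            };

            using var client = new HttpClient();
            using var csv = new StreamWriter("products.csv");
            csv.WriteLine("url,price");

            foreach (var url in urls)
            {
                var html = await client.GetStringAsync(url);
                // The price markup here is an assumption for illustration.
                var match = Regex.Match(html, @"<span class=""price"">([^<]+)</span>");
                if (match.Success)
                    csv.WriteLine($"{url},{match.Groups[1].Value.Trim()}");
            }
        }
    }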

Aggregating Information
There are many occasions where material published on websites can be collected and reused. For instance, a clothing company looking to bring its clothing line to retailers can go online for the contact information of sales personnel to generate leads. Businesses can carry out market research on prices and product availability by studying online catalogs.

Data Management
Figures and numbers are managed best in databases and spreadsheets. However, information formatted with HTML on a website is not readily accessible for those purposes. While websites are excellent for displaying facts and figures, they fall short when the data has to be analyzed, stored or otherwise manipulated.

Ultimately, web scrapers can take output intended for presentation to a person and convert it into numbers that can be used by a computer. In addition, by automating this process with software applications and macros, entry costs are severely reduced.
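
A tiny C# sketch of that presentation-to-number conversion (the scraped string is a made-up example):

    using System;
    using System.Globalization;
    using System.Text.RegularExpressions;

    class PresentationToNumber
    {
        static void Main()
        {
            // Text as a page presents it to a human reader.
            var scraped = "Price: $1,299.99 (free shipping)";

            // Pull out the numeric part and parse it into a machine-usable value.
            var match = Regex.Match(scraped, @"[\d,]+\.?\d*");
            if (match.Success)
            {
                var value = decimal.Parse(match.Value,
                    NumberStyles.AllowThousands | NumberStyles.AllowDecimalPoint,
                    CultureInfo.InvariantCulture);
                Console.WriteLine(value); // prints 1299.99
            }
        }
    }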

This type of data management is also effective at merging different sources of information. If a company obtains statistical or research information, it can be scraped and formatted into a spreadsheet or database. This is also highly efficient for taking the contents of legacy systems and incorporating them into today’s systems. Overall, a web scraper is a cost-efficient tool for data management and manipulation.

Source: http://www.locfinder.net/rise-above-from-the-tedious-tasks-using-web-scrapers/


Wednesday, 1 May 2013

How To Get Your Content Removed From Scraper Sites

The past couple of years have seen an increasing number of algorithmic penalties handed out by Google, which have drastically changed the look of its search results. First we had Google Panda, which algorithmically lowers the search visibility of sites filled with low quality content, whether scraped or spun content or just thin pages full of boilerplate text and ads. Then in April 2012 we had Google Penguin, which algorithmically lowers the search visibility of sites with links from low quality content and heavy exact-match anchor text.

I’m not going to get into a debate about whether these updates were a good or a bad thing, but with the recent news that Panda updates will now be rolled out automatically rather than manually, and that Penguin 2.0 is on the horizon, it’s fair to assume these spam fighters are here to stay.

Since the first iterations of Panda I have had clients whose blogs were being outranked by websites that had scraped their content and were not being attributed by Google as the original source of the document. I am not talking about syndication, which is a normal part of the web where an agreement exists between webmasters, but about people who steal your content without any permission or attribution. Now that we have Penguin in the mix, I am seeing websites that steal your content and then link back to you causing problems too.

Many webmasters will add internal links to pages on their sites with the anchor text they want those pages to rank for. This in itself is not a huge issue; however, if a few dozen scraper sites steal your content and then link to your web pages using the same anchor text, you can soon find that your site is in trouble.

Cleaning Up Scraped Content

When people come to me looking for help with a ranking penalty, I handle it in a very process-oriented way. I’m sure other people have their own opinions on the steps to take to remove pages that are stealing your content.

This is the exact process I use for cleaning up scraped content for my clients.

1. Find Out Who Is Scraping Your Website

There are a number of paid and free tools out there which you can begin to use to find out who is copying your content. I always suggest you start by checking the links to your website.

Google Webmaster Tools is entirely free and provides you with a list of links pointing to your website; Bing also provides free access to the links it finds pointing to your site. If not enough links are shown in Webmaster Tools, it might be worth investing in a link analysis package such as Ahrefs or MajesticSEO to help you. I like to download the links into an Excel file or Google Docs.

The next step is to copy a unique sentence from a web page with a lot of links to it and then search for that sentence in Google with quotes.

If someone is copying your content, make a note of it on your spreadsheet and move on to the next one. It is also a good idea to note any contact pages or email addresses.

(This is a labour-intensive process, so it may be more cost effective to hire an outsourcer on oDesk or Elance to do this data entry for you.)
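
If you want to semi-automate the quoted search, a small C# sketch like this (the sentences are placeholders) can generate the search URLs to open in a browser:

    using System;

    class QuotedSearch
    {
        static void Main()
        {
            // Unique sentences copied verbatim from your own pages.
            var sentences = new[]
            {
                "A distinctive sentence from one of my posts",
                "Another unique sentence from a popular page",
            };

            foreach (var s in sentences)
            {
                // Wrapping the sentence in quotes asks Google for exact matches.
                var query = Uri.EscapeDataString("\"" + s + "\"");
                Console.WriteLine($"https://www.google.com/search?q={query}");
            }
        }
    }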

2. Make Contact with the Scrapers

I always try to contact the webmaster first. If it is a high quality site and a writer has stolen my article to pass off as their own work, I might ask the webmaster to remove the content in question, or to replace it with a new version rewritten by me. You can easily gauge the quality of a site with a plugin such as SEOQuake.

If you can’t find an email address, social media accounts or a contact form on the website concerned, use a whois lookup tool to find their contact information.

3. File a DMCA Request

I always use this option as a last resort, mainly because I believe in being a good web citizen. If people are doing something wrong, I find it is better to ask them to stop, or to try to educate them, than to hit them with legal notices straight away. Some webmasters, however, will refuse or even ignore you; if that is the case, I will file a complaint with their web host and/or Google.

The best tool I have found for identifying a site’s web host is whoishostingthis.com. Simply enter the domain of the site concerned into the search box and in a matter of seconds it will give you the hosting provider’s name and web address so you can raise a DMCA complaint.

In your DMCA request, make sure you provide details of the web page you want removed (the one that has stolen your content), the original page it was copied from, and any attempts you have made to resolve the issue with the webmaster directly.

Many web hosts will take action within a matter of hours and in some cases they will remove the whole site until the scraped content is removed.

Source: http://www.webmaster-success.com/how-to-get-your-content-removed-from-scraper-sites/

Note:

Roze Tailer is an experienced web scraping consultant who writes articles on screen scraping services, website scrapers, Yellow Pages scraping, Amazon data scraping and product information scraping.