Friday, 28 November 2014

Scraping R-bloggers with Python – Part 2

In my previous post I showed how to write a small simple python script to download the pages of R-bloggers.com. If you followed that post and ran the script, you should have a folder on your hard drive with 2409 .html files labeled post1.html , post2.html and so forth. The next step is to write a small script that extract the information we want from each page, and store that information in a .csv file that is easily read by R. In this post I will show how to extract the post title, author name and date of a given post and store it in a .csv file with a unique id.

To do this open a document in your favorite python editor (I like to use aquamacs) and name it: extraction.py. As in the previous post we start by importing the modules that we will use for the extraction:

from BeautifulSoup import BeautifulSoup

import os
import re

As in the previous post we will be using the BeautifulSoup module to extract the relevant information from the pages. The os module is used to get a list of file from the directory where we have saved the .html files, and finally the re module allows us to use regular expressions to format the titles that include a comma value or a newline value (\n). We need to remove these as they would mess up the formatting of the .csv file.

After having read in the modules, we need to get a list of files that we can iterate over. First we need to specify the path were the files are saved, and then we use the os module to get all the filenames in the specified directory:

path = "/Users/thomasjensen/Documents/RBloggersScrape/download"

listing = os.listdir(path)

It might be that there are other files in the given directory, hence we apply a filter, in shape of a list comprehension, to weed out any file names that do not match our naming scheme:

listing = [name for name in listing if re.search(r"post\d+\.html",name) != None]

Notice that a regular expression was used to determine whether a given name in the list matched our naming scheme. For more on regular expressions have a look at this site.

The final steps in preparing our extraction is to change the working directory to where we have our .html files, and create an empty dictionary:

os.chdir(path)
data = {}

Dictionaries are one of the great features of Python. Essentially a dictionary is a mapping of a key to a specific value, however the fact that dictionaries can be nested within each other, allows us to create data structures similar to R’s data frames.

Now we are ready to begin extracting information from our downloaded pages. Much as in the previous post, we will loop over all the file names, read each file into Python and create a BeautifulSoup object from the file:

for page in listing:
    site = open(page,"rb")
    soup = BeautifulSoup(site)

In order to store the values we extract from a given page, we update the dictionary with a unique key for the page. Since our naming scheme made sure that each file had a unique name, we simply remove the .html part from the page name, and use that as our key:

key = re.sub(".html","",page)

data.update({key:{}})

This will create a mapping between our key and an empty dictionary, nested within the data dictionary. Once this is done we can start extract information and store it in our newly created nested dictionary. The content we want is located in the main column, which has the id tag “leftcontent” in the html code. To get at this we use find() function on soup object created above:

content = soup.find("div", id = "leftcontent")

The first “h1” tag in our content object contains the title, so again we will use the find() function on the content object, to find the first “h1” tag:

title = content.findNext("h1").text

To get the text within the “h1” tag the .text had been added to our search with in the content object.

To find the author name, we are lucky that there is a class of “div” tags called “meta” which contain a link with the author name in it. To get the author name we simply find the meta div class and search for a link. Then we pull out the text of the link tag:

author = content.find("div",{"class":"meta"}).findNext("a").text

Getting the date is a simple matter as it is nested within div tag with the class “date”:

date = content.find("div",{"class":"date"}).text

Once we have the three variables we put them in dictionaries that are nested within the nested dictionary we created with the key:

data[key]["title"] = title
data[key]["author"] = author
data[key]["date"] = date

Once we have run the loop and gone through all posts, we need to write them in the right format to a .csv file. To begin with we open a .csv file names output:

output = open("/Users/thomasjensen/Documents/RBloggersScrape/output.csv","wb")

then we create a header that contain the variable names and write it to the output.csv file as the first row:

variables = unicode(",".join(["id","date","author","title"]))
header = variables + "\n"
output.write(header.encode("utf8"))

Next we pull out all the unique keys from our dictionary that represent individual posts:

keys = data.keys()

Now it is a simple matter of looping through all the keys, pull out the information associated with each key, and write that information to the output.csv file:

for key in keys:
    print key
    id = key
    date = re.sub(",","",data[key]["date"])
    author = data[key]["author"]
    title = re.sub(",","",data[key]["title"])
    title = re.sub("\\n","",title)
    linelist = [id,date,author,title]
    linestring = unicode(",".join(linelist))
    linestring = linestring + "\n"
    output.write(linestring.encode("utf-8"))

Notice that we first create four variables that contain the id, date, author and title information. With regards to the title we use two regular expressions to remove any commas and “\n” from the title, as these would create new columns or new line breaks in the output.csv file. Finally we put the variables together in a list, and turn the list into a string with the list items separated by a comma. Then a linebreak is added to the end of the string, and the string is written to the output.csv file. As a last step we close the file connection:

output.close()

And that is it. If you followed the steps you should now have a csv file in your directory with 2409 rows, and four variables – ready to be read into R. Stay tuned for the next post which will show how we can use this data to see how R-bloggers has developed since 2005. The full extraction script is shown below:

from BeautifulSoup import BeautifulSoup

import os
import re

 path = "/Users/thomasjensen/Documents/RBloggersScrape/download"
 listing = os.listdir(path)

listing = [name for name in listing if re.search(r"post\d+\.html",name) != None]
 os.chdir(path)
 data = {}
 for page in listing:
site = open(page,"rb")
soup = BeautifulSoup(site)
key = re.sub(".html","",page)
print key
data.update({key:{}})
 content = soup.find("div", id = "leftcontent")
title = content.findNext("h1").text
author = content.find("div",{"class":"meta"}).findNext("a").text
date = content.find("div",{"class":"date"}).text
data[key]["title"] = title
data[key]["author"] = author
data[key]["date"] = date

 output = open("/Users/thomasjensen/Documents/RBloggersScrape/output.csv","wb")

 keys = data.keys()
 variables = unicode(",".join(["id","date","author","title"]))
 header = variables + "\n"
 output.write(header.encode("utf8"))
 for key in keys:
print key
id = key
date = re.sub(",","",data[key]["date"])
author = data[key]["author"]
title = re.sub(",","",data[key]["title"])
title = re.sub("\\n","",title)
linelist = [id,date,author,title]
linestring = unicode(",".join(linelist))
linestring = linestring + "\n"
output.write(linestring.encode("utf-8"))
 output.close()

Source:http://www.r-bloggers.com/scraping-r-bloggers-with-python-part-2/

Wednesday, 26 November 2014

Data Mining and Frequent Datasets

I've been doing some work for my exams in a few days and I'm going through some past papers but unfortunately there are no corresponding answers. I've answered the question and I was wondering if someone could tell me if I am correct.

My question is

    (c) A transactional dataset, T, is given below:
    t1: Milk, Chicken, Beer
    t2: Chicken, Cheese
    t3: Cheese, Boots
    t4: Cheese, Chicken, Beer,
    t5: Chicken, Beer, Clothes, Cheese, Milk
    t6: Clothes, Beer, Milk
    t7: Beer, Milk, Clothes

    Assume that minimum support is 0.5 (minsup = 0.5).

    (i) Find all frequent itemsets.

Here is how I worked it out:

    Item : Amount
    Milk : 4
    Chicken : 4
    Beer : 5
    Cheese : 4
    Boots : 1
    Clothes : 3

Now because the minsup is 0.5 you eliminate boots and clothes and make a combo of the remaining giving:

    {items} : Amount
    {Milk, Chicken} : 2
    {Milk, Beer} : 4
    {Milk, Cheese} : 1
    {Chicken, Beer} : 3
    {Chicken, Cheese} : 3
    {Beer, Cheese} : 2

Which leaves milk and beer as the only frequent item set then as it is the only one above the minsup?

data mining

Nanor

3 Answers

There are two ways to solve the problem:

    using Apriori algorithm
    Using FP counting

Assuming that you are using Apriori, the answer you got is correct.

The algorithm is simple:

First you count frequent 1-item sets and exclude the item-sets below minimum support.

Then count frequent 2-item sets by combining frequent items from previous iteration and exclude the item-sets below support threshold.

The algorithm can go on until no item-sets are greater than threshold.

In the problem given to you, you only get 1 set of 2 items greater than threshold so you can't move further.

There is a solved example of further steps on Wikipedia here.

You can refer "Data Mining Concepts and Techniques" by Han and Kamber for more examples.

141

There is more than two algorithms to solve this problem. I will just mention a few of them: Apriori, FPGrowth, Eclat, HMine, DCI, Relim, AIM, etc. –  Phil Mar 5 '13 at 7:18

OK to start, you must first understand, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cuts costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Now, the amount of raw data stored in corporate databases is exploding. From trillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!) For instance, every day, Wal-Mart uploads 20 million point-of-sale transactions to an A&T massively parallel system with 483 processors running a centralized database.

Raw data by itself, however, does not provide much information. In today's fiercely competitive business environment, companies need to rapidly turn these terabytes of raw data into significant insights into their customers and markets to guide their marketing, investment, and management strategies.

Now you must understand that association rule mining is an important model in data mining. Its mining algorithms discover all item associations (or rules) in the data that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. Minsup controls the minimum number of data cases that a rule must cover. Minconf controls the predictive strength of the rule.

Since only one minsup is used for the whole database, the model implicitly assumes that all items in the data are of the same nature and/or have similar frequencies in the data. This is, however, seldom the case in real- life applications. In many applications, some items appear very frequently in the data, while others rarely appear. If minsup is set too high, those rules that involve rare items will not be found. To find rules that involve both frequent and rare items, minsup has to be set very low.

This may cause combinatorial explosion because those frequent items will be associated with one another in all possible ways. This dilemma is called the rare item problem. This paper proposes a novel technique to solve this problem. The technique allows the user to specify multiple minimum supports to reflect the natures of the items and their varied frequencies in the database. In rule mining, different rules may need to satisfy different minimum supports depending on what items are in the rules.

Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf).

I hope that once you understand the very basics of data mining that the answer to this question shall become apparent.

1

The Apriori algorithm is based on the idea that for a pair o items to be frequent, each individual item should also be frequent. If the hamburguer-ketchup pair is frequent, the hamburger itself must also appear frequently in the baskets. The same can be said about the ketchup.

So for the algorithm, it is established a "threshold X" to define what is or it is not frequent. If an item appears more than X times, it is considered frequent.

The first step of the algorithm is to pass for each item in each basket, and calculate their frequency (count how many time it appears). This can be done with a hash of size N, where the position y of the hash, refers to the frequency of Y.

If item y has a frequency greater than X, it is said to be frequent.

In the second step of the algorithm, we iterate through the items again, computing the frequency of pairs in the baskets. The catch is that we compute only for items that are individually frequent. So if item y and item z are frequent on itselves, we then compute the frequency of the pair. This condition greatly reduces the pairs to compute, and the amount of memory taken.

Once this is calculated, the frequencies greater than the threshold are said frequent itemset.

Source: http://stackoverflow.com/questions/14164853/data-mining-and-frequent-datasets?rq=1

Wednesday, 19 November 2014

Online Data Entry & Web Scraping Services

To operate any type of organization smoothly, it is essential to have precise data that is accurate and reliable. When your business expands, data entry on an ongoing basis is a tedious job. It’s a very time consuming task that can often distract employees focusing on core business areas.

Webpop offers all forms of online data entry services that are quick and accurate. We provide data entry services across all verticals that can be completely customized to your business requirements.

Database Population Services

Database population involves content collection from various database sources. This requires a lot of attention to detail, dedication and awareness and can prove a formidable task, especially for websites that largeley depend on it.

Webpop offer a quick and efficient database population service that helps relieve the stress from an extremely laborius task and leaves you more time to focus on more important aspects of your business. By investing just a fraction of the cost, you can outsource your database population tasks to us.

Web Scraping Services

Webpop have been assisting clients in searching, extracting and collecting data from the web for the past 5 years using the latest techniques in web scraping techology. We can scrape all types of information from a variety of sources such as websites, blogs, online directories, e-commerce websites and podcasts to name a few. We use a varied selection of automated and manual web scraping technologies to extract, gather and collect all of the required data you require from any chosen website(s) on the World Wide Web.

We can simplify the whole process from collection to population, converting your scraped data in to structured formats that are applicable to your website. This can be offered as a one time service or an ongoing basis that will assist you in constantly keeping your website’s content fresh and up to date. We can crawl competitors websites, gather sales leads, product details, pricing methodologies and also creat custom campaigns to suit your project’s requirements.

Over the years Webpop has grown from strength-to-strength by providing all types of data entry, database population and web scraping services. All of our data entry services are performed with care, due dilligence and attention to detail. We enjoy a challenge and pride ourselves on delivering results whilst working on precarious projects that require precision and total commitment.

Source:http://www.webpopdesign.com/services/data-entry/

Monday, 17 November 2014

Kimono Is A Smarter Web Scraper That Lets You “API-ify” The Web, No Code Required

A new Y Combinator-backed startup called Kimono wants to make it easier to access data from the unstructured web with a point-and-click tool that can extract information from webpages that don’t have an API available. And for non-developers, Kimono plans to eventually allow anyone track data without needing to understand APIs at all.

This sort of smarter “web scraper” idea has been tried before, and has always struggled to find more than a niche audience. Previous attempts with similar services like Dapper or Needlebase, for example, folded. Yahoo Pipes still chugs along, but it’s fair to say that the service has long since been a priority for its parent company.

But Kimono’s founders believe that the issue at hand is largely timing.

“Companies more and more are realizing there’s a lot of value in opening up some of their data sets via APIs to allow developers to build these ecosystems of interesting apps and visualizations that people will share and drive up awareness of the company,” says Kimono co-founder Pratap Ranade. (He also delves into this subject deeper in a Forbes piece here). But often, companies don’t know how to begin in terms of what data to open up, or how. Kimono could inform them.

Plus, adds Ranade, Kimono is materially different from earlier efforts like Dapper or Needlebase, because it’s outputting to APIs and is starting off by focusing on the developer user base, with an expansion to non-technical users planned for the future. (Meanwhile, older competitors were often the other way around).

The company itself is only a month old, and was built by former Columbia grad school companions Ranade and Ryan Rowe. Both left grad school to work elsewhere, with Rowe off to Frog Design and Ranade at McKinsey. But over the nearly half-dozen or so years they continued their careers paths separately, the two stayed in touch and worked on various small projects together.

One of those was Airpapa.com, a website that told you which movies were showing on your flights. This ended up giving them the idea for Kimono, as it turned out. To get the data they needed for the site, they had to scrape data from several publicly available websites.

“The whole process of cleaning that [data] up, extracting it on a schedule…it was kind of a painful process,” explains Rowe. “We spent most of our time doing that, and very little time building the website itself,” he says. At the same time, while Rowe was at Frog, he realized that the company had a lot of non-technical designers who needed access to data to make interesting design decisions, but who weren’t equipped to go out and get the data for themselves.

With Kimono, the end goal is to simplify data extraction so that anyone can manage it. After signing up, you install a bookmarklet in your browser, which, when clicked, puts the website into a special state that allows you to point to the items you want to track. For example, if you were trying to track movie times, you might click on the movie titles and showtimes. Then Kimono’s learning algorithm will build a data model involving the items you’ve selected.

Screen Shot 2014-02-18 at 4.29.05 PM

Screen Shot 2014-02-18 at 4.29.27 PM

That data can be tracked in real time and extracted in a variety of ways, including to Excel as a .CSV file, to RSS in the form of email alerts, or for developers as a RESTful API that returns JSON. Kimono also offers “Kimonoblocks,” which lets you drop the data as an embed on a webpage, and it offers a simple mobile app builder, which lets you turn the data into a mobile web application.

Screen Shot 2014-02-18 at 4.29.50 PM

For developer users, the company is currently working on an API editor, which would allow you to combine multiple APIs into one.

So far, the team says, they’ve been “very pleasantly surprised” by the number of sign-ups, which have reached ten thousand*. And even though only a month old, they’ve seen active users in the thousands.

Initially, they’ve found traction with hardware hackers who have done fun things like making an airhorn blow every time someone funds their Kickstarter campaign, for instance, as well as with those who have used Kimono for visualization purposes, or monitoring the exchange rates of various cryptocurrencies like Bitcoin and dogecoin. Others still are monitoring data that’s later spit back out as a Twitter bot.

Kimono APIs are now making over 100,000 calls every week, and usage is growing by over 50 percent per week. The company also put out an unofficial “Sochi Olympics API” to showcase what the platform can do.

The current business model is freemium based, with pricing that kicks in for higher-frequency usage at scale.

The Mountain View-based company is a team of just the two founders for now, and has initial investment from YC, YC VC and SV Angel.

Source:http://techcrunch.com/2014/02/18/kimono-is-a-smarter-web-scraper-that-lets-you-api-ify-the-web-no-code-required/

Sunday, 16 November 2014

A Web Scraper’s Guide to Kimono

Being a frequent reader of Hacker News, I noticed an item on the front page earlier this year which read, “Kimono – Never write a web scraper again.” Although it got a great number of upvotes, the tech junta was quick to note issues, especially if you are a developer who knows how to write scrapers. The biggest concern was a non-intuitive UX, followed by the inability of the first beta version to extract data items from websites as smoothly as the demo video suggested.

I decided to give it a few months before I tested it out, and I finally got the chance to do so recently.

Kimono is a Y-Combinator backed startup trying to do something in a field where others have failed. Kimono is focused on creating APIs for websites which don’t have one, another term would be web scraping. Imagine you have a website which shows some data you would like to dynamically process in your website or application. If the website doesn’t have an API, you can create one using Kimono by extracting the data items from the website.

Is it Legal?

Kimono provides an FAQ section, which says that web scraping from public websites “is 100% legal” as long as you check the robots.txt file to see which URL patterns they have disallowed. However, I would advise you to proceed with caution because some websites can pose a problem.

A robots.txt is a file that gives directions to crawlers (usually of search engines) visiting the website. If a webmaster wants a page to be available on search engines like Google, he would not disallow robots in the robots.txt file. If they’d prefer no one scrapes their content, they’d specifically mention it in their Terms of Service. You should always look at the terms before creating an API through Kimono.

An example of this is Medium. Their robots.txt file doesn’t mention anything about their public posts, but the following quote from their TOS page shows you shouldn’t scrape them (since it involves extracting data from their HTML/CSS).

    For the remainder of the site, you may not duplicate, copy, or reuse any portion of the HTML/CSS, JavaScipt, logos, or visual design elements without express written permission from Medium unless otherwise permitted by law.

If you check the #BuiltWithKimono section of their website, you’d notice a few straightforward applications. For instance, there is a price comparison API, which is built by extracting the prices from product pages on different websites.

Let us move on and see how we can use this service.

What are we about to do?

Let’s try to accomplish a task, while exploring Kimono. The Blog Bowl is a blog directory where you can share and discover blogs. The posts that have been shared by users are available on the feeds page. Let us try to get a list of blog posts from the page.

The simple thought process when scraping the data is parsing the HTML (or searching through it, in simpler terms) and extracting the information we require. In this case, let’s try to get the title of the post, its link, and the blogger’s name and profile page.

Source: http://www.sitepoint.com/web-scrapers-guide-kimono/

Thursday, 13 November 2014

Future of Web Scraping

The Internet is large, complex and ever-evolving. Nearly 90% of all the data in the world has been generated over the last two years. In this vast ocean of data, how does one get to the relevant piece of information? This is where web scraping takes over.

Web scrapers attach themselves, like a leech, to this beast and ride the waves by extracting information form websites at will. Granted “scraping” doesn’t have a lot of positive connotations, yet it happens to be the only way to access data or content from a web site without RSS or an open API.

Future of Web Scraping

Web scraping faces testing times ahead. We outline why there may be some serious challenges to its future.

With rise in data, redundancies in web scraping are rising. No more is web scraping a domain of the coders; in fact, companies now offer customized scraping tools to clients which they can use to get the data they want. The outcome of everyone equipped to crawl, scrape, and extract, is unnecessary waste of precious man-power. Collaborative scraping could well heal this hurt. Here, where one web crawler does a broad scraping, the others scrape data off an API. An extension of the problem is that text retrieval attracts more attention than multimedia; and with websites becoming more complex, this enforces limited scraping capacity.

Easily, the biggest challenge to web scraping technology is Privacy concerns. With data freely available (most of it voluntary, much of it involuntary), the call for stricter legislation rings loudest. Unintended users can easily target a company and take advantage of the business using web scraping. The disdain with which “do not scrape” policies are treated and terms of usage violated, tells us that even legal restrictions are not enough. This begs to ask an age-old question: is scraping legal?

Is Crawling Legal? from PromptCloud

The flipside to this argument is that if technological barriers replace legal clauses, then web scraping will see a steady, and sure, decline. This is a distinct possibility since the only way scraping activity thrives is on the grid, and if the very means are taken away and programs no longer have access to website information, then web scraping by itself will be wiped out.

Building the Future

On the same thought is the growing trend of accepting “open data”. The open data policy, while long mused hasn’t been used at the scale it should be. The old way was to believe that closed data is the edge over competitors. But that mindset is changing. Increasingly, websites are beginning to offer APIs and embracing open data. But what’s the advantage of doing so?

Selling APIs not only brings in the money, but also is useful in driving back traffic to the sites! APIs are also a more controlled, cleaner way of turning sites into services. Steadily many successful sites like Twitter, LinkedIn etc. are offering access to their APIs with paid services and actively blocking scraper and bots.

Yet, beyond these obvious challenges, there’s a glimmer of hope for web scraping. And this is based on a singular factor: the growing need for data!

With Internet & web technology spreading, massive amounts of data will be accessible on the web. Particularly with increased adoption of mobile internet. According to one report, by 2020, the number of mobile internet users will hit 3.8 billion, or around half of the world’s population!

Since ‘big data’ can be both, structured & unstructured; web scraping tools will only get sharper and incisive. There is fierce competition between those who provide web scraping solutions. With the rise of open source languages like Python, R & Ruby, Customized scraping tools will only flourish bringing in a new wave of data collection and aggregation methods.

Source: https://www.promptcloud.com/blog/Future-of-Web-Scraping

Wednesday, 12 November 2014

3 Reasons to Up Your Web Scraping Game

If you aren’t using a machine-learning-driven intelligent Web scraping solution yet, here are three reasons why you might want to abandon that entry-level Web-scraping software or cut your high-cost script-writing approach.

    You need to keep an eye on a large number of web sources that get updated frequently.

    Understanding what’s changed is at least as critical as the data itself.

    You don’t want maintenance and scheduling to drag you down.

Here’s what an intelligent Web-scraping solution can deliver – and why:

1. Better data monitoring of an ever-shifting Web

If you need to keep a watch over hundreds, thousands or even tens of thousands of sites, an intelligent Web scraper is a must, because:

    It can scale – easily adding new websites, coordinating extraction routines, and automating the normalization of data across different websites.

    It can navigate and extract data from websites efficiently. Script-based approaches typically only can view a Web page in isolation, making it difficult to optimize navigation across unique pages of a targeted site. More intelligent approaches can be trained to bypass unnecessary links and leave a lighter footprint on the sites you need to access. And, they can monitor millions of precise Web data points quickly. This means you can monitor more pages on more sites with more frequent updates.

2. Critical alerts to Web data changes

A key sales executive suddenly drops off of the management page of your main competitor. That can mean big shakeup in the entire organization, which your sales team can jump on.

An intelligent Web scraper can alert you to this personnel shift because it can be set to monitor for just the changes; less powerful technologies or script-based approaches can’t. Whether you’re tracking price shifts, people moves, or product changes (or more) intelligent Web scraping delivers more profound insights.

3. Maintenance may become your biggest nightmare

You’ve purchased an entry-level tool and built out scrapers for a few hundred sites.  At first, everything seems fine. But, within weeks you begin to notice that your data is incomplete and not being updated as you’d expected. Why did your data deliveries disappear?

Reality is that these low-cost tools are simply not designed for mission-critical business applications – on the surface they look helpful and easy to use, but underneath the surface they are script-based and highly dependent upon the HTML of a website. But websites change, and entry-level web scraping tools are simply not engineered to adapt to those changes.

And, most of these tools are simply not designed for enterprise use. They have limited reporting, if any, so the only way to know whether they’re successfully completing their tasks is by finding gaps in the data – often when it’s too late.

An intelligent web scraping approach doesn’t rely upon the HTML of a web page. It uses machine learning algorithms which view the web the same way a user might. A typical reader doesn’t get confused when a font or color is changed on a website, and neither do these algorithms. But simple approaches to web scraping are highly dependent on the specific HTML to help it understand the content of a page. So, when websites have design changes (on average once every 18 months), the software fails.

While entry-level web scraping software can be an easy solution for simple, one-time web scraping projects, the scripts they generate are fragile and the resources required for tracking and maintenance can become overwhelming when you need to regularly extract data from multiple sites.

Case in point: Shopzilla assimilates data five times faster than outsourced Web scrapers

To demonstrate the power of intelligent Web scraping, here’s a real-life example from Shopzilla.  Shopzilla manages a premier portfolio of online shopping brands in the United States and Europe, connecting more than 40 million shoppers each month with millions of products from retailers worldwide. With the explosive growth of retail data on the Web, Shopzilla’s outsourced, custom-built approach, based on scripting, could not add the product lines of new retailers to its site in a timely fashion. It was taking up to two weeks to write the scripts needed to make a single site accessible.

By deploying Connotate’s intelligent web scraping platform on site, Shopzilla gained the ability to harness Web data’s rapid growth and keep up to date. Today, new sources are added in days, not weeks.  The platform continually monitors Web content from thousands of sites, delivering high volumes of data every day in a structured format. The result: 500 percent more data from new retailers. An added bonus: the company has reduced IT maintenance costs and its dependence on outsourced development timetables. Case in point: Deep competitor intelligence in two languages

A global manufacturer needed to monitor competitors’ technology improvements in a field where market leadership hinges on an ability to quickly leverage these advances. That meant accessing scholarly journals and niche sites in multiple languages. Using the Connotate solution, it was able to access highly-targeted, keyword-driven university and industry research journals and blogs in German and English that are hard to reach because they do not support RSS feeds. Our solution also incorporated semantic analysis to tag and categorize data and help identify new technologies and products not currently in the keyword list. The firm enhanced its competitive edge with the up-to-the-minute, precise data it needed.

Is your Web scraping intelligent enough?

See what intelligent agents through an automated Web data extraction and monitoring solution can bring to your business. Contact us and speak with one of experts.

Source:http://www.connotate.com/3-reasons-web-scraping-game-6579#.VGMjH2f4EuQ

Monday, 10 November 2014

Data Scraping vs. Data Crawling

One of our favorite quotes has been- ‘If a problem changes by an order, it becomes a totally different problem’ and in this lies the answer to- what’s the difference between scraping and crawling?

Crawling usually refers to dealing with large data-sets where you develop your own crawlers (or bots) which crawl to the deepest of the web pages. Data scraping on the other hand refers to retrieving information from any source (not necessarily the web). It’s more often the case that irrespective of the approaches involved, we refer to extracting data from the web as scraping (or harvesting) and that’s a serious misconception.

=>Below are some differences in our opinion- both evident and subtle

1.    Scraping data does not necessarily involve the web. Data scraping could refer to extracting information from a local machine, a database, or even if it is from the internet, a mere “Save as” link on the page is also a subset of the data scraping universe. Crawling on the other hand differs immensely in scale as well as in range. Firstly, crawling = web crawling which means on the web, we can only “crawl” data. Programs that perform this incredible job are called crawl agents or bots or spiders (please leave the other spider in spiderman’s world). Some web spiders are algorithmically designed to reach the maximum depth of a page and crawl them iteratively (did we ever say scrape?).

2.    Web is an open world and the quintessential practising platform of our right to freedom. Thus a lot of content gets created and then duplicated. For instance, the same blog might be posted on different pages and our spiders don’t understand that. Hence, data de-duplication (affectionately dedup) is an integral part of data crawling. This is done to achieve two things- keep our clients happy by not flooding their machines with the same data more than once, and saving our own servers some space. However, dedup is not necessarily a part of data scraping.

3.    One of the most challenging things in the web crawling space is to deal with coordination of successive crawls. Our spiders have to be polite with the servers that they hit so that they don’t piss them off and this creates an interesting situation to handle. Over a period of time, our intelligent spiders have to get more intelligent (and not crazy!) and learn to know when and how much to hit a server in order to crawl data on its web pages while complying with its politeness policies.

4.    Finally, different crawl agents are used to crawl different websites and hence you need to ensure they don’t conflict with each other in the process. This situation never arises when you intend to just scrape data.

On a concluding note, scraping represents a very superficial node of crawling which we call extraction and that again requires few algorithms and some automation in place.

Source:https://www.promptcloud.com/blog/data-scraping-vs-data-crawling/

Saturday, 8 November 2014

Web Scraping the Solution to Data Harvesting

The internet is the number one information provider in the world and it is of course the largest in the same course. Web scraping is meant to extract and harvest useful information from the internet. It can be regarded as a multidisciplinary process that involves statistics, databases, data harvesting and data retrieval.

There has been noted a rapid expansion of the web and therefore causing an enormous growth of information. This has led to increased difficulty in the extraction of useful and potential information. Web scraping therefore confronts this problem by harvesting explicit information from a number of websites for knowledge discovery and easy access. It is important to realize that query interfaces of web databases are prone to sharing of same building blocks. It is therefore important to realize that the web offers unprecedented challenge and opportunity to data harvesting.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/web-scraping-solution-data-harvesting/

Wednesday, 5 November 2014

Application of Web Data Mining in CRM

The process of improvising the customer relations and interactions and making them more amicable may be termed as Customer relationship management (CRM). Since web data mining is used in the utilization of the various modeling and data analysis methods in detecting given patterns and relationships in the data, it can be used as an effective tool in CRM. By the effectively using web data mining you are able to understand what your customers what.

It is important to note that web data mining can be used effectively in searching for the right and potential customers to be offered the right products at the right time. The result of this in any business is the increase in the revenue generated. This is made possible as you are able to respond to each customer in an effective and efficient way. The method further utilizes very few resources and can be therefore termed as an economical method.

In the next paragraphs we discuss the basic process of customer relationship management and its integration with web data mining service. The following are the basic process that should be used in understanding what your customers need, sending them the right offers and products, and reducing the resources used in managing your customers.

Defining the business objective. Web data mining can be used to define and inform your customers your business objective. By doing research you can be able to determine whether your business objective is communicated well to your customers and clients. Does your business objective take interest in the customers? Your business goal must be clearly outlined in your business CRM. By having a more precise and defined goal is the possible way of ensuring success in the customer relationship management.

Source:http://www.loginworks.com/blogs/web-scraping-blogs/application-web-data-mining-crm/