Monday, March 17, 2008

Web Data-Mining-Techno Arena--P. Sumalatha, MCA-II SEM

Here’s a ready reckoner on how to mine the riches that the internet throws up. As the Net grows, mining that vast repository of information is getting more daunting. But relax and read on, as there’s enough help at hand.

Who needs it?

Almost all of us need information. A lot of information is freely available on the web. Learning a few techniques on how to mine information on the web is a useful skill. Here are some sample usage scenarios.

· You are an entrepreneur who is planning to start a new software business. You hear that Web 2.0 and social applications are hot. You want to do some research to understand the marketplace, and want to prototype a few product ideas.
· You are part of the CTO office of a software company, and are interested in short-, medium-, and long-term technology and business trends in your industry. You need this information to build skills in your organization, and to build a few concept prototypes.
· You are part of the CIO office of an organization. You need to balance early adoption of technologies with providing a stable environment for your business; you don’t want to jump at every new technology. In addition to finding new tools and techniques, you also want to understand the risks and the maturity level of these technologies, which ones are being used for building applications, and you also want to track many non-technical factors.
· You are an outsourcing company and want to find customers for your business and track trends in outsourcing. Staying a step ahead of your competition and carving a niche are important differentiators.
· You are part of HR, or a Learning officer, and need to plan for the skill development of your employees. You want to keep your software team happy and so need to know the latest technologies, tools and resources to plan training and skill development.
· You are a development lead, and need to provide the team with the latest information on product releases, and access to product/technology knowledge bases. You need to know of any problems, including security issues, in the tools or software that you are currently using for your projects.

Broadly, there are several components to finding, using and sharing information.

· Identifying and discovering information sources.
· Tracking information from various sources and filtering it for relevance to your needs.
· Organizing collected information and sharing it with others.

FINDING INFORMATION:-
Information sources can be categorized as:


· News sources
· Company web sites
· Blogs
· Search engines
· Wikis
· Discussion groups
· Social bookmarking sites
· Social networks

All these sources are complementary to each other. Each delivers a slightly different type of information. We will discuss each one of them in more detail in the rest of this article.

NEWS SOURCES:-

News has traditionally been our major source of information, but news sources are gradually becoming web based and are delivered in several ways.

News websites are the closest to the printed newspapers that we have been accustomed to. A few layout changes are made to accommodate the smaller area for displaying news. But if you visit the websites of The New York Times, The Wall Street Journal or The Hindu, you will notice a lot of similarities to traditional newspapers.

News portals are similar to news websites, with one difference: you are allowed to customize the news to suit your tastes. Much of the more popular news delivery is done using news portals. News portals are also popular for delivering custom news specific to an industry or subject area.

News feeds are lists of news items. These are also known as web feeds or RSS feeds. Each feed has a title and description, and contains one or more news items. Each item, in turn, contains a title, description and a link to the original story. RSS feeds are a very popular means of delivering multiple news items.
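
The feed structure described above (a feed with a title and description, holding items that each have a title, description and link) can be read with a few lines of Python. This is only a minimal sketch using the standard library; the feed content, the example.com URLs and the function name parse_feed are made up for illustration, and real feeds often carry more fields than this.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 content, used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Tech News</title>
    <description>A made-up feed used only for illustration</description>
    <link>http://example.com/</link>
    <item>
      <title>First story</title>
      <description>Summary of the first story</description>
      <link>http://example.com/stories/1</link>
    </item>
    <item>
      <title>Second story</title>
      <description>Summary of the second story</description>
      <link>http://example.com/stories/2</link>
    </item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return the feed title, description and items as plain Python data."""
    root = ET.fromstring(xml_text)      # the <rss> element
    channel = root.find("channel")      # the feed itself
    items = [
        {
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in channel.findall("item")
    ]
    return {
        "title": channel.findtext("title", default=""),
        "description": channel.findtext("description", default=""),
        "items": items,
    }


if __name__ == "__main__":
    feed = parse_feed(SAMPLE_FEED)
    print(feed["title"])
    for entry in feed["items"]:
        print("-", entry["title"], entry["link"])
```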

Where do you find these feeds? :-


Several publications (like BusinessWeek and The Wall Street Journal) and Yahoo News provide news feeds. You can subscribe to these feeds and read them with news reader software. News readers aggregate several feeds of your choice, and present a list of current news items.

One of the benefits of RSS feeds is that they are based on a standard format: you can easily write an application that filters RSS feeds based on your own specific criteria, and have it present you with only selected items.
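
As a rough illustration of such a filter, the sketch below keeps only the items whose title or description mentions one of your keywords. The feed content, the example.com links and the function name filter_feed are assumptions made for this example; keyword matching is just one possible criterion, and a real application would fetch the feed from the publisher instead of using an inline string.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 content, used only for illustration.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Tech News</title>
    <description>Illustrative feed</description>
    <item>
      <title>New Python web framework released</title>
      <description>A short note about the release</description>
      <link>http://example.com/stories/framework</link>
    </item>
    <item>
      <title>Quarterly earnings roundup</title>
      <description>Business results for the quarter</description>
      <link>http://example.com/stories/earnings</link>
    </item>
    <item>
      <title>Security advisory</title>
      <description>A flaw was found in a popular Python library</description>
      <link>http://example.com/stories/advisory</link>
    </item>
  </channel>
</rss>"""


def filter_feed(xml_text, keywords):
    """Return (title, link) pairs for items that mention any keyword."""
    wanted = [k.lower() for k in keywords]
    selected = []
    for item in ET.fromstring(xml_text).iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        # Case-insensitive match against the title and the description.
        text = (title + " " + description).lower()
        if any(k in text for k in wanted):
            selected.append((title, item.findtext("link", default="")))
    return selected


if __name__ == "__main__":
    for title, link in filter_feed(SAMPLE_FEED, ["python"]):
        print(title, "->", link)
```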

Blogs are interesting sources of news: they contain content created by people like you and me, known as bloggers. In many ways, you could consider blogs to be ‘citizen news’. Bloggers pick an item of interest and write a post that is like a mini news column. Blogs most often carry a reference to the original news and commentary by the blogger. A more interesting aspect of blogs is that readers can comment on the news items or the commentary. On popular or controversial blog posts, the comments often provide more information than the original posts themselves.


Discussion groups:-

We use the term ‘discussion groups’ to broadly cover mailing lists, company/product-based discussion groups, and collaborative web products like Yahoo Groups and Google Groups. A lot of information is exchanged and debated in a discussion group. If you are considering a new programming language, web framework or development tool, these forums provide valuable information.

Company-based groups are normally run by a company’s support group, or are run independently by others outside the company. If you are looking at investing in a certain company’s products, this may be a good place to check them out.

Topic-based groups focus on a certain topic of interest, like wireless mobile devices. A typical group of this kind may focus on web frameworks or learning software. Such groups are not specific to a single company, but several companies in the arena may be discussed.

Product-based groups focus on a product or a product family. For example, Django is a Python-based framework for building web applications. Members of the Django group on Google Groups spend a lot of time discussing various approaches and best practices for using Django. They also discuss various issues and workarounds.



Company websites:-

Company websites are one of the best means of getting information about a company, directly from the source. You can typically get information about its philosophy, mission, team, products or services, and a lot more. Company sites range from a few pages to a few thousand pages.

You can learn a lot about a company by visiting its website. Frequent updates to the site indicate a lot of activity. You can look at customer wins, business partnerships, product releases, and job postings. These core activities are reflected in frequent press releases and news coverage.
A few sites like Alexa, ZoomInfo and Hoover’s not only provide information about a company, but also provide information on competing companies and products.

Search engines:-

Search engines from Google, Yahoo and Microsoft are probably the most frequently used resources to find information. Search engines index news sites, company sites, feeds and blogs, to provide answers to your search.
Web search engines in general locate information using a keyword search and some kind of ranking scheme. Many of these search engines also provide an API (application programming interface) that you can use to create automated searches, for example, a search for your own company. Meta search engines submit searches to multiple search engines, and consolidate and group the results.
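
To give a feel for the consolidation step of a meta search engine, here is a minimal sketch that merges ranked result lists from several engines into one list. The engine names, the URLs and the function merge_results are hypothetical, and the scoring (summing 1/rank across engines) is just one simple choice picked for this example, not how any particular meta search engine works. In practice the input lists would come from each engine’s own search API.

```python
from collections import defaultdict


def merge_results(results_by_engine):
    """Merge ranked URL lists into one list of (url, engines) pairs.

    A URL scores 1/rank for every engine that returned it, so pages
    ranked highly by several engines rise to the top.
    """
    scores = defaultdict(float)
    engines = defaultdict(set)
    for engine, urls in results_by_engine.items():
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / rank
            engines[url].add(engine)
    # Highest score first; ties are broken alphabetically by URL.
    ranked = sorted(scores, key=lambda url: (-scores[url], url))
    return [(url, sorted(engines[url])) for url in ranked]


if __name__ == "__main__":
    # Hypothetical results for the same query from two engines.
    results = {
        "engine_a": ["http://example.com/a", "http://example.com/b"],
        "engine_b": ["http://example.com/b", "http://example.com/c"],
    }
    for url, found_by in merge_results(results):
        print(url, found_by)
```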
Custom search engines allow you to customize one or more aspects of search. For example, Google Custom Search allows you to specify the sources (websites) to search, and some predefined keywords to include in your search.
Vertical search engines are focused on a specific industry or subject area. By focusing on a specific subject area, they can provide more effective search and more accurate results.
Blog search engines allow you to look for blogs that cover a certain topic or search area. While you can use regular search engines to find blogs, blog searches focus on indexing and searching only blog posts. Popular blog search engines include Google Blog Search and Technorati.
Code search engines let you search for code (programs) of a specific type. This is a great resource for developers. The search is typically done for open source, or other publicly available source code. You can specify certain keywords, and select a language or operating environment. Google Code Search enables you to search based on file paths or names, the license under which the code is released, the programming language used, the name of the package, and more. Each of these accepts regular expressions as the search expression, allowing you to construct powerful and very specific search expressions. Krugle is another code search engine, which was recently purchased by Yahoo, and also sports an open source version.
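
The idea of combining a regular expression with filters on language, file path and license can be tried out on a tiny in-memory collection. The sketch below is not the query syntax of Google Code Search or Krugle; the file records, the field names and the function search_code are invented for illustration.

```python
import re

# A made-up "index" of source files, used only for illustration.
FILES = [
    {"path": "src/feeds/reader.py", "lang": "python", "license": "bsd",
     "code": "def read_feed(url):\n    return fetch(url)\n"},
    {"path": "lib/feeds/Reader.java", "lang": "java", "license": "apache",
     "code": "public Feed readFeed(String url) { return fetch(url); }\n"},
    {"path": "src/util/text.py", "lang": "python", "license": "gpl",
     "code": "def slugify(title):\n    return title.lower()\n"},
]


def search_code(files, pattern, lang=None, path_pattern=None, license_name=None):
    """Return paths of files whose code matches the regular expression.

    The optional arguments narrow the search by language, by a regular
    expression on the file path, and by license.
    """
    code_re = re.compile(pattern)
    path_re = re.compile(path_pattern) if path_pattern else None
    hits = []
    for entry in files:
        if lang and entry["lang"] != lang:
            continue
        if license_name and entry["license"] != license_name:
            continue
        if path_re and not path_re.search(entry["path"]):
            continue
        if code_re.search(entry["code"]):
            hits.append(entry["path"])
    return hits


if __name__ == "__main__":
    # Python functions whose name ends in "_feed".
    print(search_code(FILES, r"def \w+_feed\(", lang="python"))
    # Any function definition under the src/ directory.
    print(search_code(FILES, r"def ", path_pattern=r"^src/"))
```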

Semantic search is a new type of search that is in its infancy. Instead of just indexing keywords and searching for them, semantic search engines allow you to search a bit deeper. Some of the semantic search engines also allow humans to validate the results of the search, so that they can be improved.

Social bookmarks:-

Social bookmark services allow people to share their bookmarks. This is done by saving bookmarks on the internet and making them accessible to everyone. To make access simpler, social bookmarking systems include descriptive tags with bookmarks. Popular social bookmarking systems include del.icio.us, Furl, Digg, Slashdot, Reddit and StumbleUpon. Some of them allow users to rank bookmarks, so the more popular ones are listed at the top, further increasing their visibility.
Other facilities provided by social bookmarking services include an API and RSS feeds to programmatically access bookmarks. They also offer tag clouds (where tags are displayed with the more frequently-occurring tags appearing larger).
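
A tag cloud boils down to counting how often each tag occurs and mapping that count to a display size. The following sketch shows one simple way to do it, assuming a linear scale between a minimum and maximum font size. The bookmark data, the size limits and the function name tag_cloud are made up for this example, and real services may well scale sizes differently.

```python
from collections import Counter


def tag_cloud(bookmarks, min_size=10, max_size=30):
    """Map each tag to a font size based on how often it occurs.

    `bookmarks` maps a bookmarked URL to its list of tags. The least
    frequent tag gets `min_size`, the most frequent gets `max_size`.
    """
    counts = Counter(tag for tags in bookmarks.values() for tag in tags)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    spread = high - low
    sizes = {}
    for tag, n in counts.items():
        if spread == 0:
            # All tags are equally frequent: use a middle size.
            sizes[tag] = (min_size + max_size) // 2
        else:
            sizes[tag] = min_size + (n - low) * (max_size - min_size) // spread
    return sizes


if __name__ == "__main__":
    # Hypothetical bookmarks and their tags.
    bookmarks = {
        "http://example.com/1": ["python", "web"],
        "http://example.com/2": ["python", "rss"],
        "http://example.com/3": ["python", "web"],
    }
    for tag, size in sorted(tag_cloud(bookmarks).items()):
        print(tag, size)
```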

WIKIS:-
Wikis are a special type of website where content is created by multiple people. Collaborative content development, editing, approval and publication make wikis a very powerful platform for creating content. Probably the most definitive example of a wiki is Wikipedia, which contains the collective knowledge of hundreds of thousands of people. A collaborative encyclopedia, Wikipedia has become the first stop in many people’s search for encyclopedic knowledge on the Net.
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia also makes it possible to link other datasets on the Web to Wikipedia data.
Product wikis are specialized wiki sites that are set up to provide a collaborative community wiki for product documentation, support and issues. Many open source projects/products have a product wiki. Even commercial products from major vendors have their own wikis.


Industry wikis cover information about specific industries. You can think of them as the aggregation of knowledge about a particular industry or topic, for example, Web 2.0 or AJAX techniques. Industry wikis are normally associated with industry portals, or managed by independent groups. If you are trying to find customers or start a company in a specific industry, you may want to first check whether your industry has its own wiki. A good starting point is Wikipedia, which may contain links to other wikis.


Support wikis are knowledge repositories for products and services. These are maintained either by the vendor or by the community. If you are in IT or any kind of software development, these can be gold mines of information.

Social networks:-

A discussion about information sources is not complete without some mention of social networks. Social networks allow people and groups to share information in a wide variety of ways. The most popular social network today is Facebook, closely followed by MySpace. LinkedIn is a business social network. Recently, Google introduced OpenSocial, which provides a common set of APIs for social applications across multiple websites. With standard JavaScript and HTML, developers can create applications that access a social network’s friends and update feeds.

There is a wide variety of information sources, and their number seems to increase rapidly. Most of them offer programmatic access through a language-independent API. All you need is a way to track them and be updated when these rich sources produce new information.
