Showing posts with label Clustering. Show all posts

Sunday, August 19, 2007

Enterprise Search - Redefining Scope

Continuing from my previous post, 'Enterprise Search Done Right', where I wrote about what users need from enterprise search, in this post I want to share my thoughts on how the technology has changed over the last couple of years and why those changes are forcing the community to rethink its scope and functionality.

In the last few years there has been a shift in how users perceive search. Users are not just searching for information; they also care about how they search, how effective and relevant the results are, and how the information is rendered to them. They are also judging the quality and efficiency of the search service. For organizations, the emphasis is no longer on making information searchable but on making it findable. Their web strategies are more closely aligned with serving customers better and converting more business opportunities. For search vendors, the emphasis has been on advancing search algorithms, bringing more technologies under the search umbrella to provide a complete enterprise information management solution, and making the search process more engaging so that users have an effective experience.

How do the research companies see the enterprise search industry growing over the next few years? Gartner suggests the search industry will see double-digit percentage growth in 2007, surpassing $728 million, a 15 percent increase from $633 million in 2006 (according to the Gartner report "Dataquest Insight: Forecast for Information Access With Search Technology in the Enterprise, 2006-2011"). This is a positive sign: search vendors can add more technologies to the search stack to benefit organizations looking for enterprise information management solutions.

Before scoping the requirements, let's take a look at how users find or access information:

1. Pattern Search - This is the typical form of searching, as when users search Google or Yahoo. You search for a word or a phrase, and the search engine returns a set of pages that match. The advanced version of pattern matching is clustering, in which the search engine automatically classifies the initial set of search results into buckets. You can read more about it in my post 'Clustering with Search Engines'. A public clustering engine can be found at Clusty.

2. Browsable Taxonomy or Topical Navigation - This is browsing through information based on pre-defined topics or categories. The organization's information is classified into organization-wide taxonomies. Content can be classified or categorized at creation time or at indexing time (by the search engine). Content authoring tools should provide the ability to define taxonomies and to associate categories with the information. For example, Documentum provides taxonomy support for content classification. Alternatively, search engines can classify information at indexing time based on pre-defined rules and a taxonomy. For example, Verity provides auto-classification of information.



3. Navigating through the Semantic Web - The Semantic Web is not just about navigating through links on web pages. It describes relationships based on meta attributes or properties. RDF (Resource Description Framework) is a framework for describing information and resources on the web, and the Semantic Web uses RDF to describe web resources. How might this be useful? Suppose you want to compare the price and selection of iPods in your zip code, or you want to search the online catalogs of different manufacturers and service providers for mobile phones. Tools like Siderean provide RDF-based alternative navigation.
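To make the third style a little more concrete, here is a minimal sketch of navigating by relationships rather than by keywords. It assumes catalog data expressed as (subject, predicate, object) triples, which is the basic shape of an RDF statement. The store names, predicates, zip codes and prices are all made up for illustration, and a real system would use an RDF store and query language rather than a Python list.

```python
# Hypothetical catalog data as (subject, predicate, object) triples.
# All identifiers and prices below are invented for illustration.
TRIPLES = [
    ("store:a", "locatedIn", "zip:94105"),
    ("store:b", "locatedIn", "zip:94105"),
    ("store:c", "locatedIn", "zip:10001"),
    ("offer:1", "soldBy", "store:a"),
    ("offer:1", "product", "ipod-nano"),
    ("offer:1", "price", 149.0),
    ("offer:2", "soldBy", "store:b"),
    ("offer:2", "product", "ipod-nano"),
    ("offer:2", "price", 139.0),
    ("offer:3", "soldBy", "store:c"),
    ("offer:3", "product", "ipod-nano"),
    ("offer:3", "price", 129.0),
]


def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def value(triples, subject, predicate):
    """Return the first object found for (subject, predicate), or None."""
    found = match(triples, s=subject, p=predicate)
    return found[0][2] if found else None


def offers_in_zip(triples, product, zip_code):
    """List (store, price) for a product sold in a zip code, cheapest first."""
    stores = {s for s, _, _ in match(triples, p="locatedIn", o=zip_code)}
    results = []
    for offer, _, _ in match(triples, p="product", o=product):
        store = value(triples, offer, "soldBy")
        if store in stores:
            results.append((store, value(triples, offer, "price")))
    return sorted(results, key=lambda pair: pair[1])


if __name__ == "__main__":
    for store, price in offers_in_zip(TRIPLES, "ipod-nano", "zip:94105"):
        print(store, price)
```

The point of the sketch is that the question "which stores in my zip code sell this product, and at what price" is answered by following properties from one resource to another, not by matching words in a page.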

If you look at it now, only pattern matching represents search functionality in the true sense. The other types of information access are getting popular because the boundary between search and navigation gets hazier every year. Search can no longer work as an isolated technology for enterprise information management. It needs to provide, and integrate with, a collection of technologies to meet demanding enterprise needs. Search is no longer just a pattern matching algorithm; it has expanded into a complete enterprise information access and management platform that includes extraction, classification, taxonomy support and pattern matching. Its requirements and scope are not limited to 'searching information based on words and key phrases'. It is now an integral part of the enterprise information management platform. The functions of enterprise search now include:

1. Enterprise Search
2. Taxonomy Management - Ability to create or extend organization taxonomy
3. Information Classification i.e. categorization
4. Entity Extraction - Identifies and extracts key entities i.e. the who, what, when and how, such as people, dates, places, companies, email addresses, geo-coordinates, facilities, etc.
5. Application Integration - First, the ability to integrate with various data sources within the organization for search. For example, RSS feeds are now a good source of data for indexing. Second, the ability to capture users' search and navigation behavior to collect data for search optimization.
6. Information Rendering - How the information found is rendered to end users, including caching, translation, transformation of information into various formats, and the user experience.
7. Administrative Interface - The ability to give every site's webmaster an administrative interface to view their site's performance and control its data.
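As a rough illustration of items 3 and 4 in the list, the sketch below classifies a document against a tiny taxonomy using keyword rules and pulls out two kinds of entities with regular expressions. The taxonomy, the keywords and the sample text are hypothetical, and the patterns are deliberately simple (ISO-style dates and basic email addresses only). Commercial products rely on far richer rules and linguistic models than this.

```python
import re

# Hypothetical taxonomy: category -> keywords that signal it.
TAXONOMY_RULES = {
    "Finance": {"invoice", "budget", "revenue"},
    "HR": {"hiring", "payroll", "benefits"},
}

# Simplified patterns, not a full treatment of either format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # e.g. 2007-08-19


def classify(text):
    """Return the sorted categories whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(cat for cat, kws in TAXONOMY_RULES.items() if words & kws)


def extract_entities(text):
    """Return email addresses and ISO dates found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


if __name__ == "__main__":
    doc = "Payroll budget review on 2007-08-19, contact jane.doe@example.com."
    print(classify(doc))
    print(extract_entities(doc))
```

Running both steps at indexing time would let the engine store categories and entities alongside the text, which is what makes taxonomy browsing and faceted navigation possible later.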

If you look at the enterprise search vendors, you will see that most of them have made strides in developing next-generation search and advanced find tools. These include Autonomy, IBM, Convera, FAST, Inxight, Vivisimo, Siderean, etc.

Wednesday, August 8, 2007

Clustering with Search Engines

When I first read about clustering, I was unsure of its usefulness and of how well it could work on a limited set of search results. But over time, after reading a lot of research material on search usability and talking to people, I realized that searchers do not go beyond the first few pages. In fact, one study shows that 2/3 of searchers do not go beyond 2 pages of 10 results each. Searchers either find the information they are looking for or change the search terms. I personally do not go beyond 2-3 pages unless I am unable to refine the search phrase. Even with 2 pages of results, users have to filter the results and infer their context to find the relevant information. So users actually spend more time on the search results page than working with the information they searched for. When I am searching, what matters is not only how relevant the search results are, but also how much time it took me to get to them. This is where I see the value of clustering with search engines.

Search technology has been evolving and maturing over the last few years. The search companies are competing with each other to create a niche for themselves and to attract more traffic so they can benefit from advertising, but the ever-increasing demands of end users keep outpacing them. Users are getting interested in federated search, clustering and faceted search in addition to the regular sequential result list. Google web search provides a form of clustering by indenting search results when it finds more results from the same site or site section. Other search engines, like Vivisimo Velocity, provide topic clustering, that is, grouping results into topics or subjects, which helps users refine their searches.
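A toy version of topic clustering can be sketched in a few lines. This is not the algorithm used by Vivisimo or any other product; it is a deliberately naive approach that assumes each result is a short snippet, labels clusters with terms shared by at least two snippets, and puts everything else in an "other" bucket. The stopword list and sample snippets are made up.

```python
import re
from collections import Counter, defaultdict

# Small hypothetical stopword list; real engines use much larger ones.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword terms."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def cluster_results(query, snippets, min_docs=2):
    """Group snippets under the most widely shared term they contain."""
    query_terms = tokens(query)
    # Query terms appear everywhere, so they make useless labels.
    doc_terms = [tokens(s) - query_terms for s in snippets]
    doc_freq = Counter(t for terms in doc_terms for t in terms)
    # Candidate labels: terms shared by at least min_docs snippets,
    # most frequent first, ties broken alphabetically.
    labels = sorted(
        (t for t, n in doc_freq.items() if n >= min_docs),
        key=lambda t: (-doc_freq[t], t),
    )
    clusters = defaultdict(list)
    for snippet, terms in zip(snippets, doc_terms):
        label = next((l for l in labels if l in terms), "other")
        clusters[label].append(snippet)
    return dict(clusters)


if __name__ == "__main__":
    results = [
        "Jaguar car dealers",
        "Jaguar car reviews",
        "Jaguar habitat in the rainforest",
        "Jaguar cat habitat facts",
        "Jaguar operating system",
    ]
    for label, group in cluster_results("jaguar", results).items():
        print(label, group)
```

Even this crude grouping shows the benefit: an ambiguous query is split into buckets, so the searcher can skip whole groups instead of scanning every result.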

Can we use clustering for anything else? We can use clustering to build applications like job sites, event sites, social networking sites, etc. We can also generate a tag cloud to help users navigate through popular topics. The list will grow as we see more needs for applications.
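The tag cloud idea is simple enough to sketch. The example assumes the tags are whatever labels you already have, such as cluster labels or popular search terms, and it scales each tag's font size linearly between a minimum and a maximum. The function name, the size range and the sample tags are my own choices for illustration.

```python
from collections import Counter


def tag_cloud(tags, min_size=10, max_size=30):
    """Map each tag to a font size scaled linearly by its frequency."""
    counts = Counter(tags)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    cloud = {}
    for tag, n in counts.items():
        # When every tag is equally frequent, show them all at max size.
        scale = (n - lo) / span if span else 1.0
        cloud[tag] = round(min_size + scale * (max_size - min_size))
    return cloud


if __name__ == "__main__":
    tags = ["search"] * 5 + ["clustering"] * 3 + ["rdf"]
    print(tag_cloud(tags))
```

A linear scale is the easiest to reason about; when a few tags dominate, a logarithmic scale may keep the rare ones readable.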

I was working with Nutch and the Google Search Appliance when I got interested in clustering and faceted search technologies. I wanted to integrate clustering with both of these search engines to show my clients its value. I gave them a demo with Clusty web search and they saw a lot of value in it. But the question was: can I integrate it with the Google Search Appliance, and is there any free or low-cost tool that provides the same functionality? I knew about Vivisimo, but the client wanted a cheaper solution, so I started digging deeper into the open source arena.

Then I came across Carrot2, an open source search results clustering engine. Carrot2 provides an architecture for acquiring search results from various sources (YahooAPI, GoogleAPI, MSNAPI, eTools Meta Search, Alexa Web Search, PubMed, OpenSearch, Lucene index, SOLR), clustering the results and visualising the clusters. Currently, 5 clustering algorithms are available that are suitable for different kinds of document clustering tasks. Carrot2 has been successfully used in a number of commercial and research applications and has resulted in a number of interesting publications.

I found this tool interesting and wanted to use it with my existing search engines. The tool provides seamless integration with the Nutch and Lucene search engines. The developer only needs to point it at the existing search indexes and customize the page layout. The Carrot2 application is deployed as a webapp which you can drop into any web application server. The deployment was a cakewalk and the results were fascinating. Then I also got it working with the Google Search Appliance. Here I had to use Carrot2's Java APIs to build a clustered search interface from the appliance's results.

If you are looking for a low-cost clustering solution, you can use this open source clustering engine, which can be integrated with any existing web search engine and also with enterprise search engines like the Google Search Appliance or a Lucene index (one can debate Lucene's enterprise capability). It is easy to deploy and configure and does not bring any extra baggage.

I am sure Google must be working on this technology and will come out with a solution that is a notch better than Vivisimo or Carrot2. I am eagerly waiting to hear from them.