
ejIndex

org.exodelta.j2.index

Full Text Indexing Service for JBoss

hosted by:

sourceforge project page

Overview:

ejIndex is a full-text indexing and search service implemented as a JBoss MBean service. It uses the Apache Lucene index engine to provide fast, efficient and stable text indexing and search facilities. ejIndex wraps the Lucene facilities in a robust service implementation with thread pooling, queueing and access synchronization, and exposes concurrent interfaces for indexing, search and management. Some text filters are also included to extract searchable text from common document formats such as MS Word, MS Excel, HTML and XML (with some help from other projects).

Features:

Full-text indexing and searching of metadata
Applications can submit properties (metadata) to be full-text indexed and made searchable. Customizable schema definitions let you store and retrieve different data types, and specify whether each property is indexed as a tokenized string (text) or as a non-tokenized value (keyword).
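
To make the text/keyword distinction concrete, here is a small plain-Java sketch. It does not use ejIndex or Lucene: the class and method names (FieldKindDemo, matchesText, matchesKeyword) are made up for illustration, and a real analyzer does more than lowercase the value and split it on non-alphanumeric characters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class FieldKindDemo {

    // Tokenized (text) matching: the value is split into lowercase terms,
    // and a query term matches if it equals any one of them.
    static boolean matchesText(String value, String term) {
        List<String> tokens =
                Arrays.asList(value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+"));
        return tokens.contains(term.toLowerCase(Locale.ROOT));
    }

    // Non-tokenized (keyword) matching: the value is a single opaque term,
    // so only the exact, complete value matches.
    static boolean matchesKeyword(String value, String term) {
        return value.equals(term);
    }

    public static void main(String[] args) {
        String title = "JBoss Home Page";
        System.out.println(matchesText(title, "home"));               // true
        System.out.println(matchesKeyword(title, "home"));            // false
        System.out.println(matchesKeyword(title, "JBoss Home Page")); // true
    }
}
```

As a rule of thumb, free text such as a title is a candidate for a tokenized property, while identifiers and codes that must match exactly are candidates for keywords.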

Full-text indexing and searching of Document Content
In addition to indexing properties, you may also submit the URL of a local or remote file to be indexed as the content (aka document body). Text filters are available for common document formats such as MS Word, MS Excel, HTML, XML and PDF, as well as for binary files. A very simple interface is also provided so that you can create filters for other custom document types.

Implemented as an MBean service
The service can be deployed and managed via the standard JBoss JMX mechanisms. The index and search interfaces are registered with, and accessible via, the JBoss JNDI service, allowing EJBs and other services to access the index and search services. For remote access, two Session EJBs are included which provide access to the IndexService and SearchService interfaces.

Robust thread pooling and synchronization
Configurable thread pools provide highly efficient search and indexing capability, as well as the ability to concurrently search and modify the index without stopping any services.

Configurable wait policies
Because all searches are executed asynchronously, you can configure different "wait-policies": WAIT_FOR_HITS, which waits until at least a specified number of hits is available, or WAIT_FOR_TIMEOUT, which waits up to a specified maximum amount of time for hits. After retrieving results, you can use a search-handle to retrieve more results (if available). This allows for very "responsive" client applications, because the user will always see results quickly.
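
The idea behind the two wait policies can be sketched in plain Java. This is not the ejIndex implementation or API: WaitPolicyDemo, waitForHits and the simulated search thread are assumptions made for illustration. Hits arrive on a queue while the client waits for either a minimum number of hits or a time limit, whichever comes first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class WaitPolicyDemo {

    // Collects hits until at least minHits have arrived or timeoutMillis has
    // elapsed, whichever comes first, and returns what was collected.
    static List<String> waitForHits(BlockingQueue<String> hits, int minHits, long timeoutMillis)
            throws InterruptedException {
        List<String> collected = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (collected.size() < minHits) {
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0) {
                break; // time limit reached
            }
            String hit = hits.poll(remaining, TimeUnit.NANOSECONDS);
            if (hit == null) {
                break; // nothing arrived before the time limit
            }
            collected.add(hit);
        }
        return collected;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> hits = new LinkedBlockingQueue<>();

        // Simulated asynchronous search that delivers one hit every 50 ms.
        Thread searcher = new Thread(() -> {
            try {
                for (int i = 1; i <= 10; i++) {
                    Thread.sleep(50);
                    hits.put("doc-" + i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        searcher.setDaemon(true);
        searcher.start();

        // Like WAIT_FOR_HITS: return as soon as 3 hits are available.
        System.out.println("first batch: " + waitForHits(hits, 3, 5000));

        // Like WAIT_FOR_TIMEOUT: take whatever else arrives within 200 ms.
        System.out.println("next batch size: " + waitForHits(hits, Integer.MAX_VALUE, 200).size());
    }
}
```

The second call plays the role of the search-handle described above: the client shows the first few hits right away and asks for more afterwards.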

Quick Start (for the impatient like me):

Download the latest build, which you can get here.
If you are not familiar with Lucene, it is a very good idea to look at some of the docs before proceeding, especially the query syntax.
Extract the files from the lib directory (ejindex.jar, exoutils.jar, lucene.jar) and put them in your JBoss lib directory (e.g. /jboss-home/server/default/lib).
Extract the files from the deploy directory (ejindex-ejb.jar, ejindex-service.xml) and put them in your JBoss deploy directory (e.g. /jboss-home/server/default/deploy).
Test the service by using something like the following (SimpleTest.java in the distribution):

Example Usage (SimpleTest.java in distribution file):

import java.net.URL;
import java.util.Date;
import java.util.Hashtable;

import javax.naming.InitialContext;

// ejIndex classes (IndexService, SearchRequest, ...). The package is assumed
// from the project's package name; adjust it to match the distribution.
import org.exodelta.j2.index.*;

public class SimpleTest
{

    public static void main(String[] args) throws Exception
    {
        // connect to the JBoss JNDI service
        Hashtable props = new Hashtable();
        props.put(InitialContext.INITIAL_CONTEXT_FACTORY, "org.jnp.interfaces.NamingContextFactory");
        props.put(InitialContext.PROVIDER_URL, "jnp://127.0.0.1:1099");
        InitialContext context = new InitialContext(props);

        // look up the index service session bean
        IndexServiceHome ixHome = (IndexServiceHome) context.lookup("exodelta/IndexService");
        IndexService ixService = ixHome.create();

        IndexRequest request = new IndexRequest("jboss.org/index.html"); // sets unique ID
        request.setDocumentURL(new URL("http://jboss.org/index.html"));

        // these properties are defined in schema defs..
        request.addProperty("title", "jboss home");
        request.addProperty("author", "someone-at-jboss");
        request.addProperty("rating", new Float(9.5));
        request.addProperty("dateCreated", new Date());

        request.setMimeType("text/html");

        ixService.addItem(request);

        // allow some time to fetch & index doc
        Thread.sleep(5000);      // we wouldn't normally need this

        // look up the search service session bean
        SearchServiceHome ssHome = (SearchServiceHome) context.lookup("exodelta/SearchService");
        SearchService ssService = ssHome.create();

        SearchRequest search = new SearchRequest();

        search.setColumns("id,title,author,rating,datecreated, summary");
        search.setQuery("Open Source");      // see Apache docs for query specs

        SearchResults results = ssService.executeSearch(search);

        // print each hit returned so far
        while (results.moveNext())
        {
            String id = (String) results.getValue("id");
            String title = (String) results.getValue("title");
            String author = (String) results.getValue("author");
            Float rating = (Float) results.getValue("rating");
            Date dateAdded = (Date) results.getValue("datecreated");
            String summary = (String) results.getValue("summary");

            System.out.println("id: " + id + ", title: " + title + ", author: " + author
                               + ", rating: " + rating + ", dateAdded: " + dateAdded);
            System.out.println("summary: " + summary);
        }

        // clean up: remove the test document and release the session beans
        ixService.removeItem("jboss.org/index.html");
        ixService.remove();
        ssService.remove();
    }
}

License:

This software is made freely available under the terms and conditions of the GNU Lesser General Public License (LGPL). For details of the terms of this license, please refer to http://www.gnu.org/licenses/licenses.html#LGPL.

Advanced Configuration:

There are many configuration options to allow fine-tuning of the system - too many to detail here. For an example of a default configuration, you can take a look at the standard ejindex-service.xml file in html format here:

Filters:

To extract text that is suitable for indexing from a document, you need to use an appropriate ContentFilter. Filters are either implemented in Java, or as an external application that can extract the text and write it to stdout. Filters are mapped to specific MIME types or file extensions by specifying the mapping in the filtermappings section of the ejindex-service.xml file.

ejIndex comes with some filters for common document formats, or you can implement your own.
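
For the external-application case, the general pattern is to run the tool and capture what it writes to stdout. The sketch below shows that pattern in plain Java and is not the ejIndex filter code: ExternalFilterDemo and extractText are hypothetical names, and decoding the output as UTF-8 is an assumption (the real tool may use another encoding).

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class ExternalFilterDemo {

    // Runs the given command and returns everything it wrote to stdout as text.
    static String extractText(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Pass the tool's stderr through to ours so it cannot fill up and block.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        try (InputStream stdout = process.getInputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = stdout.read(buffer)) != -1) {
                captured.write(buffer, 0, n);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("filter command failed with exit code " + exitCode);
        }
        // Assumption: the tool writes UTF-8 text.
        return new String(captured.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: ExternalFilterDemo <executable> [arguments...]");
            return;
        }
        // args: the filter executable followed by its arguments, e.g. the file to read
        System.out.print(extractText(Arrays.asList(args)));
    }
}
```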

The standard filters currently provided are:

Text
Requires no special configuration.
HTML
Requires no special configuration.
MS Word
This filter uses an external application to read text from Word documents. If you need this filter, you need to install either wvWare or Antiword. There may also be other apps you could use, but I haven't checked; these two are open-source, free and seem to work well. You will also need to edit the command parameter in the docfilter section of the ejindex-service.xml file to specify the correct path to the executable.
MS Excel
This filter uses an external application to read text from Excel (.xls) documents. If you need this filter, you need to install xlhtml. You will also need to edit the command parameter in the xlsfilter section of the ejindex-service.xml file to specify the correct path to xlhtml.
MS PowerPoint
This filter uses an external application to read text from PowerPoint (.ppt) documents. If you need this filter, you need to install ppthtml (actually a part of the xlhtml project). You will also need to edit the command parameter in the pptfilter section of the ejindex-service.xml file to specify the correct path to ppthtml. *Note: ppthtml only works properly for ppt versions later than PPT 2000. For older versions, you should use the binary filter (or another app).
XML
Requires no special configuration. It will index text, cdata and comment nodes in xml files.
Binary
This is the default filter that is used if no appropriate filter-mapping can be found for a file. It will try to extract text from the file, ignoring control characters and repetitive character sequences etc.
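
The mapping-with-fallback behaviour described above can be sketched as follows. None of this is the ejIndex API: SketchFilter, FilterMappingDemo and filterFor are hypothetical names, the mapping table is hard-coded rather than read from ejindex-service.xml, and the binary fallback only keeps runs of four or more printable ASCII characters (it does not try to detect repetitive sequences).

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class FilterMappingDemo {

    // Hypothetical stand-in for a content filter; not the ejIndex interface.
    interface SketchFilter {
        String extract(byte[] data);
    }

    // Plain-text filter: decodes the bytes as-is.
    static final SketchFilter TEXT = data -> new String(data, StandardCharsets.ISO_8859_1);

    // Binary fallback: keeps runs of at least 4 printable ASCII characters
    // and drops everything else (control bytes, short fragments).
    static final SketchFilter BINARY = data -> {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i <= data.length; i++) {
            boolean printable = i < data.length && data[i] >= 0x20 && data[i] < 0x7f;
            if (printable) {
                run.append((char) data[i]);
            } else {
                // End of a run (or end of input): keep it only if long enough.
                if (run.length() >= 4) {
                    if (out.length() > 0) {
                        out.append(' ');
                    }
                    out.append(run);
                }
                run.setLength(0);
            }
        }
        return out.toString();
    };

    // MIME type to filter mappings (hard-coded here for the sketch).
    static final Map<String, SketchFilter> MAPPINGS = new HashMap<>();
    static {
        MAPPINGS.put("text/plain", TEXT);
    }

    // Picks the mapped filter for a MIME type, or the binary filter if none is mapped.
    static SketchFilter filterFor(String mimeType) {
        return MAPPINGS.getOrDefault(mimeType, BINARY);
    }

    public static void main(String[] args) {
        byte[] blob = {0x00, 0x01, 'J', 'B', 'o', 's', 's', 0x02, 'o', 'k', 0x03, 'i', 'n', 'd', 'e', 'x'};
        // No mapping exists for this type, so the binary fallback is used.
        System.out.println(filterFor("application/octet-stream").extract(blob)); // prints "JBoss index"
    }
}
```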

About:

This project came about for different reasons. Firstly, I wanted to write something to get more experience with J2EE app servers, and JBoss in particular. I have been working on information/document management applications for quite some time, and this is something I wished I had before. I also wanted to contribute something to the OS community that hopefully others might find useful. If you have any comments or suggestions (good or otherwise), please let me know via the forums or by email: andy at exodelta.com.