Overview:
ejIndex is a full-text indexing and search service implemented as a
JBoss MBean service. It uses the Apache Lucene index engine to provide
fast, efficient and stable text indexing and search facilities.
ejIndex wraps the Lucene facilities in a robust service implementation
with thread pooling, queueing and access synchronization, and exposes
concurrent interfaces for indexing, search and management. In addition,
some text filters are included to extract searchable text from common
document formats such as MS Word, MS Excel, HTML and XML (with some
help from other projects).
Features:
- Full-text indexing and searching of metadata
Applications can submit properties (metadata) to be full-text indexed
and searchable. Customizable schema definitions allow you to store and
retrieve different data types, and to specify whether properties are
indexed as tokenized strings (text) or non-tokenized values (keywords).
- Full-text indexing and searching of document content
In addition to indexing properties, you may also submit the URL of a
local or remote file to be indexed as the content (aka document-body).
Text filters are available for some common document formats such as
MS Word, MS Excel, HTML, XML, PDF and binary files. A very simple
interface is also provided so that you can create filters for other
custom document types.
- Implemented as an MBean service
The service can be deployed and managed via standard JBoss JMX
mechanisms. The index and search interfaces are registered with, and
accessible via, the JBoss JNDI service, allowing EJBs and other
services to access the index and search services. For remote access,
two Session EJBs are included which provide access to the
IndexService and SearchService interfaces.
- Robust thread pooling and synchronization
Configurable thread pools provide efficient search and indexing, and
make it possible to search and modify the index concurrently without
stopping any services.
- Configurable wait policies
Because all searches are executed asynchronously, you can configure
different "wait-policies": WAIT_FOR_HITS, which waits for at least a
specified number of hits, or WAIT_FOR_TIMEOUT, which waits for hits for
at most a specified amount of time. After retrieving results, you can
use a search-handle to retrieve more results (if available). This
allows for very responsive client applications, because the user always
sees the first results quickly.
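The difference between the two wait policies can be pictured with a small stand-alone sketch. It does not use the ejIndex API: apart from the policy names WAIT_FOR_HITS and WAIT_FOR_TIMEOUT, every name here (WaitPolicySketch, collectHits, the queue that carries hits) is an assumption made for illustration, and the sketch needs Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustration only: apart from the two policy names, nothing here is ejIndex API.
class WaitPolicySketch {

    enum WaitPolicy { WAIT_FOR_HITS, WAIT_FOR_TIMEOUT }

    /**
     * Drains hits that a background search thread puts on the queue.
     * WAIT_FOR_HITS blocks until minHits hits have arrived (timeoutMillis is ignored);
     * WAIT_FOR_TIMEOUT returns whatever arrives within timeoutMillis (minHits is ignored).
     */
    static List<String> collectHits(BlockingQueue<String> hits, WaitPolicy policy,
                                    int minHits, long timeoutMillis) {
        List<String> batch = new ArrayList<>();
        try {
            if (policy == WaitPolicy.WAIT_FOR_HITS) {
                // A real service would also stop waiting once the search has finished.
                while (batch.size() < minHits) {
                    batch.add(hits.take());
                }
            } else {
                long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
                long remaining;
                while ((remaining = deadline - System.nanoTime()) > 0) {
                    String hit = hits.poll(remaining, TimeUnit.NANOSECONDS);
                    if (hit == null) {
                        break; // timed out with no further hit
                    }
                    batch.add(hit);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag, return what we have
        }
        return batch;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> hits = new LinkedBlockingQueue<>();
        // Stand-in for an asynchronous search that produces five hits.
        Thread searcher = new Thread(() -> {
            for (int i = 1; i <= 5; i++) {
                hits.add("doc-" + i);
            }
        });
        searcher.start();

        System.out.println("first batch: " + collectHits(hits, WaitPolicy.WAIT_FOR_HITS, 3, 0));
        searcher.join();
        System.out.println("more results: " + collectHits(hits, WaitPolicy.WAIT_FOR_TIMEOUT, 0, 200));
    }
}
```

The second call plays the role of the search-handle: the client shows the first batch immediately and comes back later for whatever else has arrived.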
Quick Start (for the impatient like me):
Example Usage (SimpleTest.java in distribution file):
import java.net.URL;
import java.util.Date;
import java.util.Hashtable;
import javax.naming.InitialContext;
// The ejIndex classes used below (IndexServiceHome, IndexService,
// IndexRequest, SearchServiceHome, SearchService, SearchRequest,
// SearchResults) must also be imported from the ejIndex distribution.

public class SimpleTest
{
    public static void main(String[] args) throws Exception
    {
        // connect to the JBoss JNDI service
        Hashtable props = new Hashtable();
        props.put(InitialContext.INITIAL_CONTEXT_FACTORY,
                  "org.jnp.interfaces.NamingContextFactory");
        props.put(InitialContext.PROVIDER_URL, "jnp://127.0.0.1:1099");
        InitialContext context = new InitialContext(props);

        // look up the index service and submit a document
        IndexServiceHome ixHome =
            (IndexServiceHome) context.lookup("exodelta/IndexService");
        IndexService ixService = ixHome.create();

        IndexRequest request =
            new IndexRequest("jboss.org/index.html"); // sets unique ID
        request.setDocumentURL(new URL("http://jboss.org/index.html"));
        // these properties are defined in schema defs..
        request.addProperty("title", "jboss home");
        request.addProperty("author", "someone-at-jboss");
        request.addProperty("rating", new Float(9.5));
        request.addProperty("dateCreated", new Date());
        request.setMimeType("text/html");
        ixService.addItem(request);

        // allow some time to fetch & index doc
        Thread.sleep(5000); // we wouldn't normally need this

        // look up the search service and run a query
        SearchServiceHome ssHome =
            (SearchServiceHome) context.lookup("exodelta/SearchService");
        SearchService ssService = ssHome.create();

        SearchRequest search = new SearchRequest();
        search.setColumns("id,title,author,rating,datecreated, summary");
        search.setQuery("Open Source"); // see Apache docs for query specs
        SearchResults results = ssService.executeSearch(search);

        while (results.moveNext())
        {
            String id = (String) results.getValue("id");
            String title = (String) results.getValue("title");
            String author = (String) results.getValue("author");
            Float rating = (Float) results.getValue("rating");
            Date dateAdded = (Date) results.getValue("datecreated");
            String summary = (String) results.getValue("summary");
            System.out.println("id: " + id + ", title: " + title
                + ", author: " + author + ", rating: " + rating
                + ", dateAdded: " + dateAdded);
            System.out.println("summary: " + summary);
        }

        // clean up: remove the test document and release the EJBs
        ixService.removeItem("jboss.org/index.html");
        ixService.remove();
        ssService.remove();
    }
}
License:
This software is made freely available under
the terms and conditions of the GNU Lesser General Public License
(LGPL). For details of the terms of this license, please refer to http://www.gnu.org/licenses/licenses.html#LGPL.
Advanced Configuration:
There are many configuration options to allow fine-tuning of the
system, too many to detail here. For an example of a default
configuration, take a look at the standard ejindex-service.xml file.
Filters:
In order to extract text from documents that is suitable for indexing,
you need to use an appropriate ContentFilter. Filters are either
implemented in Java, or as an external application that can extract the
text and write it to stdout. Filters are mapped to specific mime-types
or file-extensions by specifying the mapping in the filtermappings
section of the ejindex-service.xml file.
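As a rough picture of what such a mapping does, here is a minimal stand-alone sketch. It is not ejIndex code: the class FilterMappingSketch, its methods, the lookup order (mime-type first, then file extension, then a default) and the filter name "binaryfilter" are all assumptions for illustration. The real mappings are read from ejindex-service.xml.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustration only: ejIndex reads its real mappings from ejindex-service.xml.
class FilterMappingSketch {
    private final Map<String, String> byMimeType = new HashMap<>();
    private final Map<String, String> byExtension = new HashMap<>();
    private final String defaultFilter;

    FilterMappingSketch(String defaultFilter) {
        this.defaultFilter = defaultFilter;
    }

    void mapMimeType(String mimeType, String filterName) {
        byMimeType.put(mimeType.toLowerCase(Locale.ROOT), filterName);
    }

    void mapExtension(String extension, String filterName) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), filterName);
    }

    /** Assumed lookup order: mime-type first, then file extension, then the default. */
    String resolve(String mimeType, String fileName) {
        if (mimeType != null) {
            String filter = byMimeType.get(mimeType.toLowerCase(Locale.ROOT));
            if (filter != null) {
                return filter;
            }
        }
        if (fileName != null) {
            int dot = fileName.lastIndexOf('.');
            if (dot >= 0 && dot < fileName.length() - 1) {
                String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
                String filter = byExtension.get(extension);
                if (filter != null) {
                    return filter;
                }
            }
        }
        return defaultFilter;
    }

    public static void main(String[] args) {
        FilterMappingSketch mappings = new FilterMappingSketch("binaryfilter");
        mappings.mapMimeType("application/msword", "docfilter");
        mappings.mapExtension("xls", "xlsfilter");

        System.out.println(mappings.resolve("application/msword", "report.bin")); // docfilter
        System.out.println(mappings.resolve(null, "Budget.XLS"));                 // xlsfilter
        System.out.println(mappings.resolve(null, "firmware.img"));               // binaryfilter
    }
}
```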
ejIndex comes with some
filters for common document formats, or you can implement your own.
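If you write your own filter around an external application, the core of it is running the tool and capturing what it writes to stdout. The following is a minimal sketch of that idea only. It does not implement the actual ContentFilter interface, whose signature is not shown here; the class name, the method names and the assumption that the tool emits UTF-8 are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: this is not the ejIndex ContentFilter interface.
class ExternalCommandFilterSketch {
    private final List<String> command;

    ExternalCommandFilterSketch(List<String> command) {
        this.command = new ArrayList<>(command);
    }

    /** Runs "<command> <documentPath>" and returns everything the tool wrote to stdout. */
    String extractText(String documentPath) throws IOException, InterruptedException {
        List<String> fullCommand = new ArrayList<>(command);
        fullCommand.add(documentPath);
        Process process = new ProcessBuilder(fullCommand)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // unread stderr must not block the tool
                .start();
        process.getOutputStream().close(); // the tool gets no stdin
        String text;
        try (InputStream stdout = process.getInputStream()) {
            text = readAll(stdout);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("filter command exited with code " + exitCode);
        }
        return text;
    }

    /** Reads a stream to the end, decoding it as UTF-8 (an assumption about the tool's output). */
    static String readAll(InputStream in) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: ExternalCommandFilterSketch <command...> <document>");
            return;
        }
        List<String> command = Arrays.asList(args).subList(0, args.length - 1);
        String document = args[args.length - 1];
        System.out.println(new ExternalCommandFilterSketch(command).extractText(document));
    }
}
```

For example, assuming Antiword is installed and on the PATH, running the class with the arguments "antiword letter.doc" would print the text of letter.doc.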
The standard filters
currently provided are:
- Text
Requires no special configuration.
- HTML
Requires no special configuration.
- MS Word
This filter uses an external application to read text from Word
documents. If you need this filter, you need to install either wvWare
or Antiword. There may be other applications you could use, but I
haven't checked; these two are open-source, free and seem to work well.
You will also need to edit the command parameter in the docfilter
section of the ejindex-service.xml file to specify the correct path to
the executable.
- MS Excel
This filter uses an external application to read text from Excel (.xls)
documents. If you need this filter, you need to install xlhtml. You will
also need to edit the command parameter in the xlsfilter section of the
ejindex-service.xml file to specify the correct path to xlhtml.
- MS PowerPoint
This filter uses an external application to read text from PowerPoint
(.ppt) documents. If you need this filter, you need to install ppthtml
(actually a part of the xlhtml project). You will also need to edit the
command parameter in the pptfilter section of the ejindex-service.xml
file to specify the correct path to ppthtml. *Note: ppthtml only works
properly for ppt versions newer than PPT 2000. For older versions, you
should use the binary filter (or another app).
- XML
Requires no special configuration. It will index text, CDATA and
comment nodes in XML files.
- Binary
This is the default filter that is used if no appropriate
filter-mapping can be found for a file. It will try to extract text from
the file, ignoring control characters and repetitive character sequences
etc.
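To give an idea of what extracting text from an arbitrary binary file can involve, here is a minimal stand-alone sketch. It is only a guess at the general approach, not the algorithm the binary filter actually uses: the class BinaryTextSketch and the parameters minRunLength and maxRepeat are hypothetical, and only printable ASCII is treated as text.

```java
// Illustration only: the actual ejIndex binary filter may work differently.
class BinaryTextSketch {

    /**
     * Keeps runs of printable ASCII that are at least minRunLength long and
     * drops characters that would repeat more than maxRepeat times in a row.
     */
    static String extractText(byte[] data, int minRunLength, int maxRepeat) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            char c = (char) (b & 0xFF);
            if (c >= 0x20 && c < 0x7F) {
                if (!wouldOverRepeat(run, c, maxRepeat)) {
                    run.append(c);
                }
            } else {
                flush(run, out, minRunLength); // a control or non-ASCII byte ends the run
            }
        }
        flush(run, out, minRunLength);
        return out.toString();
    }

    /** True if the run already ends with maxRepeat copies of c. */
    private static boolean wouldOverRepeat(StringBuilder run, char c, int maxRepeat) {
        int len = run.length();
        if (len < maxRepeat) {
            return false;
        }
        for (int i = 1; i <= maxRepeat; i++) {
            if (run.charAt(len - i) != c) {
                return false;
            }
        }
        return true;
    }

    /** Appends the current run to the output if it is long enough, then resets it. */
    private static void flush(StringBuilder run, StringBuilder out, int minRunLength) {
        String word = run.toString().trim();
        if (word.length() >= minRunLength) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word);
        }
        run.setLength(0);
    }

    public static void main(String[] args) {
        byte[] sample = {0, 1, 'H', 'e', 'l', 'l', 'o', 0, 2, 'a', 'b', 3,
                         '-', '-', '-', '-', '-', '-', 'w', 'o', 'r', 'l', 'd'};
        System.out.println(extractText(sample, 4, 2)); // prints "Hello --world"
    }
}
```

In the sample, the control bytes split the input into runs, the two-character run "ab" is too short to keep, and the six dashes are cut down to two.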
About:
This project came about for several reasons. Firstly, I wanted to write
something to get more experience with J2EE app servers, and JBoss in
particular. I have been working on information/document management
applications for quite some time, and this is something I wish I had
had before. I also wanted to contribute something to the open-source
community that others might hopefully find useful. If you have any
comments or suggestions (good or otherwise), please let me know via the
forums or by email: andy at exodelta.com.
Copyright ©2003 Andy
Scholz.