Lucene Wrapper Project

Chapter One: Life Before Lucene

A large part of the Objectify CMS (well, any good CMS) is its search capabilities. This is compounded by the fact that the Objectify CMS includes a powerful Knowledge Management System (KMS) component as well.

A few years ago I wrote my first ever search engine… it was a powerful, fully custom built indexing system with full text indexing, stemming and soundexing (sic), for use with the Objectify CMS and KMS. It worked a treat, was fast, and handled both the CMS’s HTML content and its XML based Knowledge Base content… but it had a few flaws. Its query syntax was limited, it was complex and hard to maintain, and it didn’t scale beyond about 10,000 records (paltry??)… the memory requirements grew along with the index.

Building this search engine was a long and arduous task, with many gotchas, pitfalls and hurdles along the way. Performance and scalability were perhaps the largest and most difficult issues to deal with (err… sounds like any IT project to me – ed). Long story short, the system worked well, but the time came to add a raft of new features and address the scalability concerns. The query syntax needed a few tweaks also.

Chapter Two: My new love – Lucene

Oh Lucene how I love thee. The decision was hard, but I decided it would be best to fully re-write the search engine in Lucene – boy am I glad I did.

Chapter Three: The first date

It took me a while to sort out the best way to manage documents in Lucene, all the options, the patterns and practices, and how to integrate it into the Objectify CMS and KMS products.

Basically, Lucene is built to be really generic. It can work in just about any situation where you need search: Windows Forms applications, web applications, services, you name it. And it can scale… and scale and scale… 13 million records on a moderately equipped machine is nothing to Lucene… and it can handle multiple gigabyte indexes with ease. Long story short, it’s a set of search engine base classes, certainly not a fully fledged search engine. It’s up to the developer to implement all the features… which is probably why you are reading this: you want full control.

Chapter Four: Third base

With all this in mind, I thought I would put together a little class library that you can use to create and search with. It’s really quite simple… just grab the code, grab Lucene.NET and build away!

My aim with this project is to hit one of two goals:

1) You take this project and use it in your own projects with little or no change, and it helps out immensely with your project deadlines.

2) You take this project, pull it apart and learn all about how to implement Lucene.NET by example – cutting down the time it takes to get your own test projects up and running.

Chapter Five: How to use this project

Using this project is quite straightforward. Look in the lJak Tests project for a test class which should have all the information you require.

Lucene.NET and other search engines work great with TDD (Test Driven Development) so use the tests to your advantage as you develop – it will save you piles of time.

The basics are:

1. Choose an ID for your document. Something like the file name of the source page, a GUID or the ID of the piece of content from your database. You will use this ID later to refer back to the original document from the search results.

2. Split up your document into logical blocks. You may want to split an HTML page into areas like Title, Heading, and each P/DIV tag or something. That way the end user could search on just title, or just body etc.

3. Parse apart the content. This code does not include a parser, so you have to break the content up yourself. It would be a good idea to strip out HTML etc. also. For the CMS/KMS products I wrote a parser that takes the content XML and breaks it up based on tag name, id and whether it’s an attribute or child node etc. If I get some time I may post a sample parser.
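
Until a real sample parser turns up, here is a rough sketch of what steps 2 and 3 could look like for a plain HTML page. It is not the CMS/KMS parser mentioned above. SimpleHtmlSplitter, Split and ToPlainText are hypothetical names made up for this example, and regular expressions are only good enough for simple, well-behaved markup.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the wrapper: splits an HTML page into
// named blocks of plain text, one block per field you plan to index.
public static class SimpleHtmlSplitter
{
    public static Dictionary<string, string> Split(string html)
    {
        var fields = new Dictionary<string, string>();

        // Use the <title> element as the "Title" block, if there is one.
        Match title = Regex.Match(html, @"<title[^>]*>(.*?)</title>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        fields["Title"] = title.Success ? ToPlainText(title.Groups[1].Value) : "";

        // Use everything inside <body> as the "Content" block, falling back
        // to the whole page when no <body> element is present.
        Match body = Regex.Match(html, @"<body[^>]*>(.*?)</body>",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        fields["Content"] = ToPlainText(body.Success ? body.Groups[1].Value : html);

        return fields;
    }

    public static string ToPlainText(string fragment)
    {
        // Drop script and style blocks entirely, then strip the remaining tags.
        string text = Regex.Replace(fragment, @"<(script|style)[^>]*>.*?</\1>", " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        text = Regex.Replace(text, @"<[^>]+>", " ");

        // Decode entities such as &amp; and collapse runs of whitespace.
        text = WebUtility.HtmlDecode(text);
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

Each entry in the returned dictionary is one block of plain text, so the "Title" and "Content" values can be handed to the indexing step as separate fields.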

4. Instantiate a LuceneUpdateDocument with your ID and the name of your index.

string id = "somefile.aspx";
string indexName = "TestIndex";
LuceneUpdateDocument target = new LuceneUpdateDocument(id, indexName);

The index will be created automatically in an Index folder under the loading assembly’s base directory (that is, base directory\Index\index name). You can change this behaviour in:


public static IndexWriter GetIndexWriter(string name)

...

DirectoryInfo indexBase = new DirectoryInfo(AppDomain.CurrentDomain.BaseDirectory + "\\Index\\" + name);

...

Note: Hrm, I must have been tired when I wrote that, should have used String.Format.
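
Another option, instead of String.Format, is to let Path.Combine deal with the separators. This is only a sketch: IndexPaths and GetIndexDirectory are hypothetical names, not something in the wrapper, and the nested two-argument calls are used so it also compiles on older versions of the framework.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the wrapper.
public static class IndexPaths
{
    // Builds <base directory>\Index\<name> without hand-written separators.
    public static DirectoryInfo GetIndexDirectory(string name)
    {
        string indexRoot = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "Index");
        return new DirectoryInfo(Path.Combine(indexRoot, name));
    }
}
```

One thing to watch: if name is itself a rooted path, Path.Combine returns it unchanged, so only pass plain index names.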

5. Add fields to the document. These are what will actually be searched. You can add a field so that it can be searched and read back from the index, or so that it can be searched but not read back. Use the first option for attributes like filename, date and other small pieces of metadata, and the second option for the main bulk of content. Your index will be too big if you try to store everything in there, so ensure your main content is indexed but not stored (and thus cannot be read back).

To index attributes:


target.AddField("Title", "This is a test document", LuceneUpdateDocument.FieldStoreType.Attribute);

To index main content areas:


target.AddField("Content", "This is the main body content", LuceneUpdateDocument.FieldStoreType.Content);

6. Commit the new items to the index. This indexes the content and writes it to the index. The second call clears out some internal caching that the wrapper class performs.

target.Commit();
Index.ClearCache();

7. Search your new index! Searching is easy!


//Create an instance of the searching class, passing in the same index that you used to add the content
Search s = new Search(indexName);
//This line searches for the word "document" in the Title and Content fields.
List<SearchResult> results = s.QueryBasic("document", "Title", "Content");

//This line searches for the word "Main" in the Title and Content fields.
results = s.QueryBasic("Main", "Title", "Content");

foreach (SearchResult sr in results)
{
     string title = sr.GetFieldVal("Title");
     Assert.AreEqual("This is a test document", title);
}

As you can see above, to search the index the syntax is

 s.QueryBasic("Some Query", "FieldOne", "FieldTwo", ..., ..., "FieldX");

The first parameter is the search query. The full syntax is covered in the Lucene query syntax documentation. After the first parameter you can list as many search fields as you like using the parameter array. Only the fields you list here (or any fields you specify in the query using special syntax) will be scanned by Lucene.
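
To give a feel for that syntax, here are a few query strings written in the standard Lucene query parser syntax. Treat this as a sketch: ExampleQueries is a hypothetical class made up for this example, the field names are the ones used in the sample above, and whether every form behaves as described depends on the Lucene.NET version and the analyzer the wrapper is using.

```csharp
// Hypothetical holder for sample query strings, not part of the wrapper.
public static class ExampleQueries
{
    public static readonly string[] All =
    {
        "document",             // a single term
        "\"test document\"",    // an exact phrase
        "Title:test",           // a term restricted to the Title field
        "main AND body",        // both terms must be present
        "main OR title",        // either term may be present
        "test -draft",          // matches test, excludes anything containing draft
        "doc*",                 // prefix wildcard
        "documnet~"             // fuzzy match, tolerates small misspellings
    };
}
```

Any of these strings can be passed as the first parameter, for example s.QueryBasic(ExampleQueries.All[1], "Title", "Content"). Field names in a query are case sensitive, so Title:test and title:test are not the same thing.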

8. Reading the content from the index. Searching isn’t much good unless you can read information from your search results! Reading your attributes is easy, because they are searchable and stored in the index for quick retrieval (remember above how some items were marked as attributes and stored, whereas other items were marked as main content and thus not stored). Note that the code sample above has a foreach statement which iterates over each search result. Inside the loop it pulls out the Title field and performs an assert on it.

string title = sr.GetFieldVal("Title");

It’s too easy to get the stored attributes. But how about main content? Well this is up to you. Using the ID that you stored earlier (get it out like the title attribute above) load the original document and read what you need – or just redirect the user to it to view. Review the Lucene query documentation for a full list of all the cool stuff you can do!
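
As a rough sketch of the "load the original document" part, assume the ID is a file name relative to some content folder, like "somefile.aspx" in the sample above. OriginalDocumentLoader, LoadPreview and contentRoot are hypothetical names, not part of the wrapper. If your IDs are GUIDs or database keys, the lookup would be a database query instead.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the wrapper.
public static class OriginalDocumentLoader
{
    // Assumes the document ID is a file name relative to contentRoot.
    // Returns up to maxChars characters of the file, or null if it is missing.
    public static string LoadPreview(string contentRoot, string id, int maxChars)
    {
        string path = Path.Combine(contentRoot, id);
        if (!File.Exists(path))
        {
            // The index may be stale, or the ID may not be a file name at all.
            return null;
        }

        string text = File.ReadAllText(path);
        return text.Length <= maxChars ? text : text.Substring(0, maxChars);
    }
}
```

A null return is worth handling in the results page, since an index can easily outlive the documents it points to.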

Chapter Six: That’s a lovely story, but I don’t give a crap – where is the code man??

Firstly you will need to get the Lucene.NET DLL from here http://incubator.apache.org/lucene.net/. I didn’t include it in the download as I wasn’t sure of the legality – and I couldn’t be bothered reading their re-distribution policy.

LJAK – Lucene.NET Wrapper Class (Source).

Documentation (haha just kidding, like there is documentation).

Legal – Please read legal stuff at the top of the LuceneDocument class.

Happy indexing and searching!