Lucene indexing in Cinnamon

How indexing and searching in Cinnamon works.

Overview

The Cinnamon server uses Lucene to provide search capabilities for the client. The search is highly configurable and lets you search documents semantically: instead of a result like "1000 documents contain the word 'engine'", you can get "10 documents contain the word 'engine' in their title element". Depending on how the index is created, you can include or exclude documents via XPath statements.

Indexing

Cinnamon parses documents and folders, extracts system metadata, custom metadata and content, as applicable, and tests whether they match predefined conditions. If the conditions are met, an XPath expression is evaluated against the data and the result is stored in a Lucene index field.

Example of HTML indexing
Cinnamon checks whether each new object has the format HTML (rather than, say, 'JPEG') and the type document (rather than 'template'). If an HTML document has been found, Cinnamon extracts the //head/title element from the content and stores its value in the Lucene index field named document.title.
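
The HTML example can be sketched in a few lines of Python. This is only an illustration of the check-then-extract idea, not Cinnamon's actual (Java) implementation: the dict layout of the object, the function name index_html_title and the use of a plain dict as a stand-in for the Lucene document are all assumptions. Python's ElementTree supports only a subset of XPath, so //head/title is written as .//head/title, and the content must be well-formed XML.

```python
import xml.etree.ElementTree as ET


def index_html_title(obj, fields):
    """Store the title of an HTML document object in `fields`.

    `obj` and `fields` are plain dicts standing in for a Cinnamon object
    and a Lucene document (hypothetical simplifications).
    Returns True if a value was indexed.
    """
    # Condition: only objects of format HTML and type document qualify.
    if obj["format"] != "HTML" or obj["type"] != "document":
        return False

    # Parse the content; this requires well-formed (X)HTML.
    root = ET.fromstring(obj["content"])

    # ElementTree equivalent of the XPath expression //head/title.
    title = root.find(".//head/title")
    if title is None or title.text is None:
        return False

    # Lucene fields hold strings, so the element text is stored as-is.
    fields["document.title"] = title.text.strip()
    return True


page = {
    "format": "HTML",
    "type": "document",
    "content": "<html><head><title>Engine manual</title></head><body/></html>",
}
fields = {}
index_html_title(page, fields)
print(fields)  # {'document.title': 'Engine manual'}
```

An object with format 'JPEG' or type 'template' fails the condition and leaves the field dict untouched.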

IndexTypes

While the IndexItem defines the when and where of the indexing task, the IndexType defines the how. The Lucene index is string-based, so all values found have to be converted into strings. Sometimes the default indexing behavior of tokenizing a string at word boundaries simply does not cut it: you want to search for "DIN A4" and find the standard, not every document with a "din" in one place and an "a4" somewhere else. (Yes, there is proximity search, but that only takes you so far.) That is the moment to consider one of the specialized indexers, which are defined by the IndexTypes. An IndexType is the combination of a custom Java class, an optional value assistance provider and a basic data type (Boolean, String, Date, Number, etc.). A selection of IndexTypes is provided with the Cinnamon server, but you may add others (how about an OCR indexer for license plate images?).
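
To see why tokenizing at word boundaries falls short for "DIN A4", here is a small Python sketch that compares the two strategies. It does not use Lucene and does not reproduce any Cinnamon IndexType; the function names and the two sample values are made up for illustration.

```python
import re

# Two sample field values: one is the paper format, one merely mentions both words.
docs = {
    "norm": "DIN A4",
    "noise": "The din near the A4 motorway",
}


def tokenized_terms(value):
    """Default-style indexing: lower-case the value and split it into words."""
    return re.findall(r"\w+", value.lower())


def keyword_terms(value):
    """Specialized indexing: keep the whole value as one normalized term."""
    return [" ".join(value.lower().split())]


def search_tokenized(query):
    """Match every document that contains all query words, in any position."""
    wanted = set(tokenized_terms(query))
    return sorted(name for name, text in docs.items()
                  if wanted <= set(tokenized_terms(text)))


def search_keyword(query):
    """Match only documents whose whole value equals the query."""
    wanted = keyword_terms(query)
    return sorted(name for name, text in docs.items()
                  if keyword_terms(text) == wanted)


print(search_tokenized("DIN A4"))  # ['noise', 'norm']
print(search_keyword("DIN A4"))    # ['norm']
```

The tokenized search also returns the sentence about the motorway, because it contains both words; the keyword-style search returns only the value that is exactly "DIN A4".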

IndexItems

The combination of a search condition (in the example: format and object type), the kind of data to search (here: the content, not the metadata), a search expression ("//head/title") and a Lucene field name ("document.title") is called an IndexItem. You may define as many IndexItems as you wish (within reason).
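
As a rough mental model, an IndexItem can be pictured as a small record with those four parts. The Python sketch below is an assumption-laden illustration, not Cinnamon's data model: the class layout, the apply method, the dict-based object and the field name document.author are all hypothetical, and the search expressions use the XPath subset that Python's ElementTree understands.

```python
from dataclasses import dataclass
from typing import Callable
import xml.etree.ElementTree as ET


@dataclass
class IndexItem:
    condition: Callable[[dict], bool]  # which objects qualify
    source: str                        # "content" or "metadata"
    path: str                          # search expression (ElementTree XPath subset)
    fieldname: str                     # index field that receives the values

    def apply(self, obj, fields):
        """Evaluate the expression against `obj` and collect matches in `fields`."""
        if not self.condition(obj):
            return
        root = ET.fromstring(obj[self.source])
        for node in root.findall(self.path):
            if node.text:
                fields.setdefault(self.fieldname, []).append(node.text.strip())


items = [
    # The item from the HTML example: title element of HTML documents.
    IndexItem(lambda o: o["format"] == "HTML" and o["type"] == "document",
              "content", ".//head/title", "document.title"),
    # A second, made-up item that reads from the metadata instead.
    IndexItem(lambda o: o["type"] == "document",
              "metadata", "./author", "document.author"),
]

obj = {
    "format": "HTML",
    "type": "document",
    "content": "<html><head><title>Engine manual</title></head><body/></html>",
    "metadata": "<meta><author>Jane Doe</author></meta>",
}

fields = {}
for item in items:
    item.apply(obj, fields)
print(fields)
# {'document.title': ['Engine manual'], 'document.author': ['Jane Doe']}
```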

IndexGroups

IndexItems may be organized into IndexGroups. A Cinnamon client may choose to display different search forms depending on the circumstances. For example, a dialog for searching images may have different fields than one for searching the audit logs of a Cinnamon application.

Configuring Lucene in Cinnamon

Some aspects of Lucene in the Cinnamon server can be configured in a file named "lucene.properties", which should be placed in CINNAMON_HOME_DIR.

# lucene.properties: configuration file for Lucene indexer
indexDir=/home/cinnamon/cinnamon-system/index
sleepBetweenRuns=5000
itemsPerRun=100
logModulus=1
  • indexDir is the directory where the Lucene index will be stored (each repository gets its own folder).
  • sleepBetweenRuns is the number of milliseconds the IndexServer sleeps between runs. This prevents it from running continuously, which would reduce the overall performance of the server because the database would be queried many times per second.
  • itemsPerRun is the number of objects that will be indexed before the IndexServer thread goes back to sleep.
  • logModulus determines how often the IndexServer switches on full debug logging for its recurring status messages. Because the thread wakes up and tries to index new items every five seconds (with the sleepBetweenRuns value shown above), you may wish to increase this number to reduce log file spam. Errors and exceptions are logged regardless of logModulus.
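
How these settings interact can be sketched in Python. The parser below handles only simple key=value lines and '#' comments, not the full Java properties syntax. The function plan_runs is hypothetical: it assumes that full status logging is switched on for every run whose number is divisible by logModulus, which is one plausible reading of the description above and not confirmed behavior. It only plans the runs and does not sleep; a real loop would pause for sleepBetweenRuns milliseconds after each run.

```python
CONFIG = """\
# lucene.properties: configuration file for Lucene indexer
indexDir=/home/cinnamon/cinnamon-system/index
sleepBetweenRuns=5000
itemsPerRun=100
logModulus=1
"""


def parse_properties(text):
    """Parse simple key=value lines; blank lines and '#' comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def plan_runs(pending, props):
    """Return (run number, batch size, verbose?) tuples for a backlog of objects."""
    per_run = int(props["itemsPerRun"])
    modulus = int(props["logModulus"])
    runs = []
    run = 0
    while pending > 0:
        run += 1
        # At most itemsPerRun objects are indexed before the thread sleeps again.
        batch = min(per_run, pending)
        pending -= batch
        # Assumption: verbose status logging on every logModulus-th run.
        runs.append((run, batch, run % modulus == 0))
    return runs


props = parse_properties(CONFIG)
print(plan_runs(250, props))
# [(1, 100, True), (2, 100, True), (3, 50, True)]
```

Under this assumption, raising logModulus to 10 would make only every tenth run produce the full status output, while the batch sizes stay the same.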