
One of the reasons to use Hibernate Search is that your valued business data stays in the database: a reliable, transactional and relational store. So while Hibernate Search keeps the index in sync with the database during regular use, on several occasions you'll need to be able to rebuild the indexes from scratch:

  • Data existed before you introduced Hibernate Search
  • New features are developed, index mapping changed
  • A database backup is restored
  • Batch changes are applied on the database
  • ...you get the point, this list could be very long

Evolving, user driven

An API to perform this operation has always existed in previous versions of Hibernate Search, but questions about how to make it faster weren't unusual on the forums. Keep in mind that rebuilding the whole index basically means that you have to load all indexed entities from the database into the Java world to feed Lucene with data to index. I'm a user myself, and the code for the new MassIndexer API was tuned after field experience on several applications and with much involvement of the community.

QuickStart: MassIndexer API

Since version 3.0 the documentation provided a recommended re-indexing routine; this method is still available but a new API providing better performance was added in version 3.2. No configuration changes are required, just start it:

FullTextSession fullTextSession = ...
MassIndexer massIndexer = fullTextSession.createIndexer();
massIndexer.startAndWait();

The above code will block until all entities are reindexed. If you don't need to wait for it use the asynchronous method:

fullTextSession.createIndexer().start();

Selecting the entities to rebuild the index for

You don't need to rebuild the index for all indexed entities; let's say you want to re-index only the DVDs:

fullTextSession.createIndexer( Dvd.class ).startAndWait();

This will include all of Dvd's subclasses, as all Hibernate Search's APIs are polymorphic.

Index optimization and clearing

Since in Lucene's world an update is implemented as a remove followed by an add, we need to remove all entities from the index before adding them again. This operation is known as purgeAll in Hibernate Search. By default the index is purged and optimized at start, and optimized again at the end; you might opt for a different strategy, but keep in mind that by disabling the purge operation you could later find duplicates. The optimization after purge is applied to save some disk space, as recommended in Hibernate Search in Action.

fullTextSession.createIndexer()
   .purgeAllOnStart( true ) // true by default, highly recommended
   .optimizeAfterPurge( true ) // true by default, saves some disk space
   .optimizeOnFinish( true ) // true by default
   .start();
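
To make the duplicates risk concrete, here is a minimal sketch in plain Java. It is not Hibernate Search or Lucene code: ToyIndex and its methods are hypothetical names invented for illustration, and the only thing it models is an append-only index where re-adding an entity without purging first leaves two copies behind.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy index, NOT the Hibernate Search API.
// Like Lucene it has no in-place update: adding an entity twice
// simply leaves two documents in the index.
class ToyIndex {
    private final List<String> documents = new ArrayList<>();

    void add(String entityId) {
        documents.add(entityId);
    }

    void purgeAll() {
        documents.clear();
    }

    long countFor(String entityId) {
        return documents.stream().filter(entityId::equals).count();
    }

    // Rebuilds the index from the ids found in the "database",
    // optionally purging the existing documents first.
    void rebuild(List<String> databaseIds, boolean purgeAllOnStart) {
        if (purgeAllOnStart) {
            purgeAll();
        }
        for (String id : databaseIds) {
            add(id);
        }
    }
}
```

Rebuilding twice with purgeAllOnStart set to false leaves every entity in the toy index twice; rebuilding with true brings it back to one document per entity.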

Faster, Faster!

A MassIndexer is very sensitive to tuning; some settings can make it orders of magnitude faster when tuned properly, and the good values depend on your specific mapping, environment, database, and even your content. To find out which settings you need to tune you should be aware of some implementation details.

The MassIndexer uses a pipeline with different specialized threads working on it, so most processing is done concurrently. The following explains the process for a single entity type, but this is actually done in parallel jobs for each different entity when you have more than one indexed type:

  1. A single thread named identifier-loader scrolls over the primary keys of the type. The number of threads for this stage is always one so that a transaction can define the set of keys to consider. So a first hint is to use simple keys: avoid complex types, as the loading of keys will always be serialized.
  2. The loaded primary keys are pushed to an id queue; there's one such queue for each root type to be indexed.
  3. At the second stage a threadpool called entity-loader loads batches of entities using the provided primary keys. You can tune the number of threads working on this task concurrently (threadsToLoadObjects(int)) and the number of entities fetched per iteration (batchSizeToLoadObjects(int)). While idle threads might be considered a waste, this is minor and it's better to have some spare threads doing nothing than the opposite. Make sure you don't make it too hard for the database by requesting too much data: setting too large a batch size or way too many threads will also hurt, so you will have to find the sweet spot. The queues will work as buffers to mitigate the effect of performance highs and lows due to different data, so finding the sweet spot is not a quest for the exact value but about finding a reasonable value.
  4. The entity queue contains a stream of entities needing conversion into Lucene Documents; it's fed by the entity-loader threads and consumed by the document-creator threads.
  5. The document-creator threads will convert the entities into Lucene Documents (apply your Search mapping and custom bridges, transform data into text). It is important to understand that some lazy objects might still be loaded from the database during conversion (step 7 in the picture). So this step could be cheap or expensive: depending on your domain model and how you mapped it, there could be more round trips to the database or none at all. Second level cache interactions might help or hurt in this phase.
  6. The document queue should be a constant stream of Documents to be added to the index. If this queue is mostly near-empty, it means you're producing the data more slowly than Lucene is able to analyse it and write it to the index. If this queue is mostly full, it means you're producing the Documents faster than Lucene is able to write them to the index. Always consider that Lucene is analysing the text during the write operation, so if this is slow it's not necessarily related to I/O limits: you could have expensive analysis. To find out you'll need a profiler.
  7. The number of document indexer threads is also configurable, so in case of expensive analysis you can have more CPUs working on it.

The queues are blocking and bounded, so there's no danger in setting too many producer threads for any stage: if a queue fills up the producers will be set on hold until space is available. All thread pools have names assigned, so if you connect with a profiler or debugger the different stages can be promptly identified.
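
The staged design described above can be sketched with plain java.util.concurrent classes. This is only an illustration under simplifying assumptions, not the MassIndexer implementation: ToyPipeline, the thread names and the end-of-stream markers are all made up, each stage runs on a single thread instead of a pool, and "loading" or "indexing" is reduced to string handling. What it does show is bounded blocking queues putting a fast producer on hold until the consumer catches up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical three-stage pipeline, NOT the real MassIndexer code.
class ToyPipeline {
    private static final long END_OF_IDS = -1L;              // marker: no more ids
    private static final String END_OF_DOCUMENTS = "<end>";  // marker: no more documents

    static List<String> run(int entityCount, int queueCapacity) {
        BlockingQueue<Long> idQueue = new ArrayBlockingQueue<>(queueCapacity);
        BlockingQueue<String> documentQueue = new ArrayBlockingQueue<>(queueCapacity);
        List<String> index = new ArrayList<>(); // written only by the indexer thread

        Thread identifierLoader = new Thread(() -> {
            try {
                for (long id = 1; id <= entityCount; id++) {
                    idQueue.put(id); // blocks while the queue is full
                }
                idQueue.put(END_OF_IDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "toy-identifier-loader");

        Thread documentCreator = new Thread(() -> {
            try {
                for (long id = idQueue.take(); id != END_OF_IDS; id = idQueue.take()) {
                    documentQueue.put("document-" + id); // stands in for load + convert
                }
                documentQueue.put(END_OF_DOCUMENTS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "toy-document-creator");

        Thread documentIndexer = new Thread(() -> {
            try {
                for (String doc = documentQueue.take(); !doc.equals(END_OF_DOCUMENTS);
                        doc = documentQueue.take()) {
                    index.add(doc); // stands in for the index write
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "toy-document-indexer");

        identifierLoader.start();
        documentCreator.start();
        documentIndexer.start();
        try {
            identifierLoader.join();
            documentCreator.join();
            documentIndexer.join(); // join makes the indexer's writes visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the pipeline", e);
        }
        return index;
    }
}
```

Calling ToyPipeline.run( 100, 4 ) pushes 100 ids through queues that never hold more than 4 elements each. The named threads play the same role as the named pools mentioned above: they make the stages easy to spot in a profiler or a thread dump.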

API for data load tuning

The following settings rebuild my personal reference database in 3 minutes; it took 6 hours before I enabled these settings, and 2 months before I looked into any kind of Hibernate or Lucene tuning.

fullTextSession.createIndexer()
   .batchSizeToLoadObjects( 30 )
   .threadsForSubsequentFetching( 8 )
   .threadsToLoadObjects( 4 )
   .threadsForIndexWriter( 3 )
   .cacheMode(CacheMode.NORMAL) // defaults to CacheMode.IGNORE
   .startAndWait();

Caching

When some information is embedded in the index from entities having a low cardinality (a high cache hit ratio), for example when there's a ManyToOne relation to gender or countrycode, it might make sense to enable the cache, which is ignored by default. In most cases ignoring the cache will result in the best performance, especially if you're using a distributed cache, which would introduce unneeded network events.
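
As a rough illustration of why low cardinality matters, the following plain Java sketch counts how many "database loads" a lookup needs with and without a cache. ToyCountryLookup is a hypothetical class, not part of Hibernate, and the sketch deliberately ignores the cost of the cache itself (memory, or network traffic for a distributed cache), which is exactly what makes CacheMode.IGNORE the better choice in most cases.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup of a low-cardinality association, NOT Hibernate code.
class ToyCountryLookup {
    private final Map<String, String> cache = new HashMap<>();
    private final boolean cacheEnabled;
    private int databaseLoads = 0;

    ToyCountryLookup(boolean cacheEnabled) {
        this.cacheEnabled = cacheEnabled;
    }

    String countryName(String countryCode) {
        if (cacheEnabled) {
            // Only the first request for each code reaches the "database".
            return cache.computeIfAbsent(countryCode, this::loadFromDatabase);
        }
        return loadFromDatabase(countryCode);
    }

    // Stands in for a round trip to the database.
    private String loadFromDatabase(String countryCode) {
        databaseLoads++;
        return "Country " + countryCode;
    }

    int databaseLoads() {
        return databaseLoads;
    }
}
```

Resolving 999 entities spread over three country codes costs three loads with the cache enabled and 999 without it. With a high-cardinality association almost every lookup would be a miss, and the cache would only add overhead.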

Offline job

While normally all changes done by Hibernate Search are coupled to a transaction, the MassIndexer uses several transactions and consistency is not guaranteed if you make changes to data while it's running. The index will only contain the entities which existed in the database when the job started, and any update made to the index in this timeframe by other means might be lost. While nothing wrong would happen to the data in the database, the index might be inconsistent if changes are made while the job is busy.

Performance checklist

After having parallelized the indexing code, there are some other bottlenecks to avoid:

  1. Check your database behavior: almost all databases provide a profiling tool which can provide valuable information when run during a mass indexing
  2. Use a connection pool and size it properly: having more Hibernate Search threads accessing your database won't help when they have to contend for database connections
  3. Setting EAGER loading on properties not needed by Hibernate Search will have them loaded anyway; avoid it
  4. Check for network latency
  5. The effect of tuning settings doesn't depend only on static information like schema, mapping options and Lucene settings but also on the contents of data: don't assume five minutes of testing will highlight your normal behavior; collect metrics from real world scenarios. The queues are going to help as buffers for non-constant performance in the various stages.
  6. Don't forget to tune the IndexWriter. See the reference documentation: nothing changed in this area.
  7. If the indexing CPUs are not at 100% usage and the DB is not hitting its limits, you know you can improve the settings
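
The connection pool point in the checklist can be illustrated with a small plain Java sketch. It assumes nothing about any real pool implementation: ToyConnectionPool is a hypothetical name, a Semaphore stands in for the pool, and a short sleep stands in for a query. It only shows that the number of loader threads doing useful work at the same time can never exceed the pool size, however many threads you configure.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of loader threads sharing a connection pool.
class ToyConnectionPool {

    // Returns the highest number of "connections" that were in use at once.
    static int peakConnectionsInUse(int poolSize, int loaderThreads, int tasks) {
        Semaphore connections = new Semaphore(poolSize);
        AtomicInteger inUse = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService loaders = Executors.newFixedThreadPool(loaderThreads);

        for (int i = 0; i < tasks; i++) {
            loaders.execute(() -> {
                try {
                    connections.acquire(); // wait for a free connection
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                try {
                    int now = inUse.incrementAndGet();
                    peak.accumulateAndGet(now, Math::max);
                    Thread.sleep(2); // stands in for running a query
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inUse.decrementAndGet();
                    connections.release();
                }
            });
        }

        loaders.shutdown();
        try {
            if (!loaders.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("loader tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for loaders", e);
        }
        return peak.get();
    }
}
```

With peakConnectionsInUse( 2, 8, 40 ) the result never exceeds 2: six of the eight threads spend their time waiting for a connection, which is the contention described in point 2.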

Progress Monitoring

An API to plug in your own monitoring implementation is on the way; currently beta1 uses a logger to periodically print status, so you can enable or disable your loggers to control it. Let us know what kind of monitoring you would like to have, or contribute one!

13 comments:
 
07. Dec 2009, 23:30 CET | Link

Great stuff. I wish I had this kind of API earlier.

 
07. Dec 2009, 23:39 CET | Link
amin mohammed-coleman | aminmc(AT)gmail.com

Cool stuff. Will definitely be using this. It would be quite handy if there was a jmx hook to the monitoring.

 
08. Dec 2009, 16:16 CET | Link

Nice to finally see that in the light. This has been baking in trunk for a looooong time and in your head for even more.

I love the 3 mins (tuned) vs 6 hrs (default setting) vs 2 months prior to this :)

 
09. Dec 2009, 18:39 CET | Link
Davide D'Alto

Great job!

 
21. Jan 2010, 19:36 CET | Link

Would really like to try this out to speed up indexing. Currently indexing 20million records.

When will it be released ? Is there a beta somewhere I can download and try it out ?

Kr George

 
21. Jan 2010, 20:46 CET | Link
George wrote on Jan 21, 2010 13:36:
Would really like to try this out to speed up indexing. Currently indexing 20million records. When will it be released ? Is there a beta somewhere I can download and try it out ? Kr George

Yes, it's available in Hibernate Search 3.2 Beta1

 
03. Mar 2010, 22:48 CET | Link
Salem Ben Afia | salem.ben.afia(AT)gamil.com
Sounds Great!

Unfortunately, i'm facing some memory problems for only 1.2M entries and i don't know how to paginate with MassIndexer to avoid this.
I used to work with the old way of indexing one by one and it takes about 4hours.

I get this exception: "this writer hit an OutOfMemoryError; cannot commit"
sometimes many exception like : Hibernate Search: entityloader-N java.lang.OutOfMemoryError: Java heap space

Can someone give me the right way to use it please; I really need that MassIndexer.
Thanks
 
05. Mar 2010, 16:39 CET | Link

Hi Salem, MassIndexer is paginating automatically: the maximum memory needed is capped and not dependent on the amount of data, but you might still need more memory than with a single threaded scroller. The needed memory is dependent on the complexity of your object graph, and on the settings applied to the MassIndexer and to the Lucene writer configuration.

A rule of thumb for best performance: give it as much memory as you have ;-) But if you don't have much, keep more conservative parameters on the batch sizes and thread numbers.

Don't forget to tune your JVM startup parameters, the defaults are not great.

Please join the forums; I'll be happy to help you.

 
11. Mar 2010, 20:14 CET | Link

do you have example how to implement MassIndexerProgressMonitor?

 
13. Apr 2010, 01:38 CET | Link
ian wrote on Mar 11, 2010 14:14:
do you have example how to implement MassIndexerProgressMonitor?

Sorry for the delay, I check the forums more often. You can look at the default implementation which is part of the sources, or look in fisheye

 
07. May 2010, 10:58 CET | Link

The example above uses .threadsForIndexWriter( 3 ) but this method doesn't seem to exist in 3.2.0.Final. Not a problem, but it might be worth updating the code above.

 
11. Jun 2010, 16:04 CET | Link

right, good catch. It was removed from public API but is likely coming back in next release, together with additional tricks and performance improvements ;-)

 
25. Sep 2010, 06:38 CET | Link
Ido
Use a connection pool and size it properly: having more Hibernate Search threads accessing your database won't help when they have to contend database connections

Would you please elaborate on what might constitute properly? E.g. with c3p0.maxPoolSize=15 (c3p0's default maxPoolSize) I've been getting occasional deadlocks during MassIndexer.startAndWait(), so I suspect this isn't a good value :)

It would be very useful to know the minimal number of connections that MassIndexer needs in order to work, and the maximal number of connections it can consume if given the opportunity.
