
One of the points of using Hibernate Search is that your valued business data is stored in the database: a reliable transactional and relational store. So while Hibernate Search keeps the index in sync with the database during regular use, on several occasions you'll need to be able to rebuild the indexes from scratch:

  • Data existed before you introduced Hibernate Search
  • New features are developed, index mapping changed
  • A database backup is restored
  • Batch changes are applied on the database
  • ...you get the point, this list could be very long

Evolving, user driven

An API to perform these operations has always existed in previous versions of Hibernate Search, but questions about how to make it faster weren't unusual on the forums. Keep in mind that rebuilding the whole index basically means that you have to load all indexed entities from the database into the Java world to feed Lucene with data to index. I'm a user myself, and the code for the new MassIndexer API was tuned after field experience on several applications and with much involvement of the community.

QuickStart: MassIndexer API

Since version 3.0 the documentation has provided a recommended re-indexing routine; this method is still available, but a new API providing better performance was added in version 3.2. No configuration changes are required, just start it:

FullTextSession fullTextSession = ...
MassIndexer massIndexer = fullTextSession.createIndexer();
massIndexer.startAndWait();

The above code will block until all entities are reindexed. If you don't need to wait for it, use the asynchronous method:

fullTextSession.createIndexer().start();

Selecting the entities to rebuild the index for

You don't need to rebuild the index for all indexed entities; let's say you want to re-index only the DVDs:

fullTextSession.createIndexer( Dvd.class ).startAndWait();

This will include all of Dvd's subclasses, as all Hibernate Search's APIs are polymorphic.

Index optimization and clearing

In Lucene's world an update is implemented as a remove followed by an add, so before adding all entities to the index we need to remove them all from it. This operation is known as purgeAll in Hibernate Search. By default the index is purged and optimized at start, and optimized again at the end; you might opt for a different strategy, but keep in mind that by disabling the purge operation you could later find duplicates. The optimization after purge is applied to save some disk space, as recommended in Hibernate Search in Action.

fullTextSession.createIndexer()
   .purgeAllOnStart( true ) // true by default, highly recommended
   .optimizeAfterPurge( true ) // true is default, saves some disk space
   .optimizeOnFinish( true ) // true by default
   .start();
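
To see why skipping the purge can leave duplicates, here is a toy model in plain Java. It is not Hibernate Search or Lucene code: the class PurgeSketch and its methods are made up for illustration, and the "index" is just a list that can only add and remove entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy index with no in-place update, only add and remove (illustrative, not Lucene). */
class PurgeSketch {

    private final List<String> documents = new ArrayList<>();

    /** Removes every document, like a purgeAll before a rebuild. */
    void purgeAll() {
        documents.clear();
    }

    /** An update is a remove of the old document followed by an add of the new one. */
    void update(String entityId) {
        documents.removeIf(entityId::equals);
        documents.add(entityId);
    }

    /** A mass rebuild only adds; it relies on the purge to get rid of old documents. */
    void rebuild(List<String> entityIds, boolean purgeAllOnStart) {
        if (purgeAllOnStart) {
            purgeAll();
        }
        documents.addAll(entityIds);
    }

    int size() {
        return documents.size();
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("dvd-1", "dvd-2", "dvd-3");

        PurgeSketch withPurge = new PurgeSketch();
        withPurge.rebuild(ids, true);
        withPurge.rebuild(ids, true);
        System.out.println("with purge: " + withPurge.size());

        PurgeSketch withoutPurge = new PurgeSketch();
        withoutPurge.rebuild(ids, false);
        withoutPurge.rebuild(ids, false);
        System.out.println("without purge: " + withoutPurge.size());
    }
}
```

Rebuilding twice with the purge leaves one document per entity; rebuilding twice without it leaves each entity in the toy index twice.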

Faster, Faster!

A MassIndexer is very sensitive to tuning; some settings can make it orders of magnitude faster when tuned properly, and the good values depend on your specific mapping, environment, database, and even your content. To find out which settings you need to tune you should be aware of some implementation details.

The MassIndexer uses a pipeline with different specialized threads working on it, so most processing is done concurrently. The following explains the process for a single entity type, but this is actually done in parallel jobs for each different entity when you have more than one indexed type:

  1. A single thread named identifier-loader scrolls over the primary keys of the type. The number of threads for this stage is always one, so that a transaction can define the set of keys to consider. So a first hint is to use simple keys and avoid complex types, as the loading of keys will always be serialized.
  2. The loaded primary keys are pushed to an id queue; there's one such queue for each root type to be indexed.
  3. At the second stage a thread pool called entity-loader loads batches of entities using the provided primary keys. You can tune the number of threads working on this task concurrently (threadsToLoadObjects(int)) and the number of entities fetched per iteration (batchSizeToLoadObjects(int)). While idle threads might be considered a waste, the cost is minor, and it's better to have some spare threads doing nothing than too few. Make sure you don't make it too hard for the database by requesting too much data: too big a batch size or way too many threads will also hurt, so you will have to find the sweet spot. The queues work as buffers to mitigate the effect of performance highs and lows due to different data, so finding the sweet spot is not a quest for the exact value but for a reasonable one.
  4. The entity queue contains a stream of entities needing conversion into Lucene Documents; it's fed by the entity-loader threads and consumed by the document-creator threads.
  5. The document-creator threads convert the entities into Lucene Documents (apply your Search mapping and custom bridges, transform data into text). It is important to understand that during conversion some lazy objects might still be loaded from the database (step 7 in the picture). So this step could be cheap or expensive: depending on your domain model and how you mapped it, there could be more round trips to the database or none at all. Second level cache interactions might help or hurt in this phase.
  6. The document queue should be a constant stream of Documents to be added to the index. If this queue is mostly near-empty, it means you're producing the data more slowly than Lucene is able to analyse it and write it to the index. If this queue is mostly full, it means you're producing the Documents faster than Lucene is able to write them to the index. Always consider that Lucene is analysing the text during the write operation, so if this is slow it's not necessarily related to I/O limits: you could have expensive analysis. To find out you'll need a profiler.
  7. The number of document indexer threads is also configurable, so in case of expensive analysis you can have more CPUs working on it.

The queues are blocking and bounded, so there's no danger in setting too many producer threads for any stage: if a queue fills up the producers will be set on hold until space is available. All thread pools have names assigned, so if you connect with a profiler or debugger the different stages can be promptly identified.
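
The staged design above can be sketched with plain java.util.concurrent classes. This is a toy model, not the actual MassIndexer implementation: the class PipelineSketch, the end markers and the fake "document-N" strings are all made up for illustration, and the document-creator stage is folded into the entity-loader stage to keep it short. What it does show is how bounded blocking queues put fast producers on hold instead of letting memory grow.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy pipeline: one id producer, several loaders, one indexer, joined by bounded queues. */
class PipelineSketch {

    private static final long END_OF_IDS = -1L;              // hypothetical end marker
    private static final String END_OF_DOCUMENTS = "<end>";  // hypothetical end marker

    static int run(int totalIds, int loaderThreads, int queueCapacity) {
        BlockingQueue<Long> idQueue = new ArrayBlockingQueue<>(queueCapacity);
        BlockingQueue<String> documentQueue = new ArrayBlockingQueue<>(queueCapacity);
        AtomicInteger indexed = new AtomicInteger();

        // Stage 1: a single thread produces the ids; put() blocks while the queue is full.
        Thread identifierLoader = new Thread(() -> {
            try {
                for (long id = 1; id <= totalIds; id++) {
                    idQueue.put(id);
                }
                for (int i = 0; i < loaderThreads; i++) {
                    idQueue.put(END_OF_IDS); // one end marker per loader thread
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "identifier-loader");

        // Stage 2: several threads turn ids into (fake) documents.
        Thread[] loaders = new Thread[loaderThreads];
        for (int i = 0; i < loaderThreads; i++) {
            loaders[i] = new Thread(() -> {
                try {
                    while (true) {
                        long id = idQueue.take();
                        if (id == END_OF_IDS) {
                            return;
                        }
                        documentQueue.put("document-" + id);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "entity-loader-" + i);
        }

        // Stage 3: a single thread consumes documents and counts them as "indexed".
        Thread indexer = new Thread(() -> {
            try {
                while (!documentQueue.take().equals(END_OF_DOCUMENTS)) {
                    indexed.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "document-indexer");

        identifierLoader.start();
        for (Thread loader : loaders) {
            loader.start();
        }
        indexer.start();
        try {
            identifierLoader.join();
            for (Thread loader : loaders) {
                loader.join();
            }
            documentQueue.put(END_OF_DOCUMENTS); // all loaders are done, close the last stage
            indexer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for the pipeline", e);
        }
        return indexed.get();
    }

    public static void main(String[] args) {
        System.out.println(run(1000, 4, 8) + " documents went through the pipeline");
    }
}
```

Even with a queue capacity of 8 and 1000 ids, nothing is lost and nothing piles up: a full queue simply makes put() wait, which is the behavior described above for the real queues.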

API for data load tuning

The following settings rebuild my personal reference database in 3 minutes. The same rebuild took 6 hours before enabling these settings, and 2 months before looking into any kind of Hibernate or Lucene tuning.

fullTextSession.createIndexer()
   .batchSizeToLoadObjects( 30 )
   .threadsForSubsequentFetching( 8 )
   .threadsToLoadObjects( 4 )
   .threadsForIndexWriter( 3 ) // reported in the comments as missing from the 3.2.0.Final public API
   .cacheMode(CacheMode.NORMAL) // defaults to CacheMode.IGNORE
   .startAndWait();

Caching

When some information is embedded in the index from entities having a low cardinality (and so a high cache hit ratio), for example when there's a ManyToOne relation to gender or countrycode, it might make sense to enable the cache, which is ignored by default. In most cases ignoring the cache will result in the best performance, especially if you're using a distributed cache, which would introduce unneeded network events.

Offline job

While normally all changes done by Hibernate Search are coupled to a transaction, the MassIndexer uses several transactions and consistency is not guaranteed if you make changes to data while it's running. The index will only contain the entities which existed in the database when the job started, and any update made to the index in this timeframe by other means might be lost. While nothing wrong would happen to the data in the database, the index might be inconsistent if changes are made while the job is busy.

Performance checklist

After having parallelized the indexing code, there are some other bottlenecks to avoid:

  1. Check your database behavior: almost all databases provide a profiling tool which can provide valuable information when run during a mass indexing.
  2. Use a connection pool and size it properly: having more Hibernate Search threads accessing your database won't help when they have to contend for database connections.
  3. EAGER loading set on properties not needed by Hibernate Search will still have them loaded; avoid it.
  4. Check for network latency.
  5. The effect of tuning settings doesn't depend only on static information like schema, mapping options and Lucene settings but also on the contents of the data: don't assume five minutes of testing will highlight your normal behavior; collect metrics from real world scenarios. The queues are going to help as buffers for non-constant performance in the various stages.
  6. Don't forget to tune the IndexWriter. See the reference documentation: nothing changed in this area.
  7. If the indexing CPUs are not at 100% usage and the DB is not hitting its limits, you know you can improve the settings.
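
For item 2, a rough starting point for sizing the pool can be computed from the thread settings. This is only a back-of-the-envelope sketch under assumptions of mine, not a documented formula: it assumes every thread in the identifier-loader, entity-loader and document-creator stages may hold one database connection at a time, that threadsForSubsequentFetching sizes the document-creator stage, and that each root type indexed in parallel gets its own set of threads. The class and method names are made up.

```java
/** Back-of-the-envelope upper bound on connections; all assumptions are stated in the comments. */
class ConnectionEstimateSketch {

    static int upperBound(int rootTypesInParallel,
                          int threadsToLoadObjects,
                          int threadsForSubsequentFetching) {
        // The identifier-loader stage always uses a single thread.
        int identifierLoaderThreads = 1;
        // ASSUMPTION: each thread of each stage may hold one connection at the same time.
        int perRootType = identifierLoaderThreads
                + threadsToLoadObjects
                + threadsForSubsequentFetching;
        // ASSUMPTION: every root type processed in parallel has its own threads.
        return rootTypesInParallel * perRootType;
    }

    public static void main(String[] args) {
        // Thread settings from the tuning example, for a single root type.
        System.out.println(upperBound(1, 4, 8) + " connections as a conservative upper bound");
    }
}
```

If the pool is smaller than this estimate, the extra threads may simply wait for a connection, so measure with your own pool and database before settling on a value.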

Progress Monitoring

An API to plug in your own monitoring implementation is on the way; currently beta1 uses a logger to periodically print status, so you can enable or disable your loggers to control it. Let us know what kind of monitoring you would like to have, or contribute one!
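
To give an idea of what "periodically print status" can look like, here is a minimal thread-safe counter in plain Java that logs once every N documents. It is not the Hibernate Search monitoring API: the class ProgressLogSketch and its method names are made up, and it uses java.util.logging only to stay self-contained.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

/** Illustrative progress counter: logs a status line each time a multiple of logEvery is crossed. */
class ProgressLogSketch {

    private static final Logger LOG = Logger.getLogger(ProgressLogSketch.class.getName());

    private final AtomicLong documentsDone = new AtomicLong();
    private final AtomicLong statusLines = new AtomicLong();
    private final long logEvery;

    ProgressLogSketch(long logEvery) {
        this.logEvery = logEvery;
    }

    /** Safe to call from several indexing threads at once. */
    void documentsAdded(long increment) {
        long previous = documentsDone.getAndAdd(increment);
        long current = previous + increment;
        // Log at most once per call, and only when a multiple of logEvery was crossed.
        if (current / logEvery > previous / logEvery) {
            statusLines.incrementAndGet();
            LOG.info(current + " documents indexed so far");
        }
    }

    long getDocumentsDone() {
        return documentsDone.get();
    }

    long getStatusLines() {
        return statusLines.get();
    }

    public static void main(String[] args) {
        ProgressLogSketch progress = new ProgressLogSketch(50);
        for (int i = 0; i < 120; i++) {
            progress.documentsAdded(1);
        }
        System.out.println(progress.getDocumentsDone() + " documents, "
                + progress.getStatusLines() + " status lines");
    }
}
```

Raising the logger level above INFO silences the status lines without touching the counter, which mirrors the idea of controlling the output through your logger configuration.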

17 comments:
 
07. Dec 2009, 23:30 CET | Link

Great stuff. I wish I had this kind of API earlier.

 
07. Dec 2009, 23:39 CET | Link
amin mohammed-coleman | aminmc(AT)gmail.com

Cool stuff. Will definitely be using this. It would be quite handy if there was a jmx hook to the monitoring.

 
08. Dec 2009, 16:16 CET | Link

Nice to finally see that in the light. This has been baking in trunk for a looooong time and in your head for even more.

I love the 3 mins (tuned) vs 6 hrs (default setting) vs 2 months prior to this :)

 
09. Dec 2009, 18:39 CET | Link
Davide D'Alto

Great job!

 
21. Jan 2010, 19:36 CET | Link

Would really like to try this out to speed up indexing. Currently indexing 20million records.

When will it be released ? Is there a beta somewhere I can download and try it out ?

Kr George

 
21. Jan 2010, 20:46 CET | Link
George wrote on Jan 21, 2010 13:36:
Would really like to try this out to speed up indexing. Currently indexing 20million records. When will it be released ? Is there a beta somewhere I can download and try it out ? Kr George

Yes, it's available in Hibernate Search 3.2 Beta1

 
03. Mar 2010, 22:48 CET | Link
Salem Ben Afia | salem.ben.afia(AT)gamil.com
Sounds Great!

Unfortunately, i'm facing some memory problems for only 1.2M entries and i don't know how to paginate with MassIndexer to avoid this.
I used to work with the old way of indexing one by one and it takes about 4hours.

I get this exception: "this writer hit an OutOfMemoryError; cannot commit"
sometimes many exception like : Hibernate Search: entityloader-N java.lang.OutOfMemoryError: Java heap space

Can someone give me the right way to use it please; I really need that MassIndexer.
Thanks
 
05. Mar 2010, 16:39 CET | Link

Hi Salem, MassIndexer is paginating automatically: the maximum memory needed is capped and not dependent on the amount of data, but you might still need more memory than with a single threaded scroller. The needed memory is dependent on the complexity of your object graph, and on the settings applied to the MassIndexer and to the Lucene writer configuration.

A rule of thumb for best performance: give it as much memory as you have ;-) But if you don't have much, keep more conservative parameters on the batch sizes and thread numbers.

Don't forget to tune your JVM startup parameters, the defaults are not great.

please join the forums I'll be happy to help you.

 
11. Mar 2010, 20:14 CET | Link

do you have example how to implement MassIndexerProgressMonitor?

 
13. Apr 2010, 01:38 CET | Link
ian wrote on Mar 11, 2010 14:14:
do you have example how to implement MassIndexerProgressMonitor?

Sorry for the delay, I check the forums more often. You can look at the default implementation which is part of the sources, or look in fisheye

 
07. May 2010, 10:58 CET | Link

The example above uses .threadsForIndexWriter( 3 ) but this method doesn't seem to exist in 3.2.0.Final. Not a problem, but it might be worth updating the code above.

 
11. Jun 2010, 16:04 CET | Link

right, good catch. It was removed from public API but is likely coming back in next release, together with additional tricks and performance improvements ;-)

 
25. Sep 2010, 06:38 CET | Link
Ido
Use a connection pool and size it properly: having more Hibernate Search threads accessing your database won't help when they have to contend database connections

Would you please elaborate on what might constitute properly? E.g. with c3p0.maxPoolSize=15 (c3p0's default maxPoolSize) I've been getting occasional deadlocks during MassIndexer.startAndWait(), so I suspect this isn't a good value :)

It would be very useful to know the minimal number of connections that MassIndexer needs in order to work, and the maximal number of connections it can consume if given the opportunity.

 