
I gotta preface this post by saying that we are very skeptical of the idea that Java is the right place to do processing that works with data in bulk. By extension, ORM is probably not an especially appropriate way to do batch processing. We think that most databases offer excellent solutions in this area: stored procedure support, and various tools for import and export. Because of this, we've neglected to properly explain to people how to use Hibernate for batch processing if they really feel they /have/ to do it in Java. At some point, we have to swallow our pride, and accept that lots of people are actually doing this, and make sure they are doing it the Right Way.

A naive approach to inserting 100 000 rows in the database using Hibernate might look like this:

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for ( int i=0; i<100000; i++ ) {
   Customer customer = new Customer(.....);
   session.save(customer);
}
tx.commit();
session.close();

This would fall over with an OutOfMemoryError somewhere after the 50 000th row. That's because Hibernate caches all the newly inserted Customers in the session-level cache. Certain people have expressed the view that Hibernate should manage memory better, and not simply fill up all available memory with the cache. One very noisy guy who used Hibernate for a day and noticed this is even going around posting on all kinds of forums and blog comments, shouting about how this demonstrates what shitty code Hibernate is. For his benefit, let's remember why the first-level cache is not bounded in size:

  • persistent instances are /managed/ - at the end of the transaction, Hibernate synchronizes any change to the managed objects to the database (this is sometimes called /automatic dirty checking/)
  • in the scope of a single persistence context, persistent identity is equivalent to Java identity (this helps eliminate data /aliasing/ effects)
  • the session implements /asynchronous write-behind/, which allows Hibernate to transparently batch together write operations

For typical OLTP work, these are all very, very useful features. Since ORM is really intended as a solution for OLTP problems, I usually ignore criticisms of ORM which focus upon OLAP or batch stuff as simply missing the point.
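
The identity guarantee in the list above can be illustrated without Hibernate at all. What follows is a toy model and not Hibernate's implementation: ToyPersistenceContext, its load() method and the simulated row-read counter are all hypothetical names invented for this sketch. It only aims to show why an identity map hands back one Java instance per row, and why such a map has to keep growing with every distinct row it touches.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a persistence context's identity map (an assumption for illustration, NOT Hibernate's code).
class ToyPersistenceContext {
    // Hypothetical entity with just an id and one field.
    static final class Customer {
        final long id;
        String name;
        Customer(long id, String name) { this.id = id; this.name = name; }
    }

    private final Map<Long, Customer> identityMap = new HashMap<>();
    private int simulatedRowReads = 0;

    // Returns the managed instance for this id; the "row" is read only on first access.
    Customer load(long id) {
        return identityMap.computeIfAbsent(id, key -> {
            simulatedRowReads++;
            return new Customer(key, "customer-" + key);
        });
    }

    int managedCount() { return identityMap.size(); }
    int rowReads() { return simulatedRowReads; }

    public static void main(String[] args) {
        ToyPersistenceContext ctx = new ToyPersistenceContext();
        Customer a = ctx.load(42L);
        Customer b = ctx.load(42L);
        System.out.println("same instance: " + (a == b));
        System.out.println("row reads so far: " + ctx.rowReads());

        // Every distinct row touched stays referenced by the map until the context is cleared.
        for (long id = 0; id < 1000; id++) {
            ctx.load(id);
        }
        System.out.println("managed instances: " + ctx.managedCount());
    }
}
```

In this sketch the two load(42L) calls return the same object after a single simulated read, and the map ends up holding one entry per distinct id. Bounding the map would mean silently giving up that guarantee for whichever entries get evicted.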

However, it turns out that this problem is incredibly easy to work around. For the record, here is how you do batch inserts in Hibernate.

First, set the JDBC batch size to a reasonable number (say, 10-20):

hibernate.jdbc.batch_size 20

Then, flush() and clear() the session every so often:

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();

for ( int i=0; i<100000; i++ ) {
   Customer customer = new Customer(.....);
   session.save(customer);
   if ( (i + 1) % 20 == 0 ) {
      //flush a batch of inserts and release memory:
      session.flush();
      session.clear();
   }
}

tx.commit();
session.close();
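
To see what each of the two knobs buys you, here is a small self-contained simulation. It is a sketch under loud assumptions: ToySession is a hypothetical stand-in that models only a first-level cache, a queue of pending inserts and a counter of simulated round trips. It does not use Hibernate or JDBC, and the real Session does far more. The jdbcBatchSize parameter plays the role of hibernate.jdbc.batch_size, and flushEvery is the interval used in the loop.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a session with a first-level cache and write-behind (an assumption, NOT Hibernate's code).
class ToySession {
    private final int jdbcBatchSize;                        // stands in for hibernate.jdbc.batch_size
    private final List<Object> firstLevelCache = new ArrayList<>();
    private final List<Object> pendingInserts = new ArrayList<>();
    private int peakCacheSize = 0;
    private int roundTrips = 0;
    private int rowsWritten = 0;

    ToySession(int jdbcBatchSize) { this.jdbcBatchSize = jdbcBatchSize; }

    // save() only registers the object; nothing is written yet (write-behind).
    void save(Object entity) {
        firstLevelCache.add(entity);
        pendingInserts.add(entity);
        peakCacheSize = Math.max(peakCacheSize, firstLevelCache.size());
    }

    // flush() writes the pending inserts, at most jdbcBatchSize statements per simulated round trip.
    void flush() {
        for (int start = 0; start < pendingInserts.size(); start += jdbcBatchSize) {
            int end = Math.min(start + jdbcBatchSize, pendingInserts.size());
            rowsWritten += end - start;
            roundTrips++;
        }
        pendingInserts.clear();
    }

    // clear() drops every cached reference so the objects can be garbage collected.
    void clear() { firstLevelCache.clear(); }

    int peakCacheSize() { return peakCacheSize; }
    int roundTrips() { return roundTrips; }
    int rowsWritten() { return rowsWritten; }

    // Inserts `rows` objects; flushEvery <= 0 means "only flush at the end", like the naive loop.
    static ToySession run(int rows, int jdbcBatchSize, int flushEvery) {
        ToySession session = new ToySession(jdbcBatchSize);
        for (int i = 0; i < rows; i++) {
            session.save(new Object());
            if (flushEvery > 0 && (i + 1) % flushEvery == 0) {
                session.flush();
                session.clear();
            }
        }
        session.flush(); // the commit flushes whatever is still pending
        return session;
    }

    public static void main(String[] args) {
        ToySession naive = run(100_000, 20, 0);
        ToySession batched = run(100_000, 20, 20);
        System.out.println("naive   peak cache: " + naive.peakCacheSize() + ", round trips: " + naive.roundTrips());
        System.out.println("batched peak cache: " + batched.peakCacheSize() + ", round trips: " + batched.roundTrips());
    }
}
```

In this model both runs write the same rows in the same number of round trips, because the batch size only caps how many statements travel together. What changes is the peak cache size, which stays at the flush interval when you flush() and clear() regularly and grows to the full row count when you don't.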

What about retrieving and updating data? Well, in Hibernate 2.1.6 or later, the scroll() method is the best approach:

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();

ScrollableResults customers = session.getNamedQuery("GetCustomers")
   .scroll(ScrollMode.FORWARD_ONLY);
int count=0;
while ( customers.next() ) {
   Customer customer = (Customer) customers.get(0);
   customer.updateStuff(...);
   if ( ++count % 20 == 0 ) {
      //flush a batch of updates and release memory:
      session.flush();
      session.clear();
   }
}

tx.commit();
session.close();

Not so difficult, or even shitty, I guess. Actually, I think you'll agree that this was much easier to write than the equivalent JDBC code messing with scrollable result sets and the JDBC batch API.

One caveat: if Customer has second-level caching enabled, you can still get some memory management problems. The reason for this is that Hibernate has to notify the second-level cache /after the end of the transaction/, about each inserted or updated customer. So you should disable caching of customers for the batch process.

39 comments:
 
27. Aug 2004, 06:12 CET | Link
Noisy People
You've said over and over that Hibernate wasn't designed for batch stuff. If you want to do batch stuff and don't want to resort to jdbc, there are plenty of other ORM tools that will do the job.

No problem, we get it.

 
27. Aug 2004, 08:09 CET | Link
Gavin
Oh hello again, you really -are- very persistent :-)

Please don't mischaracterize our position on this, which is that Java is not a good place to do batch processing. We have never said that Hibernate is any better or worse than any other product for this use case. Indeed, almost all ORM products behave exactly the same as Hibernate on this.

If you actually run the code I just displayed, you will find it to be just as fast as whatever other ORM product you are using. But not as fast as a stored procedure, of course.
 
27. Aug 2004, 09:00 CET | Link
Zuzur
Hello Gavin,

Actually, you don't need to do batch processing to have thousands of insertions/updates in a single transaction. It may be a design issue, but the fact is that there is such a need in some applications.

I don't see why you speak about difficulties with JDBC and scrollable resultsets when the exact same "counter" solution has been available right here with begin/endTransaction() since the very beginning.

I'm sorry, but it is not a Hibernate strength per se to be able to rely on the transaction API of the JDBC driver :-)

BTW, I am absolutely fond of this lovely framework!
 
27. Aug 2004, 12:39 CET | Link
philippe | pch(AT)infologic.fr
Hello,

For information, it seems that scroll(ScrollMode) is only available in Hibernate 3, not yet in 2.1.6.

Philippe


 
27. Aug 2004, 13:29 CET | Link
Juozas Baliuka
I have found that an awk script is a good tool for batch stuff on the client, and it is a standard way to transform text files.
http://www.gnu.org/software/gawk/manual/html_mono/gawk.html
 
27. Aug 2004, 13:37 CET | Link
Gavin
Just use scroll() without a parameter.
 
27. Aug 2004, 13:38 CET | Link
Juozas Baliuka
This is from the manual, in case somebody does not understand the motivation:

"Programs in awk are different from programs in most other languages, because awk programs are data-driven; that is, you describe the data you want to work with and then what to do when you find it. Most other languages are procedural; you have to describe, in great detail, every step the program is to take. When working with procedural languages, it is usually much harder to clearly describe the data your program will process. For this reason, awk programs are often refreshingly easy to read and write."
 
27. Aug 2004, 14:20 CET | Link
Jonas
So now I know why I found it hard to get a straight answer on how to do this on the forums last fall when I was trying to do this :) In the end I just batched with JDBC and it worked great. I just regret the hours I put into trying to get Hibernate to do something it was never designed to do. I guess I just wanted the persistence code to look nice and consistent which is never a good motivation for breaking your back. I'm wiser now :) Keep up the great work on Hibernate!
 
27. Aug 2004, 15:15 CET | Link
Noisy Persistent People
If Java isn't a good place to do batch database stuff, where is? Is it a flaw of Java?

It is funny you say that because lots of people are doing it very successfully. It's just Hibernate that has issues because you don't feel people should do it. Other ORM tools do Java batch right out of the box, no configuration needed.

I just hope that your attitude against doing batch work in Java doesn't carry over to your EJB work.

 
27. Aug 2004, 15:45 CET | Link
Christian
Java is indeed the worst tool to do data batching.

 
27. Aug 2004, 15:52 CET | Link
Gavin
What special configuration was needed in the previous code examples? Setting the JDBC batch size?? I'm sure this setting is required in ANY tool that takes advantage of the JDBC batch API. (Full ORM or not.)

Hibernate has no issues that any other full ORM tool does not have. Any JDO implementation, TopLink, etc, will behave exactly the same. See PersistenceManager.flush() and
PersistenceManager.evictAll() for the equivalent JDO pattern.

Dude, give it a break man, you are wrong. It happens to everyone once in a while ... you'll get over it :-)
 
27. Aug 2004, 16:19 CET | Link
John
Christian & Gavin,

"Java is indeed the worst tool to do data batching. " Why is that?

 
27. Aug 2004, 16:29 CET | Link
Gavin
So, this post is an attempt to adjust some of our previous rhetoric *slightly*.
 
27. Aug 2004, 17:28 CET | Link
John
So, do I understand that you have a problem with batch in general and not with Java doing batch?

Some people can't use stored procedures to do batch jobs because they use data outside the database or need to remain somewhat neutral to the database.

 
27. Aug 2004, 17:46 CET | Link
Christian
I believe there is no way to simplify the statement Gavin made any more.


 
27. Aug 2004, 18:12 CET | Link
John
"Java is indeed the worst tool to do data batching."

You want to take back this statement, then right? Or do you disagree with Gavin?


 
27. Aug 2004, 18:16 CET | Link
Christian
Man, what is so difficult about "don't load half your database into your JVM, use a set-level language _in_ your database"?

It would be very nice if you would stop making any assumptions about our motivations. If your goal is to make us angry by acting like a bonehead, this "discussion" is over.

 
27. Aug 2004, 18:22 CET | Link
John
As a potential jboss customer, I want to know how you feel. I totally understand now. You are right, this "discussion" is over.

Here's what I'm taking away from it. jboss is against any type of batch job that doesn't happen completely inside a database.

Since that isn't the way most real world apps work, we will look for solutions outside of jboss products. Also, calling potential clients boneheads is great for business.


 
27. Aug 2004, 18:29 CET | Link
Christian
This made my day... A customer who wants to know what I feel. Thanks :)

 
27. Aug 2004, 18:45 CET | Link
Max
We haven't said we are *against* batching in Java, just that we don't encourage it!

What has been said:

Loading a *massive* amount of data into another process (VM) to do manipulation and then pushing it to another process (DB) is not as efficient as doing it with something close to the database.

This is a valid statement no matter what kind of ORM you use, and Hibernate was not built for this, but as Gavin shows it is quite possible to do - and if you do it as described it will most likely outperform many ORMs.

We probably have been too uptight about it, and we think people have misunderstood that message, so Gavin posted a blog entry on how to actually do this stuff efficiently if you really want to do it!

    
 
27. Aug 2004, 18:53 CET | Link
Christian
Forget it Max, it was just a cheap shot at JBoss...

 
27. Aug 2004, 22:40 CET | Link
Noisy Guy
But you guys make it so easy! :)

 
28. Aug 2004, 01:21 CET | Link
Gavin
Dear Noisy,

It is always very easy to take cheap shots at people who are actually *doing*. In the long run, what the doers do has relevance to the world, what the hataz do is forgotten tomorrow.

I forgive you, actually, your jealousy makes me sad.

One day, you will do something worthwhile.

peace
 
28. Aug 2004, 01:39 CET | Link
Gavin
Dear John,

We are all just "muddle-ing through". We are trying to find the truth, on the basis of our experiences. Stuff the Hibernate team tells people is the best advice we know, after thinking very hard about the problem.

It's very annoying when people try to parse our words every which way, trying to trap us in a contradiction, or trying to make us admit we are wrong as some kind of point-scoring exercise, or looking for some kind of evil motivation. Sometimes we are wrong, often at least slightly wrong. It's never malicious. Hopefully we eventually learn about our mistakes.

There is a principle in argument known as "the principle of charity", which is that you should always interpret people's comments in the *best possible* light, and assume the most idealistic motivation. In this case, our motivation is to help people build systems that interact efficiently with a SQL database. That's all.

If people aren't interested in what we have to say, or are only interested in insulting us, they shouldn't read it. If you read it, and comment on it, follow the principle of charity. If you disagree, talk about your use case, and how it differs, and show why it's important. That way we *all* might learn something. Don't try to point-score, it's just not useful.

regards,

Gavin

P.S. We've noticed a new argument technique recently. Apparently, it is possible to win any argument by declaring "oh I was a potential JBoss customer, and now I'm not". Apart from the incredible non-provability (indeed, unlikelihood) of the claim, it is quite irrelevant and has zero bearing on the topic being debated. Just a hint....
 
28. Aug 2004, 13:41 CET | Link
Roman Sykora | r.sykora(AT)gmx.net
Hrhr,
very entertaining. I'm enjoying these discussions very much. Thank you a lot.
For giving me (and everyone who likes it) Hibernate and that amusing entertainment. You could simply ignore those "hirni's" (don't know an adequate term in english), but you don't. Not just giving us Hibernate but putting a smile on our faces too.

Thanks
Roman Sykora
 
30. Aug 2004, 13:47 CET | Link
lee
I'm convinced that people who need to point-score on forums and comments have far too much bloody time on their hands, not to mention that they're immature and petty.

The rest of us who actually have to *DO* something only have time to be interested in finding solutions. And we're frankly just appreciative of any advice, hints and honesty about the tools we have to work with. The Hibernate team does this very well.

Thanks
Lee
 
02. Sep 2004, 07:15 CET | Link
1. How does the session.clear() interact with optimistic locking? Eg. if the record was subsequently read?

2. On batching generally, is it possible in JDBC to batch queries as well as updates? E.g. if I know up front that I need to query three different records, can JDBC (and hence Hibernate) batch them?

Thanks,

Anthony
 
02. Sep 2004, 11:29 CET | Link
Gavin
(1) I'm not quite sure what you mean ... by nature, an optimistic lock is checked when you perform the update ... so there is no impact here ...

(2) No, it's not possible, though Hibernate can batch-read entities by primary key. (See the doco.)
 
03. Sep 2004, 15:06 CET | Link
Nkuma
Hi Gavin,

What does the setting "jdbc.batch_size" do? If I set this setting to 20, for example, then I would expect Hibernate to persist the data transparently at this frequency (every 20 records or so).

But it doesn't happen that way.

Maybe I am missing the point.

Regards,
Nkuma
 
03. Sep 2004, 18:54 CET | Link
Gavin
The batch_size setting determines the maximum number of statements that can be executed in a single JDBC batch.
 
06. Sep 2004, 20:45 CET | Link
ernst_pluess
I completely agree that it isn't performant to do DB batch operations with Java, compared to doing it with SQL and/or stored procedures.

But I think there's another point: maintenance.
If I have both batching and a GUI-driven application instrumenting the same data, I'd prefer to maintain only one domain model.
This means I can either try to do GUI coding in PL/SQL or do batches in the JVM.

IMHO: If you can pay the price (not perfect performance) do the batching in Java (preferably with Hibernate ;-) ) and take advantage of a nice OO domain model and good support for GUI programming.
If performance _really_ hurts, pay the other price (double the maintenance) and do the batch stuff as close to your data as you can.

HTH
Ernst

 
08. Sep 2004, 20:57 CET | Link
Juozas Baliuka
This double maintenance problem is very trivial to solve: just drop the nice OO domain model or the evil database.
 
12. Sep 2004, 09:30 CET | Link
Java is fine for batch... if Java is one of the languages supported internally by the DB server. That is, stored procedures are the way to do bulk processing and Java can be used if you can write SPs for your server in Java.

So the question can be: why are server-based approaches better than client-based ones for batch processing? That's pretty obvious: the work of translating data sets out of the DB alone adds significant overhead. Java (and O-O in general) makes it even more expensive by adding a layer of memory allocation (even if you're just doing JDBC). Finally, ORMs add even more. Sure, creating new objects isn't all that expensive, but it starts to add up. As a result, PL/SQL programmers have very steady jobs in pretty much all sizeable banks and other analytics processing shops around the world.
 
05. Oct 2004, 11:23 CET | Link
BruceS
I'm a bit new to hibernate. W.r.t. your "caveat" (I think that's the problem I'm having) - I can't seem to find where to disable second-level caching for my "Customer". Please point me in the right direction...
 
05. Oct 2004, 18:31 CET | Link
Christian
The "Forum" would be a good direction to ask usage questions. The second-level cache is disabled by default, btw.

 
25. Oct 2004, 17:39 CET | Link
Paul Rivers
You know, tons of people like me come to the blog, read the post, and then typically don't post a comment because we're just like "Oh, that's cool." So thanks.

PS If I were you, I would delete these pointlessly obnoxious comments. I really can't believe these people. What is it, no one listens to them in real life, so they complain on blogs instead?
 
10. Sep 2009, 08:18 CET | Link
siddhi

Do you still think Hibernate is not the best thing for batch processing? Is it better to use PL/SQL and JDBC for batch processing?

 
02. Oct 2009, 10:52 CET | Link

I've seen a number of customers use Hibernate for batch processing; the results have been mixed. First, recognize that in 2004, when this thread first started, no container-managed batch processing technology really existed in the marketplace. Today there exists a batch container (WebSphere Compute Grid) that runs inside of an application server; this batch container is a peer to the EJB container, web container, etc. Note that the batch container is portable and can run outside of WebSphere Application Server, in places like JBoss, WAS CE, WebLogic, etc. In addition to having a first-class batch container, there are two popular ways to describe (read: program) your batch applications: Batch Datastream Framework, which integrates with the Compute Grid batch container, and Spring Batch. Container-managed services for batch are critical. Let the container manage the transaction for the batch application, prepared-statement management (JDBC batching, etc.), checkpoint/restart, etc.

Before answering your question about the viability of Hibernate for batch (or data-intensive) applications, let's first make clear the role of Java for this workload. Java, especially Java 1.5 and beyond with generational garbage collection, a better JIT, etc., is very well suited for batch applications. Surprisingly, Java has even outperformed COBOL batch for certain types of workloads on the mainframe, where COBOL batch dominates. This is a testament to the JIT optimizations made over the last several years, and we should expect the performance of Java to converge with other languages.

Using stored procedures for batch work works well; however, there are a couple of important considerations to keep in mind. First, you risk duplicating business logic across your OLTP application and your batch application, since both run in fundamentally different environments. Second, life-cycle management is a major problem. A previous forum post dismissed this as a trivial problem to solve, but it's quite difficult. The life-cycle of the database is independent of the life-cycle of the application server as well as the life-cycle of the application itself. Frequent changes to the application could require changes to the stored-procedure definition. At large IT shops, this typically requires coordination across multiple organizations: operations, database management, application development. It's better to recognize this early, and bundle the batch application logic with the OLTP application, and manage this as a single application.

Now to your question about the viability of Hibernate for batch. The closer the data access technology is to the underlying data serving technology, the more optimizations can be made. Read this as: writing SQL and using JDBC directly will give you more performance tuning opportunities than leveraging an ORM layer. SQL queries for batch, which typically must retrieve hundreds of thousands or millions of records, will be highly optimized by database experts that understand the database optimizer, plans, etc. I recently saw an SQL query for a batch application that was over 300 pages (MS Word, Times New Roman, 12pt font) long! The Hibernate query was killed after taking 4 weeks to execute. The 300-page SQL query completed in under 4 minutes. While there are numerous optimizations that can be made to an SQL statement for batch, two in particular are important: first, holding cursors open across transactions, where a single select can be made to the database, and multiple syncpoints/checkpoints/transactions are used to process the records; second, JDBC batching, where multiple prepared statements are accumulated in the app server tier and sent across the wire in one RPC call. By using an ORM layer for batch, you limit your ability to tune. ORM for batch is especially a problem for selects, since for batch you only want to select the columns you really need, and with an ORM technology you may get the entire row/object.

A major problem I see is ORM sprawl, where developers haphazardly use the ORM APIs throughout the code, versus creating a proper data-access layer that is technology independent. This limits our options from an application standpoint, such as switching the data-access object from Hibernate to SQL in the future. We instead end up in a situation where the ORM technology is ingrained in the app.

There are a number of papers on the topic of designing batch applications available now. These weren't available in 2004 when this thread started.

- designing batch applications (pdf w/ examples): http://snehalantani.googlepages.com/designingBatchApps.zip

- data-intensive processing with websphere: http://snehalantani.googlepages.com/WebSphereDataIntensiveApps.pdf

- SwissRe and their use of WebSphere Compute Grid on z/OS for batch: http://www-01.ibm.com/software/tivoli/features/ccr2/ccr2-2008-12/swissre-websphere-compute-grid-zos.html

- J2EE batch processing: http://www.slideshare.net/chris1adkin/j2ee-batch-processing-presentation

- Hibernate chapter on batch: http://docs.jboss.org/hibernate/stable/core/reference/en/html/batch.html

- Hibernate chapter on performance tuning: http://docs.jboss.org/hibernate/core/3.3/reference/en/html/performance.html

 
02. Oct 2009, 11:01 CET | Link
I should clarify that WebSphere Compute Grid is a batch processing *platform*, not just a container. The platform consists of a job dispatching tier (including a parallel job management component) that dispatches batch jobs to a cluster of batch containers. Each tier is highly available. The batch containers can run on multiple platforms and applications, where the strategy is ubiquity: batch containers should run anywhere and everywhere.

Here is a presentation that describes the technology: http://snehalantani.googlepages.com/latestpresentationmaterial