05. Dec 2004, 03:30 CET, by Christian Bauer

Yesterday, another vendor marketing statement was posted on TSS. I usually ignore these, but when it is about data management, I just have to reply. What is always surprising to me is how little we Java developers still know about data management. Here is a statement made by Maxim Kramarenko in the discussion thread:

"OO/network DBMS can be very useful when you need to handle large 
hierarchies - simple graph navigation can be times more fast and simple 
way to fetch data then SQL. Even ORACLE START WITH/CONNECT BY statement 
works VERY slow for hierarchies." 

Now, this statement has several fundamentally different (and important) things mixed together, and in the end confuses everyone more than it helps. I expressed my disbelief (yes, I should have been more verbose and friendly...) and then was asked to contribute to the discussion. Here is my (rather long) response:

I'm using the EXPLODE operator provided by my database management system if I'd like to get a hierarchical view of my data (which is the most common operation when it comes to data hierarchies, I think). If no such operator is available, or if I find its performance characteristics unacceptable, I call my DBMS vendor and complain until they fix it. Of course, I only do that after optimizing my query execution plans, etc. (this should eliminate most issues).

If all of this is not sufficient, I might consider using a nested set model or a materialized path to represent my tree information, variations of the simple adjacency list. Again, this is certainly the last step I'd take (which still happens way too often with the poor quality of the SQL products we have today), and it is likely only acceptable for read-mostly trees. What we have in SQL, the recursive WITH foo AS () or the proprietary CONNECT BY variation, is not necessarily what I have in mind when I think about a good EXPLODE operator. But see below for a reference with a better and complete explanation.
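To make the materialized path variant concrete, here is a minimal sketch in Hibernate/Java terms; the Node entity and its path property are hypothetical, the point being that a single LIKE query can fetch an entire subtree of a read-mostly tree:

// Sketch only: Node is a hypothetical entity whose path property stores the
// materialized path from the root, e.g. "/1/7/42/".
public List getSubtree(Session session, Node root) throws HibernateException {
    // one LIKE query fetches the whole subtree; the price is maintaining
    // the path column on inserts and moves, hence "read-mostly"
    return session.createQuery(
            "from Node n where n.path like :prefix order by n.path")
        .setString("prefix", root.getPath() + "%")
        .list();
}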

I would certainly not sacrifice any of the (with SQL, very rare) advantages I get from the relational data model if I can't find a good operator for my implementation in my DBMS product. After all, it's a logical data model, and any performance concern is a physical implementation detail. I don't see how the two are related, but I know we are in bad shape if we start blurring the differences between them. There is no reason why a network/graph-based logical model or a relational model couldn't be implemented with the same performance characteristics. Just because some products are not implemented optimally doesn't mean we should ditch the whole concept of data independence!

Complain to your DBMS vendor. Ask them why their product doesn't fully support state-of-the-art relational data management, such as a relational (and possibly polymorphic) data type system, or a query language that supports closures/explosion for data hierarchies. The list of deficiencies in today's SQL products is unfortunately much longer than this. It's not the fault of the data model if you can't do something in a particular product, or if a specification has serious problems.

It's easy for the snake oil salesman to sell you his old wine if you let yourself get confused about logical data models and physical implementations. It hurts everyone in the end, as we all have to tell our software vendors what we would like to see and what support we need in a product. If we are not able to clearly articulate our needs, and if we forget history (i.e. what worked and what didn't work in the past), we might get tricked. I'm not comfortable with that.

Finally, a recommendation I can make is the book Practical Issues in Database Management by Fabian Pascal. It is a small book of only about 250 pages. Fabian walks you through 10 common issues you will face in your daily practice, but instead of simply explaining how to work your way through SQL, he first explains the relational data model basics behind each problem. He then looks at current practice and explains what you can do in SQL (or what we would need in a DBMS) to solve or work around the issue. A quick read for a weekend and definitely recommended. If you want to brush up on your data management basics, buy it in a bundle with Chris Date's An Introduction to Database Systems.

29. Sep 2004, 22:29 CET, by Gavin King

Type - not sex or race - discrimination is what we do when we read a row from a SQL query result set and determine which Java class we should instantiate to hold the data from that row. Type discrimination is needed by any ORM solution or handwritten persistence layer that supports polymorphic queries or associations.

The simplest way to visualize type discrimination is to consider the following result set:

( 123, '1234-4567-8910-1234', 'Foo Bar', 'VISA' )
( 111, '4321-7654-0198-0987', 'Baz Qux', 'MCD'  )

This result set contains details of two credit cards. Suppose our application has a different Java class to represent each type of credit card: we have two subclasses, VisaCard and MasterCard, of the CreditCard class. Then we can check the last column of the result set to decide which class to instantiate for each row. This column is the discriminator column.
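In a handwritten persistence layer, type discrimination over this result set might look roughly like the following sketch; the column positions and the 'VISA'/'MCD' values come from the example above, while the constructors and setters are hypothetical:

// read each row, check the discriminator column, instantiate the right subclass
public List readCards(ResultSet rs) throws SQLException {
    List cards = new ArrayList();
    while ( rs.next() ) {
        String discriminator = rs.getString(4);  // 'VISA' or 'MCD'
        CreditCard card = "VISA".equals(discriminator)
                ? (CreditCard) new VisaCard()
                : (CreditCard) new MasterCard();
        card.setId( rs.getLong(1) );
        card.setNumber( rs.getString(2) );
        card.setHolderName( rs.getString(3) );
        cards.add(card);
    }
    return cards;
}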

You're probably wondering why I'm talking about result sets instead of tables. Well, there are various ways to map a Java class hierarchy to a relational database schema: table-per-hierarchy, table-per-concrete-class, table-per-class. So the actual table structure may be quite complex. But the only way to actually get data efficiently out of the database is to denormalize it into one big square result set. In fact, I usually think of the job of the SQL query as transforming table-per-concrete-class or table-per-class mapped data into an intermediate table-per-hierarchy denormalized form. (This is, incidentally, why table-per-hierarchy offers the best performance of the three mapping strategies - it is already in a convenient form and does not require unions or joins on the database.) Whatever strategy we choose, we need to perform type discrimination upon this flattened result set.

Now, what I'm most interested in for the moment is the result set discriminator column. Most writings on the topic of inheritance mappings - and there have not been very many - and even most ORM solutions (actually every one that I know of), have assumed that the discriminator column is an actual physical column of the root table of the supertype. Indeed, it has usually been further assumed that the mapping from discriminator values to classes is one-to-one. But this need not be the case. In my credit card example, it certainly made sense. But now let's consider a different case. We're storing data relating to specific individuals in the PERSON table; our SQL query might look like this:

SELECT ID, NAME, SPECIES FROM PERSON

and the result set might be:

( 12345, 'Zxychg Ycjzy', 'Martian' )
( 52778, 'Glooble Queghm', 'Venusian' )
( 98876, 'Gavin King', 'Human' )

Now, here on earth, we consider Humans to be quite special and worthy of extra attention compared to other species of intelligent alien. So we might have a specific Human class, and a generic Alien class. Then the mapping from discriminator column values to classes is certainly not one-to-one. Indeed, there is a specific value for Human, and a catchall for Alien.

Actually, it's quite reasonable that we might even have some extra, Human-specific data, in the HUMAN table. To get all the data we need, let's use the following outer join (an inner join would drop the non-human rows):

SELECT ID, NAME, SPECIES, COUNTRY 
FROM PERSON 
    NATURAL LEFT OUTER JOIN HUMAN

( 12345, 'Zxychg Ycjzy', 'Martian', null )
( 52778, 'Glooble Queghm', 'Venusian', null )
( 98876, 'Arnold Schwarzenegger', 'Human', 'US' )

In this result set, we have two potential discriminator columns. Either the COUNTRY or SPECIES column could be used to determine if the individual is human. And the COUNTRY column isn't a column of the root PERSON table. Now imagine we introduce a further level of specialization, and include data specific to employees of our organization:

SELECT ID, NAME, SPECIES, COUNTRY, JOB
FROM PERSON 
    NATURAL LEFT OUTER JOIN HUMAN 
    NATURAL LEFT OUTER JOIN EMPLOYEE

( 12345, 'Zxychg Ycjzy', 'Martian', null, null )
( 52778, 'Glooble Queghm', 'Venusian', null, null )
( 98876, 'Arnold Schwarzenegger', 'Human', 'US', null )
( 34556, 'Gavin King', 'Human', 'AU', 'Java Developer' )

Now we can no longer perform type discrimination using just a single column. Ouch! That's messy. Let's change the query slightly:

SELECT ID, NAME, SPECIES, COUNTRY, JOB, 
    CASE 
        WHEN COUNTRY IS NULL THEN 'ALIEN' 
        WHEN JOB IS NULL THEN 'HUMAN' 
        ELSE 'EMPLOYEE' 
    END 
FROM PERSON 
    NATURAL LEFT OUTER JOIN HUMAN 
    NATURAL LEFT OUTER JOIN EMPLOYEE

( 12345, 'Zxychg Ycjzy', 'Martian', null, null, 'ALIEN' )
( 52778, 'Glooble Queghm', 'Venusian', null, null, 'ALIEN' )
( 98876, 'Arnold Schwarzenegger', 'Human', 'US', null, 'HUMAN' )
( 34556, 'Gavin King', 'Human', 'AU', 'Java Developer', 'EMPLOYEE' )

Yay, we've got our nice clean discriminator column back again! But this column most certainly does not correspond to any physical column of a table. It holds a purely derived value.

This is content-based discrimination. Our example uses a table-per-class mapping strategy, but the result sets above could just as easily have come from some other mapping strategy.

Here's a second example of content-based discrimination:

SELECT TX_ID, ACCOUNT_ID, AMOUNT, 
    CASE 
        WHEN AMOUNT>0 THEN 'CREDIT' 
        ELSE 'DEBIT' 
    END 
FROM TRANSACTIONS

( 12875467987, 98798723, 56.99, 'CREDIT' )
( 09808343123, 87558345, -123.25, 'DEBIT' )

Here, we use a column that is based upon a mathematical expression, AMOUNT>0, to discriminate between DebitTransaction and CreditTransaction. In principle, much more complex expressions are possible. (In practice, they are likely to remain quite simple.)

In Hibernate 2.x the table-per-class mapping strategy always used content-based discrimination, and the table-per-hierarchy strategy always used column-based discrimination. For some reason - that is now kind of obscure to me - that felt quite natural. In Hibernate3, you can use content-based discrimination for the table-per-hierarchy mapping strategy:

<class name="Person" table="PERSON" 
    discriminator-value="ALIEN">
    ...
    <discriminator type="string"> 
        <formula>
            CASE 
                WHEN LOC IS NULL THEN 'ALIEN' 
                WHEN JOB IS NULL THEN 'HUMAN' 
                ELSE 'EMPLOYEE' 
            END
        </formula>
    </discriminator>
    ...
    <subclass name="Human" 
        discriminator-value="HUMAN">
        ...
        <subclass name="Employee" 
            discriminator-value="EMPLOYEE">
            ...
        </subclass>
     </subclass>
 </class>

And you can use column-based discrimination for table-per-class mappings:

<class name="Person" table="PERSON" 
    discriminator-value="ALIEN">
    ...
    <discriminator type="string" column="TYPE"/>
    ...
    <subclass name="Human" 
        discriminator-value="HUMAN">
        <join table="HUMAN">
            ...
            <subclass name="Employee" 
                discriminator-value="EMPLOYEE">
                <join table="EMPLOYEE">
                    ...
                </join>
            </subclass>
        </join>
    </subclass>
</class>

For the table-per-concrete-class strategy (<union-subclass> mapping), only content-based discrimination makes sense.

14. Sep 2004, 09:15 CET, by Gavin King

September 20-22 in Melbourne will be the first time we deliver our new three-day Hibernate course. The course has been heavily revised and expanded to include previews of the cool new stuff coming in Hibernate3 and an overview of Hibernate internals (/very/ useful if you ever need to debug a Hibernate application). There are still seats available, if you're quick! This will be the last training we run in Australia for a while, since I won't be in the country much, if at all, over the next six months or so. Email training@jboss.com for more information. (We also have an upcoming course in Paris, November 3-5.)

27. Aug 2004, 09:46 CET, by Gavin King

I gotta preface this post by saying that we are very skeptical of the idea that Java is the right place to do processing that works with data in bulk. By extension, ORM is probably not an especially appropriate way to do batch processing. We think that most databases offer excellent solutions in this area: stored procedure support, and various tools for import and export. Because of this, we've neglected to properly explain to people how to use Hibernate for batch processing if they really feel they /have/ to do it in Java. At some point, we have to swallow our pride, and accept that lots of people are actually doing this, and make sure they are doing it the Right Way.

A naive approach to inserting 100 000 rows in the database using Hibernate might look like this:

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for ( int i=0; i<100000; i++ ) {
   Customer customer = new Customer(.....);
   session.save(customer);
}
tx.commit();
session.close();

This would fall over with an OutOfMemoryError somewhere after the 50 000th row. That's because Hibernate caches all the newly inserted Customers in the session-level cache. Certain people have expressed the view that Hibernate should manage memory better, and not simply fill up all available memory with the cache. One very noisy guy who used Hibernate for a day and noticed this is even going around posting on all kinds of forums and blog comments, shouting about how this demonstrates what shitty code Hibernate is. For his benefit, let's remember why the first-level cache is not bounded in size:

  • persistent instances are /managed/ - at the end of the transaction, Hibernate synchronizes any change to the managed objects to the database (this is sometimes called /automatic dirty checking/)
  • in the scope of a single persistence context, persistent identity is equivalent to Java identity (this helps eliminate data /aliasing/ effects)
  • the session implements /asynchronous write-behind/, which allows Hibernate to transparently batch together write operations

For typical OLTP work, these are all very, very useful features. Since ORM is really intended as a solution for OLTP problems, I usually ignore criticisms of ORM which focus upon OLAP or batch stuff as simply missing the point.
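As a quick illustration of the first point, here is a minimal sketch of automatic dirty checking at work (Customer, its name property and customerId are hypothetical):

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();

// load a managed instance; customerId is assumed to be a known identifier
Customer customer = (Customer) session.load(Customer.class, customerId);
customer.setName("New Name");  // no explicit save() or update() call needed

// at commit, Hibernate detects the change and issues the UPDATE itself
tx.commit();
session.close();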

However, it turns out that this problem is incredibly easy to work around. For the record, here is how you do batch inserts in Hibernate.

First, set the JDBC batch size to a reasonable number (say, 10-20):

hibernate.jdbc.batch_size 20
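(If you configure Hibernate programmatically rather than via hibernate.properties, a sketch of the equivalent, assuming a hibernate.cfg.xml on the classpath, is:)

Configuration cfg = new Configuration().configure();  // reads hibernate.cfg.xml
cfg.setProperty("hibernate.jdbc.batch_size", "20");   // same setting as above
SessionFactory sessionFactory = cfg.buildSessionFactory();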

Then, flush() and clear() the session every so often:

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();

for ( int i=0; i<100000; i++ ) {
   Customer customer = new Customer(.....);
   session.save(customer);
   if ( i % 20 == 0 ) {
      //flush a batch of inserts and release memory:
      session.flush();
      session.clear();
   }
}

tx.commit();
session.close();

What about retrieving and updating data? Well, in Hibernate 2.1.6 or later, the scroll() method is the best approach:

Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();

ScrollableResults customers = session.getNamedQuery("GetCustomers")
   .scroll(ScrollMode.FORWARD_ONLY);
int count=0;
while ( customers.next() ) {
   Customer customer = (Customer) customers.get(0);
   customer.updateStuff(...);
   if ( ++count % 20 == 0 ) {
      //flush a batch of updates and release memory:
      session.flush();
      session.clear();
   }
}

tx.commit();
session.close();

Not so difficult, or even shitty, I guess. Actually, I think you'll agree that this was much easier to write than the equivalent JDBC code messing with scrollable result sets and the JDBC batch API.

One caveat: if Customer has second-level caching enabled, you can still get some memory management problems. The reason for this is that Hibernate has to notify the second-level cache /after the end of the transaction/, about each inserted or updated customer. So you should disable caching of customers for the batch process.
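If you are on Hibernate3, one way to do that for the whole batch session (a sketch, assuming the 3.x API; otherwise simply leave the cache mapping out for the batch run) is:

// tell this session not to interact with the second-level cache
session.setCacheMode(CacheMode.IGNORE);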

25. Aug 2004, 19:20 CET, by Gavin King

We were doing some work with a customer with a very large project recently, and they were concerned about the traceability of the SQL issued by Hibernate. Their problem is one that I guess is common: suppose I see something wrong in the Hibernate log (say, some N+1 selects problem), how do I know which of my business classes is producing this? All I've got in the Hibernate log is org.hibernate.SQL, line 224 as the source of the log message!

I started to explain how Hibernate3 can embed comments into the generated SQL, so you could at least track the problem back to a particular HQL query. But then Steve remembered that log4j provides the /nested diagnostic context/. Now, I've seen a lot of projects using log4j, but I've never actually seen this used anywhere. I think it might be a better alternative to adding entry and exit logging everywhere, since we can see this context even if the entry/exit log categories are disabled. It's a good way to track the source of SQL in the Hibernate log. All you need to do is add calls to push() and pop() in your DAO:

public List getCustomersByName(String pattern) {
    NDC.push("CustomerDAO.getCustomersByName()");
    try {
        return getSession()
            .createQuery("from Customer c where c.name like :pattern")
            .setString("pattern", pattern)
            .list();
    }
    finally {
        NDC.pop();
    }
}

Then, if I set my pattern right:

log4j.appender.stdout.layout.ConversionPattern=%d{ABSOLUTE} %5p %c{1}:%L - %m (%x)%n

I'll get a log message like this:

20:59:38,249 DEBUG [=>SQL:244] - select .... like ? (CustomerDAO.getCustomersByName())

Just thought I'd mention it, in case it helps someone.
