Help

Yesterday, another vendor marketing statement was posted on TSS. I usually ignore these, but when it is about data management, I just have to reply. What is always surprising to me is how little we Java developers still know about data management. Here is a statement made by Maxim Kramarenko in the discussion thread:

"OO/network DBMS can be very useful when you need to handle large 
hierarchies - simple graph navigation can be times more fast and simple 
way to fetch data then SQL. Even ORACLE START WITH/CONNECT BY statement 
works VERY slow for hierarchies." 

Now, this statement has several fundamentally different (and important) things mixed together, and in the end confuses everyone more than it helps. I expressed my disbelief (yes, I should have been more verbose and friendly...) and then was asked to contribute to the discussion. Here is my (rather long) response:

I'm using the EXPLODE operator provided by my database management system if I'd like to get a hierarchical view of my data (which is the most common operation when it cames to data hierarchies, I think). If no such operator is available or if I find its performance characteristics unacceptable I call my DBMS vendor and complain until they fix it. Of course, I only do that after optimizing my query execution plans, etc. (this should eliminate most issues).

If all of this is not sufficient, I might consider using a nested set model or materialized path to represent my tree information, variations of the simple adjacency list. Again, this is certainly the last step I'd take (which still happens way too often with the poor quality of the SQL products we have today), and is highly likely only acceptable for read-mostly trees. What we have in SQL, the recursive WITH foo as () or the proprietary CONNECT BY variation, is not neccessarily what I have in mind if I think about a good EXPLODE operator. But see below for a reference with a better and complete explanation.

I would certainly not sacrifice any of the (with SQL very rare) advantages I get from the relational data model if I can't find a good operator for my implementation in my DBMS product. After all, its a logical data model, and any performance concern is a physical implementation detail. I don't see how both are related, but I know we are in bad shape if we start messing up the differences between the two. There is no reason why a network/graph-based logical model or a relational model couldn't be implemented with the same performance characteristics. Just because some products are not implemented optimal doesn't mean we should ditch the whole concept of data independence!

Complain to your DBMS vendor. Ask them why their product doesn't fully support state of the art relational data management, such as a relational (and possibly polymorphic) data type system, or a query language that supports closures/explosion for data hierarchies. The list of deficiencies in todays SQL products is unfortunately much longer than this. It's not the fault of the data model if you can't do something in a particular product, or if a specification has serious problems.

It's easy for the snake oil salesman to sell you his old wine if you let yourself get confused about logical data models and physical implementations. It hurts everyone in the end, as we all have to tell our software vendors what we would like to see and what support we need in a product. If we are not able to clearly articulate our needs and if we forget history (ie. what worked and what didn't work in the past), we might get tricked. I'm not feeling comfortable with that.

Finally, a recommendation I can make is the book Practical Issues in Database Management by Fabian Pascal. It is a small book, having only 250 pages. Fabian shows you 10 common issues you will face in your daily practice, but instead of simply explaining how to work your way through SQL, he first explains the relational data model basics for each problem. He then looks at the current practice and explains what you can do in SQL (or what we would need in a DBMS) to solve or work around the issue. A quick read for a weekend and definitely recommended. If you want to brush up your data management basics, buy it in a bundle with Chris Date's Introduction to Database Systems.

13 comments:
 
09. Dec 2004, 15:13 CET | Link
Carl Rosenberger | carl(AT)db4o.com
There is no reason why a network/graph-based logical model or a relational model couldn't be implemented with the same performance characteristics. Just because some products are not implemented optimal doesn't mean we should ditch the whole concept of data independence!

Relations require one index lookup for every relation and therefore they must be slower than the direct pointers that object databases use. I repeat my challenge from TSS to provide an equivalent solution for storing and loading a deep tree. I bet that you will not get close to 1/10th of the speed of my db4o solution with a relational approach, and an O-R mapper will certainly not improve your chances.

import java.io.*;
import com.db4o.*;
public class Tree {
 static final String FILE = "tree.yap";
 static final int DEPTH = 15;
 Tree preceding;
 Tree subsequent;
 public static void main(String[] args) {
  new File(FILE).delete();
  Tree tree = createTree(DEPTH);
  ObjectContainer objectContainer = Db4o.openFile(FILE);
  long start = System.currentTimeMillis();
  objectContainer.set(tree);
  long id = objectContainer.ext().getID(tree);
  objectContainer.close();
  long stop = System.currentTimeMillis();
  System.out.println("Time to store tree: " +
   (stop - start) + "ms");
  objectContainer = Db4o.openFile(FILE);
  start = System.currentTimeMillis();
  tree = (Tree)objectContainer.ext().getByID(id);
  objectContainer.activate(tree, Integer.MAX_VALUE);
  stop = System.currentTimeMillis();
  System.out.println("Time to load tree: " +
    (stop - start) + "ms");
  objectContainer.close();
 }
 public static Tree createTree(int depth){
  if(depth < 1){
   return null;
  }
  Tree tree = new Tree();
  tree.preceding = createTree(depth - 1);
  tree.subsequent = createTree(depth - 1);
  return tree;
  }
}
ReplyQuote
 
09. Dec 2004, 15:17 CET | Link
Carl Rosenberger | carl(AT)db4o.com
I meant to quote you Christian but "blockquote" didn't work. Let's try again:

Christian wrote:
"There is no reason why a network/graph-based logical model or a relational model couldn't be implemented with the same performance characteristics. Just because some products are not implemented optimal doesn't mean we should ditch the whole concept of data independence!"

The above is wrong.

Relations require one index lookup for every relation and therefore they must be slower than the direct pointers that object databases use. I repeat my challenge from TSS to provide an equivalent solution for storing and loading a deep tree. I bet that you will not get close to 1/10th of the speed of my db4o solution with a relational approach, and an O-R mapper will certainly not improve your chances.

import java.io.*;
import com.db4o.*;
public class Tree {
static final String FILE = "tree.yap";
static final int DEPTH = 15;
Tree preceding;
Tree subsequent;
public static void main(String[] args) {
new File(FILE).delete();
Tree tree = createTree(DEPTH);
ObjectContainer objectContainer = Db4o.openFile(FILE);
long start = System.currentTimeMillis();
objectContainer.set(tree);
long id = objectContainer.ext().getID(tree);
objectContainer.close();
long stop = System.currentTimeMillis();
System.out.println("Time to store tree: " +
(stop - start) + "ms");
objectContainer = Db4o.openFile(FILE);
start = System.currentTimeMillis();
tree = (Tree)objectContainer.ext().getByID(id);
objectContainer.activate(tree, Integer.MAX_VALUE);
stop = System.currentTimeMillis();
System.out.println("Time to load tree: " +
(stop - start) + "ms");
objectContainer.close();
}
public static Tree createTree(int depth){
if(depth < 1){
return null;
}
Tree tree = new Tree();
tree.preceding = createTree(depth - 1);
tree.subsequent = createTree(depth - 1);
return tree;
}
}
 
09. Dec 2004, 17:16 CET | Link
Christian
OK, I will just let that stand for itself, Carl Rosenberger debunking himself again. "Relations need an index". What an educated statement. Pointer tables in an object database are something different! Only object database vendors can implement their systems that way! Relational databases just have to use poor old indexes and can't use the magic! Yes, being sarcastic doesn't help. Last time I've seen Carl was on Usenet, he was explaining to me that joins need foreign keys. He is also well known for making ridiculous "challenges" like this. Nobody every takes them, and I certainly won't waste my time. But thanks Carl, for showing us how a logical/physical confusion looks like. Come back once you understand the difference.

 
09. Dec 2004, 19:01 CET | Link
Christian
And here is his explanation of the relational model (thats when I stopped listening to him, before that his SODA query stuff had some interesting ideas):

http://groups-beta.google.com/group/de.comp.lang.java/msg/44da4c8de9c121ed?dmode=print

It's in German. Translation (logical errors are not mine, but Carls):

"In a relational database a join is a primary key and a foreign key (sic!). A query that aggregates from several tables can only work if there is a foreign key. Thats why you have to modify your database model for every new link."

Well, he almost got the description of relationships and data access in a network database right (his own product), but I was shocked that this kind of misinformation and misuse of terminology could come from an "expert" in the field. Any further discussion is pointless at this level.


 
09. Dec 2004, 21:52 CET | Link
Carl Rosenberger | carl(AT)db4o.com
I bow low for your theoretical superiority, I can not match it. I am very often wrong with theories, that's why I work empirical and test-first.

Somehow the practical result works very good and my above challenge will prove it, if you dare to compare.

Your refusal not to take up the challenge will let anyone assume that you do not see a chance for yourself to get close to 1/10th of our speed.

I can't remember sarcasm ever being a successful reply to facts.

Please continue writing books.

We write software.
 
09. Dec 2004, 22:03 CET | Link
Carl Rosenberger | carl(AT)db4o.com
Oh, and I overread this one:
"Pointer tables in an object database are something different".

We do not use pointer *tables* to load members of objects. We use direct *pointers* from objects to members with one single indirection for transactional purposes. Performance for navigation stays flat-out-exactly-the-same, no matter how many objects there are.

Relational databases usually do not provide this feature to achieve the o-so-admired logical independance. Feel free to use a product with redundant physical pointers to try to beat the results of my challenge.
 
10. Dec 2004, 01:50 CET | Link
Christian
"Relational databases usually do not provide this feature"

So you have finallly read the blog entry, congratulations. Please, tell us why you can't implement a logical relational data model with your superior physical pointer model. Why do you have to sell an obsolete logical model with your implementation? And don't forget to also explain why we should throw away decades of data management theory, just because _you_ can't implement it right. And no, I can't either, but I'm not the database vendor here. I'm blaming you for selling me obsolute stuff and unlike so many others you have met, I can see when you are waving your hand.

 
10. Dec 2004, 02:00 CET | Link
Christian
"Your refusal not to take up the challenge will let anyone assume that you do not see a chance for yourself to get close to 1/10th of our speed. "

Hehe, the usual Rosenberger response. Well tested and works every time, eh?

 
10. Dec 2004, 15:14 CET | Link
Carl Rosenberger | carl(AT)db4o.net
I think we are discussing at completely different levels. You talk theory, I talk practice.

You are talking about theoretical implementations. I am talking about implementations that are available today.

You are saying that my database theory is bad. I am saying that your practical implementation of an O-R mapper naturally has to be slow for member lookup because it is commonly run on systems that do not supply physical pointers.

You say I am an idiot.
I say your product will work worse than ours.

We may both be perfectly right.

I am not saying that it is not possible to add physical pointers to a relational system but hardly any product does it. Information would just be redundant. A system with physical pointers will naturally faster, do we agree on that?

Another question:
If one uses Hibernate as it is and all the mappings that it uses. If one only uses criteria queries as in chapter 12 of your documentation. What would be the advantage of having a relational database in the back?
In practice?
 
10. Dec 2004, 17:58 CET | Link
Christian
Your refusal not to take up the challenge will let anyone assume that you do not see a chance for yourself to get close to 1/10th of the speed of a relational database implemented correctly.

 
20. Dec 2004, 18:50 CET | Link
Michael Slattery
Here is what we did to make hierarchical data easy and fast in SQL.

Consider this table:
location (
name varchar(50),
location_id int, parent_location_id int,
hier_level int,
traversal_order int)

hier_level is a non-normalized column. It stores the level in the hierarchy. traversal_order is also calculated. It is the natural hierarchial ordering if you were to display all items as a list.

We created the following table:
location_family (
ancestor_id int, descendant_id int)

This can be entirely created from existing location data. This number of records in this table is the square of the number of records in 'location'. (size is a downside)

Now we can do extremely complex hierarchical queries w/o stored procedures and w/o proprietary features of one SQL product. Of course, some sort of routine must be run to populate the non-normalized columns and location_family.

Example: Get location 100 and all of its children:
select * from location
where location_id in (select descendant_id from location_family where ancestor_id=100)
order by traversal_order;







 
10. Apr 2006, 22:11 CET | Link
Fuad
Very good article on 'Nested Sets Model' in SQL:
http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

Why do we need getRight(), getLeft(), and etc. in NodeInfo class (latest version of CaveatEmptor)? Java 'contract' should not depend on a low-level table design.

What is 'Thread' - I don't know yet, please help.

 
10. Apr 2006, 22:55 CET | Link
Fuad
Some thoughts related to discussion...

Simply implement writeObject() and readObject() and get easily 'Object database'; sample: Lucene, Nutch.
What about different types of concurrency strategy, tolerance, and (indeed) performance comparable to BIOS HDD block update?
For instance, look at Oracle as a specific type of an optimized for small objects 'FileSystem' (forget SQL). UNIX/Windows are not optimized for small objects, each 100-byte file needs 8k (16k, etc.) block on a hard-drive! Can you update 8k block on a harddrive using writeObject() (just 100 contigious bytes to update)?

Are we reinventing the wheel?

BTW, we can externalize everything to Oracle etc. (forget SQL!), we are not limited by Windows/UNIX file system to read/write Java objects.

Performance comparisons look wrong. Each tool has a specific area to be applied...
Post Comment