Yesterday, another vendor marketing statement was posted on TSS. I usually ignore these, but when it is about data management, I just have to reply. What is always surprising to me is how little we Java developers still know about data management. Here is a statement made by Maxim Kramarenko in the discussion thread:
"OO/network DBMS can be very useful when you need to handle large hierarchies - simple graph navigation can be times more fast and simple way to fetch data then SQL. Even ORACLE START WITH/CONNECT BY statement works VERY slow for hierarchies."
Now, this statement has several fundamentally different (and important) things mixed together, and in the end confuses everyone more than it helps. I expressed my disbelief (yes, I should have been more verbose and friendly...) and then was asked to contribute to the discussion. Here is my (rather long) response:
I'm using the EXPLODE operator provided by my database management system if I'd like to get a hierarchical view of my data (which is the most common operation when it cames to data hierarchies, I think). If no such operator is available or if I find its performance characteristics unacceptable I call my DBMS vendor and complain until they fix it. Of course, I only do that after optimizing my query execution plans, etc. (this should eliminate most issues).
If all of this is not sufficient, I might consider using a nested set model or materialized path to represent my tree information, variations of the simple adjacency list. Again, this is certainly the last step I'd take (which still happens way too often with the poor quality of the SQL products we have today), and is highly likely only acceptable for read-mostly trees. What we have in SQL, the recursive WITH foo as () or the proprietary CONNECT BY variation, is not neccessarily what I have in mind if I think about a good EXPLODE operator. But see below for a reference with a better and complete explanation.
I would certainly not sacrifice any of the (with SQL very rare) advantages I get from the relational data model if I can't find a good operator for my implementation in my DBMS product. After all, its a logical data model, and any performance concern is a physical implementation detail. I don't see how both are related, but I know we are in bad shape if we start messing up the differences between the two. There is no reason why a network/graph-based logical model or a relational model couldn't be implemented with the same performance characteristics. Just because some products are not implemented optimal doesn't mean we should ditch the whole concept of data independence!
Complain to your DBMS vendor. Ask them why their product doesn't fully support state of the art relational data management, such as a relational (and possibly polymorphic) data type system, or a query language that supports closures/explosion for data hierarchies. The list of deficiencies in todays SQL products is unfortunately much longer than this. It's not the fault of the data model if you can't do something in a particular product, or if a specification has serious problems.
It's easy for the snake oil salesman to sell you his old wine if you let yourself get confused about logical data models and physical implementations. It hurts everyone in the end, as we all have to tell our software vendors what we would like to see and what support we need in a product. If we are not able to clearly articulate our needs and if we forget history (ie. what worked and what didn't work in the past), we might get tricked. I'm not feeling comfortable with that.
Finally, a recommendation I can make is the book Practical Issues in Database Management by Fabian Pascal. It is a small book, having only 250 pages. Fabian shows you 10 common issues you will face in your daily practice, but instead of simply explaining how to work your way through SQL, he first explains the relational data model basics for each problem. He then looks at the current practice and explains what you can do in SQL (or what we would need in a DBMS) to solve or work around the issue. A quick read for a weekend and definitely recommended. If you want to brush up your data management basics, buy it in a bundle with Chris Date's Introduction to Database Systems
.
Relations require one index lookup for every relation and therefore they must be slower than the direct pointers that object databases use. I repeat my challenge from TSS to provide an equivalent solution for storing and loading a deep tree. I bet that you will not get close to 1/10th of the speed of my db4o solution with a relational approach, and an O-R mapper will certainly not improve your chances.
import java.io.*;
import com.db4o.*;
public class Tree {
static final String FILE = "tree.yap";
static final int DEPTH = 15;
Tree preceding;
Tree subsequent;
public static void main(String[] args) {
new File(FILE).delete();
Tree tree = createTree(DEPTH);
ObjectContainer objectContainer = Db4o.openFile(FILE);
long start = System.currentTimeMillis();
objectContainer.set(tree);
long id = objectContainer.ext().getID(tree);
objectContainer.close();
long stop = System.currentTimeMillis();
System.out.println("Time to store tree: " +
(stop - start) + "ms");
objectContainer = Db4o.openFile(FILE);
start = System.currentTimeMillis();
tree = (Tree)objectContainer.ext().getByID(id);
objectContainer.activate(tree, Integer.MAX_VALUE);
stop = System.currentTimeMillis();
System.out.println("Time to load tree: " +
(stop - start) + "ms");
objectContainer.close();
}
public static Tree createTree(int depth){
if(depth < 1){
return null;
}
Tree tree = new Tree();
tree.preceding = createTree(depth - 1);
tree.subsequent = createTree(depth - 1);
return tree;
}
}
Christian wrote:
"There is no reason why a network/graph-based logical model or a relational model couldn't be implemented with the same performance characteristics. Just because some products are not implemented optimal doesn't mean we should ditch the whole concept of data independence!"
The above is wrong.
Relations require one index lookup for every relation and therefore they must be slower than the direct pointers that object databases use. I repeat my challenge from TSS to provide an equivalent solution for storing and loading a deep tree. I bet that you will not get close to 1/10th of the speed of my db4o solution with a relational approach, and an O-R mapper will certainly not improve your chances.
import java.io.*;
import com.db4o.*;
public class Tree {
static final String FILE = "tree.yap";
static final int DEPTH = 15;
Tree preceding;
Tree subsequent;
public static void main(String[] args) {
new File(FILE).delete();
Tree tree = createTree(DEPTH);
ObjectContainer objectContainer = Db4o.openFile(FILE);
long start = System.currentTimeMillis();
objectContainer.set(tree);
long id = objectContainer.ext().getID(tree);
objectContainer.close();
long stop = System.currentTimeMillis();
System.out.println("Time to store tree: " +
(stop - start) + "ms");
objectContainer = Db4o.openFile(FILE);
start = System.currentTimeMillis();
tree = (Tree)objectContainer.ext().getByID(id);
objectContainer.activate(tree, Integer.MAX_VALUE);
stop = System.currentTimeMillis();
System.out.println("Time to load tree: " +
(stop - start) + "ms");
objectContainer.close();
}
public static Tree createTree(int depth){
if(depth < 1){
return null;
}
Tree tree = new Tree();
tree.preceding = createTree(depth - 1);
tree.subsequent = createTree(depth - 1);
return tree;
}
}
http://groups-beta.google.com/group/de.comp.lang.java/msg/44da4c8de9c121ed?dmode=print
It's in German. Translation (logical errors are not mine, but Carls):
"In a relational database a join is a primary key and a foreign key (sic!). A query that aggregates from several tables can only work if there is a foreign key. Thats why you have to modify your database model for every new link."
Well, he almost got the description of relationships and data access in a network database right (his own product), but I was shocked that this kind of misinformation and misuse of terminology could come from an "expert" in the field. Any further discussion is pointless at this level.
Somehow the practical result works very good and my above challenge will prove it, if you dare to compare.
Your refusal not to take up the challenge will let anyone assume that you do not see a chance for yourself to get close to 1/10th of our speed.
I can't remember sarcasm ever being a successful reply to facts.
Please continue writing books.
We write software.
"Pointer tables in an object database are something different".
We do not use pointer *tables* to load members of objects. We use direct *pointers* from objects to members with one single indirection for transactional purposes. Performance for navigation stays flat-out-exactly-the-same, no matter how many objects there are.
Relational databases usually do not provide this feature to achieve the o-so-admired logical independance. Feel free to use a product with redundant physical pointers to try to beat the results of my challenge.
So you have finallly read the blog entry, congratulations. Please, tell us why you can't implement a logical relational data model with your superior physical pointer model. Why do you have to sell an obsolete logical model with your implementation? And don't forget to also explain why we should throw away decades of data management theory, just because _you_ can't implement it right. And no, I can't either, but I'm not the database vendor here. I'm blaming you for selling me obsolute stuff and unlike so many others you have met, I can see when you are waving your hand.
Hehe, the usual Rosenberger response. Well tested and works every time, eh?
You are talking about theoretical implementations. I am talking about implementations that are available today.
You are saying that my database theory is bad. I am saying that your practical implementation of an O-R mapper naturally has to be slow for member lookup because it is commonly run on systems that do not supply physical pointers.
You say I am an idiot.
I say your product will work worse than ours.
We may both be perfectly right.
I am not saying that it is not possible to add physical pointers to a relational system but hardly any product does it. Information would just be redundant. A system with physical pointers will naturally faster, do we agree on that?
Another question:
If one uses Hibernate as it is and all the mappings that it uses. If one only uses criteria queries as in chapter 12 of your documentation. What would be the advantage of having a relational database in the back?
In practice?
Consider this table:
location (
name varchar(50),
location_id int, parent_location_id int,
hier_level int,
traversal_order int)
hier_level is a non-normalized column. It stores the level in the hierarchy. traversal_order is also calculated. It is the natural hierarchial ordering if you were to display all items as a list.
We created the following table:
location_family (
ancestor_id int, descendant_id int)
This can be entirely created from existing location data. This number of records in this table is the square of the number of records in 'location'. (size is a downside)
Now we can do extremely complex hierarchical queries w/o stored procedures and w/o proprietary features of one SQL product. Of course, some sort of routine must be run to populate the non-normalized columns and location_family.
Example: Get location 100 and all of its children:
select * from location
where location_id in (select descendant_id from location_family where ancestor_id=100)
order by traversal_order;
http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
Why do we need getRight(), getLeft(), and etc. in NodeInfo class (latest version of CaveatEmptor)? Java 'contract' should not depend on a low-level table design.
What is 'Thread' - I don't know yet, please help.
Simply implement writeObject() and readObject() and get easily 'Object database'; sample: Lucene, Nutch.
What about different types of concurrency strategy, tolerance, and (indeed) performance comparable to BIOS HDD block update?
For instance, look at Oracle as a specific type of an optimized for small objects 'FileSystem' (forget SQL). UNIX/Windows are not optimized for small objects, each 100-byte file needs 8k (16k, etc.) block on a hard-drive! Can you update 8k block on a harddrive using writeObject() (just 100 contigious bytes to update)?
Are we reinventing the wheel?
BTW, we can externalize everything to Oracle etc. (forget SQL!), we are not limited by Windows/UNIX file system to read/write Java objects.
Performance comparisons look wrong. Each tool has a specific area to be applied...