
Visualizing data structures is not easy, and I'm confident that much of the success of the exceptionally well-received demo we presented at the JBoss World 2011 keynote came from the nice web UIs projected on the multiple big screens. These web applications effectively visualized the flowing tweets, the voted hashtags highlighted in the tag cloud, and the animated Infinispan grid, with the nodes dancing on an ideal hash wheel that showed how the data was distributed among them.

So I bet that everybody in the room got a clear picture of the fact that the data was stored in Infinispan, and when we unplugged a random server live, everybody could see the data reorganize itself, making it seem a simple and natural way to process huge amounts of data. Not all technical details were explained, so in this post and the following one we're going to detail what you did not see: how was the data stored, how could Drools filter it, and how could all the visualizations load the data stored in the grid and still be developed in record time?

JPA on a Grid

All those applications were simply using JPA, the Java Persistence API. Think about the name: it's clearly meant to address all persistence needs of Java applications. While it's traditionally associated with JDBC databases, we have just shown it's not necessarily limited to them: our applications were running an early preview of the Hibernate Object/Grid Mapper, Hibernate OGM, and the simple JPA model was mapped to Infinispan, a key/value store.

Collecting requirements

The initial plan didn't include Hibernate OGM, as it was still very experimental and had never been released or even tagged, but it was clear that we wanted to use Infinispan to both store and search the tweets. Kevin Conner was the architect who envisioned the technical needs, managed to push each developer involved to do their part, and finally assembled it all into a working application in record time. He came to Emmanuel and me with a simple list of requirements:

  • we want to showcase Infinispan
  • we want to store tweets, many of them, in real time as they come in from a live Twitter stream
  • we need to be able to iterate over them all in time order, to roll back the stream and re-process it (as you can see in the demo recording, we had a fake cheater and wanted to apply stricter rules to filter out invalid tweets at a second stage, without losing the originally collected tweets)
  • we need to know which projects are voted the most: people are going to express preferences via hashtags in their tweets
  • we want to know who's voting the most
  • it has to perform well, potentially on a big set of data

Using Lucene

So, to implement these search requirements, you have to consider that since Infinispan is a key/value store, performing queries is not as natural as it would be on a database. Infinispan currently provides two main approaches: using the Query module or defining some simple Map/Reduce tasks.
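To give a feel for the second approach, here is a minimal plain-Java sketch of the map/reduce shape applied to hashtag counting. It deliberately does not use Infinispan's Map/Reduce API: the class HashtagCountSketch, its methods and the sample tweets are all hypothetical, and everything runs in a single JVM over an in-memory list rather than over entries spread across a grid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: this is NOT the Infinispan Map/Reduce API.
final class HashtagCountSketch {

	// "Map" phase: emit one lower-cased hashtag per occurrence in a message.
	// Naive on purpose: trailing punctuation is not stripped.
	static List<String> map(String message) {
		List<String> emitted = new ArrayList<>();
		for (String word : message.split("\\s+")) {
			if (word.length() > 1 && word.startsWith("#")) {
				emitted.add(word.toLowerCase(Locale.ROOT));
			}
		}
		return emitted;
	}

	// "Reduce" phase: sum the occurrences emitted for each hashtag.
	static Map<String, Integer> reduce(List<String> emitted) {
		Map<String, Integer> counts = new TreeMap<>();
		for (String tag : emitted) {
			counts.merge(tag, 1, Integer::sum);
		}
		return counts;
	}

	// Runs both phases over every message; on a grid each node would map its own entries.
	static Map<String, Integer> countHashtags(List<String> messages) {
		List<String> emitted = new ArrayList<>();
		for (String message : messages) {
			emitted.addAll(map(message));
		}
		return reduce(emitted);
	}

	public static void main(String[] args) {
		List<String> messages = Arrays.asList(
				"I vote for #Infinispan",
				"#infinispan and #Drools rock",
				"no tags here");
		System.out.println(countHashtags(messages)); // prints {#drools=1, #infinispan=2}
	}
}
```

Note that every run has to visit every tweet again, which is one reason an index is attractive when the same counts are requested repeatedly.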

Also, consider those requirements. Using SQL, how were we going to count all tweets containing a specific hashtag, extract this count for all collected hashtags, and sort them by frequency? On a relational database, that would have been a very inefficient query involving at least a full table scan, possibly a scan per hashtag, and it would have required a prior list of hashtags to look for. We wanted to extract the most frequently mentioned tags, but we didn't know in advance what to look for, as people were free to vote for anything.

A totally different approach is to use an inverted index: every time you save a tweet, you tokenize it, extract all its terms, and maintain for each term a list of pointers to the tweets containing it, storing the frequency as well. That's exactly how full-text search engines like Lucene work; in addition, Lucene applies state-of-the-art optimizations, caches and filtering capabilities. Both Infinispan Query and Hibernate Search provide nice and easy integrations with Lucene (they actually share the same engine, one geared towards Infinispan users and the other towards Hibernate and JPA users).
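The idea can be sketched in a few lines of plain Java. This is only an illustration of the principle, under the assumption of a naive tokenizer and an in-memory map; the class InvertedIndexSketch and its methods are hypothetical, and Lucene's real index structures are far more elaborate.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: not how Lucene stores its index.
final class InvertedIndexSketch {

	// term -> ids of the tweets containing it (the "pointers")
	private final Map<String, Set<String>> postings = new HashMap<>();

	// Tokenize the message (naively) and record the tweet id under each term.
	void add(String tweetId, String message) {
		for (String term : message.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}#]+")) {
			if (!term.isEmpty()) {
				postings.computeIfAbsent(term, t -> new LinkedHashSet<>()).add(tweetId);
			}
		}
	}

	// A lookup is a single map access: no scan over all stored tweets.
	Set<String> tweetsContaining(String term) {
		return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
	}

	// Number of tweets containing the term.
	int frequency(String term) {
		return tweetsContaining(term).size();
	}

	// All known terms, most frequent first, ties broken alphabetically.
	// No prior list of terms to look for is needed.
	List<String> termsByFrequency() {
		List<String> terms = new ArrayList<>(postings.keySet());
		terms.sort(Comparator.comparingInt((String t) -> postings.get(t).size()).reversed()
				.thenComparing(Comparator.naturalOrder()));
		return terms;
	}

	public static void main(String[] args) {
		InvertedIndexSketch index = new InvertedIndexSketch();
		index.add("t1", "I vote #infinispan");
		index.add("t2", "#infinispan with #drools");
		index.add("t3", "#drools #infinispan #infinispan");
		System.out.println(index.termsByFrequency()); // prints [#infinispan, #drools, i, vote, with]
	}
}
```

The work is paid once, at write time, so both "which tweets mention this tag" and "which tags are mentioned most" become cheap reads.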

Counting who voted the most is technically comparable to counting term frequencies, so again Lucene would be perfect. Sorting all data by timestamp is not by itself a good reason to introduce Lucene, but it does that pretty well too, so Lucene would indeed solve all query needs for this application.

Hibernate OGM with Hibernate Search

So Infinispan Query could have been a good choice. But we opted for Hibernate OGM with Hibernate Search, as they provide the same indexing features on top of a nice JPA interface. I have to admit that Hibernate OGM was initially discarded because it lacked an HQL query parser: my fault, as I was late implementing it. In this case it was not a problem, as all the queries we needed were better solved with full-text queries, which are not defined via HQL.

Model

So what does our model look like? Very simple: it's a single JPA entity, enhanced with some Hibernate Search annotations.

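// Imports assume the Hibernate Search 3.x and Solr analysis packages of the time;
// later versions moved the analysis factories to other packages.
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.apache.solr.analysis.ASCIIFoldingFilterFactory;
import org.apache.solr.analysis.LowerCaseFilterFactory;
import org.apache.solr.analysis.StandardTokenizerFactory;
import org.apache.solr.analysis.StopFilterFactory;
import org.hibernate.annotations.GenericGenerator;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Index;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.NumericField;
import org.hibernate.search.annotations.Parameter;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

@Indexed(index = "tweets")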
@Analyzer(definition = "english")
@AnalyzerDef(name = "english",
	tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
	filters = {
		@TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
		@TokenFilterDef(factory = LowerCaseFilterFactory.class),
		@TokenFilterDef(factory = StopFilterFactory.class, params = {
				@Parameter(name = "words", value = "stoplist.properties"),
				@Parameter(name = "resource_charset", value = "UTF-8"),
				@Parameter(name = "ignoreCase", value = "true")
		})
})
@Entity
public class Tweet {
	
	private String id;
	private String message = "";
	private String sender = "";
	private long timestamp = 0L;
	
	public Tweet() {}
	
	public Tweet(String message, String sender, long timestamp) {
		this.message = message;
		this.sender = sender;
		this.timestamp = timestamp;
	}

	@Id
	@GeneratedValue(generator = "uuid")
	@GenericGenerator(name = "uuid", strategy = "uuid2")
	public String getId() { return id; }
	public void setId(String id) { this.id = id; }

	@Field
	public String getMessage() { return message; }
	public void setMessage(String message) { this.message = message; }

	@Field(index=Index.UN_TOKENIZED)
	public String getSender() { return sender; }
	public void setSender(String sender) { this.sender = sender; }

	@Field
	@NumericField
	public long getTimestamp() { return timestamp; }
	public void setTimestamp(long timestamp) { this.timestamp = timestamp; }

}

Note the uuid generator for the identifier: that's currently the most efficient one to use in a distributed environment. On top of the standard @Entity, @Indexed enables the Lucene indexing engine, while @AnalyzerDef and @Analyzer specify the text cleanup we want to apply to the indexed tweets. @Field selects the properties to be indexed, and @NumericField makes sure the numeric sort is performed efficiently, treating the indexed value as a real number and not as an additional keyword: always remember that Lucene is focused on natural language matching.
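To make the "text cleanup" more concrete, here is a rough plain-Java approximation of what the analyzer chain declared on the entity does to a message: tokenize, fold accented characters to ASCII, lower-case, then drop stop words. It is a sketch under stated assumptions and not the Lucene/Solr implementation: the class AnalyzerChainSketch is hypothetical, the tokenizer is a naive split, and the four stop words are invented for the example (the real list lives in stoplist.properties).

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical approximation of the analyzer chain; the real filters handle many more cases.
final class AnalyzerChainSketch {

	// Assumed stop words, for illustration only.
	private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "is", "and"));

	static List<String> analyze(String text) {
		List<String> terms = new ArrayList<>();
		// 1. Tokenizer: split on anything that is not a letter or a digit.
		for (String token : text.split("[^\\p{L}\\p{N}]+")) {
			if (token.isEmpty()) {
				continue;
			}
			// 2. ASCII folding: decompose accented characters, then drop the combining marks.
			String folded = Normalizer.normalize(token, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
			// 3. Lower-casing, so that matching ignores case.
			String lowered = folded.toLowerCase(Locale.ROOT);
			// 4. Stop filter: discard very common words that carry no meaning for search.
			if (!STOP_WORDS.contains(lowered)) {
				terms.add(lowered);
			}
		}
		return terms;
	}

	public static void main(String[] args) {
		System.out.println(analyze("The Caf\u00e9 is OPEN and Infinispan rocks!"));
		// prints [cafe, open, infinispan, rocks]
	}
}
```

The same chain is applied to the text at indexing time and to the query terms, which is why a search for "cafe" can match a tweet containing "Café".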

Example queries

This is going to look a bit verbose, as I'm expanding all functions for clarity:

public List<Tweet> allTweetsSortedByTime() {

	//this is needed only once but we want to show it in this context:
	QueryBuilder queryBuilder = fullTextEntityManager.getSearchFactory().buildQueryBuilder().forEntity( Tweet.class ).get();

	//Define a Lucene query which is going to return all tweets:
	Query query = queryBuilder.all().createQuery();

	//Make a JPA Query out of it:
	FullTextQuery fullTextQuery = fullTextEntityManager.createFullTextQuery( query );

	//Currently needed to have Hibernate Search work with OGM:
	fullTextQuery.initializeObjectsWith( ObjectLookupMethod.PERSISTENCE_CONTEXT, DatabaseRetrievalMethod.FIND_BY_ID );

	//Specify the desired sort:
	fullTextQuery.setSort( new Sort( new SortField( "timestamp", SortField.LONG ) ) );

	//Run the query (or alternatively open a scrollable result):
	return fullTextQuery.getResultList();
}

Download it

To see the full example, I pushed a complete Maven project to GitHub. It includes a test of all queries and all project details needed to get it running, such as the Infinispan and JGroups configurations and the persistence.xml needed to enable Hibernate OGM.

Please clone it to start playing with OGM: https://github.com/Sanne/tweets-ogm

And see you on IRC, the Hibernate Search forums, or the brand new Hibernate OGM forums for any clarification.

How are entities persisted in the grid?

Emmanuel is going to blog about that soon, so keep an eye on the blog! Update: the blog post on OGM has been published.

Comments:

15. Jun 2011, 23:18 CET
Leo

Is there a plan to support other NoSQL databases, like Redis or Riak?

16. Jun 2011, 12:17 CET

Yes! Just as different databases were abstracted before, it would be nice to help portability across different NoSQL engines, to some extent at least. The main problem is that, as the NoSQL name implies, there's no common language and all systems differ from each other to a great extent. Infinispan is just our currently supported engine: we know it best, it's written in Java, it rocks :) and since it supports transactions to some extent it makes our life simpler, but we're looking towards supporting many more. If anyone wants to help, please join the mailing list and introduce yourself! As mentioned, a more in-depth post is coming specifically about OGM.

 