Thursday, June 10, 2010

Intro to Hazelcast's Distributed Query

When you decide to incorporate a distributed data grid into your application architecture, a product's scalability, reliability, cost and performance are key considerations that will help you make your decision. Another important factor is how easily you can get at the data. One nice feature of Hazelcast that I have been working with lately is distributed queries. In simple terms, distributed queries provide an API and syntax that let a developer query for entries that exist in a Hazelcast distributed map. Let's look at a very simple example.

In the demo project (link at the bottom) I have one object, a test case and the Hazelcast 1.8.4 jar file as a project dependency. Below is the class that will be put into a distributed map, ReportData. Once we have a distributed map that is full of ReportData entries, we can use Hazelcast's distributed query to find our ReportData entries.

package org.axiomaticit.model;

import java.io.Serializable;
import java.util.Date;

public class ReportData implements Serializable {

    private static final long serialVersionUID = 2789198967473633902L;
    private Long id;
    private Boolean active;
    private String reportName;
    private String value;
    private Date startDate;
    private Date endDate;

    public ReportData(Long id, Boolean active, String reportName, String value, Date startDate, Date endDate) {
        this.id = id;
        this.active = active;
        this.reportName = reportName;
        this.value = value;
        this.startDate = startDate;
        this.endDate = endDate;
    }

    // all the getters and setters
}


Nothing too complex in the code above. It is just an object that implements Serializable and contains attributes of a few different types (Long, String, Boolean and Date). This class will work nicely to help demonstrate Hazelcast's distributed query API and syntax. I omitted the getters and setters for brevity.

// get a "ReportData" distributed map
// (IMap, rather than plain Map, exposes the values(Predicate) query methods)
IMap<Long, ReportData> reportDataMap = Hazelcast.getMap("ReportData");

// create a ReportData object
ReportData reportData = new ReportData(...);

// put it into our Hazelcast Distributed Map
reportDataMap.put(reportData.getId(), reportData);

In the test code, I created ~50,000 ReportData objects in a for loop and put them into the "ReportData" distributed map. I used the loop index, 0..50,000, as each ReportData's id and set the reportName to "Report " + index. I also varied a few other fields so that several different dates are represented among the map's entries. Check out the demo project for more detail.
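The population loop looks roughly like this (the exact values for the Boolean, String and Date fields in the demo project differ; the ones below are illustrative):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.IMap;
import java.util.Date;
import org.axiomaticit.model.ReportData;

public class PopulateReportData {
    public static void main(String[] args) {
        // get (or lazily create) the "ReportData" distributed map
        IMap<Long, ReportData> reportDataMap = Hazelcast.getMap("ReportData");

        for (long i = 0; i < 50000; i++) {
            Date start = new Date();
            Date end = new Date(start.getTime() + 86400000L); // one day later

            // id = loop index, reportName = "Report " + index
            ReportData reportData =
                    new ReportData(i, i % 2 == 0, "Report " + i, "value" + i, start, end);
            reportDataMap.put(reportData.getId(), reportData);
        }
    }
}
```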

Collection<ReportData> reportDataSet = reportDataMap.values(new SqlPredicate("active AND id > 990 AND reportName = 'Report 995'"));

The above code queries the distributed map for all ReportData objects where active is equal to true, the id is greater than 990 and the reportName is equal to "Report 995".
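SqlPredicate also understands a few other SQL-like operators, including BETWEEN, IN and LIKE. A short sketch (the class name and query values here are mine, not from the demo project):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.IMap;
import com.hazelcast.query.SqlPredicate;
import java.util.Collection;
import org.axiomaticit.model.ReportData;

public class MoreQueryExamples {
    public static void main(String[] args) {
        IMap<Long, ReportData> reportDataMap = Hazelcast.getMap("ReportData");

        // LIKE uses % as a multi-character wildcard
        Collection<ReportData> likeMatches =
                reportDataMap.values(new SqlPredicate("reportName LIKE 'Report 99%'"));

        // BETWEEN is inclusive on both ends
        Collection<ReportData> rangeMatches =
                reportDataMap.values(new SqlPredicate("id BETWEEN 100 AND 200"));

        // IN matches any value in the list
        Collection<ReportData> inMatches =
                reportDataMap.values(new SqlPredicate("id IN (10, 20, 30)"));
    }
}
```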

Below, reportDataSet will contain all ReportData entries where active is equal to true and id is greater than 49985.

Collection<ReportData> reportDataSet = reportDataMap.values(new SqlPredicate("active AND id > 49985"));

Below, we have a case where we build the predicate programmatically using an EntryObject to fetch all ReportData where the id is greater than 49900 and the endDate attribute of ReportData falls between two dates, startDate and endDate. I included the code below to show how I create the dates used in the predicate that eventually gets passed into the reportDataMap.values(predicate) method.

// note: Calendar months are zero-based, so 3 = April
Calendar calendar1 = Calendar.getInstance();
calendar1.set(2010, 3, 1);
Calendar calendar2 = Calendar.getInstance();
calendar2.set(2010, 3, 30); // April has only 30 days

Date startDate = new Date(calendar1.getTimeInMillis());
Date endDate = new Date(calendar2.getTimeInMillis());

EntryObject e = new PredicateBuilder().getEntryObject();
Predicate predicate = e.get("id").greaterThan(49900L).and(e.get("endDate").between(startDate, endDate));

Collection<ReportData> reportDataSet = reportDataMap.values(predicate);

Getting data from your Hazelcast distributed map using the distributed query API and query syntax is pretty straightforward. Most of these queries ran in about 500 milliseconds to 2 seconds in my IDE. The power and performance come from the ability to query objects or map entries that are in memory rather than always relying on a round trip to your RDBMS. Distributed queries are an important feature that makes Hazelcast a great tool for offsetting the workload of your RDBMS. With Hazelcast and a good knowledge of your enterprise data, you can implement a simple and effective solution that will easily scale to as many Hazelcast nodes as your hardware can support. The demo project can be downloaded here. For more information, check out Hazelcast's website or visit the project's home at Google Code.