Sunday, April 7, 2013

Kettle (Pentaho Data Integration): Connecting to Google Cloud SQL

We were playing around Google Cloud SQL and wanted to see if we can migrate some of our MySQL database data onto the cloud. We opted to use Pentaho Data Integration (aka Kettle) to do some ETL. We were surprised to see that Kettle could not connect properly to Google Cloud SQL (even if Google says it's running MySQL 5.5). So, we had to dig deeper. Here's what we found out.

We tried using the MySQL connection type. But we could not get it to use the correct URL. We also found out the following::

  1. Using a MySQL connection type does not allow a custom URL. It was prefixing the URL with "jdbc:mysql://". We needed to use a URL that starts with "jdbc:google:rdbms://".
  2. Using a "Generic database" connection type allows Kettle to connect successfully. But it still fails in retrieving meta information about tables/columns/datatypes.
  3. Kettle optimizes by using meta data from a JDBC prepared statement (not from the result set). Unfortunately, calling PreparedStatement.getMetaData on Google Cloud SQL JDBC driver always returns null (even after the statement is executed). This prevents Kettle from proceeding to display the table in its UI.

To solve the above problems, we decided to create our own database plugin.

Creating a Database Plugin for Kettle

To create a database plugin for Kettle, we need to implement the DatabaseInterface. Here's what we did.

We extended from the original MySQLDatabaseMeta class to get the default behavior of a MySQL connection type. Then we override a couple of methods to achieve behavior that is specific to Google Cloud SQL. We had to specify a URL that starts with "jdbc:google:rdbms", and com.google.cloud.sql.Driver as the driver class.

// the annotation allows PDI to recognize this class as a database plug-in 
@DatabaseMetaPlugin(
    type="GoogleCloudSqlJdbc", // A globally unique ID for database plugin
    typeDescription="Google Cloud SQL" // The label to use in the database dialog
)
public class GoogleCloudSqlDatabaseMeta extends MySQLDatabaseMeta implements
        DatabaseInterface {

    @Override
    public String getDriverClass() {
        return "com.google.cloud.sql.Driver";
    }

    /**
     * As of Google App Engine SDK 1.7.4, the driver does not
     * support metadata retrieval with prepared statements.
     */
    @Override
    public boolean supportsPreparedStatementMetadataRetrieval() {
        return false;
    }

    /**
     * As of Google App Engine SDK 1.7.4, the driver only supports
     * result set metadata (which requires the execution of the
     * statement).
     */
    @Override
    public boolean supportsResultSetMetadataRetrievalOnly() {
        return true;
    }

    /**
     * Due to code that is specific to {@link MySQLDatabaseMeta},
     * a fetch size of {@link Integer#MIN_VALUE} causes an error
     * with the Google Cloud SQL driver. Thus, this is turned off.
     */
    @Override
    public boolean isFetchSizeSupported() {
        return false;
    }

    /**
     * As of Google App Engine SDK 1.7.4, the driver is not supporting
     * {@link Statement#setMaxRows(int)}.
     */
    @Override
    public boolean supportsSetMaxRows() {
        return false;
    }

    @Override
    public String getURL(String hostname, String port, String databaseName) {
        // port number is actually ignored
        return "jdbc:google:rdbms://" + hostname + "/" + databaseName;
    }

    @Override
    public String[] getUsedLibraries() {
        return new String[] { "google_sql.jar" };
    }

    @Override
    public int[] getAccessTypeList() {
        return new int[] { DatabaseMeta.TYPE_ACCESS_NATIVE };
    }

}

While tinkering with Kettle code, we found out that it has MySQLDatabaseMeta-specific code that sets a fetch size of Integer.MIN_VALUE. Something along the lines of:

    if (databaseMeta.getDatabaseInterface() instanceof MySQLDatabaseMeta
            && databaseMeta.isStreamingResults()) {
        sel_stmt.setFetchSize(Integer.MIN_VALUE);
    }
Because of the above, the Google Cloud SQL driver throws an exception. We had to disable the setting of fetch size.

Deploying the Database Plugin to Kettle

This was tested on version 4.4 of Kettle. The resulting JAR file is copied to the plugins/databases folder of your Kettle installation (e.g. /Applications/pentaho/design-tools/data-integration). The JDBC driver from Google Cloud SQL can be downloaded from http://dl.google.com/cloudsql/tools/google_sql_tool.zip, or if you already have the App Engine SDK for Java (90+ MB) (look for the google_sql.jar file under the lib/impl folder). Copy the driver to the libext/JDBC folder of your Kettle installation.

After deploying, restart Spoon (the GUI) and you should be able to see the "Google Cloud SQL" connection type.

Kettle/Spoon Database Connection - Google Cloud SQL

Acknowledgements

Thanks to Karl Benavidez, Jose Araneta, and the awesome developers at Orange and Bronze Software Labs for making this possible.

Let us know if this works for you.

Tuesday, April 2, 2013

JavaScript: A Love-Hate Relationship

I’ve been programming in Java for a greater part of my professional career. As web development got more powerful and demanding (and complex), I ended up dealing with client-side JavaScript more often than I wanted. In most of these cases, I’ve come to hate it. To avoid it, I considered alternatives like Adobe’s Flex and GWT.

Until about five (5) years ago, when I sat down and really started learning JavaScript, I started to love it. It felt like one of those new year’s resolutions that I failed to complete. Good thing, I faced projects that relied heavily on jQuery and Google Maps JavaScript API. With that, I was forced encouraged to dive in and learn JavaScript. So far, I’m glad I did.

Variables and Primitive Data Types

How many primitive data types does JavaScript have? Three: string, numeric, and boolean. It only has three primitive data types compared to the eight of Java. Wow! How simpler could it get?!

As it turns out, there are two more primitive data types: null and undefined are also considered primitive data types. Undefined is the values of a variable with no value. Variables can be emptied by setting the value to null.

Note that primitive data types are automatically wrapped as objects (e.g. a string literal is automatically wrapped as a String object). This makes them possess properties and methods.

var text = 'hello'; // or text = new String('hello');
var len = text.length; // returns a number
text.charAt(1); // returns 'e'

Native Objects

The JavaScript language provides some native objects: Array, Function, Date, Error, and Math. Objects can have properties and methods. Objects are key-value pairs. The code snippet below shows an object with five properties. Two of which are functions.

var myObject = {
  prop1: 'hello',
  prop2: false,
  prop3: new Date(),
  function1: function(param1, param2) {. . .},
  function2: function() {. . .}
};

// the above is equivalent to this:
var myObject2 = {
  'prop1': 'hello',
  'prop2': false,
  'prop3': new Date(),
  'function1': function(param1, param2) {. . .},
  'function2': function() {. . .}
};

// accessing the properties is like indexing a map
alert(myObject['prop1'] == myObject.prop1); // true

No Classes, Just Objects

As I dug deeper, I found it confusing that there are no classes, just objects. Coming from a Java background, I wanted to learn how to apply OOP. With that, I looked for creating private member variables and functions. JavaScript is a class-free, prototypal language. Instead of defining classes, and creating objects based on classes, it simply creates objects that inherit from (parent) objects that act as classes.

JavaScript is a prototype-based language.

Prototype-based programming is a style of object-oriented programming in which classes are not present, and behavior reuse (known as inheritance in class-based languages) is accomplished through a process of decorating existing objects which serve as prototypes. This model is also known as class-less, prototype-oriented, or instance-based programming.

JavaScript MDN

This is the part I found confusing. But after a while, it’s becoming clear. Declaring a class in JavaScript is as easy as defining a function.

function Student(name) {
  // This is the constructor
  // We can add properties (instance variables) to "this"
  this.name = name;
}

Since functions are objects in JavaScript, we can add properties and methods. In this case, we use the prototype property of a Function object to add methods.

Student.prototype.greet = function(greeting) {
  alert(greeting + ', my name is '
      + this.name);
};

Student.prototype.anotherMethod = function() {
};

We can create Student objects by calling the constructor with the new keyword.

var student1 = new Student('John Smith');
student1.greet('Hi'); // will display 'Hi, my name is John Smith'
alert(student1.name); // the name property is public

So, how can we provide private members to classes? In JavaScript, functions are used to provide scope and hide variables. For this, let’s change the public name property of the Student class to become private. We also add a public getter method to access the private name property.

function Student(name) {
  // This is the constructor
  // We can add public properties (instance variables) to "this"
  // this.name = name;
  var _name = name;
  this.getName = function() { return _name; };
}

This also changes the public member method in its reference to the private name property.

Student.prototype.greet = function(greeting) {
  alert(greeting + ', my name is '
      + this.getName());
};

Student.prototype.anotherMethod = function() {. . .};

Now that the name property is made private, the usage of Student class will also change.

var student1 = new Student('John Smith');
student1.greet('Hi'); // will display 'Hi, my name is John Smith'
alert(student1.name); // will return undefined
alert(student1.getName()); // will return 'John Smith'

Closures

JavaScript uses closures to provide access to variables that continue to exist even after the variables are out of scope. Let's take the following example from one of Douglas Crockford's talks.

var names = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'];
var digit_name = function(n) {
  return names[n];
};

In the above example, the names array is declared at the global scope. We can move names as a variable inside the function (and avoid name clashes with other functions). The result looks something like this:

// var names = [];
var digit_name = function(n) {
  var names = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'];
  return names[n];
};

But the above is slower, since the names array would have to be (re-)initialized every time the function is called. Here's where we can use a closure to allow the function to continue to have access to the names array variable even after it is out of scope.

var digit_name = (function() {
  var names = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'];

  // return a function
  return function(n) {
    // var names = [];
    return names[n];
  };
})();

Notice that the variable digit_name is assigned the result of a function call. To make it clearer, let’s try re-formatting the code. Hopefully, this makes it clearer.

var digit_name = (function() {...})();

The open-and-close parenthesis immediately calls the function. And that function returns a function.

var digit_name = (function() {... return function(n) {}; })();

Closures are also used to declare private members. In the Student class example, notice how the getName method still had access to the _name variable?

Understanding the Misunderstood

JavaScript is the world’s most misunderstood language largely because people aren’t taking the time to learn it. I hope that developers continue to take programming seriously.

It is the language that most people use without bothering to learn it first.

Programming is a complicated business. It should never be taken in ignorance.

Douglas Crockford - Senior JavaScript Architect, PayPal