Sunday, April 7, 2013

Kettle (Pentaho Data Integration): Connecting to Google Cloud SQL

We were playing around Google Cloud SQL and wanted to see if we can migrate some of our MySQL database data onto the cloud. We opted to use Pentaho Data Integration (aka Kettle) to do some ETL. We were surprised to see that Kettle could not connect properly to Google Cloud SQL (even if Google says it's running MySQL 5.5). So, we had to dig deeper. Here's what we found out.

We tried using the MySQL connection type. But we could not get it to use the correct URL. We also found out the following::

  1. Using a MySQL connection type does not allow a custom URL. It was prefixing the URL with "jdbc:mysql://". We needed to use a URL that starts with "jdbc:google:rdbms://".
  2. Using a "Generic database" connection type allows Kettle to connect successfully. But it still fails in retrieving meta information about tables/columns/datatypes.
  3. Kettle optimizes by using meta data from a JDBC prepared statement (not from the result set). Unfortunately, calling PreparedStatement.getMetaData on Google Cloud SQL JDBC driver always returns null (even after the statement is executed). This prevents Kettle from proceeding to display the table in its UI.

To solve the above problems, we decided to create our own database plugin.

Creating a Database Plugin for Kettle

To create a database plugin for Kettle, we need to implement the DatabaseInterface. Here's what we did.

We extended from the original MySQLDatabaseMeta class to get the default behavior of a MySQL connection type. Then we override a couple of methods to achieve behavior that is specific to Google Cloud SQL. We had to specify a URL that starts with "jdbc:google:rdbms", and com.google.cloud.sql.Driver as the driver class.

// the annotation allows PDI to recognize this class as a database plug-in 
@DatabaseMetaPlugin(
    type="GoogleCloudSqlJdbc", // A globally unique ID for database plugin
    typeDescription="Google Cloud SQL" // The label to use in the database dialog
)
public class GoogleCloudSqlDatabaseMeta extends MySQLDatabaseMeta implements
        DatabaseInterface {

    @Override
    public String getDriverClass() {
        return "com.google.cloud.sql.Driver";
    }

    /**
     * As of Google App Engine SDK 1.7.4, the driver does not
     * support metadata retrieval with prepared statements.
     */
    @Override
    public boolean supportsPreparedStatementMetadataRetrieval() {
        return false;
    }

    /**
     * As of Google App Engine SDK 1.7.4, the driver only supports
     * result set metadata (which requires the execution of the
     * statement).
     */
    @Override
    public boolean supportsResultSetMetadataRetrievalOnly() {
        return true;
    }

    /**
     * Due to code that is specific to {@link MySQLDatabaseMeta},
     * a fetch size of {@link Integer#MIN_VALUE} causes an error
     * with the Google Cloud SQL driver. Thus, this is turned off.
     */
    @Override
    public boolean isFetchSizeSupported() {
        return false;
    }

    /**
     * As of Google App Engine SDK 1.7.4, the driver is not supporting
     * {@link Statement#setMaxRows(int)}.
     */
    @Override
    public boolean supportsSetMaxRows() {
        return false;
    }

    @Override
    public String getURL(String hostname, String port, String databaseName) {
        // port number is actually ignored
        return "jdbc:google:rdbms://" + hostname + "/" + databaseName;
    }

    @Override
    public String[] getUsedLibraries() {
        return new String[] { "google_sql.jar" };
    }

    @Override
    public int[] getAccessTypeList() {
        return new int[] { DatabaseMeta.TYPE_ACCESS_NATIVE };
    }

}

While tinkering with Kettle code, we found out that it has MySQLDatabaseMeta-specific code that sets a fetch size of Integer.MIN_VALUE. Something along the lines of:

    if (databaseMeta.getDatabaseInterface() instanceof MySQLDatabaseMeta
            && databaseMeta.isStreamingResults()) {
        sel_stmt.setFetchSize(Integer.MIN_VALUE);
    }
Because of the above, the Google Cloud SQL driver throws an exception. We had to disable the setting of fetch size.

Deploying the Database Plugin to Kettle

This was tested on version 4.4 of Kettle. The resulting JAR file is copied to the plugins/databases folder of your Kettle installation (e.g. /Applications/pentaho/design-tools/data-integration). The JDBC driver from Google Cloud SQL can be downloaded from http://dl.google.com/cloudsql/tools/google_sql_tool.zip, or if you already have the App Engine SDK for Java (90+ MB) (look for the google_sql.jar file under the lib/impl folder). Copy the driver to the libext/JDBC folder of your Kettle installation.

After deploying, restart Spoon (the GUI) and you should be able to see the "Google Cloud SQL" connection type.

Kettle/Spoon Database Connection - Google Cloud SQL

Acknowledgements

Thanks to Karl Benavidez, Jose Araneta, and the awesome developers at Orange and Bronze Software Labs for making this possible.

Let us know if this works for you.

No comments:

Post a Comment