Wednesday, January 24, 2018

JasperReports: The Tricky Parts

If you have been programming in Java long enough, chances are you needed to generate reports for business users. In my case, I've seen several projects use JasperReports® Library to generate reports in PDF and other file formats. Recently, I've had the privilege of observing Mike and his team use the said reporting library and the challenges they faced.

JasperReports in a Nutshell

In a nutshell, generating reports using JasperReports (JR) involves three steps:

  1. Load compiled report (i.e. load a JasperReport object)
  2. Run report by filling it with data (results in a JasperPrint object)
  3. Export filled report to a file (e.g. use JRPdfExporter to export to PDF)

In Java code, it looks something like this.

JasperReport compiledReport = JasperCompileManager.compileReport(...);
Map<String, Object> parameters = ...;
java.sql.Connection connection = dataSource.getConnection();
try {
    JasperPrint filledReport = JasperFillManager.fillReport(
            compiledReport, parameters, connection);
    JasperExportManager.exportReportToPdfFile(
            filledReport, "report.pdf");
} finally {
    connection.close();
}
Thanks to the facade classes, this looks simple enough. But looks can be deceiving!

Given the above code snippet (and the outlined three steps), which part do you think takes the most time and memory? (Sounds like an interview question.)

If you answered (#2) filling with data, you're correct! If you answered #3, you're also correct, since #3 is proportional to #2.

IMHO, most online tutorials only show the easy parts. In the case of JR, there seems to be a lack of discussion on the more difficult and tricky parts. Here, with Mike's team, we encountered two difficulties: out of memory errors, and long running reports. What made these difficulties particularly memorable was that they only showed up during production (not during development). I hope that by sharing them, they can be avoided in the future.

Out of Memory Errors

The first challenge was reports running out of memory. During development, the test data we use to run the report is usually too small compared to real operating data. So, design for that.

In our case, all reports were run with a JRVirtualizer. This way, it will flush to disk/file when the maximum number of pages/objects in memory has been reached.

During the process, we also learned that the virtualizer needs to be cleaned up. Otherwise, there will be several temporary files lying around. And we can only clean up these temporary files after the report has been exported to a file.

Map<String, Object> parameters = ...;
JRVirtualizer virtualizer = new JRFileVirtualizer(100);
try {
    parameters.put(JRParameter.REPORT_VIRTUALIZER, virtualizer);
    JasperPrint filledReport = JasperFillManager.fillReport(
            compiledReport, parameters, ...);
    // cannot clean up the virtualizer at this point
    JasperExportManager.exportReportToPdfFile(filledReport, ...);
} finally {
    virtualizer.cleanup();
}

For more information, please see Virtualizer Sample - JasperReports.

Note that JR is not always the culprit when we encounter out-of-memory errors while running reports. Sometimes, we would encounter an out-of-memory error even before JR was used. We saw how JPA can be misused to load the entire dataset for the report (Query.getResultList() and TypedQuery.getResultList()). Again, the error does not show up during development since the dataset is still small. But when the dataset is too large to fit in memory, we get out-of-memory errors. We opted to avoid using JPA for generating reports. I guess we'll just have to wait until JPA 2.2's Query.getResultStream() becomes available. I wish JPA's Query.getResultList() returned Iterable instead. That way, one entity can be mapped at a time, instead of the entire result set.

For now, avoid loading the entire dataset. Load one record at a time. In the process, we went back to good ol' JDBC. Good thing JR uses ResultSets well.
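
Here's a minimal sketch of that idea, reusing compiledReport and parameters from the earlier snippet (the query and the fetch size of 100 are placeholders of ours): plain JDBC with a fetch-size hint, and a JRResultSetDataSource so that JR maps one row at a time.

// Sketch only: stream rows with plain JDBC instead of pre-loading them all.
try (Connection connection = dataSource.getConnection();
     PreparedStatement statement = connection.prepareStatement(
             "SELECT ... FROM ...")) { // placeholder query
    statement.setFetchSize(100); // hint the driver to stream rows
    try (ResultSet resultSet = statement.executeQuery()) {
        JRDataSource reportData = new JRResultSetDataSource(resultSet);
        JasperPrint filledReport = JasperFillManager.fillReport(
                compiledReport, parameters, reportData);
        // export as before
    }
}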

Long Running Reports

The second challenge was long running reports. Again, this probably doesn't happen during development. During development, a report that runs for about 10 seconds is already considered long. But with real operating data, it can run for 5-10 minutes. This is especially painful when the report is being generated upon an HTTP request. If the report can start writing to the response output stream within the timeout period (usually 60 seconds or up to 5 minutes), then it has a good chance of being received by the requesting user (usually via a browser). But if it takes more than 5 minutes to fill the report and another 8 minutes to export it to a file, then the user will just see a timed-out HTTP request, and file it as a bug. Sound familiar?

Keep in mind that reports can run for a few minutes. So, design for that.

In our case, we launch reports on a separate thread. For reports that are triggered with an HTTP request, we respond with a page that contains a link to the generated report. This avoids the time-out problem. When the user clicks on this link and the report is not yet complete, s/he will see that the report is still being generated. But when the report is completed, s/he will be able to see the generated report file.

ExecutorService executorService = ...;
Future<?> future = executorService.submit(() -> {
    Map<String, Object> parameters = ...;
    try {
        JasperPrint filledReport = JasperFillManager.fillReport(
                compiledReport, parameters, ...);
        JasperExportManager.exportReportToPdfFile(filledReport, ...);
    } finally {
        // clean up (e.g. virtualizer, connection)
    }
});

We also had to add the ability to stop/cancel a running report. Good thing JR has code that checks for Thread.interrupted(). So, simply interrupting the thread will make it stop. Of course, you'll need to write some tests to verify (expect JRFillInterruptedException and ExportInterruptedException).
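
For instance, if the fill/export was submitted to the executor service as above, cancellation can be sketched like this (variable names are ours):

// Sketch: cancel(true) interrupts the worker thread; JR notices the
// interruption and stops filling/exporting the report.
Future<?> runningReport = executorService.submit(() -> {
    // fill and export as shown above
});
// ... later, when the user asks to stop the report
runningReport.cancel(true); // true = interrupt if running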

And while we were at it, we rediscovered ways to add "listeners" to the report generation (e.g. FillListener and JRExportProgressMonitor) and provide the user some progress information.

We also created utility test classes to generate large amounts of data by repeating a given piece of data over and over. This is useful to help the rest of the team develop JR applications that are designed for handling long runs and out-of-memory errors.
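
Such a utility can be as simple as a JRDataSource that repeats a single record a given number of times. Here's a sketch (the class and its names are ours, not part of JR):

// Sketch of a test utility: repeats one record over and over.
public class RepeatingDataSource implements JRDataSource {

    private final Map<String, ?> record;
    private final int repeatCount;
    private int current;

    public RepeatingDataSource(Map<String, ?> record, int repeatCount) {
        this.record = record;
        this.repeatCount = repeatCount;
    }

    @Override
    public boolean next() {
        return current++ < repeatCount;
    }

    @Override
    public Object getFieldValue(JRField field) {
        return record.get(field.getName());
    }
}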

Further Design Considerations

Another thing to consider is the opening and closing of the resource needed when filling the report. This could be a JDBC connection, a Hibernate session, a JPA EntityManager, or a file input stream (e.g. CSV, XML). Illustrated below is a rough sketch of my design considerations.

1. Compiling
         - - - - - - - - - - - - - -\
         - - - -\                    \
2. Filling       > open-close         \
         - - - -/   resource           > swap to file
3. Exporting                         /
         - - - - - - - - - - - - - -/

We want to isolate #2 and define decorators that would open the resource, fill the report, and close the opened resource in a finally block. The resource that is opened may depend on the <queryString> element (if present) inside the report. In some cases, where there is no <queryString> element, there is probably no need to open a resource.

<queryString language="hql">
    <![CDATA[ ... ]]>
</queryString>

<queryString language="csv">
    <![CDATA[ ... ]]>
</queryString>
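
For example, a decorator could inspect the compiled report's query language to decide which resource to open. A rough sketch (the branching is simplified):

// Sketch: choose the resource to open based on the report's query language.
JRQuery query = compiledReport.getQuery();
String language = (query != null) ? query.getLanguage() : null;
if ("hql".equals(language)) {
    // open a Hibernate session and pass it via the appropriate parameter
} else if ("csv".equals(language)) {
    // open a file/input stream and wrap it in a CSV data source
} else if (language == null) {
    // no <queryString>: probably nothing to open
}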

Furthermore, we also want to combine #2 and #3 as one abstraction. This single abstraction makes it easier to decorate with enhancements, like flushing the created page objects to files, and load them back during exporting. As mentioned, this is what the JRVirtualizer does. But we'd like a design where this is transparent to the object(s) using the combined-#2-and-#3 abstraction.
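
Here's a rough sketch of that combined abstraction, with one decorator that opens and closes a JDBC connection around it (the interface and class names are ours, not from JR):

// Sketch: #2 (fill) and #3 (export) behind one abstraction...
public interface ReportGenerator {
    void generate(JasperReport report,
            Map<String, Object> parameters, OutputStream out) throws JRException;
}

// ...decorated to open and close the JDBC connection around the fill.
public class WithConnectionReportGenerator implements ReportGenerator {

    private final DataSource dataSource;
    private final ReportGenerator decorated;

    public WithConnectionReportGenerator(
            DataSource dataSource, ReportGenerator decorated) {
        this.dataSource = dataSource;
        this.decorated = decorated;
    }

    @Override
    public void generate(JasperReport report,
            Map<String, Object> parameters, OutputStream out) throws JRException {
        try (Connection connection = dataSource.getConnection()) {
            parameters.put(JRParameter.REPORT_CONNECTION, connection);
            decorated.generate(report, parameters, out);
        } catch (SQLException e) {
            throw new JRException(e);
        }
    }
}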


That's all for now. Again, thanks to Mike and his team for sharing their experiences. Yup, he's the same guy who donates his app's earnings to charity. Also, thanks to Claire for the ideas on testing by repeating a given data again and again. The relevant pieces of code can be found on GitHub.

Tuesday, January 2, 2018

DataSource Routing with Spring @Transactional

I was inspired by Carl Papa's use of aspects with the Spring Framework to determine the DataSource to use (either read-write or read-only). So, I'm writing this post.

I must admit that I have long been familiar with Spring's AbstractRoutingDataSource. But I did not have a good idea where it can be used. Thanks to Carl and team, and one of their projects. Now, I know a good use case.


With Spring, read-only transactions are typically marked with annotations.

public class ... {
    @Transactional(readOnly = true)
    public void ...() {...}

    @Transactional // read-write
    public void ...() {...}
}

To take advantage of this, we use Spring's TransactionSynchronizationManager to determine if the current transaction is read-only or not.


Here, we use Spring's AbstractRoutingDataSource to route to the read-only replica if the current transaction is read-only. Otherwise, it routes to the default, which is the master.

public class ... extends AbstractRoutingDataSource {
    @Override
    protected Object determineCurrentLookupKey() {
        if (TransactionSynchronizationManager
                .isCurrentTransactionReadOnly() ...) {
            // return key to a replica
        }
        return null; // use default
    }
}

Upon using the above approach, we found out that the TransactionSynchronizationManager is one step behind because Spring will have already called DataSource.getConnection() before a synchronization is established. Thus, a LazyConnectionDataSourceProxy needs to be configured as well.
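
The resulting wiring looks roughly like this (a sketch; ReadOnlyRoutingDataSource stands for the AbstractRoutingDataSource subclass above, and masterDataSource()/replicaDataSource() are placeholder factory methods of ours):

@Bean
public DataSource dataSource() {
    // the AbstractRoutingDataSource subclass shown above
    AbstractRoutingDataSource routingDataSource = new ReadOnlyRoutingDataSource();
    Map<Object, Object> targetDataSources = new HashMap<>();
    targetDataSources.put("replica", replicaDataSource()); // placeholder
    routingDataSource.setTargetDataSources(targetDataSources);
    routingDataSource.setDefaultTargetDataSource(masterDataSource()); // placeholder
    routingDataSource.afterPropertiesSet();
    // defer getConnection() until a statement is actually executed,
    // i.e. after the transaction synchronization is established
    return new LazyConnectionDataSourceProxy(routingDataSource);
}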

As we were discussing this, we wondered if there was another way to determine whether the current transaction is read-only or not (without resorting to LazyConnectionDataSourceProxy). So, we came up with an experimental approach where an aspect captures the TransactionDefinition (from the @Transactional annotation, if any) as a thread-local variable, and an AbstractRoutingDataSource routes based on the captured information.
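
Here's a rough sketch of that experiment (all names are ours; the actual code is in the GitHub repository linked below):

// Sketch: capture the readOnly attribute of @Transactional in a thread-local
// before the transaction (and its connection) is created.
@Aspect
@Order(Ordered.HIGHEST_PRECEDENCE) // run before the transaction interceptor
public class TransactionalCaptureAspect {

    @Around("@annotation(transactional)")
    public Object capture(ProceedingJoinPoint pjp, Transactional transactional)
            throws Throwable {
        ReadOnlyContext.set(transactional.readOnly());
        try {
            return pjp.proceed();
        } finally {
            ReadOnlyContext.clear();
        }
    }
}

// Thread-local holder read by the routing data source.
public final class ReadOnlyContext {
    private static final ThreadLocal<Boolean> READ_ONLY = new ThreadLocal<>();
    public static void set(boolean readOnly) { READ_ONLY.set(readOnly); }
    public static boolean isReadOnly() { return Boolean.TRUE.equals(READ_ONLY.get()); }
    public static void clear() { READ_ONLY.remove(); }
}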

The relevant source code can be found on GitHub. Thanks again, Carl! BTW, Carl is also an award-winning movie director. Wow, talent definitely knows no boundaries.

Monday, April 24, 2017

Apache Spark RDD and Java Streams

A few months ago, I was fortunate enough to participate in a few PoCs (proof-of-concepts) that used Apache Spark. There, I got the chance to use resilient distributed datasets (RDDs for short), transformations, and actions.

After a few days, I realized that while Apache Spark and the JDK are very different platforms, there are similarities between RDD transformations and actions, and stream intermediate and terminal operations. I think these similarities can help beginners (like me *grin*) get started with Apache Spark.

Java Stream              Apache Spark RDD
Intermediate operation   Transformation
Terminal operation       Action

Java Streams

Let's start with streams. Java 8 was released sometime in 2014. Arguably, the most significant feature it brought is the Streams API (or simply streams).

Once a Stream is created, it provides many operations that can be grouped in two categories:

  • intermediate,
  • and terminal.

Intermediate operations return a stream from the previous one. These intermediate operations can be connected together to form a pipeline. Terminal operations, on the other hand, close the stream pipeline and return a result.

Here's an example.

Stream.of(1, 2, 3)
        .peek(n -> System.out.println("Peeked at: " + n))
        .map(n -> n*n)
        .forEach(System.out::println);

When the above example is run, it generates the following output:

Peeked at: 1
1
Peeked at: 2
4
Peeked at: 3
9

Intermediate operations are lazy. The actual execution does not start until the terminal operation is encountered. The terminal operation in this case is forEach(). That's why we do not see the following.

Peeked at: 1
Peeked at: 2
Peeked at: 3
1
4
9

Instead, what we see is that the operations peek(), map(), and forEach() have been joined to form a pipeline. In each pass, the static of() operation returns one element from the specified values. Then the pipeline is invoked: peek() prints the string "Peeked at: 1", followed by map(), and terminated by forEach() which prints the number "1". Then another pass starts with of() returning the next element from the specified values, followed by peek() and map(), and so on.

Executing an intermediate operation such as peek() does not actually perform any peeking, but instead creates a new stream that, when traversed, contains the same elements as the initial stream, but additionally performs the provided action.

Apache Spark RDD

Now, let's turn to Apache Spark. Spark's core abstraction for working with data is the resilient distributed dataset (RDD).

An RDD is simply a distributed collection of elements. In Spark all work is expressed as either creating new RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.

Once created, RDDs offer two types of operations:

  • transformations,
  • and actions.

Transformations construct a new RDD from a previous one. Actions, on the other hand, compute a result based on an RDD, and either return it to the driver program or save it to an external storage system (e.g., HDFS).

Here's an example with a rough equivalent using Java Streams.

SparkConf conf = new SparkConf().setAppName(...);
JavaSparkContext sc = new JavaSparkContext(conf);

List<Integer> squares = sc.parallelize(Arrays.asList(1, 2, 3))
        .map(n -> n*n)
        .collect();


// Rough equivalent using Java Streams
List<Integer> squares2 = Stream.of(1, 2, 3)
        .map(n -> n*n)
        .collect(Collectors.toList());


After setting up the Spark context, we call parallelize(), which creates an RDD from the given list of elements. map() is a transformation, and collect() is an action. Transformations, like intermediate stream operations in Java, are lazily evaluated. In this example, Spark will not begin to execute the function provided in a call to map() until it sees an action. This approach might seem unusual at first, but it makes a lot of sense when dealing with huge amounts of data (big data, in other words). It allows Spark to split up the work and do it in parallel.
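
To make the laziness visible, here's a small sketch (assuming local mode, so the printed markers show up in the console):

// Sketch: nothing is computed until an action is called.
JavaRDD<Integer> mapped = sc.parallelize(Arrays.asList(1, 2, 3))
        .map(n -> {
            System.out.println("Mapping: " + n); // visible in local mode
            return n * n;
        });
// No "Mapping: ..." lines yet; map() only recorded the transformation.
List<Integer> squares = mapped.collect(); // the mapping runs here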

Word Count Example

Let's use word count as an example. Here, we have two implementations: one uses Apache Spark, and the other uses Java Streams.

Here's the Java Stream version.

public class WordCountJava {

    private static final String REGEX = "\\s+";

    // uses static imports: Function.identity, Collectors.groupingBy, Collectors.counting
    public Map<String, Long> count(URI uri) throws IOException {
        return Files.lines(Paths.get(uri))
                .map(line -> line.split(REGEX))
                .flatMap(Arrays::stream)
                .map(word -> word.toLowerCase())
                .collect(groupingBy(
                        identity(), TreeMap::new, counting()));
    }
}


Here, we read the source file line by line, transforming each line into a sequence of words (via the map() intermediate operation). Since we have a sequence of words for each line, and we have many lines, we convert them into a single sequence of words using flatMap(). In the end, we group the words by their identity() (i.e. the identity of a string is the string itself) and count them.

When tested against a text file that contains the two lines:

The quick brown fox jumps over the lazy dog
The quick brown fox jumps over the lazy dog

It outputs the following map:

{brown=2, dog=2, fox=2, jumps=2, lazy=2, over=2, quick=2, the=4}

And now, here's the Spark version.

public class WordCountSpark {

    private static final String REGEX = "\\s+";

    public List<Tuple2<String, Long>> count(URI uri, JavaSparkContext sc) throws IOException {
        JavaRDD<String> input = sc.textFile(Paths.get(uri).toString());
        return input.flatMap(
                    line -> Arrays.asList(line.split(REGEX)).iterator())
                .map(word -> word.toLowerCase())
                .mapToPair(word -> new Tuple2<String, Long>(word, 1L))
                .reduceByKey((x, y) -> (Long) x + (Long) y)
                .sortByKey()
                .collect();
    }
}


When run against the same two-line text file, it outputs the following:

[(brown,2), (dog,2), (fox,2), (jumps,2), (lazy,2), (over,2), (quick,2), (the,4)]

The initial configuration of a JavaSparkContext has been excluded for brevity. We create a JavaRDD from a text file. It's worth mentioning that this initial RDD operates line by line on the text file. That's why we split each line into a sequence of words and flatMap() them. Then we transform each word into a key-value tuple with a count of one (1) for incremental counting. Once we have done that, we group the key-value tuples by word (reduceByKey()), and in the end we sort them in natural order.

In Closing

As shown, both implementations are similar. The Spark implementation requires more setup and configuration, and is more powerful. Learning about intermediate and terminal stream operations can help get a Java developer started with understanding Apache Spark.

Thanks to Krischelle, RB, and Juno, for letting me participate in the PoCs that used Apache Spark.