Jan. 24, 2019

Test Generation in Java - an Overview

The time has come to summarize my research on test generation in Java. One year ago I started with a presentation, “Testgeneration in Java”, at several conferences, and now I can present the essence of these talks.

The Purpose of Test Generation

When I write software, I write tests! Ideally before writing the code, often after writing the code, and in very few cases after delivering the program. I do not use test code generators. Why?

I think that test generators are not an appropriate replacement for thoroughly written unit tests, component tests, or integration tests. The scenarios that are suited to test generators are very specific.

I would characterize one scenario that is well suited to test generation like this:

  • large, grown code base
  • non-trivial complexity
  • low initial code coverage by tests
  • new requirements
  • time pressure

This scenario is the classic legacy code scenario. Sustainably adjusting legacy code means that this code has to be refactorable (to understand the code, to find the critical spot where to insert a new feature). Refactoring requires the code to be thoroughly tested. And time pressure means that there is almost no time to write tests or to refactor.

This means that test generation has the role of bootstrapping a legacy code project to a first level of testability, provided that the generated tests cover the code extensively.

Test Generation and Program Analysis - Dynamic, Static, None?

The subject of test generation is covered by three different approaches:

  • No Program Analysis: These approaches have very few prerequisites. The types and method signatures have to be resolved, yet there is no need to analyze the data flow or control flow of the method under test. Typically, code coverage is achieved by combining expressions using the tested API in a syntactically correct and heuristically promising way. Since the theory behind this approach is quite simple, there are many opportunities to optimize the results (e.g. with artificial intelligence), which makes the approach more scalable. The downside is that essential information about the program cannot be used at test generation time.
  • Static Program Analysis: These approaches analyze the data flow and control flow of a program under test. Static analysis has many advantages: it extracts knowledge about the analyzed program that is usable at test generation time, and it does not depend on user interactions. Unfortunately, some problems are not solvable with static analysis, and some solutions are too generic to be of practical use.
  • Dynamic Program Analysis: This approach analyzes execution traces instead of static data flow or control flow. Dynamic program analysis is applied at runtime and is therefore close to the real usage profile of the tested software. The downside is that dynamic program analysis cannot derive generic rules about the program, only specific observations of its current behavior. Furthermore, dynamic analysis depends on users interacting with the software under test.

A comprehensive overview of the approaches to test generation - summarized by scientists - can be found in the paper An Orchestrated Survey on Automated Software Test Case Generation (multiple authors, unfortunately nothing about dynamic analysis). A maintained list of tools on the subject can be found at Code-based test generation (by Zoltán Micskei). The following evaluation gives a more detailed overview of the existing tools:

Tools without Program Analysis

The two most popular open source test generation tools are tools without program analysis.

Randoop generates a large set of sequences based on semantically correct operations on a given API. For each sequence, all literal-valued intermediate results (i.e. primitive types and strings) are captured. At test time these sequences are replayed, and it is verified that all intermediate results are equal to the captured ones. State changes caused by a method are not recognized directly (the code is not analyzed for effects); instead the approach relies on the idea that state changes are reflected in changed intermediate results (and at least in theory, with large sets of generated sequences, this is correct). Randoop is stable and the generated code is reliably compilable. The quality of the generated tests is rather poor (hardly readable, often trivial, much redundancy). Many scientific papers compare their new approach with Randoop, probably because Randoop is reliably stable (most scientific approaches are not) and easy to beat.
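To illustrate the idea, here is a hand-written sketch (not actual Randoop output) of what such a sequence test boils down to: an operation sequence whose literal-valued intermediate results, captured at generation time, become assertions at replay time.

```java
import java.util.ArrayList;

// Hand-written sketch in the style of a generated sequence test (not real
// Randoop output): operations on java.util.ArrayList with literal-valued
// intermediate results captured as expected values.
public class SequenceSketch {
    public static void main(String[] args) {
        ArrayList<String> list = new ArrayList<>();
        boolean added = list.add("hi");   // captured at generation time: true
        int size = list.size();           // captured at generation time: 1
        String first = list.get(0);       // captured at generation time: "hi"
        // At replay time the captured literals become assertions:
        if (!added || size != 1 || !"hi".equals(first)) {
            throw new AssertionError("intermediate results changed");
        }
        System.out.println("replayed: " + added + " " + size + " " + first);
    }
}
```

A state change in `add` that does not show up in any captured literal would go unnoticed by such a test, which is exactly the limitation described above.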

The source code is well structured, the API is clear and usable. The unit tests of the Randoop project itself are of low quality: many are commented out, unfocused, and the purpose of a test is often not transparent to the reader.

Evosuite is also based on the generation of randomly arranged operations on a given API, and like Randoop it captures literal-valued intermediate results. The sequences are subsequently filtered and optimized - to reduce redundancy and improve readability. This is done by genetic algorithms that optimize a set of initial sequences towards a quality goal (e.g. high code coverage) and shorten each sequence. Consequently the generated tests are far better than randomly generated sequences. Evosuite is quite stable compared to other scientific prototypes, well documented and ready to use. Evosuite is, deservedly, the most popular test generation tool among all scientific approaches. Yet I fear that Evosuite's features are already at their limits.
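The search-based idea can be sketched in a few lines. The following toy is my own illustration, not Evosuite's API: it replaces the genetic operators with a naive widening mutation, but keeps the essential loop of mutating candidates until a coverage goal is reached.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy sketch of search-based test generation (not Evosuite's real API).
public class SearchSketch {
    // Hypothetical method under test with three branch outcomes to cover.
    static int branch(int x) {
        if (x < 0) return 0;
        if (x > 100) return 2;
        return 1;
    }

    public static void main(String[] args) {
        Set<Integer> covered = new TreeSet<>();
        int candidate = 50;   // arbitrary starting input
        int step = 1;
        // Mutate the candidate with growing deltas until all three branch
        // outcomes have been observed. Evosuite replaces this naive widening
        // with genetic operators over whole test sequences, guided by a
        // coverage-based fitness function.
        while (covered.size() < 3) {
            covered.add(branch(candidate + step));
            covered.add(branch(candidate - step));
            step *= 2;
        }
        System.out.println("covered branch outcomes: " + covered); // [0, 1, 2]
    }
}
```

The point of the sketch is the feedback loop: each mutation is scored against the coverage goal, and search continues only as long as the goal is unmet.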

The source code reveals a quite complex architecture. Unfortunately Evosuite is rather monolithic; it is very hard to reach the internal API, which is clearly not designed for third-party usage. The quality of the source code is moderate. There are few unit tests, and most existing tests are more or less golden-master tests (verifying output for a given input).

Tools with Static Program Analysis

The main approaches using static program analysis are Symbolic Execution and Concolic Execution (Concrete & Symbolic Execution). Research is quite active on this subject, yet the implementations are not really promising.
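To make the terms concrete, here is a minimal hand-made example (no real tool involved) of what such a tool has to do: a symbolic executor treats the input as a symbol, collects one path condition per branch, and asks a constraint solver for concrete inputs satisfying each condition.

```java
// Minimal hand-made example: a symbolic executor would derive the path
// conditions x > 10 and x <= 10 for classify and ask a solver for concrete
// inputs covering both paths, e.g. x = 11 and x = 3.
public class PathConditions {
    static String classify(int x) {
        if (x > 10) {
            return "big";   // path condition: x > 10
        }
        return "small";     // path condition: x <= 10
    }

    public static void main(String[] args) {
        System.out.println(classify(11) + " " + classify(3)); // prints: big small
    }
}
```

A concolic executor would additionally run the method with concrete values while tracking the symbolic conditions, negating one condition at a time to steer execution down the unexplored path.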

Symbolic Pathfinder is based on the open source JVM Java Pathfinder and extends it with symbolic execution. The documentation of how this is meant to work is readable and comprehensible. Yet the implementation was halted in an early experimental state and has never really been continued since. The installation is difficult and many examples do not run out of the box. Other examples require external programs (solvers) to be installed. Some examples may never run, even if configuration and setup are entirely correct.

The source code is full of comments about what is not done yet and creates the impression that this code was never actually released. From the viewpoint of a software craftsman the code is not clean. Many tests are not self-validating. I think this poor code quality will prevent progress in the long term.

Starfinder is a newer approach to adding symbolic execution to the open source JVM Java Pathfinder. Unfortunately, this project too is no more than a scientific prototype.

The source code seems more mature than that of Symbolic Pathfinder; there are fewer distracting comments and less commented-out code. Unfortunately there are also few tests that ensure the intended behavior of the tool. There is almost no automated testing at the unit level; most tests fall into the golden-master category. At least, for most tests the purpose is clear. I suspect, however, that this project will not be reused in later developments.

CATG/Janala2 is a tool for concolic testing of Java code. The documentation of this tool is thin; it is hard even to find out what CATG actually does (this is probably better explained in the referencing scientific papers). CATG's purpose seems to be generating input data for tests. The installation is very hard and deviates from the documentation (wrong versions, wrong artifact names, …). Overall it is of limited use.

The source code is comparatively clean, but there are effectively no tests. The fact that this code has not been modified or updated for many years suggests that it is no longer being developed.

KeY is actually a tool for the verification of Java programs. There is a spin-off based on the idea that specifications can be used to generate tests. Installing KeY with test generation in mind is not easy; the documentation is misleading and incomplete. Furthermore, test generation requires the Z3 solver to be installed on the system. All features only work with Java versions 1.4 and lower (no generics). The generated tests are unreadable, often not compilable for non-trivial programs, and even after manual rework not really comprehensible.

In general, the source code is organized in an ordered and quite clean way. There are more unit tests and component tests than in comparable projects. On the downside, the code ships with binary dependencies that are not available as source code (e.g. the parser that limits the system to Java 1.4).

Also worth mentioning are the tools JPET, JTest and AgitarOne, which look quite promising in their marketing videos. Yet none of the three is available to the public (JPET is free, but only as a binary; JTest and AgitarOne are proprietary), and we did not have the opportunity to test them in real projects.

Tools with Dynamic Program Analysis

The main approach with dynamic analysis is capture-replay: the state before and after a method invocation is captured at runtime and later verified in regression tests.

Testrecorder is a tool based on capture-replay. Methods of interest are annotated with @Recorded, and capturing starts when the program is launched with the Testrecorder agent. The tool has a modular architecture: different parts of the program (serialization, code generation) can be used independently of the others, even without the agent. Since many testing scenarios are not predictable, the tool's design offers many points for customization and extension by the user, which can be used if the standard methods for serialization or test generation are not sufficient.
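A minimal usage sketch, with the caveat that the @Recorded annotation below is a local placeholder so the snippet stays self-contained: the real annotation is provided by the Testrecorder library, and capturing only happens when the JVM runs with the Testrecorder agent attached.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local placeholder standing in for Testrecorder's @Recorded annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Recorded {
}

public class FareCalculator {

    // With the Testrecorder agent attached, each invocation of this method
    // would be captured (arguments, result, relevant state) and could later
    // be rendered as a regression test.
    @Recorded
    public int fare(int zones) {
        return zones <= 1 ? 270 : 270 + (zones - 1) * 120;
    }

    public static void main(String[] args) {
        System.out.println(new FareCalculator().fare(3)); // prints 510
    }
}
```

Every observed call, e.g. fare(3) returning 510, becomes a candidate regression test that replays the captured input and asserts the captured output.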

The source code is complex yet clean (no comments, 90% code coverage by unit tests and integration tests). The API is only rudimentarily documented, and it is not yet decided which methods will be part of the public API.

An earlier approach to this subject was ThOR, which could generate test data for limited scenarios (Java Beans), but it has not been updated recently.