A new study from researchers at multiple Chinese universities evaluates ChatGPT's ability to automatically generate unit tests, an important part of software testing. The results indicate that while ChatGPT-generated tests still have limitations, the technology shows significant promise for reducing manual test-writing effort.
Unit testing validates the functionality of discrete modules of code. Writing tests by hand is tedious and time-consuming, so automatically generating quality unit tests could greatly improve developer productivity. The study analyzed 1,000 Java methods, using a basic ChatGPT prompt for test generation.
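To make the task concrete, here is a minimal illustration of a unit test. Python is used for brevity (the study itself targets Java), and the function and test are hypothetical examples, not drawn from the paper's benchmark:

```python
# Hypothetical unit under test and its unit test (Python for brevity;
# the study evaluates Java methods).

def clamp(value, low, high):
    """Constrain value to the inclusive range [low, high]."""
    if low > high:
        raise ValueError("low must not exceed high")
    return max(low, min(value, high))

def test_clamp():
    # Typical cases: inside, below, and above the range.
    assert clamp(5, 0, 10) == 5
    assert clamp(-3, 0, 10) == 0
    assert clamp(42, 0, 10) == 10
    # Boundary values are returned unchanged.
    assert clamp(0, 0, 10) == 0
    assert clamp(10, 0, 10) == 10

test_clamp()
```

A generated test is only useful if it compiles, runs, and asserts the right behavior, which is exactly where the study found ChatGPT's output often falls short.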
Key Findings
- Similar to results reported for ChatUniTest, only 24.8% of ChatGPT's tests passed compilation and execution; the rest failed due to syntax, type, access, and assertion errors. Even so, ChatGPT substantially outperformed previous deep learning methods.
- The passing tests achieved over 80% statement coverage[1] and compared well to manually written tests in coverage, readability, and developer preference.
- With refinements that give ChatGPT more code context and clarify method intent, the error rate could be reduced, unlocking more of ChatGPT's potential.
ChatTester
Based on these insights, the researchers proposed ChatTester, a novel approach to improve the correctness of ChatGPT’s generated unit tests. It has two main components: an initial test generator and an iterative test refiner. The initial generator breaks down test creation into first inferring the intent of the code module, then generating tests based on that intent to improve assertion quality. The iterative refiner fixes compilation errors by prompting ChatGPT with error messages and additional code context, allowing ChatGPT to resolve issues itself. Evaluations showed 34.3% more compilable tests and 18.7% more passing tests compared to default ChatGPT.
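The generate-then-refine loop described above can be sketched roughly as follows. This is a hypothetical illustration, not the authors' implementation: `ask_model` and `try_compile` are stand-in stubs for a ChatGPT API call and a compile-and-run step, and the refinement budget is an assumed parameter.

```python
# Hypothetical sketch of ChatTester's two components: an initial generator that
# first infers the code's intent, then generates a test from that intent, and an
# iterative refiner that feeds compiler errors back to the model. ask_model and
# try_compile are placeholder stubs, not the paper's actual implementation.

MAX_REFINE_ROUNDS = 3  # assumption: bound the number of refinement attempts

def ask_model(prompt):
    # Placeholder for a ChatGPT API call; returns a canned test for illustration.
    return "assert add(2, 3) == 5"

def try_compile(test_code):
    # Placeholder for compiling/executing the test; returns (ok, error_message).
    return True, ""

def generate_test(focal_method, context):
    # Initial generator, step 1: infer the intent of the code module.
    intent = ask_model(f"Describe the intent of this method:\n{focal_method}")
    # Initial generator, step 2: generate a test conditioned on that intent,
    # which the paper credits with improving assertion quality.
    test = ask_model(f"Intent: {intent}\nWrite a unit test for:\n{focal_method}")

    # Iterative refiner: prompt with the error message plus extra code context
    # so the model can resolve compilation issues itself.
    for _ in range(MAX_REFINE_ROUNDS):
        ok, error = try_compile(test)
        if ok:
            return test
        test = ask_model(
            f"The test below fails to compile.\nError: {error}\n"
            f"Relevant context:\n{context}\nFix the test:\n{test}"
        )
    return test  # best effort once the refinement budget is exhausted
```

The key design choice is that the refiner reuses the compiler's own diagnostics as feedback, turning a one-shot generation task into a short repair loop.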
The results are exciting for automating the crucial but tedious task of unit testing. While more work is needed, ChatGPT demonstrates capabilities approaching human testers. If the error rate can be reduced, ChatGPT could at minimum assist developers in writing tests, or even fully automate testing for common code. This could accelerate development cycles and free programmers to focus on higher-value tasks.
Summary
The researchers plan to expand the benchmark codebase and apply techniques like ChatTester to languages other than Java. They also want to explore using ChatGPT for integration and system testing. While promising, responsibly deploying AI like ChatGPT in software testing requires careful validation to prevent defects. Overall, tapping the power of large language models could profoundly impact software quality and developer productivity.
[1] The paper reports statement and branch coverage only for generated tests that passed execution. In Section III.D, describing the experimental procedure, the authors write: “Here we only focus on 248 passing tests generated by ChatGPT since it is less meaningful to recommend tests with compilation errors or execution errors to developers in practice.” In Section IV.B, presenting the coverage data in Table V, they state: “Table V presents the statement and branch coverage of generated tests that could pass the execution.”