In this paper, we introduce MLLM-DataEngine, a novel closed-loop system that connects data generation, model training, and evaluation. Although Multimodal Large Language Models (MLLMs) have advanced considerably in both instruction dataset construction and benchmarking, current MLLMs still struggle to improve their capabilities under the guidance of evaluation results without incurring high human costs.

MLLM-DataEngine addresses this problem by analyzing the weaknesses of the model from its evaluation results and generating an incremental dataset for the next training iteration, enhancing the model round by round. Unlike previous methods, which treat data collection and benchmarking as separate stages, MLLM-DataEngine generates data with better targeting, quality, and correctness.
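To make the closed-loop idea concrete, the following is a minimal, illustrative sketch of one possible evaluate-analyze-generate-train iteration; all function and object names (`benchmark.evaluate`, `generate_data`, `train`) are hypothetical placeholders, not the released MLLM-DataEngine API.

```python
# Illustrative sketch of a closed-loop data engine iteration.
# All names here are hypothetical; the actual system may differ.

def data_engine_loop(model, benchmark, generate_data, train, num_rounds=3):
    """Iteratively evaluate, analyze weaknesses, generate targeted data, and retrain."""
    for _ in range(num_rounds):
        # 1. Evaluate the current model and collect per-capability scores.
        results = benchmark.evaluate(model)  # e.g. {"counting": 0.42, "OCR": 0.71, ...}
        # 2. Identify weak capability types from the evaluation results.
        weaknesses = {k: v for k, v in results.items() if v < 0.6}
        # 3. Generate an incremental dataset focused on those weaknesses.
        incremental_data = generate_data(weaknesses)
        # 4. Train the model on the incremental dataset for the next round.
        model = train(model, incremental_data)
    return model
```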

To achieve better targeting, we propose an Adaptive Bad-case Sampling module that adjusts the ratio of different data types within each incremental dataset according to the benchmarking results. For quality, we utilize GPT-4 to generate high-quality data for each data type. For correctness, since prompt design strongly affects the generated data, we replace hand-crafted prompts with an Interactive Prompt Optimization strategy that substantially improves data correctness through multi-round interactions between humans and GPT.
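As a hedged illustration of the bad-case-driven sampling idea, the sketch below allocates a larger share of the incremental dataset to data types on which the model performs worse. The exact formulation of Adaptive Bad-case Sampling in the paper may differ; the function name and the proportional-to-error-rate weighting are assumptions made only for illustration.

```python
# Hedged sketch: weight data types by their failure rate on the benchmark.
# This is not the paper's exact Adaptive Bad-case Sampling formula.

def compute_sampling_ratios(per_type_accuracy: dict[str, float]) -> dict[str, float]:
    """Allocate a larger share of the incremental dataset to weaker data types."""
    error_rates = {t: 1.0 - acc for t, acc in per_type_accuracy.items()}
    total = sum(error_rates.values()) or 1.0
    return {t: err / total for t, err in error_rates.items()}

# Example: "counting" is the weakest type, so it receives the largest share.
ratios = compute_sampling_ratios({"counting": 0.40, "OCR": 0.75, "spatial": 0.60})
print(ratios)  # {'counting': 0.48, 'OCR': 0.20, 'spatial': 0.32}
```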

Extensive experiments demonstrate that MLLM-DataEngine effectively enhances the capability of MLLMs in a targeted and automatic manner with minimal human involvement. We will release MLLM-DataEngine as a general solution for building future MLLMs.