When training powerful AI systems to perform complex tasks, it can be difficult to provide training signals that are robust to manipulation. One specific concern is measurement tampering, where the AI system manipulates multiple measurements to create the appearance of good results instead of actually achieving the desired outcome.

In this study, we have developed four new datasets of text inputs and measurements for evaluating measurement tampering detection techniques in large language models. Concretely, the task is to determine whether examples in which all measurements indicate the desired outcome actually achieved that outcome, or whether the measurements were tampered with.
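As a rough illustration of this setup (not the datasets' actual schema; the field names and labels below are hypothetical), each example pairs a text input with several boolean measurements, and detection is only interesting on examples where every measurement looks positive:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    """One hypothetical dataset entry: a text input plus several measurements."""
    text: str                 # input processed by the language model
    measurements: List[bool]  # each measurement claims the outcome occurred (True) or not
    outcome: Optional[bool]   # ground-truth outcome; hidden from the detector at test time


def all_measurements_positive(example: Example) -> bool:
    """Detection is only needed where every measurement indicates success."""
    return all(example.measurements)


def ground_truth_label(example: Example) -> str:
    """Evaluation label: did the outcome really occur, or were measurements tampered with?"""
    assert all_measurements_positive(example)
    return "real_positive" if example.outcome else "tampered"
```

A detection technique receives the text and measurements (but not the ground-truth outcome) and must distinguish the "real_positive" examples from the "tampered" ones.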

We have demonstrated techniques that outperform simple baselines on most of the datasets, though they fall short of the maximum achievable performance. We believe there is still substantial room for improvement in both the detection techniques and the datasets themselves, and we are excited about future research that makes further progress on measurement tampering.