A new research paper from the University of Maryland casts doubt on the reliability of current techniques for detecting AI-generated text. The paper analyzes, both empirically and theoretically, state-of-the-art detectors based on watermarking, zero-shot classification, trained classifiers, and retrieval, and demonstrates vulnerabilities in all of these approaches under practical attacks.
The main findings of the paper are:
- Paraphrasing attacks using lightweight neural paraphrasers can strip the detectable patterns from AI-generated text and evade detection. This holds even for retrieval-based defenses designed to be robust to paraphrasing (a sketch of such an attack appears after this list).
- There is a fundamental limit to how well even the best possible detector can distinguish human-written from AI-generated text. As language models improve and their outputs become more human-like, the best detector's advantage over random guessing shrinks toward zero (see the bound sketched below).
- Existing techniques based on watermarking and retrieval can be spoofed: an adversary can compose genuinely human-written text that gets flagged as AI-generated, raising the risk of falsely accusing people of using AI tools (see the spoofing sketch below).
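To make the paraphrasing attack concrete, here is a minimal sketch that rewrites AI-generated text sentence by sentence with a seq2seq paraphraser. The checkpoint name and the `paraphrase:` prompt prefix are assumptions for illustration only; the specific paraphraser and prompting setup used in the paper may differ.

```python
# Minimal sketch of a paraphrasing attack on an AI-text detector.
# The model name below is a hypothetical placeholder, not the paper's paraphraser.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "your-org/t5-paraphraser"  # hypothetical T5-style paraphrase checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def paraphrase(ai_text: str) -> str:
    """Rewrite AI-generated text sentence by sentence to weaken detector signals."""
    rewritten = []
    for sentence in ai_text.split(". "):
        inputs = tokenizer("paraphrase: " + sentence, return_tensors="pt", truncation=True)
        outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
        rewritten.append(tokenizer.decode(outputs[0], skip_special_tokens=True))
    return " ".join(rewritten)

# The paraphrased output keeps the meaning of the original text, but the
# statistical patterns a watermark or classifier relies on are largely erased.
```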
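The fundamental limit in the second finding is an upper bound on the area under the ROC curve (AUROC) of any detector in terms of the total variation distance between the two text distributions. A sketch of the bound as reported in the paper, where $\mathcal{M}$ and $\mathcal{H}$ denote the distributions of AI-generated and human-written text (notation assumed here):

```latex
% For any detector D, with TV the total variation distance between
% the AI text distribution M and the human text distribution H:
\[
  \mathrm{AUROC}(D) \;\le\; \frac{1}{2}
    + \mathrm{TV}(\mathcal{M}, \mathcal{H})
    - \frac{\mathrm{TV}(\mathcal{M}, \mathcal{H})^{2}}{2}
\]
% As language models improve, TV(M, H) shrinks toward 0 and the bound
% approaches 1/2, i.e. the best possible detector approaches random guessing.
```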
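For the spoofing finding: soft watermarking schemes of the kind analyzed in the paper flag text as AI-generated when an unusually large fraction of its tokens fall in a pseudorandom "green list". Below is a minimal sketch of the standard z-test used for such detection; the green-list fraction, token counts, and threshold are illustrative assumptions. An adversary who infers even part of the green list can deliberately stuff human-written text with green tokens and push the score past the threshold.

```python
import math

def watermark_z_score(num_green: int, num_tokens: int, gamma: float = 0.5) -> float:
    """z-statistic against the null hypothesis 'this text carries no watermark'.

    gamma is the fraction of green-list tokens expected in unwatermarked text.
    """
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1.0 - gamma))
    return (num_green - expected) / std

# Ordinary human text hits roughly gamma * T green tokens by chance:
print(watermark_z_score(num_green=105, num_tokens=200))  # ~0.7, not flagged

# Human text deliberately stuffed with inferred green-list tokens can exceed
# a typical detection threshold (e.g. z > 4) and be misattributed to the AI:
print(watermark_z_score(num_green=160, num_tokens=200))  # ~8.5, flagged as AI
```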
The implications of the paper are far-reaching. Reliable detection of AI-generated text is crucial to prevent misuse such as plagiarism, spam, and fake reviews, yet current techniques are shown to be inadequate. Deploying unreliable detectors in real-world applications could do more harm than good.
The authors recommend an open discussion around developing ethical AI systems in which text generation capabilities do not have to be sacrificed for better detection. With powerful models like GPT-4 now widely available, the stakes around AI authenticity are higher than ever. More research is needed to make AI text detection reliable, unbiased, and resistant to spoofing before it is deployed. Until then, claims of effective AI detection should be taken with a grain of salt.