AI Text Detectors: Accuracy

Overview

AI text detectors, also called AI writing detectors or AI content detectors, are tools designed to identify whether all or part of a text was produced by an artificial intelligence (AI) application such as ChatGPT.

AI detectors are used in several domains. Educators can use them to check the originality of students' work, while online moderators can use them to remove inauthentic product reviews and other spam content.

These detectors use machine learning models to identify AI-generated content. Their accuracy matters because it directly determines how reliable and useful they are. Accuracy reflects how well a detector distinguishes AI-generated from human-written content, that is, how few false positives (human text flagged as AI) and false negatives (AI text passed as human) it produces. These tools are still under development, however, and their reliability is still being evaluated.

Two text samples run through GLTR, an AI text detector

Accuracy Measurement

The accuracy of an AI text detector is measured by comparing its outputs against a set of pre-labeled data, known as a test dataset.

Typically, a set of texts, both AI-generated and human-written, is fed into the detector, which must classify each one. Because the actual source (AI or human) of each text in the test dataset is known in advance, the detector's classifications can be compared with the true sources.

The accuracy is then calculated as the ratio of correctly classified instances (both true positives and true negatives) to the total instances in the dataset.
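The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular detector's evaluation code; the function name `accuracy` and the sample labels are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of texts whose predicted label matches the known source."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("test dataset is empty")
    # A classification is correct when it equals the known source,
    # which covers both true positives and true negatives.
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical test dataset: known sources and a detector's classifications.
true_labels = ["ai", "ai", "ai", "human", "human", "human", "human", "human"]
predicted_labels = ["ai", "human", "ai", "human", "human", "ai", "human", "human"]

print(accuracy(true_labels, predicted_labels))  # 6 of 8 correct -> 0.75
```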

While accuracy is an important metric, it does not tell the whole story, especially in cases where the dataset is imbalanced (for example, there might be significantly more human-written texts than AI-generated texts). In these scenarios, other metrics, such as precision, recall (also known as sensitivity), and the F1 score, are often used alongside accuracy to provide a more comprehensive understanding of a detector's performance.

Precision measures the percentage of correctly identified positive results (in this case, AI-generated texts) out of all instances identified as positive by the detector.
Recall assesses the percentage of correctly identified positive results out of all actual positive instances in the dataset.
F1 score is the harmonic mean of precision and recall, providing a balanced measure when there is an uneven class distribution.
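To make these definitions concrete, the sketch below computes all three metrics from a confusion count, treating "AI-generated" as the positive class. The dataset is invented for illustration (95 human-written texts and 5 AI-generated ones) and the function names are hypothetical. It shows how a detector can reach high accuracy on an imbalanced dataset while still missing most AI-generated texts.

```python
def confusion_counts(true_labels, predicted_labels, positive="ai"):
    """Count true/false positives and negatives for the given positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); an undefined ratio is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented imbalanced dataset: 95 human-written texts, 5 AI-generated texts.
# The detector labels every human text correctly but catches only 1 AI text.
true_labels = ["human"] * 95 + ["ai"] * 5
predicted_labels = ["human"] * 95 + ["ai"] + ["human"] * 4

tp, fp, tn, fn = confusion_counts(true_labels, predicted_labels)
acc = (tp + tn) / len(true_labels)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"accuracy={acc:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this example accuracy is 0.96, yet recall is only 0.20, which is why accuracy alone can be misleading on imbalanced data.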

Overall, while the accuracy of AI text detectors can be measured, understanding the nuances of their effectiveness requires a multifaceted approach, considering multiple metrics and recognizing the limitations of each.

Research on Accuracy

The studies below are not exhaustive, and their order does not imply ranking. The accuracy of each tool may vary depending on the specific use case.

Research from Caulfield, J. (2023, May 15)

In this study, nine AI detectors were selected based on their visibility in search results: a mix of free tools and two well-known premium tools. All were tested on the same set of texts and scored solely on detection accuracy and the occurrence of false positives. Usability and pricing were not taken into account in the scoring.

AI text detectors ranking according to their performance and search result visibility

A total of 30 texts were used in the test, five in each of six categories. Each text was between 1,000 and 1,500 characters long, because detectors tend to be less accurate on shorter texts. The six categories were:

Completely human-written texts
Texts generated by GPT-3.5 (from ChatGPT)
Texts generated by GPT-4 (from ChatGPT)
Mix of human-written texts and GPT-3.5 text (from ChatGPT)
GPT-3.5 texts paraphrased by QuillBot
Human-written texts paraphrased by QuillBot

The human-written texts covered five different topics from various publications. To ensure a fair comparison, all AI-generated texts were created based on these same five topics.

Methodology Pros:

The methodology provides a comprehensive comparison across different types of text, giving a broad view of the detectors' capabilities.
The use of a uniform scoring system ensures fair evaluation of the detectors based solely on accuracy.
The variety of text sources, including AI-generated, human-written, and paraphrased texts, offers a robust test of the detectors' capabilities.

Methodology Cons:

The research only includes nine detectors, which might not be representative of the entire field of available AI text detectors.
The methodology doesn't account for shorter texts, which are common in real-world scenarios like social media posts.
The research focuses on the rate of false positives, while false negatives (AI-generated text classified as human-written) are also a significant concern.
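The distinction in the last point can be made explicit by reporting the two error rates separately. The sketch below uses invented counts, not figures from the study, and the function name `error_rates` is hypothetical. "AI-generated" is treated as the positive class.

```python
def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate).

    False positive rate: share of human-written texts flagged as AI.
    False negative rate: share of AI-generated texts passed as human.
    """
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Invented counts: 10 human-written texts (1 wrongly flagged as AI)
# and 20 AI-generated texts (8 missed by the detector).
fpr, fnr = error_rates(tp=12, fp=1, tn=9, fn=8)
print(f"false positive rate={fpr:.2f}, false negative rate={fnr:.2f}")
```

A detector tuned only to keep the first number low can still let a large share of AI-generated text through, so both rates are worth reporting.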

Research from Vivian van Oijen

The research focused on assessing the performance of different AI text detection tools in recognizing AI-generated text. These tools evaluate the input text either by attributing a label or assigning a score ranging from 1 to 100. The labels utilized are "AI", "human", and "unclear".
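To compare score-based and label-based tools on the same footing, a score has to be mapped to one of the three labels. The sketch below shows one way this could be done. The thresholds (40 and 60) and the assumption that a higher score means "more likely AI" are hypothetical; they are not taken from the study or from any specific tool.

```python
def score_to_label(score, human_below=40, ai_above=60):
    """Map a 1-100 detector score to "AI", "human", or "unclear".

    Assumes a higher score means "more likely AI-generated"; both
    thresholds are illustrative assumptions, not documented values.
    """
    if not 1 <= score <= 100:
        raise ValueError("score must be between 1 and 100")
    if score >= ai_above:
        return "AI"
    if score <= human_below:
        return "human"
    return "unclear"


for s in (85, 50, 10):
    print(s, score_to_label(s))
```

With these thresholds, 85 maps to "AI", 50 to "unclear", and 10 to "human". Widening the "unclear" band trades fewer wrong labels for more inconclusive ones.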

The research team used a variety of prompts for ChatGPT to generate text, which was subsequently analyzed by the detection tools. The prompts encompassed a wide array of topics, from historical events to fictional narratives, and even included a Dutch translation for added diversity.

To assess how well these tools classify human-written text, the researchers also tested selected examples of human-written content. They acknowledged that not all errors are equally damaging: a false positive can be much more harmful than a false negative.

The selected human-written texts included:

Text from Wikipedia, specifically old versions that predate the release of ChatGPT, thus reducing the chance of containing AI-generated content.
A report from a SURF project.
An excerpt from "Alice in Wonderland."
A Reddit post from r/talesfromtechsupport.

Pros of the Methodology

The use of a variety of prompts ensures a broad examination of the detectors' capabilities.
The methodology includes the use of authentic human-written texts, enabling a more accurate assessment of the tools' performance in real-world applications.
The study emphasizes the seriousness of false positives, which can cause more damage than false negatives.

Cons of the Methodology

The study primarily focuses on English text, with only one example of Dutch, which may limit the applicability of the results to other languages.
The research does not specify the total number of detectors being tested, which can affect the comprehensiveness of the research.
The methodology relies solely on the text generated by ChatGPT. Incorporating text from various AI models could provide a more comprehensive analysis.

The results are listed in the table below:

Tool	Accuracy on AI-generated texts	Accuracy on human-written texts
Content at Scale	0%	60%
Copyleaks	44%	100%
Corrector App	50%	100%
Crossplag	30%	100%
GPTZero	30%	60%
OpenAI	40%	80%
Writer	0%	80%
Average	28%	83%

Results from van Oijen's research on the accuracy of AI text detectors
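The final row of the table is consistent with an unweighted mean of the per-tool percentages. The short check below recomputes it from the values in the table; the variable names are illustrative, and the interpretation of the last row as a simple mean is an inference from the arithmetic, not something the study states.

```python
# Per-tool accuracy (%) copied from the table:
# (on AI-generated texts, on human-written texts).
results = {
    "Content at Scale": (0, 60),
    "Copyleaks": (44, 100),
    "Corrector App": (50, 100),
    "Crossplag": (30, 100),
    "GPTZero": (30, 60),
    "OpenAI": (40, 80),
    "Writer": (0, 80),
}

# Unweighted mean across the seven tools for each column.
ai_mean = sum(ai for ai, _ in results.values()) / len(results)
human_mean = sum(human for _, human in results.values()) / len(results)

print(round(ai_mean), round(human_mean))  # 28 83
```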
