A Validation Study on Artificial Intelligence Content Detection Tools
- Vishnuprasad Kakarla; Phoenix Carlisle
- Oct 16
Abstract
Since the launch of generative Artificial Intelligence (AI), various detectors have been released to identify AI-generated content. However, their reliability remains unknown due to their fluctuating levels of accuracy. In this study, we conducted a preliminary assessment of the accuracy of such tools, namely Copyleaks, GPTZero, Quillbot, Undetectable, and Writer, since they were the top results from a Google search for “AI Detectors.” We tested six text samples: one human-written version and one AI-written version for each of three differing topics. Each human sample was written individually, but one of the human-written samples was co-authored to mix the patterns found in the text. For this analysis, human-generated content was classified as negative, while AI-generated content was classified as positive. The sensitivity (the ability of a test to correctly identify samples with the condition) was 80%, while the specificity (the ability of a test to correctly identify samples without the condition) was 82.35%. In simpler terms, 26 out of the 30 results obtained were correctly classified as either human-written or AI-written, yielding an accuracy rate of 86.67%. This means that about 13 out of every 100 samples will be misclassified as AI-written or human-written. These results raise the question of whether these tools should be used as evidence of academic dishonesty or violation of terms in academic settings.
Keywords: Artificial Intelligence (AI), Large-Language Model (LLM), Copyleaks, GPTZero, Quillbot, Undetectable, Writer, ChatGPT, Google, Gemini, Perplexity, Bard, AI Detectors
1. Introduction
ChatGPT, a widely used generative Artificial Intelligence (AI) tool (Cardillo, 2024), was released in November 2022 (Foote, 2024). Since then, many other AI tools, such as Google Gemini (formerly Bard (Ortiz, 2024)), have also been released. These AI tools are examples of Large-Language Models (Guinness, 2024), or LLMs for short. LLMs are machine learning models trained on large datasets of textual information that encompass the patterns of human-written language (Liu et al., 2024). By understanding human language, they can fluently respond to prompts from the user in a conversational tone, simulating human interaction (IBM, 2023). In December 2022, originality.ai released the first publicly available AI content detection tool (Gillham, 2022), followed by multiple others. These tools attempt to determine whether a given text sample, such as an essay, contains content generated by Artificial Intelligence (McCoy, 2024). Studies have found that these tools vary in detection accuracy, with high false-positive and false-negative rates (Weber-Wulff, 2023; Elkhatat et al., 2023). The aim of this study was to complete a preliminary assessment of these detection tools. Ironically, OpenAI, the developer of ChatGPT, also released an AI detector in January 2023 (OpenAI, 2023) but later discontinued it, citing low accuracy rates and stating that “sometimes human-written text will be incorrectly but confidently labeled as AI-written by our classifier (OpenAI, 2023).” To better our understanding of these tools, the next step is to explore how they fundamentally work:
Classifying: A method of sorting given samples into a preset class, which in the context of AI detection tools is either AI-written or human-written (Santoro, 2023). Classifiers are trained on large datasets of AI-written and human-written text; they analyze the given text and look for similarities with the patterns found in the datasets they were trained on to reach a conclusion (Santoro, 2023).
Embedding: A method that converts text into numerical vectors, allowing analysis of the underlying patterns used to classify text as human- or AI-written (Santoro, 2023). After the analysis, the text is labeled as either AI-written or human-written.
Perplexity: Perplexity, different from the LLM Perplexity AI, is a measure of how predictable the text is (Santoro, 2023). The lower the perplexity, the more predictable the text is. AI text tends to be very predictable and have a low perplexity (AutoGPT, 2024).
Burstiness: Burstiness refers to variation in sentence structure and length throughout a text (Santoro, 2023). AI-written text tends to be more uniform in sentence length; this distinction becomes less pronounced as models evolve (AutoGPT, 2024).
AI detection tools primarily use these four methods to detect AI-generated content (Marinkovic, 2024). The detectors also differ in how they report results. For example, one detector might give a percentage representing how likely the text is to be human- or AI-written, while another might give a purely textual result that simply states “Written by AI / AI content detected,” or vice versa. The recorded results form the basis for our analysis of detection accuracy.
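Two of these signals can be illustrated numerically. The following Python snippet is a minimal sketch of how perplexity and a simple burstiness proxy could be computed; the function names and the use of sentence-length standard deviation as a burstiness proxy are our own illustrative choices, not any detector's actual implementation.

```python
import math
import statistics

def perplexity(token_probs):
    """Perplexity is the exponential of the average negative
    log-probability a language model assigns to each token;
    lower values mean more predictable text."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

def burstiness(sentence_lengths):
    """A simple burstiness proxy: the standard deviation of
    sentence lengths. More uniform sentences (typical of
    AI-written text) give a lower value."""
    return statistics.stdev(sentence_lengths)

# Text whose every token was predicted with probability 0.5
# has perplexity 2; perfectly predicted text has perplexity 1.
print(perplexity([0.5, 0.5, 0.5, 0.5]))  # approximately 2.0
print(burstiness([12, 3, 27, 8]))        # varied lengths -> higher value
```

A detector comparing a sample against its training data would combine signals like these: low perplexity and low burstiness both push a sample toward the AI-written label.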
Table 1: AI tools tested and their method(s)
2. Methods
Three text samples on different broad topics, each approximately 200 words, were self-written. Then, large-language models, namely ChatGPT (developed by OpenAI) and Gemini (developed by Google), were tasked with writing on the same topics at the same word count; no other parameters (e.g., tone, diction) were specified, in order to imitate typical use of artificial intelligence. These text samples were then run through several Artificial Intelligence detection tools, selected from an internet query to mimic the typical selection process for AI detectors:
The human-written sample was tested first, followed by the corresponding LLM-generated text, and this process was repeated for all remaining text samples. Each result we obtained was recorded in a spreadsheet, sorted by text sample number and detector used. For this study, any result reporting 35% or less AI content was classified as human-written, and any result reporting 35% or less human content (i.e., 65% or more AI content) was classified as AI-written. If the detector did not give a percentage-based result, the result given was recorded directly.
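The recording rule above can be summarized in a few lines of Python. This is a sketch of our thresholding logic; the "inconclusive" label for percentages between the two cutoffs is our own shorthand for cases where the detector's verbal output was recorded directly.

```python
def classify_result(ai_percent):
    """Map a detector's percentage output to a recorded label.
    <= 35% AI content is recorded as human-written; >= 65% AI
    content (i.e., <= 35% human content) is recorded as AI-written."""
    if ai_percent <= 35:
        return "human-written"
    if ai_percent >= 65:
        return "AI-written"
    return "inconclusive"  # verbal result recorded directly instead

print(classify_result(12))  # human-written
print(classify_result(88))  # AI-written
```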

3. Results
3.1. Overview
As there were six samples and five detectors, we obtained 30 results. Each result was recorded in a spreadsheet, sorted by sample number and AI detector used. 26 of the 30 results were correctly classified, while the remaining 4 were incorrectly classified. There were 12 true positives, 14 true negatives, 1 false positive, and 3 false negatives, as outlined in Table 2 below. Under our parameters, these findings indicate that about 13 of every 100 samples will be misclassified.

3.2. Calculation of Accuracy Rate
To calculate the accuracy rate, we can use the following formula (Baratloo et al., 2015):

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Substituting our counts: (12 + 14) / (12 + 14 + 1 + 3) = 26 / 30 = 86.67%.
3.3. Calculation of Sensitivity and Specificity
3.3.1. Sensitivity
Sensitivity can be calculated using the following formula (West et al., 2020):

Sensitivity = TP / (TP + FN)

Substituting our counts: 12 / (12 + 3) = 80%.
3.3.2. Specificity
Specificity can be calculated using the following formula (West et al., 2020):

Specificity = TN / (TN + FP)
4. Discussion
After assessing our samples, we determined the accuracy rate to be 86.67% and the inaccuracy rate to be 13.33%, with a sensitivity of 80% and a specificity of 82.35%. A sensitivity of 80% means the detectors collectively identify true positives 80% of the time; a specificity of 82.35% means they identify true negatives 82.35% of the time; and an accuracy rate of 86.67% means they correctly identify either true positives or true negatives 86.67% of the time. In a study conducted by Elkhatat et al., the observed results indicated “considerable variability in the tools’ ability to correctly identify and categorize text as either AI-generated or human-written,” which is consistent with the results we obtained (Elkhatat et al., 2023).
Considering the practical applications of AI detection tools in classroom settings, such technology holds the potential to assist educators in evaluating student work and identifying AI-generated content. However, the reliability of these tools for providing an accurate assessment of student work is still subject to debate. According to the National Center for Education Statistics, the average class size for departmentalized classroom settings in the United States is 23 students (National Center for Education Statistics, 2018). This implies that, out of 23 students, the work of about 3 will be misclassified by AI detectors as either a false positive or a false negative. When a teacher relies on AI detection tool results as the sole evidence of academic dishonesty and decides to act, the outcome can appear on a student’s academic record, contingent on school policies. This demonstrates that additional evidence may be needed to establish academic dishonesty.
Nonetheless, it is important to note that our study has some limitations. The free versions of the AI detection tools were used in this study; however, a paid version could have offered a deeper analysis, improving the accuracy of results. Additionally, our text samples were shorter than normal papers or essays and had limited variation in grammar, diction, and connotation. To enhance our study's results, we could incorporate a more extensive variety of samples with different writing patterns.
From a comprehensive perspective, while these AI tools do demonstrate some level of accuracy, they may not be accurate enough for use in academic settings. As generative AI is increasingly incorporated into areas of society such as education, healthcare, and technology, AI detection tools will be paramount for verifying the authenticity of work. However, the limitations and variability in the accuracy of these tools have the potential to hinder the scope of their usage. While AI detection tools may offer benefits for discerning AI content, the findings of our study suggest that they should not be relied upon as the sole basis for analyzing texts.
5. References
A. Cardillo. 20 Most Popular AI Tools Ranked. Exploding Topics. https://explodingtopics.com/blog/most-popular-ai-tools (2024).
K. Foote. A brief history of generative AI. Dataversity. https://www.dataversity.net/a-brief-history-of-generative-ai/. (2024).
S. Ortiz. What is Google's Gemini AI tool (formerly Bard)? Everything you need to know. ZDNet. https://www.zdnet.com/article/what-is-googles-gemini-ai-tool-formerly-bard-everything-you-need-to-know/ (2024).
H. Guinness. The best large language models (LLMs). Zapier. https://zapier.com/blog/best-llm/ (2024).
Y. Liu, J. Cao, C. Liu, K. Ding, L. Jin. Datasets for Large Language Models: A Comprehensive Survey. ArXiv. https://doi.org/10.48550/arXiv.2402.18041 (2024).
IBM. What are large-language models? International Business Machines. https://www.ibm.com/think/topics/large-language-models (2023).
J. Gillham. Originality.AI launches the first AI tool available to detect content created by popular AI writing tools. PRWeb Cision. https://www.prweb.com/releases/originality-ai-launches-the-first-ai-tool-available-to-detect-content-created-by-popular-ai-writing-tools-883852583.html (2022).
J. McCoy. How Do AI Detectors Work and Why It Matters? Brandwell. https://brandwell.ai/blog/how-do-ai-detectors-work/#whatisanaicontentdetector (2024).
D. Weber-Wulff, A. Anohina-Naumeca, S. Bjelobaba, et al. Testing of detection tools for AI-generated text. International Journal for Educational Integrity 19, 26. https://doi.org/10.1007/s40979-023-00146-z (2023).
A. M. Elkhatat, K. Elsaid, and S. Almeer. Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. International Journal of Educational Integrity 19, 17. https://doi.org/10.1007/s40979-023-00140-5 (2023).
OpenAI. New AI classifier for indicating AI-written text. OpenAI. https://openai.com/index/new-ai-classifier-for-indicating-ai-written-text/ (2023).
K. Santoro. How do AI detectors work? Quillbot. https://quillbot.com/blog/ai-writing-tools/how-do-ai-detectors-work/ (2023).
AutoGPT. How to Trick AI Detectors. AutoGPT. https://autogpt.net/how-to-trick-ai-detectors/ (2024).
P. Marinkovic. 4 ways AI content detectors work to spot AI. Surfer SEO. https://surferseo.com/blog/how-do-ai-content-detectors-work/ (2024).
J. Manyika. An overview of the Gemini app. Google Gemini. https://gemini.google/overview-gemini-app.pdf (2024).
OpenAI. How ChatGPT and our foundation models are developed. OpenAI Help Center. https://help.openai.com/en/articles/7842364-how-chatgpt-and-our-foundation-models-are-developed (N.d.).
Perplexity AI. How does Perplexity work? Perplexity AI FAQ. https://www.perplexity.ai/hub/faq/how-does-perplexity-work (N.d.).
Copyleaks. AI Detector by Copyleaks. Copyleaks. https://copyleaks.com/ai-content-detector (N.d.).
Undetectable AI. Advanced AI Detector and AI Checker for ChatGPT and More. Undetectable AI. https://undetectable.ai/ai-detector (N.d.).
Quillbot. AI Detector. Quillbot. https://quillbot.com/ai-content-detector (N.d.).
F. Habibzadeh. GPTZero Performance in Identifying Artificial Intelligence-Generated Medical Texts: A Preliminary Study. Journal of Korean medical science, 38, e319. https://doi.org/10.3346/jkms.2023.38.e319 (2023).
A. Baratloo, M. Hosseini, A. Negida, & G. El Ashal. Part 1: Simple Definition and Calculation of Accuracy, Sensitivity and Specificity. Emergency (Tehran, Iran), 3(2), 48–49. https://pmc.ncbi.nlm.nih.gov/articles/PMC4614595/ (2015).
R. West, A. Kobokovich. Factsheet: Understanding the Accuracy of Diagnostic and Serology Tests: Sensitivity and Specificity. Johns Hopkins Center for Health Security. https://centerforhealthsecurity.org/sites/default/files/2022-12/201207-sensitivity-specificity-factsheet.pdf (2020).
National Center for Education Statistics. Average class size in public schools, by class type and state: 2017–18. National Center for Education Statistics. https://nces.ed.gov/surveys/ntps/tables/ntps1718_fltable06_t1s.asp (2018).
6. Appendix – Text Samples