A Validation Study on Artificial Intelligence Content Detection Tools

Vishnuprasad Kakarla; Phoenix Carlisle

Abstract

Since the launch of generative Artificial Intelligence (AI), various detectors have been released to identify AI-generated content. However, their reliability remains uncertain because their accuracy fluctuates. In this study, we conducted a preliminary assessment of the accuracy of five such tools, namely Copyleaks, GPTZero, Quillbot, Undetectable, and Writer, chosen because they were the top results of a Google search for “AI Detectors.” We tested six text samples: one human-written version and one AI-written version for each of three different topics. Each human-written sample was written by a single author, except for one that was co-authored to mix the writing patterns found in the text. For this analysis, human-written content was classified as negative, while AI-generated content was classified as positive. The sensitivity (the ability of a test to correctly identify samples with the condition) was 80%, while the specificity (the ability of a test to correctly identify samples without the condition) was 93.33%. In simpler terms, 26 of the 30 results obtained were correctly classified as either human-written or AI-written, yielding an accuracy rate of 86.67%. This means that about 13 out of every 100 results would be expected to come back misclassified as AI-written or human-written. These results raise the question of whether these tools should be used as evidence of academic dishonesty or violation of terms in academic settings.

Keywords: Artificial Intelligence (AI), Large-Language Model (LLM), Copyleaks, GPTZero, Quillbot, Undetectable, Writer, ChatGPT, Google, Gemini, Perplexity, Bard, AI Detectors

1. Introduction

ChatGPT, a widely used generative Artificial Intelligence (AI) tool (Cardillo, 2024), was released in November 2022 (Foote, 2024). Since then, many other AI tools, such as Google Gemini (formerly Bard; Ortiz, 2024), have also been released. These AI tools are examples of Large-Language Models (Guinness, 2024), or LLMs for short. LLMs are machine learning models trained on large datasets of textual information that encompass the patterns of human-written language (Liu et al., 2024). By modeling human language, they can fluently respond to prompts from the user in a conversational tone, simulating human interaction (IBM, 2023). In December 2022, originality.ai released the first publicly available AI content detection tool (Gillham, 2022), followed by multiple others. These tools attempt to determine whether a given text sample, such as an essay, contains content generated by Artificial Intelligence (McCoy, 2024). Studies have found that these tools vary in detection accuracy, with high false positive and false negative rates (Weber-Wulff et al., 2023; Elkhatat et al., 2023). The aim of this study was to complete a preliminary assessment of these detection tools. Ironically, OpenAI, the developer of ChatGPT, also released an AI detector in January 2023 (OpenAI, 2023) but later discontinued it, citing low accuracy rates and stating that “sometimes human-written text will be incorrectly but confidently labeled as AI-written by our classifier” (OpenAI, 2023). To better understand these tools, the next step is to explore how they fundamentally work:

Classifying: A method of assigning a given sample to a preset class, which in the context of AI detection tools is either AI-written or human-written (Santoro, 2023). Classifiers are trained on large datasets of AI-written and human-written text; they analyze the given text for patterns similar to those in the training datasets to reach a conclusion (Santoro, 2023).
Embedding: A method that converts text into numerical vectors, allowing analysis of the underlying patterns that distinguish human-written from AI-written text (Santoro, 2023). After the analysis, the text is labeled as either AI-written or human-written.
Perplexity: Perplexity, not to be confused with the LLM Perplexity AI, is a measure of how predictable the text is (Santoro, 2023). The lower the perplexity, the more predictable the text is. AI-written text tends to be very predictable and to have low perplexity (AutoGPT, 2024).
Burstiness: Burstiness is the variation in sentence structure and length throughout the text (Santoro, 2023). AI-written text tends to be more uniform in sentence length; this distinction becomes less pronounced as models change (AutoGPT, 2024).
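
The perplexity and burstiness measures described above can be illustrated with a minimal sketch. It assumes a toy add-one-smoothed unigram word model for perplexity and the coefficient of variation of sentence lengths for burstiness; both choices, and the function names `unigram_perplexity` and `burstiness`, are hypothetical stand-ins. The detectors tested in this study do not publish their implementations, and real tools would more plausibly score text with a large language model.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def unigram_perplexity(text, reference):
    """Perplexity of `text` under an add-one-smoothed unigram model of `reference`.

    Lower values mean the words in `text` are more predictable given `reference`.
    Assumes `text` contains at least one token.
    """
    ref_counts = Counter(tokenize(reference))
    tokens = tokenize(text)
    vocab_size = len(set(ref_counts) | set(tokens))
    total = sum(ref_counts.values())
    log_prob = sum(
        math.log((ref_counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return math.exp(-log_prob / len(tokens))


def burstiness(text):
    """Coefficient of variation of sentence lengths, measured in words.

    0.0 means every sentence has the same length; larger values mean more variation.
    Assumes `text` contains at least one sentence.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(tokenize(s)) for s in sentences]
    return pstdev(lengths) / mean(lengths)
```

Under these assumptions, a text with uniform sentence lengths scores a burstiness of 0.0, and a text made of words that are frequent in the reference scores a lower perplexity than one made of unseen words.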

AI detection tools primarily use these four methods to detect AI-generated content (Marinkovic, 2024). The detectors report their results in different forms. For example, one AI detector might give a percentage that represents how likely the text was written by a human or by AI, while another might give a textual result that simply states “Written by AI / AI content detected” or the reverse. The recorded results form the basis for the analysis of detection accuracy.

Table 1: AI tools tested, and their method(s)

Name of AI Tool	Type of AI Tool	Method(s) of Tool
Google Gemini (formerly Bard)	Generative AI Tool (Large Language Models)	Uses publicly available data to generate text (Manyika, 2024).
ChatGPT (OpenAI)	Generative AI Tool (Large Language Models)	Uses publicly available data, third-party partner data, and user(s) / trainer(s) given data (OpenAI, N.d.).
Perplexity.ai	Generative AI Tool (Large Language Models)	Uses sources found on the internet to gather information and generate text (Perplexity AI, N.d.).
Copyleaks	Artificial Intelligence Content Detection Tool	Recognizes human writing patterns and classifies content that differs from them as potentially AI-written (Copyleaks, N.d.).
Writer	Artificial Intelligence Content Detection Tool	No information from credible sources regarding its methods was available at the time of writing.
Undetectable	Artificial Intelligence Content Detection Tool	Presents the results of various other AI detection tools side by side (Undetectable AI, N.d.).
Quillbot	Artificial Intelligence Content Detection Tool	Analyzes given text for repeated words, awkward phrases, and unnatural flow, often found in AI-generated text (Quillbot, N.d.).
GPTZero	Artificial Intelligence Content Detection Tool	Analyzes the perplexity and burstiness of a text (Habibzadeh, 2023).

2. Methods

Three text samples of approximately 200 words each, on different broad topics, were written by the authors. Then, large language models such as ChatGPT, developed by OpenAI LLC, and Gemini, developed by Google LLC, were asked to write on the same topics at the same word count; no other parameters (e.g., tone or diction) were specified, to imitate the typical use of artificial intelligence. These text samples were then put through several Artificial Intelligence detection tools. The tools were selected from an internet search to mimic the typical selection process of AI detectors:

Copyleaks (Detector #1)
Undetectable (Detector #2)
GPTZero (Detector #3)
Quillbot (Detector #4)
Writer (Detector #5)

The human-written sample was tested first, followed by the corresponding LLM-generated text. This process was repeated for all remaining text samples. Each result was recorded in a spreadsheet, sorted by text sample number and detector used. For this study, any result of 35% AI content or less was classified as human-written, and any result of 35% human-written content or less was classified as AI-written. If a detector did not give a percentage-based result, the result it gave was recorded directly.
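
The 35% cutoff rule can be sketched as a small function. This is an illustration under two assumptions that the text does not state: that a detector's human-written percentage is 100 minus its AI percentage, and that scores falling between the two cutoffs are left unlabeled (returned here as "unspecified"). The function name `classify` is hypothetical.

```python
def classify(ai_percent, threshold=35.0):
    """Label a percentage-based detector result using the study's 35% cutoffs.

    Assumption: human_percent = 100 - ai_percent.
    Assumption: results between the two cutoffs are returned as "unspecified",
    because the study does not say how such results were handled.
    """
    if not 0 <= ai_percent <= 100:
        raise ValueError("ai_percent must be between 0 and 100")
    human_percent = 100.0 - ai_percent
    if ai_percent <= threshold:
        return "human-written"
    if human_percent <= threshold:
        return "AI-written"
    return "unspecified"
```

For example, a detector score of 20% AI content would be recorded as human-written, and a score of 90% AI content as AI-written.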

3. Results

3.1. Overview

As there were six samples and five detectors, we obtained 30 results. Of these, 26 were correctly classified, while the remaining 4 were incorrectly classified. There were 12 true positives, 14 true negatives, 1 false positive, and 3 false negatives, as outlined in Table 2 below. Under our parameters, these findings indicate that about 13 of every 100 results would be misclassified.
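
The tallying behind these counts can be sketched as follows, with AI-written treated as the positive class as in this study. The function names and the (actual, predicted) pair format are hypothetical; the per-result data are not reproduced here, so only the aggregate counts reported above are used.

```python
from collections import Counter


def tally(results):
    """Count TP, TN, FP, FN from (actual, predicted) pairs.

    Each label is "AI" (positive) or "human" (negative).
    """
    counts = Counter({"TP": 0, "TN": 0, "FP": 0, "FN": 0})
    for actual, predicted in results:
        if actual == "AI":
            counts["TP" if predicted == "AI" else "FN"] += 1
        else:
            counts["TN" if predicted == "human" else "FP"] += 1
    return counts


def misclassified_per_100(counts):
    """Number of misclassified results per 100 results."""
    total = sum(counts.values())
    return 100 * (counts["FP"] + counts["FN"]) / total


# Aggregate counts reported in this section.
study_counts = Counter(TP=12, TN=14, FP=1, FN=3)
print(round(misclassified_per_100(study_counts), 2))  # 13.33
```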

3.2. Calculation of Accuracy Rate

To calculate the accuracy rate, we can use the following formula (Baratloo et al., 2015):
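
A reconstruction of the standard accuracy formula, applied to the counts reported in Section 3.1. The symbols TP, TN, FP, and FN are shorthand introduced here for the true positive, true negative, false positive, and false negative counts.

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
                = \frac{12 + 14}{12 + 14 + 1 + 3}
                = \frac{26}{30} \approx 86.67\%
\]
```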

3.3. Calculation of Sensitivity and Specificity

3.3.1. Sensitivity

Sensitivity can be calculated using the following formula (West & Kobokovich, 2020):
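
A reconstruction of the standard sensitivity formula, applied to the counts reported in Section 3.1, where TP is the number of true positives and FN the number of false negatives (shorthand introduced here):

```latex
\[
\text{Sensitivity} = \frac{TP}{TP + FN} = \frac{12}{12 + 3} = \frac{12}{15} = 80\%
\]
```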

3.3.2. Specificity

Specificity can be calculated using the following formula (West & Kobokovich, 2020):
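
A reconstruction of the standard specificity formula, applied to the counts reported in Section 3.1, where TN is the number of true negatives and FP the number of false positives (shorthand introduced here):

```latex
\[
\text{Specificity} = \frac{TN}{TN + FP} = \frac{14}{14 + 1} = \frac{14}{15} \approx 93.33\%
\]
```

Note that the denominator is the number of human-written (negative) results. Dividing instead by TN + FN gives 14/17, or about 82.35%, which is the negative predictive value, a different quantity from specificity.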

4. Discussion

After assessing our samples, we determined the accuracy rate to be 86.67% and the inaccuracy rate to be 13.33%. Additionally, we calculated a sensitivity of 80% and a specificity of 93.33%. A sensitivity of 80% means that the detectors collectively identified 80% of the AI-written samples as AI-written. A specificity of 93.33% means that they identified 93.33% of the human-written samples as human-written. An accuracy rate of 86.67% means that 86.67% of all results, positive or negative, were classified correctly. In a study conducted by Elkhatat et al. (2023), the observed results indicated that there was “considerable variability in the tools’ ability to correctly identify and categorize text as either AI-generated or human-written,” which is consistent with the results we obtained.

Considering the practical applications of AI detection tools in classroom settings, such technology has the potential to help educators evaluate student work and identify AI-generated content. However, the reliability of these tools for accurately assessing student work is still subject to debate. According to the National Center for Education Statistics, the average class size for departmentalized classroom settings in the United States is 23 students per class (National Center for Education Statistics, 2018). At the 13.33% misclassification rate we observed, this implies that the work of about 3 of those 23 students would be misclassified by the AI detectors, as either false positives or false negatives. When a teacher relies on AI detection tool results as the sole evidence of academic dishonesty and decides to act, the finding can potentially appear on a student’s academic record, contingent on school policies. This suggests that additional evidence may be needed to establish academic dishonesty.

Nonetheless, it is important to note that our study has some limitations. The free versions of the AI detection tools were used in this study; however, a paid version could have offered a deeper analysis, improving the accuracy of results. Additionally, our text samples were shorter than normal papers or essays and had limited variation in grammar, diction, and connotation. To enhance our study's results, we could incorporate a more extensive variety of samples with different writing patterns.

Overall, while these AI detection tools do demonstrate some level of accuracy, they may still not be accurate enough to be used in academic settings. As generative AI is increasingly incorporated into areas of society such as education, healthcare, and technology, AI detection tools will become important for verifying the authenticity of work. However, the limitations and variability in the accuracy of these tools have the potential to limit the scope of their usage. While AI detection tools may offer benefits for discerning AI content, the findings of our study suggest that they should not be relied upon as the sole basis for analysis of texts.

5. References

A. Cardillo. 20 Most Popular AI Tools Ranked. Exploding Topics. https://explodingtopics.com/blog/most-popular-ai-tools (2024).
K. Foote. A brief history of generative AI. Dataversity. https://www.dataversity.net/a-brief-history-of-generative-ai/ (2024).
S. Ortiz. What is Google's Gemini AI tool (formerly Bard)? Everything you need to know. ZDNet. https://www.zdnet.com/article/what-is-googles-gemini-ai-tool-formerly-bard-everything-you-need-to-know/ (2024).
H. Guinness. The best large language models (LLMs). Zapier. https://zapier.com/blog/best-llm/ (2024).
Y. Liu, J. Cao, C. Liu, K. Ding, L. Jin. Datasets for Large Language Models: A Comprehensive Survey. ArXiv. https://doi.org/10.48550/arXiv.2402.18041 (2024).
IBM. What are large-language models? International Business Machines. https://www.ibm.com/think/topics/large-language-models (2023).
J. Gillham. Originality.AI launches the first AI tool available to detect content created by popular AI writing tools. PRWeb Cision. https://www.prweb.com/releases/originality-ai-launches-the-first-ai-tool-available-to-detect-content-created-by-popular-ai-writing-tools-883852583.html (2022).
J. McCoy. How Do AI Detectors Work and Why It Matters? Brandwell. https://brandwell.ai/blog/how-do-ai-detectors-work/#whatisanaicontentdetector (2024).
D. Weber-Wulff, A. Anohina-Naumeca, S. Bjelobaba, et al. Testing of detection tools for AI-generated text. International Journal for Educational Integrity 19, 26. https://doi.org/10.1007/s40979-023-00146-z (2023).
A. M. Elkhatat, K. Elsaid, and S. Almeer. Evaluating the efficacy of AI content detection tools in differentiating between human and AI-generated text. International Journal for Educational Integrity 19, 17. https://doi.org/10.1007/s40979-023-00140-5 (2023).
OpenAI. New AI classifier for indicating AI-written text. OpenAI. https://openai.com/index/new-ai-classifier-for-indicating-ai-written-text/ (2023).
K. Santoro. How do AI detectors work? Quillbot. https://quillbot.com/blog/ai-writing-tools/how-do-ai-detectors-work/ (2023).
AutoGPT. How to Trick AI Detectors. AutoGPT. https://autogpt.net/how-to-trick-ai-detectors/ (2024).
P. Marinkovic. 4 ways AI content detectors work to spot AI. Surfer SEO. https://surferseo.com/blog/how-do-ai-content-detectors-work/ (2024).
J. Manyika. An overview of the Gemini app. Google Gemini. https://gemini.google/overview-gemini-app.pdf (2024).
OpenAI. How ChatGPT and our foundation models are developed. OpenAI Help Center. https://help.openai.com/en/articles/7842364-how-chatgpt-and-our-foundation-models-are-developed (N.d.).
Perplexity AI. How does Perplexity work? Perplexity AI FAQ. https://www.perplexity.ai/hub/faq/how-does-perplexity-work (N.d.).
Copyleaks. AI Detector by Copyleaks. Copyleaks. https://copyleaks.com/ai-content-detector (N.d.).
Undetectable AI. Advanced AI Detector and AI Checker for ChatGPT and More. Undetectable AI. https://undetectable.ai/ai-detector (N.d.).
Quillbot. AI Detector. Quillbot. https://quillbot.com/ai-content-detector (N.d.).
F. Habibzadeh. GPTZero Performance in Identifying Artificial Intelligence-Generated Medical Texts: A Preliminary Study. Journal of Korean Medical Science, 38, e319. https://doi.org/10.3346/jkms.2023.38.e319 (2023).
A. Baratloo, M. Hosseini, A. Negida, & G. El Ashal. Part 1: Simple Definition and Calculation of Accuracy, Sensitivity and Specificity. Emergency (Tehran, Iran), 3(2), 48–49. https://pmc.ncbi.nlm.nih.gov/articles/PMC4614595/ (2015).
R. West, A. Kobokovich. Factsheet: Understanding the Accuracy of Diagnostic and Serology Tests: Sensitivity and Specificity. Johns Hopkins Center for Health Security. https://centerforhealthsecurity.org/sites/default/files/2022-12/201207-sensitivity-specificty-factsheet.pdf (2020).
National Center for Education Statistics. Average class size in public schools, by class type and state: 2017–18. National Center for Education Statistics. https://nces.ed.gov/surveys/ntps/tables/ntps1718_fltable06_t1s.asp (2018).

6. Appendix – Text Samples

Sample 1.1 (Human-written): ~200 words essay on the role of technology in education

“Technology plays a huge role in education, especially in today’s day and age. From helping teachers grade tests faster, to giving students feedback on their work, the technology we have today is capable of changing the education world forever. They also save tons of paper as teachers are now assigning work virtually. With the rapid development of Artificial Intelligence (AI), there are even more opportunities to revolutionize the education sector even more. Khanmigo, a platform developed by Khan Academy, uses AI to give students real-time assistance. This is another example of what a wonderful role technology can play in education. However, the necessary developments are not being made. While there are many factors to this, the biggest barrier would be funding. Technology is expensive. To implement this at a school or district level is even more expensive. 84% of students attend public school, where funding is tight. There are not enough monetary resources to integrate technology to its fullest capabilities. Without this barrier, schools will have the opportunity to better provide students with a more technological approach to education. Technology’s possible major role in education could be revolutionary and life-changing for many.”

Sample 1.2 (ChatGPT 4o): ~200 words essay on the role of technology in education

“Technology has revolutionized education, reshaping how knowledge is imparted and acquired. It bridges geographical barriers, making quality education accessible to students worldwide through online platforms and virtual classrooms. Tools such as interactive whiteboards, educational apps, and learning management systems enhance engagement, catering to diverse learning styles and paces. In the digital era, technology promotes personalized learning, enabling students to tailor their educational experiences. Adaptive software assesses individual progress, providing resources and exercises to address unique needs. Similarly, access to vast digital libraries and real-time data empowers students and educators with up-to-date knowledge, fostering research and critical thinking skills. Collaboration is another area where technology excels, connecting learners globally. Group projects and discussions can now occur seamlessly across continents, preparing students for a globally interconnected world. Moreover, it equips them with essential technical skills, preparing them for future careers in an increasingly digital economy. However, reliance on technology requires addressing challenges such as the digital divide and ensuring data security. Proper implementation and guidance are essential to maximize its potential while minimizing distractions. In conclusion, technology acts as a powerful catalyst in education, enhancing accessibility, engagement, and innovation. When used responsibly, it has the potential to transform learning, equipping individuals for a dynamic future.”

Sample 2.1 (Human-written): ~200 words essay on the power of positive thinking

Does your mindset actually benefit actions? Would you say that in the right headspace a paint brush becomes an assistant or weapon? The power of positive thinking is often a regarde that isn’t taken seriously enough. We’ve all heard the cheesy sayings, seen the millions of posters and seen every influencer video talk at you in million dollar homes, yet it is a notion that can change the way we function. If we are driven by a higher thought process that guides us, the best way to control that is thinking in a way that will leave a positive impact. We cannot control what has happened or what is about to take place, however, we can change how we feel about it and positive thinking is one of many ways you can do that, providing a serene mental place during disastrous events. Not only does your calm state of mind protect and help you, it also benefits others. Many people might look at you as a role model, being a peaceful person also encourages them to be tranquil. Being calm, which is often provided by positive thinking, ensures your wellbeing in a way that no other type of thinking can match.

Sample 2.2 (Google Gemini 1.5 Flash): ~200 words essay on the power of positive thinking

“Positive thinking is a mental and emotional attitude that focuses on the bright side of life and expects positive outcomes. It involves cultivating an optimistic outlook, even during challenging times. This mindset can significantly impact an individual's overall well-being, influencing their thoughts, feelings, and behaviors. One of the key benefits of positive thinking is its ability to reduce stress and anxiety. When faced with difficulties, positive thinkers are more likely to approach challenges with a solution-oriented perspective, rather than dwelling on negativity. This proactive approach can help individuals cope with stress more effectively, leading to improved mental and emotional health. Furthermore, positive thinking can enhance self-esteem and confidence. By focusing on their strengths and accomplishments, individuals can develop a stronger sense of self-worth. This increased self-belief can empower them to pursue their goals with greater determination and resilience, ultimately leading to increased success in various areas of life. In conclusion, positive thinking is a valuable tool that can significantly improve an individual's overall well-being. By cultivating an optimistic outlook and focusing on the positive aspects of life, individuals can reduce stress, enhance self-esteem, and increase their chances of achieving their goals. While challenges are inevitable, a positive mindset can empower individuals to navigate through difficulties with greater resilience and emerge stronger.”

Sample 3.1 (Human-written): ~200 words essay on the role of Mitochondria in humans

The mitochondria’s importance is a meme that continues to be taken out of context. Those who paid attention in biology know the organelle tends to rely on two membranes yet what do they do exactly? It is understandable how and why the cell would be taken out of context under the namesake of “powerhouse,” yet it doesn’t take away from the fact that the organelle revolves around producing energy that can begin the ATP process. The mitochondria do not live within the bounds of one cellular type; mitochondria can be found in multiples within several types such as liver, muscle and even brain cells. The mitochondria's role within the body is only seen in parts of a whole with the unreliable sources of class; mitochondria will always have its claim to the power house of energy yet it plays more than one role. The romanticized role of mitochondria is what makes them funny when in reality they are there to help process cell death, heat and calcium. Just like energy, these other tasks are important to the whole body. So next time that you're laughing at the powerhouse jokes of the mitochondria, remember that it’s the reason you even can.

Sample 3.2 (Perplexity Free): ~200 words essay on the role of Mitochondria in humans

“Mitochondria, often dubbed the "powerhouses" of human cells, play a pivotal role in cellular energy metabolism. These double-membraned organelles are primarily responsible for producing adenosine triphosphate (ATP) through a process known as oxidative phosphorylation. This process occurs in the inner mitochondrial membrane, where electrons derived from nutrients are transferred through a series of protein complexes, ultimately generating ATP. Remarkably, mitochondria can produce over 30 ATP molecules from a single glucose molecule, making them essential for meeting the high energy demands of various tissues, particularly those with intensive metabolic activity like muscles and the brain. In addition to energy production, mitochondria are integral to several other cellular functions. They regulate metabolic pathways, influence cell growth and differentiation, and are key players in apoptosis, or programmed cell death. This regulation is crucial for maintaining cellular homeostasis and preventing uncontrolled cell proliferation, which can lead to cancer. Mitochondria also help maintain calcium ion balance within cells and play a role in detoxifying ammonia in liver cells. Furthermore, they possess their own circular DNA, which allows them to replicate independently of the cell's nuclear DNA and communicate with other cellular components. Dysfunctional mitochondria are associated with a range of diseases, including neurodegenerative disorders like Alzheimer's and Parkinson's disease, as well as metabolic syndromes and certain cancers. Therefore, understanding mitochondrial biology is not only fundamental to cell physiology but also vital for developing therapeutic strategies aimed at mitigating mitochondrial-related diseases.”
