Tech Ouroboros: When AI Feasts on AI-Generated Data

Much like Superman can only be defeated by Superman, AI's biggest threat has turned out to be AI itself. By now we all know that AI works through machine learning, in which a model trains on data to find underlying patterns and insights and uses them to make predictions.

To enable this, AI needs to be trained on colossal amounts of data. For the longest time, AI has been fed human-collected and human-generated data, such as scientific studies, statistics, and books, giving rise to powerful-seeming generative AI that can produce content in a variety of forms, including text, pictures, and videos.

However, the recent proliferation of AI-generated text has tempted the data scientists who build these systems to feed them content generated by AI itself. Moreover, the ubiquitous presence of AI-generated content on the internet means that future AI models will often learn from the outputs of their predecessors.

Importantly, doing this with large language models erodes the quality of their output. Over time, the process breaks the model's digital brain: it begins producing thoroughly corrupted data. According to researchers at institutions including Stanford and Rice, this raises the threat of “model collapse,” in which AI produces content that is pure, incomprehensible gibberish.

Because AI brings about its own demise by consuming its own generated data, it resembles an ouroboros, the snake that eats its own tail. Researcher Nicolas Papernot compares model collapse to repeatedly photocopying a photocopy: after enough copies of copies, the page retains none of the information in the original. The technical term for this is an autophagous (“self-consuming”) loop, the process by which AI devours its own tail.
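The photocopy-of-a-photocopy effect can be illustrated with a deliberately simplified toy sketch (my own illustration, not code from the cited researchers): here the “model” is just a Gaussian fitted to its training data, and each new generation trains only on synthetic samples drawn from the previous generation's model.

```python
import random
import statistics

random.seed(0)

def fit(data):
    # "Train" the toy model: estimate the mean and standard deviation.
    return statistics.mean(data), statistics.stdev(data)

def generate(model, n):
    # "Generate" n synthetic data points from the fitted model.
    mu, sigma = model
    return [random.gauss(mu, sigma) for _ in range(n)]

def autophagous_loop(generations=2000, n=10):
    # Generation 0 trains on "real" data drawn from N(0, 1); every later
    # generation trains only on the previous generation's synthetic output.
    data = [random.gauss(0.0, 1.0) for _ in range(n)]
    stdevs = []
    for _ in range(generations):
        model = fit(data)
        stdevs.append(model[1])
        data = generate(model, n)
    return stdevs

# Like photocopying a photocopy, small estimation errors compound each
# generation: the estimated spread drifts toward zero and the tails of the
# original distribution are lost.
stdevs = autophagous_loop()
print(f"generation 0 stdev: {stdevs[0]:.3f}")
print(f"generation {len(stdevs) - 1} stdev: {stdevs[-1]:.3g}")
```

Real LLMs are vastly more complex, but the mechanism sketched here, a model learning from its own finite, imperfect samples, is the essence of the autophagous loop.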

This is dangerous: scientists predict it could lead to widespread pollution of the internet, where information is no longer tethered to reality but only to what LLMs prefer reality to be. It is therefore important to insist on diverse data inputs, so that real-world experiences and perspectives can be incorporated into these models to ensure equity.

Sources

https://futurism.com/ai-trained-ai-generated-data-interview

https://www.popularmechanics.com/technology/a44675279/ai-content-model-collapse/

https://www.ece.utoronto.ca/news/training-ai-on-machine-generated-text-could-lead-to-model-collapse-researchers-warn
