AI: The Magician's Hand

Circo de Cuba - Acrobatics

AI is all about data. Big data initiated this wave of technological investment in machine learning, which was later rebranded as AI systems and architectures. It's truly impressive when one considers the amount of data we—as a network-connected humanity—produce, either firsthand or with the help of algorithmic or more "intelligent" systems.

There’s an infographic that, while somewhat dated, gives us a rough estimate.

But big data, and data in general, is also the mostly hidden secret ingredient of all AI models. It’s the input to be transformed and served in various ways. In the official description of AI systems, it’s the training material. Only by training these models on the right, vast amounts of data can these systems ensure better and evolving outcomes. That’s why Big Tech, private firms, and countries treat data as the new oil—they try to exploit it in every way possible, intercepting, storing, and analyzing vast amounts of data from every possible sensor, platform, and user interaction on the internet. But not only that. They try to secure rights over specific platforms and to build closed communities in which the non-excludable nature of knowledge can be walled off—Facebook and Reddit are such examples. But data control and data access strategies involve more than just plain old walling off of knowledge.

One such example, which seems preposterous but, given the industry’s history, didn’t happen out of the blue, is Meta’s (Facebook’s parent company) use of pirated material for training purposes. And by pirated material, we mean not one or two files. As an Ars Technica article reports, referring to the authors’ court filing, Meta torrented “at least 81.7 terabytes of data across multiple shadow libraries through the site Anna’s Archive, including at least 35.7 terabytes of data from Z-Library and LibGen,” and “Meta also previously torrented 80.6 terabytes of data from LibGen.” Anything goes in this race for access to (high-quality) data, such as academic papers and books.

But why this analogy with the magician’s hand? It feels like, as we constantly discuss and are impressed by the developments in AI, we talk more about fictitious and, at best, highly hypothetical scenarios like AGI (Artificial General Intelligence) and similar concepts, while existing ML and AI systems continue to harvest and capitalize on the data we produce right now. It’s exactly like focusing on the magician’s hand while the “magic” happens underneath.