The Importance of Data in Artificial Intelligence: Understanding the Current Landscape

The documentary ‘The Social Dilemma’ put it this way: “If you are not paying for the product, you are the product.” The phrase resonates with many users today because it highlights how applications and services that seem free often come at the expense of our personal information. From social networks and browsers to GPS applications, most digital platforms actively collect user data to drive their business models. The same applies to artificial intelligence (AI) applications, which aim to attract users and establish partnerships to gather data that can enhance their performance.

Improving User Experience. One significant way AI applications benefit from user data is by using it to improve their models. “When you share your content with us, you help our models become more precise since they can better solve your specific problems,” reads a line from the ChatGPT Privacy Policy. Similar statements can be found in the policies of other platforms like Gemini. Interestingly, Anthropic was initially the only company that kept conversations with Claude private; however, it recently announced a policy change. All of this indicates that our personal information, usage data, and particularly our conversations are crucial for the ongoing training and improvement of these models. Users can disable these data-sharing options, but they are enabled by default.

 <img alt="OpenAI's hypothetical social network does not want to connect people. It wants your data to train its AI" width="375" height="142" src="https://i.blogs.es/9ce6fb/ghibli/375_142.png"/>

The Data Shortage Crisis. AI requires an immense amount of data to perform well. Early language models were trained on many kinds of content, including copyright-protected material such as books and artwork. However, data is not infinite. Discussions about a potential data shortage had already begun by late 2021, and by early 2023, Elon Musk suggested that AI had already consumed all human knowledge. A shortage of this kind poses significant challenges for the progress of AI and may be contributing to an overall slowdown in development.

Innovative Solutions. In response to the data deficit, AI companies have begun exploring alternative methods for data acquisition. OpenAI, for instance, has transcribed over a million hours of YouTube content to train its GPT-4 model. Meanwhile, Google has opted to use every piece of information available on the Internet to enhance its AI capabilities. More controversially, Musk suggests that the future of AI lies in synthetic data generated by AI itself. Another valuable resource for training AI, perhaps surprisingly, is our conversations with these applications. Early models had few user interactions to draw on, but the influx of data from a base of nearly 800 million users now provides a rich vein of information for improvement.

The Data-Driven Alliance. User interactions are not just valuable on an individual scale; they can be extremely valuable when aggregated from specific demographic groups. According to Rest of World, AI companies are forming alliances with other organizations to access data that cannot be gathered through standard web scraping. OpenAI has teamed up with Shopee to offer its Plus plan to users in Indonesia, Vietnam, and Thailand. Similarly, Google offers its Gemini Pro plan free for a year to students in India, while Perplexity Pro is available for free through telecom operators such as Movistar and India's Airtel. These partnerships not only broaden the companies' user bases but also give them real-time usage data from specific groups, allowing for more accurate model training.

China’s Strategic Advantage. China exemplifies how access to specific data can significantly enhance the development of effective AI solutions. Pharmaceutical research companies leveraging AI can tap into data from the national health system, which covers more than 600 million individuals. This massive dataset gives Chinese companies a competitive edge, enabling them to sign multimillion-dollar agreements with major pharmaceutical firms.

The Call for Regulation. With the rapid expansion of AI capabilities, experts like Sameer Patil from the Observer Research Foundation are calling for clearer regulations, especially in sensitive sectors like health and finance. “Participating companies must ensure that data sets are anonymized and not personalized,” he states in an interview with Rest of World.

Image | ChatGPT



