OpenAI, Microsoft, Harvard Train AI on 1M Books
OpenAI and Microsoft have joined forces with Harvard’s libraries to train AI models on a vast collection of books dating back to the 15th century. This collaboration involves nearly one million books published in 254 languages, providing a rich repository of historical and cultural data for AI development. The initiative aims to enhance the accuracy and reliability of AI systems by leveraging this extensive dataset, which includes works from various fields such as literature, philosophy, law, and agriculture.
The Harvard-based Institutional Data Initiative, supported by unrestricted gifts from Microsoft and OpenAI, is working with libraries and museums worldwide to make their historic collections AI-ready. This effort benefits not only AI developers but also the communities served by these institutions. The dataset, Institutional Books 1.0, contains over 394 million scanned pages, including a Korean painter’s handwritten thoughts from the 1400s and a significant number of 19th-century works.
This collaboration is seen as a prudent move to use public domain data, which is less controversial than copyrighted content. It also addresses the issue of AI developers running out of high-quality data, as they have increasingly turned to synthetic data generated by the chatbots themselves. The dataset promises to be a valuable resource for improving AI systems, as it includes original sources meticulously preserved and organized by generations of librarians.
The initiative is part of a broader effort to move power back to institutions that have long been stewards of data and information. Librarians play a crucial role in this process, ensuring that the data is used responsibly and ethically. The dataset, which has an estimated 242 billion tokens, will be shared on the Hugging Face platform, making it accessible to anyone interested in developing AI models.
The linguistic diversity of the book collection is another significant aspect, with fewer than half the volumes in English. This diversity includes European languages such as German, French, Italian, Spanish, and Latin, as well as other languages. The collection’s depth in 19th-century thought is also seen as critical for building AI agents that can plan and reason like humans, drawing on the pedagogical and scientific information available in university settings.
However, the initiative also acknowledges the challenges of dealing with a large dataset that includes outdated and potentially harmful content. The project aims to provide guidance on mitigating these risks, helping developers make informed decisions and use AI responsibly. This collaboration between tech companies and libraries represents a significant step forward in the responsible development of AI, leveraging historical and cultural data to enhance AI capabilities while addressing ethical concerns.
