Rare Book Monthly

Articles - July - 2025 Issue

Vast Amounts of New Data from Books Being Made Available to AI Chatbot Programs like ChatGPT

A large source of additional information for AI (artificial intelligence) chatbot programs, like OpenAI's ChatGPT or Meta's Llama, has been opened. These are the online programs that answer just about any question you ask in seconds. A type of software known as a “large language model” takes in vast amounts of text, uses it to learn the patterns of language, and then draws on what it has absorbed to answer your question. It is utterly amazing what these programs do, but they can't do it all by themselves. They know nothing but what they are fed, and if they are to respond with knowledge drawn from vast amounts of information, that information must come from somewhere.

 

Much of it comes from the internet, which means they must be smart enough to separate the wheat from the chaff, and “chaff” is an overly polite word for a lot of what is out there. In other words, they also need more reputable sources of information, and books and other publications are an important source for that. However, many (but not all) authors and publishers are not pleased with their work being used without payment. Authors, deservedly, get royalties for their work in books, but not for their work when it is copied and used by AI. They have sued to stop this practice, citing copyright law, as these works are copyrighted.

 

All of this is in the courts, and how it will be resolved is as yet unknown. However, a new source has emerged lately: books in libraries. Harvard University announced that it is making its vast dataset of books from its library available to AI models at no cost. Most of this dataset was created almost two decades ago as part of the Google Books project, in which Google scanned and digitized millions of books at various libraries. Harvard compiled this and more as part of its Institutional Data Initiative at the Harvard Law School Library. Harvard has files for 386 million pages from almost one million books. It is now making them available for services like ChatGPT to learn from and use in answering your questions.

 

This will be helpful, particularly for understanding historic material, but there is one major drawback. It is safe to use these books without risk of being sued because they are out of copyright. For older books published in the United States, copyright lasts at most 95 years from publication. Therefore, none of these books is less than 95 years old. This will not be much good for providing medical advice, even if it sometimes feels like this must be where RFK Jr. gets his medical recommendations. You want the latest opinions for medical diagnoses, and the same goes for other scientific knowledge. Good luck fixing your computer or car with advice that predates 1930, unless you have a Model T. Of course, these programs already have a lot of later information in place (some of which they are being sued to remove). It just means that these 386 million new pages won't add much to the answers you seek for these sorts of questions.

 

It should be noted that some of the information Harvard is providing is more recent, since it is not subject to copyright. One example is legal case law. Court opinions are available for anyone to read, as they must be for legal experts to understand the law. This recent case law is being provided to the AI models that want to add it.

 

 

Update: A few days ago, the first court decision came down in a case of authors suing a chatbot company for copyright violation. The authors lost.


Posted On: 2025-07-09 14:41
User Name: hjrobin

No links in this discussion to the actual data. How un-bibliographic!

