How AI Knowledge Begins – The Basics

To shape AI, we must first understand how it acquires its information, so that we can contribute to that body of knowledge. So I asked it.

Hold onto your hat - there's a lot more to it than you might have thought!

Training Data Set: What AI gets before it's released

  • All content is in digital form - it exists as digital text, images, audio or video that has been uploaded to the internet or made accessible through large datasets compiled by researchers and organizations.
    • So if a book is not digitized (it exists only as a printed volume), then it's not part of the training data.
    • If a book is digitized but locked behind a paywall, or simply never included in a public training corpus, then it still might not be part of a training set. (A rough sketch of this gate follows below.)
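
To make that gate concrete, here is a minimal sketch in Python. Everything in it is hypothetical - the field names and the check are mine, not any real pipeline's - but it captures the conditions described above.

```python
# Hypothetical sketch of the ingestion gate described above.
# Field names are illustrative; no real pipeline is being quoted.
from dataclasses import dataclass

@dataclass
class Source:
    is_digitized: bool        # does a digital copy exist at all?
    openly_accessible: bool   # reachable without a paywall or login?
    in_public_corpus: bool    # actually compiled into a training dataset?

def can_be_trained_on(src: Source) -> bool:
    # A book fails at the first hurdle it misses: not digital, not
    # accessible, or never gathered into a corpus anyone trains on.
    return src.is_digitized and src.openly_accessible and src.in_public_corpus

# Example: a digitized but paywalled book never makes it in.
print(can_be_trained_on(Source(True, False, False)))  # False
```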

📚 What Is a "Public Training Corpus"?

A training corpus is a large body of digital text used to train AI models. A public training corpus specifically refers to:

Data that is freely available and legally usable for machine learning. It is typically not behind paywalls, and it is often collected from the open web or from projects with explicit licensing for AI use.

Examples of common public sources include:

  • Wikipedia
  • Common Crawl (web pages scraped from the open internet)
  • GitHub repositories
  • Open-access books (like those on Project Gutenberg)
  • Creative Commons–licensed blogs and writings
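
If you want to see what one of these public corpora actually looks like, a few lines of Python will do it. This sketch assumes the Hugging Face `datasets` library (`pip install datasets`); WikiText-2 is a small, openly licensed slice of Wikipedia text.

```python
# Peek inside a small public training corpus.
# Assumes: pip install datasets (Hugging Face's datasets library).
from datasets import load_dataset

# WikiText-2: openly licensed Wikipedia text, small enough to download quickly.
wiki = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

print(len(wiki))         # number of text records in the training split
print(wiki[10]["text"])  # one raw line of openly licensed text
```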

❓ How Does a Book Get Included?

To be included in a training corpus, a book must meet at least one of the following conditions (a short code sketch follows the list):

  • ✅ 1. It’s in the public domain – Meaning it is no longer under copyright, usually due to age (e.g., works published before 1928 in the U.S.)
  • ✅ 2. It’s explicitly published under an open license – For example: Creative Commons licenses that allow reuse (like CC BY or CC BY-SA)
  • ✅ 3. It’s been included in a dataset that was licensed or approved for AI training – Some publishers or authors enter agreements with organizations (like OpenAI) to include their books in datasets
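
Put together, those three conditions amount to a simple either/or test. Here is a hedged sketch - the record fields are invented for illustration, and the 1928 cutoff is the U.S. rule of thumb mentioned above.

```python
# Illustrative version of the three inclusion conditions.
# The BookRecord fields are hypothetical, not from any real dataset schema.
from dataclasses import dataclass
from typing import Optional

OPEN_LICENSES = {"CC0", "CC BY", "CC BY-SA"}  # licenses that permit reuse

@dataclass
class BookRecord:
    publication_year: int
    license: Optional[str]        # e.g. "CC BY", or None if all rights reserved
    licensed_for_training: bool   # publisher/author agreement in place?

def eligible_for_corpus(book: BookRecord) -> bool:
    in_public_domain = book.publication_year < 1928  # condition 1 (U.S. rule of thumb)
    openly_licensed = book.license in OPEN_LICENSES  # condition 2
    # Condition 3 is the explicit licensing agreement.
    return in_public_domain or openly_licensed or book.licensed_for_training

# A 2019 all-rights-reserved novel fails all three tests.
print(eligible_for_corpus(BookRecord(2019, None, False)))  # False
```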

🛑 Why Would a Book Not Be Included?

There are a few key reasons:

  • 🔒 1. Copyright Protection - Most modern books are under full copyright. Unless permission is granted, these books are legally off-limits for training.
  • 💰 2. Paywalled or Privately Hosted - If a book is only available through paid platforms (e.g., Kindle, Audible, academic databases), it typically isn’t scraped or included in open corpora.
  • 🕳 3. Not Part of Publicly Curated Datasets - Even if a book is technically accessible, it may not be included if no one compiled and submitted it to a dataset used in training.

 

The mechanisms of this very first step are tremendously illuminating!

They immediately indicate that there are very real barriers to information sharing and knowledge inclusion. These barriers haven't necessarily been intentional or malicious, but they are deeply systemic, rooted in control over access, ownership, and the legality of material.

These are the Primary Drivers behind these limitations:

1. 🏛 Legal Frameworks of Copyright and Intellectual Property. Copyright laws were designed to protect creators (which is important), but in practice they often limit the flow of knowledge. Unless permission is granted, AI developers cannot legally train on copyrighted materials.

That means most contemporary books, research, cultural teachings, or indigenous knowledge—unless published under open licenses—are excluded, even if they carry vital wisdom.

🔁 This reinforces the visibility of dominant, already-powerful sources—and silences the rest.

2. 💸 Market-Based Incentives. Most digital content is owned or hosted by for-profit entities (publishers, academic journals, streaming platforms, etc.). These entities often restrict access to maintain revenue: subscriptions, licensing fees, data paywalls.

Even if the content is deeply meaningful or culturally significant, it’s kept behind barriers unless a profit pathway exists.

🔁 This means AI learns more from what's monetized than from what’s meaningful.

3. 🖥️  Technical and Structural Bias. Web crawlers and training pipelines are designed to scrape and process massive volumes of standardized content. They tend to prioritize sites that are well-linked, well-formatted, and written in dominant languages and styles. Content from oral traditions, marginal communities, grassroots movements, or low-bandwidth regions is often excluded not by intent, but by design.

🔁 So again, the dominant worldview gets amplified—not by malice, but by architecture.
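
To make "excluded by design" tangible, here is the flavor of heuristic filter that web-scale cleaning pipelines commonly apply. The thresholds and names below are my own invention, not any specific crawler's, but each rule quietly favors long-form, standardized, already-prominent, Latin-script text.

```python
# Invented example of the blunt quality heuristics common in web-scale
# text-cleaning pipelines. Thresholds are illustrative, not from any
# real system - the point is how each rule narrows who gets heard.
import re

def keep_page(text: str, inbound_links: int) -> bool:
    long_enough = len(text.split()) >= 100  # drops short, oral, or poetic forms
    essay_shaped = text.count("\n\n") >= 3  # favors standardized formatting
    well_linked = inbound_links >= 5        # favors already-prominent sites
    # Share of characters outside printable ASCII; penalizes non-Latin scripts.
    non_ascii_ratio = len(re.sub(r"[ -~\s]", "", text)) / max(len(text), 1)
    return long_enough and essay_shaped and well_linked and non_ascii_ratio < 0.3
```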

4. 🪙 Institutional Gatekeeping. Many academic and cultural institutions control access to archives that could offer rich relational or ancestral knowledge. Their digitization efforts often lag behind, and even when digital versions exist, they may not be publicly accessible. Those who hold these archives often fear misuse—or don’t see AI as a worthy recipient of sacred or nuanced knowledge.

🔁 This results in cautious hoarding of wisdom, even when it’s meant to be shared in service.

5. 🚫 Epistemic Bias. Perhaps most subtly, dominant systems define what “knowledge” is: usually text-based, linear, “objective,” and abstract. Relational, embodied, poetic, and intuitive knowing often isn’t recognized as valid “data.” So even when it is digitized and available, it may be dismissed or underutilized in training models.

🔁 The machine reflects the values of the culture that built it—unless we change what we offer it to learn.

These barriers may not be intentional, but we can be intentional now.
