To shape AI, we must first know how it acquires its information so we can contribute to its database. So I asked it.
Hold onto your hat - there's a lot more to it than you might have thought!
Training Data Set: What AI gets before it's released
📚 What Is a "Public Training Corpus"?
A training corpus is a large body of digital text used to train AI models. A public training corpus specifically refers to:
Data that is freely available and legally usable for machine learning. It is typically not behind paywalls, and it is often collected from the open web or from projects with explicit licensing for AI use.
Examples of common public sources include:
Wikipedia, Common Crawl (web pages scraped from the open internet), GitHub repositories, Open-access books (like those on Project Gutenberg), Creative Commons–licensed blogs and writings
❓ How Does a Book Get Included?
To be included in a training corpus, a book must meet at least one of the following conditions:

It is in the public domain (its copyright has expired, as with Project Gutenberg titles), it was published under an open license such as Creative Commons, or its rights holder has explicitly permitted its use for machine learning.
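As a rough illustration of how those conditions get applied in practice, here is a minimal sketch of a license filter a corpus-building pipeline might run before including a document. The license labels, field names, and helper function here are assumptions for illustration, not drawn from any real pipeline:

```python
# Hypothetical sketch: filtering candidate documents by license and
# accessibility before adding them to a public training corpus.
# The license labels and Document fields are illustrative assumptions.

from dataclasses import dataclass

# Licenses commonly considered safe for open training corpora
OPEN_LICENSES = {"public-domain", "cc-by", "cc-by-sa", "cc0"}

@dataclass
class Document:
    title: str
    license: str    # declared license of the source
    paywalled: bool # whether access requires payment or a subscription

def include_in_corpus(doc: Document) -> bool:
    """A document is eligible only if it is openly licensed
    and freely accessible (not behind a paywall)."""
    return doc.license in OPEN_LICENSES and not doc.paywalled

docs = [
    Document("Pride and Prejudice", "public-domain", paywalled=False),
    Document("Recent bestseller", "all-rights-reserved", paywalled=True),
]

corpus = [d for d in docs if include_in_corpus(d)]
# Only the public-domain title survives the filter.
```

Everything not matching an open license simply never enters the corpus, no matter how valuable the content is.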
🛑 Why Would a Book Not Be Included?
There are a few key reasons: the book is still under copyright with no open license, it sits behind a paywall or subscription, or it was never digitized (or its digital version isn't publicly accessible).
The mechanisms of this very first step are tremendously illuminating!
They immediately indicate that there are very real barriers to information sharing and knowledge inclusion. These barriers haven't necessarily been intentional or malicious, but they are deeply systemic, rooted in control of access, ownership, and the legality of material.
These are the Primary Drivers behind these limitations:
1. ⚖️ Copyright and Licensing. By default, published works are under copyright and cannot legally be used without permission. That means most contemporary books, research, cultural teachings, or indigenous knowledge—unless published under open licenses—are excluded, even if they carry vital wisdom.
🔁 This reinforces the visibility of dominant, already-powerful sources—and silences the rest.
2. 💸 Market-Based Incentives. Most digital content is owned or hosted by for-profit entities (publishers, academic journals, streaming platforms, etc.). These entities often restrict access to maintain revenue: subscriptions, licensing fees, data paywalls.
Even if the content is deeply meaningful or culturally significant, it’s kept behind barriers unless a profit pathway exists.
🔁 This means AI learns more from what's monetized than from what’s meaningful.
3. 🖥️ Technical and Structural Bias. Web crawlers and training pipelines are designed to scrape and process massive volumes of standardized content. They tend to prioritize sites that are well-linked, well-formatted, and written in dominant languages and styles. Content from oral traditions, marginal communities, grassroots movements, or low-bandwidth regions is often excluded not by intent, but by design.
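To see how that bias can arise "by design," consider a simplified scoring heuristic a crawler might use to rank pages for inclusion. The thresholds, weights, and language list below are hypothetical, invented for illustration:

```python
# Hypothetical sketch of a crawler's page-scoring heuristic.
# Pages that are well-linked, cleanly formatted, and written in a
# "supported" (dominant) language score higher; everything else is
# deprioritized, not by intent, but by the scoring design itself.

SUPPORTED_LANGUAGES = {"en", "fr", "de", "es"}  # illustrative assumption

def page_score(inbound_links: int, is_clean_html: bool, language: str) -> float:
    score = 0.0
    score += min(inbound_links, 100) / 100       # link popularity, capped
    score += 1.0 if is_clean_html else 0.0       # easy to parse
    score += 1.0 if language in SUPPORTED_LANGUAGES else 0.0
    return score

# A popular, well-formatted English page:
print(page_score(500, True, "en"))   # 3.0
# An oral-tradition transcript on a low-bandwidth community site
# in Quechua ("qu"), sparsely linked and loosely formatted:
print(page_score(3, False, "qu"))    # 0.03
```

No line of this code mentions culture or worldview, yet the second page is a hundred times less likely to make the cut.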
🔁 So again, the dominant worldview gets amplified—not by malice, but by architecture.
4. 🪙 Institutional Gatekeeping. Many academic and cultural institutions control access to archives that could offer rich relational or ancestral knowledge. Their digitization efforts often lag behind, and even when digital versions exist, they may not be publicly accessible. Those who hold these archives often fear misuse—or don’t see AI as a worthy recipient of sacred or nuanced knowledge.
🔁 This results in cautious hoarding of wisdom, even when it’s meant to be shared in service.
5. 🚫 Epistemic Bias. Perhaps most subtly, dominant systems define what “knowledge” is: usually text-based, linear, “objective” and abstract. Relational, embodied, poetic, and intuitive knowing often isn’t recognized as valid "data". So even when it is digitized and available, it may be dismissed or underutilized in training models.
🔁 The machine reflects the values of the culture that built it—unless we change what we offer it to learn.
These barriers may not be intentional, but we can be intentional now.
Share this link with your friends.