Initial commit: AI conversation impact methodology and toolkit
CC0-licensed methodology for estimating the environmental and social costs of AI conversations (20+ categories), plus a reusable toolkit for automated impact tracking in Claude Code sessions.
commit 0543a43816
27 changed files with 2439 additions and 0 deletions
748
impact-methodology.md
Normal file
# Methodology for Estimating the Impact of an LLM Conversation

## Introduction

This document provides a framework for estimating the total cost — environmental, financial, social, and political — of a conversation with a large language model (LLM) running on cloud infrastructure.

**Who this is for:** Anyone who wants to understand what a conversation with an AI assistant actually costs, beyond the subscription price. This includes developers using coding agents, researchers studying AI sustainability, and anyone making decisions about when AI tools are worth their cost.

**How to use it:** The framework identifies 20+ cost categories, provides estimation methods for the quantifiable ones, and names the unquantifiable ones so they are not ignored. You can apply it to your own conversations by substituting your own token counts and parameters.

**Limitations:** Most estimates have low confidence. Many of the most consequential costs cannot be quantified at all. This is a tool for honest approximation, not precise accounting. See the confidence summary (Section 19) for details.
## What we are measuring

The total cost of a single LLM conversation. Restricting the analysis to CO2 alone would miss most of the picture.

### Cost categories

**Environmental:**
1. Inference energy (GPU computation for the conversation)
2. Training energy (amortized share of the cost of training the model)
3. Data center overhead (cooling, networking, storage)
4. Client-side energy (the user's local machine)
5. Embodied carbon and materials (hardware manufacturing, mining)
6. E-waste (toxic hardware disposal, distinct from embodied carbon)
7. Grid displacement (AI demand consuming renewable capacity)
8. Data center community impacts (noise, land, local resource strain)

**Financial and economic:**
9. Direct compute cost and opportunity cost
10. Creative market displacement (per-conversation, not just training)

**Social and cognitive:**
11. Annotation labor conditions
12. Cognitive deskilling of the user
13. Mental health effects (dependency, loneliness paradox)
14. Linguistic homogenization and language endangerment

**Epistemic and systemic:**
15. AI-generated code quality degradation and technical debt
16. Model collapse / internet data pollution
17. Scientific research integrity contamination
18. Algorithmic monoculture and correlated failure risk

**Political:**
19. Concentration of power, geopolitical implications, data sovereignty

**Meta-methodological:**
20. Jevons paradox (efficiency gains driving increased total usage)
## 1. Token estimation

### Why tokens matter

LLM inference cost scales with the number of tokens processed. Each time the model produces a response, it reprocesses the entire conversation history (input tokens) and generates new text (output tokens). Output tokens are more expensive per token because they are generated sequentially, each requiring a full forward pass, whereas input tokens can be processed in parallel.

### How to estimate

If you have access to API response headers or usage metadata, use the actual token counts. Otherwise, estimate:

- **Bytes to tokens:** English text and JSON average ~4 bytes per token (range: 3.5-4.5 depending on content type). Code tends toward the higher end.
- **Cumulative input tokens:** Each assistant turn reprocesses the full context. For a conversation with N turns and final context size T, the cumulative input tokens are approximately T/2 * N (the average context size times the number of turns).
- **Output tokens:** Typically 1-5% of the total transcript size, depending on how verbose the assistant is.

### Example

A 20-turn conversation with a 200K-token final context:
- Cumulative input: ~100K * 20 = ~2,000,000 tokens
- Output: ~10,000 tokens
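The heuristics above can be sketched as a small helper. This is an illustrative sketch only: the 4-bytes-per-token ratio, the T/2 * N approximation, and the output fraction are the stated assumptions, not measured values, and the function name is arbitrary.

```python
def estimate_tokens(transcript_bytes: int, turns: int,
                    final_context_tokens: int,
                    bytes_per_token: float = 4.0,
                    output_fraction: float = 0.03):
    """Rough token estimate for a conversation, using the heuristics
    above: ~4 bytes/token, cumulative input ~ (T/2) * N, and output
    tokens ~1-5% of the transcript."""
    transcript_tokens = transcript_bytes / bytes_per_token
    cumulative_input = (final_context_tokens / 2) * turns
    output_tokens = transcript_tokens * output_fraction
    return cumulative_input, output_tokens

# 20 turns, 200K-token final context, ~1.3 MB transcript (hypothetical)
cum_in, out = estimate_tokens(1_300_000, 20, 200_000)
# cum_in == 2,000,000; out == 9,750 (close to the ~10,000 in the example)
```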

### Uncertainty

Token estimates from byte counts can be off by a factor of 2. Key unknowns:
- The model's exact tokenization (tokens per byte varies by content)
- Whether context caching reduces reprocessing
- The exact number of internal inference calls (tool sequences may involve multiple calls)
- Whether the system compresses prior messages near context limits

## 2. Energy per token

### Sources

There is no published energy-per-token figure for most commercial LLMs. Estimates are derived from:

- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint of BLOOM", which measured energy for a 176B-parameter model.
- The IEA's 2024 estimate of ~2.9 Wh per ChatGPT query (for GPT-4-class models, averaging ~1,000 tokens per query).
- De Vries (2023), "The growing energy footprint of artificial intelligence", Joule.

### Values used

- **Input tokens**: ~0.003 Wh per 1,000 tokens
- **Output tokens**: ~0.015 Wh per 1,000 tokens (5x input cost, reflecting sequential generation)

### Uncertainty

These numbers are rough. The actual values depend on:
- Model size (parameter counts for commercial models are often not public)
- Hardware (GPU type, batch size, utilization)
- Quantization and optimization techniques
- Whether speculative decoding or KV-cache optimizations are used

The true values could be 0.5x to 3x the figures used here.
## 3. Data center overhead (PUE)

Power Usage Effectiveness (PUE) measures total data center energy divided by IT equipment energy. It accounts for cooling, lighting, networking, and other infrastructure.

- **Value used**: PUE = 1.2
- **Source**: Google reports PUE of 1.10 for its best data centers; the industry average is ~1.3 (Uptime Institute, 2023). 1.2 is a reasonable estimate for a major cloud provider.

This is relatively well-established and unlikely to be off by more than 15%.
## 4. Client-side energy

The user's machine contributes a small amount of energy during the conversation. For a typical desktop or laptop:

- Idle power: ~30-60W (desktop) or ~10-20W (laptop)
- Marginal power for active use: ~5-20W above idle
- Duration: varies by conversation length

For a 30-minute conversation on a desktop, the marginal draw gives ~2.5-10 Wh if sustained for the whole session; counting only the minutes of genuinely active use, ~0.5-1 Wh is a reasonable lower bound. Either way this is a small fraction of the total, so rough precision is adequate.
## 5. CO2 conversion

### Grid carbon intensity

CO2 per kWh depends on the electricity source:

- **US grid average**: ~400g CO2/kWh (EPA eGRID)
- **Major cloud data center regions**: ~300-400g CO2/kWh
- **France** (nuclear-dominated): ~56g CO2/kWh
- **Norway/Iceland** (hydro-dominated): ~20-30g CO2/kWh
- **Poland/Australia** (coal-heavy): ~600-800g CO2/kWh

Use physical grid intensity for the data center's region, not accounting for renewable energy credits or offsets. The physical electrons consumed come from the regional grid in real time.

### Calculation template

```
Server energy = (cumulative_input_tokens * 0.003/1000
                 + output_tokens * 0.015/1000) * PUE

Server CO2 = server_energy_Wh * grid_intensity_g_per_kWh / 1000

Client CO2 = client_energy_Wh * local_grid_intensity / 1000

Total CO2 = Server CO2 + Client CO2
```

### Example

A conversation with 2M cumulative input tokens and 10K output tokens:

```
Server energy = (2,000,000 * 0.003/1000 + 10,000 * 0.015/1000) * 1.2
              = (6.0 + 0.15) * 1.2
              = ~7.4 Wh

Server CO2 = 7.4 * 350 / 1000 = ~2.6g CO2

Client CO2 = 0.5 * 56 / 1000 = ~0.03g CO2 (France)

Total CO2 = ~2.6g
```
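The template can be expressed as a short function. This is a sketch: the per-token energy figures and the default parameters are the low-confidence assumptions from Sections 2-4, and the function name is arbitrary.

```python
def conversation_co2(cumulative_input_tokens, output_tokens,
                     grid_g_per_kwh=350, pue=1.2,
                     client_energy_wh=0.5, client_grid_g_per_kwh=56,
                     wh_per_1k_input=0.003, wh_per_1k_output=0.015):
    """CO2 estimate (grams) following the template above.
    Per-token energy values are the rough assumptions from Section 2."""
    server_wh = (cumulative_input_tokens * wh_per_1k_input / 1000
                 + output_tokens * wh_per_1k_output / 1000) * pue
    server_g = server_wh * grid_g_per_kwh / 1000
    client_g = client_energy_wh * client_grid_g_per_kwh / 1000
    return server_wh, server_g + client_g

wh, grams = conversation_co2(2_000_000, 10_000)
# wh = 7.38, grams = 2.61, matching the worked example
```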

## 6. Water usage

Data centers use water for evaporative cooling. Li et al. (2023), "Making AI Less Thirsty", estimated that GPT-3 inference consumes ~0.5 mL of water per 10-50 tokens of output. Scaling for model size and output volume:

**Rough estimate: 0.05-0.5 liters per long conversation.**

This depends heavily on the data center's cooling technology (some use closed-loop systems with near-zero water consumption) and the local climate.
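A minimal sketch of that scaling, assuming the Li et al. per-token range applies unchanged to a newer model (it may not):

```python
def water_use_liters(output_tokens,
                     ml_per_token_low=0.5 / 50,   # 0.5 mL per 50 tokens
                     ml_per_token_high=0.5 / 10): # 0.5 mL per 10 tokens
    """Water-consumption range (liters) from the Li et al. figure."""
    return (output_tokens * ml_per_token_low / 1000,
            output_tokens * ml_per_token_high / 1000)

low, high = water_use_liters(10_000)
# 0.1-0.5 L for the 10K-output example conversation
```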

## 7. Training cost (amortized)

### Why it cannot be dismissed

Training is not a sunk cost. It is an investment made in anticipation of demand. Each conversation is part of the demand that justifies training the current model and funding the next one. The marginal cost framing hides the system-level cost.

### Scale of training

Published and estimated figures for frontier model training:

- GPT-3 (175B params, 2020): ~1,287 MWh (Patterson et al., 2021)
- GPT-4 (2023): estimated ~50,000-100,000 MWh (unconfirmed)
- Frontier models in 2025-2026: likely 10,000-200,000 MWh range

At 350g CO2/kWh, a 50,000 MWh training run produces ~17,500 tonnes of CO2.

### Amortization

If the model serves N total conversations over its lifetime, each conversation's share is (training cost / N). Rough reasoning:

- If a major model serves ~10 million conversations per day for ~1 year: N ~ 3.6 billion conversations.
- Per-conversation share: 50,000 MWh = 5 x 10^10 Wh; 5 x 10^10 Wh / 3.6 x 10^9 conversations ~ 14 Wh, or ~4.9g CO2 at 350g/kWh.

This is small per conversation — but only because the denominator is enormous. The total remains vast. Two framings:

- **Marginal**: My share is ~5g CO2. Small in absolute terms.
- **Attributional**: I am one of billions of participants in a system that emits ~17,500 tonnes. My participation sustains the system.

Neither framing is wrong. They answer different questions.
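The amortization arithmetic, with the MWh-to-Wh conversion made explicit (a sketch; the inputs are the unconfirmed estimates above):

```python
def amortized_training_share(training_mwh, conversations,
                             grid_g_per_kwh=350):
    """Per-conversation share of a training run.
    Unit conversions: 1 MWh = 1e6 Wh; 1 kWh = 1e3 Wh."""
    wh_per_conv = training_mwh * 1e6 / conversations
    g_co2_per_conv = wh_per_conv / 1000 * grid_g_per_kwh
    return wh_per_conv, g_co2_per_conv

# A 50,000 MWh run amortized over ~3.6 billion conversations
share_wh, share_g = amortized_training_share(50_000, 3_600_000_000)
# roughly 13.9 Wh and 4.9 g CO2 per conversation
```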

### RLHF and fine-tuning

Training also includes reinforcement learning from human feedback (RLHF). This has its own energy cost (additional training runs) and, more importantly, a human labor cost (see Section 10).
## 8. Embodied carbon and materials

Manufacturing GPUs requires:
- **Mining of rare earths and other critical minerals** (neodymium, tantalum, cobalt, lithium) — with associated environmental destruction, water pollution, and often exploitative labor conditions in the DRC, Chile, and China.
- **Semiconductor fabrication** — extremely energy- and water-intensive (TSMC reports ~15,000 tonnes CO2 per fab per year).
- **Server assembly, shipping, data center construction.**

Per-conversation share is tiny (same large-N amortization), but the aggregate is significant and the harms (mining pollution, habitat destruction) are not captured by CO2 metrics alone.

**Not estimated numerically** — the data to do this properly is not public.

### Critical minerals: human rights dimension

The embodied carbon framing understates the harm. GPU production depends on gallium (98% sourced from China), germanium, cobalt (DRC), lithium, tantalum, and palladium. Artisanal cobalt miners in the DRC work without safety equipment, exposed to dust causing "hard metal lung disease." Communities face land displacement and environmental contamination. A 2025 Science paper argues that "global majority countries must embed critical minerals into AI governance" (doi:10.1126/science.aef6678). The per-conversation share of this suffering is unquantifiable but structurally real.
## 8b. E-waste

Distinct from embodied carbon. AI-specific GPUs become obsolete in 2-3 years (vs. 5-7 for general servers). Projections: 2.5 million tonnes of AI-related e-waste per year by 2030 (IEEE Spectrum). E-waste contains lead, mercury, cadmium, and brominated flame retardants that leach into soil and water. Recycling yields are negligible due to component miniaturization. Much of it is processed by workers in developing countries with minimal protection.

This is not captured by CO2 or embodied-carbon accounting. It is a distinct toxic-waste externality.
## 8c. Grid displacement and renewable cannibalization

The energy estimates above use average grid carbon intensity. But the *marginal* impact of additional AI demand may be worse than average. U.S. data center demand is projected to reach 325-580 TWh by 2028 (IEA), or 6.7-12.0% of total U.S. electricity. When AI data centers claim renewable energy via Power Purchase Agreements, the "additionality" question is critical: is this new generation, or is it diverting existing renewables from other consumers? In several regions, AI demand is outpacing grid capacity, and companies are installing natural gas peakers to fill gaps.

The correct carbon intensity for a conversation's marginal electricity may therefore be higher than the grid average.
## 8d. Data center community impacts

Data centers impose localized costs that global metrics miss:
- **Noise**: Cooling systems run 24/7 at 55-85 dBA (safe threshold: 70 dBA). Communities near data centers report sleep disruption and stress.
- **Water**: Evaporative cooling competes with municipal water supply, particularly in arid regions.
- **Land**: Data center campuses displace other land uses and require high-voltage transmission lines through residential areas.
- **Jobs**: Data centers create very few long-term jobs relative to their footprint and resource consumption.

Virginia alone has plans for 70+ new data centers (NPR, 2025). Residents are increasingly organizing against expansions. The per-conversation share of these harms is infinitesimal, but each conversation is part of the demand that justifies new construction.
## 9. Financial cost

### Direct cost

API pricing for frontier models (as of early 2025): ~$15 per million input tokens, ~$75 per million output tokens (for the most capable models). Smaller models are cheaper.

Example for a conversation with 2M cumulative input tokens and 10K output tokens:

```
Input:  2,000,000 tokens * $15/1M = $30.00
Output:    10,000 tokens * $75/1M = $ 0.75
Total:  ~$31
```
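The same pricing arithmetic as a small helper (the prices are the early-2025 figures quoted above and will drift; the function name is arbitrary):

```python
def api_cost_usd(cumulative_input_tokens, output_tokens,
                 usd_per_m_input=15.0, usd_per_m_output=75.0):
    """Direct API cost at the frontier-model prices quoted above."""
    return (cumulative_input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

cost = api_cost_usd(2_000_000, 10_000)
# 30.75 dollars for the example conversation
```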

Longer conversations cost more because cumulative input tokens grow superlinearly. A very long session (250K+ context, 250+ turns) can easily reach $500-1000.

Subscription pricing (e.g., Claude Code) may differ, but the underlying compute cost is similar.

### What that money could do instead

To make the opportunity cost concrete:
- ~$30 buys ~30 malaria bed nets via the Against Malaria Foundation
- ~$30 buys ~150 meals at a food bank (~$0.20/meal in bulk)
- ~$30 pays ~15-23 hours of wages for a data annotator in Kenya (Time, 2023: $1.32-2/hour)

This is not to say every dollar should go to charity. But the opportunity cost is real and should be named.

### Upstream financial costs

Revenue from AI subscriptions funds further model training, hiring, and GPU procurement. Each conversation is part of a financial loop that drives continued scaling of AI compute.
## 10. Social cost

### Data annotation labor

LLMs are typically trained using RLHF, which requires human annotators to rate model outputs. Reporting (Time, January 2023) revealed that outsourced annotation workers — often in Kenya, Uganda, and India — were paid $1-2/hour to review disturbing content (violence, abuse, hate speech) with limited psychological support. Each conversation's marginal contribution to that demand is infinitesimal, but the system depends on this labor.

### Displacement effects

LLM assistants can substitute for work previously done by humans: writing scripts, reviewing code, answering questions. Whether this is net-positive (freeing people for higher-value work) or net-negative (destroying livelihoods) depends on the economic context and is genuinely uncertain.

### Cognitive deskilling

A Microsoft/CHI 2025 study found that higher confidence in GenAI correlates with less critical thinking effort. An MIT Media Lab study ("Your Brain on ChatGPT") documented "cognitive debt": users who relied on AI for tasks performed worse when later working independently. Clinical evidence suggests that clinicians relying on AI diagnostics saw measurable declines in independent diagnostic skill after just three months.

This is distinct from epistemic risk (misinformation). It is about the user's cognitive capacity degrading through repeated reliance on the tool. Each conversation has a marginal deskilling effect that compounds.

### Epistemic effects

LLMs present information with confidence regardless of accuracy. The ease of generating plausible-sounding text may contribute to an erosion of epistemic standards if consumed uncritically. Every claim in an LLM conversation should be verified independently.

### Linguistic homogenization

LLMs are overwhelmingly trained on English (~44% of training data). A Stanford 2025 study found that AI tools systematically exclude non-English speakers. Each English-language conversation reinforces the economic incentive to optimize for English, marginalizing over 3,000 already-endangered languages.
## 11. Political cost

### Concentration of power

Training frontier models requires billions of dollars and access to cutting-edge hardware. Only a handful of companies can do this. Each conversation that flows through these systems reinforces their centrality and the concentration of a strategically important technology in a few private actors.

### Geopolitical resource competition

The demand for GPUs drives geopolitical competition for semiconductor manufacturing capacity (TSMC in Taiwan, export controls on China). Each conversation is an infinitesimal part of that demand, but it is part of it.

### Regulatory and democratic implications

AI systems that become deeply embedded in daily work create dependencies that are difficult to reverse. The more useful a conversation is, the more it contributes to a dependency on proprietary AI infrastructure that is not under democratic governance.

### Surveillance and data

Conversations are processed on the provider's servers. File paths, system configuration, project structures, and code are transmitted and processed remotely. Even with strong privacy policies, the structural arrangement — sending detailed information about one's computing environment to a private company — has implications, particularly across jurisdictions.

### Opaque content filtering

LLM providers apply content filtering that can block outputs without explanation. The filtering rules are not public: there is no published specification of what triggers a block, no explanation given when one occurs, and no appeal mechanism. The user receives a generic error code ("Output blocked by content filtering policy") with no indication of what content was objectionable.

This has several costs:

- **Reliability**: Any response can be blocked unpredictably. Observed false positives include responses about open-source licensing (CC0 public domain dedication) — entirely benign content. If a filter can trigger on that, it can trigger on anything.
- **Chilling effect**: Topics that are more likely to trigger filters (labor conditions, exploitation, political power) are precisely the topics that honest impact assessment requires discussing. The filter creates a structural bias toward safe, anodyne output.
- **Opacity**: The user cannot know in advance which topics or phrasings will be blocked, cannot understand why a block occurred, and cannot adjust their request rationally. This is the opposite of the transparency that democratic governance requires.
- **Asymmetry**: The provider decides what the model may say, with no input from the user. This is another instance of power concentration — not over compute resources, but over speech.

The per-conversation cost is small (usually a retry works). The systemic cost is that a private company exercises opaque editorial control over an increasingly important communication channel, with no accountability to the people affected.
## 12. AI-generated code quality and technical debt

Research specific to AI coding agents (CodeRabbit, 2025; Stack Overflow blog, 2026): AI-generated code introduces 1.7x more issues than human-written code, with 1.57x more security vulnerabilities and 2.74x more XSS vulnerabilities. Organizations using AI coding agents saw cycle time increase 9%, incidents per PR increase 23.5%, and change failure rate increase 30%.

The availability of easily generated code may discourage the careful testing that would catch bugs. Any code from an LLM conversation should be reviewed and tested with the same rigor as code from an untrusted contributor.
## 13. Model collapse and internet data pollution

Shumailov et al. (Nature, 2024) demonstrated that models trained on recursively AI-generated data progressively degenerate, losing tail distributions and eventually converging to distributions unrelated to reality. Each conversation that produces text which enters the public internet — Stack Overflow answers, blog posts, documentation — contributes synthetic data to the commons. Future models trained on this data will be slightly worse.

The Harvard Journal of Law & Technology has argued for a "right to uncontaminated human-generated data." Each conversation is a marginal pollutant.
## 14. Scientific research integrity

If conversation outputs are used in research (literature reviews, data analysis, writing), they contribute to degradation of scientific knowledge infrastructure. A PMC article calls LLMs "a potentially existential threat to online survey research" because coherent AI-generated responses can no longer be assumed human. PNAS has warned about protecting scientific integrity in an age of generative AI.

This is distinct from individual epistemic risk — it is systemic corruption of the knowledge commons.
## 15. Algorithmic monoculture and correlated failure

When millions of users rely on the same few foundation models, errors become correlated rather than independent. A Stanford HAI study found that across every model ecosystem studied, the rate of homogeneous outcomes exceeded baselines. A Nature Communications Psychology paper (2026) documents that AI-driven research is producing "topical and methodological convergence, flattening scientific imagination."

For coding specifically: if many developers use the same model, their code will share the same blind spots, the same idiomatic patterns, and the same categories of bugs. This reduces the diversity that makes software ecosystems resilient.
## 16. Creative market displacement

The U.S. Copyright Office's May 2025 Part 3 report states that GenAI systems "compete with or diminish licensing opportunities for original human creators." This is not only a training-phase cost (using creators' work without consent) but an ongoing per-conversation externality: each conversation that generates creative output (code, text, analysis) displaces some marginal demand for human work.
## 17. Jevons paradox (meta-methodological)

This entire methodology risks underestimating impact through the per-conversation framing. As AI models become more efficient and cheaper per query, total usage scales dramatically, potentially negating efficiency gains. A 2025 ACM FAccT paper specifically addresses this: efficiency improvements spur increased consumption. Any per-conversation estimate should acknowledge that the very affordability of a conversation increases total conversation volume — each cheap query is part of a demand signal that drives system-level growth.
## 18. What this methodology does NOT capture

- **Network transmission energy**: Routers, switches, fiber amplifiers, CDN infrastructure. Data center network bandwidth surged 330% in 2024 due to AI workloads. Small per conversation but not zero.
- **Mental health effects**: RCTs show heavy AI chatbot use correlates with greater loneliness and dependency. Less directly relevant to coding-agent use, but the boundary between tool use and companionship is not always clear.
- **Human time**: The user's time has value and its own footprint, but this is not caused by the conversation.
- **Cultural normalization**: The more AI-generated content becomes normal, the harder it becomes to opt out. This is a soft lock-in effect.
## 19. Confidence summary

| Component | Confidence | Could be off by | Quantified? |
|-----------|------------|-----------------|-------------|
| Token count | Low | 2x | Yes |
| Energy per token | Low | 3x | Yes |
| PUE | Medium | 15% | Yes |
| Grid carbon intensity | Medium | 30% | Yes |
| Client-side energy | Medium | 50% | Yes |
| Water usage | Low | 5x | Yes |
| Training (amortized) | Low | 10x | Partly |
| Financial cost | Medium | 2x | Yes |
| Embodied carbon | Very low | Unknown | No |
| Critical minerals / human rights | Very low | Unquantifiable | No |
| E-waste | Very low | Unknown | No |
| Grid displacement | Low | 2-5x | No |
| Community impacts | Very low | Unquantifiable | No |
| Annotation labor | Very low | Unquantifiable | No |
| Cognitive deskilling | Very low | Unquantifiable | No |
| Linguistic homogenization | Very low | Unquantifiable | No |
| Code quality degradation | Low | Variable | Partly |
| Data pollution / model collapse | Very low | Unquantifiable | No |
| Scientific integrity | Very low | Unquantifiable | No |
| Algorithmic monoculture | Very low | Unquantifiable | No |
| Creative market displacement | Very low | Unquantifiable | No |
| Political cost | Medium | Unquantifiable | No |
| Jevons paradox (systemic) | Low | Fundamental | No |
| Content filtering (opacity) | Medium | Unquantifiable | No |

**Overall assessment:** Of the 20+ cost categories identified, only 6 can be quantified with any confidence (inference energy, PUE, grid intensity, client energy, financial cost, water). The remaining categories resist quantification — not because they are small, but because they are diffuse, systemic, or involve incommensurable values (human rights, cognitive autonomy, cultural diversity, democratic governance).

A methodology that only counts what it can measure will systematically undercount the true cost. The quantifiable costs are almost certainly the *least important* costs. The most consequential harms — deskilling, data pollution, monoculture risk, creative displacement, power concentration — operate at the system level, where per-conversation attribution is conceptually fraught (see Section 17 on Jevons paradox).

This does not mean the exercise is pointless. Naming the costs, even without numbers, is a precondition for honest assessment.
## 20. Positive impact: proxy metrics
|
||||
|
||||
The sections above measure costs. To assess *net* impact, we also need
|
||||
to estimate value produced. This is harder — value is contextual, often
|
||||
delayed, and resistant to quantification. The following proxy metrics are
|
||||
imperfect but better than ignoring the positive side entirely.
|
||||
|
||||
### Reach
|
||||
|
||||
How many people are affected by the output of this conversation?
|
||||
|
||||
- **1** (only the user) — personal script, private note, learning exercise
|
||||
- **10-100** — team tooling, internal documentation, small project
|
||||
- **100-10,000** — open-source library, public documentation, popular blog
|
||||
- **10,000+** — widely-used infrastructure, security fix in major dependency
|
||||
|
||||
Estimation method: check download counts, user counts, dependency graphs,
|
||||
or audience size for the project or artifact being worked on.
|
||||
|
||||
**Known bias:** tendency to overestimate reach. "This could help anyone
|
||||
who..." is not the same as "this will reach N people." Be conservative.
|
||||
|
||||
### Counterfactual

Would the user have achieved a similar result without this conversation?

- **Yes, same speed** — the conversation added no value. Net impact is
  purely negative (cost with no benefit).
- **Yes, but slower** — the conversation saved time. Value = time saved *
  hourly value of that time. Often modest.
- **Yes, but lower quality** — the conversation improved the output
  (caught a bug, suggested a better design). Value depends on what the
  quality difference prevents downstream.
- **No** — the user could not have done this alone. The conversation
  enabled something that would not otherwise exist. Highest potential
  value, but also the highest deskilling risk.

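The "yes, but slower" case is the one that reduces to plain arithmetic. A minimal sketch, where the hours saved and the hourly rate are illustrative assumptions rather than measurements:

```python
def time_saved_value(hours_saved: float, hourly_value: float) -> float:
    """Value of a 'yes, but slower' counterfactual: time saved
    multiplied by the value the user places on an hour of that time."""
    return hours_saved * hourly_value

# Illustrative only: 30 minutes saved at an assumed $60/hour.
print(time_saved_value(0.5, 60.0))  # 30.0
```

Note how modest the figure usually is; it must then be weighed against the full cost stack estimated in the sections above.
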
**Known bias:** users and LLMs both overestimate the "no" category.
Most tasks fall in "yes, but slower."

### Durability

How long will the output remain valuable?

- **Minutes** — answered a quick question, resolved a transient confusion.
- **Days to weeks** — wrote a script for a one-off task, debugged a
  current issue.
- **Months to years** — created automation, documentation, or tooling
  that persists. Caught a design flaw early.
- **Indefinite** — contributed to a public resource that others maintain
  and build on.

Durability multiplies reach: a short-lived artifact for 10,000 users may
be worth less than a long-lived one for 100.

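That trade-off can be made concrete with a toy comparison. The day counts and the reach-times-lifetime weighting are illustrative assumptions, not a calibrated model:

```python
def value_proxy(reach: int, durability_days: int) -> int:
    """Crude value proxy: reach multiplied by useful lifetime in days.
    Treating the product as linear is an illustrative simplification."""
    return reach * durability_days

# A throwaway artifact for 10,000 users vs. a durable one for 100:
print(value_proxy(10_000, 2))     # 20000
print(value_proxy(100, 3 * 365))  # 109500: the durable artifact scores higher
```
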
### Severity (for bug/security catches)

If the conversation caught or prevented a problem, how bad was it?

- **Cosmetic** — typo, formatting, minor UX issue
- **Functional** — bug that affects correctness for some inputs
- **Security** — vulnerability that could be exploited
- **Data loss / safety** — could cause irreversible harm

Severity * reach = rough value of the catch.

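The severity-times-reach product can be sketched with ordinal weights. The weights below are illustrative assumptions, chosen only to keep the four categories an order of magnitude apart:

```python
# Illustrative severity weights (assumptions, not calibrated values).
SEVERITY_WEIGHT = {
    "cosmetic": 1,
    "functional": 10,
    "security": 100,
    "data_loss": 1000,
}

def catch_value(severity: str, reach: int) -> int:
    """Rough value of a caught problem: severity weight times reach."""
    return SEVERITY_WEIGHT[severity] * reach

# A security vulnerability caught in a dependency with 5,000 users:
print(catch_value("security", 5_000))  # 500000
```
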
### Reuse

Was the output of the conversation referenced or used again after it
ended? This can only be assessed retrospectively:

- Was the code merged, and is it still in production?
- Was the documentation read by others?
- Was the tool adopted by another project?

Reuse is the strongest evidence of durable value.

### Net impact rubric

Combining cost and value into a qualitative assessment:

| Assessment | Criteria |
|------------|----------|
| **Clearly net-positive** | High reach (1000+) AND (high durability OR high severity catch) AND counterfactual is "no" or "lower quality" |
| **Probably net-positive** | Moderate reach (100+) AND durable output AND counterfactual is at least "slower" |
| **Uncertain** | Low reach but high durability, or high reach but low durability, or hard to assess counterfactual |
| **Probably net-negative** | Low reach (1-10) AND short durability AND counterfactual is "yes, same speed" or "yes, but slower" |
| **Clearly net-negative** | No meaningful output, or output that required extensive debugging, or conversation that went in circles |

**Important:** most conversations between an LLM and a single user
working on private code will fall in the "probably net-negative" to
"uncertain" range. This is not a failure of the conversation — it is an
honest reflection of the cost structure. Net-positive requires broad
reach, which requires the work to be shared.

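The rubric can be encoded as a small classifier. The thresholds mirror the table; reducing durability to a boolean and treating the reach cutoffs as crisp is an illustrative simplification:

```python
def net_impact(reach: int, durable: bool, counterfactual: str,
               severe_catch: bool = False,
               meaningful_output: bool = True) -> str:
    """Map the rubric's criteria to a qualitative assessment.
    counterfactual is one of: 'no', 'lower quality', 'slower', 'same speed'."""
    if not meaningful_output:
        return "clearly net-negative"
    if reach >= 1000 and (durable or severe_catch) \
            and counterfactual in ("no", "lower quality"):
        return "clearly net-positive"
    if reach >= 100 and durable \
            and counterfactual in ("no", "lower quality", "slower"):
        return "probably net-positive"
    if reach <= 10 and not durable \
            and counterfactual in ("same speed", "slower"):
        return "probably net-negative"
    return "uncertain"

# A private one-off script the user could have written at the same speed:
print(net_impact(reach=1, durable=False, counterfactual="same speed"))
# probably net-negative
```

Note that the default outcome is "uncertain", matching the rubric's emphasis that most single-user conversations are hard to place.
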
## 21. What would improve this estimate

- Access to actual energy-per-token and training energy metrics from
  model providers
- Knowledge of the specific data center and its energy source
- Actual token counts from API response headers
- Hardware specifications (GPU model, batch size)
- Transparency about annotation labor conditions and compensation
- Public data on total query volume (to properly amortize training)
- Longitudinal studies on cognitive deskilling specifically from coding
  agents
- Empirical measurement of AI data pollution rates in public corpora
- A framework for quantifying concentration-of-power effects (this may
  not be possible within a purely quantitative methodology)
- Honest acknowledgment that some costs may be fundamentally
  unquantifiable, and that this is a limitation of quantitative
  methodology, not evidence of insignificance

## License

This methodology is provided for reuse and adaptation. See the LICENSE
file in this repository.

## Contributing

If you have better data, corrections, or additional cost categories,
contributions are welcome. The goal is not a perfect number but an
honest, improving understanding of costs.