# Methodology for Estimating the Impact of an LLM Conversation

## Introduction

This document provides a framework for estimating the total cost — environmental, financial, social, and political — of a conversation with a large language model (LLM) running on cloud infrastructure.

**Who this is for:** Anyone who wants to understand what a conversation with an AI assistant actually costs, beyond the subscription price. This includes developers using coding agents, researchers studying AI sustainability, and anyone making decisions about when AI tools are worth their cost.

**How to use it:** The framework identifies 20+ cost categories, provides estimation methods for the quantifiable ones, and names the unquantifiable ones so they are not ignored. You can apply it to your own conversations by substituting your own token counts and parameters.

**Limitations:** Most estimates have low confidence. Many of the most consequential costs cannot be quantified at all. This is a tool for honest approximation, not precise accounting. See the confidence summary (Section 19) for details.

## What we are measuring

The total cost of a single LLM conversation. Restricting the analysis to CO2 alone would miss most of the picture.

### Cost categories

**Environmental:**

1. Inference energy (GPU computation for the conversation)
2. Training energy (amortized share of the cost of training the model)
3. Data center overhead (cooling, networking, storage)
4. Client-side energy (the user's local machine)
5. Embodied carbon and materials (hardware manufacturing, mining)
6. E-waste (toxic hardware disposal, distinct from embodied carbon)
7. Grid displacement (AI demand consuming renewable capacity)
8. Data center community impacts (noise, land, local resource strain)

**Financial and economic:**

9. Direct compute cost and opportunity cost
10. Creative market displacement (per-conversation, not just training)

**Social and cognitive:**

11. Annotation labor conditions
12. Cognitive deskilling of the user
13. Mental health effects (dependency, loneliness paradox)
14. Linguistic homogenization and language endangerment

**Epistemic and systemic:**

15. AI-generated code quality degradation and technical debt
16. Model collapse / internet data pollution
17. Scientific research integrity contamination
18. Algorithmic monoculture and correlated failure risk

**Political:**

19. Concentration of power, geopolitical implications, data sovereignty

**Meta-methodological:**

20. Jevons paradox (efficiency gains driving increased total usage)

## 1. Token estimation

### Why tokens matter

LLM inference cost scales with the number of tokens processed. Each time the model produces a response, it reprocesses the entire conversation history (input tokens) and generates new text (output tokens). Output tokens are more expensive per token because they are generated sequentially, each requiring a full forward pass, whereas input tokens can be processed in parallel.

### How to estimate

If you have access to API response headers or usage metadata, use the actual token counts. Otherwise, estimate:

- **Bytes to tokens:** English text and JSON average ~4 bytes per token (range: 3.5-4.5 depending on content type). Code tends toward the higher end.
- **Cumulative input tokens:** Each assistant turn reprocesses the full context. For a conversation with N turns and final context size T, the cumulative input tokens are approximately T/2 * N (the average context size times the number of turns).
- **Output tokens:** Typically 1-5% of the total transcript size, depending on how verbose the assistant is.

### Example

A 20-turn conversation with a 200K-token final context:

- Cumulative input: ~100K * 20 = ~2,000,000 tokens
- Output: ~10,000 tokens

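The estimation rules above can be written out as a small sketch. It uses only the approximations stated above (~4 bytes per token, average context of T/2, output as 1-5% of the transcript); the function name and defaults are illustrative, not part of any API.

```python
def estimate_tokens(final_context_bytes, num_turns, bytes_per_token=4.0,
                    output_fraction=0.03):
    """Rough token estimate for a conversation.

    final_context_bytes: transcript/context size at the end, in bytes
    num_turns: number of assistant turns
    bytes_per_token: ~4 for English text and JSON (3.5-4.5 typical)
    output_fraction: output tokens as a fraction of the transcript (1-5%)
    """
    final_context_tokens = final_context_bytes / bytes_per_token
    # Each turn reprocesses the full history; the average context is ~T/2.
    cumulative_input = final_context_tokens / 2 * num_turns
    output = final_context_tokens * output_fraction
    return cumulative_input, output

# The 20-turn, 200K-token example above: 200K tokens is ~800 KB of text.
inp, out = estimate_tokens(final_context_bytes=800_000, num_turns=20)
print(f"~{inp / 1e6:.1f}M cumulative input, ~{out / 1e3:.0f}K output tokens")
# ~2.0M cumulative input, ~6K output tokens
```

With the default 3% output fraction this lands slightly below the ~10K output tokens assumed in the example; the 1-5% range spans both.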
### Uncertainty

Token estimates from byte counts can be off by a factor of 2. Key unknowns:

- The model's exact tokenization (tokens per byte ratio varies by content)
- Whether context caching reduces reprocessing
- The exact number of internal inference calls (tool sequences may involve multiple calls)
- Whether the system compresses prior messages near context limits

## 2. Energy per token

### Sources

Published energy-per-query data has improved significantly since 2024. Key sources, from most to least reliable:

- **Patterson et al. (Google, August 2025)**: First major provider to publish detailed per-query data. Reports **0.24 Wh per median Gemini text prompt** including full data center infrastructure. Also showed a 33x energy reduction over one year through efficiency improvements. ([arXiv:2508.15734](https://arxiv.org/abs/2508.15734))
- **Jegham et al. ("How Hungry is AI?", May 2025)**: Cross-model benchmarks for 30 LLMs. Found o3 and DeepSeek-R1 consume **>33 Wh per long prompt** (70x more than GPT-4.1 nano). Claude 3.7 Sonnet ranked highest in eco-efficiency. ([arXiv:2505.09598](https://arxiv.org/abs/2505.09598))
- The IEA's 2024 estimate of ~2.9 Wh per ChatGPT query (for GPT-4-class models, averaging ~1,000 tokens per query).
- De Vries (2023), "The growing energy footprint of artificial intelligence", Joule.
- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint of BLOOM", which measured energy for a 176B-parameter model.

### Calibration against published data

Google's 0.24 Wh per median Gemini prompt represents a **short query** (likely ~500-1000 tokens). For a long coding conversation with 2M cumulative input tokens and 10K output tokens, that is roughly 2000-4000 prompt-equivalent interactions. Naively scaling: 2000 × 0.24 Wh = **480 Wh**, though KV-cache and batching optimizations would reduce this in practice.

The Jegham et al. benchmarks show enormous variation by model: a single long prompt ranges from 0.4 Wh (GPT-4.1 nano) to >33 Wh (o3, DeepSeek-R1). For frontier reasoning models, a long conversation could consume significantly more than our previous estimates.

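The naive scaling above can be written out explicitly. The 750-token median prompt size is an assumed midpoint of the ~500-1000 token range, not a published figure.

```python
# Naive calibration against Google's 0.24 Wh per median Gemini prompt.
WH_PER_MEDIAN_PROMPT = 0.24      # Patterson et al. (2025)
TOKENS_PER_MEDIAN_PROMPT = 750   # assumed midpoint of the ~500-1000 range

# Running example: 2M cumulative input tokens + 10K output tokens.
cumulative_tokens = 2_000_000 + 10_000
prompt_equivalents = cumulative_tokens / TOKENS_PER_MEDIAN_PROMPT
naive_wh = prompt_equivalents * WH_PER_MEDIAN_PROMPT

print(f"{prompt_equivalents:.0f} prompt-equivalents -> ~{naive_wh:.0f} Wh")
# 2680 prompt-equivalents -> ~643 Wh
# KV-cache reuse and batching would reduce this in practice.
```

This lands inside the 2000-4000 prompt-equivalent range quoted above; treat the result as an upper-bound sketch, not a measurement.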
### Values used

- **Input tokens**: ~0.05-0.3 Wh per 1,000 tokens
- **Output tokens**: ~0.25-1.5 Wh per 1,000 tokens (5x input cost, reflecting sequential generation)

The wide ranges reflect model variation. The lower end corresponds to efficient models (GPT-4.1 mini, Claude 3.7 Sonnet); the upper end to frontier reasoning models (o3, DeepSeek-R1).

**Previous values** (used in versions before March 2026): 0.003 and 0.015 Wh per 1,000 tokens respectively. These were derived from pre-2025 estimates and are now known to be approximately 10-100x too low based on Google's published data.

### Uncertainty

The true values depend on:

- Model size and architecture (reasoning models use chain-of-thought, consuming far more tokens internally)
- Hardware (GPU type, batch size, utilization)
- Quantization and optimization techniques
- Whether speculative decoding or KV-cache optimizations are used
- Provider-specific infrastructure efficiency

The true values could be 0.3x to 3x the midpoint figures used here. The variation *between models* now dominates the uncertainty — choosing a different model can change energy by 70x (Jegham et al.).

## 3. Data center overhead (PUE)

Power Usage Effectiveness (PUE) measures total data center energy divided by IT equipment energy. It accounts for cooling, lighting, networking, and other infrastructure.

- **Value used**: PUE = 1.2
- **Source**: Google reports PUE of 1.10 for its best data centers; industry average is ~1.3 (Uptime Institute, 2023). 1.2 is a reasonable estimate for a major cloud provider.

This is relatively well-established and unlikely to be off by more than 15%.

## 4. Client-side energy

The user's machine contributes a small amount of energy during the conversation. For a typical desktop or laptop:

- Idle power: ~30-60W (desktop) or ~10-20W (laptop)
- Marginal power for active use: ~5-20W above idle
- Duration: varies by conversation length

For a 30-minute conversation on a desktop, counting only the marginal power during the minutes of active typing and reading, estimate ~0.5-1 Wh. This is typically a small fraction of the total, so rough precision is adequate here.

## 5. CO2 conversion

### Grid carbon intensity

CO2 per kWh depends on the electricity source:

- **US grid average**: ~400g CO2/kWh (EPA eGRID)
- **Major cloud data center regions**: ~300-400g CO2/kWh
- **France** (nuclear-dominated): ~56g CO2/kWh
- **Norway/Iceland** (hydro-dominated): ~20-30g CO2/kWh
- **Poland/Australia** (coal-heavy): ~600-800g CO2/kWh

Use physical grid intensity for the data center's region, not accounting for renewable energy credits or offsets. The physical electrons consumed come from the regional grid in real time.

### Calculation template

Using midpoint values (0.1 Wh/1K input, 0.5 Wh/1K output):

```
Server energy = (cumulative_input_tokens * 0.1/1000
                 + output_tokens * 0.5/1000) * PUE

Server CO2 = server_energy_Wh * grid_intensity_g_per_kWh / 1000

Client CO2 = client_energy_Wh * local_grid_intensity / 1000

Total CO2 = Server CO2 + Client CO2
```

### Example

A conversation with 2M cumulative input tokens and 10K output tokens:

```
Server energy = (2,000,000 * 0.1/1000 + 10,000 * 0.5/1000) * 1.2
              = (200 + 5.0) * 1.2
              = ~246 Wh

Server CO2 = 246 * 350 / 1000 = ~86g CO2

Client CO2 = 0.5 * 56 / 1000 = ~0.03g CO2 (France)

Total CO2 = ~86g
```

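The template and example above as a runnable sketch. The midpoint values and defaults are the ones stated above; the function name is illustrative.

```python
def conversation_co2(cumulative_input_tokens, output_tokens,
                     wh_per_1k_input=0.1, wh_per_1k_output=0.5,
                     pue=1.2, grid_g_per_kwh=350,
                     client_wh=0.5, client_grid_g_per_kwh=56):
    """Server energy (Wh) and total server+client CO2 (grams)."""
    server_wh = (cumulative_input_tokens * wh_per_1k_input / 1000
                 + output_tokens * wh_per_1k_output / 1000) * pue
    server_co2 = server_wh * grid_g_per_kwh / 1000
    client_co2 = client_wh * client_grid_g_per_kwh / 1000
    return server_wh, server_co2 + client_co2

wh, co2 = conversation_co2(2_000_000, 10_000)
print(f"~{wh:.0f} Wh, ~{co2:.0f} g CO2")   # ~246 Wh, ~86 g CO2
```

Substitute your own token counts and your data center region's grid intensity to reproduce the calculation for a specific conversation.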
This falls within the headline range of 100-250 Wh for a long conversation, and at the top of the 30-80g CO2 range (a dirtier grid would push it higher). The previous version of this methodology estimated ~7.4 Wh for the same conversation, which was ~30x too low.

## 6. Water usage

Data centers use water for evaporative cooling. Li et al. (2023), "Making AI Less Thirsty", estimated that GPT-3 inference consumes ~0.5 mL of water per 10-50 tokens of output. Scaling for model size and output volume:

**Rough estimate: 0.05-0.5 liters per long conversation.**

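Scaling the per-token figure above for the running example's ~10K output tokens gives a sketch of the range. This covers output-side water only, using the document's own figure; it ignores cooling-technology variation.

```python
output_tokens = 10_000
ml_low = output_tokens / 50 * 0.5    # optimistic: 0.5 mL per 50 tokens
ml_high = output_tokens / 10 * 0.5   # pessimistic: 0.5 mL per 10 tokens
print(f"~{ml_low / 1000:.2f}-{ml_high / 1000:.2f} L")   # ~0.10-0.50 L
```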
This depends heavily on the data center's cooling technology (some use closed-loop systems with near-zero water consumption) and the local climate.

## 7. Training cost (amortized)

### Why it cannot be dismissed

Training is not a sunk cost. It is an investment made in anticipation of demand. Each conversation is part of the demand that justifies training the current model and funding the next one. The marginal cost framing hides the system-level cost.

### Scale of training

Published and estimated figures for frontier model training:

- GPT-3 (175B params, 2020): ~1,287 MWh (Patterson et al., 2021)
- GPT-4 (2023): estimated ~50,000-100,000 MWh (unconfirmed)
- Frontier models in 2025-2026: likely 10,000-200,000 MWh range

At 350g CO2/kWh, a 50,000 MWh training run produces ~17,500 tonnes of CO2.

### Amortization

If the model serves N total conversations over its lifetime, each conversation's share is (training cost / N). Rough reasoning:

- If a major model serves ~10 million conversations per day for ~1 year: N ~ 3.6 billion conversations.
- Per-conversation share: 50,000 MWh = 50,000,000,000 Wh; 50,000,000,000 / 3,600,000,000 ~ 14 Wh, or ~5g CO2 at 350g/kWh.

This per-conversation share is modest — but only because the denominator is enormous. The total remains vast. Two framings:

- **Marginal**: My share is ~5g CO2, small next to the inference cost.
- **Attributional**: I am one of billions of participants in a system that emits ~17,500 tonnes. My participation sustains the system.

Neither framing is wrong. They answer different questions.

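The amortization arithmetic, with the unit conversion made explicit (1 MWh = 1,000,000 Wh). All figures are the assumptions stated above.

```python
# Amortized training share per conversation.
training_mwh = 50_000                    # assumed frontier training run
conversations = 10_000_000 * 365         # ~10M conversations/day for ~1 year
grid_g_per_kwh = 350

training_wh = training_mwh * 1_000_000   # 1 MWh = 1,000,000 Wh
share_wh = training_wh / conversations   # per-conversation energy share
share_g = share_wh * grid_g_per_kwh / 1000
total_tonnes = training_mwh * 1000 * grid_g_per_kwh / 1e6

print(f"~{share_wh:.0f} Wh and ~{share_g:.1f} g CO2 per conversation; "
      f"~{total_tonnes:,.0f} t CO2 total")
# ~14 Wh and ~4.8 g CO2 per conversation; ~17,500 t CO2 total
```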
### RLHF and fine-tuning

Training also includes reinforcement learning from human feedback (RLHF). This has its own energy cost (additional training runs) and, more importantly, a human labor cost (see Section 10).

## 8. Embodied carbon and materials

Manufacturing GPUs requires:

- **Rare-earth and critical-mineral mining** (neodymium, tantalum, cobalt, lithium) — with associated environmental destruction, water pollution, and often exploitative labor conditions in the DRC, Chile, and China.
- **Semiconductor fabrication** — extremely energy- and water-intensive (TSMC reports ~15,000 tonnes CO2 per fab per year).
- **Server assembly, shipping, data center construction.**

Per-conversation share is tiny (same large-N amortization), but the aggregate is significant and the harms (mining pollution, habitat destruction) are not captured by CO2 metrics alone.

**Not estimated numerically** — the data to do this properly is not public.

### Critical minerals: human rights dimension

The embodied carbon framing understates the harm. GPU production depends on gallium (98% sourced from China), germanium, cobalt (DRC), lithium, tantalum, and palladium. Artisanal cobalt miners in the DRC work without safety equipment, exposed to dust causing "hard metal lung disease." Communities face land displacement and environmental contamination. A 2025 Science paper argues that "global majority countries must embed critical minerals into AI governance" (doi:10.1126/science.aef6678). The per-conversation share of this suffering is unquantifiable but structurally real.

## 8b. E-waste

Distinct from embodied carbon. AI-specific GPUs become obsolete in 2-3 years (vs. 5-7 for general servers). Projections: 2.5 million tonnes of AI-related e-waste per year by 2030 (IEEE Spectrum). E-waste contains lead, mercury, cadmium, and brominated flame retardants that leach into soil and water. Recycling yields are negligible due to component miniaturization. Much of it is processed by workers in developing countries with minimal protection.

This is not captured by CO2 or embodied-carbon accounting. It is a distinct toxic-waste externality.

## 8c. Grid displacement and renewable cannibalization

The energy estimates above use average grid carbon intensity. But the *marginal* impact of additional AI demand may be worse than average. U.S. data center demand is projected to reach 325-580 TWh by 2028 (IEA), 6.7-12.0% of total U.S. electricity. When AI data centers claim renewable energy via Power Purchase Agreements, the "additionality" question is critical: is this new generation, or is it diverting existing renewables from other consumers? In several regions, AI demand is outpacing grid capacity, and companies are installing natural gas peakers to fill gaps.

The correct carbon intensity for a conversation's marginal electricity may therefore be higher than the grid average.

## 8d. Data center community impacts

Data centers impose localized costs that global metrics miss:

- **Noise**: Cooling systems run 24/7 at 55-85 dBA (safe threshold: 70 dBA). Communities near data centers report sleep disruption and stress.
- **Water**: Evaporative cooling competes with municipal water supply, particularly in arid regions.
- **Land**: Data center campuses displace other land uses and require high-voltage transmission lines through residential areas.
- **Jobs**: Data centers create very few long-term jobs relative to their footprint and resource consumption.

Virginia alone has plans for 70+ new data centers (NPR, 2025). Residents are increasingly organizing against expansions. The per-conversation share of these harms is infinitesimal, but each conversation is part of the demand that justifies new construction.

## 9. Financial cost

### Direct cost

API pricing for frontier models (as of early 2025): ~$15 per million input tokens, ~$75 per million output tokens (for the most capable models). Smaller models are cheaper.

Example for a conversation with 2M cumulative input tokens and 10K output tokens:

```
Input:  2,000,000 tokens * $15/1M = $30.00
Output:    10,000 tokens * $75/1M = $ 0.75
Total: ~$31
```

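The same arithmetic as a small function. The prices are the early-2025 list prices quoted above; the function name is illustrative.

```python
def api_cost_usd(input_tokens, output_tokens,
                 usd_per_m_input=15.0, usd_per_m_output=75.0):
    """Direct API cost for one conversation at frontier-model list prices."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

print(f"${api_cost_usd(2_000_000, 10_000):.2f}")   # $30.75
```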
Longer conversations cost more because cumulative input tokens grow roughly quadratically with the number of turns (each new turn reprocesses a longer history). A very long session (250K+ context, 250+ turns) can easily reach $500-1000.

Subscription pricing (e.g., Claude Code) may differ, but the underlying compute cost is similar.

### What that money could do instead

To make the opportunity cost concrete:

- ~$30 buys ~15 malaria bed nets via the Against Malaria Foundation (~$2 per net)
- ~$30 buys ~150 meals at a food bank (~$0.20/meal in bulk)
- ~$30 pays ~15-23 hours of wages for a data annotator in Kenya (Time, 2023: $1.32-2/hour)

This is not to say every dollar should go to charity. But the opportunity cost is real and should be named.

### Upstream financial costs

Revenue from AI subscriptions funds further model training, hiring, and GPU procurement. Each conversation is part of a financial loop that drives continued scaling of AI compute.

## 10. Social cost

### Data annotation labor

LLMs are typically trained using RLHF, which requires human annotators to rate model outputs. Reporting (Time, January 2023) revealed that outsourced annotation workers — often in Kenya, Uganda, and India — were paid $1-2/hour to review disturbing content (violence, abuse, hate speech) with limited psychological support. Each conversation's marginal contribution to that demand is infinitesimal, but the system depends on this labor.

### Displacement effects

LLM assistants can substitute for work previously done by humans: writing scripts, reviewing code, answering questions. Whether this is net-positive (freeing people for higher-value work) or net-negative (destroying livelihoods) depends on the economic context and is genuinely uncertain.

### Cognitive deskilling

A Microsoft/CMU study (Lee et al., CHI 2025) found that higher confidence in GenAI correlates with less critical thinking effort ([ACM DL](https://dl.acm.org/doi/full/10.1145/3706598.3713778)). An MIT Media Lab study ("Your Brain on ChatGPT") documented "cognitive debt" — users who relied on AI for tasks performed worse when later working independently. Clinical evidence from endoscopy studies shows that clinicians relying on AI diagnostics saw detection rates drop from 28.4% to 22.4% when AI was removed. A 2025 Springer paper argues that AI deskilling is a structural problem, not merely individual ([doi:10.1007/s00146-025-02686-z](https://link.springer.com/article/10.1007/s00146-025-02686-z)).

This is distinct from epistemic risk (misinformation). It is about the user's cognitive capacity degrading through repeated reliance on the tool. Each conversation has a marginal deskilling effect that compounds.

### Epistemic effects

LLMs present information with confidence regardless of accuracy. The ease of generating plausible-sounding text may contribute to an erosion of epistemic standards if consumed uncritically. Every claim in an LLM conversation should be verified independently.

### Linguistic homogenization

LLMs are overwhelmingly trained on English (~44% of training data). A Stanford 2025 study found that AI tools systematically exclude non-English speakers. UNESCO's 2024 report on linguistic diversity warns that AI systems risk accelerating the extinction of already-endangered languages by concentrating economic incentives on high-resource languages. Each English-language conversation reinforces this dynamic, marginalizing over 3,000 already-endangered languages.

## 11. Political cost

### Concentration of power

Training frontier models requires billions of dollars and access to cutting-edge hardware. Only a handful of companies can do this. Each conversation that flows through these systems reinforces their centrality and the concentration of a strategically important technology in a few private actors.

### Geopolitical resource competition

The demand for GPUs drives geopolitical competition for semiconductor manufacturing capacity (TSMC in Taiwan, export controls on China). Each conversation is an infinitesimal part of that demand, but it is part of it.

### Regulatory and democratic implications

AI systems that become deeply embedded in daily work create dependencies that are difficult to reverse. The more useful a conversation is, the more it contributes to a dependency on proprietary AI infrastructure that is not under democratic governance.

### Surveillance and data

Conversations are processed on the provider's servers. File paths, system configuration, project structures, and code are transmitted and processed remotely. Even with strong privacy policies, the structural arrangement — sending detailed information about one's computing environment to a private company — has implications, particularly across jurisdictions.

### Opaque content filtering

LLM providers apply content filtering that can block outputs without explanation. The filtering rules are not public: there is no published specification of what triggers a block, no explanation given when one occurs, and no appeal mechanism. The user receives a generic error message ("Output blocked by content filtering policy") with no indication of what content was objectionable.

This has several costs:

- **Reliability**: Any response can be blocked unpredictably. Observed false positives include responses about open-source licensing (CC0 public domain dedication) — entirely benign content. If a filter can trigger on that, it can trigger on anything.
- **Chilling effect**: Topics that are more likely to trigger filters (labor conditions, exploitation, political power) are precisely the topics that honest impact assessment requires discussing. The filter creates a structural bias toward safe, anodyne output.
- **Opacity**: The user cannot know in advance which topics or phrasings will be blocked, cannot understand why a block occurred, and cannot adjust their request rationally. This is the opposite of the transparency that democratic governance requires.
- **Asymmetry**: The provider decides what the model may say, with no input from the user. This is another instance of power concentration — not over compute resources, but over speech.

The per-conversation cost is small (usually a retry works). The systemic cost is that a private company exercises opaque editorial control over an increasingly important communication channel, with no accountability to the people affected.

## 12. AI-generated code quality and technical debt

Research specific to AI coding agents (CodeRabbit, 2025; Stack Overflow blog, 2026): AI-generated code introduces 1.7x more issues than human-written code, with 1.57x more security vulnerabilities and 2.74x more XSS vulnerabilities. Organizations using AI coding agents saw cycle time increase 9%, incidents per PR increase 23.5%, and change failure rate increase 30%.

The availability of easily generated code may discourage the careful testing that would catch bugs. Any code from an LLM conversation should be reviewed and tested with the same rigor as code from an untrusted contributor.

## 13. Model collapse and internet data pollution

Shumailov et al. (Nature, 2024) demonstrated that models trained on recursively AI-generated data progressively degenerate, losing tail distributions and eventually converging to distributions unrelated to reality. Each conversation that produces text which enters the public internet — Stack Overflow answers, blog posts, documentation — contributes synthetic data to the commons. Future models trained on this data will be slightly worse.

The Harvard Journal of Law & Technology has argued for a "right to uncontaminated human-generated data." Each conversation is a marginal pollutant.

## 14. Scientific research integrity

If conversation outputs are used in research (literature reviews, data analysis, writing), they contribute to degradation of scientific knowledge infrastructure. A PMC article calls LLMs "a potentially existential threat to online survey research" because coherent AI-generated responses can no longer be assumed human. PNAS has warned about protecting scientific integrity in an age of generative AI.

This is distinct from individual epistemic risk — it is systemic corruption of the knowledge commons.

## 15. Algorithmic monoculture and correlated failure

When millions of users rely on the same few foundation models, errors become correlated rather than independent. A Stanford HAI study ([Bommasani et al., 2022](https://arxiv.org/abs/2108.07258)) found that across every model ecosystem studied, the rate of homogeneous outcomes exceeded baselines. A Nature Communications Psychology paper (2026) documents that AI-driven research is producing "topical and methodological convergence, flattening scientific imagination."

For coding specifically: if many developers use the same model, their code will share the same blind spots, the same idiomatic patterns, and the same categories of bugs. This reduces the diversity that makes software ecosystems resilient.

## 16. Creative market displacement

The U.S. Copyright Office's May 2025 Part 3 report states that GenAI systems "compete with or diminish licensing opportunities for original human creators." This is not only a training-phase cost (using creators' work without consent) but an ongoing per-conversation externality: each conversation that generates creative output (code, text, analysis) displaces some marginal demand for human work.

## 17. Jevons paradox (meta-methodological)

This entire methodology risks underestimating impact through the per-conversation framing. As AI models become more efficient and cheaper per query, total usage scales dramatically, potentially negating efficiency gains. A 2025 ACM FAccT paper specifically addresses this: efficiency improvements spur increased consumption. Any per-conversation estimate should acknowledge that the very affordability of a conversation increases total conversation volume — each cheap query is part of a demand signal that drives system-level growth.

## 18. What this methodology does NOT capture

- **Network transmission energy**: Routers, switches, fiber amplifiers, CDN infrastructure. Data center network bandwidth surged 330% in 2024 due to AI workloads. Small per conversation but not zero.
- **Mental health effects**: RCTs show heavy AI chatbot use correlates with greater loneliness and dependency. Less directly relevant to coding agent use, but the boundary between tool use and companionship is not always clear.
- **Human time**: The user's time has value and its own footprint, but this is not caused by the conversation.
- **Cultural normalization**: The more AI-generated content becomes normal, the harder it becomes to opt out. This is a soft lock-in effect.

## 19. Confidence summary

| Component                        | Confidence | Could be off by | Quantified? |
|----------------------------------|------------|-----------------|-------------|
| Token count                      | Low        | 2x              | Yes         |
| Energy per token                 | Low        | 3x              | Yes         |
| PUE                              | Medium     | 15%             | Yes         |
| Grid carbon intensity            | Medium     | 30%             | Yes         |
| Client-side energy               | Medium     | 50%             | Yes         |
| Water usage                      | Low        | 5x              | Yes         |
| Training (amortized)             | Low        | 10x             | Partly      |
| Financial cost                   | Medium     | 2x              | Yes         |
| Embodied carbon                  | Very low   | Unknown         | No          |
| Critical minerals / human rights | Very low   | Unquantifiable  | No          |
| E-waste                          | Very low   | Unknown         | No          |
| Grid displacement                | Low        | 2-5x            | No          |
| Community impacts                | Very low   | Unquantifiable  | No          |
| Annotation labor                 | Very low   | Unquantifiable  | No          |
| Cognitive deskilling             | Very low   | Unquantifiable  | Proxy       |
| Linguistic homogenization        | Very low   | Unquantifiable  | No          |
| Code quality degradation         | Low        | Variable        | Proxy       |
| Data pollution / model collapse  | Very low   | Unquantifiable  | Proxy       |
| Scientific integrity             | Very low   | Unquantifiable  | No          |
| Algorithmic monoculture          | Very low   | Unquantifiable  | Proxy       |
| Creative market displacement     | Very low   | Unquantifiable  | No          |
| Political cost                   | Very low   | Unquantifiable  | No          |
| Content filtering (opacity)      | Medium     | Unquantifiable  | No          |
| Jevons paradox (systemic)        | Low        | Fundamental     | No          |

**Proxy metrics** (marked "Proxy" above): These categories cannot be directly quantified per conversation, but the impact toolkit now tracks measurable proxies:

- **Cognitive deskilling**: Automation ratio (AI output tokens / total tokens). High ratio = more delegation, higher deskilling risk.
- **Code quality degradation**: Test pass/fail counts and file churn (edits per unique file). High churn or failures = more rework.
- **Data pollution / model collapse**: Public push flag — detects when AI-generated code is pushed to a public repository.
- **Algorithmic monoculture**: Model ID logged per session, enabling provider concentration analysis over time.

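A minimal sketch of how these proxies could be computed per session. The field names and signature are illustrative; the actual toolkit's schema is not specified here.

```python
def session_proxies(ai_output_tokens, total_tokens, edits, unique_files,
                    tests_passed, tests_failed, pushed_public, model_id):
    """Compute the four proxy signals for one session."""
    return {
        # Cognitive deskilling proxy: share of the work delegated to the AI.
        "automation_ratio": ai_output_tokens / total_tokens,
        # Code quality proxies: rework signals.
        "file_churn": edits / unique_files if unique_files else 0.0,
        "test_failure_rate": tests_failed / max(tests_passed + tests_failed, 1),
        # Data pollution proxy: did AI-generated code reach a public repo?
        "public_push": pushed_public,
        # Monoculture proxy: which model produced the output.
        "model_id": model_id,
    }

# Hypothetical session: 10K AI output tokens of a ~2M token conversation,
# 42 edits across 7 files, 18 passing / 2 failing tests, pushed publicly.
p = session_proxies(10_000, 2_010_000, 42, 7, 18, 2,
                    pushed_public=True, model_id="example-model")
```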
These proxies are crude — a high automation ratio does not prove deskilling, and a public push does not prove pollution. But they make the costs visible and trackable rather than purely abstract.

**Overall assessment:** Of the 20+ cost categories identified, only 6 can be quantified with any confidence (inference energy, PUE, grid intensity, client energy, financial cost, water). Four more now have proxy metrics that capture a measurable signal, even if indirect. The remaining categories resist quantification — not because they are small, but because they are diffuse, systemic, or involve incommensurable values (human rights, cognitive autonomy, cultural diversity, democratic governance).

A methodology that only counts what it can measure will systematically
|
||
undercount the true cost. The quantifiable costs are almost certainly the
|
||
*least important* costs. The most consequential harms — deskilling, data
|
||
pollution, monoculture risk, creative displacement, power concentration —
|
||
operate at the system level, where per-conversation attribution is
|
||
conceptually fraught (see Section 17 on Jevons paradox).
|
||
|
||
This does not mean the exercise is pointless. Naming the costs, even
|
||
without numbers, is a precondition for honest assessment.
|
||
|
||
## 20. Positive impact: proxy metrics

The sections above measure costs. To assess *net* impact, we also need
to estimate value produced. This is harder — value is contextual, often
delayed, and resistant to quantification. The following proxy metrics are
imperfect but better than ignoring the positive side entirely.

### Reach

How many people are affected by the output of this conversation?

- **1** (only the user) — personal script, private note, learning exercise
- **10-100** — team tooling, internal documentation, small project
- **100-10,000** — open-source library, public documentation, popular blog
- **10,000+** — widely-used infrastructure, security fix in a major
  dependency

Estimation method: check download counts, user counts, dependency graphs,
or audience size for the project or artifact being worked on.

**Known bias:** a tendency to overestimate reach. "This could help anyone
who..." is not the same as "this will reach N people." Be conservative.

### Counterfactual

Would the user have achieved a similar result without this conversation?

- **Yes, same speed** — the conversation added no value. Net impact is
  purely negative (cost with no benefit).
- **Yes, but slower** — the conversation saved time. Value = time saved *
  hourly value of that time. Often modest.
- **Yes, but lower quality** — the conversation improved the output
  (caught a bug, suggested a better design). Value depends on what the
  quality difference prevents downstream.
- **No** — the user could not have done this alone. The conversation
  enabled something that would not otherwise exist. Highest potential
  value, but also the highest deskilling risk.

**Known bias:** users and LLMs both overestimate the "no" category.
Most tasks fall in "yes, but slower."

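For the common "yes, but slower" case, the value formula above is simple
arithmetic. A sketch — the hourly value is an assumption the assessor
must supply:

```python
def time_saved_value(hours_saved, hourly_value):
    """Value when the conversation's only benefit was speed:
    time saved multiplied by the value of that time."""
    return hours_saved * hourly_value

# Example: 30 minutes saved, time valued at $60/hour -> $30 of value.
value = time_saved_value(0.5, 60)
```
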
### Durability

How long will the output remain valuable?

- **Minutes** — answered a quick question, resolved a transient confusion.
- **Days to weeks** — wrote a script for a one-off task, debugged a
  current issue.
- **Months to years** — created automation, documentation, or tooling
  that persists. Caught a design flaw early.
- **Indefinite** — contributed to a public resource that others maintain
  and build on.

Durability multiplies reach: a short-lived artifact for 10,000 users may
be worth less than a long-lived one for 100.

### Severity (for bug/security catches)

If the conversation caught or prevented a problem, how bad was it?

- **Cosmetic** — typo, formatting, minor UX issue
- **Functional** — bug that affects correctness for some inputs
- **Security** — vulnerability that could be exploited
- **Data loss / safety** — could cause irreversible harm

Severity * reach = rough value of the catch.

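The severity-times-reach heuristic can be made explicit with ordinal
weights. The weights below are illustrative assumptions — only the
ordering matters, not the specific numbers:

```python
# Illustrative order-of-magnitude weights per severity tier.
SEVERITY_WEIGHT = {"cosmetic": 1, "functional": 10, "security": 100, "data_loss": 1000}

def catch_value(severity, reach):
    """Rough ordinal value of a caught problem: severity weight * people affected."""
    return SEVERITY_WEIGHT[severity] * reach
```

On these weights, a security catch in a 5,000-user library (500,000)
outranks a cosmetic fix seen by 100,000 people (100,000) — the intended
behavior of the heuristic.
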
### Reuse

Was the output of the conversation referenced or used again after it
ended? This can only be assessed retrospectively:

- Was the code merged, and is it still in production?
- Was the documentation read by others?
- Was the tool adopted by another project?

Reuse is the strongest evidence of durable value.

### Net impact rubric

Combining cost and value into a qualitative assessment:

| Assessment | Criteria |
|------------|----------|
| **Clearly net-positive** | High reach (1000+) AND (high durability OR high severity catch) AND counterfactual is "no" or "lower quality" |
| **Probably net-positive** | Moderate reach (100+) AND durable output AND counterfactual is at least "slower" |
| **Uncertain** | Low reach but high durability, or high reach but low durability, or hard to assess counterfactual |
| **Probably net-negative** | Low reach (1-10) AND short durability AND counterfactual is "yes, same speed" or "yes, but slower" |
| **Clearly net-negative** | No meaningful output, or output that required extensive debugging, or conversation that went in circles |

**Important:** most conversations between an LLM and a single user
working on private code will fall in the "probably net-negative" to
"uncertain" range. This is not a failure of the conversation — it is an
honest reflection of the cost structure. Net-positive requires broad
reach, which requires the work to be shared.

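The rubric can be transcribed into a decision function. This sketch
follows the table's thresholds literally; treat it as a checklist, not a
calibrated model:

```python
def net_impact(reach, durable, severe_catch, counterfactual, meaningful_output=True):
    """Qualitative net-impact assessment per the rubric.

    counterfactual: "same_speed", "slower", "lower_quality", or "no".
    durable: True if the output should last months or longer.
    severe_catch: True if a security or data-loss problem was caught.
    """
    if not meaningful_output:
        return "clearly net-negative"
    if reach >= 1000 and (durable or severe_catch) and counterfactual in ("no", "lower_quality"):
        return "clearly net-positive"
    if reach >= 100 and durable and counterfactual != "same_speed":
        return "probably net-positive"
    if reach <= 10 and not durable and counterfactual in ("same_speed", "slower"):
        return "probably net-negative"
    return "uncertain"  # mixed signals: the rubric's middle row
```

Note how hard "clearly net-positive" is to reach — it requires several
conditions at once, which matches the observation that most private-code
sessions land lower.
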
## 21. Related work

This methodology builds on and complements existing tools and research.

### Measurement tools (environmental)

- **[EcoLogits](https://ecologits.ai/)** — Python library from GenAI
  Impact that tracks per-query energy and CO2 for API calls. Covers
  operational and embodied emissions. More precise than this methodology
  for environmental metrics, but does not cover social, epistemic, or
  political costs.
- **[CodeCarbon](https://codecarbon.io/)** — Python library that measures
  GPU/CPU/RAM electricity consumption in real time with regional carbon
  intensity. Primarily for local training workloads. A 2025 validation
  study found estimates can be off by ~2.4x vs. external measurements.
- **[Hugging Face AI Energy Score](https://huggingface.github.io/AIEnergyScore/)** —
  Standardized energy efficiency benchmarking across AI models. Useful
  for model selection but does not provide per-conversation accounting.
- **[Green Algorithms](https://www.green-algorithms.org/)** — Web
  calculator from the University of Cambridge for any computational
  workload. Not AI-specific.

### Published per-query data

- **Patterson et al. (Google, August 2025)**: The most rigorous
  provider-published per-query data. Reports 0.24 Wh, 0.03 g CO2, and
  0.26 mL water per median Gemini text prompt. Showed a 33x energy
  reduction over one year.
  ([arXiv:2508.15734](https://arxiv.org/abs/2508.15734))
- **Jegham et al. ("How Hungry is AI?", May 2025)**: Cross-model
  benchmarks for 30 LLMs showing 70x energy variation between models.
  ([arXiv:2505.09598](https://arxiv.org/abs/2505.09598))

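The Patterson et al. medians can be scaled to a whole conversation as a
back-of-envelope check. Note that these are medians for short text
prompts; long agentic turns with large contexts likely cost more, so
treat the result as a floor:

```python
# Published medians per Gemini text prompt (Patterson et al., 2025).
WH_PER_PROMPT = 0.24        # energy, watt-hours
G_CO2_PER_PROMPT = 0.03     # emissions, grams CO2
ML_WATER_PER_PROMPT = 0.26  # water, milliliters

def conversation_footprint(num_prompts):
    """Scale the per-prompt medians linearly across a multi-turn conversation."""
    return {
        "energy_wh": num_prompts * WH_PER_PROMPT,
        "co2_g": num_prompts * G_CO2_PER_PROMPT,
        "water_ml": num_prompts * ML_WATER_PER_PROMPT,
    }

# A 50-turn session scales to roughly 12 Wh, 1.5 g CO2, and 13 mL of water.
fp = conversation_footprint(50)
```
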
### Broader frameworks

- **UNICC/Frugal AI Hub (December 2025)**: Three-level framework from
  Total Cost of Ownership to SDG alignment. Portfolio-level, not
  per-conversation. Does not enumerate specific social cost categories.
- **Practical Principles for AI Cost and Compute Accounting (arXiv,
  February 2025)**: Proposes compute as a governance metric. Financial
  and compute only.

### Research on social costs

- **Lee et al. (CHI 2025)**: "The AI Deskilling Paradox" — survey
  finding that higher AI confidence correlates with less critical
  thinking. See Section 10.
- **Springer (2025)**: Argues deskilling is structural, not individual.
- **Shumailov et al. (Nature, 2024)**: Model collapse from recursive
  AI-generated training data. See Section 13.
- **Stanford HAI (2025)**: Algorithmic monoculture and correlated failure
  across model ecosystems. See Section 15.

### How this methodology differs

No existing tool or framework combines per-conversation environmental
measurement with social, cognitive, epistemic, and political cost
categories. The tools above measure environmental costs well — we do
not compete with them. Our contribution is the taxonomy: naming and
organizing 20+ cost categories so that the non-environmental costs are
not ignored simply because they are harder to quantify.

## 22. What would improve this estimate

- Access to actual energy-per-token and training energy metrics from
  model providers
- Knowledge of the specific data center and its energy source
- Actual token counts from API response headers
- Hardware specifications (GPU model, batch size)
- Transparency about annotation labor conditions and compensation
- Public data on total query volume (to properly amortize training)
- Longitudinal studies on cognitive deskilling specifically from coding
  agents
- Empirical measurement of AI data pollution rates in public corpora
- A framework for quantifying concentration-of-power effects (this may
  not be possible within a purely quantitative methodology)
- Honest acknowledgment that some costs may be fundamentally
  unquantifiable, and that this is a limitation of quantitative
  methodology, not evidence of insignificance

## License

This methodology is provided for reuse and adaptation. See the LICENSE
file in this repository.

## Contributing

If you have better data, corrections, or additional cost categories,
contributions are welcome. The goal is not a perfect number but an
honest, improving understanding of costs.