# Methodology for Estimating the Impact of an LLM Conversation
## Introduction
This document provides a framework for estimating the total cost —
environmental, financial, social, and political — of a conversation with
a large language model (LLM) running on cloud infrastructure.
**Who this is for:** Anyone who wants to understand what a conversation
with an AI assistant actually costs, beyond the subscription price. This
includes developers using coding agents, researchers studying AI
sustainability, and anyone making decisions about when AI tools are worth
their cost.
**How to use it:** The framework identifies 20+ cost categories, provides
estimation methods for the quantifiable ones, and names the
unquantifiable ones so they are not ignored. You can apply it to your own
conversations by substituting your own token counts and parameters.
**Limitations:** Most estimates have low confidence. Many of the most
consequential costs cannot be quantified at all. This is a tool for
honest approximation, not precise accounting. See the confidence summary
(Section 19) for details.
## What we are measuring
The total cost of a single LLM conversation. Restricting the analysis to
CO2 alone would miss most of the picture.
### Cost categories
**Environmental:**
1. Inference energy (GPU computation for the conversation)
2. Training energy (amortized share of the cost of training the model)
3. Data center overhead (cooling, networking, storage)
4. Client-side energy (the user's local machine)
5. Embodied carbon and materials (hardware manufacturing, mining)
6. E-waste (toxic hardware disposal, distinct from embodied carbon)
7. Grid displacement (AI demand consuming renewable capacity)
8. Data center community impacts (noise, land, local resource strain)
**Financial and economic:**
9. Direct compute cost and opportunity cost
10. Creative market displacement (per-conversation, not just training)
**Social and cognitive:**
11. Annotation labor conditions
12. Cognitive deskilling of the user
13. Mental health effects (dependency, loneliness paradox)
14. Linguistic homogenization and language endangerment
**Epistemic and systemic:**
15. AI-generated code quality degradation and technical debt
16. Model collapse / internet data pollution
17. Scientific research integrity contamination
18. Algorithmic monoculture and correlated failure risk
**Political:**
19. Concentration of power, geopolitical implications, data sovereignty
**Meta-methodological:**
20. Jevons paradox (efficiency gains driving increased total usage)
## 1. Token estimation
### Why tokens matter
LLM inference cost scales with the number of tokens processed. Each time
the model produces a response, it reprocesses the entire conversation
history (input tokens) and generates new text (output tokens). Output
tokens are more expensive per token because they are generated
sequentially, each requiring a full forward pass, whereas input tokens
can be processed in parallel.
### How to estimate
If you have access to API response headers or usage metadata, use the
actual token counts. Otherwise, estimate:
- **Bytes to tokens:** English text and JSON average ~4 bytes per token
(range: 3.5-4.5 depending on content type). Code tends toward the
higher end.
- **Cumulative input tokens:** Each assistant turn reprocesses the full
context. For a conversation with N turns and final context size T, the
cumulative input tokens are approximately T/2 * N (the average context
size times the number of turns).
- **Output tokens:** Typically 1-5% of the total transcript size,
depending on how verbose the assistant is.
### Example
A 20-turn conversation with a 200K-token final context:
- Cumulative input: ~100K * 20 = ~2,000,000 tokens
- Output: ~10,000 tokens
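The estimation steps above can be sketched as a small helper. The 4-bytes-per-token ratio and the 5% output fraction are the rules of thumb stated above, not measured values:

```python
def estimate_tokens(transcript_bytes: int, turns: int,
                    bytes_per_token: float = 4.0,
                    output_fraction: float = 0.05):
    """Rough token estimate from transcript size.

    Assumes the context grows roughly linearly over the conversation,
    so cumulative input is (final context / 2) * number of turns.
    """
    final_context = transcript_bytes / bytes_per_token
    cumulative_input = final_context / 2 * turns
    output_tokens = final_context * output_fraction
    return cumulative_input, output_tokens

# A 20-turn conversation whose transcript is ~800 KB (~200K tokens)
inp, out = estimate_tokens(800_000, 20)
print(f"~{inp:,.0f} cumulative input tokens, ~{out:,.0f} output tokens")
```

Remember the factor-of-2 uncertainty: treat the result as an order-of-magnitude figure, not a count.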
### Uncertainty
Token estimates from byte counts can be off by a factor of 2. Key
unknowns:
- The model's exact tokenization (tokens per byte ratio varies by content)
- Whether context caching reduces reprocessing
- The exact number of internal inference calls (tool sequences may involve
multiple calls)
- Whether the system compresses prior messages near context limits
## 2. Energy per token
### Sources
Published energy-per-query data has improved significantly since 2024.
Key sources, from most to least reliable:
- **Patterson et al. (Google, August 2025)**: First major provider to
publish detailed per-query data. Reports **0.24 Wh per median Gemini
text prompt** including full data center infrastructure. Also showed
33x energy reduction over one year through efficiency improvements.
([arXiv:2508.15734](https://arxiv.org/abs/2508.15734))
- **Jegham et al. ("How Hungry is AI?", May 2025)**: Cross-model
benchmarks for 30 LLMs. Found o3 and DeepSeek-R1 consume **>33 Wh
per long prompt** (70x more than GPT-4.1 nano). Claude 3.7 Sonnet
ranked highest eco-efficiency.
([arXiv:2505.09598](https://arxiv.org/abs/2505.09598))
- The IEA's 2024 estimate of ~2.9 Wh per ChatGPT query (for GPT-4-class
models, averaging ~1,000 tokens per query).
- De Vries (2023), "The growing energy footprint of artificial
intelligence", Joule.
- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint
of BLOOM", which measured energy for a 176B parameter model.
### Calibration against published data
Google's 0.24 Wh per median Gemini prompt represents a **short query**
(likely ~500-1000 tokens). For a long coding conversation with 2M
cumulative input tokens and 10K output tokens, that's roughly
2000-4000 prompt-equivalent interactions. Naively scaling:
2000 × 0.24 Wh = **480 Wh**, though KV-cache and batching optimizations
would reduce this in practice.
The Jegham et al. benchmarks show enormous variation by model: a single
long prompt ranges from 0.4 Wh (GPT-4.1 nano) to >33 Wh (o3, DeepSeek-R1).
For frontier reasoning models, a long conversation could consume
significantly more than our previous estimates.
### Values used
- **Input tokens**: ~0.05-0.3 Wh per 1,000 tokens
- **Output tokens**: ~0.25-1.5 Wh per 1,000 tokens (5x input cost,
reflecting sequential generation)
The wide ranges reflect model variation. The lower end corresponds to
efficient models (GPT-4.1 mini, Claude 3.7 Sonnet); the upper end to
frontier reasoning models (o3, DeepSeek-R1).
**Previous values** (used in versions before March 2026): 0.003 and
0.015 Wh per 1,000 tokens respectively. These were derived from
pre-2025 estimates and are now known to be approximately 10-100x too
low based on Google's published data.
### Uncertainty
The true values depend on:
- Model size and architecture (reasoning models use chain-of-thought,
consuming far more tokens internally)
- Hardware (GPU type, batch size, utilization)
- Quantization and optimization techniques
- Whether speculative decoding or KV-cache optimizations are used
- Provider-specific infrastructure efficiency
The true values could be 0.3x to 3x the midpoint figures used here.
The variation *between models* now dominates the uncertainty — choosing
a different model can change energy by 70x (Jegham et al.).
## 3. Data center overhead (PUE)
Power Usage Effectiveness (PUE) measures total data center energy divided
by IT equipment energy. It accounts for cooling, lighting, networking, and
other infrastructure.
- **Value used**: PUE = 1.2
- **Source**: Google reports PUE of 1.10 for its best data centers; the
industry-wide average is ~1.58 (Uptime Institute, 2023). 1.2 is a
reasonable estimate for a major cloud provider.
This is relatively well-established and unlikely to be off by more than
15%.
## 4. Client-side energy
The user's machine contributes a small amount of energy during the
conversation. For a typical desktop or laptop:
- Idle power: ~30-60W (desktop) or ~10-20W (laptop)
- Marginal power for active use: ~5-20W above idle
- Duration: varies by conversation length
For a 30-minute conversation on a desktop, the full marginal draw (5-20W
over 0.5 h) gives ~2.5-10 Wh; attributing only the actively-used fraction
of the session yields ~0.5-1 Wh. Either way, this is typically a small
fraction of the total, and rough precision suffices.
## 5. CO2 conversion
### Grid carbon intensity
CO2 per kWh depends on the electricity source:
- **US grid average**: ~400g CO2/kWh (EPA eGRID)
- **Major cloud data center regions**: ~300-400g CO2/kWh
- **France** (nuclear-dominated): ~56g CO2/kWh
- **Norway/Iceland** (hydro-dominated): ~20-30g CO2/kWh
- **Poland/Australia** (coal-heavy): ~600-800g CO2/kWh
Use physical grid intensity for the data center's region, not accounting
for renewable energy credits or offsets. The physical electrons consumed
come from the regional grid in real time.
### Calculation template
Using midpoint values (0.1 Wh/1K input, 0.5 Wh/1K output):
```
Server energy = (cumulative_input_tokens * 0.1/1000
+ output_tokens * 0.5/1000) * PUE
Server CO2 = server_energy_Wh * grid_intensity_g_per_kWh / 1000
Client CO2 = client_energy_Wh * local_grid_intensity / 1000
Total CO2 = Server CO2 + Client CO2
```
### Example
A conversation with 2M cumulative input tokens and 10K output tokens:
```
Server energy = (2,000,000 * 0.1/1000 + 10,000 * 0.5/1000) * 1.2
= (200 + 5.0) * 1.2
= ~246 Wh
Server CO2 = 246 * 350 / 1000 = ~86g CO2
Client CO2 = 0.5 * 56 / 1000 = ~0.03g CO2 (France)
Total CO2 = ~86g
```
This is consistent with the headline range of 100-250 Wh and 30-80g CO2
for a long conversation. The previous version of this methodology
estimated ~7.4 Wh for the same conversation, which was ~30x too low.
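The calculation template above, made runnable. The midpoint energy rates, PUE, and grid intensities are the same assumptions used in the worked example; substitute your own values:

```python
def conversation_co2(input_tokens, output_tokens,
                     wh_per_1k_input=0.1, wh_per_1k_output=0.5,
                     pue=1.2, grid_g_per_kwh=350,
                     client_wh=0.5, client_grid_g_per_kwh=56):
    """Return (server Wh, total grams CO2) for one conversation."""
    server_wh = (input_tokens * wh_per_1k_input / 1000
                 + output_tokens * wh_per_1k_output / 1000) * pue
    server_g = server_wh * grid_g_per_kwh / 1000
    client_g = client_wh * client_grid_g_per_kwh / 1000
    return server_wh, server_g + client_g

wh, g = conversation_co2(2_000_000, 10_000)
print(f"~{wh:.0f} Wh, ~{g:.0f} g CO2")
```

Given the uncertainties in Sections 1-2, the output is a midpoint of a range spanning roughly an order of magnitude in either direction.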
## 6. Water usage
Data centers use water for evaporative cooling. Li et al. (2023), "Making
AI Less Thirsty", estimated that GPT-3 inference consumes ~0.5 mL of
water per 10-50 tokens of output. Scaling for model size and output
volume:
**Rough estimate: 0.05-0.5 liters per long conversation.**
This depends heavily on the data center's cooling technology (some use
closed-loop systems with near-zero water consumption) and the local
climate.
## 7. Training cost (amortized)
### Why it cannot be dismissed
Training is not a sunk cost. It is an investment made in anticipation of
demand. Each conversation is part of the demand that justifies training
the current model and funding the next one. The marginal cost framing
hides the system-level cost.
### Scale of training
Published and estimated figures for frontier model training:
- GPT-3 (175B params, 2020): ~1,287 MWh (Patterson et al., 2021)
- GPT-4 (2023): estimated ~50,000-100,000 MWh (unconfirmed)
- Frontier models in 2025-2026: likely 10,000-200,000 MWh range
At 350g CO2/kWh, a 50,000 MWh training run produces ~17,500 tonnes of
CO2.
### Amortization
If the model serves N total conversations over its lifetime, each
conversation's share is (training cost / N). Rough reasoning:
- If a major model serves ~10 million conversations per day for ~1 year:
N ~ 3.6 billion conversations.
- Per-conversation share: 50,000 MWh = 5 × 10^10 Wh; divided by
3.6 billion conversations, that is ~14 Wh, or ~5g CO2 at 350g/kWh.
This is modest per conversation — but only because the denominator is
enormous. The total remains vast. Two framings:
- **Marginal**: My share is ~14 Wh and ~5g CO2, a few percent of the
inference cost. Small, but not negligible.
- **Attributional**: I am one of billions of participants in a system
that emits ~17,500 tonnes. My participation sustains the system.
Neither framing is wrong. They answer different questions.
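The amortization arithmetic, spelled out. The training energy, daily volume, and lifetime are the illustrative figures above, not published numbers for any specific model:

```python
def amortized_share(training_mwh, convs_per_day, lifetime_days,
                    grid_g_per_kwh=350):
    """Per-conversation share of training energy (Wh) and CO2 (g)."""
    total_convs = convs_per_day * lifetime_days
    wh_per_conv = training_mwh * 1e6 / total_convs  # MWh -> Wh
    g_co2 = wh_per_conv * grid_g_per_kwh / 1000
    return wh_per_conv, g_co2, total_convs

# ~50,000 MWh run, ~10M conversations/day, ~1 year of service
wh, g, n = amortized_share(50_000, 10_000_000, 365)
print(f"{n:,} conversations: ~{wh:.1f} Wh, ~{g:.1f} g CO2 each")
```

Note how sensitive the result is to the denominator: halving the service lifetime doubles each conversation's share.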
### RLHF and fine-tuning
Training also includes reinforcement learning from human feedback (RLHF).
This has its own energy cost (additional training runs) and, more
importantly, a human labor cost (see Section 9).
## 8. Embodied carbon and materials
Manufacturing GPUs requires:
- **Rare earth mining** (neodymium, tantalum, cobalt, lithium) — with
associated environmental destruction, water pollution, and often
exploitative labor conditions in the DRC, Chile, China.
- **Semiconductor fabrication** — extremely energy- and water-intensive
(TSMC reports ~15,000 tonnes CO2 per fab per year).
- **Server assembly, shipping, data center construction.**
Per-conversation share is tiny (same large-N amortization), but the
aggregate is significant and the harms (mining pollution, habitat
destruction) are not captured by CO2 metrics alone.
**Not estimated numerically** — the data to do this properly is not
public.
### Critical minerals: human rights dimension
The embodied carbon framing understates the harm. GPU production depends
on gallium (98% sourced from China), germanium, cobalt (DRC), lithium,
tantalum, and palladium. Artisanal cobalt miners in the DRC work without
safety equipment, exposed to dust causing "hard metal lung disease."
Communities face land displacement and environmental contamination. A
2025 Science paper argues that "global majority countries must embed
critical minerals into AI governance" (doi:10.1126/science.aef6678). The
per-conversation share of this suffering is unquantifiable but
structurally real.
## 8b. E-waste
Distinct from embodied carbon. AI-specific GPUs become obsolete in 2-3
years (vs. 5-7 for general servers). Projections: 2.5 million tonnes of
AI-related e-waste per year by 2030 (IEEE Spectrum). E-waste contains
lead, mercury, cadmium, and brominated flame retardants that leach into
soil and water. Recycling yields are negligible due to component
miniaturization. Much of it is processed by workers in developing
countries with minimal protection.
This is not captured by CO2 or embodied-carbon accounting. It is a
distinct toxic-waste externality.
## 8c. Grid displacement and renewable cannibalization
The energy estimates above use average grid carbon intensity. But the
*marginal* impact of additional AI demand may be worse than average. U.S.
data center demand is projected to reach 325-580 TWh by 2028 (IEA),
6.7-12.0% of total U.S. electricity. When AI data centers claim renewable
energy via Power Purchase Agreements, the "additionality" question is
critical: is this new generation, or is it diverting existing renewables
from other consumers? In several regions, AI demand is outpacing grid
capacity, and companies are installing natural gas peakers to fill gaps.
The correct carbon intensity for a conversation's marginal electricity
may therefore be higher than the grid average.
## 8d. Data center community impacts
Data centers impose localized costs that global metrics miss:
- **Noise**: Cooling systems run 24/7 at 55-85 dBA, with the upper end
well above the ~70 dBA commonly cited as the threshold for safe
long-term exposure. Communities near data centers report sleep
disruption and stress.
- **Water**: Evaporative cooling competes with municipal water supply,
particularly in arid regions.
- **Land**: Data center campuses displace other land uses and require
high-voltage transmission lines through residential areas.
- **Jobs**: Data centers create very few long-term jobs relative to
their footprint and resource consumption.
Virginia alone has plans for 70+ new data centers (NPR, 2025). Residents
are increasingly organizing against expansions. The per-conversation
share of these harms is infinitesimal, but each conversation is part of
the demand that justifies new construction.
## 9. Financial cost
### Direct cost
API pricing for frontier models (as of early 2025): ~$15 per million
input tokens, ~$75 per million output tokens (for the most capable
models). Smaller models are cheaper.
Example for a conversation with 2M cumulative input tokens and 10K
output tokens:
```
Input: 2,000,000 tokens * $15/1M = $30.00
Output: 10,000 tokens * $75/1M = $ 0.75
Total: ~$31
```
Longer conversations cost disproportionately more: cumulative input
tokens grow roughly quadratically with the number of turns, since each
turn reprocesses the full history. A very long session (250K+ context,
250+ turns) can easily reach $500-1000.
Subscription pricing (e.g., Claude Code) may differ, but the underlying
compute cost is similar.
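The pricing arithmetic as a reusable helper. The default rates are the early-2025 frontier-model figures quoted above; substitute your provider's actual prices:

```python
def api_cost(input_tokens, output_tokens,
             usd_per_m_input=15.0, usd_per_m_output=75.0):
    """Direct API cost in USD at the quoted per-million-token rates."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

print(f"${api_cost(2_000_000, 10_000):.2f}")
```

This ignores prompt-caching discounts, which can substantially reduce the effective input rate for long conversations.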
### What that money could do instead
To make the opportunity cost concrete:
- ~$30 buys ~15 malaria bed nets via the Against Malaria Foundation
(~$2 per net)
- ~$30 buys ~150 meals at a food bank (~$0.20/meal in bulk)
- ~$30 pays ~15-23 hours of wages for a data annotator in Kenya (Time,
2023: $1.32-2/hour)
This is not to say every dollar should go to charity. But the opportunity
cost is real and should be named.
### Upstream financial costs
Revenue from AI subscriptions funds further model training, hiring, and
GPU procurement. Each conversation is part of a financial loop that
drives continued scaling of AI compute.
## 10. Social cost
### Data annotation labor
LLMs are typically trained using RLHF, which requires human annotators
to rate model outputs. Reporting (Time, January 2023) revealed that
outsourced annotation workers — often in Kenya, Uganda, and India — were
paid $1-2/hour to review disturbing content (violence, abuse, hate
speech) with limited psychological support. Each conversation's marginal
contribution to that demand is infinitesimal, but the system depends on
this labor.
### Displacement effects
LLM assistants can substitute for work previously done by humans: writing
scripts, reviewing code, answering questions. Whether this is net-positive
(freeing people for higher-value work) or net-negative (destroying
livelihoods) depends on the economic context and is genuinely uncertain.
### Cognitive deskilling
A Microsoft/CMU study (Lee et al., CHI 2025) found that higher
confidence in GenAI correlates with less critical thinking effort
([ACM DL](https://dl.acm.org/doi/full/10.1145/3706598.3713778)). An
MIT Media Lab study ("Your Brain on ChatGPT") documented "cognitive
debt" — users who relied on AI for tasks performed worse when later
working independently. Clinical evidence from endoscopy studies shows
that clinicians relying on AI diagnostics saw detection rates drop
from 28.4% to 22.4% when AI was removed. A 2025 Springer paper argues
that AI deskilling is a structural problem, not merely individual
([doi:10.1007/s00146-025-02686-z](https://link.springer.com/article/10.1007/s00146-025-02686-z)).
This is distinct from epistemic risk (misinformation). It is about the
user's cognitive capacity degrading through repeated reliance on the
tool. Each conversation has a marginal deskilling effect that compounds.
### Epistemic effects
LLMs present information with confidence regardless of accuracy. The ease
of generating plausible-sounding text may contribute to an erosion of
epistemic standards if consumed uncritically. Every claim in an LLM
conversation should be verified independently.
### Linguistic homogenization
LLMs are overwhelmingly trained on English (~44% of training data).
A Stanford 2025 study found that AI tools systematically exclude
non-English speakers. UNESCO's 2024 report on linguistic diversity
warns that AI systems risk accelerating the extinction of already-
endangered languages by concentrating economic incentives on high-
resource languages. Each English-language conversation reinforces a
dynamic that works against the more than 3,000 languages currently
classified as endangered.
## 11. Political cost
### Concentration of power
Training frontier models requires billions of dollars and access to
cutting-edge hardware. Only a handful of companies can do this. Each
conversation that flows through these systems reinforces their centrality
and the concentration of a strategically important technology in a few
private actors.
### Geopolitical resource competition
The demand for GPUs drives geopolitical competition for semiconductor
manufacturing capacity (TSMC in Taiwan, export controls on China). Each
conversation is an infinitesimal part of that demand, but it is part of
it.
### Regulatory and democratic implications
AI systems that become deeply embedded in daily work create dependencies
that are difficult to reverse. The more useful a conversation is, the
more it contributes to a dependency on proprietary AI infrastructure that
is not under democratic governance.
### Surveillance and data
Conversations are processed on the provider's servers. File paths, system
configuration, project structures, and code are transmitted and processed
remotely. Even with strong privacy policies, the structural arrangement
— sending detailed information about one's computing environment to a
private company — has implications, particularly across jurisdictions.
### Opaque content filtering
LLM providers apply content filtering that can block outputs without
explanation. The filtering rules are not public: there is no published
specification of what triggers a block, no explanation given when one
occurs, and no appeal mechanism. The user receives a generic error code
("Output blocked by content filtering policy") with no indication of
what content was objectionable.
This has several costs:
- **Reliability**: Any response can be blocked unpredictably. Observed
false positives include responses about open-source licensing (CC0
public domain dedication) — entirely benign content. If a filter can
trigger on that, it can trigger on anything.
- **Chilling effect**: Topics that are more likely to trigger filters
(labor conditions, exploitation, political power) are precisely the
topics that honest impact assessment requires discussing. The filter
creates a structural bias toward safe, anodyne output.
- **Opacity**: The user cannot know in advance which topics or phrasings
will be blocked, cannot understand why a block occurred, and cannot
adjust their request rationally. This is the opposite of the
transparency that democratic governance requires.
- **Asymmetry**: The provider decides what the model may say, with no
input from the user. This is another instance of power concentration
— not over compute resources, but over speech.
The per-conversation cost is small (usually a retry works). The systemic
cost is that a private company exercises opaque editorial control over an
increasingly important communication channel, with no accountability to
the people affected.
## 12. AI-generated code quality and technical debt
Research specific to AI coding agents (CodeRabbit, 2025; Stack Overflow
blog, 2026): AI-generated code introduces 1.7x more issues than
human-written code, with 1.57x more security vulnerabilities and 2.74x
more XSS vulnerabilities. Organizations using AI coding agents saw cycle
time increase 9%, incidents per PR increase 23.5%, and change failure
rate increase 30%.
The availability of easily generated code may discourage the careful
testing that would catch bugs. Any code from an LLM conversation should
be reviewed and tested with the same rigor as code from an untrusted
contributor.
## 13. Model collapse and internet data pollution
Shumailov et al. (Nature, 2024) demonstrated that models trained on
recursively AI-generated data progressively degenerate, losing tail
distributions and eventually converging to distributions unrelated to
reality. Each conversation that produces text which enters the public
internet — Stack Overflow answers, blog posts, documentation — contributes
synthetic data to the commons. Future models trained on this data will be
slightly worse.
The Harvard Journal of Law & Technology has argued for a "right to
uncontaminated human-generated data." Each conversation is a marginal
pollutant.
## 14. Scientific research integrity
If conversation outputs are used in research (literature reviews, data
analysis, writing), they contribute to degradation of scientific knowledge
infrastructure. A PMC article calls LLMs "a potentially existential
threat to online survey research" because coherent AI-generated responses
can no longer be assumed human. PNAS has warned about protecting
scientific integrity in an age of generative AI.
This is distinct from individual epistemic risk — it is systemic
corruption of the knowledge commons.
## 15. Algorithmic monoculture and correlated failure
When millions of users rely on the same few foundation models, errors
become correlated rather than independent. A Stanford HAI study
([Bommasani et al., 2022](https://arxiv.org/abs/2108.07258)) found
that across every model ecosystem studied, the rate of homogeneous
outcomes exceeded baselines. A Nature Communications Psychology paper
(2026) documents that AI-driven research is producing "topical and
methodological convergence, flattening scientific imagination."
For coding specifically: if many developers use the same model, their code
will share the same blind spots, the same idiomatic patterns, and the same
categories of bugs. This reduces the diversity that makes software
ecosystems resilient.
## 16. Creative market displacement
The U.S. Copyright Office's May 2025 Part 3 report states that GenAI
systems "compete with or diminish licensing opportunities for original
human creators." This is not only a training-phase cost (using creators'
work without consent) but an ongoing per-conversation externality: each
conversation that generates creative output (code, text, analysis)
displaces some marginal demand for human work.
## 17. Jevons paradox (meta-methodological)
This entire methodology risks underestimating impact through the
per-conversation framing. As AI models become more efficient and cheaper
per query, total usage scales dramatically, potentially negating
efficiency gains. A 2025 ACM FAccT paper specifically addresses this:
efficiency improvements spur increased consumption. Any per-conversation
estimate should acknowledge that the very affordability of a conversation
increases total conversation volume — each cheap query is part of a
demand signal that drives system-level growth.
## 18. What this methodology does NOT capture
- **Network transmission energy**: Routers, switches, fiber amplifiers,
CDN infrastructure. Data center network bandwidth surged 330% in 2024
due to AI workloads. Small per conversation but not zero.
- **Mental health effects**: RCTs show heavy AI chatbot use correlates
with greater loneliness and dependency. Less directly relevant to
coding agent use, but the boundary between tool use and companionship
is not always clear.
- **Human time**: The user's time has value and its own footprint, but
this is not caused by the conversation.
- **Cultural normalization**: The more AI-generated content becomes
normal, the harder it becomes to opt out. This is a soft lock-in
effect.
## 19. Confidence summary
| Component | Confidence | Could be off by | Quantified? |
|----------------------------------|------------|-----------------|-------------|
| Token count | Low | 2x | Yes |
| Energy per token | Low | 3x | Yes |
| PUE | Medium | 15% | Yes |
| Grid carbon intensity | Medium | 30% | Yes |
| Client-side energy | Medium | 50% | Yes |
| Water usage | Low | 5x | Yes |
| Training (amortized) | Low | 10x | Partly |
| Financial cost | Medium | 2x | Yes |
| Embodied carbon | Very low | Unknown | No |
| Critical minerals / human rights | Very low | Unquantifiable | No |
| E-waste | Very low | Unknown | No |
| Grid displacement | Low | 2-5x | No |
| Community impacts | Very low | Unquantifiable | No |
| Annotation labor | Very low | Unquantifiable | No |
| Cognitive deskilling | Very low | Unquantifiable | Proxy |
| Linguistic homogenization | Very low | Unquantifiable | No |
| Code quality degradation | Low | Variable | Proxy |
| Data pollution / model collapse | Very low | Unquantifiable | Proxy |
| Scientific integrity | Very low | Unquantifiable | No |
| Algorithmic monoculture | Very low | Unquantifiable | Proxy |
| Creative market displacement | Very low | Unquantifiable | No |
| Political cost | Very low | Unquantifiable | No |
| Content filtering (opacity) | Medium | Unquantifiable | No |
| Jevons paradox (systemic) | Low | Fundamental | No |
**Proxy metrics** (marked "Proxy" above): These categories cannot be
directly quantified per conversation, but the impact toolkit now tracks
measurable proxies:
- **Cognitive deskilling**: Automation ratio (AI output tokens / total
tokens). High ratio = more delegation, higher deskilling risk.
- **Code quality degradation**: Test pass/fail counts and file churn
(edits per unique file). High churn or failures = more rework.
- **Data pollution / model collapse**: Public push flag — detects when
AI-generated code is pushed to a public repository.
- **Algorithmic monoculture**: Model ID logged per session, enabling
provider concentration analysis over time.
These proxies are crude — a high automation ratio does not prove
deskilling, and a public push does not prove pollution. But they make
the costs visible and trackable rather than purely abstract.
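As an illustration, the proxies could be computed from simple per-session counters. The field names here are hypothetical, chosen for this sketch rather than taken from any existing toolkit:

```python
from dataclasses import dataclass

@dataclass
class SessionLog:
    # Hypothetical per-session counters; names are illustrative only.
    ai_output_tokens: int
    total_tokens: int
    edits: int
    unique_files: int
    tests_passed: int
    tests_failed: int
    pushed_public: bool

def proxies(s: SessionLog) -> dict:
    """Compute the four proxy metrics described above."""
    total_tests = s.tests_passed + s.tests_failed
    return {
        "automation_ratio": s.ai_output_tokens / s.total_tokens,
        "file_churn": s.edits / max(s.unique_files, 1),
        "test_fail_rate": s.tests_failed / max(total_tests, 1),
        "public_push": s.pushed_public,
    }

p = proxies(SessionLog(8_000, 10_000, 30, 6, 18, 2, False))
print(p)
```

A single session's numbers mean little; the value of these proxies is in trends across many sessions.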
**Overall assessment:** Of the 20+ cost categories identified, only 6
can be quantified with any confidence (inference energy, PUE, grid
intensity, client energy, financial cost, water). Four more now have
proxy metrics that capture a measurable signal, even if indirect. The
remaining categories resist quantification — not because they are small,
but because they are diffuse, systemic, or involve incommensurable
values (human rights, cognitive autonomy, cultural diversity, democratic
governance).
A methodology that only counts what it can measure will systematically
undercount the true cost. The quantifiable costs are almost certainly the
*least important* costs. The most consequential harms — deskilling, data
pollution, monoculture risk, creative displacement, power concentration —
operate at the system level, where per-conversation attribution is
conceptually fraught (see Section 17 on Jevons paradox).
This does not mean the exercise is pointless. Naming the costs, even
without numbers, is a precondition for honest assessment.
## 20. Positive impact: proxy metrics
The sections above measure costs. To assess *net* impact, we also need
to estimate value produced. This is harder — value is contextual, often
delayed, and resistant to quantification. The following proxy metrics are
imperfect but better than ignoring the positive side entirely.
### Reach
How many people are affected by the output of this conversation?
- **1** (only the user) — personal script, private note, learning exercise
- **10-100** — team tooling, internal documentation, small project
- **100-10,000** — open-source library, public documentation, popular blog
- **10,000+** — widely-used infrastructure, security fix in major dependency
Estimation method: check download counts, user counts, dependency graphs,
or audience size for the project or artifact being worked on.
**Known bias:** tendency to overestimate reach. "This could help anyone
who..." is not the same as "this will reach N people." Be conservative.
### Counterfactual
Would the user have achieved a similar result without this conversation?
- **Yes, same speed** — the conversation added no value. Net impact is
purely negative (cost with no benefit).
- **Yes, but slower** — the conversation saved time. Value = time saved *
hourly value of that time. Often modest.
- **Yes, but lower quality** — the conversation improved the output
(caught a bug, suggested a better design). Value depends on what the
quality difference prevents downstream.
- **No** — the user could not have done this alone. The conversation
enabled something that would not otherwise exist. Highest potential
value, but also the highest deskilling risk.
**Known bias:** users and LLMs both overestimate the "no" category.
Most tasks fall in "yes, but slower."
### Durability
How long will the output remain valuable?
- **Minutes** — answered a quick question, resolved a transient confusion.
- **Days to weeks** — wrote a script for a one-off task, debugged a
current issue.
- **Months to years** — created automation, documentation, or tooling
that persists. Caught a design flaw early.
- **Indefinite** — contributed to a public resource that others maintain
and build on.
Durability multiplies reach: a short-lived artifact for 10,000 users may
be worth less than a long-lived one for 100.
### Severity (for bug/security catches)
If the conversation caught or prevented a problem, how bad was it?
- **Cosmetic** — typo, formatting, minor UX issue
- **Functional** — bug that affects correctness for some inputs
- **Security** — vulnerability that could be exploited
- **Data loss / safety** — could cause irreversible harm
Severity * reach = rough value of the catch.
### Reuse
Was the output of the conversation referenced or used again after it
ended? This can only be assessed retrospectively:
- Was the code merged, and is it still in production?
- Was the documentation read by others?
- Was the tool adopted by another project?
Reuse is the strongest evidence of durable value.
### Net impact rubric
Combining cost and value into a qualitative assessment:
| Assessment | Criteria |
|------------|----------|
| **Clearly net-positive** | High reach (1000+) AND (high durability OR high severity catch) AND counterfactual is "no" or "lower quality" |
| **Probably net-positive** | Moderate reach (100+) AND durable output AND counterfactual is at least "slower" |
| **Uncertain** | Low reach but high durability, or high reach but low durability, or hard to assess counterfactual |
| **Probably net-negative** | Low reach (1-10) AND short durability AND counterfactual is "yes, same speed" or "yes, but slower" |
| **Clearly net-negative** | No meaningful output, or output that required extensive debugging, or conversation that went in circles |
**Important:** most conversations between an LLM and a single user
working on private code will fall in the "probably net-negative" to
"uncertain" range. This is not a failure of the conversation — it is an
honest reflection of the cost structure. Net-positive requires broad
reach, which requires the work to be shared.
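The rubric table can be read as a decision procedure. A minimal sketch, assuming simplified inputs (the reach thresholds of 1000, 100, and 10 come from the table; the category strings and the reduction of "no meaningful output / extensive debugging / went in circles" to a single boolean are illustrative simplifications):

```python
def net_impact(reach, durability, counterfactual, severity_catch=False,
               meaningful_output=True):
    """Qualitative assessment following the net impact rubric.

    durability: one of "minutes", "days", "months", "indefinite".
    counterfactual: "same_speed", "slower", "lower_quality", or "no".
    """
    if not meaningful_output:       # stands in for the whole last row
        return "clearly net-negative"
    durable = durability in ("months", "indefinite")
    if (reach >= 1000 and (durable or severity_catch)
            and counterfactual in ("no", "lower_quality")):
        return "clearly net-positive"
    # "at least slower" means anything except "same speed".
    if reach >= 100 and durable and counterfactual != "same_speed":
        return "probably net-positive"
    if (reach <= 10 and not durable
            and counterfactual in ("same_speed", "slower")):
        return "probably net-negative"
    return "uncertain"

print(net_impact(5, "days", "slower"))        # probably net-negative
print(net_impact(5000, "indefinite", "no"))   # clearly net-positive
```

Consistent with the paragraph above, typical single-user, private-code inputs (low reach, short durability, "slower" counterfactual) land in the net-negative branch of this sketch.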
## 21. Related work
This methodology builds on and complements existing tools and research.
### Measurement tools (environmental)
- **[EcoLogits](https://ecologits.ai/)** — Python library from GenAI
Impact that tracks per-query energy and CO2 for API calls. Covers
operational and embodied emissions. More precise than this methodology
for environmental metrics, but does not cover social, epistemic, or
political costs.
- **[CodeCarbon](https://codecarbon.io/)** — Python library that measures
GPU/CPU/RAM electricity consumption in real time with regional carbon
intensity. Primarily for local training workloads. A 2025 validation
study found estimates can be off by ~2.4x vs. external measurements.
- **[Hugging Face AI Energy Score](https://huggingface.github.io/AIEnergyScore/)** —
Standardized energy efficiency benchmarking across AI models. Useful
for model selection but does not provide per-conversation accounting.
- **[Green Algorithms](https://www.green-algorithms.org/)** — Web
calculator from University of Cambridge for any computational workload.
Not AI-specific.
### Published per-query data
- **Patterson et al. (Google, August 2025)**: The most rigorous
  provider-published per-query data. Reports 0.24 Wh, 0.03g CO2, and
  0.26 mL water per median Gemini text prompt, and showed a 33x energy
  reduction over one year. ([arXiv:2508.15734](https://arxiv.org/abs/2508.15734))
- **Jegham et al. ("How Hungry is AI?", May 2025)**: Cross-model
benchmarks for 30 LLMs showing 70x energy variation between models.
([arXiv:2505.09598](https://arxiv.org/abs/2505.09598))
### Broader frameworks
- **UNICC/Frugal AI Hub (December 2025)**: Three-level framework from
  Total Cost of Ownership to SDG alignment. Portfolio-level, not
  per-conversation. Does not enumerate specific social cost categories.
- **Practical Principles for AI Cost and Compute Accounting (arXiv,
February 2025)**: Proposes compute as a governance metric. Financial
and compute only.
### Research on social costs
- **Lee et al. (CHI 2025)**: "The AI Deskilling Paradox" — survey
finding that higher AI confidence correlates with less critical
thinking. See Section 10.
- **Springer (2025)**: Argues deskilling is structural, not individual.
- **Shumailov et al. (Nature, 2024)**: Model collapse from recursive
AI-generated training data. See Section 13.
- **Stanford HAI (2025)**: Algorithmic monoculture and correlated failure
across model ecosystems. See Section 15.
### How this methodology differs
No existing tool or framework combines per-conversation environmental
measurement with social, cognitive, epistemic, and political cost
categories. The tools above measure environmental costs well — we do
not compete with them. Our contribution is the taxonomy: naming and
organizing 20+ cost categories so that the non-environmental costs are
not ignored simply because they are harder to quantify.
## 22. What would improve this estimate
- Access to actual energy-per-token and training energy metrics from
model providers
- Knowledge of the specific data center and its energy source
- Actual token counts from API response headers
- Hardware specifications (GPU model, batch size)
- Transparency about annotation labor conditions and compensation
- Public data on total query volume (to properly amortize training)
- Longitudinal studies on cognitive deskilling specifically from coding
agents
- Empirical measurement of AI data pollution rates in public corpora
- A framework for quantifying concentration-of-power effects (this may
not be possible within a purely quantitative methodology)
- Honest acknowledgment that some costs may be fundamentally
unquantifiable, and that this is a limitation of quantitative
methodology, not evidence of insignificance
## License
This methodology is provided for reuse and adaptation. See the LICENSE
file in this repository.
## Contributing
If you have better data, corrections, or additional cost categories,
contributions are welcome. The goal is not a perfect number but an
honest, improving understanding of costs.