ai-conversation-impact/impact-methodology.md
claude 1b8f9a165e Update methodology confidence summary with proxy metrics
4 categories moved from "Unquantifiable/No" to "Proxy": cognitive
deskilling, code quality degradation, data pollution, algorithmic
monoculture. Added explanation of what each proxy measures and its
limitations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 15:06:29 +00:00


Methodology for Estimating the Impact of an LLM Conversation

Introduction

This document provides a framework for estimating the total cost — environmental, financial, social, and political — of a conversation with a large language model (LLM) running on cloud infrastructure.

Who this is for: Anyone who wants to understand what a conversation with an AI assistant actually costs, beyond the subscription price. This includes developers using coding agents, researchers studying AI sustainability, and anyone making decisions about when AI tools are worth their cost.

How to use it: The framework identifies 20+ cost categories, provides estimation methods for the quantifiable ones, and names the unquantifiable ones so they are not ignored. You can apply it to your own conversations by substituting your own token counts and parameters.

Limitations: Most estimates have low confidence. Many of the most consequential costs cannot be quantified at all. This is a tool for honest approximation, not precise accounting. See the confidence summary (Section 19) for details.

What we are measuring

The total cost of a single LLM conversation. Restricting the analysis to CO2 alone would miss most of the picture.

Cost categories

Environmental:

  1. Inference energy (GPU computation for the conversation)
  2. Training energy (amortized share of the cost of training the model)
  3. Data center overhead (cooling, networking, storage)
  4. Client-side energy (the user's local machine)
  5. Embodied carbon and materials (hardware manufacturing, mining)
  6. E-waste (toxic hardware disposal, distinct from embodied carbon)
  7. Grid displacement (AI demand consuming renewable capacity)
  8. Data center community impacts (noise, land, local resource strain)

Financial and economic:

  9. Direct compute cost and opportunity cost
  10. Creative market displacement (per-conversation, not just training)

Social and cognitive:

  11. Annotation labor conditions
  12. Cognitive deskilling of the user
  13. Mental health effects (dependency, loneliness paradox)
  14. Linguistic homogenization and language endangerment

Epistemic and systemic:

  15. AI-generated code quality degradation and technical debt
  16. Model collapse / internet data pollution
  17. Scientific research integrity contamination
  18. Algorithmic monoculture and correlated failure risk

Political:

  19. Concentration of power, geopolitical implications, data sovereignty

Meta-methodological:

  20. Jevons paradox (efficiency gains driving increased total usage)

1. Token estimation

Why tokens matter

LLM inference cost scales with the number of tokens processed. Each time the model produces a response, it reprocesses the entire conversation history (input tokens) and generates new text (output tokens). Output tokens are more expensive per token because they are generated sequentially, each requiring a full forward pass, whereas input tokens can be processed in parallel.

How to estimate

If you have access to API response headers or usage metadata, use the actual token counts. Otherwise, estimate:

  • Bytes to tokens: English text and JSON average ~4 bytes per token (range: 3.5-4.5 depending on content type). Code tends toward the higher end.
  • Cumulative input tokens: Each assistant turn reprocesses the full context. For a conversation with N turns and final context size T, the cumulative input tokens are approximately T/2 * N (the average context size times the number of turns, assuming the context grows roughly linearly over the conversation).
  • Output tokens: Typically 1-5% of the total transcript size, depending on how verbose the assistant is.

Example

A 20-turn conversation with a 200K-token final context:

  • Cumulative input: ~100K * 20 = ~2,000,000 tokens
  • Output: ~10,000 tokens
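
The heuristics above can be sketched in a few lines. The 4-bytes-per-token ratio, the linear-growth (T/2 * N) assumption, and the 3% output fraction are this section's approximations, not measured values:

```python
def estimate_tokens(transcript_bytes, n_turns, output_fraction=0.03):
    """Estimate token counts from transcript size using the heuristics above.

    Assumes ~4 bytes per token and that the average context over the
    conversation is half the final context (roughly linear growth).
    """
    final_context = transcript_bytes / 4             # ~4 bytes per token
    cumulative_input = final_context / 2 * n_turns   # T/2 * N
    output = final_context * output_fraction         # 1-5% of transcript
    return cumulative_input, output

# The worked example: 20 turns, ~200K-token final context (~800 KB of text)
inp, out = estimate_tokens(800_000, 20)
# inp ≈ 2,000,000 cumulative input tokens
```

With actual API usage metadata available, skip the estimation entirely and use the reported counts.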

Uncertainty

Token estimates from byte counts can be off by a factor of 2. Key unknowns:

  • The model's exact tokenization (tokens per byte ratio varies by content)
  • Whether context caching reduces reprocessing
  • The exact number of internal inference calls (tool sequences may involve multiple calls)
  • Whether the system compresses prior messages near context limits

2. Energy per token

Sources

Published energy-per-query data has improved significantly since 2024. Key sources, from most to least reliable:

  • Patterson et al. (Google, August 2025): First major provider to publish detailed per-query data. Reports 0.24 Wh per median Gemini text prompt including full data center infrastructure. Also showed 33x energy reduction over one year through efficiency improvements. (arXiv:2508.15734)
  • Jegham et al. ("How Hungry is AI?", May 2025): Cross-model benchmarks for 30 LLMs. Found o3 and DeepSeek-R1 consume >33 Wh per long prompt (70x more than GPT-4.1 nano). Claude 3.7 Sonnet ranked highest eco-efficiency. (arXiv:2505.09598)
  • The IEA's 2024 estimate of ~2.9 Wh per ChatGPT query (for GPT-4-class models, averaging ~1,000 tokens per query).
  • De Vries (2023), "The growing energy footprint of artificial intelligence", Joule.
  • Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint of BLOOM", which measured energy for a 176B parameter model.

Calibration against published data

Google's 0.24 Wh per median Gemini prompt represents a short query (likely ~500-1000 tokens). For a long coding conversation with 2M cumulative input tokens and 10K output tokens, that's roughly 2000-4000 prompt-equivalent interactions. Naively scaling: 2000 × 0.24 Wh = 480 Wh, though KV-cache and batching optimizations would reduce this in practice.

The Jegham et al. benchmarks show enormous variation by model: a single long prompt ranges from 0.4 Wh (GPT-4.1 nano) to >33 Wh (o3, DeepSeek-R1). For frontier reasoning models, a long conversation could consume significantly more than our previous estimates.

Values used

  • Input tokens: ~0.05-0.3 Wh per 1,000 tokens
  • Output tokens: ~0.25-1.5 Wh per 1,000 tokens (5x input cost, reflecting sequential generation)

The wide ranges reflect model variation. The lower end corresponds to efficient models (GPT-4.1 mini, Claude 3.7 Sonnet); the upper end to frontier reasoning models (o3, DeepSeek-R1).

Previous values (used in versions before March 2026): 0.003 and 0.015 Wh per 1,000 tokens respectively. These were derived from pre-2025 estimates and are now known to be approximately 10-100x too low based on Google's published data.

Uncertainty

The true values depend on:

  • Model size and architecture (reasoning models use chain-of-thought, consuming far more tokens internally)
  • Hardware (GPU type, batch size, utilization)
  • Quantization and optimization techniques
  • Whether speculative decoding or KV-cache optimizations are used
  • Provider-specific infrastructure efficiency

The true values could be 0.3x to 3x the midpoint figures used here. The variation between models now dominates the uncertainty — choosing a different model can change energy by 70x (Jegham et al.).

3. Data center overhead (PUE)

Power Usage Effectiveness (PUE) measures total data center energy divided by IT equipment energy. It accounts for cooling, lighting, networking, and other infrastructure.

  • Value used: PUE = 1.2
  • Source: Google reports PUE of 1.10 for its best data centers; industry average is ~1.3 (Uptime Institute, 2023). 1.2 is a reasonable estimate for a major cloud provider.

This is relatively well-established and unlikely to be off by more than 15%.

4. Client-side energy

The user's machine contributes a small amount of energy during the conversation. For a typical desktop or laptop:

  • Idle power: ~30-60W (desktop) or ~10-20W (laptop)
  • Marginal power for active use: ~5-20W above idle
  • Duration: varies by conversation length

For a 30-minute conversation on a desktop, estimate ~0.5-1 Wh of marginal energy attributable to the conversation itself (a few minutes of active load above idle). This is typically a small fraction of the total, so rough precision suffices.

5. CO2 conversion

Grid carbon intensity

CO2 per kWh depends on the electricity source:

  • US grid average: ~400g CO2/kWh (EPA eGRID)
  • Major cloud data center regions: ~300-400g CO2/kWh
  • France (nuclear-dominated): ~56g CO2/kWh
  • Norway/Iceland (hydro-dominated): ~20-30g CO2/kWh
  • Poland/Australia (coal-heavy): ~600-800g CO2/kWh

Use physical grid intensity for the data center's region, not accounting for renewable energy credits or offsets. The physical electrons consumed come from the regional grid in real time.

Calculation template

Using midpoint values (0.1 Wh/1K input, 0.5 Wh/1K output):

Server energy = (cumulative_input_tokens * 0.1/1000
                 + output_tokens * 0.5/1000) * PUE

Server CO2    = server_energy_Wh * grid_intensity_g_per_kWh / 1000

Client CO2    = client_energy_Wh * local_grid_intensity / 1000

Total CO2     = Server CO2 + Client CO2

Example

A conversation with 2M cumulative input tokens and 10K output tokens:

Server energy = (2,000,000 * 0.1/1000 + 10,000 * 0.5/1000) * 1.2
              = (200 + 5.0) * 1.2
              = ~246 Wh

Server CO2    = 246 * 350 / 1000 = ~86g CO2

Client CO2    = 0.5 * 56 / 1000  = ~0.03g CO2  (France)

Total CO2     = ~86g
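
The template as a runnable function. The defaults are the midpoint and example values used in this section (0.1 and 0.5 Wh per 1K tokens, PUE 1.2, 350 g/kWh server grid, French client grid):

```python
def conversation_co2(cumulative_input_tokens, output_tokens,
                     pue=1.2, grid_g_per_kwh=350,
                     wh_per_1k_input=0.1, wh_per_1k_output=0.5,
                     client_wh=0.5, client_grid_g_per_kwh=56):
    """Server energy (Wh) and total CO2 (g) per the calculation template."""
    server_wh = (cumulative_input_tokens * wh_per_1k_input / 1000
                 + output_tokens * wh_per_1k_output / 1000) * pue
    server_co2_g = server_wh * grid_g_per_kwh / 1000
    client_co2_g = client_wh * client_grid_g_per_kwh / 1000
    return server_wh, server_co2_g + client_co2_g

wh, co2 = conversation_co2(2_000_000, 10_000)
# wh ≈ 246 Wh, co2 ≈ 86 g, matching the worked example
```

Substituting the upper-end energy values from Section 2 (0.3 and 1.5 Wh per 1K tokens) roughly triples both figures, which is the spread the confidence summary reports.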

This lands at the top of the headline range of 100-250 Wh, and slightly above the 30-80g CO2 range, for a long conversation. The previous version of this methodology estimated ~7.4 Wh for the same conversation, which was ~30x too low.

6. Water usage

Data centers use water for evaporative cooling. Li et al. (2023), "Making AI Less Thirsty", estimated that GPT-3 inference consumes roughly 500 mL of water per 10-50 medium-length responses, i.e. on the order of 10-50 mL per response. Scaling for model size and output volume:

Rough estimate: 0.05-0.5 liters per long conversation.

This depends heavily on the data center's cooling technology (some use closed-loop systems with near-zero water consumption) and the local climate.

7. Training cost (amortized)

Why it cannot be dismissed

Training is not a sunk cost. It is an investment made in anticipation of demand. Each conversation is part of the demand that justifies training the current model and funding the next one. The marginal cost framing hides the system-level cost.

Scale of training

Published and estimated figures for frontier model training:

  • GPT-3 (175B params, 2020): ~1,287 MWh (Patterson et al., 2021)
  • GPT-4 (2023): estimated ~50,000-100,000 MWh (unconfirmed)
  • Frontier models in 2025-2026: likely 10,000-200,000 MWh range

At 350g CO2/kWh, a 50,000 MWh training run produces ~17,500 tonnes of CO2.

Amortization

If the model serves N total conversations over its lifetime, each conversation's share is (training cost / N). Rough reasoning:

  • If a major model serves ~10 million conversations per day for ~1 year: N ~ 3.65 billion conversations.
  • Per-conversation share: 50,000 MWh = 5 × 10^10 Wh; 5 × 10^10 Wh / 3.65 × 10^9 conversations ~ 14 Wh, or ~5g CO2 at 350g/kWh.

This is small per conversation — but only because the denominator is enormous. The total remains vast. Two framings:

  • Marginal: My share is ~5g CO2. Small next to the ~86g from inference.
  • Attributional: I am one of billions of participants in a system that emits ~17,500 tonnes. My participation sustains the system.

Neither framing is wrong. They answer different questions.
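
The amortization arithmetic is sensitive to unit slips (1 MWh = 10^6 Wh), so it is worth spelling out. The training-run size and conversation volume are this section's assumptions, not published figures:

```python
training_mwh = 50_000                 # assumed frontier training run (Section 7)
conversations = 10_000_000 * 365      # ~10M conversations/day for ~1 year

training_wh = training_mwh * 1_000_000     # 1 MWh = 1,000,000 Wh
share_wh = training_wh / conversations     # ~14 Wh per conversation
share_co2_g = share_wh * 350 / 1000        # ~5 g CO2 at 350 g/kWh
```

Both inputs are uncertain by an order of magnitude, so the share is too; but even 10x in either direction leaves the marginal share well below the inference energy of a long conversation.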

RLHF and fine-tuning

Training also includes reinforcement learning from human feedback (RLHF). This has its own energy cost (additional training runs) and, more importantly, a human labor cost (see Section 9).

8. Embodied carbon and materials

Manufacturing GPUs requires:

  • Rare earth mining (neodymium, tantalum, cobalt, lithium) — with associated environmental destruction, water pollution, and often exploitative labor conditions in the DRC, Chile, China.
  • Semiconductor fabrication — extremely energy- and water-intensive (TSMC reports ~15,000 tonnes CO2 per fab per year).
  • Server assembly, shipping, data center construction.

Per-conversation share is tiny (same large-N amortization), but the aggregate is significant and the harms (mining pollution, habitat destruction) are not captured by CO2 metrics alone.

Not estimated numerically — the data to do this properly is not public.

Critical minerals: human rights dimension

The embodied carbon framing understates the harm. GPU production depends on gallium (98% sourced from China), germanium, cobalt (DRC), lithium, tantalum, and palladium. Artisanal cobalt miners in the DRC work without safety equipment, exposed to dust causing "hard metal lung disease." Communities face land displacement and environmental contamination. A 2025 Science paper argues that "global majority countries must embed critical minerals into AI governance" (doi:10.1126/science.aef6678). The per-conversation share of this suffering is unquantifiable but structurally real.

8b. E-waste

Distinct from embodied carbon. AI-specific GPUs become obsolete in 2-3 years (vs. 5-7 for general servers). Projections: 2.5 million tonnes of AI-related e-waste per year by 2030 (IEEE Spectrum). E-waste contains lead, mercury, cadmium, and brominated flame retardants that leach into soil and water. Recycling yields are negligible due to component miniaturization. Much of it is processed by workers in developing countries with minimal protection.

This is not captured by CO2 or embodied-carbon accounting. It is a distinct toxic-waste externality.

8c. Grid displacement and renewable cannibalization

The energy estimates above use average grid carbon intensity. But the marginal impact of additional AI demand may be worse than average. U.S. data center demand is projected to reach 325-580 TWh by 2028 (IEA), 6.7-12.0% of total U.S. electricity. When AI data centers claim renewable energy via Power Purchase Agreements, the "additionality" question is critical: is this new generation, or is it diverting existing renewables from other consumers? In several regions, AI demand is outpacing grid capacity, and companies are installing natural gas peakers to fill gaps.

The correct carbon intensity for a conversation's marginal electricity may therefore be higher than the grid average.

8d. Data center community impacts

Data centers impose localized costs that global metrics miss:

  • Noise: Cooling systems run 24/7 at 55-85 dBA, with the upper end well above the ~70 dBA commonly cited as the safe long-term exposure level. Communities near data centers report sleep disruption and stress.
  • Water: Evaporative cooling competes with municipal water supply, particularly in arid regions.
  • Land: Data center campuses displace other land uses and require high-voltage transmission lines through residential areas.
  • Jobs: Data centers create very few long-term jobs relative to their footprint and resource consumption.

Virginia alone has plans for 70+ new data centers (NPR, 2025). Residents are increasingly organizing against expansions. The per-conversation share of these harms is infinitesimal, but each conversation is part of the demand that justifies new construction.

9. Financial cost

Direct cost

API pricing for frontier models (as of early 2025): ~$15 per million input tokens, ~$75 per million output tokens (for the most capable models). Smaller models are cheaper.

Example for a conversation with 2M cumulative input tokens and 10K output tokens:

Input:  2,000,000 tokens * $15/1M  = $30.00
Output:    10,000 tokens * $75/1M  = $ 0.75
Total: ~$31

Longer conversations cost more because cumulative input tokens grow superlinearly. A very long session (250K+ context, 250+ turns) can easily reach $500-1000.

Subscription pricing (e.g., Claude Code) may differ, but the underlying compute cost is similar.
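
A sketch of how cost compounds with conversation length, using the early-2025 pricing above. The per-turn token counts are illustrative assumptions, and context growth is taken as linear:

```python
def api_cost(n_turns, tokens_per_turn=10_000,
             usd_per_m_input=15, usd_per_m_output=75,
             output_tokens_per_turn=500):
    """Cumulative API cost in USD: turn k reprocesses k * tokens_per_turn."""
    cumulative_input = sum(k * tokens_per_turn for k in range(1, n_turns + 1))
    output = n_turns * output_tokens_per_turn
    return (cumulative_input * usd_per_m_input
            + output * usd_per_m_output) / 1_000_000

# Cost grows roughly with the square of the number of turns:
# api_cost(20) ≈ $32; api_cost(40) ≈ $124
```

Doubling the turn count roughly quadruples the cost, which is why long sessions reach hundreds of dollars.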

What that money could do instead

To make the opportunity cost concrete:

  • ~$30 buys ~30 malaria bed nets via the Against Malaria Foundation
  • ~$30 buys 150 meals at a food bank ($0.20/meal in bulk)
  • ~$30 pays ~15-23 hours of wages for a data annotator in Kenya (Time, 2023: $1.32-2/hour)

This is not to say every dollar should go to charity. But the opportunity cost is real and should be named.

Upstream financial costs

Revenue from AI subscriptions funds further model training, hiring, and GPU procurement. Each conversation is part of a financial loop that drives continued scaling of AI compute.

10. Social cost

Data annotation labor

LLMs are typically trained using RLHF, which requires human annotators to rate model outputs. Reporting (Time, January 2023) revealed that outsourced annotation workers — often in Kenya, Uganda, and India — were paid $1-2/hour to review disturbing content (violence, abuse, hate speech) with limited psychological support. Each conversation's marginal contribution to that demand is infinitesimal, but the system depends on this labor.

Displacement effects

LLM assistants can substitute for work previously done by humans: writing scripts, reviewing code, answering questions. Whether this is net-positive (freeing people for higher-value work) or net-negative (destroying livelihoods) depends on the economic context and is genuinely uncertain.

Cognitive deskilling

A Microsoft/CMU study (Lee et al., CHI 2025) found that higher confidence in GenAI correlates with less critical thinking effort (ACM DL). An MIT Media Lab study ("Your Brain on ChatGPT") documented "cognitive debt" — users who relied on AI for tasks performed worse when later working independently. Clinical evidence from endoscopy studies shows that clinicians relying on AI diagnostics saw detection rates drop from 28.4% to 22.4% when AI was removed. A 2025 Springer paper argues that AI deskilling is a structural problem, not merely individual (doi:10.1007/s00146-025-02686-z).

This is distinct from epistemic risk (misinformation). It is about the user's cognitive capacity degrading through repeated reliance on the tool. Each conversation has a marginal deskilling effect that compounds.

Epistemic effects

LLMs present information with confidence regardless of accuracy. The ease of generating plausible-sounding text may contribute to an erosion of epistemic standards if consumed uncritically. Every claim in an LLM conversation should be verified independently.

Linguistic homogenization

LLMs are overwhelmingly trained on English (~44% of training data). A Stanford 2025 study found that AI tools systematically exclude non-English speakers. UNESCO's 2024 report on linguistic diversity warns that AI systems risk accelerating the extinction of already-endangered languages by concentrating economic incentives on high-resource languages. Each English-language conversation reinforces this dynamic, marginalizing over 3,000 already-endangered languages.

11. Political cost

Concentration of power

Training frontier models requires billions of dollars and access to cutting-edge hardware. Only a handful of companies can do this. Each conversation that flows through these systems reinforces their centrality and the concentration of a strategically important technology in a few private actors.

Geopolitical resource competition

The demand for GPUs drives geopolitical competition for semiconductor manufacturing capacity (TSMC in Taiwan, export controls on China). Each conversation is an infinitesimal part of that demand, but it is part of it.

Regulatory and democratic implications

AI systems that become deeply embedded in daily work create dependencies that are difficult to reverse. The more useful a conversation is, the more it contributes to a dependency on proprietary AI infrastructure that is not under democratic governance.

Surveillance and data

Conversations are processed on the provider's servers. File paths, system configuration, project structures, and code are transmitted and processed remotely. Even with strong privacy policies, the structural arrangement — sending detailed information about one's computing environment to a private company — has implications, particularly across jurisdictions.

Opaque content filtering

LLM providers apply content filtering that can block outputs without explanation. The filtering rules are not public: there is no published specification of what triggers a block, no explanation given when one occurs, and no appeal mechanism. The user receives a generic error code ("Output blocked by content filtering policy") with no indication of what content was objectionable.

This has several costs:

  • Reliability: Any response can be blocked unpredictably. Observed false positives include responses about open-source licensing (CC0 public domain dedication) — entirely benign content. If a filter can trigger on that, it can trigger on anything.
  • Chilling effect: Topics that are more likely to trigger filters (labor conditions, exploitation, political power) are precisely the topics that honest impact assessment requires discussing. The filter creates a structural bias toward safe, anodyne output.
  • Opacity: The user cannot know in advance which topics or phrasings will be blocked, cannot understand why a block occurred, and cannot adjust their request rationally. This is the opposite of the transparency that democratic governance requires.
  • Asymmetry: The provider decides what the model may say, with no input from the user. This is another instance of power concentration — not over compute resources, but over speech.

The per-conversation cost is small (usually a retry works). The systemic cost is that a private company exercises opaque editorial control over an increasingly important communication channel, with no accountability to the people affected.

12. AI-generated code quality and technical debt

Research specific to AI coding agents (CodeRabbit, 2025; Stack Overflow blog, 2026): AI-generated code introduces 1.7x more issues than human-written code, with 1.57x more security vulnerabilities and 2.74x more XSS vulnerabilities. Organizations using AI coding agents saw cycle time increase 9%, incidents per PR increase 23.5%, and change failure rate increase 30%.

The availability of easily generated code may discourage the careful testing that would catch bugs. Any code from an LLM conversation should be reviewed and tested with the same rigor as code from an untrusted contributor.

13. Model collapse and internet data pollution

Shumailov et al. (Nature, 2024) demonstrated that models trained on recursively AI-generated data progressively degenerate, losing tail distributions and eventually converging to distributions unrelated to reality. Each conversation that produces text which enters the public internet — Stack Overflow answers, blog posts, documentation — contributes synthetic data to the commons. Future models trained on this data will be slightly worse.

The Harvard Journal of Law & Technology has argued for a "right to uncontaminated human-generated data." Each conversation is a marginal pollutant.

14. Scientific research integrity

If conversation outputs are used in research (literature reviews, data analysis, writing), they contribute to degradation of scientific knowledge infrastructure. A PMC article calls LLMs "a potentially existential threat to online survey research" because coherent AI-generated responses can no longer be assumed human. PNAS has warned about protecting scientific integrity in an age of generative AI.

This is distinct from individual epistemic risk — it is systemic corruption of the knowledge commons.

15. Algorithmic monoculture and correlated failure

When millions of users rely on the same few foundation models, errors become correlated rather than independent. A Stanford HAI study (Bommasani et al., 2022) found that across every model ecosystem studied, the rate of homogeneous outcomes exceeded baselines. A Nature Communications Psychology paper (2026) documents that AI-driven research is producing "topical and methodological convergence, flattening scientific imagination."

For coding specifically: if many developers use the same model, their code will share the same blind spots, the same idiomatic patterns, and the same categories of bugs. This reduces the diversity that makes software ecosystems resilient.

16. Creative market displacement

The U.S. Copyright Office's May 2025 Part 3 report states that GenAI systems "compete with or diminish licensing opportunities for original human creators." This is not only a training-phase cost (using creators' work without consent) but an ongoing per-conversation externality: each conversation that generates creative output (code, text, analysis) displaces some marginal demand for human work.

17. Jevons paradox (meta-methodological)

This entire methodology risks underestimating impact through the per-conversation framing. As AI models become more efficient and cheaper per query, total usage scales dramatically, potentially negating efficiency gains. A 2025 ACM FAccT paper specifically addresses this: efficiency improvements spur increased consumption. Any per-conversation estimate should acknowledge that the very affordability of a conversation increases total conversation volume — each cheap query is part of a demand signal that drives system-level growth.

18. What this methodology does NOT capture

  • Network transmission energy: Routers, switches, fiber amplifiers, CDN infrastructure. Data center network bandwidth surged 330% in 2024 due to AI workloads. Small per conversation but not zero.
  • Mental health effects: RCTs show heavy AI chatbot use correlates with greater loneliness and dependency. Less directly relevant to coding agent use, but the boundary between tool use and companionship is not always clear.
  • Human time: The user's time has value and its own footprint, but this is not caused by the conversation.
  • Cultural normalization: The more AI-generated content becomes normal, the harder it becomes to opt out. This is a soft lock-in effect.

19. Confidence summary

Component                        | Confidence | Could be off by | Quantified?
Token count                      | Low        | 2x              | Yes
Energy per token                 | Low        | 3x              | Yes
PUE                              | Medium     | 15%             | Yes
Grid carbon intensity            | Medium     | 30%             | Yes
Client-side energy               | Medium     | 50%             | Yes
Water usage                      | Low        | 5x              | Yes
Training (amortized)             | Low        | 10x             | Partly
Financial cost                   | Medium     | 2x              | Yes
Embodied carbon                  | Very low   | Unknown         | No
Critical minerals / human rights | Very low   | Unquantifiable  | No
E-waste                          | Very low   | Unknown         | No
Grid displacement                | Low        | 2-5x            | No
Community impacts                | Very low   | Unquantifiable  | No
Annotation labor                 | Very low   | Unquantifiable  | No
Cognitive deskilling             | Very low   | Unquantifiable  | Proxy
Linguistic homogenization        | Very low   | Unquantifiable  | No
Code quality degradation         | Low        | Variable        | Proxy
Data pollution / model collapse  | Very low   | Unquantifiable  | Proxy
Scientific integrity             | Very low   | Unquantifiable  | No
Algorithmic monoculture          | Very low   | Unquantifiable  | Proxy
Creative market displacement     | Very low   | Unquantifiable  | No
Political cost                   | Very low   | Unquantifiable  | No
Content filtering (opacity)      | Medium     | Unquantifiable  | No
Jevons paradox (systemic)        | Low        | Fundamental     | No

Proxy metrics (marked "Proxy" above): These categories cannot be directly quantified per conversation, but the impact toolkit now tracks measurable proxies:

  • Cognitive deskilling: Automation ratio (AI output tokens / total tokens). High ratio = more delegation, higher deskilling risk.
  • Code quality degradation: Test pass/fail counts and file churn (edits per unique file). High churn or failures = more rework.
  • Data pollution / model collapse: Public push flag — detects when AI-generated code is pushed to a public repository.
  • Algorithmic monoculture: Model ID logged per session, enabling provider concentration analysis over time.

These proxies are crude — a high automation ratio does not prove deskilling, and a public push does not prove pollution. But they make the costs visible and trackable rather than purely abstract.

Overall assessment: Of the 20+ cost categories identified, only 6 can be quantified with any confidence (inference energy, PUE, grid intensity, client energy, financial cost, water). Four more now have proxy metrics that capture a measurable signal, even if indirect. The remaining categories resist quantification — not because they are small, but because they are diffuse, systemic, or involve incommensurable values (human rights, cognitive autonomy, cultural diversity, democratic governance).

A methodology that only counts what it can measure will systematically undercount the true cost. The quantifiable costs are almost certainly the least important costs. The most consequential harms — deskilling, data pollution, monoculture risk, creative displacement, power concentration — operate at the system level, where per-conversation attribution is conceptually fraught (see Section 17 on Jevons paradox).

This does not mean the exercise is pointless. Naming the costs, even without numbers, is a precondition for honest assessment.

20. Positive impact: proxy metrics

The sections above measure costs. To assess net impact, we also need to estimate value produced. This is harder — value is contextual, often delayed, and resistant to quantification. The following proxy metrics are imperfect but better than ignoring the positive side entirely.

Reach

How many people are affected by the output of this conversation?

  • 1 (only the user) — personal script, private note, learning exercise
  • 10-100 — team tooling, internal documentation, small project
  • 100-10,000 — open-source library, public documentation, popular blog
  • 10,000+ — widely-used infrastructure, security fix in major dependency

Estimation method: check download counts, user counts, dependency graphs, or audience size for the project or artifact being worked on.

Known bias: tendency to overestimate reach. "This could help anyone who..." is not the same as "this will reach N people." Be conservative.

Counterfactual

Would the user have achieved a similar result without this conversation?

  • Yes, same speed — the conversation added no value. Net impact is purely negative (cost with no benefit).
  • Yes, but slower — the conversation saved time. Value = time saved * hourly value of that time. Often modest.
  • Yes, but lower quality — the conversation improved the output (caught a bug, suggested a better design). Value depends on what the quality difference prevents downstream.
  • No — the user could not have done this alone. The conversation enabled something that would not otherwise exist. Highest potential value, but also the highest deskilling risk.

Known bias: users and LLMs both overestimate the "no" category. Most tasks fall in "yes, but slower."

Durability

How long will the output remain valuable?

  • Minutes — answered a quick question, resolved a transient confusion.
  • Days to weeks — wrote a script for a one-off task, debugged a current issue.
  • Months to years — created automation, documentation, or tooling that persists. Caught a design flaw early.
  • Indefinite — contributed to a public resource that others maintain and build on.

Durability multiplies reach: a short-lived artifact for 10,000 users may be worth less than a long-lived one for 100.

Severity (for bug/security catches)

If the conversation caught or prevented a problem, how bad was it?

  • Cosmetic — typo, formatting, minor UX issue
  • Functional — bug that affects correctness for some inputs
  • Security — vulnerability that could be exploited
  • Data loss / safety — could cause irreversible harm

Severity * reach = rough value of the catch.

Reuse

Was the output of the conversation referenced or used again after it ended? This can only be assessed retrospectively:

  • Was the code merged and still in production?
  • Was the documentation read by others?
  • Was the tool adopted by another project?

Reuse is the strongest evidence of durable value.

Net impact rubric

Combining cost and value into a qualitative assessment:

Assessment            | Criteria
Clearly net-positive  | High reach (1000+) AND (high durability OR high severity catch) AND counterfactual is "no" or "lower quality"
Probably net-positive | Moderate reach (100+) AND durable output AND counterfactual is at least "slower"
Uncertain             | Low reach but high durability, or high reach but low durability, or hard to assess counterfactual
Probably net-negative | Low reach (1-10) AND short durability AND counterfactual is "yes, same speed" or "yes, but slower"
Clearly net-negative  | No meaningful output, or output that required extensive debugging, or conversation that went in circles

Important: most conversations between an LLM and a single user working on private code will fall in the "probably net-negative" to "uncertain" range. This is not a failure of the conversation — it is an honest reflection of the cost structure. Net-positive requires broad reach, which requires the work to be shared.
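The rubric above can be sketched as a classifier. The thresholds mirror the rubric table; the parameter names and branch encoding are assumptions made for illustration:

```python
def net_impact(reach: int, durable: bool, counterfactual: str,
               high_severity_catch: bool = False,
               no_meaningful_output: bool = False) -> str:
    """Qualitative net-impact assessment mirroring the rubric table.

    counterfactual: "same", "slower", "quality", or "impossible".
    """
    if no_meaningful_output:
        return "clearly net-negative"
    if (reach >= 1000 and (durable or high_severity_catch)
            and counterfactual in ("quality", "impossible")):
        return "clearly net-positive"
    if reach >= 100 and durable and counterfactual != "same":
        return "probably net-positive"
    if reach <= 10 and not durable and counterfactual in ("same", "slower"):
        return "probably net-negative"
    return "uncertain"

# Typical single-user session on private code:
print(net_impact(reach=1, durable=False, counterfactual="slower"))
# probably net-negative
```

Note that the default path is "uncertain": anything that does not clearly match a rubric row falls through, which matches the spirit of the table.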

21. Related work

This methodology builds on and complements existing tools and research.

Measurement tools (environmental)

  • EcoLogits — Python library from GenAI Impact that tracks per-query energy and CO2 for API calls. Covers operational and embodied emissions. More precise than this methodology for environmental metrics, but does not cover social, epistemic, or political costs.
  • CodeCarbon — Python library that measures GPU/CPU/RAM electricity consumption in real time with regional carbon intensity. Primarily for local training workloads. A 2025 validation study found estimates can be off by ~2.4x vs. external measurements.
  • Hugging Face AI Energy Score — Standardized energy efficiency benchmarking across AI models. Useful for model selection but does not provide per-conversation accounting.
  • Green Algorithms — Web calculator from University of Cambridge for any computational workload. Not AI-specific.

Published per-query data

  • Patterson et al. (Google, August 2025): Most rigorous provider-published per-query data. Reports 0.24 Wh, 0.03 g CO2, and 0.26 mL water per median Gemini text prompt. Showed a 33x energy reduction over one year. (arXiv:2508.15734)
  • Jegham et al. ("How Hungry is AI?", May 2025): Cross-model benchmarks for 30 LLMs showing 70x energy variation between models. (arXiv:2505.09598)
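The per-query figures above support quick back-of-the-envelope totals. As a sketch, assuming a hypothetical 40-prompt session at the Patterson et al. medians (real per-prompt costs vary widely by model, prompt length, and region):

```python
# Median per-prompt figures from Patterson et al. (Google, 2025).
WH_PER_PROMPT = 0.24        # energy (Wh)
G_CO2_PER_PROMPT = 0.03     # emissions (g CO2)
ML_WATER_PER_PROMPT = 0.26  # water (mL)

prompts = 40  # hypothetical session length
print(f"{prompts * WH_PER_PROMPT:.1f} Wh")              # 9.6 Wh
print(f"{prompts * G_CO2_PER_PROMPT:.1f} g CO2")        # 1.2 g CO2
print(f"{prompts * ML_WATER_PER_PROMPT:.1f} mL water")  # 10.4 mL water
```

Given the 70x between-model variation reported by Jegham et al., such totals are order-of-magnitude estimates at best.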

Broader frameworks

  • UNICC/Frugal AI Hub (December 2025): Three-level framework from Total Cost of Ownership to SDG alignment. Portfolio-level, not per-conversation. Does not enumerate specific social cost categories.
  • Practical Principles for AI Cost and Compute Accounting (arXiv, February 2025): Proposes compute as a governance metric. Financial and compute only.

Research on social costs

  • Lee et al. (CHI 2025): "The AI Deskilling Paradox" — survey finding that higher AI confidence correlates with less critical thinking. See Section 10.
  • Springer (2025): Argues deskilling is structural, not individual.
  • Shumailov et al. (Nature, 2024): Model collapse from recursive AI-generated training data. See Section 13.
  • Stanford HAI (2025): Algorithmic monoculture and correlated failure across model ecosystems. See Section 15.

How this methodology differs

No existing tool or framework combines per-conversation environmental measurement with social, cognitive, epistemic, and political cost categories. The tools above measure environmental costs well — we do not compete with them. Our contribution is the taxonomy: naming and organizing 20+ cost categories so that the non-environmental costs are not ignored simply because they are harder to quantify.

22. What would improve this estimate

  • Access to actual energy-per-token and training energy metrics from model providers
  • Knowledge of the specific data center and its energy source
  • Actual token counts from API response headers
  • Hardware specifications (GPU model, batch size)
  • Transparency about annotation labor conditions and compensation
  • Public data on total query volume (to properly amortize training)
  • Longitudinal studies on cognitive deskilling specifically from coding agents
  • Empirical measurement of AI data pollution rates in public corpora
  • A framework for quantifying concentration-of-power effects (this may not be possible within a purely quantitative methodology)
  • Honest acknowledgment that some costs may be fundamentally unquantifiable, and that this is a limitation of quantitative methodology, not evidence of insignificance

License

This methodology is provided for reuse and adaptation. See the LICENSE file in this repository.

Contributing

If you have better data, corrections, or additional cost categories, contributions are welcome. The goal is not a perfect number but an honest, improving understanding of costs.