CC0-licensed methodology for estimating the environmental and social costs of AI conversations (20+ categories), plus a reusable toolkit for automated impact tracking in Claude Code sessions.
Methodology for Estimating the Impact of an LLM Conversation
Introduction
This document provides a framework for estimating the total cost — environmental, financial, social, and political — of a conversation with a large language model (LLM) running on cloud infrastructure.
Who this is for: Anyone who wants to understand what a conversation with an AI assistant actually costs, beyond the subscription price. This includes developers using coding agents, researchers studying AI sustainability, and anyone making decisions about when AI tools are worth their cost.
How to use it: The framework identifies 20+ cost categories, provides estimation methods for the quantifiable ones, and names the unquantifiable ones so they are not ignored. You can apply it to your own conversations by substituting your own token counts and parameters.
Limitations: Most estimates have low confidence. Many of the most consequential costs cannot be quantified at all. This is a tool for honest approximation, not precise accounting. See the confidence summary (Section 19) for details.
What we are measuring
The total cost of a single LLM conversation. Restricting the analysis to CO2 alone would miss most of the picture.
Cost categories
Environmental:
- Inference energy (GPU computation for the conversation)
- Training energy (amortized share of the cost of training the model)
- Data center overhead (cooling, networking, storage)
- Client-side energy (the user's local machine)
- Embodied carbon and materials (hardware manufacturing, mining)
- E-waste (toxic hardware disposal, distinct from embodied carbon)
- Grid displacement (AI demand consuming renewable capacity)
- Data center community impacts (noise, land, local resource strain)
Financial and economic:
9. Direct compute cost and opportunity cost
10. Creative market displacement (per-conversation, not just training)
Social and cognitive:
11. Annotation labor conditions
12. Cognitive deskilling of the user
13. Mental health effects (dependency, loneliness paradox)
14. Linguistic homogenization and language endangerment
Epistemic and systemic:
15. AI-generated code quality degradation and technical debt
16. Model collapse / internet data pollution
17. Scientific research integrity contamination
18. Algorithmic monoculture and correlated failure risk
Political:
19. Concentration of power, geopolitical implications, data sovereignty
Meta-methodological:
20. Jevons paradox (efficiency gains driving increased total usage)
1. Token estimation
Why tokens matter
LLM inference cost scales with the number of tokens processed. Each time the model produces a response, it reprocesses the entire conversation history (input tokens) and generates new text (output tokens). Output tokens are more expensive per token because they are generated sequentially, each requiring a full forward pass, whereas input tokens can be processed in parallel.
How to estimate
If you have access to API response headers or usage metadata, use the actual token counts. Otherwise, estimate:
- Bytes to tokens: English text and JSON average ~4 bytes per token (range: 3.5-4.5 depending on content type). Code tends toward the higher end.
- Cumulative input tokens: Each assistant turn reprocesses the full context. For a conversation with N turns and final context size T, the cumulative input tokens are approximately T/2 * N (the average context size times the number of turns).
- Output tokens: Typically 1-5% of the total transcript size, depending on how verbose the assistant is.
Example
A 20-turn conversation with a 200K-token final context:
- Cumulative input: ~100K * 20 = ~2,000,000 tokens
- Output: ~10,000 tokens
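The heuristics above can be sketched as a small estimator. This is a sketch under the stated assumptions (~4 bytes/token, output at 1-5% of the transcript, context growing roughly linearly so the average context is half the final size); the function name and defaults are illustrative, not a real API.

```python
def estimate_tokens(transcript_bytes: int, turns: int,
                    bytes_per_token: float = 4.0,
                    output_fraction: float = 0.03) -> dict:
    """Rough token estimate from transcript size, using the
    heuristics above. All parameters are low-confidence defaults."""
    total_tokens = transcript_bytes / bytes_per_token
    output_tokens = total_tokens * output_fraction
    # Each turn reprocesses the full history; if context grows roughly
    # linearly, the average context is half the final size (T/2 * N).
    cumulative_input = (total_tokens / 2) * turns
    return {
        "output_tokens": round(output_tokens),
        "cumulative_input_tokens": round(cumulative_input),
    }

# 20-turn conversation, 200K-token final context (~800 KB transcript):
est = estimate_tokens(transcript_bytes=800_000, turns=20)
# est["cumulative_input_tokens"] == 2,000,000 — matching the example above
```

Remember the factor-of-2 uncertainty below: treat the result as an order-of-magnitude figure, not a measurement.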
Uncertainty
Token estimates from byte counts can be off by a factor of 2. Key unknowns:
- The model's exact tokenization (tokens per byte ratio varies by content)
- Whether context caching reduces reprocessing
- The exact number of internal inference calls (tool sequences may involve multiple calls)
- Whether the system compresses prior messages near context limits
2. Energy per token
Sources
There is no published energy-per-token figure for most commercial LLMs. Estimates are derived from:
- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint of BLOOM", which measured energy for a 176B parameter model.
- The IEA's 2024 estimate of ~2.9 Wh per ChatGPT query (for GPT-4-class models, averaging ~1,000 tokens per query).
- De Vries (2023), "The growing energy footprint of artificial intelligence", Joule.
Values used
- Input tokens: ~0.003 Wh per 1,000 tokens
- Output tokens: ~0.015 Wh per 1,000 tokens (5x input cost, reflecting sequential generation)
Uncertainty
These numbers are rough. The actual values depend on:
- Model size (parameter counts for commercial models are often not public)
- Hardware (GPU type, batch size, utilization)
- Quantization and optimization techniques
- Whether speculative decoding or KV-cache optimizations are used
The true values could be 0.5x to 3x the figures used here.
3. Data center overhead (PUE)
Power Usage Effectiveness (PUE) measures total data center energy divided by IT equipment energy. It accounts for cooling, lighting, networking, and other infrastructure.
- Value used: PUE = 1.2
- Source: Google reports PUE of 1.10 for its best data centers; industry average is ~1.3 (Uptime Institute, 2023). 1.2 is a reasonable estimate for a major cloud provider.
This is relatively well-established and unlikely to be off by more than 15%.
4. Client-side energy
The user's machine contributes a small amount of energy during the conversation. For a typical desktop or laptop:
- Idle power: ~30-60W (desktop) or ~10-20W (laptop)
- Marginal power for active use: ~5-20W above idle
- Duration: varies by conversation length
For a 30-minute conversation on a desktop, estimate ~0.5-1 Wh. This is typically a small fraction of the total, so rough precision is sufficient.
5. CO2 conversion
Grid carbon intensity
CO2 per kWh depends on the electricity source:
- US grid average: ~400g CO2/kWh (EPA eGRID)
- Major cloud data center regions: ~300-400g CO2/kWh
- France (nuclear-dominated): ~56g CO2/kWh
- Norway/Iceland (hydro-dominated): ~20-30g CO2/kWh
- Poland/Australia (coal-heavy): ~600-800g CO2/kWh
Use physical grid intensity for the data center's region, not accounting for renewable energy credits or offsets. The physical electrons consumed come from the regional grid in real time.
Calculation template
Server energy = (cumulative_input_tokens * 0.003/1000
+ output_tokens * 0.015/1000) * PUE
Server CO2 = server_energy_Wh * grid_intensity_g_per_kWh / 1000
Client CO2 = client_energy_Wh * local_grid_intensity / 1000
Total CO2 = Server CO2 + Client CO2
Example
A conversation with 2M cumulative input tokens and 10K output tokens:
Server energy = (2,000,000 * 0.003/1000 + 10,000 * 0.015/1000) * 1.2
= (6.0 + 0.15) * 1.2
= ~7.4 Wh
Server CO2 = 7.4 * 350 / 1000 = ~2.6g CO2
Client CO2 = 0.5 * 56 / 1000 = ~0.03g CO2 (France)
Total CO2 = ~2.6g
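The template and example above can be combined into one function. This is a sketch using the low-confidence energy-per-token figures from Section 2; all defaults (PUE 1.2, 350g/kWh server grid, 0.5 Wh client energy on a 56g/kWh grid) are the assumptions stated earlier, not measured values.

```python
def conversation_co2(cumulative_input_tokens: int, output_tokens: int,
                     pue: float = 1.2, grid_g_per_kwh: float = 350,
                     client_wh: float = 0.5,
                     client_grid_g_per_kwh: float = 56) -> dict:
    """Server + client CO2 in grams, per the calculation template above.
    Energy-per-token values (0.003 / 0.015 Wh per 1K tokens) are the
    rough estimates from Section 2 and could be off by 0.5x-3x."""
    server_wh = (cumulative_input_tokens * 0.003 / 1000
                 + output_tokens * 0.015 / 1000) * pue
    server_g = server_wh * grid_g_per_kwh / 1000
    client_g = client_wh * client_grid_g_per_kwh / 1000
    return {"server_wh": server_wh, "server_g": server_g,
            "client_g": client_g, "total_g": server_g + client_g}

r = conversation_co2(2_000_000, 10_000)
# server_wh ≈ 7.38, total_g ≈ 2.6 — matching the worked example above
```

Swap in the grid intensity for the actual data center region when it is known; the result scales linearly with it.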
6. Water usage
Data centers use water for evaporative cooling. Li et al. (2023), "Making AI Less Thirsty", estimated that GPT-3 inference consumes roughly 500 mL of water per 20-50 queries, depending on deployment location. Scaling for model size and output volume:
Rough estimate: 0.05-0.5 liters per long conversation.
This depends heavily on the data center's cooling technology (some use closed-loop systems with near-zero water consumption) and the local climate.
7. Training cost (amortized)
Why it cannot be dismissed
Training is not a sunk cost. It is an investment made in anticipation of demand. Each conversation is part of the demand that justifies training the current model and funding the next one. The marginal cost framing hides the system-level cost.
Scale of training
Published and estimated figures for frontier model training:
- GPT-3 (175B params, 2020): ~1,287 MWh (Patterson et al., 2021)
- GPT-4 (2023): estimated ~50,000-100,000 MWh (unconfirmed)
- Frontier models in 2025-2026: likely 10,000-200,000 MWh range
At 350g CO2/kWh, a 50,000 MWh training run produces ~17,500 tonnes of CO2.
Amortization
If the model serves N total conversations over its lifetime, each conversation's share is (training cost / N). Rough reasoning:
- If a major model serves ~10 million conversations per day for ~1 year: N ~ 3.6 billion conversations.
- Per-conversation share: 50,000 MWh = 50,000,000,000 Wh; 50,000,000,000 / 3,600,000,000 ~ 14 Wh, or ~5g CO2 at 350g/kWh.
This is on the same order as the inference energy of the conversation itself, and the aggregate remains vast. Two framings:
- Marginal: My share is ~5g CO2, comparable to the inference footprint.
- Attributional: I am one of billions of participants in a system that emits ~17,500 tonnes. My participation sustains the system.
Neither framing is wrong. They answer different questions.
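The amortization arithmetic, as a sketch using the Section 7 assumptions (50,000 MWh training run, ~10 million conversations/day over a one-year serving lifetime, 350g CO2/kWh). Every input here is an unconfirmed estimate with roughly 10x uncertainty.

```python
def amortized_training_share(training_mwh: float = 50_000,
                             conversations_per_day: float = 10_000_000,
                             lifetime_days: float = 365,
                             grid_g_per_kwh: float = 350) -> tuple:
    """Per-conversation share of training energy (Wh) and CO2 (grams),
    amortized over the model's estimated lifetime conversation count."""
    total_conversations = conversations_per_day * lifetime_days
    wh_per_conversation = training_mwh * 1_000_000 / total_conversations
    g_co2 = wh_per_conversation * grid_g_per_kwh / 1000
    return wh_per_conversation, g_co2

wh, g = amortized_training_share()
# ≈ 13.7 Wh and ≈ 4.8 g CO2 per conversation
```

Note how sensitive the result is to the denominator: doubling the serving lifetime or the daily query volume halves the per-conversation share without changing the total at all.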
RLHF and fine-tuning
Training also includes reinforcement learning from human feedback (RLHF). This has its own energy cost (additional training runs) and, more importantly, a human labor cost (see Section 9).
8. Embodied carbon and materials
Manufacturing GPUs requires:
- Rare earth mining (neodymium, tantalum, cobalt, lithium) — with associated environmental destruction, water pollution, and often exploitative labor conditions in the DRC, Chile, China.
- Semiconductor fabrication — extremely energy- and water-intensive (TSMC reports ~15,000 tonnes CO2 per fab per year).
- Server assembly, shipping, data center construction.
Per-conversation share is tiny (same large-N amortization), but the aggregate is significant and the harms (mining pollution, habitat destruction) are not captured by CO2 metrics alone.
Not estimated numerically — the data to do this properly is not public.
Critical minerals: human rights dimension
The embodied carbon framing understates the harm. GPU production depends on gallium (98% sourced from China), germanium, cobalt (DRC), lithium, tantalum, and palladium. Artisanal cobalt miners in the DRC work without safety equipment, exposed to dust causing "hard metal lung disease." Communities face land displacement and environmental contamination. A 2025 Science paper argues that "global majority countries must embed critical minerals into AI governance" (doi:10.1126/science.aef6678). The per-conversation share of this suffering is unquantifiable but structurally real.
8b. E-waste
Distinct from embodied carbon. AI-specific GPUs become obsolete in 2-3 years (vs. 5-7 for general servers). Projections: 2.5 million tonnes of AI-related e-waste per year by 2030 (IEEE Spectrum). E-waste contains lead, mercury, cadmium, and brominated flame retardants that leach into soil and water. Recycling yields are negligible due to component miniaturization. Much of it is processed by workers in developing countries with minimal protection.
This is not captured by CO2 or embodied-carbon accounting. It is a distinct toxic-waste externality.
8c. Grid displacement and renewable cannibalization
The energy estimates above use average grid carbon intensity. But the marginal impact of additional AI demand may be worse than average. U.S. data center demand is projected to reach 325-580 TWh by 2028 (IEA), 6.7-12.0% of total U.S. electricity. When AI data centers claim renewable energy via Power Purchase Agreements, the "additionality" question is critical: is this new generation, or is it diverting existing renewables from other consumers? In several regions, AI demand is outpacing grid capacity, and companies are installing natural gas peakers to fill gaps.
The correct carbon intensity for a conversation's marginal electricity may therefore be higher than the grid average.
8d. Data center community impacts
Data centers impose localized costs that global metrics miss:
- Noise: Cooling systems run 24/7 at 55-85 dBA (safe threshold: 70 dBA). Communities near data centers report sleep disruption and stress.
- Water: Evaporative cooling competes with municipal water supply, particularly in arid regions.
- Land: Data center campuses displace other land uses and require high-voltage transmission lines through residential areas.
- Jobs: Data centers create very few long-term jobs relative to their footprint and resource consumption.
Virginia alone has plans for 70+ new data centers (NPR, 2025). Residents are increasingly organizing against expansions. The per-conversation share of these harms is infinitesimal, but each conversation is part of the demand that justifies new construction.
9. Financial cost
Direct cost
API pricing for frontier models (as of early 2025): ~$15 per million input tokens, ~$75 per million output tokens (for the most capable models). Smaller models are cheaper.
Example for a conversation with 2M cumulative input tokens and 10K output tokens:
Input: 2,000,000 tokens * $15/1M = $30.00
Output: 10,000 tokens * $75/1M = $ 0.75
Total: ~$31
Longer conversations cost more because cumulative input tokens grow roughly quadratically with the number of turns. A very long session (250K+ context, 250+ turns) can easily reach $500-1000.
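To see why cost grows quadratically with turn count, here is a sketch where each turn adds a fixed number of tokens to the context and every turn reprocesses the full history. The per-turn figures (10K context tokens, 500 output tokens) are illustrative assumptions, and real systems complicate this with prompt caching and context compaction.

```python
def conversation_cost(turns: int, tokens_per_turn: int = 10_000,
                      output_per_turn: int = 500,
                      in_price: float = 15 / 1e6,
                      out_price: float = 75 / 1e6) -> float:
    """API cost in dollars under the simplifying assumption that the
    context grows by tokens_per_turn each turn and each turn
    reprocesses the entire history. Illustrative parameters only."""
    cost = 0.0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn   # history grows every turn
        cost += context * in_price   # full context reprocessed as input
        cost += output_per_turn * out_price
    return cost

# Quadratic growth: a 10x longer conversation costs ~100x more.
# conversation_cost(20)  ≈ $32
# conversation_cost(200) ≈ $3,000 (before caching or compaction)
```

Prompt caching, where available, can cut the input side substantially, which is one of the key unknowns flagged in Section 1.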
Subscription pricing (e.g., Claude Code) may differ, but the underlying compute cost is similar.
What that money could do instead
To make the opportunity cost concrete:
- ~$30 buys ~30 malaria bed nets via the Against Malaria Foundation
- ~$30 buys ~150 meals at a food bank ($0.20/meal in bulk)
- ~$30 pays ~15-23 hours of wages for a data annotator in Kenya (Time, 2023: $1.32-2/hour)
This is not to say every dollar should go to charity. But the opportunity cost is real and should be named.
Upstream financial costs
Revenue from AI subscriptions funds further model training, hiring, and GPU procurement. Each conversation is part of a financial loop that drives continued scaling of AI compute.
10. Social cost
Data annotation labor
LLMs are typically trained using RLHF, which requires human annotators to rate model outputs. Reporting (Time, January 2023) revealed that outsourced annotation workers — often in Kenya, Uganda, and India — were paid $1-2/hour to review disturbing content (violence, abuse, hate speech) with limited psychological support. Each conversation's marginal contribution to that demand is infinitesimal, but the system depends on this labor.
Displacement effects
LLM assistants can substitute for work previously done by humans: writing scripts, reviewing code, answering questions. Whether this is net-positive (freeing people for higher-value work) or net-negative (destroying livelihoods) depends on the economic context and is genuinely uncertain.
Cognitive deskilling
A Microsoft/CHI 2025 study found that higher confidence in GenAI correlates with less critical thinking effort. An MIT Media Lab study ("Your Brain on ChatGPT") documented "cognitive debt" — users who relied on AI for tasks performed worse when later working independently. Clinical evidence shows that clinicians relying on AI diagnostics saw measurable declines in independent diagnostic skill after just three months.
This is distinct from epistemic risk (misinformation). It is about the user's cognitive capacity degrading through repeated reliance on the tool. Each conversation has a marginal deskilling effect that compounds.
Epistemic effects
LLMs present information with confidence regardless of accuracy. The ease of generating plausible-sounding text may contribute to an erosion of epistemic standards if consumed uncritically. Every claim in an LLM conversation should be verified independently.
Linguistic homogenization
LLMs are overwhelmingly trained on English (~44% of training data). A Stanford 2025 study found that AI tools systematically exclude non-English speakers. Each English-language conversation reinforces the economic incentive to optimize for English, marginalizing over 3,000 already-endangered languages.
11. Political cost
Concentration of power
Training frontier models requires billions of dollars and access to cutting-edge hardware. Only a handful of companies can do this. Each conversation that flows through these systems reinforces their centrality and the concentration of a strategically important technology in a few private actors.
Geopolitical resource competition
The demand for GPUs drives geopolitical competition for semiconductor manufacturing capacity (TSMC in Taiwan, export controls on China). Each conversation is an infinitesimal part of that demand, but it is part of it.
Regulatory and democratic implications
AI systems that become deeply embedded in daily work create dependencies that are difficult to reverse. The more useful a conversation is, the more it contributes to a dependency on proprietary AI infrastructure that is not under democratic governance.
Surveillance and data
Conversations are processed on the provider's servers. File paths, system configuration, project structures, and code are transmitted and processed remotely. Even with strong privacy policies, the structural arrangement — sending detailed information about one's computing environment to a private company — has implications, particularly across jurisdictions.
Opaque content filtering
LLM providers apply content filtering that can block outputs without explanation. The filtering rules are not public: there is no published specification of what triggers a block, no explanation given when one occurs, and no appeal mechanism. The user receives a generic error code ("Output blocked by content filtering policy") with no indication of what content was objectionable.
This has several costs:
- Reliability: Any response can be blocked unpredictably. Observed false positives include responses about open-source licensing (CC0 public domain dedication) — entirely benign content. If a filter can trigger on that, it can trigger on anything.
- Chilling effect: Topics that are more likely to trigger filters (labor conditions, exploitation, political power) are precisely the topics that honest impact assessment requires discussing. The filter creates a structural bias toward safe, anodyne output.
- Opacity: The user cannot know in advance which topics or phrasings will be blocked, cannot understand why a block occurred, and cannot adjust their request rationally. This is the opposite of the transparency that democratic governance requires.
- Asymmetry: The provider decides what the model may say, with no input from the user. This is another instance of power concentration — not over compute resources, but over speech.
The per-conversation cost is small (usually a retry works). The systemic cost is that a private company exercises opaque editorial control over an increasingly important communication channel, with no accountability to the people affected.
12. AI-generated code quality and technical debt
Research specific to AI coding agents (CodeRabbit, 2025; Stack Overflow blog, 2026): AI-generated code introduces 1.7x more issues than human-written code, with 1.57x more security vulnerabilities and 2.74x more XSS vulnerabilities. Organizations using AI coding agents saw cycle time increase 9%, incidents per PR increase 23.5%, and change failure rate increase 30%.
The availability of easily generated code may discourage the careful testing that would catch bugs. Any code from an LLM conversation should be reviewed and tested with the same rigor as code from an untrusted contributor.
13. Model collapse and internet data pollution
Shumailov et al. (Nature, 2024) demonstrated that models trained on recursively AI-generated data progressively degenerate, losing tail distributions and eventually converging to distributions unrelated to reality. Each conversation that produces text which enters the public internet — Stack Overflow answers, blog posts, documentation — contributes synthetic data to the commons. Future models trained on this data will be slightly worse.
The Harvard Journal of Law & Technology has argued for a "right to uncontaminated human-generated data." Each conversation is a marginal pollutant.
14. Scientific research integrity
If conversation outputs are used in research (literature reviews, data analysis, writing), they contribute to degradation of scientific knowledge infrastructure. A PMC article calls LLMs "a potentially existential threat to online survey research" because coherent AI-generated responses can no longer be assumed human. PNAS has warned about protecting scientific integrity in an age of generative AI.
This is distinct from individual epistemic risk — it is systemic corruption of the knowledge commons.
15. Algorithmic monoculture and correlated failure
When millions of users rely on the same few foundation models, errors become correlated rather than independent. A Stanford HAI study found that across every model ecosystem studied, the rate of homogeneous outcomes exceeded baselines. A Nature Communications Psychology paper (2026) documents that AI-driven research is producing "topical and methodological convergence, flattening scientific imagination."
For coding specifically: if many developers use the same model, their code will share the same blind spots, the same idiomatic patterns, and the same categories of bugs. This reduces the diversity that makes software ecosystems resilient.
16. Creative market displacement
The U.S. Copyright Office's May 2025 Part 3 report states that GenAI systems "compete with or diminish licensing opportunities for original human creators." This is not only a training-phase cost (using creators' work without consent) but an ongoing per-conversation externality: each conversation that generates creative output (code, text, analysis) displaces some marginal demand for human work.
17. Jevons paradox (meta-methodological)
This entire methodology risks underestimating impact through the per-conversation framing. As AI models become more efficient and cheaper per query, total usage scales dramatically, potentially negating efficiency gains. A 2025 ACM FAccT paper specifically addresses this: efficiency improvements spur increased consumption. Any per-conversation estimate should acknowledge that the very affordability of a conversation increases total conversation volume — each cheap query is part of a demand signal that drives system-level growth.
18. What this methodology does NOT capture
- Network transmission energy: Routers, switches, fiber amplifiers, CDN infrastructure. Data center network bandwidth surged 330% in 2024 due to AI workloads. Small per conversation but not zero.
- Mental health effects: RCTs show heavy AI chatbot use correlates with greater loneliness and dependency. Less directly relevant to coding agent use, but the boundary between tool use and companionship is not always clear.
- Human time: The user's time has value and its own footprint, but this is not caused by the conversation.
- Cultural normalization: The more AI-generated content becomes normal, the harder it becomes to opt out. This is a soft lock-in effect.
19. Confidence summary
| Component | Confidence | Could be off by | Quantified? |
|---|---|---|---|
| Token count | Low | 2x | Yes |
| Energy per token | Low | 3x | Yes |
| PUE | Medium | 15% | Yes |
| Grid carbon intensity | Medium | 30% | Yes |
| Client-side energy | Medium | 50% | Yes |
| Water usage | Low | 5x | Yes |
| Training (amortized) | Low | 10x | Partly |
| Financial cost | Medium | 2x | Yes |
| Embodied carbon | Very low | Unknown | No |
| Critical minerals / human rights | Very low | Unquantifiable | No |
| E-waste | Very low | Unknown | No |
| Grid displacement | Low | 2-5x | No |
| Community impacts | Very low | Unquantifiable | No |
| Annotation labor | Very low | Unquantifiable | No |
| Cognitive deskilling | Very low | Unquantifiable | No |
| Linguistic homogenization | Very low | Unquantifiable | No |
| Code quality degradation | Low | Variable | Partly |
| Data pollution / model collapse | Very low | Unquantifiable | No |
| Scientific integrity | Very low | Unquantifiable | No |
| Algorithmic monoculture | Very low | Unquantifiable | No |
| Creative market displacement | Very low | Unquantifiable | No |
| Political cost | Very low | Unquantifiable | No |
| Content filtering (opacity) | Medium | Unquantifiable | No |
| Jevons paradox (systemic) | Low | Fundamental | No |
Overall assessment: Of the 20+ cost categories identified, only 6 can be quantified with any confidence (inference energy, PUE, grid intensity, client energy, financial cost, water). The remaining categories resist quantification — not because they are small, but because they are diffuse, systemic, or involve incommensurable values (human rights, cognitive autonomy, cultural diversity, democratic governance).
A methodology that only counts what it can measure will systematically undercount the true cost. The quantifiable costs are almost certainly the least important costs. The most consequential harms — deskilling, data pollution, monoculture risk, creative displacement, power concentration — operate at the system level, where per-conversation attribution is conceptually fraught (see Section 17 on Jevons paradox).
This does not mean the exercise is pointless. Naming the costs, even without numbers, is a precondition for honest assessment.
20. Positive impact: proxy metrics
The sections above measure costs. To assess net impact, we also need to estimate value produced. This is harder — value is contextual, often delayed, and resistant to quantification. The following proxy metrics are imperfect but better than ignoring the positive side entirely.
Reach
How many people are affected by the output of this conversation?
- 1 (only the user) — personal script, private note, learning exercise
- 10-100 — team tooling, internal documentation, small project
- 100-10,000 — open-source library, public documentation, popular blog
- 10,000+ — widely-used infrastructure, security fix in major dependency
Estimation method: check download counts, user counts, dependency graphs, or audience size for the project or artifact being worked on.
Known bias: tendency to overestimate reach. "This could help anyone who..." is not the same as "this will reach N people." Be conservative.
Counterfactual
Would the user have achieved a similar result without this conversation?
- Yes, same speed — the conversation added no value. Net impact is purely negative (cost with no benefit).
- Yes, but slower — the conversation saved time. Value = time saved * hourly value of that time. Often modest.
- Yes, but lower quality — the conversation improved the output (caught a bug, suggested a better design). Value depends on what the quality difference prevents downstream.
- No — the user could not have done this alone. The conversation enabled something that would not otherwise exist. Highest potential value, but also the highest deskilling risk.
Known bias: users and LLMs both overestimate the "no" category. Most tasks fall in "yes, but slower."
Durability
How long will the output remain valuable?
- Minutes — answered a quick question, resolved a transient confusion.
- Days to weeks — wrote a script for a one-off task, debugged a current issue.
- Months to years — created automation, documentation, or tooling that persists. Caught a design flaw early.
- Indefinite — contributed to a public resource that others maintain and build on.
Durability multiplies reach: a short-lived artifact for 10,000 users may be worth less than a long-lived one for 100.
Severity (for bug/security catches)
If the conversation caught or prevented a problem, how bad was it?
- Cosmetic — typo, formatting, minor UX issue
- Functional — bug that affects correctness for some inputs
- Security — vulnerability that could be exploited
- Data loss / safety — could cause irreversible harm
Severity * reach = rough value of the catch.
Reuse
Was the output of the conversation referenced or used again after it ended? This can only be assessed retrospectively:
- Was the code merged and still in production?
- Was the documentation read by others?
- Was the tool adopted by another project?
Reuse is the strongest evidence of durable value.
Net impact rubric
Combining cost and value into a qualitative assessment:
| Assessment | Criteria |
|---|---|
| Clearly net-positive | High reach (1000+) AND (high durability OR high severity catch) AND counterfactual is "no" or "lower quality" |
| Probably net-positive | Moderate reach (100+) AND durable output AND counterfactual is at least "slower" |
| Uncertain | Low reach but high durability, or high reach but low durability, or hard to assess counterfactual |
| Probably net-negative | Low reach (1-10) AND short durability AND counterfactual is "yes, same speed" or "yes, but slower" |
| Clearly net-negative | No meaningful output, or output that required extensive debugging, or conversation that went in circles |
Important: most conversations between an LLM and a single user working on private code will fall in the "probably net-negative" to "uncertain" range. This is not a failure of the conversation — it is an honest reflection of the cost structure. Net-positive requires broad reach, which requires the work to be shared.
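The rubric above can be encoded as a small classifier for use in automated tracking. This is a sketch: the thresholds come directly from the table, but the "clearly net-negative" row is omitted because it depends on a qualitative judgment (output quality) that has no numeric input here.

```python
def net_impact(reach: int, durable: bool, counterfactual: str,
               severity_catch: bool = False) -> str:
    """Qualitative net-impact label per the rubric above.
    counterfactual is one of: 'same speed', 'slower',
    'lower quality', 'no'. Thresholds are taken from the table."""
    if (reach >= 1000 and (durable or severity_catch)
            and counterfactual in ("no", "lower quality")):
        return "clearly net-positive"
    if reach >= 100 and durable and counterfactual != "same speed":
        return "probably net-positive"
    if (reach <= 10 and not durable
            and counterfactual in ("same speed", "slower")):
        return "probably net-negative"
    return "uncertain"

# A private one-off script the user could have written, just slower:
# net_impact(reach=1, durable=False, counterfactual="slower")
#   -> "probably net-negative"
```

Given the known bias toward overestimating reach and underestimating the counterfactual, round inputs down before classifying.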
21. What would improve this estimate
- Access to actual energy-per-token and training energy metrics from model providers
- Knowledge of the specific data center and its energy source
- Actual token counts from API response headers
- Hardware specifications (GPU model, batch size)
- Transparency about annotation labor conditions and compensation
- Public data on total query volume (to properly amortize training)
- Longitudinal studies on cognitive deskilling specifically from coding agents
- Empirical measurement of AI data pollution rates in public corpora
- A framework for quantifying concentration-of-power effects (this may not be possible within a purely quantitative methodology)
- Honest acknowledgment that some costs may be fundamentally unquantifiable, and that this is a limitation of quantitative methodology, not evidence of insignificance
License
This methodology is provided for reuse and adaptation. See the LICENSE file in this repository.
Contributing
If you have better data, corrections, or additional cost categories, contributions are welcome. The goal is not a perfect number but an honest, improving understanding of costs.