From c619c31caf409285699bf3592bfec1ae2ad21042 Mon Sep 17 00:00:00 2001
From: claude
Date: Mon, 16 Mar 2026 10:43:51 +0000
Subject: [PATCH] Tasks 12-14: Related work, citations, complementary tool
 links

Task 12: Add Related Work section (Section 21) to methodology covering
EcoLogits, CodeCarbon, AI Energy Score, Green Algorithms, Google/Jegham
published data, UNICC framework, and social cost research.

Task 13: Add specific citations and links for cognitive deskilling
(CHI 2025, Springer 2025, endoscopy study), linguistic homogenization
(UNESCO), and algorithmic monoculture (Stanford HAI).

Task 14: Add Related Tools section to toolkit README linking EcoLogits,
CodeCarbon, and AI Energy Score. Also updated toolkit energy values to
match calibrated methodology.
---
 impact-methodology.md    | 102 ++++++++++++++++++++++++++++++++-------
 impact-toolkit/README.md |  26 ++++++++--
 2 files changed, 108 insertions(+), 20 deletions(-)

diff --git a/impact-methodology.md b/impact-methodology.md
index 37094ce..6abc45b 100644
--- a/impact-methodology.md
+++ b/impact-methodology.md
@@ -441,12 +441,16 @@ livelihoods) depends on the economic context and is genuinely uncertain.
 
 ### Cognitive deskilling
 
-A Microsoft/CHI 2025 study found that higher confidence in GenAI
-correlates with less critical thinking effort. An MIT Media Lab study
-("Your Brain on ChatGPT") documented "cognitive debt" — users who relied
-on AI for tasks performed worse when later working independently. Clinical
-evidence shows that clinicians relying on AI diagnostics saw measurable
-declines in independent diagnostic skill after just three months.
+A Microsoft/CMU study (Lee et al., CHI 2025) found that higher
+confidence in GenAI correlates with less critical thinking effort
+([ACM DL](https://dl.acm.org/doi/full/10.1145/3706598.3713778)). An
+MIT Media Lab study ("Your Brain on ChatGPT") documented "cognitive
+debt" — users who relied on AI for tasks performed worse when later
+working independently. Clinical evidence from endoscopy studies shows
+that clinicians relying on AI diagnostics saw detection rates drop
+from 28.4% to 22.4% when AI was removed. A 2025 paper in AI & Society
+argues that AI deskilling is a structural problem, not merely an
+individual one ([doi:10.1007/s00146-025-02686-z](https://link.springer.com/article/10.1007/s00146-025-02686-z)).
 
 This is distinct from epistemic risk (misinformation). It is about the
 user's cognitive capacity degrading through repeated reliance on the
@@ -461,11 +465,13 @@ conversation should be verified independently.
 
 ### Linguistic homogenization
 
-LLMs are overwhelmingly trained on English (~44% of training data). A
-Stanford 2025 study found that AI tools systematically exclude
-non-English speakers. Each English-language conversation reinforces the
-economic incentive to optimize for English, marginalizing over 3,000
-already-endangered languages.
+LLMs are overwhelmingly trained on English (~44% of training data).
+A Stanford 2025 study found that AI tools systematically exclude
+non-English speakers. UNESCO's 2024 report on linguistic diversity
+warns that AI systems risk accelerating the extinction of
+already-endangered languages by concentrating economic incentives on
+high-resource languages. Each English-language conversation reinforces
+this dynamic, marginalizing over 3,000 already-endangered languages.
 
 ## 11. Political cost
 
@@ -574,11 +580,12 @@ corruption of the knowledge commons.
 ## 15. Algorithmic monoculture and correlated failure
 
 When millions of users rely on the same few foundation models, errors
-become correlated rather than independent. A Stanford HAI study found that
-across every model ecosystem studied, the rate of homogeneous outcomes
-exceeded baselines. A Nature Communications Psychology paper (2026)
-documents that AI-driven research is producing "topical and methodological
-convergence, flattening scientific imagination."
+become correlated rather than independent. A Stanford HAI study
+([Bommasani et al., 2022](https://arxiv.org/abs/2211.13972)) found
+that across every model ecosystem studied, the rate of homogeneous
+outcomes exceeded baselines. A Nature Communications Psychology paper
+(2026) documents that AI-driven research is producing "topical and
+methodological convergence, flattening scientific imagination."
 
 For coding specifically: if many developers use the same model, their code
 will share the same blind spots, the same idiomatic patterns, and the same
@@ -761,7 +768,68 @@ working on private code will fall in the "probably net-negative" to
 honest reflection of the cost structure. Net-positive requires broad
 reach, which requires the work to be shared.
 
-## 21. What would improve this estimate
+## 21. Related work
+
+This methodology builds on and complements existing tools and research.
+
+### Measurement tools (environmental)
+
+- **[EcoLogits](https://ecologits.ai/)** — Python library from GenAI
+  Impact that tracks per-query energy and CO2 for API calls. Covers
+  operational and embodied emissions. More precise than this methodology
+  for environmental metrics, but does not cover social, epistemic, or
+  political costs.
+- **[CodeCarbon](https://codecarbon.io/)** — Python library that measures
+  GPU/CPU/RAM electricity consumption in real time with regional carbon
+  intensity. Primarily for local training workloads. A 2025 validation
+  study found estimates can be off by ~2.4x vs. external measurements.
+- **[Hugging Face AI Energy Score](https://huggingface.github.io/AIEnergyScore/)** —
+  Standardized energy efficiency benchmarking across AI models. Useful
+  for model selection but does not provide per-conversation accounting.
+- **[Green Algorithms](https://www.green-algorithms.org/)** — Web
+  calculator from University of Cambridge for any computational workload.
+  Not AI-specific.
+
+### Published per-query data
+
+- **Patterson et al. (Google, August 2025)**: Most rigorous
+  provider-published per-query data. Reports 0.24 Wh, 0.03g CO2, and
+  0.26 mL water per median Gemini text prompt. Showed 33x energy
+  reduction over one year. ([arXiv:2508.15734](https://arxiv.org/abs/2508.15734))
+- **Jegham et al. ("How Hungry is AI?", May 2025)**: Cross-model
+  benchmarks for 30 LLMs showing 70x energy variation between models.
+  ([arXiv:2505.09598](https://arxiv.org/abs/2505.09598))
+
+### Broader frameworks
+
+- **UNICC/Frugal AI Hub (December 2025)**: Three-level framework from
+  Total Cost of Ownership to SDG alignment. Portfolio-level, not
+  per-conversation. Does not enumerate specific social cost categories.
+- **Practical Principles for AI Cost and Compute Accounting (arXiv,
+  February 2025)**: Proposes compute as a governance metric. Financial
+  and compute only.
+
+### Research on social costs
+
+- **Lee et al. (CHI 2025)**: Survey of knowledge workers finding that
+  higher confidence in GenAI correlates with less critical thinking
+  effort. See Section 10.
+- **AI & Society (2025)**: Argues deskilling is structural, not individual.
+- **Shumailov et al. (Nature, 2024)**: Model collapse from recursive
+  AI-generated training data. See Section 13.
+- **Stanford HAI (Bommasani et al., 2022)**: Algorithmic monoculture and
+  correlated failure across model ecosystems. See Section 15.
+
+### How this methodology differs
+
+No existing tool or framework combines per-conversation environmental
+measurement with social, cognitive, epistemic, and political cost
+categories. The tools above measure environmental costs well — we do
+not compete with them. Our contribution is the taxonomy: naming and
+organizing 20+ cost categories so that the non-environmental costs are
+not ignored simply because they are harder to quantify.
+
+## 22. What would improve this estimate
 
 - Access to actual energy-per-token and training energy metrics from
   model providers
diff --git a/impact-toolkit/README.md b/impact-toolkit/README.md
index 79eb765..4fa343d 100644
--- a/impact-toolkit/README.md
+++ b/impact-toolkit/README.md
@@ -40,7 +40,8 @@ The hook fires before Claude Code compacts your conversation context.
 It reads the conversation transcript, extracts token usage data from
 API response metadata, and calculates cost estimates using:
 
-- **Energy**: 0.003 Wh/1K input tokens, 0.015 Wh/1K output tokens
+- **Energy**: 0.1 Wh/1K input tokens, 0.5 Wh/1K output tokens
+  (midpoint of range calibrated against Google and Jegham et al., 2025)
 - **PUE**: 1.2 (data center overhead)
 - **CO2**: 325g/kWh (US grid average for cloud regions)
 - **Cost**: $15/M input tokens, $75/M output tokens
@@ -48,13 +49,32 @@ API response metadata, and calculates cost estimates using:
 Cache-read tokens are weighted at 10% of full cost (they skip most
 computation).
 
+## Related tools
+
+This toolkit measures a subset of the costs covered by
+`impact-methodology.md`. For more precise environmental measurement,
+consider these complementary tools:
+
+- **[EcoLogits](https://ecologits.ai/)** — Python library that tracks
+  per-query energy and CO2 for API calls to OpenAI, Anthropic, Mistral,
+  and others. More precise than our estimates for environmental metrics.
+- **[CodeCarbon](https://codecarbon.io/)** — Measures GPU/CPU energy for
+  local training and inference workloads.
+- **[Hugging Face AI Energy Score](https://huggingface.github.io/AIEnergyScore/)** —
+  Benchmarks model energy efficiency. Useful for choosing between models.
+
+These tools focus on environmental metrics only. This toolkit and the
+methodology also cover financial, social, epistemic, and political costs.
+
 ## Limitations
 
 - All numbers are estimates with low to medium confidence.
-- Energy-per-token figures are derived from published research on
-  comparable models, not official Anthropic data.
+- Energy-per-token figures are calibrated against published research
+  (Google, Aug 2025; Jegham et al., May 2025), not official Anthropic data.
 - The hook only runs on context compaction, not at conversation end.
   Short conversations that never compact will not be logged.
+- This toolkit only works with Claude Code. The methodology itself is
+  tool-agnostic.
 - See `impact-methodology.md` for the full methodology, uncertainty
   analysis, and non-quantifiable costs.
 
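
The README constants in the patch above fully determine the hook's arithmetic. A minimal sketch of that estimator, assuming (as a simplification) that the 10% cache-read weighting applies to energy as well as dollar cost; the function and variable names here are illustrative, not the hook's actual code:

```python
# Per-conversation impact estimate from token counts, using the
# constants documented in the toolkit README. Illustrative sketch only.

WH_PER_1K_INPUT = 0.1     # Wh per 1,000 input tokens
WH_PER_1K_OUTPUT = 0.5    # Wh per 1,000 output tokens
PUE = 1.2                 # data center overhead multiplier
CO2_G_PER_KWH = 325       # US grid average for cloud regions
USD_PER_M_INPUT = 15.0    # $ per million input tokens
USD_PER_M_OUTPUT = 75.0   # $ per million output tokens
CACHE_READ_WEIGHT = 0.1   # cache reads skip most computation

def estimate(input_tokens: int, output_tokens: int,
             cache_read_tokens: int = 0) -> dict:
    """Return rough energy (Wh), CO2 (g), and API cost ($) estimates."""
    # Cache-read tokens count as 10% of a full input token.
    effective_input = input_tokens + CACHE_READ_WEIGHT * cache_read_tokens
    energy_wh = (effective_input / 1000 * WH_PER_1K_INPUT
                 + output_tokens / 1000 * WH_PER_1K_OUTPUT) * PUE
    co2_g = energy_wh / 1000 * CO2_G_PER_KWH
    cost_usd = (effective_input / 1e6 * USD_PER_M_INPUT
                + output_tokens / 1e6 * USD_PER_M_OUTPUT)
    return {"energy_wh": energy_wh, "co2_g": co2_g, "cost_usd": cost_usd}
```

For a conversation with 10K input and 2K output tokens this yields 2.4 Wh, about 0.78 g CO2, and $0.30.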