Tasks 12-14: Related work, citations, complementary tool links

Task 12: Add Related Work section (Section 21) to methodology covering
EcoLogits, CodeCarbon, AI Energy Score, Green Algorithms, Google/Jegham
published data, UNICC framework, and social cost research.

Task 13: Add specific citations and links for cognitive deskilling
(CHI 2025, Springer 2025, endoscopy study), linguistic homogenization
(UNESCO), and algorithmic monoculture (Stanford HAI).

Task 14: Add Related Tools section to toolkit README linking EcoLogits,
CodeCarbon, and AI Energy Score. Also update toolkit energy values to
match the calibrated methodology.
claude 2026-03-16 10:43:51 +00:00
parent 9653f69860
commit c619c31caf
2 changed files with 108 additions and 20 deletions


@@ -441,12 +441,16 @@ livelihoods) depends on the economic context and is genuinely uncertain.
 ### Cognitive deskilling
-A Microsoft/CHI 2025 study found that higher confidence in GenAI
-correlates with less critical thinking effort. An MIT Media Lab study
-("Your Brain on ChatGPT") documented "cognitive debt" — users who relied
-on AI for tasks performed worse when later working independently. Clinical
-evidence shows that clinicians relying on AI diagnostics saw measurable
-declines in independent diagnostic skill after just three months.
+A Microsoft/CMU study (Lee et al., CHI 2025) found that higher
+confidence in GenAI correlates with less critical thinking effort
+([ACM DL](https://dl.acm.org/doi/full/10.1145/3706598.3713778)). An
+MIT Media Lab study ("Your Brain on ChatGPT") documented "cognitive
+debt" — users who relied on AI for tasks performed worse when later
+working independently. Clinical evidence from endoscopy studies shows
+that clinicians relying on AI diagnostics saw detection rates drop
+from 28.4% to 22.4% when AI was removed. A 2025 Springer paper argues
+that AI deskilling is a structural problem, not merely individual
+([doi:10.1007/s00146-025-02686-z](https://link.springer.com/article/10.1007/s00146-025-02686-z)).
 This is distinct from epistemic risk (misinformation). It is about the
 user's cognitive capacity degrading through repeated reliance on the
@@ -461,11 +465,13 @@ conversation should be verified independently.
 ### Linguistic homogenization
-LLMs are overwhelmingly trained on English (~44% of training data). A
-Stanford 2025 study found that AI tools systematically exclude
-non-English speakers. Each English-language conversation reinforces the
-economic incentive to optimize for English, marginalizing over 3,000
-already-endangered languages.
+LLMs are overwhelmingly trained on English (~44% of training data).
+A Stanford 2025 study found that AI tools systematically exclude
+non-English speakers. UNESCO's 2024 report on linguistic diversity
+warns that AI systems risk accelerating the extinction of already-
+endangered languages by concentrating economic incentives on high-
+resource languages. Each English-language conversation reinforces
+this dynamic, marginalizing over 3,000 already-endangered languages.
 ## 11. Political cost
@@ -574,11 +580,12 @@ corruption of the knowledge commons.
 ## 15. Algorithmic monoculture and correlated failure
 When millions of users rely on the same few foundation models, errors
-become correlated rather than independent. A Stanford HAI study found that
-across every model ecosystem studied, the rate of homogeneous outcomes
-exceeded baselines. A Nature Communications Psychology paper (2026)
-documents that AI-driven research is producing "topical and methodological
-convergence, flattening scientific imagination."
+become correlated rather than independent. A Stanford HAI study
+([Bommasani et al., 2022](https://arxiv.org/abs/2108.07258)) found
+that across every model ecosystem studied, the rate of homogeneous
+outcomes exceeded baselines. A Nature Communications Psychology paper
+(2026) documents that AI-driven research is producing "topical and
+methodological convergence, flattening scientific imagination."
 For coding specifically: if many developers use the same model, their code
 will share the same blind spots, the same idiomatic patterns, and the same
@@ -761,7 +768,68 @@ working on private code will fall in the "probably net-negative" to
 honest reflection of the cost structure. Net-positive requires broad
 reach, which requires the work to be shared.
-## 21. What would improve this estimate
+## 21. Related work
+This methodology builds on and complements existing tools and research.
+### Measurement tools (environmental)
+- **[EcoLogits](https://ecologits.ai/)** — Python library from GenAI
+Impact that tracks per-query energy and CO2 for API calls. Covers
+operational and embodied emissions. More precise than this methodology
+for environmental metrics, but does not cover social, epistemic, or
+political costs.
+- **[CodeCarbon](https://codecarbon.io/)** — Python library that measures
+GPU/CPU/RAM electricity consumption in real time with regional carbon
+intensity. Primarily for local training workloads. A 2025 validation
+study found estimates can be off by ~2.4x vs. external measurements.
+- **[Hugging Face AI Energy Score](https://huggingface.github.io/AIEnergyScore/)** —
+Standardized energy efficiency benchmarking across AI models. Useful
+for model selection but does not provide per-conversation accounting.
+- **[Green Algorithms](https://www.green-algorithms.org/)** — Web
+calculator from University of Cambridge for any computational workload.
+Not AI-specific.
+### Published per-query data
+- **Patterson et al. (Google, August 2025)**: Most rigorous provider-
+published per-query data. Reports 0.24 Wh, 0.03g CO2, and 0.26 mL
+water per median Gemini text prompt. Showed 33x energy reduction over
+one year. ([arXiv:2508.15734](https://arxiv.org/abs/2508.15734))
+- **Jegham et al. ("How Hungry is AI?", May 2025)**: Cross-model
+benchmarks for 30 LLMs showing 70x energy variation between models.
+([arXiv:2505.09598](https://arxiv.org/abs/2505.09598))
+### Broader frameworks
+- **UNICC/Frugal AI Hub (December 2025)**: Three-level framework from
+Total Cost of Ownership to SDG alignment. Portfolio-level, not per-
+conversation. Does not enumerate specific social cost categories.
+- **Practical Principles for AI Cost and Compute Accounting (arXiv,
+February 2025)**: Proposes compute as a governance metric. Financial
+and compute only.
+### Research on social costs
+- **Lee et al. (CHI 2025)**: "The AI Deskilling Paradox" — survey
+finding that higher AI confidence correlates with less critical
+thinking. See Section 10.
+- **Springer (2025)**: Argues deskilling is structural, not individual.
+- **Shumailov et al. (Nature, 2024)**: Model collapse from recursive
+AI-generated training data. See Section 13.
+- **Stanford HAI (2025)**: Algorithmic monoculture and correlated failure
+across model ecosystems. See Section 15.
+### How this methodology differs
+No existing tool or framework combines per-conversation environmental
+measurement with social, cognitive, epistemic, and political cost
+categories. The tools above measure environmental costs well — we do
+not compete with them. Our contribution is the taxonomy: naming and
+organizing 20+ cost categories so that the non-environmental costs are
+not ignored simply because they are harder to quantify.
+## 22. What would improve this estimate
 - Access to actual energy-per-token and training energy metrics from
 model providers


@@ -40,7 +40,8 @@ The hook fires before Claude Code compacts your conversation context.
 It reads the conversation transcript, extracts token usage data from
 API response metadata, and calculates cost estimates using:
-- **Energy**: 0.003 Wh/1K input tokens, 0.015 Wh/1K output tokens
+- **Energy**: 0.1 Wh/1K input tokens, 0.5 Wh/1K output tokens
+(midpoint of range calibrated against Google and Jegham et al., 2025)
 - **PUE**: 1.2 (data center overhead)
 - **CO2**: 325g/kWh (US grid average for cloud regions)
 - **Cost**: $15/M input tokens, $75/M output tokens
@@ -48,13 +49,32 @@ API response metadata, and calculates cost estimates using:
 Cache-read tokens are weighted at 10% of full cost (they skip most
 computation).
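The constants above can be sketched as a small calculation. This is an illustration only, not the toolkit's actual code: the function name is made up, and the choices to apply the PUE multiplier on top of the per-token figures and the 10% cache-read weighting to both energy and dollar cost are assumptions.

```python
# Illustrative sketch of the README's cost arithmetic. Constants come from
# the README; everything else (names, where PUE and the cache-read
# weighting apply) is an assumption, not the toolkit's implementation.

WH_PER_1K_INPUT = 0.1      # Wh per 1K input tokens (calibrated midpoint)
WH_PER_1K_OUTPUT = 0.5     # Wh per 1K output tokens
PUE = 1.2                  # data center overhead multiplier
CO2_G_PER_KWH = 325.0      # US grid average for cloud regions
USD_PER_M_INPUT = 15.0     # dollars per million input tokens
USD_PER_M_OUTPUT = 75.0    # dollars per million output tokens
CACHE_READ_WEIGHT = 0.10   # cache reads skip most computation

def estimate_costs(input_tokens: int, output_tokens: int,
                   cache_read_tokens: int = 0) -> dict:
    """Return energy (Wh), CO2 (g), and dollar cost for one conversation."""
    # Cache-read tokens count at 10% of a full input token.
    effective_input = input_tokens + CACHE_READ_WEIGHT * cache_read_tokens
    energy_wh = PUE * (effective_input / 1000 * WH_PER_1K_INPUT
                       + output_tokens / 1000 * WH_PER_1K_OUTPUT)
    co2_g = energy_wh / 1000 * CO2_G_PER_KWH  # Wh -> kWh, then grams CO2
    cost_usd = (effective_input / 1e6 * USD_PER_M_INPUT
                + output_tokens / 1e6 * USD_PER_M_OUTPUT)
    return {"energy_wh": energy_wh, "co2_g": co2_g, "cost_usd": cost_usd}
```

Under these assumptions, a conversation with 50K input, 10K output, and 200K cache-read tokens works out to about 14.4 Wh, 4.7 g CO2, and $1.80.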
+## Related tools
+This toolkit measures a subset of the costs covered by
+`impact-methodology.md`. For more precise environmental measurement,
+consider these complementary tools:
+- **[EcoLogits](https://ecologits.ai/)** — Python library that tracks
+per-query energy and CO2 for API calls to OpenAI, Anthropic, Mistral,
+and others. More precise than our estimates for environmental metrics.
+- **[CodeCarbon](https://codecarbon.io/)** — Measures GPU/CPU energy for
+local training and inference workloads.
+- **[Hugging Face AI Energy Score](https://huggingface.github.io/AIEnergyScore/)** —
+Benchmarks model energy efficiency. Useful for choosing between models.
+These tools focus on environmental metrics only. This toolkit and the
+methodology also cover financial, social, epistemic, and political costs.
 ## Limitations
 - All numbers are estimates with low to medium confidence.
-- Energy-per-token figures are derived from published research on
-comparable models, not official Anthropic data.
+- Energy-per-token figures are calibrated against published research
+(Google, Aug 2025; Jegham et al., May 2025), not official Anthropic data.
 - The hook only runs on context compaction, not at conversation end.
 Short conversations that never compact will not be logged.
 - This toolkit only works with Claude Code. The methodology itself is
 tool-agnostic.
 - See `impact-methodology.md` for the full methodology, uncertainty
 analysis, and non-quantifiable costs.