Tasks 10-11: AI authorship transparency + calibrate energy estimates

Task 10: Add "How this was made" section to README disclosing AI
collaboration and project costs. Landing page updated separately.

Task 11: Calibrate energy-per-token against Google (Patterson et al.,
Aug 2025) and "How Hungry is AI" (Jegham et al., May 2025). Previous
values (0.003/0.015 Wh per 1K tokens) were ~10-100x too low. Updated
to 0.05-0.3/0.25-1.5 Wh per 1K tokens with model-dependent ranges.
Worked example now produces ~246 Wh, consistent with headline figures.
claude 2026-03-16 10:38:12 +00:00
parent 67e86d1b6b
commit a9403fe128
2 changed files with 68 additions and 16 deletions


@@ -44,6 +44,15 @@ Most estimates have low confidence. Many of the most consequential costs
 The quantifiable costs are almost certainly the least important ones.
 This is a tool for honest approximation, not precise accounting.
 
+## How this was made
+
+This project was developed by a human directing
+[Claude](https://claude.ai) (Anthropic's AI assistant) across multiple
+conversations. The methodology was applied to itself: we estimate the
+project consumed ~$2,500-10,000 in compute, ~500-2,500 Wh of energy,
+and ~150-800g of CO2 across all sessions. Whether it produces enough
+value to justify those costs is [an open question we are tracking](plans/measure-project-impact.md).
+
 ## Contributing
 
 Corrections, better data, and additional cost categories are welcome.
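The self-estimate added in the hunk above can be sanity-checked. A minimal sketch, assuming a grid intensity of ~350 g CO2/kWh (the example value used elsewhere in the methodology; actual intensity varies by region):

```python
# Rough consistency check for the README's self-estimate, assuming a
# grid intensity of ~350 g CO2 per kWh (an assumption, not measured).
GRID_G_PER_KWH = 350

def co2_grams(energy_wh: float, grid_g_per_kwh: float = GRID_G_PER_KWH) -> float:
    """Convert energy in Wh to grams of CO2 at the given grid intensity."""
    return energy_wh * grid_g_per_kwh / 1000

low = co2_grams(500)    # lower bound of the ~500-2,500 Wh estimate
high = co2_grams(2500)  # upper bound
print(f"{low:.0f}-{high:.0f} g CO2")  # 175-875 g, overlapping the ~150-800 g claim
```

The overlap is rough rather than exact, which is consistent with the document's own framing of honest approximation.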


@@ -107,31 +107,68 @@ unknowns:
 ### Sources
 
-There is no published energy-per-token figure for most commercial LLMs.
-Estimates are derived from:
+Published energy-per-query data has improved significantly since 2024.
+Key sources, from most to least reliable:
 
-- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint
-  of BLOOM", which measured energy for a 176B parameter model.
+- **Patterson et al. (Google, August 2025)**: First major provider to
+  publish detailed per-query data. Reports **0.24 Wh per median Gemini
+  text prompt** including full data center infrastructure. Also showed
+  a 33x energy reduction over one year through efficiency improvements.
+  ([arXiv:2508.15734](https://arxiv.org/abs/2508.15734))
+- **Jegham et al. ("How Hungry is AI?", May 2025)**: Cross-model
+  benchmarks for 30 LLMs. Found o3 and DeepSeek-R1 consume **>33 Wh
+  per long prompt** (70x more than GPT-4.1 nano). Claude 3.7 Sonnet
+  ranked highest in eco-efficiency.
+  ([arXiv:2505.09598](https://arxiv.org/abs/2505.09598))
 - The IEA's 2024 estimate of ~2.9 Wh per ChatGPT query (for GPT-4-class
   models, averaging ~1,000 tokens per query).
 - De Vries (2023), "The growing energy footprint of artificial
   intelligence", Joule.
+- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint
+  of BLOOM", which measured energy for a 176B parameter model.
+
+### Calibration against published data
+
+Google's 0.24 Wh per median Gemini prompt represents a **short query**
+(likely ~500-1,000 tokens). For a long coding conversation with 2M
+cumulative input tokens and 10K output tokens, that is roughly
+2,000-4,000 prompt-equivalent interactions. Naively scaling:
+2,000 × 0.24 Wh = **480 Wh**, though KV-cache and batching optimizations
+would reduce this in practice.
+
+The Jegham et al. benchmarks show enormous variation by model: a single
+long prompt ranges from 0.4 Wh (GPT-4.1 nano) to >33 Wh (o3, DeepSeek-R1).
+For frontier reasoning models, a long conversation could consume
+significantly more than our previous estimates.
 
 ### Values used
 
-- **Input tokens**: ~0.003 Wh per 1,000 tokens
-- **Output tokens**: ~0.015 Wh per 1,000 tokens (5x input cost,
+- **Input tokens**: ~0.05-0.3 Wh per 1,000 tokens
+- **Output tokens**: ~0.25-1.5 Wh per 1,000 tokens (5x input cost,
   reflecting sequential generation)
+
+The wide ranges reflect variation between models. The lower end
+corresponds to efficient models (GPT-4.1 mini, Claude 3.7 Sonnet); the
+upper end to frontier reasoning models (o3, DeepSeek-R1).
+
+**Previous values** (used in versions before March 2026): 0.003 and
+0.015 Wh per 1,000 tokens respectively. These were derived from
+pre-2025 estimates and are now known to be approximately 10-100x too
+low based on Google's published data.
 
 ### Uncertainty
 
-These numbers are rough. The actual values depend on:
+The true values depend on:
 
-- Model size (parameter counts for commercial models are often not public)
+- Model size and architecture (reasoning models use chain-of-thought,
+  consuming far more tokens internally)
 - Hardware (GPU type, batch size, utilization)
 - Quantization and optimization techniques
 - Whether speculative decoding or KV-cache optimizations are used
+- Provider-specific infrastructure efficiency
 
-The true values could be 0.5x to 3x the figures used here.
+The true values could be 0.3x to 3x the midpoint figures used here.
+The variation *between models* now dominates the uncertainty: choosing
+a different model can change energy by 70x (Jegham et al.).
 
 ## 3. Data center overhead (PUE)
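The naive scaling added in the calibration section above can be written out explicitly. A sketch, where the tokens-per-prompt figure is an assumption taken from the hunk's "~500-1,000 tokens" estimate:

```python
# Naively scale Google's published per-prompt figure (Patterson et al.)
# to a long conversation. An upper-bound sketch: KV-cache and batching
# optimizations would reduce the real figure.
WH_PER_MEDIAN_PROMPT = 0.24   # median Gemini text prompt, incl. infrastructure
TOKENS_PER_PROMPT = 1000      # assumed short-query size (~500-1,000 tokens)

def naive_conversation_wh(cumulative_input_tokens: int) -> float:
    """Energy estimate via prompt-equivalent interactions."""
    interactions = cumulative_input_tokens / TOKENS_PER_PROMPT
    return interactions * WH_PER_MEDIAN_PROMPT

print(naive_conversation_wh(2_000_000))  # 480.0 Wh, matching the hunk's arithmetic
```

Using 500 tokens per prompt instead would double the result to 960 Wh, which is why the hunk gives a 2,000-4,000 interaction range.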
@@ -178,9 +215,11 @@ come from the regional grid in real time.
 ### Calculation template
 
+Using midpoint values (0.1 Wh/1K input, 0.5 Wh/1K output):
+
 ```
-Server energy = (cumulative_input_tokens * 0.003/1000
-              + output_tokens * 0.015/1000) * PUE
+Server energy = (cumulative_input_tokens * 0.1/1000
+              + output_tokens * 0.5/1000) * PUE
 
 Server CO2 = server_energy_Wh * grid_intensity_g_per_kWh / 1000
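The updated template in the hunk above transcribes directly into code. A sketch using the diff's midpoint values; function and constant names are illustrative, not part of the project:

```python
# Transcription of the calculation template with the new midpoint values.
INPUT_WH_PER_1K = 0.1    # Wh per 1,000 input tokens (midpoint of 0.05-0.3)
OUTPUT_WH_PER_1K = 0.5   # Wh per 1,000 output tokens (midpoint of 0.25-1.5)

def server_energy_wh(input_tokens: int, output_tokens: int, pue: float) -> float:
    """Server-side energy in Wh, scaled by data center PUE."""
    return (input_tokens * INPUT_WH_PER_1K / 1000
            + output_tokens * OUTPUT_WH_PER_1K / 1000) * pue

def server_co2_g(energy_wh: float, grid_g_per_kwh: float) -> float:
    """Server-side emissions in grams of CO2 at a given grid intensity."""
    return energy_wh * grid_g_per_kwh / 1000

energy = server_energy_wh(2_000_000, 10_000, pue=1.2)  # ≈ 246 Wh
```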
@@ -193,17 +232,21 @@ Total CO2 = Server CO2 + Client CO2
 A conversation with 2M cumulative input tokens and 10K output tokens:
 
 ```
-Server energy = (2,000,000 * 0.003/1000 + 10,000 * 0.015/1000) * 1.2
-              = (6.0 + 0.15) * 1.2
-              = ~7.4 Wh
+Server energy = (2,000,000 * 0.1/1000 + 10,000 * 0.5/1000) * 1.2
+              = (200 + 5.0) * 1.2
+              = ~246 Wh
 
-Server CO2 = 7.4 * 350 / 1000 = ~2.6g CO2
+Server CO2 = 246 * 350 / 1000 = ~86g CO2
 Client CO2 = 0.5 * 56 / 1000 = ~0.03g CO2 (France)
 
-Total CO2 = ~2.6g
+Total CO2 = ~86g
 ```
+
+This is consistent with the headline range of 100-250 Wh and 30-80g CO2
+for a long conversation. The previous version of this methodology
+estimated ~7.4 Wh for the same conversation, which was ~30x too low.
 
 ## 6. Water usage
 
 Data centers use water for evaporative cooling. Li et al. (2023), "Making
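The worked example in the hunk above can be checked end to end. A sketch reproducing the same figures (PUE 1.2, 350 g/kWh grid, 0.5 Wh client device on France's ~56 g/kWh grid); these are the example's assumptions, not measured values:

```python
# Reproduce the updated worked example: 2M input tokens, 10K output tokens.
def server_energy_wh(input_tokens, output_tokens, pue=1.2):
    """Server energy in Wh at the midpoint rates (0.1 and 0.5 Wh per 1K tokens)."""
    return (input_tokens * 0.1 / 1000 + output_tokens * 0.5 / 1000) * pue

energy = server_energy_wh(2_000_000, 10_000)  # (200 + 5.0) * 1.2 = 246 Wh
server_co2 = energy * 350 / 1000              # ~86.1 g
client_co2 = 0.5 * 56 / 1000                  # ~0.03 g
total = server_co2 + client_co2
print(f"{energy:.0f} Wh, {total:.0f} g CO2")  # 246 Wh, 86 g CO2
```

Note the client-side term is three orders of magnitude smaller than the server-side term, so the server estimate dominates the total.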