Tasks 10-11: AI authorship transparency + calibrate energy estimates

Task 10: Add "How this was made" section to README disclosing AI
collaboration and project costs. Landing page updated separately.

Task 11: Calibrate energy-per-token against Google (Patterson et al.,
Aug 2025) and "How Hungry is AI" (Jegham et al., May 2025). Previous
values (0.003/0.015 Wh per 1K tokens) were ~10-100x too low. Updated to
0.05-0.3/0.25-1.5 Wh per 1K tokens with model-dependent ranges. Worked
example now produces ~246 Wh, consistent with headline figures.

parent 67e86d1b6b
commit a9403fe128
2 changed files with 68 additions and 16 deletions
@@ -44,6 +44,15 @@ Most estimates have low confidence. Many of the most consequential costs
 The quantifiable costs are almost certainly the least important ones.
 
 This is a tool for honest approximation, not precise accounting.
 
+## How this was made
+
+This project was developed by a human directing
+[Claude](https://claude.ai) (Anthropic's AI assistant) across multiple
+conversations. The methodology was applied to itself: we estimate the
+project consumed ~$2,500-10,000 in compute, ~500-2,500 Wh of energy,
+and ~150-800g of CO2 across all sessions. Whether it produces enough
+value to justify those costs is [an open question we are tracking](plans/measure-project-impact.md).
+
 ## Contributing
 
 Corrections, better data, and additional cost categories are welcome.
@@ -107,31 +107,68 @@ unknowns:
 ### Sources
 
-There is no published energy-per-token figure for most commercial LLMs.
-Estimates are derived from:
+Published energy-per-query data has improved significantly since 2024.
+Key sources, from most to least reliable:
 
-- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint
-  of BLOOM", which measured energy for a 176B parameter model.
+- **Patterson et al. (Google, August 2025)**: First major provider to
+  publish detailed per-query data. Reports **0.24 Wh per median Gemini
+  text prompt** including full data center infrastructure. Also showed
+  33x energy reduction over one year through efficiency improvements.
+  ([arXiv:2508.15734](https://arxiv.org/abs/2508.15734))
+- **Jegham et al. ("How Hungry is AI?", May 2025)**: Cross-model
+  benchmarks for 30 LLMs. Found o3 and DeepSeek-R1 consume **>33 Wh
+  per long prompt** (70x more than GPT-4.1 nano). Claude 3.7 Sonnet
+  ranked highest in eco-efficiency.
+  ([arXiv:2505.09598](https://arxiv.org/abs/2505.09598))
 - The IEA's 2024 estimate of ~2.9 Wh per ChatGPT query (for GPT-4-class
   models, averaging ~1,000 tokens per query).
 - De Vries (2023), "The growing energy footprint of artificial
   intelligence", Joule.
+- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint
+  of BLOOM", which measured energy for a 176B parameter model.
+
+### Calibration against published data
+
+Google's 0.24 Wh per median Gemini prompt represents a **short query**
+(likely ~500-1000 tokens). For a long coding conversation with 2M
+cumulative input tokens and 10K output tokens, that's roughly
+2000-4000 prompt-equivalent interactions. Naively scaling:
+2000 × 0.24 Wh = **480 Wh**, though KV-cache and batching optimizations
+would reduce this in practice.
+
+The Jegham et al. benchmarks show enormous variation by model: a single
+long prompt ranges from 0.4 Wh (GPT-4.1 nano) to >33 Wh (o3, DeepSeek-R1).
+For frontier reasoning models, a long conversation could consume
+significantly more than our previous estimates.
 
 ### Values used
 
-- **Input tokens**: ~0.003 Wh per 1,000 tokens
-- **Output tokens**: ~0.015 Wh per 1,000 tokens (5x input cost,
+- **Input tokens**: ~0.05-0.3 Wh per 1,000 tokens
+- **Output tokens**: ~0.25-1.5 Wh per 1,000 tokens (5x input cost,
   reflecting sequential generation)
+
+The wide ranges reflect model variation. The lower end corresponds to
+efficient models (GPT-4.1 mini, Claude 3.7 Sonnet); the upper end to
+frontier reasoning models (o3, DeepSeek-R1).
+
+**Previous values** (used in versions before March 2026): 0.003 and
+0.015 Wh per 1,000 tokens respectively. These were derived from
+pre-2025 estimates and are now known to be approximately 10-100x too
+low based on Google's published data.
 
 ### Uncertainty
 
-These numbers are rough. The actual values depend on:
-- Model size (parameter counts for commercial models are often not public)
+The true values depend on:
+- Model size and architecture (reasoning models use chain-of-thought,
+  consuming far more tokens internally)
 - Hardware (GPU type, batch size, utilization)
 - Quantization and optimization techniques
 - Whether speculative decoding or KV-cache optimizations are used
+- Provider-specific infrastructure efficiency
 
-The true values could be 0.5x to 3x the figures used here.
+The true values could be 0.3x to 3x the midpoint figures used here.
+The variation *between models* now dominates the uncertainty: choosing
+a different model can change energy by 70x (Jegham et al.).
 
 ## 3. Data center overhead (PUE)
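The updated per-1K-token ranges above lend themselves to a quick sanity check. A minimal sketch (the function and constant names are illustrative, not part of the methodology file):

```python
# Sketch: server-side energy range for a conversation under the updated
# per-1K-token values (0.05-0.3 Wh input, 0.25-1.5 Wh output), before PUE.
# Names and structure are illustrative assumptions.

INPUT_WH_PER_1K = (0.05, 0.3)    # efficient model .. frontier reasoning model
OUTPUT_WH_PER_1K = (0.25, 1.5)   # 5x input, reflecting sequential generation

def energy_range_wh(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (low, high) server-side energy in Wh, before PUE overhead."""
    low = (input_tokens * INPUT_WH_PER_1K[0]
           + output_tokens * OUTPUT_WH_PER_1K[0]) / 1000
    high = (input_tokens * INPUT_WH_PER_1K[1]
            + output_tokens * OUTPUT_WH_PER_1K[1]) / 1000
    return low, high

# The long-conversation case used in the calibration: 2M input, 10K output.
low, high = energy_range_wh(2_000_000, 10_000)
print(f"{low:.1f}-{high:.1f} Wh")  # 102.5-615.0 Wh
```

The span brackets both the naive 480 Wh scaling from Google's 0.24 Wh figure and the ~246 Wh midpoint worked example, which is why the headline range is quoted rather than a single number.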
@@ -178,9 +215,11 @@ come from the regional grid in real time.
 ### Calculation template
 
+Using midpoint values (0.1 Wh/1K input, 0.5 Wh/1K output):
+
 ```
-Server energy = (cumulative_input_tokens * 0.003/1000
-                 + output_tokens * 0.015/1000) * PUE
+Server energy = (cumulative_input_tokens * 0.1/1000
+                 + output_tokens * 0.5/1000) * PUE
 
 Server CO2 = server_energy_Wh * grid_intensity_g_per_kWh / 1000
@@ -193,17 +232,21 @@ Total CO2 = Server CO2 + Client CO2
 A conversation with 2M cumulative input tokens and 10K output tokens:
 ```
-Server energy = (2,000,000 * 0.003/1000 + 10,000 * 0.015/1000) * 1.2
-              = (6.0 + 0.15) * 1.2
-              = ~7.4 Wh
+Server energy = (2,000,000 * 0.1/1000 + 10,000 * 0.5/1000) * 1.2
+              = (200 + 5.0) * 1.2
+              = ~246 Wh
 
-Server CO2 = 7.4 * 350 / 1000 = ~2.6g CO2
+Server CO2 = 246 * 350 / 1000 = ~86g CO2
 
 Client CO2 = 0.5 * 56 / 1000 = ~0.03g CO2 (France)
 
-Total CO2 = ~2.6g
+Total CO2 = ~86g
 ```
+
+This is consistent with the headline range of 100-250 Wh and 30-80g CO2
+for a long conversation. The previous version of this methodology
+estimated ~7.4 Wh for the same conversation, which was ~30x too low.
 
 ## 6. Water usage
 
 Data centers use water for evaporative cooling. Li et al. (2023), "Making
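The worked example's arithmetic can be reproduced in a few lines. A sketch using the midpoint values, PUE, and grid figures from the example (names are illustrative assumptions, not the project's actual tooling):

```python
# Sketch reproducing the worked example: midpoint values 0.1/0.5 Wh per
# 1K tokens, PUE 1.2, grid intensity 350 g CO2/kWh. Illustrative names.

def server_energy_wh(input_tokens, output_tokens, pue=1.2,
                     input_wh_per_1k=0.1, output_wh_per_1k=0.5):
    """Server-side energy in Wh, including data center overhead (PUE)."""
    return (input_tokens * input_wh_per_1k / 1000
            + output_tokens * output_wh_per_1k / 1000) * pue

energy = server_energy_wh(2_000_000, 10_000)   # (200 + 5.0) * 1.2
server_co2_g = energy * 350 / 1000             # regional grid, 350 g/kWh
client_co2_g = 0.5 * 56 / 1000                 # client device, France

print(round(energy), round(server_co2_g))  # 246 86
```

Swapping in the range endpoints from "Values used" instead of the midpoints reproduces the headline spread rather than a single point estimate.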