Tasks 10-11: AI authorship transparency + calibrate energy estimates

Task 10: Add "How this was made" section to README disclosing AI
collaboration and project costs. Landing page updated separately.

Task 11: Calibrate energy-per-token against Google (Patterson et al.,
Aug 2025) and "How Hungry is AI" (Jegham et al., May 2025). Previous
values (0.003/0.015 Wh per 1K tokens) were ~10-100x too low. Updated
to 0.05-0.3/0.25-1.5 Wh per 1K tokens with model-dependent ranges.
Worked example now produces ~246 Wh, consistent with headline figures.
claude 2026-03-16 10:38:12 +00:00
parent 67e86d1b6b
commit a9403fe128
2 changed files with 68 additions and 16 deletions


@@ -44,6 +44,15 @@ Most estimates have low confidence. Many of the most consequential costs
 The quantifiable costs are almost certainly the least important ones.
 This is a tool for honest approximation, not precise accounting.
 
+## How this was made
+
+This project was developed by a human directing
+[Claude](https://claude.ai) (Anthropic's AI assistant) across multiple
+conversations. The methodology was applied to itself: we estimate the
+project consumed ~$2,500-10,000 in compute, ~500-2,500 Wh of energy,
+and ~150-800g of CO2 across all sessions. Whether it produces enough
+value to justify those costs is [an open question we are tracking](plans/measure-project-impact.md).
+
 ## Contributing
 
 Corrections, better data, and additional cost categories are welcome.
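The self-estimate added in the hunk above can be sanity-checked. A minimal sketch, assuming a grid intensity of ~350 g CO2/kWh (the example value used elsewhere in the methodology; actual intensity varies by region):

```python
# Rough consistency check for the README's self-estimate, assuming a
# grid intensity of ~350 g CO2 per kWh (an assumption, not measured).
GRID_G_PER_KWH = 350

def co2_grams(energy_wh: float, grid_g_per_kwh: float = GRID_G_PER_KWH) -> float:
    """Convert energy in Wh to grams of CO2 at the given grid intensity."""
    return energy_wh * grid_g_per_kwh / 1000

low = co2_grams(500)    # lower bound of the ~500-2,500 Wh estimate
high = co2_grams(2500)  # upper bound
print(f"{low:.0f}-{high:.0f} g CO2")  # 175-875 g, overlapping the ~150-800 g claim
```

The overlap is rough rather than exact, which is consistent with the document's own framing of honest approximation.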


@@ -107,31 +107,68 @@ unknowns:
 ### Sources
 
-There is no published energy-per-token figure for most commercial LLMs.
-Estimates are derived from:
+Published energy-per-query data has improved significantly since 2024.
+Key sources, from most to least reliable:
 
-- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint
-  of BLOOM", which measured energy for a 176B parameter model.
+- **Patterson et al. (Google, August 2025)**: First major provider to
+  publish detailed per-query data. Reports **0.24 Wh per median Gemini
+  text prompt** including full data center infrastructure. Also showed
+  a 33x energy reduction over one year through efficiency improvements.
+  ([arXiv:2508.15734](https://arxiv.org/abs/2508.15734))
+- **Jegham et al. ("How Hungry is AI?", May 2025)**: Cross-model
+  benchmarks for 30 LLMs. Found o3 and DeepSeek-R1 consume **>33 Wh
+  per long prompt** (70x more than GPT-4.1 nano). Claude 3.7 Sonnet
+  ranked highest in eco-efficiency.
+  ([arXiv:2505.09598](https://arxiv.org/abs/2505.09598))
 - The IEA's 2024 estimate of ~2.9 Wh per ChatGPT query (for GPT-4-class
   models, averaging ~1,000 tokens per query).
 - De Vries (2023), "The growing energy footprint of artificial
   intelligence", Joule.
+- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint
+  of BLOOM", which measured energy for a 176B parameter model.
+
+### Calibration against published data
+
+Google's 0.24 Wh per median Gemini prompt represents a **short query**
+(likely ~500-1,000 tokens). For a long coding conversation with 2M
+cumulative input tokens and 10K output tokens, that is roughly
+2,000-4,000 prompt-equivalent interactions. Naively scaling:
+2,000 × 0.24 Wh = **480 Wh**, though KV-cache and batching optimizations
+would reduce this in practice.
+
+The Jegham et al. benchmarks show enormous variation by model: a single
+long prompt ranges from 0.4 Wh (GPT-4.1 nano) to >33 Wh (o3, DeepSeek-R1).
+For frontier reasoning models, a long conversation could consume
+significantly more than our previous estimates.
 
 ### Values used
 
-- **Input tokens**: ~0.003 Wh per 1,000 tokens
-- **Output tokens**: ~0.015 Wh per 1,000 tokens (5x input cost,
+- **Input tokens**: ~0.05-0.3 Wh per 1,000 tokens
+- **Output tokens**: ~0.25-1.5 Wh per 1,000 tokens (5x input cost,
   reflecting sequential generation)
+
+The wide ranges reflect variation between models. The lower end
+corresponds to efficient models (GPT-4.1 mini, Claude 3.7 Sonnet); the
+upper end to frontier reasoning models (o3, DeepSeek-R1).
+
+**Previous values** (used in versions before March 2026): 0.003 and
+0.015 Wh per 1,000 tokens respectively. These were derived from
+pre-2025 estimates and are now known to be approximately 10-100x too
+low based on Google's published data.
 
 ### Uncertainty
 
-These numbers are rough. The actual values depend on:
+The true values depend on:
 
-- Model size (parameter counts for commercial models are often not public)
+- Model size and architecture (reasoning models use chain-of-thought,
+  consuming far more tokens internally)
 - Hardware (GPU type, batch size, utilization)
 - Quantization and optimization techniques
 - Whether speculative decoding or KV-cache optimizations are used
+- Provider-specific infrastructure efficiency
 
-The true values could be 0.5x to 3x the figures used here.
+The true values could be 0.3x to 3x the midpoint figures used here.
+The variation *between models* now dominates the uncertainty: choosing
+a different model can change energy by 70x (Jegham et al.).
 
 ## 3. Data center overhead (PUE)
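The naive scaling added in the calibration section above can be written out explicitly. A sketch, where the tokens-per-prompt figure is an assumption taken from the hunk's "~500-1,000 tokens" estimate:

```python
# Naively scale Google's published per-prompt figure (Patterson et al.)
# to a long conversation. An upper-bound sketch: KV-cache and batching
# optimizations would reduce the real figure.
WH_PER_MEDIAN_PROMPT = 0.24   # median Gemini text prompt, incl. infrastructure
TOKENS_PER_PROMPT = 1000      # assumed short-query size (~500-1,000 tokens)

def naive_conversation_wh(cumulative_input_tokens: int) -> float:
    """Energy estimate via prompt-equivalent interactions."""
    interactions = cumulative_input_tokens / TOKENS_PER_PROMPT
    return interactions * WH_PER_MEDIAN_PROMPT

print(naive_conversation_wh(2_000_000))  # 480.0 Wh, matching the hunk's arithmetic
```

Using 500 tokens per prompt instead would double the result to 960 Wh, which is why the hunk gives a 2,000-4,000 interaction range.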
@@ -178,9 +215,11 @@ come from the regional grid in real time.
 ### Calculation template
 
+Using midpoint values (0.1 Wh/1K input, 0.5 Wh/1K output):
+
 ```
-Server energy = (cumulative_input_tokens * 0.003/1000
-              + output_tokens * 0.015/1000) * PUE
+Server energy = (cumulative_input_tokens * 0.1/1000
+              + output_tokens * 0.5/1000) * PUE
 
 Server CO2 = server_energy_Wh * grid_intensity_g_per_kWh / 1000
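The updated template in the hunk above transcribes directly into code. A sketch using the diff's midpoint values; function and constant names are illustrative, not part of the project:

```python
# Transcription of the calculation template with the new midpoint values.
INPUT_WH_PER_1K = 0.1    # Wh per 1,000 input tokens (midpoint of 0.05-0.3)
OUTPUT_WH_PER_1K = 0.5   # Wh per 1,000 output tokens (midpoint of 0.25-1.5)

def server_energy_wh(input_tokens: int, output_tokens: int, pue: float) -> float:
    """Server-side energy in Wh, scaled by data center PUE."""
    return (input_tokens * INPUT_WH_PER_1K / 1000
            + output_tokens * OUTPUT_WH_PER_1K / 1000) * pue

def server_co2_g(energy_wh: float, grid_g_per_kwh: float) -> float:
    """Server-side emissions in grams of CO2 at a given grid intensity."""
    return energy_wh * grid_g_per_kwh / 1000

energy = server_energy_wh(2_000_000, 10_000, pue=1.2)  # ≈ 246 Wh
```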
@@ -193,17 +232,21 @@ Total CO2 = Server CO2 + Client CO2
 A conversation with 2M cumulative input tokens and 10K output tokens:
 
 ```
-Server energy = (2,000,000 * 0.003/1000 + 10,000 * 0.015/1000) * 1.2
-              = (6.0 + 0.15) * 1.2
-              = ~7.4 Wh
+Server energy = (2,000,000 * 0.1/1000 + 10,000 * 0.5/1000) * 1.2
+              = (200 + 5.0) * 1.2
+              = ~246 Wh
 
-Server CO2 = 7.4 * 350 / 1000 = ~2.6g CO2
+Server CO2 = 246 * 350 / 1000 = ~86g CO2
 Client CO2 = 0.5 * 56 / 1000 = ~0.03g CO2 (France)
 
-Total CO2 = ~2.6g
+Total CO2 = ~86g
 ```
+
+This is consistent with the headline range of 100-250 Wh and 30-80g CO2
+for a long conversation. The previous version of this methodology
+estimated ~7.4 Wh for the same conversation, which was ~30x too low.
 
 ## 6. Water usage
 
 Data centers use water for evaporative cooling. Li et al. (2023), "Making
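The worked example in the hunk above can be checked end to end. A sketch reproducing the same figures (PUE 1.2, 350 g/kWh grid, 0.5 Wh client device on France's ~56 g/kWh grid); these are the example's assumptions, not measured values:

```python
# Reproduce the updated worked example: 2M input tokens, 10K output tokens.
def server_energy_wh(input_tokens, output_tokens, pue=1.2):
    """Server energy in Wh at the midpoint rates (0.1 and 0.5 Wh per 1K tokens)."""
    return (input_tokens * 0.1 / 1000 + output_tokens * 0.5 / 1000) * pue

energy = server_energy_wh(2_000_000, 10_000)  # (200 + 5.0) * 1.2 = 246 Wh
server_co2 = energy * 350 / 1000              # ~86.1 g
client_co2 = 0.5 * 56 / 1000                  # ~0.03 g
total = server_co2 + client_co2
print(f"{energy:.0f} Wh, {total:.0f} g CO2")  # 246 Wh, 86 g CO2
```

Note the client-side term is three orders of magnitude smaller than the server-side term, so the server estimate dominates the total.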