Tasks 10-11: AI authorship transparency + calibrate energy estimates

Task 10: Add "How this was made" section to README disclosing AI
collaboration and project costs. Landing page updated separately.

Task 11: Calibrate energy-per-token against Google (Patterson et al.,
Aug 2025) and "How Hungry is AI" (Jegham et al., May 2025). Previous
values (0.003/0.015 Wh per 1K tokens) were ~10-100x too low. Updated to
0.05-0.3/0.25-1.5 Wh per 1K tokens with model-dependent ranges. Worked
example now produces ~246 Wh, consistent with headline figures.

parent 67e86d1b6b
commit a9403fe128
2 changed files with 68 additions and 16 deletions
@@ -44,6 +44,15 @@ Most estimates have low confidence. Many of the most consequential costs
 The quantifiable costs are almost certainly the least important ones.
 
 This is a tool for honest approximation, not precise accounting.
 
+## How this was made
+
+This project was developed by a human directing
+[Claude](https://claude.ai) (Anthropic's AI assistant) across multiple
+conversations. The methodology was applied to itself: we estimate the
+project consumed ~$2,500-10,000 in compute, ~500-2,500 Wh of energy,
+and ~150-800g of CO2 across all sessions. Whether it produces enough
+value to justify those costs is [an open question we are tracking](plans/measure-project-impact.md).
+
 ## Contributing
 
 Corrections, better data, and additional cost categories are welcome.
@@ -107,31 +107,68 @@ unknowns:
 ### Sources
 
-There is no published energy-per-token figure for most commercial LLMs.
-Estimates are derived from:
+Published energy-per-query data has improved significantly since 2024.
+Key sources, from most to least reliable:
 
-- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint
-  of BLOOM", which measured energy for a 176B parameter model.
+- **Patterson et al. (Google, August 2025)**: First major provider to
+  publish detailed per-query data. Reports **0.24 Wh per median Gemini
+  text prompt** including full data center infrastructure. Also showed
+  33x energy reduction over one year through efficiency improvements.
+  ([arXiv:2508.15734](https://arxiv.org/abs/2508.15734))
+- **Jegham et al. ("How Hungry is AI?", May 2025)**: Cross-model
+  benchmarks for 30 LLMs. Found o3 and DeepSeek-R1 consume **>33 Wh
+  per long prompt** (70x more than GPT-4.1 nano). Claude 3.7 Sonnet
+  ranked highest in eco-efficiency.
+  ([arXiv:2505.09598](https://arxiv.org/abs/2505.09598))
 - The IEA's 2024 estimate of ~2.9 Wh per ChatGPT query (for GPT-4-class
   models, averaging ~1,000 tokens per query).
 - De Vries (2023), "The growing energy footprint of artificial
   intelligence", Joule.
+- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint
+  of BLOOM", which measured energy for a 176B parameter model.
+
+### Calibration against published data
+
+Google's 0.24 Wh per median Gemini prompt represents a **short query**
+(likely ~500-1000 tokens). For a long coding conversation with 2M
+cumulative input tokens and 10K output tokens, that's roughly
+2000-4000 prompt-equivalent interactions. Naively scaling:
+2000 × 0.24 Wh = **480 Wh**, though KV-cache and batching optimizations
+would reduce this in practice.
+
+The Jegham et al. benchmarks show enormous variation by model: a single
+long prompt ranges from 0.4 Wh (GPT-4.1 nano) to >33 Wh (o3, DeepSeek-R1).
+For frontier reasoning models, a long conversation could consume
+significantly more than our previous estimates.
 
 ### Values used
 
-- **Input tokens**: ~0.003 Wh per 1,000 tokens
-- **Output tokens**: ~0.015 Wh per 1,000 tokens (5x input cost,
+- **Input tokens**: ~0.05-0.3 Wh per 1,000 tokens
+- **Output tokens**: ~0.25-1.5 Wh per 1,000 tokens (5x input cost,
   reflecting sequential generation)
+
+The wide ranges reflect model variation. The lower end corresponds to
+efficient models (GPT-4.1 mini, Claude 3.7 Sonnet); the upper end to
+frontier reasoning models (o3, DeepSeek-R1).
+
+**Previous values** (used in versions before March 2026): 0.003 and
+0.015 Wh per 1,000 tokens respectively. These were derived from
+pre-2025 estimates and are now known to be approximately 10-100x too
+low based on Google's published data.
 
 ### Uncertainty
 
-These numbers are rough. The actual values depend on:
-- Model size (parameter counts for commercial models are often not public)
+The true values depend on:
+- Model size and architecture (reasoning models use chain-of-thought,
+  consuming far more tokens internally)
 - Hardware (GPU type, batch size, utilization)
 - Quantization and optimization techniques
 - Whether speculative decoding or KV-cache optimizations are used
+- Provider-specific infrastructure efficiency
 
-The true values could be 0.5x to 3x the figures used here.
+The true values could be 0.3x to 3x the midpoint figures used here.
+The variation *between models* now dominates the uncertainty: choosing
+a different model can change energy by 70x (Jegham et al.).
 
 ## 3. Data center overhead (PUE)
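The updated per-1K-token ranges above lend themselves to a quick sanity check. A minimal sketch (the function and constant names are illustrative, not part of the methodology file):

```python
# Sketch: server-side energy range for a conversation under the updated
# per-1K-token values (0.05-0.3 Wh input, 0.25-1.5 Wh output), before PUE.
# Names and structure are illustrative assumptions.

INPUT_WH_PER_1K = (0.05, 0.3)    # efficient model .. frontier reasoning model
OUTPUT_WH_PER_1K = (0.25, 1.5)   # 5x input, reflecting sequential generation

def energy_range_wh(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (low, high) server-side energy in Wh, before PUE overhead."""
    low = (input_tokens * INPUT_WH_PER_1K[0]
           + output_tokens * OUTPUT_WH_PER_1K[0]) / 1000
    high = (input_tokens * INPUT_WH_PER_1K[1]
            + output_tokens * OUTPUT_WH_PER_1K[1]) / 1000
    return low, high

# The long-conversation case used in the calibration: 2M input, 10K output.
low, high = energy_range_wh(2_000_000, 10_000)
print(f"{low:.1f}-{high:.1f} Wh")  # 102.5-615.0 Wh
```

The span brackets both the naive 480 Wh scaling from Google's 0.24 Wh figure and the ~246 Wh midpoint worked example, which is why the headline range is quoted rather than a single number.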
@@ -178,9 +215,11 @@ come from the regional grid in real time.
 ### Calculation template
 
+Using midpoint values (0.1 Wh/1K input, 0.5 Wh/1K output):
+
 ```
-Server energy = (cumulative_input_tokens * 0.003/1000
-                 + output_tokens * 0.015/1000) * PUE
+Server energy = (cumulative_input_tokens * 0.1/1000
+                 + output_tokens * 0.5/1000) * PUE
 
 Server CO2 = server_energy_Wh * grid_intensity_g_per_kWh / 1000
@@ -193,17 +232,21 @@ Total CO2 = Server CO2 + Client CO2
 A conversation with 2M cumulative input tokens and 10K output tokens:
 ```
-Server energy = (2,000,000 * 0.003/1000 + 10,000 * 0.015/1000) * 1.2
-              = (6.0 + 0.15) * 1.2
-              = ~7.4 Wh
+Server energy = (2,000,000 * 0.1/1000 + 10,000 * 0.5/1000) * 1.2
+              = (200 + 5.0) * 1.2
+              = ~246 Wh
 
-Server CO2 = 7.4 * 350 / 1000 = ~2.6g CO2
+Server CO2 = 246 * 350 / 1000 = ~86g CO2
 
 Client CO2 = 0.5 * 56 / 1000 = ~0.03g CO2 (France)
 
-Total CO2 = ~2.6g
+Total CO2 = ~86g
 ```
+
+This is consistent with the headline range of 100-250 Wh and 30-80g CO2
+for a long conversation. The previous version of this methodology
+estimated ~7.4 Wh for the same conversation, which was ~30x too low.
 
 ## 6. Water usage
 
 Data centers use water for evaporative cooling. Li et al. (2023), "Making
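The worked example's arithmetic can be reproduced in a few lines. A sketch using the midpoint values, PUE, and grid figures from the example (names are illustrative assumptions, not the project's actual tooling):

```python
# Sketch reproducing the worked example: midpoint values 0.1/0.5 Wh per
# 1K tokens, PUE 1.2, grid intensity 350 g CO2/kWh. Illustrative names.

def server_energy_wh(input_tokens, output_tokens, pue=1.2,
                     input_wh_per_1k=0.1, output_wh_per_1k=0.5):
    """Server-side energy in Wh, including data center overhead (PUE)."""
    return (input_tokens * input_wh_per_1k / 1000
            + output_tokens * output_wh_per_1k / 1000) * pue

energy = server_energy_wh(2_000_000, 10_000)   # (200 + 5.0) * 1.2
server_co2_g = energy * 350 / 1000             # regional grid, 350 g/kWh
client_co2_g = 0.5 * 56 / 1000                 # client device, France

print(round(energy), round(server_co2_g))  # 246 86
```

Swapping in the range endpoints from "Values used" instead of the midpoints reproduces the headline spread rather than a single point estimate.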