diff --git a/README.md b/README.md
index 0a6dd21..a6b2b95 100644
--- a/README.md
+++ b/README.md
@@ -44,6 +44,15 @@ Most estimates have low confidence. Many of the most consequential costs
 The quantifiable costs are almost certainly the least important ones.
 This is a tool for honest approximation, not precise accounting.
 
+## How this was made
+
+This project was developed by a human directing
+[Claude](https://claude.ai) (Anthropic's AI assistant) across multiple
+conversations. The methodology was applied to itself: we estimate the
+project consumed ~$2,500-10,000 in compute, ~500-2,500 Wh of energy,
+and ~150-800g of CO2 across all sessions. Whether it produces enough
+value to justify those costs is [an open question we are tracking](plans/measure-project-impact.md).
+
 ## Contributing
 
 Corrections, better data, and additional cost categories are welcome.
diff --git a/impact-methodology.md b/impact-methodology.md
index 064dbd5..37094ce 100644
--- a/impact-methodology.md
+++ b/impact-methodology.md
@@ -107,31 +107,68 @@ unknowns:
 
 ### Sources
 
-There is no published energy-per-token figure for most commercial LLMs.
-Estimates are derived from:
+Published energy-per-query data has improved significantly since 2024.
+Key sources, from most to least reliable:
 
-- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint
-  of BLOOM", which measured energy for a 176B parameter model.
+- **Patterson et al. (Google, August 2025)**: the first major provider
+  to publish detailed per-query data. Reports **0.24 Wh per median
+  Gemini text prompt**, including full data center infrastructure, and
+  a 33x energy reduction over one year through efficiency improvements.
+  ([arXiv:2508.15734](https://arxiv.org/abs/2508.15734))
+- **Jegham et al. ("How Hungry is AI?", May 2025)**: cross-model
+  benchmarks for 30 LLMs. Found o3 and DeepSeek-R1 consume **>33 Wh
+  per long prompt** (70x more than GPT-4.1 nano). Claude 3.7 Sonnet
+  ranked highest in eco-efficiency.
+  ([arXiv:2505.09598](https://arxiv.org/abs/2505.09598))
 - The IEA's 2024 estimate of ~2.9 Wh per ChatGPT query (for
   GPT-4-class models, averaging ~1,000 tokens per query).
 - De Vries (2023), "The growing energy footprint of artificial
   intelligence", Joule.
+- Luccioni, Viguier & Ligozat (2023), "Estimating the Carbon Footprint
+  of BLOOM", which measured energy for a 176B parameter model.
+
+### Calibration against published data
+
+Google's 0.24 Wh per median Gemini prompt represents a **short query**
+(likely ~500-1,000 tokens). For a long coding conversation with 2M
+cumulative input tokens and 10K output tokens, that's roughly
+2,000-4,000 prompt-equivalent interactions. Naively scaling,
+2,000 × 0.24 Wh = **480 Wh**, though KV-cache and batching
+optimizations would reduce this in practice.
+
+The Jegham et al. benchmarks show enormous variation by model: a
+single long prompt ranges from 0.4 Wh (GPT-4.1 nano) to >33 Wh (o3,
+DeepSeek-R1). For frontier reasoning models, a long conversation could
+consume significantly more than our previous estimates suggested.
 
 ### Values used
 
-- **Input tokens**: ~0.003 Wh per 1,000 tokens
-- **Output tokens**: ~0.015 Wh per 1,000 tokens (5x input cost,
+- **Input tokens**: ~0.05-0.3 Wh per 1,000 tokens
+- **Output tokens**: ~0.25-1.5 Wh per 1,000 tokens (5x input cost,
   reflecting sequential generation)
 
+The wide ranges reflect variation between models. The lower end
+corresponds to efficient models (GPT-4.1 mini, Claude 3.7 Sonnet); the
+upper end to frontier reasoning models (o3, DeepSeek-R1).
+
+**Previous values** (used in versions before March 2026): 0.003 and
+0.015 Wh per 1,000 tokens respectively. These were derived from
+pre-2025 estimates and are now known to be roughly 10-100x too low
+based on Google's published per-query data.
+
 ### Uncertainty
 
-These numbers are rough. The actual values depend on:
-- Model size (parameter counts for commercial models are often not public)
+The true values depend on:
+- Model size and architecture (reasoning models use chain-of-thought,
+  consuming far more tokens internally)
 - Hardware (GPU type, batch size, utilization)
 - Quantization and optimization techniques
 - Whether speculative decoding or KV-cache optimizations are used
+- Provider-specific infrastructure efficiency
 
-The true values could be 0.5x to 3x the figures used here.
+The true values could be 0.3x to 3x the midpoint figures used here.
+The variation *between models* now dominates the uncertainty: choosing
+a different model can change energy use by 70x (Jegham et al.).
 
 ## 3. Data center overhead (PUE)
 
@@ -178,9 +215,11 @@ come from the regional grid in real time.
 
 ### Calculation template
 
+Using midpoint values (0.1 Wh/1K input, 0.5 Wh/1K output):
+
 ```
-Server energy = (cumulative_input_tokens * 0.003/1000
-               + output_tokens * 0.015/1000) * PUE
+Server energy = (cumulative_input_tokens * 0.1/1000
+               + output_tokens * 0.5/1000) * PUE
 
 Server CO2 = server_energy_Wh * grid_intensity_g_per_kWh / 1000
 
@@ -193,17 +232,21 @@ Total CO2 = Server CO2 + Client CO2
 
 A conversation with 2M cumulative input tokens and 10K output tokens:
 
 ```
-Server energy = (2,000,000 * 0.003/1000 + 10,000 * 0.015/1000) * 1.2
-              = (6.0 + 0.15) * 1.2
-              = ~7.4 Wh
+Server energy = (2,000,000 * 0.1/1000 + 10,000 * 0.5/1000) * 1.2
+              = (200 + 5.0) * 1.2
+              = ~246 Wh
 
-Server CO2 = 7.4 * 350 / 1000 = ~2.6g CO2
+Server CO2 = 246 * 350 / 1000 = ~86g CO2
 
 Client CO2 = 0.5 * 56 / 1000 = ~0.03g CO2 (France)
 
-Total CO2 = ~2.6g
+Total CO2 = ~86g
 ```
 
+This is broadly consistent with the headline range of 100-250 Wh and
+30-80g CO2 for a long conversation. The previous version of this
+methodology estimated ~7.4 Wh for the same conversation, ~30x too low.
+
 ## 6. Water usage
 
 Data centers use water for evaporative cooling. Li et al. (2023), "Making