# Plan: Quantify social, epistemic, and political costs

**Target sub-goals**: 2 (measure impact), 6 (improve methodology)

## Problem

Our differentiator is the taxonomy of non-environmental costs (social,
epistemic, political). But the toolkit only tracks environmental and
financial metrics. Until we can produce numbers, even rough proxies,
for the other dimensions, the taxonomy remains a document, not a tool.

The confidence summary (Section 19) marks 13 of 22 categories as
"Unquantifiable." Some genuinely resist quantification. But for others,
we can define measurable proxies that capture *something* meaningful per
conversation, even if imperfect.

## Design principle

Not everything needs a number. The goal is to move categories from
"unquantifiable" to "rough proxy available" where honest proxies exist,
and to explicitly mark categories where quantification would be
dishonest. A bad number is worse than no number.

## Category-by-category analysis

### Feasible: per-conversation proxies exist

#### 1. Cognitive deskilling (Section 10)

**Proxy: Automation ratio**
- Measure: model output tokens as a fraction of all conversation tokens.
  A conversation where the AI writes 95% of the code carries higher
  deskilling risk than one where the user writes code and asks for review.
- Formula: `deskilling_risk = output_tokens / (output_tokens + user_tokens)`
- Range: 0 (pure teaching) to 1 (pure delegation)
- Can be computed from the transcript by the existing hook.
- Calibration: weight by task type if detectable (e.g., "explain" vs
  "write" vs "fix").
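As a sketch, the ratio could be computed like this. The transcript format is assumed (hypothetically) to be JSONL with `role` and `tokens` fields per line; the toolkit's actual schema may differ.

```python
import json

def automation_ratio(transcript_path):
    """Compute deskilling_risk = output_tokens / (output_tokens + user_tokens).

    Assumes each transcript line is a JSON object with hypothetical
    "role" ("user" or "assistant") and "tokens" fields.
    """
    output_tokens = user_tokens = 0
    with open(transcript_path) as f:
        for line in f:
            entry = json.loads(line)
            if entry["role"] == "assistant":
                output_tokens += entry["tokens"]
            elif entry["role"] == "user":
                user_tokens += entry["tokens"]
    total = output_tokens + user_tokens
    # An empty transcript yields 0.0 rather than a division error.
    return output_tokens / total if total else 0.0
```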

**Proxy: Review signal**
- Did the user modify the AI's output before committing? If the hook can
  detect git diffs between AI-generated code and committed code, the
  delta indicates human review effort. High delta = more engagement =
  less deskilling risk.
- Requires: post-commit hook comparing AI output to committed diff.

**Confidence: Low but measurable.** The ratio is crude; a user who
delegates wisely is not deskilling. But it's directionally useful.

#### 2. Code quality degradation (Section 12)

**Proxy: Defect signal**
- Track whether tests pass after AI-generated changes. The hook could
  record: (a) did the conversation include test runs? (b) did tests fail
  after AI changes? (c) how many retry cycles occurred?
- Formula: `quality_risk = failed_test_runs / total_test_runs`
- Can be extracted from tool-call results in the transcript.

**Proxy: Churn rate**
- How many times was the same file edited in the conversation? High
  churn = the AI got it wrong repeatedly.
- Formula: `churn = total_file_edits / unique_files_edited`
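A minimal sketch of the churn formula, assuming tool calls are available as dicts with hypothetical `tool` and `file_path` keys:

```python
from collections import Counter

def churn_rate(tool_calls):
    """Compute churn = total_file_edits / unique_files_edited.

    `tool_calls` is a list of dicts with hypothetical "tool" and
    "file_path" keys; only Edit/Write calls count as file edits.
    """
    edits = Counter(
        call["file_path"]
        for call in tool_calls
        if call["tool"] in ("Edit", "Write")
    )
    if not edits:
        return 0.0
    return sum(edits.values()) / len(edits)
```

A churn of 1.0 means every touched file was edited exactly once; higher values indicate repeated rework.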

**Confidence: Medium.** Test failures are a real signal. Churn is
noisier (some tasks legitimately require iterative editing).

#### 3. Data pollution risk (Section 13)

**Proxy: Publication exposure**
- Is the output likely to enter public corpora? Detect if the
  conversation involves: git push to a public repo, writing
  documentation, creating blog posts, Stack Overflow answers.
- Formula: binary flag `public_output = true/false`, or estimate
  `pollution_tokens = output_tokens_in_public_artifacts`
- Can be detected from tool calls (git push, file writes to known
  public paths).
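The binary flag could be sketched as a scan over tool calls. The `tool`/`command` field names are assumptions, not the toolkit's actual schema:

```python
import re

# Hypothetical shape: each tool call is a dict with "tool" and "command" keys.
PUSH_RE = re.compile(r"\bgit\s+push\b")

def public_output_flag(tool_calls):
    """True if any Bash tool call ran `git push` during the session.

    Only catches pushes made inside the conversation; pushes made later,
    outside the session, are invisible to the hook.
    """
    return any(
        call.get("tool") == "Bash" and PUSH_RE.search(call.get("command", ""))
        for call in tool_calls
    )
```

Whether the target repo is actually public is not knowable from the command alone; this flags potential exposure, not confirmed publication.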

**Confidence: Low.** Many paths to publication are undetectable. But
flagging known public pushes is better than nothing.

#### 4. Monoculture risk (Section 15)

**Proxy: Provider concentration**
- Log which model and provider was used. Over time, the impact log
  builds a picture of single-provider dependency.
- Formula: `monoculture_index = sessions_with_dominant_provider / total_sessions`
- Per-session: just log the model ID. The aggregate metric is computed
  across sessions.
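Per-session logging only needs a model-ID-to-provider mapping. The prefix table below is purely illustrative; the model ID itself comes from API metadata, as noted above:

```python
# Illustrative prefix -> provider table; extend as providers are encountered.
PROVIDER_PREFIXES = {
    "claude": "Anthropic",
    "gpt": "OpenAI",
    "gemini": "Google",
}

def provider_from_model_id(model_id):
    """Map a model ID string to a provider name for the session log."""
    for prefix, provider in PROVIDER_PREFIXES.items():
        if model_id.lower().startswith(prefix):
            return provider
    return "unknown"
```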

**Confidence: Medium.** Simple to measure, meaningful at portfolio level.

#### 5. Annotation labor (Section 10)

**Proxy: Token-proportional RLHF demand**
- Each conversation generates training signal (thumbs up/down, edits,
  preference data). More tokens = more potential training data = more
  annotation demand.
- Formula: `rlhf_demand_proxy = output_tokens * annotation_rate`,
  where `annotation_rate` is estimated from published RLHF dataset
  sizes vs. total conversation volume.
- Very rough, but it makes the connection between "my conversation" and
  "someone rates this output" concrete.

**Confidence: Very low.** The `annotation_rate` is unknown. But even an
order-of-magnitude estimate names the cost.

#### 6. Creative displacement (Section 16)

**Proxy: Substitution type**
- Classify the conversation by what human role it substitutes: code
  writing, code review, documentation, research, design.
- Formula: categorical label, not a number. But the label enables
  aggregation: "60% of my AI usage substitutes for junior developer
  work."
- Can be inferred from tool calls (Write/Edit = code writing, Grep/Read
  = research, etc.).
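A naive classifier along those lines could look like the sketch below. The tool-to-type mapping is illustrative; real classification would need more signals than tool names alone:

```python
from collections import Counter

# Illustrative tool -> substitution-type mapping, per the bullets above.
TOOL_TO_TYPE = {
    "Write": "code writing",
    "Edit": "code writing",
    "Grep": "research",
    "Read": "research",
}

def substitution_type(tool_calls):
    """Label the session by the most frequent substituted role."""
    counts = Counter(
        TOOL_TO_TYPE[c["tool"]]
        for c in tool_calls
        if c["tool"] in TOOL_TO_TYPE
    )
    if not counts:
        return "unclassified"
    return counts.most_common(1)[0][0]
```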

**Confidence: Low.** Classification is fuzzy. But naming what was
displaced is better than ignoring it.

#### 7. Power concentration (Section 11)

**Proxy: Spend concentration**
- Financial cost is already tracked. Aggregating by provider shows how
  much money flows to each company.
- Formula: `provider_share = spend_with_provider / total_ai_spend`
- Trivial to compute from existing data. The interpretation is what
  matters: "I sent $X to Anthropic this month."

**Confidence: High for the number, low for what it means.**

#### 8. Content filtering opacity (Section 11)

**Proxy: Block count**
- Count how many responses were blocked by content filtering during
  the conversation.
- Formula: `filter_blocks = count(blocked_responses)`
- Can be detected from error messages in the transcript.

**Confidence: High.** Easy to measure. Interpretation is subjective.

### Infeasible: honest quantification not possible per-conversation

#### 9. Linguistic homogenization (Section 10)
- Could log conversation language, but the per-conversation contribution
  to language endangerment is genuinely unattributable. A counter
  ("this conversation was in English") is factual but not a meaningful
  cost metric. **Keep qualitative.**

#### 10. Geopolitical resource competition (Section 11)
- No per-conversation proxy exists. The connection between one API call
  and semiconductor export controls is real but too diffuse to measure.
  **Keep qualitative.**

#### 11. Mental health effects (Section 18)
- Would require user self-report. No passive measurement is honest.
  **Keep qualitative unless user opts into self-assessment.**

#### 12. Scientific integrity contamination (Section 14)
- Overlaps with data pollution (proxy #3 above). The additional risk
  (AI in research methodology) is context-dependent and cannot be
  detected from the conversation alone. **Keep qualitative.**

## Implementation plan

### Phase 1: Low-hanging fruit (extend existing hook)

Modify `pre-compact-snapshot.sh` to extract from the transcript:

1. **Automation ratio**: output_tokens / (output_tokens + user_input_tokens)
2. **Model ID**: already available from API metadata
3. **Test pass/fail counts**: parse tool call results for test outcomes
4. **File churn**: count Edit/Write tool calls per unique file
5. **Public push flag**: detect `git push` in tool calls

Add these fields to the JSONL log alongside existing metrics.
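Sketched as a Python dict, an extended log record might look like the following. The field names for the new metrics are placeholders, not a fixed schema, and the existing environmental/financial fields are elided:

```python
import json

# Illustrative extended session record; field names are placeholders.
record = {
    "model_id": "claude-3-5-sonnet",  # metric 2: model ID
    "automation_ratio": 0.82,         # metric 1
    "tests_passed": 7,                # metric 3
    "tests_failed": 2,                # metric 3
    "file_churn": 1.5,                # metric 4
    "public_push": False,             # metric 5
}

# Append one JSON line per session to the existing JSONL log.
line = json.dumps(record)
```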

Estimated effort: extend the existing Python/bash parsing, ~100 lines.

### Phase 2: Post-conversation signals

Add an optional post-commit hook:

6. **Review delta**: compare AI-generated code (from transcript) with
   actual committed code. Measures human review effort.

Estimated effort: new hook, ~50 lines. Requires git integration.

### Phase 3: Aggregate metrics

Build a dashboard script (extend `show-impact.sh`) that computes
portfolio-level metrics across sessions:

7. **Monoculture index**: provider concentration over time
8. **Spend concentration**: cumulative $ per provider
9. **Displacement profile**: % of sessions by substitution type
10. **RLHF demand estimate**: cumulative annotation labor proxy
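The aggregation could be sketched as follows, assuming each logged session record carries hypothetical `provider` and `cost_usd` fields:

```python
from collections import Counter

def portfolio_metrics(sessions):
    """Aggregate per-session records into portfolio-level metrics.

    `sessions` is a list of dicts with hypothetical "provider" and
    "cost_usd" fields, one dict per logged session.
    """
    providers = Counter(s["provider"] for s in sessions)
    spend = Counter()
    for s in sessions:
        spend[s["provider"]] += s["cost_usd"]
    dominant, dominant_count = providers.most_common(1)[0]
    total_spend = sum(spend.values())
    return {
        "monoculture_index": dominant_count / len(sessions),
        "dominant_provider": dominant,
        "spend_share": {p: v / total_spend for p, v in spend.items()},
    }
```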

### Phase 4: Methodology update

Update `impact-methodology.md` Section 19 confidence summary:
- Move categories with proxies from "Unquantifiable" to "Proxy available"
- Document each proxy's limitations honestly

Update `impact-toolkit/README.md` to accurately describe what the
toolkit measures.

## What this does NOT do

- It does not make the unquantifiable quantifiable. Some costs remain
  qualitative by design.
- It does not produce a single "social cost score." Collapsing
  incommensurable harms into one number would be dishonest.
- It does not claim precision. Every proxy is explicitly labeled with
  its confidence and failure modes.

## Success criteria

- The toolkit reports at least 5 non-environmental metrics per session.
- Each metric has documented limitations in the methodology.
- The confidence summary has fewer "Unquantifiable" entries.
- No metric is misleading: a proxy that doesn't work is removed, not
  kept for show.

## Risks

- **Goodhart's law**: Once measured, users may optimize for the metric
  rather than the underlying cost (e.g., adding fake user tokens to
  lower the automation ratio). Mitigate by documenting that proxies are
  indicators, not targets.
- **False precision**: Numbers create an illusion of understanding.
  Mitigate by always showing confidence levels alongside values.
- **Scope creep**: Trying to measure everything dilutes the toolkit's
  usability. Start with Phase 1 only, and evaluate before proceeding.