01 Overview
Track scores and show scores come from two separate formulas. Both land on a 0–100 scale, both are normalized so an average performance sits near 50, and both are recalculated nightly from the latest data.
Tracks pull from ten components: four community signals, five audio/data signals, and one review signal. Together they are weighted roughly 50/50 between community and audio. Shows pull from six. Each component is computed independently, normalized to 0–100, then blended with a fixed weight. We never adjust scores by hand and we never ship weights that aren't printed somewhere on the site.
Every component is visible.
02 Track score
A track score is a single 0–100 number that blends how the community has talked about a performance with what the audio itself sounds like. It's computed for every Phish track on phish.in (currently around 36,000 of them). The breakdown:
- Popularity · 11%
- Consensus · 10%
- Jam Chart · 12%
- Recognition · 12%
- Energy · 12%
- Groove · 12%
- Harmonic · 9%
- Duration · 9%
- Volatility · 8%
- Review · 5%
The weights sum to 100. The composite is what shows on track pages and drives the leaderboards. The breakdown is stored as performance_score_breakdown JSON on every track row, so any client can pull the constituent numbers.
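For concreteness, here's a minimal sketch of the blend in Python, assuming the breakdown JSON stores each component already normalized to 0–100. The key names below are illustrative; only performance_score_breakdown itself is the real column.

```python
# Track-score weights as published (v3.3); they sum to 100.
TRACK_WEIGHTS = {
    "popularity": 11, "consensus": 10, "jam_chart": 12, "recognition": 12,
    "energy": 12, "groove": 12, "harmonic": 9, "duration": 9,
    "volatility": 8, "review": 5,
}

def composite_track_score(breakdown: dict[str, float]) -> float:
    """Blend per-component 0-100 scores into the single track score."""
    assert sum(TRACK_WEIGHTS.values()) == 100
    return sum(TRACK_WEIGHTS[k] * breakdown[k] for k in TRACK_WEIGHTS) / 100
```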
03 The ten components
Each one is a deliberate signal. Here's what each is measuring, why it's weighted the way it is, and where the data comes from.
Popularity (11%) · community signal
Per-track “liked” counts pulled from phish.in. The single strongest predictor of which versions fans return to over the years. Heavily weighted because it aggregates decades of independent listener attention.
Consensus (10%) · community signal
Cross-source agreement: when phish.in likes, phish.net jam-chart inclusion, and aggregated review sentiment all point the same way, this component lights up. A version that all three sources call great gets credit beyond what any single one would give it.
Jam Chart (12%) · community signal
Inclusion (and tag richness) on the official phish.net jam charts. Type II, bliss, dark, ambient, and funk tags all count; multi-tag entries score higher.
Recognition (12%) · community signal
How often a version is cited in setlist reviews, retrospectives, and tape lists relative to other versions of the same song. A small but persistent signal that a version “has a name.”
Energy (12%) · audio signal
Dynamic range and energy build inferred from librosa analysis of the raw MP3. We measure the spread between quiet and loud passages and how the energy envelope ramps over the track's length. Patient builds score higher than flat or jagged ones.
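The exact extraction isn't published beyond what's described above, but the measurement is roughly this shape. A sketch, assuming an RMS energy envelope from librosa; the percentile spread and linear-fit slope are illustrative choices:

```python
import librosa
import numpy as np

def energy_features(path: str) -> tuple[float, float]:
    """Return (dynamic spread, build slope) from the RMS energy envelope."""
    y, sr = librosa.load(path, sr=None, mono=True)
    rms = librosa.feature.rms(y=y)[0]          # frame-wise energy envelope
    # Spread between quiet and loud passages, robust to outlier frames.
    spread = float(np.percentile(rms, 95) - np.percentile(rms, 5))
    # Slope of a linear fit over time: positive = the jam builds.
    t = np.arange(len(rms))
    slope = float(np.polyfit(t, rms, 1)[0])
    return spread, slope
```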
Groove (12%) · audio signal
Rhythmic stability, measured via energy_std (the variability of the energy envelope): how steadily the band locks into a pocket. The transform isn't linear; it follows 100 · exp(-0.18 · energy_std^1.3), which sharply rewards genuinely locked passages over merely uneven ones.
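The transform is small enough to transcribe directly:

```python
import math

def groove_score(energy_std: float) -> float:
    """100 * exp(-0.18 * energy_std**1.3): lower variance -> higher groove."""
    return 100.0 * math.exp(-0.18 * energy_std ** 1.3)
```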
Harmonic (9%) · audio signal
Spectral coherence and harmonic-to-percussive ratio. Versions with a clear harmonic spine (bliss jams, melodic peaks) score higher than dissonant or texture-only stretches.
Duration (9%) · data signal
How long this version is relative to the song's own distribution, expressed as a z-score and clamped to [0, 3]. A longer-than-average Tweezer scores higher than the median; a four-minute Bouncing Around the Room can't score higher than other Bouncings on this axis. Per-song normalization keeps short composed songs from being structurally penalized.
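A sketch of the per-song normalization described above; the final mapping from the clamped z-score onto the 0–100 component scale is an assumption:

```python
import statistics

def duration_component(seconds: float, all_versions: list[float]) -> float:
    """Z-score of this version's length within its own song's distribution,
    clamped to [0, 3], then mapped onto the 0-100 component scale."""
    mu = statistics.mean(all_versions)
    sigma = statistics.pstdev(all_versions) or 1.0  # guard one-version songs
    z = max(0.0, min(3.0, (seconds - mu) / sigma))
    return z / 3.0 * 100.0
```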
Volatility (8%) · audio signal
Direction changes per minute in the energy envelope during the jam. Versions that swing repeatedly between quiet and loud score higher than ones that ride a single intensity. Captures the “exploration” quality of a jam without requiring it to also be long; a 12-minute version with five distinct movements outscores a 25-minute drone.
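A sketch of the counting, assuming the same RMS envelope as the Energy component; the smoothing window is an illustrative choice to keep frame-level noise from registering as direction changes:

```python
import numpy as np

def direction_changes_per_minute(rms: np.ndarray, duration_s: float) -> float:
    """Count sign flips in the smoothed energy envelope's first difference."""
    smooth = np.convolve(rms, np.ones(9) / 9, mode="valid")  # light smoothing
    signs = np.sign(np.diff(smooth))
    flips = int(np.sum(signs[1:] != signs[:-1]))
    return flips / (duration_s / 60.0)
```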
Review (5%) · review signal
Per-show review sentiment summarized by a small Claude Haiku model from the public phish.net reviews, then aggregated to the track level by mapping each review's comments onto the segments of the show it is actually responding to.
04 Show score
Shows have their own formula: six components rather than ten, split roughly 47/53 between community signals (rating, likes, jam charts) and setlist + audio signals (segues, depth, average track score). Average track score is the largest single weight at 27%, but no individual signal dominates. Listed below by weight (the blend itself is sketched after the list):
- Avg track score · 27%. Mean of the show's track scores; folds the audio signal into the show level.
- Phish.net rating · 25%. How fans rated the show on phish.net's 1–5 scale, scaled to 0–100.
- Segues · 13%. Density of segues (→) and song pairings; rewards tightly played sets.
- Set depth · 13%. Length and substance of the longest set / signature jam vehicle.
- Phish.in likes · 12%. Per-track “liked” counts on phish.in, averaged across the show.
- Jam charts · 10%. How many tracks from the show appear on the official jam charts.
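The blend follows the same pattern as tracks; a minimal sketch, with key names mirroring (but not guaranteed to match) the show_score_breakdown JSON mentioned below:

```python
# Show-score weights as published (v3.3); they sum to 100.
SHOW_WEIGHTS = {
    "avg_track_score": 27, "phishnet_rating": 25, "segues": 13,
    "set_depth": 13, "phishin_likes": 12, "jam_charts": 10,
}

def composite_show_score(breakdown: dict[str, float]) -> float:
    """Blend per-component 0-100 scores into the single show score."""
    return sum(SHOW_WEIGHTS[k] * breakdown[k] for k in SHOW_WEIGHTS) / 100
```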
Stored on the shows table as show_score with the breakdown in show_score_breakdown JSON. Recomputed whenever a track score changes, since the “avg track score” component depends on the underlying audio numbers.
05 Pipeline
Setlists, jam charts, song metadata, and community reviews come from phish.net. Audio files come from phish.in and are analyzed with librosa v0.10.2: bpm, key, mode, dynamic range, onset density, harmonic ratio, spectral centroid, energy variance, and a peak-moment estimator.
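A sketch of what that analysis pass looks like in librosa 0.10. The exact parameters aren't published, and the key/mode and peak-moment estimators are omitted here; treat the feature names as illustrative:

```python
import librosa
import numpy as np

def analyze(path: str) -> dict[str, float]:
    """Extract a subset of the per-track audio features described above."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)    # bpm estimate
    y_harm, _ = librosa.effects.hpss(y)               # harmonic/percussive split
    rms = librosa.feature.rms(y=y)[0]                 # frame-wise energy
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    duration_s = len(y) / sr
    return {
        "bpm": float(tempo),
        "harmonic_ratio": float(np.sum(y_harm**2) / (np.sum(y**2) + 1e-9)),
        "spectral_centroid": float(
            np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))
        ),
        "energy_variance": float(np.var(rms)),
        "onset_density": len(onsets) / duration_s * 60.0,  # onsets per minute
    }
```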
Sentiment summaries of reviews are generated per-show by a small Claude Haiku model. Scoring runs nightly via GitHub Actions; new tracks are picked up as phish.in publishes them, typically within 48 hours of the show.
Manual corrections live in the score_overrides table and only ever apply to broken inputs (e.g., a track tagged with the wrong song id). We never override a score because we disagree with what came out of the formula.
06 What we don't measure
A few things we tried and then pulled from the site:
- Key detection. librosa got it right less than 20% of the time on live recordings (especially audience tapes), so it's no longer surfaced anywhere in the UI.
- Raw dynamic range as a UI label. Audience recordings make the absolute number meaningless; a quiet jam recorded from the lawn can read the same as a flat soundboard. The underlying value still feeds the Energy component but isn't shown standalone.
- Per-segment jam scores. An earlier attempt to score sub-sections of a jam (intro / build / peak) was too noisy to publish; the structured tags from the community jam-chart project carry that meaning more reliably.
07 Treating new content fairly
Three of the four big community signals (jam-chart inclusion, recognition, consensus mentions) are conditional on community attention having already happened. A track played last week can't possibly be jam-charted yet, can't have phish.in likes, can't have appeared in best-of threads. Earlier versions of the formula scored those zeros literally, and because community signals dominated the weight, brand-new tracks routinely scored in single digits.
That wasn't honest. The system was saying “this performed badly” when all it actually knew was “we don't have data yet.” v3.3 fixes this in two ways:
- Bayesian shrinkage. For tracks (and shows) under 365 days old, missing community signals fall back to a low baseline calibrated to the population mean for that signal (around 25 for tracks, 40 for shows) instead of 0. Old tracks with no community traction still score 0 on those components; at that point their absence really is signal. New tracks get the benefit of the doubt instead (see the sketch after this list).
- Rebalance. Track formula moved from 80% community / 20% audio to a 50/50 split. Show formula moved from 65% community / 35% setlist+audio to roughly 47/53. Audio and setlist signals, which we have for any new content the moment it's analyzed, now carry their fair share of the weight.
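A minimal sketch of the fallback rule. The baselines and the 365-day cutoff come from the text above; the function shape is an assumption, and full shrinkage would blend observed values toward the prior rather than only filling gaps:

```python
# Baselines from the text (v3.3): ~25 for tracks, ~40 for shows.
TRACK_BASELINE, SHOW_BASELINE = 25.0, 40.0

def community_component(value: float | None, age_days: int,
                        baseline: float) -> float:
    """Missing community signals fall back to a calibrated baseline for
    content under a year old; for older content, absence is real signal."""
    if value is not None:
        return value
    return baseline if age_days < 365 else 0.0
```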
Net effect: a brand-new track with strong audio characteristics now lands in the 60–80 range (was 5–15). All-time classics still sit in the mid-90s because their community signals are real, not neutral defaults. The leaderboard ordering at the top is unchanged; the bottom of new-show pages no longer reads as a row of zeros.
08 Versioning
The current weights are version v3.3, recalibrated 2026-05-13. Every material weight change bumps the version and triggers a nightly rescore of the entire catalog. The version stamp appears at the bottom of every score panel on the site, so a screenshot is always self-dated.
If you ever need to cite a specific version in writing, the stamp is the citation. We don't silently re-tune weights; recalibrations are public, dated, and explained.
Have a methodology question or spotted a scoring oddity? Reach out.