Why jazz vocalises use syllables (and classical ones don't)

Open Vaccai’s Practical Method of Italian Singing and you will find the exercises printed with no syllables — just notes, sung on a pure vowel of the singer’s choosing (typically ah or ee). Classical vocalises across the bel canto tradition (Vaccai, Concone, Sieber, Lütgen, Marchesi) all share this convention: the voice produces a sustained, vowel-only line, and the consonants are stripped out so the voice can focus on tone, breath, and pitch.

Open Bob Stoloff’s Scat! Vocal Improvisation Techniques and the convention reverses. Every exercise comes with specific syllables: doo, bah, dah, bee, dwee, dn, du-du-da. Different syllables for different rhythmic contexts. Different syllables for ascending vs descending lines. Different syllables for swing eighths vs straight eighths. The consonants are not optional — they are part of the exercise ^[1].

This is not a stylistic preference. The two traditions are doing different things with vocal production, and the difference encodes a deep claim about what singing is for. This post lays out the claim and what it implies for ear-training app design.

The classical claim: voice as pitched instrument

The classical bel canto tradition treats the voice as primarily a pitched instrument. The pedagogical goal is a sustained, even, beautiful tone across the singer’s range, and consonants are an obstacle to it. A consonant interrupts the airflow, breaks the legato, and momentarily detunes the pitch. Strip the consonants out and you can focus entirely on what the bel canto tradition cares about most: the quality of the sustained sung pitch.

This is consistent with what classical vocalises are for: training the muscular and acoustic mechanisms of pitch production, register transition, breath support, and dynamics. The exercises are pure-pitch drills with the rhythmic dimension simplified to “long, even notes” and the consonant dimension eliminated entirely.

The jazz claim: voice as rhythmic instrument

The jazz scat tradition treats the voice as primarily a rhythmic instrument that happens to also be pitched. The pedagogical goal is the production of articulated, swinging, rhythmically precise lines, and the consonants are exactly how the rhythm gets articulated ^[2].

A swing eighth-note pair, sung as doo-bah, has a clear hard consonant (the b of bah) on the off-beat eighth. That consonant is what makes the off-beat land — what gives it the percussive, slightly accented quality that defines the swing feel. Sing the same two notes on a pure vowel (aaa-aaa) and the rhythmic articulation collapses; the line feels limp and unswinging.

The same logic extends to other contexts:

Ascending lines in scat are often sung doo-bah-dee-bah-dah, with the harder consonants on the upper notes. The rising line is carried by the rising consonant intensity.
Descending lines invert: dwee-da-doo-bah-doo, with the softer consonants on the lower notes.
Triplets get triple-syllable groupings: doo-da-tee, bah-dah-doo. The middle syllable gets a softer consonant or a vowel-only attack to keep the triplet feel cohesive.
Long notes are held on whichever vowel the consonant set up: bahhhh, deee. The release of the long note often gets a closing consonant (bahhh-pp, deee-tt) to mark the end-point precisely.
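
As a rough illustration of the first three conventions above, the sketch below picks a syllable pattern from the direction of a line. The syllable patterns are the ones quoted in the text; the lookup table, the function names (`contour`, `syllabify`), the use of MIDI note numbers, and the rule of classifying a line by its first and last pitch are assumptions made for the example, not anything taken from Stoloff. Long-note closing consonants are left out.

```python
# Syllable patterns quoted from the text above. The lookup scheme itself
# is a hypothetical illustration, not a codified method.
PATTERNS = {
    "ascending": ["doo", "bah", "dee", "bah", "dah"],
    "descending": ["dwee", "da", "doo", "bah", "doo"],
    "triplet": ["doo", "da", "tee"],
}


def contour(pitches):
    """Classify a line by comparing its first and last pitch (an assumed rule)."""
    if not pitches:
        raise ValueError("need at least one pitch")
    return "ascending" if pitches[-1] >= pitches[0] else "descending"


def syllabify(pitches, triplet=False):
    """Return one syllable per pitch, cycling through the chosen pattern."""
    key = "triplet" if triplet else contour(pitches)
    pattern = PATTERNS[key]
    return [pattern[i % len(pattern)] for i in range(len(pitches))]


if __name__ == "__main__":
    # Pitches are MIDI note numbers (an assumption): C4 up to G4, then back down.
    print(syllabify([60, 62, 64, 65, 67]))
    print(syllabify([67, 65, 64, 62, 60]))
    print(syllabify([60, 62, 64], triplet=True))
```

A real singer's choices are far less mechanical than a table lookup; the point is only that the syllable depends on the rhythmic and melodic context, not on the pitch alone.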

Different jazz vocalists have characteristic syllable vocabularies — Ella Fitzgerald, Sarah Vaughan, Bobby McFerrin, Mark Murphy, and the Lambert/Hendricks/Ross trio all sound different in part because they use different syllable sets. But the underlying principle is universal: the consonants articulate the rhythm.

Why this matters cognitively

The jazz scat convention is not just stylistic. It tracks the same cross-cultural pedagogical principle covered in Speak the rhythm before you play it: vocal articulation is part of how rhythm is internalized, and vocal articulation requires consonants.

The motor-planning literature consistently finds that consonant-articulated speech recruits more of the precise timing machinery in the motor cortex than vowel-only phonation does ^[3]. A swung eighth-note pair sung doo-bah engages onset-timing precision in a way that the same two pitches on ah-ah does not. The jazz tradition stumbled onto this empirically — vocalists who used hard-consonant syllables sounded more rhythmically precise than those who didn’t, the convention spread, and a century later the cognitive science is catching up.

The same principle is why konnakol uses specific consonant syllables (ta, ka, di, mi) rather than pure vowels for its rhythmic-articulation work (see Konnakol). The takadimi system imports this convention into Western rhythm pedagogy directly (see Takadimi). The Carnatic, jazz, and rhythm-pedagogy traditions are converging on the same insight from very different directions.

The Lambert, Hendricks, and Ross test case

The 1957–1962 jazz vocal trio Lambert, Hendricks, and Ross took the consonant-articulation principle to its extreme. They set lyrics to instrumental jazz solos, the practice known as vocalese and generally credited to Eddie Jefferson, and sang those lyrics at tempo, matching the rhythmic articulation of the original instrumental performance.

What this required was a vocal performance practice in which the consonants of the lyrics fall exactly on the rhythmic positions of the original instrument’s articulations. Listen to Twisted (Annie Ross’s setting of a Wardell Gray solo) and you can hear the consonants doing what the saxophone’s tongued articulations did: every off-beat lands with a hard consonant attack, and every triplet has a clean three-syllable distribution.

This is the strongest possible demonstration that the consonants are the rhythmic articulation. Strip them out and the vocalese genre becomes impossible — you cannot match a saxophone’s tongued attacks on pure vowels ^[4].

What this implies for vocal warm-ups in jazz vs classical

A jazz singer warming up with a Vaccai pentachord on pure vowels (ahhh) is missing what their style actually requires. Two adjustments take the classical vocalise tradition and adapt it for jazz:

Sing the vocalise on jazz syllables instead of pure vowels. A 1-2-3-4-5-4-3-2-1 pentachord becomes doo-bah-dee-bah-dah-bah-dee-bah-doo (or any other syllable set). The pitch material is identical; the rhythmic articulation is added.
Practice the same vocalise with swing eighths instead of straight eighths. The classical convention is metronomic; the jazz convention requires the off-beat eighths to fall slightly later (see Swing eighths are not 2:1). Practicing with the swing feel from the start trains the muscular swing-articulation, not just the pitch-production.
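
The two adjustments can be sketched together in a few lines of Python. The scale degrees and syllables come from the example above; `swing_fraction` (where the off-beat eighth falls within the beat, with 0.5 meaning straight), its default of 0.6, and the names `SungNote`, `swing_onsets`, and `jazz_vocalise` are hypothetical choices for illustration, not measured values or an established API.

```python
from dataclasses import dataclass

# Pitch material and syllables taken from the pentachord example in the text.
PENTACHORD = [1, 2, 3, 4, 5, 4, 3, 2, 1]
SYLLABLES = ["doo", "bah", "dee", "bah", "dah", "bah", "dee", "bah", "doo"]


@dataclass
class SungNote:
    degree: int         # scale degree, 1 to 5
    syllable: str       # scat syllable sung on this note
    onset_beats: float  # onset in quarter-note beats from the start


def swing_onsets(n_notes, swing_fraction=0.6):
    """Onsets for a run of eighth notes.

    swing_fraction is the position of the off-beat eighth inside the beat:
    0.5 is straight, larger values delay the off-beat. The default of 0.6
    is an arbitrary illustrative value.
    """
    if not 0.5 <= swing_fraction < 1.0:
        raise ValueError("swing_fraction must be in [0.5, 1.0)")
    return [i // 2 + (swing_fraction if i % 2 else 0.0) for i in range(n_notes)]


def jazz_vocalise(degrees, syllables, swing_fraction=0.6):
    """Pair each scale degree with a syllable and a swung onset time."""
    if len(degrees) != len(syllables):
        raise ValueError("need exactly one syllable per note")
    onsets = swing_onsets(len(degrees), swing_fraction)
    return [SungNote(d, s, t) for d, s, t in zip(degrees, syllables, onsets)]


if __name__ == "__main__":
    for note in jazz_vocalise(PENTACHORD, SYLLABLES):
        print(f"beat {note.onset_beats:.2f}: degree {note.degree} on '{note.syllable}'")
```

With `swing_fraction=0.5` the same function yields the straight-eighth version, so an ear-training app could move between the two feels by changing one parameter.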

These two adjustments produce the kind of cross-tradition vocalise that Stoloff codified, and the resulting warm-ups train both classical pitch production and jazz rhythmic articulation in the same exercise. The only real cost of adopting them is unfamiliarity.

The classical and jazz traditions both have valid theories of what vocal production is for. The classical theory — voice as pitched instrument — produces beautiful sustained tone. The jazz theory — voice as rhythmic instrument — produces swinging articulated lines. Neither is wrong. But for a singer or improviser working in the jazz tradition, the syllables are not decoration. They are the technique.

References

[1] Stoloff, B. (1996). Scat! Vocal Improvisation Techniques. Gerard and Sarzin Publishing. ISBN 9780962846755. The syllable-articulation conventions of the scat tradition are laid out in detail in the opening chapters; specific syllables for ascending vs descending lines, swing vs straight eighths, etc., are catalogued throughout.
[2] Stoloff was Professor and Assistant Chair in the Voice Department at Berklee for 28 years; the consonant-articulation principle is the structural innovation distinguishing his pedagogy from classical vocalise tradition.
[3] For the broader literature on consonant articulation and motor-cortex timing precision, see: Brown, S., Martinez, M. J., & Parsons, L. M. (2006). Music and language side by side in the brain: A PET study of the generation of melodies and sentences. European Journal of Neuroscience, 23(10). The shared motor-planning resources between speech and music production are well-documented, with consonant-articulated speech engaging timing-precision systems more than pure phonation does.
[4] For the Lambert, Hendricks, and Ross vocalese tradition, see Hendricks, J. (autobiography references) and the trio’s recordings on Sing a Song of Basie (1957) and The Hottest New Group in Jazz (1959–1961). The technical achievement of matching saxophone tongued articulations on consonant-articulated vocal lines is the strongest demonstration that consonants encode rhythmic articulation in vocal performance.