Content-Based Classification, Search, and Retrieval of Audio

Erling Wold, Thom Blum, Douglas Keislar, and James Wheaton (Muscle Fish)
IEEE MultiMedia, Vol. 3, No. 3, pp. 27-36, September 1996
Abstract
Many audio and multimedia applications would benefit from the ability to classify and search for audio based on its characteristics. The audio analysis, search, and classification engine described here reduces sounds to perceptual and acoustical features. This lets users search or retrieve sounds by any one feature or a combination of them, by specifying previously learned classes based on these features, or by selecting or entering reference sounds and asking the engine to retrieve similar or dissimilar sounds.


The rapid increase in speed and capacity of computers and networks has allowed the inclusion of audio as a data type in many modern computer applications. However, the audio is usually treated as an opaque collection of bytes with only the most primitive fields attached: name, file format, sampling rate, and so on. Users accustomed to searching, scanning, and retrieving text data can be frustrated by the inability to look inside the audio objects.
Multimedia databases or file systems, for example, can easily have thousands of audio recordings. These could be anything from a library of sound effects to the soundtrack portion of a news footage archive. Such libraries are often poorly indexed or named to begin with. Even if a previous user has assigned keywords or indices to the data, these are often highly subjective and may be useless to another person. Searching for a particular sound or class of sound (such as applause, music, or the speech of a particular speaker) can be a daunting task.
How might people want to access sounds? We believe there are several useful methods, all of which we have attempted to incorporate into our system.
- Simile: saying one sound is like another sound or a group of sounds in terms of some characteristics. For example, "like the sound of a herd of elephants." A simpler example would be to say that it belongs to the class of speech sounds or the class of applause sounds, where the system has previously been trained on other sounds in this class.
- Acoustical/perceptual features: describing the sounds in terms of commonly understood physical characteristics such as brightness, pitch, and loudness.
- Subjective features: describing the sounds using personal descriptive language. This requires training the system (in our case, by example) to understand the meaning of these descriptive terms. For example, a user might be looking for a "shimmering" sound.
- Onomatopoeia: making a sound similar in some quality to the sound you are looking for. For example, the user could make a buzzing sound to find bees or electrical hum.
In a retrieval application, all of the above could
be used in combination with traditional keyword
and text queries.
To accomplish any of the above methods, we
first reduce the sound to a small set of parameters
using various analysis techniques. Second, we use
statistical techniques over the parameter space to
accomplish the classification and retrieval.
Previous research
Sounds are traditionally described by their pitch, loudness, duration, and timbre. The first three of these psychological percepts are well understood and can be accurately modeled by measurable acoustic features. Timbre, on the other hand, is an ill-defined attribute that encompasses all the distinctive qualities of a sound other than its pitch, loudness, and duration. The effort to discover the components of timbre underlies much of the previous psychoacoustic research that is relevant to content-based audio retrieval.1
Salient components of timbre include the amplitude envelope, harmonicity, and spectral envelope. The attack portions of a tone are often essential for identifying the timbre. Timbres with similar spectral energy distributions (as measured by the centroid of the spectrum) tend to be judged as perceptually similar. However, research has shown that the time-varying spectrum of a single musical instrument tone cannot generally be treated as a fingerprint identifying the instrument, because there is too much variation across the instrument's range of pitches and across its range of dynamic levels.
Various researchers have discussed or prototyped algorithms capable of extracting audio structure from a sound.2 The goal was to allow queries such as "find the first occurrence of the note G-sharp." These algorithms were tuned to specific musical constructs and were not appropriate for all sounds.
Other researchers have focused on indexing
audio databases using neural nets.3 Although they
have had some success with their method, there
are several problems from our point of view. For
example, while the neural nets report similarities
between sounds, it is very hard to look inside a
net after it is trained or while it is in operation to
determine how well the training worked or what
aspects of the sounds are similar to each other.
This makes it difficult for the user to specify which
features of the sound are important and which to
ignore.
Analysis and retrieval engine
Here we present a general paradigm and specific techniques for analyzing audio signals in a way that facilitates content-based retrieval. Content-based retrieval of audio can mean a variety of things. At the lowest level, a user could retrieve a sound by specifying the exact numbers in an excerpt of the sound's sampled data. This is analogous to an exact text search and is just as simple to implement in the audio domain.
At the next higher level of abstraction, the retrieval would match any sound containing the given excerpt, regardless of the data's sample rate, quantization, compression, and so on. This is analogous to a fuzzy text search and can be implemented using correlation techniques. At the next level, the query might involve acoustic features that can be directly measured and perceptual (subjective) properties of the sound.4,5 Above this, one can ask for speech content or musical content.
It is this level of sound, the acoustic and perceptual properties, with which we are most concerned here. Some of the aural (perceptual) properties of a sound, such as pitch, loudness, and brightness, correspond closely to measurable features of the audio signal, making it logical to provide fields for these properties in the audio database record. However, other aural properties ("scratchiness," for instance) are more indirectly related to easily measured acoustical features of the sound. Some of these properties may even have different meanings for different users.
We first measure a variety of acoustical features of each sound. This set of N features is represented as an N-vector. In text databases, the resolution of queries typically requires matching and comparing strings. In an audio database, we would like to match and compare the aural properties as described above. For example, we would like to ask for all the sounds similar to a given sound or that have more or less of a given property. To guarantee that this is possible, sounds that differ in the aural property should map to different regions of the N-space. If this were not satisfied, the database could not distinguish between sounds with different values for this property. Note that this approach is similar to the feature-vector approach currently used in content-based retrieval of images, although the actual features used are very different.6
Since we cannot know the complete list of
aural properties that users might wish to specify,
it is impossible to guarantee that our choice of
acoustical features will meet these constraints.
However, we can make sure that we meet these
constraints for many useful aural properties.
Acoustical features
We can currently analyze the following aspects
of sound: loudness, pitch, brightness, bandwidth,
and harmonicity.
Loudness is approximated by the signal's root-mean-square (RMS) level in decibels, which is calculated by taking a series of windowed frames of the sound and computing the square root of the sum of the squares of the windowed sample values. (This method does not account for the frequency response of the human ear; if desired, the necessary equalization can be added by applying the Fletcher-Munson equal-loudness contours.) The human ear can hear over a 120-decibel range. Our software produces estimates over a 100-decibel range from 16-bit audio recordings.
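As a rough illustration of this measurement, the sketch below computes a per-frame RMS level in decibels. The frame size, hop size, window choice, and dB floor are illustrative assumptions rather than the parameters used by the authors.

```python
import numpy as np

def loudness_db(x, frame=1024, hop=512, floor_db=-100.0):
    """Per-frame RMS level in dB for a mono signal x scaled to [-1, 1]."""
    w = np.hanning(frame)
    levels = []
    for start in range(0, len(x) - frame + 1, hop):
        seg = x[start:start + frame] * w
        rms = np.sqrt(np.mean(seg ** 2))                 # root-mean-square of the windowed frame
        levels.append(20.0 * np.log10(max(rms, 1e-10)))  # clamp to avoid log(0)
    return np.maximum(np.array(levels), floor_db)
```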
Pitch is estimated by taking a series of short-time Fourier spectra. For each of these frames, the frequencies and amplitudes of the peaks are measured, and an approximate greatest common divisor algorithm is used to calculate an estimate of the pitch. We store the pitch as a log frequency. The pitch algorithm also returns a pitch confidence value that can be used to weight the pitch in later calculations. A perfect young human ear can hear frequencies in the 20-Hz to 20-kHz range. Our software can measure pitches in the range of 50 Hz to about 10 kHz.
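The paper does not spell out the peak-picking or approximate-GCD procedure, so the following is only one plausible reading of it: pick the strongest spectral peaks, then search a grid of candidate fundamentals for the one whose harmonic series best explains those peaks. The function name, grid resolution, and confidence formula are invented for illustration.

```python
import numpy as np

def pitch_estimate(frame, sr, n_peaks=8, fmin=50.0, fmax=10000.0):
    """Rough per-frame pitch: find spectral peaks, then pick the fundamental
    whose harmonic grid best fits them (an 'approximate GCD' of the peaks)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # crude peak picking: local maxima, keep the n_peaks strongest
    peaks = [i for i in range(1, len(spec) - 1)
             if spec[i] > spec[i - 1] and spec[i] > spec[i + 1]]
    peaks = sorted(peaks, key=lambda i: spec[i], reverse=True)[:n_peaks]
    if not peaks:
        return 0.0, 0.0                                   # no peaks: no pitch, zero confidence
    pf, pa = freqs[peaks], spec[peaks]
    best_f0, best_err = fmin, np.inf
    for f0 in np.geomspace(fmin, fmax, 400):              # candidate fundamentals
        harm = np.round(pf / f0)
        harm[harm < 1] = 1
        err = np.sum(pa * np.abs(pf - harm * f0)) / (f0 * np.sum(pa))
        if err < best_err:
            best_f0, best_err = f0, err
    # Note: subharmonics can fit nearly as well as the true pitch; a real
    # implementation needs extra disambiguation.
    confidence = 1.0 / (1.0 + best_err)                   # ad hoc confidence in (0, 1]
    return np.log2(best_f0), confidence                   # pitch stored as log frequency
```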
Brightness is computed as the centroid of the short-time Fourier magnitude spectra, again stored as a log frequency. It is a measure of the higher frequency content of the signal. As an example, putting your hand over your mouth as you speak reduces the brightness of the speech sound as well as the loudness. This feature varies over the same range as the pitch, although it can't be less than the pitch estimate at any given instant.
Bandwidth is computed as the magnitude-weighted average of the differences between the spectral components and the centroid. As examples, a single sine wave has a bandwidth of zero and ideal white noise has an infinite bandwidth.
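A compact sketch of these two per-frame measurements, under the assumption that both are taken from the same short-time Fourier magnitude spectrum; the window choice and the small epsilon guard are my additions.

```python
import numpy as np

def brightness_and_bandwidth(frame, sr):
    """Spectral centroid (brightness, returned as log2 frequency) and the
    magnitude-weighted spread around the centroid (bandwidth) for one frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    total = np.sum(spec) + 1e-12
    centroid = np.sum(freqs * spec) / total
    bandwidth = np.sum(spec * np.abs(freqs - centroid)) / total
    return np.log2(max(centroid, 1.0)), bandwidth
```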
Harmonicity distinguishes between harmonic spectra (such as vowels and most musical sounds), inharmonic spectra (such as metallic sounds), and noise (spectra that vary randomly in frequency and time). It is computed by measuring the deviation of the sound's line spectrum from a perfectly harmonic spectrum. This is currently an optional feature and is not used in the examples that follow. It is normalized to lie in a range from zero to one.
All of these aspects of sound vary over time. The trajectory in time is computed during the analysis but not stored as such in the database. However, for each of these trajectories, several features are computed and stored. These include the average value, the variance of the value over the trajectory, and the autocorrelation of the trajectory at a small lag. Autocorrelation is a measure of the smoothness of the trajectory. It can distinguish between a pitch glissando and a wildly varying pitch (for example), which the simple variance measure cannot.
The average, variance, and autocorrelation computations are weighted by the amplitude trajectory to emphasize the perceptually important sections of the sound. In addition to the above features, the duration of the sound is stored. The feature vector thus consists of the duration plus the parameters just mentioned (average, variance, and autocorrelation) for each of the aspects of sound given above. Figure 1 shows a plot of the raw trajectories of loudness, brightness, bandwidth, and pitch for a recording of male laughter.
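One way to realize this step is sketched below. The paper does not give the exact weighted forms of the variance and autocorrelation, so the amplitude-weighted expressions here, and the use of a linear (non-negative) amplitude trajectory as the weights, are assumptions.

```python
import numpy as np

def trajectory_stats(traj, amp, lag=1):
    """Amplitude-weighted mean, variance, and small-lag autocorrelation of one
    per-frame trajectory (loudness, pitch, brightness, or bandwidth)."""
    traj, amp = np.asarray(traj, float), np.asarray(amp, float)
    w = amp / (amp.sum() + 1e-12)                 # linear amplitude weights, assumed non-negative
    mean = np.sum(w * traj)
    var = np.sum(w * (traj - mean) ** 2)
    d = traj - mean
    num = np.sum(w[:-lag] * d[:-lag] * d[lag:])
    den = np.sum(w[:-lag] * d[:-lag] ** 2) + 1e-12
    return mean, var, num / den                   # autocorrelation is ~1 for smooth trajectories

def feature_vector(duration, trajectories, amp, lag=1):
    """Assemble the N-vector: duration plus (mean, variance, autocorrelation)
    for each trajectory, all weighted by the amplitude trajectory."""
    v = [duration]
    for traj in trajectories:                     # e.g. [loudness, pitch, brightness, bandwidth]
        v.extend(trajectory_stats(traj, amp, lag))
    return np.array(v)
```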
After the statistical analyses, the resulting analysis record (shown in Table 1) contains the computed values. These numbers are the only information used in the content-based classification and retrieval of these sounds. It is possible to see some of the essential characteristics of the sound. Most notably, we see the rapidly time-varying nature of the laughter.
Training the system
It is possible to specify a sound directly by submitting constraints on the values of the N-vector described above directly to the system. For example, the user can ask for sounds in a certain range of pitch or brightness. However, it is also possible to train the system by example. In this case, the user selects examples of sounds that demonstrate the property the user wishes to train, such as "scratchiness."
For each sound entered into the database, the N-vector, which we represent as a, is computed. When the user supplies a set of example sounds for training, the mean vector μ and the covariance matrix R for the a vectors in each class are calculated. The mean and covariance are given by
μ = (1/M) Σ_m a_m
R = (1/M) Σ_m (a_m − μ)(a_m − μ)^T

[Figure 1. Male laughter: raw trajectories of loudness, brightness, bandwidth, and pitch for the recording LaughterYoungMale.]
where M is the number of sounds in the summation. In practice, one can ignore the off-diagonal elements of R if the feature vector elements are reasonably independent of each other. This simplification can yield significant savings in computation time. The mean and covariance together become the system's model of the perceptual property being trained by the user.
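A minimal sketch of this training step, assuming the class model is simply the pair (μ, R) estimated from the example sounds' N-vectors; the small variance floor that keeps R invertible is my addition.

```python
import numpy as np

def train_class(examples, diagonal=True):
    """Build a class model (mean vector mu, covariance R) from the N-vectors of
    the user's example sounds. Off-diagonal terms can be dropped for speed."""
    A = np.vstack(examples)                      # shape (M, N): one row per example sound
    mu = A.mean(axis=0)
    if diagonal:
        R = np.diag(A.var(axis=0) + 1e-9)        # diagonal approximation of the covariance
    else:
        R = np.cov(A, rowvar=False, bias=True) + 1e-9 * np.eye(A.shape[1])
    return mu, R
```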
Classifying sounds
When a new sound needs to be classified, a distance measure is calculated from the new sound's a vector and the model above. We use a weighted Euclidean distance:

D = ((a − μ)^T R^(-1) (a − μ))^(1/2)
Again, the off-diagonal elements of R can be ignored for faster computation. Also, simpler measures such as an L1 or Manhattan distance can be used. The distance is compared to a threshold to determine whether the sound is in or out of the class. If there are several mutually exclusive classes, the sound is placed in the class to which it is closest, that is, for which it has the smallest value of D.

If it is known a priori that some acoustic features are unimportant for the class, these can be ignored or given a lower weight in the computation of D. For example, if the class models some timbral aspect of the sounds, the duration and average pitch of the sounds can usually be ignored.
We also define a likelihood value L based on the normal distribution and given by

L = exp(-D^2 / 2)

This value can be interpreted as how much of the defining property for the class the new sound has.
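Putting these pieces together, here is a sketch of the distance, likelihood, and nearest-class decision described above. The feature_weights argument is one simple way to realize the per-feature weighting mentioned earlier, not necessarily the authors' mechanism, and the threshold handling is illustrative.

```python
import numpy as np

def distance(a, mu, R, feature_weights=None):
    """Weighted distance of feature vector a from a class model (mu, R)."""
    d = a - mu
    if feature_weights is not None:              # down-weight or ignore chosen features
        d = d * feature_weights
    return float(np.sqrt(d @ np.linalg.inv(R) @ d))

def likelihood(D):
    """'How much' of the class property the sound has, from the normal model."""
    return float(np.exp(-D ** 2 / 2.0))

def classify(a, models, threshold=None):
    """Assign a to the closest class model; None if no model is within threshold.
    models maps class name -> (mu, R)."""
    dists = {name: distance(a, mu, R) for name, (mu, R) in models.items()}
    best = min(dists, key=dists.get)
    if threshold is not None and dists[best] > threshold:
        return None, dists
    return best, dists
```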
Retrieving sounds
It is now possible to select, sort, or classify sounds from the database using the distance measure. Some example queries are

- Retrieve the "scratchy" sounds. That is, retrieve all the sounds that have a high likelihood of being in the "scratchy" class.
- Retrieve the top 20 "scratchy" sounds.
- Retrieve all the sounds that are less "scratchy" than a given sound.
- Sort the given set of sounds by how "scratchy" they are.
- Classify a given set of sounds into the following set of classes.
For small databases, it is easiest to compute the distance measure(s) for all the sounds in the database and then to choose the sounds that match the desired result. For large databases, this can be too expensive. To speed up the search, we index (sort) the sounds in the database by all the scalar features. This allows us to quickly retrieve any desired hyper-rectangle of sounds in the database by requesting all sounds whose feature values fall in a set of desired ranges. Requesting such hyper-rectangles allows a much more efficient search. This technique has the advantage that it can be implemented on top of the very efficient index-based search algorithms in existing commercial databases.

As an example, consider a query to retrieve the top M sounds in a class. If the database has M0 sounds total, we first ask for all the sounds in a hyper-rectangle centered around the mean μ with volume V such that

V / V0 = M / M0

where V0 is the volume of feature space occupied by the whole database. We then compute the distance measure for the sounds returned and keep the closest M. If the hyper-rectangle contains too few sounds of the desired class, we enlarge the volume and try again.
Note that the above discussion is a simplification of our current algorithm, which asks for bigger volumes to begin with to correct for two factors. First, for our distance measure, we really want a hypersphere of volume V, which means we want the hyper-rectangle that circumscribes this sphere. Second, the distribution of sounds in the feature space is not perfectly regular. If we assume some reasonable distribution of the sounds in the database, we can easily compute how much larger V has to be to achieve some desired confidence level that the search will succeed.
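The sketch below mimics this strategy in memory: prefilter with a hyper-rectangle around μ, rank the survivors with the exact distance (reusing the distance() sketch above), and widen the rectangle if too few sounds are found. The initial half-widths and growth factor are illustrative assumptions; a production system would issue the range query against the database indexes rather than scanning an array.

```python
import numpy as np

def top_m_in_class(vectors, mu, R, m, grow=1.5, max_tries=10):
    """Top-m query: hyper-rectangle prefilter around mu, then exact ranking."""
    db = np.vstack(vectors)                      # (M0, N) feature vectors in the database
    half = np.sqrt(np.diag(R))                   # initial half-widths; illustrative choice
    idx = np.arange(len(db))
    for _ in range(max_tries):
        inside = np.all(np.abs(db - mu) <= half, axis=1)
        idx = np.nonzero(inside)[0]
        if len(idx) >= m or inside.all():
            break
        half = half * grow                       # rectangle too small: enlarge and try again
    dists = [distance(db[i], mu, R) for i in idx]
    return [int(idx[i]) for i in np.argsort(dists)[:m]]
```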
Quality measures
The magnitude of the covariance matrix R is a measure of the compactness of the class. This can be reported to the user as a quality measure of the classification. For example, if the dimensions of R are similar to the dimensions of the database, this class would not be useful as a discriminator, since all the sounds would fall into it. Similarly, the system can detect other irregularities in the training set, such as outliers or bimodality.
The size of the covariance matrix in each dimension is a measure of the particular dimension's importance to the class. From this, the user can see if a particular feature is too important or not important enough. For example, if all the sounds in the training set happen to have a very similar duration, the classification process will rank this feature highly, even though it may be irrelevant. If this is the case, the user can tell the system to ignore duration or weight it differently, or the user can try to improve the training set.
Similarly, the system can report to the user the components of the computed distance measure. Again, this is an indication to the user of possible problems in the class description.

Note that all of these measures would be difficult to derive from a non-statistical model such as a neural network.
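One simple way to surface such a quality measure, assuming the per-feature class variance is compared against the variance of the same feature over the whole database; the ratio and its interpretation are my own framing of the idea above, not the authors' formula.

```python
import numpy as np

def feature_importance_ratio(R_class, database_vectors):
    """Ratio of the class's per-feature variance to the database's per-feature
    variance, as a rough per-dimension quality indicator."""
    db_var = np.var(np.vstack(database_vectors), axis=0) + 1e-12
    return np.diag(R_class) / db_var
```

A ratio near 1 for some feature suggests the class barely constrains it, while a ratio near zero may flag an accidental regularity in the training set, such as the similar durations mentioned above.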
Segmentation
The discussion above deals with the case where each sound is a single gestalt. Some examples of this would be single short sounds, such as a door slam, or longer sounds of uniform texture, such as a recording of rain on cement. Recordings that contain many different events need to be segmented before using the features above.

Segmentation is accomplished by applying the acoustic analyses discussed to the signal and looking for transitions (sudden changes in the measured features). The transitions define segments of the signal, which can then be treated like individual sounds. For example, a recording of a concert could be scanned automatically for applause sounds to determine the boundaries between musical pieces. Similarly, after training the system to recognize a certain speaker, a recording could be segmented and scanned for all the sections where that speaker was talking.
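The paper does not specify how transitions are detected, so the sketch below is just one assumption: flag frames where the per-frame feature vector jumps by much more than its typical frame-to-frame change, and cut the recording there.

```python
import numpy as np

def segment(feature_frames, threshold=3.0):
    """Split a long recording at transitions: frames where the feature vector
    changes far more than the typical frame-to-frame jump."""
    F = np.vstack(feature_frames)                # (num_frames, num_features), e.g. per-frame
                                                 # loudness, pitch, brightness, bandwidth
    jumps = np.linalg.norm(np.diff(F, axis=0), axis=1)
    typical = np.median(jumps) + 1e-12
    cuts = np.nonzero(jumps > threshold * typical)[0] + 1
    bounds = [0] + cuts.tolist() + [len(F)]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
```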
Performance
We have used the above algorithms at Muscle Fish on a test sound database that contains about 400 sound files. These sound files were culled from various sound effects and musical instrument sample libraries. A wide variety of sounds are represented, from animals, machines, musical instruments, speech, and nature. The sounds vary in duration from less than a second to about 1.5 seconds.
A number of classes were made by running the classification algorithm on some perceptually similar sets of sounds. These classes were then used to reorder the sounds in the database by their likelihood of membership in the class. The following discussion shows the results of this process for several sound sets. These examples illustrate the character of the process and the fuzzy nature of the retrieval. (For more information, and to duplicate these examples, see the Interactive Web Demo sidebar.)
Example 1: Laughter. For this example, all the recordings of laughter except two were used in creating the class. Figure 2 shows a plot of the class membership likelihood values (the Y-axis) for all of the sound files in the test database. Each vertical strip along the X-axis is a user-defined category (the directory in which the sound resides). See the Class Model sidebar for details on how our system computed this model.

The highest returned likelihoods are for the laughing sounds, including the two that were not included in the original training set, as well as one of the animal recordings. This animal recording is of a chicken coop and has strong similarities in sound to the laughter recordings, consisting of a number of strong sound bursts.
Example 2: Female speech. Our test database
contains a number of very short recordings of a
[Figure 2. Laughter classification: class-membership likelihood for every sound file in the test database, grouped by user-defined category (Animals, Bells, Crowds, k2000, Laughter, Telephone, Water, Mcgill/altotrombone, Mcgill/cellobowed, Mcgill/oboe, Mcgill/percussion, Mcgill/tubularbells, Mcgill/violinbowed, Mcgill/violinpizz, Speech/female, Speech/male); the two laughter recordings not in the training set are marked.]

References
- Automatic Partitioning of Full-Motion Video (journal article).
- Video and Image Processing in Multimedia Systems (book).
- Automatic Indexing of a Sound Database Using Self-Organizing Neural Nets (journal article).
- R. Plomp, Aspects of Tone Sensation: A Psychophysical Study (book).