Recurrent Networks and Human Computer Jazz Improvisation
Papers describing machine jazz improvisation and other work are available
by clicking here.
Recurrent Reinforcement Networks for Improvisational Time Series
This page also gives a summary and provides audio samples of the original CHIME system.
A recurrent network is one that takes its own outputs as inputs in order to decide what to produce on the next iteration of the network.
I am training a recurrent artificial neural network to play 3 different Sonny
Rollins tunes (all jazz, 12-bar blues). The 3 links below go to the songs
individually:
Blue7
Tenor Madness
Solid
I have also trained a network to learn to play all three songs.
Once I trained the network, I then changed it from a recurrent network to one that will accept human input. Below is the first
attempt at the human and machine trading fours. The human is singing into
a microphone. The wave inputs are converted into MIDI events by software
called Sound2MIDI. These are turned into a representation that can be used
by the recurrent network. After four measures or bars of human playing, the
machine that has just been trained to produce the three songs now plays for
four measures using the learned network weights and the human input. This repeats several times. The human starts and the chosen MIDI voice is that of a chime,
to distinguish it from the underlying accompaniment and from the machine. The
machine's voice is a tenor saxophone (or rather the general MIDI estimate of it). In all of these examples, the rhythm section or accompaniment is
generated by software called Band-in-a-Box.
Click here to listen.
The next step is to employ reinforcement learning to improve the machine's
performance. This is in progress and I have a few preliminary results.
Click here to hear an improved machine improvisation based on the original network
trained on the 3 Sonny Rollins songs, improved by reinforcement learning,
and responding to a human solo. The improvisor's voice is a piano; it is distinguishable from the piano voice that is part of the accompaniment.
Jazz music generation and understanding by a computer is a fun idea, and is a natural area of study for computer science and music majors in a liberal arts college. Aside from the music focus, it is an area in which to explore some deep computer science research. In the long term the ever more complex area
of human-machine interaction can be explored by creating an Interactive Jazz Improviser that not only can "hear" the music being played by humans and react to it, but can "sense" other aspects of human involvement in music, such as human body motion that reflects beat and mood, human facial expression that reflects the expression of the music being played, and other human cues and indicators. In turn, the Machine Improviser should be able to give cues to the human, as to its musical intent. A goal of this part of the research is to foster positive, mutually beneficial interactions between human and machine. It is hoped that this will be extendable to other types of interaction besides music playing.
This work has been supported by the National Science Foundation.
The opinions and conclusions are my own.
More recent work, built on the original ideas, follows.
Reinforcement learning is an area of machine learning that addresses problems that are typified
by having sparse feedback as to whether a proposed solution achieved a goal. Generally only
an intermittent, scalar reinforcement value is available that indicates whether or not
actions or decision policies are successful. This can be contrasted with methods that rely
on a training period in which examples of correct actions or policies are available. In recent
years the focus of the field has been on the relationship of some kinds of reinforcement
learning algorithms to the dynamic programming algorithms
of engineering and applied mathematics. In dynamic programming terms, the reinforcement value is likened
to the cost of moving from one state to another or the cost of achieving a goal.
Most research in reinforcement learning constrains itself to a tabular approach in which
the states and actions that are available are small enough in number that the solutions
can be represented in a table. With this constraint, many algorithms and theoretical results
have been developed. In fact, it was through the tabular approach that the relationship to
dynamic programming was first discovered. We are investigating the use of machine learning algorithms
for learning to improvise jazz music, a subfield of computer music or algorithmic music composition.
During the last 15 or so years, some researchers in the algorithmic composition field have used
artificial neural networks (ANNs). Generally, ANNs learn nonlinear mappings from inputs to outputs.
They can be used to learn to classify patterns or learn models of systems. Significantly, these
are not tabular mappings, but are nonlinear functional mappings
of a set of numerical inputs.
Recurrent neural networks comprise a special class of ANNs in which the processing units receive
feedback from themselves or other processing units, allowing a history of the state to be
used in decision-making. The state of the art of recurrent artificial neural network learning and its
application to time series learning is undergoing innovation and there are several new
kinds of algorithms that have been produced just in the last several years.
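To make the idea of feedback concrete, here is a minimal sketch of one time step of an Elman-style recurrent network. This is an illustration only, not the project's actual network; the sizes, weights, and one-hot pitch inputs are assumptions for the example.

```python
import numpy as np

def step(x, h, W_in, W_rec, W_out):
    """One time step of a minimal Elman-style recurrent network: the new
    hidden state depends on the current input and the previous hidden
    state, so earlier notes can influence the next output."""
    h_new = np.tanh(W_in @ x + W_rec @ h)   # hidden state carries history
    y = W_out @ h_new                       # e.g. scores over candidate next pitches
    return y, h_new

# Illustrative sizes: 12 pitch-class inputs, 8 hidden units, 12 outputs.
rng = np.random.default_rng(0)
W_in = rng.normal(size=(8, 12))
W_rec = rng.normal(size=(8, 8))
W_out = rng.normal(size=(12, 8))
h = np.zeros(8)
for x in np.eye(12)[:4]:                    # feed a short one-hot pitch sequence
    y, h = step(x, h, W_in, W_rec, W_out)
```

The recurrent weight matrix `W_rec` is what distinguishes this from a feedforward network: the hidden state summarizes the sequence seen so far.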
We studied several of these algorithms in previous years, and
determined the usefulness of various kinds of recurrent neural network algorithms for
learning musical time series. We continued this work during the summer of 2003, comparing
several algorithms. One recently invented algorithm called LSTM, for Long Short-Term Memory,
is a significant departure from the ``standard'' recurrent algorithm in that it can
``cache'' information for long stretches of time before using it in choosing actions. Error information
can also be cached and used to correct network weights. This is just the kind of
algorithm that is needed for musical systems in which a long sequence of notes has a
comprehensive meaning when it is finished being played.
After doing comparative studies with the LSTM algorithm and several others, during the summer of 2003 and
during the PI's sabbatical in fall 2003, we re-visited the issue of representing music sequences.
The idea here is that each pitch and corresponding duration should be encoded with two issues in mind:
first, that the encoding is in a form the recurrent neural
network algorithm can use effectively, and second, that it encodes some musical properties.
We devised first a pitch representation called Circles of Thirds, and performed
a comparative study between 1) this representation, 2) one we and others have used
for algorithmic computer music\cite{franklin-01, todd-91},
and 3) another that also included some music information\cite{mozer-94}. The study
compared three musical tasks involving short music sequences, and used the LSTM recurrent algorithm.
We were able to show that the Circles of Thirds representation performed best
in all three tasks and is also the most concise encoding. This work was presented at and
published in the
proceedings of the Florida Artificial Intelligence Research Symposium (FLAIRS) 2004, special
track on computer music. In summer 2004, we submitted an invited extended version of this
paper to the International Journal of Artificial Intelligence Tools, co-authored with an
undergraduate student (Krystal Locke) who compared the results of the LSTM with
the Circles of Thirds representation to results using other recurrent networks, using
the same representation (this paper has
been accepted for publication).
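As an illustration of the general idea behind such a representation (the published encoding is in the FLAIRS paper and may differ in detail), a pitch class can be identified by which circle of major thirds and which circle of minor thirds it lies on, giving a very concise code:

```python
def circles_of_thirds(pc):
    """Sketch of a circles-of-thirds pitch encoding (illustrative, not
    necessarily the published one): a pitch class 0=C ... 11=B is marked
    by which of the four circles of major thirds it lies on (pc mod 4)
    and which of the three circles of minor thirds (pc mod 3).
    Since gcd(3, 4) = 1, the pair identifies the pitch class uniquely."""
    major = [0] * 4                 # one-hot over circles of major thirds
    minor = [0] * 3                 # one-hot over circles of minor thirds
    major[pc % 4] = 1
    minor[pc % 3] = 1
    return major + minor            # 7 bits total

# Every pitch class gets a distinct 7-bit code:
codes = [tuple(circles_of_thirds(pc)) for pc in range(12)]
assert len(set(codes)) == 12
```

Seven bits for twelve pitch classes is more concise than a 12-unit one-hot chromatic encoding, while keeping harmonically related pitches close in the code space.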
We also devised a duration representation that can be used to learn rhythms of long songs. Furthermore,
it can be used to learn from score-notations, or from human performances as encoded by MIDI
(Musical Instrument Digital Interface). Human players generally do not adhere to the very strict
timing of a musical score, particularly in jazz, where they are encouraged not to follow strict timing.
The duration representation is a step towards machine accommodation of human performers. The PI
presented a paper describing this work that is published in the Proceedings of the 8th
International Conference on Music Perception and Cognition (ICMPC8), August 2004.
A description of both representations, and the ability of the LSTM network to learn a long
jazz melody in both score form
and human-performed form, plus a description of the field of using recurrent neural networks for computer music
learning and composition, has also been accepted for publication as an article for the INFORMS
Journal on Computing (JOC), Special
Cluster of Papers on Computation in Music.
Click here to listen to the song Afro Blue, as learned by LSTM.
Now that we have developed better representations and have had success with
the LSTM network,
we are ready to return to reinforcement learning for improvisation. We
made strides in this direction this year by experimenting with combining
the LSTM recurrent network with an algorithm that predicts the reinforcement value over time, called
temporal difference learning or value iteration.
We follow Bakker's derivation of adding temporal difference
eligibility traces to
an LSTM network~\cite{bakker-02}, in which he combines
LSTM with advantage learning, a type
of reinforcement learning.
Temporal difference algorithms are concerned with
predicting delays in receiving the reinforcement value. The results of this algorithm, imposed on the
LSTM network to make predictions of the outcomes of musical sequences, are described in a paper
that appears in the Proceedings of the International Conference on Computer Music, November 2004.
In the first set of experiments, the TD-LSTM network predicts the outcome
of a music pitch sequence, while the sequence is being played.
We showed that it is able to predict
a positive or negative outcome for a short musical task
of chromatic lead-in to a chord tone, at the time the chromatic note is played.
In a second set of more difficult experiments, we showed that the TD-LSTM network
can predict positive outcomes when certain chord
tones are played on the last beat of each bar of
ii-V-I chord progressions.
In a separate research effort that has led to an undergraduate honors thesis, we developed a system
based on reinforcement learning.
There are three separate pieces of implementation:
- software called Max/MSP that gathers samples of
a human playing a melody (the same melody, different renditions)
and transforms those samples for later use
- a rhythm parser, based on dynamic programming to maximize a joint likelihood
function over time, that takes in a set of observations of a new rendition of
the same melody and assigns them to the most likely notes from the score
of the melody; that is, it estimates the score positions of the played notes.
It contains a model that is built from the samples (in part 1) and from
a heuristic that the timing of a played note varies with the length of the note.
This is based in part on Raphael's work~\cite{raphael-01}.
- a set of tables of reinforcement learning agents, one table per measure,
that learn to vary the rhythm of the melody in a unique
way while keeping the variations within the limits needed by the
rhythm parser (in part 2) to correctly assign the played notes to the
notes from the score of the same melody. In other words, the measure of
reward is based on the rhythm parser's output.
While this thesis is ongoing, preliminary results show that the table of reinforcement
learning agents can learn to vary the rhythm of a melody subject to the
constraints, and the melody is recognizable to humans.
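As a much-simplified illustration of the rhythm parser's core idea, the sketch below reduces the problem to aligning played onset times to score onsets by dynamic programming with a squared-deviation cost, rather than the full joint likelihood model the thesis uses:

```python
import math

def align(observed, score):
    """Hypothetical simplification of the rhythm parser: match each played
    onset time to a score onset by dynamic programming, minimizing total
    squared timing deviation while preserving note order (score notes may
    be skipped, observations may not)."""
    n, m = len(observed), len(score)
    # best[i][j]: min cost of matching the first i observations
    # using only the first j score notes.
    best = [[math.inf] * (m + 1) for _ in range(n + 1)]
    best[0] = [0.0] * (m + 1)                  # zero observations cost nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = best[i - 1][j - 1] + (observed[i - 1] - score[j - 1]) ** 2
            best[i][j] = min(best[i][j - 1], match)
    # Backtrack to recover which score note each observation matched.
    assignment, i, j = [], n, m
    while i > 0:
        if best[i][j] == best[i][j - 1]:       # score note j-1 was skipped
            j -= 1
        else:
            assignment.append(j - 1)
            i, j = i - 1, j - 1
    return best[n][m], assignment[::-1]

# A slightly rushed rendition of score beats 0, 1, 2, 3:
cost, idx = align([0.0, 0.9, 2.1, 2.95], [0.0, 1.0, 2.0, 3.0])
```

The real parser replaces the squared-deviation cost with a likelihood model built from the Max/MSP samples and the note-length heuristic, but the monotonic dynamic-programming structure is the same kind of machinery.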
Click on the links to MIDI files below to hear 4 versions of the first 8 bars of
The Girl From Ipanema
Early LSTM findings
As described by Mozer~\cite{mozer-94}, the state-of-the-art
recurrent neural network can really only learn short sequences,
because the gradient vanishes over time.
Schmidhuber et al.~\cite{gers-00,hochreiter-97} also analyze this phenomenon,
and designed a network called LSTM that can `cache' gradient information
in a self-recurrent memory cell. Gradient-based
gates learn when to use the error stored in the cell
to update the network's input weights. The network of cells,
enclosed in memory blocks, each with its own set of
adaptive gates, is also recurrent, receiving the set of
external inputs as well as the cell outputs at the last time
step.
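The gating scheme just described can be sketched as follows. This is an illustrative single-cell forward pass only (no memory blocks, biases, peephole connections, or learning), with made-up dimensions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, h, c, W):
    """One forward step of a single LSTM memory cell, following the gated
    design described above: the input gate decides what enters the
    self-recurrent cell state c, the forget gate decides how much of c is
    kept, and the output gate decides what leaves the cell. W is a dict
    of illustrative weight matrices over [external inputs, prior outputs]."""
    z = np.concatenate([x, h])      # external inputs plus recurrent inputs
    i = sigmoid(W["i"] @ z)         # input gate
    f = sigmoid(W["f"] @ z)         # forget gate
    o = sigmoid(W["o"] @ z)         # output gate
    g = np.tanh(W["g"] @ z)         # candidate cell input
    c = f * c + i * g               # cell state 'caches' information over time
    h = o * np.tanh(c)              # gated cell output
    return h, c

rng = np.random.default_rng(1)
W = {k: rng.normal(size=(8, 20)) for k in "ifog"}   # 12 inputs + 8 outputs = 20
h, c = np.zeros(8), np.zeros(8)
for x in np.eye(12)[:4]:            # a short one-hot input sequence
    h, c = lstm_cell_step(x, h, c, W)
```

The additive update `c = f * c + i * g` is what lets error information persist over long stretches instead of vanishing.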
Our first test with LSTM was on the forward motion task,
using 4 memory blocks of 2 cells each (as Gers et al. suggest~\cite{gers-00}).
The most startling observation was the robust stability of the network.
The squared error decreases steadily without oscillation.
With this configuration, the LSTM network learns with about the
same accuracy as does the RTRL network, but again in a much more
repeatable, stable manner. Values close to .3, such as .5 and .7,
are likely to be classified as good when the right chord value
appears on the second chord input.
We tested the LSTM network with a binary (Boxes) representation
and its performance was much worse, ending with a squared error
of .22 after 40,000 epochs.
With 6 memory blocks, the squared error was better, at
.17 after 40,000 epochs. We did not pursue this exploration
since the results with the hybrid representation were
so much better. However, this did lead us to try more
memory blocks for the hybrid representation.
After 40,000 epochs, the error is at .16.
40,000 may seem like many epochs compared to RTRL. However, LSTM
is much more computationally efficient, and the experiments are
actually shorter.
Further exploration in the forward motion task has shown us that
adding blocks, for a total of 8, hampered learning (for 80,000
epochs). Also, the best representation so far is
the hybrid representation.
Summary of use of Recurrent Networks
We had assumed that recurrent networks could
learn to classify our musical examples with 100\% accuracy.
We tried binary/localized representations,
pseudo-continuous representations, and a hybrid.
For some example sets, classification is 100\% accurate.
For others, certain examples that are too similar
to the 'good' ones are either all classified
as positive, or several good ones are misclassified as negative.
Usually, classification accuracy is about 95\%.
We have decided to make the most of this behavior.
We could use a representation that makes similar
examples acceptable musically, perhaps using
part of Mozer's psychologically based representation
utilizing the musical circle of fifths. This will be an
interesting exploration, because of the nature
of jazz. If we use a chromatic representation, then
the most similar examples are a half step off.
In many Western tonal genres, this will sound terrible.
In jazz, it may or may not. Perhaps another level
of decision making could decide whether
the network should lean toward false positives or false negatives, depending
on what part of the phrase is being improvised.
Misclassification could be part of the quirkiness of an individual player,
channeling it into a kind of creativity.
While we need to explore LSTM more, especially on the other two tasks,
we have decided to focus on this network. Its accuracy is as good
as that of any of the other algorithms; it is much more stable, especially
compared to RTRL;
and it is computationally efficient. We have also been able to adapt it
for use as a TD reinforcement predictor.
Adding TD for Credit Assignment
We have implemented temporal difference recurrent networks
using both RTRL and LSTM.
Using TD with RTRL, for the three example tasks that we
are using, we expected TD to predict a high reinforcement
value for step 3 of the counting task, for step 2 of the ordered sequence
task, and for step 3 of the forward motion task. Also, a low reinforcement
value should be predicted at steps in the `bad' sequences, before the
final step, depending on what values occur.
We are greatly encouraged by the results. These predictions did in fact
occur as we had hoped when added to RTRL.
The only caveat here is that TD obviously
depends on the capability of the underlying recurrent network.
So, as expected, the same percentage of errors in prediction occur as
do in the misclassifications of the networks without TD.
TD works on the principle of minimizing a prediction
of the cumulative future rewards (discounted as time advances into the future).
The TD `error' is $\delta = r + \gamma V(S_{t+1}) - V(S_t)$, where
$r$ is the current reinforcement, $V(S_t)$ is the prediction at time
step $t$ (TD's output), and $V(S_{t+1})$ is the estimate of future reward at time step $t+1$.
$r + \gamma V(S_{t+1})$ is an estimate of cumulative future reward.
The weight updates use this error.
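As a sketch of how the TD error drives learning, consider a linear value estimator $V(s) = w \cdot s$ standing in for the recurrent network (an illustrative simplification; the parameters here are made up):

```python
import numpy as np

def td0_update(w, s, s_next, r, gamma=0.9, alpha=0.1):
    """One temporal-difference step with a linear value estimator
    V(s) = w . s, in place of the recurrent network. The error is
    delta = r + gamma*V(s_next) - V(s); note that V(s_next) is computed
    with the same pre-update weights used for V(s)."""
    delta = r + gamma * (w @ s_next) - (w @ s)
    w = w + alpha * delta * s       # gradient of V(s) with respect to w is s
    return w, delta

# One-hot state features; a reward of 1.0 arrives on this transition.
w = np.zeros(3)
s, s_next = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
w, delta = td0_update(w, s, s_next, r=1.0)
```

In the TD-LSTM combination the same error signal is used, but the value prediction and its gradient come from the recurrent network rather than a linear map.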
TD was developed to work in a Boxes (discrete state) representation.
In this framework, $V(S_{t+1})$ is merely the weight in the next state
($S_{t+1}$) box. Using a network, in order to estimate $V(S_{t+1})$, we obtain the next set of inputs,
and run them through the network, using the same weight values that are used
to generate the network output, $V(S_t)$, necessarily before the weights are
updated. This requires careful
preservation of the {\em network's} state, particularly for LSTM where the cell state
is a function of its prior value. This has been a good estimate for us,
both in our previous work in coupling TD with a nonlinear
feedforward network, and now with recurrent networks.
When combining TD with LSTM, the preliminary results are even more
encouraging. After only 200 epochs of the
forward motion task, the TD-LSTM network can predict with
85\% accuracy. Also, the predictions occur for every time step of the
sequence. For example, in epoch 198, examples 98 and 99, the
predictions and targets are as follows:\\
\begin{tabular}{llll}
epoch & example & prediction & target\\
\hline
198 & 98 & 0.60 & 0.000\\
198 & 98 & 0.62 & 0.000\\
198 & 98 & 0.64 & 0.000\\
198 & 98 & 0.61 & 1.000\\
\hline
198 & 99 & 0.00 & 0.000\\
198 & 99 & 0.00 & 0.000\\
198 & 99 & 0.00 & 0.000\\
198 & 99 & 0.00 & 0.000
\end{tabular}\\
Increasing the number of epochs produced similar results.
The learning parameters required for TD and LSTM separately
vs.~their combination need adjustment. For example, TD
uses self-recurrent `eligibility traces' that are used in
the weight updates. These are decaying averages of the
inputs to the TD units (the output units of the TD-LSTM network).
The rate had to be adjusted to be lower than when TD was used with
a feedforward network.
We are testing TD further, with LSTM, and are on the verge of adding an
action-taking component, starting with the SARSA algorithm~\cite{sutton-98},
that can use the TD prediction to choose actions (next notes
or next sequences of notes).
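Since the action-taking component is still being added, the sketch below shows only generic tabular SARSA~\cite{sutton-98} on a made-up two-step task; in the improviser the states and actions would instead involve notes and note sequences, and the table would be replaced or augmented by the network's predictions.

```python
import random

class ToyEnv:
    """Made-up two-step task: at each step, action 'up' earns reward 1.0."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s += 1
        reward = 1.0 if a == "up" else 0.0
        return self.s, reward, self.s == 2   # next state, reward, done

def sarsa_episode(Q, env, actions, alpha=0.1, gamma=0.9, eps=0.1):
    """One episode of tabular SARSA: Q(s, a) is moved toward
    r + gamma * Q(s', a'), where a' is the action actually chosen next
    (epsilon-greedy), so the policy being followed is the one evaluated."""
    def choose(s):
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))

    s = env.reset()
    a = choose(s)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = None if done else choose(s2)
        target = r if done else r + gamma * Q.get((s2, a2), 0.0)
        q = Q.get((s, a), 0.0)
        Q[(s, a)] = q + alpha * (target - q)
        s, a = s2, a2

random.seed(0)
Q, env = {}, ToyEnv()
for _ in range(500):
    sarsa_episode(Q, env, ["up", "down"])
# After training, 'up' should carry the higher value in both states.
```

Because SARSA's update uses the action actually taken, the exploratory (epsilon) choices shape the learned values, which fits a setting where occasional "exploratory" notes are part of the behavior being learned.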
Creativity in Machine Jazz Improvisors
Jazz improvisation is an art-form and requires a certain amount of
creativity. While we are not yet claiming to have endowed our systems
with creativity, we have in the past modified our reinforcement
learning algorithms to ensure variety. We have spent some time examining
how other machine learning systems~\cite{biles-98,mozer-94,spector-95,thom-03}
applied to jazz improvisation
induced variety in their playing.
The main ideas are
- to use some probabilistic interpretation of the output of
the algorithms
- to employ musical domain knowledge
- to combine corpora of human examples
- to enable real-time interaction with the system, so that
human creativity can drive the machine's creativity.
Plans include the
design of the individual improvisors, integrating recurrent networks
with action-taking reinforcement learning techniques, and
developing hierarchical improvisors that can combine the individuals.
LSTM networks have been used to learn complex language grammars,
and work has been done on grammatical models of music~\cite{lerdahl-93}. We will
explore using an LSTM-learned grammatical model of music (with motifs, phrases, etc.)
to drive the selection of the individual improvisors.
\bibitem{bakker-02}
B.~Bakker, Reinforcement Learning with Long Short-Term Memory,
{\em Neural Information Processing Systems, 14},
eds, T.G. Dietterich and S. Becker and Z. Ghahramani,
MIT Press, Cambridge, MA, 2002
\bibitem{biles-98}
J.~A.~Biles,
Interactive GenJam: Integrating Real-Time Performance with a Genetic Algorithm,
{\em Proc.~1998 International Computer Music Conference}, Ann Arbor, MI.
\bibitem{calvert-01}
D.~Calvert and S.~C.~Kremer, Networks with Adaptive State Transitions,
{\em A Field Guide to Dynamical Recurrent Networks}, eds.~J.~F.~Kolen and S.~C.~Kremer,
IEEE Press, Piscataway, NJ, 2001.
\bibitem{campolucci-98}P.~Campolucci, A Circuit Theory Approach to Recurrent
Neural Network Architectures and Learning Methods,
Ph.D. dissertation (Dottorato di Ricerca) in Electrical Engineering, Universit\`{a} Degli Studi di
Bologna, 1998.
\bibitem{franklin-01}
J.~A.~Franklin, Learning and Improvisation,
{\em Neural Information Processing Systems 14}, eds. Dietterich, T. G., Becker, S., and Ghahramani, Z. MIT Press,
Cambridge, MA 2001.
\bibitem{franklin-02}
J.~A.~Franklin, and V.~U.~Manfredi,
Nonlinear Credit Assignment for Music Sequences,
{\em Proc.~Second International Workshop on Intelligent Systems
Design and Applications}, Atlanta, GA, 2002.
\bibitem{gers-00}
F.~A.~Gers, J.~Schmidhuber and F.~Cummins. Learning to Forget: Continual Prediction with LSTM.
{\em Neural Computation} {\bf 12} (2000), 10:2451--2471.
\bibitem{hochreiter-97}
S.~Hochreiter and J.~Schmidhuber,
Long Short-Term Memory, {\em Neural Computation} {\bf 9} (1997), 8:1735--1780.
\bibitem{lin-01}T.-N.~Lin and C.~L.~Giles, Delay Networks: Buffers to the Rescue,
in {\em A Field Guide to Dynamical Recurrent Networks}, eds.~J.~F.~Kolen and S.~C.~Kremer,
IEEE Press, Piscataway, NJ, 2001.
\bibitem{jordan-86}
M.~Jordan,
Attractor Dynamics and Parallelism in a Connectionist Sequential
Machine.
{\em Proc.~Eighth Annual Conf. of the Cognitive
Science Society}, Amherst, MA, 1986.
\bibitem{lerdahl-93}
F.~Lerdahl and R.~Jackendoff, An Overview of Hierarchical Structure in Music,
{\em Machine Models of Music}, eds.~S.~M.~Schwanauer and D.~A.~Levitt,
MIT Press, Cambridge, MA, 1993.
\bibitem{mozer-94}
M.~C.~Mozer,
Neural Network Music Composition by Prediction:
Exploring the Benefits of Psychophysical Constraints and Multiscale Processing.
{\em Connection Science}, {\bf 6} (1994), 247-280.
\bibitem{pearlmutter-01}
B.~A.~Pearlmutter, Gradient Calculations for Dynamic Recurrent Neural Networks,
{\em A Field Guide to Dynamical Recurrent Networks}, eds.~J.~F.~Kolen and S.~C.~Kremer,
IEEE Press, Piscataway, NJ, 2001.
\bibitem{principe-93}
J.~Principe and J.~Kuo, Backpropagation Through Time with Fixed Memory Size Requirements,
{\em Proc.~3rd Workshop on Neural Networks for Signal Processing}, 207--215, Maryland, 1993.
\bibitem{raphael-01}
C.~Raphael, Music Plus One: A System for Expressive and Flexible Musical Accompaniment,
{\em Proceedings of the 2001 International Computer Music Conference}. (Computer Music Association, San Francisco, CA, 2001).
\bibitem{spector-95}
L.~Spector, and A.~Alpern, Induction and Recapitulation of
Deep Musical Structure, {\em
Proc. 1995 IJCAI Workshop on Artificial Intelligence and Music},
Montreal, Qu\'{e}bec.
\bibitem{sutton-88}
R.~Sutton,
Learning to Predict by the Methods of Temporal Differences.
{\em Machine Learning} {\bf 3} (1988), 9-44.
\bibitem{sutton-98}
R.~S.~Sutton, and A.~G.~Barto,
{\em Reinforcement Learning},
(MIT Press, Cambridge, MA, 1998).
\bibitem{thom-03}
B.~Thom, Interactive Improvisational Music Companionship: A User-Modeling Approach.
{\em User Modeling and User-Adapted Interaction Journal} {\bf 13} (2003), 2: 1-2.
\bibitem{todd-91}
P.~M.~Todd,
A Connectionist Approach to Algorithmic Composition.
{\em Music and Connectionism}, eds.~P.~M. Todd and E.~D.~Loy,
(MIT Press, Cambridge, MA, 1991).
\bibitem{williams-89}
R.~J.~Williams and D.~Zipser, A Learning Algorithm for Continually Running Fully
Recurrent Neural Networks, {\em Neural Computation} {\bf 1} (1989), 270-280.
\bibitem{williams-90}
R.~J.~Williams and J.~Peng,
An Efficient Gradient-based Algorithm for On-line Training of Recurrent Network Trajectories,
{\em Neural Computation} {\bf 2} (1990), 4:491--501.