Recurrent Networks and Human Computer Jazz Improvisation

Papers describing machine jazz improvisation and other work are available by clicking here.

Recurrent Reinforcement Networks for Improvisational Time Series

This page also gives a summary and provides audio samples of the original CHIME system.

A recurrent network is one that takes its own outputs as inputs in order to decide what to produce on the next interation of the network. I am training a recurrent artificial neural network to play 3 different Sonny Rollins tunes (all jazz, 12 bar blues). The 3 links below go to the songs individually:
Tenor Madness

I have also trained a network to learn to play all three songs. Once I trained the network, I then changed it from a recurrent network to one that will accept human input. Below is the first attempt at the human and machine trading fours. The human is singing into a microphone. The wave inputs are converted into MIDI events by software called Sound2MIDI. These are turned into a representation that can be used by the recurrent network. After four measures or bars of human playing, the machine that has just been trained to produce the three songs now plays for four measures using the learned network weights and the human input. This repeats several times. The human starts and the chosen MIDI voice is that of a chime, to distinguish it from the underlying accompaniment and from the machine. The machine's voice is a tenor saxophone (or rather the general MIDI estimate of it). In all of these examples, the rhythm section or accompaniment is generated by software called Band-in-a-Box.
Click here to listen.
The next step is to employ reinforcement learning to improve the machine's performance. This is in progress and I have a few preliminary results. If you Click here you will hear an improved machine improvisation based on the original network trained on the 3 Sonny Rollin's songs, improved by reinforcement learning, and in response to a human solo. The voice of the improvisor is a piano voice, it is distinguishable from the piano voice that is part of the accompaniment.

Jazz music generation and understanding by a computer is a fun idea, and is a natural area of study for computer science and music majors in a liberal arts college. Aside from the music focus, it is an area to explore some deep computer science research. In the long term the ever more complex area of human-machine interaction can be explored by creating an Interactive Jazz Improviser that not only can "hear" the music being played by humans and react to it, but can "sense" other aspects of human involvement in music, such as human body motion that reflects beat and mood, human facial expression that reflects the expression of the music being played, and other human cues and indicators. In turn, the Machine Improviser should be able to give cues to the human, as to its musical intent. A goal of this part of the research is to foster positive, mutually beneficial interactions between human and machine. It is hoped that this will be extendable to other types of interaction besides music playing.

This work has been supported by the National Science Foundation. The opinions and conclusions are my own.

More recent work that was built on the original ideas follows

Reinforcement learning is an area of machine learning that addresses problems that are typified by having sparse feedback as to whether a proposed solution achieved a goal. Generally only an intermittent, scalar reinforcement value is available that indicates whether or not actions or decision policies are successful. This can be contrasted with methods that rely on a training period in which examples of correct actions or policies are available. In recent years the focus of the field has been on the relationship of some kinds of reinforcement learning algorithms to the dynamic programming algorithms of engineering and applied mathematics. In dynamic programming terms, the reinforcement value is likened to the cost of moving from one state to another or the cost of achieving a goal.

Most research in reinforcement learning constrains itself to a tabular approach in which the states and actions that are available are small enough in number that the solutions can be represented in a table. With this constraint, many algorithms and theoretical results have been developed. In fact, it was through the tabular approach that the relationship to dynamic programming was first discovered. We are investigating the use of machine learning algorithms for learning to improvise jazz music, a subfield of computer music or algorithmic music composition. During the last 15 or so years, some researchers in the algorithmic composition field have used artificial neural networks (ANNs). Generally, ANNs learn nonlinear mappings from inputs to outputs. They can be used to learn to classify patterns or learn models of systems. Significantly, these are not tabular mappings, but are nonlinear functional mappings of a set of numerical inptus.

Recurrent neural networks compose a special class of ANNs in which the processing units receive feedback from themselves or other processing units, allowing a history of the state to be used in decision-making. The state of the art of recurrent artificial neural network learning and its application to time series learning is undergoing innovation and there are several new kinds of algorithms that have been produced just in the last several years. We studied several of these algorithms in previous years, and determined the usefulness of various kinds of recurrent neural network algorithms for learning musical time series. We continued this work during summer of 03, comparing several algorithms. One recently invented algorithm called LSTM, for Long Short-Term Memory, is a significant departure from the ``standard'' recurrent algorithm in that it can ``cache'' information for long stretches of time before using it in choosing actions. Error information can also be cached and used to correct network weights. This is just the kind of algorithm that is needed for musical systems in which a long sequence of notes has a comprehensive meaning when it is finished being played.

After doing comparative studies with the LSTM algorithm and several others, during summer of 03 and during the PI's sabbatical period of fall 03, we re-visited the issue of representing music sequences. The idea here is that each pitch and corresponding duration should be encoded with two issues in mind. One is that the encoding is in a form that can be best used by the recurrent neural network algorithm and secondly that it encodes some musical properties.

We devised first a pitch representation called Circles of Thirds, and performed a comparative study between 1) this representation, 2) one we and others have used for algorithmic computer music\cite{franklin-01, todd-91}, and 3) another that also included some music information\cite{mozer-94}. The study compared three musical tasks involving short music sequences, and used the LSTM recurrent algorithm. We were able to show the Circles of Thirds representation performed at the best level in all three tasks, and is also the most concise encoding. This work was presented at and published in the proceedings of the Florida Artificial Intelligence Research Symposium (FLAIRS) 2004, special track on computer music. In summer 2004, we submitted an invited extended version of this paper to the International Journal of Artificial Intelligence Tools, co-authored with an undergraduate student (Krystal Locke) who compared the results of the LSTM with the Circles of Thirds representation to results using other recurrent networks, using the same representation (this paper has been accepted for publication).

We also devised a duration representation that can be used to learn rhythms of long songs. Furthermore, it can be used to learn from score-notations, or from human performances as encoded by MIDI (Musical Instrument Digital Interface). Human players most generally do not adhere to the very strict timing of a musical score, particularly in jazz, where they are encouraged not to follow strict timing. The duration representation is a step towards machine accomodation of human performers. The PI presented a paper describing this work that is published in the Proceedings of the 8th International Conference on Music Perception and Cognition (ICMPC8), August 2004.

A description of both representations, and the ability of the LSTM network to learn a long jazz melody in both score form and human-performed form, plus a description of the field of using recurrent neural networks for computer music learning and composition, has also been accepted for publication as an article for the INFORMS Journal on Computing (JOC), Special Cluster of Papers on Computation in Music.

Click here to listen to the song Afro Blue, as learned by LSTM

Now that we have developed better representations and have had success with the LSTM network, we are ready to return to reinforcement learning for improvisation. We made strides in this direction this year by experimenting with combining the LSTM recurrent network with an algorithm that predicts the reinforcement value over time, called temporal difference learning or value iteration. We follow Bakker's derivation of adding temporal difference eligibility traces to an LSTM network~\cite{bakker-02}, in which he combines LSTM with advantage learning, a type of reinforcement learning. Temporal difference algorithms are concerned with predicting delays in receiving the reinforcement value. The results of this algorithm, imposed on the LSTM network, to make predictions of outcomes of musical sequences is described in a paper that appears in the Proceedings of the International Conference on Computer Music, November 2004. In the first set of experiments, that TD-LSTM network predicts the outcome of a music pitch sequence, while the sequence is being played. We showed that it is able to predict a positive or negative outcome for a short musical task of chromatic lead-in to a chord tone, at the time the chromatic note is played. In a second set of more difficult experiments, we showed that the LSTM-TD network can predict positive outcomes when certain chord tones are played on the last beat of each bar of ii-V-I chord progressions.

In a separate research effort that has lead to an undergraduate honors thesis, we developed a system based on reinforcement learning. There are three separate pieces of implementation
  1. using software called Max/MSP to gather samples of a human playing a melody (the same melody, different renditions) and to transform those samples for later use
  2. a rhythm parser, based on dynamic programming to maximize a joint likelihood function over time, that takes in a set of observations of a new rendition of the same melody, and assigns these to the most likely notes from the score of the melody. I.e. it estimates the score positions of the played notes. It contains a model that is built from the samples (in part 1) and from a a heuristic that the timing of a played note varies with the length of the note. This is based in part on Raphael's work~\cite{raphael-01}.
  3. A set of tables of reinforcement learning agents, one table per measure, that learn to vary the rhythm of the melody in a unique way, but keep the variations within the limits needed by the rhythm parser (in part 2) to correctly assign the played notes with the notes from the score of the same melody. In other words, the measure of reward is based on the rhythm parser's output.

While this thesis is ongoing, preliminary results show that the table of reinforcement learning agents can learn to vary the rhythm of a melody subject to the constraints, and the melody is recognizable to humans.
Click on the links to midi files below to hear 4 versions of the first 8 bars of The Girl From Ipanema

Early LSTM findings

As described by Mozer~\cite{mozer-94}, the state-of-the-art recurrent neural network can really only learn short sequences, because of the vanishing of the gradient over time. Schmidhuber et al.~\cite{gers-00,hochreiter-97} also analyze this phenomenon, and designed a network called LSTM that can `cache' gradient information in a self-recurrent memory cell. Gradient-based gates learn when to use the error stored in the cell to update the network's input weights. The network of cells, enclosed in memory blocks, each with its own set of adaptive gates, is also recurrent, receiving the set of external inputs as well as the cell outputs at the last time step.

Our first test with LSTM was on the forward motion task, using 4 memory blocks of 2 cells each (as Gers et al. suggest~\cite{gers-00}). The most startling observation was the robust stability of the network. The squared error decreases steadily without oscillation. With this configuration, the LSTM network learns with about the same accuracy as does the RTRL network, but again in a much more repeatable, stable manner. Values close to .3, such as .5 and .7 are likely to be classified as good, when the right chord value appears on the second chord input.

We tested the LSTM network with a binary (Boxes) representation and its performance was much worse, ending with a squared error of .22 after 40,000 epochs. With 6 memory blocks, the squared error was better, at .17 after 40,000 epochs. We did not pursue this exploration since the results with the hybrid representation were so much better. However, this did lead us to try more memory blocks for the hybrid representation. After 40,000 epochs, the error is at .16. 40,000 may seem like many epochs compared to RTRL. However, LSTM is much more computationally efficient, and the experiments are actually shorter.

Further exploration in the forward motion task has shown us that adding blocks, for a total of 8, hampered learning (for 80,000 epochs). Also, the best representation so far is the hybrid representation.

Summary of use of Recurrent Networks

We had assumed that recurrent networks could learn to classify our musical examples with 100\% accuracy. We tried binary/localized representations, pseudo-continuous representations, and a hybrid. For some example sets, classification is 100\% accurate. For others, certain examnples that are too similar to the 'good' ones, are either all classified as positive, or several good ones are mis-classified as negative. Usually, classification accuracy is about 95\%.

We have decided to make the most of this behavior. We could use a representation that makes similar examples acceptable musically, perhaps using part of Mozer's psychologically based representation utilizing the musical circle of fifths. This will be an interesting exploration, because of the nature of jazz. If we use a chromatic representation, then the most similar examples are a half step off. In many Western tonal genres, this will sound terrible. In jazz, it may or may not. Perhaps another level of decision making could decide whether the the network should learn toward false positives or false-negatives, depending on what part of the phrase is being improvised. Misclassification could be part of the quirkiness of an individual player, channeling it into a kind of creativity.

While we need to explore LSTM more, especially on the other two tasks, we have decided to focus on this network. Its accuracy is as good as any of the other algorithms, it is much more stable, especially compared to RTRL, and it is computationally effecient. We have also been able to adapt it for use as a TD reinforcement predictor.

Adding TD for Credit Assignment

We have implemented temporal difference recurrent networks using both RTRL and LSTM.

Using TD with RTRL, for the three example tasks that we are using, we expected TD to predict a high reinforcement value for step 3 of the counting task, for step 2 of the ordered sequence task, and for step 3 of the forward motion task. Also, a low reinforcement value should be predicted at steps in the `bad' sequences, before the final step, depending on what values occur. We are greatly encouraged by the results. These predictions did in fact occur as we had hoped when added to RTRL. The only caveat here, is that TD obviously depends on the capability of the underlying recurrent network. So, as expected, the same percentage of errors in prediction occur as do in the misclassifications of the networks without TD.

TD works on the principle of minimizing a prediction of the cumulative future rewards (discounted as time advances into the future). The TD `error' is $\delta = r + \gamma V(S_{t+1}) - V(S_t)$ where r is the current reinforcement, $V(S_t)$ is the prediction at time step t (TD's output), and $V(s_{t+1})$ is the estimate of future reward at time step t+1. $r+\gamma V(S_(t+1)$ is an estimate of cumulative future reward. The weight updates use this error.

TD was developed to work in a Boxes (discrete state) representation. In this framework, $V(S_{t+1})$ is merely the weight in the next state ($s_{t+1})$ box. Using a network, in order to estimate $V(S_{t+1})$, we obtain the next set of inputs, and run them through the network, using the same weight values that are used to generate the network output, $V(S_t)$, necessarily before the weights are updated. This requires careful preservation of the {\em network's} state, particularly for LSTM where the cell state is a function of its prior value. This has been a good estimate for us, both in our previous work in coupling TD with a nonlinear feedforward network, and now with recurrent networks.

When combining TD with LSTM, the preliminary results are even more encouraging. After only 200 epochs of the forward motion task, the TD-LSTM network can predict with 85\% accuracy. Also, the predictions occur for every time step of the sequence. For example, in epoch 198, examples 98 and 99, the prediction and targets are as follows:\\
198 & 98 &  0.60 &   0.000\\
198 & 98 &  0.62 &   0.000\\
198 & 98 &  0.64 &   0.000\\
198 & 98 &  0.61 &   1.000\\
198 & 99 &  0.00 &   0.000\\
198 & 99 &  0.00 &   0.000\\
198 & 99 &  0.00 &   0.000\\
198 & 99 &  0.00 &   0.000

Increasing the number of epochs produced similar results. The learning parameters required for TD and LSTM separately vs.~their combination need adjustment. For example, TD uses self-recurrent `eligibility traces' that are used in the weight updates. These are decaying averages of the inputs to the TD units (the output units of the TD-LSTM network). The rate had to be adjusted to be lower than when TD was used with a feedforward network.

We are testing TD further, with LSTM, and are on the verge of adding an action-taking component, starting with the SARSA algorithm~\cite{sutton-98}, that can use the TD prediction to choose actions (next notes or next sequences of notes).

Creativity in Machine Jazz Improvisors

Jazz improvisation is an art-form and requires a certain amount of creativity. While we are not yet claiming to have endowed our systems with creativity, we have in the past modified our reinforcement learning algorithms to insure variety. We have spent some time examining how other machine learning systems~\cite{biles-98,mozer-94,spector-95,thom-03} applied to jazz improvisation induced variety in their playing.

The main ideas are
  1. to use some probabilistic interpretation of the output of the algorithms
  2. to employ musical domain knowledge
  3. to combine corpi of human examples
  4. to enable real-time interaction with the system to enable human creativity to drive the machine's creativity.

Plans include the design of the individual improvisors, integrating recurrent networks with action-taking reinforcement learning techniques, and in developing hierarchical improvisors that can combine the individuals. LSTM networks have been used to learn complex language grammars, and work has been done on grammatical models of music~\cite{lerdahl-93}. We will explore using an LSTM-learned grammatical model of music (with motifs, phrases, etc.) to drive the selection of the individual improvisors.

B.~Bakker, Reinforcement Learning with Long Short-Term Memory,
{\em Neural Information Processing Systems, 14},
eds, T.G. Dietterich and  S. Becker and Z. Ghahramani,
MIT Press, Cambridge, MA, 2002

Interactive GenJam: Integrating Real-Time Performance with a Genetic Algorithm,
{\em Proc.~1998 International Computer Music Conference}, Ann Arbor, MI.

D.~Calvert and S.~C.~Kremer, Networks with Adaptive State Transitions, 
{\em A Field Guide to Dynamic Recurrent Networks}, eds.~j.~K.~Kolen, and S.~C.~Kremer, 
IEEE Press, Piscataway, NJ  2001. 

\bibitem{campolucci-98}P.~Campolucci, A Circuit Theory Approach to Recurrent
Neural Network Architectures and Learning Methods, 
Dottorato di Ricerca in Ingegneria Elettrotecnica, Universit\`{a} Degli Studi di
Bologna, 1998.

J.~A.~Franklin, Learning and Improvisation, 
{\em Neural Information Processing Systems 14}, eds. Dietterich, T. G., Becker, S., and Ghahramani, Z. MIT Press,
Cambridge, MA 2001.

J.~A.~Franklin, and V.~U.~Manfredi, 
Nonlinear Credit Assignment for Music Sequences,
{\em Proc.~Second International Workshop on Intelligent Systems
Design and Applications},  Atlanta, GA, 2002.

F.~A.~Gers, J.~Schmidhuber and F.~Cummins. Learning to Forget: Continual Prediction with LSTM.
{\em Neural Computation} {\bf 12} (2000), 10:2451--2471.

S.~Hochreiter and J.~Schmidhuber,
Long Short-Term Memory, {\em Neural Computation} {\bf 9} (1997), 8:1735-1780.

\bibitem{lin-01}T.-N.~Lin and C.~L.~Giles, Delay Networks: Buffers to the Rescue,
in {\em A Field Guide to Dynamic Neural Networks}, eds.~J.~F.~Kolen and S.~C.~Kremer, 
IEEE Press, Piscataway, NJ  2001. 

 Attractor Dynamics and Parallelism in a Connectionist Sequential
 {\em Proc.~Eighth Annual Conf. of the Cognitive
 Science Society}, Amherst, MA, 1986.

F.~Lerdahl and R.~Jackendoff, An Overview of Hierarchical Structure in Music,
{\em Machine Models of Music}, eds.~S.~M.~Schwanauer and D.~A.~Levitt,
MIT Press, Cambrdige, Massachusetts, 1993.

Neural Network Music Composition by Prediction:
Exploring the Benefits of Psychophysical Constraints and Multiscale Processing. 
{\em Connection Science}, {\bf 6} (1994), 247-280.

B.~A.~Pearlmutter,Gradient Calculations for Dynamic Recurrent Neural Networks,
{\em A Field Guide to Dynamic Recurrent Networks}, eds.~j.~K.~Kolen, and S.~C.~Kremer, 
IEEE Press, Piscataway, NJ  2001. 

J.~Principe and J.~Kuo, Backpropagation Through Time with Fixed Memory Size Requirements,
Proc. 3rd Workshop on Neural Networks for Signal Processing, pages 207-215, Maryland, 1993. 

C.~Raphael,  Music Plus One: A System for Expressive and Flexible Musical Accompaniment,
{\em Proceedings of the 2001International Computer Music Conference}. (Computer Music Association. San Francisco, CA, 2001).

L.~Spector, and A.~Alpern, Induction and Recapitulation of
Deep Musical Structure, {\em
Proc. 1995 IJCAI Workshop on Artificial Intelligence and Music},
Montreal, Qu\'{e}bec.

 Learning to Predict by the Methods of Temporal Differences.
 {\em Machine Learning} {\bf 3} (1988), 9-44.

R.~S.~Sutton, and A.~G.~Barto, 
 {\em Reinforcement Learning},
 (MIT Press, Cambridge, MA, 1998).

B.~Thom, Interactive Improvisational Music Companionship: A User-Modeiling Approach.
{\em User Modeling and User-Adapted Interaction Journal} {\bf 13} (2003), 2: 1-2. 

 A Connectionist Approach to Algorithmic Composition.
{\em Music and Connectionism},  eds.~P.~M. Todd and E.~D.~Loy,
(MIT Press, Cambridge, MA, 1991).

R.~J.~Williams and D.~Zipser, A Learning Algorithm for Continually Running Fully
Recurrent Neural Networks, {\em Neural Computation} {\bf 1} (1989), 270-280.

R.~J.~Williams and J.~Peng,
An Efficient Gradient-based Algorithm for On-line Training of Recurrent Network Trajectories,
{\em Neural Computation {\bf 2} (1990)}, 4:491-501.