VirtuaLatin - Agent Based Percussive Accompaniment
David Murray-Rust
Master of Science
School of Informatics
University of Edinburgh
2003
Abstract

This project details the construction and analysis of a percussive agent, able to add timbales accompaniment to pre-recorded salsa music. We propose, implement and test a novel representational structure specific to latin music, inspired by Lerdahl and Jackendoff's Generative Theory of Tonal Music, and incorporating specific domain knowledge. This is found to capture the relevant information but lack some flexibility. We develop a music listening system designed to build up these high level representations using harmonic and rhythmic aspects along with parallelism, but find that it lacks the information necessary to create full representations. We develop a generative system which uses expert knowledge and high level representations to combine and alter templates in a musically sensitive manner. We implement and test an agent based platform for the composition of music, which is found to convey the necessary information and perform fast enough that real time operation should be possible. Overall, we find that the agent is capable of creating accompaniment which is indistinguishable from human playing to the general public, and difficult for domain experts to identify.
Acknowledgements

Thanks to everyone who has helped and supported me through this project; in particular Alan Smaill and Manuel Contreras, my supervisor and co-supervisor, and everyone who took the Salsa Challenge.
Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.
(David Murray-Rust)
Table of Contents

1 Introduction
  1.1 The use of agent systems for musical activities
  1.2 Customised representations for latin music
  1.3 Output Generation
  1.4 Musical analysis of latin music
  1.5 Aims

2 Background
  2.1 Music Representations
    2.1.1 Audio
    2.1.2 Common Practice Notation
    2.1.3 MIDI - Overview
  2.2 Literature Review
    2.2.1 Music Representations and Analyses
    2.2.2 Mechanical Analysis of Music
    2.2.3 Computer Generated Music
    2.2.4 Agents and Music
    2.2.5 Interactive Systems
    2.2.6 Distributed Architectures
  2.3 Conclusions

3 Design
  3.1 Overview
  3.2 Higher Level Representations
    3.2.1 The GTTM and its Application to Latin Music
    3.2.2 Desired Results
    3.2.3 Design Philosophy
    3.2.4 Well-Formedness Rules
    3.2.5 Preference Rules
  3.3 Agent System
  3.4 Generative Methods
    3.4.1 Basic Rhythm Selection
    3.4.2 Phrasing
    3.4.3 Fills
    3.4.4 Chatter
  3.5 Design Summary

4 System Architecture
  4.1 Agent Architecture
    4.1.1 Overview
    4.1.2 Class Hierarchy and Roles
    4.1.3 Information Flow
  4.2 High Level Representations
    4.2.1 Representation Classes
    4.2.2 Human Readability
    4.2.3 Identities
    4.2.4 Representations By Hand
  4.3 Low Level Music Representation
  4.4 Architecture Summary

5 Music Listening
  5.1 The Annotation Class
  5.2 Feature Extraction
    5.2.1 Harmonic Analysis
    5.2.2 Pattern Analysis
  5.3 Rhythmic Analysis
  5.4 Dissection
  5.5 Music Listening Summary

6 Generative Methods
  6.1 Basic Rhythm Selection
  6.2 Ornamentation
    6.2.1 Phrasing
    6.2.2 Fills
    6.2.3 Chatter
    6.2.4 Transformations
  6.3 Modularity and Division of Labour
    6.3.1 Memory
  6.4 Generative Methods Summary

7 Results and Discussion
  7.1 Music Listening
    7.1.1 Chordal Analysis
    7.1.2 Chord Pattern Analysis
    7.1.3 Phrasing Extraction
    7.1.4 Final Dissection
  7.2 Listening Tests
  7.3 Representations
    7.3.1 Structural Assumptions
  7.4 Infrastructure

8 Future Work
  8.1 Analysis
    8.1.1 Chord Recognition
    8.1.2 Pattern Analysis
  8.2 Generation
    8.2.1 Ornament Selection
    8.2.2 Groove and Feel
    8.2.3 Soloing
  8.3 Representations
  8.4 Agent Environment
  8.5 Long Term Improvements

9 Conclusions

A Musical Background
  A.1 History and Use of the Timbales
  A.2 The Structure of Salsa Music
  A.3 The Rôle of the Timbalero
  A.4 Knowledge Elicitation

B MIDI Details
  B.1 MIDI Streams
  B.2 MIDI Files

C jMusic
  C.1 Overview
  C.2 Alterations
  C.3 jMusic Issues

D Listening Assessment Test

E Example Output

Bibliography
List of Figures

3.1 Representation Structure
3.2 Example section: the montuno from Mi Tierra (Gloria Estefan), leading up to the timbales solo
3.3 Possible Network Structures
3.4 Possible Distributed Network Structure
3.5 Music Messages Timeline
3.6 Final Agent Architecture
4.1 Overview of System Structure
4.2 Class Hierarchy
4.3 Message Flow
4.4 Example jMusic XML File
4.5 SequentialRequester and CyclicResponseCollector flow diagrams
4.6 Different sets of notes which would be classified as C major
4.7 Ambiguous Chords
4.8 Example fragment of Section textual output
5.1 Analysis Operations
6.1 Generative Structure
6.2 Rhythm Selection Logic
7.1 Guitar part for bars 22-25 of Mi Tierra, exhibiting split bar chords
7.2 Phrasing under the solos (bars 153-176)
8.1 Chunk Latency for the Agent System
A.1 Example Timbales setup (overhead view)
A.2 Scoring Timbale Sounds
A.3 Standard Son Claves
A.4 Basic Cáscara pattern, with a backbeat on the hembra
C.1 jMusic Part, Phrase and Note Structure
Chapter 1
Introduction

This report details the construction of VirtuaLatin, a software agent which is capable of taking the place of a human timbalero (drummer) in a salsa band. There are several “real world” reasons to do this, as well as research interest:
• As a practice tool for musicians, so that band rehearsals are possible when the drummer is ill
• As a learning tool, to give illustrations of how and why timbales should be played in the absence of a human teacher
• a first step on the road toward allowing hybrid ensembles of human and mechanical performers

This is a large and complex task, so we identify four main areas of interest.
1.1 The use of agent systems for musical activities
The use of autonomous software agents is becoming increasingly widespread, and as with many other technological advances, it is highly applicable to music. The agent paradigm allows an opportunity to analyse the interaction between musicians, as well as each individual's mental processes; we feel that this is a key aspect of understanding how music is created. Ultimately, it is a step towards a distributable heterogeneous environment in which musicians can play together regardless of physical location or mental substrate. We describe an implementation of an agent infrastructure for musical activities, and analyse its use for both the project at hand and future work.
1.2 Customised representations for latin music
Music exists in many forms: from the abstract forms in a composer's or listener's mind, through increasingly concrete formal representations such as musical scores and MIDI data, to physical measurements of the sound waves produced when the music is played [8]. Each level of representation has its own characteristic virtues and failings, and the correct choice or design of representation is crucial to the success of musical projects. We explore very different levels of musical representation here - low level representations which allow the basic musical “facts” to be communicated between agents, and high level representations which seek to understand the music being played. When human musicians compose, play or listen to music, high level representations of the music are created, which enable a deeper understanding of musical structure [18]. We therefore develop a novel high level symbolic representation of latin music which captures all the important features of a piece in such a way as to enable our agent to play in a highly musical manner.
1.3 Output Generation
The ultimate aspiration of the work presented here is to create high quality music; as such, we need a subsystem which can work over the representations given to perform in a musical manner. We use a combination of a rule based expert system which can select and combine templates, and alter them to fit specific situations, with domain knowledge and high level representations to provide playing which supports and enhances the musical structure of the piece.
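As a purely illustrative sketch of this kind of template selection (the class, method and mapping below are invented for illustration; the actual generative logic is described in Chapter 6), the first step of such an expert system can be pictured as a lookup from the structural role of the current section to a base template, which later stages then alter to fit the specific situation:

    // Hypothetical sketch only: choosing a base timbales template from
    // the structural role of the current section. Role names, pattern
    // names and the mapping itself are illustrative assumptions, not
    // the project's actual rule base.
    class TemplateSelector {
        String baseTemplateFor(String sectionRole) {
            switch (sectionRole) {
                case "verse":   return "cascara";    // lower-energy sections
                case "montuno": return "mambo-bell"; // high-energy sections
                default:        return "cascara";    // safe fallback
            }
        }
    }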
1.4 Musical analysis of latin music
In order to provide musically sensitive accompaniment to previously unheard pieces, our agent needs to be capable of extracting the salient features from music it is listening to, and using these to build up the higher level representations it is going to work with. We combine modified versions of existing methods with domain knowledge and bespoke algorithms to create a comprehensive analysis of music heard, inspired by the structure of the GTTM [18]. We give a domain specific treatment of harmonic, rhythmic and structural features, including a search for musical parallelism, and investigate whether this is capable of creating the representations we need. We do not, however, integrate this with the generative system.
1.5 Aims
The overall aim of the project is:

To create a process which is capable of providing a timbales accompaniment to prerecorded salsa music, in an agent based environment, which is of sufficient quality to be indistinguishable from human playing.

This can be divided into four main aims:

1. construction of an agent environment suitable for the production of music
2. creation of representations which are suitably rich to inform the agent's playing
3. implementation of a generative system which can produce high quality output
4. implementation of a music listening subsystem which can build the necessary representations

The dissertation is structured as follows:
• some background on the general area, and a look at related work
• an explanation of the design concepts behind the system
• a look at the overall system architecture, including the agent platform and the music representations used
• description of the music listening sections of the project
• detail of the generative methods used
• analysis of results and discussion
• ideas for further work
• some conclusions and final thoughts
Chapter 2
Background

This chapter gives some background to the project as a whole. A detailed discussion of latin music and the rôle of the timbalero in a latin ensemble is given in Appendix A.
2.1 Music Representations
There are many different ways to represent music, with varying levels of complexity and expression. An overview is given in [8], but here we briefly detail the three standard representations which are most relevant to this project.
2.1.1 Audio
Audio data is the most basic representation of music, and consists of a direct recording of the sound produced when it is played. In the digital domain this consists of a series of samples which represent the waveform of a sound. It can be used to represent any sound, but is very low level - it does not delineate pitches, notes, beats or bars.
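For illustration (this example is mine, not from the original text), one second of digital audio is nothing more than an array of sample values; the Java sketch below synthesises a 440 Hz sine wave at a CD-style sample rate:

    // Illustrative sketch: digital audio as a bare array of samples.
    // There are no notes, beats or bars here - pitch and duration are
    // only implicit in the waveform itself.
    public class SineWave {
        public static void main(String[] args) {
            int sampleRate = 44100;      // samples per second
            double freq = 440.0;         // concert A
            double[] samples = new double[sampleRate];  // one second
            for (int i = 0; i < samples.length; i++) {
                samples[i] = Math.sin(2 * Math.PI * freq * i / sampleRate);
            }
            System.out.println("first few samples: "
                    + samples[0] + ", " + samples[1] + ", " + samples[2]);
        }
    }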
2.1.2 Common Practice Notation
Common Practice Notation (CPN) is the name given to standard “Western” scores. It contains information on what notes are to be played at particular times by each instrument. This information is then subject to interpretation - the exact rendition is up to the players; parameters such as timing, dynamics and timbre are to some extent encoded in the score, but will generally be played differently by different players, and are not trivially reproducible mechanically (work relating to this is discussed below).
2.1.3 MIDI - Overview
MIDI stands somewhere in between Audio and CPN in terms of representational levels. A MIDI file encodes:

• the start and end times, pitches and velocities of all notes
• information regarding other parameters of each part (such as volume and possible timbre changes)
• information regarding what sounds should be used for each part

To some extent, this captures all of the information about a particular performance - a MIDI recording of a pianist playing a certain piece will generally be recognisable as the same performance. A MIDI file will be played back by a sequencer, which in turn triggers a synthesiser to play sounds. It is in this stage that interpretation is possible; the MIDI sequencer has no idea what sounds it is triggering - it has simply asked for a sound by number (for example, sound 01 corresponds to a grand piano in the standard mapping). It is possible that the synthesiser in question does not support all of the parameters encoded in the MIDI file, or that the sounds are set up unexpectedly. Finally, different synthesisers will produce sounds of varying quality and realism. However, due in large part to conventions such as the General MIDI standard, one can be fairly sure that playing a MIDI file on compatible equipment will sound close to the author's intention. Thus we have a representational standard with close to the realism of Audio, with many of the high level features present in CPN. There exist many packages which can (with varying degrees of success) turn MIDI data into CPN scores.
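To make the note-level content concrete, the sketch below (written for this discussion, not taken from the project code) uses the standard javax.sound.midi API to print the note-on events a MIDI file encodes; the file name is a placeholder:

    import javax.sound.midi.*;
    import java.io.File;

    // Minimal sketch: dump the note events encoded in a MIDI file.
    public class MidiDump {
        public static void main(String[] args) throws Exception {
            Sequence seq = MidiSystem.getSequence(new File("tune.mid"));
            System.out.println("resolution: " + seq.getResolution()
                    + " ticks per quarter note");
            for (Track track : seq.getTracks()) {
                for (int i = 0; i < track.size(); i++) {
                    MidiEvent ev = track.get(i);
                    if (ev.getMessage() instanceof ShortMessage) {
                        ShortMessage sm = (ShortMessage) ev.getMessage();
                        // NOTE_ON with velocity 0 is conventionally a note-off
                        if (sm.getCommand() == ShortMessage.NOTE_ON
                                && sm.getData2() > 0) {
                            System.out.printf("tick %d: pitch %d, velocity %d%n",
                                    ev.getTick(), sm.getData1(), sm.getData2());
                        }
                    }
                }
            }
        }
    }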
2.2 Literature Review
2.2.1 Music Representations and Analyses
A broad overview of the issues surrounding music representation is given by Dannenberg [8]. He explores the problems in musical representation in several areas, the most relevant of which are hierarchy and structure, timing, timbre and notation. One of the most cited works in reference to musical representation is the Generative Theory of Tonal Music, by Lerdahl and Jackendoff [18]. This outlines a manner in which to hierarchically segment music into structurally significant groups, which it is argued is an essential step in developing an understanding of the music. As presented, it has two main obstructions to implementation: firstly it is incomplete (features such as parallelism are relied upon, but no method for determining them is given), and secondly it is not a full formal specification. Many of the rules given are intentionally ambiguous - they indicate preferences, and often two rules will indicate opposing decisions with no decision procedure being defined. Despite these acknowledged issues, it provides a comprehensive framework on which music listening applications can be built, and there are many partial implementations which exhibit some degree of success.

A different aspect of musical representation is covered by the MusES system [24], developed by Francois Pachet. A novel aspect of this system is the full treatment of enharmonic spelling - that is, considering C# and Db to be different pitch classes, despite the fact that they sound the same (in some tuning systems, when played on some instruments, they may in fact be different; on a piano keyboard, however, C# and Db are the same key). This is a distinction which may often be necessary to analysis. The design of the system leans towards support for analysis, but is intended to be able to support any development - it relies on the idea that there is “some common sense layer of musical knowledge which may be made explicit” [25]. MusES was originally developed in Smalltalk, but subsequently ported to Java. Through conversations with F. Pachet, I was able to obtain a partial copy of the MusES library, and it would have made an ideal development platform. Unfortunately, due to portions of the code being copyrighted, I was unable to obtain a complete system.

[13] describes a highly detailed formal representation of music, capable of representing a wide range of musical styles. An example is given of representing a minimalist piece which does not have explicitly heard notes; rather, a continuous set of sine waves is played, the amplitudes of which tend towards the idealised spectrum of the implied note at any given time, with the frequencies of the tones close to harmonics tending towards the ideal harmonics. The representation allows for many different levels of hierarchy and grouping, and is specifically designed for automated analysis tasks.
2.2.2 Mechanical Analysis of Music
There is a key distinction which lies at the heart of much musical analysis, and which is in many ways more deeply entrenched than in other disciplines: the divide between symbolic and numeric analysis. This dichotomy is explored in [23], and synthetic approaches are suggested. Harmonic reasoning based in the MusES system is compared with numeric harmonic analysis by NUSO, which performs statistical analysis on tonal music. It is suggested that symbolic analysis performs well if there are recognisable structures specific to a domain, and that numeric analysis is likely to perform better on “arbitrary sequences of notes”.
2.2.3 Computer Generated Music
In order to create generative musical systems in a scientific manner, it is necessary to have a specific goal in mind; this often includes tasks such as recreating a particular style of playing (imitative), creating music which has a specific function (intentional), or testing a particular technique with respect to the generation of music (technical). (These definitions are my own, intended to aid discussion rather than to create a rigorous framework.)

Intentional music is particularly interesting due to its broad usage. Every day we hear many pieces of music designed to have specific effects on us, rather than be pleasurable to listen to. Film soundtracks and the music in computer games are two common examples. The creators of GhostWriter [27] (a virtual environment used to aid children in creative writing in the horror genre) use music as a tool to build and relieve tension - to support the surprise and suspense which are the basic tools of the horror narrative. The tool proposed is a generative system which takes as input a desired level of “scariness” (tension). This is then converted into a set of parameters which control a high level form generator, a rhythmic section and a harmonic section. The harmonic section is based on the musical work of Herrmann (who wrote scores for many of Hitchcock's films, most notably Psycho) and the theoretical work of Schoenberg. Although the system is not tested in [27], the tests to be performed are outlined.

Zimmermann [30] uses complex models of musical structure to create music designed to enhance presentations - the music is used to guide the audience's attention and motivation. One contention of this paper is that there is a missing middle level in the theories of musical structure as applied to this domain: while they are good at modelling high level structure (e.g. sonata form) and low level forms (such as cadences and beats), a layer in between is needed, which is called the music-rhetorical level. A structure of the presentation is created, which defines a series of important points, such as the announcement of an event or the introduction of an object, each associated with a mood, a function and a time. This structure is then used to guide music-rhetorical operations. The system as described is a partial implementation, and no analysis is given.

This leads us on to PACTs - Possible ACTions - introduced by Pachet as strategies, and expanded in [26]. PACTs provide variable levels of description for musical actions, from low level operations (play “C E G”, play loud, play a certain rhythm) to high level concepts (play bluesy, play in a major scale). These are clearly useful tools for intention based composition; they also allow a different formulation of the problem of producing musical output - rather than starting with an empty bar and the problem being how to fill it, we can start with a general impression of what to play, and the problem is to turn this into a single concrete performance.

Even if the exact notes and rhythms are known (to the level of a musical score), this is not generally sufficient to produce quality output. Hence there are ongoing efforts both to understand how human players interpret scores, and to use this information to enhance the realism of mechanical performance. The SaxEx system [7] has been designed to take as input a sound file of a phrase played inexpressively, some MIDI data describing the notes, and an indication of the desired output. Case Based Reasoning is then applied, and a new sound file is created. It was found that this generated pleasing, natural output. The system has also been extended [2] to include affect driven labels on three axes (tender-aggressive, sad-joyful, calm-restless) for more control over output.
2.2.4 Agents and Music
There are several ways in which agents could be used for music. A natural breakdown is to model each player in an ensemble as an agent. This is the approach taken in the current project. An alternative would be to model a single musician as a collection of agents, as in Minsky's Society of Mind model of cognition. A middle path between these ideas is taken by Pachet in his investigations into evolving rhythms [15]. Here, each percussive sound (e.g. kick drum, snare drum) is assigned to an agent. The agents then work together to evolve a rhythm. They are given a starting point and a set of rules (expressed in the MusES system), and play a loop continuously, with every agent listening to the output of all the others. Typical rules are: emphasise strong/weak beats, move notes towards/away from other notes, and add syncopation or double beats. From the interaction of simple rules, it was found that some standard rhythms could be evolved, and that interesting versions of existing rhythms could be produced.

The use of multiple agents for beat tracking is described in [11]. This system creates several agents with different hypotheses about where the beat is, and assigns greater weight to the agents which correctly predict many new beats. The system is shown to be both computationally inexpensive and robust with respect to different styles of music; in all test cases it correctly divined the tempo, the only error being the phase (it sometimes tracked off-beats rather than on-beats).
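As a toy illustration of the multiple-hypothesis idea (the structure, tolerance and scoring here are invented for this sketch, and are not taken from [11]), each agent can be pictured as a (period, phase) pair which gains weight whenever an onset confirms one of its predicted beats:

    // Toy sketch of one beat-tracking hypothesis agent: it predicts
    // beats at phase + n * period and is rewarded when an onset lands
    // close to a prediction. A tracker would keep many such agents and
    // read the tempo off the highest-scoring one.
    class BeatAgent {
        double periodMs, phaseMs, score;

        BeatAgent(double periodMs, double phaseMs) {
            this.periodMs = periodMs;
            this.phaseMs = phaseMs;
        }

        void observe(double onsetMs) {
            double pos = Math.abs(onsetMs - phaseMs) % periodMs;
            double dist = Math.min(pos, periodMs - pos);  // to nearest beat
            if (dist < 30.0) score += 1.0;  // onset confirmed a prediction
        }
    }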
2.2.5 Interactive Systems
Antoni Camurri has carried out a great deal of work on interactive systems, and is director of the Laboratorio di Informatica Musicale (http://musart.dist.unige.it/sito inglese/laboratorio/description.html). In [1] and [6], he looks at analysis of human gestures and movement. In [4], he develops an architecture for environmental agents, which alter an environment according to the actions of people within it. He breaks these agents down into input and output sections, then a rational, an emotional and a reactive component. He finds the architecture to be flexible, and has used it in performances. The architecture is extended in [5] to give a fuller treatment of emotion, developing concepts such as happiness, depression, vanity, apathy and anger.

Rowe [28] has developed the Cypher system, which can be used as an interactive compositional or performance tool. It does not use any stored scores, but will play along with a human performer with “a distinctive style and a voice quite recognizably different from the music presented at its input”. It offers a general architecture on which the user can build many different types of system.

Another area of interest is auto accompaniment - creating mechanical systems which can “play along” with human performers. Raphael [9] creates a system where the computer plays a prerecorded accompaniment in time to a soloist. It uses a Hidden Markov Model to model the soloist's note onset times, a phase vocoder to allow for variable speed playback, and a Bayesian network to link the two. Training sessions (analogous to rehearsals) are used to train the belief network.
2.2.6 Distributed Architectures
Since one of the great benefits of agent based approaches is that agents may be distributed and of unknown origin (as long as they conform to a common specification), a logical direction is the distributed composition or performance of music. [16] describes some of the issues in distributed music applications. Two of the key barriers are defined: latency (the average delay in information being received after it has been transmitted) and jitter (the variability of this delay). It is stated that one can generally compensate for jitter by increasing latency, and that there is a problem with the current infrastructure in that there is no provision made for Quality of Service specification or allocation. The issues of representations and data transfer rate are discussed: audio represents a complete description of the music played, while MIDI only specifies pitches and onsets. This means that audio will be a more faithful reproduction, but that MIDI has far lower data transfer rates (typically 0.1-5kbps against 256kbps for high quality MP3 audio). It is concluded that it is currently impossible to perform music in a fully distributed fashion, but that all of the problems have technical solutions on the horizon - except the latencies due to the speed of light.

There are many constraints associated with real time programming; in response to this, there have been attempts to set out agent systems designed to handle real time operation. [12] discusses the difference between reactive and cognitive agents, and gives a possible hybrid architecture which couples an outer layer of behaviours (which may be reactive or cognitive) with a central supervisor (based on an Augmented Transition Network). This ensures that hard goals are met by reactive processes, but more complex cognitive functions can be performed when the constraints are relaxed. [10] presents an agent language which allows the specification of real time constraints, and a CORBA layer which enforces this. Finally, [14] presents a real-time agent architecture which can take account of temporal, structural and resource constraints, goal resolution and unexpected results. This architecture is designed to be implemented by individual agents to allow them to function in a time and resource limited environment.
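The jitter-for-latency trade mentioned above can be sketched as a playout buffer (my own illustration, not a mechanism described in [16]): every event is scheduled a fixed delay after it was sent, so any variation in arrival time smaller than that delay is absorbed:

    // Sketch of trading latency for jitter: schedule each event for
    // (send time + fixed delay). If the fixed delay exceeds the worst
    // observed network delay, playback is smooth despite jittery arrivals.
    class PlayoutBuffer {
        final double fixedDelayMs;

        PlayoutBuffer(double fixedDelayMs) { this.fixedDelayMs = fixedDelayMs; }

        // time at which the event should sound, or -1 if it arrived too late
        double schedule(double sendTimeMs, double arrivalTimeMs) {
            double playTime = sendTimeMs + fixedDelayMs;
            return arrivalTimeMs <= playTime ? playTime : -1;
        }
    }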
2.3 Conclusions
Several pieces of work have been particularly inspiring for this project. The theoretical work of Lerdahl and Jackendoff suggests a very useful model for musical analysis, and also helps support claims about musical structure. Pachet's work on the MusES system has been useful, as it has given a complete (working) framework to examine, as well as the concept of PACTs. It is encouraging to see that not much work has been done on interacting musical agents, so we are covering new territory. Finally, the work of Rowe has demonstrated the possibilities of interactive music, and given many concrete examples of how certain subsystems may be implemented.
Chapter 3
Design

3.1 Overview
From the overall problem domain, we have selected several areas of interest:
• High level representations specific to latin music which are sufficient to adequately inform the playing of a timbalero.
• Generative methods working over high level representations which are capable of creating realistic timbale playing.
• Music listening algorithms which are capable of generating the necessary high level representations from raw musical data.
• Construction of an Agent based environment for musical processes.

The desired end result is a system which can combine these components to generate high quality timbales parts to prerecorded salsa music.
3.2 Higher Level Representations
The musical representations discussed so far are designed to encode enough data about a piece of music to enable its reproduction in some manner. A musician either hearing or playing the music encoded in this form would need to have some higher level understanding of the music in order to either play or hear the piece correctly. It is these representations which we now consider. In our specific case, we are attempting to create a representation which will:
• be internal to a particular agent
• aid the agent in generating its output

The goal is not a full formal analysis - this is both difficult and unnecessary. The agent needs, at this stage:

• an idea of where it is in the piece
• an idea of what to play at this point in time
• some idea as to what will happen next
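These three needs can be pictured as a minimal interface (an illustrative sketch with invented names; the representation classes actually used are described in Chapter 4):

    // Illustrative only: the three things the agent needs from its
    // high level representation at any moment.
    interface PlayingContext {
        int currentBar();              // where we are in the piece
        String currentTemplate();      // what to play at this point in time
        String upcomingSectionRole();  // some idea of what will happen next
    }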
3.2.1 The GTTM and its Application to Latin Music
There can be no doubt that the GTTM has played a massive role in the current state of computational analysis of music - it appears in the bibliography of almost every paper on the subject. It is the theoretical framework around which the higher level representations used in this project have been built. To recap, the GTTM consists of four levels:

Grouping Structure segments the piece into a tree of units, with no overlap (except for the case of elisions, where the last note of one group may also be the first note of the next)

Metrical Structure divides the piece up by placing strong and weak beats at a number of levels

Time-span Reduction calculates an importance for the pitches in a piece based on grouping and metre

Prolongational Reduction calculates the harmonic and melodic importance of pitches
At each of these levels there is a set of well-formedness rules and a set of preference rules. The idea behind this is that there will often be many valid interpretations of a section, so we should try to calculate which one is most likely or preferred. The GTTM is a very general theory, and in this case we are focusing on a specific style of music; what extra information does this give us? Latin music always has a repetitive rhythm going on. Although this may change for different sections, there will always be a basic ‘groove’ happening. In almost all cases, this will be based on a clave, a repeating two bar pattern (see discussion elsewhere). There are clearly defined hyper-measure structures - mambos, verses, montunos and more - which provide the large structural elements from which a piece is built. The actions of a player can generally be described using a single sentence for each section (“the horns play in the second mambo, and then all the percussion stops except the clave in the bridge”).
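To make the clave concrete, the following sketch (my own encoding of the standard 3-2 son clave shown in Figure A.3) writes the pattern as onsets on an eighth-note grid over its two bar cycle:

    // Illustrative sketch: the 3-2 son clave as onsets on an eighth-note
    // grid (16 eighths = two bars of 4/4, 0-based, so 0 is beat 1 of bar
    // one and 8 is beat 1 of bar two).
    public class SonClave {
        // bar 1: beat 1, the "and" of 2, beat 4; bar 2: beats 2 and 3
        static final int[] ONSETS = {0, 3, 6, 10, 12};

        static boolean hitOnEighth(int eighth) {
            int pos = ((eighth % 16) + 16) % 16;  // wrap into the cycle
            for (int onset : ONSETS) {
                if (onset == pos) return true;
            }
            return false;
        }

        public static void main(String[] args) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 16; i++) sb.append(hitOnEighth(i) ? 'x' : '.');
            System.out.println(sb);  // x..x..x...x.x...
        }
    }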
3.2.2 Desired Results
In general, the smallest structural unit in latin music is the bar; phrases may be played which cross bars, or which take up less than a single bar, but the structure is defined in terms of bars. Further, the clave will continue throughout, and will be implied even when not played. It follows that the necessary tasks are:

quantization of the incoming data, according to an isochronous pulse (quantization in this sense is different to its standard usage in sequencers: here we mean “determining the most appropriate isochronous pulse and notating the incoming events relative to this”, rather than shifting incoming notes to be exact multiples of some chosen rhythmic value)

metricization of the quantized data into beats and bars

segmentation of the resulting bars into sections
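A minimal sketch of quantization in this sense (my own illustration; estimating the pulse period and phase, which is the hard part, is assumed to have been done already):

    // Notate raw onset times (ms) relative to an isochronous pulse.
    // Notes are not moved, only labelled with their nearest pulse index.
    public class Quantizer {
        static int[] quantize(double[] onsetsMs, double periodMs, double phaseMs) {
            int[] pulses = new int[onsetsMs.length];
            for (int i = 0; i < onsetsMs.length; i++) {
                pulses[i] = (int) Math.round((onsetsMs[i] - phaseMs) / periodMs);
            }
            return pulses;
        }

        public static void main(String[] args) {
            // a slightly loose run of eighth notes at 250 ms per pulse
            double[] onsets = {10, 255, 492, 760, 1010};
            int[] q = quantize(onsets, 250, 0);
            for (int p : q) System.out.print(p + " ");  // 0 1 2 3 4
        }
    }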
Here we assume that we are dealing with music which is already described in terms of beats and bars (i.e. metricised and quantized), so we are only left with the task of segmenting these bars and extracting relevant features from them - a process described in Section 5.
3.2.3 Design Philosophy
The structures under consideration do not represent the music itself, but only its higher level structure and features. There are also some assumptions which are used to simplify matters:

Structural Assumption 1: There are high level sections of music with distinct structural roles.

Structural Assumption 2: The smallest structural unit in latin music is the bar; phrases may be played which cross bars, or which take up less than a single bar, but the structure is defined in terms of bars.

Structural Assumption 3: A bar contains one and only one chord.

Structural Assumption 4: A segment contains one and only one groove.
Grouping in the GTTM is completely hierarchical: each group contains other groups down to the note level, and is contained within a larger group up to the group containing the entire piece; the number of grouping levels is unspecified. A fully recursive structure is highly expressive, but may cause difficulty with implementation and makes dealing with the resulting representation more complex. It is clear that more than two levels of grouping would provide a richer representation: a tune may have a repeated section which is composed of eight bars of a steady rumba groove, followed by six bars of phrasing. It would make sense to have this represented as one large group which contained two smaller groups (see Figure 3.1). This representation is more complex to manage than one which considers only sections which are made up of sets of bars, but is ultimately richer, and allows for specification of groove at the section level, which is more appropriate than the bar level.
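As an illustration of the structure just described (a sketch with my own class names, not the project's actual representation classes from Chapter 4), groove attaches at the section level and chords at the bar level, with groups nesting to give the case shown in Figure 3.1:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch: a group contains sections or further groups;
    // a section carries one groove (Structural Assumption 4) and a run
    // of bars; a bar carries exactly one chord (Structural Assumption 3).
    abstract class Segment { }

    class Bar {
        final String chord;  // one and only one chord per bar
        Bar(String chord) { this.chord = chord; }
    }

    class Section extends Segment {
        final String groove;  // one and only one groove per section
        final List<Bar> bars = new ArrayList<>();
        Section(String groove) { this.groove = groove; }
    }

    class Group extends Segment {
        final List<Segment> children = new ArrayList<>();
    }

The repeated section from the example would then be one Group holding two Sections: eight bars over a rumba groove, followed by six bars of phrasing.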