Flexible Audio Source Separation Toolbox (FASST)
Version 1.0
User Guide

Alexey Ozerov¹, Emmanuel Vincent¹ and Frédéric Bimbot²

¹ INRIA, Centre de Rennes - Bretagne Atlantique
² IRISA, CNRS - UMR 6074
Campus de Beaulieu, 35042 Rennes cedex, France
{alexey.ozerov, emmanuel.vincent}@inria.fr, frederic.bimbot@irisa.fr

October 5, 2011
1 Introduction
This user guide describes how to use FASST, an implementation of the general flexible source separation framework presented in [1]. Before reading the user guide, you are strongly encouraged to read [1], at least the first two sections. This guide is organized as follows. Some notations and abbreviations used throughout this document are listed in Section 2. Section 3 gives a detailed specification of the mixture structure (a Matlab structure), used to define the available prior information. The main functions the user should know about are listed in Section 4, and an example of usage is given in Section 5.
2 Some abbreviations and notations

2.1 Abbreviations

Abbreviation | Meaning
------------ | -------
GMM | Gaussian Mixture Model
GSMM | Gaussian Scaled Mixture Model
HMM | Hidden Markov Model
NMF | Nonnegative Matrix Factorization
PSD | Power Spectral Density
QERB | Quadratic Equivalent Rectangular Bandwidth transform
S-HMM | Scaled Hidden Markov Model
STFT | Short-Time Fourier Transform
2.2 Notations

Symbol | Description
------ | -----------
$F$ | Number of frequency bins in the corresponding time-frequency representation
$N$ | Number of time frames in the corresponding time-frequency representation
$I$ | Number of channels (this version is only implemented for $I = 1$ or $I = 2$)
$J_{spat}$ | Number of spatial components (see Section 3)
$J_{spec}$ | Number of spectral components (see Section 3)
$R_j$ | Rank of the covariance matrix of the $j$-th spatial component
$C_j$ | Number of factors in the $j$-th spectral component ($C_j = 1$: direct model; $C_j = 2$: factored excitation-filter model)
$L_j^{ex}$ | Number of narrowband excitation spectral patterns (see [1]) in the $j$-th spec. comp.
$K_j^{ex}$ | Number of characteristic excitation spectral patterns (see [1]) in the $j$-th spec. comp.
$M_j^{ex}$ | Number of time-localized excitation patterns (see [1]) in the $j$-th spec. comp.
$L_j^{ft}$ | Number of narrowband filter spectral patterns (see [1]) in the $j$-th spec. comp.
$K_j^{ft}$ | Number of characteristic filter spectral patterns (see [1]) in the $j$-th spec. comp.
$M_j^{ft}$ | Number of time-localized filter patterns (see [1]) in the $j$-th spec. comp.
$A_j$ | Mixing parameters ($\in \mathbb{C}^{I \times R_j \times F \times N}$) in the $j$-th spatial comp. (see [1])
$W_j^{ex}$ | Narrowband excitation spectral patterns ($\in \mathbb{R}_+^{F \times L_j^{ex}}$) in the $j$-th spec. comp. (see [1])
$U_j^{ex}$ | Excitation spectral pattern weights ($\in \mathbb{R}_+^{L_j^{ex} \times K_j^{ex}}$) in the $j$-th spec. comp. (see [1])
$G_j^{ex}$ | Excitation time pattern weights ($\in \mathbb{R}_+^{K_j^{ex} \times M_j^{ex}}$) in the $j$-th spec. comp. (see [1])
$H_j^{ex}$ | Time-localized excitation patterns ($\in \mathbb{R}_+^{M_j^{ex} \times N}$) in the $j$-th spec. comp. (see [1])
$W_j^{ft}$ | Narrowband filter spectral patterns ($\in \mathbb{R}_+^{F \times L_j^{ft}}$) in the $j$-th spec. comp. (see [1])
$U_j^{ft}$ | Filter spectral pattern weights ($\in \mathbb{R}_+^{L_j^{ft} \times K_j^{ft}}$) in the $j$-th spec. comp. (see [1])
$G_j^{ft}$ | Filter time pattern weights ($\in \mathbb{R}_+^{K_j^{ft} \times M_j^{ft}}$) in the $j$-th spec. comp. (see [1])
$H_j^{ft}$ | Time-localized filter patterns ($\in \mathbb{R}_+^{M_j^{ft} \times N}$) in the $j$-th spec. comp. (see [1])
$\mathbb{R}$ | Set of real numbers
$\mathbb{R}_+$ | Set of nonnegative real numbers
$\mathbb{C}$ | Set of complex numbers

3 Mixture structure
The mixture structure is a Matlab structure that is used to incorporate prior information into the framework. The structure has a hierarchical organization that can be seen from the example in Figure 1. Global parameters (e.g., signal representation) are defined on the first level of the hierarchy. The second level consists of $J_{spat}$ spatial components and $J_{spec}$ spectral components. Each source is typically modeled by one spectral component, although some sources (e.g., drums) might be modeled by several spectral components (e.g., bass drum, snare, etc.). Furthermore, each spectral component must be associated with one spatial component, and each spatial component must have at least one spectral component associated with it.¹ Compared to the description of the framework in [1], this implementation is more general in the sense that the number of spectral components is not necessarily equal to that of spatial components; more precisely, $J_{spec} \geq J_{spat}$. The third level of the hierarchy consists in factorizing each spectral component into one or more factors representing, for instance, excitation and filter structures (see [1]).² Finally, on the fourth level of the hierarchy, each factor is represented as the product of three or four matrices (see Figure 5), which are not represented in Figure 1. For instance, the factor representing the excitation structure is represented either as the product of four matrices $W_j^{ex} U_j^{ex} G_j^{ex} H_j^{ex}$ representing, respectively, narrowband spectral patterns, spectral pattern weights, time pattern weights and time-localized patterns (see [1]), or as the product of three matrices $W_j^{ex} U_j^{ex} G_j^{ex}$ when $H_j^{ex}$ is marked by the empty matrix [].³ Almost all the fields of the mixture structure must be filled as specified in Figures 2, 3, 4 and 5; the exceptions are those marked by the empty matrix [].

¹ This extension makes it possible to model the fact that several sources have the same direction, which is very often the case for professionally produced music recordings. It is implemented by simply adding the power spectrograms of the spectral components corresponding to the same spatial component.
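The association rules between spectral and spatial components can be checked mechanically. The following is a minimal sketch, not part of FASST: it builds a toy structure with two spatial and three spectral components, using the field names spat_comps, spec_comps and spat_comp_ind from the specification in this guide. The toy values and the helper variables n_assoc and is_valid are illustrative assumptions.

```matlab
% Toy mixture structure: 2 spatial components, 3 spectral components.
% Only the fields needed for the association check are filled in.
mix_str.spat_comps = cell(1, 2);
mix_str.spec_comps = cell(1, 3);
mix_str.spec_comps{1}.spat_comp_ind = 1;   % e.g., bass drum
mix_str.spec_comps{2}.spat_comp_ind = 1;   % e.g., snare, same direction
mix_str.spec_comps{3}.spat_comp_ind = 2;   % e.g., melody

J_spat = numel(mix_str.spat_comps);
J_spec = numel(mix_str.spec_comps);

% Count how many spectral components point to each spatial component.
n_assoc = zeros(1, J_spat);
for j = 1:J_spec
    ind = mix_str.spec_comps{j}.spat_comp_ind;
    assert(ind >= 1 && ind <= J_spat && ind == round(ind), ...
        'spat_comp_ind must be an integer in 1..J_spat');
    n_assoc(ind) = n_assoc(ind) + 1;
end

% Every spatial component needs at least one spectral component,
% which also implies J_spec >= J_spat.
is_valid = all(n_assoc >= 1);
fprintf('J_spat = %d, J_spec = %d, valid = %d\n', J_spat, J_spec, is_valid);
```

In this toy case the first spatial component is shared by two spectral components, which corresponds to the situation of several sources having the same direction.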
Figure 1: Visualization of a mixture structure example.
4 Main functions
The user should know about three main functions, comp_transf_Cx, estim_param_a_post_model and separate_spec_comps, which respectively compute the input time-frequency transform, estimate the model parameters and separate the spectral components. The headers of these functions are listed in Figures 6, 7 and 8.

² Note that [1] describes the usage of two factors (excitation and filter). The implementation presented here is more flexible, since one can use any number of factors $C_j$, and it reduces to [1] when $C_j = 2$. This is done for convenience of usage. For example, if one needs to implement an excitation model only or a filter model only (direct model), one simply needs to choose $C_j = 1$ without bothering to specify and process an additional dummy factor.

³ In [1] only the case of four matrices is considered, and the case of three matrices $W_j^{ex} U_j^{ex} G_j^{ex}$ is simply equivalent to fixing $H_j^{ex}$ to the N × N identity matrix. Since N may be quite big, we fix $H_j^{ex}$ to [] by convention in the latter case in order to avoid storing a big identity matrix in memory.
5 Examples of usage
The user should also know how to fill and browse the mixture structure and how to use the three functions mentioned above. An example of mixture structure filling and browsing is given in Figures 9 and 10. An example script for the separation of an instantaneous mixture of music signals is given in Figure 11. The function EXAMPLE_prof_rec_sep_drums_bass_melody.m contains a more sophisticated example allowing the separation of the following four sources from a stereo music recording: drums, bass, melody (singing voice or leading melodic instrument) and remaining sounds. Due to memory limits in Matlab, this function cannot process sound excerpts longer than 30 seconds. For full-length music recordings, the function EXAMPLE_prof_rec_sep_drums_bass_melody_FULL.m should be used. This function simply cuts the full recording into small parts and applies EXAMPLE_prof_rec_sep_drums_bass_melody.m to each of them.
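The idea of cutting a long recording into parts of at most 30 seconds can be sketched as follows. This is only an illustration under stated assumptions: the segments are taken as consecutive and non-overlapping, and the example values of fs and nsampl as well as the variable names seg_len, seg_beg and seg_end are hypothetical. The guide does not specify how the FULL function actually chooses its segment boundaries.

```matlab
fs      = 44100;            % sampling frequency in Hz (example value)
nsampl  = 95 * fs;          % length of a 95-second recording (example value)
seg_len = 30 * fs;          % maximum segment length: 30 seconds

% Start and end sample indices of consecutive, non-overlapping segments.
seg_beg = 1:seg_len:nsampl;
seg_end = min(seg_beg + seg_len - 1, nsampl);

for s = 1:numel(seg_beg)
    fprintf('segment %d: samples %d to %d (%.1f s)\n', s, ...
        seg_beg(s), seg_end(s), (seg_end(s) - seg_beg(s) + 1) / fs);
    % x_seg = x(:, seg_beg(s):seg_end(s));  % part to be processed separately
end
```

Each segment of an [I x nsampl] signal x would then be processed on its own, as indicated by the commented line.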
References

[1] A. Ozerov, E. Vincent, and F. Bimbot, “A general flexible framework for the handling of prior information in audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, to appear. [Online]. Available: http://hal.inria.fr/hal-00626962/

[2] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 3, pp. 550–563, March 2010.
Field | Description | Value
----- | ----------- | -----
Cx | F × N × I × I complex-valued tensor of local mixture covariances | $\in \mathbb{C}^{F \times N \times I \times I}$
transf | Input time-frequency transform | 'stft' for STFT; 'qerb' for QERB
fs | Sampling frequency in Hz | $\in \{16000, 44100, \ldots\}$
wlen | Analysis window length (used to compute STFT or QERB) in samples | $\in \{512, 1024, \ldots\}$
Noise_PSD | F × 1 real-valued nonnegative vector of additive noise PSD, e.g., for annealing | $\in \mathbb{R}_+^{F \times 1}$ or []
spat_comps | 1 × $J_{spat}$ cell array of spatial component structures | see Figure 3
spec_comps | 1 × $J_{spec}$ cell array of spectral component structures | see Figure 4

Figure 2: Specification of the mixture structure (mix_str).
Field | Description | Value
----- | ----------- | -----
time_dep | Stationarity of mixing | 'indep' for time-invariant mixing; 'dep' for time-varying mixing
mix_type | Mixing type | 'inst' for instantaneous (freq.-indep.); 'conv' for convolutive (freq.-dep.)
frdm_prior | Degree of adaptability | 'free' for adaptive; 'fixed' for fixed
params | Tensor of mixing parameters (corresponding to $A_j$ from [1]) | $\in \mathbb{R}^{I \times R_j}$ for mix_type = 'inst'; $\in \mathbb{C}^{I \times R_j \times F}$ for mix_type = 'conv'

Figure 3: Specification of the spatial component structure (spat_comps{j}, j = 1, ..., $J_{spat}$).
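The two shapes of the params field can be illustrated by filling one spatial component of each mixing type. This is a minimal sketch with random toy values; the variable names spat_inst and spat_conv and the chosen sizes are illustrative assumptions, while the field names and admissible values are those of the specification above.

```matlab
I = 2;        % number of channels
R = 1;        % rank R_j of the spatial covariance
F = 513;      % number of frequency bins

% Instantaneous (frequency-independent) mixing: real I x R matrix.
spat_inst.time_dep   = 'indep';
spat_inst.mix_type   = 'inst';
spat_inst.frdm_prior = 'free';
spat_inst.params     = randn(I, R);

% Convolutive (frequency-dependent) mixing: complex I x R x F tensor.
spat_conv.time_dep   = 'indep';
spat_conv.mix_type   = 'conv';
spat_conv.frdm_prior = 'free';
spat_conv.params     = randn(I, R, F) + 1i * randn(I, R, F);

size_inst = size(spat_inst.params);
size_conv = size(spat_conv.params);
```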
Field | Description | Value
----- | ----------- | -----
spat_comp_ind | Index of the corresponding spatial component | $\in \{1, \ldots, J_{spat}\}$
factors | 1 × $L_j$ cell array of factor structures | see Figure 5

Figure 4: Specification of the spectral component structure (spec_comps{j}, j = 1, ..., $J_{spec}$).
Field | Description | Value
----- | ----------- | -----
FB_frdm_prior | Degree of adaptability for narrowband spectral patterns | 'free' for adaptive; 'fixed' for fixed
FW_frdm_prior | Degree of adaptability for spectral pattern weights | 'free' for adaptive; 'fixed' for fixed
TW_frdm_prior | Degree of adaptability for time pattern weights | 'free' for adaptive; 'fixed' for fixed
TB_frdm_prior | Degree of adaptability for time-localized patterns | 'free' for adaptive; 'fixed' for fixed
FB | Narrowband spectral patterns (Frequency Blobs) (corresponding to $W_j^{ex}$ or $W_j^{ft}$) | $\in \mathbb{R}_+^{F \times L_j^{ex}}$ or $\in \mathbb{R}_+^{F \times L_j^{ft}}$
FW | Spectral pattern weights (Frequency Weights) (corresponding to $U_j^{ex}$ or $U_j^{ft}$) | $\in \mathbb{R}_+^{L_j^{ex} \times K_j^{ex}}$ or $\in \mathbb{R}_+^{L_j^{ft} \times K_j^{ft}}$
TW | Time pattern weights (Time Weights) (corresponding to $G_j^{ex}$ or $G_j^{ft}$) | $\in \mathbb{R}_+^{K_j^{ex} \times M_j^{ex}}$ or $\in \mathbb{R}_+^{K_j^{ft} \times M_j^{ft}}$
TB | Time-localized patterns (Time Blobs) (corresponding to $H_j^{ex}$ or $H_j^{ft}$) | $\in \mathbb{R}_+^{M_j^{ex} \times N}$, $\in \mathbb{R}_+^{M_j^{ft} \times N}$ or []
TW_constr | Constraint on the time pattern weights (note that nontrivial constraints, i.e., different from 'NMF', are not compatible with nonempty time patterns TB) | 'NMF' for no constraint; 'GMM' for GMM; 'HMM' for HMM; 'GSMM' for GSMM; 'SHMM' for S-HMM
TW_DP_params | Discrete probability (DP) parameters for the time pattern weights (needed only when TW_constr ≠ 'NMF') | 1 × $K_j^{ex}$ (1 × $K_j^{ft}$) vector of Gaussian weights for GMM or GSMM; $K_j^{ex} \times K_j^{ex}$ ($K_j^{ft} \times K_j^{ft}$) matrix of transition probabilities for HMM or S-HMM
TW_DP_frdm_prior | Degree of adaptability for DP parameters (needed only when TW_constr ≠ 'NMF') | 'free' for adaptive; 'fixed' for fixed
TW_all | Matrix of all time weights (corresponding to $\tilde{G}_j^{ex}$ or $\tilde{G}_j^{ft}$ from [1]) (needed only when TW_constr ≠ 'NMF') | Nonnegative real-valued matrix of the same size as TW

Figure 5: Specification of the spectral component factor structure (factors{l}, l = 1, ..., $L_j$).
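To make the role of the four matrices concrete, the sketch below multiplies the fields FB, FW, TW and TB of a single toy factor into an F × N nonnegative matrix, treating an empty TB as the identity matrix as described in this guide. The dimensions, the random values and the variable names fac, V and V_explicit are illustrative assumptions; this is not the toolbox's internal code.

```matlab
% Toy dimensions (illustrative only).
F = 6; L = 4; K = 3; N = 5;

% Random nonnegative factor matrices, named after the structure fields.
fac.FB = abs(randn(F, L));   % narrowband spectral patterns
fac.FW = abs(randn(L, K));   % spectral pattern weights
fac.TW = abs(randn(K, N));   % time pattern weights
fac.TB = [];                 % empty: stands for the identity matrix

% Product of three or four matrices, depending on whether TB is empty.
V = fac.FB * fac.FW * fac.TW;
if ~isempty(fac.TB)
    V = V * fac.TB;
end

% Same result with an explicit N x N identity matrix in place of [].
V_explicit = fac.FB * fac.FW * fac.TW * eye(N);
fprintf('size(V) = [%d %d], max difference = %g\n', ...
    size(V, 1), size(V, 2), max(abs(V(:) - V_explicit(:))));
```

Leaving TB empty gives the same matrix as the explicit identity while avoiding the storage of an N × N matrix.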
function Cx = comp_transf_Cx(x, transf, win_len, fs, qerb_nbin)
%
% Cx = comp_transf_Cx(x, transf, win_len, fs, qerb_nbin);
%
% compute spatial covariance matrices for the corresponding transform
%
%
% input
% -----
%
% x                 : [I x nsampl] matrix containing I time-domain mixture signals
%                     with nsampl samples
% transf            : transform
%                       'stft'
%                       'qerb'
% win_len           : window length
% fs                : (opt) sampling frequency (Hz)
% qerb_nbin         : (opt) number of bins for qerb transform
%
% output
% ------
%
% Cx                : [F x N x I x I] matrix containing the spatial covariance
%                     matrices of the input signal in all time-frequency bins
%

Figure 6: comp_transf_Cx: FASST function for the computation of the input time-frequency transform.
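As a rough illustration of what the [F x N x I x I] output contains, the sketch below builds local covariances from a toy complex-valued transform. It assumes that each covariance is the plain outer product of the channel vector with itself at one time-frequency bin; the actual comp_transf_Cx may proceed differently (for instance for the QERB transform). The random data and the variable names X, x_fn and C11 are illustrative.

```matlab
% Toy complex-valued time-frequency transform X of size F x N x I
% (illustrative random data, standing in for an STFT of a stereo mixture).
F = 4; N = 3; I = 2;
X = randn(F, N, I) + 1i * randn(F, N, I);

% ASSUMPTION: Cx(f, n, :, :) holds the I x I outer product x * x' of the
% channel vector x at bin (f, n), where ' is the conjugate transpose.
Cx = zeros(F, N, I, I);
for f = 1:F
    for n = 1:N
        x_fn = reshape(X(f, n, :), I, 1);
        Cx(f, n, :, :) = x_fn * x_fn';
    end
end

% Each local covariance is Hermitian with a real nonnegative diagonal.
C11 = reshape(Cx(1, 1, :, :), I, I);
fprintf('Hermitian: %d\n', norm(C11 - C11') < 1e-12);
```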
function [mix_str, log_like_arr] = estim_param_a_post_model(mix_str_inp, ...
    iter_num, sim_ann_opt, Ann_PSD_beg, Ann_PSD_end)
%
% [mix_str, log_like_arr] = estim_param_a_post_model(mix_str_inp, ...
%        iter_num, sim_ann_opt, Ann_PSD_beg, Ann_PSD_end);
%
% estimate a posteriori mixture model parameters
%
%
% input
% -----
%
% mix_str_inp       : input mixture structure
% iter_num          : (opt) number of EM iterations (def = 100)
% sim_ann_opt       : (opt) simulated annealing option (def = 'ann')
%                        'no_ann'     : no annealing (zero noise)
%                        'ann'        : annealing
%                        'ann_ns_inj' : annealing with noise injection
%                        'upd_ns_prm' : update noise parameters
%                                       (Noise_PSD is updated through EM)
% Ann_PSD_beg       : (opt) [F x 1] beginning vector of annealing noise PSD
%                           (def = X_power / 100)
% Ann_PSD_end       : (opt) [F x 1] end vector of annealing noise PSD
%                           (def = X_power / 10000)
%
%
% output
% ------
%
% mix_str           : estimated output mixture structure
% log_like_arr      : array of log-likelihoods
%

Figure 7: estim_param_a_post_model: FASST function for the estimation of the model parameters.
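The optional annealing vectors can also be prepared by hand. The sketch below is a hedged illustration: the header only states the defaults X_power / 100 and X_power / 10000 and does not define X_power, so the code assumes that X_power is the mixture power averaged over time frames and channels, one value per frequency bin. The toy covariances and the variable names are illustrative.

```matlab
% Toy local covariances Cx of size F x N x I x I (illustrative random data).
F = 5; N = 8; I = 2;
X = randn(F, N, I) + 1i * randn(F, N, I);
Cx = zeros(F, N, I, I);
for i1 = 1:I
    for i2 = 1:I
        Cx(:, :, i1, i2) = X(:, :, i1) .* conj(X(:, :, i2));
    end
end

% ASSUMPTION: X_power is taken here as the mixture power averaged over
% time frames and channels, one value per frequency bin ([F x 1]).
X_power = zeros(F, 1);
for i1 = 1:I
    X_power = X_power + mean(real(Cx(:, :, i1, i1)), 2) / I;
end

% Documented default levels for the beginning and end of annealing.
Ann_PSD_beg = X_power / 100;
Ann_PSD_end = X_power / 10000;
```

Following the header above, such vectors would be passed as the fourth and fifth arguments, e.g. estim_param_a_post_model(mix_str, 100, 'ann', Ann_PSD_beg, Ann_PSD_end).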
function ie = separate_spec_comps(x, mix_str, sep_cmp_inds)
%
% ie = separate_spec_comps(x, mix_str, sep_cmp_inds);
%
% separate spectral components
%
%
% input
% -----
%
% x                 : [nchan x nsampl] mixture signal
% mix_str           : input mix structure
% sep_cmp_inds      : (opt) array of indices for components to separate
%                     (def = {1, 2, ..., K_spec})
%
%
% output
% ------
%
% ie                : [K_sep x nsampl x nchan] estimated spectral components
%                     images, where K_sep = length(sep_cmp_inds) is
%                     the number of components to separate
%

Figure 8: separate_spec_comps: FASST function for the separation of the spectral component signals.
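The [K_sep x nsampl x nchan] layout of the output means that one component has to be reshaped before it can be handled as an ordinary multichannel signal. The following minimal sketch uses random data in place of a real separation result; the sizes and the variable names k and img_k are illustrative assumptions.

```matlab
% Toy output with the documented layout [K_sep x nsampl x nchan]
% (random data standing in for the result of separate_spec_comps).
K_sep = 3; nsampl = 1000; nchan = 2;
ie = randn(K_sep, nsampl, nchan);

% Extract the image of component k as an [nsampl x nchan] matrix,
% i.e., one column per channel, ready to be written to an audio file.
k = 2;
img_k = reshape(ie(k, :, :), nsampl, nchan);

% squeeze gives the same matrix here, since only the first dimension
% of the slice is a singleton.
same_as_squeeze = isequal(img_k, squeeze(ie(k, :, :)));
```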
function mix_str = init_mix_struct_Mult_NMF_inst(Cx, J, K, transf, fs, wlen)
%
% mix_str = init_mix_struct_Mult_NMF_inst(Cx, J, K, transf, fs, wlen);
%
% An example of mixture structure initialization, corresponding to
% multichannel NMF model (instantaneous case)
% Most of parameters are initialized randomly
%
% input
% -----
%
% Cx                : [F x N x I x I] matrix containing the spatial covariance
%                     matrices of the input signal in all time-frequency bins
%                     or [F x N] single channel variance matrix
% J                 : number of components (here J_spat = J_spec)
% K                 : number of NMF components per source
% transf            : transform ('stft' or 'qerb')
% fs                : sampling frequency in Hz
% wlen              : length of the time integration window (must be a power of 2)
%
% output
% ------
%
% mix_str           : initialized mixture structure
%

rank = 1;

[F, N, I, I] = size(Cx);

mix_str.Cx         = Cx;
mix_str.transf     = transf;
mix_str.fs         = fs;
mix_str.wlen       = wlen;
mix_str.spat_comps = cell(1, J);
mix_str.spec_comps = cell(1, J);

for j = 1:J
    % initialize spatial component
    mix_str.spat_comps{j}.time_dep   = 'indep';
    mix_str.spat_comps{j}.mix_type   = 'inst';
    mix_str.spat_comps{j}.frdm_prior = 'free';
    mix_str.spat_comps{j}.params     = randn(I, rank);

    % initialize single factor spectral component
    mix_str.spec_comps{j}.spat_comp_ind = j;
    mix_str.spec_comps{j}.factors       = cell(1, 1);

    factor1.FB            = 0.75 * abs(randn(F, K)) + 0.25 * ones(F, K);
    factor1.FW            = diag(ones(1, K));
    factor1.TW            = 0.75 * abs(randn(K, N)) + 0.25 * ones(K, N);
    factor1.TB            = [];
    factor1.FB_frdm_prior = 'free';
    factor1.FW_frdm_prior = 'fixed';
    factor1.TW_frdm_prior = 'free';
    factor1.TB_frdm_prior = [];
    factor1.TW_constr     = 'NMF';

    mix_str.spec_comps{j}.factors{1} = factor1;
end;

Figure 9: Example of filling of the mixture structure corresponding to the multichannel NMF method [2] (instantaneous case).
>> mix_str

mix_str =
            Cx: [4-D double]
        transf: 'stft'
            fs: 16000
          wlen: 1024
    spat_comps: {[1x1 struct]  [1x1 struct]  [1x1 struct]}
    spec_comps: {[1x1 struct]  [1x1 struct]  [1x1 struct]}
     Noise_PSD: [513x1 double]

>> mix_str.spat_comps{2}

ans =
      time_dep: 'indep'
      mix_type: 'inst'
    frdm_prior: 'free'
        params: [2x1 double]

>> mix_str.spec_comps{3}

ans =
    spat_comp_ind: 3
          factors: {[1x1 struct]}

>> mix_str.spec_comps{3}.factors{1}

ans =
               FB: [513x4 double]
               FW: [4x4 double]
               TW: [4x98 double]
               TB: []
    FB_frdm_prior: 'free'
    FW_frdm_prior: 'fixed'
    TW_frdm_prior: 'free'
    TB_frdm_prior: []
        TW_constr: 'NMF'

Figure 10: Browsing in Matlab of the example mixture structure in Figure 9.
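Instead of browsing interactively, the same information can be collected in a loop. The sketch below walks through the spectral components and factors of a small toy structure and prints the size of each factor matrix. The toy structure only mimics the sizes shown in the browsing example (F = 513, K = 4, N = 98); the variable names fac, comp and names are illustrative assumptions.

```matlab
% Small toy structure with one spectral component and one factor
% (sizes match the browsing example: F = 513, K = 4, N = 98).
fac.FB = ones(513, 4);
fac.FW = eye(4);
fac.TW = ones(4, 98);
fac.TB = [];
mix_str.spec_comps = {struct('spat_comp_ind', 1, 'factors', {{fac}})};

% Walk through spectral components and factors, printing matrix sizes.
names = {'FB', 'FW', 'TW', 'TB'};
for j = 1:numel(mix_str.spec_comps)
    comp = mix_str.spec_comps{j};
    for l = 1:numel(comp.factors)
        for m = 1:numel(names)
            sz = size(comp.factors{l}.(names{m}));
            fprintf('spec_comps{%d}.factors{%d}.%s: %d x %d\n', ...
                j, l, names{m}, sz(1), sz(2));
        end
    end
end
```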
function EXAMPLE_ssep_Mult_NMF_inst()

data_dir    = 'example_data/';
result_dir  = 'example_data/';
file_prefix = 'Shannon_Hurley__Sunrise__inst_';

transf    = 'stft';
wlen      = 1024;
nsrc      = 3;    % number of sources
NMF_ncomp = 4;    % number of NMF components
iter_num  = 200;

% load mixture
fprintf('Input time-frequency representation\n');
[x, fs, nbins] = wavread([data_dir file_prefix '_mix.wav']);
x = x.';
mix_nsamp = size(x, 2);

% compute time-frequency representation
Cx = comp_transf_Cx(x, transf, wlen, fs);

% fill in mixture structure
mix_str = init_mix_struct_Mult_NMF_inst(Cx, nsrc, NMF_ncomp, transf, fs, wlen);

% reinitialize mixing parameters
A = [sin(pi/8), sin(pi/4), sin(3*pi/8); cos(pi/8), cos(pi/4), cos(3*pi/8)];
for j = 1:nsrc
    mix_str.spat_comps{j}.params = A(:, j);
end;

% run parameters estimation (with simulated annealing)
mix_str = estim_param_a_post_model(mix_str, iter_num, 'ann');

% source separation
ie_EM = separate_spat_comps(x, mix_str);

% Computation of the spatial source images
fprintf('Computation of the spatial source images\n');
for j = 1:nsrc,
    wavwrite(reshape(ie_EM(j, :, :), mix_nsamp, 2), fs, nbins, ...
        [result_dir file_prefix '_sim_' int2str(j) '.wav']);
end

Figure 11: Example of usage involving all three main functions (runs the multichannel NMF method [2] in the instantaneous case).
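The mixing matrix A used to reinitialize the spatial components in the example has one column per source, built from the angles pi/8, pi/4 and 3*pi/8. The short sketch below rebuilds it from a vector of angles and checks two simple properties. The variable names theta, col_norms and R2 are illustrative, and interpreting A(:, j) * A(:, j)' as a rank-1 spatial covariance is an assumption based on the rank-1 setting of the example.

```matlab
% Mixing directions used in the example: angles pi/8, pi/4 and 3*pi/8.
theta = [pi/8, pi/4, 3*pi/8];
A = [sin(theta); cos(theta)];      % 2 x 3, one column per source

% Each column has unit Euclidean norm, since sin^2 + cos^2 = 1.
col_norms = sqrt(sum(A .^ 2, 1));

% ASSUMPTION: rank-1 spatial covariance of source 2, A(:, 2) * A(:, 2)'.
% For the angle pi/4 all four entries are equal to 0.5.
R2 = A(:, 2) * A(:, 2)';
fprintf('column norms: %s\n', mat2str(col_norms, 4));
```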