Jacob Benesty · Yiteng Huang (Eds.)

Adaptive Signal Processing: Applications to Real-World Problems

With 122 Figures
Jacob Benesty · Yiteng Huang
Bell Labs, Lucent Technologies
700 Mountain Avenue
Murray Hill, NJ 07974-0636, USA
Library of Congress Cataloging-in-Publication Data

Benesty, Jacob.
Adaptive signal processing: applications to real-world problems / Jacob Benesty, Yiteng Huang.
p. cm.
Includes bibliographical references.
ISBN 978-3-642-05507-2
ISBN 978-3-662-11028-7 (eBook)
DOI 10.1007/978-3-662-11028-7
1. Adaptive signal processing. I. Huang, Yiteng. II. Title
TK5102.9.B4515 2003
621.382'2--dc21
2002036471
ISBN 978-3-642-05507-2

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in other ways, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag Berlin Heidelberg GmbH. Violations are liable for prosecution under the German Copyright Law.
http://www.springer.de

© Springer-Verlag Berlin Heidelberg 2003
Originally published by Springer-Verlag Berlin Heidelberg New York in 2003
Softcover reprint of the hardcover 1st edition 2003

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typesetting: Digital data supplied by editors
Cover design: design & production GmbH, Heidelberg
Printed on acid-free paper
Preface
By adaptive signal processing, we mean, in general, adaptive filtering. In unknown environments where we need to model, identify, or track time-varying channels, adaptive filtering has proven to be an effective and powerful tool. As a result, this tool is now in use in many different fields. Since the invention, by Widrow and Hoff in 1959, of one of the first adaptive filters, the so-called least-mean-square, many applications that can make use of this fundamental concept have appeared. While the number of applications using adaptive algorithms has kept flourishing, thanks to several successes, the need for more sophisticated adaptive algorithms has become obvious as real-world problems grow more complex and more demanding. Even though the theory of adaptive filtering is already a well-established topic in signal processing, new and improved concepts are discovered every year by researchers. Some of these recent approaches are discussed in this book.

The goal of this book is to provide, for the first time, a reference to the hottest real-world applications where adaptive filtering techniques play an important role. To do so, we invited top researchers in different fields to contribute chapters addressing their specific topics of study. Thousands of pages would probably not be enough to describe all the practical applications utilizing adaptive algorithms. Therefore, we limited the topics to some important applications in acoustics, speech, wireless, and networking, where research is still very active and open.

This book is roughly organized into two parts. In the first part (Chap. 1 to Chap. 8), several applications in acoustics and speech are developed. The second part (Chap. 9 to Chap. 12) focuses on wireless and networking applications. Some chapters are tutorial in nature while others present new research ideas. Clearly, all the chapters have one thing in common: adaptive algorithms play a key role in solving real-world problems.

Chapter 1 gives some new directions in adaptive filtering for sparse impulse responses. A class of exponentiated gradient algorithms is discussed, analyzed, and compared to classical adaptive algorithms such as the stochastic gradient algorithm. It is also demonstrated how the concept of the exponentiated gradient can be used for blind multichannel identification.

In a hearing aid, the output transducer (receiver) generates mechanical and acoustic signals that are unintentionally fed back to the microphone. Feedback cancellation works by using an adaptive filter to model the feedback path, with the output of the adaptive filter subtracted from the microphone signal to cancel the feedback. Chapter 2 reviews the nature of the feedback path and describes the approaches used to implement feedback cancellation.
Limitations on system performance imposed by room reverberation are also described.

Chapter 3 is meant as an introduction to the topic of single-channel acoustic echo cancellation. It focuses on the hands-free telephone as one of the major applications for echo cancelling devices. Beginning with a brief discussion of a system formed by a loudspeaker and a microphone located within the same enclosure, the properties of speech signals and noise are described. Algorithms for the adaptation of the echo cancelling filter are also described. Because of its robustness and its low computational complexity, the NLMS algorithm is primarily applied. Measures to improve the speed of convergence and to avoid divergence in case of double-talk or strong local noise are discussed.

Chapter 4 presents a new general class of algorithms for multichannel adaptive filtering, as required for multichannel acoustic echo cancellation. Based on a new rigorous derivation from a recursive least-squares criterion in the frequency domain, the approach leads to powerful and efficient multichannel algorithms that explicitly account for the possibly high cross-correlations between the channels, but it also covers known single-channel algorithms as special cases.

Chapter 5 focuses on (adaptive and non-adaptive) filtering techniques to mitigate noise effects in speech communications. Three approaches are studied: beamforming using multiple microphone sensors; adaptive noise cancellation utilizing a primary sensor to pick up the noisy speech signal and an auxiliary or reference sensor to measure the noise field; and spectral modification exploiting only a single sensor. The strengths and weaknesses of these approaches are elaborated and their performance (with respect to noise reduction) is studied and compared.

Chapter 6 provides an overview of adaptive beamforming techniques for speech and audio signal acquisition. Basic concepts of optimum adaptive antenna arrays are reviewed and it is shown how these methods may be applied to meet the requirements of audio signal processing. A unified view is provided, especially by dropping narrowband and stationarity assumptions and by using time-domain least-squares instead of frequency-domain minimum mean-square criteria.

Chapter 7 introduces the blind source separation (BSS) of convolutive mixtures of acoustic signals, especially speech. A statistical and computational technique, called independent component analysis (ICA), is examined. By achieving nonlinear decorrelation, nonstationary decorrelation, or time-delayed decorrelation, the source signals can be recovered from the observed mixed signals. Particular attention is paid to the physical interpretation of BSS from an acoustical signal processing point of view.

Chapter 8 is devoted to multichannel time delay estimation based on blind system identification for acoustic source localization. Both time-domain and frequency-domain adaptive blind multichannel identification algorithms are
developed. The benefit of adaptive blind system identification methods for multichannel time delay estimation in reverberant environments is emphasized throughout this chapter.

In Chap. 9, an overview of classical adaptive equalizer techniques is presented, including linear, decision feedback, and fractionally spaced equalizers. Special emphasis is given to techniques applied in modern wireless systems where channels are frequency- and time-dispersive. Many basic concepts are explained and brought into the context of multiple-input multiple-output systems as they will appear in the near future in wireless communication systems. A short overview of blind techniques is given, demonstrating the potential of new signal processing techniques even better suited to the particular needs of wireless communications.

In Chap. 10, adaptive space-time processing for wireless receivers in CDMA networks is considered. Currently, the 2D RAKE is the most widely used space-time array-processor; it combines multipath signals sequentially, first in space, then in time. Incremental processing improvements are introduced, arriving ultimately at a more efficient one-dimensional joint space-time (1D-ST) adaptive processor named STAR (spatio-temporal array-receiver). STAR improves the receiver's performance by approaching blind coherent space-time maximum ratio combining (ST-MRC).

The demand for wireless local area networks (WLANs) based on the IEEE 802.11 standard is growing rapidly. This standard makes use of a carrier-sensing technique, and packet collisions occur because the stations in a system are essentially uncoordinated. In Chap. 11, an IEEE 802.11 system with multiple receive antennas is examined. Multiple receive antennas can detect multiple packets and also reduce packet errors due to channel errors. Results are presented that demonstrate significant improvement in the throughput of the system when multiple receive antennas are used.

The end-to-end delay is often used to analyze network performance. There are different types of delay in the Internet network: (artificial) delay due to unsynchronized clocks, transmission and propagation delays, and delay jitter. In Chap. 12, it is shown how to obtain a least-squares estimate of the clock skew (i.e., the difference between the sender and receiver clock frequencies) and the fixed delay. It is also shown how to adaptively estimate the delay jitter, and an unbiased recursive least-squares algorithm is proposed to estimate the clock skew and fixed delay.

We hope this book will serve as a guide for researchers, Ph.D. students, and developers whose main interest is in applications of adaptive signal processing. We also hope it will inspire many of its readers and be the source of new ideas to come.
Bell Laboratories, Murray Hill
August 2002

Jacob Benesty
Yiteng (Arden) Huang
Contents
1 On a Class of Exponentiated Adaptive Algorithms for the Identification of Sparse Impulse Responses
Jacob Benesty, Yiteng (Arden) Huang, Dennis R. Morgan
  1.1 Introduction
  1.2 Derivation of the Different Algorithms
  1.3 Link Between the LMS and EG Algorithms and Normalized Versions
  1.4 The RLS and ERLS Algorithms
  1.5 Link Between the PNLMS and EG± Algorithms
  1.6 Application of the EG± Algorithm for Blind Channel Identification
  1.7 Simulations
  1.8 Conclusions
  References

2 Adaptive Feedback Cancellation in Hearing Aids
James M. Kates
  2.1 Introduction
  2.2 Steady-State Analysis
  2.3 The Feedback Path
  2.4 Real-World Processing Concerns
  2.5 Feedback Cancellation System
    2.5.1 Initialization
    2.5.2 Running Adaptation
    2.5.3 Performance Metric
  2.6 Constrained Adaptation
    2.6.1 Adaptation with Clamp
    2.6.2 Adaptation with Cost Function
    2.6.3 Simulation Results
  2.7 Filtered-X System
  2.8 Room Reverberation Effects
    2.8.1 Test Configuration
    2.8.2 Initialization and Measurement Procedure
    2.8.3 Measured Feedback Path
    2.8.4 Maximum Stable Gain
  2.9 Conclusions
  References

3 Single-Channel Acoustic Echo Cancellation
Eberhard Hänsler, Gerhard Schmidt
  3.1 Introduction
  3.2 Settings
    3.2.1 Loudspeaker-Enclosure-Microphone Systems
    3.2.2 Electronic Replica of LEM Systems
    3.2.3 Speech Signals
    3.2.4 Background Noise
    3.2.5 Regulations
  3.3 Methods for Acoustic Echo Control
    3.3.1 Loss Control
    3.3.2 Echo Cancellation
    3.3.3 Echo Suppression
  3.4 Adaptive Algorithms
    3.4.1 NLMS Algorithm
    3.4.2 AP Algorithm
    3.4.3 RLS Algorithm
  3.5 Adaptation Control
    3.5.1 Optimal Step Size for the NLMS Algorithm
    3.5.2 A Method for Estimating the Optimal Step Size
  3.6 Suppression of Residual Echoes
  3.7 Processing Structures
    3.7.1 Fullband Processing
    3.7.2 Block Processing
    3.7.3 Subband Processing
  3.8 Conclusions
  References

4 Multichannel Frequency-Domain Adaptive Filtering with Application to Multichannel Acoustic Echo Cancellation
Herbert Buchner, Jacob Benesty, Walter Kellermann
  4.1 Introduction
  4.2 General Derivation of Multichannel Frequency-Domain Algorithms
    4.2.1 Optimization Criterion
    4.2.2 Normal Equation
    4.2.3 Adaptation Algorithm
  4.3 Convergence Analysis
    4.3.1 Analysis Model
    4.3.2 Convergence in Mean
    4.3.3 Convergence in Mean Square
  4.4 Generalized Frequency-Domain Adaptive MIMO Filtering
  4.5 Approximation and Special Cases
    4.5.1 Approximation of the Frequency-Domain Kalman Gain
    4.5.2 Special Cases
  4.6 A Dynamical Regularization Strategy
  4.7 Efficient Multichannel Realization
    4.7.1 Efficient Calculation of the Frequency-Domain Kalman Gain
    4.7.2 Dynamical Regularization for Proposed Kalman Gain Approach
    4.7.3 Efficient DFT Calculation of Overlapping Data Blocks
  4.8 Simulations and Real-World Applications
    4.8.1 Multichannel Acoustic Echo Cancellation
    4.8.2 Adaptive MIMO Filtering for Hands-Free Speech Communication
  4.9 Conclusions
  References

5 Filtering Techniques for Noise Reduction and Speech Enhancement
Jingdong Chen, Yiteng (Arden) Huang, Jacob Benesty
  5.1 Introduction
  5.2 Noise Reduction with an Array
  5.3 Adaptive Noise Cancellation
  5.4 Spectral Modification with a Single Microphone
    5.4.1 Parametric Spectral Subtraction
    5.4.2 Estimation of the Noise Spectrum
    5.4.3 Parametric Wiener Filtering
    5.4.4 Estimation of the Wiener Gain Filter
  5.5 Conclusions
  References

6 Adaptive Beamforming for Audio Signal Acquisition
Wolfgang Herbordt, Walter Kellermann
  6.1 Introduction
  6.2 Signal Model, Sensor Arrays, and Concepts
    6.2.1 Sensor Array, Sensor Signals, and Beamformer Setup
    6.2.2 Interference-Independent Beamformer Performance Measures
    6.2.3 Interference-Dependent Beamformer Performance Measures
  6.3 Data-Independent Beamformer Design
  6.4 Optimum Data-Dependent Beamformer Designs
    6.4.1 Least-Squares Error (LSE) Design
    6.4.2 Least-Squares Formulation of Linearly Constrained Minimum Variance (LCMV) Beamforming: LCMV-LS Design
    6.4.3 Eigenvector Beamformers
    6.4.4 Suppression of Correlated Interference
  6.5 Adaptation of LCMV-LS Beamformers
  6.6 A Practical Audio Acquisition System Using a Robust GSC
    6.6.1 Spatio-Temporal Constraints
    6.6.2 Robust GSC After [1] as an LCMV-LS Beamformer with Spatio-Temporal Constraints
    6.6.3 Realization in the DFT-Domain
    6.6.4 Experimental Evaluation
  6.7 Conclusions
  References

7 Blind Source Separation of Convolutive Mixtures of Speech
Shoji Makino
  7.1 Introduction
  7.2 What Is BSS?
    7.2.1 Mixed Signal Model for Speech Signals in a Room
    7.2.2 Unmixed Signal Model
    7.2.3 Task of Blind Source Separation of Speech Signals
    7.2.4 Instantaneous Mixtures vs. Convolutive Mixtures
    7.2.5 Time-Domain Approach vs. Frequency-Domain Approach
    7.2.6 Time-Domain Approach for Convolutive Mixtures
    7.2.7 Frequency-Domain Approach for Convolutive Mixtures
    7.2.8 Scaling and Permutation Problems
  7.3 What Is ICA?
    7.3.1 What Is Independence?
    7.3.2 Minimization of Mutual Information
    7.3.3 Maximization of Nongaussianity
    7.3.4 Maximization of Likelihood
    7.3.5 Three ICA Theories Are Identical
    7.3.6 Learning Rules
  7.4 How Speech Signals Can Be Separated?
    7.4.1 Second Order Statistics vs. Higher Order Statistics
    7.4.2 Second Order Statistics (SOS) Approach
    7.4.3 Higher Order Statistics (HOS) Approach
  7.5 Physical Interpretation of BSS
    7.5.1 Frequency-Domain Adaptive Beamformer (ABF)
    7.5.2 ABF for Target S1 and Jammer S2
    7.5.3 ABF for Target S2 and Jammer S1
    7.5.4 Two Sets of ABFs
  7.6 Equivalence Between BSS and Adaptive Beamformers
    7.6.1 When S1 ≠ 0 and S2 = 0
    7.6.2 When S1 = 0 and S2 ≠ 0
    7.6.3 When S1 ≠ 0 and S2 ≠ 0
  7.7 Separation Mechanism of BSS
    7.7.1 Fundamental Limitation of BSS
    7.7.2 When Sources Are Near the Microphones
  7.8 When Sources Are Not "Independent"
    7.8.1 BSS Is Upper Bounded by ABF
    7.8.2 BSS Is an Intelligent Version of ABF
  7.9 Sound Quality
    7.9.1 Directivity Patterns of NBF, BSS, and ABF
  7.10 Experimental Conditions
    7.10.1 Mixing Systems
    7.10.2 Source Signals
    7.10.3 Evaluation Measure
    7.10.4 Scaling and Permutation
  7.11 Conclusions
  References

8 Adaptive Multichannel Time Delay Estimation Based on Blind System Identification for Acoustic Source Localization
Yiteng (Arden) Huang, Jacob Benesty
  8.1 Introduction
  8.2 Problem Formulation
    8.2.1 Notation
    8.2.2 Signal Model
    8.2.3 Channel Properties and Assumptions
  8.3 Generalized Multichannel Time Delay Estimation
    8.3.1 The Principle
    8.3.2 Time-Domain Multichannel LMS Approach
    8.3.3 Frequency-Domain Adaptive Algorithms
    8.3.4 Algorithm Implementation
  8.4 Simulations
    8.4.1 Experimental Setup
    8.4.2 Performance Measure
    8.4.3 Experimental Results
  8.5 Conclusions
  References

9 Algorithms for Adaptive Equalization in Wireless Applications
Markus Rupp, Andreas Burg
  9.1 Introduction
  9.2 Criteria for Equalization
  9.3 Channel Equalization
    9.3.1 Infinite Filter Length Solutions for Single Channels
    9.3.2 Finite and Infinite Filter Length Solutions for Multiple Channels
    9.3.3 Finite Filter Length Solutions for Single Channels
    9.3.4 Decision Feedback Equalizers
  9.4 Adaptive Algorithms for Channel Equalization
    9.4.1 Adaptively Minimizing ZF
    9.4.2 Adaptively Minimizing MMSE
    9.4.3 Training and Tracking
  9.5 Channel Estimation
    9.5.1 Channel Estimation in MIMO Systems
    9.5.2 Estimation of Wireless Channels
    9.5.3 Channel Estimation by Basis Functions
    9.5.4 Channel Estimation by Predictive Methods
  9.6 Maximum Likelihood Equalization
    9.6.1 Viterbi Algorithm
  9.7 Blind Algorithms
  9.8 Conclusions
  References

10 Adaptive Space-Time Processing for Wireless CDMA
Sofiène Affes, Paul Mermelstein
  10.1 Introduction
  10.2 Data Model
  10.3 The Blind 2D RAKE Receiver
  10.4 The Blind 2D STAR
    10.4.1 Decision-Feedback Identification (DFI)
    10.4.2 Parallel and Soft DFI
    10.4.3 Parallel and Hard DFI
    10.4.4 Common and Soft DFI
    10.4.5 Common and Hard DFI
    10.4.6 Performance Gains of the DFI Versions
  10.5 The Blind 1D-ST STAR
    10.5.1 1D-ST Structured Data Model
    10.5.2 2D STAR with Common DFI Reinterpreted
    10.5.3 1D-ST Structured DFI
    10.5.4 Performance Gains of 1D-ST STAR over 2D STAR
  10.6 The Pilot-Assisted 1D-ST STAR
    10.6.1 Data Model with Pilot Signals
    10.6.2 1D-ST STAR with Conventional Pilot-Channel Use
    10.6.3 1D-ST STAR with Enhanced Pilot-Channel Use
    10.6.4 1D-ST STAR with Conventional Pilot-Symbol Use
    10.6.5 1D-ST STAR with Enhanced Pilot-Symbol Use
    10.6.6 Performance Gains with Enhanced Pilot Use
  10.7 Conclusions
  References

11 The IEEE 802.11 System with Multiple Receive Antennas
Vijitha Weerackody
  11.1 Introduction
  11.2 System Model with Multiple Receive Antennas
    11.2.1 Estimating the Receive Antenna Weights
  11.3 The IEEE 802.11 Distributed Coordination Function with Multiple Packet Reception
  11.4 Throughput Analysis
    11.4.1 All Successful Packets Acknowledged
    11.4.2 Only a Single Successful Packet Acknowledged
  11.5 Performance Results
  11.6 Conclusions
  References

12 Adaptive Estimation of Clock Skew and Different Types of Delay in the Internet Network
Jacob Benesty
  12.1 Introduction
  12.2 Terminology and Problem Formulation
  12.3 Delay Jitter Model
  12.4 The Least-Squares (LS) Estimator
  12.5 The Maximum Likelihood (ML) Estimator and Linear Programming
  12.6 An Unbiased RLS Algorithm
  12.7 Simulations
  12.8 Conclusions
  References

Index
List of Contributors
Sofiène Affes, INRS-Telecommunications, University of Quebec, Montreal, Canada
Jacob Benesty, Bell Laboratories, Lucent Technologies, Murray Hill, NJ, USA
Herbert Buchner, University of Erlangen-Nuremberg, Erlangen, Germany
Andreas Burg, ETH Zurich, Zurich, Switzerland
Jingdong Chen, Bell Laboratories, Lucent Technologies, Murray Hill, NJ, USA
Eberhard Hänsler, Darmstadt University of Technology, Darmstadt, Germany
Wolfgang Herbordt, University of Erlangen-Nuremberg, Erlangen, Germany
Yiteng (Arden) Huang, Bell Laboratories, Lucent Technologies, Murray Hill, NJ, USA
James M. Kates, Cirrus Logic, AudioLogic Design Center, Boulder, CO, USA
Walter Kellermann, University of Erlangen-Nuremberg, Erlangen, Germany
Shoji Makino, NTT Communication Science Laboratories, Kyoto, Japan
Paul Mermelstein, INRS-Telecommunications, University of Quebec, Montreal, Canada
Dennis R. Morgan, Bell Laboratories, Lucent Technologies, Murray Hill, NJ, USA
Markus Rupp, TU Wien, Vienna, Austria
Gerhard Schmidt, Temic Speech Processing, Ulm, Germany
Vijitha Weerackody, Independent Consultant, NYC, NY, USA
1 On a Class of Exponentiated Adaptive Algorithms for the Identification of Sparse Impulse Responses

Jacob Benesty, Yiteng (Arden) Huang, and Dennis R. Morgan
Bell Laboratories, Lucent Technologies, Murray Hill, NJ 07974, USA
E-mail: {jbenesty, yitenghuang, drrm}@bell-labs.com

Abstract. Sparse impulse responses are encountered in many applications (network and acoustic echo cancellation, feedback cancellation in hearing aids, etc.). Recently, a class of exponentiated gradient (EG) algorithms has been proposed. One of the algorithms belonging to this class, the so-called EG± algorithm, converges and tracks much better than the classical stochastic gradient, or LMS, algorithm for sparse impulse responses. We analyze the EG± and EG with unnormalized positive and negative weights (EGU±) algorithms and show when to expect them to behave like the LMS algorithm. We propose different normalized versions with respect to the input signal. It is shown that the proportionate normalized LMS (PNLMS) algorithm proposed by Duttweiler in the context of network echo cancellation (where the system impulse response is often sparse) is an approximation of the EG±, so that we can expect the two algorithms to have similar behavior. Finally, we demonstrate how the concept of the exponentiated gradient can be used for blind multichannel identification and propose the multichannel EG± algorithm.
1.1 Introduction
One of the most popular adaptive algorithms available in the literature is the stochastic gradient algorithm, also called least-mean-square (LMS) [1], [2]. Its popularity comes from the fact that it is very simple to implement. As a consequence, the LMS algorithm is widely used in many applications. The main drawback of the LMS algorithm, though, is that in general it converges very slowly with correlated input signals.

Recently, another variant of the LMS algorithm, called the exponentiated gradient algorithm with positive and negative weights (EG± algorithm), was proposed by Kivinen and Warmuth [3]. This new algorithm converges much faster than the LMS algorithm when the impulse response that we need to identify is sparse, which is often the case in network echo cancellation involving a hybrid transformer in conjunction with variable network delay, or in the context of hands-free communications where there is a strong coupling between the loudspeaker and the microphone [4]. The EG± algorithm has the nice feature that its update rule takes advantage of the sparseness of the impulse response to speed up its initial convergence and to improve its
tracking abilities compared to LMS. More recently, a technique known as the proportionate normalized LMS (PNLMS) algorithm [5] has been introduced which has similar advantages for sparse impulse responses. In [6], a general expression of the mean squared error (MSE) is derived for the EG± algorithm, showing that for sparse impulse responses the EG± algorithm, like PNLMS, converges more quickly than the LMS for a given asymptotic MSE.

In this chapter, we show some interesting links between the LMS algorithm and different variants of the EG algorithm, when to expect the EG and LMS algorithms to behave the same way, and that the choice of some parameters of the EG is critical. We propose some normalized versions of the EG algorithms with respect to the input signal. We also show that the PNLMS algorithm is an approximation of the EG± algorithm. The idea of the exponentiated gradient is by no means confined to the area of echo cancellation and might be useful in other applications of adaptive algorithms. We demonstrate how it can be used for blind channel identification and propose the multichannel EG± method. Finally, we compare by way of simulations all these algorithms for some typical sparse impulse responses.
1.2 Derivation of the Different Algorithms
In this section, we show how to derive different variants of the LMS algorithm. Depending on how we define the distance between the old and new weight vectors, we obtain different update rules. We define the a priori error signal $e(n+1)$ at time $n+1$ as:
$$e(n+1) = y(n+1) - \hat{y}(n+1), \qquad (1.1)$$
where
$$y(n+1) = \mathbf{h}_{\mathrm{t}}^T \mathbf{x}(n+1) \qquad (1.2)$$
is the system output,
$$\mathbf{h}_{\mathrm{t}} = \begin{bmatrix} h_{\mathrm{t},0} & h_{\mathrm{t},1} & \cdots & h_{\mathrm{t},L-1} \end{bmatrix}^T$$
is the true (subscript t) impulse response of the system, superscript $T$ denotes transpose of a vector or a matrix,
$$\mathbf{x}(n+1) = \begin{bmatrix} x(n+1) & x(n) & \cdots & x(n-L+2) \end{bmatrix}^T$$
is a vector containing the last $L$ samples of the input signal $x$,
$$\hat{y}(n+1) = \mathbf{h}^T(n)\mathbf{x}(n+1) \qquad (1.3)$$
is the model filter output, and
$$\mathbf{h}(n) = \begin{bmatrix} h_0(n) & h_1(n) & \cdots & h_{L-1}(n) \end{bmatrix}^T$$
is the model filter. One easy way to find adaptive algorithms that adjust the new weight vector $\mathbf{h}(n+1)$ from the old one $\mathbf{h}(n)$ is to minimize the following function [3]:
$$J[\mathbf{h}(n+1)] = d[\mathbf{h}(n+1), \mathbf{h}(n)] + \eta \epsilon^2(n+1), \qquad (1.4)$$
where $d[\mathbf{h}(n+1), \mathbf{h}(n)]$ is some measure of distance from the old to the new weight vector,
$$\epsilon(n+1) = y(n+1) - \mathbf{h}^T(n+1)\mathbf{x}(n+1) \qquad (1.5)$$
is the a posteriori error signal, and $\eta$ is a positive constant. (This formulation is a generalization of the case of Euclidean distance; see [7] and references therein.) The magnitude of $\eta$ represents the importance of correctiveness compared to the importance of conservativeness [3]. If $\eta$ is very small, minimizing $J[\mathbf{h}(n+1)]$ is close to minimizing $d[\mathbf{h}(n+1), \mathbf{h}(n)]$, so that the algorithm makes very small updates. On the other hand, if $\eta$ is very large, the minimization of $J[\mathbf{h}(n+1)]$ is almost equivalent to minimizing $d[\mathbf{h}(n+1), \mathbf{h}(n)]$ subject to the constraint $\epsilon(n+1) = 0$.

To minimize $J[\mathbf{h}(n+1)]$, we need to set its $L$ partial derivatives $\partial J[\mathbf{h}(n+1)]/\partial h_l(n+1)$ to zero. Hence, the different weight coefficients $h_l(n+1)$, $l = 0, 1, \ldots, L-1$, will be found by solving the equations:
$$\frac{\partial d[\mathbf{h}(n+1), \mathbf{h}(n)]}{\partial h_l(n+1)} - 2\eta x(n+1-l)\epsilon(n+1) = 0. \qquad (1.6)$$
Solving (1.6) is in general very difficult. However, if the new weight vector $\mathbf{h}(n+1)$ is close to the old weight vector $\mathbf{h}(n)$, replacing the a posteriori error signal $\epsilon(n+1)$ in (1.6) with the a priori error signal $e(n+1)$ is a reasonable approximation, and the equation
$$\frac{\partial d[\mathbf{h}(n+1), \mathbf{h}(n)]}{\partial h_l(n+1)} - 2\eta x(n+1-l)e(n+1) = 0 \qquad (1.7)$$
is much easier to solve for all distance measures $d$. The LMS algorithm is easily obtained from (1.7) by using the squared Euclidean distance
$$d_E[\mathbf{h}(n+1), \mathbf{h}(n)] = \|\mathbf{h}(n+1) - \mathbf{h}(n)\|_2^2. \qquad (1.8)$$
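Indeed, substituting (1.8) into (1.7) gives $\partial d_E[\mathbf{h}(n+1), \mathbf{h}(n)]/\partial h_l(n+1) = 2[h_l(n+1) - h_l(n)]$, so that (1.7) yields the coefficient update
$$h_l(n+1) = h_l(n) + \eta x(n+1-l)e(n+1), \quad l = 0, 1, \ldots, L-1,$$
which is the familiar LMS recursion with $\eta$ acting as the step size.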
The exponentiated gradient (EG) algorithm with positive weights results from using for $d$ the relative entropy, also known as the Kullback-Leibler divergence,
$$d_{\mathrm{re}}[\mathbf{h}(n+1), \mathbf{h}(n)] = \sum_{l=0}^{L-1} h_l(n+1) \ln \frac{h_l(n+1)}{h_l(n)}, \qquad (1.9)$$
with the constraint $\sum_l h_l(n+1) = 1$, so that (1.7) becomes:
$$\frac{\partial d_{\mathrm{re}}[\mathbf{h}(n+1), \mathbf{h}(n)]}{\partial h_l(n+1)} - 2\eta x(n+1-l)e(n+1) + \gamma = 0, \qquad (1.10)$$
where $\gamma$ is the Lagrange multiplier. Actually, the appropriate constraint should be $\sum_l h_l(n+1) = \sum_l h_{\mathrm{t},l}$, but $\sum_l h_{\mathrm{t},l}$ is not known in practice, so we take the arbitrary value 1 instead, which is realized by scaling as will be explained in more detail later. This will have an effect on the adaptation step of the resulting adaptive algorithm.

The algorithm derived from (1.10) is valid only for positive weights. To deal with both positive and negative coefficients, we can always find two vectors $\mathbf{h}^+(n+1)$ and $\mathbf{h}^-(n+1)$ with positive coefficients, in such a way that the vector
$$\mathbf{h}(n+1) = \mathbf{h}^+(n+1) - \mathbf{h}^-(n+1) \qquad (1.11)$$
can have positive and negative components. In this case, the a posteriori error signal can be written as:
$$\epsilon(n+1) = y(n+1) - [\mathbf{h}^+(n+1) - \mathbf{h}^-(n+1)]^T \mathbf{x}(n+1) \qquad (1.12)$$
and the function (1.4) will change to:
$$J[\mathbf{h}^+(n+1), \mathbf{h}^-(n+1)] = d[\mathbf{h}^+(n+1), \mathbf{h}^+(n)] + d[\mathbf{h}^-(n+1), \mathbf{h}^-(n)] + \frac{\eta}{u}\epsilon^2(n+1), \qquad (1.13)$$
where $u$ is a positive scaling constant. Using the same approximation as before and choosing the Kullback-Leibler divergence plus the constraint $\sum_l [h_l^+(n+1) + h_l^-(n+1)] = u$, the solutions of the equations
$$\frac{\partial d_{\mathrm{re}}[\mathbf{h}^+(n+1), \mathbf{h}^+(n)]}{\partial h_l^+(n+1)} - 2\frac{\eta}{u} x(n+1-l)e(n+1) + \gamma = 0, \qquad (1.14)$$
$$\frac{\partial d_{\mathrm{re}}[\mathbf{h}^-(n+1), \mathbf{h}^-(n)]}{\partial h_l^-(n+1)} + 2\frac{\eta}{u} x(n+1-l)e(n+1) + \gamma = 0, \qquad (1.15)$$
give the so-called EG± algorithm, where
$$e(n+1) = y(n+1) - [\mathbf{h}^+(n) - \mathbf{h}^-(n)]^T \mathbf{x}(n+1), \qquad (1.16)$$
and will be further detailed in the next section.

When the parameter space is a curved manifold (non-Euclidean), there are no orthonormal linear coordinates and the squared length of a small incremental vector $\mathbf{h}(n+1) - \mathbf{h}(n)$ connecting $\mathbf{h}(n)$ and $\mathbf{h}(n+1)$ is given by the quadratic form:
$$d_R[\mathbf{h}(n+1), \mathbf{h}(n)] = [\mathbf{h}(n+1) - \mathbf{h}(n)]^T \mathbf{G}[\mathbf{h}(n)][\mathbf{h}(n+1) - \mathbf{h}(n)]. \qquad (1.17)$$
Such a space is a Riemannian space. The $L \times L$ positive-definite matrix $\mathbf{G}[\mathbf{h}(n)]$ is called the Riemannian metric tensor and it depends in general on $\mathbf{h}(n)$. The Riemannian metric tensor characterizes the intrinsic curvature of a particular manifold in $L$-dimensional space. In the Euclidean orthonormal case, $\mathbf{G}[\mathbf{h}(n)] = \mathbf{I}$ (the identity matrix) and (1.17) is the same as (1.8). Using (1.17) in (1.7), we obtain the natural gradient descent algorithm proposed by Amari [8]:
$$\mathbf{h}(n+1) = \mathbf{h}(n) + \eta \mathbf{G}^{-1}[\mathbf{h}(n)]\mathbf{x}(n+1)e(n+1). \qquad (1.18)$$

1.3 Link Between the LMS and EG Algorithms and Normalized Versions
Let us define the LMS algorithm [1]:
$$e(n+1) = y(n+1)/\beta - \mathbf{h}^T(n)\mathbf{x}(n+1), \qquad (1.19)$$
$$\mathbf{h}(n+1) = \mathbf{h}(n) + \mu\mathbf{x}(n+1)e(n+1), \qquad (1.20)$$
where $\beta \ne 0$ is a constant which does not affect the convergence rate or the mean squared error (MSE) of the algorithm. Its role is to scale the impulse response to be identified. In principle, any adaptive algorithm should be invariant to this transformation; to obtain the true error signal and the true adaptive filter, we only need to multiply $e$ and $\mathbf{h}$ by $\beta$. If we initialize $h_l(0) = 0$, $l = 0, 1, \ldots, L-1$, we can easily see that:
$$\mathbf{h}(n+1) = \mu \sum_{i=0}^{n} \mathbf{x}(i+1)e(i+1). \qquad (1.21)$$
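As a concrete illustration of (1.19)-(1.21), a minimal Python sketch of LMS identification of a sparse system (with $\beta = 1$) is given below. The function name, the toy system, and the step-size value are illustrative assumptions, not specifications from this chapter.

```python
import numpy as np

def lms_identify(x, y, L, mu):
    """Minimal LMS system identification, (1.19)-(1.20) with beta = 1."""
    h = np.zeros(L)                       # model filter, h(0) = 0
    for n in range(L - 1, len(x)):
        x_vec = x[n - L + 1:n + 1][::-1]  # [x(n), x(n-1), ..., x(n-L+1)]
        e = y[n] - h @ x_vec              # a priori error signal
        h += mu * x_vec * e               # stochastic gradient update
    return h

# Toy usage: identify a sparse 64-tap impulse response from white noise.
rng = np.random.default_rng(0)
L = 64
h_true = np.zeros(L)
h_true[7], h_true[20] = 1.0, -0.5         # sparse true system
x = rng.standard_normal(20000)
y = np.convolve(x, h_true)[:len(x)]       # noiseless system output
h_hat = lms_identify(x, y, L, mu=0.005)
print(np.max(np.abs(h_hat - h_true)))     # small misalignment after convergence
```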
The exponentiated gradient algorithm with unnormalized weights updates two vectors with nonnegative weights; it is called the EGU± algorithm and it is summarized by the following equations [3], [6]:
$$e(n+1) = y(n+1)/\beta - [\mathbf{h}^+(n) - \mathbf{h}^-(n)]^T \mathbf{x}(n+1), \qquad (1.22)$$
$$h_l^+(n+1) = h_l^+(n) \exp\left[\mu' x(n+1-l)e(n+1)\right], \qquad (1.23)$$
$$h_l^-(n+1) = h_l^-(n) \exp\left[-\mu' x(n+1-l)e(n+1)\right]. \qquad (1.24)$$
The motivation for the EGU± (and EG) algorithm can be developed by taking the log of (1.23) and (1.24). This shows that the log weights use the same update as the LMS algorithm. Alternatively, this can be interpreted as exponentiating the update, hence the name EG. This has the effect of assigning larger relative updates to larger weights, thereby deemphasizing the effect of smaller weights. This is qualitatively similar to the PNLMS algorithm, to be described in more detail later, which makes the update proportional to the size of the weight. This type of behavior is desirable for sparse impulse responses where small weights do not contribute significantly to the mean solution but introduce an undesirable noise-like variance.
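To make the multiplicative nature of this update concrete, here is a minimal sketch of one EGU± iteration, (1.22)-(1.24) with $\beta = 1$; the function name and calling convention are illustrative assumptions.

```python
import numpy as np

def egu_step(h_pos, h_neg, x_vec, y_n, mu_p):
    """One EGU+- iteration, (1.22)-(1.24) with beta = 1.
    h_pos, h_neg: the two nonnegative weight vectors;
    mu_p: the step size mu' of (1.23)-(1.24)."""
    e = y_n - (h_pos - h_neg) @ x_vec     # a priori error (1.22)
    g = np.exp(mu_p * x_vec * e)          # elementwise exponentiated gradient
    return h_pos * g, h_neg / g, e        # (1.23), (1.24): h_neg / g = h_neg * exp(-...)

# Typical initialization: h_pos = h_neg = np.full(L, 1.0 / (2 * L)).
```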
The final result for the model filter is $\beta h_l(n+1) = \beta[h_l^+(n+1) - h_l^-(n+1)]$, $l = 0, 1, \ldots, L-1$. The performance of this algorithm depends on the choice of $\beta$, which is very unusual in adaptive filtering. This means that, with the same adaptation step $\mu'$, changing the system impulse response $\mathbf{h}_{\mathrm{t}}$ to $\mathbf{h}_{\mathrm{t}}/\beta$ in the EGU± algorithm changes its behavior. For the LMS algorithm, the adaptation step depends only on the input signal $x$, but for the EGU± it also depends on the true impulse response $\mathbf{h}_{\mathrm{t}}$ and the initialization. The only way to make the algorithm invariant to this is to take $\beta$ large enough, as we will see. In this case, the EGU± algorithm is equivalent to the LMS algorithm.

We usually initialize the EGU± algorithm with $h_l^+(0) = h_l^-(0) = 1/(2L)$, $l = 0, 1, \ldots, L-1$, so that $\sum_{l=0}^{L-1} [h_l^+(0) + h_l^-(0)] = 1$. This constraint can be relaxed in practice and we can always use another initialization as long as it is strictly positive. A different initialization will be reflected in the adaptation step of the algorithm. If we start adaptation with $h_l^+(0) = h_l^-(0) = c > 0$, $l = 0, 1, \ldots, L-1$, (1.23) and (1.24) can be rewritten as:
$$h_l^+(n+1) = c \exp\left[\mu' \sum_{i=0}^{n} x(i+1-l)e(i+1)\right], \qquad (1.25)$$
$$h_l^-(n+1) = c \exp\left[-\mu' \sum_{i=0}^{n} x(i+1-l)e(i+1)\right] = \frac{c^2}{h_l^+(n+1)}. \qquad (1.26)$$
Thus
$$h_l(n+1) = h_l^+(n+1) - h_l^-(n+1) = c \exp\left[\mu' \sum_{i=0}^{n} x(i+1-l)e(i+1)\right] - c \exp\left[-\mu' \sum_{i=0}^{n} x(i+1-l)e(i+1)\right] = 2c \sinh\left[\mu' \sum_{i=0}^{n} x(i+1-l)e(i+1)\right]. \qquad (1.27)$$
We can always choose $\beta$ as large as wanted in order to have:
$$|h_l(n+1)|/(2c) \ll 1, \quad \forall l. \qquad (1.28)$$
In this case, since
$$\sinh(a) \approx a, \quad |a| \ll 1, \qquad (1.29)$$
(1.27) becomes:
$$h_l(n+1) = 2c\mu' \sum_{i=0}^{n} x(i+1-l)e(i+1). \qquad (1.30)$$
Taking $\mu' = \mu/(2c)$ and comparing (1.30) to (1.21), we deduce that the LMS and EGU± are virtually identical for $\beta$ large.

The above approach also gives a hint on how to normalize the EGU± algorithm for a nonstationary input signal. Indeed, it is well known that in the normalized LMS (NLMS), we choose:
$$\mu = \frac{\alpha}{\mathbf{x}^T(n+1)\mathbf{x}(n+1) + \delta}, \qquad (1.31)$$
where $\delta$ is a regularization factor and $0 < \alpha \le 1$. Hence, for the normalized EGU± (NEGU±), $\mu'$ in (1.23) and (1.24) is replaced by:
$$\mu' = \frac{\alpha}{2c[\mathbf{x}^T(n+1)\mathbf{x}(n+1) + \delta]}. \qquad (1.32)$$
Again with $\beta$ large, the NLMS and NEGU± have exactly the same behavior, so that the NEGU± does not have much interest in that case. There are likely better normalizations that would take into account the initialization of the algorithm and the magnitude of the system impulse response, but their derivation is not obvious.

However, the so-called EG± algorithm is more intriguing and has much more potential to be used in some applications. The EG± algorithm is:
$$e(n+1) = y(n+1)/\beta - [\mathbf{h}^+(n) - \mathbf{h}^-(n)]^T \mathbf{x}(n+1), \qquad (1.33)$$
$$h_l^+(n+1) = u \frac{h_l^+(n) r_l^+(n+1)}{\sum_{j=0}^{L-1} [h_j^+(n) r_j^+(n+1) + h_j^-(n) r_j^-(n+1)]}, \qquad (1.34)$$
$$h_l^-(n+1) = u \frac{h_l^-(n) r_l^-(n+1)}{\sum_{j=0}^{L-1} [h_j^+(n) r_j^+(n+1) + h_j^-(n) r_j^-(n+1)]}, \qquad (1.35)$$
where
$$r_l^+(n+1) = \exp\left[\frac{\mu}{u} x(n+1-l)e(n+1)\right], \qquad (1.36)$$
$$r_l^-(n+1) = \exp\left[-\frac{\mu}{u} x(n+1-l)e(n+1)\right] = \frac{1}{r_l^+(n+1)}, \qquad (1.37)$$
and $u$ is a constant chosen such that $u \ge \|\mathbf{h}_{\mathrm{t}}/\beta\|_1$. We can check that we always have $\|\mathbf{h}^+(n+1)\|_1 + \|\mathbf{h}^-(n+1)\|_1 = u$. Note that, because of the choice of $u$, the parameter $\beta$ does not influence the behavior of the EG± algorithm, in the same way that it doesn't for the LMS algorithm. From here on, for simplicity, we will take $\beta = 1$.
Starting adaptation of the algorithm with $h_l^+(0) = h_l^-(0) = c > 0$, $l = 0, 1, \ldots, L-1$, we can show that (1.34) and (1.35) are equivalent to:
$$h_l^+(n+1) = u \frac{s_l^+(n+1)}{\sum_{j=0}^{L-1} [s_j^+(n+1) + s_j^-(n+1)]}, \qquad (1.38)$$
$$h_l^-(n+1) = u \frac{s_l^-(n+1)}{\sum_{j=0}^{L-1} [s_j^+(n+1) + s_j^-(n+1)]}, \qquad (1.39)$$
where
$$s_l^+(n+1) = \exp\left[\frac{\mu}{u} \sum_{i=0}^{n} x(i+1-l)e(i+1)\right], \qquad (1.40)$$
$$s_l^-(n+1) = \exp\left[-\frac{\mu}{u} \sum_{i=0}^{n} x(i+1-l)e(i+1)\right] = \frac{1}{s_l^+(n+1)}. \qquad (1.41)$$
Clearly, the convergence of the algorithm does not depend on the initialization parameter $c$. Now
$$h_l(n+1) = h_l^+(n+1) - h_l^-(n+1) = u \frac{s_l^+(n+1) - s_l^-(n+1)}{\sum_{j=0}^{L-1} [s_j^+(n+1) + s_j^-(n+1)]} = u \frac{\sinh\left[\frac{\mu}{u} \sum_{i=0}^{n} x(i+1-l)e(i+1)\right]}{\sum_{j=0}^{L-1} \cosh\left[\frac{\mu}{u} \sum_{i=0}^{n} x(i+1-j)e(i+1)\right]}. \qquad (1.42)$$
Note that the sinh function has the effect of exponentiating the update, as previously commented. For $u$ large enough and using the approximations $\sinh(a) \approx a$ and $\cosh(a) \approx 1$ when $|a| \ll 1$, (1.42) becomes:
$$h_l(n+1) = \frac{\mu}{L} \sum_{i=0}^{n} x(i+1-l)e(i+1). \qquad (1.43)$$
We understand that, by taking $\mu = L\mu_{\mathrm{LMS}}$ (where $\mu_{\mathrm{LMS}}$ denotes the LMS step size in (1.21)) and for $u$ large enough, the LMS and EG± algorithms have the same performance. Obviously, the choice of $u$ is critical in practice: if we take it too large or too small, there is not a real advantage in using the EG± algorithm. The simplest way to normalize the EG± is to choose:
$$\mu = \frac{L\alpha}{\mathbf{x}^T(n+1)\mathbf{x}(n+1) + \delta}. \qquad (1.44)$$
This algorithm is called the normalized EG± (NEG±) and is summarized in Table 1.1.

Table 1.1. The normalized EG± algorithm.

Initialization:
  $h_l^+(0) = h_l^-(0) = c > 0$, $l = 0, 1, \ldots, L-1$
Parameters:
  $u \ge \|\mathbf{h}_{\mathrm{t}}\|_1$
  $0 < \alpha \le 1$, $\delta > 0$
Error:
  $e(n+1) = y(n+1) - [\mathbf{h}^+(n) - \mathbf{h}^-(n)]^T \mathbf{x}(n+1)$
Update:
  $\mu(n+1) = \dfrac{\alpha}{\mathbf{x}^T(n+1)\mathbf{x}(n+1) + \delta}$
  $r_l^+(n+1) = \exp\left[L \dfrac{\mu(n+1)}{u} x(n+1-l)e(n+1)\right]$
  $r_l^-(n+1) = \dfrac{1}{r_l^+(n+1)}$
  $h_l^+(n+1) = u \dfrac{h_l^+(n) r_l^+(n+1)}{\sum_{j=0}^{L-1} [h_j^+(n) r_j^+(n+1) + h_j^-(n) r_j^-(n+1)]}$
  $h_l^-(n+1) = u \dfrac{h_l^-(n) r_l^-(n+1)}{\sum_{j=0}^{L-1} [h_j^+(n) r_j^+(n+1) + h_j^-(n) r_j^-(n+1)]}$
  $l = 0, 1, \ldots, L-1$
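A minimal Python rendering of one iteration of Table 1.1 follows; the function name and calling convention are illustrative, and the constant $u$ must be supplied by the user (with $u \ge \|\mathbf{h}_{\mathrm{t}}\|_1$).

```python
import numpy as np

def neg_step(h_pos, h_neg, x_vec, y_n, u, alpha, delta):
    """One iteration of the normalized EG+- (NEG+-) algorithm of Table 1.1."""
    L = len(h_pos)
    e = y_n - (h_pos - h_neg) @ x_vec              # a priori error
    mu = alpha / (x_vec @ x_vec + delta)           # normalized step size
    r_pos = np.exp(L * (mu / u) * x_vec * e)       # r_l^+(n+1)
    num_pos = h_pos * r_pos                        # h_l^+(n) r_l^+(n+1)
    num_neg = h_neg / r_pos                        # h_l^-(n) r_l^-(n+1)
    scale = u / (num_pos.sum() + num_neg.sum())    # keeps ||h^+||_1 + ||h^-||_1 = u
    return scale * num_pos, scale * num_neg, e
```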
1.4 The RLS and ERLS Algorithms
In many ways, the recursive least-squares (RLS) algorithm is optimal from a convergence point of view since its convergence does not depend on the condition number of the input signal covariance matrix. It is well known that with ill-conditioned signals (like speech) this condition number can be very large and algorithms like LMS suffer from slow convergence [2]. Thus, it is interesting to compare the RLS algorithm to the other algorithms when the impulse response to identify is sparse. The RLS algorithm is:
$$e(n+1) = y(n+1) - \mathbf{h}^T(n)\mathbf{x}(n+1), \qquad (1.45)$$
$$h_l(n+1) = h_l(n) + k_l(n+1)e(n+1), \qquad (1.46)$$
where
$$\mathbf{k}(n+1) = \begin{bmatrix} k_0(n+1) & k_1(n+1) & \cdots & k_{L-1}(n+1) \end{bmatrix}^T = \mathbf{R}^{-1}(n+1)\mathbf{x}(n+1) \qquad (1.47)$$
is the Kalman gain,
$$\mathbf{R}(n+1) = \sum_{m=1}^{n+1} \lambda^{n+1-m} \mathbf{x}(m)\mathbf{x}^T(m) = \lambda\mathbf{R}(n) + \mathbf{x}(n+1)\mathbf{x}^T(n+1) \qquad (1.48)$$
is an estimate of the input signal covariance matrix, and $\lambda$ ($0 < \lambda \le 1$) is an exponential forgetting factor. A fast RLS (FRLS) can be derived by using the a priori Kalman gain $\mathbf{k}'(n+1) = \mathbf{R}^{-1}(n)\mathbf{x}(n+1)$ and the forward and backward predictors. This a priori Kalman gain can be computed recursively with only $5L$ multiplications [2].

Following the same approach as for the NEG± algorithm, we propose the exponentiated RLS (ERLS) algorithm:
$$e(n+1) = y(n+1) - [\mathbf{h}^+(n) - \mathbf{h}^-(n)]^T \mathbf{x}(n+1), \qquad (1.49)$$
$$h_l^+(n+1) = u \frac{h_l^+(n) r_l^+(n+1)}{\sum_{j=0}^{L-1} [h_j^+(n) r_j^+(n+1) + h_j^-(n) r_j^-(n+1)]}, \qquad (1.50)$$
$$h_l^-(n+1) = u \frac{h_l^-(n) r_l^-(n+1)}{\sum_{j=0}^{L-1} [h_j^+(n) r_j^+(n+1) + h_j^-(n) r_j^-(n+1)]}, \qquad (1.51)$$
where now:
$$r_l^+(n+1) = \exp\left[\frac{k_l(n+1)}{u} e(n+1)\right] = \frac{1}{r_l^-(n+1)}. \qquad (1.52)$$
Obviously, a fast ERLS (FERLS) can easily be derived since the Kalman gain in (1.52) is the same as the one used in the FRLS. Simulations presented later show that there is not much difference between the FRLS and FERLS for initial convergence, but for tracking, the FERLS is much better than the FRLS. Hence, the FERLS algorithm may be of some interest.
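As an illustration, the following sketch performs one ERLS iteration using a direct matrix-inversion-lemma recursion for the Kalman gain; a practical implementation would instead use the fast (FERLS) machinery mentioned above. All names are illustrative assumptions.

```python
import numpy as np

def erls_step(h_pos, h_neg, R_inv, x_vec, y_n, u, lam):
    """One ERLS iteration, (1.49)-(1.52), with the Kalman gain obtained
    from a direct recursion of R^{-1} (initialize R_inv = I / delta)."""
    e = y_n - (h_pos - h_neg) @ x_vec
    Rx = R_inv @ x_vec
    k = Rx / (lam + x_vec @ Rx)            # Kalman gain k(n+1) = R^{-1}(n+1) x(n+1)
    R_inv = (R_inv - np.outer(k, Rx)) / lam
    r_pos = np.exp((k / u) * e)            # r_l^+(n+1) of (1.52)
    num_pos = h_pos * r_pos
    num_neg = h_neg / r_pos
    scale = u / (num_pos.sum() + num_neg.sum())
    return scale * num_pos, scale * num_neg, R_inv, e
```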
1.5 Link Between the PNLMS and EG± Algorithms
Recently, the proportionate normalized least-mean-square (PNLMS) algorithm was developed for use in network echo cancelers [5]. In comparison to the NLMS algorithm, PNLMS has very fast initial convergence and tracking when the echo path is sparse. As previously mentioned, the idea behind PNLMS is to update each coefficient of the filter independently of the others by adjusting the adaptation step size in proportion to the estimated filter coefficient. More recently, an improved PNLMS (IPNLMS) [9] was proposed that performs better than NLMS and PNLMS, whatever the nature of the impulse response. Table 1.2 summarizes the IPNLMS algorithm. In general, $g_l(n)$ in the table provides the "proportionate" scaling of the update.
Table 1.2. The improved proportionate NLMS algorithm.

Initialization:
  $h_l(0) = 0$, $l = 0, 1, \ldots, L-1$
Parameters:
  $0 < \alpha \le 1$, $\delta_{\mathrm{IPNLMS}} > 0$
  $-1 \le \kappa \le 1$
  $\varepsilon > 0$ (small number to avoid division by zero)
Error:
  $e(n+1) = y(n+1) - \mathbf{h}^T(n)\mathbf{x}(n+1)$
Update:
  $g_l(n) = \dfrac{1-\kappa}{2L} + (1+\kappa)\dfrac{|h_l(n)|}{2\|\mathbf{h}(n)\|_1 + \varepsilon}$, $l = 0, 1, \ldots, L-1$
  $\mu(n+1) = \dfrac{\alpha}{\sum_{j=0}^{L-1} x^2(n+1-j)g_j(n) + \delta_{\mathrm{IPNLMS}}}$
  $h_l(n+1) = h_l(n) + \mu(n+1)g_l(n)x(n+1-l)e(n+1)$, $l = 0, 1, \ldots, L-1$
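A minimal Python rendering of one iteration of Table 1.2 follows; the function name and the default value of $\varepsilon$ are illustrative assumptions.

```python
import numpy as np

def ipnlms_step(h, x_vec, y_n, alpha, delta, kappa, eps=1e-8):
    """One IPNLMS iteration following Table 1.2.
    kappa = -1 reduces to NLMS; kappa close to 1 behaves like PNLMS."""
    L = len(h)
    e = y_n - h @ x_vec
    g = (1.0 - kappa) / (2.0 * L) \
        + (1.0 + kappa) * np.abs(h) / (2.0 * np.sum(np.abs(h)) + eps)
    mu = alpha / (np.sum(g * x_vec**2) + delta)
    return h + mu * g * x_vec * e, e
```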
The parameter $\kappa$ controls the amount of proportionality in the update. For $\kappa = -1$, it can easily be checked that the IPNLMS and NLMS algorithms are identical. For $\kappa$ close to 1, the IPNLMS behaves like the PNLMS algorithm [5]. In practice, a good choice for $\kappa$ is 0 or $-0.5$.

How are the IPNLMS and EG± algorithms specifically related? In the rest of this section, we show that the IPNLMS is in fact an approximation of the EG±. For $|a| \ll 1$, we have:
$$\exp(a) \approx 1 + a. \qquad (1.53)$$
For μ small enough, the numerator and denominator of the EG± update equations can be approximated as follows: μ x(n + 1 − l)e(n + 1), (1.54) u μ rl− (n + 1) ≈ 1 − x(n + 1 − l)e(n + 1), (1.55) u L−1 μ + − − yˆ(n + 1)e(n + 1) [h+ j (n)rj (n + 1) + hj (n)rj (n + 1)] ≈ u + u j=0 rl+ (n + 1) ≈ 1 +
≈ u.
(1.56)
With these approximations, (1.34) and (1.35) become:

h_l^+(n + 1) = h_l^+(n) [1 + (μ/u) x(n + 1 − l) e(n + 1)],   (1.57)

h_l^-(n + 1) = h_l^-(n) [1 − (μ/u) x(n + 1 − l) e(n + 1)],   (1.58)

so that:

h_l(n + 1) = h_l^+(n + 1) − h_l^-(n + 1)
           = h_l(n) + μ [h_l^+(n) + h_l^-(n)] / [‖h^+(n)‖_1 + ‖h^-(n)‖_1] x(n + 1 − l) e(n + 1).   (1.59)
If the true impulse response h_t is sparse, it can be shown that if we choose u = ‖h_t‖_1, the (positive) vector h^+(n) + h^-(n) is also sparse after convergence. This means that the elements

[h_l^+(n) + h_l^-(n)] / [‖h^+(n)‖_1 + ‖h^-(n)‖_1]

in (1.59) play exactly the same role as the elements g_l(n) in the IPNLMS algorithm in the particular case where κ = 1 (PNLMS algorithm). As a result, we can expect the two algorithms (IPNLMS and EG±) to have similar performance. On the other hand, if u ≫ ‖h_t‖_1, it can be shown that h_l^+(n) + h_l^-(n) ≈ u/L, ∀l. In this case, the EG± algorithm will behave like the IPNLMS with κ = −1 (NLMS algorithm). Thus, the parameter κ in the IPNLMS operates like the parameter u in the EG±. However, the advantage of the IPNLMS is that no a priori information about the system impulse response is required in order to obtain a better convergence rate than the NLMS algorithm. Another clear advantage of the IPNLMS is that it is much less complex to implement than the EG±. We conclude that the IPNLMS is a good approximation of the EG± and is more useful in practice. Note also that the approximated EG± algorithm (1.59) belongs to the family of natural gradient algorithms [10], [11].
1.6 Application of the EG± Algorithm for Blind Channel Identification
The previous sections investigated the class of exponentiated gradient (EG) adaptive algorithms for estimating and tracking sparse impulse responses in the context of applications like echo cancellation, where the source signal is known a priori. But in many other cases, e.g., acoustic dereverberation and wireless communications, the source signal is either unobservable or very expensive to acquire, and a blind method is therefore required. Recently, the multichannel least-mean-square (MCLMS) algorithm was proposed for blindly identifying a single-input multiple-output (SIMO) FIR system [12]. In the development of the MCLMS algorithm, the cross relations between different channel outputs are utilized in a novel and systematic way, which facilitates the use of various adaptive methods. This section illustrates how
the concept of exponentiated gradient can be applied to blind channel identification with appropriate constraints. For a SIMO system, an a priori error signal e_ij(n + 1) based on the i-th and j-th observations at time n + 1 is defined as follows:

e_ij(n + 1) = x_i^T(n + 1) h_j(n) − x_j^T(n + 1) h_i(n), i ≠ j;   e_ij(n + 1) = 0, i = j;   i, j = 1, ..., M,   (1.60)

where

x_i(n + 1) = [x_i(n + 1)  x_i(n)  ···  x_i(n − L + 2)]^T

is a vector containing the last L samples of the i-th channel output x_i, x_i(n + 1) = h_{t,i}^T s(n + 1) is the output of the i-th channel at time n + 1,

h_{t,i} = [h_{t,i,0}  h_{t,i,1}  ···  h_{t,i,L−1}]^T

is the true impulse response of the i-th channel, L is set to the length of the longest channel impulse response by assumption,

s(n + 1) = [s(n + 1)  s(n)  ···  s(n − L + 2)]^T

is a vector containing the last L samples of the source signal s,

h_i(n) = [h_{i,0}(n)  h_{i,1}(n)  ···  h_{i,L−1}(n)]^T

is the i-th model filter at time n, and M is the number of channels in the studied system. Accordingly, the a posteriori error signal is given by

ε_ij(n + 1) = x_i^T(n + 1) h_j(n + 1) − x_j^T(n + 1) h_i(n + 1), i ≠ j;   ε_ij(n + 1) = 0, i = j,   (1.61)

and a multichannel cost function similar to (1.4) is constructed as:

J_M[h(n + 1)] = d[h(n + 1), h(n)] + (η/u) χ(n + 1),   (1.62)

where

h(n + 1) = [h_1^T(n + 1)  h_2^T(n + 1)  ···  h_M^T(n + 1)]^T,

and

χ(n + 1) = Σ_{i=1}^{M−1} Σ_{j=i+1}^{M} ε²_ij(n + 1).

Therefore, by minimizing J_M[h(n + 1)], an adaptive algorithm can be derived to update the model filter for blind multichannel identification. Taking the derivative of (1.62) with respect to h(n + 1) yields

∂d[h(n + 1), h(n)] / ∂h(n + 1) + (η/u) ∇χ(n + 1) = 0.   (1.63)
It was shown in [12] that

∇χ(n + 1) = 2 R̃(n + 1) h(n + 1),   (1.64)

where

R̃(n + 1) =
[ Σ_{i≠1} R̃_{x_i x_i}(n + 1)    −R̃_{x_2 x_1}(n + 1)    ···    −R̃_{x_M x_1}(n + 1) ]
[ −R̃_{x_1 x_2}(n + 1)    Σ_{i≠2} R̃_{x_i x_i}(n + 1)    ···    −R̃_{x_M x_2}(n + 1) ]
[ ⋮    ⋮    ⋱    ⋮ ]
[ −R̃_{x_1 x_M}(n + 1)    −R̃_{x_2 x_M}(n + 1)    ···    Σ_{i≠M} R̃_{x_i x_i}(n + 1) ],

R̃_{x_i x_j}(n + 1) = x_i(n + 1) x_j^T(n + 1),   i, j = 1, 2, ..., M.

For positive weights, we can use the Kullback-Leibler divergence as before. In order to avoid convergence of the adaptive algorithm to a trivial solution with all zero elements, the constraint Σ_l h_l(n + 1) = u is imposed. Then, by applying these and substituting (1.64) into (1.63), we get

∂d_re[h(n + 1), h(n)] / ∂h(n + 1) + (2η/u) R̃(n + 1) h(n + 1) + γ1 = 0,   (1.65)

where γ is again a Lagrange multiplier and 1 = [1 1 ··· 1]^T is a vector of ones. For simplicity in solving (1.65), we approximate h(n + 1) in the second term of (1.65) with h(n) and deduce the multichannel EG algorithm:

h_l(n + 1) = u h_l(n) r_l(n + 1) / Σ_{k=0}^{ML−1} h_k(n) r_k(n + 1),   l = 0, 1, ..., ML − 1,   (1.66)

where

r_l(n + 1) = exp[−(2η/u) f_l(n + 1)],

and f_l(n + 1) is the l-th element of the vector

f(n + 1) = R̃(n + 1) h(n).   (1.67)
For a system with both positive and negative filter coefficients, we can decompose the model filter impulse response h(n + 1) into two components h^+(n + 1) and h^-(n + 1) with positive coefficients, as in the previous sections [see (1.11)]. The cost function (1.62) then becomes:

J_M[h^+(n + 1), h^-(n + 1)] = d_re[h^+(n + 1), h^+(n)] + d_re[h^-(n + 1), h^-(n)] + [η/(u_1 + u_2)] χ(n + 1),   (1.68)

where u_1 and u_2 are two positive constants. Since h(n + 1) = 0 is an undesired solution, it is necessary to ensure that h^+(n + 1) and h^-(n + 1) would
not be equal to each other, from initialization and throughout the process of adaptation. Many methods can, of course, be used to enforce the inequality of h^+(n + 1) and h^-(n + 1). In this chapter, we propose the following constraints:

Σ_{l=0}^{ML−1} h_l^+(n + 1) = u_1 = κ‖h_t‖_1,   (1.69)

Σ_{l=0}^{ML−1} h_l^-(n + 1) = u_2 = (1 − κ)‖h_t‖_1,   (1.70)
where 0 < κ < 1 and κ ≠ 1/2. Utilizing these constraints and taking derivatives of (1.68) with respect to h^+(n + 1) and h^-(n + 1) respectively gives

ln[h_l^+(n + 1) / h_l^+(n)] + 1 + [2η/(u_1 + u_2)] f_l(n + 1) + γ_1 = 0,   (1.71)

ln[h_l^-(n + 1) / h_l^-(n)] + 1 − [2η/(u_1 + u_2)] f_l(n + 1) + γ_2 = 0,   (1.72)

where γ_1 and γ_2 are two Lagrange multipliers. Solving (1.71) and (1.72) for h_l^+(n + 1) and h_l^-(n + 1) respectively produces the multichannel EG± algorithm:

h_l^+(n + 1) = u_1 h_l^+(n) r_l^+(n + 1) / Σ_{k=0}^{ML−1} h_k^+(n) r_k^+(n + 1),   (1.73)

h_l^-(n + 1) = u_2 h_l^-(n) r_l^-(n + 1) / Σ_{k=0}^{ML−1} h_k^-(n) r_k^-(n + 1),   (1.74)

where

r_l^+(n + 1) = exp[−2η f_l(n + 1) / (u_1 + u_2)],
r_l^-(n + 1) = exp[2η f_l(n + 1) / (u_1 + u_2)] = 1 / r_l^+(n + 1).
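The following Python sketch (ours) applies one multichannel EG± iteration per (1.73)-(1.74). Rather than forming R̃(n + 1) explicitly, the vector f(n + 1) is built channel by channel; this rewriting follows from the block structure of R̃(n + 1) given above, and all names are our own.

# Sketch of one multichannel EG± iteration, (1.73)-(1.74); f(n+1) is built
# channel-by-channel instead of forming R~(n+1) explicitly. Names are ours.
import numpy as np

def mc_eg_step(Hp, Hm, X, u1, u2, eta):
    """Hp, Hm: M x L positive filter components; X: M x L matrix whose
    i-th row holds the last L samples of channel i at time n+1."""
    H = Hp - Hm                                    # current model filters
    f = np.zeros_like(H)
    M = H.shape[0]
    for i in range(M):
        for j in range(M):
            if j != i:
                # block i of R~ h is sum_j x_j (x_j^T h_i - x_i^T h_j)
                e_ji = X[j] @ H[i] - X[i] @ H[j]
                f[i] += X[j] * e_ji
    r_plus = np.exp(-2 * eta * f / (u1 + u2))      # exponentiated factors
    r_minus = 1.0 / r_plus
    Hp = u1 * Hp * r_plus / np.sum(Hp * r_plus)    # update (1.73)
    Hm = u2 * Hm * r_minus / np.sum(Hm * r_minus)  # update (1.74)
    return Hp, Hm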
1.7 Simulations
In this section, we compare, by way of simulation, the different algorithms developed in the previous sections. The first experiment considers the identification of a single-channel system when the input signal is known. The system impulse response h_t to be identified is shown in Fig. 1.1. It is sparse and of length L = 2048. The same length is used for all the adaptive filters
h(n). The sampling rate is 8 kHz, and a white noise signal is added to the output y(n) at a 30-dB SNR (signal-to-noise ratio). The input signal x(n) is either white Gaussian noise or a speech signal. The parameter settings chosen for all the simulations are:

• h_l(0) = 0, h_l^+(0) = h_l^-(0) = 1/2, l = 0, 1, ..., L − 1,
• α = 0.3, β = 1,
• κ = −0.5, ε = 0.001,
• λ = 1 − 1/(3L),
• δ = 20σ_x², δ_IPNLMS = (1 − κ)δ/(2L).
Figures 1.2–1.6 show the convergence of the normalized misalignment, ‖h_t − h(n)‖_2/‖h_t‖_2, for all the algorithms. The only simulation done with an input speech signal is shown in Fig. 1.6; all the others were done with a white Gaussian input signal. Figure 1.2 compares the NLMS, NEGU±, and NEG± algorithms when a large u (here u = 400‖h_t‖_1) is chosen for the NEG± algorithm. We see that the three algorithms behave exactly the same. In Fig. 1.3, we compare the NLMS, IPNLMS, and NEG± algorithms with u = 4‖h_t‖_1 for the NEG± algorithm. Clearly, the IPNLMS and NEG± algorithms converge much faster than the NLMS algorithm, while the IPNLMS and NEG± show similar performance. Figures 1.4 and 1.5 compare the algorithms in a tracking situation where, after 3 seconds, the sparse impulse response of Fig. 1.1 is shifted to the right by 50 samples. The other conditions of Fig. 1.4 are the same as in Fig. 1.3. According to this simulation, the IPNLMS and NEG± algorithms track much better than the NLMS algorithm. In Fig. 1.5, the FRLS algorithm is compared to the FERLS algorithm with u = 15‖h_t‖_1: while the initial convergence of the two algorithms is almost the same, the FERLS tracks faster than the FRLS. Finally, Fig. 1.6 shows the misalignment of the NLMS, IPNLMS, and NEG± algorithms with a speech source as input signal and u = 4‖h_t‖_1 for the NEG± algorithm. Again, the IPNLMS and NEG± algorithms behave better than the NLMS, with a small advantage for the IPNLMS over the NEG± because it is better normalized for a non-stationary input signal.

In the second experiment, we demonstrate how the EG± algorithm performs in blindly identifying a SIMO system. The system to be identified consists of M = 3 channels, and the impulse response of each channel has L = 15 taps. Figure 1.7 shows these three impulse responses. In each channel, a dominant component makes the impulse response sparse. A white Gaussian random sequence is used as the source signal to excite the system. The channel output is intentionally corrupted by additive white Gaussian noise at 50 dB SNR. In blind multichannel identification, any vector aligned with the true impulse response h_t is still a valid solution, even though the gains may differ. Therefore, a performance measure different from that in the
Fig. 1.1. Sparse impulse response used in simulations.
Fig. 1.2. Misalignment of the NLMS (++), NEGU± (xx), and NEG± (–) algorithms with white Gaussian noise as input signal and u = 400‖h_t‖_1 for the NEG± algorithm.
first experiment is used here. It is called the normalized projection misalignment (NPM) and is given at time n by [13]

NPM(n) = ‖ε(n)‖_2 / ‖h_t‖_2,   (1.75)
Fig. 1.3. Misalignment of the NLMS (++), IPNLMS (- -), and NEG± (–) algorithms with white Gaussian noise as input signal and u = 4‖h_t‖_1 for the NEG± algorithm.
Fig. 1.4. Misalignment during impulse response change. The impulse response changes at time 3 seconds. Other conditions same as in Fig. 1.3.
where

ε(n) = h_t − [h_t^T h(n) / (h^T(n) h(n))] h(n)
Fig. 1.5. Misalignment, during impulse response change, of the FRLS (–) and FERLS (- -) algorithms with white Gaussian noise as input signal and u = 15‖h_t‖_1 for the FERLS algorithm.
Fig. 1.6. Misalignment of the NLMS (++), IPNLMS (- -), and NEG± (–) algorithms with a speech source as input signal and u = 4‖h_t‖_1 for the NEG± algorithm.
is the projection misalignment vector. By projecting h_t onto h(n) and defining a projection error, we take into account only the intrinsic misalignment of the channel estimate, disregarding an arbitrary gain factor [13]. A comparison of the multichannel EG± and MCLMS algorithms is given in Fig. 1.8. For the
Fig. 1.7. Impulse responses of a single-input three-output system used in the simulation for blind multichannel identification.
MCLMS algorithm, the step size is μ = 1.6 × 10^{−4}. For the multichannel EG± algorithm, η = 3.5 × 10^{−2} and κ = 0.75. As the result clearly shows, both the multichannel EG± and the MCLMS algorithms are able to determine the channel impulse responses, and the multichannel EG± algorithm converges faster than the MCLMS algorithm.
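For completeness, the NPM of (1.75), with the projection error vector defined above, can be computed as in this short sketch (ours):

# Sketch of the NPM metric (1.75): project h_t onto h(n), keep only the
# orthogonal residual, and normalize; an arbitrary gain on the estimate
# therefore does not count as misalignment.
import numpy as np

def npm_db(h_true, h_est):
    eps_vec = h_true - (h_true @ h_est / (h_est @ h_est)) * h_est
    return 20 * np.log10(np.linalg.norm(eps_vec) / np.linalg.norm(h_true))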
1.8 Conclusions
It seems possible to exploit sparsity in adaptive algorithms. One of the first algorithms to do so was the PNLMS, proposed by Duttweiler in [5]. The PNLMS algorithm was introduced in the context of network echo cancellation, where there is a strong need to improve convergence rate and tracking. It has long been known that unknown echo paths in the network are usually sparse, and there are many different intuitions on how to take advantage of this fact. Kivinen and Warmuth [3] derived the EG± algorithm in the context of computational learning theory. We have shown here that a good approximation of the EG± leads to the PNLMS. As a result, the two algorithms have very similar performance in all the simulations we have done. We have also shown some links between the EG algorithms and LMS, so that, with an appropriate choice of some parameters, the different algorithms can be identical. In light of those links, several normalized EG algorithms were proposed. Finally, we developed the EG± algorithm for blind multichannel identification and proposed a feasibility constraint on the decomposed positive components of the channel impulse responses to avoid a trivial solution
Fig. 1.8. Normalized projection misalignment of the MCLMS (- -) and multichannel EG± (–) algorithms for identifying a single-input three-output system excited by white Gaussian noise, with additive white Gaussian noise at 50 dB SNR.
with all zero elements. It was shown with a simple example in the simulations that the proposed multichannel EG± algorithm converges faster than the MCLMS algorithm for sparse impulse responses. We note that, for the case of the Euclidean distance, minimization of (1.4) over a block of samples leads to the affine projection (AP) algorithm and, more generally, to a class of algorithms whose convergence performance is intermediate between LMS and RLS [7]. Likewise, we can suggest a similar generalization of the EG class of algorithms, which may lead to algorithms with advantageous properties. We leave this for future research.
References

1. B. Widrow and S. D. Stearns, Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, N.J., 1985.
2. S. Haykin, Adaptive Filter Theory. Fourth Edition, Prentice Hall, Upper Saddle River, N.J., 2002.
3. J. Kivinen and M. K. Warmuth, "Exponentiated gradient versus gradient descent for linear predictors," Inform. Comput., vol. 132, pp. 1-64, Jan. 1997.
4. J. Benesty, T. Gänsler, D. R. Morgan, M. M. Sondhi, and S. L. Gay, Advances in Network and Acoustic Echo Cancellation. Springer-Verlag, Berlin, 2001.
5. D. L. Duttweiler, "Proportionate normalized least mean square adaptation in echo cancelers," IEEE Trans. Speech Audio Processing, vol. 8, pp. 508-518, Sept. 2000.
6. S. I. Hill and R. C. Williamson, "Convergence of exponentiated gradient algorithms," IEEE Trans. Signal Processing, vol. 49, pp. 1208-1215, June 2001.
7. D. R. Morgan and S. G. Kratzer, "On a class of computationally efficient, rapidly converging, generalized NLMS algorithms," IEEE Signal Processing Lett., vol. 3, pp. 245-247, Aug. 1996.
8. S. Amari, "Natural gradient works efficiently in learning," Neural Computation, vol. 10, pp. 251-276, Feb. 1998.
9. J. Benesty and S. L. Gay, "An improved PNLMS algorithm," in Proc. IEEE ICASSP, 2002.
10. R. K. Martin, W. A. Sethares, R. C. Williamson, and C. R. Johnson, Jr., "Exploiting sparsity in adaptive filters," in Conference on Information Sciences and Systems, The Johns Hopkins University, 2001.
11. S. L. Gay and S. C. Douglas, "Normalized natural gradient adaptive filtering for sparse and nonsparse systems," in Proc. IEEE ICASSP, 2002.
12. Y. Huang and J. Benesty, "Adaptive multi-channel least mean square and Newton algorithms for blind channel identification," Signal Processing, vol. 82, pp. 1127-1138, Aug. 2002.
13. D. R. Morgan, J. Benesty, and M. M. Sondhi, "On the evaluation of estimated impulse responses," IEEE Signal Processing Lett., vol. 5, no. 7, pp. 174-176, July 1998.
2 Adaptive Feedback Cancellation in Hearing Aids

James M. Kates
Cirrus Logic, AudioLogic Design Center, 2465 Central Avenue, Suite 100, Boulder, CO 80301, USA
E-mail: [email protected]

Abstract. In feedback cancellation in hearing aids, an adaptive filter is used to model the feedback path. The output of the adaptive filter is subtracted from the microphone signal to cancel the acoustic and mechanical feedback picked up by the microphone, thus allowing more gain in the hearing aid. In general, the feedback cancellation filter adapts on the hearing-aid input signal, and signal cancellation and coloration artifacts can occur for a narrowband input. The goal in feedback cancellation is to design an algorithm that is computationally efficient, provides 10 dB or more of additional amplifier headroom, and is free from processing artifacts. This chapter begins with a steady-state analysis of the effects of acoustic feedback on the hearing-aid response. This analysis illustrates the factors that can affect the accuracy of the adaptive filter and the filter convergence for systems adapting with or without a probe signal. The characteristics of the feedback path being modeled are then described. An additional concern in designing signal processing for a hearing aid is the set of processing constraints imposed by a low-power portable device, and these concerns are discussed next. Two forms of constrained adaptation are then derived, and simulation results are used to give a comparison of the constrained adaptation with the conventional unconstrained approach for a system adapting without a probe signal. Further improvements in feedback cancellation performance, particularly a reduction in audible processing artifacts, can be achieved by using the filtered-X algorithm, and this approach is described next. The performance of feedback cancellation is ultimately limited by the characteristics of the room in which the listener is located, since the room reflections form part of the feedback path. The chapter concludes with a study of the feedback path in a room along with the feedback cancellation performance for an 8-, 16-, or 32-tap adaptive FIR filter in series with a five-pole nonadaptive filter, and the performance limitations imposed by the room reverberation are discussed.
2.1 Introduction
Mechanical and acoustic feedback limits the maximum gain that can be achieved in most hearing aids [1]. The acoustic feedback path includes the effects of the hearing-aid amplifier, receiver, and microphone as well as the acoustics of the vent or leak. System instability caused by feedback is sometimes audible as a continuous high-frequency tone or whistle emanating from the hearing aid. In most instruments, venting the earmold used with a behind-the-ear (BTE) hearing aid or venting the shell of an in-the-ear (ITE) hearing
aid establishes an acoustic feedback path that limits the maximum possible gain to less than 40 dB for a small vent, and even less for large vents [2]. In feedback cancellation, the feedback signal is estimated and subtracted from the microphone signal. The LMS adaptation algorithm [3] is generally used to adaptively estimate the feedback path in hearing aids, as opposed to more sophisticated adaptive algorithms, because of the limited processing power available in a battery-powered device. Computer simulations and prototype digital systems indicate that increases in gain of between 6 and 20 dB can be achieved in an adaptive system before the onset of oscillation, and no loss of high-frequency response is observed [4], [5], [6], [7], [8], [9]. In laboratory tests of a wearable digital hearing aid [10], a group of hearing-impaired subjects used an additional 4 dB of gain when adaptive feedback cancellation was engaged and showed significantly better speech recognition in quiet and in a background of speech babble. Field trials of a feedback-cancellation system built into a BTE hearing aid have shown increases of 8-10 dB in the gain used by severely-impaired subjects [11] and increases of 10-13 dB in the gain margin measured in real ears [12].

One approach to adaptation for estimating the feedback path is to use a noise sequence continuously injected at a low level while the hearing aid is in normal operation [5], [8], [11]. The weight update of the adaptive finite impulse response (FIR) filter proceeds on a continuous basis using the LMS algorithm. The difficulty with this approach is that the probe sequence reduces the signal-to-noise ratio (SNR) at the output of the hearing aid, thus compromising the sound quality. Adaptation using a probe signal injected during short intervals when the normal hearing-aid processing is turned off [6] would similarly degrade the sound quality by injecting noise and interrupting the audio output. Furthermore, the ability of such systems to cancel the feedback may be reduced by the presence of ambient noise or speech at the microphone input [6], [13].

Adaptive feedback cancellation without a probe signal is becoming the standard approach because it does not compromise the output noise level of the hearing aid. For a broadband input, such an adaptive system will generally converge to an accurate model of the feedback path. For a narrowband input, however, the convergence behavior of the adaptive system will be severely compromised as a result of the large eigenvalue spread in the excitation correlation matrix [14]. The correlation between the input and output signals of the hearing aid for a narrowband excitation leads to a bias in the adaptive solution unless adequate delay is provided in the hearing-aid processing [4], [15]. Frequency-domain adaptive algorithms can also be used for feedback cancellation without a probe signal [16], [17], [9]. The block delay provided by frequency-domain processing using the short-time Fourier transform provides the input-output signal decorrelation needed for the convergence of the feedback cancellation in these systems.
Feedback cancellation in hearing aids reduces the problems associated with acoustic feedback, but adaptive feedback cancellation systems are in turn subject to signal cancellation and coloration artifacts that can occur for a narrowband input signal. The tendency of the feedback cancellation to cancel a sinusoidal input will, in general, cause a large mismatch between the adaptive feedback path model and the actual feedback path [15]. This mismatch is normally characterized by a large increase in the magnitude of the adaptive FIR filter coefficients. The excessively large filter coefficients, when combined with the hearing-aid gain function, can result in system instability or coloration of the output signal due to large undesired changes in the system frequency response.

One procedure to maintain system stability given a narrowband input is to reduce the hearing-aid gain when a narrowband input is detected. Kaelin et al. [9] have proposed a block frequency-domain adaptive feedback cancellation system that uses a metric similar to the block-to-block coherence of the input signal to control the hearing-aid gain. If the metric exceeds a threshold, the gain is reduced. This approach, however, reduces the hearing-aid gain even if the adaptive filter is not misbehaving, and can reduce the audibility of narrowband sounds that the user may desire to hear. An effective approach that does not reduce the hearing-aid gain is constrained adaptation [18], which has been developed to allow for adaptation to changes in the feedback path while reducing processing artifacts. The filtered-X algorithm [19], [20] is also an effective approach for reducing the sensitivity of the feedback cancellation system to low-frequency inputs while allowing for rapid adaptation to feedback-path changes at high frequencies.

The amount of gain possible in a hearing aid is ultimately limited by the ability of the feedback cancellation system to model the actual feedback path. Room reverberation, which consists of multiple acoustic reflections at different amplitudes and time delays, can cause changes in the feedback path that can be difficult to model. Reflections cause peaks and valleys to appear in the feedback path frequency response [21]. The reflection pattern changes with position within the room, and imposes a fine structure on the feedback path response that the feedback cancellation system is unable to model [22]. The room reverberation thus limits the potential benefit of feedback cancellation in a practical device.

This chapter begins with a steady-state analysis of the effects of acoustic feedback on the hearing-aid response. This analysis illustrates the factors that can affect the accuracy of the adaptive filter and the filter convergence for systems adapting with or without a probe signal. The characteristics of the feedback path being modeled are then described. An additional concern in designing signal processing for a hearing aid is the set of processing constraints imposed by a low-power portable device, and these concerns are discussed next. Two forms of constrained adaptation are then derived, and simulation results are used to give a comparison of the constrained adaptation with
the conventional unconstrained approach for a system adapting without a probe signal. Further improvements in feedback cancellation performance, particularly in reducing the audible processing artifacts, can be achieved by using the filtered-X algorithm, and this approach is described next. The performance of feedback cancellation is ultimately limited by the characteristics of the room in which the listener is located since the room reflections form part of the feedback path. The chapter concludes with a study of the feedback path in a room along with the measured feedback cancellation performance for an adaptive FIR filter in series with a five-pole nonadaptive filter, and the performance limitations imposed by the room reverberation are discussed.
2.2 Steady-State Analysis
A block diagram for a generic feedback-cancellation system is presented in Fig. 2.1. The system is assumed to be in steady state, with the adaptive filter weights having reached their asymptotic values; the steady-state analysis does not depend on the procedure that was used to update the adaptive filter weights. The feedback cancellation is applied in an adaptive processing loop outside the hearing-aid processing h(n) intended to ameliorate the hearing loss, and the feedback cancellation has adapted to minimize the error signal e(n). The probe signal used in some systems ([5], [8], [11]) is given by q(n). Those systems that do not use a probe signal ([4], [16], [17], [15], [9], [18]) can be represented by setting q(n) = 0 in the resultant system equations. The system of Kates [6] that uses an intermittent probe signal can be represented by setting q(n) = 0 during normal hearing-aid operation, and setting h(n) = 0 during the time period when adaptation is performed.

The input to the hearing-aid processing is s(n), which is the sum of the desired input signal x(n) and the feedback signal f(n). The processed hearing-aid signal is g(n), which when combined with the optional probe signal gives the amplifier input signal u(n). The amplifier impulse response is given by a(n), the receiver impulse response by r(n), and the microphone impulse response by m(n). The adaptive filter weights are given by w(n), and the signal in the ear canal is y(n). The feedback path impulse response b(n) includes both the acoustic and mechanical feedback, although acoustic feedback is assumed to dominate. The acoustic feedback path through the vent tends to have a high-pass behavior, and the acoustic feed-forward path c(n) through the vent from the pinna to the ear canal tends to have a low-pass filter behavior [23], [2]. A steady-state analysis [18] yields equations for the output signal y(n), the microphone output signal s(n), and the error signal e(n). The results of the analysis are presented in the frequency domain, denoted by upper-
Fig. 2.1. Block diagram for a generic feedback cancellation system in steady-state operation. The adaptive weights are assumed to have reached their asymptotic values.
case variables, and the frequency variable ω is suppressed to save space. The output signal is given by:

Y = {Q·A·R + X[C + H(W·C + M·A·R)]} / {(1 − B·C) − H[M·A·R·B − W(1 − B·C)]}.   (2.1)

Because C is a low-pass response and B is a high-pass response with reduced gain, one can safely assume that the product |B·C| ≪ 1. This assumption leads to a useful approximate solution:

Y ≈ {Q·A·R + X[C + H(W·C + M·A·R)]} / [1 − H(M·A·R·B − W)].   (2.2)

Thus the output consists of the probe signal (if used) filtered by the amplifier and receiver, plus the microphone input modified by the vent feed-forward path and the hearing-aid processing. The denominator of (2.2) shows that the system will be stable if either the gain of the hearing-aid processing H is low or the feedback cancellation filter W comes close to canceling the feedback path M·A·R·B. Stability is guaranteed by the Nyquist criterion if

|H(M·A·R·B − W)| < 1.   (2.3)
The microphone output signal s(n) is the combination of the microphone input and the feedback path output. The steady-state solution is given by:

S = {X[M(1 + W·H)] + Q(M·A·R·B)} / {(1 − B·C) − H[M·A·R·B − W(1 − B·C)]}.   (2.4)
The approximate solution for |B·C| ≪ 1 is given by:

S ≈ {X[M(1 + W·H)] + Q(M·A·R·B)} / [1 − H(M·A·R·B − W)].   (2.5)
Thus the signal input to the hearing-aid processing h(n) can be colored by a mismatch, in the denominator of (2.5), between the feedback path and the filter modeling it, and can also be affected by the hearing-aid gain and the feedback path model if the product W·H approaches unit magnitude. The error signal used to control the feedback-cancellation filter adaptation is given by:

E = {X·M + Q[M·A·R·B − W(1 − B·C)]} / {(1 − B·C) − H[M·A·R·B − W(1 − B·C)]}.   (2.6)
Again making the assumption that |B·C| ≪ 1 leads to:

E ≈ [X·M + Q(M·A·R·B − W)] / [1 − H(M·A·R·B − W)].   (2.7)
Exact feedback cancellation occurs when W = (M·A·R·B)/(1 − B·C), or approximately for W ≈ M·A·R·B. System stability requires a very close match between the feedback path and the cancellation filter when the hearing-aid processing gain is high, and can tolerate poorer matches for low processing gains.

The error function given by (2.7) indicates how the system configuration will affect the convergence of the adaptive filter. The best convergence will occur for an open-loop system (the hearing-aid processing is turned off by setting H = 0), such as the intermittent adaptation proposed by Kates [6], because an open-loop system removes the effects of the denominator. Rapid adaptation in an open-loop system when using a probe signal requires |Q| ≫ |X|, which is best achieved by injecting an intense probe signal in a quiet environment. Adapting in the presence of a stronger ambient signal, giving |Q| ≤ |X|, will lead to slower convergence because the error signal will be noisier, although the system will still tend to converge to the desired model of the feedback path. Thus the signal-to-noise ratio (SNR) that affects the convergence behavior in an open-loop system is the ratio of the probe signal power to the power in the microphone input signal. The importance of the SNR in the performance of the adaptive system was observed by Kates [6] and has also been discussed by Knecht [17].

Despite the advantages of using a probe signal and adapting with the hearing-aid processing turned off, a practical hearing aid cannot use a probe and must adapt while in closed-loop operation. Adaptation during closed-loop operation without a probe signal (the probe is turned off by setting Q = 0) is problematic because the error can be minimized either by reducing the magnitude of the numerator or by increasing the magnitude of the denominator of (2.7). For a sinusoidal input, the magnitude of the denominator can be
driven large at the frequency of the sinusoid, converting (2.7) into a notch filter. This mode of operation can lead to cancellation of a sinusoidal input signal. Periodic input signals characterized by a limited number of spectral lines are also in danger of cancellation. The greater the number of coefficients in the adaptive FIR filter, the greater the number of sinusoidal components that can be cancelled.
2.3 The Feedback Path
The characteristics of the acoustic feedback path were measured for a ReSound BTE hearing aid attached to an AudioLogic Audallion processing unit containing a Motorola 56009 DSP with 12-bit A/D and 16-bit D/A converters. A white noise sequence was used to excite the receiver, and the response at the microphone was recorded. A sampling rate of 15.625 kHz was used. The feedback system impulse response was computed by cross-correlating the probe sequence with the microphone output. The frequency response was then computed by using an FFT of the windowed impulse response. The windowing involved removing the initial 60-sample system delay in the impulse response and then limiting the response to 256 samples using a window function that is unity for the first 128 samples followed by a half cycle of a cosine to effect the transition from 1 to 0 over the remaining 128 samples.

Two sets of BTE-vent test conditions were used. The conditions used the "Eddi" manikin (Earmold Design, Inc.) to simulate in-situ BTE placement on the head and placement of the earmold in the ear canal. The dummy head has a pinna and ear canal but lacks an ear simulator to terminate the ear canal with the proper acoustic impedance. For the first condition, an unvented earmold was used, giving a tight fit in the ear canal. A telephone handset was then placed at varying distances from the side of the head. Responses were recorded for the handset removed, and in 1-cm increments from 5 cm from the ear to up against the ear. For the second condition a vented earmold was used.

The resulting frequency responses are shown in Figs. 2.2 and 2.3 for the two test conditions. The frequency response functions give the magnitude of the signal at the hearing-aid microphone relative to the magnitude of the digital probe signal sent to the amplifier. The measured feedback path includes the hearing-aid power amplifier, receiver, and microphone in addition to the acoustic and mechanical paths through and around the vent and earmold. The different head-mounted vent conditions give quite different transfer functions, although two or three peaks (2-3 complex pole pairs) would be adequate to characterize the dominant shape for each of them. The magnitude of the transfer function for the vented earmold is about 30 dB greater than the transfer function for the unvented earmold, illustrating the increase in acoustic feedback that occurs when a vent is used. The response peak for the vented earmold occurs in the vicinity of 4 to 5 kHz, while the peak for the
Fig. 2.2. Feedback path magnitude frequency response for a telephone handset removed (dashed line) and placed near the ear (solid line) for a behind-the-ear (BTE) hearing aid connected to an unvented earmold on a dummy head.
unvented earmold occurs in the vicinity of 2.5 to 4 kHz. These differences between the two earmolds illustrate the variety of possible feedback paths and the need to estimate the feedback path model parameters for each individual fitting. Bringing the telephone handset up to the ear results in an increase of about 10 dB in the amplitude of the transfer functions independent of the type of earmold used. The shape of the transfer function in the vicinity of the peaks is essentially unchanged by the presence of the handset in these examples, but it is also possible that placing the handset close to the aided ear will create a resonant cavity that will cause substantial changes to the shape of the transfer function. The results for the intermediate handset positions (not shown) show many peaks and valleys in the responses caused by reflections between the head, handset, and the table top on which the dummy head was placed.
2.4 Real-World Processing Concerns
The feedback cancellation system estimates the feedback path response and subtracts it from the incoming signal in the hearing aid. There are many adaptive algorithms that could be used to solve this problem, but the constraints on algorithm complexity in a practical hearing aid greatly limit the algorithm choices. Hence the use of the LMS algorithm [3] for the adaptive feedback cancellation. The main concerns in a hearing aid are the small
Fig. 2.3. Feedback path magnitude frequency response for a telephone handset removed (dashed line) and placed near the ear (solid line) for a behind-the-ear (BTE) hearing aid connected to a vented earmold on a dummy head.
amount of space available for the circuitry and the required battery life. In order to fit into a hearing aid, the entire circuit with all components should be no larger than 0.5 cm². Market forces require that the hearing aid be able to run for at least 50 hours on a 1.3-V battery the size of an aspirin tablet, which means a typical current drain of 0.5-1.5 mA. These size and power considerations lead to a small digital signal processing (DSP) chip operating at a low clock rate. Typical specifications for the DSP in a commercial digital hearing aid are a Harvard architecture with a processing speed of 2 to 4 million instructions per second (MIPS), approximately 1024 words of program memory, and 1024 words each of x- and y-data memory. While these specifications will improve with time, they will always remain several orders of magnitude below what is available in off-the-shelf DSP chips that are intended for main-stream applications. The processor uses a 16-bit fixed-point data word size, and increases in the data word size are not expected in the near future because of the power and chip real-estate constraints.

The DSP in a hearing aid is therefore a small, low-speed device using 16-bit fixed-point arithmetic. The number of lines of machine code needed for an algorithm and the execution time are very important considerations because of the small amount of memory and the low processing speed. Quantization effects, round-off noise, and signal dynamic range are also important considerations because of the 16-bit fixed-point word size.
Algorithm development for hearing aids thus requires an awareness of modern signal-processing techniques combined with a need to reconsider the "hot topics" of 30 years ago, when people were first implementing DSP algorithms on slow computers having limited amounts of memory.
2.5 Feedback Cancellation System
The feedback cancellation system is intended to compensate for acoustic feedback occurring in the vicinity of the ear and head. The feedback cancellation approach divides the feedback path model into two filters in series [22], as sketched below. The first filter represents those aspects of the feedback path that are expected to change very slowly, such as the microphone, amplifier, and receiver responses. The first part of the model is thus initialized when the hearing aid is fitted to the user, and then held constant while the hearing aid is in use. A typical first filter is an all-pole IIR filter having five poles. The second filter represents those aspects of the feedback path that can change rapidly, such as changes resulting from a shift in the position of the hearing aid within the ear canal. The second filter is initialized when the hearing aid is fitted to the user and then adaptively updated while the hearing aid is in use. The second filter is long enough to encompass the reflection time delays expected when a telephone receiver is positioned close to the aided ear or a hat is put on, but it is much shorter than the typical reverberation time in a room.
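A rough illustration of this two-filter cascade follows (our sketch; the scipy-based filtering and all names are assumptions, and the DC-zero block described later in this section is omitted for brevity).

# Illustrative sketch (ours) of the two-filter feedback-path model: an
# overall delay, a fixed five-pole all-pole IIR section, and a short
# adaptive FIR section applied to the IIR output.
import numpy as np
from scipy.signal import lfilter

def feedback_model(u, a_poles, b_fir, delay):
    """u: hearing-aid output samples; a_poles: fixed denominator
    [1, a1, ..., a5]; b_fir: adaptive FIR taps; delay: path delay."""
    u_del = np.concatenate([np.zeros(delay), u])[:len(u)]  # bulk delay
    d = lfilter([1.0], a_poles, u_del)    # fixed (frozen) all-pole section
    v = lfilter(b_fir, [1.0], d)          # adaptive FIR section
    return v, d                           # v: feedback estimate; d: FIR input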
2.5.1 Initialization
The initial parameter estimation procedure uses the hearing aid to acquire the raw data needed to describe the feedback path impulse response. All other signal processing tasks are performed by the host computer that communicates with the hearing aid during the initialization. The host computer generates one period of a periodic probe sequence, which is downloaded to the hearing aid. The code in the hearing aid then records the signal at the hearing-aid microphone while reading out successive periods of the probe sequence as shown in Fig. 2.4. The accumulated microphone signal is uploaded to the host computer, which calculates the feedback path impulse response. The feedback path signal delay is extracted from the impulse response along with an estimate of the response signal-to-noise ratio. Using the estimated delay, the poles and zeros of a filter modeling the impulse response are then determined using system identification techniques. A description of each of these operations is given below.

Probe Sequence and Impulse Response. A periodic maximal-length sequence is used to measure the feedback path impulse response. Such sequences have been successfully used to measure hearing-aid responses in laboratory
Fig. 2.4. Block diagram of the system used to make the initial measurement of the feedback path impulse response.
measurement systems [24]. A maximal-length sequence (MLS) has only two values, +1 and −1. It is generated by an N-stage shift register, and the period length is M = 2^N − 1 [25]. One period of the sequence is generated and amplitude-scaled in the host computer, which then downloads the scaled sequence to the hearing aid. To acquire the feedback path impulse response, the probe is read out from a circular buffer, giving a periodic excitation, and the microphone response is synchronously summed into a circular buffer having the same length as the probe signal. Because coherent averaging is used to acquire the microphone signal, the SNR for the feedback path response relative to random room noise is improved 3 dB for every doubling of the number of periods used for the excitation. Periodic room noise, such as 50- or 60-Hz hum from electrical machinery, will also be reduced by the averaging as long as the hum is not synchronized with the periodic MLS signal. Nonlinear distortion, however, can cause spurious bumps and peaks to be distributed throughout the measured impulse response [26], [27].

The M-point circular autocorrelation of the M-point MLS has the property that it is M for zero lag and −1 for all other lags. To extract the feedback path impulse response, the summed microphone signal is divided by the number of periods used for the excitation and then circularly cross-correlated with one period of the MLS [28], [29], [30]. The result is then adjusted for the amplitude of the periodic excitation. For the present system, the MLS has a period of M = 255 samples (approximately 16 msec), which is the longest possible sequence given the data memory constraints in the real-time hearing aid system. The result of the circular correlation is the impulse response of the feedback path, including the overall path delay and any residual room noise.
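The following sketch (ours) illustrates the measurement just described: MLS generation from an N-stage shift register and impulse-response recovery by an FFT-based circular cross-correlation. The feedback taps shown are one standard primitive-polynomial choice for N = 8, and the small off-peak (−1) term of the MLS autocorrelation is neglected; everything beyond the text is an assumption.

# Sketch of MLS generation (N = 8, M = 255) and impulse-response recovery
# by circular cross-correlation; taps (8, 6, 5, 4) correspond to one
# primitive polynomial of degree 8. Details beyond the text are ours.
import numpy as np

def mls(N=8, taps=(8, 6, 5, 4)):
    reg = np.ones(N, dtype=int)
    seq = np.empty(2**N - 1)
    for i in range(seq.size):
        seq[i] = 2 * reg[-1] - 1        # map shift-register {0,1} to {-1,+1}
        fb = 0
        for t in taps:                   # XOR of the tapped stages
            fb ^= reg[t - 1]
        reg = np.roll(reg, 1)
        reg[0] = fb
    return seq

def impulse_response(mic_sum, probe, periods):
    """Average the synchronously summed mic signal, then circularly
    cross-correlate with one MLS period (autocorrelation ~ M * delta)."""
    M = probe.size
    avg = mic_sum / periods
    corr = np.fft.ifft(np.fft.fft(avg) * np.conj(np.fft.fft(probe))).real
    return corr / M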
Peak-to-Noise Ratio. The peak-to-noise ratio (PNR) is given by the ratio of the amplitude of the peak of the impulse response to the RMS noise level in the tail of the impulse response. The peak is the part of the signal that is the least corrupted by the additive room noise and therefore gives the best estimate of the feedback path signal power. The impulse response is a sequence of 255 samples, but most of its energy is contained in the first 32 samples following the overall delay. Thus the tail of the impulse response consists almost entirely of room noise and reverberation, and the RMS room noise is computed from the last 64 samples in the impulse response. The peak-to-RMS room noise ratio thus gives an indication of the quality of the estimated feedback path impulse response. A peak-to-noise ratio of 30 dB or better has been empirically found to yield good estimates of the response poles and zeros. The peak-to-noise ratio is limited by the properties of the MLS circular convolution to a maximum of about 48 dB.

Feedback Path Delay. The feedback path delay is determined using a two-step process. The delay is first estimated by finding the peak of the impulse response and then counting backwards a fixed number of samples from the peak to locate the approximate start of the impulse response. The peak of the impulse response is used to determine the delay because it is the most robust portion of the signal in the presence of additive noise. A search is then performed using candidate delay values above and below the approximate start delay estimated from the peak. The delay value yielding the best pole-zero model fit to the measured impulse response is then selected as the feedback path delay.

Poles and Zeros. The poles and zeros of the filter modeling the feedback path are determined from the impulse response using the ARX procedure of Ljung [31]. The ARX algorithm is a joint optimization procedure in which the pole and zero coefficients are determined at the same time. Define the following sequences:

r(n) = the feedback path impulse response.
q(n) = a segment of white noise.
t(n) = the noise sequence q(n) filtered through the DC zero filter D(z) = 1 − z^{−1}. This high-pass filtering duplicates the DC zero filter used in the running adaptation to remove the DC bias in the sampled data.
z(n) = r(n) ∗ t(n), the filtered noise convolved with the impulse response.

The ARX procedure is used to find the filter which, when convolved with t(n), produces the closest match to z(n). Define the regression vector

ϕ(n) = [−z(n − 1), −z(n − 2), ···, −z(n − m_a), t(n − ℓ), t(n − ℓ − 1), ···, t(n − ℓ − m_b + 1)]^T,   (2.8)
where m_a is the number of poles in the model, m_b is the number of zero coefficients (number of zeros plus 1), and ℓ is the estimated feedback path delay. The optimal coefficient vector is then found by solving the set of linear equations Pθ = c for the coefficient vector θ, where

P = Σ_{n=0}^{N−1} ϕ(n) ϕ^T(n),   c = Σ_{n=0}^{N−1} ϕ(n) z(n).   (2.9)
The correlation matrix P is symmetric positive definite, so the set of linear equations is solved using the Cholesky factorization of P followed by forward and back substitution. The pole coefficients a are given by the first m_a entries in θ, and the zero coefficients b are given by the remaining m_b entries in θ. The modeled feedback path filter transfer function is then given by

W(z) = z^{−ℓ} (1 − z^{−1}) [Σ_{m=0}^{m_b−1} b_m z^{−m}] / [1 + Σ_{k=1}^{m_a} a_k z^{−k}].   (2.10)

The resultant model can be considered to be a delay and a DC zero followed by an all-pole IIR filter in series with an FIR filter.
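A rough Python rendering of this fit (ours; the symbol ℓ for the delay follows the reconstruction above, and model orders and loop bounds are illustrative) builds the regression vectors of (2.8) and solves the normal equations (2.9) by Cholesky factorization.

# Sketch of the ARX fit: regression vectors per (2.8), normal equations
# per (2.9), Cholesky solve. Orders and the loop bounds are ours.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def arx_fit(z, t, ma, mb, ell):
    """z: impulse response convolved with filtered noise; t: filtered
    noise; ma poles, mb zero coefficients, ell: estimated path delay."""
    rows, rhs = [], []
    for n in range(max(ma + 1, ell + mb), len(z)):
        phi = np.concatenate([-z[n - 1:n - ma - 1:-1],       # -z(n-1)..-z(n-ma)
                              t[n - ell:n - ell - mb:-1]])   # t(n-ell)..t(n-ell-mb+1)
        rows.append(phi)
        rhs.append(z[n])
    Phi = np.asarray(rows)
    P = Phi.T @ Phi                        # correlation matrix of (2.9)
    c = Phi.T @ np.asarray(rhs)
    theta = cho_solve(cho_factor(P), c)    # Cholesky, forward/back subst.
    return theta[:ma], theta[ma:]          # pole coeffs a, zero coeffs b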
2.5.2 Running Adaptation
The adaptive feedback cancellation, shown in Fig. 2.5, uses LMS adaptation [3] to adjust the zeros of the FIR filter that forms part of the model of the feedback path. The poles of the all-pole IIR filter remain frozen at the values determined during initialization. The adaptation proceeds closed-loop, and no additional noise is injected. The adaptive processing is implemented in block form, with the adaptive coefficients updated once for each block of data. The LMS adaptation over the block of data minimizes the error signal given by

ε(m) = Σ_{n=0}^{N−1} e_n²(m) = Σ_{n=0}^{N−1} [s_n(m) − v_n(m)]²,   (2.11)
where s_n(m) is the microphone input signal and v_n(m) is the output of the FIR filter for block m and sample n within the block, and there are N samples per block. The constrained adaptation described by Kates [18] can also be used. The LMS coefficient update for the set of adaptive FIR filter zero coefficients {b_k} is given by

b_k(m + 1) = b_k(m) + 2μ Σ_{n=0}^{N−1} e_n(m) d_{n−k}(m),   (2.12)
Fig. 2.5. Block diagram of the run-time adaptive feedback-cancellation system.
where d_{n−k}(m) is the input to the adaptive filter, delayed by k samples, for block m. Power normalization of the adaptation step size gives improved system performance and is implemented by setting

μ = μ_0 / [d^T(n) d(n)],   (2.13)

where d(n) is the vector of present and past adaptive filter input samples.
2.5.3 Performance Metric
A useful performance metric is the estimated maximum stable gain (MSG). Kates [18] showed that the hearing aid will be stable if the following condition is met:

|H(ω)| |W(ω) − F(ω)| < 1,   (2.14)

where H(ω) is the hearing-aid gain, W(ω) is the feedback path model, and F(ω) is the actual feedback path response. W(ω) is given by (2.10) with z^{−1} = e^{−jω}. The maximum stable hearing-aid gain as a function of frequency is then

|H(ω)| < 1 / |W(ω) − F(ω)|.   (2.15)

The maximum stable gain (MSG) is the maximum allowable gain value assuming a flat frequency response for the hearing aid:

MSG = min_ω [1 / |W(ω) − F(ω)|].   (2.16)
The MSG is therefore determined by the frequency at which the mismatch between the feedback model and the actual feedback path is greatest. If no feedback cancellation is used, W (ω) = 0 and the MSG will be determined by the peak of the measured feedback path response F (ω). In general, the greatest mismatch will occur in the vicinity of the peaks in the measured feedback path frequency response.
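Numerically, (2.14)-(2.16) can be evaluated as in this sketch (ours; the freqz-based frequency grid and all names are assumptions, and f_meas is the measured response on the same grid).

# Sketch of the maximum-stable-gain metric (2.16): evaluate the model
# response W of (2.10) on a frequency grid and take the reciprocal of the
# worst-case mismatch against the measured feedback response.
import numpy as np
from scipy.signal import freqz

def max_stable_gain_db(b, a, ell, f_meas, n_fft=512):
    """b, a: FIR/IIR coefficients of the model; ell: path delay in samples;
    f_meas: measured complex feedback response on the n_fft//2+1 grid."""
    w, W = freqz(b, a, worN=n_fft // 2 + 1)
    W *= np.exp(-1j * w * ell) * (1 - np.exp(-1j * w))  # delay and DC zero
    msg = 1.0 / np.max(np.abs(W - f_meas))              # (2.15)-(2.16)
    return 20 * np.log10(msg)                           # MSG in dB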
2.6 Constrained Adaptation
The stability of the hearing aid will be maintained if the gain and feedback cancellation filter are constrained so that (2.3) is satisfied. This constraint requires that the physical feedback path transfer function be known at all times. However, the only information available during adaptation comprises the feedback cancellation filter coefficients and the hearing-aid gain. Further information is supplied by performing an initialization during the hearing-aid fitting; the initialization provides a reference set of adaptive filter coefficients obtained under controlled conditions. The reference set of coefficients can be determined by letting the system adapt in the presence of a white noise source until the adaptive filter coefficients reach steady state, or it can be determined using the system identification procedures described in Sect. 2.5.1. Even so, the information needed for an exact constraint is unavailable, and approximate constraints must be derived instead.

Two approaches to constrained adaptation are derived in this section [18]. Both approaches are designed to prevent the adaptive filter coefficients from deviating too far from the reference coefficients. In the first approach, the distance of the adaptive filter coefficients from the reference coefficients is determined, and the norm of the vector formed by the set of adaptive filter coefficients is clamped to prevent the distance from exceeding a pre-set threshold. In the second approach, a cost function is used in the adaptation to penalize excessive deviation of the adaptive filter coefficients from the reference coefficients.
2.6.1 Adaptation with Clamp
The stability of the hearing aid will be maintained if the gain and feedback cancellation filter are constrained so that (2.3) is satisfied. The transfer function of the actual feedback path is not available, so the feedback path is approximated instead by the reference set of adaptive filter coefficients determined during initialization. This approximation leads to

|H(ω, m)| |W(ω, m) − W(ω, 0)| ≪ 1   (2.17)
for block time-domain processing, where H(ω, m) is the hearing-aid gain transfer function for block m, W(ω, m) is the adaptive filter transfer function for block m, and W(ω, 0) is the transfer function for the reference adaptive
filter coefficients. The less-than sign in (2.3) has been replaced by much-less-than in (2.17) for safety, because the exact feedback path of (2.3) has been replaced by the approximate initial model fit to the feedback path. Starting with the definition of the discrete Fourier transform, one has:

|H(ω, m)| = |Σ_{p=0}^{P−1} h_p(m) e^{−jωp}| ≤ Σ_{p=0}^{P−1} |h_p(m)|

and

|W(ω, m) − W(ω, 0)| = |Σ_{k=0}^{K−1} [w_k(m) − w_k(0)] e^{−jωk}| ≤ Σ_{k=0}^{K−1} |w_k(m) − w_k(0)|,   (2.18)

where h_p(m) are the P hearing-aid gain filter coefficients for block m, w_k(m) are the K adaptive feedback cancellation filter coefficients for block m, and w_k(0) are the K reference adaptive feedback cancellation filter coefficients. Thus, the frequency-domain constraint of (2.17) can be rewritten in the time domain to get:

[Σ_{p=0}^{P−1} |h_p(m)|] [Σ_{k=0}^{K−1} |w_k(m) − w_k(0)|] < δ²,  where δ² ≪ 1.   (2.19)

The system stability can be maintained and coloration artifacts reduced by adjusting the vector formed by the set of hearing-aid gain filter coefficients to have a reduced 1-norm, adjusting the difference between the adaptive and initial feedback filter coefficient vectors to have a reduced 1-norm, or manipulating both sets of coefficients in combination. In general, one wants the tightest bound on the adaptive filter coefficients that still allows the system to adapt to expected changes in the feedback path, such as those caused by the presence of a telephone handset. The measurements of the feedback path shown in Sect. 2.3 indicate that the path response changes by about 10 dB in magnitude when a telephone handset is placed near the aided ear, and that this relative change is independent of the type of earmold used. The constraint on the norm of the adaptive filter coefficients can thus be expressed as

Σ_{k=0}^{K−1} |w_k(m) − w_k(0)| / Σ_{k=0}^{K−1} |w_k(0)| < γ,   (2.20)

where γ ≈ 2 gives the desired 10-dB headroom above the reference condition.

The situation becomes more complicated when the hearing aid incorporates dynamic-range compression. Assume that the hearing-aid processing consists of a broadband compressor, giving a time-varying gain function h(m).
The stability constraint of (2.20) may be inadequate when the hearing aid is providing high gain at low signal levels, and a reduction in the gain may therefore be needed in addition to the constraint on the filter coefficients. The gain is to be reduced when instability is possible in the hearing aid, so an increase in the norm of the filter coefficients can be used to trigger the gain reduction. The resultant algorithm for the constraint after the adaptive weight update has been performed is given below; a sketch of these steps appears at the end of this subsection.

0. Compute ξ = γ Σ_{k=0}^{K−1} |w_k(0)|.
1. If Σ_{k=0}^{K−1} |w_k(m) − w_k(0)| < ξ, set ŵ(m) = w(m);
   else set ŵ(m) = w(0) + [w(m) − w(0)] ξ / Σ_{k=0}^{K−1} |w_k(m) − w_k(0)|.
2. If Σ_{k=0}^{K−1} |w_k(m) − w_k(0)| ≤ α Σ_{k=0}^{K−1} |w_k(0)|, set ĥ(m) = h(m);
   else ĥ(m) is reduced monotonically as the norm of the coefficient difference increases.
3. Replace the adaptive filter coefficients w(m) with ŵ(m), and replace the compressor gain h(m) with ĥ(m).

Step 0 is performed at initialization, while steps 1-3 are performed for each block of data. Step 1 of the algorithm provides a clamp on the norm of the adaptive filter coefficient vector to prevent excessive vector growth and the associated coloration and stability problems. Step 2 reduces the hearing-aid gain to provide a further margin of stability if needed. The gain reduction in step 2 of the algorithm is intentionally left vague because the amount of gain reduction will depend on the hearing-aid compressor behavior and the audibility of the gain fluctuations introduced by the algorithm.

The algorithm for adaptation with a clamp given above is related to the scaled projection algorithm used in adaptive beamforming [32]. In the scaled projection algorithm, the weights for adaptive beamforming are divided into a non-adaptive set of weights that gives unit gain in the desired look direction plus an adaptive set of weights that adjusts to the spatial characteristics of the incoming signal. A projection matrix is used to remove any component of the adaptive weights that would lie parallel to the non-adaptive weights, thus minimizing cumulative roundoff errors, and the 2-norm of the adaptive weight vector is clamped to reduce the sensitivity of the adaptive system to correlated interference and errors in the assumed sensor locations. The clamp operation in the scaled projection algorithm prevents the cancellation of the signal when correlated interference is present at the sensor array, and thus serves the same function as the clamp on the adaptive weights in the hearing-aid feedback cancellation algorithm. However, the derivation of the feedback cancellation clamp is based on ensuring system stability, while the derivation of the clamp in the scaled projection algorithm is based on constraining the beamformer sensitivity to correlated measurement errors. Thus the clamp for the feedback cancellation algorithm uses the 1-norm of the difference between the adaptive weight vector at any given time and the initial values, while the
clamp for the scaled projection algorithm uses the 2-norm of the projection of the adaptive weight vector orthogonal to the look direction.
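For concreteness, a minimal Python/NumPy sketch of the clamp of Sect. 2.6.1 (steps 0-3). The function name and the specific monotonic gain-reduction rule used in step 2 are assumptions, since the chapter deliberately leaves the gain rule open:

    import numpy as np

    # Illustrative sketch of the clamp (steps 0-3 above); the function name and
    # the particular monotonic gain-reduction rule in step 2 are assumptions.
    def constrain_update(w, w0, h, gamma=2.0, alpha=1.0):
        """Clamp adapted coefficients w toward the initial vector w0.

        w, w0 : adaptive and initial FIR coefficient vectors
        h     : current compressor gain
        gamma : headroom factor (gamma ~ 2 gives roughly 10 dB of headroom)
        alpha : threshold that triggers the gain reduction
        """
        ref_norm = np.sum(np.abs(w0))
        xi = gamma * ref_norm                     # step 0 (precomputed once)
        diff_norm = np.sum(np.abs(w - w0))        # 1-norm of the coefficient change
        if diff_norm >= xi:                       # step 1: rescale onto the clamp
            w = w0 + (xi / diff_norm) * (w - w0)
        if diff_norm > alpha * ref_norm:          # step 2: reduce the gain
            h = h * alpha * ref_norm / diff_norm  # one possible monotonic rule
        return w, h                               # step 3: use the clamped values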
2.6.2 Adaptation with Cost Function
Feedback cancellation typically uses LMS adaptation to adjust the FIR filter that models the feedback path. The processing is most conveniently implemented in block form, with the adaptive coefficients updated once for each block of data. Conventional LMS adaptation over the block of data minimizes the error signal given by
\[
\varepsilon(m) = \sum_{n=0}^{N-1} e_n^2(m) = \sum_{n=0}^{N-1} \left[ s_n(m) - v_n(m) \right]^2 , \tag{2.21}
\]
where s_n(m) is the microphone output signal and v_n(m) is the output of the FIR filter for block m, and there are N samples per block. The new algorithm minimizes the error signal combined with a cost function based on the magnitude of the adaptive coefficient vector:
\[
\varepsilon(m) = \sum_{n=0}^{N-1} \left[ s_n(m) - v_n(m) \right]^2 + \beta \sum_{k=0}^{K-1} \left[ w_k(m) - w_k(0) \right]^2 , \tag{2.22}
\]
where β is a weighting factor. The new constraint is intended to allow the feedback cancellation filter to freely adapt near the initial coefficients, but to penalize coefficients that deviate too far from the initial values. The LMS coefficient update for the new algorithm is given by
\[
w_k(m+1) = w_k(m) - 2\mu\beta \left[ w_k(m) - w_k(0) \right] + 2\mu \sum_{n=0}^{N-1} e_n(m)\, d_{n-k}(m) , \tag{2.23}
\]
where dn−k (m) is the input to the adaptive filter, delayed by k samples, for block m. The modified LMS adaptation uses the same cross-correlation operation as the conventional algorithm to update the coefficients, but combines the update with an exponential decay of the coefficients toward the initial values. At low input signal or cross-correlation levels the adaptive coefficients will tend to stay in the vicinity of the initial values. If the magnitude of the cross-correlation increases, the coefficients will adapt to new values that minimize the error as long as the magnitude of the adaptive coefficients does not grow too large. Adaptation that would require large changes in the adaptive filter coefficients, however, such as reacting to a sinusoid, will lead to incomplete reduction of the error. The exponential decay towards the initial values, which prevents the adaptive coefficients from becoming excessively large, also prevents the adaptive coefficients from reaching the optimum values for minimizing the error. In the case of the telephone handset the new algorithm may be less effective than the unconstrained approach given by (2.21). But in the
case of a sinusoid or tonal input, the new algorithm greatly reduces the occurrence of processing artifacts because the artifacts are generally the result of unconstrained growth in the magnitude of the adaptive filter coefficients. Thus the adaptation with a cost function reduces coloration and improves system stability at the expense of a reduction in the ability to model large deviations from the initial feedback path.
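A minimal sketch of the update (2.23), assuming a block-based Python/NumPy implementation (the function and variable names are mine, not from the original):

    import numpy as np

    # Illustrative sketch of the cost-function update (2.23): the standard block
    # cross-correlation term plus an exponential decay toward the initial values.
    def leaky_lms_update(w, w0, e, d_delayed, mu=1e-7, mu_beta=5e-4):
        """One block update of the feedback-cancellation FIR coefficients.

        w         : current coefficients w_k(m), shape (K,)
        w0        : initial (reference) coefficients w_k(0), shape (K,)
        e         : error samples e_n(m) for the block, shape (N,)
        d_delayed : matrix with d_{n-k}(m) in row n, column k, shape (N, K)
        mu        : LMS step size
        mu_beta   : mu * beta; the simulations in Sect. 2.6.3 use 2*mu*beta = 0.001
        """
        decay = -2.0 * mu_beta * (w - w0)    # exponential pull toward w(0)
        corr = 2.0 * mu * (e @ d_delayed)    # block cross-correlation term
        return w + decay + corr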
2.6.3 Simulation Results
A feedback-cancellation system using closed-loop adaptation without a probe signal was simulated in MATLAB. The system block diagram is shown in Fig. 2.5. The adaptive FIR filter modeling the feedback path used 8 taps in series with a fixed IIR filter having 5 poles. An ARX system identification procedure [31] was used to determine the initial adaptive feedback cancellation filter coefficients. It was assumed that there was no ambient noise present at the microphone during the initial parameter estimation. The feedback path used was the path measured for the vented earmold shown in Fig. 2.3 without the telephone handset present.

The magnitude frequency response of the feedback cancellation filter after the initialization is shown in Fig. 2.6. The solid line is the measured feedback path, and the dashed line is the magnitude frequency response for the 5-pole/8-tap FIR filter determined at the end of the initialization. The fit of the model to the actual data is quite good, and this system provides about 13 dB of additional hearing-aid gain when the feedback cancellation is engaged.

Fig. 2.6. Magnitude frequency response for the vented hearing aid feedback path (solid line) and the 5-pole/8-tap FIR model (dashed line) fit by the initial parameter estimation.

The adaptive system shown in Fig. 2.5 was then simulated. The hearing aid processing consisted of a fixed 15-dB gain, and the feedback path was the one measured for the vented earmold without a telephone handset and presented in Fig. 2.3. The input signal was a 2-kHz sinusoid having unit amplitude, and the system sampling rate was 15.625 kHz. The adaptive system was run for the equivalent of 10 sec in the simulation time frame. This simulation time allowed the adaptive filter coefficients for the clamp and cost function approaches to essentially reach their asymptotic values.

The system was simulated for three versions of the block time-domain adaptive feedback cancellation. Unconstrained adaptation was used as the reference condition. The second test condition was the constrained adaptation of (2.20), with γ = 1.7 chosen to give the tightest clamp that still allowed the system to adapt to a telephone handset brought up to the aided ear. The third test condition was the adaptation with the cost function given by (2.23). The adaptation exponential decay used 2μβ = 0.001, and this value also allowed successful adaptive filter convergence for a telephone handset brought up to the aided ear. The adaptation used μ = 10⁻⁷ and the data block size was 56 samples for all three algorithms tested. The time delay in the feedback path corresponded to one data block. Exciting the system with white noise resulted in the adaptive weights staying quite close to the values
determined during initialization, indicating that the constrained adaptation has no deleterious effect on the system performance for a broadband input.

The excitation was then changed to the sinusoid, and the envelope of the hearing aid output for the sinusoidal excitation is plotted in Fig. 2.7 for the three adaptation approaches. The feedback cancellation is working to minimize the error signal, which results in cancellation of the sinusoid. For the unconstrained adaptation, the sinusoid is multiplied by a decaying exponential gain function that will ultimately lead to complete cancellation of the output signal. The envelope for the clamped adaptation shows an initial decay similar to that for the unconstrained adaptation, but after the constraint on the 1-norm of the adaptive coefficient vector is engaged at approximately 4 sec, the envelope decay ceases. The output signal intensity is reduced by about 2.5 dB, but further attenuation is prevented by the clamp constraint. The output signal envelope for the adaptation with the coefficient cost function shows the benefit of this approach to the adaptation. The signal is attenuated by only about 1 dB because the growth of the adaptive filter coefficients is constrained by the cost of becoming too large.

Fig. 2.7. Envelope of the hearing-aid output for the 2-kHz sinusoid for the unconstrained adaptation, clamped adaptation, and adaptation with the coefficient cost function.

The normalized difference signal plotted in Fig. 2.8 is given by
\[
z(n) = \frac{|v(n) - f(n)|}{|f(n)|} , \tag{2.24}
\]
where v(n) is the output of the adaptive filter modeling the feedback path and f (n) is the output of the simulated feedback path, as shown in Fig. 2.5. The signal z(n) gives the smoothed envelope of the difference between the
modeled and measured feedback path signals, normalized by the envelope of the measured feedback path signal. A value of z(n) less than 0 dB indicates that the model is converging to the measured feedback path. A value of z(n) greater than 0 dB indicates that the model is diverging from the desired system even if the error signal e(n) is being driven to zero by the adaptive filter. For a sinusoidal input, the minimum error will be obtained when the input signal is completely canceled and not when the adaptive system provides the best model of the feedback path.

Fig. 2.8. Normalized signal for the difference between the feedback path model and the measured feedback path for the 2-kHz sinusoid for unconstrained adaptation, clamped adaptation, and adaptation with the coefficient cost function.

The normalized difference plotted in Fig. 2.8 shows that the unconstrained adaptation starts at a reasonably close model of the measured feedback path, and then diverges from the desired model as the system attempts to cancel the 2-kHz sinusoid. The mismatch between the model and the actual feedback path for the unconstrained adaptation continues to grow with time as the signal cancellation shown in Fig. 2.7 becomes greater. The normalized difference signal for the clamped adaptation initially grows in a manner similar to that for the unconstrained adaptation. However, the mismatch gets clamped along with the adaptive coefficient vector once the 1-norm threshold of (2.20) is reached. The normalized difference signal for the adaptation with the coefficient cost constraint shows much less mismatch than for the coefficient vectors for the unconstrained or the clamped adaptation.

The magnitude frequency response of the adaptive feedback path model at the end of the 10-sec adaptation period is shown in Fig. 2.9. The mismatch between the measured and modeled feedback paths for the unconstrained
adaptation is about 22 dB at 2 kHz, illustrating how the modeled feedback path is diverging from the desired transfer function. Continued unconstrained adaptation would result in even greater deviation of the model from the actual feedback path. Also note the increase in the unconstrained adaptive filter frequency response in the vicinity of 6 and 7.5 kHz, which increases the probability that the hearing aid will become unstable due to a violation of the Nyquist criterion at a frequency other than 2 kHz. The magnitude frequency response of the feedback path model for the clamped adaptation shows a slightly better match to the desired feedback path as the mismatch at 2 kHz is about 15 dB, and there is a corresponding reduction in the magnitude of the mismatch between 6 and 7.5 kHz. The magnitude frequency response of the feedback path model for the adaptation with the coefficient cost function shows that the algorithm results in a feedback path model that stays closer to the measured feedback path than that resulting from unconstrained adaptation as the mismatch at 2 kHz has been reduced to about 13 dB.
Fig. 2.9. Magnitude frequency response for the vented hearing aid feedback path (solid line) and the 5-pole/8-tap FIR model after 10 sec of adaptation. The response for the unconstrained adaptation is given by the dashed line, the response for the clamped adaptation by the dotted line, and the response for the adaptation with the coefficient cost function by the dot-dash line.

2.7 Filtered-X System

The feedback cancellation system shown in Fig. 2.5 is a form of the filtered-X algorithm [31], since the input to the adaptive FIR filter is first passed through the fixed IIR filter. A more complete implementation of the
filtered-X approach also filters the inputs to the cross-correlation operation used to update the adaptive filter coefficients. The complete filtered-X feedback cancellation system [19], [20] is shown in Fig. 2.10.

Fig. 2.10. Block diagram for the filtered-X feedback cancellation system.

The feedback path is still modeled by the combination of an adaptive filter and a delay plus a frozen filter. However, both inputs to the LMS adaptation used to update the filter coefficients are filtered through filter p(n). The filter p(n) is a bandpass or highpass filter, and emphasizes the frequency region where mismatch between the actual and modeled feedback paths can cause the greatest stability problems in the hearing aid. Low frequencies, where the hearing aid typically has low gain but where tonal input signals are often experienced, are de-emphasized to minimize the possibility of canceling a tonal input. The filters p(n) help reduce the potential mismatch when a tonal input is present, and constrained adaptation can be used as well to reduce cancellation even more. The conflict between slow adaptation for tonal inputs and rapid adaptation to accommodate large changes in the feedback path is reduced in this system, but it is still not eliminated.

In the system of Fig. 2.10, the cancellation of tonal input signals is reduced by minimizing the power in a filtered version of the error signal instead of minimizing the broadband error. The inputs q(n) and r(n) to the LMS adaptation are passed through the same filter p(n), giving q(n) = e(n) ∗ p(n) and
r(n) = d(n) ∗ p(n), where ∗ denotes convolution. The adaptive coefficient update for input sample n is then given by
\[
b_k(n+1) = b_k(n) + \frac{2\mu}{\sigma_r^2(n)}\, q(n)\, r(n-k) , \tag{2.25}
\]
where μ controls the rate of adaptation and σ_r²(n) is the average power in signal r(n). The simplest filter for p(n) is the first-order difference given in the frequency domain by
\[
P(z) = 0.5\,(1 - z^{-1}) . \tag{2.26}
\]

Simulation results were obtained for a frozen 5-pole IIR filter plus DC zero, in series with an 8-tap adaptive FIR filter. The system sampling rate was 16 kHz. Results using the vented earmold feedback path of Fig. 2.3 indicate that for a white noise input signal, the filtered-X algorithm of Fig. 2.10 and the normalized LMS algorithm shown in Fig. 2.5 produced nearly identical estimated maximum stable gain (MSG) values.

The two systems were then compared for their reaction to a sinusoidal input. The systems were initialized to the feedback path for the vented earmold. The feedback path was left unchanged, and the system was then excited with a tone burst. A tone at 1 kHz was used to test the susceptibility of the feedback cancellation to speech or music, and a tone at 4 kHz was used to test the ability of the systems to cancel "whistling" that occurs when a hearing aid becomes unstable and starts to oscillate. Given enough time, all of the systems will cancel either sinusoid to the degree allowed by the FIR filter adaptation constraint. Thus the figure of merit used for this test was the rate of cancellation of the sinusoid in dB/sec.
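A minimal Python/NumPy sketch of the update (2.25) with the first-order difference filter of (2.26) may make the structure concrete. The function names, the block-wise power estimate (instead of a running average), and the default step size are assumptions:

    import numpy as np

    # Illustrative sketch of the filtered-X LMS update (2.25) using the
    # first-order difference (2.26) for p(n).
    def p_filter(x):
        """First-order difference p(n): P(z) = 0.5 (1 - z^-1)."""
        return 0.5 * (x - np.concatenate(([0.0], x[:-1])))

    def filtered_x_block(b, e, d, mu=1e-3, eps=1e-10):
        """Update the adaptive FIR coefficients b, shape (K,), over one block.

        e : error samples e(n); d : adaptive-filter input samples d(n).
        """
        q = p_filter(e)                    # q(n) = e(n) * p(n)
        r = p_filter(d)                    # r(n) = d(n) * p(n)
        sigma_r2 = np.mean(r ** 2) + eps   # stand-in for the running power
        K = len(b)
        for n in range(K, len(q)):         # start where r(n - K + 1) exists
            b = b + (2.0 * mu / sigma_r2) * q[n] * r[n:n - K:-1]
        return b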
The results of the sinusoid test are presented in the table below:

                                Rate of Sinusoid Cancellation (dB/sec)
    LMS Adaptation Filter             1 kHz          4 kHz
    Pole-Zero Model (Fig. 2.5)         9.1            5.2
    Filtered-X (Fig. 2.10)             0.3            2.8
These results indicate that the filtered-X system using a first-order difference filter for p(n) is very successful in reducing the adaptation to the 1-kHz sinusoid. The first-order difference filter reduces the amplitude of the 1-kHz tone entering the LMS adaptation of the FIR filter by a substantial amount, and the rate of adaptation to the tone is decreased proportionately. For the 4-kHz sinusoid, both systems exhibit similar performance. The first-order difference still has some attenuation at 4 kHz and its rate of cancellation is slightly reduced because of this, but both systems would be effective in canceling a high-frequency tone.
2.8 Room Reverberation Effects
Reverberation typically persists for a much longer time than the impulse responses of the electro-mechanical components in the hearing aid or the direct acoustic path from receiver to microphone. A short adaptive filter used to model the components and the acoustic environment in the ear canal and the local vicinity of the head will not be long enough to model the room reverberation. Changes in the reverberation as the user moves about the room will change the feedback path, but the feedback cancellation filter will not be able to model these changes. Increasing the length of the adaptive filter may not improve performance if the filter is not long enough to include the time delays corresponding to the major room reflections. Reverberation therefore becomes an important limitation to the effectiveness of feedback cancellation in hearing aids [22].

2.8.1 Test Configuration
The effects of reverberation on feedback cancellation were studied for feedback path measurements made in an office. A behind-the-ear (BTE) hearing aid was mounted on the right ear of an “Eddi” acoustic manikin, and the manikin was placed on a wheeled platform. An open fitting, in which the earmold was replaced by the receiver tubing held in place with an annular support, was used to get the greatest possible intensity for the feedback path signal. The ear of the manikin was approximately 2.5 feet (0.75 m) above floor level. The hearing aid was connected via a cable to a real-time digital processing system, and the sampling rate was 15.625 kHz. The ear canal in the “Eddi” manikin reproduces the shape of a human ear canal, but it is
terminated by a rigid wall rather than by a network duplicating the acoustic impedance at the ear drum. The feedback path response measured in the "Eddi" manikin has been empirically determined to be about 10 dB more intense than that measured for a comparable human ear; however, it has the same general spectral shape and temporal structure.

The feedback path response was measured for a dummy head placed at eight locations in an office representative of a small room. The office floor plan is shown in Fig. 2.11.

Fig. 2.11. Office used for the feedback path measurements. The measurement locations, numbered from 1 through 8, are on a grid with a 6-inch (15 cm) spacing. (Room dimensions: 10' 9" by 11' 3", i.e. about 3.28 m by 3.43 m.)

The eight measurement positions are indicated by the crosses in the figure, numbered from 1 through 8. At each location the right ear of the manikin was used, and the manikin always faced away from the office door. The measurement locations are on a grid with six-inch (15 cm) spacing. The time delays for early reflections from the walls to either side of the manikin will shift by about 1 msec as it is moved from one location to the next, and this will shift the locations of the peaks and valleys in the feedback path frequency response perturbations caused by the reflections.
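As a quick check of the 1-msec figure (my arithmetic, not stated in the text): moving the manikin by one 6-inch (0.15 m) grid step can change a wall-reflection path length by up to roughly twice that distance, so
\[
\Delta t \approx \frac{2 \times 0.15\ \text{m}}{343\ \text{m/s}} \approx 0.9\ \text{ms} ,
\]
which corresponds to about 14 samples at the 15.625-kHz sampling rate.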
2.8.2 Initialization and Measurement Procedure
The feedback model initialization as described in Sect. 2.5.1 was performed for each of the eight locations. The first step in the initialization was to acquire the feedback path response to the periodic MLS excitation, after which the summed periodic response was uploaded to the host computer. The analysis and signal processing was then continued on the host computer using a MATLAB simulation of the remaining initialization steps. The filter coefficient calculations were performed using the circularly wrapped feedback path impulse response data.

The all-pole IIR filters all had five poles, and the FIR filters had either 8, 16, or 32 taps. The use of a five-pole IIR filter has proven to be very robust in practice; the poles tend to model the receiver and tubing resonances, and the zeros model the remaining acoustic aspects of the feedback path. The feedback cancellation has proven to be effective in nearly all environments even though the poles are initialized in the hearing-aid dispenser's office.

The feedback path impulse response was obtained by the circular convolution of the MLS sequence with the summed response to the periodic 255-sample MLS excitation. The result was scaled to adjust for the excitation amplitude and the number of periods summed. The excitation amplitude was adjusted to give the lowest noise floor relative to the peak impulse response amplitude, thus minimizing the possibility of distortion corrupting the reverberation measurement.

A total of 10 sec of data was acquired at each of the eight locations in the room. The hearing aid was left in place on the dummy head as it was moved from one location to the next to determine the effects of changing position within the room. As a control, the dummy head was placed at location 1 in the room and eight impulse responses were sequentially acquired at the same location; these impulse responses showed the test/retest reliability and the effects of noise on repeated measurements.
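A minimal sketch of this recovery step, assuming a 255-sample MLS, FFT-based circular correlation, and the 100-sample delay unwrapping described in the next section (the function name and the normalization details are my assumptions):

    import numpy as np

    # Illustrative sketch of the MLS impulse-response recovery; the text's
    # "circular convolution" with an MLS amounts to the same recovery step
    # implemented here as circular cross-correlation via the FFT.
    def mls_impulse_response(summed_response, mls, amplitude, n_periods, delay=100):
        """summed_response : one 255-sample period, summed over n_periods
        mls               : the +/-1 maximum-length sequence used as excitation
        """
        L = len(mls)
        H = np.fft.fft(summed_response) * np.conj(np.fft.fft(mls))
        h = np.fft.ifft(H).real / (amplitude * n_periods * L)
        # Unwrap the "pre-echo": move the processing-delay samples to the end
        return np.concatenate((h[delay:], h[:delay]))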
2.8.3 Measured Feedback Path
The measured impulse responses determined using the MLS procedure are plotted in Fig. 2.12 for the eight repeated measurements at location 1 and in Fig. 2.13 for the eight separate room locations. The impulse responses have been unwrapped. The circular nature of the MLS measurement procedure results in the later reflections getting wrapped around to the beginning of the 255-sample buffer and appearing as a "pre-echo." The first 100 samples of the measured circular impulse response were therefore removed from the beginning of the buffer and instead appended to the end. These first 100 samples represent the time delay primarily due to the digital processing algorithm used to acquire the data, with additional delays due to the A/D and D/A converters, the microphone and receiver group delay, and the acoustic delay due to the receiver tubing.
Fig. 2.12. Eight unwrapped impulse responses obtained from repeated measurements at location 1.
Fig. 2.13. Eight unwrapped impulse responses consisting of one impulse response from each of the eight measurement locations.
The variance in the repeated measurements at the same location shown in Fig. 2.12 is much lower than for the set of measurements at the different room locations shown in Fig. 2.13, indicating that the changes in the feedback path impulse response with room location are due to the changes in the reflection patterns and are not an artifact of noise or poor measurement
repeatability.

It is apparent from the overlaid responses in Fig. 2.13 that there is very little variation during samples 100-150 of the unwrapped response as the room location is changed. This is the time interval where the ringing of the receiver resonances and the immediate acoustic effects of the ear canal and pinna would be expected to dominate, followed by the shoulder bounce (mounting board for the dummy head) approximately 20-25 samples later. These acoustic factors would not be expected to depend on the location within the room. The first room reflection appears at about sample 160 in Figs. 2.12 and 2.13. This signal is probably due to a reflection from the opened office door, since measurements made with the door closed did not show this reflection. However, substantial variation is seen in the interval between 180 and 250 samples in Fig. 2.13. As the manikin is moved within the room, there will be shifts in the distance from the hearing aid to the walls near the table and desk, as shown in Fig. 2.11. Thus the reflections from these two walls will arrive at different times, creating the apparent randomness in the overlaid impulse responses. There is another strong reflection with very little variance at about 340 samples, indicating a reflection path, such as a ceiling reflection, that would not depend strongly on the location of the manikin in the room. Reflections at longer time delays will be wrapped around the periodic response and will tend to appear as additional noise.

Distortion in the hearing aid could also cause artifacts to appear in the measured impulse response. Distortion would be caused by nonlinear responses to the excitation signal. However, the same nonlinear reactions would always occur in response to the same subsequences of the MLS sequence. Thus the distortion effects would be the same for each period of the excitation and would not depend on the location within the room. The variation in the tails of the impulse responses shown in Fig. 2.13 is therefore evidence that reverberation, and not distortion, is the dominating factor corrupting the measured feedback path impulse response.

The magnitude frequency responses computed from the feedback path impulse responses of Figs. 2.12 and 2.13 are plotted in Figs. 2.14 and 2.15, respectively. The variation in the impulse responses translates into a corresponding variation in the frequency responses. The greatest variation in dB is at the low and high frequencies where the signal is weakest, and the least variation at the frequency response peaks. The frequency response curves of Fig. 2.14, computed from the eight impulse responses measured at location 1, show very little variance in the vicinity of the peaks at 2.8 and 3.7 kHz, with the different responses spanning less than 0.5 dB. In contrast, the eight measurements at the different room locations span a 2-dB range in the vicinity of the response peaks. Thus small displacements within the room can result in noticeable differences in the feedback path impulse and frequency responses.
Fig. 2.14. Eight magnitude frequency response curves obtained from repeated measurements at location 1.

Fig. 2.15. Eight magnitude frequency response curves obtained from measurements at each of the eight locations shown in Fig. 2.11.
2.8.4 Maximum Stable Gain
An important question is whether the differences in the measured feedback paths cause differences in the expected feedback cancellation performance. The initialization calculations were performed for a 5-pole IIR filter in series with an 8-, 16-, or 32-tap FIR filter for each of the eight locations. The MSG without feedback cancellation and the MSG with feedback cancellation were computed for each of the combinations of FIR filter length and location. The results are based on the ARX model fit to the data.

The MSG without feedback cancellation averages 7.7 dB, while the 8-tap FIR filter gives an average of 16.4 dB, the 16-tap FIR filter an average of 19.9 dB, and the 32-tap FIR filter an average of 20.7 dB. Thus the feedback cancellation using the 8-tap FIR filter results in an 8.7-dB improvement in the expected maximum gain over no feedback cancellation. Increasing the FIR filter length to 16 taps improves the expected performance by just 3.5 dB, and doubling the length again to 32 taps adds only an additional 0.8 dB. A statistical analysis indicated that the 8-tap FIR filter and the 16-tap FIR results are significantly different at the 0.01 level. The 32-tap FIR filter results are not significantly different from those of the 16-tap FIR filter.

The 32-tap filter should be long enough to model the shoulder bounce, but this increased filter length did not make a significant improvement in feedback cancellation performance. Successfully modeling the reverberation within a room requires a much longer filter; acoustic echo cancellation, for example, can require an FIR filter of up to 4000 taps at an 8-kHz sampling
rate to model the room reverberation, and using ARMA techniques to model the room reverberation can still require on the order of 250 pole coefficients and 450 zero coefficients [33]. An additional concern is the criterion for computing the ARX model fit, which is to minimize the mean-squared error between the observed response and the model. The model thus minimizes the average deviation, while the stability, and the related MSG criterion, are controlled by the maximum deviation, which may not be improved by the increased filter length. Increasing the FIR filter length will be of minimal benefit if it cannot be made long enough to include all of the important room reverberation effects and still retain a reasonable computational load.
2.9 Conclusions
Feedback cancellation in hearing aids requires that the system track changes in the feedback path without any compromises in the sound quality of the instrument. The concerns with sound quality preclude the use of a noise probe sequence since such a probe will reduce the SNR at the hearing aid output. The feedback cancellation must therefore use the audio signal as the probe, with the result that adaptation during narrowband or sinusoidal inputs must be accommodated. Feedback cancellation using unconstrained adaptation will lead to the cancellation of an incoming narrowband signal instead of modeling the feedback path. As the adaptive feedback cancellation
filter deviates from the true feedback path, the tendency for the system to become unstable or for unwanted coloration to appear in the hearing aid output greatly increases.

The adaptive feedback cancellation algorithms implemented in a hearing aid must run successfully in 16-bit fixed-point arithmetic on a slow DSP with minimal program and data memory. The DSP chip constraints limit the algorithm complexity, and force the algorithm developer to pay attention to real-world processing concerns that can be ignored in many other signal processing applications. The feedback-cancellation approaches described in this chapter, namely the constrained adaptation and the filtered-X algorithm, have been programmed and run successfully in an ear-level digital hearing aid that also contains multi-band dynamic-range compression, a two-microphone adaptive array, noise suppression, and speech enhancement processing.

Two solutions to the problem of adapting in the presence of a sinusoid or narrowband input using constrained adaptation were presented. The constrained adaptation uses a reference filter coefficient vector determined during initialization. In the first approach, the adaptive feedback filter coefficient vector is constrained so that the 1-norm of the difference between the adaptive coefficient vector and the reference coefficient vector determined during initialization always stays below a threshold based on a scaled version of the 1-norm of the reference coefficient vector. In the second approach, a cost function is added to the adaptation, resulting in an adaptive coefficient update that incorporates an exponential decay of the coefficient vector towards the reference coefficient vector.

The filtered-X algorithm can also reduce the probability of audible processing artifacts. In the filtered-X approach, the inputs to the cross-correlation used to update the adaptive filter coefficients are themselves filtered. The coefficient updates are therefore most sensitive to the frequencies in the pass-band of the filters, and have reduced sensitivity to frequencies in the stop-band. A simple 1-zero high-pass filter was shown to greatly reduce the propensity of the feedback-cancellation system to react to low-frequency tones, and should greatly reduce hearing-aid artifacts when listening to speech or music.

Room reverberation can affect feedback cancellation in hearing aids, with the strength of the effects depending on the acoustical conditions. These effects were studied using a BTE hearing aid mounted on a dummy head and coupled to the ear canal via an open fitting. The feedback path impulse response was measured for eight closely-spaced locations in a typical office. The impulse response measurements show that the feedback path impulse response can be divided into a short-time portion consisting of the ear-level factors of the microphone, receiver, ear canal, vent, and pinna resonances, and into a long-time portion consisting of the room reverberation. The short-time portion of the feedback path response can be accurately represented using an ARMA model consisting of a 5-pole IIR filter in series with an 8-tap FIR filter.
The ARMA model fits a smooth curve through the feedback path frequency response, minimizing the effects of the room location and reverberation in fitting the poles and zeros. Increasing the length of the FIR filter from 8 to 16 taps improves the model accuracy, but yields only a 3- to 3.5-dB improvement in the estimated available amplifier headroom. Further increasing the FIR filter length to 32 taps yields a benefit only when reverberation is minimal. The changes in the measured responses with room location indicate that reverberation, and not distortion, dominates the long-time portion of the feedback path impulse response, and major features of the tail of the impulse response correlate with reflections from surfaces in the room. Because of the length of the reverberation response, it cannot be modeled with a filter short enough to be practical in a wearable hearing aid. The mismatch between the modeled and actual feedback paths limits the headroom increase that can be achieved when using feedback cancellation to approximately 10-15 dB. The actual feedback cancellation performance depends on the room acoustics and location within the room, and will therefore vary as the user moves about.

Several of the algorithms presented in this chapter are the subject of US and international patents and pending patent applications.
Acknowledgment

The preparation of this chapter was supported by GN ReSound Corporation.
References

1. S. F. Lybarger, "Acoustic feedback control," in The Vanderbilt Hearing-Aid Report, Studebaker and Bess, Eds., Upper Darby, PA: Monographs in Contemporary Audiology, pp. 87–90, 1982.
2. J. M. Kates, "A computer simulation of hearing aid response and the effects of ear canal size," J. Acoust. Soc. Am., vol. 83, pp. 1952–1963, 1988.
3. B. Widrow, J. M. McCool, M. G. Larimore, and C. R. Johnson, Jr., "Stationary and nonstationary learning characteristics of the LMS adaptive filter," Proc. IEEE, vol. 64, pp. 1151–1162, 1976.
4. D. K. Bustamante, T. L. Worrell, and M. J. Williamson, "Measurement of adaptive suppression of acoustic feedback in hearing aids," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1989, pp. 2017–2020.
5. A. M. Engebretson, M. P. O'Connell, and F. Gong, "An adaptive feedback equalization algorithm for the CID digital hearing aid," in Proc. 12th Annual Int. Conf. of the IEEE Eng. in Medicine and Biology Soc., 1990, Part 5, pp. 2286–2287.
6. J. M. Kates, "Feedback cancellation in hearing aids: results from a computer simulation," IEEE Trans. Signal Processing, vol. 39, pp. 553–562, 1991.
7. O. Dyrlund and N. Bisgaard, "Acoustic feedback margin improvements in hearing instruments using a prototype DFS (digital feedback suppression) system," Scand. Audiol., vol. 20, pp. 49–53, 1991.
8. A. M. Engebretson and M. French-St. George, "Properties of an adaptive feedback equalization algorithm," J. Rehab. Res. and Devel., vol. 30, pp. 8–16, 1993.
9. A. Kaelin, A. Lindgren, and S. Wyrsch, "A digital frequency-domain implementation of a very high gain hearing aid with compensation for recruitment of loudness and acoustic echo cancellation," Signal Processing, vol. 64, pp. 71–85, 1998.
10. M. French-St. George, D. J. Wood, and A. M. Engebretson, "Behavioral assessment of adaptive feedback cancellation in a digital hearing aid," J. Rehab. Res. and Devel., vol. 30, pp. 17–25, 1993.
11. N. Bisgaard, "Digital feedback suppression: Clinical experiences with profoundly hearing impaired," in Recent Developments in Hearing Instrument Technology: 15th Danavox Symposium, Ed. by J. Beilin and G. R. Jensen, Kolding, Denmark, pp. 370–384, 1993.
12. O. Dyrlund, L. B. Henningsen, N. Bisgaard, and J. H. Jensen, "Digital feedback suppression (DFS): characterization of feedback-margin improvements in a DFS hearing instrument," Scand. Audiol., vol. 23, pp. 135–138, 1994.
13. J. A. Maxwell and P. M. Zurek, "Reducing acoustic feedback in hearing aids," IEEE Trans. Speech Audio Processing, vol. 3, pp. 304–313, 1995.
14. S. Haykin, Adaptive Filter Theory (3rd Edition), Upper Saddle River, NJ: Prentice-Hall, 1996.
15. M. G. Siqueira, A. Alwan, and R. Speece, "Steady-state analysis of continuous adaptation systems in hearing aids," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1997.
16. P. Estermann and A. Kaelin, "Feedback cancellation in hearing aids: Results from using frequency-domain adaptive filters," in Proc. IEEE ISCAS, 1994.
17. W. G. Knecht, "Some notes on feedback suppression with adaptive filters in hearing aids," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1997, session 2, paper 3.
18. J. M. Kates, "Constrained adaptation for feedback cancellation in hearing aids," J. Acoust. Soc. Am., vol. 106, pp. 1010–1019, 1999.
19. S. Gao, S. Soli, and H.-F. Chi, "Band-limited adaptive feedback canceller for hearing aids," Int. Patent Application WO 00019605 A2, filed Sept. 30, 1999.
20. J. Hellgren, "Analysis of feedback cancellation in hearing aids with filtered-X LMS and the direct method of closed loop identification," IEEE Trans. Speech Audio Processing, vol. 10, pp. 119–131, 2002.
21. J. Hellgren, T. Lunner, and S. Arlinger, "Variations in the feedback of hearing aids," J. Acoust. Soc. Am., vol. 106, pp. 2821–2833, 1999.
22. J. M. Kates, "Room reverberation effects in hearing aid feedback cancellation," J. Acoust. Soc. Am., vol. 109, pp. 367–378, 2001.
23. J. Macrae, "Vents for high-powered hearing aids," The Hearing Journal, pp. 13–16, Jan. 1983.
24. T. Schneider and D. G. Jamieson, "A dual-channel MLS-based test system for hearing-aid characterization," J. Audio Eng. Soc., vol. 41, pp. 583–594, 1993.
25. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes: The Art of Scientific Computing, New York: Cambridge University Press, pp. 209–213 (Section 7.4: Generation of Random Bits), 1986.
26. C. Dunn and M. O. Hawksford, "Distortion immunity of MLS-derived impulse response measurements," J. Audio Eng. Soc., vol. 41, pp. 314–335, 1993.
27. J. Vanderkooy, "Aspects of MLS measuring systems," J. Audio Eng. Soc., vol. 42, pp. 219–231, 1994.
28. M. R. Schroeder, "Integrated-impulse method measuring sound decay without using impulses," J. Acoust. Soc. Am., vol. 66(2), pp. 497–500, 1979.
29. J. Borish and J. B. Angell, "An efficient algorithm for measuring the impulse response using pseudorandom noise," J. Audio Eng. Soc., vol. 31, pp. 478–488, 1983.
30. D. D. Rife and J. Vanderkooy, "Transfer-function measurement with maximum-length sequences," J. Audio Eng. Soc., vol. 37, pp. 419–444, 1989.
31. L. Ljung, System Identification: Theory for the User, Englewood Cliffs, NJ: Prentice Hall, 1987.
32. H. Cox, R. M. Zeskind, and M. M. Owen, "Robust adaptive beamforming," IEEE Trans. Acoust., Speech, and Sig. Proc., vol. ASSP-35, pp. 1365–1375, 1987.
33. Y. Haneda, S. Makino, and Y. Kaneda, "Common acoustical pole and zero modeling of room transfer functions," IEEE Trans. Speech Audio Processing, vol. 2, pp. 320–328, 1994.
3 Single-Channel Acoustic Echo Cancellation
Eberhard Hänsler¹ and Gerhard Schmidt²

¹ Darmstadt University of Technology, Signal Theory Group, Merckstr. 25, D-64283 Darmstadt, Germany. E-mail: [email protected]
² Temic Speech Processing, Algorithmics Department, Söflinger Str. 100, D-89077 Ulm, Germany. E-mail: [email protected]
Abstract. This chapter explains the problem of acoustical echoes and their cancellation. It focuses on the hands-free telephone as one of the major applications for echo cancelling devices. Beginning with a brief discussion of a system formed by a loudspeaker and a microphone located within the same enclosure, the properties of speech signals and noise are described. The major part of the echo can be cancelled by an adaptive filter connected in parallel to the loudspeaker and the microphone. Residual echoes may be suppressed by an additional filter within the outgoing signal path. We will address only single-channel solutions, which means that we will have a monophonic loudspeaker signal. Algorithms for the adaptation of the echo cancelling filter are described. Because of its robustness and its low computational complexity, the NLMS algorithm is primarily applied. Measures to improve the speed of convergence and to avoid divergence in case of double-talk or strong local noise are discussed. Echo cancellation in subbands and the applications of block processing techniques conclude the chapter. The chapter is meant as an introduction to the topic of single-channel acoustic echo cancellation. Many of the relevant details of specific topics like detection and estimation schemes can be found in the corresponding references at the end of this chapter.
3.1 Introduction
More than a century ago, the front page of Scientific American showed a picture of a man using one of the first telephones [1]. He held a loudspeaker close to one of his ears and an identical-looking device – the microphone – close to his mouth. Consequently, both hands were busy while making a phone call in the early days of telecommunications. It did not take long until the loudspeaker and the microphone were mounted into a single device: the handset. Thus, one hand had been freed.

To provide a natural communication between two users of a telecommunication system it is essential to get rid of the necessity to keep the loudspeaker and the microphone close to the users' ear and mouth. This request
causes a multitude of difficulties. The inconvenience of handsets is rewarded by optimal acoustic conditions: a high signal-to-(environmental) noise ratio, a perfect coupling between the loudspeaker and the ear of the listener, and a high attenuation between the loudspeaker and the microphone. If the electro-acoustic transducers are moved away from the users' heads, sophisticated signal processing is required. Even if it was quite simple to free the first hand of the user, the "deliberation" of the second hand has engaged hundreds of researchers and scientists in the last decades.

The problem of acoustic echo arises wherever a loudspeaker and a microphone are placed such that the microphone picks up the signal radiated by the loudspeaker and its reflections at the borders of the enclosure [3], [4], [13]. In the case of telecommunication systems, the users are annoyed by listening to their own speech delayed by the round-trip time of the system. If both conversation partners are using telephones with hands-free capabilities, the electro-acoustic circuit may furthermore become unstable and produce howling. To avoid these problems, an adaptive filter can be placed parallel to the loudspeaker-enclosure-microphone (LEM) system (see Fig. 3.1).

Fig. 3.1. Structure of a hands-free telephone system.

If one succeeds in matching the impulse response ĥ(n) of the filter exactly with the impulse response of the LEM system, the signals x(n) and e(n) are perfectly decoupled without any disturbing effects to the users of the electro-acoustic system.
Since, in real applications, a perfect match (over all times and all situations) cannot be achieved, the remaining signal e(n) (see Fig. 3.1) still contains echo components. To reduce these further, a Wiener-type echo suppression filter within the transmitting path may be used [2], [18], [19], [46]. In contrast to the cited references, where the echo suppression filter is also used for the suppression of local background noise, only the problem of residual echoes will be addressed in this chapter.

Finally, a third sub-unit, the loss control, should be mentioned. The loss control circuit has the longest history in hands-free communication systems. A device for hands-free telephone conversation using voice switching was presented in 1957 [7]. In its simplest form, it reduces the usually full-duplex communication system to a half-duplex one by alternately switching the input and output lines on and off. Besides preventing howling and suppressing echoes, this also prevented any natural conversation in which both partners can talk at the same time. Only the echo cancellation filter in parallel to the LEM system can help to provide full-duplex – i.e. fully natural – communication.

Due to the high interest in providing hands-free speech communication, an enormous number of papers has been published over the last two decades. Among those are a number of bibliographies [16], [20], [21], [22], overview papers [4], [14], [15], and books [3], [13].

For the following considerations the notation as given in Fig. 3.1 will be used. Lower case bold face letters will indicate column vectors, upper case bold face letters will denote matrices.
3.2 Settings
Before describing the basic algorithms used for acoustic echo control, some of the conditions that real systems have to cope with are investigated.

3.2.1 Loudspeaker-Enclosure-Microphone Systems
In an LEM system the loudspeaker and the microphone are connected by an acoustical path formed by a direct connection (if both can "see" each other) and in general a large number of reflections at the boundaries of the enclosure. For low sound pressure and no overload of the converters, this system may be modeled with sufficient accuracy as a linear system. The echo signal d(n) can be modeled as the output of a convolution of the (causal) impulse response of the LEM system h_i(n) and the excitation signal x(n) (see Fig. 3.2):
\[
d(n) = \sum_{i=0}^{\infty} h_i(n)\, x(n-i) . \tag{3.1}
\]

Fig. 3.2. Model of the loudspeaker-enclosure-microphone (LEM) system.
The reason for having a time (n) and a coefficient index (i) for the LEM impulse response will become clearer at the end of this section. Signals from
local speakers s(n) and background noise b(n) are combined to a local signal
\[
n(n) = s(n) + b(n) . \tag{3.2}
\]
Adding the echo signal d(n) and the local signals results in the microphone signal, given by
\[
y(n) = d(n) + s(n) + b(n) = d(n) + n(n) . \tag{3.3}
\]
The impulse response h_i(n) of an LEM system can be described by a sequence of delta impulses delayed proportionally to the geometrical length of the related path and the inverse of the sound velocity. The amplitudes of the impulses depend on the reflection coefficients of the boundaries and on the inverse of the path lengths. As a first-order approximation one can assume that the impulse response decays exponentially. A measure for the degree of this decay is the so-called reverberation time T60. It specifies the time necessary for the sound energy to drop by 60 dB after the sound source has been switched off [27]. Depending on the application, it may be possible to design the boundaries of the enclosure such that the reverberation time is small, resulting in a short impulse response. Examples are telecommunication studios. For ordinary offices the reverberation time T60 is typically on the order of a few hundred milliseconds. For the interior of a passenger car this quantity is a few tens of milliseconds long.

Figure 3.3 shows the impulse responses of LEM systems measured in a car (top), in an office (second diagram), and in a lecture room (third diagram). The microphone signals have been sampled at 8 kHz according to the standards for telephone signals. It becomes obvious that the impulse responses
of an office and of the lecture room exhibit amplitudes noticeably different from zero even after one thousand samples, that is to say after 125 ms. In comparison, the impulse response of the interior of a car decays faster due to the smaller volume of this enclosure.

The impulse responses of LEM systems are highly sensitive to any changes, such as the movement of a person within the enclosure. This is explained by the fact that, assuming a sound velocity of 343 m/s and 8 kHz sampling frequency, the distance traveled between two sampling instants is 4.3 cm. Therefore, a 4.3-cm change in the length of an echo path (caused, for example, by a person moving only a few centimeters) shifts the related impulse by one sampling interval. Thus, the impulse response of an LEM system is time-variant. For this reason, we have chosen the notation h_i(n) with filter coefficient index i and time index n for the impulse response in (3.1).
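A toy illustration of the exponential-decay approximation, assuming white noise shaped by a decaying envelope (the construction and all parameter values are mine; the chapter only states the first-order approximation itself):

    import numpy as np

    # Illustrative toy model of an exponentially decaying LEM impulse response
    # with a given reverberation time T60.
    def toy_lem_response(t60=0.3, fs=8000.0, length=2048, seed=0):
        """Random impulse response whose energy decays by 60 dB after t60 seconds."""
        rng = np.random.default_rng(seed)
        n = np.arange(length)
        # A 60-dB energy drop at n = fs * t60 corresponds to an amplitude
        # envelope of 10 ** (-3 * n / (fs * t60))
        envelope = 10.0 ** (-3.0 * n / (fs * t60))
        return rng.standard_normal(length) * envelope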
3.2.2 Electronic Replica of LEM Systems
From a control engineering point of view, acoustic echo cancellation is a system identification problem. However, the system to be identified – the LEM system – is highly complex: its impulse response exhibits up to several thousand sample values noticeably different from zero and it is time varying at a speed mainly according to human movements.

The question of the optimal structure of the model of an LEM system and therefore also the question of the structure of the echo cancellation filter has been discussed intensively. Since a long impulse response has to be modeled by the echo cancellation filter, a recursive (IIR) filter seems best suited at first glance. At second glance, however, the impulse response exhibits a highly detailed and irregular shape. To achieve a sufficiently good match, the model must offer a large number of adjustable parameters. Therefore, an IIR filter does not show an advantage over a non-recursive (FIR) filter [28], [33]. The even more important argument in favor of an FIR filter is its guaranteed stability during adaptation.

In the following we assume the echo cancellation filter to have an FIR structure of order N − 1. In this case the output of the echo cancellation filter \(\hat{d}(n)\), which is an estimate of the echo signal d(n), can be described as a vector product of the impulse response of the adaptive filter and the excitation vector:
\[
\hat{d}(n) = \sum_{i=0}^{N-1} \hat{h}_i(n)\, x(n-i) = \hat{\mathbf{h}}^{T}(n)\, \mathbf{x}(n) . \tag{3.4}
\]
The vector \(\mathbf{x}(n)\) consists of the last N samples of the excitation signal
\[
\mathbf{x}(n) = \left[ x(n), x(n-1), \ldots, x(n-N+1) \right]^{T} , \tag{3.5}
\]
and the filter coefficients \(\hat{h}_i(n)\) have been combined in a column vector
\[
\hat{\mathbf{h}}(n) = \left[ \hat{h}_0(n), \hat{h}_1(n), \ldots, \hat{h}_{N-1}(n) \right]^{T} . \tag{3.6}
\]
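A minimal sketch of (3.4)-(3.6) in Python/NumPy (the function name and the zero-padded start-up handling are my choices, not from the original):

    import numpy as np

    # Illustrative FIR echo-cancellation model: compute the echo estimate
    # d_hat(n) = h_hat^T x(n) and the residual e(n) = y(n) - d_hat(n).
    def cancel_echo(h_hat, x, y):
        """h_hat : FIR coefficients, shape (N,); x : loudspeaker signal;
        y : microphone signal. Returns the residual e(n)."""
        N = len(h_hat)
        e = np.empty_like(y, dtype=float)
        for n in range(len(y)):
            # excitation vector x(n) = [x(n), x(n-1), ..., x(n-N+1)]^T,
            # zero-padded at the start of the signal
            xv = np.zeros(N)
            m = min(N, n + 1)
            xv[:m] = x[n::-1][:m]
            e[n] = y[n] - h_hat @ xv
        return e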
Fig. 3.3. Impulse responses measured in a car, in an office, and in a small lecture room (sampling frequency = 8 kHz). The bottom diagram shows the maximum achievable echo attenuation as a function of the order of an adaptive filter placed in parallel to the LEM system.
A measure to express the effect of an echo cancellation filter is the so-called echo return loss enhancement (ERLE):
\[
\mathrm{ERLE}(n) = \frac{E\left\{ d^2(n) \right\}}{E\left\{ \left[ d(n) - \hat{d}(n) \right]^2 \right\}} , \tag{3.7}
\]
where the echo d(n) is equal to the microphone output signal y(n) in case the loudspeaker is the only signal source within the LEM system, i.e. the local speech signal s(n) and the local noise b(n) are zero. Assuming, for simplicity, a stationary white input signal x(n), the ERLE can be expressed as
\[
\mathrm{ERLE}(n) = \frac{E\left\{ x^2(n) \right\} \sum_{i=0}^{\infty} h_i^2(n)}{E\left\{ x^2(n) \right\} \left[ \sum_{i=0}^{\infty} h_i^2(n) - 2 \sum_{i=0}^{N-1} h_i(n)\, \hat{h}_i(n) + \sum_{i=0}^{N-1} \hat{h}_i^2(n) \right]} . \tag{3.8}
\]
An upper bound for the efficiency of an echo cancellation filter of degree N − 1 can be calculated by assuming a perfect match of the first N coefficients of the adaptive filter with the LEM system,
\[
\hat{h}_i(n) = h_i(n) \quad \text{for } 0 \le i < N . \tag{3.9}
\]
In this case, (3.8) reduces to
\[
\mathrm{ERLE}_{\max}(n, N) = \frac{\sum_{i=0}^{\infty} h_i^2(n)}{\sum_{i=N}^{\infty} h_i^2(n)} . \tag{3.10}
\]
The bottom diagram in Fig. 3.3 shows the upper bounds of the ERLE achievable with transversal echo cancellation filters of length N for a car, an office, and a small lecture room with impulse responses as given in the first three diagrams of Fig. 3.3. An attenuation of only 30 dB needs filter lengths of about 1900 for the lecture room, 800 for the office, and about 250 for the car.
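For illustration, (3.10) is a short computation given a sufficiently long impulse response measurement (the helper name and the dB conversion are mine):

    import numpy as np

    # Illustrative helper computing the ERLE upper bound of (3.10) in dB.
    def erle_max_db(h, N):
        """Upper ERLE bound for an FIR canceller of length N."""
        total = np.sum(h ** 2)        # sum over all coefficients, i = 0, 1, ...
        tail = np.sum(h[N:] ** 2)     # unmodeled tail, i = N, N+1, ...
        return 10.0 * np.log10(total / tail)

    # e.g., with the toy response sketched in Sect. 3.2.1:
    # h = toy_lem_response(t60=0.3); print(erle_max_db(h, 800))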
3.2.3 Speech Signals
The performance of adaptive algorithms (especially for system identification purposes) depends crucially on the properties of the input signals [24]. In the case of acoustic echo control, speech signals have to be used as excitation. Identifying systems with this type of signal turns out to be very difficult. In the upper part of Fig. 3.4 an example of a speech sequence – sampled at 8 kHz – is depicted. Speech is characterized by three different types of excitation: nearly periodic (voiced) segments, noise-like (unvoiced) segments, and speech pauses.
Fig. 3.4. Example of a speech sequence. In the upper part a 5-second sequence of a speech signal is depicted in the time domain. The signal was sampled at 8 kHz. The second diagram shows an estimate of the power spectral density of the entire sequence (periodogram averaging). In the lowest part a time-frequency analysis of the speech signal is depicted. Dark colors represent areas with high energy, light colors display areas with low energy.
Short-term stationarity may be assumed for intervals of only 10 ms to 20 ms [9]. If a sampling frequency of 8 kHz (telephony) is used, the expected spectral envelope may have a dynamic range of more than 40 dB. If higher sampling rates are implemented (e.g., in teleconferencing systems), the variations increase further. To illustrate the spectral variations as well as the non-stationarity,
an estimate of the power spectral density of a speech signal as well as a time-frequency analysis are depicted in the lower parts of Fig. 3.4. Due to the non-stationarity of speech, the power spectral density is not sufficient to characterize the spectral behaviour of speech.

Voiced and unvoiced speech segments result from vowel and consonant sounds, respectively. Voiced segments can be distinguished from unvoiced ones by periodicity and loudness. The periodicity (also called pitch) is a characteristic of the speaker. In Fig. 3.5, two 60-ms segments of speech (as well as the entire sequence) are depicted in the time and frequency domain. The second diagram shows a vowel ('i:' from be). In the last diagram, a sibilant sequence ('sh' from she) is depicted. The terms 'i:' and 'sh' are written in phonetic transcription. The differences between voiced and unvoiced speech are clearly visible in the frequency spectra. While voiced speech has a comb-like spectrum, unvoiced speech exhibits a non-harmonic spectral structure. Furthermore, unvoiced segments have most of their energy at high frequencies.

The rapidly changing spectral characteristics of speech motivate the utilization of signal processing in the subband or frequency domain (see Sect. 3.7). In these processing structures a frequency-selective power normalization is possible, leading to a smaller eigenvalue spread [24] and therefore to faster convergence. Besides this advantage, both structures offer better control possibilities (see Sect. 3.5). For the application of hands-free telephony, the excitation signal as well as the measurement noise are mutually independent speech signals. In double-talk situations (both partners speak simultaneously), the signal-to-noise ratio (SNR) varies strongly over frequency. In fullband structures the step size has to be reduced according to the frequency regions with the smallest SNR. In subband or frequency-domain (block processing) structures the step sizes and the regularization parameters can be adjusted differently over frequency. Subbands with a small SNR are adapted using small step sizes or even a step size of zero. In the other subbands – with large SNR – the adaptation process can continue without degradation.
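As a toy illustration of this frequency-selective control (the SNR-to-step-size mapping, thresholds, and names below are my assumptions; the chapter prescribes only the qualitative behavior):

    import numpy as np

    # Illustrative mapping from per-subband SNR estimates (in dB) to step sizes:
    # subbands with poor SNR get a small, or zero, step size.
    def subband_step_sizes(snr_db, mu_max=0.5, snr_on=10.0, snr_off=0.0):
        """Return per-subband step sizes in [0, mu_max]."""
        snr_db = np.asarray(snr_db, dtype=float)
        # Linear ramp: zero below snr_off, mu_max above snr_on
        ramp = np.clip((snr_db - snr_off) / (snr_on - snr_off), 0.0, 1.0)
        return mu_max * ramp

    # e.g., subband_step_sizes([-5.0, 3.0, 20.0]) -> [0.0, 0.15, 0.5]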
3.2.4 Background Noise
The signal of the local speaker is distorted by local background noise. In the case of a hands-free telephone used in an office, the noise of personal computer (PC) fans or air conditioners might disturb the speech signal. If someone phones from a car, engine, wind, and rolling noise are sources of the distorting signal. In contrast to speech signals, most background noises show a nearly stationary behaviour. Time-frequency analyses confirm this property. In Fig. 3.6, the power spectral density of car noise as well as that of a PC fan and an air conditioner – both measured in an office – are depicted. The car noise was measured in a car driving on a motorway at a speed of 100 km/h.
Fig. 3.5. Voiced and unvoiced speech segments. In the upper part, a 5-second sequence of a speech signal is depicted in the time domain. The following two diagram pairs show 60-ms sequences of the speech signal in the time domain (left) as well as the squared spectrum of the sequences (right). First a vowel ('i:' from be) and second an unvoiced (sibilant) sequence ('sh' from she) are depicted.
3.2.5 Regulations
Besides the physical constraints mentioned before, there are also some administrative restrictions. The characteristics of hands-free telephone systems are regulated by the International Telecommunication Union (ITU) as well as by the European Telecommunication Standards Institute (ETSI). The most severe restrictions for signal processing are the tolerable delays for front-end processing: only 2 ms [26] are allowed for stationary telephones and only 39 ms [11] for mobile telephones (GSM).
Fig. 3.6. Properties of background noises. Shown are two power spectral densities: one for noise recorded in a car driving on a motorway at a speed of 100 km/h, and one for the sound of a PC fan plus an air conditioner measured in an office.
Furthermore, an echo attenuation of about 45 dB in case of single-talk and 30 dB in case of double-talk (the remote and the local speaker are talking simultaneously) is prescribed. Due to the severe delay restriction for stationary phones, only fullband structures or hybrid structures are applicable. Filter bank systems (consisting of an analysis and a synthesis part, see Sect. 3.7.3) as well as overlapping DFTs (see Sect. 3.7.2) introduce a delay considerably larger than 2 ms. Therefore, at least the convolution part has to be implemented in fullband. Hybrid structures allow the adaptation to be performed in a different domain than the convolution. In these mixed processing structures, the adaptation (but not the convolution) process is delayed by a few sample instants. The fullband filter impulse response is computed via dedicated transformations [36], [10], [34].
3.3 Methods for Acoustic Echo Control
Most voice communication systems operate as a closed loop. In case of the telephone system with customers using handsets, the attenuations between loudspeakers and microphones crucially contribute to the stabilization of this loop. Therefore, with the introduction of hands-free devices the provision of a stable electro-acoustic loop becomes a major problem.
Fig. 3.7. Structure of a loss control. Depending on the remote and the local speech activity, the transmitting path and/or the receiving path is attenuated.
3.3.1 Loss Control
One of the oldest attempts to solve the acoustic feedback problem is the use of speech-activated switching or loss control – traditionally called "echo suppressors." Attenuation units are placed in the receiving path a_r(n) and in the transmitting path a_t(n) of a hands-free telephone as shown in Fig. 3.7. In case of single-talk of the remote speaker, only the transmitting path is attenuated. If only the local speaker is active, all attenuation is inserted into the receiving path. During double-talk, both signals are attenuated – leading to a half-duplex communication. Despite this disadvantage, a loss control can guarantee (in contrast to echo cancellation and residual echo suppression) the amount of attenuation required by the ITU-T or ETSI recommendations.
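The following sketch illustrates the switching logic just described. It is not taken from the chapter; the function name, the attenuation split, and the idle-state behaviour are illustrative assumptions (a practical loss control would additionally use hangover timers and smoothed gain transitions).

```python
def loss_control_gains(remote_active, local_active, max_att_db=45.0):
    """Illustrative loss-control logic (cf. Fig. 3.7): distribute the required
    attenuation between the receiving-path gain a_r and the transmitting-path
    gain a_t, depending on detected speech activity. Returns linear gains."""
    full = 10.0 ** (-max_att_db / 20.0)   # full attenuation as a linear gain
    half = 10.0 ** (-max_att_db / 40.0)   # attenuation split over both paths
    if remote_active and not local_active:
        return 1.0, full                  # remote single-talk: attenuate transmit path
    if local_active and not remote_active:
        return full, 1.0                  # local single-talk: attenuate receive path
    if remote_active and local_active:
        return half, half                 # double-talk: half-duplex behaviour
    return half, half                     # idle (assumption): keep the loop attenuated

# a_r scales the loudspeaker signal, a_t scales the microphone signal:
a_r, a_t = loss_control_gains(remote_active=True, local_active=False)
```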
3.3.2 Echo Cancellation
For the following considerations we assume that both the impulse response of the echo cancellation filter and that of the LEM system have the same order N − 1. In reality, the impulse response of the LEM system may be much longer than that of the adaptive filter. Nevertheless, this assumption means no restriction because the shorter impulse response can always be extended by zeros. Equivalent to (3.5) and (3.6), one can write the impulse response of the LEM system at time n as a vector

h(n) = [h_0(n), h_1(n), \ldots, h_{N-1}(n)]^T,   (3.11)
and the echo signal d(n) as an inner product,

d(n) = \sum_{i=0}^{N-1} h_i(n)\, x(n-i) = h^T(n)\, x(n).   (3.12)
The mismatch between the electro-acoustic system and the echo cancellation filter can be expressed by a mismatch vector – also called "misalignment" –

\Delta_h(n) = h(n) - \hat{h}(n).   (3.13)

Later, the squared L_2-norm

\|\Delta_h(n)\|^2 = \Delta_h^T(n)\, \Delta_h(n)   (3.14)

of the system mismatch vector will be called the system distance. The quantity

e_u(n) = d(n) - \hat{d}(n) = \Delta_h^T(n)\, x(n)   (3.15)

represents the undisturbed error, i.e., the error signal when both the local speech signal s(n) and the local background noise b(n) are zero. Finally, the error signal e(n) is given by

e(n) = y(n) - \hat{d}(n) = e_u(n) + n(n) = e_u(n) + s(n) + b(n).   (3.16)
This (disturbed) error will enter the equation used to update the coefficients of the adaptive filter. Obviously, only the fraction expressed by the undisturbed error e_u(n) contains "useful" information. The locally generated signal n(n), however, causes the filter to diverge and thus to increase the system distance. Therefore, independent of the adaptive algorithm used, a control procedure is necessary to switch off or slow down the adaptation when n(n) is large compared to the echo d(n).

3.3.3 Echo Suppression
The impact of the echo cancellation filter on the acoustical echo is limited by – at least – two facts:

• only echoes due to the linear part of the transfer function of the LEM system can be cancelled, and
• the order of the adaptive filter typically is much smaller than the order of the LEM system (see Sect. 3.2.1).

Therefore, a second filter – the residual echo suppression filter – is used to reduce the echo further. The transfer function of this filter is given by the well-known Wiener equation [23], [24]

W_{opt}(e^{j\Omega}) = \frac{S_{en}(\Omega)}{S_{ee}(\Omega)},   (3.17)

or by its short-term approximation

W(e^{j\Omega}, n) = \frac{S_{en}(\Omega, n)}{S_{ee}(\Omega, n)},   (3.18)
respectively. In this equation, \Omega is a normalized frequency, S_{ee}(\Omega, n) is the short-term auto power spectral density of the error signal e(n), and S_{en}(\Omega, n) is the short-term cross power spectral density of e(n) and the locally generated signal n(n). In good agreement with reality, one can assume that the undisturbed error e_u(n) and n(n) [see (3.15)] are orthogonal. Then, the power spectral density S_{ee}(\Omega, n) reduces to

S_{ee}(\Omega, n) = S_{e_u e_u}(\Omega, n) + S_{nn}(\Omega, n).   (3.19)

Furthermore, the cross power spectral density S_{en}(\Omega, n) is given by

S_{en}(\Omega, n) = S_{nn}(\Omega, n).   (3.20)
Then, it follows from (3.18) after some manipulations that the transfer function of the echo suppression filter is

W(e^{j\Omega}, n) = 1 - \frac{S_{e_u e_u}(\Omega, n)}{S_{ee}(\Omega, n)}.   (3.21)
The impulse response of this filter can be found by inverse Fourier transformation. Since the signals involved are highly non-stationary, the short-term power spectral densities have to be estimated for time intervals not longer than 20 ms. The overriding problem, however, is that the locally generated signal n(n) is only observable during the absence of the remote excitation signal u(n). Since n(n) is composed of local speech and local noise, the filter suppresses local noise as well. It should be noted, however, that any impact of the residual echo suppression filter on residual echoes also impacts the local speech signal s(n) and, thus, reduces the quality of the speech output of the echo cancelling unit. The problem of residual echo suppression is very similar to the problem of noise suppression and usually both are treated simultaneously [18], [32].
3.4 Adaptive Algorithms
In this section, the convergence and tracking behaviour of three popular adaptive algorithms are examined. The tracking behaviour indicates how fast the adaptive filter can follow enclosure dislocations, whereas the convergence behaviour is investigated as an initial adjustment of the adaptive filter to the impulse response of the LEM system. The algorithms will not be explained in detail; the interested reader is referred to the overview presented in [17] as well as the references cited therein. The following adaptive algorithms will be mentioned:

• normalized least mean square (NLMS) algorithm,
• affine projection (AP) algorithm, and
• recursive least squares (RLS) algorithm.
Table 3.1. Adaptive algorithms, their cost functions, and their complexities.

Algorithm | Cost function                                  | Complexity
----------|------------------------------------------------|------------------------------------------
NLMS      | E{e^2(n)}                                      | O_NLMS(N) ∼ 2N
AP        | E{Σ_{i=0}^{L−1} e^2(n−i)}                      | O_AP(N) ∼ 2LN, but fast versions exist
RLS       | Σ_{i=0}^{n} λ^{n−i} e^2(i), with 0 < λ ≤ 1     | O_RLS(N) ∼ N^2, but fast versions exist
They differ in the cost function they try to minimize, and therefore also in their speed of convergence and their computational complexity. An overview is given in Table 3.1.

3.4.1 NLMS Algorithm
The majority of implementations of acoustic echo cancelling systems use the NLMS algorithm to update the adaptive filter. This gradient-type algorithm minimizes the mean square error [24]. The update equation is given by

\hat{h}(n+1) = \hat{h}(n) + \mu(n)\, \frac{x(n)\, e(n)}{\|x(n)\|^2}.   (3.22)
The term xT(n) x(n) in the denominator represents a normalization according to the energy of the input vector x(n). This is necessary due to the high variance of this quantity for speech signals. The step size of the update is controlled by the step-size factor μ(n). In general, the algorithm is stable (in the mean square sense) for 0 < μ(n) < 2. Reducing the step size is necessary to prevent divergence of the filter coefficients in case of strong local speech signals s(n) and/or local background noise b(n). The NLMS algorithm has no memory, i.e. it uses only signals that are available at the time of the update. This is advantageous for tracking changes of the LEM system. The update is performed in the direction of the input signal vector x(n) (see Fig. 3.8). For speech signals, consecutive vectors may be highly correlated meaning that their directions differ only slightly. This is the reason for the low speed of convergence of the NLMS algorithm in case of speech excitation. Additional measures like decorrelation of the input signal x(n) and/or controlling the step-size parameter μ(n) (see Sect. 3.5) are necessary to speed up convergence.
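As a concrete illustration, the following minimal sketch implements one NLMS iteration according to (3.22); the function name and the small regularization constant added to the denominator are our own additions.

```python
import numpy as np

def nlms_update(h_hat, x_vec, y_n, mu=1.0, eps=1e-10):
    """One NLMS iteration, cf. (3.22).

    h_hat : current coefficients of the echo cancellation filter (length N)
    x_vec : excitation vector [x(n), x(n-1), ..., x(n-N+1)]
    y_n   : current microphone sample y(n)
    """
    e_n = y_n - h_hat @ x_vec                # error e(n) = y(n) - h_hat^T x(n)
    norm = x_vec @ x_vec + eps               # ||x(n)||^2 (eps avoids division by zero)
    h_hat = h_hat + mu * e_n * x_vec / norm  # normalized gradient step
    return h_hat, e_n
```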
Fig. 3.8. Updates of the system mismatch vector according to the NLMS algorithm (left) and to the AP algorithm (right).
The motivation for using the NLMS algorithm in the application discussed here is its robustness and its low computational complexity, which is only on the order of 2N operations per coefficient update. Decorrelating the input signal offers a computationally inexpensive method to improve the convergence of the filter coefficients. To achieve this, two (identical) decorrelation filters and an inverse filter have to be added to the echo cancelling system (see Fig. 3.9). The decorrelation filter has to be duplicated since the loudspeaker needs the original signal x(n). Simulations show that even filters of first order approximately double the speed of convergence in acoustic echo cancelling applications [4]. Further improvements require adaptive decorrelation filters because of the non-stationarity of speech signals. Also, in case of higher-order filters, the necessary interchange of the decorrelation filter and the time-varying LEM system causes additional problems. Therefore, only the use of a first-order decorrelation filter can be recommended (a sketch is given below). The curves in Fig. 3.9 (right) are averages over several speech signals with pauses removed.
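A minimal sketch of a fixed first-order decorrelation (pre-whitening) filter and its inverse follows; the coefficient value is illustrative, and an adaptive implementation would instead estimate it from the short-term autocorrelation of x(n).

```python
import numpy as np

def decorrelate(x, a=0.9):
    """First-order decorrelation filter x'(n) = x(n) - a x(n-1);
    the value of a is an illustrative fixed choice."""
    xd = x.copy()
    xd[1:] -= a * x[:-1]
    return xd

def inverse_decorrelation(xd, a=0.9):
    """Inverse filter 1/(1 - a z^-1); needed where the original signal
    must be restored (the loudspeaker is fed the unfiltered x(n))."""
    x = np.zeros_like(xd)
    for n in range(len(xd)):
        x[n] = xd[n] + (a * x[n - 1] if n > 0 else 0.0)
    return x
```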
3.4.2 AP Algorithm
The AP algorithm [40] overcomes the weakness of the NLMS algorithm concerning correlated input signals by updating the filter coefficients not just in the direction of the current input vector but within a hyperplane spanned by the current input vector and its M − 1 immediate predecessors (see Fig. 3.8). To accomplish this, an input signal matrix X(n) is formed, X(n) = [x(n), x(n − 1), . . . , x(n − M + 1)] ,
(3.23)
and an error vector is calculated,

e(n) = y(n) - X^T(n)\, \hat{h}(n),   (3.24)
Fig. 3.9. Insertion of decorrelation filters (left) and improved convergence of the NLMS algorithm utilizing decorrelation filters (right): curves for no decorrelation filter and for fixed first-order, fixed second-order, adaptive tenth-order, and adaptive eighteenth-order decorrelation filters (sampling frequency = 8 kHz).
collecting the errors for the current and the M − 1 past input signal vectors applied to the echo cancellation filter with the current coefficient setting. The vector y(n) consists of the last M microphone signals y(n) = [y(n), y(n − 1), . . . , y(n − M + 1)]T .
(3.25)
The price to be paid for the improved convergence is the increased computational complexity caused by the inversion of an M × M matrix required at each coefficient update. Fast versions of this algorithm are available [12], [37]. Finally, the update equation is given by

\hat{h}(n+1) = \hat{h}(n) + \mu(n)\, X(n)\, \left[ X^T(n)\, X(n) \right]^{-1} e(n).   (3.26)

Numerical problems arising during the inversion of the matrix X^T(n) X(n) can be overcome by using X^T(n) X(n) + \delta I instead of X^T(n) X(n), where I is the identity matrix and \delta is a small positive constant. For M = 1, the AP algorithm is equal to the NLMS procedure. For speech input signals, even M = 2 leads to a considerably faster convergence of the filter coefficients (see Fig. 3.10). Suggested values for M are between two and five for the update of the echo cancellation filter. It should be noted, however, that faster convergence of the coefficients of the echo cancellation filter also means faster divergence in case of strong local signals. Therefore, faster control of the step size is required as well. Its optimal value is based upon estimated quantities (see Sect. 3.5).
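A minimal sketch of one AP iteration according to (3.23)–(3.26), including the regularization X^T(n)X(n) + δI mentioned above; function and variable names are our own.

```python
import numpy as np

def ap_update(h_hat, X, y_vec, mu=1.0, delta=1e-6):
    """One affine projection iteration, cf. (3.26).

    X     : N x M input signal matrix [x(n), x(n-1), ..., x(n-M+1)], see (3.23)
    y_vec : last M microphone samples [y(n), ..., y(n-M+1)], see (3.25)
    """
    e = y_vec - X.T @ h_hat                           # error vector, see (3.24)
    M = X.shape[1]
    R = X.T @ X + delta * np.eye(M)                   # regularized M x M matrix
    h_hat = h_hat + mu * (X @ np.linalg.solve(R, e))  # update within the hyperplane
    return h_hat, e[0]                                # e[0] is the current error e(n)
```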
Fig. 3.10. Convergence of the filter coefficients for different adaptation algorithms (filter length = 1024, sampling frequency = 8 kHz): NLMS (μ = 1), AP of order two, AP of order five, and AP of order ten (all AP algorithms with μ = 1), as well as RLS (λ = 0.9999). The impulse response of the LEM system is changed at t = 5 seconds. Note that the algorithms do not converge to the same steady state. This is caused by their more or less sensitive reactions to residual echoes caused by the non-modeled tail of the impulse response of the LEM system.
Since their reliabilities depend upon the lengths of the data records usable for the estimation, a very high speed of convergence may not even be desirable. Nevertheless, the AP algorithm seems to be a good candidate to replace the NLMS algorithm in acoustic echo cancelling applications.
3.4.3 RLS Algorithm
The RLS algorithm minimizes the weighted sum of the squared error,
K_{RLS}(n) = \sum_{k=0}^{n} \lambda^{n-k}\, e^2(k),   (3.27)
where e(n) is given by (3.16). It calculates an estimate of the autocorrelation matrix of the input signal vector,

\hat{S}_{xx}(n) = \sum_{k=0}^{n} \lambda^{n-k}\, x(k)\, x^T(k),   (3.28)
and uses the inverse of this N × N matrix to decorrelate the input signal in the update equation. The factor \lambda is called the exponential forgetting factor. It is chosen close to but smaller than 1 and assigns decreasing weights to input signal vectors the further they lie in the past. As in the NLMS algorithm, first an a priori error

e(n) = y(n) - x^T(n)\, \hat{h}(n)   (3.29)
is calculated. Finally, the filter update equation of the RLS algorithm is given by

\hat{h}(n+1) = \hat{h}(n) + \mu(n)\, \hat{S}_{xx}^{-1}(n)\, x(n)\, e(n).   (3.30)
In contrast to the AP algorithm, now an N × N matrix has to be inverted at each coefficient update. This can be done recursively and fast RLS algorithms are available [45] with numerical complexity on the order of 7 N . The RLS algorithm exhibits considerable stability problems. One of these is caused by finite word length errors, especially when the procedure is executed in 16 bit fixed-point arithmetic. The second problem arises from the properties of speech signals (see Sect. 3.2.3) temporarily causing the estimate of the (short-term) autocorrelation matrix to become singular. A forgetting factor λ very close to one helps to overcome this problem. On the other hand, however, the long memory caused by a λ close to one slows down the convergence of the coefficients of the adaptive filter after a change of the LEM system (see Fig. 3.10). In spite of the speed of convergence achievable with the RLS algorithm, the numerical complexity and the stability problems have so far discouraged the use of this algorithm in acoustical echo cancellation applications. A unified analysis of least square adaptive algorithms can be found in [17].
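For illustration, the following textbook-style sketch performs one iteration of the conventional RLS recursion with μ(n) = 1, propagating the inverse P(n) = Ŝ_xx^{-1}(n) via the matrix inversion lemma. The fast O(7N) versions [45] mentioned above are organized quite differently; the initialization hint is an assumption on our part.

```python
import numpy as np

def rls_update(h_hat, P, x_vec, y_n, lam=0.9999):
    """One conventional RLS iteration; P is the inverse of (3.28).
    Initialize P(0) = I / delta with a small positive delta (assumption)."""
    pi = P @ x_vec
    k = pi / (lam + x_vec @ pi)      # gain vector
    e_n = y_n - h_hat @ x_vec        # a priori error, see (3.29)
    h_hat = h_hat + k * e_n          # coefficient update, cf. (3.30) with mu = 1
    P = (P - np.outer(k, pi)) / lam  # recursive update of the inverse matrix
    return h_hat, P, e_n
```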
3.5 Adaptation Control
Independent of the specific algorithm used, the update of the coefficients of the echo cancellation filter strongly depends on the error signal e(n). This signal is composed of the undisturbed error e_u(n) and the locally generated signal n(n) [see (3.16)]. Only e_u(n) steers the coefficients towards their optimal values. The step-size factor μ(n) is used to control the update according to the ratio of e_u(n) to n(n). Assume the filter has converged and the error signal e(n) has reached a certain level. If the amplitude of e(n) suddenly increases, this may have two causes that require different actions:
• First, a local speaker became active or a local noise started. In this case, the step size has to be reduced to prevent losing the degree of convergence achieved before.
• Second, the impulse response of the LEM system has changed, e.g., by a movement of the local talker. Now, the step size has to be increased to its maximal possible value in order to adapt the echo cancellation filter to the new impulse response as fast as possible.

A major problem becomes visible with this consideration: the not directly observable undisturbed error signal e_u(n) needs to be known in order to control the adaptation process. Another important point should be mentioned here: the first situation requires immediate action. In the second case, a delayed action causes an audible echo but no divergence of the adaptive filter.

3.5.1 Optimal Step Size for the NLMS Algorithm
In this section, an optimal step size for the NLMS algorithm will be derived assuming that all required quantities are available. The following sections explain how to estimate them from measurable signals. Using (3.22) and assuming that the impulse response of the LEM system does not change,

h(n+1) = h(n) = h,   (3.31)
the system mismatch vector [see (3.13)] is given by

\Delta_h(n+1) = h - \hat{h}(n+1) = h - \hat{h}(n) - \mu(n)\, \frac{x(n)\, e(n)}{\|x(n)\|^2} = \Delta_h(n) - \mu(n)\, \frac{x(n)\, e(n)}{\|x(n)\|^2}.   (3.32)
Using (3.22) and (3.15), the expectation of the system distance can be expressed as

E\{\|\Delta_h(n+1)\|^2\} = E\{\|\Delta_h(n)\|^2\} - 2\,\mu(n)\, E\left\{\frac{e(n)\, e_u(n)}{\|x(n)\|^2}\right\} + \mu^2(n)\, E\left\{\frac{e^2(n)}{\|x(n)\|^2}\right\}.   (3.33)

For a decreasing system distance it is required that

E\{\|\Delta_h(n+1)\|^2\} - E\{\|\Delta_h(n)\|^2\} \le 0.   (3.34)

Inserting (3.33) leads to

\mu^2(n)\, E\left\{\frac{e^2(n)}{\|x(n)\|^2}\right\} - 2\,\mu(n)\, E\left\{\frac{e(n)\, e_u(n)}{\|x(n)\|^2}\right\} \le 0.   (3.35)
Thus, the condition for the optimal step size is given by

0 \le \mu(n) \le 2\, \frac{E\{e(n)\, e_u(n)/\|x(n)\|^2\}}{E\{e^2(n)/\|x(n)\|^2\}}.   (3.36)
A step size in the middle of this range achieves the fastest decrease of the system distance. The optimal step size \mu_{opt}(n) therefore is

\mu_{opt}(n) = \frac{E\{e(n)\, e_u(n)/\|x(n)\|^2\}}{E\{e^2(n)/\|x(n)\|^2\}}.   (3.37)

To simplify this result, one can assume that the L_2 norm of the input signal vector x(n) is approximately constant. This can be justified by the fact that in echo cancelling applications, the length of this vector typically is on the order of 512 up to 2048. Then, the optimal step size is approximated by

\mu_{opt}(n) \approx \frac{E\{e(n)\, e_u(n)\}}{E\{e^2(n)\}}.   (3.38)

Since the undisturbed error e_u(n) and the local signals n(n) are uncorrelated, this expression further simplifies to

\mu_{opt}(n) \approx \frac{E\{e_u^2(n)\}}{E\{e^2(n)\}}.   (3.39)

Finally, the denominator may be expanded using (3.16) and again the property that e_u(n) and n(n) are orthogonal:

\mu_{opt}(n) \approx \frac{E\{e_u^2(n)\}}{E\{e_u^2(n)\} + E\{n^2(n)\}}.   (3.40)

Equation (3.40) emphasizes the importance of the undisturbed error e_u(n), respectively of its short-term power E\{e_u^2(n)\}:

• If there is a good match between the LEM system and the adaptive filter, this term is small.
• If at the same time the local signal n(n) is large, the optimal step size approaches zero – the adaptation freezes.

We have discussed here only a scalar step-size factor \mu(n). This means that the same factor is applied to all filter coefficients. Numerous suggestions have been made to replace the scalar step-size factor by a diagonal matrix in order to apply distinct factors to different elements of the coefficient vector.
An example is an exponentially weighted step size taking into account the exponential decay of the impulse responses of LEM systems [30], [31], [43]. The implementation of the optimal step size requires the solution of a number of problems. A variety of detection and estimation schemes for approximating the optimal step size according to (3.39) have been published over the last years. A presentation of all of these control methods with their individual complexities and reliabilities would be far beyond the scope of this chapter. An overview of these methods can be found in [29]. In the next section, we present one of the control schemes as an example.

3.5.2 A Method for Estimating the Optimal Step Size
The implementation of the optimal step size derived in the previous section requires the estimation of several quantities:

• the expectations of signal powers have to be approximated by short-term estimates, and
• an estimation method for the not directly observable undisturbed error e_u(n), respectively for its short-term power, has to be derived.

Estimation of Short-Term Signal Power. Short-term signal power can easily be estimated by squaring the signal amplitude and smoothing this value with an IIR filter. A filter of first order has proved to be sufficient [5]:

\overline{|x(n)|^2} = [1 - \gamma(n)]\, |x(n)|^2 + \gamma(n)\, \overline{|x(n-1)|^2}.   (3.41)

If a rising signal amplitude should be detected faster than a falling one, a shorter time constant can be used for a rising edge than for a falling one:

\gamma(n) = \begin{cases} \gamma_r, & \text{if } |x(n)|^2 > \overline{|x(n-1)|^2}, \\ \gamma_f, & \text{else.} \end{cases}   (3.42)

Typically, both constants are chosen in [0.9, 0.999]. Applying different time constants gives rise to a (small) bias that can be neglected in this application. Where squaring the signal amplitude causes a problem because of fixed-point arithmetic, the square can be replaced by the magnitude of the amplitude [25], [42]. Square and magnitude are related by a factor depending on the probability density function of the signal amplitude. If two short-term estimates of signal powers are compared with each other, as is done in most cases when controlling echo cancellation filters, this factor cancels out.
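A minimal sketch of this power estimator follows; the time constants are example values from the typical range, and the function name is our own.

```python
import numpy as np

def short_term_power(x, gamma_r=0.9, gamma_f=0.999):
    """First-order IIR smoothing of |x(n)|^2, cf. (3.41), with a faster
    time constant for rising amplitudes, cf. (3.42)."""
    p = np.zeros(len(x))
    p[0] = x[0] ** 2
    for n in range(1, len(x)):
        gamma = gamma_r if x[n] ** 2 > p[n - 1] else gamma_f
        p[n] = (1.0 - gamma) * x[n] ** 2 + gamma * p[n - 1]
    return p
```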
Estimation of the Undisturbed Error. Estimating the undisturbed error requires an estimate of the mismatch vector \Delta_h(n) [see (3.15)]. Obviously, the impulse response vector h(n) of the LEM system is not known. However, if an artificial delay of N_D samples is inserted before the loudspeaker [49],
the echo cancellation filter also models this part of the unknown impulse response. The impulse response coefficients related to this delay (later called delay coefficients) are zero:

h_i(n) = 0, \quad \text{for } i = 0, \ldots, N_D - 1.   (3.43)
The NLMS algorithm has the property of distributing coefficient errors equally over all coefficients. Therefore, from the mismatch of the first N_D coefficients one can estimate the system distance [see (3.14)]:

\Delta_{act}(n) = \frac{N}{N_D} \sum_{i=0}^{N_D - 1} \hat{h}_i^2(n).   (3.44)
First-order IIR smoothing is used to reduce the variance of the estimate (3.44):

\overline{\Delta}(n) = \gamma_\Delta\, \overline{\Delta}(n-1) + (1 - \gamma_\Delta)\, \Delta_{act}(n).   (3.45)
Assuming statistical independence of the input signal x(n) and the filter coefficients, the optimal step size according to (3.39) is approximately given by

\mu_{opt}(n) \approx \hat{\mu}(n) = \frac{\overline{|x(n)|^2}\; \overline{\Delta}(n)}{\overline{|e(n)|^2}}.   (3.46)
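In code, the estimate (3.44)–(3.46) may look as follows; omitting the smoothing of (3.45) and clipping μ to a stable range are simplifications we added.

```python
import numpy as np

def estimate_step_size(h_hat, px, pe, N, N_D):
    """Step-size estimate from the N_D delay coefficients, cf. (3.44)-(3.46).

    px, pe : short-term power estimates of x(n) and e(n), cf. (3.41)
    """
    delta_act = (N / N_D) * np.sum(h_hat[:N_D] ** 2)  # system distance, (3.44)
    mu = px * delta_act / max(pe, 1e-10)              # step size, (3.46)
    return min(max(mu, 0.0), 1.0)                     # clip to a stable range
```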
The performance of this method proves to be quite reliable. It has one deficiency, however: the update of the echo cancellation filter freezes in case of a change of the impulse response of the LEM system. The reason for this behaviour is that the coefficients related to the artificial delay remain equal to zero in that case. Therefore, applying this method requires an additional detector for changes of the LEM system. Several methods are known [29]. A reliable indicator is based on a so-called shadow filter. This is a short adaptive filter in parallel to the reference (or foreground) echo cancellation filter [39]. The shadow filter is adapted similarly to the reference filter; however, its step-size control is only excitation-based, i.e., adaptation is stopped if the remote excitation falls below a predetermined threshold. Furthermore, only half or less of the number of coefficients of the reference filter is used for the shadow filter. These features ensure a high convergence speed of the shadow filter in the case of remote single-talk. Of course, the filter diverges in case of local distortions. However, fast convergence after enclosure dislocations is ensured because the step-size control is independent of the methods that can freeze the adaptive filter in these situations. Hence, the only situations in which the shadow filter is better adjusted to the LEM echo path than the reference filter are enclosure dislocations. This is exploited to develop a detection mechanism: if the error signal of the shadow filter falls below the error signal of the reference filter for several iterations, an enclosure dislocation is detected. In this case, the delay coefficients are reinitialized in such a manner that
• the resulting estimate of the system distance is equal to the ratio of the two short-term power estimates \overline{|x(n)|^2} and \overline{|e(n)|^2},

\frac{N}{N_D} \sum_{i=0}^{N_D - 1} \hat{h}_i^2(n) = \frac{\overline{|e(n)|^2}}{\overline{|x(n)|^2}},   (3.47)

• all delay coefficients have the same absolute value,

|\hat{h}_i(n)| = |\hat{h}_{i+1}(n)|, \quad \text{for } i = 0, \ldots, N_D - 2,   (3.48)

• and adjacent coefficients have alternating signs,

\mathrm{sgn}\,\hat{h}_i(n) = -\mathrm{sgn}\,\hat{h}_{i+1}(n), \quad \text{for } i = 0, \ldots, N_D - 2.   (3.49)

All requirements can be fulfilled by setting the delay coefficients to

\hat{h}_i(n) = (-1)^i \sqrt{\frac{1}{N}\, \frac{\overline{|e(n)|^2}}{\overline{|x(n)|^2}}}   (3.50)

in case of detected enclosure dislocations.
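A sketch of this reinitialization, directly implementing (3.50); the small constant guarding against division by zero is our addition.

```python
import numpy as np

def reinit_delay_coefficients(h_hat, px, pe, N, N_D):
    """Reinitialize the first N_D (delay) coefficients after a detected
    enclosure dislocation according to (3.50): equal magnitudes and
    alternating signs, so that (3.47)-(3.49) are fulfilled."""
    mag = np.sqrt(pe / (N * max(px, 1e-10)))
    h_hat[:N_D] = (-1.0) ** np.arange(N_D) * mag
    return h_hat
```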
To conclude this section, we present a simulation with pre-recorded speech signals and measured impulse responses. For the estimation of the system distance, the delay-coefficients method was implemented with N_D = 40 delay coefficients. The length of the adaptive filter was set to N = 1024, and speech signals were used for the remote excitation as well as for the local distortion. Both signals are depicted in the two upper sections of Fig. 3.11 – a double-talk situation appears between iteration steps 30,000 and 40,000. After 62,000 iterations, an enclosure dislocation takes place (a book was placed between the loudspeaker and the microphone). To avoid freezing of the adaptive filter coefficients, a shadow filter of length N_S = 256 was implemented. If the power of the error signal of the shadow filter falls 12 dB below the error power of the reference filter, an enclosure dislocation is detected and the first N_D filter coefficients are reinitialized. In the middle part of Fig. 3.11, the estimated and the true system distance are depicted. The rescue mechanism needs about 3,000 iterations to detect the enclosure dislocation. During this time, the step size μ(n) was set to very small values (see the lower part of Fig. 3.11). After 65,000 iterations, the filter converges again. Even without detection of local speech activity, the step size was reduced during the double-talk situation.
3.6 Suppression of Residual Echoes
The coupling between the loudspeaker and the microphone is reduced in proportion to the degree of match between the echo cancellation filter \hat{h}(n) and the original system h(n). Since, in real applications, a perfect match (over all times and all situations) cannot be achieved, the remaining signal e(n) still contains echo components (see the simulation results in Fig. 3.11).
Fig. 3.11. Simulation example for an entire step-size control unit. Speech signals were used for excitation as well as for local distortion (see first two diagrams). After 60,000 iterations a book was placed between the loudspeaker and the microphone. In the third diagram, the true and the estimated system distance are depicted. The lowest diagram shows the step size, based on the estimated norm of the system mismatch vector.
To reduce these components further, a Wiener-type filter with a transfer function according to

W(e^{j\Omega}, n) = 1 - \frac{S_{e_u e_u}(\Omega, n)}{S_{ee}(\Omega, n)}   (3.51)

can be utilized (see Sect. 3.3.3). When applying (3.51), the power spectral densities S_{ee}(\Omega, n) and S_{e_u e_u}(\Omega, n) have to be replaced by their short-term
estimates \hat{S}_{ee}(\Omega, n) and \hat{S}_{e_u e_u}(\Omega, n). As a consequence, the quotient may become larger than 1, in which case the filter would exhibit a phase shift of \pi. To prevent that, the filter transfer function (3.51) is (heuristically) modified to

W(e^{j\Omega}, n) = \max\left\{ 1 - \beta_{over}\, \frac{\hat{S}_{e_u e_u}(\Omega, n)}{\hat{S}_{ee}(\Omega, n)},\; W_{min} \right\},   (3.52)

where the spectral floor W_{min} determines the maximal attenuation of the filter. The so-called overestimation parameter \beta_{over} is a second heuristic modification of the transfer function of the echo suppression filter. Using this parameter, the "aggressiveness" of the filter can be adjusted. Details can be found in [41]. In order to estimate the short-term power spectral density of the error signal, \hat{S}_{ee}(\Omega, n), a frequency analysis is required. The short-time Fourier transform (STFT) is one of several methods to transform a non-stationary input signal into the frequency domain. When performing the STFT, the input signal is periodically multiplied with a window function g(n) before performing a discrete Fourier transform (DFT):

E(e^{j\Omega_\mu}, n) = \sum_{k=0}^{M-1} g(k)\, e(nr - k)\, e^{-j\Omega_\mu k}.   (3.53)
The spectrum of the error signal is computed every r samples at M equidistant frequency supporting points

\Omega_\mu = 2\pi\, \frac{\mu}{M}, \quad \text{with } \mu \in \{0, \ldots, M-1\}.   (3.54)

In accordance with Sect. 3.5.2, first-order IIR smoothing of the squared magnitude is applied in order to estimate the short-term power spectral density:

\hat{S}_{ee}(\Omega_\mu, nr) = (1 - \gamma)\, |E(e^{j\Omega_\mu}, n)|^2 + \gamma\, \hat{S}_{ee}(\Omega_\mu, nr - r).   (3.55)

In contrast to the short-term power estimation presented in Sect. 3.5.2, only a fixed time constant \gamma is used here. Since the undisturbed error signal e_u(n) is not accessible, its short-term power spectral density \hat{S}_{e_u e_u}(\Omega_\mu, n) cannot be estimated in the same manner as \hat{S}_{ee}(\Omega_\mu, n). According to our model of the LEM system, the undisturbed error e_u(n) can be expressed by a convolution of the excitation signal x(n) and the system mismatch \Delta_{h,i}(n):

e_u(n) = \sum_{i=0}^{N-1} \Delta_{h,i}(n)\, x(n-i).   (3.56)
Hence, the power spectral density of the undisturbed error signal can be estimated by multiplying the short-term power spectral density of the excitation signal with the squared magnitude spectrum of the estimated system
mismatch:

\hat{S}_{e_u e_u}(\Omega_\mu, nr) = \hat{S}_{xx}(\Omega_\mu, nr)\, |\hat{\Delta}_H(e^{j\Omega_\mu}, nr)|^2.   (3.57)
The short-term power spectral density of the excitation signal is estimated in the same manner as for the error signal [according to (3.55)]. For estimating the system mismatch, a so-called double-talk detector [6] and a detector for enclosure dislocations are required. A detailed description of such detectors would go far beyond the scope of this chapter. The interested reader can find several examples of detection schemes in [13] and [29]. If the local background noise b(n) has only moderate power, the spectrum of the system mismatch can be estimated by smoothing the ratio of the error and the excitation spectrum:

|\hat{\Delta}_H(e^{j\Omega_\mu}, nr)|^2 = \begin{cases} (1 - \gamma_\Delta)\, |\hat{\Delta}_H(e^{j\Omega_\mu}, nr - r)|^2 + \gamma_\Delta\, \frac{\hat{S}_{ee}(\Omega_\mu, nr)}{\hat{S}_{xx}(\Omega_\mu, nr)}, & \text{if condition (3.59) is true,} \\ |\hat{\Delta}_H(e^{j\Omega_\mu}, nr - r)|^2, & \text{else.} \end{cases}   (3.58)

The recursive smoothing is updated only if one of the following conditions is true:

(1) remote single-talk is detected, or (2) enclosure dislocations are detected.   (3.59)
The suppression of residual echoes can be performed either in the time or in the frequency domain. In the first case, the frequency response of the suppression filter is periodically transformed via inverse DFT into the time domain:

w(n) = \begin{cases} \mathrm{IDFT}\{ W(e^{j\Omega_\mu}, n) \}, & \text{if } n \bmod r \equiv 0, \\ w(n-1), & \text{else.} \end{cases}   (3.60)

In the latter case, the spectrum of the error signal is multiplied with the frequency response of the suppression filter,

\hat{S}(e^{j\Omega_\mu}, n) = E(e^{j\Omega_\mu}, n)\, W(e^{j\Omega_\mu}, nr),   (3.61)

and the resulting signal spectrum is transformed via inverse DFT into the time domain using overlap-add techniques.
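A minimal sketch of the frequency-domain variant, combining (3.52), (3.53), and (3.61); the parameter values mirror the example below (β_over = 3 and a floor corresponding to 12 dB maximum attenuation), while the function names and the PSD inputs are our own.

```python
import numpy as np

def suppression_gain(S_ee, S_euu, beta_over=3.0, W_min=10 ** (-12 / 20)):
    """Suppression filter according to (3.52), evaluated for all M bins."""
    W = 1.0 - beta_over * S_euu / np.maximum(S_ee, 1e-10)
    return np.maximum(W, W_min)

def suppress_frame(e_frame, g, S_ee, S_euu):
    """Apply the gain to one windowed frame of the error signal, cf. (3.53)
    and (3.61); the results are recombined by overlap-add outside this
    function. S_ee and S_euu are the smoothed PSD estimates (3.55), (3.57)."""
    E = np.fft.fft(g * e_frame)                 # STFT of the current frame
    S_hat = E * suppression_gain(S_ee, S_euu)   # spectral weighting, (3.61)
    return np.real(np.fft.ifft(S_hat))          # back to the time domain
```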
Fig. 3.12. Suppression of residual echoes. In the upper two diagrams, the excitation signal x(n) as well as the sum of all local signals n(n) are depicted. The bottom diagram shows the short-term power of the microphone signal y(n), of the error signal (after echo cancellation) e(n), and of the signal after echo suppression ŝ(n).
In Fig. 3.12, an example of residual echo suppression is presented. In the two upper diagrams, the excitation signal x(n) as well as the sum of all local signals n(n) are depicted. The bottom diagram shows the short-term power of the microphone signal y(n), of the error signal (after echo cancellation) e(n), and of the signal after echo cancellation and suppression ŝ(n). The echo cancellation was performed using the NLMS algorithm and a control scheme as presented in Sect. 3.5.2. In remote single-talk situations, the residual echo signal is attenuated by 12 dB, which was the maximum attenuation according to the parameter W_min. During the double-talk situation (3.5–6.2 seconds), only a small attenuation of about 3 dB was introduced. The overestimation factor was chosen as β_over = 3.

Besides the advantages of residual echo suppression, its disadvantages should also be mentioned here. In contrast to echo cancellation, echo suppression schemes introduce attenuation into the sending path of the local signal. Therefore, a compromise between attenuation of residual echoes and reduction of the local speech quality always has to be made. In case of large background noise, a modulation effect due to the attenuation of the noise in remote single-talk situations appears. In such scenarios, the echo suppression should always be combined with (background) noise suppression.
Table 3.2. Advantages (+) and disadvantages (−) of different processing structures (resol. = resolution).

Processing structure     |      Fullband      |       Block        |      Subband
                         | Time    | Freq.    | Time    | Freq.    | Time    | Freq.
                         | resol.  | resol.   | resol.  | resol.   | resol.  | resol.
Control alternatives     |   ++    |   −−     |   −−    |   ++     |    +    |    +
Computational complexity |        −−          |        ++          |         +
Artificial delay         |        ++          |        −−          |         −
3.7 Processing Structures
Besides selecting different control structures, the system designer also has the possibility to choose between different processing structures: fullband processing, block processing, and subband processing. Before the three processing structures and the related control alternatives are described in more detail in the next three subsections, Table 3.2 gives an overview of the advantages (+) and the disadvantages (−) of the different structures.

3.7.1 Fullband Processing
Fullband processing structures according to Fig. 3.1 offer the possibility to adjust the control parameter μ(n) to a different value in each iteration. For this reason, fullband processing has the best time resolution among all processing structures. Especially for impulse responses that concentrate their energy on only a few coefficients, this is an important advantage for control purposes. Even though fullband processing structures do not allow frequency-selective control or normalization, they have the fundamental advantage of not introducing any delay into the signal path. For some applications, this is a necessary feature.
3.7.2 Block Processing
Long time-domain adaptive filters require huge processing power due to their large number of coefficients. For many applications, such as acoustic echo or noise control, algorithms with low numerical complexity are necessary. To remedy the complexity problem, adaptive filters based on block processing [35], [38], [44] can be used. In general, most block processing algorithms collect B input signal samples before they calculate a block of B output signal samples. Consequently, the filter is adapted only once every B sampling instants. To reduce the computational complexity, the convolution and the adaptation are performed in the frequency domain (see the sketch below).

Besides the advantage of reduced computational complexity, block processing also has disadvantages. Since only one adaptation is computed every B samples, the time resolution for control purposes is reduced. If the signal-to-noise ratio changes in the middle of a signal block, for example, the control parameters can only be adjusted to the mean signal-to-noise ratio (averaged over the block length). Especially for a large block length B, and therefore a large reduction of computational complexity, the reduced time resolution clearly impacts performance. If the filter update is performed in the frequency domain, a new degree of freedom arises: each frequency bin of the update of the transformed filter vector can be weighted individually. Besides all the advantages of block processing, another inherent disadvantage of this processing structure should also be mentioned: due to the collection of B samples, a corresponding delay is introduced in the signal paths.
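As an illustration of such a structure, the following is a minimal sketch of one block iteration of a constrained frequency-domain (overlap-save) adaptive filter with block length B equal to the filter length N; using the instantaneous bin-wise power instead of a smoothed estimate is a simplification on our part.

```python
import numpy as np

def fdaf_block(H, x_buf, y_blk, mu=0.5, eps=1e-6):
    """One block of a constrained frequency-domain adaptive filter.

    H     : filter spectrum of length 2N (DFT of the zero-padded coefficients)
    x_buf : last 2N input samples [x(n-2N+1), ..., x(n)]
    y_blk : current block of N microphone samples
    """
    N = len(y_blk)
    X = np.fft.fft(x_buf)
    y_hat = np.real(np.fft.ifft(X * H))[N:]          # overlap-save: keep last N samples
    e = y_blk - y_hat
    E = np.fft.fft(np.concatenate([np.zeros(N), e]))
    grad = np.real(np.fft.ifft(np.conj(X) * E / (np.abs(X) ** 2 + eps)))
    grad[N:] = 0.0                                   # gradient constraint
    H = H + 2.0 * mu * np.fft.fft(grad)              # bin-wise normalized update
    return H, e
```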
3.7.3 Subband Processing
Another possibility to reduce the computational complexity is to apply subband processing. By using filter banks [8], [47], [48], the excitation signal x(n) and the (distorted) desired signal y(n) are split up into several subbands (see Fig. 3.13). Depending on the properties of the lowpass and bandpass filters, the sampling rate in the subbands can be reduced. According to this reduction, the adaptive filters can be shortened. Instead of filtering the fullband signal and adapting one filter at the high sampling rate, M (number of subbands) convolutions and adaptations with subsampled signals are performed in parallel at reduced sampling rates. The factors \varphi_\mu(n) (see Fig. 3.13) are the echo suppression coefficients. They are an approximation of the transfer function W(e^{j\Omega}) according to (3.21) at the discrete frequency supporting points \Omega_\mu = 2\pi\mu/M. The subscript \mu \in \{0, \ldots, M-1\} is used to address the subband channels. The final output signal ŝ(n) is obtained by recombining and upsampling the subband signals ŝ_\mu(n).
Fig. 3.13. Structure of an acoustic echo control system operating in subbands. By using two analysis filterbanks, the far-end speaker signal and the microphone signal are decomposed into M subbands. In each subband, a digital replica of the echo signal is generated by convolution of the subband excitation signals and the estimated subband impulse responses. After subtraction of the estimated subband echo signals from the microphone signals and multiplication by the echo suppression factors, the fullband output signal ŝ(n) is synthesized using a third filterbank.
In contrast to block (frequency-domain) processing, the subband structure offers the system designer a second additional degree of freedom. Besides the possibility of frequency-dependent control, the dimensioning of the system can also be done on a frequency-selective basis. As an example, the orders of the adaptive filters can be adjusted individually in each channel according to the statistical properties of the excitation signal, the measurement noise, and the impulse response of the system to be identified. Using subsampled signals leads to a reduction of computational complexity. All necessary forms of control and detection can operate independently in each channel. The price to be paid for these advantages is – as in block processing structures – a delay introduced into the signal path by the analysis and synthesis filter banks.
3.8 Conclusions
Powerful and affordable acoustic echo cancelling systems are available. Their performance is satisfactory, especially when compared to solutions in other voice processing areas such as speech recognition or speech-to-text translation.
The fact that echo control systems have not yet entered the market on a large scale seems to be not a technical but a marketing problem: a customer who buys a high-quality echo suppressing system pays for the comfort of his communication partner. Using a poor system only affects the partner at the far end, who usually is too polite to complain.

Future research and development in the area of acoustic echo cancellation certainly will not have to take processing power restrictions into account. This has a number of consequences: the implementation of even sophisticated procedures on ordinary (office) PCs will be possible. This will make it easier to test modifications of existing procedures or completely new ideas in real-time and in real environments. The performance of future systems will approach limits only given by the environment they have to work in. It will no longer be limited by the restricted capabilities of affordable hardware; it will depend only on the quality of the algorithms implemented.

This does not necessarily mean that future systems will be perfectly reliable in all situations. The reliability of estimation procedures used to detect system states – such as a change of the impulse response of the LEM system or the beginning of double-talk – depends on the length of the usable data record. However, since the working environment is highly time-varying and non-stationary, the usage of too long records can cause the loss of the real-time capability.

Up to now, the NLMS algorithm plays the role of a "work horse" for acoustic echo cancelling. The AP algorithm offers improved performance at modest additional implementation and processing cost, and it does not cause stability problems that are difficult to solve. Rules for step-size control used for the NLMS algorithm, however, have to be reconsidered.

In recent years, frequency-selective suppression schemes for residual echoes have gained increased importance. In most systems, they are combined with (background) noise reduction algorithms. Even if they offer a large amount of suppression at moderate computational cost, they are not able to replace the "traditional" echo cancellation filter. Nevertheless, echo suppression schemes should be combined with the classical adaptive filters. In order to reduce speech distortions, a joint control of both devices – suppression and cancellation – is necessary.

Customer demands are increasing with time. Using available systems, customers will certainly ask for better performance. Therefore, the need for new and better ideas will remain. Acoustic echo control will continue to be one of the most interesting problems in digital signal processing.
Acknowledgment The authors would like to thank Dr. Dennis R. Morgan for carefully reading the draft of this chapter and making many very useful suggestions to improve it.
References

1. "The new Bell telephone," Scientific American, vol. 37, p. 1, 1877.
2. C. Beaugeant, V. Turbin, P. Scalart, and A. Gilloire, "New optimal filtering approaches for hands-free telecommunication terminals," Signal Processing, vol. 64, no. 1, pp. 33–47, 1998.
3. J. Benesty, T. Gänsler, D. R. Morgan, M. M. Sondhi, and S. L. Gay, Advances in Network and Acoustic Echo Cancellation. Springer, Berlin, Germany, 2001.
4. C. Breining, P. Dreiseitel, E. Hänsler, A. Mader, B. Nitsch, H. Puder, T. Schertler, G. Schmidt, and J. Tilp, "Acoustic echo control," IEEE Signal Processing Magazine, vol. 16, no. 4, pp. 42–69, 1999.
5. T. Burger, "Practical application of adaptation control for NLMS algorithms used for echo cancellation with speech signals," in Proc. of the IWAENC, pp. 87–90, 1995.
6. J. H. Cho, D. R. Morgan, and J. Benesty, "An objective technique for evaluating doubletalk detectors in acoustic echo cancelers," IEEE Trans. on Speech and Audio Processing, vol. 7, no. 6, pp. 718–724, 1999.
7. W. F. Clemency, F. F. Romanow, and A. F. Rose, "The Bell System Speakerphone," AIEE Trans., vol. 76(I), pp. 148–153, 1957.
8. R. E. Crochiere and L. R. Rabiner, Multirate Digital Signal Processing. Prentice Hall, Englewood Cliffs, New Jersey, USA, 1983.
9. J. Deller, J. Hansen, and J. Proakis, Discrete-Time Processing of Speech Signals. IEEE Press, New York, USA, 1993.
10. M. Dörbecker and P. Vary, "Reducing the delay of an acoustic echo canceller with subband adaptation," in Proc. of the IWAENC, pp. 103–106, 1995.
11. ETS 300 903 (GSM 03.50), "Transmission planning aspects of the speech service in the GSM public land mobile network (PLMN) system," ETSI, March 1999.
12. S. L. Gay and S. Travathia, "The fast affine projection algorithm," in Proc. of the ICASSP, pp. 3023–3027, 1995.
13. S. L. Gay and J. Benesty (editors), Acoustic Signal Processing for Telecommunication. Kluwer, Boston, Massachusetts, USA, 2000.
14. A. Gilloire, "State of the art in acoustic echo cancellation," in A. R. Figueiras and D. Docampo (eds.), Adaptive Algorithms: Applications and Non Classical Schemes, Universidad de Vigo, pp. 20–31, 1991.
15. A. Gilloire, E. Moulines, D. Slock, and P. Duhamel, "State of the art in acoustic echo cancellation," in A. R. Figueiras-Vidal (ed.), Digital Signal Processing in Telecommunications, Springer, London, UK, pp. 45–91, 1996.
16. A. Gilloire, P. Scalart, C. Lamblin, C. Mokbel, and S. Proust, "Innovative speech processing for mobile terminals: an annotated bibliography," Signal Processing, vol. 80, no. 7, pp. 1149–1166, 2000.
17. G. Glentis, K. Berberidis, and S. Theodoridis, "Efficient least squares adaptive algorithms for FIR transversal filtering: a unified view," IEEE Signal Processing Magazine, vol. 16, no. 4, pp. 13–41, 1999.
18. S. Gustafsson, R. Martin, and P. Vary, "Combined acoustic echo control and noise reduction for hands-free telephony," Signal Processing, vol. 64, pp. 21–32, 1998.
19. S. Gustafsson, P. Jax, A. Kamphausen, and P. Vary, "A postfilter for echo and noise reduction avoiding the problem of musical tones," in Proc. of the ICASSP, vol. 2, pp. 873–876, 1999.
20. E. Hänsler, "The hands-free telephone problem – An annotated bibliography," Signal Processing, vol. 27, pp. 259–271, 1992.
21. E. Hänsler, "The hands-free telephone problem – An annotated bibliography update," Annales des Télécommunications, vol. 49, pp. 360–367, 1994.
22. E. Hänsler, "The hands-free telephone problem – A second annotated bibliography update," in Proc. of the IWAENC, pp. 107–114, 1995.
23. E. Hänsler and G. U. Schmidt, "Hands-free telephones – Joint control of echo cancellation and post filtering," Signal Processing, vol. 80, pp. 2295–2305, 2000.
24. S. Haykin, Adaptive Filter Theory, Fourth Edition. Prentice Hall, Englewood Cliffs, New Jersey, USA, 2002.
25. P. Heitkämper, "An adaptation control for acoustic echo cancellers," IEEE Signal Processing Letters, vol. 4, no. 6, pp. 170–172, 1997.
26. ITU-T Recommendation G.167, "General characteristics of international telephone connections and international telephone circuits – Acoustic echo controllers," ITU-T Recommendations, March 1993.
27. H. Kuttruff, "Sound in enclosures," in M. J. Crocker (ed.), Encyclopedia of Acoustics, Wiley, New York, USA, pp. 1101–1114, 1997.
28. A. P. Liavas and P. A. Regalia, "Acoustic echo cancellation: do IIR filters offer better modelling capabilities than their FIR counterparts?," IEEE Trans. on Signal Processing, vol. 46, no. 9, pp. 2499–2504, 1998.
29. A. Mader, H. Puder, and G. Schmidt, "Step-size control for acoustic echo cancellation filters – An overview," Signal Processing, vol. 80, pp. 1697–1719, 2000.
30. S. Makino and Y. Kaneda, "Exponentially weighted step-size projection algorithm for acoustic echo cancellers," IEICE Trans. Fundamentals, vol. E75-A, no. 11, pp. 1500–1507, 1992.
31. S. Makino, Y. Kaneda, and N. Koizumi, "Exponentially weighted step-size NLMS adaptive filter based on the statistics of a room impulse response," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 1, no. 1, pp. 101–108, 1993.
32. R. Martin and P. Vary, "Combined acoustic echo control and noise reduction for hands-free telephony – State of the art and perspectives," in Proc. of the EUSIPCO, pp. 1107–1110, 1996.
33. M. Mboup and M. Bonnet, "On the adequatness of IIR adaptive filtering for acoustic echo cancellation," in Proc. of the EUSIPCO, pp. 111–114, 1992.
34. R. Merched, P. Diniz, and M. Petraglia, "A new delayless subband adaptive filter structure," IEEE Trans. on Signal Processing, vol. 47, no. 6, pp. 1580–1591, June 1999.
35. W. Mikhael and F. Wu, "Fast algorithms for block FIR adaptive digital filtering," IEEE Trans. on Circuits and Systems, vol. 34, pp. 1152–1160, Oct. 1987.
36. D. R. Morgan and J. C. Thi, "A delayless subband adaptive filter architecture," IEEE Trans. on Signal Processing, vol. 43, no. 8, pp. 1819–1830, 1995.
37. V. Myllylä, "Robust fast affine projection algorithm for acoustic echo cancellation," in Proc. of the IWAENC, pp. 143–146, 2001.
38. B. Nitsch, "The partitioned exact frequency domain block NLMS algorithm, a mathematically exact version of the NLMS algorithm working in the frequency domain," AEÜ Intern. Journ. of Electronics and Communication, vol. 52, pp. 293–301, 1998.
39. K. Ochiai, T. Araseki, and T. Ogihara, "Echo canceler with two echo path models," IEEE Trans. on Communications, vol. COM-25, no. 6, pp. 589–595, 1977.
40. K. Ozeki and T. Umeda, "An adaptive filtering algorithm using an orthogonal projection to an affine subspace and its properties," Electronics and Communications in Japan, vol. 67-A, no. 5, pp. 19–27, 1984.
41. T. F. Quatieri, Discrete-Time Speech Signal Processing. Prentice Hall, Englewood Cliffs, New Jersey, USA, 2002.
42. T. Schertler and G. U. Schmidt, "Implementation of a low-cost acoustic echo canceller," in Proc. of the IWAENC, pp. 49–52, 1997.
43. T. Schertler, "Selective block update of NLMS type algorithms," in Proc. of the 32nd Annual Asilomar Conf. on Signals, Systems, and Computers, pp. 399–403, Nov. 1998.
44. J. Shynk, "Frequency-domain and multirate adaptive filtering," IEEE Signal Processing Magazine, vol. 9, no. 1, pp. 14–37, 1992.
45. D. Slock and T. Kailath, "Fast transversal RLS algorithms," in N. Kalouptsidis and S. Theodoridis (eds.), Adaptive System Identification and Signal Processing Algorithms, Prentice Hall, Englewood Cliffs, New Jersey, USA, 1993.
46. V. Turbin, A. Gilloire, and P. Scalart, "Comparison of three post-filtering algorithms for residual acoustic echo reduction," in Proc. of the ICASSP, Munich, Germany, pp. 307–310, 1997.
47. P. P. Vaidyanathan, "Multirate digital filter banks, polyphase networks, and applications: a tutorial," Proc. of the IEEE, vol. 78, no. 1, pp. 56–93, Jan. 1990.
48. P. P. Vaidyanathan, Multirate Systems and Filter Banks. Prentice Hall, Englewood Cliffs, New Jersey, USA, 1992.
49. S. Yamamoto and S. Kitayama, "An adaptive echo canceller with variable step gain method," Trans. of the IECE of Japan, vol. E 65, no. 1, pp. 1–8, 1982.
4 Multichannel Frequency-Domain Adaptive Filtering with Application to Multichannel Acoustic Echo Cancellation

Herbert Buchner¹, Jacob Benesty², and Walter Kellermann¹

¹ University of Erlangen-Nuremberg, Telecommunications Laboratory, Cauerstr. 7, D-91058 Erlangen, Germany. E-mail: {buchner, wk}@LNT.de
² Bell Laboratories, Lucent Technologies, Murray Hill, NJ 07974, USA. E-mail: [email protected]
Abstract. In unknown environments where we need to identify, model, or track unknown and time-varying channels, adaptive filtering has been proven to be an effective tool. In this chapter, we focus on multichannel algorithms in the frequency domain that are especially well suited for input signals which are not only auto-correlated but also highly cross-correlated among the channels. These properties are particularly important for applications like multichannel acoustic echo cancellation. Most frequency-domain algorithms, as they are well known from the single-channel case, are derived from existing time-domain algorithms and are based on different heuristic strategies. Here, we present a new rigorous derivation of a whole class of multichannel adaptive filtering algorithms in the frequency domain based on a recursive least-squares criterion. Then, from the so-called normal equation, we derive a generic adaptive algorithm in the frequency domain that we formulate in different ways. An analysis of this multichannel algorithm shows that the mean-squared error convergence is independent of the input signal statistics. A useful approximation provides interesting links between some well-known algorithms for the single-channel case and the general framework. We also give design rules for important parameters to optimize the performance in practice. Due to the rigorous approach, the proposed framework inherently takes the coherence between all input signal channels into account, while the computational complexity is kept low by introducing several new techniques, such as a robust recursive Kalman gain computation in the frequency domain and efficient fast Fourier transform (FFT) computation tailored to overlapping data blocks. Simulation results and real-time performance for multichannel acoustic echo cancellation show the high efficiency of the approach.
4.1 Introduction
The ability of an adaptive filter to operate satisfactorily in an unknown environment and to track time variations of input statistics makes it a powerful tool in such diverse fields as communications, acoustics, radar, sonar,
seismology, and biomedical engineering. Despite the large variety of applications, four basic classes of adaptive filtering applications may be distinguished [1]: system identification, inverse modeling, prediction, and interference cancelling. In speech and acoustics, where all those basic types of adaptive filtering can be found, we often have to deal with very long filters (sometimes several thousand taps), unpredictably time-variant environments, and highly non-stationary and auto-correlated signals. In addition, the simultaneous processing of multiple input streams, i.e., multichannel adaptive filtering (MC-ADF), is becoming more and more desirable for future products. Typical examples are multichannel acoustic echo cancellation (system identification) or adaptive beamforming microphone arrays (interference cancelling). In this chapter, we investigate adaptive MIMO (multiple-input and multiple-output) systems that are updated in the frequency domain. The resulting generalized multichannel frequency-domain adaptive filtering has already led to efficient real-time implementations of multichannel acoustic echo cancellers on standard personal computers [2], [3].

Generally, we distinguish two classes of adaptive algorithms. One class includes filters that are updated in the time domain, usually on a sample-by-sample basis, like the classical least-mean-square (LMS) [4] and recursive least-squares (RLS) [5] algorithms. The other class may be defined as filters that are updated in the discrete Fourier transform (DFT) domain ('frequency domain'), block-by-block in general, using the fast Fourier transform (FFT) as a powerful vehicle. As a result of this block processing, the arithmetic complexity of the latter category is significantly reduced compared to time-domain adaptive algorithms. The possibility to exploit the efficiency of FFT algorithms is due to the Toeplitz structure of the matrices involved, which results from the transversal structure of the adaptive filters. The Toeplitz matrices can be expressed by circulant matrices which are diagonalizable by the DFT. Consequently, the key for deriving the frequency-domain adaptive algorithms is to formulate the time-domain error criterion so that Toeplitz and circulant matrices are explicitly shown. In addition to the low complexity, another advantage resulting from this diagonalization in frequency-domain adaptive filtering is that the adaptation stepsize can be normalized independently for each frequency bin, which results in a more uniform convergence over the entire frequency range.

Single-channel frequency-domain adaptive filtering was first introduced by Dentino et al., based on the LMS algorithm in the time domain [6]. Ferrara [7] was the first to present an efficient frequency-domain adaptive filter algorithm (FLMS) that converges to the optimum (Wiener) solution. Mansour and Gray [8] derived an even more efficient algorithm, the unconstrained FLMS (UFLMS), using only three FFT operations per block instead of five for the FLMS, with comparable performance [9].
4
Multichannel Acoustic Echo Cancellation
97
y(n) x1 (n)
^
h1 (n) ...
...
... xP(n)
^y(n)
+
-
e(n) +
^
hP(n)
Fig. 4.1. Multichannel adaptive filtering.
tions, a major handicap with these structures is the delay introduced between input and output. Indeed, for efficient implementations, this delay is equal to the length L of the adaptive filter, which is considerable for applications like acoustic echo cancellation. A new structure called multi-delay filter (MDF), using the classical overlap-save (OLS) method, was proposed in [10], [11] and generalized in [12] where the new block size N was made independent of the filter length L; N can be chosen as small as desired, with a delay equal to N . Although from a complexity point of view, the optimum choice is N = L, using smaller block sizes (N < L) in order to reduce the delay is still more efficient than time-domain algorithms. A more general scheme based on weighted overlap and add (WOLA) methods, the generalized multidelay filter (GMDFα) was proposed in [13], [14], where α is the overlap factor. The settings α > 1 appear to be very useful in the context of adaptive filtering, since the filter coefficients can be adapted more frequently (every N/α samples instead of every N samples in the conventional OLS scheme) and the delay can be (further) reduced as well. Thus, this structure introduces one more degree of freedom, but the complexity is increased roughly by a factor α. Taking the block size in the MDF as large as the delay permits will increase the convergence rate of the algorithm, while choosing the overlap factor greater than 1 will increase the tracking abilities of the algorithm. The case of multichannel adaptive filtering, as shown in Fig. 4.1, has been found to be structurally more difficult in general. In typical scenarios, the input signals xp (n), p = 1, . . . , P , to the adaptive filter are not only auto-correlated but also highly cross-correlated which often results in very ˆ p,κ (n), where κ = 0, . . . , L − slow convergence of the LP filter coefficients h 1. This problem becomes particularly severe in multichannel acoustic echo cancellation [15], [16], [17], where the signals xp (n) represent loudspeaker signals that may originate from a unique source. Signal y(n) represents an echo received by a microphone. Direct application of commonly used low-complexity algorithms, such as the LMS algorithm or conventional frequency-domain adaptive filtering, to the multichannel case usually leads to disappointing results as the crosscorrelations between the input channels are not taken into account [18]. In
98
H. Buchner et al.
contrast to this, high-order affine projection algorithms and RLS algorithms do take the cross-correlations into account. Indeed, it can be shown that the RLS provides optimum convergence speed even in the multichannel case [18], but its complexity is prohibitively high and, e.g., will not allow realtime implementation of multichannel acoustic echo cancellation on standard hardware any time soon. Two-channel frequency-domain adaptive filtering was first introduced in [19] in the context of stereophonic acoustic echo cancellation and derived from the extended least-mean-square (ELMS) algorithm [20] in the time domain using similar considerations as for the single-channel case outlined above. The rigorous derivation of frequency-domain adaptive filtering presented in the next section leads to a generic algorithm with RLS-like properties. We will also see that there is an efficient approximation of this algorithm taking the cross-correlations into account. The single-channel version of this algorithm provides a direct link to existing frequency-domain algorithms. The organization of this chapter is as follows. In Sect. 4.2, we introduce a frequency-domain recursive least-squares criterion from which the so-called normal equation is derived. Then, from the normal equation, we deduce a generic multichannel adaptive algorithm that we can formulate in different ways, and we introduce the so-called frequency-domain Kalman gain. In Sect. 4.3, we study the convergence of this multichannel algorithm. In Sect. 4.4, we consider the general MIMO case and, in Sect. 4.5, we give a very useful approximation, deduce some well-known single-channel algorithms as special cases, and explicitly show how the cross-correlations are taken into account in the multichannel case. We also give design rules for some important parameters such as the exponential window, regularization, and adaptation stepsize. A useful dynamical regularization method is discussed in more detail in Sect. 4.6. Section 4.7 introduces several methods for increasing computational efficiency in the multi-input and MIMO cases, such as a robust recursive Kalman gain computation and FFT computation tailored for overlapping data blocks. Section 4.8 presents some simulations and multichannel real-world implementations for hands-free speech communications. Finally, our results are summarized in Sect. 4.9.
4.2
General Derivation of Multichannel Frequency-Domain Algorithms
In the first part of this section we formulate a block recursive least-squares criterion in the frequency domain. Once the criterion is rigorously defined, the adaptive multichannel algorithm follows immediately. 4.2.1
Optimization Criterion
To obtain an optimization criterion for block adaptation, we assume the ˆ p,0 , . . . , ˆhp,L−1 for the (generally time-variant) adaptive filter coefficients h
4
Multichannel Acoustic Echo Cancellation
99
input channels 1, . . . , P to be fixed within the block intervals of length N . For convenience of notation, this allows us to omit the time index of the filter coefficients during the following derivation of the block error signal. From Fig. 4.1, it can be seen that the error signal at time n between the output of the multichannel adaptive filter yˆ(n) and the desired output signal y(n) is given by e(n) = y(n) − yˆ(n),
(4.1)
with yˆ(n) =
P L−1
ˆ p,κ . xp (n − κ)h
(4.2)
p=1 κ=0
ˆ p into segments of length N as in By partitioning the impulse responses h [10], [11], (4.2) can be written as yˆ(n) =
P K−1 −1 N
ˆ p,N k+κ , xp (n − N k − κ)h
(4.3)
p=1 k=0 κ=0
where we assume that the total filter length L is an integer multiple of N (N ≤ L), so that L = KN . For convenient notation of the multichannel algorithms, we rewrite (4.3) in vectorized form yˆ(n) =
P K−1
ˆp,k xTp,k (n)h
p=1 k=0
=
P
ˆp = xT (n)h, ˆ xTp (n)h
(4.4)
p=1
where xp,k (n) = [xp (n − N k), xp (n − N k − 1), · · · , xp (n − N k − N + 1)]T ,(4.5) ˆ p,k = [h ˆ p,N k , ˆ h hp,N k+1 , · · · , ˆ hp,N k+N −1 ]T , (4.6) T T (4.7) xp (n) = [xp,0 (n), xp,1 (n), · · · , xTp,K−1 (n)]T , ˆ p = [h ˆTp,0 , h ˆTp,1 , · · · , h ˆTp,K−1 ]T , (4.8) h x(n) = [xT1 (n), xT2 (n), · · · , xTP (n)]T , ˆ = [h ˆT , h ˆT , · · · , h ˆT ]T . h 1
2
P
(4.9) (4.10)
Superscript T denotes transposition of a vector or a matrix. The length-N ˆ p,k , k = 0, . . . , K − 1, represent sub-filters of the partitioned tapvectors h ˆ p of channel p. weight vector h We now define the block error signal of length N . Based on (4.1) and (4.4) we write e(m) = y(m) − y ˆ(m),
(4.11)
100
H. Buchner et al.
with m being the block time index, and y ˆ(m) =
P K−1
ˆp,k UTp,k (m)h
=
p=1 k=0
P
ˆp = UT (m)h, ˆ UTp (m)h
(4.12)
p=1
where e(m) = [e(mN ), · · · , e(mN + N − 1)]T ,
(4.13)
y(m) = [y(mN ), · · · , y(mN + N − 1)]T ,
(4.14)
y ˆ(m) = [ˆ y (mN ), · · · , yˆ(mN + N − 1)]T ,
(4.15)
Up,k (m) = [xp,k (mN ), · · · , xp,k (mN + N − 1)],
(4.16)
UTp (m) = [UTp,0 (m), · · · , UTp,K−1 (m)],
(4.17)
UT (m) = [UT1 (m), · · · , UTP (m)].
(4.18)
It can easily be verified that Up,k , p = 1, . . . , P , k = 0, . . . , K − 1 are Toeplitz matrices of size (N × N ): ⎡ ⎤ xp (mN − N k) · · · · · · xp (mN − N k − N + 1) ⎢ ⎥ .. ⎢ xp (mN − N k + 1) . . . ⎥ . T ⎢ ⎥. Up,k (m) = ⎢ ⎥ .. . . . . . . ⎣ ⎦ . . . . xp (mN − N k + N − 1) · · · · · · xp (mN − N k) These Toeplitz matrices are now diagonalized in two steps: Step 1: Transformation of Toeplitz matrices into circulant matrices. Any Toeplitz matrix Up,k can be transformed, by doubling its size, to a circulant matrix T Up,k (m) UTp,k (m) , (4.19) Cp,k (m) = UTp,k (m) UT p,k (m) where the Up,k are also Toeplitz matrices and can be expressed in terms of the elements of UTp,k (m), except for an arbitrary diagonal, e.g., ⎡ ⎤ xp (mN − N k − N ) · · · · · · xp (mN − N k + 1) ⎢ ⎥ .. ⎢ xp (mN − N k − N + 1) . . . ⎥ . T ⎢ ⎥. Up,k (m) = ⎢ ⎥ .. .. .. .. ⎣ ⎦ . . . . xp (mN − N k − 1) · · · · · · xp (mN − N k − N ) It follows 01 10 UTp,k (m) = WN ×2N Cp,k (m)W2N ×N ,
(4.20)
4
Multichannel Acoustic Echo Cancellation
101
where we introduced the windowing matrices 01 WN ×2N = [0N ×N , IN ×N ], 10 T W2N ×N = [IN ×N , 0N ×N ] .
Step 2: Transformation of the circulant matrices into diagonal matrices. Using the 2N × 2N DFT matrix F2N ×2N with elements e−j2πνn/(2N ) , where ν, n = 0, . . . , 2N − 1, the circulant matrices are diagonalized as follows: Cp,k (m) = F−1 2N ×2N Xp,k (m)F2N ×2N ,
(4.21)
where the diagonal matrices Xp,k (m) can be expressed by the elements of the first columns of Cp,k (m), Xp,k (m) = diag{F2N ×2N [xp (mN − N k − N ), · · · , xp (mN − N k + N − 1)]T }.(4.22) Now, (4.20) can be rewritten equivalently as −1 01 10 UTp,k (m) = WN ×2N F2N ×2N Xp,k (m)F2N ×2N W2N ×N .
(4.23)
Since [AX1 B, · · · , AXP B] = A[X1 , · · · , XP ]diag{B, · · · , B} for any matrices A, B, Xp with compatible dimensions, it follows for the error vector using (4.18) and (4.23): −1 01 e(m) = y(m) − WN ×2N F2N ×2N [X1 (m), · · · , XP (m)] ˆ · diag{F2N ×2N W10 , · · · , F2N ×2N W10 }h, 2N ×N
2N ×N
(4.24)
where Xp (m) = [Xp,0 (m), Xp,1 (m), · · · , Xp,K−1 (m)].
(4.25)
If we multiply (4.24) by the N × N DFT matrix FN ×N , we obtain the error signal in the frequency domain: 10 ˆ e(m) = y(m) − G01 (4.26) N ×2N X(m)G2LP ×LP h, where e(m) = FN ×N e(m),
(4.27)
y(m) = FN ×N y(m),
(4.28)
G01 N ×2N G10 2LP ×LP G10 2N ×N
= =
−1 01 FN ×N WN ×2N F2N ×2N , 10 diag{G10 2N ×N , · · · , G2N ×N }, −1 10 F2N ×2N W2N ×N FN ×N ,
= X(m) = [X1 (m), X2 (m), · · · , XP (m)], ˆ ˆ h p,k = FN ×N hp,k , ˆ = h p ˆ= h
T ˆT , h ˆT , · · · , h ˆT [h p,0 p,1 p,K−1 ] , ˆT , h ˆT , · · · , h ˆ T ]T . [h 1 2 P
(4.29) (4.30) (4.31) (4.32) (4.33) (4.34) (4.35)
102
H. Buchner et al.
Optimization Criterion: Having derived a frequency-domain error signal, we now define a frequencyˆ = h(m): ˆ domain criterion for optimizing the coefficient vector h Jf (m) = (1 − λ)
m
λm−i eH (i)e(i),
(4.36)
i=0
where H denotes conjugate transposition and λ (0 < λ < 1) is an exponential forgetting factor. The criterion (4.36) is very similar1 to the one leading to the well-known RLS algorithm [5]. The main advantage of using (4.36) is to take advantage of the FFT in order to have low-complexity adaptive filters. 4.2.2
Normal Equation
ˆ Applying the operator Let ∇hˆ be the gradient operator with respect to h. ∇hˆ to the cost function Jf (4.36), we obtain [1], [21] the complex gradient vector: ∇hˆ Jf (m) =
∂Jf (m) ˆ ∗ (m) ∂h
= −(1 − λ)
(4.37) m
H H 01 H λm−i (G10 2LP ×LP ) X (i)(GN ×2N ) y(i)
i=0
+ (1 − λ)
m
H H λm−i (G10 2LP ×LP ) X (i)
i=0
· where
∗
10 G01 2N ×2N X(i)G2LP ×LP
ˆ h(m),
denotes complex conjugation,
01 H 01 G01 2N ×2N = (GN ×2N ) GN ×2N −1 01 = F2N ×2N W2N ×2N F2N ×2N ,
and
0N ×N 0N ×N . = 0N ×N IN ×N
(4.38)
01 W2N ×2N
(4.39)
By setting the gradient of the cost function equal to zero and defining H y2N (m) = (G01 N ×2N ) y(m) 0N ×1 , = F2N ×2N y(m) 1
(4.40)
Note that the time-frequency equivalence is assured by Parseval’s theorem.
4
Multichannel Acoustic Echo Cancellation
103
we obtain the so-called normal equation: ˆ = s(m), S(m)h(m)
(4.41)
where S(m) = (1 − λ)
m
H H 01 10 λm−i (G10 2LP ×LP ) X (i)G2N ×2N X(i)G2LP ×LP
i=0 H H = λS(m − 1) + (1 − λ)(G10 2LP ×LP ) X (m) 01 10 · G2N ×2N X(m)G2LP ×LP
(4.42)
and s(m) = (1 − λ)
m
H H λm−i (G10 2LP ×LP ) X (i)y2N (i)
i=0 H H = λs(m − 1) + (1 − λ)(G10 2LP ×LP ) X (m)y2N (m) H H 01 H = λs(m − 1) + (1 − λ)(G10 2LP ×LP ) X (m)(GN ×2N ) y(m). (4.43)
If the input signal is well-conditioned, matrix S(m) is nonsingular. In this case, the normal equation has a unique solution which is the optimum Wiener solution. 4.2.3
Adaptation Algorithm
The different formulations for filter adaptation discussed below, i.e., recursive ˆ updates of h(m), are all derived directly from the normal equation (4.41) and associated equations (4.42) and (4.43). Here, we replace s(m) and s(m − 1) in the recursive equation (4.43) by formulating (4.41) in terms of block time indices m and m − 1, respectively. We then eliminate S(m − 1) from the resulting equation using (4.42). Reintroducing the error signal vector (4.26), we obtain an exact recursive solution of (4.41) by the following adaptation algorithm: 10 ˆ e(m) = y(m) − G01 N ×2N X(m)G2LP ×LP h(m − 1) H ˆ ˆ = h(m − 1) + (1 − λ)S−1 (m)(G10 h(m) 2LP ×LP ) H · XH (m)(G01 N ×2N ) e(m).
(4.44) (4.45)
For practical purposes, it is useful to reformulate this algorithm. First, we H multiply (4.44) by (G01 N ×2N ) , 10 ˆ e2N (m) = y2N (m) − G01 2N ×2N X(m)G2LP ×LP h(m − 1)
(4.46)
H ˆ ˆ = h(m − 1) + (1 − λ)S−1 (m)(G10 h(m) 2LP ×LP )
· XH (m)e2N (m),
(4.47)
104
H. Buchner et al.
where we defined analogously to (4.40) H e2N (m) = (G01 N ×2N ) e(m) 0N ×1 . = F2N ×2N e(m)
(4.48)
If we multiply (4.47) by G10 2LP ×LP , we obtain the algorithm (4.46) and (4.47) in a slightly different form: ˆ e2N (m) = y2N (m) − G01 2N ×2N X(m)h2LP (m − 1)
(4.49)
10 ˆ ˆ h 2LP (m) = h2LP (m − 1) + (1 − λ)G2LP ×LP H H · S−1 (m)(G10 2LP ×LP ) X (m)e2N (m),
(4.50)
where S(m) is given by (4.42), and 10 ˆ ˆ h 2LP (m) = G2LP ×LP h(m).
(4.51)
The rank of the matrix G10 2LP ×LP is equal to LP . Since we have to adapt LP unknowns, in principle, (4.50) is equivalent to (4.47). Indeed, H if we multiply (4.50) by (G10 2LP ×LP ) , we obtain exactly (4.47) since 10 H 10 (G2LP ×LP ) G2LP ×LP = ILP ×LP . It is interesting to see how naturally we have ended up using blocks of length 2N (especially for the error signal) even though we have used an error criterion with blocks of length N . We can do even better than that and rewrite the algorithm exclusively using FFTs of size 2N . This formulation is by far the most interesting one because an explicit link with existing frequency-domain algorithms can be established. Let us first define the (2LP × 2LP ) matrix Sd (m) = (1 − λ)
m
λm−i XH (i)G01 2N ×2N X(i)
i=0
= λSd (m − 1) + (1 − λ)XH (m)G01 2N ×2N X(m). (4.52) The relation of Sd (m) to S(m) is obviously given by: H 10 S(m) = (G10 2LP ×LP ) Sd (m)G2LP ×LP .
(4.53)
Next, we define 10 10 H G10 2N ×2N = G2N ×N (G2N ×N ) −1 10 = F2N ×2N W2N ×2N F2N ×2N
and 10 10 G10 2LP ×2LP = diag{G2N ×2N · · · G2N ×2N },
(4.54)
4
where
10 W2N ×2N =
Multichannel Acoustic Echo Cancellation
IN ×N 0N ×N . 0N ×N 0N ×N
105
(4.55)
Now, we have a relation between the inverse of the two matrices S [as it appears in (4.50)] and Sd : −1 10 −1 H (m)(G10 G10 2LP ×2LP Sd (m) = G2LP ×LP S 2LP ×LP ) .
(4.56)
This can be verified by post-multiplying both sides of (4.56) by 10 10 10 Sd (m)G10 2LP ×LP and noting that G2LP ×2LP G2LP ×LP = G2LP ×LP . Using (4.56), the adaptive algorithm consisting of (4.42), (4.49), and (4.50) can now be formulated more conveniently: Sd (m) = λSd (m − 1) + (1 − λ)XH (m)G01 2N ×2N X(m) 01 ˆ e (m) = y (m) − G X(m)h (m − 1) 2N
2N ×2N
2N
2LP
(4.57) (4.58)
−1 10 ˆ ˆ h 2LP (m) = h2LP (m − 1) + (1 − λ)G2LP ×2LP Sd (m)
· XH (m)e2N (m).
(4.59)
Due to the structure of the update equations, we introduce a frequencydomain Kalman gain matrix in analogy to the RLS algorithm [1]: H K(m) = (1 − λ)S−1 d (m)X (m).
(4.60)
This 2LP × 2L matrix includes the inverse in (4.59) and plays a very important role in practical realizations, including a tight coupling between the multiple input channels by coherence terms, as shown in detail in subsequent sections. Figure 4.2 summarizes the general steps in multichannel frequencydomain adaptive filtering. For clarity of the figure, the case N = L is depicted. The two shaded blocks represent the calculation of the Kalman gain using (4.57) and (4.60), or efficient realizations thereof.
4.3
Convergence Analysis
In this section, we analyze the convergence behaviour of the algorithm for stationary signals xp (n) and y(n) based on (4.44) and (4.45). Due to the assumed stationarity of the filter input signals, we obtain, after taking the expected value of (4.42): E{S(m)} = (1 − λ)
m
λm−i Se ,
(4.61)
H H 01 10 Se = E (G10 2LP ×LP ) X (m)G2N ×2N X(m)G2LP ×LP
(4.62)
i=0
where
106
H. Buchner et al. x1(n)
xP (n)
old new concatenate two blocks xx
concatenate two blocks
FFT
FFT
X 1(m)
PSD estim. + regulariz. S (m)
X P (m)
FD-Kalman gain comp.
K 1(m)
KP (m) 10 2L x 2L
μ
G μ
10 2L x 2L
G
z -1
+
+
^ (m) h1
z -1
+
e (m)
^ (m) y1
FFT
0 e
^ (m) hP
insert zero block
- y^
e(n)
IFFT
^ (m) yP
save last block
−
y(n)
+
Fig. 4.2. Principle of multichannel frequency-domain adaptive filtering (N = L).
denotes the time-independent ensemble average. Noting that in (4.61) we have a sum of a finite geometric series, it can be simplified to E{S(m)} = (1 − λm+1 )Se .
(4.63)
For a single realization of the stochastic process S(m), we assume that S(m) ≈ (1 − λm+1 )Se ,
(4.64)
and for the steady state we see with 0 < λ < 1 that S(m) ≈ Se for large m. 4.3.1
(4.65)
Analysis Model
For the following, we assume that the desired response y(n) and the tap-input vector x(n) are related by the multiple linear regression model [1] y(n) = xT (n)h + nO (n),
(4.66)
where the LP × 1 vector h denotes the fixed regression parameter vector of the model and the measurement error nO (n) is assumed to be a zero-mean white noise that is independent of xp (n), ∀ p ∈ {1, . . . , P }. The equivalent expression in the frequency domain reads 10 y(m) = G01 N ×2N X(m)G2LP ×LP h + n(m),
(4.67)
4
Multichannel Acoustic Echo Cancellation
107
ˆ in (4.35) and y(m) in where h and n(m) are defined in the same way as h (4.28), respectively. 4.3.2
Convergence in Mean
By noting that H 01 01 (G01 N ×2N ) GN ×2N = G2N ×2N
(4.68)
from (4.38), the coefficient update (4.45) can be written in terms of the misalignment vector (m) as ˆ (m) = h − h(m) ˆ = h − h(m − 1) H H 01 − (1 − λ)S−1 (m)(G10 2LP ×LP ) X (m)G2N ×2N X(m) ˆ · G10 [h − h(m − 1)] 2LP ×LP
H H − (1 − λ)S−1 (m)(G10 2LP ×LP ) X (m)n(m).
(4.69)
Taking the mathematical expectation of expression (4.69), using the independence theory [1], and (4.62) together with (4.65), we deduce for large m that E{(m)} = λE{(m − 1)} = λm E{(0)}.
(4.70)
Equation (4.70) expresses that the convergence rate of the algorithm is governed by λ. Most importantly, the rate of convergence is completely independent of the input statistics (even in the multichannel case). Finally, we have ˆ = h. lim E{(m)} = 0LP ×1 ⇒ lim E{h(m)}
m→∞
m→∞
(4.71)
Now, suppose that λt is the forgetting factor of a sample-by-sample adaptive algorithm (operating in the time domain). To have the same effective window length for the sample-by-sample and block-by-block algorithms, we should choose λ = λN t . For example, a typical choice for the RLS algorithm [1] is λt = 1 − 1/(3L). In this case, a good choice for the frequency-domain algorithm is λ = [1 − 1/(3L)]N . 4.3.3
Convergence in Mean Square
The convergence of the algorithm in the mean is not sufficient for convergence to the minimum mean-squared error (MMSE) estimate [1] as it only assures ˆ The algorithm converges in the mean square if a bias-free estimate h(m). < ∞, lim Jf (m) = Jf,min
m→∞
(4.72)
108
H. Buchner et al.
where Jf (m) =
1 H E e (m)e(m) . N
(4.73)
From (4.44), the error signal e(m) can be written in terms of (m) as 10 e(m) = G01 N ×2N X(m)G2LP ×LP (m − 1) + n(m).
(4.74)
Expression (4.73) becomes Jf (m) =
1 Jex (m) + σn2 , N
(4.75)
where the excess mean-square error is given by H H 01 Jex (m) = E H (m − 1)(G10 2LP ×LP ) X (m)G2N ×2N · X(m)G10 2LP ×LP (m − 1)
(4.76)
and σn2 is the variance of the noise signal nO (n). Furthermore, (4.76) can be written as H H Jex (m) = E tr H (m − 1)(G10 2LP ×LP ) X (m) 10 · G01 2N ×2N X(m)G2LP ×LP (m − 1) H H 01 = E tr (G10 2LP ×LP ) X (m)G2N ×2N X(m) H · G10 2LP ×LP (m − 1) (m − 1) H H 01 = tr E (G10 2LP ×LP ) X (m)G2N ×2N X(m) H · G10 . 2LP ×LP (m − 1) (m − 1) Invoking the independence assumption and using (4.62), we may reduce this expectation to Jex (m) = tr[Se M(m − 1)],
(4.77)
where M(m) = E (m)H (m)
(4.78)
is the misalignment correlation matrix. We derive an expression for the misalignment vector (m) using the normal equation (4.41), and (4.43): ˆ (m) = h − h(m) = h − S−1 (m)s(m) = h − (1 − λ)S−1 (m) ·X
H
m
H λm−i (G10 2LP ×LP )
i=0 01 H (i)(GN ×2N ) y(i).
(4.79)
4
Multichannel Acoustic Echo Cancellation
109
Using y(m) from the model (4.67), we obtain with (4.68) and (4.42): (m) = −(1 − λ)S−1 (m)
m
H H λm−i (G10 2LP ×LP ) X (i)
i=0 H · (G01 2N ×2N ) n(i).
(4.80)
If we plug this equation into (4.78), we obtain, after taking the expectations, and noting that for a given input sequence, the only random variable is the white measurement noise n(m): m 2 2 −1 H M(m) = σn (1 − λ) S (m) λ2(m−i) (G10 2LP ×LP ) i=0
·X
H
10 (i)G01 2N ×2N X(i)G2LP ×LP
S−1 (m),
(4.81)
where E{n(m)nH (m)} = σn2 I was used. Analogously to (4.64), we find for the term in brackets in (4.81): m
H H 01 10 λ2(m−i) (G10 2LP ×LP ) X (i)G2N ×2N X(i)G2LP ×LP
i=0
≈ (1 − λ2(m+1) )Se .
(4.82)
Assuming strict equality in (4.82), using (4.64), and 1 − λ2(m+1) = (1 − λm+1 )(1 + λm+1 ), this leads to M(m) = σn2 (1 − λ)2
1 + λm+1 −1 S . 1 − λm+1 e
Finally, we obtain for (4.75) with (4.77) 1 + λm LP Jf (m) = (1 − λ)2 + 1 σn2 . N 1 − λm
(4.83)
(4.84)
This equation describes the convergence curve of the mean-squared error. One can see that in the steady state, i.e., for large m, the mean-squared error converges to a constant value as desired in (4.72): LP 2 Jf (m → ∞) = Jf,min = (1 − λ) + 1 σn2 . (4.85) N Moreover, we see from (4.84) that the convergence behaviour of the meansquared error is independent of the eigenvalues of the ensemble-averaged matrix Se . The scalar Jmis (m) = E H (m)(m) (4.86)
110
H. Buchner et al.
Y(m) X(m)
P
-
^
E(m) +
H LPxQ(m)
Q
Fig. 4.3. Adaptive MIMO filtering in the frequency domain.
describes the convergence of the misalignment, i.e. the coefficient convergence. Using (4.83), we deduce that Jmis (m) = tr[M(m)] = σn2 (1 − λ)2 = σn2 (1 − λ)2
1 + λm+1 tr[S−1 e ] 1 − λm+1 LP −1 1 + λm+1 1 1 − λm+1
i=0
λs,i
,
(4.87)
where the λs,i denote the eigenvalues of the ensemble-averaged matrix Se . It is important to notice the difference between the convergence of the meansquared error and the misalignment. While the mean-squared error does not depend on the eigenvalues of Se (i.e., it is also independent of the channel coherence), the misalignment is magnified by the inverse of the smallest eigenvalue λs,min of Se [and thus of S(m)]. The situation is worsened when the variance of the noise σn2 is large. So in practice, at some frequencies, where the signal is poorly excited, we may have a very large misalignment. In order to avoid this problem and to keep the misalignment low, the adaptive algorithm should be regularized by adding small values to the diagonal of S(m). In Sect. 4.6, this important topic is discussed in more detail.
4.4
Generalized Frequency-Domain Adaptive MIMO Filtering
In this section, we consider the extension of the algorithm proposed in Sect. 4.2 to the general MIMO case, i.e., we have P input signals xp (n), p = 1, . . . , P , and Q desired signals yq (n), output signals yˆq (n), and error signals eq (n), q = 1, . . . , Q, respectively (Fig. 4.3). In the sequel, the following questions are discussed: What is the optimum solution? Can correlation between the error signals eq (n) be exploited and how do the resulting update equations look like? Let us define signal block vectors yq (m), eq (m), yq (m), eq (m) for each output channel in the same way as shown in (4.14), (4.13), (4.28), and (4.27), respectively. These quantities can be combined in the (N × Q) matrices E(m) = [e1 (m), · · · , eQ (m)],
4
Multichannel Acoustic Echo Cancellation
111
Y(m) = [y1 (m), · · · , yQ (m)], E(m) = [e1 (m), · · · , eQ (m)], Y(m) = [y1 (m), · · · , yQ (m)]. We consider three conceivable generalizations of the recursive least-squares error criterion proposed in (4.36): Error criterion 1: Separate optimization The most obvious approach to the problem is to treat each of the Q desired signal channels separately by the algorithm proposed above: Jf1,q (m) = (1 − λ)
m
λm−i eH q (i)eq (i)
(4.88)
i=0
for q = 1, . . . , Q. This criterion has been traditionally used in all approaches for multichannel echo cancellation which is a system identification problem. Error criterion 2: Joint optimization A more general approach foresees to jointly optimize the MIMO filter by the following criterion: Jf2 (m) =
Q
Jf1,q (m)
q=1
= (1 − λ)
m
m−i
λ
= (1 − λ)
eH q (i)eq (i)
q=1
i=0 m
Q
λm−i tr[EH (i)E(i)]
i=0
= (1 − λ)
m
λm−i diag{EH (i)E(i)}1 ,
(4.89)
i=0
where the matrix norm · 1 sums up the absolute values of all matrix elements. Introducing the (LP × Q) coefficient matrix in the frequency domain ˆ based on the subfilter coefficient vectors h p,k,q (p, k, q denote input channel, partition, and output channel, respectively), ⎡ ˆ ⎤ ˆ h1,0,1 · · · h 1,0,Q ˆ ˆ ⎢ h ⎥ ··· h 1,1,1 1,1,Q ⎥ ⎢ ˆ (4.90) H ⎥, . . LP ×Q = ⎢ . .. .. .. ⎣ ⎦ ˆ ˆ h ··· h P,K−1,1
P,K−1,Q
and using the same approach as in Sect. 4.2, we obtain the following normal equation: ˆ S(m)H LP ×Q = sLP ×Q (m).
(4.91)
112
H. Buchner et al.
Fortunately, this matrix equation can be easily decomposed into Q equations (4.41). Therefore, criteria 1 and 2 are strictly equivalent for the behaviour of the adaptation. We note, however, that the compact formulation (4.91) of the normal equation can be used, e.g., to obtain a generalized control of the adaptation for the echo cancellation application [22]. Error criterion 3: Joint Optimization, accounting for cross-correlations between error signals The last formulation of Criterion 2 (4.89) reveals an interesting possibility to take the cross-correlations between the error signals into account by optimizing Jf3 (m) = (1 − λ)
m
λm−i EH (i)E(i)1 .
(4.92)
i=0
Let us consider the optimization of the additional off-diagonal elements H eH q (i)er (i) (q = r) of E (i)E(i). According to [1], [21], we obtain ∂ eH q (i)er (i) = 0, ˆ ∂ hq (i)
(4.93)
and from ∂ eH q (i)er (i), ˆ ∂ hr (i)
(4.94)
ˆ . we obtain the well-known normal equations (4.41) for h q Therefore, for all criteria, the generalized frequency-domain adaptive MIMO filter can be summarized as Sd (m) = λSd (m − 1) + (1 − λ)XH (m)G01 2N ×2N X(m) K(m) =
H (1 − λ)S−1 d (m)X (m) ˆ Y2N ×Q (m) − G01 2N ×2N X(m)H2LP ×Q (m
E2N ×Q (m) = − 1) 10 ˆ ˆ H 2LP ×Q (m) = H2LP ×Q (m − 1) + G2LP ×2LP K(m)E2N ×Q (m)
(4.95) (4.96) (4.97) (4.98)
in analogy to equations (4.57) to (4.60).
4.5
Approximation and Special Cases
We start this section by giving a very useful approximation of the algorithm proposed in the preceding section. This allows us both, to show explicitly the links to the classical single-channel algorithms, and also to derive new and very efficient multichannel algorithms. The list of special cases of the framework is not exhaustive and several other algorithms may also be derived.
4
4.5.1
Multichannel Acoustic Echo Cancellation
113
Approximation of the Frequency-Domain Kalman Gain
Frequency-domain adaptive filters were first introduced to reduce the arithmetic complexity of the (single-channel) LMS algorithm [7]. Unfortunately, the matrix Sd is generally not diagonal, so its inversion in (4.96) has a high complexity and the algorithm may not be very useful in practice. Since Sd is composed of (K · P )2 sub-matrices Si,j,k = λSi,j,k (m − 1) + (1 − λ)X∗i,k (m)G01 2N ×2N Xj,k (m),
(4.99)
it is desirable that each of those sub-matrices be a diagonal matrix. In the next paragraph, we will argue that G01 2N ×2N can be well approximated by the identity matrix with weight 1/2; accordingly, after introducing the positive factor μ ≤ 2 in (4.98) and the matrix S (m) approximating 2Sd (m), we then obtain the following approximate algorithm: S (m) = λS (m − 1) + (1 − λ)XH (m)X(m) K(m) = (1 − λ)S
−1
H
(m)X (m)
(4.100) (4.101)
ˆ G01 2N ×2N X(m)H2LP ×Q (m
− 1) (4.102) E2N ×Q (m) = Y2N ×Q (m) − 10 ˆ ˆ H 2LP ×Q (m) = H2LP ×Q (m − 1) + μG2LP ×2LP K(m)E2N ×Q (m), (4.103) where each sub-matrix of S and K is now a diagonal matrix and μ ≤ 2 is a positive number. Note that the imprecision introduced by the approximation in (4.100) and thus in the Kalman gain (4.101) will only affect the convergence rate. Obviously, we cannot permit the same kind of approximation in (4.102), because that would result in approximating a linear convolution by a circular one, which of course can have a disastrous impact in the adaptive filter behaviour. To justify the above approximation, let us examine the structure of the matrix G01 2N ×2N . We have −1 ∗ 01 (G01 2N ×2N ) = F2N ×2N W2N ×2N F2N ×2N .
(4.104)
01 01 ∗ Since W2N ×2N is a diagonal matrix, (G2N ×2N ) is a circulant matrix. There01 fore, inverse transformation of the diagonal of W2N ×2N gives the first column 01 ∗ of (G2N ×2N ) , ∗ T g∗ = [g0∗ , g1∗ , · · · , g2N −1 ] T = F−1 2N ×2N [0, · · · , 0, 1, · · · , 1] .
The elements of vector g can be written explicitly as: gk =
2N −1 1 exp(−j2πkl/2N ) 2N l=N
=
N −1 (−1)k exp(−jπkl/N ), 2N l=0
(4.105)
114
H. Buchner et al.
where j 2 = −1. Since gk is the sum of a finite geometric series, we have: 0.5 k=0 gk = (−1)k 1−exp(−jπk) 2N 1−exp(−jπk/N ) k = 0 ⎧ k=0 ⎨ 0.5 = 0 (4.106) πk k even ⎩ 1 − 2N 1 − j cot 2N k odd, where N − 1 elements of vector g are equal to zero. Moreover, since H 01 01 H (G01 2N ×2N ) G2N ×2N = G2N ×2N , then g g = g0 = 0.5 and we have gH g − g02 =
2N −1
|gl |2 = 2
l=1
N −1 l=1
|gl |2 =
1 . 4
(4.107)
We can see from (4.107) that the first element of vector g, i.e., g0 , is dominant in a mean-square sense, and from (4.106) that the absolute values of the N first elements of g decrease rapidly to zero as k increases. Because of the conjugate symmetry, i.e. |gk | = |g2N −k | for k = 1, . . . , N − 1, the last few elements of g are not negligible, but this affects only the first and last columns of G01 2N ×2N since this matrix is circulant with g as its first column. All other columns have those non-negligible elements wrapped around in such a way that they are concentrated around the main diagonal. To summarize, we can say that for N large, only the very first (few) off-diagonals of G01 2N ×2N will be non-negligible while the others can be completely neglected. We also neglect the influence of the two isolated peaks |g2N −1 | = |g1 | < g0 on the lower left corner and the upper right corner, respectively. Thus, approximating G01 2N ×2N by a diagonal matrix, i.e., G01 2N ×2N ≈ g0 I = I/2, is reasonable, and in this case we will have μ ≈ 1/g0 = 2 for an optimum convergence rate. For the rest of this chapter, we suppose that 0 < μ ≤ 2. 4.5.2
Special Cases
In the single-channel case P = Q = 1, S and K are diagonal matrices for N = L and the classical constrained FLMS [7] follows immediately from (4.100)-(4.103). This algorithm requires the computation of 5 FFTs of length 2L per block. By approximating G10 2LP ×2LP in (4.103) to the identity matrix, we obtain the unconstrained FLMS (UFLMS) algorithm [8] which requires only 3 FFTs per block. Many simulations show that the two algorithms have virtually the same performance. For N < L, Sd (m) in (4.95) consists of (K · P )2 sub-matrices that can be approximated as shown above. It is interesting that for N = 1, the algorithm is strictly equivalent to the RLS algorithm in the time domain. After the approximation, we obtain a new algorithm that we call extended multi-delay filter (EMDF) for 1 < N < L that takes the auto-correlations between the
4
Multichannel Acoustic Echo Cancellation
115
blocks into account. Finally, the classical multi-delay filter is obtained by further approximating S (m) in (4.100) by dropping the off-diagonal components in S (m): S (m) = diag{S1,1,0 (m), · · · , S1,1,K−1 (m)},
(4.108)
where S1,1,k (m) = λS1,1,k (m − 1) + (1 − λ)X∗1,k (m)X1,k (m) are (2N × 2N ) diagonal matrices. In the MIMO case, (4.101) is the solution of a P × P system of linear equations of block matrices (which consist of K 2 diagonal block matrices each): K(m) = [KT1 (m), · · · , KTP (m)]T .
(4.109)
This allows us to formally write the update equation (4.103) as P Q tightly coupled ‘single-channel’ update equations ˆ (m) = h ˆ (m − 1) + μG10 h 2N ×2N Kp eq (m) p,q p,q
(4.110)
(p = 1, . . . , P , q = 1, . . . , Q) with the sub-matrices Kp (m) taking the crosscorrelations between the input channels into account. These update equations (4.110) can then be calculated element-wise and the (cross) power spectra are estimated recursively: Si,j (m) = λSi,j (m − 1) + (1 − λ)X∗i (m)Xj (m),
(4.111)
where Sj,i (·) = S∗i,j (·). It is important to note that the calculation of the Kalman gain [Eqs. (4.95) and (4.96)], which is the computationally most demanding part, is completely independent of the number Q of output channels and thus, has to be calculated only once, while the remaining update equations (4.110) formally correspond to single-channel algorithms (e.g., (U)FLMS for N = L). In the case of two input channels P = 2, the Kalman gain can be written in an explicit form by block-inversion: −1 ∗ ∗ K1 (m) = D(m)S−1 1,1 (m)[X1 (m) − S1,2 (m)S2,2 (m)X2 (m)],
K2 (m) =
∗ D(m)S−1 2,2 (m)[X2 (m)
−
∗ S2,1 (m)S−1 1,1 (m)X1 (m)],
(4.112) (4.113)
with the abbreviation D(m) = (1 − λ)[I2L×2L − S∗1,2 (m)S1,2 (m){S1,1 (m)S2,2 (m)}−1 ]−1 . The solutions of (4.101) for more than two input channels may be formulated similarly to the corresponding part of the stereo update equations (4.112) and (4.113) (e.g. using Cramer’s rule). These representations allow an intuitive interpretation as a correction of the interchannel-correlations in Ki between X∗i and the other input signals X∗j , j = i.
116
H. Buchner et al.
For three channels, we have (omitting, for simplicity, the block time index m of all matrices) K1 = (1 − λ)D−1 [X∗1 (S2,2 S3,3 − S3,2 S2,3 ) − X∗2 (S1,2 S3,3 − S1,3 S3,1 ) − X∗3 (S1,3 S2,2 − S1,2 S2,3 )], D := S1,1 (S2,2 S3,3 − S3,2 S2,3 ) − S2,1 (S1,2 S3,3 − S1,3 S3,1 ) − S3,1 (S1,3 S2,2 − S1,2 S2,3 ) as the first of the three Kalman gain components with the common factor D. Unfortunately, for a higher number of channels (and/or a higher number of sub-filters in case of the extended multidelay filter), the number of update terms increases rapidly, and the equations become too complicated for practial use. Therefore, a more efficient scheme for these cases will be proposed in Sect. 4.7.
4.6
A Dynamical Regularization Strategy
In most practical scenarios, the desired signal y(n) is disturbed, e.g., by some acoustic background noise. As shown above [c.f. (4.87)], the parameter estimation (i.e., misalignment) is very sensitive in poorly excited frequency bins. For robust adaptation the power spectral densities Si,i are replaced ˜ i,i = Si,i + diag{δi } prior to inversion by regularized versions according to S in (4.96). The basic feature of the regularization is a compromise between fidelity to data and fidelity to some prior information about the solution [23]. The latter increases the robustness, but leads to biased solutions. Therefore, we propose here a bin-selective dynamical regularization vector (0)
(2N −1)
δi (m) = δmax · [e−Si,i (m)/S0 , · · · , e−Si,i
(m)/S0 T
(ν)
]
(4.114)
with two scalar parameters δmax and S0 . Si,i denotes the ν-th frequency component (ν = 0, · · · , 2N − 1) on the main diagonal of Si,i . Note that for efficient implementation, e in (4.114) may be replaced by a basis 2 and modified S0 . δmax should be chosen according to the (estimated) disturbing noise level in the desired signal y(n). As shown in Fig. 4.4, this exponential method provides a smooth transition between regularization for low input power and data fidelity whenever the input power is large enough, and yields improved results compared to fixed regularization and to the popular approach of choosing the maximum (ν) out of the respective component Si,i and a fixed threshold δth . Results of numerical simulations can be found in Sect. 4.8. The method also copes well with unbalanced excitation of the input channels, and most importantly, it can be easily extended for the efficient Kalman gain calculation introduced in the next section.
4
Multichannel Acoustic Echo Cancellation
~(ν) Si,i (m)
117
constant regularization regularization below threshold
δmax
exponential (proposed) regularization no regularization (ν)
δth
Si,i (m)
Fig. 4.4. Different regularization methods (channel i, bin ν).
4.7
Efficient Multichannel Realization
As will be demonstrated by simulation results and real-world applications in Sect. 4.8, the presented algorithm copes well with multichannel input. The cases of a larger number of filter input channels (P larger than 2 or 3) and/or a larger number of sub-filters (N < L) when using the EMDF algorithm (c.f. Sect. 4.5.2) call for further improvement of the computational efficiency. In this section, we propose efficient and stable recursive calculation schemes for the frequency-domain Kalman gain and the DFTs of the overlapping input data blocks for the case of a large number of filter input channels. Overlapping input data blocks result from an overlap factor α > 1, originally proposed in [13]. Incorporating this extension in the proposed algorithm is very simple. Essentially, only the way the input data matrices (4.22) are calculated, is modified to Xp,k (m) = diag{F2N ×2N [xp (m · · · , xp (m
N − N k − N ), · · · α
N − N k + N − 1)]T }. α
(4.115)
Simulations show that increased overlap factors α are particularly useful in the multichannel case. 4.7.1
Efficient Calculation of the Frequency-Domain Kalman Gain
For a practical implementation of a system with P > 2 channels, we propose computationally more efficient methods to calculate (4.101) as follows. Due to the block diagonal structure of (4.101), it can be simply decomposed w.r.t. the DFT components ν = 0, . . . , 2N − 1 into 2N equations K(ν) (m) = (1 − λ)(S(ν) (m))
−1
(X(ν) (m))H
(4.116)
with (usually small) KP × KP unitary and positive definite matrices S(ν) containing the ν-th components on the block diagonals of S−1 in (4.101). K(ν) and X(ν) are column and row vectors of length KP , respectively. Note that for real input signals xi we need to solve (4.116) only for N + 1 bins.
118
H. Buchner et al.
A well-known and numerically stable method for this type of problems is the Cholesky decomposition of S(ν) followed by solution via backsubstitution, see [24]. The resulting total complexity for one output value is then O(KP · log2 (2N )) + O((KP )3 ),
(4.117)
where for the (U)FLMS algorithm in the two-channel (stereo) case the second term O((KP )3 ) is much smaller than the share due to the first term. For a large number (≥ 3) of input channels (see, e.g., the applications in Sect. 4.8) we introduce a recursive solution of (4.116) that jointly estimates the inverse power spectra (S(ν) )−1 in (4.100) using the matrix-inversion lemma, e.g. [1]. This lemma relates a matrix A = B−1 + CD−1 CH
(4.118)
to its inverse according to A−1 = B − BC(D + CH BC)−1 CH B,
(4.119)
as long as A and B are positive definite. Comparing (4.100) to (4.118), we immediately obtain from (4.100) an update equation for the inverse matrices
(S(ν) (m))−1 = λ−1 (S(ν) (m − 1))−1 (S(ν) (m − 1))−1 (X(ν) (m))H X(ν) (m)(S(ν) (m − 1))−1 − λ(1 − λ)−1 + X(ν) (m)(S(ν) (m − 1))−1 (X(ν) (m))H using the bin-wise quantities introduced in (4.116) (making the denominator a scalar value). Introduction of the common vector T1 (m) = (S(ν) (m − 1))−1 (X(ν) (m))H (ν)
(4.120)
in the numerator and the denominator leads to (S(ν) (m))−1 = λ−1 (S(ν) (m − 1))−1 (ν)
−
(ν)
T1 (m)(T1 (m))H . (ν) λ2 (1 − λ)−1 + λX(ν) (m)T1 (m)
(4.121)
The Kalman gain (4.116) can then be efficiently calculated [using (4.121)] by (ν) H (ν) H (m)) (X (m)) 1 − λ (T (ν) 1 . (4.122) T1 (m) 1 − K(ν) (m) = (ν) λ λ(1 − λ)−1 + X(ν) (m)T (m) 1
Again, there are common terms (ν)
(ν)
T2 (m) = X(ν) (m)T1 (m)
(4.123)
in (4.122) and (4.121). Note that our approach should not be confused with the classical RLS approach [5] which also makes use of the matrix-inversion lemma. As we
Complexity per output value
4
Multichannel Acoustic Echo Cancellation
200
Kalman gain using Cholesky decomposition
150
Proposed Kalman gain computation
119
100 50 0
1
2
3 4 5 6 7 Number of input channels P
8
Fig. 4.5. Complexity (MUL/ADDs) of Kalman gain for K = 1.
apply the lemma independently to usually small KP × KP systems, where KP << N , (4.116), it is numerically much less critical than in the RLS algorithm. Note that for N = L, there is no analogon to a more efficient fast RLS [25] due to the different matrix structures (in this case, X(ν) (m) does not reflect a tapped delay line). The complexity of the different computation methods for the Kalman gains [in MUL/ADDs for one output value e(n)] are compared in Fig. 4.5 for the case N = L (i.e., K = 1). Finally, we note that further gains in computational complexity can be achieved this way when employing the extended multidelay filter for N < L. 4.7.2
Dynamical Regularization for Proposed Kalman Gain Approach
Due to the recursion (4.121), the regularization according to (4.114) is not immediately applicable. Therefore, an equivalent modification is applied directly to the data matrices X(ν) (m) by addition of mutually uncorrelated white noise sequences to each channel and frequency bin, respectively. Using the modified signal (row) vectors, denoted by ˜ (ν) (m) = X(ν) (m) + N(ν) (m), X
(4.124)
where N(ν) (m) are the (row) vectors of the white noise signals, we obtain the modified power spectral density matrices [c.f. Eq. (4.100)] ˜ (ν) (m) ≈ (1 − λ) S
m
λm−q X(ν)H (q)X(ν) (q)
q=0
+ (1 − λ)
m
(ν)
(ν)
λm−q diag{[|N1 (q)|2 , · · · , |NP ·K (q)|2 ]T }.
(4.125)
q=0
The diagonal elements of the second term can be interpreted as a bin-selective dynamical regularization vector δ (ν) (m) with elements (for channel and/or
120
H. Buchner et al.
partition i and bin ν) (ν)
δi (m) = (1 − λ)
m
(ν)
λm−q |Ni (q)|2
q=0
=
(ν) λδi (m
(ν)
− 1) + (1 − λ)|Ni (m)|2 .
(4.126) (ν) δi (m − 1)
(ν) δi (m)
to with Thus, in order to update the regularization from the appropriate speed (determined by λ), we need to add noise with power (ν)
(ν)
|Ni (m)|2 =
(ν)
δi (m) − λδi (m − 1) . 1−λ
(4.127)
On the other hand, according to (4.114), the regularization should be chosen according to * ) (ν) Si,i (m) (ν) δi (m) = δmax · exp − S0 * ) (ν) (ν) λSi,i (m − 1) + (1 − λ)|Xi (m)|2 = δmax · exp − . (4.128) S0 Now, unlike other dynamical regularization methods, the exponential regu(ν) larization allows simple elimination of the elements Si,i (m − 1) of the noninverted matrix [which therefore need not be computed at all due to the matrix-inversion lemma (4.119)], since ) *λ * ) (ν) (ν) Si,i (m − 1) (1 − λ)|Xi (m)|2 (ν) δi (m) = δmax exp − · exp − S0 S0 * ) (ν) (1 − λ)|Xi (m)|2 1−λ (ν) . (4.129) = δmax (δi (m − 1))λ · exp − S0 4.7.3
Efficient DFT Calculation of Overlapping Data Blocks
In this section we address the first term of the computational cost given in (4.117) which is mainly determined by the DFTs of the frequency-domain adaptive filtering scheme (Fig. 4.2). The 2N -point DFT calculation in (4.115) has to be carried out for each of the P loudspeaker signals and is therefore most costly. Moreover, as will be discussed in Sect. 4.8, an increased overlap factor α is often desirable in the multichannel case. Therefore, we aim at exploiting the overlap of the input data blocks by implementing (4.115) recursively as well. For simplicity, we assume a block length of N = L in this section. For the following derivation, L (n) xi (m) = xi m − L + n (4.130) α
4 0
Multichannel Acoustic Echo Cancellation
L/2-1
L-1
3L/2-1
121
2L-1
x(m-1) 0
L/2-1
L-1
3L/2-1
2L-1
x(m) previous values
new values
Fig. 4.6. Example: overlapping data blocks, α = 2.
denotes the n-th component (n = 0, . . . , 2L − 1) of the time domain vector (block index m) to be transformed in (4.115). Let us now consider the ν-th element on the diagonal of Xi (m) where w = e−j2π/2L : (ν)
Xi (m) =
2L−1
(n)
xi (m)wνn .
(4.131)
n=0
Separating the summation into one for previous and one for new input values (Fig. 4.6), followed by the introduction of the previous vector elements (n) xi (m − 1) leads to
2L−L/α−1 (ν)
Xi (m) =
(n)
xi (m)wνn +
n=0
=
2L−1
2L−1
(n)
xi (m)wνn
n=2L−L/α (n)
(ν)
xi (m − 1)wν(n−L/α) + ΔXi (m),
(4.132)
n=L/α
where 2L−1
(ν)
ΔXi (m) =
(n)
xi (m)wνn
(4.133)
n=2L−L/α
contains the new input values and will be the update term in our recursive (ν) scheme. Next, we introduce the previous DFT output values Xi (m − 1) by (n) subtracting the vector elements of xi (m − 1) of the previous data vector shifted out of the DFT length 2L: )2L−1 (n) (ν) −νL/α Xi (m) = w xi (m − 1)wνn n=0
L/α−1
−
⎞
xi (m − 1)wνn ⎠ + ΔXi (m) (n)
(ν)
n=0
= w−νL/α Xi (m − 1) − w−νL/α (ν)
·
2L−1 n=2L−L/α
(n−2L+L/α)
xi
(4.134) (ν)
(m)wν(n−2L+L/α) + ΔXi (m).
122
H. Buchner et al. x(0)
X(0) 0
w
8
x(1) 0
w -1
X(2) 0
2
w
w
8
8
x(3) 0
w -1
X(1) 0
1
w
w
8
8
x(5) -1
2
0
w
w
8
x(6) -1
3
w
-1
-1
X(3)
2
0
w
w
8
-1
X(5)
-1
8
8
x(7)
X(6)
-1
-1
8
x(4)
X(4)
-1
8
x(2)
8
X(7)
-1
Fig. 4.7. Illustration of decimation-in-frequency FFT with windowed input.
Using (4.130), we can show that (n−2L+L/α)
xi
(n)
(m) = xi (m − 2α + 1).
(4.135)
Finally, we obtain Xi (m) = w−νL/α Xi (m − 1) (ν)
(ν)
− w−ν2L ΔXi (m − 2α + 1) + ΔXi (m). (ν)
(ν)
(4.136) Again, this recursive update needs to be carried out only for the bins ν = (n) (ν) 0, . . . , L if xi (m) is real-valued. Only the update ΔXi (m) in this equation has to be calculated explicitly using the L/α new values of the input vector. With the truncation of the time-domain input vector for calculating (ν) ΔXi (m) in mind, we consider now the decimation-in-frequency FFT algorithm. Figure 4.7 shows a simple example for 2L = 8 and α = 2. 2L − L/α inputs (thin lines) always carry zero value. As can be seen from the figure, the first log2 (α) stages do not contain any summations while for the following stages any FFT algorithm (e.g. from highly optimized software libraries) can be employed. Generally, the elimination of operations on zeros in the FFT is referred to as pruning and was first described by Markel [26]. Since then, several pruning algorithms with increased efficiency have been proposed. A summary and further references of different approaches may be found, e.g., in [27]. In summary, using FFT pruning, the recursive DFT approach reduces the first term of the complexity in (4.117) for N = L to O(P · log2 (L/α)) for each output point.
4.8
Simulations and Real-World Applications
As mentioned in the introduction, there are many areas of applications for multichannel adaptive filtering. In the following, we demonstrate the performance of our approach in a few examples for hands-free speech communication.
4 Transmission Room
...
xP (n)
g P(n)
...
... ^ h P (n) A
Automatic Speech Recognizer
B
e(n)
123
Receiving Room
x1(n)
...
g1(n)
Multichannel Acoustic Echo Cancellation
^ h 1 (n)
h P (n) h 1 (n)
- y^P (n) - y^1(n) +
+
+
+
y(n)
Fig. 4.8. Multichannel acoustic echo cancellation.
4.8.1
Multichannel Acoustic Echo Cancellation
For applications such as home entertainment, virtual reality (e.g., games, training), or advanced teleconferencing, there is a growing interest in multimedia terminals with an increased number of audio channels for sound reproduction (e.g., stereo or 5.1 channel - surround systems). In such applications, multichannel acoustic echo cancellation is a key technology whenever hands-free and full-duplex communication is desired (Fig. 4.8). The fundamental problem is that the multiple channels may carry linearly related signals which in turn may make the normal equation to be solved by the adaptive algorithm singular. This implies that there is no unique solution to the equation but an infinite number of solutions and it can be shown that all but the true one depend on the impulse responses of the transmission room [15], [16]. It is shown in [17] that the only solution to the nonuniqueness problem is to reduce the correlation between the different signals. Three methods of preprocessing can be distinguished: inaudible nonlinear processing, e.g., [17], additive noise (preferably below the masking threshold of human hearing), e.g., [28], and time-varying filtering, e.g., [29]. For the following example, a signal from a common source (in the transmission room) was convolved by P different room impulse responses and nonlinearly, but inaudibly preprocessed according to [17] (P different nonlinearities with factor 0.5). In this subsection we consider only Q = 1 microphone in the receiving room. The convergence behaviour is shown both in terms of system misalignment (ratio of the squared norms of (4.69) and the desired response), and in terms of echo return loss enhancement (ERLE) which describes the ratio of the short-term powers of the echo y(n) − nO (n) and the residual echo e(n) − nO (n). (For smoothing the ERLE curves, a moving average filter of length 256 was used.) Figure 4.9 illustrates the effect of taking the cross-correlations in (4.112) and (4.113) for P = 2 into account. As input xp (n), a common white noise signal was convolved by the room impulse responses in the transmission room. Another white noise signal was added to the echo on the microphone for SN R = 35 dB. Here, both the receiving room impulse responses and the modeling filter lengths were chosen to be 1024 (solid lines: proposed, dashed lines: classical UFLMS algorithm).
H. Buchner et al. 0
40
−10
30 ERLE [dB]
misalignment [dB]
124
classical UFLMS −20 −30
proposed
20 10
classical UFLMS
proposed −40
0
2
4 6 Time [sec]
8
0
10
0
2
(a)
4 6 Time [sec]
8
(b)
Fig. 4.9. Effect of taking cross-correlation into account (P =2 channels, α = 4). (a) Misalignment, (b) ERLE.
0 misalignment [dB]
misalignment [dB]
0
−5
−10
NLMS proposed −15
0
2
4 6 Time [sec]
(a)
8
10
−5
−10
NLMS proposed −15
0
2
4 6 Time [sec]
8
10
(b)
Fig. 4.10. Misalignment convergence for the multichannel cases P =2 (lowest),3,4,5 (uppermost). (a) Overlap α = 4, (b) overlap α adjusted.
For simulations with real-world signals, the lengths of the measured receiving room impulse responses were 4096 and the modeling filters were 1024, respectively. One common speech signal from the transmission room serves as input signal. Figure 4.10 shows the misalignment convergence of the described algorithm (solid) for the multichannel cases P = 2 (lowest curve),3, 4, 5 (uppermost curve), and the basic NLMS [1] (dashed) for comparison. In (a) the overlap factor α was set to 4 in all cases, while in (b) the overlap factor α was set to 4 for P = 2, and adjusted to 8 for P = 3, 4, and to 16 for P = 5. Using these parameters, the convergence curves for the different numbers of channels are almost indistinguishable. Figure 4.11(a) shows the corresponding ERLE curves. Figure 4.11(b) compares different regularization methods (white noise distortion as above): no regularization (uppermost curve), constant regularization (dotted), threshold (dashed), exponential with original algorithm (dashdot), proposed Kalman gain computation 4.122 (lower solid line).
4
Multichannel Acoustic Echo Cancellation
125
Coefficient error norm [dB]
0
ERLE [dB]
30
20
10
0
0
2
4 6 Time [sec]
8
10
−2
no reg.
−4
constant reg.
−6
threshold reg.
−8 exp. reg.
−10
exp. reg., recursive −12
0
2
4 6 Time [sec]
8
10
(b)
(a)
Fig. 4.11. (a) ERLE convergence for the multichannel cases P =2,3,4,5 and adjusted α and (b) comparison of regularization methods, P =5, α=16. Multimedia System
Reverberation
Target
Interferer
Echoes
Fig. 4.12. Hands-free speech recognition in multimedia systems.
We note that for both, stereophonic teleconferencing and hands-free speech recognition applications, real-time systems could be successfully implemented on regular personal computers [2], [3]. 4.8.2
Adaptive MIMO Filtering for Hands-Free Speech Communication
In applications such as hands-free speech recognition, it is very important to reduce interfering noise or competing speech signals, and reverberation of the target speech signal, in addition to the acoustic echo cancellation (Fig. 4.12). An efficient approach to address these problems is to replace the single microphone by a microphone array directing a beam of increased sensitivity at the active talker [30]. In any practical system, this scenario presents a MIMO system identification problem for the acoustic echo canceller [30], [3]. Fortunately, as noted in Sect. 4.4, the costly calculation of the Kalman gain is necessary only once, i.e., it is independent of the number of microphones. Figure 4.13 gives an example of a low-complexity structure. Echo cancellation
126
H. Buchner et al.
Preprocessing
P
(partial channel decorrelation)
"(P x B)-C AEC"
Beam-indep. part MIMO System Beam 1
Beam B
-
Postproc.
+
Voting
Fixed BFs
B -
Q
+
[Speaker position]
Fig. 4.13. A human/machine interface for hands-free speech recognition.
is applied to several beamformer (BF) output signals. The fixed beamformers do not disturb the convergence of the echo cancellation and direct beams to all directions of interest [30]. Thanks to the efficient frequency-domain approach, it has become possible to run such a system in real-time on standard PC platforms. The implementation is fully scalable (e.g., sampling rate, number of loudspeakers, microphones, and beams) with dynamical allocation of computational power. Example parameters for a speech recognition interface with 5.1-channel surround sound reproduction running on a 1.7 GHz dual processor board are: P L = 5 · 4096 adaptive filter coefficients, an overlap factor α = 16, and a sampling rate of 12 kHz for the acquisition.
4.9
Conclusions
In many applications where an adaptive filter is required, frequency-domain algorithms are an attractive alternative to time-domain algorithms, especially for the multichannel case. First, the computational complexity can be low by utilizing the efficiency of the FFT. Second, the convergence is improved if crucial parameters of these algorithms such as the exponential window, regularization, and adaptation step are properly chosen. A general framework for multichannel frequency-domain adaptive filtering was presented and its efficiency in actual applications was demonstrated. A generic multichannel algorithm with a MMSE convergence that is independent of the input signal statistics can be derived from the normal equation after minimizing a block least-squares criterion in the frequency domain. We analyzed the convergence of this algorithm and discussed some approximations that lead to both, well-known algorithms in the single-channel case, such as the FLMS and UFLMS, and new algorithms such as the EMDF. For the multichannel case the framework is attractive as the cross-correlations between all input signals are efficiently taken into account. We have also presented strategies to improve the computational efficiency further by in-
4
Multichannel Acoustic Echo Cancellation
127
troducing stable schemes for recursive DFT and Kalman gain computation. Several simulations and real-time implementations illustrate the benefits of the multichannel algorithm.
5 Filtering Techniques for Noise Reduction and Speech Enhancement

Jingdong Chen, Yiteng (Arden) Huang, and Jacob Benesty
Bell Laboratories, Lucent Technologies
Murray Hill, NJ 07974, USA
E-mail: {jingdong, arden, jbenesty}@bell-labs.com

Abstract. Adaptive filtering has plenty of potential applications in many areas (echo cancellation, blind source separation, blind channel identification/equalization, source localization, etc.). This chapter focuses on (adaptive and non-adaptive) filtering techniques for mitigating noise effects in speech communications. Three approaches are studied: beamforming using multiple microphone sensors; adaptive noise cancellation (ANC), which uses a primary sensor to pick up the noisy speech signal and an auxiliary or reference sensor to measure the noise field; and spectral modification using only a single sensor. The strengths and weaknesses of these approaches are elaborated, and their performance with respect to noise reduction is studied and compared.
5.1 Introduction
The existence of noise is inevitable in real-world applications of speech processing. In a communication system, a desired speech signal, propagating through an acoustic channel and picked up by a microphone sensor, is corrupted by unwanted noise, resulting in a distorted signal. A straightforward method to mitigate noise effects is to pass the corrupted signal through a filter that tends to suppress the noise while leaving the desired signal relatively unchanged. This is the basic concept of noise cancellation. Noise cancellation has been an active subject of research since the 1940s, and various methods have been proposed and investigated [1], [2], [3], [4]. These approaches can be classified into three basic categories: beamforming techniques exploiting multiple sensors; adaptive noise cancellation, which uses a primary sensor to pick up the noisy signal and a reference sensor to measure the noise field; and adaptive spectral modification using only a single microphone.

Beamforming: This technique makes use of a sensor array to enhance the desired signal and suppress the noise. Such an array consists of a set of sensors that are spatially distributed at known locations with reference to a common point. These sensors collect signals from sources in their field of view. By weighting the sensor outputs, a beam can be formed and steered along a specified direction. Consequently, a signal propagating from the look direction is reinforced, while sound sources originating from other directions are attenuated.
The beamforming approach has many appealing properties. For example, in diffuse noise conditions, it can achieve a broad-band SNR gain approximately equal to the number of sensors. In general, beamforming does not introduce much distortion into the signal of interest. However, the noise reduction performance of a beamformer depends on the number of sensors, the geometry of the array, the characteristics of the noise, and so forth. In addition, the estimation of the direction of arrival (DOA) is a critical issue: while the array points along a specified direction, all signals arriving from other directions are treated as undesirable. As a result, if the array is steered away from the true source direction, the signal of interest itself becomes an undesired signal and is suppressed. The degree of suppression depends on the angular separation between the direction of the signal and the current look direction, as well as on the array geometry.

Adaptive Noise Cancellation: This method uses a primary sensor to pick up the noisy signal and an auxiliary or reference input derived from one or more sensors located at points in the noise field where the desired signal is weak or undetectable. It achieves noise cancellation by adaptively recreating a replica of the interfering noise from the reference input and subtracting this replica from the corrupted signal. The strength of this approach derives from the use of closed-loop adaptive filtering algorithms. It is particularly effective when the reference input can anticipate the noise field at the primary sensor, so that the processing delay in the filter and any propagation delay between the reference input and the primary sensor are easily compensated for. Furthermore, it can deal even with non-stationary noise. However, the success of this method rests on two fundamental requirements. The first is that the desired signal is collected only by the primary sensor, while little or none of it is detected by the reference microphone; any inclusion of the desired signal in the reference input results in signal cancellation and eventually leads to distortion of the speech. The second requirement is that the noise measured by the reference sensor be highly correlated with the noise that interferes with the desired signal at the primary sensor. In the extreme case where the noise components at the two sensors are completely uncorrelated, no reduction of noise can be achieved.

Adaptive Spectral Modification: In this approach, a single sensor is used to measure both the signal and the interfering noise. Whereas closed-loop adaptive algorithms such as the one proposed in [5] can be exploited, a more common way is to achieve noise reduction by means of spectral modification, such as spectral subtraction and Wiener filtering. The former restores the magnitude spectrum of the desired signal by subtracting an estimate of the noise spectrum from the noisy signal spectrum, while the latter estimates the magnitude spectrum of the desired signal by passing the noisy signal through a filter whose transfer function is adaptively updated according to estimates of the signal and noise spectra.
The enhanced time-domain signal is constructed from an estimate of the magnitude spectrum of the desired signal combined with the phase of the noisy signal. This approach often leads to more noise reduction than that obtained by the beamformer and the ANC. In addition, it has no microphone placement problem. However, the estimation of the noise is a critical issue. Usually, the noise spectrum is estimated or updated during periods when the desired signal is absent, under the assumption that the noise is stationary, or at least slowly varying, between two consecutive update periods. Furthermore, due to the variance of the spectral estimation, this approach often results in noticeable signal distortion.

In this chapter, we study these noise cancellation/reduction techniques and analyze the performance of the different approaches.
5.2 Noise Reduction with an Array
Sensor arrays have been in use for several decades in many practical signal processing applications to detect the presence of a desired signal and to reduce the effects of unwanted noise. Their potential for speech enhancement has gained special attention since the early 1960s. The underlying idea can be described as synchronizing-and-adding. Consider a single-source situation where the waveform produced by each sensor consists of a speech signal, which is a time-delayed or time-advanced version of the signal at the reference sensor, and random noise, statistically independent from sensor to sensor. By advancing or delaying the sensor outputs to make the signal components in-phase and then adding them together, the signal components reinforce one another, while the noise components tend to remain at their original level because of their random, uncorrelated nature. The synchronizing-and-adding is accomplished through a widely known technique called beamforming.

Beamforming algorithms vary according to the location of the speech source relative to the array. If the source is located close to the array, the wavefront of the propagating wave is perceptibly curved with respect to the dimensions of the array; such a case is referred to as a near-field scenario. If the direction of propagation is approximately equal at each sensor, then the source is located in the array's far-field, and the propagating field consists of plane waves. Beamforming methodologies differ for the far-field and near-field cases. Here we assume that the source is in the far-field.

In the far-field case, a number of beamforming algorithms have been developed; they can be adaptive [6], [7], [8], [9] or non-adaptive [10]. For ease of performance analysis, we start with the non-adaptive delay-and-sum algorithm, which is the basis of other beamformers. We then show that, when the direction of arrival (DOA) is known a priori, adaptive algorithms can be employed to achieve greater noise reduction. A typical delay-and-sum beamforming algorithm is illustrated in Fig. 5.1. A plane wave, propagating from a far-field source, arrives at a uniformly spaced linear array consisting of M microphones.
Fig. 5.1. Illustration of a delay-and-sum beamformer with a uniformly spaced linear microphone array when the sound source is in the far-field.
The sensor outputs can be modeled as
$$\mathbf{x}(n) = \begin{bmatrix} x_1(n) \\ x_2(n) \\ \vdots \\ x_M(n) \end{bmatrix} = \begin{bmatrix} s(n) \\ s(n-\tau) \\ \vdots \\ s(n-(M-1)\tau) \end{bmatrix} + \begin{bmatrix} v_1(n) \\ v_2(n) \\ \vdots \\ v_M(n) \end{bmatrix}, \qquad (5.1)$$
where $x_m(n)$ is the signal received at the $m$-th sensor, $\tau = d\cos\theta/c$ is the relative delay between any two neighboring microphones, $c$ represents the velocity of propagation, $\theta$ is the angle between the wavefront and the line joining the sensors in the linear array, and $v_m(n)$ is the sensor noise, which is assumed to be uncorrelated with the source signal $s(n)$. With $X_m(j\omega)$, $S(j\omega)$, and $V_m(j\omega)$ representing the Fourier transforms of $x_m(n)$, $s(n)$, and $v_m(n)$, respectively, (5.1) becomes
$$\begin{bmatrix} X_1(j\omega) \\ X_2(j\omega) \\ \vdots \\ X_M(j\omega) \end{bmatrix} = \begin{bmatrix} S(j\omega) \\ S(j\omega)e^{-j\omega\tau} \\ \vdots \\ S(j\omega)e^{-j\omega(M-1)\tau} \end{bmatrix} + \begin{bmatrix} V_1(j\omega) \\ V_2(j\omega) \\ \vdots \\ V_M(j\omega) \end{bmatrix}. \qquad (5.2)$$
The power of the received signal at the $m$-th microphone is given by
$$P_m(\omega) = E\left[|X_m(j\omega)|^2\right] = E\left[|S(j\omega)|^2\right] + E\left[|V_m(j\omega)|^2\right], \qquad (5.3)$$
where $E[\cdot]$ denotes mathematical expectation. We assume that all microphones have the same noise power:
$$E\left[|V_1(j\omega)|^2\right] = \cdots = E\left[|V_M(j\omega)|^2\right] = E\left[|V(j\omega)|^2\right], \qquad (5.4)$$
so that
$$P_1(\omega) = P_2(\omega) = \cdots = P_M(\omega). \qquad (5.5)$$
The signal-to-noise ratio (SNR) at each microphone can then be expressed as
$$\gamma(\omega) = \frac{E\left[|S(j\omega)|^2\right]}{E\left[|V(j\omega)|^2\right]}. \qquad (5.6)$$
The delay-and-sum beamformer consists of applying a delay $\tau_m$ and an amplitude weight $w_m$ to the $m$-th sensor output, then summing the $M$ resulting signals. The delay-and-sum beamformer's output is given by
$$Y(j\omega) = \sum_{m=1}^{M} w_m X_m(j\omega)\, e^{j\omega\tau_m}, \qquad (5.7)$$
where $\{w_m\}$ is sometimes called the array taper. Substituting (5.2) into (5.7) and further setting $w_m = 1/M$ and $\tau_m = (m-1)\tau = (m-1)d\cos\theta/c$, we have
$$Y(j\omega) = S(j\omega) + \frac{1}{M}\sum_{m=1}^{M} V_m(j\omega)\, e^{j\omega(m-1)\tau}. \qquad (5.8)$$
We now define a parameter that relates the microphone signal power to the beamformer's output power as [11]:
$$\Psi(\omega) = \frac{E\left[|X_m(j\omega)|^2\right]}{E\left[|Y(j\omega)|^2\right]} = \frac{E\left[|S(j\omega)|^2\right] + E\left[|V(j\omega)|^2\right]}{E\left[|S(j\omega)|^2\right] + \dfrac{1}{M^2}\, E\left[\left|\displaystyle\sum_{m=1}^{M} V_m(j\omega)\, e^{j\omega(m-1)\tau}\right|^2\right]}. \qquad (5.9)$$
This parameter corresponds to the noise reduction performance of an equispaced linear array over a single microphone. If the noises $v_m(n)$ are uncorrelated with each other, (5.9) reduces to
$$\Psi(\omega) = \frac{E\left[|S(j\omega)|^2\right] + E\left[|V(j\omega)|^2\right]}{E\left[|S(j\omega)|^2\right] + \dfrac{1}{M}\, E\left[|V(j\omega)|^2\right]} = \frac{\gamma(\omega) + 1}{\gamma(\omega) + 1/M}. \qquad (5.10)$$
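As a quick numerical check of (5.10), the following Python snippet evaluates the theoretical noise-reduction factor for a diffuse (spatially uncorrelated) noise field; it reproduces the trend of Fig. 5.2 and the bounds (5.11). This is purely an illustrative sketch.

```python
import numpy as np

def psi_diffuse(gamma, M):
    """Noise-reduction factor of (5.10): gamma is the per-microphone SNR
    (linear scale) and M the number of microphones."""
    return (gamma + 1.0) / (gamma + 1.0 / M)

# gain in dB versus element SNR, as in Fig. 5.2
snr_db = np.arange(-30.0, 31.0, 10.0)
gamma = 10.0 ** (snr_db / 10.0)
for M in (10, 20, 100):
    print(M, np.round(10.0 * np.log10(psi_diffuse(gamma, M)), 1))
```

At very low element SNR the gain approaches 10 log10(M) dB, and at high SNR it approaches 0 dB, as stated below.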
Fig. 5.2. Theoretical noise reduction performance for a uniformly spaced linear array under the condition that the noise at each microphone is uncorrelated with both the noise at any other microphone and the source signal.
Since $M$ is a positive integer and $M \ge 2$, it can easily be shown that $\Psi(\omega)$ is a monotonically decreasing function of $\gamma(\omega)$, and
$$1 < \Psi(\omega) < M. \qquad (5.11)$$
When $\gamma(\omega) \gg 1$, $\Psi(\omega) \approx 1$. This is reasonable, since $\gamma(\omega) \gg 1$ indicates that the noisy signal is relatively clean and only little noise attenuation is needed. The lower the element SNR, the more noise reduction can be gained. Note that the maximum noise reduction is bounded by the number of microphones. A comparison of the performance of a uniform array with different numbers of microphones is shown in Fig. 5.2.

For directional noise, e.g., noise that originates from a point source whose angle of incidence is denoted by $\phi$, (5.9) can be expressed as
$$\Psi(\omega) = \frac{E\left[|S(j\omega)|^2\right] + E\left[|V(j\omega)|^2\right]}{E\left[|S(j\omega)|^2\right] + \dfrac{1}{M^2}\, E\left[|V(j\omega)|^2\right] \left|\displaystyle\sum_{m=1}^{M} e^{j\omega(m-1)d(\cos\phi-\cos\theta)/c}\right|^2} = \frac{\gamma(\omega) + 1}{\gamma(\omega) + \dfrac{1}{M^2}\left|\dfrac{\sin[\omega M d(\cos\phi - \cos\theta)/2c]}{\sin[\omega d(\cos\phi - \cos\theta)/2c]}\right|^2}. \qquad (5.12)$$
In such a condition, the noise reduction performance is a function of the number of microphones and the angular separation between the direction of the signal and that of the interfering noise. Figure 5.3 compares the noise suppression performance with different numbers of microphones as a function of the incidence angle $\phi$. Generally, the larger the angular separation, the more noise reduction can be achieved. For the same amount of noise cancellation, less angular separation is required for an array with more microphones.
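The directional-noise case (5.12) can be evaluated in the same way; the sketch below uses the parameter values quoted in the caption of Fig. 5.3 (2 cm spacing, 4 kHz, θ = 90°, −10 dB SNR) and is again purely illustrative.

```python
import numpy as np

def psi_directional(gamma, M, d, f, phi, theta=np.pi / 2, c=340.0):
    """Noise-reduction factor of (5.12) for a plane-wave interferer at
    angle phi [rad]; gamma is the linear SNR, d the spacing in meters."""
    # w*d*(cos(phi)-cos(theta))/(2c) with w = 2*pi*f
    delta = np.asarray(np.pi * f * d * (np.cos(phi) - np.cos(theta)) / c)
    den = M * np.sin(delta)
    ratio = np.ones_like(den)                 # limit value as phi -> theta
    ok = np.abs(den) > 1e-12
    ratio[ok] = np.sin(M * delta[ok]) / den[ok]
    return (gamma + 1.0) / (gamma + ratio ** 2)

# parameters of Fig. 5.3: 2 cm spacing, f = 4 kHz, theta = 90 deg, -10 dB SNR
phi = np.radians(np.arange(0.0, 181.0, 15.0))
print(np.round(10 * np.log10(psi_directional(0.1, 20, 0.02, 4000.0, phi)), 1))
```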
Fig. 5.3. Theoretical noise reduction performance for a uniform array under the condition that noise propagates from the far-field with an incidence angle of φ. The spacing between any two neighboring microphones is 2 cm. The sampling frequency is 16 kHz. The plot is shown for a signal frequency of 4 kHz, the incidence angle of the signal is θ = 90°, and the SNR at the reference microphone is −10 dB.
With the same number of microphones, when the DOA is known a priori, adaptive algorithms with directional constraints can be exploited to achieve better noise reduction performance. The underlying idea is to minimize or reject signals from all other directions while maintaining a constant gain along the look direction. This can be formulated in the time domain by finding a weighting vector $\mathbf{w}$ such that
$$\mathbf{w} = \arg\min_{\mathbf{w}}\, \mathbf{w}^H \mathbf{R}\, \mathbf{w} \quad \text{subject to} \quad |\mathbf{w}^H \mathbf{a}(\omega)| = 1, \qquad (5.13)$$
where
$$\mathbf{a}(\omega) = \frac{1}{M}\left[e^{j\omega\tau_1}, e^{j\omega\tau_2}, \cdots, e^{j\omega\tau_M}\right],$$
$\mathbf{R} = E\left[\mathbf{x}(n)\mathbf{x}^T(n)\right]$ is the $M \times M$ array output covariance matrix, $^H$ denotes complex conjugate transpose, and $^T$ denotes transpose. For a positive definite matrix $\mathbf{R}$, the solution to (5.13) is readily given by [8]:
$$\mathbf{w} = \frac{\mathbf{R}^{-1}\mathbf{a}(\omega)}{\mathbf{a}^H(\omega)\mathbf{R}^{-1}\mathbf{a}(\omega)}. \qquad (5.14)$$
The corresponding array output power is
$$E\left[|y(n)|^2\right] = \frac{1}{\mathbf{a}^H(\omega)\mathbf{R}^{-1}\mathbf{a}(\omega)}. \qquad (5.15)$$
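A direct implementation of (5.14) is straightforward; the sketch below adds a small diagonal loading term to keep the inversion of an estimated covariance matrix well conditioned (a common practical safeguard, not part of the derivation above). The steering vector follows the definition under (5.13) for a uniform linear array.

```python
import numpy as np

def steering_vector(M, d, theta, f, c=340.0):
    """a(w) of (5.13) for a uniform linear array: tau_m = (m-1) d cos(theta)/c."""
    tau = d * np.cos(theta) / c
    return np.exp(1j * 2 * np.pi * f * tau * np.arange(M)) / M

def mvdr_weights(R, a, loading=1e-3):
    """Directionally constrained weights of (5.14), with diagonal loading."""
    M = len(a)
    Rl = R + loading * np.trace(R).real / M * np.eye(M)
    Ria = np.linalg.solve(Rl, a)
    return Ria / (a.conj() @ Ria)

# example: 20 mics, 2 cm spacing, look direction 90 degrees, f = 2 kHz
a = steering_vector(20, 0.02, np.pi / 2, 2000.0)
R = np.eye(20)                     # spatially white noise covariance
w = mvdr_weights(R, a)
print(np.abs(w.conj() @ a))        # distortionless constraint: |w^H a| = 1
```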
Fig. 5.4. Noise reduction performance with a uniformly spaced linear array consisting of 20 microphones in white Gaussian noise. The upper trace shows the waveform of the original clean speech signal. The middle trace plots the noisy signal received by the reference microphone. The lower trace shows the output of the beamformer.
By applying the Cauchy-Schwarz inequality, it can easily be shown that
$$\frac{1}{\mathbf{a}^H(\omega)\mathbf{R}^{-1}\mathbf{a}(\omega)} \le \mathbf{a}^H(\omega)\mathbf{R}\,\mathbf{a}(\omega). \qquad (5.16)$$
The term on the right-hand side of (5.16) is the output power of the delay-and-sum beamformer [i.e., (5.7)] in the time domain. That is, the power of the array output with adaptive directional constraints is lower than that of the delay-and-sum beamformer. Since the gain along the look direction is 1, this indicates that more noise can be attenuated with adaptive beamforming algorithms.

Simulations. Figures 5.4 and 5.5 give the experimental performance of a uniformly spaced linear array with 20 microphones. The spacing between two neighboring microphones is 2 cm. A speech sound source is placed in front of the array, i.e., θ = 90°. The sampling rate is 16 kHz. White Gaussian noise is added to each microphone signal with an SNR equal to 0 dB. Figure 5.4 shows the waveforms of the original signal, the signal picked up by the reference microphone, and the beamformer's output. Figure 5.5 plots the predicted and the actual noise reduction performance. In the calculation of the predicted performance, both the speech signal and the noise are segmented into frames of 512 samples each. The power spectrum of each frame is then estimated using the fast Fourier transform (FFT). The average power spectra of signal and noise are computed by averaging their frame power spectra over the whole observation period. The SNR for each frequency bin is then calculated as the ratio between the signal power and the noise power.
Fig. 5.5. Predicted and actual noise reduction performance with a uniform array consisting of 20 microphones in a diffuse noise field.
Finally, the predicted noise reduction performance is estimated according to (5.10). The actual reduction performance is computed as the ratio between the average power spectrum of the reference microphone signal and that of the beamformer's output. Most of the speech energy is located below 2 kHz due to the first and second formants, while the power spectral density of white Gaussian noise is rather flat across the whole frequency band. Therefore the SNR below 2 kHz is higher than that above 2 kHz. Consequently, according to (5.10), we can expect more noise reduction at high frequencies. This is verified by Fig. 5.5.

Figures 5.6 and 5.7 show the experimental performance of the same array in a directional noise field. The sound source is again placed at an incidence angle of 90°, while an interfering noise source, replaying prerecorded car noise, is placed in the array's far-field at φ = 45°. The SNR at the reference microphone is controlled to be 0 dB. Figure 5.6 shows the waveforms of the original signal, the signal picked up by the reference microphone, and the beamformer's output. Figure 5.7 plots the noise reduction performance predicted according to (5.12) and the actual reduction performance. Due to the low-pass characteristics of the car noise, the SNR between 6 kHz and 8 kHz is significantly higher than that below 6 kHz. This results in almost no noise attenuation between 6 kHz and 8 kHz, while below 6 kHz up to 5 dB of reduction is gained.
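For reference, a compact time-domain delay-and-sum implementation in the spirit of (5.7)-(5.8) is sketched below; fractional steering delays are applied per channel in the DFT domain. The geometry arguments mirror the simulation setup above; this is an illustrative sketch, not the code used for the figures.

```python
import numpy as np

def delay_and_sum(x, fs, d=0.02, theta=np.pi / 2, c=340.0):
    """Delay-and-sum beamformer for a uniform linear array.

    x: (M, K) matrix of sensor signals, fs: sampling rate [Hz].
    Channel m is advanced by m * tau (tau = d cos(theta) / c) to
    time-align the signal components, then the channels are averaged
    with weights w_m = 1/M as in (5.8)."""
    M, K = x.shape
    tau = d * np.cos(theta) / c
    f = np.fft.rfftfreq(K, 1.0 / fs)
    y = np.zeros(K)
    for m in range(M):
        Xm = np.fft.rfft(x[m])
        # phase shift implements a (possibly fractional) delay advance
        y += np.fft.irfft(Xm * np.exp(2j * np.pi * f * m * tau), K)
    return y / M
```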
5.3 Adaptive Noise Cancellation

The aim of adaptive noise cancellation (ANC) is to eliminate the background noise by adaptively recreating a noise replica from a reference signal. This reference signal is derived from one or more sensors located at points in the noise field where the desired signal is weak or undetectable.
Fig. 5.6. Noise reduction performance with a uniformly spaced linear array consisting of 20 microphones in a directional noise condition. The upper trace shows the waveform of the original clean speech signal. The middle trace plots the noisy signal received by the reference microphone. The lower trace shows the output of the beamformer.
Fig. 5.7. Predicted and actual noise reduction performance with a uniform array consisting of 20 microphones in a directional noise situation.
Subtracting the noise replica from the corrupted signal then ideally leaves only the desired signal [3], [11], [12], [13], [14], [15], [16]. A typical ANC configuration is depicted in Fig. 5.8, where two microphone sensors are used: the primary sensor collects the sum of the unwanted noise $v_1(n)$ and the speech signal $s(n)$, and the auxiliary sensor measures the noise signal $v(n)$. To cancel the noise in the primary input, we need a noise estimate $\hat v_1(n)$.
Fig. 5.8. Model for adaptive noise cancellation with primary and auxiliary inputs.
This estimate is usually generated by passing the auxiliary input through an FIR filter $\hat{\mathbf h} = [\hat h_0, \hat h_1, \cdots, \hat h_{N-1}]^T$ of length $N$, whose coefficients are estimated adaptively. We have
$$\hat v_1(n) = \hat{\mathbf h}^T \mathbf v(n), \qquad (5.17)$$
where $\mathbf v(n) = [v(n), v(n-1), \cdots, v(n-N+1)]^T$. Subtracting $\hat v_1(n)$ from the primary signal yields the error signal
$$e(n) = x(n) - \hat v_1(n) = s(n) + [v_1(n) - \hat v_1(n)], \qquad (5.18)$$
and
$$e^2(n) = s^2(n) + [v_1(n) - \hat v_1(n)]^2 + 2 s(n)[v_1(n) - \hat v_1(n)]. \qquad (5.19)$$
Therefore, it follows that
$$E\{e^2(n)\} = E\{s^2(n)\} + E\left\{[v_1(n) - \hat v_1(n)]^2\right\} + 2E\{s(n)[v_1(n) - \hat v_1(n)]\}. \qquad (5.20)$$
The goal of the ANC is to find an FIR filter that minimizes $E\{e^2(n)\}$. Three assumptions are made in the ANC:
• The noise [$v_1(n)$ and $v(n)$] is uncorrelated with the speech signal $s(n)$.
• $v_1(n)$ and $v(n)$ are at least partially coherent.
• The auxiliary microphone is well isolated from the speech source so that it does not pick up speech.
When the noise is stationary, the optimum filter $\hat{\mathbf h}_{\rm opt}$ can be determined as the Wiener solution, i.e.,
$$\hat{\mathbf h}_{\rm opt} = \mathbf R_{vv}^{-1}\, \mathbf r_{vx}, \qquad (5.21)$$
where $\mathbf R_{vv} = E\left[\mathbf v(n)\mathbf v^T(n)\right]$ and $\mathbf r_{vx} = E\left[x(n)\mathbf v(n)\right]$.
Very often, however, the filter is estimated sequentially through adaptive algorithms to account for time-varying noise statistics. A commonly used adaptive algorithm is the steepest-descent gradient (SDG) technique, which updates the filter coefficients using
$$\hat{\mathbf h}(n) = \hat{\mathbf h}(n-1) - \mu\, \nabla E\left[e^2(n)\right] = (\mathbf I - 2\mu\mathbf R_{vv})\,\hat{\mathbf h}(n-1) + 2\mu\,\mathbf r_{vx}, \qquad (5.22)$$
where $\nabla$ is the gradient with respect to $\hat{\mathbf h}$, and $\mu$ is the step size, which controls the rate of change. In a practical situation, the correlation functions are not usually known. An iterative algorithm is therefore required that can operate without knowledge of the system statistics. An approximation to the SDG is the stochastic gradient (SG) or least-mean-square (LMS) algorithm, which updates the filter coefficients through
$$\hat{\mathbf h}(n) = \hat{\mathbf h}(n-1) + \mu\,\mathbf v(n)\, e(n). \qquad (5.23)$$
The convergence rate of the LMS algorithm can slow dramatically when there is a disparity in the eigenvalues of the correlation matrix $\mathbf R_{vv}$. Define the ratio of the largest eigenvalue to the smallest eigenvalue as the eigenvalue spread. It can be shown that a larger eigenvalue spread corresponds to a slower convergence rate [18]. The eigenvalue spread problem is a major weakness of the LMS algorithm. One often-used method to limit the effect of the eigenvalue spread is the normalized LMS (NLMS), which operates through
$$\hat{\mathbf h}(n) = \hat{\mathbf h}(n-1) + \mu\,\frac{e(n)}{\mathbf v^T(n)\mathbf v(n) + \varsigma}\,\mathbf v(n), \qquad (5.24)$$
where $\varsigma > 0$ is a regularization parameter that prevents division by zero and stabilizes the solution.

Besides the SDG, LMS, and NLMS, the least-squares (LS) and recursive least-squares (RLS) algorithms are also commonly used for their fast convergence speed. The basic idea of the LS method is to replace the expected value of $e^2(n)$ in (5.20) by the arithmetic mean of $e^2(n)$ over some range of sampling instants $n$. The LS is a block processing algorithm where $\hat{\mathbf h}$ is derived from a block of data. This estimate is assumed to be valid until the next block of data is processed to give a new estimate of $\hat{\mathbf h}$. An alternative method is to obtain an optimum $\hat{\mathbf h}$ recursively at every time instant; the resulting method is called the RLS algorithm. While it is true that in most situations the RLS algorithm converges much faster than the LMS algorithm, it is also much more computationally expensive. The affine projection algorithm (APA) provides a good compromise between LMS and RLS: it converges faster than LMS, and its complexity is intermediate between that of LMS and RLS.
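The NLMS recursion (5.24) is simple to implement; the sketch below applies it in the ANC configuration of Fig. 5.8 (the illustrative step size and regularization values are ours, not values from the chapter).

```python
import numpy as np

def anc_nlms(x, v, N=10, mu=0.5, eps=1e-8):
    """Adaptive noise cancellation with the NLMS update of (5.24).

    x: primary input s(n) + v1(n), v: reference noise input,
    N: adaptive filter length. Returns the error e(n), i.e. the
    enhanced speech estimate."""
    h = np.zeros(N)                               # adaptive filter h-hat
    vbuf = np.zeros(N)                            # [v(n), v(n-1), ...]
    e = np.zeros(len(x))
    for n in range(len(x)):
        vbuf = np.roll(vbuf, 1)                   # shift in the new sample
        vbuf[0] = v[n]
        e[n] = x[n] - h @ vbuf                    # e(n) = x(n) - v1-hat(n)
        h += mu * e[n] * vbuf / (vbuf @ vbuf + eps)
    return e
```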
If the FIR filter is sparse, which means that a large number of the coefficients in $\mathbf h$ are effectively zero (or close to zero), the proportionate NLMS (PNLMS) [17] and PNLMS++ algorithms may be used to accelerate the convergence rate. We refer the reader to [18] for more details.

Although different adaptive algorithms may lead to different degrees of noise reduction, it can be shown that when $\mathbf h$ is time-invariant and $s(n)$, $v_1(n)$, and $v(n)$ are stationary, all of these algorithms converge to the optimum Wiener solution, i.e., (5.21). With $\hat H_{\rm opt}(j\omega)$, $V(j\omega)$, and $X(j\omega)$ representing the Fourier transforms of $\hat{\mathbf h}_{\rm opt}$, $v(n)$, and $x(n)$, respectively, (5.21) can be expressed as
$$\hat H_{\rm opt}(j\omega) = \frac{E\left[V(j\omega)X^*(j\omega)\right]}{E\left[|V(j\omega)|^2\right]}, \qquad (5.25)$$
where $^*$ stands for complex conjugation. In such a condition, the output noise power spectral density is
$$E\left[|E(j\omega)|^2\right] = E\left[|X(j\omega)|^2\right] - |\hat H_{\rm opt}(j\omega)|^2\, E\left[|V(j\omega)|^2\right] = E\left[|X(j\omega)|^2\right] - \frac{\left|E\left[V(j\omega)X^*(j\omega)\right]\right|^2}{E\left[|V(j\omega)|^2\right]}, \qquad (5.26)$$
where $E(j\omega)$ is the Fourier transform of the error signal $e(n)$. As in Sect. 5.2, we define the noise cancellation performance as the ratio between the power of the primary input signal and that of the error signal, i.e.,
$$\Psi(\omega) = \frac{E\left[|X(j\omega)|^2\right]}{E\left[|E(j\omega)|^2\right]} = \frac{E\left[|X(j\omega)|^2\right]}{E\left[|X(j\omega)|^2\right] - \dfrac{\left|E\left[V(j\omega)X^*(j\omega)\right]\right|^2}{E\left[|V(j\omega)|^2\right]}}. \qquad (5.27)$$
Since the noises $v(n)$ and $v_1(n)$ are assumed to be uncorrelated with the signal $s(n)$, it can easily be shown that
$$\Psi(\omega) = \frac{\gamma(\omega) + 1}{\gamma(\omega) + 1 - |\rho_{vv_1}(\omega)|^2}, \qquad (5.28)$$
where
$$\gamma(\omega) = \frac{E\left[|S(j\omega)|^2\right]}{E\left[|V_1(j\omega)|^2\right]}$$
is the SNR of the primary input, and
$$\rho_{vv_1}(\omega) = \frac{E\left[V(j\omega)V_1^*(j\omega)\right]}{\sqrt{E\left[|V(j\omega)|^2\right]\, E\left[|V_1(j\omega)|^2\right]}}$$
is the coherence function between $v(n)$ and $v_1(n)$. It can be seen that the performance of adaptive noise cancellation is a function of the SNR at the primary microphone and of the coherence between the noise at the primary microphone and that at the reference microphone.
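Equation (5.28) is easy to explore numerically; the following illustrative snippet shows how demanding the coherence requirement is.

```python
import numpy as np

def psi_anc(gamma, rho):
    """Noise-reduction factor of (5.28); gamma is the primary-input SNR
    (linear scale) and rho the coherence magnitude |rho_vv1|."""
    return (gamma + 1.0) / (gamma + 1.0 - np.abs(rho) ** 2)

# at 0 dB SNR, even |rho| = 0.9 buys only about 2.3 dB of reduction
print(10 * np.log10(psi_anc(1.0, 0.9)))
```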
Fig. 5.9. Theoretical noise reduction performance of adaptive noise cancellation in a stationary noise condition.
A plot of this function is shown in Fig. 5.9. For a fixed SNR, it can be proven that the performance is a monotonically increasing function of $|\rho_{vv_1}(\omega)|$. The maximum noise cancellation is achieved when $|\rho_{vv_1}(\omega)| = 1$, namely when $v(n)$ and $v_1(n)$ are coherent. If $v(n)$ and $v_1(n)$ are uncorrelated, i.e., $|\rho_{vv_1}(\omega)| = 0$, no noise reduction can be gained. When $|\rho_{vv_1}(\omega)|$ is fixed, it can be shown that more noise cancellation is obtained for a lower SNR. From Fig. 5.9, it can be seen that a large amount of correlation between the noise at the primary microphone and that at the reference microphone (e.g., $|\rho_{vv_1}(\omega)| > 0.8$) is necessary for even a small amount (e.g., 5 dB) of noise reduction.

Simulations. Figures 5.10 and 5.11 show the noise reduction performance of the ANC with different adaptive algorithms. The original speech signal is digitized at a 16 kHz sampling frequency. The reference signal used here is white Gaussian noise. It was passed through a digital filter whose transfer function (the coefficients are random numbers normalized to the first coefficient) is
$$H(z) = 1 + 0.53z^{-1} - 0.57z^{-2} + 2.2z^{-3} - 1.8z^{-4} + 6.7z^{-5} - 0.4z^{-6} + 0.3z^{-7},$$
and was added to the original signal to form the primary signal. The SNR of the primary signal is 0 dB. An adaptive filter is then applied to cancel the noise in the primary signal. Since the true filter length is not known a priori, the order of the adaptive filter is tentatively set to 10. Figure 5.10 plots the waveforms of the original clean signal, the primary microphone signal, and the signals estimated with the Wiener filter [eq. (5.21)], LMS, NLMS, and RLS algorithms.
Fig. 5.10. Waveforms of the original speech signal, primary microphone signal, and the signal estimated with Wiener, LMS, NLMS, and RLS algorithms.
Fig. 5.11. Noise reduction performance of LMS (top), NLMS (middle), and RLS (bottom) algorithms (solid line) as compared to the ideal Wiener solution (dashed line).
Since the noise is stationary, the Wiener filter gives the maximum reduction. The LMS algorithm takes a longer time to converge, and the NLMS converges faster than the LMS. The RLS algorithm outperforms both the LMS and the NLMS.
Figure 5.11 presents the noise reduction performance. On average, the SNR below 4 kHz is higher than that above 4 kHz. As a result, all algorithms show more noise attenuation at frequencies above 4 kHz. The RLS, due to its fast convergence speed, performs close to the Wiener filter. For both the LMS and the NLMS, a negative noise reduction is observed; this is the result of the slow convergence and steady-state error of these algorithms.
5.4 Spectral Modification with a Single Microphone

In many applications, like teleconferencing and hands-free telephony in cars, only a single sensor is available to measure both the signal and the noise field. In such a circumstance, the beamforming technique described above is not applicable, since no multiple realizations of the noisy signal are accessible. While the ANC scheme, through some modification, may be employed [5], a more common approach is to achieve noise reduction through spectral modification, which is depicted in Fig. 5.12.

Fig. 5.12. Speech enhancement with spectral modification.

The spectrum of the corrupted signal, $X(j\omega_k)$, is estimated using the $N$-point discrete Fourier transform (DFT), where $\omega_k = 2\pi k/N$ and $k = 0, 1, \cdots, N-1$. The clean speech spectral estimate, $\hat S(j\omega_k)$, is then obtained through modification of $X(j\omega_k)$. The clean speech signal is finally recovered through the inverse DFT (IDFT). A more detailed implementation is described in the following.

The central issue for such single-channel noise reduction is to obtain an estimate of the clean speech spectrum, $\hat S(j\omega_k)$, from the noisy speech spectrum, $X(j\omega_k)$. A complex spectrum can be factorized into magnitude and phase components,
$$X(j\omega_k) = X(\omega_k)e^{j\psi_k}, \qquad S(j\omega_k) = S(\omega_k)e^{j\varphi_k}. \qquad (5.29)$$
The problem then becomes to design two signal estimators that make decisions separately on the magnitude spectrum and the phase spectrum from the observed signal. Ephraim and Malah formulated in [4] a minimum mean-square error (MMSE) phase estimator, i.e.,
$$\hat\varphi_k|_{\rm MMSE} = \arg\min_{\hat\varphi_k} E\left\{\left|e^{j\varphi_k} - e^{j\hat\varphi_k}\right|^2 \,\Big|\, X(j\omega_k)\right\}. \qquad (5.30)$$
It turns out that the solution of this MMSE estimator is
$$\hat\varphi_k|_{\rm MMSE} = \psi_k. \qquad (5.31)$$
That is, the noisy phase $\psi_k$ is an optimal estimate of the signal's phase in the MMSE sense. The use of the noisy phase as the signal's phase is in fact good enough for speech enhancement purposes, since the human auditory system is relatively insensitive to phase corruption. It has been shown that speech degradation resulting from phase corruption is not perceived when the subband SNR at any $\omega_k$ is greater than 6 dB [19]. Keeping this in mind, we can simplify the single-channel noise reduction problem to finding an optimal signal magnitude spectrum, $\hat S(\omega_k)$, based on $X(j\omega_k)$.

Estimating the magnitude spectrum of a speech signal has long been a major concern of speech enhancement and noise reduction, and many techniques have been studied. Amongst them, parametric spectral subtraction and Wiener filtering have had some experimental success and have been widely adopted by current speech enhancement systems.

5.4.1 Parametric Spectral Subtraction
From the signal model given in Fig. 5.12, it can be formulated that
$$X(j\omega_k) = S(j\omega_k) + V(j\omega_k). \qquad (5.32)$$
Therefore, it follows that
$$|X(j\omega_k)|^2 = |S(j\omega_k)|^2 + |V(j\omega_k)|^2 + 2|S(j\omega_k)||V(j\omega_k)|\cos\theta_k = S^2(\omega_k) + V^2(\omega_k) + 2S(\omega_k)V(\omega_k)\cos\theta_k, \qquad (5.33)$$
where $S(\omega_k)$ and $V(\omega_k)$ denote the magnitudes of $S(j\omega_k)$ and $V(j\omega_k)$, respectively, and $\theta_k$ is the phase difference between the speech and noise signals. If $s(n)$ and $v(n)$ are assumed to be uncorrelated stationary random processes, (5.33) can be approximated as
$$X^2(\omega_k) = S^2(\omega_k) + V^2(\omega_k). \qquad (5.34)$$
Under this circumstance, the instantaneous power spectrum (IPS), or magnitude-square spectrum, of the signal, $S^2(\omega_k)$, can be recovered by subtracting an estimate of $V^2(\omega_k)$ from $X^2(\omega_k)$, i.e.,
$$\hat S^2(\omega_k) = X^2(\omega_k) - \hat V^2(\omega_k) = S^2(\omega_k) + [V^2(\omega_k) - \hat V^2(\omega_k)]. \qquad (5.35)$$
Therefore, the magnitude spectrum of the speech signal is computed as
$$\hat S(\omega_k) = \sqrt{\hat S^2(\omega_k)} = \sqrt{X^2(\omega_k) - \hat V^2(\omega_k)}. \qquad (5.36)$$
Combined with the phase of the noisy signal, an estimate of the spectrum of the speech signal is written as
$$\hat S(j\omega_k) = \sqrt{X^2(\omega_k) - \hat V^2(\omega_k)}\; e^{j\psi_k}. \qquad (5.37)$$
This forms the basis of a popular noise reduction method called spectral subtraction [1]. A similar algorithm can be developed in the magnitude spectral domain. If the uncorrelatedness assumption holds, it can be shown that
$$X(\omega_k) \approx S(\omega_k) + V(\omega_k). \qquad (5.38)$$
Therefore, the magnitude spectrum of the speech signal can be directly estimated by
$$\hat S(\omega_k) = X(\omega_k) - \hat V(\omega_k) = S(\omega_k) + [V(\omega_k) - \hat V(\omega_k)]. \qquad (5.39)$$
In a more general form, (5.35) and (5.39) can be expressed as
$$\hat S^b(\omega_k) = X^b(\omega_k) - \eta \hat V^b(\omega_k), \qquad (5.40)$$
where $b$ is an exponent and $\eta$ is a parameter introduced to control the amount of noise to be subtracted: for full subtraction we set $\eta = 1$, and for over-subtraction $\eta > 1$. Consequently, the estimate of the speech spectrum can be constructed as
$$\hat S(j\omega_k) = \left[X^b(\omega_k) - \eta \hat V^b(\omega_k)\right]^{1/b} e^{j\psi_k}. \qquad (5.41)$$
This is often referred to as parametric spectral subtraction [20]. The instantaneous power spectral subtraction results from $b = 2$ and $\eta = 1$, and the magnitude spectral subtraction from $b = 1$ and $\eta = 1$.

The technique of spectral subtraction was developed to recover the magnitude spectrum of a signal, which is non-negative. However, (5.35) and (5.39), and more generally (5.40), may lead to negative estimates. This is one of the major drawbacks of spectral subtraction. A non-linear rectification process is often used to map a negative estimate into a non-negative value. However, this process introduces additional distortion into the recovered signal, which becomes more noticeable as the SNR decreases.
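A minimal frame-wise implementation of (5.41), including a half-wave rectification of the kind discussed above, might look as follows (an illustrative sketch; the rectification rule is one common choice among several).

```python
import numpy as np

def spectral_subtract(X, V_mag, b=2.0, eta=1.0):
    """Parametric spectral subtraction, cf. (5.41).

    X: complex noisy spectrum X(j w_k) of one frame,
    V_mag: noise magnitude estimate V-hat(w_k).
    b = 2, eta = 1 gives power subtraction; b = 1 magnitude subtraction."""
    diff = np.abs(X) ** b - eta * V_mag ** b
    mag = np.maximum(diff, 0.0) ** (1.0 / b)      # rectify negative estimates
    return mag * np.exp(1j * np.angle(X))          # reuse the noisy phase (5.31)
```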
5.4.2 Estimation of the Noise Spectrum
A paramount issue in spectral subtraction is to obtain a good noise estimate; its accuracy greatly affects the noise reduction performance. There has been a tremendous effort in tackling this problem. Representative approaches include estimating the noise in the absence of speech, the minimum statistics method, the quantile-based method, and sequential estimation using a single-pole recursion.
Usually, a noisy speech signal is not occupied by speech all the time; a large percentage of the time is occupied by noise only. Therefore, the noise can be estimated from regions where the speech signal is absent. A voice activity detector (VAD) is designed to distinguish speech and non-speech segments of a given noisy signal, and this basic noise estimation relies on a VAD with high detection accuracy. When the noise is strong and the SNR becomes rather low, distinguishing speech and noise segments can be difficult. Moreover, the noise is estimated intermittently and obtained only during speech-silent periods, which may cause problems if the noise is non-stationary, as is the case in many applications.

To avoid explicit speech/non-speech detection, Martin proposed to estimate the noise via minimum statistics [21]. This technique is based on the assumption that during a speech pause, or within brief periods between words and even syllables, the speech energy is close to zero. As a result, a short-term power spectrum estimate of the noisy signal, even during speech activity, decays frequently to the noise power. Thus, by tracking the temporal spectral minimum without distinguishing between speech presence and speech absence, the noise power in a specific frequency band can be estimated. Although a VAD is not necessary in this approach, the noise estimate is often too small to provide sufficient noise reduction.

Instead of using minimum statistics, Hirsch et al. proposed a histogram-based method which obtains a noise estimate from sub-band energy histograms [22]. A threshold is set, above which peaks in the histogram profile are attributed to speech; the highest peak in the profile below this threshold is treated as the noise energy. Stahl et al. extended this idea to a quantile-based noise estimation approach [23], which works on the assumption that even in active speech sections of the input signal not all frequency bands are permanently occupied with speech, and that for a large percentage of the time the energy is at the noise level. This method therefore computes short-term power spectra, sorts them, and obtains the noise estimate by taking a value near the median of the resulting profiles. Evans et al. compared the histogram and quantile-based noise estimation approaches for noisy speech recognition [24]; the conclusion was in favor of the latter.

More generally, the noise can be estimated sequentially using a single-pole recursive average with an implicit speech/non-speech decision embedded. The noisy signal $x(n)$ is segmented into blocks of $N$ samples, and each block is transformed via a DFT into a block of $N$ spectral samples. Successive blocks of spectral samples form a two-dimensional time-frequency matrix denoted by $X_t(j\omega_k)$, where the subscript $t$ is the frame index and denotes the time dimension. The sequential noise estimation is then formulated as
$$\hat V^b_t(\omega_k) = \begin{cases} \alpha_a \hat V^b_{t-1}(\omega_k) + (1-\alpha_a) X^b_t(\omega_k), & \text{if } X^b_t(\omega_k) \ge \hat V^b_{t-1}(\omega_k) \\ \alpha_d \hat V^b_{t-1}(\omega_k) + (1-\alpha_d) X^b_t(\omega_k), & \text{if } X^b_t(\omega_k) < \hat V^b_{t-1}(\omega_k) \end{cases}, \qquad (5.42)$$
where $\alpha_a$ is the "attack" coefficient and $\alpha_d$ is the "decay" coefficient. This method is attractive for its simplicity and efficiency; some variations of it can be found in [25], [26], [27], [28].
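The recursion (5.42) translates directly into code; in the sketch below the attack/decay values are illustrative placeholders, not values taken from the chapter.

```python
import numpy as np

def sequential_noise_estimate(Xb, alpha_a=0.999, alpha_d=0.95):
    """Single-pole recursive noise estimation of (5.42).

    Xb: (T, K) array of short-time spectra X_t^b(w_k), one row per frame.
    Returns the running noise estimates V-hat_t^b(w_k)."""
    Xb = np.asarray(Xb, dtype=float)
    V = Xb[0].copy()                               # initialize from frame 0
    out = np.empty_like(Xb)
    for t, frame in enumerate(Xb):
        a = np.where(frame >= V, alpha_a, alpha_d) # rise slowly, track decays
        V = a * V + (1.0 - a) * frame
        out[t] = V
    return out
```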
5.4.3 Parametric Wiener Filtering
Wiener derived in [29] an MMSE estimator to track a signal in noisy conditions, which can be formulated as
$$\hat H_{\rm W}(j\omega_k)|_{\rm MMSE} = \arg\min_{H(j\omega_k)} E\left\{|S(j\omega_k) - H(j\omega_k)X(j\omega_k)|^2\right\}. \qquad (5.43)$$
The solution to this estimator is given by
$$\hat H_{\rm W}(\omega_k) = \frac{E\left[S^2(\omega_k)\right]}{E\left[S^2(\omega_k)\right] + E\left[V^2(\omega_k)\right]}. \qquad (5.44)$$
Therefore, the speech signal estimated through the Wiener filter is
$$\hat S(j\omega_k) = \hat H_{\rm W}(\omega_k)\, X(j\omega_k). \qquad (5.45)$$
Through a simple manipulation, the instantaneous power spectral subtraction, i.e., (5.37), can be rewritten as
$$\hat S(j\omega_k) = \sqrt{X^2(\omega_k) - \hat V^2(\omega_k)}\; e^{j\psi_k} = \left[\frac{X^2(\omega_k) - \hat V^2(\omega_k)}{X^2(\omega_k)}\right]^{1/2} X(\omega_k)\, e^{j\psi_k}. \qquad (5.46)$$
Denoting
$$\hat H_{\rm IPS}(\omega_k) = \left[\frac{X^2(\omega_k) - \hat V^2(\omega_k)}{X^2(\omega_k)}\right]^{1/2}, \qquad (5.47)$$
we can express (5.46) as
$$\hat S(j\omega_k) = \hat H_{\rm IPS}(\omega_k)\, X(j\omega_k). \qquad (5.48)$$
That is, recovering the signal through power spectral subtraction is equivalent to passing the noisy signal through a filter whose transfer function is $\hat H_{\rm IPS}(\omega_k)$. Equation (5.48) is akin to the Wiener filter in (5.45), which indicates the close relationship between the instantaneous power spectral subtraction method and the optimum Wiener filter. Similarly, the magnitude spectral subtraction can be reformulated as
$$\hat S(j\omega_k) = \hat H_{\rm MS}(\omega_k)\, X(j\omega_k), \qquad (5.49)$$
where
$$\hat H_{\rm MS}(\omega_k) = \frac{X(\omega_k) - \hat V(\omega_k)}{X(\omega_k)}. \qquad (5.50)$$
Fig. 5.13. Theoretical noise reduction performance of three Wiener-like filtering techniques.
Like spectral subtraction, (5.46) and (5.50) can be unified into a general form referred to as parametric Wiener filtering [15], i.e.,
$$\hat S(j\omega_k) = \hat H_{\rm PW}(\omega_k)\, X(j\omega_k), \qquad (5.51)$$
with
$$\hat H_{\rm PW}(\omega_k) = \left[\frac{X^b(\omega_k) - \eta \hat V^b(\omega_k)}{X^b(\omega_k)}\right]^{1/a}, \qquad (5.52)$$
where $\eta$ [as in (5.40)] is a parameter that controls the amount of noise to be reduced. The noise reduction performance of the parametric Wiener filtering technique can be expressed as
$$\Psi(\omega) = \frac{1}{\hat H^2_{\rm PW}(\omega_k)}. \qquad (5.53)$$
Under the ideal conditions that the signal and noise are uncorrelated and the noise estimate is equal to the true noise spectrum, (5.53) can be expressed as
$$\Psi(\omega) = \left[\frac{\gamma^{b/2}(\omega_k) + 1}{\gamma^{b/2}(\omega_k) + 1 - \eta}\right]^{2/a}, \qquad (5.54)$$
where $\gamma(\omega_k) = S^2(\omega_k)/V^2(\omega_k)$ is the a priori SNR at frequency $\omega_k$. Figure 5.13 shows the noise reduction performance of the parametric Wiener filter versus $\gamma(\omega_k)$. As can be seen, the three filters have similar noise reduction performance for high-SNR signals. In low-SNR conditions,
the Wiener filter can lead to more noise attenuation. Note that for positive-SNR signals, the three filters provide no more than 6 dB of noise reduction. If more noise reduction is to be obtained at positive SNR, this can be done by setting the parameter $\eta$ greater than 1; however, this can lead to other problems.

5.4.4 Estimation of the Wiener Gain Filter
As noise estimation is the central issue in spectral subtraction, the gain filter computation is the foremost problem for the parametric Wiener filtering method. Define the a posteriori SNR as
$$\varrho(\omega_k) = X(\omega_k)/V(\omega_k), \qquad (5.55)$$
so that (5.52) can be recast as
$$\hat H_{\rm PW}(\omega_k) = \left[1 - \frac{\eta}{\varrho^b(\omega_k)}\right]^{1/a}. \qquad (5.56)$$
Note from (5.55) that $1 \le \varrho(\omega_k) < \infty$, therefore $\hat H_{\rm PW}(\omega_k)$ (for $\eta = 1$) should be between 0 and 1. Any estimate greater than 1 should be mapped to 1, and any estimate less than 0 should be mapped to 0. From (5.56), it can be seen that the gain computation is essentially a matter of SNR estimation. In what follows, we introduce an approach to compute $\varrho(\omega_k)$.

We have previously discussed the estimation of the noise spectrum. By substituting a noise estimate, (5.42) for instance, into (5.55), one can obtain an estimate of the a posteriori SNR $\varrho(\omega_k)$. However, such an estimate of the SNR fluctuates due to the variance of the DFT spectrum [26]. Two approaches can be employed to reduce the fluctuation, namely, time-averaging and frequency-averaging the DFT spectra before computing the SNR. If we denote by $X_t(\omega_k)$ the short-time magnitude spectrum of the noisy signal at frame index $t$, the time-averaging can be implemented using a single-pole recursion,
$$\bar X_t(\omega_k) = \beta \bar X_{t-1}(\omega_k) + (1 - \beta) X_t(\omega_k), \qquad (5.57)$$
where $\beta$ is a parameter that controls the time constant. An even better smoothing can be achieved by a two-sided single-pole recursion,
$$\bar X_t(\omega_k) = \begin{cases} \beta_a \bar X_{t-1}(\omega_k) + (1-\beta_a) X_t(\omega_k), & \text{if } X_t(\omega_k) \ge \bar X_{t-1}(\omega_k) \\ \beta_d \bar X_{t-1}(\omega_k) + (1-\beta_d) X_t(\omega_k), & \text{if } X_t(\omega_k) < \bar X_{t-1}(\omega_k) \end{cases}, \qquad (5.58)$$
where again $\beta_a$ is the "attack" coefficient and $\beta_d$ is the "decay" coefficient. Combining (5.42) and (5.58), we have an estimate of the narrow-band SNR at time instant $t$ as
$$\varrho^{\rm N}_t(\omega_k) = \frac{\bar X_t(\omega_k)}{\hat V_t(\omega_k)}. \qquad (5.59)$$
Further reduction of the fluctuation of the SNR can be achieved by frequency-averaging the smoothed spectrum $\bar X_t(\omega_k)$, i.e.,
$$\bar X^{\rm W}_t(\omega_k) = \frac{\displaystyle\sum_{j=k-M/2}^{k+M/2} w_j \bar X_t(\omega_j)}{\displaystyle\sum_{j=k-M/2}^{k+M/2} w_j}, \qquad (5.60)$$
where $\{w_j\}$ defines a window and $M$ its width. Based on this estimate, we can introduce a wide-band SNR as
$$\varrho^{\rm W}_t(\omega_k) = \frac{\bar X^{\rm W}_t(\omega_k)}{\hat V_t(\omega_k)}. \qquad (5.61)$$
The final estimate of the a posteriori SNR is determined as
$$\varrho_t(\omega_k) = \max\left[\varrho^{\rm N}_t(\omega_k),\; \varrho^{\rm W}_t(\omega_k)\right]. \qquad (5.62)$$
By doing so, the estimated SNR approximates the true SNR and exhibits small fluctuation in various noise conditions.

Simulations. Figures 5.14 and 5.15 show the noise reduction performance of a parametric Wiener filter whose gain is computed through (5.56) and (5.62). For this experiment, η = 1.25. The signal is digitized with a sampling frequency of 16 kHz. White Gaussian noise is added to the signal with an SNR equal to 10 dB. Figure 5.14 shows the waveforms of the clean signal, the noise-corrupted signal, and the enhanced signal. Figure 5.15 gives the noise reduction performance; for comparison, the performance of the optimal Wiener filter is also presented. As can be seen, for most frequencies the parametric Wiener filter yields more noise reduction than the optimal Wiener filter.
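Putting (5.56) and (5.62) together, the gain computation reduces to a few lines; the sketch below (with a = b = 2 and η = 1.25 as in the simulations) also applies the mapping of out-of-range estimates into [0, 1] described above.

```python
import numpy as np

def wiener_gain(rho, a=2.0, b=2.0, eta=1.25):
    """Parametric Wiener gain of (5.56) from the a posteriori SNR
    estimate rho_t(w_k), e.g. the max-combined estimate of (5.62)."""
    base = 1.0 - eta / np.asarray(rho, dtype=float) ** b
    return np.clip(base, 0.0, 1.0) ** (1.0 / a)    # map estimates into [0, 1]

# low SNR -> gain near 0, high SNR -> gain near 1
print(wiener_gain(np.array([1.0, 2.0, 10.0])))
```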
5.5 Conclusions
Noise reduction has been investigated for more than 40 years. The approaches presented here can be classified into three categories: beamforming, adaptive noise cancellation, and adaptive spectral modification. This chapter reviewed these three techniques, and their noise reduction performance, in terms of the ratio between the power of the single-channel input and that of the output, was analyzed. In an independent diffuse noise condition, the amount of noise that can be canceled by a beamformer depends on the number of microphones in the array and the SNR of the input signal; more sensors lead to more noise reduction. The effectiveness of a beamformer in suppressing directional noise, however, depends on the angular separation between the signal and the noise source as well as on the number of microphones in the array.
Fig. 5.14. Performance of a speech enhancement system based on the parametric Wiener filtering technique. The upper trace shows the waveform of the original clean speech signal. The middle trace plots the noisy signal. The lower trace shows the waveform after speech enhancement.
Fig. 5.15. Actual noise reduction performance of the optimal Wiener filter (dotted line) and of a speech enhancement system based on the parametric Wiener filter (solid line).
The performance of adaptive noise cancellation is a function of the coherence between the input noise signal and the reference noise signal; significant correlation between the two signals is necessary for even a modest amount of noise cancellation. The spectral modification is a single-microphone technique.
It needs to construct an estimate of the noise spectrum before subtracting it from the corrupted speech signal. In general, spectral modification can achieve more noise reduction than a beamformer or an ANC method.

However, the amount of noise reduction is only one aspect of the problem. In practical applications, the distortion introduced is also a major concern. For a beamformer, if the look direction coincides with the signal's incidence angle, the signal will not be distorted; but if the look direction deviates from the true angle of incidence, the signal is subject to the undesirable effects of low-pass filtering. In ANC, the distortion is mainly caused by the presence of signal in the reference input. Widrow has shown in [3] that, for the optimal solution, the SNR at the output is the reciprocal of the SNR of the reference when some signal is present in the reference. This so-called "power inversion" is the result of signal cancellation. The artifacts introduced by the parametric Wiener filtering technique are more complicated to analyze. The dominant causes of distortion include the non-linear mapping of the negative magnitude estimates, the cross-term between signal and noise, and variations of the spectrum about its mean. These causes produce a distortion known as "musical noise" because of the way it sounds. Tremendous efforts have been made to eliminate this distortion, which must be taken into account when designing a speech enhancement system.
References

1. S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-27, pp. 113–120, Apr. 1979.
2. Y. Kaneda and J. Ohga, "Adaptive microphone-array system for noise reduction," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-34, pp. 1391–1400, Sept. 1986.
3. B. Widrow, J. R. Glover, J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hearn, J. R. Zeidler, E. Dong, and R. C. Goodlin, "Adaptive noise cancelling: principles and applications," Proc. IEEE, vol. 63, pp. 1692–1716, Dec. 1975.
4. Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 1109–1121, Dec. 1984.
5. A. V. Oppenheim, E. Weinstein, K. C. Zangi, M. Feder, and D. Gauger, "Single-sensor active noise cancellation," IEEE Trans. Speech Audio Processing, vol. 2, pp. 285–290, Apr. 1994.
6. B. Widrow, P. Mantey, L. Griffiths, and B. Goode, "Adaptive antenna systems," Proc. IEEE, vol. 55, pp. 2143–2159, Dec. 1967.
7. O. L. Frost, "An algorithm for linearly constrained adaptive array signal processing," Proc. IEEE, vol. 60, pp. 926–935, Aug. 1972.
8. J. Capon, "High resolution frequency-wavenumber spectrum analysis," Proc. IEEE, vol. 57, pp. 1408–1418, Aug. 1969.
9. W. F. Gabriel, "Spectral analysis and adaptive array superresolution techniques," Proc. IEEE, vol. 68, pp. 654–666, June 1980.
10. D. H. Johnson and D. E. Dudgeon, Array Signal Processing: Concepts and Techniques. Prentice Hall, Upper Saddle River, NJ, 1993.
11. M. M. Goulding and J. S. Bird, "Speech enhancement for mobile telephony," IEEE Trans. Vehicular Technology, vol. 39, pp. 316–326, Nov. 1990.
12. H. J. Kushner, "On closed-loop adaptive noise cancellation," IEEE Trans. Automat. Contr., vol. 43, pp. 1103–1107, Aug. 1998.
13. A. S. Abutaleb, "An adaptive filter for noise canceling," IEEE Trans. Circuits Syst., vol. 35, pp. 1201–1209, Oct. 1988.
14. S. A. Vorobyov and A. Cichocki, "Adaptive noise cancellation for multi-sensory signals," Fluctuation Noise Lett., vol. 1, pp. 13–23, 2001.
15. J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proc. IEEE, vol. 67, pp. 1586–1604, Dec. 1979.
16. B. Widrow and S. D. Stearns, Adaptive Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1985.
17. D. L. Duttweiler, "Proportionate normalized least mean square adaptation in echo cancelers," IEEE Trans. Speech Audio Processing, vol. 8, pp. 508–518, Sept. 2000.
18. J. Benesty, T. Gänsler, D. R. Morgan, M. M. Sondhi, and S. L. Gay, Advances in Network and Acoustic Echo Cancellation. Springer, Berlin, 2001.
19. P. Vary, "Noise suppression by spectral magnitude estimation – mechanism and theoretical limits," Signal Processing, vol. 8, pp. 387–400, July 1985.
20. W. Etter and G. S. Moschytz, "Noise reduction by noise-adaptive spectral magnitude expansion," J. Audio Eng. Soc., vol. 42, pp. 341–349, May 1994.
21. R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. Speech Audio Processing, vol. 9, pp. 504–512, July 2001.
22. H. G. Hirsch and C. Ehrlicher, "Noise estimation techniques for robust speech recognition," in Proc. IEEE ICASSP, 1995, vol. 1, pp. 153–156.
23. V. Stahl, A. Fischer, and R. Bippus, "Quantile based noise estimation for spectral subtraction and Wiener filtering," in Proc. IEEE ICASSP, 2000, vol. 3, pp. 1875–1878.
24. N. W. D. Evans and J. S. Mason, "Noise estimation without explicit speech, non-speech detection: a comparison of mean, median and modal based approaches," in Proc. EUROSPEECH, 2001, vol. 2, pp. 893–896.
25. E. J. Diethorn, "A subband noise-reduction method for enhancing speech in telephony and teleconferencing," in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 1997.
26. S. V. Vaseghi, Advanced Digital Signal Processing and Noise Reduction. John Wiley & Sons, West Sussex, England, 2000.
27. J. A. N. Flores and S. J. Young, "Continuous speech recognition in noise using spectral subtraction and HMM adaptation," in Proc. IEEE ICASSP, 1994, vol. 1, pp. 409–412.
28. J. Chen, K. K. Paliwal, and S. Nakamura, "Sub-band based additive noise removal for robust speech recognition," in Proc. EUROSPEECH, 2001, vol. 1, pp. 571–574.
29. N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series. John Wiley & Sons, New York, 1949.
6 Adaptive Beamforming for Audio Signal Acquisition

Wolfgang Herbordt and Walter Kellermann
University of Erlangen-Nuremberg, Telecommunications Laboratory
Cauerstr. 7, D-91058 Erlangen, Germany
E-mail: {herbordt, wk}@LNT.de

Abstract. This chapter provides an overview of adaptive beamforming techniques for speech and audio signal acquisition. We review basic concepts of optimum adaptive antenna arrays and show how these methods may be applied to meet the requirements of audio signal processing. In particular, we derive optimum beamformers using time-domain least-squares instead of frequency-domain minimum mean-squares criteria and are thereby not constrained by the commonly used narrowband and stationarity assumptions. We thus obtain a more general representation of various beamforming aspects relevant to our application. From this, a robust generalized sidelobe canceller (GSC) [1] emerges as an attractive solution for practical audio acquisition systems. Moreover, the general theoretical framework leads to new insights into the GSC behavior in complex practical situations.
6.1 Introduction
Array processing techniques strive to extract maximum information from a propagating wave field using groups of sensors located at distinct spatial positions. The array sensors transduce propagating waves into signals describing both a finite spatial and a temporal aperture. In analogy to temporal sampling, which leads to the discrete time domain, spatial sampling by sensor arrays forms the discrete space domain. Thus, with sensor arrays, signal processing operates in a multidimensional space-time domain. The processor which combines temporal and spatial filtering using sensor arrays is called a beamformer. Many properties and techniques known from temporal finite impulse response (FIR) filtering directly translate to beamforming based on finite spatial apertures.¹ Usually, FIR filters are placed in each of the sensor channels in order to obtain a beamformer with desired properties. Design methods for these filters can be classified into two categories: (a) the FIR filters are designed independently of the statistics of the sensor data (data-independent beamformer); (b) the FIR filters are designed depending on known or estimated statistics of the sensor data to optimize the array response for the given wave-field characteristics (data-dependent beamformer) [2].

¹ See [2] for a tutorial about beamforming relating properties of temporal FIR filtering and space-time processing.
J. Benesty et al. (eds.), Adaptive Signal Processing © Springer-Verlag Berlin Heidelberg 2003
156
W. Herbordt, W. Kellermann
Generally, array signal processing is applied to detection and estimation problems when a desired signal is captured in the presence of interference and noise. Arrays play an important role in areas like (a) detection of the presence of signal sources, (b) estimation of temporal waveforms or spectral contents of signals, (c) estimation of directions-of-arrival (DOAs) or positions of multiple sources, and (d) focusing on specific spatial locations for transmission. Traditionally, they are used in such diverse fields as radar, sonar, transmission systems, seismology, or medical diagnosis and treatment. A new and emerging research field is given by arrays of microphones for space-time acoustic signal processing. Typical applications include hands-free acoustic human-machine interfaces for enabling acoustic telepresence as desired for audio-/video-conferencing, dialogue systems, computer games, command-and-control interfaces, dictation systems, highquality audio recordings, and other multimedia services. All these applications of microphone arrays have in common that they focus on speech and audio signal acquisition (estimation) in the presence of noise and interference. Obviously, spatio-temporal filtering is preferable over temporal filtering alone, because desired signal and interference often overlap in time and/or frequency but originate from different spatial coordinates. Spatial filtering by beamforming allows separation without distortion of the desired signal. In practical situations, beamforming microphone arrays have to cope with highly non-stationary environments, multipath propagation due to early reflections and reverberation, wideband signals, and, finally, with restrictions in geometry, size, and number of sensors due to product design constraints. Adaptive beamformers are desirable, since they optimally extract nonstationary desired signals from non-stationary interference and noise for the observed sensor signals and for given array geometry. Dereverberation is an unresolved problem if the propagation paths are not known. As we will show later in this chapter, beamformers can partially dereverberate the desired signal, although reverberation is strongly correlated with the desired signal. The beamformers will generally increase the power ratio between direct path and reverberation. This chapter focuses on optimum adaptive beamforming for nonstationary wideband signals as they are captured by microphone arrays in typical audio applications. In contrast to many traditional presentations, we explicitly drop narrowband and stationarity assumptions and formulate optimum beamformers using time-domain least-squares instead of frequencydomain minimum mean-squares criteria. We thus obtain a more general representation of various beamforming aspects relevant to our application. The relation to narrowband beamforming based on the stationarity assumption is illustrated. In our discussion, we show how the fundamental methods are adapted when applied to audio signal acquisition. A rigorous comparison of different approaches shows optimality of a robust GSC for speech signal acquisition in reverberant and non-stationary
6
Adaptive Beamforming
157
nu,1 (k) x1 (k) p1 dc (t) p
w1 (k)
nu,2 (k) x2 (k)
p2
w2 (k)
y(k)
nu,M (k) nc (t)
xM (k) pM
wM (k)
Fig. 6.1. General filter&sum beamformer.
acoustic environments. In this chapter, we (a) relate this approach, which has only been motivated by intuition before [1], to the theoretical framework of optimum beamforming and (b) show its superiority relative to competing beamforming techniques. This chapter is organized as follows: in Sect. 6.2, we introduce basic beamforming concepts and the terminology that is used throughout the chapter. In Sect. 6.3, data-independent beamformers are briefly summarized. Section 6.4 discusses data-dependent optimum beamformer designs. (A more extensive treatment of the fundamental concepts can be found, e.g., in [2], [3], [4], [5], [6], [7].) Techniques for stationary narrowband signals are related to methods for non-stationary wideband signals. It is shown how these design methods are applied to the specific needs of audio signal processing. In Sect. 6.5, we focus on realization of optimum adaptive beamformers, whose coefficients are recursively updated from the incoming data. Section 6.6 describes a real-time system combining various of the previously described methods for speech signal acquisition [8].
6.2
Signal Model, Sensor Arrays, and Concepts
In this section, we first introduce the signal model and the beamformer setup (Sect. 6.2.1). Second, we introduce beamformer performance measures, which do not depend on the interference characteristics but only on the array geometry, the steering direction, and the beamformer coefficients (Sect. 6.2.2). Third, we give an overview of widely used array performance measures which depend on the interference characteristics (Sect. 6.2.3). 6.2.1
Sensor Array, Sensor Signals, and Beamformer Setup
The general beamformer setup is shown in Fig. 6.1. Lower case boldface and upper case boldface denote vectors and matrices, respectively. T and H denote vector or matrix transposition and complex conjugate transposition, respectively. We assume presence of a wideband desired (point) source dc (t) at
158
W. Herbordt, W. Kellermann
the spatial location p in a multipath environment and presence of wideband local interference nc (t). nc (t) includes reverberation of the desired signal. t is the continuous time variable. The M sensors sample the propagating wave fields at the locations pm at times kT , where k is the discrete time index and where T = 1/(2B) is the sampling period with the one-sided bandwidth B (in Hz) of the desired signal. We define dm (k) as the component of the wideband desired signal at the m-th sensor, which arrives via the direct signal path: dm (k) = dc (kT − τm ) ,
(6.1)
where τm = |p − pm |/c is the propagation delay between source and the m-th sensor, and where c is the sound velocity. With additional sensor noise nu,m (k), the interference contribution to the m-th sensor signal is captured in nm (k), nm (k) = nc (kT ) + nu,m (k) .
(6.2)
So far, no assumptions on the statistics of the sensor signal components are made. The sensor signals may now be written as xm (k) = dm (k) + nm (k) .
(6.3)
The beamformer output is obtained by convolution of the sensor data with time-varying FIR filter impulse responses and summation. That is, y(k) = wT (k)x(k) = wT (k) [d(k) + n(k)] ,
(6.4)
where the M beamformer filters of length N , wm (k), are combined in a tapstacked M N × 1 weight vector T T w(k) = w1T (k), w2T (k), . . . , wM (k) , (6.5) while the N × 1 vector wm (k) captures N time-varying tap weights, wm (k) = (w0,m (k), w1,m (k), . . . , wN −1,m (k))T .
(6.6)
Accordingly, we introduce the data vectors T
xm (k) = (xm (k), xm (k − 1), . . . , xm (k − N + 1)) , T x(k) = xT1 (k), xT2 (k), . . . , xTM (k) , T
dm (k) = (dm (k), dm (k − 1), . . . , dm (k − N + 1)) , T d(k) = dT1 (k), dT2 (k), . . . , dTM (k) , T
nm (k) = (nm (k), nm (k − 1), . . . , nm (k − N + 1)) , T n(k) = nT1 (k), nT2 (k), . . . , nTM (k) .
(6.7) (6.8) (6.9) (6.10) (6.11) (6.12)
For illustration of the beamforming operation, a vector space interpretation of y(k) = wT (k)x(k) can be given [9]: if the vector w(k) is orthogonal to
6
Adaptive Beamforming
159
z θ k
pM pm p M +1 2
∼τm y φ
x
p2 p1
Δ
Fig. 6.2. Coordinate system with a linear array along the z-axis centered to the spatial origin with an odd number of sensors M and sensor spacing Δ [4, Chap. 2].
the vector x(k), then, y(k) is equal to zero, and the signals are suppressed. If x(k) is in the vector space spanned by w(k), the signals are passed. The optimum data-dependent beamformer is thus designed such that (a) d(k) is in the space spanned by w(k) (‘beamsteering’) and (b) n(k) is orthogonal to w(k) (‘interference suppression’). The beamformer is steered to a specific position p if the filter weights w(k) equalize the propagation delays τm , ∀ m, such that the desired signal is time-aligned in all sensor channels, which leads to coherent superposition of the desired signal at the beamformer output. Generally, these equalization delays are non-integer multiples of the sampling period T . This means that the sensor signals with respect to (w.r.t.) the desired source need to be aligned by fractional delay interpolation filtering [10] incorporated in w(k).
6.2.2
Interference-Independent Beamformer Performance Measures
In the remainder of this section, we assume stationary monochromatic signals. This allows (a) to introduce various performance measures and (b) to motivate basic array processing techniques in Sect. 6.3. We assume the source to be located in the far-field, so that its signal dc (t) propagates as a monochromatic plane wave with frequency ω and wavelength λ along the wavenumber vector k=
2π ω a(θ, φ) = a(θ, φ) , λ c
(6.13)
where a(θ, φ) is a unit vector with spherical coordinates (r = 1, θ, φ) representing the propagation direction [3, Chap. 2] (see Fig. 6.2). Due to the plane wave assumption, the wavenumber vector k is independent of the sensor position. If d(k) = dc (kT ) is the signal that would be received at the origin
160
W. Herbordt, W. Kellermann
of the underlying coordinate system, then, intersensor propagation delays are obtained as τm =
aT (θ, φ)pm , c
(6.14)
and the desired signal component dm (k) can be written as ˆ exp j ωkT − kT pm , dm (k) = D
(6.15)
ˆ is the amplitude of the monochromatic wave. where D For defining array performance measures, which only depend on the steering direction, the array geometry, and the sensor weightings, but not on interference characteristics, we assume that the desired signal is normalized, i.e., D = 1, and that no interference is present, i.e., n(k) = 0. We transform (6.4) into the frequency domain using the discrete-time Fourier transform (DTFT) [11]. With (6.1), we have the DTFT of dm (k), Dm (ω) =
∞
ˆ exp{−jkT pm } , dm (k) exp{−jωkT } = D
(6.16)
k=−∞
and the DTFT of wm , Wm (ω) =
N −1
wi,m exp{−jωiT } ,
(6.17)
i=0
where we dropped for wm (k) the dependency on k due to stationarity of the sensor data. Defining H
wF (ω) = (W1 (ω), W2 (ω), . . . , WM (ω))
,
(6.18) H
v(ω, k) = (exp{jωτ1 }, exp{jωτ2 }, . . . , exp{jωτM }) H , = exp{jkT p1 }, exp{jkT p2 }, . . . , exp{jkT pM }
(6.19) (6.20)
ˆ = 1: we now obtain for the DTFT of y(k) with D H (ω)v(ω, k) . Y (ω, k) = wF
(6.21)
v(ω, k) is the steering vector, which incorporates the geometry of the array and the direction-of-arrival of the desired source signal. Y (ω, k) is the frequency-wavenumber response or beamformer response. It describes the complex gain for an input plane wave2 with wavenumber k and temporal frequency ω. The beampattern (array pattern) is defined as B(ω; θ, φ) = Y (ω, k) , 2
(6.22)
This definition can be extended to spherical waves as described, e.g., in [3, Chap. 4]
6
Adaptive Beamforming
161
10 0
Relative sidelobe level
P(ω;θ,φ) in dB
−10 −20 −30 −40 −50 −60 −70
Peak−to−zero distance
−80 −90 0
0.1
0.2
0.3
0.4
0.5
θ/π
0.6
0.7
0.8
0.9
1
Fig. 6.3. Power pattern P (ω; θ, φ) of a beamformer with uniform sensor weighting, Wm (ω) = 1/M , ∀ M , using the uniformly spaced sensor array given in Fig. 6.2 for M = 9, Δ = λ/2.
with θ ∈ [0; π], φ ∈ [0; 2π]. The beampattern is thus the frequencywavenumber function for plane waves evaluated on a sphere with spherical coordinates (θ, φ) and radius ω/c [4, Chap. 2]. The power pattern P (ω; θ, φ) = |B(ω; θ, φ)|2 is the squared magnitude of the beampattern. Figure 6.3 shows an example of a power pattern for the uniformly spaced sensor array of Fig. 6.2, with M = 9, Δ = λ/2, for uniform sensor weighting, Wm (ω) = 1/M , ∀ M . The array is steered to θ = π/2. The power pattern P (ω; θ, φ) is independent of the φ-coordinate due to the rotational symmetry of the sensor array relative to the z-axis. Non-ideal sensor characteristics may be accounted for by multiplying the filter transfer functions Wm (ω) with the frequency-wavenumber responses of the sensors Ys,m (ω, k). For this chapter, we assume identical isotropic (omnidirectional) sensors with Ys,m (ω, k) = 1, ∀ m. From the power pattern, two performance measures, which are often used as criteria for data-independent beamforming, may be derived: first, the mainlobe width may be measured, e.g., by the peak-to-zero distance3 , which will be used in this chapter: it is defined by the angle between the maximum of the mainlobe and the first null of the power pattern P (ω; Θ, Φ) at frequency ω. Second, the relative sidelobe level is the ratio of the mainlobe level and the level of the highest sidelobe (see Fig. 6.3). 6.2.3
Interference-Dependent Beamformer Performance Measures
We now define array performance measures, which depend on the interference components of the sensor signals: the array gain A(ω) measures the 3
See, e.g., [4, Chap. 2] for other measures for the mainlobe width.
162
W. Herbordt, W. Kellermann
ratio of output signal-to-interference-plus-noise ratio (SINR) to input SINR. The directivity D(ω) is the ratio of the beamformer output power of a signal arriving from the steering direction relative to the beamformer output power caused by an isotropic (diffuse) wave field. The directivity can be interpreted as the array gain relative to isotropic noise with the desired source being located in the steering direction of the beamformer. The directivity index is the directivity on a logarithmic scale (in dB): DI(ω) = 10 log10 D(ω). The array sensitivity TF (ω) against uncorrelated random errors like geometrical array perturbations and uncorrelated sensor noise can be measured by 1 H TF (ω) = 2π wF (ω)wF (ω) if the weight vector is normalized to provide unity gain for the steering direction [12]. Except for the array sensitivity, which can be written in the time domain as T (k) = wT (k)w(k), the interference-dependent array performance measures are defined for monochromatic waves. These performance measures can still be used with non-stationary wideband signals if the signals can be decomposed into narrow frequency bins (usually using the DFT), such that frequency bins are mutually statistically independent (narrowband decomposition). If the narrowband assumption does not hold, these array performance measures become imprecise due to statistical dependence between the frequency bins [4], [13]. When using the performance measures in the following, we assume that the narrowband assumption holds. For evaluating the practical system for non-stationary signals, in Sect. 6.6, we will, however, introduce measures which estimate time-averaged interference suppression of the beamformer.
6.3
Data-Independent Beamformer Design
For data-independent beamformers, the filtering of each sensor output, wm (k), is chosen to obtain a frequency-wavenumber response with specific properties. Usually, the beamformer response is shaped to pass signals arriving from a known position p with minimum distortion while suppressing undesired signals at known or unknown positions. With FIR filters in each sensor channel, we have spatial and temporal degrees of freedom to shape the spatio-temporal beamformer response. The beamformer designs are often specified for monochromatic plane waves with source positions in the far-field of the array. For applying these beamformer designs to near-field sources, where spherical wave propagation must be considered, near-field/farfield reciprocity [14], [15] can be exploited. For generalization to wideband signals, one may, first, apply the beamformer designs to a set of monochromatic waves which samples the desired frequency range and, second, use a conventional FIR filter design [16] to obtain the sensor filters. For monochromatic plane waves with wavenumber vector k impinging on the linear uniform (‘equi-spaced’) sensor array of Fig. 6.2 with sensor weights wm = w0,m , N = 1, the correspondence between temporal FIR filtering and
6
Adaptive Beamforming
163
beamforming becomes closest: the intersensor propagation delays relative to the spatial origin for plane waves are given by Δ cos θ M +1 τm = −m . (6.23) 2 c Writing the frequency-wavenumber response Y (ω, k), (6.21), with (6.18), (6.19), (6.23), as Y (ω, k) =
M
wm exp{−jωτm }
m=1
Δ cos θ M +1 −m , = wm exp −jω 2 c m=1 M
(6.24)
and identifying the product ωΔ cos θ/c in the exponent of (6.24) with the normalized frequency ω0 T , ω0 T = ωΔ cos θ/c, we see that, except for a phase shift, Y (ω, k) is equivalent to the DTFT of the sensor weight sequence wm , m = 1, 2, . . . , M , at frequency ω0 . Assuming a desired signal which arrives from θ = π/2 (‘broadside’), and choosing wm = 1/M , ∀ m, then, the beamformer provides unity gain for the desired signal and attenuates other directions θ, as the corresponding temporal rectangular window lets through signal components with frequency ω0 = 0 and suppresses other frequencies according to the DTFT of the rectangular window. For shaping the beamformer response for monochromatic plane waves of a given frequency ω, the amplitudes of the complex weighting w(ω) may incorporate windowing (tapering) of the spatial aperture to trade the mainlobe width against relative sidelobe height. Besides classical windowing functions [17], Dolph-Chebyshev windowing [18], which minimizes the mainlobe width for a given maximum sidelobe level, is widely used. Dolph-Chebyshev design gives a power pattern with equal sidelobe levels. In Fig. 6.4, power patterns for spatial windowing for monochromatic signals are depicted (M = 9, Δ = λ/2). Figure 6.4(a) shows power patterns for classical windows (‘von Hann’, ‘Hamming’ [17]) compared to uniform weighting, and Fig. 6.4(b) shows power patterns for Dolph-Chebyshev windows for relative sidelobe levels 20, 30, 40 dB. Figure 6.5 illustrates examples of power patterns for wideband signals (M = 9, Δ = 4 cm): in Fig. 6.5(a), the power pattern of a wideband beamformer with uniform weighting is shown. We see that the peak-to-zero distance decreases with increasing frequency f = ω/(2π). For high frequencies, the beamformer thus becomes sensitive to steering errors. For low frequencies, spatial separation of desired signal and interference becomes impossible [3, Chap. 3]. In Fig. 6.5(b), the power pattern of a wideband beamformer using a Dolph-Chebyshev design is shown. The FIR filters wm (k) are obtained by applying Dolph-Chebyshev windows to a set of discrete frequencies with a predefined frequency-invariant peak-to-zero distance of the power pattern. These frequency-dependent Dolph-Chebyshev windows are then fed into the
164
W. Herbordt, W. Kellermann (a)
(b)
0
0
von Hann Hamming uniform
−20
−30
−40
−50
−60 0.5
20 dB 30 dB 40 dB
−10
P(ω;θ,φ) in dB
P(ω;θ,φ) in dB
−10
−20
−30
−40
−50
0.6
0.7
θ/π
0.8
0.9
−60 0.5
1
0.6
0.7
θ/π
0.8
0.9
1
Fig. 6.4. Power pattern P (ω; θ, φ) of a beamformer with (a) von Hann, Hamming, and uniform windowing, and (b) Dolph-Chebyshev windowing with different relative sidelobe levels; sensor array of Fig. 6.2 with M = 9, Δ = λ/2. (a)
(b)
0
0
1
1
2
2
0 dB
f in kHz
f in kHz
−20 dB
3
3
4
4
Δ=λ/2
Δ=λ/2
5
5
6 0.5
0.6
0.7
θ/π
0.8
0.9
1
6 0.5
−40 dB
−60 dB
0.6
0.7
θ/π
0.8
0.9
1
−80 dB
Fig. 6.5. Power pattern P (ω; θ, φ) of a wideband beamformer with (a) uniform weighting, and (b) wideband Dolph-Chebyshev design with frequency-invariant peak-to-zero distance for frequencies f > 1700 Hz; uniform weighting is used for f ≤ 1700 Hz; sensor array of Fig. 6.2 with M = 9, Δ = 4 cm.
Fourier approximation filter design [16] to determine the FIR filters wm (k). Due to the limited spatial aperture of the sensor array, uniform weighting is used for frequencies f ≤ 1700 Hz. Dolph-Chebyshev (and equivalently Taylor) windowing thus allow to specify an (almost) arbitrary mainlobe width, which is frequency-independent over a wide range of frequencies, and which reduces the sensitivity due to steering errors. For designing data-independent beamformers with more control about the beamformer response than simple windowing techniques, one may apply methods which are similar to the design of arbitrary FIR filter transfer
6
Adaptive Beamforming
165
functions [16] as, e.g., windowed Fourier approximation, least-squares approximation of the desired power pattern, and minimax design to control the maximum allowable variation of mainlobe level and sidelobe levels (see references in [4, Chap. 3]). Data-independent beamformers with frequencyinvariant beampatterns in a certain frequency range are proposed in [19], [20], [21], [22].
6.4
Optimum Data-Dependent Beamformer Designs
In this section, we describe optimum data-dependent beamformer designs, which are obtained by maximization of various performance criteria. Most often, these beamformers are derived for at least wide-sense stationary (WSS) monochromatic waves. These methods directly translate to at least WSS wideband signals if the signals can be decomposed into narrow frequency bins such that the narrowband assumption holds. For simplicity, the discrete Fourier transform (DFT) is often used for the narrowband decomposition. However, corresponding optimum beamformers can also be derived in the time domain without the stationarity assumption if FIR filter structures are assumed. For wideband signals, most data-dependent beamformers can be classified either as minimum mean squared error (MMSE) design or as linearly constrained minimum variance (LCMV) design. Regardless of an implementation in the DFT domain or in the time-domain, both approaches generally use time-averaging over a finite temporal aperture to estimate the relevant statistics of the sensor data. Due to the non-stationarity of speech and audio signals, the shorttime DFTs [11] do no longer produce stationary and mutually orthogonal frequency bin signals. Therefore, we use here a time-domain representation to rigorously derive optimum data-dependent beamformers. The nonstationarity of the sensor data and the finite temporal apertures are directly taken into account: instead of using stochastic expectations, which must be replaced by time-averages for realizing the beamformer, we formulate the beamformers using least-squares (LS) criteria over finite data blocks. We thus obtain the MMSE-LS and the LCMV-LS beamformer. The MMSE-LS beamformer is known as least-squares error (LSE) beamformer [4, Chap. 6]. In Sect. 6.4.1, the LSE design is presented. In Sect. 6.4.2 the LCMV-LS beamformer is derived and it is shown how this approach can be efficiently realized in the so-called generalized sidelobe canceller (GSC) structure. The interpretation of LSE and LCMV-LS beamforming using eigenanalysis methods suggests the realization of optimum beamforming using eigendecomposition, leading to eigenvector beamformers. The use of these design methods for audio signal processing is considered (Sect. 6.4.3). In Sect. 6.4.4, the problem of correlation between the desired signal and interference is discussed, which is of special interest in reverberant acoustic environments.
166
W. Herbordt, W. Kellermann d(k)
x(k)
+ y(k)
w(k)
−
Fig. 6.6. Least-squares error (LSE) beamformer.
6.4.1
Least-Squares Error (LSE) Design
In this section, we first derive the optimum LSE processor. Second, we show the relationship with MMSE beamforming for monochromatic plane waves. Third, we illustrate how LSE (MMSE) beamforming is applied to speech and audio signal acquisition. Optimum LSE Processor. For formulating the LSE cost function, we define the estimation error e(k) as the difference between the desired response and the multi-channel filter output (see Fig. 6.6) e(k) = d(k) − wT (k)x(k) .
(6.25)
Defining the LSE cost function ξLSE (r) as the sum of K squared error samples with the sensor data overlapping by a factor α ≥ 1, and introducing the number of ‘new’ samples per block R = K/α, ξLSE (r) is obtained as ξLSE (r) =
rR+K−1
e2 (k) =
k=rR
rR+K−1
2 d(k) − wT (rR)x(k) .
(6.26)
k=rR
Introducing the desired response vector dr (k) of size K × 1 and the N M × K data matrix X(k), T
dr (k) = (d(k), d(k + 1), . . . , d(k + K − 1)) , X(k) = (x(k), x(k + 1), . . . , x(k + K − 1)) , then, ξLSE (r) may be written as: / /2 ξLSE (r) = /dr (rR) − XT (rR)w(rR)/ 2 .
(6.27) (6.28)
(6.29)
X(k) can be split into a data matrix with desired signal components Xd (k) and a data matrix with interference components Xn (k): X(k) = Xd (k) + Xn (k) .
(6.30)
The matrix Φ(k) = X(k)XT (k)
(6.31)
6
Adaptive Beamforming
167
can be interpreted as the instantaneous maximum-likelihood estimate of the sensor cross-correlation matrix at time k, obtained by time averaging over the latest K sampling intervals [4]. The reliability of the estimates of the cross-correlation matrix Φ(k) depends on the ratio formed by the number of samples K and the product N M [4, Chap. 7] 4 . Accordingly, the matrices Φd (k) = Xd (k)XTd (k) ,
(6.32)
Xn (k)XTn (k) ,
(6.33)
Φn (k) =
are estimates of sensor cross-correlation matrices for desired signal and interference, respectively. In the following, we assume orthogonality of Xd (k) and Xn (k), XTd (k)Xn (k) = 0. (See Sect. 6.4.4 for the non-orthogonal case.) We further assume Φ(k) to be invertible, which means that at least spatially and temporally orthogonal (‘non-coherent’) sensor noise with a full rank diagonal component in Φn (k) is present. Invertibility of Φ(k) implies that the matrix X(k) has maximum row-rank. This requires K ≥ M N , and defines a lower limit for the number of samples in one sensor data block K. By differentiation of ξLSE (r) w.r.t. w(rR) and by setting the derivative equal to zero, the minimum of ξLSE (r) is obtained as wLSE,o (rR) = Φ−1 (rR)X(rR)dr (rR) .
(6.34)
The product X(rR)dr (rR) can be interpreted as the time-averaged maximum-likelihood estimate of the cross-correlation vector between the sensor signals x(k) and the desired signal d(k) over a data block of length K. Applying the matrix inversion lemma [6] and after some rearrangements, we may write the optimum weight vector as −1 wLSE,o (rR) = Φ−1 dr (rR) n (rR)Xd (rR) [Λ(rR) + I]
(6.35)
with the identity matrix I and with Λ(k) = XTd (k)Φ−1 n (k)Xd (k) .
(6.36)
Introducing (6.35) into (6.29), the minimum of the cost function ξLSE (r) is obtained as / /2 / / −1 ξLSE,o (r) = /[Λ(rR) + I] dr (rR)/ . (6.37) 2
Note that, according to (6.35), separate observations of the desired signal and the interference signals are necessary for finding the LSE estimator. 4
In [4, Chap. 7], this dependency is derived for cross-power spectral density maˆ x (ω) of narrowband signals with frequency ω, where S ˆ x (ω) is the timetrices S averaged estimate of the cross-power spectral density matrix Sx (ω) for a window ˆ x (ω) depends on K/M . Assuming that, for wideband of size K. The reliability of S ˆ x (ω) requires a data block of N samples for estimating N frequency bins, signals, S then, the reliability depends on K/(N M ).
168
W. Herbordt, W. Kellermann X1 (ω) X2 (ω)
Λ(ω)S−1 n (ω)v(ω, k)
Sd (ω) Sd (ω)+Λ(ω)
Y (ω, k)
XM (ω)
Fig. 6.7. Interpretation of the minimum mean squared error (MMSE) beamformer for monochromatic waves with Xm (ω) the DTFT of xm (k), ∀ m, (after [4, Chap. 6]).
Relation of LSE to MMSE Beamforming. The corresponding statistical multi-channel Wiener filter (or MMSE processor) for monochromatic signals is obtained from the LSE criterion (6.29) by assuming an at least wide-sense (WS) ergodic plane wave d(k) for K → ∞. Then, the matrices (6.32) and (6.33) correspond to the sensor cross-correlation matrices Rd = E{d(k)dT (k)} and Rn = E{n(k)nT (k)}, respectively. (E{·} is the expectation operator.) Application of the Parseval theorem to (6.29) yields for the MMSE processor in the DTFT domain wMMSE,o (ω) =
Sd (ω) Λ(ω)S−1 n (ω)v(ω, k) , Sd (ω) + Λ(ω)
(6.38)
with −1 , Λ(ω) = vH (ω, k)S−1 n (ω)v(ω, k)
(6.39)
where Sd (ω) is the power spectral density of the desired signal and where Sn (ω) is the cross-power spectral density matrix for the interference components nm (k). Note that both the MMSE and the LSE processor do not provide unity gain for the desired signal. For the MMSE estimate, the weighting of the desired signal depends on the scalar fraction on the right side of (6.38). With increasing interference power, both the distortion of the desired signal and the interference suppression increase. Without the scalar fraction or without interference presence, unity gain is assured for the desired signal (see Sect. 6.4.2). The MMSE processor is depicted in Fig. 6.7. Application to Audio Signal Processing. MMSE beamformers for audio signal acquisition are usually realized in the DFT domain with the assumption of short-time stationary desired signals [23], [24], [25], [26]. The problem of estimating desired signal and interference separately can be addressed in two ways: (a) We assume stationary, or - relative to the desired signal - slowly timevarying interference. Then, the interference can be estimated during absence of the desired signal. An estimate of the desired signal is given by the difference of the magnitude spectra between the interference estimate and a reference signal, which contains desired signal and interference. An MMSE
6
Adaptive Beamforming
169
beamformer using a weighting function similar to single-channel spectral subtraction is presented in [23]. (b) Assumptions about the sensor cross-correlation matrix for interference components Φn (k) are made in order to improve the estimate of the desired signal at the beamformer output. This approach is often realized as a beamformer-plus-postfilter structure, similar to (6.39). The MMSE criterion is usually not fulfilled by this class of beamformers (see [24], [25], [26]). Besides, all methods for estimating the desired signal and/or the interference which are known for single-channel noise reduction may be generalized to the multi-channel case (see [27], [28], [29], [30]).
6.4.2
Least-Squares Formulation of Linearly Constrained Minimum Variance (LCMV) Beamforming: LCMV-LS Design
Application of LSE beamforming to speech and audio signal acquisition is limited, since (a) - potentially unacceptable - distortion of the desired signal is introduced by LSE beamforming, and (b) the desired signal itself can usually not be estimated. However, if secondary information about the desired source is available this can be introduced into the beamformer cost function to form a constrained optimization problem. The resulting linearly constrained beamformer using a least-squares formulation (LCMV-LS) is investigated in this section. We first outline the LCMV-LS design using the direct beamformer structure of Fig. 6.1. Second, we give the LCMV-LS design in the generalized sidelobe canceller structure, which allows more efficient realization of LCMV and LCMV-LS beamformers. The generalized sidelobe canceller will be used in Sect. 6.6 for deriving a practical system.
Direct LCMV-LS Design. The optimum LCMV-LS beamformer is derived first. Second, we interpret the LCMV-LS beamforming using a vector space representation. Third, commonly used constraints are reviewed for assuring a distortionless response for the desired signal. In many applications, these directional constraints are not sufficient for preventing distortion of the desired signal. For this purpose, we consider additional robustness constraints which may be introduced into the optimum weight vector. Then, we show the relation with other beamforming techniques. Finally, we summarize applications of direct LCMV-LS design to audio processing. Optimum LCMV-LS Beamformer. Since an accurate estimate of the desired signal is usually not available, it is desirable to introduce secondary infor-
170
W. Herbordt, W. Kellermann
mation about the desired source into the beamformer cost function. This transforms the unconstrained optimization into a constrained form as follows ξLC (r) =
rR+K−1
/ /2 y 2 (k) = /wT (rR)X(rR)/ 2 → min
(6.40)
k=rR
subject to Nc constraints CT (rR)w(rR) = c(rR) .
(6.41)
C(k) is the M N ×NcN constraint matrix with linearly independent columns. c(k) is the Nc N × 1 constraint vector5 . Note that Nc spatial constraints require Nc spatial degrees of freedom of the weight vector w(k), thus, only M − Nc spatial degrees of freedom are available for minimization of ξLC (r). The optimum LCMV-LS beamformer is found by minimization of the constrained cost function using Lagrange multipliers [3], [31] as −1 wLC,o (rR) = Φ−1 (rR)C(rR) CT (rR)Φ−1 (rR)C(rR) c(rR) . (6.42) Eigenspace Interpretation. For a better understanding, the LCMV-LS beamformer after (6.42) may be interpreted using an eigendecomposition of Φ(k). For this reason, we decompose Φ(k) into its eigenvectors and eigenvalues as follows: Φ(k) = U(k)Σ(k)UH (k) =
N M
σi2 (k)ui (k)uH i (k) ,
(6.43)
i=1
where U(k) is a unitary N M × N M matrix of eigenvectors ui (k), U(k) = (u1 (k), u2 (k), . . . , uN M (k)) ,
(6.44)
and where Σ(k) is a diagonal matrix of the corresponding eigenvalues σi2 (k), 2 Σ(k) = diag σ12 (k), σ22 (k), . . . , σN (6.45) M (k) , 2 −1 with6 σ12 (k) ≥ σ22 (k) ≥ · · · > σN (k) can thus be written as: M (k) > 0. Φ
Φ−1 (k) = U(k)Σ−1 (k)UH (k) =
N M i=1
5
6
1 ui (k)uH i (k) . σi2 (k)
(6.46)
In the constrained optimization problem (6.40), (6.41), the constraints (6.41) are only evaluated once every R samples and not for each sample k. It must thus be assumed that the constraints for times rR meet the requirements for the entire r-th block of input data X(rR). In Sect. 6.5, we reformulate the optimization problem for applying the exponentially windowed recursive least squares (RLS) algorithm [6]. Then, the constraints may vary for each data input sample k, which is required later in Sect. 6.6. The smallest eigenvalue is still greater than zero, because of the assumed presence of spatially and temporally non-coherent sensor noise at all sensors.
6
Adaptive Beamforming
171
Inserting wLC,o (rR) of (6.42) into (6.4), the LCMV-LS beamformer output y(k) can be written as y(k) = β T (rR)CT (rR)Φ−1 (rR)x(k) , −1 β(rR) = CT (rR)Φ−1 (rR)C(rR) c(rR) .
(6.47) (6.48)
The vector of sensor signals x(k) is first weighted with the inverse of the estimate of the cross-correlation matrix Φ−1 (rR). From (6.46), we see that signal components which are contained in the vector space of the i-th eigenvector ui (rR) are weighted with the inverse eigenvalue 1/σi2 (rR), ∀ i, independently of desired signal or interference. This means that signal suppression by the product Φ−1 (rR)x(k) increases with increasing eigenvalues σi2 (rR). Second, Φ−1 (rR)x(k) is projected into the vector space of the constraint matrix C(rR) by pre-multiplication of CT (rR). Third, multiplication with the vector β(rR) normalizes the beamformer output to fulfill the constraints (6.41). (Desired) signal components which are contained in the vector space of the constraints (6.41), are reprojected into the constrained subspace. (Interference) signal components which are orthogonal to the constraints are further suppressed. We see that suppression of sensor signals x(k) increases with increasing eigenvalues σi2 (rR) of Φ(rR) and with increasing orthogonality of x(k) relative to C(rR)β(rR). Spatial Constraints. Generally, the constraints are designed to assure a distortionless response for the desired signal. This means in absence of interference that y(k) = d(k). Beamformer constraints can be formulated in various ways. Directional constraints, also referred to as distortionless response criterion, require knowledge about the true position of the desired source. If several desired sources are present, multiple directional constraints may be used. Directional constraints often lead to cancellation of the desired signal if the present source position is not known exactly [9], [32], [33], [34], [35], [36]. For a better robustness against such look-direction errors, derivative constraints are often used. Thereby, the derivatives up to order Nd − 1 of the beampattern B(ω; θ, φ) w.r.t. (θ,φ) for the array steering direction must be zero. Derivative constraints thus increase the angular range of the directional constraints. The choice of Nd trades maximum allowable uncertainty of the position of the desired sources against the number of spatial degrees of freedom for suppression of interferers. For sufficient suppression of interferers, this technique is typically limited to small Nd [9], [33], [34], [37], [38], [39], [40], [41], [42], [43], [44]. A greater flexibility is offered by eigenvector constraints. They influence the beamformer response over a specified region in the frequency-wavenumber space while minimizing the necessary number of degrees of freedom. The number of degrees of freedom for shaping the desired response is controlled by selecting a specified set of eigenvectors of the constrained frequencywavenumber space for representing the desired response [13], [45].
172
W. Herbordt, W. Kellermann
By quiescent pattern constraints, a desired power pattern is specified over the entire frequency-wavenumber space. The cost function ξLC (r) is minimized by simultaneous approximation of the quiescent pattern in a leastsquares sense [37], [41], [46], [47], [48]. Robustness Improvement. Robust beamformers are beamformers whose performance degrades only smoothly in the presence of mismatched conditions. Typical problems are mismatched distortionless response constraints and array perturbations like random errors in position and complex gain of sensors (see e.g., [49], [50], [51]). For improving robustness of optimum beamformers, two techniques were developed which are both useful for unreliable spatial constraints and array perturbations: first, diagonal loading increases non-coherent sensor signal components relative to coherent signal components by augmenting Φ(k) by 2 a diagonal matrix σdl (k)I, 2 Φdl (k) = Φ(k) + σdl (k)I .
(6.49)
It can be shown that this puts an upper limit T0 (k) to the array sensitivity T (k) against uncorrelated errors, T (k) = wT (k)w(k) ≤ T0 (k), where T0 (k) 2 depends on the parameter σdl (k) [12], [52], [53], [54]. Second, elimination of the desired signal for computing Φ(rR) replaces Φ(rR) in (6.42) by Φn (rR), T −1 −1 wLC,o (rR) = Φ−1 c(rR) , (6.50) n (rR)C(rR) C (rR)Φn (rR)C(rR) and provides better robustness by performing the beamforming in the vector space, which is spanned by the interference [9]. This can be understood when replacing Φ(rR) by Φn (rR) in (6.47), (6.48). Essentially, this means that the desired signal characteristic has no influence on the beamforming any more. Relation with Other Beamforming Techniques. The LCMV-LS beamformer becomes the wideband statistical LCMV processor for time-invariant constraints if the cross-correlation matrix R = Rd + Rn is determined from Φ(k) in (6.42) by time-averaging (K → ∞): −1 wLCMV,o = R−1 C CT R−1 C c. (6.51) The relation to LSE beamforming is given for the special constraints XTd (rR)w(rR) = dr (rR) ,
(6.52)
and by identifying (6.52) with (6.41). For the optimum LCMV-LS weight vector after (6.50), we then obtain with (6.36) T −1 (rR)dr (rR) . wLC,o (rR) = Φ−1 n (rR)Xd (rR)Λ
(6.53)
This is equivalent to the optimum LSE beamformer (6.35) except for the matrix inverse, which is not augmented by the identity matrix here. The
6
Adaptive Beamforming
173
X1 (ω) X2 (ω)
Λ(ω)S−1 n (ω)v(ω, k)
Y (ω, k)
XM (ω)
f in kHz
Fig. 6.8. Minimum variance distortionless response (MVDR) beamformer for monochromatic waves (after [4, Chap. 6]). 0
0 dB
1
− 10dB
2
−20 dB
3
−30 dB
4
−40 dB
5
−50 dB
0
0.2
0.4
θ/π
0.6
0.8
1
−60 dB
Fig. 6.9. Power pattern of a wideband MVDR beamformer with interference arriving from θ = 0.37π; sensor array of Fig. 6.2 with M = 9, Δ = 4 cm.
augmentation by the identity matrix results in distortion of the desired signal, which depends on the interference characteristics. The statistical version of the special LCMV beamformer given by (6.53), is referred to as the minimum variance distortionless response (MVDR) beamformer. For at least WSS monochromatic signals, the MVDR processor is obtained by assuming an at least WSS monochromatic desired signal with wavenumber k and by applying the Parseval theorem to (6.40) in the DTFT domain as wMVDR,o (ω) = Λ(ω)S−1 n (ω)v(ω, k) .
(6.54)
Comparing (6.54) with the MMSE beamformer (6.38), we see that the MMSE beamformer can be interpreted as an MVDR beamformer with a weighting of each output frequency by the scalar Sd (ω)/(Sd (ω) + Λ(ω)) (see Figs. 6.7 and 6.8). In Fig. 6.9, a power pattern of an MVDR beamformer for wideband signals with wideband interference arriving from θ = 0.37π is illustrated. Application to Audio Signal Processing. In this section, we have illustrated (a) the basic concepts of constrained optimum beamforming and (b) the relation to LSE and MMSE beamforming. Separate estimates of sensor crosscorrelation matrices for desired signal and interference - as required for LSE and MMSE processors - are usually difficult to obtain. In LCMV-LS and
174
W. Herbordt, W. Kellermann x(k)
+
wc (k)
B(k)
y(k)
−
wa (k)
Fig. 6.10. Generalized sidelobe canceller.
LCMV beamforming, these separate estimates are therefore replaced by estimates of the overall sensor cross-correlations and constraints on estimated positions of the desired sources. This is especially important for non-stationary source signals, such as speech and audio signals. If, additionally, the desired source positions are fluctuating, the constraints might not be reliable. Then, distortion of the desired signal may be efficiently prevented by a combination of (a) widening the spatial constraints, (b) diagonal loading of the estimates of the cross-correlation matrix Φ(k), and (c) reducing the contribution of the desired signal in Φ(k) by observing Φn (k) separately, where Φn (k) is assumed to be slowly time-varying. LCMV beamforming has been extensively studied for audio signal processing for general applications in [55], for hearing aids in [56], and for speech recognition in [57], [58]. For array apertures, which are much smaller than the signal wavelength, LCMV beamformers can be realized as differential (‘superdirective’) arrays [59], [60]. Generalized Sidelobe Canceller (GSC). An efficient realization of the LCMV beamformer is the generalized sidelobe canceller [61] shown in Fig. 6.10. It is especially advantageous for adaptive realizations of LCMV beamformers, since the constrained optimization problem is transformed into an unconstrained one (see Sects. 6.5 and 6.6). The GSC splits the LCMV beamformer into two orthogonal subspaces: the first subspace satisfies the constraints, and, thus, ideally contains undistorted desired signal and filtered interference. It is given by the vector space of the upper processor in Fig. 6.10, with the N M × 1 tap weight vector wc (k) fulfilling (6.41): CT (k)wc (k) = c(k) .
(6.55)
The second subspace (lower path in Fig. 6.10) is orthogonal to wc (k). Orthogonality is assured by an N M × (M − Nc )N matrix B(k), which is orthogonal to each column of C(k): CT (k)B(k) = 0 .
(6.56)
6
Adaptive Beamforming
175
B(k) is called blocking matrix, since signals which are orthogonal to B(k) (or equivalently in the vector space of the constraints) are rejected. Ideally, the output of B(k) does not contain desired signal components, and, thus, is a reference for the interference. The remaining (M − Nc )N degrees of freedom of the blocking matrix output are used to minimize the squared GSC output signal y(k). With the tap-stacked weight vector T T T T wa (k) = wa,1 (k), wa,2 (k), . . . , wa,M−N (k) , (6.57) c where the M − Nc length-N FIR filters wa,m (k) are given by T
wa,m (k) = (wa,0,m (k), wa,1,m (k), . . . , wa,N −1,m (k)) ,
(6.58)
we write ξLC (r) =
rR+K−1
y 2 (k) =
k=rR
/ T / / wc (rR) − waT (rR)BT (rR) X(rR)/2 → min . 2
(6.59)
Note that same number of filter taps N is assumed for the FIR filters in wc (k) and in wa (k). The optimum solution wa,o (rR) is found in the same way as (6.42): −1 T wa,o (rR) = BT (rR)Φ(rR)B(rR) B (rR)Φ(rR)wc (rR). (6.60) Recalling (6.31), Φ(k) = X(k)XT (k), it may be seen that the inverted term in parentheses is the time-averaged estimate of the cross-correlation matrix of the blocking matrix output signals at time rR. Furthermore, the right term is the time-averaged estimate of the cross-correlation vector between the blocking matrix output signals and the upper signal path. The weight vector wa,o (rR) thus corresponds to an LSE processor [see (6.34)]. As the blocking matrix output does not contain desired signal components (for carefully designed constraints), it may be easily verified that the LSE-type beamformer wa,o (rR) does not distort the desired signal. Equivalence of the GSC with LCMV beamfomers is shown in [62], [63]. The GSC has been applied to audio signal processing, e.g., in [1], [8], [64], [65], [66], [67], [68], [69]. 6.4.3
Eigenvector Beamformers
Eigenvector beamformers reduce the optimization space of data-dependent beamformers for improving the reliability of the estimates of the crosscorrelation matrix Φ(k). Recall that a decreasing number of filter weights N M improves the reliability of Φ(k) for a given observation interval K (see Sect. 6.3 and [4, Chap. 7]).
176
W. Herbordt, W. Kellermann x1 (k) x2 (k) UQ (rR)
−1 Σ−1 Q (rR)CQ (rR)ΛLC,Q (rR)c(rR)
y(k))
xM (k)
Fig. 6.11. Interpretation of the eigenvector beamformer (after [4, Chap. 6]).
The eigenspace interpretation of LCMV-LS beamforming suggests to concentrate on the suppression of the large, most disturbing eigenvalues of Φn (k) in order to exploit advantages of cross-power spectral density matrices with less dimensions. For deriving the LCMV-LS beamformer in the eigenspace of desired-signal-plus-interference, we assume that the Q largest eigenvalues correspond to the subspace of desired-signal-plus-interference. The remaining N M −Q eigenvalues correspond to the subspace with interference components with small eigenvalues, which is not taken into account by the eigenspace LCMV-LS beamformer. The matrix with eigenvectors which correspond to the Q largest eigenvalues is UQ (k). ΣQ (k) is the diagonal matrix with the Q largest eigenvalues. Introducing CQ (k) = UH Q (k)C(k), the eigenspace optimum beamformer is obtained as: −1 wQ,o (rR) = UQ (rR)Σ−1 Q (rR)CQ (rR)ΛLC,Q (rR)c(rR) ,
ΛLC,Q (rR) =
−1 CH Q (rR)ΣQ (rR)CQ (rR) .
(6.61) (6.62)
Here, the input data matrix X(rR) and the constraints C(rR) are projected into the subspace which is spanned by UQ (rR). The sensor data is thus processed in the vector space of UQ (rR) (see Fig. 6.11). Signal components - ideally, background noise and sensor noise - which correspond to small eigenvalues are thus suppressed without ‘seeing’ the beamformer constraints, while signal components which correspond to large eigenvalues are suppressed according to the eigenvector realization of the LCMV-LS beamformer. Eigenvector beamformers are investigated in more detail in [51], [70], [71], [72], [73], [74], [75], [76], [77], [78], [79], [80], [81], [82]. Eigenvector beamformers using the GSC structure are discussed in [77], [83], [84], [85]. Subspace techniques for speech and audio signal acquisition have been presented first for reducing artifacts of spectral subtraction in single-channel speech enhancement. There, the noise subspace is estimated during silence of the desired source. The desired signal distortion is minimized in the desiredsignal-plus-noise subspace, while constraining the noise level at the processor output. Application is restricted to situations with slowly time-varying interference, i.e., with quasi-stationary background noise [86], [87], [88], [89], [90], [91], [92], [93]. Therefore, distortion of the desired signal is not prevented as it is for eigenvector beamforming. More recently, the subspace methods have been extended to the multi-channel case [94], [95], [96]. The result-
6
Adaptive Beamforming
177
ing beamformers are based on the optimum MMSE-type beamformer (6.34). Again, desired signal distortion is minimized in the desired-signal-plus-noise subspace subject to a constraint on the maximum allowed noise level at the beamformer output. Thus, multi-channel subspace methods for audio signal acquisition are closer to MMSE-type beamforming than to LCMV-type beamformers.
6.4.4
Suppression of Correlated Interference
The signal model, which has been described in Sect. 6.2, treats multi-path propagation of the desired source signal as interference and only the direct signal path is considered to be desired signal. This model is desirable, as, ideally, it allows to dereverberate the desired signal. However, in combination with data-dependent optimum beamformers as discussed so far, this model leads to cancellation of the desired signal due to high correlation of reflected signal components with the direct path signal [97], [98] resulting in annoying signal distortion. For preventing cancellation of the desired signal, various methods have been presented for applications other than audio. They can be classified into two categories: one class performs a decorrelation step prior to conventional beamforming. The other class integrates decorrelation with the beamforming: (a) Two techniques are considered, which decorrelate the sensor signal prior to the beamforming. First, spatial averaging [99], [100], [101], [102], [103], [104] carries out the correlation reduction by averaging data covariance matrices over a bank of subarrays. This effects a simultaneous reduction of desired signal distortion and interference suppression capability, since the effective number of sensors and the effective spatial aperture is only that of the subarrays [105], [106]. Second, frequency-domain averaging [107] reduces the correlation by averaging in the frequency domain. This method is only useful if the reciprocal of the signal bandwidth is less than twice the time delays between the desired signal and the correlated interference. Estimates of interferer positions are required. (b) Integration of decorrelation and beamforming is commonly performed by the split polarity transformation [108], by a special partially adaptive beamformer [109], and by (generalized) quadratic constraints [110], which require estimates of the source positions, sets of likely interference scenarios, and an estimate of the sensor cross-correlation matrix w.r.t. the interference, respectively. These techniques have not yet been studied for speech and audio signal acquisition so far. Most of the methods require estimates of interference characteristics, e.g., interference positions, that is, the directions of reflections. In highly non-stationary acoustic environments, estimation of these parameters is difficult. If any, spatial averaging seems to be the most promising approach. However, in highly correlated environments, a large number of subarrays may
178
W. Herbordt, W. Kellermann
Table 6.1 Overview of optimum signal-dependent beamformers for audio signal processing. Type
MMSE
Non-stationary wideband signals Cost function
ξLSE = dr − XT w22
Optimum filters
wLSE,o = Φ−1 Xdr
Stationary monochromatic signals Optimum filters
Advantages/ Disadvantages
LCMV
ξLC = XT w22 subject to CT w = c wLC,o = −1 Φ−1 C CT Φ−1 C c
d wMMSE,o = SdS+Λ ΛS−1 n v H −1 −1 Λ = v Sn v
wLCMV,o = −1 R−1 C CT R−1 C c wMVDR,o = ΛS−1 n v + Exploitation of eigenvector beamforming, − Sensitivity to reverberation of desired signal, − Dereverberation capability limited, + Position of desired source + Efficient realization as not required GSC, − Knowledge of desired + Various constraints signal required, possible, − Only slowly time-varying − Less suppression of interference possible, interference than LSE, − Distortion of desired − high complexity of signal direct method
be necessary which conflicts with common limitations on spatial aperture and number of sensors. Being unable to separate desired signal and reverberation in most promising approaches, reverberation is considered as desired signal for the constraint design to avoid cancellation of the desired signal. Thereby, on the one hand, dereverberation capability is reduced to increasing the power ratio of direct path to reverberation. On the other hand, robustness against cancellation of the desired signal is assured [1], [67], [69] (see also Sect. 6.6).
6.5
Adaptation of LCMV-LS Beamformers
The previous sections have shown that successive estimates of the crosscorrelation matrices of the non-stationary sensor signals are necessary to determine the optimum weight vectors. The resulting beamformers thus adapt to the incoming data. Basically, two adaptive beamformer approaches are possible: (a) The signal statistics may be estimated from successive blocks of incoming data and the currently optimum weight vectors can be computed
6
Adaptive Beamforming
179
using these data blocks. (b) The weights may be adjusted to the incoming data samples such that the weights converge to the optimum solution. While the first method requires matrix inversions for finding the optimum filter weights, these matrix inversions can be avoided using recursive adaptation algorithms for the second category. Thereby, the cost functions are recursively minimized, which considerably reduces the computational complexity. However, due to the necessary time for convergence to the optimum weight vector, the optimum solution might not always be tracked sufficiently fast depending on the non-stationarity of the sensor data. As a computationally very efficient and robust adaptation algorithm, the least-mean-square (LMS) algorithm [6] is prevalent. The exponentially weighted RLS algorithm [6] is computationally more intensive, but provides much faster convergence than the LMS algorithm. It solves the least-squares minimization problem at each iteration and is thus optimum in a deterministic sense. It can be shown that this deterministically obtained weight vector converges to the statistically optimum solution in a least-mean-square sense for stationary environments [6]. Computationally efficient versions exist in the time domain [7] and in the frequency domain (see Chap. 4). On the one hand, linearly constrained beamformers are desirable relative to MMSE and LSE beamformers due to the problem of estimating separate cross-correlation matrices for desired signal and interference and to assure undistorted desired signal. On the other hand, unconstrained adaptation algorithms are desirable for efficiency. The GSC structure combines these desirable properties by transforming the constrained optimization problem into an unconstrained one. Thus, fundamental algorithms for unconstrained adaptation can directly be applied, which simplifies adaptive realization, and reduces computational complexity. Due to these advantages, we concentrate in the following on an RLS formulation of the GSC. First, the block cost function ξLC (r) is transformed to a sample cost function by setting R = K = 1 in (6.59). Second, a forgetting factor λ over the summation of all past data samples is introduced. We obtain with (6.59):
ξLC (k) = (1 − λ)
k
λk−i y 2 (i)
i=1
= (1 − λ)
k T 2 ˆ aT (i)BT (i) x(i) , wc (i) − w
(6.63)
i=1
where the optimum weight vector is given by [see (6.60)]: ˆ ˆ a (k) = P(k)BT (k)Φ(k)w w c (k) −1
ˆ . P(k) = BT (k)Φ(k)B(k)
(6.64) (6.65)
180
W. Herbordt, W. Kellermann
ˆ Here, Φ(k) is the recursive estimate of Φ(k) at time k: ˆ Φ(k) = (1 − λ)
k
λk−i x(i)xT (i)
i=1
ˆ − 1) + (1 − λ)x(k)xT (k) . = λΦ(k
(6.66)
The factor (1 − λ) ensures unbiased estimation of Φ(k) for k → ∞. Applying the matrix inversion lemma to P(k) [6], we obtain: P(k) = λ−1 I − (1 − λ)q(k)xT (k)B(k) P(k − 1) , (6.67) q(k) =
λ−1 P(k − 1)BT (k)x(k) , 1 + λ−1 xT (k)B(k)P(k − 1)BT (k)x(k)
(6.68)
where (6.67) is known as Riccati equation. The Kalman gain vector q(k) can also be written as q(k) = P−1 (k)BT (k)x(k) .
(6.69)
We now develop a recursive update equation for the adaptive weight vector by introducing (6.66)-(6.68) into (6.64): ˆ a (k) = w ˆ a (k − 1) + (1 − λ)q(k)˜ w y (k) ,
(6.70)
where ˆ aT (k − 1)BT (k) x(k) . y˜(k) = wcT (k) − w
(6.71)
Finally, the algorithm needs to be initialized: usually, Φ(0) is augmented by a non-coherent noise term with variance σ02 , whose influence decays exponentially with λ. It follows for the inverse P(0) = Φ−1 (0): P(0) =
1 I. σ02
(6.72)
Quadratic constraints T0 (k) for controlling the array sensitivity against random errors T (k) can be incorporated into the exponentially weighted RLS algorithm according to [111]. Note that incorporation of diagonal loading also ˆ reduces the condition number of Φ(k), which is important for stability of the ˆ −1 (k) (‘regularization’). adaptation algorithm due to the dependency on Φ
6.6 A Practical Audio Acquisition System Using a Robust GSC
Our discussion showed that LCMV beamformers are preferable to the MMSE design for desired signal integrity. However, an inherent problem of adaptive LCMV beamforming for audio signal processing is robustness against multipath propagation of the desired source signal. In Sect. 6.4.4, we have seen that
conventional techniques for improving robustness against correlated interferers are limited by the array geometry or by the necessary information about the interference environment. For increased robustness, the GSC structure proposed in [1] includes the reflections in the constraints. A distortionless response w.r.t. the desired source is thus ensured. Due to the non-stationarity of the propagation paths (i.e., changes of the desired source location and changes of the acoustic environment), these constraints are made adaptive. Explicit estimates of the propagation paths are not required. In [67], this new distortionless response criterion was stated without addressing the problem of non-stationary acoustic environments. In this section, we first generalize the spatial constraints to incorporate spatial and temporal information about the desired source (‘spatio-temporal constraints’), which is especially helpful for taking the non-stationarity of the desired signal into account. Second, we develop a GSC structure leading to the one proposed in [1] and demonstrate its optimality in the LCMV sense; it will be referred to as the robust GSC. Third, we outline a DFT-domain realization of the robust GSC for (a) reducing computational complexity and (b) improving convergence behavior. Finally, we illustrate the performance by some experiments with data recorded in real environments (for a more detailed description of the robust GSC, see [1], [8], [69]).
6.6.1 Spatio-Temporal Constraints
Obviously, the constraints (6.41) are exclusively of spatial nature, while temporal information on the desired signal cannot be incorporated. However, for maximum flexibility for non-stationary colored desired signals, it is desirable to provide constraints which depend on both spatial and temporal characteristics of the desired signal. In the following, we extend the constrained optimization problem to incorporate both spatial and temporal constraints. We rewrite (6.41) as⁷

C_st^T(k) w_st(k) = c_st(k) ,   (6.73)
with the MN² × 1 vector w_st(k), the MN² × Nc N² constraint matrix C_st(k), and the Nc N² × 1 constraint vector c_st(k),

w_st(k) = [ w^T(k), w^T(k − 1), . . . , w^T(k − N + 1) ]^T ,   (6.74)
C_st(k) = diag{ C(k), C(k − 1), . . . , C(k − N + 1) } ,   (6.75)
c_st(k) = [ c^T(k), c^T(k − 1), . . . , c^T(k − N + 1) ]^T ,   (6.76)
respectively⁸. The constraint equation (6.73) repeats the spatial constraints C^T(k − i) w(k − i) = c(k − i) for N successive time instants i = 0, 1, . . . , N − 1.

⁷ The subscript st stands for spatio-temporal.
⁸ diag{·} transforms a vector into a diagonal matrix.
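For illustration, the stacked quantities (6.74)-(6.76) can be assembled as in the following minimal sketch, assuming that the per-instant quantities are available as NumPy arrays (the function name and argument conventions are choices of this sketch):

```python
import numpy as np
from scipy.linalg import block_diag

def stack_spatio_temporal(w_hist, C_hist, c_hist):
    """Build w_st, C_st, c_st from the N most recent spatial quantities,
    ordered [k, k-1, ..., k-N+1], cf. (6.74)-(6.76)."""
    w_st = np.concatenate(w_hist)    # stacked weight vector, (6.74)
    C_st = block_diag(*C_hist)       # block-diagonal constraint matrix, (6.75)
    c_st = np.concatenate(c_hist)    # stacked constraint vector, (6.76)
    return w_st, C_st, c_st
```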
This means that the constraints at time k and the N − 1 past constraints must be fulfilled simultaneously by the beamformer. Defining a stacked MN² × K matrix

X_s(k) = [ X^T(k), X^T(k − 1), . . . , X^T(k − N + 1) ]^T ,   (6.77)

we can rewrite the optimization criterion (6.40), (6.41) as

ξ_LC^(st)(r) = ‖ w_st^T(rR) X_s(rR) ‖₂²   (6.78)
subject to the constraints (6.73) evaluated at time k = rR. With

Φ_s(k) = X_s(k) X_s^T(k) ,   (6.79)
the solution of the constrained optimization problem is given by

w_LC,o^(st)(rR) = Φ_s^{−1}(rR) C_st(rR) [ C_st^T(rR) Φ_s^{−1}(rR) C_st(rR) ]^{−1} c_st(rR) ,   (6.80)
which is simply an extended version of (6.42).
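To make the structure of (6.80) concrete, here is a minimal NumPy sketch of the closed-form solution; the small diagonal loading term delta is an assumption added for numerical stability and is not part of (6.80):

```python
import numpy as np

def lcmv_st_weights(Phi_s, C_st, c_st, delta=1e-6):
    # w = Phi^{-1} C (C^T Phi^{-1} C)^{-1} c, cf. (6.80)
    Phi_loaded = Phi_s + delta * np.eye(Phi_s.shape[0])
    A = np.linalg.solve(Phi_loaded, C_st)          # Phi^{-1} C
    return A @ np.linalg.solve(C_st.T @ A, c_st)
```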
6.6.2 Robust GSC After [1] as an LCMV-LS Beamformer with Spatio-Temporal Constraints
In this section, we rigorously derive a robust GSC after [1] as a solution to the LCMV-LS optimization criterion with spatio-temporal constraints. First, the distortionless response criterion for the robust GSC is given. Second, it is illustrated how these constraints can be implemented efficiently. Third, a closed-form solution of the optimum weight vector w_a,st(k) is derived. Finally, we show how to enhance the performance of this structure. We specialize the spatio-temporal distortionless response criterion given by (6.73) as follows: we demand that the desired signal processed by the weight vector w_c(k) is not distorted at the GSC output for N successive samples, and we specify the constraints (6.73) with Nc = 1 as

C_st(k) = diag{ d(k), d(k − 1), . . . , d(k − N + 1) } ,   (6.81)
c_st(k) = C_st^T(k) w_c,s(k) ,   (6.82)

where

w_c,s(k) = [ w_c^T(k), w_c^T(k − 1), . . . , w_c^T(k − N + 1) ]^T .   (6.83)
The constraint equation (6.73) is always fulfilled, since by (6.82) the stacked constraint vector c_st(k) equals the left side of (6.73). Therefore, the weight vector w_c(k) can be chosen arbitrarily for defining the beamformer response for the desired signals. For equivalence of the GSC with an LCMV-LS beamformer, the columns of C_st(k) must be pairwise orthogonal to the columns of the blocking matrix
B_st(k), according to (6.56). With the spatio-temporal constraints (6.81), (6.82), this simply means that the upper signal path in Fig. 6.10 has to be orthogonal to the lower signal path w.r.t. the desired signal d(k − i), i = 0, 1, . . . , N − 1. In the lower signal path, desired signal components must be suppressed. This can be achieved by introducing FIR filters with coefficient vectors b_m(k) between the output of the upper beamformer w_c(k) and each of the M lower signal paths (see Fig. 6.12). For simplicity, we use N × 1 vectors b_m(k), which are captured by an N × M matrix B_b(k):

B_b(k) = ( b_1(k), b_2(k), . . . , b_M(k) ) .   (6.84)
For suppressing desired signal components in the lower signal paths, the output of the upper beamformer, i.e., of the stacked weight vector w_c,s(k), has to be orthogonal to each lower signal path for the desired signal. This can be achieved by determining B_b(k) such that the time-averaged principle of orthogonality [6] w.r.t. the desired signal is fulfilled:

W_c,s^T(k) Φ_d,s(k) W_c,s(k) B_b(k) = W_c,s^T(k) Φ_d,s(k) J^(M)_{MN²×M} ,   (6.85)

with the selection matrix J^(M)_{MN²×M} as defined in (6.90) below.
Φ_d,s(k) is the time-averaged estimate of the sensor cross-correlation matrix w.r.t. the desired signal [see (6.32), (6.79)]. The MN² × N matrix W_c,s(k) is defined as:

W_c,s(k) = diag{ w_c(k), w_c(k − 1), . . . , w_c(k − N + 1) } .   (6.86)
The right side of (6.85) is a stacked matrix of the latest N maximum-likelihood estimates of cross-correlations between the output of w_c,s(k) and the sensor signals. Accordingly, the quadratic form on the left side of (6.85) is a stacked matrix of the latest N maximum-likelihood estimates of auto-correlations of the output of w_c,s(k). The optimum blocking matrix B_b,o(k) is found by solving (6.85) for B_b(k), which is usually performed by recursive adaptation algorithms (see below). The output of B_b(k) must be orthogonal to the output of w_c,s(k) only for desired signals. Therefore, (6.85) must be solved during periods where only the desired signal is present. Otherwise, if interference is simultaneously present, the outputs of the blocking matrix become orthogonal to the upper path w.r.t. both desired signal and interference; then, not only the desired signal but also interference will be suppressed by the blocking matrix, and the array gain of the GSC is reduced. However, in realistic situations, only Φ_s(k) can be determined, which contains both desired signal and interference. Detecting data blocks where only the desired signal is present for estimating Φ_d,s(k) fails due to the non-stationarity of speech and audio signals. Another method is the detection of eigenvectors of Φ_s(k) which are orthogonal to Φ_n,s(k), the time-averaged estimate of the sensor cross-correlation matrix for interference only. B_b(k) is then only updated in the space spanned by the eigenvectors of Φ_d,s(k), which allows a better approximation of the optimum B_b(k).
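In practice, (6.85) is solved recursively rather than in closed form. As a simplified illustration of the idea, an NLMS-style update of the blocking-matrix filters (rather than the RLS update derived later in the text) might look as follows; the step size mu, the regularizer eps, and the explicit gating flag are assumptions of this sketch:

```python
import numpy as np

def abm_update(Bb, yc_buf, x_now, desired_active, mu=0.5, eps=1e-8):
    """One sample of an adaptive blocking matrix driven by the orthogonality
    condition (6.85): each filter b_m predicts the desired-signal component
    of sensor m from the fixed-beamformer output and subtracts it.
    Bb: N x M filter matrix; yc_buf: last N beamformer outputs (newest first);
    x_now: current M sensor samples; desired_active: gating flag."""
    y_b = x_now - Bb.T @ yc_buf              # blocking-matrix outputs, cf. (6.88)
    if desired_active:                       # adapt only when the desired signal dominates
        Bb = Bb + mu * np.outer(yc_buf, y_b) / (yc_buf @ yc_buf + eps)
    return Bb, y_b
```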
Fig. 6.12. Robust generalized sidelobe canceller after [1].
Computing B_b(k) based on the principle of orthogonality makes it possible to track changes of the desired source position and of the acoustic environment without explicitly estimating the position of the desired source and without explicitly estimating room impulse responses, in contrast to [67]. We now formulate the output signal y(k) of the robust GSC according to y(k) in (6.59), which allows us to derive recursive update equations for the adaptive weight vector w_a,st(k), similarly to Sect. 6.5. The output signal vector

y_b(k) = ( y_b,1(k), y_b,2(k), . . . , y_b,M(k) )^T   (6.87)
of the blocking matrix B_st(k) of size MN² × M can be written as:

y_b(k) = [ J^(M)T_{MN²×M} − B_b^T(k) W_c,s^T(k) ] x_s(k) = B_st^T(k) x_s(k) ,   (6.88)
where

x_s(k) = [ x^T(k), x^T(k − 1), . . . , x^T(k − N + 1) ]^T ,   (6.89)
J^(M)_{MN²×M} = ( 1^(0)_{MN²×1}, 1^(M)_{MN²×1}, . . . , 1^(MN−M)_{MN²×1} ) ,   (6.90)
with 1^(i)_{MN²×1} being an MN² × 1 vector of zeroes whose i-th element equals one⁹. Note that the MN² × M matrix J^(M)_{MN²×M} rearranges the entries of x_s(k) to match the vector y_b(k). The MN² × M blocking matrix B_st(k) describes the response between the sensors and the blocking matrix output and fulfills the spatio-temporal constraints (6.81), (6.82). The GSC output signal can then be written as

y(k) = [ w_c^T(k) J^(1)T_{MN²×MN} − w_a,st^T(k) B_st,s^T(k) ] x_ss(k) ,   (6.91)
⁹ The index i of J^(i) means that the non-zero entry of the columns of J^(i) is shifted by i entries from one column to the next.
where

B_st,s(k) = diag{ B_st(k), B_st(k − 1), . . . , B_st(k − N + 1) } ,   (6.92)
x_ss(k) = [ x_s^T(k), x_s^T(k − 1), . . . , x_s^T(k − N + 1) ]^T ,   (6.93)
J^(1)_{MN²×MN} = ( 1^(0)_{MN²×1}, 1^(1)_{MN²×1}, . . . , 1^(MN−1)_{MN²×1} ) ,   (6.94)
and where w_a,st(k) is now arranged as

w_a,st(k) = ( w_a,0,1(k), w_a,0,2(k), . . . , w_a,0,M(k), w_a,1,1(k), . . . , w_a,N−1,M(k) )^T .   (6.95)

Note that we assume the same number of filter coefficients N for the weight vectors in w_c(k), B_b(k), and w_a,st(k). The optimum filter weights w_a,st,o(rR) are found according to (6.59); w_a,st,o(k) depends on the time-varying blocking matrix B_st(k). Using the above eigendecomposition-based adaptation strategy, B_st(k) can only be determined for the subspace of Φ_s(k) which is orthogonal to the interference. For practical systems, it cannot always be assured that the subspace of the blocking matrix output is orthogonal to the output of w_c(k) w.r.t. the desired signal, since it is often difficult to separate the subspace of Φ_s(k) accurately. For minimum distortion of the desired signal at the GSC output, the vector w_a,st,o(k) should therefore only be updated in the subspace which is orthogonal to the desired signal, i.e., when only noise is present. For additional robustness, diagonal loading may be used as proposed in [1]. Finally, we can summarize the algorithm of the robust GSC as follows (see also the sketch below): the output signals y_b(k) of the blocking matrix and y(k) of the GSC are calculated in a straightforward way according to (6.88) and (6.91), respectively. The blocking matrix filters B_b(k) and the weight vector w_a,st(k) are updated recursively using exponentially weighted RLS algorithms instead of direct matrix inversion, in order to reduce the computational complexity. The adaptation rule for w_a,st(k) is obtained by using (6.91) in (6.63). The update equations for B_b(k) may be derived by replacing y(k) by y_b(k) in (6.63) and by following the procedure given in Sect. 6.5.
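As a structural illustration of the signal flow in Fig. 6.12 (adaptation omitted; the buffer handling, causality delays, and all gating logic are simplifications of this sketch, and all names are illustrative):

```python
import numpy as np

def robust_gsc_step(x_frame, w_c, Bb, w_a, yc_buf, yb_buf):
    """One output sample of a robust-GSC signal flow.
    x_frame: M x N matrix of the N most recent samples per sensor.
    w_c: M x N fixed-beamformer coefficients; Bb: N x M ABM filters;
    w_a: M x N AIC filters; yc_buf: last N beamformer outputs (newest first);
    yb_buf: M x N recent ABM output samples."""
    y_c = np.sum(w_c * x_frame)           # fixed beamformer (upper path)
    y_b = x_frame[:, 0] - Bb.T @ yc_buf   # ABM: suppress the desired signal
    y = y_c - np.sum(w_a * yb_buf)        # AIC: subtract interference estimate
    return y, y_b
```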
6.6.3 Realization in the DFT-Domain
We have seen that alternating adaptation of the blocking matrix B_st(k) in the eigenspace of the desired signal and of the weight vector w_a,st(k) in the eigenspace of the interference is necessary to obtain maximum suppression of interferers with minimum distortion of the desired signal. However, the complexity of diagonalizing the estimates of the sensor cross-correlation matrices might exceed the available computational resources. Therefore, we propose the discrete Fourier transform (DFT) for an approximate diagonalization of the time-averaged cross-correlation matrix [112]. This transforms
the classification of eigenvectors of Φ_s(k) into a corresponding classification of frequency bins and simplifies a practical realization of the robust GSC [8], [69]. In Chap. 4, a DFT-domain RLS criterion is used to derive frequency-domain adaptive filters. Straightforward application of these results transforms the robust GSC into the DFT domain. In the following, the blocking matrix B_st(k) and the weight vector w_a,st(k) are referred to as the adaptive blocking matrix (ABM) and the adaptive interference canceller (AIC), respectively. The beamformer w_c(k) in the upper path is realized as a fixed beamformer. In order to allow movements of the desired source within a predefined interval of DOAs, the Dolph-Chebyshev-based beamformer of Fig. 6.5(b) is used. More details about the realization of the robust GSC can be found in [1], [69]. Usage of a fixed beamformer is acceptable if the desired source is located at a more or less fixed position, as, e.g., for microphone arrays mounted on a computer screen with the user located in front of the screen. If the desired source is moving around or if several desired sources are active, then speaker localization methods for steering the main beam [113], [114], [115] and/or several fixed beamformers which are steered to likely positions of desired sources, combined with beam voting algorithms, may be applied [116].
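The approximation underlying this realization can be checked numerically: for a Toeplitz correlation matrix, the unitary DFT matrix nearly diagonalizes it, in line with Gray's asymptotic result [112]. A toy example (the matrix size and correlation values are arbitrary choices of this sketch):

```python
import numpy as np
from scipy.linalg import toeplitz

N = 64
Phi = toeplitz(0.9 ** np.arange(N))       # correlation matrix of an AR(1)-like process

F = np.fft.fft(np.eye(N)) / np.sqrt(N)    # unitary DFT matrix
D = F @ Phi @ F.conj().T                  # approximately diagonal

off = D - np.diag(np.diag(D))
print("off-diagonal energy ratio:", np.linalg.norm(off)**2 / np.linalg.norm(D)**2)
```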
6.6.4 Experimental Evaluation
In this section, we briefly illustrate the performance of the previously described robust GSC in a real acoustic environment. Robustness of the ABM against time-varying multipath propagation (a changing position of the desired source) is illustrated, which is crucial for the application of the robust GSC. We use a uniform linear array with M = 8, Δ = 4 cm, and broadside steering in an office environment with reverberation time T60 = 300 ms. The male desired speaker and the male interferer, with an average signal-to-interference ratio (SIR) of 0 dB, are located at 60 cm distance from the array center at θ = π/2 and θ = π/6, respectively. The considered frequency range is 300 Hz - 5.8 kHz, processed at a sampling rate of 12 kHz. Cancellation of the desired signal, DS(k), and interference rejection, IR(k), of the robust GSC are measured over time as the ratio of a first-order low-pass filtered version of the instantaneous squared beamformer output signal to the correspondingly smoothed instantaneous squared sensor signals averaged across the sensors, evaluated w.r.t. the desired signal and w.r.t. the interference, respectively.
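A minimal sketch of how such time-varying measures can be computed from the separately available desired-signal and interference components; the smoothing constant alpha and the regularizer eps are assumptions of this sketch:

```python
import numpy as np

def smoothed_power(x, alpha=0.999):
    # first-order low-pass (leaky integrator) of the instantaneous squared signal
    p = np.empty_like(x)
    acc = 0.0
    for n, v in enumerate(x):
        acc = alpha * acc + (1.0 - alpha) * v * v
        p[n] = acc
    return p

def ratio_db(y_comp, x_comps, alpha=0.999, eps=1e-12):
    """Smoothed output power over smoothed sensor power averaged across sensors,
    in dB; applied to the desired components for DS(k) and to the interference
    components for IR(k)."""
    num = smoothed_power(y_comp, alpha)
    den = np.mean([smoothed_power(x, alpha) for x in x_comps], axis=0)
    return 10.0 * np.log10((num + eps) / (den + eps))
```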
Fig. 6.13. Robustness against multipath propagation of the ABM in comparison with the BM after [61]; position change of the desired speaker at 1.66 s. (Panels, top to bottom: desired and interfering signal waveforms d_i(k) and n_i(k); desired signal cancellation DS(k) in dB; interference rejection IR(k) in dB; each over time in s, for the ABM and the fixed BM.)
Figure 6.13 depicts the results for the ABM in comparison to a fixed blocking matrix (BM) after [61]. Parameters are chosen to provide the same interference suppression for the GSC with the fixed BM and with the ABM. The fixed blocking matrix is designed to suppress signals from a single propagation path, which results in a fluctuating cancellation of the desired signal of about 5 dB on average. By using the ABM, cancellation of the desired signal at the GSC output is efficiently prevented even directly after a change of the position of the desired speaker (from θ = π/2 to θ = 5π/9 at 1.66 s). This is
achieved as (a) reverberation of the desired signal is efficiently suppressed by the ABM and (b) ABM and AIC track time-varying signal characteristics sufficiently fast.
6.7 Conclusions
This chapter has shown how basic concepts of optimum adaptive antenna array processing can be applied to the specific problems of speech and audio signal acquisition. In particular, the problems of non-stationarity of audio signals and of multipath propagation in acoustic environments have been addressed. We have seen that many traditional concepts of narrowband adaptive beamforming can be extended to deal with these problems. Optimum results are obtained with systems which give up stochastic optimality criteria in favor of instantaneous data-based deterministic criteria, in order to take the non-stationarity of the source signals and the reverberation into account. Such systems ensure high interference suppression and robustness against cancellation of the desired signal, and thereby provide high output signal quality.
References

1. O. Hoshuyama, A. Sugiyama, and A. Hirano, "A robust adaptive beamformer for microphone arrays with a blocking matrix using constrained adaptive filters," IEEE Trans. on Signal Processing, vol. 47, no. 10, pp. 2677-2684, Oct. 1999.
2. B. D. Van Veen and K. M. Buckley, "Beamforming: a versatile approach to spatial filtering," IEEE ASSP Magazine, vol. 5, no. 2, pp. 4-24, Apr. 1988.
3. D. H. Johnson and D. E. Dudgeon, Array Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1993.
4. H. L. Van Trees, Optimum Array Processing, Part IV of Detection, Estimation, and Modulation Theory, John Wiley, New York, 2002.
5. B. Widrow and S. D. Stearns, Adaptive Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1985.
6. S. Haykin, Adaptive Filter Theory, Prentice Hall, Englewood Cliffs, NJ, 3rd edition, 1996.
7. M. G. Bellanger, Adaptive Digital Filters, Marcel Dekker, New York, 2001.
8. W. Herbordt, H. Buchner, and W. Kellermann, "An acoustic human-machine front-end for multimedia applications," EURASIP Journal on Applied Signal Processing, vol. 2002, no. 12, Dec. 2002.
9. H. Cox, "Resolving power and sensitivity to mismatch of optimum array processors," J. Acoust. Soc. Am., vol. 54, no. 3, pp. 771-785, 1973.
10. T. I. Laakso, V. Välimäki, M. Karjalainen, and U. K. Laine, "Splitting the unit delay - tools for fractional filter design," IEEE Signal Processing Magazine, pp. 30-60, Jan. 1996.
11. A. V. Oppenheim and R. W. Schafer, Digital Signal Processing, Prentice Hall, Englewood Cliffs, NJ, 1975.
12. H. Cox, R. M. Zeskind, and T. Kooij, "Robust adaptive beamforming," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 35, no. 10, pp. 1365-1376, Oct. 1987.
13. K. M. Buckley, "Spatial/spectral filtering with linearly constrained minimum variance beamformers," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 35, no. 3, pp. 249-266, Mar. 1987.
14. R. A. Kennedy, T. Abhayapala, and D. B. Ward, "Broadband nearfield beamforming using a radial beampattern transformation," IEEE Trans. on Signal Processing, vol. 46, no. 8, pp. 2147-2155, Aug. 1998.
15. R. A. Kennedy, D. B. Ward, and T. Abhayapala, "Near-field beamforming using radial reciprocity," IEEE Trans. on Signal Processing, vol. 47, no. 1, pp. 33-40, Jan. 1999.
16. T. W. Parks and C. S. Burrus, Digital Filter Design, John Wiley, New York, 1987.
17. F. J. Harris, "On the use of windows for harmonic analysis with the discrete Fourier transform," Proc. IEEE, vol. 66, no. 1, pp. 51-83, Jan. 1978.
18. C. L. Dolph, "A current distribution for broadside arrays which optimizes the relationship between beam width and side-lobe level," Proc. I.R.E. and Waves and Electrons, vol. 34, no. 6, pp. 335-348, June 1946.
19. M. M. Goodwin and G. W. Elko, "Constant beamwidth beamforming," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, 1993, pp. 169-172.
20. T. Chou, "Frequency-independent beamformer with low response error," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, 1995, pp. 2995-2998.
21. M. Van der Wal, E. W. Start, and D. de Vries, "Design of logarithmically spaced constant-directivity transducer arrays," J. Audio Eng. Soc., vol. 44, no. 6, pp. 497-507, June 1996.
22. D. B. Ward, R. A. Kennedy, and R. C. Williamson, "Constant directivity beamforming," in Microphone Arrays: Signal Processing Techniques and Applications, M. S. Brandstein and D. B. Ward, Eds., chapter 1, pp. 3-17. Springer, Berlin, Germany, 2001.
23. D. A. Florencio and H. S. Malvar, "Multichannel filtering for optimum noise reduction in microphone arrays," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, 2001, pp. 197-200.
24. K. U. Simmer, J. Bitzer, and C. Marro, "Post-filtering techniques," in Microphone Arrays: Signal Processing Techniques and Applications, M. S. Brandstein and D. B. Ward, Eds., chapter 3, pp. 39-60. Springer, Berlin, Germany, 2001.
25. R. Martin, "Small microphone arrays with postfilters for noise and acoustic echo cancellation," in Microphone Arrays: Signal Processing Techniques and Applications, M. S. Brandstein and D. B. Ward, Eds., pp. 255-279. Springer, Berlin, Germany, 2001.
26. I. McCowan and H. Bourlard, "Microphone array post-filter for diffuse noise field," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, 2002, pp. 905-908.
27. Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 33, no. 2, pp. 443-445, Apr. 1985.
28. R. Martin, "Noise power spectral density estimation based on optimal smoothing and minimum statistics," IEEE Trans. on Speech and Audio Processing, vol. 9, no. 5, pp. 504-512, July 2001.
29. I. Cohen, "Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator," IEEE Signal Processing Letters, vol. 9, no. 4, pp. 113-116, Apr. 2002.
30. I. Cohen and B. Berdugo, "Microphone array post-filtering for non-stationary noise suppression," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, 2002, pp. 901-904.
31. O. L. Frost, "An algorithm for linearly constrained adaptive processing," Proc. IEEE, vol. 60, no. 8, pp. 926-935, Aug. 1972.
32. C. L. Zahm, "Effects of errors in the direction of incidence on the performance of an adaptive array," Proc. IEEE, vol. 60, no. 8, pp. 1008-1009, Aug. 1972.
33. A. M. Vural, "A comparative performance study of adaptive array processors," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, 1977, pp. 695-700.
34. A. M. Vural, "Effects of perturbations on the performance of optimum/adaptive arrays," IEEE Trans. on Aerospace and Electronic Systems, vol. 15, no. 1, pp. 76-87, Jan. 1979.
35. R. T. Compton Jr., "Pointing accuracy and dynamic range in a steered beam adaptive array," IEEE Trans. on Aerospace and Electronic Systems, vol. 16, no. 3, pp. 280-287, May 1980.
36. R. T. Compton Jr., "The effect of random steering vector errors in the Applebaum adaptive array," IEEE Trans. on Aerospace and Electronic Systems, vol. 18, no. 5, pp. 392-400, Sept. 1982.
37. S. P. Applebaum and D. J. Chapman, "Adaptive arrays with main beam constraints," IEEE Trans. on Antennas and Propagation, vol. 24, no. 9, pp. 650-662, Sept. 1976.
38. A. K. Steele, "Comparison of directional and derivative constraints for beamformers subject to multiple linear constraints," IEE Proc. F, H, vol. 130, pp. 41-45, 1983.
39. M. H. Er and A. Cantoni, "Derivative constraints for broad-band element space antenna array processors," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 31, no. 6, pp. 1378-1393, Dec. 1983.
40. K. M. Buckley and L. J. Griffiths, "An adaptive generalized sidelobe canceller with derivative constraints," IEEE Trans. on Antennas and Propagation, vol. 34, no. 3, pp. 311-319, Mar. 1986.
41. C.-Y. Tseng, "Minimum variance beamforming with phase-independent derivative constraints," IEEE Trans. on Antennas and Propagation, vol. 40, no. 3, pp. 285-294, Mar. 1992.
42. C.-Y. Tseng and L. J. Griffiths, "A simple algorithm to achieve desired pattern for arbitrary arrays," IEEE Trans. on Signal Processing, vol. 40, no. 11, pp. 2737-2746, Nov. 1992.
43. I. Thng, A. Cantoni, and Y. H. Leung, "Constraints for maximally flat optimum broadband antenna arrays," IEEE Trans. on Signal Processing, vol. 43, no. 6, pp. 1334-1347, June 1995.
44. S. Zhang and I. L.-J. Thng, "Robust presteering derivative constraints for broadband antenna arrays," IEEE Trans. on Signal Processing, vol. 50, no. 1, pp. 1-10, Jan. 2002.
45. M. H. Er and A. Cantoni, "An alternative formulation for an optimum beamformer with robustness capability," IEE Proc. F, vol. 132, pp. 447-460, Oct. 1985.
46. L. J. Griffiths and K. M. Buckley, "Quiescent pattern control in linearly constrained adaptive arrays," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 35, no. 7, pp. 917-926, July 1987.
47. B. D. van Veen, "Optimization of quiescent response in partially adaptive beamformers," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 38, no. 3, pp. 471-477, Mar. 1990.
48. S. L. Sim and M. H. Er, "An effective quiescent pattern control strategy for GSC structure," IEEE Signal Processing Letters, vol. 3, no. 8, pp. 236-238, Aug. 1996.
49. L. C. Godara, "Error analysis of the optimal antenna array processors," IEEE Trans. on Aerospace and Electronic Systems, vol. 22, no. 4, pp. 395-409, July 1986.
50. N. K. Jablon, "Adaptive beamforming with the generalized sidelobe canceller in the presence of array imperfections," IEEE Trans. on Antennas and Propagation, vol. 34, no. 8, pp. 996-1012, Aug. 1986.
51. W. S. Youn and C. K. Un, "Robust adaptive beamforming based on the eigenstructure method," IEEE Trans. on Signal Processing, vol. 42, no. 6, pp. 1543-1547, June 1994.
52. E. N. Gilbert and S. P. Morgan, "Optimum design of directive antenna arrays subject to random variables," Bell Syst. Techn. Journal, vol. 34, pp. 637-663, May 1955.
53. M. Uzsoky and L. Solymar, "Theory of super-directive linear arrays," Acta Physica, Acad. Sci. Hung., vol. 6, pp. 195-204, 1956.
54. H. Cox, R. M. Zeskind, and T. Kooij, "Practical supergain," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 34, no. 3, pp. 393-398, June 1986.
55. J. Bitzer and K. U. Simmer, "Superdirective microphone arrays," in Microphone Arrays: Signal Processing Techniques and Applications, M. S. Brandstein and D. B. Ward, Eds., chapter 2, pp. 19-38. Springer, Berlin, Germany, 2001.
56. J. E. Greenberg and P. M. Zurek, "Microphone-array hearing aids," in Microphone Arrays: Signal Processing Techniques and Applications, M. S. Brandstein and D. B. Ward, Eds., chapter 11, pp. 229-253. Springer, Berlin, Germany, 2001.
57. M. Omologo, M. Matassoni, and P. Svaizer, "Speech recognition with microphone arrays," in Microphone Arrays: Signal Processing Techniques and Applications, M. S. Brandstein and D. B. Ward, Eds., chapter 15, pp. 331-353. Springer, Berlin, Germany, 2001.
58. I. A. McCowan and S. Sridharan, "Microphone array sub-band speech recognition," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, 2001, pp. 185-188.
59. G. W. Elko, "Microphone array systems for hands-free telecommunication," Speech Communication, vol. 20, pp. 229-240, 1996.
60. G. W. Elko, "Superdirectional microphone arrays," in Acoustic Signal Processing for Telecommunication, S. L. Gay and J. Benesty, Eds., chapter 10, pp. 181-237. Kluwer Academic Publishers, Boston, 2000.
61. L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. on Antennas and Propagation, vol. 30, no. 1, pp. 27-34, Jan. 1982.
62. C. W. Jim, "A comparison of two LMS constrained optimal array structures," Proc. IEEE, vol. 65, no. 12, pp. 1730-1731, Dec. 1977.
63. K. M. Buckley, "Broad-band beamforming and the generalized sidelobe canceller," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 34, no. 5, pp. 1322-1323, Oct. 1986.
64. D. van Compernolle, "Switching adaptive filters for enhancing noisy and reverberant speech from microphone array recordings," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 2, 1990, pp. 833-836.
65. S. Nordholm, I. Claesson, and B. Bengtsson, "Adaptive array noise suppression of handsfree speaker input in cars," IEEE Trans. on Vehicular Technology, vol. 42, pp. 514-518, Nov. 1993.
66. S. Affes and Y. Grenier, "A signal subspace tracking algorithm for microphone array processing of speech," IEEE Trans. on Speech and Audio Processing, vol. 5, no. 5, pp. 425-437, Sept. 1997.
67. S. Gannot, D. Burshtein, and E. Weinstein, "Signal enhancement using beamforming and nonstationarity with applications to speech," IEEE Trans. on Signal Processing, vol. 49, no. 8, pp. 1614-1626, Aug. 2001.
68. W. H. Neo and B. Farhang-Boroujeny, "Robust microphone arrays using subband adaptive filters," Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 6, pp. 3721-3724, 2001.
69. W. Herbordt and W. Kellermann, "Frequency-domain integration of acoustic echo cancellation and a generalized sidelobe canceller with improved robustness," European Transactions on Telecommunications, vol. 13, no. 2, pp. 123-132, Mar. 2002.
70. E. K. L. Hung and R. M. Turner, "A fast beamforming algorithm for large arrays," IEEE Trans. on Aerospace and Electronic Systems, vol. 19, no. 4, pp. 598-607, July 1983.
71. T. K. Citron and T. Kailath, "An improved eigenvector beamformer," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 3, 1984, pp. 33.3.1-33.3.4.
72. W. F. Gabriel, "Using spectral estimation techniques in adaptive processing antenna systems," IEEE Trans. on Antennas and Propagation, vol. 34, no. 3, pp. 291-300, Mar. 1986.
73. B. Friedlander, "A signal subspace method for adaptive interference cancellation," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 36, no. 12, pp. 1835-1845, Dec. 1988.
74. J. W. Kim and C. K. Un, "Noise subspace approach for interference cancellation," IEEE Electronic Letters, vol. 25, no. 11, pp. 712-713, May 1989.
75. A. M. Haimovich, "An eigenanalysis interference canceller," IEEE Trans. on Signal Processing, vol. 39, no. 1, pp. 76-84, Jan. 1991.
76. L. Chang and C. C. Yeh, "Performance of DMI and eigenspace-based beamformers," IEEE Trans. on Antennas and Propagation, vol. 40, no. 11, pp. 1336-1347, Nov. 1992.
77. J.-L. Yu and C.-C. Yeh, "Generalized eigenspace-based beamformers," IEEE Trans. on Signal Processing, vol. 43, no. 11, pp. 2453-2461, Nov. 1995.
78. A. Haimovich, "The eigencanceler: Adaptive radar by eigenanalysis methods," IEEE Trans. on Aerospace and Electronic Systems, vol. 32, no. 2, pp. 532-542, Apr. 1996.
79. S.-J. Yu and J.-H. Lee, "The statistical performance of eigenspace-based adaptive array beamformers," IEEE Trans. on Antennas and Propagation, vol. 44, no. 5, pp. 665-671, May 1996.
80. T. R. Messerschmitt and R. A. Gramann, "Evaluation of the dominant mode rejection beamformer using reduced integration times," IEEE Journal of Oceanic Engineering, vol. 22, no. 2, pp. 385-392, Apr. 1997.
81. C.-C. Lee and J.-H. Lee, "Eigenspace-based adaptive array beamforming with robust capabilities," IEEE Trans. on Antennas and Propagation, vol. 45, no. 12, pp. 1711-1716, Dec. 1997.
82. J.-H. Lee and C.-C. Lee, "Analysis of the performance and sensitivity of an eigenspace-based interference canceller," IEEE Trans. on Antennas and Propagation, vol. 48, no. 5, pp. 826-835, May 2000.
83. N. L. Owsley, "Sonar array processing," in Array Signal Processing, S. Haykin, Ed., chapter 3, pp. 115-193. Prentice Hall, Englewood Cliffs, NJ, 1985.
84. B. D. van Veen, "Eigenstructure based partially adaptive array design," IEEE Trans. on Antennas and Propagation, vol. 36, no. 3, pp. 357-362, Mar. 1988.
85. I. Scott and B. Mulgrew, "A sparse approach in partially adaptive linearly constrained arrays," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 4, pp. 541-544, 1994.
86. M. Dendrinos, S. Bakamidis, and G. Carayannis, "Speech enhancement from noise," Speech Communication, vol. 10, no. 2, pp. 45-57, Feb. 1991.
87. Y. Ephraim and H. L. van Trees, "A signal subspace approach for speech enhancement," IEEE Trans. on Speech and Audio Processing, vol. 3, no. 2, pp. 251-266, July 1995.
88. S. H. Jensen, P. C. Hansen, S. D. Hansen, and J. A. Sorensen, "Reduction of broad-band noise in speech by truncated QSVD," IEEE Trans. on Speech and Audio Processing, vol. 3, no. 6, pp. 439-448, Nov. 1995.
89. U. Mittal and N. Phamdo, "Signal/Noise KLT based approach for enhancing speech degraded by colored noise," IEEE Trans. on Speech and Audio Processing, vol. 8, no. 2, pp. 159-167, Mar. 2000.
90. F. Asano, S. Hayamizu, T. Yamada, and S. Nakamura, "Speech enhancement based on the subspace method," IEEE Trans. on Speech and Audio Processing, vol. 8, no. 5, pp. 497-507, Sept. 2000.
91. A. Rezayee and S. Gazor, "An adaptive KLT approach for speech enhancement," IEEE Trans. on Speech and Audio Processing, vol. 9, no. 2, pp. 87-95, Feb. 2001.
92. F. Jabloun and B. Champagne, "A perceptual signal subspace approach for speech enhancement in colored noise," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, 2002, pp. 569-572.
93. M. Klein and P. Kabal, "Signal subspace speech enhancement with perceptual post-filtering," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, 2002, pp. 537-540.
94. F. Jabloun and B. Champagne, "A multi-microphone signal subspace approach for speech enhancement," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, 2001, pp. 205-208.
95. Y. Hu and P. C. Loizou, "A subspace approach for enhancing speech corrupted by colored noise," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 1, 2001, pp. 573-576.
96. S. Doclo and M. Moonen, "GSVD-based optimal filtering for multi-microphone speech enhancement," in Microphone Arrays: Signal Processing Techniques and Applications, M. S. Brandstein and D. B. Ward, Eds., chapter 6, pp. 111-132. Springer, Berlin, Germany, 2001.
97. B. Widrow, K. M. Duvall, R. P. Gooch, and W. C. Newman, "Signal cancellation phenomena in adaptive antennas: causes and cures," IEEE Trans. on Antennas and Propagation, vol. 30, no. 3, pp. 469-478, May 1982.
98. K. Duvall, Signal Cancellation in Adaptive Arrays: Phenomena and a Remedy, Ph.D. thesis, Department of Electrical Engineering, Stanford University, 1983.
99. T.-J. Shan and T. Kailath, "Adaptive beamforming for coherent signals and interference," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 33, no. 6, pp. 527-534, June 1985.
100. Y.-L. Su, T.-J. Shan, and B. Widrow, "Parallel spatial processing: a cure for signal cancellation in adaptive arrays," IEEE Trans. on Antennas and Propagation, vol. 34, no. 3, pp. 347-355, Mar. 1986.
101. K. Takao, N. Kikuma, and T. Yano, "Toeplitzization of correlation matrix in multipath environment," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 3, 1986, pp. 1873-1876.
102. K. Takao and N. Kikuma, "Adaptive array utilizing an adaptive spatial averaging technique for multipath environments," IEEE Trans. on Antennas and Propagation, vol. 35, no. 12, pp. 1389-1396, Dec. 1987.
103. S.-C. Pei, C.-C. Yeh, and S.-C. Chiu, "Modified spatial smoothing for coherent jammer suppression without signal cancellation," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 36, no. 3, pp. 412-414, Mar. 1988.
104. S. U. Pillai and B. H. Kwon, "Forward/backward spatial smoothing techniques for coherent signal identification," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 37, no. 1, pp. 8-15, Jan. 1989.
105. V. U. Reddy, A. Paulraj, and T. Kailath, "Performance analysis of the optimum beamformer in the presence of correlated sources and its behavior under spatial smoothing," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 35, no. 7, pp. 927-936, July 1987.
106. C.-C. Yeh, S.-C. Chiu, and S.-C. Pei, "On the coherent interference suppression using a spatially smoothing adaptive array," IEEE Trans. on Antennas and Propagation, vol. 37, no. 7, pp. 851-857, July 1989.
107. J. F. Yang and M. Kaveh, "Coherent signal-subspace transformation beamformer," IEE Proc. F, vol. 137, no. 4, pp. 267-275, Aug. 1990.
108. M. Lu and Z. He, "Adaptive beam forming using split-polarity transformation for coherent signal and interference," IEEE Trans. on Antennas and Propagation, vol. 41, no. 3, pp. 314-324, Mar. 1993.
109. F. Qian and B. D. van Veen, "Partially adaptive beamforming for correlated interference rejection," IEEE Trans. on Signal Processing, vol. 43, no. 2, pp. 506-515, Feb. 1995.
110. F. Qian and D. B. van Veen, "Quadratically constrained adaptive beamforming for coherent signals and interference," IEEE Trans. on Signal Processing, vol. 43, no. 8, pp. 1890-1900, Aug. 1995.
111. Z. Tian, K. L. Bell, and H. L. van Trees, "A recursive least squares implementation for LCMP beamforming under quadratic constraint," IEEE Trans. on Signal Processing, vol. 49, no. 6, pp. 1138-1145, June 2001.
112. R. M. Gray, "On the asymptotic eigenvalue distribution of Toeplitz matrices," IEEE Trans. on Information Theory, vol. 18, no. 6, pp. 725-730, Nov. 1972.
113. J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, "Robust localization in reverberant rooms," in Microphone Arrays: Signal Processing Techniques and Applications, M. S. Brandstein and D. B. Ward, Eds., chapter 8, pp. 157-180. Springer, Berlin, Germany, 2001.
114. E. D. Di Claudio and R. Parisi, "Multi-source localization strategies," in Microphone Arrays: Signal Processing Techniques and Applications, M. S. Brandstein and D. B. Ward, Eds., chapter 9, pp. 181-201. Springer, Berlin, Germany, 2001.
115. N. Strobel, S. Spors, and R. Rabenstein, "Joint audio-video signal processing for object localization and tracking," in Microphone Arrays: Signal Processing Techniques and Applications, M. S. Brandstein and D. B. Ward, Eds., chapter 10, pp. 203-225. Springer, Berlin, Germany, 2001.
116. W. Kellermann, "A self-steering digital microphone array," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, vol. 5, 1991, pp. 3581-3584.
7 Blind Source Separation of Convolutive Mixtures of Speech

Shoji Makino
NTT Communication Science Laboratories
2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan
E-mail: [email protected]
Abstract. This chapter introduces the blind source separation (BSS) of convolutive mixtures of acoustic signals, especially speech. A statistical and computational technique called independent component analysis (ICA) is examined. By achieving nonlinear decorrelation, nonstationary decorrelation, or time-delayed decorrelation, we can find the source signals from the observed mixed signals alone. Particular attention is paid to the physical interpretation of BSS from the acoustical signal processing point of view. Frequency-domain BSS is shown to be equivalent to two sets of frequency-domain adaptive microphone arrays, i.e., adaptive beamformers (ABFs). Although BSS can reduce reverberant sounds to some extent in the same way as an ABF, it mainly removes the sounds from the jammer direction. This is why BSS has difficulties with long reverberation in the real world. If the sources are not "independent," their dependence results in bias noise when obtaining the correct unmixing filter coefficients. Therefore, the performance of BSS is limited by that of the ABF. Although BSS is upper bounded by ABF, BSS has a strong advantage over ABF: BSS can be regarded as an intelligent version of ABF in the sense that it can adapt without any information on the array manifold or the target direction, and sources can be simultaneously active in BSS.
7.1 Introduction
Speech recognition is a fundamental technology for communication with computers, but with existing computers the recognition rate drops rapidly when more than one person is speaking or when there is background noise. On the other hand, humans can engage in comprehensible conversations at a noisy cocktail party. This is the well-known cocktail-party effect, where the individual speech waveforms are extracted from the mixtures. The aim of source separation is to provide computers with this cocktail-party ability, thus making it possible for computers to understand what a person is saying at a noisy cocktail party. Blind source separation (BSS) is an emerging technique which enables the extraction of target speech from observed mixed speech without prior knowledge of the source positions, the spectral structure of the sources, or the mixing system. To achieve this, attention has focused on a method based on independent component analysis (ICA). ICA extracts independent sounds from among mixed sounds.
This chapter considers ICA in a wide sense, namely nonlinear decorrelation together with nonstationary decorrelation and time-delayed decorrelation. These three methods are discussed in a unified manner [1]. There are a number of applications for the BSS of mixed speech signals in the real world [2], but the separation performance is still not good enough [3], [4]. Since ICA is a purely statistical process, the separation mechanism has not been clearly understood in the sense of acoustic signal processing, and it has been difficult to know which components were separated, and to what degree. Recently, the ICA method has been investigated in detail, and its mechanisms have gradually been uncovered by theoretical analysis from the perspective of acoustic signal processing [5] as well as by experimental analysis based on impulse responses [6]. The mechanism of BSS based on ICA has been shown to be equivalent to that of an adaptive microphone array system, i.e., N sets of adaptive beamformers (ABFs) with an adaptive null directivity aimed in the direction of unnecessary sounds. From the equivalence between BSS and ABF, it becomes clear that the physical behavior of BSS reduces the jammer signal by steering a spatial null towards the jammer. BSS can further be regarded as an intelligent version of ABF in the sense that it can adapt without any information on the source positions or on the periods of source existence/absence. The aim of this chapter is to describe BSS and to introduce ICA in terms of acoustical signal processing. Section 7.2 outlines the framework of BSS for convolutive mixtures of speech. Section 7.3 briefly summarizes the background theory of ICA. In Sect. 7.4, the separation mechanism is described for both second-order statistics and higher-order statistics approaches. The discussion is expanded to a physical interpretation of BSS, compared with an ABF, in Sect. 7.5, and the equivalence between BSS and ABF is shown theoretically in Sect. 7.6. This equivalence leads to important discussions on performance in Sect. 7.7 and on limitations in Sect. 7.8. Section 7.9 provides experimental results that confirm the discussions. Section 7.10 describes experimental conditions with measured impulse responses in a real room and six combinations of male and female speech. The chapter finishes with a summary of the main conclusions.
7.2 What Is BSS?
Blind source separation (BSS) is an approach for estimating source signals si (n) using only the information of mixed signals xj (n) observed at each input channel. Typical examples of such source signals include mixtures of simultaneous speech signals that have been picked up by several microphones, brain waves recorded by multiple sensors, and interfering radio signals arriving at a mobile station.
Fig. 7.1. BSS system configuration.
7.2.1 Mixed Signal Model for Speech Signals in a Room
In the case of audio source separation, several sensor microphones are placed in different positions so that each records a mixture of the original source signals at a slightly different time and level. In the real world where the source signals are speech and the mixing system is a room, the signals that are picked up by the microphones are affected by reverberation [7], [8]. Therefore, the N source signals recorded by M microphones are modeled as

x_j(n) = Σ_{i=1}^{N} Σ_{p=1}^{P} h_ji(p) s_i(n − p + 1)   (j = 1, · · · , M),   (7.1)
where s_i is the source signal from source i, x_j is the signal received by microphone j, and h_ji is the P-tap impulse response from source i to microphone j. This chapter focuses on speech signals as sources; these are nongaussian, nonstationary, colored, and have a zero mean.
7.2.2 Unmixed Signal Model
To obtain the unmixed signals, unmixing filters w_ij(k) of Q taps are estimated, and the unmixed signals are obtained as

y_i(n) = Σ_{j=1}^{M} Σ_{q=1}^{Q} w_ij(q) x_j(n − q + 1)   (i = 1, · · · , N).   (7.2)
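As an illustration of (7.1) and (7.2), here is a minimal NumPy sketch of convolutive mixing and unmixing; the array shapes and function names are choices of this sketch, not of the chapter:

```python
import numpy as np

def convolutive_mix(s, h):
    """Mix N sources into M observations, cf. (7.1).
    s: N x T source signals; h: M x N x P room impulse responses."""
    M, N, P = h.shape
    T = s.shape[1]
    x = np.zeros((M, T))
    for j in range(M):
        for i in range(N):
            x[j] += np.convolve(s[i], h[j, i])[:T]
    return x

def convolutive_unmix(x, w):
    """Apply Q-tap unmixing filters, cf. (7.2). w: N x M x Q."""
    N, M, Q = w.shape
    T = x.shape[1]
    y = np.zeros((N, T))
    for i in range(N):
        for j in range(M):
            y[i] += np.convolve(x[j], w[i, j])[:T]
    return y
```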
Fig. 7.2. Task of blind source separation of speech signals.
The unmixing filters are estimated so that the unmixed signals become mutually independent. This chapter considers a two-input, two-output convolutive BSS problem, i.e., N = M = 2 (Fig. 7.1), without loss of generality.
7.2.3 Task of Blind Source Separation of Speech Signals
It is assumed that the source signals s1 and s2 are mutually independent. This assumption usually holds for sounds in the real world. There are two microphones which pick up the mixed speech. Only the observed signals x1 and x2 are available, and they are dependent. The goal is to adapt the unmixing systems w_ij and extract y1 and y2 so that they are mutually independent. With this operation, we obtain s1 and s2 at the outputs y1 and y2. No information is needed on the source positions or on the periods of source existence/absence, nor is any information required on the mixing systems h_ji. Thus, this task is called blind source separation (Fig. 7.2). Note that the unmixing systems w_ij can at best be obtained up to a scaling and a permutation, and thus BSS cannot by itself solve the dereverberation/deconvolution problem [9].
7.2.4 Instantaneous Mixtures vs. Convolutive Mixtures
Convolutive Mixtures. If the sound separation is being undertaken in a room, the mixing systems hji are of course FIR filters with several thousand taps. This is the very difficult and relatively new problem of convolutive mixtures.
Instantaneous Mixtures. By contrast, if the mixing systems h_ji are scalars, i.e., there is no delay and no reverberation, such as when we use an audio mixer, this becomes a problem of instantaneous mixtures. In fact, other applications, such as the fMRI and EEG signals found in biomedical contexts, and images, are almost all instantaneous mixture problems. Instantaneous mixture problems have been well studied, and there are many good results.
7.2.5 Time-Domain Approach vs. Frequency-Domain Approach
Several methods have been proposed for achieving the BSS of convolutive mixtures. Some approaches consider the impulse responses of a room hji as FIR filters, and estimate those filters in the time domain [10], [11], [12]; other approaches transform the problem into the frequency domain so that they can simultaneously solve an instantaneous BSS problem for every frequency [13], [14].
7.2.6 Time-Domain Approach for Convolutive Mixtures
In the time-domain approach for convolutive mixtures, the unmixing systems w_ij can be FIR filters or IIR filters. However, FIR filters are usually used in order to realize a non-minimum-phase filter [10]. For the BSS of convolutive mixtures in the time domain, Sun and Douglas clearly distinguished multichannel blind deconvolution from convolutive blind source separation [12]. Multichannel blind deconvolution tries to make the output both spatially and temporally independent. The sources are assumed to be temporally as well as spatially independent, i.e., they are assumed to be independent from channel to channel and from sample to sample. On the other hand, convolutive BSS tries to make the output spatially (mutually) independent without deconvolution. Since speech is temporally correlated, convolutive BSS is appropriate for the task of speech separation. If we apply multichannel blind deconvolution to speech, it imposes undesirable constraints on the output, causing undesirable spectral equalization, flattening, or whitening. Therefore, we need some pre/post-filtering method that maintains the spectral content of the original speech in the separated output [12], [15]. An advantage of the time-domain approach is that we do not have to deal with the severe permutation problem, i.e., the problem that the estimated source signal components are recovered in a different order. Permutation poses a serious problem in frequency-domain BSS, whereas it is trivial in time-domain BSS. A disadvantage of the time-domain approach is that the performance depends strongly on the initial value [10], [15].
7.2.7 Frequency-Domain Approach for Convolutive Mixtures
Smaragdis [13] proposed working directly in the frequency domain, applying a nonlinear function to signals with complex values. The frequency-domain approach to convolutive mixtures transforms the problem into an instantaneous BSS problem in the frequency domain [13], [14], [16], [17]. Here we consider a two-input, two-output convolutive BSS problem without loss of generality. Using a T-point short-time Fourier transformation of (7.1), we obtain

X(ω, m) = H(ω) S(ω, m),   (7.3)
where ω denotes the frequency, m represents the time-dependence of the short-time Fourier transformation, S(ω, m) = [S1(ω, m), S2(ω, m)]^T is the source signal vector, and X(ω, m) = [X1(ω, m), X2(ω, m)]^T is the observed signal vector. We assume that the (2×2) mixing matrix H(ω) is invertible and that H_ji(ω) ≠ 0. Also, H(ω) does not depend on time m. The unmixing process can be formulated in each frequency bin ω as:

Y(ω, m) = W(ω) X(ω, m),   (7.4)
where Y(ω, m) = [Y1(ω, m), Y2(ω, m)]^T is the estimated source signal vector, and W(ω) represents a (2×2) unmixing matrix at frequency bin ω. The unmixing matrix W(ω) is determined so that Y1(ω, m) and Y2(ω, m) become mutually independent. The above calculation is carried out at each frequency independently. This chapter considers the DFT frame size T to be equal to the length Q of the unmixing filters. Hereafter, the convolutive BSS problem is considered in the frequency domain unless stated otherwise. Note that digital signal processing in the time domain and in the frequency domain is essentially identical, and all discussions here in the frequency domain also hold, essentially, for the time-domain convolutive BSS problem.
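A minimal sketch of the per-bin unmixing (7.4) around an STFT, assuming SciPy's default STFT parameters; W would be estimated by the ICA learning rules described below:

```python
import numpy as np
from scipy.signal import stft, istft

def freq_domain_unmix(x, W, nperseg=1024):
    """x: M x T mixtures; W: F x N x M per-bin unmixing matrices,
    with F = nperseg // 2 + 1. Returns the separated signals."""
    _, _, X = stft(x, nperseg=nperseg)       # X: M x F x frames
    Y = np.einsum('fnm,mft->nft', W, X)      # Y(w, m) = W(w) X(w, m), cf. (7.4)
    _, y = istft(Y, nperseg=nperseg)
    return y
```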
7.2.8 Scaling and Permutation Problems
Applying the model in the frequency domain introduces a new problem: the frequency bins are treated as mutually independent. As a result, the estimated source signal components may be recovered in a different order in different frequency bins. Thus, the trivial permutation ambiguity associated with time-domain ICA, i.e., the ordering of the sources, now becomes nontrivial. In frequency-domain BSS, the scaling problem also becomes nontrivial, i.e., the estimated source signal components are recovered with a different gain in different frequency bins. The scaling ambiguity in each frequency bin results in a convolutive ambiguity for each source, i.e., an arbitrary filtering. This reflects the fact that filtered versions of independent signals remain independent.
7.3 What Is ICA?
Independent component analysis (ICA) is a statistical method that was originally introduced in the context of neural network modeling [18], [19], [20], [21], [22], [23], [24], [25]. Recently, this method has been used for the BSS of sounds, fMRI and EEG signals of biomedical applications, wireless communication signals, images, and other applications. ICA thus became an exciting new topic in the fields of signal processing, artificial neural networks, advanced statistics, information theory, and various application fields. Very general statistical properties are used in ICA theory, namely information on statistical independence. In a source separation problem, the source signals are the “independent components” of the data set. In brief, BSS poses the problem of finding a linear representation in which the components are mutually independent. ICA consists of estimating both the unmixing matrix W (ω) and sources si , when we only have the observed signals xj . The unmixing matrix W (ω) is determined so that one output contains as much information on the data as possible. The value of any one of the components gives no information on the values of the other components. If the unmixed signals are mutually independent, then they are equal to the source signals.
7.3.1 What Is Independence?
Independence is a stronger concept than "no correlation," since correlation only deals with second-order statistics whereas independence deals with higher-order statistics. Independent components can be found by nonlinear, nonstationary, or time-delayed decorrelation. In the nonlinear decorrelation approach, if the unmixing matrix W(ω) is a true separating matrix, y1 and y2 are independent and have a zero mean, and the nonlinear function Φ(·) is an odd function such that Φ(y1) also has a zero mean, then

E[Φ(y1) y2] = E[Φ(y1)] E[y2] = 0.   (7.5)
We look for an unmixing matrix W(ω) that satisfies (7.5). The question here is how the nonlinear function should be chosen. The answers can be found by using one of several background theories for ICA; using any of these theories, we can determine the nonlinear function in a satisfactory way. These are the minimization of mutual information, the maximization of nongaussianity, and the maximization of likelihood. For the nonstationary and time-delayed decorrelation approaches, see Sect. 7.4.
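Condition (7.5) is easy to verify numerically. A toy check with independent Laplacian (zero-mean, supergaussian, speech-like) samples and Φ = tanh; the sample size and seed are arbitrary choices of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
y1 = rng.laplace(size=100_000)   # zero-mean, supergaussian
y2 = rng.laplace(size=100_000)   # independent of y1

print(np.mean(np.tanh(y1) * y2))         # close to 0, cf. (7.5)
print(np.mean(np.tanh(y1) * (y1 + y2)))  # clearly nonzero for dependent signals
```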
7.3.2 Minimization of Mutual Information
The first approach for ICA, inspired by information theory, is the minimization of mutual information. Mutual information is a natural informationtheoretic measure of statistical independence. It is always nonnegative, and zero if, and only if, the variables are statistically independent. Therefore it is natural to estimate the independent components by minimizing the mutual information of their estimates. Minimization of mutual information can be interpreted as giving the maximally independent component.
7.3.3 Maximization of Nongaussianity
The second approach is based on the maximization of nongaussianity. The central limit theorem in probability theory says that the distribution of a sum of independent random variables tends toward a Gaussian distribution. Roughly speaking, the sum of independent random variables usually has a distribution that is closer to Gaussian than either of the original random variables. Therefore, the independent components can be found by finding the directions in which the data is maximally nongaussian. Note that in most classic statistical theories, random variables are assumed to have a Gaussian distribution. By contrast, in the ICA theory, random variables are assumed to have a nongaussian distribution. Many real-world data sets, including speech, have supergaussian distributions. Supergaussian random variables typically have a spiky probability density function (pdf), i.e., the pdf is relatively large at zero compared with the Gaussian distribution. A speech signal is a typical example (Fig. 7.3).
7.3.4 Maximization of Likelihood
The third approach is based on the maximization of likelihood. Maximum likelihood (ML) estimation is a fundamental principle of statistical estimation, and a very popular approach for estimating the ICA. We take the ML estimation parameter values as estimates that give the highest probability for the observations. ML estimation is closely related to the neural network principle of maximization of information flow (infomax). The infomax principle is based on maximizing the output entropy, or information flow, of a neural network with nonlinear outputs. We maximize the mutual information between the inputs xi and outputs yi . Maximization of this mutual information is equivalent to a maximization of the output entropy, so infomax is equivalent to maximum likelihood estimation.
Fig. 7.3. Speech signal and its probability density function (pdf). Dotted line is the pdf of the Gaussian distribution.
7.3.5 Three ICA Theories Are Identical
It is of interest to note that all the above solutions are identical [26]. The mutual information I(y1, y2) between the outputs y1 and y2 is expressed as

I(y1, y2) = Σ_{i=1}^{2} H(yi) − H(y1, y2),   (7.6)
where H(yi) are the marginal entropies and H(y1, y2) is the joint entropy of the output. The entropy of y can be calculated by using p(y), the pdf of y, as follows:

H(y) = E[ log (1/p(y)) ] = ∫ p(y) log (1/p(y)) dy.   (7.7)
The mutual information I(y_1, y_2) in Sect. 7.3.2 is minimized by minimizing the first term, or by maximizing the second term, of (7.6). Since Gaussian signals maximize entropy, the maximization of nongaussianity in Sect. 7.3.3 achieves the minimization of the first term. On the other hand, the maximization of the joint entropy of the output in Sect. 7.3.4 maximizes the second term. Accordingly, the three approaches mentioned above are identical. For more details of these theories, see [10], [27], [28], [29].

7.3.6 Learning Rules
To achieve separation, we vary the unmixing matrix W(ω) in (7.4) and observe how the distribution of the output changes. We search for the unmixing matrix W(ω) that minimizes the mutual information, maximizes the nongaussianity, or maximizes the likelihood of the output. This can be accomplished by the gradient method. Bell and Sejnowski derived a very simple gradient algorithm [30]. Amari proposed the natural gradient version, which improved the stability and the convergence speed [31]. This is a nonlinear extension of the ordinary requirement of uncorrelatedness and, in fact, this algorithm is a special case of the nonlinear decorrelation algorithm. The theory makes it clear that the nonlinear function must correspond to the derivative of the logarithm of the pdf of the sources. Hereafter, we assume that the pdf of the (speech) sources is known, that is, the supergaussian distribution of the speech sources is known. We also assume that the nonlinear function is set in a satisfactory way that corresponds to the derivative of the logarithm of the pdf, namely, the nonlinear function is set to tanh(·).
7.4 How Can Speech Signals Be Separated?
This chapter attempts a simple and comprehensible (rather than rigorous) exploration from the acoustical signal processing perspective. Within the ICA-based BSS framework, how can we separate speech signals? The simple answer is to diagonalize R_Y, where R_Y is the (2×2) matrix

\[
R_Y = \begin{bmatrix} \langle \Phi(Y_1)Y_1 \rangle & \langle \Phi(Y_1)Y_2 \rangle \\ \langle \Phi(Y_2)Y_1 \rangle & \langle \Phi(Y_2)Y_2 \rangle \end{bmatrix}. \tag{7.8}
\]

The function Φ(·) is a nonlinear function, and ⟨·⟩ is the averaging operation used to obtain statistical information. We want to minimize the off-diagonal components while, at the same time, constraining the diagonal components to proper constants. The components of the matrix R_Y correspond to the mutual information between Y_i and Y_j. At the convergence point, the off-diagonal components, which are the mutual information between Y_1 and Y_2, become zero:

\[
\langle \Phi(Y_1)Y_2 \rangle = 0, \quad \langle \Phi(Y_2)Y_1 \rangle = 0. \tag{7.9}
\]
At the same time, the diagonal components, which only control the amplitude scaling of the outputs Y_1 and Y_2, are constrained to proper constants:

\[
\langle \Phi(Y_1)Y_1 \rangle = c_1, \quad \langle \Phi(Y_2)Y_2 \rangle = c_2. \tag{7.10}
\]
To achieve this convergence, we use the recursive learning rule

\[
W_{i+1} = W_i + \eta \Delta W_i, \tag{7.11}
\]

\[
\Delta W_i = \begin{bmatrix} c_1 - \langle \Phi(Y_1)Y_1 \rangle & -\langle \Phi(Y_1)Y_2 \rangle \\ -\langle \Phi(Y_2)Y_1 \rangle & c_2 - \langle \Phi(Y_2)Y_2 \rangle \end{bmatrix}. \tag{7.12}
\]
When R_Y is diagonalized, ΔW converges to zero. If c_1 = c_2 = 1, the algorithm is called holonomic. If c_1 = ⟨Φ(Y_1)Y_1⟩ and c_2 = ⟨Φ(Y_2)Y_2⟩, the algorithm is called nonholonomic.
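As an illustration of the learning rule (7.11) and (7.12), the following sketch separates an instantaneous (non-convolutive) 2×2 mixture of supergaussian sources with Φ = tanh(·). The update is multiplied by W, as in the natural gradient form of (7.21), for stable convergence; the mixing matrix, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two supergaussian (Laplacian) sources, mixed by a fixed 2x2 matrix.
S = rng.laplace(size=(2, 20000))
H = np.array([[1.0, 0.6],
              [0.5, 1.0]])
X = H @ S

W = np.eye(2)              # unmixing matrix to be learned
eta = 0.05                 # step size
c = np.array([1.0, 1.0])   # holonomic constraint: c1 = c2 = 1

for _ in range(300):
    Y = W @ X
    # <Phi(Y) Y^T> with Phi = tanh: the matrix R_Y of (7.8)
    R_Y = (np.tanh(Y) @ Y.T) / Y.shape[1]
    # Delta W of (7.12), multiplied by W (natural-gradient form, cf. (7.21))
    W = W + eta * (np.diag(c) - R_Y) @ W

# After convergence, W @ H is close to a scaled/permuted identity.
print(np.round(W @ H, 2))
```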
7.4.1 Second Order Statistics vs. Higher Order Statistics
If Φ(Y_1) = Y_1, we have the simple decorrelation

\[
\langle \Phi(Y_1)Y_2 \rangle = \langle Y_1 Y_2 \rangle = 0. \tag{7.13}
\]
This is not sufficient to achieve independence. However, if we have nonstationary sources, we have this equation for multiple time blocks, and thus can solve the problem. This is the nonstationary decorrelation approach [32]. Or, if we have colored sources, we have a delayed correlation for multiple time delays,

\[
\langle \Phi(Y_1)Y_2 \rangle = \langle Y_1(m) Y_2(m+\tau_i) \rangle = 0, \tag{7.14}
\]
and thus we can solve the problem. This is the time-delayed decorrelation (TDD) approach [33], [34]. These are the approaches based on second order statistics (SOS). On the other hand, if, for example, Φ(Y_1) = tanh(Y_1), we have

\[
\langle \Phi(Y_1)Y_2 \rangle = \langle \tanh(Y_1) Y_2 \rangle = 0. \tag{7.15}
\]
With a Taylor expansion of tanh(·), (7.15) can be expressed as

\[
\left\langle \left( Y_1 - \frac{Y_1^3}{3} + \frac{2Y_1^5}{15} - \frac{17Y_1^7}{315} + \cdots \right) Y_2 \right\rangle = 0, \tag{7.16}
\]

thus we have higher order, or nonlinear, decorrelation, and we can solve the problem. This is the approach based on higher order statistics (HOS).
7.4.2 Second Order Statistics (SOS) Approach
The second order statistics (SOS) approach exploits the second order nonstationary/colored structure of the sources, namely crosstalk minimization with additional nonstationary/colored information on the sources. Weinstein et al. [9] pointed out that nonstationary signals provide enough additional information to estimate the unmixing matrix W(ω) and proposed a method based on nonstationary decorrelation. Some authors have used the SOS approach for mixed speech signals [3], [35]. This approach can be understood in a comprehensible way: we have four unknown parameters W_ij in each frequency bin, but only three equations in (7.9) and (7.10), since ⟨Y_1 Y_2⟩ = ⟨Y_2 Y_1⟩ when Φ(Y_i) = Y_i; that is, the simultaneous equations become underdetermined and cannot be solved. However, when the sources are nonstationary, the second order statistics are different in each time block. Similarly, when the sources are colored, the second order statistics are different for each time delay. As a result, more equations are available and the simultaneous equations can be solved. In the nonstationary decorrelation approach, the source signals S_1(ω, m) and S_2(ω, m) are assumed to have a zero mean and to be mutually uncorrelated. To determine the unmixing matrix W(ω) so that Y_1(ω, m) and Y_2(ω, m) become mutually uncorrelated, we seek a W(ω) that diagonalizes the covariance matrices R_Y(ω, k) simultaneously for all time blocks k,

\[
R_Y(\omega, k) = W(\omega) R_X(\omega, k) W^H(\omega) = W(\omega) H(\omega) \Lambda_s(\omega, k) H^H(\omega) W^H(\omega) = \Lambda_c(\omega, k), \tag{7.17}
\]
where (·)^H denotes the conjugate transpose and R_X is the covariance matrix of X(ω),

\[
R_X(\omega, k) = \frac{1}{M} \sum_{m=0}^{M-1} X(\omega, Mk+m) X^H(\omega, Mk+m), \tag{7.18}
\]
Λ_s(ω, k) is a covariance matrix of the source signals, which is a different diagonal matrix for each time block k, and Λ_c(ω, k) is an arbitrary diagonal matrix. The diagonalization of R_Y(ω, k) can be written as an overdetermined least squares problem,

\[
\arg\min_{W(\omega)} \sum_k \left\| \mathrm{diag}\{W(\omega) R_X(\omega, k) W^H(\omega)\} - W(\omega) R_X(\omega, k) W^H(\omega) \right\|^2
\]
\[
\text{s.t.} \quad \sum_k \left\| \mathrm{diag}\{W(\omega) R_X(\omega, k) W^H(\omega)\} \right\|^2 \neq 0, \tag{7.19}
\]

where ‖·‖² is the squared Frobenius norm and diag A denotes the diagonal components of the matrix A; the constraint excludes the trivial solution W(ω) = 0. The solution can be found by the gradient method.
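A minimal per-bin sketch of this gradient search, assuming the block covariance matrices R_X(ω, k) have already been estimated; the plain renormalization of W standing in for the constraint in (7.19) is an illustrative simplification.

```python
import numpy as np

def offdiag(A):
    return A - np.diag(np.diag(A))

def sos_unmixing(RX_blocks, n_iter=1000, eta=0.02):
    """Gradient search, in one frequency bin, for a W that jointly
    diagonalizes the block covariances, per the criterion (7.19)."""
    W = np.eye(RX_blocks[0].shape[0], dtype=complex)
    for _ in range(n_iter):
        G = np.zeros_like(W)
        for R in RX_blocks:
            E = offdiag(W @ R @ W.conj().T)  # off-diagonal (crosstalk) error
            G += 2.0 * E @ W @ R             # gradient of ||E||_F^2 wrt W*
        W = W - eta * G
        W = W / np.linalg.norm(W)            # keep W away from the trivial zero
    return W
```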
In the time-delayed decorrelation approach, R_X is defined as

\[
R_X(\omega, \tau_i) = \frac{1}{M} \sum_{m=0}^{M-1} X(\omega, m) X^H(\omega, m+\tau_i), \tag{7.20}
\]
and we seek a W(ω) that diagonalizes the covariance matrices R_Y(ω, τ_i) simultaneously for all time delays τ_i.

7.4.3 Higher Order Statistics (HOS) Approach
The higher order statistics (HOS) approach exploits the nongaussian structure of the sources. Or, more simply, we could say that we have four equations in (7.9) and (7.10) for the four unknown parameters W_ij in each frequency bin; accordingly, the simultaneous equations can be solved. To calculate the unmixing matrix W(ω), an algorithm has been proposed based on the minimization of the Kullback-Leibler divergence [13], [14]. For stable and faster convergence, Amari [36] proposed an algorithm based on the natural gradient. Using the natural gradient, the optimal unmixing matrix W(ω) is obtained by the following iterative equation:

\[
W_{i+1}(\omega) = W_i(\omega) + \eta \left[ \mathrm{diag}\langle \Phi(Y) Y^H \rangle - \langle \Phi(Y) Y^H \rangle \right] W_i(\omega), \tag{7.21}
\]
where Y = Y(ω, m), ⟨·⟩ denotes the averaging operator, i expresses the i-th step of the iterations, and η is the step size parameter. In addition, we extend the nonlinear function Φ(·) to complex-valued signals as

\[
\Phi(Y) = \tanh(Y^{(R)}) + j \tanh(Y^{(I)}), \tag{7.22}
\]
where Y^{(R)} and Y^{(I)} are the real and imaginary parts of Y, respectively [13]. For the complex-valued nonlinear function, the polar coordinate version

\[
\Phi(Y) = \tanh(|Y|) \, e^{j \arg(Y)} \tag{7.23}
\]

was shown to outperform the Cartesian coordinate version (7.22) both theoretically and experimentally [37].
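The update (7.21) with the polar nonlinearity (7.23) is compact enough to sketch directly. The following single-bin illustration assumes the STFT frames of the two microphone signals are available; function names and the step size are illustrative.

```python
import numpy as np

def phi_polar(Y):
    """Polar-coordinate nonlinearity of (7.23): tanh(|Y|) e^{j arg Y}."""
    return np.tanh(np.abs(Y)) * np.exp(1j * np.angle(Y))

def natural_gradient_step(W, X_bin, eta=0.1):
    """One iteration of (7.21) in a single frequency bin.
    X_bin: (2, n_frames) complex STFT observations in that bin."""
    Y = W @ X_bin
    R = (phi_polar(Y) @ Y.conj().T) / Y.shape[1]   # <Phi(Y) Y^H>
    return W + eta * (np.diag(np.diag(R)) - R) @ W
```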
7.5 Physical Interpretation of BSS
BSS is a statistical, or mathematical method, so the physical behavior of BSS is not obvious. We are simply attempting to make the two output signals Y1 and Y2 independent. Then, what is the physical interpretation of BSS?
In earlier work, Cardoso and Souloumiac [23] indicated the connection between blind identification and beamforming in a narrowband context. Kurita et al. [38] and Parra and Alvino [39] used the relationship between BSS and beamforming to achieve better BSS performance. However, their theoretical discussion of this relationship was insufficient. This section discusses this relationship more closely and provides a physical understanding of frequency-domain BSS [40]. It also provides an interpretation of BSS from the physical point of view, showing the equivalence between frequency-domain BSS and two sets of frequency-domain adaptive microphone arrays, i.e., adaptive beamformers (ABFs). Knaak and Filbert [41] have also provided a somewhat qualitative discussion of the relationship between frequency-domain ABF and frequency-domain BSS. This chapter goes beyond their discussions and offers an explanation of the effect of the collapse of the independence assumption in BSS.

7.5.1 Frequency-Domain Adaptive Beamformer (ABF)
Here, we consider the frequency-domain adaptive beamformer (ABF), which can adaptively remove a jammer signal. Since the aim is to separate two signals S_1 and S_2 with two microphones, two sets of ABFs are used (see Fig. 7.4): an ABF that forms a null directivity pattern towards source S_2 by using filter coefficients W_11 and W_12, and an ABF that forms a null directivity pattern towards source S_1 by using filter coefficients W_21 and W_22. Note that the direction of the target, or the impulse responses from the target to the microphones, should be known, and that the ABF can adapt only when the jammer is active but the target is silent.

7.5.2 ABF for Target S1 and Jammer S2
First, we consider the case of a target S_1 and a jammer S_2 [see Fig. 7.4(a)]. When target S_1 = 0, output Y_1(ω, m) is expressed as

\[
Y_1(\omega, m) = W(\omega) X(\omega, m), \tag{7.24}
\]

where W(ω) = [W_11(ω), W_12(ω)] and X(ω, m) = [X_1(ω, m), X_2(ω, m)]^T. To minimize jammer S_2(ω, m) in output Y_1(ω, m) when target S_1 = 0, the mean square error J(ω) is introduced as

\[
J(\omega) = E[|Y_1(\omega, m)|^2] = W(\omega) E[X(\omega, m) X^H(\omega, m)] W^H(\omega) = W(\omega) R(\omega) W^H(\omega), \tag{7.25}
\]

where E[·] is the expectation operator and

\[
R(\omega) = E \begin{bmatrix} X_1(\omega, m) X_1^*(\omega, m) & X_1(\omega, m) X_2^*(\omega, m) \\ X_2(\omega, m) X_1^*(\omega, m) & X_2(\omega, m) X_2^*(\omega, m) \end{bmatrix}. \tag{7.26}
\]
Fig. 7.4. Two sets of ABF-system configurations: (a) ABF for target S1 and jammer S2; (b) ABF for target S2 and jammer S1.
By differentiating the cost function J(ω) with respect to W(ω) and setting the gradient equal to zero, we obtain [hereafter, (ω, m) and (ω) are omitted for convenience]

\[
\frac{\partial J}{\partial W} = 2 R W^H = 0. \tag{7.27}
\]
Using X1 = H12 S2 , X2 = H22 S2 , we get W11 H12 + W12 H22 = 0.
(7.28)
With (7.28) only, we have a trivial solution W11 =W12 =0. Therefore, an additional constraint should be added to ensure that target signal S1 is in output Y1 , i.e., Y1 = (W11 H11 + W12 H21 )S1 = c1 S1 ,
(7.29)
which leads to W11 H11 + W12 H21 = c1 ,
(7.30)
where c1 is an arbitrary complex constant. The ABF solution is derived from simultaneous equations (7.28) and (7.30). For this ABF, target S1 is extracted with proper energy and jammer S2 is minimized.
7.5.3 ABF for Target S2 and Jammer S1
Similarly for a target S2 , a jammer S1 , and an output Y2 [see Fig. 7.4(b)], we obtain W21 H11 + W22 H21 = 0
(7.31)
W21 H12 + W22 H22 = c2 .
(7.32)
For this ABF, target S_2 is extracted with proper energy and jammer S_1 is minimized.

7.5.4 Two Sets of ABFs
By combining (7.28), (7.30), (7.31), and (7.32), we can summarize the simultaneous equations for the two sets of ABFs as follows:

\[
\begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix} \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix} = \begin{bmatrix} c_1 & 0 \\ 0 & c_2 \end{bmatrix}. \tag{7.33}
\]
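When H(ω) is known and invertible, (7.33) can be solved directly as W = diag(c_1, c_2) H^{-1}(ω). A small numerical check, with an arbitrary illustrative H:

```python
import numpy as np

# Illustrative 2x2 channel matrix in one frequency bin, and gain constants.
H = np.array([[1.0 + 0.2j, 0.4 - 0.1j],
              [0.3 + 0.1j, 1.0 - 0.3j]])
c = np.array([1.0, 1.0])

# (7.33): W H = diag(c)  =>  W = diag(c) H^{-1}
W = np.diag(c) @ np.linalg.inv(H)

# Each row of W nulls one source and passes the other with gain c_i.
print(np.round(W @ H, 6))   # ~ diag(c)
```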
7.6 Equivalence Between BSS and Adaptive Beamformers
Here, we consider an algorithm based on second order statistics for nonstationary signals, as described in Sect. 7.4.2. The output signals are decorrelated simultaneously for all time blocks in each frequency. The BSS strategy works to diagonalize the covariance matrix R_Y of the output signals. That is, the BSS update equation works to minimize the off-diagonal components of the matrix R_Y and to constrain the diagonal components to proper constants. As shown in (7.19), the BSS algorithm based on second order statistics works to minimize the off-diagonal components in

\[
E \begin{bmatrix} Y_1 Y_1^* & Y_1 Y_2^* \\ Y_2 Y_1^* & Y_2 Y_2^* \end{bmatrix} \tag{7.34}
\]

[see (7.17)] for all time blocks. Using H and W, the outputs Y_1 and Y_2 are expressed in each frequency bin as

\[
Y_1 = a S_1 + b S_2, \tag{7.35}
\]
\[
Y_2 = c S_1 + d S_2, \tag{7.36}
\]
where

\[
\begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix} \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix}, \tag{7.37}
\]

and their paths are shown in Fig. 7.5. Here, a and d are the target paths, and b and c are the jammer paths.
Fig. 7.5. Paths in equation (7.37).
7.6.1 When S1 ≠ 0 and S2 ≠ 0
We now analyze what occurs in the BSS framework. After convergence, the expectation of the off-diagonal component E[Y_1Y_2^*] is expressed as

\[
E[Y_1 Y_2^*] = a d^* E[S_1 S_2^*] + b c^* E[S_2 S_1^*] + \left( a c^* E[|S_1|^2] + b d^* E[|S_2|^2] \right) = 0. \tag{7.38}
\]

Since S_1 and S_2 are assumed to be independent, the first and second terms become zero. Then, the BSS adaptation should drive the third term of (7.38) to zero for all time blocks. That is, (7.38) is an identity with respect to E[|S_1|^2] and E[|S_2|^2]. This leads to

\[
a c^* = b d^* = 0. \tag{7.39}
\]

CASE 1: a = c_1, c = 0, b = 0, d = c_2

\[
\begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix} \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix} = \begin{bmatrix} c_1 & 0 \\ 0 & c_2 \end{bmatrix}. \tag{7.40}
\]
This equation is identical to (7.33) in ABF.

CASE 2: a = 0, c = c_1, b = c_2, d = 0

\[
\begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix} \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix} = \begin{bmatrix} 0 & c_2 \\ c_1 & 0 \end{bmatrix}. \tag{7.41}
\]
This equation leads to a permutation solution, Y_1 = c_2 S_2, Y_2 = c_1 S_1: the estimated source signal components are recovered in a different order.

CASE 3: a = 0, c = c_1, b = 0, d = c_2

\[
\begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix} \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ c_1 & c_2 \end{bmatrix}. \tag{7.42}
\]
This equation leads to an undesirable solution, Y_1 = 0, Y_2 = c_1 S_1 + c_2 S_2.

CASE 4: a = c_1, c = 0, b = c_2, d = 0

\[
\begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix} \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix} = \begin{bmatrix} c_1 & c_2 \\ 0 & 0 \end{bmatrix}. \tag{7.43}
\]
This equation leads to an undesirable solution, Y_1 = c_1 S_1 + c_2 S_2, Y_2 = 0. Note that, in general, CASE 3 and CASE 4 do not appear because H(ω) is assumed to be invertible and H_ji(ω) ≠ 0. That is, if a = 0 then b ≠ 0 (CASE 2), and if c = 0 then d ≠ 0 (CASE 1).

7.6.2 When S1 ≠ 0 and S2 = 0
BSS can adapt even if there is only one active source. In this case, only one set of ABFs is achieved. When S_2 = 0, we have

\[
Y_1 = a S_1 \quad \text{and} \quad Y_2 = c S_1, \tag{7.44}
\]

then

\[
E[Y_1 Y_2^*] = E[a S_1 c^* S_1^*] = a c^* E[|S_1|^2] = 0, \tag{7.45}
\]

and therefore the BSS adaptation should drive

\[
a c^* = 0. \tag{7.46}
\]
(7.46)
(7.47)
where the symbol − indicates “don’t care.” Since S2 = 0, the output can be derived correctly Y1 = c1 S1 , Y2 = 0 as follows. S1 c S Y1 c − = 1 1 (7.48) = 1 0 − Y2 0 0 CASE 6: c = c1 , a = 0 W11 W12 H11 H12 0 − = c1 − W21 W22 H21 H22
(7.49)
This equation leads to a permutation solution which is Y1 = 0, Y2 = c1 S1 . S1 0 Y1 0 − = = (7.50) c1 S1 c1 − Y2 0
7.6.3 When S1 = 0 and S2 ≠ 0
Similarly, only one set of ABFs is achieved in this case.

CASE 7: b = 0, d = c_2

\[
\begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix} \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix} = \begin{bmatrix} - & 0 \\ - & c_2 \end{bmatrix}. \tag{7.51}
\]

Since S_1 = 0, the output is derived correctly as Y_1 = 0, Y_2 = c_2 S_2:

\[
\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} - & 0 \\ - & c_2 \end{bmatrix} \begin{bmatrix} 0 \\ S_2 \end{bmatrix} = \begin{bmatrix} 0 \\ c_2 S_2 \end{bmatrix}. \tag{7.52}
\]

CASE 8: b = c_2, d = 0

\[
\begin{bmatrix} W_{11} & W_{12} \\ W_{21} & W_{22} \end{bmatrix} \begin{bmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{bmatrix} = \begin{bmatrix} - & c_2 \\ - & 0 \end{bmatrix}. \tag{7.53}
\]

This equation leads to a permutation solution, Y_1 = c_2 S_2, Y_2 = 0:

\[
\begin{bmatrix} Y_1 \\ Y_2 \end{bmatrix} = \begin{bmatrix} - & c_2 \\ - & 0 \end{bmatrix} \begin{bmatrix} 0 \\ S_2 \end{bmatrix} = \begin{bmatrix} c_2 S_2 \\ 0 \end{bmatrix}. \tag{7.54}
\]
Now, BSS and the two sets of ABFs converge to the same equation. That is, BSS based on the nonstationary decorrelation criterion and ABFs based on the least squares criterion are equivalent if the independence assumption holds ideally.
7.7 Separation Mechanism of BSS
Now, we can understand the behavior of BSS as two sets of ABFs. An ABF can create only one null towards the jammer signal when two microphones are used. BSS and ABFs form an adaptive spatial null in the jammer direction, and extract the target. The separation performance of BSS is compared with that of ABF. Figure 7.6 shows the directivity patterns obtained by BSS and ABF. In Fig. 7.6, (a) and (b) show directivity patterns by W obtained by BSS, and (c) and (d) show directivity patterns by W obtained by ABF. When TR = 0, a sharp spatial null is obtained by both BSS and ABF [see Figs. 7.6(a) and (c)]. When TR = 300 ms, the directivity pattern becomes duller for both BSS and ABF [see Figs. 7.6(b) and (d)].
Fig. 7.6. Directivity patterns (a) obtained by BSS (TR = 0 ms), (b) obtained by BSS (TR = 300 ms), (c) obtained by ABF (TR = 0 ms), and (d) obtained by ABF (TR = 300 ms).
7.7.1 Fundamental Limitation of BSS
BSS removes the sound from the jammer direction and reduces the reverberation of the jammer signal to some extent, in the same way as the ABF [6]. Since BSS and ABF mainly remove sound from the jammer direction, it is difficult to completely eliminate the reverberant sound. This understanding clearly explains the poor performance of BSS in a real acoustic environment with long reverberation. BSS was shown to outperform a null beamformer that forms a steep null directivity pattern towards a jammer under the assumption that the jammer's direction is known [6], [42]. It is well known that an adaptive beamformer outperforms a null beamformer when there is long reverberation. Our understanding also clearly explains this result. Note that the ABF needs to know the array manifold and the target direction. Note also that the ABF can adapt only when the jammer is active but the target is silent, whereas BSS can adapt in the presence of both target and jammer, and also in the presence of only one active source.

7.7.2 When Sources Are Near the Microphones
Figure 7.7 shows the performance when the contribution of the direct sound is changed artificially. The performance improves with increases in the contribution of the direct sound. This characteristic is the same as that of an ABF. The unmixing system W mainly removes sound from the jammer direction, meaning that mainly the direct (largest) sound of the jammer can be extracted, and the other reverberant components, which arrive from different directions, cannot be reduced. As a result, the separation performance is fundamentally limited. The discussion here is essentially also true for BSS based on higher order statistics (HOS).

Fig. 7.7. Relationship between the contribution of a direct sound and the separation performance. TR = 300 ms, T = 512. (a) Example of an impulse response. (b) Energy decay curve. (c) Separation performance.
7.8 When Sources Are Not “Independent”
Frequency-domain BSS and frequency-domain ABF are equivalent [see (7.33) and (7.40)] in the ideal case where the independence assumption holds [see (7.38)]. If not, the first and second terms of (7.38) behave as bias noise in obtaining the correct coefficients a, b, c, d. The influence of the frame size on the separation performance was explored in [4], where it was shown that a long frame size works poorly in frequency-domain BSS for speech data of a few seconds. This is because, when a long frame is used, the number of samples in each frequency bin becomes small, which makes it difficult to estimate the statistics; the zero-mean and independence assumptions then fail to hold. Therefore, the first and second terms of (7.38) are not equal to zero.

7.8.1 BSS Is Upper Bounded by ABF
Figure 7.8 shows the separation performance of BSS and ABF. Here, the ABF proposed by Frost was employed [43]. The ABF was adapted only when the jammer was active but the target was silent. With BSS, when the frame size is too long, the separation performance degrades. This is because, when the frame size is long, the number of samples in each frequency bin is too small for the statistics to be estimated correctly for several seconds of speech [44]. By contrast, the ABF does not use the assumption of independence of the source signals. With the ABF, therefore, the separation performance improves as the frame size becomes longer. Figure 7.8 confirms that the performance of the BSS is limited by that of the ABF. Note again that the ABF needs to know the array manifold and the target direction. Note also that the ABF can adapt only when the jammer is active but the target is silent, whereas BSS can adapt in the presence of a target and a jammer, and also in the presence of only one active source.

7.8.2 BSS Is an Intelligent Version of ABF
Although BSS is upper bounded by ABF, BSS has a strong advantage over ABF. The strict one-channel power criterion of ABF suffers from a serious crosstalk or leakage problem; in BSS, on the other hand, the sources can be simultaneously active. This is because the error criteria of ABF and BSS differ: instead of the power minimization criterion that adapts the jammer signal out of the target signal in ABF, BSS adopts a cross-power minimization criterion that decorrelates the jammer signal from the target signal in a nonlinear, nonstationary, or time-delayed manner. Section 7.6 showed that the least squares criterion of ABF is equivalent to the nonstationary decorrelation criterion of BSS; the error minimization was shown to be completely equivalent to a zero search in the cross-correlation. Unlike the case with conventional adaptive beamforming, no assumptions on array geometry or source location are made in BSS. BSS can adapt without any information on the source positions or on the periods of source existence/absence. In this sense, BSS can be regarded as an intelligent version of ABF.
Fig. 7.8. SIR results for different frame sizes. The solid lines are for ABF and the broken lines are for BSS. (a) Non-reverberant test (TR = 0 ms), (b) reverberant test (TR = 300 ms).
The inspiration for the above can be found in two pieces of work. Weinstein et al. [9] and Gerven and Compernolle [45] showed signal separation by using a noise cancellation framework with signal leakage into the noise reference.
7.9 Sound Quality
As regards sound separation, the requirement is the direct sound of the target: not its reverberation, nor the direct sound or reverberation of a jammer. The next question is then: what is separated and what remains in BSS? After convergence, if we input an impulse to S_1, we can measure the impulse response for the target [path (a) in Fig. 7.5], and if we input an impulse to S_2, we can measure the impulse response for the jammer [path (b) in Fig. 7.5]. Figure 7.9 shows the resulting impulse responses and compares them with those for a null beamformer (NBF), which makes a steep spatial null toward the given jammer direction. Figures 7.9(a) and (c) show examples of impulse responses for the target and jammer of the separating system obtained with an NBF that forms a steep null directivity pattern towards a jammer on the assumption that the jammer's direction is known. Figures 7.9(b) and (d) are the results obtained by BSS. For the target signal, we can see that the reverberation passes through the system in both cases, Fig. 7.9(a) NBF and (b) BSS. Figure 7.9(c) shows that the direct sound of the jammer is eliminated, but its reverberation is not eliminated by the NBF, as expected. By contrast, Fig. 7.9(d) indicates that BSS not only eliminates the direct sound but also reduces the reverberation of the jammer [46].

Fig. 7.9. Target and jammer impulse responses of NBF and BSS.
Fig. 7.10. Directivity patterns obtained by (a) NBF, (b) BSS (TR = 300 ms), (c) ABF (TR = 300 ms). Frame size T = 256, three-second learning.
If we understand that BSS is equivalent to ABF, we can clearly understand these results.

7.9.1 Directivity Patterns of NBF, BSS, and ABF
Figure 7.10 shows the directivity patterns obtained by NBF, BSS, and ABF. BSS and ABF provide duller directivity than NBF, thus they can remove not only the direct sound of the jammer but also its reverberation.
Fig. 7.11. Layout of room used in experiments.
7.10 Experimental Conditions

7.10.1 Mixing Systems
This section summarizes the experimental conditions used in the work described in this chapter. The experiments were conducted using speech data convolved with impulse responses recorded in two environments with different reverberation times: TR = 0 ms and 300 ms. Since the sampling rate was 8 kHz, 300 ms corresponds to 2400 taps. The size of the room used to measure the impulse responses was 5.73 m × 3.12 m × 2.70 m, and the distance between the loudspeakers and the microphones was 1.15 m (Fig. 7.11). A two-element array was used with an inter-element spacing of 4 cm, which corresponds to almost half the wavelength at the Nyquist frequency of 4 kHz. The speech signals arrived from two directions, −30° and 40°. An example of a measured room impulse response used in the experiments is shown in Fig. 7.12. Figure 7.12(b) shows the energy decay curve r(t) of an impulse response h(t), which can be obtained by integrating the impulse response energy as follows:

\[
r^2(t) = \int_t^{\infty} h^2(\tau) \, d\tau.
\]

The reverberation time TR is defined as the time for an energy attenuation of 60 dB.
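For a sampled impulse response, this backward integration and the resulting TR estimate can be sketched as follows; this is a direct reading of the definition (in practice, TR is often extrapolated from a fitted slope of the decay curve).

```python
import numpy as np

def energy_decay_curve(h):
    """Backward integration of the impulse response energy:
    r^2(t) = integral from t to infinity of h^2 (Schroeder integration)."""
    r2 = np.cumsum(h[::-1] ** 2)[::-1]
    return 10.0 * np.log10(r2 / r2[0])   # decay in dB, 0 dB at t = 0

def reverberation_time(h, fs):
    """TR: the time at which the decay curve has fallen by 60 dB."""
    edc = energy_decay_curve(h)
    below = np.where(edc <= -60.0)[0]
    return below[0] / fs if below.size else None
```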
7.10.2 Source Signals
Two sentences spoken by two male and two female speakers were used as the original speech. The investigations were carried out for six combinations
of speakers. The speech data had a length of eight seconds. The first three seconds of the data were used for learning, and the entire eight-second data were used for separation. The DFT frame size T was changed from 32 to 2048, and the performance for each condition was investigated. The frame shift was half the frame size T, and the analysis window was a Hamming window.

Fig. 7.12. An example of (a) a measured impulse response h11 used in the experiments and (b) its energy decay curve. TR = 300 ms.

7.10.3 Evaluation Measure
The performance was evaluated using the signal to interference ratio (SIR), defined as follows:

\[
\mathrm{SIR}_i = \mathrm{SIRO}_i - \mathrm{SIRI}_i, \tag{7.55}
\]
\[
\mathrm{SIRO}_i = 10 \log \frac{\sum_\omega |A_{ii}(\omega) S_i(\omega)|^2}{\sum_\omega |A_{ij}(\omega) S_j(\omega)|^2}, \quad \mathrm{SIRI}_i = 10 \log \frac{\sum_\omega |H_{ii}(\omega) S_i(\omega)|^2}{\sum_\omega |H_{ij}(\omega) S_j(\omega)|^2}, \tag{7.56}
\]

where A(ω) = W(ω)H(ω) and i ≠ j. SIR is the ratio of the target-originated signal to the jammer-originated signal. These values were averaged over all six combinations of speakers, and SIR1 and SIR2 were averaged.
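A sketch of (7.55) and (7.56) for the 2×2 case, assuming the frequency responses W(ω), H(ω) and the source spectra are available as arrays; the shapes are illustrative assumptions.

```python
import numpy as np

def sir_improvement(W, H, S):
    """SIR_i = SIRO_i - SIRI_i of (7.55)-(7.56).
    W, H: (bins, 2, 2) frequency responses; S: (bins, 2) source spectra."""
    A = W @ H                        # overall system A(w) = W(w) H(w)
    sir = []
    for i in range(2):
        j = 1 - i
        siro = 10 * np.log10(np.sum(np.abs(A[:, i, i] * S[:, i]) ** 2)
                             / np.sum(np.abs(A[:, i, j] * S[:, j]) ** 2))
        siri = 10 * np.log10(np.sum(np.abs(H[:, i, i] * S[:, i]) ** 2)
                             / np.sum(np.abs(H[:, i, j] * S[:, j]) ** 2))
        sir.append(siro - siri)
    return sir
```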
7.10.4 Scaling and Permutation
The blind beamforming algorithm proposed by Kurita et al. [38] was used to solve the scaling and permutation problem. First, the source directions were estimated from the directivity pattern obtained from the unmixing matrix W(ω), and the rows of W(ω) were reordered so that the directivity pattern formed a null in the same direction in all frequency bins. Then, the rows of W(ω) were normalized so that the gains in the target directions became 0 dB.
7.11 Conclusions
The blind source separation (BSS) of convolved mixtures of acoustic signals, especially speech, was examined. Source signals can be extracted from only the observed mixed signals by achieving nonlinear, nonstationary, or time-delayed decorrelation. The statistical technique of independent component analysis (ICA) was studied from the acoustic signal processing point of view. BSS was interpreted from the physical standpoint, showing the equivalence between frequency-domain BSS and two sets of microphone array systems, i.e., two sets of adaptive beamformers (ABFs). Convolutive BSS can be understood as multiple ABFs that generate statistically independent outputs or, more simply, outputs with minimal crosstalk. Because ABF and BSS mainly deal with sound from the jammer direction by making a null towards the jammer, the separation performance is fundamentally limited. This understanding clearly explains the poor performance of BSS in the real world with long reverberation. If the sources are not "independent," their dependency results in bias noise in obtaining the correct unmixing filter coefficients. Therefore, the BSS performance is upper bounded by that of the ABF. However, in contrast to the ABF, no assumptions regarding array geometry or source location need to be made in BSS. BSS can adapt without any information on the source positions or on the periods of source existence/absence. This is because, instead of the power minimization criterion that adapts the jammer signal out of the target signal in ABF, a cross-power minimization criterion is adopted that decorrelates the jammer signal from the target signal in BSS. It was shown that the least squares criterion of ABF is equivalent to the decorrelation criterion of the output in BSS, and that the error minimization is completely equivalent to a zero search in the cross-correlation. Although the performance of the BSS is limited by that of the ABF, BSS has a major advantage over ABF: a strict one-channel power criterion has a serious crosstalk or leakage problem in ABF, whereas sources can be simultaneously active in BSS. Also, ABF needs to know the array manifold and the target direction. Thus, BSS can be regarded as an intelligent version of ABF.
The fusion of acoustic signal processing technologies and speech recognition technologies is playing a major role in the development of user-friendly communication with computers, conversation robots, and other advanced audio media processing technologies.
Acknowledgment

I thank Shoko Araki, Ryo Mukai, Hiroshi Sawada, Tsuyoki Nishikawa, and Hiroshi Saruwatari for daily collaboration and valuable discussions, and Shigeru Katagiri for his continuous encouragement.
References

1. J. F. Cardoso, "The three easy routes to independent component analysis; contrasts and geometry," in Proc. Conference Indep. Compon. Anal. Signal. Sep., Dec. 2001, pp. 1–6.
2. T. W. Lee, A. J. Bell, and R. Orglmeister, "Blind source separation of real world signals," Neural Networks, vol. 4, pp. 2129–2134, 1997.
3. M. Z. Ikram and D. R. Morgan, "Exploring permutation inconsistency in blind separation of speech signals in a reverberant environment," in Proc. ICASSP, June 2000, pp. 1041–1044.
4. S. Araki, S. Makino, T. Nishikawa, and H. Saruwatari, "Fundamental limitation of frequency domain blind source separation for convolutive mixture of speech," in Proc. ICASSP, May 2001, vol. 5, pp. 2737–2740.
5. S. Araki, S. Makino, R. Mukai, and H. Saruwatari, "Equivalence between frequency domain blind source separation and frequency domain adaptive null beamformers," in Proc. Eurospeech, Sept. 2001, pp. 2595–2598.
6. R. Mukai, S. Araki, and S. Makino, "Separation and dereverberation performance of frequency domain blind source separation for speech in a reverberant environment," in Proc. Eurospeech, Sept. 2001, pp. 2599–2602.
7. S. C. Douglas, "Blind separation of acoustic signals," in Microphone Arrays: Techniques and Applications, M. Brandstein and D. B. Ward, Eds., pp. 355–380, Springer, Berlin, 2001.
8. K. Torkkola, "Blind separation of delayed and convolved sources," in Unsupervised Adaptive Filtering, Vol. I, S. Haykin, Ed., pp. 321–375, John Wiley & Sons, 2000.
9. E. Weinstein, M. Feder, and A. V. Oppenheim, "Multi-channel signal separation by decorrelation," IEEE Trans. Speech Audio Processing, vol. 1, no. 4, pp. 405–413, Oct. 1993.
10. T. W. Lee, Independent Component Analysis - Theory and Applications, Kluwer, 1998.
11. M. Kawamoto, A. K. Barros, A. Mansour, K. Matsuoka, and N. Ohnishi, "Real world blind separation of convolved non-stationary signals," in Proc. Workshop Indep. Compon. Anal. Signal. Sep., Jan. 1999, pp. 347–352.
12. X. Sun and S. Douglas, "A natural gradient convolutive blind source separation algorithm for speech mixtures," in Proc. Conference Indep. Compon. Anal. Signal. Sep., Dec. 2001, pp. 59–64.
13. P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," Neurocomputing, vol. 22, pp. 21–34, 1998.
14. S. Ikeda and N. Murata, "A method of ICA in time-frequency domain," in Proc. Workshop Indep. Compon. Anal. Signal. Sep., Jan. 1999, pp. 365–370.
15. R. Aichner, S. Araki, S. Makino, T. Nishikawa, and H. Saruwatari, "Time domain blind source separation of non-stationary convolved signals by utilizing geometric beamforming," in Proc. NNSP, Sept. 2002.
16. J. Anemüller and B. Kollmeier, "Amplitude modulation decorrelation for convolutive blind source separation," in Proc. Workshop Indep. Compon. Anal. Signal. Sep., 2000, pp. 215–220.
17. F. Asano, S. Ikeda, M. Ogawa, H. Asoh, and N. Kitawaki, "A combined approach of array processing and independent component analysis for blind separation of acoustic signals," in Proc. ICASSP, May 2001, vol. 5, pp. 2729–2732.
18. J. Herault and C. Jutten, "Space or time adaptive signal processing by neural network models," in Neural Networks for Computing: AIP Conference Proceedings 151, J. S. Denker, Ed., American Institute of Physics, New York, 1986.
19. C. Jutten and J. Herault, "Blind separation of sources, part I: an adaptive algorithm based on neuromimetic architecture," Signal Processing, vol. 24, pp. 1–10, 1991.
20. P. Comon, C. Jutten, and J. Herault, "Blind separation of sources, part II: problems statement," Signal Processing, vol. 24, pp. 11–20, 1991.
21. E. Sorouchyari, "Blind separation of sources, part III: stability analysis," Signal Processing, vol. 24, pp. 21–29, 1991.
22. A. Cichocki and L. Moszczynski, "A new learning algorithm for blind separation of sources," Electronics Letters, vol. 28, no. 21, pp. 1986–1987, 1992.
23. J. F. Cardoso and A. Souloumiac, "Blind beamforming for non-gaussian signals," IEE Proceedings-F, vol. 140, no. 6, pp. 362–370, Dec. 1993.
24. P. Comon, "Independent component analysis - a new concept?," Signal Processing, vol. 36, no. 3, pp. 287–314, Apr. 1994.
25. A. Cichocki and R. Unbehauen, "Robust neural networks with on-line learning for blind identification and blind separation of sources," IEEE Trans. Circuits and Systems, vol. 43, no. 11, pp. 894–906, 1996.
26. T. W. Lee, M. Girolami, A. J. Bell, and T. J. Sejnowski, "A unifying information-theoretic framework for independent component analysis," Computers and Mathematics with Applications, vol. 31, no. 11, pp. 1–12, Mar. 2000.
27. A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, John Wiley & Sons, 2001.
28. S. Haykin, Unsupervised Adaptive Filtering, John Wiley & Sons, 2000.
29. A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing, John Wiley & Sons, 2002.
30. A. J. Bell and T. J. Sejnowski, "An information-maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, no. 6, pp. 1129–1159, 1995.
31. S. Amari, A. Cichocki, and H. Yang, "A new learning algorithm for blind source separation," in Advances in Neural Information Processing Systems 8, pp. 757–763, MIT Press, 1996.
32. K. Matsuoka, M. Ohya, and M. Kawamoto, "A neural net for blind separation of nonstationary signals," Neural Networks, vol. 8, no. 3, pp. 411–419, 1995.
33. L. Molgedey and H. G. Schuster, "Separation of a mixture of independent signals using time delayed correlations," Physical Review Letters, vol. 72, no. 23, pp. 3634–3636, 1994.
34. A. Belouchrani, K. A. Meraim, J. F. Cardoso, and E. Moulines, "A blind source separation technique based on second order statistics," IEEE Trans. Signal Processing, vol. 45, no. 2, pp. 434–444, Feb. 1997.
35. L. Parra and C. Spence, "Convolutive blind separation of non-stationary sources," IEEE Trans. Speech Audio Processing, vol. 8, no. 3, pp. 320–327, May 2000.
36. S. Amari, "Natural gradient works efficiently in learning," Neural Computation, vol. 10, pp. 251–276, 1998.
37. H. Sawada, R. Mukai, S. Araki, and S. Makino, "Polar coordinate based nonlinear function for frequency-domain blind source separation," in Proc. ICASSP, May 2002, vol. 1, pp. 1001–1004.
38. S. Kurita, H. Saruwatari, S. Kajita, K. Takeda, and F. Itakura, "Evaluation of blind signal separation method using directivity pattern under reverberant conditions," in Proc. ICASSP, June 2000, pp. 3140–3143.
39. L. Parra and C. Alvino, "Geometric source separation: Merging convolutive source separation with geometric beamforming," in Proc. NNSP, Sept. 2001, pp. 273–282.
40. S. Araki, S. Makino, R. Mukai, Y. Hinamoto, T. Nishikawa, and H. Saruwatari, "Equivalence between frequency domain blind source separation and frequency domain adaptive beamforming," in Proc. ICASSP, May 2002, vol. 2, pp. 1785–1788.
41. M. Knaak and D. Filbert, "Acoustical semi-blind source separation for machine monitoring," in Proc. Conference Indep. Compon. Anal. Signal. Sep., Dec. 2001, pp. 361–366.
42. H. Saruwatari, S. Kurita, and K. Takeda, "Blind source separation combining frequency-domain ICA and beamforming," in Proc. ICASSP, May 2001, pp. 2733–2736.
43. O. L. Frost, "An algorithm for linearly constrained adaptive array processing," in Proc. IEEE, Aug. 1972, vol. 60, pp. 926–935.
44. S. Araki, S. Makino, R. Mukai, T. Nishikawa, and H. Saruwatari, "Fundamental limitation of frequency domain blind source separation for convolved mixture of speech," in Proc. Conference Indep. Compon. Anal. Signal. Sep., Dec. 2001, pp. 132–137.
45. S. Gerven and D. Compernolle, "Signal separation by symmetric adaptive decorrelation: stability, convergence, and uniqueness," IEEE Trans. Signal Processing, vol. 43, no. 7, pp. 1602–1612, July 1995.
46. R. Mukai, S. Araki, and S. Makino, "Separation and dereverberation performance of frequency domain blind source separation," in Proc. Conference Indep. Compon. Anal. Signal. Sep., Dec. 2001, pp. 230–235.
8 Adaptive Multichannel Time Delay Estimation Based on Blind System Identification for Acoustic Source Localization

Yiteng (Arden) Huang and Jacob Benesty

Bell Laboratories, Lucent Technologies, Murray Hill, NJ 07974, USA
E-mail: {arden, jb}@research.bell-labs.com

Abstract. Time delay estimation is a difficult problem in a reverberant environment, and the traditional generalized cross-correlation (GCC) methods do not perform well. Recently, a blind channel identification-based adaptive approach called the eigenvalue decomposition (ED) algorithm has been proposed to deal with room reverberation more effectively. The ED algorithm focuses on a system with two channels whose impulse responses can be blindly identified only if they do not share common zeros. This assumption often breaks down for acoustic channels, whose impulse responses are long. In this chapter, the blind channel identification-based time delay estimation approach is generalized to multiple (more than 2) channel systems, and a normalized multichannel frequency-domain LMS algorithm is proposed. The proposed algorithm is more practical since it is less likely for all channels to share a common zero when more channels are available. It is shown, using data recorded in the Varechoic chamber at Bell Labs, that the proposed method performs better than the ED and GCC algorithms.
8.1 Introduction
When a source signal propagates through multiple channels and the resulting signals are captured using multiple sensors, the diversity between different channel outputs can be exploited to develop advanced statistical and array signal processing techniques for better signal acquisition and understanding. Among the various channel diversities, the difference in time of arrival (TDOA) between distinct channels is a principal one. Time delay estimation (TDE) has been an important problem of fundamental interest for speech, radar, and sonar signal processing, as well as for wireless communications. In the current computer and information era, a natural speech interface between human beings and machines is desired in many modern communications and intelligent systems. Consequently, the ability to locate and track an acoustic sound source in a room has become essential. For example, an acoustic source locator would be used to guide an adaptive microphone array beamformer to ensure the quality of the acquired sound and/or to steer a camera to facilitate the delivery of a moving talker's video in a multimedia teleconferencing system. After two decades of continuous research,
the TDE-based approach has become the technique of choice for acoustic source localization, especially in recent digital systems. An accurate TDE method is surely the cornerstone of the success of these systems. Traditionally, the relative TDOA is estimated from the cross-correlation function between the observed channel outputs. The cross-correlation is large when the two examined signals are properly aligned. In order to eliminate the influence introduced by the possible auto-correlation of a source signal, e.g., speech, the channel outputs are pre-whitened with a filter before their cross-correlation function is computed, which leads to the so-called generalized cross-correlation (GCC) technique proposed by Knapp and Carter [1]. The GCC method is so far the most popular technique for TDE because it is computationally simple and an estimate is always made instantaneously. For an ideal acoustic system where only attenuation and delay are taken into account, an impulse would appear at the actual relative TDOA. In practice, however, additive background noise and room reverberation make the peak no longer well-defined, or even no longer dominant, in the calculated cross-correlation function. Many amendments to the GCC algorithm have been proposed [2], [3], [4], [5], but they perform better only in free space with additive noise. Reverberation in a room is a more annoying nuisance for TDE algorithms, as shown clearly in [6]. The cross-correlation based estimators, which use a single-path propagation channel model, have a fundamental weakness in their inability to cope well with reverberant environments [7]. Cepstral prefiltering was suggested to overcome the dispersive effect of room reverberation in [8]. However, the improvement is limited and shortcomings still remain. To the best of our knowledge, all known TDE methods fail in a highly reverberant room. In an earlier study [9], [10], we investigated the TDE problem from a different point of view and developed a blind channel identification (BCI) based method termed adaptive eigenvalue decomposition (AED). In this approach, the acoustic channel from the source to a microphone is characterized by a finite impulse response (FIR) filter. The knowledge of a multichannel system's impulse responses implies a profound understanding of the system and therefore theoretically facilitates the determination of all channel diversities, which of course comprise the relative TDOAs of interest. However, it is not easy to identify a multichannel acoustic system, since in practice the system is generally nonstationary and usually has a long impulse response. Even worse, the source signal is not known a priori and a blind method must be used. Indeed, blindly identifying a multichannel acoustic system is a very difficult task. In the AED algorithm, the cross relation between two microphone outputs is exploited and an error signal is constructed. An efficient frequency-domain adaptive filter is designed to search for the desired channel impulse responses by minimizing the mean value of the error signal power. It was demonstrated that in a moderately reverberant room, the proposed AED algorithm performed better than the GCC methods.
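For reference, a compact frequency-domain GCC sketch. The PHAT (phase transform) weighting used here is one common prewhitening choice, not necessarily the variant discussed above; all names are illustrative.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_lag=None):
    """TDOA via generalized cross-correlation with PHAT prewhitening.
    Returns the delay in seconds; positive when x1 is delayed relative to x2."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    G = X1 * np.conj(X2)
    G /= np.abs(G) + 1e-12            # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(G, n)           # circular cross-correlation
    lags = np.arange(n)
    lags[lags > n // 2] -= n          # map indices to signed lags
    if max_lag is not None:
        keep = np.abs(lags) <= max_lag
        cc, lags = cc[keep], lags[keep]
    return lags[np.argmax(np.abs(cc))] / fs
```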
In general, the performance of a BCI-based TDE method depends considerably on the accuracy of the BCI algorithm applied. Particularly, the more accurately channel impulse responses are estimated, the better the relative TDOAs are found. The AED algorithm considers blind identification of only two channels at a time. For a single-input two-output system, the zeros of the two channels can be close to each other especially when their impulse responses are long, which leads to an ill-conditioned system that is difficult to identify [11]. In the extreme case that the two channels share common zeros, the system becomes unidentifiable and hence the AED algorithm may fail completely. This problem can be alleviated by using more microphone sensors in the system. When more channels are involved, it is less likely for all channels to share a common zero and therefore the BCI algorithm can perform more robustly. As such, when a multi-element microphone array is used for locating a sound source, the relative TDOAs are no longer estimated pair by pair. The multichannel system will be treated as a whole and the BCI algorithm can be globally optimized. While using more microphones can make a system more feasible to identify, it inevitably presents more cross relations to analyze and brings in more parameters to estimate, which results in an explosive increase of complexity for a BCI algorithm. In [12], we constructed an error signal based on the cross relations between different channels in a novel, systematic way for a single-input multiple-output (SIMO) FIR system and the corresponding cost function was concise. We also illustrated how the traditional adaptive filtering techniques can be used to determine the desired channel impulse responses. A class of blind multichannel identification methods in both the time domain [12] and the frequency domain [13] was proposed. Among these, the normalized multichannel frequency-domain LMS (NMCFLMS) algorithm is the most efficient and has the fastest convergence. As shown in [13], the NMCFLMS algorithm was the only one of several known techniques that can successfully identify a long three-channel acoustic system excited by a speech signal. In this chapter, we generalize the AED algorithm from a two-channel to an M -channel (M ≥ 2) system and develop the idea of multichannel TDE. As will be shown, the NMCFLMS algorithm performs better than the AED algorithm for data recorded in the Varechoic chamber at Bell Labs.
8.2 Problem Formulation

8.2.1 Notation
Notation used in this chapter is mostly standard in the time domain, but is specifically defined in the frequency domain. Uppercase and lowercase bold letters denote time-domain matrices and vectors, respectively. In the frequency domain, matrices and vectors are represented respectively by uppercase calligraphic and lowercase bold italic letters, and a vector is further
emphasized by an underbar. The difference in their appearance is illustrated by the following example:

x — a vector in the time domain (bold, lowercase),
X — a matrix in the time domain (bold, uppercase),
x — a vector in the frequency domain (bold italic, lowercase, with an underbar),
X — a matrix in the frequency domain (calligraphic, uppercase).
Following standard convention, the operators E{·}, (·)^*, (·)^T, and (·)^H stand for statistical expectation, complex conjugate, vector/matrix transpose, and Hermitian transpose, respectively. The symbols ∗, ⊗, ⊙, and ∇ denote, respectively, linear convolution, circular convolution, the Schur (element-by-element) product, and the gradient operator. The identity matrix is given by I, whose dimension is either implied by the context or explicitly specified by a subscript.

8.2.2 Signal Model
We have M microphones capturing signals x_i(n) (i = 1, 2, ..., M) that propagate from a single unknown source s(n). As shown in Fig. 8.1, the following single-input multiple-output (SIMO) FIR linear system is considered in this chapter:

\[
x_i(n) = h_{t,i} * s(n) + b_i(n), \quad i = 1, 2, \ldots, M, \tag{8.1}
\]

where h_{t,i} is the true (subscript t) impulse response of the i-th channel and b_i(n) is the additive background noise at the i-th microphone. In vector form, the system equation (8.1) can be expressed as

\[
x_i(n) = H_{t,i} \cdot s(n) + b_i(n), \tag{8.2}
\]
where

\[
x_i(n) = [x_i(n)\;\; x_i(n-1)\;\; \cdots\;\; x_i(n-L+1)]^T,
\]
\[
H_{t,i} = \begin{bmatrix} h_{t,i,0} & h_{t,i,1} & \cdots & h_{t,i,L-1} & 0 & \cdots & 0 \\ 0 & h_{t,i,0} & \cdots & h_{t,i,L-2} & h_{t,i,L-1} & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & h_{t,i,0} & h_{t,i,1} & \cdots & h_{t,i,L-1} \end{bmatrix},
\]
\[
s(n) = [s(n)\;\; s(n-1)\;\; \cdots\;\; s(n-L+1)\;\; \cdots\;\; s(n-2L+2)]^T,
\]
\[
b_i(n) = [b_i(n)\;\; b_i(n-1)\;\; \cdots\;\; b_i(n-L+1)]^T,
\]

and L is set, by assumption, to the length of the longest channel impulse response. The channel parameter matrix H_{t,i} is of dimension L × (2L−1) and is constructed from the channel's impulse response

\[
h_{t,i} = [h_{t,i,0}\;\; h_{t,i,1}\;\; \cdots\;\; h_{t,i,L-1}]^T. \tag{8.3}
\]
Fig. 8.1. Illustration of the relationships between the input source s(n) and the microphone outputs x_i(n), i = 1, 2, ..., M, in a single-input multiple-output FIR system.
8.2.3 Channel Properties and Assumptions
Two assumptions (one on the channel diversity and the other on the input source signal) are made throughout this chapter to guarantee an identifiable system [14]:

1. The polynomials formed from h_{t,i}, i = 1, 2, ..., M, are co-prime, i.e., the channel transfer functions H_{t,i}(z) do not share any common zeros;
2. The autocorrelation matrix R_{ss} = E{s(n)s^T(n)} of the source signal is of full rank.

In addition, the additive noise components in different channels are assumed to be uncorrelated with the source signal, even though they might be mutually dependent. For a multipath channel, reverberation components are most of the time weaker than the signal component propagating directly from the source to the corresponding sensor. This is particularly true for an acoustic channel, where waveform energy is absorbed by room surfaces and waveform magnitude is attenuated by wall reflection. As such, the signal propagating through the direct path is one of the dominant components of the impulse response, although its magnitude may not be the largest, since two or more reverberant signals via multipaths of equal delay could add coherently. Hence, the channel propagation delay τ_i in samples can be robustly determined as the smallest delay among the Q largest components:

\[
\tau_i = \min_q \left\{ \arg\max_l^{(q)} |h_{t,i,l}| \right\}, \quad i = 1, 2, \ldots, M, \; q = 1, 2, \ldots, Q, \tag{8.4}
\]

where max^{(q)} computes the q-th largest element. The relative time delay of arrival between the i-th and the j-th channel is then defined as

\[
\tau_{ij} = \tau_i - \tau_j, \quad i, j = 1, 2, \ldots, M. \tag{8.5}
\]
Therefore, after a multichannel system is properly identified, the relative TDOAs between distinct channels can be easily determined by examining the channel impulse responses.
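A direct sketch of (8.4) and (8.5), assuming the (estimated) channel impulse responses are available as arrays; the value of Q is an illustrative choice.

```python
import numpy as np

def channel_delay(h, Q=8):
    """Direct-path delay per (8.4): the earliest of the Q largest taps."""
    idx = np.argsort(np.abs(h))[-Q:]   # indices of the Q largest-magnitude taps
    return int(idx.min())

def relative_tdoa(h_i, h_j, Q=8):
    """Relative TDOA tau_ij = tau_i - tau_j per (8.5), in samples."""
    return channel_delay(h_i, Q) - channel_delay(h_j, Q)
```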
8.3 Generalized Multichannel Time Delay Estimation
In order to deal efficiently with room reverberation in time delay estimation, a novel approach called the adaptive eigenvalue decomposition (AED) algorithm was proposed in [9] and [10]. The algorithm is based on blind identification of a two-channel system. For a system with only two channels, it is very likely that a zero is shared between the channels, or that some zeros of the two channels are close, especially when the channel impulse responses are long, as in the case of an acoustic system. When such a situation is encountered in practice, the system is impossible or difficult to identify using only second-order statistics of the channel outputs, and the AED algorithm would fail. An effective way to overcome this shortcoming is to employ more channels in the system, since it is less likely for all channels to share a common zero when the number of sensors is large. However, the identification problem becomes more challenging and complicated as the number of channels increases. It is by no means trivial to generalize the BCI-based TDE approach from a two-channel to a multichannel system. In this section, a systematic method is presented to perform multichannel TDE.

8.3.1 The Principle
Basically, a multichannel system can be blindly identified because of the channel diversity, which makes the outputs of different channels distinct though related. Following the fact that

\[
x_i(n) * h_{t,j} = s(n) * h_{t,i} * h_{t,j} = x_j(n) * h_{t,i}, \quad i, j = 1, 2, \ldots, M, \tag{8.6}
\]
a cross-relation between the i-th and j-th channel outputs, in the absence of noise, can be formulated as

\[
x_i^T(n) h_{t,j} = x_j^T(n) h_{t,i}, \quad i, j = 1, 2, \ldots, M, \; i \neq j. \tag{8.7}
\]
When noise is present or the channel impulse responses are improperly modeled, the left- and right-hand sides of (8.7) are generally not equal, and the inequality can be used to define an a priori error signal as follows:

\[
e_{ij}(n+1) = \frac{x_i^T(n+1) h_j(n) - x_j^T(n+1) h_i(n)}{\| h(n) \|}, \quad i, j = 1, 2, \ldots, M, \tag{8.8}
\]

where h_i(n) is the model filter for the i-th channel at time n and

\[
h(n) = [h_1^T(n)\;\; h_2^T(n)\;\; \cdots\;\; h_M^T(n)]^T.
\]
The model filter is normalized in order to avoid a trivial solution whose elements are all zeros. Based on the error signal defined here, a cost function at time n+1 is given by

\[
J(n+1) = \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} e_{ij}^2(n+1). \tag{8.9}
\]
An adaptive algorithm is then derived to efficiently determine the model filters h_i, i = 1, 2, ..., M, that minimize this cost function and therefore would be good estimates of h_{t,i}/‖h_t‖ (i = 1, 2, ..., M). The error signal (8.8) is linear in the channel impulse responses, which facilitates computing the gradient of J(n+1) and hence developing an adaptive multichannel algorithm using various traditional adaptive schemes in both the time and the frequency domains.
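The error (8.8) and the cost (8.9) are straightforward to compute. A sketch, assuming the most recent L samples of each channel output are held in x_blocks:

```python
import numpy as np

def cross_relation_cost(h_list, x_blocks):
    """Cost (8.9) built from the a priori error signals of (8.8).
    h_list: M model filters of length L; x_blocks: M observation
    vectors x_i(n+1), each of length L."""
    hnorm = np.linalg.norm(np.concatenate(h_list))
    M, J = len(h_list), 0.0
    for i in range(M - 1):
        for j in range(i + 1, M):
            e = (x_blocks[i] @ h_list[j] - x_blocks[j] @ h_list[i]) / hnorm
            J += e ** 2
    return J
```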
8.3.2 Time-Domain Multichannel LMS Approach
Recently, a simple and straightforward adaptive approach to blind multichannel identification has been developed in the time domain [12]. The multichannel LMS (MCLMS) algorithm updates the estimate of the channel impulse responses along the opposite direction of the gradient of the cost function (8.9):

\[
h(n+1) = h(n) - \mu \nabla J(n+1), \tag{8.10}
\]

where μ is a small positive step size. As shown in [12], the gradient of J(n+1) is determined as

\[
\nabla J(n+1) = \frac{\partial J(n+1)}{\partial h(n)} = \frac{2 \left[ \tilde{R}(n+1) h(n) - J(n+1) h(n) \right]}{\| h(n) \|^2}, \tag{8.11}
\]

where

\[
\tilde{R}(n+1) = \begin{bmatrix} \sum_{i \neq 1} \tilde{R}_{x_i x_i}(n+1) & -\tilde{R}_{x_2 x_1}(n+1) & \cdots & -\tilde{R}_{x_M x_1}(n+1) \\ -\tilde{R}_{x_1 x_2}(n+1) & \sum_{i \neq 2} \tilde{R}_{x_i x_i}(n+1) & \cdots & -\tilde{R}_{x_M x_2}(n+1) \\ \vdots & \vdots & \ddots & \vdots \\ -\tilde{R}_{x_1 x_M}(n+1) & -\tilde{R}_{x_2 x_M}(n+1) & \cdots & \sum_{i \neq M} \tilde{R}_{x_i x_i}(n+1) \end{bmatrix},
\]

and

\[
\tilde{R}_{x_i x_j}(n+1) = x_i(n+1) x_j^T(n+1), \quad i, j = 1, 2, \ldots, M.
\]
If the model filter is always normalized after each update, then a simplified algorithm is obtained:

\[
h(n+1) = \frac{h(n) - 2\mu \left[ \tilde{R}(n+1) h(n) - J(n+1) h(n) \right]}{\left\| h(n) - 2\mu \left[ \tilde{R}(n+1) h(n) - J(n+1) h(n) \right] \right\|}. \tag{8.12}
\]
/ . h(n + 1) = / (8.12) / ˜ + 1)h(n) − J(n + 1)h(n) / /h(n) − 2μ R(n / It was shown theoretically and demonstrated empirically in [12] that the MCLMS algorithm converges in the mean to the desired impulse responses. 8.3.3
8.3.3 Frequency-Domain Adaptive Algorithms
The time-domain MCLMS algorithm is simple, and its derivation clearly explains how an adaptive approach can be developed. Even though it converges steadily, its inefficiency and slow convergence make it unsatisfactory for multichannel time delay estimation. In this section, we will develop the adaptive blind channel identification algorithm in the frequency domain. Thanks to the fast Fourier transform (FFT), the frequency-domain approaches are more efficient than their time-domain counterparts. With proper normalization, the model filter coefficients proceed to their final solutions independently and at a uniform rate, which dramatically accelerates the convergence of the adaptive algorithm.

To begin, we define an intermediate signal $y_{ij} = x_i * h_j$, the convolution of the $i$-th channel output $x_i$ and the $j$-th model filter $h_j$. In vector form, a block of such a signal can be expressed in the frequency domain as
\[
\underline{y}_{ij}(m+1) = W_{L\times 2L}^{01}\, D_{x_i}(m+1)\, W_{2L\times L}^{10}\, \underline{h}_j(m), \tag{8.13}
\]
where
\[
\begin{aligned}
W_{L\times 2L}^{01} &= F_{L\times L} \left[\, 0_{L\times L} \;\; I_{L\times L} \,\right] F_{2L\times 2L}^{-1}, \\
D_{x_i}(m+1) &= \mathrm{diag}\left\{ F_{2L\times 2L} \cdot x_i(m+1)_{2L\times 1} \right\}, \\
W_{2L\times L}^{10} &= F_{2L\times 2L} \left[\, I_{L\times L} \;\; 0_{L\times L} \,\right]^T F_{L\times L}^{-1}, \\
\underline{h}_j(m) &= F_{L\times L}\, h_j(m), \\
x_i(m+1)_{2L\times 1} &= \left[\, x_i(mL) \;\; x_i(mL+1) \;\; \cdots \;\; x_i(mL+2L-1) \,\right]^T,
\end{aligned} \tag{8.14}
\]
$F_{L\times L}$ and $F_{L\times L}^{-1}$ are respectively the Fourier and inverse Fourier matrices of size $L \times L$, and $m$ is the block time index. Then a block of the error signal based on the cross-relation between the $i$-th and the $j$-th channel is determined in the frequency domain as
\[
e_{ij}(m+1) = \underline{y}_{ij}(m+1) - \underline{y}_{ji}(m+1) = W_{L\times 2L}^{01} \left[ D_{x_i}(m+1)\, W_{2L\times L}^{10}\, \underline{h}_j(m) - D_{x_j}(m+1)\, W_{2L\times L}^{10}\, \underline{h}_i(m) \right]. \tag{8.15}
\]
Continuing, we construct a (frequency-domain) cost function at the $(m+1)$-th block time index as follows:
\[
J_f(m+1) = \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} e_{ij}^H(m+1)\, e_{ij}(m+1). \tag{8.16}
\]
Therefore, by minimizing $J_f(m+1)$, the model filter in the frequency domain can be updated as
\[
\underline{h}_k(m+1) = \underline{h}_k(m) - \mu_f \frac{\partial J_f(m+1)}{\partial \underline{h}_k^*(m)}, \quad k = 1, 2, \ldots, M, \tag{8.17}
\]
where $\mu_f$ is a small positive step size. It can be shown [13] that
\[
\frac{\partial J_f(m+1)}{\partial \underline{h}_k^*(m)} = \sum_{i=1}^{M} \left[ W_{L\times 2L}^{01}\, D_{x_i}(m+1)\, W_{2L\times L}^{10} \right]^H e_{ik}(m+1). \tag{8.18}
\]
Substituting (8.18) into (8.17) yields the multichannel frequency-domain LMS (MCFLMS) algorithm:
\[
\underline{h}_k(m+1) = \underline{h}_k(m) - \mu_f\, W_{L\times 2L}^{10} \sum_{i=1}^{M} D_{x_i}^*(m+1)\, W_{2L\times L}^{01}\, e_{ik}(m+1), \tag{8.19}
\]
where
\[
W_{L\times 2L}^{10} = F_{L\times L} \left[\, I_{L\times L} \;\; 0_{L\times L} \,\right] F_{2L\times 2L}^{-1}, \qquad
W_{2L\times L}^{01} = F_{2L\times 2L} \left[\, 0_{L\times L} \;\; I_{L\times L} \,\right]^T F_{L\times L}^{-1}.
\]
The constraint ensuring that the adaptive algorithm does not converge to a trivial solution with all-zero elements can be applied in either the frequency or the time domain. Since in the application of time delay estimation the delay will be estimated in the time domain, and the time-domain model filter coefficients have to be computed anyway, we will enforce the constraint in the time domain for convenience.

The MCFLMS is computationally more efficient than a multichannel time-domain block LMS algorithm. Even though they are implemented in different domains, the MCFLMS and its time-domain counterpart are equivalent in performance. The convergence of the MCFLMS algorithm is nevertheless slow because of the nonuniform convergence rates of the filter coefficients and the cross-coupling between them. To accelerate convergence, we will use Newton's method to develop the normalized MCFLMS (NMCFLMS) algorithm. Using Newton's method, we update the model filter coefficients according to:
\[
\underline{h}_k(m+1) = \underline{h}_k(m) - \mu_f \left[ \frac{\partial}{\partial \underline{h}_k^*(m)} \left( \frac{\partial J_f(m+1)}{\partial \underline{h}_k^T(m)} \right) \right]^{-1} \frac{\partial J_f(m+1)}{\partial \underline{h}_k^*(m)}, \tag{8.20}
\]
where the Hessian matrix can be evaluated as
\[
\frac{\partial}{\partial \underline{h}_k^*(m)} \left( \frac{\partial J_f(m+1)}{\partial \underline{h}_k^T(m)} \right) = \sum_{i=1, i\neq k}^{M} W_{L\times 2L}^{10}\, D_{x_i}^*(m+1)\, W_{2L\times 2L}^{01}\, D_{x_i}(m+1)\, W_{2L\times L}^{10}, \tag{8.21}
\]
and
\[
W_{2L\times 2L}^{01} = W_{2L\times L}^{01}\, W_{L\times 2L}^{01} = F_{2L\times 2L}
\begin{bmatrix}
0_{L\times L} & 0_{L\times L} \\
0_{L\times L} & I_{L\times L}
\end{bmatrix}
F_{2L\times 2L}^{-1}.
\]
As shown in [13], when $L$ is large, $2 W_{2L\times 2L}^{01}$ can be well approximated by the identity matrix:
\[
2 W_{2L\times 2L}^{01} \approx I_{2L\times 2L}. \tag{8.22}
\]
Thereafter, (8.21) becomes
\[
\frac{\partial}{\partial \underline{h}_k^*(m)} \left( \frac{\partial J_f(m+1)}{\partial \underline{h}_k^T(m)} \right) \approx \frac{1}{2}\, W_{L\times 2L}^{10}\, P_k(m+1)\, W_{2L\times L}^{10}, \tag{8.23}
\]
where
\[
P_k(m+1) = \sum_{i=1, i\neq k}^{M} D_{x_i}^*(m+1)\, D_{x_i}(m+1), \quad k = 1, 2, \ldots, M.
\]
Substituting (8.18) and (8.23) into (8.20) and pre-multiplying by $W_{2L\times L}^{10}$ produces the constrained NMCFLMS algorithm:
\[
\begin{aligned}
\underline{h}_k^{10}(m+1) &= \underline{h}_k^{10}(m) - 2\mu_f\, W_{2L\times L}^{10} \left[ W_{L\times 2L}^{10}\, P_k(m+1)\, W_{2L\times L}^{10} \right]^{-1} W_{L\times 2L}^{10} \sum_{i=1}^{M} D_{x_i}^*(m+1)\, \underline{e}_{ik}^{01}(m+1) \\
&= \underline{h}_k^{10}(m) - 2\mu_f\, W_{2L\times 2L}^{10}\, P_k^{-1}(m+1) \sum_{i=1}^{M} D_{x_i}^*(m+1)\, \underline{e}_{ik}^{01}(m+1), \tag{8.24}
\end{aligned}
\]
where
\[
\begin{aligned}
\underline{h}_k^{10}(m) &= W_{2L\times L}^{10}\, \underline{h}_k(m) = F_{2L\times 2L} \left[\, h_k^T(m) \;\; 0_{1\times L} \,\right]^T, \\
\underline{e}_{ik}^{01}(m+1) &= W_{2L\times L}^{01}\, e_{ik}(m+1) = F_{2L\times 2L} \left[\, 0_{1\times L} \;\; e_{ik}^T(m+1) \,\right]^T, \\
W_{2L\times 2L}^{10} &= W_{2L\times L}^{10}\, W_{L\times 2L}^{10} = F_{2L\times 2L}
\begin{bmatrix}
I_{L\times L} & 0_{L\times L} \\
0_{L\times L} & 0_{L\times L}
\end{bmatrix}
F_{2L\times 2L}^{-1},
\end{aligned}
\]
and the relation
\[
W_{2L\times L}^{10} \left[ W_{L\times 2L}^{10}\, P_k(m+1)\, W_{2L\times L}^{10} \right]^{-1} W_{L\times 2L}^{10} = W_{2L\times 2L}^{10}\, P_k^{-1}(m+1)
\]
can be justified by post-multiplying both sides of the expression by $P_k(m+1)\, W_{2L\times L}^{10}$ and recognizing that $W_{2L\times 2L}^{10}\, W_{2L\times L}^{10} = W_{2L\times L}^{10}$. If the matrix $2 W_{2L\times 2L}^{10}$ is approximated by the identity matrix, similarly to (8.22), we finally deduce the unconstrained NMCFLMS algorithm:
\[
\underline{h}_k^{10}(m+1) = \underline{h}_k^{10}(m) - \mu_f\, P_k^{-1}(m+1) \sum_{i=1}^{M} D_{x_i}^*(m+1)\, \underline{e}_{ik}^{01}(m+1), \tag{8.25}
\]
where the normalization matrix $P_k(m+1)$ is diagonal, so its inverse is straightforward to find. Again, the unit-norm constraint will be enforced on the model filter coefficients in the time domain.
8.3.4 Algorithm Implementation
When the channel impulse responses are long, as in the multichannel acoustic systems of interest, blind identification is not easy. For the adaptive algorithms developed in this chapter, it takes a long time to determine the filter coefficients of the reverberant paths. However, in the application of time delay estimation, the goal is not to accurately estimate the system impulse responses. As long as the direct path of each channel is located, the time delay can be found and the problem is successfully solved. Even though the proposed adaptive algorithms would converge to the desired system impulse responses with an arbitrary initialization, deliberately selected initial model filter coefficients make the direct path in each channel become dominant earlier during adaptation. In this chapter, we place a peak at tap $L/2$ of the first channel and initialize all other model filter coefficients to zero, i.e.,
\[
h_1 = \left[\, \underbrace{0 \; \cdots \; 0}_{L/2-1} \;\; 1 \;\; \underbrace{0 \; \cdots \; 0}_{L/2} \,\right]^T, \qquad h_i = 0, \quad i = 2, 3, \ldots, M. \tag{8.26}
\]
In blind channel identification, a group delay does not affect the cross-relations between different channels, as can be clearly seen from (8.7). Therefore, two sets of model filter coefficients that differ only in group delay are equivalent solutions for time delay estimation. Among a group of such equivalent solutions, the initialization usually determines to which one an adaptive algorithm converges. For the initialization given in (8.26), the peak in the first channel stays dominant and the direct paths of the other channels gradually become clear in the process of adaptation. Since the peak is placed in the middle of the first model filter's impulse response, both positive and negative time delays of the other channels relative to the first channel can be easily accommodated.

From the viewpoint of system identification, stationary white noise would be a good source signal to fully excite the system's impulse responses. However, in time delay estimation for acoustic source localization, the source
signal is speech, which is neither white nor stationary. Therefore the power spectrum of the multiple channel outputs changes considerably with time. In the MCFLMS algorithm, the correction applied to the model filter in each update is approximately proportional to the power spectrum $P_k(m+1)$; this can be seen by substituting (8.15) into (8.18) and using the approximation (8.22). When the channel outputs are large, gradient noise amplification may be experienced. With the normalization of the MCFLMS correction by $P_k(m+1)$ in the NMCFLMS algorithm, this noise amplification problem is diminished and the variability of the convergence rates due to the change of signal level is eliminated. In order to estimate a more stable power spectrum, a recursive scheme is employed in the implementation:
\[
P_k(m+1) = \lambda P_k(m) + (1-\lambda) \sum_{i=1, i\neq k}^{M} D_{x_i}^*(m+1)\, D_{x_i}(m+1), \quad k = 1, 2, \ldots, M, \tag{8.27}
\]
where $\lambda$ is a forgetting factor that may appropriately be set as $\lambda = [1 - 1/(3L)]^L$ for the NMCFLMS algorithm.

Although the NMCFLMS algorithm bypasses the problem of noise amplification, a similar problem occurs when the channel outputs become too small, as in periods of silence. A remedy, therefore, is to insert a small positive number $\delta$ into the normalization, which leads to the following modification of the unconstrained NMCFLMS algorithm:
\[
\underline{h}_k^{10}(m+1) = \underline{h}_k^{10}(m) - \mu_f \left[ P_k(m) + \delta I_{2L\times 2L} \right]^{-1} \sum_{i=1}^{M} D_{x_i}^*(m+1)\, \underline{e}_{ik}^{01}(m+1), \quad k = 1, 2, \ldots, M. \tag{8.28}
\]
The NMCFLMS algorithm is summarized in Table 8.1.
8.4 Simulations
In order to evaluate the performance of the proposed normalized multichannel frequency-domain LMS (NMCFLMS) algorithm for multichannel time delay estimation, extensive Monte Carlo simulations were carried out; their results are presented in this section. For comparison, the adaptive eigenvalue decomposition (AED) [9], [10] and the phase transform (PHAT) [1] algorithms are also studied.

8.4.1 Experimental Setup
The measurements used in this chapter were made in the Varechoic chamber at Bell Labs [15]. A diagram of the floor plan layout is shown in Fig. 8.2.
Table 8.1 The constrained normalized multichannel frequency-domain LMS (CNMCFLMS) adaptive algorithm for the blind identification of a single-input multiple-output FIR system.

Parameters: $h = \left[\, h_1^T \;\; h_2^T \;\; \cdots \;\; h_M^T \,\right]^T$, the adaptive filter coefficients; $\mu_f$, the step size; $\delta$, the regularization factor.

Initialization: $h_1(0) = \left[\, \underbrace{0 \cdots 0}_{L/2-1} \;\; 1 \;\; \underbrace{0 \cdots 0}_{L/2} \,\right]^T$; $h_k(0) = 0$, $k = 2, 3, \ldots, M$.

Update: for $m = 0, 1, \ldots$ compute
(a) $\underline{h}_k^{10}(m) = \mathrm{FFT}_{2L}\{ [\, h_k^T(m) \;\; 0_{1\times L} \,]^T \}$, $k = 1, 2, \ldots, M$;
(b) construct $x_i(m+1)_{2L\times 1}$, $i = 1, 2, \ldots, M$, according to (8.14);
(c) $\underline{x}_i(m+1)_{2L\times 1} = \mathrm{FFT}_{2L}\{ x_i(m+1)_{2L\times 1} \}$;
(d) $p_k(m+1) = \dfrac{1}{2L} \left[ \sum_{i=1}^{M} \underline{x}_i^H(m+1)\, \underline{x}_i(m+1) \right] 1_{2L\times 1}$ for $m = 0$, and $p_k(m+1) = \lambda\, p_k(m) + (1-\lambda) \sum_{i=1, i\neq k}^{M} \underline{x}_i^*(m+1) \odot \underline{x}_i(m+1)$ for $m > 0$;
(e) take the reciprocal of the elements of $p_k(m+1) + \delta 1_{2L\times 1}$ to get $p_k^{-1}(m+1)_{2L\times 1}$;
(f) $\tilde{e}_{ij}(m+1)_{2L\times 1} = \underline{x}_i(m+1) \odot \underline{h}_j^{10}(m) - \underline{x}_j(m+1) \odot \underline{h}_i^{10}(m)$ for $i \neq j$, and $0_{2L\times 1}$ for $i = j$ ($i, j = 1, 2, \ldots, M$);
(g) $\tilde{e}_{ij}(m+1)_{2L\times 1} = \mathrm{IFFT}_{2L}\{ \tilde{e}_{ij}(m+1)_{2L\times 1} \}$;
(h) $e_{ij}(m+1) = W_{L\times 2L}^{01}\, \tilde{e}_{ij}(m+1)_{2L\times 1}$ (trim, keeping the last $L$ elements);
(i) $\underline{e}_{ij}^{01}(m+1) = \mathrm{FFT}_{2L}\{ [\, 0_{1\times L} \;\; e_{ij}^T(m+1) \,]^T \}$;
(j) $\Delta \underline{h}_k^{10}(m) = \sum_{i=1}^{M} \underline{x}_i^*(m+1) \odot \underline{e}_{ik}^{01}(m+1) \odot p_k^{-1}(m+1)$, $k = 1, 2, \ldots, M$;
(k) $\Delta h_k^{10}(m) = \mathrm{IFFT}_{2L}\{ \Delta \underline{h}_k^{10}(m) \}$, $k = 1, 2, \ldots, M$;
(l) $h_k^{10}(m+1) = h_k^{10}(m) - \mu_f\, \Delta h_k^{10}(m)$, $k = 1, 2, \ldots, M$;
(m) trim $h_k^{10}(m+1)_{2L\times 1}$, keeping the first $L$ elements, to get $h_k(m+1)_{L\times 1}$;
(n) $h(m+1) = h(m+1)/\|h(m+1)\|$ (impose the unit-norm constraint).

NOTES: $\mathrm{FFT}_{2L}\{\cdot\}$ and $\mathrm{IFFT}_{2L}\{\cdot\}$ are 2L-point fast Fourier and inverse fast Fourier transforms, respectively; $\odot$ denotes the element-wise product.
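For readers who prefer running code, the following Python sketch implements the regularized unconstrained update (8.28) with the time-domain unit-norm constraint of Table 8.1. It is a minimal illustration under assumed conventions (real-valued signals, block-synchronous processing); the function name, array layout, and default parameter values are not from the original text.

import numpy as np

def nmcflms(x, L, mu_f=0.2, lam=None, delta=None):
    # x: (M, T) array of channel outputs; returns (M, L) model filters.
    M, T = x.shape
    if lam is None:
        lam = (1.0 - 1.0 / (3 * L)) ** L       # forgetting factor suggested in the text
    h = np.zeros((M, L))
    h[0, L // 2] = 1.0                         # initialization (8.26)
    p = None
    for m in range(T // L - 1):
        xb = np.fft.fft(x[:, m * L:(m + 2) * L], axis=1)           # step (c), (M, 2L)
        hf = np.fft.fft(np.hstack([h, np.zeros((M, L))]), axis=1)  # step (a)
        power = np.abs(xb) ** 2
        if p is None:
            p = np.full((M, 2 * L), power.sum() / (2 * L))         # step (d), m = 0
            if delta is None:
                delta = p[0, 0] / 5.0  # regularization tied to first-frame power (Sect. 8.4.1)
        else:
            for k in range(M):                                     # recursion (8.27)
                p[k] = lam * p[k] + (1 - lam) * (power.sum(axis=0) - power[k])
        dh = np.zeros((M, 2 * L), dtype=complex)
        for i in range(M):
            for k in range(M):
                if i == k:
                    continue
                e = np.fft.ifft(xb[i] * hf[k] - xb[k] * hf[i])     # steps (f), (g)
                e[:L] = 0.0                                        # keep last L samples, step (h)
                dh[k] += np.conj(xb[i]) * np.fft.fft(e) / (p[k] + delta)  # steps (e), (i), (j)
        h10 = np.hstack([h, np.zeros((M, L))]) - mu_f * np.fft.ifft(dh, axis=1).real
        h = h10[:, :L]                          # step (m): first L taps
        h /= np.linalg.norm(h)                  # step (n): unit-norm constraint
    return h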
Fig. 8.2. Varechoic chamber floor plan (coordinate values measured in meters); loudspeaker sources located, with respect to the microphone array, at about 45° [position S1 (0.337, 3.162, 1.4)] and broadside [position S2 (3.337, 1.162, 1.4)]; microphones at positions M1 (2.437, 5.6, 1.4), M2 (2.937, 5.6, 1.4), M3 (3.437, 5.6, 1.4), M4 (3.937, 5.6, 1.4), and M5 (4.437, 5.6, 1.4).
For convenience, positions in the floor plan will be designated by (x, y) coordinates with reference to the southwest corner and corresponding to meters along the (South, West) walls. The chamber, of size 6.7 m × 6.1 m × 2.9 m (x × y × z), is a room with 368 electronically controlled panels that vary the acoustic absorption of the walls, floor, and ceiling [16]. Each panel consists of two perforated sheets whose holes, if aligned, expose sound-absorbing material behind, but if shifted to misalign, form a highly reflective surface. The panels are individually controlled so that the holes on one particular panel are either fully open (absorbing state) or fully closed (reflective state). Therefore, by varying the binary state of each panel in any combination, 2^368 different room characteristics can be simulated. In this chapter, two different panel configurations were selected, with 75% and 0% of the panels open; the corresponding 60 dB reverberation times T60 in the 20–4000 Hz band are approximately 300 ms and 585 ms, respectively. A linear array consisting of five omnidirectional microphones was employed in the measurement; the spacing between adjacent microphones is about 50 cm. The array was mounted 1.4 m above the floor, parallel to the North wall and at a distance of 50 cm from it. The five microphone positions are denoted M1 (2.437, 5.6, 1.4), M2 (2.937, 5.6, 1.4), M3 (3.437, 5.6, 1.4), M4 (3.937, 5.6, 1.4), and M5 (4.437, 5.6, 1.4).
Fig. 8.3. A speech signal used as the source and sampled at 8 kHz.
The source was simulated by placing a loudspeaker in two different positions: 45° and broadside, denoted S1 (0.337, 3.162, 1.4) and S2 (3.337, 1.162, 1.4), respectively. For each source location, measurements were made for both of the considered panel configurations, which produced four different room configurations. The transfer functions of the acoustic channels between loudspeakers and microphones were measured at a 48 kHz sampling rate. The obtained channel impulse responses were then downsampled to an 8 kHz sampling rate and truncated to 4096 samples. These measured impulse responses are treated as the actual impulse responses in the TDE experiments.

The source signal is a sequence of clean speech, sampled at 8 kHz and of 2 minutes duration. The first minute of speech was spoken by a male speaker and the second by a female speaker. The signal waveform is shown in Fig. 8.3. The multichannel system output is computed by convolving the speech source with the corresponding measured channel impulse responses and adding zero-mean white Gaussian noise to the results for a given signal-to-noise ratio (SNR).

For the PHAT algorithm, a 256 ms Kaiser window was used for the analysis frame. For the AED and the proposed NMCFLMS algorithms, the length of the adaptive model filter for each channel was taken as L = 2048, equivalently 256 ms, which is only half the length of the actual channel impulse responses. Each TDE algorithm yields an estimate when a frame of channel outputs is available. Therefore, with a 2-minute sequence, 468 time delay estimates are computed and the statistical accuracy of each TDE algorithm can be evaluated. For each room configuration and a given SNR, the performance of each time delay estimator is calculated by averaging the results of 30 Monte Carlo runs. For the AED algorithm, the step size is 0.1 and the forgetting factor γ = 0.72 for adaptively computing the power spectrum of the channel outputs is fixed. For the NMCFLMS algorithm, a step size μf = 0.6 is used, and the regularization factor δ was initially set to one fifth of the total power over all channels at the first frame. For both the AED and NMCFLMS algorithms,
Q = 5 dominant peaks in the estimated impulse responses are extracted for determining the direct path and hence the channel's TDOA.

8.4.2 Performance Measure
To better evaluate the performance of a time delay estimator, it is helpful to classify each estimate into one of two comprehensive categories: the class of successes and the class of failures [17], [6]. An estimate $\hat{\tau}_{ij}$ for which the absolute error $|\hat{\tau}_{ij} - \tau_{ij}|$ exceeds $T_c/2$, where $T_c$ is the signal correlation time, is identified as a failure, or an anomaly, following the terminology used in [6]. Otherwise, an estimate is deemed a success, or a nonanomalous one. In this chapter, $T_c$ is defined as the width of the main lobe of the source signal autocorrelation function (taken between the −3 dB points). For the particular source signal used here, $T_c$ is about 5.2 samples on average at the 8 kHz sampling rate. After time delay estimates are classified into the two classes, three statistics can be formed: the percentage, bias, and variance of successful estimates over the ensemble of all TDE results. In our experiments, the source signal correlation time is small and the successful interval $[\tau_{ij} - T_c/2,\; \tau_{ij} + T_c/2]$ is quite narrow. Therefore, the bias and variance of successful estimates are not very meaningful. In the interest of brevity and clarity, only the percentage of successful estimates is used as a performance measure of estimation accuracy in this chapter.

For a multichannel TDE problem, where more than two channels are available and more than one time delay needs to be estimated, an algorithm is not satisfactory if it is not accurate over all sensor pairs. In order to fairly and completely evaluate the performance of a TDE algorithm on the multichannel TDE problem, the percentage of successful overall estimates will also be presented, in addition to the percentage of successful individual estimates for every distinct pair of available sensors. A set of time delay estimates for all sensor pairs at a given time is deemed successful if all of the individual estimates are successful; the classification is illustrated by the short sketch below.
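A minimal numerical sketch of this success/failure classification follows, assuming the estimates are arranged as a frames × pairs array; the function name and layout are illustrative.

import numpy as np

def tde_success_rates(tau_hat, tau_true, Tc=5.2):
    # tau_hat : (n_frames, n_pairs) delay estimates (in samples)
    # tau_true: (n_pairs,) true delays; Tc: signal correlation time
    ok = np.abs(tau_hat - tau_true) <= Tc / 2.0   # success per pair and frame
    individual = 100.0 * ok.mean(axis=0)          # percent successful per sensor pair
    overall = 100.0 * ok.all(axis=1).mean()       # all pairs successful at once
    return individual, overall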
8.4.3 Experimental Results
Table 8.2 reports the summary statistics, in terms of the percentage of successful individual as well as overall estimates, for each investigated TDE algorithm with the 45° source at S1 shown in Fig. 8.2. As seen, when the room is less reverberant (T60 = 300 ms), the three TDE algorithms can accurately determine the relative TDOAs between all microphone pairs at each examined SNR. But as more panels are closed and the reverberation time increases to 585 ms, the performance of all estimators deteriorates dramatically, especially when the SNR is low. It is clear that the proposed NMCFLMS algorithm is better than either AED or PHAT for this room configuration. When the SNR is only 10 dB, 16.27% of the overall estimates are successful using
the NMCFLMS. The statistic improves with the SNR and reaches 63.38% at 40 dB SNR. However, such a trend does not hold for the AED and PHAT algorithms, as shown in Panel (a) of Fig. 8.4; unfortunately, they fail to determine the time delays of arrival for several microphone pairs, which results in a poor (almost zero) percentage of successful overall estimates even when the SNR is quite high. In particular, both the AED and PHAT algorithms have trouble estimating the time delay between microphones 3 and 5, while the NMCFLMS appears less vulnerable. It is remarkable that the NMCFLMS converges to the true TDOAs after adaptation over only one frame, equivalently 256 ms.

The second experiment considers the source at the broadside location S2. The summary statistics are presented in Table 8.3. Similarly, when the reverberation time is short, it is easy for these algorithms to estimate the relative TDOAs between different microphones, and noise effects are small. When the room reverberation time is long, the NMCFLMS is more consistent and performs better than the AED and PHAT algorithms. The PHAT algorithm is particularly sensitive to room reverberation. In this room configuration, the percentage of successful overall estimates of the PHAT does not exceed 90% until the SNR is greater than 40 dB, as shown in Panel (b) of Fig. 8.4.
8.5 Conclusions
Channel diversities, including relative TDOAs, of a single-input multiple-output system can be easily determined once all channel impulse responses are found. Adaptive blind channel identification techniques can thus be used for time delay estimation, and it was demonstrated that they deal with room reverberation more effectively than traditional generalized cross-correlation methods. In this chapter, the blind channel identification-based approach was generalized from a two-channel system to a multichannel (greater than two) system, and a normalized multichannel frequency-domain LMS (NMCFLMS) algorithm was proposed. Because it is less likely for all channels to share a common zero when more channels are available, a multichannel system is usually well-conditioned for blind identification in practice, and the proposed multichannel time delay estimation algorithm is accordingly more robust. The proposed algorithm is implemented in the frequency domain and is computationally efficient due to the use of FFTs. The NMCFLMS algorithm may take a long time to converge to the actual impulse responses because of the weak tails caused by room reverberation; however, the direct path in each channel becomes dominant in less than 256 ms. It was shown, using data measured in the Varechoic chamber at Bell Labs, that the proposed algorithm performed better than the adaptive eigenvalue decomposition (AED) and phase transform (PHAT) methods, particularly when the room reverberation time was long.
Table 8.2 Simulation summary statistics for a source at the 45° position S1 shown in Fig. 8.2, in terms of the percentage of successful individual/overall estimates of the time delay estimation algorithms: adaptive eigenvalue decomposition (AED), phase transform (PT), and normalized multichannel frequency-domain LMS (NMC). The table lists the percent of successful estimates for each microphone pair (1-2 through 4-5) and overall, at SNRs of 10, 20, 30, and 40 dB, for T60 = 300 ms and T60 = 585 ms.
Table 8.3 Simulation summary statistics for a source at the broadside position S2 shown in Fig. 8.2, in terms of the percentage of successful individual/overall estimates of the time delay estimation algorithms: adaptive eigenvalue decomposition (AED), phase transform (PT), and normalized multichannel frequency-domain LMS (NMC). The table lists the percent of successful estimates for each microphone pair (1-2 through 4-5) and overall, at SNRs of 10, 20, 30, and 40 dB, for T60 = 300 ms and T60 = 585 ms.
Fig. 8.4. Comparison of the percentage of successful overall TDEs vs. SNR among the AED, PHAT, and NMCFLMS algorithms for T60 = 585 ms. (a) Source at 45° (S1). (b) Source at broadside (S2).
References

1. C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-24, no. 4, pp. 320–327, Aug. 1976.
2. M. S. Brandstein, "A pitch-based approach to time-delay estimation of reverberant speech," in Proc. IEEE ASSP Workshop Appls. Signal Processing Audio Acoustics, Oct. 1997.
3. H. Wang and P. Chu, "Voice source localization for automatic camera pointing system in videoconferencing," in Proc. IEEE ASSP Workshop Appls. Signal Processing Audio Acoustics, Oct. 1997.
4. M. Omologo and P. Svaizer, "Acoustic event localization using a crosspower-spectrum phase based technique," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1994, vol. 2, pp. 273–276.
5. M. Omologo and P. Svaizer, "Acoustic source location in noisy and reverberant environment using CSP analysis," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1996, vol. 2, pp. 921–924.
6. B. Champagne, S. Bédard, and A. Stéphenne, "Performance of time-delay estimation in the presence of room reverberation," IEEE Trans. Speech Audio Processing, vol. 4, no. 2, pp. 148–152, Mar. 1996.
7. D. R. Morgan, V. N. Parikh, and C. H. Coker, "Automated evaluation of acoustic talker direction finder algorithms in the varechoic chamber," J. Acoust. Soc. Am., vol. 102, no. 5, pp. 2786–2792, Nov. 1997.
8. A. Stéphenne and B. Champagne, "Cepstral prefiltering for time delay estimation in reverberant environments," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1995, vol. 5, pp. 3055–3058.
9. Y. Huang, J. Benesty, and G. W. Elko, "Adaptive eigenvalue decomposition algorithm for realtime acoustic source localization system," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1998, vol. 2, pp. 937–940.
10. J. Benesty, "Adaptive eigenvalue decomposition algorithm for passive acoustic source localization," J. Acoust. Soc. Am., vol. 107, no. 1, pp. 384–391, Jan. 2000.
11. C. Avendano, J. Benesty, and D. R. Morgan, "A least squares component normalization approach to blind channel identification," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1999, vol. 4, pp. 1797–1800.
12. Y. Huang and J. Benesty, "Adaptive multi-channel least mean square and Newton algorithms for blind channel identification," Elsevier Science Signal Processing, vol. 82, pp. 1127–1138, Aug. 2002.
13. Y. Huang and J. Benesty, "A class of frequency-domain adaptive approaches to blind multi-channel identification," IEEE Trans. Signal Processing, to appear.
14. G. Xu, H. Liu, L. Tong, and T. Kailath, "A least-squares approach to blind channel identification," IEEE Trans. Signal Processing, vol. 43, pp. 2982–2993, Dec. 1995.
15. A. Härmä, "Acoustic measurement data from the varechoic chamber," Technical Memorandum, Agere Systems, Nov. 2001.
16. W. C. Ward, G. W. Elko, R. A. Kubli, and W. C. McDougald, "The new Varechoic chamber at AT&T Bell Labs," in Proc. Wallace Clement Sabine Centennial Symposium, 1994, pp. 343–346.
17. J. P. Ianniello, "Time delay estimation via cross correlation in the presence of large estimation errors," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-30, no. 6, pp. 998–1003, Dec. 1982.
9 Algorithms for Adaptive Equalization in Wireless Applications

Markus Rupp¹ and Andreas Burg²

¹ TU Wien, Institute for Communication and RF Engineering, Gusshausstr. 25/389, A-1040 Vienna, Austria. E-mail: [email protected]
² ETH Zurich, Integrated Systems Laboratory, Gloriastr. 35, CH-8092 Zurich, Switzerland. E-mail: [email protected]

Abstract. Since the introduction of adaptive equalizers in digital communication systems by Lucky [1], much progress has been made. Owing to the particular constraints of the wireless domain, many new and different concepts have been proposed. The wireless channel is typically time and frequency dispersive, making it difficult to use standard equalizer techniques. Also, due to its time-varying nature, long transmission bursts may get corrupted and require a continuous tracking operation. Thus, transmission is often performed in short bursts, allowing only a limited amount of training data. Furthermore, quite recently, the advantages of the multiple-input multiple-output character of wireless channels have been recognized. This chapter presents an overview of equalization techniques in use and emphasizes the particularities of wireless applications.
9.1 Introduction
Consider the following simplified problem: a linear channel, characterized by a time-discrete FIR filter function with $L_C$ coefficients
\[
C(q^{-1}) = \sum_{l=0}^{L_C-1} c_l\, q^{-l}, \tag{9.1}
\]
where $q^{-1}$ denotes the delay operator¹ and $c_l \in \mathbb{C}$ the coefficients. It is fed by transmitted symbols $s(k) \in \mathcal{A}$ from a finite alphabet² $\mathcal{A} \subset \mathbb{C}$.

¹ Note that the operator style with shift operator $q^{-1}$ is utilized throughout the chapter since it does not require the existence of a z-transform and can thus also be applied to noise sequences. Note also that the description of signals and systems is purely discrete-time, assuming that equivalent discrete-time counterparts to continuous-time signals and systems exist.
² The transmitted signal $s(k)$ is assumed to be normalized such that $E[|s(k)|^2] = 1$.
The received sequence $r(k) \in \mathbb{C}$ is given by the convolution of the transmitted symbols $s(k)$ with the channel filter plus additive noise $v(k)$:
\[
r(k) = C(q^{-1})[s(k)] + v(k) = \sum_{l=0}^{L_C-1} c_l\, s(k-l) + v(k). \tag{9.2}
\]
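As a concrete illustration of (9.2), the following minimal sketch simulates a noisy FIR channel; the tap values, alphabet, and noise level are illustrative assumptions, not taken from the chapter.

import numpy as np

rng = np.random.default_rng(0)

# Transmit unit-energy QPSK symbols through the FIR channel of (9.2):
# r(k) = sum_l c_l s(k-l) + v(k).
c = np.array([0.8, 0.5 - 0.3j, 0.2j])            # channel taps c_l, L_C = 3
K = 1000
s = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, K)))
sigma_v = 0.1
v = sigma_v / np.sqrt(2) * (rng.standard_normal(K) + 1j * rng.standard_normal(K))
r = np.convolve(s, c)[:K] + v                    # received sequence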
Clearly, the coefficients of the channel do not only provide a means for transmitting the symbols, but also cause a linear distortion due to the time-dispersive nature of the wireless channel, called inter-symbol interference (ISI). For the moment, its frequency-dispersive character is neglected; it will be considered later. If a situation can be established in which only one coefficient exists, the decoding process can be simplified. Thus, the question arises whether there exists a linear filter that can guarantee the re-establishment of
\[
F_D(q^{-1})\, C(q^{-1}) = g_D\, q^{-D} \tag{9.3}
\]
with a finite delay $D$ and $g_D \neq 0$. Following such an approach, the additive noise is neglected and a filter $F_D(q^{-1})$ of length $L_F$ is sought. Such a criterion is called zero-forcing (ZF) since it forces all coefficients but one of the resulting filter to zero. The remaining coefficient $g_D$ can be set to one due to the linear nature of the equalization filter $F_D(q^{-1})$ and will no longer be used explicitly in the context of ZF.

So far, the noise component has been neglected. However, it impacts the decision through the filtered value $F_D(q^{-1})[v(k)]$, which can result in noise enhancement. Therefore, it appears a better approach to take the noise into account when searching for the optimal filter $F_D(q^{-1})$, which can be obtained by minimizing the so-called mean square error (MSE), i.e.,
\[
\mathrm{MSE} \overset{\Delta}{=} E\left| q^{-D}[s(k)] - F_D(q^{-1})[r(k)] \right|^2. \tag{9.4}
\]
Minimizing (9.4) leads to the minimum mean square error (MMSE). Neither solution, (9.3) nor (9.4), is straightforward to obtain. Depending on the equalizer structure and its length, different solutions may occur. The following Sect. 9.2 discusses criteria for optimal equalizer performance. For a time-invariant channel it is sufficient to compute the equalizer solution just once, at the initialization phase of a transmission. Algorithms for this case are addressed in Sect. 9.3, including fractionally spaced equalizers, multiple-input multiple-output (MIMO) channel equalizers, and decision feedback structures. Adaptive algorithms, based either on estimating the channel and noise variance or operating without explicit knowledge of the impulse response, can achieve the initial training of such structures. The famous least-mean-square (LMS) algorithm is introduced for ZF and MMSE solutions and its implications are discussed in Sect. 9.4. With some modification, such adaptive filters also perform well for time-variant channels as long as the rate of change is relatively slow and training is performed periodically.
Other techniques, presented in Sect. 9.6, try to find the most likely transmitted sequence by a so-called maximum likelihood (ML) technique. This requires a priori knowledge of the channel impulse response, and thus Sect. 9.5 gives an overview of estimation techniques for time-variant channels and MIMO settings. Finally, Sect. 9.7 presents a short overview of blind techniques as they are applied today in wireless communications.
9.2 Criteria for Equalization
In data communications systems, and in particular in wireless transmission, the best criterion for comparing equalizer performance is either the resulting bit-error rate (BER) or symbol-error rate (SER). Unfortunately, such measures are usually not available, and other performance metrics need to be utilized instead. Next to the signal to interference plus noise ratio (SINR) at the equalizer output, the MMSE measure already mentioned in the previous section is the most common measure and will be considered first. Substituting (9.2) into the definition of the MSE (9.4) yields for the equalizer output
\[
\mathrm{MMSE} = \min_{F_D(q^{-1})} \Big\{ \underbrace{E\left| \left( q^{-D} - F_D(q^{-1})\,C(q^{-1}) \right)[s(k)] \right|^2}_{\text{ISI part}} + \underbrace{E\left| F_D(q^{-1})[v(k)] \right|^2}_{\text{noise part}} \Big\}. \tag{9.5}
\]
(9.6)
where the index 0 ≤ D < LC indicates the channel coefficient on which the signal is detected (typically the strongest one). Clearly, there is signal energy LC −1 2 in the term i=0,i =D |ci | but without additional means it cannot be used as such and appears as a disturbance to the signal. Define the convolution of the channel and a finite length filter function by a new polynomial G(q −1 ) = FD (q −1 )C(q −1 ) with the coefficients gi ; i = 0, ..., LC + LF − 2 = 0, ..., LG − 1. The impact of ISI after equalization can be described by either of the two following measures (note that 0 ≤ D < LG and in general gD = 1) LG −1 ISI =
i=0
|gD |2
LG −1 PD =
|gi |2
i=0
|gD |
|gi |
− 1,
− 1.
(9.7) (9.8)
252
M. Rupp, A. Burg
The second metric is called peak distortion (PD) measure. The convolution of channel and equalizer filter results in a new SINReq SINReq = LG −1 i=0,i=D
|gD |2
LF −1 2
|gi |2 + σv
i=0
|fi |2
= ISI +
σv2 |gD |2
1 LF −1 i=0
|fi |2
.(9.9)
The problem is to find equalizer filter values {fi } such that (9.9) is maximized. Both criteria, MSE and SINR are related by MSE =
|gD |2 + |1 − gD |2 , SINReq
(9.10)
which simply becomes MSEZF = 1/SNReq in the ZF case. As mentioned before, the BER measure is the best criterion, however, it is very difficult to obtain. In the simple case of binary signaling (BPSK) or quadrature phase shift keying (QPSK) with Gray coding over an AWGN channel with constant gain cD and delay D, the BER can be determined by evaluating the expression * )5 1 |cD |2 . (9.11) BERAWGN,B/QPSK = erfc 2 2σv2 For other modulation schemes at least an upper bound of the form √ BERAWGN ≤ Kerfc δSNR
(9.12)
exists [2], where K and δ are constants depending on the modulation scheme. SNR stands for signal to noise ratio, i.e., the SINR without interference. Once ISI is present, the BER measure can be modified to * )5 I |cD + cISI (si )|2 1 BERISI = , (9.13) pi (si )erfc 2 i=1 2σv2 where for all I = P LC −1 possibilities with probability pi (si ) signal corruption is caused by cISI (si ). The vectors si contain all possible combinations of I transmitted symbols. The formula is for QAM transmission with equidistant symbols and is only correct as long as the ISI is small enough (max |cISI (si )| < |cD |). For example, with BPSK the values cD + cISI (si ) must remain positive and for QPSK they must remain in the first quarter of the complex plane. The value of P is 2 for BPSK and 4 for QPSK and can be very large, depending on the size of the symbol alphabet. Clearly, with a large number of coefficients the complexity of such an expression becomes very high. Applying a linear equalizer, the BER reads * )5 I |gD + gISI (si )|2 1 BERISI,eq = . (9.14) pi (si )erfc F −1 2 i=1 2σv2 L |fi |2 i=0
9
Adaptive Equalization in Wireless Applications
253
Optimizing such an expression is quite difficult. Approximations can be obtained for small ISI, i.e., max |gISI (si )| |gD |. Then the corruption resulting from ISI can be regarded as a Gaussian process and can be added to the noise term: ⎛ ⎞ ): * " # 1 SINReq 1 1 ⎜#
⎟ ⎠ = erfc BERISI,eq ≈ erfc ⎝$ . σ2 LF −1 2 2 2 2 |gDv|2 i=0 |fi |2 + ISI (9.15) In the case of small PD < 1 another option for approximation is to derive an upper bound by the worst case ⎛" #! !2 ⎞ #! ! !gD − |gD | PD! ⎟ ⎜# 1 $ ⎜ ⎟, BERISI ≤ erfc ⎝ (9.16) F −1 2 2σv2 L |fi |2 ⎠ i=0 with the above defined peak distortion measure (9.8). Note that (perfect) ZF solutions always result in simpler expressions of the form (9.11) with gD = 1 and gISI = 0, while MMSE will result in the much more complicated expression (9.13) with the need to use approximate results. The erfc(·) links the BER to the SNReq obtained from the ideal ZF solution and the BER to the SINReq from the MMSE solution. However, in general it remains open which criterion leads to smaller BER3 . Note also that there exist unbiased MMSE solutions as well. In this case gD = 1, simply obtained by dividing the MMSE solution by gD . In [4] it is argued that unbiased MMSE solutions give lower BER than standard MMSE.
9.3 Channel Equalization
Channel equalization tries to restore the transmitted signal $s(k)$ by means of linear or nonlinear filtering. Such an approach seems straightforward, and abundant literature is available; see for example [3], [4], [5], [6], to name a few. An overview of such techniques is given in the following section, where a trade-off has been made between a detailed description and sufficient information for wireless applications. Channel equalization as described in this section is not specific to wireless systems and can also be (and has successfully been) applied to other fields where time-invariant channels are common.

³ For a non-ISI channel, as appears, for example, in OFDM (also called DMT) and PAM transmission, it can be shown that minimizing the BER is equivalent to the ZF solution, i.e., $f_D = 1/c_D$.
9.3.1 Infinite Filter Length Solutions for Single Channels
It is quite educational to assume the equalizer filter solution to be of infinite length. Minimizing only the ISI part in (9.6), the ZF criterion is obtained, leading to the following expression:
\[
F_{\mathrm{ZF},\infty}(q^{-1}) = \frac{C^*(q^{-1})}{|C(q^{-1})|^2}\, q^{-D} = \frac{q^{-D}}{C(q^{-1})}. \tag{9.17}
\]
This solution is typically of infinite length, as can be shown by a simple example. Assume the channel impulse response to be of the form
\[
C(q^{-1}) = c_0 + c_1 q^{-1}. \tag{9.18}
\]
Then the ZF solution (for $D = 0$) requires inversion of the channel, i.e.,
\[
F_{\mathrm{ZF}}(q^{-1}) = \frac{1}{c_0}\, \frac{1}{1 + \frac{c_1}{c_0} q^{-1}}, \tag{9.19}
\]
which is the structure of a first-order recursive filter. If $|c_1/c_0| < 1$, a causal, stable filter solution of infinite length exists. On the other hand, if $|c_1/c_0| > 1$, a stable, anti-causal filter exists, with an impulse response spanning from $-\infty$ to 0. In practice, such an anti-causal solution can be handled by allowing additional delay $D > 0$, so that the part of the impulse response carrying most of its energy can be represented as a causal solution and the remaining anti-causal tail is neglected. In the remainder of this section the difference between causal and anti-causal filters will not be considered; instead, a general equalizer filter of doubly infinite length will be assumed.

Substituting the general solution (9.17) into the MMSE expression above clearly zeroes the ISI part. The remaining term is controlled by the noise, yielding
\[
\mathrm{MSE}_{\mathrm{ZF},\infty} = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{\sigma_v^2}{|C(e^{-j\Omega})|^2}\, d\Omega. \tag{9.20}
\]
In contrast to the ZF solution, the MMSE solution, minimizing ISI and noise simultaneously, reads
\[
F_{\mathrm{MMSE},\infty}(q^{-1}) = \frac{C^*(q^{-1})}{|C(q^{-1})|^2 + \sigma_v^2}\, q^{-D}. \tag{9.21}
\]
The MMSE for this solution results in
\[
\mathrm{MMSE}_{\mathrm{LIN},\infty} = \frac{1}{2\pi} \int_{-\pi}^{\pi} \frac{\sigma_v^2}{|C(e^{-j\Omega})|^2 + \sigma_v^2}\, d\Omega, \tag{9.22}
\]
which is for $\sigma_v^2 > 0$ always smaller than the MSE of the ZF solution. Note that this advantage requires exact knowledge of the noise power, as indicated in (9.21). Especially in wireless applications, with time- and frequency-dispersive channels, such knowledge is usually not available and it can be
very difficult and of high complexity to obtain a reliable estimate.

Example. Consider the above example of a two-tap FIR filter for which $|c_0|^2 + |c_1|^2 = 1$ and $|c_1/c_0| < 1$. In the case of the ZF receiver, the MSE is given by
\[
\mathrm{MSE}_{\mathrm{ZF},\infty} = \frac{\sigma_v^2}{\left( |c_0|^2 - |c_1|^2 \right)^2}, \tag{9.23}
\]
while for the MMSE receiver the MMSE
\[
\mathrm{MMSE}_{\mathrm{LIN},\infty} = \frac{\sigma_v^2}{\left( |c_0|^2 - |c_1|^2 \right)^2 + \sigma_v^4 + 2\sigma_v^2} \le \mathrm{MSE}_{\mathrm{ZF},\infty} \tag{9.24}
\]
is obtained.
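The integral expressions (9.20) and (9.22) are also easy to check numerically; the following short sketch does so for a two-tap channel. The tap values and noise power are illustrative assumptions.

import numpy as np

# Numerical evaluation of (9.20) and (9.22) for C(q^-1) = c0 + c1 q^-1.
c0, c1, sigma_v2 = 0.9, np.sqrt(1.0 - 0.81), 0.1   # |c0|^2 + |c1|^2 = 1
omega = np.linspace(-np.pi, np.pi, 1 << 14, endpoint=False)
C2 = np.abs(c0 + c1 * np.exp(-1j * omega)) ** 2
mse_zf = np.mean(sigma_v2 / C2)                    # (9.20), mean = (1/2pi) integral
mmse_lin = np.mean(sigma_v2 / (C2 + sigma_v2))     # (9.22)
print(mse_zf, mmse_lin)                            # MMSE is the smaller value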
9.3.2 Finite and Infinite Filter Length Solutions for Multiple Channels
Until here, only single channels were considered. In the near future, wireless systems are expected to use multiple antennas at the transmitter and/or receiver, resulting in so-called multiple-input multiple-output (MIMO) systems. In this case, the received signal at antenna $n$; $n = 1, \ldots, N$ is a superposition of all transmitted symbols:
\[
r_n(k) = \sum_{m=1}^{M} C_{nm}(q^{-1})[s_m(k)] + v_n(k); \quad n = 1, \ldots, N. \tag{9.25}
\]
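A short simulation sketch of the superposition (9.25) follows; the dimensions, tap statistics, and symbol alphabet are illustrative assumptions.

import numpy as np

# MIMO reception per (9.25): each of the N receive antennas observes the
# superposition of M transmitted streams, each through its own FIR channel.
rng = np.random.default_rng(1)
M, N, L_C, T = 2, 3, 4, 500
C = rng.standard_normal((N, M, L_C))           # real channel taps, illustrative
s = rng.choice([-1.0, 1.0], size=(M, T))       # BPSK streams
sigma_v = 0.05
r = np.zeros((N, T))
for n in range(N):
    for m in range(M):
        r[n] += np.convolve(s[m], C[n, m])[:T] # superposition of all streams
    r[n] += sigma_v * rng.standard_normal(T)   # additive noise v_n(k)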
Two cases are of interest: 1) the symbols at every transmit antenna are identical (single-input multiple-output (SIMO) system), and 2) they are different.

SIMO Systems: If the transmitted symbol is identical at all transmit antennas (but different at different times), the $M$ channels $C_{nm}(q^{-1})$ are simply combined into one new channel $C_n(q^{-1}) = \sum_m C_{nm}(q^{-1})$ and the received symbols can be written in vector form:
\[
r(k) = c(q^{-1})[s(k)] + v(k), \tag{9.26}
\]
where $c(q^{-1})$ is a column vector with $N$ entries $C_n(q^{-1})$; $n = 1, \ldots, N$. The linear equalizer solution is given by a vector $f_D(q^{-1})$ with entries $F_n^*(q^{-1})$ chosen to re-establish the transmitted symbol $s(k)$ by computing $f_D^H(q^{-1})\,r(k) = f_D^H(q^{-1})\,c(q^{-1})[s(k)] + f_D^H(q^{-1})\,v(k)$. The corresponding ZF condition thus reads
\[
f_D^H(q^{-1})\, c(q^{-1}) = \sum_{n=1}^{N} F_n(q^{-1})\, C_n(q^{-1}) = q^{-D}. \tag{9.27}
\]
Fig. 9.1. Common structure for SIMO or fractionally spaced equalizers.
Figure 9.1 depicts a scenario for N = 2. If the antennas are spaced sufficiently far apart (> λ/4), the noise components can be assumed to be uncorrelated. A nonlinear device, denoted NL, is typically used to reconstruct the symbols from the finite alphabet A. In the simple case of BPSK and QPSK it is called a slicer, since it slices the complex plane into two halves or four quadrants, respectively. A typical example of this principle is receive diversity, where multiple receive antennas pick up different realizations of the same s(k), caused by the different transmission channels Cn(q^-1) from the one transmit antenna to each of the receive antennas. In a flat-fading situation (no time dispersion), for example, each channel function is given by a single value Cn(q^-1) = cn0. The linear combination by the vector f is performed best for fn0 = cn0* and is called maximal ratio combining (MRC). A second example of this principle is the so-called RAKE receiver typically used in CDMA systems. Here, the multiple channels may be caused not by several receive antennas but by multiple delayed transmission paths which can be separated in the receiver. After this separation, the signal appears as if several receive antennas (also called virtual antennas) were present.

A very similar situation to the multi-antenna case is obtained when so-called fractionally spaced equalizers (FSE) are utilized. In this case, the received signal is sampled more often than the Nyquist rate requires. Note that such oversampling is performed while the signal bandwidth remains constant. In fractionally spaced equalizers, the received signals are sampled at, say, an N times higher rate. The corresponding N phases of the received signal experience different channels Cn. The received signals thus appear as if they were transmitted through N different channels, very much as if N receive antennas were used. One difference, though, is the correlation of the received noise. While in the multiple-antenna case the noise component on each receive antenna can be uncorrelated, this is not the case for the fractionally
spaced equalizer. The pulse-shaping filters at the transmitter and receiver (the latter in the form of a matched filter) cause the noise components to be correlated for fractionally spaced equalizers. Thus, Fig. 9.1 also describes a scenario with T/2-spaced oversampling, where C1(q^-1) describes the T-spaced part starting at time zero and C2(q^-1) describes the part starting at time T/2. Note that the noise components, unlike in the multiple-antenna case, are now correlated.

Example. Assume two channels of two coefficients each, fed by the same symbols s(k). The symbols are collected in a vector $s(k) = [s(k), s(k-1)]^T$. The received vector $r(k) = [r_1(k), r_2(k)]^T$ then reads, in compact matrix notation,
\[
r(k) = \begin{bmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{bmatrix} s(k) + v(k). \tag{9.28}
\]
The desired equalizer shall be a set of two FIR filters of length $L_F = 2$ each,
\[
f^H(q^{-1}) = \left[\, f_{11} + f_{12} q^{-1},\; f_{21} + f_{22} q^{-1} \,\right]. \tag{9.29}
\]
Thus, a ZF solution is obtained (stacked Toeplitz form) for
\[
\begin{bmatrix}
c_{11} & 0 & c_{21} & 0 \\
c_{12} & c_{11} & c_{22} & c_{21} \\
0 & c_{12} & 0 & c_{22}
\end{bmatrix}
\begin{bmatrix} f_{11} \\ f_{12} \\ f_{21} \\ f_{22} \end{bmatrix}
=
\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}. \tag{9.30}
\]
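A quick numerical check of (9.30) can be done with a least-squares solver, which returns the minimum-norm solution of the underdetermined system; the tap values below are illustrative assumptions.

import numpy as np

# Solve the stacked-Toeplitz ZF condition (9.30) for illustrative taps.
c11, c12 = 0.9, 0.3
c21, c22 = 0.7, -0.4
A = np.array([[c11, 0.0, c21, 0.0],
              [c12, c11, c22, c21],
              [0.0, c12, 0.0, c22]])
e = np.array([0.0, 1.0, 0.0])             # desired pure delay (D = 1)
f, *_ = np.linalg.lstsq(A, e, rcond=None)
print(A @ f)                              # approximately [0, 1, 0]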
Obviously, only three equations exist to determine four variables, i.e., one variable can be selected at will. For an MMSE solution, this additional freedom allows smaller MSE values to be achieved while not violating the ZF solution too much. More about such finite-length MMSE solutions will be said in Sect. 9.3.3. Length requirements for typical wireless channels are discussed in [5].

In general, if the reception of a symbol $s(k)$ can be written in terms of multiple polynomials caused by multiple sub-channel filters $C_n(q^{-1})$,
\[
r_n(k) = C_n(q^{-1})[s(k)]; \quad n = 1, \ldots, N, \tag{9.31}
\]
then a finite-length ZF solution can be obtained when the sub-channels $C_n(q^{-1})$ are co-prime, i.e., they have no common zeros. They are co-prime if and only if there exists a set of polynomials $\{F_n(q^{-1})\}$; $n = 1, \ldots, N$ that satisfies the Bezout identity [7], [8], [9]:
\[
\sum_{n=1}^{N} F_n(q^{-1})\, C_n(q^{-1}) = 1. \tag{9.32}
\]
Thus, although an MMSE solution is obtained by
\[
F_n(q^{-1}) = \frac{C_n^*(q^{-1})}{|C_n(q^{-1})|^2 + \sigma_v^2} \tag{9.33}
\]
in the case of independent and identical noise components, such an infinite-length filter solution is not required, and a finite-length filter solution exists instead.

Example. Assume two channel filters with one common zero, $C_1(q^{-1}) = (1 + c_1 q^{-1})\bar{C}_1(q^{-1})$ and $C_2(q^{-1}) = (1 + c_1 q^{-1})\bar{C}_2(q^{-1})$, where $\bar{C}_1(q^{-1})$ and $\bar{C}_2(q^{-1})$ are co-prime. In this case,
\[
1 = C_1(q^{-1})F_1(q^{-1}) + C_2(q^{-1})F_2(q^{-1}) = (1 + c_1 q^{-1})\left[ \bar{C}_1(q^{-1})F_1(q^{-1}) + \bar{C}_2(q^{-1})F_2(q^{-1}) \right]
\]
cannot be satisfied.

MIMO Systems: If the transmitted symbols are all different⁴, the received symbol vector can be written as
\[
r(k) = C\, s(k) + v(k), \tag{9.34}
\]
where $C$ is a matrix with $N \times M$ entries $C_{nm}(q^{-1})$; $m = 1, \ldots, M$, $n = 1, \ldots, N$. In this case, a matrix $F$ is desired to re-establish the transmitted vector $s(k) = [s_1(k), \ldots, s_M(k)]^T$ by computing $F^H r(k) = F^H C\, s(k) + F^H v(k)$. Consider a simple case in which two symbols are transmitted (one on each transmit antenna) and three receive antennas are utilized:
\[
r(k) = \begin{bmatrix}
C_{11}(q^{-1}) & C_{12}(q^{-1}) \\
C_{21}(q^{-1}) & C_{22}(q^{-1}) \\
C_{31}(q^{-1}) & C_{32}(q^{-1})
\end{bmatrix} s(k) + v(k). \tag{9.35}
\]
According to the generalized Bezout identity [9], [10], a ZF solution exists for $N > M$ if $C$ is of full rank for all $q^{-1}$. Since each filter function $C_{nm}$ can be expressed as a vector $c_{nm}$, stacking the past $L_C - 1$ symbols of $s(k)$ into an even larger vector allows the $N \times M$ matrix to be reformulated into a new matrix $\bar{C}$ of higher dimension with constant entries. At the receiver, a minimum-norm ZF solution can be obtained (see also the next section) by computing $F_{\mathrm{ZF}} = \bar{C}[\bar{C}^H \bar{C}]^{-1}$, or alternatively an MMSE solution by computing $F_{\mathrm{MMSE}} = \bar{C}[\bar{C}^H \bar{C} + \sigma_v^2 I]^{-1}$. The inversion of the so-obtained $(M L_C) \times (M L_C)$ matrix can be numerically challenging. Better methods are obtained by so-called iterative BLAST receivers [11], also in combination with CDMA [12] or OFDM transmission [13]. An overview of ZF, MMSE, and DFE techniques for MIMO is given in [14].
9.3.3 Finite Filter Length Solutions for Single Channels
As mentioned before, the ideal filter length of linear equalizers for single channels is doubly infinite. In practice, the filter length is fixed by the affordable complexity. Therefore, in this section, suboptimal finite-length solutions

⁴ This is also called Vertical-Bell-Labs-LAyered-Space-Time (V-BLAST) transmission.
will be considered. To achieve this goal, the channel impulse response will be assumed to be of finite length $L_C$ and will be written in vector notation; i.e., $\tilde{C} \in \mathbb{C}^{(L+L_C-1)\times L}$ denotes a channel transmission matrix describing the following model (only shown for $L_C = 3$):
\[
\begin{bmatrix}
r(L+2) \\ r(L+1) \\ r(L) \\ r(L-1) \\ \vdots \\ r(2) \\ r(1)
\end{bmatrix}
=
\begin{bmatrix}
c_2 & 0 & 0 & 0 & \cdots & 0 \\
c_1 & c_2 & 0 & 0 & \cdots & 0 \\
c_0 & c_1 & c_2 & 0 & \cdots & 0 \\
0 & c_0 & c_1 & c_2 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \ddots & \vdots \\
0 & \cdots & 0 & 0 & c_0 & c_1 \\
0 & \cdots & 0 & 0 & 0 & c_0
\end{bmatrix}
\begin{bmatrix}
s(L) \\ s(L-1) \\ \vdots \\ s(2) \\ s(1)
\end{bmatrix}
+
\begin{bmatrix}
v(L+2) \\ v(L+1) \\ v(L) \\ v(L-1) \\ \vdots \\ v(2) \\ v(1)
\end{bmatrix}, \tag{9.36}
\]
allowing us to express the transmission of $L$ symbols $s(k)$; $k = 1, \ldots, L$:
\[
r = \tilde{C}\, s + v. \tag{9.37}
\]
Note that such notation includes the beginning and end of a transmission and grows with $L$. For zero-forcing, a filter vector $\tilde{f}_D \in \mathbb{C}^{(L+L_C-1)\times 1}$ can be found such that
\[
\tilde{f}_D^H\, \tilde{C} = e_D^T, \tag{9.38}
\]
where $e_D = [0 \ldots 0\; 1\; 0 \ldots 0]^T$ is a unit vector of appropriate dimension whose $D$-th component is 1; i.e., the filter selects a single value $e_D^T s = s(k-D)$. This description allows for computing the ZF and MMSE solutions:
\[
\tilde{f}_{\mathrm{ZF},D} = \tilde{C}\left[ \tilde{C}^H \tilde{C} \right]^{-1} e_D, \tag{9.39}
\]
\[
\tilde{f}_{\mathrm{MMSE},D} = \tilde{C}\left[ \tilde{C}^H \tilde{C} + \sigma_v^2 I \right]^{-1} e_D. \tag{9.40}
\]
Example. Consider again the example of (9.23) and (9.24) for a two-tap FIR channel filter. The channel matrix reads
\[
\tilde{C} = \begin{bmatrix} c_1 & 0 \\ c_0 & c_1 \\ 0 & c_0 \end{bmatrix}. \tag{9.41}
\]
The two ZF solutions are
\[
\tilde{f}_1 = \frac{\left[\, c_1\left(|c_0|^2 + |c_1|^2\right),\; c_0 |c_0|^2,\; -c_1^* c_0^2 \,\right]^T}{\left(|c_0|^2 + |c_1|^2\right)^2 - |c_0 c_1|^2}, \tag{9.42}
\]
\[
\tilde{f}_2 = \frac{\left[\, -c_0^* c_1^2,\; c_1 |c_1|^2,\; c_0\left(|c_0|^2 + |c_1|^2\right) \,\right]^T}{\left(|c_0|^2 + |c_1|^2\right)^2 - |c_0 c_1|^2}, \tag{9.43}
\]
while the two MMSE solutions read
\[
\tilde{f}_1 = \frac{\left[\, c_1\left(|c_0|^2 + |c_1|^2 + \sigma_v^2\right),\; c_0\left(|c_0|^2 + \sigma_v^2\right),\; -c_1^* c_0^2 \,\right]^T}{\left(|c_0|^2 + |c_1|^2 + \sigma_v^2\right)^2 - |c_0 c_1|^2}, \tag{9.44}
\]
\[
\tilde{f}_2 = \frac{\left[\, -c_0^* c_1^2,\; c_1\left(|c_1|^2 + \sigma_v^2\right),\; c_0\left(|c_0|^2 + |c_1|^2 + \sigma_v^2\right) \,\right]^T}{\left(|c_0|^2 + |c_1|^2 + \sigma_v^2\right)^2 - |c_0 c_1|^2}. \tag{9.45}
\]
While these results seem convincing, they are misleading. Clearly, the description (9.36) requires us to observe the transmission from beginning to end; in particular, the rows in which not all symbols are present are of importance. If, however, for a continuous transmission, a snapshot over $L_1 < L$ symbols is taken, say $s^T(k) = [s(k), s(k+1), \ldots, s(k+L_1-1)]$, the following matrix of Toeplitz structure is obtained instead:
\[
C = \begin{bmatrix}
c_0 & c_1 & \cdots & c_{L_C-1} & 0 & \cdots & 0 \\
0 & c_0 & c_1 & \cdots & c_{L_C-1} & 0 & \cdots \\
0 & 0 & c_0 & c_1 & \cdots & c_{L_C-1} & 0 \\
\vdots & & \ddots & \ddots & & \ddots & \vdots \\
0 & \cdots & 0 & c_0 & c_1 & \cdots & c_{L_C-1}
\end{bmatrix}. \tag{9.46}
\]
Such a matrix $C \in \mathbb{C}^{L_1 \times (L_C+L_1-1)}$ is not of sufficient rank to compute a linear solution by the pseudo-inverse [of dimension $(L_C+L_1-1) \times (L_C+L_1-1)$] as was done in (9.39). For ZF, a different pseudo-inverse leading to a minimum-norm solution exists, while for MMSE the standard pseudo-inverse exists; with the above notation, and vectors $e_D$ of appropriate size for each solution,
\[
f_{\mathrm{ZF},D} = \left[ C C^H \right]^{-1} C\, e_D, \tag{9.47}
\]
\[
f_{\mathrm{MMSE},D} = C \left[ C^H C + \sigma_v^2 I \right]^{-1} e_D. \tag{9.48}
\]
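The finite-length solutions (9.47) and (9.48) translate directly into a few lines of linear algebra; in the sketch below, the channel taps, snapshot length, delay, and noise power are illustrative assumptions.

import numpy as np

def toeplitz_channel(c, L1):
    # Build the L1 x (L_C + L1 - 1) convolution matrix C of (9.46).
    L_C = len(c)
    Cmat = np.zeros((L1, L_C + L1 - 1), dtype=complex)
    for row in range(L1):
        Cmat[row, row:row + L_C] = c
    return Cmat

c = np.array([0.8, 0.5 - 0.3j, 0.2j])    # illustrative channel taps
L1, D, sigma_v2 = 8, 4, 0.01
C = toeplitz_channel(c, L1)
eD = np.zeros(C.shape[1]); eD[D] = 1.0
f_zf = np.linalg.solve(C @ C.conj().T, C @ eD)                     # (9.47)
f_mmse = C @ np.linalg.solve(C.conj().T @ C
                             + sigma_v2 * np.eye(C.shape[1]), eD)  # (9.48)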
Extending such solutions to the MIMO case (including fractionally spaced equalizers) is straightforward⁵.

Example. For an FSE with T/2 spacing, including the matched filters, the MMSE solution is considered. Note that in order to describe the oversampled system, the introduction of interpolation filters is required. Typically a matched filter pair $R_p$ is used at the transmitter and receiver. So far, such filters were not shown explicitly and were incorporated in the channel impulse response. For oversampled systems, however, they play an important role and need to be considered explicitly. The received signal for each phase can be written in the form
\[
r_p(k) = R_p C_p R_p\, s(k) + R_p v(k); \quad p = 1, \ldots, P. \tag{9.49}
\]

⁵ Note that in the SIMO or FSE case the elements in $C$ become vectors, and the condition that the Toeplitz matrix $C$ in (9.46) is of full column rank is equivalent to Bezout's identity.
Fig. 9.2. Structure of a DFE receiver.
The noise is thus also filtered. Note that the noise is present on each path, originating from identical noise sources but differently filtered: $v_p = R_p v$. The MMSE equations read
\[
\begin{bmatrix}
R_1 C_1 R_1 R_1^H C_1^H R_1^H + \sigma_v^2 R_1 R_1^H & R_1 C_1 R_1 R_2^H C_2^H R_2^H \\
R_2 C_2 R_2 R_1^H C_1^H R_1^H & R_2 C_2 R_2 R_2^H C_2^H R_2^H + \sigma_v^2 R_2 R_2^H
\end{bmatrix}
\begin{bmatrix} f_1 \\ f_2 \end{bmatrix}
=
\begin{bmatrix} R_1 C_1 R_1 \\ R_2 C_2 R_2 \end{bmatrix} e_D. \tag{9.50}
\]

9.3.4 Decision Feedback Equalizers
As mentioned above and substantiated by the Bezout theorem, once several transmission paths (sub-channels) are available for the same sequence, the equalization problem can be solved by a finite-length filter, whereas this is not the case for a single transmission path. Several methods are available to obtain independent paths, and they do not necessarily require more than one antenna at the transmitter or receiver; FSE was already mentioned in this respect. A similar method is known under the name of decision feedback equalization (DFE). Unlike the linear equalizers considered so far, the DFE also uses the past estimated symbols ŝ(k−D−1), ..., ŝ(k−D−L_B) to detect the current symbol s(k−D). Figure 9.2 displays the DFE structure. The switch allows the operation mode to be changed from training to tracking, thus from feeding correct symbols to using estimated ones. The estimated symbols result from a nonlinear
device (NL) that maps the soft symbols z(k) onto the nearest possible symbol in the allowed alphabet A. During training, the feedback path B_D(q−1) of length L_B,

$$B_D(q^{-1}) = \sum_{i=0}^{L_B-1} b_i q^{-i}; \qquad b_0 = 0, \qquad (9.51)$$
can be considered a second transmission path, and thus Bezout's theorem can be satisfied. In the training mode, the ZF condition reads

$$F_D(q^{-1})\,C(q^{-1}) + q^{-D-1}B_D(q^{-1}) = q^{-D}, \qquad (9.52)$$

while the MMSE condition is given by

$$\mathrm{MMSE} = \min_{F_D,B_D}\; \underbrace{E\left|\left[q^{-D} - F_D(q^{-1})C(q^{-1}) - q^{-D-1}B_D(q^{-1})\right]s(k)\right|^2}_{\text{ISI part}} \;+\; \underbrace{E\left|F_D(q^{-1})\,v(k)\right|^2}_{\text{noise part}}. \qquad (9.53)$$
As compared to (9.6), the equation above allows one more degree of freedom. By modifying B_D(q−1) it is now possible to cancel ISI components while the noise term remains unchanged. In the tracking mode, the estimated symbols are fed back. As long as the SER is relatively small, the feedback path can still be considered a second transmission path.

Infinite Length Solution. Classically, the DFE solution was given by J. Salz [15]. The infinite MMSE solution requires a term |C(q−1)|² + σ_v² in the denominator. Assume a monic, causal, and minimum-phase polynomial M(q−1) (monic polynomials have m_0 = 1, causal ones have only terms for nonnegative delays, and minimum-phase is equivalent to having zeroes only inside the unit circle) selected so that M_o M(q−1)M*(q) = |C(q−1)|² + σ_v², where M_o is a scaling constant. For unit transmitted signal energy, the DFE solution is given by

$$F(q^{-1}) = \frac{1}{M_o}\,\frac{C^*(q)}{M^*(q)}, \qquad (9.54)$$
$$B(q^{-1}) = 1 - M(q^{-1}). \qquad (9.55)$$

Obviously, the feedforward path requires an anti-causal solution while the feedback path is strictly causal. It can be shown [15], [3] that the corresponding MSE is given by

$$\mathrm{MMSE}_{\mathrm{DFE},\infty} = \frac{\sigma_v^2}{M_o} = \sigma_v^2 \exp\left(-\frac{1}{2\pi}\int_{-\pi}^{\pi} \log\left(|C(e^{j\Omega})|^2 + \sigma_v^2\right)\, d\Omega\right) \qquad (9.56)$$
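The closed-form expression (9.56) is easy to check numerically. A minimal sketch, assuming an FIR channel and approximating the integral on a uniform frequency grid (all names are illustrative):

```python
import numpy as np

def dfe_mmse_infinite(c, sigma2_v, n_grid=4096):
    """Evaluate MMSE_DFE,inf of (9.56) for an FIR channel c."""
    omega = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    taps = np.arange(len(c))
    C = np.asarray(c, dtype=complex) @ np.exp(-1j * np.outer(taps, omega))
    # (1/2pi) * integral approximated by the mean over the uniform grid
    return sigma2_v * np.exp(-np.mean(np.log(np.abs(C) ** 2 + sigma2_v)))

# Flat-channel sanity check: equals sigma^2/(1 + sigma^2)
print(dfe_mmse_infinite([1.0], 0.1))   # ~0.0909
```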
[Stem plot of a channel impulse response magnitude |c(k)| versus the time index k, with the selected cursor position D marked.]

Fig. 9.3. Possible channel impulse response magnitude and selected cursor position D.
and that the MMSE so obtained never exceeds the MMSE of a linear equalizer solution. (A second solution also exists, with a stable and causal feedforward filter but an anti-causal feedback filter.) A comprehensive overview of DFE solutions is given in [3].

Finite Length Solution. Figure 9.3 depicts a possible channel impulse response. Assume D = 0, i.e., the cursor is set to the first value of the impulse response. In this case, by proper choice of B_D(q−1), all following values can be cancelled. For the BER of a BPSK (QPSK) system,

$$\mathrm{BER} = \frac{1}{2}\,\mathrm{erfc}\left(\sqrt{\frac{|c_0 f_0|^2}{2\sum_i |f_i|^2\,\sigma_v^2}}\right) \qquad (9.57)$$

is obtained. Since the noise term is minimal for f_i = 0, i ≠ 0, it can be recognized that the influence of f_0 is cancelled and thus the BER quality depends only on the first channel term c_0. For wireless channels this will in general be a small term, and thus only poor BER performance can be achieved. If D is increased, much better values can be obtained. This does not need to come at the expense of additional ISI resulting from the pre-cursor position. In [16], it is shown that a ZF solution is obtained if

$$L_F + L_B \ge \max\{L_F + L_C - 1,\; D + L_B - 1\}. \qquad (9.58)$$

However, it remains difficult to find the optimal selection of D, since it depends on the actual channel C(q−1). Since minimizing the BER remains problematic, a better starting point for optimization is again the ZF and the MMSE criterion.
In the finite filter length case, the impact of the channel and the feedback part can be combined into one matrix

$$\tilde C = \begin{bmatrix} C \\ I_D \end{bmatrix} = \begin{bmatrix} c_0 & c_1 & \dots & c_{L_C-1} & 0 & \dots & 0 \\ 0 & c_0 & c_1 & \dots & c_{L_C-1} & 0 & \dots \\ \vdots & & \ddots & \ddots & & \ddots & \vdots \\ 0 & \dots & 0 & c_0 & c_1 & \dots & c_{L_C-1} \\ 0 & \dots & 0 & 1 & 0 & \dots & 0 \\ 0 & \dots & 0 & 0 & 1 & 0 & \dots \\ 0 & \dots & 0 & 0 & 0 & 1 & 0 \end{bmatrix}, \qquad (9.59)$$

where a new matrix

$$I_D \stackrel{\Delta}{=} \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & \dots & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 & \dots & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 & \dots & 0 \\ \vdots & & & & & & \ddots & \vdots \\ 0 & 0 & 0 & 0 & \dots & 0 & 1 & 0 \end{bmatrix} \qquad (9.60)$$

has been introduced, which is a shifted identity matrix with a delay of D steps (here D = 3 is shown). The task is to find two vectors f_D and b_D that minimize either the ZF or the MMSE condition:

$$\mathrm{MSE}_{ZF} = \min_{f_D,\,b_D} \left\| \left(e_D^H - f_D^H C - b_D^H I_D\right)s \right\|^2, \qquad (9.61)$$

$$\mathrm{MMSE} = \min_{f_D,\,b_D} \left\| \left(e_D^H - f_D^H C - b_D^H I_D\right)s \right\|^2 + \|f_D\|^2\,\sigma_v^2. \qquad (9.62)$$

For MMSE, after differentiation with respect to f_D and b_D,

$$\begin{bmatrix} CC^H + \sigma_v^2 I & C I_D^T \\ I_D C^H & I_D I_D^T \end{bmatrix}\begin{bmatrix} f_D \\ b_D \end{bmatrix} = \begin{bmatrix} C\, e_D \\ I_D\, e_D \end{bmatrix} \qquad (9.63)$$

is obtained. The matrix is not of full rank when the noise term is zero. Thus, the ZF solution can again be obtained via the minimum norm solution.

Error Propagation. So far it was assumed that the correct symbols are fed back. In general, this will only be the case when utilizing a known training sequence. Once the equalizer is used in tracking (or decision directed) mode, the detected symbols are fed back. In this case an erroneously detected symbol can cause further errors. In simulations [18], DFE structures show about 2-3 dB loss in BER due to error propagation. (A computation of the BER is also possible; see, for example, [17].) Possibilities to avoid such undesired behavior are either the application of a particular precoding (Tomlinson-Harashima [18]) or, in general, the utilization of error correcting codes such as trellis codes. See also Sect. 9.6.1 for more details in combination with the Viterbi algorithm.
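A numerical sketch of the finite-length MMSE DFE design (9.63) follows. The construction of the shifted identity I_D is one consistent reading of (9.60), with the feedback taps covering the L_B symbols after the cursor position D; all names are illustrative:

```python
import numpy as np

def dfe_mmse_finite(C, D, LB, sigma2_v):
    """Solve the block system (9.63) for feedforward f_D and feedback b_D.

    C is the LF x (LC + LF - 1) channel matrix of (9.46); I_D selects the
    LB symbols after the cursor D (requires D + LB < C.shape[1]).
    """
    LF, n = C.shape
    I_D = np.zeros((LB, n))
    I_D[np.arange(LB), D + 1 + np.arange(LB)] = 1.0
    e_D = np.zeros(n); e_D[D] = 1.0
    A = np.block([[C @ C.conj().T + sigma2_v * np.eye(LF), C @ I_D.T],
                  [I_D @ C.conj().T, I_D @ I_D.T]])
    x = np.linalg.solve(A, np.r_[C @ e_D, I_D @ e_D])
    return x[:LF], x[LF:]          # f_D, b_D
```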
9.4 Adaptive Algorithms for Channel Equalization
The equations for finite filter solutions already provide a possible solution for an adaptive filter. Assuming the channel remains constant over a certain time period, the channel (and the noise variance) can be estimated, and by applying equations (9.39) and (9.40) the ZF or MMSE solution can be computed [19]. Such a method not only requires an additional channel estimation step, but also the computation of a matrix inverse. Although often acceptable in terms of complexity, the numerical challenge of such an inversion can be quite high. For rapidly changing channels, the complexity aspect becomes burdensome as well. Thus, simpler methods with low complexity and high numerical robustness are desired. Since only the equalizer solution is wanted, it is not required to estimate the channel first. Applying gradient-type adaptive algorithms is a straightforward solution, resulting in much less complexity for this application. In the following, adaptive algorithms to minimize ZF as well as MMSE criteria based on reference models are considered.

9.4.1 Adaptively Minimizing ZF
As previously mentioned, a finite equalization filter f_o cannot be expected to deliver a perfect ZF solution. Assuming a linear channel with finite impulse response c^T = [c_0, c_1, ..., c_{L_C−1}], f_o will result in a sub-optimal ZF solution. Composing a matrix of size L_1 × (L_C + L_1 − 1) as in (9.46) allows us to write the obtained ZF solution as

$$C^H f_o = e_D + \delta_D, \qquad (9.64)$$
where e_D indicates the unit vector with a single entry "one" at the D-th position and δ_D is a residual error vector. Following this approach, the optimal ZF solution is the one given by

$$f_o = \arg\min_f \left\|C^H f - e_D\right\|^2. \qquad (9.65)$$
Note that the solution of (9.65) is not straightforward due to the rank deficiency of C. A minimum norm solution exists (9.47), causing a gain at time instant D that is not necessarily one. Or, equivalently, the contribution of δ_D at time D may not be zero. With such a reference system f_o, an iterative algorithm known as the adaptive zero-forcing algorithm is given by

$$f^*(k) = f^*(k-1) + \mu(k)\, s^*(k)\,[s(k-D) - \hat s(k-D)], \qquad (9.66)$$

where the vector f*(k) of dimension L_F estimates the equalizer solution f_o and the regression vector is s^T(k) = [s(k), s(k−1), ..., s(k−D), ..., s(k−L_F+1)]. In the literature (see for example [18]) this algorithm is usually referred to as the least-mean-square (LMS) algorithm. It is demonstrated next that this is a simplification. By introducing a signal vector t(k),

$$t^T(k) = [s(k), s(k-1), \dots, s(k-L_F-L_C+1)],$$
of appropriate length, the received signal is given by r(k) = Ct(k) + v(k), where a noise vector v(k) of dimension L_F is added as well. Applying the optimal equalizer (9.64),

$$f_o^H r(k) = f_o^H C\, t(k) + f_o^H v(k) = \underbrace{e_D^T t(k)}_{s(k-D)} + \underbrace{\delta_D^H t(k) + f_o^H v(k)}_{\bar v_o(k)} \qquad (9.67)$$

is obtained. This needs to be compared to the estimated equalizer output

$$f^H(k-1)\, r(k) = f^H(k-1)\, C\, t(k) + f^H(k-1)\, v(k) = \hat s(k-D) + \hat\delta_D^H t(k) + \underbrace{f^H(k-1)\, v(k)}_{\bar v(k)}. \qquad (9.68)$$

Thus, the error signal s(k−D) − ŝ(k−D) can be reformulated as

$$s(k-D) - \hat s(k-D) = r^T(k) f_o^* - \bar v_o(k) - r^T(k) f^*(k-1) + \bar v(k) = \left[r^T(k) - v^T(k)\right]\left[f_o^* - f^*(k-1)\right] + t^T(k)\left[\delta_D^* - \hat\delta_D^*\right]. \qquad (9.69)$$

This formulation utilizes the noise-free received value y(k) = r(k) − v(k) = Ct(k). Reformulating the update equation (9.66) in terms of the parameter error vector f̃(k) = f_o − f(k) results in

$$\tilde f^*(k) = \tilde f^*(k-1) - \mu(k)\, s^*(k)\left[y^T(k)\tilde f^*(k-1) + t^T(k)\left[\delta_D^* - \hat\delta_D^*\right]\right] = \left[I - \mu(k)\, s^*(k)\, y^T(k)\right]\tilde f^*(k-1) - \mu(k)\, s^*(k)\, t^T(k)\left[\delta_D^* - \hat\delta_D^*\right]. \qquad (9.70)$$

The additional term μ(k)s*(k)t^T(k)[δ_D* − δ̂_D*] causes a noise floor very similar to the additive noise floor when the LMS algorithm is applied for system identification [21]. However, when approaching the optimal solution, δ̂_D also tends to δ_D and the difference vanishes. Obviously, the quality of the ZF solution has an impact on the initial behavior of the iterative solution. If the ZF solution results in a large value for δ_D, the iterative solution has to cope with large additional noise. By selecting a small step-size, this effect can be decreased at the expense of a slower convergence rate. The convergence of such an algorithm is thus dependent on the properties of the matrix I − μ(k)s*(k)y^T(k) = I − μ(k)s*(k)t^T(k)C^T.
If convergence in the mean is considered, the correlation E[s∗ (k)tT (k)]=G simplifies the condition, so that the eigenvalues of I − μ(k)GC T need to be smaller than 1. However, this is not guaranteed for general channels C. A more practical solution for this problem can be obtained by a particular step-size rule. Consider the eigenvalues of I −μ(k)s∗ (k)y T (k). For this matrix
L_F − 1 eigenvalues are 1, and one eigenvalue equals 1 − μ(k)y^T(k)s*(k). A good choice for the step-size μ(k) is thus

$$\mu(k) = \alpha\,\frac{y^T(k)\,s^*(k)}{\|y(k)\|^2\,\|s(k)\|^2 + \epsilon}, \qquad (9.71)$$

where a small positive value ε is added to ensure that the denominator is strictly positive. More details on such step-size selection can be found in [20].
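The recursion (9.66) with the step-size rule (9.71) can be simulated directly. Note that (9.71) involves the noise-free signal y(k) = Ct(k), which is available in a simulation but would have to be approximated in practice; the sketch below (names are illustrative) therefore takes y as an input:

```python
import numpy as np

def adaptive_zf(s, r, y, D, LF, alpha=0.5, eps=1e-8):
    """Adaptive ZF recursion (9.66) with the step-size rule (9.71).

    s: transmitted symbols; r: received samples; y: noise-free received
    samples Ct(k), available in a simulation run.
    """
    f = np.zeros(LF, dtype=complex)
    for k in range(max(LF, D), len(s)):
        s_k = s[k - LF + 1:k + 1][::-1]        # regression vector s(k)..s(k-LF+1)
        r_k = r[k - LF + 1:k + 1][::-1]
        y_k = y[k - LF + 1:k + 1][::-1]
        e = s[k - D] - np.vdot(f, r_k)         # s(k-D) - f^H(k-1) r(k)
        mu = alpha * (y_k @ s_k.conj()) / (
            (np.vdot(y_k, y_k) * np.vdot(s_k, s_k)).real + eps)   # (9.71)
        f = f + np.conj(mu * e) * s_k          # conjugate form of (9.66)
    return f
```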
9.4.2 Adaptively Minimizing MMSE
The corresponding gradient-type algorithm for minimizing the MMSE is much easier to analyze. The update equation reads

$$f^*(k) = f^*(k-1) + \mu(k)\, r^*(k)\,[s(k-D) - \hat s(k-D)]. \qquad (9.72)$$

Utilizing the same terms as in the ZF case,

$$\tilde f^*(k) = \left[I - \mu(k)\, r^*(k)\, y^T(k)\right]\tilde f^*(k-1) - \mu(k)\, r^*(k)\, t^T(k)\left[\delta_D^* - \hat\delta_D^*\right] \qquad (9.73)$$

is obtained. The essential difference now is the appearance of the vector r(k) in place of the former s(k). Although not perfectly symmetric [note that the noisy term r(k) and the noise-free term y(k) appear together], the algorithm can be analyzed with conventional methods, including the independence assumption. The algorithm behaves very much like an LMS algorithm for system identification, however with a slightly different learning dynamic due to the additional noise in the driving signal, and with a different noise behavior.
9.4.3 Training and Tracking
The previously considered algorithms find ZF and MMSE solutions assuming the correct symbols s(k−D) are available. Once the training period is over, the system is fed with estimated symbols ŝ(k−D) instead. This mode of operation is called the decision directed, or tracking, mode. Again, a reference model delivering an optimal solution is assumed. This time, a nonlinear device is included to map the linearly estimated symbols into symbols of the transmitted alphabet A. Note that such a reference structure cannot guarantee error-free symbols s(k−D). Due to the additive Gaussian noise, a small error probability remains. In the following, the length L_F of the reference filter f_o is assumed to be sufficiently long so that such an error occurrence is negligible. Figure 9.4 depicts the structure, exhibiting four different error signals under consideration:

$$e_{LIN}(k) = z(k) - \hat z(k), \qquad (9.74)$$
$$e_{TK}(k) = \hat s(k-D) - \hat z(k), \qquad (9.75)$$
$$e_{TR}(k) = s(k-D) - \hat z(k), \qquad (9.76)$$
$$e_{NL}(k) = s(k-D) - z(k). \qquad (9.77)$$
[Block diagram of the reference model: s(k) passes through C(q−1) with additive noise v(k); the reference filter FD(q−1) followed by the nonlinear device NL yields z(k) and s(k−D), while the adaptive filter F̂D(q−1) followed by its own NL yields ẑ(k); the differences of these signals form the errors eLIN, eTK, eTR, and eNL.]

Fig. 9.4. Reference model for adaptive filtering showing various adaptation errors.
The adaptation error e(k) = s(k−D) − ŝ(k−D) of the two adaptive algorithms in the previous sections is not shown explicitly. Note that in practice, the signal e_TR(k) is used for training as long as training symbols are available, while in the tracking mode e_TK(k) is used. The adaptive algorithm thus reads

$$\hat f^*(k) = \hat f^*(k-1) + \mu(k)\, r^*(k)\,\begin{cases} e_{TR}(k), & \text{training mode} \\ e_{TK}(k), & \text{tracking mode.} \end{cases} \qquad (9.78)$$

The relation of the two errors to the error signal e_LIN(k) is given by

$$e_{LIN}(k) = e_{TR}(k) - e_{NL}(k) \qquad (9.79)$$
$$= e_{TK}(k) - e_{NL}(k) + e(k). \qquad (9.80)$$
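A sketch of the switched update (9.78) for a QPSK alphabet follows. The slicer plays the role of the nonlinear device NL; all names and parameter values are illustrative:

```python
import numpy as np

QPSK = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)

def slicer(z):
    """Nonlinear device NL: nearest alphabet symbol."""
    return QPSK[np.argmin(np.abs(QPSK - z))]

def switched_lms(r, train, D, LF, mu=0.01):
    """Switched update (9.78): e_TR while training symbols last, e_TK after."""
    f = np.zeros(LF, dtype=complex)
    decisions = []
    for k in range(LF - 1, len(r)):
        r_k = r[k - LF + 1:k + 1][::-1]
        z_hat = np.vdot(f, r_k)                        # soft output
        s_dec = slicer(z_hat)
        ref = train[k - D] if 0 <= k - D < len(train) else s_dec
        e = ref - z_hat                                # e_TR or e_TK
        f = f + mu * np.conj(e) * r_k                  # conjugate form of (9.78)
        decisions.append(s_dec)
    return f, np.array(decisions)
```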
Thus, the adaptation error in the training mode can be regarded as the combination of the MMSE obtained by a linear system identification and an additional corruption term e_NL(k), controlled only by the optimal MMSE estimate z(k). Using an adaptive filter algorithm, the following can be concluded:

• The adaptive filter works in a system identification setting.
• In the training mode, an additive noise term is present, defined in terms of the nonlinear error e_NL(k).
• In the tracking mode, the error signal contains an additional error term e(k).
• The excitation for the system identification problem is given by a composite signal, consisting of the transmitted symbols filtered by the linear channel and a white noise source.
In particular, the last point causes problems when applying low-complexity adaptive filters like the LMS algorithm. Its learning behavior is very slow for highly correlated signals [21], as is in general the case with filtered symbols. In [22] it has been found that the learning of oversampled DFE equalizers is hampered even further. Several possible solutions to overcome this problem are available:

1. A solution to this problem could be the recursive least-squares (RLS) algorithm. However, cheap fixed-point solutions on general purpose DSPs do not seem to be available [23]. Only if floating-point hardware is available can the RLS algorithm in some formulation [21] be considered for implementation.
2. A solution guaranteeing a fast convergence rate is the subband approach with polyphase filters [24], [25]. Here, the entire frequency band is split into smaller bands in which the signal appears to be more or less white; faster convergence is the result. Moreover, because of down-sampling, complexity can be saved. However, the price for these advantages is an additional delay due to the inherent block processing.
3. The convergence of the LMS algorithm can be sped up by using particular step-size matrices [26]. However, modifying the entries of such a matrix during the adaptation process can lead to unstable behavior [27].
4. Another possibility is to modify the reference model and to include the nonlinear device [28]. In this case, the conditions for the adaptive filter change considerably and standard theory for system identification cannot be applied.
5. Similar to the previous point is the idea of using particular nonlinear error signals constructed out of the estimated values ẑ(k) only. Such algorithms are called blind when they exclude a training signal for reference entirely. More on this will be considered in Sect. 9.7.
6. Due to fading in wireless channels, a fixed MMSE solution may not exist over the entire observation horizon of a data burst. In such a case, adaptive algorithms will be used to track the channel alterations. However, in a fading environment it can happen that the instantaneous channel gain becomes so weak that the signal disappears in the noise for a few symbols. Adaptive algorithms easily lose performance in such situations. If only one fade exists in a data burst, the algorithm can run forward beginning with a training preamble and backward from the preamble of the consecutive burst; in GSM, where a training midamble exists, adaptation can run from the midamble in both the forward and backward directions. Forward-backward DFE structures typically offer advantages compared to just unidirectional adaptation [29], [30].
9.5 Channel Estimation
One of the major problems with the previously considered equalizers is the typically high correlation of the received sequence, hampering the learning
rate. When using transmitted symbols, the learning rate can be expected to be much higher, since the transmitted symbols are typically uncorrelated (white) and can be considered a statistically independent sequence. Consider the received signal vector

$$r(k) = C\, s(k) + v(k) = S(k)\, c + v(k), \qquad (9.81)$$

where the first form describes the channel in matrix form with the transmitted symbols organized in a vector, while in the second form this description has been swapped. Once the training sequence S(k) is known, it can be used to estimate the channel impulse response c. The least-squares (LS) estimator is given by

$$\hat c_{LS} = \left[S^H(k)\, S(k)\right]^{-1} S^H(k)\, r(k). \qquad (9.82)$$
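The LS estimator (9.82) takes only a few lines; the row layout of S(k) as sliding windows of the training sequence is one plausible arrangement consistent with r(k) = S(k)c + v(k):

```python
import numpy as np
from scipy.linalg import toeplitz

def ls_channel_estimate(s_train, r, LC):
    """LS channel estimate (9.82) from a known training sequence.

    Rows of S are sliding windows [s(k), ..., s(k-LC+1)] so that
    r(k) = S(k) c + v(k).
    """
    s_train = np.asarray(s_train, dtype=complex)
    S = toeplitz(s_train[LC - 1:], s_train[LC - 1::-1])  # (L-LC+1) x LC
    r = np.asarray(r, dtype=complex)[LC - 1:len(s_train)]
    c_hat, *_ = np.linalg.lstsq(S, r, rcond=None)        # solves (9.82)
    return c_hat
```

For white training sequences, S^H(k)S(k) is approximately a scaled identity, which is exactly the condition under which the estimate approaches the CR bound discussed next.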
Under the assumption of additive white noise v(k), the LS estimate is known to be the best linear unbiased estimator (BLUE), i.e., the estimator with the lowest error variance (without any further a-priori knowledge of the given statistics). More importantly, under Gaussian noise, the LS estimator achieves the Cramer-Rao (CR) bound [21], thus delivering an efficient estimate that cannot be improved by any other unbiased estimator [31]. Its variance is given by trace([S^H(k)S(k)]^{-1}), and thus the CR bound is smallest if S^H(k)S(k) = I. Hence, orthogonal sequences are of highest interest for fastest training [31], [32].

9.5.1 Channel Estimation in MIMO Systems
In MIMO systems with M transmit and N receive antennas it is of interest to estimate all MN paths simultaneously. Thus, M transmit sequences have to be found that do not interfere with each other. In [33], it is shown that for optimal MIMO training trace([S^H(k)S(k)]^{-1}) must be minimized, where S(k) is a matrix of the form

$$S(k) \stackrel{\Delta}{=} \begin{bmatrix} s_1(L) & \dots & s_1(L-L_C+1) & \dots & s_M(L) & \dots & s_M(L-L_C+1) \\ \vdots & & \vdots & \dots & \vdots & & \vdots \\ s_1(L_C) & \dots & s_1(1) & \dots & s_M(L_C) & \dots & s_M(1) \end{bmatrix},$$

in which all M training sequences of length L for estimating channels of length L_C < L are combined. It turns out that the minimum LS error is obtained if and only if S^H(k)S(k) = (L − L_C + 1)I (a necessary condition being L − L_C + 1 ≥ M L_C), and that this minimum error is given by

$$\min_{S(k)} E\left\|c - \hat c_{LS}\right\|^2 = \frac{M L_C}{L - L_C + 1}\,\sigma_v^2, \qquad (9.83)$$

allowing the interesting interpretation that the error grows in proportion to the number of transmit antennas, while a growing training length L can compensate for this effect.
9.5.2 Estimation of Wireless Channels
Depending on the wireless environment, channel impulse responses can be as short as 25 ns (in small rooms [34]) and, in mountainous areas, as long as 60 μs. Thus, given the data rate and modulation scheme, a certain symbol length is defined, and if it is much smaller than the channel impulse response, the channel estimation vector can be expected to be of large dimension. However, typical wireless channels display a specific structure: most of the energy is contained in only a few taps of the impulse response. By concentrating only on such positions, most of the channel energy can be captured with much less complexity. Typically, four positions (called fingers) are sufficient; up to ten may be used. A receiver architecture exploiting such channel structure is called the RAKE receiver [18]. Again, ZF or MMSE techniques can be applied to reconstruct the transmitted symbol with a channel equalizer, or maximum-likelihood and Viterbi techniques can be used based on the channel information. Note, however, that in addition to the low-complexity RAKE structure, further algorithms are required to find the optimal finger locations and possibly track such locations when the channel is time-varying.

Once the channels are not static but time-varying, as is expected in mobile communications, the tracking behavior of adaptive algorithms becomes important. For small movement (f_D/f_C < 10^{−7}, with f_D the Doppler and f_C the carrier frequency), typical algorithms like LMS and RLS track the channel quite well [18], [21]. However, as the mobile movement becomes larger, standard algorithms can no longer track well. Of particular interest in recent years is the estimation of rapidly changing channels. Since mobiles are expected to move at speeds up to 350 km/h (the French TGV train, for example), channel estimation becomes challenging. Some methods to improve estimation quality in this environment are discussed in the following.

9.5.3 Channel Estimation by Basis Functions
An approach for achieving better results when applying channel estimation techniques to rapidly changing channels is the utilization of basis functions [35], [36]. If the channel is considered only for a limited time period, each coefficient c_i(k); i = 0, ..., L_C−1 is assumed to vary in time, and thus describes a function in time in its real and imaginary parts. Such a function can be approximated by orthogonal basis functions, the simplest of them being the exponential function. A model for one coefficient over a certain time period can thus be written as

$$c_i(k) = \sum_{l=0}^{L_g-1} a_{il}\, \exp(j\Omega_l k) = a_i^T\, g(k), \qquad (9.84)$$
where the coefficients a_il are gathered in a vector a_i and the exponential terms, based on the frequencies Ω_l, in a second, time-varying vector g(k). The received signal is thus given by

$$r(k) = \sum_{i=0}^{L_C-1} c_i(k)\, s(k-i) + v(k) = \sum_{i=0}^{L_C-1} a_i^T\, g(k)\, s(k-i) + v(k) = a^T \tilde s(k) + v(k), \qquad (9.85)$$
in which the vector a consists of all time-constant parameters a_il and s̃(k) of all time-varying components exp(jΩ_l k)s(k−i). If s̃(k) is known a priori, the unknown vector a can be estimated from the observations r(k) over a time period larger than the dimension of a [36]. Another approach allows us to reformulate the received vector in matrix form:

$$r(k) = G(k)\, A\, s(k) + v(k), \qquad (9.86)$$

with new matrices G(k) and A containing the basis functions and the column vectors a_i, respectively. The matrix G(k) exhibits the particular structure of an FFT matrix when Ω_l = Ω_o l, in which case the various basis functions can be interpreted as Doppler spectral components of the time-varying channel. Thus, its inverse is simply given by G^H(k) and can be applied to the received sequence r(k), making the values independent of the time variations. The matrix elements of A can then be estimated by conventional estimation techniques like LS. By reformulating As(k) into S(k)a, the LS estimate is â_LS = [S^H(k)S(k)]^{-1}S^H(k)G^H(k)r(k). Note that this is a particularly low-complexity solution, since G^H(k)r(k) can be computed by an FFT and [S^H(k)S(k)]^{-1}S^H(k) can be pre-computed for the training mode.
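A sketch of the basis-expansion estimate: stack the known time-varying regressors exp(jΩ_l k)s(k−i) of (9.85) and solve for the time-constant parameters a_il by LS. The frequencies Ω_l and all names are illustrative; identifiability requires at least L_C·L_g observations:

```python
import numpy as np

def bem_channel_estimate(s_train, r, LC, omegas):
    """LS fit of the basis-expansion model (9.84)-(9.85):
    c_i(k) = sum_l a_il exp(j*Omega_l*k)."""
    s_train = np.asarray(s_train, dtype=complex)
    omegas = np.asarray(omegas)
    rows, targets = [], []
    for k in range(LC - 1, len(s_train)):
        g_k = np.exp(1j * omegas * k)                   # basis vector g(k)
        s_win = s_train[k - LC + 1:k + 1][::-1]         # s(k), ..., s(k-LC+1)
        rows.append(np.kron(s_win, g_k))                # one row of the regressor
        targets.append(r[k])
    a, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return a.reshape(LC, len(omegas))                   # a_il
```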
9.5.4 Channel Estimation by Predictive Methods
Another class of adaptive algorithms includes knowledge of the time-varying process of the channel filter coefficients [37]. Here, a model of this process is incorporated into the adaptive process, for example by Kalman filtering [38]. The channel coefficients can be described by

$$c(k) = A(k)\, c(k-1) + u(k), \qquad (9.87)$$
where A(k) describes the transition of the channel states from time instant k−1 to k and u(k) is a driving process. If the channel is described by statistical methods, A(k) = A can be as simple as a filter causing a particular autocorrelation function and spectrum of the elements c_i(k) of c(k). If the time-variation of the channel is caused by two-dimensional motion of the transmitter and/or receiver, the so-called Jakes spectrum is obtained [39].
The optimal adaptive filter in such situations [described by (9.87)] is the Kalman filter [40]. However, this requires not only knowledge of the filter structure of A(k) but also precise knowledge of its parameters. A similar but computationally simpler approach is to apply the Wiener least-mean-square (WLMS) algorithm [41], [42]. Here, the dynamic behavior of the channel is described by a simpler one-dimensional autoregressive (AR) model which is included in the parameter estimation part. If not even the model structure is known, simple predictive methods [43] can be utilized. The estimates from LMS filtering, for example, can be linearly combined as

$$\hat{\bar c}(k) = \sum_{l=0}^{L_p} \gamma_l\, \hat c(k-l-D), \qquad (9.88)$$
with a fixed positive delay D defining the prediction order. Having statistical knowledge about the random process c(k), the optimal coefficients γ_l can be pre-computed. Good results were obtained with the simple approach

$$\hat{\bar c}(k) = \hat c(k-D) + \frac{p}{L_p}\left[\hat c(k-D) - \hat c(k-D-L_p)\right], \qquad (9.89)$$

where p is the prediction step-size. Obviously, the selection of the step-size p depends primarily on the parameters of the random process. Other approaches do not exhibit the prediction process in the algorithmic modification. The proportionate LMS (PLMS) [26], [27], for example, assigns individual step-sizes according to the energy of the various weights. This can also be interpreted as a predictive method.
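The simple predictor (9.89) amounts to a two-point linear extrapolation of past channel estimates; a minimal sketch (names are illustrative):

```python
import numpy as np

def predict_channel(c_hist, D, Lp, p):
    """Channel predictor (9.89) from past estimates (newest last)."""
    c_d = np.asarray(c_hist[-1 - D])          # c_hat(k - D)
    c_old = np.asarray(c_hist[-1 - D - Lp])   # c_hat(k - D - Lp)
    return c_d + (p / Lp) * (c_d - c_old)
```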
9.6 Maximum Likelihood Equalization
Having available a good estimate of the channel coefficients simplifies the detection process considerably. In this case, an optimal detection method can be considered. The maximum a-posteriori probability (MAP) detector is known to provide the best performance [18]. Its decision is based on Bayes' rule,

$$P(s_m|r) = \frac{P(r|s_m)\, P(s_m)}{P(r)}, \qquad (9.90)$$
where P(r|s_m) is the conditional PDF of the observed vector r given the transmitted signal s_m. In the MAP approach, P(s_m|r) has to be maximized in order to find the most likely sequence s_m. Since P(r) = Σ_m P(r|s_m)P(s_m) does not depend on the decision, this term deserves no further attention. Assuming that all sequences are transmitted with the same probability P(s_m), the expression simplifies and the
so-called maximum likelihood (ML) estimator is obtained. In the case of additive noise, the conditional PDF P(r|s_m) can be computed. Assuming noise with a Gaussian distribution, maximizing P(s_m|r) becomes equivalent to minimizing the Euclidean distance ‖r − s_m‖. The principle can be applied to transmission with ISI. In this case, the minimization has to include the channel information, i.e.,

$$\hat s_{ML} \stackrel{\Delta}{=} \arg\min_{s \in \mathcal{A}} \|r - Cs\|^2. \qquad (9.91)$$

Suppose a sequence of L symbols with a P-ary alphabet A has been transmitted. Then, the minimization has to be performed over P^L possible values, a complexity growing exponentially with the number of transmitted symbols. Such high complexity is the reason why brute-force ML equalization has not been used much in the past. Recently, however, ML as a realizable receiver technique has drawn more attention, since MIMO antenna systems typically employ a small number of transmit and receive antennas, so that an ML technique can be used to search for the most likely instantaneous symbols [13], [12].
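For reference, the brute-force search (9.91) takes only a few lines, which makes its exponential cost explicit; it is feasible only for very small L (names are illustrative):

```python
import numpy as np
from itertools import product

def ml_sequence(r, C, alphabet):
    """Brute-force ML detection (9.91): search all P^L symbol sequences."""
    L = C.shape[1]
    best, best_metric = None, np.inf
    for s in product(alphabet, repeat=L):       # P^L candidates
        s = np.array(s)
        metric = np.linalg.norm(r - C @ s) ** 2
        if metric < best_metric:
            best, best_metric = s, metric
    return best
```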
9.6.1 Viterbi Algorithm
A technique that allows performance close to the ML technique but with linear complexity is the Viterbi algorithm (VA) [18], [44], [45], [46]. The VA structures the ML search recursively, based on a finite state machine (FSM) description of the signal generation. Such an FSM is given in the context of a channel of length L_C with the past symbol values s(k−1), s(k−2), ..., s(k−L_C) as the states. If the next symbol s(k) is to be detected, its impact on all possible L_C positions of the estimated channel filter has to be computed, i.e., roughly P^{L_C} operations. Among those possibilities, only the P best ones, i.e., those with the smallest metric, are selected; these are called survivor paths, and based on the survivor metrics the next symbol is investigated. Thus, for L symbols, the complexity is roughly L·P^{L_C}, and for L_C ≪ L the VA has much lower complexity than brute-force ML. Once all symbols have been taken into account, a backward algorithm is required to find the optimal symbol sequence in the set of smallest metrics. An important issue is the delay until the detected symbol is available. It turns out that with high probability the survivor paths agree on a symbol s(k) after D steps as long as D ≥ 5L_C. If they do not agree, the decision can be based on the most probable path with the smallest metric. The Viterbi algorithm can be used to reduce the error propagation in a DFE structure. Since the length L_B of a DFE filter can be much shorter than the actual channel length L_C, the complexity of the VA becomes smaller. Such reduced-state sequence estimation (RSSE) techniques have been exploited in [47], [18], where trellis codes have also been considered. In a recent development [48], a tap-selective DFE has been implemented in which only the most energy-carrying
coefficients are utilized in the search, reducing the number of states considerably. For time-varying channels, however, the latency in the decision of the VA can be quite prohibitive. The minimum decision latency of L_B can be further reduced by parallel decision feedback decoding (PDFD) techniques [49], running several DFE structures in parallel and selecting the most likely one at a later stage. Another promising technique to deal with time-varying channels combines the concept of basis functions (see Sect. 9.5.3) with the VA [50].
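A textbook sketch of the VA over an ISI channel, with the last L_C − 1 symbols as the FSM state; it illustrates the L·P^{L_C} complexity mentioned above and omits the usual truncated-traceback refinements (all names are illustrative):

```python
import numpy as np
from itertools import product

def viterbi_equalize(r, c, alphabet):
    """Viterbi sequence detection over an ISI channel c."""
    c = np.asarray(c, dtype=complex)
    LC = len(c)
    states = list(product(alphabet, repeat=LC - 1))   # P^(LC-1) states
    metric = {st: 0.0 for st in states}
    paths = {st: [] for st in states}
    for rk in r:
        new_metric, new_paths = {}, {}
        for st in states:             # st = (s(k-1), ..., s(k-LC+1))
            for sk in alphabet:
                y = c[0] * sk + np.dot(c[1:], st)     # expected r(k)
                m = metric[st] + abs(rk - y) ** 2
                nst = (sk,) + st[:-1]
                if nst not in new_metric or m < new_metric[nst]:
                    new_metric[nst] = m
                    new_paths[nst] = paths[st] + [sk]
        metric, paths = new_metric, new_paths
    best = min(metric, key=metric.get)   # survivor with the smallest metric
    return paths[best]
```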
9.7 Blind Algorithms
When training signals are entirely absent, the transmission is called blind, and adaptive algorithms for estimating the transferred symbols, and possibly channel or equalizer information, are called blind algorithms [6], [51]. Since training information is not available, a reliable reference is missing, leading to very slow learning behavior in such algorithms. Thus, blind methods are typically of interest when a large amount of data is available and quick detection is not important. Their major application field is thus broadcasting for digital radio or TV. However, recently the concept of basis functions (see Sect. 9.5.3) to describe time-variant channels has been incorporated, showing that blind techniques also have potential for time-varying channels, in particular for MIMO transmissions [52]. Various principles allowing blind algorithms to successfully detect the symbols can be distinguished:

1. Algorithms utilizing the constant modulus (CM) property of the transmitted signal [53], [54], [55] are historically probably the first blind algorithms. The constant modulus algorithm (CMA)

$$f^*(k) = f^*(k-1) + \mu(k)\, x^*(k)\, y(k)\, g[y(k)], \qquad (9.92)$$
$$y(k) = f^H(k-1)\, x(k), \qquad (9.93)$$

is the most well-known procedure and appears with many variations of the nonlinear function g[·] (mostly of the form g[y] = γ − |y|^n) [56], [57] and in many applications [58]; a small implementation sketch is given after this list. While the convergence analysis of such algorithms is limited to very few cases [56], the analysis of the tracking behavior, i.e., the steady-state performance, has made progress. In [57], the feedback approach from [56] has been extended and conditions were presented under which the steady-state error can be computed. In particular, for CMA1-2 and CMA2-2, it was shown that the proper selection of g[·] can lead to improved performance.
2. Algorithms based on higher order statistics (HOS), also called higher order moments, have been introduced by Shalvi and Weinstein [59]. In particular, the kurtosis, i.e., the ratio of the fourth-order moment to the squared second-order moment, is of interest (also in the context of multiuser detection [8]). This ratio is 3 for Gaussian sequences but typically
smaller for transmission symbols. It has been recognized that most algorithms based on the CM property also satisfy this condition [53], [54], [55].
3. Second order statistics (SOS) [60] usually do not carry phase information and thus cannot be used to identify linear systems. If more than one transmission channel is present, however, the missing phase information can be delivered by SOS techniques. A simple example is a two-channel scenario transmitting the same sequence s(k) over both channels C_1(q−1) and C_2(q−1). In the absence of noise, the received signals are r_1(k) = C_1(q−1)s(k) and r_2(k) = C_2(q−1)s(k). Thus, C_2(q−1)r_1(k) = C_1(q−1)r_2(k). The selection of

$$[R_1,\, R_2]\begin{bmatrix} c_1 \\ c_2 \end{bmatrix} = 0 \qquad (9.94)$$

guarantees a unique solution up to a constant if C_1(q−1) and C_2(q−1) do not share any common zeroes, where R_1 and R_2 are Toeplitz matrices of the received sequences r_1(k) and r_2(k). (In order to obtain a unique solution there is also a persistent excitation condition on the input signal s(k). If the two channels do share at least one common zero, there are multiple solutions.) In the presence of noise, this condition can be modified to

$$\arg\min_{\|c_1\|^2+\|c_2\|^2=1}\; \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}^H [R_1,\, R_2]^H\, [R_1,\, R_2] \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}. \qquad (9.95)$$

The solution of such a problem is typically found by singular-value decomposition (SVD) methods. SVD methods (also called subspace techniques) are of high complexity and usually numerically challenging. Not only is an extension of (9.95) to multiple channels possible, but various variants also exist, using cross-correlation functions or spectra. A good overview of such techniques can be found in [61], [10]. An interesting modification for MIMO transmission is the space-time matrix modulation method (STMM) [62]. The received signal consists of

$$r(k) = C \sum_{l=1}^{L} m_l(k)\, s_l(k) + v(k), \qquad (9.96)$$

where m_l(k)s_l(k) is a term of the mixture matrix combining L different data streams onto a certain number of transmit antennas. With a fixed mixture vector sequence m_l(k), known at the transmitter and receiver, it can be shown that the channel C and the symbols s_l(k) can be separated uniquely up to a constant. The advantage of STMM is that not only SVD methods can be applied; ML and even a much simpler projection algorithm also lead to successful equalization [63]. Combinations with basis function approaches are possible as well [63], [64].
4. Blind ML methods try to estimate the channel or, alternatively, the equalizer, and the transmitted data sequence at the same time. Hereby, two criteria are common:

$$\min_{\{c,s\}} \|r - Cs\|^2 = \min_{\{c,s\}} \|r - Sc\|^2, \qquad (9.97)$$
$$\min_{\{f,s\}} \|s - Fr\|^2 = \min_{\{f,s\}} \|s - Rf\|^2, \qquad (9.98)$$

where C and F are Toeplitz matrices of the channel c and equalizer f, respectively, and S and R those of the sequences s and r, respectively. Typically, one minimizes the first term with a fixed channel/equalizer and then the second with a fixed data sequence, and runs such a procedure several times until convergence is observed. Hereby, the Toeplitz structure of the matrices, as well as the property that s(k) stems from a finite alphabet, is utilized to optimize the technique. In [6], [61] the first criterion is utilized, while the second can be found in [65]. In order to succeed, a good starting value is required. This is usually achieved by a very short training sequence; hence such techniques are called semi-blind.
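The sketch announced in item 1 above: a minimal CMA2-2 recursion of (9.92)-(9.93) with g[y] = γ − |y|², using a common center-spike initialization (parameters and names are illustrative):

```python
import numpy as np

def cma_equalizer(x, LF, mu=1e-3, gamma=1.0):
    """Blind CMA2-2 recursion, one common instance of (9.92)-(9.93)."""
    f = np.zeros(LF, dtype=complex)
    f[LF // 2] = 1.0                       # center-spike initialization
    for k in range(LF - 1, len(x)):
        x_k = x[k - LF + 1:k + 1][::-1]
        y = np.vdot(f, x_k)                # y(k) = f^H(k-1) x(k)
        g = gamma - np.abs(y) ** 2         # g[y] for n = 2
        f = f + np.conj(mu * y * g) * x_k  # conjugate form of (9.92)
    return f
```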
9.8 Conclusions
In this chapter, an overview of adaptive equalizer techniques was presented. Special emphasis was given to techniques applied in modern wireless systems, where channels are frequency- and time-dispersive. Many basic concepts were explained and brought into the context of the multiple-input multiple-output systems that will appear in the near future in wireless communication systems. A short overview of blind techniques was given, demonstrating the potential of new signal processing techniques even better suited to the particular needs of wireless communications.
Acknowledgment

The authors would like to thank Harold Artés for his careful reading of the manuscript and for pointing out many inconsistencies.
References

1. R. W. Lucky, "Automatic equalization for digital communication," Bell Syst. Tech. J., vol. 44, pp. 547–588, Apr. 1965.
2. G. Forney, "Maximum likelihood sequence estimation of digital sequences in the presence of intersymbol interference," IEEE Trans. Information Theory, vol. 18, no. 3, pp. 363–378, 1972.
3. J. M. Cioffi, G. Dudevoir, M. Eyuboglu, and G. D. Forney, Jr., "MMSE decision feedback equalization and coding–Part I," IEEE Trans. Commun., vol. 43, no. 10, pp. 2582–2594, Oct. 1995.
4. N. Al-Dhahir and J. M. Cioffi, "MMSE decision feedback equalizers: finite length results," IEEE Trans. Information Theory, vol. 41, no. 4, pp. 961–975, July 1995.
5. J. R. Treichler, I. Fijalkow, and C. R. Johnson, Jr., "Fractionally spaced equalizers," IEEE Signal Processing Mag., pp. 65–81, May 1996.
6. J. K. Tugnait, L. Tong, and Z. Ding, "Single user channel estimation and equalization," IEEE Signal Processing Mag., vol. 17, no. 3, pp. 17–28, May 2000.
7. P. A. Fuhrman, A Polynomial Approach to Linear Algebra. Springer, N.Y., 1996.
8. S. Haykin, Unsupervised Adaptive Filtering. Wiley-Interscience, N.Y., 2000.
9. T. Kailath, Linear Systems. Prentice Hall, Englewood Cliffs, N.J., 1980.
10. G. B. Giannakis et al., Signal Processing Advances in Wireless & Mobile Communications. Prentice Hall, vol. 1, 2001.
11. P. W. Wolniansky, G. J. Foschini, G. D. Golden, and R. A. Valenzuela, "V-BLAST: an architecture for achieving very high data rates over rich-scattering wireless channels," in Conf. Record of ISSSE, Pisa, Italy, 1998.
12. A. Burg, E. Beck, M. Rupp, D. Perels, N. Felber, and W. Fichtner, "FPGA implementation of a MIMO receiver front-end for UMTS," in Conf. Record of Int. Zurich Seminar on Broadband Commun., 2002, pp. 8-1–8-6.
13. R. van Nee, A. van Zelst, and G. Awater, "Maximum likelihood decoding in a space division multiplexing system," in Conf. Record of VTC, Japan, 2000.
14. B. A. Bjerke and J. G. Proakis, "Equalization and decoding for multiple-input multiple-output wireless channels," EURASIP Journal on Applied Signal Processing, vol. 3, pp. 249–266, 2002.
15. J. Salz, "Optimum mean-square decision feedback equalization," Bell Syst. Tech. J., vol. 52, no. 8, Oct. 1973.
16. C. B. Papadias and A. J. Paulraj, "Unbiased decision feedback equalization," in Conf. Record of the IEEE Intern. Symp. on Information Theory, 1998, p. 448.
17. C. B. Papadias and M. Rupp, "Performance analysis of finite-length DFE receivers based on a polyphase representation," in Conf. Record of the 32nd Asilomar Conf. on Signals, Systems and Computers, 1998, pp. 374–378.
18. J. G. Proakis, Digital Communications. McGraw-Hill, fourth edition, 2001.
19. I. Ghauri and D. T. M. Slock, "Linear receivers for the DS-CDMA downlink exploiting orthogonality of spreading sequences," in Conf. Record of the 32nd Asilomar Conf. on Signals, Systems and Computers, 1998.
20. M. Rupp, "Normalization and convergence of gradient-based algorithms for adaptive IIR filters," Signal Processing, vol. 46, no. 1, pp. 15–30, Sept. 1995.
21. S. Haykin, Adaptive Filter Theory. Fourth edition, Prentice Hall, 2001.
22. M. Rupp, "On the learning behavior of decision feedback equalizers," in Conf. Record of the 33rd Asilomar Conf. on Signals, Systems and Computers, vol. 1, 1999, pp. 514–518.
23. A. P. Liavas and P. A. Regalia, "On the numerical stability and accuracy of the conventional recursive least squares algorithm," IEEE Trans. Signal Processing, pp. 88–96, Jan. 1999.
24. H. Mohamad, S. Weiss, M. Rupp, and L. Hanzo, "A fast converging fractionally spaced equalizer," in Conf. Record of the 35th Asilomar Conf. on Signals, Systems and Computers, 2001.
25. H. Mohamad, S. Weiss, M. Rupp, and L. Hanzo, "Fast adaptation of fractionally spaced equalizers," Electronics Letters, vol. 38, no. 2, pp. 96–98, Jan. 2002.
26. S. L. Gay, "An efficient, fast converging adaptive filter for network echo cancellation," in Conf. Record of the 32nd Asilomar Conf. on Signals, Systems and Computers, 1998, pp. 394–398.
27. M. Rupp and J. Cezanne, "Robustness conditions of the LMS algorithm with time-variant matrix step-size," Signal Processing, vol. 80, no. 9, pp. 1787–1794, Sept. 2000.
28. M. Rupp and A. H. Sayed, "Robustness and convergence of adaptive schemes in blind equalization," in Conf. Record of the 30th Asilomar Conf. on Signals, Systems and Computers, vol. 1, 1996, pp. 271–275.
29. A. Bahai and M. Rupp, "Adaptive DFE algorithms for IS-136 based TDMA cellular phones," in Conf. Record of the IEEE International Conf. on Acoustics, Speech, and Signal Processing, vol. 3, 1997, pp. 2489–2492.
30. J. Balakrishnan and C. R. Johnson, Jr., "Time-reversal diversity in decision feedback equalization," in Conf. Record of the Allerton Conf. on Communication, Control and Computing, Monticello, IL, 2000.
31. S. N. Crozier, D. D. Falconer, and S. A. Mahmoud, "Least sum of squared errors channel estimation," IEE Proc. F, vol. 138, no. 4, pp. 371–378, Aug. 1991.
32. M. Rupp, "Fast implementation of the LMS algorithm," in Conf. Record of Eusipco, Tampere, 2000.
33. J. Balakrishnan, M. Rupp, and H. Vishwanatan, "Optimal channel training for multiple antenna systems," in Conf. Record of Multiaccess, Mobility and Teletraffic for Wireless Communications, 2000.
34. T. S. Rappaport, Wireless Communications. Prentice Hall, 1996.
35. L. Greenstein and B. Czekaj, "Modeling multipath fading responses using multitone probing signals and polynomial approximation," Bell Syst. Tech. J., vol. 60, pp. 193–214, 1981.
36. M. K. Tsatsanis and G. B. Giannakis, "Modeling and equalization of rapidly fading channels," Int. J. Adaptive Control Signal Processing, vol. 10, pp. 159–176, 1996.
37. A. Duel-Hallen, S. Hu, and H. Hallen, "Long range prediction of fading signals," IEEE Signal Processing Mag., vol. 17, no. 3, pp. 62–75, May 2000.
38. R. A. Iltis and A. W. Fuxjaeger, "A digital DS spread-spectrum receiver with joint channel and Doppler shift estimation," IEEE Trans. Commun., vol. 39, no. 8, Aug. 1991.
39. W. C. Jakes, Microwave Mobile Communications. IEEE Press, 1974.
40. T. Kailath, A. H. Sayed, and B. Hassibi, Linear Estimation. Prentice Hall, 1999.
41. L. Lindbom, M. Sternad, and A. Ahlen, "Tracking of time-varying mobile radio channels: part I. The Wiener LMS algorithm," IEEE Trans. Commun., pp. 2207–2217, Dec. 2001.
42. L. Lindbom, A. Ahlen, M. Sternad, and M. Falkenstrom, "Tracking of time-varying mobile radio channels: part II. A case study," IEEE Trans. Commun., pp. 156–167, Jan. 2002.
43. M. C. Chiu and C. Chao, "Analysis of LMS-adaptive MLSE equalization on multipath fading channels," IEEE Trans. Commun., pp. 1684–1692, Dec. 1996.
44. A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Trans. Information Theory, vol. IT-13, Apr. 1967.
45. E. A. Lee and D. G. Messerschmitt, Digital Communication. Kluwer, 2nd edition, 1994.
46. H. L. Lou, "Implementing the Viterbi algorithm," IEEE Signal Processing Mag., vol. 12, no. 5, pp. 42–52, Sept. 1995.
47. M. Eyuboglu and S. Qureshi, "Reduced-state sequence estimation for coded modulation on intersymbol interference channels," IEEE Journal Sel. Areas Commun., vol. 7, pp. 989–995, Aug. 1989.
48. E. F. Haratsch, A. J. Blanksby, and K. Azadet, "Reduced state sequence estimation with tap-selective decision feedback," in IEEE Int. Conf. on Commun., vol. 1, 2000, pp. 372–376.
49. H. L. Lou, M. Rupp, R. L. Urbanke, H. Viswanatan, and R. Krishnamoorthy, "Efficient implementation of parallel decision feedback decoders for broadband applications," in Conf. Record of the 6th IEEE Int. Conf. on Electronics, Circuits and Systems, vol. 3, 1999, pp. 1475–1478.
50. J. Bakkoury, D. Roviras, M. Ghogho, and F. Castanie, "Adaptive MLSE receiver over rapidly fading channels," Signal Processing, vol. 80, pp. 1347–1360, 2000.
51. S. Haykin, Blind Deconvolution. Prentice Hall, Englewood Cliffs, N.J., 1994.
52. H. Liu and G. B. Giannakis, "Deterministic approaches for blind equalization of time-varying channels with antenna arrays," IEEE Trans. Signal Processing, vol. 46, no. 11, pp. 3003–3013, Nov. 1998.
53. Y. Sato, "A method of self-recovering equalization for multilevel amplitude modulation," IEEE Trans. Commun., vol. COM-23, pp. 679–682, June 1975.
54. D. N. Godard, "Self-recovering equalization and carrier tracking in two-dimensional data communication systems," IEEE Trans. Commun., vol. COM-28, pp. 1867–1875, Nov. 1980.
55. G. J. Foschini, "Equalization without altering or detecting data," AT&T Tech. Journal, vol. 64, pp. 1885–1911, 1985.
56. M. Rupp and A. H. Sayed, "On the convergence of blind adaptive equalizers for constant modulus signals," IEEE Trans. Commun., vol. 48, no. 5, pp. 795–803, May 2000.
57. J. Mai and A. H. Sayed, "A feedback approach to the steady-state performance of fractionally spaced blind equalizers," IEEE Trans. Signal Processing, vol. 48, no. 1, pp. 80–91, Jan. 2000.
58. J. Treichler and C. R. Johnson, Jr., "Blind fractionally spaced equalization of digital cable TV," in Conf. Record of the 7th IEEE DSP Workshop, 1996, pp. 122–130.
59. O. Shalvi and E. Weinstein, "New criteria for blind deconvolution of nonminimum phase systems (channels)," IEEE Trans. Information Theory, vol. IT-39, pp. 292–297, Jan. 1990.
60. L. Tong, G. Xu, and T. Kailath, "A new approach to blind identification and equalization of multipath channels," in Conf. Record of the 25th Asilomar Conf. on Signals, Systems and Computers, 1991.
61. L. Tong and S. Perreau, "Multichannel blind channel estimation: from subspace to maximum likelihood methods," Proc. of the IEEE, vol. 86, pp. 1951–1968, Oct. 1998.
62. H. Artes and F. Hlawatsch, "Blind equalization of MIMO channels using deterministic precoding," in Conf. Record of the IEEE Intern. Conf. on Acoustics, Speech, and Signal Processing, vol. 4, 2001, pp. 2153–2156.
63. H. Artes, F. Hlawatsch, and G. Matz, "Efficient POCS algorithms for deterministic blind equalization of time-varying channels," in Conf. Record of IEEE Globecom, 2000, pp. 1031–1035.
64. H. Artes and F. Hlawatsch, "Blind multiuser equalization for time-varying channels," in Conf. Record of the third IEEE Signal Processing Workshop SPAWC, 2001, pp. 102–105.
65. J. Laurila, R. Tschofen, and E. Bonek, "Semi-blind space-time estimation of co-channel signals using least squares projections," in Conf. Record of the 50th IEEE Vehicular Technology Conf., vol. 3, 1999, pp. 1310–1315.
10 Adaptive Space-Time Processing for Wireless CDMA

Sofiène Affes and Paul Mermelstein
INRS-Telecommunications, University of Quebec
800, de la Gauchetière Ouest, Suite 6900
Montreal, Quebec, H5A 1K6, Canada
E-mail: {affes, mermel}@inrs-telecom.uquebec.ca

Abstract. We consider adaptive space-time processing for wireless receivers in CDMA networks. Currently, the 2D RAKE is the most widely used space-time array-processor; it combines multipath signals sequentially, first in space, then in time. We introduce incremental processing improvements to arrive ultimately at a more efficient one-dimensional joint space-time (1D-ST) adaptive processor named STAR, the spatio-temporal array-receiver. STAR improves the receiver's performance by approaching blind coherent space-time maximum ratio combining (ST-MRC). With blind quasi-coherent joint ST-MRC, STAR outperforms the 2D RAKE by improving the combining operation, with a 2 dB gain in SNR. With quasi-blind coherent joint ST-MRC, STAR can enhance channel identification while significantly reducing the pilot-power or -overhead fraction, leading to a 1 dB gain in SNR. These gains translate to a significant performance advantage for all versions of 1D-ST STAR over current 2D RAKE-type receivers.
10.1 Introduction
Recently adopted standards confirm that CDMA is the preferred air-interface technology for third-generation wireless systems [1]-[3]. They also recognize adaptive antenna arrays as a key means of increasing capacity and spectrum efficiency. In the context of this important real-world application, adaptive space-time processing can respond to the need for fast and reliable transmission over the noisy and time-varying channels encountered in wireless access. Adaptive space-time processing addresses a broad range of issues that aim to improve: 1) multipath channel identification and combining [4], 2) synchronization [4], [5], 3) interference reduction [6] or suppression [7], etc. We focus here on the first issue and propose useful upgrades of the 2D RAKE [8]-[11] that ultimately implement a more efficient adaptive array-processor, STAR, the spatio-temporal array-receiver [4], [5].

The 2D RAKE, developed first by a research group at Stanford University [8]-[11], is a two-dimensional space-time adaptive array-processor, widely investigated today, which combines signals sequentially, first in space, then in time, over CDMA multipath Rayleigh-fading channels. In the blind mode (i.e., without a pilot), 2D-RAKE receivers estimate the channel with phase ambiguities and implement sequential combining: first noncoherent
spatial MRC, followed by temporal equal-gain combining (EGC). In the pilot-assisted mode they use the pilot for channel identification [12]-[16] and hence require a pilot with sufficient power to estimate the channel accurately; they implement first reference-assisted coherent spatial MRC, followed by temporal MRC. This contribution considers an adaptive array-receiver that significantly improves receiver performance and approaches that of blind coherent joint ST-MRC. First, in the blind mode we exploit the flexibility of the decision-feedback identification (DFI) procedure [4], [17] for channel estimation in the STAR receiver [4], [5] to arrive at a 2D RAKE with an initial feedback mode in an improved adaptive structure. Further upgrades of the feedback mode in the DFI procedure ultimately enable identification of the channel within a constellation-invariant phase ambiguity and hence allow implementation of quasi-coherent (i.e., differential decoding after coherent detection) joint ST-MRC with about a 2 dB gain in SNR. Second, in the pilot-assisted mode we exploit a much weaker pilot than needed by the 2D-RAKE receiver to estimate and then compensate the constellation-invariant phase ambiguity of the channel identified blindly and more accurately, and therefore implement quasi-blind or asymptotically blind coherent joint ST-MRC [18], [19], [21]. Thereby STAR can outperform the 2D-RAKE receiver by enhancing channel identification and by significantly reducing the pilot power or overhead fraction (in the range of 1%) and the resulting interference. Both enhancements combined result in a total SNR gain of 1 dB and enable significant battery power-savings and spectrum-efficiency gains.

This upgrade process allows us to replace the sequential spatial-temporal processing in 2D RAKE-type receivers by one-dimensional joint space-time (1D-ST) processing in STAR and thereby improve channel identification. We present novel and significant analytical results [17], [19] that establish clearly the performance advantages of one-dimensional joint space-time processing in 1D-ST STAR over the two-dimensional, spatial-then-temporal sequential processing widely implemented today in 2D RAKE-type receivers. We show that 1D-ST structured adaptive receivers reduce both complexity and channel identification errors, increase robustness to changing propagation conditions, and speed up convergence over multipath Rayleigh-fading channels.

This chapter is organized as follows: in Sect. 10.2 we describe our data model, then provide a brief overview of the blind 2D RAKE receiver in Sect. 10.3. Section 10.4 proposes incremental upgrades of the 2D RAKE that ultimately implement blind quasi-coherent ST-MRC in 2D STAR. In Sect. 10.5, we show the benefits of joint space-time processing in the blind 1D-ST STAR. Section 10.6 considers a last option, namely the use of very weak pilot signals for channel ambiguity estimation and resolution, which implements quasi-blind coherent joint ST-MRC. We finally draw our conclusions
in Sect. 10.7. Simulations of the enhancements at each stage validate the significant performance gains achievable with the 1D-ST STAR receiver.
10.2 Data Model
We consider uplink transmission with M receiving antennas at the base-station. Extension to the downlink with multi-antenna mobile stations follows along similar lines, but the details are left for future consideration. We consider a multipath Rayleigh fading environment with P paths and Doppler spread frequency f_D. For simplicity, we assume perfect synchronization of the multipath delays. Efficient incorporation of accurate timing in CDMA receivers is addressed in [4], [5].

For air-interface transmission we use MPSK modulation (the HSDPA standard [3] suggests the use of high-order modulations such as MPSK and MQAM, see Sects. 10.4.5 and 10.6.6, in order to increase the peak rate), defined by the following constellation set of M symbols:

$$\mathcal{C}_M = \{\dots, c_k, \dots\} = \left\{\dots, e^{j(2k-1-\delta(M-2))\pi/M}, \dots\right\}, \quad k \in \{1, \dots, M\}, \qquad (10.1)$$

where δ(x) = 1 if x = 0, and 0 otherwise. The data symbols b̄_n ∈ C_M are MPSK-modulated at the rate 1/T_s, where T_s is the symbol duration, then differentially encoded as

$$b_n = \bar b_n\, b_{n-1}\, e^{-j(1-\delta(M-2))\pi/M}, \qquad (10.2)$$

and hence ideally differentially decoded as

$$\bar b_n = b_n\, b_{n-1}^*\, e^{j(1-\delta(M-2))\pi/M}. \qquad (10.3)$$

The phase offset e^{−j(1−δ(M−2))π/M} keeps both b_n and b̄_n in the MPSK constellation set C_M in the differential encoding and decoding steps, respectively. We also define the set of constellation-invariant phase rotations,

$$\mathcal{R}_M = \{\dots, r_k, \dots\} \;\text{ s.t. }\; \forall\, k \in \{1, \dots, M\}:\; r_k\, c_k \in \mathcal{C}_M, \qquad (10.4)$$

referred to in the following as the rotation set. For MPSK modulations, the rotation set is given by

$$\mathcal{R}_M = \left\{\dots, e^{j2(k-1)\pi/M}, \dots\right\}, \quad k \in \{1, \dots, M\}. \qquad (10.5)$$

(Note that the MPSK constellation set C_M can be equated to its rotation set R_M, thereby allowing suppression of the phase offset e^{−j(1−δ(M−2))π/M} in both (10.2) and (10.3). For the sake of generality, however, we use the conventional MPSK constellations of (10.1), where C_M ≠ R_M.)

After despreading the data at the receiver, we form the M × 1 multipath despread vector for each path p = 1, ..., P:

$$Z_{p,n} = G_{p,n}\,\varepsilon_{p,n}\,\psi_n\, b_n + N_{p,n} = G_{p,n}\,\zeta_{p,n}\, b_n + N_{p,n} = G_{p,n}\, s_{p,n} + N_{p,n}, \qquad (10.6)$$
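A small sketch of the constellation (10.1) and the differential encoding/decoding pair (10.2)-(10.3); the choice of reference symbol and all names are our illustrative assumptions:

```python
import numpy as np

def dmpsk_constellation(M):
    """MPSK constellation C_M of (10.1)."""
    k = np.arange(1, M + 1)
    delta = 1 if M == 2 else 0                  # delta(M - 2)
    return np.exp(1j * (2 * k - 1 - delta) * np.pi / M)

def diff_encode(data, M):
    """Differential encoding (10.2); reference symbol drawn from C_M."""
    off = np.exp(-1j * (1 - (M == 2)) * np.pi / M)
    b = [dmpsk_constellation(M)[0]]             # illustrative reference symbol
    for d in data:
        b.append(d * b[-1] * off)
    return np.array(b)

def diff_decode(b, M):
    """Ideal differential decoding (10.3)."""
    off = np.exp(1j * (1 - (M == 2)) * np.pi / M)
    return b[1:] * np.conj(b[:-1]) * off
```

A round trip `diff_decode(diff_encode(data, M), M)` recovers `data`, and the phase offset keeps every encoded symbol inside C_M, as stated above.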
Table 10.1. Parameters used in the illustrations.

Parameter | Value                        | Comment
1/Ts      | 19.2 kBaud                   | baud rate
M         | 4                            | number of antennas
P         | 3                            | number of fading paths
ε̄p²      | (0, 0, 0) dB                 | average power profile
fc        | 1.9 GHz                      | carrier frequency
fD        | 9 Hz                         | Doppler spread (i.e., 5 km/h)
fPC       | 1600 Hz                      | power control (PC) frequency
ΔPPC      | ±0.25 dB                     | PC adjustment
BERPC     | 10%                          | PC-command BER
τPC       | 0.625 ms                     | PC transmission delay
μ         | 0.05                         | adaptation step-size
SNRin     | 2 − 10 log10 sin(π/M)² dB    | SNR after despreading
where s_{p,n} = ε_{p,n}ψ_n b_n = ζ_{p,n}b_n is the multipath signal component, ψ_n² is the total received power, and ε_{p,n}² is the normalized power fraction (i.e., Σ_{p=1}^P ε_{p,n}² = 1) of the total power received over the p-th path; ζ_{p,n}² = ε_{p,n}²ψ_n². The M × 1 vector G_{p,n}, with norm √M (the normalization factors of ‖G_{p,n}‖² in space and ε_{p,n}² in time are both included in ψ_n²), denotes the channel vector from the transmitter to the multi-antenna receiver over the p-th multipath. The interference vectors N_{p,n}, mutually independent, have an identical spatially-uncorrelated Gaussian distribution with mean zero and covariance matrix R_N = σ_N² I_M after despreading of the data channel. The resulting input SNR after despreading is SNR_in = ψ̄²/σ_N² per antenna element, where ψ̄² denotes the average total received power. The uncorrelated-Gaussian assumption holds when a large number of users are active. This motivates us to implement coherent maximum ratio combining (MRC) in both space and time, the optimal combiner in this case. Otherwise, for colored noise situations, we may incorporate the optimum or the multi-user combining solutions proposed in [6] and [7], respectively; but that is beyond the scope of this contribution.

The performance of the various receiver structures is verified by simulations at the physical level with the parameters listed in Table 10.1. (The last two parameters of Table 10.1 are only used in Figs. 10.1 to 10.8. The SNR value guarantees the same nominal SER for all modulations; see also the SNR in Fig. 10.8 for 16QAM.) Enhancements in terms of capacity at specified transmission rates are best estimated by system-level simulations [18], [19], but those are beyond the scope of this contribution.
10.3 The Blind 2D RAKE Receiver
To the best of our knowledge, the blind 2D RAKE was the first adaptive array-processing receiver structure proposed for DBPSK-modulated CDMA signals [8]-[11]. This receiver is adaptive in that it carries out iterative channel identification in order to implement noncoherent spatial MRC. The blind channel identification step of the 2D RAKE is explained shortly below. Here, we extend the 2D RAKE to DMPSK-modulated CDMA signals. For now, assume that estimates of the propagation vectors with phase ambiguities are available at each iteration n (i.e., Ĝ_{p,n} ≃ e^{−jφ_{p,n}} G_{p,n}). The 2D RAKE first estimates the multipath signal component s̃_{p,n} over each path p = 1, …, P by noncoherent spatial MRC:

    s̃_{p,n} = Ĝ^H_{p,n} Z_{p,n}/M ≃ e^{jφ_{p,n}} ψ_n ε_{p,n} b̄_n + Ĝ^H_{p,n} N_{p,n}/M ≃ e^{jφ_{p,n}} s_{p,n} + η_{p,n},   (10.7)

where the residual interference η_{p,n} is zero-mean complex Gaussian with variance σ²_N/M. The 2D RAKE thereby implements the so-called "antenna gain" by reducing the level of interference by a factor M at the combiner output. Second, to alleviate the impact of the phase ambiguities φ_{p,n}, the 2D RAKE resorts to noncoherent temporal differential demodulation and EGC of the multipath signal-component estimates in the following decision variable:
    d_n = Σ_{p=1}^{P} s̃_{p,n} s̃*_{p,n−1},   (10.8)

and hence estimates the MPSK symbol b_n from d_n as follows:

    b̂_n = arg min_{c_k ∈ C_M} |d_n e^{j(1−δ(M−2))π/M} − c_k|.   (10.9)

In a channel-coded transmission, the 2D RAKE passes on d_n directly to the channel decoder after appropriate mapping. For power control, the total received power can be estimated by simple smoothing as follows:

    ψ̂²_{n+1} = (1 − α) ψ̂²_n + α Σ_{p=1}^{P} |s̃_{p,n}|²,   (10.10)

where α ≪ 1 is a smoothing factor. An equivalent estimator of the total received power sums up estimates of the received powers over the paths, ζ̂²_{p,n} = ε̂²_{p,n} ψ̂²_n, and allows estimation of the normalized power fractions ε̂²_{p,n} as follows:

    ζ̂²_{p,n+1} = (1 − α) ζ̂²_{p,n} + α |s̃_{p,n}|²,   (10.11)
    ψ̂²_{n+1} = Σ_{p=1}^{P} ζ̂²_{p,n+1},   (10.12)
    ε̂²_{p,n+1} = ζ̂²_{p,n+1} / Σ_{p=1}^{P} ζ̂²_{p,n+1} = ζ̂²_{p,n+1} / ψ̂²_{n+1}.   (10.13)
The normalized power fraction estimates ε̂²_{p,n} are of no immediate use in the blind 2D RAKE. However, they will be exploited later to significantly enhance 2D space-time receivers. As mentioned above, estimates of the propagation vectors Ĝ_{p,n} are required to implement the noncoherent spatial MRC of (10.7) in the 2D RAKE. Exploiting the fact that the interference vector in (10.6) is an uncorrelated white noise vector, the propagation vector over each path G_{p,n} can be identified as the principal eigenvector of R_{Z_p}, the correlation matrix of the despread vector Z_{p,n} over the p-th path:

    R_{Z_p} = E[Z_{p,n} Z^H_{p,n}] = ψ̄² ε̄²_p G_p G^H_p + σ²_N I_M = ψ̄² ε̄²_p (e^{−jφ_p} G_p)(e^{−jφ_p} G_p)^H + σ²_N I_M,   (10.14)

where ψ̄² and ε̄²_p are the average total received power and the multipath power fraction, respectively. In practice, each vector G_{p,n} is estimated within an unknown phase ambiguity φ_{p,n} by an iterative principal component analysis (PCA) method based on a singular- or eigenvalue decomposition of the sample correlation matrix R̂_{Z_p} [8]-[11]. However, in the next section we show that this iterative PCA method can be replaced by an adaptive channel identification technique that is less complex and performs better. In summary, the blind 2D RAKE [8]-[11] implements noncoherent spatial MRC and achieves an antenna gain by reducing the interference power by a factor equal to the number of antennas, thereby improving capacity significantly. Additional enhancements may be introduced until the noncoherent differential temporal demodulation and EGC step of (10.9) is completely replaced by quasi-coherent (i.e., within a constellation-invariant phase rotation) MRC in both space and time, without a pilot.
10.4 The Blind 2D STAR
We propose incremental upgrades of the blind 2D RAKE that ultimately lead to a very efficient blind quasi-coherent (i.e., within a constellation-invariant phase rotation) ST-MRC combiner. The resulting improvement in the combining operation offers about 2 dB of SNR gain with all tested MPSK modulations.
10.4.1 Decision-Feedback Identification (DFI)
We introduce an adaptive channel identification procedure that offers a unifying framework in terms of a common structure called the spatio-temporal array-receiver (STAR) [4], [5] by equipping various combiners of DMPSK-modulated signals with the same channel identification engine. Starting with the conventional 2D RAKE, we consider successive simple modifications to the feedback signal and obtain incremental improvements until we reach a blind quasi-coherent ST-MRC combiner.
This procedure, referred to as decision-feedback identification (DFI) [4], [17], updates the channel estimate⁵ as follows:

    Ĝ_{p,n+1} = Ĝ_{p,n} + μ_p (Z_{p,n} − Ĝ_{p,n} ŝ_{p,n}) ŝ*_{p,n},   (10.15)

where μ_p is an adaptation step-size, and ŝ_{p,n} is a feedback signal providing a selected estimate of the signal component. We show next how improved choices of the feedback signal lead to enhanced versions of the 2D STAR receiver.
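In code, the DFI recursion (10.15) is a one-line stochastic-gradient update per path. The sketch below (illustrative names, Python/NumPy) also shows the optional renormalization of footnote 5; the four 2D STAR variants that follow differ only in how the feedback signal s_hat is built:

    import numpy as np

    def dfi_update(G_hat, Z_p, s_hat, mu):
        # (10.15): G_hat <- G_hat + mu (Z_p - G_hat s_hat) s_hat^*
        G_hat = G_hat + mu * (Z_p - G_hat * s_hat) * np.conj(s_hat)
        # force ||G_hat|| = sqrt(M) after each update for stability (footnote 5)
        M = G_hat.size
        return G_hat * np.sqrt(M) / np.linalg.norm(G_hat)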
10.4.2 Parallel and Soft DFI
In a first version of 2D STAR, we implement parallel and soft DFI in that 1) the DFI procedures of (10.15) over the multipaths p = 1, …, P are excited with independent feedback signals (i.e., parallel), and 2) the feedback signals are assigned the soft output values of the noncoherent MRC combiners in (10.7) (i.e., soft):

    ŝ_{p,n} = s̃_{p,n}.   (10.16)
Substituting the expression (10.7) of s̃_{p,n} for the feedback signal ŝ_{p,n}, the DFI procedure of (10.15) can be rewritten as:

    Ĝ_{p,n+1} = Ĝ_{p,n} + μ_p (Z_{p,n} − Ĝ_{p,n} Ĝ^H_{p,n} Z_{p,n}/M) Z^H_{p,n} Ĝ_{p,n}/M
              = Ĝ_{p,n} + μ_p (I_M − Ĝ_{p,n} Ĝ^H_{p,n}/M) Z_{p,n} Z^H_{p,n} Ĝ_{p,n}/M
              = Ĝ_{p,n} + μ_p Π_{p,n} Ṙ_{Z_p} Ĝ_{p,n}/M,   (10.17)
and its adaptation gradient now interprets as a projector Π_{p,n}, orthogonal to Ĝ_{p,n}, applied to Ṙ_{Z_p}, the instantaneous estimate of the correlation matrix R_{Z_p}. On average, adaptation errors are minimized when the projector Π_{p,n} suppresses the dimension of R_{Z_p} with the highest energy, i.e., its principal eigenvector e^{−jφ_{p,n}} G_{p,n} [note that Π_{p,n} R_{Z_p} G_{p,n} = Π_{p,n} × (λ G_{p,n}) = 0 if Ĝ_{p,n} = λ′ G_{p,n}, see (10.14)]. The DFI procedure is therefore an adaptive PCA implementation. Hence, after convergence we have Ĝ_{p,n} ≃ e^{−jφ_{p,n}} G_{p,n} and ŝ_{p,n} = s̃_{p,n} ≃ e^{jφ_{p,n}} s_{p,n} + η_{p,n} [see (10.7)]. For illustration purposes, we define the channel ambiguity over each path a_{p,n} and the centroid channel ambiguity a_n as:

    a_{p,n} = ρ_{p,n} e^{jφ_{p,n}} = Ĝ^H_{p,n} G_{p,n}/M  for p = 1, …, P,   (10.18)
    a_n = ρ_n e^{jφ_n} = Σ_{p=1}^{P} ε̂_{p,n} ε_{p,n} Ĝ^H_{p,n} G_{p,n}/M = Σ_{p=1}^{P} ε̂_{p,n} ε_{p,n} a_{p,n}.   (10.19)
⁵ Preferably, ||Ĝ_{p,n}|| is forced to √M after each DFI update for increased stability (we do so in this work), although normalization of ||Ĝ_{p,n}|| to √M is asymptotically guaranteed after convergence.
[Fig. 10.1 (four polar plots, not reproduced): Channel ambiguity with BPSK-modulated parallel/soft DFI over (a): 1st path (i.e., a_{1,n}), (b): 2nd path (i.e., a_{2,n}), (c): 3rd path (i.e., a_{3,n}), (d): centroid channel ambiguity (i.e., a_n). Constellation-invariant rotation points (i.e., r_k ∈ R_M) are denoted by black circles and initial/final channel ambiguities by black/white squares.]
Figure 10.1 shows that a_{p,n}, with parallel/soft DFI, follows the shortest path from the initial position a_{p,0} towards the unit circle in the learning phase, then remains in its vicinity after convergence (i.e., ρ_{p,n} = |a_{p,n}| ≃ 1 and a_{p,n} ≃ e^{jφ_{p,n}}), except during deep fades [e.g., see Figs. 10.1(b) and 10.1(c)]. With any random initialization Ĝ_{p,0} different from the null vector (here with norm √M), Ĝ_{p,n} indeed converges to the corresponding propagation vector G_{p,n} within a phase ambiguity φ_{p,n} (i.e., Ĝ_{p,n} ≃ e^{−jφ_{p,n}} G_{p,n}). The centroid channel ambiguity a_n illustrates the difference between this version of the 2D STAR and a coherent ST-MRC combiner. As shown in Fig. 10.1, the phase ambiguities a_{p,n} are mutually independent and combine in a_n [see Fig. 10.1(d)] in a destructive manner with parallel/soft DFI, hence the need for noncoherent temporal differential demodulation and EGC with (10.8) and (10.9).
In fact, the 2D STAR version with parallel/soft DFI readily implements the blind 2D RAKE receiver discussed previously in Sect. 10.3. However, its DFI procedure offers an adaptive PCA implementation that is much more efficient than the iterative PCA method considered in the blind 2D RAKE [8]-[11]. It requires a complexity order per symbol that is only linear in the number of antennas M, and it tracks time-varying channels faster due to its LMS-type nature (ŝ_{p,n} acts as a reference signal). Furthermore, the iterative PCA method in [8]-[11] is not decision-directed and results in phase ambiguities that are almost random from one block iteration to another. With the DFI procedure, the phase ambiguities a_{p,n} (or φ_{p,n}) can be exploited as "controllable" degrees of freedom whose convergence to a common constellation-invariant phase rotation (i.e., r_k ∈ R_M) can be forced by both common and hard signal feedback. In the following, we explain hard then common DFI, as opposed to soft and parallel DFI, respectively, then show how the combined use of both common and hard DFI enables the implementation of blind quasi-coherent ST-MRC.
10.4.3 Parallel and Hard DFI
In a second version of 2D STAR, we implement parallel and hard DFI in that the feedback signals, still independent (i.e., parallel), now incorporate tentative estimates of the transmitted symbol (i.e., hard) as follows⁶:

    ŝ_{p,n} = ζ̂_{p,n} b̄̂_{p,n} = ε̂_{p,n} ψ̂_n b̄̂_{p,n},   (10.20)

where b̄̂_{p,n} is the tentative symbol estimate over the p-th path given by⁷:

    b̄̂_{p,n} = Hard{s̃_{p,n}} = arg min_{c_k ∈ C_M} |s̃_{p,n} − c_k|.   (10.21)
Previously we have shown that s̃_{p,n} ≃ e^{jφ_{p,n}} s_{p,n} + η_{p,n} [see (10.7)] with the DFI procedure. Hence, neglecting momentarily η_{p,n} in s̃_{p,n}, we have:

    b̄̂_{p,n} ≃ arg min_{c_k ∈ C_M} |e^{jφ_{p,n}} s_{p,n} − c_k| = arg min_{c_k ∈ C_M} |e^{jφ_{p,n}} b̄_n − c_k|.   (10.22)
⁶ An alternative hard feedback signal ŝ_{p,n} = Real{s̃_{p,n} b̄̂*_{p,n}/|b̄̂_{p,n}|} b̄̂_{p,n}/|b̄̂_{p,n}|, which performs nearly the same in the DFI procedure of (10.15), finds more efficient use in power estimation (see (10.25)). With MPSK modulations, note that the normalization of b̄̂_{p,n} by |b̄̂_{p,n}| is not needed [in (10.25) as well].
⁷ For non-constant-modulus modulations such as MQAM, the minimum distance from the constellation C_M should be found over |s̃_{p,n} − ζ̂_{p,n} c_k| in (10.21).
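A sketch of the parallel/hard feedback of (10.20)-(10.21), together with the improved power smoothing of (10.25) introduced below; as in the other sketches, the names are ours:

    import numpy as np

    def hard_feedback(s_tilde_p, zeta_hat_p, C):
        # (10.21): tentative symbol by minimum distance to the constellation C_M
        b_tent = C[np.argmin(np.abs(s_tilde_p - C))]
        # (10.20): rescale by the estimated path amplitude
        return zeta_hat_p * b_tent, b_tent

    def path_power_update(zeta2_p, s_tilde_p, b_tent, alpha=0.05):
        # (10.25): projecting on the tentative symbol halves the residual noise variance
        proj = np.real(s_tilde_p * np.conj(b_tent) / np.abs(b_tent))
        return (1 - alpha) * zeta2_p + alpha * proj ** 2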
[Fig. 10.2 (four polar plots, not reproduced): Same as Fig. 10.1 with parallel/hard DFI.]
Hard decision above exploits the phase ambiguity a_{p,n} ≃ e^{jφ_{p,n}} as a degree of freedom with φ_{p,n} ∈ [0, 2π) and hence restricts its realization to the limited constellation-invariant rotation points r_k ∈ R_M that minimize the distance between the rotated symbol e^{jφ_{p,n}} b̄_n and the constellation C_M. Indeed, we have r_k b̄_n ∈ C_M and hence the minimum distance ideally reduces to zero if we neglect the residual noise contributions in s̃_{p,n}. Figure 10.2 shows indeed that a_{p,n}, with parallel/hard DFI, basically follows the shortest path to the closest rotation point r_k ∈ R_M from the initial position a_{p,0} in the learning phase, then remains in its vicinity after convergence. Hence, the DFI procedure converges with high probability⁸ to the following ambiguity over the p-th path:

    a_{p,n} ≃ e^{jφ_{p,n}} ≃ r_{k(p)} = arg min_{r_k ∈ R_M} |a_{p,0} − r_k|.   (10.23)
⁸ Higher perturbations in the DFI procedure due to faster channel variations, higher noise levels, or higher adaptation step-size values may prevent a_{p,n} from converging to the closest rotation point in R_M, only to converge to another point in R_M.
Deep fades may sporadically force a_{p,n} away from r_{k(p)} [see Fig. 10.2(c)]. However, the amplitude attenuations away from the unit circle are less significant than those observed with parallel/soft DFI (see Fig. 10.2). They suggest that hard DFI has better channel-tracking capabilities than soft DFI, by "anchoring" the phase ambiguities to constellation-invariant phase rotations. With parallel/hard DFI, the ambiguities a_{p,n} ≃ r_{k(p)} are mutually independent and still combine in a_n in a destructive manner, as shown in Fig. 10.1(d), hence the need again for noncoherent temporal differential demodulation and EGC with (10.8) and (10.9). For BPSK modulations, however, hard DFI has an advantage over soft DFI. The constellation is one-dimensional in the complex plane and the desired signals always lie on the real axis with hard DFI (see Fig. 10.2). Hence for BPSK-modulated hard DFI, the noncoherent spatial MRC of (10.7) can be replaced by quasi-coherent (i.e., within a sign ambiguity) spatial MRC as follows:

    s̃_{p,n} = Real{Ĝ^H_{p,n} Z_{p,n}/M} ≃ ±s_{p,n} + Real{η_{p,n}}.   (10.24)

This further reduces the residual noise variance by a factor of 2 and thereby reduces the detection errors (see Sect. 10.4.6) in both (10.8) and (10.9). Reduction of the residual noise power by a factor of 2 can also be exploited to enhance power estimation with hard DFI. However, this improvement can be achieved for both BPSK and higher-order modulations by rewriting (10.11) as:

    ζ̂²_{p,n+1} = (1 − α) ζ̂²_{p,n} + α [Real{s̃_{p,n} b̄̂*_{p,n}/|b̄̂_{p,n}|}]².   (10.25)
Projection of s̃_{p,n} over the normalized tentative symbol estimate b̄̂_{p,n}/|b̄̂_{p,n}| indeed reduces the variance of the residual noise by half and improves estimation⁹ of ζ̂²_{p,n} above, as well as of ψ̂²_n and ε̂²_{p,n} in (10.12) and (10.13), respectively.
10.4.4 Common and Soft DFI
In a third version of 2D STAR, we implement common and soft DFI in that the feedback signals are based on weighted replicas of the same (i.e., common) soft output value s̃_n of noncoherent ST-MRC (i.e., soft):

    ŝ_{p,n} = ε̂_{p,n} s̃_n,   (10.26)
⁹ Projection of s̃_{p,n} over the orthogonal to b̄̂_{p,n}/|b̄̂_{p,n}|, given by Im{s̃_{p,n} b̄̂*_{p,n}/|b̄̂_{p,n}|} [5], [18], enables estimation of the residual noise variance and its subtraction from ζ̂²_{p,n} for an even more enhanced power estimation. For simplicity, this option will not be pursued further.
[Fig. 10.3 (four polar plots, not reproduced): Same as Fig. 10.1 with common/soft DFI.]
where s̃_n is simply obtained by noncoherent temporal MRC¹⁰ of the soft outputs s̃_{p,n} of the noncoherent spatial MRC in (10.7):

    s̃_n = Σ_{p=1}^{P} ε̂_{p,n} s̃_{p,n} = Σ_{p=1}^{P} ε̂_{p,n} Ĝ^H_{p,n} Z_{p,n}/M.   (10.27)
Exploiting again the expression s̃_{p,n} ≃ e^{jφ_{p,n}} s_{p,n} + η_{p,n} [see (10.7)] established with the DFI procedure, as well as (10.18) and (10.19), we have:

    s̃_n = Σ_{p=1}^{P} ε̂_{p,n} e^{jφ_{p,n}} s_{p,n} + Σ_{p=1}^{P} ε̂_{p,n} η_{p,n}
        = (Σ_{p=1}^{P} ε̂_{p,n} ε_{p,n} e^{jφ_{p,n}}) ψ_n b̄_n + η_n = a_n s_n + η_n = ρ_n e^{jφ_n} s_n + η_n,   (10.28)
¹⁰ Note that the estimates ε̂²_{p,n} from (10.13), required for temporal MRC (also used in hard DFI in (10.20)) and of no use in the 2D RAKE, definitely enable significant enhancements of 2D receivers in the following.
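Common/soft DFI only changes how the feedback is assembled: a single ST-MRC output (10.27) is shared by all paths through (10.26). A minimal sketch, with illustrative names:

    import numpy as np

    def common_soft_feedback(Z, G_hat, eps_hat):
        # spatial MRC per path (10.7), then noncoherent temporal MRC (10.27)
        s_tilde_p = np.sum(np.conj(G_hat) * Z, axis=1) / Z.shape[1]
        s_tilde = np.sum(eps_hat * s_tilde_p)
        # (10.26): each path is fed back a weighted replica of the common output
        return eps_hat * s_tilde, s_tilde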
where the residual output noise η_n is Gaussian with variance σ²_N/M, since the ε̂²_{p,n} sum up to 1. Common DFI exploits the phase ambiguities a_{p,n} ≃ e^{jφ_{p,n}} as degrees of freedom to tie their values after convergence to a unique phase ambiguity a_n ≃ e^{jφ_n} (i.e., ρ_n = |a_n| ≃ 1 and a_{p,n} ≃ e^{jφ_{p,n}} ≃ a_n ≃ e^{jφ_n} after convergence), in order to maximize the energy of both s̃_n and the feedback signal ŝ_{p,n} by constructive combining (see another interpretation later in Sect. 10.5.3). Hence we have s̃_n ≃ e^{jφ_n} s_n + η_n, s̃_{p,n} ≃ e^{jφ_n} s_{p,n} + η_{p,n}, and Ĝ_{p,n} ≃ e^{−jφ_n} G_{p,n} after convergence.

Figure 10.3 shows indeed that the centroid ambiguity a_n [see Fig. 10.3(d)], with common/soft DFI, follows the shortest path from the initial position a_0 towards the unit circle in the learning phase, then remains in its vicinity after convergence. Phase deviations around the unit circle are due to the time variations of the channel realizations. After convergence, the multipath ambiguities a_{p,n} are tied together to a_n (see the final values of a_{p,n} and a_n nearly at the same position in Fig. 10.3) and hence combine constructively in a_n. Amplitude attenuations of a_n away from the unit circle are less significant than those of a_{p,n}, themselves weaker than those observed in Fig. 10.1. They suggest that common DFI has better tracking capabilities than parallel DFI, by exploiting a "kind" of multipath diversity in the feedback signals. Noncoherent temporal MRC in (10.27), a priori destructive, forces the phase ambiguities a_{p,n} to a common centroid ambiguity a_n after convergence and hence becomes constructive a posteriori. With common DFI, we therefore replace the noncoherent temporal differential demodulation and EGC in the decision variable of (10.8) by noncoherent ST-MRC in (10.27) followed by differential demodulation:

    d_n = s̃_n s̃*_{n−1},   (10.29)
to reduce detection errors (see Sect. 10.4.6) over the symbol estimate b̂_n in (10.9) and possibly enhance data channel-decoding in the case of channel-coded transmissions. Common DFI has an additional benefit. It can exploit the soft output s̃_n of the noncoherent ST-MRC of (10.27) to directly estimate the total received power as follows:

    ψ̂²_{n+1} = (1 − α) ψ̂²_n + α |s̃_n|².   (10.30)
Compared to the two equivalent total-power estimates of (10.10) or (10.12), which sum P squared terms (i.e., temporal EGC), the new estimate sums only one squared term (i.e., noncoherent ST-MRC) and therefore improves power control through a weaker variance due to residual noise [20].
10.4.5 Common and Hard DFI
[Fig. 10.4 (four polar plots, not reproduced): Same as Fig. 10.1 with common/hard DFI.]

In a fourth and last version of 2D STAR, we implement common and hard DFI in that the feedback signals enclose weighted replicas of the same (i.e., common) tentative symbol estimate (i.e., hard)¹¹:

    ŝ_{p,n} = ζ̂_{p,n} b̄̂_n = ε̂_{p,n} ψ̂_n b̄̂_n,   (10.31)
where the tentative symbol estimate b̄̂_n is obtained by hard¹² decision over the soft output s̃_n of the noncoherent ST-MRC combiner in (10.27):

    b̄̂_n = Hard{s̃_n} = arg min_{c_k ∈ C_M} |s̃_n − c_k|.   (10.32)
12
! ! ! ! An alternative hard feedback signal sˆp,n = !Real s˜p,nˆb∗n /|ˆbn | ! ˆbn /|ˆbn | performs nearly the same in the DFI procedure (see footnote 6). For non-constant-modulus modulations such as MQAM, it is more accurate to sn − ψˆn ck |. However, power find minimum distance from constellation CM over |˜ control attempts to equalize ψn2 to 1 and hence the rule in (10.32) holds, unlike (10.21) (see footnote 7).
10
Adaptive Space-Time Processing for Wireless CDMA
297
(b)
(a) 90
90
1.5 60
120
2 60
120 1.5
1 30
150
150
30
1
0.5 0.5
180
0
210
330
240
180
0
210
330
240
300
300 270
270
Fig. 10.5. Realizations of the noncoherent ST-MRC soft output s˜n in (10.27) for (a): common/soft DFI, (b): common/hard DFI. Realizations marked with a small square correspond to the nominal transmitted symbol among the constellation points ck ∈ CM marked with a large square.
With common/soft DFI, we have just shown that s̃_n ≃ e^{jφ_n} s_n + η_n. Hence, neglecting momentarily η_n in s̃_n, we have:

    b̄̂_n ≃ arg min_{c_k ∈ C_M} |e^{jφ_n} s_n − c_k| = arg min_{c_k ∈ C_M} |e^{jφ_n} b̄_n − c_k|.   (10.33)
Hard decision above exploits the centroid phase ambiguity a_n ≃ e^{jφ_n} as a degree of freedom with φ_n ∈ [0, 2π) and hence restricts its realization to the limited constellation-invariant rotation points r_k ∈ R_M that minimize the distance between the rotated symbol e^{jφ_n} b̄_n and the constellation C_M (see the similar discussion below (10.22)). Simultaneously, common DFI ties all multipath phase ambiguities a_{p,n} to a_n, and hence a_{p,n} ≃ a_n ∈ R_M after convergence. Figure 10.4 shows indeed that the centroid ambiguity a_n, with common/hard DFI, follows the shortest path from the initial position a_0 to the closest rotation point r_k ∈ R_M, then remains in its vicinity after convergence. The figure also shows that the multipath phase ambiguities are bound to converge to the same rotation point. Hence, the DFI procedure converges with high probability (see footnote 8) to the following phase ambiguity:

    a_n ≃ a_{p,n} ≃ e^{jφ_n} ≃ r_k = arg min_{r_k ∈ R_M} |a_0 − r_k|.   (10.34)
Amplitude attenuations of both ap,n and an are significantly weaker than those observed in Figs. 10.1, 10.2, and 10.3 with the previous DFI versions. They suggest that common and hard DFI has better tracking capabilities than parallel and/or soft DFI, 1) by exploiting a “kind” of multipath diversity in the feedback signals, and 2) by “anchoring” all phase ambiguities to a common constellation-invariant phase rotation.
[Fig. 10.6 (two plots, not reproduced): (a): Centroid phase ambiguity a_n with QPSK-modulated common/hard DFI (see the caption of Fig. 10.1 for additional explanations). (b): Realizations of the noncoherent ST-MRC soft output s̃_n in (10.27) with QPSK-modulated common/hard DFI (see the caption of Fig. 10.5 for additional explanations).]
[Fig. 10.7 (two plots, not reproduced): Same as Fig. 10.6 with 8PSK modulation instead.]
Figure 10.5 shows that common/soft DFI results in a continuous deviation of the ST-MRC output from the constellation points [see Fig. 10.5(a)], while common/hard DFI rotates it back around the nominal constellation points [see Fig. 10.5(b)], within a constellation-invariant phase rotation r_k [here r_k = +1, only because a_0 was closer to +1 in Fig. 10.4(d), see (10.34)]. The soft output s̃_n of ST-MRC, a priori noncoherent with common/soft DFI, becomes a posteriori quasi-coherent (i.e., within a constellation-invariant phase rotation) after convergence with common/hard DFI. This useful "anchoring" mechanism, which casts the soft output of ST-MRC around the nominal positions of the constellation points, has been illustrated so far with BPSK but holds very well with higher-order modulations.
[Fig. 10.8 (two plots, not reproduced): Same as Fig. 10.6 with 16QAM modulation instead (normalized constellation with average unit power and SNR_in = 12 dB).]
With QPSK-modulated common/hard DFI, Fig. 10.6 shows that the centroid phase ambiguity a_n converges to the closest rotation point r_k ∈ R_M from a_0, i.e., r_k = −j [see Fig. 10.6(a)], resulting in a rotation of the ST-MRC output by −π/2 from the nominal constellation points [see Fig. 10.6(b)]. Similar observations can be made from Fig. 10.7 for 8PSK-modulated common/hard DFI, where the realizations are rotated by −π/4 from their nominal positions (i.e., a_0 is closer to r_k = (1 − j)/√2). In fact, this useful "anchoring" mechanism of common/hard DFI holds even for MQAM modulations, as illustrated in Fig. 10.8 with 16QAM. Due to its geometry, 16QAM has the same set R_M of rotation points r_k as QPSK [see Fig. 10.6(a)]. Hence the centroid phase ambiguity a_n, which starts from the same initial position a_0, also converges to r_k = −j as the closest rotation point [see Fig. 10.8(a)], thereby resulting in a rotation of the ST-MRC output by −π/2 from the nominal constellation points [see Fig. 10.8(b)]. With standard MQAM modulations, however, there is no trivial differential coding scheme¹³ to alleviate a channel phase ambiguity, even if the phase rotation is constellation-invariant. The phase "anchoring" mechanism of MQAM-modulated common/hard DFI finds particularly good application later with the pilot-assisted versions of STAR (see Sects. 10.6 and 10.6.6).

With MPSK modulations, differential coding at transmission enables detection of symbols with a channel phase ambiguity. We resolve it by noncoherent ST-MRC and differential demodulation at the receiver when equipped with common/soft DFI [see (10.27) and (10.29)]. With common/hard DFI, however, the soft output s̃_n of ST-MRC becomes quasi-coherent (i.e., within a constellation-invariant phase rotation) and hence enables reliable estimation of the transmitted DMPSK symbol b_n from the tentative symbol b̄̂_n ≃ r_k b̄_n of (10.32), within a rotation r_k ∈ R_M.

¹³ Differential detection is possible, for instance, with a star 16QAM constellation, a combination of two 8PSK constellations with different amplitudes (e.g., see [22]).
Therefore, power estimation in (10.30) can be replaced by:

    ψ̂²_{n+1} = (1 − α) ψ̂²_n + α [Real{s̃_n b̄̂*_n/|b̄̂_n|}]²,   (10.35)

for improved power estimation and control¹⁴. Additionally, instead of the differential demodulation in (10.29) and the hard decision in (10.9), differential decoding of b̄̂_n enables simple estimation of the MPSK symbol b_n as follows:

    b̂_n = b̄̂_n b̄̂*_{n−1} e^{j(1−δ(M−2))π/M}.   (10.36)
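The quasi-coherent power update (10.35) and the differential decoding (10.36) translate directly into code; the constellation-invariant rotation r_k cancels in the product of two consecutive tentative symbols, leaving an estimate of the data symbol. A sketch (our names):

    import numpy as np

    def power_and_decode(psi2, s_tilde, b_tent, b_tent_prev, M_psk, alpha=0.05):
        proj = np.real(s_tilde * np.conj(b_tent) / np.abs(b_tent))
        psi2 = (1 - alpha) * psi2 + alpha * proj ** 2                    # (10.35)
        offset = np.exp(1j * (1 - (M_psk == 2)) * np.pi / M_psk)
        b_hat = b_tent * np.conj(b_tent_prev) * offset                   # (10.36)
        return psi2, b_hat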
Detection errors over r_k b̄_n in b̄̂_n of (10.32) (i.e., Prob[b̄̂_n ≠ r_k b̄_n]) are those of coherent ST-MRC, much fewer than those resulting from noncoherent ST-MRC with common/soft DFI. Differential decoding as in (10.36) doubles these errors, yet common/hard DFI significantly outperforms common/soft DFI in symbol detection (see Sect. 10.4.6). For channel-coded transmissions with soft channel decoding, however, we need to pass on the soft decision variable d_n of (10.29) to the decoder. In this case, the performance gains after channel decoding are theoretically those of common/soft DFI. Recent simulations suggest, however, that the "anchoring" mechanism of common/hard DFI enables a noticeable improvement by limiting the phase deviations in d_n. This issue is, however, outside the scope of this contribution. In the following, we compare the SER (symbol error rate) performance of the four DFI versions discussed previously.
10.4.6 Performance Gains of the DFI Versions
Simple analytical expressions for the SER (i.e., Prob[b̂_n ≠ b_n]) can be derived only for BPSK with differential demodulation and for MPSK with differential decoding, both over Gaussian channels (see general closed-form expressions in [23], [24], for instance). Here, in Fig. 10.9, we simply plot the SER curves obtained by simulations for each DFI version when modulated with BPSK, QPSK, and 8PSK. These curves confirm the performance gains expected from the successive upgrades of the DFI procedure in 2D STAR. As shown in Fig. 10.9, the 2D STAR version with parallel/soft DFI, which most closely approximates the blind 2D RAKE receiver (see Sects. 10.3 and 10.4.2), performs worst. Parallel/hard DFI outperforms parallel/soft DFI only with BPSK, where the noncoherent spatial MRC in (10.7) is replaced by the quasi-coherent spatial MRC in (10.24). Common/soft DFI always outperforms parallel/soft DFI, by replacing the noncoherent temporal differential demodulation and EGC in (10.8) with noncoherent temporal MRC in (10.27) and differential demodulation in (10.29).
¹⁴ Similarly to (10.25), projection of s̃_n over 1) the normalized tentative symbol estimate b̄̂_n/|b̄̂_n| and 2) its orthogonal, given by Im{s̃_n b̄̂*_n/|b̄̂_n|} [5], [18], 1) reduces the variance of the residual noise by half in (10.35), and 2) enables estimation of this variance and then its subtraction from ψ̂²_n for an even more enhanced power estimation (see footnote 9).
[Fig. 10.9 (SER plot, not reproduced): SER vs. SNR_in in dB of blind 2D STAR, MPSK-modulated, with the various DFI versions using the optimum step-size μ_{p,opt}(ε̄²_p ψ̄², σ²_N, f_D T_s) in (10.52) (see Sect. 10.5.4).]
The SNR gain, however, shrinks steadily as the modulation is changed from BPSK to higher order. Common/soft DFI even outperforms parallel/hard DFI with BPSK, suggesting that the gains of noncoherent temporal MRC over EGC are more significant than those of quasi-coherent over noncoherent spatial MRC. Common/hard DFI outperforms all other DFI versions by implementing quasi-coherent ST-MRC. Regardless of the modulation employed, it offers an SNR gain of about 2 dB over the worst DFI version, i.e., parallel/soft DFI, which most closely approximates the blind 2D RAKE receiver. Overall, we have been able to upgrade the blind 2D RAKE by introducing incremental improvements to the combining operation, thereby enabling the blind 2D STAR to gain about 2 dB in SNR over blind 2D RAKE-type receivers with all tested MPSK modulations.
10.5 The Blind 1D-ST STAR
So far, we have exploited the flexibility of the DFI procedure in a 2D structured receiver, i.e., sequential processing of the diversity fingers in two dimensions: first in space over antennas, then in time over multipaths. Ultimately, we arrive at a quasi-coherent ST-MRC combiner in 2D STAR with common and hard DFI. Yet by exploiting the common DFI version, further improvements are achievable by 1) identifying then 2) combining all diversity fingers jointly in space and time with 1D-ST (space-time) structured versions of STAR. Of particular interest, the 1D-ST counterpart of the 2D structured STAR with common and hard DFI implements quasi-coherent joint ST-MRC with reduced complexity, increased robustness to changing propagation conditions, and increased accuracy and speed of channel estimation.
10.5.1 1D-ST Structured Data Model
Let us simply align the multipath despread vectors Z_{p,n} for p = 1, …, P in an MP × 1 spatio-temporal despread vector [see (10.6)]:

    Z_n = [Z^T_{1,n}, …, Z^T_{p,n}, …, Z^T_{P,n}]^T = [ε_{1,n} G^T_{1,n}, …, ε_{p,n} G^T_{p,n}, …, ε_{P,n} G^T_{P,n}]^T ψ_n b̄_n + [N^T_{1,n}, …, N^T_{p,n}, …, N^T_{P,n}]^T = H_n s_n + N_n,   (10.37)

where s_n = ψ_n b̄_n is the signal component, H_n is the MP × 1 spatio-temporal propagation vector, and N_n is the MP × 1 zero-mean uncorrelated spatio-temporal Gaussian noise vector with covariance matrix R_N = σ²_N I_{MP}. This 1D-ST structured data model is actually a simplification of the post-correlation model (PCM) introduced in [4] to efficiently address the more general issue of joint space-time processing with simultaneous multipath time-delay synchronization in STAR. Exploitation of a similar 1D-ST structured data model before despreading later allowed the development of very efficient multi-user upgrades of STAR by simultaneous joint space-time signal combining, channel identification, time-delay synchronization, and interference suppression [7]. To our knowledge, the advantages of simultaneous joint space-time processing operations were not recognized previously and were not pursued to further integrate the spatial dimension made available by antenna arrays (see the discussion in [7] and references therein). Below we exhibit the advantages of joint space-time signal combining and channel identification using the simplified 1D-ST structured data model above.
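The 1D-ST rearrangement of (10.37) is a simple stacking operation; an illustrative helper (our names):

    import numpy as np

    def stack_space_time(Z, G, eps):
        # Z, G: (P, M) arrays; eps: (P,) amplitude fractions with sum(eps**2) = 1
        Z_n = Z.reshape(-1)                    # MP x 1 despread vector [Z_1; ...; Z_P]
        H_n = (eps[:, None] * G).reshape(-1)   # MP x 1 propagation vector of (10.37)
        return Z_n, H_n

Note that ||H_n||² = Σ_p ε²_p ||G_p||² = M, consistent with the normalization used throughout.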
10.5.2 2D STAR with Common DFI Reinterpreted
Common DFI enables exploitation of the 1D-ST data model of (10.37) in 1) reformulating the two sequential spatial and temporal processing steps of 2D STAR in a compact form, 2) reinterpreting the resulting compact form as a joint space-time processing step, and 3) reimplementing this joint space-time processing step in a more efficient 1D-ST structure of STAR.
Using the expressions for Z_{p,n}, a_{p,n}, and a_n in (10.6), (10.18), and (10.19), respectively, we further analyze the expression for the soft output s̃_n of noncoherent ST-MRC in (10.27). The first three lines of the following equation develop s̃_n path by path (the "upward developments"), while the last line regroups it in the 1D-ST notation of (10.37) (the "downward developments"):

    s̃_n = Σ_{p=1}^{P} ε̂_{p,n} Ĝ^H_{p,n} Z_{p,n}/M
        = (Σ_{p=1}^{P} ε̂_{p,n} ε_{p,n} Ĝ^H_{p,n} G_{p,n}/M) s_n + Σ_{p=1}^{P} ε̂_{p,n} Ĝ^H_{p,n} N_{p,n}/M
        = (Σ_{p=1}^{P} ε̂_{p,n} ε_{p,n} a_{p,n}) s_n + Σ_{p=1}^{P} ε̂_{p,n} η_{p,n} = a_n s_n + η_n
        = Ĥ^H_n Z_n/M = (Ĥ^H_n H_n/M) s_n + Ĥ^H_n N_n/M.   (10.38)
In the downward developments, however, we can reformulate s̃_n as the soft output of the space-time propagation vector estimate Ĥ_n combining the space-time despread vector Z_n by ST-MRC, noncoherent or quasi-coherent depending on whether soft or hard DFI is used jointly with common DFI. Exploiting the 1D-ST structured data model of (10.37) and equating the developments above allows reinterpretation of the two sequential spatial-then-temporal MRC combining steps of (10.7) and (10.27) (i.e., 2D structured) as a single joint ST-MRC combining step (i.e., 1D-ST structured). Additionally, the centroid channel ambiguity a_n of (10.19) can be considered as the normalized scalar product Ĥ^H_n H_n/M that measures the total distortion between the space-time propagation vector H_n and its estimate Ĥ_n. The above reinterpretations of the processing steps in 2D STAR with common DFI along the 1D-ST data model call for directly estimating the space-time propagation vector H_n in a single joint space-time identification step using 1D-ST structured DFI.
10.5.3 1D-ST Structured DFI
Driven by the same channel estimates as 2D STAR with common DFI, joint space-time processing suggests merely simple rearrangements of the data structure processed sequentially in space then in time. The 2D and 1D-ST structured STAR versions would then be identical and would have the same performance. Actually, the benefits of joint spatio-temporal processing go beyond compact spatio-temporal data modeling when they reach the steps of signal combining and channel identification. Indeed, joint space-time processing replaces the P disjoint DFI procedures of (10.15) for all paths, referred to as 2D structured DFI, by a single joint spatio-temporal DFI update¹⁵:
    Ĥ_{n+1} = Ĥ_n + μ (Z_n − Ĥ_n ŝ_n) ŝ*_n,   (10.39)

where μ is an adaptation step-size and the common feedback signal ŝ_n is a selected estimate of the spatio-temporal signal component. As shown later in Sect. 10.5.4, this 1D-ST structured DFI procedure outperforms its 2D structured counterpart by reducing channel estimation and power control errors. A first version of 1D-ST STAR transforms 2D structured common/soft DFI (see Sect. 10.4.4) into 1D-ST structured soft DFI using the following feedback signal in (10.39):

    ŝ_n = s̃_n,   (10.40)

where the soft output s̃_n of joint noncoherent ST-MRC:

    s̃_n = Ĥ^H_n Z_n/M,   (10.41)
replaces that obtained by (10.27). Estimation of the total received power ψ̂²_n, the decision variable d_n, and the MPSK data symbol b̂_n follow using (10.35), (10.29), and (10.9), respectively. Substituting the expression (10.41) for ŝ_n, the 1D-ST structured DFI procedure can be rewritten as [see (10.17)]:

    Ĥ_{n+1} = Ĥ_n + μ (Z_n − Ĥ_n Ĥ^H_n Z_n/M) Z^H_n Ĥ_n/M
            = Ĥ_n + μ (I_{MP} − Ĥ_n Ĥ^H_n/M) Z_n Z^H_n Ĥ_n/M
            = Ĥ_n + μ Π_n Ṙ_Z Ĥ_n/M,   (10.42)
and its adaptation gradient interprets as a projector Π_n, orthogonal to Ĥ_n, applied to Ṙ_Z, the instantaneous estimate of the correlation matrix R_Z of Z_n [see (10.14)]:

    R_Z = ψ̄² (e^{−jφ_n} H_n)(e^{−jφ_n} H_n)^H + σ²_N I_{MP}.   (10.43)

On average, adaptation errors are minimized when the projector Π_n suppresses the dimension of R_Z with the highest energy, i.e., its principal eigenvector e^{−jφ_n} H_n (note that Π_n R_Z H_n = Π_n × (λH_n) = 0 if Ĥ_n = λ′ H_n). This new interpretation of the 1D-ST DFI procedure as an adaptive PCA implementation provides a more intuitive justification as to why the multipath ambiguities a_{p,n} are tied, converging in parallel with the centroid ambiguity a_n to the same phase ambiguity with common DFI (see the discussion below (10.28)).
¹⁵ Preferably, ||Ĥ_n|| is forced to √M after each DFI update for increased stability (we do so in this work), although normalization of ||Ĥ_n|| to √M is asymptotically guaranteed after convergence (see footnote 5).
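The joint update (10.39), with the soft feedback of (10.40)-(10.41) or the hard feedback (10.44) introduced next, sketched under the same assumptions as the earlier snippets (illustrative names):

    import numpy as np

    def st_mrc(H_hat, Z_n, M):
        # joint ST-MRC soft output (10.41); M is the number of antennas
        return np.vdot(H_hat, Z_n) / M

    def dfi_1d_st(H_hat, Z_n, s_hat, mu, M):
        # (10.39): a single spatio-temporal update replacing the P updates of (10.15)
        H_hat = H_hat + mu * (Z_n - H_hat * s_hat) * np.conj(s_hat)
        # force ||H_hat|| = sqrt(M) after each update for stability (footnote 15)
        return H_hat * np.sqrt(M) / np.linalg.norm(H_hat)

Soft DFI passes s_hat = st_mrc(H_hat, Z_n, M); hard DFI passes s_hat = psi_hat * b_tent.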
A second version of 1D-ST STAR transforms 2D structured common/hard DFI (see Sect. 10.4.5) into 1D-ST structured hard DFI using the following feedback signal¹⁶ in (10.39):

    ŝ_n = ψ̂_n b̄̂_n,   (10.44)
where the tentative symbol estimate b̄̂_n is obtained in (10.32) by hard decision over s̃_n of (10.41). Estimation of the total received power ψ̂²_n and the decision variable d_n follow by (10.35) and (10.29), respectively. Since 1D-ST hard DFI implements joint quasi-coherent ST-MRC by forcing the centroid channel ambiguity to converge to a constellation-invariant rotation point r_k ∈ R_M, the MPSK data symbol b_n is directly estimated by differential decoding of b̄̂_n in (10.36). As discussed previously in Sect. 10.4.5, common/hard DFI in 1D-ST hard DFI outperforms common/soft DFI in 1D-ST soft DFI by significantly reducing the detection errors over the data symbols b_n (see Sect. 10.4.6). Theoretically, they perform equally over channel-coded transmissions, although current investigations suggest that the "anchoring" mechanism in hard DFI reduces channel-decoding errors (see Sect. 10.4.5). In the case of a single path (i.e., P = 1 in nonselective fading), the 1D-ST and 2D STAR versions become identical when both implement either hard or soft DFI. In the case of a single receive antenna (i.e., M = 1, on the downlink for instance), the differences between all the 1D and 2D versions of STAR persist and offer the same potential improvements¹⁷. We have shown in Sect. 10.4.6 that common/hard DFI outperforms all other versions of 2D STAR by implementing quasi-coherent ST-MRC. Hence, next we only compare this 2D version of STAR with its 1D-ST counterpart to show the performance advantages of the latter.
10.5.4 Performance Gains of 1D-ST STAR over 2D STAR
We establish below a theoretical performance result showing that the channel identification errors with 1D-ST STAR are smaller than those with 2D STAR.
17
An alternative hard feedback signal sˆn = Real s˜nˆb∗n /|ˆbn | ˆbn /|ˆbn | that performs
nearly the same in the DFI procedure of (10.39) finds more efficient use in power estimation [see (10.35)]. Note in this case that the 1D-ST DFI version amounts to identifying the multipath fades as a temporal channel vector. Similarly, the 2D DFI versions, applied here to the identification of spatial propagation vectors Gp,n , could easily be combined with another “temporal” DFI procedure applied to the soft outputs s˜p,n of (10.7) aligned as a temporal observation vector for identification there of εp,n ejφp,n as a temporal P × 1 channel vector with norm 1. This option is beyond the scope of this contribution. It shows however that 1D-ST DFI amounts to jointly applying two DFI procedures, one in space and another in time.
To our knowledge, this is the first analytical explanation as to when and why joint space-time processing outperforms sequential space and time processing. Later we validate this proof by simulations and show the resulting performance advantage of 1D-ST STAR over 2D STAR and current 2D RAKE-type receivers. We define the mean square error per diversity branch of channel identification in both space and time, referred to in the following as the misadjustment, as [17]:

    β² = E[||ΔH_n||²]/(MP) = E[||Ĥ_n − H_n||²]/(MP).   (10.45)

In [17] we carried out a detailed convergence and performance analysis of the channel misadjustment using the 1D-ST structured DFI of (10.39). Here we provide a summary of the main analytical results derived there. We show that the DFI procedure converges in the mean square sense with the following time constant [17]:

    τ = 1/(2μψ̄²(1 − μψ̄²/2)) ≃ 1/(2μψ̄²),   (10.46)

and establish the analytical expression for the steady-state (i.e., after convergence) misadjustment [17]:

    β²(ψ̄², σ²_N, f_D T_s, P, μ) = μσ²_N/(2(1 − μψ̄²/2)) + 2(1 − B₀(2πf_D T_s))/(P(μψ̄²)²),   (10.47)

where B₀(x) is the Bessel function of the first kind of order 0. This expression for the misadjustment reflects two contributions. The first, from noise, increases with larger values of the adaptation step-size μ due to higher gradient-update perturbations. The second, from the Doppler spread, increases with smaller values of μ due to slower tracking of the channel variations. We hence establish the following analytical expressions for the optimum step-size and the resulting minimum misadjustment [19] and time constant (not necessarily the smallest):

    μ_opt(ψ̄², σ²_N, f_D T_s, P) = 2 [(πf_D T_s)/(√P ψ̄² σ_N)]^{2/3},   (10.48)
    β²_min(SNR_in, f_D T_s, P) = (3/2) [(πf_D T_s)/(√P SNR_in)]^{2/3},   (10.49)
    τ_opt(SNR_in, f_D T_s, P) ≃ (1/4) [√P/(πf_D T_s √SNR_in)]^{2/3}.   (10.50)
To the best of our knowledge, these expressions (which apply to reference-assisted receivers as well, see Sect. 10.6) are the first to provide practical means for the optimal tuning of adaptive channel identification and for the prediction of the step-size, misadjustment, and convergence time in a multipath Rayleigh fading environment.
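Under the forms of (10.48)-(10.50) as reconstructed above, optimum tuning reduces to a few lines. Treat the following as a sketch whose formulas inherit any transcription uncertainty in those equations (names and the demo values are ours):

    import numpy as np

    def optimum_tuning(snr_in, fD_Ts, P, psi2=1.0):
        # snr_in is linear (not dB); psi2 is the average total received power
        sigma_N = np.sqrt(psi2 / snr_in)
        x = np.pi * fD_Ts
        mu_opt = 2.0 * (x / (np.sqrt(P) * psi2 * sigma_N)) ** (2.0 / 3.0)     # (10.48)
        beta2_min = 1.5 * (x / (np.sqrt(P) * snr_in)) ** (2.0 / 3.0)          # (10.49)
        tau_opt = 0.25 * (np.sqrt(P) / (x * np.sqrt(snr_in))) ** (2.0 / 3.0)  # (10.50)
        return mu_opt, beta2_min, tau_opt

    # Table 10.1 setting: SNR_in = 2 dB, f_D T_s = 9/19200, P = 3
    print(optimum_tuning(10 ** 0.2, 9.0 / 19200.0, 3))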
We exploit the theoretical results above to compare the minimum misadjustments and the corresponding time constants of 1D-ST and 2D STAR, denoted by β²_{1D,min}, τ_{1D,opt}, β²_{2D,min}, and τ_{2D,opt}, respectively. We already have β²_{1D,min} = β²_min in (10.49) and τ_{1D,opt} = τ_opt in (10.50). For 2D STAR we derive the expression for β²_{2D,min} as follows:
    β²_{2D,min} = E[||Ĥ_n − H_n||²]/(MP) = Σ_{p=1}^{P} E[||ε̂_{p,n} Ĝ_{p,n} − ε_{p,n} G_{p,n}||²]/(MP)
                = (1/P) Σ_{p=1}^{P} ε̄²_p E[||ΔG_{p,n}||²]/M + (1/P) Σ_{p=1}^{P} E[|Δε_{p,n}|²],   (10.51)

where ΔG_{p,n} = Ĝ_{p,n} − G_{p,n} and Δε_{p,n} = ε̂_{p,n} − ε_{p,n} denote the estimation errors over G_{p,n} and ε_{p,n}, respectively, both assumed independent. Although the analytical expressions of (10.46) to (10.49) assume perfect equalization of the total received power (i.e., ψ²_n = ψ̄²) [17], we apply them to 2D DFI with step-size optimization assuming a constant received power ε̄²_p ψ̄² over each path, and therefore have:

    μ_{p,opt}(ε̄²_p ψ̄², σ²_N, f_D T_s) = 2 [(πf_D T_s)/(ε̄²_p ψ̄² σ_N)]^{2/3} = (P^{1/3}/ε̄_p^{4/3}) μ_opt,   (10.52)

yielding:

    E[||ΔG_{p,n}||²]/M = β²_min(ε̄²_p SNR_in, f_D T_s, 1) = (P^{1/3}/ε̄_p^{4/3}) β²_{1D,min}.   (10.53)

Using the expression above in (10.51), we have:

    β²_{2D,min} = [(1/P) Σ_{p=1}^{P} ε̄_p^{2/3} P^{1/3}] β²_{1D,min} + β²_ε = κ β²_{1D,min} + β²_ε,   (10.54)

where β²_ε denotes the average estimation errors over the multipath amplitudes given by the second summation term in (10.51). Theory bounds the factor κ between 0 and 1. With realistic average multipath power profiles¹⁸, however, the values of κ are actually close to 1, so that in practice we have:

    β²_{2D,min} ≃ β²_{1D,min} + β²_ε > β²_{1D,min}.   (10.55)
¹⁸ With equal-power paths (i.e., ε̄²_p = 1/P), note that a feedback signal with average power ψ̄²/P in 2D DFI yields μ_{p,opt} = P^{2/3} μ_opt and E[||ΔG_{p,n}||²]/M = P^{2/3} β²_{1D,min} in (10.52) and (10.53), respectively. In (10.54), note also that κ = 1 and β²_{2D,min} = β²_{1D,min} + β²_ε.
This steady-state misadjustment is reached after convergence with a time constant dominated by the weakest multipath power, say ε̄²_min:

    τ_{2D,opt} ≃ τ_opt(ε̄²_min SNR_in, f_D T_s, 1) ≃ τ_{1D,opt}/(ε̄²_min P)^{1/3} ≥ τ_{1D,opt}.   (10.56)
Intuitively, a joint 1D-ST DFI update with the total received power in the feedback signal ŝ_n of (10.39) results in 1) less misadjustment and 2) faster convergence than separate 2D DFI updates with fractioned and possibly unbalanced¹⁹ powers in the multipath feedback signals ŝ_{p,n} of (10.15). Additionally, a joint 1D-ST DFI update estimates the multipath amplitudes ε_{p,n} implicitly in the space-time channel estimate Ĥ_n and hence avoids the additional misadjustment β²_ε in (10.55) that arises inevitably with separate 2D DFI updates, regardless of the multipath amplitude estimation technique employed (see footnote 17).

To our knowledge, the theoretical results above provide the first analytical explanation as to when and why joint space-time processing outperforms the sequential space and time processing widely implemented today in RAKE-type receivers (see Sect. 10.3). We illustrate them below by simulations, with focus on the minimum misadjustment and the resulting SER.

In Fig. 10.10(a) we plot the minimum misadjustment for the 1D-ST and 2D versions of STAR using the corresponding optimum step-sizes. To widen the scope of the comparisons between the 1D-ST and 2D versions of STAR, we increase the Doppler up to about 90 Hz (i.e., a speed of 50 km/h) in Fig. 10.10(b). For both low and high Doppler, the misadjustment curves in Fig. 10.10 show a very good fit between the theoretical and experimental values of β²_{1D,min} in (10.49) with 1D-ST STAR. They also suggest that the analytical expressions of (10.46) to (10.49), derived for BPSK in [17], [19], hold for higher-order modulations as well, the fit to the experimental curves improving at even higher SNR values. More importantly, the misadjustment curves confirm the theoretical proof of (10.55) above that 2D STAR indeed performs worse in channel identification than 1D-ST STAR, the gap in misadjustment being larger at faster Doppler. Nevertheless, the reduction of misadjustment in Fig. 10.10 is not sufficient to result in a noticeable SER reduction in Fig. 10.11, especially at lower SNR where the gap in misadjustment between 1D-ST and 2D STAR is smaller. This suggests that 1D-ST STAR performs nearly the same in SER as 2D STAR over a large range of Doppler, despite the gains in minimum misadjustment achieved with optimum step-sizes. However, the small gap that appears between the SER curves in Fig. 10.11(b) suggests that noticeable SNR gains can be expected with 1D-ST STAR at very high Doppler, especially at high SNR and with higher-order modulations.
¹⁹ With equal-power paths (i.e., ε̄²_p = ε̄²_min = 1/P), we have τ_{2D,opt} ≃ τ_{1D,opt} in (10.56). With unbalanced multipath power profiles, 2D DFI is hence slower. Furthermore, simulations indicate that it produces even higher misadjustment.
[Fig. 10.10 (two plots, not reproduced): Minimum misadjustment in dB vs. SNR in dB for 1D-ST and 2D STAR with MPSK modulations and the optimum step-sizes μ_opt and μ_{p,opt} of (10.48) and (10.52), respectively. (a): Doppler of 9 Hz. (b): Doppler increased up to 90 Hz.]
[Fig. 10.11 (two plots, not reproduced): Same as Fig. 10.10 with SER vs. SNR in dB instead.]
Table 10.2. Complexity per symbol for the 2D and 1D-ST versions of STAR (number of operations; complex for +, ×, and /).

    operation    | +            | ×            | /       | √
    2D STAR      | 3MP + 2P − 2 | 3MP + 5P + 1 | MP + 2P | 2P + 1
    1D-ST STAR   | 3MP          | 3MP + 5      | MP + 1  | 2

    reduction in the number of operations with joint processing:
    M = 4, P = 3 | 10 %         | 20 %         | 30 %    | 70 %
    M = 2, P = 3 | 20 %         | 30 %         | 40 %    | 70 %
There are actually other performance criteria by which joint space-time processing in 1D-ST STAR readily outperforms sequential space and time processing in 2D RAKE-type receivers. In terms of complexity, joint space-time processing requires fewer computations than sequential space and time processing. As shown in Table 10.2, the reduction in the number of operations with joint processing is significant with M = 4 antennas and increases with M = 2 antennas.
The computational gain shrinks, however, with larger antenna arrays. In terms of robustness to changes in propagation conditions, 1D-ST STAR is insensitive to multipath power-profile variations. Indeed, it requires optimization of a single step-size value μ, regardless of the average multipath power fractions. On the other hand, 2D STAR requires optimization of multiple step-sizes μ_p with constant adjustments to the average multipath power fractions²⁰ in order to cope with changing propagation conditions. Without such adjustments, simulations with variable multipath power profiles (not shown for lack of space) suggest that 2D STAR loses about 0.5 dB in SNR (with step-size values optimized for equal-power paths), while 1D-ST STAR performs the same. Joint space-time processing in 1D-ST STAR is hence more robust to changes in propagation conditions than sequential space and time processing in current 2D RAKE-type receivers, and alleviates the demanding burden of continuous step-size optimization. So far, the comparisons between 1D-ST and 2D STAR have been limited to the link level. In fact, 1D-ST STAR has additional benefits at the system level, where a reduced variance of the total received power reduces the probability of outage [25]. Indeed, more accurate channel estimation with 1D-ST DFI results in more accurate estimation of the total received power. Additionally, while the power variations of the feedback signal ŝ_n of 1D-ST DFI in (10.39) are "equalized" by power control, those of the feedback signals ŝ_{p,n} of 2D DFI in (10.15) are not. Reduced variation in the power of the feedback signal further reduces channel estimation and power control errors and hence increases the performance advantage of 1D-ST STAR over 2D STAR at the system level. This is, however, beyond the scope of this contribution. In summary, joint space-time processing in 1D-ST STAR outperforms sequential space and time processing in current 2D RAKE-type receivers in many ways²¹:
• it requires less complexity, especially with small antenna arrays;
• it increases robustness to changing propagation conditions and alleviates the demanding burden of continuous step-size optimization;
• it identifies multipath Rayleigh channels faster and more accurately, and offers noticeable²² link-level SNR gains at very high Doppler;
• it reduces power control errors and offers potential system-level capacity gains (see footnote 22).
22
If required, notice that (10.52) enables instantaneous optimization of variable step-sizes μp,n using ε2p,n instead of ε¯2p,n . Another advantage of joint processing is that it increases the dimension of the observation space from M (or P ) to M P thereby allowing implementation of null constraints with less noise enhancement and more efficient interference suppression [7]. This is however beyond the scope of this contribution. In fact, performance evaluations at the link and system levels with active synchronization [5] both showed significant gains of blind 2D STAR over the blind 2D RAKE, more so at high Doppler.
We previously reported on the capacity gains achievable by 1D-ST STAR over 2D STAR and 2D RAKE-type receivers with orthogonal Walsh-modulated CDMA signals [20]. There we proposed similar incremental upgrades of the 2D DFI procedure in 2D STAR (see Sect. 10.4) that ultimately implement blind coherent²³ ST-MRC with 1D-ST STAR. With MPSK modulations so far, we have been able to implement blind quasi-coherent (i.e., within a constellation-invariant rotation point) ST-MRC. In the next section we propose further upgrades of 1D-ST STAR that implement quasi-blind (i.e., with very weak pilot signals) or "asymptotically" blind coherent ST-MRC.
10.6 The Pilot-Assisted 1D-ST STAR
Blind 1D-ST STAR implements quasi-coherent ST-MRC by identifying the channel within a constellation-invariant phase rotation. The conventional use of pilot signals [12]-[16] in RAKE-type receivers allows channel identification²⁴ without phase ambiguity and hence enables the implementation of reference-assisted coherent ST-MRC. It requires, however, large-enough pilot-power or -overhead fractions to guarantee accurate channel identification. We propose instead an enhanced use of pilot signals, with much weaker power or overhead, to resolve and then compensate the constellation-invariant phase rotation of the channel identified blindly and more accurately. We hence implement quasi-blind (i.e., with very weak pilot signals) or "asymptotically" blind coherent ST-MRC. Enhanced channel identification and the reduction of the pilot power or overhead combined result in a total SNR gain of 1 dB and enable significant battery power savings and spectrum-efficiency gains.
10.6.1 Data Model with Pilot Signals
So far we differentially encoded the data symbols b_n as b̄_n in (10.2) to compensate for the phase ambiguity inherent to blind channel identification, by differential decoding with hard DFI (or demodulation with soft DFI, see Sect. 10.5.3). Here we exploit pilot signals to either avoid or resolve this ambiguity. Hence we avoid differential encoding and simply assign b̄_n = b_n in the following. In both blind and pilot-assisted processing we spread the data symbols b_n by a data code and hence mark the corresponding data-channel parameters with the superscript δ.
²³ With orthogonal Walsh modulation, the constellation set is C_M = {0, 1} after despreading and the only constellation-invariant phase rotation possible is 2π (i.e., R_M = {1}); hence no ambiguity is possible with hard DFI (see Sect. 10.4.5).
²⁴ Note that these techniques (i.e., [12]-[16]) estimate each diversity finger with a multiple-tap low-pass filter. With fewer computations here, we identify each finger with an optimized single-tap adaptive filter using the DFI procedure (see Sects. 10.6.2 and 10.6.4).
In pilot-assisted processing, we may either code-multiplex the spread data with a pilot or simply insert (i.e., time-multiplex) pilot symbols in the data channel (see Fig. 10.12), and hence mark the corresponding pilot-channel parameters or pilot symbols with the superscript π. Hence we simply rewrite the data observation vector of (10.37) as follows:

    Z^δ_n = H_n s^δ_n + N^δ_n = H_n ψ_n b_n + N^δ_n.   (10.57)

Similarly, when a pilot channel is used (see Fig. 10.12), we form the MP × 1 pilot observation vector as:

    Z^π_n = H_n s^π_n + N^π_n = H_n ξψ_n + N^π_n,   (10.58)

where ξ² denotes the allocated pilot-to-data power ratio and N^π_n is a zero-mean space-time uncorrelated Gaussian noise vector with the same covariance matrix as N^δ_n (i.e., R_N = σ²_N I_{MP}). When pilot symbols are used (see Fig. 10.12), the data sequence b_n is simply assigned a constant "1" once every K symbols, although the insertion of pilot blocks is possible. Hence we have b_{n′K} = 1 and s^π_{n′K} = s^δ_{n′K} = ψ_{n′K}. In this case, ξ² = 1/K denotes the allocated pilot-to-data overhead ratio. In the following, we investigate the five versions of STAR summarized in Table 10.3, the first reference version Rx1 being the blind 1D-ST STAR with hard DFI already described in Sect. 10.5.3.

[Fig. 10.12 (timing diagram, not reproduced): Pilot modes (data signals are in grey and pilot signals are in white); in pilot-symbol mode, one pilot symbol is inserted in the δ-channel every KT_s = T_s/ξ².]

Table 10.3. Description of the tested versions of 1D-ST STAR.

    version | pilot mode    | pilot use
    Rx1     | none          | none (i.e., blind without pilot)
    Rx2     | pilot-channel | channel identification (i.e., conventional)
    Rx3     | pilot-channel | ambiguity resolution (i.e., enhanced)
    Rx4     | pilot-symbol  | channel identification (i.e., conventional)
    Rx5     | pilot-symbol  | ambiguity resolution (i.e., enhanced)
10.6.2 1D-ST STAR with Conventional Pilot-Channel Use
The second version of STAR, denoted by Rx2 (see Table 10.3), uses a pilot channel for conventional channel identification [14], [16] (see footnote 24). It exploits the fact that the pilot signal is a known reference signal (a priori the constant "1") to feed it back to the 1D-ST DFI procedure of (10.39), modified²⁵ as follows [18]:

    Ĥ_{n+1} = Ĥ_n + μ (Z^π_n − Ĥ_n ŝ^π_n) ŝ^π_n,   (10.59)

where ŝ^π_n = ξψ̂_n denotes the feedback signal, with known positive sign. Estimation of the total received power ψ̂²_n and of the tentative symbol estimate b̄̂_n follow by (10.35) and (10.32). Note that the modified DFI procedure above operates on the pilot despread vector Z^π_n with a constellation set C₁ = {1} (i.e., the pilot symbol is the constant "1") and a rotation set R₁ = {1} (i.e., no possible ambiguity). As a result, it identifies the channel without ambiguity (i.e., a_n ≃ 1) and hence estimates the data symbol and the decision variable as follows:

    b̂_n = b̄̂_n,   (10.60)
    d_n = s̃^δ_n,   (10.61)
where s̃^δ_n in (10.41) now denotes the soft output of coherent ST-MRC. In contrast to Rx1, Rx2 no longer requires the differential decoding of (10.36) or the differential demodulation of (10.29) in the detection steps above. Hence it reduces detection errors over both channel-uncoded and -coded transmissions by implementing "fully" coherent instead of quasi-coherent ST-MRC. Note, however, that Rx2 identifies the channel using the pilot feedback signal ŝ^π_n with power ξ²ψ̄² < ψ̄². The analytical results of Sect. 10.5.4 apply to Rx2 and hence yield the following optimum step-size and minimum misadjustment, as well as the corresponding time constant:

    μ_{Rx2,opt} = 2 [(πf_D T_s)/(√P ξ²ψ̄² σ_N)]^{2/3},   (10.62)
    β²_{Rx2,min} = (3/2) [(πf_D T_s)/(√P ξ² SNR_in)]^{2/3},   (10.63)
    τ_{Rx2,opt} ≃ (1/4) [√P/(πf_D T_s ξ√SNR_in)]^{2/3}.   (10.64)

The expressions above indicate that Rx2 performs worse than Rx1 in channel identification in terms of both misadjustment and convergence speed. Despite the increased detection errors due to differential decoding, Rx1 may outperform Rx2 thanks to its reduced channel-identification errors, as shown later by simulations.
Footnote 25: Note that the DFI step of (10.59) could be updated at a slower rate if the pilot signal is transmitted in short bursts on the pilot channel. Extension to this case is ad hoc.
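As an aside, the recursion (10.59) is easy to exercise numerically. The sketch below (not from the chapter) tracks a synthetic slowly varying channel vector with the pilot-fed DFI step; the random-walk fading stand-in, the dimensions, the step-size, and the noise level are all illustrative assumptions, not the chapter's simulation setup.

```python
import numpy as np

rng = np.random.default_rng(0)
MP, n_sym = 4, 2000        # assumed despread-vector length and run length
xi, psi = 0.2, 1.0         # assumed pilot amplitude fraction and signal level
sigma_N = 0.1              # assumed noise std per component
mu = 0.1                   # assumed step-size (the text gives the optimum)

# Normalized channel vector perturbed by a slow random walk, a crude
# stand-in for Rayleigh fading, identified from pilot observations only.
H = rng.standard_normal(MP) + 1j * rng.standard_normal(MP)
H /= np.linalg.norm(H)
H_hat = np.zeros(MP, dtype=complex)

mse = []
for n in range(n_sym):
    drift = 1e-2 * (rng.standard_normal(MP) + 1j * rng.standard_normal(MP))
    H = (H + drift) / np.linalg.norm(H + drift)
    s_pi = xi * psi                               # feedback signal, known sign
    Z_pi = H * s_pi + sigma_N * (rng.standard_normal(MP)
                                 + 1j * rng.standard_normal(MP))
    H_hat = H_hat + mu * (Z_pi - H_hat * s_pi) * s_pi   # DFI step (10.59)
    mse.append(np.sum(np.abs(H - H_hat) ** 2))

print("steady-state identification error ~", np.mean(mse[-500:]))
```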
10.6.3  1D-ST STAR with Enhanced Pilot-Channel Use
The third version of STAR, denoted by Rx3 (see Table 10.3), is a hybrid of Rx1 and Rx2. Like Rx1, it applies the blind DFI procedure of (10.39) and (10.44) to estimate the channel within a constellation-invariant phase ambiguity a_n. Like Rx2, it uses a pilot-channel, however with much weaker power: it exploits the pilot more efficiently to accurately estimate then resolve the phase ambiguity a_n = r_k ∈ R_M [18]. Noticing that the pilot signal component estimate (see footnote 26), given by:

s̃^π_n = Ĥ^H_n Z^π_n / M ≃ a_n ψ_n ξ + η^π_n ,   (10.65)

is based on a noisy value of a_n, Rx3 estimates it by hard decision over the soft output s̃^π_n averaged over consecutive blocks of A samples, giving for n ∈ {n'A, . . . , (n'+1)A − 1} [18]:

ã_n = ã_{n'} = (1/A) Σ_{i=0}^{A−1} s̃^π_{n'A+i} ,   (10.66)
â_n = â_{n'} = arg min_{r_k ∈ R_M} {|ã_{n'} − r_k|} .   (10.67)
The averaging step above (see footnote 27) enables accurate estimation of a_n with a much weaker pilot power [18]. Hence, Rx3 estimates the data symbol and the decision variable by simple phase ambiguity compensation as follows [18]:

b̌_n = â*_n b̂_n ,   (10.68)
d_n = â*_n s̃^δ_n ,   (10.69)
and thereby implements coherent ST-MRC, like Rx2, with the same benefits in reducing symbol detection errors over Rx1. Note however that the expressions for μ_{Rx3,opt}, β²_{Rx3,min} and τ_{Rx3,opt} are exactly those of Rx1 in (10.48) to (10.50), respectively. Rx3 therefore combines the advantages of both Rx1 and Rx2 by 1) exploiting a much weaker pilot than Rx2 to resolve the phase ambiguity and implement coherent detection, and 2) exploiting the data channel, with much more power than the pilot channel, for more accurate channel identification. Simulations will later confirm the performance advantage of Rx3 over both Rx1 and Rx2.
The power fraction in s˜πn can be exploited in enhancing power estimation in (10.35) [18]. Long-term averaging is made possible by the “anchoring” mechanism of hard DFI (see Sect. 10.4.5). With soft DFI, the ambiguity rotates continuously (see Fig. 10.3) and prevents accurate estimation of an by long-term averaging.
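A minimal sketch of the two ambiguity-resolution steps (10.66)-(10.67), assuming QPSK so that the rotation set is R_4 = {1, j, −1, −j}, and modeling the soft pilot outputs of (10.65) as Gaussian-perturbed; all numerical values are illustrative, chosen to show that a weak pilot suffices once the block average is taken.

```python
import numpy as np

rng = np.random.default_rng(1)
A = 500                            # averaging block length (A = 500 in the text)
R_M = np.array([1, 1j, -1, -1j])   # rotation set R_4, assuming QPSK
a_true = 1j                        # constellation-invariant ambiguity to resolve
xi, psi = 0.1, 1.0                 # assumed weak pilot amplitude and signal level
sigma = 0.5                        # assumed noise std on the soft pilot output

# Soft pilot-component estimates per (10.65): s~_n ~= a_n * psi * xi + noise.
s_pi = a_true * psi * xi + sigma * (rng.standard_normal(A)
                                    + 1j * rng.standard_normal(A))

a_tilde = s_pi.mean()                           # block average, (10.66)
a_hat = R_M[np.argmin(np.abs(a_tilde - R_M))]   # hard decision, (10.67)
print(a_hat == a_true)   # averaging makes this reliable even for a weak pilot
```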
10.6.4  1D-ST STAR with Conventional Pilot-Symbol Use
The fourth version of STAR, denoted by Rx4 (see Table 10.3), uses pilot symbols for conventional channel identification [12], [13], [15], [16] (see footnote 24). Its DFI procedure is similar to that of Rx1 in (10.39). However, it only feeds back the signal components containing the pilot symbols inserted in the data sequence once every K symbols [19]:

Ĥ_{(n'+1)K} = Ĥ_{n'K} + μ (Z^δ_{n'K} − Ĥ_{n'K} ŝ^π_{n'K}) ŝ^π_{n'K} ,   (10.70)

where ŝ^π_{n'K} = ψ̂_{n'K}. Note that the modified DFI procedure above operates, at the times corresponding to the pilot-symbol indices n'K, on the data despread vector Z^δ_{n'K} with a constellation set C_1 = {1} (i.e., the pilot symbol is the constant "1") and a rotation set R_1 = {1} (i.e., no possible ambiguity). Similarly to Rx2, the DFI procedure of Rx4 identifies the channel without ambiguity (i.e., a_n = 1) and allows estimation of b̌_n and d_n using (10.60) and (10.61), respectively. Notice, however, that Rx4 updates the DFI procedure less frequently than Rx2, namely at the pilot-symbol rate 1/(K T_s). Exploiting again the analytical results of Sect. 10.5.4, we have:

μ_{Rx4,opt} = [2(πf_D K T_s) / (√P ψ̄² σ_N)]^{2/3} = μ_{Rx2,opt} ,   (10.71)
β²_{Rx4,min} = (3/2) [2(πf_D K T_s) / (√P K ξ² SNR_in)]^{2/3} = β²_{Rx2,min} ,   (10.72)
τ_{Rx4,opt} ≃ (K/4) [√P / (πf_D K T_s SNR_in)]^{2/3} = τ_{Rx2,opt} ,   (10.73)
and hence find that Rx2 and Rx4 identify the channel equally well when they use the same pilot power and overhead fractions (i.e., K = 1/ξ²) [19] (see footnote 28). On the one hand, the channel subsampled at the DFI-update instants appears to vary K times faster, with a relative normalized Doppler K f_D T_s. Thus channel identification errors can be expected to increase with faster time-variations. On the other hand, the power of the feedback signal in Rx4, |ŝ^π_{n'K}|² = ψ̂²_{n'K}, is K times stronger than in Rx2, where |ŝ^π_n|² = ξ²ψ̂²_n = ψ̂²_n/K. This suggests that channel identification errors will decrease with stronger feedback signals. Analysis of Sect. 10.5.4 shows the non-trivial result that the corresponding loss and gain in the performance of adaptive channel identification cancel each other [19]: the minimum channel misadjustment achievable remains constant if we increase or decrease both the relative Doppler and the feedback-signal's power by the same factor K. Later we confirm by simulations that Rx2 and Rx4 perform equally well.
Footnote 28: A similar conclusion regarding misadjustment was reached in [16] based on channel estimation with low-pass filtering (see footnote 24).
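Using the scaling β²_min ∝ (relative Doppler / feedback power)^{2/3} implied by the expressions above (as reconstructed here), the cancellation can be checked by simple arithmetic; the snippet below is only a sanity check of that scaling, not a re-derivation of the analysis in [19].

```python
# Check that K-times-faster relative Doppler (Rx4) and K-times-stronger
# feedback power cancel in beta_min^2 ∝ (Doppler / power)^(2/3).
fD_Ts, psi2 = 1e-3, 1.0          # illustrative normalized Doppler and power
for K in (2, 5, 10, 20):
    xi2 = 1.0 / K                # same pilot power/overhead fraction
    rx2 = (fD_Ts / (xi2 * psi2)) ** (2.0 / 3.0)   # pilot-channel use (Rx2)
    rx4 = (K * fD_Ts / psi2) ** (2.0 / 3.0)       # pilot-symbol use (Rx4)
    print(K, rx2, rx4)           # identical for every K
```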
10.6.5  1D-ST STAR with Enhanced Pilot-Symbol Use
Similarly to Rx3, the fifth version of STAR, denoted by Rx5 (see Table 10.3), is a hybrid of Rx1 and Rx4. It applies the blind DFI procedure of (10.39) and (10.44) to estimate the channel within a constellation-invariant phase ambiguity a_n. However, it uses the pilot symbols to estimate then resolve the phase ambiguity a_n [19]. Assume for simplicity that a block of A consecutive symbols contains exactly Q pilot symbols (i.e., A = QK). Rx5 modifies (10.66) simply by averaging the pilot signal component estimate over these Q symbols, for n ∈ {n'A, . . . , (n'+1)A − 1} [18]:

ã_n = (1/Q) Σ_{i=0}^{Q−1} s̃^π_{n'A+Ki} ,   (10.74)
before estimating a_n by hard decision in (10.67). Similarly to Rx3, Rx5 thereby resolves the phase ambiguity and hence estimates b̌_n and d_n using (10.68) and (10.69), respectively. Like Rx3, the expressions for μ_{Rx5,opt}, β²_{Rx5,min} and τ_{Rx5,opt} are exactly those of Rx1 in (10.48) to (10.50), respectively. Notice that Rx5 in (10.74) estimates the pilot-signal component from K = A/Q times fewer values than Rx3 in (10.66). The variance of the residual noise present in ã_n is thereby increased by a factor of K. However, bear in mind that the pilot-signal power in Rx5 is K times stronger than in Rx3. Indeed, in contrast to (10.65), we have:

s̃^π_n = Ĥ^H_n Z^π_n / M ≃ a_n ψ_n + η^π_n .   (10.75)
The SNR of ã_n before phase ambiguity estimation in (10.67) is therefore the same for both receiver versions. Despite the differences between Rx3 and Rx5, the averaging step in (10.66) or (10.74) results in the same phase estimation error. In the following section we confirm that Rx3 and Rx5 have equivalent performance.

10.6.6  Performance Gains with Enhanced Pilot Use
In Fig. 10.13, we compare the SER performance of the various receiver versions of 1D-ST STAR described previously (see Table 10.3) for BPSK, QPSK, 8PSK and 16QAM (see footnote 29). In Sect. 10.4.5, we have shown that the anchoring mechanism of hard DFI works well with MQAM modulations (see Fig. 10.8). Hence its useful feature of casting the phase ambiguity in the rotation set R_M can be efficiently exploited to estimate and then compensate the phase ambiguity with enhanced use of pilot signals in 16QAM-modulated Rx3 and Rx5.
Footnote 29: Note that the blind version Rx1, inapplicable with MQAM modulations, is not evaluated in Fig. 10.13(d).
Fig. 10.13. SER vs. SNR in dB for different versions of 1D-ST STAR (see Table 10.3) with optimum step-size values. (a): BPSK, (b): QPSK, (c): 8PSK, (d): 16QAM.
Simulation results shown in Fig. 10.13 indicate that more efficient exploitation of pilot channels or symbols for phase-ambiguity resolution only outperforms their conventional use for channel identification, regardless of the modulation employed. They also suggest that pilot-channel and pilot-symbol versions, with either conventional or enhanced pilot use, perform similarly for all modulations except 16QAM with conventional pilot use, thereby confirming our analytical assertions (see footnote 30). Note that the receiver versions with enhanced pilot use (i.e., Rx3 and Rx5) perform practically the same with 1 or 5% fractions of the pilot power or overhead. This result shows that long-term averaging in (10.66) and (10.74) indeed significantly reduces phase ambiguity estimation errors in (10.67) from very weak pilot signals (we used A = 500 in all simulations of Fig. 10.13). On the other hand, the receiver versions with conventional pilot use (i.e., Rx2 and Rx4) see their performance drop when the pilot power or overhead fraction is reduced by half, from 10 to 5%.
Footnote 30: The theoretical results of Sect. 10.5.4 derive from a convergence and performance analysis in [17] that assumes a constant-modulus modulation. Conventional identification with weak power or overhead fractions increases channel estimation errors and likely contributes to further increasing the mismatch with the analysis.
Simulations actually indicate that Rx2 and Rx4 with a 10% fraction perform worse than Rx3 and Rx5 with only a 1% fraction. They also suggest that Rx2 and Rx4 perform even worse than blind Rx1 at higher SNR, more so with reduced power fractions. Figure 10.13 actually suggests that Rx3 and Rx5 with a 1% power or overhead fraction offer about 0.8 dB gain in SNR over Rx2 and Rx4 with a 5% fraction, and about 0.5 dB over Rx2 and Rx4 with 10%. However, capacity gains achievable at the system level by the reduction of the pilot-signals' interference from 5 and 10% to 1% account for "equivalent" SNR gains of roughly 0.2 dB (i.e., 10 log10(1.05/1.01)) and 0.4 dB (i.e., 10 log10(1.1/1.01)) at the link level, respectively. The total performance gain of Rx3 and Rx5 over Rx2 and Rx4 with either power/overhead fraction is hence in the range of 1 dB. Reduced power variations and power control errors due to blind identification of the data channel with stronger signals (see Sect. 10.5.4) should further increase the performance advantage of Rx3 and Rx5 over Rx2 and Rx4 at the system level. The discussion above suggests that optimization of the step-size alone does not allow for fair comparisons without simultaneous optimization of the pilot power or overhead fraction ξ² = 1/K, in order to reflect the additional gains at the system level due to enhanced pilot use with significantly reduced pilot power or overhead [18], [19]. This is beyond the scope of this contribution. In [18], [19], however, we provide analytical means for optimizing the five receiver versions of 1D-ST STAR at the system level and show that enhanced pilot use allows for significant spectrum efficiency gains in most practical situations. Overall, pilot-assisted space-time receivers with conventional pilot use require pilot-power or -overhead fractions large enough to guarantee accurate channel identification and reliable data detection at the receiver. Pilot-assisted space-time receivers with enhanced pilot use require much weaker power or overhead (i.e., in the range of 1%) to resolve then compensate the constellation-invariant phase rotation of the channel identified blindly and more accurately. They hence implement quasi-blind or "asymptotically" blind (i.e., with very weak pilot signals) coherent ST-MRC. Enhanced channel identification and reduction of the pilot power or overhead combined result in a total SNR gain of 1 dB and enable significant battery power-savings and spectrum-efficiency gains [18], [19]. Similar gains can be achieved on the downlink [21].
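The "equivalent" SNR gains quoted above follow from simple arithmetic on the pilot-interference fractions; a quick check:

```python
import math

# Link-level "equivalent" SNR gains from reducing pilot interference
# fractions of 5% and 10% down to 1%, as quoted in the text.
for frac in (0.05, 0.10):
    gain_db = 10 * math.log10((1 + frac) / 1.01)
    print(f"{frac:.0%} -> 1%: {gain_db:.2f} dB")   # ~0.17 dB and ~0.37 dB
```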
10.7  Conclusions
Several improvements are proposed to the 2D-RAKE, a widely used space-time adaptive receiver which combines CDMA signals sequentially, first in space, then in time. Ultimately we arrive at a more efficient one-dimensional joint space-time (1D-ST) adaptive processor, STAR, the spatio-temporal array-receiver. The advantages of STAR are twofold.
First, STAR carries out an improved combining operation that approaches that of blind coherent space-time MRC. In the blind mode (i.e., without a pilot), the 2D RAKE sequentially implements noncoherent spatial MRC then temporal EGC, while STAR implements quasi-coherent (i.e., within a constellation-invariant phase ambiguity) joint ST-MRC. By improving the combining rule, STAR outperforms the 2D RAKE by about 2 dB in SNR. In the pilot-assisted mode, the 2D RAKE uses the pilot for conventional identification and hence requires a strong pilot to accurately implement reference-assisted coherent space and time MRC sequentially. On the other hand, STAR requires a much weaker pilot to estimate then compensate the constellation-invariant phase ambiguity of the channel identified blindly and more accurately, and hence implements quasi-blind or asymptotically blind coherent joint ST-MRC. Enhanced channel identification and reduction of the pilot power or overhead (in the range of 1%) combined result in a total SNR gain of 1 dB and enable significant battery power-savings and spectrum-efficiency gains. Second, STAR outperforms the 2D RAKE in many ways by implementing joint space-time processing in a 1D-ST structured adaptive receiver. Indeed, we provide novel and non-trivial analytical results that clearly establish the performance advantage of one-dimensional joint space-time processing in 1D-ST STAR over two-dimensional spatial then temporal sequential processing in the 2D RAKE-type adaptive receivers widely used today. We show that 1D-ST structured adaptive receivers reduce both complexity and channel identification errors, increase robustness to changing propagation conditions, and speed up convergence over multipath Rayleigh-fading channels. These gains, validated by simulations, translate immediately into enhanced receiver performance.
References

1. 3rd Generation Partnership Project (3GPP), Technical Specification Group (TSG), Radio Access Network (RAN), Working Group (WG4), UE Radio Transmission and Reception (FDD), TS 25.101, V3.4.1, 2000.
2. 3rd Generation Partnership Project (3GPP), Technical Specification Group (TSG), Radio Access Network (RAN), Working Group (WG4), Base Station Conformance Testing (FDD), TS 25.141, V3.3.0, 2000.
3. 3rd Generation Partnership Project, Technical Specification Group (TSG), Radio Access Networks (RAN), Physical Layer Aspects of UTRA High Speed Downlink Packet Access, 3GPP TR 25.848, V4.0.0, 2001.
4. S. Affes and P. Mermelstein, "A new receiver structure for asynchronous CDMA: STAR - the spatio-temporal array-receiver," IEEE J. Selec. Areas Comm., vol. 16, no. 8, pp. 1411–1422, Oct. 1998.
5. K. Cheikhrouhou, S. Affes, and P. Mermelstein, "Impact of synchronization on performance of enhanced array-receivers in wideband CDMA networks," IEEE J. Selec. Areas Comm., vol. 19, no. 12, pp. 2462–2476, Dec. 2001.
6. H. Hansen, S. Affes, and P. Mermelstein, "A beamformer for CDMA with enhanced near-far resistance," in Proc. IEEE ICC, 1999, vol. 3, pp. 249–253.
7. S. Affes, H. Hansen, and P. Mermelstein, "Interference subspace rejection: a framework for multiuser detection in wideband CDMA," IEEE J. Selec. Areas Comm., vol. 20, no. 2, pp. 287–302, Feb. 2002.
8. B. Suard, A. F. Naguib, G. Xu, and A. Paulraj, "Performance of CDMA mobile communication systems using antenna arrays," in Proc. ICASSP, vol. IV, 1993, pp. 153–156.
9. B. H. Khalaj, A. Paulraj, and T. Kailath, "2D RAKE receivers for CDMA cellular systems," in Proc. GLOBECOM, 1994, pp. 400–404.
10. A. F. Naguib and A. Paulraj, "Effects of multipath and base-station antenna arrays on uplink capacity of cellular CDMA," in Proc. GLOBECOM, 1994, pp. 395–399.
11. A. F. Naguib, Adaptive Antennas for CDMA Wireless Networks, Ph.D. Dissertation, Stanford University, USA, 1996.
12. J. K. Cavers, "An analysis of pilot symbol assisted modulation for Rayleigh fading channels," IEEE Trans. Vehic. Tech., vol. 40, no. 4, pp. 686–693, Nov. 1991.
13. C. D'Amours, M. Moher, and A. Yongaçoğlu, "Comparison of pilot-symbol assisted and differentially detected BPSK for DS-CDMA systems employing RAKE receivers in Rayleigh fading channels," IEEE Trans. Vehic. Tech., vol. 47, no. 4, pp. 1258–1267, Nov. 1997.
14. P. Schramm, "Analysis and optimization of pilot-channel-assisted BPSK for DS-CDMA systems," IEEE Trans. Comm., vol. 46, no. 9, pp. 1122–1124, Sept. 1998.
15. P. Schramm, "Pilot symbol assisted BPSK on Rayleigh fading channels with diversity: performance analysis and parameter optimization," IEEE Trans. Comm., vol. 46, no. 12, pp. 1560–1563, Dec. 1998.
16. F. Ling, "Optimal reception, performance bound, and cutoff rate analysis of reference-assisted coherent CDMA communications with applications," IEEE Trans. Comm., vol. 47, no. 10, pp. 1583–1592, Oct. 1999.
17. S. Affes and P. Mermelstein, "Comparison of pilot-assisted and blind CDMA array-receivers adaptive to Rayleigh fading rates," in Proc. IEEE PIMRC, 1999, vol. 3, pp. 1186–1192.
18. S. Affes, A. Louzi, N. Kandil, and P. Mermelstein, "A high capacity CDMA array-receiver requiring reduced pilot power," in Proc. IEEE GLOBECOM, 2000, vol. 2, pp. 910–916.
19. S. Affes, N. Kandil, and P. Mermelstein, "Efficient use of pilot signals in wideband CDMA array-receivers," submitted for publication, 2002.
20. S. Affes and P. Mermelstein, "A blind coherent spatio-temporal processor of orthogonal Walsh-modulated CDMA signals," Wiley Journal on Wireless Communications and Mobile Computing, accepted for publication, to appear in 2002.
21. S. Affes, A. Saadi, and P. Mermelstein, "Pilot-assisted STAR for increased capacity and coverage on the downlink of wideband CDMA networks," in Proc. IEEE SPAWC, 2001, pp. 310–313.
22. N. A. B. Svensson, "On differentially encoded star 16QAM with differential detection and diversity," IEEE Trans. Vehic. Tech., vol. 44, no. 3, pp. 586–593, Aug. 1995.
23. J. G. Proakis, Digital Communications, 3rd Edition, McGraw-Hill, 1995.
24. M. K. Simon and M.-S. Alouini, "A unified approach to the probability of error of noncoherent and differentially coherent modulations over generalized fading channels," IEEE Trans. Comm., vol. 46, no. 12, pp. 1625–1638, Dec. 1998.
25. A. Jalali and P. Mermelstein, "Effects of diversity, power control, and bandwidth on the capacity of microcellular CDMA systems," IEEE J. Selec. Areas Comm., vol. 12, no. 5, pp. 952–961, June 1994.
11  The IEEE 802.11 System with Multiple Receive Antennas

Vijitha Weerackody
Independent Consultant, NYC, NY, USA
E-mail: [email protected]

Abstract. The demand for wireless local area networks (WLANs) based on the IEEE 802.11 standard is growing rapidly. These WLANs operate in the unlicensed spectrum, are relatively low-cost, and provide very high data rates. The IEEE 802.11 system makes use of a carrier-sensing technique, and packet collisions occur because the stations in a system are essentially uncoordinated. Packet collisions increase with the number of stations in the system and the traffic intensity. In this chapter, we examine the IEEE 802.11 system with multiple receive antennas. Multiple receive antennas can successfully detect multiple packets and thus reduce packet collisions. Also, multiple receive antennas help to reduce the packet errors due to channel errors. We analyze the IEEE 802.11 system with multiple antennas and propose some minor changes to the protocol to obtain improved performance gains. It is shown that significant performance gains can be obtained using this system; for example, the throughput obtained with 10 stations and a single receive antenna is about 0.50; with two and three antennas the corresponding throughput increases to about 0.85 and 1.3, respectively.
11.1  Introduction
Applications such as mobile computing and high-quality audio and video wireless services are driving the demand for very high rate wireless data networks. Wireless local area networks (WLANs) based on the IEEE 802.11 standard [1], [2], [3] deliver very high data rates and are gaining rapid acceptance, especially in indoor wireless data applications. WLANs provide very high data rates with relatively low-cost base stations or access points, and the spectrum in which WLANs operate is available without licensing costs. The IEEE 802.11 standard specifies the medium access control (MAC) and physical (PHY) layers of the WLAN, which operate in the 2.4 GHz and 5 GHz unlicensed frequency bands, where a large portion of the spectrum is available free of any frequency licensing fees. The PHY layer of the IEEE 802.11 standard is specified in several different modes. The frequency hopping and direct sequence spreading schemes operate in the 2.4 GHz band, where the aggregate spectrum available is about 80 MHz. In addition, an infra-red based physical link is also specified. All the above PHY layers provide data rates of 1 Mb/s and optionally 2 Mb/s. Recently, the 2.4 GHz direct sequence spreading scheme has been modified using complementary code keying (CCK) to deliver very high
data rates of up to 11 Mb/s; this PHY is contained in the IEEE 802.11b standard. Motivated by the need for very high data rates, the IEEE 802.11a working group has standardized another PHY layer scheme using orthogonal frequency division multiplexing (OFDM) in the 5 GHz band. The total spectrum available at this frequency band is 300 MHz, and a peak data rate of 54 Mb/s can be achieved using the IEEE 802.11a standard. The IEEE 802.11a standard is attractive because of the availability of this very large spectrum, and in this chapter we will concentrate on this PHY layer, although the general techniques discussed are applicable to other radio frequency PHY layers as well. Currently, the IEEE 802.11 standards committees are setting up another standard, IEEE 802.11g, for the PHY layer in the 2.4 GHz frequency band. The IEEE 802.11g standard provides data rates up to 54 Mb/s using an OFDM system. Since the IEEE 802.11g standard shares the spectrum with the IEEE 802.11b standard, the two standards should be made compatible; that is, the access point of the IEEE 802.11g system should recognize an IEEE 802.11b packet and vice-versa. This is accomplished by both systems having similar packet headers. The MAC layer of the IEEE 802.11 system operates in two different schemes: the distributed coordination function (DCF) and the point coordination function (PCF). In the DCF, all users with data to transmit have a fair chance of accessing the network. On the other hand, the PCF is based on polling and is controlled by the access point. The PCF is suitable for delay-sensitive traffic. The DCF is usually the primary access scheme, and in this chapter we will consider only this scheme. The multiple access technique deployed in the IEEE 802.11 system is based on the carrier sense multiple access scheme with collision avoidance (CSMA/CA) [4]. In CSMA, a station that is ready to send a packet monitors the channel and, if the channel is available, transmits the packet; else, a collision avoidance mechanism is adopted which schedules the transmission of the packet at a later time. In CSMA/CA, packet collisions may occur at the destination station when two or more stations transmit at the same time. These collisions significantly reduce the throughput, especially when the traffic load is high. Performance of the CSMA scheme may also be degraded by the "hidden terminal" problem [5]. Here a station transmits to a destination node and the hidden terminal is out-of-range of the transmit station but within range of the destination station. Suppose the destination station is receiving packets; the hidden terminal, unaware of the on-going transmission, may transmit a packet and contribute to a packet collision. It is well-known that multiple antennas at the receiver can improve performance in the presence of interferers and signal fading. In [6], it was shown that in a Rayleigh fading environment, with N receive antennas and M (≤ N) mutually interfering users, it is possible to null out the M − 1 interfering users at the receiver and also provide diversity gain of order (N − M + 1) for each user.
A WLAN receiver employing multiple receive antennas has the capability of reducing packet collisions, diminishing the effects of the hidden terminal problem, and improving the bit-error-rate (BER) performance in fading channels. These effects will improve the throughput of the WLAN system. Simultaneous reception of multiple packets is possible with multiple receive antennas. Receiving multiple packets in ALOHA systems has been examined in the literature [7], [8], [9]. In a direct sequence code division multiple access based ALOHA system, it is possible to receive simultaneously transmitted packets with a reasonably low packet error rate. Multiuser detection techniques reduce the packet error rate further and thus improve the performance of the ALOHA scheme. In this chapter, we will examine the performance gains obtained in the IEEE 802.11 WLAN system in the DCF mode with multiple receive antennas and minimum mean squared-error (MMSE) combining. In this wireless packet transmission scheme, the antenna weights have to be determined for each packet. This is because, in general, each packet has to contend for channel access and, therefore, the interference profile and the channel conditions experienced by each received packet could be substantially different. Usually, the packets in the IEEE 802.11 system are fragmented so that the length of a packet is no more than a few hundred bytes. Each packet contains a preamble that has sufficient training symbols to determine the receive antenna weights, and the details of determining the antenna weights are discussed in Sect. 11.2. A detailed throughput analysis of the IEEE 802.11 DCF technique has been carried out in [10]. We will make use of the analytical tools given in that paper to examine the performance of the multiple packet reception case with multiple receive antennas. Also included in this analysis is the effect of packet errors not only due to collisions but also due to fading and receiver noise. It is demonstrated that with multiple receive antennas, the throughput of the IEEE 802.11 system can be substantially improved. Changes to the existing MAC protocol are proposed that will support a multiple packet environment with multiple receive antennas. This chapter is organized as follows. The physical layer of the IEEE 802.11a system with multiple receive antennas is modeled in Sect. 11.2. In this section, we determine the MMSE weights at the receive antennas. In Sect. 11.3, we briefly discuss the DCF of the IEEE 802.11 protocol and propose changes to the DCF to accommodate multiple packet reception. In Sect. 11.4, we present a throughput analysis of the IEEE 802.11 system with multiple packet reception. In this section, we consider two cases: the destination station acknowledging all the correctly detected packets, and the destination station acknowledging only a single correctly detected packet although more than one packet may be detected correctly. Performance results for these schemes are given in Sect. 11.5, and the concluding remarks are contained in Sect. 11.6.
11.2  System Model with Multiple Receive Antennas
We will assume the IEEE 802.11a [1] system for the physical layer and combine the receive antennas according to the optimum minimum mean squared-error criterion [11]. Let us consider a system with N receive antennas and M simultaneous users. In the WLAN system, M is the number of packets in a multiple packet environment, and in this chapter we will refer to users and stations interchangeably. Denote by a^m(k) the binary data symbol of the m-th user at the k-th instant. The data is first channel encoded and bit-wise interleaved before being mapped to the complex modulation symbol d^m(k). In the IEEE 802.11a standard, depending on the required data rate, d^m(k) is a symbol from either the BPSK, QPSK, 16-QAM or 64-QAM constellations. Denote the transmitted baseband signal for the i-th OFDM symbol of the m-th user by x^m(t):

x^m(t) = (1/√T_s) Σ_{k=0}^{K−1} d^m(k) e^{j2πkt/T_s} for t ∈ [iT_o, (i+1)T_o], and 0 otherwise,   (11.1)

where T_s is the reciprocal of the subcarrier spacing and T_o is the duration of an OFDM symbol, given by the sum of T_s and the cyclic extension duration T_g [12]. Since the number of subcarriers is K, the total bandwidth of the transmitted signal is approximately K/T_s. Implementation of (11.1) is accomplished using an IDFT [13]. The parameters for the IEEE 802.11a system are T_o = 4 μs, T_g = 0.8 μs, and K = 52, with 4 subcarriers reserved for training purposes. The time-varying channel response from the m-th transmit station to the n-th receive antenna in the i-th OFDM symbol interval is given as

h^m_n(i) = Σ_{l=0}^{L−1} h^m_{n,l}(i) δ[τ − τ^m_{n,l}(i)] ,   (11.2)

where h^m_{n,l}(i) and τ^m_{n,l}(i) are, respectively, the complex channel gain and the delay for multipath l from the m-th station to the n-th antenna. In the above, it is assumed that the channel is constant over the OFDM symbol interval T_o. Since T_o = 4 μs, the channel characteristics will be constant over this symbol duration even at reasonably high mobile speeds. In this work, it is also assumed that the path delays τ^m_{n,l}(i) are smaller than the guard interval T_g, so that intersymbol interference is avoided. Let us assume the fading experienced by each subcarrier is constant over an OFDM symbol period and denote it by H^m_n(k). Note that H^m_n(k) is the frequency response of the channel from the m-th station to the n-th receive antenna at the k-th subcarrier and the i-th OFDM symbol interval, where the dependence on the time index i is dropped for notational simplicity.
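Since (11.1) is realized with an IDFT, a minimal sketch of generating one OFDM symbol follows. The 64-point FFT size, the cyclic-prefix length, and the subcarrier mapping are assumptions for illustration only and do not reproduce the exact IEEE 802.11a subcarrier assignment.

```python
import numpy as np

rng = np.random.default_rng(2)
K, Nfft, Ncp = 52, 64, 16     # 52 used subcarriers; assumed FFT size and
                              # CP length (Nfft * Tg / Ts = 64 * 0.8 / 3.2 = 16)
# QPSK modulation symbols d(k) for one OFDM symbol
d = (rng.choice([-1, 1], K) + 1j * rng.choice([-1, 1], K)) / np.sqrt(2)

X = np.zeros(Nfft, dtype=complex)
X[1:1 + K] = d                              # simplified mapping: skip DC bin
x = np.fft.ifft(X) * np.sqrt(Nfft)          # IDFT realizes (11.1)
x_cp = np.concatenate([x[-Ncp:], x])        # cyclic extension of duration Tg

print(x_cp.shape)   # one transmitted OFDM symbol of duration To = Ts + Tg
```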
Using this and (11.1), the received baseband signal for the i-th OFDM symbol at the n-th receive antenna can be written as

x_n(t) = (1/√T_s) Σ_{m=0}^{M−1} Σ_{k=0}^{K−1} H^m_n(k) d^m(k) e^{j2πkt/T_s} + ν_n(t) for t ∈ [iT_o, (i+1)T_o], and 0 otherwise,   (11.3)

where M is the number of transmit stations and ν_n(t) is the additive white Gaussian noise (AWGN) term. In (11.3), it is assumed that all stations are synchronized such that the i-th OFDM symbols of the transmit stations arrive at the receiver within the guard period, and the propagation delays are included in H^m_n(k). The system model can be extended to the general case where the transmission times of the stations are not synchronized; however, for simplicity, in this chapter we consider only the synchronous case. At the receiver, the samples x_n(k'T_s/K), k' = 0, 1, ..., K − 1, are transformed using the DFT, and the resulting output for the k-th subcarrier at the n-th receive antenna is

y_n(k) = Σ_{m=0}^{M−1} H^m_n(k) d^m(k) + v_n(k) .   (11.4)

In the above, v_n(k) is the AWGN component with E{v_n(k) v*_n(k + k')} = N_0 δ(k'). Note that in the above formulation we have ignored timing and carrier frequency offsets. Denote by d_k = {d^0(k), d^1(k), ..., d^{M−1}(k)}^T the M-dimensional data vector and by y_k = {y_0(k), y_1(k), ..., y_{N−1}(k)}^T the N-dimensional vector of received symbols from the antennas for the k-th subcarrier; then we may write

y_k = H_k d_k + v_k ,   (11.5)

where H_k is the N × M channel matrix and v_k is the N × 1 noise vector whose correlation matrix is given by E{v_k v_k^H} = N_0 I, with I the N × N identity matrix and the superscript H denoting the conjugate transpose. Suppose W_k is the M × N weight matrix; then the vector of received symbols in the k-th subcarrier after the antenna combiner, z_k = {z^0(k), z^1(k), ..., z^{M−1}(k)}^T, is given as

z_k = W_k y_k .   (11.6)

We will assume an optimum minimum mean squared-error processor, so that the weight matrix Ŵ_k is

Ŵ_k = H_k^H R_k^{−1} v_d ,   (11.7)

where v_d = E{d^i(k) d^{i*}(k)} is the variance of the data symbol and R_k = H_k H_k^H v_d + N_0 I is the N × N autocorrelation matrix. The antenna combiner output for the m-th user, z^m(k), is then passed to the deinterleaver and the channel decoder to give an estimate â^m(k) of the data symbol a^m(k).
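The per-subcarrier combiner of (11.5)-(11.7) is straightforward to assemble; the sketch below does so for one subcarrier. The dimensions and noise level are illustrative, and a known channel matrix stands in for the training-based estimate discussed in Sect. 11.2.1.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, N0, vd = 4, 2, 0.1, 1.0        # antennas, users, noise level, symbol var.

# Random flat channel for subcarrier k and one QPSK symbol per user
Hk = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
dk = (np.sign(rng.standard_normal(M)) + 1j * np.sign(rng.standard_normal(M)))
dk /= np.sqrt(2)
vk = np.sqrt(N0 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
yk = Hk @ dk + vk                                   # received vector, (11.5)

Rk = vd * Hk @ Hk.conj().T + N0 * np.eye(N)         # autocorrelation matrix
Wk = vd * Hk.conj().T @ np.linalg.inv(Rk)           # MMSE weights, (11.7)
zk = Wk @ yk                                        # combiner output, (11.6)
print(np.round(zk, 2), dk)                          # zk should be close to dk
```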
11.2.1  Estimating the Receive Antenna Weights
Each packet in the IEEE 802.11a system contains a preamble of 2 OFDM symbols for training and other receiver functions. Note that each OFDM symbol consists of 52 subcarriers, and these training symbols are contained in all the subcarriers. Also, each data-carrying OFDM symbol has 4 equally spaced subcarriers dedicated for training purposes. The channel characteristics as well as the interference profile may be assumed to be constant over an OFDM symbol interval, which is 4 μs. However, the characteristics may change over the duration of a packet. Therefore, the antenna weight matrix for the k-th subcarrier, Ŵ_k, should be computed for each OFDM symbol. In order to determine the antenna weight matrix, the N × M channel matrix for the k-th subcarrier, H_k, may be directly estimated from the training subcarriers. Next, using (11.7), the weight matrix is computed. Since, in practice, the maximum number of antennas (N) is limited to not more than about 4, we prefer a direct matrix inversion to compute the weight matrix. Note that, for resolution of the interfering users, M ≤ N. Since the channel characteristics are highly correlated across the subcarriers, the weight matrix for the non-training subcarriers may be obtained using an interpolation technique.
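A minimal sketch of the interpolation idea for a single user follows, assuming hypothetical pilot positions (not the exact 802.11a training-subcarrier assignment): the channel is estimated on the training subcarriers with known symbols and linearly interpolated to the remaining subcarriers, after which the weights follow from (11.7).

```python
import numpy as np

rng = np.random.default_rng(4)
K = 52
pilots = np.array([5, 19, 33, 47])           # assumed pilot positions
# Frequency-correlated channel modeled as a smoothed random walk
H_true = np.cumsum(rng.standard_normal(K) + 1j * rng.standard_normal(K)) / 8

p = np.ones(len(pilots), dtype=complex)      # known training symbols ("1")
noise = 0.05 * (rng.standard_normal(len(pilots))
                + 1j * rng.standard_normal(len(pilots)))
y = H_true[pilots] * p + noise               # received training subcarriers

H_ls = y / p                                 # per-pilot least-squares estimate
# Linear interpolation across subcarriers (real and imaginary parts separately)
k_all = np.arange(K)
H_hat = (np.interp(k_all, pilots, H_ls.real)
         + 1j * np.interp(k_all, pilots, H_ls.imag))
print("interp. MSE:", np.mean(np.abs(H_hat - H_true) ** 2))
```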
11.3  The IEEE 802.11 Distributed Coordination Function with Multiple Packet Reception
In this section, we will briefly describe the MAC layer of the IEEE 802.11 protocol in the DCF mode and propose some modifications to the protocol to accommodate multiple packet reception. As seen from the performance results given in Sect. 11.5, these modifications to the IEEE 802.11 protocol will enhance the throughput of the WLAN system. According to the IEEE 802.11 DCF technique, when a station has a new packet to transmit it monitors the channel for a period equal to the distributed interframe space (DIFS). If the channel is sensed idle during this period the packet is transmitted; else, a collision avoidance technique is adopted which schedules the transmission of the packet at a later time according to a backoff procedure. Upon successful reception of the packet, the destination sends a positive acknowledgment packet (ACK) after a short interframe space, SIFS (< DIFS).
Fig. 11.1. The DCF of the IEEE 802.11 in the basic access method. Stations A and B are contending for the channel. The x-axis denotes the time.
Packet collisions occur when two or more contending stations simultaneously decrement their backoff timers to zero. The contention window, w, varies between CWmin and CWmax and is exponentially increased at each retransmission stage. Suppose the packet is retransmitted for the r-th time, r = 0, 1, 2, ...; then w = min(2^r CWmin, CWmax), where CWmax = 2^rmax CWmin. For the IEEE 802.11a, the values of the contention window w are 16 (CWmin), 32, 64, 128, 256, 512, and 1024 (CWmax). Also, in this system, the slot time σ = 9 μs, DIFS = 34 μs, and SIFS = 16 μs. The above channel access scheme is known as the basic access method, and an example is given in Fig. 11.1. As shown in this figure, when a new packet arrives at Station A, it senses the channel for a DIFS period and transmits the packet at the end of this idle DIFS period. The first event shown in the figure is Station B receiving an ACK to a previously transmitted packet. Before transmitting the next packet, Station B has to go into the backoff mode. Immediately after receiving the ACK, Station B monitors the channel for a DIFS period and chooses a contention window and a backoff time. The backoff timer is decremented unless the medium is sensed busy during the slot time. When the channel is busy, the backoff timer is frozen until a DIFS period after the busy period; in Fig. 11.1 this corresponds to the backoff timer changing from 5 to 4. Finally, Station B sends its packet when the backoff timer reaches zero. Packet collisions occur because two or more stations transmit at the same time. These colliding packets can be resolved using a receiver structure that employs multiple receive antennas. Note that, because the DCF mechanism is slotted, we will assume the start-times of the colliding packets from the stations are the same and the receive times at the destination are within δ, the maximum propagation time in the system.
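The backoff-window selection just described is compact enough to state in code; the uniform draw over the window is the usual reading of the procedure, and the exact draw range is an assumption here.

```python
import numpy as np

CW_MIN, R_MAX = 16, 6            # 802.11a values per the text: 16, ..., 1024
CW_MAX = 2 ** R_MAX * CW_MIN     # = 1024

def backoff_slots(r, rng):
    """Draw a backoff count for retransmission stage r (binary exponential)."""
    w = min(2 ** r * CW_MIN, CW_MAX)     # w = min(2^r CWmin, CWmax)
    return int(rng.integers(0, w))       # assumed uniform draw in [0, w-1]

rng = np.random.default_rng(5)
print([backoff_slots(r, rng) for r in range(7)])   # window doubles up to CWmax
```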
Consider the case when multiple stations are transmitting packets of the same length at the same time to a destination station. Suppose the destination station is equipped with multiple receive antennas, so that it has the capability of detecting more than one packet successfully. The destination station, after a SIFS interval, sends an ACK that contains all the positive acknowledgments of the packets that were detected successfully during the multiple packet transmission slot. However, note that the existing MAC protocol acknowledges only a single packet. The station address in the ACK is 6 bytes long, and the ACK may be made longer to accommodate the addresses of all the stations whose packets were correctly detected. Next, consider the general case when multiple stations transmit packets that are of different lengths. As before, the destination station may detect more than one packet correctly. Since the channel is busy during the time packets are transmitted, the destination station cannot transmit an ACK until the end of the transmission time of the longest packet. The destination station waits for a SIFS period immediately after the end of the transmission of all the packets and transmits an ACK that contains the acknowledgments for all the successfully received packets. According to the DCF, a station expects an ACK immediately after a SIFS period following the transmission of a packet. If an ACK is not received within an ACK timeout period, the packet is scheduled for retransmission. In the present situation with multiple packets of unequal lengths, immediately after the transmission of a packet, all the stations except the station that transmits the longest packet will sense the channel busy. So the stations have to monitor the channel until it is no longer busy and expect an ACK immediately after a channel idle period of SIFS. As before, the ACK should contain the addresses of all the stations that sent successful packets to the destination station. Figure 11.2 depicts a multiple packet case. This occurs because the backoff timers of both Stations A and B reach zero simultaneously. Station B will sense the channel busy immediately after transmitting Packet B1 and expects an ACK only after a SIFS period following the transmission of Packet A1. As shown in the figure, following a successful ACK, Stations A and B will transmit the subsequent packets according to the standard basic access method.
11.4  Throughput Analysis
In this section, we compute the throughput in the multiple packet reception case using the approach given in [10] for the analysis of the DCF in the IEEE 802.11 system. Let us briefly describe the approach adopted in that paper. Denote by p the packet collision probability seen by a station and by τ the probability that a station transmits in a given time slot. Expressions are derived for p and τ in terms of the minimum backoff window CWmin and the maximum backoff stage rmax. Next, the system throughput is determined using p and τ. According to [10], a packet collision occurs when two or more stations transmit simultaneously, and all the packets that were subjected to the collision have to be retransmitted.
Fig. 11.2. The DCF of the IEEE 802.11 in the basic access method modified to accommodate multiple packet transmissions. Stations A and B are contending for the channel and transmit Packets A1 and B1 at the same time. The x-axis denotes the time.
The error probability experienced by a packet is p, and in our analysis this accounts for collisions as well as transmission errors due to receiver noise and fading. In our case, with N receive antennas it is possible to successfully detect up to N simultaneously transmitted packets. Note that in this analysis it is assumed that the system operates in saturation, that is, each one of the stations always has a packet to transmit. Denote by b(t') the stochastic process representing the backoff time counter for a given station and by s(t') the stochastic process representing the backoff stage (0, 1, ..., rmax) of the station at time t'. Note that, as discussed in Sect. 11.3, the backoff stage is the number of times a particular packet is retransmitted. As in [10], the time t' is discrete and does not correspond to the system time. A change from t' → (t' + 1) takes place when the backoff timer changes its value. In other words, the time t' changes at slot time intervals, where the slot time in this section refers to the time interval during which the backoff timer is constant. That is, the slot time could be either the constant time σ or the variable time interval between consecutive changes of the backoff timer. With respect to Fig. 11.1, the slot time could be the time interval σ as shown, or the time interval during which Station B's backoff timer is fixed at 5. It is shown in [10] that the bidimensional process {s(t), b(t)} can be represented by a discrete-time Markov chain. From [10], the probability that a station transmits in a randomly chosen slot time is

τ = 2(1 − 2p) / [(1 − 2p)(CWmin + 1) + p CWmin (1 − (2p)^rmax)] .   (11.8)
Next, consider the multiple packet reception case with a total of M stations in the system. Note that multiple packet transmissions occur because the backoff timers of more than one station may reach zero at the same time. As stated before, we do not consider the multiple packet transmission case caused by hidden terminals. Suppose p_i is the probability of successfully detecting a packet when there are i simultaneous packet transmissions in the system. For a given number of antennas and the system model given in Sect. 11.2, we determine the value of p_i in Sect. 11.5. We will assume that the statistics of the received signals from different stations are the same, so that the average packet-error-rates will be the same for all the packets in a multiple packet environment. In a multiple packet environment, as discussed in Sect. 11.3, the DCF of the IEEE 802.11 protocol should be changed to accommodate the ACKs for all the correctly received packets. However, performance improvements can be achieved even if one stays within the existing protocol, where only one successfully detected packet from a particular station is acknowledged although more than one packet may be successfully detected. We consider the performance of these two cases in the following. To simplify the analysis in this chapter, we consider only the case of equal length packets from all stations. Also, in this analysis we assume that the ACKs arrive without any channel errors. This analysis can be extended to account for errors in the ACK in a straightforward manner.

11.4.1  All Successful Packets Acknowledged
In this subsection, we assume the ACK contains the acknowledgments for all successfully received packets, so that those packets received correctly are not transmitted again. In order to determine the error probability experienced by a packet, p, we compute the probability of success, (1 − p), as seen by a particular packet. We note that a packet transmitted from a given station will be successful if it is received correctly in the presence of i, (i = 0, 1, 2, ..., (M − 1)), active stations, where these i stations can be any of the remaining (M − 1) stations. Note that by active stations we mean those stations that transmit simultaneously. Consider a particular transmit station and a given set of i stations all transmitting at the same time. Then the probability of success as seen by the particular packet is p_{i+1}. Considering all such combinations and summing over all i, we have for the required probability

(1 − p) = Σ_{i=0}^{M−1} C(M−1, i) τ^i (1 − τ)^{M−1−i} p_{i+1} ,   (11.9)

where C(n, k) denotes the binomial coefficient. As in [10], the throughput S is defined as the fraction of time the channel is utilized:

S = E[payload information transmitted in a slot time] / E[length of a slot time] .   (11.10)
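Equations (11.8) and (11.9) couple τ and p through a fixed point that can be solved by simple (here damped) iteration. The sketch below assumes illustrative p_i values taken from Fig. 11.4(i) for N = 2, with p_i = 0 for i > N as noted later for the MMSE combiner; CWmin and rmax follow Sect. 11.5, while M is an assumption.

```python
from math import comb

CW_MIN, R_MAX, M = 8, 5, 10          # CWmin, rmax as in Sect. 11.5; M assumed
p_succ = {1: 0.94, 2: 0.65}          # illustrative p_i (N = 2, Fig. 11.4(i));
                                     # p_i = 0 for i > N with the MMSE combiner

def tau_of_p(p):                     # (11.8)
    num = 2.0 * (1.0 - 2.0 * p)
    den = ((1.0 - 2.0 * p) * (CW_MIN + 1)
           + p * CW_MIN * (1.0 - (2.0 * p) ** R_MAX))
    return num / den

def p_of_tau(tau):                   # collision/error probability from (11.9)
    s = sum(p_succ.get(i + 1, 0.0) * comb(M - 1, i)
            * tau**i * (1.0 - tau) ** (M - 1 - i) for i in range(M))
    return 1.0 - s

p = 0.3                              # initial guess
for _ in range(200):                 # damped fixed-point iteration
    p = 0.5 * p + 0.5 * p_of_tau(tau_of_p(p))
    p = min(max(p, 0.0), 0.49)       # stay clear of the 2p = 1 singularity
print("tau =", tau_of_p(p), ", p =", p)
```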
Fig. 11.3. The packet lengths for successfully transmitted packets (Ts ) and for colliding packets (Tc ). PHY and MAC are the size of headers for PHY and MAC layers and PAYLOAD and ACK are the respective lengths of the payload and the ACK.
In order to compute the numerator of S, denote by P the length of the payload of a packet. Suppose i packets are transmitted in a multiple packet situation and j of these packets are successfully detected. Then the average length of the packet payload delivered is jP C(i, j) p_i^j (1 − p_i)^{i−j}. Summing this over all j, we have for the average length of the payload delivered Σ_{j=1}^{i} jP C(i, j) p_i^j (1 − p_i)^{i−j}. Since the probability of i active stations from a set of M stations is C(M, i) τ^i (1 − τ)^{M−i}, we have

E[payload information transmitted in a slot time] = Σ_{i=1}^{M} C(M, i) τ^i (1 − τ)^{M−i} Σ_{j=1}^{i} jP C(i, j) p_i^j (1 − p_i)^{i−j} .   (11.11)

Next, to compute E[length of a slot time], we have to determine the probability of successfully transmitting a packet from at least a single terminal, which is denoted by P_s. Suppose there are i packets in the system; then the probability that none of the packets is successfully detected is (1 − p_i)^i. The probability of i packets occurring in a system with M stations is C(M, i) τ^i (1 − τ)^{M−i}. Therefore, we have for the probability that none of the packets is received correctly

1 − P_s = Σ_{i=0}^{M} C(M, i) τ^i (1 − τ)^{M−i} (1 − p_i)^i .   (11.12)

Now, consider the average value of the slot time. It can be seen that with probability (1 − τ)^M the slot is idle; with probability P_s at least one of the packets is successfully detected; and with probability (1 − P_s − (1 − τ)^M) it is subject to a collision, that is, none of the packets is successfully detected. As discussed earlier, in this case the length of the ACK depends on the number of packets acknowledged.
Suppose the ACK contains acknowledgments for j packets; then the length of a packet during a successful transmission is T_s(j), and when a collision occurs it is T_c; these are shown in Fig. 11.3. The dependence of T_s on j is because of the additional length of the ACK needed to acknowledge j packets. Denote by p_s(i, j) = C(i, j) p_i^j (1 − p_i)^{i−j} the probability of successfully detecting j packets in the presence of i received packets. Then we may write for the expected slot time

E[slot time | i, j] = T_s(j) p_s(i, j) = T_s(j) C(i, j) p_i^j (1 − p_i)^{i−j} .   (11.13)
To obtain E[slot time], (11.13) has to be averaged over all i and j. Therefore, we have

E[slot time] = (1 − τ)^M σ + Σ_{i=1}^{M} C(M, i) τ^i (1 − τ)^{M−i} Σ_{j=1}^{i} T_s(j) C(i, j) p_i^j (1 − p_i)^{i−j} + (1 − P_s − (1 − τ)^M) T_c ,   (11.14)

where σ is the constant length of the slot time. The throughput S as given in (11.10) is the ratio of (11.11) to (11.14). As shown in Fig. 11.3, the values of T_s(j) and T_c are given as

T_s(j) = H + P + SIFS + δ + ACK(j) + DIFS + δ ,   (11.15)
T_c = H + P + DIFS + δ ,   (11.16)

where H is the length of the PHY and MAC headers, P is the payload, δ is the maximum propagation delay, and ACK(j) is the length of the ACK when j packets are acknowledged.
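The throughput of this scheme can be evaluated directly from (11.11), (11.12), and (11.14)-(11.16). The sketch below does so with the payload expressed in transmission time, using the parameter values of Sect. 11.5 where stated there and illustrative assumptions (M, τ, the p_i values) otherwise.

```python
from math import comb

M, tau, rate = 10, 0.1, 12.0                  # stations, tx prob., Mb/s (assumed)
p_succ = {1: 0.94, 2: 0.65}                   # p_i for N = 2, Fig. 11.4(i)
sigma, SIFS, DIFS, delta = 9.0, 16.0, 34.0, 1.0   # microseconds, Sect. 11.5
H_b, P_b = 272 + 128, 8000 - (272 + 128)      # header and payload, bits

def Ts(j):                                    # (11.15); ACK(j) = 240*j bits
    return (H_b + P_b) / rate + SIFS + delta + 240 * j / rate + DIFS + delta

Tc = (H_b + P_b) / rate + DIFS + delta        # (11.16)
pi = lambda i: p_succ.get(i, 0.0)             # p_i = 0 for i > N

num = den = Ps = 0.0
for i in range(1, M + 1):
    Pi = comb(M, i) * tau**i * (1 - tau) ** (M - i)
    Ps += Pi * (1 - (1 - pi(i)) ** i)         # success probability, via (11.12)
    for j in range(1, i + 1):
        ps_ij = comb(i, j) * pi(i) ** j * (1 - pi(i)) ** (i - j)
        num += Pi * j * (P_b / rate) * ps_ij  # (11.11), payload in time units
        den += Pi * Ts(j) * ps_ij             # successful-slot part of (11.14)
den += (1 - tau) ** M * sigma + (1 - Ps - (1 - tau) ** M) * Tc
print("throughput S =", num / den)
```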
11.4.2  Only a Single Successful Packet Acknowledged
The receiving station may decode several packets correctly; however, the ACK in this case can carry the acknowledgment for only a single successful packet. We will assume that the packet that is acknowledged is chosen randomly and uniformly from all the successfully received packets. The unacknowledged packets, even though they were detected correctly, have to be retransmitted. Let us examine the probability of success, (1 − p), as seen by a packet at a given transmit station denoted by m. Suppose, in addition to m, i stations out of the remaining (M − 1) stations are transmitting. A given station knows that its packet is successful only if the packet is detected correctly and the destination station has decided to acknowledge that particular packet. Say, in addition to the packet received correctly from m,
j of the i transmitted packets are received correctly; this probability is C(i, j) p_{i+1}^j (1 − p_{i+1})^{i−j}. The receive station will acknowledge only one of the (j + 1) correctly received packets, each with probability 1/(j + 1), so in this situation the probability that the m-th transmit station will receive an ACK is p_{i+1}/(j + 1). Using the above, it can be seen that

(1 − p) = Σ_{i=0}^{M−1} C(M−1, i) τ^i (1 − τ)^{M−1−i} Σ_{j=0}^{i} [p_{i+1}/(j + 1)] C(i, j) p_{i+1}^j (1 − p_{i+1})^{i−j} .   (11.17)

Next, let us proceed to compute the average length of the payload. In this case, unlike in Sect. 11.4.1, the length of the payload successfully delivered in a time slot is P, irrespective of the number of correctly received packets. To determine the numerator of (11.10), we see that when i stations are transmitting in the system, at least one of them should be correctly received to deliver the payload successfully. Therefore, it can be seen that

E[payload information transmitted in a slot time] = P Σ_{i=1}^{M} C(M, i) τ^i (1 − τ)^{M−i} (1 − (1 − p_i)^i) .   (11.18)
As in Sect. 11.4.1, the probability of successful packet transmission from any terminal, P_s, is given in (11.12), and E[slot time] can be derived as

E[slot time] = (1 − τ)^M σ + P_s T_s + (1 − P_s − (1 − τ)^M) T_c .   (11.19)

Using these, we have for the throughput

S = P Σ_{i=1}^{M} C(M, i) τ^i (1 − τ)^{M−i} (1 − (1 − p_i)^i) / [(1 − τ)^M σ + P_s T_s + (1 − P_s − (1 − τ)^M) T_c] .   (11.20)
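For the single-ACK scheme, (11.20) is simpler to evaluate; a self-contained sketch under the same illustrative assumptions as before follows.

```python
from math import comb

M, tau, rate = 10, 0.1, 12.0                   # stations, tx prob., Mb/s (assumed)
p_succ = {1: 0.94, 2: 0.65}                    # p_i for N = 2, Fig. 11.4(i)
sigma, SIFS, DIFS, delta = 9.0, 16.0, 34.0, 1.0
H_b, P_b, ACK_b = 400, 7600, 240               # header/payload/ACK sizes, bits

T_P = P_b / rate                               # payload transmission time, us
Ts = (H_b + P_b) / rate + SIFS + delta + ACK_b / rate + DIFS + delta   # (11.15)
Tc = (H_b + P_b) / rate + DIFS + delta                                 # (11.16)

def P_i(i):                                    # probability of i active stations
    return comb(M, i) * tau**i * (1 - tau) ** (M - i)

def succ(i):                                   # prob. at least one of i succeeds
    return 1 - (1 - p_succ.get(i, 0.0)) ** i

Ps = sum(P_i(i) * succ(i) for i in range(1, M + 1))          # via (11.12)
num = sum(T_P * P_i(i) * succ(i) for i in range(1, M + 1))   # (11.18), in time
den = (1 - tau) ** M * sigma + Ps * Ts + (1 - Ps - (1 - tau) ** M) * Tc  # (11.19)
print("throughput S =", num / den)             # (11.20)
```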
11.5  Performance Results
We use simulations to determine p_i, i = 1, 2, ..., M, the probability of correctly receiving any one of the packets in the presence of N receive antennas and i simultaneous packets. These values are then used in the throughput expressions derived in Sects. 11.4.1 and 11.4.2. For the simulations we consider the IEEE 802.11a PHY layer with a rate-1/2 channel code and 4PSK modulation, giving an effective data rate of 12 Mb/s. The antenna combining scheme is as discussed in Sect. 11.2. We consider two channel models: a 3-path Rayleigh fading model with an exponentially decaying signal strength and delays uniform in (0, 500 ns); and an independently fading Rayleigh channel where the channel coefficients H^m_n(k) in (11.4) are independently and identically distributed for different subcarrier components k. Also, in the simulations we assumed the
channel coefficients are statistically independent for different users and receive antennas; that is, for a Rayleigh fading channel, E{H^{m1}_{n1}(k) H^{m2 *}_{n2}(k)} = E{|H^{m1}_{n1}(k)|²} δ(m1 − m2) δ(n1 − n2). Figure 11.4 depicts the packet-error-rates (PERs) for these two channel models. In these simulations the size of each packet is 8000 bits, and we assumed power control such that the received signal strengths of all the M users are the same. Generalization to the case of unequal received power levels for the different users is straightforward and will not be dealt with here. Note that with the MMSE antenna combiner PER = 1 for N < M.

Fig. 11.4. Packet-error-rate for the 12 Mb/s channel-coded IEEE 802.11a PHY layer with different numbers of receive antennas (N) and simultaneous users (M). Channel models: (i) 3-path Rayleigh channel with exponentially decaying signal strength profile; (ii) independent Rayleigh fading across subcarrier components.

In order to compute the throughput, we consider a signal-to-noise ratio (SNR) level and determine the values of p_i for different numbers of receive antennas. Let us consider an SNR per diversity branch of 4 dB. From Fig. 11.4(i), for the 3-path Rayleigh channel, for N = 3, p_1 = 0.996, p_2 = 0.95, p_3 = 0.73; for N = 2, p_1 = 0.94, p_2 = 0.65; and for N = 1, p_1 = 0.65. In Fig. 11.5 we show the throughput results computed using the scheme given in Sect. 11.4.1. The parameters used in the simulations are as follows: CWmin = 8, rmax = 5, σ = 9 μs; and for the T_s and T_c computation, H = (272+128) bits, (P + H) = 8000 bits, ACK(j) = (112+128)·j bits, δ = 1 μs, SIFS = 16 μs and DIFS = 34 μs. It is seen that the throughput can be significantly improved using multiple receive antennas. As seen from this figure, when the packet transmission probability τ is 0.1, with 15 contending stations and a single receive antenna the throughput is 0.27. The corresponding throughput with two and three antennas is 0.8 and 1.25, respectively. Note that in the multiple packet case more than one packet can be successfully detected and delivered to the higher layers, so the throughput defined as the channel utilization can be more than one. Figure 11.6 shows the results for the throughput computed according to the scheme discussed in Sect. 11.4.2.
Fig. 11.5. Throughput (S) from Sect. 11.4.1 as a function of the packet transmission probability, τ, for different numbers of stations, M, and receive antennas N. (i) N = 1, (ii) N = 2, (iii) N = 3.

In this case, only a single successfully detected packet is delivered to the higher layers even though more than one packet may be successfully detected. So, unlike in the previous case, the throughput is always less than one. It can be seen that the throughput at τ = 0.1 for M = 15 and N = 2 is 0.65. The corresponding throughput for three receive antennas is about 0.83. For the single antenna case the throughput is very low, about 0.27; that is, even a two-antenna receiver will give rise to very large performance gains. The throughput as a function of the number of contending stations is depicted in Fig. 11.7. It can be seen from Fig. 11.7(i) that for N = 2 and N = 3 the throughput actually increases with the number of stations before starting to decrease. This may be explained as follows. For a given number of receive antennas, N, it is possible to successfully detect up to N multiple packets. The number of multiple packets at a given instant increases with the number of contending stations. When the number of contending stations is small, the number of multiple packets will be smaller than N most of the time. Since the receiver has the capability of successfully detecting up to N multiple packets, increasing the number of contending stations will help to increase the number of correctly detected packets and, hence, the throughput. On the other hand, when the number of contending stations is large, the number of multiple packets will be more than N most of the time.
Fig. 11.6. Throughput (S) from Sect. 11.4.2 as a function of the packet transmission probability, τ, for different numbers of stations, M, and receive antennas N. (i) N = 2, (ii) N = 3.
Fig. 11.7. Throughput (S) as a function of the number of stations, M, for different receive antennas, N. (i) S as in Sect. 11.4.1. (ii) S as in Sect. 11.4.2.
As seen from Fig. 11.7, multiple receive antennas can significantly improve the throughput as well as increase the number of stations in the system. For example, with a single antenna and 10 stations, the throughput obtainable is about 0.5. In a two-antenna system the number of stations can be increased to 60 for a throughput of about 0.65, and more gains are achieved in a three-antenna system.
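This increase-then-decrease behavior can be illustrated with a small Monte Carlo sketch of our own (it is not the throughput expression of Sect. 11.4.1). It assumes the simplified capture model described above: all k simultaneous packets are delivered when k ≤ N and none otherwise, ignoring the residual per-packet error rate and the DCF slot-time accounting (σ, Ts, Tc). It therefore reproduces the trend, not the exact curves of Fig. 11.7.

```python
import numpy as np

# Toy model: each of M stations transmits in a slot with probability tau;
# an N-antenna receiver delivers all k packets when 1 <= k <= N, and none
# otherwise. Per-packet detection errors are ignored here.
rng = np.random.default_rng(0)
tau, n_slots = 0.1, 200_000

for N in (1, 2, 3):                        # receive antennas
    for M in (5, 15, 40):                  # contending stations
        k = rng.binomial(M, tau, n_slots)  # simultaneous transmissions
        delivered = np.where(k <= N, k, 0) # packets delivered per slot
        print(f"N={N}, M={M:2d}: {delivered.mean():.3f} packets/slot")
```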
11.6 Conclusions
In this chapter, we investigated the case of multiple packet reception in the IEEE 802.11a system with multiple receive antennas. It was shown that significant performance gains can be obtained with multiple receive antennas.
In this work we have used an MMSE antenna combiner, which can resolve a maximum of N simultaneous transmissions, where N is the number of receive antennas. Better performance may be obtained using more advanced detection schemes, such as iterative processing techniques, although at increased complexity. In the simulation results we considered a power-controlled system where all the packets arrive at the receiver at the same power level; note that the underlying analysis does not depend on this assumption. The multiple packet transmissions considered in this chapter arise because the backoff timers of more than one station reach zero at the same time. It is also possible to transmit multiple packets in a deliberate and controlled manner to increase the throughput. This aspect requires further study.
Acknowledgment

Discussions with Eric Yang and input from Oliver Meili regarding the parameters of the IEEE 802.11a system are gratefully acknowledged.
12 Adaptive Estimation of Clock Skew and Different Types of Delay in the Internet Network

Jacob Benesty
Bell Laboratories, Lucent Technologies
Murray Hill, NJ 07974, USA
E-mail: [email protected]
Abstract. The end-to-end delay is often used to analyze network performance. There are different types of delay in the network: (artificial) delay due to unsynchronized clocks, transmission and propagation delays, and delay jitter. First, we show how to obtain a least-squares estimate of the clock skew (i.e., the difference between the sender and receiver clock frequencies) and the fixed delay. We then show that the linear programming algorithm proposed elsewhere to estimate the clock ratio is equivalent to the maximum likelihood estimator if the delay jitter is modeled as an independent, identically distributed random process with an exponential probability density function. Finally, we show how to estimate the delay jitter and propose an unbiased recursive least-squares algorithm to estimate the clock skew and fixed delay.
12.1 Introduction
Clock skew is caused by a difference in clock frequency. This problem almost always appears when there is communication between two systems, each of which has its own clock. Because of variations in the quartz crystal and other oscillator components that regulate the clock frequency, we usually observe some variation between nominally similar clocks [1]. In the context of network delay measurement, the clock skew introduces an artificial delay, so it is important to remove its effect in order to obtain an accurate estimate of the end-to-end delay, which can be used in analyzing network performance [2]. In Internet audio applications, the difference in clock frequencies also has undesirable effects [1], [3]. For instance, if the sender's clock frequency is higher than the receiver's, the receiver will systematically receive more audio samples than it is able to replay according to its own clock, which leads to systematic overfill of its buffer. Conversely, if the sender's clock frequency is lower than the receiver's, the receiver will systematically find itself lacking audio samples to replay. In addition to the artificial delay introduced by clock skew, the end-to-end delay measurement is composed of transmission and propagation delays plus a variable queuing delay. This variable queuing delay, known as jitter, means that the time difference between transmitting any two packets at the sender is
unlikely to be the same as that observed between their arrivals at the receiver [4]. Hence, this delay jitter should be modeled as a random process [3]. The real-time protocol (RTP) [5], as its name implies, supports the transport of real-time media content (e.g., audio and video) over packet networks. An RTP header contains very useful information such as the sequence number and timestamp of a packet. This information is used by the receiver to reconstruct the source and to determine the end-to-end delay. In the rest of this chapter, we suppose that such a protocol is used. Borrowing some ideas from [2] and [3], we show several ways to estimate the clock skew and the various delays in the network. In particular, we propose an unbiased recursive least-squares algorithm. This chapter is organized as follows. Section 12.2 introduces the terminology and formulates the problem. In Sect. 12.3, we give a convenient model for the delay jitter. Section 12.4 explains the least-squares estimator and how it can be used in this application. Section 12.5 explains the maximum likelihood estimator and its equivalence with linear programming methods. In Sect. 12.6, we show how to adaptively estimate the delay jitter and propose an unbiased recursive least-squares algorithm to estimate the clock skew and fixed delay. Section 12.7 shows some simulations. Finally, we give our conclusions in Sect. 12.8.
12.2 Terminology and Problem Formulation
We suppose that there is a connection between two hosts: the sender and the receiver. Each host has its own clock. The sender adds a timestamp (corresponding to its own clock) to a packet when it leaves, and the receiver records the time the packet arrives according to its clock. Let us first introduce the terminology used in this chapter:

• $c_s$: sender clock.
• $c_r$: receiver clock.
• $N$: number of packets that arrive at the receiver.
• $t_s^s(n)$: timestamp of the nth packet leaving the sender according to $c_s$.
• $t_s^r(n)$: timestamp of the nth packet leaving the sender according to $c_r$.
• $t_r^r(n)$: timestamp of the nth packet arriving at the receiver according to $c_r$.
• $t_r^s(n)$: timestamp of the nth packet arriving at the receiver according to $c_s$.

Note that $t_r^r(n)$ and $t_s^s(n)$ are known, which is not the case for $t_r^s(n)$ and $t_s^r(n)$.
The timestamp of the nth packet arriving at the receiver according to $c_s$ can be written explicitly as:
$$t_r^s(n) = t_s^s(n) + d_f^s + v^s(n), \qquad (12.1)$$
where $d_f^s$ is the (fixed) transmission and propagation delay, so for all the packets transmitted during a connection this delay is the same. The random variable $v^s(n)$ characterizes the extra delay, called delay jitter, added by the network (due to queuing) for packet $n$. We suppose in the following that there is no initial offset (difference in time) between the clocks and that only the frequencies differ. One-way measurements alone do not give enough information to distinguish the clock offset from the fixed delay. If the two clocks progress at different frequencies (that is, have a clock skew), $t_s^r(n) \neq t_s^s(n)$ and $t_r^r(n) \neq t_r^s(n)$. There is a simple relation between the clock skew and the clock ratio [2]: when one is known, the other is easily determined. Now, the timestamp of the nth packet arriving at the receiver according to $c_r$ is:
$$t_r^r(n) = \alpha t_r^s(n) = \alpha t_s^s(n) + \alpha d_f^s + \alpha v^s(n) = \alpha t_s^s(n) + d_f^r + v^r(n), \qquad (12.2)$$
where $\alpha$ is the clock ratio, which is constant if $c_s$ and $c_r$ have constant frequencies. The parameter $\alpha$ can be smaller or greater than one but is in general very close to one. Our objective is to estimate $\alpha$, $d_f^r$, and $v^r(n)$ given $t_r^r(n)$ and $t_s^s(n)$ over $N$ packets, $n = 1, 2, \ldots, N$. The end-to-end delay of the nth packet consistent with $c_r$ is:
$$d^r(n) = t_r^r(n) - t_s^r(n) = t_r^r(n) - \alpha t_s^s(n) = d_f^r + v^r(n). \qquad (12.3)$$
However, this delay is not known at the receiver; only the end-to-end delay measurement of the nth packet can be determined:
$$\tilde{d}(n) = t_r^r(n) - t_s^s(n) = (\alpha - 1)t_s^s(n) + d_f^r + v^r(n) = (\alpha - 1)t_s^s(n) + d^r(n), \qquad (12.4)$$
which is consistent with neither $c_r$ nor $c_s$. In (12.4), $\tilde{d}(n)$ differs from $d^r(n)$ by $(\alpha - 1)t_s^s(n)$. If $\alpha > 1$, $(\alpha - 1)t_s^s(n)$ grows linearly with $t_s^s(n)$, and thus $\tilde{d}(n)$ gets larger. Let $\hat{\alpha}$ be an estimate of $\alpha$. From the end-to-end delay measurement, an estimate of the end-to-end delay $d^r(n)$ is easily obtained:
$$\hat{d}^r(n) = \tilde{d}(n) - (\hat{\alpha} - 1)t_s^s(n). \qquad (12.5)$$
It is clear from the previous discussion that it is natural and convenient to estimate all the variables according to the receiver clock, $c_r$. As an example, we show in Fig. 12.1 a simulation of the end-to-end delay measurement with no clock skew ($\alpha = 1$) [Fig. 12.1(a)] and with clock skew $\alpha = 1.0001$ [Fig. 12.1(b)]. In the figures, the x-axis is the sender timestamp (in seconds) and the y-axis is the delay (in seconds), calculated by subtracting the sender timestamp from the receiver timestamp of each packet. Clearly, the effect of the skew cannot be neglected, since the delay increases linearly with the sender timestamp.

Fig. 12.1. End-to-end delay measurement with (a) no clock skew (α = 1) and (b) with clock skew α = 1.0001.
12.3 Delay Jitter Model
In this chapter, the delay jitter, $v^r(n)$, is modeled as an independent, identically distributed (i.i.d.) random process with exponential probability density function (pdf):
$$f(v^r) = u(v^r)\,\lambda \exp(-\lambda v^r), \qquad (12.6)$$
where $u(v^r)$ is the unit step function:
$$u(x) = \begin{cases} 1, & x \geq 0, \\ 0, & x < 0. \end{cases}$$
The mean and variance of the exponential distribution are, respectively, $E\{v^r\} = 1/\lambda$ and $E\{(v^r - E\{v^r\})^2\} = 1/\lambda^2$, where $E\{\cdot\}$ denotes mathematical expectation. The exponential distribution is often used to model queuing times in a service queue, and several studies show that the variable part of the delay caused by an individual router is reasonably well modeled with this pdf [2], [3].
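To make the model concrete, here is a minimal Python sketch (our own construction, not part of the chapter) that generates end-to-end delay measurements according to (12.4) with the exponential jitter of (12.6). The parameter values match the simulation settings of Sect. 12.7, and the variables t_s and d_meas are reused in the later sketches.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed, for reproducibility
alpha = 1.0001                        # clock ratio
d_f   = 0.3                           # fixed delay d_f^r (s)
p     = 0.04                          # packetization time (s)
lam   = 20.0                          # jitter rate: mean jitter 1/lam = 0.05 s
N     = 10_000                        # number of packets

t_s    = p * np.arange(1, N + 1)      # sender timestamps t_s^s(n) = np
v_r    = rng.exponential(1.0 / lam, N)       # exponential jitter, (12.6)
d_meas = (alpha - 1.0) * t_s + d_f + v_r     # measurements d~(n), (12.4)
```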
12.4
The Least-Squares (LS) Estimator
The least-squares (LS) estimator is widely used in practice because it is easy to implement. It is derived from the minimization of a least-squares error
criterion. Let us define the error signal for packet $n$:
$$e(n) = \tilde{d}(n) - \hat{\tilde{d}}(n), \qquad (12.7)$$
where
$$\hat{\tilde{d}}(n) = (\hat{\alpha} - 1)t_s^s(n) + \hat{d}_f^r \qquad (12.8)$$
is an estimate of $\tilde{d}(n)$ and $\hat{d}_f^r$ is an estimate of $d_f^r$. Here, the delay jitter is considered as noise. Equation (12.7) can be rewritten in a more compact form:
$$e(n) = \tilde{d}(n) - \mathbf{x}^T(n)\mathbf{h}, \qquad (12.9)$$
where
$$\mathbf{x}(n) = \begin{bmatrix} x_1(n) & 1 \end{bmatrix}^T = \begin{bmatrix} t_s^s(n) & 1 \end{bmatrix}^T,$$
$$\mathbf{h} = \begin{bmatrix} h_1 & h_2 \end{bmatrix}^T = \begin{bmatrix} \hat{\alpha} - 1 & \hat{d}_f^r \end{bmatrix}^T,$$
and superscript $T$ denotes the transpose of a vector or a matrix. Consider the following cost function:
$$J(\mathbf{h}) = \sum_{n=1}^{N} e^2(n). \qquad (12.10)$$
The minimization of (12.10) with respect to $\mathbf{h}$ gives the LS estimator:
$$\mathbf{h}(N) = \mathbf{R}^{-1}(N)\mathbf{r}(N), \qquad (12.11)$$
called the normal equation, where
$$\mathbf{R}(N) = \sum_{n=1}^{N} \mathbf{x}(n)\mathbf{x}^T(n), \qquad (12.12)$$
$$\mathbf{r}(N) = \sum_{n=1}^{N} \mathbf{x}(n)\tilde{d}(n). \qquad (12.13)$$
If the sender starts sending packets at time 00:00 and sends them at a constant packetization time $p$, i.e., every $p$ seconds (which means that $t_s^s(n) - t_s^s(n-1) = p$ is constant), we have $t_s^s(n) = np$, $n = 1, 2, \ldots, N$. In this case, (12.12) has the following form:
$$\mathbf{R}(N) = \begin{bmatrix} \dfrac{p^2 N(N+1)(2N+1)}{6} & \dfrac{pN(N+1)}{2} \\[2mm] \dfrac{pN(N+1)}{2} & N \end{bmatrix}. \qquad (12.14)$$
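As a quick numerical sanity check (ours, under the same constant-packetization assumption), the closed form (12.14) can be verified against the definition (12.12):

```python
import numpy as np

p, N = 0.04, 10_000
t_s = p * np.arange(1, N + 1)
X = np.column_stack([t_s, np.ones(N)])   # rows are x^T(n), cf. (12.9)
R_sum = X.T @ X                          # direct evaluation of (12.12)
R_closed = np.array([[p**2 * N*(N+1)*(2*N+1)/6, p * N*(N+1)/2],
                     [p * N*(N+1)/2,            N            ]])
assert np.allclose(R_sum, R_closed)      # (12.14) matches the summation
```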
Using (12.11), (12.13), (12.4), and (12.14), it can easily be shown that:
$$E\{\mathbf{h}(N)\} = \begin{bmatrix} \alpha - 1 \\ d_f^r \end{bmatrix} + \begin{bmatrix} 0 \\ 1/\lambda \end{bmatrix}, \qquad (12.15)$$
which means that the LS estimator is unbiased for $\alpha$ but biased for $d_f^r$, with a bias equal to the mean of the delay jitter. If we are only interested in finding an estimate of the end-to-end delay, $d^r(n)$, the fact that $\hat{d}_f^r$ is biased has no influence on the result, according to (12.5). In practice, it might be more convenient to compute the normal equation recursively. The resulting algorithm, recursive least-squares (RLS) [6], is summarized below:
$$\mathbf{R}(n) = \mathbf{R}(n-1) + \mathbf{x}(n)\mathbf{x}^T(n), \qquad (12.16)$$
$$e(n) = \tilde{d}(n) - \mathbf{x}^T(n)\mathbf{h}(n-1), \qquad (12.17)$$
$$\mathbf{h}(n) = \mathbf{h}(n-1) + \mathbf{R}^{-1}(n)\mathbf{x}(n)e(n). \qquad (12.18)$$
(With the RLS algorithm, $\mathbf{R}^{-1}(n)$ is usually also recursively updated, but that is not necessary here for the simple 2 × 2 case.) The other advantage of using an adaptive algorithm is its ability to quickly track changes in $\mathbf{h}$. Indeed, it may happen that the clocks do not have constant frequencies (non-zero clock drift); in this case, $\alpha$ will vary slowly with time.
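A minimal Python sketch of the recursion (12.16)-(12.18) follows. The synthetic data match the earlier sketch; the small initial regularization of R is our own implementation detail, not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, d_f, p, lam, N = 1.0001, 0.3, 0.04, 20.0, 10_000
t_s = p * np.arange(1, N + 1)
d_meas = (alpha - 1.0) * t_s + d_f + rng.exponential(1.0 / lam, N)

h = np.zeros(2)                        # h = [alpha_hat - 1, d_f_hat]^T
R = 1e-6 * np.eye(2)                   # small regularization of R(0) (ours)
for n in range(N):
    x = np.array([t_s[n], 1.0])
    R += np.outer(x, x)                # (12.16)
    e = d_meas[n] - x @ h              # (12.17)
    h += np.linalg.solve(R, x * e)     # (12.18)

# Per (12.15): h[0] -> alpha - 1 (unbiased); h[1] -> d_f^r + 1/lam (biased).
print(1.0 + h[0], h[1])
```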
12.5 The Maximum Likelihood (ML) Estimator and Linear Programming
When the noise signal is assumed to be zero-mean, white, and Gaussian, the least-squares estimator is equivalent to the maximum likelihood (ML) estimator and hence is asymptotically efficient. Here, however, it is more natural to use the ML estimator, since the noise (delay jitter) is better modeled with an exponential distribution, as discussed in Sect. 12.3. The nice asymptotic properties of the ML estimator (being unbiased and achieving the Cramér-Rao lower bound) make it a powerful tool worth trying in any estimation problem where the noise is non-Gaussian. Let us first define an error vector:
$$\mathbf{e} = \begin{bmatrix} e(1) & e(2) & \cdots & e(N) \end{bmatrix}^T.$$
Because of the i.i.d. assumption, the joint probability density (parameterized by $\mathbf{h}$) of the whole vector $\mathbf{e}$ is the product of the marginal densities:
$$f(\mathbf{e}) = \prod_{n=1}^{N} u[e(n)]\,\lambda \exp[-\lambda e(n)] = \lambda^N \exp\left[-\lambda \sum_{n=1}^{N} e(n)\right] \prod_{n=1}^{N} u[e(n)]. \qquad (12.19)$$
The ML estimator is obtained by maximizing $f(\mathbf{e})$, which is equivalent to the linear programming problem:
$$\max\left[-\sum_{n=1}^{N} e(n)\right] = \min \sum_{n=1}^{N} e(n) = \min\left[-\sum_{n=1}^{N} \mathbf{x}^T(n)\mathbf{h}\right] \qquad (12.20)$$
subject to $\mathbf{x}^T(n)\mathbf{h} \leq \tilde{d}(n)$, $n = 1, 2, \ldots, N$. The $N$ constraints in the equation above come from the likelihood function in (12.19), which for $f(\mathbf{e}) \neq 0$ requires $e(n) \geq 0$, $\forall n \geq 1$. Thus, because of the exponential distribution, it turns out that the ML estimator is simply a classical linear programming problem. Hence, the intuition given in [2] to use a linear programming algorithm instead of the LS estimator to estimate the clock ratio $\alpha$ is clearly well justified.
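Below is a sketch of the ML estimator solved as the linear program (12.20), using SciPy's linprog rather than the MATLAB "linprog" mentioned in Sect. 12.7. The synthetic data are generated as before; the explicit free-sign bounds on h are needed because linprog defaults to nonnegative variables.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
alpha, d_f, p, lam, N = 1.0001, 0.3, 0.04, 20.0, 5_000
t_s = p * np.arange(1, N + 1)
d_meas = (alpha - 1.0) * t_s + d_f + rng.exponential(1.0 / lam, N)
X = np.column_stack([t_s, np.ones(N)])   # rows are x^T(n)

c = -X.sum(axis=0)                       # minimize -sum_n x^T(n) h, (12.20)
res = linprog(c, A_ub=X, b_ub=d_meas,    # constraints x^T(n) h <= d~(n)
              bounds=[(None, None)] * 2)
alpha_hat, d_f_hat = 1.0 + res.x[0], res.x[1]
```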
12.6 An Unbiased RLS Algorithm
We start this section by showing how to adaptively estimate the delay jitter of the nth packet from the previous one. Recall that the end-to-end delay measurements of the nth and (n−1)th packets are, respectively:
$$\tilde{d}(n) = (\alpha - 1)t_s^s(n) + d_f^r + v^r(n), \qquad (12.21)$$
$$\tilde{d}(n-1) = (\alpha - 1)t_s^s(n-1) + d_f^r + v^r(n-1). \qquad (12.22)$$
Taking the difference of (12.21) and (12.22), we get:
$$v^r(n) = v^r(n-1) + \tilde{d}(n) - \tilde{d}(n-1) - (\alpha - 1)\left[t_s^s(n) - t_s^s(n-1)\right], \qquad (12.23)$$
and if the sender transmits packets at a constant packetization time $p$, the previous equation simplifies to
$$v^r(n) = v^r(n-1) + \tilde{d}(n) - \tilde{d}(n-1) - p(\alpha - 1). \qquad (12.24)$$
Note that (12.24) depends on the clock skew but is independent of the fixed delay. The advantage of using a recursive algorithm is that even if the parameter $\alpha$ is not known, it can be replaced by its estimate. Hence, we propose the following simple recursion to estimate the delay jitter for the nth packet:
$$\hat{v}^r(n) = \left| \hat{v}^r(n-1) + \tilde{d}(n) - \tilde{d}(n-1) - p\,h_1(n-1) \right|, \qquad (12.25)$$
where $h_1(n-1) = \hat{\alpha}(n-1) - 1$ is computed with the RLS algorithm. Note that the absolute value in (12.25) is necessary in order to avoid negative delays.
A good estimate of the delay jitter allows us to improve the RLS algorithm. Indeed, by simply subtracting $\hat{v}^r(n)$ from the error signal in (12.17), the RLS algorithm is, in principle, made unbiased. Thus, the complete unbiased RLS algorithm is:
$$\hat{v}^r(n) = \left| \hat{v}^r(n-1) + \tilde{d}(n) - \tilde{d}(n-1) - \left[x_1(n) - x_1(n-1)\right] h_1(n-1) \right|, \qquad (12.26)$$
$$\mathbf{R}(n) = \mathbf{R}(n-1) + \mathbf{x}(n)\mathbf{x}^T(n), \qquad (12.27)$$
$$e_0(n) = \tilde{d}(n) - \mathbf{x}^T(n)\mathbf{h}(n-1) - \hat{v}^r(n), \qquad (12.28)$$
$$\mathbf{h}(n) = \mathbf{h}(n-1) + \mathbf{R}^{-1}(n)\mathbf{x}(n)e_0(n). \qquad (12.29)$$
This algorithm has very low complexity; it requires only a few operations to estimate the different parameters each time a new packet is received. In contrast, the complexity of the linear programming algorithm is at least of order $n$ for the nth packet received. This means that if the total number of packets is $N$ and we need to estimate the parameters each time a new packet is received, the total complexity will be at least of order $N(N+1)/2$.
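A sketch of the unbiased RLS (12.26)-(12.29) on the same synthetic data follows; as before, the initialization of R is our own choice, and $\hat{v}^r(0) = 0$ as in Sect. 12.7.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, d_f, p, lam, N = 1.0001, 0.3, 0.04, 20.0, 10_000
t_s = p * np.arange(1, N + 1)
d_meas = (alpha - 1.0) * t_s + d_f + rng.exponential(1.0 / lam, N)

h = np.zeros(2)                        # h = [alpha_hat - 1, d_f_hat]^T
R = 1e-6 * np.eye(2)                   # small regularization of R(0) (ours)
v_hat = 0.0                            # v_hat^r(0) = 0
for n in range(N):
    x = np.array([t_s[n], 1.0])
    if n > 0:                          # (12.26); x1(n) - x1(n-1) = p here
        v_hat = abs(v_hat + d_meas[n] - d_meas[n - 1]
                    - (t_s[n] - t_s[n - 1]) * h[0])
    R += np.outer(x, x)                # (12.27)
    e0 = d_meas[n] - x @ h - v_hat     # (12.28)
    h += np.linalg.solve(R, x * e0)    # (12.29)

print(1.0 + h[0], h[1])                # d_f estimate should now be unbiased
```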
12.7 Simulations
In this section, we show, by way of simulations, the behavior of the different algorithms. The parameter settings chosen for all the simulations are:

• $\alpha = 1.0001$.
• $d_f^r = 0.3$ s.
• $p = 0.04$ s.
• $\lambda = 20$, which means that the average delay jitter is 0.05 s.
• $\hat{v}^r(0) = 0$, $\mathbf{h}(0) = \mathbf{0}$.
The measure used here to compare $\alpha$ and $d_f^r$ to their estimated values with the different algorithms is the relative error in dB:
$$20 \log_{10} \frac{|\alpha - \hat{\alpha}|}{\alpha}, \qquad 20 \log_{10} \frac{|d_f^r - \hat{d}_f^r|}{d_f^r}.$$
Figures 12.2, 12.3, and 12.4 show, respectively, the behavior of the RLS algorithm, the unbiased RLS algorithm, and the linear programming algorithm. Panels (a) correspond to the relative error between the clock ratio $\alpha$ and its estimate, and panels (b) correspond to the relative error between the fixed delay $d_f^r$ and its estimate. For the linear programming algorithm, we used the MATLAB function "linprog" available in the optimization toolbox. As expected, the RLS algorithm is biased for $d_f^r$ but the two other algorithms are not. The proposed unbiased RLS algorithm has a smooth convergence and gives much more reliable results than the other algorithms. As a result, the estimation of the parameters improves each time a new packet is received. Figure 12.5 shows the normalized mean square error between the delay jitter $v^r(n)$ and its estimate $\hat{v}^r(n)$ obtained by using (12.26). Clearly, (12.26) is a reliable estimator.

Fig. 12.2. Behavior of the RLS algorithm. (a) Relative error between the clock ratio α and its estimate. (b) Relative error between the fixed delay $d_f^r$ and its estimate.

Fig. 12.3. Behavior of the unbiased RLS algorithm. (a) Relative error between the clock ratio α and its estimate. (b) Relative error between the fixed delay $d_f^r$ and its estimate.
Fig. 12.4. Behavior of the linear programming algorithm. (a) Relative error between the clock ratio α and its estimate. (b) Relative error between the fixed delay $d_f^r$ and its estimate.

Fig. 12.5. Normalized mean square error between the delay jitter $v^r(n)$ and its estimate $\hat{v}^r(n)$.

12.8 Conclusions

The main objective of this chapter was to show how adaptive signal processing techniques can be used to solve some interesting problems encountered in the Internet network. Here, we discussed the clock skew problem and how to model it correctly. We also discussed a model for the different delays that occur in the network. From this model, we derived several estimators to estimate the clock ratio and the fixed delay. We proposed an unbiased RLS
algorithm and a simple way to estimate the delay jitter for each packet. Future work should include the problem of clock drift, which is due to the fact that a clock may not have a constant frequency. We believe that adaptive algorithms are appropriate not only to estimate the clock skew but also to track changes when they occur. The clock offset should also be studied both in one-way and round-trip end-to-end delays.
Acknowledgment

I would like to thank Dennis R. Morgan and Yiteng (Arden) Huang for reading an earlier draft of this chapter and suggesting many improvements.
References

1. O. Hodson, C. Perkins, and V. Hardman, "Skew detection and compensation for Internet audio applications," in Proc. IEEE ICME, 2000, vol. 3, pp. 1687–1690.
2. S. B. Moon, P. Skelly, and D. Towsley, "Estimation and removal of clock skew from network delay measurements," in Proc. IEEE INFOCOM, 1999, vol. 1, pp. 227–234.
3. T. Trump, "Estimation of clock skew in telephony over packet switched networks," in Proc. IEEE ICASSP, 2000, vol. 5, pp. 2605–2609.
4. P. Deleon and C. J. Sreenan, "An adaptive predictor for media playout buffering," in Proc. IEEE ICASSP, 1999, vol. 6, pp. 3097–3100.
5. H. Schulzrinne and J. Rosenberg, "The IETF internet telephony architecture and protocols," IEEE Network, vol. 13, pp. 18–23, May/June 1999.
6. S. Haykin, Adaptive Filter Theory. Fourth Edition, Prentice Hall, Upper Saddle River, N.J., 2002.
Index
a posteriori error signal, 3
a priori error signal, 2, 77
acoustic echo cancellation, 59
acoustic echo control, 69
acoustic feedback, 23, 26
acoustic source localization, 228, 237
acoustic source locator, 227
adaptation control, 77
adaptive algorithm, 72
adaptive antenna array, 283
adaptive beamformer (ABF), 208
adaptive beamforming, 155
adaptive eigenvalue decomposition, 228, 232
adaptive noise cancellation, 130, 137
adaptive space-time processing, 283
adaptive spectral modification, 130
adaptive zero-forcing algorithm, 265
affine projection algorithm, 72, 74
array gain, 161
array processing, 155
array sensitivity, 162
background noise, 67
Bayes' rule, 273
beamformer, 157
beamformer response, 160
beamforming, 129, 131
beampattern, 161
beamsteering, 159
behind-the-ear (BTE), 23
BER, 251
Bezout identity, 257
blind 1D-ST STAR, 301
blind 2D RAKE receiver, 287
blind 2D STAR, 288
blind algorithms, 275
blind channel identification, 228
blind method, 12
blind source separation (BSS), 195
block error signal, 99
block LMS, 235
block processing, 88
BLUE, 270
broadside, 163
CDMA, 283
channel cross relation, 232
channel diversity, 227, 232
channel equalization, 253
circulant matrix, 100
clamp, 39
clock ratio, 343
clock skew, 341
CMA, 275
cocktail-party effect, 195
coherence function, 141
computational complexity, 73
constant modulus, 275
constrained adaptation, 25, 37
constrained NMCFLMS, 236, 239
convergence analysis, 105
convergence in mean, 107
convergence in mean square, 107
convolutive blind source separation, 199
convolutive mixtures, 198
cost function, 73
cross correlation, 228
CSMA/CA, 324
data-dependent beamformer, 165
data-independent beamformer, 162
decision feedback equalization, 261
decision feedback identification, 288
decorrelation filter, 74
delay coefficients, 81
delay-and-sum beamformer, 131
dereverberation, 156
direction of arrival (DOA), 131
directivity, 162
directivity index, 162
distance measure, 3
distributed coordination function, 324
Dolph-Chebyshev window, 163
double-talk, 67
double-talk detector, 85
echo cancellation, 70
echo return loss enhancement, 65
echo suppression, 71
EG, 1
EG±, 7
EGU±, 5
eigenvalue spread, 67
eigenvector beamformer, 175
entropy, 203
equalizer, 251
error signal, 99
error vector, 74
Euclidean distance, 3
exponential forgetting factor, 10, 77, 102
exponential pdf, 344
exponentiated gradient, 1
exponentiated RLS (ERLS), 10
far-field, 131
fast ERLS (FERLS), 10
feedback cancellation, 24, 26, 32
feedback path, 25, 29
feedback path delay, 34
FFT, 234
filtered-X algorithm, 44
FLMS, 114
Fourier matrix, 234
fractionally spaced equalizers, 256
frequency-domain BSS, 200
frequency-domain criterion, 102
frequency-domain error signal, 101
frequency-domain Kalman gain matrix, 105
full-duplex, 61
fullband processing, 87
GCC, 228
generalized sidelobe canceller (GSC), 174
half-duplex, 61
hands-free, 60
hearing aid, 23
Hessian matrix, 236
higher order statistics (HOS), 205, 207
IEEE 802.11, 323
IEEE 802.11a, 324
IEEE 802.11b, 324
IEEE 802.11g, 324
improved PNLMS (IPNLMS), 10
impulse response, 62, 230
in-the-ear (ITE), 23
independence, 201
independent component analysis (ICA), 201
infomax, 202
information theory, 202
input signal matrix, 74
instantaneous mixtures, 199
inter-symbol-interference (ISI), 250
interference suppression, 159
inverse Fourier matrix, 234
jitter, 341
Kalman gain, 10
Kullback-Leibler divergence, 3
LCMV beamformer, 169
LCMV-LS beamformer, 169
learning rule, 204
least-mean-square, 1, 140
least-squares estimator, 270, 344
likelihood, 202
linear programming, 347
LMS, 5, 35
loss control, 61, 70
loudspeaker-enclosure-microphone system, 60, 61
MAC, 323
mainlobe width, 161
maximal ratio combining, 256
maximal-length sequence (MLS), 33
maximization of nongaussianity, 202
maximum likelihood equalization, 273
maximum likelihood estimator, 202, 346
maximum stable gain (MSG), 36
MCFLMS, 235
microphone array, 131, 157
MIMO, 96, 110, 255, 270, 274
minimization of mutual information, 202
minimum statistics, 147
misalignment correlation matrix, 108
misalignment vector, 71, 107, 108
mismatch vector, 71
MMSE, 250, 327
MMSE beamforming, 168
model of a loudspeaker-enclosure-microphone system, 63
modulation, 285
MSE, 250
multi-delay filter, 97, 115
multichannel acoustic echo cancellation, 96, 123
multichannel adaptive filter, 99
multichannel blind deconvolution, 199
multichannel EG, 14
multichannel EG±, 15
multichannel frequency-domain LMS, 235
multichannel LMS, 233
multipath propagation, 156
multiple antennas, 324
mutual information, 202
natural gradient, 5, 204
near-field, 131
NEG±, 8
NEGU±, 7
Newton's method, 235
NLMS, 7, 72, 73, 140
NMCFLMS, 229, 235
noise reduction, 129
nongaussianity, 202
nonlinear decorrelation, 201
nonstationary decorrelation, 205
normal equation, 102
normalized misalignment, 16
normalized projection misalignment, 17
OFDM, 324
optimal step size, 79
optimum data-dependent beamformer, 165
optimum LSE processor, 166
orthogonal basis function, 271, 275
overestimation parameter, 84
parametric spectral subtraction, 145
parametric Wiener filter, 148
peak distortion, 252
peak-to-noise ratio (PNR), 33
permutation problem, 200
PHY, 323
pilot-assisted 1D-ST STAR, 311
pitch, 67
PNLMS, 2, 10
point coordination function, 324
power pattern, 161
probability density function (pdf), 202
proportionate NLMS, 10
RAKE, 256
recursive least-squares, 346
regularization, 116
regulations, 68
relative entropy, 3
relative error, 348
relative sidelobe level, 161
reverberation time, 62, 220, 240
Riemannian metric tensor, 5
RLS, 9, 72, 76, 77, 346
robust beamformer, 172
room reverberation, 25, 47, 228
RTP, 342
scaling problem, 200
second order statistics (SOS), 205, 206
SER, 251
shadow filter, 81
short-time Fourier transform, 84
signal correlation time, 242
signal-to-interference ratio, 221
SIMO, 12, 229, 230, 255
SINR, 251
slicer, 256
sparseness, 1
spatio-temporal filtering, 156
spectral floor, 84
spectral modification, 144
speech enhancement, 129
speech signal, 65
steady-state analysis, 26
steepest descent, 140
steering vector, 160
step-size factor, 73
subband processing, 88
suppression of residual echoes, 82
system distance, 71, 81
system identification, 63
system stability, 27
TDOA, 231
teleconferencing, 123, 227
throughput analysis, 330
time delay estimation, 227
time delay of arrival, 227, 231
time-delayed decorrelation (TDD), 205
time-domain BSS, 199
Toeplitz matrix, 100
tracking, 267
tracking behaviour, 72
training, 267
UFLMS, 114
unbiased RLS, 348
unconstrained NMCFLMS, 237
undisturbed error, 71, 80, 84
unit step function, 344
unit-norm constraint, 237
unvoiced speech, 67
V-BLAST, 258
Varechoic chamber, 238
Viterbi algorithm, 274
voice activity detector, 147
voiced speech, 67
Wiener filter, 71, 148
Wiener solution, 103
wireless local area network (WLAN), 323
zero-forcing, 250