Learning with Kernels
Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J. Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour, and Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Bernhard Schölkopf and Alexander J. Smola
Learning with Kernels
Support Vector Machines, Regularization, Optimization, and Beyond

Bernhard Schölkopf
Alexander J. Smola
The MIT Press Cambridge, Massachusetts London, England
Contents

1 A Tutorial Introduction
1.1 Data Representation and Similarity
1.2 A Simple Pattern Recognition Algorithm
1.3 Some Insights From Statistical Learning Theory
1.4 Hyperplane Classifiers
1.5 Support Vector Classification
1.6 Support Vector Regression
1.7 Kernel Principal Component Analysis
1.8 Empirical Results and Implementations

I CONCEPTS AND TOOLS

2 Kernels
2.1 Product Features
2.2 The Representation of Similarities in Linear Spaces
2.3 Examples and Properties of Kernels
2.4 The Representation of Dissimilarities in Linear Spaces
2.5 Summary
2.6 Problems

3 Risk and Loss Functions
3.1 Loss Functions
3.2 Test Error and Expected Risk
3.3 A Statistical Perspective
3.4 Robust Estimators
3.5 Summary
3.6 Problems

4 Regularization
4.1 The Regularized Risk Functional
4.2 The Representer Theorem
4.3 Regularization Operators
4.4 Translation Invariant Kernels
4.5 Translation Invariant Kernels in Higher Dimensions
4.6 Dot Product Kernels
4.7 Multi-Output Regularization
4.8 Semiparametric Regularization
4.9 Coefficient Based Regularization
4.10 Summary
4.11 Problems

5 Elements of Statistical Learning Theory
5.1 Introduction
5.2 The Law of Large Numbers
5.3 When Does Learning Work: the Question of Consistency
5.4 Uniform Convergence and Consistency
5.5 How to Derive a VC Bound
5.6 A Model Selection Example
5.7 Summary
5.8 Problems

17 Regularized Principal Manifolds
17.1 A Coding Framework
17.2 A Regularized Quantization Functional
17.3 An Algorithm for Minimizing R_reg[f]
17.4 Connections to Other Algorithms
17.5 Uniform Convergence Bounds
17.6 Experiments
17.7 Summary
17.8 Problems

18 Pre-Images and Reduced Set Methods
18.1 The Pre-Image Problem
18.2 Finding Approximate Pre-Images
18.3 Reduced Set Methods
18.4 Reduced Set Selection Methods
18.5 Reduced Set Construction Methods
18.6 Sequential Evaluation of Reduced Set Expansions
18.7 Summary
18.8 Problems

A Addenda
A.1 Data Sets
A.2 Proofs

B Mathematical Prerequisites
B.1 Probability
B.2 Linear Algebra
B.3 Functional Analysis

References
Index
Notation and Symbols
Series Foreword
The goal of building systems that can adapt to their environments and learn from their experience has attracted researchers from many fields, including computer science, engineering, mathematics, physics, neuroscience, and cognitive science. Out of this research has come a wide variety of learning techniques that have the potential to transform many scientific and industrial fields. Recently, several research communities have converged on a common set of issues surrounding supervised, unsupervised, and reinforcement learning problems. The MIT Press series on Adaptive Computation and Machine Learning seeks to unify the many diverse strands of machine learning research and to foster high quality research and innovative applications.

Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond is an excellent illustration of this convergence of ideas from many fields. The development of kernel-based learning methods has resulted from a combination of machine learning theory, optimization algorithms from operations research, and kernel techniques from mathematical analysis. These three ideas have spread far beyond the original support-vector machine algorithm: Virtually every learning algorithm has been redesigned to exploit the power of kernel methods. Bernhard Schölkopf and Alexander Smola have written a comprehensive, yet accessible, account of these developments. This volume includes all of the mathematical and algorithmic background needed not only to obtain a basic understanding of the material but to master it. Students and researchers who study this book will be able to apply kernel methods in creative ways to solve a wide range of problems in science and engineering.

Thomas Dietterich
Preface
One of the most fortunate situations a scientist can encounter is to enter a field in its infancy. There is a large choice of topics to work on, and many of the issues are conceptual rather than merely technical. Over the last seven years, we have had the privilege to be in this position with regard to the field of Support Vector Machines (SVMs). We began working on our respective doctoral dissertations in 1994 and 1996. Upon completion, we decided to combine our efforts and write a book about SVMs. Since then, the field has developed impressively, and has to an extent been transformed. We set up a website that quickly became the central repository for the new community, and a number of workshops were organized by various researchers. The scope of the field has now widened significantly, both in terms of new algorithms, such as kernel methods other than SVMs, and in terms of a deeper theoretical understanding being gained. It has become clear that kernel methods provide a framework for tackling some rather profound issues in machine learning theory. At the same time, successful applications have demonstrated that SVMs not only have a more solid foundation than artificial neural networks, but are able to serve as a replacement for neural networks that perform as well or better, in a wide variety of fields. Standard neural network and pattern recognition textbooks have now started including chapters on SVMs and kernel PCA (for instance, [235, 153]).

While these developments took place, we were trying to strike a balance between pursuing exciting new research, and making progress with the slowly growing manuscript of this book. In the two and a half years that we worked on the book, we faced a number of lessons that we suspect everyone writing a scientific monograph — or any other book — will encounter. First, writing a book is more work than you think, even with two authors sharing the work in equal parts. Second, our book got longer than planned. Once we exceeded the initially planned length of 500 pages, we got worried. In fact, the manuscript kept growing even after we stopped writing new chapters, and began polishing things and incorporating corrections suggested by colleagues. This was mainly due to the fact that the book deals with a fascinating new area, and researchers keep adding fresh material to the body of knowledge. We learned that there is no asymptotic regime in writing such a book — if one does not stop, it will grow beyond any bound — unless one starts cutting. We therefore had to take painful decisions to leave out material that we originally thought should be in the book. Sadly, and this is the third point, the book thus contains less material than originally planned, especially on the subject of theoretical developments. We sincerely apologize to all researchers who feel that their contributions should have been included — the book is certainly biased towards our own work, and does not provide a fully comprehensive overview of the field. We did, however, aim to provide all the necessary concepts and ideas to enable a reader equipped with some basic mathematical knowledge to enter the engaging world of machine learning, using theoretically well-founded kernel algorithms, and to understand and apply the powerful algorithms that have been developed over the last few years.

The book is divided into three logical parts. Each part consists of a brief introduction and a number of technical chapters. In addition, we include two appendices containing addenda, technical details, and mathematical prerequisites. Each chapter begins with a short discussion outlining the contents and prerequisites; for some of the longer chapters, we include a graph that sketches the logical structure and dependencies between the sections. At the end of most chapters, we include a set of problems, ranging from simple exercises (marked by •) to hard ones (•••); in addition, we describe open problems and questions for future research (◦◦◦).¹ The latter often represent worthwhile projects for a research publication, or even a thesis. References are also included in some of the problems. These references contain the solutions to the associated problems, or at least significant parts thereof.

1. We suggest that authors post their solutions on the book website, www.learning-with-kernels.org.

The overall structure of the book is perhaps somewhat unusual. Rather than presenting a logical progression of chapters building upon each other, we occasionally touch on a subject briefly, only to revisit it later in more detail. For readers who are used to reading scientific monographs and textbooks from cover to cover, this will amount to some redundancy. We hope, however, that some readers, who are more selective in their reading habits (or less generous with their time), and only look at those chapters that they are interested in, will benefit. Indeed, nobody is expected to read every chapter. Some chapters are fairly technical, and cover material included for reasons of completeness. Other chapters, which are more relevant to the central subjects of the book, are kept simpler, and should be accessible to undergraduate students. In a way, this book thus contains several books in one. For instance, the first chapter can be read as a standalone "executive summary" of Support Vector and kernel methods. This chapter should also provide a fast entry point for practitioners. Someone interested in applying SVMs to a pattern recognition problem might want to read Chapters 1 and 7 only. A reader thinking of building their own SVM implementation could additionally read Chapter 10, and parts of Chapter 6. Those who would like to get actively involved in research aspects of kernel methods, for example by "kernelizing" a new algorithm, should probably read at least Chapters 1 and 2.

A one-semester undergraduate course on learning with kernels could include the material of Chapters 1, 2.1-2.3, 3.1-3.2, 5.1-5.2, 6.1-6.3, and 7. If there is more time, one of Chapters 14, 16, or 17 can be added, or Sections 4.1-4.2. A graduate course could additionally deal with the more advanced parts of Chapters 3, 4, and 5. The remaining chapters provide ample material for specialized courses and seminars. As a general time-saving rule, we recommend reading the first chapter and then jumping directly to the chapter of particular interest to the reader. Chances are that this will lead to a chapter that contains references to the earlier ones, which can then be followed as desired. We hope that this way, readers will inadvertently be tempted to venture into some of the less frequented chapters and research areas. Explore this book; there is a lot to find, and much more is yet to be discovered in the field of learning with kernels.

We conclude the preface by thanking those who assisted us in the preparation of the book. Our first thanks go to our first readers. Chris Burges, Arthur Gretton, and Bob Williamson have read through various versions of the book, and made numerous suggestions that corrected or improved the material. A number of other researchers have proofread various chapters. We would like to thank Matt Beal, Daniel Berger, Olivier Bousquet, Ben Bradshaw, Nicolò Cesa-Bianchi, Olivier Chapelle, Dennis DeCoste, André Elisseeff, Anita Faul, Arnulf Graf, Isabelle Guyon, Ralf Herbrich, Simon Hill, Dominik Janzing, Michael Jordan, Sathiya Keerthi, Neil Lawrence, Ben O'Loghlin, Ulrike von Luxburg, Davide Mattera, Sebastian Mika, Natasa Milic-Frayling, Marta Milo, Klaus Müller, Dave Musicant, Fernando Perez Cruz, Ingo Steinwart, Mike Tipping, and Chris Williams.

In addition, a large number of people have contributed to this book in one way or another, be it by sharing their insights with us in discussions, or by collaborating with us on some of the topics covered in the book. In many places, this strongly influenced the presentation of the material. We would like to thank Dimitris Achlioptas, Luis Almeida, Shun-Ichi Amari, Peter Bartlett, Jonathan Baxter, Tony Bell, Shai Ben-David, Kristin Bennett, Matthias Bethge, Chris Bishop, Andrew Blake, Volker Blanz, Léon Bottou, Paul Bradley, Chris Burges, Heinrich Bülthoff, Olivier Chapelle, Nello Cristianini, Corinna Cortes, Cameron Dawson, Tom Dietterich, André Elisseeff, Oscar de Feo, Federico Girosi, Thore Graepel, Isabelle Guyon, Patrick Haffner, Stefan Harmeling, Paul Hayton, Markus Hegland, Ralf Herbrich, Tommi Jaakkola, Michael Jordan, Jyrki Kivinen, Yann LeCun, Chih-Jen Lin, Gábor Lugosi, Olvi Mangasarian, Laurent Massoulié, Sebastian Mika, Sayan Mukherjee, Klaus Müller, Noboru Murata, Nuria Oliver, John Platt, Tomaso Poggio, Gunnar Rätsch, Sami Romdhani, Rainer von Sachs, Christoph Schnörr, Matthias Seeger, John Shawe-Taylor, Kristy Sim, Patrice Simard, Stephen Smale, Sara Solla, Lionel Tarassenko, Lily Tian, Mike Tipping, Alexander Tsybakov, Lou van den Dries, Santosh Venkatesh, Thomas Vetter, Chris Watkins, Jason Weston, Chris Williams, Bob Williamson, Andreas Ziehe, Alex Zien, and Tong Zhang.

Next, we would like to extend our thanks to the research institutes that allowed us to pursue our research interests and to dedicate the time necessary for writing the present book; these are AT&T / Bell Laboratories (Holmdel), the Australian National University (Canberra), Biowulf Technologies (New York), GMD FIRST (Berlin), the Max-Planck-Institute for Biological Cybernetics (Tübingen), and Microsoft Research (Cambridge). We are grateful to Doug Sery from MIT Press for continuing support and encouragement during the writing of this book. We are, moreover, indebted to funding from various sources; specifically, from the Studienstiftung des deutschen Volkes, the Deutsche Forschungsgemeinschaft, the Australian Research Council, and the European Union. Finally, special thanks go to Vladimir Vapnik, who introduced us to the fascinating world of statistical learning theory.

P.S.: For pointing out errors in the first printing of this book, we are indebted to Juan Borras Garcia, Dongwei Cao, Dave DeBarr, Thore Graepel, Arthur Gretton, Alexandros Karatzoglou, Adam Kowalczyk, Malte Kuss, Frederic Maire, Tristan Mary-Huard, Sebastian Mika, Tommi Poggio, Carl Rasmussen, Salla Ruosaari, Kristy Sim, Paul Teal, Zhang Tong, Christian Walder, S.V.N. Vishwanathan, Xi Xuecheng, and other readers.
...the story of the sheep dog who was herding his sheep, and serendipitously invented both large margin classification and Sheep Vectors...

Illustration by Ana Martín Larrañaga
1 A Tutorial Introduction
This chapter describes the central ideas of Support Vector (SV) learning in a nutshell. Its goal is to provide an overview of the basic concepts. One such concept is that of a kernel. Rather than going immediately into mathematical detail, we introduce kernels informally as similarity measures that arise from a particular representation of patterns (Section 1.1), and describe a simple kernel algorithm for pattern recognition (Section 1.2). Following this, we report some basic insights from statistical learning theory, the mathematical theory that underlies SV learning (Section 1.3). Finally, we briefly review some of the main kernel algorithms, namely Support Vector Machines (SVMs) (Sections 1.4 to 1.6) and kernel principal component analysis (Section 1.7). We have aimed to keep this introductory chapter as basic as possible, whilst giving a fairly comprehensive overview of the main ideas that will be discussed in the present book. After reading it, readers should be able to place all the remaining material in the book in context and judge which of the following chapters is of particular interest to them. As a consequence of this aim, most of the claims in the chapter are not proven. Abundant references to later chapters will enable the interested reader to fill in the gaps at a later stage, without losing sight of the main ideas described presently.
1.1 Data Representation and Similarity
Training Data
One of the fundamental problems of learning theory is the following: suppose we are given two classes of objects. We are then faced with a new object, and we have to assign it to one of the two classes. This problem can be formalized as follows: we are given empirical data

(x_1, y_1), ..., (x_m, y_m) ∈ X × {±1}.   (1.1)
Here, X is some nonempty set from which the patterns x_i (sometimes called cases, inputs, instances, or observations) are taken, usually referred to as the domain; the y_i are called labels, targets, outputs, or sometimes also observations.¹

1. Note that we use the term pattern to refer to individual observations. A (smaller) part of the existing literature reserves the term for a generic prototype which underlies the data. The latter is probably closer to the original meaning of the term; however, we decided to stick with the present usage, which is more common in the field of machine learning.

Note that there are
only two classes of patterns. For the sake of mathematical convenience, they are labelled by +1 and −1, respectively. This is a particularly simple situation, referred to as (binary) pattern recognition or (binary) classification.

It should be emphasized that the patterns could be just about anything, and we have made no assumptions on X other than it being a set. For instance, the task might be to categorize sheep into two classes, in which case the patterns x_i would simply be sheep.

In order to study the problem of learning, however, we need an additional type of structure. In learning, we want to be able to generalize to unseen data points. In the case of pattern recognition, this means that given some new pattern x ∈ X, we want to predict the corresponding y ∈ {±1}.² By this we mean, loosely speaking, that we choose y such that (x, y) is in some sense similar to the training examples (1.1). To this end, we need notions of similarity in X and in {±1}.

Characterizing the similarity of the outputs {±1} is easy: in binary classification, only two situations can occur: two labels can either be identical or different. The choice of the similarity measure for the inputs, on the other hand, is a deep question that lies at the core of the field of machine learning.

Let us consider a similarity measure of the form

k: X × X → ℝ, (x, x′) ↦ k(x, x′),   (1.2)

that is, a function that, given two patterns x and x′, returns a real number characterizing their similarity. Unless stated otherwise, we will assume that k is symmetric, that is, k(x, x′) = k(x′, x) for all x, x′ ∈ X. For reasons that will become clear later (cf. Remark 2.16), the function k is called a kernel [359, 4, 42, 62, 223].

Dot Product

General similarity measures of this form are rather difficult to study. Let us therefore start from a particularly simple case, and generalize it subsequently. A simple type of similarity measure that is of particular mathematical appeal is a dot product. For instance, given two vectors x, x′ ∈ ℝ^N, the canonical dot product is defined as

⟨x, x′⟩ := Σ_{i=1}^N [x]_i [x′]_i.   (1.3)

Here, [x]_i denotes the ith entry of x. Note that the dot product is also referred to as inner product or scalar product, and sometimes denoted with round brackets and a dot, as (x · x′) — this is where the "dot" in the name comes from. In Section B.2, we give a general definition of dot products. Usually, however, it is sufficient to think of dot products as (1.3).

2. Doing this for every x ∈ X amounts to estimating a function f: X → {±1}.
Length
The geometric interpretation of the canonical dot product is that it computes the cosine of the angle between the vectors x and x′, provided they are normalized to length 1. Moreover, it allows computation of the length (or norm) of a vector x as

‖x‖ = √⟨x, x⟩.   (1.4)
Likewise, the distance between two vectors is computed as the length of the difference vector. Therefore, being able to compute dot products amounts to being able to carry out all geometric constructions that can be formulated in terms of angles, lengths, and distances.

Note, however, that the dot product approach is not really sufficiently general to deal with many interesting problems.

• First, we have deliberately not made the assumption that the patterns actually exist in a dot product space. So far, they could be any kind of object. In order to be able to use a dot product as a similarity measure, we therefore first need to represent the patterns as vectors in some dot product space H (which need not coincide with ℝ^N). To this end, we use a map

Φ: X → H, x ↦ x := Φ(x).   (1.5)
• Second, even if the original patterns exist in a dot product space, we may still want to consider more general similarity measures obtained by applying a map (1.5). In that case, Φ will typically be a nonlinear map. An example that we will consider in Chapter 2 is a map which computes products of entries of the input patterns.

Feature Space
In both the above cases, the space H is called a feature space. Note that we have used a bold face x to denote the vectorial representation of x in the feature space. We will follow this convention throughout the book.

To summarize, embedding the data into H via Φ has three benefits:

1. It lets us define a similarity measure from the dot product in H,

k(x, x′) := ⟨x, x′⟩ = ⟨Φ(x), Φ(x′)⟩.   (1.6)
2. It allows us to deal with the patterns geometrically, and thus lets us study learning algorithms using linear algebra and analytic geometry.

3. The freedom to choose the mapping Φ will enable us to design a large variety of similarity measures and learning algorithms.

This also applies to the situation where the inputs x_i already exist in a dot product space. In that case, we might directly use the dot product as a similarity measure. However, nothing prevents us from first applying a possibly nonlinear map Φ to change the representation into one that is more suitable for a given problem. This will be elaborated in Chapter 2, where the theory of kernels is developed in more detail.
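To make (1.5) and (1.6) concrete, here is a small numerical sketch (our own illustration, not code from the book, and the function names phi and k are ours): for a map that computes degree-2 products of the entries of 2D input patterns, the dot product in the feature space can be evaluated directly on the inputs by a polynomial kernel.

```python
import numpy as np

def phi(x):
    # Map a 2D pattern to the space of degree-2 monomials; the sqrt(2)
    # weighting makes the feature-space dot product match the kernel below.
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, xp):
    # Homogeneous polynomial kernel of degree 2.
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)), k(x, xp))  # both print 1.0, verifying (1.6)
```

Even in this toy case, the kernel route evaluates a 2D dot product and squares it, while the explicit route first constructs feature vectors; for the higher-degree maps of Chapter 2 the gap becomes dramatic.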
1.2 A Simple Pattern Recognition Algorithm

We are now in the position to describe a pattern recognition learning algorithm that is arguably one of the simplest possible. We make use of the structure introduced in the previous section; that is, we assume that our data are embedded into a dot product space H.³ Using the dot product, we can measure distances in this space. The basic idea of the algorithm is to assign a previously unseen pattern to the class with closer mean.

3. For the definition of a dot product space, see Section B.2.

We thus begin by computing the means of the two classes in feature space,

c₊ = (1/m₊) Σ_{y_i = +1} x_i,   (1.7)
c₋ = (1/m₋) Σ_{y_i = −1} x_i,   (1.8)
where m₊ and m₋ are the number of examples with positive and negative labels, respectively. We assume that both classes are non-empty, thus m₊, m₋ > 0. We assign a new point x to the class whose mean is closest (Figure 1.1).

This geometric construction can be formulated in terms of the dot product ⟨·, ·⟩. Halfway between c₊ and c₋ lies the point c := (c₊ + c₋)/2. We compute the class of x by checking whether the vector x − c connecting c to x encloses an angle smaller than π/2 with the vector w := c₊ − c₋ connecting the class means. This leads to

y = sgn ⟨(x − c), w⟩ = sgn ⟨(x − (c₊ + c₋)/2), (c₊ − c₋)⟩ = sgn (⟨x, c₊⟩ − ⟨x, c₋⟩ + b).   (1.9)
Here, we have defined the offset

b := (1/2) (‖c₋‖² − ‖c₊‖²),   (1.10)

with the norm ‖x‖ := √⟨x, x⟩. If the class means have the same distance to the origin, then b will vanish. Note that (1.9) induces a decision boundary which has the form of a hyperplane (Figure 1.1); that is, a set of points that satisfy a constraint expressible as a linear equation.

Decision Function

It is instructive to rewrite (1.9) in terms of the input patterns x_i, using the kernel k to compute the dot products. Note, however, that (1.6) only tells us how to compute the dot products between the vectorial representations (the bold x_i) of inputs x_i. We therefore need to express the vectors c₊, c₋, and w in terms of x_1, ..., x_m. To this end, substitute (1.7) and (1.8) into (1.9) to get the decision function

y = sgn ((1/m₊) Σ_{y_i = +1} ⟨x, x_i⟩ − (1/m₋) Σ_{y_i = −1} ⟨x, x_i⟩ + b)
  = sgn ((1/m₊) Σ_{y_i = +1} k(x, x_i) − (1/m₋) Σ_{y_i = −1} k(x, x_i) + b).   (1.11)
Figure 1.1 A simple geometric classification algorithm: given two classes of points (depicted by 'o' and '+'), compute their means c₊, c₋ and assign a test pattern x to the one whose mean is closer. This can be done by looking at the dot product between x − c (where c = (c₊ + c₋)/2) and w := c₊ − c₋, which changes sign as the enclosed angle passes through π/2. Note that the corresponding decision boundary is a hyperplane (the dotted line) orthogonal to w.
Similarly, the offset becomes

b = (1/2) ((1/m₋²) Σ_{y_i = y_j = −1} k(x_i, x_j) − (1/m₊²) Σ_{y_i = y_j = +1} k(x_i, x_j)).   (1.12)
Surprisingly, it turns out that this rather simple-minded approach contains a well-known statistical classification method as a special case. Assume that the class means have the same distance to the origin (hence b = 0, cf. (1.10)), and that k can be viewed as a probability density when one of its arguments is fixed. By this we mean that it is positive and has unit integral,⁴

∫_X k(x, x′) dx = 1  for all x′ ∈ X.   (1.13)

4. In order to state this assumption, we have to require that we can define an integral on X.

Parzen Windows

In this case, (1.11) takes the form of the so-called Bayes classifier separating the two classes, subject to the assumption that the two classes of patterns were generated by sampling from two probability distributions that are correctly estimated by the Parzen windows estimators of the two class densities,

p₊(x) := (1/m₊) Σ_{y_i = +1} k(x, x_i)  and  p₋(x) := (1/m₋) Σ_{y_i = −1} k(x, x_i),   (1.14)

where x ∈ X. Given some point x, the label is then simply computed by checking which of the two values p₊(x) or p₋(x) is larger, which leads directly to (1.11). Note that this decision is the best we can do if we have no prior information about the probabilities of the two classes.

The classifier (1.11) is quite close to the type of classifier that this book deals with in detail. Both take the form of kernel expansions on the input domain,

f(x) = sgn (Σ_{i=1}^m α_i k(x, x_i) + b).   (1.15)
In both cases, the expansions correspond to a separating hyperplane in a feature space. In this sense, the α_i can be considered a dual representation of the hyperplane's normal vector [223]. Both classifiers are example-based in the sense that the kernels are centered on the training patterns; that is, one of the two arguments of the kernel is always a training pattern. A test point is classified by comparing it to all the training points that appear in (1.15) with a nonzero weight.

More sophisticated classification techniques, to be discussed in the remainder of the book, deviate from (1.11) mainly in the selection of the patterns on which the kernels are centered and in the choice of weights α_i that are placed on the individual kernels in the decision function. It will no longer be the case that all training patterns appear in the kernel expansion, and the weights of the kernels in the expansion will no longer be uniform within the classes — recall that in the current example, cf. (1.11), the weights are either (1/m₊) or (−1/m₋), depending on the class to which the pattern belongs. In the feature space representation, this statement corresponds to saying that we will study normal vectors w of decision hyperplanes that can be represented as general linear combinations (i.e., with non-uniform coefficients) of the training patterns. For instance, we might want to remove the influence of patterns that are very far away from the decision boundary, either since we expect that they will not improve the generalization error of the decision function, or since we would like to reduce the computational cost of evaluating the decision function (cf. (1.11)). The hyperplane will then only depend on a subset of training patterns, called Support Vectors.
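As a concrete illustration, the following numpy sketch (our own, not code from the book; the names gaussian_kernel and simple_classifier are ours) implements the decision function (1.11) with the offset (1.12). With a Gaussian kernel it is exactly the Parzen-windows rule (1.14), up to the offset b.

```python
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def simple_classifier(X, y, x_test, k=gaussian_kernel):
    """Assign x_test to the class whose feature-space mean is closer, cf. (1.11)."""
    pos, neg = X[y == +1], X[y == -1]
    # Offset (1.12): difference of the average within-class similarities.
    b = 0.5 * (np.mean([k(xi, xj) for xi in neg for xj in neg])
               - np.mean([k(xi, xj) for xi in pos for xj in pos]))
    f = (np.mean([k(x_test, xi) for xi in pos])
         - np.mean([k(x_test, xi) for xi in neg]) + b)
    return np.sign(f)

X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.1], [1.2, 0.9]])
y = np.array([-1, -1, +1, +1])
print(simple_classifier(X, y, np.array([0.9, 1.0])))  # prints 1.0
```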
1.3 Some Insights From Statistical Learning Theory With the above example in mind, let us now consider the problem of pattern recognition in a slightly more formal setting [559, 152, 186]. This will allow us to indicate the factors affecting the design of "better" algorithms. Rather than just
providing tools to come up with new algorithms, we also want to provide some insight into how to do it in a promising way.

Figure 1.2 A 2D toy example of binary classification, solved using three models (the decision boundaries are shown). The models vary in complexity, ranging from a simple one (left), which misclassifies a large number of points, to a complex one (right), which "trusts" each point and comes up with a solution that is consistent with all training points (but may not work well on new points). As an aside: the plots were generated using the so-called soft-margin SVM to be explained in Chapter 7; cf. also Figure 7.10.

In two-class pattern recognition, we seek to infer a function

f: X → {±1}   (1.16)

from input-output training data (1.1). The training data are sometimes also called the sample.

Figure 1.2 shows a simple 2D toy example of a pattern recognition problem. The task is to separate the solid dots from the circles by finding a function which takes the value 1 on the dots and −1 on the circles. Note that instead of plotting this function, we may plot the boundaries where it switches between 1 and −1. In the rightmost plot, we see a classification function which correctly separates all training points. From this picture, however, it is unclear whether the same would hold true for test points which stem from the same underlying regularity. For instance, what should happen to a test point which lies close to one of the two "outliers," sitting amidst points of the opposite class? Maybe the outliers should not be allowed to claim their own custom-made regions of the decision function. To avoid this, we could try to go for a simpler model which disregards these points. The leftmost picture shows an almost linear separation of the classes. This separation, however, not only misclassifies the above two outliers, but also a number of "easy" points which are so close to the decision boundary that the classifier really should be able to get them right. Finally, the central picture represents a compromise, by using a model with an intermediate complexity, which gets most points right, without putting too much trust in any individual point.

The goal of statistical learning theory is to place these intuitive arguments in a mathematical framework. To this end, it studies mathematical properties of learning machines. These properties are usually properties of the function class that the learning machine can implement.

IID Data

We assume that the data are generated independently from some unknown (but fixed) probability distribution P(x, y).⁵ This is a standard assumption in learning theory; data generated this way is commonly referred to as iid (independent and identically distributed). Our goal is to find a function f that will correctly classify unseen examples (x, y), so that f(x) = y for examples (x, y) that are also generated from P(x, y).⁶

Loss Function

Correctness of the classification is measured by means of the zero-one loss function c(x, y, f(x)) := (1/2)|f(x) − y|. Note that the loss is 0 if (x, y) is classified correctly, and 1 otherwise.

Test Data

If we put no restriction on the set of functions from which we choose our estimated f, however, then even a function that does very well on the training data, e.g., by satisfying f(x_i) = y_i for all i = 1, ..., m, might not generalize well to unseen examples. To see this, note that for each function f and any test set (x̄_1, ȳ_1), ..., (x̄_m̄, ȳ_m̄) ∈ X × {±1} satisfying {x̄_1, ..., x̄_m̄} ∩ {x_1, ..., x_m} = ∅, there exists another function f* such that f*(x_i) = f(x_i) for all i = 1, ..., m, yet f*(x̄_i) ≠ f(x̄_i) for all i = 1, ..., m̄ (cf. Figure 1.3). As we are only given the training data, we have no means of selecting which of the two functions (and hence which of the two different sets of test label predictions) is preferable.

Figure 1.3 A 1D classification problem, with a training set of three points (marked by circles), and three test inputs (marked on the x-axis). Classification is performed by thresholding real-valued functions g(x) according to sgn(g(x)). Note that both functions (dotted line, and solid line) perfectly explain the training data, but they give opposite predictions on the test inputs. Lacking any further information, the training data alone give us no means to tell which of the two functions is to be preferred.

Empirical Risk

We conclude that minimizing only the (average) training error (or empirical risk),

R_emp[f] = (1/m) Σ_{i=1}^m (1/2)|f(x_i) − y_i|,   (1.17)

Risk

does not imply a small test error (called risk), averaged over test examples drawn from the underlying distribution P(x, y),

R[f] = ∫ (1/2)|f(x) − y| dP(x, y).   (1.18)

5. For a definition of a probability distribution, see Section B.1.1.
6. We mostly use the term example to denote a pair consisting of a training pattern x and the corresponding target y.
The risk can be defined for any loss function, provided the integral exists. For the present zero-one loss function, the risk equals the probability of misclassification.⁷

Capacity

Statistical learning theory (Chapter 5, [570, 559, 561, 136, 562, 14]), or VC (Vapnik-Chervonenkis) theory, shows that it is imperative to restrict the set of functions from which f is chosen to one that has a capacity suitable for the amount of available training data. VC theory provides bounds on the test error. The minimization of these bounds, which depend on both the empirical risk and the capacity of the function class, leads to the principle of structural risk minimization [559].

VC Dimension

The best-known capacity concept of VC theory is the VC dimension, defined as follows: each function of the class separates the patterns in a certain way and thus induces a certain labelling of the patterns.

Shattering

Since the labels are in {±1}, there are at most 2^m different labellings for m patterns. A very rich function class might be able to realize all 2^m separations, in which case it is said to shatter the m points. However, a given class of functions might not be sufficiently rich to shatter the m points. The VC dimension is defined as the largest m such that there exists a set of m points which the class can shatter, and ∞ if no such m exists. It can be thought of as a one-number summary of a learning machine's capacity (for an example, see Figure 1.4). As such, it is necessarily somewhat crude. More accurate capacity measures are the annealed VC entropy or the growth function. These are usually considered to be harder to evaluate, but they play a fundamental role in the conceptual part of VC theory. Another interesting capacity measure, which can be thought of as a scale-sensitive version of the VC dimension, is the fat shattering dimension [286, 6]. For further details, cf. Chapters 5 and 12.

VC Bound

Whilst it will be difficult for the non-expert to appreciate the results of VC theory in this chapter, we will nevertheless briefly describe an example of a VC bound:

7. The risk-based approach to machine learning has its roots in statistical decision theory [582, 166, 43]. In that context, f(x) is thought of as an action, and the loss function measures the loss incurred by taking action f(x) upon observing x when the true output (state of nature) is y. Like many fields of statistics, decision theory comes in two flavors. The present approach is a frequentist one. It considers the risk as a function of the distribution P and the decision function f. The Bayesian approach considers parametrized families P_θ to model the distribution. Given a prior over θ (which need not in general be a finite-dimensional vector), the Bayes risk of a decision function f is the expected frequentist risk, where the expectation is taken over the prior. Minimizing the Bayes risk (over decision functions) then leads to a Bayes decision function. Bayesians thus act as if the parameter θ were actually a random variable whose distribution is known. Frequentists, who do not make this (somewhat bold) assumption, have to resort to other strategies for picking a decision function. Examples thereof are considerations like invariance and unbiasedness, both used to restrict the class of decision rules, and the minimax principle. A decision function is said to be minimax if it minimizes (over all decision functions) the maximal (over all distributions) risk. For a discussion of the relationship of these issues to VC theory, see Problem 5.9.
Figure 1.4 A simple VC dimension example. There are 2³ = 8 ways of assigning 3 points to two classes. For the displayed points in ℝ², all 8 possibilities can be realized using separating hyperplanes; in other words, the function class can shatter 3 points. This would not work if we were given 4 points, no matter how we placed them. Therefore, the VC dimension of the class of separating hyperplanes in ℝ² is 3.
if h < m is the VC dimension of the class of functions that the learning machine can implement, then for all functions of that class, independent of the underlying distribution P generating the data, with a probability of at least 1 − δ over the drawing of the training sample,⁸ the bound

R[f] ≤ R_emp[f] + φ(h, m, δ)   (1.19)

holds, where the confidence term (or capacity term) φ is defined as

φ(h, m, δ) = √((1/m)(h(ln(2m/h) + 1) + ln(4/δ))).   (1.20)
The bound (1.19) merits further explanation. Suppose we wanted to learn a "dependency" where patterns and labels are statistically independent, P(x, y) = P(x)P(y). In that case, the pattern x contains no information about the label y. If, moreover, the two classes +1 and −1 are equally likely, there is no way of making a good guess about the label of a test pattern. Nevertheless, given a training set of finite size, we can always come up with a learning machine which achieves zero training error (provided we have no examples contradicting each other, i.e., whenever two patterns are identical, then they must come with the same label). To reproduce the random labellings by correctly separating all training examples, however, this machine will necessarily require a large VC dimension h. Therefore, the confidence term (1.20), which increases monotonically with h, will be large, and the bound (1.19) will show

8. Recall that each training example is generated from P(x, y), and thus the training data are subject to randomness.
that the small training error does not guarantee a small test error. This illustrates how the bound can apply independent of assumptions about the underlying distribution P(x,y): it always holds (provided that h < m), but it does not always make a nontrivial prediction. In order to get nontrivial predictions from (1.19), the function class must be restricted such that its capacity (e.g., VC dimension) is small enough (in relation to the available amount of data). At the same time, the class should be large enough to provide functions that are able to model the dependencies hidden in P(x, y). The choice of the set of functions is thus crucial for learning from data. In the next section, we take a closer look at a class of functions which is particularly interesting for pattern recognition problems.
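Numerically, the bound is easy to explore. The following sketch (our own, using the form of φ given in (1.20); the function name confidence_term is ours) shows how the confidence term shrinks as the sample size m grows relative to the VC dimension h:

```python
import numpy as np

def confidence_term(h, m, delta=0.05):
    # phi(h, m, delta) from (1.19)/(1.20); requires h < m.
    return np.sqrt((h * (np.log(2 * m / h) + 1) + np.log(4 / delta)) / m)

for m in (100, 1000, 10000, 100000):
    print(m, round(confidence_term(h=10, m=m), 3))
# The capacity term decays roughly like sqrt(h log(m) / m), so a
# fixed-capacity class eventually makes the bound (1.19) nontrivial.
```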
1.4 Hyperplane Classifiers

In the present section, we shall describe a hyperplane learning algorithm that can be performed in a dot product space (such as the feature space that we introduced earlier). As described in the previous section, to design learning algorithms whose statistical effectiveness can be controlled, one needs to come up with a class of functions whose capacity can be computed. Vapnik et al. [573, 566, 570] considered the class of hyperplanes in some dot product space H,

⟨w, x⟩ + b = 0, where w ∈ H, b ∈ ℝ,   (1.21)
corresponding to decision functions

f(x) = sgn(⟨w, x⟩ + b),   (1.22)
Optimal Hyperplane
and proposed a learning algorithm for problems which are separable by hyperplanes (sometimes said to be linearly separable), termed the Generalized Portrait, for constructing f from empirical data. It is based on two facts. First (see Chapter 7), among all hyperplanes separating the data, there exists a unique optimal hyperplane, distinguished by the maximum margin of separation between any training point and the hyperplane. It is the solution of

maximize_{w ∈ H, b ∈ ℝ}  min {‖x − x_i‖ : x ∈ H, ⟨w, x⟩ + b = 0, i = 1, ..., m}.   (1.23)
Second (see Chapter 5), the capacity (as discussed in Section 1.3) of the class of separating hyperplanes decreases with increasing margin. Hence there are theoretical arguments supporting the good generalization performance of the optimal hyperplane, cf. Chapters 5, 7, 12. In addition, it is computationally attractive, since we will show below that it can be constructed by solving a quadratic programming problem for which efficient algorithms exist (see Chapters 6 and 10).

Note that the form of the decision function (1.22) is quite similar to our earlier example (1.9). The ways in which the classifiers are trained, however, are different. In the earlier example, the normal vector of the hyperplane was trivially computed from the class means as w = c₊ − c₋.
Figure 1.5 A binary classification toy problem: separate balls from diamonds. The optimal hyperplane (1.23) is shown as a solid line. The problem being separable, there exists a weight vector w and a threshold b such that y_i(⟨w, x_i⟩ + b) > 0 (i = 1, ..., m). Rescaling w and b such that the point(s) closest to the hyperplane satisfy |⟨w, x_i⟩ + b| = 1, we obtain a canonical form (w, b) of the hyperplane, satisfying y_i(⟨w, x_i⟩ + b) ≥ 1. Note that in this case, the margin (the distance of the closest point to the hyperplane) equals 1/‖w‖. This can be seen by considering two points x_1, x_2 on opposite sides of the margin, that is, ⟨w, x_1⟩ + b = 1, ⟨w, x_2⟩ + b = −1, and projecting them onto the hyperplane normal vector w/‖w‖.
In the present case, we need to do some additional work to find the normal vector that leads to the largest margin. To construct the optimal hyperplane, we have to solve

minimize_{w ∈ H, b ∈ ℝ}  τ(w) = (1/2)‖w‖²   (1.24)
subject to  y_i(⟨w, x_i⟩ + b) ≥ 1  for all i = 1, ..., m.   (1.25)
Note that the constraints (1.25) ensure that f(x_i) will be +1 for y_i = +1, and −1 for y_i = −1. Now one might argue that for this to be the case, we don't actually need the "≥ 1" on the right hand side of (1.25). However, without it, it would not be meaningful to minimize the length of w: to see this, imagine we wrote "> 0" instead of "≥ 1." Now assume that the solution is (w, b). Let us rescale this solution by multiplication with some 0 < λ < 1. Since λ > 0, the constraints are still satisfied. Since λ < 1, however, the length of w has decreased. Hence (w, b) cannot be the minimizer of τ(w). The "≥ 1" on the right hand side of the constraints effectively fixes the scaling of w. In fact, any other positive number would do.

Let us now try to get an intuition for why we should be minimizing the length of w, as in (1.24). If ‖w‖ were 1, then the left hand side of (1.25) would equal the distance from x_i to the hyperplane (cf. (1.23)). In general, we have to divide
Lagrangian
y_i(⟨w, x_i⟩ + b) by ‖w‖ to transform it into this distance. Hence, if we can satisfy (1.25) for all i = 1, ..., m with a w of minimal length, then the overall margin will be maximized. A more detailed explanation of why this leads to the maximum margin hyperplane will be given in Chapter 7. A short summary of the argument is also given in Figure 1.5.

The function τ in (1.24) is called the objective function, while (1.25) are called inequality constraints. Together, they form a so-called constrained optimization problem. Problems of this kind are dealt with by introducing Lagrange multipliers α_i ≥ 0 and a Lagrangian⁹

L(w, b, α) = (1/2)‖w‖² − Σ_{i=1}^m α_i (y_i(⟨x_i, w⟩ + b) − 1).   (1.26)
The Lagrangian L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables α_i (in other words, a saddle point has to be found). Note that the constraint has been incorporated into the second term of the Lagrangian; it is not necessary to enforce it explicitly.

Let us try to get some intuition for this way of dealing with constrained optimization problems. If a constraint (1.25) is violated, then y_i(⟨w, x_i⟩ + b) − 1 < 0, in which case L can be increased by increasing the corresponding α_i. At the same time, w and b will have to change such that L decreases. To prevent α_i(y_i(⟨w, x_i⟩ + b) − 1) from becoming an arbitrarily large negative number, the change in w and b will ensure that, provided the problem is separable, the constraint will eventually be satisfied.

KKT Conditions

Similarly, one can understand that for all constraints which are not precisely met as equalities (that is, for which y_i(⟨w, x_i⟩ + b) − 1 > 0), the corresponding α_i must be 0: this is the value of α_i that maximizes L. The latter is the statement of the Karush-Kuhn-Tucker (KKT) complementarity conditions of optimization theory (Chapter 6).

The statement that at the saddle point, the derivatives of L with respect to the primal variables must vanish,

∂L/∂b = 0  and  ∂L/∂w = 0,   (1.27)

leads to

Σ_{i=1}^m α_i y_i = 0   (1.28)

and

w = Σ_{i=1}^m α_i y_i x_i.   (1.29)
9. Henceforth, we use boldface Greek letters as a shorthand for corresponding vectors, e.g., α = (α_1, ..., α_m).
Support Vector
The solution vector thus has an expansion (1.29) in terms of a subset of the training patterns, namely those patterns with non-zero α_i, called Support Vectors (SVs) (cf. (1.15) in the initial example). By the KKT conditions,

α_i · [y_i(⟨x_i, w⟩ + b) − 1] = 0  for all i = 1, ..., m,   (1.30)

the SVs lie on the margin (cf. Figure 1.5). All remaining training examples (x_j, y_j) are irrelevant: their constraint y_j(⟨w, x_j⟩ + b) ≥ 1 (cf. (1.25)) could just as well be left out, and they do not appear in the expansion (1.29). This nicely captures our intuition of the problem: as the hyperplane (cf. Figure 1.5) is completely determined by the patterns closest to it, the solution should not depend on the other examples.

Dual Problem

By substituting (1.28) and (1.29) into the Lagrangian (1.26), one eliminates the primal variables w and b, arriving at the so-called dual optimization problem, which is the problem that one usually solves in practice:

maximize_{α ∈ ℝ^m}  W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j ⟨x_i, x_j⟩   (1.31)
subject to  α_i ≥ 0 for all i = 1, ..., m, and Σ_{i=1}^m α_i y_i = 0.   (1.32)
Decision Function
Using (1.29), the hyperplane decision function (1.22) can thus be written as

f(x) = sgn (Σ_{i=1}^m y_i α_i ⟨x, x_i⟩ + b),   (1.33)

where b is computed by exploiting (1.30) (for details, cf. Chapter 7).

Mechanical Analogy

The structure of the optimization problem closely resembles those that typically arise in Lagrange's formulation of mechanics (e.g., [206]). In the latter class of problem, it is also often the case that only a subset of constraints become active. For instance, if we keep a ball in a box, then it will typically roll into one of the corners. The constraints corresponding to the walls which are not touched by the ball are irrelevant, and those walls could just as well be removed.

Seen in this light, it is not too surprising that it is possible to give a mechanical interpretation of optimal margin hyperplanes [87]: If we assume that each SV x_i exerts a perpendicular force of size α_i and direction y_i · w/‖w‖ on a solid plane sheet lying along the hyperplane, then the solution satisfies the requirements for mechanical stability. The constraint (1.28) states that the forces on the sheet sum to zero, and (1.29) implies that the torques also sum to zero, via Σ_i x_i × y_i α_i w/‖w‖ = w × w/‖w‖ = 0.¹⁰ This mechanical analogy illustrates the physical meaning of the term Support Vector.

10. Here, the × denotes the vector (or cross) product, satisfying v × v = 0 for all v ∈ H.
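To see that the dual (1.31) is indeed just a small quadratic program, here is a sketch (our own; it assumes SciPy is available and that the toy data are linearly separable) that solves it with a general-purpose solver and recovers w via (1.29) and b via the KKT conditions (1.30):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.5], [0.5, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
Q = (y[:, None] * y[None, :]) * (X @ X.T)  # Q_ij = y_i y_j <x_i, x_j>

# Maximize W(alpha) in (1.31), i.e. minimize its negative, subject to (1.32).
res = minimize(lambda a: 0.5 * a @ Q @ a - a.sum(),
               x0=np.zeros(m),
               jac=lambda a: Q @ a - 1.0,
               bounds=[(0.0, None)] * m,
               constraints=[{"type": "eq", "fun": lambda a: a @ y}],
               method="SLSQP")
alpha = res.x
w = (alpha * y) @ X              # expansion (1.29)
sv = np.argmax(alpha)            # any pattern with alpha_i > 0 is an SV
b = y[sv] - w @ X[sv]            # from the KKT condition (1.30)
print(w, b, np.sign(X @ w + b))  # the signs reproduce y
```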
Figure 1.6 The idea of SVMs: map the training data into a higher-dimensional feature space via Φ, and construct a separating hyperplane with maximum margin there. This yields a nonlinear decision boundary in input space. By the use of a kernel function (1.2), it is possible to compute the separating hyperplane without explicitly carrying out the map into the feature space.
1.5 Support Vector Classification

We now have all the tools to describe SVMs (Figure 1.6). Everything in the last section was formulated in a dot product space. We think of this space as the feature space H of Section 1.1. To express the formulas in terms of the input patterns in X, we thus need to employ (1.6), which expresses the dot product of bold face feature vectors x, x′ in terms of the kernel k evaluated on input patterns x, x′,

k(x, x′) = ⟨x, x′⟩.   (1.34)
Decision Function
This substitution, which is sometimes referred to as the kernel trick, was used by Boser, Guyon, and Vapnik [62] to extend the Generalized Portrait hyperplane classifier to nonlinear Support Vector Machines. Aizerman, Braverman, and Rozonoer [4] called H the linearization space, and used it in the context of the potential function classification method to express the dot product between elements of H in terms of elements of the input space.

The kernel trick can be applied since all feature vectors only occurred in dot products (see (1.31) and (1.33)). The weight vector (cf. (1.29)) then becomes an expansion in feature space, and will therefore typically no longer correspond to the Φ-image of a single input space vector (cf. Chapter 18). We obtain decision functions of the form (cf. (1.33))

f(x) = sgn (Σ_{i=1}^m y_i α_i ⟨Φ(x), Φ(x_i)⟩ + b) = sgn (Σ_{i=1}^m y_i α_i k(x, x_i) + b),   (1.35)
and the following quadratic program (cf. (1.31)):

maximize_{α ∈ ℝ^m}  W(α) = Σ_{i=1}^m α_i − (1/2) Σ_{i,j=1}^m α_i α_j y_i y_j k(x_i, x_j)   (1.36)
subject to  α_i ≥ 0 for all i = 1, ..., m, and Σ_{i=1}^m α_i y_i = 0.   (1.37)
Figure 1.7 Example of an SV classifier found using a radial basis function kernel k(x, x′) = exp(−‖x − x′‖²) (here, the input space is X = [−1, 1]²). Circles and disks are two classes of training examples; the middle line is the decision surface; the outer lines precisely meet the constraint (1.25). Note that the SVs found by the algorithm (marked by extra circles) are not centers of clusters, but examples which are critical for the given classification task. Gray values code |Σ_{i=1}^m y_i α_i k(x, x_i) + b|, the modulus of the argument of the decision function (1.35). The top and the bottom lines indicate places where it takes the value 1 (from [471]).
Soft Margin Hyperplane
Figure 1.7 shows an example of this approach, using a Gaussian radial basis function kernel. We will later study the different possibilities for the kernel function in detail (Chapters 2 and 13).

In practice, a separating hyperplane may not exist, e.g., if a high noise level causes a large overlap of the classes. To allow for the possibility of examples violating (1.25), one introduces slack variables [111, 561, 481]

ξ_i ≥ 0  for all i = 1, ..., m,   (1.38)
in order to relax the constraints (1.25) to

y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i  for all i = 1, ..., m.   (1.39)
A classifier that generalizes well is then found by controlling both the classifier capacity (via ‖w‖) and the sum of the slacks Σ_i ξ_i. The latter can be shown to provide an upper bound on the number of training errors. One possible realization of such a soft margin classifier is obtained by minimizing the objective function

τ(w, ξ) = (1/2)‖w‖² + C Σ_{i=1}^m ξ_i   (1.40)
subject to the constraints (1.38) and (1.39), where the constant C > 0 determines the trade-off between margin maximization and training error minimization.¹¹ Incorporating a kernel, and rewriting the problem in terms of Lagrange multipliers, this again leads to the problem of maximizing (1.36), subject to the constraints

0 ≤ α_i ≤ C for all i = 1, ..., m, and Σ_{i=1}^m α_i y_i = 0.   (1.41)

11. It is sometimes convenient to scale the sum in (1.40) by C/m rather than C, as done in Chapter 7 below.
The only difference from the separable case is the upper bound C on the Lagrange multipliers α_i. This way, the influence of the individual patterns (which could be outliers) gets limited. As above, the solution takes the form (1.35). The threshold b can be computed by exploiting the fact that for all SVs x_i with α_i < C, the slack variable ξ_i is zero (this again follows from the KKT conditions), and hence

Σ_{j=1}^m α_j y_j k(x_i, x_j) + b = y_i.   (1.42)
Geometrically speaking, choosing b amounts to shifting the hyperplane, and (1.42) states that we have to shift the hyperplane such that the SVs with zero slack variables lie on the ±1 lines of Figure 1.5.

Another possible realization of a soft margin variant of the optimal hyperplane uses the more natural ν-parametrization. In it, the parameter C is replaced by a parameter ν ∈ (0, 1] which can be shown to provide lower and upper bounds for the fraction of examples that will be SVs and those that will have non-zero slack variables, respectively. It uses a primal objective function with the error term (1/(νm)) Σ_i ξ_i − ρ instead of C Σ_i ξ_i (cf. (1.40)), and separation constraints that involve a margin parameter ρ,

y_i(⟨w, x_i⟩ + b) ≥ ρ − ξ_i  for all i = 1, ..., m,   (1.43)

which itself is a variable of the optimization problem. The dual can be shown to consist in maximizing the quadratic part of (1.36), subject to 0 ≤ α_i ≤ 1/(νm), Σ_i α_i y_i = 0, and the additional constraint Σ_i α_i = 1. We shall return to these methods in more detail in Section 7.5.
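In practice, one rarely codes these quadratic programs by hand. As an illustration (our own, assuming scikit-learn is installed; any QP-based SVM package would do), both the C- and the ν-parametrized soft margin classifiers are available off the shelf:

```python
import numpy as np
from sklearn.svm import SVC, NuSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.8, (40, 2)), rng.normal(1.0, 0.8, (40, 2))])
y = np.hstack([-np.ones(40), np.ones(40)])

# C-SVM with Gaussian kernel k(x, x') = exp(-gamma ||x - x'||^2), cf. Figure 1.7;
# C upper-bounds the Lagrange multipliers alpha_i as in (1.41).
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print("SVs per class:", clf.n_support_, "training accuracy:", clf.score(X, y))

# nu-SVM: nu in (0, 1] bounds the fractions of margin errors and of SVs.
nu_clf = NuSVC(kernel="rbf", gamma=1.0, nu=0.2).fit(X, y)
print("SVs per class:", nu_clf.n_support_)
```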
1.6 Support Vector Regression

Let us turn to a problem slightly more general than pattern recognition. Rather than dealing with outputs y ∈ {±1}, regression estimation is concerned with estimating real-valued functions.

ε-Insensitive Loss

To generalize the SV algorithm to the regression case, an analog of the soft margin is constructed in the space of the target values y (note that we now have y ∈ ℝ) by using Vapnik's ε-insensitive loss function [561] (Figure 1.8; see Chapters 3 and 9). This quantifies the loss incurred by predicting f(x) instead of y as

|y − f(x)|_ε := max{0, |y − f(x)| − ε}.   (1.44)

Figure 1.8 In SV regression, a tube with radius ε is fitted to the data. The trade-off between model complexity and points lying outside of the tube (with positive slack variables ξ) is determined by minimizing (1.47).

To estimate a linear regression

f(x) = ⟨w, x⟩ + b,   (1.45)

one minimizes

(1/2)‖w‖² + C Σ_{i=1}^m |y_i − f(x_i)|_ε.   (1.46)

Note that the term ‖w‖² is the same as in pattern recognition (cf. (1.40)); for further details, cf. Chapter 9. We can transform this into a constrained optimization problem by introducing slack variables, akin to the soft margin case. In the present case, we need two types of slack variable for the two cases f(x_i) − y_i > ε and y_i − f(x_i) > ε. We denote them by ξ and ξ*, respectively, and collectively refer to them as ξ^(*). The optimization problem is given by

minimize_{w ∈ H, ξ^(*) ∈ ℝ^m, b ∈ ℝ}  τ(w, ξ, ξ*) = (1/2)‖w‖² + C Σ_{i=1}^m (ξ_i + ξ_i*)   (1.47)
subject to  f(x_i) − y_i ≤ ε + ξ_i,   (1.48)
            y_i − f(x_i) ≤ ε + ξ_i*,   (1.49)
            ξ_i, ξ_i* ≥ 0  for all i = 1, ..., m.   (1.50)
Note that according to (1.48) and (1.49), any error smaller than e does not require a nonzero xi or xi* and hence does not enter the objective function (1.47). Generalization to kernel-based regression estimation is carried out in an analo-
manner to the case of pattern recognition. Introducing Lagrange multipliers, one arrives at the following optimization problem (for C, ε > 0 chosen a priori): maximize

W(α, α*) = −ε ∑ᵢ₌₁ᵐ (αᵢ* + αᵢ) + ∑ᵢ₌₁ᵐ (αᵢ* − αᵢ) yᵢ − ½ ∑ᵢ,ⱼ₌₁ᵐ (αᵢ* − αᵢ)(αⱼ* − αⱼ) k(xᵢ, xⱼ)

subject to 0 ≤ αᵢ, αᵢ* ≤ C and ∑ᵢ₌₁ᵐ (αᵢ − αᵢ*) = 0.
The regression estimate takes the form

f(x) = ∑ᵢ₌₁ᵐ (αᵢ* − αᵢ) k(xᵢ, x) + b,
where b is computed using the fact that (1.48) becomes an equality with ξᵢ = 0 if 0 < αᵢ < C, and (1.49) becomes an equality with ξᵢ* = 0 if 0 < αᵢ* < C (for details, see Chapter 9). The solution thus looks quite similar to the pattern recognition case (cf. (1.35) and Figure 1.9). A number of extensions of this algorithm are possible. From an abstract point of view, we just need some target function which depends on ⟨w, x⟩ (cf. (1.47)). There are multiple degrees of freedom for constructing it, including some freedom in how to penalize, or regularize. For instance, more general loss functions can be used for ξ, leading to problems that can still be solved efficiently ([512, 515], cf. Chapter 9). Moreover, norms other than the 2-norm ‖·‖ can be used to regularize the solution (see Sections 4.9 and 9.4). Finally, the algorithm can be modified such that ε need not be specified a priori. Instead, one specifies an upper bound 0 ≤ ν ≤ 1 on the fraction of points allowed to lie outside the tube (asymptotically, the number of SVs) and the corresponding ε is computed automatically. This is achieved by using as primal objective function

½‖w‖² + C (ν m ε + ∑ᵢ₌₁ᵐ |yᵢ − f(xᵢ)|_ε)
instead of (1.46), and treating ε ≥ 0 as a parameter over which we minimize. For more detail, cf. Section 9.3.
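As a small aside (our own illustration, not from the text), the ε-insensitive loss is a one-liner in code; the following numpy sketch also shows that errors smaller than ε incur no loss, and hence need no slack variables:

```python
# Minimal numpy sketch of the epsilon-insensitive loss
# |y - f(x)|_eps = max(0, |y - f(x)| - eps); names are illustrative only.
import numpy as np

def eps_insensitive(y, f_x, eps=0.1):
    return np.maximum(0.0, np.abs(y - f_x) - eps)

print(eps_insensitive(np.array([1.0, 1.05, 2.0]), np.array([1.0, 1.0, 1.0])))
# -> [0.   0.   0.9]  (the first two errors lie inside the tube)
```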
1.7 Kernel Principal Component Analysis

The kernel method for computing dot products in feature spaces is not restricted to SVMs. Indeed, it has been pointed out that it can be used to develop nonlinear generalizations of any algorithm that can be cast in terms of dot products, such as principal component analysis (PCA) [480]. Principal component analysis is perhaps the most common feature extraction algorithm; for details, see Chapter 14. The term feature extraction commonly refers
to procedures for extracting (real) numbers from patterns which in some sense represent the crucial information contained in these patterns. PCA in feature space leads to an algorithm called kernel PCA. By solving an eigenvalue problem, the algorithm computes nonlinear feature extraction functions

fₙ(x) = ∑ᵢ₌₁ᵐ αᵢⁿ k(xᵢ, x),
where, up to a normalizing constant, the αᵢⁿ are the components of the nth eigenvector of the kernel matrix Kᵢⱼ := k(xᵢ, xⱼ). In a nutshell, this can be understood as follows. To do PCA in H, we wish to find eigenvectors v and eigenvalues λ of the so-called covariance matrix C in the feature space, where

C = (1/m) ∑ⱼ₌₁ᵐ Φ(xⱼ) Φ(xⱼ)ᵀ.
Here, Φ(xᵢ)ᵀ denotes the transpose of Φ(xᵢ) (see Section B.2.1). In the case when H is very high-dimensional, the computational costs of doing this directly are prohibitive. Fortunately, one can show that all solutions to

C v = λ v

with λ ≠ 0 must lie in the span of Φ-images of the training data. Thus, we may expand the solution v as

v = ∑ᵢ₌₁ᵐ αᵢ Φ(xᵢ),
thereby reducing the problem to that of finding the αᵢ. It turns out that this leads to a dual eigenvalue problem for the expansion coefficients,

m λ α = K α,
where α = (α₁, …, α_m)ᵀ. To extract nonlinear features from a test point x, we compute the dot product between Φ(x) and the nth normalized eigenvector in feature space,

⟨vⁿ, Φ(x)⟩ = ∑ᵢ₌₁ᵐ αᵢⁿ k(xᵢ, x).
Usually, this will be computationally far less expensive than taking the dot product in the feature space explicitly. A toy example is given in Chapter 14 (Figure 14.4). As in the case of SVMs, the architecture can be visualized by Figure 1.9.
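The whole procedure fits in a few lines of numpy; the sketch below is ours (the Gaussian kernel and all names are assumptions), solving the dual eigenvalue problem and evaluating the feature extraction functions on test points. For simplicity it omits the centering of the mapped data in feature space that the full treatment in Chapter 14 includes.

```python
# Compact kernel PCA sketch: eigendecompose the kernel matrix and use its
# eigenvectors as expansion coefficients alpha^n (centering omitted).
import numpy as np

def rbf(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_pca_features(X, X_test, n_components=2, sigma=1.0):
    K = rbf(X, X, sigma)
    lam, A = np.linalg.eigh(K)          # columns of A are eigenvectors of K
    lam, A = lam[::-1], A[:, ::-1]      # sort in non-increasing order
    # Normalize so the eigenvectors v^n in feature space have unit length:
    # <v^n, v^n> = lam_n * <alpha^n, alpha^n> = 1.
    A = A[:, :n_components] / np.sqrt(lam[:n_components])
    # Feature extraction: <v^n, Phi(x)> = sum_i alpha_i^n k(x_i, x).
    return rbf(X_test, X, sigma) @ A
```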
Figure 1.9 Architecture of SVMs and related kernel methods. The input x and the expansion patterns (SVs) xᵢ (we assume that we are dealing with handwritten digits) are nonlinearly mapped (by Φ) into a feature space H where dot products are computed. Through the use of the kernel k, these two layers are in practice computed in one step. The results are linearly combined using weights υᵢ, found by solving a quadratic program (in pattern recognition, υᵢ = yᵢαᵢ; in regression estimation, υᵢ = αᵢ* − αᵢ) or an eigenvalue problem (kernel PCA). The linear combination is fed into the function σ (in pattern recognition, σ(x) = sgn(x + b); in regression estimation, σ(x) = x + b; in kernel PCA, σ(x) = x).
1.8 Empirical Results and Implementations
Examples of Kernels
Having described the basics of SVMs, we now summarize some empirical findings. By the use of kernels, the optimal margin classifier was turned into a high-performance classifier. Surprisingly, it was observed that the polynomial kernel

k(x, x′) = ⟨x, x′⟩^d,
the Gaussian

k(x, x′) = exp(−‖x − x′‖² / (2σ²)),
and the sigmoid

k(x, x′) = tanh(κ⟨x, x′⟩ + Θ),
with suitable choices of d ∈ ℕ and σ, κ, Θ ∈ ℝ (here, X ⊂ ℝᴺ), empirically led to SV classifiers with very similar accuracies and SV sets (Section 7.8.2). In this sense, the SV set seems to characterize (or compress) the given task in a manner which
to some extent is independent of the type of kernel (that is, the type of classifier) used, provided the kernel parameters are well adjusted.

Initial work at AT&T Bell Labs focused on OCR (optical character recognition), a problem where the two main issues are classification accuracy and classification speed. Consequently, some effort went into the improvement of SVMs on these issues, leading to the Virtual SV method for incorporating prior knowledge about transformation invariances by transforming SVs (Chapter 7), and the Reduced Set method (Chapter 18) for speeding up classification. Using these procedures, SVMs soon became competitive with the best available classifiers on OCR and other object recognition tasks [87, 57, 419, 438, 134], and later even achieved the world record on the main handwritten digit benchmark dataset [134].

An initial weakness of SVMs, less apparent in OCR applications which are characterized by low noise levels, was that the size of the quadratic programming problem (Chapter 10) scaled with the number of support vectors. This was due to the fact that in (1.36), the quadratic part contained at least all SVs — the common practice was to extract the SVs by going through the training data in chunks while regularly testing for the possibility that patterns initially not identified as SVs become SVs at a later stage. This procedure is referred to as chunking; note that without chunking, the size of the matrix in the quadratic part of the objective function would be m × m, where m is the number of all training examples.

What happens if we have a high-noise problem? In this case, many of the slack variables ξᵢ become nonzero, and all the corresponding examples become SVs. For this case, decomposition algorithms were proposed [398, 409], based on the observation that not only can we leave out the non-SV examples (the xᵢ with αᵢ = 0) from the current chunk, but also some of the SVs, especially those that hit the upper boundary (αᵢ = C). The chunks are usually dealt with using quadratic optimizers. Among the optimizers used for SVMs are LOQO [555], MINOS [380], and variants of conjugate gradient descent, such as the optimizers of Bottou [459] and Burges [85]. Several public domain SV packages and optimizers are listed on the web page http://www.kernel-machines.org. For more details on implementations, see Chapter 10.

Once the SV algorithm had been generalized to regression, researchers started applying it to various problems of estimating real-valued functions. Very good results were obtained on the Boston housing benchmark [529], and on problems of time series prediction (see [376, 371, 351]). Moreover, the SV method was applied to the solution of inverse function estimation problems ([572]; cf. [563, 589]). For overviews, the interested reader is referred to [85, 472, 504, 125].
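In code, the three classical kernels above are one-liners; the sketch below is ours (parameter defaults are illustrative, not values from the text) and operates on matrices whose rows are patterns:

```python
# Polynomial, Gaussian, and sigmoid kernels as plain numpy functions;
# parameter names (d, sigma, kappa, theta) follow the text.
import numpy as np

def polynomial_kernel(X1, X2, d=3):
    return (X1 @ X2.T) ** d

def gaussian_kernel(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def sigmoid_kernel(X1, X2, kappa=1.0, theta=-1.0):
    # Note: not positive definite in general (cf. Section 4.6).
    return np.tanh(kappa * (X1 @ X2.T) + theta)
```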
I
CONCEPTS AND TOOLS
The generic can be more intense than the concrete. J. L. Borges1
We now embark on a more systematic presentation of the concepts and tools underlying Support Vector Machines and other kernel methods. In machine learning problems, we try to discover structure in data. For instance, in pattern recognition and regression estimation, we are given a training set (x₁, y₁), …, (x_m, y_m) ∈ X × Y, and attempt to predict the outputs y for previously unseen inputs x. This is only possible if we have some measure that tells us how (x, y) is related to the training set. Informally, we want similar inputs to lead to similar outputs.² To formalize this, we have to state what we mean by similar. A particularly simple yet surprisingly useful notion of similarity of inputs — the one we will use throughout this book — derives from embedding the data into a Euclidean feature space and utilizing geometrical concepts. Chapter 2 describes how certain classes of kernels induce feature spaces, and how one can compute dot products, and thus angles and distances, without having to explicitly work in these potentially infinite-dimensional spaces. This leads to a rather general class of similarity measures to be used on the inputs.

1. From A History of Eternity, in The Total Library, Penguin, London, 2001.
2. This procedure can be traced back to an old maxim of law: de similibus ad similia eadem ratione procedendum est — from things similar to things similar we are to proceed by the same rule.
On the outputs, similarity is usually measured in terms of a loss function stating how "bad" it is if the predicted y does not match the true one. The training of a learning machine commonly involves a risk functional that contains a term measuring the loss incurred for the training patterns. The concepts of loss and risk are introduced in depth in Chapter 3.

This is not the full story, however. In order to generalize well to the test data, it is not sufficient to "explain" the training data. It is also necessary to control the complexity of the model used for explaining the training data, a task that is often accomplished with the help of regularization terms, as explained in Chapter 4. Specifically, one utilizes objective functions that involve both the empirical loss term and a regularization term.

From a statistical point of view, we can expect the function minimizing a properly chosen objective function to work well on test data, as explained by statistical learning theory (Chapter 5). From a practical point of view, however, it is not at all straightforward to find this minimizer. Indeed, the quality of a loss function or a regularizer should be assessed not only on a statistical basis, but also in terms of the feasibility of the objective function minimization problem. In order to be able to assess this, and in order to obtain a thorough understanding of practical algorithms for this task, we conclude this part of the book with an in-depth review of optimization theory (Chapter 6).

The chapters in this part of the book assume familiarity with basic concepts of linear algebra and probability theory. Readers who would like to refresh their knowledge of these topics may want to consult Appendix B beforehand.
2
Kernels
In Chapter 1, we described how a kernel arises as a similarity measure that can be thought of as a dot product in a so-called feature space. We tried to provide an intuitive understanding of kernels by introducing them as similarity measures, rather than immediately delving into the functional analytic theory of the classes of kernels that actually admit a dot product representation in a feature space. In the present chapter, we will be both more formal and more precise. We will study the class of kernels k that correspond to dot products in feature spaces H via a map Φ,

Φ : X → H, x ↦ Φ(x),   (2.1)

that is,

k(x, x′) = ⟨Φ(x), Φ(x′)⟩ for all x, x′ ∈ X.   (2.2)
Overview
Prerequisites
Regarding the input domain X, we need not make assumptions other than it being a set. For instance, we could consider a set of discrete objects, such as strings. A natural question to ask at this point is what kind of functions k(x, x′) admit a representation of the form (2.2); that is, whether we can always construct a dot product space H and a map Φ mapping into it such that (2.2) holds true. We shall begin, however, by trying to give some motivation as to why kernels are at all useful, considering kernels that compute dot products in spaces of monomial features (Section 2.1). Following this, we move on to the questions of how, given a kernel, an associated feature space can be constructed (Section 2.2). This leads to the notion of a Reproducing Kernel Hilbert Space, crucial for the theory of kernel machines. In Section 2.3, we give some examples and properties of kernels, and in Section 2.4, we discuss a class of kernels that can be used as dissimilarity measures rather than as similarity measures. The chapter builds on knowledge of linear algebra, as briefly summarized in Appendix B. Apart from that, it can be read on its own; however, readers new to the field will profit from first reading Sections 1.1 and 1.2.
2.1 Product Features
Monomial Features
In this section, we think of X as a subset of the vector space ℝᴺ (N ∈ ℕ), endowed with the canonical dot product (1.3). Suppose we are given patterns x ∈ X where most information is contained in the dth order products (so-called monomials) of entries [x]ⱼ of x,

[x]_{j₁} · [x]_{j₂} ⋯ [x]_{j_d},   (2.3)
where j₁, …, j_d ∈ {1, …, N}. Often, these monomials are referred to as product features. These features form the basis of many practical algorithms; indeed, there is a whole field of pattern recognition research studying polynomial classifiers [484], which is based on first extracting product features and then applying learning algorithms to these features. In other words, the patterns are preprocessed by mapping into the feature space H of all products of d entries. This has proven quite effective in visual pattern recognition tasks, for instance. To understand the rationale for doing this, note that visual patterns are usually represented as vectors whose entries are the pixel intensities. Taking products of entries of these vectors then corresponds to taking products of pixel intensities, and is thus akin to taking logical "and" operations on the pixels. Roughly speaking, this corresponds to the intuition that, for instance, a handwritten "8" constitutes an eight if there is a top circle and a bottom circle. With just one of the two circles, it is not half an "8," but rather a "0." Nonlinearities of this type are crucial for achieving high accuracies in pattern recognition tasks. Let us take a look at this feature map in the simple example of two-dimensional patterns, for which X = ℝ². In this case, we can collect all monomial feature extractors of degree 2 in the nonlinear map

Φ : ℝ² → H = ℝ³, ([x]₁, [x]₂) ↦ ([x]₁², [x]₂², [x]₁[x]₂).
This approach works fine for small toy examples, but it fails for realistically sized
problems: for N-dimensional input patterns, there exist

N_H = (N + d − 1 choose d) = (N + d − 1)! / (d! (N − 1)!)   (2.6)
Kernel
different monomials (2.3) of degree d, comprising a feature space H of dimension N_H. For instance, 16 × 16 pixel input images and a monomial degree d = 5 thus yield a dimension of almost 10¹⁰. In certain cases described below, however, there exists a way of computing dot products in these high-dimensional feature spaces without explicitly mapping into the spaces, by means of kernels nonlinear in the input space ℝᴺ. Thus, if the subsequent processing can be carried out using dot products exclusively, we are able to deal with the high dimension. We now describe how dot products in polynomial feature spaces can be computed efficiently, followed by a section in which we discuss more general feature spaces. In order to compute dot products of the form ⟨Φ(x), Φ(x′)⟩, we employ kernel representations of the form

k(x, x′) = ⟨Φ(x), Φ(x′)⟩,   (2.7)
which allow us to compute the value of the dot product in H without having to explicitly compute the map Φ. What does k look like in the case of polynomial features? We start by giving an example for N = d = 2, as considered above [561]. For the map

C₂ : ([x]₁, [x]₂) ↦ ([x]₁², [x]₂², [x]₁[x]₂, [x]₂[x]₁),   (2.8)
(note that for now, we have considered [x]₁[x]₂ and [x]₂[x]₁ as separate features; thus we are looking at ordered monomials) dot products in H take the form

⟨C₂(x), C₂(x′)⟩ = [x]₁²[x′]₁² + [x]₂²[x′]₂² + 2[x]₁[x]₂[x′]₁[x′]₂ = ⟨x, x′⟩².   (2.9)
In other words, the desired kernel k is simply the square of the dot product in input space. The same works for arbitrary N, d ∈ ℕ [62]: as a straightforward generalization of a result proved in the context of polynomial approximation [412, Lemma 2.1], we have:
Polynomial Kernel

Proposition 2.1 Define C_d to map x ∈ ℝᴺ to the vector C_d(x) whose entries are all possible dth degree ordered products of the entries of x. Then the corresponding kernel computing the dot product of vectors mapped by C_d is

k(x, x′) = ⟨C_d(x), C_d(x′)⟩ = ⟨x, x′⟩^d.   (2.10)

Proof  We directly compute

⟨C_d(x), C_d(x′)⟩ = ∑_{j₁,…,j_d=1}^{N} [x]_{j₁} ⋯ [x]_{j_d} · [x′]_{j₁} ⋯ [x′]_{j_d} = (∑_{j=1}^{N} [x]_j [x′]_j)^d = ⟨x, x′⟩^d.
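Proposition 2.1 is easy to check numerically for small N and d; the following sketch (ours) does so for N = d = 2 with the ordered product features (2.8):

```python
# Verify that the ordered features C_2(x) reproduce <x, x'>^2 numerically.
import numpy as np

def C2(x):
    return np.array([x[0]*x[0], x[0]*x[1], x[1]*x[0], x[1]*x[1]])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(C2(x) @ C2(xp))     # 1.0
print((x @ xp) ** 2)      # 1.0 — same value, without computing C2 explicitly
```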
Note that we used the symbol C_d for the feature map. The reason for this is that we would like to reserve Φ_d for the corresponding map computing unordered product features. Let us construct such a map Φ_d, yielding the same value of the dot product. To this end, we have to compensate for the multiple occurrence of certain monomials in C_d by scaling the respective entries of Φ_d with the square roots of their numbers of occurrence. Then, by this construction of Φ_d, and (2.10),

⟨Φ_d(x), Φ_d(x′)⟩ = ⟨C_d(x), C_d(x′)⟩ = ⟨x, x′⟩^d.   (2.11)
For instance, if n of the jᵢ in (2.3) are equal, and the remaining ones are different, then the coefficient in the corresponding component of Φ_d is √((d − n + 1)!). For the general case, see Problem 2.2. For Φ₂, this simply means that [561]

Φ₂(x) = ([x]₁², [x]₂², √2 [x]₁[x]₂),   (2.12)

and hence

⟨Φ₂(x), Φ₂(x′)⟩ = ⟨x, x′⟩².   (2.13)
The above reasoning illustrates an important point pertaining to the construction of feature spaces associated with kernel functions. Although they map into different feature spaces, Φ_d and C_d are both valid instantiations of feature maps for k(x, x′) = ⟨x, x′⟩^d. To illustrate how monomial feature kernels can significantly simplify pattern recognition tasks, let us consider a simple toy example.
Toy Example
Example 2.2 (Monomial Features in 2-D Pattern Recognition) In the example of Figure 2.1, a non-separable problem is reduced to the construction of a separating hyperplane by preprocessing the input data with Φ₂. As we shall see in later chapters, this has advantages both from the computational point of view (there exist efficient algorithms for computing the hyperplane) and from the statistical point of view (there exist guarantees for how well the hyperplane will generalize to unseen test points). In more realistic cases, e.g., if x represents an image with the entries being pixel values, polynomial kernels ⟨x, x′⟩^d enable us to work in the space spanned by products of any d pixel values — provided that we are able to do our work solely in terms of dot products, without any explicit usage of a mapped pattern Φ_d(x). Using kernels of the form (2.10), we can take higher-order statistics into account, without the combinatorial explosion (2.6) of time and memory complexity which accompanies even moderately high N and d. To conclude this section, note that it is possible to modify (2.10) such that it maps into the space of all monomials up to degree d, by defining k(x, x′) = (⟨x, x′⟩ + 1)^d (Problem 2.17). Moreover, in practice, it is often useful to multiply the kernel by a scaling factor c to ensure that its numeric range is within some bounded interval, say [−1, 1]. The value of c will depend on the dimension and range of the data.
Figure 2.1 Toy example of a binary classification problem mapped into feature space. We assume that the true decision boundary is an ellipse in input space (left panel). The task of the learning process is to estimate this boundary based on empirical data consisting of training points in both classes (crosses and circles, respectively). When mapped into feature space via the nonlinear map Φ₂(x) = (z₁, z₂, z₃) = ([x]₁², [x]₂², √2 [x]₁[x]₂) (right panel), the ellipse becomes a hyperplane (in the present simple case, it is parallel to the z₃ axis, hence all points are plotted in the (z₁, z₂) plane). This is due to the fact that ellipses can be written as linear equations in the entries of (z₁, z₂, z₃). Therefore, in feature space, the problem reduces to that of estimating a hyperplane from the mapped data points. Note that via the polynomial kernel (see (2.12) and (2.13)), the dot product in the three-dimensional space can be computed without computing Φ₂. Later in the book, we shall describe algorithms for constructing hyperplanes which are based on dot products (Chapter 7).
2.2 The Representation of Similarities in Linear Spaces

In what follows, we will look at things the other way round, and start with the kernel rather than with the feature map. Given some kernel, can we construct a feature space such that the kernel computes the dot product in that feature space; that is, such that (2.2) holds? This question has been brought to the attention of the machine learning community in a variety of contexts, especially during recent years [4, 152, 62, 561, 480]. In functional analysis, the same problem has been studied under the heading of Hilbert space representations of kernels. A good monograph on the theory of kernels is the book of Berg, Christensen, and Ressel [42]; indeed, a large part of the material in the present chapter is based on this work. We do not aim to be fully rigorous; instead, we try to provide insight into the basic ideas. As a rule, all the results that we state without proof can be found in [42]. Other standard references include [16, 455]. There is one more aspect in which this section differs from the previous one: the latter dealt with vectorial data, and the domain X was assumed to be a subset of ℝᴺ. By contrast, the results in the current section hold for data drawn from domains which need no structure, other than their being nonempty sets. This generalizes kernel learning algorithms to a large number of situations where a vectorial representation is not readily available, and where one directly works
with pairwise distances or similarities between non-vectorial objects [246, 467, 154, 210, 234, 585]. This theme will recur in several places throughout the book, for instance in Chapter 13.

2.2.1 Positive Definite Kernels
We start with some basic definitions and results. As in the previous chapter, indices i and j are understood to run over 1, …, m.
Gram Matrix
Definition 2.3 (Gram Matrix) Given a function k : X² → 𝕂 (where 𝕂 = ℂ or 𝕂 = ℝ) and patterns x₁, …, x_m ∈ X, the m × m matrix K with elements

Kᵢⱼ := k(xᵢ, xⱼ)   (2.14)

is called the Gram matrix (or kernel matrix) of k with respect to x₁, …, x_m.

PD Matrix
Definition 2.4 (Positive Definite Matrix) A complex m × m matrix K satisfying

∑_{i,j} cᵢ c̄ⱼ Kᵢⱼ ≥ 0   (2.15)

for all cᵢ ∈ ℂ is called positive definite.¹ Similarly, a real symmetric m × m matrix K satisfying (2.15) for all cᵢ ∈ ℝ is called positive definite. Note that a symmetric matrix is positive definite if and only if all its eigenvalues are nonnegative (Problem 2.4). The left hand side of (2.15) is often referred to as the quadratic form induced by K.
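Since positive definiteness of a symmetric matrix is equivalent to nonnegativity of its eigenvalues, a Gram matrix can be tested numerically; the helper below is our own sketch, not part of the text:

```python
# Check positive definiteness (in the book's sense: nonnegative eigenvalues)
# of a real symmetric Gram matrix, cf. Definition 2.4.
import numpy as np

def is_positive_definite(K, tol=1e-10):
    assert np.allclose(K, K.T), "Gram matrix of a real kernel must be symmetric"
    return np.linalg.eigvalsh(K).min() >= -tol
```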
PD Kernel
Definition 2.5 ((Positive Definite) Kernel) Let X be a nonempty set. A function k on X × X which for all m ∈ ℕ and all x₁, …, x_m ∈ X gives rise to a positive definite Gram matrix is called a positive definite (pd) kernel. Often, we shall refer to it simply as a kernel.

Remark 2.6 (Terminology) The term kernel stems from the first use of this type of function in the field of integral operators as studied by Hilbert and others [243, 359, 112]. A function k which gives rise to an operator T_k via

(T_k f)(x) := ∫_X k(x, x′) f(x′) dx′   (2.16)
is called the kernel of T_k. In the literature, a number of different terms are used for positive definite kernels, such as reproducing kernel, Mercer kernel, admissible kernel, Support Vector kernel, nonnegative definite kernel, and covariance function. One might argue that the term positive definite kernel is slightly misleading. In matrix theory, the term definite is sometimes reserved for the case where equality in (2.15) only occurs if c₁ = ⋯ = c_m = 0.

1. The bar in c̄ⱼ denotes complex conjugation; for real numbers, it has no effect.
Simply using the term positive kernel, on the other hand, could be mistaken as referring to a kernel whose values are positive. Finally, the term positive semidefinite kernel becomes rather cumbersome if it is to be used throughout a book. Therefore, we follow the convention used for instance in [42], and employ the term positive definite both for kernels and matrices in the way introduced above. The case where the value 0 is only attained if all coefficients are 0 will be referred to as strictly positive definite. We shall mostly use the term kernel. Whenever we want to refer to a kernel k(x, x′) which is not positive definite in the sense stated above, it will be clear from the context. The definitions for positive definite kernels and positive definite matrices differ in the fact that in the former case, we are free to choose the points on which the kernel is evaluated — for every choice, the kernel induces a positive definite matrix. Positive definiteness implies positivity on the diagonal (Problem 2.12),

k(x, x) ≥ 0 for all x ∈ X,   (2.17)
and symmetry (Problem 2.13),

k(x, x′) = \overline{k(x′, x)}.   (2.18)
Real-Valued Kernels
To also cover the complex-valued case, our definition of symmetry includes complex conjugation. The definition of symmetry of matrices is analogous; that is, Kᵢⱼ = K̄ⱼᵢ. For real-valued kernels, it is not sufficient to stipulate that (2.15) hold for real coefficients cᵢ. To get away with real coefficients only, we must additionally require that the kernel be symmetric (Problem 2.14): k(xᵢ, xⱼ) = k(xⱼ, xᵢ) (cf. Problem 2.13). It can be shown that whenever k is a (complex-valued) positive definite kernel, its real part is a (real-valued) positive definite kernel. Below, we shall largely be dealing with real-valued kernels. Most of the results, however, also apply for complex-valued kernels. Kernels can be regarded as generalized dot products. Indeed, any dot product is a kernel (Problem 2.5); however, linearity in the arguments, which is a standard property of dot products, does not carry over to general kernels. However, another property of dot products, the Cauchy-Schwarz inequality, does have a natural generalization to kernels:

Proposition 2.7 (Cauchy-Schwarz Inequality for Kernels) If k is a positive definite kernel, and x₁, x₂ ∈ X, then

|k(x₁, x₂)|² ≤ k(x₁, x₁) · k(x₂, x₂).   (2.19)
Proof  For the sake of brevity, we give a non-elementary proof using some basic facts of linear algebra. The 2 × 2 Gram matrix with entries Kᵢⱼ = k(xᵢ, xⱼ) (i, j ∈ {1, 2}) is positive definite. Hence both its eigenvalues are nonnegative, and so is their product, the determinant of K. Therefore

0 ≤ K₁₁ K₂₂ − K₁₂ K₂₁ = K₁₁ K₂₂ − |K₁₂|².   (2.20)
Figure 2.2 One instantiation of the feature map associated with a kernel is the map (2.21), which represents each pattern (in the picture, x or x′) by a kernel-shaped function sitting on the pattern. In this sense, each pattern is represented by its similarity to all other patterns. In the picture, the kernel is assumed to be bell-shaped, e.g., a Gaussian k(x, x′) = exp(−‖x − x′‖²/(2σ²)). In the text, we describe the construction of a dot product ⟨·, ·⟩ on the function space such that k(x, x′) = ⟨Φ(x), Φ(x′)⟩.
Substituting k(xᵢ, xⱼ) for Kᵢⱼ, we get the desired inequality.

We now show how the feature spaces in question are defined by the choice of kernel function.

2.2.2 The Reproducing Kernel Map

Feature Map
Assume that k is a real-valued positive definite kernel, and X a nonempty set. We define a map from X into the space of functions mapping X into ℝ, denoted as ℝ^X := {f : X → ℝ}, via

Φ : X → ℝ^X, x ↦ k(·, x).   (2.21)
Here, Φ(x) denotes the function that assigns the value k(x′, x) to x′ ∈ X; that is, Φ(x)(·) = k(·, x) (as shown in Figure 2.2). We have thus turned each pattern into a function on the domain X. In this sense, a pattern is now represented by its similarity to all other points in the input domain X. This seems a very rich representation; nevertheless, it will turn out that the kernel allows the computation of the dot product in this representation. Below, we show how to construct a feature space associated with Φ, proceeding in the following steps:

1. turn the image of Φ into a vector space,
2. define a dot product, that is, a strictly positive definite bilinear form, and
3. show that the dot product satisfies k(x, x′) = ⟨Φ(x), Φ(x′)⟩.
Vector Space
We begin by constructing a dot product space containing the images of the input patterns under Φ. To this end, we first need to define a vector space. This is done by taking linear combinations of the form

f(·) = ∑ᵢ₌₁ᵐ αᵢ k(·, xᵢ).   (2.22)
Here, m ∈ ℕ, αᵢ ∈ ℝ and x₁, …, x_m ∈ X are arbitrary. Next, we define a dot product
between f and another function

g(·) = ∑ⱼ₌₁^{m′} βⱼ k(·, x′ⱼ),   (2.23)
Dot Product
where m′ ∈ ℕ, βⱼ ∈ ℝ and x′₁, …, x′_{m′} ∈ X, as

⟨f, g⟩ := ∑ᵢ₌₁ᵐ ∑ⱼ₌₁^{m′} αᵢ βⱼ k(xᵢ, x′ⱼ).   (2.24)
This expression explicitly contains the expansion coefficients, which need not be unique. To see that it is nevertheless well-defined, note that

⟨f, g⟩ = ∑ⱼ₌₁^{m′} βⱼ f(x′ⱼ),   (2.25)
using k(x′ⱼ, xᵢ) = k(xᵢ, x′ⱼ). The sum in (2.25), however, does not depend on the particular expansion of f. Similarly, for g, note that

⟨f, g⟩ = ∑ᵢ₌₁ᵐ αᵢ g(xᵢ).   (2.26)
The last two equations also show that ⟨·, ·⟩ is bilinear. It is symmetric, as ⟨f, g⟩ = ⟨g, f⟩. Moreover, it is positive definite, since positive definiteness of k implies that for any function f, written as (2.22), we have

⟨f, f⟩ = ∑ᵢ,ⱼ₌₁ᵐ αᵢ αⱼ k(xᵢ, xⱼ) ≥ 0.   (2.27)
The latter implies that ⟨·, ·⟩ is actually itself a positive definite kernel, defined on our space of functions. To see this, note that given functions f₁, …, fₙ, and coefficients γ₁, …, γₙ ∈ ℝ, we have

∑ᵢ,ⱼ₌₁ⁿ γᵢ γⱼ ⟨fᵢ, fⱼ⟩ = ⟨∑ᵢ₌₁ⁿ γᵢ fᵢ, ∑ⱼ₌₁ⁿ γⱼ fⱼ⟩ ≥ 0.   (2.28)
Here, the left hand equality follows from the bilinearity of ⟨·, ·⟩, and the right hand inequality from (2.27). For the last step in proving that it qualifies as a dot product, we will use the following interesting property of Φ, which follows directly from the definition: for all functions (2.22), we have

⟨k(·, x), f⟩ = f(x)   (2.29)
— k is the representer of evaluation. In particular,

⟨k(·, x), k(·, x′)⟩ = k(x, x′).   (2.30)
Reproducing Kernel
By virtue of these properties, positive definite kernels k are also called reproducing kernels [16, 42, 455, 578, 467, 202]. By (2.29) and Proposition 2.7, we have

|f(x)|² = |⟨k(·, x), f⟩|² ≤ k(x, x) · ⟨f, f⟩.   (2.31)
Therefore, ⟨f, f⟩ = 0 directly implies f = 0, which is the last property that required proof in order to establish that ⟨·, ·⟩ is a dot product (cf. Section B.2). The case of complex-valued kernels can be dealt with using the same construction; in that case, we will end up with a complex dot product space [42]. The above reasoning has shown that any positive definite kernel can be thought of as a dot product in another space: in view of (2.21), the reproducing kernel property (2.30) amounts to

k(x, x′) = ⟨Φ(x), Φ(x′)⟩.   (2.32)
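The construction above can be mimicked numerically. In the sketch below (ours, with an assumed Gaussian kernel), functions are stored as expansion coefficients together with their expansion points, the dot product (2.24) is a double sum of kernel evaluations, and the reproducing property (2.29) is verified on an example:

```python
# Patterns as functions f = sum_i alpha_i k(., x_i); the dot product (2.24)
# needs only kernel evaluations between expansion points.
import numpy as np

def gaussian(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def dot(alpha, X, beta, Xp, k=gaussian):
    # <f, g> = sum_i sum_j alpha_i beta_j k(x_i, x'_j), cf. (2.24)
    return sum(a * b * k(x, xp)
               for a, x in zip(alpha, X) for b, xp in zip(beta, Xp))

# Reproducing property (2.29): <k(., x), f> = f(x)
X = [np.array([0.0]), np.array([1.0])]
alpha = [0.5, -2.0]
x = np.array([0.3])
f_x = sum(a * gaussian(xi, x) for a, xi in zip(alpha, X))
print(np.isclose(dot([1.0], [x], alpha, X), f_x))  # True
```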
Kernels from Feature Maps
Therefore, the dot product space H constructed in this way is one possible instantiation of the feature space associated with a kernel. Above, we have started with the kernel, and constructed a feature map. Let us now consider the opposite direction. Whenever we have a mapping Φ from X into a dot product space, we obtain a positive definite kernel via k(x, x′) := ⟨Φ(x), Φ(x′)⟩, since for all cᵢ ∈ ℝ,

∑ᵢ,ⱼ cᵢ cⱼ k(xᵢ, xⱼ) = ⟨∑ᵢ cᵢ Φ(xᵢ), ∑ⱼ cⱼ Φ(xⱼ)⟩ = ‖∑ᵢ cᵢ Φ(xᵢ)‖² ≥ 0,
Equivalent Definition of PD Kernels
Kernel Trick
due to the nonnegativity of the norm. This has two consequences. First, it allows us to give an equivalent definition of positive definite kernels as functions with the property that there exists a map Φ into a dot product space such that (2.32) holds true. Second, it allows us to construct kernels from feature maps. For instance, it is in this way that powerful linear representations of 3D heads proposed in computer graphics [575, 59] give rise to kernels. The identity (2.32) forms the basis for the kernel trick:

Remark 2.8 ("Kernel Trick") Given an algorithm which is formulated in terms of a positive definite kernel k, one can construct an alternative algorithm by replacing k by another positive definite kernel k̃.

In view of the material in the present section, the justification for this procedure is the following: effectively, the original algorithm can be thought of as a dot product based algorithm operating on vectorial data Φ(x₁), …, Φ(x_m). The algorithm obtained by replacing k by k̃ then is exactly the same dot product based algorithm, only that it operates on Φ̃(x₁), …, Φ̃(x_m). The best known application of the kernel trick is in the case where k is the dot product in the input domain (cf. Problem 2.5). The trick is not limited to that case, however: k and k̃ can both be nonlinear kernels. In general, care must be exercised in determining whether the resulting algorithm will be useful: sometimes, an algorithm will only work subject to additional conditions on the input data, e.g., the data set might have to lie in the positive orthant. We shall later see that certain kernels induce feature maps which enforce such properties for the mapped data (cf. (2.73)), and that there are algorithms which take advantage of these aspects (e.g., in Chapter 8). In such cases, not every conceivable positive definite kernel
Historical Remarks
will make sense. Even though the kernel trick had been used in the literature for a fair amount of time [4, 62], it took until the mid 1990s before it was explicitly stated that any algorithm that only depends on dot products, i.e., any algorithm that is rotationally invariant, can be kernelized [479, 480]. Since then, a number of algorithms have benefitted from the kernel trick, such as the ones described in the present book, as well as methods for clustering in feature spaces [479, 215, 199]. Moreover, the machine learning community took time to comprehend that the definition of kernels on general sets (rather than dot product spaces) greatly extends the applicability of kernel methods [467], to data types such as texts and other sequences [234, 585, 23]. Indeed, this is now recognized as a crucial feature of kernels: they lead to an embedding of general data types in linear spaces. Not surprisingly, the history of methods for representing kernels in linear spaces (in other words, the mathematical counterpart of the kernel trick) dates back significantly further than their use in machine learning. The methods appear to have first been studied in the 1940s by Kolmogorov [304] for countable X and Aronszajn [16] in the general case. Pioneering work on linear representations of a related class of kernels, to be described in Section 2.4, was done by Schoenberg [465]. Further bibliographical comments can be found in [42]. We thus see that the mathematical basis for kernel algorithms has been around for a long time. As is often the case, however, the practical importance of mathematical results was initially underestimated.²

2.2.3 Reproducing Kernel Hilbert Spaces
In the last section, we described how to define a space of functions which is a valid realization of the feature spaces associated with a given kernel. To do this, we had to make sure that the space is a vector space, and that it is endowed with a dot product. Such spaces are referred to as dot product spaces (cf. Appendix B), or equivalently as pre-Hilbert spaces. The reason for the latter is that one can turn them into Hilbert spaces (cf. Section B.3) by a fairly simple mathematical trick. This additional structure has some mathematical advantages. For instance, in Hilbert spaces it is always possible to define projections. Indeed, Hilbert spaces are one of the favorite concepts of functional analysis. So let us again consider the pre-Hilbert space of functions (2.22), endowed with the dot product (2.24). To turn it into a Hilbert space (over ℝ), one completes it in the norm corresponding to the dot product, ‖f‖ := √⟨f, f⟩. This is done by adding the limit points of sequences that are convergent in that norm (see Appendix B).

2. This is illustrated by the following quotation from an excellent machine learning textbook published in the seventies (p. 174 in [152]): "The familiar functions of mathematical physics are eigenfunctions of symmetric kernels, and their use is often suggested for the construction of potential functions. However, these suggestions are more appealing for their mathematical beauty than their practical usefulness."
RKHS
In view of the properties (2.29) and (2.30), this space is usually called a reproducing kernel Hilbert space (RKHS). In general, an RKHS can be defined as follows.

Definition 2.9 (Reproducing Kernel Hilbert Space) Let X be a nonempty set (often called the index set) and H a Hilbert space of functions f : X → ℝ. Then H is called a reproducing kernel Hilbert space endowed with the dot product ⟨·, ·⟩ (and the norm ‖f‖ := √⟨f, f⟩) if there exists a function k : X × X → ℝ with the following properties.
Reproducing Property
1. k has the reproducing property³

⟨f, k(x, ·)⟩ = f(x) for all f ∈ H;   (2.34)
in particular,

⟨k(x, ·), k(x′, ·)⟩ = k(x, x′).   (2.35)
Closed Space
2. k spans H; i.e., H = \overline{span{k(x, ·) | x ∈ X}}, where the bar denotes the completion of the set (cf. Appendix B).

On a more abstract level, an RKHS can be defined as a Hilbert space of functions f on X such that all evaluation functionals (the maps f ↦ f(x′), where x′ ∈ X) are continuous. In that case, by the Riesz representation theorem (e.g., [429]), for each x′ ∈ X there exists a unique function of x, called k(x, x′), such that

f(x′) = ⟨f, k(·, x′)⟩.   (2.36)
Uniqueness of k
It follows directly from (2.35) that k(x, x′) is symmetric in its arguments (see Problem 2.28) and satisfies the conditions for positive definiteness. Note that the RKHS uniquely determines k. This can be shown by contradiction: assume that there exist two kernels, say k and k′, spanning the same RKHS H. From Problem 2.28 we know that both k and k′ must be symmetric. Moreover, from (2.34) we conclude that

k(x, x′) = ⟨k(x, ·), k′(x′, ·)⟩ = ⟨k′(x′, ·), k(x, ·)⟩ = k′(x′, x).
In the second equality we used the symmetry of the dot product. Finally, symmetry in the arguments of k yields k(x, x′) = k′(x, x′), which proves our claim.

2.2.4 The Mercer Kernel Map
Section 2.2.2 has shown that any positive definite kernel can be represented as a dot product in a linear space. This was done by explicitly constructing a (Hilbert) space that does the job. The present section will construct another Hilbert space.

3. Note that this implies that each f ∈ H is actually a single function whose values at any x ∈ X are well-defined. In contrast, L₂ Hilbert spaces usually do not have this property. The elements of these spaces are equivalence classes of functions that disagree only on sets of measure 0; cf. footnote 15 in Section B.3.
Mercer's Theorem
One could argue that this is superfluous, given that any two separable Hilbert spaces are isometrically isomorphic; in other words, it is possible to define a one-to-one linear map between the spaces which preserves the dot product. However, the tool that we shall presently use, Mercer's theorem, has played a crucial role in the understanding of SVMs, and it provides valuable insight into the geometry of feature spaces, which more than justifies its detailed discussion. In the SVM literature, the kernel trick is usually introduced via Mercer's theorem. We start by stating the version of Mercer's theorem given in [606]. We assume (X, μ) to be a finite measure space.⁴ The term almost all (cf. Appendix B) means except for sets of measure zero. For the commonly used Lebesgue-Borel measure, countable sets of individual points are examples of zero measure sets. Note that the integral with respect to a measure is explained in Appendix B. Readers who do not want to go into mathematical detail may simply want to think of the dμ(x′) as a dx′, and of X as a compact subset of ℝᴺ. For further explanations of the terms involved in this theorem, cf. Appendix B, especially Section B.3.

Theorem 2.10 (Mercer [359, 307]) Suppose k ∈ L∞(X²) is a symmetric real-valued function such that the integral operator (cf. (2.16))

T_k : L₂(X) → L₂(X), (T_k f)(x) := ∫_X k(x, x′) f(x′) dμ(x′),
is positive definite; that is, for all f ∈ L₂(X), we have

∫_{X²} k(x, x′) f(x) f(x′) dμ(x) dμ(x′) ≥ 0.
Let ψⱼ ∈ L₂(X) be the normalized orthogonal eigenfunctions of T_k associated with the eigenvalues λⱼ > 0, sorted in non-increasing order. Then

1. (λⱼ)ⱼ ∈ ℓ₁,

2. k(x, x′) = ∑_{j=1}^{N_H} λⱼ ψⱼ(x) ψⱼ(x′) holds for almost all (x, x′). Either N_H ∈ ℕ, or N_H = ∞; in the latter case, the series converges absolutely and uniformly for almost all (x, x′).

For the converse of Theorem 2.10, see Problem 2.23. For a data-dependent approximation and its relationship to kernel PCA (Section 1.7), see Problem 2.26. From statement 2 it follows that k(x, x′) corresponds to a dot product in ℓ₂^{N_H}, since k(x, x′) = ⟨Φ(x), Φ(x′)⟩ with

Φ : X → ℓ₂^{N_H}, x ↦ (√λⱼ ψⱼ(x))_{j=1,…,N_H},   (2.40)
for almost all x ∈ X. Note that we use the same Φ as in (2.21) to denote the feature

4. A finite measure space is a set X with a σ-algebra (Definition B.1) defined on it, and a measure (Definition B.2) defined on the latter, satisfying μ(X) < ∞ (so that, up to a scaling factor, μ is a probability measure).
map, although the target spaces are different. However, this distinction is not important for the present purposes; we are interested in the existence of some Hilbert space in which the kernel corresponds to the dot product, and not in what particular representation of it we are using. In fact, it has been noted [467] that the uniform convergence of the series implies that given any ε > 0, there exists an n ∈ ℕ such that even if N_H = ∞, k can be approximated within accuracy ε as a dot product in ℝⁿ: for almost all x, x′ ∈ X, |k(x, x′) − ⟨Φₙ(x), Φₙ(x′)⟩| < ε, where Φₙ : x ↦ (√λ₁ ψ₁(x), …, √λₙ ψₙ(x)). The feature space can thus always be thought of as finite-dimensional within some accuracy ε. We summarize our findings in the following proposition.
Mercer Feature Map
Proposition 2.11 (Mercer Kernel Map) If k is a kernel satisfying the conditions of Theorem 2.10, we can construct a mapping Φ into a space where k acts as a dot product,

⟨Φ(x), Φ(x′)⟩ = k(x, x′),

for almost all x, x′ ∈ X. Moreover, given any ε > 0, there exists a map Φₙ into an n-dimensional dot product space (where n ∈ ℕ depends on ε) such that

|k(x, x′) − ⟨Φₙ(x), Φₙ(x′)⟩| < ε
for almost all x, x′ ∈ X. Both Mercer kernels and positive definite kernels can thus be represented as dot products in Hilbert spaces. The following proposition, showing a case where the two types of kernels coincide, thus comes as no surprise.

Proposition 2.12 (Mercer Kernels are Positive Definite [359, 42]) Let X = [a, b] be a compact interval and let k : [a, b] × [a, b] → ℂ be continuous. Then k is a positive definite kernel if and only if

∫_a^b ∫_a^b k(x, x′) f(x) \overline{f(x′)} dx dx′ ≥ 0
for each continuous function f : X → ℂ. Note that the conditions in this proposition are actually more restrictive than those of Theorem 2.10. Using the feature space representation (Proposition 2.11), however, it is easy to see that Mercer kernels are also positive definite (for almost all x, x′ ∈ X) in the more general case of Theorem 2.10: given any c ∈ ℝᵐ, we have

∑ᵢ,ⱼ₌₁ᵐ cᵢ cⱼ k(xᵢ, xⱼ) = ⟨∑ᵢ₌₁ᵐ cᵢ Φ(xᵢ), ∑ⱼ₌₁ᵐ cⱼ Φ(xⱼ)⟩ = ‖∑ᵢ₌₁ᵐ cᵢ Φ(xᵢ)‖² ≥ 0.
Being positive definite, Mercer kernels are thus also reproducing kernels. We next show how the reproducing kernel map is related to the Mercer kernel map constructed from the eigenfunction decomposition [202,467]. To this end, let us consider a kernel which satisfies the condition of Theorem 2.10, and construct
a dot product ⟨·, ·⟩ such that k becomes a reproducing kernel for the Hilbert space H containing the functions

f(x) = ∑ᵢ αᵢ k(xᵢ, x) = ∑ᵢ αᵢ ∑ⱼ λⱼ ψⱼ(xᵢ) ψⱼ(x).   (2.45)
By linearity, which holds for any dot product, we have

⟨f, k(x, ·)⟩ = ∑ᵢ αᵢ ∑ⱼ,ₙ λⱼ ψⱼ(xᵢ) λₙ ψₙ(x) ⟨ψⱼ, ψₙ⟩.   (2.46)
Since k is a Mercer kernel, the ψᵢ (i = 1, …, N_H) can be chosen to be orthogonal with respect to the dot product in L₂(X). Hence it is straightforward to choose ⟨·, ·⟩ such that

⟨ψⱼ, ψₙ⟩ = δⱼₙ / λⱼ   (2.47)
Equivalence of Feature Spaces
(using the Kronecker symbol δⱼₙ, see (B.30)), in which case (2.46) reduces to the reproducing kernel property (2.36) (using (2.45)). For a coordinate representation in the RKHS, see Problem 2.29. The above connection between the Mercer kernel map and the RKHS map is instructive, but we shall rarely make use of it. In fact, we will usually identify the different feature spaces. Thus, to avoid confusion in subsequent chapters, the following comments are necessary. As described above, there are different ways of constructing feature spaces for any given kernel. In fact, they can even differ in terms of their dimensionality (cf. Problem 2.22). The two feature spaces that we will mostly use in this book are the RKHS associated with k (Section 2.2.2) and the Mercer ℓ₂ feature space. We will mostly use the same symbol H for all feature spaces that are associated with a given kernel. This makes sense provided that everything we do, at the end of the day, reduces to dot products. For instance, let us assume that Φ₁, Φ₂ are maps into the feature spaces H₁, H₂ respectively, both associated with the kernel k; in other words,

k(x, x′) = ⟨Φ₁(x), Φ₁(x′)⟩_{H₁} = ⟨Φ₂(x), Φ₂(x′)⟩_{H₂}.   (2.48)
Then it will usually not be the case that Φ₁(x) = Φ₂(x); due to (2.48), however, we always have ⟨Φ₁(x), Φ₁(x′)⟩_{H₁} = ⟨Φ₂(x), Φ₂(x′)⟩_{H₂}. Therefore, as long as we are only interested in dot products, the two spaces can be considered identical. An example of this identity is the so-called large margin regularizer that is usually used in SVMs, as discussed in the introductory chapter (cf. also Chapters 4 and 7),

‖w‖², where w = ∑ᵢ₌₁ᵐ αᵢ Φ(xᵢ).   (2.49)
No matter whether Φ is the RKHS map Φ(xᵢ) = k(·, xᵢ) (2.21) or the Mercer map Φ(xᵢ) = (√λⱼ ψⱼ(xᵢ))_{j=1,…,N_H} (2.40), the value of ‖w‖² will not change. This point is of great importance, and we hope that all readers are still with us.
It is fair to say, however, that Section 2.2.5 can be skipped at first reading.

2.2.5 The Shape of the Mapped Data in Feature Space
Using Mercer's theorem, we have shown that one can think of the feature map as a map into a high- or infinite-dimensional Hilbert space. The argument in the remainder of the section shows that this typically entails that the mapped data Φ(X) lie in some box with rapidly decaying side lengths [606]. By this we mean that the range of the data decreases as the dimension index j increases, with a rate that depends on the size of the eigenvalues. Let us assume that for all j ∈ ℕ, we have sup_{x∈X} λⱼ |ψⱼ(x)|² < ∞. Define the sequence

lⱼ := sup_{x∈X} λⱼ |ψⱼ(x)|².   (2.50)
Note that if

C_k := sup_{j∈ℕ} sup_{x∈X} |ψⱼ(x)|   (2.51)

exists (see Problem 2.24), then we have lⱼ ≤ λⱼ C_k². However, if the λⱼ decay rapidly, then (2.50) can be finite even if (2.51) is not. By construction, Φ(X) is contained in an axis-parallel parallelepiped in ℓ₂^{N_H} with side lengths 2√lⱼ (cf. (2.40)).⁵ Consider an example of a common kernel, the Gaussian, and let μ (see Theorem 2.10) be the Lebesgue measure. In this case, the eigenvectors are sine and cosine functions (with supremum one), and thus the sequence of the lⱼ coincides with the sequence of the eigenvalues λⱼ. Generally, whenever sup_{x∈X} |ψⱼ(x)|² is finite, the lⱼ decay as fast as the λⱼ. We shall see in Sections 4.4, 4.5 and Chapter 12 that for many common kernels, this decay is very rapid. It will be useful to consider operators that map Φ(X) into balls of some radius R centered at the origin. The following proposition characterizes a class of such operators, determined by the sequence (lⱼ)_{j∈ℕ}. Recall that ℝ^ℕ denotes the space of all real sequences.
Proposition 2.13 (Mapping Φ(X) into ℓ₂) Let S be the diagonal map

S : (xⱼ)ⱼ ↦ (sⱼ xⱼ)ⱼ,

where (sⱼ)ⱼ ∈ ℝ^ℕ. If (sⱼ √lⱼ)ⱼ ∈ ℓ₂, then S maps Φ(X) into a ball centered at the origin whose radius is R = ‖(sⱼ √lⱼ)ⱼ‖_{ℓ₂}.
5. In fact, it is sufficient to use the essential supremum in (2.50). In that case, subsequent statements also only hold true almost everywhere.
2.2
The Representation of Similarities in Linear Spaces
Proof
41
Suppose (sⱼ √lⱼ)ⱼ ∈ ℓ₂. Using the Mercer map (2.40), we have

‖S Φ(x)‖² = ∑ⱼ (sⱼ √λⱼ ψⱼ(x))² ≤ ∑ⱼ sⱼ² lⱼ = R²

for any x ∈ X. Hence S Φ(X) is contained in a ball of radius R centered at the origin of ℓ₂.
The converse is not necessarily the case. To see this, note that if (sⱼ √lⱼ)ⱼ ∉ ℓ₂, amounting to saying that ∑ⱼ sⱼ² lⱼ is not finite, then there need not always exist an x ∈ X for which ∑ⱼ (sⱼ √λⱼ ψⱼ(x))² is not finite.

To see how the freedom to rescale Φ(X) effectively restricts the class of functions we are using, we first note that everything in the feature space H = ℓ₂^{N_H} is done in terms of dot products. Therefore, we can compensate any invertible symmetric linear transformation of the data in H by the inverse transformation on the set of admissible weight vectors in H. In other words, for any invertible symmetric operator S on H, we have ⟨S⁻¹w, SΦ(x)⟩ = ⟨w, Φ(x)⟩ for all x ∈ X. As we shall see below (cf. Theorem 5.5, Section 12.4, and Problem 7.5), there exists a class of generalization error bounds that depend on the radius R of the smallest sphere containing the data. If the (lᵢ)ᵢ decay rapidly, we are not actually "making use" of the whole sphere. In this case, we may construct a diagonal scaling operator S which inflates the sides of the above parallelepiped as much as possible, while ensuring that it is still contained within a sphere of the original radius R in H (Figure 2.3). By effectively reducing the size of the function class, this will provide a way of strengthening the bounds. A similar idea, using kernel PCA (Section 14.2) to determine empirical scaling coefficients, has been successfully applied by [101].

Continuity of Φ

We conclude this section with another useful insight that characterizes a property of the feature map Φ. Note that most of what was said so far applies to the case where the input domain X is a general set. In this case, it is not possible to make nontrivial statements about continuity properties of Φ. This changes if we assume X to be endowed with a notion of closeness, by turning it into a so-called topological space. Readers not familiar with this concept will be reassured to hear that Euclidean vector spaces are particular cases of topological spaces.

Proposition 2.14 (Continuity of the Feature Map [402]) If X is a topological space and k is a continuous positive definite kernel on X × X, then there exists a Hilbert space H and a continuous map Φ : X → H such that for all x, x′ ∈ X, we have k(x, x′) = ⟨Φ(x), Φ(x′)⟩.
Figure 2.3 Since everything is done in terms of dot products, scaling up the data by an operator S can be compensated by scaling the weight vectors with S⁻¹ (cf. text). By choosing S such that the data are still contained in a ball of the same radius R, we effectively reduce our function class (parametrized by the weight vector), which can lead to better generalization bounds, depending on the kernel inducing the map Φ.

2.2.6 The Empirical Kernel Map
The map Φ, defined in (2.21), transforms each input pattern into a function on X, that is, into a potentially infinite-dimensional object. For any given set of points, however, it is possible to approximate Φ by only evaluating it on these points (cf. [232, 350, 361, 547, 474]):
Empirical Kernel Map
Definition 2.15 (Empirical Kernel Map) For a given set {z₁, …, zₙ} ⊂ X, n ∈ ℕ, we call

Φₙ : X → ℝⁿ, x ↦ (k(z₁, x), …, k(zₙ, x))ᵀ   (2.56)
the empirical kernel map w.r.t. {z₁, …, zₙ}. As an example, consider first the case where k is a positive definite kernel, and {z₁, …, zₙ} = {x₁, …, x_m}; we thus evaluate k(·, x) on the training patterns. If we carry out a linear algorithm in feature space, then everything will take place in the linear span of the mapped training patterns. Therefore, we can represent the k(·, x) of (2.21) as Φ_m(x) without losing information. The dot product to use in that representation, however, is not simply the canonical dot product in ℝᵐ, since the Φ(xᵢ) will usually not form an orthonormal system. To turn Φ_m into a feature map associated with k, we need to endow ℝᵐ with a dot product ⟨·, ·⟩_m such that

⟨Φ_m(x), Φ_m(x′)⟩_m = k(x, x′).   (2.57)
To this end, we use the ansatz ⟨·, ·⟩_m = ⟨·, M·⟩, with M being a positive definite matrix.⁶ Enforcing (2.57) on the training patterns, this yields the self-consistency condition [478, 512]

K = K M K,   (2.58)
6. Every dot product in ℝᵐ can be written in this form. We do not require strict definiteness of M, as the null space can be projected out, leading to a lower-dimensional feature space.
Kernel PCA Map
where K is the Gram matrix. The condition (2.58) can be satisfied for instance by the (pseudo-)inverse M = K⁻¹. Equivalently, we could have incorporated this rescaling operation, which corresponds to a kernel PCA "whitening" ([478, 547, 474], cf. Section 11.4), directly into the map, by whitening (2.56) to get

Φ_m^w : x ↦ K^{−1/2} (k(x₁, x), …, k(x_m, x))ᵀ.   (2.59)
This simply amounts to dividing the eigenvector basis vectors of K by √λᵢ, where the λᵢ are the eigenvalues of K.⁷ This parallels the rescaling of the eigenfunctions of the integral operator belonging to the kernel, given by (2.47). It turns out that this map can equivalently be performed using kernel PCA feature extraction (see Problem 14.8), which is why we refer to this map as the kernel PCA map (a numerical sketch of this map follows at the end of this section). Note that we have thus constructed a data-dependent feature map into an m-dimensional space which satisfies ⟨Φ_m^w(x), Φ_m^w(x′)⟩ = k(x, x′); i.e., we have found an m-dimensional feature space associated with the given kernel. In the case where K is invertible, Φ_m^w(x) computes the coordinates of Φ(x) when represented in a basis of the m-dimensional subspace spanned by Φ(x₁), …, Φ(x_m). For data sets where the number of examples is smaller than their dimension, it can actually be computationally attractive to carry out Φ_m^w explicitly, rather than using kernels in subsequent algorithms. Moreover, algorithms which are not readily "kernelized" may benefit from explicitly carrying out the kernel PCA map. We end this section with two notes which illustrate why the use of (2.56) need not be restricted to the special case we just discussed.

• More general kernels. When using non-symmetric kernels k in (2.56), together with the canonical dot product, we effectively work with the positive definite matrix KᵀK. Note that each positive definite matrix can be written as KᵀK. Therefore, working with positive definite kernels leads to an equally rich set of nonlinearities as working with an empirical kernel map using general non-symmetric kernels. If we wanted to carry out the whitening step, we would have to use (KᵀK)^{−1/4} (cf. footnote 7 concerning potential singularities).

• Different evaluation sets. Things can be sped up by using expansion sets of the form {z₁, …, zₙ}, mapping into an n-dimensional space, with n < m, as done in [100, 228]. In that case, one modifies (2.59) to

Φₙ^w : x ↦ Kₙ^{−1/2} (k(z₁, x), …, k(zₙ, x))ᵀ,   (2.60)

where (Kₙ)ᵢⱼ := k(zᵢ, zⱼ). The expansion set can either be a subset of the training set,⁸ or some other set of points. We will later return to the issue of how to choose

7. It is understood that if K is singular, we use the pseudo-inverse of K^{1/2}, in which case we get an even lower-dimensional subspace.

8. In [228] it is recommended that the size n of the expansion set is chosen large enough to ensure that the smallest eigenvalue of Kₙ is larger than some predetermined ε > 0. Alternatively, one can start off with a larger set, and use kernel PCA to select the most important components for the map, see Problem 14.8. In the kernel PCA case, the map (2.60) is computed as Dₙ^{−1/2} Uₙᵀ (k(z₁, x), …, k(zₙ, x))ᵀ, where Uₙ Dₙ Uₙᵀ is the eigenvalue decomposition of Kₙ. Note that the columns of Uₙ are the eigenvectors of Kₙ. We discard all columns that correspond to zero eigenvalues, as well as the corresponding dimensions of Dₙ. To approximate the map, we may actually discard all eigenvalues smaller than some ε > 0.
the best set (see Section 10.2 and Chapter 18). As an aside, note that in the case of kernel PCA (see Section 1.7 and Chapter 14 below), one does not need to worry about the whitening step in (2.59) and (2.60): using the canonical dot product in ℝᵐ (rather than ⟨·, ·⟩_m) will simply lead to diagonalizing K² instead of K, which yields the same eigenvectors with squared eigenvalues. This was pointed out by [350, 361]. The study [361] reports experiments where (2.56) was employed to speed up kernel PCA by choosing {z₁, …, zₙ} as a subset of {x₁, …, x_m}.
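To make the whitening step concrete, the following sketch (ours; the Gaussian kernel and all names are assumptions) computes the map (2.59) on a training set and checks that the canonical dot products of the whitened features reproduce the Gram matrix:

```python
# Whitened empirical kernel map Phi_m^w(x) = K^{-1/2} (k(x_1,x),...,k(x_m,x))^T.
import numpy as np

def gaussian_gram(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

X = np.random.randn(20, 3)
K = gaussian_gram(X, X)                 # full rank for distinct points
lam, U = np.linalg.eigh(K)
K_inv_sqrt = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T   # assumes K nonsingular

Phi = (K_inv_sqrt @ gaussian_gram(X, X)).T           # rows are Phi_m^w(x_i)
print(np.allclose(Phi @ Phi.T, K))                   # True
```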
2.2.7 A Kernel Map Defined from Pairwise Similarities
In practice, we are given a finite amount of data x₁, …, x_m. The following simple observation shows that even if we do not want to (or are unable to) analyze a given kernel k analytically, we can still compute a map Φ such that k corresponds to a dot product of the mapped patterns.

Proposition 2.16 Suppose the kernel k gives rise to a positive definite Gram matrix K, with Kᵢⱼ := k(xᵢ, xⱼ). Then it is possible to construct a map Φ into an m-dimensional feature space such that

k(xᵢ, xⱼ) = ⟨Φ(xᵢ), Φ(xⱼ)⟩.   (2.61)
Conversely, given an arbitrary map Φ into some dot product space, the matrix Kᵢⱼ := ⟨Φ(xᵢ), Φ(xⱼ)⟩ is positive definite.

Proof  Being symmetric and positive definite, K can be diagonalized as

K = S D Sᵀ,   (2.62)

where D is the diagonal matrix of the (nonnegative) eigenvalues, and where we have defined the Sᵢ as the rows of S (note that the columns of S would be K's eigenvectors). Therefore, K is the Gram matrix of the vectors √D Sᵢ.⁹ Hence the following map Φ, defined on x₁, …, x_m, will satisfy (2.61):

Φ : xᵢ ↦ √D Sᵢ.   (2.63)
Thus far, Φ is only defined on a set of points, rather than on a vector space. Therefore, it makes no sense to ask whether it is linear. We can, however, ask whether it can be extended to a linear map, provided the xᵢ are elements of a vector space. The answer is that if the xᵢ are linearly dependent (which is often the case), then this will not be possible, since a linear map would then typically be over-

9. In fact, every positive definite matrix is the Gram matrix of some set of vectors [46].
determined by the m conditions (2.63). For the converse, assume an arbitrary α ∈ ℝᵐ, and compute

∑ᵢ,ⱼ₌₁ᵐ αᵢ αⱼ ⟨Φ(xᵢ), Φ(xⱼ)⟩ = ⟨∑ᵢ₌₁ᵐ αᵢ Φ(xᵢ), ∑ⱼ₌₁ᵐ αⱼ Φ(xⱼ)⟩ = ‖∑ᵢ₌₁ᵐ αᵢ Φ(xᵢ)‖² ≥ 0.
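The construction in the proof is directly computable; the sketch below (ours) builds the vectors √D Sᵢ from an eigendecomposition of a Gram matrix and verifies (2.61) numerically:

```python
# Build feature vectors sqrt(D) S_i from K = S D S^T so that their canonical
# dot products reproduce K, cf. (2.61)-(2.63).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2)  # Gaussian Gram

lam, U = np.linalg.eigh(K)             # K = U diag(lam) U^T, lam >= 0
Phi = U * np.sqrt(np.maximum(lam, 0))  # row i is the feature vector of x_i
print(np.allclose(Phi @ Phi.T, K))     # True
```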
In particular, this result implies that given data x\,..., xm/ and a kernel k which gives rise to a positive definite matrix X, it is always possible to construct a feature space "K of dimension at most m that we are implicitly working in when using kernels (cf. Problem 2.32 and Section 2.2.6). If we perform an algorithm which requires k to correspond to a dot product in some other space (as for instance the SV algorithms described in this book), it is possible that even though k is not positive definite in general, it still gives rise to a positive definite Gram matrix K with respect to the training data at hand. In this case, Proposition 2.16 tells us that nothing will go wrong during training when we work with these data. Moreover, if k leads to a matrix with some small negative eigenvalues, we can add a small multiple of some strictly positive definite kernel k' (such as the identity k'(Xj, Xj) — 6jj) to obtain a positive definite matrix. To see this, suppose that Amin < 0 is the minimal eigenvalue of k's Gram matrix. Note that being strictly positive definite, the Gram matrix X' of k' satisfies
where $\lambda'_{\min}$ denotes its minimal eigenvalue, and the first inequality follows from Rayleigh's principle (B.57). Therefore, provided that $\lambda_{\min} + \lambda \lambda'_{\min} > 0$, we have
for all $a \in \mathbb{R}^m$, rendering $(K + \lambda K')$ positive definite.
2.3  Examples and Properties of Kernels

Polynomial
For the following examples, let us assume that $\mathcal{X} \subset \mathbb{R}^N$. Besides homogeneous polynomial kernels (cf. Proposition 2.1),
Gaussian
Boser, Guyon, and Vapnik [62,223,561] suggest the usage of Gaussian radial basis function kernels [26,4],
Sigmoid
where $\sigma > 0$, and sigmoid kernels,
where $\kappa > 0$ and $\vartheta < 0$. By applying Theorem 13.4 below, one can check that the latter kernel is not actually positive definite (see Section 4.6 and [85, 511] and the discussion in Example 4.25). Curiously, it has nevertheless successfully been used in practice; the reasons for this are discussed in [467].

Inhomogeneous Polynomial

Other useful kernels include the inhomogeneous polynomial,
$B_n$-Spline of Odd Order

($d \in \mathbb{N}$, $c > 0$) and the $B_n$-spline kernel [501, 572] ($I_X$ denoting the indicator (or characteristic) function on the set $X$, and $\otimes$ the convolution operation, $(f \otimes g)(x) := \int f(x') g(x' - x) \, dx'$),
The kernel computes B-splines of order $2p + 1$ ($p \in \mathbb{N}$), defined by the $(2p+1)$-fold convolution of the unit interval $[-1/2, 1/2]$. See Section 4.4.1 for further details and a regularization theoretic analysis of this kernel.

Invariance of Kernels

Note that all these kernels have the convenient property of unitary invariance, $k(x, x') = k(Ux, Ux')$ if $U^\top = U^{-1}$, for instance if $U$ is a rotation. If we consider complex numbers, then we have to use the adjoint $U^* := \bar{U}^\top$ instead of the transpose.

RBF Kernels

Radial basis function (RBF) kernels are kernels that can be written in the form
where $d$ is a metric on $\mathcal{X}$, and $f$ is a function on $\mathbb{R}_0^+$.
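A minimal sketch of Gram matrix routines for the kernels just listed may help fix ideas; the parameter names (d, c, sigma, kappa, theta) are our own choices, and the equation numbers in the comments refer to the text.

```python
import numpy as np

def polynomial(X, Y, d=2):
    # homogeneous polynomial kernel: k(x, x') = <x, x'>^d
    return (X @ Y.T) ** d

def inhomogeneous_polynomial(X, Y, d=2, c=1.0):
    # inhomogeneous polynomial kernel (2.70): k(x, x') = (<x, x'> + c)^d
    return (X @ Y.T + c) ** d

def gaussian(X, Y, sigma=1.0):
    # Gaussian RBF kernel (2.68): k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    sq = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma ** 2))

def sigmoid(X, Y, kappa=1.0, theta=-1.0):
    # sigmoid kernel: k(x, x') = tanh(kappa <x, x'> + theta);
    # not positive definite in general, yet used in practice (see above)
    return np.tanh(kappa * X @ Y.T + theta)
```

Here X and Y are arrays of shape (m, N) and (m', N); each routine returns the corresponding m x m' Gram matrix.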
Some interesting additional structure exists in the case of a Gaussian RBF kernel $k$ (2.68). As $k(x, x) = 1$ for all $x \in \mathcal{X}$, each mapped example has unit length, $\|\Phi(x)\| = 1$ (Problem 2.18 shows how to achieve this for general kernels). Moreover, as $k(x, x') > 0$ for all $x, x' \in \mathcal{X}$, all points lie inside the same orthant in feature space. To see this, recall that for unit length vectors, the dot product (1.3) equals the cosine of the enclosed angle. We obtain
which amounts to saying that the enclosed angle between any two mapped examples is smaller than $\pi/2$. The above seems to indicate that in the Gaussian case, the mapped data lie in a fairly restricted area of feature space. However, in another sense, they occupy a space which is as large as possible:

Theorem 2.18 (Full Rank of Gaussian RBF Gram Matrices [360]) Suppose that $x_1, \dots, x_m \in \mathcal{X}$ are distinct points, and $\sigma \neq 0$. The matrix $K$ given by
has full rank.
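Theorem 2.18 is easy to probe numerically; the following check, with data and bandwidth chosen arbitrarily, is our own illustration rather than part of the original argument.

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 20, 0.5
X = rng.standard_normal((m, 3))                      # m distinct points in R^3
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / (2 * sigma ** 2))                   # Gaussian Gram matrix
print(np.linalg.matrix_rank(K))                      # 20, i.e. full rank
```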
Infinite-Dimensional Feature Space
In other words, the points $\Phi(x_1), \dots, \Phi(x_m)$ are linearly independent. Since $m$ can be arbitrarily large, the feature space of the Gaussian kernel must thus be infinite-dimensional.
is a positive definite kernel on $\mathcal{C} \times \mathcal{C}$.

Proof
To see this, we define a feature map
where $I_A$ is the characteristic function on $A$. On the feature space, which consists of functions on $\mathcal{X}$ taking values in $[-1, 1]$, we use the dot product
The result follows by noticing $\langle I_A, I_B \rangle = P(A \cap B)$ and $\langle I_A, P(B) \rangle = P(A)P(B)$.
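For a finite sample space, the proof can be replayed numerically; the uniform measure and the random events below are illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6                                          # finite sample space, uniform P
P = np.full(n, 1 / n)
events = [rng.random(n) < 0.5 for _ in range(8)]   # events as indicator vectors

def k(A, B):
    # k(A, B) = P(A intersect B) - P(A) P(B)
    return np.sum(P * A * B) - np.sum(P * A) * np.sum(P * B)

K = np.array([[k(A, B) for B in events] for A in events])
print(np.all(np.linalg.eigvalsh(K) >= -1e-12))     # True: the Gram matrix is pd
```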
Further examples include kernels for string matching, as proposed by [585, 234, 23]. We shall describe these, and address the general problem of designing kernel functions, in Chapter 13. The next section will return to the connection between kernels and feature spaces. Readers who are eager to move on to SV algorithms may want to skip this section, which is somewhat more technical.
2.4  The Representation of Dissimilarities in Linear Spaces

2.4.1  Conditionally Positive Definite Kernels
We now proceed to a larger class of kernels than that of the positive definite ones. This larger class is interesting in several regards. First, it will turn out that some kernel algorithms work with this class, rather than only with positive definite kernels. Second, its relationship to positive definite kernels is a rather interesting one, and a number of connections between the two classes provide understanding of kernels in general. Third, they are intimately related to a question which is a variation on the central aspect of positive definite kernels: the latter can be thought of as dot products in feature spaces; the former, on the other hand, can be embedded as distance measures arising from norms in feature spaces. The present section thus attempts to extend the utility of the kernel trick by looking at the problem of which kernels can be used to compute distances in feature spaces. The underlying mathematical results have been known for quite a while [465]; some of them have already attracted interest in the kernel methods community in various contexts [515, 234].

Clearly, the squared distance $\|\Phi(x) - \Phi(x')\|^2$ in the feature space associated with a pd kernel $k$ can be computed, using $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$, as
Positive definite kernels are, however, not the full story: there exists a larger class of kernels that can be used as generalized distances, and the present section will describe why and how [468].

Let us start by considering how a dot product and the corresponding distance measure are affected by a translation of the data, $x \mapsto x - x_0$. Clearly, $\|x - x'\|^2$ is translation invariant while $\langle x, x' \rangle$ is not. A short calculation shows that the effect of the translation can be expressed in terms of $\|\cdot - \cdot\|^2$ as
Note that this, just like $\langle x, x' \rangle$, is still a pd kernel: $\sum_{i,j} c_i c_j \langle (x_i - x_0), (x_j - x_0) \rangle = \| \sum_i c_i (x_i - x_0) \|^2 \geq 0$ holds true for any $c_i$. For any choice of $x_0 \in \mathcal{X}$, we thus get a similarity measure (2.79) associated with the dissimilarity measure $\|x - x'\|$. This naturally leads to the question of whether (2.79) might suggest a connection
that also holds true in more general cases: what kind of nonlinear dissimilarity measure do we have to substitute for $\|\cdot - \cdot\|^2$ on the right hand side of (2.79), to ensure that the left hand side becomes positive definite? To state the answer, we first need to define the appropriate class of kernels. The following definition differs from Definition 2.4 only in the additional constraint on the sum of the $c_i$. Below, $\mathbb{K}$ is a shorthand for $\mathbb{C}$ or $\mathbb{R}$; the definitions are the same in both cases.

Definition 2.20 (Conditionally Positive Definite Matrix) A symmetric $m \times m$ matrix $K$ ($m \geq 2$) taking values in $\mathbb{K}$ and satisfying
is called conditionally positive definite (cpd).

Definition 2.21 (Conditionally Positive Definite Kernel) Let $\mathcal{X}$ be a nonempty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{K}$ which for all $m \geq 2$ and all $x_1, \dots, x_m \in \mathcal{X}$ gives rise to a conditionally positive definite Gram matrix is called a conditionally positive definite (cpd) kernel.

Note that symmetry is also required in the complex case. Due to the additional constraint on the coefficients $c_i$, it no longer follows automatically, as it did in the case of complex positive definite matrices and kernels. In Chapter 4, we will revisit cpd kernels. There, we will actually introduce cpd kernels of different orders. The definition given in the current chapter covers the case of kernels which are cpd of order 1.
Connection PD — CPD
Proposition 2.22 (Constructing PD Kernels from CPD Kernels [42]) Let $x_0 \in \mathcal{X}$, and let $k$ be a symmetric kernel on $\mathcal{X} \times \mathcal{X}$. Then
is positive definite if and only if $k$ is conditionally positive definite.

The proof follows directly from the definitions and can be found in [42]. This result does generalize (2.79): the negative squared distance kernel is indeed cpd, since $\sum_i c_i = 0$ implies
$$-\sum_{i,j} c_i c_j \|x_i - x_j\|^2 = -\sum_i c_i \sum_j c_j \|x_j\|^2 - \sum_j c_j \sum_i c_i \|x_i\|^2 + 2 \sum_{i,j} c_i c_j \langle x_i, x_j \rangle = 2 \sum_{i,j} c_i c_j \langle x_i, x_j \rangle = 2 \Big\| \sum_i c_i x_i \Big\|^2 \geq 0.$$
In fact, this implies that all kernels of the form
are cpd (they are not pd),¹⁰ by application of the following result (note that the case $\beta = 0$ is trivial):

10. Moreover, they are not cpd if $\beta > 2$ [42].
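Proposition 2.22 and the cpd property of (2.81) can be checked numerically. In the sketch below, the exponent beta = 1.5 and the random data are arbitrary choices; the construction of the pd kernel follows Proposition 2.22, with $x_0$ taken to be the first data point.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 4))
x0 = X[0]

def k(x, y, beta=1.5):
    # negative distance kernel (2.81), cpd for 0 < beta <= 2
    return -np.linalg.norm(x - y) ** beta

def k_tilde(x, y):
    # Proposition 2.22: (1/2)[k(x,x') - k(x,x0) - k(x',x0) + k(x0,x0)]
    return 0.5 * (k(x, y) - k(x, x0) - k(y, x0) + k(x0, x0))

K_tilde = np.array([[k_tilde(x, y) for y in X] for x in X])
print(np.all(np.linalg.eigvalsh(K_tilde) >= -1e-10))   # True: pd, so k is cpd
```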
Proposition 2.23 (Fractional Powers and Logs of CPD Kernels [42]) If $k : \mathcal{X} \times \mathcal{X} \to (-\infty, 0]$ is cpd, then so are $-(-k)^\alpha$ ($0 < \alpha < 1$) and $-\ln(1 - k)$.

To state another class of cpd kernels that are not pd, note first that, as a trivial consequence of Definition 2.20, we know that (i) sums of cpd kernels are cpd, and (ii) any constant $b \in \mathbb{R}$ is a cpd kernel. Therefore, any kernel of the form $k + b$, where $k$ is cpd and $b \in \mathbb{R}$, is also cpd. In particular, since pd kernels are cpd, we can take any pd kernel and offset it by $b$, and it will still be at least cpd. For further examples of cpd kernels, cf. [42, 578, 205, 515].

2.4.2  Hilbert Space Representation of CPD Kernels
We now return to the main flow of the argument. Proposition 2.22 allows us to construct the feature map for $k$ from that of the pd kernel $\tilde{k}$. To this end, fix $x_0 \in \mathcal{X}$ and define $\tilde{k}$ according to Proposition 2.22. Due to Proposition 2.22, $\tilde{k}$ is positive definite. Therefore, we may employ the Hilbert space representation $\Phi : \mathcal{X} \to \mathcal{H}$ of $\tilde{k}$ (cf. (2.32)), satisfying $\langle \Phi(x), \Phi(x') \rangle = \tilde{k}(x, x')$; hence,
Substituting Proposition 2.22 yields
This implies the following result [465, 42].

Feature Map for CPD Kernels
Proposition 2.24 (Hilbert Space Representation of CPD Kernels) Let $k$ be a real-valued cpd kernel on $\mathcal{X}$, satisfying $k(x, x) = 0$ for all $x \in \mathcal{X}$. Then there exists a Hilbert space $\mathcal{H}$ of real-valued functions on $\mathcal{X}$, and a mapping $\Phi : \mathcal{X} \to \mathcal{H}$, such that
If we drop the assumption $k(x, x) = 0$, the Hilbert space representation reads
It can be shown that if $k(x, x) = 0$ for all $x \in \mathcal{X}$, then
is a semi-metric: clearly, it is nonnegative and symmetric; additionally, it satisfies the triangle inequality, as can be seen by computing $d(x, x') + d(x', x'') = \|\Phi(x) - \Phi(x')\| + \|\Phi(x') - \Phi(x'')\| \geq \|\Phi(x) - \Phi(x'')\| = d(x, x'')$ [42]. It is a metric if $k(x, x') \neq 0$ for $x \neq x'$. We thus see that we can rightly think of $k$ as the negative of a distance measure.

We next show how to represent general symmetric kernels (thus in particular cpd kernels) as symmetric bilinear forms $Q$ in feature spaces. This generalization of the previously known feature space representation for pd kernels comes at a cost: $Q$ will no longer be a dot product. For our purposes, we can get away with this. The result will give us an intuitive understanding of Proposition 2.22: we can then write $\tilde{k}$ as $\tilde{k}(x, x') := Q(\Phi(x) - \Phi(x_0), \Phi(x') - \Phi(x_0))$. Proposition 2.22 thus essentially adds an origin in feature space which corresponds to the image $\Phi(x_0)$ of one point $x_0$ under the feature map.

Feature Map for General Symmetric Kernels
Proposition 2.25 (Vector Space Representation of Symmetric Kernels) Let $k$ be a real-valued symmetric kernel on $\mathcal{X}$. Then there exists a linear space $\mathcal{H}$ of real-valued functions on $\mathcal{X}$, endowed with a symmetric bilinear form $Q(\cdot, \cdot)$, and a mapping $\Phi : \mathcal{X} \to \mathcal{H}$, such that $k(x, x') = Q(\Phi(x), \Phi(x'))$.

Proof  The proof is a direct modification of the pd case. We use the map (2.21) and linearly complete the image as in (2.22). Define $Q(f, g) := \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \beta_j k(x_i, x'_j)$. To see that it is well-defined, although it explicitly contains the expansion coefficients (which need not be unique), note that $Q(f, g) = \sum_{j=1}^{m'} \beta_j f(x'_j)$, independent of the $\alpha_i$. Similarly, for $g$, note that $Q(f, g) = \sum_i \alpha_i g(x_i)$; hence it is independent of the $\beta_j$. The last two equations also show that $Q$ is bilinear; clearly, it is symmetric.

Note, moreover, that by definition of $Q$, $k$ is a reproducing kernel for the feature space (which is not a Hilbert space): for all functions $f$ (2.22), we have $Q(k(\cdot, x), f) = f(x)$; in particular, $Q(k(\cdot, x), k(\cdot, x')) = k(x, x')$. Rewriting $\tilde{k}$ as $\tilde{k}(x, x') := Q(\Phi(x) - \Phi(x_0), \Phi(x') - \Phi(x_0))$
Matrix Centering
Proposition 2.26 (Exercise 2.23 in [42]) Let $K$ be a symmetric matrix, $e \in \mathbb{R}^m$ the vector of all ones, $I$ the $m \times m$ identity matrix, and let $c \in \mathbb{C}^m$ satisfy $e^* c = 1$. Then
is positive definite if and only if $K$ is conditionally positive definite.¹¹

Proof  "$\Rightarrow$": Suppose $\tilde{K}$ is positive definite. Then for any $a \in \mathbb{C}^m$ which satisfies $a^* e = e^* a = 0$, we have $0 \leq a^* \tilde{K} a = a^* K a + a^* e c^* K c e^* a - a^* K c e^* a - a^* e c^* K a = a^* K a$. This means that $0 \leq a^* K a$, proving that $K$ is conditionally positive definite.

"$\Leftarrow$": Suppose $K$ is conditionally positive definite. We need to show that $a^* \tilde{K} a \geq 0$ for all $a \in \mathbb{C}^m$. We have
11. c* is the vector obtained by transposing and taking the complex conjugate of c.
All we need to show is $e^* s = 0$, where $s := (I - c e^*) a$ (so that $a^* \tilde{K} a = s^* K s$), since then we can use the fact that $K$ is cpd to obtain $s^* K s \geq 0$. This can be seen as follows: $e^* s = e^* (I - c e^*) a = (e^* - (e^* c) e^*) a = (e^* - e^*) a = 0$.
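The special case $c = e/m$ of this construction is the familiar centering operation used in kernel PCA; a quick numerical sanity check of our own, with an arbitrary cpd kernel matrix, looks as follows.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((8, 3))
K = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # cpd: -||x - x'||^2

m = K.shape[0]
e = np.ones((m, 1))
c = e / m                                  # any c with e^T c = 1 would do
P = np.eye(m) - e @ c.T
K_tilde = P @ K @ P.T                      # (I - e c^*) K (I - c e^*)
print(np.all(np.linalg.eigvalsh(K_tilde) >= -1e-10))        # True: pd
```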
This result directly implies a corresponding generalization of Proposition 2.22:

Kernel Centering

Proposition 2.27 (Adding a General Origin) Let $k$ be a symmetric kernel, $x_1, \dots, x_m \in \mathcal{X}$, and let the $c_i \in \mathbb{C}$ satisfy $\sum_{i=1}^m c_i = 1$. Then
is positive definite if and only if $k$ is conditionally positive definite.

Proof  Consider a set of $m' \in \mathbb{N}$ points $x'_1, \dots, x'_{m'} \in \mathcal{X}$, and let $K$ be the $(m + m') \times (m + m')$ Gram matrix based on $x_1, \dots, x_m, x'_1, \dots, x'_{m'}$. Apply Proposition 2.26 using $c_{m+1} = \dots = c_{m+m'} = 0$.

Application to SVMs
The above results show that conditionally positive definite kernels are a natural choice whenever we are dealing with a translation invariant problem, such as the SVM: maximization of the margin of separation between two classes of data is independent of the position of the origin. Seen in this light, it is not surprising that the structure of the dual optimization problem (cf. [561]) allows cpd kernels: as noted in [515, 507], the constraint $\sum_{i=1}^m \alpha_i y_i = 0$ projects out the same subspace as (2.80) in the definition of conditionally positive definite matrices.

Application to Kernel PCA

Another example of a kernel algorithm that works with conditionally positive definite kernels is kernel PCA (Chapter 14), where the data are centered, thus removing the dependence on the origin in feature space. Formally, this follows from Proposition 2.26 for $c_i = 1/m$.

Application to Parzen Windows Classifiers

Let us consider another example. One of the simplest distance-based classification algorithms proceeds as follows. Given $m_+$ points labelled with $+1$, $m_-$ points labelled with $-1$, and a mapped test point $\Phi(x)$, we compute the mean squared distances between the latter and the two classes, and assign it to the one for which this mean is smaller;
We use the distance kernel trick (Proposition 2.24) to express the decision function as a kernel expansion in the input domain: a short calculation shows that
with the constant offset
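Although the expansion (2.89) and the offset are displayed as equations in the original, the underlying algorithm is simple enough to state directly in code. The sketch below computes the mean squared distances via the kernel trick; the Gaussian kernel and the toy data are placeholder choices of ours.

```python
import numpy as np

def mean_sq_dist(x, Xc, k):
    # (1/m) sum_i ||Phi(x) - Phi(x_i)||^2, expanded via k(x,x') = <Phi(x),Phi(x')>
    return (k(x, x)
            - 2 * np.mean([k(x, xi) for xi in Xc])
            + np.mean([k(xi, xi) for xi in Xc]))

def classify(x, X_pos, X_neg, k):
    # assign x to the class with the smaller mean squared distance
    return 1 if mean_sq_dist(x, X_pos, k) < mean_sq_dist(x, X_neg, k) else -1

k = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 2)      # Gaussian kernel
X_pos = [np.array([1.0, 1.0]), np.array([1.5, 0.5])]
X_neg = [np.array([-1.0, -1.0]), np.array([-0.5, -1.5])]
print(classify(np.array([0.8, 0.9]), X_pos, X_neg, k))  # 1
```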
Note that for some cpd kernels, such as (2.81), $k(x_i, x_i)$ is always 0, and thus $b = 0$. For others, such as the commonly used Gaussian kernel, $k(x_i, x_i)$ is a nonzero constant, in which case $b$ vanishes provided that $m_+ = m_-$. For normalized Gaussians, the resulting decision boundary can be interpreted as the Bayes decision based on two Parzen window density estimates of the classes; for general cpd kernels, the analogy is merely a formal one; that is, the decision functions take the same form.

Properties of CPD Kernels

Many properties of positive definite kernels carry over to the more general case of conditionally positive definite kernels, such as Proposition 13.1. Using Proposition 2.22, one can prove an interesting connection between the two classes of kernels:

Proposition 2.28 (Connection PD — CPD [465]) A kernel $k$ is conditionally positive definite if and only if $\exp(tk)$ is positive definite for all $t > 0$.

Positive definite kernels of the form $\exp(tk)$ ($t > 0$) have the interesting property that their $n$th root ($n \in \mathbb{N}$) is again a positive definite kernel. Such kernels are called infinitely divisible. One can show that, disregarding some technicalities, the logarithm of an infinitely divisible positive definite kernel mapping into $\mathbb{R}_0^+$ is a conditionally positive definite kernel.
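Proposition 2.28 also invites a quick numerical check; here we take the cpd kernel $-\|x - x'\|^2$ (in which case $\exp(tk)$ is simply a Gaussian kernel), with random data as an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((10, 3))
D = -np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # cpd matrix
for t in (0.1, 1.0, 10.0):
    # exp(t k) should be positive definite for every t > 0
    print(np.all(np.linalg.eigvalsh(np.exp(t * D)) >= -1e-10))  # True each time
```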
2.4.3  Higher Order CPD Kernels
For the sake of completeness, we now present some material which is of interest to one section later in the book (Section 4.8), but not central to the present chapter. We follow [341, 204].

Definition 2.29 (Conditionally Positive Definite Functions of Order q) A continuous function $h$, defined on $[0, \infty)$, is called conditionally positive definite (cpd) of order $q$ on $\mathbb{R}^N$ if for any distinct points $x_1, \dots, x_m \in \mathbb{R}^N$, the quadratic form,
is nonnegative, provided that the scalars $\alpha_1, \dots, \alpha_m$ satisfy $\sum_{i=1}^m \alpha_i p(x_i) = 0$ for all polynomials $p(\cdot)$ on $\mathbb{R}^N$ of degree lower than $q$.

Let $\Pi_q$ denote the space of polynomials of degree lower than $q$ on $\mathbb{R}^N$. By definition, every cpd function $h$ of order $q$ generates a positive definite kernel for SV expansions in the space of functions orthogonal to $\Pi_q$, by setting $k(x, x') := h(\|x - x'\|^2)$. There also exists an analogue to the positive definiteness of the integral operator in the conditions of Mercer's theorem. In [157, 341] it is shown that for cpd functions $h$ of order $q$, we have
provided that the projection of $f$ onto $\Pi_q$ is zero.
Figure 2.4 Conditionally positive definite functions, as described in Table 2.1. Where applicable, we set the free parameter $c$ to 1; $\beta$ is set to 2. Note that cpd kernels need not be positive anywhere (e.g., the multiquadric kernel).
Table 2.1 Examples of conditionally positive definite kernels. The fact that the exponential kernel is pd (i.e., cpd of order 0) follows from (2.81) and Proposition 2.28.

    Kernel                  Order
    Exponential             0
    Inverse Multiquadric    0
    Multiquadric            1
    Thin Plate Spline       n
Definition 2.30 (Completely Monotonic Functions) A function h(x) is called completely monotonic of order q if
It can be shown [464, 465, 360] that a function $h(x^2)$ is conditionally positive definite if and only if $h(x)$ is completely monotonic of the same order. This gives a (sometimes simpler) criterion for checking whether a function is cpd or not. If we use cpd kernels in learning algorithms, we must ensure orthogonality of the estimate with respect to $\Pi_q$. This is usually done via the constraints $\sum_{i=1}^m \alpha_i p(x_i) = 0$ for all polynomials $p(\cdot)$ of degree lower than $q$ (see Section 4.8).
2.5  Summary

The crucial ingredient of SVMs and other kernel methods is the so-called kernel trick (see (2.7) and Remark 2.8), which permits the computation of dot products in high-dimensional feature spaces, using simple functions defined on pairs of input patterns. This trick allows the formulation of nonlinear variants of any algorithm that can be cast in terms of dot products, SVMs being but the most prominent example. The mathematical result underlying the kernel trick is almost a century old [359]. Nevertheless, it was only much later that it was exploited by the machine learning community for the analysis [4] and construction of algorithms [62], and that it was described as a general method for constructing nonlinear generalizations of dot product algorithms [480].

The present chapter has reviewed the mathematical theory of kernels. We started with the class of polynomial kernels, which can be motivated as computing a combinatorially large number of monomial features rather efficiently. This led to the general question of which kernels can be used, or: which kernels can be represented as dot products in a linear feature space. We defined this class and discussed some of its properties. We described several ways how, given such a kernel, one can construct a representation in a feature space. The most well-known representation employs Mercer's theorem, and represents the feature space as an $\ell_2$ space defined in terms of the eigenfunctions of an integral operator associated with the kernel. An alternative representation uses elements of the theory of reproducing kernel Hilbert spaces, and yields additional insights, representing the linear space as a space of functions written as kernel expansions. We gave an in-depth discussion of the kernel trick in its general form, including the case where we are interested in dissimilarities rather than similarities; that is, when we want to come up with nonlinear generalizations of distance-based algorithms rather than dot-product-based algorithms. In both cases, the underlying philosophy is the same: we are trying to express a complex nonlinear algorithm in terms of simple geometrical concepts, and we are then dealing with it in a linear space. This linear space may not always be readily available; in some cases, it may even be hard to construct explicitly. Nevertheless, for the sake of design and analysis of the algorithms, it is sufficient to know that the linear space exists, empowering us to use the full potential of geometry, linear algebra, and functional analysis.
2.6  Problems

2.1 (Monomial Features in $\mathbb{R}^2$ •) Verify the second equality in (2.9).

2.2 (Multiplicity of Monomial Features in $\mathbb{R}^N$ [515] ••) Consider the monomial kernel $k(x, x') = \langle x, x' \rangle^d$ (where $x, x' \in \mathbb{R}^N$), generating monomial features of order $d$. Prove
that a valid feature map for this kernel can be defined coordinate-wise as
for every $m \in \mathbb{N}^N$ with $\sum_{i=1}^{N} [m]_i = d$ (i.e., every such $m$ corresponds to one dimension of $\mathcal{H}$).

2.3 (Inhomogeneous Polynomial Kernel ••) Prove that the kernel (2.70) induces a feature map into the space of all monomials up to degree $d$. Discuss the role of $c$.

2.4 (Eigenvalue Criterion of Positive Definiteness •) Prove that a symmetric matrix is positive definite if and only if all its eigenvalues are nonnegative (see Appendix B).

2.5 (Dot Products are Kernels •) Prove that dot products (Definition B.7) are positive definite kernels.

2.6 (Kernels on Finite Domains ••) Prove that for finite $\mathcal{X}$, say $\mathcal{X} = \{x_1, \dots, x_m\}$, $k$ is a kernel if and only if the $m \times m$ matrix $(k(x_i, x_j))_{ij}$ is positive definite.

2.7 (Positivity on the Diagonal •) From Definition 2.5, prove that a kernel satisfies $k(x, x) \geq 0$ for all $x \in \mathcal{X}$.

2.8 (Cauchy-Schwarz for Kernels ••) Give an elementary proof of Proposition 2.7. Hint: start with the general form of a symmetric $2 \times 2$ matrix, and derive conditions for its coefficients that ensure that it is positive definite.

2.9 (PD Kernels Vanishing on the Diagonal •) Use Proposition 2.7 to prove that a kernel satisfying $k(x, x) = 0$ for all $x \in \mathcal{X}$ is identically zero. How does the RKHS look in this case? Hint: use (2.31).

2.10 (Two Kinds of Positivity •) Give an example of a kernel which is positive definite according to Definition 2.5, but not positive in the sense that $k(x, x') \geq 0$ for all $x, x'$. Give an example of a kernel where the contrary is the case.

2.11 (General Coordinate Transformations •) Prove that if $\sigma : \mathcal{X} \to \mathcal{X}$ is a function, and $k(x, x')$ is a kernel, then $k(\sigma(x), \sigma(x'))$ is a kernel, too.

2.12 (Positivity on the Diagonal •) Prove that positive definite kernels are positive on the diagonal, $k(x, x) \geq 0$ for all $x \in \mathcal{X}$. Hint: use $m = 1$ in (2.15).

2.13 (Symmetry of Complex Kernels ••) Prove that complex-valued positive definite kernels are symmetric (2.18).

2.14 (Real Kernels vs. Complex Kernels •) Prove that a real matrix satisfies (2.15) for all $c_i \in \mathbb{C}$ if and only if it is symmetric and it satisfies (2.15) for real coefficients $c_i$. Hint: decompose each $c_i$ in (2.15) into real and imaginary parts.
2.15 (Rank-One Kernels •) Prove that if $f$ is a real-valued function on $\mathcal{X}$, then $k(x, x') := f(x) f(x')$ is a positive definite kernel.

2.16 (Bayes Kernel ••) Consider a binary pattern recognition problem. Specialize the last problem to the case where $f : \mathcal{X} \to \{\pm 1\}$ equals the Bayes decision function $y(x)$, i.e., the classification with minimal risk subject to an underlying distribution $P(x, y)$ generating the data. Argue that this kernel is particularly suitable since it renders the problem linearly separable in a 1D feature space: state a decision function (cf. (1.35)) that solves the problem (hint: you just need one parameter $\alpha$, and you may set it to 1; moreover, use $b = 0$) [124]. The final part of the problem requires knowledge of Chapter 16: consider now the situation where some prior $P(f)$ over the target function class is given. What would the optimal kernel be in this case? Discuss the connection to Gaussian processes.

2.17 (Inhomogeneous Polynomials •) Prove that the inhomogeneous polynomial (2.70) is a positive definite kernel, e.g., by showing that it is a linear combination of homogeneous polynomial kernels with positive coefficients. What kind of features does this kernel compute [561]?

2.18 (Normalization in Feature Space •) Given a kernel $k$, construct a corresponding normalized kernel $\tilde{k}$ by normalizing the feature map such that for all $x \in \mathcal{X}$, $\|\tilde{\Phi}(x)\| = 1$ (cf. also Definition 12.35). Discuss the relationship between normalization in input space and normalization in feature space for Gaussian kernels and homogeneous polynomial kernels.

2.19 (Cosine Kernel •) Suppose $\mathcal{X}$ is a dot product space, and $x, x' \in \mathcal{X}$. Prove that $k(x, x') = \cos(\angle(x, x'))$ is a positive definite kernel. Hint: use Problem 2.18.

2.20 (Alignment Kernel •) Let $\langle K, K' \rangle_F := \sum_{ij} K_{ij} K'_{ij}$ be the Frobenius dot product of two matrices. Prove that the empirical alignment of two Gram matrices [124], $A(K, K') := \langle K, K' \rangle_F / \sqrt{\langle K, K \rangle_F \langle K', K' \rangle_F}$, is a positive definite kernel. Note that the alignment can be used for model selection, putting $K_{ij} := y_i y_j$ (cf. Problem 2.16) and $K'_{ij} := \mathrm{sgn}(k(x_i, x_j))$ or $K'_{ij} := \mathrm{sgn}(k(x_i, x_j)) - b$ (cf. [124]).

2.21 (Equivalence Relations as Kernels •••) Consider a similarity measure $k : \mathcal{X} \times \mathcal{X} \to \{0, 1\}$ with
Prove that $k$ is a positive definite kernel if and only if, for all $x, x', x'' \in \mathcal{X}$,
Equations (2.96) to (2.98) amount to saying that $k = I_T$, where $T \subset \mathcal{X} \times \mathcal{X}$ is an equivalence relation.
As a simple example, consider an undirected graph, and let $(x, x') \in T$ whenever $x$ and $x'$ are in the same connected component of the graph. Show that $T$ is an equivalence relation. Find examples of equivalence relations that lend themselves to an interpretation as similarity measures. Discuss whether there are other relations that one might want to use as similarity measures.

2.22 (Different Feature Spaces for the Same Kernel •) Give an example of a kernel with two valid feature maps $\Phi_1, \Phi_2$, mapping into spaces $\mathcal{H}_1, \mathcal{H}_2$ of different dimensions.

2.23 (Converse of Mercer's Theorem •) Prove that if an integral operator kernel $k$ admits a uniformly convergent dot product representation on some compact set $\mathcal{X} \times \mathcal{X}$,
then it is positive definite. Hint: show that
Argue that in particular, polynomial kernels (2.67) satisfy Mercer's conditions.

2.24 ($\infty$-Norm of Mercer Eigenfunctions ••) Prove that under the conditions of Theorem 2.10, we have, up to sets of measure zero,
Hint: note that $\|k\|_\infty \geq k(x, x)$ up to sets of measure zero, and use the series expansion given in Theorem 2.10. Show, moreover, that it is not generally the case that
Hint: consider the case where $\mathcal{X} = \mathbb{N}$, $\mu(\{n\}) := 2^{-n}$, and $k(i, j) := \delta_{ij}$. Show that

1. $T_k$ maps $(a_j) \mapsto (a_j 2^{-j})$ for $(a_j) \in L_2(\mathcal{X}, \mu)$,
2. $T_k$ satisfies $\langle (a_j), T_k(a_j) \rangle = \sum_j (a_j 2^{-j})^2 \geq 0$ and is thus positive definite,
3. $\lambda_j = 2^{-j}$ and $\psi_j = 2^{j/2} e_j$ form an orthonormal eigenvector decomposition of $T_k$ (here, $e_j$ is the $j$th canonical unit vector in $\ell_2$), and
4. $\|\psi_j\|_\infty = 2^{j/2} = \lambda_j^{-1/2}$.

Argue that the last statement shows that (2.101) is wrong and (2.100) is tight.¹²

2.25 (Generalized Feature Maps •••) Via (2.38), Mercer kernels induce compact (integral) operators. Can you generalize the idea of defining a feature map associated with an

12. Thanks to S. Smale and I. Steinwart for this exercise.
operator to more general bounded positive definite operators $T$? Hint: use the multiplication operator representation of $T$ [467].

2.26 (Nyström Approximation (cf. [603]) •) Consider the integral operator obtained by substituting the distribution P underlying the data into (2.38), i.e.,
If the conditions of Mercer's theorem are satisfied, then k can be diagonalized as
where $\lambda_j$ and $\psi_j$ satisfy the eigenvalue equation
and the orthonormality conditions
Show that by replacing the integral by a summation over an iid sample $X = \{x_1, \dots, x_m\}$ from $P(x)$, one can recover the kernel PCA eigenvalue problem (Section 1.7). Hint: start by evaluating (2.104) for $x' \in X$, to obtain $m$ equations. Next, approximate the integral by a sum over the points in $X$, replacing $\int_{\mathcal{X}} k(x, x') \psi_j(x) \, dP(x)$ by $\frac{1}{m} \sum_{n=1}^{m} k(x_n, x') \psi_j(x_n)$. Derive the orthogonality condition for the eigenvectors $(\psi_j(x_n))_{n=1,\dots,m}$ from (2.105).

2.27 (Lorentzian Feature Spaces ••) If a finite number of eigenvalues is negative, the expansion in Theorem 2.10 is still valid. Show that in this case, $k$ corresponds to a Lorentzian symmetric bilinear form in a space with indefinite signature [467]. Discuss whether this causes problems for learning algorithms utilizing these kernels. In particular, consider the cases of SV machines (Chapter 7) and kernel PCA (Chapter 14).

2.28 (Symmetry of Reproducing Kernels •) Show that reproducing kernels (Definition 2.9) are symmetric. Hint: use (2.35) and exploit the symmetry of the dot product.

2.29 (Coordinate Representation in the RKHS ••) Write $\langle \cdot, \cdot \rangle$ as a dot product of coordinate vectors by expressing the functions of the RKHS in the basis $(\sqrt{\lambda_n} \phi_n)_{n=1,\dots,N_{\mathcal{H}}}$, which is orthonormal with respect to $\langle \cdot, \cdot \rangle$, i.e.,
Obtain an expression for the coordinates $\alpha_n$, using (2.47) and $\alpha_n = \langle f, \sqrt{\lambda_n} \phi_n \rangle$. Show that $\mathcal{H}$ has the structure of a RKHS in the sense that for $f$ and $g$ given by (2.106), and
we have $\langle \alpha, \beta \rangle = \langle f, g \rangle$. Show, moreover, that $f(x) = \langle \alpha, \Phi(x) \rangle$ in $\mathcal{H}$. In other words, we have, for all $x, x' \in \mathcal{X}$,
Now consider the special case where $\langle \cdot, \cdot \rangle$ is a Euclidean dot product and $\langle x - x', x - x' \rangle$ is the squared Euclidean distance between $x$ and $x'$. Discuss why the polarization identity does not imply that the value of the dot product can be recovered from the distances alone. What else does one need?

2.36 (Vector Space Representation of CPD Kernels •••) Specialize the vector space representation of symmetric kernels (Proposition 2.25) to the case of cpd kernels. Can you identify a subspace on which a cpd kernel is actually pd?

2.37 (Parzen Windows Classifiers in Feature Space ••) Assume that $k$ is a positive definite kernel. Compare the algorithm described in Section 1.2 with the one of (2.89). Construct situations where the two algorithms give different results. Hint: consider datasets where the class means coincide.

2.38 (Canonical Distortion Kernel ◦◦◦) Can you define a kernel based on Baxter's canonical distortion metric [28]?
3  Risk and Loss Functions

Overview

One of the most immediate requirements in any learning problem is to specify what exactly we would like to achieve, minimize, bound, or approximate. In other words, we need to determine a criterion according to which we will assess the quality of an estimate $f : \mathcal{X} \to \mathcal{Y}$ obtained from data. This question is far from trivial. Even in binary classification there exist ample choices. The selection criterion may be the fraction of patterns classified correctly, it could involve the confidence with which the classification is carried out, or it might take into account the fact that losses are not symmetric for the two classes, such as in health diagnosis problems. Furthermore, the loss for an error may be input-dependent (for instance, meteorological predictions may require a higher accuracy in urban regions), and finally, we might want to obtain probabilities rather than a binary prediction of the class labels $-1$ and $1$. Multi-class discrimination and regression add even further levels of complexity to the problem. Thus we need a means of encoding these criteria.

The chapter is structured as follows: in Section 3.1, we begin with a brief overview of common loss functions used in classification and regression algorithms. This is done without much mathematical rigor or statistical justification, in order to provide basic working knowledge for readers who want to get a quick idea of the default design choices in the area of kernel machines. Following this, Section 3.2 formalizes the idea of risk. The risk approach is the predominant technique used in this book, and most of the algorithms presented subsequently minimize some form of a risk functional. Section 3.3 treats the concept of loss functions from a statistical perspective, points out the connection to the estimation of densities, and introduces the notion of efficiency. Readers interested in more detail should also consider Chapter 16, which discusses the problem of estimation from a Bayesian perspective. The later parts of this section are intended for readers interested in the more theoretical details of estimation. The concept of robustness is introduced in Section 3.4. Several commonly used loss functions, such as Huber's loss and the ε-insensitive loss, enjoy robustness properties with respect to rather general classes of distributions. Beyond the basic relations, we will show how to adjust the ε-insensitive loss in such a way as to accommodate different amounts of variance automatically. This will later lead to the construction of so-called ν-Support Vector Algorithms (see Chapters 7, 8, and 9). While technical details and proofs can be omitted for most of the present chapter, we encourage the reader to review the practical implications of this section.
Prerequisites
As usual, exercises for all sections can be found at the end. The chapter requires knowledge of probability theory, as introduced in Section B.1.
3.1  Loss Functions

Let us begin with a formal definition of what we mean by the loss incurred by a function $f$ at location $x$, given an observation $y$.

Definition 3.1 (Loss Function) Denote by $(x, y, f(x)) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Y}$ the triplet consisting of a pattern $x$, an observation $y$, and a prediction $f(x)$. Then the map $c : \mathcal{X} \times \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ with the property $c(x, y, y) = 0$ for all $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ will be called a loss function.
Minimized Loss ≠ Incurred Loss
Note that we require $c$ to be a nonnegative function. This means that we will never get a payoff from an extra good prediction. If the latter were the case, we could always recover non-negativity (provided the loss is bounded from below) by using a simple shift operation (possibly depending on $x$). Likewise, we can always satisfy the condition that exact predictions ($f(x) = y$) never cause any loss. The advantage of these extra conditions on $c$ is that we know that the minimum of the loss is 0 and that it is obtainable, at least for a given $x, y$.

Next we will formalize different kinds of loss, as described informally in the introduction of the chapter. Note that the incurred loss is not always the quantity that we will attempt to minimize. For instance, for algorithmic reasons, some loss functions will prove to be infeasible (the binary loss, for instance, can lead to NP-hard optimization problems [367]). Furthermore, statistical considerations, such as the desire to obtain confidence levels on the prediction (Section 3.3.1), will also influence our choice.

3.1.1  Misclassification Error
Binary Classification
The simplest case to consider involves counting the misclassification error: if pattern $x$ is classified wrongly, we incur loss 1; otherwise there is no penalty:
This definition of $c$ does not distinguish between different classes and types of errors (false positives or negatives).¹ A slight extension takes the latter into account.

Asymmetric and Input-Dependent Loss

For the sake of simplicity let us assume, as in (3.1), that we have a binary classification problem. This time, however, the loss may depend on a function $c(x)$ which accounts for input-dependence, i.e.

A simple (albeit slightly contrived) example is the classification of objects into rocks and diamonds. Clearly, the incurred loss will depend largely on the weight of the object under consideration. Analogously, we might distinguish between errors for $y = 1$ and $y = -1$ (see, e.g., [331] for details). For instance, in a fraud detection application, we would like to be really sure about the situation before taking any measures, rather than losing potential customers. On the other hand, a blood bank should consider even the slightest suspicion of disease before accepting a donor.

Confidence Level

Rather than predicting only whether a given object $x$ belongs to a certain class $y$, we may also want to take a certain confidence level into account. In this case, $f(x)$ becomes a real-valued function, even though $y \in \{-1, 1\}$. The sign $\mathrm{sgn}(f(x))$ then denotes the class label, and the absolute value $|f(x)|$ the confidence of the prediction. Corresponding loss functions will depend on the product $y f(x)$ to assess the quality of the estimate.

Soft Margin Loss

The soft margin loss function, as introduced by Bennett and Mangasarian [40, 111], is defined as
In some cases [348, 125] (see also Section 10.6.2) the squared version of (3.3) provides an expression that can be minimized more easily;
Logistic Loss
The soft margin loss closely resembles the so-called logistic loss function (cf. [251], as well as Problem 3.1 and Section 16.1.1);
We will derive this loss function in Section 3.3.1. It is used in order to associate a probabilistic meaning with $f(x)$. Note that in both (3.3) and (3.5) (nearly) no penalty occurs if $y f(x)$ is sufficiently large, i.e. if the patterns are classified correctly with large confidence. In particular, in (3.3) a minimum confidence of 1 is required for zero loss.

1. A false positive is a point which the classifier erroneously assigns to class 1; a false negative is erroneously assigned to class $-1$.
Figure 3.1 From left to right: 0-1 loss, linear soft margin loss, logistic regression, and quadratic soft margin loss. Note that both soft margin loss functions are upper bounds on the 0-1 loss.
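For reference, the classification losses discussed so far translate directly into code; the sketch below is vectorized over arrays of labels y and real-valued predictions f, with equation numbers as in the text.

```python
import numpy as np

def zero_one_loss(y, f):
    # misclassification loss (3.1): 1 if x is classified wrongly, else 0
    return np.where(y * np.sign(f) <= 0, 1.0, 0.0)

def soft_margin_loss(y, f):
    # soft margin loss (3.3): max(0, 1 - y f(x))
    return np.maximum(0.0, 1.0 - y * f)

def squared_soft_margin_loss(y, f):
    # squared version of the soft margin loss (cf. (3.3))
    return np.maximum(0.0, 1.0 - y * f) ** 2

def logistic_loss(y, f):
    # logistic loss (3.5): ln(1 + exp(-y f(x)))
    return np.log1p(np.exp(-y * f))
```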
These loss functions led to the development of large margin classifiers (see [491, 460, 504] and Chapter 5 for further details). Figure 3.1 depicts various popular loss functions.²

Multi-Class Discrimination

Matters are more complex when dealing with more than two classes. Each type of misclassification could potentially incur a different loss, leading to an $M \times M$ matrix ($M$ being the number of classes) with positive off-diagonal and zero diagonal entries. It is still a matter of ongoing research in which way a confidence level should be included in such cases (cf. [41, 311, 593, 161, 119]).

3.1.2  Regression
When estimating real-valued quantities, it is usually the size of the difference $y - f(x)$, i.e. the amount of misprediction, rather than the product $y f(x)$, which is used to determine the quality of the estimate. For instance, this can be the actual loss incurred by mispredictions (e.g., the loss incurred by mispredicting the value of a financial instrument at the stock exchange), provided the latter is known and computationally tractable.³ Assuming location independence, in most cases the loss function will be of the type
See Figure 3.2 below for several regression loss functions; we list the ones most common in kernel methods.

2. Other popular loss functions from the generalized linear model context include the inverse complementary log-log function. It is given by
This function, unfortunately, is not convex and therefore it will not lead to a convex optimization problem. However, it has nice robustness properties and therefore we think that it should be investigated in the present context. 3. As with classification, computational tractability is one of the primary concerns. This is not always satisfying from a statistician's point of view, yet it is crucial for any practical implementation of an estimation algorithm.
Squared Loss

The popular choice is to minimize the sum of squares of the residuals $f(x) - y$. As we shall see in Section 3.3.1, this corresponds to the assumption that we have additive normal noise corrupting the observations $y_i$. Consequently we minimize

For convenience of subsequent notation, $\frac{1}{2}\xi^2$ rather than $\xi^2$ is often used.

ε-insensitive Loss and ℓ₁ Loss

An extension of the soft margin loss (3.3) to regression is the ε-insensitive loss function [561, 572, 562]. It is obtained by symmetrization of the "hinge" of (3.3),
The idea behind (3.9) is that deviations up to $\varepsilon$ should not be penalized, and all further deviations should incur only a linear penalty. Setting $\varepsilon = 0$ leads to an $\ell_1$ loss, i.e., to minimization of the sum of absolute deviations. This is written
Practical Considerations
We will study these functions in more detail in Section 3.4.2. For efficient implementations of learning procedures, it is crucial that loss functions satisfy certain properties. In particular, they should be cheap to compute, have a small number of discontinuities (if any) in the first derivative, and be convex in order to ensure the uniqueness of the solution (see Chapter 6 and also Problem 3.6 for details). Moreover, we may want to obtain solutions that are computationally efficient, which may disregard a certain number of training points. This leads to conditions such as vanishing derivatives for a range of function values f(x). Finally, requirements such as outlier resistance are also important for the construction of estimators.
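The regression losses of this section are equally compact in code; the sketch below is again vectorized, and the default value of eps is an arbitrary choice of ours.

```python
import numpy as np

def squared_loss(y, f):
    # squared loss (3.8), with the conventional factor 1/2: (1/2)(y - f(x))^2
    return 0.5 * (y - f) ** 2

def eps_insensitive_loss(y, f, eps=0.1):
    # epsilon-insensitive loss (3.9): max(0, |y - f(x)| - eps)
    return np.maximum(0.0, np.abs(y - f) - eps)

def l1_loss(y, f):
    # l1 loss (3.10): |y - f(x)|, the special case eps = 0
    return np.abs(y - f)
```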
3.2  Test Error and Expected Risk

Now that we have determined how errors should be penalized on specific instances $(x, y, f(x))$, we have to find a method to combine these (local) penalties. This will help us to assess a particular estimate $f$. In the following, we will assume that there exists a probability distribution $P(x, y)$ on $\mathcal{X} \times \mathcal{Y}$ which governs the data generation and underlying functional dependency. Moreover, we denote by $P(y|x)$ the conditional distribution of $y$ given $x$, and by $dP(x, y)$ and $dP(y|x)$ the integrals with respect to the distributions $P(x, y)$ and $P(y|x)$ respectively (cf. Section B.1.3).

3.2.1  Exact Quantities

Unless stated otherwise, we assume that the data $(x, y)$ are drawn iid (independent and identically distributed, see Section B.1) from $P(x, y)$. Whether or not we have
knowledge of the test patterns at training time⁴ makes a significant difference in the design of learning algorithms. If we do know them, we will want to minimize the test error on that specific test set; if we do not, the expected error over all possible test sets.
Transduction Problem
Definition 3.2 (Test Error) Assume that we are not only given the training data $\{x_1, \dots, x_m\}$ along with target values $\{y_1, \dots, y_m\}$, but also the test patterns $\{x'_1, \dots, x'_{m'}\}$ on which we would like to predict $y'_i$ ($i = 1, \dots, m'$). Since we already know the $x'_i$, all we should care about is to minimize the expected error on the test set. We formalize this in the following definition
Unfortunately, this problem, referred to as transduction, is quite difficult to address, both computationally and conceptually; see [562, 267, 37, 211]. Instead, one typically considers the case where no knowledge about test patterns is available, as described in the following definition.

Definition 3.3 (Expected Risk) If we have no knowledge about the test patterns (or decide to ignore them), we should minimize the expected error over all possible training patterns. Hence we have to minimize the expected loss with respect to P and c
Here the integration is carried out with respect to the distribution $P(x, y)$. Again, just as (3.11), this problem is intractable, since we do not know $P(x, y)$ explicitly. Instead, we are only given the training patterns $(x_i, y_i)$. The latter, however, allow us to replace the unknown distribution $P(x, y)$ by its empirical estimate.

To study connections between loss functions and density models, it will be convenient to assume that there exists a density $p(x, y)$ corresponding to $P(x, y)$. This means that we may replace $\int dP(x, y)$ by $\int p(x, y)\,dx\,dy$ with the appropriate measure on $\mathcal{X} \times \mathcal{Y}$. Such a density $p(x, y)$ need not always exist (see Section B.1 for more details), but we will not give further heed to these concerns at present.

3.2.2  Approximations
Empirical Density
Unfortunately, this change in notation did not solve the problem. All we have at our disposal is the actual training data. What one usually does is replace p(x,y) by the empirical density
4. The test outputs, however, are not available during training.
Here $\delta_{x'}(x)$ denotes the $\delta$-distribution, satisfying $\int \delta_{x'}(x) f(x)\,dx = f(x')$. The hope is that replacing $p$ by $p_{\mathrm{emp}}$ will lead to a quantity that is "reasonably close" to the expected risk. This will be the case if the class of possible solutions $f$ is sufficiently limited [568, 571]. The issue of closeness with regard to different estimators will be discussed in further detail in Chapters 5 and 12. Substituting $p_{\mathrm{emp}}(x, y)$ into (3.12) leads to the empirical risk:
Definition 3.4 (Empirical Risk) The empirical risk is defined as
M-Estimator

This quantity has the advantage that, given the training data, we can readily compute and also minimize it. This constitutes a particular case of what is called an M-estimator in statistics. Estimators of this type are studied in detail in the field of empirical processes [554]. As pointed out in Section 3.1, it is crucial to understand that although our particular M-estimator is built by minimizing a loss, this need not always be the case. From a decision-theoretic point of view, the question of which loss to choose is a separate issue, which is dictated by the problem at hand as well as the goal of trying to evaluate the performance of estimation methods, rather than by the problem of trying to define a particular estimation method [582, 166, 43].

Ill-Posed Problems

These considerations aside, it may appear as if (3.14) is the answer to our problems, and all that remains to be done is to find a suitable class of functions $\mathcal{F} \ni f$ such that we can minimize $R_{\mathrm{emp}}[f]$ with respect to $\mathcal{F}$. Unfortunately, determining $\mathcal{F}$ is quite difficult (see Chapters 5 and 12 for details). Moreover, the minimization of $R_{\mathrm{emp}}[f]$ can lead to an ill-posed problem [538, 370]. We will show this with a simple example.

Example of an Ill-Posed Problem

Assume that we want to solve a regression problem using the quadratic loss function (3.8), given by $c(x, y, f(x)) = (y - f(x))^2$. Moreover, assume that we are dealing with a linear class of functions,⁵ say
where the $f_i$ are functions mapping $\mathcal{X}$ to $\mathbb{R}$. We want to find the minimizer of $R_{\mathrm{emp}}$, i.e.,
5. In the simplest case, assuming $\mathcal{X}$ is contained in a vector space, these could be functions that extract coordinates of $x$; in other words, $\mathcal{F}$ would be the class of linear functions on $\mathcal{X}$.
Computing the derivative of $R_{\mathrm{emp}}[f]$ with respect to $\alpha$ and defining $F_{ij} := f_j(x_i)$, we can see that the minimum of (3.16) is achieved if
Condition of a Matrix
A sufficient condition for (3.17) is $\alpha = (F^\top F)^{+} F^\top y$, where $(F^\top F)^{+}$ denotes the (pseudo-)inverse of the matrix. If $F^\top F$ has a bad condition number (i.e. the quotient between the largest and the smallest eigenvalue of $F^\top F$ is large), it is numerically difficult [423, 530] to solve (3.17) for $\alpha$. Furthermore, if $n > m$, i.e. if we have more basis functions $f_i$ than training patterns $x_i$, there will exist a subspace of solutions with dimension at least $n - m$ satisfying (3.17). This is undesirable both practically (speed of computation) and theoretically (we would have to deal with a whole class of solutions rather than a single one). One might also expect that if $\mathcal{F}$ is too rich, the discrepancy between $R_{\mathrm{emp}}[f]$ and $R[f]$ could be large. For instance, if $F$ is an $m \times m$ matrix of full rank, $\mathcal{F}$ contains an $f$ that predicts all target values $y_i$ correctly on the training data. Nevertheless, we cannot expect that we will also obtain zero prediction error on unseen points. Chapter 4 will show how these problems can be overcome by adding a so-called regularization term to $R_{\mathrm{emp}}[f]$.
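The ill-posedness is easy to reproduce numerically; in this sketch, a random design matrix with $n > m$ stands in for the basis functions $f_j$ evaluated at the training points.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 10, 20                        # more basis functions than training points
F = rng.standard_normal((m, n))      # F_ij := f_j(x_i)
y = rng.standard_normal(m)

# F^T F is singular here (rank <= m < n), so (3.17) has an (n - m)-dimensional
# affine subspace of solutions; the pseudo-inverse picks one of them.
print(np.linalg.matrix_rank(F.T @ F))            # 10
alpha = np.linalg.pinv(F.T @ F) @ F.T @ y
print(np.allclose(F @ alpha, y))                 # True: zero training error,
                                                 # with no guarantee on new points
```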
3.3  A Statistical Perspective

Given a particular pattern $x$, we may want to ask what risk we can expect for it, and with which probability the corresponding loss is going to occur. In other words, instead of (or in addition to) $\mathrm{E}[c(x, y, f(x))]$ for a fixed $x$, we may want to know the distribution of $y$ given $x$, i.e., $P(y|x)$. (Bayesian) statistics (see [338, 432, 49, 43] and also Chapter 16) often attempt to estimate the density corresponding to the random variables $(x, y)$, and in some cases, we may really need information about $p(x, y)$ to arrive at the desired conclusions given the training data (e.g., medical diagnosis). However, we always have to keep in mind that if we model the density $p$ first, and subsequently, based on this approximation, compute a minimizer of the expected risk, we will have to make two approximations. This could lead to inferior or at least not easily predictable results. Therefore, wherever possible, we should avoid solving a more general problem, since additional approximation steps might only make the estimates worse [561].

3.3.1  Maximum Likelihood Estimation
All this said, we still may want to compute the conditional density $p(y|x)$. For this purpose we need to model how $y$ is generated, based on some underlying dependency $f(x)$; thus, we specify the functional form of $p(y|x, f(x))$ and maximize
the expression with respect to $f$. This will provide us with the function $f$ that is most likely to have generated the data.

Definition 3.5 (Likelihood) The likelihood of a sample $(x_1, y_1), \dots, (x_m, y_m)$ given an underlying functional dependency $f$ is given by
Log-Likelihood

Strictly speaking, the likelihood only depends on the values $f(x_1), \dots, f(x_m)$, rather than being a functional of $f$ itself. To keep the notation simple, however, we write $p(\{x_1, \dots, x_m\}, \{y_1, \dots, y_m\} | f)$ instead of the more heavyweight expression $p(\{x_1, \dots, x_m\}, \{y_1, \dots, y_m\} | \{f(x_1), \dots, f(x_m)\})$. For practical reasons, we convert products into sums by taking the negative logarithm of $p(\{x_1, \dots, x_m\}, \{y_1, \dots, y_m\} | f)$, an expression which is then conveniently minimized. Furthermore, we may drop the $p(x_i)$ from (3.18), since they do not depend on $f$. Thus, maximization of (3.18) is equivalent to minimization of the negative log-likelihood

Regression
Remark 3.6 (Regression Loss Functions) Minimization of $\mathcal{L}[f]$ and of $R_{\mathrm{emp}}[f]$ coincide if the loss function $c$ is chosen according to

Assuming that the target values $y_i$ were generated by an underlying functional dependency $f$ plus additive noise $\xi_i$ with density $p_\xi$, i.e. $y_i = f_{\mathrm{true}}(x_i) + \xi_i$, we obtain
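Assuming, as Remark 3.6 and Table 3.1 indicate, that the loss takes the form $c(x, y, f(x)) = -\ln p(y - f(x))$ up to additive constants, the correspondence can be sketched as follows; the concrete noise densities below are our illustrative choices.

```python
import numpy as np

def loss_from_density(neg_log_density):
    # turn a noise model p(xi) into the loss c(x, y, f(x)) = -ln p(y - f(x))
    return lambda y, f: neg_log_density(y - f)

# Gaussian noise corresponds to the squared loss (up to constants):
gauss = loss_from_density(lambda xi: 0.5 * xi ** 2 + 0.5 * np.log(2 * np.pi))
# Laplacian noise corresponds to the l1 loss (up to constants):
laplace = loss_from_density(lambda xi: np.abs(xi) + np.log(2.0))
print(gauss(1.0, 0.2), laplace(1.0, 0.2))
```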
Classification
Things are slightly different in classification. Since all we are interested in is the probability that pattern $x$ has label $1$ or $-1$ (assuming binary classification), we can transform the problem into one of estimating the logarithm of the probability that a pattern assumes its correct label.

Remark 3.7 (Classification Loss Functions) We have a finite set of labels, which allows us to model $P(y|f(x))$ directly, instead of modelling a density. In the binary classification case (classes $1$ and $-1$) this problem becomes particularly easy, since all we have to do is assume a functional dependency underlying $P(1|f(x))$; this immediately gives us $P(-1|f(x)) = 1 - P(1|f(x))$. The link to loss functions is established via

The same result can be obtained by minimizing the cross entropy⁶ between the classification labels $y_i$ and the probabilities $p(y|f(x))$, as is typically done in a generalized linear models context (see e.g., [355, 232, 163]).
Table 3.1 Common loss functions and corresponding density models according to Remark 3.6. As a shorthand, we use $c(f(x) - y) := c(x, y, f(x))$; each density is $p(\xi) \propto \exp(-c(\xi))$.

    ε-insensitive:          $c(\xi) = |\xi|_\varepsilon$
    Laplacian:              $c(\xi) = |\xi|$
    Gaussian:               $c(\xi) = \frac{1}{2}\xi^2$
    Huber's robust loss:    $c(\xi) = \frac{1}{2\sigma}\xi^2$ if $|\xi| \leq \sigma$, $|\xi| - \frac{\sigma}{2}$ otherwise
    Polynomial:             $c(\xi) = \frac{1}{p}|\xi|^p$
    Piecewise polynomial:   $c(\xi) = \frac{1}{p\sigma^{p-1}}|\xi|^p$ if $|\xi| \leq \sigma$, $|\xi| - \sigma\frac{p-1}{p}$ otherwise
For binary classification (with $y \in \{\pm 1\}$) we obtain
When substituting the actual values for y into (3.23), this reduces to (3.22).
At this point we have a choice in modelling $P(y = 1|f(x))$ to suit our needs. Possible models include the logistic transfer function, the probit model, and the inverse complementary log-log model. See Section 16.3.5 for a more detailed discussion of the choice of such link functions. Below, we explain the connections in some more detail for the logistic link function. For a logistic model, where $P(y = \pm 1 | x, f) \propto \exp(\pm \frac{1}{2} f(x))$, we obtain after normalization
Examples
and consequently $-\ln P(y = 1|x, f) = \ln(1 + \exp(-f(x)))$. We thus recover (3.5) as the loss function for classification. Choices other than (3.24) for a map $\mathbb{R} \to [0, 1]$ will lead to further loss functions for classification. See [579, 179, 596] and Section 16.1.1 for more details on this subject.

It is important to note that not every loss function used in classification corresponds to such a density model (recall that in this case, the probabilities have to add up to 1 for any value of $f(x)$). In fact, one of the most popular loss functions, the soft margin loss (3.3), does not enjoy this property. A discussion of these issues can be found in [521].

Table 3.1 summarizes common loss functions and the corresponding density models as defined by (3.21), some of which were already presented in Section 3.1. It is an exhaustive list of the loss functions that will be used in this book for regression. Figure 3.2 contains graphs of the functions.
Figure 3.2 Graphs of loss functions and corresponding density models. Upper left: Gaussian; upper right: Laplacian; lower left: Huber's robust; lower right: ε-insensitive.
Practical Considerations
We conclude with a few cautionary remarks. The loss function resulting from maximum likelihood reasoning might be non-convex. This might spell trouble when we try to find an efficient solution of the corresponding minimization problem. Moreover, we made a very strong assumption by claiming to know $P(y|x, f)$ explicitly, which was necessary in order to evaluate (3.20). Finally, the solution we obtain by minimizing the negative log-likelihood depends on the class of functions $\mathcal{F}$. So we are in no better situation than by minimizing $R_{\mathrm{emp}}[f]$, albeit with the additional constraint that the loss functions $c(x, y, f(x))$ must correspond to a probability density.
3.3.2  Efficiency
The above reasoning could mislead us into thinking that the choice of loss function is rather arbitrary, and that there exists no good means of assessing the performance of an estimator. In the present section we will develop tools which can be used to compare estimators that are derived from different loss functions. For this purpose we need to introduce additional statistical concepts which deal with the efficiency of an estimator. Roughly speaking, these give an indication of how
"noisy" an estimator is with respect to a reference estimator. We begin by formalizing the concept of an estimator.

Estimator

Denote by $P(y|\theta)$ a distribution of $y$ depending (amongst other variables) on the parameters $\theta$, and by $Y = \{y_1, \dots, y_m\}$ an $m$-sample drawn iid from $P(y|\theta)$. Note that the use of the symbol $y$ bears no relation to the $y_i$ that are outputs of some functional dependency (cf. Chapter 1). We employ this symbol because some of the results to be derived will later be applied to the outputs of SV regression. Next, we introduce the estimator $\hat{\theta}(Y)$ of the parameters $\theta$, based on $Y$. For instance, $P(y|\theta)$ could be a Gaussian with fixed variance and mean $\theta$, and $\hat{\theta}(Y)$ could be the estimator $\frac{1}{m} \sum_{i=1}^m y_i$. To avoid cumbersome notation, we use the shorthand
to express expectations of a random variable $\xi(y)$ with respect to $P(y|\theta)$. One criterion that we might impose on an estimator is that it be unbiased, i.e., that on average, it tell us the correct value of the parameter it attempts to estimate.

Definition 3.8 (Unbiased Estimator) An unbiased estimator $\hat{\theta}(Y)$ of the parameters $\theta$ in $P(y|\theta)$ satisfies
In this section, we will focus on unbiased estimators. In general, however, the estimators we are dealing with in this book will not be unbiased. In fact, they will have a bias towards 'simple', low-complexity functions. Properties of such estimators are more difficult to deal with, which is why, for the sake of simplicity, we restrict ourselves to the unbiased case in this section. Note, however, that biasedness is not a bad property by itself. On the contrary, there exist cases, such as the one described by James and Stein [262], where biased estimators consistently outperform unbiased estimators in the finite sample size setting, both in terms of variance and prediction error.

A possible way to compare unbiased estimators is to compute their variance. Other quantities, such as moments of higher order or maximum deviation properties, would be valid criteria as well, yet for historical and practical reasons the variance has become a standard tool to benchmark estimators. The Fisher information matrix is crucial for this purpose, since it will tell us, via the Cramér-Rao bound (Theorem 3.11), the minimal possible variance for an unbiased estimator. The idea is that the smaller the variance, the lower (typically) the probability that $\hat{\theta}(Y)$ will deviate from $\theta$ by a large amount. Therefore, we can use the variance as a possible one-number summary to compare different estimators.

Definition 3.9 (Score Function, Fisher Information, Covariance) Assume there exists a density $p(y|\theta)$ for the distribution $P(y|\theta)$ such that $\ln p(y|\theta)$ is differentiable with
respect to θ. The score V_θ(Y) of P(y|θ) is a random variable defined by⁷

$$V_\theta(Y) := \partial_\theta \ln p(Y|\theta).$$
This score tells us how much the likelihood of the data depends on the different components of θ, and thus, in the maximum likelihood procedure, how much the data affect the choice of θ. The covariance of V_θ(Y) is called the Fisher information matrix I; it is given by

$$I := \mathbf{E}_\theta\Big[V_\theta(Y)\, V_\theta(Y)^{\!\top}\Big],$$
and the covariance matrix B of the estimator θ̂(Y) is defined by

$$B := \mathbf{E}_\theta\Big[\big(\hat\theta(Y) - \mathbf{E}_\theta[\hat\theta(Y)]\big)\big(\hat\theta(Y) - \mathbf{E}_\theta[\hat\theta(Y)]\big)^{\!\top}\Big].$$
The covariance matrix B tells us the amount of variation of the estimator. It can therefore be used (e.g., via Chebyshev's inequality) to bound the probability that θ̂(Y) deviates from θ by more than a certain amount.

Remark 3.10 (Expected Value of the Fisher Score) One can check that the expected value of V_θ(Y) is 0, since

$$\mathbf{E}_\theta[V_\theta(Y)] = \int \partial_\theta \ln p(Y|\theta)\; p(Y|\theta)\, dY = \int \partial_\theta p(Y|\theta)\, dY = \partial_\theta \int p(Y|\theta)\, dY = \partial_\theta 1 = 0.$$
Average Fisher Score Vanishes
In other words, the contribution of Y to the adjustment of θ averages to 0 over all possible Y drawn according to P(Y|θ). Equivalently, we could say that the average likelihood of Y drawn according to P(Y|θ) is extremal, provided we choose θ correctly: the derivative of the expected log-likelihood of the data, E_θ[ln P(Y|θ)], with respect to θ vanishes. This is also what we expect, namely that the "proper" distribution is, on average, the one with the highest likelihood. The following theorem gives a lower bound on the variance of an estimator; i.e., B is bounded from below in terms of the Fisher information I. This is useful for determining how well a given estimator performs with respect to the one with the lowest possible variance.

Theorem 3.11 (Cramér and Rao [425]) Any unbiased estimator θ̂(Y) satisfies

$$\det IB \geq 1. \qquad (3.31)$$
Proof We prove (3.31) for the scalar case. The extension to matrices is left as an exercise (see Problem 3.10). Using the Cauchy-Schwarz inequality, we obtain

$$\Big(\mathbf{E}_\theta\big[(\hat\theta(Y) - \theta)\, V_\theta(Y)\big]\Big)^2 \leq \mathbf{E}_\theta\big[(\hat\theta(Y) - \theta)^2\big]\; \mathbf{E}_\theta\big[V_\theta(Y)^2\big] = BI.$$
7. Recall that ∂_θ p(Y|θ) is the gradient of p(Y|θ) with respect to the parameters θ₁, ..., θₙ.
At the same time, E_θ[V_θ(Y)] = 0 implies that

$$\mathbf{E}_\theta\big[(\hat\theta(Y) - \theta)\, V_\theta(Y)\big] = \int \hat\theta(Y)\, \partial_\theta p(Y|\theta)\, dY = \partial_\theta\, \mathbf{E}_\theta\big[\hat\theta(Y)\big] = \partial_\theta\, \theta = 1,$$
since we may interchange integration over Y and differentiation with respect to θ. Together with the Cauchy-Schwarz bound above, this yields BI ≥ 1 and proves (3.31) in the scalar case. Eq. (3.31) lends itself to the definition of a one-number summary of the properties of an estimator, namely how closely the inequality is met.

Definition 3.12 (Efficiency) The statistical efficiency e of an estimator θ̂(Y) is defined as

$$e := \frac{1}{\det IB}.$$
The closer e is to 1, the lower the variance of the corresponding estimator θ̂(Y). For the special class of estimators that minimize loss functions, the following theorem allows us to compute B and e efficiently.
Asymptotic Variance
Theorem 3.13 (Murata, Yoshizawa, Amari [379, Lemma 3]) Assume that θ̂ is defined by θ̂(Y) := argmin_θ d(Y, θ), and that d is a twice differentiable function in θ. Then, asymptotically for increasing sample size m → ∞, the variance B is given by B = Q⁻¹GQ⁻¹. Here

$$G = \mathbf{E}_\theta\Big[\partial_\theta d(Y, \theta)\, \big(\partial_\theta d(Y, \theta)\big)^{\!\top}\Big] \qquad (3.38) \qquad \text{and} \qquad Q = \mathbf{E}_\theta\Big[\partial_\theta^2\, d(Y, \theta)\Big], \qquad (3.39)$$
and therefore e = (det Q)²/(det IG). This means that, for the class of estimators defined via d, the evaluation of their asymptotic efficiency can be conveniently achieved via (3.38) and (3.39). For scalar valued estimators θ̂(Y) ∈ ℝ, these expressions can be greatly simplified to

$$G = \mathbf{E}_\theta\Big[\big(\partial_\theta d(Y, \theta)\big)^2\Big] \qquad (3.41), \qquad Q = \mathbf{E}_\theta\Big[\partial_\theta^2\, d(Y, \theta)\Big] \qquad (3.42), \qquad \text{and} \qquad e = \frac{Q^2}{IG}.$$
Finally, in the case of continuous densities, Theorem 3.13 may be extended to piecewise twice differentiable continuous functions d, by convolving the latter with a twice differentiable smoothing kernel, and letting the width of the smoothing kernel converge to zero. We will make use of this observation in the next section when studying the efficiency of some estimators.
The current section concludes with the proof that the maximum likelihood estimator meets the Cramér-Rao bound.

Theorem 3.14 (Efficiency of Maximum Likelihood [118, 218, 43]) The maximum likelihood estimator (cf. (3.18) and (3.19)), given by

$$\hat\theta(Y) := \operatorname*{argmin}_{\theta}\; \big(-\ln p(Y|\theta)\big), \qquad (3.43)$$
is asymptotically efficient (e = 1). To keep things simple, we will prove the claim only for the class of twice differentiable continuous densities, by applying Theorem 3.13. For a more general proof see [118, 218, 43].

Proof By construction, G is equal to the Fisher information matrix if we choose d according to (3.43). Hence a sufficient condition is that Q = −I, which is what we show below. To this end we expand the integrand of (3.42),

$$\partial_\theta^2 \ln p(Y|\theta) = \frac{\partial_\theta^2\, p(Y|\theta)}{p(Y|\theta)} - \left(\frac{\partial_\theta\, p(Y|\theta)}{p(Y|\theta)}\right)^{2}. \qquad (3.44)$$
The expectation of the second term in (3.44) equals −I. We now show that the expectation of the first term vanishes:

$$\mathbf{E}_\theta\left[\frac{\partial_\theta^2\, p(Y|\theta)}{p(Y|\theta)}\right] = \int \partial_\theta^2\, p(Y|\theta)\, dY = \partial_\theta^2 \int p(Y|\theta)\, dY = 0.$$
Hence Q = −I, and thus e = Q²/(IG) = 1. This proves that the maximum likelihood estimator is asymptotically efficient. It appears as if the best thing we could do is to use the maximum likelihood (ML) estimator. Unfortunately, reality is not quite so simple. First, the above statement holds only asymptotically. This leads to the (justified) suspicion that for finite sample sizes we may be able to do better than ML estimation. Second, practical considerations, such as the additional goal of a sparse decomposition, may lead to the choice of a non-optimal loss function. Finally, we may not know the true density model, which is required for the definition of the maximum likelihood estimator. We can try to make an educated guess; bad guesses of the class of densities, however, can lead to large errors in the estimation (see, e.g., [251]). This prompted the development of robust estimators.
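The asymptotic statements above are easy to check empirically. The following Monte Carlo sketch (ours, with illustrative parameters, not from the book) compares the sample mean, the ML estimator for Gaussian noise, with the sample median, the ML estimator for the "wrong" (Laplacian) noise model:

```python
import numpy as np

# Monte Carlo check of estimator efficiency for a Gaussian location model.
# For an m-sample from N(theta, 1) the Fisher information is I = m, so the
# Cramer-Rao bound on the variance of an unbiased estimator is 1/m.
rng = np.random.default_rng(0)
m, trials, theta = 1000, 5000, 0.0
samples = rng.normal(theta, 1.0, size=(trials, m))

mean_est = samples.mean(axis=1)          # ML estimator for Gaussian noise
median_est = np.median(samples, axis=1)  # ML estimator for Laplacian noise

crb = 1.0 / m
for name, est in [("mean", mean_est), ("median", median_est)]:
    print(f"{name:6s}: var = {est.var():.2e}, efficiency e ~ {crb / est.var():.3f}")
# Expected output: the mean attains e ~ 1, the median only e ~ 2/pi ~ 0.64.
```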
3.4 Robust Estimators

So far, in order to make any practical predictions, we had to assume a certain class of distributions from which P(Y) was chosen. Likewise, in the case of risk functionals, we also assumed that training and test data are identically distributed. This section provides tools to safeguard ourselves against cases where the above
Outliers
assumptions are not satisfied. More specifically, we would like to avoid a certain fraction ν of 'bad' observations (often also referred to as 'outliers') seriously affecting the quality of the estimate. This implies that the influence of individual patterns should be bounded from above. Huber [250] gives a detailed list of desirable properties of a robust estimator. We refrain from reproducing this list at present, or committing to a particular definition of robustness. As usual in the location parameter estimation context (i.e., estimation of the expected value of a random variable), we assume a specific parametric form of p(Y|θ), namely

$$p(Y|\theta) = \prod_{i=1}^{m} p(y_i - \theta).$$
Unless stated otherwise, this is the formulation we will use throughout this section.

3.4.1 Robustness via Loss Functions
Huber's idea [250] in constructing a robust estimator was to take a loss function as provided by the maximum likelihood framework, and modify it in such a way as to limit the influence of each individual pattern. This is done by providing an upper bound on the slope of −ln p(Y|θ). We shall see that methods such as the trimmed mean or the median are special cases thereof. The ε-insensitive loss function can also be viewed as a trimmed estimator. This will lead to the development of adaptive loss functions in the subsequent sections. We begin with the main theorem of this section.
Mixture Densities
Theorem 3.15 (Robust Loss Functions (Huber [250])) Let 𝔓 be a class of densities formed by mixtures

$$\mathfrak{P} := \left\{\, p \;\middle|\; p = (1 - \varepsilon)\, p_0 + \varepsilon\, p_1 \,\right\}.$$
Moreover, assume that both p₀ and p₁ are symmetric with respect to the origin, that their logarithms are twice continuously differentiable, that ln p₀ is convex and known, and that p₁ is unknown. Then the density

$$p(y) = \begin{cases} (1 - \varepsilon)\, p_0(y) & \text{for } |y| \leq \theta_0 \\ (1 - \varepsilon)\, p_0(\theta_0)\, e^{-k(|y| - \theta_0)} & \text{for } |y| > \theta_0 \end{cases} \qquad (3.48)$$
is robust in the sense that the maximum likelihood estimator corresponding to (3.48) has minimum variance with respect to the "worst" possible density p_worst = (1 − ε)p₀ + εp₁: it is a saddle point (located at p_worst) in terms of variance with respect to the true density p ∈ 𝔓 and the density p̃ ∈ 𝔓 used in estimating the location parameter. This means that no density p has larger variance than p_worst, and that for p = p_worst no estimator is better than the one where p̃ = p_worst is used in the robust estimator. The constants k > 0 and θ₀ are obtained by the normalization conditions, that p be a
proper density and that the first derivative of ln p be continuous.
Proof To show that p is a saddle point in 𝔓, we have to prove that (a) no estimation procedure other than the one using −ln p as the loss function has lower variance for the density p, and that (b) no density has higher variance than p if −ln p is used as the loss function. Part (a) follows immediately from the Cramér-Rao theorem (Th. 3.11); part (b) can be proved as follows. We use Theorem 3.13, and a proof technique pointed out in [559], to compute the variance of an estimator using −ln p as loss function.
Here p′ is an arbitrary density, which we will choose such that B is maximized. By construction,
Thus any density p′ which is 0 on [−θ₀, θ₀] will minimize the denominator (the term depending on p′ will be 0, which is the lowest obtainable value due to (3.51)), and maximize the numerator, since in the latter the contribution of p′ is always limited to k²ε. Now ε⁻¹(p − (1 − ε)p₀) is exactly such a density. Hence the saddle point property holds.

Remark 3.16 (Robustness Classes) If we have more knowledge about the class of densities 𝔓, a different loss function will have the saddle point property. For instance, using a similar argument as above, one can show that the normal distribution is robust in the class of all distributions with bounded variance. This implies that, among all possible distributions with bounded variance, the estimator of the mean of a normal distribution has the highest variance. Likewise, the Laplacian distribution is robust in the class of all symmetric distributions with density p(0) ≥ c for some fixed c > 0 (see [559, 251] for more details).
Hence, even though a loss function defined according to Theorem 3.15 is generally desirable, we may be less cautious, and use a different loss function for improved performance, when we have additional knowledge of the distribution.

Remark 3.17 (Mean and Median) Assume we are dealing with a mixture of a normal distribution with variance σ² and an additional unknown distribution with weight at most ε. It is easy to check that the application of Theorem 3.15 to normal distributions yields Huber's robust loss function from Table 3.1. The maximizer of the likelihood (see also Problem 3.17) is a trimmed mean estimator which discards ε of the data: effectively, all yᵢ deviating from the mean by more than σ are
ignored, and the mean is computed from the remaining data. Hence Theorem 3.15 gives a formal justification for this popular type of estimator. If we let ε → 1, we recover the median estimator, which stems from a Laplacian distribution. Here, all patterns but the median one are discarded.
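The following sketch (ours; the contamination model and all constants are made up for illustration) contrasts these estimators on a sample with 5% gross outliers:

```python
import numpy as np

def huber_loss(r, sigma=1.0):
    """Huber's robust loss: quadratic inside [-sigma, sigma], linear outside."""
    a = np.abs(r)
    return np.where(a <= sigma, 0.5 * r ** 2, sigma * a - 0.5 * sigma ** 2)

def location_estimate(y, loss):
    """Brute-force minimizer of sum_i loss(theta - y_i) over a grid
    (a crude stand-in for a proper one-dimensional optimizer)."""
    grid = np.linspace(y.min(), y.max(), 4001)
    return grid[np.argmin([loss(t - y).sum() for t in grid])]

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 1.0, 95),    # inliers
                    rng.normal(20.0, 1.0, 5)])   # 5% gross outliers

print("mean        :", y.mean())                 # dragged by the outliers
print("median      :", np.median(y))             # the eps -> 1 limit
print("Huber       :", location_estimate(y, huber_loss))
print("trimmed mean:", np.sort(y)[5:-5].mean())  # discard the extremes
```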
Besides the classical examples of loss functions and density models, we might also consider a slightly unconventional estimation procedure: use the average of the k-smallest and the k-largest of all observations as the estimated mean of the underlying distribution (for sorted observations yᵢ, with yᵢ ≤ yⱼ for 1 ≤ i < j ≤ m, this trimmed interval estimator computes (y_k + y_{m−k+1})/2). This procedure makes sense, for instance, when we are trying to infer the mean of a random variable generated by roundoff noise (i.e., noise whose density is constant within some bounded interval) plus an additional unknown amount of noise. Note that the patterns strictly inside or strictly outside an interval of size [−ε, ε] around the estimate have no direct influence on the outcome: only patterns on the boundary matter. This is very similar to the behavior of Support Vector Machines in regression, where the boundary patterns become support patterns, and one can show that it corresponds to the minimizer of the ε-insensitive loss function (3.9). We will study the properties of the latter in more detail in the following section, and thereafter show how it can be transformed into an adaptive risk functional.

3.4.2 Efficiency and the ε-Insensitive Loss Function
The tools of Section 3.3.2 allow us to analyze the ε-insensitive loss function in more detail. Even though the asymptotic estimation of a location parameter is a gross oversimplification of what happens in an SV regression estimator (where we estimate a nonparametric function, and moreover have only a limited number of observations at our disposal), it will provide us with useful insights into this more complex case [510, 481]. In a first step, we compute the efficiency of an estimator, for several noise models and amounts of variance, using the density corresponding to the ε-insensitive loss function (cf. Table 3.1),

$$p_\varepsilon(y) = \frac{1}{2(1 + \varepsilon)}\, \exp\big(-|y|_\varepsilon\big), \qquad \text{where } |y|_\varepsilon := \max\{0,\, |y| - \varepsilon\}.$$
For this purpose we have to evaluate the quantities G (3.41) and Q (3.42) of Theorem 3.13. We obtain

$$G = 1 - \int_{-\varepsilon}^{\varepsilon} p(t)\, dt \qquad (3.53) \qquad \text{and} \qquad Q = p(-\varepsilon) + p(\varepsilon). \qquad (3.54)$$
The Fisher information I of m iid random variables distributed according to p_θ is m times that of a single random variable. Thus all dependencies on m in e cancel out, and we can limit ourselves to the case m = 1 for the analysis of the
efficiency of estimators. Now we may check what happens if we use the ε-insensitive loss function for different types of noise model. For the sake of simplicity, we begin with Gaussian noise.

Example 3.18 (Gaussian Noise) Assume that y is normally distributed with zero mean (i.e., θ = 0) and variance σ². By construction, the minimum obtainable variance is I⁻¹ = σ² (recall that m = 1). Moreover, (3.53) and (3.54) yield

$$G = 1 - \operatorname{erf}\!\left(\frac{\varepsilon}{\sqrt{2}\,\sigma}\right) \qquad \text{and} \qquad Q = \sqrt{\frac{2}{\pi\sigma^2}}\; e^{-\varepsilon^2/(2\sigma^2)}. \qquad (3.55)$$
The efficiency e = Q²/(IG) is maximized for ε = 0.6120σ. This means that if the underlying noise model is Gaussian with standard deviation σ, and we have to use an ε-insensitive loss function to estimate a location parameter, the most efficient estimator from this family is obtained for ε = 0.6120σ. The consequence of (3.55) is that the optimal value of ε scales linearly with σ. Of course, we could just use squared loss in such a situation; but in general we will not know the exact noise model, and squared loss does not lead to robust estimators. The following lemma (which will come in handy in the next section) shows that this scaling is a general property of the ε-insensitive loss.

Lemma 3.19 (Linear Dependency between ε-Tube Width and Variance) Denote by p a symmetric density with variance σ² > 0. Then the optimal value of ε (i.e., the value that achieves maximum asymptotic efficiency) for an estimator using the ε-insensitive loss is given by

$$\varepsilon_{\mathrm{opt}} = \sigma \cdot \operatorname*{argmax}_{\tau}\; \frac{\big(p_{\mathrm{std}}(-\tau) + p_{\mathrm{std}}(\tau)\big)^2}{1 - \int_{-\tau}^{\tau} p_{\mathrm{std}}(t)\, dt}, \qquad (3.56)$$
where p_std(τ) := σ p(στ + θ|θ) is the standardized version of p(y|θ), i.e., it is obtained by rescaling p(y|θ) to zero mean and unit variance. Since p_std is independent of σ, we have a linear dependency between ε_opt and σ. The scaling factor depends on the noise model.

Proof We prove (3.56) by rewriting the efficiency e(ε) in terms of p_std via p(y|θ) = σ⁻¹ p_std(σ⁻¹(y − θ)). This yields

$$e(\varepsilon) = \frac{Q^2}{IG} = \frac{1}{I\sigma^2}\cdot \frac{\big(p_{\mathrm{std}}(-\sigma^{-1}\varepsilon) + p_{\mathrm{std}}(\sigma^{-1}\varepsilon)\big)^2}{1 - \int_{-\sigma^{-1}\varepsilon}^{\sigma^{-1}\varepsilon} p_{\mathrm{std}}(t)\, dt}.$$
The maximum of e(ε) does not depend directly on ε, but only on σ⁻¹ε (which is independent of σ). Hence we can find argmax_ε e(ε) by solving (3.56). Lemma 3.19 made it apparent that in order to adjust ε we have to know σ beforehand. Unfortunately, the latter is usually unknown at the beginning of the
estimation procedure.⁸ The solution to this dilemma is to make ε adaptive.
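Before turning to adaptive losses, here is a quick numerical check of Example 3.18 (our sketch; it hard-codes the unit variance Gaussian formulas for G and Q from (3.53) and (3.54)):

```python
import numpy as np
from math import erf, pi

# Efficiency e(eps) = Q^2 / (I G) of the eps-insensitive estimator under
# unit-variance Gaussian noise (m = 1, so I = 1).
eps = np.linspace(0.0, 2.0, 20001)
G = 1.0 - np.array([erf(e / np.sqrt(2.0)) for e in eps])  # P(|y| >= eps)
Q = 2.0 / np.sqrt(2.0 * pi) * np.exp(-eps ** 2 / 2.0)     # p(-eps) + p(eps)
e_eff = Q ** 2 / G

i = int(np.argmax(e_eff))
print(f"optimal eps ~ {eps[i]:.4f}")  # prints ~0.6120, as quoted in the text
```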
3.4.3 Adaptive Loss Functions
We again consider the trimmed mean estimator, which discards a predefined fraction of the largest and smallest samples. This method belongs to the more general class of quantile estimators, which base their estimates on the values of samples in a certain quantile. The latter methods do not require prior knowledge of the variance, and adapt to whatever scale is required. What we need is a technique which connects σ (in Huber's robust loss function) or ε (in the ε-insensitive loss case) with the deviations between the estimate θ̂ and the random variables yᵢ. Let us analyze what happens to the negative log-likelihood if, in the ε-insensitive case, we change ε to ε + δ (with δ ∈ ℝ) while keeping θ fixed. In particular, we assume that |δ| is chosen sufficiently small that for all i = 1, ..., m,

$$\operatorname{sgn}\big(|\theta - y_i| - \varepsilon\big) = \operatorname{sgn}\big(|\theta - y_i| - (\varepsilon + \delta)\big).$$
Moreover, denote by m_<, m_=, m_> the number of samples for which |θ − yᵢ| is less than, equal to, or greater than ε, respectively. Then

$$\sum_{i=1}^m |\theta - y_i|_{\varepsilon + \delta} = \sum_{i=1}^m |\theta - y_i|_{\varepsilon} - \begin{cases} \delta\, m_> & \text{for } \delta > 0 \\ \delta\, (m_> + m_=) & \text{for } \delta < 0. \end{cases} \qquad (3.58)$$
In other words, the amount by which the loss changes depends only on the quantiles at ε. What happens if we make ε itself a variable of the optimization problem? By the scaling properties of (3.58), one can see that for ν ∈ [0, 1] the functional

$$\nu\varepsilon + \frac{1}{m}\sum_{i=1}^m |\theta - y_i|_\varepsilon \qquad (3.59)$$
ν-Property
is minimized if ε is chosen such that

$$\frac{m_>}{m} \;\leq\; \nu \;\leq\; \frac{m_> + m_=}{m}.$$
This relation holds, since at the solution (θ̂, ε̂) the estimate also has to be optimal with respect to ε alone, keeping θ fixed. In the latter case, however, the one-sided derivatives of the log-likelihood (i.e., error) term with respect to ε at the solution are given by −m_>/m from the right and −(m_> + m_=)/m from the left.⁹ These have to cancel against ν, which proves the claim. Furthermore, computing the derivative of (3.59) with respect to θ shows that the number of samples outside the interval [θ − ε, θ + ε] has to be equal on both half-lines (−∞, θ − ε) and (θ + ε, ∞). We have the following theorem:

8. The obvious question is why one would ever choose an ε-insensitive loss in the presence of Gaussian noise in the first place. If the complexity of the function expansion is of no concern and the highest accuracy is required, squared loss is to be preferred. In most cases, however, it is not quite clear what exactly the type of the additive noise model is. This is when we would like to have a more conservative estimator. In practice, the ε-insensitive loss has been shown to work rather well on a variety of tasks (Chapter 9).
Theorem 3.20 (Quantile Estimation as Optimization Problem [481]) A quantile procedure, estimating the mean of a distribution by taking the average of the samples at the ν/2-th and (1 − ν/2)-th quantile, is equivalent to minimizing (3.59). In particular:
1. ν is an upper bound on the fraction of samples outside the closed interval [θ − ε, θ + ε].
2. ν is a lower bound on the fraction of samples outside the open interval (θ − ε, θ + ε).
3. If the distribution p(θ) is continuous, then for all ν ∈ [0, 1],

$$\lim_{m \to \infty} \frac{m_>}{m} = \lim_{m \to \infty} \frac{m_> + m_=}{m} = \nu \quad \text{with probability 1.}$$
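The ν-property is easy to verify numerically; the following brute-force sketch (ours; sample size, grids, and ν are illustrative) minimizes (3.59) over a grid in (θ, ε) and checks the first bound:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(0.0, 1.0, 200)
nu = 0.3

def nu_risk(theta, eps):
    """The objective (3.59): nu*eps plus the mean eps-insensitive loss."""
    return nu * eps + np.mean(np.maximum(0.0, np.abs(theta - y) - eps))

thetas = np.linspace(-1.0, 1.0, 201)
epss = np.linspace(0.0, 3.0, 301)
risks = np.array([[nu_risk(t, e) for e in epss] for t in thetas])
it, ie = np.unravel_index(np.argmin(risks), risks.shape)
theta_hat, eps_hat = thetas[it], epss[ie]

frac_out = np.mean(np.abs(y - theta_hat) > eps_hat)
print(f"theta ~ {theta_hat:.3f}, eps ~ {eps_hat:.3f}")
print(f"fraction strictly outside the tube: {frac_out:.3f} <= nu = {nu}")
```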
Extension to General Robust Estimators
One might question the practical advantage of this method over direct trimming of the sample Y. In fact, the use of (3.59) is not recommended if all we want is to estimate θ. That said, (3.59) does allow us to employ trimmed estimation in the nonparametric case; cf. Chapter 9. Unfortunately, we were unable to find a similar method for Huber's robust loss function, since in this case the change in the negative log-likelihood incurred by changing σ involves not only the (statistical) rank of the yᵢ, but also the exact location of the samples with |yᵢ − θ| < σ. One way to overcome this problem is to re-estimate σ adaptively while minimizing a term similar to (3.59) (see [180] for details in the context of boosting, Section 10.6.3 for a discussion of online estimation techniques, or [251] for a general overview).

3.4.4 Optimal Choice of ν
Let us return to the ε-insensitive loss. A combination of Theorems 3.20 and 3.13 and Lemma 3.19 allows us to compute optimal values of ν for various distributions, provided that an ε-insensitive loss function is to be used in the estimation procedure.¹⁰ The idea is to determine the optimal value of ε for a fixed density p(y|θ) via (3.56), and to compute the corresponding fraction ν of patterns outside the interval [θ − ε, θ + ε].

9. Strictly speaking, the derivative is not defined at ε; the left and right hand side values are defined, however, which is sufficient for our purposes.
10. This is not optimal in the sense of Theorem 3.15, which suggests the use of a more adapted loss function. However (as already stated in the introduction of this chapter), algorithmic or technical reasons, such as computationally efficient solutions or limited memory, may provide sufficient motivation to use such a loss function.
Table 3.2 Optimal ν and ε for various degrees of polynomial additive noise.

Polynomial degree d        | 1 | 2      | 3      | 4      | 5      | 6      | 7      | 8      | 9      | 10
Optimal ν                  | 1 | 0.5405 | 0.2909 | 0.1898 | 0.1384 | 0.1080 | 0.0881 | 0.0743 | 0.0641 | 0.0563
Optimal ε (unit variance)  | 0 | 0.6120 | 1.1180 | 1.3583 | 1.4844 | 1.5576 | 1.6035 | 1.6339 | 1.6551 | 1.6704
Theorem 3.21 (Optimal Choice of ν) Denote by p a symmetric density with variance σ² > 0, and by p_std the corresponding rescaled density with zero mean and unit variance. Then the optimal value of ν (i.e., the value that achieves maximum asymptotic efficiency) for an estimator using the ε-insensitive loss is given by

$$\nu = 1 - \int_{-\varepsilon}^{\varepsilon} p_{\mathrm{std}}(t)\, dt, \qquad (3.62)$$
where ε is chosen according to (3.56). This expression is independent of σ.

Proof The independence of σ follows from the fact that ν depends only on p_std. Next we show (3.62). For a given density p, the asymptotically optimal value of ε is given by Lemma 3.19. The average fraction of patterns outside the interval [θ − ε_opt, θ + ε_opt] is

$$\nu = 1 - \int_{\theta - \varepsilon_{\mathrm{opt}}}^{\theta + \varepsilon_{\mathrm{opt}}} p(y|\theta)\, dy = 1 - \int_{-\sigma^{-1}\varepsilon_{\mathrm{opt}}}^{\sigma^{-1}\varepsilon_{\mathrm{opt}}} p_{\mathrm{std}}(t)\, dt, \qquad (3.63)$$
which depends only on σ⁻¹ε_opt and is thus independent of σ. Combining (3.63) with (3.56) yields the theorem.

This means that, given the type of additive noise, we can determine the value of ν which yields the asymptotically most efficient estimator, independently of the noise level. These theoretical predictions have since been confirmed rather accurately in a set of regression experiments [95]. Let us now look at some special cases.

Example 3.22 (Optimal ν for Polynomial Noise) Arbitrary polynomial noise models (∝ exp(−|ξ|ᵈ)) with unit variance can be written as

$$p(\xi) = \frac{d\, c}{2\, \Gamma(1/d)}\; \exp\big(-(c\, |\xi|)^d\big) \qquad \text{with} \qquad c = \sqrt{\frac{\Gamma(3/d)}{\Gamma(1/d)}},$$
Heavy Tails → Large ν
where Γ(x) is the gamma function. Figure 3.3 shows ν_opt for polynomial degrees in the interval [1, 10]. For convenience, the explicit numerical values are repeated in Table 3.2. Observe that as the distribution becomes "lighter-tailed", the optimal ν decreases; in other words, we may then use a larger amount of the data for the purpose of estimation. This is reasonable, since it is only for very long tails of the distribution (data with many
outliers) that we have to be conservative and discard a large fraction of observations.

Figure 3.3 Optimal ν and ε for various degrees of polynomial additive noise.
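The entries of Table 3.2 can be reproduced from (3.53), (3.54), and (3.62); the following sketch (ours) performs the maximization of Lemma 3.19 by grid search over ε for the unit variance density of Example 3.22:

```python
import numpy as np
from math import gamma

def p_std(t, d):
    """Unit-variance polynomial noise density ~ exp(-(c|t|)^d)."""
    c = (gamma(3.0 / d) / gamma(1.0 / d)) ** 0.5
    return d * c / (2.0 * gamma(1.0 / d)) * np.exp(-(c * np.abs(t)) ** d)

eps = np.linspace(1e-4, 4.0, 40001)
dt = eps[1] - eps[0]
for d in range(1, 11):
    p = p_std(eps, d)
    G = 1.0 - 2.0 * np.cumsum(p) * dt  # fraction outside [-eps, eps], (3.53)
    Q = 2.0 * p                        # p(-eps) + p(eps), (3.54)
    i = int(np.argmax(Q ** 2 / G))     # efficiency up to the constant 1/I
    print(f"d = {d:2d}: eps_opt ~ {eps[i]:.4f}, nu_opt ~ {G[i]:.4f}")
# The printed values agree with Table 3.2 up to grid resolution.
```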
Even though we derived these relations solely for the case where a single number (θ) has to be estimated, experiments show that the same scaling properties hold in the nonparametric case. It is still an open research problem to establish this connection exactly. As we shall see, in the nonparametric case the effect of ν is that it determines both the number of Support Vectors (i.e., the number of basis functions needed to expand the solution) and the fraction of function values f(xᵢ) with deviation larger than ε from the corresponding observations. Further information on this topic, both from the statistical and the algorithmic point of view, can be found in Section 9.3.
3.5 Summary

We saw in this chapter that there exist two complementary concepts of how risk and loss functions should be designed. The first is data driven and uses the incurred loss as its principal guideline, possibly modified to suit the needs of numerical efficiency. This leads to loss functions and to the definitions of the empirical and expected risk. The second method is based on the idea of estimating (or at least approximating) the distribution which may be responsible for generating the data. We showed
that in a maximum likelihood setting this concept is rather similar to the notions of risk and loss, with c(x, y, f(x)) = −ln p(y|x, f(x)) as the link between the two quantities. This point of view allowed us to analyze the properties of estimators in more detail, and to provide lower bounds on the performance of unbiased estimators, i.e., the Cramér-Rao theorem. The latter was then used as a benchmarking tool for various loss functions and density models, such as the ε-insensitive loss. The consequence of this analysis is a corroboration of experimental findings that there exists a linear correlation between the amount of noise in the observations and the optimal width ε. This, in turn, allowed us to construct adaptive loss functions which adjust themselves to the amount of noise, much like trimmed mean estimators. These formulations can be used directly in mathematical programs, leading to ν-SV algorithms in subsequent chapters. The question of which choices are optimal in a finite sample size setting remains an open research problem.
3.6 Problems

3.1 (Soft Margin and Logistic Regression •) The soft margin loss function c_soft and the logistic loss c_logist are asymptotically almost the same; show that
3.2 (Multi-class Discrimination ••) Assume you have to solve a classification problem with M different classes. Discuss how the number of functions used to solve this task affects the quality of the solution.
• How would the loss function look if you were to use only one real-valued function f : X → ℝ? Which symmetries are violated in this case (hint: what happens if you permute the classes)?
• How many functions do you need if each of them makes a binary decision f : X → {0, 1}?
• How many functions do you need in order to make the solution permutation symmetric with respect to the class labels?
• How should you assess the classification error? Is it a good idea to use the misclassification rate of one individual function as a performance criterion (hint: correlation of errors)? By how much can this error differ from the total misclassification error?

3.3 (Mean and Median •) Assume 8 people want to gather for a meeting; 5 of them live in Stuttgart and 3 in Munich. Where should they meet if (a) they want the total distance traveled by all people to be minimal, (b) they want the average distance traveled per person to be minimal, or (c) they want the average squared distance to be minimal? What happens
to the meeting points if one of the 3 people moves from Munich to Sydney?

3.4 (Locally Adaptive Loss Functions •••) Assume that the loss function c(x, y, f(x)) varies with x. What does this mean for the expected loss? Can you give a bound on the latter, even if you know p(y|x) and f at every point but know c only on a finite sample (hint: construct a counterexample)? How will things change if c cannot vary much with x?

3.5 (Transduction Error •••) Assume that we want to minimize the test error of misclassification R_test[f], given a training sample {(x₁, y₁), ..., (x_m, y_m)}, a test sample {x′₁, ..., x′_{m′}}, and a loss function c(x, y, f(x)). Show that any loss function c′(x′, f(x′)) on the test sample has to be symmetric in f, i.e., c′(x′, f(x′)) = c′(x′, −f(x′)). Prove that no non-constant convex function can satisfy this property. What does this mean for the practical solution of the optimization problem? See [267, 37, 211, 103] for details.

3.6 (Convexity and Uniqueness ••) Show that the problem of estimating a location parameter (a single scalar) has an interval [a, b] ⊂ ℝ of equivalent global minima if the loss functions are convex. For non-convex loss functions, construct an example where this is not the case.

3.7 (Linearly Dependent Parameters ••) Show that in a linear model f = Σᵢ aᵢ fᵢ on X it is impossible to find a unique set of optimal parameters aᵢ if the functions fᵢ are not linearly independent. Does this have any effect on f itself?

3.8 (Ill-posed Problems •••) Assume you want to solve the problem Ax = y, where A is a symmetric positive definite matrix, i.e., a matrix with nonnegative eigenvalues. If you change y to y′, how much will the solution x′ of Ax′ = y′ differ from x? Give lower and upper bounds on this quantity. Hint: decompose y into the eigensystem of A.

3.9 (Fisher Map [258] ••) Show that the map
maps x into vectors with zero mean and unit variance. Chapter 13 will use this map to design kernels.

3.10 (Cramér-Rao Inequality for Multivariate Estimators ••) Prove equation (3.31). Hint: start by applying the Cauchy-Schwarz inequality to
to obtain I and B, and compute the expected value coefficient-wise.

3.11 (Soft Margin Loss and Conditional Probabilities [521] •••) What is the conditional probability p(y|x) corresponding to the soft margin loss function c(x, y, f(x)) = max(0, 1 − yf(x))?
• How can you fix the problem that the probabilities p(−1|x) and p(1|x) have to sum to 1?
• How does the introduction of a third class ("don't know") change the problem? What is the problem with this approach? Hint: what is the behavior for large |f(x)|?

3.12 (Label Noise ••) Denote by P(y = 1|f(x)) and P(y = −1|f(x)) the conditional probabilities of labels ±1 for a classifier output f(x). How will P change if we randomly flip labels with probability η ∈ (0, 1)? How should you adapt your density model?

3.13 (Unbiased Estimators ••) Prove that the least mean squares estimator is unbiased for arbitrary symmetric distributions. Can you extend the result to arbitrary symmetric losses?

3.14 (Efficiency of Huber's Robust Estimator ••) Compute the efficiency of Huber's robust estimator in the presence of pure Gaussian noise with unit variance.
3.15 (Influence and Robustness •••) Prove that, for robust estimators using (3.48) as their density model, the maximum change in the minimizer of the empirical risk remains bounded when a sample yᵢ is changed to yᵢ + δ. What happens in the case of Gaussian density models (i.e., squared loss)?

3.16 (Robustness of Gaussian Distributions [559] •••) Prove that the normal distribution with variance σ² is robust among the class of distributions with variance bounded by σ². Hint: show that we have a saddle point analogous to Theorem 3.15 by exploiting Theorem 3.13 and Theorem 3.14.

3.17 (Trimmed Mean ••) Show that, under the assumption of an unknown distribution contributing at most ε, Huber's robust loss function for normal distributions leads to a trimmed mean estimator which discards ε of the data.

3.18 (Optimal ν for Gaussian Noise •) Give an explicit solution for the optimal ν in the case of additive Gaussian noise.

3.19 (Optimal ν for Discrete Distribution ••) Assume that we have a noise model with a discrete distribution of θ, where P(θ = ε) = P(θ = −ε) = p₁, P(θ = 2ε) = P(θ = −2ε) = p₂, 2(p₁ + p₂) = 1, and p₁, p₂ > 0. Compute the optimal value of ν.
4 Regularization

Overview

Minimizing the empirical risk can lead to numerical instabilities and bad generalization performance. A possible way to avoid this problem is to restrict the class of admissible solutions, for instance to a compact set. This technique was introduced by Tikhonov and Arsenin [538] for solving inverse problems, and has since been applied to learning problems with great success. In statistics, the corresponding estimators are often referred to as shrinkage estimators [262]. Kernel methods are best suited to two special types of regularization: a coefficient space constraint on the expansion coefficients of the weight vector in feature space [343, 591, 37, 517, 189], or, alternatively, a function space regularization directly penalizing the weight vector in feature space [573, 62, 561]. In this chapter we discuss the connections between regularization, Reproducing Kernel Hilbert Spaces (RKHS), feature spaces, and regularization operators. The connection to Gaussian Processes is explained in more detail in Section 16.3. These different viewpoints will help us gain insight into the success of kernel methods. We start by introducing regularized risk functionals (Section 4.1), followed by a discussion of the Representer Theorem, which describes the functional form of the minimizers of a certain class of such risk functionals (Section 4.2). Section 4.3 introduces regularization operators and details their connection to SV kernels. Sections 4.4 through 4.6 look at this connection for specific classes of kernels. Following that, several sections deal with regularization issues of interest for machine learning: vector-valued functions (Section 4.7), semiparametric regularization (Section 4.8), and finally coefficient-based regularization (Section 4.9).

Prerequisites

This chapter may not be easy to digest for some of our readers. We recommend that most readers nevertheless go through Sections 4.1 and 4.2; those two sections are accessible with the background given in Chapters 1 and 2. The following Section 4.3 is somewhat more technical, since it uses the concepts of Green's functions and operators, but it should nevertheless be looked at; a background in functional analysis will be helpful. Sections 4.4, 4.5, and 4.6 are more difficult, and require a solid knowledge of Fourier integrals and elements of the theory of special functions. To understand Section 4.7, some basic notions of group theory are beneficial. Finally, Sections 4.8 and 4.9 do not require additional knowledge beyond the basic concepts put forward in the introductory chapters. Yet some readers may find it beneficial to read these two last sections after they have gained a deeper insight into classification, regression, and mathematical programming, as provided by Chapters 6, 7, and 9.
4.1 The Regularized Risk Functional
Continuity Assumption
The key idea in regularization is to restrict the class 𝔉 of possible minimizers (with f ∈ 𝔉) of the empirical risk functional R_emp[f] such that 𝔉 becomes a compact set. While there exist various characterizations of compact sets, and we may define a large variety of such sets to suit different assumptions on the type of estimates we get, the common key idea is compactness. In addition, we will assume that R_emp[f] is continuous in f. Note that this is a stronger assumption than it may appear at first glance. It is easily satisfied for many regression problems, such as those using the squared loss or the ε-insensitive loss. Yet binary valued loss functions, as often used in classification (such as c(x, y, f(x)) = ½(1 − sgn yf(x))), do not meet the requirement. Since both the exact minimization of R_emp[f] for classification problems [367], even with very restricted classes of functions, and the approximate solution of this problem [20] have been proven to be NP-hard, we will not bother with this case any further, but rather attempt to minimize a continuous approximation of the 0-1 loss, such as the soft margin loss function (3.3). We may now apply the operator inversion lemma to show that for compact 𝔉, the inverse map from the minimum of the empirical risk functional R_emp[f] : 𝔉 → ℝ to its minimizer f is continuous, and the optimization problem is well-posed.

Theorem 4.1 (Operator Inversion Lemma (e.g., [431])) Let X be a compact set and let the map f : X → Y be continuous. Then there exists an inverse map f⁻¹ : f(X) → X that is also continuous.
We do not directly specify a compact set 𝔉, since this leads to a constrained
Regularization Term
optimization problem, which can be cumbersome in practice. Instead, we add a stabilization (regularization) term Ω[f] to the original objective function, which could be R_emp[f], for instance. This, too, leads to better conditioning of the problem. We consider the following class of regularized risk functionals (see also Problem 4.1):

$$R_{\mathrm{reg}}[f] := R_{\mathrm{emp}}[f] + \lambda\, \Omega[f]. \qquad (4.1)$$
Here λ > 0 is the so-called regularization parameter, which specifies the trade-off between minimization of R_emp[f] and the smoothness or simplicity enforced by a small Ω[f]. Usually one chooses Ω[f] to be convex, since this ensures that there exists only one global minimum, provided R_emp[f] is also convex (see Lemma 6.3 and Theorem 6.5). Maximizing the margin of classification in feature space by using the regularization term ½‖w‖², and thus minimizing

$$R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \frac{\lambda}{2}\, \|\mathbf{w}\|^2, \qquad (4.2)$$
Quadratic Regularizer
Regularized Risk in RKHS
is the common choice in SV classification [573, 62]. In regression, the geometrical interpretation of minimizing ½‖w‖² is to find the flattest function with sufficient approximation quality. Unless stated otherwise, we will limit ourselves to this type of regularizer in the present chapter. Other methods, e.g., minimizing the ℓ_p norm (where ‖x‖_p^p = Σᵢ |xᵢ|^p) of the expansion coefficients for w, will be discussed in Section 4.9. As described in Section 2.2.3, we can equivalently think of the feature space as a reproducing kernel Hilbert space. It is often useful, and indeed one of the central themes of this chapter, to rewrite the risk functional (4.2) in terms of the RKHS representation of the feature space. In this case, we equivalently minimize

$$R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \frac{\lambda}{2}\, \|f\|_{\mathcal H}^2 \qquad (4.3)$$
over the whole space ℋ. The next section studies the properties of minimizers of (4.3) and of similar regularizers that depend on ‖f‖_ℋ.
4.2 The Representer Theorem
History of the Representer Theorem
The explicit form of a minimizer of R_reg[f] is given by the celebrated representer theorem of Kimeldorf and Wahba [296], which plays a central role in solving practical problems of statistical estimation. It was first proven in the context of squared loss functions, and later extended to general pointwise loss functions [115]. For a machine learning point of view on the representer theorem, and variational proofs, see [205, 512]. The linear case has also been dealt with in [300]. We present a new and slightly more general version of the theorem with a simple proof [473]. As above, ℋ is the RKHS associated with the kernel k.
Theorem 4.2 (Representer Theorem) Denote by Ω : [0, ∞) → ℝ a strictly monotonically increasing function, by X a set, and by c : (X × ℝ²)ᵐ → ℝ ∪ {∞} an arbitrary loss function. Then each minimizer f ∈ ℋ of the regularized risk

$$c\big((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\big) + \Omega\big(\|f\|_{\mathcal H}\big) \qquad (4.4)$$
admits a representation of the form

$$f(x) = \sum_{i=1}^{m} \alpha_i\, k(x_i, x). \qquad (4.5)$$
Note that this setting is slightly more general than Definition 3.1, since it allows coupling between the samples (xᵢ, yᵢ). Before we proceed with the actual proof, let us make a few remarks. The original form, with pointwise mean squared loss

$$c\big((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\big) = \frac{1}{m}\sum_{i=1}^m \big(y_i - f(x_i)\big)^2,$$
Requirements on Ω[f]
Significance
Sparsity and Loss Function
or hard constraints (i.e., hard limits on the maximally allowed error, incorporated formally by using a loss function that takes the value ∞), and Ω(‖f‖) = (λ/2)‖f‖²_ℋ (λ > 0), is due to Kimeldorf and Wahba [296]. Monotonicity of Ω is necessary to ensure that the theorem holds. It does not prevent the regularized risk functional (4.4) from having multiple local minima; to ensure a single minimum, we would need to require convexity. If we discard the strictness of the monotonicity, it no longer follows that each minimizer of the regularized risk admits an expansion (4.5); it still follows, however, that there is always another solution that is as good, and that does admit the expansion. Note that the freedom to use regularizers other than Ω(‖f‖) = (λ/2)‖f‖²_ℋ allows us, in principle, to design algorithms more closely aligned with the recommendations given by bounds derived from statistical learning theory, as described below (cf. Problem 5.7). The significance of the Representer Theorem is that, although we might be trying to solve an optimization problem in an infinite-dimensional space ℋ, containing linear combinations of kernels centered on arbitrary points of X, the solution lies in the span of m particular kernels: those centered on the training points. In the Support Vector community, (4.5) is called the Support Vector expansion. For suitable choices of loss functions, it has empirically been found that many of the αᵢ often equal 0 (see Problem 4.6 for more detail on the connection between sparsity and loss functions).

Proof For convenience, we will work with Ω̃(‖f‖²) := Ω(‖f‖) rather than Ω(‖f‖). This is no restriction at all, since the quadratic function is strictly monotonic on [0, ∞), and therefore Ω̃ is strictly monotonic on [0, ∞) if and only if Ω also satisfies this requirement. We may decompose any f ∈ ℋ into a part contained in the span of the kernel
functions k(x₁, ·), ..., k(x_m, ·), and one in the orthogonal complement:

$$f = \sum_{i=1}^m \alpha_i\, k(x_i, \cdot) + f_\perp.$$
Here αᵢ ∈ ℝ and f_⊥ ∈ ℋ, with ⟨f_⊥, k(xᵢ, ·)⟩_ℋ = 0 for all i ∈ [m] := {1, ..., m}. By (2.34) we may write f(xⱼ) (for all j ∈ [m]) as

$$f(x_j) = \big\langle f,\, k(x_j, \cdot)\big\rangle_{\mathcal H} = \sum_{i=1}^m \alpha_i\, k(x_i, x_j) + \big\langle f_\perp,\, k(x_j, \cdot)\big\rangle_{\mathcal H} = \sum_{i=1}^m \alpha_i\, k(x_i, x_j).$$
Thus the first term of (4.4) is independent of f_⊥. Second, for all f_⊥,

$$\tilde\Omega\big(\|f\|_{\mathcal H}^2\big) = \tilde\Omega\Big(\big\|\textstyle\sum_i \alpha_i k(x_i, \cdot)\big\|_{\mathcal H}^2 + \|f_\perp\|_{\mathcal H}^2\Big) \geq \tilde\Omega\Big(\big\|\textstyle\sum_i \alpha_i k(x_i, \cdot)\big\|_{\mathcal H}^2\Big).$$
Thus for any fixed αᵢ ∈ ℝ, the risk functional (4.4) is minimized for f_⊥ = 0. Since this must also hold for the solution, the theorem holds. Let us state two immediate extensions of Theorem 4.2. The proof of the following theorem is left as an exercise (see Problem 4.3).
Prior Knowledge by Parametric Expansions
Theorem 4.3 (Semiparametric Representer Theorem) Suppose that, in addition to the assumptions of the previous theorem, we are given a set of M real-valued functions {ψ_p}_{p=1}^M : X → ℝ, with the property that the m × M matrix (ψ_p(xᵢ))_{ip} has rank M. Then any f̃ := f + h, with f ∈ ℋ and h ∈ span{ψ_p}, minimizing the regularized risk

$$c\big((x_1, y_1, \tilde f(x_1)), \ldots, (x_m, y_m, \tilde f(x_m))\big) + \Omega\big(\|f\|_{\mathcal H}\big) \qquad (4.10)$$
admits a representation of the form

$$\tilde f(x) = \sum_{i=1}^m \alpha_i\, k(x_i, x) + \sum_{p=1}^M \beta_p\, \psi_p(x),$$
with β_p ∈ ℝ for all p ∈ [M]. We will discuss applications of the semiparametric extension in Section 4.8.
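As a concrete numerical illustration of Theorem 4.2 (our sketch; the data are synthetic, and the closed-form coefficients follow the standard kernel ridge regression derivation for the squared loss, rather than an algorithm given in this chapter), the minimizer is represented entirely by the m coefficients αᵢ of (4.5):

```python
import numpy as np

def rbf(a, b, sigma=0.5):
    """Gaussian RBF kernel matrix between two 1-d sample vectors."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(3)
m = 50
x = np.sort(rng.uniform(-3.0, 3.0, m))
y = np.sinc(x) + 0.1 * rng.standard_normal(m)

# Squared loss + (lambda/2)||f||^2 leads to alpha = (K + lambda*m*I)^{-1} y,
# i.e., the solution lies in the span of kernels on the training points.
lam = 1e-2
K = rbf(x, x)
alpha = np.linalg.solve(K + lam * m * np.eye(m), y)

x_test = np.linspace(-3.0, 3.0, 7)
f_test = rbf(x_test, x) @ alpha  # the Support Vector expansion (4.5)
print(np.round(f_test, 3))
```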
Bias
Remark 4.4 (Biased Regularization) Another extension of the representer theorems can be obtained by including a term −⟨f₀, f⟩ in (4.4) or (4.10), where f₀ ∈ ℋ. In this case, if a solution to the minimization problem exists, it admits an expansion which differs from those described above in that it additionally contains a multiple of f₀. To see this, decompose the f_⊥(·) used in the proof of Theorem 4.2 into a part orthogonal to f₀ and the remainder. Biased regularization means that we do not assume the function f = 0 to be the simplest of all estimates. This is a convenient way of incorporating prior knowledge about the type of solution we expect from our estimation procedure. After this rather abstract and formal treatment of regularization, let us consider some practical cases in which the representer theorem can be applied. First consider
the problem of regression, where the solution is chosen to be an element of a Reproducing Kernel Hilbert Space.
Application of Semiparametric Expansion
Example 4.5 (Support Vector Regression) For Support Vector regression with the ε-insensitive loss (Section 1.6) we have

$$c\big((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\big) = \frac{1}{m}\sum_{i=1}^m \big|y_i - f(x_i)\big|_{\varepsilon},$$
and Ω(‖f‖) = (λ/2)‖f‖², where λ > 0 and ε > 0 are fixed parameters which determine the trade-off between regularization and fit to the training set. In addition, a single (M = 1) constant function ψ₁(x) = 1 is used as an offset that is not regularized by the algorithm. Section 4.8 and [507] contain details of how the case M > 1, in which more than one parametric function is used, can be dealt with algorithmically. Theorem 4.3 also applies in this case.

Example 4.6 (Support Vector Classification) Here, the targets satisfy yᵢ ∈ {±1}, and we use the soft margin loss function (3.3) to obtain

$$c\big((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\big) = \frac{1}{m}\sum_{i=1}^m \max\big(0,\, 1 - y_i f(x_i)\big).$$
The regularizer is Ω(‖f‖) = (λ/2)‖f‖², and ψ₁(x) = 1. For λ → 0, we recover the hard margin SVM, for which the minimizer must correctly classify each training point (xᵢ, yᵢ). Note that after training, the actual classifier will be sgn(f(·)).
Kernel Principal Component Analysis
Example 4.7 (Kernel PCA) Principal Component Analysis (see Chapter 14 for details) in a kernel feature space can be shown to correspond to the case of

$$c\big((x_1, y_1, f(x_1)), \ldots, (x_m, y_m, f(x_m))\big) = \begin{cases} 0 & \text{if } \frac{1}{m}\sum_{i=1}^m \Big(f(x_i) - \frac{1}{m}\sum_{j=1}^m f(x_j)\Big)^{2} = 1 \\ \infty & \text{otherwise,} \end{cases}$$
with Ω(·) an arbitrary function that is strictly monotonically increasing [480]. The constraint ensures that we only consider linear feature extraction functionals that produce outputs of unit empirical variance. In other words, the task is to find the simplest function with unit variance. Note that in this case of unsupervised learning, there are no labels yᵢ to consider.
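A minimal kernel PCA sketch in the spirit of Example 4.7 (ours; the eigenvalue formulation is derived properly in Chapter 14, and the normalization below is just one way of enforcing the unit variance constraint):

```python
import numpy as np

def rbf(X, sigma=1.0):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 2))
m = len(X)

K = rbf(X)
H = np.eye(m) - np.ones((m, m)) / m  # centering in feature space
Kc = H @ K @ H

eigval, eigvec = np.linalg.eigh(Kc)
# Leading feature extractor, scaled so that its outputs have unit empirical
# variance: f(x_i) = (Kc @ alpha)_i.
alpha = np.sqrt(m) / eigval[-1] * eigvec[:, -1]
scores = Kc @ alpha
print("empirical variance of extracted feature:", scores.var())  # ~1.0
```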
4.3 Regularization Operators
Curse of Dimensionality
The RKHS framework proved useful in obtaining the explicit functional form of minimizers of the regularized risk functional. It still does not explain the good performance of kernel algorithms, however. In particular, it seems counter-intuitive that estimators using very high dimensional feature spaces (easily with some 10¹⁰ features, as in optical character recognition with polynomial kernels, or even infinite dimensional spaces in the case of Gaussian RBF kernels) should exhibit good
Regularization Operator Viewpoint
performance. It seems as if kernel methods defy the curse of dimensionality [29], which requires the number of samples to increase with the dimensionality of the space in which estimation is performed. However, the distribution of capacity in these spaces is not isotropic (cf. Section 2.2.5). The basic idea of the viewpoint described in the present section is simple: rather than dealing with an abstract quantity such as an RKHS, which is defined by means of its corresponding kernel k, we take the converse approach of obtaining the kernel via the corresponding Hilbert space. Unless stated otherwise, we will use L₂(X) as the Hilbert space (cf. Section B.3) on which the regularization operators are defined. Note that L₂(X) is not the feature space ℋ. Recall that in Section 2.2.2 we showed that one way to think of the kernel mapping is as a map taking a point x ∈ X to a function k(x, ·) living in an RKHS. To do this, we constructed a dot product ⟨·, ·⟩_ℋ satisfying

$$\big\langle k(x, \cdot),\, k(x', \cdot)\big\rangle_{\mathcal H} = k(x, x').$$
Physically, however, it is still unclear what the dot product ⟨f, g⟩_ℋ actually does. Does it compute some kind of "overlap" of the functions, similar to the usual dot product between functions in L₂(X)? Recall that, assuming we can define an integral on X, the latter is (cf. (B.60))

$$\langle f, g\rangle_{L_2(X)} = \int_X f(x)\, g(x)\, dx. \qquad (4.16)$$
Main Idea
In the present section, we will show that whilst the dot product in the RKHS is not quite as simple as (4.16), we can at least write it as

$$\langle f, g\rangle_{\mathcal H} = \langle \Upsilon f,\, \Upsilon g\rangle \qquad (4.17)$$
in a suitable L₂ space of functions. This space contains transformed versions of the original functions, where the transformation Υ "extracts" those parts that should be affected by the regularization. This gives a much clearer physical understanding of the dot product in the RKHS (and thus of the similarity measure used by SVMs). It becomes particularly illuminating once one sees that for common kernels, the associated transformation Υ extracts properties like derivatives of functions. In other words, these kernels induce a form of regularization that penalizes non-smooth functions.

Definition 4.8 (Regularization Operator) A regularization operator Υ is defined as a linear map from a dot product space of functions 𝔉 := {f | f : X → ℝ} into a dot product space 𝔇. The regularization term Ω[f] takes the form

$$\Omega[f] := \frac{1}{2}\, \langle \Upsilon f,\, \Upsilon f\rangle_{\mathfrak D}. \qquad (4.18)$$
Positive Definite Operator
Without loss of generality, we may assume that Υ is positive definite. This can be seen as follows: provided that the adjoint of Υ exists, all that matters for the definition of Ω[f] is the positive definite operator Υ*Υ (since ⟨Υf, Υf⟩ = ⟨f, Υ*Υf⟩). Hence we may always define a positive definite operator Υ̃ := (Υ*Υ)^{1/2} (cf. Sec-
tion B.2.2), which has the same regularization properties as Υ. Next, we formally state the equivalence between the RKHS and the regularization operator view.

Theorem 4.9 (RKHS and Regularization Operators) For every RKHS ℋ with reproducing kernel k there exists a linear operator Υ : ℋ → 𝔇 such that for all f ∈ ℋ

$$f(x) = \big\langle \Upsilon k(x, \cdot),\, \Upsilon f\big\rangle_{\mathfrak D}, \qquad (4.19)$$
and in particular,

$$k(x, x') = \big\langle \Upsilon k(x, \cdot),\, \Upsilon k(x', \cdot)\big\rangle_{\mathfrak D}. \qquad (4.20)$$
Matching RKHS
Likewise, for every positive definite linear self-adjoint operator Υ̂ : 𝔉 → 𝔉 for which a Green's function exists, there exists a corresponding RKHS ℋ with reproducing kernel k, a dot product space 𝔇, and an operator Υ : 𝔉 → 𝔇 such that (4.19) and (4.20) are satisfied. Equation (4.20) is useful for analyzing smoothness properties of kernels, in particular if we pick 𝔇 to be L₂(X). Here we obtain an explicit form of the dot product induced by the RKHS, which helps us understand why kernel methods work. From Section 2.2.4 we can see that minimization of ‖w‖² is equivalent to minimization of Ω[f] (4.18), due to the feature map Φ(x) := k(x, ·).

Proof We prove the first part by explicitly constructing an operator that takes care of the mapping. One can see immediately that Υ = 1 and 𝔇 = ℋ satisfy all requirements.¹ For the converse statement, we have to obtain k, 𝔇, and Υ from Υ̂, and show that k is, in fact, the kernel of an RKHS (note that this does not imply 𝔉 = ℋ, since 𝔉 may be equipped with a different dot product). Denote by G_x the Green's function of Υ̂, which satisfies

$$\big\langle G_x,\, \hat\Upsilon f\big\rangle = \big\langle \Upsilon G_x,\, \Upsilon f\big\rangle = f(x)$$
for every f ∈ 𝔉. The second equality follows from the factorization Υ̂ = ΥΥ with Υ := Υ̂^{1/2}. It implies that G_x satisfies the reproducing property (4.19). Furthermore, G_x(x′) is symmetric in (x, x′), since

$$G_x(x') = \big\langle \Upsilon G_x,\, \Upsilon G_{x'}\big\rangle = G_{x'}(x).$$
Finally, to show that G_x(x′) gives rise to a valid dot product (i.e., is positive definite), we use

$$\sum_{i,j} \alpha_i\, \alpha_j\, G_{x_i}(x_j) = \Big\|\Upsilon \sum_i \alpha_i\, G_{x_i}\Big\|^2 \geq 0.$$
Hence k(x, x′) := G_x(x′) is a positive definite kernel with feature map x ↦ ΥG_x(·).
1. Υ = 1 is not the most useful operator. Typically we will seek an operator Υ corresponding to a specific dot product space 𝔇. Note that the 𝔇 and Υ associated with a given Υ̂ are not unique.
The corresponding RKHS is the closure of the set {f ∈ Υ*Υ𝔉 : ‖Υf‖² < ∞}.
Kernel Function = Regularization Operator
This means that 𝔇 is an RKHS with inner product ⟨Υ·, Υ·⟩.

Proposition 4.10 Denote by {(λₙ, ψₙ)}ₙ an orthonormal eigensystem of Υ*Υ. If k is defined via this eigensystem as

$$k(x, x') := \sum_{n} d_n\, \frac{\psi_n(x)\, \psi_n(x')}{\lambda_n}, \qquad (4.21)$$

where dₙ ∈ {0, 1} for all n and Σₙ dₙ/λₙ is convergent, then k satisfies (4.20). Moreover, the corresponding RKHS is given by span{ψᵢ | dᵢ = 1 and i ∈ ℕ}.

Proof
We evaluate (4.21) and use the orthonormality of the system (λₙ, ψₙ).
The statement about the span follows immediately from the construction of k.
Null Space of Υ*Υ
The summation coefficients may be rearranged, since the eigenfunctions are orthonormal and the series Σₙ dₙ/λₙ converges absolutely. Consequently, a large class of kernels can be associated with a given regularization operator (and vice versa), thereby restricting us to a subspace of the eigenvector decomposition of Υ*Υ. In other words, there exists a one-to-one correspondence between kernels and regularization operators only on the image of ℋ under the corresponding integral operator.²

2. Provided that no f ∈ 𝔇 contains directions of the null space of the regularization operator Υ*Υ, and that the kernel functions k span the whole space 𝔇. If this is not the case, simply define the space to be the span of the k(x, ·).
4.4 Translation Invariant Kernels

An important class of kernels k(x, x′), such as Gaussian RBF kernels or Laplacian kernels, depends only on the difference between x and x′. For the sake of simplicity, and with slight abuse of notation, we will use the shorthand

$$k(x, x') = k(x - x'), \qquad (4.25)$$
or simply k(x). Since such k are independent of the absolute position of x and depend only on the difference x − x′, we will refer to them as translation invariant kernels. In the following we show that for kernels defined via (4.25) there exists a simple recipe for finding a regularization operator Υ*Υ corresponding to k, and vice versa. In particular, we will show that the Fourier transform of k(x) provides the representation of the regularization operator in the frequency domain.

Fourier Transformation

For this purpose we need a few definitions. For the sake of simplicity we assume X ⊆ ℝᴺ. In this case the Fourier transform of f is given by

$$F[f](\omega) := (2\pi)^{-N/2} \int_X f(x)\, e^{-i\langle \omega, x\rangle}\, dx.$$
Note that here i is the imaginary unit and that, in general, F[f](ω) ∈ ℂ is a complex number. The inverse Fourier transform is then given by

$$f(x) = (2\pi)^{-N/2} \int F[f](\omega)\, e^{i\langle \omega, x\rangle}\, d\omega.$$
Regularization Operators in Fourier Domain

We now specifically consider regularization operators Υ that may be written as multiplications in Fourier space (i.e., Υ*Υ is diagonalized in the Fourier basis).
Denote by υ(ω) a nonnegative symmetric function defined on X, i.e., υ(−ω) = υ(ω) ≥ 0, which converges to 0 for ‖ω‖ → ∞. Moreover, denote by Ω the support of υ(ω), and by x̄ the complex conjugate of x. Now we introduce a regularization operator via

$$\langle \Upsilon f,\, \Upsilon g\rangle := (2\pi)^{-N/2} \int_{\Omega} \frac{\overline{F[f](\omega)}\; F[g](\omega)}{\upsilon(\omega)}\, d\omega. \qquad (4.28)$$
The goal of regularized risk minimization in an RKHS is to find a function which minimizes R_reg[f] while keeping ⟨Υf, Υf⟩ reasonably small. In the context of (4.28) this means the following: small nonzero values of υ(ω) correspond to a strong attenuation of the corresponding frequencies. Hence small values of υ(ω) for large ω are desirable, since high frequency components of F[f] correspond to rapid changes in f. It follows that υ(ω) describes the filter properties of Υ*Υ; note that no attenuation takes place for υ(ω) = 0, since these frequencies have been excluded from the integration domain Ω. Our next step is to construct kernels k corresponding to Υ as defined in (4.28).

Green's Functions and Fourier Transformations

We show that

$$k(x, x') = (2\pi)^{-N/2} \int_{\Omega} e^{i\langle \omega,\, x - x'\rangle}\; \upsilon(\omega)\, d\omega \qquad (4.29)$$
is a Green's function for Υ*Υ, and that it can be used as a kernel. For a function f whose Fourier transform has support contained in Ω, we have

$$\big\langle \Upsilon k(x, \cdot),\, \Upsilon f\big\rangle = (2\pi)^{-N/2} \int_{\Omega} \frac{\overline{F[k(x, \cdot)](\omega)}\; F[f](\omega)}{\upsilon(\omega)}\, d\omega = (2\pi)^{-N/2} \int_{\Omega} F[f](\omega)\, e^{i\langle \omega, x\rangle}\, d\omega = f(x),$$

since F[k(x, ·)](ω) = υ(ω) e^{−i⟨ω, x⟩} on Ω.
From Theorem 4.9 it now follows that k is a Green's function, and that it can be used as an RKHS kernel. Eq. (4.29) provides us with an efficient tool for analyzing SV kernels and the types of capacity control they exhibit: we may also read (4.29) backwards and, in doing so, find the regularization operator for a given kernel, simply by applying the Fourier transform to k(x). As expected, kernels with high frequency components will lead to less smooth estimates. Note that (4.29) is a special case of Bochner's theorem [60], which states that the Fourier transform of a positive measure constitutes a positive definite kernel. In the remainder of this section we apply our new insight to a wide range of popular kernels, such as Bₙ-splines, Gaussian kernels, Laplacian kernels, and periodic kernels. A discussion of the multidimensional case, which requires additional mathematical techniques, is left to Section 4.5.
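Reading (4.29) backwards is straightforward to do numerically. The following sketch (ours; grid sizes are arbitrary) recovers the spectrum υ(ω) for the Gaussian and Laplacian kernels by a discrete Fourier transform and compares how strongly each attenuates a high frequency:

```python
import numpy as np

x = np.linspace(-40.0, 40.0, 2 ** 14, endpoint=False)
dx = x[1] - x[0]
omega = 2.0 * np.pi * np.fft.fftshift(np.fft.fftfreq(len(x), d=dx))

def spectrum(k_vals):
    """Approximate continuous Fourier transform of a sampled kernel k(x)."""
    return np.abs(np.fft.fftshift(np.fft.fft(np.fft.ifftshift(k_vals)))) * dx

v_gauss = spectrum(np.exp(-x ** 2 / 2.0))  # Gaussian kernel, sigma = 1
v_laplace = spectrum(np.exp(-np.abs(x)))   # Laplacian kernel

i = int(np.searchsorted(omega, 3.0))       # attenuation at omega = 3
print("Gaussian  v(3)/v(0):", v_gauss[i] / v_gauss.max())      # ~exp(-4.5)
print("Laplacian v(3)/v(0):", v_laplace[i] / v_laplace.max())  # ~1/10
# The Gaussian spectrum decays far faster: high frequencies are penalized
# much more strongly, yielding smoother estimates.
```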
4.4.1 Bₙ-Splines
As was briefly mentioned in Section 2.3, splines are an important tool in interpolation and function estimation. They excel at problems of low dimensional interpolation. Computational problems become increasingly acute, however, as the dimensionality of the patterns (i.e., of x) increases; yet there exists a way to circumvent these difficulties. In [501, 572], a method is proposed for using Bₙ-splines (see Figure 4.1) as building blocks for kernels, i.e.,

$$k(x, x') = B_n(x - x').$$

Splines in ℝ

We start with X = ℝ (higher dimensional cases can also be obtained, for instance by taking products over the individual dimensions). Recall that Bₙ-splines are defined as n + 1 convolutions³ of the centered unit interval (cf. (2.71) and [552]):

$$B_n := \bigotimes^{n+1} \mathbf{1}_{[-\frac{1}{2},\, \frac{1}{2}]}.$$
Given this kernel, we now use (4.29) in order to obtain the corresponding Fourier representation. In particular, we must compute the Fourier transform of Bn(x). The following theorem allows us to do this conveniently for functions represented by convolutions.
Convolutions and Products
Theorem 4.11 (Fourier-Plancherel, e.g., [306, 112]) Denote by f, g two functions in L₂(X), by F[f], F[g] their corresponding Fourier transforms, and by ⊗ the convolution operation. Then the following identities hold:

$$F[f \otimes g] = F[f] \cdot F[g] \qquad \text{and} \qquad F[f \cdot g] = F[f] \otimes F[g].$$
In other words, convolutions in the original space become products in the Fourier domain, and vice versa. Hence we may jump from one representation to the other, depending on which space is more convenient for our calculations. Repeated application of Theorem 4.11 shows that in the case of Bₙ-splines, the Fourier representation is given by the (n+1)-st power of the Fourier transform of B₀. Since the Fourier transform of Bₙ equals υ(ω), we obtain (up to a multiplicative constant)

$$\upsilon(\omega) = F[B_n](\omega) \propto \mathrm{sinc}^{\,n+1}\!\left(\frac{\omega}{2}\right), \qquad \text{where } \mathrm{sinc}(u) := \frac{\sin u}{u}. \qquad (4.36)$$
3. A convolution f ⊗ g of two functions f, g : X → ℝ is defined as

$$(f \otimes g)(x) := (2\pi)^{-\frac{N}{2}} \int f(x')\, g(x - x')\, dx'.$$

The normalization factor of (2π)^{−N/2} serves to make the convolution compatible with the Fourier transform; we will need this property in Theorem 4.11. Note that f ⊗ g = g ⊗ f, as can be seen by exchanging variables.
Figure 4.1 From left to right: Bₙ-splines of order 0 to 3 (top row) and their Fourier transforms (bottom row). The length of the support of Bₙ is n + 1, and the degree of continuous differentiability increases with n − 1. Note that the higher the degree of Bₙ, the more peaked its Fourier transform (4.36) becomes. This is due to the increasing support of Bₙ. The frequency axis labels of the Fourier transform are multiples of 2π.
Only B₂ₙ₊₁-Splines Admissible
This illustrates why only Bₙ-splines of odd order are positive definite kernels (cf. (2.71)):⁴ the even ones have negative components in their Fourier spectrum (which would result in an amplification of the corresponding frequencies). The zeros in F[k] stem from the fact that Bₙ has compact support, [−(n+1)/2, (n+1)/2]. See Figure 4.2 for details. By using this kernel, we trade reduced computational complexity in calculating f (we need only take points into account whose distance ‖xᵢ − xⱼ‖ is smaller than the support of Bₙ) for a potentially decreased performance of the regularization operator, since it completely removes (i.e., disregards) frequencies ω_p with F[k](ω_p) = 0. Moreover, as we shall see below, in comparison to other kernels, such as the Gaussian kernel, F[k](ω) decays rather slowly.
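This admissibility criterion is easy to check numerically (a sketch of ours based on (4.36)):

```python
import numpy as np

# Up to constants, F[B_n](omega) = sinc^{n+1}(omega/2) with sinc(u)=sin(u)/u.
omega = np.linspace(1e-6, 20.0, 1000)
sinc = np.sin(omega / 2.0) / (omega / 2.0)

for n in range(4):
    v = sinc ** (n + 1)
    print(f"B_{n}: min of F[k] on (0, 20] = {v.min():+.4f}")
# B_0 and B_2 dip below zero (not positive definite); B_1 and B_3 do not.
```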
4.4.2 Gaussian Kernels
Another class of kernels are the Gaussian radial basis function kernels (Figure 4.3). These are widely popular in neural networks and approximation theory [80, 203, 201, 420]. We already encountered k(x, x′) = exp(−‖x − x′‖²/(2σ²)) in (2.68); we now investigate the regularization and smoothness properties of these kernels. For a Fourier representation, we need only compute the Fourier transform of

4. Although both even and odd order Bₙ-splines converge to a Gaussian as n → ∞, due to the law of large numbers.
Figure 4.2 Left: B₃-spline kernel. Right: Fourier transform of k (in log-scale). Note the zeros and the rapid decay in the spectrum of B₃.
(2.68), which is given by

$$v(\omega) = F[k](\omega) \propto \exp\left(-\frac{\sigma^2 \omega^2}{2}\right). \tag{4.37}$$
Uncertainty Relation
Pseudo-Differential Operators
In other words, the smoother $k$ is in pattern space, the more peaked its Fourier transform becomes. In particular, the product between the width of $k$ and the width of its Fourier transform is constant.⁵ This phenomenon is also known as the uncertainty relation in physics and engineering. Equation (4.37) also means that the contribution of high frequency components in estimates is relatively small, since $v(\omega)$ decays extremely rapidly. It also helps explain why Gaussian kernels produce full rank kernel matrices (Theorem 2.18). We next determine an explicit representation of $\|\Upsilon f\|^2$ in terms of differential operators, rather than a pure Fourier space formalism. While this is not possible by using only "conventional" differential operators, we may achieve our goal by using pseudo-differential operators. Roughly speaking, a pseudo-differential operator differs from a differential operator in that it may contain an infinite sum of differential operators; the latter correspond to a Taylor expansion of the operator in the Fourier domain. There is an additional requirement that the arguments lie inside the radius of convergence, however. Following the exposition of Yuille and Grzywacz [612], one can see that
$$\|\Upsilon f\|^2 = \int \sum_{n=0}^{\infty} \frac{\sigma^{2n}}{n!\, 2^n} \left(\hat{O}^n f(x)\right)^2 dx, \tag{4.38}$$

with $\hat{O}^{2n} = \Delta^n$ and $\hat{O}^{2n+1} = \nabla \Delta^n$, $\Delta$ being the Laplacian and $\nabla$ the gradient operator, is equivalent to a regularization with $v(\omega)$ as in (4.37). The key observation in this context is that derivatives in $\mathcal{X}$ translate to multiplications in the frequency

5. The multidimensional case is completely analogous, since it can be decomposed into a product of one-dimensional Gaussians. See also Section 4.5 for more details.
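As a rough numerical illustration of the uncertainty relation (our own sketch, not part of the book; it assumes NumPy, and measures widths as standard deviations of the kernel and of the magnitude of its discrete Fourier transform):

```python
import numpy as np

x = np.linspace(-50, 50, 2**14)
dx = x[1] - x[0]
for sigma in (0.5, 1.0, 2.0):
    k = np.exp(-x**2 / (2 * sigma**2))                    # Gaussian kernel
    K = np.abs(np.fft.fftshift(np.fft.fft(k)))            # |F[k]|, also Gaussian
    omega = np.fft.fftshift(np.fft.fftfreq(x.size, d=dx)) * 2 * np.pi
    # standard deviations of k and |F[k]|, treated as unnormalized densities
    sx = np.sqrt(np.sum(x**2 * k) / np.sum(k))
    sw = np.sqrt(np.sum(omega**2 * K) / np.sum(K))
    print(f"sigma={sigma}: width(k) * width(F[k]) = {sx * sw:.3f}")  # ~1 for all
```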
Figure 4.3 Left: Gaussian kernel with standard deviation 0.5. Right: Fourier transform of the kernel.

Taylor Expansion in Differential Operators
domain, and vice versa.⁶ Therefore a Taylor expansion of $v(\omega)$ in $\omega$ can be rewritten as a Taylor expansion in $\mathcal{X}$ in terms of differential operators. See [612] and the references therein for more detail. On the practical side, training an SVM with Gaussian RBF kernels [482] corresponds to minimizing the specific loss function with a regularization operator of type (4.38). Recall that (4.38) causes all derivatives of $f$ to be penalized, to obtain a very smooth estimate. This also explains the good performance of SVMs in this case, since it is by no means obvious that choosing a flat function in some high dimensional space will correspond to a simple function in a low dimensional space (see Section 4.4.3 for a counterexample).
4.4.3 Dirichlet Kernels
Proposition 4.10 can also be used to generate practical kernels. In particular, [572] introduced a class of kernels based on Fourier expansions,

$$k(x, x') = \sum_{j=-n}^{n} e^{i j (x - x')} = \frac{\sin\frac{(2n+1)(x - x')}{2}}{\sin\frac{x - x'}{2}}. \tag{4.39}$$
As in Section 4.4.1, we consider $x \in \mathbb{R}$ to avoid tedious notation. By construction, this kernel corresponds to $v(\omega) = \sum_{i=-n}^{n} \delta(\omega - i)$, with $\delta$ being Dirac's delta function.
A regularization operator with these properties may not be desirable, however, as it only damps a finite number of frequencies (see Figure 4.4), and leaves all other frequencies unchanged, which can lead to overfitting (Figure 4.5).

6. Integrability considerations aside, one can see this by

$$\frac{d}{dx} f(x) = \frac{d}{dx} \int F[f](\omega) \exp(i\omega x)\, d\omega = \int i\omega\, F[f](\omega) \exp(i\omega x)\, d\omega.$$
Figure 4.4 Left: Dirichlet kernel of order 10. Note that this kernel is periodic. Right: Fourier transform of the kernel.
Figure 4.5 Left: regression with a Dirichlet kernel of order $N = 10$. One can clearly observe the overfitting (solid line: interpolation; '+': original data). Right: regression based on the same data with a Gaussian kernel of width $\sigma^2 = 1$ (dash-dotted line: interpolation; '+': original data).
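The overfitting in Figure 4.5 is easy to reproduce. The following hypothetical experiment (our own sketch in Python/NumPy; the noise level and regularization constant are arbitrary choices) fits noisy samples of $\sin(x)$ with kernel ridge regression, once with the Dirichlet kernel (4.39) and once with a Gaussian kernel; one would expect the Dirichlet fit to exhibit the larger test error:

```python
import numpy as np

def dirichlet(x, xp, n=10):
    # Dirichlet kernel of order n, cf. (4.39); the limit at x = x' is 2n + 1.
    d = x - xp
    den = np.sin(d / 2)
    small = np.abs(den) < 1e-12
    num = np.sin((2 * n + 1) * d / 2)
    return np.where(small, 2.0 * n + 1.0, num / np.where(small, 1.0, den))

def gaussian(x, xp, sigma2=1.0):
    return np.exp(-(x - xp) ** 2 / (2 * sigma2))

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 2 * np.pi, 25))
y = np.sin(x) + 0.2 * rng.standard_normal(x.size)   # noisy samples of sin
xt = np.linspace(0.0, 2 * np.pi, 500)               # dense test grid
for name, k in (("Dirichlet", dirichlet), ("Gaussian ", gaussian)):
    K = k(x[:, None], x[None, :])
    alpha = np.linalg.solve(K + 1e-2 * np.eye(x.size), y)
    ft = k(xt[:, None], x[None, :]) @ alpha
    print(name, "test RMSE:", np.sqrt(np.mean((ft - np.sin(xt)) ** 2)))
```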
Types of Invariances
In other words, this kernel only describes band-limited periodic functions, and makes no distinction between the different components of the frequency spectrum. Section 4.4.4 will present an example of a periodic kernel with a more suitable distribution of capacity over the frequency spectrum. In some cases, it might be useful to approximate periodic functions, for instance functions defined on a circle. This leads to the second possible type of translation invariant kernel function, namely functions defined on factor spaces.⁷ It is not reasonable to define translation invariant kernels on a bounded interval, since the data will lie beyond the boundaries of the specified interval when translated by a large amount. Therefore unbounded intervals and factor spaces are the only possible domains.

7. Factor spaces are vector spaces $\mathcal{X}$ with the additional property that for at least one nonzero element $x_0 \in \mathcal{X}$, we have $x + x_0 = x$ for all $x \in \mathcal{X}$. For instance, the modulo operation on $\mathbb{Z}$ forms such a space. We denote this space by $\mathbb{Z}/x_0$.
We assume a period of $2\pi$ without loss of generality, and thus consider translation invariance on $\mathbb{R}/2\pi$. The next section shows how this setting affects the operator defined in Section 4.4.2.

4.4.4 Periodic Kernels

Regularization Operator on $[0, 2\pi]$

One way of dealing with periodic invariances is to begin with a translation invariant regularization operator, defined similarly to (4.38), albeit on $L_2([0, 2\pi])$ (where the points $0$ and $2\pi$ are identified) rather than on $L_2(\mathbb{R})$, and to find a matching kernel function. We start with the regularization operator

$$\|\Upsilon f\|^2 = \int_0^{2\pi} \sum_{n=0}^{\infty} \frac{\sigma^{2n}}{n!\, 2^n} \left(\hat{O}^n f(x)\right)^2 dx, \tag{4.40}$$
Periodic Kernels via Fourier Coefficients
Periodic Kernels via Translation
with $\hat{O}$ defined as in Section 4.4.2. For the sake of simplicity, assume $\dim \mathcal{X} = 1$; a generalization to multidimensional kernels is straightforward. To obtain the eigensystem of $\Upsilon$, we start with the Fourier basis, which is dense in $L_2([0, 2\pi])$ [69], the space of functions we are interested in. One can check, by substitution into (4.40), that the Fourier basis $\left\{\frac{1}{\sqrt{2\pi}}, \sin(nx), \cos(nx) : n \in \mathbb{N}\right\}$ is an eigenvector decomposition of the operator defined in (4.40), with eigenvalues $\exp\left(\frac{\sigma^2 n^2}{2}\right)$. Since the Fourier basis is dense in $L_2([0, 2\pi])$, we have thus identified all eigenfunctions of $\Upsilon$. Next we apply Proposition 4.10, taking into account all eigenfunctions except the constant function with $n = 0$. This yields the following kernel,

$$k(x, x') = \sum_{n=1}^{\infty} e^{-\frac{\sigma^2 n^2}{2}} \cos\left(n(x - x')\right). \tag{4.41}$$
For practical purposes, one may truncate the expansion after a finite number of terms. Since the expansion coefficients decay rapidly, this approximation is very good. If necessary, $k$ can be rescaled to have a range of exactly $[0, 1]$. While this is a convenient way of building kernels if the Fourier expansion is known, we would also like to be able to render arbitrary translation invariant kernels on $\mathbb{R}$ periodic. The method is rather straightforward, and works as follows. Given any translation invariant kernel $k$, we obtain $k_p$ by

$$k_p(x, x') := \sum_{j \in \mathbb{Z}} k\left((x - x') + 2\pi j\right). \tag{4.42}$$
Again, we can approximate (4.42) by truncating the sum after a finite number of terms. The question is whether the definition of $k_p$ leads to a positive definite kernel at all, and if so, which regularization properties it exhibits.
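A truncated version of (4.42) is straightforward to implement. The sketch below (ours, assuming NumPy; the truncation at 25 shifts is an arbitrary choice) periodizes the Laplacian kernel, which is also the subject of Problem 4.10:

```python
import numpy as np

def periodize(k, period=2 * np.pi, terms=25):
    # Render a translation invariant kernel k(d) periodic as in (4.42),
    # truncating the sum over shifts after a finite number of terms.
    def kp(d):
        return sum(k(d + period * j) for j in np.arange(-terms, terms + 1))
    return kp

k = lambda d: np.exp(-np.abs(d))     # Laplacian kernel (cf. Problem 4.10)
kp = periodize(k)
d = np.linspace(0, 2 * np.pi, 5)
print(kp(d))                                 # largest at the period boundaries
print(np.allclose(kp(0.0), kp(2 * np.pi)))   # True up to truncation error
```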
Proposition 4.12 (Spectrum of Periodized Kernels) Denote by $k$ a translation invariant kernel in $L_2(\mathcal{X})$, by $F[k]$ its Fourier transform, and by $k_p$ its periodization according to (4.42). Then $k_p$ can be expanded into the series

$$k_p(x, x') \propto F[k](0) + 2 \sum_{j=1}^{\infty} F[k](j) \cos\left(j(x - x')\right). \tag{4.43}$$

Figure 4.6 Left: periodic Gaussian kernel for several values of $\sigma$ (normalized to 1 as its maximum and 0 as its minimum value). Peaked functions correspond to small $\sigma$. Right: Fourier coefficients of the kernel for $\sigma^2 = 0.1$.
Proof The proof makes use of the fact that for Lebesgue integrable functions $k$, the integral over $\mathcal{X}$ can be split up into a sum over segments of size $2\pi$. Specifically, for $j \in \mathbb{Z}$ we obtain

$$F[k](j) = (2\pi)^{-\frac{1}{2}} \int_{\mathbb{R}} k(x)\, e^{-ijx}\, dx = (2\pi)^{-\frac{1}{2}} \sum_{n \in \mathbb{Z}} \int_0^{2\pi} k(x + 2\pi n)\, e^{-ijx}\, dx = (2\pi)^{-\frac{1}{2}} \int_0^{2\pi} k_p(x)\, e^{-ijx}\, dx.$$
The latter, however, is the Fourier transform of $k_p$ over the interval $[0, 2\pi]$. Hence we have $F[k](j) = F[k_p](j)$ for $j \in \mathbb{Z}$, where $F[k_p](j)$ denotes the Fourier transform over the compact set $[0, 2\pi]$. Now we may use the inverse Fourier transformation on $[0, 2\pi]$ to obtain a decomposition of $k_p$ into a trigonometric series. Due to the symmetry of $k$, the imaginary part of $F[k]$ vanishes, and thus all contributions of $\sin jx$ cancel out. Moreover, we obtain (4.43) since $\cos x$ is a symmetric function.

In some cases, the full summation of $k_p$ can be computed in closed form; see Problem 4.10 for an application of this reasoning to Laplacian kernels. In the context of periodic functions, the difference between this kernel and the Dirichlet kernel of Section 4.4.3 is that the latter does not distinguish between the different frequency components $\omega \in \{-n, \ldots, n\}$.
4.4.5 Practical Implications
We are now able to draw some useful conclusions regarding the practical application of translation invariant kernels. Let us begin with two extreme situations.

• Suppose that the shape of the power spectrum $\mathrm{Pow}[f](\omega)$ of the function we would like to estimate is known beforehand. In this case, we should choose $k$ such that $F[k]$ matches the expected value of the power spectrum of $f$. The latter is given by the squared absolute value of the Fourier transform of $f$, i.e.,

$$\mathrm{Pow}[f](\omega) = |F[f](\omega)|^2.$$
Matched Filters
One may check, using the Fourier-Plancherel equality (Theorem 4.11), that $\mathrm{Pow}[f]$ equals the Fourier transform of the autocorrelation function of $f$, given by $f(x) \otimes f(-x)$; a numerical check of this relation is sketched after this list. In signal processing this is commonly known as the problem of "matched filters" [581]. It has been shown that the optimal filter for the reconstruction of signals corrupted with white noise has to match the frequency distribution of the signal which is to be reconstructed. (White noise has a uniform distribution over the frequency band occupied by the useful signal.)

• If we know very little about the given data, however, it is reasonable to make a general smoothness assumption. Thus a Gaussian kernel as in Section 4.4.2 or 4.4.4 is recommended. If computing time is important, we might instead consider kernels with compact support, such as the $B_n$-spline kernels of Section 4.4.1. This choice will cause many matrix elements $k_{ij} = k(x_i - x_j)$ to vanish.
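The Wiener-Khinchin relation underlying the matched filter argument — the power spectrum equals the Fourier transform of the autocorrelation — can be verified directly. This is our own minimal check, assuming NumPy and using circular (periodic) correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 128
f = rng.standard_normal(N)
# circular autocorrelation r[k] = sum_n f[n] * f[(n + k) mod N]
autocorr = np.array([f @ np.roll(f, -k) for k in range(N)])
power = np.abs(np.fft.fft(f)) ** 2                      # Pow[f] = |F[f]|^2
print(np.allclose(np.fft.fft(autocorr).real, power))    # True
```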
Prior Knowledge
The usual scenario will be in between these two extremes, and we will have some limited prior knowledge available, which should be used in the choice of kernel. The goal of the present reasoning is to give a guide to the selection of kernels, through a deeper understanding of their regularization properties. For more information on using prior knowledge for choosing kernels, e.g., by explicit construction of kernels exhibiting only a limited amount of interaction, see Chapter 13. Finally, note that the choice of the kernel width may be more important than the actual functional form of the kernel. For instance, there may be little difference in the relevant filter properties close to $\omega = 0$ between a B-spline and a Gaussian kernel (cf. Figure 4.7). This heuristic holds if we are interested only in uniform convergence results of a certain degree of precision, in which case only a small part of the power spectrum of $k$ is relevant (see [604, 606] and also Section 12.4.1).

4.5 Translation Invariant Kernels in Higher Dimensions
Product Kernels
Things get more complicated in higher dimensions. There are basically two ways to construct kernels $\mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ with $N > 1$, if no particular assumptions on the data are made. First, we could construct kernels $k : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$ as products of one-dimensional kernels,

$$k(\mathbf{x}, \mathbf{x}') = \prod_{i=1}^{N} k_i(x_i - x'_i). \tag{4.48}$$
Figure 4.7 Comparison of the regularization properties, in the low frequency domain, of the $B_3$-spline kernel and the Gaussian kernel ($\sigma^2 = 20$). Down to an attenuation factor of $5 \cdot 10^{-3}$, i.e., in the interval $[-4\pi, 4\pi]$, both types of kernels exhibit somewhat similar filter characteristics.
Figure 4.8 Laplacian product kernel in $\mathbb{R}$ and $\mathbb{R}^2$. Note the preferred directions in the two dimensional case.
Note that we have deviated from our usual notation: in the present section, we use bold face letters to denote elements of the input space. This will help to simplify the notation, using $\mathbf{x} = (x_1, \ldots, x_N)$ and $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_N)$ below. The choice (4.48) usually leads to preferred directions in input space (see Figure 4.8), since the kernels are not generally rotation invariant, the exception being Gaussian kernels. This can also be seen from the corresponding regularization operator: since $k$ factorizes, we can apply the Fourier transform to $k$ on a per-dimension basis, to obtain

$$v(\boldsymbol{\omega}) = \prod_{i=1}^{N} v_i(\omega_i).$$
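The anisotropy of product kernels is easy to see numerically. In the hypothetical snippet below (ours, assuming NumPy), a product of one-dimensional Laplacian kernels assigns different similarities to two test points at equal Euclidean distance, depending on the direction:

```python
import numpy as np

def laplacian_product_kernel(x, xp):
    # Product of one-dimensional Laplacian kernels, as in (4.48); note the
    # preferred axis-aligned directions, in contrast to exp(-||x - x'||).
    return np.exp(-np.sum(np.abs(x - xp), axis=-1))

x = np.array([0.0, 0.0])
# both test points lie at Euclidean distance 1 from x, yet the kernel value
# is larger along a coordinate axis than along the diagonal:
print(laplacian_product_kernel(x, np.array([1.0, 0.0])))               # exp(-1)
print(laplacian_product_kernel(x, np.array([1.0, 1.0]) / np.sqrt(2)))  # exp(-sqrt(2))
```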
Kernels on Distance Matrices
The second approach is to assume $k(\mathbf{x} - \mathbf{x}') = k(\|\mathbf{x} - \mathbf{x}'\|_{\ell_2})$. This leads to kernels which are both translation invariant and rotation invariant. It is quite straightforward to generalize the exposition to the rotation asymmetric case, and to norms other than the $\ell_2$ norm. We now recall some basic results which will be useful later.
4.5.1 Basic Tools

Fourier Transform
The $N$-dimensional Fourier transform is defined as

$$F[f](\boldsymbol{\omega}) := (2\pi)^{-\frac{N}{2}} \int_{\mathbb{R}^N} f(\mathbf{x})\, e^{-i\langle \boldsymbol{\omega}, \mathbf{x} \rangle}\, d\mathbf{x}. \tag{4.50}$$
Its inverse transform is given by

$$f(\mathbf{x}) = (2\pi)^{-\frac{N}{2}} \int_{\mathbb{R}^N} F[f](\boldsymbol{\omega})\, e^{i\langle \boldsymbol{\omega}, \mathbf{x} \rangle}\, d\boldsymbol{\omega}. \tag{4.51}$$
For radially symmetric functions, i.e., $f(\mathbf{x}) = f(\|\mathbf{x}\|)$, we can explicitly carry out the integration on the sphere to obtain a Fourier transform which is also radially symmetric (cf. [520, 373]); up to a constant factor,

$$F[f](\omega) = \omega^{-\nu}\, H_\nu\!\left[f(r)\, r^{\nu}\right](\omega), \tag{4.52}$$
Hankel Transform

Bessel Function
where $\nu := \frac{N}{2} - 1$, and $H_\nu$ is the Hankel transform over the positive real line (we use the shorthand $\omega = \|\boldsymbol{\omega}\|$). The latter is defined as

$$H_\nu[f](\omega) := \int_0^{\infty} f(r)\, J_\nu(\omega r)\, r\, dr.$$
Here $J_\nu$ is the Bessel function of the first kind, which is given by

$$J_\nu(x) = \sum_{j=0}^{\infty} \frac{(-1)^j}{j!\, \Gamma(j + \nu + 1)} \left(\frac{x}{2}\right)^{2j + \nu},$$
and $\Gamma$ is the Gamma function, satisfying $\Gamma(n + 1) = n!$ for $n \in \mathbb{N}$. Note that $H_\nu = H_\nu^{-1}$, i.e., $f = H_\nu[H_\nu[f]]$ (in $L_2$), due to the Hankel inversion theorem [520] (see also Problem 4.11); this is just another way of writing the inverse Fourier transform in the rotation symmetric case. Based on the results above, we can now use (4.29) to compute the Green's functions in $\mathbb{R}^N$ directly from the regularization operators given in Fourier space.
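The radially symmetric Fourier transform (4.52) can be evaluated numerically via the Hankel transform. The following sketch (ours, assuming NumPy and SciPy; the normalization is chosen, up to constants, so that a unit-width Gaussian maps onto a unit-width Gaussian, in line with Example 4.13 below):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import jv

def radial_fourier(f, omega, N):
    # Fourier transform of a radially symmetric f(r) in R^N, via the Hankel
    # transform as in (4.52), up to a dimension-dependent constant.
    nu = N / 2.0 - 1.0
    integrand = lambda r: f(r) * jv(nu, omega * r) * r ** (nu + 1)
    val, _ = quad(integrand, 0.0, np.inf, limit=200)
    return omega ** (-nu) * val

for N in (1, 2, 3, 5):
    F = radial_fourier(lambda r: np.exp(-r**2 / 2), omega=1.5, N=N)
    print(N, F, np.exp(-1.5**2 / 2))   # Gaussian -> Gaussian in any dimension
```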
4.5.2 Regularization Properties of Kernels in $\mathbb{R}^N$

We now give some examples of kernels typically used in SVMs, this time in $\mathbb{R}^N$. We must first compute the Fourier/Hankel transforms of the kernels.
Gaussian → Gaussian
Example 4.13 (Gaussian RBFs) For Gaussian RBFs in $N$ dimensions, $k(r) = \sigma^{-N} e^{-\frac{r^2}{2\sigma^2}}$, and correspondingly (as before, we use the shorthand $\omega := \|\boldsymbol{\omega}\|$),

$$F[k](\omega) = e^{-\frac{\sigma^2 \omega^2}{2}}.$$
In other words, the Fourier transform of a Gaussian is also a Gaussian, in higher dimensions.
Example 4.14 (Exponential RBFs) In the case of $k(r) = e^{-\alpha r}$,

$$F[k](\omega) \propto \frac{\alpha}{\left(\alpha^2 + \omega^2\right)^{\frac{N+1}{2}}}.$$
Exponential → Inverse Polynomial

Inverse Polynomial → Exponential
For $N = 1$ we recover the damped harmonic oscillator in the frequency domain. In general, a decay in the Fourier spectrum approximately proportional to $\omega^{-(N+1)}$ can be observed. Moreover, the Fourier transform of $k$, viewed itself as a kernel, $k(r) = (1 + r^2)^{-\frac{N+1}{2}}$, yields the initial kernel as its corresponding Fourier transform.

Example 4.15 (Damped Harmonic Oscillator) Another way to generalize the harmonic oscillator, this time so that $k$ does not depend on the dimensionality $N$, is to set $k(r) = \frac{1}{1 + r^2}$. Following [586, Section 13.6],

$$F[k](\omega) \propto \omega^{-\nu}\, K_\nu(\omega), \tag{4.56}$$
where $K_\nu$ is the Bessel function of the second kind (see [520]).
It is possible to upper bound $F[k]$ using the expansion of $K_\nu$ given in [209, eq. (8.451.6)], which holds with $p > \nu - \frac{1}{2}$ and $\theta \in [0, 1]$. The term in brackets $[\cdot]$ in that expansion converges to 1 as $x \to \infty$, and thus results in an exponential decay of the Fourier spectrum.

Example 4.16 (Modified Bessel Kernels) In the previous example, we defined a kernel via $k(r) = \frac{1}{1 + r^2}$. Since $k(r)$ is a nonnegative function with acceptable decay properties, we could also use this function to define a kernel in Fourier space, via $v(\boldsymbol{\omega}) = \frac{1}{\alpha^2 + \|\boldsymbol{\omega}\|^2}$. The consequence thereof is that (4.56) will now be a kernel, i.e., up to constants,

$$k(\|\mathbf{x} - \mathbf{x}'\|) \propto \|\mathbf{x} - \mathbf{x}'\|^{-\nu}\, K_\nu\left(\alpha \|\mathbf{x} - \mathbf{x}'\|\right).$$
This is a popular kernel in Gaussian Process estimation [599] (see Section 16.3), since for $\nu > n$ the corresponding Gaussian process is a mean-square differentiable stochastic process. See [3] for more detail on this subject. For our purposes, it is sufficient to know that for $\nu > n$, $k(\|\mathbf{x} - \mathbf{x}'\|)$ is differentiable in $\mathbb{R}^N$.

Example 4.17 (Generalized $B_n$ Splines) Finally, we generalize $B_n$-splines to $N$ dimensions. One way is to define

$$B_n^N := \underbrace{\chi_{U_N} \otimes \cdots \otimes \chi_{U_N}}_{n+1 \text{ times}},$$
Figure 4.9 $B_n$ splines in 2 dimensions. From left to right and top to bottom: splines of order 0 to 3. Again, note the increasing degree of smoothness and differentiability with increasing order of the splines.
$B_n$ Splines → Bessel Functions
so that $B_n^N$ is the $(n+1)$-fold convolution of the indicator function of the unit ball $U_N$ in $N$ dimensions. See Figure 4.9 for examples of such functions. Employing the Fourier-Plancherel Theorem (Theorem 4.11), we find that its Fourier transform is the $(n+1)$st power of the Fourier transform of the unit ball,

$$F[\chi_{U_N}](\omega) \propto \omega^{-\frac{N}{2}}\, J_{\frac{N}{2}}(\omega),$$
and therefore,

$$F[B_n^N](\omega) \propto \omega^{-\frac{N(n+1)}{2}}\, \left(J_{\frac{N}{2}}(\omega)\right)^{n+1}.$$
Only odd $n$ generate positive definite kernels, since only then does the kernel have a nonnegative Fourier transform.

4.5.3 A Note on Other Invariances
So far, we have only been exploiting invariances with respect to the translation group in $\mathbb{R}^N$. The methods could also be applied to other symmetry transformations with corresponding canonical coordinate systems, however. This means that we use a coordinate system in which the invariance transformations can be represented as additions.
Lie Groups and Lie Algebras

Not all symmetries have this property. Informally speaking, those that do are called Lie groups (see also Section 11.3), and the parameter space where the additions take place is referred to as a Lie algebra. For instance, the rotation and scaling group (i.e., the product between the special orthogonal group SO(N) and radial scaling), as proposed in [487, 167], corresponds to a log-polar parametrization of $\mathbb{R}^N$. The matching transform into frequency space is commonly referred to as the Fourier-Mellin transform [520].
4.6 Dot Product Kernels

A second, important family of kernels can be efficiently described in terms of dot products, i.e., kernels of the form

$$k(x, x') = k(\langle x, x' \rangle). \tag{4.63}$$
Regularization Properties via Mercer's Theorem
Here, with slight abuse of notation, we use $k$ to define dot product kernels via $k(\langle x, x' \rangle)$. Such dot product kernels include the homogeneous and inhomogeneous polynomial kernels $(\langle x, x' \rangle + c)^p$ with $c \geq 0$. Proposition 2.1 shows that they satisfy Mercer's condition. What we will do in the following is state an easily verifiable criterion determining under which conditions a general kernel, as defined by (4.63), satisfies Mercer's condition. A side effect of this analysis will be a deeper insight into the regularization properties of the operator $\Upsilon^*\Upsilon$, when considered on the space $L_2(S_{N-1})$, where $S_{N-1}$ is the unit sphere in $\mathbb{R}^N$. The choice of the domain $S_{N-1}$ is made in order to exploit the symmetries inherent in $k$: $k(x, x')$ is rotation invariant in its arguments $x, x'$. In a nutshell, we use Mercer's Theorem (Theorem 2.10) explicitly to obtain an expansion of $k$ in terms of the eigenfunctions of the integral operator $T_k$ (2.38) corresponding to $k$. For convenience, we briefly review the connection between $T_k$, the eigenvalues $\lambda_i$, and kernels $k$. For a given kernel $k$, the integral operator $(T_k f)(x) := \int_{\mathcal{X}} k(x, x') f(x')\, d\mu(x')$ can be expanded into its eigenvector decomposition $(\lambda_i, \psi_i(x))$, such that

$$k(x, x') = \sum_{i} \lambda_i\, \psi_i(x)\, \psi_i(x')$$
holds. Furthermore, the eigensystem of the regularization operator $\Upsilon^*\Upsilon$ is given by $(\lambda_i^{-1}, \psi_i(x))$. The latter tells us the preference of a kernel expansion for specific types of functions (namely the eigenfunctions $\psi_i$), and the smoothness assumptions made via the size of the eigenvalues $\lambda_i$: for instance, large values of $\lambda_i$ correspond to functions that are weakly penalized.
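An empirical version of this decomposition can be obtained by eigendecomposing the kernel matrix on a grid. In the sketch below (our own illustration, assuming NumPy; grid and kernel width are arbitrary choices), the leading eigenvalues of a Gaussian kernel decay rapidly, and the corresponding eigenvectors oscillate least — precisely the functions that are penalized only weakly:

```python
import numpy as np

x = np.linspace(-1, 1, 200)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5)   # Gaussian kernel matrix
lam, psi = np.linalg.eigh(K)
lam, psi = lam[::-1], psi[:, ::-1]                  # sort descending
print(lam[:5] / lam[0])                             # rapidly decaying spectrum
# count sign changes of each eigenvector: leading ones oscillate least
zeros = [(np.diff(np.sign(psi[:, i])) != 0).sum() for i in range(5)]
print(zeros)                                        # roughly 0, 1, 2, 3, 4
```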
4.6.1 Conditions for Positivity and Eigenvector Decompositions
In the following, we assume that $\mathcal{X}$ is the unit sphere $S_{N-1} \subset \mathbb{R}^N$ and that $\mu$ is the uniform measure on $S_{N-1}$. This takes advantage of the inherent invariances in dot product kernels, and it simplifies the notation. We begin with a few definitions.

Legendre Polynomials Denote by $P_n(\xi)$ the Legendre polynomials of degree $n$, and by $P_n^N(\xi)$ the associated Legendre polynomials (see [373] for more details and examples), where $P_n = P_n^3$. Without stating their explicit functional form, we list some properties we will need:

Orthonormal Basis

1. The (associated) Legendre polynomials form an orthogonal basis, with

$$\int_{-1}^{1} P_n^N(\xi)\, P_m^N(\xi)\, \left(1 - \xi^2\right)^{\frac{N-3}{2}} d\xi = \delta_{nm}\, \frac{|S_{N-1}|}{|S_{N-2}|\, M(N, n)}.$$

Here $|S_{N-1}| = \frac{2\pi^{N/2}}{\Gamma(N/2)}$ denotes the surface of $S_{N-1}$, and $M(N, n)$ denotes the multiplicity of spherical harmonics of order $n$ on $S_{N-1}$, which is given by $M(N, n) = \frac{2n + N - 2}{n} \binom{n + N - 3}{n - 1}$.

Series Expansion

2. We can find an expansion of any analytic function $k(\xi)$ on $[-1, 1]$ into the orthogonal basis functions $P_n^N$,⁸

$$k(\xi) = \sum_{n=0}^{\infty} a_n\, P_n^N(\xi).$$

3. The Legendre polynomials may be expanded into an orthonormal basis of spherical harmonics $Y_{n,j}$ by the Funk-Hecke equation (see [373]), to obtain

$$P_n^N(\langle x, x' \rangle) = \frac{|S_{N-1}|}{M(N, n)} \sum_{j=1}^{M(N, n)} Y_{n,j}(x)\, Y_{n,j}(x').$$
The explicit functional form of $Y_{n,j}$ is not important for the further analysis.

Necessary and Sufficient Conditions Below we list conditions, proven by Schoenberg [466], under which a function $k(\langle x, x' \rangle)$ defined on $S_{N-1}$ is positive definite. In particular, he proved the following two theorems:
Legendre Expansion
Taylor Series Expansion
Theorem 4.18 (Dot Product Kernels in Finite Dimensions) A kernel $k(\langle x, x' \rangle)$ defined on $S_{N-1} \times S_{N-1}$ is positive definite if and only if its expansion into associated Legendre polynomials $P_n^N$ has only nonnegative coefficients, i.e.,

$$k(\xi) = \sum_{n=0}^{\infty} a_n\, P_n^N(\xi) \quad \text{with} \quad a_n \geq 0. \tag{4.68}$$
Theorem 4.19 (Dot Product Kernels in Infinite Dimensions) A kernel $k(\langle x, x' \rangle)$ defined on the unit sphere in a Hilbert space is positive definite if and only if its Taylor series expansion has only nonnegative coefficients:

$$k(\xi) = \sum_{n=0}^{\infty} a_n\, \xi^n \quad \text{with} \quad a_n \geq 0. \tag{4.69}$$

8. Typically, computer algebra programs can be used to find such expansions for given kernels $k$. This greatly reduces the problems in the analysis of such kernels.
Therefore, all we have to do in order to check whether a particular kernel may satisfy Mercer's condition is to look at its polynomial series expansion, and check the coefficients. We note that (4.69) is a more stringent condition than (4.68). In other words, in order to prove positive definiteness for arbitrary dimensions, it suffices to show that the Taylor expansion contains only nonnegative coefficients. On the other hand, in order to prove that a candidate kernel function will never be positive definite, it is sufficient to show this for (4.68) with $P_n^N = P_n$, i.e., for the Legendre polynomials.

Eigenvector Decomposition We conclude this section with an explicit representation of the eigensystem of $k(\langle x, x' \rangle)$. For a proof see [511].

Lemma 4.20 (Eigenvector Decomposition of Dot Product Kernels) Denote by $k(\langle x, x' \rangle)$ a kernel on $S_{N-1} \times S_{N-1}$ satisfying condition (4.68) of Theorem 4.18. Then the eigenfunctions of $k$ are given by the spherical harmonics, $\psi_{n,j} = Y_{n,j}$, with eigenvalues $\lambda_{n,j} = a_n \frac{|S_{N-1}|}{M(N, n)}$.
In other words, $a_n \frac{|S_{N-1}|}{M(N, n)}$ determines the regularization properties of $k(\langle x, x' \rangle)$.

4.6.2 Examples and Applications
In the following we will analyze a few kernels, and state under which conditions they may be used as SV kernels.

Example 4.21 (Homogeneous Polynomial Kernels $k(x, x') = \langle x, x' \rangle^p$) As we showed in Chapter 2, this kernel is positive definite for $p \in \mathbb{N}$. We will now show that for $p \notin \mathbb{N}$ this is never the case. We thus have to show that (4.68) cannot hold for an expansion in terms of Legendre polynomials ($N = 3$). From [209, 7.126.1], we obtain a closed form, in terms of Gamma functions, for the coefficient integrals of $k(\xi) = |\xi|^p$ (we need $|\xi|$ to make $k$ well-defined),

$$a_n \propto \int_{-1}^{1} |\xi|^p\, P_n(\xi)\, d\xi. \tag{4.71}$$
For odd $n$, the integral vanishes, since $P_n(-\xi) = (-1)^n P_n(\xi)$. In order to satisfy (4.68), the integral has to be nonnegative for all $n$. One can see that $\Gamma\left(1 + \frac{p}{2} - \frac{n}{2}\right)$ is the only term in (4.71) that may change its sign. Since the sign of the $\Gamma$ function alternates with period 1 for negative arguments (and it has poles at negative integer arguments), we cannot find any $p \notin \mathbb{N}$ for which both $n = 2\lfloor \frac{p}{2} \rfloor + 2$ and $n = 2\lceil \frac{p}{2} \rceil + 2$ correspond to nonnegative values of the integral.
Example 4.22 (Inhomogeneous Polynomial Kernels $k(x, x') = (\langle x, x' \rangle + 1)^p$) Likewise, let us analyze $k(\xi) = (1 + \xi)^p$ for $p > 0$. Again, we expand $k$ in a series of Legendre polynomials, to obtain [209, 7.127]

$$a_n \propto \int_{-1}^{1} (1 + \xi)^p\, P_n(\xi)\, d\xi = \frac{2^{p+1}\, \Gamma(p + 1)^2}{\Gamma(p + 2 + n)\, \Gamma(p + 1 - n)}. \tag{4.72}$$
For $p \in \mathbb{N}$, all terms with $n > p$ vanish, and the remainder is positive. For non-integer $p$, however, (4.72) may change its sign. This is due to $\Gamma(p + 1 - n)$. In particular, for any $p \notin \mathbb{N}$ (with $p > 0$), we have $\Gamma(p + 1 - n) < 0$ for $n = \lceil p \rceil + 1$. This violates condition (4.68); hence such kernels cannot be used in SV machines unless $p \in \mathbb{N}$.

Example 4.23 (Vovk's Real Polynomial $k(x, x') = \frac{1 - \langle x, x' \rangle^p}{1 - \langle x, x' \rangle}$ with $p \in \mathbb{N}$ [459]) This kernel can be written as $k(\xi) = \sum_{n=0}^{p-1} \xi^n$, hence all the coefficients $a_n = 1$, which means that the kernel can be used regardless of the dimensionality of the input space. Likewise, we can analyze an infinite power series.

Example 4.24 (Vovk's Infinite Polynomial $k(x, x') = (1 - \langle x, x' \rangle)^{-1}$ [459]) This kernel can be written as $k(\xi) = \sum_{n=0}^{\infty} \xi^n$, hence all the coefficients $a_n = 1$. The flat spectrum of the kernel suggests poor generalization properties.

Example 4.25 (Neural Network Kernels $k(x, x') = \tanh(a + \langle x, x' \rangle)$) We next show that $k(\xi) = \tanh(a + \xi)$ is never positive definite, no matter how we choose the parameters. The technique is identical to that of Examples 4.21 and 4.22: we have to show that the kernel does not satisfy the conditions of Theorem 4.18. Since this is very technical (and is best done using computer algebra programs such as Maple), we refer the reader to [401] for details, and explain how the method works in the simpler case of Theorem 4.19. Expanding $\tanh(a + \xi)$ into a Taylor series yields

$$\tanh(a + \xi) = \tanh a + \frac{\xi}{\cosh^2 a} - \frac{\tanh a}{\cosh^2 a}\, \xi^2 + \frac{2 \tanh^2 a - \operatorname{sech}^2 a}{3 \cosh^2 a}\, \xi^3 + O(\xi^4). \tag{4.73}$$
We now analyze (4.73) coefficient-wise. Since the coefficients have to be nonnegative, we obtain $a \in [0, \infty)$ from the first term, $a \in (-\infty, 0]$ from the third term, and $|a| \geq \operatorname{arctanh}\sqrt{1/3}$ from the fourth term. This leaves us with $a \in \emptyset$; hence there are no parameters for which this kernel is positive definite.
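The coefficient-wise argument can be mirrored numerically. The helper below (ours, assuming NumPy) evaluates the first four Taylor coefficients of $\tanh(a + \xi)$ in closed form and checks the nonnegativity required by Theorem 4.19 for a few values of $a$:

```python
import numpy as np

def tanh_taylor_coeffs(a):
    # first four Taylor coefficients of tanh(a + xi) around xi = 0,
    # matching (4.73): [tanh a, sech^2 a, -tanh a sech^2 a, ...]
    t, s2 = np.tanh(a), 1.0 / np.cosh(a) ** 2
    return np.array([t, s2, -t * s2, s2 * (2 * t**2 - s2) / 3.0])

for a in (-1.0, 0.0, 0.5, 2.0):
    c = tanh_taylor_coeffs(a)
    print(f"a={a:+.1f}: coefficients {np.round(c, 3)}, "
          f"admissible: {bool(np.all(c >= 0))}")
# No value of a makes all four coefficients nonnegative, so tanh(a + <x, x'>)
# is never positive definite (Example 4.25).
```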
4.7 Multi-Output Regularization

So far in this chapter, we have only considered scalar functions $f : \mathcal{X} \to \mathbb{R}$. Below we will show that under rather mild assumptions on the symmetry properties of $\mathcal{Y}$, there exist no other vector valued extensions of $\Upsilon^*\Upsilon$ than the trivial extension, i.e., the application of a scalar regularization operator to each of the dimensions of $\mathcal{Y}$ separately. The reader not familiar with group theory may want to skip the more detailed discussion given below.
The type of regularization we study are quadratic functionals $\Omega[f]$; ridge regression, RKHS regularizers, and Gaussian Processes are examples of such regularization. Our proofs rely on a result from [509], which is stated without proof.

Proposition 4.26 (Homogeneous Invariant Regularization [509]) Any regularization term $\Omega[f]$ that is positive, homogeneous quadratic, and invariant under an irreducible orthogonal representation $\rho$ of the group⁹ $\mathcal{G}$ on $\mathcal{Y}$, i.e., that satisfies

$$\Omega[f] \geq 0, \tag{4.74}$$

$$\Omega[\lambda f] = \lambda^2\, \Omega[f] \text{ for all } \lambda \in \mathbb{R}, \tag{4.75}$$

$$\Omega[\rho(g) f] = \Omega[f] \text{ for all } g \in \mathcal{G}, \tag{4.76}$$
is of the form

$$\Omega[f] = \sum_{i} \left\langle \Upsilon f_i, \Upsilon f_i \right\rangle, \tag{4.77}$$

where the $f_i$ denote the scalar components of $f$, and $\Upsilon$ is a scalar regularization operator.
The motivation for the requirements (4.74) to (4.76) can be seen as follows.

Positivity The necessity that a regularization term be positive (4.74) is self evident — it must at least be bounded from below. Otherwise we could obtain arbitrarily "good" estimates by exploiting the pathological behavior of the regularization operator. Hence, via a positive offset, $\Omega[f]$ can be transformed such that it satisfies the positivity condition (4.74).

Homogeneity Homogeneity (4.75) is a useful condition for efficient capacity control: the entropy numbers (a quantity to be introduced in Chapter 12), which are a measure of the size of the set of possible solutions, scale in a linear (hence, homogeneous) fashion when the hypothesis class is rescaled by a constant. Practically speaking, this means that we do not need new capacity bounds for every scale the function $f$ might assume. The requirement of being quadratic is merely algorithmic, as it allows us to avoid taking absolute values in the linear or cubic case to ensure positivity, or when dealing with derivatives.

Invariance Finally, the invariance must be chosen beforehand. If it happens to be sufficiently strong, it can rule out all operators but scalars. Permutation symmetry is such a case; in classification, for instance, this would mean that all class labels are treated equally.

No Vector Valued Regularizer

A consequence of the proposition is that there exists no vector valued regularization operator satisfying the invariance conditions. We now look at practical applications of Proposition 4.26, which will be stated in the form of corollaries.

Corollary 4.27 (Permutation and Rotation Symmetries) Under the assumptions of Proposition 4.26, both the canonical representation of the permutation group (by permutation matrices) in a finite dimensional vector space $\mathcal{Y}$, and the group of orthogonal transformations on $\mathcal{Y}$, enforce scalar operators $\Upsilon$.

9. $\mathcal{G}$ may also be directly defined on $\mathcal{Y}$, i.e., it might be a matrix group like SU(N).
Permutation and Rotation Symmetries are Irreducible
This follows immediately from the fact that both rotations and permutations (or more precisely, their representations on $\mathcal{Y}$) are unitary and irreducible on $\mathcal{Y}$ by construction. For instance, if the permutation group were reducible on $\mathcal{Y}$, then there would exist subspaces of $\mathcal{Y}$ which do not change under any permutation on $\mathcal{Y}$. This is impossible, however, since we are considering the group of all possible permutations on $\mathcal{Y}$. Finally, permutations are a subgroup of the group of all possible orthogonal transformations.

Let us now address the more practical side of such operators, namely how they translate into function expansions. We need only evaluate $\langle \Upsilon \alpha f, \Upsilon \alpha' f' \rangle$, where $f, f'$ are scalar functions and $\alpha, \alpha' \in \mathcal{Y}$. Since $\Upsilon$ is scalar, this yields $\langle \alpha, \alpha' \rangle \langle \Upsilon f, \Upsilon f' \rangle$. It then remains to evaluate $\Omega[f]$ for a kernel expansion of $f$. We obtain:

Corollary 4.28 (Kernel Expansions) Under the assumptions of Proposition 4.26, the regularization functional $\Omega[f]$ for a kernel expansion

$$f(x) = \sum_{i=1}^{m} \alpha_i\, k(x_i, x), \qquad \alpha_i \in \mathcal{Y},$$
where $k(x_i, x)$ is a function mapping $\mathcal{X} \times \mathcal{X}$ to the space of scalars $\mathbb{S}$, compatible with the dot product space $\mathcal{Y}$ (we require that $\beta \alpha \in \mathcal{Y}$ for $\alpha \in \mathcal{Y}$ and $\beta \in \mathbb{S}$), can be stated as

$$\Omega[f] = \sum_{i,j=1}^{m} \langle \alpha_i, \alpha_j \rangle \left\langle \Upsilon k(x_i, \cdot),\, \Upsilon k(x_j, \cdot) \right\rangle.$$
In particular, if $k$ is the Green's function of $\Upsilon^*\Upsilon$, we get

$$\Omega[f] = \sum_{i,j=1}^{m} \langle \alpha_i, \alpha_j \rangle\, k(x_i, x_j).$$
For possible applications such as regularized principal manifolds, see Chapter 17.
4.8 Semiparametric Regularization
Preference for Parametric Part
Understandable Model
In some cases, we may have additional knowledge about the solution we are going to encounter. In particular, we may know that a specific parametric component is very likely to be part of the solution. It would be unwise not to take advantage of this extra knowledge. For instance, it might be the case that the major properties of the data are described by a combination of a small set of linearly independent basis functions $\{\phi_1(\cdot), \ldots, \phi_n(\cdot)\}$. Or we might want to correct the data for some (e.g., linear) trends. Second, it may also be the case that the user wants an understandable model, without sacrificing accuracy; many people in the life sciences tend to have a preference for linear models. These reasons motivate the construction of semiparametric models, which are both easy to understand (due to the parametric part) and perform well (often thanks to the nonparametric term). For more advantages and advocacy of semiparametric models, see [47]. A common approach is to fit the data with the parametric model and train the nonparametric add-on using the errors of the parametric part; that is, we fit the
Backfitting vs. Global Solution
Capacity Control
The Algorithm
Primal Objective Function
Dual Objective Function
nonparametric part to the errors. We will show that this is useful only in a very restricted situation; in general, this method does not permit us to find the best model amongst a given class for different loss functions. It is better instead to solve a convex optimization problem, as in standard SVMs, but with a different set of admissible functions,

$$f(x) := g(x) + \sum_{i=1}^{n} \beta_i\, \phi_i(x). \tag{4.81}$$
Here $g \in \mathcal{H}$, where $\mathcal{H}$ is a Reproducing Kernel Hilbert Space as used in Theorem 4.3. In particular, this theorem implies that there exists a mixed expansion in terms of kernel functions $k(x_i, x)$ and the parametric part $\phi_i$. Keeping the standard regularizer $\Omega[f] = \frac{1}{2}\|g\|_{\mathcal{H}}^2$, we can see that there exist functions $\phi_1(\cdot), \ldots, \phi_n(\cdot)$ whose contribution is not regularized at all. This need not be a major concern if $n$ is sufficiently smaller than $m$, as the VC dimension (and thus the capacity) of this additional class of linear models is $n$; hence the overall capacity control will still work, provided the nonparametric part is sufficiently restricted. We will show, in the case of SV regression, how the semiparametric setting translates into optimization problems. The application to classification is straightforward, and is left as an exercise (see Problem 4.8). Formulating the optimization equations for the expansion (4.81), using the $\varepsilon$-insensitive loss function, and introducing kernels, we arrive at the following primal optimization problem:

$$\begin{aligned}
\text{minimize} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} (\xi_i + \xi_i^*) \\
\text{subject to} \quad & \Big(\langle w, \Phi(x_i) \rangle + \sum_{j=1}^{n} \beta_j\, \phi_j(x_i)\Big) - y_i \leq \varepsilon + \xi_i, \\
& y_i - \Big(\langle w, \Phi(x_i) \rangle + \sum_{j=1}^{n} \beta_j\, \phi_j(x_i)\Big) \leq \varepsilon + \xi_i^*, \\
& \xi_i, \xi_i^* \geq 0.
\end{aligned} \tag{4.82}$$
Computing the Lagrangian (we introduce $\alpha_i, \alpha_i^*, \eta_i, \eta_i^*$ for the constraints), and solving for the Wolfe dual, yields¹⁰

$$\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{m} y_i (\alpha_i - \alpha_i^*) - \varepsilon \sum_{i=1}^{m} (\alpha_i + \alpha_i^*) - \frac{1}{2} \sum_{i,j=1}^{m} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, k(x_i, x_j) \\
\text{subject to} \quad & \sum_{i=1}^{m} (\alpha_i - \alpha_i^*)\, \phi_j(x_i) = 0 \text{ for all } 1 \leq j \leq n, \\
& \alpha_i, \alpha_i^* \in [0, C].
\end{aligned} \tag{4.83}$$

10. See also (1.26) for details on how to formulate the Lagrangian.
Figure 4.10 Backfitting of a model with two parameters, $f(x) = wx + \beta$. Data was generated by taking 10 samples from the uniform distribution on $[1, 2]$; the target values were obtained by the dependency $y_i = x_i$. From left to right: (left) best fit with the parametric model of a constant function; (middle) after adaptation of the second parameter while keeping the first parameter fixed; (right) optimal fit with both parameters.
Semiparametric Kernel Expansion
Why Backfitting Is Not Sufficient
Backfitting for SVMs
Coordinate Descent
Note the similarity to the standard SV regression model. The objective function, and the box constraints on the Lagrange multipliers $\alpha_i, \alpha_i^*$, remain unchanged. The only modification comes from the additional unregularized basis functions: instead of a single (constant) function $\phi_1(x) = 1$, as in the standard SV case, we now have an expansion in the basis functions $\phi_i(\cdot)$. This gives rise to $n$ constraints instead of one. Finally, $f$ can be found as

$$f(x) = \sum_{i=1}^{m} (\alpha_i - \alpha_i^*)\, k(x_i, x) + \sum_{j=1}^{n} \beta_j\, \phi_j(x). \tag{4.84}$$
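To see the mixed expansion (4.84) in action without setting up the full quadratic program, one can replace the $\varepsilon$-insensitive loss by the squared loss, in which case the joint problem over kernel and parametric coefficients reduces to a single linear system. This is only a sketch of the idea, not the book's algorithm; it assumes NumPy, and the function and parameter names are ours:

```python
import numpy as np

def semiparametric_fit(x, y, k, phis, lam=1e-2):
    # joint solution of the mixed expansion (4.84) under squared loss:
    # [[K + lam*I, P], [P^T, 0]] [alpha; beta] = [y; 0]
    K = k(x[:, None], x[None, :])
    P = np.column_stack([phi(x) for phi in phis])    # unregularized part
    m, n = len(x), P.shape[1]
    A = np.block([[K + lam * np.eye(m), P], [P.T, np.zeros((n, n))]])
    sol = np.linalg.lstsq(A, np.concatenate([y, np.zeros(n)]), rcond=None)[0]
    return sol[:m], sol[m:]            # kernel and parametric coefficients

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 3.0, 40)
y = 2.0 * x + np.sin(3 * x)            # linear trend plus smooth part
k = lambda a, b: np.exp(-(a - b) ** 2)
alpha, beta = semiparametric_fit(x, y, k, [np.ones_like, lambda t: t])
print(beta)    # roughly recovers the unpenalized trend coefficients
```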
The only difficulty remaining is how to determine the $\beta_j$. This can be done by exploiting the Karush-Kuhn-Tucker optimality conditions in a manner analogous to (1.30), or more easily, by using an interior point algorithm (Section 6.4). In the latter case, the variables $\beta_j$ can be obtained as the dual variables of the dual (dual dual = primal) optimization problem (4.83), as a by-product of the optimization process.

It might seem that the approach presented above is quite unnecessary, and overly complicated for semiparametric modelling. In fact, we could try to fit the data to the parametric model first, and then fit the nonparametric part to the residuals; this approach is called backfitting. In most cases, however, this does not lead to the minimum of the regularized risk functional. We will show this using a simple example. Consider an SV regression machine as defined in Section 1.6, with linear kernel (i.e., $k(x, x') = \langle x, x' \rangle$) in one dimension, and a constant term as parametric part (i.e., $f(x) = wx + \beta$). Now suppose the data were generated by $y_i = x_i$, where $x_i$ is uniformly drawn from $[1, 2]$, without noise. Clearly, $y_i \geq 1$ also holds for all $i$. By construction, the best overall fit of the pair $(\beta, w)$ will be arbitrarily close to $(0, 1)$ if the regularization parameter $\lambda$ is chosen sufficiently small. For backfitting, we first carry out the parametric fit, to find a constant $\beta$ minimizing the term $\sum_{i=1}^{m} c(y_i - \beta)$. Depending on the chosen loss function $c(\cdot)$, $\beta$ will be the mean ($L_2$ error), the median ($L_1$ error), a trimmed mean (related to the $\varepsilon$-insensitive loss), or
Orthogonal Decomposition
$\Omega[f]$ for Subspaces
Connecting CPD Kernels and Semiparametric Models
some other function of the set $\{y_1 - wx_1, \ldots, y_m - wx_m\}$ (cf. Section 3.4). Since all $y_i \geq 1$, we have $\beta \geq 1$; this is not the optimal solution of the overall problem, since in the latter case $\beta$ would be close to 0, as seen above. Hence backfitting does not minimize the regularized risk functional, even in the simplest of settings; and we certainly cannot expect backfitting to work in more complex cases. There exists only one case in which backfitting suffices, namely if the function spaces spanned by the kernel expansion $\{k(x_i, \cdot)\}$ and by $\{\phi_i(\cdot)\}$ are orthogonal. Consequently, we must in general jointly solve for both the parametric and the nonparametric part, as done in (4.82) and (4.83).

Above, we effectively excluded a set of basis functions $\phi_1, \ldots, \phi_n$ from being regularized at all. This means that we could use regularization functionals $\Omega[f]$ that need not be positive definite on the whole Reproducing Kernel Hilbert Space $\mathcal{H}$, but only on the orthogonal complement of $\mathrm{span}\{\phi_1, \ldots, \phi_n\}$. This brings us back to the notion of conditionally positive definite kernels, as explained in Section 2.2. These exclude the space of linear functions from the space of admissible functions $f$, in order to achieve a positive definite regularization term $\Omega[f]$ on the orthogonal complement. In (4.83), this is precisely what happens with the functions $\phi_j$, which are not supposed to be regularized. Consequently, if we choose the $\phi_i$ to be the family of all linear functions, the semiparametric approach allows us to use conditionally positive definite (cpd) kernels (see Definition 2.21 and below) without any further problems.
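The backfitting counterexample above is easily reproduced numerically. In this hypothetical sketch (ours, assuming NumPy), we use squared loss and a small penalty on the slope only, which suffices to exhibit the same phenomenon as the SV regression example in the text:

```python
import numpy as np

# Backfitting vs. joint optimization: data y_i = x_i with x_i in [1, 2],
# model f(x) = w*x + beta, squared loss, small penalty lam*w^2 on the slope.
rng = np.random.default_rng(0)
x = rng.uniform(1.0, 2.0, 10)
y = x.copy()
lam = 1e-3

# backfitting: fit the constant alone first, then the slope on the residuals
beta_bf = y.mean()                                  # >= 1 by construction
w_bf = (x @ (y - beta_bf)) / (x @ x + lam)

# joint fit: solve the 2x2 normal equations for (w, beta) simultaneously
A = np.array([[x @ x + lam, x.sum()], [x.sum(), float(len(x))]])
w_jt, beta_jt = np.linalg.solve(A, np.array([x @ y, y.sum()]))

print("backfitting:", w_bf, beta_bf)   # beta stays near 1.5, w near 0
print("joint fit:  ", w_jt, beta_jt)   # close to the optimum (w, beta) = (1, 0)
```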
4.9 Coefficient Based Regularization
Function Space vs. Coefficient Space
General Kernel Expansion
Most of the discussion in the current chapter was based on regularization in Reproducing Kernel Hilbert Spaces, and explicitly avoided any specific restrictions on the type of coefficient expansions used. This is useful insofar as it provides a powerful mathematical framework to assess the quality of the estimates obtained in this process. In some cases, however, we would rather use a regularization operator that acts directly on coefficient space, be it for theoretical reasons (see Section 16.5), to satisfy the practical desire to obtain sparse expansions (Section 4.9.2), or simply by the heuristic that small coefficients generally translate into simple functions. We will now consider the situation where $\Omega[f]$ can be written as a function of the coefficients $\alpha_i$, where $f$ will again be expanded as a linear combination of kernel functions,

$$f(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x), \tag{4.85}$$

but with the possibility that the expansion points $x_i$ and the training patterns do not coincide, and that possibly $n \neq m$.
4.9.1 Ridge Regression
Weight Decay

A popular choice for regularizing linear combinations of basis functions is a weight decay term (see [339, 49] and the references therein), which penalizes large weights. Thus we choose

$$\Omega[f] = \frac{1}{2} \sum_{i=1}^{n} \alpha_i^2 = \frac{1}{2} \|\alpha\|_2^2. \tag{4.86}$$
Equivalence Condition
Equivalent Operator
This is also called Ridge Regression [245, 377], and is a very common method in the context of shrinkage estimators. Similarly to Section 4.3, we now investigate whether there exists a correspondence between Ridge Regression and SVMs. Although no strict equivalence holds, we will show that it is possible to obtain models generated by the same type of regularization operator. The requirement on an operator $\Upsilon$ for a strict equivalence would be

$$\left\langle \Upsilon k(x_i, \cdot),\, \Upsilon k(x_j, \cdot) \right\rangle = \delta_{ij}, \tag{4.87}$$

and thus,

$$\Omega[f] = \|\Upsilon f\|^2 = \sum_{i,j} \alpha_i \alpha_j \left\langle \Upsilon k(x_i, \cdot),\, \Upsilon k(x_j, \cdot) \right\rangle = \sum_{i} \alpha_i^2. \tag{4.88}$$
Unfortunately, this requirement is not suitable in the case of the Kronecker $\delta$, as (4.88) implies that the functions $(\Upsilon k)(x_i, \cdot)$ are elements of a non-separable Hilbert space. The solution is to change the finite Kronecker $\delta_{ij}$ into the more appropriate $\delta$-distribution, i.e., $\delta(x_i - x_j)$. By reasoning similar to Theorem 4.9, we can see that (4.88) then holds, with $k(x, x')$ the Green's function of $\Upsilon$. Note that as a regularization operator, $(\Upsilon^*\Upsilon)^{\frac{1}{2}}$ is equivalent to $\Upsilon$, as we can always replace the latter by the former without any difference in the regularization properties. Therefore, we assume without loss of generality that $\Upsilon$ is a positive definite operator. Formally, we require

$$\left\langle \Upsilon k(x_i, \cdot),\, \Upsilon k(x_j, \cdot) \right\rangle = \delta(x_i - x_j). \tag{4.89}$$
Again, this allows us to connect regularization operators and kernels: the Green's function of $\Upsilon$ must be found in order to satisfy (4.89). For the special case of translation invariant operators represented in Fourier space, we can associate $\Upsilon$ with $\Upsilon_{\mathrm{ridge}}(\omega)$ as in (4.28), leading to

$$k(x_i, x_j) \propto \int \frac{e^{i\langle \omega,\, x_i - x_j \rangle}}{\Upsilon_{\mathrm{ridge}}(\omega)}\, d\omega. \tag{4.90}$$

This expansion is possible since the Fourier transform diagonalizes the corresponding regularization operator: repeated applications of $\Upsilon$ become multiplications in the Fourier domain. Comparing (4.90) with (4.28) leads to the conclusion that the following relation between kernels for Support Vector Machines and
Ridge Regression holds:

$$F[k_{\mathrm{SVM}}](\omega) = F[k_{\mathrm{ridge}}](\omega)^2. \tag{4.91}$$

In other words, in Ridge Regression it is the squared Fourier transform of the kernel that determines the regularization properties. Later on, in Chapter 16, Theorem 16.9 will give a similar result, derived under the assumption that the penalties on $\alpha_i$ are given by a prior probability over the distribution of expansion coefficients. This connection also explains the performance of Ridge Regression models in a smoothing regularizer context (the squared norm of the Fourier transform of the kernel function describes its regularization properties), and allows us to "transform" Support Vector Machines into Ridge Regression models, and vice versa. Note, however, that the sparsity properties of Support Vectors are lost.
4.9.2 Linear Programming Regularization

$\ell_1$ for Sparsity
A squared penalty on the coefficients $\alpha_i$ has the disadvantage that even though some kernel functions $k(x_i, x)$ may not contribute much to the overall solution, they still appear in the function expansion. This is due to the fact that the gradient of $\alpha_i^2$ tends to 0 as $\alpha_i \to 0$ (as can easily be checked by looking at the partial derivative of $\Omega[f]$ with respect to $\alpha_i$). A regularizer whose derivative does not vanish in the neighborhood of 0, on the other hand, will not exhibit such problems. This is why we choose

$$\Omega[f] = \|\alpha\|_1 = \sum_{i=1}^{n} |\alpha_i|. \tag{4.92}$$
The regularized risk minimization problem can then be rewritten as

$$\text{minimize} \quad \frac{1}{m} \sum_{i=1}^{m} c\big(x_i, y_i, f(x_i)\big) + \lambda \sum_{i=1}^{n} |\alpha_i|. \tag{4.93}$$
Soft Margin → Linear Program
Besides replacing $\alpha_i$ with $\alpha_i - \alpha_i^*$ and $|\alpha_i|$ with $\alpha_i + \alpha_i^*$, and requiring $\alpha_i, \alpha_i^* \geq 0$, there is hardly anything that needs to be done to render the problem computationally feasible — the constraints are already linear. Moreover, most optimization software can deal efficiently with problems of this kind.
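Such a linear program can be solved directly with off-the-shelf LP software. The sketch below (ours; it assumes NumPy and SciPy's linprog, and all names and constants are illustrative) encodes (4.93) with the $\varepsilon$-insensitive loss, using the variable splitting described above:

```python
import numpy as np
from scipy.optimize import linprog

def l1_svr(K, y, eps=0.1, lam=1.0):
    # l1-regularized regression with the epsilon-insensitive loss, cf. (4.93),
    # as a linear program; variables are [alpha, alpha*, xi, xi*] >= 0.
    m = len(y)
    I, Z = np.eye(m), np.zeros((m, m))
    c = np.concatenate([lam * np.ones(2 * m), np.ones(2 * m)])
    A_ub = np.block([[K, -K, -I, Z], [-K, K, Z, -I]])
    b_ub = np.concatenate([y + eps, eps - y])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
    return res.x[:m] - res.x[m:2 * m]      # the coefficients alpha_i

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 30)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)
a = l1_svr(K, np.sin(x))
print("nonzero coefficients:", np.sum(np.abs(a) > 1e-6), "of", a.size)  # sparse
```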
4.9.3 Mixed Semiparametric Regularizers

We now investigate the use of mixed regularization functionals, with different penalties for distinct parts of the function expansion, as suggested by equations (4.92) and (4.81). Indeed, we can construct the following variant, which is a mixture
of linear and quadratic regularizers,

$$\Omega[f] = \frac{1}{2} \|g\|_{\mathcal{H}}^2 + \lambda' \sum_{j=1}^{n} |\beta_j|, \quad \text{where } f(x) = g(x) + \sum_{j=1}^{n} \beta_j\, \phi_j(x). \tag{4.94}$$
Mixed Dual Problem
Semiparametric and Sparse
The equation above is essentially the SV estimation model, with an additional linear regularization term added for the parametric part. In this case, the constraints of the optimization problem (4.83) become

$$\left| \sum_{i=1}^{m} (\alpha_i - \alpha_i^*)\, \phi_j(x_i) \right| \leq \lambda' \text{ for all } 1 \leq j \leq n, \qquad \alpha_i, \alpha_i^* \in [0, C], \tag{4.95}$$
and the variables $\beta_j$ are obtained as the dual variables of these constraints, as discussed previously in similar cases. Finally, we could reverse the setting to obtain the regularizer

$$\Omega[f] = \sum_{i=1}^{n} |\alpha_i| + \frac{1}{2}\, \beta^\top M \beta \tag{4.96}$$
for some positive definite matrix $M$. Note that (4.96) can be reduced to the case of (4.94) by renaming variables accordingly, given a suitable choice of $M$. The proposed regularizers are a simple extension of existing methods, such as Basis Pursuit [104], or Linear Programming for classification (e.g., [184]). The common idea is to have two different sets of basis functions which are regularized differently, or a subset that is not regularized at all. This is an efficient way of encoding prior knowledge or user preference, since the emphasis is on the functions with little or no regularization. Finally, one could also use a regularization functional $\Omega[f] = \|\alpha\|_0$, which simply counts the number of nonzero terms in the vector $\alpha \in \mathbb{R}^n$, or alternatively, combine this regularizer with the $\ell_1$ norm to obtain $\Omega[f] = \|\alpha\|_0 + \|\alpha\|_1$. This is a concave function in $\alpha$ which, in combination with the soft margin loss function, leads to an optimization problem that is, as a whole, concave. Therefore one may apply Rockafellar's theorem (Theorem 6.12) to obtain an optimal solution. See [189] for further details and an explicit algorithm.
4.10 Summary

A connection between Support Vector kernels and regularization operators has been established, which provides one key to understanding why Support Vector Machines have been found to exhibit high generalization ability. In particular, for common choices of kernels, the mapping into feature space is not arbitrary, but corresponds to useful regularization operators (see Sections 4.4.1, 4.4.2, and 4.4.4). For kernels where this is not the case, Support Vector Machines may show poor performance (Section 4.4.3). This will become more obvious in Chapter 12 where, building on the results of the current chapter, the eigenspectrum of integral operators
Bayesian Methods
Vector Valued Functions
Semiparametric Models
is connected with generalization bounds for the corresponding Support Vector Machines. The link to regularization theory can be seen as a tool for determining the structure, consisting of sets of functions, in which Support Vector Machines and other kernel algorithms (approximately) perform structural risk minimization [561], possibly in a data dependent manner. In other words, it allows us to choose an appropriate kernel, given the data and the problem specific knowledge.

A simple consequence of this link is a Bayesian interpretation of Support Vector Machines. In this case, the choice of a special kernel can be regarded as a prior on the hypothesis space, with $P[f] \propto \exp\left(-\lambda \|\Upsilon f\|^2\right)$. See Chapter 16 for more detail on this matter.

It should be clear by now that the setting of Tikhonov and Arsenin [538], whilst very powerful, is certainly not the only conceivable one. A theorem on vector valued regularization operators showed, however, that under quite generic conditions on the isotropy of the space of target values, only scalar operators are possible; an extended version of their approach is thus the only possible option.

Finally, a closer consideration of the null space of regularization functionals $\Omega[f]$ led us to formulate semiparametric models. The roots of such models lie in the representer theorem (Theorem 4.2), proposed and explored in the context of smoothing splines in [296]. In fact, the SV expansion is a direct consequence of the representer theorem. Moreover, the semiparametric setting solves a problem created by the use of conditionally positive definite kernels of order $q$ (see Section 2.4.3): here, polynomials of order lower than $q$ are excluded. Hence, to cope with this effect, we must add the polynomials back in "manually," and the semiparametric approach presents a way of doing so. Another application of semiparametric models, besides the conventional approach of treating the nonparametric part as nuisance parameters [47], is in the domain of hypothesis testing, for instance to test whether a parametric model fits the data sufficiently well. This can be achieved in the framework of structural risk minimization [561] — given the different models (nonparametric vs. semiparametric vs. parametric), we can evaluate the bounds on the expected risk, and then choose the model with the best bound.
4.11 Problems

4.1 (Equivalent Optimization Strategies •••) Denote by $S$ a metric space and by $R, \Omega : S \to \mathbb{R}$ two strictly convex continuous maps. Let $\lambda > 0$.

• Show that the map $f \mapsto R[f] + \lambda \Omega[f]$ has only one minimum and a unique minimizer. Hint: assume the contrary, and consider a straight line between two minima.

• Show that for every $\lambda > 0$, there exists an $\Omega_\lambda$ such that minimization of $R[f] + \lambda \Omega[f]$ is equivalent to minimizing $R[f]$ subject to $\Omega[f] \leq \Omega_\lambda$. Show that an analogous statement holds with $R$ and $\Omega$ exchanged. Hint: consider the minimizer of $R[f] + \lambda \Omega[f]$, and keep
the second term fixed while minimizing over the first term.

• Consider the parametrized curve $(\Omega(\lambda), R(\lambda))$. What is the shape of this curve? Show that (barring discontinuities) $-\lambda$ is the tangent of the curve.

• Consider the parametrized curve $(\ln \Omega(\lambda), \ln R(\lambda))$, as proposed by Hansen [225]. Show that a tangent criterion similar to that imposed above is scale insensitive with respect to $\Omega$ and $R$. Why is this useful? What are the numerical problems with such an ansatz?
4.3 (Semiparametric Representer Theorem ••) Prove Theorem 4.3. Hint: start with a decomposition off into a parametric part, a kernel part, and an orthogonal contribution and evaluate the loss and regularization terms independently. 4.4 (Kernel Boosting •••) Show that for f € 'K and c(x, y , f ( x ) } = exp(—yf(x}}, you can develop a boosting algorithm by performing a coefficient-wise gradient descent on the coefficients oti of the expansion f ( x ) = £fli aik(x^ x). In particular, show that the expansion above is optimal What changes if we drop the regularization term Q[f] = \\f\\2? See [498, 577, 221] for examples. 4.5 (Monotonicity of the Regularizer ••) Give an example where, due to the fact that Q[/] is not strictly monotonic the kernel expansion (4.5) is not the only minimizer of the regularized risk functional (4.4). 4.6 (Sparse Expansions ••) Show that it is a sufficient requirement for the coefficients ai of the kernel expansion of the minimizer of (4.4) to vanish, if for the corresponding loss functions c(xi, i//,/(*;)) both the Ihs and the rhs derivative with respect to /(*,•) vanish. Hint: use the proof strategy of Theorem 4.2. Furthermore show that for loss functions c(x, y , f ( x } } this implies that we can obtain vanishing coefficients only ifc(xi, y,-, /(%;)) = 0. 4.7 (Biased Regularization ••) Show that for biased regularization (Remark 4.4) with £(||/||:K) = jll/lljc/ the effective overall regularizer is given by \\\f - /0||2. 4.8 (Semiparametric Classification ••) Show that given a set of parametric basis functions 4>i, the optimization problem for SV classification has the same objective function as (1.31), however with the constraints [506]
What happens if you combine Semiparametric classification with adaptive margins (the v-trick)?
4.9 (Regularization Properties of Kernels •) Analyze the regularization properties of the Laplacian kernel $k(x, x') = e^{-|x - x'|}$. What is the rate of decay in its power spectrum? What is the kernel corresponding to the associated regularization operator? Hint: rewrite $\Upsilon$ in the Fourier domain.

4.10 (Periodizing the Laplacian Kernel •) Show that for the Laplacian kernel $k(x, x') = e^{-|x - x'|}$, periodization with period $a$ results in a kernel proportional to $\cosh\left(\frac{a}{2} - d\right)$, where $d := |x - x'| \bmod a$.
4.11 (Hankel Transform and Inversion •••) Show that for radially symmetric functions, the Fourier transform is given by (4.52). Moreover, use (4.51) to prove the Hankel inversion theorem, stating that $H_\nu$ is its own inverse.

4.12 (Eigenvector Decompositions of Polynomial Kernels •••) Compute the eigenvalues of polynomial kernels on $U_N$. Hint: use [511], separate the radial from the angular part in the eigenvector decomposition of $k$, and solve the radial part empirically via numerical analysis. Possible kernels to consider are Vovk's kernel, (in)homogeneous polynomials, and the hyperbolic tangent kernel.
Prove this using functional analytic methods. 4.14 (Mixed Semiparametric Regularizers ••) Derive (4.96). Hint: set up the primal optimization problem as described in Section 1.4, compute the Lagrangian, and eliminate the primal variables. Can you find an interpretation of (4.95)? What is the effect of^=l(ai - a*)(f>j(Xi)?
5 Elements of Statistical Learning Theory
Overview
We now give a more complete exposition of the ideas of statistical learning theory, which we briefly touched on in Chapter 1. We mentioned previously that in order to learn from a small training set, we should try to explain the data with a model of small capacity; we have not yet justified why this is the case, however. This is the main goal of the present chapter. We start by revisiting the difference between risk minimization and empirical risk minimization, and illustrating some common pitfalls in machine learning, such as overfitting and training on the test set (Section 5.1). We explain that the motivation for empirical risk minimization is the law of large numbers, but that the classical version of this law is not sufficient for our purposes (Section 5.2). Thus, we need to introduce the statistical notion of consistency (Section 5.3). It turns out that consistency of learning algorithms amounts to a law of large numbers which holds uniformly over all functions that the learning machine can implement (Section 5.4). This crucial insight, due to Vapnik and Chervonenkis, focuses our attention on the set of attainable functions; this set must be restricted in order to have any hope of succeeding. Section 5.5 states probabilistic bounds on the risk of learning machines, and summarizes different ways of characterizing precisely how the set of functions can be restricted. This leads to the notion of capacity concepts, which gives us the main ingredients of the typical generalization error bound of statistical learning theory. We do not indulge in a complete treatment; rather, we try to give the main insights, to provide the reader with some intuition as to how the different pieces of the puzzle fit together. We end with a section showing an example application of risk bounds for model selection (Section 5.6).

Prerequisites

The chapter attempts to present the material in a fairly non-technical manner, providing intuition wherever possible. Given the nature of the subject matter, however, a limited amount of mathematical background is required. The reader who is not familiar with basic probability theory should first read Section B.1.

5.1 Introduction
Let us start with an example. We consider a regression estimation problem. Suppose we are given empirical observations,

$$(x_1, y_1), \ldots, (x_m, y_m) \in \mathcal{X} \times \mathcal{Y}, \tag{5.1}$$
Regression Example
where for simplicity we take $\mathcal{X} = \mathcal{Y} = \mathbb{R}$. Figure 5.1 shows a plot of such a dataset, along with two possible functional dependencies that could underlie the data. The dashed line represents a fairly complex model, and fits the training data perfectly. The straight line, on the other hand, does not completely "explain" the data, in the sense that there are some residual errors; it is much "simpler," however. A physicist measuring these data points would argue that it cannot be by chance that the measurements almost lie on a straight line, and would much prefer to attribute the residuals to measurement error than to an erroneous model. But is it possible to characterize the way in which the straight line is simpler, and why this should imply that it is, in some sense, closer to an underlying true dependency? In one form or another, this issue has long occupied the minds of researchers studying the problem of learning. In classical statistics, it has been studied as the bias-variance dilemma. If we computed a linear fit for every data set that we ever encountered, then every functional dependency we would ever "discover" would be linear. But this would not come from the data; it would be a bias imposed by us. If, on the other hand, we fitted a polynomial of sufficiently high degree to any given data set, we would always be able to fit the data perfectly, but the exact model we came up with would be subject to large fluctuations, depending on
Bias-Variance Dilemma
Figure 5.1 Suppose we want to estimate a functional dependence from a set of examples (black dots). Which model is preferable? The complex model perfectly fits all data points, whereas the straight line exhibits residual errors. Statistical learning theory formalizes the role of the complexity of the model class, and gives probabilistic guarantees for the validity of the inferred model.
Overfitting
Risk
Empirical Risk
how accurate our measurements were in the first place — the model would suffer from a large variance. A related dichotomy is the one between estimation error and approximation error: if we use a small class of functions, then even the best possible solution will poorly approximate the "true" dependency, while a large class of functions will lead to a large statistical estimation error. In the terminology of applied machine learning and the design of neural networks, an overly complex explanation shows overfitting, while an overly simple explanation imposed by the learning machine design leads to underfitting. A great deal of research has gone into clever engineering tricks and heuristics; these are used, for instance, to aid in the design of neural networks which will not overfit on a given data set [397]. In neural networks, overfitting can be avoided in a number of ways, such as by choosing a number of hidden units that is not too large, by stopping the training procedure early in order not to enforce a perfect explanation of the training set, or by using weight decay to limit the size of the weights, and thus of the function class implemented by the network.

Statistical learning theory provides a solid mathematical framework for studying these questions in depth. As mentioned in Chapters 1 and 3, it makes the assumption that the data are generated by sampling from an unknown underlying distribution $P(x, y)$. The learning problem then consists in minimizing the risk (or expected loss on the test data, see Definition 3.3),

$$R[f] = \int_{\mathcal{X} \times \mathcal{Y}} c\big(x, y, f(x)\big)\, dP(x, y). \tag{5.2}$$
Here, c is a loss function. In the case of pattern recognition, where $\mathcal{Y} = \{\pm 1\}$, a common choice is the misclassification error, $c(x, y, f(x)) = \frac{1}{2}|f(x) - y|$. The difficulty of the task stems from the fact that we are trying to minimize a quantity that we cannot actually evaluate: since we do not know P, we cannot compute the integral (5.2). What we do know, however, are the training data (5.1), which are sampled from P. We can thus try to infer a function f from the training sample that is, in some sense, close to the one minimizing (5.2). To this end, we need what is called an induction principle. One way to proceed is to use the training sample to approximate the integral in (5.2) by a finite sum (see (B.18)). This leads to the empirical risk (Definition 3.4),

$R_{\mathrm{emp}}[f] = \frac{1}{m} \sum_{i=1}^m c(x_i, y_i, f(x_i)),$  (5.3)
and the empirical risk minimization (ERM) induction principle, which recommends that we choose an f that minimizes (5.3). Cast in these terms, the fundamental trade-off in learning can be stated as follows: if we allow f to be taken from a very large class of functions $\mathcal{F}$, we can always find an f that leads to a rather small value of (5.3). For instance, if we allow the use of all functions f mapping $\mathcal{X} \to \mathcal{Y}$ (in compact notation, $\mathcal{F} = \mathcal{Y}^{\mathcal{X}}$), then we can minimize (5.3) yet still be distant from the minimizer of (5.2). Considering a
pattern recognition problem with $x_i \neq x_j$ for $i \neq j$, we could set

$f(x) = \begin{cases} y_i & \text{if } x = x_i \text{ for some } i \in [m] \\ 1 & \text{otherwise.} \end{cases}$  (5.4)
This does not amount to any form of learning, however: suppose we are now given a test point drawn from the same distribution, $(x, y) \sim P(x, y)$. If $\mathcal{X}$ is a continuous domain, and we are not in a degenerate situation, the new pattern x will almost never be exactly equal to any of the training inputs $x_i$. Therefore, the learning machine will almost always predict that y = 1. If we allow all functions from $\mathcal{X}$ to $\mathcal{Y}$, then the values of the function at points $x_1, \ldots, x_m$ carry no information about the values at other points. In this situation, a learning machine cannot do better than chance. This insight lies at the core of the so-called No-Free-Lunch Theorem popularized in [608]; see also [254, 48]. The message is clear: if we make no restrictions on the class of functions from which we choose our estimate f, we cannot hope to learn anything. Consequently, machine learning research has studied various ways to implement such restrictions. In statistical learning theory, these restrictions are enforced by taking into account the complexity or capacity (measured by VC dimension, covering numbers, entropy numbers, or other concepts) of the class of functions that the learning machine can implement.¹ In the Bayesian approach, a similar effect is achieved by placing prior distributions P(f) over the class of functions (Chapter 16). This may sound fundamentally different, but it leads to algorithms which are closely related; and on the theoretical side, recent progress has highlighted intriguing connections [92, 91, 353, 238].
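The bias-variance dilemma of the opening example is easy to reproduce numerically. The following Python sketch (an illustration only; the linear target, noise level, sample size, and polynomial degrees are arbitrary choices, not taken from the text) fits a low-degree and a high-degree polynomial to noisy samples of a linear dependency and estimates their test errors over repeated draws:

import numpy as np

rng = np.random.default_rng(0)

def fit_and_eval(degree, trials=100, m=10):
    """Fit polynomials of the given degree to noisy linear data and
    return the average squared test error over independent draws."""
    x_test = np.linspace(0, 1, 50)
    errors = []
    for _ in range(trials):
        x = rng.uniform(0, 1, m)
        y = 2.0 * x + rng.normal(0, 0.1, m)   # the true dependency is linear
        coeffs = np.polyfit(x, y, degree)     # least-squares polynomial fit
        y_pred = np.polyval(coeffs, x_test)
        errors.append(np.mean((y_pred - 2.0 * x_test) ** 2))
    return np.mean(errors)

print("degree 1 test error:", fit_and_eval(1))   # small bias, small variance
print("degree 9 test error:", fit_and_eval(9))   # zero training error, large variance

The high-degree fit achieves zero training error on each draw, yet its test error is dominated by the variance of the estimated coefficients.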
5.2 The Law of Large Numbers

Let us step back and try to look at the problem from a slightly different angle. Consider the case of pattern recognition using the misclassification loss function. Given a fixed function f, then for each example, the loss $\xi_i := \frac{1}{2}|f(x_i) - y_i|$ is either

1. As an aside, note that the same problem applies to training on the test set (sometimes called data snooping): sometimes, people optimize tuning parameters of a learning machine by looking at how they change the results on an independent test set. Unfortunately, once one has adjusted the parameter in this way, the test set is not independent anymore. This is identical to the corresponding problem in training on the training set: once we have chosen the function to minimize the training error, the latter no longer provides an unbiased estimate of the test error. Overfitting occurs much faster on the training set, however, than it does on the test set. This is usually due to the fact that the number of tuning parameters of a learning machine is much smaller than the total number of parameters, and thus the capacity tends to be smaller. For instance, an SVM for pattern recognition typically has two tuning parameters, and optimizes m weight parameters (for a training set size of m). See also Problem 5.3 and [461].
0 or 1 (provided we have a ±1-valued function f), and all examples are drawn independently. In the language of probability theory, we are faced with Bernoulli trials. The $\xi_1, \ldots, \xi_m$ are independently sampled from a random variable

$\xi := \tfrac{1}{2}|f(x) - y|, \quad \text{where } (x, y) \sim P(x, y).$  (5.5)
Chernoff Bound
A famous inequality due to Chernoff [107] characterizes how the empirical mean $\frac{1}{m}\sum_{i=1}^m \xi_i$ converges to the expected value (or expectation) of $\xi$, denoted by $\mathbf{E}(\xi)$:

$P\Big\{\Big|\frac{1}{m}\sum_{i=1}^m \xi_i - \mathbf{E}(\xi)\Big| \geq \epsilon\Big\} \leq 2\exp(-2m\epsilon^2).$  (5.6)
Note that the P refers to the probability of getting a sample $\xi_1, \ldots, \xi_m$ with the property $\big|\frac{1}{m}\sum_{i=1}^m \xi_i - \mathbf{E}(\xi)\big| \geq \epsilon$. Mathematically speaking, P strictly refers to a so-called product measure (cf. (B.11)). We will presently avoid further mathematical detail; more information can be found in Appendix B. In some instances, we will use a more general bound, due to Hoeffding (Theorem 5.1). Presently, we formulate and prove a special case of the Hoeffding bound, which implies (5.6). Note that in the following statement, the $\xi_i$ are no longer restricted to take values in {0, 1}.
Hoeffding Bound
Theorem 5.1 (Hoeffding [244]) Let $\xi_i$, $i \in [m]$, be m independent instances of a bounded random variable $\xi$, with values in [a, b]. Denote their average by $Q_m = \frac{1}{m}\sum_i \xi_i$. Then for any $\epsilon > 0$,

$P\{Q_m - \mathbf{E}(\xi) \geq \epsilon\} \leq \exp\Big(-\frac{2m\epsilon^2}{(b-a)^2}\Big) \quad \text{and} \quad P\{\mathbf{E}(\xi) - Q_m \geq \epsilon\} \leq \exp\Big(-\frac{2m\epsilon^2}{(b-a)^2}\Big).$  (5.7)
The proof is carried out using a technique commonly known as Chernoff's bounding method [107]. The proof technique is widely applicable, and generates bounds such as Bernstein's inequality [44] (exponential bounds based on the variance of random variables), as well as concentration-of-measure inequalities (see, e.g., [356, 66]). Readers not interested in the technical details underlying laws of large numbers may want to skip the following discussion. We start with an auxiliary inequality.

Lemma 5.2 (Markov's Inequality (e.g., [136])) Denote by $\xi$ a nonnegative random variable with distribution P. Then for all $\lambda > 0$, the following inequality holds:

$P\{\xi \geq \lambda\} \leq \frac{\mathbf{E}(\xi)}{\lambda}.$  (5.8)
Proof
Using the definition of $\mathbf{E}(\xi)$, we have

$\mathbf{E}(\xi) = \int_0^\infty \xi \, dP(\xi) \geq \int_\lambda^\infty \xi \, dP(\xi) \geq \lambda \int_\lambda^\infty dP(\xi) = \lambda \, P\{\xi \geq \lambda\}.$
Proof of Theorem 5.1. Without loss of generality, we assume that $\mathbf{E}(\xi) = 0$ (otherwise, simply define a random variable $\tilde\xi := \xi - \mathbf{E}(\xi)$ and use the latter in the proof). Chernoff's bounding method consists in transforming a random variable $\xi$ into $\exp(s\xi)$ (s > 0), and applying Markov's inequality to it. Depending on $\xi$, we can obtain different bounds. In our case, we use

$P\{Q_m \geq \epsilon\} = P\big\{e^{sQ_m} \geq e^{s\epsilon}\big\} \leq e^{-s\epsilon}\, \mathbf{E}\big[e^{sQ_m}\big] = e^{-s\epsilon} \prod_{i=1}^m \mathbf{E}\big[e^{\frac{s}{m}\xi_i}\big].$  (5.10)
In (5.10), we exploited the fact that for independent random variables, $\mathbf{E}\big[\prod_i \xi_i\big] = \prod_i \mathbf{E}[\xi_i]$. Since the inequality holds independent of the choice of s, we may minimize over s to obtain a bound that is as tight as possible. To this end, we transform the expectation of $\exp\big(\frac{s}{m}\xi_i\big)$ into something more amenable. The derivation is rather technical; thus we state it without proof [244]: $\mathbf{E}\big[\exp\big(\frac{s}{m}\xi_i\big)\big] \leq \exp\big(\frac{s^2(b-a)^2}{8m^2}\big)$. From this, we conclude that the optimal value of s is given by $s = \frac{4m\epsilon}{(b-a)^2}$. Substituting this value into the right hand side of (5.10) proves the bound.

Let us now return to (5.6). Substituting (5.5) into (5.6), we have a bound which states how likely it is that for a given function f, the empirical risk is close to the actual risk,

$P\big\{\big|R_{\mathrm{emp}}[f] - R[f]\big| \geq \epsilon\big\} \leq 2\exp(-2m\epsilon^2).$  (5.11)
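Bounds of this form are straightforward to check by simulation. The sketch below (an illustration; the Bernoulli parameter p, the deviation $\epsilon$, and the sample sizes are arbitrary choices) estimates the deviation probability of the empirical mean of Bernoulli trials and compares it with the right hand side of (5.6):

import numpy as np

rng = np.random.default_rng(1)
m, eps, p, trials = 200, 0.1, 0.3, 20000

# empirical means of m Bernoulli(p) losses, repeated over many trials
means = rng.binomial(1, p, size=(trials, m)).mean(axis=1)
freq = np.mean(np.abs(means - p) > eps)

print("observed P{|mean - E| > eps}:", freq)
print("Chernoff bound 2 exp(-2 m eps^2):", 2 * np.exp(-2 * m * eps**2))

The observed frequency typically lies well below the bound, which holds for every distribution of the losses and is therefore not tight for any particular one.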
Using Hoeffding's inequality, a similar bound can be given for the case of regression estimation, provided the loss c(x, y, f(x)) is bounded. For any fixed function, the training error thus provides an unbiased estimate of the test error. Moreover, the convergence (in probability) $R_{\mathrm{emp}}[f] \to R[f]$ as $m \to \infty$ is exponentially fast in the number of training examples.² Although this sounds just about as good as we could possibly have hoped, there is one caveat: a crucial property of both the Chernoff and the Hoeffding bound is that they are probabilistic in nature. They state that the probability of a large deviation between test error and training error of f is small; the larger the sample size m, the smaller the probability. Granted, they do not rule out the presence of cases where the deviation is large, and our learning machine will have many functions that it can implement. Could there be a function for which things go wrong? It appears that

2. Convergence in probability, denoted as $R_{\mathrm{emp}}[f] \xrightarrow{P} R[f]$,
means that for all $\epsilon > 0$, we have $\lim_{m \to \infty} P\big\{\big|R_{\mathrm{emp}}[f] - R[f]\big| > \epsilon\big\} = 0$.
we would be very unlucky for this to occur precisely for the function f chosen by empirical risk minimization. At first sight, it seems that empirical risk minimization should work, in contradiction to our lengthy explanation in the last section, arguing that we have to do more than that. What is the catch?
5.3 When Does Learning Work: the Question of Consistency
Consistency
It turns out that in the last section, we were too sloppy. When we find a function f by choosing it to minimize the training error, we are no longer looking at independent Bernoulli trials. We are actually choosing f such that the mean of the $\xi_i$ is as small as possible. In this sense, we are actively looking for the worst case, for a function which is very atypical with respect to the average loss (i.e., the empirical risk) that it will produce. We should thus state more clearly what it is that we actually need for empirical risk minimization to work. This is best expressed in terms of a notion that statisticians call consistency. It amounts to saying that as the number of examples m tends to infinity, we want the function $f_m$ that minimizes $R_{\mathrm{emp}}[f]$ (note that $f_m$ need not be unique) to lead to a test error which converges to the lowest achievable value. In other words, $f_m$ is asymptotically as good as whatever we could have done if we were able to directly minimize R[f] (which we cannot, as we do not even know it). In addition, consistency requires that asymptotically, the training and the test error of $f_m$ be identical.³ It turns out that without restricting the set of admissible functions, empirical risk minimization is not consistent. The main insight of VC (Vapnik-Chervonenkis) theory is that actually, the worst case over all functions that the learning machine can implement determines the consistency of empirical risk minimization. In other words, we need a version of the law of large numbers which is uniform over all functions that the learning machine can implement.
5.4 Uniform Convergence and Consistency

The present section will explain how consistency can be characterized by a uniform convergence condition on the set of functions $\mathcal{F}$ that the learning machine can implement. Figure 5.2 gives a simplified depiction of the question of consistency. Both the empirical risk and the actual risk are drawn as functions of f. For

3. We refrain from giving a more formal definition of consistency, the reason being that there are some caveats to this classical definition of consistency; these would necessitate a discussion leading us away from the main thread of the argument. For the precise definition of the required notion of "nontrivial consistency," see [561].
Figure 5.2 Simplified depiction of the convergence of empirical risk to actual risk. The x axis gives a one-dimensional representation of the function class; the y axis denotes the risk (error). For each fixed function f, the law of large numbers tells us that as the sample size goes to infinity, the empirical risk $R_{\mathrm{emp}}[f]$ converges towards the true risk R[f] (indicated by the downward arrow). This does not imply, however, that in the limit of infinite sample sizes, the minimizer of the empirical risk, $f_m$, will lead to a value of the risk that is as good as the best attainable risk, $R[f_{\mathrm{opt}}]$ (consistency). For the latter to be true, we require the convergence of $R_{\mathrm{emp}}[f]$ towards R[f] to be uniform over all functions that the learning machine can implement (see text).
simplicity, we have summarized all possible functions f by a single axis of the plot. Empirical risk minimization consists in picking the f that yields the minimal value of $R_{\mathrm{emp}}$. If it is consistent, then the minimum of $R_{\mathrm{emp}}$ converges to that of R in probability. Let us denote the minimizer of R by $f_{\mathrm{opt}}$, satisfying

$R[f_{\mathrm{opt}}] \leq R[f]$  (5.12)
for all $f \in \mathcal{F}$. This is the optimal choice that we could make, given complete knowledge of the distribution P.⁴ Similarly, since $f_m$ minimizes the empirical risk, we have

$R_{\mathrm{emp}}[f_m] \leq R_{\mathrm{emp}}[f]$  (5.13)
for all $f \in \mathcal{F}$. Being true for all $f \in \mathcal{F}$, (5.12) and (5.13) hold in particular for $f_m$ and $f_{\mathrm{opt}}$. If we substitute the former into (5.12) and the latter into (5.13), we obtain

$R[f_m] - R[f_{\mathrm{opt}}] \geq 0$  (5.14)
and

$R_{\mathrm{emp}}[f_{\mathrm{opt}}] - R_{\mathrm{emp}}[f_m] \geq 0.$  (5.15)
4. As with $f_m$, $f_{\mathrm{opt}}$ need not be unique.
The sum of these two inequalities satisfies

$0 \leq R[f_m] - R[f_{\mathrm{opt}}] \leq \big(R[f_m] - R_{\mathrm{emp}}[f_m]\big) + \big(R_{\mathrm{emp}}[f_{\mathrm{opt}}] - R[f_{\mathrm{opt}}]\big).$  (5.16)
Let us first consider the second half of the right hand side. Due to the law of large numbers, we have convergence in probability, i.e., for all $\epsilon > 0$,

$\lim_{m \to \infty} P\big\{\big|R_{\mathrm{emp}}[f_{\mathrm{opt}}] - R[f_{\mathrm{opt}}]\big| > \epsilon\big\} = 0.$  (5.17)
Uniform Convergence of Risk
This holds true since $f_{\mathrm{opt}}$ is a fixed function, which is independent of the training sample (see (5.11)). The important conclusion is that if the empirical risk converges to the actual risk one-sided uniformly, over all functions that the learning machine can implement,

$\lim_{m \to \infty} P\Big\{\sup_{f \in \mathcal{F}} \big(R[f] - R_{\mathrm{emp}}[f]\big) > \epsilon\Big\} = 0 \quad \text{for all } \epsilon > 0,$  (5.18)
then the left hand sides of (5.14) and (5.15) will likewise converge to 0;

$R[f_m] - R[f_{\mathrm{opt}}] \xrightarrow{P} 0$  (5.19)

and

$R_{\mathrm{emp}}[f_{\mathrm{opt}}] - R_{\mathrm{emp}}[f_m] \xrightarrow{P} 0.$  (5.20)
As argued above, (5.17) is not always true for $f_m$, since $f_m$ is chosen to minimize $R_{\mathrm{emp}}$, and thus depends on the sample. Assuming that (5.18) holds true, however, then (5.19) and (5.20) imply that in the limit, $R[f_m]$ cannot be larger than $R_{\mathrm{emp}}[f_m]$. One-sided uniform convergence on $\mathcal{F}$ is thus a sufficient condition for consistency of the empirical risk minimization over $\mathcal{F}$.⁵ What about the other way round? Is one-sided uniform convergence also a necessary condition? Part of the mathematical beauty of VC theory lies in the fact that this is the case. We cannot go into the necessary details to prove this [571, 561, 562], and only state the main result. Note that this theorem uses the notion of nontrivial consistency that we already mentioned briefly in footnote 3. In a nutshell, this concept requires that the induction principle be consistent even after the "best" functions have been removed. Nontrivial consistency thus rules out, for instance, the case in which the problem is trivial, due to the existence of a function which uniformly does better than all other functions. To understand this, assume that there exists such a function. Since this function is uniformly better than all others, we can already select this function (using ERM) from one (arbitrary) data point. Hence the method would be trivially consistent, no matter what the

5. Note that the one-sidedness of the convergence comes from the fact that we only require consistency of empirical risk minimization. If we required the same for empirical risk maximization, then we would end up with standard uniform convergence, and the parentheses in (5.18) would be replaced with modulus signs.
rest of the function class looks like. Having one function which gets picked as soon as we have seen one data point would essentially void the inherently asymptotic notion of consistency.

Theorem 5.3 (Vapnik & Chervonenkis (e.g., [562])) One-sided uniform convergence in probability,

$\lim_{m \to \infty} P\Big\{\sup_{f \in \mathcal{F}} \big(R[f] - R_{\mathrm{emp}}[f]\big) > \epsilon\Big\} = 0$  (5.21)
for all $\epsilon > 0$, is a necessary and sufficient condition for nontrivial consistency of empirical risk minimization.

As explained above, consistency, and thus learning, crucially depends on the set of functions. In Section 5.1, we gave an example where we considered the set of all possible functions, and showed that learning was impossible. The dependence of learning on the set of functions has now returned in a different guise: the condition of uniform convergence will crucially depend on the set of functions for which it must hold. The abstract characterization in Theorem 5.3 of consistency as a uniform convergence property, whilst theoretically intriguing, is not all that useful in practice. We do not want to check some fairly abstract convergence property every time we want to use a learning machine. Therefore, we next address whether there are properties of learning machines, i.e., of sets of functions, which ensure uniform convergence of risks.
5.5 How to Derive a VC Bound

We now take a closer look at the subject of Theorem 5.3; the probability

$P\Big\{\sup_{f \in \mathcal{F}} \big(R[f] - R_{\mathrm{emp}}[f]\big) > \epsilon\Big\}.$  (5.22)
We give a simplified account, drawing from the expositions of [561, 562, 415, 238]. We do not aim to describe or even develop the theory to the extent that would be necessary to give precise bounds for SVMs, say. Instead, our goal will be to convey central insights rather than technical details. For more complete treatments geared specifically towards SVMs, cf. [562, 491, 24]. We focus on the case of pattern recognition; that is, on functions taking values in {±1}. Two tricks are needed along the way: the union bound and the method of symmetrization by a ghost sample.

5.5.1 The Union Bound
Suppose the set $\mathcal{F}$ consists of two functions, $f_1$ and $f_2$. In this case, uniform convergence of risk trivially follows from the law of large numbers, which holds
for each of the two. To see this, let

$C_\epsilon^i := \big\{\big((x_1, y_1), \ldots, (x_m, y_m)\big) \,\big|\, \big(R[f_i] - R_{\mathrm{emp}}[f_i]\big) > \epsilon\big\}$  (5.23)
denote the set of samples for which the risks of $f_i$ differ by more than $\epsilon$. Then, by definition, we have

$P\Big\{\sup_{f \in \{f_1, f_2\}} \big(R[f] - R_{\mathrm{emp}}[f]\big) > \epsilon\Big\} = P\big(C_\epsilon^1 \cup C_\epsilon^2\big).$  (5.24)
The latter, however, can be rewritten as

$P\big(C_\epsilon^1 \cup C_\epsilon^2\big) = P\big(C_\epsilon^1\big) + P\big(C_\epsilon^2\big) - P\big(C_\epsilon^1 \cap C_\epsilon^2\big) \leq P\big(C_\epsilon^1\big) + P\big(C_\epsilon^2\big),$  (5.25)
where the last inequality follows from the fact that P is nonnegative. Similarly, if $\mathcal{F} = \{f_1, \ldots, f_n\}$, we have

$P\Big\{\sup_{f \in \mathcal{F}} \big(R[f] - R_{\mathrm{emp}}[f]\big) > \epsilon\Big\} \leq \sum_{i=1}^n P\big(C_\epsilon^i\big).$  (5.26)
Union Bound
This inequality is called the union bound. As it is a crucial step in the derivation of risk bounds, it is worthwhile to emphasize that it becomes an equality if and only if all the events involved are disjoint. In practice, this is rarely the case, and we therefore lose a lot when applying (5.26). It is a step with a large "slack." Nevertheless, when $\mathcal{F}$ is finite, we may simply apply the law of large numbers (5.11) for each individual $P(C_\epsilon^i)$, and the sum in (5.26) then leads to a constant factor n on the right hand side of the bound; it does not change the exponentially fast convergence of the empirical risk towards the actual risk. In the next section, we describe an ingenious trick used by Vapnik and Chervonenkis to reduce the infinite case to the finite one. It consists of introducing what is sometimes called a ghost sample.
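The slack incurred by (5.26) can be observed numerically even for tiny function classes. The following sketch (an illustration; the class of five threshold functions on [0, 1] and all parameter values are our own choices) estimates the uniform deviation probability by simulation and compares it with n times the single-function one-sided Chernoff bound:

import numpy as np

rng = np.random.default_rng(2)
m, eps, trials = 100, 0.15, 20000
thresholds = np.linspace(0.1, 0.9, 5)   # a finite class of n = 5 functions

def risk(t):
    # true risk of sgn(x - t) when labels are sgn(x - 0.5), x uniform on [0, 1]
    return abs(t - 0.5)

sup_dev = np.zeros(trials)
for k in range(trials):
    x = rng.uniform(0, 1, m)
    y = np.sign(x - 0.5)
    devs = [risk(t) - np.mean(np.sign(x - t) != y) for t in thresholds]
    sup_dev[k] = max(devs)

print("observed P{sup_f (R - R_emp) > eps}:", np.mean(sup_dev > eps))
print("union bound n * exp(-2 m eps^2)    :", len(thresholds) * np.exp(-2 * m * eps**2))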
5.5.2 Symmetrization
The central observation in this section is that we can bound (5.22) in terms of a probability of an event referring to a finite function class. Note first that the empirical risk term in (5.22) effectively refers only to a finite function class: for any given training sample of m points $x_1, \ldots, x_m$, the functions of $\mathcal{F}$ can take at most $2^m$ different values $(y_1, \ldots, y_m)$ (recall that the $y_i$ take values only in {±1}). In addition, the probability that the empirical risk differs from the actual risk by more than $\epsilon$ can be bounded by twice the probability that it differs from the empirical risk on a second sample of size m by more than $\epsilon/2$.
Symmetrization
Lemma 5.4 (Symmetrization (Vapnik & Chervonenkis) (e.g., [559])) For $m\epsilon^2 \geq 2$, we have

$P\Big\{\sup_{f \in \mathcal{F}} \big(R[f] - R_{\mathrm{emp}}[f]\big) > \epsilon\Big\} \leq 2\, P\Big\{\sup_{f \in \mathcal{F}} \big(R_{\mathrm{emp}}[f] - R'_{\mathrm{emp}}[f]\big) > \epsilon/2\Big\}.$  (5.27)
Here, the first P refers to the distribution of iid samples of size m, while the second one
refers to iid samples of size 2m. In the latter case, $R_{\mathrm{emp}}$ measures the loss on the first half of the sample, and $R'_{\mathrm{emp}}$ on the second half. Although we do not prove this result, it should be fairly plausible: if the empirical error rates on two independent m-samples are close to each other, then they should also be close to the true error rate.
Shattering Coefficient
Shattering
5.5.3 The Shattering Coefficient
The main result of Lemma 5.4 is that it implies, for the purpose of bounding (5.22), that the function class $\mathcal{F}$ is effectively finite: restricted to the 2m points appearing on the right hand side of (5.27), it has at most $2^{2m}$ elements. This is because only the outputs of the functions on the patterns of the sample count, and there are 2m patterns with two possible outputs, ±1. The number of effectively different functions can be smaller than $2^{2m}$, however; and for our purposes, this is the case that will turn out to be interesting. Let $Z_{2m} := ((x_1, y_1), \ldots, (x_{2m}, y_{2m}))$ be the given 2m-sample. Denote by $\mathcal{N}(\mathcal{F}, Z_{2m})$ the cardinality of $\mathcal{F}$ when restricted to $\{x_1, \ldots, x_{2m}\}$, that is, the number of functions from $\mathcal{F}$ that can be distinguished by their values on $\{x_1, \ldots, x_{2m}\}$. Let us, moreover, denote the maximum (over all possible choices of a 2m-sample) number of functions that can be distinguished in this way as $\mathcal{N}(\mathcal{F}, 2m)$. The function $\mathcal{N}(\mathcal{F}, m)$ is referred to as the shattering coefficient, or in the more general case of regression estimation, the covering number of $\mathcal{F}$.⁶ In the case of pattern recognition, which is what we are currently looking at, $\mathcal{N}(\mathcal{F}, m)$ has a particularly simple interpretation: it is the number of different outputs $(y_1, \ldots, y_m)$ that the functions in $\mathcal{F}$ can achieve on samples of a given size.⁷ In other words, it simply measures the number of ways that the function class can separate the patterns into two classes. Whenever $\mathcal{N}(\mathcal{F}, m) = 2^m$, all possible separations can be implemented by functions of the class. In this case, the function class is said to shatter m points. Note that this means that there exists a set of m patterns which can be separated in all possible ways; it does not mean that this applies to all sets of m patterns.
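For simple classes, the shattering coefficient can be computed by exhaustive enumeration. The sketch below (an illustration; the class of one-dimensional threshold functions sgn(x − t) is chosen purely for convenience) counts the distinct label vectors realized on a random sample, and shows that the count grows like m + 1, far below $2^m$:

import numpy as np

def shattering_coefficient(x):
    """Number of distinct labelings of the sample x achievable by
    the class {x -> sign(x - t), t real}, computed by brute force."""
    # candidate thresholds: below, between, and above the sorted points
    xs = np.sort(x)
    cands = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    labelings = {tuple(np.where(x - t > 0, 1, -1)) for t in cands}
    return len(labelings)

rng = np.random.default_rng(3)
for m in [2, 5, 10, 20]:
    x = rng.uniform(0, 1, m)
    print(m, shattering_coefficient(x), 2**m)   # N(F, m) = m + 1 << 2^m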
5.5.4 Uniform Convergence Bounds
Let us now take a closer look at the probability that for a 2m-sample $Z_{2m}$ drawn iid from P, we get a one-sided uniform deviation larger than $\epsilon/2$ (cf. (5.27)),

$P\Big\{\sup_{f \in \mathcal{F}} \big(R_{\mathrm{emp}}[f] - R'_{\mathrm{emp}}[f]\big) > \epsilon/2\Big\}.$  (5.29)
6. In regression estimation, the covering number also depends on the accuracy within which we are approximating the function class, and on the loss function used; see Section 12.4 for more details. 7. Using the zero-one loss $c(x, y, f(x)) = \frac{1}{2}|f(x) - y| \in \{0, 1\}$, it also equals the number of different loss vectors $(c(x_1, y_1, f(x_1)), \ldots, c(x_m, y_m, f(x_m)))$.
The basic idea now is to pick a maximal set of functions $\{f_1, \ldots, f_{\mathcal{N}(\mathcal{F}, Z_{2m})}\}$ that can be distinguished based on their values on $Z_{2m}$, then use the union bound, and finally bound each term using the Chernoff inequality. However, the fact that the $f_i$ depend on the sample $Z_{2m}$ will make things somewhat more complicated. To deal with this, we have to introduce an auxiliary step of randomization, using a uniform distribution over permutations $\sigma$ of the 2m-sample $Z_{2m}$. Let us denote the empirical risks on the two halves of the sample after the permutation by $R^\sigma_{\mathrm{emp}}[f]$ and $R'^\sigma_{\mathrm{emp}}[f]$. Since the sample is drawn iid, a random permutation applied before splitting it leaves the probability (5.29) unchanged; it thus equals

$P_{Z_{2m}, \sigma}\Big\{\sup_{f \in \mathcal{F}} \big(R^\sigma_{\mathrm{emp}}[f] - R'^\sigma_{\mathrm{emp}}[f]\big) > \epsilon/2\Big\},$
where the subscripts of P were added to clarify what the distribution refers to. We next rewrite this as

$\mathbf{E}_{Z_{2m}}\Big[P_\sigma\Big\{\sup_{f \in \mathcal{F}} \big(R^\sigma_{\mathrm{emp}}[f] - R'^\sigma_{\mathrm{emp}}[f]\big) > \epsilon/2 \,\Big|\, Z_{2m}\Big\}\Big].$  (5.30)
We can now express the event $C_\epsilon := \big\{\sigma \,\big|\, \sup_{f \in \mathcal{F}} \big(R^\sigma_{\mathrm{emp}}[f] - R'^\sigma_{\mathrm{emp}}[f]\big) > \epsilon/2\big\}$ as

$C_\epsilon = \bigcup_{n=1}^{\mathcal{N}(\mathcal{F}, Z_{2m})} C_\epsilon(f_n),$  (5.31)
where the events $C_\epsilon(f_n) := \big\{\sigma \,\big|\, \big(R^\sigma_{\mathrm{emp}}[f_n] - R'^\sigma_{\mathrm{emp}}[f_n]\big) > \epsilon/2\big\}$ refer to individual functions $f_n$, chosen such that $\big(\bigcup_n \{f_n\}\big)\big|_{Z_{2m}} = \mathcal{F}\big|_{Z_{2m}}$. Note that the functions $f_n$ may be considered as fixed, since we have conditioned on $Z_{2m}$. We are now in a position to appeal to the classical law of large numbers. Our random experiment consists of drawing $\sigma$ from the uniform distribution over all permutations of 2m-samples. This turns our sequence of losses $\xi_i = \frac{1}{2}|f(x_i) - y_i|$ ($i = 1, \ldots, 2m$) into an iid sequence of Bernoulli trials. We then apply a modified Chernoff inequality to bound the probability of each event $C_\epsilon(f_n)$. It states that given a 2m-sample of Bernoulli trials, we have (see Problem 5.4)

$P\Big\{\frac{1}{m}\sum_{i=1}^m \xi_i - \frac{1}{m}\sum_{i=m+1}^{2m} \xi_i > \epsilon/2\Big\} \leq 2\exp\Big(-\frac{m\epsilon^2}{8}\Big).$  (5.32)
For our present problem, we thus obtain

$P_\sigma\big(C_\epsilon(f_n)\big) \leq 2\exp\Big(-\frac{m\epsilon^2}{8}\Big),$  (5.33)
independent of $f_n$. We next use the union bound to get a bound on the probability of the event $C_\epsilon$ defined in (5.31). We obtain a sum over $\mathcal{N}(\mathcal{F}, Z_{2m})$ identical terms of the form (5.33). Hence (5.30) (and (5.29)) can be bounded from above by

$2\, \mathbf{E}\big[\mathcal{N}(\mathcal{F}, Z_{2m})\big] \exp\Big(-\frac{m\epsilon^2}{8}\Big),$  (5.34)
where the expectation is taken over the random drawing of $Z_{2m}$. The last step is to combine this with Lemma 5.4, to obtain

$P\Big\{\sup_{f \in \mathcal{F}} \big(R[f] - R_{\mathrm{emp}}[f]\big) > \epsilon\Big\} \leq 4\, \mathbf{E}\big[\mathcal{N}(\mathcal{F}, Z_{2m})\big] \exp\Big(-\frac{m\epsilon^2}{8}\Big).$  (5.35)
Inequality of Vapnik-Chervonenkis Type
We conclude that provided $\mathbf{E}[\mathcal{N}(\mathcal{F}, Z_{2m})]$ does not grow exponentially in m (i.e., provided $\ln \mathbf{E}[\mathcal{N}(\mathcal{F}, Z_{2m})]$ grows sublinearly), it is actually possible to make nontrivial statements about the test error of learning machines. The above reasoning is essentially the VC style analysis. Similar bounds can be obtained using a strategy which is more common in the field of empirical processes, by first proving that $\sup_{f \in \mathcal{F}} \big(R[f] - R_{\mathrm{emp}}[f]\big)$ is concentrated around its mean [554, 14].
5.5.5 Confidence Intervals

Risk Bound
It is sometimes useful to rewrite (5.35) such that we specify the probability with which we want the bound to hold, and then get the confidence interval, which tells us how close the risk should be to the empirical risk. This can be achieved by setting the right hand side of (5.35) equal to some $\delta > 0$, and then solving for $\epsilon$. As a result, we get the statement that with a probability of at least $1 - \delta$,

$R[f] \leq R_{\mathrm{emp}}[f] + \sqrt{\frac{8}{m}\Big(\ln \mathbf{E}\big[\mathcal{N}(\mathcal{F}, Z_{2m})\big] + \ln\frac{4}{\delta}\Big)}.$  (5.36)
Note that this bound holds independent of f; in particular, it holds for the function $f_m$ minimizing the empirical risk. This is not only a strength, but also a weakness in the bound. It is a strength, since many learning machines do not truly minimize the empirical risk, and the bound thus holds for them, too. It is a weakness, since by taking into account more information on which function we are interested in, one could hope to get more accurate bounds. We will return to this issue in Section 12.1. Bounds like (5.36) can be used to justify induction principles different from the empirical risk minimization principle. Vapnik and Chervonenkis [569, 559] proposed minimizing the right hand side of these bounds, rather than just the empirical risk.

Structural Risk Minimization

The confidence term, in the present case, then ensures that the chosen function, denoted $f^*$, not only leads to a small risk, but also comes from a function class with small capacity. The capacity term is a property of the function class $\mathcal{F}$, and not of any individual function f. Thus, the bound cannot simply be minimized over choices of f. Instead, we introduce a so-called structure on $\mathcal{F}$, and minimize over the elements of the structure. This leads to an induction principle called structural risk minimization. We leave out the technicalities involved [559, 136, 562]. The main idea is depicted in Figure 5.3.
Figure 5.3 Graphical depiction of the structural risk minimization (SRM) induction principle. The function class is decomposed into a nested sequence of subsets of increasing size (and thus, of increasing capacity). The SRM principle picks a function $f^*$ which has small training error, and comes from an element of the structure that has low capacity h, thus minimizing a risk bound of type (5.36).
For practical purposes, we usually employ bounds of the type (5.36) as a guideline for coming up with risk functionals (see Section 4.1). Often, the risk functionals form a compromise between quantities that should be minimized from a statistical point of view, and quantities that can be minimized efficiently (cf. Problem 5.7). There exists a large number of bounds similar to (5.35) and its alternative form (5.36). Differences occur in the constants, both in front of the exponential and in its exponent. The bounds also differ in the exponent of $\epsilon$, in some cases by a factor greater than 2. For instance, if a training error of zero is achievable, we can use Bernstein's inequality instead of Chernoff's result, which leads to $\epsilon$ rather than $\epsilon^2$. For further details, cf. [136, 562, 492, 238]. Finally, the bounds differ in the way they measure capacity. So far, we have used covering numbers, but this is not the only method.
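As a concrete illustration of how such a bound is evaluated, the following sketch computes the confidence term of (5.36) as reconstructed above (the constants 8 and 4 are those of that reconstruction; the capacity value plugged in anticipates the growth function bound (5.46) of the next section, with an arbitrary VC dimension h = 10):

import numpy as np

def confidence_term(m, ln_N, delta):
    """Confidence term of a (5.36)-type bound, given the sample size m,
    the (log) capacity ln E[N(F, Z_2m)], and the confidence level delta."""
    return np.sqrt((8 / m) * (ln_N + np.log(4 / delta)))

# the interval shrinks if the capacity grows only sublinearly in m,
# e.g. like h * (ln(2m / h) + 1) for a class of VC dimension h
for m in [100, 1000, 10000, 100000]:
    h = 10
    print(m, confidence_term(m, h * (np.log(2 * m / h) + 1), 0.05))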
5.5.6 The VC Dimension and Other Capacity Concepts
So far, we have formulated the bounds in terms of the so-called annealed entropy $\ln \mathbf{E}[\mathcal{N}(\mathcal{F}, Z_{2m})]$. This led to statements that depend on the distribution, and thus can take into account characteristics of the problem at hand. The downside is that they are usually difficult to evaluate; moreover, in most problems, we do not have knowledge of the underlying distribution. However, a number of different capacity concepts, with different properties, can take the role of the term $\ln(\mathbf{E}[\mathcal{N}(\mathcal{F}, Z_{2m})])$ in (5.36).

• Given an example (x, y), $f \in \mathcal{F}$ causes a loss that we denote by $c(x, y, f(x)) := \frac{1}{2}|f(x) - y| \in \{0, 1\}$. For a larger sample $(x_1, y_1), \ldots, (x_m, y_m)$, the different functions
VC Entropy
$f \in \mathcal{F}$ lead to a set of loss vectors $\xi_f = (c(x_1, y_1, f(x_1)), \ldots, c(x_m, y_m, f(x_m)))$, whose cardinality we denote by $\mathcal{N}(\mathcal{F}, (x_1, y_1), \ldots, (x_m, y_m))$. The VC entropy is defined as

$H_{\mathcal{F}}(m) := \mathbf{E}\big[\ln \mathcal{N}(\mathcal{F}, (x_1, y_1), \ldots, (x_m, y_m))\big],$  (5.37)
where the expectation is taken over the random generation of the m-sample $(x_1, y_1), \ldots, (x_m, y_m)$ from P. One can show [562] that the convergence

$\lim_{m \to \infty} \frac{H_{\mathcal{F}}(m)}{m} = 0$  (5.38)
is equivalent to uniform (two-sided) convergence of risk,

$\lim_{m \to \infty} P\Big\{\sup_{f \in \mathcal{F}} \big|R[f] - R_{\mathrm{emp}}[f]\big| > \epsilon\Big\} = 0$  (5.39)
for all $\epsilon > 0$. By Theorem 5.3, (5.39) thus implies consistency of empirical risk minimization.

Annealed Entropy
• If we exchange the expectation E and the logarithm in (5.37), we obtain the annealed entropy used above,

$H^{\mathrm{ann}}_{\mathcal{F}}(m) := \ln \mathbf{E}\big[\mathcal{N}(\mathcal{F}, (x_1, y_1), \ldots, (x_m, y_m))\big].$  (5.40)
Since the logarithm is a concave function, the annealed entropy is an upper bound on the VC entropy. Therefore, whenever the annealed entropy satisfies a condition of the form (5.38), the same automatically holds for the VC entropy. One can show that the convergence

$\lim_{m \to \infty} \frac{H^{\mathrm{ann}}_{\mathcal{F}}(m)}{m} = 0$  (5.41)
implies exponentially fast convergence of the empirical risks to the actual risks [561].
It has recently been proven that in fact (5.41) is not only sufficient, but also necessary for this [66].
Growth Function
• We can obtain an upper bound on both entropies introduced so far, by taking a supremum over all possible samples, instead of the expectation. This leads to the growth function,

$G_{\mathcal{F}}(m) := \sup_{(x_1, y_1), \ldots, (x_m, y_m)} \ln \mathcal{N}(\mathcal{F}, (x_1, y_1), \ldots, (x_m, y_m)).$  (5.43)
Note that by definition, the growth function is the logarithm of the shattering coefficient, $G_{\mathcal{F}}(m) = \ln \mathcal{N}(\mathcal{F}, m)$. The convergence

$\lim_{m \to \infty} \frac{G_{\mathcal{F}}(m)}{m} = 0$  (5.44)
is necessary and sufficient for exponentially fast convergence of risk for all underlying distributions P.
• The next step will be to summarize the main behavior of the growth function with a single number. If $\mathcal{F}$ is as rich as possible, so that for any sample of size m, the points can be chosen such that by using functions of the learning machine, they can be separated in all $2^m$ possible ways (i.e., they can be shattered), then

$G_{\mathcal{F}}(m) = m \ln 2.$  (5.45)
VC Dimension
In this case, the convergence (5.44) does not take place, and learning will not generally be successful. What about the other case? Vapnik and Chervonenkis [567, 568] showed that either (5.45) holds true for all m, or there exists some maximal m for which (5.45) is satisfied. This number is called the VC dimension, and is denoted by h. If the maximum does not exist, the VC dimension is said to be infinite. By construction, the VC dimension is thus the maximal number of points which can be shattered by functions in $\mathcal{F}$. It is possible to prove that for m > h [568],

$G_{\mathcal{F}}(m) \leq h\Big(\ln\frac{m}{h} + 1\Big).$  (5.46)
This means that up to m = h, the growth function increases linearly with the sample size. Thereafter, it only increases logarithmically, i.e., much more slowly. This is the regime where learning can succeed. Although we do not make use of it in the present chapter, it is worthwhile to also introduce the VC dimension of a class of real-valued functions $\{f_w \mid w \in \Lambda\}$ at this stage. It is defined to equal the VC dimension of the class of indicator functions

$\{\operatorname{sgn}(f_w) \mid w \in \Lambda\}.$  (5.47)
VC Dimension for Real-Valued Functions
VC Dimension Example
In summary, we get a succession of capacity concepts,

$H_{\mathcal{F}}(m) \leq H^{\mathrm{ann}}_{\mathcal{F}}(m) \leq G_{\mathcal{F}}(m) \leq h\Big(\ln\frac{m}{h} + 1\Big) \quad (\text{the last step for } m > h).$  (5.48)
From left to right, these become less precise. The entropies on the left are distribution-dependent, but rather difficult to evaluate (see, e.g., [430, 391]). The growth function and VC dimension are distribution-independent. This is less accurate, and does not always capture the essence of a given problem, which might have a much more benign distribution than the worst case; on the other hand, we want the learning machine to work for unknown distributions. If we knew the distribution beforehand, then we would not need a learning machine anymore. Let us look at a simple example of the VC dimension. As a function class, we consider hyperplanes in $\mathbb{R}^2$, i.e.,

$f(x) = \operatorname{sgn}\big(a\,[x]_1 + b\,[x]_2 + c\big),$  (5.49)

where $[x]_1$ and $[x]_2$ denote the two coordinates of x.
Suppose we are given three points $x_1, x_2, x_3$ which are not collinear. No matter how they are labelled (that is, independent of our choice of $y_1, y_2, y_3 \in \{\pm 1\}$), we can always find parameters $a, b, c \in \mathbb{R}$ such that $f(x_i) = y_i$ for all i (see Figure 1.4 in the introduction). In other words, there exist three points that we can shatter. This
VC Dimension of Hyperplanes
VC Dimension of Margin Hyperplanes
shows that the VC dimension of the set of hyperplanes in $\mathbb{R}^2$ satisfies $h \geq 3$. On the other hand, we can never shatter four points. It follows from simple geometry that given any four points, there is always a set of labels such that we cannot realize the corresponding classification. Therefore, the VC dimension is h = 3. More generally, for hyperplanes in $\mathbb{R}^N$, the VC dimension can be shown to be h = N + 1. For a formal derivation of this result, as well as of other examples, see [523]. How does this fit together with the fact that SVMs can be shown to correspond to hyperplanes in feature spaces of possibly infinite dimension? The crucial point is that SVMs correspond to large margin hyperplanes. Once the margin enters, the capacity can be much smaller than the above general VC dimension of hyperplanes. For simplicity, we consider the case of hyperplanes containing the origin.

Theorem 5.5 (Vapnik [559]) Consider hyperplanes $\langle \mathbf{w}, x \rangle = 0$, where $\mathbf{w}$ is normalized such that they are in canonical form w.r.t. a set of points $X^* = \{x_1, \ldots, x_r\}$; i.e.,

$\min_{i=1,\ldots,r} \big|\langle \mathbf{w}, x_i \rangle\big| = 1.$  (5.50)
The set of decision functions $f_{\mathbf{w}}(x) = \operatorname{sgn}\langle x, \mathbf{w} \rangle$ defined on $X^*$, and satisfying the constraint $\|\mathbf{w}\| \leq \Lambda$, has a VC dimension satisfying

$h \leq R^2 \Lambda^2.$  (5.51)
Here, R is the radius of the smallest sphere centered at the origin and containing $X^*$. Before we give a proof, several remarks are in order.

• The theorem states that we can control the VC dimension irrespective of the dimension of the space by controlling the length of the weight vector $\|\mathbf{w}\|$. Note, however, that this needs to be done a priori, by choosing a value for $\Lambda$. It therefore does not strictly motivate what we will later see in SVMs, where $\|\mathbf{w}\|$ is minimized in order to control the capacity. Detailed treatments can be found in the work of Shawe-Taylor et al. [491, 24, 125].

• There exists a similar result for the case where R is the radius of the smallest sphere (not necessarily centered at the origin) enclosing the data, and where we allow for the possibility that the hyperplanes have a nonzero offset b [562]. For this case, we give a simple visualization in Figure 5.4, which shows that it is plausible that enforcing a large margin amounts to reducing the VC dimension.

• Note that the theorem talks about functions defined on $X^*$. To extend it to the case where the functions are defined on all of the input domain $\mathcal{X}$, it is best to state it for the fat shattering dimension. For details, see [24].

The proof [24, 222, 559] is somewhat technical, and can be skipped if desired.

Proof Let us assume that $x_1, \ldots, x_r$ are shattered by canonical hyperplanes with $\|\mathbf{w}\| \leq \Lambda$. Consequently, for all $y_1, \ldots, y_r \in \{\pm 1\}$, there exists a $\mathbf{w}$ with $\|\mathbf{w}\| \leq \Lambda$, such that

$y_i \langle \mathbf{w}, x_i \rangle \geq 1 \quad \text{for all } i \in [r].$  (5.52)
Figure 5.4 Simple visualization of the fact that enforcing a large margin of separation amounts to limiting the VC dimension. Assume that the data points are contained in a ball of radius R (cf. Theorem 5.5). Using hyperplanes with margin $\gamma_1$, it is possible to separate three points in all possible ways. Using hyperplanes with the larger margin $\gamma_2$, this is only possible for two points, hence the VC dimension in that case is two rather than three.
The proof proceeds in two steps. In the first part, we prove that the more points we want to shatter (5.52), the larger $\big\|\sum_{i=1}^r y_i x_i\big\|$ must be. In the second part, we prove that we can upper bound the size of $\big\|\sum_{i=1}^r y_i x_i\big\|$ in terms of R. Combining the two gives the desired condition, which tells us the maximum number of points we can shatter. Summing (5.52) over $i = 1, \ldots, r$ yields

$\Big\langle \mathbf{w}, \sum_{i=1}^r y_i x_i \Big\rangle \geq r.$  (5.53)
By the Cauchy-Schwarz inequality, on the other hand, we have

$\Big\langle \mathbf{w}, \sum_{i=1}^r y_i x_i \Big\rangle \leq \|\mathbf{w}\| \Big\|\sum_{i=1}^r y_i x_i\Big\| \leq \Lambda \Big\|\sum_{i=1}^r y_i x_i\Big\|.$  (5.54)
Here, the second inequality follows from $\|\mathbf{w}\| \leq \Lambda$. Combining (5.53) and (5.54), we get the desired lower bound,

$\Big\|\sum_{i=1}^r y_i x_i\Big\| \geq \frac{r}{\Lambda}.$  (5.55)
We now move on to the second part. Let us consider independent random labels $y_i \in \{\pm 1\}$ which are uniformly distributed, sometimes called Rademacher variables. Let E denote the expectation over the choice of the labels. Exploiting the linearity of E, we have

$\mathbf{E}\Big[\Big\|\sum_{i=1}^r y_i x_i\Big\|^2\Big] = \sum_{i,j=1}^r \mathbf{E}[y_i y_j]\, \langle x_i, x_j \rangle = \sum_{i=1}^r \langle x_i, x_i \rangle,$
where the last equality follows from the fact that the Rademacher variables have zero mean and are independent. Exploiting the fact that $\|y_i x_i\| = \|x_i\| \leq R$, we get

$\mathbf{E}\Big[\Big\|\sum_{i=1}^r y_i x_i\Big\|^2\Big] \leq r R^2.$
Since this is true for the expectation over the random choice of the labels, there must be at least one set of labels for which it also holds true. We have so far made no restrictions on the labels, hence we may now use this specific set of labels. This leads to the desired upper bound,

$\Big\|\sum_{i=1}^r y_i x_i\Big\| \leq \sqrt{r}\, R.$
Combining the upper bound with the lower bound (5.55), we get

$\frac{r}{\Lambda} \leq \sqrt{r}\, R, \quad \text{hence} \quad r \leq R^2 \Lambda^2.$  (5.60)
In other words, if the r points are shattered by a canonical hyperplane satisfying the assumptions we have made, then r is constrained by (5.60). The VC dimension h also satisfies (5.60), since it corresponds to the maximum number of points that can be shattered. In the next section, we give an application of this theorem. Readers only interested in the theoretical background of learning theory may want to skip this section.
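The quantitative content of Theorem 5.5 can be tabulated directly. In the sketch below (an illustration; combining (5.51) with the dimension-based value N + 1 is a standard but separate observation, not part of the theorem itself), the margin-based term $R^2\Lambda^2$ stays fixed while the dimension N grows, so it eventually becomes the binding bound:

import numpy as np

def margin_vc_bound(R, Lam, N):
    """Upper bound on the VC dimension of margin hyperplanes:
    the smaller of R^2 * Lambda^2 (Theorem 5.5) and N + 1."""
    return min(np.floor(R**2 * Lam**2), N + 1)

# in a high-dimensional feature space, the margin term dominates
for N in [10, 1000, 10**6]:
    print(N, margin_vc_bound(R=1.0, Lam=5.0, N=N))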
5.6 A Model Selection Example

In the following example, taken from [470], we use a bound of the form (5.36) to predict which kernel would perform best on a character recognition problem (USPS set, see Section A.1). Since the problem is essentially separable, we disregard the empirical risk term in the bound, and choose the parameters of a polynomial kernel by minimizing the second term. Note that the second term is a monotonic function of the capacity. As a capacity measure, we use the upper bound on the VC dimension described in Theorem 5.5, which in turn is an upper bound on the logarithm of the covering number that appears in (5.36) (by the arguments put forward in Section 5.5.6).
Figure 5.5 Average VC dimension (solid), and total number of test errors, of ten two-class classifiers (dotted) with polynomial degrees 2 through 7, trained on the USPS set of handwritten digits. The baseline 174 on the error scale corresponds to the total number of test errors of the ten best binary classifiers, chosen from degrees 2 through 7. The graph shows that for this problem, which can essentially be solved with zero training error for all degrees greater than 1, the VC dimension allows us to predict that degree 4 yields the best overall performance of the two-class classifiers on the test set (from [470, 467]).
Computing the Enclosing Sphere in $\mathcal{H}$
We employ a version of Theorem 5.5, which uses the radius of the smallest sphere containing the data in a feature space $\mathcal{H}$ associated with the kernel k [561]. The radius was computed by solving a quadratic program [470, 85] (cf. Section 8.3). We formulate the problem as follows:

$\text{minimize}_{x^* \in \mathcal{H},\, R \in \mathbb{R}} \; R^2 \quad \text{subject to} \quad \|x_i - x^*\|^2 \leq R^2 \; \text{for all } i \in [m],$
where $x^*$ is the center of the sphere, and is found in the course of the optimization. Employing the tools of constrained optimization, as briefly described in Chapter 1 (for details, see Chapter 6), we construct a Lagrangian,

$L(R, x^*, \boldsymbol{\lambda}) = R^2 + \sum_{i=1}^m \lambda_i \big(\|x_i - x^*\|^2 - R^2\big), \quad \lambda_i \geq 0,$
and compute the derivatives with respect to $x^*$ and R, to get

$\sum_{i=1}^m \lambda_i = 1 \quad \text{and} \quad x^* = \sum_{i=1}^m \lambda_i x_i,$
and the Wolfe dual problem:

$\text{maximize}_{\boldsymbol{\lambda}} \; \sum_{i=1}^m \lambda_i \langle x_i, x_i \rangle - \sum_{i,j=1}^m \lambda_i \lambda_j \langle x_i, x_j \rangle \quad \text{subject to} \quad \lambda_i \geq 0 \; \text{for all } i \in [m], \; \sum_{i=1}^m \lambda_i = 1,$
where $\boldsymbol{\lambda}$ is the vector of all Lagrange multipliers $\lambda_i$, $i = 1, \ldots, m$. As in the Support Vector algorithm, this problem has the property that the $x_i$
appear only in dot products, so we can again compute the dot products in feature space, replacing $\langle x_i, x_j \rangle$ by $k(x_i, x_j)$ (where the arguments of k belong to the input domain $\mathcal{X}$, while the $x_i$ in the dot products denote their images in the feature space $\mathcal{H}$). As Figure 5.5 shows, the VC dimension bound, using the radius R computed in this way, gives a rather good prediction of the error on an independent test set.
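A minimal sketch of this computation (our own illustration, not the QP solver used in [470]; the Frank-Wolfe iteration, the Gaussian kernel, and all parameter values are assumptions made here) maximizes the Wolfe dual over the probability simplex and recovers $R^2$ as the optimal dual objective value:

import numpy as np

def sphere_radius_sq(K, steps=2000):
    """Smallest enclosing sphere of the points in feature space, given
    their kernel matrix K; returns R^2. Frank-Wolfe on the simplex."""
    m = K.shape[0]
    lam = np.full(m, 1.0 / m)
    diag = np.diag(K)
    for t in range(steps):
        grad = diag - 2 * K @ lam       # gradient of the dual objective
        j = np.argmax(grad)             # best vertex of the simplex
        gamma = 2.0 / (t + 2)           # standard Frank-Wolfe step size
        lam = (1 - gamma) * lam
        lam[j] += gamma
    return lam @ diag - lam @ K @ lam   # dual objective = R^2 at the optimum

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 2))
K = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1))   # Gaussian kernel
print("R^2 in feature space:", sphere_radius_sq(K))

The Frank-Wolfe step moves toward a single vertex of the simplex in each iteration, which mirrors the observation above that only dot products, i.e., kernel evaluations, are needed.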
5.7 Summary

In this chapter, we introduced the main ideas of statistical learning theory. For learning processes utilizing empirical risk minimization to be successful, we need a version of the law of large numbers that holds uniformly over all functions the learning machine can implement. For this uniform law to hold true, the capacity of the set of functions that the learning machine can implement has to be "well-behaved." We gave several capacity measures, such as the VC dimension, and illustrated how to derive bounds on the test error of a learning machine, in terms of the training error and the capacity. We have, moreover, shown how to bound the capacity of margin classifiers, a result which will later be used to motivate the Support Vector algorithm. Finally, we described an application in which a uniform convergence bound was used for model selection. Whilst this discussion of learning theory should be sufficient to understand most of the present book, we will revisit learning theory at a later stage. In Chapter 12, we will present some more advanced material, which applies to kernel learning machines. Specifically, we will introduce another class of generalization error bound, building on a concept of stability of algorithms minimizing regularized risk functionals. These bounds are proven using concentration-of-measure inequalities, which are themselves generalizations of Chernoff and Hoeffding type bounds. In addition, we will discuss leave-one-out and PAC-Bayesian bounds.
5.8 Problems

5.1 (No Free Lunch in Kernel Choice ••) Discuss the relationship between the "no-free-lunch Theorem" and the statement that there is no free lunch in kernel choice.

5.2 (Error Counting Estimate [136] •) Suppose you are given a test set with n elements to assess the accuracy of a trained classifier. Use the Chernoff bound to quantify the probability that the mean error on the test set differs from the true risk by more than $\epsilon > 0$. Argue that the test set should be as large as possible, in order to get a reliable estimate of the performance of a classifier.

5.3 (The Tainted Die ••) A con-artist wants to taint a die such that it does not generate any '6' when cast. Yet he does not know exactly how. So he devises the following scheme:
he makes some changes and subsequently rolls the die 20 times to check that no '6' occurs. Unless pleased with the outcome, he changes more things and repeats the experiment. How long will it take on average until, even with a perfect die, he will be convinced that he has a die that never generates a '6'? What is the probability that this already happens at the first trial? Can you improve the strategy such that he can be sure the die is 'well' tainted (hint: longer trials provide increased confidence)?

5.4 (Chernoff Bound for the Deviation of Empirical Means ••) Use (5.6) and the triangle inequality to prove that

$P\Big\{\Big|\frac{1}{m}\sum_{i=1}^m \xi_i - \frac{1}{m}\sum_{i=m+1}^{2m} \xi_i\Big| > \epsilon\Big\} \leq 4\exp\Big(-\frac{m\epsilon^2}{2}\Big).$  (5.66)
Next, note that the bound (5.66) is symmetric in how it deals with the two halves of the sample. Therefore, since the two events

$\Big\{\frac{1}{m}\sum_{i=1}^m \xi_i - \frac{1}{m}\sum_{i=m+1}^{2m} \xi_i > \epsilon\Big\} \quad \text{and} \quad \Big\{\frac{1}{m}\sum_{i=m+1}^{2m} \xi_i - \frac{1}{m}\sum_{i=1}^m \xi_i > \epsilon\Big\}$
are disjoint, argue that (5.32) holds true. See also Corollary 6.34 below.

5.5 (Consistency and Uniform Convergence ••) Why can we not get a bound on the generalization error of a learning algorithm by applying (5.11) to the outcome of the algorithm? Argue that since we do not know in advance which function the learning algorithm returns, we need to consider the worst possible case, which leads to uniform convergence considerations. Speculate whether there could be restrictions on learning algorithms which imply that effectively, empirical risk minimization only leads to a subset of the set of all possible functions. Argue that this amounts to restricting the capacity. Consider as an example neural networks with back-propagation: if the training algorithm always returns a local minimum close to the starting point in weight space, then the network effectively does not explore the whole weight (i.e., function) space.

5.6 (Confidence Interval and Uniform Convergence •) Derive (5.36) from (5.35).

5.7 (Representer Algorithms for Minimizing VC Bounds ooo) Construct kernel algorithms that are more closely aligned with VC bounds of the form (5.36). Hint: in the risk functional, replace the standard SV regularizer $\|\mathbf{w}\|^2$ with the second term of (5.36), bounding the shattering coefficient with the VC dimension bound (Theorem 5.5). Use the representer theorem (Section 4.2) to argue that the minimizer takes the form of a kernel expansion in terms of the training examples. Find the optimal expansion coefficients by minimizing the modified risk functional over the choice of expansion coefficients.
5.8 (Bounds in Terms of the VC Dimension •) From (5.35) and (5.36), derive bounds in terms of the growth function and the VC dimension, using the results of Section 5.5.6. Discuss the conditions under which they hold.

5.9 (VC Theory and Decision Theory •••) (i) Discuss the relationship between minimax estimation (cf. footnote 7 in Chapter 1) and VC theory. Argue that the VC bounds can be made "worst case" over distributions by picking suitable capacity measures. However, they only bound the difference between empirical risk and true risk, thus they are only "worst case" for the variance term, not for the bias (or empirical risk). The minimization of an upper bound on the risk of the form (5.36), as performed in SRM, is done in order to construct an induction principle rather than to make a minimax statement. Finally, note that the minimization is done with respect to a structure on the set of functions, while in the minimax paradigm, one takes the minimum directly over (all) functions. (ii) Discuss the following folklore statement: "VC statisticians do not care about doing the optimal thing, as long as they can guarantee how well they are doing. Bayesians do not care how well they are doing, as long as they are doing the optimal thing."

5.10 (Overfitting on the Test Set •••) Consider a learning algorithm which has a free parameter C. Suppose you randomly pick n values $C_1, \ldots, C_n$, and for each of them, you train your algorithm. At the end, you pick the value for C which did best on the test set. How would you expect your misjudgment of the true test error to scale with n? How does the situation change if the $C_i$ are not picked randomly, but by some adaptive scheme which proposes new values of C by looking at how the previous ones did, and guessing which change of C would likely improve the performance on the test set?

5.11 (Overfitting the Leave-One-Out Error ••) Explain how it is possible to overfit the leave-one-out error. I.e., consider a learning algorithm that minimizes the leave-one-out error, and argue that it is possible that this algorithm will overfit.

5.12 (Learning Theory for Differential Equations ooo) Can you develop a statistical theory of estimating differential equations from data? How can one suitably restrict the "capacity" of differential equations? Note that without restrictions, already ordinary differential equations may exhibit behavior where the capacity is infinite, as exemplified by Rubel's universal differential equation [447],

$3y'^4 y'' y''''^2 - 4y'^4 y'''^2 y'''' + 6y'^3 y''^2 y''' y'''' + 24y'^2 y''^4 y'''' - 12y'^3 y'' y'''^3 - 29y'^2 y''^3 y'''^2 + 12y''^7 = 0.$  (5.69)
Rubel proved that given any continuous function $f: \mathbb{R} \to \mathbb{R}$ and any positive continuous function $\epsilon: \mathbb{R} \to \mathbb{R}^+$, there exists a $C^\infty$ solution y of (5.69) such that $|y(t) - f(t)| < \epsilon(t)$ for all $t \in \mathbb{R}$. Therefore, all continuous functions are uniform limits of sequences of solutions of (5.69). Moreover, y can be made to agree with f at a countable number of distinct points $(t_i)$. Further references of interest to this problem include [61, 78, 63].
6 Optimization

Overview
This chapter provides a self-contained overview of some of the basic tools needed to solve the optimization problems used in kernel methods. In particular, we will cover topics such as minimization of functions in one variable, convex minimization and maximization problems, duality theory, and statistical methods to solve optimization problems approximately. The focus is noticeably different from the topics covered in works on optimization for Neural Networks, such as Backpropagation [588, 452, 317, 7] and its variants. In these cases, it is necessary to deal with non-convex problems exhibiting a large number of local minima, whereas much of the research on Kernel Methods and Mathematical Programming is focused on problems with global exact solutions. These boundaries may become less clear-cut in the future, but at the present time, methods for the solution of problems with unique optima appear to be sufficient for our purposes.

In Section 6.1, we explain general properties of convex sets and functions, and how the extreme values of such functions can be found. Next, we discuss practical algorithms to minimize convex functions on unconstrained domains (Section 6.2). In this context, we will present techniques like interval cutting methods, Newton's method, gradient descent, and conjugate gradient descent. Section 6.3 then deals with constrained optimization problems, and gives characterization results for solutions. In this context, Lagrangians, primal and dual optimization problems, and the Karush-Kuhn-Tucker (KKT) conditions are introduced. These concepts set the stage for Section 6.4, which presents an interior point algorithm for the solution of constrained convex optimization problems. In a sense, the final section (Section 6.5) is a departure from the previous topics, since it introduces the notion of randomization into the optimization procedures. The basic idea is that unless the exact solution is required, statistical tools can speed up the search by orders of magnitude.

For a general overview, we recommend Section 6.1, and the first parts of Section 6.3, which explain the basic ideas underlying constrained optimization. The latter section is needed to understand the calculations which lead to the dual optimization problems in Support Vector Machines (Chapters 7-9). Section 6.4 is only intended for readers interested in practical implementations of optimization algorithms. In particular, Chapter 10 will require some knowledge of this section. Finally, Section 6.5 describes novel randomization techniques, which are needed in the sparse greedy methods of Sections 10.2, 15.3, 16.4, and 18.4.3. Unconstrained optimization problems (Section 6.2) are less common in this book, and will only be required in the gradient descent methods of Section 10.6.1, and the Gaussian Process implementation methods of Section 16.4.

Prerequisites

The present chapter is intended as an introduction to the basic concepts of optimization. It is relatively self-contained, and requires only basic skills in linear algebra and multivariate calculus. Section 6.3 is somewhat more technical, Section 6.4 requires some additional knowledge of numerical analysis, and Section 6.5 assumes some knowledge of probability and statistics.
6.1 Convex Optimization

In the situations considered in this book, learning (or equivalently statistical estimation) implies the minimization of some risk functional such as $R_{\mathrm{emp}}[f]$ or $R_{\mathrm{reg}}[f]$ (cf. Chapter 4). While minimizing an arbitrary function on a (possibly not even compact) set of arguments can be a difficult task, and will most likely exhibit many local minima, minimization of a convex objective function on a convex set exhibits exactly one global minimum. We now prove this property.

Definition and Construction of Convex Sets and Functions

Definition 6.1 (Convex Set) A set X in a vector space is called convex if for any $x, x' \in X$ and any $\lambda \in [0, 1]$, we have

$\lambda x + (1 - \lambda) x' \in X.$  (6.1)
Definition 6.2 (Convex Function) A function f defined on a set X (note that X need not be convex itself) is called convex if, for any $x, x' \in X$ and any $\lambda \in [0, 1]$ such that $\lambda x + (1 - \lambda) x' \in X$, we have

$f(\lambda x + (1 - \lambda) x') \leq \lambda f(x) + (1 - \lambda) f(x').$  (6.2)
A function f is called strictly convex if for $x \neq x'$ and $\lambda \in (0, 1)$, (6.2) holds with strict inequality.
Figure 6.1 Left: Convex function in two variables. Right: the corresponding convex level sets $\{x \mid f(x) \leq c\}$, for different values of c.

There exist several ways to define convex sets. A convenient method is to define them via below sets of convex functions, such as the sets for which $f(x) \leq c$, for instance.

Lemma 6.3 (Convex Sets as Below-Sets) Denote by $f: X \to \mathbb{R}$ a convex function on a convex set X. Then the set

$X_c := \{x \in X \mid f(x) \leq c\}$
is convex.

Proof We must show condition (6.1). For any $x, x' \in X_c$, we have $f(x), f(x') \leq c$. Moreover, since f is convex, we also have

$f(\lambda x + (1 - \lambda) x') \leq \lambda f(x) + (1 - \lambda) f(x') \leq \lambda c + (1 - \lambda) c = c.$
Hence, for all $\lambda \in [0, 1]$, we have $\lambda x + (1 - \lambda) x' \in X_c$, which proves the claim. Figure 6.1 depicts this situation graphically.
Intersections
Lemma 6.4 (Intersection of Convex Sets) Denote by $X, X' \subset \mathcal{X}$ two convex sets. Then $X \cap X'$ is also a convex set.

Proof Given any $x, x' \in X \cap X'$, then for any $\lambda \in [0, 1]$, the point $x_\lambda := \lambda x + (1 - \lambda) x'$ satisfies $x_\lambda \in X$ and $x_\lambda \in X'$, hence also $x_\lambda \in X \cap X'$. See also Figure 6.2.

Now we have the tools to prove the central theorem of this section.

Theorem 6.5 (Minima on Convex Sets) If the convex function $f: \mathcal{X} \to \mathbb{R}$ has a minimum on a convex set $X \subset \mathcal{X}$, then its arguments $x \in X$, for which the minimum value is attained, form a convex set. Moreover, if f is strictly convex, then this set will contain only one element.
Figure 6.2 Left: a convex set; observe that line segments connecting points of the set are fully contained in the set. Right: the intersection of two convex sets is also a convex set.
Figure 6.3 Note that the maximum of a convex function is obtained at the ends of the interval [a, b].
Proof Denote by c the minimum of f on X. Then the set $X_m := \{x \mid x \in \mathcal{X} \text{ and } f(x) \leq c\}$ is clearly convex. In addition, $X_m \cap X$ is also convex, and f(x) = c for all $x \in X_m \cap X$ (otherwise c would not be the minimum). If f is strictly convex, then for any $x, x' \in X$, and in particular for any $x, x' \in X \cap X_m$, we have (for $x \neq x'$ and all $\lambda \in (0, 1)$)

$f(\lambda x + (1 - \lambda) x') < \lambda f(x) + (1 - \lambda) f(x') = \lambda c + (1 - \lambda) c = c.$
This contradicts the assumption that $X_m \cap X$ contains more than one element. ∎
Global Minima
A simple application of this theorem is in constrained convex minimization. Recall that the notation [n], used below, is a shorthand for $\{1, \ldots, n\}$.

Corollary 6.6 (Constrained Convex Minimization) Given the set of convex functions $f, c_1, \ldots, c_n$ on the convex set X, the problem

$\text{minimize}_{x \in X} \; f(x) \quad \text{subject to} \quad c_i(x) \leq 0 \; \text{for all } i \in [n],$
has as its solution a convex set, if a solution exists. This solution is unique if f is strictly convex. Many problems in Mathematical Programming or Support Vector Machines can be cast into this formulation. This means either that they all have unique solutions (if f is strictly convex), or that all solutions are equally good and form a convex set (if f is merely convex). We might ask what can be said about convex maximization. Let us analyze a simple case first: convex maximization on an interval.
Lemma 6.7 (Convex Maximization on an Interval) Denote by f a convex function on $[a, b] \subset \mathbb{R}$. Then the problem of maximizing f on [a, b] has its solution at one of the endpoints, i.e., the maximum equals $\max\{f(a), f(b)\}$.

Maxima on Extreme Points
Proof
Any $x \in [a, b]$ can be written as $\lambda a + (1 - \lambda) b$ with $\lambda = \frac{b - x}{b - a} \in [0, 1]$, and hence

$f(x) \leq \lambda f(a) + (1 - \lambda) f(b) \leq \max\{f(a), f(b)\}.$
Therefore the maximum of f on [a, b] is obtained at one of the points a, b. We will next show that the problem of convex maximization on a convex set is typically a hard problem, in the sense that the maximum can only be found at one of the extreme points of the constraining set. We must first introduce the notion of vertices of a set.

Definition 6.8 (Vertex of a Set) A point $x \in X$ is a vertex of X if, for all $x' \in X$ with $x' \neq x$, and for all $\lambda > 1$, the point $\lambda x + (1 - \lambda) x' \notin X$.

This definition implies, for instance, that in the case of X being an $\ell_2$ ball, the vertices of X make up its surface. In the case of an $\ell_\infty$ ball, we have $2^n$ vertices in n dimensions, and for an $\ell_1$ ball, we have only 2n of them. These differences will guide us in the choice of admissible sets of parameters for optimization problems (see, e.g., Section 14.4). In particular, there exists a connection between suprema on sets and their convex hulls. To state this link, however, we need to define the latter.

Definition 6.9 (Convex Hull) Denote by X a set in a vector space. Then the convex hull co X is defined as

$\operatorname{co} X := \Big\{\sum_{i=1}^n \alpha_i x_i \;\Big|\; n \in \mathbb{N},\; x_i \in X,\; \alpha_i \geq 0,\; \sum_{i=1}^n \alpha_i = 1\Big\}.$
Theorem 6.10 (Suprema on Sets and their Convex Hulls) Denote by X a set and by co X its convex hull. Then for a convex function f,

$\sup\{f(x) \mid x \in X\} = \sup\{f(x) \mid x \in \operatorname{co} X\}.$

Evaluating Convex Sets on Extreme Points
Proof Recall that the below set of a convex function is convex (Lemma 6.3), and that the below set of f with respect to $c = \sup\{f(x) \mid x \in X\}$ is by definition a superset of X. Moreover, due to its convexity, it is also a superset of co X.

This theorem can be used to replace search operations over sets X by subsets $X' \subset X$, which are considerably smaller, if the convex hull of the latter generates X. In particular, the vertices of convex sets are sufficient to reconstruct the whole set.

Theorem 6.11 (Vertices) A compact convex set is the convex hull of its vertices.
Figure 6.4 A convex function on a convex polyhedral set. Note that the minimum of this function is unique, and that the maximum can be found at one of the vertices of the constraining domain.
Reconstructing Convex Sets from Vertices
The proof is slightly technical, and not central to the understanding of kernel methods. See Rockafellar [435, Chapter 18] for details, along with further theorems on convex functions. We now proceed to the second key theorem in this section.

Theorem 6.12 (Maxima of Convex Functions on Convex Compact Sets) Denote by X a compact convex set, by V(X) the set of vertices of X, and by f a convex function on X. Then

sup{f(x) | x ∈ X} = sup{f(x) | x ∈ V(X)}.
Proof Application of Theorem 6.10 and Theorem 6.11 proves the claim, since under the assumptions made on X, we have X = co(V(X)). Figure 6.4 depicts the situation graphically.
6.2 Unconstrained Problems

After the characterization and uniqueness results (Theorem 6.5, Corollary 6.6, and Lemma 6.7) of the previous section, we will now study numerical techniques for obtaining minima (or maxima) of convex optimization problems. While the choice of algorithms is motivated by their applicability to kernel methods, the presentation here is not problem specific. For details on implementation, and descriptions of applications to learning problems, see Chapter 10.

6.2.1 Functions of One Variable

Continuous Differentiable Functions

We begin with the easiest case, in which f depends on only one variable. Some of the concepts explained here, such as the interval cutting algorithm and Newton's method, can be extended to the multivariate setting (see Problem 6.5). For the sake of simplicity, however, we limit ourselves to the univariate case. Assume we want to minimize f : ℝ → ℝ on the interval [a, b] ⊂ ℝ. If we cannot make any further assumptions regarding f, then this problem, as simple as it may seem, cannot be solved numerically. If f is differentiable, the problem can be reduced to finding f'(x) = 0 (see Problem 6.4 for the general case). If, in addition to the previous assumptions, f is convex, then f' is nondecreasing, and we can find a fast, simple algorithm (Algorithm
Figure 6.5 Interval Cutting Algorithm. The selection of points is ordered according to the numbers beneath (points 1 and 2 are the initial endpoints of the interval).
Algorithm 6.1 Interval Cutting
Require: a, b, Precision ε
Set A = a, B = b
repeat
  if f'((A + B)/2) > 0 then
    B = (A + B)/2
  else
    A = (A + B)/2
  end if
until (B − A) min(|f'(A)|, |f'(B)|) ≤ ε
Output: x = (A + B)/2
Interval Cutting
6.1) to solve our problem (see Figure 6.5). This technique works by halving the size of the interval that contains the minimum x* of f, since the selection criteria for B and A guarantee at every step that x* ∈ [A, B]. We use the following Taylor series expansion to determine the stopping criterion.

Theorem 6.13 (Taylor Series) Denote by f : ℝ → ℝ a function that is d times differentiable. Then for any x, x' ∈ ℝ, there exists a ξ with |ξ| ≤ |x − x'|, such that

f(x') = Σ_{i=0}^{d−1} (1/i!) f^(i)(x) (x' − x)^i + (1/d!) f^(d)(x + ξ) (x' − x)^d.   (6.11)
Now we may apply (6.11) to the stopping criterion of Algorithm 6.1. We denote by x* the minimum of f(x). Expanding f around x*, we obtain for some ξ_A ∈ [A − x*, 0] that f(A) = f(x*) + ξ_A f'(x* + ξ_A), and therefore

|f(A) − f(x*)| = |ξ_A| |f'(x* + ξ_A)| ≤ (B − A) |f'(A)|.

Proof of Linear Convergence
Taking the minimum over {A, B} shows that Algorithm 6.1 stops once f is ε-close to its minimal value. The convergence of the algorithm is linear with constant 0.5, since the intervals [A, B] of possible locations of x* are halved at each iteration.
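A minimal Python sketch of Algorithm 6.1, with illustrative names and an example objective (any convex differentiable f with a minimum in [a, b] will do):

def interval_cutting(f_prime, a, b, eps):
    """Minimize a convex differentiable function on [a, b] by halving the
    interval bracketing the zero of its nondecreasing derivative f'."""
    A, B = a, b
    while (B - A) * min(abs(f_prime(A)), abs(f_prime(B))) > eps:
        mid = 0.5 * (A + B)
        if f_prime(mid) > 0:
            B = mid   # the minimum lies to the left of the midpoint
        else:
            A = mid   # the minimum lies to the right of the midpoint
    return 0.5 * (A + B)

# Example: f(x) = (x - 1)^2 on [0, 3], with f'(x) = 2(x - 1)
print(interval_cutting(lambda x: 2.0 * (x - 1.0), 0.0, 3.0, 1e-10))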
Algorithm 6.2 Newton's Method
Require: x_0, Precision ε
Set x = x_0
repeat
  x = x − f'(x)/f''(x)
until |f'(x)| ≤ ε
Output: x
Newton's Method
In constructing the interval cutting algorithm, we in fact wasted most of the information obtained in evaluating f' at each point, by only making use of the sign of f'. In particular, we could fit a parabola to f and thereby obtain a method that converges more rapidly. If we are only allowed to use f and f', this leads to the Method of False Position (see [334] or Problem 6.3). Moreover, if we may compute the second derivative as well, we can use (6.11) to obtain a quadratic approximation of f, and use the latter to find the minimum of f. This is commonly referred to as Newton's method (see Section 16.4.1 for a practical application of the latter to classification problems). We expand f(x) around x_0:

f(x) ≈ f(x_0) + f'(x_0)(x − x_0) + (1/2) f''(x_0)(x − x_0)^2.   (6.12)
Minimization of the expansion (6.12) yields

x = x_0 − f'(x_0)/f''(x_0).   (6.13)
Hence, we hope that if the approximation (6.12) is good, we will obtain an algorithm with fast convergence (Algorithm 6.2). Let us analyze the situation in more detail. For convenience, we state the result in terms of g := f', since finding a zero of g is equivalent to finding a minimum of f.
Quadratic Convergence
Theorem 6.14 (Convergence of Newton's Method) Let g : ℝ → ℝ be a twice continuously differentiable function, and denote by x* ∈ ℝ a point with g'(x*) ≠ 0 and g(x*) = 0. Then, provided x_0 is sufficiently close to x*, the sequence generated by (6.13) will converge to x* at least quadratically.

Proof For convenience, denote by x_n the value of x at the nth iteration. As before, we apply Theorem 6.13. We now expand g(x*) around x_n. For some ξ ∈ [0, x* − x_n], we have

0 = g(x*) = g(x_n) + (x* − x_n) g'(x_n) + (1/2)(x* − x_n)^2 g''(x_n + ξ),   (6.14)
and therefore, by substituting (6.14) into (6.13),

x_{n+1} − x* = x_n − x* − g(x_n)/g'(x_n) = (x_n − x*)^2 g''(x_n + ξ)/(2 g'(x_n)).   (6.15)
Since by construction |ξ| ≤ |x_n − x*|, we obtain an algorithm that converges quadratically in |x_n − x*|, provided that |x_n − x*| |g''(x_n + ξ)|/(2|g'(x_n)|) < 1.
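A correspondingly minimal Python sketch of Algorithm 6.2 (illustrative names; recall that convergence is only guaranteed if the starting point lies in the region of convergence):

def newton_minimize(f_prime, f_double_prime, x0, eps, max_iter=100):
    """Algorithm 6.2: minimize f by driving f' to zero with Newton steps."""
    x = x0
    for _ in range(max_iter):
        if abs(f_prime(x)) < eps:
            break
        x -= f_prime(x) / f_double_prime(x)
    return x

# Example: f(x) = x^4 - 2x^2, started near the minimizer x* = 1
print(newton_minimize(lambda x: 4*x**3 - 4*x, lambda x: 12*x**2 - 4, 0.9, 1e-12))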
Region of Convergence
Line Search
In other words, if the Newton method converges, it converges more rapidly than interval cutting or similar methods. We cannot guarantee beforehand, however, that we are really in the region of convergence of the algorithm. In practice, if we apply the Newton method and find that it converges, we know that it has converged to the minimizer of f. For more information on optimization algorithms for unconstrained problems, see [173, 530, 334, 15, 159, 45]. In some cases we will not know an upper bound on the size of the interval to be analyzed for the presence of minima. In this situation we may, for instance, start with an initial guess of an interval, and if no minimum can be found strictly inside the interval, enlarge it, say by doubling its size. See [334] for more information on this matter. Let us now proceed to a technique which is quite popular (albeit not always preferable) in machine learning.
Direction of Steepest Descent
6.2.2 Functions of Several Variables: Gradient Descent
Gradient descent is one of the simplest optimization techniques to implement for minimizing functions of the form f : X → ℝ, where X may be ℝ^N, or indeed any set on which a gradient may be defined and evaluated. In order to avoid further complications, we assume that the gradient f'(x) exists and that we are able to compute it. The basic idea is as follows: given a location x_n at iteration n, compute the gradient g_n := f'(x_n), and update

x_{n+1} = x_n − γ g_n,   (6.16)
such that the decrease in f is maximal over all γ > 0. For this final step, one of the algorithms from Section 6.2.1 can be used. It is straightforward to show that f(x_n) is a monotonically decreasing series, since at each step the line search updates x_{n+1} in such a way that f(x_{n+1}) ≤ f(x_n). Such a value of γ must exist, since (again by Theorem 6.13) we may expand f(x_n − γ g_n) in terms of γ around x_n, to obtain¹

f(x_n − γ g_n) = f(x_n) − γ ‖g_n‖² + O(γ²).   (6.17)
Problems of Convergence
As usual, ‖·‖ is the Euclidean norm. For small γ the linear contribution in the Taylor expansion will be dominant, hence for some γ > 0 we have f(x_n − γ g_n) < f(x_n). It can be shown [334] that after a (possibly infinite) number of steps, gradient descent (see Algorithm 6.3) will converge. In spite of this, the performance of gradient descent is far from optimal. Depending on the shape of the landscape of values of f, gradient descent may take a long time to converge. Figure 6.6 shows two examples of possible convergence behavior of the gradient descent algorithm.

1. To see that Theorem 6.13 applies in (6.17), note that f(x_n − γ g_n) is a mapping ℝ → ℝ when viewed as a function of γ.
Algorithm 6.3 Gradient Descent
Require: x_0, Precision ε
n = 0
repeat
  Compute g = f'(x_n)
  Perform line search on f(x_n − γ g) for optimal γ
  x_{n+1} = x_n − γ g
  n = n + 1
until ‖f'(x_n)‖ ≤ ε
Output: x_n
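A Python sketch of Algorithm 6.3; as an implementation choice, the one-dimensional line search is delegated here to scipy.optimize.minimize_scalar, although any of the methods of Section 6.2.1 could be substituted:

import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent(f, grad, x0, eps, max_iter=1000):
    """Algorithm 6.3: steepest descent with a line search along -g."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            break
        gamma = minimize_scalar(lambda t: f(x - t * g)).x  # cf. (6.17)
        x = x - gamma * g
    return x

# Ill-conditioned quadratic: produces the zig-zagging of Figure 6.6 (left)
K = np.diag([1.0, 50.0])
print(gradient_descent(lambda x: 0.5 * x @ K @ x, lambda x: K @ x,
                       np.array([5.0, 1.0]), 1e-8))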
Figure 6.6 Left: Gradient descent takes a long time to converge, since the landscape of values of f forms a long and narrow valley, causing the algorithm to zig-zag along the walls of the valley. Right: due to the homogeneous structure of the minimum, the algorithm converges after very few iterations. Note that in both cases, the next direction of descent is orthogonal to the previous one, since line search provides the optimal step length.
6.2.3 Convergence Properties of Gradient Descent
Let us analyze the convergence properties of Algorithm 6.3 in more detail. To keep matters simple, we assume that f is a quadratic function, i.e.,

f(x) = (1/2)(x − x*)ᵀ K (x − x*) + c_0,   (6.18)
where K is a positive definite symmetric matrix (cf. Definition 2.4) and c_0 is a constant.² This is clearly a convex function with minimum at x*, and f(x*) = c_0. The gradient of f is given by

f'(x) = K(x − x*).   (6.19)
To find the update of steepest descent, we have to minimize, over γ,

f(x_n − γ g_n) = (1/2)(x_n − γ g_n − x*)ᵀ K (x_n − γ g_n − x*) + c_0.   (6.20)
2. Note that we may rewrite (up to a constant) any convex quadratic function f(x) = (1/2) xᵀKx + cᵀx + d in the form (6.18), simply by expanding f around its minimizer x*.
By minimizing (6.20) over γ, the update of steepest descent is given explicitly by

γ_n = (g_nᵀ g_n)/(g_nᵀ K g_n) and x_{n+1} = x_n − γ_n g_n.   (6.21)
Improvement per Step
Substituting (6.21) into (6.18), and subtracting the resulting f(x_{n+1}) from f(x_n), yields the following improvement after an update step:

f(x_n) − f(x_{n+1}) = (g_nᵀ g_n)² / (2 g_nᵀ K g_n).   (6.22)
Since f(x_n) − c_0 = (1/2) g_nᵀ K⁻¹ g_n, the relative improvement per iteration depends on the value of

t(g) := (gᵀg)² / ((gᵀKg)(gᵀK⁻¹g)).

In order to give performance guarantees, we have to find a lower bound for t(g). To this end we introduce the condition of a matrix.

Definition 6.15 (Condition of a Matrix) Denote by K a matrix and by λ_max and λ_min its largest and smallest singular values (or eigenvalues, if they exist), respectively. The condition of the matrix is defined as

cond K := λ_max/λ_min.   (6.23)
Clearly, as cond K decreases, different directions are treated in a more homogeneous manner by xᵀKx. In particular, note that smaller values of cond K correspond to less elliptic contours in Figure 6.6. Kantorovich proved the following inequality, which allows us to connect the condition number with the convergence behavior of gradient descent algorithms.
Lower Bound for Improvement
Theorem 6.16 (Kantorovich Inequality [278]) Denote by K ∈ ℝ^{m×m} (typically the kernel matrix) a strictly positive definite symmetric matrix with largest and smallest eigenvalues λ_max and λ_min. Then the following inequality holds for any g ∈ ℝ^m:

(gᵀg)² / ((gᵀKg)(gᵀK⁻¹g)) ≥ 4 λ_max λ_min / (λ_max + λ_min)² = 4 cond K / (1 + cond K)².   (6.24)
We typically denote by g the gradient of f. The second identity in (6.24) follows immediately from Definition 6.15; the proof of the first inequality is more technical, and is not essential to the understanding of the situation. See Problem 6.7 and [278, 334] for more detail. A brief calculation gives us the correct order of magnitude. Note that for any x, the quadratic term xᵀKx is bounded from above by λ_max‖x‖², and likewise xᵀK⁻¹x ≤ λ_min⁻¹‖x‖². Hence we may bound the relative improvement t(g) (as defined below (6.22)) from below by 1/(cond K), which is almost as good as the rightmost term in (6.24) (the latter can be up to a factor of 4 better for λ_min ≪ λ_max).
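The inequality is easily checked numerically; the following sketch (with an arbitrary random positive definite matrix) compares t(g) with the lower bound of (6.24):

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
K = B @ B.T + 0.1 * np.eye(5)              # strictly positive definite
lam = np.linalg.eigvalsh(K)                # eigenvalues, ascending
lam_min, lam_max = lam[0], lam[-1]

g = rng.standard_normal(5)
t = (g @ g) ** 2 / ((g @ K @ g) * (g @ np.linalg.solve(K, g)))
bound = 4 * lam_max * lam_min / (lam_max + lam_min) ** 2

print("cond K =", lam_max / lam_min)
print("t(g) =", t, ">= bound =", bound)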
Unfortunately, kernel matrices tend to have rapidly decaying eigenvalues, and hence a large condition number (even though a rapid decay of the eigenvalues is desired for learning theoretical reasons); see Chapter 4 for details. This is one of the reasons why many gradient descent algorithms for training Support Vector Machines, such as the Kernel AdaTron [183, 12] or AdaLine [185], exhibit poor convergence. Section 10.6.1 deals with these issues, and sets up the gradient descent directions both in the Reproducing Kernel Hilbert Space ℋ and in coefficient space ℝ^m.
6.2.4 Functions of Several Variables: Conjugate Gradient Descent
Let us now look at methods that are better suited to minimizing convex functions. Again, we start with quadratic forms. The key problem with gradient descent is that the quotient between the smallest and the largest eigenvalue can be very large, which leads to slow convergence. Hence, one possible technique is to rescale X by some matrix M, such that the condition of K ∈ ℝ^{m×m} in this rescaled space, which is to say the condition of MᵀKM, is much closer to 1 (in numerical analysis this is often referred to as preconditioning [247, 423, 530]). In addition, we would like to focus first on the largest eigenvectors of K. A key tool is the concept of conjugate directions. The basic idea is that rather than using the metric of the ordinary dot product xᵀx' = xᵀ1x' (1 is the unit matrix), we use the metric imposed by K, i.e., xᵀKx', to guide our algorithm, and we introduce an equivalent notion of orthogonality with respect to the new metric.

Definition 6.17 (Conjugate Directions) Given a symmetric matrix K ∈ ℝ^{m×m}, any two vectors v, v' ∈ ℝ^m are called K-orthogonal if vᵀKv' = 0.
Likewise, we can introduce notions of a basis and of linear independence with respect to K. The following theorem establishes the necessary identities.

Theorem 6.18 (Orthogonal Decompositions in K) Denote by K ∈ ℝ^{m×m} a strictly positive definite symmetric matrix, and by v_1, ..., v_m a set of mutually K-orthogonal and nonzero vectors. Then the following properties hold:
(i) The vectors v_1, ..., v_m form a basis.
(ii) Any x ∈ ℝ^m can be expanded in terms of the v_i by

x = Σ_{i=1}^m (v_iᵀKx / v_iᵀKv_i) v_i.   (6.25)
(iii) In particular, for any y = Kx, we can find x by

x = Σ_{i=1}^m (v_iᵀy / v_iᵀKv_i) v_i.   (6.26)
Linear Independence
Proof (i) Since we have m vectors in ℝ^m, all we have to show is that the vectors v_i are linearly independent. Assume that there exist some α_i ∈ ℝ such that Σ_{i=1}^m α_i v_i = 0.
Then due to K-orthogonality, we have, for all j,

0 = v_jᵀ K (Σ_{i=1}^m α_i v_i) = α_j (v_jᵀ K v_j).   (6.27)
Since v_jᵀKv_j > 0, this yields α_j = 0 for all j. This means that all v_i are linearly independent.
(ii) The vectors {v_1, ..., v_m} form a basis. Therefore we may expand any x ∈ ℝ^m as a linear combination of the v_i, i.e., x = Σ_{i=1}^m α_i v_i. Consequently we can expand v_jᵀKx in terms of v_jᵀKv_j, and we obtain

v_jᵀKx = Σ_{i=1}^m α_i v_jᵀKv_i = α_j v_jᵀKv_j.   (6.28)
Basis Expansion
Solving (6.28) for α_j proves the claim.
(iii) Let y = Kx. Since the vectors v_i form a basis, we can expand x in terms of the α_i. Substituting this expansion into (6.28) proves (6.26).

The practical consequence of this theorem is that, provided we know a set of K-orthogonal vectors v_i, we can solve the linear equation y = Kx via (6.26). Furthermore, we can also use it to minimize quadratic functions of the form f(x) = (1/2) xᵀKx − cᵀx. The following theorem tells us how.
Optimality in Linear Space
Theorem 6.19 (Deflation Method) Denote by v_0, ..., v_{m−1} a set of mutually K-orthogonal vectors for a strictly positive definite symmetric matrix K ∈ ℝ^{m×m}. Then for any x_0 ∈ ℝ^m, the following method finds iterates x_i that minimize f(x) = (1/2) xᵀKx − cᵀx in the linear manifold X_i := x_0 + span{v_0, ..., v_{i−1}}:

x_{i+1} = x_i + α_i v_i, where α_i = −(v_iᵀ g_i)/(v_iᵀ K v_i) and g_i := f'(x_i) = Kx_i − c.
Proof We use induction. For i = 0 the statement is trivial, since the linear manifold consists of only one point. Assume that the statement holds for i. Since f is convex, we only need prove that the gradient of f at x_{i+1} is orthogonal to span{v_0, ..., v_i}; in that case no further improvement can be gained on the linear manifold X_{i+1}. It suffices to show that v_jᵀ g_{i+1} = 0 for all j ≤ i.
Gradient Descent in Rescaled Space
Additionally, we may expand x_{i+1} to obtain

v_jᵀ g_{i+1} = v_jᵀ (K x_{i+1} − c) = v_jᵀ (K x_i − c) + α_i v_jᵀ K v_i = v_jᵀ g_i + α_i v_jᵀ K v_i.
For j = i both terms cancel out, by the choice of α_i. For j < i both terms vanish, due to the induction assumption and the K-orthogonality of the v_i. Since the vectors v_j form a basis, X_m = ℝ^m, and x_m is a minimizer of f.

Algorithm 6.4 Conjugate Gradient Descent
Require: x_0
Set i = 0, g_0 = Kx_0 − c, v_0 = −g_0
repeat
  x_{i+1} = x_i + α_i v_i where α_i = −(g_iᵀ v_i)/(v_iᵀ K v_i)
  g_{i+1} = f'(x_{i+1}) = K x_{i+1} − c
  v_{i+1} = −g_{i+1} + β_i v_i where β_i = (g_{i+1}ᵀ K v_i)/(v_iᵀ K v_i)
  i = i + 1
until g_i = 0
Output: x_i
In a nutshell, Theorem 6.19 already contains the Conjugate Gradient descent algorithm: in each step we perform gradient descent with respect to one of the K-orthogonal vectors v_i, which means that after m steps we will reach the minimum. We still lack a method to obtain such a K-orthogonal basis of vectors v_i. It turns out that we can get the latter directly from the gradients g_i. Algorithm 6.4 describes the procedure. All we have to do is prove that Algorithm 6.4 actually does what it is required to do, namely generate a K-orthogonal set of vectors v_i, and perform deflation with the latter. To achieve this, the v_i are obtained by an orthogonalization procedure akin to Gram-Schmidt orthogonalization.

Theorem 6.20 (Conjugate Gradient) Assume we are given a quadratic convex function f(x) = (1/2) xᵀKx − cᵀx, to which we apply conjugate gradient descent for minimization purposes. Then Algorithm 6.4 is a deflation method, and unless g_i = 0, we have for every i:
(i) span{v_0, ..., v_i} = span{g_0, Kg_0, ..., K^i g_0}.
(ii) The vectors v_i are K-orthogonal.
(iii) The equations in Algorithm 6.4 for α_i and β_i can be replaced by

α_i = (g_iᵀ g_i)/(v_iᵀ K v_i) and β_i = (g_{i+1}ᵀ g_{i+1})/(g_iᵀ g_i).

(iv) After i steps, x_i is the solution in the manifold x_0 + span{g_0, Kg_0, ..., K^{i−1} g_0}.

Proof (i) and (ii) We use induction. For i = 0 the statements trivially hold, since v_0 = −g_0. For the induction step, note that by construction (see Algorithm 6.4) g_{i+1} = Kx_{i+1} − c = g_i + α_i K v_i, hence span{g_0, ..., g_{i+1}} = span{g_0, Kg_0, ..., K^{i+1} g_0}. Since v_{i+1} = −g_{i+1} + β_i v_i, the same statement holds for span{v_0, ..., v_{i+1}}. Moreover, the vectors g_i are linearly independent or 0, due to Theorem 6.19. Finally, v_jᵀ K v_{i+1} = −v_jᵀ K g_{i+1} + β_i v_jᵀ K v_i = 0, since for j = i both terms cancel out, and for j < i both terms individually vanish (due to Theorem 6.19 and (i)).
(iii) We have −g_iᵀ v_i = g_iᵀ g_i − β_{i−1} g_iᵀ v_{i−1} = g_iᵀ g_i, since the second term vanishes due to Theorem 6.19. This proves the result for α_i.
For β_i, note that g_{i+1}ᵀ K v_i = α_i⁻¹ g_{i+1}ᵀ (g_{i+1} − g_i) = α_i⁻¹ g_{i+1}ᵀ g_{i+1}, since g_{i+1}ᵀ g_i = 0 by Theorem 6.19. Substitution of the value of α_i proves the claim.
(iv) Again, we use induction. At step i = 1 we compute the solution within the space spanned by g_0. For the induction step, note that by Theorem 6.19, x_{i+1} minimizes f on x_0 + span{v_0, ..., v_i}, which by (i) equals x_0 + span{g_0, Kg_0, ..., K^i g_0}.

Table 6.1 Non-quadratic modifications of conjugate gradient descent.

Generic Method: Compute the Hessian K_i := f''(x_i) and update α_i and β_i with
  α_i = −(g_iᵀ v_i)/(v_iᵀ K_i v_i) and β_i = (g_{i+1}ᵀ K_i v_i)/(v_iᵀ K_i v_i).
This requires calculation of the Hessian at each iteration.

Fletcher-Reeves [173]: Find α_i via a line search, and use Theorem 6.20 (iii) for β_i:
  α_i = argmin_α f(x_i + α v_i) and β_i = (g_{i+1}ᵀ g_{i+1})/(g_iᵀ g_i).

Polak-Ribiere [414]: Find α_i via a line search:
  α_i = argmin_α f(x_i + α v_i) and β_i = ((g_{i+1} − g_i)ᵀ g_{i+1})/(g_iᵀ g_i).
Experimentally, Polak-Ribiere tends to be better than Fletcher-Reeves.
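A compact Python rendering of Algorithm 6.4, using the simplified α_i and β_i of Theorem 6.20 (iii); the names are illustrative:

import numpy as np

def conjugate_gradient(K, c, x0, eps=1e-10):
    """Minimize 0.5 x^T K x - c^T x for strictly positive definite K;
    terminates after at most m = len(c) steps in exact arithmetic."""
    x = np.asarray(x0, dtype=float)
    g = K @ x - c
    v = -g
    for _ in range(len(c)):
        if np.linalg.norm(g) < eps:
            break
        alpha = (g @ g) / (v @ K @ v)       # Theorem 6.20 (iii)
        x = x + alpha * v
        g_new = K @ x - c
        beta = (g_new @ g_new) / (g @ g)    # Theorem 6.20 (iii)
        v = -g_new + beta * v
        g = g_new
    return x

rng = np.random.default_rng(1)
B = rng.standard_normal((6, 6))
K = B @ B.T + np.eye(6)
c = rng.standard_normal(6)
print(np.allclose(K @ conjugate_gradient(K, c, np.zeros(6)), c))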
Space of Largest Eigenvalues
Nonlinear Extensions
We conclude this section with some remarks on the optimality of conjugate gradient descent algorithms, and on how they can be extended to arbitrary convex functions. Due to Theorems 6.19 and 6.20, we can see that after i iterations, the conjugate gradient descent algorithm finds a solution on the linear manifold x_0 + span{g_0, Kg_0, ..., K^{i−1} g_0}. This means that the solutions will be mostly aligned with the largest eigenvalues of K, since after repeated application of K to an arbitrary vector g_0, the largest eigenvectors dominate. Nonetheless, the algorithm here is significantly cheaper than computing the eigenvalues of K, and subsequently minimizing f in the subspace corresponding to the largest eigenvalues. For more detail see [334].
In the case of general convex functions, the assumptions of Theorem 6.20 are no longer satisfied. In spite of this, conjugate gradient descent has proven to be effective even in these situations, although we have to account for some modifications. Basically, the update rules for g_i and v_i remain unchanged, but the parameters α_i and β_i are computed differently. Table 6.1 gives an overview of different methods. See [173, 334, 530, 414] for details.
6.2.5 Predictor Corrector Methods
As we go to higher order Taylor expansions of the function f to be minimized (or set to zero), the corresponding numerical methods become increasingly
Increasing the Order
Predictor Corrector Methods for Quadratic Equations
complicated to implement, and require an ever increasing number of parameters to be estimated or computed. For instance, a quadratic expansion of a multivariate function f : ℝ^m → ℝ requires m × m terms for the quadratic part (the Hessian), whereas the linear part (the gradient) can be obtained by computing m terms. Since the quadratic expansion is only an approximation for most non-quadratic functions, this is wasteful (for interior point programs, see Section 6.4). We might instead be able to achieve roughly the same goal without computing the quadratic term explicitly, or more generally, obtain the performance of higher order methods without actually implementing them. This can in fact be achieved using predictor-corrector methods. These work by computing a tentative update x_i → x_pred (predictor step), then using x_pred to account for higher order changes in the objective function, and finally obtaining a corrected value x_corr based on these changes. A simple example illustrates the method. Assume we want to find the solution to the equation

f(x) = f_0 + ax + bx² = 0.   (6.32)
We assume a, b, f_0, x ∈ ℝ. Exact solution of (6.32) requires taking a square root. Let us see whether we can find an approximate method that avoids this (in general, b will be an m × m matrix, so this is a worthwhile goal). The predictor-corrector approach works as follows: first solve the linear part only, i.e.,

f_0 + a x_pred = 0, and hence x_pred = −f_0/a.   (6.33)
Second, substitute x_pred into the nonlinear parts of (6.32) to obtain

f_0 + a x_corr + b x_pred² = 0, and hence x_corr = −(f_0 + b x_pred²)/a = x_pred − b x_pred²/a.   (6.34)
No Quadratic Residuals
Comparing x_pred and x_corr, we see that −b x_pred²/a is the correction term that takes the effect of the changes in x into account. Since neither of the two values (x_pred or x_corr) will give us the exact solution of f(x) = 0 in just one step, it is worthwhile having a look at the errors of both approaches:

f(x_pred) = b f_0²/a²   (6.35)
f(x_corr) = 2 b² f_0³/a⁴ + b³ f_0⁴/a⁶.   (6.36)
We can check, by comparing (6.35) and (6.36), that for sufficiently small |b f_0/a²| (the precise condition is |b f_0/a² + 1| < √2), the corrector estimate will be better than the predictor one. As our initial estimate f_0 decreases, this will increasingly be the case. Moreover, we can see that f(x_corr) only contains terms in f_0 that are of higher order than quadratic. This means that even though we did not solve the quadratic form explicitly, we eliminated all corresponding terms. The general scheme is described in Algorithm 6.5. It is based on the assumption that f(x + ξ) can be split up into

f(x + ξ) = f(x) + f_simple(ξ, x) + Γ(ξ, x),
Algorithm 6.5 Predictor Corrector Method
Require: x_0, Precision ε
Set i = 0
repeat
  Expand f(x_i + ξ) into f(x_i) + f_simple(ξ, x_i) + Γ(ξ, x_i)
  Predictor: Solve f(x_i) + f_simple(ξ_pred, x_i) = 0 for ξ_pred
  Corrector: Solve f(x_i) + f_simple(ξ_corr, x_i) + Γ(ξ_pred, x_i) = 0 for ξ_corr
  x_{i+1} = x_i + ξ_corr
  i = i + 1
until |f(x_i)| ≤ ε
Output: x_i

Here f_simple(ξ, x) contains the simple, possibly low order, part of f, and Γ(ξ, x) the higher order terms, such that f_simple(0, x) = Γ(0, x) = 0. While in the previous example we introduced higher order terms into f that were not present before (f is only quadratic), usually such terms will already exist anyway. Hence the corrector step will just eliminate additional lower order terms, without too much additional error in the approximation. We will encounter such methods, for instance, in the context of interior point algorithms (Section 6.4), where we have to solve a set of quadratic equations.
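For the scalar example (6.32)-(6.36), the predictor and corrector steps amount to two divisions; the following sketch (with arbitrarily chosen a, b, f_0) exhibits the higher order corrector residual:

def predictor_corrector_step(a, b, f0):
    """One predictor-corrector step for f(x) = f0 + a*x + b*x**2 = 0,
    avoiding the square root required by the exact solution."""
    x_pred = -f0 / a                       # solve the linear part only
    x_corr = -(f0 + b * x_pred ** 2) / a   # reinsert the quadratic term
    return x_pred, x_corr

a, b, f0 = 2.0, 0.5, 0.1
x_pred, x_corr = predictor_corrector_step(a, b, f0)
f = lambda x: f0 + a * x + b * x ** 2
print(f(x_pred), f(x_corr))   # the corrector residual is much smaller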
6.3 Constrained Problems

After this digression on unconstrained optimization problems, let us return to constrained optimization, which makes up the main body of the problems we will have to deal with in learning (e.g., quadratic or general convex programs for Support Vector Machines). Typically, we have to deal with problems of type (6.6). For convenience we repeat the problem statement:

minimize_x f(x)
subject to c_i(x) ≤ 0 for all i ∈ [n].   (6.37)
Here f and the c_i are convex functions and n ∈ ℕ. In some cases³, we additionally have equality constraints e_j(x) = 0 for some j ∈ [n']. Then the optimization problem can be written as

minimize_x f(x)
subject to c_i(x) ≤ 0 for all i ∈ [n]
and e_j(x) = 0 for all j ∈ [n'].   (6.38)
3. Note that it is common practice in Support Vector Machines to write the c_i as positivity constraints, by using concave functions. This can be fixed by a sign change, however.
Before we start minimizing f, we have to discuss what optimality means in this case. Clearly, f'(x) = 0 is too restrictive a condition. For instance, f' could point in a direction which is forbidden by the constraints c_i and e_j. Then we could have optimality, even though f' ≠ 0. Let us analyze the situation in more detail.
6.3.1 Optimality Conditions
We start with optimality conditions for optimization problems which are independent of their differentiability. While it is fairly straightforward to state sufficient optimality conditions for arbitrary functions f and c_i, we will need convexity and "reasonably nice" constraints (see Lemma 6.23) to state necessary conditions. This is not a major concern, since for practical applications the constraint qualification criteria are almost always satisfied, and the functions themselves are usually convex and differentiable. Much of the reasoning in this section follows [345], which should also be consulted for further references and detail. Some of the most important sufficient criteria are the Kuhn-Tucker⁴ saddle point conditions [312]. As indicated previously, they are independent of assumptions on convexity or differentiability of the constraints c_i or objective function f.
Lagrangian
Theorem 6.21 (Kuhn-Tucker Saddle Point Condition [312, 345]) Assume an optimization problem of the form (6.37), where f : ℝ^m → ℝ and c_i : ℝ^m → ℝ for i ∈ [n] are arbitrary functions, and a Lagrangian

L(x, α) := f(x) + Σ_{i=1}^n α_i c_i(x).   (6.39)
If a pair of variables (x̄, ᾱ) with x̄ ∈ ℝ^m and ᾱ_i ≥ 0 for all i ∈ [n] exists, such that for all x ∈ ℝ^m and α ∈ [0, ∞)^n,

L(x̄, α) ≤ L(x̄, ᾱ) ≤ L(x, ᾱ),   (6.40)
then x̄ is a solution to (6.37). The parameters α_i are called Lagrange multipliers. As described in the later chapters, they will become the coefficients in the kernel expansion in SVMs.

Proof The proof follows [345]. Denote by (x̄, ᾱ) a pair of variables satisfying (6.40). From the first inequality it follows that

Σ_{i=1}^n α_i c_i(x̄) ≤ Σ_{i=1}^n ᾱ_i c_i(x̄) for all α ∈ [0, ∞)^n.   (6.41)

Since we are free to choose any α_i ≥ 0, we can see (by setting all but one of the terms α_i to ᾱ_i, and the remaining one to α_i = ᾱ_i + 1) that c_i(x̄) ≤ 0 for all i ∈ [n]. This shows that x̄ satisfies the constraints, i.e., it is feasible.
Since we are free to choose az > 0, we can see (by setting all but one of the terms o;z to a, and the remaining one to o:; = a/ + 1) that Cj(x) < 0 for all i € [n]. This shows that x satisfies the constraints, i.e. it is feasible. 4. An earlier version is due to Karush [283]. This is why often one uses the abbreviation KKT (Karush-Kuhn-Tucker) rather than KT to denote the optimality conditions.
Additionally, by setting one of the α_i to 0 and the rest to ᾱ_i, we see that ᾱ_i c_i(x̄) ≥ 0. Given c_i(x̄) ≤ 0 and ᾱ_i ≥ 0, the only way to satisfy this is by having

ᾱ_i c_i(x̄) = 0 for all i ∈ [n].   (6.42)
Eq. (6.42) is often referred to as the KKT condition [283, 312]. Finally, combining (6.42) and c_i(x) ≤ 0 with the second inequality in (6.40) yields f(x̄) ≤ f(x) for all feasible x. This proves that x̄ is optimal.

We can immediately extend Theorem 6.21 to accommodate equality constraints, by splitting them into the conditions e_j(x) ≤ 0 and e_j(x) ≥ 0. We obtain:

Theorem 6.22 (Equality Constraints) Assume an optimization problem of the form (6.38), where f, c_i, e_j : ℝ^m → ℝ for i ∈ [n] and j ∈ [n'] are arbitrary functions, and a Lagrangian

L(x, α, β) := f(x) + Σ_{i=1}^n α_i c_i(x) + Σ_{j=1}^{n'} β_j e_j(x).   (6.43)
If a set of variables (x̄, ᾱ, β̄) with x̄ ∈ ℝ^m, ᾱ ∈ [0, ∞)^n, and β̄ ∈ ℝ^{n'} exists, such that for all x ∈ ℝ^m, α ∈ [0, ∞)^n, and β ∈ ℝ^{n'},

L(x̄, α, β) ≤ L(x̄, ᾱ, β̄) ≤ L(x, ᾱ, β̄),   (6.44)
then x̄ is a solution to (6.38).

Now we determine when the conditions of Theorem 6.21 are necessary. We will see that convexity and sufficiently "nice" constraints are needed for (6.40) to become a necessary condition. The following lemma (see [345]) describes three constraint qualifications, which will turn out to be exactly what we need.
Feasible Region
Equivalence Between Constraint Qualifications
Lemma 6.23 (Constraint Qualifications) Denote by X ⊆ ℝ^m a convex set, and by c_1, ..., c_n : X → ℝ convex functions defining a feasible region by

X̄ := {x | x ∈ X and c_i(x) ≤ 0 for all i ∈ [n]}.   (6.45)
Then the following additional conditions on the c_i are connected by (i) ⟺ (ii) and (iii) ⟹ (i):
(i) There exists an x ∈ X such that c_i(x) < 0 for all i ∈ [n] (Slater's condition [500]).
(ii) For all nonzero α ∈ [0, ∞)^n there exists an x ∈ X such that Σ_{i=1}^n α_i c_i(x) < 0 (Karlin's condition [281]).
(iii) The feasible region X̄ contains at least two distinct elements, and there exists an x ∈ X̄ such that all c_i are strictly convex at x with respect to X̄ (strict constraint qualification).
The connection (i) ⟺ (ii) is also known as the Generalized Gordan Theorem [164]. The proof can be skipped if necessary. We need an auxiliary lemma, which we state without proof (see [345, 435] for details).
Figure 6.7 Two hyperplanes (and their normal vectors) separating the convex hull of a finite set of points from the origin.
Lemma 6.24 (Separating Hyperplane Theorem) Denote by X ⊆ ℝ^m a convex set not containing the origin 0. Then there exists a hyperplane with normal vector a ∈ ℝ^m such that aᵀx ≥ 0 for all x ∈ X. See also Figure 6.7.

Proof of Lemma 6.23. We prove {(i) ⟺ (ii)} by showing {(i) ⟹ (ii)} and {not (i) ⟹ not (ii)}.
(i) ⟹ (ii) For a point x ∈ X with c_i(x) < 0 for all i ∈ [n], any nonzero α ∈ [0, ∞)^n satisfies Σ_{i=1}^n α_i c_i(x) < 0, which is Karlin's condition.
not (i) ⟹ not (ii) Assume that there is no x with c_i(x) < 0 for all i ∈ [n]. Hence the set

Γ := {γ | γ ∈ ℝ^n and there exists some x ∈ X with γ_i > c_i(x) for all i ∈ [n]}   (6.46)
is convex and does not contain the origin. The latter follows directly from the assumption. For the former, take γ, γ' ∈ Γ with corresponding x, x' ∈ X, and λ ∈ (0, 1), to obtain

λγ_i + (1 − λ)γ'_i > λc_i(x) + (1 − λ)c_i(x') ≥ c_i(λx + (1 − λ)x') for all i ∈ [n],   (6.47)

and hence λγ + (1 − λ)γ' ∈ Γ.
Now by Lemma 6.24, there exists some a ∈ ℝ^n with ‖a‖_2 = 1 such that aᵀγ ≥ 0 for all γ ∈ Γ. Since each of the γ_i for γ ∈ Γ can be arbitrarily large (with respect to the other coordinates), we conclude a_i ≥ 0 for all i ∈ [n]. Denote by δ := inf_{x∈X} Σ_{i=1}^n a_i c_i(x), and by δ' := inf_{γ∈Γ} aᵀγ. One can see that by construction δ = δ'. By Lemma 6.24, a was chosen such that δ' ≥ 0, and hence δ ≥ 0. This contradicts (ii), however, since it implies the existence of a suitable nonzero α = a with Σ_{i=1}^n α_i c_i(x) ≥ 0 for all x ∈ X.
This shows that Xx + (1 — X)x' satisfies (z) and we are done. We proved Lemma 6.23 as it provides us with a set of constraint qualifications (conditions on the constraints) that allow us to determine cases where the KKT saddle point conditions are both necessary and sufficient. This is important, since we will use the KKT conditions to transform optimization problems into their duals, and solve the latter numerically. For this approach to be valid, however, we must ensure that we do not change the solvability of the optimization problem.
Theorem 6.25 (Necessary KKT Conditions [312, 553, 281]) Under the assumptions and definitions of Theorem 6.21, with the additional assumptions that f and the c_i are convex on the convex set X ⊆ ℝ^m (containing the set of feasible solutions as a subset), and that the c_i satisfy one of the constraint qualifications of Lemma 6.23, the saddle point criterion (6.40) is necessary for optimality.
Denote by x the solution to (6.37), and by X' the set
By construction, x̄ ∈ X'. Furthermore, there exists no x' ∈ X' such that all inequality constraints, including f(x) − f(x̄) ≤ 0, are satisfied as strict inequalities (otherwise x̄ would not be optimal). In other words, X' violates Slater's condition (i) of Lemma 6.23 (where both f(x) − f(x̄) and the c_i(x) together play the role of the c_i(x)), and thus also Karlin's condition (ii). This means that there exists a nonzero vector (α_0, α) ∈ ℝ^{n+1} with nonnegative entries such that

α_0 (f(x) − f(x̄)) + Σ_{i=1}^n α_i c_i(x) ≥ 0 for all x ∈ X.   (6.50)
In particular, for x = x̄ we get Σ_{i=1}^n α_i c_i(x̄) ≥ 0. In addition, since x̄ is a solution to (6.37), we have c_i(x̄) ≤ 0, and hence Σ_{i=1}^n α_i c_i(x̄) = 0. This allows us to rewrite (6.50) as

α_0 f(x) + Σ_{i=1}^n α_i c_i(x) ≥ α_0 f(x̄) + Σ_{i=1}^n α_i c_i(x̄) for all x ∈ X.   (6.51)
This looks almost like the second inequality of (6.40), except for the α_0 term (which we will return to later). But let us consider the first inequality next. Again, since c_i(x̄) ≤ 0, we have Σ_{i=1}^n α'_i c_i(x̄) ≤ 0 = Σ_{i=1}^n α_i c_i(x̄) for any α' ∈ [0, ∞)^n. Adding α_0 f(x̄) to both sides yields

α_0 f(x̄) + Σ_{i=1}^n α'_i c_i(x̄) ≤ α_0 f(x̄) + Σ_{i=1}^n α_i c_i(x̄) for all α' ∈ [0, ∞)^n.   (6.52)
This is almost all we need for the first inequality of (6.40).⁵ If α_0 > 0, we may divide (6.51) and (6.52) by α_0, set ᾱ := α/α_0, and we are done. If α_0 = 0, then (6.50) implies the existence of a nonzero α ∈ ℝ^n with nonnegative entries satisfying Σ_{i=1}^n α_i c_i(x) ≥ 0 for all x ∈ X. This contradicts Karlin's constraint qualification (ii), which allows us to rule out this case.
6.3.2 Duality and KKT-Gap
Now that we have formulated necessary and sufficient optimality conditions (Theorems 6.21 and 6.25) under quite general circumstances, let us put them to practical

5. The two inequalities (6.51) and (6.52) are also known as the Fritz John saddle point necessary optimality conditions [269], which play a similar role to the saddle point conditions of Theorem 6.21.
use for convex differentiable optimization problems. We first derive a more practically useful form of Theorem 6.21. Our reasoning is as follows: Eq. (6.40) states that L(x, α) has a saddle point at (x̄, ᾱ). Hence, all we have to do is write the saddle point conditions in the form of derivatives.
Primal and Dual Feasibility
Theorem 6.26 (KKT for Differentiable Convex Problems [312]) A solution to the optimization problem (6.37), with convex, differentiable f and c_i, is given by x̄, if there exists some ᾱ ∈ ℝ^n with ᾱ_i ≥ 0 for all i ∈ [n], such that the following conditions are satisfied:

∂_x L(x̄, ᾱ) = f'(x̄) + Σ_{i=1}^n ᾱ_i c'_i(x̄) = 0 (saddle point in x̄),   (6.53)
c_i(x̄) ≤ 0 for all i ∈ [n] (feasibility),   (6.54)
Σ_{i=1}^n ᾱ_i c_i(x̄) = 0 (vanishing KKT-gap).   (6.55)
Proof The easiest way to prove Theorem 6.26 is to show that, for any feasible x, we have f(x) − f(x̄) ≥ 0. Due to convexity we may linearize, and obtain

f(x) − f(x̄) ≥ f'(x̄)ᵀ(x − x̄)   (6.56)
= −Σ_{i=1}^n ᾱ_i c'_i(x̄)ᵀ(x − x̄)   (6.57)
≥ Σ_{i=1}^n ᾱ_i (c_i(x̄) − c_i(x))   (6.58)
= −Σ_{i=1}^n ᾱ_i c_i(x) ≥ 0.   (6.59)
Optimization by Constraint Satisfaction
Here we used the convexity and differentiability of f to arrive at the rhs of (6.56), and that of the c_i for (6.58). To obtain (6.57), we exploited the fact that at the saddle point, ∂_x f(x̄) can be replaced by the corresponding expansion in the ∂_x c_i(x̄); that is, we used (6.53). Finally, for (6.59), we used the fact that the KKT gap vanishes at the optimum (6.55), and that the constraints are satisfied (6.54), both by x̄ and by x.
In other words, we may solve a convex optimization problem by finding (x̄, ᾱ) that satisfy the conditions of Theorem 6.26. Moreover, these conditions, together with the constraint qualifications of Lemma 6.23, ensure necessity. Note that we have transformed the problem of minimizing functions into one of solving a set of equations, for which several numerical tools are readily available. This is exactly how interior point methods work (see Section 6.4 for details on how to implement them). Necessary conditions on the constraints, similar to those discussed previously, can also be formulated (see [345] for a detailed discussion). The other consequence of Theorem 6.26, or rather of the definition of the Lagrangian L(x, α), is that we may bound f(x̄) = L(x̄, ᾱ) from above and below without explicit knowledge of f(x̄).
Theorem 6.27 (KKT-Gap) Assume an optimization problem of type (6.37), where both f and the c_i are convex and differentiable. Denote by x̄ its solution. Then for any set of variables (x, α), with α_i ≥ 0 for all i ∈ [n], satisfying

f'(x) + Σ_{i=1}^n α_i c'_i(x) = 0   (6.60)
and c_i(x) ≤ 0 for all i ∈ [n],   (6.61)
Bounding the Error
we have

f(x) ≥ f(x̄) ≥ f(x) + Σ_{i=1}^n α_i c_i(x).   (6.62)
Strictly speaking, we only need differentiability of f and the c_i at x̄. However, since x̄ is only known after the optimization problem has been solved, this is not a very useful condition.

Proof The first part of (6.62) follows from the fact that x satisfies the constraints (6.61), while x̄ is optimal. Next, note that L(x̄, ᾱ) = f(x̄), where (x̄, ᾱ) denotes the saddle point of L. For the second part, note that due to the saddle point condition (6.40), we have for any α with α_i ≥ 0,

f(x̄) = L(x̄, ᾱ) ≥ L(x̄, α) ≥ min_{x'} L(x', α).   (6.63)
The function L(x', α) is convex in x', since both f and the constraints c_i are convex and all α_i ≥ 0. Therefore (6.60) implies that x minimizes L(x', α), i.e., min_{x'} L(x', α) = L(x, α) = f(x) + Σ_{i=1}^n α_i c_i(x). This proves the second part of (6.63), which in turn proves the second inequality of (6.62).

Hence, no matter what algorithm we use to solve (6.37), we may always apply (6.62) to assess the proximity of the current set of parameters to the solution. Clearly, the relative size of Σ_{i=1}^n α_i c_i(x) provides a useful stopping criterion for convex optimization algorithms. Finally, another concept that is useful when dealing with optimization problems is that of duality. This means that for the primal minimization problem considered so far, which is expressed in terms of x, we can find a dual maximization problem in terms of α, by computing the saddle point of the Lagrangian L(x, α) and eliminating the primal variables x. We thus obtain the following dual maximization problem from (6.37):

maximize_{x,α} L(x, α)
subject to ∂_x L(x, α) = 0 and α_i ≥ 0 for all i ∈ [n].   (6.64)
We state without proof a theorem guaranteeing the existence of a solution to (6.64).
Existence of Dual Solution
Theorem 6.28 (Wolfe [607]) Recall the definition of X̄ (6.45), and of the optimization problem (6.37). Under the assumptions that X is an open set, that one of the constraint qualifications of Lemma 6.23 is satisfied, and that f and the c_i are all convex and differentiable, there exists an ᾱ ∈ ℝ^n such that (x̄, ᾱ) solves the dual optimization problem (6.64), and in addition L(x̄, ᾱ) = f(x̄).
In order to prove Theorem 6.28, we first have to show that some (x̄, ᾱ) exists satisfying the KKT conditions, and then use the fact that the KKT-Gap at the saddle point vanishes.
Primal Linear Program
Unbounded and Infeasible Problems
6.3.3 Linear and Quadratic Programs
Let us analyze the notions of primal and dual objective functions in more detail by looking at linear and quadratic programs. We begin with a simple linear setting:⁶

minimize_x cᵀx
subject to Ax + d ≤ 0,   (6.65)
where c, x ∈ ℝ^m, d ∈ ℝ^n, and A ∈ ℝ^{n×m}, and where Ax + d ≤ 0 is a shorthand for Σ_{j=1}^m A_{ij}x_j + d_i ≤ 0 for all i ∈ [n]. It is far from clear that (6.65) always has a solution, or indeed a minimum. For instance, the set of x satisfying Ax + d ≤ 0 might be empty, or it might contain rays going to infinity in directions along which cᵀx keeps decreasing. Before we deal with this issue in more detail, let us compute the sufficient KKT conditions for optimality, and the dual of (6.65). We may use Theorem 6.26, since (6.65) is clearly differentiable and convex. In particular, we obtain:
Theorem 6.29 (KKT Conditions for Linear Programs) A sufficient condition for a solution to the linear program (6.65) to exist is that the following four conditions are satisfied for some (x̄, ᾱ) ∈ ℝ^{m+n}:

c + Aᵀᾱ = 0 (dual feasibility),   (6.66)
Ax̄ + d ≤ 0 (primal feasibility),   (6.67)
ᾱᵀ(Ax̄ + d) = 0 (vanishing KKT-gap),   (6.68)
ᾱ ≥ 0.   (6.69)

Then the minimum is given by cᵀx̄.
Then the minimum is given by CTX. Note that, depending on the choice of A and d, there may not always exist an x such that A x + d < 0, in which case the constraint does not satisfy the conditions of Lemma 6.23. In this situation, no solution exists for (6.65). If a feasible x exists, however, then (projections onto lower dimensional subspaces aside) the constraint qualifications are satisfied on the feasible set, and the conditions above are necessary. See [334,345,555] for details. 6. Note that we encounter a small clash of notation in (6.65), since c is used as a symbol for the loss function in the remainder of the book. This inconvenience is outweighed, however, by the advantage of consistency with the standard literature (e.g., [345, 45, 555]) on optimization. The latter will allow the reader to read up on the subject without any need for cumbersome notational changes.
Dual Linear Program
Primal Solution ⟺ Dual Solution
Dual Dual Linear Program → Primal
Next we may compute Wolfe's dual optimization problem, by substituting (6.66) into L(x, α). Consequently, the primal variables x vanish, and we obtain a maximization problem in terms of α only:

maximize_α dᵀα
subject to c + Aᵀα = 0 and α ≥ 0.   (6.70)
Note that the number of variables and constraints has changed: we started with m variables and n constraints; now we have n variables, together with m equality constraints and n inequality constraints. While it is not yet completely obvious in the linear case, dualization may render optimization problems more amenable to numerical solution (though the contrary may be true as well). What happens if a solution x̄ of the primal problem (6.65) exists? In this case we know (since the KKT conditions of Theorem 6.29 are necessary and sufficient) that there must be an ᾱ solving the dual problem, since L(x, α) has a saddle point at (x̄, ᾱ). If no feasible point of the primal problem exists, there must exist, by (a small modification of) Lemma 6.23, some α ∈ ℝ^n with α ≥ 0 and at least one α_i > 0, such that αᵀ(Ax + d) > 0 for all x. This means that for every x, the Lagrangian L(x, α) is unbounded from above, since we can make αᵀ(Ax + d) arbitrarily large. Hence the dual optimization problem is unbounded. By analogous reasoning, if the primal problem is unbounded, the dual problem is infeasible.
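These relations can be observed numerically; the following sketch poses a small instance of (6.65) and its dual (6.70) with scipy.optimize.linprog (the data A, d, c are an arbitrary illustration):

import numpy as np
from scipy.optimize import linprog

A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
d = np.array([0.0, 0.0, -4.0])     # constraints: x >= 0, x_1 + x_2 <= 4
c = np.array([1.0, 2.0])

# Primal (6.65): minimize c^T x subject to A x + d <= 0
primal = linprog(c, A_ub=A, b_ub=-d, bounds=[(None, None)] * 2)

# Dual (6.70): maximize d^T alpha subject to c + A^T alpha = 0, alpha >= 0
dual = linprog(-d, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 3)

print(primal.fun, -dual.fun)       # the extrema coincide (Theorem 6.30)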
We can remove y > 0 from the set of variables by transforming Ax' + d + y into Ax + d < 0; thus we recover the primal optimization problem (6.65)7 The following theorem gives an overview of the transformations and relations between primal and dual problems (see also Table 6.2). Although we only derived these relations for linear programs, they also hold for other convex differentiable settings [45].
Theorem 6.30 (Trichotomy) For linear and convex quadratic programs exactly one of 7. This finding is useful if we have to dualize twice in some optimization settings (see Chapter 10), since then we will be able to recover some of the primal variables without further calculations if the optimization algorithm provides us with both primal and dual variables.
Table 6.2 Connections between primal and dual linear and convex quadratic programs.

Primal Optimization Problem (in x)  |  Dual Optimization Problem (in α)
solution exists  |  solution exists and the extrema are equal
no solution exists  |  maximization problem has an objective unbounded from above, or is infeasible
minimization problem has an objective unbounded from below, or is infeasible  |  no solution exists
inequality constraint  |  nonnegative variable
equality constraint  |  free variable

Theorem 6.30 (Trichotomy) For linear and convex quadratic programs, exactly one of
the following three alternatives must hold: 1. Both feasible regions are empty. 2. Exactly one feasible region is empty, in which case the objective function of the other problem is unbounded in the direction of optimization. 3. Both feasible regions are nonempty, in which case both problems have solutions and their extrema are equal.
Primal Quadratic Program
We conclude this section by stating the primal and dual optimization problems, and the sufficient KKT conditions, for convex quadratic optimization problems. To keep matters simple, we only consider the following type of optimization problem (other problems can be rewritten in the same form; see Problem 6.11 for details):

minimize_x (1/2) xᵀKx + cᵀx
subject to Ax + d ≤ 0.   (6.72)
Here K is a strictly positive definite matrix, x, c ∈ ℝ^m, A ∈ ℝ^{n×m}, and d ∈ ℝ^n. Note that this is clearly a differentiable convex optimization problem. To introduce a Lagrangian we need corresponding multipliers α ∈ ℝ^n with α ≥ 0. We obtain

L(x, α) = (1/2) xᵀKx + cᵀx + αᵀ(Ax + d).   (6.73)
Next we may apply Theorem 6.26 to obtain the KKT conditions. They can be stated in analogy to (6.66)-(6.69) as

Kx̄ + c + Aᵀᾱ = 0,   (6.74)
Ax̄ + d ≤ 0,   (6.75)
ᾱᵀ(Ax̄ + d) = 0,   (6.76)
ᾱ ≥ 0.   (6.77)
In order to compute the dual of (6.72), we have to eliminate x from (6.73) and write the Lagrangian as a function of α alone. We obtain

L(x, α) = −(1/2) xᵀKx + αᵀd   (6.78)
= −(1/2) (c + Aᵀα)ᵀ K⁻¹ (c + Aᵀα) + αᵀd.   (6.79)
Dual Quadratic Program
In (6.78) we used (6.74) directly, in the form xᵀ(Kx + c + Aᵀα) = 0, whereas in order to eliminate x completely in (6.79), we solved (6.74) for x = −K⁻¹(c + Aᵀα). Ignoring constant terms, this leads to the dual quadratic optimization problem:

maximize_α −(1/2) αᵀ A K⁻¹ Aᵀ α + αᵀ(d − A K⁻¹ c)
subject to α ≥ 0.   (6.80)
The surprising fact about the dual problem (6.80) is that its constraints are significantly simpler than those of the primal (6.72). Furthermore, if n < m, we also obtain a more compact representation of the quadratic term. There is one aspect, however, in which (6.80) differs from its linear counterpart (6.70): if we dualize (6.80) again, we do not recover (6.72), but rather a problem very similar in structure to (6.80). Dualizing (6.80) twice, however, we do recover the dual itself (Problem 6.13 deals with this matter in more detail).
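The passage from (6.72) to (6.80) is mechanical, and the primal solution can be recovered from the dual one via (6.74). A sketch with arbitrarily chosen problem data, using scipy's bound-constrained L-BFGS-B as a stand-in solver for the dual:

import numpy as np
from scipy.optimize import minimize

K = np.array([[2.0, 0.5], [0.5, 1.0]])
c = np.array([-2.0, -1.0])
A = np.array([[1.0, 1.0]])
d = np.array([-1.0])               # constraint: x_1 + x_2 <= 1

Kinv = np.linalg.inv(K)
Q = A @ Kinv @ A.T                 # quadratic term of the dual (6.80)
q = d - A @ Kinv @ c               # linear term of the dual (6.80)

res = minimize(lambda a: 0.5 * a @ Q @ a - q @ a, np.zeros(1),
               jac=lambda a: Q @ a - q, bounds=[(0, None)],
               method="L-BFGS-B")
alpha = res.x
x = -Kinv @ (c + A.T @ alpha)      # recover the primal solution via (6.74)
print(x, alpha)                    # approx. [0.75, 0.25] and [0.375]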
6.4 Interior Point Methods

Let us now have a look at simple, yet efficient, optimization algorithms for constrained problems: interior point methods. An interior point is a pair of variables (x, α) that satisfies both the primal and dual constraints. As mentioned before, finding a set of vectors (x̄, ᾱ) that satisfy the KKT conditions is sufficient to obtain a solution in x̄. Hence, all we have to do is devise an algorithm which solves (6.74)-(6.77), if we want to solve a quadratic program. We will focus on the quadratic case; the changes required for linear programs merely involve the removal of some variables, simplifying the equations. See Problem 6.14 and [555, 512] for details.
6.4.1 Sufficient Conditions for a Solution
We need a slight modification of (6.74)-(6.77) in order to achieve our goal: rather than the inequality (6.75), we are better off with an equality, plus a positivity constraint on an additional variable; i.e., we transform Ax + d ≤ 0 into Ax + d + ξ = 0, where ξ ≥ 0.
Hence we arrive at the following system of equations:

Kx + c + Aᵀα = 0 (dual feasibility)
Ax + d + ξ = 0 (primal feasibility)
αᵀξ = 0
α, ξ ≥ 0.   (6.81)
Optimality as Constraint Satisfaction
Let us analyze the equations in more detail. We have three sets of variables: x, α, ξ. To determine them, we have an equal number of equations, plus the positivity constraints on α and ξ. While the first two equations are linear, and thus amenable to solution, e.g., by matrix inversion, the third equality αᵀξ = 0 has a small defect: given one variable, say α, we cannot solve it for ξ, or vice versa. Furthermore, the last two constraints are not very informative either. We use a primal-dual path-following algorithm, as proposed in [556], to solve this problem. Rather than requiring αᵀξ = 0, we modify it to become α_iξ_i = μ > 0 for all i ∈ [n], solve (6.81) for a given μ, and decrease μ to 0 as we go. The advantage of this strategy is that we may use a Newton-type predictor-corrector algorithm (see Section 6.2.5) to update the parameters x, α, ξ, which exhibits the fast convergence of a second order method.
Linearized Constraints
6.4.2 Solving the Equations
For the moment, assume that we have suitable initial values of x, α, ξ, and μ, with α, ξ > 0. Linearizing the first two equations of (6.81), together with the modification α_iξ_i = μ, yields (we expand x into x + Δx, etc.)

KΔx + AᵀΔα = −(Kx + c + Aᵀα) =: ρ_d
AΔx + Δξ = −(Ax + d + ξ) =: ρ_p
α_iΔξ_i + ξ_iΔα_i = μ − α_iξ_i − Δα_iΔξ_i ≈ μ − α_iξ_i =: (ρ_KKT)_i.   (6.82)
Next we solve the third equation for Δξ_i and substitute, to obtain what is commonly referred to as the reduced KKT system. For convenience we use D := diag(α_1⁻¹ξ_1, ..., α_n⁻¹ξ_n) as a shorthand:

[ K   Aᵀ ] [ Δx ]   [ ρ_d ]
[ A  −D ] [ Δα ] = [ ρ_p − diag(α_1⁻¹, ..., α_n⁻¹) ρ_KKT ].   (6.83)
We apply a predictor-corrector method, as in Section 6.2.5. The matrix of the linear system (6.83) is indefinite but of full rank, and we can solve (6.83) for (Δx_pred, Δα_pred) by explicitly pivoting for individual entries (for instance, solve for Δx first, and then substitute the result into the second equation to obtain Δα). This gives us the predictor part of the solution. Next we have to correct for the linearization, which is conveniently achieved by updating ρ_KKT (using the predictor values for the neglected Δα_iΔξ_i term) and solving (6.83) again, to obtain the corrector values (Δx_corr, Δα_corr). The value of Δξ is then obtained from (6.82).
Update in x, α
Next, we have to make sure that the updates in α and ξ do not cause these estimates to violate their positivity constraints. This is done by shrinking the length of (Δx, Δα, Δξ) by some factor λ > 0, chosen such that

α_i + λΔα_i ≥ εα_i and ξ_i + λΔξ_i ≥ εξ_i for all i ∈ [n].   (6.84)
Of course, only the negative Δα_i and Δξ_i pose a problem, since they move the parameter values closer to 0, which may bring them into conflict with the positivity constraints. Typically [556, 502], we choose ε = 0.05. In other words, the solution will not approach the boundaries in α and ξ by more than 95%. See Problem 6.15 for a formula to compute λ.
Tightening the KKT Conditions
6.4.3 Updating μ
Next we have to update μ. Here we face the following dilemma: if we decrease μ too quickly, we will get bad convergence of our second order method, since the solution of the problem (which depends on the value of μ) moves too quickly away from our current set of parameters (x, α, ξ). On the other hand, we do not want to spend too much time solving an approximation of the unrelaxed (μ = 0) KKT conditions exactly. A good indication is how much the positivity constraints would be violated by the current update. Vanderbei [556] proposes the following update of μ:

μ = (αᵀξ/n) ((1 − λ)/(10 + λ))².   (6.85)
The first term gives the average value of satisfaction of the condition α_iξ_i = μ after an update step. The second term allows us to decrease μ rapidly if good progress was made (small (1 − λ)²). Experimental evidence shows that it pays to be slightly more conservative, and to use the predictor estimates of α and ξ in (6.85), rather than the corresponding corrector terms.⁸ This imposes little overhead for the implementation.
Regularized KKT System
6.4.4 Initial Conditions and Stopping Criterion
To provide a complete algorithm, we have to consider two more things: a stopping criterion and a suitable starting value. For the latter, we simply solve a regularized version of the initial reduced KKT system (6.83). This means that we replace K by K + 1, use (x, α) in place of (Δx, Δα), and replace D by the identity matrix. Moreover, ρ_p and ρ_d are set to the values they would have if all variables had been set to 0 beforehand, and ρ_KKT is set to 0. In other words, we obtain an initial guess of

8. In practice, it is often useful to replace (1 − λ) by (1 + ε' − λ) for some small ε' > 0, in order to avoid μ = 0.
(x, α, ξ) by solving

(K + 1)x + Aᵀα = −c
Ax − α = −d,   (6.86)
and setting ξ = −Ax − d. Since we have to ensure positivity of α and ξ, we simply replace

α_i ← max(α_i, 1) and ξ_i ← max(ξ_i, 1) for all i ∈ [n].   (6.87)
This heuristic solves the problem of a suitable initial condition. Regarding the stopping criterion, we recall Theorem 6.27, and in particular (6.62). Rather than obtaining bounds on the precision of the parameters, we want to make sure that f(x) is close to its optimal value f(x̄). From (6.64) we know, provided the feasibility constraints are all satisfied, that the value of the dual objective function is given by f(x) + Σ_{i=1}^n α_i c_i(x). We may use the latter to bound the relative size of the gap between the primal and dual objective functions by

Gap(x, α) := −Σ_{i=1}^n α_i c_i(x) / |f(x) + Σ_{i=1}^n α_i c_i(x)|.   (6.88)
For the special case where f(x) = (1/2)xᵀKx + cᵀx, as in (6.72), we know by virtue of (6.73) that the size of the feasibility gap is given by αᵀξ, and therefore

Gap(x, α) = αᵀξ / |(1/2)xᵀKx + cᵀx − αᵀξ|.   (6.89)
Number of Significant Figures
In practice, a small number is usually added to the denominator of (6.89), in order to avoid divisions by 0 in the first iteration. The quality of the solution is typically measured on a logarithmic scale by −log₁₀ Gap(x, α), the number of significant figures.⁹ We will come back to specific versions of such interior point algorithms in Chapter 10, and show how Support Vector Regression and Classification problems can be solved with them. Primal-dual path-following methods are certainly not the only algorithms that can be employed for minimizing constrained quadratic problems. Other variants, for instance, are Barrier Methods [282, 45, 557], which minimize the unconstrained problem

f(x) − μ Σ_{i=1}^n ln(−c_i(x)) for μ > 0, μ → 0.   (6.90)

9. Interior point codes are very precise. They usually achieve up to 8 significant figures, whereas iterative approximation methods do not normally exceed more than 3 significant figures on large optimization problems.
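As an illustration of the barrier idea (6.90), rather than of the primal-dual predictor-corrector algorithm described above, the following sketch minimizes the barrier objective for the QP (6.72) by Newton steps while driving μ towards 0; for simplicity it assumes that x = 0 is strictly feasible:

import numpy as np

def barrier_qp(K, c, A, d, mu=1.0, shrink=0.5, tol=1e-8):
    """Log-barrier method (6.90) for the QP (6.72): minimize
    0.5 x^T K x + c^T x - mu * sum(log(-(A x + d))), letting mu -> 0."""
    x = np.zeros(K.shape[0])
    assert np.all(A @ x + d < 0), "x = 0 must be strictly feasible here"
    while mu > tol:
        for _ in range(50):                  # Newton steps at fixed mu
            s = -(A @ x + d)                 # slack variables, s > 0
            grad = K @ x + c + mu * A.T @ (1.0 / s)
            hess = K + mu * A.T @ np.diag(1.0 / s**2) @ A
            dx = np.linalg.solve(hess, -grad)
            t = 1.0
            while np.any(A @ (x + t * dx) + d >= 0):
                t *= 0.5                     # stay strictly feasible
            x = x + t * dx
            if np.linalg.norm(grad) < tol:
                break
        mu *= shrink
    return x

K = np.array([[2.0, 0.5], [0.5, 1.0]]); c = np.array([-2.0, -1.0])
A = np.array([[1.0, 1.0]]); d = np.array([-1.0])
print(barrier_qp(K, c, A, d))                # approaches [0.75, 0.25]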
Active set methods have also been used with success in machine learning [369, 284]. These select subsets of variables x for which the constraints c_i are not
active, i.e., where we have a strict inequality, and solve the resulting restricted quadratic program, for instance by conjugate gradient descent. We will encounter subset selection methods in Chapter 10.
6.5 Maximum Search Problems
Approximations
In several cases, the task of finding an optimal function for estimation purposes means finding the best element of a finite set, or sometimes finding an optimal subset of a finite set of elements. These are discrete (sometimes combinatorial) optimization problems, which are not as easily amenable to the techniques presented in the previous two sections. Furthermore, many commonly encountered problems are computationally expensive if solved exactly. Instead, by using probabilistic methods, it is possible to find almost optimal approximate solutions. These probabilistic methods are the topic of the present section.
6.5.1 Random Subset Selection
Consider the following problem: given a set of m functions, say M := {f_1, ..., f_m}, and some criterion Q[f], find the function f̂ that maximizes Q[f]. More formally,

f̂ := argmax_{f∈M} Q[f].   (6.91)
Clearly, unless we have additional knowledge about the values Q[f_i], we have to compute all m terms Q[f_i] if we want to solve (6.91) exactly. This costs O(m) operations. If m is large, which is often the case in practical applications, this operation is too expensive. In sparse greedy approximation problems (Section 10.2), or in Kernel Feature Analysis (Section 14.4), m can easily be of the order of 10^5 or larger (here, m is the number of training patterns). Hence we have to look for cheaper, approximate solutions. The key idea is to pick a random subset M' ⊂ M that is sufficiently large, and take the maximum over M' as an approximation of the maximum over M. Provided the distribution of the values of Q[f_i] is "well behaved", i.e., there does not exist a small fraction of the Q[f_i] whose values are significantly smaller or larger than the average, we will obtain a solution that is close to the optimum with high probability. To formalize these ideas, we need the following result.
Lemma 6.31 (Maximum of Random Variables) Denote by ξ, ξ' two independent random variables on ℝ, with corresponding distributions P_ξ, P_ξ' and distribution functions F_ξ, F_ξ'. Then the random variable ξ̄ := max(ξ, ξ') has the distribution function F_ξ̄ = F_ξ F_ξ'.
Proof Note that for a random variable, the distribution function F(ξ_0) is given by
the probability P{ξ ≤ ξ_0}. Since ξ and ξ' are independent, we may write

F_ξ̄(ξ_0) = P{max(ξ, ξ') ≤ ξ_0} = P{ξ ≤ ξ_0} P{ξ' ≤ ξ_0} = F_ξ(ξ_0) F_ξ'(ξ_0),   (6.92)
which proves the claim.

Distribution Over ξ̄ is More Peaked

Repeated application of Lemma 6.31 leads to the following corollary.

Corollary 6.32 (Maximum Over Identical Random Variables) Let ξ_1, ..., ξ_m̃ be m̃ independent and identically distributed (iid) random variables, with corresponding distribution function F_ξ. Then the random variable ξ̄ := max(ξ_1, ..., ξ_m̃) has the distribution function F_ξ̄(ξ) = (F_ξ(ξ))^m̃.
Best Element of a Subset
In practice, the random variables ξ_i will be the values of Q[f_i], where the f_i are drawn from the set M. If we draw them without replacement (i.e., none of the functions f_i appears twice), however, the values after each draw are dependent, and we cannot apply Corollary 6.32 directly. Nonetheless, we can see that the maximum over draws without replacement will be larger than the maximum with replacement, since recurring observations can be understood as reducing the effective size of the set under consideration. Thus Corollary 6.32 gives us a lower bound on the value of the distribution function for draws without replacement. Moreover, for large m, the difference between draws with and without replacement is small. If the distribution of Q[f_i] is known, we may use it directly to determine the size m̃ of a subset needed to find some Q[f_i] that is almost as good as the solution of (6.91). In all other cases, we have to resort to assessing the relative quality of maxima over subsets. The following theorem tells us how.

Theorem 6.33 (Ranks on Random Subsets) Denote by M := {x_1, ..., x_m} a set of cardinality m, and by M̃ ⊆ M a random subset of size m̃. Then the probability that max M̃ is greater than or equal to n elements of M is at least 1 − (n/m)^m̃.

Proof We prove this by bounding the probability of the converse event, namely that max M̃ is smaller than (m − n) elements of M, i.e., that M̃ is contained in the subset M_low ⊂ M of the n smallest elements. For m̃ = 1, this probability is n/m, since there are n elements of M_low to choose from. For m̃ > 1, the probability is that of choosing all m̃ elements from M_low, rather than from all of M. Therefore we have

P(M̃ ⊆ M_low) ≤ (n/m)^m̃.
Consequently, the probability that the maximum over M̃ is larger than n elements of M is given by 1 − P(M̃ ⊆ M_low) ≥ 1 − (n/m)^m̃.

The practical consequence is that we may use 1 − (n/m)^m̃ to compute the size of a random subset required to achieve a desired degree of approximation. If we want to obtain results in the n/m percentile range with confidence 1 − η, we must
solve 1 − (n/m)^m̃ ≥ 1 − η for m̃, which yields m̃ ≥ ln η / ln(n/m). To give a numerical example: if we desire values that are better than 95% of all other estimates, with probability at least 1 − 0.05, then m̃ = 59 samples are sufficient (since ln 0.05 / ln 0.95 ≈ 58.4). This (95%, 95%, 59) rule is very useful in practice.¹⁰ A similar method was used to speed up the process of boosting classifiers in the MadaBoost algorithm [143]. Furthermore, one could ask whether it might not be useful to recycle old observations, rather than computing all 59 values from scratch. If this can be done cheaply, and under some additional independence assumptions, subset selection methods can be improved further. For details see [424], where the method is used in the context of memory management for operating systems.
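The rule is easily verified by simulation (the values below are arbitrary stand-ins for the Q[f_i]):

import numpy as np

rng = np.random.default_rng(0)
values = rng.standard_normal(100000)
threshold = np.quantile(values, 0.95)      # boundary of the top 5%

trials = 10000
hits = sum(values[rng.integers(0, len(values), 59)].max() >= threshold
           for _ in range(trials))
print(hits / trials)                       # close to 1 - 0.95**59, approx. 0.95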
6.5.2 Approximating Sums by Partial Sums
Deviation of Subsets
Random Evaluation
Quite often, the evaluation of the term Q[f] itself is rather time consuming, especially if Q[f] is the sum of many (say, m) iid random variables. Again, we can speed matters up considerably by using probabilistic methods. The key idea is that averages over independent random variables are concentrated; which is to say, averages over subsets do not differ too much from averages over the whole set. Hoeffding's Theorem (Section 5.2) quantifies the size of the deviations between the expectation of a sum of random variables and their values at individual trials. We will use this to bound deviations between averages over sets and subsets. All we have to do is translate Theorem 5.1 into a statement regarding sample averages over different sample sizes. This can readily be done as follows.
Corollary 6.34 (Deviation Bounds for Empirical Means [508]) Suppose ξ_1, ..., ξ_m are iid bounded random variables, falling into the interval [a, a + b] with probability one. Denote their average by Q_m = (1/m) Σ_i ξ_i. Furthermore, denote by ξ_{s(1)}, ..., ξ_{s(m̃)} a subset of these variables, with s : {1, ..., m̃} → {1, ..., m} being an injective map (i.e., s(i) = s(j) only if i = j), and let Q_m̃ = (1/m̃) Σ_i ξ_{s(i)}. Then for any ε > 0,

P{Q_m̃ − Q_m ≥ ε} ≤ exp(−(2ε² m̃ m)/((m − m̃) b²)).   (6.93)
Proof By construction, E[Q_m̃ − Q_m] = 0, since Q_m̃ and Q_m are both averages over random variables drawn from the same distribution. Hence we only have to rewrite Q_m̃ − Q_m as an average over (different) random variables in order to apply Hoeffding's bound. Since all ξ_i are identically distributed, we may pick the first m̃ random variables without loss of generality; in other words, we assume that s(i) = i.

10. During World War I, tanks were often numbered in continuously increasing order. Unfortunately, this "feature" allowed the enemy to estimate the number of tanks. How?
Thus we may split up Q_m̃ − Q_m into a sum of m̃ random variables with range b_i = (1/m̃ − 1/m) b, and m − m̃ random variables with range b_i = b/m. We obtain

    Σ_i b_i² = m̃ (1/m̃ − 1/m)² b² + (m − m̃) (b/m)² = ((m − m̃)/(m̃ m)) b².
Substituting this into (5.7), and noting that Q_m̃ − Q_m − E[Q_m̃ − Q_m] = Q_m̃ − Q_m, completes the proof.
Cutoff Criterion
For small m̃/m, the rhs of (6.93) reduces to exp(−2m̃ε²/b²). In other words, deviations on the subsample of size m̃ dominate the overall deviation of Q_m̃ − Q_m from 0. This allows us to compute a cutoff criterion for evaluating Q_m by computing only a subset of its terms: we need only solve (6.93) for m̃. Hence, in order to ensure that Q_m̃ is within ε of Q_m with probability at least 1 − η, we have to take a fraction m̃/m of samples that satisfies

    m̃/m ≥ ( 1 + 2mε² / (b² ln(2/η)) )^(−1).
The fraction m̃/m can be small for large m, which is exactly the case where we need methods to speed up evaluation.
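As an illustration (a sketch of our own, not code from the book), the following Python fragment estimates a large average from a random subsample whose size is chosen according to the reasoning above; eps, eta, and the range b are assumed to be given:

    import math
    import random

    def partial_mean(xs, eps, eta, b):
        # Average a random fraction of xs such that the result is within
        # eps of the full average with probability at least 1 - eta,
        # using the Hoeffding-type bound (6.93); b is the range of the xs.
        m = len(xs)
        c = b ** 2 * math.log(2.0 / eta) / (2.0 * eps ** 2)
        m_tilde = min(m, math.ceil(m * c / (m + c)))
        sample = random.sample(xs, m_tilde)  # draw without replacement
        return sum(sample) / m_tilde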
6.5.3 Applications
Greedy Optimization Strategies
Quite often, the overall goal is not to find the single best element x_i from a set X, but rather to find a good subset X̃ ⊂ X of size m̃ according to some quality criterion Q[X̃]. Problems of this type include approximating a matrix by a subset of its rows and columns (Section 10.2), finding approximate solutions to Kernel Fisher Discriminant Analysis (Chapter 15), and finding a sparse solution to the problem of Gaussian Process Regression (Section 16.3.4). These all have a common structure:

(i) Finding an optimal set X̃ ⊂ X is quite often a combinatorial problem, and may even be NP-hard, since it means selecting m̃ = |X̃| elements from a set of m = |X| elements. There are (m choose m̃) different choices, which clearly rules out an exhaustive search over all of them. Additionally, the size of m̃ is often not known beforehand. Hence we need a fast approximate algorithm.

(ii) The evaluation of Q[X̃ ∪ {x_i}] is inexpensive, provided Q[X̃] has been computed before. This indicates that an iterative algorithm can be useful.

(iii) The value of Q[X], or equivalently, how well we would do by taking the whole set X, can be bounded efficiently using Q[X̃] (or some by-products of the computation of Q[X̃]), without actually computing Q[X].
Algorithm 6.6 Sparse Greedy Algorithm

Require: Set X, precision ε, criterion Q[·]
Set X̃ = ∅
repeat
    Choose a random subset X' of size m' from X \ X̃.
    Pick x = argmax_{x ∈ X'} Q[X̃ ∪ {x}]
    Set X̃ = X̃ ∪ {x}
    If needed, (re)compute the bound on Q[X].
until Q[X̃] + ε ≥ bound on Q[X]
Output: X̃, Q[X̃]
(iv) The set X is typically very large (containing, for instance, more than 10^5 elements), yet the individual improvements Q[X̃ ∪ {x_i}] do not differ too much; that is, there are no specific x_i for which Q[X̃ ∪ {x_i}] deviates by a large amount from the rest.
Iterative Enlargement of X̃
In this case, we may use a sparse greedy algorithm to find near-optimal solutions among the remaining elements of X \ X̃. It combines the idea of an iterative enlargement of X̃ by one element at a time (which is feasible, since we can compute Q[X̃ ∪ {x_i}] cheaply) with the insight that we need not consider all remaining x_i as candidates for the enlargement. The latter uses the reasoning of Section 6.5.1, combined with the fact that the distribution of the improvements is not too long-tailed (cf. (iv)). The overall strategy is described in Algorithm 6.6; a code sketch is given below. Problems 6.9 and 6.10 contain more examples of sparse greedy algorithms.
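A minimal Python rendering of Algorithm 6.6 might look as follows (a sketch under our own naming conventions; Q is the quality criterion and bound the efficiently computable upper bound on Q[X] from (iii)):

    import random

    def sparse_greedy(X, Q, bound, eps, subset_size=59):
        # Iteratively enlarge X_tilde by the best element of a random
        # subset of the remaining candidates (cf. the (95%, 95%, 59) rule).
        X_tilde = []
        while Q(X_tilde) + eps < bound(X_tilde):
            remaining = [x for x in X if x not in X_tilde]
            if not remaining:
                break
            candidates = random.sample(remaining, min(subset_size, len(remaining)))
            best = max(candidates, key=lambda x: Q(X_tilde + [x]))
            X_tilde.append(best)
        return X_tilde, Q(X_tilde)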
6.6 Summary

This chapter gave an overview of different optimization methods, which form the basic toolbox for solving the problems arising in learning with kernels. The main focus was on convex and differentiable problems; hence the overview of properties of convex sets and of functions defined on them. The key insights of Section 6.1 are that convex sets can be defined by level sets of convex functions, and that convex optimization problems have one global minimum. Furthermore, the fact that the solutions of convex maximization over polyhedral sets can be found on the vertices will prove useful in some unsupervised learning applications (Section 14.4).

Basic tools for unconstrained problems (Section 6.2) include interval cutting methods, the Newton method, Conjugate Gradient descent, and predictor-corrector methods. These techniques are often used as building blocks for solving more advanced constrained optimization problems. Since constrained minimization is a fairly complex topic, we only presented a selection of fundamental results, such as necessary and sufficient conditions in the general case of nonlinear programming. The KKT conditions for differentiable
convex functions then followed immediately from the previous reasoning. The main results are dualization, meaning the transformation of optimization problems via the Lagrangian mechanism into possibly simpler problems, and the fact that optimality properties can be estimated via the KKT gap (Theorem 6.27). Interior point algorithms are practical applications of this duality reasoning; they seek a solution of an optimization problem by satisfying the KKT optimality conditions. Here we were able to employ some of the concepts introduced at an earlier stage, such as predictor-corrector methods and numerical ways of finding roots of equations. These algorithms are robust tools for finding solutions of moderately sized problems (10^3 to 10^4 examples). Larger problems require decomposition methods, to be discussed in Section 10.4, or randomized methods.

The chapter concluded with an overview of randomized methods for maximizing functions or finding the best subset of elements. These techniques are useful once datasets are so large that we cannot reasonably hope to find exact solutions to the optimization problems.
6.7 Problems
6.1 (Level Sets •) Given the function f : R² → R with f(x) := |x_1|^p + |x_2|^p, for which p do we obtain a convex function? Now consider the sets {x | f(x) ≤ c} for some c > 0. Can you give an explicit parametrization of the boundary of the set? Is it easier to deal with this parametrization? Can you find other examples (see also [489] and Chapter 8 for details)?

6.2 (Convex Hulls •) Show that for any set X, its convex hull co X is convex. Furthermore, show that co X = X if X is convex.

6.3 (Method of False Position [334] •••) Given a unimodal (possessing one minimum) differentiable function f : R → R, develop a quadratic method for minimizing f.
Hint: Recall the Newton method. There we used f''(x) to make a quadratic approximation of f. Two values of f'(x) are also sufficient to obtain this information, however. What happens if we may only use values of f itself? What does the iteration scheme look like? See Figure 6.8 for a hint.

6.4 (Convex Minimization in one Variable ••) Denote by f a convex function on [a, b]. Show that the algorithm below finds the minimum of f. What is the rate of convergence in x to argmin_x f(x)? Can you obtain a bound on f(x) with respect to min_x f(x)?

input a, b, f and threshold ε
x_1 = a, x_2 = (a + b)/2, x_3 = b and compute f(x_1), f(x_2), f(x_3)
repeat
    if x_3 − x_2 > x_2 − x_1 then
        x_4 = (x_2 + x_3)/2 and compute f(x_4)
    else
        x_4 = (x_1 + x_2)/2 and compute f(x_4)
    end if
    Keep the two points closest to the point with the minimum value of f(x_i), and rename them such that x_1 < x_2 < x_3.
until x_3 − x_1 ≤ ε
6.5 (Newton Method in R^d ••) Extend the Newton method to functions on R^d. What does the iteration rule look like? Under which conditions does the algorithm converge? Do you have to extend Theorem 6.13 to prove convergence?

6.6 (Rewriting Quadratic Functionals •) Given a function
rewrite it into the form of (6.18). Give explicit expressions for x* = argmin_x f(x) and for the difference in the additive constants.

6.7 (Kantorovich Inequality [278] •••) Prove Theorem 6.16. Hint: note that without loss of generality we may require ‖x‖_2 = 1. Second, perform a transformation of coordinates into the eigensystem of K. Finally, note that in the new coordinate system we are dealing with convex combinations of the eigenvalues λ_i and their reciprocals. First show (6.24) for only two eigenvalues; then argue that only the largest and smallest eigenvalues matter.

6.8 (Random Subsets •) Generate m random numbers drawn uniformly from the interval [0, 1]. Plot their distribution function. Plot the distribution of maxima of subsets of random numbers. What can you say about the distribution of the maxima? What happens if you draw randomly from the Laplace distribution, with density p(ξ) = e^(−ξ) (for ξ ≥ 0)?

6.9 (Matching Pursuit [342] ••) Denote by f_1, ..., f_M a set of functions X → R, by {x_1, ..., x_m} ⊂ X a set of locations, and by {y_1, ..., y_m} ⊂ Y a set of corresponding observations. Design a sparse greedy algorithm that finds a linear combination of functions f := Σ_i α_i f_i minimizing the squared loss between f(x_i) and y_i.
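One possible solution sketch for Problem 6.9 (our own, using NumPy) greedily picks, at each step, the function whose optimally scaled addition most reduces the squared loss:

    import numpy as np

    def matching_pursuit(F, y, n_terms):
        # F: (m, M) matrix with F[i, j] = f_j(x_i); y: (m,) observations.
        # Greedily builds f = sum_j alpha_j f_j, minimizing the squared loss.
        alpha = np.zeros(F.shape[1])
        residual = y.astype(float).copy()
        norms = (F ** 2).sum(axis=0)       # assumed non-zero columns
        for _ in range(n_terms):
            coef = F.T @ residual / norms  # optimal scale per function
            gains = coef ** 2 * norms      # resulting loss reduction
            j = int(np.argmax(gains))
            alpha[j] += coef[j]
            residual -= coef[j] * F[:, j]
        return alpha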
Figure 6.8 From left to right: Newton method, method of false position, quadratic interpolation through 3 points. Solid line: f(x), dash-dotted line: interpolation.
6.10 (Reduced Set Approximation [474] ••) Let f(x) = Σ_{i=1}^m α_i k(x_i, x) be a kernel expansion in a Reproducing Kernel Hilbert Space H_k (see Section 2.2.3). Give a sparse greedy algorithm that finds an approximation of f in H_k using fewer terms. See also Chapter 18 for more detail.

6.11 (Equality Constraints in LP and QP ••) Find the dual optimization problem and the necessary KKT conditions for the following optimization problem:
where c, x ∈ R^m, b ∈ R^n, d ∈ R^(n'), A ∈ R^(n×m), and C ∈ R^(n'×m). Hint: split up each equality constraint into two inequality constraints. Note that you may combine the two corresponding Lagrange multipliers again to obtain a free variable. Derive the corresponding conditions for
where K is a strictly positive definite matrix.

6.12 (Not Strictly Definite Quadratic Parts •••) How do you have to change the dual of (6.99) if K does not have full rank? Is it better not to dualize in this case? Do the KKT conditions still hold?

6.13 (Dual Problems of Quadratic Programs ••) Denote by P a quadratic optimization problem of type (6.72), and by (·)_D the dualization operation. Prove that the following holds,

    ((P_D)_D)_D = P_D,
where in general (P_D)_D ≠ P. Hint: use (6.80). Caution: you have to check whether K has full rank.

6.14 (Interior Point Equations for Linear Programs [336] •••) Derive the interior point equations for linear programs. Hint: use the expansions for the quadratic programs, and note that the reduced KKT system has only a diagonal term where we had K before. How does the complexity of the problem scale with the size of A?

6.15 (Update Step in Interior Point Codes •) Show that the maximum value of λ satisfying (6.84) can be found by
II
SUPPORT VECTOR MACHINES
The algorithms for constructing the separating hyperplane considered above will be utilized for developing a battery of programs for pattern recognition. V. N. Vapnik [560, p. 364]
Now that we have the necessary concepts and tools, we move on to the class of Support Vector (SV) algorithms. SV algorithms are commonly considered the first practicable spin-off of statistical learning theory. We described the basic ideas of Support Vector machines (SVMs) in Chapter 1. It is now time for a much more detailed discussion and description of SVMs, starting with the case of pattern recognition (Chapter 7), which was historically the first to be developed. Following this, we move on to a problem that can actually be considered as being even simpler than pattern recognition. In pattern recognition, we try to distinguish between patterns of at least two classes; in single-class classification (Chapter 8), however, there is only one class. In the latter case, which belongs to the realm of unsupervised learning, we try to learn a model of the data which describes, in a weak sense, what the training data looks like. This model can then be used to assess the "typicality" or novelty of previously unseen patterns, a task which is rather useful in a number of application domains. Chapter 9 introduces SV algorithms for regression estimation. These retain most of the properties of the other SV algorithms, with the exception that in the regression case, the choice of the loss function, as described in Chapter 3, becomes a more interesting issue.
After this, we give details on how to implement the various types of SV algorithms (Chapter 10), and we describe some methods for incorporating prior knowledge about invariances of a given problem into SVMs (Chapter 11). We conclude this part of the book by revisiting statistical learning theory, this time with a much stronger emphasis on elements that are specific to SVMs and kernel methods (Chapter 12).
7
Pattern Recognition
Overview
Prerequisites
This chapter is devoted to a detailed description of SV classification (SVC) methods. We have already briefly visited the SVC algorithm in Chapter 1. There will be some overlap with that chapter, but here we give a more thorough treatment. We start by describing the classifier that forms the basis for SVC, the separating hyperplane (Section 7.1). Separating hyperplanes can differ in how large a margin of separation they induce between the classes, with corresponding consequences for the generalization error, as discussed in Section 7.2. The "optimal" margin hyperplane is defined in Section 7.3, along with a description of how to compute it. Using the kernel trick of Chapter 2, we generalize to the case where the optimal margin hyperplane is not computed in input space, but in a feature space nonlinearly related to the latter (Section 7.4). This dramatically increases the applicability of the approach, as does the introduction of slack variables to deal with outliers and noise in the data (Section 7.5). Many practical problems require us to classify the data into more than just two classes. Section 7.6 describes how multi-class SV classification systems can be built. Following this, Section 7.7 describes some variations on standard SV classification algorithms, differing in the regularizers and constraints that are used. We conclude with a fairly detailed section on experiments and applications (Section 7.8).

This chapter requires basic knowledge of kernels, as conveyed in the first half of Chapter 2. To understand the details of the optimization problems, it is helpful (but not indispensable) to have some background from Chapter 6. To understand the connections to learning theory, in particular regarding the statistical basis of the regularizer used in SV classification, it is useful to have read Chapter 5.
7.1 Separating Hyperplanes
Hyperplane
Suppose we are given a dot product space H, and a set of pattern vectors x_1, ..., x_m ∈ H. Any hyperplane in H can be written as

    {x ∈ H | ⟨w, x⟩ + b = 0},  w ∈ H, b ∈ R.    (7.1)
In this formulation, w is a vector orthogonal to the hyperplane: if w has unit length, then ⟨w, x⟩ is the length of x along the direction of w (Figure 7.1). For general w, this number is scaled by ‖w‖. In any case, the set (7.1) consists
of vectors that all have the same length along w; in other words, these are vectors that project onto the same point on the line spanned by w. In this formulation, we still have the freedom to multiply w and b by the same non-zero constant. This superfluous freedom (physicists would call it a "gauge" freedom) can be abolished as follows.

Definition 7.1 (Canonical Hyperplane) The pair (w, b) ∈ H × R is called a canonical form of the hyperplane (7.1) with respect to x_1, ..., x_m ∈ H, if it is scaled such that

    min_{i=1,...,m} |⟨w, x_i⟩ + b| = 1,    (7.2)
which amounts to saying that the point closest to the hyperplane has a distance of 1/‖w‖ (Figure 7.2).
Decision Function
Note that condition (7.2) still allows two such pairs: given a canonical hyperplane (w, b), another one satisfying (7.2) is given by (−w, −b). For the purpose of pattern recognition, these two hyperplanes turn out to be different, as they are oriented differently; they correspond to the two decision functions f_{w,b} and f_{−w,−b}, where

    f_{w,b}(x) = sgn(⟨w, x⟩ + b),    (7.3)
and which are the inverse of each other. In the absence of class labels y_i ∈ {±1} associated with the x_i, there is no way of distinguishing the two hyperplanes. For a labelled dataset, a distinction exists: the two hyperplanes make opposite class assignments. In pattern recognition,
Figure 7.1 A separable classification problem, along with a separating hyperplane, written in terms of an orthogonal weight vector w and a threshold b. Note that by multiplying both w and b by the same non-zero constant, we obtain the same hyperplane, represented in terms of different parameters. Figure 7.2 shows how to eliminate this scaling freedom.
Figure 7.2 By requiring the scaling of w and b to be such that the point(s) closest to the hyperplane satisfy |⟨w, x_i⟩ + b| = 1, we obtain a canonical form (w, b) of a hyperplane. Note that in this case, the margin, measured perpendicularly to the hyperplane, equals 1/‖w‖. This can be seen by considering two opposite points which precisely satisfy |⟨w, x_i⟩ + b| = 1 (cf. Problem 7.4).
we attempt to find a solution f_{w,b} which correctly classifies the labelled examples (x_i, y_i) ∈ H × {±1}; in other words, which satisfies f_{w,b}(x_i) = y_i for all i (in this case, the training set is said to be separable), or at least for a large fraction thereof. The next section introduces the term margin, denoting the distance from the separating hyperplane to the point closest to it. It will be argued that to generalize well, a large margin should be sought. In view of Figure 7.2, this can be achieved by keeping ‖w‖ small. Readers who are content with this level of detail may skip the next section and proceed directly to Section 7.3, where we describe how to construct the hyperplane with the largest margin.
7.2 The Role of the Margin

The margin plays a crucial role in the design of SV learning algorithms. Let us start by formally defining it.

Definition 7.2 (Geometrical Margin) For a hyperplane {x ∈ H | ⟨w, x⟩ + b = 0}, we call

    γ_(x,y) := y (⟨w, x⟩ + b) / ‖w‖
Geometrical Margin
the geometrical margin of the point (x, y) ∈ H × {±1}. The minimum value

    γ := min_{i=1,...,m} γ_(x_i,y_i)
shall be called the geometrical margin of (x_1, y_1), ..., (x_m, y_m). If the latter qualification is omitted, it is understood that the training set is meant. Occasionally, we will omit the word geometrical, and simply refer to the margin.

For a point (x, y) which is correctly classified, the margin is simply the distance from x to the hyperplane. To see this, note first that the margin is zero on the hyperplane. Second, in the definition, we effectively consider a hyperplane
Margin of Canonical Hyperplanes
Insensitivity to Pattern Noise
which has a unit length weight vector, and then compute the quantity y(⟨w, x⟩ + b). The term ⟨w, x⟩, however, simply computes the length of the projection of x onto the direction orthogonal to the hyperplane, which, after adding the offset b, equals the distance to it. The multiplication by y ensures that the margin is positive whenever a point is correctly classified. For misclassified points, we thus get a margin which equals the negative distance to the hyperplane. Finally, note that for canonical hyperplanes, the margin is 1/‖w‖ (Figure 7.2). The definition of the canonical hyperplane thus ensures that the length of w now corresponds to a meaningful geometrical quantity.

It turns out that the margin of a separating hyperplane, and thus the length of the weight vector w, plays a fundamental role in Support Vector type algorithms. Loosely speaking, if we manage to separate the training data with a large margin, then we have reason to believe that we will do well on the test set. Not surprisingly, there exist a number of explanations for this intuition, ranging from the simple to the rather technical. We will now briefly sketch some of them.

The simplest possible justification for large margins is as follows. Since the training and test data are assumed to have been generated by the same underlying dependence, it seems reasonable to assume that most of the test patterns will lie close (in H) to at least one of the training patterns. For the sake of simplicity, let us consider the case where all test points are generated by adding bounded pattern noise (sometimes called input noise) to the training patterns. More precisely, given a training point (x, y), we generate test points of the form (x + Δx, y), where
Figure 7.3 Two-dimensional toy example of a classification problem: separate 'o' from '+' using a hyperplane. Suppose that we add bounded noise to each pattern. If the optimal margin hyperplane has margin ρ, and the noise is bounded by r < ρ, then the hyperplane will correctly separate even the noisy patterns. Conversely, if we ran the perceptron algorithm (which finds some separating hyperplane, but not necessarily the optimal one) on the noisy data, then we would recover the optimal hyperplane in the limit r → ρ.
Δx ∈ H is bounded in norm by some r > 0. Clearly, if we manage to separate the training set with a margin ρ > r, we will correctly classify all test points: since all training points have a distance of at least ρ to the hyperplane, the test patterns will still be on the correct side (Figure 7.3, cf. also [152]).

If we knew ρ beforehand, this could actually be turned into an optimal margin classifier training algorithm, as follows. If we use an r which is slightly smaller than ρ, then even the patterns with added noise will be separable with a nonzero margin. In this case, the standard perceptron algorithm can be shown to converge.¹ Therefore, we can run the perceptron algorithm on the noisy patterns. If the algorithm finds a sufficient number of noisy versions of each pattern, with different perturbations Δx, then the resulting hyperplane will not intersect any of the balls depicted in Figure 7.3. As r approaches ρ, the resulting hyperplane should better approximate the maximum margin solution (the figure depicts the limit r = ρ). This constitutes a connection between training with pattern noise and maximizing the margin. The latter, in turn, can be thought of as a regularizer, comparable to those discussed earlier (see Chapter 4 and (2.49)). Similar connections to training with noise, for other types of regularizers, have been pointed out before for neural networks [50].

1. Rosenblatt's perceptron algorithm [439] is one of the simplest conceivable iterative procedures for computing a separating hyperplane. In its simplest form, it proceeds as follows. We start with an arbitrary weight vector w_0. At step n ∈ N, we consider the training example (x_n, y_n). If it is classified correctly using the current weight vector (i.e., if sgn⟨x_n, w_{n−1}⟩ = y_n), we set w_n := w_{n−1}; otherwise, we set w_n := w_{n−1} + η y_n x_n (here, η > 0 is a learning rate). We thus loop over all patterns repeatedly, until we can complete one full pass through the training set without a single error. The resulting weight vector will thus classify all points correctly. Novikoff [386] proved that this procedure terminates, provided that the training set is separable with a nonzero margin.
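In code, the procedure of footnote 1 amounts to the following (a NumPy sketch; the data format and learning rate are our assumptions):

    import numpy as np

    def perceptron(X, y, eta=1.0):
        # X: (m, d) patterns, y: (m,) labels in {-1, +1}.
        # Loops over the data until a full pass makes no mistake; terminates
        # whenever the set is separable with nonzero margin (Novikoff).
        w = np.zeros(X.shape[1])
        while True:
            errors = 0
            for x_n, y_n in zip(X, y):
                if np.sign(x_n @ w) != y_n:
                    w = w + eta * y_n * x_n
                    errors += 1
            if errors == 0:
                return w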
Figure 7.4 Two-dimensional toy example of a classification problem: separate 'o' from '+' using a hyperplane passing through the origin. Suppose the patterns are bounded in length (distance to the origin) by R, and the classes are separated by an optimal hyperplane (parametrized by the angle γ) with margin ρ. In this case, we can perturb the parameter by some Δγ with |Δγ| < arcsin(ρ/R), and still correctly separate the data.
Parameter Noise
VC Margin Bound
Margin Error
A similar robustness argument can be made for the dependence of the hyperplane on the parameters (w, b) (cf. [504]). If all points lie at a distance of at least ρ from the hyperplane, and the patterns are bounded in length, then small perturbations to the hyperplane parameters will not change the classification of the training data (see Figure 7.4).² Being able to perturb the parameters of the hyperplane amounts to saying that to store the hyperplane, we need fewer bits than we would for a hyperplane whose exact parameter settings are crucial. Interestingly, this is related to what is called the Minimum Description Length principle ([583, 433, 485], cf. also [522, 305, 94]): the best description of the data, in terms of generalization error, should be the one that requires the fewest bits to store.

We now move on to a more technical justification of large margin algorithms. For simplicity, we only deal with hyperplanes that have offset b = 0, leaving f(x) = sgn⟨w, x⟩. The theorem below follows from a result in [24].

Theorem 7.3 (Margin Error Bound) Consider the set of decision functions f(x) = sgn⟨w, x⟩ with ‖w‖ ≤ Λ and ‖x‖ ≤ R, for some R, Λ > 0. Moreover, let ρ > 0, and let ν denote the fraction of training examples with margin smaller than ρ/‖w‖, referred to as the margin error. For all distributions P generating the data, with probability at least 1 − δ over the drawing of the m training patterns, and for any ρ > 0 and δ ∈ (0, 1), the probability that a test pattern drawn from P will be misclassified is bounded from above by

    ν + sqrt( (c/m) ( (R²Λ²/ρ²) ln² m + ln(1/δ) ) ).    (7.7)
Here, c is a universal constant.

2. Note that this would not hold true if we allowed patterns of arbitrary length; this type of restriction of the pattern lengths pops up in various places, such as Novikoff's theorem [386], Vapnik's VC dimension bound for margin classifiers (Theorem 5.5), and Theorem 7.3.
Implementation in Hardware
Let us try to understand this theorem. It makes a probabilistic statement about a probability, by giving an upper bound on the probability of test error which itself only holds true with a certain probability, 1 − δ. Where do these two probabilities come from? The first is due to the fact that the test examples are randomly drawn from P; the second is due to the training examples being drawn from P. Strictly speaking, the bound does not refer to a single classifier that has been trained on some fixed data set at hand, but to an ensemble of classifiers, trained on various instantiations of training sets generated by the same underlying regularity P. It is beyond the scope of the present chapter to prove this result. The basic ingredients of bounds of this type, commonly referred to as VC bounds, are described in Chapter 5; for further details, see Chapter 12, and [562, 491, 504, 125].

Several aspects of the bound are noteworthy. The test error is bounded by a sum of the margin error ν and a capacity term (the square root term in (7.7)), with the latter tending to zero as the number of examples m tends to infinity. The capacity term can be kept small by keeping R and Λ small, and making ρ large. If we assume that R and Λ are fixed a priori, the main influence is ρ. As can be seen from (7.7), a large ρ leads to a small capacity term, but the margin error ν gets larger. A small ρ, on the other hand, will usually cause fewer points to have margins smaller than ρ/‖w‖, leading to a smaller margin error; but the capacity penalty will increase correspondingly. The overall message: try to find a hyperplane which is aligned such that even for a large ρ, there are few margin errors. Maximizing ρ, however, is the same as minimizing the length of w. Hence we might just as well keep ρ fixed, say, equal to 1 (which is the case for canonical hyperplanes), and search for a hyperplane which has a small ‖w‖ and few points with a margin smaller than 1/‖w‖; in other words (Definition 7.2), few points such that y⟨w, x⟩ < 1.

It should be emphasized that dropping the condition ‖w‖ ≤ Λ would prevent us from stating a bound of the kind shown above. We could give an alternative bound, where the capacity depends on the dimensionality of the space H. The crucial advantage of the bound given above is that it is independent of that dimensionality, enabling us to work in very high dimensional spaces. This will become important when we make use of the kernel trick. It has recently been pointed out that the margin also plays a crucial role in improving asymptotic rates in nonparametric estimation [551]; this topic, however, is beyond the scope of the present book.

To conclude this section, we note that large margin classifiers also have advantages of a practical nature: an algorithm that can separate a dataset with a certain margin will behave in a benign way when implemented in hardware. Real-world systems typically work only within certain accuracy bounds, and if the classifier is insensitive to small changes in the inputs, it will usually tolerate those inaccuracies.

We have thus accumulated a fair amount of evidence in favor of the following approach: keep the margin training error small, and the margin large, in order to achieve high generalization ability. In other words, hyperplane decision functions
should be constructed such that they maximize the margin, and at the same time separate the training data with as few exceptions as possible. Sections 7.3 and 7.5 respectively will deal with these two issues.
7.3 Optimal Margin Hyperplanes

Let us now derive the optimization problem to be solved for computing the optimal hyperplane. Suppose we are given a set of examples (x_1, y_1), ..., (x_m, y_m), with x_i ∈ H and y_i ∈ {±1}. Here and below, the index i runs over 1, ..., m by default. We assume that there is at least one negative and one positive y_i. We want to find a decision function f_{w,b}(x) = sgn(⟨w, x⟩ + b) satisfying

    f_{w,b}(x_i) = y_i for all i.    (7.8)
If such a function exists (the non-separable case will be dealt with later), canonicality (7.2) implies

    y_i (⟨w, x_i⟩ + b) ≥ 1 for all i.
As an aside, note that out of the two canonical forms of the same hyperplane, (w, b) and (−w, −b), only one will satisfy (7.8) and (7.11). The existence of class labels thus allows us to distinguish two orientations of a hyperplane. Following the previous section, a separating hyperplane which generalizes well can thus be constructed by solving the following problem:

    minimize_{w ∈ H, b ∈ R}  τ(w) = (1/2) ‖w‖²,    (7.10)
    subject to  y_i (⟨w, x_i⟩ + b) ≥ 1 for all i = 1, ..., m.    (7.11)
This is called the primal optimization problem.
Lagrangian
Problems like this one are the subject of optimization theory. For details on how to solve them, see Chapter 6; for a short intuitive explanation, cf. the remarks following (1.26) in the introductory chapter. We will now derive the so-called dual problem, which can be shown to have the same solutions as (7.10). In the present case, it will turn out to be more convenient to deal with the dual. To derive it, we introduce the Lagrangian

    L(w, b, α) = (1/2) ‖w‖² − Σ_i α_i (y_i (⟨x_i, w⟩ + b) − 1),    (7.12)
with Lagrange multipliers α_i ≥ 0. Recall that, as in Chapter 1, we use boldface Greek letters to refer to the corresponding vectors of variables, for instance, α = (α_1, ..., α_m). The Lagrangian L must be maximized with respect to the α_i, and minimized with respect to w and b (see Theorem 6.26). Consequently, at this saddle point, the
derivatives of L with respect to the primal variables must vanish,

    ∂L/∂b = 0  and  ∂L/∂w = 0,    (7.13)
which leads to

    Σ_i α_i y_i = 0,    (7.14)
and

    w = Σ_i α_i y_i x_i.    (7.15)
The solution vector thus has an expansion in terms of training examples. Note that although the solution w is unique (due to the strict convexity of (7.10) and the convexity of (7.11)), the coefficients α_i need not be. According to the KKT theorem (Chapter 6), only those Lagrange multipliers α_i that are non-zero at the saddle point correspond to constraints (7.11) which are precisely met. Formally, for all i = 1, ..., m, we have

    α_i [y_i (⟨x_i, w⟩ + b) − 1] = 0.    (7.16)
Support Vectors
The patterns x_i for which α_i > 0 are called Support Vectors. This terminology is related to corresponding terms in the theory of convex sets, relevant to convex optimization (e.g., [334, 45]).³ According to (7.16), they lie exactly on the margin.⁴ All remaining examples of the training set are irrelevant: their constraints (7.11) are satisfied automatically, and they do not appear in the expansion (7.15), since their multipliers satisfy α_i = 0.⁵

This leads directly to an upper bound on the generalization ability of optimal margin hyperplanes. To this end, we consider the so-called leave-one-out method (for further details, see Section 12.2) to estimate the expected test error [335, 559]. This procedure is based on the idea that if we leave out one of the training

3. Given any boundary point of a convex set, there always exists a hyperplane separating the point from the interior of the set. This is called a supporting hyperplane. SVs lie on the boundary of the convex hulls of the two classes, thus they possess supporting hyperplanes. The SV optimal hyperplane is the hyperplane which lies in the middle of the two parallel supporting hyperplanes (of the two classes) with maximum distance. Conversely, from the optimal hyperplane, we can obtain supporting hyperplanes for all SVs of both classes, by shifting it by 1/‖w‖ in both directions.
4. Note that this implies that the solution (w, b), where b is computed using y_i(⟨w, x_i⟩ + b) = 1 for SVs, is in canonical form with respect to the training data. (This makes use of the reasonable assumption that the training set contains both positive and negative examples.)
5. In a statistical mechanics framework, Anlauf and Biehl [12] have put forward a similar argument for the optimal stability perceptron, also computed using constrained optimization. There is a large body of work in the physics community on optimal margin classification. Some further references of interest are [310, 191, 192, 394, 449, 141]; other early works include [313].
examples, and train on the remaining ones, then the probability of error on the left out example gives us a fair indication of the true test error. Of course, doing this for a single training example leads to an error of either zero or one, so it does not yet give an estimate of the test error. The leave-one-out method repeats this procedure for each individual training example in turn, and averages the resulting errors.

Let us return to the present case. If we leave out a pattern x_{i*}, and construct the solution from the remaining patterns, the following outcomes are possible (cf. (7.11)):

1. y_{i*}(⟨x_{i*}, w⟩ + b) > 1. In this case, the pattern is classified correctly and does not lie on the margin. These are patterns that would not have become SVs anyway.

2. y_{i*}(⟨x_{i*}, w⟩ + b) = 1. In other words, x_{i*} exactly meets the constraint (7.11). In this case, the solution w does not change, even though the coefficients α_i would change: namely, x_{i*} might have become a Support Vector (i.e., α_{i*} > 0) had it been kept in the training set. In that case, the fact that the solution is the same, no matter whether x_{i*} is in the training set or not, means that x_{i*} can be written as Σ_{SVs} β_i y_i x_i with β_i ≥ 0. Note that condition 2 is not equivalent to saying that x_{i*} may be written as some linear combination of the remaining Support Vectors: since the sign of the coefficients in the linear combination is determined by the class of the respective pattern, not any linear combination will do. Strictly speaking, x_{i*} must lie in the cone spanned by the y_i x_i, where the x_i are all Support Vectors.⁶ For more detail, see [565] and Section 12.2.

3. 0 < y_{i*}(⟨x_{i*}, w⟩ + b) < 1. In this case, x_{i*} lies within the margin, but still on the correct side of the decision boundary. Thus, the solution looks different from the one obtained with x_{i*} in the training set (in that case, x_{i*} would satisfy (7.11) after training); classification is nevertheless correct.

4. y_{i*}(⟨x_{i*}, w⟩ + b) < 0. This means that x_{i*} is classified incorrectly.

Note that cases 3 and 4 necessarily correspond to examples which would have become SVs if kept in the training set; case 2 potentially includes such situations. Only case 4, however, leads to an error in the leave-one-out procedure. Consequently, we have the following result on the generalization error of optimal margin classifiers [570]:⁷
Leave-One-Out Bound
Proposition 7.4 The expectation of the number of Support Vectors obtained during training on a training set of size m, divided by m, is an upper bound on the expected probability of test error of the SVM trained on training sets of size m − 1.⁸

6. Possible non-uniqueness of the solution's expansion in terms of SVs is related to zero eigenvalues of (y_i y_j k(x_i, x_j))_ij, cf. Proposition 2.16. Note, however, the above caveat on the distinction between linear combinations, and linear combinations with coefficients of fixed sign.
7. It also holds for the generalized versions of optimal margin classifiers described in the following sections.
8. Note that the leave-one-out procedure performed with m training examples thus yields
Figure 7.5 The optimal hyperplane (Figure 7.2) is the one bisecting the shortest connection between the convex hulls of the two classes.
Quadratic Program of Optimal Margin Classifier
A sharper bound can be formulated by making a further distinction in case 2, between SVs that must occur in the solution, and those that can be expressed in terms of the other SVs (see [570, 565, 268, 549] and Section 12.2).

We now return to the optimization problem to be solved. Substituting the conditions for the extremum, (7.14) and (7.15), into the Lagrangian (7.12), we arrive at the dual form of the optimization problem:

    maximize_{α ∈ R^m}  W(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩,    (7.17)
    subject to  α_i ≥ 0 for all i = 1, ..., m,    (7.18)
    and  Σ_i α_i y_i = 0.    (7.19)
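In practice, (7.17)-(7.19) is a standard quadratic program and can be handed to a generic QP solver. The following sketch (not the book's implementation; it assumes the CVXOPT package and linearly separable data) makes the correspondence explicit:

    import numpy as np
    from cvxopt import matrix, solvers

    def hard_margin_svm(X, y):
        # Maximize sum_i a_i - 1/2 a^T Q a with Q_ij = y_i y_j <x_i, x_j>,
        # subject to a_i >= 0 and y^T a = 0 (cf. (7.17)-(7.19)); CVXOPT
        # minimizes, so we flip the sign of the objective.
        m = X.shape[0]
        Yx = y[:, None] * X
        Q = Yx @ Yx.T
        sol = solvers.qp(matrix(Q), matrix(-np.ones(m)),
                         matrix(-np.eye(m)), matrix(np.zeros(m)),
                         matrix(y.reshape(1, -1).astype(float)), matrix(0.0))
        a = np.ravel(sol['x'])
        w = (a * y) @ X                 # expansion (7.15)
        sv = int(np.argmax(a))          # index of a Support Vector
        b = y[sv] - X[sv] @ w           # from y_i(<w, x_i> + b) = 1
        return w, b, a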
On substitution of the expansion (7.15) into the decision function (7.3), we obtain an expression which can be evaluated in terms of dot products between the pattern to be classified and the Support Vectors,

    f(x) = sgn( Σ_i α_i y_i ⟨x, x_i⟩ + b ).    (7.20)

To conclude this section, we note that there is an alternative way to derive the dual optimization problem [38]. To describe it, we first form the convex hulls C₊
To conclude this section, we note that there is an alternative way to derive the dual optimization problem [38]. To describe it, we first form the convex hulls C+ a bound valid for training sets of size m — \. This difference, however, does not usually mislead us too much. In statistical terms, the leave-one-out error is called almost unbiased. Note, moreover, that the statement talks about the expected probability of test error — there are thus two sources of randomness. One is the expectation over different training sets of size m — l, the other is the probability of test error when one of the SVMs is faced with a test example drawn from the underlying distribution generating the data. For a generalization, see Theorem 12.9.
and C₋ of both classes of training points,

    C± := { Σ_{y_i=±1} c_i x_i  |  c_i ≥ 0, Σ_{y_i=±1} c_i = 1 }.    (7.21)
Convex Hull Separation
It can be shown that the maximum margin hyperplane as described above is the one bisecting the shortest line orthogonally connecting C₊ and C₋ (Figure 7.5). Formally, this can be seen by considering the optimization problem

    minimize_c  ‖ Σ_{y_i=1} c_i x_i − Σ_{y_i=−1} c_i x_i ‖²  subject to the constraints of (7.21),    (7.22)
and using the normal vector w = Σ_{y_i=1} c_i x_i − Σ_{y_i=−1} c_i x_i, scaled to satisfy the canonicality condition (Definition 7.1). The threshold b is explicitly adjusted such that the hyperplane bisects the shortest connecting line (see also Problem 7.7).
7.4 Nonlinear Support Vector Classifiers
Cover's Theorem
Thus far, we have shown why it is that a large margin hyperplane is good from a statistical point of view, and we have demonstrated how to compute it. Although these two points have worked out nicely, there is still a major drawback to the approach: everything that we have done so far is linear in the data. To allow for much more general decision surfaces, we now use kernels to nonlinearly transform the input data x_1, ..., x_m ∈ X into a high-dimensional feature space, using a map Φ : x_i ↦ x_i := Φ(x_i); we then do a linear separation there.

To justify this procedure, Cover's Theorem [113] is sometimes alluded to. This theorem characterizes the number of possible linear separations of m points in general position in an N-dimensional space. If m ≤ N + 1, then all 2^m separations are possible; the VC dimension of the function class is N + 1 (Section 5.5.6). If m > N + 1, then Cover's Theorem states that the number of linear separations equals

    2 Σ_{i=0}^{N} ( (m−1) choose i ).    (7.23)
The more we increase N, the more terms there are in the sum, and thus the larger is the resulting number. This theorem formalizes the intuition that the number of separations increases with the dimensionality. It requires, however, that the points be in general position; therefore, it does not strictly make a statement about the separability of a given dataset in a given feature space. For example, the feature map might be such that all points lie on a rather restrictive lower-dimensional manifold, which could prevent us from finding points in general position. There is another way to intuitively understand why the kernel mapping in-
Figure 7.6 By mapping the input data (top left) nonlinearly (via Φ) into a higher-dimensional feature space H (here: H = R³), and constructing a separating hyperplane there (bottom left), an SVM (top right) corresponds to a nonlinear decision surface in input space (here: R², bottom right). We use x_1, x_2 to denote the entries of the input vectors, and w_1, w_2, w_3 to denote the entries of the hyperplane normal vector in H.
"Kernelizing" the Optimal Margin Hyperplane
Kernel Trick
creases the chances of a separation, in terms of concepts of statistical learning theory. Using a kernel typically amounts to using a larger function class, thus increasing the capacity of the learning machine, and rendering problems separable that are not linearly separable to start with.

On the practical level, the modifications necessary to perform the algorithm in a high-dimensional feature space are minor. In the above sections, we made no assumptions on the dimensionality of H, the space which we assumed our patterns to belong to. We only required H to be equipped with a dot product. The patterns x_i that we talked about previously thus need not coincide with the input patterns. They can equally well be the results of mapping the original input patterns x_i into a high-dimensional feature space. Consequently, we take the stance that wherever we wrote x, we actually meant Φ(x). Maximizing the target function (7.17), and evaluating the decision function (7.20), then requires the computation of dot products ⟨Φ(x), Φ(x_i)⟩ in a high-dimensional space. These expensive calculations are reduced significantly by using a positive definite kernel k (see Chapter 2), such that

    ⟨Φ(x), Φ(x')⟩ = k(x, x'),    (7.24)
leading to decision functions of the form (cf. (7.20))

    f(x) = sgn( Σ_i y_i α_i k(x, x_i) + b ).    (7.25)
Figure 7.7 Architecture of SVMs. The kernel function k is chosen a priori; it determines the type of classifier (for instance, polynomial classifier, radial basis function classifier, or neural network). All other parameters (number of hidden units, weights, threshold b) are found during training, by solving a quadratic programming problem. The first layer weights x_i are a subset of the training set (the Support Vectors); the second layer weights λ_i = y_i α_i are computed from the Lagrange multipliers (cf. (7.25)).
Kernels
At this point, a small aside regarding terminology is in order. As explained in Chapter 2, the input domain X need not be a vector space. Therefore, the Support Vectors in (7.25) (i.e., those x_i with α_i > 0) are not necessarily vectors. One could choose to be on the safe side, and only refer to the corresponding Φ(x_i) as SVs. Common usage, however, employs the term in a somewhat loose sense for both.

Consequently, everything that has been said about the linear case also applies to nonlinear cases, obtained using a suitable kernel k instead of the Euclidean dot product (Figure 7.6). By using some of the kernel functions described in Chapter 2, the SV algorithm can construct a variety of learning machines (Figure 7.7), some of which coincide with classical architectures: polynomial classifiers of degree d,

    k(x, x') = ⟨x, x'⟩^d;    (7.26)
radial basis function classifiers with Gaussian kernel of width c > 0,

    k(x, x') = exp( −‖x − x'‖² / (2 c²) );    (7.27)
and neural networks (e.g., [49, 235]) with tanh activation function,

    k(x, x') = tanh( κ ⟨x, x'⟩ + ϑ ).    (7.28)
The parameters κ > 0 and ϑ ∈ R are the gain and horizontal shift. As we shall see later, the tanh kernel can lead to very good results. Nevertheless, we should mention at this point that from a mathematical point of view, it has certain shortcomings, cf. the discussion following (2.69).
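For concreteness, the three kernels can be written down directly (a NumPy sketch, following the conventions of (7.26)-(7.28)):

    import numpy as np

    def poly_kernel(x, xp, d):              # polynomial, cf. (7.26)
        return (x @ xp) ** d

    def gauss_kernel(x, xp, c):             # Gaussian of width c, cf. (7.27)
        return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * c ** 2))

    def tanh_kernel(x, xp, kappa, theta):   # sigmoid, cf. (7.28)
        return np.tanh(kappa * (x @ xp) + theta)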
Quadratic Program
To find the decision function (7.25), we solve the following problem (cf. (7.17)):

    maximize_{α ∈ R^m}  W(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j),    (7.29)
subject to the constraints (7.18) and (7.19). If k is positive definite, then Q := (y_i y_j k(x_i, x_j))_ij is a positive definite matrix (Problem 7.6), which provides us with a convex problem that can be solved efficiently (cf. Chapter 6). To see this, note that (cf. Proposition 2.16)

    αᵀ Q α = ‖ Σ_i α_i y_i Φ(x_i) ‖² ≥ 0
Threshold
for all α ∈ R^m. As described in Chapter 2, we can actually use a larger class of kernels without destroying the convexity of the quadratic program. This is due to the fact that the constraint (7.19) excludes certain parts of the space of multipliers α_i. As a result, we only need the kernel to be positive definite on the remaining points. This is precisely guaranteed if we require k to be conditionally positive definite (see Definition 2.21). In this case, we have αᵀQα ≥ 0 for all coefficient vectors α satisfying (7.19).

To compute the threshold b, we take into account that, due to the KKT conditions (7.16), α_j > 0 implies (using (7.24))

    y_j ( Σ_i y_i α_i k(x_j, x_i) + b ) = 1.    (7.31)
Thus, the threshold can for instance be obtained by averaging

    b = y_j − Σ_i y_i α_i k(x_j, x_i)    (7.32)
Comparison to RBF Network
over all points with α_j > 0; in other words, over all SVs. Alternatively, one can compute b from the value of the corresponding double dual variable; see Section 10.3 for details. Sometimes it is also useful not to use the "optimal" b, but to change it in order to adjust the number of false positives and false negatives.

Figure 1.7 shows how a simple binary toy problem is solved, using a Support Vector Machine with a radial basis function kernel (7.27). Note that the SVs are the patterns closest to the decision boundary, not only in the feature space (where, by construction, the SVs are the patterns closest to the separating hyperplane), but also in the input space depicted in the figure. This feature differentiates SVMs from other types of classifiers. Figure 7.8 shows both the SVs and the centers extracted by k-means, which are the expansion patterns that a classical RBF network approach would employ.

In a study comparing the two approaches on the USPS problem of handwritten character recognition, an SVM with a Gaussian kernel outperformed the classical RBF network using Gaussian kernels [482]. A hybrid approach, where the SVM
Figure 7.8 RBF centers automatically computed by the Support Vector algorithm (indicated by extra circles), using a Gaussian kernel. The number of SV centers happens to coincide with the number of identifiable clusters (indicated by crosses found by k-means clustering, with k = 2 and k = 3 for balls and circles, respectively), but the naive correspondence between clusters and centers is lost; indeed, 3 of the SV centers are circles, and only 2 of them are balls. Note that the SV centers are chosen with respect to the classification task to be solved (from [482]).
algorithm was used to identify the centers (or hidden units) for the RBF network (that is, as a replacement for k-means), exhibited a performance in between the previous two. The study concluded that the SVM algorithm yielded two advantages: first, it better identified good expansion patterns, and second, its large margin regularizer led to second-layer weights that generalized better. We should add, however, that using clever engineering, the classical RBF algorithm can be improved to achieve a performance close to that of SVMs [427].
7.5 Soft Margin Hyperplanes

So far, we have not said much about when the above will actually work. In practice, a separating hyperplane need not exist; and even if it does, it is not always the best solution to the classification problem. After all, an individual outlier in a data set, for instance a pattern which is mislabelled, can crucially affect the hyperplane. We would rather have an algorithm which can tolerate a certain fraction of outliers.

A natural idea might be to ask for the algorithm to return the hyperplane that leads to the minimal number of training errors. Unfortunately, it turns out that this is a combinatorial problem. Worse still, the problem is even hard to approximate: Ben-David and Simon [34] have recently shown that it is NP-hard to find a hyperplane whose training error is worse by some constant factor than the optimal one. Interestingly, they also show that this difficulty can be alleviated by taking into account the concept of the margin: if one disregards points that are within some fixed positive margin of the hyperplane, the problem has polynomial complexity. Cortes and Vapnik [111] chose a different approach for the SVM, following [40].
Slack Variables
To allow for the possibility of examples violating (7.11), they introduced so-called slack variables

    ξ_i ≥ 0 for all i = 1, ..., m,    (7.33)
and used the relaxed separation constraints (cf. (7.11))

    y_i (⟨w, x_i⟩ + b) ≥ 1 − ξ_i for all i = 1, ..., m.    (7.34)
C-SVC
Clearly, by making ξ_i large enough, the constraint on (x_i, y_i) can always be met. In order not to obtain the trivial solution where all ξ_i take on large values, we thus need to penalize them in the objective function. To this end, a term Σ_i ξ_i is included in (7.10). In the simplest case, referred to as the C-SV classifier, this is done by solving, for some C > 0,

    minimize_{w ∈ H, ξ ∈ R^m, b ∈ R}  τ(w, ξ) = (1/2) ‖w‖² + (C/m) Σ_i ξ_i,    (7.35)
subject to the constraints (7.33) and (7.34). It is instructive to compare this to Theorem 7.3, considering the case ρ = 1. Whenever the constraint (7.34) is met with ξ_i = 0, the corresponding point will not be a margin error. All non-zero slacks ξ_i correspond to margin errors; hence, roughly speaking, the fraction of margin errors in Theorem 7.3 increases with the second term of (7.35). The capacity term, on the other hand, increases with ‖w‖. Hence, for a suitable positive constant C, this approach approximately minimizes the right hand side of the bound. Note, however, that if many of the ξ_i attain large values (in other words, if the classes to be separated strongly overlap, for instance due to noise), then Σ_i ξ_i can be significantly larger than the fraction of margin errors. In that case, there is no guarantee that the hyperplane will generalize well.

As in the separable case (7.15), the solution can be shown to have an expansion

    w = Σ_i α_i y_i x_i,    (7.36)
where non-zero coefficients α_i can only occur if the corresponding example (x_i, y_i) precisely meets the constraint (7.34). Again, the problem only depends on dot products in H, which can be computed by means of the kernel. The coefficients α_i are found by solving the following quadratic programming problem:

    maximize_{α ∈ R^m}  W(α) = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j),    (7.37)
    subject to  0 ≤ α_i ≤ C/m for all i = 1, ..., m,    (7.38)
    and  Σ_i α_i y_i = 0.    (7.39)
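For illustration, present-day libraries expose the C-SV classifier directly; e.g., with scikit-learn (a usage sketch; the data arrays are placeholders):

    from sklearn.svm import SVC

    # C controls the trade-off between margin maximization and slack
    # penalties; larger C penalizes training errors more heavily.
    clf = SVC(C=10.0, kernel='rbf', gamma=0.5)
    clf.fit(X_train, y_train)            # X_train: (m, d); y_train in {-1, +1}
    print(clf.support_)                  # indices of the Support Vectors
    print(clf.decision_function(X_test))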
To compute the threshold b, we take into account that due to (7.34), for Support
ν-SVC
Margin Error
ν-Property
Vectors x_j for which ξ_j = 0, we have (7.31). Thus, the threshold can be obtained by averaging (7.32) over all Support Vectors x_j (recall that they satisfy α_j > 0) with α_j < C/m.

In the above formulation, C is a constant determining the trade-off between two conflicting goals: minimizing the training error, and maximizing the margin. Unfortunately, C is a rather unintuitive parameter, and we have no a priori way to select it.⁹ Therefore, a modification was proposed in [481], which replaces C by a parameter ν; the latter will turn out to control the number of margin errors and Support Vectors. As a primal problem for this approach, termed the ν-SV classifier, we consider

    minimize_{w ∈ H, ξ ∈ R^m, b, ρ ∈ R}  τ(w, ξ, ρ) = (1/2) ‖w‖² − νρ + (1/m) Σ_i ξ_i,    (7.40)
    subject to  y_i (⟨x_i, w⟩ + b) ≥ ρ − ξ_i,    (7.41)
    and  ξ_i ≥ 0, ρ ≥ 0.    (7.42)
Note that no constant C appears in this formulation; instead, there is a parameter ν, and also an additional variable ρ to be optimized. To understand the role of ρ, note that for ξ = 0, the constraint (7.41) simply states that the two classes are separated by the margin 2ρ/‖w‖ (cf. Problem 7.4).

To explain the significance of ν, let us first recall the term margin error: by this, we denote points with ξ_i > 0. These are points which are either errors, or lie within the margin. Formally, the fraction of margin errors is

    R_emp^ρ[g] := (1/m) |{ i | y_i g(x_i) < ρ }|.    (7.43)
Here, g is used to denote the argument of the sgn in the decision function (7.25): f = sgn ∘ g (see footnote 5, p. 344). We are now in a position to state a result that explains the significance of ν.

Proposition 7.5 ([481]) Suppose we run ν-SVC with kernel k on some data, with the result that ρ > 0. Then
(i) ν is an upper bound on the fraction of margin errors.
(ii) ν is a lower bound on the fraction of SVs.
(iii) Suppose the data (x_1, y_1), ..., (x_m, y_m) were generated iid from a distribution P(x, y) = P(x)P(y|x), such that neither P(x, y = 1) nor P(x, y = −1) contains any discrete component. Suppose, moreover, that the kernel used is analytic and non-constant. Then, with probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of errors.
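Before turning to the proof and the dual derivation, note that the ν-property is easy to observe empirically, e.g., with scikit-learn's NuSVC (a usage sketch on synthetic data of our own choosing):

    import numpy as np
    from sklearn.svm import NuSVC

    rng = np.random.RandomState(0)
    X = rng.randn(200, 2)
    y = np.sign(X[:, 0] + 0.3 * rng.randn(200))

    for nu in [0.1, 0.3, 0.5]:
        clf = NuSVC(nu=nu, kernel='rbf').fit(X, y)
        frac_sv = len(clf.support_) / len(X)
        print(nu, frac_sv)  # nu lower-bounds the fraction of SVs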
The proof can be found in Section A.2. Before we get into the technical details of the dual derivation, let us take a look

9. As a default value, we use C/m = 10 unless stated otherwise.
Figure 7.9 Toy problem (task: separate circles from disks) solved using ν-SV classification, with parameter values ranging from ν = 0.1 (top left) to ν = 0.8 (bottom right). The larger we make ν, the more points are allowed to lie inside the margin (depicted by dotted lines). Results are shown for a Gaussian kernel, k(x, x') = exp(−‖x − x'‖²).
Table 7.1 Fractions of errors and SVs, along with the margins of class separation, for the toy example in Figure 7.9. Note that ν upper bounds the fraction of errors and lower bounds the fraction of SVs, and that increasing ν, i.e., allowing more errors, increases the margin.

    ν                    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8
    fraction of errors   0.00   0.07   0.25   0.32   0.39   0.50   0.61   0.71
    fraction of SVs      0.29   0.36   0.43   0.46   0.57   0.68   0.79   0.86
    margin ρ/‖w‖         0.005  0.018  0.115  0.156  0.364  0.419  0.461  0.546

Derivation of the Dual
at a toy example illustrating the influence of ν (Figure 7.9). The corresponding fractions of SVs and margin errors are listed in Table 7.1.

The derivation of the ν-SVC dual is similar to the above SVC formulations, only slightly more complicated. We consider the Lagrangian

    L(w, ξ, b, ρ, α, β, δ) = (1/2) ‖w‖² − νρ + (1/m) Σ_i ξ_i − Σ_i ( α_i (y_i (⟨x_i, w⟩ + b) − ρ + ξ_i) + β_i ξ_i ) − δρ,    (7.44)
using multipliers α_i, β_i, δ ≥ 0. This function has to be minimized with respect to the primal variables w, ξ, b, ρ, and maximized with respect to the dual variables α, β, δ. To eliminate the former, we compute the corresponding partial derivatives
and set them to 0, obtaining the following conditions:

    w = Σ_i α_i y_i x_i,    (7.45)
    α_i + β_i = 1/m,    (7.46)
    Σ_i α_i y_i = 0,    (7.47)
    Σ_i α_i − δ = ν.    (7.48)
Quadratic Program for ν-SVC
Again, in the SV expansion (7.45), the α_i that are non-zero correspond to a constraint (7.41) which is precisely met. Substituting (7.45) and (7.46) into L, using α_i, β_i, δ ≥ 0, and incorporating kernels for dot products, leaves us with the following quadratic optimization problem for ν-SV classification:

    maximize_{α ∈ R^m}  W(α) = −(1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j),    (7.49)
    subject to  0 ≤ α_i ≤ 1/m,    (7.50)
    Σ_i α_i y_i = 0,    (7.51)
    and  Σ_i α_i ≥ ν.    (7.52)
As above, the resulting decision function can be shown to take the form

    f(x) = sgn( Σ_i α_i y_i k(x, x_i) + b ).    (7.53)
Compared with the C-SVC dual (7.37), there are two differences. First, there is an additional constraint (7.52).¹⁰ Second, the linear term Σ_i α_i no longer appears in the objective function (7.49). This has an interesting consequence: (7.49) is now quadratically homogeneous in α. It is straightforward to verify that the same decision function is obtained if we start with the primal function

    τ(w, ξ, ρ) = (1/2) ‖w‖² + C ( −νρ + (1/m) Σ_i ξ_i ),    (7.54)
10. The additional constraint makes it more challenging to come up with efficient training algorithms for large datasets. So far, two approaches have been proposed which work well. One of them slightly modifies the primal problem in order to avoid the other equality constraint (related to the offset b) [98]. The other is a direct generalization of a corresponding algorithm for C-SVC, which reduces the problem for each chunk to a linear system, and which does not suffer any disadvantages from the additional constraint [407, 408]. See also Sections 10.3.2, 10.4.3, and 10.6.3 for further details. Note added in the second printing: it has recently become possible to train these systems as efficiently as C-SVC; see http://www.csie.ntu.edu.tw/~cjlin/papers/nusvmtutorial.pdf
i.e., if one does use C (cf. Problem 7.16). To compute the threshold b and the margin parameter ρ, we consider two sets S±, of identical size s > 0, containing SVs x_i with 0 < α_i < 1/m and y_i = ±1, respectively. Then, due to the KKT conditions, (7.41) becomes an equality with ξ_i = 0. Hence, in terms of kernels,

    b = −(1/(2s)) Σ_{x ∈ S₊ ∪ S₋} Σ_j α_j y_j k(x, x_j),    (7.55)
    ρ = (1/(2s)) ( Σ_{x ∈ S₊} Σ_j α_j y_j k(x, x_j) − Σ_{x ∈ S₋} Σ_j α_j y_j k(x, x_j) ).    (7.56)
Connection between ν-SVC and C-SVC
Note that for the decision function, only b is actually required. A connection to standard SV classification, and a somewhat surprising interpretation of the regularization parameter C, is described by the following result:

Proposition 7.6 (Connection between ν-SVC and C-SVC [481]) If ν-SV classification leads to ρ > 0, then C-SV classification, with C set a priori to 1/ρ, leads to the same decision function.

Proof If we minimize (7.40), and then fix ρ to minimize only over the remaining variables, nothing will change. Hence the solution w₀, b₀, ξ₀ minimizes (7.35), for C = 1, subject to (7.41). To recover the constraint (7.34), we rescale to the set of variables w' = w/ρ, b' = b/ρ, ξ' = ξ/ρ. This leaves us with the objective function (7.35), up to a constant scaling factor ρ², using C = 1/ρ.
Robustness and Outliers
For further details on the connection between ν-SVMs and C-SVMs, see [122, 38]. A complete account has been given by Chang and Lin [98], who show that for a given problem and kernel, there is an interval [ν_min, ν_max] of admissible values for ν, with 0 < ν_min < ν_max < 1. The boundaries of the interval are computed by considering Σᵢ αᵢ as returned by the C-SVM in the limits C → ∞ and C → 0, respectively.

It has been noted that ν-SVMs have an interesting interpretation in terms of reduced convex hulls [122, 38] (cf. (7.21)). If a problem is non-separable, the convex hulls will no longer be disjoint. Therefore, it no longer makes sense to search for the shortest line connecting them, and the approach of (7.22) will fail. In this situation, it seems natural to reduce the convex hulls in size, by limiting the size of the coefficients cᵢ in (7.21) to some value ν ∈ (0, 1). Intuitively, this amounts to limiting the influence of individual points — note that in the original problem (7.22), two single points can already determine the solution. It is possible to show that the ν-SVM formulation solves the problem of finding the hyperplane orthogonal to the closest line connecting the reduced convex hulls [122]; a schematic form of these hulls is given below.
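Schematically, and in notation that is assumed here rather than taken from (7.21), the reduced convex hull of each class is

```latex
C_{\pm}^{\nu} = \left\{ \sum_{\{i \,:\, y_i = \pm 1\}} c_i \mathbf{x}_i \;:\; \sum_{\{i \,:\, y_i = \pm 1\}} c_i = 1, \; 0 \le c_i \le \nu \right\}
```

For ν = 1 this is the usual convex hull; shrinking ν caps each point's weight and contracts the hull toward the class mean, so that the two hulls can become disjoint again even for overlapping classes.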
We now move on to another aspect of soft margin classification. When we introduced the slack variables, we did not attempt to justify the fact that in the objective function, we used the penalizer Σᵢ ξᵢ. Why not use another penalizer, such as Σᵢ ξᵢᵖ, for some p ≥ 0 [111]? For instance, p = 0 would yield a penalizer
that exactly counts the number of margin errors. Unfortunately, however, it is also a penalizer that leads to a combinatorial optimization problem. Penalizers yielding optimization problems that are particularly convenient, on the other hand, are obtained for p = 1 and p = 2 (both are written out below). By default, we use the former, as it possesses an additional property which is statistically attractive. As the following proposition shows, linearity of the target function in the slack variables ξᵢ leads to a certain "outlier" resistance of the estimator. As above, we use the shorthand xᵢ for Φ(xᵢ).
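For concreteness, the two convenient cases correspond to the primal objectives below (a reconstruction in the notation of (7.35), not a display from the text); both are convex quadratic programs, whereas p = 0 replaces the sum by a count of nonzero slacks and is combinatorial, as stated above:

```latex
\tau_1(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i
\qquad
\tau_2(\mathbf{w}, \boldsymbol{\xi}) = \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i^2
```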
Proposition 7.7 (Resistance of SV classification [481]) Suppose w can be expressed in terms of the SVs which are not at bound,

w = Σᵢ γᵢ xᵢ    (7.57)

with γᵢ ≠ 0 only if αᵢ ∈ (0, 1/m) (where the αᵢ are the coefficients of the dual solution). Then local movements of any margin error xₘ parallel to w do not change the hyperplane.¹¹
The proof can be found in Section A.2. For further results in support of the p = 1 case, see [527]. Note that the assumption (7.57) is not as restrictive as it may seem. Even though the SV expansion of the solution, w = Σᵢ αᵢ yᵢ xᵢ, often contains many multipliers αᵢ which are at bound, it is nevertheless quite conceivable, especially when discarding the requirement that the coefficients be bounded, that we can obtain an expansion (7.57) in terms of a subset of the original vectors. For instance, if we have a 2-D problem that we solve directly in input space, i.e., with k(x, x′) = ⟨x, x′⟩, then it suffices to have two linearly independent SVs which are not at bound, in order to express w. This holds true regardless of whether or not the two classes overlap, even if there are many SVs which are at the upper bound. Further information on resistance and robustness of SVMs can be found in Sections 3.4 and 9.3.

We have introduced SVs as those training examples xᵢ for which αᵢ > 0. In some cases, it is useful to further distinguish different types of SVs. For reference purposes, we give a list of different types of SVs (Table 7.2). In Section 7.3, we used the KKT conditions to argue that in the hard margin case, the SVs lie exactly on the margin. Using an identical argument for the soft margin case, we see that in this instance, in-bound SVs lie on the margin (Problem 7.9). Note that in the hard margin case, where α_max = ∞, every SV is an in-bound SV. Note, moreover, that for kernels that produce full-rank Gram matrices, such as the Gaussian (Theorem 2.18), in theory every SV is essential (provided there are no duplicate patterns in the training set).¹²

11. Note that the perturbation of the point is carried out in feature space. What it precisely corresponds to in input space therefore depends on the specific kernel chosen.
12. In practice, Gaussian Gram matrices usually have some eigenvalues that are close to 0.
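To make Proposition 7.7 concrete, here is a small numerical sketch; it is not from the original text, and the dataset, solver, and step size are illustrative assumptions. It uses scikit-learn's linear C-SVM, which implements the p = 1 penalizer, and checks that shifting a margin error parallel to w leaves the hyperplane essentially unchanged:

```python
# Shift a margin error along w and retrain: with the linear (p = 1)
# penalizer, the point's multiplier stays at the upper bound, so the
# hyperplane should be (numerically) unchanged.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X = np.vstack([rng.randn(40, 2) - 1, rng.randn(40, 2) + 1])
y = np.hstack([-np.ones(40), np.ones(40)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)
w = clf.coef_[0]

# Pick a margin error: a training point with y * f(x) < 1.
f = y * clf.decision_function(X)
i = int(np.argmin(f))  # the worst-violating point

# Move it a small distance parallel to w, deeper into the error region.
X2 = X.copy()
X2[i] -= 0.3 * y[i] * w / np.linalg.norm(w)
clf2 = SVC(kernel="linear", C=1.0).fit(X2, y)

print(w, clf.intercept_)
print(clf2.coef_[0], clf2.intercept_)  # nearly identical, up to solver tolerance
```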
Table 7.2  Overview of different types of SVs. In each case, the condition on the Lagrange multiplier αᵢ (corresponding to an SV xᵢ) is given. In the table, α_max stands for the upper bound in the optimization problem; for instance, α_max = C/m in (7.38) and α_max = 1/m in (7.50).