Clustering Time Dependent PITCHf/x Data
by
CHRISTOPHER PICARDO
RICHARD DE VEAUX AND STEVEN MILLER, ADVISORS
A thesis submitted in partial fulfillment
of the requirements for the
Degree of Bachelor of Arts with Honors
in Mathematics
WILLIAMS COLLEGE
Williamstown, Massachusetts
May 8, 2013
ABSTRACT
In this paper I extend the model based clustering framework to data that incorporates an entire time period, specifically single seasons from the PITCHf/x database. Traditional clustering methods are reviewed and described in detail in order to motivate the introduction of model based clustering. In order to apply model based clustering to the time indexed data, a cluster consistency algorithm is proposed that treats the cluster selection problem like a model selection problem from the supervised learning literature. Finally, the cluster consistency procedure is applied to the PITCHf/x dataset to select the appropriate number of clusters for the entire season. The PITCHf/x season data for two starting pitchers is then analyzed using the cluster movements for the entire season.
ACKNOWLEDGEMENTS
I would first like to thank my advisors, Richard De Veaux and Steven Miller, for all of their support throughout the entire thesis project. Without their advice this thesis would not have been possible. I would especially like to thank Steven Miller for rekindling my passion for math during freshman year, for many late night and early morning emails, and for being the biggest baseball fan that I know.
I would also like to thank the Thursday Statistics Group, Jack, Ben, and Alec for being great sounding boards over the year and for all of their advice and suggestions.
Finally, I need to thank my family for always supporting me, my friends for never being too annoyed by me, and the residents of the White House on North Street for a great senior year.
Contents

1 Introduction 6
2 Review of Existing Methodology 9
  2.1 Hierarchical Clustering 11
    2.1.1 Single-Linkage Clustering (SLINK)
    2.1.2 Complete Linkage Clustering (CLINK) 14
  2.2 Center Based (Iterative) Clustering 16
    2.2.1 The K-Means Algorithm
3 Description of Data and the Pitch Classification Problem 21
  3.1 Statistical Analysis and Baseball 21
  3.2 The PITCHf/x System 24
    3.2.1 The PITCHf/x Dataset 27
4 Methodology 28
  4.1 Model Based Clustering 28
    4.1.1 Probability Models 28
    4.1.2 E-M Algorithm
    4.1.3 Modeling Procedure 32
  4.2 Model Based Procedure for Time Dependent Data 34
  4.3 Cluster Choice Procedure 35
    4.3.1 Iris Example 37
5 Results and Analysis 42
  5.1 Cluster Consistency Study
  5.2 Cluster Movement over Time
6 Discussion 54
  6.1 Conclusion 56
Chapter 1
Introduction
The use of Cluster Analysis as a statistical tool is an example of "unsupervised" learning: exploratory methods that attempt to find hidden structures in data sets that are unlabeled. This means that when attempting to uncover these structures, no error rate or "reward" can be calculated when evaluating possible solutions. Supervised algorithms, conversely, rely on developing a model using data with known or true output values, often measured by human experts. These models divide the data into a training set, where the true classifications are known, and a test set, where the goal is to accurately predict classifications for each case. The model is measured by its ability to perform well on the training data, typically by minimizing prediction error. This approach has the nice property that the classification algorithms "learn" the important structures and features of the data from the true classifications and thus hopefully increase prediction accuracy on the test set. Both the traditional and recent clustering methods that I will examine in this paper attempt to construct good partitions of the data without knowing or incorporating the true classification of any observation.
Cluster analysis has a relatively short history, dating back only to the 1950's, but has vast applications ranging from bioinformatics to marketing analysis, from economics to social networks and even, as we shall see, to baseball. There are numerous methods for clustering data, and all of them play an important role both in specific fields and in exploratory data mining; cluster analysis is both a specific endpoint for some experiments and an important preliminary tool for others. For these reasons, the specific clustering procedure for a given problem is very much defined by the data set and the desired use of the results. This paper will focus on extending an existing clustering algorithm, model based clustering, to data sets that are indexed by time. In order to do so, I will introduce a new method of evaluating the accuracy of our clustering models by simulating data labels and treating the problem like a model selection choice in supervised learning. I will examine pitching data from the Major League Baseball PITCHf/x system to show the need for a clustering methodology that incorporates a time component in order to more accurately depict the structure of the data. Furthermore, one of the main results of this paper is to show that even with a distribution based clustering method that selects the appropriate number of clusters, this number is not consistent across the entire time span. Therefore a new algorithm is proposed to determine the most appropriate number of clusters to use over the entire dataset.
The motivation for all of this statistical analysis is the applied problem of classifying pitches and analyzing individual pitchers in Major League Baseball. This motivation is important for several reasons: pitching analysis lags far behind the new advanced methods of analyzing outfield players; current methods of pitch classification are inaccurate and outdated and lead to incorrect results; and, when deciding whether to pay an individual player millions of dollars, it is imperative to have the most accurate information about that player's skill and potential contribution to the team. Successful teams sign good players at a good price rather than average players at a bad price. Furthermore, with the advent of the highly accurate PITCHf/x data, the traditional methods of analyzing pitchers fail to incorporate the wealth of newly available data, and with this data we should be able to develop a method to analyze what makes pitchers like Felix Hernandez so good and pitchers like late career John Rocker so bad.
Chapter 2 focuses on a review of the most common clustering methods that appear in the literature and will examine and describe several of the most important methods in greater depth. Chapter 3 introduces the data set that will be analyzed and the relevant applications for analyzing baseball players. Chapter 4 discusses the methodology that was used to analyze the data, including the unsupervised cluster consistency procedure. Finally, Chapters 5 and 6 present the results of the analysis and a discussion of future work.
Chapter 2
Review of Existing Methodology
Cluster Analysis is a statistical technique first developed in the early 1950's designed to uncover underlying structure in complex datasets. The goal is to partition the data in such a way that observations within clusters have high similarity to each other but low similarity to observations in other clusters. Intuitively, cluster analysis is a relatively simple idea to understand. Essentially, the algorithm groups together observations that are more similar to each other than those in any other group. In other words, a clustering C = (C_1, ..., C_K) of a set of observations S = (x_1, ..., x_n) into K groups assigns each observation x_i a label k ∈ {1, ..., K}, where K denotes the number of clusters. An observation can either be described by a set of measurements assigned to it or by its relation to other observations in S. Because clustering algorithms attempt to uncover structure in different data sets, cluster analysis is an important tool of data mining when the true classifications are not known to the researcher. As a statistical method, cluster analysis is widely used throughout machine learning and is an extremely important tool in computational biology and bioinformatics research; clustering is especially useful in analyzing microarray experiments to determine information about gene expression and interaction. For two recent examples of cluster analysis in bioinformatics research see [Jon C. Wakefield, 2003] and [Inoue et al., 2007].
Because the concept of a cluster can be defined in many different ways, the
term cluster analysis refers not to a specific algorithm but rather to the general problem to be solved: to most accurately partition a given data set into separate groups or clusters. Unfortunately, there is no clear method of determining the true number of clusters that partition the data most accurately. There are a variety of different metrics used to assess the quality of one clustering solution over another, including internal measures such as Within Cluster Sum of Squares, the Davies-Bouldin Index, and the Dunn Index, as well as external evaluations such as the Jaccard Index, Rand Measure, and F-Measure. Some clustering methods, as outlined in Chapter 4, utilize internal selection procedures that attempt to select the optimum number of clusters based on a given penalty function. Other methods, such as k-means, are not able to select the optimum number. Therefore, most clustering functions C_K also take K, the number of clusters to construct, as an additional input. The term cluster analysis, then, should be viewed more accurately as an umbrella term that encompasses a wide variety of algorithms and techniques that model the underlying structures in different ways.
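As a small illustration of the external evaluations mentioned above, the (adjusted) Rand measure compares two partitions while ignoring how the clusters happen to be labeled. The sketch below is hypothetical, assuming scikit-learn is available; it is not code from the thesis.

```python
# Sketch: comparing a clustering against reference labels with an
# external index (the adjusted Rand measure). Assumes scikit-learn.
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 1, 1, 1]
clustering  = [1, 1, 1, 0, 0, 0]  # the same partition, labels permuted

# The adjusted Rand index is invariant to label permutation:
# identical partitions score 1.0 regardless of cluster names.
print(adjusted_rand_score(true_labels, clustering))  # -> 1.0
```

Because the index scores the partition rather than the label names, it is suitable for comparing unsupervised output to known classes.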
An example of the difficulty of the general cluster analysis problem for datasets with clear structure is shown in Figure 2.1: the classic "Iris" data set that motivated the development of Linear Discriminant Analysis by Fisher in 1936 and is often used to test new data mining and machine learning algorithms.
The observations appear to be separated into two distinct clusters; however, there are actually three classifications as indicated by the coloring, with the green and red clusters overlapping only slightly. The cluster on the left is linearly separable from the two clusters on the right, which are not easily separable from each other. Although this dataset has clear structure, two of the clusters are not obviously distinct from each other. For this reason, many clustering methods do not optimally partition this dataset although the task seems relatively straightforward. Thus, different clustering methods have been created to handle different types of
Figure 2.1: Edgar Anderson's Iris Data.
data structure and they each fundamentally treat the concept of a cluster in very different ways. It is up to the analyst, then, to decide on the most appropriate algorithm to use for any given analysis and dataset. This chapter describes some of the earliest and most common clustering methods, and Chapter 4 describes in detail a new method that relies on probability models to partition data.
2.1 Hierarchical Clustering
The first clustering algorithms described in the literature are hierarchical methods. These clustering methods are designed to organize the data into clusters that possess a hierarchical relationship to each other. There are two main approaches to hierarchical clustering: agglomerative and divisive. Agglomerative clustering starts with each point assigned to its own cluster: in the case of the data S = (x_1, ..., x_n) from before, agglomerative clustering begins with n clusters and then merges pairs of clusters together to form the final clustering. Divisive clustering proceeds in the opposite direction: these methods start with one cluster that encompasses the entire data set S and proceed to divide clusters to reach the end result. Both methods are clearly computationally taxing when the data set involved is sufficiently large.
Both methods, agglomerative and divisive, rely on a measure of dissimilarity between observations to merge or split the data. In order to calculate the dissimilarity between observations, hierarchical algorithms use both a distance metric, to calculate the distance between pairs of observations, and a linkage function, which determines the dissimilarity of clusters based on the pairwise distances calculated by the distance metric. The results of a given clustering will depend significantly on the choice of distance measure and linkage criteria. The first advantage of this property is that the observations themselves are not required; the algorithm uses only a Dissimilarity Matrix of mathematically valid distances between observations. The second advantage is that any distance function can be used; the algorithm is not constrained to a specific measure.
The most commonly used distance functions are Euclidean distance, where the distance d between points x and y in $\mathbb{R}^n$ is given by

d = |x - y| = \sqrt{\sum_{i=1}^{n} |x_i - y_i|^2},   (2.1.1)

and squared Euclidean distance, where the distance d between the same points x and y in $\mathbb{R}^n$ is given by

d = |x - y|^2 = \sum_{i=1}^{n} |x_i - y_i|^2.   (2.1.2)
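Equations (2.1.1) and (2.1.2) can be checked directly on a pair of points; a minimal NumPy sketch with hypothetical values:

```python
# Sketch of equations (2.1.1) and (2.1.2): Euclidean and squared
# Euclidean distance between two points in R^n. Values hypothetical.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

sq_euclidean = np.sum(np.abs(x - y) ** 2)  # (2.1.2): 9 + 16 + 0 = 25
euclidean = np.sqrt(sq_euclidean)          # (2.1.1): sqrt(25) = 5

print(euclidean, sq_euclidean)  # -> 5.0 25.0
```

Squared Euclidean distance avoids the square root, which is why some linkage criteria prefer it.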
Because clusters are described by the maximum distance from the center, different clusterings form at different distances. These clustering methods are called hierarchical methods because they do not find one unique clustering but rather many different clusterings that occur at different distances. A dendrogram is generally used to display the results of hierarchical clustering: these charts represent all of the clusterings that form at different distances.
The linkage criteria in hierarchical methods use a given distance metric and dissimilarity matrix to determine the distance between clusters in different ways. There are many arbitrary ways in which one could define the linkage between clusters, but most algorithms that do so are computationally inefficient for large datasets, since the general complexity of any agglomerative clustering is O(2^n). Several methods were discovered that reduced the complexity of the procedure to O(n^2). The two earliest efficient hierarchical methods are Single-Linkage Clustering (SLINK), optimally implemented by [Sibson, 1973], and Complete-Linkage Clustering (CLINK), first implemented by [Defays, 1977] based on Sibson's single-linkage idea.
2.1.1 Single-Linkage Clustering (SLINK)
The Single-linkage clustering algorithm, also known as nearest neighbors, calculates the distance between clusters as the distance between the two closest elements of the clusters. The linkage function calculates the distance D(C_i, C_j) between clusters C_i and C_j as follows, where d is the chosen distance metric:

D(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y).   (2.1.3)

The algorithm simply merges clusters together based on the minimum distance between one cluster and another. For example, if D(C_i, C_j) < D(C_i, C_s) for all s ≠ j, clusters C_i and C_j are merged to form a new cluster C_new. This algorithm then repeats until all observations are located in one cluster or until a given number of clusters k is reached [Sibson, 1973]. Figure 2.2 presents a graphical representation of the Single-linkage function.
Figure 2.2: The Single-linkage function.
This approach, however, has the obvious drawback that it tends to merge clusters together based only on a single element of each cluster being close to another. This property does not allow the SLINK algorithm to handle outliers particularly well. Because the clusters can be grouped together by single points, the overall shape of each cluster may not be taken into sufficient consideration, and long chains of seemingly dissimilar clusters may form, connected only by individual points from each cluster. This chaining effect has led to the development of other algorithms designed to incorporate overall cluster structure more precisely, specifically the Complete-linkage method.
2.1.2 Complete Linkage Clustering (CLINK)
The Complete-linkage, or furthest neighbors algorithm, developed by Defays in 1977 following Sibson's SLINK method, avoids the chaining phenomenon exhibited by the Single-linkage algorithm. Instead of the minimum distance between clusters C_i and C_j, the furthest distance D(C_i, C_j) is calculated:

D(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y).
algorithms such as CLINK and SLINK. These algorithms are not widely used in the data mining community anymore but do provide the theoretical background for other, more advanced algorithms. Hierarchical methods in general are quite flexible, but due to their inherent computational disadvantage and inability to capture more complicated data structures, approaches such as K-means and distribution models have largely replaced them in the literature.
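Both linkage criteria are available in standard libraries; the sketch below runs single (SLINK) and complete (CLINK) linkage on hypothetical toy data using SciPy, which is not the software used in the thesis.

```python
# Sketch: agglomerative clustering with single (SLINK) and complete
# (CLINK) linkage via SciPy. The two-blob toy data are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated blobs in the plane.
data = np.vstack([rng.normal(0, 0.3, (20, 2)),
                  rng.normal(5, 0.3, (20, 2))])

for method in ("single", "complete"):
    Z = linkage(data, method=method)                 # dendrogram encoding
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, np.unique(labels))
```

On data this well separated, both criteria recover the same two-cluster split; the chaining behavior of single linkage only appears when clusters are connected by intermediate points.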
2.2 Center Based (Iterative) Clustering
The most well known and widely used clustering algorithms are the family of center, or centroid, based methods. Center based clustering methods assign points to individual clusters based on the cluster means (centers) instead of partitioning the data based on the distance between specific observations. Each of the center based methods discussed attempts to minimize an objective function that determines how successful a given clustering is. These objective functions often cannot be minimized directly, so an iterative updating technique is used to ensure convergence to a local minimum. Following [Hamerly and Elkan, 2002], we can define the general iterative process for center based methods.
We begin again by defining a d-dimensional data set S = (x_1, ..., x_n) as the data we are interested in clustering. We also define our set of k d-dimensional clusters C = (C_1, ..., C_k). We can then define a membership function m(C_j | x_i) that gives the proportion of the observation x_i that belongs to the cluster C_j. This function is subject to the constraints m(C_j | x_i) ≥ 0 and \sum_{j=1}^{k} m(C_j | x_i) = 1. These constraints simply force the observation x_i to be a member of at least one cluster but not such that its total cluster weight exceeds 1. There are two methods by which algorithms implement these constraints: hard clustering and soft clustering.
Definition 2.2.1 (Hard Clustering). A hard clustering restricts the membership function m such that m(C_j | x_i) ∈ {0, 1}. This property simply forces observations into exactly one cluster.

Definition 2.2.2 (Soft Clustering). A soft clustering treats the membership function m such that 0 ≤ m(C_j | x_i) ≤ 1. This property allows the observation x_i to be assigned to multiple clusters with a varying amount of weight.
Both hard and soft clustering methods have been developed for different center based algorithms. In many cases, soft clustering is used to describe the membership levels of each observation but a hard clustering is assigned by choosing the cluster to which the observation most belongs. The soft clustering algorithms are also referred to as fuzzy clustering algorithms and are quite common in statistical analysis. This paper does not describe in detail fuzzy clustering algorithms, but the model based framework described in Chapter 4 uses the membership probabilities of each observation to assign hard labels.
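The conversion from soft memberships to hard labels described above is just a row-wise argmax over the membership matrix; a minimal sketch with a hypothetical membership matrix:

```python
# Sketch: turning a soft (fuzzy) membership matrix into a hard
# clustering by assigning each observation to its highest-weight
# cluster. The membership values below are hypothetical.
import numpy as np

# m[i, j] = membership of observation i in cluster j; rows sum to 1,
# satisfying the constraints on the membership function.
m = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.6, 0.4]])

assert np.allclose(m.sum(axis=1), 1.0)  # total cluster weight is 1
hard_labels = m.argmax(axis=1)          # hard assignment per observation
print(hard_labels)  # -> [0 1 0]
```

This is the same device the model based framework of Chapter 4 uses: posterior membership probabilities are computed first, and each observation is then labeled with its most probable component.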
2.2.1 The K-Means Algorithm
The K-means algorithm, first named by James MacQueen [MacQueen, 1967] but originally proposed by Hugo Steinhaus [Steinhaus, 1957] and implemented by Stuart Lloyd [Lloyd, Mar] at Bell Labs, attempts to partition a data set S into k clusters where each point is assigned to the cluster with the nearest mean. Like all iterative center based methods, K-means minimizes an objective function to determine the optimal partitions of the data. This objective function is the Within Cluster Sum of Squares, abbreviated WCSS.
Definition 2.2.3 (Within Cluster Sum of Squares). Given a set of observations S = (x_1, ..., x_n) partitioned into k sets C = (C_1, ..., C_k), the WCSS is defined as:

WCSS = \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2,   (2.2.1)

where \mu_i is the mean of the points in C_i. The K-means algorithm seeks the partition C that minimizes this quantity.
The algorithm selects k random points in the domain of S as initial means and assigns each observation to the closest mean at every iteration. After every step, new centers are calculated as the means of the existing clusters, and the algorithm stops when all of the centers remain fixed. As we will see in Chapter 4, the K-means algorithm is a special case of the generalized Expectation-Maximization (E-M) algorithm when the clusters are equal sized, and thus can be divided into an Expectation step that updates the cluster memberships and a Maximization step that calculates the new clusters. Formally, the process is detailed below:
1. Select k points as the initial means m_1^{(1)}, ..., m_k^{(1)}.

2. Alternate between the following two steps until convergence is reached:

(a) Expectation Step: Assign each data point to the cluster whose mean is the closest:

C_i^{(t)} = \{ x_p : \| x_p - m_i^{(t)} \|^2 \le \| x_p - m_j^{(t)} \|^2 \text{ for all } 1 \le j \le k \}.   (2.2.2)

(b) Maximization Step: Calculate the new means m_i^{(t+1)} as the centroids of the new clusters:

m_i^{(t+1)} = \frac{1}{|C_i^{(t)}|} \sum_{x_j \in C_i^{(t)}} x_j.   (2.2.3)
Convergence of the K-means algorithm is reached when the cluster assignments no longer change. Figures 2.4 and 2.5 show the K-means algorithm applied to the "Iris" data with both k = 3 and k = 2 clusters.
The clustering results of K-means on the Iris dataset reveal one of the method's key drawbacks: the clusters are assumed to be spherical in shape and equal in size. When k = 3 clusters are chosen, corresponding to the true number of classes, the K-means algorithm alternates between splitting the cluster on the left and splitting the cluster on the right, as indicated by the coloring of the points in Figure 2.4. When k = 2 clusters are chosen, the algorithm always finds the correct split between the two obvious clusters. In this case, the appropriate number of clusters to use on this data set is not equivalent to the true number of classes. In clustering we normally do not assume we know the true number of classes; however, this simple example shows how the K-means algorithm can fluctuate on datasets with clusters of unequal shapes and sizes. This effect may be due to the algorithm permuting the labeling of the classes based on different initial centers, and we show in Chapter 4 that k = 3 on the "Iris" data is a stable and valid clustering when considering many samples of the data. Chapter 4 also outlines the distribution focused, model based clustering algorithm that can handle datasets with many different types of structure.
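The k = 2 and k = 3 experiments above can be reproduced in a few lines; the sketch below uses scikit-learn (not the software used in the thesis) and reports the final WCSS, which scikit-learn exposes as `inertia_`.

```python
# Sketch: K-means on the Iris data with k = 2 and k = 3 clusters,
# as in the discussion above. Assumes scikit-learn is installed.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

for k in (2, 3):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # km.inertia_ is the Within Cluster Sum of Squares (WCSS) from
    # equation (2.2.1), evaluated at the final partition.
    print(k, round(km.inertia_, 1))
```

Note that WCSS always decreases as k grows, which is why it cannot by itself select the number of clusters; that is exactly the gap the penalized model selection criteria of Chapter 4 address.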
Chapter 3
Description of Data and the Pitch
Classification Problem
3.1 Statistical Analysis and Baseball
Over the last ten years, baseball analysis has undergone a revolutionary period which completely changed the approach most teams use to evaluate player value. The publication of Michael Lewis' Moneyball revealed to the general public that there existed executives, such as Billy Beane of the Oakland Athletics, who were relying on advanced statistical analysis to determine which players to sign to their roster. These executives viewed constructing a team as a type of optimization problem: given a hard constraint such as a limited budget, construct a team that will win the most games. Therefore, executives who were confined by these constraints needed to determine which players produced the most value for their given salary. The general idea was to try to determine player value based only on what the player can control, and therefore these executives devalued traditional statistics such as batting average for hitters and earned run average for pitchers because these statistics are not player independent. Originally these
not true about other aspects of the game because it is much harder to intuitively isolate a player's individual contribution in fielding or pitching.
Unlike batter evaluation, evaluating pitchers is significantly harder and much less intuitive than it may at first seem. The task at first seems trivial because the pitcher is the most isolated of all positions; however, the pitcher is also the most dependent on his team, and so isolating the individual contribution of a single pitcher becomes very difficult. A simple example might help explain the difficulty in analyzing pitchers' value to their teams. Pitchers are often judged by the general public on their ability to win games; unfortunately, win-loss record is still a major tool used to evaluate starting pitchers by Major League teams. However, a win is an important part of the game that is largely out of a pitcher's control: a pitcher can put up an awful performance but still post a win based on a great offensive performance from his own team. On May 21, 2000, Russ Ortiz of the San Francisco Giants gave up 10 earned runs over 6.2 innings yet still recorded the win, with the San Francisco Giants victorious 16-10. Conversely, a pitcher can also pitch a perfect game and still end up losing. The famous example of this occurred on May 26, 1959, when Harvey Haddix of the Pittsburgh Pirates tossed a perfect game for 12 and 1/3 innings before a fielding error and a sacrifice bunt ended up allowing the rival Milwaukee Braves to score the winning run to end the game 1-0. Examples such as these reveal that our conception of pitcher quality is very often reliant on old statistical measures that have little relation to performance quality. Haddix's performance is often regarded as the best pitching performance in baseball history, yet he still lost the game. This paper focuses on a new method of describing a pitcher's performance based on the one thing only the pitcher can control: the actual pitch itself.
Mike Fast, an analyst for the Houston Astros, claims that our current understanding of baseball analysis is too rooted in the outdated box score format, developed in the 19th century, to adequately evaluate player performance [Fast, 2010]. Specifically, he claims that the language we use to describe pitching is easy to understand in terms of balls and strikes, fastballs and curveballs, yet far too general to accurately describe the finer aspects of what really separates pitchers from each other. Until five years ago, fans and analysts were limited to describing pitches by their general location (inside, outside, high or low), pitch type (the previously mentioned fastball and curveball), and other variables such as pitch velocity and pitch outcome. For example, a fan might listen to a commentator announce that Justin Verlander threw a 98 mile per hour fastball up and in for the third swinging strike in the seventh inning. This type of data is incredibly useful for visualizing each pitch and providing a qualitative description of the game but is very difficult to use in advanced analysis as a measure of pitcher value. Many other professional pitchers, as well as some college pitchers, may also be able to throw a 98 mph fastball up and in, yet it is indisputable that Verlander is one of the premier pitchers in professional baseball. There must be something specific to Verlander, and pitchers of equal quality, that separates them from all of the other pitchers in the game. Therefore it is important that we develop a form of analysis that can take advantage of the wealth of current information available and the recent improvements in data collecting ability in order to work towards a new model for effectively determining pitcher quality and value.
3.2 The PITCHf/x System
The PITCHf/x system was first implemented during the 2006 MLB playoffs and was fully installed in every Major League ballpark by the summer of 2008. The system uses two cameras to track the trajectory of each pitch as it leaves the pitcher's hand until the moment it is caught or batted in play. The data is accurate to better than one mile per hour in velocity and less than one inch in distance, giving the analyst a nearly perfect description of the three-dimensional movement,
3.2.1 The PITCHf/x Dataset
The PITCHf/x dataset consists of every single pitch thrown by every major league pitcher in all games since 2008. For each pitch, 43 different attributes are observed, ranging from which stadium the game was played in and who the batter was, to values that exactly describe the movement of the pitch and even the result of the play itself. It is a completely comprehensive dataset, but for the purposes of keeping the analysis efficient and easily interpretable, a dimension reduction algorithm was employed to reduce the number of variables to the most significant contributors to the classification. I implemented the random forest algorithm from JMP to reduce the dimension of the data and select the six most significant variables: Start Speed, End Speed, Movement in the X-direction, Movement in the Y-direction, Spin, and Acceleration of the pitch in the X-direction. These variables are easy to interpret and therefore allow us to infer results from our clustering, such as the average movement of the pitches in each cluster, in a way that is consistent with baseball analysis.
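The variable selection step above used JMP's random forest; the same idea can be sketched with scikit-learn's importance ranking. Everything in this sketch is hypothetical: the column names, the synthetic data, and the labels (in practice one might train against provisional pitch-type labels), and it is not the thesis's actual procedure.

```python
# Sketch of the variable-selection idea: rank attributes by random
# forest importance and keep the top six. Column names, synthetic
# data, and labels are all hypothetical stand-ins for PITCHf/x data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
columns = ["start_speed", "end_speed", "pfx_x", "pfx_z",
           "spin", "ax", "noise_1", "noise_2"]
X = rng.normal(size=(500, len(columns)))
y = (X[:, 0] + X[:, 2] > 0).astype(int)  # labels driven by two columns

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranked = sorted(zip(rf.feature_importances_, columns), reverse=True)
top_six = [name for _, name in ranked[:6]]
print(top_six)
```

The forest assigns high importance to the columns that actually drive the labels, so uninformative attributes fall to the bottom of the ranking and can be dropped.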
Chapter 4
Methodology
This chapter outlines the overall methodology used to more accurately classify the PITCHf/x data for deeper analysis. We discuss the Model Based Clustering procedure developed by [Fraley and Raftery, 1998], which, along with an original modification, is implemented in order to more accurately cluster the pitching data and to extend our analysis of pitching data to full seasons.
4.1 Model Based Clustering
4.1.1 Probability Models
The main difference between model-based clustering and other methods of cluster analysis is that model-based clustering utilizes the assumption that the data are generated by a mixture of underlying probability distributions in which each component identifies a different cluster. Given our familiar d-dimensional dataset S = (x_1, ..., x_n), let f_k(x_i | θ_k) be the density of observation x_i from component k, where the θ_k are the corresponding parameters, and finally let G be the number of components in the mixture. So far this is just the general definition of a mixture model of probability distributions. In order to construct the composite model of our clusters, we can follow two different approaches: the classification likelihood approach or the mixture likelihood approach. The classification likelihood approach maximizes

L_C(θ_1, ..., θ_G; γ_1, ..., γ_n | x) = \prod_{i=1}^{n} f_{γ_i}(x_i | θ_{γ_i}),   (4.1.1)

where the γ_i ∈ {1, ..., G} are discrete labels: γ_i = k if x_i belongs to the kth component. Similarly, the mixture likelihood approach maximizes

L_M(θ_1, ..., θ_G; τ_1, ..., τ_G | x) = \prod_{i=1}^{n} \sum_{k=1}^{G} τ_k f_k(x_i | θ_k),   (4.1.2)

where τ_k is the probability that x_i belongs to the kth component (recall: τ_k ≥ 0 and \sum_{k=1}^{G} τ_k = 1).
The model-based method restricts our distributions fk(xi iBk) to multivariate normal, or Gaussian distributions because this appears to be sufficient in describing most data sets, however, it is possible, but not common, to model using other distributions. Therefore we can describe our parameters Bk in greater detail: each Bk consists of a mean vector J-lk and a covariance matrix commonly denoted as Ek. The density takes the standard multivruiate normal form
tor /-lb while the covariance matrix Ek determines their other characteristics. [Bensmail et al., 1997] created a model for clustering by parameterizing the covariance matrix Ek based on its eigenvalue decomposition of the form
(4. 1.4)
where Dk is the orthogonal matrix of eigenvectors, Ak is a diagonal mat1ix with elements proportional to the eigenvalues of Ek and )..k is simply a scalar. In
other words, Dk controls the orientation, Ak controls the shape and )...k controls the shape of our density fk(xiiBk). These characteristics are estimated from the data and therefore the model allows the shape and orientation of each cluster to be quite flexible. It follows immediately that the K-means algmi.thm is the case
I:k = )...J that restricts the clusters to spherical shape and equal volume; thus we can view the K-means algorithm as a simple case of the basic Gaussian mixture model. For the full table of the different shapes and orientations made possible by this parametrization see [Banfield and Raftery, 1993]. Our goal for the clustering problem in either the case of 4. 1.1 or 4. 1.2 is to find B such that our likelihood function £ is maximized. That is we want to find the best estimate B* such that
\[
\theta^{*} \;=\; \operatorname*{arg\,max}_{\theta}\; \mathcal{L}(\theta \mid X). \qquad (4.1.5)
\]

In most cases we optimize the log-likelihood log[\(\mathcal{L}(\theta \mid X)\)] due to its greater analytical ease.
4.1.2 E-M Algorithm
In practice it is impossible to analytically optimize 4.1.1 or 4.1.2, so iterative techniques are used to find a local solution. The most important and powerful of these techniques is the Expectation-Maximization, or E-M, algorithm. The E-M algorithm is used to solve the maximum likelihood problem when analytic solutions are impossible. Specifically, E-M is used to find the maximum likelihood estimates of parameters when the underlying data is incomplete or missing. Even if the data is complete, the maximum likelihood problem can be simplified by assuming that additional hidden parameters exist and assuming values for these parameters. In clustering applications, for the purpose of the E-M algorithm, our
data set X = (x_1, ..., x_n) is considered to be incomplete and Y = (y_1, ..., y_n) is considered to be the complete data, with y_i = (x_i, z_i) and z_i ∈ Z = (z_1, ..., z_G). The z_i are defined component-wise as

\[
z_{ik} \;=\;
\begin{cases}
1 & \text{if } x_i \in \text{group } k,\\
0 & \text{otherwise},
\end{cases}
\]
and are generally called the missing or hidden data. We can then define a new joint density function for the complete data
\[
f(y \mid \theta) \;=\; f(X, Z \mid \theta) \;=\; f(Z \mid X, \theta)\, f(X \mid \theta). \qquad (4.1.6)
\]
This function generally arises from the assumptions about the hidden parameters; in this case a Gaussian distribution is assumed. Therefore, in place of the original likelihood functions \(\mathcal{L}_C\) from Equation (4.1.1) and \(\mathcal{L}_M\) from Equation (4.1.2), we can construct a new complete-data likelihood function
\[
\mathcal{L}(\theta \mid Y) \;=\; \mathcal{L}(\theta \mid X, Z) \;=\; f(X, Z \mid \theta). \qquad (4.1.7)
\]
Equation (4.1.7) gives the general form of the complete-data likelihood function that we wish to optimize. The following two steps give the general case of the E-M algorithm.
1. Expectation Step. Calculate the expected value of the complete-data log-likelihood log f(X, Z | θ) with respect to the unknown data Z, given the observed data X and the current parameter estimates θ^(i−1). In other words, we compute

\[
Q\!\left(\theta, \theta^{(i-1)}\right) \;=\; \mathbb{E}\!\left[\log f(X, Z \mid \theta) \,\middle|\, X, \theta^{(i-1)}\right].
\]
labels (γ_1, ..., γ_k) that correspond to the observations of Y in the same way as Step 3.

6. Calculate the Prediction Error as the percentage of misclassified results, using (γ_1, ..., γ_k) as labels.

7. Repeat steps 2-4 for k = 2 to k = n, where n is appropriately large for the data set.

8. Repeat Steps 1-5 for many random samples of the data.
The clustering with the best consistency is the number of clusters k that produces the lowest average misclassification rate over all the selected samples of the data. This approach tests the consistency of a classifier C(X) in generalizing across the entire data set because the training and test sets are both drawn independently from the complete data set. For the pitching data we select the classifier with the best generalizing power to account for movement in the clusters over time.
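The procedure above can be sketched end to end. In this illustration plain k-means stands in for the model-based clusterer, the data is synthetic, brute-force permutation matching handles the arbitrariness of cluster labels, and all function names (`kmeans`, `consistency_error`) are made up for the sketch.

```python
import numpy as np
from itertools import permutations

def kmeans(X, k, iters=50):
    """Plain Lloyd's k-means with farthest-point initialization
    (a stand-in for the model-based clusterer used in the thesis)."""
    centers = [X[0]]
    for _ in range(k - 1):  # farthest-point init: robust for separated clusters
        d2 = np.min(((X[:, None] - np.array(centers)) ** 2).sum(-1), axis=1)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def consistency_error(X, k, seed=0):
    """One pass of the consistency procedure: split the sample in half,
    cluster the training half, transfer its labels to the test half via the
    nearest training center ("true" labels), cluster the test half
    independently, and report the misclassification rate under the best
    matching of the arbitrary cluster labels."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    train, test = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]
    centers, _ = kmeans(train, k)
    truth = np.argmin(((test[:, None] - centers) ** 2).sum(-1), axis=1)
    _, pred = kmeans(test, k)
    return min(np.mean(np.take(p, pred) != truth)
               for p in permutations(range(k)))

# two well-separated synthetic "pitch" clusters: k = 2 should be most consistent
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.5, (60, 2)), rng.normal(5, 0.5, (60, 2))])
errors = {k: np.mean([consistency_error(X, k, seed=s) for s in range(5)])
          for k in (2, 3, 4)}
print(errors)
```

The selected k is then simply the argmin of the averaged errors, mirroring the "lowest average misclassification rate" rule stated above.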
Figure 4.5: Consistency for K-Means Clustering of the "Iris" data.
select the two cluster solution proposed by both algorithms based on its extremely high consistency measure. We might also select a three cluster solution based on the consistency plot and the associated BIC evidence from the mclust algorithm: the BIC of the two cluster solution is -561 while the BIC of the three cluster solution is -562; these values are not significantly different. The three cluster solution, shown in Figure 4.6, classifies the data nearly perfectly, with a 3% error rate and only five misclassified observations. The k-means three cluster solution performs equally well.
Chapter 5
Results and Analysis
5.1 Cluster Consistency Study
For the PITCHf/x data, I selected several prominent pitchers to analyze over a single season. These pitchers are Felix Hernandez, Daisuke Matsuzaka, Justin Verlander, R.A. Dickey, Jason Motte, and Mariano Rivera. The data was downloaded from the PITCHf/x database, imported into JMP, and reduced in dimension through the random forest algorithm. The reduced data was then imported into R for the cluster analysis. The two main goals of the analysis were to effectively select the correct number of clusters to consider throughout an entire season and, from this knowledge, to cluster the entire season's data, displaying the movement of each pitch cluster over the entire time period. These paths should be useful in formulating new ways of analyzing a pitcher's performance over a single season, based on the differences in his pitches from game to game. Importantly, throughout the entire analysis the pitch labels provided by the PITCHf/x data are ignored; we instead rely only on the clustering algorithm to discover the appropriate groupings.
The cluster choice algorithm works very well for the four starting pitchers analyzed. Each pitcher's start data was randomly split into a 600 observation subset, which was subsequently split into independent, equal sized training and test sets. The data was sampled from the entire time frame, effectively creating random subsets of three games' worth of data, assuming each game is roughly equal to 100 pitches. The term "error rate" in this context refers to the misclassification rate of the test clustering, assuming the "true" labels are those produced by the training algorithm. For example, if we are analyzing the consistency of the three component solutions, an observation x is deemed misclassified if the test clustering assigns it to cluster 2 when the true cluster was cluster 1. The overall error rate is simply the ratio of these errors to the total number of observations.
Figure 5.1 shows the cluster consistency plot for Felix Hernandez' 2012 season. The labels assigned by the PITCHf/x database record Hernandez as throwing 7 pitches, 6 when discounting the "FA" category that accounts for only 8 total incidences, less than .3% of the data. The cluster consistency plot, however, suggests that a 5 pitch classification is most appropriate.
Figure 5.1: Felix Hernandez cluster consistency, indicating evidence for 5 clusters.
The lowest average error occurs at k = 5, shown by the clear dip in the graph at that point. At k = 5 the average error is 17.3%, significantly lower than the 21.6% at k = 4 or the 22.6% at k = 6.
Figure 5.2 shows Justin Verlander's cluster consistency plot for the 2012 season. Verlander, like Hernandez, is a Cy Young winning, dominant starting pitcher. He is known to throw four pitches, with the ability to change his fastball velocity at will according to each at bat. The PITCHf/x database lists five pitches that Verlander was classified as having thrown in the 2012 season. The cluster consistency plot indicates that k = 4 clusters is the appropriate choice, with the lowest average error of 13.6%.
Figure 5.2: Justin Verlander cluster consistency, indicating evidence for 4 clusters.
The Verlander plot shows a lower average error at each clustering level than the Hernandez plot does but still exhibits the same indicative dip at the appropriate clustering. Both graphs also seem to indicate that the PITCHf/x database is consistently overestimating the number of pitches that a pitcher throws.
Another example of the overestimation of pitches by PITCHf/x is for the pitcher Daisuke Matsuzaka. Matsuzaka, a highly sought after free agent who joined the Boston Red Sox from Japan, was always reported to throw a vast array of pitches. The Boston Globe even estimated that he threw at least 8 different pitches, as well as the mythic "Gyroball" that may or may not have been pitch number 9. The PITCHf/x database lists 7 pitch classifications for Matsuzaka's 2008 season. The cluster consistency plot in Figure 5.3, however, only gives evidence for 4 pitches, with the other stable clustering occurring at k = 3. The
shows that the ending speed of both of his pitches was very consistent throughout the season, with a slight increase in the middle of the season. Likewise, Figure 5.11 shows that there is a clear increase in average vertical movement during the middle of the season; this may indicate that Dickey was able to make his knuckleball "float" more during his intermediate starts. Finally, Figure 5.12 shows that the horizontal movement of the knuckleball was more unpredictable, as expected: the knuckleball is thrown with the intention that it "dances" horizontally back and forth in order to confuse the hitter, and Figure 5.12 gives evidence to support that baseball claim.
Overall, the charts give a good idea of the changes throughout the season in each pitch's movement and velocity. The ability to accurately track these aspects will hopefully help baseball analysts more accurately track a pitcher's performance throughout the season based on only what he can control: the pitches themselves.
Chapter 6
Discussion
The initial goal of this paper was to develop a method for analyzing pitchers that incorporates only the actual pitches themselves. Cluster analysis was suggested as a possible model for the analysis of the data, given the spatial aspect of the data and the natural clustering tendency that is easily detectable by the naked eye. After briefly describing the history of clustering, we have given a detailed description of model based clustering, a relatively new and powerful method for clustering data sets. By applying model based clustering to actual pitching data from Major League Baseball games, it became clear that a new method of determining a consistent number of clusters needed to be introduced to extend the current methodology to data sets that are indexed by time. We therefore developed a procedure that treats cluster choice as a model selection problem and selects the number of clusters based on the lowest total misclassification rate rather than a BIC penalty. This procedure produced quite strong results on the Iris dataset test and also for many of the pitchers analyzed. The cluster consistency measure allowed us to present and analyze pitching data in a novel way by tracking the trajectories of the individual pitch clusters throughout an entire season. There are still many remaining questions about cluster analysis, pitching analysis, and the combination of the two. The cluster choice procedure relies on simulating labels for the data using the clustering algorithm itself before calculating the error between the training and test results; however, it would be interesting to compare different classification methods (logistic regression, decision trees, neural networks, etc.) for simulating the labels of the training set in order to test the accuracy of the algorithm. Furthermore, I treated the cluster consistency method as if it were a problem of model selection for supervised learning. It would be useful to test other measures, such as the Mutual Information Criterion, Rand Index, or Jaccard Index for external evaluation and different variations of internal evaluative measures, and compare the results to those given by the cluster consistency measure. There may be effects of the sample size and sample selection on the results of the cluster consistency algorithm. [Tibshirani and Walther, 2005] claim that the sample size has a negligible effect on the results of their pairwise measure; however, the data is structured into groups of different sizes, and therefore sampling evenly from the entire data set may overweight some observations too heavily and not accurately reflect the distribution of pitches in each game. Finally, the PITCHf/x data is poorly labeled but not completely unlabeled and thus would be a nice dataset for analyzing methods of semi-supervised clustering. I did not come across any of these methods in my literature review, but it seems plausible that semi-supervised techniques could be used to improve clustering accuracy.
There is still much work to be done on using cluster analysis to help analyze individual pitchers. In the analysis of the results I gave simple observations based on the plots themselves, but a more thorough analysis of the clusters along with the game data will certainly reveal more about the effects of different pitches over the season. Some relevant open questions are:
• Can we use the cluster paths to group different pitchers together? Can we classify certain pitchers as "consistent" or others as "inconsistent" based solely on these paths and not subjective analysis?

• How does the position of each path correlate to good starts or bad starts? Can we predict whether a pitcher will pitch well or not based on the cluster movement from previous games?

• Can these paths be used to analyze pitcher fatigue and its effects on injury? For example, can we model injury rate based on negative trends in velocity or pitch movement?

• Do new clusters develop over a certain time period? Are there points where we can see new clusters form, and if so, does that give us meaningful information about the development of a new pitch?
6.1 Conclusion
This thesis presented a thorough review of the history of clustering techniques, beginning with simple hierarchal models and concluding with an application of a distribution based algorithm on a new database. This work mostly falls under the umbrella of new research being done on Cluster Stability and its associated use as a component selection technique. However, treating the selection problem as an implicit model selection problem breaks from other selection procedures by offering easily interpretable results. Because the pitching data is indexed by time, the model based algorithm did not choose a consistent number of clusters to partition the data for the entire time period, but it is clear that a single, logically supported number needed to be selected. To do so, I introduced the concept of Cluster Consistency in order to choose the most plausible number of clusters for the entire time period. The cluster consistency measure has a clear and compelling interpretation and was shown to work well for various data sets. Furthermore, the model based algorithm accounts for the potential for clusters of different shapes in the data, and therefore the Cluster Consistency measure should not be strongly
biased by the choice of model.
The method is not without limitations but performs well often enough to remain a useful tool. It struggles with clusters that are not well separated, as in the last two cases examined, but many existing methods also struggle in these situations, as evidenced by the occasionally insufficient BIC penalty used by MCLUST. Using this procedure allowed for the consistent clustering of a time indexed data set. Finally, the PITCHf/x data for two starting pitchers was analyzed and presented over an entire season in an entirely new way, substituting empirical methods for traditional subjective baseball analysis. This type of work was previously only done based on arbitrarily classified pitches and individual games. I extended the traditional method to clusters of pitches over an entire season. There is much analysis left to do, but this thesis hopefully opens up a new avenue of baseball and cluster analysis that can be explored more in the future.
Bibliography
[Banfield and Raftery, 1993] Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, pages 803-821.

[Bensmail et al., 1997] Bensmail, H., Celeux, G., Raftery, A. E., and Robert, C. P. (1997). Inference in model-based cluster analysis. Statistics and Computing, 7(1):1-10.

[Bilmes, 1998] Bilmes, J. A. (1998). A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Technical Report TR-97-021, International Computer Science Institute, UC Berkeley.

[Celeux and Govaert, 1992] Celeux, G. and Govaert, G. (1992). A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14(3):315-332.

[Defays, 1977] Defays, D. (1977). An efficient algorithm for a complete link method. The Computer Journal, 20(4):364-366.

[Fast, 2010] Fast, M. (2010). What the heck is PITCHf/x? The Hardball Times Annual, pages 153-158.

[Fraley and Raftery, 1998] Fraley, C. and Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8):578-588.

[Hamerly and Elkan, 2002] Hamerly, G. and Elkan, C. (2002). Alternatives to the k-means algorithm that find better clusterings. In Proceedings of the Eleventh International Conference on Information and Knowledge Management, pages 600-607.

[Hastie et al., 2008] Hastie, T., Tibshirani, R., and Friedman, J. (2008). The Elements of Statistical Learning.

[Inoue et al., 2007] Inoue, L. Y., Neira, M., Nelson, C., Gleave, M., and Etzioni, R. (2007). Cluster-based network model for time-course gene expression data. Biostatistics, 8(3):507-525.

[Jon C. Wakefield, 2003] Wakefield, J. C., Zhou, C., and Self, S. G. (2003). Modelling gene expression data over time: Curve clustering with informative prior distributions. Bayesian Statistics, 7.

[Kass and Raftery, 1995] Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773-795.

[Lloyd, 1982] Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129-137.

[MacQueen, 1967] MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297. California, USA.

[Roth et al., ] Roth, V., Lange, T., Braun, M., and Buhmann, J. A resampling approach to cluster validation.

[Sibson, 1973] Sibson, R. (1973). SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30-34.

[Steinhaus, 1957] Steinhaus, H. (1957). The problem of estimation. The Annals of Mathematical Statistics, 28(3):633-648.

[Tango et al., 2007] Tango, T., Lichtman, M., and Dolphin, A. (2007). The Book: Playing the Percentages in Baseball. Potomac Books Incorporated.

[Tibshirani and Walther, 2005] Tibshirani, R. and Walther, G. (2005). Cluster validation by prediction strength. Journal of Computational and Graphical Statistics, 14(3):511-528.

[von Luxburg, 2010] von Luxburg, U. (2010). Clustering stability: An overview. Foundations and Trends in Machine Learning, 2(3):235-274.