The Statistical Analysis of Compositional Data
May 9-10, 2011
To provide an introduction to theoretical and practical aspects of statistical analysis of compositional data.
Compositional data (CoDa) are typically defined as vectors of positive components with a constant sum, usually 100% or 1. These conditions render the classical statistical techniques useless on compositions, as they were devised for unbounded real vectors. However, many more types of data share the same limitations: as soon as the variables of a data set express the relative importance of some parts of a whole, the data must be considered compositional, and classical statistics should then be avoided. Typical examples of these disguised compositions are data presented in ppm, ppb, molarities, or any other concentration units.
Aitchison introduced the log-ratio approach to the analysis of CoDa back in the eighties. His solution was to transform the composition with one of two standard log-ratio transformations (alr and clr, the additive and centered log-ratio transformations, respectively) and to apply the classical techniques to the scores so obtained. This became the foundation of modern compositional data analysis, which is nowadays based on the geometric structure of the simplex itself, the sample space of compositional data. Since then, progress has been made in understanding the geometry peculiar to this sample space, the D-part simplex. In this geometry, classical translation is replaced by a multiplication-based perturbation, and classical scaling is replaced by powering. Moreover, some log-ratios play the role of coordinates, and the most familiar vector procedures (sums, orthogonal projections, distances, …) are available using coordinates. This course will present the current state of the art in this field of active research and will cover the following topics:
- Hypotheses underlying statistical data analysis (sample space, scale).
- The Aitchison geometry of the simplex.
- Coordinate representation; distributions on the simplex.
- Exploratory analysis (centering, variation array, biplot, balances-dendrogram).
- Linear processes in the simplex; regression.
- CoDa-discriminant analysis and logistic regression.
- Introduction to CoDaPack, a user-friendly freeware package; discussion of case studies.
- first semester courses in statistics, algebra and calculus;
- basic knowledge in multivariate statistics.
Participants are encouraged to bring their own laptops for the practicals. Microsoft Excel (MS Office) should be installed on the laptop. Anyone unable to bring a laptop should state this in the registration form.
Language of the course: English
May 9: 10h-13h + lunch time + 15h-18h
May 10: 10h-13h
Location: Hotel Eden Roc, Sant Feliu de Guíxols (Girona, Spain)
Places on the course are limited, and workshop participants will be given preference.
More on the course contents
1.- Hypotheses underlying statistical data analysis (sample space, scale).
Pearson (1896) was the first to detect what came to be called spurious correlation in CoDa. The need for a special treatment of CoDa began to be recognized as far back as the sixties, when Chayes first examined the effects of spurious correlation (induced by the forced sum to 100%, and not due to any natural process) on all sorts of multivariate statistical techniques. The key to getting out of the problem was to realize that the information conveyed by a composition is purely relative: from a compositional data set alone, we can only make statements about the evolution or change of (log-)ratios of components. Any statement about absolute increments or decrements of any variable is utterly spurious, as we will never be able to distinguish between them: for instance, in a 3-part system [A,B,C], we can pass from [25,50,25]% to [50,25,25]% by transferring 25 mass units out of 100 from B to A, but we could also keep A stable and remove 3/4 of part B and 1/2 of part C. The aim of this unit is to show why we should avoid asking questions about absolute increments and reductions of components: no valid answer to them can be extracted by analysing compositional data sets alone. These considerations lead to the formulation of the principles of CoDa analysis: scale invariance, permutation invariance and sub-compositional coherence.
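The 3-part example can be checked numerically. The following is a minimal sketch, not course material: the helper name `closure` and the variable names are illustrative assumptions. It shows that a pure transfer from B to A and a pure removal of mass (3/4 of B and 1/2 of C, with A untouched) end in the same composition, so the closed data alone cannot tell them apart.

```python
def closure(parts, total=100.0):
    """Rescale positive parts so that they sum to `total` (hypothetical helper)."""
    s = sum(parts)
    return [total * p / s for p in parts]

# Masses of A, B, C; they sum to 100, so they are also percentages.
start = [25.0, 50.0, 25.0]

# Scenario 1: move 25 mass units from B to A; the total mass stays at 100.
transfer = [start[0] + 25.0, start[1] - 25.0, start[2]]

# Scenario 2: keep A fixed, remove 3/4 of B and 1/2 of C; the total drops to 50.
removal = [start[0], start[1] * 0.25, start[2] * 0.5]

comp_transfer = closure(transfer)
comp_removal = closure(removal)

print("after transfer:", comp_transfer)
print("after removal: ", comp_removal)
```

Both scenarios differ in absolute terms (total mass 100 versus 50) but are identical once closed to 100%.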
2.- The Aitchison geometry of the simplex.
From the realization that compositional data carry only relative information, Aitchison deduced that the fundamental operation of change for compositions also had to be of a relative nature: the perturbation of two compositions is their closed component-wise product. The inverse perturbation is perhaps easier to interpret, as it describes the change between an initial state z(0) and a final one z as the closed division C[z_1/z_1(0), ..., z_D/z_D(0)], where the operation C[·] just ensures that its argument vector is re-closed to sum up to 100% (or the total sum we are working with). This operation, complemented with the closed component-wise powering of a composition by a scalar, builds a vector space structure on the simplex. Finally, by adding a log-ratio scalar product, this space is given a Euclidean structure, in which we can measure distances and angles, and define concepts like orthogonality, projections, lines, hyperplanes, ellipses, etc. Elementary statistical concepts involving the metrics of the sample space (mean, variance, confidence regions, ...) will later be reviewed and adapted to the Euclidean structure of the simplex, in section 4.
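The operations described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function names (`closure`, `perturb`, `power`, `clr`, `aitchison_distance`) are hypothetical, and the distance is computed as the Euclidean distance between clr scores, which is one standard way of writing the Aitchison distance.

```python
import math

def closure(x, total=1.0):
    """Rescale positive parts to a constant sum."""
    s = sum(x)
    return [total * xi / s for xi in x]

def perturb(x, y):
    """Perturbation: closed component-wise product."""
    return closure([xi * yi for xi, yi in zip(x, y)])

def power(x, a):
    """Powering: closed component-wise power by the scalar a."""
    return closure([xi ** a for xi in x])

def clr(x):
    """Centered log-ratio scores: logs minus their mean."""
    logs = [math.log(xi) for xi in x]
    m = sum(logs) / len(logs)
    return [v - m for v in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr scores of x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

z0 = closure([25, 50, 25])           # initial state z(0)
z = closure([50, 25, 25])            # final state z

# Inverse perturbation: the change from z0 to z, C[z_1/z_1(0), ..., z_D/z_D(0)].
change = perturb(z, power(z0, -1))
recovered = perturb(z0, change)      # applying the change to z0 gives z back

# The distance is unaffected by perturbing both compositions by the same shift.
shift = closure([1, 3, 2])
d_before = aitchison_distance(z0, z)
d_after = aitchison_distance(perturb(z0, shift), perturb(z, shift))

print(change, recovered, d_before, d_after)
```

Perturbation plays the role of translation here, which is why the distance between two compositions does not change when both are perturbed by the same `shift`.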
3.- Coordinate representation; distributions on the simplex.
Euclidean spaces permit the definition of reference systems and the corresponding coordinates. Orthonormal bases are of primary interest because they allow us to translate all operations and metrics of the simplex into standard operations and metrics of real vector spaces. Two useful procedures are available for building orthonormal coordinate systems. The first one is to carry out a CoDa-principal component analysis, which is obtained by representing the raw percentages or proportions as centered log-ratio (clr) scores and then applying a singular value decomposition (SVD). The second one is to define groups of compositional components (a sequential binary partition), which generates an orthonormal system of coordinates called balances. The first technique is clearly adequate when the user has no preference regarding the interpretation of the coordinates. The second technique can be adapted to the user's preferences, thus enhancing the interpretation of the coordinates. Balance-coordinates are useful for understanding the meaning of orthogonal projections in the simplex. For instance, the extraction of a sub-composition, or the grouping of some components, can be interpreted as orthogonal projections.
The Euclidean structure of the simplex also implies a natural measure of reference on it. This measure, called the Aitchison measure, can be used to better visualize and understand probability distributions on the simplex. The main distribution on the simplex is the logistic-normal one. When the coordinates of a random composition are distributed as a multivariate normal, the distribution is logistic-normal, or simply normal on the simplex.
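As an illustration of the second procedure, the sketch below builds balance coordinates from a sequential binary partition coded with +1, -1 and 0. The function names, the 4-part partition and the example data are assumptions made for this sketch. It uses the usual balance formula, sqrt(rs/(r+s)) times the log-ratio of the geometric means of the two groups, written here as a scalar product with an orthonormal basis vector.

```python
import math

def closure(x):
    s = sum(x)
    return [xi / s for xi in x]

def balance_basis(sbp):
    """One basis vector (in clr space) per row of the partition.

    A row codes a split with +1 (first group), -1 (second group), 0 (not involved).
    """
    basis = []
    for row in sbp:
        r = sum(1 for c in row if c > 0)   # size of the +1 group
        s = sum(1 for c in row if c < 0)   # size of the -1 group
        pos = math.sqrt(s / (r * (r + s)))
        neg = -math.sqrt(r / (s * (r + s)))
        basis.append([pos if c > 0 else neg if c < 0 else 0.0 for c in row])
    return basis

def ilr(x, basis):
    """Balance coordinates; plain logs suffice because each basis vector sums to 0."""
    logs = [math.log(xi) for xi in x]
    return [sum(b * v for b, v in zip(vec, logs)) for vec in basis]

def ilr_inv(coords, basis):
    """Back-transform coordinates to a composition closed to 1."""
    d = len(basis[0])
    clr_vals = [sum(c * vec[j] for c, vec in zip(coords, basis)) for j in range(d)]
    return closure([math.exp(v) for v in clr_vals])

# Hypothetical partition of 4 parts: {1,2} vs {3,4}, then 1 vs 2, then 3 vs 4.
sbp = [[+1, +1, -1, -1],
       [+1, -1, 0, 0],
       [0, 0, +1, -1]]
basis = balance_basis(sbp)

x = [10.0, 20.0, 30.0, 40.0]
y = [25.0, 25.0, 10.0, 40.0]
bx, by = ilr(x, basis), ilr(y, basis)
back = ilr_inv(bx, basis)
dist_in_coordinates = math.sqrt(sum((a - b) ** 2 for a, b in zip(bx, by)))
print(bx, back, dist_in_coordinates)
```

Because the basis is orthonormal, the ordinary Euclidean distance between the balance vectors equals the Aitchison distance between the two compositions, and the coordinates can be fed to standard multivariate software.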
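A random composition that is normal on the simplex can be simulated by drawing normal coordinates and mapping them back to the simplex. The sketch below is only illustrative: the 3-part basis, the mean and standard deviations, and the assumption of independent coordinates (a diagonal covariance) are choices made here for simplicity, not part of the course description.

```python
import math
import random

SQRT2, SQRT6 = math.sqrt(2.0), math.sqrt(6.0)
# One possible orthonormal basis for 3 parts, written as clr vectors.
BASIS = [[1 / SQRT2, -1 / SQRT2, 0.0],
         [1 / SQRT6, 1 / SQRT6, -2 / SQRT6]]

def closure(x):
    s = sum(x)
    return [xi / s for xi in x]

def ilr(x):
    logs = [math.log(xi) for xi in x]
    return [sum(b * v for b, v in zip(vec, logs)) for vec in BASIS]

def ilr_inv(coords):
    clr_vals = [sum(c * vec[j] for c, vec in zip(coords, BASIS)) for j in range(3)]
    return closure([math.exp(v) for v in clr_vals])

def sample_normal_on_simplex(mean, sd, n, rng):
    """Draw n compositions whose coordinates are independent normals (assumption)."""
    return [ilr_inv([rng.gauss(m, s) for m, s in zip(mean, sd)]) for _ in range(n)]

def center(sample):
    """Closed vector of component-wise geometric means."""
    n, d = len(sample), len(sample[0])
    return closure([math.exp(sum(math.log(x[j]) for x in sample) / n) for j in range(d)])

rng = random.Random(2011)  # fixed seed so the run is reproducible
sample = sample_normal_on_simplex(mean=[0.5, -0.2], sd=[0.3, 0.6], n=500, rng=rng)
center_comp = center(sample)
print(center_comp)
```

The center of the simulated sample, expressed in coordinates, coincides with the arithmetic mean of the sampled coordinates, which is the sense in which the mean is adapted to the geometry of the simplex.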
4.- Exploratory analysis of CoDa.
Elementary statistics for CoDa should agree with the principles of CoDa analysis. A consequence is that statistics defined for single compositional components are meaningless. The analysis of simple log-ratios should be used instead. Means and variances of the simple log-ratios, organized in a variation array, are a standard tool in a first exploration of CoDa. The CoDa-biplot is a simultaneous two-dimensional projection of the clr-components and the data, and it is based on the CoDa-principal component analysis introduced in the preceding section. The interpretation of the CoDa-biplot has some characteristics that deserve particular attention. If data are represented in balance-coordinates, the CoDa-dendrogram is a simultaneous visualization of the sequential binary partition that generates the balances, together with some descriptions of their marginal distributions, such as the means, quantiles and variances of the balances. It also allows the visual comparison of different populations. Both representations (principal components and balance-coordinates) are closely related to dimension-reduction techniques, which are needed when dealing with a large number of compositional components.
Because the techniques mentioned are based on log-ratios, the CoDa should be free of zeros. The treatment of zeros is a difficult issue and requires either imputation techniques or changes in the sample space. The practical implications of these issues when exploring compositional data will also be discussed here.
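A variation array can be computed directly from its definition. The sketch below is an illustration only: the five 3-part compositions are made-up numbers and the function names are hypothetical. It stores the mean and the sample variance of ln(x_i/x_j) for every pair of parts, and derives the total variance as the sum of all entries of the variance table divided by 2D.

```python
import math
from statistics import mean, variance

# Hypothetical data set: 5 observations of a 3-part composition, in percent.
data = [[48.0, 32.0, 20.0],
        [52.0, 30.0, 18.0],
        [40.0, 35.0, 25.0],
        [55.0, 25.0, 20.0],
        [45.0, 33.0, 22.0]]

def variation_array(data):
    """Means and sample variances of the simple log-ratios ln(x_i / x_j)."""
    d = len(data[0])
    means = [[0.0] * d for _ in range(d)]
    variances = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            if i != j:
                log_ratios = [math.log(x[i] / x[j]) for x in data]
                means[i][j] = mean(log_ratios)
                variances[i][j] = variance(log_ratios)
    return means, variances

def total_variance(variances):
    """Sum of all pairwise log-ratio variances divided by 2D."""
    d = len(variances)
    return sum(sum(row) for row in variances) / (2 * d)

lr_means, lr_variances = variation_array(data)
totvar = total_variance(lr_variances)
for row in lr_variances:
    print(["%.4f" % v for v in row])
print("total variance: %.4f" % totvar)
```

A small entry in the variance table indicates two parts that vary almost proportionally; the result does not depend on whether the data are closed to 100 or to 1, since only ratios enter the computation.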
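One simple imputation scheme often used for small zeros is multiplicative replacement: each zero is replaced by a small value and the non-zero parts are shrunk by a common factor so that the total is preserved. The sketch below is only an example of that idea, not necessarily the method used in the course; the function name and the value of `delta` are assumptions, and in practice `delta` must be chosen by the analyst (for instance, in relation to a detection limit).

```python
def multiplicative_replacement(x, delta, total=1.0):
    """Replace zeros by `delta` and rescale the non-zero parts.

    `x` is assumed to be closed to `total`; `delta` is a hypothetical small
    value chosen by the analyst.
    """
    n_zero = sum(1 for xi in x if xi == 0)
    shrink = 1.0 - n_zero * delta / total   # common factor for non-zero parts
    return [delta if xi == 0 else xi * shrink for xi in x]

x = [0.6, 0.0, 0.3, 0.1]
replaced = multiplicative_replacement(x, delta=0.005)
print(replaced)
```

Because all non-zero parts are multiplied by the same factor, the ratios between them are unchanged, which keeps their log-ratios intact.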
5.- Processes in the simplex. Linear regression.
Two kinds of processes in the simplex are very often found in applications: the evolution of a composition with respect to an external variable (time, space, temperature, etc.), and the mixing of a set of endmembers (or, better said, the unmixing problem of expressing a composition as proportions of such endmembers). Examples are the exponential growth or decay of a population in time (several species of bacteria, a mixture of radioactive nuclides, etc.), as opposed to the estimation of the mineral composition of a rock from its geochemical composition (where the chemical compositions of the minerals are estimated or fixed). Decay/growth processes may be linear in the simplex, with the Euclidean geometry presented in section 2. Thus, the choice of such a linear process to explain the observed evolution of a compositional phenomenon may be a parsimonious option when evolution is the focus. The simplicity of such models contrasts with the non-linear character of the mixture models when viewed in the geometry of the simplex.
Linear regression of a compositional response on external covariates can be used to identify and estimate the linear evolution of CoDa. This kind of regression is easily carried out using some coordinate representation of the compositional response and conventional statistical packages. Residual analysis is also carried out in the simplex, preferably using some coordinate system. Regression techniques are easily extended to comparisons of the centers (means) of different populations (ANOVA), in a similar way as is done in standard multivariate analysis.
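The statement that decay processes are linear in the simplex can be illustrated with a small computation. In the sketch below, the initial masses and decay constants are made-up values and the function names are hypothetical. Each part decays exponentially at its own rate; the resulting composition at time t is compared with the straight line "x0 perturbed by (p powered to t)", where p is the closed vector of exp(-rate).

```python
import math

def closure(x):
    s = sum(x)
    return [xi / s for xi in x]

def perturb(x, y):
    return closure([xi * yi for xi, yi in zip(x, y)])

def power(x, a):
    return closure([xi ** a for xi in x])

masses0 = [50.0, 30.0, 20.0]   # hypothetical initial masses of three nuclides
rates = [0.10, 0.50, 0.02]     # hypothetical decay constants per unit time

x0 = closure(masses0)                           # initial composition
p = closure([math.exp(-lam) for lam in rates])  # direction of the line

def composition_from_masses(t):
    """Let every part decay in absolute terms, then close."""
    return closure([m * math.exp(-lam * t) for m, lam in zip(masses0, rates)])

def composition_on_line(t):
    """Point of the compositional line through x0 with direction p."""
    return perturb(x0, power(p, t))

for t in [0.0, 1.0, 2.5, 10.0]:
    print(t, composition_from_masses(t), composition_on_line(t))
```

Both computations agree for every t, so the observed compositions lie on a straight line of the simplex even though each percentage follows a curved path when plotted against time.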
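A regression of a compositional response on one covariate can be sketched as an ordinary least-squares fit on each coordinate, followed by a back-transformation of the fitted intercept and slope. Everything in this sketch is an assumption made for illustration: the basis, the helper names, and the synthetic, noise-free data, which are generated from a known line so that the fit can be checked. Real data would leave residuals to be analysed in coordinates.

```python
import math

SQRT2, SQRT6 = math.sqrt(2.0), math.sqrt(6.0)
BASIS = [[1 / SQRT2, -1 / SQRT2, 0.0],
         [1 / SQRT6, 1 / SQRT6, -2 / SQRT6]]

def closure(x):
    s = sum(x)
    return [xi / s for xi in x]

def ilr(x):
    logs = [math.log(xi) for xi in x]
    return [sum(b * v for b, v in zip(vec, logs)) for vec in BASIS]

def ilr_inv(coords):
    clr_vals = [sum(c * vec[j] for c, vec in zip(coords, BASIS)) for j in range(3)]
    return closure([math.exp(v) for v in clr_vals])

def ols(t, y):
    """Simple least squares of y on t; returns (intercept, slope)."""
    n = len(t)
    t_mean, y_mean = sum(t) / n, sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope

# Synthetic response: x(t) = x0 perturbed by (p powered to t), without noise.
x0_true = closure([50.0, 30.0, 20.0])
p_true = closure([1.0, 0.8, 1.5])
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
observed = [closure([a * b ** t for a, b in zip(x0_true, p_true)]) for t in times]

# Fit one ordinary regression per coordinate, then return to the simplex.
coords = [ilr(x) for x in observed]
fits = [ols(times, [z[k] for z in coords]) for k in range(2)]
intercept = ilr_inv([a for a, _ in fits])
slope = ilr_inv([b for _, b in fits])
print("intercept composition:", intercept)
print("slope composition:    ", slope)
```

The fitted intercept and slope are themselves compositions, so they can be reported and plotted in the simplex; the choice of orthonormal basis changes the coordinates but not the back-transformed result.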
6.- CoDa discriminant analysis and logistic-regression.
Most multivariate statistical techniques, such as those presented in this section, can be applied to the coordinates of CoDa. Discriminant techniques are useful when the response data are some pre-defined categories, considered a function of some continuous variables. When these variables in fact form a composition, CoDa-discriminant analysis should be employed. This is formally equivalent to applying standard discriminant techniques (Fisher, linear or quadratic discrimination) to the log-ratio coordinates of the observed components. However, this subject deserves further attention, as the results obtained can be interpreted as objects on the simplex (directions of maximal separation between groups, boundary planes, ellipses, parabolas, etc.) and represented as such. Attention is finally paid to the relationship between the simplicial linear regression of section 5 and the standard multinomial logistic regression, an alternative discrimination method.
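The equivalence with standard discriminant techniques can be sketched for two groups of 3-part compositions: transform to coordinates, then apply Fisher's linear rule with a pooled covariance matrix. The data, group labels, basis and function names below are all hypothetical, and equal prior probabilities are assumed for the two groups.

```python
import math

SQRT2, SQRT6 = math.sqrt(2.0), math.sqrt(6.0)

def ilr(x):
    """Two orthonormal log-ratio coordinates of a 3-part composition."""
    a, b, c = (math.log(v) for v in x)
    return [(a - b) / SQRT2, (a + b - 2 * c) / SQRT6]

def mean_vec(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]

def pooled_cov(groups, means):
    """Pooled within-group covariance matrix (2 x 2)."""
    n = sum(len(g) for g in groups)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for g, m in zip(groups, means):
        for z in g:
            d = [z[0] - m[0], z[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return [[v / (n - len(groups)) for v in row] for row in s]

def fisher_rule(group_a, group_b):
    """Return direction w and threshold c of the linear rule (equal priors)."""
    za, zb = [ilr(x) for x in group_a], [ilr(x) for x in group_b]
    ma, mb = mean_vec(za), mean_vec(zb)
    s = pooled_cov([za, zb], [ma, mb])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    d = [ma[0] - mb[0], ma[1] - mb[1]]
    # w = inverse(s) * d, written out for the 2 x 2 case.
    w = [(s[1][1] * d[0] - s[0][1] * d[1]) / det,
         (-s[1][0] * d[0] + s[0][0] * d[1]) / det]
    c = sum(wk * (a + b) / 2 for wk, a, b in zip(w, ma, mb))
    return w, c

def classify(x, w, c):
    z = ilr(x)
    return "A" if w[0] * z[0] + w[1] * z[1] > c else "B"

# Hypothetical training data, in percent.
group_a = [[60, 30, 10], [55, 30, 15], [65, 25, 10], [58, 32, 10]]
group_b = [[20, 30, 50], [25, 35, 40], [15, 35, 50], [22, 28, 50]]

w, c = fisher_rule(group_a, group_b)
print([classify(x, w, c) for x in group_a + group_b])
print(classify([50, 30, 20], w, c))   # a new, unlabeled composition
```

The boundary `w . z = c` is a straight line in coordinates; mapped back to the simplex it becomes a compositional line, which is how such results can be drawn, for example, in a ternary diagram.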
7.- Introduction to available software.
The lab exercises are based on available software. CoDaPack-3D, built as a macro library for MS-Excel, provides an easy-to-use tool for elementary exploratory analysis of CoDa. It is used in most teaching activities of the course. For more demanding statistical techniques, such as discriminant or cluster analyses and logistic regression, the package "compositions", based on the open-source statistical environment R, is also used.