## 3. Program design and academic results
As was indicated above, three major aspects can be distinguished within the research project. The first of these, the development of methods for the analysis of multivariate data, has the following subareas: (1a) nonlinear optimal scaling methods, (1b) three-way multivariate methods, (1c) clustering and multidimensional scaling, and (1d) regression and classification trees. These subareas are fully described below.
In behavioral science research, categorical data (e.g. SES, gender, and typologies of children) occur frequently, often but not always in combination with ordinal and continuous variables. In addition, research is characterized by a keen interest in the individual, for instance, because many educational decisions are made at the level of the individual child. However, classical multivariate analysis techniques are not really suited to do full justice to the central position the individual has in behavioral science research, because these techniques are primarily geared towards the analysis of relationships between variables, in which subjects are not important individually but are considered exchangeable. Classical multivariate techniques do not handle large numbers of categorical variables well, especially not in combination with other data types. For categorical data, the classical techniques only allow relatively simple analyses, and they can only handle a few categorical variables at a time. A major aim of the research programme "Data Theory for the Multivariate Analysis of Individual and Group Differences" is to adapt classical statistical methods in various ways to suit the particular characteristics of research in the educational sciences, psychology, political science, and market research. These adaptations have also proven to be very useful in other areas such as biology, medicine, and other life sciences, and their application areas of genomics, proteomics, and systems biology. These new techniques allow a balanced treatment of individuals and variables in the analysis of multivariate data (subject-oriented multivariate analysis). Individuals are not regarded merely as a replication factor to derive association measures such as correlation or covariance, but are considered important entities in the analysis as well.
In addition, the research programme aims to develop techniques that are suitable for investigating complex relationships between categorical variables. The major tool in our analysis of categorical data is the so-called optimal scaling process, in which the original qualitative variables are transformed during the analysis into quantitative variables. Consequently, the analysis will give quantitative results. Another characteristic of the data-analytic techniques developed within the project is that they aim to represent complex high-dimensional information in low-dimensional graphs, which can be inspected visually and are more easily understood. In addition to the methodological innovations described above, an explicit aim of the programme is to build user-friendly computer programs for the statistical techniques that have been developed. This requires the co-ordination and supervision of software development, and depends on up-to-date knowledge of efficient algorithm construction. Our computer programs have been made available both via SPSS and as independently distributed software. This part of the programme is further described in 3.3. The commitment of the programme to the practice of research in the behavioral sciences and other disciplines has led to special efforts in disseminating the work of the programme via the development and publication of exemplary applications, which are described in 3.4.
Also, various chapters in handbooks, encyclopedias, and other review media were written by members of the research programme, such as the paper on matrix factorization (Hubert, Meulman, & Heiser, 2000) in SIAM Review, the section on dynamic programming in clustering (Hubert, Arabie, & Meulman, 2001) in the Encyclopedia of Optimization, the sections on principal component analysis (Kroonenberg, 2003) and the biplot (Kroonenberg, 2003) in the Encyclopedia of Social Science Research Methods, the chapter on nonlinear principal components analysis (Meulman, Van der Kooij, & Heiser, 2003) in the Handbook of Quantitative Methodology for the Social Sciences, the chapter on exploratory longitudinal analysis with three-mode component models (Kroonenberg, 2004) in Analyse Statistique des Données Longitudinales, and the chapter on correspondence analysis in the Encyclopedia of Statistical Science, 2nd Edition (Kroonenberg, 2005). Also, members of the research group were co-editors of two books that give the state of the art of the field: Data Analysis, Classification, and Related Methods (Kiers, Rasson, Groenen, & Schader, 2000) and New Developments in Psychometrics (Yanai, Okada, Shigemasu, Kano, & Meulman, 2003).
In the past five years, our efforts to remedy the deficiencies in the way classical multivariate analysis methods treat individual subjects and categorical data, have been focused on four major domains of statistical methodology: (1a) Nonlinear optimal scaling methods, (1b) Three-way multivariate methods, (1c) Clustering and multidimensional scaling, (1d) Regression and classification trees. As indicated above, multivariate categorical data analysis is characterized by the use of nonlinear optimal scaling techniques (1a) which allow for quantitative analyses of qualitative data. Within this field, all classical multivariate techniques, such as regression analysis and principal components analysis, are generalized to make them suitable for the analysis of categorical data. Three-way methods (1b) deal with complex data types in which subjects, variables and points in time each make up one of the ways of the data. One of the characteristics of such data is that they do full justice to individual differences, rather than treating the individuals stochastically, as is usual in classical statistical procedures. So far, optimal scaling has not been introduced in this part of the project but this is one of the aims in the years to come. Clustering and multidimensional scaling (1c) involves a class of techniques which seek to portray the similarities and dissimilarities between subjects. Multidimensional scaling works in a low-dimensional Euclidean space so that the relationships between subjects can be examined both visually and numerically. Clustering usually involves other representations such as hierarchical trees. Our work in this area makes heavy use of optimal scaling techniques to allow the inclusion of ordinal similarities. A special focus in this part of the project is combinatorial data analysis, formed by the use of dynamic programming to find globally optimal solutions. 
Regression and classification trees (1d) is a collection of techniques which try to predict an outcome variable (for instance, a measured attribute) from a number of predictor variables by finding subsequent binary splits that explain differences in the outcome as well as possible. The collection of binary splits can be represented as a tree. The outcome variable can also involve a typology of the subjects in the analysis, and in that case we use the term classification.
As described above, the class of nonlinear methods focuses on data that are non-numerical, with measurements recorded on scales having an uncertain unit of measurement. Data would typically consist of qualitative or categorical variables that describe the subjects (objects, children, consumers, patients, options) in a limited number of categories. The zero point of these qualitative scales is uncertain, and although frequently it can be assumed that the categories are ordered, their mutual distances are unknown. An important development in multidimensional data analysis has been the optimal assignment of quantitative values to such qualitative scales. This form of optimal quantification (scaling, scoring) is a very general approach to treat multivariate (categorical) data. The optimal scaling principle has been applied to a variety of multivariate data analysis techniques such as multiple regression, canonical correlation analysis, discriminant analysis, and principal components analysis. Optimal scaling also includes the techniques of correspondence and multiple correspondence analysis. There is a large emphasis on graphical representation of the results; specifically, on the joint representation of subjects and variables in a common low-dimensional Euclidean space. These representations are usually called biplots, and they visualize the complex relationships between subjects and variables, groups and variables, and subjects and groups. Transformation plots show the relation between the original qualitative variable and the resulting optimally scaled outcome. Because the observations collected on subjects will usually be qualitative (categorical), or mixtures of qualitative and quantitative data, only very weak assumptions are made in the models that are fitted to the data. There are no distributional assumptions, such as normality or independence of error terms, which are often unreasonable for the data we will be concerned with.
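To illustrate the optimal quantification step for an ordered categorical variable, the sketch below implements monotone (isotonic) regression via the pool-adjacent-violators algorithm, a standard building block for ordinal transformations in optimal scaling. This is a minimal illustration under unit weights, not the programme's actual software; the function name and interface are hypothetical.

```python
def pav(values):
    """Least-squares monotone (non-decreasing) fit via pool-adjacent-violators.

    Each block stores [mean, count]; adjacent blocks are merged whenever
    their means violate the required non-decreasing order.
    """
    blocks = []  # list of [mean, count]
    for v in values:
        blocks.append([float(v), 1])
        # Merge backwards while the ordering is violated
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, n2 = blocks.pop()
            m1, n1 = blocks.pop()
            blocks.append([(m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2])
    # Expand block means back to one fitted value per input
    fitted = []
    for m, n in blocks:
        fitted.extend([m] * n)
    return fitted
```

For example, `pav([1, 3, 2, 4])` pools the violating pair 3, 2 into their mean, yielding the monotone sequence `[1.0, 2.5, 2.5, 4.0]`; in an ordinal optimal scaling step, such a fit would supply the category quantifications.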
Although no traditional statistical framework is available for these types of techniques, the procedures are embedded in a computational framework for significance testing and stability analysis that uses resampling methods such as bootstrapping, cross-validation, and permutation tests. In the past five years, the programme has studied various aspects of nonlinear optimal scaling techniques. Among these was the study of the occurrence of local minima in nonlinear categorical regression (Van der Kooij, Heiser, & Meulman, 2004). This paper investigates the cause of local minima in the Leiden CATREG algorithm (and some similar algorithms that have been published in the literature). Also studied were the prevalence and severity of local minima and the data and analysis conditions under which they occur, and a strategy was provided to obtain the global minimum. Because the latter could be very time-consuming for data sets with a large number of predictors, several (less time-consuming) strategies that are very likely to obtain the global minimum were proposed as well. In Meulman (2003), a novel approach to nonlinear categorical regression was proposed through the use of so-called additive prediction components. A prediction component is a linear combination of the predictors, as is standard in regression. The use of additive prediction components (additive linear combinations) is similar to so-called boosting (Friedman, 2001; Hastie, Tibshirani, & Friedman, 2001). The regression problem is partitioned into a number of steps. In each step, a nonlinear regression problem is solved suboptimally, where the next step is based on the residuals of the previous step. Because suboptimal solutions are sufficient, each consecutive step needs only a few iterations to find the nonlinear transformations (instead of very many iterations). For the actual prediction, the prediction components are combined.
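The stagewise residual-fitting scheme can be sketched as follows, simplifying each component to a shrunken one-predictor linear fit (the actual method fits nonlinear transformations); all names and the shrinkage value are illustrative assumptions, not the published algorithm.

```python
def simple_fit(x, y):
    """Least-squares slope and intercept for one (non-constant) predictor."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return b, my - b * mx

def stagewise(X, y, steps=10, shrink=0.5):
    """Additive prediction components: each step fits one deliberately
    suboptimal (shrunken) component to the current residuals; the
    components are summed for prediction, as in boosting-style
    stagewise regression. X is a list of predictor columns."""
    residual = list(y)
    components = []  # (predictor index, slope, intercept)
    for _ in range(steps):
        best = None
        for j, col in enumerate(X):
            b, a = simple_fit(col, residual)
            sse = sum((r - (a + b * xi)) ** 2 for xi, r in zip(col, residual))
            if best is None or sse < best[0]:
                best = (sse, j, b, a)
        _, j, b, a = best
        components.append((j, shrink * b, shrink * a))
        residual = [r - (shrink * a + shrink * b * xi)
                    for xi, r in zip(X[j], residual)]
    return components

def predict(components, X, i):
    """Combine all fitted components for observation i."""
    return sum(a + b * X[j][i] for j, b, a in components)
```

Each pass solves its step only approximately (here via shrinkage), and the next pass works on what remains unexplained, mirroring the residual-based partitioning described above.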
Current research in nonlinear categorical regression involves the further evaluation of its predictive accuracy. Predictive accuracy concerns the power of the method to predict future observations using regression weights obtained from current observations only. A large study was performed (Van der Kooij, Meulman, & Heiser, 2005, in press) with the so-called .632 bootstrap to assess the prediction error for different optimal scaling levels (nominal, ordinal, nonmonotonic and monotonic splines), and the results were compared to similar studies with other multiple regression techniques that involve transformation of variables. Further demonstration of the usefulness of the bootstrap can be found in Linting, Meulman, Groenen, & Van der Kooij (2004, submitted), where the balanced bootstrap is used to establish the stability of the results from a principal components analysis of an educational data set in terms of the eigenvalues, the component loadings, the scores for the persons, and the optimal scaling transformations. By using 1000 samples with replacement from the original sample, it was established that nonlinear categorical principal components analysis solutions can be as stable as standard linear PCA results. The significance of the parameters from linear multivariate analysis methods is usually determined in terms of p-values and effect sizes. Because traditional p-values are based on the assumption of (multivariate) normal distributions, they are not applicable in the context of nonlinear optimal scaling methods, where distributions are often skewed. Therefore, p-values are determined empirically, by using permutation tests. In this procedure, the correlational structure of the original data set is destroyed by random permutation of the observations within the variables, and the analysis is performed on the permuted data set.
This procedure is repeated a large number of times, and a p-value for a parameter is established by calculating the percentage of values from permuted data sets that are equal to or more extreme than the originally found value. Other topics that were studied in the area of nonlinear optimal scaling methods include alternating nonnegative least squares (Groenen, Van Os, & Meulman, 2000), properties of distance-based multivariate analysis (Groenen & Meulman, 2004), classification and nonlinear discriminant analysis (Meulman, 2000), and classification using nonlinear categorical principal components analysis (Meulman, Van der Kooij, & Babinec, 2002). Another special development with respect to categorical regression relates to the use of correspondence analysis of nonsymmetric data for this purpose (Lombardo, Kroonenberg, & D'Ambra, 2000). Current and future research focuses on regularization methods in nonlinear categorical regression to accommodate large data sets that may be either long or wide. In wide data sets, the number of variables is very large compared to the number of observations. Such data are very frequent in biological psychology (gene-environment interaction), and can be studied in relation to attachment (Bakermans-Kranenburg, 2005). Other typical examples of such new types of data are found in the life sciences, with applications in genomics (the study of genes and their functions), proteomics (the study of proteins and their functions), and metabolomics (the study of metabolic profiles of cells), recently subsumed under the name systems biology (giving micro-array data, NMR mass spectrometry data, etc.). Because there are more variables than subjects, the correlation matrix of the predictors is rank-deficient, and cannot directly be used in regression procedures. To deal with this problem, so-called regularization methods are required, which approximate the original measurements with smooth functions that can be used in regression.
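The permutation procedure described above can be sketched for a single statistic, here the Pearson correlation between two variables; a minimal illustration with hypothetical function names, not the programme's software.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Empirical two-sided p-value: permute y to destroy the association,
    recompute the statistic, and count permuted values at least as
    extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    extreme = 0
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson(x, y_perm)) >= observed:
            extreme += 1
    # Count the observed data as one permutation (standard correction)
    return (extreme + 1) / (n_perm + 1)
```

The `(extreme + 1) / (n_perm + 1)` convention guarantees a p-value that is never exactly zero, which matches the "equal to or more extreme" counting described in the text.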
Long data sets are especially found in applications in data mining, one of the major areas that use predictive modeling (for data mining in the behavioral sciences, see Dusseldorp & Meulman, 2001). Very large data sets call for so-called scalable algorithms: computing times with such algorithms are largely independent of the sample size, so that very large numbers of observations can be analyzed within a reasonable amount of time. Another approach to the analysis of long data sets is the further use of the so-called additive prediction components that were proposed in Meulman (2003). Since the regression problem is partitioned into a number of steps, and in each step a nonlinear regression problem is solved suboptimally, this is a form of regularization as well. If this approach fulfills its promise, the result would be a fast procedure to perform multiple regression with nonlinear transformations.
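For the wide data sets discussed above, where predictors outnumber observations and the predictor correlation matrix is rank-deficient, one standard regularization device is ridge penalization. The sketch below is an assumption of this text (not necessarily the regularization used in the programme): it minimizes a penalized least-squares loss by gradient descent, which remains well-defined even when ordinary least squares breaks down.

```python
def ridge_gd(X, y, lam=1.0, lr=0.01, steps=2000):
    """Minimize sum((y - Xb)^2) + lam * sum(b^2) by gradient descent.

    X is a list of observation rows. The penalty lam * ||b||^2 keeps the
    problem well-posed even when columns outnumber rows, i.e. when X'X
    is rank-deficient and unpenalized regression has no unique solution.
    """
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(steps):
        # residuals r_i = y_i - x_i . b
        r = [yi - sum(xi[j] * b[j] for j in range(p)) for xi, yi in zip(X, y)]
        for j in range(p):
            grad = -2.0 * sum(X[i][j] * r[i] for i in range(n)) + 2.0 * lam * b[j]
            b[j] -= lr * grad
    return b
```

With two observations and three predictors the unpenalized problem has infinitely many exact solutions, but the ridge objective selects a unique shrunken coefficient vector.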
In the past five years, further development of three-way methods has taken place in the areas of three-mode correspondence analysis and its place among other techniques for three-way contingency tables (Kroonenberg & Anderson, in press). Another focus was on exploratory and confirmatory three-mode techniques for the analysis of multimode covariance matrices (Kroonenberg & Oort, 2003). In addition, theoretical work has been carried out in the adaptation of generalized Procrustes analysis, and a special Procrustes analysis algorithm was tailor-made to optimize the alignment of molecules (Kroonenberg, Dunn, & Commandeur, 2003; Commandeur, Kroonenberg, & Dunn, 2004). In addition, research on interaction was continued, which was published in Gilbert, Sutherland, & Kroonenberg (2000), De la Vega, Hall, & Kroonenberg (2002), and De Rooij & Kroonenberg (2003). Research reported in the latter paper focused on the analysis of dyadic sequential interaction data. Reviews of three-way methods were given in Kroonenberg (2001), focusing on examples, and in Kroonenberg (2004), focusing on theory. Special attention to individual differences was given in Murakami & Kroonenberg (2003). The software development that will be described in more detail below went hand-in-hand with studies into the applicability of three-way methods in various disciplines. For example, a major problem in medicinal pharmacy is the simultaneous alignment of similar molecules, and it was shown that this could be realized via the specially designed variant of Procrustes analysis mentioned above. Another example is a three-way analysis of data from a cancer registry, which was used to demonstrate how to evaluate in one analysis varying trends for several types of cancers across a larger number of nations. Other exemplary applications were worked out in diverse fields like semantic differential research, nursing, subjective music appreciation and agriculture. 
As was mentioned above, a major effort in the domain of three-way methods is being undertaken by Kroonenberg to bring together all current knowledge, expertise, and experience in a new book, as a successor to his successful 1983 book on the subject, which has been cited almost 190 times.
The optimal scaling principle that was described under the nonlinear optimal scaling subprogram has also been applied to a variety of techniques that can be subsumed under the name "proximity analysis", which includes clustering, multidimensional scaling, and multidimensional unfolding. In clustering and multidimensional scaling, the relationship between persons is measured in the form of proximities, which are measures that assess the similarities or dissimilarities between persons. To evaluate these proximities, visual representations are made: in clustering this is often in the form of a dendrogram (tree), and in multidimensional scaling it is a map in a low-dimensional space, where distances on the map are used to represent the proximities. In multidimensional unfolding, we deal with proximities between persons and objects; for example, the proximities are preference scores that are given by the persons to a number of different objects. Special topics that were studied within the group are fuzzy clustering with Minkowski distances (Groenen & Jajuga, 2001) and model-based clustering (Bensmail & Meulman, 2003). A very special problem that was solved was the problem of clustering subjects on subsets of variables (Friedman & Meulman, 2004; Meulman, 2003). The motivation for clustering subjects on subsets of attributes (COSA) was given by consideration of data of a very special kind, i.e. data sets where the number of attributes (in the columns) is much larger than the number of subjects (in the rows). An obvious application is in genomics, where we deal with gene expression data in micro-arrays, with very many genes (say, 1,500-40,000) and very few subjects (say, 20-250). When we have a large number of attributes, subjects are very unlikely to cluster on all, or even a large number, of them.
Indeed, subjects might cluster on some attributes and be far apart on all others, and our task is to find that (unique) set of attributes that a particular group of subjects is clustering on. A common data analysis approach in systems biology is to cluster the attributes first, and only after having reduced the original many-attribute data set to a much smaller one, to try to cluster the subjects. The problem here, of course, is that we would like to select those attributes that discriminate most among the subjects (so we have to do this while regarding all attributes multivariately), and it is usually not good enough to inspect each attribute univariately, because important clustering attributes usually have no influence as single agents, but do in a (small) group. Therefore, two tasks have to be carried out simultaneously: cluster the subjects into homogeneous groups, while selecting different subsets of variables (one for each group of subjects). The attribute subset for any discovered group may overlap completely, partially, or not at all with those for other groups. To limit the search space, and to focus on particular aspects of the data, COSA (Friedman & Meulman, 2004) incorporates targeting. For example, we might seek clusters of samples that have preferentially high or low expression values on subsets of genes. To avoid local optima, it is shown in Friedman and Meulman (2004) that we need to start with the inverse exponential mean (rather than the arithmetic mean) of the separate attribute distances. By using a homotopy strategy, the algorithm creates a smooth transition from the inverse exponential distance to the mean of the ordinary Euclidean distances over attributes. Most work in this subprogram has a very strong technical component, because very hard optimization problems have to be dealt with. A specific class of optimization strategies in multidimensional scaling is reviewed in Groenen & Heiser (2000) and Groenen, Mathar, & Trejos (2000).
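The inverse exponential mean of the separate attribute distances can be written as D = -λ log((1/p) Σ_k exp(-d_k / λ)). The sketch below illustrates its key property: for small λ it is dominated by the attributes on which two subjects are closest, and as λ grows it approaches the arithmetic mean, which is the smooth transition exploited by the homotopy strategy. Function and parameter names are illustrative, not COSA's implementation.

```python
import math

def inverse_exponential_mean(dists, lam):
    """Inverse exponential mean of per-attribute distances:

        D = -lam * log(mean(exp(-d_k / lam)))

    Small lam emphasizes the attributes on which two subjects are
    closest (D approaches min(dists)); as lam grows, D approaches the
    ordinary arithmetic mean of the attribute distances.
    """
    p = len(dists)
    return -lam * math.log(sum(math.exp(-d / lam) for d in dists) / p)
```

A homotopy over λ, starting small and increasing it, would therefore move the combined distance smoothly from a "best-matching-attributes" measure toward the plain mean over all attributes, as described in the text.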
A major class of optimization problems can be subsumed under the name combinatorial optimization, and it is in this area that major results were obtained by implementing the so-called dynamic programming approach. Apart from special papers that focused on ultrametric and additive tree representations (Hubert, Arabie, & Meulman, 2001), one-dimensional scaling (Hubert, Arabie, & Meulman, 2002), and partitioning/clustering (Van Os & Meulman, 2004), two monographs were published within the research group: the SIAM book by Hubert, Arabie, and Meulman (2001) and the PhD thesis by Van Os (2001). Combinatorial data analysis includes problems such as classification, clustering, and ordering problems. A naive view of these problems sees them as relatively simple, basically because classical strategies are based upon suboptimal procedures. In sharp contrast, research in this subprogram concentrates on developing optimal procedures, formulating these problems as mathematical programs with an overall loss function to be minimized with respect to all possible solutions. The Hubert, Arabie, and Meulman (2001) book shows how optimal solutions can be guaranteed through the use of dynamic programming (DP). The use of dynamic programming for partitioning has been extensively studied in Van Os (2001). During this research, many improvements have been made to both the formulation of DP for partitioning and its implementation, leading to an overall approach that is vastly faster than the original one in Hubert, Arabie, and Meulman (2001): problems can now be solved in minutes where previously they took weeks. In Van Os and Meulman (2004) it was shown that most classical clustering problems can be reformulated as optimization problems and solved with DP, including K-means clustering and single-link and complete-link clustering.
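The flavor of the DP optimality guarantee can be illustrated on a simple textbook case: partitioning an ordered sequence into k contiguous clusters so that the total within-cluster sum of squares is globally minimal. This is an illustrative instance of DP partitioning under an ordering assumption, not the research group's formulation or implementation.

```python
def dp_partition(values, k):
    """Globally optimal partition of an ordered sequence into k contiguous
    clusters, minimizing total within-cluster sum of squares (O(k*n^2) DP).

    Returns (total cost, list of (start, end) index pairs, end exclusive).
    """
    n = len(values)
    # Prefix sums give O(1) within-cluster SSE for any segment values[i:j]
    ps = [0.0] * (n + 1)
    ps2 = [0.0] * (n + 1)
    for i, v in enumerate(values):
        ps[i + 1] = ps[i] + v
        ps2[i + 1] = ps2[i] + v * v

    def sse(i, j):
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):  # last cluster is values[i:j]
                cand = cost[c - 1][i] + sse(i, j)
                if cand < cost[c][j]:
                    cost[c][j] = cand
                    back[c][j] = i
    # Recover cluster boundaries by walking the backpointers
    bounds, j = [], n
    for c in range(k, 0, -1):
        i = back[c][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return cost[k][n], bounds
```

Because every admissible placement of the last cluster boundary is examined against the optimal cost of the remaining prefix, the minimum is global, in contrast with the suboptimal classical heuristics mentioned above.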
Due to inherent size limitations, the approach is especially suited for situations where the number of subjects to be clustered is small, such as in cognitive psychology or microbiology. Nonetheless, new research resulted in a reformulation of the algorithm and a new implementation that speeds up the largest problems that can be handled by a factor of more than five, such that modern computer architecture can now be fully exploited (Van Os, 2004). The same is true for hierarchical clustering methods: the research programme has led to an innovative new formulation that can handle most hierarchical clustering problems in a fashion similar to that for partitioning, such that larger problems can be handled, and much faster (Van Os & Meulman, submitted). Current and future research in the DP area concentrates on the following two extensions. Because the DP approach is inherently limited to relatively small problems, some alternative heuristic varieties of DP have been developed that use DP within the framework of genetic algorithms. Although global optimality can no longer be guaranteed, it has been shown that these algorithms very often do succeed in finding the optimal solution (Van Os, 2001). The field of operations research has formulated several alternatives that are also promising, but these are currently ignored within the fields of data analysis and statistics. Therefore, such heuristic algorithms will be further developed for various clustering problems. Other research concerns the problem of finding optimal non-exhaustive partitionings through the use of DP; in these clustering problems, outlier subjects (being very different from any of the groups that can be identified) are set aside, and the groupings are obtained only for subjects that fit into the overall solution, reducing the overall influence of these outliers.
Regression and classification trees originate from the seminal work of Breiman, Friedman, Olshen, and Stone (1984), and the name stands for a collection of techniques that try to predict a response or outcome variable from a number of predictor variables by finding subsequent binary splits that explain differences in the outcome as well as possible. The collection of binary splits can be represented as a tree. When the outcome variable is a continuous, measured attribute, we use the term regression tree; when the outcome variable is categorical, giving a grouping of the individuals, we use the term classification tree. Regression and classification trees (sometimes collectively called decision trees) have many attractive properties (which they share, in fact, with optimal scaling techniques): the data do not require ad-hoc transformations beforehand, missing data are elegantly treated in the overall analysis approach, mixtures of numeric, ordinal, and nominal data can be analyzed, and no distributional assumptions about the data or the error terms need to be made. The research in this area within the Data Theory programme has two important components. The first concerns the development of methodology for the study of individual differences between persons with respect to their response to a particular kind of treatment. These differences are very important because they explain why we sometimes do not observe any treatment effect in the experimental group compared to the control group. A new data analysis strategy was developed to estimate interaction effects, the so-called Regression Trunk Approach (RTA; Dusseldorp, 2001; Dusseldorp & Meulman, 2001; Dusseldorp & Meulman, 2004). This approach makes it possible to estimate a radically new type of regression model, namely models that include threshold interactions (instead of cross-product interactions).
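The greedy step at the heart of tree growing, finding the single best binary split, can be sketched as follows for one numeric predictor and a least-squares criterion. This is a minimal illustration of the generic CART-style split search; names are hypothetical.

```python
def best_split(x, y):
    """Find the binary split x <= threshold minimizing the summed
    within-node sum of squares of y (the greedy step of growing a
    regression tree). Returns (loss, threshold)."""
    pairs = sorted(zip(x, y))
    n = len(pairs)

    def sse(ys):
        if not ys:
            return 0.0
        m = sum(ys) / len(ys)
        return sum((v - m) ** 2 for v in ys)

    best = (float("inf"), None)
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between equal x values
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        total = sse(left) + sse(right)
        if total < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (total, threshold)
    return best
```

A full tree grower would apply this search over all predictors at every node and recurse on the two resulting subsets; the threshold form `x <= c` is also exactly the kind of term a threshold-interaction model adds to a linear predictor.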
The methodological tools are based on the use of regression and classification trees to detect interaction between independent predictors and covariates. In November 2002, Dusseldorp started the VENI project "Modeling interactions as small trees in regression and classification". This project is an innovative extension of RTA, since the methodology is extended into a complete analysis strategy for nominal-by-continuous variable interactions. The initial three-step approach has been reduced to a one-step approach, in which linear main effects and interaction effects are estimated simultaneously (in collaboration with Conversano, University of Cassino, Italy). Furthermore, RTA has been applied to estimate the differential effectiveness of cognitive therapy for panic disorder patients (Dusseldorp, Spinhoven, & Bakker, submitted). Future research intends to investigate in which situations, relevant to the behavioral sciences, a threshold interaction is more appropriate, and in which situations a product interaction; and to explore the generalization of RTA to classification problems. The second component is the development of globally optimal decision trees. The research on globally optimal hierarchical clustering through dynamic programming that was described above has also led to an innovation in the area of classification and regression trees. Traditionally, such trees are grown using a greedy algorithm that finds the best splits sequentially, without looking ahead to the complete tree to be fitted. The resulting tree will be suboptimal, and it has been shown repeatedly in the literature that such trees are not very accurate, which means that they lack predictive power for future observations. Although it is easy to formulate globally optimal trees as a mathematical problem that optimizes an overall loss function, solutions to this mathematical problem had never been formulated until this was done in Van Os (2001).
Thus the current research programme has succeeded in finding a DP formulation, and for the first time globally optimal trees have been constructed for some real data sets. Since classical decision trees lack predictive accuracy, they were replaced by linear combinations of small suboptimal trees in MART (multiple additive regression trees; Friedman, 2001, 2002; see Friedman & Meulman (2003) for an application in epidemiology). Current research within the Data Theory programme will include optimal small trees in the multiple additive regression trees approach. A major effort of the project has been the inclusion of the nonlinear optimal scaling software developed in Leiden in the CATEGORIES module of the world-wide distributed software package SPSS, Chicago (see http://www.spss.com/categories/). In 1997, a major step was taken with the integration of the procedures developed in Leiden with the award-winning "pivot table" technology developed by SPSS. Subsequently, the original module was replaced with a series of new programs, including CATREG (for nonlinear categorical regression), CORRESPONDENCE (for correspondence analysis, including constraints and more general biplots), CATPCA (a generalization of principal components analysis for multivariate ordinal and/or nominal data that also includes multiple correspondence analysis), and OVERALS (for generalized canonical correlation analysis). In 2004, a new version of homogeneity analysis or multiple correspondence analysis (MULTIPLE CORRESPONDENCE) was added to the module. CATEGORIES also includes a state-of-the-art program for multidimensional scaling (PROXSCAL), encompassing all the scaling models known from the literature, including several three-way methods, and generalizing these models substantively by the use of external information on the subjects, in the form of special restrictions on the configuration and the fitting of supplementary variables into a predefined subject space.
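The distance-fitting idea behind least-squares multidimensional scaling can be sketched by minimizing the raw stress Σ (d_ij(X) - δ_ij)², here with plain gradient descent for brevity; PROXSCAL itself uses iterative majorization, and all names and parameters below are illustrative.

```python
import math
import random

def mds(delta, dim=2, lr=0.05, steps=1000, seed=0):
    """Least-squares MDS: place n points so that their pairwise Euclidean
    distances approximate the dissimilarities in delta, by gradient
    descent on the raw stress sum((d_ij - delta_ij)^2)."""
    n = len(delta)
    rng = random.Random(seed)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n)]
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                d = math.dist(X[i], X[j]) or 1e-12  # guard zero distance
                coef = 2.0 * (d - delta[i][j]) / d
                for a in range(dim):
                    g = coef * (X[i][a] - X[j][a])
                    grad[i][a] += g
                    grad[j][a] -= g
        for i in range(n):
            for a in range(dim):
                X[i][a] -= lr * grad[i][a]
    return X

def stress(X, delta):
    """Raw stress of configuration X against dissimilarities delta."""
    n = len(X)
    return sum((math.dist(X[i], X[j]) - delta[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))
```

For three points with all pairwise dissimilarities equal to 1, the procedure recovers an (approximately) equilateral triangle in the plane; restrictions on the configuration, as in PROXSCAL, would be imposed on X during the update step.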
In 2005, PROXSCAL will be joined by PREFSCAL, a similar program that performs multidimensional unfolding analysis. Apart from the module CATEGORIES in SPSS, a specialist suite of programs has been developed for the analysis of three-way data, 3WAYPACK (Kroonenberg, 1997a, 1997b). Several different models can be handled by the package, such as three-mode principal component analysis, parallel factor analysis, and three-way correspondence analysis. In addition, various supplementary procedures have been included to allow further processing of the results of a three-way analysis, such as rotation of components, analysis of residuals, and plotting of the results. A new version of the program suite was released in 2000. Partly as a result of new developments in three-way methods (as described above), the three-way program suite 3WAYPACK has undergone a major overhaul, by updating the source code and adding several new programs to accommodate new and already existing three-way procedures not yet included in the suite, such as simultaneous component analysis, Procrustes analysis, three-mode PCA with missing data, and three-mode cluster analysis. In addition, major upgrades with respect to the user-friendliness of the output and the inclusion of graphical procedures were realized. The goal of the research in the programme "Data Theory for the Multivariate Analysis of Individual and Group Differences" is to develop advanced data-analytic techniques that meet the requirements of research in the behavioral and social sciences, and of applications beyond those disciplines. The development and adaptation of data-analytic techniques is usually a product of close co-operation between empirical researchers and methodologists, the latter either giving methodological advice or collaborating in empirical research that calls for an innovative methodological approach.
This research often creates innovations that can be used in other fields of research as well, and it explains why exemplary applications have not only been developed within the behavioral and social sciences, but also in other disciplines.
