This paper describes the first step towards building quantitative structure-property
relationship (QSPR) models for water treatment processes: the systematic selection of a
group of representative micropollutants to serve as a training set. A well-established
selection strategy was applied, which combined principal component analysis (PCA) and
statistical experimental design. In
this research, the initial dataset contained 183 micropollutants, mostly emerging
contaminants, selected from the peer-reviewed literature. Each compound was
characterized by 858 molecular descriptors (i.e., the variables used in QSPR
modeling). This resulted in a large, complex multivariate dataset, to which PCA was
applied to summarize the information in the form of principal components. The first
four principal components, which captured 62.9% of the variation in the initial dataset,
were used to select representative compounds using a D-optimal onion design
approach. Using this design, 22 substances were selected as structurally representative
compounds that covered the chemical domain (i.e., the chemical characteristics
of all compounds) in a well-balanced manner and captured the majority of the
information. The systematic selection approach employed here ensures that future
QSPR models are applicable to a wide range of chemicals as long as their
characteristics fall within the original chemical domain. Includes 15 references, tables, figure.
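
The selection workflow summarized above (PCA on the descriptor matrix, followed by a layered D-optimal selection in score space) can be illustrated with a minimal sketch. It is not the authors' implementation: the descriptor matrix is synthetic random data, the function names (`pca_scores`, `greedy_d_optimal`, `onion_select`) are hypothetical, the layers are formed from simple distance quantiles, and a greedy determinant search stands in for a full D-optimal exchange algorithm. The number of layers and picks per layer are arbitrary and are not meant to reproduce the 22 compounds reported.

```python
import numpy as np

def pca_scores(X, n_components):
    """Autoscale X, then return PCA scores and explained-variance ratios."""
    Xc = X - X.mean(axis=0)
    std = Xc.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # guard against constant descriptors
    Z = Xc / std
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    ratios = s**2 / np.sum(s**2)
    return U[:, :n_components] * s[:n_components], ratios[:n_components]

def greedy_d_optimal(T, candidates, k, ridge=1e-6):
    """Greedily pick k candidate rows that maximize log det(M'M)
    for a linear model with intercept (assumed simplification)."""
    M = np.column_stack([np.ones(len(T)), T])
    info = ridge * np.eye(M.shape[1])  # small ridge keeps det > 0
    remaining = list(candidates)
    chosen = []
    for _ in range(min(k, len(remaining))):
        best = max(
            remaining,
            key=lambda i: np.linalg.slogdet(info + np.outer(M[i], M[i]))[1],
        )
        chosen.append(best)
        remaining.remove(best)
        info += np.outer(M[best], M[best])
    return chosen

def onion_select(T, n_layers, per_layer):
    """Split score space into concentric layers by distance from the
    origin, then run the greedy selection inside each layer."""
    dist = np.linalg.norm(T, axis=1)
    edges = np.quantile(dist, np.linspace(0, 1, n_layers + 1))
    layer = np.clip(
        np.searchsorted(edges, dist, side="right") - 1, 0, n_layers - 1
    )
    selected = []
    for l in range(n_layers):
        idx = np.flatnonzero(layer == l)
        selected.extend(greedy_d_optimal(T, idx, per_layer))
    return sorted(int(i) for i in selected)

# Synthetic stand-in for a compounds-by-descriptors matrix (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(183, 50))

scores, ratios = pca_scores(X, n_components=4)
picked = onion_select(scores, n_layers=3, per_layer=7)
print("cumulative explained variance:", ratios.sum())
print("selected row indices:", picked)
```

Selecting within layers rather than over the whole score space is what keeps the chosen compounds from clustering at the extremes of the domain, which a plain D-optimal design tends to favor.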