

|
putting the focus on ADMET properties |

|
ChemSilico Methods |
| Data Selection: Data selection is difficult to define in a generic sense, since several data selection techniques were employed to analyze the many different datasets used to construct ChemSilico predictors. However, generally after removal of duplicated compounds within large datasets,
The removal process does not completely eliminate inaccuracies or inconsistencies in reported values. That is not feasible. Nonetheless, the resultant dataset is well described by the set of descriptors, which are used as inputs to the neural net for the training/test process. |
| Neural Net Analysis (NNA): Artificial neural networks (ANN) have the astonishing ability to ferret out nonlinear dependencies among input variables and compound parameters that classical statistical methods or multivariate linear analysis cannot detect. Both linear and nonlinear relationships can be established for a large dataset of compounds using neural network analysis. There are many different approaches to ANN, the most common being back-propagation, which is used in ChemSilico QSAR predictor modeling. The methodology for constructing robust NN QSAR models is to rank descriptors, reduce their number, and maximize the number of compounds per neural network weight, while achieving a maximal R2 value (a measure of the goodness of fit) for the training set and a maximal Q2 value (a measure of the goodness of prediction) for the validation sets, without overfitting the data. Proprietary systems were used to "prune" the molecular descriptor set.
As the NN QSAR models are developed, they are continually tested for predictive accuracy. This is accomplished using a portion of the data, randomly selected as a "withheld set", which was not used to select the final algorithm. With each iteration the least important or irrelevant inputs are removed, so that only the fittest survive this exhaustive process. The pruning is iterative and reduces the inputs to an essential set that must perform well on the "withheld set" before it moves on to the final testing phase. In this manner, the number of inputs is reduced from 515 to a more manageable value between 7 and 70. The variable reduction occurs with high confidence. The total number of molecular indices depends on the size of the dataset. Ultimately, the emphasis in producing the final predictive QSAR model is on maximizing the Q2 value.
Validation: the Q2 obtained when a model is applied to a validation set that was not used to construct it is the principal statistical parameter used to assess the predictive capability of a QSAR model. A well-known problem in QSAR model development is the algorithmic bias that arises from the specific structural characteristics of the compounds in a given training set. Such a model may work well within the chemical-descriptor space from which it was built, but not outside that space. There is a need to demonstrate that the chemical-descriptor space used is sufficiently broad to predict, with reasonable accuracy, the biopharmaceutical properties of new chemical entities. Enhanced classical cross-validation techniques have been employed with all ChemSilico predictors whose datasets are of sufficient size (>800 compounds). CSlogP, CSlogWS, CSGenoTox, CSpKa, CSBBB and CSPB have undergone cross-validation. The Q2 for CSLogD arises from an external validation set. Although not all CSpKa predictors for ionizable groups underwent cross-validation, owing to dataset size limitations, all the major pKa groups (CO2H, ROH, N1R, ArN, N3R) were cross-validated. |
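The iterative pruning described above can be illustrated with a minimal sketch. It is not ChemSilico's proprietary procedure: an ordinary least-squares model stands in for the neural network, the data are synthetic, and the names `fit_predict`, `squared_corr` and `prune_descriptors` are hypothetical. At each pass the input whose removal hurts the withheld-set score least is dropped, and the best-scoring input set is remembered.

```python
import numpy as np

def fit_predict(X_train, y_train, X_new):
    """Least-squares linear model with intercept (a stand-in for the neural net)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def squared_corr(x, y):
    """Square of the Pearson correlation coefficient."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r

def prune_descriptors(X, y, min_inputs=2, withheld_fraction=0.25, seed=0):
    """Backward elimination scored on a randomly selected withheld set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_withheld = int(len(y) * withheld_fraction)
    held, train = order[:n_withheld], order[n_withheld:]

    def score(cols):
        pred = fit_predict(X[train][:, cols], y[train], X[held][:, cols])
        return squared_corr(y[held], pred)

    active = list(range(X.shape[1]))
    best_active, best_score = list(active), score(active)
    while len(active) > min_inputs:
        # Try removing each remaining input; keep the removal that scores best.
        trials = [(score([k for k in active if k != j]), j) for j in active]
        trial_score, drop = max(trials)
        active.remove(drop)
        if trial_score >= best_score:
            best_active, best_score = list(active), trial_score
    return best_active, best_score

# Synthetic demonstration: only descriptors 0 and 3 carry signal.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 8))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)
selected, score_value = prune_descriptors(X_demo, y_demo, min_inputs=2)
print(sorted(selected), round(score_value, 3))
```

In this simplified form the withheld set guides the pruning directly; a production workflow would keep a further external set untouched for the final testing phase.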
| Cross-validation and Final Predictor: R2 (the goodness of fit) is inflationary and approaches unity (1.0) as the number of variables (network inputs) increases. Q2 (the goodness of prediction), in contrast, is not inflationary. Q2 reaches a plateau and then declines as the complexity of the QSAR model increases. ChemSilico predictors are finalized on a maximal Q2 value with a minimal number of variables. Enhanced cross-validation and the final biopharmaceutical property predictor are interrelated as follows:
|
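The inflationary behaviour of R2 can be checked numerically. The sketch below is an illustration under stated assumptions, not the ChemSilico workflow: it fits nested least-squares models (in place of neural nets) to synthetic data in which only the first two variables carry signal, and records the squared correlation on the training compounds and on compounds held out of the fit. All names and the data are hypothetical.

```python
import numpy as np

def fit_predict(X_train, y_train, X_new):
    """Least-squares linear model with intercept (a stand-in for the neural net)."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def squared_corr(x, y):
    """Square of the Pearson correlation coefficient."""
    r = np.corrcoef(x, y)[0, 1]
    return r * r

rng = np.random.default_rng(1)
n_train, n_held, n_vars = 60, 60, 30
X = rng.normal(size=(n_train + n_held, n_vars))
# Only the first two variables are informative; the rest are noise.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=1.0, size=n_train + n_held)
X_tr, y_tr = X[:n_train], y[:n_train]
X_he, y_he = X[n_train:], y[n_train:]

r2_values, q2_values = [], []
for k in range(1, n_vars + 1):
    cols = slice(0, k)  # nested models: first k variables
    r2_values.append(squared_corr(y_tr, fit_predict(X_tr[:, cols], y_tr, X_tr[:, cols])))
    q2_values.append(squared_corr(y_he, fit_predict(X_tr[:, cols], y_tr, X_he[:, cols])))

for k in (1, 2, 10, 30):
    print(k, round(r2_values[k - 1], 3), round(q2_values[k - 1], 3))
```

For nested least-squares models with an intercept, the training-set value can never decrease as variables are added, which is the inflation described above; the held-out value carries no such guarantee.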
| Explanation of Data Handling and Statistics: Correlation Coefficient: Experimental values and those generated during the various phases of the modeling process are compared using the Pearson product moment correlation coefficient defined below. |
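For reference, the standard form of the Pearson product moment correlation coefficient, written with the symbols X, Y and n used on this page, is given below. Treating n as the index over which the sums run is an assumption.

```latex
r = \frac{\sum_{n} (X_n - \bar{X})(Y_n - \bar{Y})}
         {\sqrt{\sum_{n} (X_n - \bar{X})^2 \; \sum_{n} (Y_n - \bar{Y})^2}}
```

Here \bar{X} and \bar{Y} denote the means of the experimental and generated values, respectively.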

| X | is the experimental (observed) value |
| Y | is the value generated by the predictor |
| n | is each individual observation |
|
The Pearson equation is applied to sets of X and Y values that arise from three different computational environments. These environments are defined by the extent to which the individual observations on the compounds represented by the Y values contributed to the development of the equation from which the Y value is generated. |
|
R2: the square of the correlation coefficient between the calculated and experimental values, is derived from calculated results. All of the compounds that contribute to R2 were used in both variable selection and in the generation of the model. |
|
| Q2: the square of the correlation coefficient for cross validation between the calculated and experimental values, is derived from predicted results. The compounds that contribute to Q2 were used in variable selection, but not in the generation of the model. |
|
| Q2val: the square of the correlation coefficient for external validation between the predicted and experimental values, is derived from predicted results. The compounds that contribute to validation Q2 were not used in either variable selection or in the generation of the model. |
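The three statistics differ only in which compounds supply the Y values. The sketch below makes that concrete under several assumptions: a least-squares model replaces the neural net, leave-one-out is used as the cross-validation scheme (the page does not say which scheme ChemSilico uses), variable selection is omitted, and all function names and data are hypothetical.

```python
import numpy as np

def squared_pearson(x, y):
    """Square of the Pearson correlation between two sets of values."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dx, dy = x - x.mean(), y - y.mean()
    return float((dx @ dy) ** 2 / ((dx @ dx) * (dy @ dy)))

def linear_fit(X, y):
    """Least-squares coefficients, intercept first (stand-in for the model)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def linear_predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def model_statistics(X_model, y_model, X_ext, y_ext):
    """Return (R2, Q2, Q2val) for a modeling set and an external set."""
    coef = linear_fit(X_model, y_model)
    # R2: every compound contributed to the model that calculates its value.
    r2 = squared_pearson(y_model, linear_predict(coef, X_model))
    # Q2: each compound is predicted by a model built without it.
    loo = np.empty(len(y_model))
    for i in range(len(y_model)):
        keep = np.arange(len(y_model)) != i
        loo[i] = linear_predict(linear_fit(X_model[keep], y_model[keep]),
                                X_model[i:i + 1])[0]
    q2 = squared_pearson(y_model, loo)
    # Q2val: external compounds never touched the modeling process.
    q2val = squared_pearson(y_ext, linear_predict(coef, X_ext))
    return r2, q2, q2val

rng = np.random.default_rng(7)
X_all = rng.normal(size=(120, 3))
y_all = X_all @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.2, size=120)
r2, q2, q2val = model_statistics(X_all[:80], y_all[:80], X_all[80:], y_all[80:])
print(round(r2, 3), round(q2, 3), round(q2val, 3))
```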
|
Additional Statistics A number of additional statistics are calculated to help define the quality of the ChemSilico family of predictors. |
|
Mean Absolute Error (MAE) The mean absolute error statistic is calculated for both the calculated and predicted (validation) results. |
| X | is the experimental (observed) value |
| Y | is the value generated by the predictor |
| i | is the ith compound |
| N | is the number of compounds in the dataset |
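With X, Y, i and N as defined here, the mean absolute error is conventionally written as follows. This is the standard definition, given as a reconstruction rather than a quotation of the original equation.

```latex
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| X_i - Y_i \right|
```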
|
Standard Deviation ( s ) The standard deviation is calculated for regression when the number of degrees of freedom is known. |
| X | is the experimental (observed) value |
| Y | is the value generated by the predictor |
| i | is the ith compound |
| Ndf | ( N - number of regression variables -1) |
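Given these definitions, the standard deviation of regression would take the usual form below. It is a reconstruction that assumes the residual X_i - Y_i is what is squared and summed, with Ndf as the divisor.

```latex
s = \sqrt{\frac{\sum_{i=1}^{N} (X_i - Y_i)^2}{N_{df}}},
\qquad N_{df} = N - (\text{number of regression variables}) - 1
```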
|
Average Relative Error (RAE) The average relative error gives the average of the absolute error expressed as a percent of the experimental value. |

| X | is the experimental (observed) value |
| Y | is the value generated by the predictor |
| i | is the ith compound |
| N | is the number of compounds in the dataset |
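Read literally, the description of the average relative error corresponds to the expression below. It is a reconstruction from the wording above, assuming that each absolute error is divided by the magnitude of its own experimental value before averaging; it is undefined for any compound whose experimental value is zero.

```latex
\mathrm{RAE} = \frac{100}{N} \sum_{i=1}^{N} \frac{\left| X_i - Y_i \right|}{\left| X_i \right|}
```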
|
Root Mean Square (RMS) The root mean square is a corollary statistic to s and is calculated for validation where the number of degrees of freedom is undefined. |
| X | is the experimental (observed) value |
| Y | is the value generated by the predictor |
| i | is the ith compound |
| N | is the number of compounds in the dataset |
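As a worked illustration, the four error statistics described on this page can be computed as in the sketch below. The formulas are the conventional ones implied by the definitions of X, Y, i, N and Ndf (the root mean square being the square root of the mean squared difference between X and Y); the function names and the four-compound example are hypothetical.

```python
import math

def mean_absolute_error(x, y):
    """MAE: mean of |X_i - Y_i| over the N compounds."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / len(x)

def standard_deviation(x, y, n_variables):
    """s: root of the squared-error sum divided by Ndf = N - variables - 1."""
    ndf = len(x) - n_variables - 1
    if ndf <= 0:
        raise ValueError("not enough compounds for the degrees of freedom")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / ndf)

def average_relative_error(x, y):
    """RAE: mean absolute error as a percent of each experimental value."""
    return 100.0 * sum(abs(xi - yi) / abs(xi) for xi, yi in zip(x, y)) / len(x)

def root_mean_square(x, y):
    """RMS: like s, but divided by N because Ndf is undefined for validation."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x))

observed = [1.0, 2.0, 4.0, 5.0]    # X: experimental values
predicted = [1.5, 2.0, 3.0, 5.5]   # Y: values generated by a predictor

print(mean_absolute_error(observed, predicted))      # 0.5
print(average_relative_error(observed, predicted))   # 21.25
print(round(standard_deviation(observed, predicted, n_variables=1), 4))
print(round(root_mean_square(observed, predicted), 4))
```

Because s divides by Ndf rather than N, it is always at least as large as the RMS computed on the same residuals.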
|
To contact us: |
Phone: 978-501-0633 Fax: 781-275-5197 Email: sales@chemsilico.com |
Copyright © 2003 ChemSilico LLC All Rights Reserved Terms and Conditions of Use | Privacy Policy ChemSilico is a registered trademark of ChemSilico LLC, Tewksbury, MA 01876 |