1 Crimson College of Technology, Butwal, Nepal
2 Kathmandu Multiple College (formerly Karnali College of Health Sciences), Kathmandu, Nepal
This study focuses on the development of a machine learning-based Quantitative Structure-Activity Relationship (QSAR) model to predict the toxicity of organic compounds, including both traditional herbal remedies and synthetic compounds. The study employs Logistic Regression, Random Forest, and Support Vector Machines (SVM) to predict potential toxicity based on molecular descriptors calculated using RDKit, achieving over 90% accuracy across models. Feature importance analysis reveals that molecular descriptors such as lipophilicity (logP), hydrogen bond donors, and specific molecular fingerprints (e.g., FP_375, FP_243, FP_417) significantly correlate with toxicity. A Random Forest-based model highlighted these fingerprint bits as key contributors to toxicity prediction, showing strong correlations with known toxicological properties. The top 20 fingerprint features were analyzed, with their importance ranking depicted in a bar chart. The model demonstrates promising results in predicting hepatotoxicity and neurotoxicity, offering an early-stage toxicity screening tool for drug discovery. Validated on external datasets, the model generalizes well to unseen herbal and synthetic compounds, making it a valuable tool for pharmaceutical and herbal compound safety evaluation. This research underscores the potential of integrating traditional medicinal knowledge with advanced computational methods to enhance safety profiling of diverse organic compounds.
Traditional herbal remedies have played a significant role in healthcare systems worldwide, including Ayurveda, Traditional Chinese Medicine (TCM), and other ethnopharmacological practices (Sen et al., 2011). While herbal compounds are often assumed to be safe due to their natural origins, many lack extensive toxicity studies, particularly through modern scientific methods (Woo et al., 2012). This can lead to safety concerns, especially with long-term use or high doses (Moreira et al., 2014).
As drug discovery and toxicology studies progress, computational approaches have gained prominence (Cherkasov et al., 2014). Specifically, Quantitative Structure-Activity Relationship (QSAR) models have become a popular tool for predicting the biological activity of chemical compounds based on their molecular structures (Varsou et al., 2024). QSAR models can predict potential toxic effects without the need for large-scale clinical testing or experimental setups, making them efficient for early-stage screening (EBSCOhost, 2023).
While QSAR models have been applied extensively in the pharmaceutical and industrial sectors, little attention has been given to herbal medicines (Xu et al., 2024). The lack of data about traditional medicines creates a barrier to ensuring the safe use of those compounds in modern healthcare.
There is a growing need for computational models that can predict the toxicity of herbal compounds, thus bridging the gap between traditional medicine and modern toxicological evaluation (Machhar et al., 2019). This study aims to develop a machine learning-based QSAR model to predict the toxicity of herbal and synthetic organic compounds, using RDKit-calculated molecular descriptors. The model will be trained and validated with known toxicity data from herbal and plant-derived compounds.
The outcomes could benefit both the scientific community and the herbal medicine industry by offering a tool for early-stage screening, potentially reducing the need for extensive in vivo and in vitro testing (Krewski et al., 2010).
METHODOLOGY
Dataset Collection
This study gathered data on the molecular structures and toxicity profiles of herbal compounds from publicly available databases such as:
The dataset included toxicity data for various endpoints, including hepatotoxicity, neurotoxicity, and general cytotoxicity (Yang et al., 2019). Compounds without toxicity data were excluded from the study (Fourches et al., 2016).
Molecular Descriptor Calculation
To represent the molecular structure of each herbal compound numerically, molecular descriptors were calculated using RDKit, an open-source cheminformatics toolkit (Ekins et al., 2014). The descriptors included:
0D Descriptors: 0D (zero-dimensional) descriptors in cheminformatics are typically scalar values that provide information about the overall properties of a molecule without considering its spatial arrangement. E.g., molecular weight, number of atoms, number of heavy atoms, number of hydrogen bond donors and others.
1D Descriptors: 1D (one-dimensional) descriptors in cheminformatics represent counts or specific attributes of molecular features, focusing on individual atom types, bond types, or functional groups without considering the spatial arrangement of the molecule. E.g., total hydrogen atoms, total number of C atoms, number of aromatic amino groups and others.
2D Descriptors: 2D (two-dimensional) descriptors in cheminformatics provide information about the molecular structure and properties based on the arrangement of atoms and bonds, without taking into account three-dimensional (3D) conformations. E.g., maximum absolute E-state values, minimum absolute partial charges, charge components and various shape descriptors.
These descriptors were used as input features for the machine learning models.
Molecular Fingerprints: Fingerprints are a specific type of descriptor that encodes molecular structures as bit strings. This encoding consists of a sequence of binary digits (bits), which indicate the presence (1) or absence (0) of particular substructures within the molecule. The resulting numeric array is of length n, where n is determined by the specific fingerprint algorithm employed. In this study, molecular fingerprints were calculated for the input data using the RDKit library in Python (Rogers & Hahn, 2010). The types of fingerprints calculated include:
Machine Learning Algorithms
Three machine learning algorithms were selected for the toxicity prediction task (Lo et al., 2018):
Logistic Regression used 11 molecular descriptors (e.g., LogP, TPSA) selected for toxicity relevance, with features scaled via RobustScaler and missing values imputed with zeros. Hyperparameters (regularization strength C; solvers liblinear and saga) were optimized using GridSearchCV with 5-fold cross-validation. The model was evaluated for accuracy and other classification metrics and saved with joblib for reproducibility.
Support Vector Machines (SVM) utilized the same 11 descriptors, standardized with StandardScaler, and employed an RBF kernel (C=1, gamma='scale'). The model was assessed for accuracy, precision, recall, and F1-score, and serialized using joblib.
Random Forest used 1024-bit Morgan fingerprints (radius=2) from RDKit as input, with the dataset split into 80% training and 20% testing sets. The model, configured with 100 decision trees (random_state=42), was evaluated on validation and test sets and saved with joblib for external validation.
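As a rough illustration of what a fixed-length bit fingerprint is, the sketch below folds a handful of substructure identifiers into a 1024-bit vector using only the Python standard library. It is not RDKit's Morgan algorithm: the feature labels and the helper `fold_features_to_bits` are hypothetical, and RDKit derives its identifiers from atom environments with its own hashing. The point is only that hashing into a fixed number of bits is many-to-one, which is why a bit such as FP_375 has to be mapped back to example structures before it can be interpreted.

```python
import hashlib

def fold_features_to_bits(feature_ids, n_bits=1024):
    """Fold arbitrary substructure identifiers into a fixed-length bit vector."""
    bits = [0] * n_bits
    for feature in feature_ids:
        # Deterministic hash of the identifier, reduced modulo the vector length.
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_bits
        bits[index] = 1
    return bits

# Hypothetical substructure labels standing in for hashed atom environments.
features = ["aromatic_ring", "sulfonamide", "trifluoromethyl"]
fingerprint = fold_features_to_bits(features, n_bits=1024)
on_bits = [i for i, b in enumerate(fingerprint) if b]
print(len(fingerprint), len(on_bits))
```

Because two different identifiers can land on the same index, the number of set bits can be smaller than the number of features; longer vectors reduce, but do not eliminate, such collisions.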
Model Training and Validation
The complete dataset was divided into training (80%) and testing (20%) subsets to train and evaluate the machine learning models. Feature engineering involved the calculation of molecular descriptors and fingerprints using RDKit, which served as the input features, while toxicity classification served as the target variable.
For Logistic Regression, hyperparameter tuning was conducted using GridSearchCV with 5-fold cross-validation. The model's regularization strength (C) and solver type (liblinear, saga) were optimized based on validation accuracy.
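The bookkeeping behind a grid search with 5-fold cross-validation can be sketched without any machine learning library. The solver names below come from the text, but the `C` values are hypothetical placeholders, since the actual grid is not reported, and `k_fold_indices` is an illustrative helper that makes plain contiguous folds. scikit-learn's GridSearchCV uses stratified folds for classifiers by default, so this is a simplification rather than a reproduction of the authors' setup.

```python
from itertools import product

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# Solvers are taken from the text; the C values are hypothetical placeholders.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0], "solver": ["liblinear", "saga"]}
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

folds = list(k_fold_indices(n_samples=23, k=5))
total_fits = len(candidates) * len(folds)
print(len(candidates), len(folds), total_fits)  # 8 5 40
```

Each candidate setting is fitted once per fold and scored on the held-out fold; the setting with the best mean validation score is then selected.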
The Random Forest and Support Vector Machine (SVM) models were trained using fixed hyperparameters. Random Forest was implemented with 100 estimators and a fixed random state to ensure reproducibility, while the SVM used an RBF kernel with C=1 and gamma='scale' as default settings.
After training, all models were evaluated using the test dataset. Performance metrics such as accuracy, precision, recall, and F1-score were calculated to assess the predictive ability of each model (Ballabio et al., 2018). Additionally, confusion matrices and feature importance analyses were employed to understand model decisions and interpretability.
External validation:
To assess the generalization ability of the trained Random Forest, Logistic Regression, and Support Vector Machine (SVM) models, an external dataset of 506 structurally diverse compounds from the ProTox-III (Khaouane et al., 2023) validation set (https://tox.charite.de/protox3/index.php?site=links) was used. SMILES notations of the compounds were converted into 1024-bit Morgan fingerprints (radius = 2) for the Random Forest model and into 11 molecular descriptors (e.g., molecular weight, LogP, TPSA, rotatable bonds, ring count) using RDKit for the Logistic Regression and SVM models, consistent with their training methodologies. Infinite or missing descriptor values were replaced with zeros. Each model predicted toxicity based solely on chemical structure, without using experimental LD50 values or pre-assigned toxicity classes. Post-prediction, LD50 values were used to classify compounds per the Globally Harmonized System (GHS) as toxic (Classes I–IV, LD50 ≤ 2000 mg/kg) or non-toxic (Classes V–VI, LD50 > 2000 mg/kg) for performance evaluation. Predictions were exported for further analysis.
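The post-prediction labelling rule described above reduces to a single threshold on LD50. A minimal sketch follows, in which the function name `ghs_binary_label` and the example doses are hypothetical and only the 2000 mg/kg cut-off comes from the text:

```python
def ghs_binary_label(ld50_mg_per_kg, threshold=2000.0):
    """Return 'toxic' for LD50 <= threshold (Classes I-IV), else 'non-toxic'."""
    if ld50_mg_per_kg <= 0:
        raise ValueError("LD50 must be a positive dose in mg/kg")
    return "toxic" if ld50_mg_per_kg <= threshold else "non-toxic"

# Hypothetical LD50 values (mg/kg), not taken from the validation set.
examples = [5.0, 300.0, 2000.0, 2000.1, 5000.0]
labels = [ghs_binary_label(value) for value in examples]
print(labels)
```

A compound with an LD50 of exactly 2000 mg/kg falls on the toxic side of the boundary, matching the "≤ 2000 mg/kg" wording.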
RESULTS AND DISCUSSION
Model Performance
The models were evaluated based on several key performance metrics, including accuracy, precision, recall, and F1 score. The performance of the Logistic Regression, Random Forest, and Support Vector Machine (SVM) models on the test set for toxicity prediction is summarized in the table below:
| Model | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic regression | 92.22% | 98.73% | 85.56% | 91.56% |
| Random forest | 92.78% | 98.73% | 86.67% | 92.31% |
| Support vector machine | 92.22% | 87.00% | 100% | 93.00% |
Both Logistic Regression and SVM models achieved an accuracy of 92.22%, while the Random Forest model performed slightly better with an accuracy of 92.78%. All models exhibited high precision (98.73% for Logistic Regression and Random Forest, 87% for SVM), indicating a low rate of false positives.
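The F1 score reported for each model is the harmonic mean of its precision and recall. As a quick consistency check, the sketch below recomputes the Random Forest value from the precision and recall given in the table; the helper name is illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Random Forest row of the table: precision 98.73%, recall 86.67%.
f1_rf = f1_from_precision_recall(0.9873, 0.8667)
print(f"{100 * f1_rf:.2f}%")  # 92.31%
```

Because the harmonic mean is pulled toward the smaller of the two inputs, a model cannot reach a high F1 score by maximizing precision or recall alone.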
Analysis of Molecular Descriptors
Feature Importance analysis of Random Forest Model: Feature importance analysis using the Random Forest model revealed that not only classical molecular descriptors such as logP and the number of hydrogen bond donors were correlated with compound toxicity, but also specific molecular fingerprints contributed significantly. This finding aligns with established toxicological principles, where lipophilicity and hydrogen bonding influence membrane permeability and biological activity.
To understand the structural patterns influencing toxicity, the top 10 most important fingerprint bits (FP_375, FP_243, FP_417, FP_595, FP_887, FP_540, FP_591, FP_118, FP_695, and FP_69) were visualized using RDKit. Each fingerprint bit corresponds to a specific molecular substructure that frequently appeared in toxic compounds within the dataset. Representative SMILES structures were identified for each bit:
| Fingerprint Bit | Matched SMILES | Structural Insight |
|---|---|---|
| FP_375 | Cc1ccc(-c2cc(C(F)(F)F)nn2-c2ccc(S(N)(=O)=O)cc2)cc1 | Aromatic ring with trifluoromethyl & sulfonamide groups – both associated with membrane interaction and enzyme binding. |
| FP_243 | CCCC(C)C1(CC)C(=O)NC(=O)NC1=O | Branched alkyl chain with cyclic urea – linked to hydrophobicity and metabolic stability. |
| FP_417 | CC(=O)NC[C@H]1CN(c2ccc(N3CCOCC3)c(F)c2)C(=O)O1 | Morpholine and fluorinated aryl groups – common in CNS-active drugs with potential neurotoxicity. |
| FP_595 | Long chain phospholipid-like ester | Highly lipophilic, mimicking biological membranes – relevant in cytotoxicity. |
| FP_887 | OCCN1CCN(CCCN2c3ccccc3Sc3ccc(Cl)cc32)CC1 | Tertiary amines and sulfur-containing heterocycles – often associated with hepatotoxicity. |
| FP_540 | CC(=O)Nc1cccc2c1-c1ccccc1C2 | Fused aromatic rings with acetamide – planar structures affecting DNA intercalation. |
| FP_591 | c1ccc2c(c1)[nH]c1cnccc12 | Indole-pyridine fused ring – found in many bioactive compounds, potentially toxic at high doses. |
| FP_118 | CC(C)CON=O | Alkyl nitrite (O–N=O) group – a structural alert for mutagenicity and carcinogenicity. |
| FP_695 | Cc1ccc2c(c1[N+](=O)[O-])C(=O)c1ccccc1C2=O | Nitroaromatic ketone – known for redox cycling and liver toxicity. |
| FP_69 | COC12C(COC(N)=O)C3=C(C(=O)C(C)=C(N)C3=O)N1CC1NC12 | Complex fused heterocycles – structurally rich motifs often flagged in lead optimization for off-target effects. |
Visual representations of each substructure (highlighted in red) are provided in the supplementary materials (Figures S1–S10).
These fingerprint-based substructures, derived from the Morgan algorithm, do not directly convey semantic chemical features, but they consistently match recurring toxic motifs in chemical space. Their high importance values in the model emphasize their predictive power and their likely involvement in pharmacokinetic and toxicodynamic pathways.
Fig. 3. Relative importance of the top 20 Morgan fingerprint bits in the Random Forest model.
This bar chart shows the relative importance of the top 20 Morgan fingerprint bits that contributed to the toxicity prediction model. Higher bars indicate more important features for the Random Forest model, suggesting these fingerprint bits are linked to key molecular characteristics that influence toxicity.
Feature Importance Analysis of Logistic Regression Model: The feature importance analysis for the logistic regression model was performed by examining the absolute values of the model's coefficients. This provides insights into which features most significantly influence the toxicity predictions of organic compounds. The feature importance values were visualized in a bar chart, highlighting the molecular descriptors that played key roles in distinguishing between toxic and non-toxic compounds.
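Ranking descriptors by absolute coefficient can be sketched in a few lines. The coefficient values below are hypothetical and chosen only to show the mechanics; they are not the fitted model's values, and `rank_by_abs_coefficient` is an illustrative helper.

```python
def rank_by_abs_coefficient(coefficients):
    """Sort (feature, coefficient) pairs by descending absolute coefficient."""
    return sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical coefficients for illustration; not the fitted model's values.
coefficients = {"LogP": 0.8, "TPSA": -1.1, "NumRotatableBonds": 0.3}
ranking = rank_by_abs_coefficient(coefficients)
print([name for name, _ in ranking])  # ['TPSA', 'LogP', 'NumRotatableBonds']
```

The magnitude gives the ranking while the sign gives the direction of the effect on the predicted log-odds of toxicity. Magnitudes are only comparable across descriptors when the inputs share a common scale, which is the role of the scaling step applied before fitting.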
Fig. 4. Feature importance analysis of logistic regression model.
Figure 4 presents the feature importance analysis of the logistic regression model. As observed, the most significant features for toxicity prediction were ExactMolWt and HeavyAtomMolWt, which showed the highest coefficient values. The bar chart in Figure 4 provides a clear visual representation of these features, where higher bars indicate more important features for the model's prediction of toxicity.
Feature Importance Analysis of SVM Model: Feature importance analysis was conducted to better understand the role of each molecular descriptor in predicting toxicity using the Support Vector Machine (SVM) model. The analysis was performed using permutation importance, which evaluates the impact of each feature by measuring the decrease in model accuracy when a feature's values are shuffled.
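The permutation procedure can be sketched with the standard library alone. In the example below, `toy_model` is a hypothetical stand-in for a fitted classifier that looks only at the first feature, and the data are synthetic. scikit-learn offers the same idea as `sklearn.inspection.permutation_importance`, although the text does not state which implementation was used.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy stand-in for a fitted classifier: it only looks at feature 0.
def toy_model(row):
    return 1 if row[0] > 0.5 else 0

rows = [[i / 19, (i * 7 % 20) / 19] for i in range(20)]  # feature 1 is unused
labels = [toy_model(r) for r in rows]
importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

Shuffling the unused second feature leaves every prediction unchanged, so its importance is exactly zero, while shuffling the first feature breaks its link to the labels and lowers accuracy.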
The most important features for the SVM model, based on the mean decrease in accuracy, were found to be ExactMolWt and HeavyAtomMolWt, which had the highest importance values, followed by NumRotatableBonds, LogP, and NumHydroxylGroups. These results align with known toxicological principles, where molecular weight and lipophilicity (LogP) are key factors in predicting the bioactivity and toxicity of compounds.
Fig. 5. Feature importance ranking of SVM model.
The feature importance ranking is presented in Figure 5, which visually represents the relative importance of each feature in the model. As shown, ExactMolWt and HeavyAtomMolWt have a considerable influence on the toxicity predictions, suggesting that molecular weight-related descriptors are crucial in understanding compound toxicity. LogP and NumHydroxylGroups also play important roles, highlighting the relevance of hydrophobicity and functional group presence in predicting toxicity.
Validation with External Data
External Validation of Random Forest Model: The Random Forest model was externally validated using 506 structurally diverse compounds from the ProTox-III validation set. SMILES strings were converted into 1024-bit Morgan fingerprints, consistent with the training pipeline. Experimental LD50 values were used post-prediction to evaluate performance against ProTox-defined toxicity classes.
The model achieved an accuracy of 77.1%, with particularly strong performance for identifying toxic compounds (Classes I–IV), achieving 96.2% recall and 79.0% precision. However, its precision for non-toxic compounds (Classes V–VI) was lower (37.5%), indicating a conservative bias toward predicting toxicity. These results suggest the Random Forest model effectively recognizes harmful structural motifs but may overpredict toxicity in less harmful or benign compounds.
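The reported percentages can be cross-checked against a confusion matrix. The counts below are not published in the study; they are inferred here as one set of integers that reproduces the stated figures given the 396 toxic and 110 non-toxic compounds in the validation set, and should be read as an assumption.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy plus per-class precision and recall from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "toxic_recall": tp / (tp + fn),
        "toxic_precision": tp / (tp + fp),
        "nontoxic_precision": tn / (tn + fn),
        "nontoxic_recall": tn / (tn + fp),
    }

# Counts inferred from the reported percentages and the 396/110 class split;
# they are an assumption, not figures published by the authors.
rf_external = binary_metrics(tp=381, fn=15, fp=101, tn=9)
for name, value in rf_external.items():
    print(f"{name}: {100 * value:.1f}%")
```

Under these assumed counts, only 9 of the 110 non-toxic compounds would be recognized as non-toxic, which makes the size of the conservative bias explicit.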
External Validation of Logistic Regression Model: The Logistic Regression model achieved 100% recall for toxic compounds (Classes I–IV), correctly identifying all 396 toxic entries. However, it misclassified all 110 non-toxic compounds (Classes V–VI) as toxic, giving 0% recall for the non-toxic class; precision for that class is undefined because no compound was predicted non-toxic. The overall accuracy of 78.3% therefore equals the share of toxic compounds in the validation set (396 of 506) and reflects the model's conservative bias toward predicting toxicity rather than any discrimination between classes. This high sensitivity may be advantageous in early hazard screening but underscores the need for improved specificity and balanced training to reduce false positives.
External Validation of Support Vector Machine (SVM) Model: The SVM model achieved an overall accuracy of 78.3% in external validation. It identified all 396 toxic compounds (Classes I–IV; recall 1.00) but none of the 110 non-toxic compounds (Classes V–VI), giving 0% recall for the non-toxic class and no non-toxic predictions from which to compute precision.
These results suggest the model has a high sensitivity for toxic compounds, favoring toxicity prediction across structurally varied chemical space. While this conservatism helps minimize false negatives, it also results in false positives among non-toxic chemicals. Future work may focus on improving class balance and calibration to enhance non-toxic compound identification.
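To put the 78.3% figure in context, the sketch below computes the metrics of a rule that labels every compound as toxic, using the class counts from the external validation set. Balanced accuracy is not reported in the study and is included here only as an illustrative complement; the helper name is hypothetical.

```python
def all_toxic_baseline(n_toxic, n_nontoxic):
    """Metrics for a rule that labels every compound as toxic."""
    total = n_toxic + n_nontoxic
    sensitivity = 1.0  # every toxic compound is flagged
    specificity = 0.0  # no non-toxic compound is cleared
    return {
        "accuracy": n_toxic / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }

baseline = all_toxic_baseline(n_toxic=396, n_nontoxic=110)
print(f"{100 * baseline['accuracy']:.1f}%")  # 78.3%
```

On a validation set where toxic compounds make up most of the entries, accuracy alone cannot distinguish a useful classifier from this trivial rule, so per-class recall or a balanced metric is the more informative comparison.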
Limitations
The study faced limitations related to the availability and quality of toxicity data for herbal compounds. Many herbal compounds lack comprehensive toxicity profiles, which restricted the size of the dataset. Additionally, the QSAR models were limited to predicting toxicity for individual compounds and did not account for the potential synergistic effects of multiple compounds in an herbal mixture.
DISCUSSION AND CONCLUSION
The development of machine learning-based Quantitative Structure-Activity Relationship (QSAR) models for predicting the toxicity of herbal and synthetic organic compounds marks a significant advancement in computational toxicology. Our study demonstrates that Logistic Regression, Random Forest, and Support Vector Machines (SVM) can achieve high accuracy (>90%) in predicting hepatotoxicity, neurotoxicity, and general acute toxicity. Random Forest slightly outperformed the others, with an accuracy of 92.78%, precision of 98.73%, recall of 86.67%, and F1 score of 92.31%. SVM exhibited perfect recall (100%) but lower precision (87%), making it particularly suitable for applications where missing toxic compounds is critical, even at the cost of some false positives.
Feature importance analysis provided valuable insights into the structural determinants of toxicity. For Random Forest, key features included molecular fingerprints corresponding to substructures such as aromatic rings, nitro groups, and tertiary amines, which are well-known for their association with toxic effects. Logistic Regression and SVM highlighted the significance of molecular weight and lipophilicity (logP), consistent with established toxicological principles. These findings not only validate the models’ predictive capabilities but also offer actionable insights for designing safer compounds by identifying structural features that contribute to toxicity.
When compared to recent advancements in the field, our models’ performance is commendable. For instance, Romano et al. (2022) utilized graph neural networks (GNNs) with publicly aggregated semantic graph data, achieving a mean area under the receiver operating characteristic curve (AUROC) of 0.883 for toxicity prediction across 52 assays from the Tox21 dataset. While AUROC is a different metric from accuracy, the high performance of GNNs suggests that incorporating relational data between chemicals, genes, and assays can enhance prediction accuracy. Similarly, Sharma et al. (2023) employed multi-task deep neural networks (MTDNN) with pre-trained SMILES embeddings, achieving an AUROC of 0.991 for clinical toxicity prediction, demonstrating the potential of deep learning for complex endpoints. Our study’s focus on both herbal and synthetic compounds addresses a gap in the literature, where herbal compounds are often underrepresented, thus bridging traditional herbal medicine with modern toxicology.
The high sensitivity of our models, particularly SVM, ensures that potentially toxic compounds, including those derived from herbal sources, are identified, enhancing public health safety. This is particularly relevant given the increasing use of herbal supplements and the need for robust safety profiling. However, our models exhibited a conservative bias in external validation, over-predicting toxicity, which could lead to false positives. This is a common challenge in toxicity prediction, potentially due to class imbalance in the datasets, where toxic compounds are less prevalent than non-toxic ones. The lower accuracy (77–78%) on the external ProTox-III dataset further highlights the need for more diverse training data to improve generalizability.
The practical implications of our findings are substantial. By providing early-stage toxicity screening, our models can reduce reliance on extensive animal testing, aligning with ethical and regulatory trends toward alternative methods. In pharmaceutical development, these models can prioritize compounds for further testing, optimizing resources and accelerating the identification of safe candidates. The inclusion of herbal compounds also supports the integration of traditional medicine into modern safety assessment frameworks, potentially informing regulatory guidelines for herbal products.
Despite these strengths, our study has limitations. The conservative bias in external validation suggests that class imbalance and dataset specificity may affect model performance. Additionally, the reliance on specific molecular descriptors and fingerprints may limit the models’ applicability to novel chemical spaces. Future research should explore strategies to mitigate class imbalance, such as oversampling minority classes or employing cost-sensitive learning techniques. Incorporating more diverse data sources, such as those aggregated in platforms like ComptoxAI, could enhance model robustness. Moreover, adopting advanced machine learning architectures, such as GNNs or deep neural networks (Mayr et al., 2016), may improve prediction accuracy and enable the modeling of more complex toxicity endpoints, including clinical toxicity.
In conclusion, this study underscores the efficacy of machine learning-based QSAR models in predicting the toxicity of a diverse range of compounds, including those from traditional herbal medicine. While opportunities for refinement remain, particularly in addressing class imbalance and enhancing generalizability, the current models offer a powerful tool for early-stage toxicity screening, supporting both scientific research and public health initiatives.
FUTURE DIRECTIONS
ACKNOWLEDGEMENT:
The list of FDA-approved drugs used in this research was obtained from the U.S. Food and Drug Administration (FDA) website. The Human Metabolome Database (HMDB) was used for non-toxic metabolite data in this research. The database is freely available and must be cited as per the following reference: Wishart DS, Guo AC, Oler E, et al., HMDB 5.0: the Human Metabolome Database for 2022. Nucleic Acids Res. 2022. We thank the Toxic Exposome Database (T3DB) for providing the comprehensive data on toxic substances, which was essential for this study. We acknowledge the National Library of Medicine (NLM) for providing access to the Carcinogenic Potency Database (CPDB) used in this study. This work utilized data from the NLM, 'Courtesy of the U.S. National Library of Medicine'.
The author gratefully acknowledges the use of the publicly available external validation dataset provided by ProTox-III (https://tox.charite.de/protox3/index.php?site=home#). The validation set, described by the developers as a diverse subset of compounds spanning multiple toxicity classes, was used in this study to perform structure-based external validation of predictive toxicity models. The resource significantly contributed to the evaluation of our models against real-world chemical diversity.
Supplementary information:
The supplementary information can be found here: https://github.com/Pratikkhanal18/QSAR-model
REFERENCES
Pratik Khanal, Bhawana Sen, Predicting Toxicity of Herbal and Synthetic Organic Compounds Using Machine Learning-Based QSAR Models, Int. J. of Pharm. Sci., 2025, Vol 3, Issue 7, 3938-3950. https://doi.org/10.5281/zenodo.16569753