Implementation Plan:
Step 1: First, we collect and load the data from the PubMed MultiLabel Text Classification Dataset.
Step 2: Then, we preprocess the data: we clean the text (special-character removal and case normalization), apply biomedical tokenization, convert the labels into multi-hot vectors, and map the labels to their root categories to reduce label sparsity.
Step 3: Next, we select and implement a transformer-based biomedical model such as PubMedBERT for multi-label classification, with a sigmoid activation applied to each output label.
Step 4: Next, we train the model: we split the dataset into training, validation, and testing sets, fine-tune the biomedical transformer model, and optimize it with the Binary Cross-Entropy with Logits loss function.
Step 5: Next, we evaluate the trained model by comparing the predicted root labels with the ground-truth labels and generating confusion matrices.
Step 6: Next, we generate the prediction outputs for the biomedical classification of the collected data in both textual and structured JSON formats.
Step 7: Finally, we plot the following performance metrics:
7.1: Number of Epochs Vs. Accuracy (%)
7.2: Number of Epochs Vs. Precision (%)
7.3: Number of Epochs Vs. Recall (%)
7.4: Number of Epochs Vs. F1-score (%)
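The preprocessing, loss, evaluation, and output steps above (Steps 2, 4, 5, and 6) can be illustrated with a minimal sketch in plain Python. It is not the project implementation: the four root labels, the function names, and the hand-written logits are hypothetical placeholders standing in for the real dataset labels and the PubMedBERT outputs. In practice a framework loss (for example torch.nn.BCEWithLogitsLoss in PyTorch) would likely be used instead of the hand-written one shown here.

```python
import json
import math
import re

# Hypothetical root categories; the real label set comes from the dataset.
ROOT_LABELS = ["A", "B", "C", "D"]
LABEL_INDEX = {name: i for i, name in enumerate(ROOT_LABELS)}


def clean_text(text):
    """Lowercase the text and replace special characters with spaces."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def to_multi_hot(labels, label_index):
    """Convert a list of label names into a multi-hot vector."""
    vector = [0] * len(label_index)
    for label in labels:
        vector[label_index[label]] = 1
    return vector


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw logits.

    Uses the stable form max(x, 0) - x * y + log(1 + exp(-|x|)).
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


def predict(logits, threshold=0.5):
    """Apply a sigmoid per label and threshold the probability."""
    return [1 if sigmoid(x) >= threshold else 0 for x in logits]


def micro_scores(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            tp += int(t == 1 and p == 1)
            fp += int(t == 0 and p == 1)
            fn += int(t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def to_json_record(text, prediction, labels):
    """Serialize one prediction as a structured JSON string."""
    predicted = [name for name, flag in zip(labels, prediction) if flag]
    return json.dumps({"text": text, "predicted_labels": predicted})


# Usage with made-up values (the logits are not real model outputs).
abstract = clean_text("BRCA1-Mutations & Cancer!")
target = to_multi_hot(["A", "C"], LABEL_INDEX)
logits = [2.0, -1.0, 0.3, -3.0]
prediction = predict(logits)
print(bce_with_logits(logits, target))
print(micro_scores([target], [prediction]))
print(to_json_record(abstract, prediction, ROOT_LABELS))
```

The metrics are micro-averaged here only as one possible choice; per-label or macro-averaged scores could be plotted per epoch in the same way.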
Software Requirements:
- Development Tool: Python 3.11.4 or above
- Operating System: Windows 10 (64-bit) or above
Dataset:
Link: PubMed MultiLabel Text Classification Dataset
Note:
1) If the proposed plan does not fully align with your requirements, please provide all necessary details—including steps, parameters, models, and expected outcomes—in advance. Kindly ensure that any missing configurations or specifications are clearly outlined in the plan before confirming.
2) If there is no built-in solution for what the project needs, we can turn to reference models, customize our own models, use different mathematical models, or write the code ourselves to fulfil the process.
3) If the plan satisfies your requirements, please confirm with us.
4) This project is based on simulation only.

