When I set out to build MedGenius LLaMA, I wanted to create a medical AI assistant that could handle patient-doctor dialogues with the nuance and accuracy required in healthcare. After 3 months of work, I fine-tuned LLaMA 3.2 1B on 40,000+ curated medical dialogues.
In this post, I’ll share the complete process, from dataset curation to deployment, so you can fine-tune domain-specific LLMs for your own use cases.
Why Fine-Tune Instead of RAG?
Before diving in, let’s address the elephant in the room: when should you fine-tune vs. use RAG?
Use Fine-Tuning When:
- ✅ You need the model to “speak” domain-specific language
- ✅ You have 10,000+ high-quality examples
- ✅ You need consistent tone and style
- ✅ Latency matters (no retrieval overhead)
- ✅ You want to embed knowledge IN the model
Use RAG When:
- ✅ Knowledge changes frequently
- ✅ You have documents but not dialogue examples
- ✅ You need source citations
- ✅ You have budget for inference API calls
- ✅ You don’t have GPU resources for training
For MedGenius, I chose fine-tuning because medical terminology and consultation patterns needed to be embedded in the model itself.
Dataset Curation: The Foundation
Good data > fancy techniques. I spent 50% of my time on data curation.
Sources for Medical Dialogues
- Medical transcripts (anonymized)
- Public medical QA datasets (cleaned and verified)
- Synthetic data from GPT-4 (carefully reviewed)
- Medical literature adapted to dialogue format
Data Format
I used the instruction-following format:
{
"instruction": "You are a medical AI assistant. Provide accurate, empathetic responses.",
"input": "I've had a persistent headache for 3 days, along with sensitivity to light. What could this be?",
"output": "Based on your symptoms of persistent headache and photophobia (light sensitivity), this could indicate several conditions:\n\n1. **Migraine** - Most likely given the combination of headache and light sensitivity\n2. **Tension headache** - Though less commonly causes photophobia\n3. **Sinusitis** - If accompanied by facial pressure\n4. **Meningitis** - Rare but serious (seek immediate care if you have fever, stiff neck, or confusion)\n\nI recommend:\n- Rest in a dark, quiet room\n- Stay hydrated\n- OTC pain relief (ibuprofen or acetaminophen)\n- See a doctor if symptoms persist beyond 5 days or worsen\n\n⚠️ Seek immediate medical attention if you experience: severe sudden headache, fever, stiff neck, confusion, or vision changes."
}
Data Quality Criteria
Every dialogue in my dataset met these standards:
- Medically accurate - Verified against medical databases
- Appropriately cautious - Never replace professional diagnosis
- Empathetic tone - Healthcare requires compassion
- Clear disclaimers - When to seek professional help
- Structured responses - Easy to parse and understand
Cleaning Pipeline
import re
import pandas as pd
from typing import Dict, List, Optional

def clean_medical_dialogue(dialogue: Dict) -> Optional[Dict]:
    """Clean and validate medical dialogue"""
    # Remove PII (emails, phone numbers, names)
    dialogue['input'] = remove_pii(dialogue['input'])
    dialogue['output'] = remove_pii(dialogue['output'])

    # Normalize medical terms
    dialogue['output'] = normalize_medical_terms(dialogue['output'])

    # Add safety disclaimers
    if needs_disclaimer(dialogue['output']):
        dialogue['output'] += "\n\n⚠️ This information is for educational purposes. Consult a healthcare professional for personalized advice."

    # Validate medical accuracy
    if not is_medically_sound(dialogue):
        return None  # Skip invalid data

    return dialogue
def remove_pii(text: str) -> str:
    """Remove personally identifiable information"""
    # Remove emails
    text = re.sub(r'\S+@\S+', '[EMAIL]', text)

    # Remove phone numbers
    text = re.sub(r'\+?\d[\d\s\-\(\)]{8,}\d', '[PHONE]', text)

    # Remove specific names (use NER)
    text = anonymize_names(text)

    return text
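The helpers referenced above (anonymize_names, normalize_medical_terms, needs_disclaimer, is_medically_sound) are project-specific and not shown in full. As a rough illustration of the idea, here is a minimal sketch of two of them, assuming spaCy's en_core_web_sm model for name detection; the real implementations may differ:

import spacy

# Hypothetical sketch — assumes spaCy's small English model is installed
nlp = spacy.load("en_core_web_sm")

def anonymize_names(text: str) -> str:
    """Replace detected PERSON entities with a placeholder."""
    doc = nlp(text)
    for ent in reversed(doc.ents):  # iterate in reverse so character offsets stay valid
        if ent.label_ == "PERSON":
            text = text[:ent.start_char] + "[NAME]" + text[ent.end_char:]
    return text

def needs_disclaimer(text: str) -> bool:
    """Crude heuristic: flag responses that give recommendations but lack a disclaimer."""
    gives_advice = any(k in text.lower() for k in ("recommend", "take", "dose", "treatment"))
    return gives_advice and "⚠️" not in text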
Final Dataset Stats
- Total dialogues: 40,127
- Average input length: 87 tokens
- Average output length: 246 tokens
- Medical categories: 42 (cardiology, neurology, etc.)
- Quality score: 4.7/5 (expert review)
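For reference, stats like these are straightforward to compute from the cleaned JSONL. A minimal sketch (assuming the data lives in medical_dialogues.jsonl and you have access to a LLaMA 3.2 tokenizer):

import json
from statistics import mean
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")  # gated model; any LLaMA 3.2 tokenizer works

with open("medical_dialogues.jsonl") as f:
    rows = [json.loads(line) for line in f]

print("Total dialogues:", len(rows))
print("Avg input tokens:", round(mean(len(tokenizer.encode(r["input"])) for r in rows)))
print("Avg output tokens:", round(mean(len(tokenizer.encode(r["output"])) for r in rows)))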
Choosing LLaMA 3.2 1B
Why LLaMA 3.2?
- Open source - Full control and commercial use
- Strong base - Meta’s latest architecture
- Efficient - 1B parameters runs on consumer GPUs
- Good instruction following - Pre-trained for chat
Model Size Trade-offs
| Model Size | GPU Memory | Inference Speed | Quality | My Choice |
|---|---|---|---|---|
| 405B | 810GB+ | Very Slow | Excellent | ❌ Overkill |
| 70B | 140GB | Slow | Excellent | ❌ Too expensive |
| 8B | 16GB | Medium | Very Good | ⚠️ Good option |
| 1B | 2GB | Fast | Good | ✅ Perfect balance |
For a medical chatbot that needs to be accessible, 1B was the sweet spot.
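If you want to verify the memory numbers on your own hardware, transformers can report the weight footprint directly. A quick sketch (the exact figure depends on dtype and quantization):

import torch
from transformers import AutoModelForCausalLM

# Load the base 1B model in half precision and report its weight footprint
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", torch_dtype=torch.float16)
print(f"Weights: {model.get_memory_footprint() / 1e9:.2f} GB")  # a couple of GB in fp16; less with 4-bit quantization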
LoRA: Parameter-Efficient Fine-Tuning
Full fine-tuning would require:
- 🔴 Far more GPU memory (gradients and optimizer states for every one of the 1B parameters)
- 🔴 $500+ in compute costs
- 🔴 48+ hours training time
LoRA (Low-Rank Adaptation) changed everything:
How LoRA Works
Instead of updating all 1 billion parameters, LoRA:
- Freezes the base model (1B params)
- Injects trainable low-rank matrices (2M params)
- Trains only the small matrices
- Merges them back after training
Result:
- ✅ 99.8% fewer trainable parameters
- ✅ 2x faster training
- ✅ Same quality as full fine-tuning
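Conceptually, the adaptation described above is just a low-rank delta added to each frozen weight matrix. A minimal NumPy sketch of the idea (illustrative only, not the PEFT internals):

import numpy as np

d, r, alpha = 2048, 16, 32            # hidden size, LoRA rank, scaling factor
W = np.random.randn(d, d)             # frozen pretrained weight (never updated)
A = np.random.randn(r, d) * 0.01      # trainable "down" projection
B = np.zeros((d, r))                  # trainable "up" projection (initialized to zero)

def forward(x):
    # Base output plus the scaled low-rank update; only A and B receive gradients
    return W @ x + (alpha / r) * (B @ (A @ x))

# After training, the delta can be merged back into the base weight:
# W_merged = W + (alpha / r) * (B @ A)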
LoRA Configuration
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,                    # Rank - higher = more capacity
    lora_alpha=32,           # Scaling factor
    target_modules=[         # Which layers to adapt
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj"
    ],
    lora_dropout=0.05,       # Regularization
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, lora_config)
# Check trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable_params:,}") # ~2M instead of 1B!
Unsloth: 2x Faster Training
Unsloth optimizes training with:
- Flash Attention 2 - Faster attention computation
- Gradient checkpointing - Lower memory usage
- Optimized kernels - CUDA-level optimizations
Setup
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-1b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,          # Auto-detect
    load_in_4bit=True    # Quantization
)

# Enable Unsloth optimizations
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Optimized checkpointing
    random_state=42
)
Training speed comparison:
| Framework | Time per Epoch | GPU Memory |
|---|---|---|
| Vanilla PyTorch | 12 hours | 24GB |
| PEFT (LoRA) | 6 hours | 12GB |
| Unsloth | 3 hours | 8GB |
Training Configuration
Hyperparameters
After experimentation, here’s what worked:
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./medgenius-llama-checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,   # Effective batch size: 32
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    logging_steps=10,
    save_strategy="epoch",
    fp16=True,                       # Mixed precision training
    optim="adamw_8bit",              # Memory-efficient optimizer
    max_grad_norm=0.3,               # Gradient clipping
    seed=42
)
Why These Values?
- Learning rate (2e-4): Higher than typical full fine-tuning (5e-5) because only the small LoRA matrices are updated, and they tolerate a larger step size
- Batch size (32 effective): Balances memory and training stability
- 3 epochs: Medical domain converges fast; more risks overfitting
- Cosine schedule: Smooth learning rate decay
Training Process
from transformers import Trainer, DataCollatorForLanguageModeling
from datasets import load_dataset
# Load dataset
dataset = load_dataset("json", data_files="medical_dialogues.jsonl")
train_dataset = dataset["train"].shuffle(seed=42)
# Tokenize
def tokenize_function(examples):
    prompts = [
        f"### Instruction:\n{examples['instruction'][i]}\n\n### Input:\n{examples['input'][i]}\n\n### Response:\n{examples['output'][i]}"
        for i in range(len(examples['input']))
    ]
    return tokenizer(
        prompts,
        truncation=True,
        max_length=2048,
        padding="max_length"
    )

tokenized_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=train_dataset.column_names
)

# Collator for causal LM: pads batches and copies input_ids into labels
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator
)
trainer.train()
Training Logs
Epoch 1/3: Loss=1.234 | Time=2.8h | GPU=7.2GB
Epoch 2/3: Loss=0.876 | Time=2.8h | GPU=7.2GB
Epoch 3/3: Loss=0.654 | Time=2.8h | GPU=7.2GB
Total training time: 8.4 hours on RTX 4090
Evaluation
Quantitative Metrics
import numpy as np
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu

def evaluate_model(model, test_dataset):
    predictions = []
    references = []

    for example in test_dataset:
        pred = generate_response(model, example['input'])
        predictions.append(pred)
        references.append(example['output'])

    # ROUGE scores
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
    rouge_scores = [scorer.score(ref, pred) for ref, pred in zip(references, predictions)]

    # Medical accuracy (custom metric)
    accuracy = medical_accuracy_check(predictions, references)

    return {
        'rouge1': np.mean([s['rouge1'].fmeasure for s in rouge_scores]),
        'rouge2': np.mean([s['rouge2'].fmeasure for s in rouge_scores]),
        'rougeL': np.mean([s['rougeL'].fmeasure for s in rouge_scores]),
        'medical_accuracy': accuracy
    }
Results:
| Metric | Base LLaMA 3.2 | MedGenius (Fine-tuned) |
|---|---|---|
| ROUGE-1 | 0.42 | 0.78 |
| ROUGE-L | 0.38 | 0.72 |
| Medical Accuracy | 61% | 89% |
| Empathy Score | 3.2/5 | 4.7/5 |
Qualitative Analysis
Example 1: Symptom Assessment
Input: “I’ve been feeling dizzy and lightheaded when I stand up quickly.”
Base Model: “You might want to see a doctor about that. It could be several things.”
MedGenius: “What you’re describing sounds like orthostatic hypotension (postural hypotension) - a sudden drop in blood pressure when standing. This is common and usually not serious.
Possible causes:
- Dehydration
- Prolonged bed rest
- Medications (blood pressure meds, diuretics)
- Anemia
Recommendations:
- Stand up slowly in stages (sit first, then stand)
- Stay well-hydrated (8+ glasses water/day)
- Avoid sudden position changes
- Increase salt intake slightly (if not hypertensive)
See a doctor if you experience: fainting, chest pain, blurred vision, or if episodes are frequent and affecting daily life.”
Winner: MedGenius provides structured, actionable, medically accurate advice.
Publishing to Hugging Face
# Merge LoRA weights
model = model.merge_and_unload()
# Push to Hub
model.push_to_hub("huzaifa525/MedGenius-LLaMA-3.2-1B")
tokenizer.push_to_hub("huzaifa525/MedGenius-LLaMA-3.2-1B")
Model card (important for visibility):
---
language: en
license: llama3.2
tags:
- medical-ai
- llama
- lora
- healthcare
datasets:
- medical-dialogues
metrics:
- rouge
---
# MedGenius LLaMA 3.2 1B
Fine-tuned medical chatbot based on Meta's LLaMA 3.2 1B, trained on 40,000+ curated medical dialogues.
## Intended Use
- Medical education
- Symptom guidance (NOT diagnosis)
- Health information queries
## Limitations
⚠️ NOT a replacement for professional medical advice
⚠️ Cannot diagnose conditions
⚠️ Should not be used for emergency situations
## Training Details
- Base model: LLaMA 3.2 1B
- Method: LoRA (r=16, alpha=32)
- Dataset: 40,127 medical dialogues
- Training time: 8.4 hours on RTX 4090
Check it out: huggingface.co/huzaifa525/MedGenius-LLaMA-3.2-1B
Deployment
Local Inference
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "huzaifa525/MedGenius-LLaMA-3.2-1B",
    device_map="auto",
    load_in_4bit=True
)
tokenizer = AutoTokenizer.from_pretrained("huzaifa525/MedGenius-LLaMA-3.2-1B")
def chat(prompt: str, max_new_tokens: int = 512) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,            # sampling must be enabled for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
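One caveat: since the model was trained on the Instruction/Input/Response template shown earlier, raw questions work best when wrapped in the same format. An illustrative call (the question here is just an example):

prompt = (
    "### Instruction:\nYou are a medical AI assistant. Provide accurate, empathetic responses.\n\n"
    "### Input:\nI've been getting heartburn after large meals. What can I do?\n\n"
    "### Response:\n"
)
print(chat(prompt))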
FastAPI Production Deployment
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI()
class Query(BaseModel):
    message: str
    max_tokens: int = 256

@app.post("/chat")
async def medical_chat(query: Query):
    response = chat(query.message, max_new_tokens=query.max_tokens)
    return {"response": response}
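Assuming the snippet above lives in app.py and is served with `uvicorn app:app`, a quick client-side check looks like this:

import requests

resp = requests.post(
    "http://localhost:8000/chat",
    json={"message": "What are common causes of lower back pain?", "max_tokens": 256},
)
print(resp.json()["response"])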
Lessons Learned
What Worked
- Quality > Quantity: 40K high-quality > 400K noisy data
- LoRA is magic: 99.8% parameter reduction, same quality
- Unsloth 2x speedup: $200 training cost → $100
- Instruction format: Consistent formatting improves learning
- Safety first: Built-in disclaimers prevent misuse
What Didn’t Work
- Larger LoRA rank (r=64): Overfitting, no quality gain
- Higher learning rate (5e-4): Training instability
- More epochs (5+): Memorization, not generalization
- Synthetic data only: Lacks real medical nuance
Mistakes to Avoid
- ❌ Skipping data cleaning: Garbage in = garbage out
- ❌ Training without validation: Can’t detect overfitting (see the sketch after this list)
- ❌ Ignoring safety: Medical AI needs strong safeguards
- ❌ Wrong base model: LLaMA > GPT-2 for instruction following
- ❌ No evaluation plan: How do you know it works?
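On the validation point above: a held-out split is cheap to add. A minimal sketch against the training code from earlier (the 5% split and the tokenized_train/tokenized_eval names are illustrative, not the exact setup used):

# Hold out 5% of the dialogues for evaluation during training
split = dataset["train"].train_test_split(test_size=0.05, seed=42)
train_dataset, eval_dataset = split["train"], split["test"]

# Tokenize both splits as before, then hand the eval split to the Trainer
trainer = Trainer(
    model=model,
    args=training_args,          # also enable per-epoch evaluation in TrainingArguments (eval_strategy="epoch")
    train_dataset=tokenized_train,
    eval_dataset=tokenized_eval,
    data_collator=data_collator
)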
Cost Breakdown
Total project cost: $473
- Dataset curation: $120 (GPT-4 API for synthetic data)
- GPU compute (Vast.ai): $243 (RTX 4090, 12 hours)
- Hugging Face Pro: $9/month
- Domain expertise (my time): Priceless 😄
Future Improvements
- Multimodal: Add medical image understanding
- Multilingual: Expand to Spanish, Hindi, Arabic
- Specialized models: Cardiology, Neurology sub-models
- Reinforcement learning: RLHF from doctor feedback
- Federated learning: Train on hospital data privately
Conclusion
Fine-tuning LLaMA 3.2 for medical AI taught me that domain expertise + clean data + efficient training = production-ready models in weeks, not months.
MedGenius LLaMA now handles patient queries with 89% medical accuracy, empathetic responses, and proper safety disclaimers - all in a 1B parameter model that runs on consumer hardware.
Key Takeaways:
- Data curation is 50% of the work - Don’t rush it
- LoRA makes fine-tuning accessible - 2M params vs 1B
- Unsloth cuts costs in half - Use it
- Medical AI needs safeguards - Never skip disclaimers
- Evaluation is critical - Quantitative + qualitative
Resources
- Model: HuggingFace
- Code: GitHub
- Paper: Fine-Tuning LLaMA for Healthcare (coming soon)
Connect
Questions about medical AI or fine-tuning? Let’s connect:
- GitHub: github.com/huzaifa525
- LinkedIn: linkedin.com/in/huzefanalkheda
- Email: huzaifanalkheda@gmail.com
Building healthcare AI? I’d love to hear about your project. Drop a comment or reach out!