Metadata-Version: 2.4
Name: slimformers
Version: 1.4.0
Summary: Lightweight Optimization and Model Adaptation
Author-email: Caden Chen <cadenc.woss@gmail.com>
License: MIT
Keywords: transformers,LLM,pruning,LoRA,model efficiency
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.38.0
Requires-Dist: peft>=0.7.0
Requires-Dist: rich>=13.0.0
Requires-Dist: psutil>=5.9.0

# Slimformers

Slimformers is a lightweight Python framework for pruning and fine-tuning transformer models. It supports activation-based MLP (FFN) pruning and low-rank adaptation (LoRA) without requiring manual layer specification.

# Features

- Prunes neurons based on average activations across multiple batches (see the sketch below)
- Automatic FFN and gated FFN block discovery for common architectures (GPT-2, BERT, LLaMA)
- Safely rebuilds pruned `nn.Linear` and `Conv1D` layers
- LoRA fine-tuning with auto-inferred target modules
- Compatible with Hugging Face models and tokenizers
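
For intuition, here is a hedged sketch of what activation-based neuron selection can look like. It follows the same strategy contract as the "Custom Prune Strategy" example further down, but the function name and details here are illustrative, not the library's internal implementation:

```python
import torch

def mean_activation_selection(activations: torch.Tensor, sparsity: float):
    """Rank FFN neurons by mean absolute activation; keep the top (1 - sparsity)."""
    # Collapse the batch (and sequence) dimensions, leaving one score per neuron
    dims = tuple(range(activations.dim() - 1))
    magnitude = activations.abs().mean(dim=dims)

    total = magnitude.size(0)
    k = int((1.0 - sparsity) * total)  # number of neurons to keep
    return torch.topk(magnitude, k=k).indices, total
```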

# Quick Start

## Basic Pruning

```python
from slimformers import Pruner
from transformers import AutoModel, AutoTokenizer
import torch

# Load your model
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Create pruner
pruner = Pruner(model)

# Prepare your data (each batch is a dict with input_ids, attention_mask, etc.)
dataloader = your_dataloader_here

# Prune 30% of neurons based on activation magnitudes
pruner.prune_all_mlp_layers(
    dataloader=dataloader,
    sparsity=0.3,
    max_batches=10
)
```
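
One way to build `your_dataloader_here` for calibration, assuming the pruner consumes dict batches of `input_ids` and `attention_mask` as the comment above suggests (the texts are placeholders):

```python
from torch.utils.data import DataLoader

# A few representative texts are enough to gather activation statistics
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Pruning removes low-activation neurons from FFN blocks.",
]
encodings = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Default collation stacks each key, so every batch is a dict of tensors
dataset = [{k: v[i] for k, v in encodings.items()} for i in range(len(texts))]
dataloader = DataLoader(dataset, batch_size=2)
```
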
## LoRA Fine-tuning

```python
from slimformers import lora_finetune
from peft import TaskType

# Fine-tune with LoRA after pruning
fine_tuned_model = lora_finetune(
    model=model,
    dataloader=train_dataloader,
    epochs=3,
    lr=1e-4,
    device="cuda",
    r=8,
    alpha=16,
    task_type=TaskType.TOKEN_CLS
)
```
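
Here `r` is the LoRA rank and `alpha` the scaling factor; larger ranks add capacity at the cost of more trainable parameters. If you need a `train_dataloader` to experiment with, a minimal sketch for token classification follows; the all-zero labels are purely illustrative, and it assumes batches carry a `labels` key:

```python
import torch
from torch.utils.data import DataLoader

texts = ["Slimformers prunes transformer FFN layers."]
enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
labels = torch.zeros_like(enc["input_ids"])  # placeholder class ids, one per token

train_dataset = [
    {"input_ids": enc["input_ids"][i],
     "attention_mask": enc["attention_mask"][i],
     "labels": labels[i]}
    for i in range(len(texts))
]
train_dataloader = DataLoader(train_dataset, batch_size=1)
```
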
## Custom Prune Strategy

```python
import torch

def custom_neuron_selection(activations, sparsity):
    """Custom strategy: keep neurons with highest variance"""
    if activations.dim() == 3:
        variance = activations.var(dim=(0,1))
    else:
        variance = activations.var(dim=0)
    
    total = variance.size(0)
    k = int((1.0 - sparsity) * total)
    return torch.topk(variance, k=k).indices, total

# Use custom strategy
pruner = Pruner(model, pruning_strategy=custom_neuron_selection)
```
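
A strategy receives the captured activation tensor and the target sparsity, and returns the indices of the neurons to keep together with the original neuron count, as in the example above.
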
## Pruning Report

After pruning, `pruner.report()` displays a summary of the compression results, including:
- Original and pruned parameter counts
- Percentage reduction in model size
- CPU and GPU memory usage before and after pruning
- Peak GPU memory usage (if CUDA is enabled)

### Example

Pruning was run on `deepseek-ai/deepseek-coder-1.3b-base` with 40% sparsity on a Lenovo ThinkPad T490 (Intel i5-8365U CPU, no GPU!):
- Original Parameters: `1,346,471,936`
- Pruned Parameters: `1,024,855,424`
- Total Reduction: `321,616,512 (23.89%)`
- CPU Memory (Before --> After): `5398.57 MB --> 4253.34 MB (-1145.23 MB)`
