Metadata-Version: 2.4
Name: outlier-cleaner
Version: 0.1.5
Summary: A Python package for detecting and removing outliers in data using various statistical methods
Home-page: https://github.com/SubaashNair/OutlierCleaner
Author: Subashanan Nair
Author-email: subaashnair12@gmail.com
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.19.0
Requires-Dist: pandas>=1.2.0
Requires-Dist: matplotlib>=3.3.0
Requires-Dist: seaborn>=0.11.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# OutlierCleaner

A Python package for detecting and removing outliers in data using various statistical methods such as IQR and Z-score.

## Features

- Remove outliers using IQR (Interquartile Range) method
- Remove outliers using Z-score method
- Add Z-score columns to your DataFrame
- Clean multiple columns using pre-calculated Z-scores
- Batch clean all Z-score columns at once
- Visualize outliers with boxplots and histograms
- Generate detailed reports on outlier removal
- Support for cleaning multiple columns at once
- Comprehensive documentation and examples

## Installation

```bash
pip install outlier-cleaner
```

## Usage

```python
from outlier_cleaner import OutlierCleaner
import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {
    'height': np.random.normal(170, 10, 1000),
    'weight': np.random.normal(70, 15, 1000)
}
df = pd.DataFrame(data)

# Create an OutlierCleaner instance
cleaner = OutlierCleaner(df)

# Method 1: Add Z-score columns and clean using them
df_with_zscores = cleaner.add_zscore_columns()  # Adds '_zscore' columns for all numeric columns
cleaned_df, info = cleaner.clean_zscore_columns(threshold=3.0)  # Cleans all columns with Z-scores

# Method 2: Add Z-scores for specific columns only
cleaner.reset()  # Reset to original data so this example does not depend on Method 1
cleaner.add_zscore_columns(columns=['height'])  # Adds only 'height_zscore'
cleaned_df, info = cleaner.remove_outliers_zscore('height', threshold=2.5)  # Uses existing Z-score column

# Method 3: Clean using IQR method
cleaner.reset()  # Reset to original data
cleaned_df, info = cleaner.remove_outliers_iqr('height')
cleaner.visualize_outliers('height')

# Method 4: Clean multiple columns at once
cleaner.reset()
cleaned_df, info = cleaner.clean_columns(method='iqr', columns=['height', 'weight'])

# Get a summary report
report = cleaner.get_summary_report()
print(report)
```

## Methods

### Add Z-score Columns
```python
cleaner.add_zscore_columns(columns=None)
```
- Adds a new Z-score column for each selected numeric column
- Each new column is named `<original_column_name>_zscore` (for example, `height_zscore`)
- If `columns=None`, processes all numeric columns
- Returns the modified DataFrame
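
As a rough illustration of what a Z-score column contains, the sketch below computes one with plain pandas. It is not the package's implementation: the helper name `zscore_columns_sketch` is hypothetical, and the use of the sample standard deviation (pandas' default, `ddof=1`) is an assumption, since this README does not state which convention the package follows.

```python
import pandas as pd


def zscore_columns_sketch(df, columns=None):
    """Return a copy of df with a '<name>_zscore' column per selected column."""
    out = df.copy()
    if columns is None:
        # Mirror the documented default: process every numeric column.
        columns = out.select_dtypes(include="number").columns.tolist()
    for col in columns:
        # Assumption: sample standard deviation (pandas default, ddof=1).
        out[f"{col}_zscore"] = (out[col] - out[col].mean()) / out[col].std()
    return out


example = pd.DataFrame(
    {"height": [160.0, 170.0, 180.0], "label": ["a", "b", "c"]}
)
result = zscore_columns_sketch(example)
print(result)
```

The non-numeric `label` column is skipped, so only `height_zscore` is added.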

### Clean All Z-score Columns
```python
cleaner.clean_zscore_columns(threshold=3.0)
```
- Automatically cleans all columns that have associated Z-score columns
- Uses pre-calculated Z-scores for efficiency
- Applies the same threshold to all columns
- Returns cleaned DataFrame and outlier information
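
The idea of applying one threshold across every Z-score column can be sketched in plain pandas as follows. This is an illustration under assumptions, not the package's code: the helper name `clean_by_zscore_columns_sketch` is hypothetical, a row is assumed to be dropped as soon as any of its Z-scores exceeds the threshold in absolute value, and the shape of the returned `info` dictionary is invented for the example.

```python
import pandas as pd


def clean_by_zscore_columns_sketch(df, threshold=3.0, suffix="_zscore"):
    """Drop rows whose absolute Z-score exceeds threshold in any Z-score column."""
    zcols = [c for c in df.columns if c.endswith(suffix)]
    # Keep a row only if every one of its Z-scores is within the threshold.
    keep = (df[zcols].abs() <= threshold).all(axis=1)
    # Hypothetical report format: number of flagged values per original column.
    info = {c[: -len(suffix)]: int((df[c].abs() > threshold).sum()) for c in zcols}
    return df[keep], info


# The Z-score values here are illustrative numbers, not computed from x and y.
example = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 100.0],
        "x_zscore": [-0.5, -0.4, -0.3, 3.5],
        "y": [5.0, -40.0, 6.0, 7.0],
        "y_zscore": [0.1, -3.2, 0.0, 0.2],
    }
)
cleaned, info = clean_by_zscore_columns_sketch(example, threshold=3.0)
print(cleaned.index.tolist())
print(info)
```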

### IQR Method
```python
cleaner.remove_outliers_iqr(column, lower_factor=1.5, upper_factor=1.5)
```
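
For readers unfamiliar with the IQR rule, here is a minimal sketch of how such bounds are commonly computed. It assumes that `lower_factor` and `upper_factor` scale the interquartile range below the first quartile and above the third quartile respectively, and that values exactly on a bound are kept; the README does not confirm either detail. The helper name `iqr_bounds_sketch` is hypothetical.

```python
import pandas as pd


def iqr_bounds_sketch(series, lower_factor=1.5, upper_factor=1.5):
    """Return (lower, upper) bounds from the conventional IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    # Assumption: each factor scales the IQR on its own side of the box.
    return q1 - lower_factor * iqr, q3 + upper_factor * iqr


values = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])
low, high = iqr_bounds_sketch(values)
# Assumption: values equal to a bound count as inliers.
kept = values[(values >= low) & (values <= high)]
print(low, high)
print(kept.tolist())
```

Using separate factors allows asymmetric bounds, for example a tighter upper limit for data with a long right tail.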

### Z-score Method
```python
cleaner.remove_outliers_zscore(column, threshold=3.0)
```
- Uses the pre-calculated Z-score column for `column` if one exists
- Otherwise calculates the Z-scores on the fly
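
The "use the existing column, otherwise compute" behaviour described above might look roughly like the sketch below. It is an assumption-laden illustration in plain pandas rather than the package's implementation: `remove_outliers_zscore_sketch` is a hypothetical name, the `_zscore` suffix follows the naming described in this README, and the fallback uses the sample standard deviation.

```python
import pandas as pd


def remove_outliers_zscore_sketch(df, column, threshold=3.0):
    """Keep rows whose absolute Z-score for `column` is within threshold."""
    zcol = f"{column}_zscore"
    if zcol in df.columns:
        # Reuse the pre-calculated Z-scores.
        z = df[zcol]
    else:
        # Fallback (assumption: sample standard deviation, ddof=1).
        z = (df[column] - df[column].mean()) / df[column].std()
    return df[z.abs() <= threshold]


raw = pd.DataFrame({"v": [10.0] * 9 + [100.0]})
# With only 10 rows no Z-score can reach 3, so a lower threshold is used here.
cleaned = remove_outliers_zscore_sketch(raw, "v", threshold=2.0)
print(len(raw), len(cleaned))
```

In very small samples the largest attainable Z-score is bounded by the sample size, which is why the example lowers the threshold instead of using the default of 3.0.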

### Clean Multiple Columns
```python
cleaner.clean_columns(method='iqr', columns=None, **kwargs)
```

## Requirements

- Python >= 3.7
- numpy >= 1.19.0
- pandas >= 1.2.0
- matplotlib >= 3.3.0
- seaborn >= 0.11.0

## License

This project is licensed under the MIT License. See the LICENSE file for details.

## Author

Subashanan Nair

## Contributing

Contributions are welcome! Feel free to submit a pull request.
