Metadata-Version: 2.1
Name: synda
Version: 0.1.2
Summary: A synthetic data generator pipeline
License: Apache 2.0
Keywords: synthetic data,pipeline,llm
Author: TimothePearce
Author-email: timothe.pearce@gmail.com
Requires-Python: >=3.11,<4.0
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: litellm (==1.55.12)
Requires-Dist: pandas (>=2.2.3,<3.0.0)
Requires-Dist: pydantic (>=2.10.5,<3.0.0)
Requires-Dist: python-dotenv (>=1.0.1,<2.0.0)
Requires-Dist: pyyaml (>=6.0.2,<7.0.0)
Requires-Dist: rich (>=13.9.4,<14.0.0)
Description-Content-Type: text/markdown

# Synda

> [!WARNING]
> This project is in its very early stages of development and should not be used in production environments.

> [!NOTE]
> PRs are more than welcome. Check the roadmap if you want to contribute, or open a discussion to submit a use case.

Synda (*synthetic data*) is a package for building synthetic data generation pipelines.
It is opinionated and fast by design, with plans to become highly configurable in the future.


## Installation

```bash
poetry add synda
```

## Usage

1. Create a YAML configuration file (e.g., `config.yaml`) that defines your pipeline:

```yaml
input:
  type: csv
  properties:
    path: tests/stubs/simple_pipeline/source.csv
    target_column: content
    separator: "\t"

pipeline:
  - type: split
    method: chunk
    parameters:
      size: 500

  - type: generation
    method: llm
    parameters:
      provider: openai
      model: gpt-4o-mini
      template: |
        Ask a question regarding the content.
        content: {chunk}

        Instructions :
        1. Use english only
        2. Keep it short

        question:

  - type: ablation
    method: llm-judge-binary
    parameters:
      provider: openai
      model: gpt-4o-mini
      consensus: all
      criteria:
        - Is the text written in english?
        - Is the text consistent?

output:
  type: csv
  properties:
    path: tests/stubs/simple_pipeline/output.csv
    separator: "\t"
```

2. Run the following command:

```bash
poetry run synda -i config.yaml
```

## Pipeline Structure

A Synda pipeline configuration consists of three main parts:

- **Input**: Data source configuration
- **Pipeline**: Sequence of transformation and generation steps
- **Output**: Configuration for the generated data output
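
As a minimal sketch of this three-part layout, the snippet below checks that a parsed configuration contains the `input`, `pipeline`, and `output` sections. The `missing_sections` helper and the `config` dict are hypothetical and only for illustration; they are not part of Synda's API, and Synda's own validation may differ.

```python
# Hypothetical helper, NOT part of Synda's API: it only illustrates the
# three top-level sections described above.
REQUIRED_SECTIONS = ("input", "pipeline", "output")


def missing_sections(config: dict) -> list[str]:
    """Return the required top-level sections that are absent from `config`."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# A dict mirroring the shape of the YAML example in the Usage section,
# shortened to a single pipeline step.
config = {
    "input": {
        "type": "csv",
        "properties": {
            "path": "tests/stubs/simple_pipeline/source.csv",
            "target_column": "content",
            "separator": "\t",
        },
    },
    "pipeline": [
        {"type": "split", "method": "chunk", "parameters": {"size": 500}},
    ],
    "output": {
        "type": "csv",
        "properties": {
            "path": "tests/stubs/simple_pipeline/output.csv",
            "separator": "\t",
        },
    },
}

print(missing_sections(config))         # []
print(missing_sections({"input": {}}))  # ['pipeline', 'output']
```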

### Available Steps

- **split**: Breaks down data into chunks of defined size
- **generation**: Generates content using LLM models
- **ablation**: Filters data based on defined criteria
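
To make the three steps concrete, here is a rough plain-Python sketch of what each one does conceptually. It is not Synda's implementation: the function names are hypothetical, `size` is assumed to count characters, the LLM is replaced by a stub function, and the judges are ordinary predicates standing in for LLM calls. The ablation step keeps an item only when every criterion passes, which is how `consensus: all` is interpreted here.

```python
from typing import Callable


def split_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters.

    Assumption: `size` counts characters; Synda may measure it differently.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [text[i:i + size] for i in range(0, len(text), size)]


def generate(chunks: list[str], template: str, llm: Callable[[str], str]) -> list[str]:
    """Fill the template's {chunk} placeholder and pass each prompt to `llm`."""
    return [llm(template.format(chunk=chunk)) for chunk in chunks]


def ablate(items: list[str], judges: list[Callable[[str], bool]]) -> list[str]:
    """Keep only the items that every judge accepts (consensus: all)."""
    return [item for item in items if all(judge(item) for judge in judges)]


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call: echoes the prompt's second line."""
    return "Q: " + prompt.splitlines()[1]


template = "Ask a question regarding the content.\ncontent: {chunk}\nquestion:"

chunks = split_chunks("abcdefghij", 4)
questions = generate(chunks, template, fake_llm)
kept = ablate(questions, [lambda q: q.startswith("Q:"), lambda q: len(q) >= 16])

print(chunks)  # ['abcd', 'efgh', 'ij']
print(kept)    # ['Q: content: abcd', 'Q: content: efgh']
```

The last question is dropped because it fails the second judge, even though it passes the first.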

## Roadmap

Checked items are done; the remaining features are planned for future releases:

- [x] Implement a Proof of Concept
- [x] Implement a common interface (Node) for input and output of each step
- [ ] Add SQLite support
- [ ] Add a setter command for `.env` variables (OpenAI key, etc.)
- [ ] Trace each synthetic data item with its history
- [ ] Store each execution and step in DB
- [ ] Allow pausing and resuming pipelines
- [ ] Enable caching of each step's output
- [ ] Implement a scriptable step for developers
- [ ] Design other steps and methods

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
