Metadata-Version: 2.1
Name: fmeval
Version: 1.0.1
Summary: Amazon Foundation Model Evaluations
License: Apache-2.0
Author: Amazon FMEval Team
Author-email: amazon-fmeval-team@amazon.com
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Requires-Dist: IPython
Requires-Dist: aiohttp (>=3.9.2,<4.0.0)
Requires-Dist: bert-score (>=0.3.13,<0.4.0)
Requires-Dist: evaluate (>=0.4.0,<0.5.0)
Requires-Dist: grpcio (>=1.60.0,<2.0.0)
Requires-Dist: ipykernel (>=6.26.0,<7.0.0)
Requires-Dist: jiwer (>=3.0.3,<4.0.0)
Requires-Dist: markdown
Requires-Dist: matplotlib (>=3.8.0,<4.0.0)
Requires-Dist: mypy-boto3-bedrock (>=1.33.2,<2.0.0)
Requires-Dist: nltk (>=3.8.1,<4.0.0)
Requires-Dist: pandas (==2.1.4)
Requires-Dist: pyarrow
Requires-Dist: pyfunctional (==1.4.3)
Requires-Dist: ray (==2.9.1)
Requires-Dist: rouge-score (>=0.1.2,<0.2.0)
Requires-Dist: sagemaker (>=2.199.0,<3.0.0)
Requires-Dist: scikit-learn (>=1.3.1,<2.0.0)
Requires-Dist: semantic-version (==2.10.0)
Requires-Dist: testbook (>=0.4.2,<0.5.0)
Requires-Dist: torch (>=2.0.0,!=2.0.1,!=2.1.0)
Requires-Dist: transformers (>=4.36.0,<5.0.0)
Requires-Dist: urllib3 (==1.26.18)
Description-Content-Type: text/markdown

## Foundation Model Evaluations Library
`fmeval` is a library to evaluate Large Language Models (LLMs) in order to help select the best LLM
for your use case. The library evaluates LLMs for the following tasks:
* Open-ended generation - The production of natural human responses to text that does not have a pre-defined structure.
* Text summarization - The generation of a condensed summary retaining the key information contained in a longer text.
* Question Answering - The generation of a relevant and accurate response to a question.
* Classification - Assigning a category, such as a label or score, to text based on its content.

The library contains:
* Algorithms to evaluate LLMs for Accuracy, Toxicity, Semantic Robustness and
  Prompt Stereotyping across different tasks.
* Implementations of the `ModelRunner` interface. `ModelRunner` encapsulates the logic for invoking different types of LLMs, exposing a `predict` method to simplify interactions with LLMs within the eval algorithm code. We have built-in support for Amazon SageMaker Endpoints and JumpStart models. The user can extend the interface for their own model classes by implementing the `predict` method.
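To illustrate the `ModelRunner` contract, here is a minimal, self-contained sketch. The `EchoModelRunner` class and its echo behavior are toy stand-ins invented for this example; in real code you would subclass `ModelRunner` from `fmeval.model_runners.model_runner` and call your hosted LLM inside `predict` (check that module for the exact signature, which returns an output text and an optional log probability):

```python
from typing import Optional, Tuple


class EchoModelRunner:
    """Toy stand-in illustrating the ModelRunner contract: predict()
    takes a prompt string and returns (output_text, log_probability).
    A real implementation would invoke a hosted LLM here."""

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # Replace this with a call to your model's inference endpoint.
        return f"Echo: {prompt}", None


runner = EchoModelRunner()
output, log_prob = runner.predict("Hello")
```

Any object exposing a compatible `predict` method can then be passed wherever the eval algorithms expect a model.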

## Installation
`fmeval` is developed with Python 3.10. To install the package, run:

```
pip install fmeval
```

## Usage
You can see examples of running evaluations on your LLMs with built-in or custom datasets in
the [examples folder](https://github.com/aws/fmeval/tree/main/examples).

The main steps for using `fmeval` are:
1. Create a `ModelRunner` that can invoke your LLM. `fmeval` provides built-in support for Amazon SageMaker Endpoints and JumpStart LLMs. You can also extend the `ModelRunner` interface for LLMs hosted anywhere.
2. Use any of the supported [eval_algorithms](https://github.com/aws/fmeval/tree/main/src/fmeval/eval_algorithms).

For example,
```
from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig

eval_algo = Toxicity(ToxicityConfig())
eval_output = eval_algo.evaluate(model=model_runner)
```
*Note: You can update the default eval config parameters for your specific use case.*

### Using a custom dataset for an evaluation
By default, the eval algorithms compute their scores on pre-configured built-in datasets.
To use a custom dataset instead:
1. Create a [DataConfig](https://github.com/aws/fmeval/blob/main/src/fmeval/data_loaders/data_config.py)
   for your custom dataset
```
config = DataConfig(
    dataset_name="custom_dataset",
    dataset_uri="./custom_dataset.jsonl",
    dataset_mime_type="application/jsonlines",
    model_input_location="question",
    target_output_location="answer",
)
```
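Given the `model_input_location` and `target_output_location` above, each line of the dataset file is a JSON object with `question` and `answer` fields. The sample record below is invented for illustration:

```python
import json

# One record per line in the JSON Lines file; the "question" and "answer"
# keys match model_input_location and target_output_location in the
# DataConfig above.
record = {"question": "What is the capital of France?", "answer": "Paris"}
line = json.dumps(record)

# Round-trip to show the on-disk line parses back to the same record.
parsed = json.loads(line)
```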

2. Use an eval algorithm with a custom dataset
```
eval_algo = Toxicity(ToxicityConfig())
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config)
```

*Please refer to the [developer guide](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-foundation-model-evaluate-auto.html) and
[examples](https://github.com/aws/fmeval/tree/main/examples) for more details around the usage of
eval algorithms.*

## Troubleshooting

1. Users running `fmeval` on a Windows machine may encounter the error `OSError: [Errno 0] AssignProcessToJobObject() failed` when `fmeval` internally calls `ray.init()`. This OS error is a known Ray issue, and is detailed [here](https://github.com/ray-project/ray/issues/21994). Multiple users have reported that installing Python from the [official Python website](https://www.python.org/downloads/windows/) rather than the Microsoft store fixes this issue. You can view more details on limitations of running Ray on Windows on [Ray's webpage](https://docs.ray.io/en/latest/ray-overview/installation.html#windows-support).

2. If you run into the error `error: can't find Rust compiler` while installing `fmeval` on a Mac, please try running the steps below.

```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup install 1.72.1
rustup default 1.72.1-aarch64-apple-darwin
rustup toolchain remove stable-aarch64-apple-darwin
rm -rf $HOME/.rustup/toolchains/stable-aarch64-apple-darwin
mv $HOME/.rustup/toolchains/1.72.1-aarch64-apple-darwin $HOME/.rustup/toolchains/stable-aarch64-apple-darwin
```

3. If you run into out-of-memory (OOM) errors, especially while running evaluations that use LLMs as evaluators (such as toxicity and
summarization accuracy), your machine likely does not have enough memory to load the evaluator
models. By default, `fmeval` loads multiple copies of the model into memory to maximize parallelization; the exact number depends on the number of cores on the machine. To reduce the number of models loaded in parallel,
set the environment variable `PARALLELIZATION_FACTOR` to a value that suits your machine.
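For example, the variable can be set from Python before the evaluation runs (the value `1` is just an illustrative choice; tune it to your machine's memory):

```python
import os

# Limit the number of evaluator-model copies fmeval loads in parallel.
# Set this before the evaluation starts.
os.environ["PARALLELIZATION_FACTOR"] = "1"
```

Alternatively, export the variable in your shell before launching the script.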

## Development

### Setup and the use of `devtool`
Once you have created a virtual environment with Python 3.10, run the following commands to set up the development environment:
```
./devtool install_deps_dev
./devtool install_deps
./devtool all
```

Before submitting a PR, rerun `./devtool all` for testing and linting. It should run without errors.

### Adding python dependencies
We use [poetry](https://python-poetry.org/docs/) to manage python dependencies in this project. If you want to add a new
dependency, please update the [pyproject.toml](./pyproject.toml) file, and run the `poetry update` command to update the
`poetry.lock` file (which is checked in).

Other than this step to add dependencies, use devtool commands for installing dependencies, linting and testing. Execute the command `./devtool` without any arguments to see a list of available options.

### Adding your own evaluation algorithm and/or metrics

The evaluation algorithms and metrics provided by `fmeval` are implemented using `Transform` and `TransformPipeline` objects. You can leverage these existing tools to similarly implement your own metrics and algorithms in a modular manner.

Here, we provide a high-level overview of what these classes represent and how they are used. Specific implementation details can be found in their respective docstrings (see `src/fmeval/transforms/transform.py` and `src/fmeval/transforms/transform_pipeline.py`).

#### Preface
At a high level, an evaluation algorithm takes an initial tabular dataset consisting of a number of "records" (i.e. rows) and repeatedly transforms this dataset until the dataset either contains all the evaluation metrics, or at least all the intermediate data needed to compute said metrics. The transformations that get applied to the dataset inherently operate at a per-record level, and simply get applied to every record in the dataset to transform the dataset in full.
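The idea above can be sketched in plain Python, with the dataset as a list of dicts and a transform as a function applied to every record (the `add_input_length` step is a made-up example, not an fmeval API):

```python
# A dataset is a collection of records (dicts); a record-level transform
# is a function from record to record, applied to every row.
dataset = [
    {"model_input": "What is 2 + 2?"},
    {"model_input": "Name a primary color."},
]

def add_input_length(record):
    # One transformation step: augment the record with a derived column.
    record["input_length"] = len(record["model_input"])
    return record

dataset = [add_input_length(r) for r in dataset]
```

fmeval's `Transform` and `TransformPipeline` classes formalize exactly this pattern, as described below.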

#### The `Transform` class
We represent the concept of a record-level transformation using the `Transform` class. `Transform` is a callable class where its `__call__` method takes a single argument, `record`, which represents the record to be transformed. A record is represented by a Python dictionary. To implement your own record-level transformation logic, create a concrete subclass of `Transform` and implement its `__call__` method.

**Example:**

Let's implement a `Transform` for a simple, toy metric.

```
class NumSpaces(Transform):
    """
    Augments the input record (which contains some text data)
    with the number of spaces found in the text.
    """
    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        input_text = record["input_text"]
        record["num_spaces"] = input_text.count(" ")
        return record
```

One issue with this simple example is that the keys used for the input text data and the output data are both hard-coded. This generally isn't desirable, so let's improve on our running example.

```
class NumSpaces(Transform):
    """
    Augments the input record (which contains some text data)
    with the number of spaces found in the text.
    """

    def __init__(self, text_key, output_key):
        super().__init__(text_key, output_key)  # always need to pass all init args to superclass init
        self.text_key = text_key  # the dict key corresponding to the input text data
        self.output_key = output_key  # the dict key corresponding to the output data (i.e. number of spaces)

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        input_text = record[self.text_key]
        record[self.output_key] = input_text.count(" ")
        return record
```

Since `__call__` only takes a single argument, `record`, we pass the information regarding which keys to use for input and output data to `__init__` and save them as instance attributes. Note that all subclasses of `Transform` need to call `super().__init__` with all of their `__init__` arguments, due to low-level implementation details regarding how we apply the `Transform`s to the dataset.
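Since a `Transform` is callable, you can exercise it directly on a record dict. The snippet below uses a minimal stand-in for the `Transform` base class so it is self-contained; in real code, import `Transform` from `fmeval.transforms.transform` instead:

```python
from typing import Any, Dict


class Transform:
    """Minimal stand-in for fmeval.transforms.transform.Transform,
    just enough to make this example self-contained."""

    def __init__(self, *args, **kwargs):
        pass


class NumSpaces(Transform):
    """Augments the input record with the number of spaces in its text."""

    def __init__(self, text_key, output_key):
        super().__init__(text_key, output_key)
        self.text_key = text_key
        self.output_key = output_key

    def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
        record[self.output_key] = record[self.text_key].count(" ")
        return record


num_spaces = NumSpaces(text_key="model_output", output_key="num_spaces")
record = num_spaces({"model_output": "one two three"})
```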

#### The `TransformPipeline` class
While `Transform` encapsulates the logic for the record-level transformation, we still don't have a mechanism for applying the transform to a dataset. This is where `TransformPipeline` comes in. A `TransformPipeline` represents a sequence, or "pipeline", of `Transform` objects that you wish to apply to a dataset. After initializing a `TransformPipeline` with a list of `Transform`s, simply call its `execute` method on an input dataset.

**Example:**
Here, we implement a pipeline for a very simple evaluation. The steps are:
1. Construct LLM prompts from raw text inputs
2. Feed the prompts to a `ModelRunner` to get the model outputs
3. Compute the "number of spaces" metric we defined above

```
# Use the built-in utility Transform for generating prompts
gen_prompt = GeneratePrompt(
    input_keys=["model_input"],
    output_keys=["prompt"],
    prompt_template="Answer the following question: $model_input",
)

# Use the built-in utility Transform for getting model outputs
model = ... # some ModelRunner
get_model_outputs = GetModelOutputs(
    input_to_output_keys={"prompt": ["model_output"]},
    model_runner=model,
)

# Our new metric!
compute_num_spaces = NumSpaces(
    text_key="model_output",
    output_key="num_spaces",
)

my_pipeline = TransformPipeline([gen_prompt, get_model_outputs, compute_num_spaces])
dataset = ...  # load some dataset
dataset = my_pipeline.execute(dataset)
```

#### Conclusion
To implement new metrics, create a new `Transform` that encapsulates the logic for computing said metric. Since the logic for all evaluation algorithms can be represented as a sequence of different `Transform`s, implementing a new evaluation algorithm essentially amounts to defining a `TransformPipeline`. Please see the built-in evaluation algorithms for examples.

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This project is licensed under the Apache-2.0 License.

