Metadata-Version: 2.4
Name: dftly
Version: 0.0.2
Summary: dftly (pronounced deftly) is a simple library for a safe, expressive, config-file friendly, and readable DSL for encoding simple dataframe operations.
Author-email: Matthew McDermott <mattmcdermott8@gmail.com>
Project-URL: Homepage, https://github.com/mmcdermott/dftly
Project-URL: Issues, https://github.com/mmcdermott/dftly/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyYAML
Requires-Dist: lark
Requires-Dist: python-dateutil
Provides-Extra: polars
Requires-Dist: polars~=1.33.0; extra == "polars"
Dynamic: license-file

# DataFrame Transformation Language from YAML (dftly)

[![Python 3.12+](https://img.shields.io/badge/-Python_3.12+-blue?logo=python&logoColor=white)](https://www.python.org/downloads/release/python-3100/)
[![PyPI - Version](https://img.shields.io/pypi/v/dftly)](https://pypi.org/project/dftly/)
[![Documentation Status](https://readthedocs.org/projects/dftly/badge/?version=latest)](https://dftly.readthedocs.io/en/latest/?badge=latest)
[![Tests](https://github.com/mmcdermott/dftly/actions/workflows/tests.yaml/badge.svg)](https://github.com/mmcdermott/dftly/actions/workflows/tests.yaml)
[![Test Coverage](https://codecov.io/github/mmcdermott/dftly/graph/badge.svg?token=BV119L5JQJ)](https://codecov.io/github/mmcdermott/dftly)
[![Code Quality](https://github.com/mmcdermott/dftly/actions/workflows/code-quality-main.yaml/badge.svg)](https://github.com/mmcdermott/dftly/actions/workflows/code-quality-main.yaml)
[![Contributors](https://img.shields.io/github/contributors/mmcdermott/dftly.svg)](https://github.com/mmcdermott/dftly/graphs/contributors)
[![Pull Requests](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)](https://github.com/mmcdermott/dftly/pulls)
[![License](https://img.shields.io/badge/License-MIT-green.svg?labelColor=gray)](https://github.com/mmcdermott/dftly#license)

Dftly (pronounced "deftly") is a simple, expressive, human-readable DSL for encoding simple tabular
transformations over dataframes, designed for expression in YAML files. With dftly, you can transform your
data, deftly!

> [!WARNING]
> All of the code in this repository was generated by OpenAI's ChatGPT Codex. It has only
> received light human review, so while many examples work, expect rough edges.

## Installation

```bash
pip install dftly
```

To enable the optional polars execution engine, install with the extra:

```bash
pip install "dftly[polars]"
```

You can also install it locally with [`uv`](https://docs.astral.sh/uv/) by running:

```bash
uv sync
```

from the root of the repository.

## Usage

Dftly is designed to make it easy to specify simple dataframe transformations in a YAML file (or a
mapping-like format). In particular, with dftly, you can specify a mapping of output column names to
expressions over input columns, then easily execute that over an input table.

Suppose we have an input dataframe that looks like this:

```python
>>> import polars as pl
>>> from datetime import date
>>> df = pl.DataFrame({
...     "col1": [1, 2],
...     "col2": [3, 4],
...     "foo": ["5", "6"],
...     "col3": [date(2020, 1, 1), date(2021, 6, 15)],
...     "bp": ["120/80", "NULL"],
... })
>>> df
shape: (2, 5)
┌──────┬──────┬─────┬────────────┬────────┐
│ col1 ┆ col2 ┆ foo ┆ col3       ┆ bp     │
│ ---  ┆ ---  ┆ --- ┆ ---        ┆ ---    │
│ i64  ┆ i64  ┆ str ┆ date       ┆ str    │
╞══════╪══════╪═════╪════════════╪════════╡
│ 1    ┆ 3    ┆ 5   ┆ 2020-01-01 ┆ 120/80 │
│ 2    ┆ 4    ┆ 6   ┆ 2021-06-15 ┆ NULL   │
└──────┴──────┴─────┴────────────┴────────┘

```

But we want to produce a file that adds columns `col1` and `col2` together, converts the strings in `foo` to
integers, adds a timestamp onto the dates in `col3`, and extracts the systolic and diastolic blood pressure
from the `bp` column. We can express this in a YAML file as follows:

```python
>>> yaml_text = """
... sum: col1 + col2
... foo_as_int: foo as "%i"
... col3_with_time: col3 @ "11:59:59 p.m."
... systolic_bp: extract group 1 of (\\d+)/(\\d+) from bp
... diastolic_bp: extract group 2 of (\\d+)/(\\d+) from bp
... interpolate: "val {col1}"
... """
>>> input_schema={"col1": "int", "col2": "int", "foo": "str", "col3": "date", "bp": "str"}
>>> from dftly import from_yaml
>>> from dftly.polars import map_to_polars
>>> df.select(**map_to_polars(from_yaml(yaml_text, input_schema=input_schema)))
shape: (2, 6)
┌─────┬────────────┬─────────────────────┬─────────────┬──────────────┬─────────────┐
│ sum ┆ foo_as_int ┆ col3_with_time      ┆ systolic_bp ┆ diastolic_bp ┆ interpolate │
│ --- ┆ ---        ┆ ---                 ┆ ---         ┆ ---          ┆ ---         │
│ i64 ┆ i64        ┆ datetime[μs]        ┆ str         ┆ str          ┆ str         │
╞═════╪════════════╪═════════════════════╪═════════════╪══════════════╪═════════════╡
│ 4   ┆ 5          ┆ 2020-01-01 23:59:59 ┆ 120         ┆ 80           ┆ val 1       │
│ 6   ┆ 6          ┆ 2021-06-15 23:59:59 ┆ null        ┆ null         ┆ val 2       │
└─────┴────────────┴─────────────────────┴─────────────┴──────────────┴─────────────┘

```

You can also use a more direct, expansive form rather than the concise string forms:

```python
>>> yaml_text = """
... sum:
...   expression:
...     type: ADD
...     arguments:
...       - column: {name: col1, type: int}
...       - column: {name: col2, type: int}
... foo_as_int:
...   expression:
...     type: PARSE_WITH_FORMAT_STRING
...     arguments:
...       input:
...         column: {name: foo, type: str}
...       format: {literal: "%i"}
...       output_type: {literal: int}
... col3_with_time:
...   expression:
...     type: RESOLVE_TIMESTAMP
...     arguments:
...       date:
...         column: {name: col3, type: date}
...       time:
...         expression:
...           type: PARSE_WITH_FORMAT_STRING
...           arguments:
...             input: {literal: "11:59:59 p.m."}
...             output_type: {literal: clock_time}
...             format: {literal: AUTO}
... systolic_bp:
...   expression:
...     type: REGEX
...     arguments:
...       regex: {literal: "(\\\\d+)/(\\\\d+)"}
...       action: {literal: EXTRACT}
...       group: {literal: 1}
...       input:
...         column: {name: bp, type: str}
... diastolic_bp:
...   expression:
...     type: REGEX
...     arguments:
...       regex: {literal: "(\\\\d+)/(\\\\d+)"}
...       action: {literal: EXTRACT}
...       group: {literal: 2}
...       input:
...         column: {name: bp, type: str}
... interpolate:
...   expression:
...     type: STRING_INTERPOLATE
...     arguments:
...       pattern: {literal: "val {col1}"}
...       inputs:
...         col1:
...           column: {name: col1, type: int}
... """
>>> df.select(**map_to_polars(from_yaml(yaml_text, input_schema=input_schema)))
shape: (2, 6)
┌─────┬────────────┬─────────────────────┬─────────────┬──────────────┬─────────────┐
│ sum ┆ foo_as_int ┆ col3_with_time      ┆ systolic_bp ┆ diastolic_bp ┆ interpolate │
│ --- ┆ ---        ┆ ---                 ┆ ---         ┆ ---          ┆ ---         │
│ i64 ┆ i64        ┆ datetime[μs]        ┆ str         ┆ str          ┆ str         │
╞═════╪════════════╪═════════════════════╪═════════════╪══════════════╪═════════════╡
│ 4   ┆ 5          ┆ 2020-01-01 23:59:59 ┆ 120         ┆ 80           ┆ val 1       │
│ 6   ┆ 6          ┆ 2021-06-15 23:59:59 ┆ null        ┆ null         ┆ val 2       │
└─────┴────────────┴─────────────────────┴─────────────┴──────────────┴─────────────┘

```

Here is another example, showcasing a variety of additional operation types:

```python
>>> df = pl.DataFrame({
...     "col1": [1, None],
...     "col2": [3, 4],
...     "flag": [True, False],
...     "flag1": [True, False],
...     "flag2": [False, True],
...     "chartdate": [date(2024, 1, 1), date(2024, 1, 2)],
...     "text": ["foo123", "bar456"],
...     "dt": ["2024-01-01", "2024-01-02"],
... })
>>> df
shape: (2, 8)
┌──────┬──────┬───────┬───────┬───────┬────────────┬────────┬────────────┐
│ col1 ┆ col2 ┆ flag  ┆ flag1 ┆ flag2 ┆ chartdate  ┆ text   ┆ dt         │
│ ---  ┆ ---  ┆ ---   ┆ ---   ┆ ---   ┆ ---        ┆ ---    ┆ ---        │
│ i64  ┆ i64  ┆ bool  ┆ bool  ┆ bool  ┆ date       ┆ str    ┆ str        │
╞══════╪══════╪═══════╪═══════╪═══════╪════════════╪════════╪════════════╡
│ 1    ┆ 3    ┆ true  ┆ true  ┆ false ┆ 2024-01-01 ┆ foo123 ┆ 2024-01-01 │
│ null ┆ 4    ┆ false ┆ false ┆ true  ┆ 2024-01-02 ┆ bar456 ┆ 2024-01-02 │
└──────┴──────┴───────┴───────┴───────┴────────────┴────────┴────────────┘
>>> spec = """
... add: col1 + col2
... coalesce:
...   - col1
...   - col2
... conditional: col1 if flag else col2
... in_set: col1 in {1, 2}
... in_range: col1 in (3, 4]
... bool_ops: flag1 && !flag2
... parse: dt as "%Y-%m-%d"
... hashed: hash(col1)
... """
>>> schema = {
...     "col1": "int",
...     "col2": "int",
...     "flag": "bool",
...     "flag1": "bool",
...     "flag2": "bool",
...     "chartdate": "date",
...     "text": "str",
...     "dt": "str",
... }
>>> ops = from_yaml(spec, input_schema=schema)
>>> df.select(**map_to_polars(ops))
shape: (2, 8)
┌──────┬──────────┬─────────────┬────────┬──────────┬──────────┬────────────┬──────────────────────┐
│ add  ┆ coalesce ┆ conditional ┆ in_set ┆ in_range ┆ bool_ops ┆ parse      ┆ hashed               │
│ ---  ┆ ---      ┆ ---         ┆ ---    ┆ ---      ┆ ---      ┆ ---        ┆ ---                  │
│ i64  ┆ i64      ┆ i64         ┆ bool   ┆ bool     ┆ bool     ┆ date       ┆ u64                  │
╞══════╪══════════╪═════════════╪════════╪══════════╪══════════╪════════════╪══════════════════════╡
│ 4    ┆ 1        ┆ 1           ┆ true   ┆ false    ┆ true     ┆ 2024-01-01 ┆ 9057554573187823076  │
│ null ┆ 4        ┆ 4           ┆ null   ┆ null     ┆ false    ┆ 2024-01-02 ┆ 16397991471585692086 │
└──────┴──────────┴─────────────┴────────┴──────────┴──────────┴────────────┴──────────────────────┘

```

You can also compare values directly across numeric, date, time, and datetime
columns using both symbolic and fully-resolved forms:

```python
>>> from datetime import date, datetime, time
>>> df = pl.DataFrame({
...     "int_col": [1, 3],
...     "int_limit": [0, 3],
...     "dt_col": [
...         datetime(2024, 1, 1, 12, 0, 0),
...         datetime(2024, 1, 1, 13, 0, 0),
...     ],
...     "dt_limit": [
...         datetime(2024, 1, 1, 11, 30, 0),
...         datetime(2024, 1, 1, 13, 30, 0),
...     ],
...     "date_col": [date(2024, 1, 1), date(2024, 1, 3)],
...     "date_limit": [date(2024, 1, 2), date(2024, 1, 3)],
...     "time_col": [time(12, 0, 0), time(12, 30, 0)],
...     "time_limit": [time(12, 0, 0), time(12, 15, 0)],
... })
>>> spec = """
... gt: int_col > int_limit
... ge: dt_col >= dt_limit
... lt: date_col < date_limit
... le: time_col <= time_limit
... """
>>> schema = {
...     "int_col": "int",
...     "int_limit": "int",
...     "dt_col": "datetime",
...     "dt_limit": "datetime",
...     "date_col": "date",
...     "date_limit": "date",
...     "time_col": "time",
...     "time_limit": "time",
... }
>>> ops = from_yaml(spec, input_schema=schema)
>>> df.select(**map_to_polars(ops))
shape: (2, 4)
┌───────┬───────┬───────┬───────┐
│ gt    ┆ ge    ┆ lt    ┆ le    │
│ ---   ┆ ---   ┆ ---   ┆ ---   │
│ bool  ┆ bool  ┆ bool  ┆ bool  │
╞═══════╪═══════╪═══════╪═══════╡
│ true  ┆ true  ┆ true  ┆ true  │
│ false ┆ false ┆ false ┆ false │
└───────┴───────┴───────┴───────┘

```

## Design Documentation

### Key Principles

Dftly is designed to enable users to easily express a (1) class of simple dataframe operations (2) in a
human-readable way, that (3) can then be used across different execution engines through a middle-layer DSL
that is fully resolved and unambiguous.

> [!NOTE]
> Dftly will most often be used through downstream packages that make use of the common
> human-readable input format but may do intermediate processing of the YAML files their users specify before
> calling dftly's internal parsing and resolution functions.

> [!WARNING]
> dftly is _not_ designed to perform complex, interdependent operations across multiple dftly
> blocks -- it will neither type check such operations nor provide a meaningful dependency graph for a parsed
> dftly specification, and instead is designed to parse all specified blocks independently.

#### (1) Class of Simple Dataframe Operations

Dftly is _not_ designed to be a full SQL or dataframe manipulation DSL. Rather, it is only intended to capture
operations that we call "tabular transformations", meaning those that can be expressed as a simple function of
a (subset of) a single row of a dataframe, returning a single value (cell) in an output dataframe at an
analogous row. This excludes operations that are at a table-level, such as pivoting or grouping, as well as
operations that yield outputs over multiple rows, such as `explode` or `unpack` operations.

It also includes some operations that are very common in data pre-processing workflows but are less common in
typical SQL workflows, such as simple arithmetic operations, temporal resolution operations, and string
manipulation.

#### (2) Human-readable way

The entire point of dftly is to make it easy to express data pre-processing operations in a communicable but
unambiguous way. This is done by providing both a simplified language and a fully-resolved specification of
dftly supported operations -- with the simplified language designed for use with YAML files. Internally, dftly
will parse the simplified language into a fully-resolved form that can then be executed on a dataframe.

#### (3) Middle-layer DSL and execution engines

When the simplified form is parsed into a fully resolved form, the resulting structure is fully unambiguous
and readily translatable to a dataframe execution context. Notably, the core library itself _does not_ provide
any execution engine, but rather provides only the DSL for the fully resolved form and the parsing library for
the simplified form. Extensions of this library then enable, in an engine-specific manner, translation of the
fully resolved form into operations on the given engine. For example, the built-in `polars` extension (which
you can enable simply by installing the `dftly[polars]` extra, or by installing polars separately)
allows you to translate any fully resolved expression or input into a `polars` expression that can then be
used to manipulate a `polars` dataframe. In this way, it is possible to extend this library for new dataframe
engines easily without changing its human-readable input format and the DSL it supports.

### Typical Workflow

A typical workflow for using dftly (including both internal and external steps) would look like this:

1. A user specifies a YAML file which contains a map (or collection of maps) from output column names to
    dftly specifications for transformations to realize that output column.
2. The library in use (not dftly itself, but the library the user is using that depends on dftly) reads
    the YAML file and may perform some pre-processing (e.g., to extract sections for parsing with dftly from
    those used for other purposes).
3. As needed, the library calls `dftly.parse` on the (loaded) YAML file contents (in python data form --
    e.g., not as strings, but dictionaries, lists, etc.). This will return a map of output column names to
    fully resolved operations. If an extension is enabled, those resolved operations can be naively converted
    to source execution code.
4. The library then uses this output to execute those transformations (either in bulk across all columns or
    on a per-column basis in the output map from dftly) on the input dataframes in use and uses the outputs
    as needed.
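
As a rough illustration of steps 2 and 3 from the downstream library's side, the sketch below splits a loaded config into the part meant for dftly and the rest, then hands the former to a parser. The `transforms` key, the `split_config` and `resolve_columns` helpers, and the stand-in `fake_parse` are all hypothetical; in real use the `parse` argument would be dftly's own entry point (the text above names `dftly.parse`), whose exact signature is not shown here.

```python
def split_config(config, dftly_key="transforms"):
    """Separate the dftly section of a loaded config from everything else."""
    if not isinstance(config, dict):
        raise TypeError("expected the loaded YAML contents to be a mapping")
    dftly_section = config.get(dftly_key, {})
    other = {k: v for k, v in config.items() if k != dftly_key}
    return dftly_section, other


def resolve_columns(config, parse, dftly_key="transforms"):
    """Return (resolved output-column map, remaining config)."""
    dftly_section, other = split_config(config, dftly_key)
    return parse(dftly_section), other


def fake_parse(section):
    """Stand-in for the real parser so that this sketch runs on its own."""
    return {name: {"unparsed": spec} for name, spec in section.items()}


config = {
    "output_path": "out.parquet",  # used by the downstream library only
    "transforms": {"sum": "col1 + col2"},  # handed to the parser
}
resolved, other = resolve_columns(config, fake_parse)
print(resolved)
print(other)
```

Passing the parser in as an argument keeps the pre-processing logic testable without any particular engine or parser installed.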

### Fully resolved dftly DSL:

The purpose of the fully resolved form is to enable easy mapping of specified operations to dataframe engines
/ operations (e.g., `polars` expressions, SQL queries, etc.), in a manner that is technically unambiguous and
requires minimal technical debt or unnecessary complexity. All fully resolved entities are dictionary-like
objects, obeying one of the following simple templates:

#### Literals

Literals are simple string or typed literals (type determined by the YAML parser). They are expressed via a
one-element map with the key `literal` and the value being the literal value itself. For example:

```yaml
literal: $VALUE
```

#### Columns

Columns are simple references to a column in a dataframe. They are expressed via a one-element map with the
key `column` and the value being a map holding the column's `name` and, optionally, its `type`. For example:

```yaml
column:
  name: $COLUMN_NAME
  type: $COLUMN_TYPE # optional, if missing, type is assumed to be unknown & valid for subsequent operations.

```

#### Expressions

```yaml
expression:
  type: $EXPR_NAME
  arguments: '...list or map of literals, columns, or expressions'

```

Supported expressions include:

##### `ADD`

Adds two or more inputs together. Only supports a list of positional arguments, which must obey the following
type restrictions:

1. All inputs are numeric or duration types (in which case the output will be the same type as the inputs).
2. One input is a datetime and the rest are duration values (in which case the output will be a
    datetime type).

> [!NOTE]
> For strings, use the `STRING_INTERPOLATE` expression instead; addition of strings is not supported.

##### `SUBTRACT`

Subtracts the second input from the first. Only supports a list of two positional arguments, which must obey
the following type restrictions:

1. Both inputs are numeric or duration types (in which case the output will be the same type as the inputs).
2. The first input is a datetime and the second is a duration value (in which case the output will be a
    datetime type).

##### `RESOLVE_TIMESTAMP`

Resolves a lower resolution timestamp to a higher resolution timestamp, e.g. from date to datetime. Only
supports one of several possible sets of keyword arguments that clearly indicate the lower and higher
resolution components of the timestamp. Dates and times have resolution controlled via the following
compositional relationships:

```
datetime:
  date:
    year: $YEAR
    month: $MONTH
    day: $DAY
  time:
    hour: $HOUR
    minute: $MINUTE
    second: $SECOND
    microsecond: $MICROSECOND
```

The keyword arguments allowable for this expression must satisfy the property that they are mutually
compatible and not relatively incomplete. For example, a `date` and a `time` can be passed, as can a `date`
and an `hour`, but a `date` and a `minute` cannot: while the latter is compatible with the former, the
combination is incomplete because the `hour` is missing. Likewise, a `date` and a `year` cannot be passed
together, as they are not mutually compatible: the `year` is already contained in the `date`.
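
One way to read the compatibility and completeness rules is sketched below. This is an interpretation of the prose above, not dftly's validation code: `check_timestamp_keys` is a hypothetical helper that expands each keyword into its leaf components, rejects any component supplied twice, and rejects gaps between the coarsest and finest component supplied.

```python
# Leaf components from coarsest to finest, following the hierarchy above.
LEAVES = ["year", "month", "day", "hour", "minute", "second", "microsecond"]
COMPOSITES = {
    "date": ["year", "month", "day"],
    "time": ["hour", "minute", "second", "microsecond"],
}
COMPOSITES["datetime"] = COMPOSITES["date"] + COMPOSITES["time"]


def check_timestamp_keys(keys):
    """Return None if the keys are acceptable, else a short reason string."""
    covered = []
    for key in keys:
        for part in COMPOSITES.get(key, [key]):
            if part not in LEAVES:
                return f"unknown component: {part}"
            if part in covered:
                return f"incompatible: {part} is supplied more than once"
            covered.append(part)
    if not covered:
        return "no components supplied"
    positions = sorted(LEAVES.index(part) for part in covered)
    for prev, cur in zip(positions, positions[1:]):
        if cur != prev + 1:
            return f"incomplete: missing {LEAVES[prev + 1]}"
    return None


for keys in (["date", "time"], ["date", "hour"], ["date", "minute"], ["date", "year"]):
    print(keys, "->", check_timestamp_keys(keys))
```

Under this reading, the first two combinations pass, `date` plus `minute` is flagged as missing `hour`, and `date` plus `year` is flagged because `year` appears twice.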

##### `REGEX`

Extracts the matches of a regex from a column or checks if a column matches or fails to match a regex.

```yaml
regex: $REGEX # the regex to extract (must be a string and a valid regex)
action: EXTRACT|MATCH|NOT_MATCH   # the action to perform
input: $EXPR # the input expression to extract from
group: NULL|$GROUP # the group to extract (if applicable, e.g., for `EXTRACT` action)
```
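
The three actions can be pictured at the level of a single string value with Python's `re` module. This is only a sketch: `apply_regex` is a hypothetical helper, and it assumes substring-style matching (`re.search`), Python's regex dialect, null inputs propagating as null, and `EXTRACT` without a group returning the whole match. An execution engine may differ on any of these points.

```python
import re


def apply_regex(value, regex, action, group=None):
    """Row-level sketch of EXTRACT / MATCH / NOT_MATCH on one string value."""
    if value is None:
        return None  # assumption: null in, null out
    found = re.search(regex, value)
    if action == "EXTRACT":
        if found is None:
            return None
        return found.group(0 if group is None else group)
    if action == "MATCH":
        return found is not None
    if action == "NOT_MATCH":
        return found is None
    raise ValueError(f"unknown action: {action}")


pattern = r"(\d+)/(\d+)"
for bp in ["120/80", "NULL"]:
    systolic = apply_regex(bp, pattern, "EXTRACT", group=1)
    diastolic = apply_regex(bp, pattern, "EXTRACT", group=2)
    print(bp, systolic, diastolic, apply_regex(bp, pattern, "MATCH"))
```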

##### `COALESCE`

Identify the first non-null value in a list of expressions.

##### `CONDITIONAL`

A conditional expression that takes a boolean predicate, a true value, and a false value, and returns the
appropriate value based on the predicate.

```yaml
if: $PREDICATE # the boolean predicate to evaluate
then: $TRUE_VALUE # the value to return if the predicate is true
else: $FALSE_VALUE # the value to return if the predicate is false. If omitted, `null` is returned when false.
```
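
For intuition, `COALESCE` and `CONDITIONAL` have simple row-level meanings, sketched here in plain Python. The function names are hypothetical (`otherwise` stands in for `else`, which is a reserved word in Python), and treating a null predicate as false is a simplification that an engine may not share.

```python
def coalesce(*values):
    """Return the first value that is not None, or None if there is none."""
    for value in values:
        if value is not None:
            return value
    return None


def conditional(predicate, then, otherwise=None):
    """Return `then` when the predicate holds; otherwise `otherwise` (None by default)."""
    return then if predicate else otherwise


def sign_label(x):
    """Chain conditionals without an else branch by coalescing them."""
    return coalesce(
        conditional(x < 0, "negative"),
        conditional(x == 0, "zero"),
        "positive",
    )


print(coalesce(None, 4))
print(conditional(False, 1))
print([sign_label(x) for x in (-2, 0, 5)])
```

Because a `CONDITIONAL` without `else` yields null when its predicate is false, coalescing several of them behaves like a chain of "else if" branches.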

##### `STRING_INTERPOLATE`

Interpolates a string with one or more input expressions. Argument spec:

```yaml
pattern: $PATTERN # the string pattern to interpolate, in python format-string syntax, e.g., "{key_1}!"
inputs:
  key_1: $VALUE_1_EXPR
  key_2: $VALUE_2_EXPR
  ...
```
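
Assuming the pattern uses Python `str.format`-style placeholders, as the `"val {col1}"` pattern in the usage example suggests, the row-level effect can be sketched as follows. `string_interpolate` is a hypothetical helper, not part of dftly.

```python
def string_interpolate(pattern, inputs):
    """Fill each {key} placeholder in the pattern with the matching input value."""
    return pattern.format(**inputs)


rows = [{"col1": 1}, {"col1": 2}]
out = [string_interpolate("val {col1}", {"col1": row["col1"]}) for row in rows]
print(out)
```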

##### `TYPE_CAST`

Casts a value to a specific type. Argument spec:

```yaml
input: $INPUT_EXPR # the input expression to cast
output_type: $OUTPUT_TYPE # the type to cast to, e.g., "int", "float", "str", "bool", etc.
```

##### `VALUE_IN_LITERAL_SET`

Checks if a value is in a set of values. Argument spec:

```yaml
value: $VALUE_EXPR # the value to check
set: # the set of values to check against (as a list of literals only)
  - $VALUE_1
  - $VALUE_2
  - '...'
```

##### `VALUE_IN_RANGE`

Checks if a (numeric) value is in a range. Argument spec:

```yaml
value: $VALUE_EXPR # the value to check
min: $MIN_EXPR # the minimum value of the range
min_inclusive: $MIN_INCLUSIVE # whether the minimum value is inclusive (default: true)
max: $MAX_EXPR # the maximum value of the range
max_inclusive: $MAX_INCLUSIVE # whether the maximum value is inclusive (default: true)
```

Either or both of `min` or `max` can be omitted, in which case the range is unbounded in that direction.
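
The argument spec above can be read as the following row-level check. This is a sketch under stated assumptions: `value_in_range` is a hypothetical helper, `lower` and `upper` stand in for `min` and `max` (to avoid shadowing Python builtins), and a null value is assumed to yield null, in line with the `in_range` column of the usage example.

```python
def value_in_range(value, lower=None, upper=None, min_inclusive=True, max_inclusive=True):
    """Check one value against an optionally bounded, optionally open range."""
    if value is None:
        return None  # assumption: null in, null out
    if lower is not None:
        if value < lower or (value == lower and not min_inclusive):
            return False
    if upper is not None:
        if value > upper or (value == upper and not max_inclusive):
            return False
    return True


# A half-open range that excludes 3 and includes 4.
print([value_in_range(v, 3, 4, min_inclusive=False) for v in (1, 3, 4, None)])
# Omitting the upper bound leaves the range unbounded above.
print(value_in_range(10, lower=3))
```

Under the usual interval notation, the `col1 in (3, 4]` string from the usage example would correspond to `lower=3, min_inclusive=False, upper=4`; that mapping is an assumption here.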

##### `NOT`/`AND`/`OR`

Logical operations that take one or more boolean expressions and return a boolean value. Arguments are
positional only.

##### `PARSE_WITH_FORMAT_STRING`

Parses a string into a specific type. Argument spec:

```yaml
input: $INPUT_EXPR # the input expression to parse
output_type: $OUTPUT_TYPE # the type to parse to, e.g., "datetime", "float", "int", etc.
format: $FORMAT # the format to parse the input. Meaning differs based on the output type.
```

##### `HASH_TO_INT`

Generates a hash in int64 format of an input expression. Argument can either be a single positional argument
or follow the argument spec:

```yaml
input: $INPUT_EXPR # the input expression to hash
algorithm: $ALGORITHM # the hash algorithm to use, e.g., "md5", "sha256", etc. Defaults to "sha256"
```

### Simplified Form:

The purpose of the simplified form is to take a concise, human-readable representation (designed for use with
`YAML` files, though it will ultimately be parsed from direct python structures) and return the fully resolved
form. The simplified form obeys the following design principles:

1. All parsing happens in a specific "context", which is simply a map of options that change certain aspects
    of parsing behavior. Typically, parsing is in a `null` context, which means that default behavior is
    used.
2. Parsing a YAML map (or a python dictionary parsed from a YAML file) is done independently for each
    key-value pair in the dictionary, with the key being the output column name and the value being the
    expression to be parsed. If a top-level parse operation is attempted on a non-map value (e.g., a list),
    it will fail.
3. A key constraint of the simplified form is that if a fully-resolved form is specified in the simplified
    form, it _must_ resolve to itself.

In addition to the above principles, value-parsing (meaning not the top-level `parse` call but the internal
call used on each of the values in the input map) can be directly simplified based both on the (python) type
of the input to be parsed (as YAML only supports a limited set of plain python types) as well as the possible
output type you can obtain. The vast majority of the complexity in this language will be found when we are
parsing strings into expression outputs, followed by parsing maps into expression outputs.

The sections below describe the language in detail, in the following order:

1. We will describe the context flags that modify parsing behavior. Note that these cannot in general be set
    manually by the user, but instead are controlled based on other aspects of the parse tree and parsing
    always begins with a `null` context (aside from specifying available columns, if applicable).
2. We will then describe how `literal` outputs can be obtained, across different YAML input types across
    different context flags.
3. Next, we will describe how `column` outputs can be obtained, which is only possible through map and
    string inputs.
4. We will finally describe how to obtain `expression` outputs, separated into both general principles and
    mechanisms for different expression types, leveraging (where applicable) string, map, and/or list
    input types.

#### Context flags

##### `recursive_list`

Lists are parsed recursively and returned as lists, rather than being parsed into expression form (see below).

##### `literal`

All inputs are parsed as literals, and no further resolution happens.

##### `recurse_to_literal`

Subsequent recursive resolution calls will enable the `literal` context flag.

##### `input_schema`

This is a map of available input column names to their initial types, which can dictate some aspects of the
parsing behavior. If set to `null`, it is assumed to be unknown/unspecified. A `null` type value indicates
that the column exists but has an unknown/unspecified type.

#### Obtaining a `literal`

Literals can be obtained in one of several ways:

1. If a fully-resolved literal is specified.
2. If the `literal` context flag is set, then the input is parsed as a literal regardless of its type.
3. If a numeric or boolean literal is passed as input, it is parsed as a literal.
4. If a string is passed as an input that _is not_ a valid and simple input column name nor an example of a
    parse option for another expression, then it is parsed as a literal. Note this is also a valid instance
    of the string parsing as a string interpolation, simply without any interpolation keys or values.

For example:

_Input_ (`null` context):

```yaml
a: 1234 # int literal
b: 12.34 # float literal
c: true # boolean literal
d: {literal: [1, 2, 3]} # list literal, fully resolved form
e: foobar123   # string literal, provided it is not a valid column name or expression
```

_Output_:

```yaml
a: {literal: 1234}
b: {literal: 12.34}
c: {literal: true}
d: {literal: [1, 2, 3]}
e: {literal: foobar123}
```

> [!NOTE]
> The last example works here as the input is parsed in a `null` context (so it has no columns specified), and
> `foobar123` is not an expression parseable input. Were you to specify an `input_schema` context variable
> containing `foobar123`, then the input would be parsed as a column reference instead.

> [!NOTE]
> If the `literal` flag were true in context, then everything would just be parsed as a literal.

#### Obtaining a `column`

Columns can be obtained in one of three ways:

1. (_Fully resolved_) If a fully-resolved column is specified.
2. (_Dictionary short form_) If the input is a dictionary with a single key `column` and a value that is a
    string, it is resolved into a column form with the given `name` set and the `type` inferred from the
    `input_schema` if possible, or left as `null` if not.
3. (_String form_) If the input is a string that is a valid column name and not an expression input parse
    target, then it is parsed as a column reference.

> [!NOTE]
> If a column is specified in string form, it must be present in the `input_schema` context --
> therefore, in this setting, if the schema specifies its type, that type is included in the output. However,
> this is not true for a fully resolved column, which is assumed to be specified as intended (type included).

For example:

_Input_ (`input_schema: {'foobar123': null, 'bar': 'string'}` context):

```yaml
a: {column: {name: bar}} # fully resolved column
b: foobar123   # string column reference, which is a valid column name
c: {column: bar} # Dictionary short-form
d: {column: qux} # Dictionary short-form
e: qux # Not a valid column name, so not parsed as a column
```

_Output_:

```yaml
a: {column: {name: bar, type: null}} # fully resolved column
b: {column: {name: foobar123, type: null}} # string column reference, which is a valid column name
c: {column: {name: bar, type: string}} # Dictionary short-form infers type
d: {column: {name: qux, type: null}} # Dictionary short-form does not error if column is not in input_schema
e: {literal: qux} # Not a valid column name, so not parsed as a column
```
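
The literal and column rules above can be condensed into a toy resolver. This is a simplified sketch rather than dftly's parser: `resolve_simple` is a hypothetical name, it ignores expressions and every context flag other than `input_schema`, and it exists only to reproduce the two worked examples.

```python
def resolve_simple(value, input_schema=None):
    """Toy resolver covering literals and columns only (no expressions)."""
    schema = input_schema or {}
    if isinstance(value, dict) and set(value) == {"literal"}:
        return value  # a fully resolved literal resolves to itself
    if isinstance(value, dict) and set(value) == {"column"}:
        col = value["column"]
        if isinstance(col, str):
            # Dictionary short form: infer the type from the schema if possible.
            return {"column": {"name": col, "type": schema.get(col)}}
        # Fully resolved column: taken as written; a missing type becomes null.
        return {"column": {"name": col["name"], "type": col.get("type")}}
    if isinstance(value, str) and value in schema:
        # String form: only names present in the schema become columns.
        return {"column": {"name": value, "type": schema[value]}}
    return {"literal": value}


schema = {"foobar123": None, "bar": "string"}
spec = {
    "a": {"column": {"name": "bar"}},
    "b": "foobar123",
    "c": {"column": "bar"},
    "d": {"column": "qux"},
    "e": "qux",
}
for name, value in spec.items():
    print(name, resolve_simple(value, schema))
```

Resolving an already resolved output a second time returns it unchanged, which mirrors the requirement that a fully-resolved form must resolve to itself.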

#### Obtaining an `expression`

Expressions can be obtained from string, dictionary, or list inputs in a variety of ways. Firstly, we note
that any fully resolved input will be parsed as such, including expressions; we will omit those from this
section for brevity.

##### Shared dictionary short-form:

In addition, all expressions can be expressed in a slightly shortened dictionary form by passing the
`expression_type` as a key (case insensitive) and the arguments to the expression as the value, though
depending on the expression different context flags may be activated when parsing the arguments recursively.
For example, this form allows for resolutions like:

```yaml
a:
  add: [col1, col2]
b:
  conditional:
    if:
      value_in_literal_set:
        value: col1
        set: [1, 2, foo]
    then: col2
    else: 43
```

to be parsed into:

```yaml
a:
  expression:
    type: ADD
    arguments:
      - {column: {name: col1, type: null}}
      - {column: {name: col2, type: null}}
b:
  expression:
    type: CONDITIONAL
    arguments:
      if:
        expression:
          type: VALUE_IN_LITERAL_SET
          arguments:
            value: {column: {name: col1, type: null}}
            set: [literal: 1, literal: 2, literal: foo]
      then: {column: {name: col2, type: null}}
      else: {literal: 43}
```

This dictionary short-form applies for all expression types; however, certain expressions have even shorter
dictionary forms or string forms that can be used as well; these will be described below on a per-expression
basis.
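
The outer rewrite performed by this short form is small enough to sketch directly. The helper below is hypothetical and deliberately incomplete: it only wraps a single-key dictionary into the `expression` template with an upper-cased type, and it leaves the arguments untouched rather than parsing them recursively as the real resolution would.

```python
RESOLVED_KEYS = {"literal", "column", "expression"}


def expand_short_form(node):
    """Wrap {name: args} as {'expression': {'type': NAME, 'arguments': args}}."""
    if isinstance(node, dict) and len(node) == 1:
        ((key, args),) = node.items()
        if key.lower() not in RESOLVED_KEYS:
            return {"expression": {"type": key.upper(), "arguments": args}}
    return node  # already resolved, or not a short form


print(expand_short_form({"add": ["col1", "col2"]}))
print(expand_short_form({"Add": ["col1", "col2"]}))  # keys are case insensitive
print(expand_short_form({"literal": 43}))
```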

##### Shared string form:

All expressions can also be expressed in a string form via the python function calling signature, though this
only supports inputs that can be suitably expressed in a minimal string form. This form allows for resolutions
like:

```yaml
a: add(col1, col2)   # Add two columns together
b: conditional(if=value_in_literal_set(col1, [1, 2, foo]), then=col2, else=43)   # Conditional expression
```

Parentheses can be used to group sub-expressions. Without explicit grouping, the
grammar follows normal operator precedence: arithmetic and membership operations
are evaluated before boolean logic, and `not` binds tighter than `and`, which in
turn binds tighter than `or`.

The two examples above are parsed into:

```yaml
a:
  expression:
    type: ADD
    arguments:
      - {column: {name: col1, type: null}}
      - {column: {name: col2, type: null}}
b:
  expression:
    type: CONDITIONAL
    arguments:
      if:
        expression:
          type: VALUE_IN_LITERAL_SET
          arguments:
            value: {column: {name: col1, type: null}}
            set: [literal: 1, literal: 2, literal: foo]
      then: {column: {name: col2, type: null}}
      else: {literal: 43}
```

> [!WARNING]
> Use of this form is generally frowned upon as it is much less human readable for those not
> familiar with python syntax.
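
The precedence order described above for ungrouped string expressions is the same one Python applies to its own `not`, `and`, `or`, `in`, and arithmetic operators. As an analogy only (dftly's string form has its own grammar and operator spellings, such as the `&&` and `!` seen in the usage example), the sketch below uses Python's `ast` module to confirm which explicit grouping an ungrouped expression corresponds to.

```python
import ast


def same_grouping(implicit, explicit):
    """True if two Python expressions parse to the same syntax tree."""
    left = ast.dump(ast.parse(implicit, mode="eval"))
    right = ast.dump(ast.parse(explicit, mode="eval"))
    return left == right


# `not` binds tighter than `and`, which binds tighter than `or`.
print(same_grouping("not a and b or c", "((not a) and b) or c"))
# Arithmetic and membership are grouped before boolean logic.
print(same_grouping("x + 1 in s and flag", "((x + 1) in s) and flag"))
# A different grouping yields a different tree.
print(same_grouping("not a and b or c", "not (a and (b or c))"))
```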

##### `ADD`

- _Context flags_: `recursive_list`
- _String form_: `col1 + col2 + ...` is mapped to `add: [col1, col2, ...]` and parsed accordingly. Spaces
    are mandatory, to avoid ambiguity with literals or column names that may contain `+` characters.

##### `SUBTRACT`

- _Context flags_: `recursive_list`
- _String form_: `col1 - col2` is mapped to `subtract: [col1, col2]` and parsed accordingly. Spaces are
    mandatory, to avoid ambiguity with literals or column names that may contain `-` characters.
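
To see why the surrounding spaces matter for the `+` and `-` string forms, consider a naive splitter that only treats `" + "` (with spaces) as the operator. This is a toy illustration with a hypothetical `split_add` helper, not how dftly tokenizes its input.

```python
def split_add(text):
    """Split on ' + ' (spaces included); return None if no operator is found."""
    parts = text.split(" + ")
    return parts if len(parts) > 1 else None


print(split_add("col1 + col2 + col3"))  # three operands
print(split_add("a+b"))  # no spaces: stays a single name or literal
```

With this convention, a column that happens to be named `a+b` is never mistaken for the sum of `a` and `b`.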

##### `RESOLVE_TIMESTAMP`

_String form_: `date_col @ $TIME` or `year_col @ $CALENDAR_DATE`. In this format, the part before the `@` sign
must be resolvable into a column of an appropriate type and the part after the `@` sign is resolved as a
datetime part via the `PARSE_WITH_FORMAT_STRING` expression with the `AUTO` option for the format string and
output type. The resolution of the columnar input and the time/date part must be mutually compatible (see the
definition of this expression above).

_Examples_:

```yaml
a: charttime @ 11:59:59 p.m.
b: birth_year @ January 1, 12:00 a.m.
```

_Output_:

```yaml
a:
  expression:
    type: RESOLVE_TIMESTAMP
    arguments:
      date: {column: {name: charttime, type: null}}
      time:
        hour: {literal: 23}
        minute: {literal: 59}
        second: {literal: 59}
b:
  expression:
    type: RESOLVE_TIMESTAMP
    arguments:
      date:
        year: {column: {name: birth_year, type: null}}
        month: {literal: 1}
        day: {literal: 1}
      time:
        hour: {literal: 0}
        minute: {literal: 0}
        second: {literal: 0}
```
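
As a rough illustration of how the time part in example `a` could be reduced to `hour`/`minute`/`second`
literals and then combined with a date, consider the sketch below. It uses only the standard library; the
helper `time_literal_parts` and the sample date are hypothetical, and dftly's own `AUTO` resolution may
work differently.

```python
from datetime import date, datetime, time


def time_literal_parts(text: str) -> dict:
    """Hypothetical helper: split a clock time like '11:59:59 p.m.' into parts."""
    normalized = text.strip().upper().replace(".", "")  # '11:59:59 p.m.' -> '11:59:59 PM'
    parsed = datetime.strptime(normalized, "%I:%M:%S %p")
    return {"hour": parsed.hour, "minute": parsed.minute, "second": parsed.second}


parts = time_literal_parts("11:59:59 p.m.")

# Combine the resolved time with a single (made-up) date value, standing in
# for one row of the `charttime` column.
resolved = datetime.combine(date(2020, 5, 17), time(**parts))
```

With these inputs, `parts` is `{"hour": 23, "minute": 59, "second": 59}`, matching the `time` arguments in
the output for `a`.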

##### `REGEX`

- _Aliases / dictionary short forms_: You can avoid specifying the regex action by using the `regex_$ACTION`
    syntax, or by specifying the regex under an argument keyed with the action rather than the `regex` key.
- _String form_: You can use the syntax `extract $REGEX from $INPUT` or
    `extract group 1 of $REGEX from $INPUT` to extract matches from an input, `match $REGEX against $INPUT`
    to check if the input matches the regex, or `not match $REGEX against $INPUT` to check if the input does
    not match the regex. The `$REGEX` part must be a valid regex string, and the `$INPUT` part must be a
    valid column name or expression.

##### `COALESCE`

- _**List** form_: If the input is a list of expressions, it is parsed as a `COALESCE` operation,
    which returns the first non-null value in the list. This is the most common case for lists in dftly. List
    elements are parsed recursively.
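
The "first non-null value" semantics can be sketched per row in plain Python. The function below is a
hypothetical illustration, not dftly's implementation. Note that only `None` counts as missing here, so
falsy values such as `0` or an empty string are still returned.

```python
from typing import Any


def coalesce(*values: Any) -> Any:
    """Return the first value that is not None, or None if all are missing."""
    for value in values:
        if value is not None:
            return value
    return None


first = coalesce(None, 0, 5)      # 0 is not null, so it is returned
fallback = coalesce(None, None)   # every value is null
```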

##### `CONDITIONAL`

- _**List** form_: A list in which each element specifies an `if`/`then` block needs no special syntax:
    because lists are parsed as `COALESCE`, such a list naturally yields a chained conditional expression
    capable of multiple "else if" branches.
- _String form_: You can use the Python ternary `if` syntax, e.g., `col1 if col2 else col3`, which
    is parsed as a `CONDITIONAL` expression with the `if` argument set to `col2`, the `then` argument set to
    `col1`, and the `else` argument set to `col3`.
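
Because the ternary form puts the `then` value first, the argument order is easy to misread. The sketch
below uses Python's standard `ast` module to show which part of the string lands in which argument. The
helper `ternary_to_arguments` is hypothetical and is not how dftly parses strings.

```python
import ast


def ternary_to_arguments(text: str) -> dict:
    """Hypothetical helper: split `<then> if <condition> else <otherwise>` into named parts."""
    node = ast.parse(text, mode="eval").body
    if not isinstance(node, ast.IfExp):
        raise ValueError("expected the form `<then> if <condition> else <otherwise>`")
    return {
        "if": ast.unparse(node.test),
        "then": ast.unparse(node.body),
        "else": ast.unparse(node.orelse),
    }


arguments = ternary_to_arguments("col1 if col2 else col3")
```

Here `arguments` is `{"if": "col2", "then": "col1", "else": "col3"}`.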

##### `STRING_INTERPOLATE`

_String form_: Uses Python string interpolation syntax, interpreting the values as strings to be resolved.

##### `TYPE_CAST`

_String form_: Uses `as` keyword (case insensitive) to specify the type to cast to, e.g., `col1 AS int` or
`col2 as float`.

> [!NOTE]
> This syntax is shared with the `PARSE_WITH_FORMAT_STRING` expression, where a format string can be used
> instead of a type, e.g., `col1 as '%Y-%m-%d %H:%M:%S'` to parse a datetime from a string.

##### `VALUE_IN_LITERAL_SET`

- _Context flags_: Enables the context flags `recursive_list` and `recurse_to_literal`.
- _String form_: Uses the `in` keyword (case insensitive) to specify the set to check against, e.g.,
    `col2 in {1, 2, 3}`. The elements are parsed recursively from string form inputs.

> [!NOTE]
> Only the syntax for a set is permitted here to avoid ambiguity with the value in range expression type.

##### `VALUE_IN_RANGE`

_String form_: Uses the `in` keyword (case insensitive) followed by a two-element range. The opening `[` or
`(` and the closing `]` or `)` mark the corresponding bound as inclusive or exclusive, respectively, e.g.,
`col1 in [1, 10]` for an inclusive range between 1 and 10, or `col2 in (1, 10]` for an exclusive lower bound
and inclusive upper bound.
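
The bracket conventions can be sketched in plain Python as follows. The helpers `parse_range` and
`in_range`, and the dictionary keys they use, are hypothetical names chosen for this illustration; they are
not dftly's internal representation, and the sketch only handles numeric bounds.

```python
def parse_range(text: str) -> dict:
    """Hypothetical helper: parse '[1, 10]' or '(1, 10]' into bounds and inclusivity flags."""
    text = text.strip()
    if len(text) < 2 or text[0] not in "[(" or text[-1] not in "])":
        raise ValueError(f"not a range: {text!r}")
    bounds = [part.strip() for part in text[1:-1].split(",")]
    if len(bounds) != 2:
        raise ValueError(f"a range needs exactly two elements: {text!r}")
    low, high = (float(bound) for bound in bounds)
    return {
        "min": low,
        "max": high,
        "min_inclusive": text[0] == "[",
        "max_inclusive": text[-1] == "]",
    }


def in_range(value: float, spec: dict) -> bool:
    """Check a single value against a parsed range."""
    above = value >= spec["min"] if spec["min_inclusive"] else value > spec["min"]
    below = value <= spec["max"] if spec["max_inclusive"] else value < spec["max"]
    return above and below


closed = parse_range("[1, 10]")
half_open = parse_range("(1, 10]")
```

With these definitions, `in_range(1, closed)` is `True`, while `in_range(1, half_open)` is `False` because
the lower bound is exclusive.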

##### `NOT`/`AND`/`OR`

_String form_: Uses `not`, `and`, and `or` keywords (case insensitive) or the `!`, `&&`, and `||` syntax.
Parentheses may be used to group boolean expressions; otherwise `not` applies
before `and`, which applies before `or`.
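
This ordering matches that of Python's own boolean operators, so Python's parser can serve as a quick
analogy for how an ungrouped expression is read. The sketch below uses the standard `ast` module and says
nothing about dftly's grammar itself, which is separate.

```python
import ast

# Without parentheses, `not` applies first, then `and`, then `or`.
implicit = ast.dump(ast.parse("a or b and not c", mode="eval"))
explicit = ast.dump(ast.parse("a or (b and (not c))", mode="eval"))
same_grouping = implicit == explicit

# Explicit parentheses change the grouping, and therefore the resulting tree.
regrouped = ast.dump(ast.parse("(a or b) and not c", mode="eval"))
different_grouping = implicit != regrouped
```

Both `same_grouping` and `different_grouping` are `True`: the ungrouped string is read as
`a or (b and (not c))`, and parentheses are needed to get `(a or b) and not c`.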

##### `PARSE_WITH_FORMAT_STRING`

_Dictionary shorter-form_:
Here we support some alternative key names for arguments to pre-specify the output type of the argument, and
allow an expression alias and a shortened column-input form to be used as well. So, in particular, the
following are all equivalent:

```yaml
time:
  parse_with_format_string:
    input: ${input_col}
    output_type: datetime
    format: '%Y-%m-%d %H:%M:%S'

time:
  parse_with_format_string:
    input: ${input_col}
    datetime_format: '%Y-%m-%d %H:%M:%S' # Note the different key name for the format

time:
  parse:
    input: ${input_col}
    output_type: datetime
    format: '%Y-%m-%d %H:%M:%S' # Note the long form for the input column

time:
  input_col:
    datetime_format: '%Y-%m-%d %H:%M:%S' # Note the short form for the input column
```

_String form_: You can also use the `as` keyword (case insensitive) to specify the format string for parsing,
with the structure of the format string dictating the output type. The format string **must** be quoted.
Unquoted strings that look like chained expressions (for example `col1 + col2 AS %m mo %d d`) are considered
invalid and will raise a `ValueError`. To combine parsing with other operations, use quotes around the
format string, e.g., `col1 + col2 as '%m mo %d d'`.
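
The format strings in the examples above use `strftime`-style directives. Assuming they carry the same
meaning as in Python's standard library (an assumption, since the backend that executes the expression may
interpret them itself), the effect on a single value can be sketched like this:

```python
from datetime import datetime

# The same format string as in the dictionary examples above.
FORMAT = "%Y-%m-%d %H:%M:%S"

parsed = datetime.strptime("2021-03-04 05:06:07", FORMAT)

# A value that does not fit the format raises ValueError in the standard library.
try:
    datetime.strptime("03/04/2021", FORMAT)
    mismatch_raises = False
except ValueError:
    mismatch_raises = True
```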

##### `HASH_TO_INT`

- _Name Aliases_: You can use `hash` instead of `hash_to_int` in the typical string or dictionary form.

## Doctest Examples

The snippet below demonstrates parsing a variety of syntax options and verifying
the resulting expression types.
