Metadata-Version: 2.1
Name: code-indexer-loop
Version: 0.2.1
Summary: Code Indexer Loop
Author-email: Rick Lamers <rick@definitive.io>
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: llama-index>=0.8.20,<0.9
Requires-Dist: chromadb>=0.4.8,<0.5
Requires-Dist: tree-sitter-languages>=1.7.0,<1.8
Requires-Dist: tree-sitter>=0.20.2,<0.21
Requires-Dist: tiktoken>=0.4.0,<0.5
Requires-Dist: langchain>=0.0.283
Requires-Dist: watchdog>=2.3.1,<2.4
Requires-Dist: nltk>=3.8.1,<3.9
Requires-Dist: toml ~=0.10.2 ; extra == "dev"
Requires-Dist: black ~=23.3.0 ; extra == "dev"
Requires-Dist: isort ~=5.9.3 ; extra == "dev"
Requires-Dist: autoflake ~=2.2.0 ; extra == "dev"
Requires-Dist: ruff ~=0.0.284 ; extra == "dev"
Requires-Dist: pytest ~=7.4.1 ; extra == "dev"
Requires-Dist: flit >=3.8.0,<4 ; extra == "dev"
Requires-Dist: pytest-cov ~=3.0.0 ; extra == "test"
Provides-Extra: dev
Provides-Extra: test

# Code Indexer Loop

[![PyPI version](https://badge.fury.io/py/code-indexer-loop.svg?v=2)](https://pypi.org/project/code-indexer-loop/)
[![License](https://img.shields.io/github/license/definitive-io/code-indexer-loop?v=2)](LICENSE)
[![Forks](https://img.shields.io/github/forks/definitive-io/code-indexer-loop?v=2)](https://github.com/definitive-io/code-indexer-loop/network)
[![Stars](https://img.shields.io/github/stars/definitive-io/code-indexer-loop?v=2)](https://github.com/definitive-io/code-indexer-loop/stargazers)
[![Twitter](https://img.shields.io/twitter/url/https/twitter.com?style=social&label=Follow%20%40DefinitiveIO)](https://twitter.com/definitiveio)
[![Discord](https://dcbadge.vercel.app/api/server/CPJJfq87Vx?compact=true&style=flat)](https://discord.gg/CPJJfq87Vx)


**Code Indexer Loop** is a Python library designed to index and retrieve code snippets. 

It builds on the indexing utilities of the **LlamaIndex** library and uses the multi-language **tree-sitter** library to parse code written in many popular programming languages. **tiktoken** is used to right-size retrieval based on the number of tokens, and **LangChain** is used to obtain embeddings (defaults to **OpenAI**'s `text-embedding-ada-002`) and store them in an embedded **ChromaDB** vector database. **watchdog** is used to continuously update the index based on file system events.

Read the [launch blog post](https://www.definitive.io/blog/open-sourcing-code-indexer-loop) for more details about why we've built this!

## Installation
Use `pip` to install Code Indexer Loop from PyPI.
```
pip install code-indexer-loop
```

## Usage
1. Import necessary modules:
```python
from code_indexer_loop.api import CodeIndexer
```
2. Create a CodeIndexer object and have it watch for changes:
```python
indexer = CodeIndexer(src_dir="path/to/code/", watch=True)
```
3. Use `.query` to perform a search query:
```python
query = "pandas"
print(indexer.query(query)[0:30])
```

Note: make sure the `OPENAI_API_KEY` environment variable is set. This is needed for generating the embeddings.

You can also use `indexer.query_nodes` to get the nodes that match a query, or `indexer.query_documents` to receive the entire source code files.

Note that if you edit any of the source code files in `src_dir`, the indexer efficiently re-indexes those files using `watchdog` and an `md5`-based caching mechanism. This keeps the embeddings up to date every time you query the index.
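
To illustrate the idea behind hash-based caching, here is a minimal sketch that is not taken from the library's source. The helper names `file_md5` and `files_needing_reindex` and the in-memory `seen_hashes` dictionary are assumptions made for this example; the actual implementation may store and compare hashes differently.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the hex md5 digest of a file's bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def files_needing_reindex(paths, seen_hashes):
    """Return the paths whose content hash differs from the cached one.

    `seen_hashes` (hypothetical cache) maps a path to the md5 digest it had
    when it was last indexed. It is updated in place.
    """
    changed = []
    for path in paths:
        digest = file_md5(path)
        if seen_hashes.get(path) != digest:
            seen_hashes[path] = digest
            changed.append(path)
    return changed
```

With a cache like this, a file system event for a file whose content did not actually change (for example, a save without edits) would not trigger new embedding requests.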

## Examples
Check out the [basic_usage](examples/basic_usage.ipynb) notebook for a quick overview of the API.

## Token limits
You can configure token limits for the chunks through the CodeIndexer constructor:

```python
indexer = CodeIndexer(
    src_dir="path/to/code/",
    watch=True,
    target_chunk_tokens=300,
    max_chunk_tokens=1000,
    enforce_max_chunk_tokens=False,
    coalesce=50,
    token_model="gpt-4",
)
```

Note that you can choose whether `max_chunk_tokens` is enforced. If `enforce_max_chunk_tokens` is `True`, an exception is raised when there is no semantic parsing that respects `max_chunk_tokens`.

The `coalesce` argument controls how smaller chunks are combined into single chunks, which avoids ending up with many very small chunks. Like the other limits, `coalesce` is measured in tokens.
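
As a rough illustration of coalescing, the sketch below merges consecutive chunks until each merged chunk reaches a token threshold. It is a sketch under stated assumptions, not the library's code: `count_tokens` is a crude whitespace-based stand-in for a real tokenizer such as tiktoken, and the exact merge rule in `coalesce_chunks` is a guess at the general approach.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def coalesce_chunks(chunks, coalesce=50):
    """Merge consecutive chunks until each merged chunk has at least `coalesce` tokens."""
    merged = []
    current = ""
    for chunk in chunks:
        current += chunk
        if count_tokens(current) >= coalesce:
            merged.append(current)
            current = ""
    if current:
        # The trailing remainder may be smaller than `coalesce`.
        merged.append(current)
    return merged


chunks = ["a b\n", "c\n", "d e f g\n", "h\n"]
merged = coalesce_chunks(chunks, coalesce=3)
# Merging only concatenates neighbours, so no text is lost or reordered.
assert "".join(merged) == "".join(chunks)
```

Because chunks are only ever concatenated with their neighbours, the original text can still be recovered by joining the merged chunks.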

## tree-sitter
Using `tree-sitter` for parsing, the chunks are broken only at valid node-level string positions in the source file. This avoids breaking up e.g. function and class definitions.
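
To make node-level splitting concrete, here is a small sketch that uses Python's built-in `ast` module as a stand-in for `tree-sitter`, so it only handles Python source and only splits at top-level statements. The function name `split_at_top_level_nodes` is hypothetical, and the library's actual chunker works differently (it uses `tree-sitter` and token limits).

```python
import ast


def split_at_top_level_nodes(source: str):
    """Split Python source into chunks that each start at a top-level statement."""
    lines = source.splitlines(keepends=True)
    starts = set()
    for node in ast.parse(source).body:
        first_line = node.lineno
        # Decorators sit above the `def`/`class` line, so include them.
        for decorator in getattr(node, "decorator_list", []):
            first_line = min(first_line, decorator.lineno)
        starts.add(first_line - 1)  # convert to a 0-based line index
    if not starts:
        return [source] if source else []
    starts = sorted(starts)
    # Text before the first statement (e.g. comments) stays with the first chunk.
    starts[0] = 0
    boundaries = starts + [len(lines)]
    return ["".join(lines[a:b]) for a, b in zip(boundaries, boundaries[1:])]


src = (
    "# header\n"
    "import os\n"
    "\n"
    "@staticmethod\n"
    "def f():\n"
    "    return 1\n"
    "\n"
    "class A:\n"
    "    x = 2\n"
)
chunks = split_at_top_level_nodes(src)
# No function or class body is cut in half, and no text is dropped.
assert "".join(chunks) == src
```

Each chunk boundary falls between two top-level statements, so a function or class definition always ends up whole inside a single chunk.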

### Supported languages
C, C++, C#, Go, Haskell, Java, Julia, JavaScript, PHP, Python, Ruby, Rust, Scala, Swift, SQL, TypeScript

Note: we mainly test Python support. Use other languages at your own peril.

## Contributing
Pull requests are welcome. Please make sure to update tests as appropriate. Use tools provided within `dev` dependencies to maintain the code standard.

### Tests
Run the unit tests by invoking `pytest` in the root.

## License
Please see the LICENSE file provided with the source code.

## Attribution
We'd like to thank Sweep AI for publishing their ideas about code chunking. Read their blog posts about the topic [here](https://docs.sweep.dev/blogs/chunking-2m-files) and [here](https://docs.sweep.dev/blogs/chunking-improvements). The implementation in `code_indexer_loop` is modified from their original implementation, mainly to limit chunk size by tokens instead of characters and to achieve perfect document reconstruction (`"".join(chunks) == original_source_code`).

