Metadata-Version: 2.1
Name: juqueue
Version: 0.0.13
Summary: Computation and work management system for time-constrained cluster environments.
Author-email: Viet Anh Khoa Tran <v.tran@fz-juelich.de>
License: GNU General Public License v3 (GPLv3)
Project-URL: Homepage, https://github.com/tran-khoa/JuQueue
Keywords: workflow,cluster,management,slurm,dask
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyyaml (~=6.0)
Requires-Dist: dask-jobqueue
Requires-Dist: dask (~=2022.5.2)
Requires-Dist: tornado (~=6.1)
Requires-Dist: loguru (~=0.6.0)
Requires-Dist: nest-asyncio
Requires-Dist: filelock (~=3.7.1)
Requires-Dist: fastapi (~=0.78.0)
Requires-Dist: hypercorn (~=0.13.2)
Requires-Dist: pydantic (~=1.9.1)
Requires-Dist: uvloop (~=0.16.0)
Provides-Extra: dev
Requires-Dist: bumpver ; extra == 'dev'
Requires-Dist: pip-tools ; extra == 'dev'

# JuQueue
Computation and workflow management system for **time-constrained** cluster environments.
This system is aimed at compute clusters on which users are charged
for the runtime of an entire node or of a *minimum resource allocation unit*
(e.g. at the Jülich Supercomputing Centre (JSC)).

Work in progress and potentially unstable. The [wiki](https://github.com/tran-khoa/JuQueue/wiki) provides further documentation.

## Concept
- **Runs**
  - Each Run defines a command and its corresponding parameters.
  - Each Run defines an Executor, which determines environment variables, virtual environments, etc.
  - Commands should be robust to termination, i.e.
    - They should resume from previous computation if terminated.
      - If the Node shuts down or fails, the Run will be requeued.
    - Upon failure, they must return a non-zero status code (the Run will not be requeued).
    - Upon completion, they must return status code 0 (the Run will not be requeued).
- **Experiments**
  - Each Experiment is a logical group of Runs.
- **Clusters**
  - Each Cluster (currently `local` and `slurm`) defines a group of nodes.
  - A ClusterManager manages NodeManagers on computation nodes (e.g. via SLURM jobs).
    - Each NodeManager specifies a certain number of Slots and manages the execution of Runs in Python subprocesses.
    - As Runs are queued to or removed from the Cluster, or are completed or fail, the number of nodes is rescaled as necessary.
  - For now, the system aggressively minimizes the number of nodes, e.g.
    - Assume 4 nodes (each with 4 slots), each executing a single Run.
    - Since all 4 Runs fit into the 4 slots of a single node, 3 nodes are cancelled (along with their Runs) and those Runs are rescheduled to the remaining node.
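
The requirements on Run commands listed above (resume after termination, exit with status 0 on completion, exit with a non-zero status on failure) can be illustrated with a small script. This is only a sketch of one way to write such a command, not part of JuQueue itself: the checkpoint file, the JSON state layout, and the function names `load_state`, `save_state` and `run` are assumptions made for illustration.

```python
import json
import os
import sys
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Return the saved progress, or a fresh state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"step": 0, "total": 0}


def save_state(path: Path, state: dict) -> None:
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX


def run(checkpoint: Path, num_steps: int) -> int:
    """Do the work step by step and return the process exit code."""
    try:
        state = load_state(checkpoint)
        # Resume at the first step that has not been checkpointed yet.
        for step in range(state["step"], num_steps):
            state["total"] += step  # placeholder for the real computation
            state["step"] = step + 1
            save_state(checkpoint, state)
    except Exception as exc:
        print(f"run failed: {exc}", file=sys.stderr)
        return 1  # non-zero status: the Run failed
    return 0  # status 0: the Run completed
```

Used as the entry point of a command, the return value would be passed on as the exit status, e.g. `sys.exit(run(Path("checkpoint.json"), num_steps=10))`. If the process is killed when its node shuts down, the next invocation picks up at the last saved step instead of starting over.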

## Installation
### From source
```bash
git clone https://github.com/tran-khoa/JuQueue juqueue
cd juqueue
pip install -e .

# (optional) Start with example definitions
cp -r example_defs ~/defs
```
### Via pip
```bash
pip install juqueue
```

## Usage
```bash
juqueue --def-dir [PATH] --work-dir [PATH]
```

A minimal user interface is offered at [localhost:51234](http://localhost:51234).
For more advanced usage, JuQueue can be controlled via FastAPI's interactive docs 
available at [localhost:51234/docs](http://localhost:51234/docs).
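
The same API can also be explored from a script. The sketch below assumes that JuQueue serves its OpenAPI schema at FastAPI's default location, `/openapi.json`, on the port mentioned above; this README does not confirm that path. The helper names `fetch_schema` and `list_operations` are hypothetical. It only lists the available operations and does not assume any particular endpoint.

```python
import json
import urllib.request

HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options"}


def list_operations(schema: dict) -> list:
    """Return (METHOD, path, summary) tuples from an OpenAPI schema dict."""
    operations = []
    for path, item in sorted(schema.get("paths", {}).items()):
        for method, operation in item.items():
            # Path items may also carry non-method keys such as "parameters".
            if method in HTTP_METHODS:
                operations.append(
                    (method.upper(), path, operation.get("summary", ""))
                )
    return operations


def fetch_schema(base_url: str = "http://localhost:51234",
                 timeout: float = 5.0) -> dict:
    """Download the schema; /openapi.json is FastAPI's default (assumed here)."""
    url = f"{base_url}/openapi.json"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With JuQueue running, `for op in list_operations(fetch_schema()): print(*op)` would print one line per available operation.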

## Documentation
For now, refer to the examples in [example_defs/](./example_defs) and FastAPI's docs,
available at [localhost:51234/docs](http://localhost:51234/docs)
or [localhost:51234/redoc](http://localhost:51234/redoc).
