The Python scripts assume that the wget mirror of the website has been uploaded to an S3 bucket: s3://neurips-papers/

The command to download all the files is:
wget --adjust-extension -e robots=off -r -np -l9 'https://papers.nips.cc/'

--adjust-extension is needed because some directories and files share the same year name with no extension. Without saving those pages as .html, the name collision causes errors.
-e robots=off is needed because the PDF files carry no-robots metadata intended to block tools like wget.
-r downloads everything recursively.
-np (--no-parent) keeps wget from ascending above the starting directory, which is good practice.
-l9 lets the recursion go 9 levels deep, which is enough for now.
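
For anyone who prefers to launch the download from Python, the wget command can be wrapped roughly as follows. This is only a sketch: `build_wget_command` and `mirror_site` are hypothetical helper names that are not part of this repository, and it assumes `wget` is installed and on the PATH.

```python
import subprocess

SITE_URL = "https://papers.nips.cc/"


def build_wget_command(url=SITE_URL, depth=9):
    """Return the wget argument list (hypothetical helper, not in the repo)."""
    return [
        "wget",
        "--adjust-extension",  # save extensionless pages as .html
        "-e", "robots=off",    # ignore the no-robots directives
        "-r",                  # follow links recursively
        "-np",                 # never ascend to the parent directory
        f"-l{depth}",          # maximum recursion depth
        url,
    ]


def mirror_site(url=SITE_URL, depth=9):
    """Run wget and return its exit code instead of raising on failure."""
    completed = subprocess.run(build_wget_command(url, depth), check=False)
    return completed.returncode

# Usage (starts the full download, so call it deliberately):
#     exit_code = mirror_site()
```

The exit code is returned rather than raised because wget can exit non-zero when even a single request fails, which may be acceptable for a large mirror.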

As of 2023, this downloads about 73 GB of data.
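
Given the size of the download, it may be worth checking free disk space first. The sketch below uses only the standard library; `free_space_gb` and `has_room_for_mirror` are hypothetical names, and the 73 GB threshold is just the approximate 2023 figure, so the real requirement may be larger.

```python
import shutil

REQUIRED_GB = 73  # approximate 2023 size; treat as a rough lower bound


def free_space_gb(path="."):
    """Return the free disk space at `path` in gigabytes (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room_for_mirror(path=".", required_gb=REQUIRED_GB):
    """Return True if `path` has at least `required_gb` gigabytes free."""
    return free_space_gb(path) >= required_gb
```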

In studio, the scripts are run with Python 3.9 and the packages listed in requirements.txt.

You must supply your own OpenAI API key in the `process_pdf.py` file. The API key is not version controlled.
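
One way to keep the key out of version control, assuming you are free to edit `process_pdf.py`, is to read it from an environment variable rather than typing it into the file. The helper name `load_openai_api_key` is hypothetical; `OPENAI_API_KEY` is the variable name the OpenAI Python library conventionally looks for.

```python
import os


def load_openai_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running."
        )
    return key
```

With this in place, the key would be set in the shell (for example `export OPENAI_API_KEY=...`) before the script is run.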

To run the RAG-LLM demo, first save the processing results to `pdf-bib`, then run `distance_to_query.py` and save its results to `rag-query`. Finally, run `llm_chat.py` to generate the final results.
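
The scripts themselves are not reproduced here, so the following is not the implementation of `distance_to_query.py`. It is a minimal, generic sketch of the retrieval step in a RAG pipeline, assuming the documents and the query have already been embedded as numeric vectors. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vec, doc_vecs, top_k=3):
    """Return the ids of the `top_k` documents closest to the query."""
    scored = [
        (cosine_distance(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort()  # smallest distance first
    return [doc_id for _, doc_id in scored[:top_k]]


# Toy embeddings standing in for real paper embeddings.
docs = {"paper_a": [1.0, 0.0], "paper_b": [0.0, 1.0], "paper_c": [1.0, 1.0]}
closest = rank_by_distance([1.0, 0.0], docs, top_k=2)
```

In a full pipeline, the text of the closest documents would then be passed to the language model as context for the user's question.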
