Metadata-Version: 2.4
Name: maalfrid_toolkit
Version: 1.2.0
Summary: Toolkit for the Målfrid project
Author-Email: Magnus Breder Birkenes <magnus.birkenes@nb.no>
License-File: LICENSE.txt
License-File: src/justext/LICENSE
Project-URL: repository, https://github.com/NationalLibraryOfNorway/maalfrid_toolkit
Requires-Python: <3.13,>=3.10
Requires-Dist: beautifulsoup4==4.12.3
Requires-Dist: certifi==2024.8.30
Requires-Dist: charset-normalizer==3.4.0
Requires-Dist: faust-cchardet==2.1.19
Requires-Dist: gielladetect==1.0.3
Requires-Dist: html5lib==1.1
Requires-Dist: idna==3.10
Requires-Dist: lxml<5.2.0
Requires-Dist: numpy==1.*
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: pytz==2024.2
Requires-Dist: requests==2.32.3
Requires-Dist: six==1.16.0
Requires-Dist: soupsieve==2.6
Requires-Dist: typing-extensions==4.12.2
Requires-Dist: tzdata==2024.2
Requires-Dist: urllib3==2.2.3
Requires-Dist: webencodings==0.5.1
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: warcio>=1.7.4
Requires-Dist: psycopg2-binary>=2.9.10
Requires-Dist: PyMuPDF==1.23.26
Requires-Dist: pdfminer-six==20201018
Requires-Dist: sentence-splitter==1.4
Requires-Dist: joblib==1.4.2
Requires-Dist: tqdm>=4.67.1
Requires-Dist: python-docx>=1.1.2
Requires-Dist: docx2txt>=0.8
Requires-Dist: simhash>=2.1.2
Requires-Dist: python-dotenv>=1.0.1
Provides-Extra: glotlid
Requires-Dist: fasttext>=0.9.3; extra == "glotlid"
Requires-Dist: huggingface-hub>=0.26.3; extra == "glotlid"
Description-Content-Type: text/markdown

# Maalfrid toolkit

__maalfrid_toolkit__ is a Python package designed for crawling and extracting natural language data from documents found on the web (HTML, PDF, DOC). It is primarily used in the Målfrid project, a collaboration between the National Library of Norway and The Language Council of Norway, which aims to measure the usage of the two official Norwegian language forms, Bokmål and Nynorsk, on Norwegian public sector websites. While the toolkit has a particular emphasis on the Nordic countries, it supports text extraction and language detection for more than 60 languages.

It builds upon:
- [wget](https://www.gnu.org/software/wget/) and [(custom) browsertrix](https://github.com/Sprakbanken/browsertrix-crawler/) for crawling
- [JusText](https://github.com/miso-belica/jusText) for HTML boilerplate removal
- [Notram PDF text extraction](https://github.com/NbAiLab/notram/) from NB AI-lab
- DOC extraction using docx2txt and antiword
- [Gielladetect/pytextcat](https://github.com/NationalLibraryOfNorway/gielladetect) and [GlotLID V3](https://huggingface.co/cis-lmu/glotlid) for language detection
- [Simhash](https://github.com/1e0ng/simhash) for near-duplicate detection

# Install
## Install with pip

```bash
pip install maalfrid_toolkit
```

With GlotLID / fasttext (optional; see the OS-level dependencies section below for caveats):

```bash
# Quotes keep shells such as zsh from interpreting the square brackets
pip install "maalfrid_toolkit[glotlid]"
```

## Install with pdm

```bash
pdm install
```

## Test run pipeline

### On HTML

```bash
python -m maalfrid_toolkit.pipeline --url https://www.nb.no/utstilling/opplyst-glimt-fra-en-kulturhistorie/ --to_jsonl
```

### On PDF

```bash
python -m maalfrid_toolkit.pipeline --url https://www.nb.no/sbfil/dok/nst_taledat_dk.pdf --to_jsonl
```

### On DOC

```bash
python -m maalfrid_toolkit.pipeline --url https://www.nb.no/content/uploads/2018/11/Søknadsskjema-Bokhylla-2.doc --to_jsonl
```

### On WARC file (e.g. from self-crawled material)

```bash
python -m maalfrid_toolkit.pipeline --warc_file example_com-00000.warc.gz --calculate_simhash --to_jsonl > warc.jsonl
```
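
The `--calculate_simhash` flag relates to near-duplicate detection. As a rough illustration of the general simhash idea (not the toolkit's own code, which builds on the Simhash library linked above), the sketch below computes a 64-bit fingerprint over word tokens and compares two fingerprints by Hamming distance. The function names, the tokenisation, and the use of MD5 as the token hash are assumptions made for this example only.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Illustrative simhash over lowercase word tokens (hypothetical helper)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # Stable per-token hash; MD5 is used only because it is deterministic.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (value >> i) & 1 else -1
    # A bit is set in the fingerprint where the positive votes win.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be treated as near-duplicates when the Hamming distance between their fingerprints falls below a threshold chosen for the corpus at hand.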

### On sitemap

```bash
python -m maalfrid_toolkit.pipeline --url https://example.com/sitemap.xml --crawl_sitemap --to_jsonl > example.jsonl
```
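
The redirected output files from the commands above can be inspected with a few lines of Python. This is a minimal sketch that assumes the file contains one JSON object per line; the field names are not documented here, so the example only counts which keys occur rather than relying on any particular one. `parse_jsonl` and `key_counts` are hypothetical helper names.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def key_counts(records: Iterable[dict]) -> Counter:
    """Count how often each top-level key appears across the records."""
    counts: Counter = Counter()
    for record in records:
        counts.update(record.keys())
    return counts
```

For example, `with open("example.jsonl", encoding="utf-8") as handle: print(key_counts(parse_jsonl(handle)))` would give a quick overview of the fields that are present.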

## Database (Postgres)

If you want to store and process the data further in a database, set up a Postgres database and enter your credentials in a `.env` file in the package root directory (see `env-example`). Be sure to populate the database with the schema and indices found in `db/` before running the commands in `maalfrid_toolkit.db`.
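
To check the credentials independently of the toolkit, a connection can be opened with python-dotenv and psycopg2, both of which are installed as dependencies. The variable names below (`DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`) are placeholders assumed for this sketch; use the names given in `env-example` instead.

```python
import os


def connection_kwargs(env=os.environ) -> dict:
    """Map environment variables to psycopg2.connect() keyword arguments.

    The variable names are hypothetical; adapt them to env-example.
    """
    return {
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "5432")),
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
    }


def connect():
    """Load the .env file and open a Postgres connection."""
    # Imported here so connection_kwargs() works without these packages.
    from dotenv import load_dotenv
    import psycopg2

    load_dotenv()  # loads variables from a .env file into os.environ
    return psycopg2.connect(**connection_kwargs())
```

Calling `connect()` and running a trivial query such as `SELECT 1` is usually enough to confirm that the database is reachable.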

## OS-level dependencies for optional functionality (tested with Ubuntu 24.04)

### For fasttext (optional)

```bash
sudo apt-get install build-essential python3-dev
```

### For .doc text extraction (optional)

```bash
sudo apt-get install antiword
```

## A note on using Browsertrix

To use Browsertrix for crawling JavaScript-heavy pages and extracting text from HTML, you must currently clone a custom Browsertrix build from:

https://github.com/Sprakbanken/browsertrix-crawler/tree/add-dom-resource

Then build with Docker:

```bash
docker build -t maalfrid-browsertrix .
```

## License
GPL
