Dataset Card for AI Aktindsigt¶

Multiple web scrapes from municipality websites collected as a part of the AI-aktindsigt project.

The dataset consists of multiple scrapes of municipal websites compiled in connection with the work on the AI-aktindsigt project. The scrape is made across different domains from several different municipalities.

Dataset Description¶

Number of samples: 200.91K
Number of tokens (Llama 3): 139.23M
Average document length in tokens (min, max): 693.0064405666105 (9, 152.60K)

Dataset Structure¶

An example from the dataset looks as follows.

{
  "id": "ai-aktindsigt_0",
  "text": "Vallensbæk Stationstorv 100 2665 Vallensbæk Strand Telefon: +45 4797 4000",
  "source": "ai-aktindsigt",
  "added": "2025-03-24",
  "created": "2010-01-01, 2024-03-18",
  "token_count": 29
}

Data Fields¶

An entry in the dataset consists of the following fields:

id (str): An unique identifier for each document.
text(str): The content of the document.
source (str): The source of the document.
added (str): An date for when the document was added to this collection.
created (str): An date range for when the document was originally created.
token_count (int): The number of tokens in the sample computed using the Llama 8B tokenizer

Dataset Statistics¶

Additional Information¶

Sourced data¶

This dataset is derived from AI-aktindsigt/Skrabet_kommunale_hjemmesider

Citation Information¶

No citation is applicable for this work. We recommend citing the huggingface repository.