Danish Foundation Models

Dataset Card for Nota lyd- og tekstdata (Tekst only)

The text-only part of the Nota lyd- og tekstdata dataset.

Nota lyd- og tekstdata (Tekst only) is the text portion of a read-aloud dataset. It consists of a small number of very long texts.

Dataset Description

Number of samples: 446
Number of tokens (Llama 3): 7.30M
Average document length in tokens (min, max): 16.37K (4.48K, 107.26K)
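
As a quick sanity check, the reported total should roughly equal the number of samples times the average document length. The sketch below uses only the rounded figures listed above, so it can confirm agreement only to the stated precision; the variable names are illustrative.

```python
# Rounded figures copied from the statistics above.
num_samples = 446
avg_tokens_per_doc = 16.37e3  # 16.37K tokens

# Total tokens implied by the sample count and the average length.
total = num_samples * avg_tokens_per_doc

print(f"{total / 1e6:.2f}M")  # compare against the reported 7.30M
```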

Dataset Structure

An example from the dataset looks as follows.

{
  "id": "INSL20160004",
  "text": "Inspiration nr. 4, 2016\nBiblioteksbetjening \nTelefon: 39 13 46 00\nEmail: biblioteket@nota.dk\nInspira[...]",
  "source": "nota",
  "added": "2025-02-03",
  "created": "2016-01-01, 2016-12-31",
  "token_count": 69977
}

Data Fields

An entry in the dataset consists of the following fields:

id (str): A unique identifier for each document.
text (str): The content of the document.
source (str): The source of the document.
added (str): The date when the document was added to this collection.
created (str): The date range in which the document was originally created.
token_count (int): The number of tokens in the sample, computed using the Llama 3 8B tokenizer.
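
The field layout above can be checked programmatically. The following is a minimal sketch, assuming each entry is available as a plain Python dict and that `created` always holds two ISO dates separated by a comma, as in the example entry. The helper names `parse_created` and `validate_record` are hypothetical and not part of any dataset tooling, and the `text` value is a shortened placeholder rather than the real document.

```python
from datetime import date

# Field names and types as listed in the Data Fields section.
EXPECTED_TYPES = {
    "id": str,
    "text": str,
    "source": str,
    "added": str,
    "created": str,
    "token_count": int,
}


def parse_created(created: str) -> tuple[date, date]:
    """Split a 'YYYY-MM-DD, YYYY-MM-DD' string into (start, end) dates."""
    start, end = (date.fromisoformat(part.strip()) for part in created.split(","))
    return start, end


def validate_record(record: dict) -> None:
    """Raise an exception if the record does not match the documented fields."""
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            raise KeyError(f"missing field: {name}")
        if not isinstance(record[name], expected):
            raise TypeError(f"{name} should be {expected.__name__}")
    date.fromisoformat(record["added"])  # must be a valid ISO date
    start, end = parse_created(record["created"])
    if start > end:
        raise ValueError("created range starts after it ends")


# Metadata copied from the example entry; the text is a placeholder (assumption).
record = {
    "id": "INSL20160004",
    "text": "Inspiration nr. 4, 2016",
    "source": "nota",
    "added": "2025-02-03",
    "created": "2016-01-01, 2016-12-31",
    "token_count": 69977,
}

validate_record(record)
print(parse_created(record["created"]))
```

Parsing `created` into a pair of `date` objects makes it straightforward to filter documents by creation period without string comparisons.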

Additional Processing

Dataset Statistics

Additional Information

Citation Information