Dataset Card for Nota lyd- og tekstdata (Tekst only)

The text-only part of the Nota lyd- og tekstdata dataset.

Nota lyd- og tekstdata (Tekst only) is a read-aloud dataset consisting of a small number of very long texts.

Dataset Description

  • Number of samples: 446
  • Number of tokens (Llama 3): 7.30M
  • Average document length in tokens (min, max): 16.37K (4.48K, 107.26K)
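As a quick consistency check on the figures above, the average document length should equal the total token count divided by the number of samples. The sketch below assumes the rounded total of 7.30M tokens, so the result is only accurate to the precision of that rounding.

```python
# Figures copied from the list above; the token total is rounded to 7.30M.
num_samples = 446
total_tokens = 7_300_000

# Mean document length in tokens.
mean_tokens = total_tokens / num_samples
print(f"{mean_tokens / 1000:.2f}K tokens per document")  # prints "16.37K tokens per document"
```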

Dataset Structure

An example from the dataset looks as follows.

{
  "id": "INSL20160004",
  "text": "Inspiration nr. 4, 2016\nBiblioteksbetjening \nTelefon: 39 13 46 00\nEmail: biblioteket@nota.dk\nInspira[...]",
  "source": "nota",
  "added": "2025-02-03",
  "created": "2016-01-01, 2016-12-31",
  "token_count": 69977
}

Data Fields

An entry in the dataset consists of the following fields:

  • id (str): A unique identifier for each document.
  • text (str): The content of the document.
  • source (str): The source of the document.
  • added (str): The date when the document was added to this collection.
  • created (str): The date range in which the document was originally created.
  • token_count (int): The number of tokens in the sample, computed using the Llama 3 8B tokenizer.
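The fields above can be checked with a few lines of Python. This is a minimal sketch, not part of the dataset tooling: the helper names parse_created and check_record are hypothetical, and it assumes that created always holds two ISO dates separated by a comma, as in the example record.

```python
from datetime import date

# Example record copied from the card; the text is truncated there as well.
record = {
    "id": "INSL20160004",
    "text": "Inspiration nr. 4, 2016\nBiblioteksbetjening \nTelefon: 39 13 46 00\nEmail: biblioteket@nota.dk\nInspira[...]",
    "source": "nota",
    "added": "2025-02-03",
    "created": "2016-01-01, 2016-12-31",
    "token_count": 69977,
}

# Expected Python type for each field, following the field list above.
EXPECTED_TYPES = {
    "id": str,
    "text": str,
    "source": str,
    "added": str,
    "created": str,
    "token_count": int,
}


def parse_created(created: str) -> tuple[date, date]:
    """Split a 'start, end' string into two dates (hypothetical helper)."""
    start_str, end_str = (part.strip() for part in created.split(","))
    return date.fromisoformat(start_str), date.fromisoformat(end_str)


def check_record(rec: dict) -> None:
    """Raise if a field is missing, has the wrong type, or holds a bad date."""
    for field, expected in EXPECTED_TYPES.items():
        if not isinstance(rec[field], expected):
            raise TypeError(f"{field} should be {expected.__name__}")
    start, end = parse_created(rec["created"])
    if start > end:
        raise ValueError("created range starts after it ends")
    date.fromisoformat(rec["added"])  # raises ValueError if malformed


check_record(record)
start, end = parse_created(record["created"])
print(start, end)
```

Parsing created into real date objects up front makes it straightforward to filter or sort documents by creation period later on.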

Additional Processing

Dataset Statistics

Additional Information

Citation Information