DFM Data: A Composite Dataset for Danish LLMs.

This page provides a detailed description of the composite dataset used to train large language models developed by Danish Foundation Models. The dataset is curated to offer a diverse and comprehensive corpus across multiple domains, including legal, financial, and literary texts, with the primary intention of developing language models for Danish.

Dataset Description

Summary

The DFM Data is a collection of datasets used to train Danish Foundation Models. This repository provides documentation for the data and follows FAIR data practices.

Curation Rationale

These datasets were collected and curated with the intention of developing language models for Danish.

Data Collection and Processing

The dataset was constructed by collecting and integrating text from a wide variety of public and partner-provided sources. The raw data was subjected to a standardized cleaning pipeline, which included steps such as deduplication and filtering of low-quality content, to prepare it for large-scale language model training.
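The exact DFM pipeline is not specified here, but the two steps named above can be sketched in a minimal form: exact deduplication via content hashing, plus an illustrative length-based quality filter (the threshold below is an assumption, not the DFM value).

```python
import hashlib

def clean_corpus(documents, min_chars=50):
    """Deduplicate documents and drop very short ones.

    `min_chars` is an illustrative quality threshold, not the value
    used by the actual DFM pipeline.
    """
    seen = set()
    cleaned = []
    for text in documents:
        # Normalise whitespace so trivially different copies collide.
        normalised = " ".join(text.split())
        if len(normalised) < min_chars:
            continue  # low-quality: too short to be useful training text
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a document already kept
        seen.add(digest)
        cleaned.append(text)
    return cleaned
```

Production pipelines typically extend this with near-duplicate detection (e.g. MinHash) and model- or heuristic-based quality scoring; the sketch shows only the exact-match case.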

Dataset Statistics

  • Number of samples: 230.07M
  • Number of tokens (Llama 3): 430.24B
  • Average document length in tokens (min, max): 1.87K (1, 51.77M)
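As a quick sanity check, the reported average document length follows directly from the totals above:

```python
# Totals as reported in the statistics above.
n_samples = 230.07e6   # 230.07M documents
n_tokens = 430.24e9    # 430.24B Llama 3 tokens

# Implied average document length in tokens.
avg_tokens_per_doc = n_tokens / n_samples
print(round(avg_tokens_per_doc))  # roughly 1.87K tokens per document
```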

The figure below shows per-dataset histograms of document lengths.

Languages

This dataset includes the following languages:

  • Danish
  • English
  • Swedish
  • Norwegian Bokmål
  • Norwegian Nynorsk

Below is a visualisation of the main languages in each of the datasets.

Domains

This dataset consists of data from various domains (e.g., legal, books, social media). The following table and figure give an overview of the relative distributions of these domains.

Domain Sources N. Tokens
Legal retsinformationdk, retspraksis, skat, fm-udgivelser, eur-lex-sum-da, miljoeportalen, cellar, domsdatabasen, caselaw_access_project_filtered, uspto_filtered 162.19B
Other dannet, depbank, synne, dsk-cbrain, dsk-hofor, dsk-plesner, dsk-vitec, ncc_parliament, data_provenance_initiative_filtered, public_domain_review_filtered, stackv2_edu_filtered 63.56B
Scientific arxiv_abstracts_filtered, arxiv_papers_filtered, peS2o_filtered 46.15B
Books adl, gutenberg, jvj, relig, wikibooks, memo, ncc_books, dbc-abstracts, dbc-reviews, danish-pd, grundtvig, biodiversity_heritage_library_filtered, doab_filtered, library_of_congress_filtered, libretexts_filtered, oercommons_filtered, pre_1929_books_filtered, pressbooks_filtered, project_gutenberg_filtered 37.19B
Medical health_hovedstaden, pubmed_filtered 35.35B
Conversation ep, ft, naat, spont, danske-taler, opensubtitles, github_archive_filtered, stackexchange_filtered, ubuntu_irc_filtered 34.30B
Encyclopedic wiki, wikisource, dbc-faktalink, dbc-forfatterweb, wikimedia_filtered, wikiteam_filtered 17.21B
Web dsk-alexandra, dsk-atp, dsk-salling, dsk-vejle, ai-aktindsigt, ncc_maalfrid, cccc_filtered 14.20B
Governmental plandata, regulations_filtered, uk_hansard_filtered, usgpo_filtered 12.10B
Speeches youtube_filtered 4.07B
Financial cvr-reports 2.32B
News tv2r, dsk-danskerhverv, dsk-dkmedier, dsk-ida, dsk-odense, nordjyllandnews, ncc_newspaper, enevaeldens_nyheder, news_filtered 1.22B
Social Media hest 389.32M
Readaloud nota 7.30M
Technical python_enhancement_proposals_filtered 2.54M
Dialect botxt 847.97K
Total 430.24B
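The token counts in the table can be expressed as shares of the corpus. The sketch below uses the per-domain counts from the table (in billions of tokens; the sub-billion domains are omitted for brevity):

```python
# Per-domain token counts (billions), taken from the table above.
domain_tokens = {
    "Legal": 162.19,
    "Other": 63.56,
    "Scientific": 46.15,
    "Books": 37.19,
    "Medical": 35.35,
    "Conversation": 34.30,
    "Encyclopedic": 17.21,
    "Web": 14.20,
    "Governmental": 12.10,
    "Speeches": 4.07,
    "Financial": 2.32,
    "News": 1.22,
}
total = 430.24  # total billions of tokens, as reported above

# Share of the corpus contributed by each domain, largest first.
shares = {d: 100 * t / total for d, t in domain_tokens.items()}
for domain, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{domain:<13} {pct:5.1f}%")
```

Legal text alone accounts for roughly 38% of the corpus, which is worth keeping in mind when interpreting downstream model behaviour.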

Licensing

The following gives an overview of the licensing in DFMv1. For the exact license of an individual dataset, consult that dataset's own documentation via the sources listed in the table. These licenses apply to the constituent data, i.e., the text. The collection of datasets (metadata, quality control, etc.) is licensed under CC-0.

License Sources N. Tokens
Public Domain danish-pd, python_enhancement_proposals_filtered, regulations_filtered, ubuntu_irc_filtered, usgpo_filtered, uspto_filtered 153.74B
CC-BY-SA 4.0 depbank, jvj, tv2r, fm-udgivelser, eur-lex-sum-da, memo, cellar, doab_filtered, libretexts_filtered, news_filtered, oercommons_filtered, peS2o_filtered, pressbooks_filtered, public_domain_review_filtered, pubmed_filtered, stackexchange_filtered, wikimedia_filtered, wikiteam_filtered, youtube_filtered 122.21B
CC-0 adl, botxt, ep, ft, hest, naat, relig, retspraksis, skat, spont, synne, wiki, wikibooks, wikisource, danske-taler, miljoeportalen, nordjyllandnews, nota, opensubtitles, ncc_books, ncc_newspaper, health_hovedstaden, grundtvig, enevaeldens_nyheder, arxiv_abstracts_filtered, arxiv_papers_filtered, biodiversity_heritage_library_filtered, caselaw_access_project_filtered, cccc_filtered, data_provenance_initiative_filtered, library_of_congress_filtered, pre_1929_books_filtered, project_gutenberg_filtered 74.04B
Various - MIT, BSD-3-Clause, Apache-2.0, etc. github_archive_filtered, stackv2_edu_filtered 72.60B
Verbal agreement cvr-reports 2.32B
Open Parliament License uk_hansard_filtered 2.01B
Written agreement (public models, private data) plandata, dbc-abstracts, dbc-faktalink, dbc-forfatterweb, dbc-reviews 1.78B
Other (No attribution required) retsinformationdk, domsdatabasen 904.61M
Other (Attribution required) dannet, gutenberg, ai-aktindsigt, ncc_maalfrid, ncc_parliament 515.61M
DSK-1 dsk-alexandra, dsk-atp, dsk-cbrain, dsk-danskerhverv, dsk-dkmedier, dsk-hofor, dsk-ida, dsk-odense, dsk-plesner, dsk-salling, dsk-vejle, dsk-vitec 113.35M
Total 430.24B
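When composing a training mix, one may want to restrict the corpus to sources under a particular set of licenses. A minimal sketch of that selection, using a small illustrative subset of the source-to-license mapping from the table above:

```python
# Illustrative subset of the license table above (license -> sources).
license_sources = {
    "Public Domain": ["danish-pd", "usgpo_filtered", "uspto_filtered"],
    "CC-0": ["adl", "botxt", "wiki", "opensubtitles"],
    "CC-BY-SA 4.0": ["depbank", "jvj", "tv2r", "pubmed_filtered"],
    "Verbal agreement": ["cvr-reports"],
}

def sources_under(allowed):
    """Return all sources whose license is in the `allowed` set."""
    return sorted(
        src
        for lic, srcs in license_sources.items()
        if lic in allowed
        for src in srcs
    )

# Example: keep only sources with no attribution requirement.
open_sources = sources_under({"Public Domain", "CC-0"})
print(open_sources)
```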

Additional Information

Citation Information

If you use a model trained on this dataset, please cite the associated DFM project or research paper when it becomes available. A BibTeX entry will be provided here upon the official release of a corresponding paper.

Disclaimer

We do not own any of the text from which the data has been extracted. If you believe that we do not have the right to train on any of the datasets listed, please contact us.

Notice and take down policy

Notice: If you believe that our data contains material that you own and that should therefore not be included in the training of LLMs, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

You can contact us by opening an issue.

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
