DFM Data: A Composite Dataset for Danish LLMs.

This page provides a detailed description of the composite dataset used to train large language models developed by Danish Foundation Models. The dataset is curated to offer a diverse and comprehensive corpus across multiple domains, including legal, financial, and literary texts, with the primary intention of developing language models for Danish.

Dataset Description

Summary

The DFM Data is a collection of datasets used to train Danish Foundation Models. This repository provides documentation for the data and follows FAIR data practices.

Curation Rationale

These datasets were collected and curated with the intention of developing language models for Danish.

Data Collection and Processing

The dataset was constructed by collecting and integrating text from a wide variety of public and partner-provided sources. The raw data was subjected to a standardized cleaning pipeline, which included steps such as deduplication and filtering of low-quality content, to prepare it for large-scale language model training.
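The exact DFM pipeline is not specified here, but the two steps named above can be sketched in a minimal form: exact deduplication via content hashing, plus an illustrative length-based quality filter (the threshold below is an assumption, not the DFM value).

```python
import hashlib

def clean_corpus(documents, min_chars=50):
    """Deduplicate documents and drop very short ones.

    `min_chars` is an illustrative quality threshold, not the value
    used by the actual DFM pipeline.
    """
    seen = set()
    cleaned = []
    for text in documents:
        # Normalise whitespace so trivially different copies collide.
        normalised = " ".join(text.split())
        if len(normalised) < min_chars:
            continue  # low-quality: too short to be useful training text
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a document already kept
        seen.add(digest)
        cleaned.append(text)
    return cleaned
```

Production pipelines typically extend this with near-duplicate detection (e.g. MinHash) and model- or heuristic-based quality scoring; the sketch shows only the exact-match case.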

Dataset Statistics

  • Number of samples: 230.07M
  • Number of tokens (Llama 3): 430.24B
  • Average document length in tokens (min, max): 1.87K (1, 51.77M)
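As a quick sanity check, the reported average document length follows directly from the totals above:

```python
# Totals as reported in the statistics above.
n_samples = 230.07e6   # 230.07M documents
n_tokens = 430.24e9    # 430.24B Llama 3 tokens

# Implied average document length in tokens.
avg_tokens_per_doc = n_tokens / n_samples
print(round(avg_tokens_per_doc))  # roughly 1.87K tokens per document
```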

The figure below shows per-dataset histograms of document lengths.

Languages

This dataset includes the following languages:

  • Danish
  • English
  • Swedish
  • Norwegian Bokmål
  • Norwegian Nynorsk

Below is a visualisation of the main languages in each of the datasets.

Domains

This dataset consists of data from various domains (e.g., legal, books, social media). The following table and figure give an overview of the relative distributions of these domains.

Domain Sources N. Tokens
Legal retsinformationdk, retspraksis, skat, fm-udgivelser, eur-lex-sum-da, miljoeportalen, cellar, domsdatabasen, caselaw_access_project_filtered, uspto_filtered 162.19B
Other dannet, depbank, synne, dsk-cbrain, dsk-hofor, dsk-plesner, dsk-vitec, ncc_parliament, data_provenance_initiative_filtered, public_domain_review_filtered, stackv2_edu_filtered 63.56B
Scientific arxiv_abstracts_filtered, arxiv_papers_filtered, peS2o_filtered 46.15B
Books adl, gutenberg, jvj, relig, wikibooks, memo, ncc_books, dbc-abstracts, dbc-reviews, danish-pd, grundtvig, biodiversity_heritage_library_filtered, doab_filtered, library_of_congress_filtered, libretexts_filtered, oercommons_filtered, pre_1929_books_filtered, pressbooks_filtered, project_gutenberg_filtered 37.19B
Medical health_hovedstaden, pubmed_filtered 35.35B
Conversation ep, ft, naat, spont, danske-taler, opensubtitles, github_archive_filtered, stackexchange_filtered, ubuntu_irc_filtered 34.30B
Encyclopedic wiki, wikisource, dbc-faktalink, dbc-forfatterweb, wikimedia_filtered, wikiteam_filtered 17.21B
Web dsk-alexandra, dsk-atp, dsk-salling, dsk-vejle, ai-aktindsigt, ncc_maalfrid, cccc_filtered 14.20B
Governmental plandata, regulations_filtered, uk_hansard_filtered, usgpo_filtered 12.10B
Speeches youtube_filtered 4.07B
Financial cvr-reports 2.32B
News tv2r, dsk-danskerhverv, dsk-dkmedier, dsk-ida, dsk-odense, nordjyllandnews, ncc_newspaper, enevaeldens_nyheder, news_filtered 1.22B
Social Media hest 389.32M
Readaloud nota 7.30M
Technical python_enhancement_proposals_filtered 2.54M
Dialect botxt 847.97K
Total 430.24B
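The token counts in the table can be expressed as shares of the corpus. The sketch below uses the per-domain counts from the table (in billions of tokens; the sub-billion domains are omitted for brevity):

```python
# Per-domain token counts (billions), taken from the table above.
domain_tokens = {
    "Legal": 162.19,
    "Other": 63.56,
    "Scientific": 46.15,
    "Books": 37.19,
    "Medical": 35.35,
    "Conversation": 34.30,
    "Encyclopedic": 17.21,
    "Web": 14.20,
    "Governmental": 12.10,
    "Speeches": 4.07,
    "Financial": 2.32,
    "News": 1.22,
}
total = 430.24  # total billions of tokens, as reported above

# Share of the corpus contributed by each domain, largest first.
shares = {d: 100 * t / total for d, t in domain_tokens.items()}
for domain, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{domain:<13} {pct:5.1f}%")
```

Legal text alone accounts for roughly 38% of the corpus, which is worth keeping in mind when interpreting downstream model behaviour.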

Licensing

The following gives an overview of the licensing in DFMv1. For the exact license of an individual dataset, consult that dataset's own documentation via the sources listed in the table. These licenses apply to the constituent data, i.e., the text. The collection of datasets (metadata, quality control, etc.) is licensed under CC-0.

License Sources N. Tokens
Public Domain danish-pd, python_enhancement_proposals_filtered, regulations_filtered, ubuntu_irc_filtered, usgpo_filtered, uspto_filtered 153.74B
CC-BY-SA 4.0 depbank, jvj, tv2r, fm-udgivelser, eur-lex-sum-da, memo, cellar, doab_filtered, libretexts_filtered, news_filtered, oercommons_filtered, peS2o_filtered, pressbooks_filtered, public_domain_review_filtered, pubmed_filtered, stackexchange_filtered, wikimedia_filtered, wikiteam_filtered, youtube_filtered 122.21B
CC-0 adl, botxt, ep, ft, hest, naat, relig, retspraksis, skat, spont, synne, wiki, wikibooks, wikisource, danske-taler, miljoeportalen, nordjyllandnews, nota, opensubtitles, ncc_books, ncc_newspaper, health_hovedstaden, grundtvig, enevaeldens_nyheder, arxiv_abstracts_filtered, arxiv_papers_filtered, biodiversity_heritage_library_filtered, caselaw_access_project_filtered, cccc_filtered, data_provenance_initiative_filtered, library_of_congress_filtered, pre_1929_books_filtered, project_gutenberg_filtered 74.04B
Various - MIT, BSD-3-Clause, Apache-2.0, etc. github_archive_filtered, stackv2_edu_filtered 72.60B
Verbal agreement cvr-reports 2.32B
Open Parliament License uk_hansard_filtered 2.01B
Written agreement (public models, private data) plandata, dbc-abstracts, dbc-faktalink, dbc-forfatterweb, dbc-reviews 1.78B
Other (No attribution required) retsinformationdk, domsdatabasen 904.61M
Other (Attribution required) dannet, gutenberg, ai-aktindsigt, ncc_maalfrid, ncc_parliament 515.61M
DSK-1 dsk-alexandra, dsk-atp, dsk-cbrain, dsk-danskerhverv, dsk-dkmedier, dsk-hofor, dsk-ida, dsk-odense, dsk-plesner, dsk-salling, dsk-vejle, dsk-vitec 113.35M
Total 430.24B
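When composing a training mix, one may want to restrict the corpus to sources under a particular set of licenses. A minimal sketch of that selection, using a small illustrative subset of the source-to-license mapping from the table above:

```python
# Illustrative subset of the license table above (license -> sources).
license_sources = {
    "Public Domain": ["danish-pd", "usgpo_filtered", "uspto_filtered"],
    "CC-0": ["adl", "botxt", "wiki", "opensubtitles"],
    "CC-BY-SA 4.0": ["depbank", "jvj", "tv2r", "pubmed_filtered"],
    "Verbal agreement": ["cvr-reports"],
}

def sources_under(allowed):
    """Return all sources whose license is in the `allowed` set."""
    return sorted(
        src
        for lic, srcs in license_sources.items()
        if lic in allowed
        for src in srcs
    )

# Example: keep only sources with no attribution requirement.
open_sources = sources_under({"Public Domain", "CC-0"})
print(open_sources)
```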

Additional Information

Citation Information

If you use a model trained on this dataset, please cite the associated DFM project or research paper when it becomes available. A BibTeX entry will be provided here upon the official release of a corresponding paper.

Disclaimer

We do not own any of the text from which the data has been extracted. If you believe that we do not have the right to train on any of the datasets listed, please contact us.

Notice and take down policy

Notice: If you believe that our data contains material that you own and that should therefore not be included in the training of LLMs, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.

You can contact us by opening an issue.

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the corpus.
