DCC v1

The DCC is a composite corpus consisting of the following subcorpora. For more information about a specific subcorpus, see its individual datasheet.

Name             Description             Size             Open Access   Novel Corpus
DAGW             Danish Gigaword         1B tokens
reddit-da        Danish Reddit           <0.1B tokens
HopeTwitter      Danish Tweets           0.48B tokens
DaNews           Danish newspapers       0.5B tokens
Netarkivet Text  Danish internet         >100B tokens
DaRadio          Danish talk radio       140,000 hours
DaTV             Danish subtitled TV     900 hours

Collaborators and Data Owners

Data are provided in agreement with the data owners and data collaborators. The data are generally accessible to the research collaborators, though each data agreement has its own access restrictions and might not cover all research collaborators. Access restrictions are specified on the server hosting the data, in accordance with the data agreements.

  • Data Owners
      • Newspapers / dailies
      • Danmarks Statistik
      • Netarkivet
  • Data Collaborators
      • Det Kongelige Bibliotek
      • Infomedia
  • Research Collaborators
      • Center for Humanities Computing, Aarhus Universitet
      • Alexandra Institutet
      • Peter Schneider-Kamp, Syddansk Universitet