Munin 1.0 release note

Today, Danish Foundation Models releases the Munin 1.0 family of language models, post-trained on top of several best-in-class open models. These models are trained using a combination of existing open and newly developed datasets.

Models and results

The Munin 1.0 models are post-trained from open-weights base models from Swiss AI, Mistral, and Qwen. All base models were originally released under the Apache 2.0 license, and our models carry the same license.

| Developer | Base model | Munin model | Country | Openness |
|---|---|---|---|---|
| Swiss AI | swiss-ai/Apertus-8B-2509 | munin-apertus-8b | Europe/Switzerland | Fully open |
| Mistral | mistralai/Ministral-3-8B-Base-2512 | munin-ministral3-8B | Europe/France | Open weights |
| Qwen | Qwen/Qwen3.5-9B-Base | munin-qwen3.5-9B | China | Open weights |

Munin 1.0 is a text-only post-training release, so the comparisons below focus on text benchmarks. Some upstream instruct models advertise image-to-text or other multimodal capabilities, but those capabilities are not supported by the models released here.

We evaluated the Munin models against the instruction-tuned models released by the original model developers. Since Munin is trained from the same base models, this is an apples-to-apples comparison of two independent post-training efforts: the original developer's post-training and our Danish-focused Munin training.

The aggregate scores below are unweighted means across evaluations in each task group. Scores are percentages, and standard errors are percentage points. Blue marks the better Original/Munin score within each model family, or both if they are tied within uncertainty. Bold scores are not substantially below the row-best score. See the full benchmark results for the individual evaluations behind each aggregate.
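As a rough illustration of how such aggregates and comparisons can be read, the sketch below averages per-benchmark scores with equal weight and checks whether two scores are tied. The tie rule used here (gap no larger than the combined standard error) is an assumption made for this example, not the rule documented for this release, and the function names are hypothetical. The per-benchmark inputs are placeholders; only the two comparisons at the end use values from the table.

```python
from math import sqrt
from statistics import mean


def unweighted_mean(scores):
    """Average per-benchmark scores, giving every benchmark equal weight."""
    return mean(scores)


def tied_within_uncertainty(a, se_a, b, se_b):
    """Assumed tie rule: the gap does not exceed the combined standard error."""
    return abs(a - b) <= sqrt(se_a ** 2 + se_b ** 2)


# Placeholder per-benchmark scores (NOT real results), one per benchmark in a group.
placeholder_scores = [60.0, 70.0, 80.0]
print(unweighted_mean(placeholder_scores))

# Aggregates from the table, Qwen 9B, Original vs Munin.
# Danish Knowledge: 76.0 ± 0.5 vs 77.6 ± 0.5
print(tied_within_uncertainty(76.0, 0.5, 77.6, 0.5))
# Danish Sentiment Classification: 64.9 ± 0.9 vs 64.7 ± 1.0
print(tied_within_uncertainty(64.9, 0.9, 64.7, 1.0))
```

Under this assumed rule the Danish Knowledge gap counts as a real difference, while the Sentiment Classification scores count as tied. A stricter or looser threshold would move borderline rows.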

Task aggregate scores

| Suite | Task | Metric | Apertus 8B Original | Apertus 8B Munin | Ministral 8B Original | Ministral 8B Munin | Qwen 9B Original | Qwen 9B Munin |
|---|---|---|---|---|---|---|---|---|
| Danish | Common-sense Reasoning | MCC | 33.1 ± 0.9 | 29.4 ± 1.3 | 52.4 ± 1.2 | 50.2 ± 1.3 | 64.6 ± 0.7 | 62.2 ± 0.9 |
| Danish | Grammatical Error Detection | micro-F1 | 18.0 ± 1.3 | 17.4 ± 1.1 | 21.7 ± 2.0 | 17.7 ± 0.7 | 20.4 ± 1.2 | 20.7 ± 1.1 |
| Danish | Instruction-following | Accuracy | 69.0 ± 1.1 | 51.4 ± 1.4 | 66.7 ± 1.3 | 74.4 ± 0.9 | 81.6 ± 0.8 | 77.9 ± 0.9 |
| Danish | Knowledge | MCC | 58.9 ± 0.7 | 62.3 ± 0.7 | 73.6 ± 0.5 | 68.5 ± 0.6 | 76.0 ± 0.5 | 77.6 ± 0.5 |
| Danish | Linguistic Acceptability | MCC | 33.0 ± 1.2 | 29.3 ± 2.4 | 43.4 ± 1.9 | 18.9 ± 3.1 | 49.2 ± 1.7 | 52.2 ± 1.3 |
| Danish | Multiple-choice Reading Comprehension | MCC | 67.1 ± 1.0 | 66.0 ± 2.0 | 85.9 ± 1.4 | 84.4 ± 1.1 | 87.2 ± 1.2 | 87.3 ± 1.3 |
| Danish | Named Entity Recognition | micro-F1 | 49.3 ± 1.4 | 47.6 ± 1.3 | 61.1 ± 1.0 | 51.4 ± 1.8 | 69.1 ± 1.2 | 69.6 ± 1.2 |
| Danish | Natural Language Inference | MCC | 48.8 ± 2.3 | 52.1 ± 2.6 | 25.8 ± 1.6 | 58.2 ± 2.2 | 53.8 ± 1.9 | 65.6 ± 2.0 |
| Danish | Reading Comprehension | F1 | 70.8 ± 0.5 | 69.4 ± 0.6 | 69.7 ± 0.7 | 71.2 ± 0.8 | 70.8 ± 0.6 | 72.0 ± 0.7 |
| Danish | Sentiment Classification | MCC | 57.9 ± 1.0 | 54.3 ± 1.1 | 60.4 ± 1.0 | 59.6 ± 1.4 | 64.9 ± 0.9 | 64.7 ± 1.0 |
| Danish | Summarization | chrF++ | 37.6 ± 0.2 | 36.9 ± 0.2 | 35.1 ± 0.3 | 37.0 ± 0.2 | 36.5 ± 0.3 | 36.7 ± 0.4 |
| Danish | Word-in-Context | MCC | 11.8 ± 2.2 | 8.7 ± 3.5 | 29.9 ± 1.7 | 23.3 ± 3.2 | 44.6 ± 2.1 | 40.1 ± 3.5 |
| English | Common-sense Reasoning | Accuracy | 58.7 ± 0.5 | 23.2 ± 0.4 | 73.1 ± 0.4 | 59.6 ± 0.5 | 90.0 ± 0.3 | 85.7 ± 0.3 |
| English | Instruction-following | Accuracy | 73.3 ± 1.9 | 54.7 ± 2.0 | 70.4 ± 1.8 | 69.8 ± 1.9 | 89.6 ± 1.5 | 78.6 ± 1.8 |
| English | Knowledge | Accuracy | 50.3 ± 0.5 | 41.9 ± 0.5 | 81.7 ± 0.3 | 73.0 ± 0.3 | 79.2 ± 0.2 | 82.4 ± 0.2 |
| English | Long-context | Accuracy | 34.6 ± 2.1 | 35.8 ± 2.1 | 51.4 ± 2.2 | 49.4 ± 2.2 | 67.2 ± 2.1 | 54.6 ± 2.2 |
| English | Math | Accuracy | 68.1 ± 1.3 | 56.7 ± 1.4 | 92.2 ± 0.7 | 82.3 ± 1.1 | 94.8 ± 0.6 | 92.2 ± 0.7 |
| English | Truthfulness | Accuracy | 16.8 ± 1.3 | 15.7 ± 1.3 | 64.7 ± 1.7 | 63.3 ± 1.7 | 78.1 ± 1.4 | 74.2 ± 1.5 |
| Agentic | Code | pass@1 | 46.8 ± 2.5 | 39.2 ± 2.4 | 75.0 ± 2.1 | 49.2 ± 2.3 | 83.0 ± 1.8 | 77.2 ± 2.1 |
| Agentic | Tool Calling | Accuracy | 52.4 ± 0.8 | 43.1 ± 0.8 | 75.0 ± 0.7 | 49.2 ± 0.8 | 79.4 ± 0.6 | 75.8 ± 0.7 |

What is included in each aggregate?

Danish task groupings follow the EuroEval Danish task taxonomy. Links point to papers where available, otherwise to public datasets, upstream repositories, or the EuroEval dataset construction scripts used for the benchmark. For several entries ending in -da, the link is to the original benchmark paper because the evaluated resource is a Danish translated or otherwise localized variant. Some Danish variants, including ifeval-da, are adaptations rather than simple direct translations.

| Suite | Aggregate task | Benchmarks |
|---|---|---|
| Danish | Common-sense Reasoning | goldenswag-da, hellaswag-da, winogrande-da |
| Danish | Grammatical Error Detection | gerlangmod-da |
| Danish | Instruction-following | ifeval-da |
| Danish | Knowledge | arc-da, dameta, danish-citizen-tests, danske-talemaader, mmlu-da |
| Danish | Linguistic Acceptability | dala, scala-da |
| Danish | Multiple-choice Reading Comprehension | belebele-da |
| Danish | Named Entity Recognition | dane, dansk |
| Danish | Natural Language Inference | danish-entailment, danish-lexical-inference |
| Danish | Reading Comprehension | multi-wiki-qa-da, scandiqa-da |
| Danish | Sentiment Classification | angry-tweets, danish-sentiment-in-context |
| Danish | Summarization | nordjylland-news |
| Danish | Word-in-Context | danwic |
| English | Knowledge | ARC-C, ARC-E, MMLU, MMLU-Pro |
| English | Other task groups | HellaSwag, IFEval, RULER 32k, GSM8K, TruthfulQA |
| Agentic | Code and Tool Calling | HumanEval, MBPP p@1, BFCL |

The main result is that Munin is highly competitive on the Danish evaluations: in several task groups, our post-trained models match or improve on the original instruct models. The strongest results are on Danish knowledge, reading comprehension, summarization, and natural language inference, where some Munin models are ahead of their original counterparts by more than the reported uncertainty. This is the behavior we wanted to see: post-training moves the models toward stronger Danish performance while retaining a useful general capability profile.

The English and agentic task results are more mixed. The original instruct models are generally stronger on code, tool calling, and several English benchmarks, which reflects that Munin 1.0 is currently focused primarily on Danish text capabilities. These gaps are important because they point directly to the next stage of work: keeping the Danish gains while investing more deliberately in reasoning, multilingual retention, and agentic capabilities.

We also ran a focused Danish creative writing tournament across 360 rated matches, judged by Qwen3.5-397B-A17B. We chose this judge due to its strong Danish language capabilities.

This tournament is not a broad benchmark of model quality, but it is useful because creative writing is a high-signal Danish task: it tests fluency, register, coherence, and whether the model can produce text that feels natural rather than merely correct. Munin Qwen 9B separates clearly from the rest of the field with a 96-9 record, while Munin Apertus 8B places second and Munin Ministral 8B also beats its original counterpart.

Danish creative writing tournament ranking; W-L is aggregate win-loss
| Rank | Model | W-L |
|---|---|---|
| 1 | Munin Qwen 9B | 96-9 |
| 2 | Munin Apertus 8B | 56-45 |
| 3 | Original Qwen 9B | 43-53 |
| 4 | Munin Ministral 8B | 43-54 |
| 5 | Original Apertus 8B | 30-65 |
| 6 | Original Ministral 8B | 26-68 |
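To make the records easier to compare, the sketch below turns each W-L pair into a win rate over decided matches. This is one possible reading of the table, not the ranking method used for the tournament, which is not described here. The helper name `decided_win_rate` is hypothetical, and matches counted as neither a win nor a loss are simply left out.

```python
# W-L records copied from the ranking table above.
records = {
    "Munin Qwen 9B": (96, 9),
    "Munin Apertus 8B": (56, 45),
    "Original Qwen 9B": (43, 53),
    "Munin Ministral 8B": (43, 54),
    "Original Apertus 8B": (30, 65),
    "Original Ministral 8B": (26, 68),
}


def decided_win_rate(wins, losses):
    """Share of decided matches (wins plus losses) that the model won."""
    return wins / (wins + losses)


# Sanity check: every decided match has one winner and one loser,
# so wins and losses must sum to the same total across models.
total_wins = sum(w for w, _ in records.values())
total_losses = sum(l for _, l in records.values())
assert total_wins == total_losses

# Sort by decided win rate, highest first.
ranking = sorted(records, key=lambda m: decided_win_rate(*records[m]), reverse=True)
for model in ranking:
    wins, losses = records[model]
    print(f"{model}: {decided_win_rate(wins, losses):.1%} of {wins + losses} decided matches")
```

Sorting by this rate reproduces the order in the table. Wins and losses each sum to 294, which is fewer than the 360 rated matches, so some matches appear in neither column.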

Contributors

Rasmus Larsen led training, performed experiments, built synthetic datasets and benchmarks, and drafted the release announcement.

Oliver Kinch built synthetic datasets and benchmarks.

Vladimir Salnikov and Jacob Nielsen contributed to datasets.

Dan Saattrup Smart developed benchmarks and led the integration into EuroEval.

Torben Blach contributed project management and coordination.

Acknowledgements

The work was supported by the Danish Foundation Models project, funded by the Danish government. This work was partially supported by DeiC National HPC (grant agreement DeiC-SDU-N5-2025162).