Munin 1.0 release note
Today, Danish Foundation Models releases the Munin 1.0 family of language models, post-trained on top of several best-in-class open models. These models are trained using a combination of existing open and newly developed datasets.
Models and results
The Munin 1.0 family of models is post-trained on open-weights models from Swiss AI, Mistral, and Qwen. All base models were originally released under the Apache 2.0 license, and the Munin models carry the same license.
| Developer | Base model | Munin model | Country | Openness |
|---|---|---|---|---|
| Swiss AI | swiss-ai/Apertus-8B-2509 | munin-apertus-8b | Europe/Switzerland | Fully open |
| Mistral | mistralai/Ministral-3-8B-Base-2512 | munin-ministral3-8B | Europe/France | Open weights |
| Qwen | Qwen/Qwen3.5-9B-Base | munin-qwen3.5-9B | China | Open weights |
Munin 1.0 is a text-only post-training release, so the comparisons below focus on text benchmarks. Some upstream instruct models advertise image-to-text or other multimodal capabilities, but those capabilities are not supported by the models released here.
We evaluated the Munin models against the instruction-tuned models released by the original model developers. Since Munin is trained from the same base models, this is an apples-to-apples comparison of two independent post-training efforts: the original developer's post-training and our Danish-focused Munin training.
The aggregate scores below are unweighted means across the evaluations in each task group. Scores are percentages, and the ± values are standard errors in percentage points. Blue marks the better Original/Munin score within each model family, or both if they are tied within uncertainty. Bold scores are not substantially below the row-best score. See the full benchmark results for the individual evaluations behind each aggregate.
| Suite | Task | Metric | Apertus 8B (Original) | Apertus 8B (Munin) | Ministral 8B (Original) | Ministral 8B (Munin) | Qwen 9B (Original) | Qwen 9B (Munin) |
|---|---|---|---|---|---|---|---|---|
| Danish | Common-sense Reasoning | MCC | 33.1 ± 0.9 | 29.4 ± 1.3 | 52.4 ± 1.2 | 50.2 ± 1.3 | 64.6 ± 0.7 | 62.2 ± 0.9 |
| Danish | Grammatical Error Detection | micro-F1 | 18.0 ± 1.3 | 17.4 ± 1.1 | 21.7 ± 2.0 | 17.7 ± 0.7 | 20.4 ± 1.2 | 20.7 ± 1.1 |
| Danish | Instruction-following | Accuracy | 69.0 ± 1.1 | 51.4 ± 1.4 | 66.7 ± 1.3 | 74.4 ± 0.9 | 81.6 ± 0.8 | 77.9 ± 0.9 |
| Danish | Knowledge | MCC | 58.9 ± 0.7 | 62.3 ± 0.7 | 73.6 ± 0.5 | 68.5 ± 0.6 | 76.0 ± 0.5 | 77.6 ± 0.5 |
| Danish | Linguistic Acceptability | MCC | 33.0 ± 1.2 | 29.3 ± 2.4 | 43.4 ± 1.9 | 18.9 ± 3.1 | 49.2 ± 1.7 | 52.2 ± 1.3 |
| Danish | Multiple-choice Reading Comprehension | MCC | 67.1 ± 1.0 | 66.0 ± 2.0 | 85.9 ± 1.4 | 84.4 ± 1.1 | 87.2 ± 1.2 | 87.3 ± 1.3 |
| Danish | Named Entity Recognition | micro-F1 | 49.3 ± 1.4 | 47.6 ± 1.3 | 61.1 ± 1.0 | 51.4 ± 1.8 | 69.1 ± 1.2 | 69.6 ± 1.2 |
| Danish | Natural Language Inference | MCC | 48.8 ± 2.3 | 52.1 ± 2.6 | 25.8 ± 1.6 | 58.2 ± 2.2 | 53.8 ± 1.9 | 65.6 ± 2.0 |
| Danish | Reading Comprehension | F1 | 70.8 ± 0.5 | 69.4 ± 0.6 | 69.7 ± 0.7 | 71.2 ± 0.8 | 70.8 ± 0.6 | 72.0 ± 0.7 |
| Danish | Sentiment Classification | MCC | 57.9 ± 1.0 | 54.3 ± 1.1 | 60.4 ± 1.0 | 59.6 ± 1.4 | 64.9 ± 0.9 | 64.7 ± 1.0 |
| Danish | Summarization | chrF++ | 37.6 ± 0.2 | 36.9 ± 0.2 | 35.1 ± 0.3 | 37.0 ± 0.2 | 36.5 ± 0.3 | 36.7 ± 0.4 |
| Danish | Word-in-Context | MCC | 11.8 ± 2.2 | 8.7 ± 3.5 | 29.9 ± 1.7 | 23.3 ± 3.2 | 44.6 ± 2.1 | 40.1 ± 3.5 |
| English | Common-sense Reasoning | Accuracy | 58.7 ± 0.5 | 23.2 ± 0.4 | 73.1 ± 0.4 | 59.6 ± 0.5 | 90.0 ± 0.3 | 85.7 ± 0.3 |
| English | Instruction-following | Accuracy | 73.3 ± 1.9 | 54.7 ± 2.0 | 70.4 ± 1.8 | 69.8 ± 1.9 | 89.6 ± 1.5 | 78.6 ± 1.8 |
| English | Knowledge | Accuracy | 50.3 ± 0.5 | 41.9 ± 0.5 | 81.7 ± 0.3 | 73.0 ± 0.3 | 79.2 ± 0.2 | 82.4 ± 0.2 |
| English | Long-context | Accuracy | 34.6 ± 2.1 | 35.8 ± 2.1 | 51.4 ± 2.2 | 49.4 ± 2.2 | 67.2 ± 2.1 | 54.6 ± 2.2 |
| English | Math | Accuracy | 68.1 ± 1.3 | 56.7 ± 1.4 | 92.2 ± 0.7 | 82.3 ± 1.1 | 94.8 ± 0.6 | 92.2 ± 0.7 |
| English | Truthfulness | Accuracy | 16.8 ± 1.3 | 15.7 ± 1.3 | 64.7 ± 1.7 | 63.3 ± 1.7 | 78.1 ± 1.4 | 74.2 ± 1.5 |
| Agentic | Code | pass@1 | 46.8 ± 2.5 | 39.2 ± 2.4 | 75.0 ± 2.1 | 49.2 ± 2.3 | 83.0 ± 1.8 | 77.2 ± 2.1 |
| Agentic | Tool Calling | Accuracy | 52.4 ± 0.8 | 43.1 ± 0.8 | 75.0 ± 0.7 | 49.2 ± 0.8 | 79.4 ± 0.6 | 75.8 ± 0.7 |
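The note above the table describes two operations: averaging benchmark scores without weights, and treating two scores as tied when they fall within uncertainty. The following is a minimal Python sketch of both. The release note does not define the tie rule, so `tied_within_uncertainty` and its threshold `z` are assumptions: it combines the two standard errors in quadrature, which is one common convention and not necessarily the rule used for the table. The per-benchmark scores passed to `aggregate` are placeholders, since the individual scores are not listed here.

```python
import math
from statistics import mean


def aggregate(scores):
    """Unweighted mean of per-benchmark scores, in percent."""
    return mean(scores)


def tied_within_uncertainty(a, se_a, b, se_b, z=1.0):
    """Assumed tie rule: the gap is at most z combined standard errors.

    Combining the standard errors in quadrature is an illustrative choice,
    not a rule stated in the release note.
    """
    return abs(a - b) <= z * math.hypot(se_a, se_b)


# Danish Multiple-choice Reading Comprehension, Qwen 9B: Original vs Munin.
print(tied_within_uncertainty(87.2, 1.2, 87.3, 1.3))  # True

# Danish Knowledge, Qwen 9B: Original vs Munin.
print(tied_within_uncertainty(76.0, 0.5, 77.6, 0.5))  # False

# Placeholder per-benchmark scores, not taken from the release.
print(aggregate([60.0, 70.0, 80.0]))  # 70.0
```

Under this assumed rule, a 0.1-point gap with standard errors above 1 point counts as a tie, while a 1.6-point gap with standard errors of 0.5 does not.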
What is included in each aggregate?
Danish task groupings follow the EuroEval Danish task taxonomy. Links point to papers where available, otherwise to public datasets, upstream repositories, or the EuroEval dataset construction scripts used for the benchmark. For several entries ending in -da, the link is to the original benchmark paper because the evaluated resource is a translated or otherwise localized Danish variant. Some Danish variants, including ifeval-da, are adaptations rather than direct translations.
| Suite | Aggregate task | Benchmarks |
|---|---|---|
| Danish | Common-sense Reasoning | goldenswag-da, hellaswag-da, winogrande-da |
| Danish | Grammatical Error Detection | gerlangmod-da |
| Danish | Instruction-following | ifeval-da |
| Danish | Knowledge | arc-da, dameta, danish-citizen-tests, danske-talemaader, mmlu-da |
| Danish | Linguistic Acceptability | dala, scala-da |
| Danish | Multiple-choice Reading Comprehension | belebele-da |
| Danish | Named Entity Recognition | dane, dansk |
| Danish | Natural Language Inference | danish-entailment, danish-lexical-inference |
| Danish | Reading Comprehension | multi-wiki-qa-da, scandiqa-da |
| Danish | Sentiment Classification | angry-tweets, danish-sentiment-in-context |
| Danish | Summarization | nordjylland-news |
| Danish | Word-in-Context | danwic |
| English | Knowledge | ARC-C, ARC-E, MMLU, MMLU-Pro |
| English | Other task groups | HellaSwag, IFEval, RULER 32k, GSM8K, TruthfulQA |
| Agentic | Code and Tool Calling | HumanEval, MBPP p@1, BFCL |
The main result is that Munin is highly competitive on the Danish evaluations: in several task groups, our post-trained models match or improve on the original instruct models. The strongest results are on Danish knowledge, reading comprehension, summarization, and natural language inference, where some Munin models lead their original counterparts by more than the reported standard errors. This is the behavior we wanted to see: post-training moves the models toward stronger Danish performance while retaining a useful general capability profile.
The English and agentic results are more mixed. The original instruct models are generally stronger on code, tool calling, and several English benchmarks, which reflects Munin 1.0's primary focus on Danish text capabilities. These gaps matter because they point directly to the next stage of work: keeping the Danish gains while investing more deliberately in reasoning, multilingual retention, and agentic capabilities.
We also ran a focused Danish creative writing tournament across 360 rated matches, judged by Qwen3.5-397B-A17B. We chose this judge for its strong Danish language capabilities.
This tournament is not a broad benchmark of model quality, but it is useful because creative writing is a high-signal Danish task: it tests fluency, register, coherence, and whether the model can produce text that feels natural rather than merely correct. Munin Qwen 9B separates clearly from the rest of the field with a 96-9 record, while Munin Apertus 8B places second and Munin Ministral 8B also beats its original counterpart.
| Rank | Model | W-L |
|---|---|---|
| 1 | Munin Qwen 9B | 96-9 |
| 2 | Munin Apertus 8B | 56-45 |
| 3 | Original Qwen 9B | 43-53 |
| 4 | Munin Ministral 8B | 43-54 |
| 5 | Original Apertus 8B | 30-65 |
| 6 | Original Ministral 8B | 26-68 |
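To make the records above easier to compare, the following Python sketch converts each W-L record into a win rate over decisive matches and sorts the models by it. The release note does not say how the ranking was computed, so ranking by win rate is an assumption made here for illustration; `RECORDS`, `win_rate`, and `rank_by_win_rate` are hypothetical names.

```python
# W-L records copied from the tournament table above.
RECORDS = {
    "Munin Qwen 9B": (96, 9),
    "Munin Apertus 8B": (56, 45),
    "Original Qwen 9B": (43, 53),
    "Munin Ministral 8B": (43, 54),
    "Original Apertus 8B": (30, 65),
    "Original Ministral 8B": (26, 68),
}


def win_rate(wins, losses):
    """Share of decisive matches (wins plus losses) that were won."""
    return wins / (wins + losses)


def rank_by_win_rate(records):
    """Model names sorted from highest to lowest win rate (assumed criterion)."""
    return sorted(records, key=lambda m: win_rate(*records[m]), reverse=True)


for model in rank_by_win_rate(RECORDS):
    wins, losses = RECORDS[model]
    print(f"{model}: {wins}-{losses} ({win_rate(wins, losses):.1%})")
```

Sorting by this win rate reproduces the order in the table, with Munin Qwen 9B at roughly 91% and Original Qwen 9B and Munin Ministral 8B separated by less than one percentage point.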
Contributors
Rasmus Larsen led training, performed experiments, built synthetic datasets and benchmarks, and drafted the release announcement.
Oliver Kinch built synthetic datasets and benchmarks.
Vladimir Salnikov and Jacob Nielsen contributed to datasets.
Dan Saattrup Smart developed benchmarks and led the integration into EuroEval.
Torben Blach contributed project management and coordination.
Acknowledgements
The work was supported by the Danish Foundation Models project, funded by the Danish government. This work was partially supported by DeiC National HPC (grant agreement DeiC-SDU-N5-2025162).