Munin 1.0 release note

Today, Danish Foundation Models releases the Munin 1.0 family of language models, post-trained on top of several best-in-class open models. These models are trained using a combination of existing open and newly developed datasets.

Models and results

The Munin 1.0 models are post-trained on open-weights base models from Swiss AI, Mistral, and Qwen. All of these base models were originally released under the Apache 2.0 license, and our models carry the same license.

Developer	Base model	Munin model	Region/Country	Openness
Swiss AI	`swiss-ai/Apertus-8B-2509`	`munin-apertus-8b`	Europe/Switzerland	Fully open
Mistral	`mistralai/Ministral-3-8B-Base-2512`	`munin-ministral3-8B`	Europe/France	Open weights
Qwen	`Qwen/Qwen3.5-9B-Base`	`munin-qwen3.5-9B`	China	Open weights

Munin 1.0 is a text-only post-training release, so the comparisons below focus on text benchmarks. Some upstream instruct models advertise image-to-text or other multimodal capabilities, but those capabilities are not supported by the models released here.

We evaluated the Munin models against the instruction-tuned models released by the original model developers. Since Munin is trained from the same base models, this is an apples-to-apples comparison of two independent post-training efforts: the original developer's post-training and our Danish-focused Munin training.

The aggregate scores below are unweighted means across evaluations in each task group. Scores are percentages, and standard errors are percentage points. Blue marks the better Original/Munin score within each model family, or both if they are tied within uncertainty. Bold scores are not substantially below the row-best score. See the full benchmark results for the individual evaluations behind each aggregate.
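The aggregation and the tie marking described above can be sketched as follows. This is a minimal illustration, not the project's evaluation code: the function names are hypothetical, the per-benchmark scores in `example_group` are made up, and the tie rule (a gap of at most 1.96 combined standard errors) is an assumption, since the release note does not define "tied within uncertainty" precisely.

```python
from math import sqrt
from statistics import fmean


def aggregate(scores):
    """Unweighted mean of per-benchmark scores (in percent) for one task group."""
    return fmean(scores)


def tied_within_uncertainty(a, se_a, b, se_b, z=1.96):
    """Assumed rule: two scores tie if their gap is at most z combined standard errors."""
    return abs(a - b) <= z * sqrt(se_a**2 + se_b**2)


# Hypothetical per-benchmark scores, for illustration only (not from the release).
example_group = [60.0, 70.0, 80.0]
group_score = aggregate(example_group)

# Aggregate values copied from the table: Danish Knowledge (Apertus 8B) and
# Danish Sentiment Classification (Qwen 9B), each as Original vs Munin.
knowledge_tied = tied_within_uncertainty(58.9, 0.7, 62.3, 0.7)
sentiment_tied = tied_within_uncertainty(64.9, 0.9, 64.7, 1.0)
print(group_score, knowledge_tied, sentiment_tied)
```

Under this assumed rule, the Apertus knowledge gap counts as a real difference while the Qwen sentiment scores count as tied. A stricter or looser threshold `z` would move some borderline rows.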

Task aggregate scores
Suite	Task	Metric	Apertus 8B Original	Apertus 8B Munin	Ministral 8B Original	Ministral 8B Munin	Qwen 9B Original	Qwen 9B Munin
Danish	Common-sense Reasoning	MCC	33.1 ± 0.9	29.4 ± 1.3	52.4 ± 1.2	50.2 ± 1.3	64.6 ± 0.7	62.2 ± 0.9
Danish	Grammatical Error Detection	micro-F1	18.0 ± 1.3	17.4 ± 1.1	21.7 ± 2.0	17.7 ± 0.7	20.4 ± 1.2	20.7 ± 1.1
Danish	Instruction-following	Accuracy	69.0 ± 1.1	51.4 ± 1.4	66.7 ± 1.3	74.4 ± 0.9	81.6 ± 0.8	77.9 ± 0.9
Danish	Knowledge	MCC	58.9 ± 0.7	62.3 ± 0.7	73.6 ± 0.5	68.5 ± 0.6	76.0 ± 0.5	77.6 ± 0.5
Danish	Linguistic Acceptability	MCC	33.0 ± 1.2	29.3 ± 2.4	43.4 ± 1.9	18.9 ± 3.1	49.2 ± 1.7	52.2 ± 1.3
Danish	Multiple-choice Reading Comprehension	MCC	67.1 ± 1.0	66.0 ± 2.0	85.9 ± 1.4	84.4 ± 1.1	87.2 ± 1.2	87.3 ± 1.3
Danish	Named Entity Recognition	micro-F1	49.3 ± 1.4	47.6 ± 1.3	61.1 ± 1.0	51.4 ± 1.8	69.1 ± 1.2	69.6 ± 1.2
Danish	Natural Language Inference	MCC	48.8 ± 2.3	52.1 ± 2.6	25.8 ± 1.6	58.2 ± 2.2	53.8 ± 1.9	65.6 ± 2.0
Danish	Reading Comprehension	F1	70.8 ± 0.5	69.4 ± 0.6	69.7 ± 0.7	71.2 ± 0.8	70.8 ± 0.6	72.0 ± 0.7
Danish	Sentiment Classification	MCC	57.9 ± 1.0	54.3 ± 1.1	60.4 ± 1.0	59.6 ± 1.4	64.9 ± 0.9	64.7 ± 1.0
Danish	Summarization	chrF++	37.6 ± 0.2	36.9 ± 0.2	35.1 ± 0.3	37.0 ± 0.2	36.5 ± 0.3	36.7 ± 0.4
Danish	Word-in-Context	MCC	11.8 ± 2.2	8.7 ± 3.5	29.9 ± 1.7	23.3 ± 3.2	44.6 ± 2.1	40.1 ± 3.5
English	Common-sense Reasoning	Accuracy	58.7 ± 0.5	23.2 ± 0.4	73.1 ± 0.4	59.6 ± 0.5	90.0 ± 0.3	85.7 ± 0.3
English	Instruction-following	Accuracy	73.3 ± 1.9	54.7 ± 2.0	70.4 ± 1.8	69.8 ± 1.9	89.6 ± 1.5	78.6 ± 1.8
English	Knowledge	Accuracy	50.3 ± 0.5	41.9 ± 0.5	81.7 ± 0.3	73.0 ± 0.3	79.2 ± 0.2	82.4 ± 0.2
English	Long-context	Accuracy	34.6 ± 2.1	35.8 ± 2.1	51.4 ± 2.2	49.4 ± 2.2	67.2 ± 2.1	54.6 ± 2.2
English	Math	Accuracy	68.1 ± 1.3	56.7 ± 1.4	92.2 ± 0.7	82.3 ± 1.1	94.8 ± 0.6	92.2 ± 0.7
English	Truthfulness	Accuracy	16.8 ± 1.3	15.7 ± 1.3	64.7 ± 1.7	63.3 ± 1.7	78.1 ± 1.4	74.2 ± 1.5
Agentic	Code	pass@1	46.8 ± 2.5	39.2 ± 2.4	75.0 ± 2.1	49.2 ± 2.3	83.0 ± 1.8	77.2 ± 2.1
Agentic	Tool Calling	Accuracy	52.4 ± 0.8	43.1 ± 0.8	75.0 ± 0.7	49.2 ± 0.8	79.4 ± 0.6	75.8 ± 0.7
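For readers who want to work with the table programmatically, the sketch below parses one tab-separated row into score and standard-error pairs and computes the Munin minus Original difference for each model family. The helper names `parse_cell` and `munin_deltas` are hypothetical, and the column order is assumed to match the table above.

```python
def parse_cell(cell):
    """Split a cell such as '58.9 ± 0.7' into (score, standard_error)."""
    score, se = cell.split("±")
    return float(score), float(se)


def munin_deltas(row):
    """Return Munin minus Original, per model family, for one tab-separated row."""
    fields = row.split("\t")
    cells = [parse_cell(c) for c in fields[3:]]  # skip Suite, Task, Metric
    families = ["Apertus 8B", "Ministral 8B", "Qwen 9B"]
    deltas = {}
    for i, family in enumerate(families):
        original, munin = cells[2 * i][0], cells[2 * i + 1][0]
        deltas[family] = round(munin - original, 1)
    return deltas


# Row copied from the table above (Danish Natural Language Inference).
nli_row = (
    "Danish\tNatural Language Inference\tMCC\t"
    "48.8 ± 2.3\t52.1 ± 2.6\t25.8 ± 1.6\t58.2 ± 2.2\t53.8 ± 1.9\t65.6 ± 2.0"
)
nli_deltas = munin_deltas(nli_row)
print(nli_deltas)
```

A positive delta means the Munin model scored higher than the original instruct model on that aggregate. The deltas ignore the standard errors, so small values should not be read as meaningful differences.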

What is included in each aggregate?

Danish task groupings follow the EuroEval Danish task taxonomy. Links point to papers where available, otherwise to public datasets, upstream repositories, or the EuroEval dataset construction scripts used for the benchmark. For several entries ending in -da, the link points to the original benchmark paper because the evaluated resource is a Danish-translated or otherwise localized variant. Some Danish variants, including ifeval-da, are adaptations rather than direct translations.

Suite	Aggregate task	Benchmarks
Danish	Common-sense Reasoning	`goldenswag-da`, `hellaswag-da`, `winogrande-da`
Danish	Grammatical Error Detection	`gerlangmod-da`
Danish	Instruction-following	`ifeval-da`
Danish	Knowledge	`arc-da`, `dameta`, `danish-citizen-tests`, `danske-talemaader`, `mmlu-da`
Danish	Linguistic Acceptability	`dala`, `scala-da`
Danish	Multiple-choice Reading Comprehension	`belebele-da`
Danish	Named Entity Recognition	`dane`, `dansk`
Danish	Natural Language Inference	`danish-entailment`, `danish-lexical-inference`
Danish	Reading Comprehension	`multi-wiki-qa-da`, `scandiqa-da`
Danish	Sentiment Classification	`angry-tweets`, `danish-sentiment-in-context`
Danish	Summarization	`nordjylland-news`
Danish	Word-in-Context	`danwic`
English	Knowledge	`ARC-C`, `ARC-E`, `MMLU`, `MMLU-Pro`
English	Other task groups	`HellaSwag`, `IFEval`, `RULER 32k`, `GSM8K`, `TruthfulQA`
Agentic	Code and Tool Calling	`HumanEval`, `MBPP p@1`, `BFCL`

The main result is that Munin is highly competitive on the Danish evaluations: in several task groups, our post-trained models match or improve on the original instruct models. The strongest results are on Danish knowledge, reading comprehension, summarization, and natural language inference, where some Munin models lead their original counterparts by more than the reported standard errors. This is the behavior we wanted to see: post-training moves the models toward stronger Danish performance while retaining a useful general capability profile.

The English and agentic task results are more mixed. The original instruct models are generally stronger on code, tool calling, and several English benchmarks, which reflects that Munin 1.0 is, for now, primarily focused on Danish text capabilities. These gaps matter because they point directly to the next stage of work: keeping the Danish gains while investing more deliberately in reasoning, multilingual retention, and agentic capabilities.

We also ran a focused Danish creative writing tournament across 360 rated matches, judged by Qwen3.5-397B-A17B. We chose this judge due to its strong Danish language capabilities.

This tournament is not a broad benchmark of model quality, but it is useful because creative writing is a high-signal Danish task: it tests fluency, register, coherence, and whether the model can produce text that feels natural rather than merely correct. Munin Qwen 9B separates clearly from the rest of the field with a 96-9 record, while Munin Apertus 8B places second and Munin Ministral 8B also beats its original counterpart.

Danish creative writing tournament ranking; W-L is aggregate win-loss
Rank	Model	W-L
1	Munin Qwen 9B	96-9
2	Munin Apertus 8B	56-45
3	Original Qwen 9B	43-53
4	Munin Ministral 8B	43-54
5	Original Apertus 8B	30-65
6	Original Ministral 8B	26-68
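The win-loss records above can be turned into win rates with a short sketch like the one below. The `win_rate` helper is hypothetical, and the rate is computed over decisive results only; the release note does not say how the matches outside the W-L records were scored, so they are left out here.

```python
# W-L records copied from the ranking table above.
records = {
    "Munin Qwen 9B": (96, 9),
    "Munin Apertus 8B": (56, 45),
    "Original Qwen 9B": (43, 53),
    "Munin Ministral 8B": (43, 54),
    "Original Apertus 8B": (30, 65),
    "Original Ministral 8B": (26, 68),
}


def win_rate(wins, losses):
    """Share of decisive matches won (an assumed summary, not the official ranking rule)."""
    return wins / (wins + losses)


# Sort models by win rate, best first.
ranking = sorted(records, key=lambda model: win_rate(*records[model]), reverse=True)

# Sanity check: every recorded win for one model is a recorded loss for another.
total_wins = sum(w for w, _ in records.values())
total_losses = sum(l for _, l in records.values())

for model in ranking:
    print(f"{model}: {win_rate(*records[model]):.1%}")
```

Sorting by this win rate reproduces the order of the table, and the total numbers of wins and losses agree, which is what a consistent pairwise tournament record should show.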

Contributors

Rasmus Larsen led training, performed experiments, built synthetic datasets and benchmarks, and drafted the release announcement.

Oliver Kinch built synthetic datasets and benchmarks.

Andreas Holm contributed to experiments.

Vladimir Salnikov and Jacob Nielsen contributed to datasets.

Dan Saattrup Smart developed benchmarks and led the integration into EuroEval.

Torben Blach contributed project management and coordination.

Acknowledgements

The work was supported by the Danish Foundation Models project, funded by the Danish government. This work was partially supported by DeiC National HPC (grant agreement DeiC-SDU-N5-2025162).